Google File System (GFS) is a distributed file system developed at Google
it shares the same goals as previous distributed file systems: performance, scalability, reliability, and availability
its design differs in ways driven by observations of Google’s application workloads and technological environment
operations: create, delete, open, close, read, write, plus two GFS-specific operations: snapshot and record append
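As an illustration only, the sketch below shows what a GFS-style client interface could look like in Java. GFS itself is Google-internal, so every name here (GfsClient, FileHandle, the method signatures) is a hypothetical stand-in that simply mirrors the operation set listed above.

    import java.io.IOException;

    // Hypothetical GFS-style client interface; all names are illustrative only.
    interface GfsClient {
      interface FileHandle {}                               // opaque handle to an open file

      FileHandle create(String path) throws IOException;    // create a new file
      void delete(String path) throws IOException;           // remove a file
      FileHandle open(String path) throws IOException;       // open an existing file
      void close(FileHandle file) throws IOException;        // release the handle

      // read up to buffer.length bytes starting at offset; returns bytes read
      int read(FileHandle file, long offset, byte[] buffer) throws IOException;

      // write data at an application-chosen offset
      void write(FileHandle file, long offset, byte[] data) throws IOException;

      // snapshot: low-cost copy of a file or directory tree
      void snapshot(String sourcePath, String targetPath) throws IOException;

      // record append: the system chooses the offset and returns it to the caller
      long append(FileHandle file, byte[] data) throws IOException;
    }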
2.5 quintillion bytes of data are created each day (ViaWest, 2012)
three major aspects of Big Data: volume, velocity, and variety (the “three Vs”)
Hadoop is open-source software for scalable distributed computing
the project consists of four major modules: Hadoop Common, HDFS, YARN, and MapReduce
clusters consist of commodity Linux servers
scales roughly linearly as commodity nodes are added
adopted as an Apache project in 2006
the term “Hadoop” has evolved to describe an ecosystem of tools
Ecosystem applications are higher-level tools that work on top of HDFS and MapReduce (e.g., Hive, Pig, HBase)
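As a concrete example of a MapReduce program running on top of HDFS, a word-count job is sketched below in Java. This is the canonical Hadoop example, not code specific to this course: the mapper emits (word, 1) pairs for every token, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in the input line
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer (also used as combiner): sums the counts for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: takes input and output paths as command-line arguments
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }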
create/customize and run one or more MapReduce programs with Hadoop, via Amazon Elastic MapReduce; use the us-east-1 (N. Virginia) region
the Amazon Elastic MapReduce Ruby Client might be useful; you will need your Access Key ID and Secret Access Key
sample input data is available on S3, in the bucket “ad-bootcamp”
an output bucket has been pre-created on S3, named “ad-emr-out” (see the driver sketch below)
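A minimal driver sketch for running such a job on Elastic MapReduce against the buckets above follows. It reuses the WordCount mapper and reducer from the earlier sketch; the object key prefixes “input/” and “run1/” are placeholders (assumptions, not paths given in the assignment), and depending on the Hadoop version on the cluster the S3 scheme may be s3n:// or s3://.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver pinned to the assignment's S3 buckets; key prefixes are placeholders.
    public class EmrWordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "emr word count");
        job.setJarByClass(EmrWordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // s3n:// was the S3 native filesystem scheme commonly used on EMR-era Hadoop
        FileInputFormat.addInputPath(job, new Path("s3n://ad-bootcamp/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://ad-emr-out/run1/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }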