Google File System (GFS) is a distributed file system developed at Google
it shares the same goals as previous distributed file systems: performance, scalability, reliability, and availability
its design differs in ways driven by observations of Google’s application workloads and technological environment
operations: create, delete, open, close, read, write, plus two GFS-specific operations: snapshot and record append
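As an illustration only, the sketch below shows what a GFS-style client interface could look like in Java. GFS itself is Google-internal, so every name here (GfsClient, FileHandle, the method signatures) is a hypothetical stand-in that simply mirrors the operation set listed above.

    import java.io.IOException;

    // Hypothetical GFS-style client interface; all names are illustrative only.
    interface GfsClient {
      interface FileHandle {}                               // opaque handle to an open file

      FileHandle create(String path) throws IOException;    // create a new file
      void delete(String path) throws IOException;           // remove a file
      FileHandle open(String path) throws IOException;       // open an existing file
      void close(FileHandle file) throws IOException;        // release the handle

      // read up to buffer.length bytes starting at offset; returns bytes read
      int read(FileHandle file, long offset, byte[] buffer) throws IOException;

      // write data at an application-chosen offset
      void write(FileHandle file, long offset, byte[] data) throws IOException;

      // snapshot: low-cost copy of a file or directory tree
      void snapshot(String sourcePath, String targetPath) throws IOException;

      // record append: the system chooses the offset and returns it to the caller
      long append(FileHandle file, byte[] data) throws IOException;
    }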
2.5 quintillion bytes of data are created each day (ViaWest, 2012)
three major aspects of Big Data: volume, velocity, and variety (the “three Vs”)
Hadoop is open-source software for scalable distributed computing
the project consists of four major modules: Hadoop Common, HDFS, YARN, and MapReduce
clusters consist of commodity Linux servers
scales roughly linearly as commodity nodes are added
adopted as an Apache project in 2006
the term “Hadoop” has evolved to describe an ecosystem of tools
Ecosystem applications are higher-level tools that work on top of HDFS and MapReduce (e.g., Hive, Pig, HBase)
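As a concrete example of a MapReduce program running on top of HDFS, a word-count job is sketched below in Java. This is the canonical Hadoop example, not code specific to this course: the mapper emits (word, 1) pairs for every token, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in the input line
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer (also used as combiner): sums the counts for each word
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver: takes input and output paths as command-line arguments
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }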
create/customize and run one or more MapReduce programs with Hadoop, via Amazon Elastic MapReduce; use the us-east-1 (N. Virginia) region
the Amazon Elastic MapReduce Ruby Client might be useful; you will need your Access Key ID and Secret Access Key
sample input data is available on S3, in the bucket “ad-bootcamp”
an output bucket has been pre-created on S3, named “ad-emr-out” (see the driver sketch below)
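A minimal driver sketch for running such a job on Elastic MapReduce against the buckets above follows. It reuses the WordCount mapper and reducer from the earlier sketch; the object key prefixes “input/” and “run1/” are placeholders (assumptions, not paths given in the assignment), and depending on the Hadoop version on the cluster the S3 scheme may be s3n:// or s3://.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver pinned to the assignment's S3 buckets; key prefixes are placeholders.
    public class EmrWordCountDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "emr word count");
        job.setJarByClass(EmrWordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // s3n:// was the S3 native filesystem scheme commonly used on EMR-era Hadoop
        FileInputFormat.addInputPath(job, new Path("s3n://ad-bootcamp/input/"));
        FileOutputFormat.setOutputPath(job, new Path("s3n://ad-emr-out/run1/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }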