Hadoop: Distributed Systems, Storage, Databases, Cloud Computing (15)

Date: 2026-01-09

Keywords: Hadoop, distributed, systems, storage, cloud computing

15-440, Hadoop Distributed File System

Allison Naaktgeboren

Ur doin' it rong kitteh

Wut u mean? I iz loadin a HA-doop fileh

Announcements

Go Vote!

    Interpretive Dances happen only after Lecture

Office Hour Change

    Mon: 6:30-9:30
    Tues: 6-7:30

Exams are graded

Hadoop Core at 30,000 ft

Back to the Map Reduce Model

Recall that

    map (in_key, in_value) -> (inter_key, inter_value) list
    combine (inter_key, inter_value) -> (inter_key, inter_value)
    reduce (inter_key, inter_value list) -> (out_key, out_value)
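To make the three signatures concrete, here is a minimal single-process sketch using word count. The function names (map_fn, combine_fn, reduce_fn, run_job) and the sample documents are made up for illustration; this is not Hadoop's API, only the shape of the model.

```python
from collections import defaultdict
from itertools import chain

def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (inter_key, inter_value)."""
    return [(word, 1) for word in in_value.split()]

def combine_fn(pairs):
    """Pre-aggregate one mapper's output so fewer pairs cross the network."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

def reduce_fn(inter_key, inter_values):
    """reduce(inter_key, inter_value list) -> (out_key, out_value)."""
    return (inter_key, sum(inter_values))

def run_job(documents):
    # Map and combine run once per input split (here: once per document).
    combined = [combine_fn(map_fn(name, text)) for name, text in documents.items()]
    # Shuffle: group intermediate values by key across all mappers.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(combined):
        groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

docs = {"a.txt": "the cat sat", "b.txt": "the cat ran the race"}
counts = run_job(docs)
```

Note that the combiner shrinks the second document's output from five pairs to four before the shuffle; on real inputs that saving is what eases the "skinny pipes" problem.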

What resource are we most constrained by?

“Oceans of Data, Skinny pipes”

How many types of data will the file system care about?

How long will we need each kind?

What is the common case for each?

What would a MR Filesystem need?

General Use case: large files

    Mostly append to end, long sequential reads, few deletes
    Appends might be concurrent

Scalability

    Adding (or losing) machines should be relatively painless

Nodes work on nearby data

    Minimize moving data between machines
    Bandwidth is our limiting resource
    Remember how much data

Failure (handling) is Common

    Yea, yea we know, we took 213, we know hardware sucks
    Disks, processors, whole nodes, racks, and datacenters

No, really failure (handling) is common (constant)
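A quick back-of-the-envelope calculation shows why bandwidth is the limiting resource. The numbers below (1 TB of input, a 1 Gbit/s link) are illustrative assumptions, not figures from the lecture.

```python
def transfer_hours(num_bytes, link_bits_per_s):
    """Hours needed to push num_bytes over a link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_s / 3600

TB = 10**12    # bytes (decimal terabyte)
GBPS = 10**9   # bits per second

# Assumed workload: ship 1 TB of input to a worker over one 1 Gbit/s link.
hours = transfer_hours(1 * TB, 1 * GBPS)
```

Under these assumptions a single terabyte takes more than two hours just to move, which is why the design sends the computation to the node that already holds the data.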

Addressing Those Concerns

Sequential Reads, appends need to be fast

    Deletes can be painful

“Hot plug” machines

    Add or lose machines while system is running jobs
    System should auto detect the change

HDFS should distribute data somewhat evenly

    So that all workers have a reasonable amount of data to chew on
    And coordinating with the Jobtracker (job master)

Data Replication

    Should be spread out. Why? What type of problems could arise?

Moving into the Details

Nodes in HDFS

    NameNode (master) (like GFS Master)
    DataNodes (slaves) (like GFS chunkservers)
    “careful use of jargon defines the true expert”

NB: Hadoop and HDFS closely paired

    “worker node A” and “data node 1” are frequently the same machine

Two types of Masters

    Jobtracker (Hadoop Job Master)
    NameNode (file system Master)
        What I mean by 'master' for the rest of the lecture

Your Data goes in ....

Files are divided into Chunks

    64 MB

The mapping between filename and chunks goes to the Master

Each chunk is replicated and sent off to DataNodes

    By default, 3
    The master determines which dataNodes
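The bookkeeping on this slide can be sketched in a few lines. ToyMaster and its round-robin placement are hypothetical stand-ins chosen for illustration; the real master's placement policy is not described here. Only the 64 MB chunk size and the replica count of 3 come from the slide.

```python
import itertools

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, per the slide
REPLICATION = 3                # default replica count, per the slide

def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (offset, length) for each chunk of a file of file_size bytes."""
    return [(off, min(chunk_size, file_size - off))
            for off in range(0, file_size, chunk_size)]

class ToyMaster:
    """Hypothetical master: filename -> chunk IDs, chunk ID -> dataNodes."""

    def __init__(self, datanodes):
        if len(datanodes) < REPLICATION:
            raise ValueError("need at least REPLICATION dataNodes")
        self.datanodes = list(datanodes)
        self.file_to_chunks = {}
        self.chunk_to_nodes = {}
        self._next_id = itertools.count()
        self._cursor = 0

    def add_file(self, name, file_size):
        chunk_ids = []
        for _ in split_into_chunks(file_size):
            cid = next(self._next_id)
            # Round-robin placement (an assumption): REPLICATION distinct nodes,
            # starting one node later for each new chunk to spread the load.
            nodes = [self.datanodes[(self._cursor + i) % len(self.datanodes)]
                     for i in range(REPLICATION)]
            self._cursor = (self._cursor + 1) % len(self.datanodes)
            self.chunk_to_nodes[cid] = nodes
            chunk_ids.append(cid)
        self.file_to_chunks[name] = chunk_ids
        return chunk_ids

master = ToyMaster(["dn1", "dn2", "dn3", "dn4"])
chunk_ids = master.add_file("logs.dat", 200 * 1024 * 1024)  # 200 MB file
```

A 200 MB file yields three full 64 MB chunks plus one 8 MB tail chunk, each mapped to three distinct dataNodes.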

What the Clients Do

Where the data starts

Checksums

    On file creation, creates a separate file w/ checksum
    When data is fetched back from a dataNode, checksum computed again

Cache file data

    Avoid bothering the Master too often

When a Client has 1 chunk's worth of data

    Contacts the Master, Master sends names of dataNodes to send it to
    ONLY sends it to the 1st
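The checksum round trip can be sketched as follows. The dict-based store, the ".crc" suffix, and the choice of CRC32 via zlib are assumptions made for this example; the slide only says that a separate checksum file is created and that the checksum is recomputed on read.

```python
import zlib

def write_with_checksum(store, name, data):
    """Store the data plus a separate checksum entry (store is a plain dict)."""
    store[name] = data
    store[name + ".crc"] = zlib.crc32(data)

def read_verified(store, name):
    """Recompute the checksum on read and refuse corrupted data."""
    data = store[name]
    if zlib.crc32(data) != store[name + ".crc"]:
        raise IOError(f"checksum mismatch for {name}")
    return data

store = {}
write_with_checksum(store, "chunk-0", b"oceans of data")
clean = read_verified(store, "chunk-0")

# Simulate a flipped byte on the dataNode's disk.
store["chunk-0"] = b"oceans of dat4"
try:
    read_verified(store, "chunk-0")
    corruption_detected = False
except IOError:
    corruption_detected = True
```

On a mismatch a real client would presumably fetch another replica; this sketch only raises an error.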

What the DataNodes Do

Heartbeat to the Master

Opens, closes, or replicates a chunk if requested from Master

During replication, sends data to next dataNode in chain
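The replication chain can be sketched like this: the client hands the chunk to the first dataNode only, and each dataNode forwards it to the next. ToyDataNode and client_write are hypothetical names, and the network is replaced by direct method calls.

```python
class ToyDataNode:
    """Hypothetical dataNode that stores chunks in memory."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}

    def receive(self, chunk_id, data, downstream):
        """Store the chunk, then forward it to the next dataNode in the chain."""
        self.chunks[chunk_id] = data
        if downstream:
            downstream[0].receive(chunk_id, data, downstream[1:])

def client_write(chunk_id, data, pipeline):
    # The client contacts ONLY the first dataNode; the chain does the rest.
    pipeline[0].receive(chunk_id, data, pipeline[1:])

nodes = [ToyDataNode(n) for n in ("dn1", "dn2", "dn3")]
client_write(7, b"one chunk's worth of data", nodes)
```

Chaining means the client's outgoing link carries each chunk once rather than three times, which fits the "skinny pipes" constraint.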

What the Namespace Node Does

System metadata!

    Holds Name->ID mapping
    Chunk replica locations
    Transaction Logs
        EditLog
        FSImage

It is responsible for coherency

    Uses the logs atomically
    Addresses the concurrent writes issue

It is checkpointed

    Similar to AFS volume snapshots
    Will pull last consistent log upon restart
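The interplay of the edit log and the checkpointed image can be sketched as below. ToyNamespace is a hypothetical in-memory model, not the real NameNode: the attribute names fsimage and editlog merely echo the slide, and real persistence to disk is omitted.

```python
import copy

class ToyNamespace:
    """Hypothetical stand-in for the master's metadata."""

    def __init__(self):
        self.fsimage = {}     # last checkpointed name -> chunk-ID list
        self.editlog = []     # operations recorded since that checkpoint
        self.namespace = {}   # live in-memory state

    @staticmethod
    def _apply(state, op):
        kind, name, chunks = op
        if kind == "create":
            state[name] = list(chunks)
        elif kind == "delete":
            state.pop(name, None)

    def record(self, kind, name, chunks=()):
        op = (kind, name, tuple(chunks))
        self.editlog.append(op)          # log the operation first ...
        self._apply(self.namespace, op)  # ... then change the live state

    def checkpoint(self):
        """Fold the log into a new image and start a fresh log."""
        self.fsimage = copy.deepcopy(self.namespace)
        self.editlog.clear()

    def recover(self):
        """Rebuild state after a restart: load the image, replay the log."""
        state = copy.deepcopy(self.fsimage)
        for op in self.editlog:
            self._apply(state, op)
        return state

ns = ToyNamespace()
ns.record("create", "a.dat", [1])
ns.checkpoint()
ns.record("create", "b.dat", [2])
ns.record("delete", "a.dat")
recovered = ns.recover()
```

After the simulated restart, the recovered state matches the live namespace even though the image itself still describes the older checkpoint.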

What the Namespace Node Does (continued)

Listens for Heartbeats

    If no heartbeat
        marks a node as dead
        Its data is deregistered

Listens for Client Requests

It selects dataNodes

    Which nodes get which chunks
    Signals creating, opening, closing

Deletes

    Orders move to /trash
    Starts delete timer
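The heartbeat handling can be sketched as a timeout sweep. HeartbeatMonitor, its method names, and the 10-second timeout are assumptions for illustration; timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical master-side bookkeeping for dataNode liveness."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}         # node -> time of last heartbeat
        self.chunk_locations = {}   # chunk ID -> set of nodes holding it

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def register(self, chunk_id, node):
        self.chunk_locations.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes dead, deregister their chunks, return the dead nodes."""
        dead = [n for n, t in self.last_seen.items() if now - t > self.timeout_s]
        for node in dead:
            del self.last_seen[node]
            for holders in self.chunk_locations.values():
                holders.discard(node)
        return dead

    def under_replicated(self, target=3):
        """Chunks that now have fewer than target live replicas."""
        return [c for c, holders in self.chunk_locations.items()
                if len(holders) < target]

monitor = HeartbeatMonitor(timeout_s=10)
for node in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(node, now=0)
    monitor.register(1, node)
monitor.heartbeat("dn1", now=20)
monitor.heartbeat("dn2", now=20)   # dn3 has gone silent
dead = monitor.sweep(now=25)
needs_copy = monitor.under_replicated()
```

The chunks reported by under_replicated are the ones the master would ask a surviving dataNode to replicate.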

All together Now!

Additional Resources

Hadoop wiki

Youtube → “Hadoop” → Google developer videos (1-3 will be helpful)

Google University

    Includes UW course, the other UW course, a couple others
    Use at your own risk

“The Google File System” paper is rather readable as research papers go
