Hadoop, Distributed Systems, Storage, Databases, Cloud Computing (15)

Published: 2024-11-12

Hadoop, distributed, systems, storage, cloud computing

15-440, Hadoop Distributed File System

Allison Naaktgeboren

Ur doin' it rong kitteh

Wut u mean? I iz loadin a HA-doop fileh

Announcements

Go Vote!
– Interpretive Dances happen only after Lecture

Office Hour Change
– Mon: 6:30-9:30
– Tues: 6-7:30

Exams are graded

Hadoop Core at 30,000 ft

Back to the Map Reduce Model

Recall that
– map (in_key, in_value) -> (inter_key, inter_value) list
– combine (inter_key, inter_value) -> (inter_key, inter_value)
– reduce (inter_key, inter_value list) -> (out_key, out_value)
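As a refresher on those three signatures, here is a minimal single-process sketch of word count. The function names (`map_fn`, `combine_fn`, `reduce_fn`, `run_job`) and the sample documents are made up for illustration; a real Hadoop job runs these phases on different machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> list of (inter_key, inter_value)."""
    return [(word.lower(), 1) for word in in_value.split()]


def combine_fn(pairs):
    """combine: pre-sum pairs on the mapper side, before anything is shipped."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def reduce_fn(inter_key, inter_values):
    """reduce (inter_key, inter_value list) -> (out_key, out_value)."""
    return (inter_key, sum(inter_values))


def run_job(documents):
    """Run map, combine, group-by-key, and reduce in one process."""
    grouped = defaultdict(list)
    for name, text in documents.items():
        for key, value in combine_fn(map_fn(name, text)):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = run_job({"a.txt": "the cat the hat", "b.txt": "the end"})
```

The combine step is what keeps the "skinny pipes" less busy: for `a.txt` only one `("the", 2)` pair leaves the mapper instead of two `("the", 1)` pairs.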

What resource are we most constrained by?

“Oceans of Data, Skinny pipes”

How many types of data will the file system care about? How long will we need each kind? What is the common case for each?

What would a MR Filesystem need?

General use case: large files
– Mostly append to end, long sequential reads, few deletes
– Appends might be concurrent

Scalability
– Adding (or losing) machines should be relatively painless

Nodes work on nearby data
– Minimize moving data between machines
– Bandwidth is our limiting resource
– Remember how much data

Failure (handling) is common
– Yea, yea, we know, we took 213, we know hardware sucks
– Disks, processors, whole nodes, racks, and datacenters
– No, really, failure (handling) is common (constant)

Addressing Those Concerns

Sequential reads, appends need to be fast
– Deletes can be painful

“Hot plug” machines
– Add or lose machines while system is running jobs
– System should auto-detect the change

HDFS should distribute data somewhat evenly
– So that all workers have a reasonable amount of data to chew on
– And coordinating with the Jobtracker (job master)

Data Replication
– Should be spread out. Why?
– What type of problems could arise?
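One way to think about "spread out" is to ask what survives when a whole rack dies. The sketch below is a toy placement rule, not HDFS's actual policy: it assumes the file system knows each node's rack and current chunk count, and it picks the least-loaded nodes while never using the same rack twice. The function name `place_replicas` and the node and rack names are hypothetical.

```python
def place_replicas(node_racks, load, replicas=3):
    """Pick up to `replicas` nodes, least loaded first, one per rack.

    node_racks: dict of node name -> rack name (assumed to be known)
    load:       dict of node name -> number of chunks already stored
    Returns fewer nodes than requested if there are not enough racks.
    """
    chosen, used_racks = [], set()
    # Least-loaded nodes first; the name breaks ties so the result is stable.
    for node in sorted(node_racks, key=lambda n: (load.get(n, 0), n)):
        if node_racks[node] in used_racks:
            continue  # a second copy on this rack would not survive a rack failure
        chosen.append(node)
        used_racks.add(node_racks[node])
        if len(chosen) == replicas:
            break
    return chosen


racks = {"dn1": "rackA", "dn2": "rackA", "dn3": "rackB", "dn4": "rackC"}
load = {"dn1": 5, "dn2": 1, "dn3": 2, "dn4": 9}
picked = place_replicas(racks, load)
```

Here `dn1` is skipped even though it is less loaded than `dn4`, because `dn2` already covers rackA. Losing any single rack then costs at most one replica of the chunk.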

Moving into the Details

Nodes in HDFS
– NameNode (master) (like GFS Master)
– DataNodes (slaves) (like GFS chunkservers)

NB – Hadoop and HDFS closely paired
– “careful use of jargon defines the true expert”
– “worker node A” and “data node 1” are frequently the same machine

Two types of Masters
– Jobtracker (Hadoop Job Master)
– NameNode (file system Master): what I mean by 'master' for the rest of the lecture

Your Data goes in ....

Files are divided into Chunks
– 64 MB each

The mapping between filename and chunks goes to the Master

Each chunk is replicated and sent off to DataNodes
– By default, 3 replicas
– The master determines which dataNodes
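The bookkeeping on this slide can be sketched in a few lines. The 64 MB chunk size and the replica count of 3 come from the slide; everything else (the `ToyMaster` class, the `filename#index` chunk ids, and the round-robin choice of dataNodes) is an assumption made for illustration, since the slide only says that the master decides.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per chunk, as on the slide
REPLICATION = 3                # default replica count, as on the slide


def chunk_count(file_size):
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


class ToyMaster:
    """Hypothetical master: filename -> chunk ids -> dataNodes."""

    def __init__(self, data_nodes):
        if not data_nodes:
            raise ValueError("need at least one dataNode")
        self.data_nodes = list(data_nodes)
        self.file_to_chunks = {}   # filename -> [chunk id, ...]
        self.chunk_to_nodes = {}   # chunk id -> [dataNode, ...]
        self._cursor = 0           # round-robin position (an assumption)

    def _pick_nodes(self):
        """Pick distinct dataNodes for one chunk, rotating the start point."""
        n = len(self.data_nodes)
        picked = [self.data_nodes[(self._cursor + i) % n]
                  for i in range(min(REPLICATION, n))]
        self._cursor = (self._cursor + 1) % n
        return picked

    def add_file(self, filename, file_size):
        """Record the chunks of a new file and where their replicas go."""
        chunk_ids = [f"{filename}#{i}" for i in range(chunk_count(file_size))]
        for chunk_id in chunk_ids:
            self.chunk_to_nodes[chunk_id] = self._pick_nodes()
        self.file_to_chunks[filename] = chunk_ids
        return chunk_ids


master = ToyMaster(["dn1", "dn2", "dn3", "dn4"])
chunks = master.add_file("logs.dat", 200 * 1024 * 1024)
```

A 200 MB file needs four chunks, and the last one is only partly full. Note that the master stores only names and locations here, never the file data itself.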

What the Clients Do

Where the data starts

On file creation, creates a separate file w/ checksum
– When data is fetched back from a dataNode, the checksum is computed again

Cache file data
– Avoid bothering the Master too often

When a Client has 1 chunk's worth of data
– Contacts the Master; Master sends names of the dataNodes to send it to
– ONLY sends it to the 1st
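The checksum round trip can be illustrated with a small sketch. It assumes CRC32 from Python's `zlib` module as the checksum and uses plain dicts in place of the separate checksum file and the dataNode's disk; the `ToyClient` class and chunk id are hypothetical.

```python
import zlib


class ToyClient:
    """Hypothetical client that checksums on write and verifies on read."""

    def __init__(self):
        # Stands in for the separate checksum file created with the data.
        self.checksums = {}

    def write(self, data_node, chunk_id, data):
        """Record the checksum, then hand the bytes to the dataNode."""
        self.checksums[chunk_id] = zlib.crc32(data)
        data_node[chunk_id] = data

    def read(self, data_node, chunk_id):
        """Fetch the bytes back and recompute the checksum before trusting them."""
        data = data_node[chunk_id]
        if zlib.crc32(data) != self.checksums[chunk_id]:
            raise IOError(f"checksum mismatch for {chunk_id}")
        return data


client = ToyClient()
data_node = {}  # a dict standing in for one dataNode's storage
client.write(data_node, "logs.dat#0", b"hello hdfs")
```

If the stored bytes are altered on the dataNode, `client.read` raises instead of returning bad data, and the client could then try another replica.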

What the DataNodes Do

Heartbeat to the Master

Opens, closes, or replicates a chunk if requested by the Master

During replication, sends data to next dataNode in chain
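The replication chain can be sketched as each dataNode storing the chunk and then forwarding it to the next node. The `ToyDataNode` class, its `receive` method, and the `forwarded_to` field are invented for this illustration; real dataNodes stream data over the network rather than making in-process calls.

```python
class ToyDataNode:
    """Hypothetical dataNode that stores a chunk and passes it down the chain."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}          # chunk id -> bytes held by this node
        self.forwarded_to = None  # name of the next node we sent to, if any

    def receive(self, chunk_id, data, rest_of_chain):
        """Store the chunk locally, then forward it to the next node."""
        self.chunks[chunk_id] = data
        if rest_of_chain:
            next_node = rest_of_chain[0]
            self.forwarded_to = next_node.name
            next_node.receive(chunk_id, data, rest_of_chain[1:])


def client_write(chunk_id, data, chain):
    """The client sends ONLY to the first dataNode in the chain."""
    chain[0].receive(chunk_id, data, chain[1:])


dn1, dn2, dn3 = ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")
client_write("logs.dat#0", b"payload", [dn1, dn2, dn3])
```

The client's link carries the chunk once; the other two copies travel between dataNodes, which fits the "bandwidth is our limiting resource" concern.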

What the Namespace Node Does

System metadata!
– Holds Name->ID mapping
– Chunk replica locations
– Transaction logs: EditLog, FSImage

It is responsible for coherency
– Uses the logs atomically
– Addresses the concurrent writes issue

It is checkpointed
– Similar to AFS volume snapshots
– Will pull last consistent log upon restart
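A rough sketch of how a log plus a checkpoint allows recovery after a restart: every change is appended to the edit log before it is applied, a checkpoint folds the current state into the image and empties the log, and a restart loads the image and replays whatever the log still holds. The `ToyNamespace` class and its operation names are assumptions for illustration, and in-memory structures stand in for the on-disk EditLog and FSImage.

```python
import copy


class ToyNamespace:
    """Hypothetical namespace with an edit log and a checkpoint image."""

    def __init__(self):
        self.fs_image = {}  # last checkpoint: filename -> [chunk ids]
        self.edit_log = []  # operations recorded since that checkpoint
        self.live = {}      # current in-memory namespace

    @staticmethod
    def _apply_to(state, op, filename, chunks):
        """Apply one logged operation to a namespace dict."""
        if op == "create":
            state[filename] = list(chunks or [])
        elif op == "delete":
            state.pop(filename, None)

    def apply(self, op, filename, chunks=None):
        """Log the operation first, then apply it to the live state."""
        self.edit_log.append((op, filename, chunks))
        self._apply_to(self.live, op, filename, chunks)

    def checkpoint(self):
        """Fold the live state into the image and start a fresh log."""
        self.fs_image = copy.deepcopy(self.live)
        self.edit_log.clear()

    def restart(self):
        """Rebuild the live state from the image plus the logged edits."""
        state = copy.deepcopy(self.fs_image)
        for op, filename, chunks in self.edit_log:
            self._apply_to(state, op, filename, chunks)
        self.live = state
        return state


ns = ToyNamespace()
ns.apply("create", "a.txt", ["a.txt#0"])
ns.checkpoint()
ns.apply("create", "b.txt", ["b.txt#0", "b.txt#1"])
ns.apply("delete", "a.txt")
ns.live = {}  # simulate losing the in-memory state in a crash
recovered = ns.restart()
```

Without the checkpoint, a restart would have to replay every edit since the file system was created, so checkpointing bounds the recovery time.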

What the Namespace Node Does

Listens for Heartbeats

Listens for Client Requests

If no heartbeat
– Marks a node as dead
– Its data is deregistered

It selects dataNodes
– Which nodes get which chunks
– Signals creating, opening, closing

Deletes
– Orders move to /trash
– Starts delete timer
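The heartbeat handling can be sketched as a table of last-seen times plus a periodic sweep. The 30-second timeout, the `ToyHeartbeatMonitor` class, and its method names are assumptions for illustration; the slide does not give a timeout. Times are passed in explicitly so the example is deterministic.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds; a made-up value for this sketch


class ToyHeartbeatMonitor:
    """Hypothetical master-side view of which dataNodes are alive."""

    def __init__(self):
        self.last_seen = {}       # dataNode -> time of its last heartbeat
        self.chunk_to_nodes = {}  # chunk id -> set of dataNodes holding it

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def register(self, chunk_id, node):
        """Record that `node` holds a replica of `chunk_id`."""
        self.chunk_to_nodes.setdefault(chunk_id, set()).add(node)

    def sweep(self, now):
        """Mark silent nodes as dead and deregister their replicas."""
        dead = [node for node, seen in self.last_seen.items()
                if now - seen > HEARTBEAT_TIMEOUT]
        for node in dead:
            del self.last_seen[node]
            for holders in self.chunk_to_nodes.values():
                holders.discard(node)
        return sorted(dead)

    def under_replicated(self, target=3):
        """Chunks that now have fewer than `target` live replicas."""
        return sorted(chunk for chunk, holders in self.chunk_to_nodes.items()
                      if len(holders) < target)


monitor = ToyHeartbeatMonitor()
for name in ("dn1", "dn2", "dn3"):
    monitor.heartbeat(name, now=0.0)
    monitor.register("logs.dat#0", name)
monitor.heartbeat("dn1", now=40.0)
monitor.heartbeat("dn2", now=40.0)
dead_nodes = monitor.sweep(now=50.0)
```

After the sweep, `dn3` is gone and `logs.dat#0` shows up as under-replicated, which is the point where the master would ask a surviving dataNode to copy the chunk to a new node.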

All together Now!

Additional Resources

Hadoop wiki

Youtube → “Hadoop” → Google developer videos (1-3 will be helpful)

Google University
– Includes UW course, the other UW course, a couple others
– Use at your own risk

“The Google File System” paper is rather readable as research papers go
