COMP5426_Parallel Program_2013 Semester 1_lecture06-13

时间：2025-07-11

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

COMP5426 Parallel and Distributed Computing

MapReduce

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Data-Intensive ComputingData-Intensive

In data-intensive computing, the focus is on the data: problem areas include

Typically store data at datacenters Use compute nodes nearby Compute nodes run computation services

Storage Communication bottleneck Moving tasks to data (rather than vice-versa) Security Availability of Data Scalability

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

What is MapReduce?MapReduce is a distributed/parallel computing framework introduced by Google to support computing on large data sets on clusters of computers. Originally used at Google, now widely used as a more general platform for data-intensive computing

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

What is MapReduce?The framework is inspired by map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as their original forms).

User implements map() and reduce() functions Runtime library takes care of EVERYTHING else

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

What is MapReduce?A simple programming model that applies to certain large-scale data-intensive computing problems Hide messy details in MapReduce runtime library:

Improvements to core library benefit all users of library!

automatic parallelization load balancing network and disk transfer optimization handling of machine failures Robustness

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

MotivationLarge Scale Data Processing

Want to process lots of data (> 1 TB) Want to parallelize across hundreds/thousands of CPUsHow to parallelize How to distribute How to handle failures

… Want to make this easy6

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Typical problem solved by MapReduceRead a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, or transform Write the results Outline stays the same, but map and reduce change to fit the problem7

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Programming ModelMap

Takes an input pair and produces a set of intermediate key/value pairs e.g., Map: (key1, value1) -> list(key2,value2)

The MapReduce library groups together all intermediate values associated with the same intermediate key Reduce

This function accepts an intermediate key and a set of values for that key Reduce: (key2,list(key2,value2)) -> value38

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Example: Counting WordsCounting words in a large set of documents: map()

Input<filename, file_text> Parses file and emits<word, count> pairs eg.<”hello”, 1>

reduce()

Sums all values for the same key and emits<word, TotalCount> eg.<”hello”, (3 5 2 7)>=><”hello”, 17>

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Example: Use of MapReducemap(string key, string value)//key: document name//value: document contents for each word w in value EmitIntermediate(w,“1”); reduce(string key, iterator values)//key: word//values: list of counts int results= 0; for each v in values result+= ParseInt(v); Emit(AsString(result));10

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Actual Source CodeThe example is written in pseudo-code Actual

implementation is in C++, using a MapReduce library True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

ApplicationsStructure of the Web: Input is (URL, contents) Scan through the document's contents looking for links to other URLs Map outputs (URL, linked-to URL) you get a simple representation of the WWW“link graph” Map outputs (linked-to URL, URL) you get the reverse link graph,“what web pages link to me?” Map outputs (linked-to URL, anchor text) you get“how do other web pages characterize me?”12

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

ApplicationsGoogle uses MapReduce for

Page indexing pipeline: What are all the pages that match this query? PageRank: What are the best pages that match this query? and more others

Greatly simplifies large-scale computations at Google13

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

How MapReduce WorksUser to do list:

Write map() and reduce() functions indicate:

Input/output files M: number of map tasks R: number of reduce tasks W: number of machines

Submit the job

This requires no knowledge of parallel& distributed systems!! What about everything else?14

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

The InfrastructureLarge clusters of commodity PCs and networking hardware Clusters consists of 100/1000s of machines (failures are common) GFS (Google File System)

Distributed file system Provides replication of the data

University of Sydney_COMP5426_Parallel Program_2013 Semester 1

Parallelismmap() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently. Synchronization required between the two functions.

…… 此处隐藏：3347字，全部文档内容请下载后查看。喜欢就下载吧 ……

COMP5426_Parallel Program_2013 Semester 1_lecture06-13.doc 将本文的Word文档下载到电脑

下载这篇word文档

上一篇：27[1].施工测量放线报审(污水管道)中线、开挖线

下一篇：试论中职学校实训基地建设的分层次探索