COMP5426_Parallel Program_2013 Semester 1_lecture06-13
时间:2025-04-21
时间:2025-04-21
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
COMP5426 Parallel and Distributed Computing
MapReduce
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Data-Intensive ComputingData-Intensive
In data-intensive computing, the focus is on the data: problem areas include
Typically store data at datacenters Use compute nodes nearby Compute nodes run computation services
Storage Communication bottleneck Moving tasks to data (rather than vice-versa) Security Availability of Data Scalability
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
What is MapReduce?MapReduce is a distributed/parallel computing framework introduced by Google to support computing on large data sets on clusters of computers. Originally used at Google, now widely used as a more general platform for data-intensive computing
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
What is MapReduce?The framework is inspired by map and reduce functions commonly used in functional programming (although their purpose in the MapReduce framework is not the same as their original forms).
User implements map() and reduce() functions Runtime library takes care of EVERYTHING else
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
What is MapReduce?A simple programming model that applies to certain large-scale data-intensive computing problems Hide messy details in MapReduce runtime library:
Improvements to core library benefit all users of library!
automatic parallelization load balancing network and disk transfer optimization handling of machine failures Robustness
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
MotivationLarge Scale Data Processing
Want to process lots of data (> 1 TB) Want to parallelize across hundreds/thousands of CPUsHow to parallelize How to distribute How to handle failures
… Want to make this easy6
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Typical problem solved by MapReduceRead a lot of data Map: extract something you care about from each record Shuffle and Sort Reduce: aggregate, summarize, filter, or transform Write the results Outline stays the same, but map and reduce change to fit the problem7
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Programming ModelMap
Takes an input pair and produces a set of intermediate key/value pairs e.g., Map: (key1, value1) -> list(key2,value2)
The MapReduce library groups together all intermediate values associated with the same intermediate key Reduce
This function accepts an intermediate key and a set of values for that key Reduce: (key2,list(key2,value2)) -> value38
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Example: Counting WordsCounting words in a large set of documents: map()
Input<filename, file_text> Parses file and emits<word, count> pairs eg.<”hello”, 1>
reduce()
Sums all values for the same key and emits<word, TotalCount> eg.<”hello”, (3 5 2 7)>=><”hello”, 17>
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Example: Use of MapReducemap(string key, string value)//key: document name//value: document contents for each word w in value EmitIntermediate(w,“1”); reduce(string key, iterator values)//key: word//values: list of counts int results= 0; for each v in values result+= ParseInt(v); Emit(AsString(result));10
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Actual Source CodeThe example is written in pseudo-code Actual
implementation is in C++, using a MapReduce library True code is somewhat more involved (defines how the input key/values are divided up and accessed, etc.)
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
ApplicationsStructure of the Web: Input is (URL, contents) Scan through the document's contents looking for links to other URLs Map outputs (URL, linked-to URL) you get a simple representation of the WWW“link graph” Map outputs (linked-to URL, URL) you get the reverse link graph,“what web pages link to me?” Map outputs (linked-to URL, anchor text) you get“how do other web pages characterize me?”12
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
ApplicationsGoogle uses MapReduce for
Page indexing pipeline: What are all the pages that match this query? PageRank: What are the best pages that match this query? and more others
Greatly simplifies large-scale computations at Google13
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
How MapReduce WorksUser to do list:
Write map() and reduce() functions indicate:
Input/output files M: number of map tasks R: number of reduce tasks W: number of machines
Submit the job
This requires no knowledge of parallel& distributed systems!! What about everything else?14
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
The InfrastructureLarge clusters of commodity PCs and networking hardware Clusters consists of 100/1000s of machines (failures are common) GFS (Google File System)
Distributed file system Provides replication of the data
University of Sydney_COMP5426_Parallel Program_2013 Semester 1
Parallelismmap() functions run in parallel, creating different intermediate values from different input data sets reduce() functions also run in parallel, each working on a different output key All values are processed independently. Synchronization required between the two functions.
…… 此处隐藏:3347字,全部文档内容请下载后查看。喜欢就下载吧 ……