MapReduce
Course Instructor - Jimeng Sun

Introduction
MapReduce is a powerful system that can run some of the methods we have discussed so far (such as predictive modeling) on big data sets using distributed computation and distributed storage.
What is MapReduce?
Hadoop is the Java implementation of MapReduce and the Hadoop Distributed File System (HDFS); together they form an execution environment. Many software tools have been developed on top of Hadoop to ease data science tasks such as data processing, extraction-transform-load (ETL), statistical computation, and analytic modeling. In summary, Hadoop and MapReduce enable a powerful big data ecosystem by combining all of these pieces. A MapReduce or Hadoop system provides the following capabilities: distributed storage for large data sets through HDFS, distributed computation through the MapReduce programming interface, and fault tolerance to cope with the constant component failures that occur in large distributed systems built on commodity hardware.
Computational Process
MapReduce was originally proposed by Jeff Dean and Sanjay Ghemawat from Google in 2004 and was implemented inside Google as proprietary software supporting many of their search engine activities. Later, Apache Hadoop was developed as open source software that mimics Google's original MapReduce system. It is written in Java and provides distributed storage and distributed processing of very large data sets. To program on a Hadoop system, we use the MapReduce programming abstraction, which provides a limited but powerful paradigm for parallel processing of large data sets. Every algorithm running on MapReduce or Hadoop has to be specified as a MapReduce program. The reason for this very constrained programming model is to support scalable, parallel, and fault-tolerant implementations of data processing algorithms on large data sets. To utilize Hadoop and MapReduce for data analytics, we have to understand and master common patterns of computation in MapReduce.
Learning Via Aggregation
A fundamental pattern for writing a data mining or machine learning algorithm on Hadoop is to express it as the computation of aggregation statistics. Say we want to implement a machine learning algorithm that identifies the most common risk factors for heart failure. We need to decompose the algorithm into a set of smaller computation units. In particular, we specify a map function f that is applied to every heart failure patient in the large database; for example, f extracts the list of heart-failure-related risk factors that appear in each patient's record. The results from the map function are then aggregated by a reduce function; for example, instead of listing the risk factors for each patient, the reduce function computes the frequency of each risk factor over the entire population by aggregating the output of the map function.
MapReduce Abstraction
Here is an example that illustrates MapReduce in more detail. Say we have a large database of patients stored in the Hadoop distributed file system. Each patient is stored as a separate record, and each record contains the history of that patient's encounters: the diagnoses, medications, procedures, and clinical notes are all stored in the patient record. Our goal is to write a MapReduce program that computes the number of cases of each disease. To do this in MapReduce, we first specify a map function that goes through each patient record, extracts the diseases mentioned in it, and outputs them. For example, suppose we have two patient records. When we apply the map function to the first record, we find that hypertension and diabetes are mentioned. Then, for each disease identified inside the record, we emit the disease name together with the value 1, so the output of each map function looks like the following.
For the first patient we emit hypertension with value 1 and diabetes with value 1; for the second patient, we emit heart disease with value 1 and hypertension with value 1. The disease name in this case is the key, and 1 is the value. So the map function processes an input record and outputs a set of key-value pairs. Notice that the same key-value pair may appear in the output of different map functions; for example, hypertension with value 1 appears in both outputs. The next phase will combine them: all the output from the map functions is processed internally by Hadoop, where it is shuffled and aggregated.
For example, hypertension appears twice, so after the shuffle and combine phase we have the key hypertension together with all of its associated values. Diabetes appears only once, so its list contains a single value, and similarly heart disease has only one value. This intermediate result is the input to the reduce function. To write a MapReduce program, we need to specify both the map function and the reduce function. Here is one example of a reduce function: it takes a disease key and the list of values for that disease. For hypertension, for instance, the values are 1 and 1, and the reduce function simply sums them. The result of the reduce phase is therefore hypertension with value 2, diabetes with value 1, and heart disease with value 1. In this very simple example, the two-stage process may seem strange, but when we are dealing with large data sets with billions of records, this two-phase process is essential. Next, we will explain how the MapReduce system works internally, and you will understand why the computation has to be specified in these two phases.
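The following is a minimal sketch in plain Python, not Hadoop code, of the disease-count example above. The record format and the function names (map_record, shuffle, reduce_counts) are hypothetical, chosen only to mirror the map, shuffle, and reduce flow described in the text.

```python
from collections import defaultdict

# Two toy patient records; each lists the diseases mentioned in the record.
records = [
    {"patient_id": 1, "diseases": ["hypertension", "diabetes"]},
    {"patient_id": 2, "diseases": ["heart disease", "hypertension"]},
]

def map_record(record):
    """Map phase: emit a (disease, 1) key-value pair for every disease mention."""
    for disease in record["diseases"]:
        yield disease, 1

def shuffle(pairs):
    """Shuffle phase (done internally by Hadoop): group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(disease, values):
    """Reduce phase: sum the values for each disease key."""
    return disease, sum(values)

map_output = [pair for record in records for pair in map_record(record)]
# e.g. [('hypertension', 1), ('diabetes', 1), ('heart disease', 1), ('hypertension', 1)]
grouped = shuffle(map_output)
result = dict(reduce_counts(k, v) for k, v in grouped.items())
print(result)  # {'hypertension': 2, 'diabetes': 1, 'heart disease': 1}
```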
MapReduce System
A MapReduce system has two kinds of components: mappers and reducers. The data is partitioned and processed by multiple mappers, and each mapper handles one partition, which is a subset of the records in the entire data set. For example, mapper 1 processes three records, mapper 2 processes another three records, and mapper 3 processes the remaining records. Then we have the reducers; in this case, reducer 1 and reducer 2. They also divide the work, with each reducer processing intermediate results in a certain key range. For example, reducer 1 is in charge of heart diseases, and reducer 2 is in charge of cancers. In the map phase, each mapper iteratively applies the map function to the data it is in charge of and prepares the output to be sent to the corresponding reducer: one set of intermediate results destined for reducer 1 and another set destined for reducer 2. Across different mappers, these processes happen in parallel.
To summarize, a MapReduce job has three phases: the map stage, where the map function and any pre-aggregation (combine) functions are applied; the shuffle stage, where all the intermediate results are sent to the corresponding reducers; and the final reduce stage, where the final output is generated.
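To make the three stages concrete, here is a hedged sketch that simulates several mappers working on separate partitions, a local pre-aggregation (combiner) step, and a hash-based assignment of keys to reducers. The partitioning scheme and function names are assumptions for illustration, not Hadoop's actual internals.

```python
from collections import Counter, defaultdict

NUM_REDUCERS = 2

# Each inner list is the partition of records handled by one mapper.
partitions = [
    [["hypertension", "diabetes"], ["hypertension"]],      # mapper 1
    [["heart disease", "hypertension"], ["diabetes"]],     # mapper 2
]

def map_stage(partition):
    """Map stage: emit (disease, 1), pre-aggregate locally, and bucket output by reducer."""
    local_counts = Counter(d for record in partition for d in record)  # combiner
    buckets = defaultdict(list)
    for disease, count in local_counts.items():
        buckets[hash(disease) % NUM_REDUCERS].append((disease, count))
    return buckets

def shuffle_stage(all_buckets):
    """Shuffle stage: route each mapper's buckets to the reducer in charge of that key range."""
    per_reducer = defaultdict(list)
    for buckets in all_buckets:
        for reducer_id, pairs in buckets.items():
            per_reducer[reducer_id].extend(pairs)
    return per_reducer

def reduce_stage(pairs):
    """Reduce stage: sum the partial counts for each disease."""
    totals = Counter()
    for disease, count in pairs:
        totals[disease] += count
    return dict(totals)

mapper_outputs = [map_stage(p) for p in partitions]   # runs in parallel on a real cluster
reducer_inputs = shuffle_stage(mapper_outputs)
final = {}
for reducer_id, pairs in reducer_inputs.items():
    final.update(reduce_stage(pairs))
print(final)  # {'hypertension': 3, 'diabetes': 2, 'heart disease': 1}
```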
MapReduce Fault Recovery
One of the key capabilities that a MapReduce or Hadoop system provides is fault tolerance. In a large cluster environment, many things can fail at any given time. Failure recovery is very costly, and data scientists usually do not want to deal with it when they implement their algorithms; they want to assume the system will always work. It is therefore on the shoulders of the system to take care of all the potential failures that can happen in this distributed environment.
Distributed File System
The MapReduce engine is just one part of a Hadoop system. The other crucial part is the distributed file system. Given a large file, it is often impossible or impractical to store the whole file on a single machine. Instead, we split the file into partitions; for example, here we have four partitions: A, B, C, and D. We store these partitions on different machines, which we call workers. Here we have four workers, and in addition we store multiple copies of the same partition on different workers. For example, partition A is stored on three workers: 2, 3, and 4, and the same is done for B, C, and D. In this way, we distribute a large file across multiple machines, and to access the file we can retrieve its partitions in many different ways.
There are two benefits to doing this. First, we can now access multiple partitions of a single file concurrently, so access is faster than reading the whole file from a single machine. More importantly, if some of the workers fail, we can still retrieve the entire file. Say two workers fail, worker 1 and worker 3: we can still reconstruct the whole file by retrieving all of the partitions from workers 2 and 4. HDFS is the backend file system that stores all the data you want to process with MapReduce programs.
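A simple way to see the benefit of replication is the following sketch, which places each partition on three of four workers and shows that the file can still be reassembled after two workers fail. The round-robin placement policy and function names here are made up for illustration and are not HDFS's actual block-placement algorithm.

```python
# Assign each partition to 3 out of 4 workers (replication factor 3).
partitions = ["A", "B", "C", "D"]
workers = [1, 2, 3, 4]
REPLICATION = 3

placement = {
    p: [workers[(i + j) % len(workers)] for j in range(REPLICATION)]
    for i, p in enumerate(partitions)
}
# e.g. {'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [3, 4, 1], 'D': [4, 1, 2]}

def read_file(failed_workers):
    """Reassemble the file by reading each partition from any surviving replica."""
    pieces = []
    for p in partitions:
        survivors = [w for w in placement[p] if w not in failed_workers]
        if not survivors:
            raise IOError(f"partition {p} lost: all replicas failed")
        pieces.append((p, survivors[0]))
    return pieces

print(read_file(failed_workers={1, 3}))  # the file is still fully readable
```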
MapReduce Design Choice
The key design philosophy behind MapReduce is not to include every functionality that people may want. Instead, the key design question is: what functionality can we remove so that the system is more reliable and more scalable? At the same time, we want to make sure that even with the minimal functionality MapReduce provides, we can still express powerful computational algorithms such as machine learning.
Given a machine learning algorithm, there are many limitations in utilizing MapReduce. For example, we cannot directly access individual data records; instead, we have to specify map and reduce functions and express the computation as a very restricted form of aggregation query.
Analytics with MapReduce – KNN
Now let's talk about how to use MapReduce to write a k-nearest-neighbor (KNN) classifier. Given a set of patients, say we want to predict whether a query patient will develop heart failure by finding patients similar to that query patient. We plot the patients in a two-dimensional space: the x-axis is the cholesterol level and the y-axis is the blood pressure (see image). Every black point indicates a patient, and the value attached to each point indicates whether that patient had heart failure. The goal is, given a query patient, to find the nearest neighbors and take a majority vote among those similar patients to predict whether the query patient will have heart failure.
In the MapReduce setting, the patients are split into multiple partitions. In this case we have two partitions: the green partition for mapper 1 and the brown partition for mapper 2, and the red point is our query data point. We want to find the three nearest neighbors. Now we need to specify the map and reduce functions. The input to the map function is the data points in one partition together with the query point, and the output is the k nearest neighbors within that partition. The algorithm is quite simple: each mapper goes through all the data points in its partition and keeps the k closest points, and these local nearest neighbors are the intermediate results sent to the reduce phase. In the reduce phase, the input is the local nearest neighbors from all partitions together with the query point, and the output is the final, global set of nearest neighbors. The algorithm is almost the same as the map function: the reduce function goes through all the local nearest neighbors and identifies the global nearest neighbors, which are the three points closest to the query.
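Below is a hedged Python sketch of this two-phase KNN: each mapper finds the k closest points in its own partition, and the reducer merges those local candidates into the global k nearest neighbors. The point format, toy data values, and the use of Euclidean distance are assumptions made only for illustration.

```python
import heapq
import math

K = 3
query = (190.0, 130.0)  # (cholesterol, blood pressure) of the query patient

# Two partitions of labeled patients: ((cholesterol, blood pressure), had_heart_failure)
partitions = [
    [((180.0, 120.0), 1), ((250.0, 140.0), 1), ((160.0, 110.0), 0)],  # mapper 1
    [((200.0, 135.0), 0), ((170.0, 100.0), 0), ((240.0, 150.0), 1)],  # mapper 2
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.dist(a, b)

def map_knn(partition):
    """Map: return the K points in this partition closest to the query."""
    return heapq.nsmallest(K, partition, key=lambda p: distance(p[0], query))

def reduce_knn(local_neighbors):
    """Reduce: merge the local candidates into the global K nearest neighbors."""
    return heapq.nsmallest(K, local_neighbors, key=lambda p: distance(p[0], query))

local = [p for part in partitions for p in map_knn(part)]
global_knn = reduce_knn(local)
labels = [label for _, label in global_knn]
prediction = 1 if sum(labels) > K / 2 else 0  # majority vote on heart failure
print(global_knn, prediction)
```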
Linear Regression
Now let's talk about how to use MapReduce to implement linear regression. For example, we want to build a linear regression model that maps patient information to heart disease risk, and we want to find the coefficients associated with all the patient features. In this case, we have an input feature matrix X of size n by d, where n is the number of patients and d is the number of features, and an output target vector y of size n by 1, where every row is the heart disease risk of one individual. Finally, we have a d by 1 coefficient vector, where every element is the linear regression coefficient for the corresponding feature; for example, the first element may be the coefficient for age, the second the coefficient for height, and so on.
Now let's see how to compute such a model using MapReduce. From statistics we know there are many ways to solve linear regression, and one popular way is the normal equation: the optimal coefficients, in the maximum likelihood sense, are given by (X^T X)^{-1} X^T y. If we have a small data set, this can easily be computed on a single machine using your favorite statistical tool, such as R or MATLAB. However, if we have a very large data set with billions of patients, the computation cannot be done on a single machine, and next we will see how MapReduce can help. To write this as a MapReduce program, we have to understand the equation a little better. There are two steps involved: computing X^T X and computing X^T y, and both can be decomposed into aggregation statistics. X^T X becomes the summation over i from 1 to n of x_i x_i^T, and similarly X^T y becomes the summation over i from 1 to n of x_i y_i.
Here we immediately see the pattern we are looking for in terms of aggregation statistics, which maps easily to a MapReduce computation: x_i x_i^T becomes one map function, f1, and x_i y_i becomes another map function, f2. For example, to implement the first one as a MapReduce job, the pseudocode is as follows. The input to the map function is the (x, y) pairs, where x is the patient feature vector and y is the heart disease risk; in this case we only need the feature vector x, and we compute x x^T. In the reduce phase, we take all the outputs x_i x_i^T and compute their sum.
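Here is a hedged NumPy sketch of the map and reduce functions described above. On a real cluster the per-patient terms would be computed by mappers on different machines; the synthetic data and function names (map_f1, map_f2, reduce_sum) are assumptions, but the aggregation pattern is the one in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                      # n patients, d features
X = rng.normal(size=(n, d))         # patient feature matrix (n x d)
y = rng.normal(size=n)              # heart disease risk (n x 1)

def map_f1(x_i):
    """Map function f1: compute x_i x_i^T (a d x d matrix) for one patient."""
    return np.outer(x_i, x_i)

def map_f2(x_i, y_i):
    """Map function f2: compute x_i * y_i (a d x 1 vector) for one patient."""
    return x_i * y_i

def reduce_sum(parts):
    """Reduce: sum the per-patient contributions."""
    return sum(parts)

XtX = reduce_sum(map_f1(X[i]) for i in range(n))        # equals X^T X
Xty = reduce_sum(map_f2(X[i], y[i]) for i in range(n))  # equals X^T y

theta = np.linalg.solve(XtX, Xty)   # normal equation: (X^T X)^{-1} X^T y
assert np.allclose(theta, np.linalg.lstsq(X, y, rcond=None)[0])
```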
Limitations of MapReduce
Using MapReduce to compute logistic regression requires an iterative process. Every iteration requires loading the data again, once to compute the map functions and once to compute the reduce functions, and we have to repeat this many, many times, so it is not efficient. MapReduce is not optimized for iterative or multi-stage computation.
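To see why iteration is costly, consider this hedged sketch of logistic regression trained by gradient descent expressed as repeated map-and-reduce passes: each iteration is effectively a separate job whose map function recomputes per-patient gradient contributions over the full data set and whose reduce function sums them. The data, learning rate, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n)      # binary heart disease label
w = np.zeros(d)
lr = 0.1

def map_gradient(x_i, y_i, w):
    """Map: per-patient gradient contribution for the current weights."""
    p = 1.0 / (1.0 + np.exp(-x_i @ w))
    return (p - y_i) * x_i

def reduce_sum(parts):
    """Reduce: sum the contributions over all patients."""
    return sum(parts)

for _ in range(20):                 # every iteration is a full pass over the data,
    grad = reduce_sum(map_gradient(X[i], y[i], w) for i in range(n))
    w -= lr * grad / n              # which Hadoop re-reads from disk for each job
print(w)
```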
Resources
Lecture videos
Full lecture transcript with illustrations
Hands-on tutorials @Sunlab
Extract, Transform and Load (ETL) with Pandas
scikit-learn: machine learning in Python
Performance Evaluation