Here at the Labs, we have been playing around with the MapReduce programming model (namely the open-source Hadoop implementation) for a while, but have been relatively conservative up till now. Most of the jobs we have run thus far have been fairly simple, more or less basic aggregation functions, with the most difficult part being the sheer size of the data itself. This time around, we are planning to be a little more adventurous with the techniques we use to analyze the data, mining deeper than we have before. Before actually diving in, I did some reading of books and papers from both academia and industry to get an idea of the landscape and what we could try. Here’s a basic summary of some of the more interesting things I’ve stumbled upon and possible ideas that we will tackle in the near future.
First off, a quick introduction to MapReduce. Ever since Google released their research paper “MapReduce: Simplified Data Processing on Large Clusters” as a way of processing huge datasets with mostly unstructured or semi-structured formats, MapReduce has been attracting a lot of attention in both academia and industry. You can read the paper (or the Wikipedia article) for a more detailed explanation of MapReduce, but in a nutshell, MapReduce is a generic execution engine that is more or less ignorant of storage schemas and data models. The underlying runtime system automatically parallelizes computations across a large cluster of machines, handles failures, and manages disk and network I/O efficiently. The user is only expected to provide a map function and a reduce function. The map function is applied over all input records in the dataset and yields intermediate output that is aggregated by the reduce function. This model seemed perfect for extracting useful aggregate statistics from the terabytes of log data that we were generating on a regular basis.
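To make that division of labor concrete, here is the classic word-count job expressed as the two functions a MapReduce user supplies. This is just an illustrative sketch in plain Python (the real runtime handles the shuffle and distribution across machines; here we simulate it locally to show the data flow):

```python
from collections import defaultdict

def map_fn(record):
    # Emit an intermediate (word, 1) pair for every word in one input
    # record (think of each record as one log line).
    for word in record.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Aggregate every count emitted for one key.
    return key, sum(values)

def run_job(records):
    # Simulated "shuffle": group intermediate pairs by key, exactly what
    # the runtime does between the map and reduce phases.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

counts = run_job(["the quick fox", "the lazy dog"])
print(counts["the"])  # prints 2
```

The user never sees the parallelism: the same `map_fn` and `reduce_fn` run unchanged whether the input is two strings or two terabytes of logs.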
For the past few months, we’ve been playing around with Hadoop on a relatively small cluster of only 10 nodes. But now that we’ve gotten our feet wet with basic MapReduce, it was about time we tried something a little more complicated. We wanted to find a way to extract useful information from the huge volume of content being generated by our customers every day. In the future we would like to do some more complicated analysis of the content, like running it through a part-of-speech tagger and actually getting contextual information, but that is a bit much for the first step.
Compliance is an obviously important application: the ability to predict whether or not an email would make it past the spam filters and into the recipient’s inbox, before it is even sent out, would be invaluable. But we can go even deeper. By analyzing the content of the email itself as well as the track record of the sender, we could predict other statistics about the email: whether or not it will be flagged as spam, the open rate, the click-through rate, etc. The problem is that standard MapReduce is inherently a bulk activity; feed it a large dataset, such as logging data from the last X amount of time, and it will eventually spit out some aggregate information. This is where machine learning comes in. Because we already track all of these statistics, these predictions can be framed as supervised learning problems. Building a model, such as a regression tree, from our existing data will require the Hadoop cluster and probably several days of computation, but the resulting model can make predictions on new emails very efficiently. And because the real-world statistics for the predicted emails will eventually become apparent, we can use reinforcement learning, automatically refining the model to make it even more accurate as time goes by.
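The train-once, predict-cheaply pattern is the key idea. As a toy illustration (not our production model, and the training data and labels here are made up), a tiny Naive Bayes classifier shows the shape of it: the expensive training pass would run on the cluster over historical emails, and the resulting word counts are all we need to score a new email in microseconds:

```python
import math
from collections import Counter

def train(labeled_emails):
    # labeled_emails: list of (text, label) pairs, label in {"spam", "ham"}.
    # In practice this pass runs over the full historical dataset.
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in labeled_emails:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    # Scoring a new email only touches the precomputed counts.
    word_counts, label_counts = model
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            # Laplace smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

model = train([("win free money now", "spam"),
               ("meeting notes attached", "ham"),
               ("free prize claim now", "spam"),
               ("project schedule update", "ham")])
print(predict(model, "claim your free money"))  # prints "spam"
```

The reinforcement idea from above maps onto this naturally: as real-world outcomes for sent emails arrive, they become new labeled examples and the counts are incrementally updated.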
Of course life isn’t always that easy. While being able to predict the outcome of different campaigns can be very valuable, the ability to find patterns in the data that would have otherwise never even been considered is arguably even more important. Because we are trying to find structure in a collection of more or less unlabeled data (such as email content), I immediately thought of clustering, which some consider to be the most important unsupervised learning problem. For the purposes of this research exercise, I looked into the implementation of two clustering algorithms, K-means and expectation maximization (EM), in the MapReduce model on top of an underlying Hadoop cluster.
I chose to look at the K-means algorithm mainly because it is so well known and simple to implement. The algorithm begins with k initial cluster centers (typically chosen at random) and assigns each data instance to the cluster whose center is closest to it. It then computes new cluster centers by taking the mean of each cluster’s instances and reassigns every instance to the cluster with the nearest new center. This process is repeated until no instance changes clusters, yielding a partition that maximizes intra-cluster similarity and minimizes inter-cluster similarity. There is a more detailed explanation in the Wikipedia article. In their paper “Parallel K-Means Clustering Based on MapReduce”, Zhao, Ma and He demonstrated that the K-means algorithm can be successfully applied to the MapReduce model and can efficiently process large datasets. This is good news for us, knowing that the way has already been paved and the method works.
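One K-means iteration fits MapReduce cleanly: mappers assign each point to its nearest center, and reducers average the points in each cluster to produce the new centers, with a driver program repeating the job until convergence. A minimal local sketch of that structure (my own simplification on 1-D points, not Zhao et al.’s actual code, and it ignores edge cases like empty clusters):

```python
from collections import defaultdict

def nearest(point, centers):
    # Index of the center closest to this point (squared distance).
    return min(range(len(centers)), key=lambda i: (point - centers[i]) ** 2)

def kmeans_iteration(points, centers):
    # "Map" phase: emit (cluster_id, point); the shuffle groups by id.
    clusters = defaultdict(list)
    for p in points:
        clusters[nearest(p, centers)].append(p)
    # "Reduce" phase: new center = mean of each cluster's points.
    return [sum(ps) / len(ps) for _, ps in sorted(clusters.items())]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centers = [0.0, 5.0]
for _ in range(10):  # driver loop; a real job would test convergence
    centers = kmeans_iteration(points, centers)
print(centers)  # converges to roughly [1.0, 9.5]
```

Each iteration is one MapReduce job, so a run with ten iterations is ten jobs chained by the driver, with the current centers broadcast to every mapper.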
The other algorithm I looked at, EM, is a more probabilistic approach to the problem. The algorithm finds maximum likelihood estimates of model parameters in the presence of missing or hidden data. It iteratively performs the expectation (E) step, which estimates the missing data given the observed data and the current parameters, and the maximization (M) step, which maximizes the likelihood function as if the missing data were known, using the estimates from the E-step. In their book “Data-Intensive Text Processing with MapReduce”, Jimmy Lin and Chris Dyer give a very detailed explanation of applying EM algorithms to text processing and fitting those algorithms into the MapReduce programming model. EM fits naturally into MapReduce by making each iteration of EM one MapReduce job: mappers map over independent instances and compute summary statistics, while reducers sum together the required training statistics and solve the M-step optimization problems.
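To make the mapper-computes-statistics, reducer-solves-M-step split concrete, here is a small sketch of one EM iteration for a two-component 1-D Gaussian mixture with fixed unit variance. The shape follows Lin and Dyer’s description, but the function names and toy data are my own, not from the book:

```python
import math

def gauss(x, mu):
    # Unit-variance Gaussian density.
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em_iteration(points, mus, weights):
    # "Map" (E-step): each point contributes its expected responsibility
    # statistics; per component we accumulate [sum_resp, sum_resp * x].
    stats = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        likes = [w * gauss(x, mu) for w, mu in zip(weights, mus)]
        total = sum(likes)
        for k in range(2):
            r = likes[k] / total  # responsibility of component k for x
            stats[k][0] += r
            stats[k][1] += r * x
    # "Reduce" (M-step): closed-form parameter updates from the sums.
    n = len(points)
    new_mus = [s_x / s_r for s_r, s_x in stats]
    new_weights = [s_r / n for s_r, _ in stats]
    return new_mus, new_weights

points = [0.1, -0.2, 0.3, 4.9, 5.2, 5.1]
mus, weights = [0.0, 4.0], [0.5, 0.5]
for _ in range(20):  # again, each iteration would be one MapReduce job
    mus, weights = em_iteration(points, mus, weights)
print([round(m, 2) for m in mus])  # means settle near 0.07 and 5.07
```

The important property is that the mapper’s output per instance is just a small, summable tuple of statistics, which is exactly what a combiner and reducer can aggregate efficiently.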
But of course it is impossible to decide which algorithm to use until we decide what we are actually trying to accomplish. One of the biggest challenges is still the process of adding structure to the unstructured data that is natural language. In order to do this, we’re going to have to take a step into the realm of natural language processing, which is an entire field of computer science and linguistics in itself. But this is a delicate process because in the act of adding structure to the data, we are also introducing bias that can dramatically affect the correlations that are discovered. I’ll talk about what we actually decide to do to overcome this challenge in the future, when we actually figure it out…
For unrelated tasks, I have always used interactive data mining toolkits such as Weka. They provide a library of implemented algorithms behind an easy-to-use interface. Unfortunately, they were designed to run on a single machine and cannot handle datasets as large as those required for this project. That’s when I stumbled upon the paper “Toolkit-based high-performance Data Mining of large Data on MapReduce Clusters” by Wegener, Mock, Adranale and Wrobel. They partially implemented an extension that integrates Weka with a Hadoop cluster for distributed data storage and parallel processing. In their experiment, they managed to successfully implement the simple Naive Bayes algorithm as a proof of concept. Their cluster was significantly faster than the single-machine baseline and scaled up to input sizes of 100 GB. The ability to use an interactive data mining toolkit would make the task that much easier. I am looking forward to any progress made in this development and would love to contribute in any way I can.
At the moment, we are at the beginning stages of this project and still have a long way to go. Has anyone else had success doing complex data mining on top of Hadoop, or is anyone planning to try it as well? This is uncharted territory for us, and I’m looking forward to seeing how things progress.
Continue the conversation by sharing your comments here on the blog and by following us on Twitter @CTCT_API