mrjob in a cluster

During the first programming assignment, you have gained familiarity with mrjob development, and run tasks on a single machine. While this is suitable for quick development and debugging, it, of course, does not achieve the whole purpose of using MapReduce: to scale up the size of analyses we can perform by using a cluster of machines. The purpose of this lab is to move from basic mrjob testing to real-life deployment in the Google Cloud.

This lab should be completed on your VM, not on a CSIL machine, because it requires administrator access.

Google Cloud Dataproc

Google Cloud is the cloud computing platform we are using this quarter. It offers various services, such as the Compute Engine service we used last week to make one or more virtual machines available for transient use. Another service is Google Cloud Dataproc: managed MapReduce using the Hadoop framework.

Managed denotes a service for which professional system administrators have designed and configured an environment for us. In this context, Google Cloud employees have created a virtual machine template, installed a suitable version of Linux, configured the operating system, installed a version of the Hadoop MapReduce framework, and configured the environment in various ways. They have then created an automated process by which we can request a cluster of Hadoop servers for a span of time, and the system will temporarily create and provision virtual machines for our use with all these installations and configurations already handled. This saves us from having to perform a variety of relatively uninspiring work to set up such a cluster by hand, and from having to learn about and consider various details that any Hadoop administrator needs to become familiar with. Instead, we can just focus on the details of our particular analysis.

Installing mrjob

You have likely already installed mrjob on your VM during PA#1; if not, see the writeup for that assignment for how to install it there.

Word Counts

Copy and paste this code into a file,

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordFreqCount(MRJob):

    def mapper(self, _, line):
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':

One of the reasons why mrjob supports running on a single machine is to facilitate development, testing, and debugging. It is actually slower to run on a single machine using mrjob than to write code not using mrjob at all and run on a single machine, because of the overhead that adds. But, there is no sense in trying to run broken code on Google Cloud. So, find or create some small text file and try running this code with it on your machine, to make sure it works:

python3 mytext.txt

Setting up Cloud Dataproc

Now that our dry run has been successful, we need to do some one-time setup of Cloud Dataproc so we can scale up our job.

Within the web browser on your VM, go to the Google Cloud API Manager (log in with the Google account to which you redeemed your coupon code).

Search for "Google Cloud Storage" and select it. Make sure it is enabled (a button saying "DISABLE" should be present -- if not, enable it).

Go back and search for Google Cloud Storage JSON API" and "Google Cloud Dataproc API" and make sure they are also enabled.

Choose "Credentials" in the menu on the left. Click "Create credentials" and choose "Service account key".

Please make sure you are doing this within the web browser on your VM, because we are about to download a file that needs to be on your VM and should never leave your VM, for security reasons.

Under "Service account", choose "New service account". Enter a name in the "Service account name" field. "mrjob" is recommended. Accept the ID that is filled in. Make sure "JSON" is selected.

For the "Role" choose "Dataproc", then "Dataproc Editor". Also choose "Storage", then "Storage Admin". (This will result in two different things being checked at once, which we need. Then click "Create".

A file will download onto your VM. This file gives credentials that can be used to access Google Cloud. It must be protected at all times, by never leaving your VM. It must never be checked into git, GitHub, or put in any other public location, or hackers are guaranteed to find it and start abusing your account. (This really happened, last year.)

Save this file in /home/student/mrjob-key.json

Go to a terminal window and type (or copy and paste):

echo "export GOOGLE_APPLICATION_CREDENTIALS=/home/student/mrjob-key.json" >> ~/.bashrc
source ~/.bashrc
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
echo "deb $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
curl | sudo apt-key add -
sudo apt-get update && sudo apt-get install google-cloud-sdk
gcloud init
You may need to log in to your Google account with the redeemed coupon again, and will need to "Allow" the Google Cloud SDK to have access to your account. After doing so, you can close the web browser window that the terminal opened up. You then should select one of the us-central1 regions to finish setup.

Trying a clustered mrjob

If you have completed all of the previous steps, then you should be able to run the mrjob you ran earlier on your local computer on Google Cloud Dataproc by simply adding the -r dataproc option to the command line:

python3 -r dataproc mytext.txt
If you receive an error, verify that you have completed all of the above steps. If it says that it is waiting for the cluster to accept the job, then it is working. In the web console, choose the menu button at top left (the three horizontal lines) and choose Dataproc near the bottom. You should see your cluster "Provisioning" with two nodes.

In about ten minutes, you should get your results back.

Congratulations! You have run your first MapReduce job in a real cluster and gone from having a job that would take nearly a second on your own computer to ten minutes on a cluster.

In all seriousness: the overheads of starting a cluster are substantial. When you try MapReduce with such a trivial dataset, the fixed cost of these overheads will not be recouped with faster completion of the actual task. But, had this been a truly large computation, the ten minutes it took to launch the cluster would have been well worth the wait: your analysis would have finished much faster, after the initial setup, than it would have on your own computer.

Once your job finishes, the web console should indicate that the cluster is "Deleting". Please make sure that it goes away.

If, for any reason, the cluster does not shut down, or if you have to leave the lab with something still running, please be sure to click on the cluster name in the web console and manually delete it. Otherwise, there is the potential that you will continue to use resources indefinitely, drawing down your account balance.

Why did mrjob start only two cluster nodes? Apparently, this is the default. Had you wanted to run with more, then after the -r dataproc option on the command line, you would also have added: --num-core-instances 4 for instance if you wanted four nodes in the cluster. Note that for a seriously large job, you would likely use even more.

A larger job

It's time to try our word counter on some real data. However, for it to finish within the lab, we won't scale it up that much.

In your repository, do a git pull upstream master to get a lab3 directory. In it is a file named gutenberg.txt which contains some works of literature in English from Project Gutenberg. Putting it all together, you should be able to run:

python3 -r dataproc /home/student/cmsc12300-spr-17-yourcnetid/lab3/gutenberg.txt > wordcounts.txt
Then, use a text editor to look at the output file wordcounts.txt. You could even use sort and uniq, which we learned about in CS 122, to find the top words. Can you do this directly in MapReduce, though? Well, you'll need to figure out how to complete Task 2 of the programming assignment....

After trying out this job, please again make sure your cluster has terminated.