r³ – A quick demo of usage

Posted: 4-8-12 in Development, Python

My new map/reduce engine project, r³, got a lot of attention last week, and before that on Twitter, Facebook, and even Hacker News.

So I decided to write a sample project demoing the usage of r³.

The problem

I had to find an interesting, yet simple problem to show in this demo. Since I am a huge fan of github, I decided that I would show each committer’s percentage of commits in a given repository.

GitHub has a VERY nice API that you can use to retrieve a myriad of information about your own repositories or other people’s repositories (provided they are public).

You just have to access https://api.github.com/repos/mirrors/linux/commits?per_page=100&top=master to get the first 100 commits in the Linux kernel repository. The response comes with a Link header that specifies where the next 100 commits can be found.
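Extracting the "next page" URL from that Link header is just string work. Here's a minimal sketch (the header value below is made up to illustrate the format GitHub uses; no GitHub client library is assumed):

```python
def find_next_url(link_header):
    # A Link header is a comma-separated list of '<url>; rel="..."' entries.
    for part in link_header.split(','):
        pieces = part.split(';')
        if len(pieces) == 2 and 'rel="next"' in pieces[1]:
            return pieces[0].strip().lstrip('<').rstrip('>')
    return None

# Illustrative header, shaped like what the GitHub API returns.
link = ('<https://api.github.com/repos/mirrors/linux/commits?per_page=100&page=2>; rel="next", '
        '<https://api.github.com/repos/mirrors/linux/commits?per_page=100&page=300>; rel="last"')
print(find_next_url(link))
# https://api.github.com/repos/mirrors/linux/commits?per_page=100&page=2
```

On the last page there is no rel="next" entry, so the function returns None and the crawl stops.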

The Input Stream

Cool! So my map/reduce operation should operate on all commits for a given project. That means my input stream just needs to capture all those commits and return them.

I just built a simple crawler that keeps looking for the next page of commits until it can’t find one.

To save myself some time and bandwidth, it also stores those commits in a temp folder as a means of caching them.
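The cache layout is derived from the URL itself: the URL path joined with the query string, with & and = flattened into path separators, so each page of commits gets its own directory. A standalone sketch of that scheme (in Python 3 syntax, so urllib.parse rather than the Python 2 urlparse module the crawler uses):

```python
from os.path import join
from urllib.parse import urlparse

CACHE_PATH = '/tmp/r3-gh-cache'

def cache_key(url):
    parts = urlparse(url)
    # Flatten '?per_page=100&page=2' into 'per_page_100/page_2'.
    url_path = join(parts.path.lstrip('/'),
                    parts.query.replace('&', '/').replace('=', '_'))
    return join(CACHE_PATH, url_path, 'contents.json')

print(cache_key('https://api.github.com/repos/rails/rails/commits?per_page=100&page=2'))
# /tmp/r3-gh-cache/repos/rails/rails/commits/per_page_100/page_2/contents.json
```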

The code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from os.path import exists, join, dirname
from urlparse import urlparse
import os
import urllib2

from ujson import loads

CACHE_PATH = '/tmp/r3-gh-cache'

class Stream:
    job_type = 'percentage'
    group_size = 10

    def process(self, app, arguments):
        if not exists(CACHE_PATH):
            os.makedirs(CACHE_PATH)
        user = arguments['user'][0]
        repo = arguments['repo'][0]

        return get_repo_commits(user, repo)

def get_repo_commits(user, repo):
    next_url = 'https://api.github.com/repos/%s/%s/commits?per_page=100' % (user, repo)
    commits = []
    index = 0

    while next_url:
        index += 1
        content, next_url = get_url_content(next_url, index)
        commits.extend(loads(content))

    return commits

def get_url_content(url, index):
    parts = urlparse(url)

    # Build a cache key out of the URL path and flattened query string.
    url_path = join(parts.path.lstrip('/'), parts.query.replace('&', '/').replace('=', '_'))
    cache_path = join(CACHE_PATH, url_path, 'contents.json')
    next_path = join(CACHE_PATH, url_path, 'next.json')

    # The last page has no next URL, so only the contents need to be cached.
    if exists(cache_path):
        print "%d - %s found in cache!" % (index, url)
        next_url = None
        if exists(next_path):
            with open(next_path) as next_file:
                next_url = next_file.read()
        with open(cache_path) as cache_file:
            return cache_file.read(), next_url

    print "%d - getting %s..." % (index, url)
    response = urllib2.urlopen(urllib2.Request(url))
    contents = response.read()
    print "%d - storing in cache" % index

    if not exists(dirname(cache_path)):
        os.makedirs(dirname(cache_path))

    with open(cache_path, 'w') as cache_file:
        cache_file.write(contents)

    # GitHub paginates with a Link header; grab the rel="next" URL if present.
    next_url = None
    link = response.headers.get('link', '')
    for part in link.split(','):
        pieces = part.split(';')
        if len(pieces) == 2 and 'rel="next"' in pieces[1]:
            next_url = pieces[0].strip().lstrip('<').rstrip('>')

    if next_url is not None:
        with open(next_path, 'w') as next_file:
            next_file.write(next_url)

    return contents, next_url

This stream is very simple. All it does is get all the commits for a given project (using the user and repo arguments) and return them as a stream for r³.

The mapper

Now that we have all the commits for the given project, it can’t get any simpler. We’ll just separate the commits per committer like this:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from r3.worker.mapper import Mapper

class CommitsPercentageMapper(Mapper):
    job_type = 'percentage'

    def map(self, commits):
        return list(self.split_commits(commits))

    def split_commits(self, commits):
        for commit in commits:
            commit = commit['commit']
            yield commit['author']['name'], 1

That yields a (committer, 1) pair for every commit in the project.
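To see what the mapper emits, here's the same split logic run standalone (no r³ import; the commit dicts are made up but follow the nesting of GitHub's commit documents):

```python
def split_commits(commits):
    # Each GitHub commit document nests the author under commit -> author -> name.
    for commit in commits:
        commit = commit['commit']
        yield commit['author']['name'], 1

# Hypothetical commit documents, trimmed down to the fields the mapper reads.
fake_commits = [
    {'commit': {'author': {'name': 'alice'}}},
    {'commit': {'author': {'name': 'bob'}}},
    {'commit': {'author': {'name': 'alice'}}},
]

print(list(split_commits(fake_commits)))
# [('alice', 1), ('bob', 1), ('alice', 1)]
```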

All that’s left is to reduce this to a coherent value.

The reducer

The reducer just iterates through all committers and assigns percentages:

#!/usr/bin/python
# -*- coding: utf-8 -*-

from collections import defaultdict

class Reducer:
    job_type = 'percentage'

    def reduce(self, app, items):
        commits_per_user = defaultdict(int)
        total_commits = 0

        for commit in items:
            for login, frequency in commit:
                commits_per_user[login] += frequency
                total_commits += frequency

        percentages = {}
        for login, frequency in commits_per_user.iteritems():
            percentages[login] = round(float(frequency) / float(total_commits) * 100, 2)

        ordered_percentages = sorted(percentages.iteritems(), key=lambda item: -item[1])
        return {
            'total_commits': total_commits,
            'commit_percentages': [
                {'user': item[0], 'percentage': item[1], 'commits': commits_per_user[item[0]]}
                for item in ordered_percentages
            ]
        }
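The same reduction can be run standalone on a couple of fake mapped batches to see the shape of the result. This sketch is in Python 3 syntax (items() instead of iteritems()); the input below is invented:

```python
from collections import defaultdict

def reduce_percentages(items):
    # Each item is one mapper's output: a list of (committer, 1) pairs.
    commits_per_user = defaultdict(int)
    total_commits = 0

    for commit in items:
        for login, frequency in commit:
            commits_per_user[login] += frequency
            total_commits += frequency

    percentages = {
        login: round(float(frequency) / total_commits * 100, 2)
        for login, frequency in commits_per_user.items()
    }
    ordered = sorted(percentages.items(), key=lambda item: -item[1])
    return {
        'total_commits': total_commits,
        'commit_percentages': [
            {'user': user, 'percentage': pct, 'commits': commits_per_user[user]}
            for user, pct in ordered
        ],
    }

result = reduce_percentages([[('alice', 1), ('bob', 1)], [('alice', 1), ('alice', 1)]])
print(result['total_commits'])          # 4
print(result['commit_percentages'][0])  # alice: 75.0% of 4 commits
```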

Getting it all together

Now it’s time to put everything we’ve done together and start looking at some famous repositories.

To make this easier, I set up a repository on GitHub that has everything in place.

Just clone it, type make run and the server will be running.

WARNING: The make run command will install some python packages. If you don’t want them to be installed system-wide, create a virtualenv before running the command.

Interesting Trivia

I ran r3-gh against some famous repositories and got some interesting information. Be advised that the number of commits does not reflect code committed and/or effort spent, since some people commit more often than others. This is meant simply as trivia and as a way of demoing r³.

That said, let’s take a look at the rails repository (total of 25974 commits):

Now let’s see how django is distributed among committers (total of 12403 commits):

And finally the linux kernel (total of 63226 commits):

It’s worth noting that I excluded every committer with less than 1% of the commits (less than 0.5% for the Linux kernel), so the percentages are a little off.

Conclusion

It is pretty simple to get r³ to do some cool calculations for us. I got the whole sample working in a very short amount of time. It took me more time to write this post than to make r³ calculate the committer percentages.

Hope you guys come up with some interesting stuff to calculate as well.

Comments
  1. lequocdo says:

     Hi heynemann! Could R3 run in a multi-node cluster?
