My new map/reduce engine project, r³, got a lot of attention last week, and before that on Twitter, Facebook and even Hacker News.
So I decided to write a sample project demoing the usage of r³.
The problem
I had to find an interesting yet simple problem for this demo. Since I am a huge fan of GitHub, I decided to show each committer’s percentage of commits in a given repository.
GitHub has a VERY nice API that you can use to retrieve a myriad of information on your own repositories or on other people’s repositories (provided they are public).
You just have to access https://api.github.com/repos/mirrors/linux/commits?per_page=100&top=master to get the first 100 commits in the linux kernel repository. The resulting document comes with a link header that specifies where the next 100 commits can be found.
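To make the paging concrete, here is a minimal sketch of pulling the next page URL out of a link header. It is standalone Python 3, not part of r³ or the crawler; the function name `parse_next_link` and the sample header value are assumptions made up for illustration, and it assumes the URLs themselves contain no commas.

```python
def parse_next_link(link_header):
    """Return the URL tagged rel="next" in a Link header, or None.

    Hypothetical helper: splits on commas, so it assumes the URLs in the
    header do not contain commas themselves.
    """
    if not link_header:
        return None
    for part in link_header.split(','):
        segments = part.split(';')
        url = segments[0].strip()
        params = [segment.strip() for segment in segments[1:]]
        if 'rel="next"' in params and url.startswith('<') and url.endswith('>'):
            return url[1:-1]  # drop the surrounding angle brackets
    return None


# Made-up header value, shaped like a typical Link header.
header = (
    '<https://api.github.com/repos/mirrors/linux/commits?per_page=100&page=2>; rel="next", '
    '<https://api.github.com/repos/mirrors/linux/commits?per_page=100&page=1>; rel="first"'
)
next_url = parse_next_link(header)
print(next_url)
```

When no entry is tagged as next, the helper returns None, which is the signal to stop crawling.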
The Input Stream
Cool! So my map/reduce operation should operate on top of all commits for a given project. That means that in my input stream I just need to capture all those commits and return them.
I just built a simple crawler that keeps looking for the next page of commits until it can’t find one.
To save myself some time and bandwidth, it also stores those commits in a temp folder as a means of caching them.
The code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

from os.path import exists, join, dirname
from urlparse import urlparse
import os
import urllib2

from ujson import loads

CACHE_PATH = '/tmp/r3-gh-cache'


class Stream:
    job_type = 'percentage'
    group_size = 10

    def process(self, app, arguments):
        if not exists(CACHE_PATH):
            os.makedirs(CACHE_PATH)
        user = arguments['user'][0]
        repo = arguments['repo'][0]

        return get_repo_commits(user, repo)


def get_repo_commits(user, repo):
    next_url = 'https://api.github.com/repos/%s/%s/commits?per_page=100' % (user, repo)
    commits = []
    index = 0

    # Keep following the "next" link until there is none left.
    while next_url:
        index += 1
        content, next_url = get_url_content(next_url, index)
        json = loads(content)
        for item in json:
            commits.append(item)

    return commits


def get_url_content(url, index):
    parts = urlparse(url)

    # Turn the URL path and query string into a folder inside the cache.
    url_path = join(parts.path.lstrip('/'), parts.query.replace('&', '/').replace('=', '_'))
    cache_path = join(CACHE_PATH, url_path, 'contents.json')
    next_path = join(CACHE_PATH, url_path, 'next.json')

    if exists(cache_path) and exists(next_path):
        print "%d - %s found in cache!" % (index, url)
        with open(cache_path) as cache_file:
            with open(next_path) as next_file:
                return cache_file.read(), next_file.read()

    print "%d - getting %s..." % (index, url)
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)

    contents = response.read()
    print "%d - storing in cache" % index

    if not exists(dirname(cache_path)):
        os.makedirs(dirname(cache_path))

    with open(cache_path, 'w') as cache_file:
        cache_file.write(contents)

    next_url = None
    if 'link' in response.headers:
        # Pick the entry tagged rel="next" instead of assuming it comes first.
        for part in response.headers['link'].split(','):
            if 'rel="next"' in part:
                next_url = part.split(';')[0].strip()[1:-1]
                break

    # Always write next.json (empty on the last page) so the last page is cached too.
    with open(next_path, 'w') as next_file:
        next_file.write(next_url or '')

    return contents, next_url
This stream is very simple. All it does is get all commits for a given project (using the arguments user and repo) and return them as a stream for r³.
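To see where the cached pages end up, here is a small standalone sketch (Python 3) of the URL-to-folder mapping the crawler uses. The helper name `cache_dir_for` is an assumption for illustration, and it uses posixpath so the result looks the same on every platform.

```python
from posixpath import join
from urllib.parse import urlparse

CACHE_PATH = '/tmp/r3-gh-cache'


def cache_dir_for(url):
    """Hypothetical helper: map a commits URL to its cache folder."""
    parts = urlparse(url)
    # The path becomes nested folders; each query pair becomes one more folder.
    url_path = join(parts.path.lstrip('/'),
                    parts.query.replace('&', '/').replace('=', '_'))
    return join(CACHE_PATH, url_path)


first_page = cache_dir_for('https://api.github.com/repos/mirrors/linux/commits?per_page=100')
print(first_page)  # /tmp/r3-gh-cache/repos/mirrors/linux/commits/per_page_100
```

Each distinct page URL gets its own folder, so a page that was already fetched can be read back from disk instead of hitting the API again.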
The mapper
Now that we have all the commits for the given project, it can’t get any simpler. We’ll just separate the commits per committer like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-


from r3.worker.mapper import Mapper


class CommitsPercentageMapper(Mapper):
    job_type = 'percentage'

    def map(self, commits):
        return list(self.split_commits(commits))

    def split_commits(self, commits):
        for commit in commits:
            commit = commit['commit']
            yield commit['author']['name'], 1
That emits one (author name, 1) pair per commit, which is all we need to count the commits per user in the project.
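As a quick illustration of what the mapper produces, here is a standalone sketch (Python 3, no r³ import) that runs the same splitting logic over three hand-written commits. The sample data is made up and only contains the fields the mapper reads.

```python
def split_commits(commits):
    """Same logic as the mapper's split_commits, as a plain function."""
    for commit in commits:
        yield commit['commit']['author']['name'], 1


# Made-up commits, trimmed to the fields the mapper actually touches.
sample_commits = [
    {'commit': {'author': {'name': 'Alice'}}},
    {'commit': {'author': {'name': 'Bob'}}},
    {'commit': {'author': {'name': 'Alice'}}},
]

mapped = list(split_commits(sample_commits))
print(mapped)  # [('Alice', 1), ('Bob', 1), ('Alice', 1)]
```

Nothing is summed at this point; the pairs are only added up later, in the reduce step.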
All that’s left is to reduce this to a coherent value.
The reducer
The reducer just iterates through all committers and assigns percentages:
#!/usr/bin/python
# -*- coding: utf-8 -*-

from collections import defaultdict


class Reducer:
    job_type = 'percentage'

    def reduce(self, app, items):
        commits_per_user = defaultdict(int)
        total_commits = 0

        for commit in items:
            for user_data in commit:
                login = user_data[0]
                frequency = user_data[1]
                commits_per_user[login] += frequency
                total_commits += frequency

        percentages = {}
        for login, frequency in commits_per_user.iteritems():
            percentages[login] = round(float(frequency) / float(total_commits) * 100, 2)

        ordered_percentages = sorted(percentages.iteritems(), key=lambda item: -1 * item[1])
        return {
            'total_commits': total_commits,
            'commit_percentages': [{
                'user': item[0],
                'percentage': item[1],
                'commits': commits_per_user[item[0]]
            } for item in ordered_percentages]
        }
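Here is a small worked example of that arithmetic, as a standalone Python 3 sketch rather than the r³ reducer itself. The function name `commit_percentages` and the input data are assumptions for illustration; the input is shaped like a list of mapper outputs, and the sketch sorts by commit count instead of by percentage.

```python
from collections import defaultdict


def commit_percentages(mapped_groups):
    """Sum (name, count) pairs from every group and turn them into percentages."""
    commits_per_user = defaultdict(int)
    total = 0
    for group in mapped_groups:
        for name, frequency in group:
            commits_per_user[name] += frequency
            total += frequency

    # Most active committer first.
    ranked = sorted(commits_per_user.items(), key=lambda item: -item[1])
    return {
        'total_commits': total,
        'commit_percentages': [
            {'user': name, 'commits': count,
             'percentage': round(count * 100.0 / total, 2)}
            for name, count in ranked
        ],
    }


# Two made-up mapper outputs: Alice has 3 commits, Bob has 1.
result = commit_percentages([
    [('Alice', 1), ('Bob', 1)],
    [('Alice', 1), ('Alice', 1)],
])
print(result['total_commits'])  # 4
print(result['commit_percentages'][0])
```

With 3 of 4 commits, Alice ends up at 75.0 percent and Bob at 25.0 percent.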
Getting it all together
Now it’s time to put together everything we’ve built and start looking at some famous repositories.
In order to make this easier, I set up a repository on GitHub that has everything in place.
Just clone it, type make run and the server will be running.
WARNING: The make run command will install some python packages. If you don’t want them to be installed system-wide, create a virtualenv before running the command.
Interesting Trivia
I ran r3-gh against some famous repositories and got some interesting information. Be advised that the number of commits does not reflect code committed and/or effort spent, since some people commit more often than others. This is meant simply as trivia and as a way of demoing r³.
That said, let’s take a look at the rails repository (total of 25974 commits):
Now let’s see how django is distributed among committers (total of 12403 commits):
And finally the linux kernel (total of 63226 commits):
It’s worth noting that I excluded every committer that had less than 1% of commits (less than 0.5% for the linux kernel), so the percentages are a little off.
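That kind of cut-off can be sketched in a few lines. This is a standalone Python 3 example, assuming a result shaped like the reducer's output; the function name `filter_committers` and the numbers are made up for illustration and are not taken from any of the repositories above.

```python
def filter_committers(result, min_percentage=1.0):
    """Keep only the committers at or above the given percentage."""
    return [entry for entry in result['commit_percentages']
            if entry['percentage'] >= min_percentage]


# Made-up reducer output, not real repository data.
sample_result = {
    'total_commits': 200,
    'commit_percentages': [
        {'user': 'Alice', 'percentage': 62.5, 'commits': 125},
        {'user': 'Bob', 'percentage': 37.0, 'commits': 74},
        {'user': 'Carol', 'percentage': 0.5, 'commits': 1},
    ],
}

visible = filter_committers(sample_result)
print([entry['user'] for entry in visible])  # ['Alice', 'Bob']
```

Passing min_percentage=0.5 would keep Carol as well, which mirrors the lower cut-off used for the linux kernel.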
Conclusion
It is pretty simple to get r³ to do some cool calculations for us. I got the whole sample working in a very short amount of time. It took me more time to write this post than to make r³ calculate the committer percentages.
Hope you guys come up with some interesting stuff to calculate as well.



