Franck Cuny @franckcuny

Software and Operation engineer.

StarGit

Last year I did a small exploration of GitHub to show the various communities using it and how they work. I wanted to do it again this year, but I lacked the time and motivation to start over. A couple of months ago, I got a message from mojombo asking whether I was planning to do a new poster. That gave me the motivation to work on it again.

This time I got help from Alexis to provide you with an awesome tool: a real explorer of your graph, but more on this later ;)

And of course, the poster. Feel free to print it yourself; the poster is A1 size.

The data

All the data are available! Last year I got some emails asking me for the dataset. So this time I asked first if I could release the data with the code and the poster, and the answer is yes! So if you're interested, you can download it.

The data are stored in MongoDB, so I provide a dump that you can easily restore:

% wget http://maps.stargit.net/dump/github.
% tar xvzf github.tgz
% cd github
% mongorestore -d github .

Now you can use MongoDB to browse the imported database. There are five collections: profiles, repositories, relations, contributions, and edges.
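
As a quick check that the restore worked, here is a minimal Python sketch that counts the documents in each collection. It assumes the `pymongo` driver is installed and that `mongod` runs locally on the default port; `collection_counts` and `open_github_db` are helper names made up for this example.

```python
# The five collections named in the post.
COLLECTIONS = ["profiles", "repositories", "relations", "contributions", "edges"]


def collection_counts(db, names):
    """Return {collection name: number of documents} for the given collections."""
    return {name: db[name].count_documents({}) for name in names}


def open_github_db(host="localhost", port=27017):
    """Connect to the restored "github" database (assumes a local mongod)."""
    # Imported here so the rest of the sketch does not require pymongo.
    from pymongo import MongoClient

    return MongoClient(host, port)["github"]


# Usage, once the dump has been restored:
#   for name, count in collection_counts(open_github_db(), COLLECTIONS).items():
#       print(name, count)
```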

Methodology

Last year I did a simple "follower/following" graph. It was already interesting, but also far too simple. This time I wanted to dig deeper.

The steps to process all this data are:

  • using the GitHub API, fetch information from the profiles.
  • once all the profiles are collected, fetch information about the repositories. Only forked repositories are kept.
  • keep the "simple" relations (followers/following), to be used later to add weight to relations.
  • tag each user with the main programming language they use. Using the GitHub API, I was able to categorize ~40k profiles (about 1/3 of my whole dataset).
  • using the GeoNames API, extract the name of the country each user is in. This time, about 55k profiles were tagged.
  • fetch the contributions for each repository.
  • compute a score between the author of each contribution and the owner of the repo.
  • add a weight to each edge, using the computed score and "+1" if the developer follows the other developer.
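
A minimal sketch of the last two steps, in Python. The scoring is an assumption made purely for illustration: the score here is just the number of contributions an author made to repositories owned by someone else, and the real scoring may well differ. The function names and the `(author, owner)` tuple layout are hypothetical too.

```python
from collections import Counter


def contribution_scores(contributions):
    """Count contributions per (author, owner) pair.

    ASSUMPTION: a plain count stands in for the real score.
    Contributions to one's own repositories are ignored.
    """
    scores = Counter()
    for author, owner in contributions:
        if author != owner:
            scores[(author, owner)] += 1
    return scores


def edge_weights(scores, follows):
    """Edge weight = score, plus 1 if the author follows the owner.

    `follows` is a set of (follower, followed) pairs.
    """
    return {
        pair: score + (1 if pair in follows else 0)
        for pair, score in scores.items()
    }


contributions = [("alice", "bob"), ("alice", "bob"), ("carol", "bob"), ("bob", "bob")]
follows = {("alice", "bob")}
weights = edge_weights(contribution_scores(contributions), follows)
```

With this toy input, the alice → bob edge gets a weight of 3 (two contributions plus the follow bonus) and carol → bob gets 1.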

For all the graphs, each of the following language groups has its own color:

  • Ruby
  • JavaScript
  • Python
  • C (C++, C#)
  • Perl
  • PHP
  • JVM (Java, Clojure, Scala)
  • Lisp (Emacs Lisp, Common Lisp)
  • Other

Exploring

Feel free to do your own analysis in the comments :) For each map, you'll find a PDF of the map, and the graph to explore using Gephi (in GEXF or GDF format).

But first, some numbers

I've collected:

  • 123 562 profiles
  • 2 730 organizations
  • 40 807 repositories

It took me about a month to collect the data and build the tools I needed.

Account creation

The following chart shows the number of accounts created per month. "Everyone" is the total number of accounts created. You can also see the numbers for each community.

On the "Everyone" graph, you can see a huge peak around April 2008: that's when GitHub was launched.

For most of the communities, the number of new accounts has been decreasing since 2010. I think the reason is that most of the developers from those communities are now on GitHub.

Languages

(Keep in mind that these numbers come from the profiles I was able to tag, roughly 40k.)

  • Ruby: 10046 (28%)
  • Python: 5403 (15%)
  • JavaScript: 5282 (15%) (JavaScript + CoffeeScript)
  • C: 5093 (14%) (C, C++, C#)
  • PHP: 3933 (11%)
  • JVM: 3790 (10%) (Java, Clojure, Scala, Groovy)
  • Perl: 1215 (3%)
  • Lisp: 348 (0%) (Emacs Lisp, Common Lisp)
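
The percentages in this list are consistent with each count being divided by the sum of the eight listed counts (35,110, rather than the ~40k tagged profiles) and then truncated instead of rounded, which is why Lisp shows 0%. That reading is inferred from the numbers alone, so treat it as an assumption; a short Python check:

```python
# Counts copied from the list above.
counts = {
    "Ruby": 10046,
    "Python": 5403,
    "JavaScript": 5282,
    "C": 5093,
    "PHP": 3933,
    "JVM": 3790,
    "Perl": 1215,
    "Lisp": 348,
}

# ASSUMPTION: the denominator is the sum of the listed counts only.
total = sum(counts.values())

# Integer division truncates, e.g. Lisp: 34800 // 35110 == 0.
shares = {lang: (100 * n) // total for lang, n in counts.items()}

for lang, pct in shares.items():
    print(f"{lang}: {counts[lang]} ({pct}%)")
```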

Those numbers don't really match what GitHub gives (https://github.com/languages), but that could be explained by the way I selected my users.

Country

  • United States: 19861 (36%)
  • United Kingdom: 3533 (6%)
  • Germany: 3009 (5%)
  • Canada: 2657 (4%)
  • Brazil: 2454 (4%)
  • France: 1833 (3%)
  • Japan: 1799 (3%)
  • Russia: 1604 (2%)
  • Australia: 1441 (2%)
  • China: 1159 (2%)

The United States is still the most represented country on GitHub; no surprise here.

If you are interested in the "geography" of Open Source, you should read these two articles: Coding Places and Investigating the Geography of Open Source Software through GitHub.

Companies

Looking at the "company" field on users' profiles, here are some stats about which companies have employees using GitHub:

  • ThoughtWorks: 102
  • Google: 66
  • Mozilla: 65
  • Yahoo!: 65
  • Red Hat: 64
  • Globo.com: 55
  • Twitter: 53
  • Facebook: 45
  • Yandex: 43
  • Intridea: 34
  • Microsoft: 33
  • Engine Yard: 32
  • Pivotal Labs: 29
  • MIT: 28
  • Rackspace: 27
  • IBM: 24
  • Caelum: 23
  • Novell: 22
  • GitHub: 22
  • VMware: 22

I didn't know the first company, ThoughtWorks, and I was expecting to see Facebook or Twitter as the company with the most developers on GitHub. It's also interesting to see Yandex here.
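
A tally like the one above can be produced with a simple counter over the profiles. This is only a sketch in Python: it assumes each profile is a dict with an optional "company" key, and `company_counts` is a made-up helper, not the code behind the numbers above. Variants of a name (with or without "Inc.", for example) are counted separately here.

```python
from collections import Counter


def company_counts(profiles, top=20):
    """Return the `top` most common non-empty company names with their counts."""
    counts = Counter()
    for profile in profiles:
        # The field may be missing or None; surrounding whitespace is stripped.
        company = (profile.get("company") or "").strip()
        if company:
            counts[company] += 1
    return counts.most_common(top)


sample_profiles = [
    {"login": "a", "company": "ThoughtWorks"},
    {"login": "b", "company": " ThoughtWorks "},
    {"login": "c", "company": None},
    {"login": "d", "company": "Google"},
    {"login": "e"},
]
top_companies = company_counts(sample_profiles)
```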

Global graph (1628 nodes, 9826 edges)

(download PDF)

The main difference from last year is the Android / modders community. They develop mostly in C and Java. The poster was created from this map.

Ruby (1968 nodes, 9662 edges)

(download PDF, download GDF, download GEXF)

This is still the main community on GitHub, even if JavaScript is now the most popular language. This graph is really dense and not easy to read, since there is no real cluster in it.

Python (1062 nodes, 2631 edges)

(download PDF, download GDF)

Here we have some clusters. I'm not familiar with the Python community, so I can't really give any insight.

Perl (608 nodes, 2967 edges)

(download PDF, download GDF, download GEXF)

I really like this graph since it shows (in my opinion) one of the real strengths of this community: everybody works with everybody. People working on a web framework will collaborate with people working on Moose, or an ORM, or other tools. It shows that in this community, people are competent in more than one field.

The Perl community is about the same size as last year. However, we can extract the following information:

  • the Japanese Perl hackers are still a cluster by themselves
  • miyagawa is still the glue between the Japanese community and the "rest of the world"
  • other leaders are: Florian Ragwitz (rafl), Andy Armstrong (AndyA), Dave Rolsky (autarch)
  • some clusters exist for Moose and Dancer.

As we can see on the previous charts, the number of accounts created by Perl developers is stalling.

United States (2646 nodes, 11344 edges)

(download PDF, download GDF, download GEXF)

This one is really nice. We can clearly see all the communities. Two things stand out:

  • C and Ruby are on opposite sides (C on the left, Ruby on the right)
  • Python and Perl are also opposed (Perl at the bottom and Python at the top)

I'll let you draw your own conclusions on this one ;)

France (706 nodes, 1059 edges)

(download PDF, download GDF, download GEXF)

We have a lot of small clusters on this one, and some very big authorities.

Japan (464 nodes, 1091 edges)

(download PDF, download GDF, download GEXF)

There are three dominant clusters on this one:

  • Ruby
  • Perl
  • C

The Ruby and Perl ones are well connected. There are a lot of Japanese hackers on CPAN using both languages.

StarGit

StarGit is a great tool we built with Alexis to let you explore your community on GitHub. You can read more about the application on Alexis' blog.

It's hosted on dotcloud (I'm still amazed at how easy it was to deploy the code ...), using the Perl Dancer web framework, MongoDB to store the data, and Redis to do some caching.

Credits

I would like to thank the whole GitHub team for their interest in the previous poster and for asking for another one this year :)

A huge thanks to Alexis for his help in building the awesome StarGit. Another big thanks to Antonin for his work on the poster.