Collaboration graphs and Git

2018 September 19, 15:08

A collaboration graph, as the name indicates, is a graph of people who have collaborated on projects of some kind. The vertices are individuals, and they are linked by an edge if they have worked together on the same project.

A popular application of collaboration graphs is calculating somebody's Bacon number or Erdős number. In case of the Bacon number, the collaboration graph links actors who have been in a movie together. The number is equal to the number of edges between Kevin Bacon and a given actor. An actor who was in a movie with Kevin Bacon has a Bacon number of 1. An actor with the Bacon number of 2 is someone who was not in any movies with Kevin Bacon, but was in a movie with someone else, who in turn was in a movie with Bacon. The Erdős number is the same concept, except instead of movies, people are linked by co-authorship of academic papers.

I have a lot of FOSS project Git repositories cloned on my computer. Some of them I have contributed to, but I also clone stuff to compile locally, or even to check out the source code. It occurred to me that using those repositories, I can build a small collaboration graph of FOSS contributors. This should not be too hard, since both libraries for pulling information out of Git repositories and libraries for building and analyzing graphs are readily available.

Histories of Git

When you set up Git on your computer you will (hopefully) input two things: your name, and your email address (user.name and user.email, respectively). Git adds these to every commit you make. Because of this, if you have a Git repository that was not cloned shallowly (that is, you have all the commits available locally), then you can find a list of all people who have contributed to it.

Git is popular, so of course there is a library available for interacting with it from Python: the aptly-named GitPython library. It makes it fairly easy to process Git commit histories:

from git import Repo

contributors = set()

repo = Repo('/path/to/repo')
commits = repo.iter_commits()

for commit in commits:
    contributors.add(commit.author)

With some parallelization to make things faster, we can process all the git directories we are interested in, and either load the data directly into a graph object, or perhaps save it in an SQLite database for easier loading later.

Of course, it should be noted that a person can contribute under several name-email pairs. Sites like Github will generally identify people by their email address, and that is a relatively easy option. It is certainly possible that someone would have the same name and several different addresses under which they contributed, but that gets into more complicated territory, with possible false positives when trying to merge contributors.

Making a graph

After obtaining a set of contributors for a given project, we will want to add that information to a graph. Two most popular graph libraries for Python seem to be graph-tool and NetworkX.

graph-tool focuses on performance—it uses C or C++ parts for handling the more computationally demanding tasks. NetworkX, on the other hand, is somewhat more flexible. Attaching metadata to edges or vertices is easier in NetworkX than in graph-tool, but drawing a good-looking graph with a lot of edges is faster if you use graph-tool, rather than NetworkX's matplotlib graphing capabilities. Both libraries are capable of dumping and loading graphs to and from a number of formats, so it is not very difficult to switch from one to the other.

NetworkX is slightly easier to use in this case. Since it can use any hashable Python object as a vertex, we can use project paths and contributor emails to identify our vertices. Adding a vertex is idempotent—if we try to add the same object twice, we only end up with one vertex.

Having obtained a set of contributors for each project earlier on, we can now easily create a graph, whether we are running a database query or just directly using the sets we assembled earlier.

graph = networkx.Graph()

# pairs contains tuples of project path and contributor email, such
# as ('/path/to/repo', 'alice@example.com')

for project, contributor in pairs:
    graph.add_node(project, kind='project')
    graph.add_node(contributor, kind='contributor')
    
    graph.add_edge(project, contributor)

Note that both projects and contributors are vertices (which NetworkX calls nodes). We include extra metadata in the form of the kind field.

Finding paths

If we want to find a path between two vertices, NetworkX offers us an aptly named function: shortest_path(). Let's say we want to find the path between alice@example.com and bob@example.com:

networkx.shortest_path(graph, 'alice@example.com', 'bob@example.com')

Provided there actually is a path between the two nodes, we will get something like this:

['alice@example.com',
 '/some/project',
 'carol@example.com',
 '/another/project',
 'bob@example.com']

As a human, you can tell that Bob is two hops away from Alice. Of course, you can also programmatically filter the resulting path to only show contributors:

path = networkx.shortest_path(g, 'alice@example.com', 'bob@example.com')
kind = networkx.get_node_attributes(graph, 'kind')

' — '.join(filter(lambda n: kind[n] == 'contributor', path))

This will give you a string like this:

'alice@example.com — carol@example.com — bob@example.com'

On the other hand, you can also filter for the projects which connect the contributors. If you count these, you will get the degree of separation:

path = networkx.shortest_path(g, 'alice@example.com', 'bob@example.com')
kind = networkx.get_node_attributes(graph, 'kind')

sum(1 for n in path if kind[n] == 'project')

Conclusions

After all this, I found out my Linus Torvalds Number is two—I have contributed to a project that someone who has contributed to Git also contributed to. Of course, there is more stuff you can do with with collaboration graphs like these—for an obvious example, you can draw them. The available Python libraries make it fairly easy to get started.