Geocoding Github: Visualizing Distributed Open-Source Development
- Brandon Heller
- Eli Marschner
- Evan Rosenfeld
The world depends on open-source software (OSS). We take for granted that people have offered the fruits of their labor to save others time and money in development. Yet few give any thought to who developed the software, where and when.
We seek to build a visualization that shows when and where open source development occurs, at global scale. More specifically, we’re interested in:
- Where is OSS development most common?
- How does OSS development vary by season, day, and year on a world scale?
Less important to us, but equally interesting, are questions of global collaboration:
- Is development localized or is there global collaboration?
- Which countries collaborate most (in)frequently with which other countries?
- Are certain regions more or less active at certain times, due to local events like holidays?
Our final project will visualize open-source development on Github, a commercial service that hosts over 1.4 million git repositories from over 400,000 users. Github is unique among source code hosts, as it adds a social layer on top of the existing repository data. The social graph, self-reported user locations, and open API of Github make our questions more accessible than ever before. Eli has demonstrated the feasibility of this approach by searching for names, geolocating their positions, and plotting the aggregate on a tiled map.
In the final version, developers with Github profiles will be able to enter their own profile to interactively explore how their projects compare to those of their friends, and see how it all varies with time.
To support spatial representation and interaction we will rely on user-reported location data for Github accounts or email addresses of committers. Temporal functionality will be supported by the timestamps assigned to individual commits.
The following pieces of metadata can be assembled for each commit:
- Latitude & Longitude
- Local time of day
- Time zone
- Local season
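As a rough illustration of how the metadata above might be derived for one commit, here is a minimal Python sketch. It assumes the commit's UTC timestamp and UTC offset are already known (git records an offset with each author timestamp) and that the author's location has already been geocoded to a latitude and longitude. The function name `commit_metadata`, its parameters, and the choice of meteorological (month-based) seasons are our own assumptions for illustration, not part of any Github API.

```python
from datetime import datetime, timedelta, timezone

# Meteorological seasons for the northern hemisphere, keyed by month.
_NORTHERN_SEASON = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
# Seasons are reversed south of the equator.
_OPPOSITE = {"winter": "summer", "summer": "winter",
             "spring": "autumn", "autumn": "spring"}


def commit_metadata(utc_timestamp, utc_offset_minutes, latitude, longitude):
    """Hypothetical helper: derive per-commit metadata from time and place."""
    tz = timezone(timedelta(minutes=utc_offset_minutes))
    local = datetime.fromtimestamp(utc_timestamp, tz=tz)
    season = _NORTHERN_SEASON[local.month]
    if latitude < 0:
        season = _OPPOSITE[season]
    return {
        "latitude": latitude,
        "longitude": longitude,
        "local_hour": local.hour,
        "utc_offset_minutes": utc_offset_minutes,
        "local_season": season,
    }
```

A UTC offset is only a stand-in for a true time zone; mapping coordinates to a named zone would need an additional lookup that this sketch does not attempt.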
The fundamental data unit will be commits oriented in both time (by timestamp) and space (by location of author). To answer our driving questions we will use additional metadata, such as commit hierarchy and project collaborators/contributors, all of which is available via GitHub APIs.
Effort will be made early on to evaluate the quality of data pulled from our primary sources for characteristics such as completeness of location info, reliability of that info (especially after being fed through a geocoding service), and the ratio of usable (i.e. locatable) data to total data volume. We’ll likely be using Tableau for this.
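The completeness and usable-data ratios described above could be computed along these lines. This is only a sketch: the record layout (a `location` string as reported by the user, and a `coords` pair that is `None` when geocoding failed) and the function name `location_quality` are hypothetical, chosen for illustration.

```python
def location_quality(records):
    """Summarize how much of the data is reported and how much is locatable.

    Each record is assumed to be a dict with:
      "location": the self-reported location string (may be None or empty)
      "coords":   a (latitude, longitude) tuple, or None if geocoding failed
    """
    total = len(records)
    reported = sum(1 for r in records if (r.get("location") or "").strip())
    locatable = sum(1 for r in records if r.get("coords") is not None)
    return {
        "total": total,
        "reported_ratio": reported / total if total else 0.0,
        "locatable_ratio": locatable / total if total else 0.0,
    }
```

For example, a sample where half of the users report a location but only a quarter geocode successfully would yield a `reported_ratio` of 0.5 and a `locatable_ratio` of 0.25.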
We expect to build on top of familiar interaction techniques for the two main axes, time and space. Space can be adjusted through Google Maps-style tile dragging and zooming, while time can be controlled through an adjustable-extents time slider (like the protovis example). Combining the time slider with a play button will let users animate activity over time, and smoothing (e.g. per-week, per-month, or per-year) should reduce noise, possibly with notches on the slider for these time periods.
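One simple way to realize the per-week, per-month, or per-year smoothing mentioned above is to bucket commit timestamps by period before drawing them. The sketch below assumes timestamps are Unix seconds in UTC; the function names and the use of ISO week numbering are our own choices for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def bucket_key(utc_timestamp, period):
    """Return a tuple identifying the week, month, or year of a timestamp."""
    t = datetime.fromtimestamp(utc_timestamp, tz=timezone.utc)
    if period == "year":
        return (t.year,)
    if period == "month":
        return (t.year, t.month)
    if period == "week":
        iso = t.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        return (iso[0], iso[1])
    raise ValueError("period must be 'week', 'month', or 'year'")


def commits_per_period(timestamps, period):
    """Count commits falling in each bucket of the chosen period."""
    return Counter(bucket_key(ts, period) for ts in timestamps)
```

Note that ISO weeks can straddle a year boundary, so a commit made on 1 January may be counted in the last week of the previous ISO year.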
Beyond the primary questions of time and space, we’d like to support data exploration through filtering. Text filtering options may include project, name, language, and geographic region, while spatial filtering could be done through a highlight box. Users will be able to choose one of the commit metadata dimensions described above to aggregate. For instance, a user can choose to aggregate commits by local time of day, while filtering on the time boundaries specified with the time slider. This aggregated data can be displayed in bar-chart format underneath the main map view.
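The aggregation example above (commits by local time of day, restricted to the slider's time window) might look roughly like the following. This is a sketch under stated assumptions: each commit is represented as a hypothetical `(utc_timestamp, utc_offset_minutes)` pair, and the window is a half-open interval of Unix timestamps taken from the time slider.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def commits_by_local_hour(commits, window_start, window_end):
    """Return 24 bar heights: commits per local hour within the window.

    commits: iterable of (utc_timestamp, utc_offset_minutes) pairs
    window_start, window_end: Unix timestamps; the start is inclusive
    and the end is exclusive.
    """
    counts = Counter()
    for utc_timestamp, offset_minutes in commits:
        if window_start <= utc_timestamp < window_end:
            tz = timezone(timedelta(minutes=offset_minutes))
            local = datetime.fromtimestamp(utc_timestamp, tz=tz)
            counts[local.hour] += 1
    # One entry per hour so the bar chart always has 24 bars.
    return [counts.get(hour, 0) for hour in range(24)]
```

Returning a fixed-length list keeps empty hours visible in the bar chart rather than silently dropping them.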