An Analysis of the Top 1000 Go Repositories

This analysis is based on copies cloned early in the morning (Pacific Time) on January 2, 2016.

Code organization

Most code is a library, so it is organized either as .go files in the repository root or as .go files in sub-directories. Many people also put their code under a dedicated sub-directory such as /src/, /lib/, /go/, or /pkg/. I can’t manually inspect all of the repositories, but the ones I did check with that layout are apps written in Go rather than libraries. Using go get would fail on them because of the directory structure.

At least one of the repositories seems to use the developer’s $GOPATH as the repository root, which would certainly make developing it alongside anything else impractical.

oniony/TMSU:src/github.com/oniony/TMSU/cli/mount.go

Tumblr seems to use the same model, but with its own mini-monolithic repository that holds a ton of packages under the /src directory:

tumblr/gocircuit:src/tumblr/redis/redis.go
tumblr/gocircuit:src/circuit/kit/debug/ctrlc/init.go

Vendoring

Vendoring is done in nearly as many ways as there are repositories. To pick out the source files for each project, I scrolled through all 50-thousand-odd files in an intermediate list to see if there were any vendored packages. I wanted to exclude them from the analysis in general, but that is quite difficult because of the variety of ways vendoring is done.
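As a rough illustration of that exclusion step, here is a minimal Go sketch that sorts file paths into buckets. The patterns closely follow the egrep filters in the analysis script later in this article (with the dot in `_test.go` escaped); the `classify` function and the sample paths are my own assumptions for illustration, not part of the original analysis.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns modeled on the egrep filters used by the analysis script.
var (
	testRe    = regexp.MustCompile(`(_test\.go$|/test/|/testdata/)`)
	vendorRe  = regexp.MustCompile(`/(Godeps|_?vendor|_?third_party|3rdparty|external)/`)
	exampleRe = regexp.MustCompile(`/_?examples?/`)
)

// classify reports which bucket a path falls into.
// The filters are checked in the same order as in the script,
// so the first match wins.
func classify(path string) string {
	switch {
	case testRe.MatchString(path):
		return "test"
	case vendorRe.MatchString(path):
		return "vendored"
	case exampleRe.MatchString(path):
		return "example"
	default:
		return "source"
	}
}

func main() {
	// Hypothetical paths, made up for illustration.
	paths := []string{
		"repos/acme/tool/main.go",
		"repos/acme/tool/main_test.go",
		"repos/acme/tool/Godeps/_workspace/src/x/y.go",
		"repos/acme/tool/examples/demo.go",
	}
	for _, p := range paths {
		fmt.Printf("%-10s %s\n", classify(p), p)
	}
}
```

A filter like this only catches dependencies that live in conventionally named folders; packages copied in alongside a project's own code would still slip through.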

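Some people store external dependencies in special folders, with names like Godeps, vendor, third_party, 3rdparty, or external (these are the folder names that the analysis script below filters on).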

Or alongside their own code (e.g.):

Open Source Books and Websites

Many books and websites exist as public repositories on GitHub:

I’m a big fan of this approach, mostly because the community can send pull requests and fix things that need a bit more clarification, or edit the work so it stays a living document of the subject matter. I also keep my blog as a public repository because I don’t think it makes sense to hide something that’s already public.

Age

The pace at which great Go projects are created has been picking up over time.

[Figure: time_created]

Of the top 1000, 815 were updated in the last 7 days (give or take a little, because I can’t download all repos at exactly the same time).

[Figure: last_updated]

Overall, the popular repositories move faster than the community at large, which could possibly be explained by the size of their developer pools. I would hazard a guess that the popular repositories have a snowball effect, where more developers join over time and the velocity increases as a result. I have the git histories, but haven’t done this analysis.

Repositories per Organization / User

I counted the number of top 1000 repositories each organization (or user) had on GitHub, with some interesting, if unsurprising, results. Many are concentrated under just a few umbrellas, with a long tail of people with a single repo in the top crowd. The pattern, interestingly enough, follows a pretty clear exponential curve.

[Figure: repos_per_org]
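The counting itself is simple. Here is a minimal Go sketch of the idea, assuming each repository is identified by an "org/repo" full name as in GitHub's search results; the `countByOrg` function and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// orgCount pairs an organization (or user) with its number of repos.
type orgCount struct {
	Org   string
	Repos int
}

// countByOrg tallies "org/repo" names per org. The result is sorted by
// count (descending) and then by name, so the output is deterministic.
// Names without a slash are ignored.
func countByOrg(fullNames []string) []orgCount {
	tally := map[string]int{}
	for _, name := range fullNames {
		i := strings.IndexByte(name, '/')
		if i < 0 {
			continue
		}
		tally[name[:i]]++
	}
	counts := make([]orgCount, 0, len(tally))
	for org, n := range tally {
		counts = append(counts, orgCount{Org: org, Repos: n})
	}
	sort.Slice(counts, func(a, b int) bool {
		if counts[a].Repos != counts[b].Repos {
			return counts[a].Repos > counts[b].Repos
		}
		return counts[a].Org < counts[b].Org
	})
	return counts
}

func main() {
	// Made-up names for illustration.
	names := []string{"alpha/one", "beta/x", "alpha/two", "gamma/y", "alpha/three"}
	for _, c := range countByOrg(names) {
		fmt.Println(c.Repos, c.Org)
	}
}
```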

Unsafe, Reflect, and CGo

In Go, the “power tools” are the unsafe and reflect packages and CGo, which provides integration with C.

From this output, you can see that out of roughly 50 thousand files, relatively few use any of the three. Each line counted below represents one organization or repository with at least one match:

scott@devbox:/tmp/ghgo$ wc -l *_by_*
    7 cgo_by_org
    7 cgo_by_repo
  323 reflect_by_org
  413 reflect_by_repo
  165 unsafe_by_org
  195 unsafe_by_repo
 1110 total

Unsafe

Unsafe is used to get around the type system in Go. It is useful for handling raw pointers when dealing with CGo, but otherwise is typically not used. Interestingly, usage of unsafe is far higher than the usage of CGo.

Top org usage of unsafe:

Top repo usage of unsafe:

Reflect

Reflect is typically used for JSON unmarshalling or sometimes for testing. I haven’t vetted the usage in each of the packages that import it, so I don’t know the purpose in each case. It is used quite heavily.

Top org usage of reflect:

Top repo usage of reflect:

CGo

CGo provides integration with C libraries. It is used in a few places, such as CockroachDB, to connect to a backing storage library written in C or C++, such as RocksDB.

Top org usage of CGo:

Top repo usage of CGo:

Source Code for Analysis

You will need jq and git installed:

You will also need about 10 gigabytes of free disk space in your /tmp/ directory.

The script takes a while to run, but produces all of the output used to write this article. It first goes to GitHub’s search API (which is publicly available, no auth required) to retrieve the top 1000 repositories with Go as the language. It then clones all of the repositories so that it can do further analysis on the source code. Some basic statistics, like the age of each repository, are extracted from the GitHub search results first. Then a more complex piece starts, where it looks for usage of the unsafe and reflect packages as well as //#cgo comments. This analysis is done against a curated list of files that excludes vendored dependencies, test files, and example code. The usage of unsafe and reflect is measured in number of import statements in .go files from the final curated list, and usage of CGo is measured in instances of //#cgo in the source code.
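The script measures usage with line-based regular expressions. An alternative that I did not use for this article is to let Go's own parser read the import section, which also catches aliased imports and treats CGo as an import of the pseudo-package "C" (a directive written as `// #cgo`, with a space, would not match the `//#cgo` pattern). Here is a minimal sketch of that idea; the `importsOf` helper and the sample source are assumptions for illustration.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// importsOf parses only the package clause and imports of a Go source
// file and returns the set of imported paths. CGo shows up as "C".
func importsOf(filename, src string) (map[string]bool, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	found := map[string]bool{}
	for _, imp := range f.Imports {
		// imp.Path.Value is the quoted string literal from the source.
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			continue
		}
		found[path] = true
	}
	return found, nil
}

func main() {
	// Made-up source file for illustration.
	src := `package demo

// #cgo LDFLAGS: -lm
// #include <math.h>
import "C"

import (
	"fmt"
	u "unsafe"
)
`
	found, err := importsOf("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, pkg := range []string{"C", "unsafe", "reflect"} {
		fmt.Printf("%-8s %v\n", pkg, found[pkg])
	}
}
```

Counting this way would give different numbers from the ones reported here, so treat it as a possible refinement rather than a description of how the figures in this article were produced.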

There are a couple of things that I couldn’t figure out exactly, like re-using find results in multiple egrep loops while respecting file names with spaces, but overall it gets the job done.

My final file lists were in the tens of thousands of files:

scott@devbox:/tmp/ghgo$ wc -l file_list* | sort -rn
  322753 total
  128657 file_list_example_vendor_test
   92131 file_list_example_vendor
   52079 file_list_example
   49886 file_list

Here is the full code in all its glory:

#! /bin/bash

# Start from a clean working directory (-f avoids an error on the first run)
rm -rf /tmp/ghgo
mkdir -p /tmp/ghgo
cd /tmp/ghgo || exit 1

# The GitHub API only provides the first 1000 results
# https://developer.github.com/v3/search/
# There's also a sleep to be a good citizen and avoid rate limits
echo; echo "Retrieving list of repositories from GitHub..."
for page in $(seq 1 10); do echo "Retrieving page $page"; curl "https://api.github.com/search/repositories?q=language:go&sort=stars&per_page=100&page=$page" 2>/dev/null > $page; sleep 10; done

# Pull all needed data out of the huge json structures sent back
echo; echo "Parsing data..."
for page in $(seq 1 10); do cat $page | jq -c '.items | .[] | {name: .full_name, stars: .stargazers_count, cloneurl: .clone_url, created: .created_at, updated: .updated_at}' >> summary; done

# Extract info in easy format to use
for page in $(seq 1 10); do cat $page | jq -r '.items | .[] | {a: .clone_url, b: .full_name} | to_entries | map(.value) | join(",")' >> clone_info; done

# Parse into clone commands
cat clone_info | awk -F, '{print "git clone " $1 " repos/" $2}' > clone_cmds

# Clone all repositories
echo; echo "Cloning all repositories..."
cat clone_cmds | bash

echo; echo "Gathering basic stats..."

# Repos per organization or person
for org in $(ls repos/); do echo $(ls repos/$org | wc -l) $org >> org_counts; done
echo; echo "Top repo count per org or person:"
cat org_counts | sort -rn | head -n 10

# Times created
for line in $(cat summary); do echo $line | jq -r '[.created,.name] | join(" ")' >> created; done
echo; echo "Earliest 10 created repos:"
sort created | head
echo; echo "Latest 10 created repos:"
sort -r created | head

# Graph created date by month
cat created | cut -d'-' -f1-2 | sort | uniq -c | sed 's/^\s*//' > created_graph_input

# Times updated
for line in $(cat summary); do echo $line | jq -r '[.updated,.name] | join(" ")' >> updated; done
echo; echo "Oldest 10 updated:"
sort updated | head
echo; echo "Latest 10 updated:"
sort -r updated | head

# Graph updated date by day
cat updated | cut -d'T' -f1 | sort | uniq -c | sed 's/^\s*//' > updated_graph_input

echo; echo "Looking for usage of unsafe, reflect, and CGo..."

# Find all non-test, non-vendored, non-example go source files
# Each tee saves an intermediate list before the next filter is applied
# The sed invocation at the end escapes spaces in file names
find repos -iname "*.go" -type f | tee file_list_example_vendor_test | egrep -v '(.*_test.go$|/test/|/testdata/)' | tee file_list_example_vendor | egrep -v '/(Godeps|_?vendor|_?third_party|3rdparty|external)/' | tee file_list_example | egrep -v '/_?examples?/' | sed 's/ /\\ /g' > file_list

# Look for import statements with "unsafe" in them
for file in $(cat file_list); do egrep -H '^(import ( _)?)?\s*"unsafe"$' "$file" >> res_unsafe; done

cat res_unsafe | cut -d'/' -f2 | sort | uniq -c | sort -rn > unsafe_by_org
cat res_unsafe | cut -d'/' -f2-3 | sort | uniq -c | sort -rn > unsafe_by_repo

echo; echo "Top org usage of unsafe:"
head unsafe_by_org
echo; echo "Top repo usage of unsafe:"
head unsafe_by_repo

# Look for import statements with "reflect" in them
for file in $(cat file_list); do egrep -H '^(import ( _)?)?\s*"reflect"$' "$file" >> res_reflect; done

cat res_reflect | cut -d'/' -f2 | sort | uniq -c | sort -rn > reflect_by_org
cat res_reflect | cut -d'/' -f2-3 | sort | uniq -c | sort -rn > reflect_by_repo

echo; echo "Top org usage of reflect:"
head reflect_by_org
echo; echo "Top repo usage of reflect:"
head reflect_by_repo

# Look for //#cgo
# Note: this exact pattern does not match directives written with a space after the slashes
for file in $(cat file_list); do egrep -H '//#cgo' "$file" >> res_cgo; done

# Files might have multiple lines that match, but we only count each file once
cat res_cgo | cut -d':' -f1 | sort | uniq | cut -d'/' -f2 | sort | uniq -c | sort -rn > cgo_by_org
cat res_cgo | cut -d':' -f1 | sort | uniq | cut -d'/' -f2-3 | sort | uniq -c | sort -rn > cgo_by_repo

echo; echo "Top org usage of CGo:"
head cgo_by_org
echo; echo "Top repo usage of CGo:"
head cgo_by_repo

I also created a LibreOffice spreadsheet file that I used to generate the graphs, which can be found here.

Conclusion and Further Research

There’s a whole lot more information that could be gleaned from the data given by GitHub and what exists in the repositories themselves. I haven’t done much with this yet, but might do another post with some further metrics. I did have some ideas:

There’s a sizable and growing community of gophers out there, and it’s getting bigger by the day. I was surprised to see the kind of diversity that exists, including major infrastructure projects like Docker, Kubernetes, and etcd, open source books, databases, internet proxies, graphics libraries, and more. I couldn’t be more excited for 2016.
