简评:让我们来看一下在 2017 年 Github 上哪些项目最热门,哪一家公司贡献得最多?
这里作者研究了 2017 年截止今天 Github 上的所有 PushEvents,对于每个用户都尽量分辨了其属于哪个组织,并且以下列出的仓库都是在 2017 年增长超过 20 个 star 的(比如,Apache 目前在 Github 有超过 1500 个仓库,但只有 205 个在今年获得了 20 个以上的 star。因此,这里也就只分析了这 205 个仓库的提交情况)。
更详细的内容,大家可以直接看这个可交互的版本:The top contributors to GitHub (2017)。
怎么做到的?
1. 数据来源:GitHub Archive。
2. 怎么判断 Github 用户属于哪个公司?
这里主要通过用户提交记录中的邮件地址来进行判断。当然不是每个人都会用自己组织的邮箱地址,有很多人在 Github 上都选择如 gmail.com, users.noreply.github.com, 或其他的邮箱,这种情况就没有办法了。
3. 所用工具:Google Big Query。
4. 具体的 SQL 代码:
#standardSQL
WITH
period AS (
SELECT *
FROM `githubarchive.month.2017*` a
),
repo_stars AS (
SELECT repo.id, COUNT(DISTINCT actor.login) stars, APPROX_TOP_COUNT(repo.name, 1)[OFFSET(0)].value repo_name
FROM period
WHERE type='WatchEvent'
GROUP BY 1
HAVING stars>20
),
pushers_guess_emails_and_top_projects AS (
SELECT *, REGEXP_EXTRACT(email, r'@(.*)') domain
FROM (
SELECT actor.id
, APPROX_TOP_COUNT(actor.login,1)[OFFSET(0)].value login
, APPROX_TOP_COUNT(JSON_EXTRACT_SCALAR(payload, '$.commits[0].author.email'),1)[OFFSET(0)].value email
, COUNT(*) c
, ARRAY_AGG(DISTINCT TO_JSON_STRING(STRUCT(b.repo_name,stars))) repos
FROM period a
JOIN repo_stars b
ON a.repo.id=b.id
WHERE type='PushEvent'
GROUP BY 1
HAVING c>3
)
)
SELECT * FROM (
SELECT domain
, githubers
, (SELECT COUNT(DISTINCT repo) FROM UNNEST(repos) repo) repos_contributed_to
, ARRAY(
SELECT AS STRUCT JSON_EXTRACT_SCALAR(repo, '$.repo_name') repo_name
, CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64) stars
, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo
GROUP BY 1, 2
HAVING githubers_from_domain>1
ORDER BY stars DESC LIMIT 3
) top
, (SELECT SUM(CAST(JSON_EXTRACT_SCALAR(repo, '$.stars') AS INT64)) FROM (SELECT DISTINCT repo FROM UNNEST(repos) repo)) sum_stars_projects_contributed_to
FROM (
SELECT domain, COUNT(*) githubers, ARRAY_CONCAT_AGG(ARRAY(SELECT * FROM UNNEST(repos) repo)) repos
FROM pushers_guess_emails_and_top_projects
#WHERE domain IN UNNEST(SPLIT('google.com|microsoft.com|amazon.com', '|'))
WHERE domain NOT IN UNNEST(SPLIT('gmail.com|users.noreply.github.com|qq.com|hotmail.com|163.com|me.com|googlemail.com|outlook.com|yahoo.com|web.de|iki.fi|foxmail.com|yandex.ru', '|')) # email hosters
GROUP BY 1
HAVING githubers > 30
)
WHERE (SELECT MAX(githubers_from_domain) FROM (SELECT repo, COUNT(*) githubers_from_domain FROM UNNEST(repos) repo GROUP BY repo))>4 # second filter email hosters
)
ORDER BY githubers DESC
原文:The top contributors to GitHub — 2017
日报扩展阅读: