Looking back at 9 years of Hacker News

Original link: debarghyadas.com

Hacker News started as a pet project for the venture capital firm named after a concept in lambda calculus, Y Combinator. Since then, it has grown to become the go-to source of all technology news amongst technology people [citation needed]. Besides serving as the holy grail of daily updates of what's going on in the tech world, it has, over time, managed to accumulate a history of what tech talks about, what tech cares about, and the progress tech has made in the recent past. In this post, I look at interesting things the data from HN can tell us. In another post, The Top 100 Hacker News Posts of All Time, I go over HN's top highlights.

As of 13th October, 2015, out of nearly 2 million (1,959,809) Hacker News posts, merely 217 have managed to rack up over 1000 upvotes. That's about one out of every 9,000 posts. Recently, I stumbled upon one of Google engineer Felipe Hoffa's many awesome curated datasets on Google's BigQuery containing all the data on all HN posts (I encourage you to look at some of his other datasets, which include all of Reddit, Wikipedia, Freebase, NYC Taxis, Uber, and more).

Post Volume

The growth of Hacker News post volume over time, and the subsequent stabilization starting late 2011.

Hacker News had its humble beginnings on October 9, 2006, although daily traction was only logged from Feb 19, 2007. Since then, the daily volume of content rose steadily, peaking on November 29, 2011 with 1474 posts. After that, the average daily volume has remained steady at around 900 posts a day. Interestingly, probably due to a long outage or a bug, Jan 5, 2014 has much less content than expected, and the next day has none. HN volume is much lower on Saturdays and Sundays, about half that of the weekdays, which all share similar volume.

Over the course of a week we see clear daily post rhythms on weekdays and a much lower post volume on weekends.
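The weekday versus weekend comparison can be reproduced offline once the post dates are exported. The following is a minimal Python sketch on hypothetical toy data (the dates and counts are made up for illustration, not taken from the dataset); it averages the number of posts per calendar day for each day of the week.

```python
from collections import Counter
from datetime import date, timedelta


def average_posts_by_weekday(post_dates):
    """Return {weekday name: average posts per calendar day} for a list of dates.

    Days with no posts at all never appear in `post_dates`, so they are not
    counted in the average (an outage day would be skipped, not counted as 0).
    """
    posts_per_day = Counter(post_dates)  # date -> number of posts that day
    totals = Counter()                   # weekday index -> total posts
    days_seen = Counter()                # weekday index -> distinct days seen
    for day, n_posts in posts_per_day.items():
        totals[day.weekday()] += n_posts
        days_seen[day.weekday()] += 1
    names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
    return {names[i]: totals[i] / days_seen[i] for i in sorted(totals)}


# Hypothetical toy data: one week with 4 posts on each weekday
# and 2 posts on each weekend day.
week_start = date(2015, 10, 5)
toy_dates = []
for offset in range(7):
    day = week_start + timedelta(days=offset)
    toy_dates += [day] * (4 if day.weekday() < 5 else 2)

averages = average_posts_by_weekday(toy_dates)
print(averages)
```

On real data the input would be one date per story, for example derived from the time_ts column used in the queries in this post.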

Average Upvote Volume

The slow steady growth of average daily upvotes on Hacker News over time.

Interestingly, the average number of upvotes per post has also grown since 2007 and is still climbing steadily today, at about 10 upvotes per post. The distribution of upvotes is, unsurprisingly, skewed. After just one other vote on a post (posters upvote their own posts by default), your post is at the 50th percentile of all HN. That is, only about half of all HN posts have been upvoted by anyone other than their poster. At 14 votes, you hit the 90th percentile; 43 votes gets you to the 95th, and you need 139 for the 99th. Although we previously saw that weekends get about half the post volume, the average upvote count per post is approximately 16% higher on weekends.
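To make the percentile figures concrete, here is a small Python sketch of the two calculations involved: the percentage of posts that score below a given value, and the score sitting at a given percentile (using the nearest-rank method, which is an assumption; the post does not say which definition was used). The toy scores are hypothetical and merely shaped like the skewed distribution described above.

```python
def percentile_rank(scores, threshold):
    """Percentage of posts whose score is strictly below `threshold`."""
    below = sum(1 for s in scores if s < threshold)
    return 100.0 * below / len(scores)


def score_at_percentile(scores, pct):
    """Score at integer percentile `pct`, using the nearest-rank method."""
    ordered = sorted(scores)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical toy scores: half the posts only have the poster's own upvote.
toy_scores = [1, 1, 1, 1, 1, 2, 3, 5, 14, 139]

print(percentile_rank(toy_scores, 2))       # share of posts with a single vote
print(score_at_percentile(toy_scores, 90))  # score needed to reach the 90th percentile
```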

Average upvotes, too, show weekly variations. They bounce from 8 to 10 during the work week, and between 10 and 13 on the weekends.

Average upvote volume is particularly interesting on two days: Jan 6, 2014, which had 0 votes (again, most likely because of an outage), and Jan 12, 2013. The latter has an average of 32 upvotes per post, almost twice as much as any other day in HN history. I wonder how many of you remember what happened on that day?

It was one of the most tragic days in HN history - the suicide of Aaron Swartz. Out of the 39 posts that had over 100 upvotes, 36 of them were related to Aaron Swartz. You can relive that day by querying the following in BigQuery, or find the results here.

SELECT
  *
FROM
  [fh-bigquery:hackernews.stories]
WHERE
  DATE(time_ts)='2013-01-12'
ORDER BY
  score DESC
LIMIT
  100

Best Sources

What are the best domains shared on Hacker News? Using several interpretations of what best means, we figured out what they were.

Most Commonly Shared Domains

If by "best", we mean most commonly shared domains, the following are the top 20. Most of these are the usual suspects - large aggregate sites and publications. The only site I hadn't heard of in the top 20 was ReadWrite Web.

The SQL query to get these results is:

SELECT
  a.domain,
  COUNT(1) AS c
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.domain
ORDER BY
  c DESC
LIMIT
  100
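
To see what the domain extraction does, here is a minimal Python sketch of the same idea on a few hypothetical URLs. It uses a trimmed version of the pattern (the trailing part after the capture group does not affect what is captured) and Python's re module rather than BigQuery's regex engine. Note the escaped dot in `www\.`, which keeps the optional prefix from matching an arbitrary character after "www".

```python
import re
from collections import Counter

# Capture everything between the scheme (and an optional "www.") and the first "/".
DOMAIN_RE = re.compile(r'^https?://(?:www\.)?([^/]*)')


def extract_domain(url):
    """Return the domain of an http(s) URL, or None if the URL does not match."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


# Hypothetical sample URLs, for illustration only.
toy_urls = [
    "http://www.nytimes.com/2015/10/13/technology/story.html",
    "https://nytimes.com/",
    "https://github.com/torvalds/linux",
    "http://techcrunch.com",
    "ftp://example.org/file",  # not http(s), so no domain is extracted
]

# Equivalent of GROUP BY domain with COUNT(1), skipping non-matching URLs.
domain_counts = Counter(d for d in map(extract_domain, toy_urls) if d)
print(domain_counts.most_common())
```

One difference to keep in mind: this sketch drops URLs that do not match, whereas in the SQL query they would be grouped together under a NULL domain.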

Most Upvoted Domains

If by "best", we mean most upvoted domains, the list largely remains unchanged from before. Notable (and surprising) additions include EFF and Google+.

The SQL query is:

SELECT
  a.domain,
  SUM(score) AS c
FROM (
  SELECT
    REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.domain
ORDER BY
  c DESC
LIMIT
  100

Domains with the Highest Average Upvotes

What drastically alters the list is ranking domains by their average upvotes per post, keeping only domains with a minimum number of posts. I arbitrarily chose 100 as my filter bar to preserve only the most popular content. This reveals some extremely interesting content. We see some companies - Tesla, SpaceX, Mozilla, Stripe, and unsurprisingly, Y Combinator. 10 of the 20 are personal blogs of well-known developers and influential people in technology (who are not very well known outside it). Wikileaks also made it. To me, the most surprising entry was Kalzumeus, which I had never heard of.

The SQL query for the highest average upvotes per domain with a minimum of 100 shares is:

SELECT
  b.domain,
  s,
  c,
  s/c AS quality
FROM (
  SELECT
    a.domain,
    SUM(score) AS s,
    COUNT(1) AS c
  FROM (
    SELECT
      REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain,
      score
    FROM
      [fh-bigquery:hackernews.stories]) a
  GROUP BY
    a.domain) b
WHERE
  c >= 100
ORDER BY
  quality DESC
LIMIT
  100
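
The nested query above boils down to three steps: sum and count scores per domain, drop domains below the minimum post count, and sort by the ratio. A rough Python sketch of that logic follows; the domains and scores are hypothetical, and the threshold is lowered to 2 so the toy data shows the filter at work.

```python
from collections import defaultdict


def rank_by_average_score(rows, min_posts):
    """rows: iterable of (domain, score) pairs.

    Returns a list of (domain, total_score, post_count, average_score),
    sorted by average score, highest first, for domains with at least
    `min_posts` posts.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for domain, score in rows:
        totals[domain] += score
        counts[domain] += 1
    ranked = [
        (d, totals[d], counts[d], totals[d] / counts[d])
        for d in totals
        if counts[d] >= min_posts
    ]
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked


# Hypothetical toy data.
toy_rows = [
    ("blog.example", 300), ("blog.example", 100),
    ("news.example", 10), ("news.example", 20), ("news.example", 30),
    ("oneoff.example", 900),  # a single viral post, removed by the filter
]

ranking = rank_by_average_score(toy_rows, min_posts=2)
print(ranking)
```

The minimum-post filter is what keeps a domain with a single viral post from topping the list.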

What People Talk About

We know the source of the content people share, but what do they actually talk about?

Most Commonly Upvoted Words

Let's take a look at the words in the titles of the posts that get the most upvotes. The SQL query for this is:

SELECT
  a.word,
  SUM(a.score) AS score
FROM (
  SELECT
    LOWER(SPLIT(title, ' ')) AS word,
    score
  FROM
    [fh-bigquery:hackernews.stories]) a
GROUP BY
  a.word
ORDER BY
  score DESC
LIMIT
  1000

The top 40 words, excluding stopwords (admittedly handpicked), are below. They cover the standard array of programming languages, companies and other tech and programming related things.

Rank  Word         Upvotes   Rank  Word      Upvotes
1     google       633322    21    twitter   122475
2     web          360208    22    iphone    121689
3     startup      277140    23    windows   121070
4     data         248914    24    design    119559
5     app          248277    25    nsa       118330
6     facebook     232569    26    language  114610
7     apple        224476    27    project   109374
8     code         214499    28    apps      109072
9     programming  201684    29    computer  108865
10    javascript   182948    30    github    108706
11    python       178466    31    [pdf]     107560
12    source       169257    32    ios       106800
13    internet     167170    33    search    106669
14    software     161382    34    system    106441
15    android      161100    35    build     105453
16    microsoft    160804    36    tech      103110
17    game         152917    37    security  102922
18    linux        141067    38    bitcoin   102473
19    hacker       129485    39    os        96854
20    amazon       124177    40    startups  96643
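
The word ranking works by crediting a story's full score to every word in its title. A minimal Python sketch of that aggregation, with the stopword filtering folded in, might look as follows. The titles, scores and stopword list are all hypothetical (the post does not list the stopwords it removed).

```python
from collections import Counter

# Hypothetical, handpicked stopword list.
STOPWORDS = {"the", "a", "to", "of", "in", "for", "and", "is", "on"}


def word_scores(stories):
    """stories: iterable of (title, score). Returns Counter of word -> summed score."""
    totals = Counter()
    for title, score in stories:
        # Lowercase and split on single spaces, like LOWER(SPLIT(title, ' ')).
        for word in title.lower().split(" "):
            if word and word not in STOPWORDS:
                totals[word] += score
    return totals


# Hypothetical toy stories.
toy_stories = [
    ("Google open sources the web", 100),
    ("Show HN: a Python web app", 50),
    ("Google buys startup", 30),
]

scores = word_scores(toy_stories)
print(scores.most_common(2))
```

Because the split is on spaces only, punctuation stays attached to words, which is why a token such as [pdf] can show up in the table above.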

Word Trends

Another way to track what people on Hacker News talk about is by tracing the rise and fall of specific words. To test this, I used the words "bitcoin", which gained traction relatively recently, and "php", which I hypothesized was popular in the past and has waned in recent times. It turns out that this is indeed the case.
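
The post does not say which measure was plotted for these trends. One plausible choice, sketched below in Python, is the share of each year's titles that contain the word; that choice, and the toy titles and dates, are assumptions for illustration only.

```python
from collections import Counter
from datetime import datetime


def yearly_share(stories, word):
    """stories: iterable of (posted_at, title).

    Returns {year: fraction of that year's titles containing `word`},
    matching whole lowercase words only.
    """
    posts = Counter()
    hits = Counter()
    for posted_at, title in stories:
        posts[posted_at.year] += 1
        if word in title.lower().split():
            hits[posted_at.year] += 1
    return {year: hits[year] / posts[year] for year in sorted(posts)}


# Hypothetical toy stories.
toy_stories = [
    (datetime(2009, 1, 1), "PHP 5.3 released"),
    (datetime(2009, 6, 1), "Why PHP works"),
    (datetime(2013, 4, 1), "Bitcoin hits $100"),
    (datetime(2013, 5, 1), "PHP is dead"),
    (datetime(2013, 6, 1), "Bitcoin exchange hacked"),
    (datetime(2013, 7, 1), "Show HN: my weekend project"),
]

print(yearly_share(toy_stories, "php"))
print(yearly_share(toy_stories, "bitcoin"))
```

Using a share rather than a raw count keeps the overall growth in post volume from being mistaken for growing interest in a word.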

Who Posts The Best Content

Similar to domains, the three rankings of users on HN we look at are - most prolific posters, most upvoted posters, and most upvotes/posts for users with at least 100 posts. Thankfully, due to the HN karma system, there are no prolific posters who get by with substandard post quality.
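
As a quick illustration of how the three rankings differ, the Python sketch below sorts a handful of (user, posts, upvotes) rows, with the numbers taken from the tables in this post, in each of the three ways. It only covers five users, so it shows the method rather than reproducing the full lists.

```python
# (user, posts, upvotes), copied from the tables in this post.
users = [
    ("cwan", 7077, 52833),
    ("shawndumas", 6602, 64308),
    ("ColinWright", 3766, 77799),
    ("pg", 708, 46333),
    ("whoishiring", 126, 23156),
]

MIN_POSTS = 100  # same minimum as used for the quality ranking

most_prolific = sorted(users, key=lambda u: u[1], reverse=True)
most_upvoted = sorted(users, key=lambda u: u[2], reverse=True)
highest_quality = sorted(
    (u for u in users if u[1] >= MIN_POSTS),
    key=lambda u: u[2] / u[1],  # average upvotes per post
    reverse=True,
)

print([u[0] for u in most_prolific])
print([u[0] for u in most_upvoted])
print([u[0] for u in highest_quality])
```

The same five users come out in a different order each time, which is why the three lists in the following sections look so different.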

Most Prolific Contributors

With a runaway total of over 7000 posts on Hacker News, Clement Wan averages 2.24 posts a day since Hacker News took off (it has been 3,158 days since Feb 19, 2007). Two very mysterious users appear on this list: iProject, who has no user description and posts a lot of content from popular publications, and nickb. nickb is a great conspiracy theory story if there ever was one. There is a thread which claims that it is in fact a pseudonym of Paul Graham, the YC founder. The theory came about when nickb replied, seemingly without hesitation, to a comment on one of Paul Graham's (pg) comments as if he were pg.

Rank  User         Posts  Upvotes  Description
1     cwan         7077   52833    Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China
2     shawndumas   6602   64308    Shawn Dumas, front-end engineer at Nest (Google)
3     evo_9        5659   41765    Rick Giampietro, founder of web dev company DotGlow
4     nickb        4322   29611    Subject of a conspiracy theory; claimed in an HN thread to be another account of Paul Graham
5     iProject     4266   26436    No clues given
6     bootload     4212   28759    Peter Renshaw, British creative learning consultant and researcher
7     edw519       3844   30073    Ed Weissman, professional programmer for 32 years
8     ColinWright  3766   77799    Colin Wright, PhD in Math and founder of Solipsys
9     nreece       3724   29841    Ashutosh Nilkanth, entrepreneur and programmer from Melbourne
10    tokenadult   3659   36769    Karl Bunday, founding director of the Edina Center for Academic Excellence
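
The posts-per-day figure for cwan can be checked with a few lines of Python. The start date and post count come from this post; the end date is assumed to be the 13th October, 2015 snapshot date mentioned at the top.

```python
from datetime import date

first_logged_day = date(2007, 2, 19)  # first day with logged daily traction
snapshot_day = date(2015, 10, 13)     # assumed date of the data snapshot

days_elapsed = (snapshot_day - first_logged_day).days
cwan_posts = 7077                     # from the table above
posts_per_day = cwan_posts / days_elapsed

print(days_elapsed)             # 3158
print(round(posts_per_day, 2))  # 2.24
```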

Most Upvoted Contributors

When it comes to most upvoted contributors, Colin Wright leads the list. The list contains 4 usual suspects from the most prolific contributors and notably, nickb's "real half", pg.

Rank  User         Posts  Upvotes  Description
1     ColinWright  3766   77799    Colin Wright, PhD in Math and founder of Solipsys
2     shawndumas   6602   64308    Shawn Dumas, front-end engineer at Nest (Google)
3     llambda      2601   60432    Max Countryman, engineer and open-source contributor
4     fogus        2420   57038    Michael Fogus, Clojure and ClojureScript contributor
5     danso        2625   53587    Dan Nguyen, Stanford lecturer in Computational Journalism
6     cwan         7077   52833    Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China
7     luu          2266   51838    Dan Luu, ex-Google engineer, currently at Microsoft
8     ssclafani    1326   49155    Stephen Sclafani, security researcher and founder of Play To Win
9     pg           708    46333    Paul Graham, YC founder
10    evo_9        5659   41765    Rick Giampietro, founder of web dev company DotGlow

Highest Quality Contributors

Note, again, that these are users with at least 100 posts, ranked by average upvotes per post. Funnily, the highest quality poster is whoishiring, a bot which posts "Who is Hiring?" posts at 11AM EST on the first weekday of every month. Here are the top 10 and their descriptions:

Rank  User           Upvotes  Posts  Quality      Description
1     whoishiring    23156    126    183.7777778  A bot that posts "Who is Hiring?" posts every month.
2     jaf12duke      8947     123    72.7398374   Jason Freedman, two time YC alum, runs 42Floors
3     cperciva       10541    145    72.69655172  Colin Percival, founder of Tarsnap, FreeBSD security officer, runs his blog Daemonology.
4     pg             46333    708    65.4420904   Paul Graham, YC founder
5     jsnell         7761     124    62.58870968  Juho Snellman, systems programmer from Zurich
6     jordanmessina  7452     121    61.58677686  Jordan Messina, YC alum and founder of density.io
7     paul           6458     107    60.35514019  Paul Buchheit, lead dev on Gmail
8     tptacek        17969    310    57.96451613  Thomas Ptacek, founder of Matasano Security
9     wlll           5869     103    56.98058252  Unsure - probably Jason Fried, founder of Basecamp
10    dko            7461     144    51.8125      Derrick Ko, PM at Lyft

I've only scratched the surface of what Hacker News data can tell us, and I'm sure there's plenty more. Let me know in the comments if you think there would be other cool things worth exploring, or if there are any other cool analyses of HN data out there. And do leave any feedback! If you enjoyed this, make sure you check out The Top 100 Hacker News Posts of All Time.