Looking back at 9 years of Hacker News
Hacker News started as a pet project of Y Combinator, the venture capital firm named after a concept in lambda calculus. Since then, it has grown into the go-to source of technology news among technology people. Besides serving as the holy grail of daily updates on what's going on in the tech world, it has, over time, accumulated a history of what tech talks about, what tech cares about, and the progress tech has made in the recent past. In this post, I look at interesting things the data from HN can tell us. In another post, The Top 100 Hacker News Posts of All Time, I go over HN's top highlights.
As of 13th October, 2015, out of nearly 2 million Hacker News posts (1,959,809), merely 217 have managed to rack up over 1000 upvotes. That's about one out of every 9,000 posts. Recently, I stumbled upon one of Google engineer Felipe Hoffa's many awesome curated datasets on Google's BigQuery, containing all the data on all HN posts (I encourage you to look at some of his other datasets, which include all of Reddit, Wikipedia, Freebase, NYC Taxis, Uber, and more).
The growth of Hacker News post volume over time, and the subsequent stabilization starting late 2011.
Hacker News had its humble beginnings on October 9, 2006, although logged daily traction only began on Feb 19, 2007. Since then, the daily volume of content rose steadily, peaking on November 29, 2011 with 1474 posts. After that, the average daily volume has remained steady at around 900 a day. Interestingly, probably due to a long outage or a bug, Jan 5, 2014 has much less content than expected, and the next day has none. HN volume is much lower on Saturdays and Sundays, about half as much as on weekdays, which all share similar volume.
Over the course of a week we see clear daily post rhythms on weekdays and a much lower post volume on weekends.
Average Upvote Volume
The slow steady growth of average daily upvotes on Hacker News over time.
Interestingly, the average upvotes per article has also grown since 2007, and has a steady growth trajectory even today, at about 10 upvotes per post. The distribution of upvotes on content is, unsurprisingly, skewed. After just one other vote on a post (original posters upvote their own posts by default), your post is at the 50th percentile of all HN. That is, only half of all HN posts have been upvoted by anyone other than the poster. At 14 votes, you hit the 90th percentile, 43 votes hits the 95th, and you need 139 for the 99th. Although we previously saw that weekends got about half the post volume, the average upvote volume on weekends is approximately 16% more per post.
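To make the percentile figures concrete, here is a minimal Python sketch of one common way to compute them, the nearest-rank method. This is an illustration only: the post does not say which method produced the numbers above, and the scores in `toy_scores` are made up, not taken from the dataset.

```python
def nearest_rank_percentile(scores, pct):
    """Smallest score such that at least pct percent of posts score that or less.

    `pct` is an integer from 1 to 100; `scores` must be non-empty.
    """
    ordered = sorted(scores)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical scores for ten posts, NOT taken from the HN dataset.
toy_scores = [1, 1, 1, 1, 1, 2, 3, 5, 14, 139]

for pct in (50, 90, 99):
    print(pct, nearest_rank_percentile(toy_scores, pct))
```

With a skewed list like this one, the median sits at the default self-upvote while the top percentiles are far above it, which is the shape described above.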
Average upvotes, too, show weekly variations. They bounce from 8 to 10 during the work week, and between 10 and 13 on the weekends.
Average upvote volume is particularly interesting on two days: Jan 6, 2014, which had 0 votes, again most likely because of an outage, and Jan 12, 2013. The latter has an average of 32 upvotes per post, almost twice as much as any other day in HN history. I wonder how many of you remember what happened on that day?
It was one of the most tragic days in HN history - the suicide of Aaron Swartz. Out of the 39 posts that had over 100 upvotes, 36 of them were related to Aaron Swartz. You can relive that day by querying the following in BigQuery, or find the results here.
SELECT * FROM [fh-bigquery:hackernews.stories] WHERE DATE(time_ts)='2013-01-12' ORDER BY score DESC LIMIT 100
What are the best domains shared on Hacker News? Using several interpretations of what best means, we figured out what they were.
Most Commonly Shared Domains
If by "best", we mean most commonly shared domains, the following are the top 20. Most of these are the usual suspects - large aggregate sites and publications. The only site I hadn't heard of in the top 20 was ReadWrite Web.
The SQL query to get these results is:
SELECT a.domain, COUNT(1) AS c FROM ( SELECT REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain, score FROM [fh-bigquery:hackernews.stories]) a GROUP BY a.domain ORDER BY c DESC LIMIT 100
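To see what the domain-extraction pattern in this query captures, here is a small Python sketch that applies it to a few URLs. Treat it as an approximation: Python's `re` module is not the same regex engine as BigQuery's, the sample URLs are made up, and `extract_domain` is a helper name introduced here for illustration. The pattern below escapes the dot after `www` so that only a literal `www.` prefix is stripped.

```python
import re

# Domain pattern from the query, with the dot after "www" escaped so that
# it matches only a literal "www." prefix.
DOMAIN_RE = re.compile(r'^https?://(?:www\.)?([^/]*)/?(?:.*)')


def extract_domain(url):
    """Return the host part of a URL without a leading 'www.', or None."""
    match = DOMAIN_RE.match(url)
    return match.group(1) if match else None


# Hypothetical sample URLs, not taken from the dataset.
samples = [
    "https://www.example.com/some/article",
    "http://blog.example.org/",
    "not a url",
]
for url in samples:
    print(url, "->", extract_domain(url))
```

Note that subdomains other than `www` are kept, so `blog.example.org` and `example.org` would be counted as separate domains.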
Most Upvoted Domains
If by "best", we mean most upvoted domains, the list largely remains unchanged from before. Notable (and surprising) additions include EFF and Google+.
The SQL query is:
SELECT a.domain, SUM(score) AS c FROM ( SELECT REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain, score FROM [fh-bigquery:hackernews.stories]) a GROUP BY a.domain ORDER BY c DESC LIMIT 100
Domains with the Highest Average Upvotes
What drastically alters the list is ranking domains by their average upvotes per post, after filtering for a minimum number of posts. I arbitrarily chose 100 posts as my bar, to preserve only the most popular content. This reveals some extremely interesting content. We see some companies: Tesla, SpaceX, Mozilla, Stripe, and unsurprisingly, Y Combinator. 10 of the 20 are personal blogs of well-known developers and influential people in technology (who are not very well known outside it). Wikileaks also made it. To me, the most surprising entry was Kalzumeus, which I'd never heard of.
The SQL query for the highest average upvotes per domain with a minimum of 100 shares is:
SELECT b.domain, s, c, s/c AS quality FROM ( SELECT a.domain, SUM(score) AS s, COUNT(1) AS c FROM ( SELECT REGEXP_EXTRACT(url,r'^https?://(?:www\.)?([^/]*)/?(?:.*)') AS domain, score FROM [fh-bigquery:hackernews.stories]) a GROUP BY a.domain) b WHERE c >= 100 ORDER BY quality DESC LIMIT 100
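The logic of this query (sum and count per domain, drop domains under the minimum, sort by the ratio) can be sketched in a few lines of Python. This is only an illustration on made-up rows; `rank_by_average` and the `*.example` domains are hypothetical names, and the threshold is lowered to 2 so the toy data produces a result.

```python
from collections import defaultdict


def rank_by_average(rows, min_posts=100):
    """Rank domains by average score, keeping those with >= min_posts posts.

    `rows` is an iterable of (domain, score) pairs. Returns a list of
    (domain, total_score, post_count, average) sorted by average, descending.
    """
    totals = defaultdict(lambda: [0, 0])  # domain -> [sum of scores, count]
    for domain, score in rows:
        totals[domain][0] += score
        totals[domain][1] += 1
    ranked = [
        (domain, s, c, s / c)
        for domain, (s, c) in totals.items()
        if c >= min_posts
    ]
    ranked.sort(key=lambda row: row[3], reverse=True)
    return ranked


# Hypothetical rows, not taken from the dataset.
toy_rows = [
    ("a.example", 10), ("a.example", 30),
    ("b.example", 100),            # only one post: filtered out below
    ("c.example", 5), ("c.example", 7),
]
print(rank_by_average(toy_rows, min_posts=2))
```

The minimum-post filter matters: without it, `b.example` would top the list on the strength of a single post.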
What People Talk About
We know the source of the content people share, but what do they actually talk about?
Most Commonly Upvoted Words
Let's take a look at the words in the titles of the posts that get the most upvotes. The SQL query for this is:
SELECT a.word, SUM(a.score) AS score FROM ( SELECT LOWER(SPLIT(title, ' ')) AS word, score FROM [fh-bigquery:hackernews.stories]) a GROUP BY a.word ORDER BY score DESC LIMIT 1000
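As a rough Python equivalent of this query, the sketch below lowercases each title, splits it on single spaces, and adds the story's score to every word it contains. The titles and scores are made up, and `word_scores` is a name introduced here for illustration.

```python
from collections import Counter


def word_scores(stories):
    """Sum story scores per lowercased, space-separated title word.

    `stories` is an iterable of (title, score) pairs.
    """
    totals = Counter()
    for title, score in stories:
        for word in title.lower().split(' '):
            totals[word] += score
    return totals


# Hypothetical stories, not taken from the dataset.
toy_stories = [
    ("Show HN: My Python project", 10),
    ("Python is great", 5),
]
print(word_scores(toy_stories).most_common(3))
```

Because the split is on spaces only, punctuation stays attached to words (`hn:` above), which is one reason the raw output needs manual cleanup.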
The top 40 words, excluding stopwords (admittedly handpicked), are below. They cover the standard array of programming languages, companies, and other tech and programming-related things.
Another way to track what people on Hacker News talk about is to trace the rise and fall of specific words. To test this, I used the words "bitcoin", which gained traction relatively recently, and "php", which I hypothesized was popular in the past and has waned since. It turns out that this is indeed the case.
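One simple way to trace a word over time is to count, per year, the titles that contain it. The Python sketch below does that on made-up data. It is an assumption about how such a trend could be computed, not a reconstruction of how the charts in this post were produced, and `yearly_mentions` is a hypothetical helper.

```python
from collections import Counter
from datetime import date


def yearly_mentions(stories, word):
    """Count titles per year that contain `word` as a space-separated token.

    `stories` is an iterable of (posted_on, title) pairs, where `posted_on`
    is a datetime.date. `word` should be lowercase.
    """
    counts = Counter()
    for posted_on, title in stories:
        if word in title.lower().split(' '):
            counts[posted_on.year] += 1
    return dict(sorted(counts.items()))


# Hypothetical stories, not taken from the dataset.
toy_stories = [
    (date(2009, 1, 1), "PHP tips"),
    (date(2009, 5, 1), "Why PHP works"),
    (date(2013, 3, 1), "Bitcoin hits new high"),
    (date(2014, 1, 1), "PHP and Bitcoin"),
]
print(yearly_mentions(toy_stories, "php"))
print(yearly_mentions(toy_stories, "bitcoin"))
```

Since overall post volume changed a lot over the years, dividing each count by that year's total number of posts would give a fairer comparison than raw counts.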
Who Posts The Best Content
Similar to domains, we look at three rankings of users on HN: most prolific posters, most upvoted posters, and highest average upvotes per post among users with at least 100 posts. Thankfully, due to the HN karma system, there are no prolific posters who get by with substandard post quality.
Most Prolific Contributors
With a runaway total of over 7000 posts on Hacker News, Clement Wan has averaged 2.24 posts a day since Hacker News took off (it's been 3,158 days since Feb 19, 2007). Two very mysterious users appear on this list: iProject, who has no user description and posts a lot of content from popular publications, and nickb. nickb is a great conspiracy theory story if there ever was one. There is a thread which points out that it is in fact a pseudonym of Paul Graham, the YC founder. It came out when nickb responded, seemingly unhesitatingly, to a comment on Paul Graham's (pg) comment as if he were pg.
| Rank | User | Posts | Total upvotes | Description |
|---|---|---|---|---|
| 1 | cwan | 7077 | 52833 | Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China |
| 2 | shawndumas | 6602 | 64308 | Shawn Dumas, front-end engineer at Nest (Google) |
| 3 | evo_9 | 5659 | 41765 | Rick Giampietro, founder of web dev company DotGlow |
| 4 | nickb | 4322 | 29611 | Quite a conspiracy theory, but revealed to be another account of Paul Graham here. |
| 5 | iProject | 4266 | 26436 | No clues given |
| 6 | bootload | 4212 | 28759 | Peter Renshaw, British creative learning consultant and researcher |
| 7 | edw519 | 3844 | 30073 | Ed Weissman, professional programmer for 32 years |
| 8 | ColinWright | 3766 | 77799 | Colin Wright, PhD in Math and founder of Solipsys |
| 9 | nreece | 3724 | 29841 | Ashutosh Nilkanth, entrepreneur and programmer from Melbourne |
| 10 | tokenadult | 3659 | 36769 | Karl Bunday, founding director of the Edina Center for Academic Excellence |
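The posts-per-day figure quoted for cwan can be reproduced with a few lines of Python. The start date and post count come from the text above; the snapshot date of October 13, 2015 is the "as of" date given at the top of this post.

```python
from datetime import date

start = date(2007, 2, 19)      # first day with logged daily traction
snapshot = date(2015, 10, 13)  # "as of" date of the dataset
cwan_posts = 7077              # cwan's post count from the table

days_elapsed = (snapshot - start).days
posts_per_day = cwan_posts / days_elapsed

print(days_elapsed)             # 3158
print(round(posts_per_day, 2))  # 2.24
```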
Most Upvoted Contributors
When it comes to most upvoted contributors, Colin Wright leads the list. The list contains 4 usual suspects from the most prolific contributors and notably, nickb's "real half", pg.
| Rank | User | Posts | Total upvotes | Description |
|---|---|---|---|---|
| 1 | ColinWright | 3766 | 77799 | Colin Wright, PhD in Math and founder of Solipsys |
| 2 | shawndumas | 6602 | 64308 | Shawn Dumas, front-end engineer at Nest (Google) |
| 3 | llambda | 2601 | 60432 | Max Countryman, engineer and open-source contributor |
| 4 | fogus | 2420 | 57038 | Michael Fogus, Clojure and ClojureScript contributor |
| 5 | danso | 2625 | 53587 | Dan Nguyen, Stanford lecturer in Computational Journalism |
| 6 | cwan | 7077 | 52833 | Clement Wan, founder of a few niche contract manufacturing services in Hong Kong/China |
| 7 | luu | 2266 | 51838 | Dan Luu, ex-Google engineer, currently at Microsoft |
| 8 | ssclafani | 1326 | 49155 | Stephen Sclafani, security researcher and founder of Play To Win |
| 9 | pg | 708 | 46333 | Paul Graham, YC founder |
| 10 | evo_9 | 5659 | 41765 | Rick Giampietro, founder of web dev company DotGlow |
Highest Quality Contributors
Note, again, that these are users with at least 100 posts, ranked by average upvotes per post. Funnily enough, the highest quality poster is whoishiring, a bot which posts "Who is Hiring?" posts at 11AM EST on the first weekday of every month. Here are the top 10 and their descriptions:
| Rank | User | Total upvotes | Posts | Average upvotes per post | Description |
|---|---|---|---|---|---|
| 1 | whoishiring | 23156 | 126 | 183.7777778 | A bot that posts "Who is Hiring?" posts every month. |
| 2 | jaf12duke | 8947 | 123 | 72.7398374 | Jason Freedman, two time YC alum, runs 42Floors |
| 3 | cperciva | 10541 | 145 | 72.69655172 | Colin Percival, founder of Tarsnap, FreeBSD security officer, runs his blog Daemonology. |
| 4 | pg | 46333 | 708 | 65.4420904 | Paul Graham, YC founder |
| 5 | jsnell | 7761 | 124 | 62.58870968 | Juho Snellman, systems programmer from Zurich |
| 6 | jordanmessina | 7452 | 121 | 61.58677686 | Jordan Messina, YC alum and founder of density.io |
| 7 | paul | 6458 | 107 | 60.35514019 | Paul Buchheit, lead dev on Gmail |
| 8 | tptacek | 17969 | 310 | 57.96451613 | Thomas Ptacek, founder of Matasano Security |
| 9 | wlll | 5869 | 103 | 56.98058252 | Unsure - probably Jason Fried, founder of Basecamp |
| 10 | dko | 7461 | 144 | 51.8125 | Derrick Ko, PM at Lyft |
I've only scratched the surface of what Hacker News data can tell us, and I'm sure there's plenty more. Let me know in the comments if you think there would be other cool things worth exploring, or if there are any other cool analyses of HN data out there. And do leave any feedback! If you enjoyed this, make sure you check out The Top 100 Hacker News Posts of All Time.