The real top Stack Overflow questions
It’s easy to find the top Stack Overflow questions of all time — but the *current* top questions gives us what’s important now. Find the how-to here, and an interactive dashboard to get to the top trends.
Update 2020 — Check my new post “Stack Overflow in 2023: Predicting with ARIMA and BigQuery”
Here’s the interactive dashboard you can use to get the answers to your specific questions. Below, find the story of how it was created and why it matters.
The top Stack Overflow questions — all time vs current
Let’s see first what questions have received the largest number of pageviews through time:
Some observations:
- The top Stack Overflow question of all time — with more than 7 million views since its creation 9 years ago — is not even a programming question: “How do I undo the most recent commits in Git”.
- From the top 10 questions, 4 are related to git, 3 to JavaScript, 1 to Java, 1 to Linux, and 1 to HTML. What’s missing?
- There are no Python questions in the top 10 list. Surprised?
So let’s take a look at the top 10 question — not from all time, but from this last quarter. This is how the Stack Overflow world looks now:
Some observations:
- The top question shows that people are still trying to figure out “How do I undo the most recent commits in Git” — with more than 400k pageviews this last quarter.
- The programming world has changed: Now 4 of the top 10 questions are related to Python.
- Java is gone — it doesn’t show up within the top 10 questions any more.
The trend is even more evident if you look at the tags for the top 30 questions:
Java and SQL are gone from the top 30. Meanwhile Python now dominates the top 10 and top 30 Stack Overflow questions — by current pageviews.
So what are the top questions for each tag
Now you can look at what the top questions for any tag are. Before we take a look at them — can you guess the top questions for JavaScript, Python, Go?
JavaScript, Python, Go: Different challenges for each
The first thing you can notice with this dashboard are the top current 10 questions for any tag, and how they’ve trended through time. Python devs keep asking about “Iterating over dictionaries using ‘for’ loops”, JavaScript devs ask “How do I remove a particular element from an array in JavaScript?”, and Go devs need to know “What is the best way to convert byte array to string?”.
TensorFlow struggles
People are really excited about machine learning, and TensorFlow is one of the top projects in the space. But most of the top questions show people struggling with issues like “TensorFlow not found using pip” or the ability to work with their current CPU/GPU combos.
A worthy goal for the TensorFlow team: With this dashboard they’ll be able to track how these questions trend down through time — if they work out a foolproof way for devs to install TensorFlow with pip.
Kotlin top 10: Now and then
With the dashboard you can jump to see the top 10 questions at any given quarter. For example, let’s compare the top 10 for Kotlin in 2018Q4 vs 2017Q2:
This looks good: Most top questions from 2017 disappeared into the void — and as a guess, let’s say that the team was great at solving common issues people had back then. And on the 2018Q4 side you can see how most current top questions are new, and not the same as many months ago.
Redis top questions — focusing on Java vs C#
This are the top 10 questions for Redis:
You can see that the current top question for Redis ask how to install it on Windows. If you are a Redis developer working to improve the Java libraries, this doesn’t tell you much — luckily our dashboard also allows us to filter for subtags.
Let’s look at the top 10 Redis+Java questions instead:
If we filter for “redis” and “Java” then we see different questions: Most of these are related to Redis and Spring, for example. Which are different to the top question for Redis and C#:
Oh yeah — the top question for Redis + Java asks for simple C# examples. Maybe developers can’t find these in the official docs?
One question, so many replies
With this dashboard you can also see how many answers each question has. For example, the top question “Convert a String to int in Rust?“ has 5 different answers. Go has 12 different replies to “What is the best way to convert byte array to string?” and 20 to “How to efficiently concatenate strings in Go?”:
Dig beyond the top 10
Why stop at the top 10? You can also go deeper into the top 30:
How-to: Queries
You can find all this data in BigQuery. Every 3 months Stack Overflow publishes a snapshot of their latest data, and we make sure to have a fresh copy ready to be queried.
Top Stack Overflow questions, current vs all time
SELECT (
SELECT tag
FROM UNNEST(tags)
ORDER BY view_count DESC LIMIT 1
) tag
, * EXCEPT(tags)
FROM (
SELECT quarter_views , view_count
, ROW_NUMBER() OVER(ORDER BY quarter_views DESC) q_ranking
, ROW_NUMBER() OVER(ORDER BY view_count DESC) ranking
, ARRAY(
SELECT AS STRUCT tag, b.view_count
FROM UNNEST(tags) tag
JOIN `fh-bigquery.stackoverflow_archive_questions.merged_aux_tags` b
ON tag=b.tag
) tags, title
FROM `fh-bigquery.stackoverflow_archive_questions.merged`
WHERE quarter='2018-12-01'
AND view_count > 50000
)
WHERE q_ranking<30 OR ranking <30
ORDER BY 1 DESC
Finding the number of pageviews for each question through time
For this I started by storing a copy of each snapshot through time — and then proceeded to calculate the difference of pageviews between the snapshots.
Luckily the query looks pretty simple:
CREATE OR REPLACE TABLE `stackoverflow_archive_questions.merged`
AS SELECT
IFNULL(
view_count -
LAG(view_count) OVER(PARTITION BY id ORDER BY view_count)
, view_count) quarter_views, *
FROM (
SELECT PARSE_DATE('%Y%m',_table_suffix) quarter
, id, view_count
, SPLIT(tags, '|') tags
, score, creation_date, answer_count
, accepted_answer_id, title
FROM `fh-bigquery.stackoverflow_archive_questions.q*`
)
Top questions per tag
#standardSQL
SELECT title, quarter_views, view_count
FROM `fh-bigquery.stackoverflow_archive_questions.merged`
WHERE 'google-cloud-dataflow' IN UNNEST(tags)
AND quarter='2018-12-01'
ORDER BY quarter_views DESC
LIMIT 10
Top current questions that haven’t been updated in more than a year
#standardSQL
WITH top_questions AS (
SELECT id, title, quarter_views, view_count
FROM `fh-bigquery.stackoverflow_archive_questions.merged`
WHERE 'google-cloud-dataflow' IN UNNEST(tags)
AND quarter='2018-12-01'
), latest_answer AS (
SELECT parent_id, DATE(MAX(COALESCE(last_edit_date, last_activity_date, creation_date))) answer_last_edit_date
FROM `bigquery-public-data.stackoverflow.posts_answers` b
GROUP BY parent_id
)SELECT SUBSTR(title, 0,80) title, quarter_views, view_count, answer_last_edit_date
FROM top_questions a
JOIN latest_answer b
ON a.id=b.parent_id
WHERE DATE_DIFF(CURRENT_DATE(), answer_last_edit_date, DAY)>360
ORDER BY quarter_views DESC
LIMIT 10
Go beyond
Now you can go back to the top of this post to play with the dashboard. Please share what you find!
Then also check out what tags are trending and which ones are going down, with The real Stack Overflow trends:
Predict how long it takes for Stack Overflow to answer a question:
Next steps
Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.