A preliminary analysis of 200 stackoverflow users with the highest reputation

The goal is to find out if there’s any correlation between the votes, views and the time on platform and the reputations they have.

Preparation

Examining data from bigquery-public-data

A dataset from bigquery-public-data.stackoverflow.users containing such data looks clean and organized; however, in the ‘location’ column, city and country were squeezed into one place. I decided to separate the column based on the delimit ‘,’ and then select only ‘country’ for analysis.

I entered the below query on Bigquery to pull the relevant data:

WITH stackoverflow_user AS 
(
 SELECT 
 *
 FROM `bigquery-public-data.stackoverflow.users` 
 WHERE location is not null AND id > 0
 ORDER BY reputation DESC
 LIMIT 200
)

SELECT
 DISTINCT id,
 DATE(creation_date) AS cre_date,
 DATE(last_access_date) AS last_ace_date,
 country,
 reputation,
 up_votes,
 down_votes,
 views
FROM stackoverflow_user

Then I exported the CSV to Rstudio and then split the data there.

x <- separate(data = stackoverflow_user, col = location, into = c("city", "country"), sep = ",")

Now, the data is ready for analysis.

Analysis

0.1 A quick look at where the users are located based on the location info provided:

Top 3 countries:

  • USA
  • UK
  • Germany

plot here

0.2 Years to reach top 200 level of reputation

To answer this question, first, I calculated the time differencet between the id creation date and last access date. Then, I converted the days value to years.

#to get a result as days as difference.
timediff_days_x <- x %>% 
  mutate(timediff = difftime(last_ace_date, cre_date, 'days'))

#convert the days value to numeric and calculate and round to the nearest hundredth for years value.
timediff_years_x <- timediff_days_x %>% 
  mutate(timediff_year = round(as.numeric(timediff) / 365, digits = 1))

#a quick summary can tell us that the average years spent is 12.43.
round(mean(timediff_years_x$timediff_year), digits = 2)

Now, a chart here will tell me how many years users have spent on the platfom to reach it

> max_year_rep <- max(timelength_x$timediff_year)
> min_year_rep <- min(timelength_x$timediff_year)

ggplot(timelength_x)+
  geom_jitter(aes(y = reputation, x = timediff_year))+
  labs(caption=paste0(min_year_rep, ' to ', max_year_rep, ' years to achive top 200 level of reputation'),
       y = 'repuation from 126K to 1.3M')

Click to view chart

Finding #1 Views x Reputation

The graph shows that hose among the highest views tend to have higher reputation.

read.csv('stackoverflow_user_2023.csv')

x <- stackoverflow_user_2023 
ggplot(x) +
  geom_point(aes(y = reputation, x = views))
  

View the chart here

Finding #2 Upvotes and downvotes x Reputation

ggplot(x) +
  geom_point(aes(y = reputation, x = down_votes))

ggplot(x) +
  geom_point(aes(y = reputation, x = up_votes))
  
  • Contrary to common expectation, users with high volumes of downvotes are still able to amass high reputation.
  • It’s evident that higher amount of upvotes does not contribute to reputation as well. The majority of users that have surpassed over 400k reputation have fewer than 20,000 upvotes.

Click to view plot

Wrap up and follow up

Key takeaways:

User base

impact: dependent Notes: About 35% of the top users are from the US, 15% from UK, 10% from Germany.

User time

impact: dependent Notes: * Users need to spend at least four years on the platform to amass high enough reputation to be top 200, though the majority shows that longer term does not always result in higher reputation. * Those who have accumulated over 400k reputation have stayed on the platform for more than 8 years. * Users amassing 1 million–or close to 1 million–have used the platform for more than 10 years.

#1 Views

impact: strong Notes: Users with the highest level of reputation have the highest view among others, indicating it a supporting contributing factor.

#2 Upvotes and downvotes

impact: weak Notes: The amount of upvotes and downvotes do not seem to affect user reputation much.

Follow-up: More variables are needed for further in-depth analysis.

Furthermore, I’ve created a dashboard on Tableau to demonstrate the relationship between the variables presented above. Feel free to check it out here.