The goal is to find out if there’s any correlation between the votes, views and the time on platform and the reputations they have.
A dataset from bigquery-public-data.stackoverflow.users
containing such data looks clean and organized; however, in the
‘location’ column, city and country were squeezed into one place. I
decided to separate the column based on the delimit ‘,’ and then select
only ‘country’ for analysis.
I entered the below query on Bigquery to pull the relevant data:
WITH stackoverflow_user AS
(
SELECT
*
FROM `bigquery-public-data.stackoverflow.users`
WHERE location is not null AND id > 0
ORDER BY reputation DESC
LIMIT 200
)
SELECT
DISTINCT id,
DATE(creation_date) AS cre_date,
DATE(last_access_date) AS last_ace_date,
country,
reputation,
up_votes,
down_votes,
views
FROM stackoverflow_user
Then I exported the CSV to Rstudio and then split the data there.
x <- separate(data = stackoverflow_user, col = location, into = c("city", "country"), sep = ",")
Now, the data is ready for analysis.
Top 3 countries:
To answer this question, first, I calculated the time differencet between the id creation date and last access date. Then, I converted the days value to years.
#to get a result as days as difference.
timediff_days_x <- x %>%
mutate(timediff = difftime(last_ace_date, cre_date, 'days'))
#convert the days value to numeric and calculate and round to the nearest hundredth for years value.
timediff_years_x <- timediff_days_x %>%
mutate(timediff_year = round(as.numeric(timediff) / 365, digits = 1))
#a quick summary can tell us that the average years spent is 12.43.
round(mean(timediff_years_x$timediff_year), digits = 2)
Now, a chart here will tell me how many years users have spent on the platfom to reach it
> max_year_rep <- max(timelength_x$timediff_year)
> min_year_rep <- min(timelength_x$timediff_year)
ggplot(timelength_x)+
geom_jitter(aes(y = reputation, x = timediff_year))+
labs(caption=paste0(min_year_rep, ' to ', max_year_rep, ' years to achive top 200 level of reputation'),
y = 'repuation from 126K to 1.3M')
The graph shows that hose among the highest views tend to have higher reputation.
read.csv('stackoverflow_user_2023.csv')
x <- stackoverflow_user_2023
ggplot(x) +
geom_point(aes(y = reputation, x = views))
ggplot(x) +
geom_point(aes(y = reputation, x = down_votes))
ggplot(x) +
geom_point(aes(y = reputation, x = up_votes))
Key takeaways:
impact: dependent Notes: About 35% of the top users are from the US, 15% from UK, 10% from Germany.
impact: dependent Notes: * Users need to spend at least four years on the platform to amass high enough reputation to be top 200, though the majority shows that longer term does not always result in higher reputation. * Those who have accumulated over 400k reputation have stayed on the platform for more than 8 years. * Users amassing 1 million–or close to 1 million–have used the platform for more than 10 years.
impact: strong Notes: Users with the highest level of reputation have the highest view among others, indicating it a supporting contributing factor.
impact: weak Notes: The amount of upvotes and downvotes do not seem to affect user reputation much.
Follow-up: More variables are needed for further in-depth analysis.
Furthermore, I’ve created a dashboard on Tableau to demonstrate the relationship between the variables presented above. Feel free to check it out here.