The data I have chosen to analyze comes from Steam, an online video game marketplace. It is retrieved through Steam Spy, an API that monitors changes to the games on the marketplace. Specifically, we look at the 100 games that had the most players over the past two weeks.
# load in required packages
library(tidyverse)
library(httr2)
library(psych)
library(ggfortify)
# load data through a Steam Spy API request
req <- request("https://steamspy.com/api.php")
resp <- req %>%
  req_url_query(request = "top100in2weeks") %>%
  req_perform()
# Process the response JSON into a list of lists
jlist <- resp %>%
  resp_body_json(flatten = TRUE)
# Melt the list of lists down into a tidy data frame
df_full <- jlist %>%
  map(as_tibble) %>%
  reduce(bind_rows)
# Select the columns which are important to our analysis and
# calculate a new column for percent positive ratings
df <- df_full %>%
  select(appid, name, positive, negative, median_2weeks) %>%
  mutate(rating = round(positive / (positive + negative), 3),
         rating_count = positive + negative,
         positive_log = log(positive),
         negative_log = log(negative)) %>%
  mutate(low_reviews = ifelse(rating_count < 10000, 1, 0),
         square_rating = rating^2) %>%
  mutate(positive_low_reviews = positive * low_reviews)
# Preview the data
knitr::kable(head(df))
| appid | name | positive | negative | median_2weeks | rating | rating_count | positive_log | negative_log | low_reviews | square_rating | positive_low_reviews |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 570 | Dota 2 | 1621661 | 341941 | 809 | 0.826 | 1963602 | 14.29896 | 12.74239 | 0 | 0.682276 | 0 |
| 730 | Counter-Strike: Global Offensive | 6325123 | 810842 | 320 | 0.886 | 7135965 | 15.66004 | 13.60583 | 0 | 0.784996 | 0 |
| 1172470 | Apex Legends | 524101 | 110212 | 429 | 0.826 | 634313 | 13.16944 | 11.61016 | 0 | 0.682276 | 0 |
| 578080 | PUBG: BATTLEGROUNDS | 1237666 | 928096 | 197 | 0.571 | 2165762 | 14.02874 | 13.74089 | 0 | 0.326041 | 0 |
| 1063730 | New World | 177235 | 76325 | 2045 | 0.699 | 253560 | 12.08523 | 11.24276 | 0 | 0.488601 | 0 |
| 440 | Team Fortress 2 | 887970 | 58964 | 346 | 0.938 | 946934 | 13.69669 | 10.98468 | 0 | 0.879844 | 0 |
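To make the flattening and rating steps easier to follow, here is a minimal sketch on a toy list. The `toy_list` object is a hypothetical stand-in for the parsed JSON (its values are copied from two rows of the table above); the real response has many more fields per game.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the parsed JSON: one named list per game
toy_list <- list(
  `570` = list(appid = 570L, name = "Dota 2",
               positive = 1621661L, negative = 341941L),
  `440` = list(appid = 440L, name = "Team Fortress 2",
               positive = 887970L, negative = 58964L)
)

# Each inner list becomes a one-row tibble; bind_rows() stacks them,
# then the rating is the share of positive reviews rounded to 3 places
toy_df <- toy_list %>%
  map(as_tibble) %>%
  bind_rows() %>%
  mutate(rating = round(positive / (positive + negative), 3))

toy_df
```

The resulting ratings should match the `rating` column shown for these two games in the preview table.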
From this data I want to determine whether a multiple regression model can describe a game’s median playtime as a function of several terms.
To make the model a bit more complex, we add a quadratic term, the rating squared, because we believe the rating should have a larger-than-linear effect on playtime.
We also create a dichotomous (dummy) term that flags games with fewer than 10,000 total ratings as low-review games. The reasoning is that games with fewer ratings might not score as highly but could still see a high amount of playtime.
We build on this idea with an interaction term: the number of positive reviews multiplied by the low-review dummy.
Beyond these, we also include the natural log of the number of positive reviews and of the number of negative reviews.
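Written out, the terms described above correspond to a model of roughly the following form. The symbols \(\beta_0, \dots, \beta_5\) (unknown coefficients) and \(\varepsilon_i\) (error for game \(i\)) are notation introduced here for illustration:

\[
\text{playtime}_i = \beta_0 + \beta_1 \, \text{rating}_i^2 + \beta_2 \, \text{lowreviewed}_i + \beta_3 \, \text{lowreviewed}_i \cdot \text{positivereviews}_i + \beta_4 \log(\text{positivereviews}_i) + \beta_5 \log(\text{negativereviews}_i) + \varepsilon_i
\]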
We fit the model with R’s built-in lm() function:
df_lm <- lm(median_2weeks ~ square_rating + low_reviews + positive_low_reviews + positive_log + negative_log , data = df)
summary(df_lm)
##
## Call:
## lm(formula = median_2weeks ~ square_rating + low_reviews + positive_low_reviews +
## positive_log + negative_log, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -572.6 -251.6 -111.4 72.8 3653.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.27363 736.87021 0.147 0.883
## square_rating 334.52198 1041.36618 0.321 0.749
## low_reviews -119.59290 851.51122 -0.140 0.889
## positive_low_reviews -0.02748 0.18291 -0.150 0.881
## positive_log -163.51073 190.86299 -0.857 0.394
## negative_log 199.78335 186.39139 1.072 0.287
##
## Residual standard error: 548.5 on 94 degrees of freedom
## Multiple R-squared: 0.08573, Adjusted R-squared: 0.0371
## F-statistic: 1.763 on 5 and 94 DF, p-value: 0.1281
The intercept of 108 is the predicted median playtime, in minutes, when every predictor equals zero. No real game sits at that point: positive_log and negative_log are both zero only when a game has exactly one positive and one negative review, which would give a rating of 0.5 rather than 0. The intercept therefore has no practical interpretation here, and with a p-value of 0.883 it cannot be distinguished from zero.
The square_rating coefficient tells us that a one-unit increase in the squared rating, which spans the whole range from a rating of 0 to a rating of 1, is associated with a 335-minute increase in median playtime, holding the other terms fixed. However, the p-value of 0.749 is far from significant, which tells us square_rating is not a good predictor of median_2weeks.
The low_reviews coefficient tells us that a game with fewer than 10,000 ratings is predicted to have about 120 minutes less median playtime than an otherwise comparable game. However, the p-value of 0.889 means this difference cannot be distinguished from random variation, which tells us the dichotomous low_reviews term is not a good predictor of median_2weeks.
The positive_low_reviews coefficient tells us that, among games with fewer than 10,000 ratings, each additional positive review is associated with a 0.03-minute decrease in median playtime. However, the p-value of 0.881 means this effect cannot be distinguished from random variation, which tells us the interaction term is not a good predictor of median_2weeks.
The positive_log coefficient tells us that a one-unit increase in the natural log of positive reviews (multiplying the number of positive reviews by about 2.72) is associated with a 164-minute decrease in median playtime. However, the p-value of 0.394 is not significant, which tells us positive_log is not a good predictor of median_2weeks.
The negative_log coefficient tells us that a one-unit increase in the natural log of negative reviews (multiplying the number of negative reviews by about 2.72) is associated with a 200-minute increase in median playtime. However, the p-value of 0.287 is not significant, which tells us negative_log is not a good predictor of median_2weeks.
Despite the poor fit, the fitted regression equation is:
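\[
\begin{aligned}
\widehat{\text{playtime}} ={}& 334.52 \cdot \text{rating}^2 - 119.59 \cdot \text{lowreviewed} - 0.02748 \cdot \text{lowreviewed} \cdot \text{positivereviews} \\
& - 163.51 \cdot \log(\text{positivereviews}) + 199.78 \cdot \log(\text{negativereviews}) + 108.27
\end{aligned}
\]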
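As a worked example, we can plug the Dota 2 row from the preview table into the fitted equation by hand. This is only a sketch: the coefficients are the rounded values printed by summary(), and the object names (`coefs`, `dota`) are introduced here for illustration.

```r
# Coefficients as printed in the summary() output (rounded there)
coefs <- c(intercept = 108.27363, square_rating = 334.52198,
           low_reviews = -119.59290, positive_low_reviews = -0.02748,
           positive_log = -163.51073, negative_log = 199.78335)

# Dota 2 predictor values from the preview table (1 multiplies the intercept)
dota <- c(intercept = 1, square_rating = 0.682276, low_reviews = 0,
          positive_low_reviews = 0, positive_log = 14.29896,
          negative_log = 12.74239)

# Prediction is the sum of coefficient * predictor value
predicted <- sum(coefs * dota[names(coefs)])
observed <- 809                      # median_2weeks for Dota 2
residual <- observed - predicted

round(c(predicted = predicted, residual = residual), 1)
```

The prediction comes out at roughly 544 minutes against an observed 809, so even for the first game in the table the model misses by more than four hours.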
The residual summary already shows several departures from what we would expect of normally distributed errors. The median residual is -111.4 rather than close to zero, the first and third quartiles (-251.6 and 72.8) are not symmetric about zero, and the minimum (-572.6) and maximum (3653.4) differ greatly in magnitude, which points to a strong right skew with at least one large positive outlier.
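The asymmetry can be quantified with a little arithmetic on the five-number summary printed above. This is a rough sketch rather than a formal test; the variable names are introduced here for illustration.

```r
# Five-number summary of the residuals, copied from the summary() output
res <- c(min = -572.6, q1 = -251.6, median = -111.4, q3 = 72.8, max = 3653.4)

# Distance from the median to each side; symmetric errors would give
# similar values above and below
upper_quartile_gap <- res[["q3"]] - res[["median"]]
lower_quartile_gap <- res[["median"]] - res[["q1"]]
upper_tail <- res[["max"]] - res[["median"]]
lower_tail <- res[["median"]] - res[["min"]]

c(quartile_ratio = upper_quartile_gap / lower_quartile_gap,
  tail_ratio = upper_tail / lower_tail)
```

The upper tail is about eight times as long as the lower tail, which is consistent with a heavily right-skewed residual distribution.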
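Next we look at goodness of fit through the adjusted R-squared value. At 0.0371, the model accounts for just 3.71% of the variation in playtime after adjusting for the number of predictors. The overall F-test (p-value 0.1281) likewise fails to show that the model does better than an intercept-only model. All signs so far point to this model not working out well.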
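For readers who want to see where these figures come from, the sketch below recomputes them from the reported multiple R-squared using the standard formulas. It assumes n = 100 games and p = 5 predictors, which matches the 5 and 94 degrees of freedom in the output.

```r
r2 <- 0.08573   # multiple R-squared from the summary() output
n  <- 100       # number of games
p  <- 5         # number of predictors (excluding the intercept)

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic and its p-value against the intercept-only model
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value), 4)
```

Up to rounding, these should agree with the adjusted R-squared, F-statistic, and p-value shown in the model summary.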
Let us look at the residuals in more detail and at what they tell us about the model. Here we use ggfortify’s autoplot to draw four diagnostic residual plots at once.
autoplot(df_lm)
Looking at the residuals vs fitted plot we can see that the residuals are
not scattered evenly around zero. They follow a downward trend and begin
to fan upward as we move to the right. Additionally, there are more
residuals below the line than above it, yet many of the residuals above
the line have a large magnitude. These patterns suggest non-constant
variance and mean that the model is not a good fit for our data.
The Q-Q plot of the residuals reinforces the idea that they are not normally distributed, and thus that our model is not a good fit for the data. Both the lower and upper tails of the residuals deviate from the reference line.
From this analysis, we have arrived at a multiple regression model that is not appropriate for our data. The relationship between a game’s playtime and its other variables still seems to evade me.