Introduction

The data I have chosen to analyze comes from Steam, an online video game marketplace. It is retrieved through Steam Spy, an API that tracks changes to the games on the marketplace. Specifically, we look at the 100 games that had the most players over the past two weeks.

Data Preparation

# load in required packages
library(tidyverse)
library(httr2)
library(psych)
library(ggfortify)

# load data through a Steam Spy API request
req <- request("https://steamspy.com/api.php")
resp <- req %>%
  req_url_query(`request` = 'top100in2weeks') %>%
  req_perform()

# Process the response JSON into a list of lists
jlist <- resp %>%
  resp_body_json(flatten = TRUE)

# Melt the list of lists down into a tidy data frame
df_full <- jlist %>%
  map(as_tibble) %>%
  reduce(bind_rows)

# Select the columns relevant to our analysis and derive the model terms,
# including the share of positive ratings (a proportion between 0 and 1)
df <- df_full %>%
  select(appid, name, positive, negative, median_2weeks) %>%
  mutate(
    rating = round(positive / (positive + negative), 3),
    rating_count = positive + negative,
    positive_log = log(positive),
    negative_log = log(negative),
    low_reviews = ifelse(rating_count < 10000, 1, 0),
    square_rating = rating^2,
    positive_low_reviews = positive * low_reviews
  )

# Preview the data
knitr::kable(head(df))
| appid | name | positive | negative | median_2weeks | rating | rating_count | positive_log | negative_log | low_reviews | square_rating | positive_low_reviews |
|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 570 | Dota 2 | 1621661 | 341941 | 809 | 0.826 | 1963602 | 14.29896 | 12.74239 | 0 | 0.682276 | 0 |
| 730 | Counter-Strike: Global Offensive | 6325123 | 810842 | 320 | 0.886 | 7135965 | 15.66004 | 13.60583 | 0 | 0.784996 | 0 |
| 1172470 | Apex Legends | 524101 | 110212 | 429 | 0.826 | 634313 | 13.16944 | 11.61016 | 0 | 0.682276 | 0 |
| 578080 | PUBG: BATTLEGROUNDS | 1237666 | 928096 | 197 | 0.571 | 2165762 | 14.02874 | 13.74089 | 0 | 0.326041 | 0 |
| 1063730 | New World | 177235 | 76325 | 2045 | 0.699 | 253560 | 12.08523 | 11.24276 | 0 | 0.488601 | 0 |
| 440 | Team Fortress 2 | 887970 | 58964 | 346 | 0.938 | 946934 | 13.69669 | 10.98468 | 0 | 0.879844 | 0 |
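As a small illustration of the reshaping and derived columns above, the sketch below runs the same steps on a hand-written list. The two games and their counts are hypothetical, not Steam Spy data. The second entry is deliberately given zero negative reviews to show that log(negative) then evaluates to -Inf, a value lm() will not accept in a predictor, so such rows would need to be filtered or adjusted before modeling.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the parsed JSON: one named list per game
toy_list <- list(
  `1` = list(appid = 1L, name = "Game A", positive = 900L,
             negative = 100L, median_2weeks = 120L),
  `2` = list(appid = 2L, name = "Game B", positive = 50L,
             negative = 0L, median_2weeks = 30L)
)

toy_df <- toy_list %>%
  map(as_tibble) %>%   # each inner list becomes a one-row tibble
  bind_rows() %>%      # stack the rows into one data frame
  mutate(
    rating = round(positive / (positive + negative), 3),
    rating_count = positive + negative,
    positive_log = log(positive),
    negative_log = log(negative),   # log(0) gives -Inf for Game B
    low_reviews = ifelse(rating_count < 10000, 1, 0)
  )

toy_df
```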

From this data, I want to determine whether a multiple regression model can describe a game's median playtime as a function of several of these terms.

Generating the Model

To make the model a bit more complex, we add a quadratic term, the rating squared, because we believe the rating should have a stronger than linear effect on playtime.

We also create a dichotomous term that flags games with fewer than 10,000 total ratings as low-review games. The reasoning behind this term is that games with fewer ratings might not score as highly, yet could still see more playtime.

We extend this reasoning with an interaction term: the number of positive reviews multiplied by the low-review dummy.

Beyond these, we also include the log of the number of positive reviews and the log of the number of negative reviews.
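These terms can also be written directly inside an R model formula instead of as precomputed columns. The sketch below uses simulated data (the value ranges, the seed, and the `sim` data frame are assumptions for illustration only) and checks that both specifications produce the same fitted values. `I()` protects the arithmetic inside the formula, and `low_reviews:positive` is the product of the dummy and the positive count.

```r
set.seed(42)
n <- 60

# Simulated review counts, chosen so both low- and high-review games occur
sim <- data.frame(
  positive = sample(100:12000, n),
  negative = sample(50:6000, n)
)
sim$rating <- sim$positive / (sim$positive + sim$negative)
sim$low_reviews <- as.numeric(sim$positive + sim$negative < 10000)
sim$playtime <- rnorm(n, mean = 300, sd = 100)  # unrelated noise response

# Version 1: precomputed columns, as in the data preparation step
sim$square_rating <- sim$rating^2
sim$positive_low_reviews <- sim$positive * sim$low_reviews
sim$positive_log <- log(sim$positive)
sim$negative_log <- log(sim$negative)
fit_cols <- lm(playtime ~ square_rating + low_reviews + positive_low_reviews +
                 positive_log + negative_log, data = sim)

# Version 2: the same terms expressed inline in the formula
fit_formula <- lm(playtime ~ I(rating^2) + low_reviews + low_reviews:positive +
                    log(positive) + log(negative), data = sim)

# The two fits should agree up to floating-point error
all.equal(unname(fitted(fit_cols)), unname(fitted(fit_formula)))
```

Note that R places interaction terms after the main effects, so the coefficients of the second fit appear in a different order than those of the first.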

We use R’s built-in lm() function to fit the model below:

df_lm <- lm(median_2weeks ~ square_rating + low_reviews + positive_low_reviews + positive_log + negative_log , data = df)
summary(df_lm)
## 
## Call:
## lm(formula = median_2weeks ~ square_rating + low_reviews + positive_low_reviews + 
##     positive_log + negative_log, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -572.6 -251.6 -111.4   72.8 3653.4 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           108.27363  736.87021   0.147    0.883
## square_rating         334.52198 1041.36618   0.321    0.749
## low_reviews          -119.59290  851.51122  -0.140    0.889
## positive_low_reviews   -0.02748    0.18291  -0.150    0.881
## positive_log         -163.51073  190.86299  -0.857    0.394
## negative_log          199.78335  186.39139   1.072    0.287
## 
## Residual standard error: 548.5 on 94 degrees of freedom
## Multiple R-squared:  0.08573,    Adjusted R-squared:  0.0371 
## F-statistic: 1.763 on 5 and 94 DF,  p-value: 0.1281

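The intercept of 108 is the predicted median playtime, in minutes, when every predictor equals zero. That point cannot occur in practice: positive_log and negative_log are zero only for a game with exactly one positive and one negative review, which would give a rating of 0.5 rather than 0. The intercept therefore has no direct real-world interpretation.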

The square_rating coefficient tells us that a one-unit increase in the squared rating (its full range, since rating is a proportion between 0 and 1) is associated with a 335-minute increase in median playtime. However, with a p-value of 0.749 the coefficient is not significant, so we cannot conclude that square_rating helps predict median_2weeks.

The low_reviews coefficient tells us that a game with fewer than 10,000 ratings has a median playtime about 119.6 minutes lower, all else equal. However, with a p-value of 0.889 this difference is indistinguishable from random variation, so the dichotomous low_reviews term is not a good indicator for median_2weeks.

The positive_low_reviews coefficient tells us that, among games with fewer than 10,000 ratings, each additional positive review is associated with a decrease of about 0.03 minutes in median playtime. However, with a p-value of 0.881 this effect is indistinguishable from random variation, so the interaction term is not a good indicator for median_2weeks.

The positive_log coefficient tells us that a one-unit increase in the log of positive reviews (multiplying the number of positive reviews by about 2.72) is associated with a 164-minute decrease in median playtime. However, with a p-value of 0.394 the coefficient is not significant, so positive_log is not a good indicator for median_2weeks.

The negative_log coefficient tells us that a one-unit increase in the log of negative reviews (multiplying the number of negative reviews by about 2.72) is associated with a 200-minute increase in median playtime. However, with a p-value of 0.287 the coefficient is not significant, so negative_log is not a good indicator for median_2weeks.
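Because these two predictors enter on the log scale, their coefficients are easier to read as the effect of a proportional change in the review count. As a worked example using the estimates from the summary output, doubling a count shifts the prediction by the coefficient times log(2). This treats the other terms as fixed, which is only an approximation here because the rating also depends on the review counts.

```r
# Coefficients copied from the summary() output above
b_positive_log <- -163.51073
b_negative_log <- 199.78335

# Predicted change in median playtime (minutes) when a review count doubles,
# treating all other terms in the model as fixed
effect_double_positive <- b_positive_log * log(2)
effect_double_negative <- b_negative_log * log(2)

round(c(positive = effect_double_positive,
        negative = effect_double_negative), 1)
```

Under these estimates, doubling the positive reviews lowers the prediction by roughly 113 minutes, and doubling the negative reviews raises it by roughly 138 minutes. Neither effect is statistically significant.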

Despite the poor fit already evident from these p-values, the fitted regression model is:

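\[ \begin{aligned} \widehat{\text{playtime}} = {} & 334.52 \cdot \text{rating}^2 - 119.59 \cdot \text{lowreviewed} - 0.02748 \cdot \text{lowreviewed} \cdot \text{positivereviews} \\ & - 163.51 \cdot \log(\text{positivereviews}) + 199.78 \cdot \log(\text{negativereviews}) + 108.27 \end{aligned} \]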
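To make the equation concrete, the sketch below evaluates it for a single game. The helper `predict_playtime()` is hypothetical and is not part of the original analysis; it hard-codes the coefficients from the summary output and rebuilds the derived terms from the raw review counts.

```r
# Hypothetical helper: evaluates the fitted equation by hand using the
# coefficients copied from the summary() output
predict_playtime <- function(positive, negative) {
  rating <- round(positive / (positive + negative), 3)
  low_reviewed <- as.numeric(positive + negative < 10000)
  108.27363 +
    334.52198 * rating^2 -
    119.59290 * low_reviewed -
    0.02748 * low_reviewed * positive -
    163.51073 * log(positive) +
    199.78335 * log(negative)
}

# Dota 2 row from the preview table: 1,621,661 positive, 341,941 negative
dota_pred <- predict_playtime(1621661, 341941)
dota_pred
```

For Dota 2 this gives a prediction of about 544 minutes against an observed median of 809, a gap of several hundred minutes that is consistent with the weak fit. With the fitted model object available, `predict(df_lm, newdata = ...)` would be the usual way to obtain the same value.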

Evaluating Quality of the Model

Examining the residual summary, we see almost every sign of a departure from a Gaussian distribution. The median (-111.4) is not close to zero, the first and third quartiles (-251.6 and 72.8) are not close in magnitude, and the minimum and maximum (-572.6 and 3653.4) are far from symmetric, with the maximum standing out as an extreme outlier.

Next we look at goodness of fit through the adjusted R-squared value. At 0.0371, the model accounts for just 3.71% of the variation in playtime after adjusting for the number of predictors. The overall F-test (p-value 0.1281) is not significant either. All signs so far point to this model not working well.
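The adjusted R-squared, F-statistic, and overall p-value reported by summary() can be reproduced from the multiple R-squared alone. The sketch below takes n = 100 games and p = 5 predictors, which matches the 94 residual degrees of freedom in the output.

```r
r2 <- 0.08573   # Multiple R-squared from the summary output
n  <- 100       # number of games in the sample
p  <- 5         # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic and its upper-tail p-value
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value), 4)
```

The results should agree with the reported 0.0371, 1.763, and 0.1281 up to rounding of the input R-squared.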

Residual Analysis of the Model

Let us take a look at the individual residuals and what they tell us about the model. Here we use ggfortify’s autoplot capabilities to draw four diagnostic residual plots at once.

autoplot(df_lm)


Looking at the residuals vs fitted plot, we can see that the residuals are not scattered evenly around zero. They follow a downward trend and begin to fan upward as we move to the right. Additionally, more residuals fall below the line than above it, yet many of those above the line are large in magnitude. These deviations from a random, constant-variance pattern mean that the model is not a good fit for our data.

The Q-Q plot of the residuals reinforces the idea that they are not normally distributed, and thus that our model is not a good fit for the data. Both the lower and upper tails of the residuals deviate from normality.
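A formal test can complement the visual check. The sketch below is an illustration on simulated data, not the Steam Spy sample: it fits a line to a response with strongly right-skewed noise (an assumption meant to mimic a few games with very high playtime) and applies the Shapiro-Wilk test to the residuals. The same two calls could be applied to `residuals(df_lm)`.

```r
set.seed(1)

# Simulated stand-in data with right-skewed (exponential) noise
x <- runif(100)
y <- 200 + 50 * x + rexp(100, rate = 1 / 300)
toy_fit <- lm(y ~ x)

res <- residuals(toy_fit)

# With an intercept the residual mean is zero, so a clearly negative
# median is a sign of right skew
summary(res)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```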

Conclusion

From our analysis here, we have come up with a multiple regression model that is not appropriate for our data. The relationship between a game’s playtime and its other variables still seems to evade me.