Introduction

The data set I chose to work with is game data from Steam, an online video game marketplace. The data is retrieved through the API of Steam Spy, a service that monitors changes to the games on the marketplace. Specifically, we are looking at the 100 games that had the most users over the past two weeks.

Data Preparation

# load in required packages
library(tidyverse)
library(httr2)
library(psych)
library(ggfortify)

# load data through a Steam Spy API request
req <- request("https://steamspy.com/api.php")
resp <- req %>%
  req_url_query(`request` = 'top100in2weeks') %>%
  req_perform()

# Process the response JSON into a list of lists
jlist <- resp %>%
  resp_body_json(flatten = TRUE)

# Collapse the list of lists into a single tidy data frame
df_full <- jlist %>%
  map(as_tibble) %>%
  bind_rows()

# Select the columns which are important to our analysis and
# calculate a new column for the proportion of positive ratings
df <- df_full %>%
  select(appid, name, positive, negative, average_2weeks, median_2weeks) %>%
  mutate(rating = round(positive / (positive + negative), 3)) %>%
  select(-positive, -negative)

# Preview the data
knitr::kable(head(df))
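|   appid | name                             | average_2weeks | median_2weeks | rating |
|--------:|:---------------------------------|---------------:|--------------:|-------:|
|     570 | Dota 2                           |           1233 |           709 |  0.826 |
|     730 | Counter-Strike: Global Offensive |            644 |           255 |  0.886 |
| 1172470 | Apex Legends                     |            796 |           374 |  0.827 |
|  578080 | PUBG: BATTLEGROUNDS              |            719 |           298 |  0.571 |
| 1063730 | New World                        |            649 |           843 |  0.699 |
|     440 | Team Fortress 2                  |            965 |           298 |  0.938 |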
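To make the flattening step easier to follow without hitting the API, here is a minimal base R sketch. It assumes the parsed response is a list with one record per game; `jlist_demo` and `df_demo` are made-up names, and the records only carry the four fields shown in the preview table (the real response has more).

```r
# Hand-built stand-in for the parsed JSON: one record per game.
# Values are copied from the preview table; field selection is an assumption.
jlist_demo <- list(
  `570` = list(appid = 570L, name = "Dota 2",
               average_2weeks = 1233L, median_2weeks = 709L),
  `730` = list(appid = 730L, name = "Counter-Strike: Global Offensive",
               average_2weeks = 644L, median_2weeks = 255L),
  `440` = list(appid = 440L, name = "Team Fortress 2",
               average_2weeks = 965L, median_2weeks = 298L)
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(jlist_demo, function(rec) {
  as.data.frame(rec, stringsAsFactors = FALSE)
})
df_demo <- do.call(rbind, rows)
rownames(df_demo) <- NULL

print(df_demo)
```

This is the same idea as mapping `as_tibble` over the list and binding the rows, just written with base R so it runs without any packages.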

From this data I want to determine whether a simple linear regression can model a game's median playtime over the past two weeks (median_2weeks, which Steam Spy reports in minutes) as a function of its user review rating.

Visualization

For the initial visualization we plot the dependent variable, median playtime, against the independent variable, user rating. A linear relationship is very hard to make out from this view.

df %>%
  ggplot(aes(x=rating, y= median_2weeks)) +
  geom_point()

Generating the Linear Model

We use R’s built-in lm function to fit our linear model below:

df_lm <- lm(median_2weeks ~ rating, data = df)
(df_lm)
## 
## Call:
## lm(formula = median_2weeks ~ rating, data = df)
## 
## Coefficients:
## (Intercept)       rating  
##       577.1       -339.5

With a y-intercept of 577.1 and a slope of -339.5, we get the regression model of:

\(\widehat{\text{playtime}} = -339.5 \times \text{rating} + 577.1\)
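As a quick worked example of what this equation says, we can plug in one game from the preview table. This is only a sketch: the coefficients are the rounded values printed above, and the variable names are my own.

```r
# Coefficients as printed by lm() above (rounded to one decimal place).
intercept <- 577.1
slope     <- -339.5

# Dota 2 row from the preview table.
rating_dota   <- 0.826
observed_dota <- 709

# Predicted median playtime and the resulting residual (observed - predicted).
predicted_dota <- intercept + slope * rating_dota
residual_dota  <- observed_dota - predicted_dota

cat("predicted:", round(predicted_dota, 1), "\n")
cat("residual: ", round(residual_dota, 1), "\n")
```

For Dota 2 the model predicts roughly 297 against an observed 709, a residual of about 412, which already hints at how loosely the line follows the points.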

Oddly enough, we seem to have a negative relationship between playtime and a game’s user rating.

This is what the line looks like when drawn on our scatter plot:

df %>%
  ggplot(aes(x=rating, y= median_2weeks)) +
  geom_point() +
  geom_abline(intercept = df_lm$coefficients[1], slope = df_lm$coefficients[2])

Evaluating Quality of the Model

We use the summary function on the model object in order to get more information on how well the model fits.

summary(df_lm)
## 
## Call:
## lm(formula = median_2weeks ~ rating, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.91 -207.00  -98.01   80.30 1752.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    577.1      245.6   2.350   0.0208 *
## rating        -339.5      283.8  -1.197   0.2343  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 336.4 on 98 degrees of freedom
## Multiple R-squared:  0.0144, Adjusted R-squared:  0.004344 
## F-statistic: 1.432 on 1 and 98 DF,  p-value: 0.2343

Examining the residual summary, we run against almost every indication of a Gaussian distribution. The median (-98.01) is not close to zero, the first and third quartiles (-207.00 and 80.30) are not close in magnitude, and the maximum residual (1752.48) is more than five times the magnitude of the minimum (-317.91).
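To put rough numbers on that asymmetry, here is a small sketch that only uses the five values printed by summary(df_lm). The two ratios are an informal check of my own, not a standard statistic.

```r
# Five-number summary of the residuals, copied from summary(df_lm).
res_summary <- c(min = -317.91, q1 = -207.00, median = -98.01,
                 q3 = 80.30, max = 1752.48)

# For residuals that are symmetric around zero, both ratios would be near 1.
tail_ratio     <- res_summary[["max"]] / abs(res_summary[["min"]])
quartile_ratio <- abs(res_summary[["q1"]]) / res_summary[["q3"]]

cat("max / |min|:", round(tail_ratio, 2), "\n")
cat("|1Q| / 3Q:  ", round(quartile_ratio, 2), "\n")
```

Both ratios come out well above 1, which is consistent with a long right tail: a few games with far more playtime than the line predicts.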

Next we can examine the coefficients. The estimate of the rating coefficient (-339.5) is only about 1.2 times the size of its standard error (283.8), so the slope is estimated very imprecisely. The correspondingly small t-value of -1.197 gives a p-value of 0.2343: if there were truly no linear relationship, we would still see a slope at least this extreme about 23.43% of the time, so we cannot reject the null hypothesis of a zero slope at the usual 5% level.
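The t-value and p-value can be reproduced by hand from the printed estimate and standard error. The sketch below uses the rounded values from the summary, so the results will differ slightly from R's own output in the last digits; the 95% confidence interval is an addition of mine, not something the summary prints.

```r
# Values copied from the coefficient table of summary(df_lm).
estimate  <- -339.5
std_error <- 283.8
df_resid  <- 98

# t statistic: the estimate measured in units of its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 98 degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope.
margin <- qt(0.975, df = df_resid) * std_error
ci_95  <- c(lower = estimate - margin, upper = estimate + margin)

cat("t value:", round(t_value, 3), "\n")
cat("p value:", round(p_value, 4), "\n")
cat("95% CI: ", round(ci_95, 1), "\n")
```

The interval comfortably straddles zero, which is another way of saying that the data cannot tell a negative slope from a positive one.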

Finally we’ll take a look at the goodness of fit using the multiple R-squared value. At 0.0144, the model accounts for just 1.44% of the variation in median playtime. All signs so far point to this model not working out so well.
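The last lines of the summary are all tied to R-squared, and it is easy to check how. This sketch starts from the rounded R-squared value and takes n = 100 games with one predictor, which matches the 98 residual degrees of freedom; the variable names are my own.

```r
# Values taken from summary(df_lm).
r_squared <- 0.0144
n         <- 100   # number of games
p         <- 1     # number of predictors
df_resid  <- n - p - 1

# F statistic and its p-value for the overall regression.
f_stat  <- (r_squared / p) / ((1 - r_squared) / df_resid)
f_pval  <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)

# Adjusted R-squared penalises for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# With a single predictor, the correlation is the signed square root of
# R-squared; the sign follows the (negative) slope.
r <- -sqrt(r_squared)

cat("F statistic:  ", round(f_stat, 3), "\n")
cat("F p-value:    ", round(f_pval, 4), "\n")
cat("adj. R-squared:", round(adj_r_squared, 6), "\n")
cat("correlation:  ", round(r, 2), "\n")
```

A correlation of about -0.12 between rating and median playtime is weak, which fits the shapeless scatter plot from earlier.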

Residual Analysis of the Model

Let us take a look at the individual residuals and what they tell us about the model. Here we use ggfortify’s autoplot method to draw four diagnostic residual plots at once.

autoplot(df_lm)

Looking at the residuals vs fitted plot, we can see that the residuals are not evenly spread. They are concentrated towards the left side and begin to fan out as we move to the right. Additionally, more residuals fall below the zero line than above it, yet many of those above the line are large in magnitude. These deviations mean that the model is not a great fit for our data.

The normal Q-Q plot of our residuals reinforces the idea that they are not normally distributed, and thus that our model is not a great fit for the data. Both the lower and the upper tails of the residuals deviate from the reference line.
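A visual check like this can be backed up with a formal test such as Shapiro-Wilk. The sketch below runs on simulated residuals, not the Steam data, purely to show what the test reports for a right-skewed sample; on the real model the call would presumably be shapiro.test(residuals(df_lm)), which I have not run here.

```r
set.seed(42)

# Simulated stand-ins, NOT the Steam data: a right-skewed sample centred
# near zero, and a normal sample of the same size for comparison.
skewed_resid <- rexp(100, rate = 1) - 1
normal_resid <- rnorm(100)

# Shapiro-Wilk tests the null hypothesis that the sample is normal.
sw_skewed <- shapiro.test(skewed_resid)
sw_normal <- shapiro.test(normal_resid)

cat("skewed sample: W =", round(sw_skewed$statistic, 3),
    " p =", signif(sw_skewed$p.value, 3), "\n")
cat("normal sample: W =", round(sw_normal$statistic, 3),
    " p =", signif(sw_normal$p.value, 3), "\n")
```

A small p-value is evidence against normality, so a heavily right-skewed set of residuals like ours would be expected to fail this test.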

Conclusion

From our analysis here, the linear model we came up with is not appropriate for our data. This is mainly because the residuals are far from normally distributed; in addition, the slope is not statistically significant and the model explains only 1.44% of the variation in playtime. I would be interested if anyone has any thoughts on what type of regression model might be a better fit here.