DATA 605 Multiple Regression Discussion

Data Preparation

# load in required packages
library(tidyverse)
library(httr2)
library(psych)
library(ggfortify)

# load data through a Steam Spy API request
req <- request("https://steamspy.com/api.php")
resp <- req %>%
  req_url_query(`request` = 'top100in2weeks') %>%
  req_perform()

# Parse the response JSON into a list with one element per game
jlist <- resp %>%
  resp_body_json(flatten = TRUE)

# Bind the per-game lists into a single tidy data frame
df_full <- jlist %>%
  map(as_tibble) %>%
  reduce(bind_rows)

# Select the columns needed for the analysis and derive the model terms.
# rating is the share of positive reviews (a proportion from 0 to 1).
df <- df_full %>%
  select(appid, name, positive, negative, median_2weeks) %>%
  mutate(
    rating = round(positive / (positive + negative), 3),
    rating_count = positive + negative,
    positive_log = log(positive),
    negative_log = log(negative),
    low_reviews = ifelse(rating_count < 10000, 1, 0),
    square_rating = rating^2,
    positive_low_reviews = positive * low_reviews
  )

# Preview the data
knitr::kable(head(df))

appid	name	positive	negative	median_2weeks	rating	rating_count	positive_log	negative_log	square_rating
570	Dota 2	1621661	341941	809	0.826	1963602	14.29896	12.74239	0.682276
730	Counter-Strike: Global Offensive	6325123	810842	320	0.886	7135965	15.66004	13.60583	0.784996
1172470	Apex Legends	524101	110212	429	0.826	634313	13.16944	11.61016	0.682276
578080	PUBG: BATTLEGROUNDS	1237666	928096	197	0.571	2165762	14.02874	13.74089	0.326041
1063730	New World	177235	76325	2045	0.699	253560	12.08523	11.24276	0.488601
440	Team Fortress 2	887970	58964	346	0.938	946934	13.69669	10.98468	0.879844
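One caveat with the log columns: log(0) evaluates to -Inf in R, and lm() will not fit a model whose predictors contain infinite values. That does not appear to be a problem for the games previewed here, but a minimal sketch of a guard is shown below. The `toy` data frame and its values are hypothetical and are not Steam Spy data.

```r
# Hypothetical review counts; the second game has no negative reviews
toy <- data.frame(
  name     = c("Game A", "Game B", "Game C"),
  positive = c(1200, 35, 98000),
  negative = c(300, 0, 4100)
)

# log(0) is -Inf, which lm() cannot use as a predictor value
toy$negative_log <- log(toy$negative)
is.infinite(toy$negative_log)

# Keep only rows where both counts are strictly positive before taking logs
toy_ok <- toy[toy$positive > 0 & toy$negative > 0, ]
toy_ok$positive_log <- log(toy_ok$positive)
toy_ok$negative_log <- log(toy_ok$negative)
toy_ok
```

Dropping rows is only one option; whether it is appropriate depends on how many games would be excluded.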

From this data I want to determine whether a multiple regression model can describe a game's median playtime over the last two weeks (median_2weeks, in minutes) as a function of several review-based terms.

Generating the Model

To make the model a bit more complex, we add a quadratic term, the rating squared, because we believe the effect of rating on playtime grows faster than linearly.

We also create a dichotomous term, low_reviews, which flags games with fewer than 10,000 total ratings. The thought process behind this term is that games with fewer ratings might not score as highly but could still have a large amount of playtime.

We build on this idea with an interaction term: the number of positive reviews multiplied by the low-review dummy.

Beyond these, we also include the natural log of the number of positive reviews and of the number of negative reviews.
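As an aside, the terms described above do not have to be precomputed as columns; R's formula interface can express them inline. The sketch below uses simulated data (the `sim` data frame, its distributions, and the seed are all assumptions made for illustration) to check that both styles produce the same fitted values.

```r
set.seed(605)  # arbitrary seed so the simulated data is reproducible

# Simulated stand-in for the Steam Spy data (hypothetical values)
n <- 100
sim <- data.frame(
  positive = round(exp(runif(n, 5, 14))),
  negative = round(exp(runif(n, 4, 12)))
)
sim$rating        <- sim$positive / (sim$positive + sim$negative)
sim$low_reviews   <- ifelse(sim$positive + sim$negative < 10000, 1, 0)
sim$median_2weeks <- rexp(n, rate = 1 / 300)  # skewed, unrelated to the predictors

# Version 1: precomputed columns, mirroring the data preparation step
sim$square_rating        <- sim$rating^2
sim$positive_low_reviews <- sim$positive * sim$low_reviews
sim$positive_log         <- log(sim$positive)
sim$negative_log         <- log(sim$negative)
fit_cols <- lm(median_2weeks ~ square_rating + low_reviews + positive_low_reviews +
                 positive_log + negative_log, data = sim)

# Version 2: the same terms written inline in the formula
fit_inline <- lm(median_2weeks ~ I(rating^2) + low_reviews + positive:low_reviews +
                   log(positive) + log(negative), data = sim)

# The coefficient order can differ between the two fits (R places interaction
# terms last), so compare fitted values rather than coefficient vectors
all.equal(fitted(fit_cols), fitted(fit_inline))
```

The inline form keeps the transformations visible in the model call, while precomputed columns make the data preview easier to read. Either choice should describe the same model.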

We use R’s built-in lm() function to fit the model below:

df_lm <- lm(median_2weeks ~ square_rating + low_reviews + positive_low_reviews + positive_log + negative_log , data = df)
summary(df_lm)

## 
## Call:
## lm(formula = median_2weeks ~ square_rating + low_reviews + positive_low_reviews + 
##     positive_log + negative_log, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -572.6 -251.6 -111.4   72.8 3653.4 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           108.27363  736.87021   0.147    0.883
## square_rating         334.52198 1041.36618   0.321    0.749
## low_reviews          -119.59290  851.51122  -0.140    0.889
## positive_low_reviews   -0.02748    0.18291  -0.150    0.881
## positive_log         -163.51073  190.86299  -0.857    0.394
## negative_log          199.78335  186.39139   1.072    0.287
## 
## Residual standard error: 548.5 on 94 degrees of freedom
## Multiple R-squared:  0.08573,    Adjusted R-squared:  0.0371 
## F-statistic: 1.763 on 5 and 94 DF,  p-value: 0.1281

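With an intercept of 108.27, the model predicts roughly 108 minutes of median playtime when every predictor equals zero. That combination cannot actually occur in the data: a log term of zero corresponds to exactly one review of that type, not zero reviews, while low_reviews = 0 means at least 10,000 ratings. The intercept is therefore an extrapolation rather than the playtime of a game with no reviews.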

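The square_rating coefficient tells us that each one-unit increase in the squared rating is associated with a 335-minute increase in median playtime, holding the other terms fixed. Because rating is a proportion between 0 and 1, one unit spans the entire range from 0% to 100% positive. However, the p-value of 0.749 is not significant, so we cannot conclude that square_rating is a useful predictor of median playtime.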

The low_reviews coefficient tells us that a game with fewer than 10,000 ratings is predicted to have about 119.6 fewer minutes of median playtime, holding the other terms fixed. However, at a p-value of 0.889 this estimate is indistinguishable from random variation, so the dichotomous low_reviews term is not a useful predictor of median playtime.

The positive_low_reviews coefficient tells us that, among games with fewer than 10,000 ratings, each additional positive review is associated with a decrease of about 0.03 minutes in median playtime. For games with 10,000 or more ratings this term is zero and has no effect. However, at a p-value of 0.881 this estimate is indistinguishable from random variation, so the interaction term is not a useful predictor of median playtime.

The positive_log coefficient tells us that each one-unit increase in the natural log of positive reviews, which corresponds to multiplying the number of positive reviews by e (about 2.72), is associated with a 164-minute decrease in median playtime. However, the p-value of 0.394 is not significant, so we cannot conclude that positive_log is a useful predictor of median playtime.

The negative_log coefficient tells us that each one-unit increase in the natural log of negative reviews is associated with a 200-minute increase in median playtime. However, the p-value of 0.287 is not significant, so we cannot conclude that negative_log is a useful predictor of median playtime.
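Coefficients on log terms can be easier to read as the effect of doubling a count rather than of a one-unit change on the log scale. The sketch below converts the two log coefficients, copied by hand from the summary output, into that form. The variable names are ours, and "holding everything else fixed" is a simplification, since the number of positive reviews also enters the rating and interaction terms.

```r
# Coefficients copied from the summary() output
b_positive_log <- -163.51073
b_negative_log <-  199.78335

# A one-unit increase in log(x) multiplies x by e (about 2.718)
exp(1)

# Implied change in predicted playtime (minutes) when a count doubles,
# holding every other term fixed: coefficient * log(2)
effect_double_positive <- b_positive_log * log(2)
effect_double_negative <- b_negative_log * log(2)
round(c(positive = effect_double_positive, negative = effect_double_negative), 1)
```

Since neither coefficient is statistically significant, these figures describe the fitted equation only and should not be read as real effects.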

Despite the poor fit we have already seen, the fitted regression model is:

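\[ \begin{aligned} \widehat{\text{playtime}} ={} & 108.27 + 334.52 \cdot \text{rating}^2 - 119.59 \cdot \text{lowreviewed} - 0.02748 \cdot \text{lowreviewed} \cdot \text{positivereviews} \\ & - 163.51 \cdot \log(\text{positivereviews}) + 199.78 \cdot \log(\text{negativereviews}) \end{aligned} \]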
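To see the equation in action, we can plug in one game by hand. The sketch below uses the Dota 2 row from the data preview and the coefficients from the summary output. The vector names `b` and `x` are ours, and low_reviews is taken to be 0 because Dota 2 has far more than 10,000 ratings.

```r
# Coefficients from the summary() output
b <- c(intercept = 108.27363, square_rating = 334.52198, low_reviews = -119.59290,
       positive_low_reviews = -0.02748, positive_log = -163.51073,
       negative_log = 199.78335)

# Dota 2 row from the data preview (low_reviews is 0: rating_count >= 10000)
x <- c(intercept = 1, square_rating = 0.682276, low_reviews = 0,
       positive_low_reviews = 0, positive_log = 14.29896, negative_log = 12.74239)

# Predicted playtime is the sum of coefficient * value over all terms
predicted <- sum(b * x)
observed  <- 809  # median_2weeks for Dota 2 in the preview
c(predicted = predicted, residual = observed - predicted)
```

With the fitted model object available, predict(df_lm, newdata = ...) should give the same number without copying coefficients by hand.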

Evaluating Quality of the Model

Examining the residual summary, we violate almost every indication of a Gaussian distribution. The median (-111.4) is not close to zero, the first and third quartiles (-251.6 and 72.8) are not similar in magnitude, and the minimum (-572.6) and maximum (3653.4) are far from symmetric, with the maximum being an extreme outlier.
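The eyeball checks above can be supplemented with a formal test. The sketch below applies the Shapiro-Wilk test from base R's stats package to the residuals of a toy model. The data is simulated and hypothetical, chosen to be right-skewed like the playtime variable; it is not the Steam Spy data.

```r
set.seed(605)  # arbitrary seed for reproducibility

# Simulated right-skewed response with no real relationship to x (hypothetical data)
x <- runif(100)
y <- rexp(100, rate = 1 / 300)
toy_fit <- lm(y ~ x)
res <- residuals(toy_fit)

# Five-number view of the residuals, like the "Residuals" block in summary()
quantile(res)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(res)
sw$p.value
```

For the model in this post, the equivalent call would be shapiro.test(residuals(df_lm)).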

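Next we look at goodness of fit through the adjusted R-squared value. At 0.0371, the model accounts for only about 3.7% of the variation in playtime after adjusting for the number of predictors. The overall F-test (p-value 0.1281) likewise fails to show that the model does better than an intercept-only model. All signs so far point to this model not working out well.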
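For reference, the adjusted R-squared and the F-statistic can both be recovered from the multiple R-squared reported in the summary. Here we take n = 100 observations (94 residual degrees of freedom plus 6 estimated coefficients) and p = 5 predictors:

\[ R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1} = 1 - (1 - 0.08573)\cdot\frac{99}{94} \approx 0.0371 \]

\[ F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{0.08573 / 5}{0.91427 / 94} \approx 1.763 \]

Both values agree with the summary output, which confirms that the adjustment for five predictors is what pulls the already small R-squared of 0.0857 down to 0.0371.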

Residual Analysis of the Model

Let us take a look at the residuals and what they tell us about the model. Here we use ggfortify’s autoplot() to draw four diagnostic residual plots at once.

autoplot(df_lm)

Looking at the residuals vs fitted plot, we can see that the residuals are not scattered evenly around zero. They follow a downward trend and begin to fan upward as we move to the right. Additionally, more residuals fall below the line than above it, yet many of those above the line have a large magnitude. These deviations mean that the model is not a great fit for our data.

The normal Q-Q plot of the residuals reinforces the idea that they are not normally distributed, and thus that our model is not a great fit for the data. Both the lower and upper tails of the residuals deviate from normality.
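For readers without ggfortify, the two plots discussed above can be approximated with base R graphics. The sketch below uses simulated, hypothetical data whose spread grows with the predictor, so the fan shape and the heavy upper tail are both visible. It is an illustration only and not a reproduction of the plots from this model.

```r
set.seed(605)  # arbitrary seed for reproducibility

# Hypothetical data: right-skewed response whose spread grows with x
x <- runif(100, 1, 10)
y <- 50 * x + rexp(100, rate = 1 / (30 * x))
toy_fit <- lm(y ~ x)

op <- par(mfrow = c(1, 2))  # two panels side by side

# Residuals vs fitted: look for trends or a fan shape around the zero line
plot(fitted(toy_fit), residuals(toy_fit),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)

# Normal Q-Q: points bending away from the line indicate non-normal tails
qqnorm(residuals(toy_fit))
qqline(residuals(toy_fit))

par(op)  # restore the previous plotting layout
```

Calling plot(df_lm, which = 1:2) on the fitted model should produce the base R versions of the same two diagnostics.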

Taha Ahmad
