DATA 605 Linear Regression Discussion

Data Preparation

# load in required packages
library(tidyverse)
library(httr2)
library(psych)
library(ggfortify)

# load data through a Steam Spy API request
req <- request("https://steamspy.com/api.php")
resp <- req %>%
  req_url_query(`request` = 'top100in2weeks') %>%
  req_perform()

# Parse the response JSON into a list of lists
jlist <- resp %>%
  resp_body_json(flatten = TRUE)

# Convert the list of lists into a tidy data frame
df_full <- jlist %>%
  map(as_tibble) %>%
  reduce(bind_rows)

# Select the columns that are important to our analysis and
# calculate a new column for the proportion of positive ratings
df <- df_full %>%
  select(appid, name, positive, negative, average_2weeks, median_2weeks) %>%
  mutate(rating = round(positive / (positive + negative), 3)) %>%
  select(-positive, -negative)

# Preview the data
knitr::kable(head(df))

appid	name	average_2weeks	median_2weeks	rating
570	Dota 2	1233	709	0.826
730	Counter-Strike: Global Offensive	644	255	0.886
1172470	Apex Legends	796	374	0.827
578080	PUBG: BATTLEGROUNDS	719	298	0.571
1063730	New World	649	843	0.699
440	Team Fortress 2	965	298	0.938
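To make the reshaping step concrete, here is a minimal sketch on a hand-made list instead of the live API response. The object names `toy_list` and `toy_df` and all values in them are made up for illustration; only the field names mirror the columns used above. It uses `bind_rows()` directly on the list of tibbles, which should give the same result as `reduce(bind_rows)` here.

```r
library(purrr)
library(dplyr)
library(tibble)

# Hypothetical stand-in for the parsed JSON: one inner list per game.
# Field names mirror the columns used above; the values are made up.
toy_list <- list(
  `1` = list(appid = 1L, name = "Game A", positive = 90L, negative = 10L),
  `2` = list(appid = 2L, name = "Game B", positive = 45L, negative = 55L)
)

toy_df <- toy_list %>%
  map(as_tibble) %>%   # each inner list becomes a one-row tibble
  bind_rows() %>%      # stack the one-row tibbles into a single data frame
  mutate(rating = round(positive / (positive + negative), 3))

toy_df
```

Note that a game with no reviews at all would produce a rating of `NaN` (0 divided by 0), so it may be worth checking for that case before modeling.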

From this data I want to determine whether a simple linear regression can model a game's median playtime over the last two weeks as a function of its user review rating.

Visualization

For the initial visualization we plot the dependent variable, median playtime, against the independent variable, user rating. A linear relationship is very hard to see from this view.

df %>%
  ggplot(aes(x=rating, y= median_2weeks)) +
  geom_point()

Generating the Linear Model

We use R’s built-in lm function to fit our linear model below:

df_lm <- lm(median_2weeks ~ rating, data = df)
(df_lm)

## 
## Call:
## lm(formula = median_2weeks ~ rating, data = df)
## 
## Coefficients:
## (Intercept)       rating  
##       577.1       -339.5

With a y-intercept of 577.1 and a slope of -339.5, we get the regression model of:

\(\widehat{\text{playtime}} = -339.5 \cdot \text{rating} + 577.1\)

Oddly enough, we seem to have a negative relationship between playtime and a game’s user rating.
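As a quick illustration of what the fitted equation implies, the sketch below plugs a few ratings into it. The helper `predict_playtime` is hypothetical and not part of the original analysis, and the coefficients are the rounded values printed above, so the results are approximate.

```r
# Coefficients copied from the lm() output above (rounded to one decimal place)
intercept <- 577.1
slope     <- -339.5

# Hypothetical helper: predicted median playtime for a given rating
predict_playtime <- function(rating) intercept + slope * rating

# Predictions for ratings of 60%, 80% and 100% positive
predict_playtime(c(0.60, 0.80, 1.00))
```

Under this model, each additional 0.1 in rating lowers the predicted median playtime by about 34 units. With the fitted object available, `predict(df_lm, newdata = data.frame(rating = c(0.60, 0.80, 1.00)))` would give the same predictions from the unrounded coefficients.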

This is what the line looks like when drawn on our scatter plot:

df %>%
  ggplot(aes(x=rating, y= median_2weeks)) +
  geom_point() +
  geom_abline(intercept = df_lm$coefficients[1], slope = df_lm$coefficients[2])

Evaluating Quality of the Model

We use the summary function on the model object in order to get more information on how well the model fits.

summary(df_lm)

## 
## Call:
## lm(formula = median_2weeks ~ rating, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -317.91 -207.00  -98.01   80.30 1752.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    577.1      245.6   2.350   0.0208 *
## rating        -339.5      283.8  -1.197   0.2343  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 336.4 on 98 degrees of freedom
## Multiple R-squared:  0.0144, Adjusted R-squared:  0.004344 
## F-statistic: 1.432 on 1 and 98 DF,  p-value: 0.2343

The residual summary departs from what we would expect of normally distributed residuals in nearly every respect. The median (-98.01) is not close to zero, the first and third quartiles (-207.00 and 80.30) are not close in magnitude, and the minimum (-317.91) and maximum (1752.48) are far apart in magnitude, with the largest positive residual more than five times the size of the largest negative one.

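Next we can examine the coefficients. The standard error of the rating coefficient (283.8) is only about 1.2 times smaller than the estimate itself (-339.5), which signals a great deal of uncertainty in the slope; a well-determined coefficient would have a standard error several times smaller than its estimate. The correspondingly small t-value of -1.197 gives a p-value of 0.2343: if there were truly no linear relationship, we would still see a slope at least this extreme about 23.43% of the time. We therefore cannot reject the hypothesis that the slope is zero at the usual 0.05 level.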
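The t-value and p-value in the coefficient table can be reproduced by hand from the estimate, its standard error, and the residual degrees of freedom. The sketch below uses the rounded numbers printed in the summary, so the results should match the reported values only approximately.

```r
# Values copied from the summary(df_lm) output above
estimate  <- -339.5   # slope estimate for rating
std_error <- 283.8    # standard error of the slope
df_resid  <- 98       # residual degrees of freedom

# The t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(c(t = t_value, p = p_value), 3)
```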

Finally, we’ll take a look at the goodness of fit through the multiple R-squared value. At 0.0144, the model explains just 1.44% of the variation in median playtime. All signs so far point to this model being a poor fit.
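For a regression with a single predictor, the last three lines of the summary are tied together by simple identities: the F-statistic is the square of the slope's t-value, R-squared equals F / (F + residual df), and adjusted R-squared rescales it by the degrees of freedom. The sketch below checks these against the printed values; because the inputs are rounded, agreement is only approximate.

```r
# Values copied from the summary(df_lm) output above
t_value  <- -1.197
f_stat   <- 1.432
df_resid <- 98
n        <- df_resid + 2   # one slope plus one intercept were estimated

# With one predictor, F is the squared t-value of the slope
f_from_t <- t_value^2

# R-squared recovered from the F-statistic
r_squared <- f_stat / (f_stat + df_resid)

# Adjusted R-squared penalizes for the number of estimated parameters
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

round(c(F = f_from_t, R2 = r_squared, adjR2 = adj_r_squared), 4)
```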

Residual Analysis of the Model

Let us take a look at the individual residuals and what they tell us about the model. Here we use ggfortify’s autoplot to draw four diagnostic residual plots at once.

autoplot(df_lm)

Looking at the residuals vs fitted plot, we can see that the residuals are not evenly scattered around zero. They are concentrated towards the left side and begin to fan out as we move to the right, which suggests non-constant variance. Additionally, more residuals fall below the zero line than above it, yet many of those above the line are large in magnitude. These deviations mean that the model is not a great fit for our data.

The Q-Q plot reinforces the idea that our residuals are not normally distributed, and thus that our model is not a great fit for the data. Both the lower and upper tails of the residuals deviate from the normal reference line.
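A visual reading of the Q-Q plot can be backed up with a formal test. The sketch below is not part of the original analysis: it runs a Shapiro-Wilk test on simulated right-skewed values standing in for the model residuals, since the live data is not reproduced here. With the fitted model available, `resid(df_lm)` could be passed in place of `skewed`.

```r
set.seed(605)

# Simulated right-skewed "residuals": a stand-in for resid(df_lm), not the real data
skewed <- rexp(100, rate = 1) - 1

# Base R Q-Q plot with a reference line through the quartiles
qqnorm(skewed)
qqline(skewed)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(skewed)
sw$p.value
```

A p-value below 0.05 here would agree with what the curved tails of the Q-Q plot suggest, namely that the values are unlikely to come from a normal distribution.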

Taha Ahmad
