# load in required packages
library(tidyverse)
library(httr2)
library(jsonlite)
library(psych)
library(GGally)
library(ggpubr)
library(ggfortify)
In this project, we take data regarding games listed on Valve’s Steam platform for video games from the Steam Spy API. Specifically, we’re interested in the median playtime of the games with the top 100 users in the past 2 weeks from this data. With the additional variables of user rating, amount of game owners, and game price we also attempt to answer the question of if: The user ratings, reported by Steam, are related to median playtime. This is answered based on creating regression models from the data after processing, transforming, and removing the outliers from the data.
One model was a simple linear regression model with user rating predicting playtime, this model had a very low \(R^2\) of 0.0107 and the coefficient of user rating was not statistically significant at a p-value of 0.350, additionally many assumptions were violated. The second model was a multiple regression model with multiple variables predicting playtime, this model had a negative adjusted \(R^2\) of -0.0312 and the coefficient of user rating was not statistically significant at a p-value of 0.534, additionally many assumptions were violated. Thus we concluded with our data that the user ratings, reported by Steam, are not related to median playtime.
Valve corporation’s Steam is the most popular store front for digital computer games. Nearly every video game that is released these days which supports being played on computers will be listed for sale on Steam. Valve also provides robust API access to stats regarding both players and games for developers to utilize. This makes it ideal for gathering data on sales and engagement about video games.
We will be utilizing the Steam Spy API which collects data directly from the Steam Web API to gather the games which have the highest count of players in the past two weeks. Each case from this dataset represents one of the top 100 most played games (based on the amount of users that have launched the game) in the past two weeks. Thus there are 100 observations within our dataset.
Some of the information gathered includes the amount of user reviews a game has, both positive and negative, a general estimation of the amount of users that own a game, the price of the game, and the median playtime within the past two weeks between users who have played the game.
Using this information we want to find what variables lead to the game with the highest user engagement, i.e. median playtime. Thus, we start with our research question of: Considering the top 100 most played games in the past 2 weeks, are the user ratings, reported by Steam, related to median playtime?
We aim to answer this question by generating a regression model.
Here we initially load in the data from the API. We also process our data down to only have the columns which believe will be relevant for our analysis.
We create an estimation for the user rating displayed directly on the Steam marketplace which is the amount of positive reviews over total reviews, we convert playtime to hours from minutes, and we change the price from cents to dollars.
Then we preview our dataframe that is ready to explore.
# load data through a Steam Spy API request
req <- request(r"(steamspy.com/api.php)")
resp <- req %>%
req_url_query(`request` = 'top100in2weeks') %>%
req_perform()
# Process the response JSON into a list of lists
jlist <- resp %>%
resp_body_json(flatten= TRUE)
# Melt the list of lists down into a format of a tidy dataframe
df <- jlist %>%
map(as_tibble) %>%
reduce(bind_rows) %>%
# Select the columns which are relevant to our analysis
select(appid,name,positive_reviews = positive, negative_reviews = negative, owners, playtime = median_2weeks, positive_reviews = positive, price) %>%
# Calculate a new column for percent positive ratings
mutate(rating = round(positive_reviews/(positive_reviews + negative_reviews) ,3),
# Factorize the owner column which was previously stored as a string and reverse the ordering so the lowest owner amount would be the reference
owners = fct_rev(as_factor(owners)),
# Convert playtime from minutes to hours
playtime = round(playtime/60,2),
# Convert price to a numeric column and change it from cents to dollars
price = as.numeric(price)/100)
# Preview the data
knitr::kable(head(df))
| appid | name | positive_reviews | negative_reviews | owners | playtime | price | rating |
|---|---|---|---|---|---|---|---|
| 570 | Dota 2 | 1628885 | 346270 | 200,000,000 .. 500,000,000 | 12.23 | 0.00 | 0.825 |
| 730 | Counter-Strike: Global Offensive | 6365912 | 813590 | 50,000,000 .. 100,000,000 | 5.82 | 0.00 | 0.887 |
| 578080 | PUBG: BATTLEGROUNDS | 1241220 | 929663 | 50,000,000 .. 100,000,000 | 2.78 | 0.00 | 0.572 |
| 1063730 | New World | 177754 | 76441 | 50,000,000 .. 100,000,000 | 18.18 | 39.99 | 0.699 |
| 1172470 | Apex Legends | 529312 | 113259 | 50,000,000 .. 100,000,000 | 6.47 | 0.00 | 0.824 |
| 440 | Team Fortress 2 | 891253 | 59114 | 50,000,000 .. 100,000,000 | 5.55 | 0.00 | 0.938 |
Note that this analysis will be using data queried from the API on 5/4/2023. To maintain reproducibility of this project we have also uploaded the json file retrieved at this time period “top100in2weeks.json” on GitHub. The following code block allows for utilizing said json file:
# load data through a Steam Spy API request
url <- r"(https://raw.githubusercontent.com/alu-potato/DATA606/main/Final%20Project/top100in2weeks.json)"
# Process the response JSON into a list of lists
jlist <- read_json(url)
# Melt the list of lists down into a format of a tidy dataframe
df <- jlist %>%
map(as_tibble) %>%
reduce(bind_rows) %>%
# Select the columns which are relevant to our analysis
select(appid,name,positive_reviews = positive, negative_reviews = negative, owners, playtime = median_2weeks, positive_reviews = positive, price) %>%
# Calculate a new column for percent positive ratings
mutate(rating = round(positive_reviews/(positive_reviews + negative_reviews) ,3),
# Factorize the owner column which was previously stored as a string and reverse the ordering so the lowest owner amount would be the reference
owners = fct_rev(as_factor(owners)),
# Convert playtime from minutes to hours
playtime = round(playtime/60,2),
# Convert price to a numeric column and change it from cents to dollars
price = as.numeric(price)/100)
# Preview the data
knitr::kable(head(df))
We’ll use the summary() function to get an overview of the data we are working with here. Looking between the data as a whole, we notice that the means are almost always skewed away from the median. This is especially prevalent in the reviews and the playtime with strong rightward skews. This means that there are large outliers within those categories that we might want to deal with to prevent our regression model from not being generalizable.
However, if we compare the positive and negative review count categories with our transformed rating category, we can now see that the skew is much less noticeable with only less than a 2 percent difference between median and mean. Meaning we should use this category over the other two in our model.
The factors within owners seem relatively well distributed around games with 10 to 20 million owners. The outliers to those games with higher ownership should not matter too much for a categorical variable.
Looking at price we can see that even up to the 1st quartile games that are free are in the top 100 most users for the past two weeks. We expect this to be relevant in determining playtime later. The 3rd quartile games only go up to $20 as well, while the typical price for a new release these days is $60-$70.
Evaluating our response variable of playtime, we can see something odd right off the bat. There are games with no playtime within the top 100 most played games of the past two weeks as our minimum. Although, this is possible to be accounted for by many users launching the games and closing them right away leading to many users counted as “playing” a game, it should not be possible for these games to eclipse those with actual playtime recorded in them. This anomaly seems more likely to be an error in the data and should be dealt with as well.
The mean of playtime being 9.3 hours and the median being 3.6 hours also indicates a few games being large outliers in average median playtime such as the game with 160.3 hours of median playtime per user.
df %>%
select(-appid, -name) %>%
summary() %>%
knitr::kable()
| positive_reviews | negative_reviews | owners | playtime | price | rating | |
|---|---|---|---|---|---|---|
| Min. : 2155 | Min. : 270 | 5,000,000 .. 10,000,000 :29 | Min. : 0.000 | Min. : 0.00 | Min. :0.3570 | |
| 1st Qu.: 84270 | 1st Qu.: 9914 | 10,000,000 .. 20,000,000 :43 | 1st Qu.: 1.812 | 1st Qu.: 0.00 | 1st Qu.:0.8023 | |
| Median : 209310 | Median : 23248 | 20,000,000 .. 50,000,000 :21 | Median : 3.600 | Median : 9.99 | Median :0.8785 | |
| Mean : 359402 | Mean : 58056 | 50,000,000 .. 100,000,000 : 6 | Mean : 9.270 | Mean :14.39 | Mean :0.8505 | |
| 3rd Qu.: 458521 | 3rd Qu.: 55489 | 200,000,000 .. 500,000,000: 1 | 3rd Qu.: 7.305 | 3rd Qu.:19.99 | 3rd Qu.:0.9403 | |
| Max. :6365912 | Max. :929663 | NA | Max. :160.320 | Max. :69.99 | Max. :0.9850 |
Looking at a histogram and boxplot for rating we can see that despite the mean and median being close from our summary statistics, there is still a leftward skew going on. In this case we have 5 outliers towards the left where despite the user ratings being low the game still manages to find itself on the top 100 most played. Here we will consider pruning the outlier with a user rating of less than 0.4, as it is completely disconnected from the rest of the user ratings. Thus not being a good input to take for a regression model.
par(mfrow=c(1,2))
ggplot(df, aes(x=rating)) + geom_histogram(binwidth = .025, na.rm = TRUE, color = "black") +
xlim(c(0.3,1))+
ggtitle("User Rating Distribution")
ggplot(df, aes(x=rating)) + geom_boxplot(fill = "grey") +
ggtitle("User Rating Spread") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Looking at a histogram and boxplot for median playtime distribution we can see that it matches our findings from the summary statistics in that it is clustered towards the median playtime of 3.6 hours, but there are extreme outliers beyond the 50 hours of playtime mark. Since these are also completely disconnected from the rest of the rest of the distribution they likely will not tend to be an accurate representation of the population. Considering it logically as well, it simply doesn’t make sense that a game shared between millions of players would have a median playtime of 150+ hours within two weeks. That’s close to a whole week’s worth of time, including nights, just playing the game.
par(mfrow=c(2,1))
ggplot(df, aes(x=playtime)) + geom_histogram(bins = 50, na.rm = TRUE, color = "black") +
ggtitle("User Playtime Distribution")
ggplot(df, aes(x=playtime)) + geom_boxplot(fill = "grey") +
ggtitle("User Playtime Spread") +
theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Before removing our outlier for rating, let us take a look at it and surmise why this outlier exists in our data. Filtering for the outlier we can see that the game is Battlefield 2042, and now the outlier makes a little bit more sense. The game was supposed to be a major and solid release for the Battlefield franchise, but ended up releasing in a very poor state regarding performance, gameplay, and bugs. Thus, it was bombarded with many negative reviews immediately after release and the reviews never recovered. Despite that, updates over time have led to a state of the game that facilitates a healthy population of users.
df %>%
filter(rating < 0.40) %>%
knitr::kable()
| appid | name | positive_reviews | negative_reviews | owners | playtime | price | rating |
|---|---|---|---|---|---|---|---|
| 1517290 | Battlefield 2042 | 67407 | 121511 | 10,000,000 .. 20,000,000 | 8.63 | 59.99 | 0.357 |
Still, there’s no compelling reason to leave this in for our regression model when it diverges so far from the rating baseline.
df <- df %>%
filter(!rating < 0.40)
Taking a look at the games with no playtime, there doesn’t seem to be much relating these games together besides most being older games.
df %>%
filter(playtime == 0) %>%
select(name) %>%
knitr::kable()
| name |
|---|
| Half-Life 2: Lost Coast |
| Grand Theft Auto IV: Complete Edition |
| Serena |
| Half-Life 2: Deathmatch |
| Ring of Elysium |
| Black Squad |
| Guacamelee! Super Turbo Championship Edition |
| The Tiny Bang Story |
| Ricochet |
| Castle Crashers |
| Deathmatch Classic |
Since there’s no obvious pattern as to why the games wouldn’t have any median playtime we will remove this believing it is erroneous data.
df <- df %>%
filter(!playtime == 0)
Taking a look at the games with high playtime, 3 out of 4 of the games here are free to play. However, that doesn’t mean much considering we know that at least 25% of the games we initially had were free from the IQR of price. These games are also popular and have many players.
df %>%
filter(playtime > 50) %>%
knitr::kable()
| appid | name | positive_reviews | negative_reviews | owners | playtime | price | rating |
|---|---|---|---|---|---|---|---|
| 1599340 | Lost Ark | 137885 | 54131 | 20,000,000 .. 50,000,000 | 78.62 | 0 | 0.718 |
| 291480 | Warface | 54760 | 26554 | 10,000,000 .. 20,000,000 | 56.78 | 0 | 0.673 |
| 433850 | Z1 Battle Royale | 114954 | 92032 | 10,000,000 .. 20,000,000 | 160.32 | 0 | 0.555 |
| 466240 | Deceit | 68878 | 18172 | 5,000,000 .. 10,000,000 | 160.32 | 0 | 0.791 |
The fact that these games are popular makes it even stranger that their median playtime is so high. Since something isn’t make sense with these data points, we will also remove them.
df <- df %>%
filter(!playtime > 50)
After removing outliers and suspected erroneous data we are now down from 100 observations to 84 observations which will impact our adjusted \(R^2\).
Going back to visualizing our data, we build a scatter plot for user rating against playtime. The results are not promising with a correlation of -0.1 which suggests little to no negative correlation. We can also notice that the variance of playtime changes as the rating increases. Additionally, there are some outliers with more than 20 hours of playtime present as we have accepted those as not being too extreme. With our low correlation, we would be violating the assumption
ggplot(df, aes(x=rating,y=playtime)) +
geom_point(na.rm = TRUE) +
geom_smooth(formula = y ~ x,method=lm, na.rm = TRUE, se = FALSE) +
stat_cor(aes(label = after_stat(r.label))) +
ggtitle("User Rating Against Playtime")
Next, we build a scatter plot for price against playtime. The results are even less promising with a correlation of -0.065 which suggests basically no negative correlation. We can also notice that the variance of playtime changes as the price decreases. This is as the majority of the top played games are either free or low in price. We might be able to counteract this with a square root transformation to price which could lead to closer data points and a greater magnitude of correlation.
par(mfrow=c(1,2))
ggplot(df, aes(x=price,y=playtime)) +
geom_jitter(na.rm = TRUE) +
geom_smooth(formula = y ~ x,method=lm, na.rm = TRUE, se = FALSE) +
stat_cor(aes(label = after_stat(r.label)), label.x = 60) +
ggtitle("Price Against Playtime")
ggplot(df, aes(x=sqrt(price),y=playtime)) +
geom_jitter(na.rm = TRUE) +
geom_smooth(formula = y ~ x,method=lm, na.rm = TRUE, se = FALSE) +
stat_cor(aes(label = after_stat(r.label)), label.x = 7.5) +
ggtitle("Root Price Against Playtime")
We can confirm that our correlation has been doubled to -0.12, which while is still very weak, it is better than we had before. Thus, we insert it into our dataframe.
df <- df %>%
mutate(root_price = sqrt(price), .keep = "unused")
Finally, we’ll take a look at the pair plots for determining if we have colinearity between variables. As we have no correlations above 0.5 between variables, we can consider there being no colinearity here.
df %>%
select(-appid, -name, -negative_reviews, -positive_reviews) %>%
ggpairs()
The next step to take is building our regression models and then analyzing them.
We first tackle our research question on determining if the user ratings, reported by Steam, are related to median playtime.
We utilize R’s built in linear model generation to get our linear model below:
df_slm <- lm(playtime ~ rating, data = df)
summary(df_slm)
##
## Call:
## lm(formula = playtime ~ rating, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.965 -3.135 -1.452 1.839 21.953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.848 4.657 2.115 0.0375 *
## rating -5.020 5.341 -0.940 0.3500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.24 on 82 degrees of freedom
## Multiple R-squared: 0.01066, Adjusted R-squared: -0.001406
## F-statistic: 0.8835 on 1 and 82 DF, p-value: 0.35
With a y-intercept of 9.85 and a slope of -5.02, we get the regression model of:
\(\hat{playtime} = -5.02*rating + 9.85\)
This means that for every percentage point the rating goes up, the playtime goes down by .05 hours with a baseline of 9.85 hours played for a game with a 0% user rating. Oddly enough, we seem to have a negative relationship between playtime and a game’s user rating.
Next we can examine the information regarding the coefficients. The standard error we have is barely smaller than the estimate of the rating coefficient, which is not indicative of a good model’s variability. Additionally, we have a small t-value that leaves the probability of any linear relationship being from chance of 35.0%.
Finally we’ll take a look at the goodness of fit with the multiple R-squared value. At 0.0107 we know that the model accounts for just 1.07% of variation in playtime based on the user rating. All signs so far point to our model not being fitting in this case.
Let us take a look at the individual residuals and what they tell us with the model. Here we utilize ggfortify’s autoplot capabilities to plot 4 diagnostic residual plots at once.
autoplot(df_slm)
Looking at the residuals vs fitted plot we can see that our data is not distributed well. The residuals are concentrated towards the left side and begin to fan out as we move to the right. These deviations mean that the model is not a great fit for our data as we have violated homoscedasticity.
Generating a qq plot of our residuals reinforces the idea that our residuals do not seem to be normally distributed, and thus our model is not a great fit for the data. Both the lower and upper residual data deviates from normality, with the deviations towards the top quantiles being especially egregious. Thus, we have violated the assumption of residual normality.
Going back to our scatterplot, the linearity between playtime and rating is dubious. The correlation is low and the spread is more of a cone shape than a line. We will consider the assumption of linearity violated as well.
The one assumption that we have not violated here is independent observation as what games the different users will be playing is not going to be dependent on another game.
From our analysis here, we have come up with a simple linear regression model that was not appropriate for our data. As we were unable to create a good model just by using user rating and median playtime we are able to answer our research question here:
No the user ratings, reported by Steam, are not related to median playtime based on creating a regression model from the data.
Although our research question is answered, we still have the data to attempt to generate some type of relationship between our other predictor variables and median playtime. We will now create a multiple regression model to see if we can make a better model including the other variables.
We utilize R’s built in linear model generation to get our linear model below:
df_mlm <- lm(playtime ~ rating + root_price + owners , data = df)
summary(df_mlm)
##
## Call:
## lm(formula = playtime ~ rating + root_price + owners, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.007 -3.029 -1.125 1.440 21.260
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3089 4.9043 1.898 0.0614 .
## rating -3.5120 5.6217 -0.625 0.5340
## root_price -0.1808 0.2299 -0.786 0.4341
## owners10,000,000 .. 20,000,000 -0.6732 1.4234 -0.473 0.6376
## owners20,000,000 .. 50,000,000 -0.2170 1.6912 -0.128 0.8982
## owners50,000,000 .. 100,000,000 0.6208 2.4875 0.250 0.8036
## owners200,000,000 .. 500,000,000 5.8185 5.4904 1.060 0.2926
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.317 on 77 degrees of freedom
## Multiple R-squared: 0.0433, Adjusted R-squared: -0.03125
## F-statistic: 0.5808 on 6 and 77 DF, p-value: 0.7445
This gives us a regression model of:
\[ \hat{playtime} = -3.512*rating - 0.181*\sqrt{price} - 0.673\text{( if owners 10m-20m)} - \\ 0.217\text{( if owners 20m-50m)} + 0.621\text{( if owners 50m-100m)} + 5.818\text{( if owners 100m-500m)} + 9.309 \]
With a y-intercept of 9.3 we are told that a game would have close to 9.3 hours of median playtime if there were no reviews at all.
The rating coefficient tells us that for every percentage point a game is rated, the median playtime decreases by 0.035 hours. However, our p-value is not significant at 0.534 which tells us rating is not a good indicator for playtime.
The root_price coefficient tells us that for every root dollar a game costs, the median playtime decreases by 0.181 hours. Yet again, our p-value at 0.434 is not significant which means root_price should not be used to determine playtime.
For each owner coefficient we know that if the game falls into one of these four categories of the amount of owners, it will either increase or decrease the median playtime in hours by the coefficient. However, none of the coefficients have a significant p-value. Meaning that these changes could very well be because of random chance.
Looking at the goodness of fit with the adjusted R-squared value. At -0.0312 we know that the model has taken so high of a penalty from our extra variables without gaining any benefit, that randomly guessing might as well be better.
Let us take a look at the individual residuals and what they tell us with the model. Here we utilize ggfortify’s autoplot capabilities to plot 4 diagnostic residual plots at once.
autoplot(df_mlm)
Looking at the residuals vs fitted plot we can see that our data is not distributed well. The residuals are concentrated towards the left side and begin to fan out as we move to the right again. These deviations mean that the model is not a great fit for our data as we have violated homoscedasticity.
Generating a qq plot of our residuals shows that our residuals do not seem to be normally distributed, and thus our model is not a great fit for the data. As the upper residual data deviates from normality quite a bit. Thus, we have violated the assumption of residual normality.
Going back to our pair plot, the linearity between playtime and rating is dubious along with root_price and rating. The correlation is low and the spread is more of a cone shape than a line in both cases. We will consider the assumption of linearity violated as well because of these.
We retain an assumption that we have not violated, independent observation as what games the different users will be playing is not going to be dependent on another game.
The final assumption we check for multiple regression is colinearity.
Going back to the pairplot we see that between predictor variables there
is low correlation, thus we pass this assumption check.
From our analysis here, we have come up with a multiple linear regression model that was not appropriate for our data. The adjusted R^2 was in the negatives and none of our coefficients were statistically significant. Additionally, three of our assumptions have been violated. In the end, this model was not much of an improvement over the previous model if at all.
After going through our regression analysis we can answer our research question with our data. The user ratings, reported by Steam, are not related to median playtime based on creating a regression model from the data. This is because within our regression models created, user rating as a dependent variable is never a statistically significant predictor of median playtime.
Knowing this might be useful to any future researchers who are interested in what sort of variables increase player retention and playtime. Despite, not being able to create a useful model we now know that in the future someone attempting the same analysis should attempt different methods to do so.
We were limited with the amount of data we had in both the variables used as predictors and the sample size as a whole. Any further research should be attempted on a larger dataset, or with different methods of transforming the existing data.