Final Project

Author

Dekker Spielman

The dataset I am using is a collection of every game on Steam, along with data about each one. This information was all scraped from SteamDB. For this project, I will look at what relationships exist between game age, copies sold, price, and general sentiment. Most of these values are given in the dataset, but general sentiment isn’t, so I measured it as the proportion of a game’s reviews that were positive. I chose this dataset in particular because I thought it would be an interesting topic to explore.

#Here I get all the libraries I need

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(viridis)

Loading required package: viridisLite

setwd("~/aaaworkingdirectory")
games <- read_csv("steam.csv")

Rows: 27075 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): name, developer, publisher, platforms, categories, genres, steamsp...
dbl  (9): appid, english, required_age, achievements, positive_ratings, nega...
date (1): release_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#First, I cut out all the columns that I'm sure I won't need - things like app id, genres, and other information I won't be using.

trimGames <- games |>
  select(name, release_date, developer, publisher, platforms, steamspy_tags, positive_ratings, negative_ratings, average_playtime, price, owners) |>
  mutate(positive_ratio = positive_ratings / (positive_ratings + negative_ratings))
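
To make the positive_ratio calculation concrete, here is a minimal sketch with made-up review counts (these numbers are hypothetical, not rows from the dataset). It also shows one edge case worth knowing about: a game with no reviews at all divides zero by zero, which R reports as NaN.

```r
# Toy review counts (hypothetical, not taken from steam.csv)
positive <- c(90, 5, 0)
negative <- c(10, 15, 0)

# Same formula as the mutate() call: share of all reviews that are positive
ratio <- positive / (positive + negative)
ratio  # 0.90 0.25 NaN
```

If the real data contains games with zero reviews, their ratio would be NaN and they would presumably be left out of any plot that uses positive_ratio.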


#Here I set up the linear model

lm_model <- lm(positive_ratings ~ negative_ratings + average_playtime + price, data = trimGames)

# Summary of model
summary(lm_model)


Call:
lm(formula = positive_ratings ~ negative_ratings + average_playtime + 
    price, data = trimGames)

Residuals:
     Min       1Q   Median       3Q      Max 
-1146869     -301     -263     -206  1285657 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      301.91754   94.18378   3.206  0.00135 ** 
negative_ratings   3.35561    0.01764 190.260  < 2e-16 ***
average_playtime   0.37520    0.04136   9.072  < 2e-16 ***
price            -10.80755    9.48279  -1.140  0.25442    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12260 on 27071 degrees of freedom
Multiple R-squared:  0.5831,    Adjusted R-squared:  0.5831 
F-statistic: 1.262e+04 on 3 and 27071 DF,  p-value: < 2.2e-16

According to the linear model, the fitted equation is positive_ratings = 301.9 + 3.36 × negative_ratings + 0.38 × average_playtime - 10.81 × price. In other words, the model predicts the number of positive reviews from the other three variables. The coefficient for price has a p-value of 0.25, so it is not statistically significant, and I can’t conclude that price is related to the number of positive reviews once the other variables are accounted for. The p-values for both negative_ratings and average_playtime, on the other hand, are below 2e-16, which shows that those relationships are very significant. The R-squared value of 0.58 means the model explains about 58% of the variation in positive reviews. That is a moderately good fit, although the very large minimum and maximum residuals show that a few games are predicted badly. What this tells us about the data is that games with more negative reviews and a higher average playtime tend to have more positive reviews. This makes sense to me, because games with more positive reviews will get more coverage, and will therefore get more negative reviews. Additionally, the higher production-value games that get more positive reviews are also the games that people will be willing to sink more time into.
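
As a worked example of how the coefficients combine into one prediction, here is a small sketch that plugs a made-up game into the fitted equation. The coefficients are copied from the summary() output above; the game itself is hypothetical, and average_playtime is in whatever units the dataset uses.

```r
# Coefficients copied from the summary(lm_model) output
coefs <- c(intercept = 301.91754,
           negative_ratings = 3.35561,
           average_playtime = 0.37520,
           price = -10.80755)

# Hypothetical game (made-up values, not a row from the dataset)
game <- c(negative_ratings = 1000, average_playtime = 500, price = 10)

# Intercept plus each coefficient times the matching predictor value
predicted_positive <- coefs[["intercept"]] + sum(coefs[names(game)] * game)
predicted_positive  # about 3737 positive reviews
```

With the fitted model object available, predict(lm_model, newdata = ...) should give the same kind of result without typing the coefficients by hand.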

Outside Research

A research paper by Andraž De Luisa and co-authors uses a more sophisticated model to predict the number of players a game will get in its second month based on a wide variety of factors (Source 2). In the conclusion, the authors mention that the model could be improved by including outside sources such as Google searches and Twitch views, which would help predict popularity spikes that are harder to measure. This paper helped me realize how much more could be added to my simple model. While a linear model could never fully represent the complexity of something like this, adding more predictors could make it more accurate.

#Here I make a graph of only games with more than 100,000 positive reviews to see if there are any trends among popular games

options(scipen = 999)
topGames <- trimGames |>
  filter(positive_ratings > 100000) |>
  ggplot(aes(release_date, positive_ratings + negative_ratings, text = paste("", name))) +
  geom_point(aes(color = positive_ratio, size = price)) +
  scale_color_viridis_c() +
  scale_y_log10() +
  theme_dark() +
  labs(x = "Release Date",
       y = "Total Reviews",
       color = "Ratio of Positive Reviews",
       caption = "SteamDB",
       title = "Popular Steam Games by Release Date",
       size = "Price")
ggplotly(topGames, tooltip = "text")

This visualization was pretty interesting to make, and I noticed several things while I was working on it. It was relatively easy to build, although I had trouble with a few things. One thing I think I did well was having the tooltip show the name of each game even though the name isn’t one of the variables the graph plots. I also put the y-axis on a log scale so that the cluster of points is easier to read. One thing I wish I could have done better is the size legend. The size of each point is supposed to represent the price of the game, but plotly drops the size legend when it renders the plot, and I wasn’t able to find a fix for that.

While working on this, I noticed a few patterns. Games with later release dates seem to have a lower ratio of positive reviews on average. Additionally, a surprising number of the games that made it into this graph were developed by Valve, the owner of Steam: Portal 2, Counter-Strike, Counter-Strike: Global Offensive, Team Fortress 2, Left 4 Dead 2, and Dota 2. I’m curious whether this is because Valve, as the owner of the platform, is able to promote its own games more than other games.

#This visualization encompasses all Steam games, and shows both the ratio of positive reviews and the number of copies sold.

#I found this fix on Stack Overflow - without it, the owner ranges would go in alphabetical order instead of numerical. (Source 3)
cust_order <- c("0-20000", "20000-50000", "50000-100000", "100000-200000", "200000-500000", "500000-1000000", "1000000-2000000", "2000000-5000000", "5000000-10000000", "10000000-20000000", "20000000-50000000", "50000000-100000000", "100000000-200000000")
trimGames$owners<- factor(trimGames$owners, levels = cust_order)
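
The ordering problem that this fix addresses can be seen with a tiny example. The sketch below uses three of the owner ranges from cust_order; the short vector itself is made up for illustration.

```r
# A few owner ranges stored as plain text (made-up sample)
owners <- c("100000-200000", "0-20000", "20000-50000", "0-20000")

# Default factor levels are sorted as text, so "100000-..." lands before "20000-..."
default_levels <- levels(factor(owners))
default_levels  # "0-20000" "100000-200000" "20000-50000"

# Passing levels explicitly keeps the ranges in numerical order
ordered_levels <- levels(factor(owners,
                                levels = c("0-20000", "20000-50000", "100000-200000")))
ordered_levels  # "0-20000" "20000-50000" "100000-200000"
```

Because ggplot2 follows the factor's level order for discrete fills, setting the levels this way is what puts the legend and the stacked colors in numerical order.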

gameHist <- trimGames |>
  ggplot(aes(positive_ratio, fill = owners)) +
  scale_fill_viridis_d() +
  geom_histogram() +
  theme_minimal() +
  labs(x = "Ratio of Positive Reviews",
       y = "Count",
       fill = "Copies Sold",
       caption = "SteamDB",
       title = "Games by Rating")
gameHist

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This visualization was also interesting to make, and I think it gives some insights. I was able to make this graph almost exactly the way I wanted. My only small qualm is that I wasn’t able to put the y-axis on a log scale. Whenever I tried, the bars changed in a strange way that couldn’t be explained just by the y scale changing. I think this graph also shows an interesting relationship: only games with few copies sold have 100% positive reviews (or 0%). Part of this is simply that games with few owners have few reviews, and a handful of reviews can easily all be positive or all be negative. Beyond that, as a good game gets more and more popular, its review score will tend to go down, because the larger its audience, the more likely it is to reach someone who doesn’t like it. There are of course exceptions. If you look very closely at the bottom of the 100% bar, there is a small sliver of the 2000-5000 color. This means some games have managed to get exclusively positive reviews even with a moderate level of popularity, which I thought was really interesting.
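
One possible explanation for the strange log-scale result, offered as an assumption rather than something I verified on this plot: ggplot2 appears to apply the scale transformation before it stacks the bars, so a stacked histogram on a log axis ends up adding log values instead of counts. The sketch below uses made-up counts for two ownership groups in a single bin to show how far off that can be.

```r
# Hypothetical counts for two ownership groups in one histogram bin
counts <- c(group_a = 1000, group_b = 10)

# What the top of the stacked bar should represent: the sum of the counts
true_total <- sum(counts)

# What stacking in log space implies: logs are added, so counts get multiplied
log_stacked_total <- 10^sum(log10(counts))

true_total         # 1010
log_stacked_total  # 10000
```

If that is what is happening, the heights of stacked bars on a log axis can't be read as totals. Bins with a count of zero are another likely source of trouble, since log10(0) is -Inf.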

#For this one, I took all games less than $50, and saw what proportion of each price category was successful

priceHist <- trimGames |>
  filter(price<50) |>
  ggplot(mapping = aes(price, fill = owners)) +
  scale_fill_viridis_d() +
  geom_histogram(bins = 10, position = "fill") +
  theme_linedraw() +
  labs(x = "Price ($)",
       y = "Proportion",
       fill = "Copies Sold",
       caption = "SteamDB",
       title = "Proportional Popularity of Games by Price")
priceHist

Finally, my last visualization. At first I had trouble with this one, since I had a different idea for what I wanted to do, which didn’t end up working out. I had wanted to show the top 100 games on Steam in a line, with each game’s height showing how many positive reviews it had. I didn’t go with that because the definition of the ‘top’ games is somewhat arbitrary, and the shape of the final graph changed drastically depending on which definition I used. The graph I ended up with was actually not too hard to make. I set the upper bound of price to $50 because there were very few games above that, so any bars on the far right would be very chunky and based on a few outliers.

What I like about this graph is that it shows off the two main forms of monetization that mainstream games use. Most games charge upfront, so you pay once and then you can play the game forever. That’s why a larger portion of games have many copies sold when their price is higher: higher production-value games cost more and are more likely to be widely successful. However, the graph also goes up a little in the leftmost bin. This is because of the free-to-play model. A larger portion of free games are widely successful compared with very cheap games, because free-to-play games have a low barrier to entry but can still have a high production value if they are monetized through in-game microtransactions. This trend is larger than it seems in the graph, because there is a huge quantity of free, low-quality games on the platform.

Sources:

Source 1: (N.d.). Steam. Retrieved May 11, 2025, from https://help.steampowered.com/public/shared/images/responsive/steam_share_image.jpg.

Source 2: De Luisa, A., Hartman, J., Nabergoj, D., Pahor, S., Rus, M., Stevanoski, B., Demšar, J., & Štrumbelj, E. (2021, October 6). Predicting the Popularity of Games on Steam. https://arxiv.org/abs/2110.02896

Source 3: R - custom reorder histogram fill categories in GGPLOT2 - stack overflow. Stack Overflow. (2020, August 7). https://stackoverflow.com/questions/63305764/custom-reorder-histogram-fill-categories-in-ggplot2