knitr::include_graphics("~/Documents/Data 110/steam home.jpg")
Source: https://store.steampowered.com/
The topic of my analysis is the “Economics of Play”, which focuses on PC video games (specifically those sold on Steam). The raw dataset, games.csv, was web scraped from Steam and contains 111,452 rows and 39 columns, including price, peak_ccu, genres, and required_age. To clean the data, I kept only games with recorded playtime and a price above zero, used filter to remove missing genre entries, and mutated the complex genres string into a three-level game_type variable (2 Player, Multiplayer, or Singleplayer). The cleaned data frame, games3, has 12 variables. This topic is meaningful to me because I have used Steam every day for years, and I think it is interesting to learn how the games I like compare to others.
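library(tidyverse)  # read_csv(), dplyr verbs, str_detect(), ggplot2
library(janitor)    # clean_names()
library(plotly)     # ggplotly()
games <- read_csv("~/Documents/Data 110/games.csv")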
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
## Rows: 111452 Columns: 39
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (19): Name, Release date, Estimated owners, Supported languages, Full au...
## dbl (16): AppID, Peak CCU, Required age, Price, DiscountDLC count, About the...
## lgl (4): Mac, Linux, Metacritic score, Achievements
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
games <- games |>
  clean_names()
games2 <- games |>
  filter(median_playtime_forever != 0)
gamesc <- games2 |>
  select(name, price, peak_ccu, positive, negative, required_age, genres) |>
  filter(price > 0) |>
  mutate(total_reviews = positive + negative,
         review_ratio = positive / total_reviews)
# This step removes free-to-play games and creates new variables to better understand user sentiment.
Further filtering to create the ideal dataset
games3 <- gamesc |>
  filter(!is.na(genres)) |>
  # Creating a simplified 'Game Type' variable.
  mutate(game_type = case_when(
    str_detect(genres, "Co-op") ~ "2 Player",
    str_detect(genres, "Multi-player|Online") ~ "Multiplayer",
    TRUE ~ "Singleplayer"
  )) |>
  # Mutate: Transform peak_ccu to log scale because popularity often follows a power law
  mutate(log_ccu = log(peak_ccu + 1)) |>
  # Mutate: Treat required_age as a categorical factor
  mutate(age_rating = as.factor(required_age))
head(games3)
## # A tibble: 6 × 12
## name price peak_ccu positive negative required_age genres total_reviews
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Far Cry® 5 60.0 2164 0 100620 17 Singl… 100620
## 2 Forza Hori… 60.0 7571 0 122539 0 Singl… 122539
## 3 Max Payne 3.49 49 0 9516 17 Singl… 9516
## 4 Sally Face… 2.99 133 0 11739 0 Singl… 11739
## 5 Automobili… 24.0 358 0 4278 0 Singl… 4278
## 6 Oxygen Not… 25.0 7507 0 82902 0 Singl… 82902
## # ℹ 4 more variables: review_ratio <dbl>, game_type <chr>, log_ccu <dbl>,
## # age_rating <fct>
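Because case_when() returns the value for the first condition that matches, the order of the rules above matters: a game whose tags include both "Co-op" and "Multi-player" ends up labeled "2 Player". A minimal sketch on made-up tag strings (the `toy` data frame and its values are hypothetical, not taken from the Steam data):

```r
library(dplyr)
library(stringr)

# Hypothetical tag strings, for illustration only
toy <- tibble(
  genres = c("Single-player,Co-op,Multi-player",
             "Single-player,Multi-player",
             "Single-player")
)

toy <- toy |>
  mutate(game_type = case_when(
    str_detect(genres, "Co-op") ~ "2 Player",                   # checked first
    str_detect(genres, "Multi-player|Online") ~ "Multiplayer",  # only reached if no Co-op tag
    TRUE ~ "Singleplayer"                                       # everything else
  ))

toy$game_type
```

If co-op games should instead count as Multiplayer whenever both tags are present, swapping the first two rules would be one way to do that.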
model <- lm(log_ccu ~ price + total_reviews + game_type, data = games3)
summary(model)
##
## Call:
## lm(formula = log_ccu ~ price + total_reviews + game_type, data = games3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6728 -1.2380 0.0956 1.3515 7.7515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.586e+00 1.172e-01 39.126 < 2e-16 ***
## price 5.837e-02 2.922e-03 19.977 < 2e-16 ***
## total_reviews 1.184e-05 6.956e-07 17.018 < 2e-16 ***
## game_typeMultiplayer -2.717e-01 1.480e-01 -1.835 0.0666 .
## game_typeSingleplayer -5.522e-01 1.062e-01 -5.201 2.18e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.064 on 2073 degrees of freedom
## Multiple R-squared: 0.2871, Adjusted R-squared: 0.2857
## F-statistic: 208.7 on 4 and 2073 DF, p-value: < 2.2e-16
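The multiple linear regression model predicts the log of peak concurrent users, log(peak_ccu + 1), from price, total reviews, and game type. The adjusted R^2 of 0.2857 means these three predictors explain about 29% of the variance in popularity. Price and total_reviews both have positive, highly significant coefficients (p < 2e-16): holding the other variables constant, each additional dollar in price is associated with roughly 6% more peak concurrent users. R uses the first factor level, “2 Player”, as the baseline for game_type, so the other two coefficients are comparisons with co-op games. Singleplayer games have a significantly lower log peak (-0.552, p < 0.001), while the Multiplayer coefficient (-0.272, p = 0.0666) is not significant at the 0.05 level, so the model does not show a clear difference between multiplayer and co-op games.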
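Since the response is on a log scale, a coefficient b can be read as an approximate percentage change of (exp(b) - 1) × 100 in peak_ccu + 1. A minimal sketch, with the estimates typed in by hand from the summary output above (the vector name `est` is only for illustration; with the fitted object, coef(model) could be used instead):

```r
# Estimates copied from the regression summary
est <- c(price = 0.05837,
         game_typeMultiplayer = -0.2717,
         game_typeSingleplayer = -0.5522)

# Convert log-scale coefficients to approximate percent changes
pct_change <- (exp(est) - 1) * 100
round(pct_change, 1)
```

This works out to about +6% per additional dollar, and about -42% for Singleplayer games relative to the 2 Player baseline.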
plob <- ggplot(games3, aes(x = price, y = peak_ccu, color = game_type)) +
  geom_point(alpha = 0.6, size = 2) +
  # Use a log scale for Y axis because player counts vary wildly
  scale_y_log10(labels = scales::comma) +
  # Custom Colors
  scale_color_manual(values = c(
    "Singleplayer" = "#E69F00", # Orange
    "Multiplayer" = "#56B4E9", # Blue
    "2 Player" = "#009E73" # Green
  )) +
  # Non-default Theme
  theme_minimal() +
  labs(
    title = "Impact of Price on Peak Player Counts",
    subtitle = "Multiplayer games maintain higher concurrency across price points",
    x = "Price (USD)",
    y = "Peak Concurrent Users (Log)",
    color = "Game Mode",
    caption = "Data Source: Steam Game data"
  ) +
  theme(legend.position = "bottom")
plob
## Warning in scale_y_log10(labels = scales::comma): log-10 transformation
## introduced infinite values.
This scatter plot shows the relationship between game price and peak
concurrent users, with peak counts on a log scale. Co-op games are
separated into a distinct ‘2 Player’ category (shown in green) so they
can be compared with other multiplayer titles (blue). While
Singleplayer games (orange) span the entire price range, 2 Player games
appear to have higher engagement floors, which may suggest that the
social aspect of co-op gaming drives consistent player counts.
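The warning above arises because log10(0) is -Inf, so any game with a peak_ccu of 0 cannot be placed on a log axis. A hedged sketch of two possible workarounds, using a small made-up data frame (`toy` and its values are hypothetical): drop the zero rows before plotting, or plot log(peak_ccu + 1), the same transformation used for log_ccu, which stays finite at zero.

```r
library(ggplot2)

# Hypothetical values, for illustration only
toy <- data.frame(price = c(5, 10, 20, 60),
                  peak_ccu = c(0, 12, 340, 7500))

log10(toy$peak_ccu)  # the zero becomes -Inf, which triggers the warning

# Option 1: keep only games with at least one concurrent user
toy_pos <- subset(toy, peak_ccu > 0)
p1 <- ggplot(toy_pos, aes(x = price, y = peak_ccu)) +
  geom_point() +
  scale_y_log10(labels = scales::comma)

# Option 2: keep every row and plot log(peak_ccu + 1) on a linear axis
p2 <- ggplot(toy, aes(x = price, y = log1p(peak_ccu))) +
  geom_point() +
  labs(y = "log(peak_ccu + 1)")
```

Option 1 changes which games appear in the plot, while Option 2 keeps them all but makes the axis labels harder to read as raw player counts.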
plop <- ggplot(games3, aes(x = age_rating, y = total_reviews, fill = age_rating)) +
  geom_boxplot(outlier.shape = NA) + # Hide outliers to keep plot clean
  scale_y_log10(labels = scales::label_number()) +
  # Custom Colors
  scale_fill_viridis_d(option = "plasma") +
  # Non-default Theme
  theme_light() +
  labs(
    title = "Game Engagement Distribution by Age Rating",
    x = "Required Age Rating",
    y = "Total Reviews (Log Scale)",
    fill = "Age Rating"
  )
ggplotly(plop)
## Warning in scale_y_log10(labels = scales::label_number()): log-10
## transformation introduced infinite values.
## Warning: Removed 59 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
This interactive boxplot displays the distribution of total reviews across the required-age categories in the data (0, 13, 17, etc.), with review counts on a log scale. Hovering over a box shows its exact median and quartile values. Games rated for mature audiences (17+) appear to have higher median review counts than those rated for everyone (0). Per the warning above, 59 rows with non-finite values on the log scale (such as games with zero total reviews) are dropped from the plot.
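The medians read off the boxplot can also be checked numerically with a grouped summary. A minimal sketch on hypothetical values (the `toy` data frame is made up; in the analysis the same pipeline would start from games3):

```r
library(dplyr)

# Hypothetical values, for illustration only
toy <- tibble(
  age_rating = factor(c("0", "0", "0", "17", "17", "17")),
  total_reviews = c(10, 200, 3000, 50, 900, 12000)
)

# Count of games and median review count per age rating
medians <- toy |>
  group_by(age_rating) |>
  summarise(n_games = n(),
            median_reviews = median(total_reviews),
            .groups = "drop")
medians
```

Including the group sizes alongside the medians helps show whether an age category has enough games for its median to be meaningful.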
The video game industry has seen explosive growth in the last decade. According to a report by Newzoo, the global games market generated over $180 billion in 2023, with PC gaming accounting for a significant portion. Multiplayer games, in particular, drive long-term engagement through “Games as a Service” (GaaS) models, where continuous updates keep concurrent user counts high.
Sources:
Wijman, T. (2023, August 8). Explore the global games market in 2023. Newzoo. https://newzoo.com/resources/blog/explore-the-global-games-market-in-2023
Roundhill Investments. (2020, October 30). The Rise of Gaming as a Service. Roundhill Investments. https://blog.roundhillinvestments.com/the-rise-of-gaming-as-a-service
The visualizations highlight some distinct patterns in the gaming market. The scatter plot shows no obvious straight-line relationship between price and peak players, although the regression does find a small positive association between price and log peak users once total reviews and game type are accounted for. Game mode matters as well: co-op and multiplayer games tend to reach higher peak user counts than singleplayer games, which doesn't surprise me because multiplayer games, in my experience, have more longevity. The interactive boxplot further suggests that mature-rated games (age 17+) tend to have more reviews, possibly because adults are more likely to leave reviews than kids. I was also not surprised that price is only weakly related to success; many lower-priced games achieved massive popularity. I think this reflects the impact Fortnite had on monetization, with many games now using free-to-play models, although free games themselves were filtered out of this analysis. One limitation was the positive and negative review columns, which contained unexpected values (for example, zero positive reviews for heavily reviewed games such as Far Cry® 5), preventing the sentiment analysis that I had originally hoped to include.
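One way to gauge how widespread the review-column problem is would be to count rows that have zero positive reviews but many negative ones. A rough sketch on hypothetical rows that mimic the pattern in head(games3) (the `toy` data and the cutoff of 1000 are assumptions chosen for illustration):

```r
library(dplyr)

# Hypothetical rows, for illustration only
toy <- tibble(
  name = c("Game A", "Game B", "Game C"),
  positive = c(0, 0, 150),
  negative = c(100620, 9516, 20)
)

# Flag rows where a heavily reviewed game reports no positive reviews
suspect <- toy |>
  filter(positive == 0, negative > 1000)
nrow(suspect)
```

A large share of flagged rows would point to a problem with how the columns were read in, and not to genuinely negative sentiment.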