Randy Koliha
Isaiah Sarria
Owen Telis
John Levitt
Steam is a popular online video game distributor with many complementary features. Steam hosts community discussion threads, downloads of community-made modifications for games sold on the platform, social-media-like user profiles and friends lists that display a user's overall or recent video game activity, and sales events based on holidays or unique themes.
We want to analyze a dataset regarding Steam games to dive deep into the behavior of gamers on the platform.
Loading any needed packages.
library(tidyverse)
library(scales)
library(directlabels)
The games.csv dataset contains data collected for games on Steam, the popular game-license selling platform, for each month from July 2012 to February 2021. The dataset has records for 1,258 games (including some miscellaneous pieces of software) and includes their titles, monthly peak players, monthly average concurrent players, monthly gain or loss in players compared to the previous month, and the average expressed as a percentage of the peak. It should also be noted that the data was scraped from Steam by Steam Charts and only tracks relevant games on Steam.
# Storing dataset in `games`
games <- read.csv("games.csv")
# Cleaning it up: add a numeric month column so months sort chronologically
games <- games %>%
mutate(number_of_month = match(month, month.name)) %>%
select(gamename,year,number_of_month,everything()) %>%
arrange(desc(year), number_of_month)
games
summary(games)
gamename           year           number_of_month   month              avg                 gain                peak              avg_peak_perc
Length:83631       Min.   :2012   Min.   : 1.000    Length:83631       Min.   :      0.0   Min.   :-250249.0   Min.   :      0   Length:83631
Class :character   1st Qu.:2016   1st Qu.: 3.000    Class :character   1st Qu.:     53.1   1st Qu.:    -38.2   1st Qu.:    137   Class :character
Mode  :character   Median :2018   Median : 7.000    Mode  :character   Median :    203.1   Median :     -1.6   Median :    500   Mode  :character
                   Mean   :2017   Mean   : 6.546                       Mean   :   2765.7   Mean   :    -10.3   Mean   :   5470
                   3rd Qu.:2019   3rd Qu.:10.000                       3rd Qu.:    764.0   3rd Qu.:     22.2   3rd Qu.:   1727
                   Max.   :2021   Max.   :12.000                       Max.   :1584886.8   Max.   : 426446.1   Max.   :3236027
                                                                                           NA's   :1258
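The summary lists avg_peak_perc as a character column rather than a numeric one, presumably because the values carry a percent sign. Below is a minimal sketch of converting such a column to numbers. The "45.25%" style of the toy values is an assumption about how games.csv stores them, and the toy avg and peak vectors are made up for illustration.

```r
# Toy values standing in for the avg_peak_perc column; the "NN.NN%" format
# is an assumption about how the percentages are stored in games.csv.
avg_peak_perc <- c("45.25%", "50%", "33.3333%")

# Strip the literal percent sign and convert the remainder to numeric.
avg_peak_num <- as.numeric(sub("%", "", avg_peak_perc, fixed = TRUE))

# The same ratio can be recomputed from avg and peak as a sanity check.
avg  <- c(181, 50, 1)
peak <- c(400, 100, 3)
recomputed <- round(100 * avg / peak, 4)

avg_peak_num
recomputed
```

Once the column is numeric, it can be summarised and plotted like avg and peak.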
# Finding the distinct number of `gamenames`
dist_g <- games %>%
distinct(gamename)
dist_g
# in alpha order
dist_g_alpha <- games %>%
distinct(gamename) %>%
arrange(gamename)
dist_g_alpha
There are 1,258 distinct games in the dataset.
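The count itself can be read off directly instead of from the printed table. The sketch below uses a small made-up data frame (not the real data) to show two equivalent ways of counting distinct titles with dplyr.

```r
library(dplyr)

# Toy stand-in for `games`: three distinct titles, some with several monthly rows.
toy_games <- data.frame(gamename = c("A", "A", "B", "C", "C", "C"))

# Option 1: count the rows of the distinct() result.
n_from_distinct <- nrow(distinct(toy_games, gamename))

# Option 2: count distinct values of the column directly.
n_from_n_distinct <- n_distinct(toy_games$gamename)

n_from_distinct
n_from_n_distinct
```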
g_payday <- games %>%
filter(gamename == "PAYDAY 2") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_payday
# The chunk above builds the PAYDAY 2 data frame; year becomes a factor so each year gets its own discrete color
Years_Payday2 <- as.factor(g_payday$year)
ggplot(g_payday,aes(x = number_of_month,y = avg, group = year, color = Years_Payday2)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Payday 2: Average Players by Year")
# If scale_colour_manual() is removed, ggplot2 falls back to its automatic color palette
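The per-game chunks in this report all follow the same filter-then-plot pattern, so they could be wrapped in one function. The sketch below is a hypothetical helper (plot_game_by_year is not part of the original analysis) that assumes the column names used in `games`. It leaves out the direct labels and the manual palette to stay short, so its output will not look identical to the plots in this report.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: one line per year for a single title, with month on
# the x-axis and average players on the y-axis.
plot_game_by_year <- function(data, title_name, years = 2012:2021) {
  one_game <- data %>%
    filter(gamename == title_name, year %in% years)

  ggplot(one_game, aes(x = number_of_month, y = avg,
                       group = year, color = factor(year))) +
    geom_line() +
    geom_point() +
    scale_x_continuous(breaks = 1:12) +
    labs(title = paste0(title_name, ": Average Players by Year"),
         x = "Months", y = "Average Players", color = "Year")
}

# Toy data with the same column names as `games` (values are made up).
toy_games <- data.frame(
  gamename = "Toy Game",
  year = rep(c(2019, 2020), each = 3),
  number_of_month = rep(1:3, times = 2),
  avg = c(100, 120, 90, 150, 170, 160)
)

p <- plot_game_by_year(toy_games, "Toy Game")
```

With the real data, a call such as plot_game_by_year(games, "PAYDAY 2") would stand in for one of the repeated chunks.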
Why Payday 2 Spiked In 2017
g_pubg <- games %>%
filter(gamename == "PLAYERUNKNOWN'S BATTLEGROUNDS") %>%
filter(year %in% c(2017:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_pubg
Years_PUBG <- as.factor(g_pubg$year)
ggplot(g_pubg,aes(x = number_of_month,y = avg, group = year, color = Years_PUBG)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "PLAYERUNKNOWN'S BATTLEGROUNDS: Average Players by Year")
g_csgo <- games %>%
filter(gamename == "Counter-Strike: Global Offensive") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_csgo
Years_csgo <- as.factor(g_csgo$year)
ggplot(g_csgo,aes(x = number_of_month,y = avg, group = year, color = Years_csgo)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Counter-Strike: Global Offensive: Average Players by Year")
g_gtav <- games %>%
filter(gamename == "Grand Theft Auto V") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_gtav
Years_gtav <- as.factor(g_gtav$year)
ggplot(g_gtav,aes(x = number_of_month,y = avg, group = year, color = Years_gtav)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Grand Theft Auto V: Average Players by Year")
g_dota2 <- games %>%
filter(gamename == "Dota 2") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_dota2
Years_dota2 <- as.factor(g_dota2$year)
ggplot(g_dota2,aes(x = number_of_month,y = avg, group = year, color = Years_dota2)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
scale_y_continuous(labels = comma) +
labs(title = "DOTA 2: Average Players by Year")
CS:GO was also made free to play, in December 2018, so its curve resembles those of other games that were given away for free.
This data shows that many multiplayer games seem to gain players over time and often reach their peak average players significantly after launch. GTA V, unlike the multiplayer games, loses players in its first year; a variable that has to be factored in is that GTA V is both a single-player and a multiplayer game. So we see a sudden drop after release but increasing players over time.
g_fallout4 <- games %>%
filter(gamename == "Fallout 4") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_fallout4
Years_fallout4 <- as.factor(g_fallout4$year)
ggplot(g_fallout4,aes(x = number_of_month,y = avg, group = year, color = Years_fallout4)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Fallout 4: Average Players by Year")
g_farcry5 <- games %>%
filter(gamename == "Far Cry 5") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_farcry5
Years_farcry5 <- as.factor(g_farcry5$year)
ggplot(g_farcry5,aes(x = number_of_month,y = avg, group = year, color = Years_farcry5)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Far Cry 5: Average Players by Year")
g_cyberpunk <- games %>%
filter(gamename == "Cyberpunk 2077") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_cyberpunk
Years_cyberpunk <- as.factor(g_cyberpunk$year)
ggplot(g_cyberpunk,aes(x = number_of_month,y = avg, group = year, color = Years_cyberpunk)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Cyberpunk 2077: Average Players by Year")
g_girl <- games %>%
filter(gamename == "Hentai Girl") %>%
filter(year %in% c(2012:2021)) %>%
mutate(label = if_else(number_of_month == max(number_of_month), as.character(year), NA_character_))
g_girl
Years_girl <- as.factor(g_girl$year)
ggplot(g_girl,aes(x = number_of_month,y = avg, group = year, color = Years_girl)) +
geom_line(size = 1) +
geom_point()+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1))+
geom_dl(aes(label = year), method = list(dl.trans(x = x + 0.1), "last.points", cex = 1)) +
scale_colour_manual(values=c("#964B00","#f58231","#030000", "#00FBFE", "#3cb44b", "#FF0004", "#4363d8", "#911eb4", "#f032e6", "#a9a9a9" )) +
xlab("Months")+
ylab("Average Players") +
labs(title = "Hentai Girl: Average Players by Year")
This data shows that single-player titles seem to have a drastic fall-off in their player base following launch. This most likely occurs because once someone beats a single-player game, they are less likely to return to it. It can also be attributed to the fact that multiplayer/esports games are usually updated continually with new content, whereas single-player games are less likely to receive such updates consistently.
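One way to put a number on this fall-off is to express each month's average players as a share of the first tracked month. The sketch below uses two made-up titles and assumes the first row of each game is its launch month. That assumption would not hold for games released before the data begins in July 2012.

```r
library(dplyr)

# Toy monthly averages for two hypothetical titles (values are made up).
toy <- data.frame(
  gamename = rep(c("Solo Story", "Arena Online"), each = 4),
  year = 2020,
  number_of_month = rep(1:4, times = 2),
  avg = c(1000, 400, 200, 100,   500, 600, 750, 900)
)

retention <- toy %>%
  arrange(gamename, year, number_of_month) %>%
  group_by(gamename) %>%
  mutate(
    months_since_launch = row_number() - 1,   # 0 = first tracked month
    share_of_launch = avg / first(avg)        # 1.0 = same as first month
  ) %>%
  ungroup()

retention
```

A share that drops well below 1 within a few months matches the single-player pattern described above, while a share above 1 matches the multiplayer pattern.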
games_seasons <- games %>%
filter(year %in% c(2012:2021)) %>%
group_by(number_of_month) %>%
summarise(avg_month_sum = sum(avg))
ggplot(data = games_seasons) +
geom_line(aes(x = number_of_month, y = avg_month_sum),color = "blue", size = 1) +
geom_point(aes(x = number_of_month, y = avg_month_sum),color = "red", size = 3)+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1)) +
scale_y_continuous(labels = comma) +
xlab("Months")+
ylab("Average Number of Players") +
labs(title = "Steam Seasonality using Avg. Players")
games_seasons
games_seasons_pk <- games %>%
filter(year %in% c(2012:2021)) %>%
group_by(number_of_month) %>%
summarise(peak_month_sum = sum(peak))
ggplot(data = games_seasons_pk) +
geom_line(aes(x = number_of_month, y = peak_month_sum),color = "blue", size = 1) +
geom_point(aes(x = number_of_month, y = peak_month_sum),color = "red", size = 3)+
scale_x_continuous(breaks = seq(1, 12, by = 1),expand=c(0, 1)) +
scale_y_continuous(labels = comma) +
xlab("Months")+
ylab("Peak Number of Players") +
labs(title = "Steam Seasonality using Peak Players")
games_seasons_pk
The data suggests that there is a seasonality to Steam’s player data: Steam has the most players around December, January, and February. We suspect that this occurs because of the Christmas season, when people receive money and games as gifts. The steep fall-off we see after February may occur because individuals have completed the games they received during the holiday season. We also see an increase in players during the summer months. We suspect that this occurs because individuals are out of school for summer break, giving them more time to play video games on Steam.
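Because the data runs from July 2012 to February 2021, calendar months are not all covered by the same number of years: January and February appear in nine years (2013 to 2021), while March through June appear in eight. A raw sum across years can therefore favor some months. The sketch below, on made-up numbers, shows one possible adjustment, averaging the monthly totals over the years in which each month appears. It is an illustration, not a replacement for the plots above.

```r
library(dplyr)

# Toy data: two games, January covered in three years, February in two.
toy <- data.frame(
  gamename = rep(c("A", "B"), times = 5),
  year = c(2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021),
  number_of_month = c(1, 1, 2, 2, 1, 1, 2, 2, 1, 1),
  avg = c(60, 40, 50, 30, 70, 50, 55, 35, 80, 60)
)

seasonality <- toy %>%
  # Platform-wide total for each calendar month of each year
  group_by(year, number_of_month) %>%
  summarise(month_total = sum(avg), .groups = "drop") %>%
  # Compare the raw sum with the per-year mean for each calendar month
  group_by(number_of_month) %>%
  summarise(
    years_covered = n(),
    sum_over_years = sum(month_total),
    mean_per_year = mean(month_total)
  )

seasonality
```

In the toy data, January's raw sum is more than double February's, but much of that gap comes from the extra year of coverage; the per-year means are closer together.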
games_user_growth <- games %>%
filter(year %in% c(2012:2020)) %>%
group_by(year) %>%
summarise(avg_month_all_years = sum(avg))
ggplot(data = games_user_growth) +
geom_line(aes(x = year, y = avg_month_all_years),color = "blue", size = 1) +
geom_point(aes(x = year, y = avg_month_all_years),color = "red", size = 3) +
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
scale_y_continuous(labels = comma) +
xlab("Years")+
ylab("Average Number of Players") +
labs(title = "Steam Platform Growth using Avg. Players")
games_user_growth
games_user_growth_pk <- games %>%
filter(year %in% c(2012:2020)) %>%
group_by(year) %>%
summarise(peak_month_all_years = sum(peak))
ggplot(data = games_user_growth_pk) +
geom_line(aes(x = year, y = peak_month_all_years),color = "blue", size = 1) +
geom_point(aes(x = year, y = peak_month_all_years),color = "red", size = 3) +
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
scale_y_continuous(labels = comma) +
xlab("Years")+
ylab("Peak Number of Players") +
labs(title = "Steam Platform Growth using Peak Players")
games_user_growth_pk
The graph shows that the Steam platform had relatively consistent growth from 2012 to 2020. The only drop in users came from 2018 to 2019. We suspect this is related to the rise of other online game marketplaces, such as the Epic Games Store, and the popularity of Fortnite at the time. In November 2018, Fortnite hit 8.3 million concurrent players. We suspect that the popularity of Fortnite, combined with the fact that it was not on Steam, is ultimately what caused the dip in players from 2018 to 2019.
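The size of a dip like this can be read more precisely from year-over-year changes than from the line chart. The following sketch computes the absolute and percentage change with dplyr's lag(). The yearly totals are made-up numbers, not values from the dataset.

```r
library(dplyr)

# Toy yearly totals (made up); with the real data these would come from
# a table such as games_user_growth.
toy_growth <- data.frame(
  year = 2016:2019,
  total_avg = c(100, 120, 150, 135)
)

yoy <- toy_growth %>%
  arrange(year) %>%
  mutate(
    change = total_avg - lag(total_avg),          # absolute change vs. prior year
    pct_change = 100 * change / lag(total_avg)    # relative change in percent
  )

yoy
```

The first year has no prior year to compare against, so its change is NA.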
g_without_2021 <- games %>%
filter(number_of_month == 12, year %in% 2012:2020)
g_added <- g_without_2021 %>%
ggplot()+
geom_bar(aes(x = year))+
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
xlab("Years")+
ylab("Number of Games") +
labs(title = "Total Number of Relevant Games on Steam")
g_added
g_added_count <- g_without_2021 %>%
count(year)
grow_diff_titles <- c(268,268,393,549,720,911,1034,1101,1165)
g_added_count$differ <- grow_diff_titles
g_added_count <- g_added_count %>%
mutate(subtracted = n - differ)
g_added_count %>%
ggplot() +
geom_line(aes(x = year, y = subtracted, group = 1), size = 1, color = "red")+
geom_point(aes(x = year, y = subtracted, group = 1), size = 3, color = "red")+
scale_x_continuous(breaks = seq(2012, 2020, by = 1)) +
xlab("Years")+
ylab("Numbers of Games") +
labs(title = "New Relevant Games Published to Steam by Year")
g_added_count
When analyzing this data, we are looking at what we consider to be relevant titles published on the Steam platform. Anyone can publish a game on Steam, and as of 2020 there were nearly 50,000 titles published on the platform. The data used here is pulled from a dataset that focuses on games that are actually played. The bar graph shows that the total number of games in the library has increased over the years. However, the line graph shows that fewer games have been published to Steam each year since 2016.
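The hard-coded grow_diff_titles vector appears to hold each year's count shifted by one year, which lag() can derive automatically. In the sketch below, the counts for 2012 to 2015 are taken from the first entries of grow_diff_titles on the assumption that they are the December counts for those years.

```r
library(dplyr)

# Assumed December title counts for 2012 to 2015, read off grow_diff_titles.
counts <- data.frame(
  year = 2012:2015,
  n = c(268, 393, 549, 720)
)

new_titles <- counts %>%
  arrange(year) %>%
  mutate(
    differ = lag(n, default = first(n)),  # previous year's count; first year compares to itself
    subtracted = n - differ               # newly tracked titles that year
  )

new_titles
```

Deriving the lagged column this way keeps the result in sync if the year filter changes.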
games %>%
filter(avg == max(avg))
PLAYERUNKNOWN’S BATTLEGROUNDS is the game with the highest monthly average player count.
games %>%
filter(peak == max(peak))
The game with the highest peak players in Steam history is PLAYERUNKNOWN’S BATTLEGROUNDS with 3,236,027 players.
games %>%
drop_na(gain) %>%
filter(gain == max(gain))
PLAYERUNKNOWN’S BATTLEGROUNDS has had the greatest gain over the prior month in all of Steam history.
games %>%
drop_na(gain) %>%
filter(gain == min(gain))
It makes sense that Cyberpunk 2077 had the biggest loss of players from one month to the next: the game was filled with bugs that made it hard to enjoy, so many people returned it.
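The gain queries need drop_na() because gain contains missing values, and max() or min() returns NA as soon as one is present. The sketch below shows that behavior on a made-up data frame. It also shows dplyr's slice_min() and slice_max() as an alternative that sorts missing values to the end.

```r
library(dplyr)

# Toy data: the first row mimics a game's first tracked month, where no
# previous month exists and gain is missing.
toy <- data.frame(
  gamename = c("A", "B", "C"),
  gain = c(NA, -250, 40)
)

max_with_na    <- max(toy$gain)                # NA: one missing value propagates
max_without_na <- max(toy$gain, na.rm = TRUE)  # missing values ignored

# slice_min()/slice_max() pick the extreme rows without a separate drop_na() step.
biggest_loss <- toy %>% slice_min(gain, n = 1)
biggest_gain <- toy %>% slice_max(gain, n = 1)

biggest_loss
biggest_gain
```

The avg and peak queries work without drop_na() because, per the summary, those columns have no missing values.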
With the exploratory data analysis complete, let’s run down what we have learned. Single-player and multiplayer games differ in how their average player counts evolve over the years. Multiplayer games tend to gain players after launch and then either keep rising year over year or decline (e.g., Dota 2 and PLAYERUNKNOWN’S BATTLEGROUNDS, respectively). Single-player games instead reach their all-time-high average players at launch, then drop and stay roughly flat. GTA V fits this picture because it mixes single-player and multiplayer play: it declines in its first year, but once people finish the single-player story, the multiplayer side follows the trend of purely multiplayer titles. We also learned that a large, abrupt spike in a game's average players most likely means the game was given away at some point. Payday 2 and CS:GO are examples, with player counts that jump from one month to the next.
When aggregating all games on Steam, we found a seasonal trend in both peak and average players: gamers tend to play more around the holidays and during summer break. Yearly totals of peak and average players show a roughly linear increase, which means Steam is growing its customer base over time. There was a slight decrease in 2019, when the popular title Fortnite was being played on another platform, the Epic Games Store. We believe Fortnite was the main cause; after Fortnite lost traction, Steam’s overall player count rose past its 2018 level. We also found that the total number of relevant games on Steam has increased over time, although the number of relevant games published per year has been lower since 2016 than in earlier years.
Finally, the best-performing game on Steam was PLAYERUNKNOWN’S BATTLEGROUNDS, with the highest average, peak, and gain in all of Steam history. The worst game in terms of players lost, with the lowest gain in Steam history, was Cyberpunk 2077, which lost 250,249 players from one month to the next.
There is still a lot to be found and explored with this dataset. For one, our analysis of single-player and multiplayer games covered just a handful of titles, so we would run more analysis on both types. Another improvement would be to add a column that specifies what type of game each title is: multiplayer, single-player, or both. Other useful additions would be the genre and the price of each game. These columns would let us make sense of how gamers behave at a given price for certain titles. We do know that average players increase when games are suddenly given out for free, but at the moment we can't compare two multiplayer shooters at different prices without doing outside research. It would also be useful to determine how Steam Charts decides which games are relevant enough to record. The actual Steam platform has far more than 1,200 games. Currently, we are just assuming that these titles have enough players to be considered worthwhile to track.
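A game-type column of this kind could be attached with a small hand-made lookup table and a join. In the sketch below, the table game_types and the column game_type are hypothetical names, the three labels follow how this report characterizes those titles, and the avg values are made up.

```r
library(dplyr)

# Toy slice of the data (avg values are made up for illustration).
toy_games <- data.frame(
  gamename = c("Dota 2", "Fallout 4", "Grand Theft Auto V", "Unlabeled Game"),
  avg = c(400, 20, 90, 5)
)

# Hypothetical hand-labeled lookup table, one row per title.
game_types <- data.frame(
  gamename = c("Dota 2", "Fallout 4", "Grand Theft Auto V"),
  game_type = c("multiplayer", "single player", "both")
)

# left_join() keeps every game; titles missing from the lookup get NA.
games_typed <- toy_games %>%
  left_join(game_types, by = "gamename")

# Once labeled, the analysis can be grouped by type.
avg_by_type <- games_typed %>%
  group_by(game_type) %>%
  summarise(mean_avg = mean(avg), .groups = "drop")

games_typed
avg_by_type
```

Genre and price could be added the same way, with one lookup column each.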