My EDA focused on Steam game data sourced from Steam Spy. Steam is a popular online gaming platform, and Steam Spy is a Steam statistics service that is built on the Web API provided by Valve and gathers data from user profiles. I looked at game variables such as Release Date, Price, Ratings, # of Owners, Total Playtime, Median Playtime, Developers, Genre, etc.
This is the data that I’m working with for this EDA. All data was sourced from Steam Spy on 11/26/19. There are two .csv datasets: 1) a Steam dataset with game title and other variables (Release Date, Price, Ratings, # of Owners, Total Playtime, Median Playtime, Developers, etc.) and 2) the genre classification of each game.
I wrangled the data in a separate R file. I’m going to directly load in the clean, tidy data here.
Taking an initial glance at the data, I first want to see the total number of new games released on the Steam platform each year and the percentage change from year to year.
| release_year | total_released | growth_rate |
|---|---|---|
| 2004 | 29 | NA |
| 2005 | 33 | 0.1379310 |
| 2006 | 115 | 2.4848485 |
| 2007 | 107 | -0.0695652 |
| 2008 | 193 | 0.8037383 |
| 2009 | 335 | 0.7357513 |
| 2010 | 314 | -0.0626866 |
| 2011 | 310 | -0.0127389 |
| 2012 | 405 | 0.3064516 |
| 2013 | 549 | 0.3555556 |
| 2014 | 1648 | 2.0018215 |
| 2015 | 2728 | 0.6553398 |
| 2016 | 4346 | 0.5931085 |
| 2017 | 6498 | 0.4951680 |
| 2018 | 8483 | 0.3054786 |
| 2019 | 8410 | -0.0086054 |
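The growth_rate column can be reproduced from the yearly counts alone. The original wrangling was done in R and is not shown here, so the following is only a minimal Python sketch of the calculation; the names `total_released` and `yoy_growth` are my own, and the sketch assumes the years are consecutive, as they are in the table above.

```python
# Yearly release counts copied from the table above.
total_released = {
    2004: 29, 2005: 33, 2006: 115, 2007: 107, 2008: 193, 2009: 335,
    2010: 314, 2011: 310, 2012: 405, 2013: 549, 2014: 1648, 2015: 2728,
    2016: 4346, 2017: 6498, 2018: 8483, 2019: 8410,
}


def yoy_growth(counts):
    """Return {year: (count - previous) / previous}; the first year maps to None.

    Assumes consecutive years, so the "previous" entry is always the prior year.
    """
    growth = {}
    previous = None
    for year in sorted(counts):
        current = counts[year]
        growth[year] = None if previous is None else (current - previous) / previous
        previous = current
    return growth


growth_rate = yoy_growth(total_released)
for year, rate in growth_rate.items():
    print(year, "NA" if rate is None else f"{rate:.7f}")
```

A rate of 2.0 means the number of releases tripled relative to the previous year, which is what happened between 2013 and 2014.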
The first plot shows the total number of games released on the Steam platform each year from 2004 to 2019. We can see a roughly exponential increase in yearly releases through 2018; the 2019 count is slightly lower, but it only covers releases up to the 11/26/19 data pull. However, to truly unpack the insight, I looked at the year-over-year growth rate of new games released on Steam from 2004 to 2019.
Looking at the second plot, you can see spikes in the growth rate of new games released in 2006, 2009, and 2014. These spikes coincide with major updates to the Steam platform for both users and game developers. In 2006, Steam started approaching third-party publishers to release their games on the Steam platform. In 2009, Steam introduced Steam Cloud, which allowed users to store their game data on Steam-owned cloud servers and access it from any computer running the Steam client. In 2014, Steam’s parent company, Valve, announced plans to hugely widen the number of games allowed onto Steam by ending Steam Greenlight and introducing Steam Direct, under which developers are approved directly.
As background, during the days of Greenlight, developers pitched games to Steam users, who voted by answering the question “Would you buy this game if it were available in Steam?” If a game was popular enough, Valve eventually approved it. On Steam Direct, game developers register with Valve and, after verification, publish games to Steam, bypassing the “popularity contest” completely.
Next, I wanted to take a closer look at new Steam game releases by month, to understand whether there is a trend in when game developers typically release their games on Steam.
| release_month | total_released |
|---|---|
| 1 | 2143 |
| 2 | 2446 |
| 3 | 2744 |
| 4 | 2777 |
| 5 | 2922 |
| 6 | 2489 |
| 7 | 2920 |
| 8 | 3182 |
| 9 | 3265 |
| 10 | 3426 |
| 11 | 3140 |
| 12 | 3049 |
Looking at the graph, we can see that the largest number of new games was released on the Steam platform in October. We can also see a general trend of more games being released in the 4th quarter of the year.
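To put a number on the quarterly pattern, the monthly counts can be rolled up into quarters. This is a minimal Python sketch using only the values from the table above; the variable names are illustrative and not taken from the original R code.

```python
# Monthly release counts (all years combined), copied from the table above.
monthly_released = {
    1: 2143, 2: 2446, 3: 2744, 4: 2777, 5: 2922, 6: 2489,
    7: 2920, 8: 3182, 9: 3265, 10: 3426, 11: 3140, 12: 3049,
}

# Map each month to its calendar quarter (1-3 -> Q1, ..., 10-12 -> Q4) and sum.
quarter_totals = {}
for month, count in monthly_released.items():
    quarter = (month - 1) // 3 + 1
    quarter_totals[quarter] = quarter_totals.get(quarter, 0) + count

grand_total = sum(monthly_released.values())
quarter_share = {q: total / grand_total for q, total in quarter_totals.items()}
busiest_month = max(monthly_released, key=monthly_released.get)

for q in sorted(quarter_totals):
    print(f"Q{q}: {quarter_totals[q]} releases ({quarter_share[q]:.1%})")
print("Busiest month:", busiest_month)
```

With these counts, the 4th quarter accounts for roughly 28% of all releases, compared with about 21% for the 1st quarter.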
Now why might this be the case? Is there some reason why game developers release games in the latter half of the year? My hunch is that game developers want to release their games shortly before holidays such as Black Friday and Christmas, when consumers are more likely to spend money on games for themselves or for others.
Another interesting observation is that there seem to be fewer new game releases in June than in the neighboring months of May and July. A possible explanation is that many of the large game developers reveal their games at big gaming conventions such as E3, which happen in June. Smaller game developers, who make up a good portion of the games on Steam, may release their games in other months to avoid competing with these bigger developers. For example, a small indie game developer wouldn’t want to release their game in the same week that the new Call of Duty is announced and released.
The first area I wanted to explore is how the average price of Steam games has trended over the 15-year period.
| release_year | avg_price |
|---|---|
| 2004 | 6.659655 |
| 2005 | 8.674687 |
| 2006 | 9.053097 |
| 2007 | 8.889533 |
| 2008 | 9.863548 |
| 2009 | 9.718364 |
| 2010 | 9.016555 |
| 2011 | 9.379400 |
| 2012 | 10.928031 |
| 2013 | 10.626743 |
| 2014 | 9.650562 |
| 2015 | 8.351543 |
| 2016 | 8.147715 |
| 2017 | 8.035172 |
| 2018 | 7.422085 |
| 2019 | 7.894500 |
Initially, I expected the average price to stay relatively the same over time. However, looking at the plot, we can see that the average price of games increased until 2012 and then decreased. Note that the average price of a Steam game in 2019 is about 78 cents less than the average price in 2005!
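The peak year and the 2005-to-2019 difference can be read directly off the avg_price table. The following is a small Python sketch of that arithmetic, using only the table values; the variable names are my own.

```python
# Average price per release year, copied from the table above.
avg_price = {
    2004: 6.659655, 2005: 8.674687, 2006: 9.053097, 2007: 8.889533,
    2008: 9.863548, 2009: 9.718364, 2010: 9.016555, 2011: 9.379400,
    2012: 10.928031, 2013: 10.626743, 2014: 9.650562, 2015: 8.351543,
    2016: 8.147715, 2017: 8.035172, 2018: 7.422085, 2019: 7.894500,
}

# Year with the highest average price.
peak_year = max(avg_price, key=avg_price.get)

# Difference between the 2019 and 2005 averages (negative means cheaper in 2019).
change_2005_to_2019 = avg_price[2019] - avg_price[2005]

print("Peak year:", peak_year)
print(f"2019 vs 2005: {change_2005_to_2019:+.2f}")
```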
Next, I wanted to understand what might have caused the overall decrease in average game price starting in 2012. My hunch is that Free-to-Play games are becoming more and more popular and are thus driving down the average price. To test my hypothesis, I calculated the share of Free-to-Play games among all games released each year and visualized the result on a plot.
| release_year | n | total | prop |
|---|---|---|---|
| 2004 | 1 | 29 | 0.0344828 |
| 2005 | 1 | 33 | 0.0303030 |
| 2006 | 4 | 115 | 0.0347826 |
| 2007 | 5 | 107 | 0.0467290 |
| 2008 | 9 | 193 | 0.0466321 |
| 2009 | 8 | 335 | 0.0238806 |
| 2010 | 9 | 314 | 0.0286624 |
| 2011 | 11 | 310 | 0.0354839 |
| 2012 | 28 | 405 | 0.0691358 |
| 2013 | 38 | 549 | 0.0692168 |
| 2014 | 96 | 1648 | 0.0582524 |
| 2015 | 206 | 2728 | 0.0755132 |
| 2016 | 436 | 4346 | 0.1003221 |
| 2017 | 659 | 6498 | 0.1014158 |
| 2018 | 910 | 8483 | 0.1072734 |
| 2019 | 857 | 8410 | 0.1019025 |
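The prop column is simply n divided by total for each year. As a minimal Python sketch (the original calculation was done in R and is not shown, and the names below are hypothetical), the same values can be recomputed from the table, along with the first year in which free-to-play games exceeded 10% of releases:

```python
# (free_to_play_count, total_released) per release year, copied from the table above.
f2p_counts = {
    2004: (1, 29), 2005: (1, 33), 2006: (4, 115), 2007: (5, 107),
    2008: (9, 193), 2009: (8, 335), 2010: (9, 314), 2011: (11, 310),
    2012: (28, 405), 2013: (38, 549), 2014: (96, 1648), 2015: (206, 2728),
    2016: (436, 4346), 2017: (659, 6498), 2018: (910, 8483), 2019: (857, 8410),
}

# Share of free-to-play games among all games released in each year.
prop = {year: n / total for year, (n, total) in f2p_counts.items()}

# First release year in which the free-to-play share exceeds 10%.
first_year_above_10pct = min(year for year, p in prop.items() if p > 0.10)

for year in sorted(prop):
    print(year, f"{prop[year]:.7f}")
print("First year above 10%:", first_year_above_10pct)
```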
Looking at the plot, the share of Free-to-Play games rose from roughly 3% of releases in 2004 to about 10% from 2016 onward, which is consistent with my initial hypothesis. At least from personal experience, many of the games that I play nowadays are classified as “Freemium” games. This is where a game is free of charge, but the game developers charge money for additional features, services, or virtual or physical goods (think loot boxes, in-game currency, etc.).
Next, I wanted to visualize the overall shift to lower pricing in a more in-depth plot. To do this, I used a faceted plot to visualize all game prices.
Looking at the faceted plot, we can see an overall increase in lower-priced games (priced less than $10). We can also see that as the number of newly released games on the Steam platform increases, the diversity in prices also increases. This wider spread in prices may indicate that game developers are employing different pricing strategies to better attract consumers to buy their games. Yes, free and low-priced games are becoming more and more popular; however, this doesn’t stop some game developers from pricing their games comparatively very high.
First, I wanted to explore whether there was a large difference in average playtime between game genres. The average playtime here is the average playtime over the past two weeks. For this EDA, I am looking at the two-week timeframe from November 12th, 2019 to November 26th, 2019.
| genre | total_playtime |
|---|---|
| action | 175110 |
| adventure | 175110 |
| early_access | 22927 |
| indie | 85056 |
| mmo | 37949 |
| rpg | 83578 |
| simulation | 67136 |
| sports | 17166 |
| strategy | 54136 |
Overall, it seems like games in the Action and Adventure genres have a much higher average playtime in the last two weeks than other game genres. Interestingly, Action and Adventure have exactly the same playtime. Is this because every game that is classified as Action is also classified as Adventure? Or is it just pure coincidence?
To answer these questions, I counted the action and adventure genre classifications for each game.
| n | count |
|---|---|
| 2 | 14199 |
| 4 | 117 |
| 6 | 3 |
| 8 | 1 |

| genre | game | publishers |
|---|---|---|
| action | Zombie Apocalypse | GameTop.com |
| action | Zombie Apocalypse | Kapitan |
| adventure | Zombie Apocalypse | GameTop.com |
| adventure | Zombie Apocalypse | Kapitan |
Weird, there seem to be games with multiple action and adventure genre tags. It turns out that there are games by different publishers with the same name!
Overall, every value in column n is even, and a game with a single publisher has n = 2 (one action row plus one adventure row). This suggests that every action genre game is also classified as an adventure game. I suppose no adventure is without action in the eyes of game developers.
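The counting step can be sketched in Python as follows. This is an illustration under assumptions, not the original R code: the Zombie Apocalypse rows come from the table above, while "Example Game" and "Example Publisher" are hypothetical entries added so that the n = 2 case appears. The sketch also includes a stricter check than "n is even", testing each (game, publisher) pair for both tags.

```python
from collections import Counter

# Long-format (genre, game, publisher) rows. The Zombie Apocalypse rows are from
# the table above; "Example Game" is a hypothetical single-publisher title.
rows = [
    ("action", "Zombie Apocalypse", "GameTop.com"),
    ("action", "Zombie Apocalypse", "Kapitan"),
    ("adventure", "Zombie Apocalypse", "GameTop.com"),
    ("adventure", "Zombie Apocalypse", "Kapitan"),
    ("action", "Example Game", "Example Publisher"),
    ("adventure", "Example Game", "Example Publisher"),
]
TARGET = {"action", "adventure"}

# n = number of action/adventure rows per game title.
n_per_title = Counter(game for genre, game, _ in rows if genre in TARGET)

# How many titles have each value of n (the shape of the n/count table).
n_distribution = Counter(n_per_title.values())

# Stricter check: every (game, publisher) pair tagged with either genre has both.
tags = {}
for genre, game, publisher in rows:
    tags.setdefault((game, publisher), set()).add(genre)
always_paired = all(TARGET <= t for t in tags.values() if t & TARGET)

print(dict(n_per_title))
print(dict(n_distribution))
print("Action and adventure always paired:", always_paired)
```

An even n alone would also be consistent with a title carrying two action tags and no adventure tag, which is why the per-publisher check is the more direct test.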
Next, I wanted to see if the proportions of the different game genres changed over the 15-year period.
Looking at the faceted plot, we can see that over time, there has been a dramatic increase in indie genre games. There has been a slight increase in early_access, simulation, and rpg games, while there has been a decline in strategy genre games. Action, Adventure, MMO, and Sports have stayed relatively the same.
To determine which games have a nostalgia factor, I looked at the top trending games in the past two weeks.
| release_year | number_released |
|---|---|
| 2004 | 1 |
| 2006 | 1 |
| 2007 | 1 |
| 2009 | 1 |
| 2010 | 1 |
| 2011 | 1 |
| 2012 | 5 |
| 2013 | 9 |
| 2014 | 7 |
| 2015 | 8 |
| 2016 | 18 |
| 2017 | 13 |
| 2018 | 17 |
| 2019 | 14 |
I then filtered for games released in 2010 or earlier and sorted them by average playtime from most to least.
| genre | game | release_year | trending | avg_playtime | metascore |
|---|---|---|---|---|---|
| action | Team Fortress 2 | 2007 | Yes | 1393 | 92 |
| adventure | Team Fortress 2 | 2007 | Yes | 1393 | 92 |
| indie | Garry’s Mod | 2006 | Yes | 523 | NA |
| simulation | Garry’s Mod | 2006 | Yes | 523 | NA |
| strategy | Sid Meier’s Civilization V | 2010 | Yes | 230 | 90 |
| action | Left 4 Dead 2 | 2009 | Yes | 191 | 89 |
| adventure | Left 4 Dead 2 | 2009 | Yes | 191 | 89 |
| action | Counter-Strike: Source | 2004 | Yes | 150 | 88 |
| adventure | Counter-Strike: Source | 2004 | Yes | 150 | 88 |
*Note: some games are duplicated because they are classified under two genres.
Of all the games listed above, the most surprising was Garry’s Mod. Compared to the other games on the list, it is not a typical action/adventure first-person shooter or a classic strategy game. Instead, Garry’s Mod is an indie-simulation game. As a physics sandbox game, it doesn’t even have any predefined aims or goals. Yet, despite its unusual genre and having no Metascore (NA in the table), the game is still being played 13 years after its initial release.
To have a large enough sample size to ensure a relatively accurate and widely agreed-upon metascore, I filtered for games with 10M+ owners. According to the Metacritic website, anything above 80 (Green) is a pretty decent game.
| genre | n |
|---|---|
| action | 17 |
| adventure | 17 |
| indie | 4 |
| mmo | 4 |
| rpg | 5 |
| simulation | 1 |
| sports | 1 |
| strategy | 2 |
Again, to have a large enough sample size to ensure a relatively accurate and widely agreed-upon metascore, I filtered for games with 10M+ owners. The table lists the ten highest-rated of these games.
| game | metascore | avg_playtime |
|---|---|---|
| Grand Theft Auto V | 96 | 826 |
| Half-Life 2 | 96 | 112 |
| The Elder Scrolls V: Skyrim | 94 | 155 |
| Team Fortress 2 | 92 | 1393 |
| Dota 2 | 90 | 1866 |
| Sid Meier’s Civilization V | 90 | 230 |
| Borderlands 2 | 89 | 275 |
| Left 4 Dead 2 | 89 | 191 |
| Counter-Strike: Source | 88 | 150 |
| PLAYERUNKNOWN’S BATTLEGROUNDS | 86 | 994 |
Looking at the first plot, we can see that the bulk of 80+ metascore games are categorized under action and adventure. Interestingly, out of all the games in the Top 10 Highest Rated Games table, Sid Meier’s Civilization V is the only pure strategy genre game on the list. This may imply two conclusions: 1) developing a highly rated pure strategy game is quite hard to do and 2) a highly rated game doesn’t have to be in the action/adventure genre.
I want to identify whether there are any anomalies in the distribution of Steam game owners. My initial hypothesis is that the distribution will be very right-skewed, with most games having few owners and a long tail of very popular games. This is based on the fact that developing and publishing a highly popular game is very difficult nowadays.
| owners | n |
|---|---|
| 100,000,000 .. 200,000,000 | 2 |
| 50,000,000 .. 100,000,000 | 1 |
| 20,000,000 .. 50,000,000 | 4 |
| 10,000,000 .. 20,000,000 | 30 |
| 5,000,000 .. 10,000,000 | 59 |
| 2,000,000 .. 5,000,000 | 217 |
| 1,000,000 .. 2,000,000 | 350 |
| 500,000 .. 1,000,000 | 600 |
| 200,000 .. 500,000 | 1414 |
| 100,000 .. 200,000 | 1480 |
| 50,000 .. 100,000 | 1940 |
| 20,000 .. 50,000 | 3649 |
| 0 .. 20,000 | 24058 |
As noted, my hypothesis stands: the distribution of the number of owners across Steam games is extremely right-skewed. Over 70% of all games on the Steam platform have between 0 and 20,000 owners. I honestly did not expect such a steep drop-off between games that have fewer than 20,000 owners and games that have 20,000+ owners. This may indicate that for many game developers and their games, reaching 20,000 owners on the Steam platform is a very tough barrier to cross.
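The "over 70%" figure follows directly from the bucket counts. Here is a minimal Python sketch of that calculation using the values from the owners table; the variable names are my own, and the sketch is not the original R code.

```python
# (owners bucket, number of games), copied from the table above, largest bucket first.
owner_buckets = [
    ("100,000,000 .. 200,000,000", 2),
    ("50,000,000 .. 100,000,000", 1),
    ("20,000,000 .. 50,000,000", 4),
    ("10,000,000 .. 20,000,000", 30),
    ("5,000,000 .. 10,000,000", 59),
    ("2,000,000 .. 5,000,000", 217),
    ("1,000,000 .. 2,000,000", 350),
    ("500,000 .. 1,000,000", 600),
    ("200,000 .. 500,000", 1414),
    ("100,000 .. 200,000", 1480),
    ("50,000 .. 100,000", 1940),
    ("20,000 .. 50,000", 3649),
    ("0 .. 20,000", 24058),
]

total_games = sum(n for _, n in owner_buckets)

# Share of all games that falls into each owners bucket.
share = {label: n / total_games for label, n in owner_buckets}

print("Total games:", total_games)
print(f"Share with 0 .. 20,000 owners: {share['0 .. 20,000']:.1%}")
```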
I want to explore whether there are any differences in metascore distributions among different owner groups.
| stats | 20M+ | 10M-20M | 5M-10M | 1M-5M | <1M |
|---|---|---|---|---|---|
| Min. | 69.00 | 62.00 | 63.00 | 39.00 | 20.00 |
| 1st Qu. | 83.00 | 79.75 | 73.00 | 75.00 | 65.00 |
| Median | 84.50 | 83.50 | 81.50 | 81.00 | 72.00 |
| Mean | 82.83 | 82.65 | 79.81 | 79.56 | 70.56 |
| 3rd Qu. | 86.00 | 89.25 | 85.25 | 86.00 | 78.00 |
| Max. | 90.00 | 96.00 | 96.00 | 95.00 | 98.00 |
| NA’s | 1 | 10 | 23 | 212 | 30404 |
Looking at the boxplots, we can see that as the number of owners increases, the spread of metascores narrows and the metascores are generally higher. This intuitively makes sense because in order for a game to be very successful, it has to cater to a diverse set of players, as well as garner high reviews from both critics and players.
Interestingly enough, the highest-rated game (a metascore of 98, seen in the <1M group) is not among the most widely owned games. This might be because the game is relatively unknown and has yet to be discovered by the general masses.
There are many potential next steps we could take. I think one interesting next step would be to integrate Twitch (a popular video game streaming platform) data with the Steam data. This would allow me to explore which Steam games are the most popular among livestreamers, how these games trend and perform on the Steam platform, and whether a game’s introduction on Twitch helped its popularity on Steam.