Week 3 Data Dive: Group By and Probabilities

The goal of this project is to gain a better understanding of the video game sales data set and the frequency of different categories video games may have, including their genre, platform, and publisher.

library(readr)
library(tidyverse)
library(ggplot2)
game_sales <- read_csv("video_game_sales.csv")

Group By

Group by Genre

sales_by_genre <- game_sales |>
  group_by(genre) |>
  summarize(count = n(), mean_global_sales = mean(global_sales)) |>
  arrange(desc(count))
sales_by_genre
## # A tibble: 12 × 3
##    genre        count mean_global_sales
##    <chr>        <int>             <dbl>
##  1 Action        3316             0.528
##  2 Sports        2346             0.567
##  3 Misc          1739             0.466
##  4 Role-Playing  1488             0.623
##  5 Shooter       1310             0.792
##  6 Adventure     1286             0.186
##  7 Racing        1249             0.586
##  8 Platform       886             0.938
##  9 Simulation     867             0.452
## 10 Fighting       848             0.529
## 11 Strategy       681             0.257
## 12 Puzzle         582             0.421
rare_genre <- sales_by_genre |>
  filter(count < 600) |>
  pluck("genre")
game_sales <- game_sales |>
  mutate(rare_genre = (genre == rare_genre))

Action games are the most common genre in the data by far, with almost 1000 more games present than the runner-up, sports games. Puzzle games are the rarest genre, although the exact way games were categorized into a single genre is unclear, so there might be some potential overlap with strategy games or puzzle platformers that were categorized as only platformers.

sales_by_genre |>
  ggplot() +
  geom_col(mapping = aes(x = genre, y = mean_global_sales)) +
  labs(title = "Mean Global Sales by Genre", x = "Genre", y = "Mean Global Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This visualization helps to illustrate how different the success of different genres in the data set is from the popularity of the genres. Platformers were on the less common side, but are the closest to having an average of 1 million dollars in sales. One possible reason for this could be that platformers as a whole have decreased in popularity in recent years and so are less likely to reach the minimum sales requirement necessary to be included in this data, but major franchises like the Mario series bring up the sales average.

Group by Platform

sales_by_platform <- game_sales |>
  group_by(platform) |>
  summarize(count = n(), total_global_sales = sum(global_sales)) |>
  arrange(desc(count))
sales_by_platform
## # A tibble: 31 × 3
##    platform count total_global_sales
##    <chr>    <int>              <dbl>
##  1 DS        2163               822.
##  2 PS2       2161              1256.
##  3 PS3       1329               958.
##  4 Wii       1325               927.
##  5 X360      1265               980.
##  6 PSP       1213               296.
##  7 PS        1196               731.
##  8 PC         960               259.
##  9 XB         824               258.
## 10 GBA        822               318.
## # ℹ 21 more rows
rare_platforms <- sales_by_platform |>
  filter(count < 10) |>
  pluck("platform")
game_sales <- game_sales |>
  mutate(rare_platform = (platform %in% rare_platforms))
top_platforms <- sales_by_platform |>
  filter (count > 500) |>
  pluck("platform")

sales_by_platform |>
  filter(platform %in% top_platforms) |>
  ggplot() +
  geom_col(mapping = aes(x = platform, y = total_global_sales)) +
  labs(title = "Total Global Sales by Platform", x = "Platform", y = "Total Global Sales", subtitle = "Only includes games on platforms with at least 500 high-selling games") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The most common platforms in the data set are the Nintendo DS and the PlayStation 2, which is unsurprising since they are the two all-time best selling consoles. Though the DS has two more games present, PS2 games have made over 400 million dollars more, which is a disparity definitely worth further investigation. About two thirds of the platforms have at least 100 games in the data set, which is plenty for useful aggregation, but some older consoles have only a handful of high selling games.

Group by Publisher

sales_by_publisher <- game_sales |>
  group_by(publisher) |>
  summarize(count = n(), mean_global_sales = round(mean(global_sales),4)) |>
  arrange(desc(count))
sales_by_publisher
## # A tibble: 579 × 3
##    publisher                    count mean_global_sales
##    <chr>                        <int>             <dbl>
##  1 Electronic Arts               1351             0.822
##  2 Activision                     975             0.746
##  3 Namco Bandai Games             932             0.273
##  4 Ubisoft                        921             0.515
##  5 Konami Digital Entertainment   832             0.341
##  6 THQ                            715             0.477
##  7 Nintendo                       703             2.54 
##  8 Sony Computer Entertainment    683             0.890
##  9 Sega                           639             0.427
## 10 Take-Two Interactive           413             0.967
## # ℹ 569 more rows
one_hit_publishers <- sales_by_publisher |>
  filter(count == 1) |>
  pluck("publisher")
game_sales <- game_sales |>
  mutate(one_hit_wonder = (publisher %in% one_hit_publishers))

Though there are over 500 publishers with at least one game in the data set, there are definitely some standout publishers with hundreds of high-selling games. Electronic Arts in particular is the only publisher with over 1000 high-selling games, which may be due to them publishing numerous different sports games, which generally have new releases yearly across multiple platforms. Since we will generally want to focus on publishers with multiple high selling games for aggregation purposes, all games that are the only entry by a certain publisher in the data set have been categorized as ‘one-hit wonders’.

top_publishers <- game_sales |>
  group_by(publisher) |>
  summarize(count = n()) |>
  arrange(desc(count)) |>
  filter(count >= 500) |>
  pluck("publisher")

sales_by_publisher |>
  filter(publisher %in% top_publishers) |>
  ggplot() +
  geom_col(mapping = aes(x = publisher, y = mean_global_sales)) +
  labs(title = "Mean Global Sales by Publisher", x = "Publisher", y = "Mean Global Sales", subtitle = "Only includes games by publishers with at least 500 high-selling games") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Since there are so many publishers in the data set, we focus on those with at least 500 high-selling games, making up the top 9 publishers. Though EA had the most high-selling games as determined above, this visual helps to illustrate that Nintendo has the best average global sales numbers and is the only publisher to average over a million dollars in sales on their games. The top 15 sellers in the whole data set were all published by Nintendo, so it makes sense that these massive hits would balance out their lower-selling games.

Hypothesis

On average, puzzle games have longer development times than action games. (This could potentially explain why there are so many high-selling action games compared to puzzle games–if puzzle games take longer to develop, they will be rarer in general.)

Analyzing Combinations

Group by Platform and Genre

sales_by_platform_and_genre <- game_sales |>
  group_by(platform, genre) |>
  summarize(count = n(), total_global_sales = sum(global_sales)) |>
  arrange(desc(count))
sales_by_platform_and_genre
## # A tibble: 293 × 4
## # Groups:   platform [31]
##    platform genre      count total_global_sales
##    <chr>    <chr>      <int>              <dbl>
##  1 PS2      Sports       400              273. 
##  2 DS       Misc         393              138. 
##  3 PS3      Action       380              308. 
##  4 PS2      Action       348              273. 
##  5 DS       Action       343              116. 
##  6 X360     Action       324              243. 
##  7 DS       Simulation   285              132. 
##  8 Wii      Misc         280              221. 
##  9 Wii      Sports       261              292. 
## 10 DS       Adventure    240               47.3
## # ℹ 283 more rows

Here, we can see that the most common games in our data set are sports games for the PlayStation 2. However, some of the slightly less frequent platform/genre combinations have been more successful, notably action games for the PlayStation 3 and sports game for the Wii. The latter is clearly due to Wii Sports and Wii Sports Resort being the best and fourth best sellers of the entire data set, but the former could use further investigation. The rarer platform/genre combinations mostly consist of older or less popular platforms that only have a few total games in the data set.

Visualization of Most Common Platform/Genre Combination

game_sales |>
  filter(platform == "PS2", genre == "Sports", publisher %in% top_publishers) |>
  ggplot() + 
  geom_col(mapping = aes(x = year, y = global_sales, fill = publisher)) +
  labs (title = "Global Sales of PS2 Sports Games Over Time", x = "Year", y = "Global Sales", subtitle = "Only includes games by publishers with at least 500 high-selling games") +
  theme_classic() +
  scale_fill_brewer(palette = "Set2")

Since there are too many different publishers to focus on in a single visual, we continue to focus on the top 9 publishers. This would be too many for this kind of visual, but since we are focusing on the PS2, Nintendo will not be present and we can focus on 8 publishers instead. Based on the discussion in the Group by Publisher section, it is unsurprising to see that EA made the most sales on sports games for the PS2 than any other publisher in each year except for perhaps 2011. Though they were not generally as successful as their competitors, it is expected to see that Sony has games present in each year since they own PlayStation. Sales had been decreasing for a few years already, but a major drop can be seen after 2006, when the PS3 was released. The trends in this visual would have to be analyzed against all sports games and all PS2 games to get a better idea of how the genre impacts sales on this particular platform.

Investigation of Missing Platform/Genre Combinations

platform_genre_combinations <- crossing(select(game_sales, platform), select(game_sales, genre))
platform_genre_combinations |>
  merge (sales_by_platform_and_genre, all.x = TRUE) |>
  filter (is.na(count)) |>
  select(platform, genre) |>
  tibble()
## # A tibble: 79 × 2
##    platform genre       
##    <chr>    <chr>       
##  1 2600     Role-Playing
##  2 2600     Strategy    
##  3 3DO      Action      
##  4 3DO      Fighting    
##  5 3DO      Misc        
##  6 3DO      Platform    
##  7 3DO      Racing      
##  8 3DO      Role-Playing
##  9 3DO      Shooter     
## 10 3DO      Sports      
## # ℹ 69 more rows

Here, we can systematically determine each missing combination of platform and genre in the data set. Many platforms have several genres missing, which is most likely due to being older consoles, as this data set does not account for inflation for sales. Individual instances of missing platform/genre combinations require more investigation, such as there being no puzzle games for the Xbox One in the data.