The goal of this project is to gain a better understanding of the video game sales data set and the frequency of different categories video games may have, including their genre, platform, and publisher.
library(readr)
library(tidyverse)
library(ggplot2)
game_sales <- read_csv("video_game_sales.csv")
sales_by_genre <- game_sales |>
group_by(genre) |>
summarize(count = n(), mean_global_sales = mean(global_sales)) |>
arrange(desc(count))
sales_by_genre
## # A tibble: 12 × 3
## genre count mean_global_sales
## <chr> <int> <dbl>
## 1 Action 3316 0.528
## 2 Sports 2346 0.567
## 3 Misc 1739 0.466
## 4 Role-Playing 1488 0.623
## 5 Shooter 1310 0.792
## 6 Adventure 1286 0.186
## 7 Racing 1249 0.586
## 8 Platform 886 0.938
## 9 Simulation 867 0.452
## 10 Fighting 848 0.529
## 11 Strategy 681 0.257
## 12 Puzzle 582 0.421
rare_genre <- sales_by_genre |>
filter(count < 600) |>
pluck("genre")
game_sales <- game_sales |>
mutate(rare_genre = (genre == rare_genre))
Action games are the most common genre in the data by far, with almost 1000 more games present than the runner-up, sports games. Puzzle games are the rarest genre, although the exact way games were categorized into a single genre is unclear, so there might be some potential overlap with strategy games or puzzle platformers that were categorized as only platformers.
sales_by_genre |>
ggplot() +
geom_col(mapping = aes(x = genre, y = mean_global_sales)) +
labs(title = "Mean Global Sales by Genre", x = "Genre", y = "Mean Global Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This visualization helps to illustrate how different the success of different genres in the data set is from the popularity of the genres. Platformers were on the less common side, but are the closest to having an average of 1 million dollars in sales. One possible reason for this could be that platformers as a whole have decreased in popularity in recent years and so are less likely to reach the minimum sales requirement necessary to be included in this data, but major franchises like the Mario series bring up the sales average.
sales_by_platform <- game_sales |>
group_by(platform) |>
summarize(count = n(), total_global_sales = sum(global_sales)) |>
arrange(desc(count))
sales_by_platform
## # A tibble: 31 × 3
## platform count total_global_sales
## <chr> <int> <dbl>
## 1 DS 2163 822.
## 2 PS2 2161 1256.
## 3 PS3 1329 958.
## 4 Wii 1325 927.
## 5 X360 1265 980.
## 6 PSP 1213 296.
## 7 PS 1196 731.
## 8 PC 960 259.
## 9 XB 824 258.
## 10 GBA 822 318.
## # ℹ 21 more rows
rare_platforms <- sales_by_platform |>
filter(count < 10) |>
pluck("platform")
game_sales <- game_sales |>
mutate(rare_platform = (platform %in% rare_platforms))
top_platforms <- sales_by_platform |>
filter (count > 500) |>
pluck("platform")
sales_by_platform |>
filter(platform %in% top_platforms) |>
ggplot() +
geom_col(mapping = aes(x = platform, y = total_global_sales)) +
labs(title = "Total Global Sales by Platform", x = "Platform", y = "Total Global Sales", subtitle = "Only includes games on platforms with at least 500 high-selling games") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The most common platforms in the data set are the Nintendo DS and the PlayStation 2, which is unsurprising since they are the two all-time best selling consoles. Though the DS has two more games present, PS2 games have made over 400 million dollars more, which is a disparity definitely worth further investigation. About two thirds of the platforms have at least 100 games in the data set, which is plenty for useful aggregation, but some older consoles have only a handful of high selling games.
sales_by_publisher <- game_sales |>
group_by(publisher) |>
summarize(count = n(), mean_global_sales = round(mean(global_sales),4)) |>
arrange(desc(count))
sales_by_publisher
## # A tibble: 579 × 3
## publisher count mean_global_sales
## <chr> <int> <dbl>
## 1 Electronic Arts 1351 0.822
## 2 Activision 975 0.746
## 3 Namco Bandai Games 932 0.273
## 4 Ubisoft 921 0.515
## 5 Konami Digital Entertainment 832 0.341
## 6 THQ 715 0.477
## 7 Nintendo 703 2.54
## 8 Sony Computer Entertainment 683 0.890
## 9 Sega 639 0.427
## 10 Take-Two Interactive 413 0.967
## # ℹ 569 more rows
one_hit_publishers <- sales_by_publisher |>
filter(count == 1) |>
pluck("publisher")
game_sales <- game_sales |>
mutate(one_hit_wonder = (publisher %in% one_hit_publishers))
Though there are over 500 publishers with at least one game in the data set, there are definitely some standout publishers with hundreds of high-selling games. Electronic Arts in particular is the only publisher with over 1000 high-selling games, which may be due to them publishing numerous different sports games, which generally have new releases yearly across multiple platforms. Since we will generally want to focus on publishers with multiple high selling games for aggregation purposes, all games that are the only entry by a certain publisher in the data set have been categorized as ‘one-hit wonders’.
top_publishers <- game_sales |>
group_by(publisher) |>
summarize(count = n()) |>
arrange(desc(count)) |>
filter(count >= 500) |>
pluck("publisher")
sales_by_publisher |>
filter(publisher %in% top_publishers) |>
ggplot() +
geom_col(mapping = aes(x = publisher, y = mean_global_sales)) +
labs(title = "Mean Global Sales by Publisher", x = "Publisher", y = "Mean Global Sales", subtitle = "Only includes games by publishers with at least 500 high-selling games") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Since there are so many publishers in the data set, we focus on those with at least 500 high-selling games, making up the top 9 publishers. Though EA had the most high-selling games as determined above, this visual helps to illustrate that Nintendo has the best average global sales numbers and is the only publisher to average over a million dollars in sales on their games. The top 15 sellers in the whole data set were all published by Nintendo, so it makes sense that these massive hits would balance out their lower-selling games.
On average, puzzle games have longer development times than action games. (This could potentially explain why there are so many high-selling action games compared to puzzle games–if puzzle games take longer to develop, they will be rarer in general.)
sales_by_platform_and_genre <- game_sales |>
group_by(platform, genre) |>
summarize(count = n(), total_global_sales = sum(global_sales)) |>
arrange(desc(count))
sales_by_platform_and_genre
## # A tibble: 293 × 4
## # Groups: platform [31]
## platform genre count total_global_sales
## <chr> <chr> <int> <dbl>
## 1 PS2 Sports 400 273.
## 2 DS Misc 393 138.
## 3 PS3 Action 380 308.
## 4 PS2 Action 348 273.
## 5 DS Action 343 116.
## 6 X360 Action 324 243.
## 7 DS Simulation 285 132.
## 8 Wii Misc 280 221.
## 9 Wii Sports 261 292.
## 10 DS Adventure 240 47.3
## # ℹ 283 more rows
Here, we can see that the most common games in our data set are sports games for the PlayStation 2. However, some of the slightly less frequent platform/genre combinations have been more successful, notably action games for the PlayStation 3 and sports game for the Wii. The latter is clearly due to Wii Sports and Wii Sports Resort being the best and fourth best sellers of the entire data set, but the former could use further investigation. The rarer platform/genre combinations mostly consist of older or less popular platforms that only have a few total games in the data set.
game_sales |>
filter(platform == "PS2", genre == "Sports", publisher %in% top_publishers) |>
ggplot() +
geom_col(mapping = aes(x = year, y = global_sales, fill = publisher)) +
labs (title = "Global Sales of PS2 Sports Games Over Time", x = "Year", y = "Global Sales", subtitle = "Only includes games by publishers with at least 500 high-selling games") +
theme_classic() +
scale_fill_brewer(palette = "Set2")
Since there are too many different publishers to focus on in a single visual, we continue to focus on the top 9 publishers. This would be too many for this kind of visual, but since we are focusing on the PS2, Nintendo will not be present and we can focus on 8 publishers instead. Based on the discussion in the Group by Publisher section, it is unsurprising to see that EA made the most sales on sports games for the PS2 than any other publisher in each year except for perhaps 2011. Though they were not generally as successful as their competitors, it is expected to see that Sony has games present in each year since they own PlayStation. Sales had been decreasing for a few years already, but a major drop can be seen after 2006, when the PS3 was released. The trends in this visual would have to be analyzed against all sports games and all PS2 games to get a better idea of how the genre impacts sales on this particular platform.
platform_genre_combinations <- crossing(select(game_sales, platform), select(game_sales, genre))
platform_genre_combinations |>
merge (sales_by_platform_and_genre, all.x = TRUE) |>
filter (is.na(count)) |>
select(platform, genre) |>
tibble()
## # A tibble: 79 × 2
## platform genre
## <chr> <chr>
## 1 2600 Role-Playing
## 2 2600 Strategy
## 3 3DO Action
## 4 3DO Fighting
## 5 3DO Misc
## 6 3DO Platform
## 7 3DO Racing
## 8 3DO Role-Playing
## 9 3DO Shooter
## 10 3DO Sports
## # ℹ 69 more rows
Here, we can systematically determine each missing combination of platform and genre in the data set. Many platforms have several genres missing, which is most likely due to being older consoles, as this data set does not account for inflation for sales. Individual instances of missing platform/genre combinations require more investigation, such as there being no puzzle games for the Xbox One in the data.