Source: https://gamepress.jp/archives/157095
Video games are one of the most influential forms of entertainment worldwide, but player preferences are not the same across regions. In reality, some countries and regions favor certain genres, platforms, or styles of games more than others. Understanding these differences can help reveal interesting cultural and market patterns within the gaming industry.
The dataset used in this project contains information about video game sales in North America, Japan, Europe, and worldwide. The data was originally collected from VGChartz, a website that tracks and estimates video game sales and market performance. Since the data was gathered through web scraping, the sales figures should be interpreted as estimates rather than exact official company records.
This dataset contains information for 16,598 video games and 11 variables covering genre, platform, publisher, release year, and sales. The variables used in this project are described below.
| Variable | Type | Description |
|---|---|---|
| Platform | Categorical | Gaming platform (PS4, Wii, Xbox, etc.) |
| Genre | Categorical | Type of video game (adventure, puzzle, etc.) |
| Year | Quantitative | Release year of the game |
| NA_Sales | Quantitative | Sales in North America (millions) |
| EU_Sales | Quantitative | Sales in Europe (millions) |
| JP_Sales | Quantitative | Sales in Japan (millions) |
| Global_Sales | Quantitative | Total worldwide sales (millions) |
The main question explored in this project is: How do video game preferences and sales differ across regions such as North America, Japan, and Europe?
I chose this topic because video games are an important part of modern entertainment and culture, and I find it interesting how gaming preferences can vary across different regions. I also wanted to explore a dataset that combines cultural trends, technology, and consumer behavior in a way that can be visually analyzed through graphs and interactive visualizations.
# Necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.3
## Warning: package 'dplyr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.3
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
vgames <- read_csv("vgsales.csv")
## Rows: 16598 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Platform, Year, Genre, Publisher
## dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Before creating any visualizations or statistical models, the dataset needs to be cleaned and organized to focus on the variables most relevant for this project. This process includes selecting meaningful variables, checking for missing values, filtering incomplete observations, and preparing the data for analysis.
str(vgames)
## spc_tbl_ [16,598 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Rank : num [1:16598] 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : chr [1:16598] "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
## $ Platform : chr [1:16598] "Wii" "NES" "Wii" "Wii" ...
## $ Year : chr [1:16598] "2006" "1985" "2008" "2009" ...
## $ Genre : chr [1:16598] "Sports" "Platform" "Racing" "Sports" ...
## $ Publisher : chr [1:16598] "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
## $ NA_Sales : num [1:16598] 41.5 29.1 15.8 15.8 11.3 ...
## $ EU_Sales : num [1:16598] 29.02 3.58 12.88 11.01 8.89 ...
## $ JP_Sales : num [1:16598] 3.77 6.81 3.79 3.28 10.22 ...
## $ Other_Sales : num [1:16598] 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
## $ Global_Sales: num [1:16598] 82.7 40.2 35.8 33 31.4 ...
## - attr(*, "spec")=
## .. cols(
## .. Rank = col_double(),
## .. Name = col_character(),
## .. Platform = col_character(),
## .. Year = col_character(),
## .. Genre = col_character(),
## .. Publisher = col_character(),
## .. NA_Sales = col_double(),
## .. EU_Sales = col_double(),
## .. JP_Sales = col_double(),
## .. Other_Sales = col_double(),
## .. Global_Sales = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(vgames)
## # A tibble: 6 × 11
## Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 Wii Sports Wii 2006 Spor… Nintendo 41.5 29.0 3.77
## 2 2 Super Mario B… NES 1985 Plat… Nintendo 29.1 3.58 6.81
## 3 3 Mario Kart Wii Wii 2008 Raci… Nintendo 15.8 12.9 3.79
## 4 4 Wii Sports Re… Wii 2009 Spor… Nintendo 15.8 11.0 3.28
## 5 5 Pokemon Red/P… GB 1996 Role… Nintendo 11.3 8.89 10.2
## 6 6 Tetris GB 1989 Puzz… Nintendo 23.2 2.26 4.22
## # ℹ 2 more variables: Other_Sales <dbl>, Global_Sales <dbl>
colSums(is.na(vgames))
## Rank Name Platform Year Genre Publisher
## 0 0 0 0 0 0
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
## 0 0 0 0 0
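There are no values recorded as NA, so I can move to the next step. Note that Year is stored as text, so any non-numeric entries are not counted here and only show up as NA once the column is converted to a number.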
vgames1 <- vgames |>
select(Name, Platform, Genre, Year, NA_Sales, EU_Sales, JP_Sales, Global_Sales)
vgames2 <- vgames1 |>
mutate(Year = as.numeric(Year))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Year = as.numeric(Year)`.
## Caused by warning:
## ! NAs introduced by coercion
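The coercion warning above means that some Year entries are text that cannot be read as a number. The sketch below illustrates the idea on a toy vector; the placeholder "N/A" is an assumption about how missing years are written, not something confirmed by the output shown here.

```r
# Toy stand-in for the Year column; "N/A" is an assumed placeholder for a missing year
year_chr <- c("2006", "1985", "N/A", "2008")

# is.na() sees ordinary text, so it reports no missing values
n_missing_text <- sum(is.na(year_chr))

# Converting to numeric turns non-numeric text into NA (warning silenced here)
year_num <- suppressWarnings(as.numeric(year_chr))
n_missing_num <- sum(is.na(year_num))

# List the entries that failed to convert
bad_entries <- unique(year_chr[is.na(year_num)])

n_missing_text
n_missing_num
bad_entries
```

Running the same check on the real column before converting it would show how many games lose their release year and what the offending entries look like.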
vgames2 <- vgames2 |>
mutate(Regional_sales = NA_Sales + EU_Sales + JP_Sales)
head(vgames2)
## # A tibble: 6 × 9
## Name Platform Genre Year NA_Sales EU_Sales JP_Sales Global_Sales
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Wii Sports Wii Spor… 2006 41.5 29.0 3.77 82.7
## 2 Super Mario Bros. NES Plat… 1985 29.1 3.58 6.81 40.2
## 3 Mario Kart Wii Wii Raci… 2008 15.8 12.9 3.79 35.8
## 4 Wii Sports Resort Wii Spor… 2009 15.8 11.0 3.28 33
## 5 Pokemon Red/Poke… GB Role… 1996 11.3 8.89 10.2 31.4
## 6 Tetris GB Puzz… 1989 23.2 2.26 4.22 30.3
## # ℹ 1 more variable: Regional_sales <dbl>
regional_totals <- vgames2 |>
summarize( North_America = sum(NA_Sales),
Europe = sum(EU_Sales),
Japan = sum(JP_Sales))
regional_totals
## # A tibble: 1 × 3
## North_America Europe Japan
## <dbl> <dbl> <dbl>
## 1 4393. 2434. 1291.
This output shows that North America accounts for the largest share of video game sales in the dataset, which indicates that this region has historically been the largest market for video games. However, these totals reflect overall market size rather than specific genre preferences, which will be explored later in the analysis.
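To put the totals on a common scale, each region can be expressed as a share of the three-region total. This is a minimal sketch that copies the totals from the output in this report; it leaves out Other_Sales, so the shares are relative to these three regions only, not to global sales.

```r
# Regional totals copied from the summary output (millions of units)
regional_totals <- c(North_America = 4392.95, Europe = 2434.13, Japan = 1291.02)

# Share of the combined three-region total (Other_Sales is not included)
regional_share <- regional_totals / sum(regional_totals)

# Express as percentages rounded to one decimal place
round(100 * regional_share, 1)
```

Under this assumption North America accounts for roughly 54% of the combined total, Europe about 30%, and Japan about 16%.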
# Creating a small summary table so ggplot can use it
regional_plot <- data.frame(
Region = c("North America", "Europe", "Japan"),
Total_Sales = c(
regional_totals$North_America,
regional_totals$Europe,
regional_totals$Japan))
regional_plot
## Region Total_Sales
## 1 North America 4392.95
## 2 Europe 2434.13
## 3 Japan 1291.02
# Creating a Bar Chart
ggplot(regional_plot, aes(x = Region, y = Total_Sales, fill = Region)) +
geom_col() +
labs( title = "Total Video Game Sales by Region",
subtitle = "North America, Europe, and Japan",
x = "Region", y = "Total Sales (Millions of Units)",
caption = "Source: VGChartz") +
scale_fill_manual(values = c( "North America" = "steelblue",
"Europe" = "darkorange", "Japan" = "forestgreen")) +
theme_minimal() +
theme( legend.position = "none",
plot.title = element_text(face = "bold"))
This bar chart confirms the earlier calculations: North America clearly has the largest video game sales and Japan the smallest.
To examine which factors are associated with higher video game sales, a multiple linear regression model will be fit with Global_Sales as the response variable. The predictors chosen are Year, Platform, and Genre. This model estimates how sales vary across platforms and genres while accounting for the release year of each game.
Model Equation:
Global_Sales = β0 + β1(Year) + (Platform effects) + (Genre effects) + ε
Here ε is the error term, which absorbs other factors not included in the model, such as marketing, reviews, brand popularity, and competition.
#Fitting multiple linear regression model
sales_model <- lm(Global_Sales ~ Year + Platform + Genre,
data = vgames2)
summary(sales_model)
##
## Call:
## lm(formula = Global_Sales ~ Year + Platform + Genre, data = vgames2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.921 -0.460 -0.231 0.040 81.874
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.626425 10.740555 10.021 < 2e-16 ***
## Year -0.053946 0.005417 -9.958 < 2e-16 ***
## Platform3DO 0.174396 0.896411 0.195 0.845748
## Platform3DS 1.470262 0.229774 6.399 1.61e-10 ***
## PlatformDC 0.597409 0.273364 2.185 0.028874 *
## PlatformDS 1.144613 0.203421 5.627 1.87e-08 ***
## PlatformGB 2.654543 0.224107 11.845 < 2e-16 ***
## PlatformGBA 0.787729 0.190363 4.138 3.52e-05 ***
## PlatformGC 0.772435 0.194888 3.963 7.42e-05 ***
## PlatformGEN 0.872426 0.332332 2.625 0.008669 **
## PlatformGG -0.452941 1.535902 -0.295 0.768072
## PlatformN64 0.853398 0.189236 4.510 6.54e-06 ***
## PlatformNES 2.046023 0.212122 9.645 < 2e-16 ***
## PlatformNG 0.111288 0.470991 0.236 0.813215
## PlatformPC 1.039675 0.209595 4.960 7.11e-07 ***
## PlatformPCFX -0.020418 1.536482 -0.013 0.989397
## PlatformPS 0.770731 0.172568 4.466 8.01e-06 ***
## PlatformPS2 1.098724 0.190334 5.773 7.95e-09 ***
## PlatformPS3 1.563082 0.214803 7.277 3.57e-13 ***
## PlatformPS4 1.881126 0.242378 7.761 8.92e-15 ***
## PlatformPSP 1.020643 0.207107 4.928 8.38e-07 ***
## PlatformPSV 1.205750 0.235408 5.122 3.06e-07 ***
## PlatformSAT 0.280216 0.199198 1.407 0.159531
## PlatformSCD 0.235627 0.643184 0.366 0.714113
## PlatformSNES 0.747928 0.185208 4.038 5.41e-05 ***
## PlatformTG16 0.089196 1.091911 0.082 0.934896
## PlatformWii 1.479050 0.208021 7.110 1.21e-12 ***
## PlatformWiiU 1.543642 0.255747 6.036 1.62e-09 ***
## PlatformWS 0.516690 0.647972 0.797 0.425233
## PlatformX360 1.564254 0.211597 7.393 1.51e-13 ***
## PlatformXB 0.743070 0.192103 3.868 0.000110 ***
## PlatformXOne 1.705254 0.249889 6.824 9.16e-12 ***
## GenreAdventure -0.249809 0.051341 -4.866 1.15e-06 ***
## GenreFighting -0.022232 0.060417 -0.368 0.712898
## GenreMisc -0.077425 0.046379 -1.669 0.095055 .
## GenrePlatform 0.326986 0.059438 5.501 3.83e-08 ***
## GenrePuzzle -0.172894 0.071061 -2.433 0.014983 *
## GenreRacing 0.025259 0.052117 0.485 0.627922
## GenreRole-Playing 0.100248 0.048617 2.062 0.039224 *
## GenreShooter 0.223183 0.051189 4.360 1.31e-05 ***
## GenreSimulation -0.066781 0.060161 -1.110 0.266998
## GenreSports -0.023984 0.042386 -0.566 0.571509
## GenreStrategy -0.243725 0.066609 -3.659 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.527 on 16284 degrees of freedom
## (271 observations deleted due to missingness)
## Multiple R-squared: 0.0507, Adjusted R-squared: 0.04825
## F-statistic: 20.71 on 42 and 16284 DF, p-value: < 2.2e-16
Regression Results
The model is statistically significant overall (F-test p-value < 2.2e-16), indicating that at least some of these predictors are related to video game sales. However, the adjusted R-squared value was 0.048, meaning that the model explains only about 4.8% of the variation in global sales. Although this is a relatively small proportion, it suggests that year, genre, and platform have some predictive value, while most of the variation in sales is likely driven by other factors not included in this dataset. Note that 271 games with a missing release year were dropped when the model was fit. Several platform variables, including PS4, XOne, and Game Boy (GB), had positive coefficients, indicating higher expected sales relative to the reference platform, holding year and genre constant. Shooter and Platform games also showed positive associations with sales, while Adventure and Strategy games tended to have lower predicted sales than the reference genre.
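Because Platform and Genre are categorical, each coefficient is a comparison with a reference level that does not appear in the coefficient table. The sketch below uses a small made-up data frame (not the real dataset) to show how R picks that reference level by default.

```r
# Toy platform values; the real column has many more levels
toy <- data.frame(Platform = c("Wii", "NES", "GB", "Wii"))

# Character predictors are treated as factors, with levels sorted alphabetically
platform_levels <- levels(factor(toy$Platform))

# The design matrix has one indicator column per non-reference level;
# the first level ("GB" here) is absorbed into the intercept
mm <- model.matrix(~ Platform, data = toy)

platform_levels
colnames(mm)
```

To compare against a different baseline, the column could be converted with `factor()` and passed through `relevel()` before fitting the model.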
# Diagnostic Plots
par(mfrow = c(2, 2))
plot(sales_model)
## Warning: not plotting observations with leverage one:
## 13314, 14327
par(mfrow = c(1, 1))
The diagnostic plots suggest that the assumptions of linear regression are only partially satisfied. The Residuals vs Fitted plot shows some clustering and several large outliers, indicating that the model does not capture all of the variation in sales. The Q-Q plot shows a strong deviation from the straight line in the upper tail, suggesting that the residuals are not normally distributed. The Scale-Location plot indicates that the spread of residuals is not constant across fitted values. Finally, the Residuals vs Leverage plot identifies a few potentially influential observations. These patterns are expected because video game sales are highly skewed, with a small number of blockbuster titles selling far more than most games.
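One common response to this kind of right skew is to model the logarithm of sales instead of raw sales. This was not done in this project; the sketch below only illustrates the idea on deterministic made-up data, and it assumes every sales value is strictly positive so that the log is defined.

```r
# Deterministic right-skewed toy sales (log-normal quantiles), not real data
n <- 300
toy <- data.frame(
  Genre = rep(c("Action", "Puzzle", "Shooter"), length.out = n),
  Global_Sales = qlnorm(ppoints(n), meanlog = -1, sdlog = 1.2)
)

# Same predictor, two response scales
raw_model <- lm(Global_Sales ~ Genre, data = toy)
log_model <- lm(log(Global_Sales) ~ Genre, data = toy)

# Rough asymmetry measure: (mean - median) of the residuals, in SD units
skew_gap <- function(model) {
  r <- resid(model)
  (mean(r) - median(r)) / sd(r)
}

c(raw = skew_gap(raw_model), log = skew_gap(log_model))
```

A value near zero suggests roughly symmetric residuals. Coefficients from the log model would be interpreted as approximate percentage changes in sales rather than changes in millions of units.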
To explore how genre preferences differ across major markets, this interactive chart compares the average sales of each genre in North America, Japan, and Europe. An interactive visualization makes it easier to explore genre preferences.
To create it, it is necessary to summarize average sales by genre and region from the dataset and convert the result to a long format.
genre_region <- vgames2 |>
group_by(Genre) |>
summarize( North_America = mean(NA_Sales),
Europe = mean(EU_Sales), Japan = mean(JP_Sales))
# Convert to long format
genre_region_long <- genre_region |>
pivot_longer( cols = c(North_America, Europe, Japan),
names_to = "Region",
values_to = "Average_Sales")
The following code builds the plot with ggplot2 and converts it to an interactive plotly chart.
Inter_plot <- ggplot(genre_region_long,
aes(x = Genre, y = Average_Sales,
fill = Region, text = paste("Genre:", Genre,
"<br>Region:", Region,
"<br>Average Sales:",
round(Average_Sales, 3), "million"))) +
geom_col(position = "dodge") + scale_fill_manual(values = c(
"North_America" = "brown", "Europe" = "orange",
"Japan" = "darkgreen")) +
labs( title = "Average Video Game Sales by Genre Across Regions",
subtitle = "Comparison of North America, Europe, and Japan",
x = "Genre", y = "Average Sales (Millions of Units)",
fill = "Region", caption = "Source: VGChartz") +
theme_minimal() +
theme( axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"))
# Convert to interactive plot
ggplotly(Inter_plot, tooltip = "text")
This interactive chart highlights important regional differences in video game genre preferences across North America, Europe, and Japan. North America has the highest average sales in most genres, especially Platform, Shooter, and Sports games. This suggests that fast-paced and competitive games have historically been very popular in the North American market. Europe follows a pattern similar to North America, although average sales are generally lower across most genres. Genres such as Shooter, Sports, and Racing also perform relatively well in Europe, which may reflect similar gaming trends between these two regions. Japan shows a noticeably different pattern compared to the other regions. Role-Playing games have the highest average sales in Japan, while Shooter and Action-related genres are much less popular. This may reflect the strong influence of Japanese role-playing game franchises and storytelling-focused games within the Japanese gaming market.
Another interesting result is that Strategy and Puzzle games have relatively low average sales across all regions, suggesting that these genres appeal to smaller audiences compared to mainstream genres like Shooter, Platform, and Sports. Overall, the visualization suggests that video game preferences are strongly influenced by regional market trends and cultural gaming interests. While North America and Europe show similar genre preferences, Japan demonstrates a more distinct preference for Role-Playing games.
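Because North America is the largest market overall, its average sales are higher in almost every genre. One way to separate genre preference from market size is to look at each genre's share of a region's own sales. The sketch below uses made-up numbers, not values from the dataset, and is only meant to illustrate the calculation.

```r
# Hypothetical regional sales by genre (millions); not taken from the dataset
toy <- data.frame(
  Genre = c("Role-Playing", "Shooter", "Sports"),
  NA_Sales = c(2, 5, 3),
  JP_Sales = c(3, 0.5, 1.5)
)

# Divide each genre's sales by the region's total, so each column sums to 1
shares <- data.frame(
  Genre = toy$Genre,
  NA_share = toy$NA_Sales / sum(toy$NA_Sales),
  JP_share = toy$JP_Sales / sum(toy$JP_Sales)
)

shares
```

In this toy example Role-Playing makes up 60% of the Japanese column but only 20% of the North American one, even though the raw totals are of similar size.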
For this section, Tableau was used to create a visualization that shows total sales by platform across regions. This analysis was chosen because platforms are closely connected to regional gaming culture, market dominance, and consumer behavior.
This interactive Tableau visualization shows total video game sales by platform, broken down into North America, Europe, and Japan. To create this chart, the regional sales columns (NA_Sales, EU_Sales, and JP_Sales) were reshaped using a pivot option to convert the data from a wide format into a long format. This allowed the creation of a single “Region” variable and a corresponding “Sales” measure, which made it possible to build a stacked bar chart. After restructuring the data, Platform was placed on the x-axis and total Sales on the y-axis, while Region was used to color the bars.
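The same wide-to-long restructuring can be reproduced in R. The sketch below is a rough base R equivalent on a small made-up table; it is not the Tableau workflow itself, and the Region and Sales column names simply mirror the ones described above.

```r
# Made-up rows standing in for the real dataset
toy <- data.frame(
  Platform = c("Wii", "Wii", "PS2"),
  NA_Sales = c(1.0, 2.0, 3.0),
  EU_Sales = c(0.5, 1.0, 2.0),
  JP_Sales = c(0.2, 0.3, 0.5)
)

# Sum each regional column within platform (still wide format)
wide <- aggregate(cbind(NA_Sales, EU_Sales, JP_Sales) ~ Platform,
                  data = toy, FUN = sum)

# Stack the three regional columns into Region / Sales pairs (long format)
long <- data.frame(
  Platform = rep(wide$Platform, times = 3),
  Region = rep(c("North America", "Europe", "Japan"), each = nrow(wide)),
  Sales = c(wide$NA_Sales, wide$EU_Sales, wide$JP_Sales)
)

long
```

Each platform now has one row per region, which is the shape a stacked bar chart needs: Platform on the x-axis, Sales on the y-axis, and Region as the fill color.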
The visualization shows that North America contributes the largest share of sales across most platforms, particularly for systems like Xbox 360 and PlayStation consoles. Europe follows a similar pattern but with slightly lower sales, while Japan shows a more distinct distribution, with stronger relative performance in Nintendo-related platforms. Overall, this visualization highlights how platform success varies across regions and supports the broader finding that video game markets are regionally distinct.
Research shows that video game preferences vary significantly across regions due to cultural and historical differences in gaming development. For example, the Japanese gaming market has traditionally emphasized role-playing games and narrative-driven experiences, supported by long-running franchises such as Pokémon and Final Fantasy (UdOnis, 2023). In contrast, North American and European markets show stronger preferences for action, shooter, and sports games. This supports the idea that gaming preferences are not globalized uniformly but are influenced by regional culture and market history.
The Tableau visualization further reinforces these findings by showing how sales are distributed across platforms. North American sales dominate on platforms such as Xbox 360 and PlayStation systems, while Japan shows stronger relative contributions from Nintendo platforms. This suggests that platform popularity is also regionally influenced and closely tied to local gaming ecosystems.
These patterns are also consistent with the results of the multiple linear regression model, which indicated that both platform and genre have a statistically significant relationship with global sales. However, the model’s relatively low explanatory power suggests that while these factors matter, they only explain a small portion of overall variation in sales, and other influences such as marketing, branding, and consumer trends likely play a significant role.
One interesting observation is that despite global technological convergence in gaming, regional differences remain strong and consistent. Even as the industry becomes more globalized, consumer preferences continue to reflect cultural and historical influences.
One limitation of the analysis is that certain factors such as marketing influence, game reviews, pricing strategies, and digital vs physical sales were not included in the dataset. These factors could further explain variation in sales performance. Additionally, more advanced time-based analysis could have provided deeper insight into how preferences change across different gaming generations.
Overall, the visualizations demonstrate that video game preferences are strongly shaped by regional markets, with North America and Europe showing similar trends and Japan standing out as a distinct gaming culture with unique genre preferences.
Source: UdOnis. (2023). Japanese Gaming Market: Trends and Insights. https://www.blog.udonis.co/mobile-marketing/mobile-games/japanese-gaming-market