Video Game Title Source: https://gamepress.jp/archives/157095

1. Introduction

Video games are one of the most influential forms of entertainment worldwide, but player preferences are not the same across regions. In reality, some countries and regions may favor certain genres, platforms, or styles of games more than others. Understanding these differences can help reveal interesting cultural and market patterns within the gaming industry.

The dataset used in this project contains information about video game sales in North America, Japan, Europe, and the world overall. The data was originally collected from VGChartz, a website that tracks and estimates video game sales and market performance. Since the data was gathered through web scraping, the sales figures should be interpreted as estimates rather than exact official company records.

This dataset contains information on 16,598 video games and 11 variables, including genre, platform, publisher, and release year. The variables used in this project are described below.

Variable       Type           Description
Platform       Categorical    Gaming platform (PS4, Wii, Xbox, etc.)
Genre          Categorical    Type of video game (adventure, puzzle, etc.)
Year           Quantitative   Release year of the game
NA_Sales       Quantitative   Sales in North America (millions)
EU_Sales       Quantitative   Sales in Europe (millions)
JP_Sales       Quantitative   Sales in Japan (millions)
Global_Sales   Quantitative   Total worldwide sales (millions)

The main question explored in this project is: How do video game preferences and sales differ across regions such as North America, Japan, and Europe?

I chose this topic because video games are an important part of modern entertainment and culture, and I find it interesting how gaming preferences can vary across different regions. I also wanted to explore a dataset that combines cultural trends, technology, and consumer behavior in a way that can be visually analyzed through graphs and interactive visualizations.

# Necessary libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.3
## Warning: package 'dplyr' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
vgames <- read_csv("vgsales.csv")
## Rows: 16598 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Platform, Year, Genre, Publisher
## dbl (6): Rank, NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Data Cleaning and Wrangling

Before creating any visualizations or statistical models, the dataset needs to be cleaned and organized to focus on the variables most relevant to this project. This process includes selecting meaningful variables, checking for missing values, handling incomplete observations, and preparing the data for analysis.

str(vgames)
## spc_tbl_ [16,598 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Rank        : num [1:16598] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Name        : chr [1:16598] "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
##  $ Platform    : chr [1:16598] "Wii" "NES" "Wii" "Wii" ...
##  $ Year        : chr [1:16598] "2006" "1985" "2008" "2009" ...
##  $ Genre       : chr [1:16598] "Sports" "Platform" "Racing" "Sports" ...
##  $ Publisher   : chr [1:16598] "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
##  $ NA_Sales    : num [1:16598] 41.5 29.1 15.8 15.8 11.3 ...
##  $ EU_Sales    : num [1:16598] 29.02 3.58 12.88 11.01 8.89 ...
##  $ JP_Sales    : num [1:16598] 3.77 6.81 3.79 3.28 10.22 ...
##  $ Other_Sales : num [1:16598] 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
##  $ Global_Sales: num [1:16598] 82.7 40.2 35.8 33 31.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Rank = col_double(),
##   ..   Name = col_character(),
##   ..   Platform = col_character(),
##   ..   Year = col_character(),
##   ..   Genre = col_character(),
##   ..   Publisher = col_character(),
##   ..   NA_Sales = col_double(),
##   ..   EU_Sales = col_double(),
##   ..   JP_Sales = col_double(),
##   ..   Other_Sales = col_double(),
##   ..   Global_Sales = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(vgames)
## # A tibble: 6 × 11
##    Rank Name           Platform Year  Genre Publisher NA_Sales EU_Sales JP_Sales
##   <dbl> <chr>          <chr>    <chr> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports     Wii      2006  Spor… Nintendo      41.5    29.0      3.77
## 2     2 Super Mario B… NES      1985  Plat… Nintendo      29.1     3.58     6.81
## 3     3 Mario Kart Wii Wii      2008  Raci… Nintendo      15.8    12.9      3.79
## 4     4 Wii Sports Re… Wii      2009  Spor… Nintendo      15.8    11.0      3.28
## 5     5 Pokemon Red/P… GB       1996  Role… Nintendo      11.3     8.89    10.2 
## 6     6 Tetris         GB       1989  Puzz… Nintendo      23.2     2.26     4.22
## # ℹ 2 more variables: Other_Sales <dbl>, Global_Sales <dbl>
colSums(is.na(vgames))
##         Rank         Name     Platform         Year        Genre    Publisher 
##            0            0            0            0            0            0 
##     NA_Sales     EU_Sales     JP_Sales  Other_Sales Global_Sales 
##            0            0            0            0            0

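is.na() finds no missing values, so I can move to the next step. Note, however, that Year was read as a character column, which suggests that some entries are not numeric. Those entries become NA when Year is converted to a number in the next step, and the regression later drops 271 observations for that reason.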
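The following is a minimal sketch of why a check with is.na() can report zero missing values even when some years are absent. It assumes the missing years are stored as the text "N/A" (an assumption, since the raw file is not shown here) and uses a small inline CSV with made-up rows rather than vgsales.csv. The object names `toy_csv`, `default_read`, and `explicit_na` are hypothetical.

```r
library(readr)

# Made-up rows; "N/A" stands in for an unknown release year (assumed placeholder)
toy_csv <- I("Name,Year,Global_Sales\nGame A,2006,1.5\nGame B,N/A,0.3\nGame C,1999,0.8\n")

# Default behaviour: "N/A" is ordinary text, so Year is read as character
default_read <- read_csv(toy_csv, show_col_types = FALSE)
class(default_read$Year)
sum(is.na(default_read$Year))   # no NA is detected

# Declaring "N/A" as a missing-value marker lets readr parse Year as a number
explicit_na <- read_csv(toy_csv, na = c("", "NA", "N/A"), show_col_types = FALSE)
class(explicit_na$Year)
sum(is.na(explicit_na$Year))    # the missing year is now counted
```

If the same placeholder is used in the real file, passing the `na` argument to read_csv() would make the missing years visible to colSums(is.na(...)) and would avoid the coercion warning later on.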

vgames1 <- vgames |>
    select(Name, Platform, Genre, Year, NA_Sales, EU_Sales, JP_Sales, Global_Sales)
vgames2 <- vgames1 |>
  mutate(Year = as.numeric(Year))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Year = as.numeric(Year)`.
## Caused by warning:
## ! NAs introduced by coercion
vgames2 <- vgames2 |>
  mutate(Regional_sales = NA_Sales + EU_Sales + JP_Sales)
head(vgames2)
## # A tibble: 6 × 9
##   Name              Platform Genre  Year NA_Sales EU_Sales JP_Sales Global_Sales
##   <chr>             <chr>    <chr> <dbl>    <dbl>    <dbl>    <dbl>        <dbl>
## 1 Wii Sports        Wii      Spor…  2006     41.5    29.0      3.77         82.7
## 2 Super Mario Bros. NES      Plat…  1985     29.1     3.58     6.81         40.2
## 3 Mario Kart Wii    Wii      Raci…  2008     15.8    12.9      3.79         35.8
## 4 Wii Sports Resort Wii      Spor…  2009     15.8    11.0      3.28         33  
## 5 Pokemon Red/Poke… GB       Role…  1996     11.3     8.89    10.2          31.4
## 6 Tetris            GB       Puzz…  1989     23.2     2.26     4.22         30.3
## # ℹ 1 more variable: Regional_sales <dbl>
regional_totals <- vgames2 |>
  summarize( North_America = sum(NA_Sales),
             Europe = sum(EU_Sales),
             Japan = sum(JP_Sales))
regional_totals
## # A tibble: 1 × 3
##   North_America Europe Japan
##           <dbl>  <dbl> <dbl>
## 1         4393.  2434. 1291.

This output shows that North America accounts for the largest share of video game sales in the dataset, which indicates that this region has historically been the largest market for video games. However, these totals reflect overall market size rather than specific genre preferences, which are explored later in the interactive visualization.

# Creating a small summary table so ggplot can use it
regional_plot <- data.frame(
  Region = c("North America", "Europe", "Japan"),
  Total_Sales = c(
    regional_totals$North_America,
    regional_totals$Europe,
    regional_totals$Japan))
regional_plot
##          Region Total_Sales
## 1 North America     4392.95
## 2        Europe     2434.13
## 3         Japan     1291.02
# Creating a Bar Chart
ggplot(regional_plot, aes(x = Region, y = Total_Sales, fill = Region)) +
  geom_col() +
  labs( title = "Total Video Game Sales by Region",
        subtitle = "North America, Europe, and Japan",
        x = "Region", y = "Total Sales (Millions of Units)",
        caption = "Source: VGChartz") +
  scale_fill_manual(values = c( "North America" = "steelblue",
                                "Europe" = "darkorange", "Japan" = "forestgreen")) +
  theme_minimal() +
  theme( legend.position = "none",
         plot.title = element_text(face = "bold"))

This bar chart confirms the previous calculations: North America clearly has the largest video game sales and Japan the smallest.
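The size of the gap can be made concrete by converting the totals into shares. This is a small sketch using the three totals printed above. The names `totals_vec` and `regional_share` are introduced here for illustration, and the shares cover only these three regions because Other_Sales is left out.

```r
# Regional totals (millions of units) copied from the printed summary table
totals_vec <- c(North_America = 4392.95, Europe = 2434.13, Japan = 1291.02)

# Each region's share of the three-region total
regional_share <- totals_vec / sum(totals_vec)

# Express as percentages rounded to one decimal place
round(100 * regional_share, 1)
```

With these totals, North America accounts for roughly 54% of the combined sales, Europe for about 30%, and Japan for about 16%.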

3. Multiple Linear Regression Model

To examine which factors are associated with higher video game sales, a multiple linear regression model is fit with Global_Sales as the response variable. The predictors are Year, Platform, and Genre. This model estimates how sales vary across platforms and genres while accounting for the release year of each game.

Model Equation:

Global_Sales = Year + Platform + Genre + (Other Factors)

Examples of other factors include marketing, reviews, brand popularity, and competition.
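Written more formally, a multiple linear regression with categorical predictors represents Platform and Genre as sets of indicator (dummy) variables. The form below is a sketch of that idea: the symbols $\beta_0$, $\beta_1$, $\gamma_p$, $\delta_g$, and $\varepsilon_i$ are notation introduced here for illustration, and the "other factors" are absorbed into the error term $\varepsilon_i$.

```latex
\text{Global\_Sales}_i
  = \beta_0
  + \beta_1\,\text{Year}_i
  + \sum_{p} \gamma_p\,\mathbf{1}[\text{Platform}_i = p]
  + \sum_{g} \delta_g\,\mathbf{1}[\text{Genre}_i = g]
  + \varepsilon_i
```

The sums run over every platform and genre except one reference level each, so each $\gamma_p$ and $\delta_g$ is a difference in expected sales (in millions) relative to that reference level, holding the other predictors fixed.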

# Fitting the multiple linear regression model

sales_model <- lm(Global_Sales ~ Year + Platform + Genre,
                  data = vgames2)

summary(sales_model)
## 
## Call:
## lm(formula = Global_Sales ~ Year + Platform + Genre, data = vgames2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.921 -0.460 -0.231  0.040 81.874 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       107.626425  10.740555  10.021  < 2e-16 ***
## Year               -0.053946   0.005417  -9.958  < 2e-16 ***
## Platform3DO         0.174396   0.896411   0.195 0.845748    
## Platform3DS         1.470262   0.229774   6.399 1.61e-10 ***
## PlatformDC          0.597409   0.273364   2.185 0.028874 *  
## PlatformDS          1.144613   0.203421   5.627 1.87e-08 ***
## PlatformGB          2.654543   0.224107  11.845  < 2e-16 ***
## PlatformGBA         0.787729   0.190363   4.138 3.52e-05 ***
## PlatformGC          0.772435   0.194888   3.963 7.42e-05 ***
## PlatformGEN         0.872426   0.332332   2.625 0.008669 ** 
## PlatformGG         -0.452941   1.535902  -0.295 0.768072    
## PlatformN64         0.853398   0.189236   4.510 6.54e-06 ***
## PlatformNES         2.046023   0.212122   9.645  < 2e-16 ***
## PlatformNG          0.111288   0.470991   0.236 0.813215    
## PlatformPC          1.039675   0.209595   4.960 7.11e-07 ***
## PlatformPCFX       -0.020418   1.536482  -0.013 0.989397    
## PlatformPS          0.770731   0.172568   4.466 8.01e-06 ***
## PlatformPS2         1.098724   0.190334   5.773 7.95e-09 ***
## PlatformPS3         1.563082   0.214803   7.277 3.57e-13 ***
## PlatformPS4         1.881126   0.242378   7.761 8.92e-15 ***
## PlatformPSP         1.020643   0.207107   4.928 8.38e-07 ***
## PlatformPSV         1.205750   0.235408   5.122 3.06e-07 ***
## PlatformSAT         0.280216   0.199198   1.407 0.159531    
## PlatformSCD         0.235627   0.643184   0.366 0.714113    
## PlatformSNES        0.747928   0.185208   4.038 5.41e-05 ***
## PlatformTG16        0.089196   1.091911   0.082 0.934896    
## PlatformWii         1.479050   0.208021   7.110 1.21e-12 ***
## PlatformWiiU        1.543642   0.255747   6.036 1.62e-09 ***
## PlatformWS          0.516690   0.647972   0.797 0.425233    
## PlatformX360        1.564254   0.211597   7.393 1.51e-13 ***
## PlatformXB          0.743070   0.192103   3.868 0.000110 ***
## PlatformXOne        1.705254   0.249889   6.824 9.16e-12 ***
## GenreAdventure     -0.249809   0.051341  -4.866 1.15e-06 ***
## GenreFighting      -0.022232   0.060417  -0.368 0.712898    
## GenreMisc          -0.077425   0.046379  -1.669 0.095055 .  
## GenrePlatform       0.326986   0.059438   5.501 3.83e-08 ***
## GenrePuzzle        -0.172894   0.071061  -2.433 0.014983 *  
## GenreRacing         0.025259   0.052117   0.485 0.627922    
## GenreRole-Playing   0.100248   0.048617   2.062 0.039224 *  
## GenreShooter        0.223183   0.051189   4.360 1.31e-05 ***
## GenreSimulation    -0.066781   0.060161  -1.110 0.266998    
## GenreSports        -0.023984   0.042386  -0.566 0.571509    
## GenreStrategy      -0.243725   0.066609  -3.659 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.527 on 16284 degrees of freedom
##   (271 observations deleted due to missingness)
## Multiple R-squared:  0.0507, Adjusted R-squared:  0.04825 
## F-statistic: 20.71 on 42 and 16284 DF,  p-value: < 2.2e-16

Regression Results

The model is statistically significant overall (F-test p-value < 2.2e-16), indicating that these predictors are related to video game sales. However, the adjusted R-squared is 0.048, meaning that the model explains approximately 4.8% of the variation in global sales. Although this is a relatively small proportion, it suggests that year, genre, and platform have some predictive value, while most of the variation in sales is likely driven by factors not included in this dataset. Several platform variables, including PS4, XOne, and GB (Game Boy), have positive coefficients, indicating higher expected sales relative to the reference platform. Shooter and Platform games also show positive associations with sales relative to the reference genre, while Adventure and Strategy games tend to have lower predicted sales.
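The adjusted R-squared can be reproduced from the other numbers in the summary output, which also shows how many games the model actually used. This is a short worked calculation with the printed values. The variable names are introduced here for illustration, and the result is approximate because the printed R-squared is rounded.

```r
# Values copied from the summary() output
r_squared <- 0.0507   # Multiple R-squared (rounded as printed)
df_resid  <- 16284    # residual degrees of freedom
n_coef    <- 42       # slope coefficients (numerator df of the F-statistic)

# Observations used in the fit: residual df + slopes + 1 for the intercept
n_obs <- df_resid + n_coef + 1
n_obs   # 16327, i.e. 16598 games minus the 271 dropped for missing Year

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid
round(adj_r_squared, 3)
```

Because the sample is large relative to the 42 coefficients, the adjustment is small, and the adjusted value stays close to the unadjusted 0.0507.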

# Diagnostic Plots
par(mfrow = c(2, 2))
plot(sales_model)
## Warning: not plotting observations with leverage one:
##   13314, 14327

par(mfrow = c(1, 1))

The diagnostic plots suggest that the assumptions of linear regression are only partially satisfied. The Residuals vs Fitted plot shows some clustering and several large outliers, indicating that the model does not capture all of the variation in sales. The Q-Q plot shows a strong deviation from the straight line in the upper tail, suggesting that the residuals are not normally distributed. The Scale-Location plot indicates that the spread of residuals is not constant across fitted values. Finally, the Residuals vs Leverage plot identifies a few potentially influential observations. These patterns are expected because video game sales are highly skewed, with a small number of blockbuster titles selling far more than most games.
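One common response to this kind of right skew, which was not applied in this analysis, is to model the logarithm of sales instead of raw sales. The sketch below illustrates the idea on simulated data only: the data frame `toy`, the helper `skewness`, and the log-normal sales values are all made up for demonstration, and applying this to the real data would assume every Global_Sales value is strictly positive.

```r
set.seed(1)
n <- 500

# Simulated stand-in for the real data: a few predictors and right-skewed sales
toy <- data.frame(
  Year  = sample(1990:2015, n, replace = TRUE),
  Genre = factor(sample(c("Action", "Puzzle", "Shooter"), n, replace = TRUE))
)
toy$Global_Sales <- rlnorm(n, meanlog = -1.5, sdlog = 1.2)

# Same predictors, raw response versus log-transformed response
raw_model <- lm(Global_Sales ~ Year + Genre, data = toy)
log_model <- lm(log(Global_Sales) ~ Year + Genre, data = toy)

# Simple sample skewness of the residuals (0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(raw = skewness(residuals(raw_model)),
  log = skewness(residuals(log_model)))
```

In this simulation the residuals of the log model are much closer to symmetric. Coefficients from such a model describe changes in log sales, so they would need to be interpreted as approximate percentage differences rather than differences in millions of units.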

4. Interactive Visualization: Regional Genre Preferences

To explore how genre preferences differ across major markets, this interactive chart compares the average sales of each genre in North America, Japan, and Europe. An interactive visualization makes it easier to explore these genre preferences.

To create it, it is necessary to summarize average sales by genre and region, and then convert the result to a long format.

genre_region <- vgames2 |>
  group_by(Genre) |>
  summarize( North_America = mean(NA_Sales),
             Europe = mean(EU_Sales), Japan = mean(JP_Sales))

# Convert to long format
genre_region_long <- genre_region |>
  pivot_longer( cols = c(North_America, Europe, Japan),
    names_to = "Region",
    values_to = "Average_Sales")

Next is the code for the interactive plot, which builds a ggplot chart and then converts it with plotly.

Inter_plot <- ggplot(genre_region_long,
             aes(x = Genre, y = Average_Sales,
                 fill = Region, text = paste("Genre:", Genre, 
                                             "<br>Region:", Region,
                                             "<br>Average Sales:",
                                             round(Average_Sales, 3), "million"))) +
  geom_col(position = "dodge") + scale_fill_manual(values = c(
    "North_America" = "brown", "Europe" = "orange",
    "Japan" = "darkgreen")) +
  labs( title = "Average Video Game Sales by Genre Across Regions",
        subtitle = "Comparison of North America, Europe, and Japan",
        x = "Genre", y = "Average Sales (Millions of Units)",
        fill = "Region", caption = "Source: VGChartz") +
  theme_minimal() +
  theme( axis.text.x = element_text(angle = 45, hjust = 1),
         plot.title = element_text(face = "bold"))

# Convert to interactive plot
ggplotly(Inter_plot, tooltip = "text")

This interactive chart highlights important regional differences in video game genre preferences across North America, Europe, and Japan. North America has the highest average sales in most genres, especially Platform, Shooter, and Sports games. This suggests that fast-paced and competitive games have historically been very popular in the North American market. Europe follows a pattern similar to North America, although average sales are generally lower across most genres. Genres such as Shooter, Sports, and Racing also perform relatively well in Europe, which may reflect similar gaming trends between these two regions. Japan shows a noticeably different pattern compared to the other regions. Role-Playing games have the highest average sales in Japan, while Shooter and Action-related genres are much less popular. This may reflect the strong influence of Japanese role-playing game franchises and storytelling-focused games within the Japanese gaming market.

Another interesting result is that Strategy and Puzzle games have relatively low average sales across all regions, suggesting that these genres appeal to smaller audiences compared to mainstream genres like Shooter, Platform, and Sports. Overall, the visualization suggests that video game preferences are strongly influenced by regional market trends and cultural gaming interests. While North America and Europe show similar genre preferences, Japan demonstrates a more distinct preference for Role-Playing games.
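Because average sales also reflect the overall size of each market, one way to compare preferences more directly is to look at each genre's share within a region. The following is a minimal sketch of that calculation. The numbers in `toy_genre` are hypothetical placeholders and not results from the dataset, and `Share` is a column name introduced here.

```r
library(dplyr)
library(tidyr)

# Hypothetical average sales (millions) by genre and region; values are made up
toy_genre <- tibble(
  Genre         = c("Role-Playing", "Shooter", "Sports"),
  North_America = c(0.20, 0.40, 0.30),
  Europe        = c(0.10, 0.25, 0.15),
  Japan         = c(0.25, 0.05, 0.10)
)

# Long format, then each genre's share of its region's summed average sales
genre_share <- toy_genre |>
  pivot_longer(cols = -Genre, names_to = "Region", values_to = "Average_Sales") |>
  group_by(Region) |>
  mutate(Share = Average_Sales / sum(Average_Sales)) |>
  ungroup()

genre_share
```

Within each region the shares sum to 1, so a genre that is large in a small market can be compared fairly with the same genre in a large market.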

5. Tableau Visualization

For this section, Tableau was used to create a visualization that shows total sales by platform across regions. This analysis was chosen because platforms are closely connected to regional gaming culture, market dominance, and consumer behavior.

Tableau Visualization

This interactive Tableau visualization shows total video game sales by platform, broken down into North America, Europe, and Japan. To create this chart, the regional sales columns (NA_Sales, EU_Sales, and JP_Sales) were reshaped using a pivot option to convert the data from a wide format into a long format. This allowed the creation of a single “Region” variable and a corresponding “Sales” measure, which made it possible to build a stacked bar chart. After restructuring the data, Platform was placed on the x-axis and total Sales on the y-axis, while Region was used to color the bars.
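The reshaping step described above was done in Tableau, but the same wide-to-long idea can be sketched in R for comparison. This is only an illustration on made-up rows, not the code behind the Tableau chart. The objects `toy_wide`, `toy_long`, `platform_region`, and `stacked_plot` are hypothetical names.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up rows in the same wide layout as the dataset
toy_wide <- tibble(
  Platform = c("Wii", "Wii", "PS2", "DS"),
  NA_Sales = c(1.2, 0.4, 0.9, 0.3),
  EU_Sales = c(0.8, 0.2, 0.6, 0.2),
  JP_Sales = c(0.3, 0.1, 0.2, 0.5)
)

# Wide to long: one row per game and region, with a single Sales measure
toy_long <- toy_wide |>
  pivot_longer(cols = c(NA_Sales, EU_Sales, JP_Sales),
               names_to = "Region", values_to = "Sales")

# Total sales per platform and region, the quantity each bar segment shows
platform_region <- toy_long |>
  group_by(Platform, Region) |>
  summarize(Sales = sum(Sales), .groups = "drop")

# geom_col() stacks segments that share an x value by default
stacked_plot <- ggplot(platform_region,
                       aes(x = Platform, y = Sales, fill = Region)) +
  geom_col()
```

Printing `stacked_plot` would draw one bar per platform, with the bar height equal to the combined sales of the three regions.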

The visualization shows that North America contributes the largest share of sales across most platforms, particularly for systems like Xbox 360 and PlayStation consoles. Europe follows a similar pattern but with slightly lower sales, while Japan shows a more distinct distribution, with stronger relative performance in Nintendo-related platforms. Overall, this visualization highlights how platform success varies across regions and supports the broader finding that video game markets are regionally distinct.

6. Conclusion

Research shows that video game preferences vary significantly across regions due to cultural and historical differences in gaming development. For example, the Japanese gaming market has traditionally emphasized role-playing games and narrative-driven experiences, supported by long-running franchises such as Pokémon and Final Fantasy (UdOnis, 2023). In contrast, North American and European markets show stronger preferences for action, shooter, and sports games. This supports the idea that gaming preferences are not globalized uniformly but are influenced by regional culture and market history.

The Tableau visualization further reinforces these findings by showing how sales are distributed across platforms. North American sales dominate on platforms such as Xbox 360 and PlayStation systems, while Japan shows stronger relative contributions from Nintendo platforms. This suggests that platform popularity is also regionally influenced and closely tied to local gaming ecosystems.

These patterns are also consistent with the results of the multiple linear regression model, which indicated that both platform and genre have a statistically significant relationship with global sales. However, the model’s relatively low explanatory power suggests that while these factors matter, they only explain a small portion of overall variation in sales, and other influences such as marketing, branding, and consumer trends likely play a significant role.

One interesting observation is that despite global technological convergence in gaming, regional differences remain strong and consistent. Even as the industry becomes more globalized, consumer preferences continue to reflect cultural and historical influences.

One limitation of the analysis is that certain factors such as marketing influence, game reviews, pricing strategies, and digital vs physical sales were not included in the dataset. These factors could further explain variation in sales performance. Additionally, more advanced time-based analysis could have provided deeper insight into how preferences change across different gaming generations.

Overall, the visualizations demonstrate that video game preferences are strongly shaped by regional markets, with North America and Europe showing similar trends and Japan standing out as a distinct gaming culture with unique genre preferences.

Source: UdOnis. (2023). Japanese Gaming Market: Trends and Insights. https://www.blog.udonis.co/mobile-marketing/mobile-games/japanese-gaming-market