Introduction

Research questions:

- Do global sales differ significantly by platform?

- Do global sales differ significantly by genre?

- Is the distribution of game genres significantly different across platforms?

I have a PC that I mainly use to play video games, and I personally love role-playing games (RPGs), so I am interested in how different game genres perform in sales across platforms. These research questions matter to me because they explore whether certain platforms favor specific genres and whether platform and genre are related to a game's sales. Answering them will show which kinds of video games are developed and sold the most. Furthermore, understanding these relationships is important because it can provide insight into trends and consumer preferences, helping developers, companies, and even customers decide where to focus their investments.

The dataset I used was downloaded from Kaggle (https://www.kaggle.com/datasets/gregorut/videogamesales), but the data itself was scraped from VGChartz (https://www.vgchartz.com/), a website that posts weekly estimates of video game sales and other statistics. The dataset contains 16,598 titles/observations and 11 variables, including game name, release year, and publisher. Each row represents a unique video game release, and the variables provide both categorical and numerical data about the game’s characteristics. For this project, I focused on three key variables: Platform (categorical, the system the game was released on), Genre (categorical, type of game/playstyle), and Global_Sales (numerical, total worldwide sales, in millions).

Loading the tidyverse and the dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("C:/Users/hanle/Desktop/Final Project Data 101")
vgsales <- read.csv("vgsales.csv")

Data Management and Analysis:

I started by exploring the dataset: checking its dimensions, viewing summary statistics, and checking for NAs. I changed the variable names to lowercase for easier handling, then examined the unique platforms and genres available. To narrow the analysis to four categories each, I filtered the data to the four platforms (PC, PS4, Wii, X360) and four genres (Action, Puzzle, Role-Playing, Shooter) that I am most familiar with. For my visualizations, I created a stacked bar plot showing total global sales across platforms, with color-coding to differentiate the genres. This plot displays the contribution of each genre to overall global sales on each platform, connecting to the question of how sales vary by platform and genre. Additionally, I created boxplots of global sales distributions by platform and by genre separately. These boxplots display the spread of sales within each categorical variable, making it easier to spot variability and outliers. I then conducted two ANOVA tests and one chi-squared test to answer my research questions.

#dimensions
dim(vgsales)
## [1] 16598    11
#summary
summary(vgsales)
##       Rank           Name             Platform             Year          
##  Min.   :    1   Length:16598       Length:16598       Length:16598      
##  1st Qu.: 4151   Class :character   Class :character   Class :character  
##  Median : 8300   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8301                                                           
##  3rd Qu.:12450                                                           
##  Max.   :16600                                                           
##     Genre            Publisher            NA_Sales          EU_Sales      
##  Length:16598       Length:16598       Min.   : 0.0000   Min.   : 0.0000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200  
##                                        Mean   : 0.2647   Mean   : 0.1467  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100  
##                                        Max.   :41.4900   Max.   :29.0200  
##     JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374  
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700  
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400
#Check if there's any NAs
sum(is.na(vgsales))
## [1] 0
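
One caveat about the check above: `is.na()` only counts true `NA` values. The summary shows that `Year` was read as character, which suggests that some entries are non-numeric text (possibly a placeholder string such as "N/A"; that is an assumption, not something verified here). A minimal sketch on a made-up data frame (`toy` is hypothetical) shows how such placeholders slip past `is.na()` and how they could be counted:

```r
# Toy data standing in for vgsales; the "N/A" placeholder is an assumption.
toy <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Year = c("2006", "N/A", "2010"),
  stringsAsFactors = FALSE
)

# is.na() treats the placeholder as an ordinary string, so it is not counted.
n_true_na <- sum(is.na(toy))

# Count placeholder strings column by column.
n_placeholder <- colSums(toy == "N/A")

# Convert placeholders to real NA, then parse Year as a number.
toy$Year[toy$Year == "N/A"] <- NA
toy$Year <- as.integer(toy$Year)
n_na_after <- sum(is.na(toy$Year))
```

If the real file does use that placeholder, `read.csv("vgsales.csv", na.strings = "N/A")` would make the conversion at load time.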

Convert all column names to lowercase

names(vgsales) <- tolower(names(vgsales))

head(vgsales)
##   rank                     name platform year        genre publisher na_sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   eu_sales jp_sales other_sales global_sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26

Check different game platforms and genres in dataset to select for filtering

unique(vgsales$platform)
##  [1] "Wii"  "NES"  "GB"   "DS"   "X360" "PS3"  "PS2"  "SNES" "GBA"  "3DS" 
## [11] "PS4"  "N64"  "PS"   "XB"   "PC"   "2600" "PSP"  "XOne" "GC"   "WiiU"
## [21] "GEN"  "DC"   "PSV"  "SAT"  "SCD"  "WS"   "NG"   "TG16" "3DO"  "GG"  
## [31] "PCFX"
unique(vgsales$genre)
##  [1] "Sports"       "Platform"     "Racing"       "Role-Playing" "Puzzle"      
##  [6] "Misc"         "Shooter"      "Simulation"   "Action"       "Fighting"    
## [11] "Adventure"    "Strategy"

Create new dataset based on my chosen platforms and genres

vgglobal <- vgsales |>
  select(name, platform, genre, global_sales) |>
  filter(genre %in% c("Action", "Puzzle", "Role-Playing", "Shooter"),
         platform %in% c("PC", "PS4", "Wii", "X360")) |>
  arrange(global_sales)

head(vgglobal)
##                                                              name platform
## 1                                                           Turok       PC
## 2                                                  Serious Sam II       PC
## 3                      Neverwinter Nights 2: Mask of the Betrayer       PC
## 4                                                  Call of Juarez       PC
## 5 Tom Clancy's  Ghost Recon Advanced Warfighter (weekly JP sales)     X360
## 6                                          Unreal Tournament 2003       PC
##          genre global_sales
## 1       Action         0.01
## 2      Shooter         0.01
## 3 Role-Playing         0.01
## 4      Shooter         0.01
## 5      Shooter         0.01
## 6      Shooter         0.01

Among the six lowest-selling games shown, most are PC games and most are shooters. I will use visualizations and hypothesis tests to explore that connection further and to look for other relationships.
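
Six rows are a small sample, and more games may be tied at the minimum of 0.01. A sketch of how the pattern could be checked across every game tied at the lowest value, using a small made-up data frame (`toy` and its values are hypothetical) in place of `vgglobal`:

```r
# Hypothetical stand-in for vgglobal.
toy <- data.frame(
  platform = c("PC", "PC", "X360", "PC", "Wii", "PS4"),
  genre = c("Action", "Shooter", "Shooter", "Shooter", "Puzzle", "Action"),
  global_sales = c(0.01, 0.01, 0.01, 0.02, 0.01, 1.50)
)

# Keep every row tied at the minimum sales value.
lowest <- toy[toy$global_sales == min(toy$global_sales), ]

# Share of those rows by platform and by genre.
platform_share <- prop.table(table(lowest$platform))
genre_share <- prop.table(table(lowest$genre))
```

Comparing these shares with each platform's and genre's share of the full filtered dataset would show whether PC and Shooter are over-represented at the bottom or simply more numerous overall.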

ggplot(vgglobal, aes(x = platform, 
                     y = global_sales, 
                     fill = genre)) +
  geom_bar(stat = "identity", 
           position = "stack") +
  scale_fill_manual(values = c("Action" = "#d41a17",
                               "Puzzle" = "#50bfff", 
                               "Role-Playing" = "#f5dc3d",
                               "Shooter" = "#f170c4")) +
  labs(title = "Total Global Sales by Platform and Genre",
       x = "Platform",
       y = "Global Sales (in millions)") +
  theme_bw()
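
Because `geom_bar(stat = "identity")` stacks one segment per game, each bar's height is a sum of `global_sales`. To read exact totals rather than estimating them from the plot, the same sums could be tabulated. A base R sketch on made-up numbers (`toy` is hypothetical, not the project data):

```r
# Hypothetical stand-in for vgglobal.
toy <- data.frame(
  platform = c("PC", "PC", "X360", "X360", "X360"),
  genre = c("Action", "Shooter", "Shooter", "Shooter", "Action"),
  global_sales = c(0.5, 1.0, 2.0, 3.0, 1.5)
)

# Sum sales for each platform/genre pair (one colored block per pair).
totals <- aggregate(global_sales ~ platform + genre, data = toy, FUN = sum)

# Platform-by-genre table of totals; pairs with no games show as 0.
totals_wide <- xtabs(global_sales ~ platform + genre, data = toy)
```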

Boxplots by platform and by genre to see the distribution of global sales within each variable

ggplot(vgglobal, aes(x = global_sales, 
                     y = platform)) +
  geom_boxplot(fill = "#7dff6e") +
  labs(title = "Global Sales Distribution by Platform", 
       x = "Sales (in millions)", 
       y = "Platform")

ggplot(vgglobal, aes(x = global_sales, 
                     y = genre)) +
  geom_boxplot(fill = "#93d8ff") +
  labs(title = "Global Sales Distribution by Genre", 
       x = "Sales (in millions)", 
       y = "Genre")

Statistical Analysis

Hypothesis testing is appropriate for my data because I am examining whether observed differences in global sales across platforms and genres are statistically significant. My variables are both quantitative and categorical, so tests such as ANOVA and chi-square are suitable. ANOVA compares the means of a continuous variable across more than two groups, which matches my questions about differences in mean global sales by platform and by genre. The chi-square test checks for an association between two categorical variables, which matches my question about whether the distribution of game genres differs significantly across platforms.


Question 1: Do global sales differ significantly by platform?

Hypothesis

\(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) = \(\mu_4\); the mean global sales are equal across all platforms (PC, PS4, Wii, X360)

\(H_a\): Not all \(\mu\) are equal

Where each \(\mu\) represents the mean global sales for one of the four platforms in the dataset

\(\alpha\) = 0.05

ANOVA test (one numeric variable, one categorical variable) for average sales by platform

anova_platform <- aov(global_sales ~ platform, data = vgglobal)
summary(anova_platform)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## platform       3    154   51.35   24.78 1.11e-15 ***
## Residuals   1646   3411    2.07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
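
As a quick check on how the table fits together, the F value is the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. Using the rounded figures printed above:

\[
F = \frac{MS_{\text{platform}}}{MS_{\text{residual}}} = \frac{51.35}{2.07} \approx 24.8
\]

which agrees with the reported 24.78 up to rounding of the displayed values. The same table gives a rough effect size:

\[
\eta^2 = \frac{SS_{\text{platform}}}{SS_{\text{platform}} + SS_{\text{residual}}} = \frac{154}{154 + 3411} \approx 0.043
\]

so platform accounts for roughly 4% of the variance in global sales within this subset, even though the p-value is very small.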

Results

Since the p-value, \(1.11 \times 10^{-15}\), is less than the \(\alpha\) of 0.05, I reject \(H_0\).

The small p-value indicates strong evidence that global sales for video games differ significantly by platform.
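
The ANOVA says only that at least one platform mean differs, not which ones. A common follow-up is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below runs it on simulated data (`toy`, its group means, and the seed are made up for illustration); it is not a result from the project data.

```r
set.seed(101)

# Simulated sales for four platforms; only "X360" has a shifted mean.
toy <- data.frame(
  platform = factor(rep(c("PC", "PS4", "Wii", "X360"), each = 30)),
  global_sales = c(rnorm(30, 1), rnorm(30, 1), rnorm(30, 1), rnorm(30, 3))
)

fit <- aov(global_sales ~ platform, data = toy)

# Pairwise differences with family-wise 95% confidence intervals.
tukey <- TukeyHSD(fit, "platform", conf.level = 0.95)
pairs <- tukey$platform

# Pairs whose adjusted p-value falls below alpha = 0.05.
significant_pairs <- rownames(pairs)[pairs[, "p adj"] < 0.05]
```

With the project data, the equivalent call would be `TukeyHSD(anova_platform)`.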


Question 2: Do global sales differ significantly by genre?

Hypothesis

\(H_0\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) = \(\mu_4\), the mean global sales are equal across genres (Action, Puzzle, RPG, Shooter)

\(H_a\): Not all \(\mu\) are equal

Where each \(\mu\) represents the mean global sales for one of the four genres in the dataset

\(\alpha\) = 0.05

ANOVA test (one numeric variable, one categorical variable) for average sales by genre

anova_genre <- aov(global_sales ~ genre, data = vgglobal)
summary(anova_genre)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## genre          3     65  21.565   10.14 1.28e-06 ***
## Residuals   1646   3500   2.127                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results

Since the p-value, \(1.28 \times 10^{-6}\), is less than the \(\alpha\) of 0.05, I reject \(H_0\).

The small p-value indicates strong evidence that global sales for video games differ significantly by genre.
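
One caution: ANOVA assumes roughly normal residuals with similar variance in each group, while the summary statistics (a mean well above the median, and a very large maximum) point to strongly right-skewed sales. A rank-based check such as the Kruskal-Wallis test (`kruskal.test()` in base R) does not rely on normality. A sketch on simulated skewed data (`toy` and its parameters are made up, not taken from the project):

```r
set.seed(101)

# Log-normal draws mimic right-skewed sales; "Shooter" is shifted upward.
toy <- data.frame(
  genre = factor(rep(c("Action", "Puzzle", "Role-Playing", "Shooter"), each = 40)),
  global_sales = c(rlnorm(40, 0), rlnorm(40, 0), rlnorm(40, 0), rlnorm(40, 1.5))
)

# Rank-based test of whether the groups share one distribution.
kw <- kruskal.test(global_sales ~ genre, data = toy)

# ANOVA on log-transformed sales is another common option for skewed data.
log_fit <- summary(aov(log(global_sales) ~ genre, data = toy))
log_p <- log_fit[[1]][["Pr(>F)"]][1]
```

If both checks agree with the original ANOVA on the real data, the conclusion is less likely to be driven by a few best-selling outliers.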


Question 3: Is the distribution of game genres significantly different across platforms?

Hypothesis

\(H_0\): \(p_{1j} = p_{2j} = p_{3j} = p_{4j}\) for every genre \(j\), where \(p_{ij}\) is the proportion of platform \(i\)'s games that belong to genre \(j\); genre proportions are the same across all platforms.

\(H_a\): For at least one genre \(j\), the \(p_{ij}\) are not all equal; at least one platform has a different genre distribution.

\(\alpha\) = 0.05

Chi-square test (two categorical variables) for whether the genre distribution differs across platforms

genre_platform <- table(vgglobal$genre, vgglobal$platform)
chisq.test(genre_platform)
## 
##  Pearson's Chi-squared test
## 
## data:  genre_platform
## X-squared = 188.59, df = 9, p-value < 2.2e-16

Results

Since the p-value is less than \(2.2 \times 10^{-16}\), far below the \(\alpha\) of 0.05, I reject \(H_0\).

The small p-value indicates strong evidence of an association between video game platform and genre.
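
The test reports that an association exists but not where it comes from. The object returned by `chisq.test()` also holds the expected counts and standardized residuals, which show the cells that depart most from independence. The degrees of freedom follow from the table's shape: \((4-1)(4-1) = 9\) for four genres by four platforms. A sketch on a made-up 2 × 3 table (the counts are hypothetical):

```r
# Hypothetical counts of games by genre (rows) and platform (columns).
counts <- matrix(
  c(30, 10, 20,
    10, 30, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(genre = c("Action", "Shooter"),
                  platform = c("PC", "PS4", "Wii"))
)

res <- chisq.test(counts)

# Counts expected if genre and platform were independent.
expected <- res$expected

# Standardized residuals; values beyond about +/-2 flag notable cells.
std_resid <- res$stdres

# The chi-square approximation is usually trusted when expected counts are >= 5.
all_expected_ok <- all(expected >= 5)
```

With the project data, the corresponding objects would be `chisq.test(genre_platform)$expected` and `chisq.test(genre_platform)$stdres`.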


Conclusion and Future Directions:

In conclusion, the results of my hypothesis tests show that mean global sales differ significantly by platform and by genre, indicating that some platforms and some genres are associated with higher average global sales than others. My chi-squared test shows that genre distribution differs significantly across platforms, meaning that certain genres are more common on specific platforms than on others. The bar plot and boxplots I created support these conclusions. The visualizations showed trends such as higher sales of Shooter games on X360 and Action games on PS4, and very low sales on PC overall, which back up my answer that platform and genre are important factors in how well a video game sells.

The implications are relevant for both game developers and consumers. Choosing the right platform for a given genre could lead to greater sales performance, which is especially important to know for games in competitive categories. As a fan of RPGs, I find it valuable to know which platforms support RPGs the most or which platforms have games with a wide variety of genres in their selection.

For future research, I would expand my analysis to include all genres and platforms, both to identify the top-selling genres and platforms and to find more detailed relationships. From prior knowledge, I know that RPGs sell extremely well in Japan, so I was surprised that my visualizations did not show RPGs standing out in global sales. Therefore, in the future I would also explore sales by region (the dataset includes NA_Sales, EU_Sales, JP_Sales, and Other_Sales) instead of only global sales, to see whether sales patterns are consistent across regions.

References:

https://www.kaggle.com/datasets/gregorut/videogamesales

https://www.vgchartz.com/