# Load required packages
library(dplyr)      # data wrangling
library(ggplot2)    # plotting
library(tidyr)      # tidy helpers
library(car)        # Levene's test for homogeneity of variance
library(knitr)      # nicely formatted tables

A. Introduction

Research question: Is there a statistically significant difference in mean global sales (in millions of units) across video-game genres?

The Video Game Sales dataset was compiled and published by Gregory Smith and is available on Kaggle (Smith, 2016). The data were originally scraped from VGChartz.com, and the dataset is described as covering games with more than 100,000 copies sold worldwide. The full file has 16,598 observations and 11 variables, which satisfies the assignment minimum of 1,000 observations. Each observation is one video game on a specific platform. The 11 variables consist of identifiers (Rank, Name, Platform, Year, Publisher), one categorical game type (Genre), and five continuous sales fields (NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales). All sales figures are in millions of units.

The purpose of identifying and measuring the top-selling video game genres is to inform developers' studio strategy and retailers' buying decisions, and to give regulators information for examining trends in media consumption. Genre and global sales are the two variables analyzed in this study: Genre is the independent variable with k = 12 groups, and Global Sales is the dependent variable. The other fields remain in the dataset, but only Genre and Global Sales are tested.

B. Data Analysis

I first used dplyr to clean and prepare the dataset for the ANOVA. The process had five steps: (1) I read the raw CSV and treated the string "N/A" as a missing value, which allowed the Year column to be parsed as a number; (2) I used glimpse() and summary() to examine the dataset and confirm the column types and ranges; (3) I removed any rows where Genre or Global_Sales was missing and converted Genre to a factor with mutate(); (4) I used filter() to keep only records with Year >= 1980. Because the earliest year in the data is 1980, this condition in practice removes the 271 rows whose Year is missing, since filter() drops rows where the condition evaluates to NA; and (5) I used select() to retain only the columns needed for the analysis and group_by() / summarise() to produce a descriptive table of global sales by genre. The pipeline uses six different dplyr verbs (mutate, filter, select, group_by, summarise, and arrange), which is well above the required minimum of three.
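Step (4) relies on a detail of filter() that is easy to miss: rows for which the condition evaluates to NA are dropped, so a comparison on Year also removes records with a missing year. A minimal sketch, using a small hypothetical data frame (`toy` and its values are invented for illustration and are not taken from vgsales.csv):

```r
library(dplyr)

# Hypothetical toy data: one record has a missing Year
toy <- data.frame(
  Name         = c("Game A", "Game B", "Game C"),
  Year         = c(1985L, NA, 2006L),
  Global_Sales = c(1.20, 0.40, 0.10)
)

# The comparison NA >= 1980 evaluates to NA, and filter() keeps only TRUE rows
kept <- toy %>% filter(Year >= 1980)

nrow(toy)    # rows before filtering
nrow(kept)   # rows after filtering: the NA-year record is gone
```

In the real data the same mechanism accounts for the row count: 16,598 rows minus the 271 missing Year values reported by summary() gives the 16,327 rows retained after cleaning.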

# Step 1: read the dataset (place vgsales.csv in the same folder as the .Rmd)
vg_raw <- read.csv("vgsales.csv", stringsAsFactors = FALSE, na.strings = "N/A")

dim(vg_raw)
## [1] 16598    11
# Step 2: inspect structure and basic summaries
glimpse(vg_raw)
## Rows: 16,598
## Columns: 11
## $ Rank         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
## $ Name         <chr> "Wii Sports", "Super Mario Bros.", "Mario Kart Wii", "Wii…
## $ Platform     <chr> "Wii", "NES", "Wii", "Wii", "GB", "GB", "DS", "Wii", "Wii…
## $ Year         <int> 2006, 1985, 2008, 2009, 1996, 1989, 2006, 2006, 2009, 198…
## $ Genre        <chr> "Sports", "Platform", "Racing", "Sports", "Role-Playing",…
## $ Publisher    <chr> "Nintendo", "Nintendo", "Nintendo", "Nintendo", "Nintendo…
## $ NA_Sales     <dbl> 41.49, 29.08, 15.85, 15.75, 11.27, 23.20, 11.38, 14.03, 1…
## $ EU_Sales     <dbl> 29.02, 3.58, 12.88, 11.01, 8.89, 2.26, 9.23, 9.20, 7.06, …
## $ JP_Sales     <dbl> 3.77, 6.81, 3.79, 3.28, 10.22, 4.22, 6.50, 2.93, 4.70, 0.…
## $ Other_Sales  <dbl> 8.46, 0.77, 3.31, 2.96, 1.00, 0.58, 2.90, 2.85, 2.26, 0.4…
## $ Global_Sales <dbl> 82.74, 40.24, 35.82, 33.00, 31.37, 30.26, 30.01, 29.02, 2…
summary(vg_raw[, c("Year", "Global_Sales")])
##       Year       Global_Sales    
##  Min.   :1980   Min.   : 0.0100  
##  1st Qu.:2003   1st Qu.: 0.0600  
##  Median :2007   Median : 0.1700  
##  Mean   :2006   Mean   : 0.5374  
##  3rd Qu.:2010   3rd Qu.: 0.4700  
##  Max.   :2020   Max.   :82.7400  
##  NA's   :271
# Step 3 & 4: drop rows with missing key fields, coerce Genre to factor,
#             restrict to 1980+ titles
vg <- vg_raw %>%
  filter(!is.na(Genre), !is.na(Global_Sales), Year >= 1980) %>%
  mutate(Genre = factor(Genre)) %>%
  select(Name, Platform, Year, Genre, Publisher, Global_Sales)

# Confirm cleaning
cat("Rows after cleaning:", nrow(vg), "\n")
## Rows after cleaning: 16327
cat("Number of genre levels:", nlevels(vg$Genre), "\n")
## Number of genre levels: 12
# Step 5: per-genre descriptive summary
genre_summary <- vg %>%
  group_by(Genre) %>%
  summarise(
    n          = n(),
    mean_sales = round(mean(Global_Sales), 3),
    median     = round(median(Global_Sales), 3),
    sd         = round(sd(Global_Sales),   3),
    max_sales  = max(Global_Sales)
  ) %>%
  arrange(desc(mean_sales))

kable(genre_summary, caption = "Per-genre descriptive statistics for Global Sales (millions of units).")
Per-genre descriptive statistics for Global Sales (millions of units).

Genre             n   mean_sales   median      sd   max_sales
Platform        876        0.947     0.28   2.599       40.24
Shooter        1282        0.800     0.23   1.834       28.31
Role-Playing   1471        0.628     0.19   1.717       31.37
Racing         1226        0.593     0.19   1.677       35.82
Sports         2304        0.568     0.22   2.105       82.74
Fighting        836        0.531     0.21   0.958       13.04
Action         3253        0.530     0.19   1.165       21.40
Misc           1710        0.466     0.16   1.323       29.02
Simulation      851        0.458     0.16   1.206       24.76
Puzzle          571        0.424     0.10   1.576       30.26
Strategy        671        0.258     0.09   0.524        5.45
Adventure      1276        0.184     0.06   0.511       11.18

The summary table above shows clear differences in average global sales: Platform games average ~0.95 M units per title and Shooters ~0.80 M, while Adventure and Strategy titles average under 0.30 M. Standard deviations are large relative to the means and unequal across groups, ranging from about 0.51 (Adventure) to about 2.60 (Platform).
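The spread in group standard deviations can be summarized by the ratio of the largest to the smallest value. The sketch below re-enters the sd column of the table by hand (the vector name `genre_sd` is an assumption made for illustration). A ratio far above 1 is only an informal warning sign for the equal-variance assumption, not a formal test.

```r
# Standard deviations copied from the per-genre summary table
genre_sd <- c(
  Platform = 2.599, Shooter = 1.834, `Role-Playing` = 1.717,
  Racing = 1.677, Sports = 2.105, Fighting = 0.958,
  Action = 1.165, Misc = 1.323, Simulation = 1.206,
  Puzzle = 1.576, Strategy = 0.524, Adventure = 0.511
)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- max(genre_sd) / min(genre_sd)

names(which.max(genre_sd))   # genre with the largest spread
names(which.min(genre_sd))   # genre with the smallest spread
round(sd_ratio, 2)
```

The largest standard deviation is roughly five times the smallest, which is consistent with treating homogeneity of variance as doubtful before any formal test is run.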

C. Statistical Analysis

This study tests whether mean global sales differ across genres. A one-way ANOVA was used because it compares the means of a continuous outcome across three or more independent groups. The three assumptions of ANOVA are independent observations, approximately normally distributed residuals, and equal variances in the populations from which the groups were drawn. Observations are treated as independent because each row is a distinct title-platform record, although a title released on several platforms appears in more than one row, so independence holds only approximately. The distribution of the residuals and the equality of variances across groups were then checked to see whether they meet the requirements of a one-way ANOVA.

Hypotheses (using μ for population mean global sales):

\[H_{0}:\ \mu_{\text{Action}} = \mu_{\text{Adventure}} = \mu_{\text{Fighting}} = \cdots = \mu_{\text{Strategy}}\]

\[H_{A}:\ \text{at least one } \mu_{i} \neq \mu_{j} \text{ for some pair of genres } i \neq j\]

I will reject \(H_{0}\) if the ANOVA p-value is below the conventional significance level \(\alpha = 0.05\).
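To make the test statistic behind this decision rule concrete, the sketch below computes the one-way ANOVA F value by hand as the ratio of the between-group to the within-group mean square and compares it with what aov() reports. The data are simulated (the data frame `toy` and its three groups are hypothetical and unrelated to the video game dataset).

```r
set.seed(1)

# Hypothetical data: three groups of 20 observations with different means
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 20)),
  y     = c(rnorm(20, mean = 0), rnorm(20, mean = 0.5), rnorm(20, mean = 1))
)

grand_mean  <- mean(toy$y)
group_means <- tapply(toy$y, toy$group, mean)
group_n     <- tapply(toy$y, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_n * (group_means - grand_mean)^2)
ss_within  <- sum((toy$y - ave(toy$y, toy$group))^2)  # ave() returns each row's group mean

# Degrees of freedom: k - 1 and N - k
df_between <- nlevels(toy$group) - 1
df_within  <- nrow(toy) - nlevels(toy$group)

# F is the ratio of the two mean squares
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# Same quantity extracted from aov()
f_aov <- summary(aov(y ~ group, data = toy))[[1]][["F value"]][1]

c(manual = f_manual, aov = f_aov)
```

With k = 12 genres and N = 16,327 rows, the same formulas give the 11 and 16,315 degrees of freedom that appear in the ANOVA table.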

# Fit the one-way ANOVA: Global_Sales explained by Genre
aov_fit <- aov(Global_Sales ~ Genre, data = vg)
summary(aov_fit)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Genre          11    486   44.21   18.24 <2e-16 ***
## Residuals   16315  39537    2.42                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Assumption 1: Homogeneity of variance (Levene's test, robust to non-normality)
levene_res <- leveneTest(Global_Sales ~ Genre, data = vg)
levene_res
## Levene's Test for Homogeneity of Variance (center = median)
##          Df F value    Pr(>F)    
## group    11   15.36 < 2.2e-16 ***
##       16315                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Assumption 2: Normality of residuals - inspect visually + Shapiro on a subsample
par(mfrow = c(1, 2))
plot(aov_fit, which = 1)   # Residuals vs Fitted
plot(aov_fit, which = 2)   # Normal Q-Q

par(mfrow = c(1, 1))

# Shapiro-Wilk is limited to n <= 5000, so sample the residuals
set.seed(42)
resid_sample <- sample(residuals(aov_fit), 5000)
shapiro.test(resid_sample)
## 
##  Shapiro-Wilk normality test
## 
## data:  resid_sample
## W = 0.31673, p-value < 2.2e-16

The ANOVA produced an F-statistic of 18.24 with 11 and 16,315 degrees of freedom and a p-value of 9.5 × 10⁻³⁷, indicating that at least one genre's mean global sales differs from the others. Levene’s test, however, showed that the group variances are not equal (F = 15.36, p < 0.001), and the normality assumption is also violated: the Q-Q plot of the residuals shows a heavy right tail, and the Shapiro-Wilk test on a 5,000-residual subsample rejects normality (W = 0.317, p < 0.001). The heavy right tail is attributable to a small number of “blockbuster” titles that inflate the within-genre variance, particularly in genres such as Platform and Sports. Because both assumptions are violated, the analysis was repeated in two ways: (a) a one-way ANOVA on log-transformed global sales and (b) Welch’s ANOVA, which does not assume equal variances, using the oneway.test() function. Both tests reached the same qualitative conclusion.
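Because summary() only prints the p-value as <2e-16, it may help to show how a more exact figure can be recovered from the F distribution. The sketch below plugs the rounded values from the ANOVA table into pf(); the variable names are assumptions for illustration, and since F is rounded to two decimals the result only approximates the value obtained from the full-precision model.

```r
# Values read from the classical ANOVA table
f_value  <- 18.24
df_genre <- 11
df_resid <- 16315

# Upper-tail probability of the F distribution;
# lower.tail = FALSE keeps precision for very small p-values
p_exact <- pf(f_value, df_genre, df_resid, lower.tail = FALSE)

format(p_exact, scientific = TRUE, digits = 3)
```

The same value can be read directly from a fitted model with summary(aov_fit)[[1]][["Pr(>F)"]][1], which avoids the rounding in the printed table.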

# (a) Log10 transform: add a small constant as a guard against zeros
#     (the minimum observed Global_Sales is 0.01, so no zeros occur in this data)
vg <- vg %>% mutate(LogSales = log10(Global_Sales + 0.01))

aov_log <- aov(LogSales ~ Genre, data = vg)
summary(aov_log)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Genre          11    318  28.880   90.74 <2e-16 ***
## Residuals   16315   5193   0.318                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# (b) Welch's ANOVA - does NOT assume equal variances
oneway.test(Global_Sales ~ Genre, data = vg, var.equal = FALSE)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Global_Sales and Genre
## F = 44.261, num df = 11.0, denom df = 4968.9, p-value < 2.2e-16
# Post-hoc: Tukey's Honest Significant Difference on the log-transformed model
tukey_res <- TukeyHSD(aov_log)

# Pull out the significant pairs (adjusted p < 0.05) for readability
tukey_df <- as.data.frame(tukey_res$Genre)
tukey_df$Comparison <- rownames(tukey_df)
tukey_df <- tukey_df %>%
  select(Comparison, diff, lwr, upr, `p adj`) %>%
  arrange(`p adj`)

n_sig <- sum(tukey_df$`p adj` < 0.05)
cat("Significant pairwise differences (alpha = 0.05):",
    n_sig, "out of", nrow(tukey_df), "\n\n")
## Significant pairwise differences (alpha = 0.05): 51 out of 66
kable(head(tukey_df, 12),
      row.names = FALSE,
      caption = "Top 12 most significant pairwise genre differences (Tukey HSD on log10 sales).")
Top 12 most significant pairwise genre differences (Tukey HSD on log10 sales).

Comparison                     diff          lwr          upr   p adj
Adventure-Action         -0.4135195   -0.4744284   -0.3526106       0
Fighting-Adventure        0.4543515    0.3723040    0.5363990       0
Misc-Adventure            0.3482373    0.2800241    0.4164505       0
Platform-Adventure        0.5688293    0.4879215    0.6497372       0
Racing-Adventure          0.4249564    0.3512136    0.4986993       0
Role-Playing-Adventure    0.4162149    0.3456734    0.4867564       0
Shooter-Adventure         0.5121062    0.4391894    0.5850230       0
Simulation-Adventure      0.3424695    0.2608600    0.4240790       0
Sports-Adventure          0.4737176    0.4093716    0.5380636       0
Strategy-Platform        -0.4020557   -0.4966530   -0.3074584       0
Strategy-Shooter         -0.3453326   -0.4331929   -0.2574723       0
Strategy-Sports          -0.3069440   -0.3878327   -0.2260553       0
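The diff column is on the log10 scale, so it is easier to read after back-transforming: 10^diff is approximately the ratio of the two genres' geometric-mean sales (strictly, of Global_Sales + 0.01, because of the constant added before taking logs). A small sketch using three differences copied from the table; the vector names are assumptions for illustration:

```r
# Log10 mean differences copied from the Tukey HSD table
log10_diff <- c(
  "Platform-Adventure" = 0.5688293,
  "Shooter-Adventure"  = 0.5121062,
  "Strategy-Platform"  = -0.4020557
)

# Back-transform to multiplicative ratios of geometric means
sales_ratio <- 10^log10_diff

round(sales_ratio, 2)
```

Read this way, a typical Platform title sells roughly 3.7 times as many units as a typical Adventure title, while a ratio below 1 (as for Strategy versus Platform) means the first genre in the pair sells less.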
# Boxplot of Global_Sales by Genre, ordered by mean for readability.
# Y-axis is on a log10 scale because the raw distribution is extremely right-skewed.
genre_order <- genre_summary$Genre

ggplot(vg, aes(x = factor(Genre, levels = genre_order),
               y = Global_Sales,
               fill = Genre)) +
  geom_boxplot(outlier.alpha = 0.25, outlier.size = 0.7) +
  scale_y_log10() +
  labs(
    title = "Global Sales per Title by Video-Game Genre",
    subtitle = "Boxes ordered by mean sales (left = highest). Y-axis log10.",
    x = "Genre",
    y = "Global Sales (millions, log10)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1),
    legend.position = "none"
  )

Interpreting the results. The classical, log-transformed, and Welch versions of the test all agree: genre is significantly associated with average global sales (p < 10⁻³⁰). Tukey HSD indicates that 51 of 66 pairwise comparisons (approximately 77 percent) are statistically significant at α = 0.05. Two patterns stand out: (i) Platform and Shooter games sell much better than Adventure, Strategy, Puzzle, and Simulation games; (ii) Adventure is the lowest-selling genre, with significantly lower sales than almost every other genre. The boxplot shows this clearly: the medians slope downward from Platform on the left to Adventure on the right, even though the upper whiskers (rare mega-hit titles) overlap considerably across genres. In practical terms, however, genre explains only about 1% of the variance in global sales (R² for the linear model is ~0.012). Thus, while the direction of the differences is stable, genre alone is a poor predictor of an individual title’s commercial performance. Most of the variance in sales is attributable to factors other than genre (e.g., publisher, platform, marketing budget, release timing).
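The effect-size figure can be reproduced from the sums of squares in the ANOVA tables, since eta-squared (equal to R² in a one-way model) is the between-group sum of squares divided by the total. The sketch below uses the rounded values printed by summary(), so the results are approximate; the variable names are assumptions for illustration.

```r
# Sums of squares read from the classical ANOVA table (raw sales)
ss_genre_raw <- 486
ss_resid_raw <- 39537
eta_sq_raw   <- ss_genre_raw / (ss_genre_raw + ss_resid_raw)

# Sums of squares read from the log10 ANOVA table
ss_genre_log <- 318
ss_resid_log <- 5193
eta_sq_log   <- ss_genre_log / (ss_genre_log + ss_resid_log)

# Share of Tukey comparisons that were significant
share_sig <- 51 / 66

round(c(eta_sq_raw = eta_sq_raw, eta_sq_log = eta_sq_log, share_sig = share_sig), 3)
```

On the raw scale genre accounts for about 1.2% of the variance; on the log scale, where blockbuster titles carry less weight, the share computed from the printed table is closer to 6%, which is still small.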

D. Conclusion and Future Directions

The one-way ANOVA indicates a statistically significant difference in mean global sales among the 12 video-game genres in the VGChartz data (F(11, 16315) = 18.24, p < 0.001, n = 16,327). Platform and Shooter titles had the highest average global sales, and Adventure and Strategy titles had much lower sales. Tukey HSD found that most of the pairwise differences remain individually significant after correction for multiple comparisons. The same conclusion holds when the assumptions of normality and homogeneity of variance are relaxed with a log transformation and Welch’s test, indicating that the finding is not an artifact of the distributional assumptions.

Although genre explains some of the variance in title sales, it accounts for only about 1% on its own, so nearly all of the remaining variation must be due to other factors. An immediate follow-up would therefore be a multiple regression or mixed-effects model that incorporates additional factors such as platform, publisher, release date, and sales figures by region, to estimate how much each contributes to the remaining variance. A second analysis could examine temporal trends, since the relationship between genre and sales has likely changed since the introduction of mobile and free-to-play gaming. A third extension would be to examine the far right tail of the sales distribution: the small number of blockbuster titles that account for a disproportionate share of total sales and behave differently from the rest of the catalogue.
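As a rough illustration of the proposed regression follow-up, the sketch below compares a genre-only linear model with one that adds platform and year, using simulated data. Everything here is hypothetical: the data frame `toy`, its factor levels, and the randomly generated LogSales values are placeholders, so the printed numbers say nothing about the real dataset. Only the workflow (fit nested models, compare R², run a nested F-test) is the point.

```r
set.seed(7)
n <- 300

# Hypothetical predictors; levels chosen only for illustration
toy <- data.frame(
  Genre    = factor(sample(c("Action", "Puzzle", "Sports"), n, replace = TRUE)),
  Platform = factor(sample(c("PS2", "Wii", "DS"), n, replace = TRUE)),
  Year     = sample(2000:2010, n, replace = TRUE)
)

# Hypothetical outcome: random noise standing in for log10 sales
toy$LogSales <- rnorm(n, mean = -0.7, sd = 0.5)

# Nested models: genre only, then genre plus platform and year
fit_genre <- lm(LogSales ~ Genre, data = toy)
fit_full  <- lm(LogSales ~ Genre + Platform + Year, data = toy)

r2_genre <- summary(fit_genre)$r.squared
r2_full  <- summary(fit_full)$r.squared

c(genre_only = r2_genre, full = r2_full)

# F-test for whether the added predictors improve the fit
anova(fit_genre, fit_full)
```

On the real data the same comparison would show how much explained variance platform and release year add beyond genre; a mixed-effects model with publisher as a random effect would need an additional package and is not sketched here.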

E. References