Week 7 Data Dive: Hypothesis Testing

The goal of this project is to practice using the Neyman-Pearson and Fisher’s Significance Testing frameworks, and to demonstrate what kinds of business questions can be answered by applying hypothesis testing to game sales data.

library(effsize)
library(readr)
library(tidyverse)
library(ggplot2)
library(pwrss)
game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
  # preallocate the result vector instead of growing it on every iteration
  func_values <- numeric(n_iter)
  for (i in seq_len(n_iter)) {
    # pull the sample (a vector)
    x_sample <- sample(x, size = length(x), replace = TRUE)
    func_values[i] <- func(x_sample)
  }
  return(func_values)
}

Neyman-Pearson Framework

In this scenario, a game development studio in Japan is deciding on the genre for its next game. The studio has released games in a variety of genres, but never a role-playing game. It has observed that role-playing games tend to sell better in Japan, but wants to make sure the difference is significant, since a role-playing game would require hiring many new staff members for art, writing, and voice acting. The studio estimates the cost of hiring this staff at $50k (0.05 in the dataset's units of millions), so this is the smallest difference in sales it is interested in. Also, historical data may not be the best reference for the modern game market, so the analysis is narrowed to games released after 2010.

In this scenario, the null hypothesis is that there is no difference in mean sales in Japan between role-playing and non-role-playing games, while the alternative hypothesis is that there is a difference between them.

game_sales <- game_sales |>
  mutate(is_roleplaying = genre == "Role-Playing")

cohen.d(d = filter(game_sales, is_roleplaying, year > 2010) |> pluck("jp_sales"),
        f = filter(game_sales, !is_roleplaying, year > 2010) |> pluck("jp_sales"))

## 
## Cohen's d
## 
## d estimate: 0.5599124 (medium)
## 95 percent confidence interval:
##     lower     upper 
## 0.4613050 0.6585198

The estimated Cohen’s d of about 0.56 falls between .2 and .8 (a medium effect), so it is worth investigating this scenario using hypothesis testing. The company mainly wants to avoid losing money by hiring extra staff for no benefit (a Type I error), and is less worried about missing out on a small amount of extra profit (a Type II error). They decide that a 6% chance of hiring staff when there is no real difference in sales is the most they are willing to risk, so alpha is set to .06, and power is set to .8.
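To make the effect size concrete, here is a minimal sketch of Cohen's d computed by hand, assuming the standard pooled-standard-deviation definition. The two vectors and the helper name `cohens_d` are hypothetical and exist only for illustration; they are not taken from the game sales data.

```r
# Hypothetical sales figures (in millions), made up for illustration only
group_a <- c(0.10, 0.25, 0.40, 0.15, 0.30)
group_b <- c(0.05, 0.10, 0.20, 0.08, 0.12)

# hypothetical helper: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  # the pooled SD weights each group's variance by its degrees of freedom
  pooled_sd <- sqrt(((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2))
  (mean(x) - mean(y)) / pooled_sd
}

d_manual <- cohens_d(group_a, group_b)
d_manual
```

A positive value means the first group has the larger mean, and the magnitude is read against the usual rough thresholds of .2 (small), .5 (medium), and .8 (large).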

roleplaying_count <- game_sales |>
  filter(is_roleplaying) |>
  summarize(n = n()) |>
  pluck("n")

not_roleplaying_count <- game_sales |>
  filter(!is_roleplaying) |>
  summarize(n = n()) |>
  pluck("n")

roleplaying_not_ratio <- roleplaying_count/not_roleplaying_count
roleplaying_not_ratio

## [1] 0.09847783

test <- pwrss.t.2means(mu1 = .05, 
                       sd1 = sd(pluck(game_sales, "jp_sales")),
                       kappa = roleplaying_not_ratio,
                       power = .8, alpha = 0.06, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 312 
##   n2 = 3166 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 3476 
##  Non-centrality parameter = 2.724 
##  Type I error rate = 0.06 
##  Type II error rate = 0.2

game_sales |>
  filter(year > 2010) |>
  group_by(is_roleplaying) |>
  summarize(count = n())

## # A tibble: 2 × 2
##   is_roleplaying count
##   <lgl>          <int>
## 1 FALSE           3431
## 2 TRUE             455

Here, we calculate the ratio of role-playing to non-role-playing games in the entire dataset to use as kappa (the ratio of the two group sizes) in the sample size calculation. The result tells us that we need at least 312 observations in the smaller group and 3166 in the larger group to reach the desired power. The post-2010 data contains 455 role-playing games and 3431 other games, so both requirements are met.
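As a sanity check on the sample size calculation, the power can be approximately recomputed from the numbers printed in the pwrss output using the noncentral t distribution in base R. This is only a sketch: it assumes the pooled-variance, two-sided setup that the output describes, and it copies n1, n2, and the non-centrality parameter from that output rather than recomputing them from the data.

```r
# values copied from the pwrss output above
n1 <- 312
n2 <- 3166
alpha <- 0.06
ncp <- 2.724           # non-centrality parameter reported by pwrss
df <- n1 + n2 - 2      # degrees of freedom for the pooled two-sample t test

# two-sided critical value: alpha is split across both tails
t_crit <- qt(1 - alpha / 2, df)

# power = probability that |t| exceeds the critical value when the
# statistic follows a noncentral t distribution with this ncp
power <- pt(t_crit, df, ncp = ncp, lower.tail = FALSE) +
  pt(-t_crit, df, ncp = ncp)
power
```

The result should land close to the requested power of .8, which confirms that the reported group sizes are the ones needed for this alpha and minimum difference.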

roleplaying_sales <- game_sales |>
  filter(is_roleplaying, year > 2010) |>
  pluck("jp_sales")

not_roleplaying_sales <- game_sales |>
  filter(!is_roleplaying, year > 2010) |>
  pluck("jp_sales")

t.test(roleplaying_sales, not_roleplaying_sales)

## 
##  Welch Two Sample t-test
## 
## data:  roleplaying_sales and not_roleplaying_sales
## t = 5.869, df = 471.77, p-value = 8.27e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.08072348 0.16198501
## sample estimates:
##  mean of x  mean of y 
## 0.16872527 0.04737103

qt(.96, 472)

## [1] 1.754463

Finally, we calculate the t-statistic and the critical value. Here qt(.96, 472) gives 1.754463; because the alternative hypothesis is two-sided, alpha = .06 is split across both tails, and the corresponding critical value qt(.97, 472) is slightly larger, at about 1.885. The resulting t of 5.869 is far above either value; therefore, we reject the null hypothesis that there is no difference between the mean sales in Japan of role-playing games and non-role-playing games. For the studio, this means hiring more staff and producing a role-playing game next.
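The Neyman-Pearson decision rule for a two-sided test can be sketched as follows. The t-statistic and degrees of freedom are copied from the Welch test output above; the variable names are our own, and the only assumption is that alpha = .06 is divided equally between the two tails.

```r
alpha <- 0.06
t_stat <- 5.869    # t from the Welch test output
df <- 471.77       # Welch-adjusted degrees of freedom from the same output

# two-sided critical value: alpha / 2 in each tail
t_crit <- qt(1 - alpha / 2, df)

# reject the null hypothesis when |t| exceeds the critical value
reject_null <- abs(t_stat) > t_crit

# the matching two-sided p-value, for comparison with the t.test output
p_value <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)

t_crit
reject_null
p_value
```

Comparing |t| to the critical value and comparing the p-value to alpha are two views of the same rule, so they always lead to the same decision.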

Fisher’s Significance Testing Framework

Another scenario in which significance testing on game sales data could be valuable is a company deciding which of two major publishers to invest in. In this section, we investigate how a company could choose between Electronic Arts and Activision. The company is inclined to invest in Activision, but it looks like EA may have produced more profitable games. In this case, the company cares about global sales, focusing on the proportion of each publisher's games that hit a certain sales target. The company has determined that, for an investment to be profitable, a game needs to make more than $700k, so we first create a new column indicating whether a game would have been a profitable investment.

game_sales <- game_sales |>
  mutate(profitable_investment = global_sales > .7)

In this situation, the null hypothesis is that there is no difference between the profitable-unprofitable ratios of games produced by EA and Activision, while the alternative hypothesis is that there is a difference.

profit_table <- game_sales |>
  filter(publisher %in% c("Activision","Electronic Arts")) |>
  group_by(publisher) |>
  summarize(profitable = sum(profitable_investment), unprofitable = n() - sum(profitable_investment))
profit_table

## # A tibble: 2 × 3
##   publisher       profitable unprofitable
##   <chr>                <int>        <int>
## 1 Activision             220          755
## 2 Electronic Arts        482          869

fisher.test(select(profit_table, profitable, unprofitable))

## 
##  Fisher's Exact Test for Count Data
## 
## data:  select(profit_table, profitable, unprofitable)
## p-value = 7.973e-12
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4335647 0.6357952
## sample estimates:
## odds ratio 
##  0.5254877

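Since there are two groups and two outcomes being investigated, we create a two-by-two contingency table for use in Fisher’s Exact Test. The p-value of 7.973e-12 is extremely small, so the data would be very unlikely if the null hypothesis were true, and we conclude that the two publishers differ in how often their games are profitable. The odds ratio of about 0.53 (Activision relative to Electronic Arts) is below 1: roughly 23% of Activision's games cleared the sales target (220 of 975) compared with roughly 36% of Electronic Arts' games (482 of 1351), so the evidence favors Electronic Arts.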
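The quantities behind this test can be reproduced directly from the counts in the contingency table. The sketch below rebuilds the table as a matrix (the object names are our own) and computes each publisher's share of profitable games along with the simple cross-product odds ratio. The estimate reported by fisher.test is a conditional maximum likelihood estimate, so it can differ slightly from the cross-product value.

```r
# counts copied from the profit_table output above
profit_matrix <- matrix(
  c(220, 482,    # profitable:   Activision, Electronic Arts
    755, 869),   # unprofitable: Activision, Electronic Arts
  nrow = 2,
  dimnames = list(publisher = c("Activision", "Electronic Arts"),
                  outcome = c("profitable", "unprofitable"))
)

# share of each publisher's games that cleared the sales target
prop_profitable <- profit_matrix[, "profitable"] / rowSums(profit_matrix)

# cross-product odds ratio: Activision's odds divided by Electronic Arts' odds
sample_or <- (profit_matrix[1, 1] * profit_matrix[2, 2]) /
  (profit_matrix[1, 2] * profit_matrix[2, 1])

fisher_result <- fisher.test(profit_matrix)

prop_profitable
sample_or
fisher_result$p.value
```

An odds ratio below 1 here means the first row (Activision) has lower odds of a profitable game than the second row (Electronic Arts).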

Visualizations

game_sales |>
  filter(year > 2010) |>
  ggplot() +
  geom_boxplot(mapping = aes(x = is_roleplaying, y = jp_sales)) +
  scale_x_discrete(labels = c("Not Role-Playing", "Role-Playing")) +
  labs(x = "Genre (Role-Playing or Not)", y = "Sales in Japan (millions)") +
  theme_minimal()

game_sales |>
  filter(publisher %in% c("Activision","Electronic Arts")) |>
  ggplot() +
  geom_bar(mapping = aes(x = publisher, fill = profitable_investment), position = position_dodge()) +
  labs(x = "Publisher", y = "Number of Games", fill = "Investment is Profitable") +
  theme_minimal()

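These basic visualizations show the differences between the groups in each scenario: the box plot compares sales in Japan for role-playing and other games released after 2010, and the bar chart compares the number of profitable and unprofitable games from each publisher. These observed differences are what motivated the hypothesis tests above.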