Week 6 Data Dive: Confidence Intervals

The goal of this project is to practice different strategies for calculating columns from existing data and to begin investigating correlations between variables through confidence intervals.

library(readr)
library(tidyverse)
library(ggplot2)
library(boot)
game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

Building Numeric Variable Pairs

# orders of magnitude for global sales
oom_breaks <- 10 ^ c(-3, -2, -1, 0, 1, 2)
oom_labels <- c("10K", "10K-100K","100K-1M","1M-10M","10M+")

game_sales <- game_sales |>
  mutate(global_sales_oom = cut(global_sales, breaks = oom_breaks, labels = oom_labels, right = TRUE))

game_sales |>
  filter(rank <= 10) |>
  select(name, year, global_sales, global_sales_oom)
## # A tibble: 10 × 4
##    name                       year global_sales global_sales_oom
##    <chr>                     <dbl>        <dbl> <fct>           
##  1 Wii Sports                 2006         82.7 10M+            
##  2 Super Mario Bros.          1985         40.2 10M+            
##  3 Mario Kart Wii             2008         35.8 10M+            
##  4 Wii Sports Resort          2009         33   10M+            
##  5 Pokemon Red/Pokemon Blue   1996         31.4 10M+            
##  6 Tetris                     1989         30.3 10M+            
##  7 New Super Mario Bros.      2006         30.0 10M+            
##  8 Wii Play                   2006         29.0 10M+            
##  9 New Super Mario Bros. Wii  2009         28.6 10M+            
## 10 Duck Hunt                  1984         28.3 10M+
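
To make the binning rule concrete, here is a minimal sketch of how cut() with right = TRUE treats values that sit exactly on a break. The toy_sales values are made up for illustration, and the breaks are written out as literals rather than computed with 10 ^ c(...).

```r
# Toy sales values in millions (hypothetical, not taken from the dataset)
toy_sales <- c(0.01, 0.05, 0.1, 0.5, 1, 10, 82.7)

# Same five bins as above, with the breaks written out as literals
toy_breaks <- c(0.001, 0.01, 0.1, 1, 10, 100)
oom_labels <- c("10K", "10K-100K", "100K-1M", "1M-10M", "10M+")

# right = TRUE makes every bin (lower, upper], so a value that sits
# exactly on a break is placed in the bin below that break
toy_oom <- cut(toy_sales, breaks = toy_breaks, labels = oom_labels, right = TRUE)

data.frame(toy_sales, toy_oom)
```

Under this rule a game with exactly 1 million in sales is labeled 100K-1M and one with exactly 10 million is labeled 1M-10M. Passing right = FALSE would move those boundary values into the upper bin instead.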
# create new dataframe with the total number of games per genre and each genre's popularity ranking
genre_ranks <- game_sales |>
  group_by(genre) |>
  summarize(total_genre_games = n()) |>
  mutate(genre_rank = rank(desc(total_genre_games)))

# merge total number of games per genre and genre ranking into main dataframe
game_sales <- game_sales |>
  merge(genre_ranks, all.x = TRUE) |>
  arrange(rank)

game_sales |>
  filter(rank <= 10) |>
  select(rank, name, jp_sales, genre, genre_rank)
##    rank                      name jp_sales        genre genre_rank
## 1     1                Wii Sports     3.77       Sports          2
## 2     2         Super Mario Bros.     6.81     Platform          8
## 3     3            Mario Kart Wii     3.79       Racing          7
## 4     4         Wii Sports Resort     3.28       Sports          2
## 5     5  Pokemon Red/Pokemon Blue    10.22 Role-Playing          4
## 6     6                    Tetris     4.22       Puzzle         12
## 7     7     New Super Mario Bros.     6.50     Platform          8
## 8     8                  Wii Play     2.93         Misc          3
## 9     9 New Super Mario Bros. Wii     4.70     Platform          8
## 10   10                 Duck Hunt     0.28      Shooter          5

I have chosen two pairs of variables to analyze. The first is the year of a game’s release and the order of magnitude of its global sales. The second is the popularity ranking of a game’s genre (determined by how many games of that genre exist in the entire dataset) and the game’s total sales in Japan.
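
The genre ranking and the merge step can be illustrated on a toy example. This is a sketch in base R only: rank(-x) stands in for rank(desc(x)), and genre_counts, toy_games, and toy_genres are hypothetical names and values.

```r
# Hypothetical game counts per genre
genre_counts <- c(Action = 30, Sports = 20, Misc = 20, Puzzle = 5)

# Negating the counts ranks the largest genre first; by default tied
# genres share the average of the ranks they would have occupied
avg_ranks <- rank(-genre_counts)                       # 1, 2.5, 2.5, 4
min_ranks <- rank(-genre_counts, ties.method = "min")  # 1, 2, 2, 4

# merge() with no `by` argument joins on the shared column ("genre") and
# sorts the result by that key, so the sales rank order must be restored
toy_games  <- data.frame(rank = 1:4,
                         genre = c("Sports", "Action", "Sports", "Puzzle"))
toy_genres <- data.frame(genre = c("Action", "Sports", "Puzzle"),
                         genre_rank = c(1, 2, 3))
merged <- merge(toy_games, toy_genres, all.x = TRUE)
merged <- merged[order(merged$rank), ]
merged
```

The re-sorting in the last step plays the same role as arrange(rank) in the pipeline above.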

Visualizing Relationships

Release Year and Global Sales OOM

game_sales |>
  filter(!is.na(year)) |>
  group_by(year, global_sales_oom) |>
  count() |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales_oom, size = n)) +
  labs(x = "Year", y = "Global Sales Order of Magnitude", size = "Number of Rows") +
  theme_minimal()

Here we can see the number of video games released each year, grouped by the order of magnitude of their global sales. Early on, games fell only in the 1M-10M and 100K-1M categories; 10M+ and 10K-100K games appear around 1985, and games in the lowest (10K) category appear around 1992. The main pattern is that more video games are being produced overall, rather than more games reaching any specific order of magnitude of global sales. Since the lowest category took the longest to appear, though, a slight downward trend could be perceived. There are no outliers specific to this chart, although the outlying year 2020 is visible.

Genre Rank and Japan Sales

game_sales |>
  ggplot() +
  geom_point(mapping = aes(x = genre_rank, y = jp_sales)) +
  scale_x_continuous(breaks = pluck(genre_ranks, "genre_rank"), labels = pluck(genre_ranks, "genre")) +
  labs(x = "Genres (Ranked by Popularity)", y = "Sales in Japan (millions)") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This visualization seems to show very little correlation between the monetary success of a video game in Japan and how common the game’s genre is. Unsurprisingly, every genre is very heavily weighted towards 0, but some genres have vastly more successful games than others. Shooters and strategy games, ranked 5th and 11th, are both concentrated at low values, while role-playing games, ranked 4th, have a notable outlier: the only game in the dataset with over 10 million in Japan sales.

Calculating Correlation Coefficients

game_sales_filtered_nas <- game_sales |> filter(!is.na(year))
cor(x = game_sales_filtered_nas$year,
    y = as.numeric(game_sales_filtered_nas$global_sales_oom),
    method = "spearman")
## [1] -0.1452938

This value aligns with the observations made about the visualization. Though more video games are being produced now than in the past overall, the lower orders of magnitude of global sales are growing faster than the higher ones, so there is a weak downward trend.
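
As a rough illustration of what the Spearman call does with the factor column, the sketch below uses made-up values (toy_year and toy_oom are hypothetical). as.numeric() turns a factor into its integer level codes, and Spearman's coefficient is the Pearson correlation of the ranks of the two variables.

```r
oom_labels <- c("10K", "10K-100K", "100K-1M", "1M-10M", "10M+")

# Hypothetical release years and sales categories for six games
toy_year <- c(1985, 1992, 1999, 2004, 2010, 2015)
toy_oom  <- factor(c("1M-10M", "100K-1M", "100K-1M", "10K-100K", "10K", "10K-100K"),
                   levels = oom_labels)

# A factor converts to its level codes, so "10K" becomes 1 and "10M+" becomes 5
oom_codes <- as.numeric(toy_oom)

# Spearman's rho computed directly, then by hand as Pearson on the ranks
rho         <- cor(toy_year, oom_codes, method = "spearman")
rho_by_hand <- cor(rank(toy_year), rank(oom_codes))

all.equal(rho, rho_by_hand)
```

Because only the ordering of the categories matters for a rank correlation, the unequal widths of the order-of-magnitude bins do not affect the coefficient.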

cor(x = game_sales$genre_rank, y = game_sales$jp_sales)
## [1] 0.03293325

As discussed above, there is little trend in Japan sales against genre ranking. The correlation coefficient reflects this with a very small positive value: as the genre rank number goes up (that is, as the number of games within a game’s genre goes down), the sales value goes up slightly. However, the coefficient is too small to show any real correlation between the two variables.
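
One hedged way to judge whether a small coefficient is distinguishable from zero is to put a confidence interval around the correlation itself. The sketch below uses cor.test() from base R's stats package on simulated data. toy_genre_rank and toy_jp_sales are hypothetical stand-ins, not the real columns, and the Pearson interval relies on a normal approximation that heavily skewed sales data may not satisfy well.

```r
set.seed(42)

# Simulated stand-ins for genre rank and Japan sales (hypothetical values)
n <- 1000
toy_genre_rank <- sample(1:12, n, replace = TRUE)
toy_jp_sales   <- rexp(n, rate = 10) + 0.002 * toy_genre_rank

# cor.test() reports the Pearson estimate together with a 95% interval
ct <- cor.test(toy_genre_rank, toy_jp_sales, conf.level = 0.95)

ct$estimate   # the sample correlation
ct$conf.int   # lower and upper bounds for the population correlation
```

If the interval contains 0, the data are consistent with no linear relationship at that confidence level.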

Building Confidence Intervals

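# set up bootstrapping function (taken from lab workbook)
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 500) {
  # boot() passes the data and a vector of resampled indices to the statistic
  boot_func <- \(x, i) func(x[i], na.rm=TRUE)
  b <- boot(v, boot_func, R = n_iter)
  # percentile interval computed from the n_iter bootstrap replicates
  boot.ci(b, conf = conf, type = "perc")
}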
boot_ci(pluck(game_sales, "jp_sales"), mean, .95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 0.0725,  0.0824 )  
## Calculations and Intervals on Original Scale

This confidence interval tells us that, assuming our data is a representative sample of the entire population, if we repeatedly drew such samples and built a 95% interval from each one in the same way, about 95% of those intervals would contain the true mean of the population. The required assumption is unlikely to hold for this dataset, since it ranks all games with global sales above a certain point rather than taking a random representative sample from all video games. Even so, the interval could still provide a general idea that the true population mean lies between .0725 million and .0824 million dollars in sales.
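
The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population is simulated (a skewed exponential, chosen only for illustration), and percentile_ci() is a hand-rolled base R stand-in for a percentile bootstrap, not the boot.ci() routine used above, so its endpoints may differ slightly from boot.ci()'s.

```r
set.seed(1)

# Hypothetical skewed population with a known mean
population <- rexp(100000, rate = 12)
true_mean  <- mean(population)

# Percentile bootstrap: resample with replacement, recompute the statistic,
# and take the middle `conf` share of the resulting values
percentile_ci <- function(v, func = mean, conf = 0.95, n_iter = 500) {
  stats <- replicate(n_iter, func(sample(v, replace = TRUE)))
  alpha <- 1 - conf
  quantile(stats, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Draw many random samples and record whether each interval covers the true mean
n_samples <- 200
covered <- replicate(n_samples, {
  s  <- sample(population, size = 300)
  ci <- percentile_ci(s)
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Share of intervals containing the true mean
mean(covered)
```

The coverage here depends on each sample being drawn at random from the population, which is the assumption the ranked sales dataset is unlikely to meet.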