The goal of this project is to practice different strategies for calculating new columns from existing data and to begin investigating relationships between variables through correlations and confidence intervals.
library(readr)
library(tidyverse)
library(ggplot2)
library(boot)
game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales
# orders of magnitude for global sales
oom_breaks <- 10 ^ c(-3, -2, -1, 0, 1, 2)
oom_labels <- c("10K", "10K-100K", "100K-1M", "1M-10M", "10M+")
game_sales <- game_sales |>
  mutate(global_sales_oom = cut(global_sales, breaks = oom_breaks, labels = oom_labels, right = TRUE))
game_sales |>
  filter(rank <= 10) |>
  select(name, year, global_sales, global_sales_oom)
## # A tibble: 10 × 4
## name year global_sales global_sales_oom
## <chr> <dbl> <dbl> <fct>
## 1 Wii Sports 2006 82.7 10M+
## 2 Super Mario Bros. 1985 40.2 10M+
## 3 Mario Kart Wii 2008 35.8 10M+
## 4 Wii Sports Resort 2009 33 10M+
## 5 Pokemon Red/Pokemon Blue 1996 31.4 10M+
## 6 Tetris 1989 30.3 10M+
## 7 New Super Mario Bros. 2006 30.0 10M+
## 8 Wii Play 2006 29.0 10M+
## 9 New Super Mario Bros. Wii 2009 28.6 10M+
## 10 Duck Hunt 1984 28.3 10M+
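To make the binning step easier to follow, here is a minimal sketch of how `cut()` assigns labels with the same breaks. The `toy_sales` values are made up for illustration and are not taken from the dataset. With `right = TRUE`, each bin includes its upper edge, so a value of exactly 1 (million) lands in "100K-1M" rather than "1M-10M".

```r
# Hypothetical sales figures in millions (illustration only, not from the data)
toy_sales <- c(0.005, 0.05, 0.5, 1, 3.2, 10, 82.7)

# Same breaks and labels as in the analysis above
oom_breaks <- 10 ^ c(-3, -2, -1, 0, 1, 2)
oom_labels <- c("10K", "10K-100K", "100K-1M", "1M-10M", "10M+")

# right = TRUE means each interval is (lower, upper], i.e. closed on the right
toy_oom <- cut(toy_sales, breaks = oom_breaks, labels = oom_labels, right = TRUE)

data.frame(toy_sales, toy_oom)
```

Any value outside the range of the breaks (for example, 0 or anything above 100) would come back as `NA`, which is worth checking for before relying on the new column.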
# create new dataframe with the total number of games per genre and each genre's popularity ranking
genre_ranks <- game_sales |>
  group_by(genre) |>
  summarize(total_genre_games = n()) |>
  mutate(genre_rank = rank(desc(total_genre_games)))
# merge total number of games per genre and genre ranking into main dataframe
game_sales <- game_sales |>
  merge(genre_ranks, all.x = TRUE) |>
  arrange(rank)
game_sales |>
  filter(rank <= 10) |>
  select(rank, name, jp_sales, genre, genre_rank)
## rank name jp_sales genre genre_rank
## 1 1 Wii Sports 3.77 Sports 2
## 2 2 Super Mario Bros. 6.81 Platform 8
## 3 3 Mario Kart Wii 3.79 Racing 7
## 4 4 Wii Sports Resort 3.28 Sports 2
## 5 5 Pokemon Red/Pokemon Blue 10.22 Role-Playing 4
## 6 6 Tetris 4.22 Puzzle 12
## 7 7 New Super Mario Bros. 6.50 Platform 8
## 8 8 Wii Play 2.93 Misc 3
## 9 9 New Super Mario Bros. Wii 4.70 Platform 8
## 10 10 Duck Hunt 0.28 Shooter 5
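One detail of the ranking step is how ties are handled. The sketch below uses made-up genre counts (`toy_counts` is hypothetical) and base R's `rank()` on negated values, which for numeric input orders the same way as `rank(desc(...))`. By default, tied values share the average of their ranks; the genre ranks shown above are all whole numbers, so ties may simply not occur in this data.

```r
# Hypothetical number of games per genre (illustration only)
toy_counts <- c(Action = 5, Sports = 3, Misc = 3, Puzzle = 1)

# Negating the counts ranks the largest count first
default_ranks <- rank(-toy_counts)                   # ties get the average rank
min_ranks <- rank(-toy_counts, ties.method = "min")  # ties get the lowest rank

default_ranks
min_ranks
```

If whole-number ranks are wanted even when ties exist, `ties.method = "min"` is one option.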
I have chosen two pairs of variables to analyze. The first pair is the year of a game’s release and the order of magnitude of its global sales. The second is the popularity ranking of a game’s genre (determined by how many games of that genre exist in the entire dataset) and the game’s total sales in Japan.
game_sales |>
  filter(!is.na(year)) |>
  group_by(year, global_sales_oom) |>
  count() |>
  ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales_oom, size = n)) +
  labs(x = "Year", y = "Global Sales Order of Magnitude", size = "Number of Rows") +
  theme_minimal()
Here we can see the number of video games released each year, grouped by the order of magnitude of their global sales. Early on, games fell only in the 1M-10M and 100K-1M categories; 10M+ and 10K-100K games appear around 1985, and games in the lowest (10K) category appear around 1992. The main pattern is that more video games are being produced in general, rather than more games reaching any specific order of magnitude of global sales. Still, because the lowest category took the longest to appear, a slight downward trend could be perceived. There are no outliers specific to this chart, although the outlying 2020 release year is visible.
game_sales |>
  ggplot() +
  geom_point(mapping = aes(x = genre_rank, y = jp_sales)) +
  scale_x_continuous(breaks = pluck(genre_ranks, "genre_rank"), labels = pluck(genre_ranks, "genre")) +
  labs(x = "Genres (Ranked by Popularity)", y = "Sales in Japan (millions)") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This visualization seems to show very little correlation between a video game’s sales in Japan and how common its genre is. Unsurprisingly, every genre is heavily concentrated near 0, but some genres have vastly more successful games than others. Shooters and strategy games, ranked 5th and 11th, are concentrated at low values, while role-playing games, ranked 4th, have a notable outlier: the only game in the dataset with over 10 million in Japanese sales.
game_sales_filtered_nas <- game_sales |> filter(!is.na(year))
cor(x = game_sales_filtered_nas$year,
    y = as.numeric(game_sales_filtered_nas$global_sales_oom),
    method = "spearman")
## [1] -0.1452938
This value of about -0.145 indicates a weak negative correlation, which aligns with the observations made about the visualization. Although more video games are being produced now than in the past overall, the lower global sales categories are growing faster than the higher ones, so there is a slight downward trend.
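As a brief illustration of what `method = "spearman"` computes, the sketch below shows that Spearman's coefficient is the ordinary Pearson correlation applied to the ranks of each variable. The vectors `toy_year` and `toy_oom_code` are invented for this example; `toy_oom_code` stands in for the integer level codes (1 to 5) that `as.numeric()` returns for an ordered set of factor levels.

```r
# Hypothetical release years and order-of-magnitude level codes (illustration only)
toy_year <- c(1985, 1990, 1996, 2006, 2008, 2009)
toy_oom_code <- c(5, 4, 5, 3, 2, 2)

# Spearman correlation computed directly
spearman <- cor(toy_year, toy_oom_code, method = "spearman")

# Pearson correlation of the (average-tie) ranks gives the same value
pearson_on_ranks <- cor(rank(toy_year), rank(toy_oom_code))

all.equal(spearman, pearson_on_ranks)
```

Because only the ordering matters, the coefficient is unaffected by the uneven widths of the sales categories, which is one reason it suits an ordinal variable like this one.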
cor(x = game_sales$genre_rank, y = game_sales$jp_sales)
## [1] 0.03293325
As discussed above, there is little trend in sales in Japan against genre ranking, and the correlation coefficient reflects this with a very small positive value. Taken literally, it means that as the genre rank number goes up (that is, as the number of games in a game’s genre goes down), sales go up slightly; however, the value is too small to indicate any real correlation between the two variables.
# set up bootstrapping function (taken from lab workbook)
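boot_ci <- function (v, func = median, conf = 0.95, n_iter = 500) {
  # statistic applied to each resample: x is the data, i the resampled indices
  boot_func <- \(x, i) func(x[i], na.rm = TRUE)
  b <- boot(v, boot_func, R = n_iter)
  # percentile-type confidence interval from the bootstrap replicates
  boot.ci(b, conf = conf, type = "perc")
}
boot_ci(pluck(game_sales, "jp_sales"), mean, .95)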
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 0.0725, 0.0824 )
## Calculations and Intervals on Original Scale
This confidence interval tells us that, assuming our data is a representative sample of the entire population, if we repeatedly took such samples and built a 95% confidence interval from each in this way, about 95% of those intervals would contain the true mean of the population. The required assumption is unlikely to hold for this dataset, since it ranks all games with global sales above a certain point rather than taking a random representative sample of all video games. Even so, the interval could still provide a general idea that the true population mean lies between 0.0725 million and 0.0824 million dollars in sales.
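To show the idea behind the percentile interval without the `boot` package, here is a rough base-R sketch: resample the data with replacement many times, compute the mean of each resample, and take the 2.5% and 97.5% quantiles of those means. The `toy_sales` vector is made up for illustration, and the result may differ slightly from `boot.ci()`, which interpolates quantiles in its own way.

```r
set.seed(1)  # make the resampling reproducible

# Hypothetical sales figures in millions (illustration only, not from the data)
toy_sales <- c(0, 0, 0.02, 0.05, 0.1, 0.3, 1.2, 0, 0.01, 0.4)

n_iter <- 500

# Each replicate: draw a resample of the same size with replacement, take its mean
boot_means <- replicate(n_iter, mean(sample(toy_sales, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

With only ten skewed values the interval is wide; the narrow interval reported above reflects the much larger number of rows in the real dataset.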