The goal of this project is to learn how the particular sample drawn from a population can lead to misleading conclusions. To simulate sample collection, it analyzes subsamples of a video game sales dataset.
library(readr)
library(tidyverse)
library(ggplot2)
library(magrittr)
game_sales <- read_csv("video_game_sales.csv")
# Each sample is half the size of the full dataset
sample_size <- round(nrow(game_sales) / 2)
num_samples <- 5
sample_list <- list()
# Draw each sample at random, with replacement, and store it under its name
for (i in seq_len(num_samples)) {
  sample_name <- paste0("sample_", i)
  sample_list[[sample_name]] <- game_sales |>
    slice_sample(n = sample_size, replace = TRUE) |>
    arrange(rank)
}
This code sets the sample size to half the number of rows in the game sales data, then draws five random samples of that size (with replacement, so a row can appear more than once in a sample) and stores them in a named list so that later analysis can be applied to every sample with map.
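Because the samples are drawn at random, the quantiles and tables in this report will change each time the code is run. Below is a minimal sketch of one way to make the draws reproducible. It assumes a call to set.seed, which is not part of the original code, and uses a small made-up data frame in place of the real data. The names toy_sales, toy_sample_size, and toy_sample_list are hypothetical.

```r
library(dplyr)
library(purrr)

# Made-up stand-in for game_sales (illustrative values, not the real dataset)
toy_sales <- tibble(
  rank = 1:10,
  global_sales = c(50, 20, 10, 5, 2, 1, 0.5, 0.2, 0.1, 0.05)
)

# Fix the random number generator so the same samples are drawn on every run
set.seed(42)

toy_sample_size <- round(nrow(toy_sales) / 2)
toy_num_samples <- 5

# Build the named list in one step instead of growing it inside a loop
toy_sample_list <- paste0("sample_", seq_len(toy_num_samples)) |>
  set_names() |>
  map(\(sample_name) {
    toy_sales |>
      slice_sample(n = toy_sample_size, replace = TRUE) |>
      arrange(rank)
  })

names(toy_sample_list)
```

With a fixed seed, the written interpretation stays in step with the printed output when the report is re-knit.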
global_sales_by_sample <- sample_list |> map(\(x) pluck(x, "global_sales")) |> map(quantile)
global_sales_by_sample
## $sample_1
## 0% 25% 50% 75% 100%
## 0.01 0.06 0.17 0.46 33.00
##
## $sample_2
## 0% 25% 50% 75% 100%
## 0.01 0.06 0.17 0.48 82.74
##
## $sample_3
## 0% 25% 50% 75% 100%
## 0.01 0.06 0.17 0.46 40.24
##
## $sample_4
## 0% 25% 50% 75% 100%
## 0.01 0.06 0.17 0.49 82.74
##
## $sample_5
## 0% 25% 50% 75% 100%
## 0.01 0.06 0.17 0.47 35.82
Here, we can see that the first four quantiles are very similar across all five samples. Since sales data is heavily skewed toward lower values, it makes sense that samples of this size capture the bulk of the distribution with little disparity: the largest difference is only 0.03 million (about thirty thousand units), between the 75th percentile of Sample 3 and that of Sample 4. However, the differences between the maximum values are very large. Only two of the samples include the number one best-selling game, Wii Sports. Since it sold about twice as well as the second-best seller, the samples missing it give a very inaccurate range of values.
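One way to see why only some samples capture the top seller is to work out the chance that any single row ends up in a sample drawn with replacement. The sketch below assumes a hypothetical population of 10,000 rows, since the real row count is not restated here. The helper name prob_row_included is chosen for illustration.

```r
# Chance that one particular row appears at least once when n rows are
# drawn with replacement from a population of N rows
prob_row_included <- function(N, n) {
  1 - (1 - 1 / N)^n
}

# Hypothetical population size; the sample is half the population, as above
N_rows <- 10000
p_included <- prob_row_included(N_rows, N_rows / 2)
print(round(p_included, 2))  # about 0.39
```

Under these assumptions, each sample has roughly a 39% chance of containing any given game, so seeing Wii Sports in two of the five samples is about what would be expected.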
global_sales_by_sample_genre <- sample_list |>
  map(\(x) group_by(x, genre)) |>
  map(\(x) summarize(x, count = n(), mean_global_sales = mean(global_sales))) |>
  map(\(x) arrange(x, desc(count)))
global_sales_by_sample_genre[2]
## $sample_2
## # A tibble: 12 × 3
## genre count mean_global_sales
## <chr> <int> <dbl>
## 1 Action 1667 0.546
## 2 Sports 1185 0.660
## 3 Misc 823 0.437
## 4 Role-Playing 721 0.600
## 5 Adventure 660 0.196
## 6 Shooter 653 0.713
## 7 Racing 621 0.645
## 8 Platform 475 1.07
## 9 Simulation 437 0.395
## 10 Fighting 422 0.530
## 11 Strategy 335 0.294
## 12 Puzzle 300 0.508
Overall, the subsamples are quite similar to the full data in terms of how common different genres are. The extremes of genre popularity, the top four and bottom two genres, are consistent across all the subsamples and the full data. The ranking of the genres in the middle varies but is still relatively close to the full data. Racing and shooter games are the most misleading: racing appears two ranks higher than its actual rank in two of the samples, and shooter appears one rank lower in three of the samples. Though the mean global sales naturally vary a fair amount, probably the most significant difference from the full data set is that the second sample shows platform games with mean global sales of over 1 million units, something no genre in the full data set achieves.
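The rank comparison described above can be made explicit by ranking genres in the full data and in a sample, then joining the two tables. This is a rough sketch using made-up genre counts. The data frames toy_full_genres and toy_sample_genres and the helper rank_genres are hypothetical names, not objects from the analysis.

```r
library(dplyr)

# Made-up genre labels standing in for the full data and one sample
toy_full_genres <- tibble(
  genre = c(rep("Action", 6), rep("Sports", 4), rep("Racing", 3), rep("Shooter", 2))
)
toy_sample_genres <- tibble(
  genre = c(rep("Action", 4), rep("Sports", 1), rep("Racing", 2), rep("Shooter", 3))
)

# Count rows per genre and rank genres from most to least common
rank_genres <- function(df) {
  df |>
    count(genre, name = "count") |>
    mutate(genre_rank = min_rank(desc(count)))
}

# A positive shift means the genre places lower in the sample than in the full data
rank_shift <- rank_genres(toy_full_genres) |>
  inner_join(rank_genres(toy_sample_genres), by = "genre",
             suffix = c("_full", "_sample")) |>
  mutate(shift = genre_rank_sample - genre_rank_full) |>
  arrange(genre_rank_full)

rank_shift
```

Applying the same helper to each element of sample_list with map would give one shift table per sample, which makes it easier to see which genres move and by how much.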
# Group by publisher only; the name reflects that no genre is involved
global_sales_by_sample_publisher <- sample_list |>
  map(\(x) group_by(x, publisher)) |>
  map(\(x) summarize(x, count = n(), mean_global_sales = mean(global_sales))) |>
  map(\(x) arrange(x, desc(count)))
global_sales_by_sample_publisher[5]
## $sample_5
## # A tibble: 424 × 3
## publisher count mean_global_sales
## <chr> <int> <dbl>
## 1 Electronic Arts 619 0.863
## 2 Activision 484 0.692
## 3 Ubisoft 467 0.503
## 4 Namco Bandai Games 452 0.286
## 5 Konami Digital Entertainment 434 0.332
## 6 Nintendo 366 2.78
## 7 THQ 358 0.478
## 8 Sony Computer Entertainment 355 0.914
## 9 Sega 331 0.484
## 10 Take-Two Interactive 204 0.955
## # ℹ 414 more rows
Once again, the top publishers are quite similar across the subsamples and the complete dataset. Electronic Arts is so far ahead of the other publishers that it is consistently in first place, but the subsamples tend to show Activision lower than second, Ubisoft above fourth, and Sega above ninth, so the exact rankings are not very accurate. Though the order varies, the top 9 publishers (which were analyzed in the Week 3 Data Dive) are the same in all the samples and the full data set. In the full data set, Nintendo has the highest mean global sales of the major publishers at 2.5 million, but in the fifth sample it reaches 2.78 million, which could seem anomalous. The number of publishers varies by sample but ranges between 400 and 424, out of 579 in the full data set. With so many publishers missing from the samples, someone could be misled into thinking there are fewer total publishers in the population. Looking at the ratio of publishers to total games, however, the samples would imply there are actually more: distinct publishers make up 4.8-5.1% of the rows in the samples, as opposed to 3.5% in the full data.
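The ratio of publishers to rows mentioned above can be computed with n_distinct. The following is a small sketch on made-up data. The data frames toy_full_publishers and toy_sample_publishers and the helper publisher_share are hypothetical names introduced for illustration.

```r
library(dplyr)

# Made-up publisher labels: 10 rows in the "full" data, 5 rows in the "sample"
toy_full_publishers <- tibble(publisher = c(rep("A", 5), rep("B", 3), "C", "D"))
toy_sample_publishers <- tibble(publisher = c("A", "A", "B", "C", "A"))

# Distinct publishers, total rows, and the share of rows they represent
publisher_share <- function(df) {
  df |>
    summarize(
      publishers = n_distinct(publisher),
      rows = n(),
      share = publishers / rows
    )
}

full_share <- publisher_share(toy_full_publishers)
sample_share <- publisher_share(toy_sample_publishers)

bind_rows(full = full_share, sample = sample_share, .id = "source")
```

In this toy case the sample has fewer publishers than the full data (3 versus 4) but a higher share (0.6 versus 0.4), because halving the rows does not halve the number of distinct publishers. The same pattern appears in the real samples.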
Working through these analyses demonstrated how easy it can be to draw incorrect conclusions from a sample of data. Generally, broad aggregations of each sample are more reliable than extremes, as seen with the maximum global sales versus the other quantiles. For categorical data, identifying the top n categories may be more useful than analyzing their exact ranking, since the absence of a few key members of the population can completely change your understanding of the data.