Week 4 Data Dive: Sampling and Drawing Conclusions

The goal of this project is to learn how the particular sample drawn from a population can lead to misleading conclusions. To simulate sample collection, it analyzes random subsamples of a game sales dataset.

library(readr)
library(tidyverse)
library(ggplot2)
library(magrittr)
game_sales <- read_csv("video_game_sales.csv")

Generation of Subsamples

# Each sample holds half as many rows as the full dataset.
sample_size <- round(nrow(game_sales) / 2)
num_samples <- 5

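# Draw num_samples random samples and store them in a named list.
# replace = TRUE samples with replacement, so a row can repeat within a sample.
sample_list <- list()
for (i in seq_len(num_samples)) {
  sample_name <- paste0("sample_", i)
  sample_list[[sample_name]] <- game_sales |>
    slice_sample(n = sample_size, replace = TRUE) |>
    arrange(rank)
}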

This code sets the sample size to half the number of rows in the game sales data, then draws five random samples of that size (with replacement, so a row can appear more than once in a sample) and stores them in a list so that later analysis can be applied to every sample with map.
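Because the samples are random, every re-run of the code above produces different numbers. A minimal sketch of a reproducible version follows. It assumes a small made-up tibble, toy_sales, standing in for game_sales, and an arbitrarily chosen seed; neither is part of the original analysis.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for game_sales (not the real data).
toy_sales <- tibble(
  rank = 1:20,
  global_sales = round(40 / (1:20), 2)
)

set.seed(4)  # arbitrary seed (assumption) so every re-run draws the same samples

sample_size <- round(nrow(toy_sales) / 2)
num_samples <- 5

# Build the named list of samples with map instead of growing a list in a loop.
toy_samples <- paste0("sample_", seq_len(num_samples)) |>
  set_names() |>
  map(\(name) toy_sales |>
        slice_sample(n = sample_size, replace = TRUE) |>
        arrange(rank))

# Number of rows in each sample.
map_int(toy_samples, nrow)
```

Fixing the seed keeps the written discussion in step with the printed output when the document is knitted again.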

Analysis of Subsample Global Sales

global_sales_by_sample <- sample_list |>
  map(\(x) pluck(x, "global_sales")) |>
  map(quantile)
global_sales_by_sample
## $sample_1
##    0%   25%   50%   75%  100% 
##  0.01  0.06  0.17  0.46 33.00 
## 
## $sample_2
##    0%   25%   50%   75%  100% 
##  0.01  0.06  0.17  0.48 82.74 
## 
## $sample_3
##    0%   25%   50%   75%  100% 
##  0.01  0.06  0.17  0.46 40.24 
## 
## $sample_4
##    0%   25%   50%   75%  100% 
##  0.01  0.06  0.17  0.49 82.74 
## 
## $sample_5
##    0%   25%   50%   75%  100% 
##  0.01  0.06  0.17  0.47 35.82

Here, we can see that the minimum and the first three quartiles are very similar across all five samples. Because the sales data is heavily skewed toward low values, a sample of this size contains more than enough low-selling games to pin down those quantiles; the largest difference is only 0.03 million (30,000 copies), between the 75th percentiles of Sample 3 and Sample 4. The maximum values, however, differ widely. Only two of the samples (Samples 2 and 4) include the number one best-selling game, Wii Sports, at 82.74 million. Since it sold about twice as much as the second-best seller, the samples missing it badly understate the total range of values.
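The number of samples that miss the top seller is close to what the sampling scheme predicts. The rough check below assumes the full dataset has 16,598 rows (inferred from the sample 2 genre counts, which sum to 8,299, rather than read from the data) and that rows are drawn uniformly with replacement, as slice_sample does here.

```r
# Assumed sizes: N is inferred from the sample size, not read from the data.
N <- 16598   # rows in the full dataset (assumption)
n <- 8299    # rows drawn per sample, with replacement

# Chance that one specific row (for example, the top seller) is never drawn in n tries.
p_missed <- (1 - 1 / N)^n

# Chance that it appears at least once in a sample.
p_included <- 1 - p_missed
round(p_included, 3)

# Expected number of the 5 samples that contain that row.
round(5 * p_included, 2)
```

Under these assumptions a given row lands in a sample about 39% of the time, or in roughly two of five samples, which is in line with the two samples that captured Wii Sports.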

Analysis of Subsample Genres

global_sales_by_sample_genre <- sample_list |>
  map(\(x) group_by(x, genre)) |>
  map(\(x) summarize(x, count = n(), mean_global_sales = mean(global_sales))) |>
  map(\(x) arrange(x, desc(count)))
global_sales_by_sample_genre[2]
## $sample_2
## # A tibble: 12 × 3
##    genre        count mean_global_sales
##    <chr>        <int>             <dbl>
##  1 Action        1667             0.546
##  2 Sports        1185             0.660
##  3 Misc           823             0.437
##  4 Role-Playing   721             0.600
##  5 Adventure      660             0.196
##  6 Shooter        653             0.713
##  7 Racing         621             0.645
##  8 Platform       475             1.07 
##  9 Simulation     437             0.395
## 10 Fighting       422             0.530
## 11 Strategy       335             0.294
## 12 Puzzle         300             0.508

Overall, the subsamples are quite similar to the actual data in terms of how common different genres are. The extreme ends of genre popularity, the top four and bottom two genres, are consistent across all the subsamples and the real data. The ranking of the genres in the middle varies, but is still relatively similar to the actual data. Racing and shooter games are the most misleading: racing appears two ranks higher than its actual rank in two of the samples, and shooter appears one rank lower in three of the samples. Though the mean global sales naturally vary a fair amount, probably the most significant difference from the actual data set is that the second sample shows platform games with mean global sales of over 1 million copies, something no genre in the full data set actually achieved.
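Rank comparisons like these can be computed rather than read off the printed tables. The sketch below uses two tiny made-up tibbles, full_toy and sample_toy, in place of the real full data and one sample; the helper rank_genres and the rank_shift column are likewise hypothetical names introduced only for illustration.

```r
library(dplyr)

# Hypothetical genre labels standing in for the full dataset and one sample.
full_toy <- tibble(genre = rep(c("Action", "Sports", "Racing", "Puzzle"),
                               times = c(8, 6, 4, 2)))
sample_toy <- tibble(genre = rep(c("Action", "Sports", "Racing", "Puzzle"),
                                 times = c(4, 2, 3, 1)))

# Rank genres by how often they appear (1 = most common).
rank_genres <- function(df) {
  df |>
    count(genre, name = "count") |>
    mutate(rank = min_rank(desc(count)))
}

# Positive rank_shift: the genre ranks higher in the sample than in the full data.
rank_shift <- full_toy |>
  rank_genres() |>
  inner_join(rank_genres(sample_toy), by = "genre",
             suffix = c("_full", "_sample")) |>
  mutate(rank_shift = rank_full - rank_sample) |>
  arrange(rank_full)

rank_shift
```

In this toy case Racing moves up one rank and Sports moves down one, while the most and least common genres keep their positions, the same pattern described for the real samples.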

Analysis of Subsample Publishers

# Count games and average global sales per publisher within each sample.
global_sales_by_sample_publisher <- sample_list |>
  map(\(x) group_by(x, publisher)) |>
  map(\(x) summarize(x, count = n(), mean_global_sales = mean(global_sales))) |>
  map(\(x) arrange(x, desc(count)))
global_sales_by_sample_publisher[5]
## $sample_5
## # A tibble: 424 × 3
##    publisher                    count mean_global_sales
##    <chr>                        <int>             <dbl>
##  1 Electronic Arts                619             0.863
##  2 Activision                     484             0.692
##  3 Ubisoft                        467             0.503
##  4 Namco Bandai Games             452             0.286
##  5 Konami Digital Entertainment   434             0.332
##  6 Nintendo                       366             2.78 
##  7 THQ                            358             0.478
##  8 Sony Computer Entertainment    355             0.914
##  9 Sega                           331             0.484
## 10 Take-Two Interactive           204             0.955
## # ℹ 414 more rows

Once again, the top publishers are quite similar across the subsamples and the complete dataset. Electronic Arts is so far ahead of the other publishers that it is consistently in first place, but the subsamples tend to show Activision lower than second, Ubisoft above fourth, and Sega above ninth, so the exact rankings are not very accurate. Though the order varies, the top 9 publishers (which were analyzed in the Week 3 Data Dive) are consistent across all the samples and the full data set. In the full data set, Nintendo has the highest mean global sales of the major publishers at 2.5 million, but in the fifth sample shown above it is 2.78 million, noticeably higher than the true figure.

The number of distinct publishers varies by sample, ranging from about 400 to 424 (the fifth sample above) out of 579 in the full data set. It might seem that missing so many publishers would mislead someone into thinking the population has fewer publishers in total. Relative to the number of games, however, the samples imply the opposite: about 4.8-5.1% of the rows in a sample correspond to a distinct publisher, compared with 3.5% in the full data, so scaling up from a sample would overestimate the number of publishers.
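The ratio of distinct publishers to rows rises in a half-size sample because the row count is cut in half while most of the common publishers still appear at least once. A minimal sketch of that effect, using a made-up long-tailed population (toy_full, with invented publisher names and an arbitrary seed) rather than the real data:

```r
library(dplyr)

# Hypothetical population: 1,000 rows across up to 100 publishers,
# where a few publishers account for most of the rows.
set.seed(4)  # arbitrary seed (assumption) for reproducibility
toy_full <- tibble(
  publisher = sample(paste0("pub_", 1:100), size = 1000, replace = TRUE,
                     prob = 1 / (1:100))
)

# Half-size sample drawn with replacement, mirroring the approach above.
toy_sample <- toy_full |>
  slice_sample(n = 500, replace = TRUE)

# Distinct publishers per row (hypothetical helper).
distinct_ratio <- function(df) n_distinct(df$publisher) / nrow(df)

c(full = distinct_ratio(toy_full), sample = distinct_ratio(toy_sample))
```

The sample contains fewer distinct publishers in absolute terms but more per row, so neither the raw count nor the ratio can be scaled up directly to estimate the number of publishers in the population.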

Discussion

Working through these analyses demonstrated how easy it can be to draw incorrect conclusions from a sample of data. Generally, broad aggregates of a sample are more reliable than its extremes, as seen with the maximum global sales versus the other quantiles. For categorical data, identifying the top n categories may be more useful than analyzing their exact ranking, because a few key members of the population missing from a sample can completely change your understanding of the data.