Week 5 Data Dive: Documentation

The goal of this project is to analyze the detail of the video game sales dataset’s documentation, point out areas in which it is lacking, and locate missing data and outliers.

library(tidyverse)  # attaches readr, dplyr, ggplot2, and purrr
game_sales <- read_csv("video_game_sales.csv")

Initially Unclear Data

One initially unclear component of this data is the unit used by the sales columns. The documentation confirms that sales are tracked in millions, a scale likely chosen because the sales columns range from $10K to under $100 million, so there are at most two digits on either side of the decimal point. However, the majority of games have global sales under $1 million, so the decimal values that frequently appear on plots could confuse viewers and should generally be relabeled.
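One way to do that relabeling is sketched below. The helper name `scale_y_sales_thousands` is hypothetical, and the sketch assumes the sales column is mapped to the y-axis and that the plot focuses on the sub-million range, where labels in thousands read more naturally.

```r
library(ggplot2)
library(scales)

# Hypothetical helper: a y-axis scale that shows values stored in millions
# as thousands (for example 0.01 is labeled as 10K). Only the axis labels
# change; the underlying data is left untouched.
scale_y_sales_thousands <- function(...) {
  scale_y_continuous(
    labels = label_number(scale = 1000, suffix = "K", accuracy = 1),
    ...
  )
}
```

Adding `+ scale_y_sales_thousands()` to a plot with a sales column on the y-axis applies the relabeling.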

The meaning of the ‘platform’ column is also not obvious at first: in gaming, it could easily refer to either the console a game is played on or the storefront it is sold on. Though it is not explicitly stated in the documentation, viewing the metadata makes it clear that this refers to the console. Many games are sold across multiple consoles and PC, but this is less universal than PC games being sold across all available storefronts, so tracking by console is more informative, and it is probably simpler to gather data by console than by store as well.

Lastly, the ‘other sales’ column title is somewhat vague, but the documentation shows that it is simply sales in regions other than North America, Europe, and Japan. Presumably the curator of this dataset gathered data for these three regions as well as globally, and created this column by subtracting the major regions’ sales from the global sales. Though it would be difficult to make general inferences about the trends of game sales in so many different regions, it is a useful column to have for the sake of completeness.
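The presumption that ‘other sales’ is the remainder after subtracting the three major regions from the global total can be checked directly. The following is a minimal sketch: the helper name `check_other_sales` is hypothetical, and the tolerance `tol` is an assumed allowance for values rounded to two decimal places.

```r
library(dplyr)

# Hypothetical helper: compares global_sales with the sum of the four
# regional columns and keeps only the rows where they differ by more
# than `tol` (in millions). `tol` is an assumed rounding allowance.
check_other_sales <- function(df, tol = 0.02) {
  df |>
    mutate(
      regional_total = na_sales + eu_sales + jp_sales + other_sales,
      gap = global_sales - regional_total
    ) |>
    filter(abs(gap) > tol) |>
    select(name, platform, global_sales, regional_total, gap)
}
```

Calling `check_other_sales(game_sales)` would return the rows that do not fit the presumed relationship; an empty result would be consistent with it.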

Unclear Data

One major issue with this dataset is that it is unclear when the data was collected and how accurate the release years are. It could be assumed that the data was collected during the last year that appears in the dataset. However, the last year with a large amount of data is 2016, followed by three games in 2017 and one in 2020. This sudden jump seems unusual, and searching for the game supposedly from 2020, ‘Imagine: Makeup Artist’, shows that it was actually released in 2009. This calls the reliability of the entire year column into question. Since the documentation is not clear, it is also possible that the year refers to something other than the year of the game’s earliest release on a specific platform.
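The late rows mentioned above can be pulled out for manual checking with a short filter. This is a sketch only: the helper name `late_releases` is hypothetical, and the default cutoff of 2016 is an assumption based on it being the last year with a large amount of data.

```r
library(dplyr)

# Hypothetical helper: lists the rows whose year falls after an assumed
# cutoff, newest first. Rows with a missing year are dropped by filter().
late_releases <- function(df, cutoff = 2016) {
  df |>
    filter(year > cutoff) |>
    select(name, platform, year, publisher, global_sales) |>
    arrange(desc(year))
}
```

`late_releases(game_sales)` would list the games recorded after 2016 so that their release years can be compared against another source.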

Demonstrative Visualizations

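game_sales |>
  filter(!is.na(year)) |>  # drop the rows with no recorded year
  ggplot() +
  # binwidth = 1 gives one bar per year instead of the default 30 bins
  geom_histogram(mapping = aes(x = year), binwidth = 1, color = 'white') +
  geom_vline(xintercept = 2020, color = 'red') +  # latest year in the data
  labs(x = "Year", y = "Games") +
  theme_minimal()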

Here, the histogram extends beyond the last significant year in the dataset, 2016, all the way to 2020, although the bars for 2017 and 2020 are too small to be visible. The data points in these years should likely be excluded from visualizations. Depending on the exact cutoff for data collection, it may be necessary to exclude data from 2016 for accuracy as well. It is possible that the sales metrics for the games from 2017 were based on pre-purchases and that the actual data collection ended during 2016.
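Because the smallest bars cannot be read off the histogram, a table of counts for the most recent years makes the drop-off explicit. The helper name `recent_year_counts` and its `n_years` parameter are hypothetical and serve only as an illustration.

```r
library(dplyr)

# Hypothetical helper: number of games per year for the `n_years` most
# recent years in the data, newest first. Missing years are excluded.
recent_year_counts <- function(df, n_years = 5) {
  df |>
    filter(!is.na(year)) |>
    count(year, name = "games") |>
    slice_max(year, n = n_years)
}
```

`recent_year_counts(game_sales)` would show how sharply the counts fall after the last well-populated year, which helps when choosing a cutoff for visualizations.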

years <- 2008:2014

genres <- game_sales |>
  group_by(genre) |>
  summarize(count = n()) |>
  arrange(desc(count)) |>
  filter(count > 1000) |>
  pluck("genre")

game_sales |>
  filter(year %in% years, genre %in% genres) |>
  ggplot() +
  geom_bar(mapping = aes(x = year, fill = genre)) +
  theme_minimal() +
  labs(x = "Year", y = "Games") +
  scale_fill_brewer(palette = "Set2")

This visualization is similar to the one above, but it narrows the data to a few major years and to the most common genres, those with more than 1,000 games in the dataset. Though the anomalous years have been excluded, major issues could still come up if the years associated with each game are inaccurate. This matters less at the scale of all video games, but if, for example, some racing games from 2013 were mislabeled as releasing in 2012, the potential interpretations could change dramatically.

Missing Category Investigation

Genre

game_sales |> filter(is.na(genre))

## # A tibble: 0 × 11
## # ℹ 11 variables: rank <dbl>, name <chr>, platform <chr>, year <dbl>,
## #   genre <chr>, publisher <chr>, na_sales <dbl>, eu_sales <dbl>,
## #   jp_sales <dbl>, other_sales <dbl>, global_sales <dbl>

game_sales |> pluck("genre") |> unique()

##  [1] "Sports"       "Platform"     "Racing"       "Role-Playing" "Puzzle"      
##  [6] "Misc"         "Shooter"      "Simulation"   "Action"       "Fighting"    
## [11] "Adventure"    "Strategy"

There are no explicitly missing values in the genre column, and there are no implicitly missing values either, since genre does not exist on a continuum. However, it would be easy to interpret the column as having empty groups. Genre is a very loose concept, and the documentation does not explain why the specific genres in this dataset were chosen. I would not think that there are any empty groups, since there is a catch-all ‘Misc’ genre, but another genre could easily be added that covers some of the miscellaneous games.
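How much of the dataset lands in the catch-all genre is one indication of whether an additional genre would be worthwhile. A minimal sketch, with the hypothetical helper name `genre_shares`:

```r
library(dplyr)

# Hypothetical helper: number of games per genre and each genre's share
# of all rows, largest genre first.
genre_shares <- function(df) {
  df |>
    count(genre, sort = TRUE, name = "games") |>
    mutate(share = games / sum(games))
}
```

In the result of `genre_shares(game_sales)`, a large share for ‘Misc’ would suggest that the existing genres leave many games uncategorized.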

Year

game_sales |> filter(is.na(year))

## # A tibble: 271 × 11
##     rank name          platform  year genre publisher na_sales eu_sales jp_sales
##    <dbl> <chr>         <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
##  1   180 Madden NFL 2… PS2         NA Spor… Electron…     4.26     0.26     0.01
##  2   378 FIFA Soccer … PS2         NA Spor… Electron…     0.59     2.36     0.04
##  3   432 LEGO Batman:… Wii         NA Acti… Warner B…     1.86     1.02     0   
##  4   471 wwe Smackdow… PS2         NA Figh… <NA>          1.57     1.02     0   
##  5   608 Space Invade… 2600        NA Shoo… Atari         2.36     0.14     0   
##  6   625 Rock Band     X360        NA Misc  Electron…     1.93     0.34     0   
##  7   650 Frogger's Ad… GBA         NA Adve… Konami D…     2.15     0.18     0   
##  8   653 LEGO Indiana… Wii         NA Acti… LucasArts     1.54     0.63     0   
##  9   713 Call of Duty… Wii         NA Shoo… Activisi…     1.19     0.84     0   
## 10   784 Rock Band     Wii         NA Misc  MTV Games     1.35     0.56     0   
## # ℹ 261 more rows
## # ℹ 2 more variables: other_sales <dbl>, global_sales <dbl>

game_sales |>
  group_by(platform) |>
  filter(is.na(year)) |>
  summarize(count = n()) |>
  arrange(desc(count))

## # A tibble: 16 × 2
##    platform count
##    <chr>    <int>
##  1 Wii         35
##  2 PS2         34
##  3 DS          30
##  4 X360        30
##  5 PS3         25
##  6 XB          21
##  7 2600        17
##  8 PC          17
##  9 PSP         16
## 10 GC          14
## 11 GBA         11
## 12 3DS          9
## 13 PS           7
## 14 N64          3
## 15 GB           1
## 16 PSV          1

game_sales |> arrange(desc(year)) |> pluck("year") |> unique()

##  [1] 2020 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004
## [16] 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 1990 1989
## [31] 1988 1987 1986 1985 1984 1983 1982 1981 1980   NA

There are 271 rows in the dataset that are explicitly missing year values. As discussed above, 2018 and 2019 appear to be implicitly missing, although the inclusion of 2020 seems to be a mistake. The concept of empty groups does not particularly apply to year. Though the number of missing values is small compared to the total dataset, it may be valuable to investigate where years are most commonly missing, especially given the earlier concerns about year tracking. Looking at platform, it seems that a disproportionate number of Wii games are missing years compared to the more popular DS, PS2, and PS3. A thorough investigation of this column could involve comparing the release years from this dataset to those from another source to judge its reliability.
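Raw counts of missing years do not account for how many games each platform has in total, so a rate per platform is a fairer basis for calling any platform disproportionate. The sketch below uses the hypothetical helper name `missing_year_rate`.

```r
library(dplyr)

# Hypothetical helper: for each platform, the total number of games, the
# number with a missing year, and the missing share, highest share first.
missing_year_rate <- function(df) {
  df |>
    group_by(platform) |>
    summarize(
      games = n(),
      missing = sum(is.na(year)),
      missing_rate = missing / games
    ) |>
    arrange(desc(missing_rate))
}
```

Platforms with very few games can show a high rate from a single missing value, so the `games` column in the result of `missing_year_rate(game_sales)` should be read alongside the rate.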

Continuous Column Outliers

game_sales |> ggplot() +
  geom_point(mapping = aes(x = year, y = global_sales)) +
  labs(title = "Global Sales by Year", x = "Year", y = "Global Sales (Millions)") +
  theme_minimal()

game_sales |> filter(global_sales > 40)

## # A tibble: 2 × 11
##    rank name           platform  year genre publisher na_sales eu_sales jp_sales
##   <dbl> <chr>          <chr>    <dbl> <chr> <chr>        <dbl>    <dbl>    <dbl>
## 1     1 Wii Sports     Wii       2006 Spor… Nintendo      41.5    29.0      3.77
## 2     2 Super Mario B… NES       1985 Plat… Nintendo      29.1     3.58     6.81
## # ℹ 2 more variables: other_sales <dbl>, global_sales <dbl>

In the global sales column, I would define the top-selling game, Wii Sports, as an outlier: it made over twice as much as the second-best-selling game, while the second-best seller has many runners-up close behind it. For most statistics of the whole dataset, it would be best to include all the data points. However, when comparing the success of subcategories, such as genre, it may be beneficial to exclude this outlier in order to give a more reasonable statistic for the success of sports games.
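The effect of this outlier on a genre-level statistic can be measured by computing the mean with and without it. This is a sketch under the assumption that the outlier is identified by the `rank` column (Wii Sports has rank 1 in the output above); the helper name `genre_mean_sensitivity` and its `drop_rank` parameter are hypothetical.

```r
library(dplyr)

# Hypothetical helper: mean global sales per genre, computed once with
# every row and once without the row whose rank equals `drop_rank`.
# `change` is the amount the excluded row adds to the genre mean.
genre_mean_sensitivity <- function(df, drop_rank = 1) {
  df |>
    group_by(genre) |>
    summarize(
      mean_all = mean(global_sales),
      mean_without = mean(global_sales[rank != drop_rank])
    ) |>
    mutate(change = mean_all - mean_without)
}
```

In the result of `genre_mean_sensitivity(game_sales)`, only the Sports row should show a nonzero change, and its size indicates how much a single game shifts the genre average.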