Week 12 Data Dive: Time-Based Data

The goal of this data dive is to demonstrate how to use time-based data for time-series modeling and how to identify the trend and seasonality of a time series.

library(readr)
library(tidyverse)
library(ggplot2)
library(ggrepel)
library(xts)
library(tsibble)

game_sales <- read_csv("video_game_sales.csv")
game_sales_raw <- game_sales

Setting Up Time-Based Data

The time-based data in the game sales dataset is the year of a game’s release. For time-series modeling to work, this must be converted to a date; we use January 1 of each year. First, though, we drop rows with a missing year and check for gaps. As found in a previous data dive, there is no data for 2018 or 2019, but there is data for 2020. This was determined to be anomalous: the game marked as released in 2020 was actually released in an earlier year, so we simply remove that year from the dataset.

game_sales <- game_sales |>
  filter(!is.na(year))

game_sales |>
  arrange(year) |>
  pluck("year") |>
  unique()

##  [1] 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## [16] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [31] 2010 2011 2012 2013 2014 2015 2016 2017 2020

game_sales <- game_sales |>
  filter(year != 2020)
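
The gap check can also be made explicit instead of being read off the printed vector. The following is a minimal sketch in base R; the `years_present` vector is copied from the output above, and the other variable names are illustrative.

```r
# Years present after dropping missing values (copied from the printed output).
years_present <- c(1980:2017, 2020)

# Every year between the first and last observation.
expected_years <- seq(min(years_present), max(years_present))

# Years that fall inside the range but have no rows.
missing_years <- setdiff(expected_years, years_present)
missing_years
```

Any year listed in `missing_years` is a gap; a lone year sitting beyond a gap, like 2020 here, is worth checking by hand before deciding whether to keep it.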

Many games are released in any given year, so we need one summary value per year that can be analyzed over time. Global sales, the number of copies of a game sold worldwide (in millions), is the most important column, so we summarize each year with the total global sales of the games released that year. This dataframe can then be transformed into a tsibble.

game_sales <- game_sales |>
  mutate(date = as.Date(paste(year, "-1-1", sep = "")))

game_sales_summarized <- game_sales |>
  group_by(date) |>
  summarize(total_global_sales = sum(global_sales))

game_sales_ts <- as_tsibble(game_sales_summarized, index = date)
game_sales_xts <- xts(
  x = game_sales_ts$total_global_sales,
  order.by = game_sales_ts$date,
  frequency = 1  # one observation per year
)
game_sales_xts <- setNames(game_sales_xts, "TotalGlobalSales")
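
One thing worth checking at this step is the interval that tsibble infers from the index. With a Date index and one row per year, consecutive rows are 365 or 366 days apart, so tsibble may treat the series as daily data with gaps instead of yearly data. The sketch below uses synthetic totals (not the real sales figures) to compare a Date index with an integer `year` index; the integer index is an assumed alternative, not something the original analysis used.

```r
library(tsibble)

# Synthetic yearly totals, for illustration only.
toy <- data.frame(
  year = 1980:1989,
  total_global_sales = c(11, 36, 29, 17, 50, 54, 37, 22, 47, 73)
)
toy$date <- as.Date(paste0(toy$year, "-01-01"))

# Date index: inspect the inferred interval and whether gaps are reported.
toy_by_date <- as_tsibble(toy, index = date)
tsibble::interval(toy_by_date)
has_gaps(toy_by_date)

# Integer year index: one observation per year, so no gaps are expected.
toy_by_year <- as_tsibble(toy[, c("year", "total_global_sales")], index = year)
tsibble::interval(toy_by_year)
has_gaps(toy_by_year)
```

If the Date version reports gaps, functions that expect a regular series may complain or fill in missing days, so the integer-year form can be the safer choice for yearly totals.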

Plotting Data Over Time

game_sales_ts |>
  ggplot() +
  geom_line(mapping = aes(x = date, y = total_global_sales)) +
  labs(x = "Year", y = "Total Global Sales of Games Released (Millions)")

The most obvious feature of this plot is that total global sales trend upward over time, peak around 2008, and then fall quickly toward the end of the dataset. Early in the data there are also many small peaks and troughs roughly every 3-4 years, which may indicate seasonality.

Detecting Trends

game_sales_ts |>
  filter_index(~ "2008") |>
  ggplot(mapping = aes(x = date, y = total_global_sales, ymin = -100, ymax = 800)) +
  geom_line() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "Year", y = "Total Global Sales of Games Released (Millions)")

## `geom_smooth()` using formula = 'y ~ x'

game_sales_ts |>
  filter_index("2008" ~ .) |>
  ggplot(mapping = aes(x = date, y = total_global_sales, ymin = -100, ymax = 800)) +
  geom_line() +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "Year", y = "Total Global Sales of Games Released (Millions)")

## `geom_smooth()` using formula = 'y ~ x'

Clearly, the data as a whole should not be modeled with a single linear regression, but we can model the rise to the peak (2008) and the fall from it with separate models. The subset before 2008 would probably be better modeled with polynomial regression (or another nonlinear model), since the trend appears curved, but the subset afterwards is fairly well modeled with a straight line. The downward trend from 2008 to 2017 is much steeper than the upward trend from 1980 to 2008. A quick search shows that 2008 was a record high for video game sales, but 2009 fell short of this record due to the recession at that time, which is reflected in the data. Video game sales totals stayed low until 2016, when they rose sharply again. This, however, is not reflected in the data, most likely because data collection for 2016 and 2017 was still in progress when this dataset was released.
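
The comparison of the two trends can be made numeric by fitting the models directly and not only drawing them. The sketch below uses a synthetic series shaped roughly like the plot (a curved rise followed by a sharp fall); the values are made up for illustration and are not the real yearly totals.

```r
# Synthetic series: curved rise to 2008, then a steep linear fall, plus a wiggle.
years <- 1980:2017
sales <- ifelse(years <= 2008,
                10 + 0.8 * (years - 1980)^2,
                10 + 0.8 * 28^2 - 60 * (years - 2008)) + 15 * sin(years)
toy <- data.frame(year = years, sales = sales)

rise <- subset(toy, year <= 2008)
fall <- subset(toy, year >= 2008)

# Separate straight-line fits on each side of the peak.
rise_lm <- lm(sales ~ year, data = rise)
fall_lm <- lm(sales ~ year, data = fall)
coef(rise_lm)["year"]  # average yearly change before the peak
coef(fall_lm)["year"]  # average yearly change after the peak

# Quadratic fit for the rise; compare its R-squared with the linear fit.
rise_quad <- lm(sales ~ poly(year, 2), data = rise)
summary(rise_lm)$r.squared
summary(rise_quad)$r.squared
```

Running the same fits on `game_sales_ts` (with the year extracted from `date`) would give slopes in millions of copies per year, which makes "steeper" a measurable claim.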

Detecting Seasons

game_sales_ts |>
  ggplot(mapping = aes(x = date, y = total_global_sales)) +
  geom_line(linewidth = .5) +
  geom_smooth(linewidth = .5, span = .2, se = FALSE) + 
  scale_x_date(breaks = "4 years", labels = \(x) year(x)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Year", y = "Total Global Sales of Games Released (Millions)")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Since the data is already aggregated to the year level and covers only 38 years, smoothing does not help much with identifying seasons. There appears to be a peak and a trough about every 4 years early in the data, around 1980-1996, but no other seasons are readily apparent. We can use the partial autocorrelation function (PACF) to look for other seasons.

game_sales_xts |>
  pacf(na.action = na.exclude)

Unfortunately, other than the first lag, which mainly reflects each year's similarity to the year before it, there do not appear to be real seasons in this data. Among the remaining lags, the partial ACF is highest at lag 9, and upon closer inspection there are very slight troughs every 9 years in the data. However, the intervening data has very little in common, and this looks more like a coincidence than a true season. Since the time series has one observation per year, any patterns that repeat across the literal seasons within each year are not captured in this data.
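
The partial autocorrelations can also be inspected as numbers and not only as a plot, which makes it easier to see which lags clear the significance bound. This is a minimal sketch on a synthetic yearly series (not the real totals); the bound is the approximate 95% limit that the default `pacf()` plot draws as dashed lines.

```r
# Synthetic yearly series, for illustration only.
set.seed(1)
toy_sales <- ts(cumsum(rnorm(38)), start = 1980, frequency = 1)

# Compute the partial autocorrelations without drawing the plot.
p <- pacf(toy_sales, plot = FALSE)

# Approximate 95% bound for a white-noise series of this length.
bound <- qnorm(0.975) / sqrt(length(toy_sales))

result <- data.frame(
  lag = as.vector(p$lag),
  pacf = as.vector(p$acf),
  significant = abs(as.vector(p$acf)) > bound
)
result
```

With only 38 observations the bound is wide, so an isolated spike at a single higher lag, such as lag 9, is weak evidence on its own.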