data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(readr)
library(dplyr)
library(ggplot2)
library(tsibble)
print(head(data))
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26

Convert Year Column to Date

data$Date <- as.Date(paste(data$Year, "01", "01", sep = "-"))

data <- data %>% filter(!is.na(Date))
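The conversion above relies on `as.Date()` returning `NA` for values it cannot parse, which the filter then drops. A minimal sketch of that behavior, assuming the Year column is stored as text and marks missing years with a placeholder such as "N/A" (an assumption; the source does not show those rows):

```r
# Hypothetical Year values; "N/A" stands in for a missing release year
year_chr <- c("2006", "1985", "N/A", "2008")

# An explicit format makes unparseable strings come back as NA
dates <- as.Date(paste(year_chr, "01", "01", sep = "-"), format = "%Y-%m-%d")
print(dates)

# Keep only the entries whose date parsed successfully
valid_dates <- dates[!is.na(dates)]
print(length(valid_dates))
```

Passing `format` explicitly is a design choice here: it keeps the result predictable even if the first value in the column happens to be unparseable.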

Summarize Data by Year and Calculate Total Global Sales

# Group data by year, summing up Global Sales
annual_sales <- data %>%
  group_by(Date) %>%
  summarize(Global_Sales = sum(Global_Sales, na.rm = TRUE))

# Confirm that no missing dates remain after filtering
print(sum(is.na(annual_sales$Date)))
## [1] 0

Convert Data to tsibble and Plot Global Sales Over Time

# Convert to tsibble, now without NA in Date
sales_tsibble <- as_tsibble(annual_sales, index = Date)

# Proceed with plotting and analysis
ggplot(sales_tsibble, aes(x = Date, y = Global_Sales)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Global Sales (Millions)", title = "Global Video Game Sales Over Time")
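Before treating the yearly totals as a regular time series, it may help to confirm that no calendar years are missing, since a gap would leave the index irregular. The sketch below uses base R on a small made-up vector of years; the `years` values are hypothetical and not taken from vgsales.csv.

```r
# Hypothetical release years with one gap (2003 is absent); not real data
years <- c(2000L, 2001L, 2002L, 2004L, 2005L)

# Every year between the first and last observation
full_range <- seq(min(years), max(years))

# Years that are expected but not observed
missing_years <- setdiff(full_range, years)
print(missing_years)
```

On the real data, `as.integer(format(annual_sales$Date, "%Y"))` would give the vector of years to check.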

Insight:

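The initial plot shows spikes in certain years, suggesting high sales during specific periods, likely driven by major game releases or new console launches. Because the data are aggregated by year, within-year effects such as holiday seasons cannot be seen in this plot.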

Significance:

Identifying these peaks provides insight into how external factors, like new technology or major game releases, influence overall sales trends. This is valuable for understanding and predicting high-revenue years.

Further Questions:

  1. What major events correspond with the spikes in sales?

  2. Do certain platforms or game genres contribute more significantly to sales spikes?

Linear Regression for Trend Detection

# Perform linear regression to detect long-term trends
trend_model <- lm(Global_Sales ~ as.numeric(Date), data = sales_tsibble)
summary(trend_model)
## 
## Call:
## lm(formula = Global_Sales ~ as.numeric(Date), data = sales_tsibble)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -450.13  -74.96  -14.66   86.93  356.95 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -84.840271  77.205260  -1.099 0.278914    
## as.numeric(Date)   0.029310   0.006778   4.325 0.000111 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 175.4 on 37 degrees of freedom
## Multiple R-squared:  0.3357, Adjusted R-squared:  0.3178 
## F-statistic:  18.7 on 1 and 37 DF,  p-value: 0.000111
# Plot Global Sales with the regression line
ggplot(sales_tsibble, aes(x = Date, y = Global_Sales)) +
  geom_line() +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Year", y = "Global Sales (Millions)", title = "Trend in Global Video Game Sales Over Time")

Insight:

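The regression line slopes upward. The fitted coefficient of 0.0293 is a change per day, because as.numeric(Date) counts days, which corresponds to roughly 10.7 million in additional sales per year, and it is statistically significant (p = 0.000111). With an R-squared of 0.34, however, the linear trend explains only about a third of the year-to-year variation in sales.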
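Because `as.numeric(Date)` counts days since 1970-01-01, the slope in the regression summary is a per-day change. The sketch below, run on simulated yearly totals (the data and the names `toy`, `fit_day`, and `fit_year` are illustrative assumptions), suggests that regressing on the calendar year instead gives a per-year slope about 365.25 times larger while describing essentially the same line.

```r
set.seed(1)
# Simulated yearly totals; these are not the vgsales figures
toy <- data.frame(Year = 1980:2018)
toy$Date <- as.Date(paste(toy$Year, "01", "01", sep = "-"))
toy$Sales <- 10 * (toy$Year - 1980) + rnorm(nrow(toy), sd = 50)

fit_day  <- lm(Sales ~ as.numeric(Date), data = toy)  # slope per day
fit_year <- lm(Sales ~ Year, data = toy)              # slope per year

slope_day  <- unname(coef(fit_day)[2])
slope_year <- unname(coef(fit_year)[2])

# The ratio of the two slopes should be close to the number of days in a year
print(c(per_day = slope_day, per_year = slope_year,
        ratio = slope_year / slope_day))
```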

Significance:

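The statistically significant positive trend suggests that recorded video game sales grew over the period covered by the data. This can inform business strategies, marketing budgets, and product release schedules, although the large residual standard error (175.4) means a straight line is only a rough summary of the series.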

Further Questions:

  1. Are certain years responsible for sharp deviations from the trend? If so, what happened in those years?

  2. Does this trend differ for specific regions or platforms?

Applying Smoothing to Detect Underlying Patterns

smoothed_sales <- loess(Global_Sales ~ as.numeric(Date), data = sales_tsibble)
sales_tsibble$Smoothed <- predict(smoothed_sales)

# Plot the original and smoothed data
ggplot(sales_tsibble, aes(x = Date)) +
  geom_line(aes(y = Global_Sales), color = "blue", alpha = 0.5) +
  geom_line(aes(y = Smoothed), color = "red") +
  labs(x = "Year", y = "Global Sales (Millions)", title = "Original vs Smoothed Global Sales")

Insight:

The smoothed line reveals the underlying pattern in the data, reducing the impact of year-to-year fluctuations.
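How much fluctuation is removed depends on the `span` argument of `loess()`, which the code above leaves at its default of 0.75. A rough sketch on simulated data (the series and the two span values are illustrative assumptions) compares a narrow and a wide smoothing window:

```r
set.seed(42)
# Simulated yearly series with a single broad peak; not the vgsales figures
year  <- 1980:2018
sales <- 300 * exp(-((year - 2008) / 8)^2) + rnorm(length(year), sd = 30)

fit_narrow <- loess(sales ~ year, span = 0.3)  # follows local wiggles closely
fit_wide   <- loess(sales ~ year, span = 0.9)  # smoother, may flatten the peak

# Residual sum of squares: a smaller value means the curve hugs the data more
rss_narrow <- sum(residuals(fit_narrow)^2)
rss_wide   <- sum(residuals(fit_wide)^2)
print(c(narrow = rss_narrow, wide = rss_wide))
```

A smaller span tracks individual years more closely, while a larger one emphasizes the long-run shape; which is preferable depends on whether the peaks themselves are the object of interest.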

Significance:

Smoothing makes it easier to identify the general direction of the trend, helping stakeholders see if the high points are anomalies or part of an ongoing trend. This insight can be crucial for strategic planning, particularly around product launches and budget allocations.

Further Questions:

  1. Do sales increases coincide with new console generations or popular game franchises?

  2. Are there intervals in which sales consistently increase, such as every 5 years?

Autocorrelation Analysis Using ACF and PACF

library(forecast)
# Convert Global_Sales to a time series object for analysis
ts_data <- ts(sales_tsibble$Global_Sales, frequency = 1)  # Annual frequency

# Plot ACF and PACF
Acf(ts_data, main = "Autocorrelation Function (ACF) of Global Sales")

Pacf(ts_data, main = "Partial Autocorrelation Function (PACF) of Global Sales")

Interpretation of the ACF:

  • Initial Lag: A positive autocorrelation at the first lag might indicate that sales in a given year are influenced by the previous year’s sales. This suggests that a strong sales year might set the stage for another strong sales year.

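  • Declining Correlation: A gradual decrease in the correlation suggests that the influence of past sales diminishes over time and that there is no strong recurring multi-year cycle. Because the series is annual, within-year seasonality cannot be assessed here, and some of the slow decay may reflect the upward trend rather than genuine year-to-year dependence.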

  • Long-Term Independence: The autocorrelation levels off at zero, implying that sales in one year become less dependent on those of previous years, except in the short term.

Interpretation of the PACF:

  • First Lag: A significant partial autocorrelation at the first lag suggests that the previous year’s sales have a notable impact on current sales. This may indicate market momentum or consumer loyalty to popular franchises or platforms.

  • Subsequent Lags: Lack of significant partial autocorrelation at higher lags suggests that past years beyond the previous one don’t strongly impact the current year, pointing to an AR(1) process where future sales are primarily influenced by the immediately preceding year’s sales.
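Whether a lag counts as "significant" in these plots is usually judged against the approximate 95% bounds of ±1.96/√n, which `Acf()` and `Pacf()` draw as dashed lines. The sketch below uses base R's `acf()` and `pacf()` on a simulated series of 39 points (matching the 39 yearly observations implied by the regression output; the series itself is made up):

```r
set.seed(7)
n <- 39
# Simulated AR(1)-like series; not the vgsales figures
x <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))

acf_vals  <- acf(x, lag.max = 10, plot = FALSE)$acf[-1]  # drop lag 0
pacf_vals <- pacf(x, lag.max = 10, plot = FALSE)$acf

# Approximate 95% bound for a white-noise series of length n
bound <- qnorm(0.975) / sqrt(n)
print(bound)

# Lags whose sample (partial) autocorrelation falls outside the bound
print(which(abs(acf_vals) > bound))
print(which(abs(pacf_vals) > bound))
```

With only 39 observations the bound is wide (about 0.31), so moderate correlations at higher lags can easily fail to stand out.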

Implications for Modeling:

The ACF and PACF results suggest that an autoregressive model of order 1 (AR(1)) could be appropriate, where each year’s sales depend on the sales of the preceding year.
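As a rough sketch of what fitting such a model could look like, the block below uses base R's `arima()` with order (1, 0, 0) on a simulated series. The data, the true coefficient of 0.7, and the object names are illustrative assumptions; the same call could be tried on `ts_data`, possibly after removing the trend.

```r
set.seed(123)
# Simulated AR(1) series with coefficient 0.7; not the vgsales figures
sim <- arima.sim(model = list(ar = 0.7), n = 200)

# AR(1): one autoregressive term, no differencing, no moving-average term
fit <- arima(sim, order = c(1, 0, 0))
phi <- unname(coef(fit)["ar1"])
print(phi)

# One-step-ahead forecast from the fitted model
fc <- predict(fit, n.ahead = 1)
print(fc$pred)
```

The estimated `ar1` coefficient measures how strongly one year's value carries over into the next; values near zero would argue against the AR(1) reading.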