data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
## method from
## as_tibble.grouped_df dplyr
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
print(head(data))
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
data$Date <- as.Date(paste(data$Year, "01", "01", sep = "-"))
data <- data %>% filter(!is.na(Date))
# Group data by year, summing up Global Sales
annual_sales <- data %>%
group_by(Date) %>%
summarize(Global_Sales = sum(Global_Sales, na.rm = TRUE))
# Confirm that no missing dates remain after filtering
print(sum(is.na(annual_sales$Date)))
## [1] 0
# Convert to tsibble, now without NA in Date
sales_tsibble <- as_tsibble(annual_sales, index = Date)
# Proceed with plotting and analysis
ggplot(sales_tsibble, aes(x = Date, y = Global_Sales)) +
geom_line() +
geom_point() +
labs(x = "Year", y = "Global Sales (Millions)", title = "Global Video Game Sales Over Time")
Insight:
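The initial plot shows spikes in certain years, suggesting high sales during specific periods, likely due to major game releases or new console launches. Because the data are aggregated by year, within-year effects such as holiday seasons cannot be distinguished in this plot.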
Significance:
Identifying these peaks provides insight into how external factors, like new technology or major game releases, influence overall sales trends. This is valuable for understanding and predicting high-revenue years.
Further Questions:
What major events correspond with the spikes in sales?
Do certain platforms or game genres contribute more significantly to sales spikes?
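One way to start on the platform question is to find the peak year and break its sales down by platform. The following is a minimal sketch in base R on a small, made-up data frame (`toy` and its values are hypothetical, chosen only to mirror the `Year`, `Platform`, and `Global_Sales` columns); it is not a result from the vgsales data.

```r
# Hypothetical toy data in the same shape as the vgsales columns used above
toy <- data.frame(
  Year         = c(2005, 2005, 2006, 2006, 2006, 2007, 2007),
  Platform     = c("PS2", "DS", "Wii", "DS", "PS2", "Wii", "X360"),
  Global_Sales = c(10, 8, 40, 15, 5, 20, 12)
)

# Total sales per year, then the year with the highest total
yearly <- aggregate(Global_Sales ~ Year, data = toy, FUN = sum)
peak_year <- yearly$Year[which.max(yearly$Global_Sales)]

# Each platform's share of sales within the peak year, largest first
peak_rows <- toy[toy$Year == peak_year, ]
by_platform <- aggregate(Global_Sales ~ Platform, data = peak_rows, FUN = sum)
by_platform$Share <- by_platform$Global_Sales / sum(by_platform$Global_Sales)
by_platform <- by_platform[order(-by_platform$Share), ]
print(by_platform)
```

The same steps should carry over to the real data by replacing `toy` with `data`, and `Platform` with `Genre` for the genre question.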
# Perform linear regression to detect long-term trends
# (as.numeric(Date) counts days since 1970-01-01, so the slope is per day)
trend_model <- lm(Global_Sales ~ as.numeric(Date), data = sales_tsibble)
summary(trend_model)
##
## Call:
## lm(formula = Global_Sales ~ as.numeric(Date), data = sales_tsibble)
##
## Residuals:
## Min 1Q Median 3Q Max
## -450.13 -74.96 -14.66 86.93 356.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -84.840271 77.205260 -1.099 0.278914
## as.numeric(Date) 0.029310 0.006778 4.325 0.000111 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 175.4 on 37 degrees of freedom
## Multiple R-squared: 0.3357, Adjusted R-squared: 0.3178
## F-statistic: 18.7 on 1 and 37 DF, p-value: 0.000111
# Plot Global Sales with the regression line
ggplot(sales_tsibble, aes(x = Date, y = Global_Sales)) +
geom_line() +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Year", y = "Global Sales (Millions)", title = "Trend in Global Video Game Sales Over Time")
## `geom_smooth()` using formula = 'y ~ x'
Insight:
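The fitted slope is positive and statistically significant (p of about 0.0001), indicating an overall upward trend in global sales over the period. Because as.numeric(Date) counts days, the estimate of 0.0293 corresponds to roughly 10.7 million per year. The R-squared of about 0.34 shows that a straight line explains only about a third of the year-to-year variation.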
Significance:
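A statistically significant positive trend suggests that recorded video game sales grew on average over the period, which can inform business strategies, marketing budgets, and product release schedules. The large residuals (from about -450 to +357) show that sales departed widely from the line in some years, so the linear trend is better treated as a rough summary than as a forecast.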
Further Questions:
Are certain years responsible for sharp deviations from the trend? If so, what happened in those years?
Does this trend differ for specific regions or platforms?
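Two small calculations help with these questions. The sketch below first converts the per-day slope from the regression summary into a per-year figure, then ranks years by their residual from a fitted trend line. The second part runs on a hypothetical toy series (`toy`, with one deliberately inflated year), not on the vgsales data.

```r
# Slope reported in the regression summary, in sales units per day
slope_per_day <- 0.029310
slope_per_year <- slope_per_day * 365.25  # average number of days per year
print(round(slope_per_year, 2))           # 10.71

# Hypothetical toy series: a linear trend plus one unusually strong year (2005)
toy <- data.frame(
  Year  = 2000:2009,
  Sales = c(100, 110, 120, 130, 140, 250, 160, 170, 180, 190)
)
fit <- lm(Sales ~ Year, data = toy)

# Rank years by the absolute size of their residual from the trend line
toy$Residual <- resid(fit)
outliers <- toy[order(-abs(toy$Residual)), ]
print(head(outliers, 3))
```

Applied to the real series, the same ranking on `resid(trend_model)` would list the years that deviate most from the trend.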
smoothed_sales <- loess(Global_Sales ~ as.numeric(Date), data = sales_tsibble)
sales_tsibble$Smoothed <- predict(smoothed_sales)
# Plot the original and smoothed data
ggplot(sales_tsibble, aes(x = Date)) +
geom_line(aes(y = Global_Sales), color = "blue", alpha = 0.5) +
geom_line(aes(y = Smoothed), color = "red") +
labs(x = "Year", y = "Global Sales (Millions)", title = "Original vs Smoothed Global Sales")
Insight:
The smoothed line reveals the underlying pattern in the data, reducing the impact of year-to-year fluctuations.
Significance:
Smoothing makes it easier to identify the general direction of the trend, helping stakeholders see if the high points are anomalies or part of an ongoing trend. This insight can be crucial for strategic planning, particularly around product launches and budget allocations.
Further Questions:
Do sales increases coincide with new console generations or popular game franchises?
Are there intervals in which sales consistently increase, such as every 5 years?
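How much the smoothed line suppresses year-to-year fluctuations depends on the `span` argument of `loess()`, which the code above leaves at its default of 0.75. The following sketch compares two spans on a simulated toy series; the data, the seed, and the span values 0.3 and 0.9 are illustrative assumptions, not settings from the analysis.

```r
set.seed(1)  # reproducible toy data

# Hypothetical toy series: a smooth rise and fall plus random noise
toy <- data.frame(t = 1:40)
toy$y <- 100 * sin(toy$t / 13) + rnorm(40, sd = 15)

# A smaller span uses fewer neighbouring points per local fit and follows
# the data more closely; a larger span gives a smoother curve
fit_tight  <- loess(y ~ t, data = toy, span = 0.3)
fit_smooth <- loess(y ~ t, data = toy, span = 0.9)
toy$tight  <- predict(fit_tight)
toy$smooth <- predict(fit_smooth)

# Residual sum of squares for each curve, as a rough measure of closeness
rss <- c(tight  = sum((toy$y - toy$tight)^2),
         smooth = sum((toy$y - toy$smooth)^2))
print(rss)
```

Plotting both columns against the raw series is a quick way to judge whether an apparent peak survives heavier smoothing.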
library(forecast)
## Warning: package 'forecast' was built under R version 4.4.2
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
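# Convert Global_Sales to a time series object for analysis
# Note: ts() assumes consecutive, evenly spaced years; a missing year would shift the lags
start_year <- as.numeric(format(min(sales_tsibble$Date), "%Y"))
ts_data <- ts(sales_tsibble$Global_Sales, start = start_year, frequency = 1) # Annual frequency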
# Plot ACF and PACF
Acf(ts_data, main = "Autocorrelation Function (ACF) of Global Sales")
Pacf(ts_data, main = "Partial Autocorrelation Function (PACF) of Global Sales")
ACF:
Initial Lag: A positive autocorrelation at the first lag indicates that sales in a given year are correlated with the previous year’s sales, so a strong sales year tends to be followed by another strong year.
Declining Correlation: A gradual decrease in the correlation suggests that the influence of past sales diminishes as the lag grows. This pattern is typical of an autoregressive process, although a trend in the series can also produce a slowly decaying ACF.
Long-Term Independence: The autocorrelation levels off near zero at longer lags, implying that sales in one year depend little on sales from many years earlier.
PACF:
First Lag: A significant partial autocorrelation at the first lag suggests that the previous year’s sales have a notable impact on current sales. This may indicate market momentum or consumer loyalty to popular franchises or platforms.
Subsequent Lags: A lack of significant partial autocorrelation at higher lags suggests that years before the previous one add little once the previous year is accounted for, pointing to an AR(1) process in which sales are primarily influenced by the immediately preceding year’s sales.
Taken together, the ACF and PACF results suggest that an autoregressive model of order 1 (AR(1)) could be appropriate, where each year’s sales depend on the sales of the preceding year.
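As an illustration of what fitting such a model involves, the sketch below fits an AR(1) with base R's `arima()` and checks the residuals with a Ljung-Box test. It runs on a simulated series with a known coefficient of 0.6 (an assumption made for the example), so the numbers say nothing about the video game data.

```r
set.seed(42)  # reproducible simulated data

# Simulated stand-in series with a known AR(1) coefficient of 0.6
sim <- arima.sim(model = list(ar = 0.6), n = 200)

# Fit an AR(1): order = c(p, d, q) with p = 1, no differencing, no MA terms
fit <- arima(sim, order = c(1, 0, 0))
print(coef(fit))

# If the model is adequate, the residuals should show no remaining
# autocorrelation; fitdf = 1 accounts for the one estimated AR coefficient
lb <- Box.test(resid(fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb$p.value)
```

For the sales series, the same call would be `arima(ts_data, order = c(1, 0, 0))`. Given the upward trend found earlier, it may be worth comparing this against a model fitted to the differenced series.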