vg_sales <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(vg_sales)
##   Rank                     Name Platform Year        Genre Publisher NA_Sales
## 1    1               Wii Sports      Wii 2006       Sports  Nintendo    41.49
## 2    2        Super Mario Bros.      NES 1985     Platform  Nintendo    29.08
## 3    3           Mario Kart Wii      Wii 2008       Racing  Nintendo    15.85
## 4    4        Wii Sports Resort      Wii 2009       Sports  Nintendo    15.75
## 5    5 Pokemon Red/Pokemon Blue       GB 1996 Role-Playing  Nintendo    11.27
## 6    6                   Tetris       GB 1989       Puzzle  Nintendo    23.20
##   EU_Sales JP_Sales Other_Sales Global_Sales
## 1    29.02     3.77        8.46        82.74
## 2     3.58     6.81        0.77        40.24
## 3    12.88     3.79        3.31        35.82
## 4    11.01     3.28        2.96        33.00
## 5     8.89    10.22        1.00        31.37
## 6     2.26     4.22        0.58        30.26
summary(vg_sales)
##       Rank           Name             Platform             Year          
##  Min.   :    1   Length:16598       Length:16598       Length:16598      
##  1st Qu.: 4151   Class :character   Class :character   Class :character  
##  Median : 8300   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 8301                                                           
##  3rd Qu.:12450                                                           
##  Max.   :16600                                                           
##     Genre            Publisher            NA_Sales          EU_Sales      
##  Length:16598       Length:16598       Min.   : 0.0000   Min.   : 0.0000  
##  Class :character   Class :character   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Mode  :character   Mode  :character   Median : 0.0800   Median : 0.0200  
##                                        Mean   : 0.2647   Mean   : 0.1467  
##                                        3rd Qu.: 0.2400   3rd Qu.: 0.1100  
##                                        Max.   :41.4900   Max.   :29.0200  
##     JP_Sales         Other_Sales        Global_Sales    
##  Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0100  
##  1st Qu.: 0.00000   1st Qu.: 0.00000   1st Qu.: 0.0600  
##  Median : 0.00000   Median : 0.01000   Median : 0.1700  
##  Mean   : 0.07778   Mean   : 0.04806   Mean   : 0.5374  
##  3rd Qu.: 0.04000   3rd Qu.: 0.04000   3rd Qu.: 0.4700  
##  Max.   :10.22000   Max.   :10.57000   Max.   :82.7400

Response variable

I use Global_Sales as the response variable. It is a continuous variable that records the total worldwide sales of each game (in millions of units), which is a key measure of commercial success in the gaming industry. It helps developers and publishers understand the market reach of their games.

Categorical Column for Explanatory Variable

I’m choosing Platform as the categorical variable. Different platforms (e.g., PlayStation, Xbox, or Nintendo) may have varying levels of popularity and market reach, which could influence the sales performance of video games. Factors such as platform exclusivity and user base size can also impact a game’s sales.

NULL HYPOTHESIS (H0)

Mean global sales are the same across all platform groups. The alternative hypothesis is that at least one platform group has a different mean.

ANOVA TEST

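# Keep the four platforms of interest and pool every other platform into "Other".
# as.character() keeps both case_when() branches the same type; the result is
# converted to a factor once the recoding is done.
vg_sales <- vg_sales %>%
  mutate(Platform = factor(case_when(
    Platform %in% c("PS4", "X360", "PS3", "Wii") ~ as.character(Platform),
    TRUE ~ "Other"
  )))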
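
As a quick check of the consolidation logic, the sketch below applies the same rule to a small vector in base R. The platform labels in `platform_raw` are hypothetical values chosen for illustration, not rows from the dataset.

```r
# Hypothetical platform labels, used only to illustrate the recoding
platform_raw <- c("Wii", "NES", "PS3", "GB", "X360", "PS4", "DS")
keep <- c("PS4", "X360", "PS3", "Wii")

# Anything outside `keep` is pooled into "Other"
platform_grouped <- factor(ifelse(platform_raw %in% keep, platform_raw, "Other"))

table(platform_grouped)     # counts per consolidated group
nlevels(platform_grouped)   # four named platforms plus "Other"
```

Counting the levels this way is useful because the Platform degrees of freedom in the ANOVA table equal the number of levels minus one.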

Performing the ANOVA test

anova_result <- aov(Global_Sales ~ Platform, data = vg_sales)
summary(anova_result)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## Platform        4    238   59.56   24.77 <2e-16 ***
## Residuals   16593  39895    2.40                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation of Results:

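  • Degrees of Freedom (Df) - Platform has 4 degrees of freedom because the consolidated factor has 5 levels (PS3, PS4, Wii, X360, and Other), and the degrees of freedom equal the number of levels minus one. The residual Df of 16593 is the 16598 observations minus the 5 group means.

  • Sum of Squares (Sum Sq) - The between-group sum of squares (Platform) is 238, and the within-group sum of squares (Residuals) is 39895.

  • Mean Square (Mean Sq) - Each sum of squares divided by its degrees of freedom: 59.56 for Platform and 2.40 for the residuals.

  • F value - The F-value is 24.77, the ratio of the two mean squares. Under the null hypothesis this ratio should be close to 1, so a value near 25 indicates that the group means differ by more than within-group variation can explain.

  • p-value - The p-value is reported as < 2e-16, far below 0.05, so we reject the null hypothesis.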
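
The quantities in the ANOVA table can be reproduced by hand from the sums of squares and degrees of freedom. The sketch below uses the rounded values printed by `summary(anova_result)`, so the results agree with the table only up to rounding; the variable names are my own.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_between <- 238;   df_between <- 4
ss_within  <- 39895; df_within  <- 16593

# Mean square = sum of squares / degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic = ratio of the mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Share of the total sum of squares attributable to Platform (eta squared)
eta_sq <- ss_between / (ss_between + ss_within)

c(ms_between = ms_between, ms_within = ms_within,
  f_value = f_value, p_value = p_value, eta_sq = eta_sq)
```

The last quantity, eta squared, is not part of the `aov` summary. It is included here as a simple effect-size measure to read alongside the p-value.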

Conclusion from ANOVA Test

The p-value is far below the 0.05 significance threshold, so we reject the null hypothesis: mean global sales differ across at least some of the platform groups. The test does not say which groups differ or by how much, and the effect is small in relative terms, since Platform accounts for only about 0.6% of the total sum of squares (238 of 40133). For game developers and publishers, this suggests that platform is associated with a game's sales, although on its own it explains little of the game-to-game variation.
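
A significant ANOVA does not identify which groups differ. One common follow-up is Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below runs it on simulated data, because the real dataset is not bundled here: the `toy` data frame, its group shifts, and the seed are all assumptions made for illustration, and its output says nothing about the actual platforms.

```r
set.seed(1)

# Simulated stand-in for the consolidated data: 5 groups, 30 games each,
# right-skewed sales with an arbitrary shift per group (illustrative only)
groups <- c("Other", "PS3", "PS4", "Wii", "X360")
toy <- data.frame(
  Platform     = factor(rep(groups, each = 30)),
  Global_Sales = rexp(150, rate = 2) + rep(c(0, 0.2, 0.3, 0.4, 0.3), each = 30)
)

toy_fit <- aov(Global_Sales ~ Platform, data = toy)

# Pairwise differences in group means with family-wise adjusted p-values
tukey <- TukeyHSD(toy_fit, "Platform")
tukey$Platform
```

On the real data the equivalent call would be `TukeyHSD(anova_result, "Platform")`, which returns one row per pair of platform groups.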

Visualization

library(ggplot2)

ggplot(vg_sales, aes(x = Platform, y = Global_Sales, fill = Platform)) +
  geom_boxplot() +
  labs(x = "Platform", y = "Global Sales (millions)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Continuous Column for Explanatory Variable

I selected Year as a continuous explanatory variable. This column represents the release year of each game and might influence global sales. Technological advancements, consumer preferences, and market conditions could affect how games perform in different years.

# Year is read as character because some entries are not numeric.
# Convert once (non-numeric entries become NA), then drop those rows.
vg_sales$Year <- suppressWarnings(as.numeric(vg_sales$Year))
vg_sales <- vg_sales[!is.na(vg_sales$Year), ]
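
`as.numeric()` returns `NA`, with a coercion warning, for any string that is not a number, which is why rows with a non-numeric Year have to be dropped before the correlation and regression. The small example below shows the behaviour; the values in `year_raw`, including the `"N/A"` placeholder, are hypothetical.

```r
# Hypothetical Year values as they might be read from a CSV
year_raw <- c("2006", "1985", "N/A", "2008")

# Non-numeric strings become NA; the warning is silenced because it is expected
year_num <- suppressWarnings(as.numeric(year_raw))

is.na(year_num)        # flags the entries that could not be converted
sum(is.na(year_num))   # how many rows would be dropped
```

Counting the `NA` values before filtering is a cheap way to see how many games are excluded from the later steps.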

correlation <- cor(vg_sales$Year, vg_sales$Global_Sales, use = "complete.obs")

# Print the correlation coefficient
correlation
## [1] -0.0747348

The correlation coefficient between Year and Global_Sales is -0.0747, indicating a very weak negative linear correlation. Newer games tend to have slightly lower global sales, but release year alone tells us very little about how a game sells.

Regression Model

lm_model <- lm(Global_Sales ~ Year, data = vg_sales)

# Summary of the regression model
summary(lm_model)
## 
## Call:
## lm(formula = Global_Sales ~ Year, data = vg_sales)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.950 -0.458 -0.338 -0.058 82.192 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.818104   4.206323   9.704   <2e-16 ***
## Year        -0.020075   0.002096  -9.576   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.561 on 16325 degrees of freedom
## Multiple R-squared:  0.005585,   Adjusted R-squared:  0.005524 
## F-statistic: 91.69 on 1 and 16325 DF,  p-value: < 2.2e-16

Conclusion

The regression model indicates a weak but statistically significant negative relationship between Year and Global_Sales. Each additional release year is associated with a decrease of about 0.020 million units (roughly 20,000 units) in expected global sales. However, the R-squared of 0.0056 means that release year explains only about 0.6% of the variation in global sales; the result reaches significance largely because the sample is big (16,327 games). Factors other than release year, such as platform, genre, and marketing strategies, may therefore matter more for a game's commercial success, although this model does not test them.
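
The regression summary and the correlation coefficient are two views of the same fit. The sketch below checks this arithmetic using only the values printed in the output above; the variable names are my own, and small differences from the printed figures come from rounding.

```r
r     <- -0.0747348   # correlation between Year and Global_Sales
slope <- -0.020075    # Year coefficient, in millions of units per year
n     <- 16325 + 2    # residual degrees of freedom plus two estimated coefficients

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2

# Slope converted from millions of units to units per year
units_per_year <- slope * 1e6

# t statistic for the slope, recovered from the correlation alone
t_from_r <- r * sqrt(n - 2) / sqrt(1 - r^2)

c(r_squared = r_squared, units_per_year = units_per_year, t_from_r = t_from_r)
```

The recovered values line up with the reported Multiple R-squared (0.005585) and the t value for Year (-9.576), which shows how a very small correlation can still be highly significant when n is large.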

library(ggplot2)

ggplot(vg_sales, aes(x = Year, y = Global_Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Year", y = "Global Sales (millions)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This visualization will help us observe the weak downward trend in global sales as the release year increases, as indicated by the regression model.

Insights

Further Investigation

Given the weak relationship between Year and Global_Sales, future analysis could focus on the following:

By focusing on these additional factors, we can gain a deeper understanding of what drives the success of video games in different markets.