vg_sales <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(vg_sales)
## Rank Name Platform Year Genre Publisher NA_Sales
## 1 1 Wii Sports Wii 2006 Sports Nintendo 41.49
## 2 2 Super Mario Bros. NES 1985 Platform Nintendo 29.08
## 3 3 Mario Kart Wii Wii 2008 Racing Nintendo 15.85
## 4 4 Wii Sports Resort Wii 2009 Sports Nintendo 15.75
## 5 5 Pokemon Red/Pokemon Blue GB 1996 Role-Playing Nintendo 11.27
## 6 6 Tetris GB 1989 Puzzle Nintendo 23.20
## EU_Sales JP_Sales Other_Sales Global_Sales
## 1 29.02 3.77 8.46 82.74
## 2 3.58 6.81 0.77 40.24
## 3 12.88 3.79 3.31 35.82
## 4 11.01 3.28 2.96 33.00
## 5 8.89 10.22 1.00 31.37
## 6 2.26 4.22 0.58 30.26
summary(vg_sales)
## Rank Name Platform Year
## Min. : 1 Length:16598 Length:16598 Length:16598
## 1st Qu.: 4151 Class :character Class :character Class :character
## Median : 8300 Mode :character Mode :character Mode :character
## Mean : 8301
## 3rd Qu.:12450
## Max. :16600
## Genre Publisher NA_Sales EU_Sales
## Length:16598 Length:16598 Min. : 0.0000 Min. : 0.0000
## Class :character Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000
## Mode :character Mode :character Median : 0.0800 Median : 0.0200
## Mean : 0.2647 Mean : 0.1467
## 3rd Qu.: 0.2400 3rd Qu.: 0.1100
## Max. :41.4900 Max. :29.0200
## JP_Sales Other_Sales Global_Sales
## Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
## 1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
## Median : 0.00000 Median : 0.01000 Median : 0.1700
## Mean : 0.07778 Mean : 0.04806 Mean : 0.5374
## 3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4700
## Max. :10.22000 Max. :10.57000 Max. :82.7400
Out of all the variables in the dataset, Global_Sales is probably one of the most important continuous variables. This variable represents the total global sales of a video game, which is a key measure of success in the gaming industry. It helps developers and publishers understand the market reach and profitability of their games.
I’m choosing Platform as the categorical variable. Different platforms (e.g., PlayStation, Xbox, or Nintendo) may have varying levels of popularity and market reach, which could influence the sales performance of video games. Factors such as platform exclusivity and user base size can also impact a game’s sales.
Null hypothesis (H0): Mean global sales are the same across platforms. Alternative hypothesis (H1): At least one platform's mean global sales differ from the others.
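# Keep the four selected platforms and pool every other platform into "Other".
# Platform is recoded while it is still a character column, so case_when()
# never has to mix factor and character values; it is made a factor afterwards.
vg_sales <- vg_sales %>%
  mutate(Platform = case_when(
    Platform %in% c("PS4", "X360", "PS3", "Wii") ~ Platform,
    TRUE ~ "Other"
  ))
vg_sales$Platform <- as.factor(vg_sales$Platform)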
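As a quick check of what this recoding produces, here is a minimal base-R sketch on a toy vector. The platform values and the names `platform`, `keep`, and `pooled` are made up for illustration and are not taken from the dataset. It shows that pooling leaves five groups: the four named platforms plus "Other".

```r
# Toy platform labels (hypothetical, for illustration only)
platform <- c("Wii", "NES", "PS4", "GB", "X360", "PS3", "DS")

# Platforms kept as their own group; everything else is pooled
keep <- c("PS4", "X360", "PS3", "Wii")
pooled <- factor(ifelse(platform %in% keep, platform, "Other"))

levels(pooled)   # the four kept platforms plus "Other"
nlevels(pooled)  # number of groups entering the ANOVA
```

The number of groups matters later: a one-way ANOVA on k groups has k - 1 degrees of freedom for the grouping factor.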
Performing the ANOVA test
anova_result <- aov(Global_Sales ~ Platform, data = vg_sales)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Platform 4 238 59.56 24.77 <2e-16 ***
## Residuals 16593 39895 2.40
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Degrees of Freedom (Df) - The Platform term has 4 degrees of freedom, one less than the 5 groups being compared (PS4, X360, PS3, Wii, and Other). The residuals have 16593 degrees of freedom (16598 games minus 5 groups).
Sum of Squares (Sum Sq) - The between-platform sum of squares is 238, and the residual (within-platform) sum of squares is 39895.
Mean Square (Mean Sq) - The mean square for the model is 59.56, and for the residuals is 2.40.
F value - The F-value is 24.77, the ratio of the Platform mean square to the residual mean square. A value this far above 1 indicates that the platform means differ by more than within-platform variation alone would produce.
p-value - The p-value is below 2e-16, far less than 0.05, so we reject the null hypothesis.
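The entries of the ANOVA table are linked by simple arithmetic, which can be sketched in R using the rounded values printed above. Because the printed sums of squares are rounded, the recomputed mean square and F value differ slightly from the table. The variable names are illustrative.

```r
# Values copied from the printed ANOVA table (rounded)
ss_between <- 238;   df_between <- 4
ss_within  <- 39895; df_within  <- 16593

# Degrees of freedom imply the number of groups and observations
n_groups <- df_between + 1               # k - 1 = 4, so k = 5
n_obs    <- df_between + df_within + 1   # total rows used by aov()

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of the two mean squares (about 24.7 with rounded inputs)
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
```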
Because the p-value is far below the 0.05 significance threshold, we reject the null hypothesis: mean global sales differ significantly across the platform groups. The effect is small, however. The Platform sum of squares (238) is under 1% of the total (238 + 39895), so platform grouping alone explains little of the variation in sales. For game developers and publishers, this suggests that platform is associated with commercial performance but is far from the only factor.
library(ggplot2)
ggplot(vg_sales, aes(x = Platform, y = Global_Sales, fill = Platform)) +
geom_boxplot() +
labs(x = "Platform", y = "Global Sales (millions)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
I selected Year as a continuous explanatory variable. This column represents the release year of each game and might influence global sales. Technological advancements, consumer preferences, and market conditions could affect how games perform in different years.
vg_sales <- vg_sales[!is.na(as.numeric(vg_sales$Year)), ]
## Warning in `[.data.frame`(vg_sales, !is.na(as.numeric(vg_sales$Year)), ): NAs
## introduced by coercion
vg_sales$Year <- as.numeric(vg_sales$Year)
correlation <- cor(vg_sales$Year, vg_sales$Global_Sales, use = "complete.obs")
# Print the correlation coefficient
correlation
## [1] -0.0747348
The correlation coefficient between Year and Global_Sales is -0.0747, indicating a very weak negative correlation. This suggests that, contrary to initial expectations, newer games tend to have slightly lower global sales, but the relationship is very weak.
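To see how weak this correlation is, and why it can still be statistically significant, the sketch below squares the coefficient and converts it to a t statistic. The sample size is an assumption derived from the regression output later in this document (16325 residual degrees of freedom plus 2 estimated coefficients); the variable names are illustrative.

```r
# Correlation printed above
r <- -0.0747348

# Sample size inferred from the regression summary: residual df + 2
n <- 16325 + 2

# Share of variance in Global_Sales linearly associated with Year
r_squared <- r^2

# t statistic for testing H0: correlation = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
```

With more than 16,000 games, even a correlation this close to zero gives a large t statistic, so statistical significance here says little about practical importance.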
lm_model <- lm(Global_Sales ~ Year, data = vg_sales)
# Summary of the regression model
summary(lm_model)
##
## Call:
## lm(formula = Global_Sales ~ Year, data = vg_sales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.950 -0.458 -0.338 -0.058 82.192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.818104 4.206323 9.704 <2e-16 ***
## Year -0.020075 0.002096 -9.576 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.561 on 16325 degrees of freedom
## Multiple R-squared: 0.005585, Adjusted R-squared: 0.005524
## F-statistic: 91.69 on 1 and 16325 DF, p-value: < 2.2e-16
Intercept: The intercept (40.82) is the predicted global sales when Year is zero. Because a release year of zero lies far outside the range of the data, the intercept has no practical interpretation; it only anchors the regression line.
Year Coefficient: The coefficient for Year is -0.0201, which means that for each one-year increase in release year, expected global sales decrease by approximately 0.02 million units (about 20,000 units). This small but statistically significant negative coefficient suggests a weak downward trend in global sales as the release year increases.
Significance: The p-value for the Year variable is extremely small (< 2.2e-16), indicating that the negative relationship between Year and Global_Sales is statistically significant. However, the R-squared value is only 0.0056, which means that the Year variable explains less than 1% of the variability in global sales. This suggests that Year is not a strong predictor of global sales, and other factors likely play a much larger role in determining sales performance.
The regression model indicates a weak but statistically significant negative relationship between Year and Global_Sales. For every additional year, global sales decrease by a small amount (about 20,000 units). However, the R-squared value of 0.0056 shows that the release year does not explain much of the variation in global sales. This implies that factors other than the year of release, such as platform, genre, and marketing strategies, likely have a more substantial impact on a game’s commercial success.
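The fitted line can be applied directly to see what the coefficients imply. This sketch uses the printed estimates; the helper `predict_sales` and the example years 2000 and 2010 are chosen here for illustration only.

```r
# Coefficients copied from the regression summary
intercept <- 40.818104
slope     <- -0.020075

# Predicted global sales (millions of units) for a given release year
predict_sales <- function(year) intercept + slope * year

pred_2000 <- predict_sales(2000)
pred_2010 <- predict_sales(2010)

# Change in predicted sales over ten years: 10 * slope
decade_change <- pred_2010 - pred_2000
```

Over a decade the predicted change is about -0.2 million units, which is small next to the residual standard error of 1.561.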
library(ggplot2)
ggplot(vg_sales, aes(x = Year, y = Global_Sales)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Year", y = "Global Sales (millions)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot with the fitted regression line shows the weak downward trend in global sales as the release year increases, consistent with the regression model.
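Platform: The earlier ANOVA test shows that mean global sales differ significantly across the platform groups. However, the platform grouping explains about 0.6% of the variance in global sales (238 / (238 + 39895)), roughly the same small share that release year explains, so these results do not show that platform has a much stronger effect than year. Identifying which platforms have higher mean sales would require a post-hoc comparison of the group means.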
Year: While there is a weak negative correlation between Year and Global_Sales, the year of release is not a strong predictor of sales. Other factors such as game quality, marketing, and platform availability are likely more important in determining a game’s success.
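One way to put the two explanatory variables on a common scale is the share of variance each explains: eta squared for the ANOVA and R-squared for the regression. This sketch uses the rounded values printed in the outputs above. The two models were fit on slightly different row counts (the regression drops games with a missing year), so the comparison is approximate.

```r
# ANOVA table values (Platform model)
ss_platform <- 238
ss_residual <- 39895

# Eta squared: between-group sum of squares over total sum of squares
eta_sq <- ss_platform / (ss_platform + ss_residual)

# R-squared reported by the regression on Year
r_sq_year <- 0.005585

# Both shares as percentages of total variance
round(100 * c(platform = eta_sq, year = r_sq_year), 2)
```

Both values are below 1%, which supports looking at further variables such as genre and publisher.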
Given the weak relationship between Year and Global_Sales, future analysis could focus on the following:
Genre: Investigate whether certain game genres are more likely to achieve higher global sales.
Publisher: Explore how different game publishers influence global sales, as some publishers may have better marketing or distribution channels.
Regional Sales: Examine how sales vary by region (NA_Sales, EU_Sales, JP_Sales, Other_Sales) to understand the impact of regional markets on game performance.
By focusing on these additional factors, we can gain a deeper understanding of what drives the success of video games in different markets.