data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
Response Variable: Global_Sales - Total global sales of each game.
Explanatory Variables:
NA_Sales - Sales in North America.
EU_Sales - Sales in Europe.
JP_Sales - Sales in Japan.
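We fit a Generalized Linear Model (GLM) with a Gaussian family and identity link, which is equivalent to ordinary least-squares regression, to analyze how sales in different regions contribute to the global sales of a video game.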
# 1. Generalized Linear Model (GLM)
model <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
             data = data,
             family = gaussian(link = "identity"))
summary(model)
##
## Call:
## glm(formula = Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
## family = gaussian(link = "identity"), data = data)
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.006166   0.001046   5.895 3.83e-09 ***
## NA_Sales    1.047260   0.001927 543.550  < 2e-16 ***
## EU_Sales    1.222241   0.003089 395.620  < 2e-16 ***
## JP_Sales    0.962372   0.003622 265.734  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.01619746)
##
## Null deviance: 40133.40 on 16597 degrees of freedom
## Residual deviance: 268.78 on 16594 degrees of freedom
## AIC: -21323
##
## Number of Fisher Scoring iterations: 2
This summary provides the following:
Coefficients: Estimates, standard errors, t-values, and p-values for the intercept and each regional variable.
Model Fit Statistics: The null deviance (40133.40 on 16597 degrees of freedom), the residual deviance (268.78 on 16594 degrees of freedom), and the AIC (-21323).
Insight: The GLM quantifies the linear relationship between global sales and regional sales in North America, Europe, and Japan. The estimated coefficients are 1.047 for NA_Sales, 1.222 for EU_Sales, and 0.962 for JP_Sales.
Significance: Each coefficient is the predicted change in global sales per one-unit increase in that region's sales, holding the other regions constant. All three p-values are below 2e-16, so each region's association with global sales is statistically significant.
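Because the model uses a Gaussian family with an identity link, its coefficient estimates should match those of an ordinary `lm()` fit. The sketch below checks this on synthetic data and prints t-based 95% confidence intervals. The data frame `sim` and its values are assumptions made up for illustration; they are not the vgsales.csv data.

```r
# Synthetic stand-in for the sales data (NOT the real vgsales.csv values)
set.seed(1)
n <- 200
sim <- data.frame(
  NA_Sales = rexp(n, rate = 4),
  EU_Sales = rexp(n, rate = 6),
  JP_Sales = rexp(n, rate = 12)
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales) + rnorm(n, sd = 0.1)

# Same formula fitted two ways
glm_fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
               data = sim, family = gaussian(link = "identity"))
lm_fit  <- lm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, data = sim)

# The two fits give the same coefficient estimates (up to numerical tolerance)
print(all.equal(coef(glm_fit), coef(lm_fit)))

# 95% confidence intervals for the coefficients, taken from the lm fit
print(confint(lm_fit))
```

Confidence intervals complement the p-values by showing the range of coefficient values that is compatible with the data.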
To ensure that the model assumptions are met, we analyze several diagnostic plots:
plot(model, which = 1)
In the Residuals vs Fitted plot, any systematic pattern (such as a curve or a funnel shape) indicates non-linearity or heteroscedasticity, that is, a violation of the linearity or constant-variance assumption.
plot(model, which = 2)
In the Normal Q-Q plot, points that deviate from the reference line indicate that the residuals are not normally distributed.
plot(model, which = 3)
In the Scale-Location plot, a visible trend in the spread of the standardized residuals indicates heteroscedasticity.
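A visual check for heteroscedasticity can be backed up with a number. The sketch below computes the studentized (Koenker) Breusch-Pagan statistic by hand in base R: squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The toy variables `x` and `y` are assumptions, generated to be heteroscedastic on purpose; this is not a result for the sales model.

```r
set.seed(2)
n <- 300
x <- runif(n, 0, 5)
# Error spread grows with x, so this toy data is heteroscedastic by design
y <- 1 + 2 * x + rnorm(n, sd = 0.2 + 0.5 * x)
fit <- glm(y ~ x, family = gaussian(link = "identity"))

# Auxiliary regression of the squared residuals on the predictor
res_sq <- residuals(fit, type = "response")^2
aux <- lm(res_sq ~ x)

# Studentized Breusch-Pagan statistic: n * R^2, chi-squared with 1 df
# (one df because the auxiliary regression has one predictor)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
cat("BP statistic:", bp_stat, " p-value:", bp_p, "\n")
```

A small p-value here is evidence against constant variance. For a model with k predictors, the auxiliary regression would include all k of them and the test would use k degrees of freedom.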
plot(model, which = 5)
In the Residuals vs Leverage plot, no influential outliers appear to unduly affect the model.
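The leverage plot can be complemented by listing observations with a large Cook's distance. The sketch below uses toy data with one planted influential point; the variables and the 4/n cutoff are assumptions (4/n is a common rule of thumb, not a formal test), and nothing here is computed from the sales data.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
# Plant one high-leverage, badly fitting point so there is something to find
x[n] <- 8
y[n] <- -10
fit <- glm(y ~ x, family = gaussian(link = "identity"))

# Cook's distance for every observation
cd <- cooks.distance(fit)
threshold <- 4 / n            # rule-of-thumb cutoff (an assumption)
flagged <- which(cd > threshold)

print(flagged)                # indices of observations above the cutoff
print(round(cd[flagged], 3))  # their Cook's distances
```

Flagged observations are candidates for a closer look, not automatic deletions; refitting the model without them shows how much the coefficients actually depend on them.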
# 2. Examining coefficients and significance
summary(model)
# 3. Assessing model fit
# Residual Deviance and Degrees of Freedom
residual_deviance <- deviance(model)
df_residual <- df.residual(model)
# Null Deviance and Degrees of Freedom
null_model <- glm(Global_Sales ~ 1, data = data, family = gaussian(link = "identity"))
null_deviance <- deviance(null_model)
df_null <- df.residual(null_model)  # n - 1 for an intercept-only model
# Calculating R-squared
r_squared <- 1 - (residual_deviance / null_deviance)
cat("Residual Deviance:", residual_deviance, "\n")
## Residual Deviance: 268.7807
cat("Degrees of Freedom (Residual):", df_residual, "\n")
## Degrees of Freedom (Residual): 16594
cat("Null Deviance:", null_deviance, "\n")
## Null Deviance: 40133.4
cat("Degrees of Freedom (Null):", df_null, "\n")
## Degrees of Freedom (Null): 16597
cat("R-squared:", r_squared, "\n")
## R-squared: 0.9933028
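Residual Deviance: For a Gaussian GLM this is the residual sum of squares, so lower values indicate a better fit. Here it drops from 40133.4 for the null model to 268.78 for the fitted model.
R-squared: The proportion of variability in Global_Sales explained by the model, computed as 1 - (residual deviance / null deviance). The value of 0.9933 means the three regional variables explain about 99.3% of the variation in global sales.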
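As a cross-check on the deviance-based R-squared, the sketch below shows that for a Gaussian GLM the residual deviance equals the residual sum of squares, and that 1 - (residual deviance / null deviance) matches the R-squared reported by `lm()`. It also reads both deviances directly from the fitted object, which avoids fitting a separate null model. The variables `x1`, `x2`, and `y` are synthetic assumptions.

```r
set.seed(4)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 0.5 + 1.0 * x1 + 1.2 * x2 + rnorm(n, sd = 0.3)
fit <- glm(y ~ x1 + x2, family = gaussian(link = "identity"))

# The fitted glm object already stores both deviances
r2_dev <- 1 - fit$deviance / fit$null.deviance

# Ordinary R-squared from the equivalent linear model
r2_lm <- summary(lm(y ~ x1 + x2))$r.squared

# Residual deviance of a Gaussian GLM is the residual sum of squares
rss <- sum(residuals(fit, type = "response")^2)

print(all.equal(r2_dev, r2_lm))
print(all.equal(fit$deviance, rss))
```

This equivalence holds for the Gaussian family with identity link; for other families, 1 - (residual deviance / null deviance) is a pseudo R-squared and does not have the same interpretation.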
Consider the coefficient for NA_Sales: for every one-unit increase in North American sales, the model predicts an increase in global sales of about 1.047 units (the NA_Sales coefficient), holding EU_Sales and JP_Sales constant.
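The "holding other variables constant" interpretation can be verified with `predict()`: raising NA_Sales by one unit while leaving the other predictors unchanged shifts the prediction by exactly the NA_Sales coefficient. The sketch below does this on synthetic data; `sim`, `base`, and `bumped` are hypothetical names and values chosen for illustration.

```r
# Synthetic stand-in for the sales data (NOT the real vgsales.csv values)
set.seed(5)
n <- 200
sim <- data.frame(
  NA_Sales = rexp(n, rate = 4),
  EU_Sales = rexp(n, rate = 6),
  JP_Sales = rexp(n, rate = 12)
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales) + rnorm(n, sd = 0.1)

fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
           data = sim, family = gaussian(link = "identity"))

# Two hypothetical games that differ only by one unit of NA_Sales
base   <- data.frame(NA_Sales = 1, EU_Sales = 0.5, JP_Sales = 0.2)
bumped <- transform(base, NA_Sales = NA_Sales + 1)

# Difference in predicted global sales equals the NA_Sales coefficient
diff_pred <- predict(fit, newdata = bumped) - predict(fit, newdata = base)
print(unname(diff_pred))
print(unname(coef(fit)["NA_Sales"]))
```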
Insight:
The coefficient for NA_Sales (estimate 1.047, standard error 0.0019, t = 543.55) reveals a strong positive association between North American and global sales.
Significance:
Global sales rise with North American sales, and the p-value below 2e-16 indicates that this relationship is statistically significant.
Further Questions:
How do sales trends in North America impact overall performance in the gaming industry?
Could including additional regions or factors improve the model’s explanatory power?
Can this model be applied to predict sales in other markets or contexts?
Does this relationship align with industry expectations?
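One way to approach the question about additional regions or factors is to compare nested models. The sketch below adds a hypothetical predictor, `Extra_Sales`, to the three-region model on synthetic data and compares the two fits with an F-test and AIC. The data, the predictor name, and its effect are all assumptions for illustration.

```r
# Synthetic data with a hypothetical fourth predictor (NOT the real data)
set.seed(6)
n <- 200
sim <- data.frame(
  NA_Sales    = rexp(n, rate = 4),
  EU_Sales    = rexp(n, rate = 6),
  JP_Sales    = rexp(n, rate = 12),
  Extra_Sales = rexp(n, rate = 10)   # hypothetical additional region or factor
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales + Extra_Sales) +
  rnorm(n, sd = 0.05)

small <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
             data = sim, family = gaussian(link = "identity"))
big <- update(small, . ~ . + Extra_Sales)

# F-test for the added term, and AIC for both models (lower is better)
print(anova(small, big, test = "F"))
print(AIC(small, big))
```

A significant F-test together with a lower AIC for the larger model would suggest that the added predictor improves explanatory power.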