data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")

Response Variable: Global_Sales - Total global sales of each game.

Explanatory Variables: NA_Sales, EU_Sales, and JP_Sales - regional sales in North America, Europe, and Japan.

Model Building

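We fit a Generalized Linear Model (GLM) to analyze how sales in different regions contribute to the global sales of a video game. With the Gaussian family and identity link used here, the GLM is equivalent to an ordinary linear regression.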

# Generalized Linear Model (GLM)
model <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, 
             data = data, 
             family = gaussian(link = "identity"))
summary(model)
## 
## Call:
## glm(formula = Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, 
##     family = gaussian(link = "identity"), data = data)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.006166   0.001046   5.895 3.83e-09 ***
## NA_Sales    1.047260   0.001927 543.550  < 2e-16 ***
## EU_Sales    1.222241   0.003089 395.620  < 2e-16 ***
## JP_Sales    0.962372   0.003622 265.734  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01619746)
## 
##     Null deviance: 40133.40  on 16597  degrees of freedom
## Residual deviance:   268.78  on 16594  degrees of freedom
## AIC: -21323
## 
## Number of Fisher Scoring iterations: 2

This summary provides the following:

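Insight: The GLM relates global sales to regional sales in North America, Europe, and Japan. All three slope estimates are close to 1 (1.047, 1.222, and 0.962), so each additional unit sold in a region is associated with roughly one additional unit of global sales.

Significance: Each coefficient estimates the change in global sales for a one-unit increase in that region's sales, holding the other regions constant. All three p-values are below 2e-16, so every region is statistically significant at any conventional level.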
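
A Gaussian GLM with an identity link should reproduce ordinary least squares. The following is a minimal sketch on simulated data, not on vgsales.csv: the data frame `sim` and the numbers used to generate it are assumptions made purely for illustration. It checks that `glm()` and `lm()` return the same coefficient estimates for the same formula.

```r
# Simulated stand-in for the real data; all values here are hypothetical.
set.seed(1)
n <- 200
sim <- data.frame(
  NA_Sales = rexp(n),
  EU_Sales = rexp(n),
  JP_Sales = rexp(n)
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales) + rnorm(n, sd = 0.1)

# Fit the same formula two ways.
glm_fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
               data = sim, family = gaussian(link = "identity"))
lm_fit <- lm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, data = sim)

# The coefficient estimates agree up to numerical tolerance.
same_coefs <- all.equal(coef(glm_fit), coef(lm_fit))
print(same_coefs)
```

In practice this means the t-tests and standard errors in the summary can be read exactly as they would be for a linear regression.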

Diagnosing the Model

To ensure that the model assumptions are met, we analyze several diagnostic plots:

Residuals vs. Fitted Plot

plot(model, which = 1)

In this plot, any systematic pattern in the residuals indicates non-linearity or heteroscedasticity (non-constant variance), either of which would signal a problem with the model assumptions.

Q-Q Plot of Residuals

plot(model, which = 2)

In this plot, points that deviate from the reference line indicate that the residuals are not normally distributed.

Scale-Location Plot

plot(model, which = 3)

In this plot, a visible trend in the spread of the residuals indicates that heteroscedasticity may be present.

Residuals vs. Leverage Plot

plot(model, which = 5)

In this plot, no influential outliers appear to unduly affect the model.
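
The visual check for influential points can be backed up numerically with Cook's distance. The sketch below uses simulated data with one deliberately planted outlier; the data frame `sim`, the planted point, and the 4/n cutoff (a common rule of thumb, not a formal test) are all assumptions for illustration.

```r
# Simulated data with one planted high-leverage outlier (hypothetical values).
set.seed(2)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n, sd = 0.5)
sim$x[1] <- 5     # far from the bulk of x: high leverage
sim$y[1] <- -10   # far from the true line: large residual

fit <- glm(y ~ x, data = sim, family = gaussian(link = "identity"))

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# Flag observations above the 4/n rule-of-thumb cutoff.
flagged <- which(cd > 4 / n)
print(flagged)
```

Applied to the fitted `model`, the same two calls would list the rows worth inspecting before concluding that no observation is unduly influential.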

# 2. Examining coefficients and significance
summary(model)
# 3. Assessing model fit
# Residual Deviance and Degrees of Freedom
residual_deviance <- deviance(model)
df_residual <- df.residual(model)

# Null Deviance and Degrees of Freedom
null_model <- glm(Global_Sales ~ 1, data = data, family = gaussian(link = "identity"))
null_deviance <- deviance(null_model)
df_null <- nrow(data) - 1

# Calculating R-squared
r_squared <- 1 - (residual_deviance / null_deviance)

cat("Residual Deviance:", residual_deviance, "\n")
## Residual Deviance: 268.7807
cat("Degrees of Freedom (Residual):", df_residual, "\n")
## Degrees of Freedom (Residual): 16594
cat("Null Deviance:", null_deviance, "\n")
## Null Deviance: 40133.4
cat("Degrees of Freedom (Null):", df_null, "\n")
## Degrees of Freedom (Null): 16597
cat("R-squared:", r_squared, "\n")
## R-squared: 0.9933028

Interpretation:

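  • Residual Deviance: For a Gaussian model this is the residual sum of squares, so lower values indicate a better fit. Here it falls from 40133.40 for the null model to 268.78 for the fitted model.

  • R-squared: This is the proportion of variability in Global_Sales explained by the model. The value of 0.9933 means that the three regional sales variables explain about 99.3% of the variation in global sales.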
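
The R-squared arithmetic can be reproduced directly from the two deviances reported above. The second half of this sketch is a cross-check on simulated data (the data frame `sim` is an assumption for illustration) showing that, for a Gaussian identity-link GLM, the deviance-based value matches the R-squared reported by `lm()`.

```r
# Deviances as printed in the output above.
residual_deviance <- 268.7807
null_deviance <- 40133.4
r_squared <- 1 - residual_deviance / null_deviance
print(round(r_squared, 4))  # 0.9933

# Cross-check on simulated (hypothetical) data.
set.seed(3)
sim <- data.frame(x = rnorm(50))
sim$y <- 3 + 1.5 * sim$x + rnorm(50)

fit <- glm(y ~ x, data = sim, family = gaussian(link = "identity"))

# A glm object stores both deviances, so no separate null model is needed.
dev_r2 <- 1 - fit$deviance / fit$null.deviance
lm_r2 <- summary(lm(y ~ x, data = sim))$r.squared
print(all.equal(dev_r2, lm_r2))
```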

Interpretation of Coefficients

Consider the coefficient for NA_Sales: for every one-unit increase in North American sales, the model predicts an increase in global sales of about 1.047 units (the value of the NA_Sales coefficient), holding EU_Sales and JP_Sales constant.

Insight:

The coefficient for NA_Sales (1.047, standard error 0.0019) reveals a strong positive association between North American and global sales, backed by statistical significance.

Significance:

This suggests that higher North American sales are associated with higher global sales, and the small p-value (< 2e-16) indicates that this relationship is statistically significant.
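
The "one-unit increase, holding other variables constant" reading can be checked with `predict()`. This is a minimal sketch on simulated data; the data frame `sim`, its generating coefficients, and the two hypothetical games in `new_games` are assumptions for illustration only.

```r
# Simulated stand-in for the real data; all values here are hypothetical.
set.seed(4)
n <- 200
sim <- data.frame(
  NA_Sales = rexp(n),
  EU_Sales = rexp(n),
  JP_Sales = rexp(n)
)
sim$Global_Sales <- with(sim, 1.05 * NA_Sales + 1.2 * EU_Sales + 0.95 * JP_Sales) +
  rnorm(n, sd = 0.1)

fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
           data = sim, family = gaussian(link = "identity"))

# Two hypothetical games that differ only by one unit of NA_Sales.
new_games <- data.frame(
  NA_Sales = c(1, 2),
  EU_Sales = c(0.5, 0.5),
  JP_Sales = c(0.2, 0.2)
)
pred <- predict(fit, newdata = new_games)

# The difference in predictions equals the NA_Sales coefficient.
effect <- unname(diff(pred))
na_coef <- unname(coef(fit)["NA_Sales"])
print(all.equal(effect, na_coef))
```

With an identity link the effect is the same at every baseline, so the choice of EU_Sales and JP_Sales values in `new_games` does not change the result.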

Further Questions: