Week_11

data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")

Response Variable: Global_Sales - Total global sales of each game.

Explanatory Variables:

NA_Sales - Sales in North America.
EU_Sales - Sales in Europe.
JP_Sales - Sales in Japan.

Model Building

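I used a Generalized Linear Model (GLM) with a Gaussian family and identity link to analyze how sales in different regions contribute to the global sales of a video game. With this family and link, the GLM is equivalent to an ordinary least squares linear regression.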

# Generalized Linear Model (GLM)
model <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, 
             data = data, 
             family = gaussian(link = "identity"))

summary(model)

## 
## Call:
## glm(formula = Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, 
##     family = gaussian(link = "identity"), data = data)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.006166   0.001046   5.895 3.83e-09 ***
## NA_Sales    1.047260   0.001927 543.550  < 2e-16 ***
## EU_Sales    1.222241   0.003089 395.620  < 2e-16 ***
## JP_Sales    0.962372   0.003622 265.734  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.01619746)
## 
##     Null deviance: 40133.40  on 16597  degrees of freedom
## Residual deviance:   268.78  on 16594  degrees of freedom
## AIC: -21323
## 
## Number of Fisher Scoring iterations: 2

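This summary provides the following:

Coefficients: Estimates, standard errors, t-values, and p-values for the intercept and each regional sales variable.
Model Fit Statistics: The null deviance (40133.40 on 16597 degrees of freedom), the residual deviance (268.78 on 16594 degrees of freedom), and the AIC (-21323).

Insight: The GLM quantifies the linear relationship between global sales and regional sales in North America, Europe, and Japan. All three slopes are positive and close to 1 (1.047, 1.222, and 0.962).

Significance: Each coefficient is the expected change in global sales per one-unit increase in that region's sales, holding the other two regions constant. All three p-values are below 2e-16, so every region is statistically significant at any conventional level. This is expected, since a game's global sales figure necessarily includes its regional sales.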
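
Since the model uses a Gaussian family with an identity link, it should give the same estimates as lm(). The sketch below checks this on simulated data. The `sim` data frame and its distribution parameters are assumptions standing in for vgsales.csv, not the real dataset.

```r
# Simulated stand-in for vgsales.csv (hypothetical data, not the real file)
set.seed(42)
n <- 500
sim <- data.frame(
  NA_Sales = rexp(n, rate = 4),
  EU_Sales = rexp(n, rate = 7),
  JP_Sales = rexp(n, rate = 12)
)
sim$Global_Sales <- sim$NA_Sales + sim$EU_Sales + sim$JP_Sales + rnorm(n, sd = 0.1)

glm_fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales,
               data = sim, family = gaussian(link = "identity"))
lm_fit <- lm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, data = sim)

# Coefficients agree up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(glm_fit), coef(lm_fit)))

# The Gaussian dispersion estimate is the squared residual standard error from lm()
same_dispersion <- isTRUE(all.equal(summary(glm_fit)$dispersion,
                                    summary(lm_fit)$sigma^2))

print(c(same_coefs = same_coefs, same_dispersion = same_dispersion))
```

If both checks are TRUE, the "dispersion parameter" in the GLM summary can be read as the residual variance of the equivalent linear regression.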

Diagnosing the Model

To ensure that the model assumptions are met, we analyze several diagnostic plots:

Residuals vs. Fitted Plot

plot(model, which = 1)

My observation from the plot: any systematic pattern would indicate non-linearity, and a spread that widens or narrows across the fitted values would indicate heteroscedasticity, i.e. a violation of the constant-variance assumption.
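
A visual impression of heteroscedasticity can be backed up with a number. The sketch below regresses the squared residuals on the fitted values, in the spirit of the Breusch-Pagan test, and is not a replacement for a packaged implementation. The data are simulated with noise that grows with the predictor; all values are assumptions for illustration.

```r
# Hypothetical data with noise that grows with the predictor (not vgsales.csv)
set.seed(1)
n <- 1000
NA_Sales <- runif(n, min = 0, max = 5)
Global_Sales <- 0.1 + 1.5 * NA_Sales + rnorm(n, sd = 0.05 + 0.2 * NA_Sales)
fit <- glm(Global_Sales ~ NA_Sales, family = gaussian(link = "identity"))

# Auxiliary regression: do the squared residuals depend on the fitted values?
sq_resid <- residuals(fit, type = "response")^2
fitted_vals <- fitted(fit)
aux <- lm(sq_resid ~ fitted_vals)

# n * R^2 is compared against a chi-squared distribution with 1 degree of freedom
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
cat("LM statistic:", lm_stat, " p-value:", p_value, "\n")
```

A small p-value would support what a funnel shape in the plot suggests: the residual variance is not constant.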

Q-Q Plot of Residuals

plot(model, which = 2)

My observation from the plot: points that deviate from the reference line, especially in the tails, would indicate that the residuals are not normally distributed.
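
The Q-Q plot can be supplemented with a formal test and a skewness estimate. This is a minimal sketch on simulated right-skewed errors, not the real residuals. Note that shapiro.test() only accepts between 3 and 5000 values, so a model with 16598 rows needs a random subsample.

```r
# Hypothetical right-skewed errors (not the real vgsales residuals)
set.seed(2)
n <- 16598  # row count implied by the 16597 null degrees of freedom
NA_Sales <- rexp(n, rate = 4)
Global_Sales <- 1.2 * NA_Sales + (rexp(n, rate = 10) - 0.1)  # mean-zero skewed noise
fit <- glm(Global_Sales ~ NA_Sales, family = gaussian(link = "identity"))

# shapiro.test() is limited to 5000 observations, so test a random subsample
std_resid <- rstandard(fit)
sub <- sample(std_resid, size = 5000)
sw <- shapiro.test(sub)
print(sw)

# Skewness near 0 is what normally distributed residuals would give
skewness <- mean((std_resid - mean(std_resid))^3) / sd(std_resid)^3
cat("Sample skewness:", skewness, "\n")
```

With thousands of observations the test rejects even mild departures from normality, so the Q-Q plot and the skewness value are usually more informative than the p-value alone.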

Scale-Location Plot

plot(model, which = 3)

My observation from the plot: a visible trend would indicate heteroscedasticity, while a roughly flat line supports constant variance.

Residuals vs. Leverage Plot

plot(model, which = 5)

My observation from the plot: no influential outliers appear to unduly affect the model.
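
The leverage plot can be cross-checked numerically with Cook's distance. The sketch below plants one extreme point in simulated data to show how it would be flagged. The data and the 4/n cutoff (a common rule of thumb, not a strict threshold) are assumptions for illustration.

```r
# Hypothetical data with one planted high-leverage outlier in row 1
set.seed(3)
n <- 200
NA_Sales <- c(20, runif(n - 1, min = 0, max = 5))
Global_Sales <- 1 + 2 * NA_Sales + rnorm(n, sd = 0.5)
Global_Sales[1] <- 0  # far below the trend of the other points
fit <- glm(Global_Sales ~ NA_Sales, family = gaussian(link = "identity"))

cooks_d <- cooks.distance(fit)
leverage <- hatvalues(fit)

# Rule of thumb: flag points whose Cook's distance exceeds 4 / n
flagged <- which(cooks_d > 4 / n)
cat("Most influential row:", which.max(cooks_d), "\n")
cat("Rows flagged by the 4/n rule:", length(flagged), "\n")
```

Applied to the real model, the same two calls, cooks.distance(model) and hatvalues(model), would list the specific games behind any points that stand out in the plot.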

# 3. Assessing model fit
# Residual Deviance and Degrees of Freedom
residual_deviance <- deviance(model)
df_residual <- df.residual(model)

# Null Deviance and Degrees of Freedom
null_model <- glm(Global_Sales ~ 1, data = data, family = gaussian(link = "identity"))
null_deviance <- deviance(null_model)
df_null <- nrow(data) - 1

# Calculating R-squared
r_squared <- 1 - (residual_deviance / null_deviance)

cat("Residual Deviance:", residual_deviance, "\n")

## Residual Deviance: 268.7807

cat("Degrees of Freedom (Residual):", df_residual, "\n")

## Degrees of Freedom (Residual): 16594

cat("Null Deviance:", null_deviance, "\n")

## Null Deviance: 40133.4

cat("Degrees of Freedom (Null):", df_null, "\n")

## Degrees of Freedom (Null): 16597

cat("R-squared:", r_squared, "\n")

## R-squared: 0.9933028

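Interpretation:

Residual Deviance: For a Gaussian model this is the residual sum of squares, so lower values indicate a better fit. Here it falls from 40133.40 under the null model to 268.78 with the three regional predictors.
R-squared: The proportion of variability in Global_Sales explained by the model. The value of 0.9933 means the three regional sales variables account for about 99.3% of the variance in global sales.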
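
The R-squared arithmetic can be checked directly from the two deviances reported above. The second half of this sketch uses simulated data (an assumption, not vgsales.csv) to confirm that the deviance-based value matches the R-squared from lm() for a Gaussian identity-link model.

```r
# Reproduce the reported value from the two deviances printed above
r2_reported <- 1 - 268.7807 / 40133.4
print(round(r2_reported, 4))

# Hypothetical data (not vgsales.csv) to check the identity against lm()
set.seed(4)
n <- 300
sim <- data.frame(NA_Sales = rexp(n, rate = 4), EU_Sales = rexp(n, rate = 7))
sim$Global_Sales <- sim$NA_Sales + sim$EU_Sales + rnorm(n, sd = 0.1)

glm_fit <- glm(Global_Sales ~ NA_Sales + EU_Sales, data = sim,
               family = gaussian(link = "identity"))
lm_fit <- lm(Global_Sales ~ NA_Sales + EU_Sales, data = sim)

# A fitted glm object already stores the null deviance, so no second fit is needed
r2_glm <- 1 - glm_fit$deviance / glm_fit$null.deviance
r2_lm <- summary(lm_fit)$r.squared
print(c(glm = r2_glm, lm = r2_lm))
```

Using `$null.deviance` from the fitted object is a design choice that avoids fitting a separate intercept-only model.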

Interpretation of Coefficients

Consider the coefficient for NA_Sales (1.047): for every one-unit increase in North American sales, the model predicts an increase of about 1.047 units in global sales, holding EU_Sales and JP_Sales constant.

Insight:

The coefficient for NA_Sales (estimate 1.047, standard error 0.0019, t = 543.55) reveals a strong and precisely estimated association between North American and global sales.

Significance:

North American sales are positively associated with global sales, and the p-value (below 2e-16) indicates that this relationship is statistically significant.
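
The "one-unit increase, other variables held constant" reading can be illustrated by predicting for two games that differ only in NA_Sales. This is a sketch on simulated data. The `sim` and `games` data frames are hypothetical and are not taken from vgsales.csv.

```r
# Simulated stand-in for vgsales.csv (hypothetical data)
set.seed(5)
n <- 400
sim <- data.frame(
  NA_Sales = rexp(n, rate = 4),
  EU_Sales = rexp(n, rate = 7),
  JP_Sales = rexp(n, rate = 12)
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales) + rnorm(n, sd = 0.1)
fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, data = sim,
           family = gaussian(link = "identity"))

# Two hypothetical games that differ only by one unit of NA_Sales
games <- data.frame(NA_Sales = c(1, 2), EU_Sales = 0.5, JP_Sales = 0.2)
pred <- predict(fit, newdata = games, type = "response")
diff_pred <- unname(pred[2] - pred[1])

cat("Predicted difference:", diff_pred, "\n")
cat("NA_Sales coefficient:", coef(fit)[["NA_Sales"]], "\n")

# Wald 95% confidence intervals (normal approximation) for all coefficients
print(confint.default(fit))
```

The predicted difference equals the NA_Sales coefficient, which is what the interpretation states. The confidence interval adds a range of plausible values around that estimate.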

Further Questions:

How do sales trends in North America impact overall performance in the gaming industry?
Could including additional regions or factors improve the model’s explanatory power?
Can this model be applied to predict sales in other markets or contexts?
Does this relationship align with industry expectations?
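
One way to approach the question about additional regions or factors is to compare nested models. The sketch below uses simulated data in which a fourth column, here called Other_Sales, is an assumption made for illustration. It compares the three-region model with a four-region model using an F-test and AIC.

```r
# Simulated data with a hypothetical fourth region (not the real vgsales.csv)
set.seed(6)
n <- 500
sim <- data.frame(
  NA_Sales = rexp(n, rate = 4),
  EU_Sales = rexp(n, rate = 7),
  JP_Sales = rexp(n, rate = 12),
  Other_Sales = rexp(n, rate = 20)
)
sim$Global_Sales <- with(sim, NA_Sales + EU_Sales + JP_Sales + Other_Sales) +
  rnorm(n, sd = 0.01)

base_fit <- glm(Global_Sales ~ NA_Sales + EU_Sales + JP_Sales, data = sim,
                family = gaussian(link = "identity"))
full_fit <- update(base_fit, . ~ . + Other_Sales)

# F-test for the added term, and AIC for both models (lower is better)
print(anova(base_fit, full_fit, test = "F"))
print(AIC(base_fit, full_fit))
```

A significant F-test together with a lower AIC for the larger model would suggest that the added predictor improves explanatory power.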