data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(ggplot2)
The dataset includes the regional sales columns NA_Sales, EU_Sales, JP_Sales, and Other_Sales, which may have predictive value for Global_Sales and are therefore candidates for the regression model. The new model uses two of them, NA_Sales and EU_Sales, together with Year. The interaction term Year * NA_Sales captures the combined effect of Year and NA_Sales on the response variable, Global_Sales, and including EU_Sales lets us estimate its own effect on Global_Sales.
lm_model <- lm(Global_Sales ~ Year * NA_Sales + EU_Sales, data = data)
# Summary of the model
summary(lm_model)
##
## Call:
## lm(formula = Global_Sales ~ Year * NA_Sales + EU_Sales, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7600 -0.0567 -0.0267 0.0049 7.2127
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0003106 0.1260849 -0.002 0.998034
## Year1981 0.0012447 0.1382277 0.009 0.992815
## Year1982 0.0002083 0.1370888 0.002 0.998788
## Year1983 0.6446984 0.1803457 3.575 0.000352 ***
## Year1984 1.0571472 0.1475912 7.163 8.25e-13 ***
## Year1985 0.5021227 0.1473163 3.408 0.000655 ***
## Year1986 1.0167100 0.1451636 7.004 2.59e-12 ***
## Year1987 0.6835984 0.1539555 4.440 9.04e-06 ***
## Year1988 0.6902886 0.1507286 4.580 4.69e-06 ***
## Year1989 0.5358108 0.1454500 3.684 0.000231 ***
## Year1990 0.6098960 0.1476224 4.131 3.62e-05 ***
## Year1991 0.3243251 0.1346088 2.409 0.015990 *
## Year1992 0.3852453 0.1348845 2.856 0.004294 **
## Year1993 0.3641652 0.1312886 2.774 0.005547 **
## Year1994 0.1932130 0.1288156 1.500 0.133654
## Year1995 0.1828860 0.1275490 1.434 0.151635
## Year1996 0.0608431 0.1273064 0.478 0.632709
## Year1997 0.1214210 0.1273025 0.954 0.340200
## Year1998 0.0634360 0.1270620 0.499 0.617609
## Year1999 0.0146660 0.1271364 0.115 0.908164
## Year2000 0.0961292 0.1272081 0.756 0.449849
## Year2001 0.0256172 0.1268337 0.202 0.839939
## Year2002 0.0005748 0.1265116 0.005 0.996375
## Year2003 0.0153952 0.1266050 0.122 0.903217
## Year2004 -0.0809181 0.1265541 -0.639 0.522574
## Year2005 -0.0123328 0.1264547 -0.098 0.922309
## Year2006 0.0556786 0.1263832 0.441 0.659542
## Year2007 0.0288707 0.1263710 0.228 0.819292
## Year2008 0.0206608 0.1263194 0.164 0.870079
## Year2009 0.0078712 0.1263080 0.062 0.950311
## Year2010 0.0293491 0.1263417 0.232 0.816309
## Year2011 0.0306796 0.1263749 0.243 0.808189
## Year2012 0.0495444 0.1265908 0.391 0.695525
## Year2013 0.0556422 0.1266897 0.439 0.660522
## Year2014 0.0239083 0.1266857 0.189 0.850314
## Year2015 0.0354370 0.1266195 0.280 0.779581
## Year2016 0.0321952 0.1270497 0.253 0.799957
## Year2017 0.0169773 0.2009853 0.084 0.932683
## Year2020 0.0235623 0.2936825 0.080 0.936055
## YearN/A 0.0178767 0.1274410 0.140 0.888445
## NA_Sales 0.9879569 0.0747291 13.221 < 2e-16 ***
## EU_Sales 1.3736293 0.0081440 168.667 < 2e-16 ***
## Year1981:NA_Sales 0.0011065 0.0929594 0.012 0.990503
## Year1982:NA_Sales 0.0000518 0.0843339 0.001 0.999510
## Year1983:NA_Sales -0.3775826 0.2542327 -1.485 0.137514
## Year1984:NA_Sales -0.0059958 0.0754769 -0.079 0.936684
## Year1985:NA_Sales 0.2098974 0.0753656 2.785 0.005358 **
## Year1986:NA_Sales -0.0419964 0.1015631 -0.414 0.679245
## Year1987:NA_Sales 0.0605748 0.1306643 0.464 0.642947
## Year1988:NA_Sales 0.1774437 0.0796724 2.227 0.025950 *
## Year1989:NA_Sales 0.1804387 0.0756079 2.387 0.017021 *
## Year1990:NA_Sales 0.1572057 0.0781122 2.013 0.044177 *
## Year1991:NA_Sales 0.0715706 0.1001192 0.715 0.474709
## Year1992:NA_Sales 0.2970330 0.0808364 3.674 0.000239 ***
## Year1993:NA_Sales 0.1867357 0.0859708 2.172 0.029864 *
## Year1994:NA_Sales 0.2692088 0.0850206 3.166 0.001546 **
## Year1995:NA_Sales 0.1264236 0.0915715 1.381 0.167421
## Year1996:NA_Sales 0.3757158 0.0766752 4.900 9.67e-07 ***
## Year1997:NA_Sales 0.0632868 0.0780757 0.811 0.417617
## Year1998:NA_Sales 0.1077871 0.0778613 1.384 0.166271
## Year1999:NA_Sales 0.2839180 0.0771393 3.681 0.000233 ***
## Year2000:NA_Sales 0.0244292 0.0812812 0.301 0.763760
## Year2001:NA_Sales 0.0979643 0.0766791 1.278 0.201413
## Year2002:NA_Sales 0.1432654 0.0766742 1.868 0.061711 .
## Year2003:NA_Sales 0.0635592 0.0786350 0.808 0.418940
## Year2004:NA_Sales 0.5119741 0.0764488 6.697 2.20e-11 ***
## Year2005:NA_Sales 0.2664725 0.0764702 3.485 0.000494 ***
## Year2006:NA_Sales 0.1054653 0.0751157 1.404 0.160326
## Year2007:NA_Sales 0.1539540 0.0759161 2.028 0.042581 *
## Year2008:NA_Sales 0.1403799 0.0757032 1.854 0.063708 .
## Year2009:NA_Sales 0.1727549 0.0754370 2.290 0.022031 *
## Year2010:NA_Sales 0.0675556 0.0755141 0.895 0.371009
## Year2011:NA_Sales 0.0549339 0.0761355 0.722 0.470595
## Year2012:NA_Sales 0.0964113 0.0767333 1.256 0.208972
## Year2013:NA_Sales 0.0787641 0.0763944 1.031 0.302547
## Year2014:NA_Sales 0.1541182 0.0781807 1.971 0.048705 *
## Year2015:NA_Sales 0.0687939 0.0784864 0.877 0.380768
## Year2016:NA_Sales 0.0360226 0.1125488 0.320 0.748925
## Year2017:NA_Sales NA NA NA NA
## Year2020:NA_Sales NA NA NA NA
## YearN/A:NA_Sales 0.0363229 0.0841897 0.431 0.666154
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2711 on 16519 degrees of freedom
## Multiple R-squared: 0.9698, Adjusted R-squared: 0.9696
## F-statistic: 6789 on 78 and 16519 DF, p-value: < 2.2e-16
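The note "(2 not defined because of singularities)" corresponds to the two NA rows, Year2017:NA_Sales and Year2020:NA_Sales. One common cause is a year with too few games, or with no variation in NA_Sales, to estimate a separate slope. Whether that is the case in vgsales.csv is an assumption that the per-year counts would confirm or refute. A minimal sketch on made-up data (`toy`, `fit_toy`, and `undefined` are hypothetical names):

```r
# Made-up data: 2017 has NA_Sales = 0 for every game, 2020 has a single game
toy <- data.frame(
  Year         = c("2015", "2015", "2015", "2016", "2016", "2016",
                   "2017", "2017", "2020"),
  NA_Sales     = c(0.5, 1.0, 1.5, 0.2, 0.7, 1.1, 0.0, 0.0, 0.3),
  Global_Sales = c(1.0, 2.1, 2.9, 0.5, 1.4, 2.4, 0.02, 0.05, 0.6)
)

fit_toy <- lm(Global_Sales ~ Year * NA_Sales, data = toy)

# Number of games per year: sparse years are the first thing to look for
table(toy$Year)

# Names of the coefficients that lm() could not estimate
undefined <- names(coef(fit_toy))[is.na(coef(fit_toy))]
print(undefined)
```

Running `table(data$Year)` on the real data would show whether 2017 and 2020 are similarly sparse.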
Visualizing the relationship between these variables:
ggplot(data, aes(x = Year, y = Global_Sales, color = NA_Sales, size = EU_Sales)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Global Sales by Year and NA Sales with EU Sales Size",
       x = "Year",
       y = "Global Sales",
       color = "NA Sales",
       size = "EU Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Year * NA_Sales (Interaction Term):
Reason for inclusion: This interaction term captures how the influence of North American sales on global sales may vary by year. For instance, certain years may have seen stronger or weaker North American contributions due to changes in consumer behavior, technological advancements, or economic factors. This term is included to better understand these potential shifts over time.
Adding an interaction term can sometimes introduce multicollinearity, particularly if Year and NA_Sales are highly correlated.
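The summary lists one coefficient per year (Year1981 through YearN/A), which suggests that Year was read as text, because of the "N/A" entries, and entered the model as a categorical variable. If a single linear trend in Year is what is wanted, one option is to convert it to a number first. A minimal sketch on made-up data, where `toy`, `Year_num`, and `fit_num` are hypothetical names and the values are not taken from vgsales.csv:

```r
# Made-up stand-in for a few rows of the sales data
toy <- data.frame(
  Year         = c("2006", "2008", "N/A", "2010", "2012", "2014", "2009", "2011"),
  NA_Sales     = c(1.2, 0.8, 0.1, 2.0, 0.5, 0.9, 1.6, 0.3),
  EU_Sales     = c(0.9, 0.4, 0.0, 1.1, 0.3, 0.6, 0.7, 0.2),
  Global_Sales = c(2.5, 1.4, 0.1, 3.6, 0.9, 1.8, 2.7, 0.6),
  stringsAsFactors = FALSE
)

# Mark "N/A" as missing, then convert the remaining strings to numbers
toy$Year[toy$Year == "N/A"] <- NA
toy$Year_num <- as.numeric(toy$Year)

# lm() drops rows with a missing Year_num by default (na.action = na.omit)
fit_num <- lm(Global_Sales ~ Year_num * NA_Sales + EU_Sales, data = toy)

# One slope for Year and one interaction term, instead of one per year
print(names(coef(fit_num)))
```

With a numeric year the interaction is a single coefficient that describes how the NA_Sales slope changes per year, at the cost of assuming that change is linear.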
EU_Sales (Continuous Variable):
Reason for inclusion: EU_Sales is included to account for the direct impact of European sales on global sales. This variable provides insight into the contribution of the European market independently, helping to complete the picture of regional influences on global sales.
Because Global_Sales includes NA_Sales and EU_Sales as components, these predictors explain it almost by construction, which is consistent with the R-squared of 0.9698 reported above. The regional sales figures are also not independent of one another, so multicollinearity between NA_Sales and EU_Sales could arise.
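One way to gauge how strongly two predictors overlap is their correlation and the variance inflation factor (VIF), computed as 1 / (1 - R^2) from regressing one predictor on the others. The sketch below uses only base R and simulated data. The names `toy`, `r`, `r2`, and `vif_na` are hypothetical, and the simulated relationship between the two columns is an assumption, not a property of vgsales.csv:

```r
set.seed(1)
n <- 200

# Simulated regional sales in which EU sales partly track NA sales
na_sales <- rexp(n)
eu_sales <- 0.6 * na_sales + rexp(n, rate = 3)
toy <- data.frame(NA_Sales = na_sales, EU_Sales = eu_sales)

# Pairwise correlation between the two predictors
r <- cor(toy$NA_Sales, toy$EU_Sales)

# VIF for NA_Sales: regress it on the other predictor and use that R^2
r2 <- summary(lm(NA_Sales ~ EU_Sales, data = toy))$r.squared
vif_na <- 1 / (1 - r2)

print(c(correlation = r, vif = vif_na))
```

A VIF close to 1 indicates little overlap. Larger values mean the coefficient's standard error is inflated by the correlation with the other predictors.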
# Diagnostic plots
par(mfrow=c(2, 2)) # Arrange plots in a 2x2 grid
# Residuals vs Fitted Values Plot
plot(lm_model, which = 1)
# Normal Q-Q Plot
plot(lm_model, which = 2)
## Warning: not plotting observations with leverage one:
## 5958
# Scale-Location Plot
plot(lm_model, which = 3)
## Warning: not plotting observations with leverage one:
## 5958
# Residuals vs Leverage Plot
plot(lm_model, which = 5)
## Warning: not plotting observations with leverage one:
## 5958
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
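The warnings say that observation 5958 has leverage one, meaning the model reproduces that row exactly and its residual is forced to zero. This can happen when a row is the only observation at some factor level. Whether that explains row 5958 is an assumption to check against the data. A minimal sketch on made-up data using `hatvalues()`, where `toy`, `fit_toy`, `h`, and `lev_one` are hypothetical names:

```r
# Made-up data in which 2020 appears only once
toy <- data.frame(
  Year         = c("2015", "2015", "2015", "2016", "2016", "2016", "2020"),
  NA_Sales     = c(0.5, 1.0, 1.5, 0.2, 0.7, 1.1, 0.3),
  Global_Sales = c(1.0, 2.1, 2.9, 0.5, 1.4, 2.4, 0.6)
)

fit_toy <- lm(Global_Sales ~ Year + NA_Sales, data = toy)

# Leverage (diagonal of the hat matrix) for every observation
h <- hatvalues(fit_toy)

# Rows with leverage (numerically) equal to one
lev_one <- which(h > 1 - 1e-8)
print(toy[lev_one, ])
```

On the real model, `data[5958, ]` would show which game the warning refers to.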
Summary of the four diagnostic plots:
Residuals vs Fitted: mild violation, suggesting the model could benefit from additional terms or transformations.
Normal Q-Q: significant issue, as residuals deviate considerably from normality.
Scale-Location: moderate to high violation, indicating non-constant variance.
Residuals vs Leverage: a few points show influence, but they don’t appear to have an overwhelming effect on the model.
Residuals vs Fitted Plot:
Indication: The plot shows a slight curve in the residuals as the fitted values increase, which suggests potential non-linearity in the relationship between the predictors and the response variable.
Severity: The curvature is mild, indicating a mild to moderate departure from linearity. The model might benefit from a transformation or the addition of non-linear terms.
Confidence: Moderate. This plot implies that the linear assumption might not be fully met, though it’s not severely violated.
Normal Q-Q Plot:
Indication: The residuals deviate from the theoretical quantile line, especially at the tails, which suggests non-normality of residuals.
Severity: The deviation is significant, particularly in the extremes, indicating potential issues with the assumption of normality. This can affect the reliability of hypothesis tests and confidence intervals.
Confidence: Low. The significant departure from normality suggests the assumption of normality is not well met.
Scale-Location Plot (Spread-Location):
Indication: The spread of the residuals increases with higher fitted values, forming a fan-like pattern. This suggests heteroscedasticity (non-constant variance in the residuals).
Severity: Moderate to high, as a clear pattern of increasing spread is evident. Heteroscedasticity can lead to inefficient estimates and unreliable standard errors.
Confidence: Low. The assumption of homoscedasticity is not well supported, indicating the model may need adjustment to correct for this.
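A log transformation of the response is one common adjustment for a fan-shaped spread. The sketch below uses simulated data with multiplicative noise, not vgsales.csv. It assumes a strictly positive response, and the helper `spread_trend` is a hypothetical, crude numeric stand-in for the Scale-Location plot:

```r
set.seed(42)
n <- 1000

# Simulated response whose spread grows with its mean (multiplicative noise)
x <- runif(n, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))

fit_raw <- lm(y ~ x)       # response on the original scale
fit_log <- lm(log(y) ~ x)  # response on the log scale

# Correlation between absolute residuals and fitted values:
# clearly positive values indicate a fan-shaped spread
spread_trend <- function(fit) cor(abs(resid(fit)), fitted(fit))

spread_raw <- spread_trend(fit_raw)
spread_log <- spread_trend(fit_log)
print(c(raw = spread_raw, log = spread_log))
```

Whether a log transform is appropriate for Global_Sales would need to be checked by refitting the model and re-examining the diagnostic plots. Note that coefficients on the log scale have a multiplicative interpretation.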
Residuals vs Leverage:
Indication: A few points are labeled (e.g., 18, 10, 48) and fall close to or outside the Cook’s distance lines, indicating they might be influential observations.
Severity: Moderate, as there are a few potential influential points, but they don’t seem to be drastically affecting the overall model.
Confidence: Moderate. These points should be further investigated; however, their influence on the model might be manageable.
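One way to investigate such points is to compute Cook's distance directly and refit the model without the flagged rows. The sketch below uses simulated data with one planted outlier. The 4/n cutoff is a common rule of thumb rather than a fixed standard, and `d`, `cd`, `flagged`, and `fit_drop` are hypothetical names:

```r
set.seed(7)
n <- 50

# Simulated data with a true slope of 2, plus one planted extreme point
d <- data.frame(x = rnorm(n))
d$y <- 2 * d$x + rnorm(n)
d$x[n] <- 10
d$y[n] <- -20

fit <- lm(y ~ x, data = d)

# Cook's distance for each observation, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / n)

# Refit without the most influential row and compare the slopes
fit_drop <- lm(y ~ x, data = d[-which.max(cd), ])
print(rbind(all_rows = coef(fit), without_top = coef(fit_drop)))
```

If the coefficients barely move after the refit, the flagged points are manageable. A large shift means the conclusions depend on a handful of rows.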