data <- read.csv("C:\\Users\\gajaw\\OneDrive\\Desktop\\STATS\\vgsales.csv")
library(ggplot2)

Linear Regression Model :

The dataset includes the regional sales columns NA_Sales, EU_Sales, JP_Sales, and Other_Sales, which are likely to have predictive value for Global_Sales. The model below uses two of them, NA_Sales and EU_Sales, together with Year. The interaction term Year * NA_Sales captures how the effect of NA_Sales on the response variable, Global_Sales, changes with Year, and including EU_Sales lets us estimate its separate impact on Global_Sales.

lm_model <- lm(Global_Sales ~ Year * NA_Sales + EU_Sales, data = data)
# Summary of the model
summary(lm_model)
## 
## Call:
## lm(formula = Global_Sales ~ Year * NA_Sales + EU_Sales, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7600 -0.0567 -0.0267  0.0049  7.2127 
## 
## Coefficients: (2 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.0003106  0.1260849  -0.002 0.998034    
## Year1981           0.0012447  0.1382277   0.009 0.992815    
## Year1982           0.0002083  0.1370888   0.002 0.998788    
## Year1983           0.6446984  0.1803457   3.575 0.000352 ***
## Year1984           1.0571472  0.1475912   7.163 8.25e-13 ***
## Year1985           0.5021227  0.1473163   3.408 0.000655 ***
## Year1986           1.0167100  0.1451636   7.004 2.59e-12 ***
## Year1987           0.6835984  0.1539555   4.440 9.04e-06 ***
## Year1988           0.6902886  0.1507286   4.580 4.69e-06 ***
## Year1989           0.5358108  0.1454500   3.684 0.000231 ***
## Year1990           0.6098960  0.1476224   4.131 3.62e-05 ***
## Year1991           0.3243251  0.1346088   2.409 0.015990 *  
## Year1992           0.3852453  0.1348845   2.856 0.004294 ** 
## Year1993           0.3641652  0.1312886   2.774 0.005547 ** 
## Year1994           0.1932130  0.1288156   1.500 0.133654    
## Year1995           0.1828860  0.1275490   1.434 0.151635    
## Year1996           0.0608431  0.1273064   0.478 0.632709    
## Year1997           0.1214210  0.1273025   0.954 0.340200    
## Year1998           0.0634360  0.1270620   0.499 0.617609    
## Year1999           0.0146660  0.1271364   0.115 0.908164    
## Year2000           0.0961292  0.1272081   0.756 0.449849    
## Year2001           0.0256172  0.1268337   0.202 0.839939    
## Year2002           0.0005748  0.1265116   0.005 0.996375    
## Year2003           0.0153952  0.1266050   0.122 0.903217    
## Year2004          -0.0809181  0.1265541  -0.639 0.522574    
## Year2005          -0.0123328  0.1264547  -0.098 0.922309    
## Year2006           0.0556786  0.1263832   0.441 0.659542    
## Year2007           0.0288707  0.1263710   0.228 0.819292    
## Year2008           0.0206608  0.1263194   0.164 0.870079    
## Year2009           0.0078712  0.1263080   0.062 0.950311    
## Year2010           0.0293491  0.1263417   0.232 0.816309    
## Year2011           0.0306796  0.1263749   0.243 0.808189    
## Year2012           0.0495444  0.1265908   0.391 0.695525    
## Year2013           0.0556422  0.1266897   0.439 0.660522    
## Year2014           0.0239083  0.1266857   0.189 0.850314    
## Year2015           0.0354370  0.1266195   0.280 0.779581    
## Year2016           0.0321952  0.1270497   0.253 0.799957    
## Year2017           0.0169773  0.2009853   0.084 0.932683    
## Year2020           0.0235623  0.2936825   0.080 0.936055    
## YearN/A            0.0178767  0.1274410   0.140 0.888445    
## NA_Sales           0.9879569  0.0747291  13.221  < 2e-16 ***
## EU_Sales           1.3736293  0.0081440 168.667  < 2e-16 ***
## Year1981:NA_Sales  0.0011065  0.0929594   0.012 0.990503    
## Year1982:NA_Sales  0.0000518  0.0843339   0.001 0.999510    
## Year1983:NA_Sales -0.3775826  0.2542327  -1.485 0.137514    
## Year1984:NA_Sales -0.0059958  0.0754769  -0.079 0.936684    
## Year1985:NA_Sales  0.2098974  0.0753656   2.785 0.005358 ** 
## Year1986:NA_Sales -0.0419964  0.1015631  -0.414 0.679245    
## Year1987:NA_Sales  0.0605748  0.1306643   0.464 0.642947    
## Year1988:NA_Sales  0.1774437  0.0796724   2.227 0.025950 *  
## Year1989:NA_Sales  0.1804387  0.0756079   2.387 0.017021 *  
## Year1990:NA_Sales  0.1572057  0.0781122   2.013 0.044177 *  
## Year1991:NA_Sales  0.0715706  0.1001192   0.715 0.474709    
## Year1992:NA_Sales  0.2970330  0.0808364   3.674 0.000239 ***
## Year1993:NA_Sales  0.1867357  0.0859708   2.172 0.029864 *  
## Year1994:NA_Sales  0.2692088  0.0850206   3.166 0.001546 ** 
## Year1995:NA_Sales  0.1264236  0.0915715   1.381 0.167421    
## Year1996:NA_Sales  0.3757158  0.0766752   4.900 9.67e-07 ***
## Year1997:NA_Sales  0.0632868  0.0780757   0.811 0.417617    
## Year1998:NA_Sales  0.1077871  0.0778613   1.384 0.166271    
## Year1999:NA_Sales  0.2839180  0.0771393   3.681 0.000233 ***
## Year2000:NA_Sales  0.0244292  0.0812812   0.301 0.763760    
## Year2001:NA_Sales  0.0979643  0.0766791   1.278 0.201413    
## Year2002:NA_Sales  0.1432654  0.0766742   1.868 0.061711 .  
## Year2003:NA_Sales  0.0635592  0.0786350   0.808 0.418940    
## Year2004:NA_Sales  0.5119741  0.0764488   6.697 2.20e-11 ***
## Year2005:NA_Sales  0.2664725  0.0764702   3.485 0.000494 ***
## Year2006:NA_Sales  0.1054653  0.0751157   1.404 0.160326    
## Year2007:NA_Sales  0.1539540  0.0759161   2.028 0.042581 *  
## Year2008:NA_Sales  0.1403799  0.0757032   1.854 0.063708 .  
## Year2009:NA_Sales  0.1727549  0.0754370   2.290 0.022031 *  
## Year2010:NA_Sales  0.0675556  0.0755141   0.895 0.371009    
## Year2011:NA_Sales  0.0549339  0.0761355   0.722 0.470595    
## Year2012:NA_Sales  0.0964113  0.0767333   1.256 0.208972    
## Year2013:NA_Sales  0.0787641  0.0763944   1.031 0.302547    
## Year2014:NA_Sales  0.1541182  0.0781807   1.971 0.048705 *  
## Year2015:NA_Sales  0.0687939  0.0784864   0.877 0.380768    
## Year2016:NA_Sales  0.0360226  0.1125488   0.320 0.748925    
## Year2017:NA_Sales         NA         NA      NA       NA    
## Year2020:NA_Sales         NA         NA      NA       NA    
## YearN/A:NA_Sales   0.0363229  0.0841897   0.431 0.666154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2711 on 16519 degrees of freedom
## Multiple R-squared:  0.9698, Adjusted R-squared:  0.9696 
## F-statistic:  6789 on 78 and 16519 DF,  p-value: < 2.2e-16
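
The coefficient table above has one row per year (Year1981 through YearN/A), which indicates that R treated Year as a categorical variable, most likely because the column contains the string "N/A" and was therefore read as character. If a single linear trend in Year was intended instead, one option is to convert Year to numeric before fitting. The following is a minimal sketch on a made-up data frame; `toy`, `Year_num`, and all values in it are illustrative assumptions, not taken from vgsales.csv.

```r
# Synthetic stand-in for the sales data; every value here is made up.
set.seed(1)
toy <- data.frame(
  Year     = c(as.character(sample(1980:2016, 198, replace = TRUE)), "N/A", "N/A"),
  NA_Sales = round(rexp(200, rate = 2), 2),
  EU_Sales = round(rexp(200, rate = 3), 2),
  stringsAsFactors = FALSE
)
toy$Global_Sales <- toy$NA_Sales + 1.3 * toy$EU_Sales + rnorm(200, sd = 0.1)

# "N/A" is not a number, so as.numeric() turns it into NA (with a warning).
toy$Year_num <- suppressWarnings(as.numeric(toy$Year))

# lm() drops rows with a missing Year_num by default (na.action = na.omit).
fit_num <- lm(Global_Sales ~ Year_num * NA_Sales + EU_Sales, data = toy)

# Five coefficients: intercept, Year_num, NA_Sales, EU_Sales, Year_num:NA_Sales.
names(coef(fit_num))
```

Reading the file with `read.csv(..., na.strings = "N/A")` should have a similar effect at import time. Whether Year is better modeled as numeric or categorical depends on the question being asked, so this is one option rather than a correction.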

Visualizing the relationship between these variables:

ggplot(data, aes(x = Year, y = Global_Sales, color = NA_Sales, size = EU_Sales)) +
  geom_point(alpha = 0.7) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Global Sales by Year and NA Sales with EU Sales Size",
       x = "Year",
       y = "Global Sales",
       color = "NA Sales",
       size = "EU Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

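Year * NA_Sales (Interaction Term):

Because Year enters the model as a categorical variable (one coefficient per year, including an N/A level), the interaction gives NA_Sales a separate slope for each year. The NA_Sales coefficient for the reference year is 0.988, and each Year:NA_Sales coefficient is that year's adjustment to it; for example, the 2004 adjustment is 0.512 (p = 2.20e-11). The Year2017 and Year2020 interaction terms are not defined (NA) because of singularities.

EU_Sales (Continuous Variable):

The EU_Sales coefficient is 1.374 (standard error 0.008, p < 2e-16): holding Year and NA_Sales fixed, each additional unit of EU_Sales is associated with about 1.37 more units of Global_Sales.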

Model Evaluation :

#diagnostic plots
par(mfrow=c(2, 2)) # Arrange plots in a 2x2 grid

# Residuals vs Fitted Values Plot
plot(lm_model, which = 1)

# Normal Q-Q Plot
plot(lm_model, which = 2)
## Warning: not plotting observations with leverage one:
##   5958
# Scale-Location Plot
plot(lm_model, which = 3)
## Warning: not plotting observations with leverage one:
##   5958
# Residuals vs Leverage Plot
plot(lm_model, which = 5)
## Warning: not plotting observations with leverage one:
##   5958
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

Findings:

  • Linearity (Residuals vs Fitted): mild violation, suggesting the model could benefit from additional terms or transformations.

  • Normality (Normal Q-Q): significant issue, as residuals deviate considerably from normality.

  • Constant variance (Scale-Location): moderate to high violation, indicating non-constant variance.

  • Influence (Residuals vs Leverage): a few points show influence, but they don’t appear to have an overwhelming effect on the model.

Residuals vs Fitted Plot:

  • Indication: The plot shows a slight curve in the residuals as the fitted values increase, which suggests potential non-linearity in the relationship between the predictors and the response variable.

  • Severity: The curvature is mild, so the departure from linearity is minor. The model might still benefit from a transformation or the addition of non-linear terms.

  • Confidence: Moderate. This plot implies that the linear assumption might not be fully met, though it’s not severely violated.
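
One way to follow up on the curvature noted above is to add a squared term and compare the two nested models with an F test. The sketch below uses synthetic data with a small built-in curvature; `toy`, `fit_linear`, `fit_quad`, and the choice of NA_Sales as the term to square are assumptions for illustration only.

```r
# Synthetic data with a mild curvature in NA_Sales; all values are made up.
set.seed(2)
n <- 300
toy <- data.frame(NA_Sales = runif(n, 0, 5), EU_Sales = runif(n, 0, 3))
toy$Global_Sales <- 1.0 * toy$NA_Sales + 0.05 * toy$NA_Sales^2 +
  1.3 * toy$EU_Sales + rnorm(n, sd = 0.2)

fit_linear <- lm(Global_Sales ~ NA_Sales + EU_Sales, data = toy)

# I() protects ^ so that it means arithmetic squaring inside a formula.
fit_quad <- update(fit_linear, . ~ . + I(NA_Sales^2))

# Nested-model F test: a small p-value favours keeping the squared term.
cmp <- anova(fit_linear, fit_quad)
print(cmp)
```

The same comparison could be run on the real model by updating `lm_model` in the same way, though which predictor to square there is a modeling decision this sketch does not settle.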

Normal Q-Q Plot :

  • Indication: The residuals deviate from the theoretical quantile line, especially at the tails, which suggests non-normality of residuals.

  • Severity: The deviation is significant, particularly in the extremes, indicating potential issues with the assumption of normality. This can affect the reliability of hypothesis tests and confidence intervals.

  • Confidence: Low. The significant departure from normality suggests the assumption of normality is not well met.

Scale-Location Plot (Spread-Location):

  • Indication: The spread of the residuals increases with higher fitted values, forming a fan-like pattern. This suggests heteroscedasticity (non-constant variance in the residuals).

  • Severity: Moderate to high, as a clear pattern of increasing spread is evident. Heteroscedasticity can lead to inefficient estimates and unreliable standard errors.

  • Confidence: Low. The assumption of homoscedasticity is not well supported, indicating the model may need adjustment to correct for this.
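
The visual impression of a fan shape can be backed by a numeric check. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R, on synthetic data whose error spread grows with NA_Sales by construction; `toy`, `aux`, and the `bp_*` names are illustrative assumptions.

```r
# Synthetic data; the error spread grows with NA_Sales on purpose.
set.seed(3)
n <- 400
toy <- data.frame(NA_Sales = runif(n, 0, 5), EU_Sales = runif(n, 0, 3))
toy$Global_Sales <- toy$NA_Sales + 1.3 * toy$EU_Sales +
  rnorm(n, sd = 0.1 + 0.3 * toy$NA_Sales)

fit <- lm(Global_Sales ~ NA_Sales + EU_Sales, data = toy)

# Auxiliary regression of the squared residuals on the same predictors.
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ NA_Sales + EU_Sales, data = toy)

# Test statistic: n * R^2 of the auxiliary fit, compared to a chi-squared
# distribution with df equal to the number of predictors.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value is evidence against constant variance. If the lmtest package is available, its `bptest()` function offers a packaged version of this test.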

Residuals vs Leverage:

  • Indication: A few points are labeled (e.g., 18, 10, 48) and fall close to or outside the Cook’s distance lines, indicating they might be influential observations.

  • Severity: Moderate, as there are a few potential influential points, but they don’t seem to be drastically affecting the overall model.

  • Confidence: Moderate. These points should be further investigated; however, their influence on the model might be manageable.
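
To investigate the flagged points further, leverage and Cook's distance can be extracted directly rather than read off the plot. The sketch below uses synthetic data with one planted extreme row; `toy`, `flagged`, and the 4/n cutoff are assumptions for illustration (4/n is a common rule of thumb, not a formal test).

```r
# Synthetic data; all values are made up.
set.seed(4)
n <- 100
toy <- data.frame(NA_Sales = runif(n, 0, 2), EU_Sales = runif(n, 0, 1))
toy$Global_Sales <- toy$NA_Sales + 1.3 * toy$EU_Sales + rnorm(n, sd = 0.1)

# Plant one extreme row so there is something to flag.
toy$NA_Sales[1]     <- 20
toy$EU_Sales[1]     <- 0.5
toy$Global_Sales[1] <- 5

fit <- lm(Global_Sales ~ NA_Sales + EU_Sales, data = toy)

lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # Cook's distance of each observation

# Rows whose Cook's distance exceeds the rule-of-thumb cutoff.
flagged <- which(cook > 4 / n)

# The five largest Cook's distances, for manual inspection.
head(sort(cook, decreasing = TRUE), 5)
```

Applied to `lm_model`, the same two calls would list the rows worth inspecting. An observation with leverage one, such as row 5958 in the warnings above, may come out with an undefined Cook's distance and would need to be examined separately.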