Linear Regression

Linear Regression Assumptions:

* Linear relationship between dependent and independent variables.
* Dependent variable has normal distribution across independent variables.
    + This case study has a binary dependent variable - so this assumption is already violated.
* Little or no multicollinearity.
* Little or no auto-correlation.
* No heteroscedasticity.

Linear regression places no bounds on its predictions: fitted values can range from negative to positive infinity. Since our dependent variable is binary, a linear model is not appropriate, because it can predict values that lie outside the two outcome classes.
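To illustrate the point about unbounded predictions, here is a minimal sketch on simulated data (not the BBBC data). The variables `x` and `y` and the coefficient 2 are assumptions chosen only for illustration.

```r
# Simulated data (not the BBBC data): a binary outcome driven by one predictor.
set.seed(1)
x <- rnorm(500)
y <- rbinom(500, size = 1, prob = plogis(2 * x))

# Ordinary least squares on a 0/1 outcome.
fit <- lm(y ~ x)

# Predict at extreme and central values of x; nothing bounds the fitted line.
new_x <- data.frame(x = c(-4, 0, 4))
pred <- predict(fit, newdata = new_x)
print(pred)

# Flag predictions that fall outside the valid probability range [0, 1].
outside <- pred < 0 | pred > 1
print(outside)
```

The predictions at the two extreme values of `x` fall below 0 and above 1, which cannot be read as probabilities of purchase.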

Linear regression might be useful if BBBC were interested in forecasting revenue from a single genre or title, and in identifying which variables most affect that revenue.

LM with all Variables (and ‘Choice’ as numeric)

bbbc_lm_1 <- lm(as.numeric(Choice) ~., data = bbbc_train)
summary(bbbc_lm_1)

## 
## Call:
## lm(formula = as.numeric(Choice) ~ ., data = bbbc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9603 -0.2462 -0.1161  0.1622  1.0588 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.3642284  0.0307411  44.378  < 2e-16 ***
## Gender1          -0.1309205  0.0200303  -6.536 8.48e-11 ***
## Amount_purchased  0.0002736  0.0001110   2.464   0.0138 *  
## Frequency        -0.0090868  0.0021791  -4.170 3.21e-05 ***
## Last_purchase     0.0970286  0.0135589   7.156 1.26e-12 ***
## First_purchase   -0.0020024  0.0018160  -1.103   0.2704    
## P_Child          -0.1262584  0.0164011  -7.698 2.41e-14 ***
## P_Youth          -0.0963563  0.0201097  -4.792 1.81e-06 ***
## P_Cook           -0.1414907  0.0166064  -8.520  < 2e-16 ***
## P_DIY            -0.1352313  0.0197873  -6.834 1.17e-11 ***
## P_Art             0.1178494  0.0194427   6.061 1.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared:  0.2401, Adjusted R-squared:  0.2353 
## F-statistic:  50.2 on 10 and 1589 DF,  p-value: < 2.2e-16
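The intercept of about 1.36 is larger than either value of a 0/1 outcome, which would be consistent with `Choice` being stored as a factor: `as.numeric()` on a factor returns the internal level codes (1 and 2) rather than the labels. The sketch below uses a hypothetical stand-in vector `choice`, assuming levels "0" and "1", to show the behavior.

```r
# Hypothetical stand-in for the Choice column, assumed to be a factor
# with levels "0" and "1" (an assumption, not taken from the data set).
choice <- factor(c("0", "1", "1", "0"))

# as.numeric() on a factor returns the level codes, not the labels.
codes <- as.numeric(choice)
print(codes)

# Converting through character recovers the original 0/1 values.
values <- as.numeric(as.character(choice))
print(values)
```

If this is what happened here, the slope estimates are unaffected; only the intercept is shifted up by 1 relative to a 0/1 coding.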

All predictors except First_purchase (p = 0.27) are significant at the 5% level. Next, we check whether any of them cause multicollinearity problems.

Checking for multicollinearity problems

library(car)  # provides vif()
vif(bbbc_lm_1)

##           Gender Amount_purchased        Frequency    Last_purchase 
##         1.005801         1.248066         3.253860        18.770402 
##   First_purchase          P_Child          P_Youth           P_Cook 
##         9.685333         3.360349         1.775022         3.324928 
##            P_DIY            P_Art 
##         2.016910         2.273771

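Last_purchase has a very high VIF (18.77), well above the common rule-of-thumb threshold of 10, which indicates a multicollinearity problem. First_purchase (9.69) is also close to that threshold. To clean up the model, we remove Last_purchase.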
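For intuition about what a VIF measures, the value for a predictor can be computed by hand as 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below uses simulated predictors `x1`, `x2`, `x3` (assumed names, not the BBBC data), with `x3` deliberately built to be nearly collinear with `x1`.

```r
# Simulated predictors (not the BBBC data); x3 is constructed to be
# highly correlated with x1, while x2 is independent of both.
set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.2)

# VIF for x3: regress it on the other predictors, then apply 1 / (1 - R^2).
r2_x3 <- summary(lm(x3 ~ x1 + x2))$r.squared
vif_x3 <- 1 / (1 - r2_x3)

# VIF for x2, which is unrelated to the other predictors.
r2_x2 <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)

print(c(x3 = vif_x3, x2 = vif_x2))
```

The collinear predictor gets a large VIF while the independent one stays near 1, which mirrors the contrast between Last_purchase and Gender in the output above.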

Remove ‘Last_purchase’ due to high VIF

bbbc_lm_1_novif <- lm(as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
vif(bbbc_lm_1_novif)

##           Gender Amount_purchased        Frequency   First_purchase 
##         1.005634         1.235982         2.651820         7.182666 
##          P_Child          P_Youth           P_Cook            P_DIY 
##         1.949849         1.307915         2.009609         1.457362 
##            P_Art 
##         1.634878

With Last_purchase removed, every VIF is below 10 (the largest is First_purchase at 7.18), so this model is more appropriate to work with. Let’s look at a summary.

LM Model Summary

summary(bbbc_lm_1_novif)

## 
## Call:
## lm(formula = as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0018 -0.2482 -0.1277  0.1567  1.1035 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.3926595  0.0309609  44.981  < 2e-16 ***
## Gender1          -0.1290720  0.0203424  -6.345 2.89e-10 ***
## Amount_purchased  0.0003518  0.0001122   3.135 0.001753 ** 
## Frequency        -0.0157943  0.0019980  -7.905 4.97e-15 ***
## First_purchase    0.0046036  0.0015884   2.898 0.003803 ** 
## P_Child          -0.0502183  0.0126891  -3.958 7.90e-05 ***
## P_Youth          -0.0225339  0.0175326  -1.285 0.198888    
## P_Cook           -0.0667467  0.0131127  -5.090 4.00e-07 ***
## P_DIY            -0.0606486  0.0170835  -3.550 0.000396 ***
## P_Art             0.1916012  0.0167447  11.443  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3847 on 1590 degrees of freedom
## Multiple R-squared:  0.2156, Adjusted R-squared:  0.2111 
## F-statistic: 48.55 on 9 and 1590 DF,  p-value: < 2.2e-16

Diagnostic plots of the reduced model can show whether the linear regression assumptions are violated.

Diagnostic Plots

par(mfrow = c(2,2))
plot(bbbc_lm_1_novif, which = c(1:4))

The diagnostic plots show that the residuals are not normally distributed.
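A numeric check can complement the visual reading of the Q-Q plot. The sketch below is a minimal example on simulated data (not the BBBC data), assuming a single predictor `x` and a binary outcome `y`; it applies the Shapiro-Wilk test from base R to the residuals of a linear fit.

```r
# Simulated data (not the BBBC data): binary outcome, one predictor.
set.seed(7)
x <- rnorm(300)
y <- rbinom(300, size = 1, prob = plogis(x))

# Fit a linear model to the 0/1 outcome and extract its residuals.
fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)
print(sw$p.value)
```

With a binary outcome the residuals split into two bands, one per class, so the test rejects normality. The same call could be applied to `residuals(bbbc_lm_1_novif)`.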