bbbc_test <- read.csv("C:/Users/lacyb/Documents/2022 Spring/DA 6813 - DA Applications/Case Studies/Case Study 2/BBBC-Test.csv", header = TRUE)
bbbc_train <- read.csv("C:/Users/lacyb/Documents/2022 Spring/DA 6813 - DA Applications/Case Studies/Case Study 2/BBBC-Train.csv", header = TRUE)
bbbc_train$Observation <- NULL
bbbc_test$Observation <- NULL
bbbc_train$Choice <- as.factor(bbbc_train$Choice)
bbbc_train$Gender <- as.factor(bbbc_train$Gender)
bbbc_test$Choice <- as.factor(bbbc_test$Choice)
bbbc_test$Gender <- as.factor(bbbc_test$Gender)
Linear Regression Assumptions: * Linear relationship between dependent and independent variables. * Dependent variable has normal distribution across independent variables. + This case study has a binary dependent variable - so this assumption is already violated. * Little or no multicollinearity. * Little or no auto-correlation. * No heteroscedasticity.
It is possible to have the dependent variables range from negative to positive infinity in linear regression models. However, since we have a binary dependent variable, it is not appropriate.
Linear Regression might be useful for this case if BBBC were interested in forecasting revenue levels from a single genre or title, and to make predictions on what variables most affect those profits.
bbbc_lm_1 <- lm(as.numeric(Choice) ~., data = bbbc_train)
summary(bbbc_lm_1)
##
## Call:
## lm(formula = as.numeric(Choice) ~ ., data = bbbc_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9603 -0.2462 -0.1161 0.1622 1.0588
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3642284 0.0307411 44.378 < 2e-16 ***
## Gender1 -0.1309205 0.0200303 -6.536 8.48e-11 ***
## Amount_purchased 0.0002736 0.0001110 2.464 0.0138 *
## Frequency -0.0090868 0.0021791 -4.170 3.21e-05 ***
## Last_purchase 0.0970286 0.0135589 7.156 1.26e-12 ***
## First_purchase -0.0020024 0.0018160 -1.103 0.2704
## P_Child -0.1262584 0.0164011 -7.698 2.41e-14 ***
## P_Youth -0.0963563 0.0201097 -4.792 1.81e-06 ***
## P_Cook -0.1414907 0.0166064 -8.520 < 2e-16 ***
## P_DIY -0.1352313 0.0197873 -6.834 1.17e-11 ***
## P_Art 0.1178494 0.0194427 6.061 1.68e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared: 0.2401, Adjusted R-squared: 0.2353
## F-statistic: 50.2 on 10 and 1589 DF, p-value: < 2.2e-16
Several variables are significant. Now checking to see if they will cause multicollinearity problems.
vif(bbbc_lm_1)
## Gender Amount_purchased Frequency Last_purchase
## 1.005801 1.248066 3.253860 18.770402
## First_purchase P_Child P_Youth P_Cook
## 9.685333 3.360349 1.775022 3.324928
## P_DIY P_Art
## 2.016910 2.273771
Last_purchase shows a very high value - meaning there is a multicollinearity problem. To clean the model better, we need to remove it.
bbbc_lm_1_novif <- lm(as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
vif(bbbc_lm_1_novif)
## Gender Amount_purchased Frequency First_purchase
## 1.005634 1.235982 2.651820 7.182666
## P_Child P_Youth P_Cook P_DIY
## 1.949849 1.307915 2.009609 1.457362
## P_Art
## 1.634878
This looks to be more appropriate to work with. Let’s look at a summary.
summary(bbbc_lm_1_novif)
##
## Call:
## lm(formula = as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0018 -0.2482 -0.1277 0.1567 1.1035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3926595 0.0309609 44.981 < 2e-16 ***
## Gender1 -0.1290720 0.0203424 -6.345 2.89e-10 ***
## Amount_purchased 0.0003518 0.0001122 3.135 0.001753 **
## Frequency -0.0157943 0.0019980 -7.905 4.97e-15 ***
## First_purchase 0.0046036 0.0015884 2.898 0.003803 **
## P_Child -0.0502183 0.0126891 -3.958 7.90e-05 ***
## P_Youth -0.0225339 0.0175326 -1.285 0.198888
## P_Cook -0.0667467 0.0131127 -5.090 4.00e-07 ***
## P_DIY -0.0606486 0.0170835 -3.550 0.000396 ***
## P_Art 0.1916012 0.0167447 11.443 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3847 on 1590 degrees of freedom
## Multiple R-squared: 0.2156, Adjusted R-squared: 0.2111
## F-statistic: 48.55 on 9 and 1590 DF, p-value: < 2.2e-16
Diagnostic plots of the initial model can show if there are violations of the linear regression assumptions.
par(mfrow = c(2,2))
plot(bbbc_lm_1_novif, which = c(1:4))
The diagnostic plots show non-normality.
bbbc_lm_1_stepwise <- ols_step_both_p(bbbc_lm_1, pent = 0.05, prem = 0.05, details = F)
bbbc_lm_1_stepwise
##
## Stepwise Selection Summary
## -----------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## -----------------------------------------------------------------------------------------------
## 1 P_Art addition 0.128 0.127 227.4360 1649.2034 0.4046
## 2 Frequency addition 0.170 0.169 142.0340 1572.6125 0.3949
## 3 Gender addition 0.188 0.186 106.5840 1539.7192 0.3908
## 4 P_Cook addition 0.200 0.198 82.3710 1516.8351 0.3879
## 5 P_DIY addition 0.204 0.201 77.1250 1511.8878 0.3871
## 6 Amount_purchased addition 0.208 0.205 70.8160 1505.8838 0.3863
## 7 P_Child addition 0.211 0.208 64.8840 1500.2051 0.3855
## 8 Last_purchase addition 0.228 0.224 31.7060 1467.7006 0.3815
## 9 P_Youth addition 0.239 0.235 10.2160 1446.2388 0.3788
## -----------------------------------------------------------------------------------------------
bbbc_lm_1_step <- lm(as.numeric(Choice) ~ P_Art + Frequency + Gender + P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth, data = bbbc_train)
summary(bbbc_lm_1_step)
##
## Call:
## lm(formula = as.numeric(Choice) ~ P_Art + Frequency + Gender +
## P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase +
## P_Youth, data = bbbc_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9802 -0.2452 -0.1157 0.1655 1.0595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3727367 0.0297590 46.128 < 2e-16 ***
## P_Art 0.1150034 0.0192719 5.967 2.97e-09 ***
## Frequency -0.0110830 0.0012128 -9.138 < 2e-16 ***
## Gender1 -0.1316464 0.0200208 -6.575 6.56e-11 ***
## P_Cook -0.1433497 0.0165218 -8.676 < 2e-16 ***
## P_DIY -0.1365578 0.0197520 -6.914 6.82e-12 ***
## Amount_purchased 0.0002742 0.0001110 2.470 0.0136 *
## P_Child -0.1275991 0.0163571 -7.801 1.11e-14 ***
## Last_purchase 0.0894288 0.0116772 7.658 3.25e-14 ***
## P_Youth -0.0973642 0.0200903 -4.846 1.38e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared: 0.2395, Adjusted R-squared: 0.2352
## F-statistic: 55.63 on 9 and 1590 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(bbbc_lm_1_step, which = c(1:4))
vif(bbbc_lm_1_step)
## P_Art Frequency Gender P_Cook
## 2.233698 1.007855 1.004715 3.290657
## P_DIY Amount_purchased P_Child Last_purchase
## 2.009454 1.248033 3.341880 13.920175
## P_Youth
## 1.771355