Reading Data

bbbc_test <- read.csv("C:/Users/lacyb/Documents/2022 Spring/DA 6813 - DA Applications/Case Studies/Case Study 2/BBBC-Test.csv", header = TRUE)
bbbc_train <- read.csv("C:/Users/lacyb/Documents/2022 Spring/DA 6813 - DA Applications/Case Studies/Case Study 2/BBBC-Train.csv", header = TRUE)

Removing ‘Observation’ & Setting ‘Choice’ & ‘Gender’ as Factors

bbbc_train$Observation <- NULL
bbbc_test$Observation <- NULL 
bbbc_train$Choice <- as.factor(bbbc_train$Choice)
bbbc_train$Gender <- as.factor(bbbc_train$Gender)
bbbc_test$Choice <- as.factor(bbbc_test$Choice)
bbbc_test$Gender <- as.factor(bbbc_test$Gender)

Linear Regression

Linear Regression Assumptions: * Linear relationship between dependent and independent variables. * Dependent variable has normal distribution across independent variables. + This case study has a binary dependent variable - so this assumption is already violated. * Little or no multicollinearity. * Little or no auto-correlation. * No heteroscedasticity.

It is possible to have the dependent variables range from negative to positive infinity in linear regression models. However, since we have a binary dependent variable, it is not appropriate.

Linear Regression might be useful for this case if BBBC were interested in forecasting revenue levels from a single genre or title, and to make predictions on what variables most affect those profits.

LM with all Variables (and ‘Choice’ as numeric)

bbbc_lm_1 <- lm(as.numeric(Choice) ~., data = bbbc_train)
summary(bbbc_lm_1)
## 
## Call:
## lm(formula = as.numeric(Choice) ~ ., data = bbbc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9603 -0.2462 -0.1161  0.1622  1.0588 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.3642284  0.0307411  44.378  < 2e-16 ***
## Gender1          -0.1309205  0.0200303  -6.536 8.48e-11 ***
## Amount_purchased  0.0002736  0.0001110   2.464   0.0138 *  
## Frequency        -0.0090868  0.0021791  -4.170 3.21e-05 ***
## Last_purchase     0.0970286  0.0135589   7.156 1.26e-12 ***
## First_purchase   -0.0020024  0.0018160  -1.103   0.2704    
## P_Child          -0.1262584  0.0164011  -7.698 2.41e-14 ***
## P_Youth          -0.0963563  0.0201097  -4.792 1.81e-06 ***
## P_Cook           -0.1414907  0.0166064  -8.520  < 2e-16 ***
## P_DIY            -0.1352313  0.0197873  -6.834 1.17e-11 ***
## P_Art             0.1178494  0.0194427   6.061 1.68e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 1589 degrees of freedom
## Multiple R-squared:  0.2401, Adjusted R-squared:  0.2353 
## F-statistic:  50.2 on 10 and 1589 DF,  p-value: < 2.2e-16

Several variables are significant. Now checking to see if they will cause multicollinearity problems.

Checking for multicollinearity problems

vif(bbbc_lm_1)
##           Gender Amount_purchased        Frequency    Last_purchase 
##         1.005801         1.248066         3.253860        18.770402 
##   First_purchase          P_Child          P_Youth           P_Cook 
##         9.685333         3.360349         1.775022         3.324928 
##            P_DIY            P_Art 
##         2.016910         2.273771

Last_purchase shows a very high value - meaning there is a multicollinearity problem. To clean the model better, we need to remove it.

Remove "Last_purchase’ due to high value

bbbc_lm_1_novif <- lm(as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
vif(bbbc_lm_1_novif)
##           Gender Amount_purchased        Frequency   First_purchase 
##         1.005634         1.235982         2.651820         7.182666 
##          P_Child          P_Youth           P_Cook            P_DIY 
##         1.949849         1.307915         2.009609         1.457362 
##            P_Art 
##         1.634878

This looks to be more appropriate to work with. Let’s look at a summary.

LM Model Summary

summary(bbbc_lm_1_novif)
## 
## Call:
## lm(formula = as.numeric(Choice) ~ . - Last_purchase, data = bbbc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0018 -0.2482 -0.1277  0.1567  1.1035 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.3926595  0.0309609  44.981  < 2e-16 ***
## Gender1          -0.1290720  0.0203424  -6.345 2.89e-10 ***
## Amount_purchased  0.0003518  0.0001122   3.135 0.001753 ** 
## Frequency        -0.0157943  0.0019980  -7.905 4.97e-15 ***
## First_purchase    0.0046036  0.0015884   2.898 0.003803 ** 
## P_Child          -0.0502183  0.0126891  -3.958 7.90e-05 ***
## P_Youth          -0.0225339  0.0175326  -1.285 0.198888    
## P_Cook           -0.0667467  0.0131127  -5.090 4.00e-07 ***
## P_DIY            -0.0606486  0.0170835  -3.550 0.000396 ***
## P_Art             0.1916012  0.0167447  11.443  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3847 on 1590 degrees of freedom
## Multiple R-squared:  0.2156, Adjusted R-squared:  0.2111 
## F-statistic: 48.55 on 9 and 1590 DF,  p-value: < 2.2e-16

Diagnostic plots of the initial model can show if there are violations of the linear regression assumptions.

Diagnostic Plots

par(mfrow = c(2,2))
plot(bbbc_lm_1_novif, which = c(1:4))

The diagnostic plots show non-normality.

Linear Regression with Stepwise Selection (based on p-value) with selection criteria of 0.05

bbbc_lm_1_stepwise <- ols_step_both_p(bbbc_lm_1, pent = 0.05, prem = 0.05, details = F)
bbbc_lm_1_stepwise
## 
##                                   Stepwise Selection Summary                                    
## -----------------------------------------------------------------------------------------------
##                              Added/                   Adj.                                         
## Step        Variable        Removed     R-Square    R-Square      C(p)         AIC        RMSE     
## -----------------------------------------------------------------------------------------------
##    1         P_Art          addition       0.128       0.127    227.4360    1649.2034    0.4046    
##    2       Frequency        addition       0.170       0.169    142.0340    1572.6125    0.3949    
##    3         Gender         addition       0.188       0.186    106.5840    1539.7192    0.3908    
##    4         P_Cook         addition       0.200       0.198     82.3710    1516.8351    0.3879    
##    5         P_DIY          addition       0.204       0.201     77.1250    1511.8878    0.3871    
##    6    Amount_purchased    addition       0.208       0.205     70.8160    1505.8838    0.3863    
##    7        P_Child         addition       0.211       0.208     64.8840    1500.2051    0.3855    
##    8     Last_purchase      addition       0.228       0.224     31.7060    1467.7006    0.3815    
##    9        P_Youth         addition       0.239       0.235     10.2160    1446.2388    0.3788    
## -----------------------------------------------------------------------------------------------

Create Stepwise LM

bbbc_lm_1_step <- lm(as.numeric(Choice) ~ P_Art + Frequency + Gender + P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase + P_Youth, data = bbbc_train)
summary(bbbc_lm_1_step)
## 
## Call:
## lm(formula = as.numeric(Choice) ~ P_Art + Frequency + Gender + 
##     P_Cook + P_DIY + Amount_purchased + P_Child + Last_purchase + 
##     P_Youth, data = bbbc_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9802 -0.2452 -0.1157  0.1655  1.0595 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.3727367  0.0297590  46.128  < 2e-16 ***
## P_Art             0.1150034  0.0192719   5.967 2.97e-09 ***
## Frequency        -0.0110830  0.0012128  -9.138  < 2e-16 ***
## Gender1          -0.1316464  0.0200208  -6.575 6.56e-11 ***
## P_Cook           -0.1433497  0.0165218  -8.676  < 2e-16 ***
## P_DIY            -0.1365578  0.0197520  -6.914 6.82e-12 ***
## Amount_purchased  0.0002742  0.0001110   2.470   0.0136 *  
## P_Child          -0.1275991  0.0163571  -7.801 1.11e-14 ***
## Last_purchase     0.0894288  0.0116772   7.658 3.25e-14 ***
## P_Youth          -0.0973642  0.0200903  -4.846 1.38e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3788 on 1590 degrees of freedom
## Multiple R-squared:  0.2395, Adjusted R-squared:  0.2352 
## F-statistic: 55.63 on 9 and 1590 DF,  p-value: < 2.2e-16

Interpretation:

  • We have enough statistical evidence to reject the null hypothesis that all model coefficients equal zero.
  • We can interpret the following from the significant variables:
    • P_Art : A customer who purchases more art books may purchase more copies of The Art History of Florence.
    • Frequency : A customer who purchases more books may purchase fewer copes of The Art History of Florence.
    • Gender1 (male) : Men on average may purchase fewer copes of The Art History of Florence.
    • P_Cook : A customer who purchases more cooking books may purchase fewer copies of The Art History of Florence.
    • P_DIY : A customer who purchases more DIY books may purchase fewer copies of The Art History of Florence.
    • Amount_purchased : A customer who spends more on books may purchase more copies of The Art History of Florence.
    • P_Child : A customer who purchases more children’s books may purchase fewer copies of The Art History of Florence.
    • Last_purchase : A customer who hasn’t purchased any books in a while may purchase more copies of The Art History of Florence.
    • P_Youth : A customer who purchases more youth books may purchase fewer copies of The Art History of Florence.

Diagnostic Plots

par(mfrow = c(2,2))
plot(bbbc_lm_1_step, which = c(1:4))

Check VIF

vif(bbbc_lm_1_step)
##            P_Art        Frequency           Gender           P_Cook 
##         2.233698         1.007855         1.004715         3.290657 
##            P_DIY Amount_purchased          P_Child    Last_purchase 
##         2.009454         1.248033         3.341880        13.920175 
##          P_Youth 
##         1.771355

More Interpretation:

  • Last_purchase has a very high VIF value & the QQPlot shows non-normal trace in the data.
  • Linear Regression is NOT a good method to provide inferences in this model to predict if a customer will buy a book. It does not take into account the binary variable “Choice”. While a linear regression model could be helpful in determining total revenue trends of P_Art, it is not a useful or appropriate choice for classification or prediction.