The output that follows was generated using the stats package in base R.
In the lesson that follows, we use the training and testing sets of the CarSeats data, which we import as CS_Train and CS_Test.
CS_Train <- read.csv("CarSeats_Train.csv")
CS_Test <- read.csv("CarSeats_Test.csv")
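The lesson assumes the data were already split into training and testing sets. For reference, a roughly 70/30 split like this one (278 and 122 of the 400 total observations) could be created from a single full data file along the following lines; the file name CarSeats.csv and the seed are assumptions, not part of the original lesson.
CarSeats <- read.csv("CarSeats.csv")       # assumed file name for the full data set
set.seed(1)                                # assumed seed, included only for reproducibility
train_rows <- sample(nrow(CarSeats), 278)  # draw 278 of the 400 rows for training
CS_Train <- CarSeats[train_rows, ]         # training set
CS_Test  <- CarSeats[-train_rows, ]        # remaining rows form the testing set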
We will train our linear regression model using the training dataset, CS_Train. We can view the structure of the data below.
str(CS_Train)
## 'data.frame': 278 obs. of 12 variables:
## $ CompPrice : int 138 117 141 124 132 121 117 122 115 149 ...
## $ Income : int 73 100 64 113 113 78 94 35 28 95 ...
## $ Advertising : int 11 4 3 13 0 9 4 2 11 5 ...
## $ Population : int 276 466 340 501 131 150 503 393 29 400 ...
## $ Price : int 120 97 128 72 124 100 94 136 86 144 ...
## $ ShelveLoc_Medium: int 0 1 0 0 1 0 0 1 0 1 ...
## $ ShelveLoc_Good : int 0 0 0 0 0 0 1 0 1 0 ...
## $ Age : int 42 55 38 78 76 26 50 62 53 76 ...
## $ Education : int 17 14 13 16 17 10 13 18 18 18 ...
## $ Urban : int 1 1 1 0 0 0 1 1 1 0 ...
## $ US : int 1 1 0 1 1 1 1 0 1 0 ...
## $ Sales : num 9.5 7.4 4.15 10.81 4.69 ...
We will evaluate the performance of our predictive model on the testing set, CS_Test. We can view the structure of the data below.
str(CS_Test)
## 'data.frame': 122 obs. of 12 variables:
## $ CompPrice : int 111 113 115 136 132 107 139 131 121 109 ...
## $ Income : int 48 35 105 81 110 117 32 84 41 73 ...
## $ Advertising : int 16 10 0 15 0 11 0 11 5 0 ...
## $ Population : int 260 269 45 425 108 148 176 29 412 454 ...
## $ Price : int 83 80 108 120 124 118 82 96 110 102 ...
## $ ShelveLoc_Medium: int 0 1 1 0 1 0 0 1 1 1 ...
## $ ShelveLoc_Good : int 1 0 0 1 0 1 1 0 0 0 ...
## $ Age : int 65 59 71 67 76 52 54 44 54 65 ...
## $ Education : int 10 12 15 10 10 18 11 17 10 15 ...
## $ Urban : int 1 1 1 1 0 1 0 0 1 1 ...
## $ US : int 1 1 0 1 0 1 0 1 1 0 ...
## $ Sales : num 11.22 10.06 6.63 11.85 6.54 ...
We use the lm() function to fit the linear regression model. The formula Sales ~ . specifies that all other variables in the CS_Train data frame are used to predict Sales. We can view a summary of the fitted model, which includes overall model information, fitted coefficient values, and p-values.
train_mod <- lm(Sales ~ ., data = CS_Train)
mod_sum <- summary(train_mod)
mod_sum
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3525 -0.7223 -0.0214 0.6778 3.4249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8267996 0.6883138 8.465 1.74e-15 ***
## CompPrice 0.0950507 0.0049609 19.160 < 2e-16 ***
## Income 0.0148406 0.0022775 6.516 3.60e-10 ***
## Advertising 0.1198440 0.0135362 8.854 < 2e-16 ***
## Population 0.0002053 0.0004445 0.462 0.645
## Price -0.0963036 0.0032293 -29.822 < 2e-16 ***
## ShelveLoc_Medium 1.8057701 0.1494249 12.085 < 2e-16 ***
## ShelveLoc_Good 4.7429806 0.1858085 25.526 < 2e-16 ***
## Age -0.0456242 0.0038855 -11.742 < 2e-16 ***
## Education -0.0363030 0.0236249 -1.537 0.126
## Urban 0.2035123 0.1375914 1.479 0.140
## US -0.1611668 0.1851281 -0.871 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 266 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8647
## F-statistic: 161.9 on 11 and 266 DF, p-value: < 2.2e-16
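The components of this summary can also be accessed programmatically from mod_sum, which becomes useful when we automate backward elimination later in the lesson. A few examples (these are standard components of a summary.lm object):
mod_sum$coefficients["Price", ]  # estimate, std. error, t value, and p-value for Price
mod_sum$r.squared                # multiple R-squared
mod_sum$adj.r.squared            # adjusted R-squared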
We evaluate model performance on the testing data. We can use the predict() function to make predictions for the test data set using the fitted regression equation. Then, we can compute performance measures such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), all of which capture prediction error. We want low prediction error, and therefore prefer smaller values.
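For a test set with $n$ observations, actual values $y_i$, and predictions $\hat{y}_i$, these measures are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert\hat{y}_i - y_i\rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}.$$

RMSE is on the same scale as the response (here, Sales), which makes it the easiest of the three to interpret.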
test_preds <- predict(object = train_mod,
                      newdata = CS_Test)
test_perf <- c(MSE = mean((test_preds - CS_Test$Sales)^2),
               MAE = mean(abs(test_preds - CS_Test$Sales)),
               RMSE = sqrt(mean((test_preds - CS_Test$Sales)^2)))
test_perf
## MSE MAE RMSE
## 1.0484738 0.8227436 1.0239501
We want our model to be balanced, meaning it performs similarly well on the training and testing sets, which suggests it will generalize well to unseen data (data it was not trained on). We can evaluate performance on the training and testing sets side by side to assess goodness of fit.
train_perf <- c(MSE = mean(train_mod$residuals^2),
                MAE = mean(abs(train_mod$residuals)),
                RMSE = sqrt(mean(train_mod$residuals^2)))
cbind(train = train_perf, test = test_perf)
## train test
## MSE 1.002329 1.0484738
## MAE 0.801216 0.8227436
## RMSE 1.001164 1.0239501
As shown, the performance is similar on the training and testing sets and the error is low, suggesting the model is a good fit (balanced).
In the model above we used all available variables as predictors to predict the target variable, Sales.
preds <- names(CS_Train)[!names(CS_Train) %in% "Sales"]
preds
## [1] "CompPrice" "Income" "Advertising" "Population"
## [5] "Price" "ShelveLoc_Medium" "ShelveLoc_Good" "Age"
## [9] "Education" "Urban" "US"
As we saw, not all variables were statistically significant predictors of car seat sales. Often (especially when a model is overfitting!) we want to include only the best subset of variables as predictors in the model. Model selection methods for regression include forward selection, backward elimination, and stepwise selection. These methods, also known as variable or feature selection methods, can improve model performance by eliminating unimportant predictors.
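For reference, the stats package also provides the step() function, which automates stepwise selection using AIC rather than p-values. A minimal sketch (not part of the original lesson; its output is not shown):
step_mod <- step(train_mod, direction = "backward", trace = 0)  # backward selection by AIC
summary(step_mod)  # summary of the AIC-selected model
In this lesson, however, we implement backward elimination by hand using p-values.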
In Backward Elimination, we start with the full model and iteratively remove predictors based on their p-values. We first set a threshold value for removal and then repeat the process described below until every remaining predictor has a p-value below the chosen threshold.
The code below automates the backward elimination process. It sets a removal threshold of 0.10 and iterates, refitting the linear regression model and removing the variable with the highest p-value as long as that p-value is greater than or equal to the threshold. For each iteration, the output includes:

- the summary output of the regression model
- the variable considered for removal
- the p-value of the variable under consideration
- the decision (REMOVE/TERMINATE)
counter <- 0           # initialize the iteration counter
cutoff_thresh <- 0.10  # set the threshold for variable removal
ps <- c(1)             # seed the p-value vector so the loop runs at least once
while (ps[1] >= cutoff_thresh) {
  # drop the current candidate (a no-op on the first pass, when ps has no names)
  preds <- preds[!preds %in% names(ps)[1]]
  # refit the model using the remaining predictors
  cslm1 <- lm(Sales ~ ., data = CS_Train[, c(preds, "Sales")])
  it1 <- summary(cslm1)
  # sort the slope p-values from largest to smallest (the intercept is excluded)
  ps <- sort(it1$coefficients[-1, 4], decreasing = TRUE)
  counter <- counter + 1
  print(paste("**------------------------ ITERATION", counter, "------------------------**"))
  print(it1)
  print(paste("Candidate for Removal:", names(ps)[1]))
  print(paste("p-Value:", round(ps[1], 5)))
  dec <- ifelse(ps[1] >= cutoff_thresh, "REMOVE", "*TERMINATE*")
  print(paste("Decision:", dec))
}
## [1] "**------------------------ ITERATION 1 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3525 -0.7223 -0.0214 0.6778 3.4249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8267996 0.6883138 8.465 1.74e-15 ***
## CompPrice 0.0950507 0.0049609 19.160 < 2e-16 ***
## Income 0.0148406 0.0022775 6.516 3.60e-10 ***
## Advertising 0.1198440 0.0135362 8.854 < 2e-16 ***
## Population 0.0002053 0.0004445 0.462 0.645
## Price -0.0963036 0.0032293 -29.822 < 2e-16 ***
## ShelveLoc_Medium 1.8057701 0.1494249 12.085 < 2e-16 ***
## ShelveLoc_Good 4.7429806 0.1858085 25.526 < 2e-16 ***
## Age -0.0456242 0.0038855 -11.742 < 2e-16 ***
## Education -0.0363030 0.0236249 -1.537 0.126
## Urban 0.2035123 0.1375914 1.479 0.140
## US -0.1611668 0.1851281 -0.871 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 266 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8647
## F-statistic: 161.9 on 11 and 266 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Population"
## [1] "p-Value: 0.64464"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 2 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3929 -0.7170 -0.0468 0.6695 3.4389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.923824 0.654491 9.051 < 2e-16 ***
## CompPrice 0.094899 0.004943 19.199 < 2e-16 ***
## Income 0.014804 0.002273 6.514 3.63e-10 ***
## Advertising 0.121670 0.012927 9.412 < 2e-16 ***
## Price -0.096241 0.003222 -29.873 < 2e-16 ***
## ShelveLoc_Medium 1.802803 0.149066 12.094 < 2e-16 ***
## ShelveLoc_Good 4.742077 0.185524 25.560 < 2e-16 ***
## Age -0.045766 0.003868 -11.833 < 2e-16 ***
## Education -0.037479 0.023453 -1.598 0.111
## Urban 0.199440 0.137106 1.455 0.147
## US -0.177799 0.181322 -0.981 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 267 degrees of freedom
## Multiple R-squared: 0.8699, Adjusted R-squared: 0.8651
## F-statistic: 178.6 on 10 and 267 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: US"
## [1] "p-Value: 0.32769"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 3 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4265 -0.7051 -0.0429 0.6850 3.3315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.873794 0.652453 9.003 < 2e-16 ***
## CompPrice 0.094788 0.004941 19.183 < 2e-16 ***
## Income 0.014553 0.002258 6.445 5.36e-10 ***
## Advertising 0.112907 0.009339 12.090 < 2e-16 ***
## Price -0.096266 0.003221 -29.884 < 2e-16 ***
## ShelveLoc_Medium 1.813780 0.148635 12.203 < 2e-16 ***
## ShelveLoc_Good 4.732666 0.185262 25.546 < 2e-16 ***
## Age -0.045779 0.003867 -11.837 < 2e-16 ***
## Education -0.035396 0.023354 -1.516 0.131
## Urban 0.195267 0.137030 1.425 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 268 degrees of freedom
## Multiple R-squared: 0.8695, Adjusted R-squared: 0.8651
## F-statistic: 198.4 on 9 and 268 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Urban"
## [1] "p-Value: 0.15532"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 4 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5572 -0.7122 -0.0332 0.6791 3.4080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.943805 0.651845 9.118 < 2e-16 ***
## CompPrice 0.095392 0.004932 19.340 < 2e-16 ***
## Income 0.014750 0.002258 6.532 3.23e-10 ***
## Advertising 0.113339 0.009352 12.120 < 2e-16 ***
## Price -0.096214 0.003227 -29.813 < 2e-16 ***
## ShelveLoc_Medium 1.795588 0.148369 12.102 < 2e-16 ***
## ShelveLoc_Good 4.705205 0.184610 25.487 < 2e-16 ***
## Age -0.045810 0.003875 -11.823 < 2e-16 ***
## Education -0.036306 0.023390 -1.552 0.122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 269 degrees of freedom
## Multiple R-squared: 0.8685, Adjusted R-squared: 0.8646
## F-statistic: 222.1 on 8 and 269 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Education"
## [1] "p-Value: 0.12179"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 5 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4852 -0.7098 0.0043 0.7064 3.3593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.488583 0.583660 9.404 < 2e-16 ***
## CompPrice 0.094882 0.004934 19.229 < 2e-16 ***
## Income 0.015037 0.002256 6.664 1.49e-10 ***
## Advertising 0.113800 0.009371 12.144 < 2e-16 ***
## Price -0.096134 0.003235 -29.714 < 2e-16 ***
## ShelveLoc_Medium 1.789020 0.148695 12.031 < 2e-16 ***
## ShelveLoc_Good 4.701091 0.185072 25.401 < 2e-16 ***
## Age -0.046093 0.003880 -11.878 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.027 on 270 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8639
## F-statistic: 252.1 on 7 and 270 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Income"
## [1] "p-Value: 0"
## [1] "Decision: *TERMINATE*"
As shown above, the backward elimination procedure goes through five iterations before reaching the final model, which retains CompPrice, Income, Advertising, Price, ShelveLoc_Medium, ShelveLoc_Good, and Age as predictors. This final model would then be used for prediction, and its performance would be evaluated on the test set.
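For completeness, a sketch of that evaluation (its output is not part of the original lesson): after the loop terminates, cslm1 holds the final fitted model and preds holds the seven surviving predictors, so test-set performance can be computed just as before.
final_preds <- predict(cslm1, newdata = CS_Test)  # predictions from the reduced model
c(MSE = mean((final_preds - CS_Test$Sales)^2),
  MAE = mean(abs(final_preds - CS_Test$Sales)),
  RMSE = sqrt(mean((final_preds - CS_Test$Sales)^2)))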