The output that follows was generated using the stats package in base R.
In the lesson that follows, we use the training and testing sets of the CarSeats data, which we import as CS_Train and CS_Test.
CS_Train <- read.csv("CarSeats_Train.csv")
CS_Test <- read.csv("CarSeats_Test.csv")
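The lesson assumes the data were already split into training and testing sets. For reference, a roughly 70/30 split like this one (278 and 122 of the 400 total observations) could be created from a single full data file along the following lines; the file name CarSeats.csv and the seed are assumptions, not part of the original lesson.
CarSeats <- read.csv("CarSeats.csv")       # assumed file name for the full data set
set.seed(1)                                # assumed seed, included only for reproducibility
train_rows <- sample(nrow(CarSeats), 278)  # draw 278 of the 400 rows for training
CS_Train <- CarSeats[train_rows, ]         # training set
CS_Test  <- CarSeats[-train_rows, ]        # remaining rows form the testing set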
We will train our linear regression model using the training dataset, CS_Train. We can view the structure of the data below.
str(CS_Train)
## 'data.frame': 278 obs. of 12 variables:
## $ CompPrice : int 138 117 141 124 132 121 117 122 115 149 ...
## $ Income : int 73 100 64 113 113 78 94 35 28 95 ...
## $ Advertising : int 11 4 3 13 0 9 4 2 11 5 ...
## $ Population : int 276 466 340 501 131 150 503 393 29 400 ...
## $ Price : int 120 97 128 72 124 100 94 136 86 144 ...
## $ ShelveLoc_Medium: int 0 1 0 0 1 0 0 1 0 1 ...
## $ ShelveLoc_Good : int 0 0 0 0 0 0 1 0 1 0 ...
## $ Age : int 42 55 38 78 76 26 50 62 53 76 ...
## $ Education : int 17 14 13 16 17 10 13 18 18 18 ...
## $ Urban : int 1 1 1 0 0 0 1 1 1 0 ...
## $ US : int 1 1 0 1 1 1 1 0 1 0 ...
## $ Sales : num 9.5 7.4 4.15 10.81 4.69 ...
We will evaluate the performance of our predictive model on the testing set, CS_Test. We can view the structure of the data below.
str(CS_Test)
## 'data.frame': 122 obs. of 12 variables:
## $ CompPrice : int 111 113 115 136 132 107 139 131 121 109 ...
## $ Income : int 48 35 105 81 110 117 32 84 41 73 ...
## $ Advertising : int 16 10 0 15 0 11 0 11 5 0 ...
## $ Population : int 260 269 45 425 108 148 176 29 412 454 ...
## $ Price : int 83 80 108 120 124 118 82 96 110 102 ...
## $ ShelveLoc_Medium: int 0 1 1 0 1 0 0 1 1 1 ...
## $ ShelveLoc_Good : int 1 0 0 1 0 1 1 0 0 0 ...
## $ Age : int 65 59 71 67 76 52 54 44 54 65 ...
## $ Education : int 10 12 15 10 10 18 11 17 10 15 ...
## $ Urban : int 1 1 1 1 0 1 0 0 1 1 ...
## $ US : int 1 1 0 1 0 1 0 1 1 0 ...
## $ Sales : num 11.22 10.06 6.63 11.85 6.54 ...
We use the lm() function to fit the linear regression model. The formula Sales ~ . specifies that all other variables in the CS_Train data frame are used to predict Sales. We can view a summary of the fitted model, which includes overall model information, fitted coefficient values, and p-values.
train_mod <- lm(Sales ~ ., data = CS_Train)
mod_sum <- summary(train_mod)
mod_sum
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3525 -0.7223 -0.0214 0.6778 3.4249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8267996 0.6883138 8.465 1.74e-15 ***
## CompPrice 0.0950507 0.0049609 19.160 < 2e-16 ***
## Income 0.0148406 0.0022775 6.516 3.60e-10 ***
## Advertising 0.1198440 0.0135362 8.854 < 2e-16 ***
## Population 0.0002053 0.0004445 0.462 0.645
## Price -0.0963036 0.0032293 -29.822 < 2e-16 ***
## ShelveLoc_Medium 1.8057701 0.1494249 12.085 < 2e-16 ***
## ShelveLoc_Good 4.7429806 0.1858085 25.526 < 2e-16 ***
## Age -0.0456242 0.0038855 -11.742 < 2e-16 ***
## Education -0.0363030 0.0236249 -1.537 0.126
## Urban 0.2035123 0.1375914 1.479 0.140
## US -0.1611668 0.1851281 -0.871 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 266 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8647
## F-statistic: 161.9 on 11 and 266 DF, p-value: < 2.2e-16
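The components of this summary can also be accessed programmatically from mod_sum, which becomes useful when we automate backward elimination later in the lesson. A few examples (these are standard components of a summary.lm object):
mod_sum$coefficients["Price", ]  # estimate, std. error, t value, and p-value for Price
mod_sum$r.squared                # multiple R-squared
mod_sum$adj.r.squared            # adjusted R-squared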
We evaluate model performance on the testing data. We can use the predict() function to make predictions for the test data set using the fitted regression equation. Then, we can compute performance measures such as mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE), all of which capture prediction error. We want low prediction error, and therefore prefer smaller values.
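For a test set with $n$ observations, actual values $y_i$, and predictions $\hat{y}_i$, these measures are defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert\hat{y}_i - y_i\rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}.$$

RMSE is on the same scale as the response (here, Sales), which makes it the easiest of the three to interpret.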
test_preds <- predict(object = train_mod,
                      newdata = CS_Test)
test_perf <- c(MSE = mean((test_preds - CS_Test$Sales)^2),
               MAE = mean(abs(test_preds - CS_Test$Sales)),
               RMSE = sqrt(mean((test_preds - CS_Test$Sales)^2)))
test_perf
## MSE MAE RMSE
## 1.0484738 0.8227436 1.0239501
We want our model to be balanced, meaning it performs similarly well on the training and testing sets, which suggests it will generalize well to unseen data (data it was not trained on). We can evaluate performance on the training and testing sets side by side to assess goodness of fit.
train_perf <- c(MSE = mean(train_mod$residuals^2),
                MAE = mean(abs(train_mod$residuals)),
                RMSE = sqrt(mean(train_mod$residuals^2)))
cbind(train = train_perf, test = test_perf)
## train test
## MSE 1.002329 1.0484738
## MAE 0.801216 0.8227436
## RMSE 1.001164 1.0239501
As shown, the performance is similar on the training and testing sets and the error is low, suggesting the model is a good fit (balanced).
In the model above we used all available variables as predictors to predict the target variable, Sales.
preds <- names(CS_Train)[!names(CS_Train) %in% "Sales"]
preds
## [1] "CompPrice" "Income" "Advertising" "Population"
## [5] "Price" "ShelveLoc_Medium" "ShelveLoc_Good" "Age"
## [9] "Education" "Urban" "US"
As we saw, not all variables were statistically significant predictors of car seat sales. Often (especially when a model is overfitting!) we want to include only the best subset of variables as predictors in the model. Model selection methods for regression include forward selection, backward elimination, and stepwise selection. These methods, also known as variable or feature selection methods, can improve model performance by eliminating unimportant predictors.
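For reference, the stats package also provides the step() function, which automates stepwise selection using AIC rather than p-values. A minimal sketch (not part of the original lesson; its output is not shown):
step_mod <- step(train_mod, direction = "backward", trace = 0)  # backward selection by AIC
summary(step_mod)  # summary of the AIC-selected model
In this lesson, however, we implement backward elimination by hand using p-values.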
In Backward Elimination, we start with the full model and iteratively remove predictors based on their p-values. We first set a threshold value for removal and then repeat the process described below until every remaining predictor has a p-value below the chosen threshold.
The code below automates the backward elimination process. It sets a removal threshold of 0.10 and iterates, refitting the linear regression model and removing the variable with the highest p-value as long as that p-value is greater than or equal to the threshold. For each iteration, the output includes:

- the summary output of the regression model
- the variable considered for removal
- the p-value of the variable under consideration
- the decision (REMOVE/TERMINATE)
counter <- 0           # initialize the iteration counter
cutoff_thresh <- 0.10  # set the threshold for variable removal
ps <- c(1)             # seed the p-value vector so the loop runs at least once
while (ps[1] >= cutoff_thresh) {
  # drop the current candidate (a no-op on the first pass, when ps has no names)
  preds <- preds[!preds %in% names(ps)[1]]
  # refit the model using the remaining predictors
  cslm1 <- lm(Sales ~ ., data = CS_Train[, c(preds, "Sales")])
  it1 <- summary(cslm1)
  # sort the slope p-values from largest to smallest (the intercept is excluded)
  ps <- sort(it1$coefficients[-1, 4], decreasing = TRUE)
  counter <- counter + 1
  print(paste("**------------------------ ITERATION", counter, "------------------------**"))
  print(it1)
  print(paste("Candidate for Removal:", names(ps)[1]))
  print(paste("p-Value:", round(ps[1], 5)))
  dec <- ifelse(ps[1] >= cutoff_thresh, "REMOVE", "*TERMINATE*")
  print(paste("Decision:", dec))
}
## [1] "**------------------------ ITERATION 1 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3525 -0.7223 -0.0214 0.6778 3.4249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.8267996 0.6883138 8.465 1.74e-15 ***
## CompPrice 0.0950507 0.0049609 19.160 < 2e-16 ***
## Income 0.0148406 0.0022775 6.516 3.60e-10 ***
## Advertising 0.1198440 0.0135362 8.854 < 2e-16 ***
## Population 0.0002053 0.0004445 0.462 0.645
## Price -0.0963036 0.0032293 -29.822 < 2e-16 ***
## ShelveLoc_Medium 1.8057701 0.1494249 12.085 < 2e-16 ***
## ShelveLoc_Good 4.7429806 0.1858085 25.526 < 2e-16 ***
## Age -0.0456242 0.0038855 -11.742 < 2e-16 ***
## Education -0.0363030 0.0236249 -1.537 0.126
## Urban 0.2035123 0.1375914 1.479 0.140
## US -0.1611668 0.1851281 -0.871 0.385
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 266 degrees of freedom
## Multiple R-squared: 0.87, Adjusted R-squared: 0.8647
## F-statistic: 161.9 on 11 and 266 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Population"
## [1] "p-Value: 0.64464"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 2 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3929 -0.7170 -0.0468 0.6695 3.4389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.923824 0.654491 9.051 < 2e-16 ***
## CompPrice 0.094899 0.004943 19.199 < 2e-16 ***
## Income 0.014804 0.002273 6.514 3.63e-10 ***
## Advertising 0.121670 0.012927 9.412 < 2e-16 ***
## Price -0.096241 0.003222 -29.873 < 2e-16 ***
## ShelveLoc_Medium 1.802803 0.149066 12.094 < 2e-16 ***
## ShelveLoc_Good 4.742077 0.185524 25.560 < 2e-16 ***
## Age -0.045766 0.003868 -11.833 < 2e-16 ***
## Education -0.037479 0.023453 -1.598 0.111
## Urban 0.199440 0.137106 1.455 0.147
## US -0.177799 0.181322 -0.981 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 267 degrees of freedom
## Multiple R-squared: 0.8699, Adjusted R-squared: 0.8651
## F-statistic: 178.6 on 10 and 267 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: US"
## [1] "p-Value: 0.32769"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 3 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4265 -0.7051 -0.0429 0.6850 3.3315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.873794 0.652453 9.003 < 2e-16 ***
## CompPrice 0.094788 0.004941 19.183 < 2e-16 ***
## Income 0.014553 0.002258 6.445 5.36e-10 ***
## Advertising 0.112907 0.009339 12.090 < 2e-16 ***
## Price -0.096266 0.003221 -29.884 < 2e-16 ***
## ShelveLoc_Medium 1.813780 0.148635 12.203 < 2e-16 ***
## ShelveLoc_Good 4.732666 0.185262 25.546 < 2e-16 ***
## Age -0.045779 0.003867 -11.837 < 2e-16 ***
## Education -0.035396 0.023354 -1.516 0.131
## Urban 0.195267 0.137030 1.425 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.022 on 268 degrees of freedom
## Multiple R-squared: 0.8695, Adjusted R-squared: 0.8651
## F-statistic: 198.4 on 9 and 268 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Urban"
## [1] "p-Value: 0.15532"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 4 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5572 -0.7122 -0.0332 0.6791 3.4080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.943805 0.651845 9.118 < 2e-16 ***
## CompPrice 0.095392 0.004932 19.340 < 2e-16 ***
## Income 0.014750 0.002258 6.532 3.23e-10 ***
## Advertising 0.113339 0.009352 12.120 < 2e-16 ***
## Price -0.096214 0.003227 -29.813 < 2e-16 ***
## ShelveLoc_Medium 1.795588 0.148369 12.102 < 2e-16 ***
## ShelveLoc_Good 4.705205 0.184610 25.487 < 2e-16 ***
## Age -0.045810 0.003875 -11.823 < 2e-16 ***
## Education -0.036306 0.023390 -1.552 0.122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.024 on 269 degrees of freedom
## Multiple R-squared: 0.8685, Adjusted R-squared: 0.8646
## F-statistic: 222.1 on 8 and 269 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Education"
## [1] "p-Value: 0.12179"
## [1] "Decision: REMOVE"
## [1] "**------------------------ ITERATION 5 ------------------------**"
##
## Call:
## lm(formula = Sales ~ ., data = CS_Train[, c(preds, "Sales")])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4852 -0.7098 0.0043 0.7064 3.3593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.488583 0.583660 9.404 < 2e-16 ***
## CompPrice 0.094882 0.004934 19.229 < 2e-16 ***
## Income 0.015037 0.002256 6.664 1.49e-10 ***
## Advertising 0.113800 0.009371 12.144 < 2e-16 ***
## Price -0.096134 0.003235 -29.714 < 2e-16 ***
## ShelveLoc_Medium 1.789020 0.148695 12.031 < 2e-16 ***
## ShelveLoc_Good 4.701091 0.185072 25.401 < 2e-16 ***
## Age -0.046093 0.003880 -11.878 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.027 on 270 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8639
## F-statistic: 252.1 on 7 and 270 DF, p-value: < 2.2e-16
##
## [1] "Candidate for Removal: Income"
## [1] "p-Value: 0"
## [1] "Decision: *TERMINATE*"
As shown above, the backward elimination procedure goes through five iterations before reaching the final model, which retains CompPrice, Income, Advertising, Price, ShelveLoc_Medium, ShelveLoc_Good, and Age as predictors. This final model would then be used for prediction, and its performance would be evaluated on the test set.
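For completeness, a sketch of that evaluation (its output is not part of the original lesson): after the loop terminates, cslm1 holds the final fitted model and preds holds the seven surviving predictors, so test-set performance can be computed just as before.
final_preds <- predict(cslm1, newdata = CS_Test)  # predictions from the reduced model
c(MSE = mean((final_preds - CS_Test$Sales)^2),
  MAE = mean(abs(final_preds - CS_Test$Sales)),
  RMSE = sqrt(mean((final_preds - CS_Test$Sales)^2)))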