Multiple regression — qualitative predictors & interactions

Importing necessary libraries

library(tidyverse)
library(ISLR)
library(tidymodels)
library(ggplot2)
library(dplyr)
library(skimr)
library(GGally)
library(ggformula)
  1. Use the Carseats dataset (can be found in ISLR R package) to predict Sales using other variables. You will build, assess, and interpret a multiple regression model that:

(a) Exploration and Model identification

This includes (but not limited to):

  • Explore the data using summaries and visualizations
  • Identify your chosen predictors (quantitative & qualitative)
  • Justify why they are reasonable for predicting sales.
  • Propose an initial multiple regression model that includes an interaction.
carseats <- Carseats

head(carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
skim(carseats)
Data summary
Name carseats
Number of rows 400
Number of columns 11
_______________________
Column type frequency:
factor 3
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
ShelveLoc 0 1 FALSE 3 Med: 219, Bad: 96, Goo: 85
Urban 0 1 FALSE 2 Yes: 282, No: 118
US 0 1 FALSE 2 Yes: 258, No: 142

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Sales 0 1 7.50 2.82 0 5.39 7.49 9.32 16.27 ▁▆▇▃▁
CompPrice 0 1 124.97 15.33 77 115.00 125.00 135.00 175.00 ▁▅▇▃▁
Income 0 1 68.66 27.99 21 42.75 69.00 91.00 120.00 ▇▆▇▆▅
Advertising 0 1 6.64 6.65 0 0.00 5.00 12.00 29.00 ▇▃▃▁▁
Population 0 1 264.84 147.38 10 139.00 272.00 398.50 509.00 ▇▇▇▇▇
Price 0 1 115.80 23.68 24 100.00 117.00 131.00 191.00 ▁▂▇▆▁
Age 0 1 53.32 16.20 25 39.75 54.50 66.00 80.00 ▇▆▇▇▇
Education 0 1 13.90 2.62 10 12.00 14.00 16.00 18.00 ▇▇▃▇▇

There are no missing values in our dataset, hence we do not need extra steps to handle the missing values in our dataset.

In the carseats dataset, there are three qualitative variables (ShelveLoc, Urban, and US) and eight quantitative variables. Our goal is to build a multiple linear regression model with at least one qualitative variable and at least one quantitative variable while also including the interaction between your qualitative and quantitative explanatory variables

Our goal is to fit a model like: \[ y = \beta_0 + \beta_1 \times qualitative\_variable + \beta_2 \times quantitative\_variable + \beta_3 \times ( qualitative\_variable \times quantitative\_variable) + \varepsilon \]

In order to choose my qualitative variable, I am visualizing the boxplot between all the qualitative variables and sales.

carseats |>
  dplyr::select(US, Urban, ShelveLoc, Sales) |>
  ggpairs()

From the boxplot, we can see that ShelveLoc has the clear distiction in the median values of the three levels good, medium and bad with a very strong separation between the boxes. Hence, It would be wise to choose ShelveLoc as our qualitative predictor in this case.

In order to choose my quantitative variable, I am visualizing the correlation between all the quantitative variables and sales.

carseats |>
  dplyr::select(-US, -Urban, -ShelveLoc) |>
  ggpairs()

From the above visualization, we can see a comparatively stronger negative relationship of Price variable with unit Sales than other quantitative variables. Hence, I am planning to choose Price as my quantitative predictor in this case.

The proposed model having Sales as a response variable and Price and ShelveLoc as predictors along with the interaction term is:

\[ \widehat{Sales} = \beta_0 + \beta_1 \times price + \beta_2 \times ShelveLoc_{med} + \beta_3 \times ShelveLoc_{good} + \beta_4 \times (price \times ShelveLoc_{med}) + \beta_5 \times (price \times ShelveLoc_{good}) +\varepsilon \] In the proposed model, I am taking ShelveLoc = bad as the reference category, i.e

\(ShelveLoc_{med}\) = 1 if Medium, 0 otherwise

\(ShelveLoc_{good}\) = 1 if Good, 0 otherwise

Here, the inclusion of the interaction term gives us three different regression lines depending on the shelve location.

To interpret it easily:

If ShelveLoc is bad then \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 0, hence the model becomes \[ \widehat{Sales} = \beta_0 + \beta_1 \times price +\varepsilon \]

If ShelveLoc is mediumthen \(ShelveLoc_{med}\) = 1 and \(ShelveLoc_{good}\) = 0, hence the model becomes \[ \widehat{Sales} = (\beta_0 + \beta_2 ) + (\beta_1 + \beta_4) \times price +\varepsilon \]

If ShelveLoc is good then \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 1, hence the model becomes \[ \widehat{Sales} = (\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) \times price +\varepsilon \]

(b) Interpretation of the coefficients

This includes (but not limited to):

  • Interpret the coefficient for the quantitative predictor.
  • Interpret the coefficient(s) for the qualitative predictor.
  • Interpret the interaction term.
  • Explain how the relationship between the quantitative predictor and Sales changes across levels of the qualitative predictor.

Splitting the data into train/test set:

# set seed before random split
set.seed(44)

# put 80% of the data into the training set
carseat <- initial_split(carseats, prop = 0.8)

# assign the two splits to data frames - with descriptive names
carseat_train <- training(carseat)
carseat_test <- testing(carseat)

# splits
dim(carseat_train)
## [1] 320  11
dim(carseat_test)
## [1] 80 11

Model Instantiation

lm_spec <- linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

Now using our training dataset i.e carseat_train to fit our regression model.

#Using carseat_train to fit our linear model
mlr_fit_model <- lm_spec |>
  fit(Sales ~ Price * ShelveLoc, data = carseat_train)

#assessing the fit our model
tidy(mlr_fit_model, conf.int =TRUE)
## # A tibble: 6 × 7
##   term                  estimate std.error statistic  p.value conf.low conf.high
##   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)            1.12e+1   1.01      11.1    1.78e-24   9.24     13.2   
## 2 Price                 -5.05e-2   0.00865   -5.84   1.32e- 8  -0.0675   -0.0335
## 3 ShelveLocGood          6.55e+0   1.45       4.51   9.04e- 6   3.69      9.40  
## 4 ShelveLocMedium        1.81e+0   1.26       1.44   1.50e- 1  -0.656     4.28  
## 5 Price:ShelveLocGood   -1.32e-2   0.0123    -1.08   2.82e- 1  -0.0374    0.0109
## 6 Price:ShelveLocMedium -3.69e-4   0.0107    -0.0345 9.73e- 1  -0.0214    0.0207

Medium shelf location and interaction terms between price and Shelf location are not statistically significant because of their larger p-value (>0.05). However, for the purpose of this assignment, I am going to include it for the rest of the part.

Using tidy output, the updated model becomes

\[ \widehat{Sales} = 11.23 - 0.05 \times price + 1.81 \times ShelveLoc_{med} + 6.55 \times ShelveLoc_{good} - 0.00037 \times (price \times ShelveLoc_{med}) - 0.0132 \times (price \times ShelveLoc_{good}) +\varepsilon \]

Interpretation of coefficients:

  1. y_intercept (\(\beta_0\)) = 11.23: For the stores with the bad shelf location, the predicted value of Sales will be 11.23 when the value of price become 0.
  2. Price_coefficient (\(\beta_1\)) = -0.05: This explains the the effect of Price when ShelveLoc = Bad i.e. For stores with Bad shelf location, increasing Price by 1 unit is associated with a decrease in predicted Sales by 0.05 units.
  3. \(ShelveLoc_{med}\)_coefficient (\(\beta_2\)) = 1.81: For the stores having the same price and ignoring interaction for a moment, stores with Medium shelf location will have predicted Sales about 1.81 units higher than stores with Bad shelf location.
  4. \(ShelveLoc_{good}\)_coefficient (\(\beta_3\)) = 6.55: For the stores having the same price and ignoring interaction for a moment, stores with Good shelf location will have predicted Sales about 6.55 units higher than stores with Bad shelf location.
  5. \(price \times ShelveLoc_{med}\)_coefficient (\(\beta_4\))= -0.00037: Compared to Bad shelves, the price effect for Medium shelves becomes 0.00037 more negative per 1 unit increase in price.
  6. \(price \times ShelveLoc_{good}\)_coefficient (\(\beta_5\)) = -0.0132: Compared to Bad shelves, the price effect for Good shelves becomes 0.0132 more negative per 1 unit increase in price.

From the explanation above, we are clear that we will get three different regression lines for three different levels of category of ShelveLoc variables and they are:

Bad (reference): \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 0

\[ \widehat{Sales} = 11.23 - 0.05 \times Price + \epsilon\] Medium: \(ShelveLoc_{med}\) = 1 and \(ShelveLoc_{good}\) = 0

\[ \widehat{Sales} = (11.23 + 1.81) + ( - 0.05 - 0.00037) \times Price + \epsilon\] \[ \widehat{Sales} = 13.04 − 0.05037 \times Price + \epsilon\] Good: \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 1

\[ \widehat{Sales} = (11.23 + 6.55) + ( - 0.05 - 0.0132) \times Price + \epsilon\] \[ \widehat{Sales} = 17.78 − 0.0632 \times Price + \epsilon\]

In summary, we can say that: For Bad shelves, Sales drop by about 0.05 per $1 increase in Price. For Medium shelves, Sales drop by about 0.05037 per $1 increase, which is basically the same as Bad. For Good shelves, Sales drop by about 0.0632 per $1 increase, which is more price-sensitive (sales decrease faster as price rises).

Lets visualize the regression line:

carseats |>
  ggplot(aes(x = Price, 
             y = Sales,  
             color = ShelveLoc)) +
  geom_point() +
  scale_color_viridis_d() + #colorblind friendly
  geom_smooth(method = "lm", 
              se = FALSE) +
  labs(y = "Sales (in thousands)",
       title = "Sales vs. Price by Shelf Location") 

(c) Accuracy of the model

This includes (but not limited to):

  • Evaluate how well the model fits the data using R2
  • Generate predictions on the test data
  • Compare (briefly) training vs. testing performance if relevant
  • Discuss whether the model generalizes well
glance(mlr_fit_model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.569         0.562  1.87      83.0 2.51e-55     5  -651. 1316. 1343.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

The \(R^2\) value of 0.5692 suggests that about 56.92% of the variability in the Sales of the child carseat across stores is explained by the Price and the ShelveLoc variable along with thier interaction term.

train_aug <- augment(mlr_fit_model, new_data = carseat_train)
test_aug <- augment(mlr_fit_model, new_data = carseat_test)
metrics(train_aug, truth = Sales, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       1.85 
## 2 rsq     standard       0.569
## 3 mae     standard       1.50
metrics(test_aug, truth = Sales, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       2.12 
## 2 rsq     standard       0.434
## 3 mae     standard       1.73

On the training data, the model shows reasonably strong performance, with an \(R^2\) of 0.57, an RMSE of 1.85, and an MAE of 1.50, which suggests that the model explains a substantial portion of the variability in Sales and makes fairly accurate predictions on the data it was trained on. On the other hand, the test data shows declination on the \(R^2\) to 0.43, an increase in the RMSE to 2.12, and MAE to 1.73, which suggests that the model’s predictions are less accurate on unseen data.

Overall, the decrease in performance from training to testing suggests some overfitting, but not much severe. We can say that the model generalizes reasonably well with a lot of room for improvement by removing the interaction term or by taking more relevant features into our consideration.

(d) Predictions of new observations

This includes (but not limited to):

  • Use your fitted model to predict sales for at least one new observation
  • Clearly specify the values of the predictors you used (both qualitative and quantitative)
  • Interpret the prediction in words.
  • Include confidence and prediction intervals

Predictions in this section should be treated as truly new cases; they do not need to come from your test set

Here, I am using my fitted linear model to predict the sales for a new observation

new_obvs <- data.frame(
  Price = 125,    # quantitative predictor
  ShelveLoc = 'Good'      #qualitative predictor
)
cbind(
  predict(mlr_fit_model,new_obvs, type = "numeric"),
  predict(mlr_fit_model,new_obvs,type = "conf_int"),
  predict(mlr_fit_model,new_obvs,type = "pred_int"))
##      .pred .pred_lower .pred_upper .pred_lower .pred_upper
## 1 9.808379    9.340831    10.27593     6.10195    13.51481

For a new store with a Price of 120 and a Good shelf location, the predicted value of Sales is approximately 9.8 units.

The associated 95% confidence interval for the predicted Sales associated with the given Price and shelf location is (9.340831 10.27593). This means that, for stores with the specified Price and shelf location, we are 95% confident that the average Sales across all such stores lies between approximately 9.34 and 10.28 units.

The associated 95% prediction interval for the predicted Sales associated with the given Price and shelf location is (6.10195 13.51481). This means that we are 95% confident that the Sales for a single new store with the given Price and shelf location will fall between approximately 6.10 and 13.51 units.

(e) Model Assessment

This includes (but not limited to):

  • Assess model assumptions using residual plots, normality checks (if appropriate)
  • Comment on whether assumptions appear reasonable

Linearity and Variability check

ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 4, linetype = "dashed", color = "red") +
  geom_hline(yintercept = -4, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

From the plot of the residuals vs fitted values, we can observe that the residuals are randomly scattered around 0 without any curves or systematic patterns. Additionally, we can also see that the residual spread is fairly consistent across most fitted values with slightly wider spread at mid-range fitted values, but not that extreme. Hence, our assumption of linearity and homoscedasticity is reasonably satisfied.

Normality Check:

ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

train_aug |>
  gf_qq(~.resid) |>
  gf_qqline()

From the histogram and Q-Q plot, we can say that the normality assumption is satisfied by our model as residuals follow the reference line fairly closely, with only minor deviations in the tails.