Importing necessary libraries
library(tidyverse)
library(ISLR)
library(tidymodels)
library(ggplot2)
library(dplyr)
library(skimr)
library(GGally)
library(ggformula)
This includes (but not limited to):
carseats <- Carseats
head(carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
skim(carseats)
| Name | carseats |
| Number of rows | 400 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| ShelveLoc | 0 | 1 | FALSE | 3 | Med: 219, Bad: 96, Goo: 85 |
| Urban | 0 | 1 | FALSE | 2 | Yes: 282, No: 118 |
| US | 0 | 1 | FALSE | 2 | Yes: 258, No: 142 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Sales | 0 | 1 | 7.50 | 2.82 | 0 | 5.39 | 7.49 | 9.32 | 16.27 | ▁▆▇▃▁ |
| CompPrice | 0 | 1 | 124.97 | 15.33 | 77 | 115.00 | 125.00 | 135.00 | 175.00 | ▁▅▇▃▁ |
| Income | 0 | 1 | 68.66 | 27.99 | 21 | 42.75 | 69.00 | 91.00 | 120.00 | ▇▆▇▆▅ |
| Advertising | 0 | 1 | 6.64 | 6.65 | 0 | 0.00 | 5.00 | 12.00 | 29.00 | ▇▃▃▁▁ |
| Population | 0 | 1 | 264.84 | 147.38 | 10 | 139.00 | 272.00 | 398.50 | 509.00 | ▇▇▇▇▇ |
| Price | 0 | 1 | 115.80 | 23.68 | 24 | 100.00 | 117.00 | 131.00 | 191.00 | ▁▂▇▆▁ |
| Age | 0 | 1 | 53.32 | 16.20 | 25 | 39.75 | 54.50 | 66.00 | 80.00 | ▇▆▇▇▇ |
| Education | 0 | 1 | 13.90 | 2.62 | 10 | 12.00 | 14.00 | 16.00 | 18.00 | ▇▇▃▇▇ |
There are no missing values in our dataset, hence we do not need extra steps to handle the missing values in our dataset.
In the carseats dataset, there are three qualitative
variables (ShelveLoc, Urban, and
US) and eight quantitative variables. Our goal is to build
a multiple linear regression model with at least one qualitative
variable and at least one quantitative variable while also including the
interaction between your qualitative and quantitative explanatory
variables
Our goal is to fit a model like: \[ y = \beta_0 + \beta_1 \times qualitative\_variable + \beta_2 \times quantitative\_variable + \beta_3 \times ( qualitative\_variable \times quantitative\_variable) + \varepsilon \]
In order to choose my qualitative variable, I am visualizing the boxplot between all the qualitative variables and sales.
carseats |>
dplyr::select(US, Urban, ShelveLoc, Sales) |>
ggpairs()
From the boxplot, we can see that ShelveLoc has the
clear distiction in the median values of the three levels good, medium
and bad with a very strong separation between the boxes. Hence, It would
be wise to choose ShelveLoc as our qualitative predictor in
this case.
In order to choose my quantitative variable, I am visualizing the correlation between all the quantitative variables and sales.
carseats |>
dplyr::select(-US, -Urban, -ShelveLoc) |>
ggpairs()
From the above visualization, we can see a comparatively stronger
negative relationship of Price variable with unit
Sales than other quantitative variables. Hence, I am
planning to choose Price as my quantitative predictor in
this case.
The proposed model having Sales as a response variable
and Price and ShelveLoc as predictors along
with the interaction term is:
\[
\widehat{Sales} = \beta_0 + \beta_1 \times price + \beta_2 \times
ShelveLoc_{med} + \beta_3 \times ShelveLoc_{good} + \beta_4 \times
(price \times ShelveLoc_{med}) + \beta_5 \times (price \times
ShelveLoc_{good}) +\varepsilon
\] In the proposed model, I am taking ShelveLoc =
bad as the reference category, i.e
\(ShelveLoc_{med}\) = 1 if Medium, 0 otherwise
\(ShelveLoc_{good}\) = 1 if Good, 0 otherwise
Here, the inclusion of the interaction term gives us three different regression lines depending on the shelve location.
To interpret it easily:
If ShelveLoc is bad then \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 0, hence the model
becomes \[
\widehat{Sales} = \beta_0 + \beta_1 \times price +\varepsilon
\]
If ShelveLoc is mediumthen \(ShelveLoc_{med}\) = 1 and \(ShelveLoc_{good}\) = 0, hence the model
becomes \[
\widehat{Sales} =
(\beta_0 + \beta_2 ) + (\beta_1 + \beta_4) \times price +\varepsilon
\]
If ShelveLoc is good then \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 1, hence the model
becomes \[
\widehat{Sales} =
(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) \times price +\varepsilon
\]
This includes (but not limited to):
Splitting the data into train/test set:
# set seed before random split
set.seed(44)
# put 80% of the data into the training set
carseat <- initial_split(carseats, prop = 0.8)
# assign the two splits to data frames - with descriptive names
carseat_train <- training(carseat)
carseat_test <- testing(carseat)
# splits
dim(carseat_train)
## [1] 320 11
dim(carseat_test)
## [1] 80 11
Model Instantiation
lm_spec <- linear_reg() |>
set_mode("regression") |>
set_engine("lm")
Now using our training dataset i.e carseat_train to fit
our regression model.
#Using carseat_train to fit our linear model
mlr_fit_model <- lm_spec |>
fit(Sales ~ Price * ShelveLoc, data = carseat_train)
#assessing the fit our model
tidy(mlr_fit_model, conf.int =TRUE)
## # A tibble: 6 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.12e+1 1.01 11.1 1.78e-24 9.24 13.2
## 2 Price -5.05e-2 0.00865 -5.84 1.32e- 8 -0.0675 -0.0335
## 3 ShelveLocGood 6.55e+0 1.45 4.51 9.04e- 6 3.69 9.40
## 4 ShelveLocMedium 1.81e+0 1.26 1.44 1.50e- 1 -0.656 4.28
## 5 Price:ShelveLocGood -1.32e-2 0.0123 -1.08 2.82e- 1 -0.0374 0.0109
## 6 Price:ShelveLocMedium -3.69e-4 0.0107 -0.0345 9.73e- 1 -0.0214 0.0207
Medium shelf location and interaction terms between price and Shelf location are not statistically significant because of their larger p-value (>0.05). However, for the purpose of this assignment, I am going to include it for the rest of the part.
Using tidy output, the updated model becomes
\[ \widehat{Sales} = 11.23 - 0.05 \times price + 1.81 \times ShelveLoc_{med} + 6.55 \times ShelveLoc_{good} - 0.00037 \times (price \times ShelveLoc_{med}) - 0.0132 \times (price \times ShelveLoc_{good}) +\varepsilon \]
Interpretation of coefficients:
Sales will be 11.23 when the value of price become 0.Price when
ShelveLoc = Bad i.e. For stores with
Bad shelf location, increasing Price by 1 unit is
associated with a decrease in predicted Sales by 0.05 units.Medium shelf location will have predicted Sales about 1.81
units higher than stores with Bad shelf location.Good shelf location will have predicted Sales about 6.55
units higher than stores with Bad shelf location.Bad shelves, the price effect for Medium
shelves becomes 0.00037 more negative per 1 unit increase in price.Bad shelves, the price effect for Good shelves
becomes 0.0132 more negative per 1 unit increase in price.From the explanation above, we are clear that we will get three
different regression lines for three different levels of category of
ShelveLoc variables and they are:
Bad (reference): \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 0
\[ \widehat{Sales} = 11.23 - 0.05 \times Price + \epsilon\] Medium: \(ShelveLoc_{med}\) = 1 and \(ShelveLoc_{good}\) = 0
\[ \widehat{Sales} = (11.23 + 1.81) + ( - 0.05 - 0.00037) \times Price + \epsilon\] \[ \widehat{Sales} = 13.04 − 0.05037 \times Price + \epsilon\] Good: \(ShelveLoc_{med}\) = 0 and \(ShelveLoc_{good}\) = 1
\[ \widehat{Sales} = (11.23 + 6.55) + ( - 0.05 - 0.0132) \times Price + \epsilon\] \[ \widehat{Sales} = 17.78 − 0.0632 \times Price + \epsilon\]
In summary, we can say that: For Bad shelves, Sales drop
by about 0.05 per $1 increase in Price. For Medium shelves,
Sales drop by about 0.05037 per $1 increase, which is basically the same
as Bad. For Good shelves, Sales drop by about 0.0632 per $1
increase, which is more price-sensitive (sales decrease faster as price
rises).
Lets visualize the regression line:
carseats |>
ggplot(aes(x = Price,
y = Sales,
color = ShelveLoc)) +
geom_point() +
scale_color_viridis_d() + #colorblind friendly
geom_smooth(method = "lm",
se = FALSE) +
labs(y = "Sales (in thousands)",
title = "Sales vs. Price by Shelf Location")
This includes (but not limited to):
glance(mlr_fit_model)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.569 0.562 1.87 83.0 2.51e-55 5 -651. 1316. 1343.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
The \(R^2\) value of 0.5692 suggests
that about 56.92% of the variability in the Sales of the
child carseat across stores is explained by the Price and
the ShelveLoc variable along with thier interaction
term.
train_aug <- augment(mlr_fit_model, new_data = carseat_train)
test_aug <- augment(mlr_fit_model, new_data = carseat_test)
metrics(train_aug, truth = Sales, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 1.85
## 2 rsq standard 0.569
## 3 mae standard 1.50
metrics(test_aug, truth = Sales, estimate = .pred)
## # A tibble: 3 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 2.12
## 2 rsq standard 0.434
## 3 mae standard 1.73
On the training data, the model shows reasonably strong performance, with an \(R^2\) of 0.57, an RMSE of 1.85, and an MAE of 1.50, which suggests that the model explains a substantial portion of the variability in Sales and makes fairly accurate predictions on the data it was trained on. On the other hand, the test data shows declination on the \(R^2\) to 0.43, an increase in the RMSE to 2.12, and MAE to 1.73, which suggests that the model’s predictions are less accurate on unseen data.
Overall, the decrease in performance from training to testing suggests some overfitting, but not much severe. We can say that the model generalizes reasonably well with a lot of room for improvement by removing the interaction term or by taking more relevant features into our consideration.
This includes (but not limited to):
Predictions in this section should be treated as truly new cases; they do not need to come from your test set
Here, I am using my fitted linear model to predict the sales for a new observation
new_obvs <- data.frame(
Price = 125, # quantitative predictor
ShelveLoc = 'Good' #qualitative predictor
)
cbind(
predict(mlr_fit_model,new_obvs, type = "numeric"),
predict(mlr_fit_model,new_obvs,type = "conf_int"),
predict(mlr_fit_model,new_obvs,type = "pred_int"))
## .pred .pred_lower .pred_upper .pred_lower .pred_upper
## 1 9.808379 9.340831 10.27593 6.10195 13.51481
For a new store with a Price of 120 and a
Good shelf location, the predicted value of
Sales is approximately 9.8 units.
The associated 95% confidence interval for the predicted
Sales associated with the given Price and shelf location is
(9.340831 10.27593). This means that, for stores with the specified
Price and shelf location, we are 95% confident that the average
Sales across all such stores lies between approximately
9.34 and 10.28 units.
The associated 95% prediction interval for the predicted
Sales associated with the given Price and shelf location is
(6.10195 13.51481). This means that we are 95% confident that the Sales
for a single new store with the given Price and shelf location will fall
between approximately 6.10 and 13.51 units.
This includes (but not limited to):
Linearity and Variability check
ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
geom_hline(yintercept = 4, linetype = "dashed", color = "red") +
geom_hline(yintercept = -4, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
From the plot of the residuals vs fitted values, we can observe that the residuals are randomly scattered around 0 without any curves or systematic patterns. Additionally, we can also see that the residual spread is fairly consistent across most fitted values with slightly wider spread at mid-range fitted values, but not that extreme. Hence, our assumption of linearity and homoscedasticity is reasonably satisfied.
Normality Check:
ggplot(data = train_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
train_aug |>
gf_qq(~.resid) |>
gf_qqline()
From the histogram and Q-Q plot, we can say that the normality assumption is satisfied by our model as residuals follow the reference line fairly closely, with only minor deviations in the tails.