Let’s import necessary libraries that we use in this assignment.
library(ISLR)
library(tidymodels)
library(tidyverse)
library(ggformula)
library(skimr)
The objective of this section is to to perform a simple linear
regression with mpg as the response and `horsepower as the
predictor.
auto_data <- Auto
skim(auto_data)
| Name | auto_data |
| Number of rows | 392 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| name | 0 | 1 | FALSE | 301 | amc: 5, for: 5, toy: 5, amc: 4 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| mpg | 0 | 1 | 23.45 | 7.81 | 9 | 17.00 | 22.75 | 29.00 | 46.6 | ▆▇▆▃▁ |
| cylinders | 0 | 1 | 5.47 | 1.71 | 3 | 4.00 | 4.00 | 8.00 | 8.0 | ▇▁▃▁▅ |
| displacement | 0 | 1 | 194.41 | 104.64 | 68 | 105.00 | 151.00 | 275.75 | 455.0 | ▇▂▂▃▁ |
| horsepower | 0 | 1 | 104.47 | 38.49 | 46 | 75.00 | 93.50 | 126.00 | 230.0 | ▆▇▃▁▁ |
| weight | 0 | 1 | 2977.58 | 849.40 | 1613 | 2225.25 | 2803.50 | 3614.75 | 5140.0 | ▇▇▅▅▂ |
| acceleration | 0 | 1 | 15.54 | 2.76 | 8 | 13.78 | 15.50 | 17.02 | 24.8 | ▁▆▇▂▁ |
| year | 0 | 1 | 75.98 | 3.68 | 70 | 73.00 | 76.00 | 79.00 | 82.0 | ▇▆▇▆▇ |
| origin | 0 | 1 | 1.58 | 0.81 | 1 | 1.00 | 1.00 | 2.00 | 3.0 | ▇▁▂▁▂ |
skim() function shows all the variables and its type and
also gives the descriptive statistics of the numeric variables. From the
table above we can see that auto dataset contains 8 numeric
variables. There are not any missing values in our dataset, and hence we
don’t need to follow any step to handle the missing values.
Instead of using lm() function, I have used
tidymodels function as instructed.
lm_spec = linear_reg() |>
set_mode("regression") |>
set_engine("lm")
Fitting the simple linear model with horsepower as
predictor and mpg as the response variable
slr_fit = lm_spec |>
fit(mpg ~ horsepower, data = auto_data)
Here, tidy() function transforms “messy” model output
into clean, standardized format.
tidy(slr_fit)
Using the tidy output, the linear relationship between
mpg and horsepower with appropriate parameters
are:
\[\widehat{mpg} = 39.936 - 0.158 \times
horsepower\] Interpretation of the model parameters: 1.
y_intercept (\(\beta_0\)) = 39.936: For
a vehicle having 0 horsepower, the predicted
mpg will be 39.936. This is just a theoritical
interpretation, such scenario doesn’t exist in the real world. 2.
horsepower coefficient (\(\beta_1\)) =
- 0.158: The predicted value of mpg decreases by 0.158 with
a unit increase in the value of horsepower, keeping all
other constant.
glance(slr_fit)
Based on the model above:
i. Is there a relationship between the predictor and the response?
Yes. The regression results show a statistically significant
relationship between horsepower and mpg,
supported by a very small p-value (7.031989e-81) for the horsepower
coefficient.
ii. How strong is the relationship between the predictor and the response?
The model has an \(R^2\) of about
0.61, meaning roughly 61% of the variability in mpg is
explained by horsepower alone. Hence, we can say that the relationship
between horsepower and mpg is fairly
strong.
iii. Is the relationship between the predictor and the response positive or negative?
The value of slope is -0.1578447, which means that the relationship
is negative i.e. if horsepower increases, mpg
tends to decrease.
iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?
auto_98 <- auto_data |>
filter(horsepower == 98) |>
head(n = 1)
auto_98
cbind(
predict(slr_fit,new_data=auto_98),
predict(slr_fit,new_data=auto_98,type="conf_int"),
predict(slr_fit,new_data=auto_98,type="pred_int"))
The associated 95% confidence interval for the predicted mpg associated with a horsepower of 98 is (23.97308, 24.96108), which means that we can be 95% confident that the true average mpg for all cars with 98 horsepower falls between 23.97 and 24.96 mpg.
The associated 95% prediction interval for the predicted mpg associated with a horsepower of 98 is (14.8094, 34.12476), which means that if we choose a single random car with 98 horsepower, we are 95% confident that its actual fuel efficiency (mpg) will fall between 14.81 and 34.12 mpg.
mpg) and the predictor
(horsepower)auto_data |>
mutate(.pred = predict(slr_fit, new_data = auto_data)$.pred) |>
gf_point(mpg~horsepower) |>
gf_line(.pred~horsepower, color = 'red')
While the simple linear regression model is statistically significant
and explains 61% of the variance, it may be underfitting the data
because it does not account for the slight curve visible in the actual
observations. Hence, a non-linear or polynomial model might provide a
more accurate fit than a straight line.
# Produce diagnostic plots
par(mfrow = c(2, 2)) # arrange 4 plots in a 2x2 grid
slr_fit |>
extract_fit_engine() |>
plot()
par(mfrow = c(1, 1)) # reset plotting layout
The residuals vs fitted plot shows a clear U-shaped curve in the red line, which indicates non-linearity. The linear model is missing a curved relationship, suggesting that a polynomial term (like \(horsepower^2\)) might fit the data better.
In the Q-Q plot, the points mostly follow the dashed straight line but curve upward at the high end, which means that residuals are mostly normal, but there is some “heavy-tailing,” meaning the model has a few larger-than-expected errors.
In scale-location plot, the spread of the points changes as the fitted values increase, which suggests heteroscedasticity, meaning the model’s accuracy is not consistent across all horsepower levels.
I collect a set of data (n = 100 observations) containing a single predictor X and a quantitative response Y . I then fit a linear regression model i.e \[\widehat{y} = \beta_0 + \beta_1 \times x + \epsilon\] the data, as well as a separate cubic regression model, i.e., \[\widehat{y} = \beta_0 + \beta_1 \times x + \beta_2 \times x^2 + \beta_3 \times x^3 +\epsilon\]
(a) Which model is more flexible? Why?
A cubic regression model is more flexible than a linear model because it can fit curved relationships, while a linear model can only fit straight lines. The cubic model has more parameters, which allows it to capture a greater variety of patterns in the data.
(b) Suppose that the true relationship between X and Y is linear, i.e., \[\widehat{y} = \beta_0 + \beta_1 \times + \epsilon\]
(i) Based on model flexibility, which model would you expect to have a lower training RSS? Why?
As cubic model id more flexible, it is expected to have a lover training RSS because it can mimic the linear model by setting \(\beta_2\) = \(\beta_3\) = 0, and can also fit some of the random noise in the training data.
(ii) Based on model flexibility, which model would you expect to have a lower test RSS? Why?
The linear model is expected to have a lower test RSS because The true relationship is linear and it matches the true data-generating process, while the cubic model may overfit, capturing noise rather than signal.
This dataset contains information collected by the U.S Census Service
concerning real estate information in the area of Boston, Massachusetts.
In the context of R, the Boston dataset is found in the MASS library and
has 506 rows and 14 columns. The median value of a home, medv, is to be
predicted using the social economic status (lstat)
library(MASS)
skim(Boston)
| Name | Boston |
| Number of rows | 506 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| crim | 0 | 1 | 3.61 | 8.60 | 0.01 | 0.08 | 0.26 | 3.68 | 88.98 | ▇▁▁▁▁ |
| zn | 0 | 1 | 11.36 | 23.32 | 0.00 | 0.00 | 0.00 | 12.50 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.14 | 6.86 | 0.46 | 5.19 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.38 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▆▅▁ |
| rm | 0 | 1 | 6.28 | 0.70 | 3.56 | 5.89 | 6.21 | 6.62 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.57 | 28.15 | 2.90 | 45.02 | 77.50 | 94.07 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.21 | 5.19 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.55 | 8.71 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 408.24 | 168.54 | 187.00 | 279.00 | 330.00 | 666.00 | 711.00 | ▇▇▃▁▇ |
| ptratio | 0 | 1 | 18.46 | 2.16 | 12.60 | 17.40 | 19.05 | 20.20 | 22.00 | ▁▃▅▅▇ |
| black | 0 | 1 | 356.67 | 91.29 | 0.32 | 375.38 | 391.44 | 396.22 | 396.90 | ▁▁▁▁▇ |
| lstat | 0 | 1 | 12.65 | 7.14 | 1.73 | 6.95 | 11.36 | 16.96 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.53 | 9.20 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
Fitting a simple Linear Regression Model without the trainsplit:
lm_boston <- lm_spec |>
fit(medv ~ lstat, data = Boston)
# regression output with confidence interval for the model coefficients
tidy(lm_boston, conf.int = TRUE)
Using the tidy output, the linear relationship between
medv and lstat with appropriate parameters
are:
\[\widehat{medv} = 34.55 - 0.95 \times lstat\] Interpretation of coefficients:
y-intercept (\(\beta_0\)) =
34.55: The predicted median value of house (medv) will be
34.55 for those having 0 socio economic status (lstat),
which is ideal case.
lstat_coefficient(\(\beta_0\)) =
-0.95: The median value of house (medv) decreases by 0.95
foe one unit increase in lstat.
Confidence Intervals:
We are 95% confident that the true intercept lies between 33.44 and
35.66. This means that when lstat = 0 (i.e., when the
percentage of lower status population is zero), the average median house
value (medv) is estimated to be between 33.44 and 35.66 (in
$1000s).
We are 95% confident that the true slope lies between –1.03 and
-0.87. This means that for every 1-unit increase in lstat,
the median house value (medv) is expected to decrease by
between -1.03 and -0.87 (in $1000s) on average.
glance(lm_boston)
\(R^2\) value of 54.4% suggests that
54.4% of variability in the response (medv) is explained by
the predictor (lstat).
slr_aug <- augment(lm_boston, new_data = Boston)
slr_aug
Train/test split to assess predictive performance.
# set seed before random split
set.seed(44)
# put 80% of the data into the training set
boston <- initial_split(Boston, prop = 0.8)
# assign the two splits to data frames - with descriptive names
boston_train <- training(boston)
boston_test <- testing(boston)
# splits
dim(boston_train)
## [1] 404 14
dim(boston_test)
## [1] 102 14
Now using our training dataset to fit our simple linear regression model.
# fit a linear model using boston_train
lm_boston_train <- lm_spec |>
fit(medv ~ lstat, data = boston_train)
#assessing the fit our model
tidy(lm_boston_train, conf.int =TRUE)
Using the tidy output, the linear relationship between
medv and lstat with appropriate parameters
are:
\[\widehat{medv} = 33.97 - 0.92 \times lstat\] Interpretation of coefficients:
y-intercept (\(\beta_0\)) =
33.97: The predicted median value of house (medv) will be
33.97 for those having 0 socio economic status (lstat),
which is ideal case.
lstat_coefficient(\(\beta_0\)) =
-0.92: The median value of house (medv) decreases by 0.92
foe one unit increase in lstat.
Confidence Intervals:
We are 95% confident that the true intercept lies between 32.74 and
35.20. This means that when lstat = 0 (i.e., when the
percentage of lower status population is zero), the average median house
value (medv) is estimated to be between 32.74 and 35.20 (in
$1000s).
We are 95% confident that the true slope lies between -1.00 and
-0.83. This means that for every 1-unit increase in lstat,
the median house value (medv) is expected to decrease by
between 0.83 and 1.00 (in $1000s) on average.
glance(lm_boston_train)
\(R^2\) value of 53.6% suggests that
53.6% of variability in the response (medv) is explained by
the predictor (lstat).
train_aug <- augment(lm_boston_train, new_data = boston_train)
test_aug <- augment(lm_boston_train, new_data = boston_test)
metrics(slr_aug, truth = medv, estimate = .pred)
metrics(train_aug, truth = medv, estimate = .pred)
metrics(test_aug, truth = medv, estimate = .pred)
We can conclude that the simple linear regression model using
lstat to predict medv performs moderately well
with the \(R^2\) value ranging from
54-58%, which means that the model explains 54-58% variances in the home
prices. Moreover, similar performance on training and test sets
indicates the model generalizes well, although using polynomial terms
(like lstat²) or adding other predictors (crime, age, rooms, etc.) could
reduce RMSE and increase R².
Model diagnostics
To assess whether the linear model is reliable, we should check for (1) linearity, (2) nearly normal residuals, and (3) constant variability. Note that the normal residuals is not really necessary for all models (sometimes we simply want to describe a relationship for the data that we have or population-level data, where statistical inference is not appropriate/necessary).
Linearity: Lets look at things graphically
test_aug |>
mutate(mod.pred=predict(lm_boston,new_data=boston_test)$.pred) |>
gf_point(medv~lstat) |>
gf_line(.pred~lstat,col='red') |>
gf_line(mod.pred~lstat,col='blue')
From the plot above, we can see that the relationship between the
medv and lstats is reasonably linear. We
should also verify this condition with a plot of the residuals
vs. fitted (predicted) values.
Variability check:
ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
From the above visualization, we can see that residual variance increases with fitted values, and this pattern indicates heteroscedasticity.This suggests that prediction errors are larger for higher fitted values and our model is less reliable for expensive homes.
Normality of residuals:
ggplot(data = train_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.5) +
xlab("Residuals")
From the histogram above, we can see that the residual distribution is right-skewed, with a long positive tail, hence, residuals are not approximately normal, particularly for larger positive residuals.
train_aug |>
gf_qq(~.resid) |>
gf_qqline()
This Q–Q plot shows substantial departures from the diagonal, especially in the upper tail, which means that residuals are not approximately normal, particularly for larger positive residuals.
In a nutshell, a linear model is acceptable as a first approximation, but it is not ideal for the real estate information in the area of Boston, Massachusetts. Predictive metrics (RMSE, MAE, R²) are reasonable and stable, so the model generalizes okay. However, diagnostic plots clearly show violations of linearity, normality and constant variance. This means that the linear model misses important structure in the data.