Let’s import the necessary libraries that we will use in this assignment.

library(ISLR)
library(tidymodels)
library(tidyverse)
library(ggformula)
library(skimr)

1. Linear Regression with the Auto Dataset

The objective of this section is to perform a simple linear regression with mpg as the response and horsepower as the predictor.

auto_data <- Auto

skim(auto_data)
Data summary
Name auto_data
Number of rows 392
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
name 0 1 FALSE 301 amc: 5, for: 5, toy: 5, amc: 4

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
mpg 0 1 23.45 7.81 9 17.00 22.75 29.00 46.6 ▆▇▆▃▁
cylinders 0 1 5.47 1.71 3 4.00 4.00 8.00 8.0 ▇▁▃▁▅
displacement 0 1 194.41 104.64 68 105.00 151.00 275.75 455.0 ▇▂▂▃▁
horsepower 0 1 104.47 38.49 46 75.00 93.50 126.00 230.0 ▆▇▃▁▁
weight 0 1 2977.58 849.40 1613 2225.25 2803.50 3614.75 5140.0 ▇▇▅▅▂
acceleration 0 1 15.54 2.76 8 13.78 15.50 17.02 24.8 ▁▆▇▂▁
year 0 1 75.98 3.68 70 73.00 76.00 79.00 82.0 ▇▆▇▆▇
origin 0 1 1.58 0.81 1 1.00 1.00 2.00 3.0 ▇▁▂▁▂

The skim() function lists every variable with its type and gives descriptive statistics for the numeric variables. From the table above we can see that the Auto dataset contains 8 numeric variables and 1 factor (name). There are no missing values in the dataset, so no additional steps are needed to handle missing data.
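As an optional cross-check of the missing-value claim above (not a required step), the NA values can also be counted directly:

# count missing values per column; all zeros, consistent with the skim output
colSums(is.na(auto_data))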

A. Fitting a simple linear regression model

Instead of using the lm() function directly, I have used the tidymodels interface as instructed.

lm_spec = linear_reg() |>
  set_mode("regression") |>
  set_engine("lm")

Fitting the simple linear model with horsepower as the predictor and mpg as the response variable:

slr_fit = lm_spec |>
  fit(mpg ~ horsepower, data = auto_data)

Here, the tidy() function transforms the “messy” model output into a clean, standardized format.

tidy(slr_fit)

Using the tidy output, the linear relationship between mpg and horsepower with the estimated parameters is:

\[\widehat{mpg} = 39.936 - 0.158 \times horsepower\]

Interpretation of the model parameters:

  1. y-intercept (\(\beta_0\)) = 39.936: For a vehicle with 0 horsepower, the predicted mpg is 39.936. This is only a theoretical interpretation; such a scenario does not exist in the real world.

  2. horsepower coefficient (\(\beta_1\)) = -0.158: The predicted mpg decreases by 0.158 for each one-unit increase in horsepower.
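As a small sanity check of this fitted equation, we can pull the coefficients out of the tidy() output and compute a predicted value by hand; the 120-horsepower figure below is only an illustrative value, not part of the assignment.

coefs <- tidy(slr_fit)
b0 <- coefs$estimate[1]  # intercept
b1 <- coefs$estimate[2]  # horsepower slope

# hand-computed prediction for an illustrative 120-horsepower car
b0 + b1 * 120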

Assessing our model

glance(slr_fit)

Based on the model above:

i. Is there a relationship between the predictor and the response?

Yes. The regression results show a statistically significant relationship between horsepower and mpg, supported by a very small p-value (7.031989e-81) for the horsepower coefficient.

ii. How strong is the relationship between the predictor and the response?

The model has an \(R^2\) of about 0.61, meaning roughly 61% of the variability in mpg is explained by horsepower alone. Hence, we can say that the relationship between horsepower and mpg is fairly strong.

iii. Is the relationship between the predictor and the response positive or negative?

The value of the slope is -0.1578447, which means that the relationship is negative, i.e., as horsepower increases, mpg tends to decrease.

iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

auto_98 <- auto_data |>
  filter(horsepower == 98) |>
  head(n = 1)

auto_98
cbind(
  predict(slr_fit,new_data=auto_98),
  predict(slr_fit,new_data=auto_98,type="conf_int"),
  predict(slr_fit,new_data=auto_98,type="pred_int"))

The associated 95% confidence interval for the predicted mpg associated with a horsepower of 98 is (23.97308, 24.96108), which means that we can be 95% confident that the true average mpg for all cars with 98 horsepower falls between 23.97 and 24.96 mpg.

The associated 95% prediction interval for the predicted mpg associated with a horsepower of 98 is (14.8094, 34.12476), which means that if we choose a single random car with 98 horsepower, we are 95% confident that its actual fuel efficiency (mpg) will fall between 14.81 and 34.12 mpg.
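The same point estimate and intervals can also be obtained without filtering the original data, by passing a one-row data frame containing only the horsepower value of interest; this is just an equivalent sketch of the step above.

new_car <- tibble(horsepower = 98)

cbind(
  predict(slr_fit, new_data = new_car),
  predict(slr_fit, new_data = new_car, type = "conf_int"),
  predict(slr_fit, new_data = new_car, type = "pred_int"))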

B. Plot the response (mpg) and the predictor (horsepower)

auto_data |>
  mutate(.pred = predict(slr_fit, new_data = auto_data)$.pred) |>
  gf_point(mpg~horsepower) |>
  gf_line(.pred~horsepower, color = 'red')

While the simple linear regression model is statistically significant and explains 61% of the variance, it may be underfitting the data because it does not account for the slight curve visible in the actual observations. Hence, a non-linear or polynomial model might provide a more accurate fit than a straight line.
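As a sketch of that idea (not part of the required analysis), a quadratic term can be added with poly() while reusing the same lm_spec; comparing its glance() output with the simple fit would indicate whether the curvature matters.

# quadratic fit in horsepower, using the same model specification
quad_fit <- lm_spec |>
  fit(mpg ~ poly(horsepower, 2), data = auto_data)

glance(quad_fit)  # compare R^2 and sigma with the simple linear fit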

C. Use the plot() function to produce diagnostic plots of the least squares regression fit.

# Produce diagnostic plots
par(mfrow = c(2, 2))  # arrange 4 plots in a 2x2 grid
slr_fit |>
  extract_fit_engine() |>
  plot()

par(mfrow = c(1, 1))  # reset plotting layout

The residuals vs fitted plot shows a clear U-shaped curve in the red line, which indicates non-linearity. The linear model is missing a curved relationship, suggesting that a polynomial term (like \(horsepower^2\)) might fit the data better.

In the Q-Q plot, the points mostly follow the dashed straight line but curve upward at the high end, which means that residuals are mostly normal, but there is some “heavy-tailing,” meaning the model has a few larger-than-expected errors.

In the scale-location plot, the spread of the points changes as the fitted values increase, which suggests heteroscedasticity, meaning the model’s accuracy is not consistent across all horsepower levels.
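If a formal check is desired, a Breusch–Pagan test can back up what the scale-location plot suggests; this sketch assumes the lmtest package is installed and is not required by the assignment.

library(lmtest)

# Breusch-Pagan test on the underlying lm object; a small p-value is
# evidence of non-constant error variance
slr_fit |>
  extract_fit_engine() |>
  bptest()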

2. Theoretical Interpretation

I collect a set of data (n = 100 observations) containing a single predictor X and a quantitative response Y. I then fit a linear regression model, i.e., \[Y = \beta_0 + \beta_1 \times X + \epsilon\] to the data, as well as a separate cubic regression model, i.e., \[Y = \beta_0 + \beta_1 \times X + \beta_2 \times X^2 + \beta_3 \times X^3 + \epsilon\]

(a) Which model is more flexible? Why?

A cubic regression model is more flexible than a linear model because it can fit curved relationships, while a linear model can only fit straight lines. The cubic model has more parameters, which allows it to capture a greater variety of patterns in the data.

(b) Suppose that the true relationship between X and Y is linear, i.e., \[Y = \beta_0 + \beta_1 \times X + \epsilon\]

(i) Based on model flexibility, which model would you expect to have a lower training RSS? Why?

As the cubic model is more flexible, it is expected to have a lower training RSS because it can mimic the linear model by setting \(\beta_2\) = \(\beta_3\) = 0, and can also fit some of the random noise in the training data.

(ii) Based on model flexibility, which model would you expect to have a lower test RSS? Why?

The linear model is expected to have a lower test RSS because the true relationship is linear, so it matches the true data-generating process, while the cubic model may overfit, capturing noise rather than signal.
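A small simulation can illustrate both answers. The data below are generated from a truly linear model with n = 100; all numbers are purely illustrative, and plain lm() is used here only for brevity.

set.seed(1)
train <- tibble(x = runif(100), y = 2 + 3 * x + rnorm(100))
test  <- tibble(x = runif(100), y = 2 + 3 * x + rnorm(100))

lin_fit   <- lm(y ~ x, data = train)
cubic_fit <- lm(y ~ poly(x, 3), data = train)

# training RSS: the cubic fit is never worse on the data it was trained on
c(linear = sum(resid(lin_fit)^2), cubic = sum(resid(cubic_fit)^2))

# test RSS: the linear fit typically wins because the truth is linear
c(linear = sum((test$y - predict(lin_fit, newdata = test))^2),
  cubic  = sum((test$y - predict(cubic_fit, newdata = test))^2))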

3. Regression with the Boston Dataset

This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. In R, the Boston dataset is found in the MASS library and has 506 rows and 14 columns. The median home value, medv, is to be predicted using the percentage of the population with lower socioeconomic status (lstat).

library(MASS)

skim(Boston)
Data summary
Name Boston
Number of rows 506
Number of columns 14
_______________________
Column type frequency:
numeric 14
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
crim 0 1 3.61 8.60 0.01 0.08 0.26 3.68 88.98 ▇▁▁▁▁
zn 0 1 11.36 23.32 0.00 0.00 0.00 12.50 100.00 ▇▁▁▁▁
indus 0 1 11.14 6.86 0.46 5.19 9.69 18.10 27.74 ▇▆▁▇▁
chas 0 1 0.07 0.25 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
nox 0 1 0.55 0.12 0.38 0.45 0.54 0.62 0.87 ▇▇▆▅▁
rm 0 1 6.28 0.70 3.56 5.89 6.21 6.62 8.78 ▁▂▇▂▁
age 0 1 68.57 28.15 2.90 45.02 77.50 94.07 100.00 ▂▂▂▃▇
dis 0 1 3.80 2.11 1.13 2.10 3.21 5.19 12.13 ▇▅▂▁▁
rad 0 1 9.55 8.71 1.00 4.00 5.00 24.00 24.00 ▇▂▁▁▃
tax 0 1 408.24 168.54 187.00 279.00 330.00 666.00 711.00 ▇▇▃▁▇
ptratio 0 1 18.46 2.16 12.60 17.40 19.05 20.20 22.00 ▁▃▅▅▇
black 0 1 356.67 91.29 0.32 375.38 391.44 396.22 396.90 ▁▁▁▁▇
lstat 0 1 12.65 7.14 1.73 6.95 11.36 16.96 37.97 ▇▇▅▂▁
medv 0 1 22.53 9.20 5.00 17.02 21.20 25.00 50.00 ▂▇▅▁▁

Fitting a simple linear regression model without the train/test split:

lm_boston <- lm_spec |>
  fit(medv ~ lstat, data = Boston)
# regression output with confidence interval for the model coefficients
tidy(lm_boston, conf.int = TRUE)

Using the tidy output, the linear relationship between medv and lstat with the estimated parameters is:

\[\widehat{medv} = 34.55 - 0.95 \times lstat\] Interpretation of coefficients:

  1. y-intercept (\(\beta_0\)) = 34.55: The predicted median house value (medv) is 34.55 (in $1000s) when lstat = 0, which is a theoretical scenario rather than one observed in the data.

  2. lstat coefficient (\(\beta_1\)) = -0.95: The predicted median house value (medv) decreases by 0.95 (in $1000s) for each one-unit increase in lstat.

Confidence Intervals:

We are 95% confident that the true intercept lies between 33.44 and 35.66. This means that when lstat = 0 (i.e., when the percentage of lower status population is zero), the average median house value (medv) is estimated to be between 33.44 and 35.66 (in $1000s).

We are 95% confident that the true slope lies between -1.03 and -0.87. This means that for every 1-unit increase in lstat, the median house value (medv) is expected to decrease by between 0.87 and 1.03 (in $1000s) on average.
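These intervals are just estimate ± t* × std.error, with t* the 0.975 quantile of a t distribution on n − 2 degrees of freedom; the sketch below reconstructs them by hand from the tidy() output.

tidy(lm_boston) |>
  mutate(t_star = qt(0.975, df = nrow(Boston) - 2),
         lower  = estimate - t_star * std.error,
         upper  = estimate + t_star * std.error)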

glance(lm_boston)

An \(R^2\) value of 54.4% suggests that 54.4% of the variability in the response (medv) is explained by the predictor (lstat).

slr_aug <- augment(lm_boston, new_data = Boston)

slr_aug

Train/test split to assess predictive performance.

# set seed before random split
set.seed(44)

# put 80% of the data into the training set
boston <- initial_split(Boston, prop = 0.8)

# assign the two splits to data frames - with descriptive names
boston_train <- training(boston)
boston_test <- testing(boston)

# splits
dim(boston_train)
## [1] 404  14
dim(boston_test)
## [1] 102  14

Now we use our training dataset to fit the simple linear regression model.

# fit a linear model using boston_train
lm_boston_train <- lm_spec |>
  fit(medv ~ lstat, data = boston_train)

# assessing the fit of our model
tidy(lm_boston_train, conf.int =TRUE)

Using the tidy output, the linear relationship between medv and lstat with the estimated parameters is:

\[\widehat{medv} = 33.97 - 0.92 \times lstat\] Interpretation of coefficients:

  1. y-intercept (\(\beta_0\)) = 33.97: The predicted median house value (medv) is 33.97 (in $1000s) when lstat = 0, which is a theoretical scenario rather than one observed in the data.

  2. lstat coefficient (\(\beta_1\)) = -0.92: The predicted median house value (medv) decreases by 0.92 (in $1000s) for each one-unit increase in lstat.

Confidence Intervals:

We are 95% confident that the true intercept lies between 32.74 and 35.20. This means that when lstat = 0 (i.e., when the percentage of lower status population is zero), the average median house value (medv) is estimated to be between 32.74 and 35.20 (in $1000s).

We are 95% confident that the true slope lies between -1.00 and -0.83. This means that for every 1-unit increase in lstat, the median house value (medv) is expected to decrease by between 0.83 and 1.00 (in $1000s) on average.

glance(lm_boston_train)

An \(R^2\) value of 53.6% suggests that 53.6% of the variability in the response (medv) is explained by the predictor (lstat).

train_aug <- augment(lm_boston_train, new_data = boston_train)
test_aug <- augment(lm_boston_train, new_data = boston_test)
metrics(slr_aug, truth = medv, estimate = .pred)
metrics(train_aug, truth = medv, estimate = .pred)
metrics(test_aug, truth = medv, estimate = .pred)

We can conclude that the simple linear regression model using lstat to predict medv performs moderately well, with \(R^2\) values ranging from 54-58%, meaning the model explains 54-58% of the variance in home prices. Moreover, the similar performance on the training and test sets indicates the model generalizes well, although using polynomial terms (like lstat²) or adding other predictors (crime, age, rooms, etc.) could reduce RMSE and increase R².
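As a sketch of that suggestion (not one of the required models), a quadratic term in lstat can be fit on the training split and scored on the test split with the same workflow:

poly_boston <- lm_spec |>
  fit(medv ~ poly(lstat, 2), data = boston_train)

# test-set performance of the quadratic fit, for comparison with the metrics above
poly_boston |>
  augment(new_data = boston_test) |>
  metrics(truth = medv, estimate = .pred)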

Model diagnostics

To assess whether the linear model is reliable, we should check for (1) linearity, (2) nearly normal residuals, and (3) constant variability. Note that the nearly normal residuals condition is not strictly necessary for all models (sometimes we simply want to describe a relationship for the data we have, or we have population-level data, where statistical inference is not appropriate or necessary).

Linearity: Let’s look at things graphically.

# red line: model fit on the training split; blue line: model fit on the full data
test_aug |>
  mutate(mod.pred = predict(lm_boston, new_data = boston_test)$.pred) |>
  gf_point(medv ~ lstat) |>
  gf_line(.pred ~ lstat, col = 'red') |>
  gf_line(mod.pred ~ lstat, col = 'blue')

From the plot above, we can see that the relationship between medv and lstat is reasonably linear. We should also verify this condition with a plot of the residuals vs. fitted (predicted) values.

Variability check:

ggplot(data = train_aug, aes(x = .pred, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  xlab("Fitted values") +
  ylab("Residuals")

From the above visualization, we can see that residual variance increases with fitted values, and this pattern indicates heteroscedasticity. This suggests that prediction errors are larger for higher fitted values and that our model is less reliable for more expensive homes.

Normality of residuals:

ggplot(data = train_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 0.5) +
  xlab("Residuals")

From the histogram above, we can see that the residual distribution is right-skewed with a long positive tail; hence, the residuals are not approximately normal, particularly for larger positive residuals.

train_aug |>
  gf_qq(~.resid) |>
  gf_qqline()

This Q–Q plot shows substantial departures from the diagonal, especially in the upper tail, which means that residuals are not approximately normal, particularly for larger positive residuals.
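If a formal check is wanted alongside the Q-Q plot, a Shapiro-Wilk test on the training residuals is one option (optional, not part of the assignment):

# small p-values indicate departure from normality
shapiro.test(train_aug$.resid)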

Conclusion:

In a nutshell, a linear model is acceptable as a first approximation, but it is not ideal for the Boston housing data. Predictive metrics (RMSE, MAE, R²) are reasonable and stable, so the model generalizes reasonably well. However, the diagnostic plots clearly show violations of linearity, normality, and constant variance, which means the linear model misses important structure in the data.