Here we are using simple linear regression on the Auto data set. We use the lm() function to fit a simple linear regression with mpg as the response and horsepower as the predictor, and the summary() function to print the results.
# first we load the dataset
library(ISLR2)
df_auto <- Auto
head(df_auto)
#fit the simple linear regression model
model <- lm(mpg ~ horsepower, data=df_auto)
#print the summary of the model
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = df_auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Is there a relationship between the predictor and the response?
Yes, there exists a relationship between the predictor horsepower and the response mpg. We know this from the p-value of the coefficient associated with horsepower in the regression summary, which is less than \(2 \times 10^{-16}\). Since this p-value is extremely small, there is a statistically significant relationship between the two.
How strong is the relationship between the predictor and the response?
To determine the strength of the relationship between the predictor and the response, we use \(R^2\), which has a value of 0.6059, meaning that approximately 60.6% of the variation in mpg is explained by the model using horsepower.
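These summary quantities can also be pulled directly from the fitted model object; a small sketch using base R accessors (the variable names below are just illustrative):
#extract R-squared, adjusted R-squared, and the residual standard error
r2     <- summary(model)$r.squared      #multiple R-squared (0.6059 here)
adj_r2 <- summary(model)$adj.r.squared  #adjusted R-squared (0.6049 here)
rse    <- sigma(model)                  #residual standard error (4.906 here)
c(r2 = r2, adj_r2 = adj_r2, rse = rse)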
Is the relationship between the predictor and the response positive or negative?
The relationship between the predictor and the response is negative. We know this from the sign of the coefficient for the horsepower variable. The coefficient is -0.1578, meaning that higher horsepower is associated with lower mpg. This makes sense practically because cars with more powerful engines are less fuel-efficient and therefore achieve fewer miles per gallon.
What is the predicted mpg associated with a horsepower of 98?
Using the linear regression equation from our model, shown below, we can predict mpg for a horsepower value of 98.
\[\begin{align} mpg &= 39.94 - 0.1578 \times \text{horsepower} \\[4pt] &= 39.94 - (0.1578 \times 98) \\[4pt] &= 24.48 \text{(mpg)} \end{align}\]
Hence the predicted value is approximately 24.48 mpg.
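The same number can be obtained directly from the fitted model with predict(); a quick check of the hand calculation:
#verify the hand calculation with predict()
predict(model, newdata = data.frame(horsepower = 98))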
What are the associated 95% confidence and prediction intervals?
#compute the confidence and prediction intervals
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 38.525212 41.3465103
## horsepower -0.170517 -0.1451725
For the intercept, we have a confidence interval of (38.53, 41.35), meaning that the mean mpg when horsepower is zero lies between these values.
For horsepower, the confidence interval is (-0.1705, -0.1452), meaning that for each additional unit of horsepower, mpg decreases by an amount within this range.
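Note that confint() above gives intervals for the coefficients. If the question is read as asking for the 95% confidence and prediction intervals of mpg at horsepower = 98, these can be obtained with predict(); a minimal sketch (new_obs is just a helper name):
new_obs <- data.frame(horsepower = 98)
#95% confidence interval for the mean mpg at horsepower = 98
predict(model, newdata = new_obs, interval = "confidence", level = 0.95)
#95% prediction interval for an individual car with horsepower = 98
predict(model, newdata = new_obs, interval = "prediction", level = 0.95)
The prediction interval is wider than the confidence interval because it also accounts for the irreducible error of an individual observation.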
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
plot(
df_auto$horsepower,
df_auto$mpg,
xlab="horsepower",
ylab="mpg",
pch = 16,
col="blue")
#adding a regression line
abline(model, col="red", lwd=2)
The plot above displays mpg against horsepower together with the fitted least squares line. Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
par(mfrow=c(2,2))
plot(model)
From the Residuals vs. Fitted plot, we can see that the residuals follow a clear pattern, as seen from the curved red line, rather than being randomly scattered. This suggests a problem with our linear model: the curved pattern indicates that the relationship between horsepower and mpg is not perfectly linear. This is not a surprise, since \(R^2\) already showed that the model with horsepower alone explains only about 60% of the variability in the response variable.
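Although not required by the exercise, one common way to address this curvature is to add a quadratic horsepower term; a minimal sketch (model_quad is just an illustrative name):
#fit a model with a quadratic horsepower term to capture the curvature
model_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = df_auto)
summary(model_quad)$r.squared  #compare with the 0.6059 of the purely linear fit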
Using the Q-Q diagnostic plot, we can assess whether the residuals of our model are normally distributed, which is one of the key assumptions of the linear regression model. The central part of the plot aligns with the reference line, suggesting that the majority of the residuals follow a normal distribution. However, the tails deviate from the reference line, implying the possible presence of outliers.
Using the Scale-Location plot, we can check for homoscedasticity, i.e. whether the variance of the residuals remains constant across the range of fitted values. The red line, which represents the trend in the standardized residuals, is slightly curved but generally increasing. This suggests that the variance is not constant and instead increases for larger fitted values.
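As a complement to this visual check, a Breusch-Pagan test can formalize the assessment; a sketch assuming the lmtest package (not used elsewhere in this document) is installed:
#Breusch-Pagan test for non-constant error variance (requires the lmtest package)
library(lmtest)
bptest(model)  #a small p-value would support the impression of heteroscedasticity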
Finally, we look at the Residuals vs. Leverage plot, with its Cook's distance contours, which helps us identify influential data points that have a large impact on the model. From the plot, we see that some of the labelled observations combine high leverage with large residuals, meaning that they influence the regression fit substantially. If we were tuning the model, we could consider investigating these points, possibly removing them, and refitting for better performance.
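The flagged points can also be inspected numerically with cooks.distance(); a minimal sketch using the common 4/n rule of thumb:
#identify observations with unusually large Cook's distance
cd <- cooks.distance(model)
influential <- which(cd > 4 / nrow(df_auto))  #4/n is a common rule of thumb
head(sort(cd[influential], decreasing = TRUE))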
This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
#view description of the dataset
?Carseats
head(Carseats)
#next we fit our model using the provided predictor variables.
model <- lm(Sales ~ Price + Urban + US, data=Carseats)
#the model coefficients
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
From the summary of the model above, the intercept \(\beta_0\) is 13.04, which represents the predicted Sales when all the predictor variables are at their reference levels, i.e. \(\text{Price} = 0\), Urban = No, and US = No. Realistically this does not make much sense, because it implies that the baseline average Sales for stores that are neither in the US nor in an urban area is 13.04 when the company charges nothing for the car seats at those sites. That would mean the company was giving the car seats away, perhaps as part of some special promotion campaign; in any case, this scenario lies outside the range of the data the model was fit on.
The coefficient associated with Price is -0.0545, implying that for every one-unit increase in Price, Sales of car seats decrease by this amount, holding the other two predictor variables constant. This agrees with the real-world expectation that sales fall when the price of a commodity increases.
For the Urban variable, the coefficient is -0.0219. Since this is a categorical variable, it implies that sales at stores in urban areas tend to be lower by 0.0219 units than at stores not in urban areas, holding the other predictor variables constant.
Finally, for US, the coefficient is 1.2006, implying that, on average, stores located in the US sell 1.2006 more units than stores outside the US, holding the other predictor variables constant.
The fitted model in equation form is:
\[ \hat{Sales} = 13.04 - 0.0545 \times \text{Price} - 0.0219 \times \text{UrbanYes} + 1.2006 \times \text{USYes}\]
(d) For which of the predictors can you reject the null hypothesis H0 : \(\beta_j = 0\)?
The p-values for Price and US are extremely small, implying that these predictors are statistically significant in predicting the Sales response variable; for these two we reject the null hypothesis that their \(\beta_j = 0\). However, for the Urban variable the p-value is very large, implying that this predictor is not statistically significant in predicting the response. Thus, for Urban, we fail to reject the null hypothesis that \(\beta_\text{Urban} = 0\). We therefore refit the model without the Urban variable as one of the predictors.
#the new updated model without the urban variable
model <- lm(Sales ~ Price + US, data=Carseats)
#get the model summary
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
To evaluate how well the models from (a) and (e) fit, we can start by looking at the \(R^2\) metric. Both models have the same \(R^2\) of 0.2393, meaning that each explains only about 24% of the variance in the Sales response variable. This is expected, since in (e) we removed only a predictor variable that was not statistically significant.
When it comes to the residual standard error, the two models again have approximately the same value of about 2.47, so they perform similarly on this measure as well.
Lastly, when we compare the models on their F-statistics, the model in (e) has a noticeably larger value, indicating stronger overall model significance relative to its number of predictors; a formal comparison of the two nested fits is sketched below.
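A formal way to compare the two nested models is a partial F-test with anova(); a sketch, assuming both fits are kept under separate names (model_full and model_reduced are names chosen here for illustration, since the chunks above reuse model):
#compare the full model from (a) with the reduced model from (e)
model_full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
model_reduced <- lm(Sales ~ Price + US, data = Carseats)
anova(model_reduced, model_full)  #a large p-value means dropping Urban costs nothing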
confint(model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The confidence intervals are as follows: Intercept: (11.79, 14.27); Price: (-0.0648, -0.0442); USYes: (0.6915, 1.7078). For the two predictor variables, the respective confidence intervals exclude zero, which agrees with our initial analysis of rejecting the null hypothesis.
par(mfrow = c(2, 2))
plot(model)
From the Residuals vs. Fitted plot, we can identify some possible outliers, such as the 51st, 69th and 377th observations. These three have residuals that deviate substantially from the rest of the points.
From the Residuals vs. Leverage plot (Cook's distance), we can identify the 26th, 50th and 368th data points as high-leverage points; these points may unduly influence our model.
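These visual impressions can be checked numerically with studentized residuals and leverage values; a minimal sketch:
#studentized residuals beyond +/- 3 are commonly treated as outliers
which(abs(rstudent(model)) > 3)
#leverage well above the average (p + 1)/n flags high-leverage points
lev <- hatvalues(model)
head(sort(lev[lev > 2 * mean(lev)], decreasing = TRUE))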
This question focuses on the problem of collinearity. (a) Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The last line creates a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
The form of the linear model is shown below:
\[y = 2 + 2x_1 + 0.3x_2 + \epsilon\]
with \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\) representing the intercept, the x1 coefficient, and the x2 coefficient respectively, and \(\epsilon \sim N(0, 1)\) the random error term.
What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
#scatter plot between x1 and x2
plot(
x1,
x2,
main = "Scatter plot of x1 versus x2",
pch = 19,
col = "blue"
)
The scatter plot above shows a strong positive correlation between the two variables: x2 tends to increase as x1 increases. To confirm the actual correlation between these two variables, we can run the code below:
cor(x1, x2)
## [1] 0.8351212
The correlation between the two variables is approximately 0.84, confirming the strong positive correlation we identified earlier from the plot. This is expected, because we know x2 was generated from x1 plus a small amount of noise, hence the two are nearly linearly dependent.
Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are \(\hat\beta_0, \;\hat\beta_1,\;\hat\beta_2\)? How do these relate to the true \(\beta_0, \; \beta_1, \; \text{and} \;\beta_2\)? Can you reject the null hypothesis H0 : \(\beta_1 = 0\)? How about the null hypothesis H0 : \(\beta_2 = 0\)?
#fitting the linear regression model
lm1 = lm(y ~ x1 + x2)
#getting the summary of the model
summary(lm1)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
From the summary above, we observe the estimated coefficients \(\hat{\beta_0} = 2.1305\), \(\hat{\beta_1} = 1.4396\), and \(\hat{\beta_2} = 1.0097\). Ideally, the estimated coefficients should be close to the true coefficients of the predictor variables, but here the estimates for the two predictors differ substantially from the true values. This is largely due to the multicollinearity between x1 and x2, which inflates the standard errors and makes the individual coefficient estimates unstable.
For x1, we reject the null hypothesis H0: \(\beta_1 = 0\), since the p-value (0.0487) is below the 0.05 threshold. This means that x1 is statistically significant in predicting y.
For x2, we fail to reject the null hypothesis H0: \(\beta_2 = 0\), since the p-value (0.3754) is above the 0.05 threshold. This means that x2 is NOT statistically significant in predicting y in this model.
Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
#fitting the regression model now using only x1
lm2 = lm(y ~ x1)
#getting summary of the model
summary(lm2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
From the summary above, we reject the null hypothesis H0: \(\beta_1 = 0\), because the p-value associated with x1 is extremely small (\(2.66 \times 10^{-6}\)). This implies that x1 is statistically significant in predicting the response variable y.
Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 : \(\beta_1 = 0\)?
lm3 = lm(y ~ x2)
summary(lm3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Yes, we reject the null hypothesis H0: \(\beta_1 = 0\). The p-value associated with x2 is extremely small (\(1.37 \times 10^{-5}\)), indicating that x2 is statistically significant in predicting y when used on its own.
No, the results do not contradict each other. In lm1 (where we used both x1 and x2) we did not find enough evidence to reject the null hypothesis for x2. However, when we fit only x2 in lm3, the p-value associated with x2 shows that x2 is statistically significant, contrary to our earlier conclusion. This is not a contradiction: because of the collinearity between x1 and x2, the two predictors carry largely the same information, so when both are included it is hard to attribute the effect to either one individually and the standard errors of their coefficients are inflated.
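One way to quantify this collinearity is the variance inflation factor (VIF); a minimal base-R sketch (avoiding extra packages):
#VIF for x1 in the model y ~ x1 + x2, computed from the R-squared of x1 on x2
r2_x1 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2_x1)  #values well above 5 (or 10) are usually considered problematic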
We now add one additional observation and refit the three models:
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
#with both x1 and x2
lm4 = lm(y ~ x1 + x2)
summary(lm4)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
#with only x1
lm5 = lm(y ~ x1)
summary(lm5)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
#finally with just x2
lm6 = lm(y ~ x2)
summary(lm6)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
From the results of the three models above, we notice changes in the coefficient estimates. The estimates have shifted in all of the models, with the largest changes for x1 and x2 in the multiple regression: the coefficient for x1 dropped from 1.4396 to 0.5394 and is no longer significant, while the coefficient for x2 rose from 1.0097 to 2.5146 and became significant. This suggests that the new observation substantially affects the fitted relationship.
We can make a scatter plot of x1 vs. x2 to check whether the new point is an outlier:
plot(x1, x2, main = "Scatter Plot: x1 vs. x2", col="blue", pch=19)
From the scatter plot above, we can clearly see that the new observation is an outlier in the scatter plot of x1 vs. x2.
Finally, we check whether the observation is a high-leverage point:
#for the multiple regression model
which.max(hatvalues(lm4))
## 101
## 101
x2[101]
## [1] 0.8
x1[101]
## [1] 0.1
#here we see the new observation is a high leverage point
#for the regression model using only x1
which.max(hatvalues(lm5))
## 27
## 27
x2[27]
## [1] -0.03763402
x1[27]
## [1] 0.01339033
#here it doesn't seem like the new observation is a high leverage point.
#finally we check for the regression model using x2
which.max(hatvalues(lm6))
## 101
## 101
x2[101]
## [1] 0.8
x1[101]
## [1] 0.1
#again, here we see that the new observation is a high leverage point.
From the results above, the new observation is a high-leverage point in the two models that include x2 (lm4 and lm6), but not in the model that uses only x1 (lm5).
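As a final sanity check, the leverage of the added observation can be compared with the average leverage \((p + 1)/n\) in each model; a minimal sketch:
#compare the leverage of the added observation (index 101) with the
#average leverage (p + 1)/n in each of the three refitted models
sapply(list(lm4 = lm4, lm5 = lm5, lm6 = lm6), function(m) {
  h <- hatvalues(m)
  c(h_101 = unname(h[101]), average = mean(h))
})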