Question 8
Part a)
lm_model <- lm(mpg ~ horsepower, data = Auto)
summary(lm_model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
i. Yes, there is a relationship between the predictor, horsepower, and the
response, mpg: the p-value on the horsepower coefficient is far below 0.05,
and the two variables are negatively correlated.
ii. The coefficient estimate of -0.158 means that each additional unit of
horsepower is associated with a decrease of about 0.158 mpg; with an
R-squared of 0.61, the relationship is moderately strong (see the quick
check after the prediction output below).
iii. Negative.
iv. We use the predict() function below to compute the predicted mpg for a
horsepower of 98, together with the associated confidence and prediction
intervals.
predict(lm_model, data.frame(horsepower = 98), interval = 'confidence')
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, data.frame(horsepower = 98), interval = 'prediction')
## fit lwr upr
## 1 24.46708 14.8094 34.12476
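As a quick check on the strength noted in (ii), the residual standard error
can be compared with the mean of the response (a minimal sketch, assuming the
Auto data from the ISLR package is loaded as above):
sigma(lm_model) / mean(Auto$mpg) * 100 # RSE as a percentage of mean mpg, roughly 21%
summary(lm_model)$r.squared # share of the variance in mpg explained by horsepower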
Part b)
plot(Auto$horsepower,Auto$mpg)
abline(lm_model, lwd = 3, col = "blue")

Part c)
par(mfrow = c(2, 2))
plot(lm_model)

The Residuals vs Fitted plot shows a clear pattern (curvature), which
indicates the relationship might not be entirely linear.
The Q-Q plot shows that the residuals are approximately normally distributed,
which is good.
We can also notice one high-leverage point in the bottom-right plot and a
couple of outliers in the top-left plot, which could be problematic.
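As a follow-up to the apparent curvature (a sketch, not required by the
question), one could add a quadratic term in horsepower and compare the fits:
quad_model <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(quad_model)$adj.r.squared # compare with the 0.6049 of the linear fit above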
Question 10
Part c)
In equation form, the fitted Carseats model is:
Sales = 13 - 0.054(Price) - 0.02(UrbanYes) + 1.2(USYes)
Part d)
We can reject the null hypothesis that the coefficient is zero for Price and
for USYes, because their associated p-values are very small.
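The dummy coding behind the UrbanYes and USYes terms can be inspected
directly (assuming the Carseats data from the ISLR package is loaded):
contrasts(Carseats$US) # USYes is 1 when US == "Yes", 0 otherwise
contrasts(Carseats$Urban) # UrbanYes is 1 when Urban == "Yes", 0 otherwise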
Part e)
multi_lm_trim <- lm(Sales ~ Price + US, data = Carseats)
summary(multi_lm_trim)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Part f)
We can compare the adjusted R-squared values of the two models. The full
model in a) has 0.2335 and the reduced model in e) has 0.2354, so the reduced
model fits the data slightly better, although the improvement is marginal:
both models explain only about 24% of the variance in Sales.
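The full model's fit object is not shown here; as an illustration it can be
refit so the two adjusted R-squared values are pulled out programmatically
(multi_lm_full is a hypothetical name):
multi_lm_full <- lm(Sales ~ Price + Urban + US, data = Carseats) # model from part a)
c(full = summary(multi_lm_full)$adj.r.squared,
  reduced = summary(multi_lm_trim)$adj.r.squared)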
Part g)
confint(multi_lm_trim)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Part h)
Yes, the diagnostic plots for this model (sketched below) show evidence of
both outliers and high-leverage points.
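A sketch of how those diagnostics can be produced and checked numerically
(the cutoffs are common rules of thumb, not hard limits):
par(mfrow = c(2, 2))
plot(multi_lm_trim)
which(hatvalues(multi_lm_trim) > 2 * (2 + 1) / 400) # leverage well above the average (p + 1)/n
which(abs(rstandard(multi_lm_trim)) > 3) # standardized residuals large enough to flag as outliers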
Question 14
Part a)
set.seed(1)
x1 <- runif (100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The linear model has the form y = 2 + 2(x1) + 0.3(x2) + error, so the true
regression coefficients are beta0 = 2 (intercept), beta1 = 2 for x1, and
beta2 = 0.3 for x2.
Part b)
plot(x1,x2)

The plot shows that x2 increases with x1, as expected since x2 was generated
as a linear function of x1 plus noise. The two predictors are therefore highly
collinear (see the correlation check below).
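The collinearity can be quantified with the sample correlation:
cor(x1, x2) # large positive correlation (roughly 0.8 with this seed)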
Part c)
lm_ <- lm(y~ x1 + x2)
summary(lm_)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates are now beta0-hat = 2.13, beta1-hat = 1.44, and beta2-hat =
1.01. Both slope estimates differ from the truth: beta2-hat is far from the
true beta2 = 0.3, and beta1-hat is also well below the true beta1 = 2. We can
(just barely) reject the null hypothesis that beta1 = 0, since its p-value of
0.0487 is below 0.05, but we fail to reject the null that beta2 = 0 (p-value
0.3754).
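The inflated standard errors are a symptom of the collinearity; one way to
quantify it, assuming the car package is installed, is the variance inflation
factor:
library(car) # assumes the car package is available
vif(lm_) # VIFs noticeably above 1 indicate collinearity-inflated variances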
Part d)
lm_2 <- lm(y~ x1)
summary(lm_2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
Yes, we can still reject the null hypothesis that beta1 = 0; in fact, the
p-value (2.66e-06) is much smaller than in the two-predictor model.
Part e)
lm_3 <- lm(y ~ x2)
summary(lm_3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
We still reject the null hypothesis that the coefficient on x2 is zero,
because its p-value is very small. In the two-predictor model x2 appeared
insignificant, but on its own it is clearly related to the response. This is a
consequence of the collinearity between x1 and x2 rather than of any real
change in the effect of x2.
Part f)
As mentioned in the previous part, the key issue is the collinearity between
the predictors. On its own, each predictor appears highly significant, but
when both are included the collinearity inflates the standard errors of the
coefficient estimates, making it hard to attribute the effect on the response
to either predictor individually (see the comparison of standard errors
below). The results therefore do not contradict each other.
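The standard-error inflation can be read directly off the summaries above;
programmatically:
summary(lm_)$coefficients["x1", "Std. Error"] # about 0.72 with both predictors included
summary(lm_2)$coefficients["x1", "Std. Error"] # about 0.40 with x1 alone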
Part g)
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_ <- lm(y~ x1 + x2)
summary(lm_)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
par(mfrow = c(2, 2))
plot(lm_)

In this model, the new observation is a high-leverage point: it changed the
coefficient estimates substantially and stands out in the Residuals vs
Leverage plot (bottom right). However, it is not an outlier, as can be seen
from the Residuals vs Fitted plot (top left); a numeric check is sketched
below.
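Observation 101 is the added point; the comparisons below are common rules of
thumb rather than hard thresholds:
hatvalues(lm_)[101] # leverage; compare with the average (p + 1)/n = 3/101
rstudent(lm_)[101] # studentized residual; |values| beyond about 3 suggest an outlier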
lm_2 <- lm(y~ x1)
summary(lm_2)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
par(mfrow = c(2, 2))
plot(lm_2)

For the second model, the new observation is both an outlier and a leverage
point, as can be seen from the diagnostic plots (the same numeric check as
above is repeated below).
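The same check, applied to this single-predictor fit:
hatvalues(lm_2)[101] # leverage of the added point; average leverage here is 2/101
rstudent(lm_2)[101] # studentized residual of the added point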
lm_3 <- lm(y ~ x2)
summary(lm_3)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
par(mfrow = c(2, 2))
plot(lm_3)

For this model, the new observation is not an outlier, but it is a
high-leverage point: its x2 value of 0.8 lies well above the bulk of the
original x2 values (see the numeric check below).
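And the same check for the third fit:
hatvalues(lm_3)[101] # leverage of the added point, versus average 2/101
rstudent(lm_3)[101] # studentized residual of the added point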