library(ISLR2) # provides the Auto and Carseats data sets used below
lm_fit <- lm(mpg ~ horsepower, data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
a.
i. Yes, there is a relationship between horsepower (the predictor) and mpg (the response): the F-statistic and the coefficient p-value for horsepower are both far below 0.05.
ii. The relationship is fairly strong, since the R-squared is about 0.6, meaning roughly 60% of the variability in mpg is explained by horsepower (a quick numerical check follows).
iii. The relationship is negative, as shown by the negative coefficient on horsepower: as horsepower increases, mpg tends to decrease.
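A quick one-number check of both the direction and the strength is the sample correlation between mpg and horsepower; for a simple linear regression its square equals the R-squared reported above (a minimal sketch, assuming the Auto data are loaded as above).
# Correlation between mpg and horsepower: the sign gives the direction of the
# relationship, and its square equals the R-squared of the simple regression
cor(Auto$mpg, Auto$horsepower)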
predicted_output <- predict(lm_fit, data.frame(horsepower = (c(98))), interval = "confidence")
print(predicted_output)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
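For comparison, the corresponding 95% prediction interval can be obtained the same way; a minimal sketch (it will be wider than the confidence interval, since it also accounts for the variability of an individual car).
# 95% prediction interval for mpg at horsepower = 98
predict(lm_fit, data.frame(horsepower = 98), interval = "prediction")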
b.
plot(Auto$horsepower, Auto$mpg)
abline(lm_fit)
c.
par(mfrow = c(2, 2))
plot(lm_fit)
The residuals versus fitted plot shows a clear U-shaped pattern, which suggests the relationship between mpg and horsepower is non-linear; a few observations also stand out with relatively high leverage.
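To back up the visual impression with numbers, the largest studentized residual and the highest-leverage observation can be checked directly; a minimal sketch:
# Studentized residuals beyond |3| are commonly flagged as outliers, and hat
# values well above the average (p + 1)/n indicate high leverage
max(abs(rstudent(lm_fit)))
which.max(hatvalues(lm_fit))
(1 + 1) / nrow(Auto) # average-leverage benchmark for one predictor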
10. a.
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Sales decrease as Price increases. Sales are slightly lower in urban locations, but this effect is not statistically significant. Sales are higher for stores located in the US.
c. Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US), where Urban = 1 for urban stores (0 otherwise) and US = 1 for stores in the US (0 otherwise).
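To confirm how the qualitative predictors enter this equation, the dummy coding R used can be inspected (a quick check; Urban and US are factors in Carseats):
# Dummy-variable coding used by lm(): "Yes" is coded as 1, "No" as 0
contrasts(Carseats$Urban)
contrasts(Carseats$US)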
d. We can reject the null hypothesis for Price and
US as their p-values are below the common alpha level of 0.05,
indicating that these predictors have a statistically significant
association with Sales.
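The same conclusion can be read off the coefficient table programmatically; a small sketch:
# p-values for each coefficient: Price and USYes fall well below 0.05,
# while UrbanYes does not
summary(model)$coefficients[, "Pr(>|t|)"]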
e.
model <- lm(Sales ~ Price + US, data = Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. The R-squared is quite low for both models (about 0.24), so neither model fits the data very well and their explanatory power is limited. Removing Urban leaves the R-squared essentially unchanged while slightly improving the adjusted R-squared, so the smaller model from (e) is preferable.
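A direct way to compare the two fits is a partial F-test. The sketch below refits both models under new names (fit_full and fit_small are names introduced here, since the answers above reuse the name model for both):
# Test whether dropping Urban significantly worsens the fit
fit_full <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_small <- lm(Sales ~ Price + US, data = Carseats)
anova(fit_small, fit_full)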
g.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h. The diagnostic plots below show an observation with relatively high leverage, yet in the smaller model no studentized residuals are large enough to flag any point as an outlier. This indicates that certain points have a disproportionate influence on the model's predictions; a numerical check follows the plots.
par(mfrow = c(2,2))
plot(model)
plot(predict(model), rstudent(model))
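The plots can be supplemented with a quick numerical check using the smaller model from (e); a minimal sketch:
# Outlier check: count studentized residuals beyond the usual |3| cutoff
sum(abs(rstudent(model)) > 3)
# Leverage check: largest hat value versus the average (p + 1)/n benchmark
max(hatvalues(model))
(2 + 1) / nrow(Carseats)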
14. a.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The data are generated from the model \(y = 2 + 2x_1 + 0.3x_2 + \epsilon\), so the true regression coefficients are \(\beta_0 = 2\), \(\beta_1 = 2\), and \(\beta_2 = 0.3\).
b.
plot(x1, x2)
cor(x1, x2)
## [1] 0.8351212
x2 increases as x1 increases; the two predictors are strongly positively correlated (r ≈ 0.84).
c.
model2 <- lm(y ~ x1 + x2)
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates are quite different from the true values, except for \(\hat{\beta}_0\). True values: \(\beta_0 = 2\), \(\beta_1 = 2\), \(\beta_2 = 0.3\); estimates: \(\hat{\beta}_0 = 2.13\), \(\hat{\beta}_1 = 1.44\), \(\hat{\beta}_2 = 1.01\). Since the p-value for \(\hat{\beta}_1\) is below 0.05, we can reject the null hypothesis \(H_0: \beta_1 = 0\); however, we cannot do the same for \(\beta_2\).
d.
model3 <- lm(y ~ x1)
summary(model3)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The coefficient estimates are much closer to the true values this time, and we can reject the null hypothesis \(H_0: \beta_1 = 0\).
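One way to see how close the estimate is to the truth is a confidence interval for the x1 coefficient; under the data-generating model (\(\beta_1 = 2\)) we would expect it to cover 2. A minimal sketch:
# 95% confidence interval for the x1 coefficient in the y ~ x1 fit
confint(model3, "x1")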
e.
model4 <- lm(y ~ x2)
summary(model4)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
f. Despite appearing to, the results do not contradict one another. Because x1 and x2 are highly correlated, collinearity masks the importance of x2 when both predictors are included in the model; on its own, x2 is a significant predictor of y.
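The degree of collinearity can be quantified with a variance inflation factor, computed by hand here so no extra package is needed; a minimal sketch:
# VIF for x2: 1 / (1 - R^2) from regressing x2 on x1; with only two
# predictors the same value applies to x1 as well
r2_x2_on_x1 <- summary(lm(x2 ~ x1))$r.squared
1 / (1 - r2_x2_on_x1)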
g.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
model5 <- lm(y ~ x1 + x2)
model6 <- lm(y ~ x1)
model7 <- lm(y ~ x2)
summary(model5)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(model6)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(model7)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
For the first model, which includes both predictors (\(x_1\) and \(x_2\)), the coefficient estimates change substantially compared with the original fit: \(\hat{\beta}_2\) is now the statistically significant coefficient rather than \(\hat{\beta}_1\). In the models that include only one of the predictors, the significance of each variable remains as in the original fits, but the coefficient estimates shift because of the new observation.
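A compact way to see these shifts is to put the old and new coefficient estimates side by side, using the model objects already fitted above; a minimal sketch:
# Coefficients before vs. after adding the mismeasured observation
rbind(before = coef(model2), after = coef(model5)) # y ~ x1 + x2
rbind(before = coef(model3), after = coef(model6)) # y ~ x1
rbind(before = coef(model4), after = coef(model7)) # y ~ x2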
par(mfrow=c(2, 2))
plot(model5)
par(mfrow=c(2,2))
plot(model7)
The mismeasured observation is a high-leverage point in both the first and third models, which means it has a strong influence on the estimated regression coefficients.
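The leverage of the added point can be checked directly; it is observation 101 after the append above. A minimal sketch:
# Hat value of the new observation versus the average-leverage benchmark
hatvalues(model5)[101]
(2 + 1) / length(y) # (p + 1)/n for the model with two predictors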
plot(predict(model5), rstudent(model5))
plot(predict(model6), rstudent(model6))
plot(predict(model7), rstudent(model7))
In the second model, which uses \(x_1\) as the only predictor, the new observation is flagged as an outlier: its studentized residual exceeds the rule-of-thumb cutoff of 3.
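This can be confirmed by extracting the studentized residual of the added observation from each refit; a minimal sketch:
# Studentized residual of observation 101 in each model;
# absolute values above 3 are commonly flagged as outliers
rstudent(model5)[101]
rstudent(model6)[101]
rstudent(model7)[101]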