library(ISLR2)
data("Auto")
auto <- Auto
auto <- na.omit(auto)
model <- lm(mpg ~ horsepower, data = auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
(i) The p-value for the horsepower coefficient is essentially zero, which indicates a strong relationship between the response (mpg) and the predictor (horsepower).
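As a quick cross-check (a supplementary sketch, not part of the original answer): in simple linear regression, the squared correlation between the response and the predictor equals the R-squared of the fit.
cor(auto$mpg, auto$horsepower)^2 # should reproduce the Multiple R-squared of about 0.6059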
mean(auto$mpg, na.rm = TRUE) # mean mpg, used to express the RSE as a percentage
## [1] 23.44592
4.906 / mean(auto$mpg, na.rm = TRUE) * 100 # RSE (4.906) as a percentage of mean mpg
## [1] 20.92475
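The hard-coded 4.906 can also be pulled straight from the fitted model; sigma() returns the residual standard error, so the same percentage error can be computed as (a supplementary sketch):
sigma(model) / mean(auto$mpg) * 100 # ~20.9, matching the figure above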
(iv)
horsepower_new <- data.frame(horsepower = 98)
predicted_mpg <- predict(model, newdata = horsepower_new)
print(predicted_mpg)
## 1
## 24.46708
intervals <- predict(model, newdata = horsepower_new, interval = "confidence", level = 0.95)
prediction_intervals <- predict(model, newdata = horsepower_new, interval = "prediction", level = 0.95)
print(intervals)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
print(prediction_intervals)
## fit lwr upr
## 1 24.46708 14.8094 34.12476
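Both intervals are centred on the same fitted value; the prediction interval is far wider because it adds the irreducible error of a single new observation to the uncertainty in the estimated line. A quick width comparison (supplementary sketch):
intervals[, "upr"] - intervals[, "lwr"] # confidence interval width, ~0.99 mpg
prediction_intervals[, "upr"] - prediction_intervals[, "lwr"] # prediction interval width, ~19.3 mpg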
(b)
plot(auto$horsepower, auto$mpg,
     xlab = "Horsepower", ylab = "MPG",
     main = "Scatter plot of MPG vs Horsepower")
abline(model, col = "red")
(c)
plot(model)
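Calling plot() on an lm object produces the four diagnostic plots one at a time; to see them on a single page, a 2 x 2 layout can be set first (a minor layout tweak, not in the original code):
par(mfrow = c(2, 2)) # draw the four diagnostic plots in one 2 x 2 grid
plot(model)
par(mfrow = c(1, 1)) # restore the default single-panel layout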
## Question 10
(a)
data("Carseats")
multiple_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(multiple_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
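Urban and US are factors, so R expands each into a 0/1 dummy variable; contrasts() shows the coding behind the UrbanYes and USYes rows above (a supplementary check, not part of the original answer):
contrasts(Carseats$Urban) # Yes = 1, No = 0
contrasts(Carseats$US) # Yes = 1, No = 0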
The fitted model is

Sales = 13.0434689 + (−0.0544588) × Price + (−0.0219162) × Urban + (1.2005727) × US,

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
(d) We can reject the null hypothesis H0: βj = 0 for the Price and US variables, since their p-values are far below 0.05; we cannot reject it for Urban.
(e)
model_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(model_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
confint(model_2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
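To confirm that dropping Urban loses essentially nothing, the two models can be compared with a nested-model F-test (a supplementary sketch, not asked for in the exercise):
anova(model_2, multiple_model) # large p-value: no evidence that Urban improves the fit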
plot(model_2)
The plot of standardized residuals versus leverage indicates that there are some outliers (standardized residuals higher than 2 or lower than -2). There are also some high-leverage points, since a few observations exceed the average-leverage threshold (p + 1)/n; the sketch below checks both counts numerically.
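These claims can be verified with hatvalues(), which gives each observation's leverage, and rstandard(), which gives the standardized residuals (a supplementary sketch, assuming model_2 is still in the workspace):
n <- nrow(Carseats); p <- 2 # two predictors in model_2
sum(abs(rstandard(model_2)) > 2) # observations flagged as potential outliers
sum(hatvalues(model_2) > (p + 1) / n) # observations above the average-leverage threshold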
## Question 14
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
(c)
model_3 <- lm(y ~ x1 + x2)
summary(model_3)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimated coefficients β̂0, β̂1, and β̂2 are 2.1305, 1.4396, and 1.0097 respectively. β̂0 is close to the true β0 = 2, but β̂1 and β̂2 are well away from the true β1 = 2 and β2 = 0.3. We may reject the null hypothesis H0: β1 = 0, as its p-value (0.0487) is just below 0.05; we may not reject H0: β2 = 0, as its p-value (0.3754) is well above 0.05.
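Confidence intervals make the same point: with collinear predictors the intervals are wide, and the interval for x2 covers 0 while the one for x1 only barely excludes it (a supplementary sketch):
confint(model_3) # 95% intervals for the coefficients of the x1 + x2 model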
(d)
model_4 <- lm(y ~ x1)
summary(model_4)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
The coefficient of x1 differs from the previous case, where y was regressed on both x1 and x2. Here the coefficient of x1 is highly significant and its p-value is very low, so we may reject the null hypothesis H0: β1 = 0.
(e)
model_5 <- lm(y ~ x2)
summary(model_5)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
The coefficient of x2 is also highly significant and its p-value is very low, so we may again reject the null hypothesis β1 = 0.
(f) No, the results of (c) and (e) do not contradict each other. The difference is due to collinearity: the response depends on two predictors that are collinear with one another, so each masks the other's effect when both are in the model. Collinearity also inflates the standard errors; the standard error of x1 in the x1 + x2 model (0.7212) is much larger than in the x1-only model (0.3963). The sketch below quantifies this.
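The degree of collinearity can be quantified with the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the other (a self-contained sketch that avoids extra packages):
r2 <- summary(lm(x1 ~ x2))$r.squared # R-squared from regressing x1 on x2
1 / (1 - r2) # VIF of roughly 3.3, consistent with cor(x1, x2) = 0.835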
(g)
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
model_6 <- lm(y ~ x1 + x2)
model_7 <- lm(y ~ x1)
model_8 <- lm(y ~ x2)
summary(model_6)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(model_7)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(model_8)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
plot(model_6)
plot(model_7)
plot(model_8)
In the model with two predictors, point 101 is a high-leverage point. In the model with x1 as the only predictor, point 101 is an outlier. In the model with x2 as the only predictor, point 101 is again a high-leverage point; the sketch below verifies this.
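Point 101's status can be checked directly (a supplementary sketch, assuming the three models above are still in the workspace):
n <- length(y) # 101 observations after the added point
hatvalues(model_6)[101] # compare to (2 + 1) / n: high leverage in the x1 + x2 model
rstandard(model_7)[101] # |value| > 2 marks it as an outlier in the x1-only model
hatvalues(model_8)[101] # compare to (1 + 1) / n: high leverage in the x2-only model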