Load libraries and attach data sets
library(ISLR2)
library(tidyverse)
library(GGally)
attach(Auto)
attach(Carseats)
The KNN regression method is closely related to the KNN classifier. The textbook states that for KNN classifiers, “Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j. Finally, KNN classifies the test observation x0 to the class with the largest probability.” It goes on to say that “KNN regression first identifies the K training observations that are closest to x0, represented by N0. It then estimates f(x0) using the average of all the training responses in N0.”
From this, my understanding is that the KNN classifier uses estimated class probabilities to assign each observation to a class (a qualitative output), while KNN regression averages the responses of the nearest neighbors to produce a numeric prediction (a quantitative output).
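A small sketch makes the contrast concrete (this assumes the class and FNN packages, which are not loaded above, and uses made-up toy data rather than anything from the exercises):
library(class) # knn() for classification
library(FNN)   # knn.reg() for regression
set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)         # 50 toy training points
train_cl <- factor(rep(c("A", "B"), each = 25)) # qualitative response
train_y <- rnorm(50)                            # quantitative response
test_x <- matrix(rnorm(10), ncol = 2)           # 5 toy test points
# Classification: majority vote among the K nearest neighbors
knn(train_x, test_x, cl = train_cl, k = 5)
# Regression: average of the K nearest neighbors' responses
knn.reg(train_x, test_x, y = train_y, k = 5)$pred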
This question involves the use of multiple linear regression on the Auto data set.
auto <- ISLR2::Auto
plot(auto, col = "dark blue")
auto_minus_name <- subset(auto, select = -name)
cor(auto_minus_name)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
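Since GGally is loaded above but never used, the same pairwise relationships could also be visualized with its scatterplot-matrix function (a sketch; the plot is not reproduced here):
# Scatterplot matrix with pairwise correlations in the upper panels
GGally::ggpairs(auto_minus_name)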
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
auto_lrm <- lm(mpg ~ ., data = auto_minus_name)
summary(auto_lrm)
##
## Call:
## lm(formula = mpg ~ ., data = auto_minus_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, based on the F-statistic's p-value (< 2.2e-16) and the R-squared of about 0.82, there appears to be a relationship between at least some of the predictors and mpg.
The predictors marked with “**” or “***” are statistically significant at conventional levels. In this case displacement, weight, year, and origin all appear to be statistically significant.
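The same list can be pulled out programmatically (a quick sketch using the conventional 0.05 cutoff; the intercept shows up in the result as well):
# Terms in the fitted model with p-values below 0.05
coefs <- summary(auto_lrm)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]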
What does the coefficient for the year variable suggest? The year variable has a coefficient of 0.750773, which suggests that, holding the other predictors fixed, each additional model year is associated with an increase of roughly 0.75 mpg.
The fit is not great: the diagnostic plots below show that the data are not as linear as we would like.
par(mfrow = c(2, 2))
plot(auto_lrm, col = "dark blue")
The Residuals vs Fitted, Normal Q-Q, and Scale-Location plots all show outliers in the top-right portion of the charts.
No: in the Residuals vs Leverage plot, none of the points fall outside the red dashed Cook's distance contours, so no observation appears to have unusually high leverage.
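This can also be checked numerically; a small sketch using the common 4/n rule of thumb (the cutoff is an assumption, not something from the text):
cd <- cooks.distance(auto_lrm)
max(cd)                    # largest Cook's distance in the fit
which(cd > 4 / length(cd)) # observations flagged by the 4/n rule of thumb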
auto_lrm_interact <- lm(mpg ~ . + weight * displacement + horsepower * origin + year:weight, data = auto_minus_name)
summary(auto_lrm_interact)
##
## Call:
## lm(formula = mpg ~ . + weight * displacement + horsepower * origin +
## year:weight, data = auto_minus_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8003 -1.5504 -0.1553 1.2630 12.0826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.962e+01 1.390e+01 -5.008 8.41e-07 ***
## cylinders 1.535e-01 2.900e-01 0.529 0.59696
## displacement -5.244e-02 1.113e-02 -4.710 3.48e-06 ***
## horsepower -1.128e-02 2.033e-02 -0.555 0.57934
## weight 1.272e-02 4.874e-03 2.609 0.00944 **
## acceleration 6.202e-02 9.098e-02 0.682 0.49585
## year 1.556e+00 1.775e-01 8.768 < 2e-16 ***
## origin 2.181e+00 1.050e+00 2.076 0.03853 *
## displacement:weight 1.621e-05 2.674e-06 6.061 3.26e-09 ***
## horsepower:origin -1.930e-02 1.222e-02 -1.580 0.11493
## weight:year -2.874e-04 6.295e-05 -4.565 6.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.866 on 381 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8651
## F-statistic: 251.8 on 10 and 381 DF, p-value: < 2.2e-16
Two of the three interactions I explored appear to be statistically significant: weight\(\times\)displacement and year\(\times\)weight both have very small p-values, while horsepower\(\times\)origin is not significant at the 0.05 level (p \(\approx\) 0.11).
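A nested-model F-test (a sketch; this comparison is not part of the original exercise) can check whether the interaction terms jointly improve the fit:
# Compare the additive model against the model with the three interactions
anova(auto_lrm, auto_lrm_interact)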
auto_lrm_transform <- lm(mpg ~ . + log(displacement) + log(acceleration) + I(horsepower^2), data = auto_minus_name)
summary(auto_lrm_transform)
##
## Call:
## lm(formula = mpg ~ . + log(displacement) + log(acceleration) +
## I(horsepower^2), data = auto_minus_name)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3686 -1.5724 -0.0506 1.5073 11.7089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.389e+01 1.504e+01 3.582 0.000385 ***
## cylinders 4.061e-01 2.988e-01 1.359 0.174931
## displacement 2.795e-02 1.260e-02 2.218 0.027116 *
## horsepower -2.347e-01 3.895e-02 -6.025 3.98e-09 ***
## weight -2.897e-03 6.782e-04 -4.272 2.45e-05 ***
## acceleration 6.181e-01 4.670e-01 1.324 0.186419
## year 7.599e-01 4.536e-02 16.753 < 2e-16 ***
## origin 5.969e-01 2.704e-01 2.208 0.027867 *
## log(displacement) -8.207e+00 2.147e+00 -3.823 0.000154 ***
## log(acceleration) -1.439e+01 7.478e+00 -1.924 0.055054 .
## I(horsepower^2) 6.528e-04 1.319e-04 4.948 1.13e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.935 on 381 degrees of freedom
## Multiple R-squared: 0.8622, Adjusted R-squared: 0.8586
## F-statistic: 238.3 on 10 and 381 DF, p-value: < 2.2e-16
Taking the log of displacement makes that term highly significant, while log(acceleration) is only marginally significant (p \(\approx\) 0.055). Adding I(horsepower^2) makes both the linear and quadratic horsepower terms highly significant, which suggests the relationship between horsepower and mpg is nonlinear.
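Comparing adjusted R-squared across the three fits summarizes the effect of the interactions and transformations (a quick sketch):
sapply(list(base = auto_lrm,
            interact = auto_lrm_interact,
            transform = auto_lrm_transform),
       function(m) summary(m)$adj.r.squared)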
This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
carseats_lrm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats_lrm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Urban is not significant; however, both Price and US are. Holding the other predictors fixed, each unit increase in Price is associated with a 0.054 decrease in Sales. Likewise, stores in the US sell on average 1.20 more units of Sales (about 1,200 car seats, since Sales is recorded in thousands) than stores outside the US.
Sales = 13.043469 - 0.054459\(\times\)Price - 0.021916\(\times\)UrbanYes + 1.200573\(\times\)USYes.
We can reject the null hypothesis \(H_0: \beta_j = 0\) for the predictors Price and US, but not for Urban.
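As a sanity check on the equation and the dummy coding, here is a sketch that predicts Sales at a hypothetical Price of 100 for each Urban/US combination:
grid <- expand.grid(Price = 100,
                    Urban = c("No", "Yes"),
                    US = c("No", "Yes"))
cbind(grid, predicted_sales = predict(carseats_lrm, newdata = grid))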
carseats_lrm2 <- lm(Sales ~ Price + US)
summary(carseats_lrm2)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The R-squared in both models is 0.2393, which indicates a mediocre fit: the models explain only approximately 24% of the variance in Sales.
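Since the second model is just the first with Urban dropped, a nested F-test (a sketch) confirms that removing Urban costs essentially nothing; its p-value matches Urban's t-test p-value above:
anova(carseats_lrm2, carseats_lrm)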
confint(carseats_lrm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Based on the diagnostic plots below, no observations appear to be outliers or to have high leverage according to Cook's distance.
par(mfrow = c(2, 2))
plot(carseats_lrm2, col = "dark blue")
This problem involves simple linear regression without an intercept.
The coefficient estimate will be the same for X onto Y and Y onto X exactly when \(\sum_i x_i^2 = \sum_i y_i^2\); Y = X is one special case that satisfies this condition.
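To see why, write out the two least-squares estimates for a regression through the origin:
\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}.
\]
The numerators are identical, so the two estimates agree exactly when the denominators are equal.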
set.seed(22)
X1 <- rnorm(100)
Y1 <- rnorm(100) + 7       # independent draws, so sum(Y1^2) != sum(X1^2)
YintoX1 <- lm(Y1 ~ X1 + 0) # regress Y onto X, no intercept
XintoY1 <- lm(X1 ~ Y1 + 0) # regress X onto Y, no intercept
The coefficients are not equal here because \(\sum_i x_i^2 \neq \sum_i y_i^2\):
summary(YintoX1)$coefficients[1,1]
## [1] 0.5950274
summary(XintoY1)$coefficients[1,1]
## [1] 0.0122923
set.seed(22)
X2 <- rnorm(100)
Y2 <- X2                   # Y identical to X, so sum(X2^2) == sum(Y2^2)
YintoX2 <- lm(Y2 ~ X2 + 0)
XintoY2 <- lm(X2 ~ Y2 + 0)
The coefficients are equal when X and Y are equal, since then the denominators trivially match:
summary(YintoX2)$coefficients[1,1]
## [1] 1
summary(XintoY2)$coefficients[1,1]
## [1] 1
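More generally, the two estimates coincide whenever \(\sum_i x_i^2 = \sum_i y_i^2\), even if Y \(\neq\) X; a sketch using a random permutation of X:
set.seed(22)
X3 <- rnorm(100)
Y3 <- sample(X3)      # same values in a different order: sum(X3^2) == sum(Y3^2)
coef(lm(Y3 ~ X3 + 0)) # this estimate...
coef(lm(X3 ~ Y3 + 0)) # ...equals this one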