Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN Classifier is used for categorical (classification) tasks, while KNN Regression is used for continuous (regression) tasks.
KNN Classifier finds the K closest training points to a given test point and assigns the most frequent class among those neighbors, while KNN Regression finds the K closest training points and calculates the average of their response variables.
KNN Classifier uses majority voting, which means that the class label that appears most often among the K neighbors is assigned to the test point, while KNN Regression uses numerical averaging, which means that the predicted value is the mean of the K nearest neighbors' target values.
KNN Classifier’s output is a discrete category like ‘Yes’ or ‘No’, while KNN Regression’s output is a continuous numerical value.
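The contrast between majority voting and averaging can be sketched in a few lines of base R. This is a minimal illustration for a single numeric predictor, not a packaged implementation: `knn_predict` and the toy vectors are hypothetical names and values made up for the example, and ties in the vote are broken by taking the first class.

```r
# Minimal KNN for one numeric predictor (illustrative helper, not from a package).
knn_predict <- function(x_train, y_train, x_new, k = 3) {
  # Indices of the k training points closest to x_new
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  neighbours <- y_train[nearest]
  if (is.factor(neighbours) || is.character(neighbours)) {
    # Classification: return the most frequent class among the neighbours
    names(which.max(table(neighbours)))
  } else {
    # Regression: return the mean response of the neighbours
    mean(neighbours)
  }
}

# Toy training data (made up for the example)
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
y_num   <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)

class_pred <- knn_predict(x_train, y_class, x_new = 2.4, k = 3)  # a class label
value_pred <- knn_predict(x_train, y_num,   x_new = 2.4, k = 3)  # a number
print(class_pred)
print(value_pred)
```

The neighbour search is identical in both calls; only the final aggregation step differs.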
library(ISLR)
attach(Auto)
pairs(Auto)
cor(Auto[, sapply(Auto, is.numeric)])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
model9c <- lm(mpg ~ . -name, data = Auto)
summary(model9c)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, there is a relationship between the predictors and the response
variable: the p-value of the F-statistic is below 2.2e-16, which
suggests that at least one predictor significantly explains variation in
mpg. However, cylinders,
horsepower and acceleration have p-values above 0.05, so,
given the other predictors in the model, their individual effects
on mpg are not statistically
significant.
displacement, weight, year,
and origin are the predictors with a statistically
significant relationship to the response.
Holding all other predictors constant, each one-unit (one model year) increase in
year is associated with an average increase of about 0.75 in
mpg. This likely reflects improvements in fuel efficiency
over time.
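The "holding all other predictors constant" reading of a coefficient can be checked directly with predict(). The sketch below uses the built-in mtcars data as a stand-in (an assumption made so the example runs without the ISLR package); the variable names `base` and `plus_one` are illustrative.

```r
# Stand-in example on mtcars: the coefficient of hp equals the change in the
# predicted mpg when hp rises by one unit and wt is held fixed.
fit <- lm(mpg ~ wt + hp, data = mtcars)

base     <- data.frame(wt = 3, hp = 110)
plus_one <- data.frame(wt = 3, hp = 111)  # same wt, hp increased by 1

pred_diff <- unname(predict(fit, plus_one) - predict(fit, base))
hp_coef   <- unname(coef(fit)["hp"])

print(c(pred_diff = pred_diff, hp_coef = hp_coef))
```

The same check applied to model9c with year increased by one would return the year coefficient.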
par(mfrow = c(2, 2))
plot(model9c)
par(mfrow = c(1, 1))
I do not see any issues with the diagnostic plots.
The residual plots do not suggest any large outliers.
The leverage plot also does not identify any observations with unusually high leverage. All observations fall inside the Cook's distance contours.
model9e <- lm(mpg ~ . -name + displacement:weight + horsepower:acceleration + year:origin, data = Auto)
summary(model9e)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower:acceleration +
## year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7073 -1.6687 0.0337 1.4242 12.8153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.726e+00 8.643e+00 0.663 0.508026
## cylinders 3.390e-01 2.915e-01 1.163 0.245533
## displacement -7.449e-02 1.076e-02 -6.925 1.86e-11 ***
## horsepower 5.107e-02 2.420e-02 2.110 0.035494 *
## weight -8.580e-03 8.525e-04 -10.065 < 2e-16 ***
## acceleration 5.743e-01 1.545e-01 3.717 0.000231 ***
## year 5.097e-01 9.887e-02 5.155 4.09e-07 ***
## origin -1.228e+01 4.128e+00 -2.975 0.003118 **
## displacement:weight 1.948e-05 2.325e-06 8.381 1.03e-15 ***
## horsepower:acceleration -6.668e-03 1.730e-03 -3.854 0.000136 ***
## year:origin 1.639e-01 5.303e-02 3.091 0.002143 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.87 on 381 degrees of freedom
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8648
## F-statistic: 251.1 on 10 and 381 DF, p-value: < 2.2e-16
The interactions displacement:weight,
horsepower:acceleration, and year:origin
all appear to be statistically significant.
Many of the other interactions do not appear to be statistically significant, as the next model shows.
model9e_2 <- lm(mpg ~ . -name + displacement:horsepower + weight:acceleration + cylinders:origin, data = Auto)
summary(model9e_2)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:horsepower + weight:acceleration +
## cylinders:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2931 -1.6317 -0.1017 1.4266 12.5900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.724e+00 7.483e+00 -0.765 0.445
## cylinders 4.400e-01 4.842e-01 0.909 0.364
## displacement -7.077e-02 1.173e-02 -6.032 3.84e-09 ***
## horsepower -1.909e-01 2.264e-02 -8.431 7.13e-16 ***
## weight -1.831e-03 1.688e-03 -1.084 0.279
## acceleration 2.969e-02 3.017e-01 0.098 0.922
## year 7.406e-01 4.551e-02 16.275 < 2e-16 ***
## origin 1.315e-01 1.200e+00 0.110 0.913
## displacement:horsepower 4.926e-04 6.194e-05 7.953 2.10e-14 ***
## weight:acceleration -8.485e-05 1.000e-04 -0.848 0.397
## cylinders:origin 1.319e-01 2.832e-01 0.465 0.642
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.916 on 381 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.8604
## F-statistic: 242 on 10 and 381 DF, p-value: < 2.2e-16
In this model, we can see that the interaction
displacement:horsepower is statistically significant, while
weight:acceleration, and cylinders:origin are
not.
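One way to test a single interaction term, sketched here on the built-in mtcars data as a stand-in for Auto (an assumption, so the example runs without ISLR), is to read its p-value from the coefficient table or to compare nested models with anova(). For one added term the two tests agree.

```r
# Stand-in example: does the wt:hp interaction improve on the main-effects model?
fit_main <- lm(mpg ~ wt + hp, data = mtcars)
fit_int  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# p-value of the interaction term from the t-test in the coefficient table
int_p <- coef(summary(fit_int))["wt:hp", "Pr(>|t|)"]

# Partial F-test comparing the two nested models
cmp <- anova(fit_main, fit_int)
f_p <- cmp[2, "Pr(>F)"]

print(c(t_test_p = int_p, f_test_p = f_p))
```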
model9f <- lm(mpg ~ cylinders + displacement + I(horsepower^2) + log(weight) + sqrt(acceleration) + year + origin, data = Auto)
summary(model9f)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + I(horsepower^2) +
## log(weight) + sqrt(acceleration) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3864 -1.9909 0.0413 1.6213 12.8007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.329e+02 1.106e+01 12.011 < 2e-16 ***
## cylinders -3.447e-01 3.050e-01 -1.130 0.25909
## displacement 1.532e-02 7.259e-03 2.111 0.03541 *
## I(horsepower^2) 4.171e-05 3.947e-05 1.057 0.29135
## log(weight) -2.261e+01 1.530e+00 -14.773 < 2e-16 ***
## sqrt(acceleration) 1.569e+00 6.473e-01 2.424 0.01582 *
## year 8.073e-01 4.723e-02 17.092 < 2e-16 ***
## origin 8.687e-01 2.636e-01 3.295 0.00108 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.095 on 384 degrees of freedom
## Multiple R-squared: 0.8456, Adjusted R-squared: 0.8427
## F-statistic: 300.3 on 7 and 384 DF, p-value: < 2.2e-16
The adjusted R-squared increased from 0.8182 in the untransformed model (model9c) to 0.8427 in the model with transformations.
The RSE dropped from 3.328 to 3.095, which indicates a closer fit to the training data with lower residual variability.
log(weight) confirms that weight has a strong negative
effect on mpg.
sqrt(acceleration) is now significant compared to the
other model, suggesting that acceleration’s effect on mpg is
non-linear.
Squaring horsepower I(horsepower^2) didn’t improve its
significance, meaning it may not have a strong non-linear effect on
mpg.
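The comparison above reads adjusted R-squared and RSE off two summaries. A small sketch of collecting them side by side follows, again using mtcars as a stand-in dataset (an assumption); `fit_linear`, `fit_log` and `comparison` are illustrative names. The comparison is only meaningful because both models share the same response.

```r
# Stand-in example: compare a linear and a log-transformed predictor
fit_linear <- lm(mpg ~ wt, data = mtcars)
fit_log    <- lm(mpg ~ log(wt), data = mtcars)

comparison <- data.frame(
  model  = c("wt", "log(wt)"),
  adj_r2 = c(summary(fit_linear)$adj.r.squared, summary(fit_log)$adj.r.squared),
  rse    = c(summary(fit_linear)$sigma, summary(fit_log)$sigma)
)
print(comparison)
```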
library(ISLR)
attach(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
model10a <- lm(Sales ~ Price + Urban + US , data = Carseats)
summary(model10a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
coef(model10a)[2]
## Price
## -0.05445885
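The coefficient for Price is -0.054459. Sales is recorded in thousands of units, so,
holding Urban and US fixed, each one-dollar increase in price is associated with
an average decrease in sales of about 54 units.
The coefficient for UrbanYes is -0.021916: urban stores sell about 22 fewer units on average than non-urban stores, but this difference is not statistically significant (p = 0.936).
The coefficient for USYes is 1.200573, which means that, on average, US stores sell about 1,200 more units than stores outside the US.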
We can reject the null hypothesis (coefficient equal to zero) for
Price and US, because both have
p-values lower than 0.05.
Urban's p-value is far above 0.05
(0.936), which means we cannot reject the null
hypothesis for that predictor.
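The p-value behind that decision can be reproduced by hand from the coefficient table of model10a. The sketch below plugs in the UrbanYes estimate, standard error, and residual degrees of freedom printed above; the variable names are illustrative.

```r
# Values copied from the summary(model10a) output for UrbanYes
est <- -0.021916
se  <- 0.271650
df  <- 396          # residual degrees of freedom

t_value <- est / se                    # t statistic
p_value <- 2 * pt(-abs(t_value), df)   # two-sided p-value

print(round(c(t_value = t_value, p_value = p_value), 3))
```

A p-value this close to 1 means the data give essentially no evidence against a zero Urban effect.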
model10e <- lm(Sales ~ Price + US, data = Carseats)
summary(model10e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The fit is not good. Adjusted R-squared is 0.2335 for the model in part (a) and 0.2354 for the model in part (e), so each model explains only about 24% of the variance in Sales. I'd prefer an adjusted R-squared above 0.7.
confint(model10e)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
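As a check on what confint() computes, the 95% interval for Price can be rebuilt as estimate plus or minus a t quantile times the standard error. The numbers below are copied from the summary(model10e) output, so the result matches the confint() row only up to rounding of those printed values.

```r
# Values copied from the summary(model10e) output for Price
est <- -0.05448
se  <- 0.00523
df  <- 397          # residual degrees of freedom

t_crit <- qt(0.975, df)   # critical value for a 95% two-sided interval
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)
```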
par(mfrow = c(2, 2))
plot(model10e)
summary(influence.measures(model10e))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
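All points fall well inside the Cook's distance contours, which are not even visible on the residuals-vs-leverage plot because no point comes close to them. The largest residuals are under 3 residual standard errors from zero (about 7.05 against an RSE of 2.469), so there are no clear outliers. However, influence.measures() flags the hat values of observations 43, 126, 166, 175, 314 and 368, so a few observations do have relatively high leverage.
Even so, no single observation appears to be highly influential: the largest Cook's distance in the table is 0.03, and although observations 26, 50 and 368 are flagged on DFFITS, those values (at most 0.28) and the DFBETAS are all relatively low.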
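Leverage and influence are separate questions, and both can be checked numerically rather than by eye. The sketch below uses simulated data with one deliberately extreme predictor value (everything here is made up for illustration) and flags hat values above three times the average leverage, which is one common rule of thumb.

```r
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
d$x[1] <- 8   # hypothetical extreme predictor value for observation 1

fit <- lm(y ~ x, data = d)

h <- hatvalues(fit)          # leverage of each observation
k <- length(coef(fit))       # number of estimated coefficients
avg_leverage <- k / n        # hat values always average to k / n

high_leverage <- which(h > 3 * avg_leverage)   # rule-of-thumb cutoff
cooks <- cooks.distance(fit)                   # influence on the fitted model

print(high_leverage)
print(round(max(cooks), 3))
```

A point can have high leverage without being influential if it lies close to the fitted line, which is why both quantities are inspected.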
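For regression without an intercept, the coefficient estimate for the regression of Y onto X is sum(x_i * y_i) / sum(x_i^2), and the estimate for the regression of X onto Y is sum(x_i * y_i) / sum(y_i^2). The two estimates are therefore the same exactly when the sum of squares of X equals the sum of squares of Y. Data lying on a 45-degree line through the origin (Y = X) is one special case of this, but it is not required.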
set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 2)
Y <- 0.5 * X + rnorm(n, mean = 0, sd = 1)
model12b <- lm(Y ~ 0 + X) #0 is to make sure there is no intercept
summary(model12b)
##
## Call:
## lm(formula = Y ~ 0 + X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9815 -0.5947 -0.0741 0.4498 2.7669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 0.5122 0.0438 11.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9081 on 99 degrees of freedom
## Multiple R-squared: 0.5801, Adjusted R-squared: 0.5759
## F-statistic: 136.8 on 1 and 99 DF, p-value: < 2.2e-16
model12b_2 <- lm(X ~ 0 + Y)
summary(model12b_2)
##
## Call:
## lm(formula = X ~ 0 + Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3644 -0.7187 0.1213 1.1146 2.6722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 1.13251 0.09683 11.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.35 on 99 degrees of freedom
## Multiple R-squared: 0.5801, Adjusted R-squared: 0.5759
## F-statistic: 136.8 on 1 and 99 DF, p-value: < 2.2e-16
Since sum(X^2) != sum(Y^2), the regression coefficients are different.
Y on X coefficient = 0.5122
X on Y coefficient = 1.13251
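Both coefficients can also be computed from the closed-form no-intercept estimates, which makes the role of the two sums of squares explicit. The sketch regenerates the same simulated data with the same seed; `b_yx` and `b_xy` are illustrative names.

```r
set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 2)
Y <- 0.5 * X + rnorm(n, mean = 0, sd = 1)

# Closed-form least-squares slopes for regression through the origin
b_yx <- sum(X * Y) / sum(X^2)   # Y onto X
b_xy <- sum(X * Y) / sum(Y^2)   # X onto Y

# Same numerator, different denominators
print(c(b_yx = b_yx, b_xy = b_xy, sum_X2 = sum(X^2), sum_Y2 = sum(Y^2)))
```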
set.seed(42)
n <- 100
X <- rnorm(n, mean = 0, sd = 1)
Y <- X
model12c <- lm(Y ~ 0 + X)
summary(model12c)
##
## Call:
## lm(formula = Y ~ 0 + X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.178e-14 -1.028e-16 1.120e-17 1.112e-16 4.124e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 1.000e+00 1.154e-16 8.668e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.196e-15 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.514e+31 on 1 and 99 DF, p-value: < 2.2e-16
model12c_2 <- lm(X ~ 0 + Y)
summary(model12c_2)
##
## Call:
## lm(formula = X ~ 0 + Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.178e-14 -1.028e-16 1.120e-17 1.112e-16 4.124e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 1.000e+00 1.154e-16 8.668e+15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.196e-15 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.514e+31 on 1 and 99 DF, p-value: < 2.2e-16
Since X = Y, their sums of squares are equal, making the two coefficients identical (both equal to 1).
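Setting Y equal to X is the simplest case, but equal sums of squares are enough on their own. As a further sketch (not part of the original exercise), taking Y to be a random reordering of X keeps the sum of squares the same while moving the points off the 45-degree line, and the two coefficients still agree.

```r
set.seed(42)
X <- rnorm(100)
Y <- sample(X)   # same values in a different order, so sum(Y^2) equals sum(X^2)

b_yx <- unname(coef(lm(Y ~ 0 + X)))   # Y onto X
b_xy <- unname(coef(lm(X ~ 0 + Y)))   # X onto Y

print(c(b_yx = b_yx, b_xy = b_xy))
```

Here the shared coefficient is no longer 1, because the points do not lie on the line Y = X.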