KNN classification is used when the response variable is categorical, while KNN regression is used when the response variable is a continuous numeric value. The classifier predicts the most common class among the k nearest neighbors; the regression predicts the average of their responses.
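The difference can be sketched in a few lines of base R. This is only an illustration on made-up one-dimensional data; the helper names `nearest`, `knn_classify`, and `knn_regress` are hypothetical and not taken from any package.

```r
# Toy one-dimensional training data (made up for illustration)
train_x   <- c(1, 2, 3, 6, 7, 8)
train_cls <- c("a", "a", "a", "b", "b", "b")   # categorical response
train_num <- c(1.0, 1.5, 2.0, 6.5, 7.0, 8.0)   # numeric response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN classification: majority vote among the k nearest labels
knn_classify <- function(x0, k = 3) {
  votes <- table(train_cls[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average of the k nearest numeric responses
knn_regress <- function(x0, k = 3) mean(train_num[nearest(x0, k)])

knn_classify(2.5)  # "a"
knn_regress(2.5)   # mean of 1.5, 2.0, 1.0 = 1.5
```

Both functions share the same neighbor search and differ only in how they summarize the neighbors' responses.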
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'ISLR2'
## The following object is masked from 'package:MASS':
##
## Boston
## 'data.frame': 397 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130" "165" "150" "150" ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## Warning: NAs introduced by coercion
pairs(auto %>% select(-name))
cor(auto %>% select(-name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7762599 -0.8044430 NA -0.8317389
## cylinders -0.7762599 1.0000000 0.9509199 NA 0.8970169
## displacement -0.8044430 0.9509199 1.0000000 NA 0.9331044
## horsepower NA NA NA 1 NA
## weight -0.8317389 0.8970169 0.9331044 NA 1.0000000
## acceleration 0.4222974 -0.5040606 -0.5441618 NA -0.4195023
## year 0.5814695 -0.3467172 -0.3698041 NA -0.3079004
## origin 0.5636979 -0.5649716 -0.6106643 NA -0.5812652
## acceleration year origin
## mpg 0.4222974 0.5814695 0.5636979
## cylinders -0.5040606 -0.3467172 -0.5649716
## displacement -0.5441618 -0.3698041 -0.6106643
## horsepower NA NA NA
## weight -0.4195023 -0.3079004 -0.5812652
## acceleration 1.0000000 0.2829009 0.2100836
## year 0.2829009 1.0000000 0.1843141
## origin 0.2100836 0.1843141 1.0000000
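The `NA` entries for horsepower arise because the column was read as character and some values could not be converted to numbers. A minimal sketch of one way to handle this, on a toy data frame rather than the real `auto` data; the `"?"` placeholder for missing values is an assumption about the raw file.

```r
# Toy data frame mimicking the issue: a numeric column stored as character,
# with "?" marking missing values (an assumption about the raw file)
toy <- data.frame(
  mpg        = c(18, 15, 25, 30, 22),
  horsepower = c("130", "165", "?", "70", "95"),
  stringsAsFactors = FALSE
)

# Turn the placeholder into NA first, so the conversion is silent
toy$horsepower[toy$horsepower == "?"] <- NA
toy$horsepower <- as.numeric(toy$horsepower)

# Default cor() propagates NA; use = "complete.obs" drops incomplete rows
cor(toy)
cor(toy, use = "complete.obs")
```

With `use = "complete.obs"`, the correlation matrix is computed only on rows with no missing values, so the horsepower row and column are filled in.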
auto.model <- lm(mpg ~ . -name, data = auto)
summary(auto.model)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(auto.model)
The residuals vs. fitted plot is slightly curved, which suggests nonlinearity. The spread of the residuals also widens toward the right, which suggests non-constant variance (heteroscedasticity). The Q-Q plot shows a few points in the upper tail that depart from the reference line, but because most points fall along it, the normality assumption is probably acceptable. The scale-location plot does not suggest any obvious outliers, and the residuals vs. leverage plot does not flag any points as having unusually high leverage.
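Visual impressions from the diagnostic plots can be cross-checked numerically. The sketch below uses the built-in `mtcars` data as a stand-in for the Auto data; the cutoffs (absolute studentized residual above 3, leverage above twice the average) are common rules of thumb, not hard thresholds.

```r
# Built-in mtcars stands in for the Auto data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit)) - 1          # number of predictors

# Studentized residuals: |value| > 3 is a common flag for outliers
stud <- rstudent(fit)
names(stud)[abs(stud) > 3]

# Leverage: the average is (p + 1) / n; values far above it deserve a look
lev <- hatvalues(fit)
names(lev)[lev > 2 * (p + 1) / n]
```

Applying the same calls to `auto.model` would list any observations that the plots might have hidden.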
auto.model.interactions = lm(mpg~.:.,auto %>% select(-name))
summary(auto.model.interactions)
##
## Call:
## lm(formula = mpg ~ .:., data = auto %>% select(-name))
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
At the 0.05 level, the displacement:year, acceleration:year, and acceleration:origin interactions appear significant.
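Rather than scanning a long coefficient table by eye, the significant interaction terms can be pulled out programmatically. A minimal sketch on the built-in `mtcars` data, with a smaller set of predictors chosen purely for illustration:

```r
# All main effects plus all two-way interactions of three predictors
fit_int <- lm(mpg ~ (wt + hp + disp)^2, data = mtcars)

coefs <- summary(fit_int)$coefficients        # matrix: estimate, SE, t, p
is_interaction <- grepl(":", rownames(coefs), fixed = TRUE)
p_values <- coefs[is_interaction, "Pr(>|t|)"]

# Interaction terms with p < 0.05, smallest p-value first
sort(p_values[p_values < 0.05])
```

With 21 interaction terms tested at the 0.05 level, as in the Auto model above, about one would be expected to appear significant by chance alone, so the flagged terms are best read as candidates rather than firm findings.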
auto.model.int2 <- lm(mpg ~ displacement * weight * year * origin, data = auto)
summary(auto.model)  # prints the additive model again; use summary(auto.model.int2) to inspect the interaction fit
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
lm.b = lm(mpg ~ displacement + weight + year + origin, auto)
summary(lm.b)
##
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8313 -2.1194 -0.0079 1.8070 13.1429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.884e+01 3.992e+00 -4.720 3.29e-06 ***
## displacement 6.091e-03 4.757e-03 1.280 0.201
## weight -6.657e-03 5.556e-04 -11.980 < 2e-16 ***
## year 7.765e-01 4.945e-02 15.700 < 2e-16 ***
## origin 1.235e+00 2.653e-01 4.653 4.49e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.348 on 392 degrees of freedom
## Multiple R-squared: 0.8188, Adjusted R-squared: 0.8169
## F-statistic: 442.7 on 4 and 392 DF, p-value: < 2.2e-16
lm.log = lm(formula = mpg ~ log(displacement) + log(weight) + log(year) + log(origin), data = auto)
summary(lm.log)
##
## Call:
## lm(formula = mpg ~ log(displacement) + log(weight) + log(year) +
## log(origin), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0040 -2.0533 -0.0359 1.6305 13.0884
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -81.5119 17.4114 -4.682 3.93e-06 ***
## log(displacement) -0.1381 0.9971 -0.139 0.88989
## log(weight) -18.9260 1.7067 -11.089 < 2e-16 ***
## log(year) 59.0925 3.4591 17.083 < 2e-16 ***
## log(origin) 1.4220 0.4824 2.948 0.00339 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.136 on 392 degrees of freedom
## Multiple R-squared: 0.8411, Adjusted R-squared: 0.8395
## F-statistic: 518.6 on 4 and 392 DF, p-value: < 2.2e-16
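Taking logs improved the overall fit: adjusted R-squared rose from 0.8169 to 0.8395 and the residual standard error fell from 3.348 to 3.136. The p-values did not uniformly decrease, however. Weight and year remain highly significant, while the p-values for displacement (0.201 to 0.890) and origin (4.49e-06 to 0.00339) increased.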
lm.other = lm(mpg ~ I(displacement^2) + sqrt(weight) + sqrt(year) + sqrt(origin), data = auto)
summary(lm.other)
##
## Call:
## lm(formula = mpg ~ I(displacement^2) + sqrt(weight) + sqrt(year) +
## sqrt(origin), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8344 -1.9544 -0.0111 1.7301 12.8196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.078e+01 7.479e+00 -8.128 5.79e-15 ***
## I(displacement^2) 2.969e-05 7.192e-06 4.128 4.48e-05 ***
## sqrt(weight) -8.440e-01 4.649e-02 -18.155 < 2e-16 ***
## sqrt(year) 1.435e+01 8.194e-01 17.512 < 2e-16 ***
## sqrt(origin) 2.755e+00 6.675e-01 4.127 4.49e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.162 on 392 degrees of freedom
## Multiple R-squared: 0.8384, Adjusted R-squared: 0.8367
## F-statistic: 508.2 on 4 and 392 DF, p-value: < 2.2e-16
With displacement squared, the displacement term becomes significant (p = 4.48e-05), whereas it was not in the untransformed model (p = 0.201).
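Comparing transformations is easier when the fit statistics sit side by side. The sketch below does this on the built-in `mtcars` data; the predictors and transformations are chosen only for illustration and are not the ones used above.

```r
# Three candidate models with the same response and different predictor scales
fits <- list(
  linear = lm(mpg ~ wt + hp, data = mtcars),
  log    = lm(mpg ~ log(wt) + log(hp), data = mtcars),
  sqrt   = lm(mpg ~ sqrt(wt) + sqrt(hp), data = mtcars)
)

# Same response in every model, so these fit measures are comparable
comparison <- data.frame(
  adj_r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
  rse    = sapply(fits, sigma)
)
comparison
```

This comparison is only meaningful because the response is untransformed in every model; transforming `mpg` itself would change the scale of the residual standard error.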
carseats = Carseats
str(carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
carseat.model = lm(Sales~Price + Urban + US,data=carseats)
summary(carseat.model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
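Price: holding the other predictors fixed, a one-unit increase in price is associated with a decrease of 0.054459 in Sales. Sales is recorded in thousands of units, so this is roughly 54 fewer units sold.
Urban: the coefficient is not significant (p = 0.936), so there is no evidence that sales differ between urban and non-urban stores.
US: holding the other predictors fixed, stores in the US have Sales about 1.200573 higher (roughly 1,201 more units) than stores outside the US.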
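$$\widehat{\text{Sales}} = 13.043469 - 0.054459\,\text{Price} - 0.021916\,\text{Urban} + 1.200573\,\text{US}$$
where Urban = 1 if the store is urban and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.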
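The interpretation of a dummy-variable coefficient can be checked directly: for two stores at the same price, the difference in predicted sales equals the coefficient on the indicator. A minimal sketch on simulated data (the data frame `sim` and its coefficients are made up and only loosely mimic the Carseats fit):

```r
set.seed(42)
n     <- 200
price <- runif(n, 50, 150)
us    <- factor(sample(c("No", "Yes"), n, replace = TRUE))
sales <- 13 - 0.05 * price + 1.2 * (us == "Yes") + rnorm(n)
sim   <- data.frame(sales, price, us)

fit <- lm(sales ~ price + us, data = sim)

# Same price, different US status
new  <- data.frame(price = c(100, 100),
                   us    = factor(c("No", "Yes"), levels = levels(sim$us)))
pred <- predict(fit, newdata = new)

# The gap between the two predictions is exactly the usYes coefficient
unname(pred[2] - pred[1])
unname(coef(fit)["usYes"])
```

R codes the factor with "No" as the baseline level, which is why the coefficient is labeled with the "Yes" level.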
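The null hypothesis of a zero coefficient can be rejected for Price and USYes (both p < 0.001), but not for UrbanYes (p = 0.936).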
carseat.model2 = lm(Sales~Price + US,data=carseats)
summary(carseat.model2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
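Adjusted R-squared rises slightly (from 0.2335 to 0.2354) and the residual standard error falls slightly (from 2.472 to 2.469), so the second model fits marginally better with one fewer predictor. Both models explain only about 24% of the variance in Sales.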
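Nested models like these can also be compared with a formal F-test via `anova()`. The sketch below uses simulated data in which one predictor is irrelevant by construction; the variable names are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)               # irrelevant predictor by construction
y  <- 1 + 2 * x1 + rnorm(n)

full    <- lm(y ~ x1 + x2)
reduced <- lm(y ~ x1)

# F-test of H0: the coefficient on x2 is zero
tab <- anova(reduced, full)
tab

# Adjusted R-squared for each model, side by side
c(full = summary(full)$adj.r.squared, reduced = summary(reduced)$adj.r.squared)
```

When a single predictor is dropped, the F-test p-value equals the t-test p-value for that predictor in the full model, so here it carries the same information as the UrbanYes row of the first Carseats summary.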
confint(carseat.model2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
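These intervals are the estimate plus or minus a t critical value times the standard error. For Price, -0.05448 ± 1.966 × 0.00523 gives roughly (-0.0648, -0.0442), in line with the output above. A minimal sketch reproducing `confint()` by hand on simulated data:

```r
set.seed(1)
x <- rnorm(50)
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

est  <- coef(summary(fit))[, "Estimate"]
se   <- coef(summary(fit))[, "Std. Error"]
crit <- qt(0.975, df = df.residual(fit))   # two-sided 95% critical value

# Manual interval next to the built-in one
manual <- cbind(lower = est - crit * se, upper = est + crit * se)
manual
confint(fit)
```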
par(mfrow=c(2,2))
plot(carseat.model2)
The diagnostic plots show no clear evidence of outliers or high-leverage observations.
The coefficient estimates are the same when $\sum_i x_i^2 = \sum_i y_i^2$.
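The condition follows from the closed-form least-squares slope for a regression without an intercept, which is the kind of model fit by the `+ 0` formulas in the examples that follow:

```latex
\hat{\beta}_{y \sim x} = \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
\hat{\beta}_{x \sim y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2}
```

The numerators are identical, so (provided $\sum_i x_i y_i \neq 0$) the two estimates agree exactly when the denominators agree. In the first example the sums of squares differ (338350 vs. 1353606) and so do the slopes; in the second both sums equal 338350 and both slopes are 0.5075.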
set.seed(1)
x = 1:100
sum(x^2)
## [1] 338350
y = 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit.Y = lm(y ~ x + 0)
fit.X = lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
x = 1:100
sum(x^2)
## [1] 338350
y = 100:1
sum(y^2)
## [1] 338350
fit.Y = lm(y ~ x + 0)
fit.X = lm(x ~ y + 0)
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
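As a check on the second example, the no-intercept slopes can be computed directly from the sums and compared with `lm()`. The variable names `slope_y_on_x` and `slope_x_on_y` are introduced here only for illustration.

```r
x <- 1:100
y <- 100:1

# Closed-form no-intercept slopes: sum(x * y) over the predictor's sum of squares
slope_y_on_x <- sum(x * y) / sum(x^2)
slope_x_on_y <- sum(x * y) / sum(y^2)

# Compare with lm()
c(formula = slope_y_on_x, lm = unname(coef(lm(y ~ x + 0))))
c(formula = slope_x_on_y, lm = unname(coef(lm(x ~ y + 0))))
```

Both slopes come out to about 0.5075, matching the two summaries above.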