The KNN classifier is used for classification problems (problems with qualitative responses), while KNN regression is used for regression problems (problems with quantitative responses). Given a test point x0, the KNN classifier looks at the K nearest training observations and assigns x0 to the most common class among them, whereas KNN regression estimates f(x0) as the average response of those K neighbors.
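The difference between the two methods can be sketched in a few lines of base R. The toy data and the helper names (`nearest`, `knn_classify`, `knn_regress`) are made up for illustration and are not part of the original analysis.

```r
# Toy 1-D training data (made up for illustration)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN classification: majority vote among the k nearest neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: mean response of the k nearest neighbors
knn_regress <- function(x0, k) mean(train_y[nearest(x0, k)])

knn_classify(2.5, k = 3)  # "a"
knn_regress(2.5, k = 3)   # 1.5
```

Both functions use the same neighbors; only the final step (vote versus average) differs.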
library(ISLR)
data(Auto)
plot(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Auto_lm <- lm(mpg~., data = Auto[, 1:8])
summary(Auto_lm)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
At least some of the predictors have a relationship with ‘mpg’ (the response). This is indicated by the large F-statistic (252.4 on 7 and 384 DF) and its very small p-value (< 2.2e-16).

#### Predictors that appear to have a statistically significant relationship to response?

‘displacement’ - with a 1 unit increase, ‘mpg’ increases 0.019896 units (all others constant)

‘weight’ - with a 1 unit increase, ‘mpg’ decreases 0.006474 units (all others constant)

‘year’ - with a 1 unit increase, ‘mpg’ increases 0.750773 units (all others constant)

‘origin’ - with a 1 unit increase, ‘mpg’ increases 1.426141 units (all others constant)

‘cylinders’, ‘horsepower’, and ‘acceleration’ show no statistically significant relationship to ‘mpg’, as indicated by their higher p-values (all above 0.05).

#### Coefficient for ‘year’ suggests?

The ‘year’ coefficient of 0.750773 indicates that, holding the other predictors constant, a car that is one model year newer gets about 0.75 more miles per gallon on average.
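The p-value attached to the overall F-test can be recovered from the F-statistic and its degrees of freedom. A minimal sketch using base R's `pf()` and the values printed in the summary of `Auto_lm` (252.4 on 7 and 384 DF); the variable names are chosen here for illustration.

```r
# Values copied from the printed summary of Auto_lm
f_stat <- 252.4
df_num <- 7    # number of predictors
df_den <- 384  # residual degrees of freedom

# Upper-tail probability of the F distribution: P(F > f_stat) under H0
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
p_value < 2.2e-16  # TRUE
```

A large F-statistic corresponds to a small p-value, which is the evidence against the null hypothesis that all slope coefficients are zero.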
par(mfrow = c(2,2))
plot(Auto_lm)
#### Comments on Problems

The Residuals vs. Fitted plot shows a pattern in the residuals, which suggests non-linearity in the data. The Residuals vs. Leverage plot also shows that there are some outliers in the data.
Auto_lm_2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year, data = Auto[, 1:8])
summary(Auto_lm_2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight +
## acceleration * year, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3265 -1.5779 0.0389 1.3483 11.6961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.162e+02 1.853e+01 6.274 9.53e-10 ***
## cylinders -1.803e-01 4.776e-01 -0.377 0.7061
## displacement -2.867e-02 1.425e-02 -2.013 0.0449 *
## horsepower -2.261e-01 2.609e-02 -8.664 < 2e-16 ***
## weight -1.019e-02 9.020e-04 -11.296 < 2e-16 ***
## acceleration -7.081e+00 1.158e+00 -6.113 2.41e-09 ***
## year -6.719e-01 2.417e-01 -2.780 0.0057 **
## cylinders:displacement 2.790e-03 2.067e-03 1.350 0.1779
## horsepower:weight 5.154e-05 6.727e-06 7.661 1.53e-13 ***
## acceleration:year 9.113e-02 1.502e-02 6.069 3.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.819 on 382 degrees of freedom
## Multiple R-squared: 0.8726, Adjusted R-squared: 0.8696
## F-statistic: 290.6 on 9 and 382 DF, p-value: < 2.2e-16
The interactions between ‘horsepower’ and ‘weight’ and between ‘acceleration’ and ‘year’ are statistically significant. However, the interaction between ‘cylinders’ and ‘displacement’ is not statistically significant (p = 0.1779).
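With an interaction in the model, a main-effect coefficient is no longer the overall slope for that variable. As a rough illustration using the printed estimates from `Auto_lm_2`, the estimated change in ‘mpg’ per unit of ‘acceleration’ is the ‘acceleration’ coefficient plus the interaction coefficient times ‘year’. The helper `accel_slope` and the two year values are chosen here for illustration only.

```r
# Estimates copied from the printed summary of Auto_lm_2
b_accel      <- -7.081     # acceleration main effect
b_accel_year <- 9.113e-02  # acceleration:year interaction

# Estimated slope of mpg with respect to acceleration at a given year
accel_slope <- function(year) b_accel + b_accel_year * year

accel_slope(70)  # about -0.70
accel_slope(82)  # about  0.39
```

Under these estimates the sign of the ‘acceleration’ effect depends on ‘year’, which is why the large negative main effect should not be read in isolation.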
#### Graph 1: Horsepower
par(mfrow = c(2,2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
#### Graph 2: Acceleration
par(mfrow = c(2,2))
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)
#### Graph 3: Cylinders
par(mfrow = c(2,2))
plot(log(Auto$cylinders), Auto$mpg)
plot(sqrt(Auto$cylinders), Auto$mpg)
plot((Auto$cylinders)^2, Auto$mpg)
#### Comments on Findings

Among the variables that were not statistically significant in the original model (‘horsepower’, ‘acceleration’, and ‘cylinders’), the log(X) transformation of ‘horsepower’ appears to give a relationship with ‘mpg’ that is closer to linear than the other transformations tried.
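Visual comparison of transformations can be complemented by comparing R-squared across single-predictor fits. The sketch below uses the `mtcars` data that ships with base R as a stand-in for `Auto` (an assumption made so the example runs without extra packages; its `hp` column plays the role of ‘horsepower’). The results will differ from those for `Auto`.

```r
# mtcars ships with base R; it stands in for Auto here (assumption)
fit_raw  <- lm(mpg ~ hp, data = mtcars)
fit_log  <- lm(mpg ~ log(hp), data = mtcars)
fit_sqrt <- lm(mpg ~ sqrt(hp), data = mtcars)
fit_sq   <- lm(mpg ~ I(hp^2), data = mtcars)  # I() protects ^ inside a formula

# R-squared for each transformation of the predictor
r2 <- sapply(
  list(raw = fit_raw, log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
  function(fit) summary(fit)$r.squared
)
round(r2, 3)
```

Because the response is the same in every fit, the R-squared values are directly comparable; the same pattern could be applied to `Auto` by swapping the data and column names.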
data(Carseats)
Carseats_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Carseats_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
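‘Price’ - with a $1 increase, ‘Sales’ decreases by about 54.46 units (0.054459 thousand units, since ‘Sales’ is measured in thousands of units), all others constant.

‘US’ - sales in a US store are about 1,200.57 units higher than in a non-US store (all others constant).

The ‘Urban’ variable shows no statistically significant relationship to ‘Sales’, given its large p-value (0.936).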
Sales = 13.043469 - (0.054459 * Price) - (0.021916 * (Urban = yes)) + (1.200573 * (US = yes))
Where (Urban = yes) = 1 for Urban and 0 for Not Urban & (US = yes) = 1 for a US Store and 0 for a Non-US Store
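As a worked example, the fitted equation can be evaluated for a single store. The function name `predict_sales` and the store's values (price of 100, urban, in the US) are hypothetical and chosen only to show how the dummy variables enter.

```r
# Coefficients copied from the printed summary of Carseats_lm
predict_sales <- function(price, urban, us) {
  # urban and us are 0/1 indicators (1 = "Yes")
  13.043469 - 0.054459 * price - 0.021916 * urban + 1.200573 * us
}

# Hypothetical store: price of 100, urban location, in the US
predict_sales(price = 100, urban = 1, us = 1)  # about 8.776
```

Since ‘Sales’ is recorded in thousands of units, this corresponds to roughly 8,776 units.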
The null hypothesis (that the coefficient is zero) can be rejected for ‘Price’ and ‘US’, since both have very small p-values. It cannot be rejected for ‘Urban’ (p = 0.936).
Carseats_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Carseats_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models explain about 24% of the variability in ‘Sales’ (multiple R-squared of 0.2393 for each; adjusted R-squared of 0.2335 and 0.2354). This indicates that neither model fits the data well.
confint(Carseats_lm_2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
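The interval for ‘Price’ can be reproduced by hand as estimate plus or minus a t quantile times the standard error. This sketch uses the rounded values printed in the summary of `Carseats_lm_2`, so it agrees with `confint()` only up to rounding.

```r
# Estimate and standard error for Price, copied from summary(Carseats_lm_2)
est <- -0.05448
se  <- 0.00523
df_resid <- 397  # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df_resid)
ci <- est + c(-1, 1) * t_crit * se
ci
```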
par(mfrow = c(2,2))
plot(Carseats_lm_2)
#### Observations

The Residuals vs. Fitted plot suggests that the relationship is approximately linear. However, the Residuals vs. Leverage plot shows that there are outliers, and there appear to be high-leverage observations (leverage greater than 0.01).
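Outliers and high-leverage points can also be flagged numerically rather than read off the plots. The sketch below runs on simulated data (not `Carseats`), and the cutoffs (leverage above three times the average, absolute studentized residual above 3) are rule-of-thumb assumptions rather than fixed standards.

```r
set.seed(1)
# Simulated data for illustration only (not the Carseats data)
n <- 50
x <- c(rnorm(n - 1), 10)  # last point lies far from the rest in x
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

lev <- hatvalues(fit)        # leverage statistic for each observation
p <- length(coef(fit)) - 1   # number of predictors
avg_lev <- (p + 1) / n       # average leverage across observations
high_lev <- which(lev > 3 * avg_lev)  # rule-of-thumb cutoff (assumption)

stud <- rstudent(fit)        # studentized residuals
outliers <- which(abs(stud) > 3)      # rule-of-thumb cutoff (assumption)
high_lev
```

The same calls (`hatvalues()` and `rstudent()`) work on any `lm` object, so they could be applied to `Carseats_lm_2` directly.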
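For regression without an intercept, the coefficient estimate for the regression of Y onto X is the same as that for the regression of X onto Y when the sum of X^2 is equal to the sum of Y^2.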
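The reason is that, without an intercept, the slope of Y on X is sum(X*Y) / sum(X^2) and the slope of X on Y is sum(X*Y) / sum(Y^2), so the two agree whenever the denominators are equal. A minimal check of this, using illustrative vectors `u` and `v` (names chosen here):

```r
# Illustrative vectors with equal sums of squares
u <- 1:100
v <- 100:1

# Closed-form slopes for regression through the origin
slope_v_on_u <- sum(u * v) / sum(u^2)
slope_u_on_v <- sum(u * v) / sum(v^2)

# The closed form matches lm() with the intercept removed
lm_slope <- unname(coef(lm(v ~ u + 0)))
c(closed_form = slope_v_on_u, lm = lm_slope)

all.equal(slope_v_on_u, slope_u_on_v)  # TRUE
```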
set.seed(1)
X <- 1:100
sum(X^2)
## [1] 338350
Y <- X * -153
sum(Y^2)
## [1] 7920435150
fit.X <- lm(Y ~ X + 0)
fit.Y <- lm(X ~ Y + 0)
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = X ~ Y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.493e-13 -1.826e-15 4.100e-17 1.549e-15 1.140e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y -6.536e-03 2.846e-19 -2.297e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.276e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = Y ~ X + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.926e-12 -2.460e-13 -1.000e-15 2.110e-13 3.282e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X -1.530e+02 5.745e-15 -2.663e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.342e-12 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.093e+32 on 1 and 99 DF, p-value: < 2.2e-16
A <- 1:100
sum(A^2)
## [1] 338350
B <- 100:1
sum(B^2)
## [1] 338350
fit.A <- lm(A ~ B + 0)
fit.B <- lm(B ~ A + 0)
summary(fit.B)
##
## Call:
## lm(formula = B ~ A + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## A 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.A)
##
## Call:
## lm(formula = A ~ B + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## B 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08