Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used for classification problems, where the target variable is categorical. It assigns a class label to a test observation by majority vote among its k nearest neighbors.
KNN regression is used for regression problems, where the target variable is continuous. It predicts the response by averaging the target values of the k nearest neighbors.
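As a brief illustration (a minimal sketch, assuming the class and FNN packages, which are not used elsewhere in this document), the two methods can be run side by side on simulated data:
library(class)  # knn(): KNN classification
library(FNN)    # knn.reg(): KNN regression
set.seed(1)
x_train <- matrix(rnorm(100 * 2), ncol = 2)
x_test  <- matrix(rnorm(20 * 2), ncol = 2)
# Classification: categorical response, predicted by majority vote among the k neighbors
y_class <- factor(ifelse(x_train[, 1] + x_train[, 2] > 0, "A", "B"))
pred_class <- knn(train = x_train, test = x_test, cl = y_class, k = 5)
# Regression: continuous response, predicted by averaging the k neighbors' values
y_cont <- x_train[, 1] + rnorm(100)
pred_cont <- knn.reg(train = x_train, test = x_test, y = y_cont, k = 5)$pred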
library(ISLR2)  # assumed source of the Auto and Carseats data sets (library(ISLR) also provides them)
pairs(Auto[, -9])  # scatterplot matrix of all variables except name
cor_matrix <- cor(Auto[, -9])
print(cor_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
mpg is strongly negatively correlated with cylinders, displacement, horsepower and weight, and these four predictors are also strongly correlated with one another (pairwise correlations above 0.84 in absolute value), indicating multicollinearity among them.
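One way to quantify this multicollinearity (a sketch, assuming the car package is available) is to compute variance inflation factors for the full model:
library(car)  # vif()
vif(lm(mpg ~ . - name, data = Auto))
# VIF values well above 5-10 for cylinders, displacement, horsepower and weight
# would confirm the strong pairwise correlations seen in the matrix above.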
model1 <- lm(mpg ~ . -name, data = Auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
anova(model1)
## Analysis of Variance Table
##
## Response: mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## cylinders 1 14403.1 14403.1 1300.6838 < 2.2e-16 ***
## displacement 1 1073.3 1073.3 96.9293 < 2.2e-16 ***
## horsepower 1 403.4 403.4 36.4301 3.731e-09 ***
## weight 1 975.7 975.7 88.1137 < 2.2e-16 ***
## acceleration 1 1.0 1.0 0.0872 0.7679
## year 1 2419.1 2419.1 218.4609 < 2.2e-16 ***
## origin 1 291.1 291.1 26.2912 4.666e-07 ***
## Residuals 384 4252.2 11.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
displacement, weight, year and origin have a statistically significant relationship with mpg, since their p-values are below 0.05.
par(mfrow = c(2, 2))
plot(model1)
The residuals vs. fitted plot shows notable non-linearity, suggesting the linear form does not fully capture the relationship. The Q-Q plot indicates approximately normal residuals, with some deviation at the right tail.
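The same points can be flagged numerically (a sketch using base R functions on the fitted model):
# Studentized residuals larger than 3 in absolute value suggest outliers
which(abs(rstudent(model1)) > 3)
# Leverage more than twice the average flags high-leverage observations
which(hatvalues(model1) > 2 * mean(hatvalues(model1)))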
Interaction terms are added between predictors that are highly correlated with each other.
model_interact <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto[, 1:8])
summary(model_interact)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
Only the displacement:weight interaction is statistically significant.
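A partial F-test (a sketch; model_main is a new name introduced here) compares the interaction model to the corresponding main-effects-only model and tests whether the interaction terms jointly improve the fit:
model_main <- lm(mpg ~ cylinders + displacement + weight, data = Auto)
anova(model_main, model_interact)
# A small p-value means the two interaction terms together add explanatory power.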
model_log <- lm(log(mpg) ~ . -name, data = Auto)
summary(model_log)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
model_sqrt <- lm(sqrt(mpg) ~ . -name, data = Auto)
summary(model_sqrt)
##
## Call:
## lm(formula = sqrt(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98891 -0.18946 0.00505 0.16947 1.02581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.075e+00 4.290e-01 2.506 0.0126 *
## cylinders -5.942e-02 2.986e-02 -1.990 0.0474 *
## displacement 1.752e-03 6.942e-04 2.524 0.0120 *
## horsepower -2.512e-03 1.274e-03 -1.972 0.0493 *
## weight -6.367e-04 6.024e-05 -10.570 < 2e-16 ***
## acceleration 2.738e-03 9.131e-03 0.300 0.7644
## year 7.381e-02 4.709e-03 15.675 < 2e-16 ***
## origin 1.217e-01 2.569e-02 4.735 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared: 0.8561, Adjusted R-squared: 0.8535
## F-statistic: 326.3 on 7 and 384 DF, p-value: < 2.2e-16
model_2 <- lm((mpg)^2 ~ . -name, data = Auto)
summary(model_2)
##
## Call:
## lm(formula = (mpg)^2 ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -483.45 -141.87 -19.62 103.58 1042.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.878e+03 2.928e+02 -6.412 4.22e-10 ***
## cylinders -1.436e+01 2.038e+01 -0.704 0.48157
## displacement 1.328e+00 4.738e-01 2.802 0.00534 **
## horsepower -3.587e-01 8.693e-01 -0.413 0.68009
## weight -3.522e-01 4.111e-02 -8.567 2.62e-16 ***
## acceleration 9.278e+00 6.232e+00 1.489 0.13740
## year 4.081e+01 3.214e+00 12.698 < 2e-16 ***
## origin 9.509e+01 1.754e+01 5.422 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 209.8 on 384 degrees of freedom
## Multiple R-squared: 0.7292, Adjusted R-squared: 0.7243
## F-statistic: 147.8 on 7 and 384 DF, p-value: < 2.2e-16
When mpg is squared, the model has the lowest R^2 of the three transformations; the log transformation gives the highest R^2, followed by the square root.
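A compact way to compare the transformations side by side (a sketch reusing the models fitted above):
sapply(list(raw = model1, log = model_log, sqrt = model_sqrt, squared = model_2),
       function(m) summary(m)$adj.r.squared)
# The log transformation has the highest adjusted R^2 and the squared response the lowest.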
data("Carseats")
model2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price: for every $1 increase in price, Sales decrease by about 0.054 units, holding Urban and US fixed. The small p-value indicates this effect is statistically significant.
UrbanYes: the large p-value (0.936) indicates no statistically significant effect of urban location on Sales.
USYes: stores in the US sell, on average, about 1.2 more units than stores outside the US, holding the other predictors fixed. This effect is statistically significant and suggests US stores perform better.
Sales = 13.04 - 0.0545 * Price - 0.0219 * UrbanYes + 1.2006 * USYes, where UrbanYes = 1 for urban stores and USYes = 1 for US stores (0 otherwise).
Price and US have statistically significant effects on Sales, so we can reject the null hypothesis H0: beta_j = 0 for those predictors; for Urban we fail to reject it.
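To illustrate the fitted equation, predictions can be obtained with predict(); the store characteristics below (Price = 120, urban, US) are made-up values for illustration only:
new_store <- data.frame(Price = 120, Urban = "Yes", US = "Yes")
predict(model2, newdata = new_store, interval = "confidence")
# Plugging the same values into the equation gives the same point estimate:
# 13.04 - 0.0545 * 120 - 0.0219 + 1.2006 is approximately 7.7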
model3 <- lm(Sales ~ Price + US, data = Carseats)
summary(model3)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
When comparing the model from (a) to the model in (e), the adjusted R^2 improves slightly, the multiple R^2 stays the same, and the RSE decreases slightly. Removing Urban therefore fits the data slightly better, which is preferable: it retains model performance while dropping an unnecessary variable.
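This can also be checked with a formal nested-model comparison (a sketch):
anova(model3, model2)
# The large p-value confirms that adding Urban back does not significantly improve the fit.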
confint(model3, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(model3)
The residuals vs. leverage plot for the model in (e) shows evidence of a few outliers and some high-leverage observations.
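These observations can also be identified numerically (a sketch using base R influence measures):
# Cook's distance combines residual size and leverage into one influence measure
head(sort(cooks.distance(model3), decreasing = TRUE))
# Leverage more than twice the average, (p + 1)/n, flags high-leverage points
which(hatvalues(model3) > 2 * mean(hatvalues(model3)))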
For regression through the origin, the coefficient estimate for Y onto X is sum(x_i * y_i) / sum(x_i^2), while for X onto Y it is sum(x_i * y_i) / sum(y_i^2). The two estimates are therefore equal only when sum(x_i^2) = sum(y_i^2), i.e., when X and Y have the same sum of squares.
set.seed(123)
X <- rnorm(100, mean = 0, sd = 5)
Y <- rnorm(100, mean = 0, sd = 3)
# Y onto X
model_YX <- lm(Y ~ X - 1)
# X onto Y
model_XY <- lm(X ~ Y - 1)
summary(model_YX)
##
## Call:
## lm(formula = Y ~ X - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.003 -2.370 -0.540 1.408 9.529
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X -0.03818 0.06385 -0.598 0.551
##
## Residual standard error: 2.914 on 99 degrees of freedom
## Multiple R-squared: 0.003598, Adjusted R-squared: -0.006466
## F-statistic: 0.3575 on 1 and 99 DF, p-value: 0.5512
summary(model_XY)
##
## Call:
## lm(formula = X ~ Y - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.5274 -2.5380 0.1912 3.3317 11.1065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y -0.09426 0.15764 -0.598 0.551
##
## Residual standard error: 4.578 on 99 degrees of freedom
## Multiple R-squared: 0.003598, Adjusted R-squared: -0.006466
## F-statistic: 0.3575 on 1 and 99 DF, p-value: 0.5512
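The closed-form estimates can be verified directly from the simulated data (a sketch; X and Y here are the vectors generated above):
sum(X * Y) / sum(X^2)  # coefficient for Y onto X, matches coef(model_YX)
sum(X * Y) / sum(Y^2)  # coefficient for X onto Y, matches coef(model_XY)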
set.seed(123)
X <- rnorm(100, mean = 0, sd = 5)
Y <- X
# Y onto X
model_YX <- lm(Y ~ X - 1)
# X onto Y
model_XY <- lm(X ~ Y - 1)
summary(model_YX)
## Warning in summary.lm(model_YX): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y ~ X - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.366e-15 -1.360e-16 1.320e-17 2.015e-16 1.256e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 1.000e+00 2.874e-17 3.48e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312e-15 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.211e+33 on 1 and 99 DF, p-value: < 2.2e-16
summary(model_XY)
## Warning in summary.lm(model_XY): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = X ~ Y - 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.366e-15 -1.360e-16 1.320e-17 2.015e-16 1.256e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 1.000e+00 2.874e-17 3.48e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.312e-15 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.211e+33 on 1 and 99 DF, p-value: < 2.2e-16