Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative (categorical) response: it assigns an observation to the class that occurs most frequently among its K nearest neighbors. KNN regression predicts a quantitative (continuous) response: it estimates the value as the average of the responses of the K nearest neighbors.
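A minimal sketch of the contrast, assuming the CRAN packages class (for classification) and FNN (for regression) are installed; the simulated data here are purely illustrative:
set.seed(1)
train_x <- matrix(rnorm(200), ncol = 2)
test_x  <- matrix(rnorm(20),  ncol = 2)
# Classifier: predict the majority class among the K = 5 nearest neighbors
train_cl <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))
class::knn(train_x, test_x, cl = train_cl, k = 5)
# Regression: predict the average response of the K = 5 nearest neighbors
train_y <- train_x[, 1] + rnorm(100)
FNN::knn.reg(train_x, test_x, y = train_y, k = 5)$pred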
This question involves the use of multiple linear regression on the Auto data set.
First we will import the Auto dataset and repeat preprocessing done in assignment 1:
auto <- read.table('auto.data', header = T)
# head(auto)
# sum(is.na(auto)) # no outright missing values
# unique(auto$horsepower) # there is a "?" which is a missing value; we will remove these rows
library(dplyr)
auto <- auto %>%
  filter(!grepl("\\?", horsepower)) # drop the rows where horsepower is "?"
# unique(auto$horsepower)
Now we can plot a scatterplot matrix of the variables, excluding name, along with a correlation matrix:
auto$horsepower <- as.numeric(auto$horsepower)
pairs(auto[,-9])
library(corrplot) # needed for corrplot(); assumed installed
auto_cor <- cor(auto[,-9])
corrplot(auto_cor)
auto_noname <- auto[,-9] # for convenience
full_model <- lm(mpg~., data=auto_noname)
summary(full_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_noname)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, there appears to be a relationship between the predictors and the response: the overall F-statistic of 252.4 has a p-value below 2.2e-16, so we can reject the null hypothesis that none of the predictors is related to mpg.
Using a significance level of 0.05, the statistically significant predictors are displacement, weight, year, and origin.
The coefficient for year is about 0.75, meaning that, holding all other variables constant, each additional model year is associated with an increase of roughly 0.75 mpg.
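To make the "holding all else constant" interpretation concrete, here is a quick sketch; the predictor values are arbitrary, chosen only for illustration. Two otherwise identical cars one model year apart differ in predicted mpg by exactly the year coefficient:
# Two hypothetical cars identical except for a one-unit difference in year
newdata <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(full_model, newdata)) # equals the year coefficient, 0.7508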
par(mfrow = c(2,2))
plot(full_model)
Looking at the residuals vs. fitted plot, we see some evidence of a pattern in the residuals, suggesting a non-linear relationship may be present. We can also see in the residuals vs. leverage plot that point 14 has high leverage, which could disproportionately influence the model fit.
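The leverage reading can also be checked numerically. A sketch using the common rule of thumb that leverage above 2(p+1)/n deserves attention:
hv <- hatvalues(full_model)
head(sort(hv, decreasing = TRUE), 3) # largest leverage values
which(hv > 2 * length(coef(full_model)) / nrow(auto_noname)) # rule-of-thumb cutoff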
interactions_model <- lm(mpg~.*., data=auto_noname)
summary(interactions_model)
##
## Call:
## lm(formula = mpg ~ . * ., data = auto_noname)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
If we look at a model containing all terms and their pairwise interactions, a few are statistically significant at the 0.05 level: displacement, acceleration, origin, displacement:year, acceleration:year, and acceleration:origin.
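Since individual t-tests can be misleading with this many correlated interaction terms, one way to check whether the interactions improve the fit as a group is a nested-model F-test:
# Compare the main-effects model against the model with all pairwise interactions
anova(full_model, interactions_model)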
par(mfrow=c(2,2))
auto_log <- log(auto_noname)
log_model <- lm(mpg~., data=auto_log)
summary(log_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41298 -0.07098 0.00055 0.06150 0.39532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.155391 0.648230 -0.240 0.81068
## cylinders -0.082815 0.061429 -1.348 0.17841
## displacement 0.006625 0.056970 0.116 0.90748
## horsepower -0.294389 0.057652 -5.106 5.18e-07 ***
## weight -0.569666 0.082397 -6.914 1.98e-11 ***
## acceleration -0.179239 0.059536 -3.011 0.00278 **
## year 2.243989 0.131661 17.044 < 2e-16 ***
## origin 0.044848 0.018821 2.383 0.01767 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1136 on 384 degrees of freedom
## Multiple R-squared: 0.8903, Adjusted R-squared: 0.8883
## F-statistic: 445.3 on 7 and 384 DF, p-value: < 2.2e-16
plot(log_model, main = "Log Model")
auto_sqrt <- sqrt(auto_noname)
sqrt_model <- lm(mpg~., data=auto_sqrt)
summary(sqrt_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_sqrt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.98667 -0.17280 -0.00315 0.16145 1.02245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.949286 0.847481 -2.300 0.021979 *
## cylinders -0.108552 0.141968 -0.765 0.444964
## displacement 0.019707 0.021182 0.930 0.352752
## horsepower -0.090896 0.028428 -3.197 0.001502 **
## weight -0.061414 0.007292 -8.422 7.48e-16 ***
## acceleration -0.107258 0.077048 -1.392 0.164699
## year 1.266015 0.079308 15.963 < 2e-16 ***
## origin 0.272324 0.070883 3.842 0.000143 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2964 on 384 degrees of freedom
## Multiple R-squared: 0.8662, Adjusted R-squared: 0.8638
## F-statistic: 355.1 on 7 and 384 DF, p-value: < 2.2e-16
plot(sqrt_model, main ='Square Root Model')
auto_sq <- auto_noname^2
sq_model <- lm(mpg~., data=auto_sq)
summary(sq_model)
##
## Call:
## lm(formula = mpg ~ ., data = auto_sq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -501.89 -145.36 -18.91 111.41 1034.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.523e+02 1.456e+02 -5.165 3.87e-07 ***
## cylinders -3.746e+00 1.559e+00 -2.403 0.016713 *
## displacement 3.356e-03 8.547e-04 3.926 0.000102 ***
## horsepower 1.279e-04 3.076e-03 0.042 0.966851
## weight -4.833e-05 5.551e-06 -8.707 < 2e-16 ***
## acceleration 4.892e-01 1.663e-01 2.941 0.003474 **
## year 2.731e-01 2.183e-02 12.513 < 2e-16 ***
## origin 2.608e+01 4.275e+00 6.101 2.57e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 218.8 on 384 degrees of freedom
## Multiple R-squared: 0.7055, Adjusted R-squared: 0.7001
## F-statistic: 131.4 on 7 and 384 DF, p-value: < 2.2e-16
plot(sq_model, main = 'Square Model')
Applying each of these transformations to the data set shows their effects. Recall that the original adjusted R^2 was about 0.82; log(x) improves it to about 0.89 and sqrt(x) to about 0.86, while x^2 hurts performance, dropping it to about 0.70.
Reviewing the diagnostic plots of each transformation, log(x) appears to remove the pattern found in the base model's residuals, and sqrt(x) does so to a lesser degree, while x^2 worsens the trend. None of these transformations appears to address the high-leverage point, however.
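The adjusted R^2 comparison above can be collected in one place; a small sketch using the models already fit:
# Adjusted R^2 for the base model and each transformed fit, side by side
sapply(list(base = full_model, log = log_model, sqrt = sqrt_model,
            square = sq_model),
       function(m) summary(m)$adj.r.squared)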
This question should be answered using the Carseats data set.
data("Carseats")
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
model <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Since Sales is recorded in thousands of units, the coefficients translate as follows. For each one-dollar increase in Price, sales decrease by about 54 units, holding every other variable constant. A store in an urban location sells about 22 fewer units than one in a rural location, all else held constant (though this effect is not statistically significant). A store in the US sells about 1,200 more units than one outside the US, all other variables held constant.
If we treat ‘Yes’ as 1 and ‘No’ as 0 for the categorical variables Urban and US, we can use the following equation:
\[ \text{Sales} = 13.043 - 0.054 \cdot \text{Price} - 0.022 \cdot \text{Urban} + 1.200 \cdot \text{US}\]
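The 0/1 coding assumed in this equation is exactly the dummy coding R uses for these factors, which can be confirmed with contrasts():
# Confirm that R codes Yes = 1, No = 0 for the two qualitative predictors
contrasts(Carseats$Urban)
contrasts(Carseats$US)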
Using a significance level of 0.05, we can reject the null hypothesis that there is no relationship between the predictor and the response for Price and US.
smaller_model <- lm(Sales ~ Price + US, data=Carseats)
summary(smaller_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
We can compare the models using R^2 and adjusted R^2. Both have an R^2 of 0.2393, meaning each accounts for 23.93% of the variance in Sales, but the smaller model from part (e) has a slightly higher adjusted R^2 (0.2354 vs. 0.2335). The two models fit the data essentially equally well, so the smaller model is preferable: it achieves the same fit with fewer variables.
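Because the model in (e) is nested within the model from (a), the comparison can also be made formally with an F-test; given Urban's t-statistic near zero, we would expect this test not to favor the larger model:
anova(smaller_model, model) # tests whether adding Urban improves the fit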
confint(smaller_model, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
We can identify outliers using Cook's distance and high-leverage points using the residuals vs. leverage plot, both of which are among the standard diagnostic plots.
# Cook's distance to identify outliers
n <- nrow(Carseats)
plot(smaller_model, which = 4)
abline(h = 4/n, lty = 2, col = "red") # threshold for outliers
# residuals vs. leverage plot:
plot(smaller_model, which = 5)
From here we can see several outliers by Cook's distance, above the 4/n threshold. There also appear to be a few high-leverage points in the second plot. One particular point of concern is observation 368, which has a Cook's distance above the threshold and relatively high leverage.
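These visual reads can be backed up numerically. A sketch using studentized residuals (|r| > 3 as a loose outlier flag, following ISLR's convention) and the 2(p+1)/n leverage cutoff:
rs <- rstudent(smaller_model)
which(abs(rs) > 3) # candidate outliers
hv <- hatvalues(smaller_model)
which(hv > 2 * length(coef(smaller_model)) / n) # high-leverage points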
This problem involves simple linear regression without an intercept.
For regression without an intercept, the coefficient estimate for the regression of \(Y\) onto \(X\) is
\[ \hat{\beta} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}, \]
while the estimate for the regression of \(X\) onto \(Y\) is
\[ \hat{\beta}' = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}. \]
The two estimates are therefore equal exactly when \(\sum_i x_i^2 = \sum_i y_i^2\); for example, when \(y\) is a permutation of \(x\), with \(y = x\) being the simplest case.
set.seed(123)
x <- rnorm(100)
y <- 5 * x + rnorm(100)
diff_coef_model <- lm(y~x)
summary(diff_coef_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9073 -0.6835 -0.0875 0.5806 3.2904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10280 0.09755 -1.054 0.295
## x 4.94753 0.10688 46.291 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9707 on 98 degrees of freedom
## Multiple R-squared: 0.9563, Adjusted R-squared: 0.9558
## F-statistic: 2143 on 1 and 98 DF, p-value: < 2.2e-16
x <- rnorm(100)
y <- 1 * x + rnorm(100)
same_coef_model <- lm(y~x)
summary(same_coef_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4412 -0.6537 0.0080 0.7412 2.5266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03031 0.10515 -0.288 0.774
## x 0.95094 0.11036 8.617 1.2e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.043 on 98 degrees of freedom
## Multiple R-squared: 0.4311, Adjusted R-squared: 0.4253
## F-statistic: 74.25 on 1 and 98 DF, p-value: 1.204e-13
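Note that since the question concerns regression without an intercept, the fits should strictly use the y ~ x + 0 form. A short sketch fitting both directions with no intercept: in the first construction the two estimates differ, while permuting x with sample() forces \(\sum_i x_i^2 = \sum_i y_i^2\), so the two estimates agree exactly.
set.seed(123)
x <- rnorm(100)
y <- 5 * x + rnorm(100)
coef(lm(y ~ x + 0)) # roughly 5
coef(lm(x ~ y + 0)) # roughly 0.19; the two estimates differ
# Reuse the same values in a different order, so sum(x^2) == sum(y^2) exactly
y <- sample(x)
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0)) # identical to the previous estimate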