Carefully explain the differences between the KNN classifier and KNN regression methods.
Lovaas answer: For some reason I find it hilarious that I’ve been instructed to do this “carefully.” I am many things but careful isn’t really one of them! The KNN classifier and KNN regression are closely related in concept: both start by finding the K training observations closest to a new point. The classifier returns a qualitative output, the most common class among those neighbors (will voter x vote for candidate y or z?). KNN regression returns a quantitative output, the average of the neighbors' response values (for example, a predicted sales figure).
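To make the difference concrete, here is a minimal base-R sketch with one numeric predictor. The helper `knn_predict()` and the toy data are made up for illustration; this is not a function from ISLR or any other package.

```r
# Minimal base-R sketch of KNN for a single numeric predictor.
# knn_predict() is a hypothetical helper written for this example.
knn_predict <- function(x_train, y_train, x_new, k = 3) {
  # Indices of the k training points closest to x_new
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  neighbours <- y_train[nearest]
  if (is.factor(neighbours) || is.character(neighbours)) {
    # Classification: majority vote among the k neighbours
    names(which.max(table(neighbours)))
  } else {
    # Regression: average of the k neighbours' responses
    mean(neighbours)
  }
}

x_train <- c(1, 2, 3, 10, 11, 12)
votes   <- c("y", "y", "y", "z", "z", "z")  # qualitative response (toy data)
sales   <- c(5, 6, 7, 20, 21, 22)           # quantitative response (toy data)

knn_class <- knn_predict(x_train, votes, x_new = 2.4, k = 3)
knn_reg   <- knn_predict(x_train, sales, x_new = 2.4, k = 3)
print(knn_class)
print(knn_reg)
```

The neighbour search is identical in both calls; only the last step (vote versus average) changes with the type of response.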
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
pairs(Auto)
cor(subset(Auto, select=-name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit = lm(mpg~.-name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm.fit)
Lovaas answer 9d: The Residuals vs Fitted plot has a bit of a smiley-face (U-shaped) curve, which suggests the relationship is not purely linear and a transformation may be in order. The normal Q-Q plot reinforces this notion. There’s no visible pattern in the Scale-Location plot, which is good. R has labeled a few extreme observations in the residual plots (top left and bottom right); these are possible outliers or high-leverage points, so the model may not have a great fit yet.
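The visual impressions above can be backed up with a couple of numbers. The sketch below is one possible check, not part of the original exercise: it regresses the residuals on a quadratic in the fitted values to look for curvature, and lists observations with large studentized residuals. The cutoff of 3 is a rule of thumb, and the names `diag_df` and `curve_check` are made up here.

```r
library(ISLR)  # provides the Auto data set used above

fit <- lm(mpg ~ . - name, data = Auto)

# Regress the residuals on a quadratic in the fitted values.
# A clearly non-zero squared term points to curvature the linear fit missed.
diag_df <- data.frame(res = resid(fit), fitted_mpg = fitted(fit))
curve_check <- lm(res ~ fitted_mpg + I(fitted_mpg^2), data = diag_df)
print(coef(summary(curve_check)))

# Observations whose studentized residual exceeds 3 in absolute value
# (3 is a common rule of thumb, not a hard cutoff)
flagged <- which(abs(rstudent(fit)) > 3)
print(flagged)
```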
lm.fit2 = lm(mpg~cylinders*displacement+weight*year+cylinders*weight+displacement*year, data = Auto) #By using the asterisk I am including the interaction effects and all 4 base variables
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + weight * year +
## cylinders * weight + displacement * year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9577 -1.5989 -0.1133 1.2365 13.9236
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.489e+01 2.246e+01 -1.108 0.2686
## cylinders -4.156e+00 7.428e-01 -5.595 4.21e-08 ***
## displacement 1.990e-01 1.045e-01 1.905 0.0575 .
## weight -1.649e-02 1.370e-02 -1.204 0.2295
## year 1.198e+00 2.690e-01 4.453 1.11e-05 ***
## cylinders:displacement -2.335e-03 3.085e-03 -0.757 0.4496
## weight:year 1.850e-05 1.653e-04 0.112 0.9110
## cylinders:weight 1.518e-03 3.585e-04 4.234 2.88e-05 ***
## displacement:year -2.567e-03 1.269e-03 -2.022 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.009 on 383 degrees of freedom
## Multiple R-squared: 0.8545, Adjusted R-squared: 0.8514
## F-statistic: 281.1 on 8 and 383 DF, p-value: < 2.2e-16
Lovaas answer 9e: Only the interaction effects cylinders:weight and displacement:year appear to be significant at the 0.05 level. The cylinders:displacement and weight:year interactions were not significant.
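To see exactly which terms the asterisk notation puts into the model, the formula can be expanded with `terms()` without fitting anything. This is only a sketch; the variable names `f`, `term_labels` and `same_expansion` are made up for this example.

```r
# Expand the formula used for lm.fit2 without fitting a model.
f <- mpg ~ cylinders * displacement + weight * year +
  cylinders * weight + displacement * year
term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# a * b is shorthand for a + b + a:b
same_expansion <- identical(
  attr(terms(y ~ a * b), "term.labels"),
  attr(terms(y ~ a + b + a:b), "term.labels")
)
print(same_expansion)
```

The expansion lists the four main effects first and then the four interaction terms, which is the same order as the coefficient table above.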
lm.fit3=lm(mpg~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto)
lm.fit4=lm(log(mpg)~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto)
lm.fit5=lm(log(mpg)~cylinders+sqrt(cylinders)+horsepower+sqrt(horsepower)+displacement+sqrt(displacement), data = Auto)
lm.fit6=lm(mpg~log(cylinders)+log(year)+log(displacement)+log(weight), data = Auto)
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ cylinders + I(cylinders^2) + year + I(year^2) +
## displacement + I(displacement^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.1138 -1.4426 0.0194 1.5267 13.6306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.476e+02 7.898e+01 5.667 2.85e-08 ***
## cylinders 5.607e+00 1.543e+00 3.633 0.000318 ***
## I(cylinders^2) -3.826e-01 1.238e-01 -3.090 0.002144 **
## year -1.183e+01 2.084e+00 -5.676 2.72e-08 ***
## I(year^2) 8.296e-02 1.370e-02 6.055 3.34e-09 ***
## displacement -1.852e-01 1.405e-02 -13.184 < 2e-16 ***
## I(displacement^2) 2.574e-04 2.449e-05 10.509 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.283 on 385 degrees of freedom
## Multiple R-squared: 0.8257, Adjusted R-squared: 0.823
## F-statistic: 304.1 on 6 and 385 DF, p-value: < 2.2e-16
summary(lm.fit4)
##
## Call:
## lm(formula = log(mpg) ~ cylinders + I(cylinders^2) + year + I(year^2) +
## displacement + I(displacement^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45829 -0.06492 0.00738 0.07036 0.50013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.709e+01 3.151e+00 5.423 1.04e-07 ***
## cylinders 2.308e-01 6.156e-02 3.750 0.000204 ***
## I(cylinders^2) -1.789e-02 4.938e-03 -3.623 0.000330 ***
## year -3.949e-01 8.315e-02 -4.750 2.88e-06 ***
## I(year^2) 2.799e-03 5.466e-04 5.122 4.79e-07 ***
## displacement -6.487e-03 5.605e-04 -11.574 < 2e-16 ***
## I(displacement^2) 8.284e-06 9.772e-07 8.477 4.98e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.131 on 385 degrees of freedom
## Multiple R-squared: 0.8539, Adjusted R-squared: 0.8516
## F-statistic: 375 on 6 and 385 DF, p-value: < 2.2e-16
summary(lm.fit5)
##
## Call:
## lm(formula = log(mpg) ~ cylinders + sqrt(cylinders) + horsepower +
## sqrt(horsepower) + displacement + sqrt(displacement), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.51384 -0.09954 -0.01098 0.11642 0.60844
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.988975 0.818498 6.095 2.66e-09 ***
## cylinders -0.093537 0.151596 -0.617 0.53759
## sqrt(cylinders) 0.390836 0.722566 0.541 0.58889
## horsepower 0.003520 0.003230 1.090 0.27654
## sqrt(horsepower) -0.156775 0.067888 -2.309 0.02145 *
## displacement 0.002832 0.001450 1.954 0.05146 .
## sqrt(displacement) -0.120581 0.041027 -2.939 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1589 on 385 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.7816
## F-statistic: 234.3 on 6 and 385 DF, p-value: < 2.2e-16
summary(lm.fit6)
##
## Call:
## lm(formula = mpg ~ log(cylinders) + log(year) + log(displacement) +
## log(weight), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6326 -1.9688 -0.0678 1.6796 13.3853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.888 17.632 -4.304 2.13e-05 ***
## log(cylinders) 2.555 1.656 1.543 0.1237
## log(year) 58.164 3.508 16.579 < 2e-16 ***
## log(displacement) -2.899 1.322 -2.193 0.0289 *
## log(weight) -17.820 1.718 -10.374 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.162 on 387 degrees of freedom
## Multiple R-squared: 0.8376, Adjusted R-squared: 0.8359
## F-statistic: 498.9 on 4 and 387 DF, p-value: < 2.2e-16
Lovaas notes ex 9f: Of the 4 (somewhat random) models I created, the second one, modelling the log of mpg, had the highest R-squared value (0.8539). One caveat: its response is log(mpg), so its R-squared is not directly comparable with lm.fit3 and lm.fit6, which predict mpg itself. Investigating that further could be interesting. Best model: lm(log(mpg)~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto)
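One way to compare a log(mpg) model with an mpg model on the same footing is to back-transform the log-scale fitted values and measure agreement on the mpg scale. The sketch below does that using the squared correlation between observed and predicted mpg. It is a rough comparison under stated assumptions: the plain `exp()` back-transform ignores retransformation bias, and the names `fit_raw`, `fit_log` and `r2_on_mpg_scale()` are made up here.

```r
library(ISLR)

fit_raw <- lm(mpg ~ cylinders + I(cylinders^2) + year + I(year^2) +
                displacement + I(displacement^2), data = Auto)
fit_log <- lm(log(mpg) ~ cylinders + I(cylinders^2) + year + I(year^2) +
                displacement + I(displacement^2), data = Auto)

# Back-transform the log-scale fitted values to mpg
# (a naive exp(); it ignores the retransformation bias)
pred_log_as_mpg <- exp(fitted(fit_log))

# Squared correlation between observed and predicted mpg
r2_on_mpg_scale <- function(pred) cor(Auto$mpg, pred)^2
r2_raw <- r2_on_mpg_scale(fitted(fit_raw))
r2_log <- r2_on_mpg_scale(pred_log_as_mpg)
print(c(raw = r2_raw, log_backtransformed = r2_log))
```

For the untransformed model this quantity equals the Multiple R-squared reported by `summary()`, so the two numbers printed are on the same scale.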
lm.fitcar = lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.fitcar)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
R-squared = 24%
Lovaas answer ex 10b: Price has a negative coefficient: holding the other predictors fixed, each one-unit increase in price is associated with a decrease of about 0.054 in Sales. Urban is a qualitative variable. If a store is in an urban location, sales are estimated to be slightly lower (negative coefficient for UrbanYes), but the effect is tiny and not statistically significant. US is also a qualitative variable: either stores are in the US or they are not. Stores in the US have higher sales of carseats than stores outside of the US, by about 1.20 on the Sales scale.
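Lovaas answer ex 10c: Sales = 13.04 - 0.054 × Price - 0.022 × UrbanYes + 1.20 × USYes, where UrbanYes is 1 for an urban store and 0 otherwise, and USYes is 1 for a US store and 0 otherwise.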
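As a worked example of the equation, the sketch below plugs in a price of 100 for two kinds of store. The coefficients are copied from the `summary(lm.fitcar)` output above; the helper `predict_sales()` and the price of 100 are made up for illustration.

```r
# Coefficients copied from the summary(lm.fitcar) output above
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# predict_sales() is a hypothetical helper for illustration only.
# The comparisons (urban == "Yes") turn the Yes/No labels into 1/0 dummies.
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] + b["price"] * price +
    b["urban_yes"] * (urban == "Yes") + b["us_yes"] * (us == "Yes"))
}

sales_urban_us     <- predict_sales(price = 100, urban = "Yes", us = "Yes")
sales_rural_abroad <- predict_sales(price = 100, urban = "No",  us = "No")
print(sales_urban_us)
print(sales_rural_abroad)
```

The gap between the two predictions is just the sum of the UrbanYes and USYes coefficients, since Price is held at the same value.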
Lovaas answer ex 10d: Price and USYes have p-values < 0.05, so for each of them we can reject the null hypothesis that its coefficient is 0 and keep them in our model. UrbanYes has a p-value well above 0.05 (0.936), so we fail to reject the null hypothesis for it and leave Urban out of our model.
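The t value and p-value in the coefficient table can be reproduced by hand from the estimate and its standard error. The sketch below does this for UrbanYes, using the numbers printed by `summary(lm.fitcar)` above; the variable names are made up for this example.

```r
# Estimate and standard error for UrbanYes, copied from the output above
estimate  <- -0.021916
std_error <- 0.271650
df_resid  <- 396   # residual degrees of freedom from the summary

t_value <- estimate / std_error
# Two-sided test of H0: the UrbanYes coefficient is 0
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(round(t_value, 3))
print(round(p_value, 3))
```

A t value this close to zero means the estimate is much smaller than its own standard error, which is why the p-value is so large.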
lm.fitcar2 = lm(Sales~Price+US, data = Carseats)
summary(lm.fitcar2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Lovaas answer ex 10f: The models in (a) and (e) have essentially the same R-squared of 0.24, meaning both explain about 24% of the variance of Sales. The model in part (e) has one fewer predictor, so it is simpler, and it comes out slightly ahead on adjusted R-squared (0.2354 vs 0.2335) and residual standard error (2.469 vs 2.472).
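The small edge for model (e) comes from the penalty for extra predictors in adjusted R-squared. The sketch below applies the standard formula to the rounded R-squared from the two summaries; because the input is rounded, the last digit may differ slightly from R's output. The helper `adj_r2()` is made up for this example.

```r
# adj_r2() is a hypothetical helper implementing the standard formula
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n  <- 400      # 396 residual df + 4 coefficients in model (a)
r2 <- 0.2393   # rounded Multiple R-squared reported for both models

adj_a <- adj_r2(r2, n, p = 3)  # Price, Urban, US
adj_e <- adj_r2(r2, n, p = 2)  # Price, US
print(c(model_a = adj_a, model_e = adj_e))
```

With the same R-squared, the model with fewer predictors always gets the higher adjusted value.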
confint(lm.fitcar2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(predict(lm.fitcar2), rstudent(lm.fitcar2))
par(mfrow=c(2,2))
plot(lm.fitcar2)
Lovaas answer ex 10h: All of the studentized residuals in the first chart fall between -3 and 3, so no outliers are apparent. There are some labeled points on the Residuals vs Leverage (bottom right) chart, i.e., there is evidence of high-leverage observations.
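The same checks can be done numerically instead of by eye. The sketch below flags studentized residuals beyond ±3 and hat values well above the average leverage (p + 1)/n. The multiplier of 3 is an arbitrary illustrative cutoff, and the variable names are made up for this example.

```r
library(ISLR)

fit <- lm(Sales ~ Price + US, data = Carseats)

# Outliers: studentized residuals beyond +/- 3 (rule of thumb)
outliers <- which(abs(rstudent(fit)) > 3)

# High leverage: hat values well above the average (p + 1) / n.
# The factor of 3 is an arbitrary illustrative cutoff.
lev <- hatvalues(fit)
avg_leverage <- length(coef(fit)) / nrow(Carseats)
high_leverage <- which(lev > 3 * avg_leverage)

print(outliers)
print(head(sort(lev[high_leverage], decreasing = TRUE)))
```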
Lovaas answer ex 12a: For regression without an intercept, the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X when the sum of the squares of the observed y-values is equal to the sum of the squares of the observed x-values.
set.seed(1)
x = rnorm(100)
y = 2*x
lm.fity = lm(y~x+0)
lm.fitx = lm(x~y+0)
summary(lm.fity)
## Warning in summary.lm(lm.fity): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.776e-16 -3.378e-17 2.680e-18 6.113e-17 5.105e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.000e+00 1.296e-17 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm.fitx)
## Warning in summary.lm(lm.fitx): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.24e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
set.seed(1)
x <- rnorm(100)
y <- -sample(x, 100)
lm.fity2 = lm(y~x+0)
lm.fitx2 = lm(x~y+0)
summary(lm.fity2)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2833 -0.6945 -0.1140 0.4995 2.1665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.07768 0.10020 0.775 0.44
##
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared: 0.006034, Adjusted R-squared: -0.004006
## F-statistic: 0.601 on 1 and 99 DF, p-value: 0.4401
summary(lm.fitx2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2182 -0.4969 0.1595 0.6782 2.4017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.07768 0.10020 0.775 0.44
##
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared: 0.006034, Adjusted R-squared: -0.004006
## F-statistic: 0.601 on 1 and 99 DF, p-value: 0.4401
Lovaas note ex 12c: Note that the coefficients are the same, 0.07768, for each model. This works because y is a shuffled, sign-flipped copy of x, so the two sums of squares are equal.
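The reason behind 12a through 12c can be checked directly with the closed-form slope for regression through the origin: sum(x*y)/sum(x^2) for y onto x, and sum(x*y)/sum(y^2) for x onto y. The sketch below reuses the 12c setup; the variable names `slope_y_on_x` and `slope_x_on_y` are made up for this example.

```r
set.seed(1)
x <- rnorm(100)
y <- -sample(x, 100)   # a shuffled, sign-flipped copy of x

# Closed-form slopes for regression through the origin
slope_y_on_x <- sum(x * y) / sum(x^2)
slope_x_on_y <- sum(x * y) / sum(y^2)

# Shuffling and negating leave the sum of squares unchanged,
# so the two denominators (and therefore the slopes) match
same_ss <- isTRUE(all.equal(sum(x^2), sum(y^2)))
print(same_ss)
print(c(slope_y_on_x, slope_x_on_y))

# Agreement with lm() fitted without an intercept
matches_lm <- isTRUE(all.equal(slope_y_on_x, unname(coef(lm(y ~ x + 0)))))
print(matches_lm)
```

In 12b, with y = 2*x, the denominators differ by a factor of 4, which is why the two slopes there are 2 and 0.5 rather than equal.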