Carefully explain the differences between the KNN classifier and KNN regression methods.
The main difference is that the KNN classifier predicts a qualitative response: for a test observation it finds the K nearest training points and assigns the most prevalent class among them. KNN regression instead predicts a quantitative response by averaging the response values of the K nearest neighbors.
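As a small illustration of the two methods, the sketch below uses the class and FNN packages (an assumption; they are not used elsewhere in this document) on made-up toy data:
library(class)  # provides knn() for classification
library(FNN)    # provides knn.reg() for regression
set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)            # 50 training points with 2 features
x_test  <- matrix(rnorm(20), ncol = 2)             # 10 test points
cls <- factor(ifelse(x_train[, 1] > 0, "A", "B"))  # qualitative response
num <- x_train[, 1] + rnorm(50, sd = 0.3)          # quantitative response
knn(x_train, x_test, cl = cls, k = 5)              # classifier: majority class among the 5 nearest neighbors
knn.reg(x_train, x_test, y = num, k = 5)$pred      # regression: average response of the 5 nearest neighbors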
library(ISLR)
data(Auto)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm_w_mpg <- lm(mpg ~ . - name, data = Auto)
summary(lm_w_mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic for the model has a p-value below 2.2e-16, so we can conclude that there is a relationship between the predictors and the response.
Examining the individual p-values, all predictors appear statistically significant except cylinders, horsepower, and acceleration, whose p-values are greater than 0.05.
The coefficient on year indicates that, holding the other predictors fixed, an increase of one model year is associated with an increase of about 0.75 mpg.
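To see the year effect in the fitted model, one can compare predictions for two otherwise identical cars that differ by one model year (a sketch; the predictor values below are made-up illustrative numbers, not taken from the data):
newcars <- data.frame(cylinders = 4, displacement = 150, horsepower = 100,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
predict(lm_w_mpg, newcars)        # two predicted mpg values
diff(predict(lm_w_mpg, newcars))  # their difference equals the estimated year coefficient, ~0.75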
par(mfrow = c(2, 2))
plot(lm_w_mpg)
Looking at the residuals-vs-fitted plot, there appears to be some non-linearity in the data. The residuals-vs-leverage plot also flags a few potential outliers, namely observations 327, 394, and 14.
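To back up the visual impression numerically, one could list the most extreme observations (a sketch; the cutoffs used are common rules of thumb, not values from the text):
which(abs(rstudent(lm_w_mpg)) > 3)                          # candidate outliers (|studentized residual| > 3)
which(hatvalues(lm_w_mpg) > 2 * mean(hatvalues(lm_w_mpg)))  # high-leverage points (leverage above twice the average)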
part9e <- lm(mpg ~ horsepower * displacement + displacement * cylinders, data = Auto[, 1:8])
summary(part9e)
##
## Call:
## lm(formula = mpg ~ horsepower * displacement + displacement *
## cylinders, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1114 -2.1683 -0.4345 2.0054 18.2391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.408e+01 2.564e+00 21.094 < 2e-16 ***
## horsepower -2.318e-01 2.285e-02 -10.144 < 2e-16 ***
## displacement -1.241e-01 1.444e-02 -8.592 < 2e-16 ***
## cylinders 1.224e-01 7.419e-01 0.165 0.869
## horsepower:displacement 5.544e-04 8.214e-05 6.750 5.44e-11 ***
## displacement:cylinders 3.055e-03 2.957e-03 1.033 0.302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.93 on 386 degrees of freedom
## Multiple R-squared: 0.7497, Adjusted R-squared: 0.7464
## F-statistic: 231.2 on 5 and 386 DF, p-value: < 2.2e-16
From above, we can see that the interaction between horsepower and displacement is significant, but the interaction between displacement and cylinders is not.
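One way to double-check this is a partial F-test against the corresponding main-effects-only model (a sketch, reusing the same subset of Auto):
part9e_main <- lm(mpg ~ horsepower + displacement + cylinders, data = Auto[, 1:8])
anova(part9e_main, part9e)  # tests whether the two interaction terms jointly add explanatory power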
par(mfrow = c(2, 2))
plot(log(Auto$displacement), Auto$mpg)
plot(sqrt(Auto$displacement), Auto$mpg)
plot((Auto$displacement)^2, Auto$mpg)
Among these transformations of displacement, the log transformation produces the plot that looks most nearly linear.
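To go beyond eyeballing the plots, one could fit a simple regression on each transformed predictor and compare R-squared values (a sketch; it only assumes Auto is already loaded):
fit_log  <- lm(mpg ~ log(displacement), data = Auto)
fit_sqrt <- lm(mpg ~ sqrt(displacement), data = Auto)
fit_sq   <- lm(mpg ~ I(displacement^2), data = Auto)
sapply(list(log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
       function(m) summary(m)$r.squared)  # higher values indicate a better linear fit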
data(Carseats)
carsests_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carsests_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The coefficient on Price means that a $1 increase in price is associated with a decrease of about 54.46 units in sales (Sales is measured in thousands of units), holding the other predictors fixed. The coefficient on Urban says that unit sales at an urban location are about 21.92 units lower than at a rural location, all else equal. The coefficient on US says that unit sales at a US store are about 1200.57 units higher than at a non-US store, all else equal.
Sales = 13.043469 + (−0.054459) × Price + (−0.021916) × Urban + (1.200573) × US + ε,
where Urban = 1 if the store is located in an urban location and 0 otherwise, and US = 1 if the store is located in the United States and 0 otherwise.
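To make the dummy coding concrete, one can predict sales for hypothetical stores that differ only in Urban and US at a fixed price (a sketch; the $120 price is an arbitrary illustrative value):
newstores <- data.frame(Price = 120,
                        Urban = c("Yes", "No", "Yes"),
                        US    = c("Yes", "Yes", "No"))
predict(carsests_model, newstores)  # the gaps between predictions are the UrbanYes and USYes coefficients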
We can reject the null hypothesis for the “Price” and “US” variables.
model10e <- lm(Sales ~ Price + US, data = Carseats)
summary(model10e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
According to the R-squared, about 23.93% of the variability in Sales is explained by the model.
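Since model10e is nested within the model from part (a), an F-test comparing them can confirm that dropping Urban does not significantly change the fit (a sketch using the models already fitted above):
anova(model10e, carsests_model)  # tests whether Urban adds anything beyond Price and US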
confint(model10e)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(model10e)
Looking at the residuals-vs-leverage plot, there appear to be some outliers in the data.
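The visual impression can also be checked numerically, for example by listing the most extreme studentized residuals and leverage values (a sketch):
sort(abs(rstudent(model10e)), decreasing = TRUE)[1:5]  # largest studentized residuals
sort(hatvalues(model10e), decreasing = TRUE)[1:5]      # largest leverage values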
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), regressing Y onto X without an intercept gives β̂ = Σxᵢyᵢ / Σxᵢ², while regressing X onto Y gives β̂ = Σxᵢyᵢ / Σyᵢ². Since the numerators are identical, the two estimates are the same exactly when Σxᵢ² = Σyᵢ².
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(100)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 2 * x + rnorm(100, sd = 0.5)
sum(y^2)
## [1] 1353011
model.Y <- lm(y ~ x + 0)
model.X <- lm(x ~ y + 0)
summary(model.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11910 -0.30013 -0.01178 0.34459 1.31061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9996933 0.0008768 2281 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.51 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.201e+06 on 1 and 99 DF, p-value: < 2.2e-16
summary(model.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6542 -0.1713 0.0067 0.1504 0.5607
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5000672 0.0002193 2281 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2551 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.201e+06 on 1 and 99 DF, p-value: < 2.2e-16
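As a quick check, the fitted slopes above should match the closed-form no-intercept estimates from (3.38) (a small verification sketch, reusing the x and y already defined):
sum(x * y) / sum(x^2)  # should reproduce coef(model.Y)
sum(x * y) / sum(y^2)  # should reproduce coef(model.X)
Because sum(y^2) is much larger than sum(x^2) here, the two slopes differ. Finally, we generate an example in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X, by choosing x and y so that sum(x^2) equals sum(y^2).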
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
modely2 <- lm(y ~ x + 0)
modelx2 <- lm(x ~ y + 0)
summary(modely2)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(modelx2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
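Here sum(x^2) and sum(y^2) are both 338350, and, as expected from the condition above, the two regressions return the identical coefficient estimate of 0.5075.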