Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN classification and KNN regression are closely related: both start by finding the K training observations nearest to the query point. The main difference is how those neighbors are combined. The KNN classifier predicts the class held by the majority of the neighbors, while KNN regression predicts the average of the neighbors' response values. A second difference is that classification usually uses an odd number of neighbors, which avoids tied votes when there are two classes; regression has no such restriction. Both methods can use weights that give more importance to closer neighbors. The weighting function can have a larger effect on the prediction in regression, so it should be chosen with care.
[Figure: KNN regression example with K = 1, 2 and 3.]
[Figure: KNN classification example.]
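To make the contrast concrete, here is a minimal base-R sketch on made-up toy data. The names `x_train`, `y_num`, `y_class`, `nearest`, `knn_classify` and `knn_regress` are hypothetical helpers introduced for illustration only, and the neighbors are unweighted. Both functions share the same neighbor search and differ only in the final step: a majority vote versus a mean.

```r
# Toy training data (assumed for illustration): one predictor,
# a numeric response and a class label for the same six points.
x_train <- c(1, 2, 3, 6, 7, 8)
y_num   <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)
y_class <- factor(c("a", "a", "a", "b", "b", "b"))

# Indices of the k training points closest to the query point x0.
nearest <- function(x0, k) order(abs(x_train - x0))[seq_len(k)]

# Classification: return the most frequent class among the k neighbors.
knn_classify <- function(x0, k) {
  votes <- table(y_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: return the mean response of the k neighbors.
knn_regress <- function(x0, k) mean(y_num[nearest(x0, k)])

knn_classify(2.5, k = 3)  # all three neighbors are class "a"
knn_regress(2.5, k = 3)   # mean of 1.0, 1.5 and 2.0
```

A weighted variant would replace the vote count and the mean with distance-weighted versions, for example by using `weighted.mean` in the regression step.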
It is claimed in the text that in the case of simple linear regression of Y onto X, the R² statistic (3.17) is equal to the square of the correlation between X and Y (3.18). Prove that this is the case. For simplicity, you may assume that x̄ = ȳ = 0.
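One way to carry out the proof is sketched below. It assumes that (3.17) is the usual definition R² = 1 − RSS/TSS and that (3.18) is the sample correlation, and it uses the simplification x̄ = ȳ = 0 allowed by the question, under which the intercept estimate is zero.

```latex
\begin{align*}
\hat\beta_0 &= \bar y - \hat\beta_1 \bar x = 0, \qquad
\hat\beta_1 = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad
\hat y_i = \hat\beta_1 x_i, \\
\mathrm{TSS} &= \sum_i (y_i - \bar y)^2 = \sum_i y_i^2, \\
\mathrm{RSS} &= \sum_i (y_i - \hat\beta_1 x_i)^2
  = \sum_i y_i^2 - 2\hat\beta_1 \sum_i x_i y_i + \hat\beta_1^2 \sum_i x_i^2
  = \sum_i y_i^2 - \frac{\left(\sum_i x_i y_i\right)^2}{\sum_i x_i^2}, \\
R^2 &= 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
  = \frac{\left(\sum_i x_i y_i\right)^2}{\sum_i x_i^2 \, \sum_i y_i^2}
  = \left( \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}} \right)^2
  = \mathrm{Cor}(X, Y)^2 .
\end{align*}
```

The last equality holds because, with both means equal to zero, the sample correlation reduces to the ratio inside the parentheses.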
This problem involves simple linear regression without an intercept.
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
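A short worked answer, assuming (3.38) is the usual no-intercept least squares estimate (the sum of products divided by the sum of squared predictors). The subscripts on the two estimates are notation introduced here for clarity.

```latex
\begin{align*}
\hat\beta_{Y \text{ onto } X} &= \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad
\hat\beta_{X \text{ onto } Y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2}, \\
\hat\beta_{Y \text{ onto } X} = \hat\beta_{X \text{ onto } Y}
&\iff \sum_i x_i y_i \left( \sum_i y_i^2 - \sum_i x_i^2 \right) = 0 .
\end{align*}
```

The two numerators are identical, so the estimates coincide exactly when the denominators do, that is when the sum of the squared x values equals the sum of the squared y values. They also coincide in the degenerate case where the sum of products is zero, since both estimates are then zero.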
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
> x = runif(100, min=3, max=5)
> y = runif(100, min=5, max=7)
> lm.fity = lm(y ~ x + 0)
> lm.fitx = lm(x ~ y + 0)
> summary(lm.fity)
Call:
lm(formula = y ~ x + 0)
Residuals:
Min 1Q Median 3Q Max
-2.12033 -0.58598 0.06645 0.79872 1.93644
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 1.47010 0.02344 62.72 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9542 on 99 degrees of freedom
Multiple R-squared: 0.9754, Adjusted R-squared: 0.9752
F-statistic: 3933 on 1 and 99 DF, p-value: < 2.2e-16
> summary(lm.fitx)
Call:
lm(formula = x ~ y + 0)
Residuals:
Min 1Q Median 3Q Max
-1.20111 -0.44104 0.06561 0.50158 1.52925
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y 0.66353 0.01058 62.72 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.641 on 99 degrees of freedom
Multiple R-squared: 0.9754, Adjusted R-squared: 0.9752
F-statistic: 3933 on 1 and 99 DF, p-value: < 2.2e-16
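The two estimates above differ (1.47010 versus 0.66353) because every y value here is at least as large as every x value, so the two sums of squares cannot match. The sketch below checks this against the closed-form no-intercept estimate, assuming that is what (3.38) states. The seed is a hypothetical choice added for reproducibility, so the numbers will not match the transcript above, which was run without one.

```r
set.seed(42)  # hypothetical seed, not used in the transcript above
x <- runif(100, min = 3, max = 5)
y <- runif(100, min = 5, max = 7)

# Closed-form no-intercept estimates: same numerator, different denominators.
b_yx_formula <- sum(x * y) / sum(x^2)  # regression of y onto x
b_xy_formula <- sum(x * y) / sum(y^2)  # regression of x onto y

# The same estimates as reported by lm().
b_yx_lm <- unname(coef(lm(y ~ x + 0)))
b_xy_lm <- unname(coef(lm(x ~ y + 0)))

c(formula = b_yx_formula, lm = b_yx_lm)
c(formula = b_xy_formula, lm = b_xy_lm)
c(sum_x_sq = sum(x^2), sum_y_sq = sum(y^2))  # unequal, so the slopes differ
```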
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
> set.seed(1)
> x = runif(100, min=3, max=3.1)
> y = runif(100, min=3, max=3.1)
> lm.fity = lm(y ~ x + 0)
> lm.fitx = lm(x ~ y + 0)
> summary(lm.fity)
Call:
lm(formula = y ~ x + 0)
Residuals:
Min 1Q Median 3Q Max
-0.088602 -0.029414 0.005026 0.027561 0.077283
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.999913 0.001239 807 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03782 on 99 degrees of freedom
Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
F-statistic: 6.512e+05 on 1 and 99 DF, p-value: < 2.2e-16
> summary(lm.fitx)
Call:
lm(formula = x ~ y + 0)
Residuals:
Min 1Q Median 3Q Max
-0.076819 -0.027097 -0.004566 0.029877 0.089068
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y 0.999935 0.001239 807 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03782 on 99 degrees of freedom
Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
F-statistic: 6.512e+05 on 1 and 99 DF, p-value: < 2.2e-16
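Note that the two estimates above (0.999913 and 0.999935) agree only to about four decimal places, because x and y are independent draws whose sums of squares are close but not identical. A sketch of one way to get an exact match, up to floating-point rounding, is to make y a random permutation of x, so that both vectors contain the same values and therefore have the same sum of squares. The use of `rnorm` and `sample` is a choice made here for illustration and is not taken from the transcript above.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)  # a permutation of x: same values, different order

# Same sum of squares, so both no-intercept slopes share one denominator.
c(sum_x_sq = sum(x^2), sum_y_sq = sum(y^2))

b_yx <- unname(coef(lm(y ~ x + 0)))  # regression of y onto x
b_xy <- unname(coef(lm(x ~ y + 0)))  # regression of x onto y

c(y_onto_x = b_yx, x_onto_y = b_xy)
all.equal(b_yx, b_xy)  # equal within numerical tolerance
```

Any other construction with equal sums of squares would work as well, for example y = -x or y = x, though those give the trivial slopes -1 and 1.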