Carefully explain the differences between the KNN classifier and KNN regression methods.
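The KNN classifier is used when the response variable is categorical (qualitative). It finds the K training observations nearest to a test point and assigns the class that is most common among them. KNN regression is used when the response is numerical (quantitative). It predicts the value of Y as the average of the responses of the K nearest training observations, so the prediction can be continuous.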
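As a rough illustration of the difference, the sketch below implements both methods by hand on a tiny made-up one-dimensional data set. The data values and the helper names (`nearest`, `knn_classify`, `knn_regress`) are assumptions for this example, not part of the exercise.

```r
# Toy 1-D training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("A", "A", "A", "B", "B", "B")  # qualitative response
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)  # quantitative response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the k nearest neighbours
knn_classify <- function(x0, k = 3) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbours
knn_regress <- function(x0, k = 3) {
  mean(train_y[nearest(x0, k)])
}

knn_classify(2.5)  # "A"  (a class label)
knn_regress(2.5)   # 1.5  (a numeric value)
```

The neighbour search is identical in both functions; only the final step (vote versus average) differs.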
This question involves the use of multiple linear regression on the Auto data set.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR)
data(Auto)
auto <- na.omit(Auto)
plot(Auto)
Auto1<-Auto
Auto1$name=NULL
cor(Auto1)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
model1<-lm(mpg~ .-name,data=Auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, far less than 0.05, so we reject the null hypothesis that all of the coefficients are zero.
Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin have a significant relationship with mpg, as shown by their p-values.
What does the coefficient for the year variable suggest? The coefficient for year is positive and statistically significant, which suggests that, holding all other variables constant, mpg increases on average by about 0.75 for each additional model year.
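The "holding all other variables constant" reading of a coefficient can be checked numerically. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, so that it runs without the ISLR package): raising one predictor by one unit while fixing the other changes the prediction by exactly that predictor's coefficient.

```r
# Stand-in data: mtcars ships with base R (not the Auto data used above)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Two hypothetical cars that differ only in wt, by exactly one unit
base  <- data.frame(wt = 3, hp = 120)
plus1 <- transform(base, wt = wt + 1)

# The change in predicted mpg equals the wt coefficient
change <- predict(fit, plus1) - predict(fit, base)
all.equal(unname(change), unname(coef(fit)["wt"]))
```

The same check applied to `year` in `model1` would give the 0.75 mpg per year quoted above.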
par(mfrow=c(2,2))
plot(model1)
plot(predict(model1),rstudent(model1))
plot(hatvalues(model1))
which.max(hatvalues(model1))
## 14
## 14
The first graph (Residuals vs Fitted) shows that there is a non-linear relationship between the response and the predictors. The second graph (Normal Q-Q) shows that the residuals are roughly normally distributed but right-skewed in the upper tail. The third graph (Scale-Location) shows that the constant-variance assumption for the errors does not hold for this model. The fourth graph (Residuals vs Leverage) shows no clear outliers with high leverage. However, one observation stands out as a potential high-leverage point (labeled 14 on the graph), which agrees with the result of which.max(hatvalues(model1)).
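Leverage and outliers can also be screened numerically rather than by eye. The sketch below uses simulated data with one deliberately extreme x value. The cut-offs (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb assumed here, not fixed standards.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
x[n] <- 8                      # one point far from the rest, so high leverage
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)

h <- hatvalues(fit)
p <- length(coef(fit)) - 1     # number of predictors
avg_leverage <- (p + 1) / n    # hat values always average to (p + 1) / n

which(h > 3 * avg_leverage)    # candidate high-leverage observations
which(abs(rstudent(fit)) > 3)  # candidate outliers
```

Applied to `model1`, the same two `which()` calls would list the observations worth a closer look.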
model2 = lm(mpg ~.-name+displacement:weight, data = Auto)
summary(model2)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
model3 = lm(mpg ~.-name+displacement:cylinders+displacement:weight+acceleration:horsepower, data=Auto)
summary(model3)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3344 -1.6333 0.0188 1.4740 11.9723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.725e+01 5.328e+00 -3.237 0.00131 **
## cylinders 6.354e-01 6.106e-01 1.041 0.29870
## displacement -6.805e-02 1.337e-02 -5.088 5.68e-07 ***
## horsepower 6.026e-02 2.601e-02 2.317 0.02105 *
## weight -8.864e-03 1.097e-03 -8.084 8.43e-15 ***
## acceleration 6.257e-01 1.592e-01 3.931 0.00010 ***
## year 7.845e-01 4.470e-02 17.549 < 2e-16 ***
## origin 4.668e-01 2.595e-01 1.799 0.07284 .
## cylinders:displacement -1.337e-03 2.726e-03 -0.490 0.62415
## displacement:weight 2.071e-05 3.638e-06 5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03 1.784e-03 -4.185 3.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared: 0.865, Adjusted R-squared: 0.8615
## F-statistic: 244.2 on 10 and 381 DF, p-value: < 2.2e-16
model4 = lm(mpg ~.-name+displacement:cylinders+displacement:weight+year:origin+acceleration:horsepower, data=Auto)
summary(model4)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight +
## year:origin + acceleration:horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6504 -1.6476 0.0381 1.4254 12.7893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.287e+00 9.074e+00 0.583 0.560429
## cylinders 4.249e-01 6.079e-01 0.699 0.485011
## displacement -7.322e-02 1.334e-02 -5.490 7.38e-08 ***
## horsepower 5.252e-02 2.586e-02 2.031 0.042913 *
## weight -8.689e-03 1.086e-03 -7.998 1.54e-14 ***
## acceleration 5.796e-01 1.582e-01 3.665 0.000283 ***
## year 5.116e-01 9.976e-02 5.129 4.66e-07 ***
## origin -1.220e+01 4.161e+00 -2.933 0.003560 **
## cylinders:displacement -4.368e-04 2.712e-03 -0.161 0.872156
## displacement:weight 1.992e-05 3.608e-06 5.522 6.21e-08 ***
## year:origin 1.630e-01 5.341e-02 3.051 0.002440 **
## horsepower:acceleration -6.735e-03 1.781e-03 -3.781 0.000181 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8644
## F-statistic: 227.7 on 11 and 380 DF, p-value: < 2.2e-16
model5 = lm(mpg ~.-name-cylinders-acceleration+year:origin+displacement:weight+
displacement:weight+acceleration:horsepower+acceleration:weight, data=Auto)
summary(model5)
##
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin +
## displacement:weight + displacement:weight + acceleration:horsepower +
## acceleration:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5074 -1.6324 0.0599 1.4577 12.7376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.868e+01 7.796e+00 2.396 0.017051 *
## displacement -7.794e-02 9.026e-03 -8.636 < 2e-16 ***
## horsepower 8.719e-02 3.167e-02 2.753 0.006183 **
## weight -1.350e-02 1.287e-03 -10.490 < 2e-16 ***
## year 4.911e-01 9.825e-02 4.998 8.83e-07 ***
## origin -1.262e+01 4.109e+00 -3.071 0.002288 **
## year:origin 1.686e-01 5.277e-02 3.195 0.001516 **
## displacement:weight 2.253e-05 2.184e-06 10.312 < 2e-16 ***
## horsepower:acceleration -9.164e-03 2.222e-03 -4.125 4.56e-05 ***
## weight:acceleration 2.784e-04 7.087e-05 3.929 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared: 0.8687, Adjusted R-squared: 0.8656
## F-statistic: 280.8 on 9 and 382 DF, p-value: < 2.2e-16
Of the models fitted above, the last one (model5) is the only one in which every term, including the interaction terms, is statistically significant. It is plausibly the best combination of predictors and interaction terms among those tried: by the R-squared statistic, about 87% of the variation in the response is explained by these predictors.
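When models are nested, they can be compared side by side instead of reading each summary in turn. The sketch below is a minimal example on the built-in `mtcars` data (a stand-in chosen so that it runs without extra packages). It collects adjusted R-squared for two models and runs an F-test on the added interaction.

```r
# Stand-in data: mtcars ships with base R (the Auto models are not refit here)
m_main <- lm(mpg ~ wt + hp, data = mtcars)          # main effects only
m_int  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # adds one interaction

# Adjusted R-squared penalises extra terms, unlike plain R-squared
adj <- c(main = summary(m_main)$adj.r.squared,
         interaction = summary(m_int)$adj.r.squared)
adj

# F-test for whether the interaction term improves the fit
anova(m_main, m_int)
```

Adjusted R-squared is the safer comparison here because plain R-squared never decreases when terms are added.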
plot(log(auto$weight), auto$mpg)
plot(sqrt(auto$weight), auto$mpg)
hist(log(auto$weight))
hist(sqrt(auto$weight))
hist(sqrt(auto$mpg))
lm.fit.full = lm(mpg ~ ., data = Auto[, 1:8])
summary(lm.fit.full)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
lm.fit.trans = lm(mpg ~ . + I(cylinders^2) + log(horsepower) + sqrt(displacement), data = Auto[, 1:8])
summary(lm.fit.trans)
##
## Call:
## lm(formula = mpg ~ . + I(cylinders^2) + log(horsepower) + sqrt(displacement),
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8017 -1.5427 -0.0296 1.5157 11.7164
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.145e+01 1.235e+01 5.785 1.51e-08 ***
## cylinders 3.345e-01 1.512e+00 0.221 0.82505
## displacement 7.073e-02 2.860e-02 2.473 0.01383 *
## horsepower 1.075e-01 3.133e-02 3.430 0.00067 ***
## weight -3.270e-03 6.561e-04 -4.984 9.45e-07 ***
## acceleration -2.595e-01 9.838e-02 -2.637 0.00870 **
## year 7.578e-01 4.534e-02 16.715 < 2e-16 ***
## origin 5.691e-01 2.729e-01 2.085 0.03772 *
## I(cylinders^2) -1.671e-02 1.242e-01 -0.135 0.89300
## log(horsepower) -1.927e+01 3.582e+00 -5.379 1.31e-07 ***
## sqrt(displacement) -2.271e+00 8.504e-01 -2.671 0.00789 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.928 on 381 degrees of freedom
## Multiple R-squared: 0.8628, Adjusted R-squared: 0.8592
## F-statistic: 239.7 on 10 and 381 DF, p-value: < 2.2e-16
Both the log and square-root transformations of weight show a roughly linear trend against mpg. In the histograms, the log-transformed values appear skewed, while the square-root-transformed values look relatively normal. Adding log(horsepower) and sqrt(displacement) to the model made more of the predictors significant by lowering their p-values. Finally, squaring the cylinders variable had no significant effect (p-value 0.893).
This question should be answered using the Carseats data set. a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
data(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
s1 = lm(Sales ~ Price+Urban+US, data= Carseats)
summary(s1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Holding all other predictors constant, the coefficient of the Price variable shows that sales decrease by about 54.46 units for every one-dollar increase in price (Sales is recorded in thousands of units). Sales are not significantly affected by whether or not the store is in an urban area. On average, a US store sells about 1,200 more car seats than a store outside the US.
The model may be written as Sales = 13.043469 - 0.054459 × Price - 0.021916 × Urban + 1.200573 × US + ε, with Urban = 1 if the store is in an urban location and 0 if not, and US = 1 if the store is in the US and 0 if not.
We can reject the null hypothesis for the “Price” and “US” variables, but not for the predictor “Urban”: its p-value of 0.936 is not statistically significant.
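The 0/1 coding of the two qualitative predictors can be inspected with `model.matrix()`. The sketch below builds three hypothetical stores (made-up rows, not taken from Carseats) and applies the fitted coefficients reported in the summary output above.

```r
# Three hypothetical stores with the same price but different locations
stores <- data.frame(
  Price = c(100, 100, 100),
  Urban = factor(c("Yes", "Yes", "No"), levels = c("No", "Yes")),
  US    = factor(c("Yes", "No", "Yes"), levels = c("No", "Yes"))
)

# R codes each factor as a dummy column: 1 for "Yes", 0 for "No"
X <- model.matrix(~ Price + Urban + US, data = stores)
X

# Coefficients copied from summary(s1): intercept, Price, UrbanYes, USYes
beta <- c(13.043469, -0.054459, -0.021916, 1.200573)

# Predicted Sales (in thousands of units) for each store
drop(X %*% beta)
```

The first level of each factor ("No") is the baseline, which is why the output lists UrbanYes and USYes rather than Urban and US.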
s2 = lm(Sales ~ Price+US, data= Carseats)
summary(s2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
anova(s1,s2)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
The smaller model shows a slight decrease in the residual standard error (2.472 to 2.469) and a slight increase in the adjusted R-squared (0.2335 to 0.2354). The ANOVA test shows that the difference between the two models is not statistically significant (p-value 0.9357), so we do not reject the null hypothesis that the coefficient of Urban is zero. In addition, both models have an R-squared of about 0.24, so they explain only about 24% of the variation in the response, and removing the non-significant predictor changed little.
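The F-statistic in the ANOVA table can be reproduced by hand from the printed residual sums of squares. The sketch below uses only the rounded numbers shown in the output above, so the last digits may differ slightly from what R computes internally.

```r
# Values copied from the anova(s1, s2) table
rss_full <- 2420.8    # RSS of the larger model (with Urban)
df_full  <- 396       # its residual degrees of freedom
ss_diff  <- 0.03979   # increase in RSS when Urban is dropped
q        <- 1         # number of coefficients dropped

# F = (change in RSS per dropped term) / (residual variance of the full model)
F_stat <- (ss_diff / q) / (rss_full / df_full)
p_val  <- pf(F_stat, q, df_full, lower.tail = FALSE)

round(F_stat, 4)  # 0.0065
round(p_val, 3)
```

With a single dropped coefficient, this F-statistic is the square of the t value for UrbanYes in the first model's summary.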
confint(s2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
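The intervals above follow the usual form of estimate plus or minus a t quantile times the standard error. As a quick check, the sketch below rebuilds the Price interval from the rounded estimate and standard error printed in the summary of s2.

```r
# Values copied from summary(s2)
est <- -0.05448   # Price coefficient
se  <- 0.00523    # its standard error
df  <- 397        # residual degrees of freedom

# 95% interval: estimate +/- t quantile * standard error
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 4)  # -0.0648 -0.0442
```

The result agrees with the confint(s2) row for Price up to the rounding of the inputs.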
par(mfrow=c(2,2))
plot(s2)
Based on the Residuals vs Leverage and Normal Q-Q plots, there is no evidence of outliers or high-leverage observations.
This problem involves simple linear regression without an intercept.
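For regression without an intercept, the coefficient estimate for the regression of Y onto X is sum(xi*yi) / sum(xi^2), and for the regression of X onto Y it is sum(xi*yi) / sum(yi^2). The two estimates are therefore equal when the summation of xi^2 equals the summation of yi^2.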
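The closed-form slope for a no-intercept fit can be checked against `lm()`. The sketch below uses simulated data (the seed and the true slope of 2 are arbitrary choices for illustration).

```r
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form least-squares slopes without an intercept
b_yx <- sum(x * y) / sum(x^2)   # y regressed onto x
b_xy <- sum(x * y) / sum(y^2)   # x regressed onto y

# Both agree with lm() fitted through the origin
all.equal(b_yx, unname(coef(lm(y ~ x + 0))))
all.equal(b_xy, unname(coef(lm(x ~ y + 0))))
```

The two slopes share a numerator, so they can only differ through the denominators sum(x^2) and sum(y^2).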
x=rnorm(100)
y=rbinom(100,2,0.3)
model12<-lm(y~x+0)
summary(model12)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17896 0.02441 0.80486 1.01173 2.12560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.08405 0.07952 1.057 0.293
##
## Residual standard error: 0.848 on 99 degrees of freedom
## Multiple R-squared: 0.01116, Adjusted R-squared: 0.00117
## F-statistic: 1.117 on 1 and 99 DF, p-value: 0.2931
model12a<-lm(x~y+0)
summary(model12a)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4292 -0.8298 -0.1029 0.5355 2.6864
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.1328 0.1256 1.057 0.293
##
## Residual standard error: 1.066 on 99 degrees of freedom
## Multiple R-squared: 0.01116, Adjusted R-squared: 0.00117
## F-statistic: 1.117 on 1 and 99 DF, p-value: 0.2931
As we can see, the coefficient estimates differ between the two regressions (0.08405 for y onto x versus 0.1328 for x onto y).
x=1:100
y=100:1
model12b<-lm(y~x+0)
summary(model12b)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
model12c<-lm(x~y+0)
summary(model12c)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
As shown above, the coefficient estimates are the same in both regressions (0.5075).
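The two estimates coincide because y = 100:1 is just x = 1:100 in reverse order, so both vectors have the same sum of squares. A short check of this, using the same x and y as above:

```r
x <- 1:100
y <- 100:1

# Same values in a different order, so the sums of squares match
sum(x^2) == sum(y^2)        # TRUE

# Shared no-intercept slope for both regressions
sum(x * y) / sum(x^2)       # about 0.5075
```

Any pair where y is a rearrangement of x (or of -x) would give the same result.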