Both the KNN classifier and KNN regression are non-parametric approaches to modeling data. Both depend on the parameter k, the number of nearest neighbors the algorithm uses to estimate the response. The classifier looks at the proportion of each class among the k nearest neighbors to decide which class an observation should belong to, which implicitly defines a decision boundary for classifying new values. KNN regression, instead of taking a proportion, averages the responses of the k nearest neighbors, producing a step-like prediction function for new values.
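As a rough illustration of the difference, here is a minimal base-R sketch (one-dimensional predictor, absolute distance; the function names are made up for this example): the regression version averages the neighbors' responses, while the classifier takes a majority vote.
knn_reg_predict <- function(x_train, y_train, x0, k = 3) {
  nn <- order(abs(x_train - x0))[1:k]     # indices of the k nearest neighbors
  mean(y_train[nn])                       # regression: average their responses
}
knn_class_predict <- function(x_train, cl_train, x0, k = 3) {
  nn <- order(abs(x_train - x0))[1:k]
  names(which.max(table(cl_train[nn])))   # classification: majority vote among neighbors
}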
Auto <- Auto  # work with a local copy of the Auto data set
pairs(Auto)
cor(select(Auto,-name))
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Autolm <- lm(mpg ~.-name, Auto)
summary(Autolm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes: the F-statistic of 252.4 has a p-value below 2.2e-16, so at least one predictor is related to mpg.
Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin all have p-values well below 0.05; cylinders, horsepower, and acceleration do not.
What does the coefficient for the year variable suggest? Holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg.
par(mfrow=c(2,2))
plot(Autolm)
In the diagnostic plots, the residuals-vs-fitted panel shows an upward curve where we would like to see a flat line, suggesting some non-linearity. In the Q-Q plot the points in the right tail do not follow the line exactly. For leverage, R flags point 14, and there are points with very large residuals (possible outliers) such as 327, 394, 326, and 323.
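These visual impressions can be checked numerically; a quick sketch using the Autolm fit above:
which.max(hatvalues(Autolm))                          # observation with the largest leverage
sort(abs(rstudent(Autolm)), decreasing = TRUE)[1:4]   # largest studentized residuals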
AutolmInter <- lm(mpg ~ .+ weight*horsepower, select(Auto, -name, -cylinders, -acceleration))
summary(AutolmInter)
##
## Call:
## lm(formula = mpg ~ . + weight * horsepower, data = select(Auto,
## -name, -cylinders, -acceleration))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4091 -1.7734 -0.1386 1.5039 11.9502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.012e-01 3.966e+00 0.177 0.859757
## displacement 6.504e-03 4.884e-03 1.331 0.183822
## horsepower -2.202e-01 2.076e-02 -10.605 < 2e-16 ***
## weight -1.142e-02 6.891e-04 -16.571 < 2e-16 ***
## year 7.727e-01 4.476e-02 17.263 < 2e-16 ***
## origin 8.399e-01 2.488e-01 3.376 0.000809 ***
## horsepower:weight 5.438e-05 5.068e-06 10.731 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.928 on 385 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8593
## F-statistic: 399 on 6 and 385 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(AutolmInter)
From exploring interaction terms, displacement:horsepower and horsepower:weight are consistently significant, depending on which combination of variables is included; removing one of the two interactions increases the significance of the other. horsepower:weight has the larger effect on R-squared (see the comparison below). Including the interaction term also flattens the residuals.
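One way to compare the two candidate interactions directly (a sketch using the same reduced predictor set as above; the fit names here are arbitrary) is to fit each and look at its R-squared:
fit_hp_wt <- lm(mpg ~ . + horsepower:weight, select(Auto, -name, -cylinders, -acceleration))
fit_dp_hp <- lm(mpg ~ . + displacement:horsepower, select(Auto, -name, -cylinders, -acceleration))
c(hp_wt = summary(fit_hp_wt)$r.squared, dp_hp = summary(fit_dp_hp)$r.squared)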
Auto %>%
select(-name) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
The density plots show whether any variables might benefit from a transformation toward normality. Origin should be treated as a categorical variable, not a continuous one.
Autolmtrans <- lm(mpg ~ . + poly(horsepower, 2) - weight + log(weight) + as.factor(origin) - origin, select(Auto,-name, -cylinders, -acceleration, -displacement))
summary(Autolmtrans)
##
## Call:
## lm(formula = mpg ~ . + poly(horsepower, 2) - weight + log(weight) +
## as.factor(origin) - origin, data = select(Auto, -name, -cylinders,
## -acceleration, -displacement))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5229 -1.8585 -0.0073 1.5756 12.3550
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.104181 9.631414 8.836 < 2e-16 ***
## horsepower -0.026151 0.008622 -3.033 0.00258 **
## year 0.776688 0.045725 16.986 < 2e-16 ***
## poly(horsepower, 2)1 NA NA NA NA
## poly(horsepower, 2)2 22.988399 3.367983 6.826 3.42e-11 ***
## log(weight) -14.885036 1.263899 -11.777 < 2e-16 ***
## as.factor(origin)2 1.293095 0.462423 2.796 0.00543 **
## as.factor(origin)3 1.536449 0.474980 3.235 0.00132 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.948 on 385 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8573
## F-statistic: 392.6 on 6 and 385 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(Autolmtrans)
Transforming the variables increased the R-squared relative to the additive model and improved the residuals by flattening their mean. Searching for the combination of transformations that maximizes R-squared could go on endlessly; I based these transformations on the density plots (to make variables more normal) and on the pairs plot (to look for polynomial behavior).
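A quick side-by-side of the three fits so far (additive, interaction, and transformed models) makes the comparison concrete:
sapply(list(additive = Autolm, interaction = AutolmInter, transformed = Autolmtrans),
       function(m) summary(m)$adj.r.squared)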
lmcs <- Carseats %>%
select(Sales,Price, Urban, US) %>%
lm(Sales ~ ., .)
summary(lmcs)
##
## Call:
## lm(formula = Sales ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Be careful: some of the variables in the model are qualitative!
This model has four coefficients. The intercept does not have a strong interpretation because the domain of Price does not contain zero. The Price coefficient is negative, indicating that when price goes up, sales go down; this is intuitive, since demand decreases as price increases. The Urban coefficient says that being in an urban location lowers sales slightly, perhaps because people in urban environments rely more on public transit, though this variable is not significant. The positive US coefficient means that stores in the US have higher sales, perhaps because US laws and safety measures require car seats.
\[
\widehat{Sales}_{i} = 13.043469 - 0.054459 \, Price_{i} - 0.021916 \, Urban_{i} + 1.200573 \, US_{i},
\qquad
\begin{array}{l}
Urban_{i} = \begin{cases} 1 & \text{if store $i$ is in an urban location}\\ 0 & \text{if store $i$ is in a rural location} \end{cases} \\
US_{i} = \begin{cases} 1 & \text{if store $i$ is in the United States}\\ 0 & \text{if store $i$ is not in the United States} \end{cases}
\end{array}
\]
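To see the dummy coefficients in action, a small prediction sketch (the Price value of 120 is arbitrary): for two otherwise identical urban stores, the predicted Sales differ by the USYes coefficient, about 1.2 units.
predict(lmcs, newdata = data.frame(Price = 120, Urban = "Yes", US = c("Yes", "No")))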
We can reject the null hypothesis that \(\beta_j = 0\) for the variables Price and US, since their p-values are very close to zero.
lmcssimple <- Carseats %>%
select(Sales,Price, US) %>%
lm(Sales ~ ., .)
summary(lmcssimple)
##
## Call:
## lm(formula = Sales ~ ., data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The two models are very similar. The multiple R-squared is unchanged to four decimal places between (a) and (e), while the adjusted R-squared improves slightly (from 0.2335 to 0.2354), and the parameter estimates \(\beta\) change only a little. The benefit of model (e) is that it is simpler and easier to explain. Neither model explains much of the variance in Sales, with R-squared values of about 0.24. A formal comparison of the two nested models is shown below.
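Since model (e) is nested inside model (a), an F-test comparing the two shows directly whether dropping Urban loses anything:
anova(lmcssimple, lmcs)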
confint(lmcssimple)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow=c(2,2))
plot(lmcssimple)
In the model from (e), points with large residuals (potential outliers) include observations 377, 69, and 51, and observation 368 has high leverage and influence on the model.
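A quick numerical check of these claims (a sketch using the lmcssimple fit above):
which.max(hatvalues(lmcssimple))                            # observation with the largest leverage
sort(abs(rstudent(lmcssimple)), decreasing = TRUE)[1:3]     # largest studentized residuals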
The coefficients are the same if \(\sum_i (x_i - \bar{x})^2 = \sum_i (y_i - \bar{y})^2\).
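This follows from the two least-squares slope formulas (with intercepts included):
\[
\hat{\beta}_{y \sim x} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{\beta}_{x \sim y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (y_i - \bar{y})^2}.
\]
Both slopes share the same numerator, so they are equal exactly when the two denominators are equal.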
An example where the slope coefficients differ, because x and y have different spreads:
set.seed(1)
x <- jitter(1:100, 0)    # essentially no jitter
y <- jitter(1:100, 100)  # add a large amount of jitter
TSSX <- sum((x-mean(x))^2)
TSSY <- sum((y-mean(y))^2)
print(TSSX == TSSY)
## [1] FALSE
coef(lm(y~x))
## (Intercept) x
## 0.7955612 0.9983826
coef(lm(x~y))
## (Intercept) y
## 5.366281 0.881279
plot(x,y)
An example where the slope coefficients are the same, because the sums of squares are equal:
x <- 1:100
y <- 101:200
TSSX <- sum((x-mean(x))^2)
TSSY <- sum((y-mean(y))^2)
print(TSSX == TSSY)
## [1] TRUE
coef(lm(y~x))
## (Intercept) x
## 100 1
coef(lm(x~y))
## (Intercept) y
## -100 1
plot(x,y)