A KNN classifier assigns a new observation to a class by majority vote among its k nearest training points, while KNN regression predicts a quantitative response for a new observation by averaging the responses of its k nearest training points. The methodology for choosing the neighbors is identical in both cases; the difference is what is done with them: a vote over class labels for classification versus an average of numeric responses for regression.
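A minimal sketch of the distinction on toy data (this assumes the class and FNN packages, which are not otherwise used in this write-up):
set.seed(1)
train.X <- matrix(rnorm(40), ncol = 2)          # 20 training points, 2 features
labels  <- factor(rep(c("A", "B"), each = 10))  # qualitative response for classification
y       <- rowSums(train.X) + rnorm(20)         # quantitative response for regression
new.X   <- matrix(rnorm(4), ncol = 2)           # 2 new points to predict
class::knn(train.X, new.X, labels, k = 3)    # classification: majority vote of the 3 nearest labels
FNN::knn.reg(train.X, new.X, y, k = 3)$pred  # regression: mean response of the 3 nearest points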
library(ISLR)  # the Auto data frame ships with the ISLR package
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
pairs(Auto)
### Compute the matrix of correlations between the variables using the cor() function. The name variable must be excluded, since it is qualitative and cor() only accepts numeric input.
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
mpg.lm.fit = lm(mpg ~ . - name, data = Auto)
summary(mpg.lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The model’s F-statistic has a p-value below .001, indicating that the model is significant overall: there is a relationship between at least one of the predictors and the response.
The statistically significant predictors, assuming a .05 cutoff, are displacement (p = .008), weight (p < .001), year (p < .001), and origin (p < .001). Each of these has a significant relationship with mpg after controlling for the other predictors.
All else held equal, a one-unit increase in year (e.g., a car made in 1976 rather than 1975) is associated with an increase in average mpg of 0.750773.
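A quick way to sanity-check this interpretation with predict() (the twins data frame below is introduced just for illustration):
twins <- Auto[c(1, 1), ]          # two copies of the first car record
twins$year <- c(75, 76)           # identical except for model year
diff(predict(mpg.lm.fit, twins))  # difference equals the year coefficient, ~0.7508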
par(mfrow = c(2,2))
plot(mpg.lm.fit)
cooksd = order(cooks.distance(mpg.lm.fit), decreasing = T)  # row indices, largest Cook's distance first
head(cooksd)
## [1] 14 389 325 382 324 243
The Residuals vs Fitted and Scale-Location plots can be used to assess the linearity and constant-variance assumptions of the regression. The Scale-Location plot uses standardized residuals, which are easier to read because the effect of scale is removed, reducing the visual influence of outliers. The residuals are not evenly spread: they fan out toward the right, implying unequal variance (heteroscedasticity), and instead of scattering evenly around the line they show a small dip, suggesting a mildly nonlinear pattern. Both issues look somewhat milder on the standardized-residual plot.
The Q-Q plot, which checks for normality by plotting standardized residuals against theoretical normal quantiles, shows a decent trend until the upper tail, where a few outliers cause the distribution to skew.
The Residuals vs Leverage plot highlights potentially influential points, labeling those at the extremes. We can list them explicitly by ordering the Cook's distances, which shows the most influential observations are 14, 389, 325, 382, and 324.
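As a check beyond eyeballing the plot, a common (though not universal) heuristic flags points whose Cook's distance exceeds 4/n:
which(cooks.distance(mpg.lm.fit) > 4 / nrow(Auto))  # observations over the 4/n cutoff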
# All main effects plus every pairwise interaction, excluding name (equivalently: mpg ~ (. - name)^2)
mpg.lm.fit2 <- lm(mpg ~ . * . - name * . + . - name, data = Auto)
summary(mpg.lm.fit2)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The interactions that appear to be significant at a .05 cutoff are displacement:year (p = .014), acceleration:year (p = .030), and acceleration:origin (p = .004).
Looking back at the main-effects model, cylinders, horsepower, and acceleration were not significant, so including some transformations of these variables might help.
mpg.lm.fit3 = lm(mpg ~ . - name + log(acceleration) + log(cylinders) + log(horsepower), data = Auto)
summary(mpg.lm.fit3)
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration) + log(cylinders) +
## log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1801 -1.6733 -0.0964 1.5354 12.0891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.019e+02 1.484e+01 6.868 2.66e-11 ***
## cylinders 8.702e-01 1.205e+00 0.722 0.470607
## displacement -6.439e-03 7.352e-03 -0.876 0.381642
## horsepower 1.545e-01 2.632e-02 5.870 9.47e-09 ***
## weight -3.226e-03 6.732e-04 -4.792 2.37e-06 ***
## acceleration 3.557e-01 4.791e-01 0.742 0.458285
## year 7.438e-01 4.532e-02 16.412 < 2e-16 ***
## origin 8.724e-01 2.543e-01 3.431 0.000667 ***
## log(acceleration) -1.091e+01 7.667e+00 -1.423 0.155637
## log(cylinders) -4.894e+00 6.499e+00 -0.753 0.451914
## log(horsepower) -2.487e+01 2.912e+00 -8.542 3.19e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.956 on 381 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.8565
## F-statistic: 234.4 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit3)
Horsepower appears to have benefited from the log transformation, becoming strongly significant. The model also returns a slightly better Residuals vs Fitted plot, with the points more evenly spread and showing less heteroscedasticity and nonlinearity than the first model. The Q-Q plot still has issues with a few upper-quantile points, but the Cook's distances also appear to have a smaller effect on the fit.
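As a rough numeric cross-check of the two fits, their AIC values can be compared (assuming AIC is an acceptable yardstick here; both models share the same response and data):
AIC(mpg.lm.fit, mpg.lm.fit3)  # lower AIC indicates a better fit/complexity trade-off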
mpg.lm.fit4 = lm(mpg ~ . - name + sqrt(acceleration) + sqrt(cylinders) + sqrt(horsepower), data = Auto)
summary(mpg.lm.fit4)
##
## Call:
## lm(formula = mpg ~ . - name + sqrt(acceleration) + sqrt(cylinders) +
## sqrt(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3747 -1.6637 -0.0922 1.5159 12.0592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.833e+01 1.858e+01 4.217 3.09e-05 ***
## cylinders 2.870e+00 2.343e+00 1.225 0.221392
## displacement -7.473e-03 7.361e-03 -1.015 0.310619
## horsepower 3.750e-01 5.102e-02 7.350 1.22e-12 ***
## weight -3.158e-03 6.743e-04 -4.684 3.91e-06 ***
## acceleration 1.230e+00 9.832e-01 1.251 0.211720
## year 7.413e-01 4.526e-02 16.380 < 2e-16 ***
## origin 8.752e-01 2.536e-01 3.451 0.000621 ***
## sqrt(acceleration) -1.250e+01 7.941e+00 -1.574 0.116281
## sqrt(cylinders) -1.314e+01 1.104e+01 -1.191 0.234369
## sqrt(horsepower) -9.565e+00 1.131e+00 -8.453 6.10e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.952 on 381 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.857
## F-statistic: 235.2 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit4)
The square root of horsepower is also significant, although the diagnostic plots do not appear to have changed much from the log-transformation model.
# Squared terms added via I(); I(acceleration^2) would be an equivalent spelling
mpg.lm.fit5 = lm(mpg ~ . - name + I(acceleration * acceleration) + I(cylinders * cylinders) + I(horsepower * horsepower), data = Auto)
summary(mpg.lm.fit5)
##
## Call:
## lm(formula = mpg ~ . - name + I(acceleration * acceleration) +
## I(cylinders * cylinders) + I(horsepower * horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8488 -1.6976 -0.0639 1.4410 12.1216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.2501162 6.4010975 2.695 0.00735 **
## cylinders -2.5830432 1.2080209 -2.138 0.03313 *
## displacement -0.0090379 0.0074258 -1.217 0.22432
## horsepower -0.2975187 0.0344484 -8.637 < 2e-16 ***
## weight -0.0030964 0.0006800 -4.554 7.11e-06 ***
## acceleration -1.4761632 0.5457876 -2.705 0.00714 **
## year 0.7355311 0.0453526 16.218 < 2e-16 ***
## origin 0.9131603 0.2531125 3.608 0.00035 ***
## I(acceleration * acceleration) 0.0351887 0.0159215 2.210 0.02769 *
## I(cylinders * cylinders) 0.2534219 0.1003346 2.526 0.01195 *
## I(horsepower * horsepower) 0.0008809 0.0001108 7.951 2.13e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.957 on 381 degrees of freedom
## Multiple R-squared: 0.8601, Adjusted R-squared: 0.8564
## F-statistic: 234.2 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit5)
Lastly, with the squared terms, most of the model turns out to be significant at a .05 cutoff; the only variable not significant is displacement. The diagnostic plots also look better, with the residuals spread more evenly and showing a more linear pattern.
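For a side-by-side look at all four fits, the adjusted R-squared values can be pulled out directly (a quick sketch using the models fit above):
sapply(list(base = mpg.lm.fit, log = mpg.lm.fit3, sqrt = mpg.lm.fit4, squared = mpg.lm.fit5),
       function(m) summary(m)$adj.r.squared)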
data(Carseats)  # the Carseats data frame also ships with the ISLR package
attach(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
sales.lm.fita = lm(Sales ~ Price + US + Urban, Carseats)
summary(sales.lm.fita)
##
## Call:
## lm(formula = Sales ~ Price + US + Urban, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The model is significant overall, with a p-value below .001. Price and USYes (an indicator that the store is located in the US) are both significant, with p-values below .001. The Price coefficient means that, holding the other variables fixed, each one-dollar increase in price is associated with a decrease in sales of 0.054459 thousand units (about 54 car seats). If the store is located in the United States, expected sales increase by 1.200573 thousand units. UrbanYes has a p-value of .936, well above the .05 cutoff, so its relationship with the response Sales is not significant; its estimated coefficient of -0.021916 thousand units should not be over-read.
In equation form, the full model is:
Sales = 13.043469 - 0.054459 * Price + 1.200573 * I(US = Yes) - 0.021916 * I(Urban = Yes)
If the store isn't in the US, the US indicator is zero and the model simplifies to:
Sales = 13.043469 - 0.054459 * Price - 0.021916 * I(Urban = Yes)
If the store isn't in an urban area, the Urban indicator is zero and the model simplifies to:
Sales = 13.043469 - 0.054459 * Price + 1.200573 * I(US = Yes)
Lastly, if the store is neither in the US nor in an urban area, the formula reduces to the intercept and the Price term:
Sales = 13.043469 - 0.054459 * Price
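A small sketch of the four cases using predict() (the Price value of 100 is an arbitrary illustration, and grid is a name introduced here):
grid <- expand.grid(Price = 100, US = c("No", "Yes"), Urban = c("No", "Yes"))
cbind(grid, Sales.hat = predict(sales.lm.fita, grid))  # predicted Sales, in thousands of units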
In this model, we can reject the null hypothesis H0: βj = 0 for both Price and US, since their coefficients are significant.
sales.lm.fite = lm(Sales ~ Price + US, Carseats)
summary(sales.lm.fite)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models a and e are significant overall and share the exact same R-squared of .2393, but the adjusted R-squared shows model e slightly ahead (.2354 vs .2335). This is because adjusted R-squared penalizes the number of terms in the model, and model e has fewer terms. The improvement over model a is very small, as expected: no new variables were added, only an unhelpful one was removed.
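Since model e is nested inside model a (it drops only Urban), an F-test makes the same comparison formally; for a single dropped term its p-value matches the t-test above:
anova(sales.lm.fite, sales.lm.fita)  # p-value ~ .936, the same as UrbanYes's t-test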
confint(sales.lm.fite)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2,2))
plot(sales.lm.fite)
The diagnostic plots return very clean results. The residuals in the Residuals vs Fitted plot are randomly scattered, making it difficult to spot outliers visually. We can instead look at Cook's distance to find the influential points.
cooksdistance = cooks.distance(sales.lm.fite)
cooksdordered = order(cooksdistance, decreasing = T)
head(cooksdordered)
## [1] 26 368 50 317 166 377
cooksdistance[c(26,368,50,317,166,377)]
## 26 368 50 317 166 377
## 0.02610946 0.02428736 0.02283546 0.02047046 0.01975504 0.01828219
The preceding points are the most influential by Cook's distance; the row indices are listed on top, with each distance beneath.
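To put these values on a scale, the same 4/n heuristic used earlier can be applied (again a rule of thumb, not a hard cutoff):
sum(cooksdistance > 4 / nrow(Carseats))  # how many points exceed the rule-of-thumb threshold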
For the regression of Y onto X without an intercept, the coefficient estimate is β̂ = (Σ xᵢyᵢ) / (Σ xᵢ²). For the regression of X onto Y, the estimate is β̂′ = (Σ xᵢyᵢ) / (Σ yᵢ²).
The two estimates share the same numerator, so setting them equal shows they coincide exactly when the denominators match.
That is, Σ xᵢ² = Σ yᵢ² is the condition under which the regressions of X onto Y and Y onto X have the same coefficient estimate.
x = rnorm(100)          # note: no seed is set, so exact values vary from run to run
y = 2 * x + rnorm(100)  # y is approximately 2x plus standard normal noise
sum(x^2)
## [1] 73.56352
sum(y^2)
## [1] 424.2937
lmYX = lm(y~x+0)
lmXY = lm(x~y+0)
summary(lmYX)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36794 -0.68238 -0.07716 0.69265 2.40594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0786 0.1209 17.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 99 degrees of freedom
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7465
## F-statistic: 295.5 on 1 and 99 DF, p-value: < 2.2e-16
summary(lmXY)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9457 -0.3165 -0.0449 0.2829 1.0358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.36038 0.02096 17.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4318 on 99 degrees of freedom
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7465
## F-statistic: 295.5 on 1 and 99 DF, p-value: < 2.2e-16
The sums of x^2 and y^2 are not equal, in line with the reasoning in 12a, and accordingly the two coefficient estimates differ (2.0786 for y onto x vs 0.36038 for x onto y).
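As a quick check, the closed-form expressions from 12a reproduce the fitted coefficients directly:
sum(x * y) / sum(x^2)  # matches coef(lmYX), ~2.0786
sum(x * y) / sum(y^2)  # matches coef(lmXY), ~0.36038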
x = rnorm(100)
y = abs(x)  # same magnitudes as x, so sum(y^2) is exactly sum(x^2)
sum(x^2)
## [1] 91.0051
sum(y^2)
## [1] 91.0051
lmYX = lm(y~x+0)
lmXY = lm(x~y+0)
summary(lmYX)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.01504 0.24488 0.52209 1.02569 2.56584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.09838 0.10002 0.984 0.328
##
## Residual standard error: 0.9541 on 99 degrees of freedom
## Multiple R-squared: 0.009679, Adjusted R-squared: -0.0003241
## F-statistic: 0.9676 on 1 and 99 DF, p-value: 0.3277
summary(lmXY)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56584 -0.49015 0.03824 0.53775 2.34748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.09838 0.10002 0.984 0.328
##
## Residual standard error: 0.9541 on 99 degrees of freedom
## Multiple R-squared: 0.009679, Adjusted R-squared: -0.0003241
## F-statistic: 0.9676 on 1 and 99 DF, p-value: 0.3277
Using the absolute value of x for y changes the signs of the data but not the squared values, so the sums of x^2 and y^2 come out identical. By the result in 12a, the coefficient estimates for the regressions of X onto Y and Y onto X are therefore the same, and indeed both come out to 0.09838.
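Another construction with the same property is a random permutation of x, since reordering the values leaves the sum of squares unchanged (a minimal sketch; y2 is a name introduced here):
y2 <- sample(x)  # same values in a different order, so sum(y2^2) equals sum(x^2)
c(coef(lm(y2 ~ x + 0)), coef(lm(x ~ y2 + 0)))  # the two estimates agree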