The KNN classifier assigns a test observation to the class that is most frequent among its K nearest training observations. KNN regression estimates f(x0) as the average response of the K training observations nearest to x0.
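The difference between the two methods can be sketched in a few lines of base R. The toy data and the helper names (`nearest_k`, `knn_classify`, `knn_regress`) are made up for illustration and are not part of the original exercise.

```r
# Toy training data: one predictor, a class label, and a numeric response
# (all values are invented for this sketch).
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)

# Indices of the k training points nearest to x0 (absolute distance in 1-D).
nearest_k <- function(x0, x, k) order(abs(x - x0))[seq_len(k)]

# KNN classification: majority vote among the k nearest neighbours.
knn_classify <- function(x0, x, cls, k) {
  votes <- table(cls[nearest_k(x0, x, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbours.
knn_regress <- function(x0, x, y, k) mean(y[nearest_k(x0, x, k)])

knn_classify(2.5, train_x, train_class, k = 3)   # "a"
knn_regress(2.5, train_x, train_y, k = 3)        # 1.5
```

Both functions share the same neighbour search; only the final step differs (a vote versus a mean).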
pairs(Auto)
cor(Auto[-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm = lm(mpg ~ . - name, data = Auto)
summary(lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
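Yes, there is a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16. The R^2 value indicates that the predictors explain 82.15% of the variance in mpg.
At the 0.05 level, displacement, weight, year, and origin have statistically significant relationships with mpg; cylinders, horsepower, and acceleration do not.
The year coefficient suggests that, holding the other predictors fixed, each additional model year is associated with an average increase of about 0.75 mpg.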
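The quantities read off the summary can also be pulled out programmatically. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so the snippet runs without the Auto data); the model formula is arbitrary.

```r
# Stand-in data: built-in mtcars, used only so the snippet is self-contained.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s   <- summary(fit)

# R^2 is the share of the response's variance explained by the model:
# 1 - RSS / TSS.
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_by_hand <- 1 - rss / tss
all.equal(r2_by_hand, s$r.squared)   # TRUE

# Coefficient table; list the terms whose p-value is below 0.05.
coefs <- coef(s)
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
significant
```

The same calls work on any fitted `lm` object, so they apply unchanged to the Auto model.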
par(mfrow = c(2,2))
plot(lm)
The Normal Q-Q plot shows that the residuals are skewed to the right. Observation 14 is the only high-leverage point.
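Leverage can also be checked numerically rather than by eye. A minimal sketch, again using mtcars as a stand-in data set; the cutoff of three times the average leverage is a rule of thumb chosen here for illustration, not a value taken from the original analysis.

```r
# Stand-in model on built-in data (assumption: any lm fit works the same way).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

h <- hatvalues(fit)            # leverage statistic for each observation
p <- length(coef(fit)) - 1     # number of predictors
n <- nrow(mtcars)

# The average leverage always equals (p + 1) / n.
avg_leverage <- (p + 1) / n
all.equal(mean(h), avg_leverage)   # TRUE

# Flag observations well above the average (illustrative cutoff).
high_leverage <- names(h)[h > 3 * avg_leverage]
high_leverage
```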
lm.interaction.attempt.1 = lm(mpg ~. -name + cylinders:weight, data = Auto)
summary(lm.interaction.attempt.1)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9484 -1.7133 -0.1809 1.4530 12.4137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3143478 5.0076737 1.461 0.14494
## cylinders -5.0347425 0.5795767 -8.687 < 2e-16 ***
## displacement 0.0156444 0.0068409 2.287 0.02275 *
## horsepower -0.0314213 0.0126216 -2.489 0.01322 *
## weight -0.0150329 0.0011125 -13.513 < 2e-16 ***
## acceleration 0.1006438 0.0897944 1.121 0.26306
## year 0.7813453 0.0464139 16.834 < 2e-16 ***
## origin 0.8030154 0.2617333 3.068 0.00231 **
## cylinders:weight 0.0015058 0.0001657 9.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8501
## F-statistic: 278.1 on 8 and 383 DF, p-value: < 2.2e-16
lm.interaction.attempt.2 = lm(mpg ~. -name + displacement:weight, data = Auto)
summary(lm.interaction.attempt.2)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
lm.interaction.attempt.3 = lm(mpg ~. -name + horsepower:weight, data = Auto)
summary(lm.interaction.attempt.3)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.589 -1.617 -0.184 1.541 12.001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.876e+00 4.511e+00 0.638 0.524147
## cylinders -2.955e-02 2.881e-01 -0.103 0.918363
## displacement 5.950e-03 6.750e-03 0.881 0.378610
## horsepower -2.313e-01 2.363e-02 -9.791 < 2e-16 ***
## weight -1.121e-02 7.285e-04 -15.393 < 2e-16 ***
## acceleration -9.019e-02 8.855e-02 -1.019 0.309081
## year 7.695e-01 4.494e-02 17.124 < 2e-16 ***
## origin 8.344e-01 2.513e-01 3.320 0.000986 ***
## horsepower:weight 5.529e-05 5.227e-06 10.577 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared: 0.8618, Adjusted R-squared: 0.859
## F-statistic: 298.6 on 8 and 383 DF, p-value: < 2.2e-16
lm.interaction.attempt.4 = lm(mpg ~. -name + acceleration:weight, data = Auto)
summary(lm.interaction.attempt.4)
##
## Call:
## lm(formula = mpg ~ . - name + acceleration:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.247 -2.048 -0.045 1.619 12.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.364e+01 5.811e+00 -7.511 4.18e-13 ***
## cylinders -2.141e-01 3.078e-01 -0.696 0.487117
## displacement 3.138e-03 7.495e-03 0.419 0.675622
## horsepower -4.141e-02 1.348e-02 -3.071 0.002287 **
## weight 4.027e-03 1.636e-03 2.462 0.014268 *
## acceleration 1.629e+00 2.422e-01 6.726 6.36e-11 ***
## year 7.821e-01 4.833e-02 16.184 < 2e-16 ***
## origin 1.033e+00 2.686e-01 3.846 0.000141 ***
## weight:acceleration -5.826e-04 8.408e-05 -6.928 1.81e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.141 on 383 degrees of freedom
## Multiple R-squared: 0.8414, Adjusted R-squared: 0.838
## F-statistic: 253.9 on 8 and 383 DF, p-value: < 2.2e-16
Of the four models tested, lm.interaction.attempt.3 (the horsepower:weight interaction) has the highest adjusted R^2 (0.859), the lowest residual standard error (2.931), and the most significant interaction term (t = 10.577).
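Comparing several candidate interactions by adjusted R^2 can be done in a loop instead of reading each summary. This is a sketch on the built-in mtcars data (a stand-in); the base formula and the list of interaction terms are hypothetical.

```r
# Stand-in data and a hypothetical base model.
base_formula <- mpg ~ wt + hp + disp
interactions <- c("wt:hp", "wt:disp", "hp:disp")

# Fit one model per interaction term and collect the adjusted R^2.
adj_r2 <- sapply(interactions, function(term) {
  f <- update(base_formula, paste(". ~ . +", term))
  summary(lm(f, data = mtcars))$adj.r.squared
})

# Largest adjusted R^2 first.
sort(adj_r2, decreasing = TRUE)
```

Adjusted R^2 is a convenient screen, but it is only one criterion; the residual plots are still worth checking for the chosen model.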
lm.log.interaction.attempt.1 = lm(mpg ~. -name + log(cylinders) + cylinders:weight, data = Auto)
summary(lm.log.interaction.attempt.1)
##
## Call:
## lm(formula = mpg ~ . - name + log(cylinders) + cylinders:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8589 -1.7311 -0.1874 1.5819 12.3569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.622e+00 6.042e+00 -0.599 0.54920
## cylinders -1.117e+01 2.027e+00 -5.512 6.54e-08 ***
## displacement 1.416e-02 6.778e-03 2.090 0.03731 *
## horsepower -2.361e-02 1.272e-02 -1.856 0.06418 .
## weight -1.767e-02 1.382e-03 -12.791 < 2e-16 ***
## acceleration 7.749e-02 8.906e-02 0.870 0.38483
## year 7.934e-01 4.604e-02 17.234 < 2e-16 ***
## origin 7.995e-01 2.587e-01 3.090 0.00215 **
## log(cylinders) 2.626e+01 8.319e+00 3.156 0.00172 **
## cylinders:weight 1.954e-03 2.168e-04 9.014 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.987 on 382 degrees of freedom
## Multiple R-squared: 0.8569, Adjusted R-squared: 0.8535
## F-statistic: 254.1 on 9 and 382 DF, p-value: < 2.2e-16
lm.log.interaction.attempt.2 = lm(mpg ~. -name + log(displacement) + displacement:weight, data = Auto)
summary(lm.log.interaction.attempt.2)
##
## Call:
## lm(formula = mpg ~ . - name + log(displacement) + displacement:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0020 -1.7953 -0.0745 1.5635 12.1431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.107e+00 1.288e+01 -0.241 0.80950
## cylinders 1.190e-01 2.948e-01 0.404 0.68657
## displacement -6.308e-02 3.022e-02 -2.087 0.03752 *
## horsepower -3.343e-02 1.285e-02 -2.602 0.00962 **
## weight -1.043e-02 1.356e-03 -7.692 1.24e-13 ***
## acceleration 6.366e-02 9.019e-02 0.706 0.48072
## year 7.854e-01 4.560e-02 17.223 < 2e-16 ***
## origin 5.471e-01 2.727e-01 2.007 0.04550 *
## log(displacement) -6.546e-01 3.481e+00 -0.188 0.85093
## displacement:weight 2.197e-05 4.491e-06 4.891 1.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.968 on 382 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8554
## F-statistic: 258.1 on 9 and 382 DF, p-value: < 2.2e-16
lm.log.interaction.attempt.3 = lm(mpg ~. -name + log(horsepower) + horsepower:weight, data = Auto)
summary(lm.log.interaction.attempt.3)
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + horsepower:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.509 -1.514 -0.122 1.415 11.931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.054e+01 1.691e+01 2.397 0.016994 *
## cylinders -5.646e-03 2.867e-01 -0.020 0.984298
## displacement 3.473e-04 7.137e-03 0.049 0.961209
## horsepower -7.176e-02 7.296e-02 -0.983 0.326007
## weight -8.189e-03 1.497e-03 -5.472 8.08e-08 ***
## acceleration -2.054e-01 1.012e-01 -2.030 0.043065 *
## year 7.591e-01 4.491e-02 16.902 < 2e-16 ***
## origin 8.170e-01 2.500e-01 3.268 0.001182 **
## log(horsepower) -1.157e+01 5.011e+00 -2.310 0.021419 *
## horsepower:weight 3.563e-05 9.972e-06 3.573 0.000398 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.915 on 382 degrees of freedom
## Multiple R-squared: 0.8637, Adjusted R-squared: 0.8605
## F-statistic: 269.1 on 9 and 382 DF, p-value: < 2.2e-16
lm.log.interaction.attempt.4 = lm(mpg ~. -name + log(acceleration) + acceleration:weight, data = Auto)
summary(lm.log.interaction.attempt.4)
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration) + acceleration:weight,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5279 -1.8822 -0.0237 1.6775 12.3282
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.083e+00 1.671e+01 -0.304 0.761077
## cylinders -1.286e-01 3.078e-01 -0.418 0.676212
## displacement -1.442e-03 7.675e-03 -0.188 0.851089
## horsepower -4.814e-02 1.367e-02 -3.521 0.000482 ***
## weight 3.357e-03 1.648e-03 2.037 0.042302 *
## acceleration 2.610e+00 4.657e-01 5.604 4.01e-08 ***
## year 7.811e-01 4.801e-02 16.270 < 2e-16 ***
## origin 1.027e+00 2.669e-01 3.849 0.000139 ***
## log(acceleration) -1.975e+01 8.031e+00 -2.460 0.014355 *
## weight:acceleration -5.101e-04 8.858e-05 -5.759 1.74e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.121 on 382 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8402
## F-statistic: 229.3 on 9 and 382 DF, p-value: < 2.2e-16
lm.fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
lm.fit
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Coefficients:
## (Intercept) Price UrbanYes USYes
## 13.04347 -0.05446 -0.02192 1.20057
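There is a negative relationship between price and sales: holding the other predictors fixed, each $1 increase in price is associated with a decrease of about 54.46 units sold (Sales is recorded in thousands of units, so the coefficient is -0.0545). There is no evidence of a relationship between Urban and sales (p-value of 0.936). There is a positive relationship between US and sales: at the same price, stores located in the US sell about 1,201 more units on average than stores outside the US.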
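Sales = 13.04 - 0.05 * Price - 0.02 * UrbanYes + 1.20 * USYes, where UrbanYes and USYes equal 1 if the store is in an urban area or in the US, respectively, and 0 otherwise.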
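As a worked example of the equation, the fitted coefficients can be applied to a single store. The store's values below (Price = 100, urban, in the US) are hypothetical, and the conversion to units assumes Sales is recorded in thousands, as in the Carseats documentation.

```r
# Coefficients copied from the summary output above.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical store: Price = 100, urban location, located in the US.
price     <- 100
urban_yes <- 1
us_yes    <- 1

sales_thousands <- b[["intercept"]] + b[["price"]] * price +
  b[["urban_yes"]] * urban_yes + b[["us_yes"]] * us_yes

sales_thousands          # about 8.776 (thousand units)
sales_thousands * 1000   # about 8776 units
```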
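We can reject the null hypothesis H0: beta_j = 0 for Price and USYes. We fail to reject it for UrbanYes because of its high p-value (0.936), so Urban is dropped from the model.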
lm.fit.2 = lm(Sales ~ Price -Urban + US, data = Carseats)
summary(lm.fit.2)
##
## Call:
## lm(formula = Sales ~ Price - Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
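The adjusted R^2 is 0.2335 for the first model and 0.2354 for the second, and the residual standard error drops from 2.472 to 2.469. The difference is slight, but the second model fits marginally better with one fewer predictor. Both models explain only about 24% of the variance in Sales.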
confint(lm.fit.2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
plot(lm.fit.2)
Observations 51, 69, and 377 appear to be outliers in the Residuals vs Fitted and Scale-Location plots. There also appear to be a few observations with high leverage.
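Possible outliers can also be screened with studentized residuals. The sketch below uses mtcars as a stand-in data set; the cutoff of 3 in absolute value is a common rule of thumb, not something derived from the plots above.

```r
# Stand-in model on built-in data (assumption: any lm fit works the same way).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Studentized residuals: each residual divided by its estimated standard error.
rs <- rstudent(fit)

# Observations beyond +/- 3 are candidates for a closer look.
possible_outliers <- names(rs)[abs(rs) > 3]
possible_outliers

# The three largest studentized residuals in absolute value, for reference.
head(sort(abs(rs), decreasing = TRUE), 3)
```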
From (3.38), the coefficient estimate for the regression of Y onto X without an intercept is sum(x*y) / sum(x^2), and for the regression of X onto Y it is sum(x*y) / sum(y^2). The numerators are identical and only the denominators differ, so the two estimates are equal whenever the sum of x^2 equals the sum of y^2.
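The two closed-form expressions can be checked against `lm()` directly. This is a minimal sketch with simulated data; the seed and the slope of 2 are arbitrary choices for illustration.

```r
set.seed(1)                     # arbitrary seed, for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)         # noise drawn independently of x

# No-intercept estimates: same numerator, different denominators.
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# Both agree with the coefficients reported by lm().
all.equal(beta_y_on_x, unname(coef(lm(y ~ x + 0))))   # TRUE
all.equal(beta_x_on_y, unname(coef(lm(x ~ y + 0))))   # TRUE
```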
set.seed(12)
x = rnorm(100)
sum(x)
## [1] -3.116866
# Re-seeding with the same value makes this rnorm(100) draw identical to x, so y = 3*x exactly.
set.seed(12)
y = 2*x + rnorm(100)
sum(y)
## [1] -9.350598
betaX = lm(x~y + 0)
betaY = lm(y~x + 0)
coef(betaX)
## y
## 0.3333333
coef(betaY)
## x
## 3
set.seed(12)
x.2 = rnorm(100)
y.2 = 1*x.2
sum(x.2)
## [1] -3.116866
sum(y.2)
## [1] -3.116866
betaX2 = lm(x.2~y.2 + 0)
betaY2 = lm(y.2~x.2 + 0)
coef(betaX2)
## y.2
## 1
coef(betaY2)
## x.2
## 1
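Setting y equal to x is the simplest way to make the two sums of squares match. A less trivial sketch, using hypothetical names `x.3` and `y.3`, shuffles x: the values are the same, so the sums of squares agree, but y is no longer identical to x.

```r
set.seed(12)
x.3 <- rnorm(100)
y.3 <- sample(x.3)    # same values in a random order

# Shuffling leaves the sum of squares unchanged.
all.equal(sum(x.3^2), sum(y.3^2))   # TRUE

# The two no-intercept regressions therefore give the same coefficient.
beta_y3_on_x3 <- unname(coef(lm(y.3 ~ x.3 + 0)))
beta_x3_on_y3 <- unname(coef(lm(x.3 ~ y.3 + 0)))
all.equal(beta_y3_on_x3, beta_x3_on_y3)   # TRUE
```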