For KNN regression, the algorithm identifies the K observations nearest to a point X and estimates f(X) as the average of the responses of those K neighbors. The classification flavor of KNN is similar: the algorithm identifies the K observations nearest to X and assigns X the most common label among those K neighbors. If “Banana” is the majority label among the K neighbors, then X gets classified as the yellow fruit.
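As a minimal sketch of the classification flavor, here is how this looks with the knn() function from the class package; the fruit measurements below are invented purely for illustration:

library(class)  # provides knn()

# Invented toy data: two measurements (length, curvature score) per fruit
train.X <- matrix(c(8.0, 0.9,    # Banana
                    7.5, 0.8,    # Banana
                    3.0, 0.1,    # Apple
                    3.2, 0.2),   # Apple
                  ncol = 2, byrow = TRUE)
train.labels <- factor(c("Banana", "Banana", "Apple", "Apple"))

# A new point X to classify: knn() finds the K nearest training points
# and assigns the majority label among them
knn(train.X, matrix(c(7.8, 0.85), ncol = 2), cl = train.labels, k = 3)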
library(ISLR)  # Auto and Carseats data sets used below
# str(Auto)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[, 1:8])  # correlation matrix of the quantitative variables (name excluded)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# Multiple regression of mpg on all predictors except name
lm.fit9C <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit9C)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The null hypothesis is that the model is not useful and that all coefficients should be considered zero; the alternate hypothesis is that the model is useful and that at least one coefficient is not equal to zero. For this model, the p-value of the F-statistic is well below the typical 0.05 threshold, so we reject the null hypothesis: the model is useful, and at least one coefficient is not equal to zero. The model explains approximately 82% of the variance observed in mpg.
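As a side note, the reported model p-value can be reproduced directly from the F distribution with 7 and 384 degrees of freedom:

# Upper-tail probability of the reported F-statistic
pf(252.4, df1 = 7, df2 = 384, lower.tail = FALSE)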
Having established that the model is useful, we examine the p-values for the individual coefficients. For each coefficient, the null hypothesis states that there is no relationship and that the coefficient should be considered zero; the alternate hypothesis states that there is a relationship and the coefficient is not zero. For this model, the following coefficients have p-values below the 0.05 threshold: displacement, weight, year, and origin. All of these have a statistically significant relationship with the target variable, mpg.
The coefficient for “year” should be interpreted as follows: “mpg increases by about 0.75 for every one-year increase in model year, when all other variables are held constant.”
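A quick sketch to check that interpretation with predict(), using an invented car whose specs are held fixed while year increases by one (all values below are hypothetical):

newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15,
                      year = c(76, 77), origin = 1)
diff(predict(lm.fit9C, newcars))  # difference equals the year coefficient, ~0.75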
# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, Cook's distance
par(mfrow = c(2,2))
plot(lm.fit9C, which = 1:4)
I observe both curvature and an expanding “cheese wedge” shape in the Residuals vs. Fitted plot, indicating possible non-linearity and heteroscedasticity. The Q-Q plot generally tracks the reference line well, although points 323, 326, and 327 diverge in the upper tail. The Cook's distance plot identifies three values that could be influential: 14, 327, and 394.
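Those influential candidates can also be pulled out numerically rather than read off the plot; a brief sketch:

# Largest Cook's distances for this fit (the plot labels the top three)
cd9C <- cooks.distance(lm.fit9C)
head(sort(cd9C, decreasing = TRUE), 3)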
# Setting plots back to the standard configuration
par(mfrow = c(1,1))
# Interaction candidates chosen from the correlation matrix.
# (Note: displacement*weight and weight*displacement name the same term, so R keeps one copy.)
lm.fit9E1 <- lm(mpg ~ . + cylinders*displacement + displacement*weight +
                horsepower*displacement + weight*displacement +
                acceleration*horsepower + origin*displacement - name,
                data = Auto)
summary(lm.fit9E1)
##
## Call:
## lm(formula = mpg ~ . + cylinders * displacement + displacement *
## weight + horsepower * displacement + weight * displacement +
## acceleration * horsepower + origin * displacement - name,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3250 -1.5778 -0.0658 1.4758 12.4039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.047e+00 6.922e+00 -0.874 0.38291
## cylinders 7.586e-01 6.331e-01 1.198 0.23159
## displacement -9.031e-02 1.916e-02 -4.712 3.44e-06 ***
## horsepower -7.047e-02 5.853e-02 -1.204 0.22941
## weight -7.088e-03 1.452e-03 -4.882 1.55e-06 ***
## acceleration 2.107e-01 2.316e-01 0.910 0.36354
## year 7.593e-01 4.544e-02 16.710 < 2e-16 ***
## origin -5.884e-01 9.558e-01 -0.616 0.53856
## cylinders:displacement -9.817e-04 2.827e-03 -0.347 0.72862
## displacement:weight 1.427e-05 4.753e-06 3.002 0.00286 **
## displacement:horsepower 2.524e-04 1.074e-04 2.350 0.01930 *
## horsepower:acceleration -3.540e-03 2.342e-03 -1.512 0.13148
## displacement:origin 9.991e-03 8.248e-03 1.211 0.22652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.886 on 379 degrees of freedom
## Multiple R-squared: 0.8675, Adjusted R-squared: 0.8633
## F-statistic: 206.7 on 12 and 379 DF, p-value: < 2.2e-16
Using the correlation matrix as a reference, I chose interaction terms among the most highly correlated pairs of variables. The summary indicates that the majority of the interaction terms are not statistically significant, i.e., there is no evidence of those interactions. However, there does appear to be statistically significant evidence of an interaction between displacement and weight, and between displacement and horsepower.
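A nested-model F-test via anova() offers a joint check on whether the interaction terms improve the fit over the main-effects model (a sketch, reusing the two fits above):

# Compare the main-effects model to the model with interactions
anova(lm.fit9C, lm.fit9E1)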
# Assorted transformations of previously non-significant predictors
lm.fit9F <- lm(mpg ~ log(cylinders) + displacement + I(horsepower^2) + weight +
               sqrt(acceleration) + year + origin, data = Auto)
summary(lm.fit9F)
##
## Call:
## lm(formula = mpg ~ log(cylinders) + displacement + I(horsepower^2) +
## weight + sqrt(acceleration) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8647 -2.1472 0.0056 1.7709 13.0454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.267e+01 5.251e+00 -4.317 2.01e-05 ***
## log(cylinders) -2.724e+00 1.759e+00 -1.549 0.1223
## displacement 1.410e-02 7.843e-03 1.797 0.0731 .
## I(horsepower^2) 7.067e-05 4.387e-05 1.611 0.1080
## weight -7.054e-03 6.018e-04 -11.722 < 2e-16 ***
## sqrt(acceleration) 1.701e+00 7.096e-01 2.398 0.0170 *
## year 7.823e-01 5.053e-02 15.480 < 2e-16 ***
## origin 1.198e+00 2.794e-01 4.286 2.30e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.318 on 384 degrees of freedom
## Multiple R-squared: 0.8225, Adjusted R-squared: 0.8193
## F-statistic: 254.2 on 7 and 384 DF, p-value: < 2.2e-16
In this case, after applying assorted transformations to variables that had previously not been statistically significant, we see a change in behavior: the square root of acceleration is now statistically significant, but none of the other transformed variables is.
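One quick way to weigh these candidate models against each other on fit versus complexity is AIC (a sketch; both models share the same response, mpg, so the values are comparable):

AIC(lm.fit9C, lm.fit9F)  # lower AIC indicates the preferred trade-off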
par(mfrow=c(2,2))
plot(lm.fit9F, which=1:4)
Examination of the diagnostic plots indicates behavior similar to the previous (untransformed) model. There is still some evidence of heteroscedasticity, and the Cook's distance plot flags some points as potentially influential (14, 327, 387).
# Same predictors, but with a log-transformed response
lm.fit9F2 <- lm(log(mpg) ~ . - name, data = Auto)
summary(lm.fit9F2)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
Just for fun, I transformed mpg to log(mpg) and re-ran the original model. Cylinders and horsepower have now become statistically significant. The model explains about 88% of the variance in log(mpg), so that's an improvement if we're aiming for predictive power.
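One practical note: predictions from this model come out on the log scale, so they need to be back-transformed with exp(). A sketch with an invented car (and ignoring the small retransformation bias):

newcar <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                     weight = 2500, acceleration = 15, year = 77, origin = 1)
exp(predict(lm.fit9F2, newcar))  # predicted mpg, back on the original scale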
par(mfrow=c(2,2))
plot(lm.fit9F2, which=1:4)
The diagnostic plots indicate somewhat better behavior: the residuals are more evenly scattered, and the Cook's distances have been knocked down a bit, with only observation 14 worth a look for influential status. The Q-Q plot shows some departure at either end of the reference line, but otherwise the data tracks it well.
# ?Carseats
# str(Carseats)
lm.fit10A <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit10A)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The null hypothesis for the model is that the model is not useful and that all coefficients are equal to zero. The alternate hypothesis is that the model IS useful and that at least one coefficient is not equal to zero. The p-value for the model is below the typical 0.05 threshold, so we can reject the null hypothesis: this model is useful and has something to tell us. It explains only about 24% of the variance in Sales, which isn't all that good for prediction purposes, but it does establish that a relationship exists between some of the predictors and Sales.
The null hypothesis for each predictor is that it has no relationship with the target variable and that its coefficient is therefore zero; the alternate hypothesis holds that it does have a relationship. Going down the list, Price is statistically significant, so there is a “Price Effect” on Sales. Whether the store is in an urban location (Urban) is not statistically significant. Whether the store is in the US (US) is statistically significant, so we can say there is a categorical “US Store Effect” on Sales.
For the statistically significant variables, we interpret the coefficients as follows:
Sales decrease by 0.054 (×1000 units) for every one-unit increase in Price, holding all other variables constant. On average, carseats sold in US stores have Sales 1.2006 (×1000 units) higher than carseats sold in stores outside the US, when all other variables are held constant.
The equations for the model, keeping only the statistically significant terms, may be written as follows:
\(\text{Sales}\ (\times 1000) = (13.043 + 1.200573) - 0.054459 \times \text{Price}\) for carseats sold in US stores.
\(\text{Sales}\ (\times 1000) = 13.043 - 0.054459 \times \text{Price}\) for carseats sold in stores outside the US.
The equation for the entire model is:
\(\text{Sales}\ (\times 1000) = 13.043 - 0.054459 \times \text{Price} - 0.021916 \times \text{UrbanYes} + 1.200573 \times \text{USYes}\)
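These equations can be checked numerically with predict(); the $100 price below is just an illustrative value:

# Predicted Sales (x1000) at Price = 100, non-urban store, US vs. non-US
predict(lm.fit10A, data.frame(Price = 100, Urban = "No", US = c("Yes", "No")))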
I reject the null hypothesis for Price and the categorical USYes predictor. The reasoning behind this is explained in the answer for 10-B.
lm.fit10E <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit10E)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models explain about 24% of the variance in Sales. The new model has a slightly smaller residual standard error and a slightly higher adjusted R-squared (0.2354 vs. 0.2335), and its F-statistic is larger, because dropping the non-significant Urban term costs essentially no explanatory power.
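Since the second model is nested in the first, anova() gives a formal check that dropping Urban does not hurt the fit (a sketch, reusing the two fits above):

anova(lm.fit10E, lm.fit10A)  # F-test for the dropped Urban term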
confint(lm.fit10E)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The 95% confidence intervals for the coefficients are displayed above; neither the Price nor the USYes interval contains zero, consistent with their statistical significance.
par(mfrow=c(2,2))
plot(lm.fit10E, which=1:4)
The residuals plot shows a random “shotgun” scattering, so I would say this data is homoscedastic. Three observations sit at the outer bounds of the residuals plot (51, 69, 377). The Q-Q plot looks very good, with the data tracking the reference line closely. The Cook's distance plot shows several points above a Cook's distance of 0.02 (26, 50, and 368). Depending upon your threshold for influential points, these points may need to be removed.
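The outlying residuals can also be flagged numerically; a common rule of thumb looks at studentized residuals beyond roughly 2.5 or 3 in absolute value (a sketch):

which(abs(rstudent(lm.fit10E)) > 2.5)  # candidate outliers by studentized residual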
If \(\sum_i x_i^2 = \sum_i y_i^2\), then the \(\hat{\beta}\) estimates from regressing \(Y\) on \(X\) and \(X\) on \(Y\) (without an intercept) should be the same.
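To see why, write out the two no-intercept least-squares estimates, which share a numerator:
\[
\hat{\beta}_{y \sim x} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2},
\qquad
\hat{\beta}_{x \sim y} = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}
\]
They are equal exactly when the denominators match, i.e., when \(\sum_i x_i^2 = \sum_i y_i^2\).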
par(mfrow=c(1,1))
set.seed(42)
# Example where sum(x^2) != sum(y^2): the two slope estimates differ
x = rnorm(100)
y = 2*x + rnorm(100)
# plot(x, y)
lm.fit12B1 <- lm(y ~ x + 0)  # regress y on x, no intercept
summary(lm.fit12B1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9815 -0.5947 -0.0741 0.4498 2.7669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0245 0.0876 23.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9081 on 99 degrees of freedom
## Multiple R-squared: 0.8436, Adjusted R-squared: 0.8421
## F-statistic: 534.1 on 1 and 99 DF, p-value: < 2.2e-16
lm.fit12B2 <- lm(x ~ y + 0)  # regress x on y, no intercept
summary(lm.fit12B2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.56841 -0.21077 0.06774 0.31614 0.83105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.41671 0.01803 23.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.412 on 99 degrees of freedom
## Multiple R-squared: 0.8436, Adjusted R-squared: 0.8421
## F-statistic: 534.1 on 1 and 99 DF, p-value: < 2.2e-16
set.seed(42)
# Example where sum(x^2) == sum(y^2): y is an exact copy of x,
# so both no-intercept slope estimates are exactly 1
x = rnorm(100)
y = 1*x
lm.fit12B1 <- lm(y ~ x + 0)
summary(lm.fit12B1)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.281e-15 -5.030e-17 -3.400e-18 4.530e-17 3.022e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 7.156e-17 1.397e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.418e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.953e+32 on 1 and 99 DF, p-value: < 2.2e-16
lm.fit12B2 <- lm(x ~ y + 0)
summary(lm.fit12B2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.281e-15 -5.030e-17 -3.400e-18 4.530e-17 3.022e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 7.156e-17 1.397e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.418e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.953e+32 on 1 and 99 DF, p-value: < 2.2e-16