Chapter 3
Question 2:
Carefully explain the differences between the KNN classifier and KNN regression methods.
As discussed in Chapter 2, the KNN classifier assigns an observation to a qualitative class: it identifies the K nearest neighbors of the observation and predicts the class that occurs most frequently among them. In other words, the class proportions among the K nearest neighbors act as estimated class probabilities, and the most frequent class wins. KNN regression, by contrast, produces a quantitative prediction by averaging the responses of the K nearest neighbors.
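To make the contrast concrete, here is a minimal base-R sketch (not from the text; knn_predict is a hypothetical helper written only for illustration) that predicts for a single query point using either a majority vote or a neighbor average:
knn_predict <- function(train_x, train_y, query, k, type = c("class", "reg")) {
  type <- match.arg(type)
  #Euclidean distance from the query point to every training observation
  d <- sqrt(rowSums((train_x - matrix(query, nrow(train_x), ncol(train_x), byrow = TRUE))^2))
  nbrs <- train_y[order(d)[1:k]]
  if (type == "class") {
    names(which.max(table(nbrs)))  #most frequent class among the K neighbors
  } else {
    mean(nbrs)                     #average response among the K neighbors
  }
}
set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)             #50 training points in 2 dimensions
class_y <- sample(c("A", "B"), 50, replace = TRUE)  #qualitative response
num_y <- rnorm(50)                                  #quantitative response
knn_predict(train_x, class_y, c(0, 0), k = 5, type = "class")  #KNN classification
knn_predict(train_x, num_y, c(0, 0), k = 5, type = "reg")      #KNN regression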
Question 9
#Auto dataset
library(ISLR)
data(Auto)
#scatterplot matrix of all pairs of variables
pairs(Auto)
b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
sub_auto <- subset(Auto, select = -name)
cor(sub_auto)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
#multiple linear regression with mpg as the response and all variables except name as predictors
lm_res_2 <- lm(mpg ~ . - name, data = Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
c.i) Is there a relationship between the predictors and the response?
Yes. The F-statistic (252.4 on 7 and 384 DF) is large and its p-value is essentially zero, so we reject the null hypothesis that all of the regression coefficients are zero; there is strong evidence of a relationship between the predictors and the response.
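For reference, the overall F-statistic and its p-value can be extracted directly from the summary object (a quick sketch using the lm_res_2 fit above):
fs <- summary(lm_res_2)$fstatistic
fs                                           #F value with its numerator and denominator df
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  #p-value of the overall F-test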
c.ii) Which predictors appear to have a statistically significant relationship to the response?
As indicated by the p-values of the individual t-statistics, displacement, weight, year, and origin have a statistically significant relationship with the response.
c.iii) What does the coefficient for the year variable suggest?
The regression coefficient for year, 0.750773, suggests that, holding the other predictors fixed, mpg increases by about 0.75 on average for each additional model year.
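One way to see this directly (a small sketch using the fit above) is to predict mpg for the same car at two consecutive model years and take the difference:
nd <- Auto[c(1, 1), ]         #duplicate a single observation
nd$year <- nd$year + c(0, 1)  #advance the model year by one for the copy
diff(predict(lm_res_2, nd))   #difference equals the year coefficient, about 0.75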
#arrange the diagnostic plots in a 2 x 2 grid
par(mfrow=c(2,2))
plot(lm_res_2)
It can first be noted that the Residuals vs Fitted plot is suggestive of a non-linear relationship (there is a visible curve), albeit less strongly so than in Question 8, where horsepower was the sole predictor. Additionally, the relatively horizontal red line in the Scale-Location plot indicates approximate homoscedasticity, though the points are not spread perfectly evenly. The Residuals vs Leverage plot shows that observation 14 is a high-leverage point. The Normal Q-Q plot is broadly supportive of normality, since normally distributed residuals fall approximately along a straight line in such a plot. Lastly, observations 327 and 323 stand out most clearly as potential outliers, along with a couple of other points.
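These visual impressions can be checked numerically with base R diagnostics (a brief sketch on the same fit):
which.max(hatvalues(lm_res_2))                          #observation with the highest leverage
head(sort(abs(rstudent(lm_res_2)), decreasing = TRUE))  #largest studentized residuals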
#implementing a few interactions
interact <- lm(mpg ~ cylinders * displacement + acceleration * year + displacement * weight, data = Auto)
summary(interact)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + acceleration *
## year + displacement * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6591 -1.6942 0.0812 1.4461 11.7626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.052e+02 1.846e+01 5.698 2.42e-08 ***
## cylinders 1.471e+00 5.743e-01 2.561 0.01081 *
## displacement -7.306e-02 1.167e-02 -6.261 1.02e-09 ***
## acceleration -7.201e+00 1.179e+00 -6.108 2.48e-09 ***
## year -7.114e-01 2.460e-01 -2.892 0.00405 **
## weight -1.223e-02 9.370e-04 -13.052 < 2e-16 ***
## cylinders:displacement -4.937e-03 2.559e-03 -1.929 0.05444 .
## acceleration:year 9.655e-02 1.537e-02 6.282 9.10e-10 ***
## displacement:weight 2.809e-05 3.579e-06 7.847 4.29e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.858 on 383 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8659
## F-statistic: 316.6 on 8 and 383 DF, p-value: < 2.2e-16
Judging by the p-values of the corresponding t-statistics, as in the earlier parts of Question 9, the displacement:weight interaction is statistically significant, whereas the cylinders:displacement interaction is not at the conventional levels given in the text. Interestingly, the acceleration:year interaction is also statistically significant, which is a bit less intuitive, although it is plausible.
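As a side note on the formula syntax used above, a * b in an R formula expands to the two main effects plus their interaction, so the same model could be written with explicit : terms:
#equivalent specification with explicit interaction terms
interact_alt <- lm(mpg ~ cylinders + displacement + acceleration + year + weight +
                     cylinders:displacement + acceleration:year + displacement:weight,
                   data = Auto)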
Below, I applied some transformations to the weight predictor. As can be seen, the more strongly the transformation compresses large values of the predictor, the more linear the relationship appears, with log(weight) producing the most linear pattern.
Subsequently, I performed additional transformations.
data(Auto)
par(mfrow = c(2, 2))
plot(Auto$weight, Auto$mpg)
plot(sqrt(Auto$weight), Auto$mpg)
plot(log(Auto$weight), Auto$mpg)
plot(Auto$weight^2, Auto$mpg)
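To complement the visual comparison, the fit of each single-predictor transformation can be summarized numerically, e.g., by \(R^2\) (a quick sketch; output not reproduced here):
summary(lm(mpg ~ weight, data = Auto))$r.squared
summary(lm(mpg ~ sqrt(weight), data = Auto))$r.squared
summary(lm(mpg ~ log(weight), data = Auto))$r.squared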
#for baseline comparison to transformations
lm_res_2 <- lm(mpg ~ . - name, data = Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
#transformation to compare to above baseline
lm_res_2 <- lm(I(mpg^2) ~ cylinders + I(displacement^2) + horsepower + I(log(weight)) + acceleration + I(sqrt(year)) + origin, data=Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = I(mpg^2) ~ cylinders + I(displacement^2) + horsepower +
## I(log(weight)) + acceleration + I(sqrt(year)) + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -456.65 -116.14 -17.65 89.46 1005.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.995e+03 7.323e+02 4.089 5.28e-05 ***
## cylinders -1.149e+01 1.568e+01 -0.733 0.4640
## I(displacement^2) 3.610e-03 6.299e-04 5.731 2.03e-08 ***
## horsepower -1.054e+00 8.263e-01 -1.276 0.2029
## I(log(weight)) -1.161e+03 9.574e+01 -12.130 < 2e-16 ***
## acceleration 1.174e+01 5.508e+00 2.131 0.0338 *
## I(sqrt(year)) 7.543e+02 5.090e+01 14.818 < 2e-16 ***
## origin 6.373e+01 1.551e+01 4.109 4.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189.8 on 384 degrees of freedom
## Multiple R-squared: 0.7785, Adjusted R-squared: 0.7745
## F-statistic: 192.8 on 7 and 384 DF, p-value: < 2.2e-16
Some notable aspects of the preceding transformation: \(R^2\) was reduced from 0.8215 to 0.7785, although the response was also transformed to \(mpg^2\), so the two values are not directly comparable. Interestingly, acceleration became significant at the 5% level. Displacement also gained significance, whereas the significance of origin decreased somewhat (though it remains significant). Additionally, the F-statistic decreased from 252.4 to 192.8.
Question 10
data(Carseats)
cs_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(cs_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The low p-value of the t-statistic, coupled with the negative sign of the coefficient, implies a negative relationship between Price and Sales: for every $1 increase in price, sales drop by roughly 0.0545 thousand units, i.e., about 54 units, when the other predictors are held fixed. Note that the documentation states that Sales is unit sales in thousands.
Unlike Price, Urban is qualitative (here, binary). Since the coefficient for UrbanYes is negative, the output suggests that sales in urban locations are about 22 units lower than in rural locations, holding the other predictors fixed; however, the large p-value (0.936) indicates that this difference is not statistically significant.
Similarly, US is qualitative as well; the output indicates that sales at US stores are roughly 1,201 units (1.200573 thousand) higher than at stores outside the US, when the other predictors remain fixed.
\(Sales = 13.043469 - 0.054459 \cdot Price - 0.021916 \cdot UrbanYes + 1.200573 \cdot USYes\), where \(UrbanYes\) equals 1 or 0 to indicate whether the store is in an urban or rural location, respectively, and \(USYes\) equals 1 or 0 to indicate whether the store is in the US or not, respectively.
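How R encodes these dummy variables can be confirmed with contrasts(), assuming Urban and US are stored as factors in Carseats:
contrasts(Carseats$Urban)
contrasts(Carseats$US)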
We can reject the null hypothesis for Price and US (but not for Urban), given their large t-statistics and correspondingly low p-values.
cs_fit_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(cs_fit_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The R-squared, i.e., the proportion of the variation in the response explained by the model, is essentially unchanged between (a) and (e), and the Adjusted R-squared increases slightly for (e). Given this similarity, removing Urban simplifies the model while yielding comparable results.
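Because the model in (e) is nested within the model in (a), the two can also be compared formally with an F-test (a quick sketch; output not shown):
anova(cs_fit_2, cs_fit)  #tests whether adding Urban improves the fit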
confint(cs_fit_2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The Residuals vs Fitted and Scale-Location plots below appear fairly well balanced in their distribution, so nothing looks particularly alarming. From the Residuals vs Leverage plot, some potentially influential outliers can be identified. In particular, one point has high leverage, though its standardized residual is close to 0. There are also a couple of points that exhibit both noticeable leverage and standardized residuals below about -1.
par(mfrow = c(2, 2))
plot(cs_fit_2)
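As in Question 9, the plots can be supplemented with numeric diagnostics (a short sketch on the same fit):
sort(hatvalues(cs_fit_2), decreasing = TRUE)[1:3]      #highest-leverage observations
sort(abs(rstudent(cs_fit_2)), decreasing = TRUE)[1:3]  #largest studentized residuals
length(coef(cs_fit_2)) / nrow(Carseats)                #average leverage (p + 1)/n, for comparison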
Question 12
For the regression of Y onto X, the coefficient estimate is \(\frac{\sum_i x_i y_i}{\sum_j x_j^2}\); for the regression of X onto Y, it is \(\frac{\sum_i x_i y_i}{\sum_j y_j^2}\). The numerators are identical, so only the denominators matter. Therefore the coefficient estimate for the regression of X onto Y equals the coefficient estimate for the regression of Y onto X precisely when \(\sum_j x_j^2 = \sum_j y_j^2\). The example below, with n = 100 observations, produces different estimates for the two regressions.
set.seed(1)
x <- rnorm(100, mean=50, sd=5)
y <- x + rnorm(100, mean=10, sd=5)
y_fit <- lm(y ~ x + 0)
x_fit <- lm(x ~ y + 0)
print("X coefficient:")
## [1] "X coefficient:"
coef(x_fit)
## y
## 0.8331712
print("Y coefficient:")
## [1] "Y coefficient:"
coef(y_fit)
## x
## 1.192592
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 100:1
y <- 1:100
y_fit <- lm(y ~ x + 0)
x_fit <- lm(x ~ y + 0)
print("X coefficient:")
## [1] "X coefficient:"
coef(x_fit)
## y
## 0.5074627
print("Y coefficient:")
## [1] "Y coefficient:"
coef(y_fit)
## x
## 0.5074627