Chapter 3
Question 2:
Carefully explain the differences between the KNN classifier and KNN regression methods.
As discussed in Chapter 2, the KNN classifier assigns an observation to a qualitative class: it identifies the K nearest neighbors of the observation and predicts the class that occurs most frequently among them. In other words, the class proportions among the K nearest neighbors act as estimated class probabilities, and the most frequent class wins. KNN regression, by contrast, produces a quantitative prediction by averaging the responses of the K nearest neighbors.
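To make the contrast concrete, here is a minimal base-R sketch (not from the text; knn_predict is a hypothetical helper written only for illustration) that predicts for a single query point using either a majority vote or a neighbor average:
knn_predict <- function(train_x, train_y, query, k, type = c("class", "reg")) {
  type <- match.arg(type)
  #Euclidean distance from the query point to every training observation
  d <- sqrt(rowSums((train_x - matrix(query, nrow(train_x), ncol(train_x), byrow = TRUE))^2))
  nbrs <- train_y[order(d)[1:k]]
  if (type == "class") {
    names(which.max(table(nbrs)))  #most frequent class among the K neighbors
  } else {
    mean(nbrs)                     #average response among the K neighbors
  }
}
set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)             #50 training points in 2 dimensions
class_y <- sample(c("A", "B"), 50, replace = TRUE)  #qualitative response
num_y <- rnorm(50)                                  #quantitative response
knn_predict(train_x, class_y, c(0, 0), k = 5, type = "class")  #KNN classification
knn_predict(train_x, num_y, c(0, 0), k = 5, type = "reg")      #KNN regression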
Question 9
#Auto dataset
library(ISLR)
data(Auto)
#scatterplot matrix of all pairs of variables
pairs(Auto)
b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
sub_auto <- subset(Auto, select = -name)
cor(sub_auto)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
#multiple linear regression with mpg as the response and all variables except name as predictors
lm_res_2 <- lm(mpg ~ . - name, data = Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
c.i) Is there a relationship between the predictors and the response?
Yes. The F-statistic (252.4 on 7 and 384 DF) is large and its p-value is essentially zero, so we reject the null hypothesis that all of the regression coefficients are zero; there is strong evidence of a relationship between the predictors and the response.
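For reference, the overall F-statistic and its p-value can be extracted directly from the summary object (a quick sketch using the lm_res_2 fit above):
fs <- summary(lm_res_2)$fstatistic
fs                                           #F value with its numerator and denominator df
pf(fs[1], fs[2], fs[3], lower.tail = FALSE)  #p-value of the overall F-test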
c.ii) Which predictors appear to have a statistically significant relationship to the response?
As indicated by the p-values of the individual t-statistics, displacement, weight, year, and origin have a statistically significant relationship with the response.
c.iii) What does the coefficient for the year variable suggest?
The regression coefficient for year, 0.750773, suggests that, holding the other predictors fixed, mpg increases by about 0.75 on average for each additional model year.
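One way to see this directly (a small sketch using the fit above) is to predict mpg for the same car at two consecutive model years and take the difference:
nd <- Auto[c(1, 1), ]         #duplicate a single observation
nd$year <- nd$year + c(0, 1)  #advance the model year by one for the copy
diff(predict(lm_res_2, nd))   #difference equals the year coefficient, about 0.75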
#arrange the diagnostic plots in a 2 x 2 grid
par(mfrow=c(2,2))
plot(lm_res_2)
It can first be noted that the Residuals vs Fitted plot is suggestive of a non-linear relationship (there is a visible curve), albeit less strongly so than in Question 8, where horsepower was the sole predictor. Additionally, the relatively horizontal red line in the Scale-Location plot indicates approximate homoscedasticity, though the points are not spread perfectly evenly. The Residuals vs Leverage plot shows that observation 14 is a high-leverage point. The Normal Q-Q plot is broadly supportive of normality, since normally distributed residuals fall approximately along a straight line in such a plot. Lastly, observations 327 and 323 stand out most clearly as potential outliers, along with a couple of other points.
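These visual impressions can be checked numerically with base R diagnostics (a brief sketch on the same fit):
which.max(hatvalues(lm_res_2))                          #observation with the highest leverage
head(sort(abs(rstudent(lm_res_2)), decreasing = TRUE))  #largest studentized residuals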
#implementing a few interactions
interact <- lm(mpg ~ cylinders * displacement + acceleration * year + displacement * weight, data = Auto)
summary(interact)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + acceleration *
## year + displacement * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6591 -1.6942 0.0812 1.4461 11.7626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.052e+02 1.846e+01 5.698 2.42e-08 ***
## cylinders 1.471e+00 5.743e-01 2.561 0.01081 *
## displacement -7.306e-02 1.167e-02 -6.261 1.02e-09 ***
## acceleration -7.201e+00 1.179e+00 -6.108 2.48e-09 ***
## year -7.114e-01 2.460e-01 -2.892 0.00405 **
## weight -1.223e-02 9.370e-04 -13.052 < 2e-16 ***
## cylinders:displacement -4.937e-03 2.559e-03 -1.929 0.05444 .
## acceleration:year 9.655e-02 1.537e-02 6.282 9.10e-10 ***
## displacement:weight 2.809e-05 3.579e-06 7.847 4.29e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.858 on 383 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.8659
## F-statistic: 316.6 on 8 and 383 DF, p-value: < 2.2e-16
Judging by the p-values of the corresponding t-statistics, as in the earlier parts of Question 9, the displacement:weight interaction is statistically significant, whereas the cylinders:displacement interaction is not at the conventional levels given in the text. Interestingly, the acceleration:year interaction is also statistically significant, which is a bit less intuitive, although it is plausible.
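As a side note on the formula syntax used above, a * b in an R formula expands to the two main effects plus their interaction, so the same model could be written with explicit : terms:
#equivalent specification with explicit interaction terms
interact_alt <- lm(mpg ~ cylinders + displacement + acceleration + year + weight +
                     cylinders:displacement + acceleration:year + displacement:weight,
                   data = Auto)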
Below, I applied some transformations to the weight predictor. As can be seen, the more strongly the transformation compresses large values of the predictor, the more linear the relationship appears, with log(weight) producing the most linear pattern.
Subsequently, I performed additional transformations.
data(Auto)
par(mfrow = c(2, 2))
plot(Auto$weight, Auto$mpg)
plot(sqrt(Auto$weight), Auto$mpg)
plot(log(Auto$weight), Auto$mpg)
plot(Auto$weight^2, Auto$mpg)
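To complement the visual comparison, the fit of each single-predictor transformation can be summarized numerically, e.g., by \(R^2\) (a quick sketch; output not reproduced here):
summary(lm(mpg ~ weight, data = Auto))$r.squared
summary(lm(mpg ~ sqrt(weight), data = Auto))$r.squared
summary(lm(mpg ~ log(weight), data = Auto))$r.squared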
#for baseline comparison to transformations
lm_res_2 <- lm(mpg ~ . - name, data = Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
#transformation to compare to above baseline
lm_res_2 <- lm(I(mpg^2) ~ cylinders + I(displacement^2) + horsepower + I(log(weight)) + acceleration + I(sqrt(year)) + origin, data=Auto)
summary(lm_res_2)
##
## Call:
## lm(formula = I(mpg^2) ~ cylinders + I(displacement^2) + horsepower +
## I(log(weight)) + acceleration + I(sqrt(year)) + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -456.65 -116.14 -17.65 89.46 1005.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.995e+03 7.323e+02 4.089 5.28e-05 ***
## cylinders -1.149e+01 1.568e+01 -0.733 0.4640
## I(displacement^2) 3.610e-03 6.299e-04 5.731 2.03e-08 ***
## horsepower -1.054e+00 8.263e-01 -1.276 0.2029
## I(log(weight)) -1.161e+03 9.574e+01 -12.130 < 2e-16 ***
## acceleration 1.174e+01 5.508e+00 2.131 0.0338 *
## I(sqrt(year)) 7.543e+02 5.090e+01 14.818 < 2e-16 ***
## origin 6.373e+01 1.551e+01 4.109 4.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 189.8 on 384 degrees of freedom
## Multiple R-squared: 0.7785, Adjusted R-squared: 0.7745
## F-statistic: 192.8 on 7 and 384 DF, p-value: < 2.2e-16
Some notable aspects of the preceding transformation: \(R^2\) was reduced from 0.8215 to 0.7785, although the response was also transformed to \(mpg^2\), so the two values are not directly comparable. Interestingly, acceleration became significant at the 5% level. Displacement also gained significance, whereas the significance of origin decreased somewhat (though it remains significant). Additionally, the F-statistic decreased from 252.4 to 192.8.
Question 10
data(Carseats)
cs_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(cs_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The low p-value of the t-statistic, coupled with the negative sign of the coefficient, implies a negative relationship between Price and Sales: for every $1 increase in price, sales drop by roughly 0.0545 thousand units, i.e., about 54 units, when the other predictors are held fixed. Note that the documentation states that Sales is unit sales in thousands.
Unlike Price, Urban is qualitative (here, binary). Since the coefficient for UrbanYes is negative, the output suggests that sales in urban locations are about 22 units lower than in rural locations, holding the other predictors fixed; however, the large p-value (0.936) indicates that this difference is not statistically significant.
Similarly, US is qualitative as well; the output indicates that sales at US stores are roughly 1,201 units (1.200573 thousand) higher than at stores outside the US, when the other predictors remain fixed.
\(Sales = 13.043469 - 0.054459 \cdot Price - 0.021916 \cdot UrbanYes + 1.200573 \cdot USYes\), where \(UrbanYes\) equals 1 or 0 to indicate whether the store is in an urban or rural location, respectively, and \(USYes\) equals 1 or 0 to indicate whether the store is in the US or not, respectively.
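How R encodes these dummy variables can be confirmed with contrasts(), assuming Urban and US are stored as factors in Carseats:
contrasts(Carseats$Urban)
contrasts(Carseats$US)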
We can reject the null hypothesis for Price and US (but not for Urban), given their large t-statistics and correspondingly low p-values.
cs_fit_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(cs_fit_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The R-squared, i.e., the proportion of the variation in the response explained by the model, is essentially unchanged between (a) and (e), and the Adjusted R-squared increases slightly for (e). Given this similarity, removing Urban simplifies the model while yielding comparable results.
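Because the model in (e) is nested within the model in (a), the two can also be compared formally with an F-test (a quick sketch; output not shown):
anova(cs_fit_2, cs_fit)  #tests whether adding Urban improves the fit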
confint(cs_fit_2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The Residuals vs Fitted and Scale-Location plots below appear fairly well balanced in their distribution, so nothing looks particularly alarming. From the Residuals vs Leverage plot, some potentially influential outliers can be identified. In particular, one point has high leverage, though its standardized residual is close to 0. There are also a couple of points that exhibit both noticeable leverage and standardized residuals below about -1.
par(mfrow = c(2, 2))
plot(cs_fit_2)
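As in Question 9, the plots can be supplemented with numeric diagnostics (a short sketch on the same fit):
sort(hatvalues(cs_fit_2), decreasing = TRUE)[1:3]      #highest-leverage observations
sort(abs(rstudent(cs_fit_2)), decreasing = TRUE)[1:3]  #largest studentized residuals
length(coef(cs_fit_2)) / nrow(Carseats)                #average leverage (p + 1)/n, for comparison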
Question 12
For the regression of Y onto X, the coefficient estimate is \(\frac{\sum_i x_i y_i}{\sum_j x_j^2}\); for the regression of X onto Y, it is \(\frac{\sum_i x_i y_i}{\sum_j y_j^2}\). The numerators are identical, so only the denominators matter. Therefore the coefficient estimate for the regression of X onto Y equals the coefficient estimate for the regression of Y onto X precisely when \(\sum_j x_j^2 = \sum_j y_j^2\). The example below, with n = 100 observations, produces different estimates for the two regressions.
set.seed(1)
x <- rnorm(100, mean=50, sd=5)
y <- x + rnorm(100, mean=10, sd=5)
y_fit <- lm(y ~ x + 0)
x_fit <- lm(x ~ y + 0)
print("X coefficient:")
## [1] "X coefficient:"
coef(x_fit)
## y
## 0.8331712
print("Y coefficient:")
## [1] "Y coefficient:"
coef(y_fit)
## x
## 1.192592
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 100:1
y <- 1:100
y_fit <- lm(y ~ x + 0)
x_fit <- lm(x ~ y + 0)
print("X coefficient:")
## [1] "X coefficient:"
coef(x_fit)
## y
## 0.5074627
print("Y coefficient:")
## [1] "Y coefficient:"
coef(y_fit)
## x
## 0.5074627