## 2.

The major difference between the KNN classifier and KNN regression is that the KNN classifier predicts the class that the output variable belongs to by estimating local class probabilities (a majority vote among the K nearest neighbors), while KNN regression predicts the value of the output as a local average of the K nearest neighbors' responses.
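Both rules can be computed by hand from the same set of neighbors. A minimal sketch on toy data (all names and values here are illustrative, not from the text):

set.seed(1)
x_train <- matrix(rnorm(40), ncol = 2)            # 20 training points in 2D
y_class <- factor(rep(c("A", "B"), each = 10))    # qualitative response
y_num   <- rnorm(20)                              # quantitative response
x0 <- c(0, 0)                                     # query point
K  <- 5
nbrs <- order(colSums((t(x_train) - x0)^2))[1:K]  # indices of the K nearest neighbors
table(y_class[nbrs]) / K  # KNN classifier: estimated local class probabilities
mean(y_num[nbrs])         # KNN regression: local average of the responses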

## 9.

### (a)

library(MASS)
library(ISLR)
autoDf <- Auto
pairs(autoDf)

### (b)

cor(autoDf[, 1:8])  # all quantitative variables (the qualitative name variable is excluded)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

### (c)

lm.fit <- lm(mpg ~ . - name, data = autoDf)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = autoDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

#### i.

Based on the \(F\)-statistic of \(252.4\) (\(p < 2.2\times10^{-16}\)) and the \(R^2\), there is a strong relationship between the predictors and the response: the model explains \(82.15\%\) of the variance in MPG.

#### ii.

The \(\beta\) coefficients for displacement, weight, year, and origin are all significantly different from zero.

#### iii.

The year coefficient of \(0.750773\) suggests that, holding the other predictors fixed, MPG increases by about \(0.75\) for every one-year increase in model year. This seems logical, since newer vehicles have improved technology that typically improves MPG.
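One way to see the size of the year effect (a sketch with hypothetical predictor values, not part of the original answer) is to predict MPG for two otherwise identical cars one model year apart:

newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
diff(predict(lm.fit, newcars))  # the difference equals the year coefficient, about 0.75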

### (d)

par(mfrow=c(2,2))
plot(lm.fit)

Looking at the diagnostic plots, the main issue appears to be some large outliers that deviate from the line of normality in the Q-Q plot. The residuals vs. fitted plot points to the same observations, showing an increase in variance as the fitted values become larger. The residuals vs. leverage plot shows that observation 14 has unusually high leverage.
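As a quick numeric check (a sketch, not part of the original output), the hat values single out the same observation:

which.max(hatvalues(lm.fit))  # index of the highest-leverage observation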

### (e)

lm_int.fit <- lm(mpg ~ cylinders*displacement + weight + acceleration + horsepower*year + origin, data = autoDf)
summary(lm_int.fit)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + weight + acceleration + 
##     horsepower * year + origin, data = autoDf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3762  -1.7426   0.0006   1.2578  11.2401 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -7.200e+01  1.037e+01  -6.946 1.62e-11 ***
## cylinders              -1.741e+00  4.048e-01  -4.302 2.16e-05 ***
## displacement           -6.844e-02  1.332e-02  -5.140 4.40e-07 ***
## weight                 -4.523e-03  5.937e-04  -7.620 2.03e-13 ***
## acceleration           -2.619e-02  8.678e-02  -0.302   0.7630    
## horsepower              6.401e-01  9.404e-02   6.807 3.87e-11 ***
## year                    1.649e+00  1.285e-01  12.836  < 2e-16 ***
## origin                  6.596e-01  2.565e-01   2.572   0.0105 *  
## cylinders:displacement  1.064e-02  1.661e-03   6.407 4.36e-10 ***
## horsepower:year        -9.418e-03  1.276e-03  -7.379 1.00e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.894 on 382 degrees of freedom
## Multiple R-squared:  0.8657, Adjusted R-squared:  0.8625 
## F-statistic: 273.5 on 9 and 382 DF,  p-value: < 2.2e-16

Adding an interaction term between cylinders and displacement, as well as between horsepower and year, yields two significant interaction terms. The cylinders:displacement interaction is interesting because more cylinders and a larger displacement are each correlated with lower MPG, as can be seen in the individual betas, yet the interaction coefficient is positive. The horsepower:year interaction is also significant, and its negative coefficient suggests that the MPG gain from newer model years is slightly smaller for more powerful cars.
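Because the additive model from (c) is nested in this one, the interaction terms can also be tested jointly with an F-test (a sketch, not part of the original output):

anova(lm.fit, lm_int.fit)  # compares the additive fit to the interaction fit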

### (f)

lm_log.fit <- lm(log(mpg) ~ . - name, data = autoDf)
summary(lm_log.fit)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = autoDf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40955 -0.06533  0.00079  0.06785  0.33925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
## cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
## displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
## horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
## weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
## acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
## year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
## origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8773 
## F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log.fit)

lm_sqrt.fit <- lm(mpg^0.5 ~ . - name, data = autoDf)
summary(lm_sqrt.fit)
## 
## Call:
## lm(formula = mpg^0.5 ~ . - name, data = autoDf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98891 -0.18946  0.00505  0.16947  1.02581 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.075e+00  4.290e-01   2.506   0.0126 *  
## cylinders    -5.942e-02  2.986e-02  -1.990   0.0474 *  
## displacement  1.752e-03  6.942e-04   2.524   0.0120 *  
## horsepower   -2.512e-03  1.274e-03  -1.972   0.0493 *  
## weight       -6.367e-04  6.024e-05 -10.570  < 2e-16 ***
## acceleration  2.738e-03  9.131e-03   0.300   0.7644    
## year          7.381e-02  4.709e-03  15.675  < 2e-16 ***
## origin        1.217e-01  2.569e-02   4.735 3.09e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3074 on 384 degrees of freedom
## Multiple R-squared:  0.8561, Adjusted R-squared:  0.8535 
## F-statistic: 326.3 on 7 and 384 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sqrt.fit)

lm_sqrd.fit <- lm(mpg^2 ~ . - name, data = autoDf)
summary(lm_sqrd.fit)
## 
## Call:
## lm(formula = mpg^2 ~ . - name, data = autoDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483.45 -141.87  -19.62  103.58 1042.84 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.878e+03  2.928e+02  -6.412 4.22e-10 ***
## cylinders    -1.436e+01  2.038e+01  -0.704  0.48157    
## displacement  1.328e+00  4.738e-01   2.802  0.00534 ** 
## horsepower   -3.587e-01  8.693e-01  -0.413  0.68009    
## weight       -3.522e-01  4.111e-02  -8.567 2.62e-16 ***
## acceleration  9.278e+00  6.232e+00   1.489  0.13740    
## year          4.081e+01  3.214e+00  12.698  < 2e-16 ***
## origin        9.509e+01  1.754e+01   5.422 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.8 on 384 degrees of freedom
## Multiple R-squared:  0.7292, Adjusted R-squared:  0.7243 
## F-statistic: 147.8 on 7 and 384 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sqrd.fit)

Rerunning the model with the \(\log(y)\), \(\sqrt{y}\), and \(y^2\) transformations produces interesting differences. The \(\log(y)\) transformation performs well, increasing the \(R^2\) from \(82.15\%\) to \(87.95\%\) and resolving some of the issues with large outliers, though it does seem to introduce small outliers. The \(\sqrt{y}\) transformation still improves on the original model's \(R^2\), though not as much as \(\log(y)\), with an \(R^2\) of \(85.61\%\), and helps reduce high outliers without introducing as many low outliers as the \(\log(y)\) transformation. Finally, the \(y^2\) transformation performs the worst of all the models, with an \(R^2\) of \(72.92\%\), while also exaggerating the high outliers.
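A compact way to line the fits up (a sketch, not part of the original output; the \(R^2\) values are only roughly comparable since the responses are on different scales):

sapply(list(linear = lm.fit, log = lm_log.fit, sqrt = lm_sqrt.fit, squared = lm_sqrd.fit),
       function(m) summary(m)$r.squared)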

## 10.

### (a)

carseatsDf <- Carseats
lm_cs.fit <- lm(Sales ~ Price + Urban + US, data = carseatsDf)
summary(lm_cs.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseatsDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_cs.fit)

### (b)

Sales are recorded in thousands of units, so for every one-dollar increase in price, sales decrease by about \(54\) units. If the store is in an urban location, sales decrease by about \(22\) units. If the store is in the US, sales increase by about \(1{,}200\) units.

### (c)

\(\hat{y} = 13.043469 - 0.054459x_1 - 0.021916x_2 + 1.200573x_3\), where \(\hat{y}\) is the predicted car seat sales in thousands of units, \(x_1\) is the price of the car seat, \(x_2\) is an indicator variable that is 1 if the store is in an urban area and 0 if it is in a rural area, and \(x_3\) is an indicator variable that is 1 if the store is in the US and 0 otherwise.

### (d)

Urban is the only predictor for which we fail to reject the null hypothesis \(H_0\colon \beta = 0\).
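Plugging hypothetical values into the fitted equation (a sketch, not part of the original answer; the inputs are made up):

predict(lm_cs.fit, data.frame(Price = 100, Urban = "Yes", US = "Yes"))
# 13.043469 - 0.054459*100 - 0.021916 + 1.200573, about 8.78 (thousand units)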

### (e)

lm_cs_sig.fit <- lm(Sales ~ Price + US, data = carseatsDf)
summary(lm_cs_sig.fit)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = carseatsDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_cs_sig.fit)

### (f)

Both models fit adequately, meeting the linear-model assumptions based on the diagnostic plots; however, neither model explains the sales of carseats well, as both have an \(R^2\) of only about \(24\%\).
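Because the model from (e) is nested in the model from (a), the comparison can also be made with an F-test (a sketch, not part of the original output):

anova(lm_cs_sig.fit, lm_cs.fit)  # dropping Urban costs essentially no explanatory power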

### (g)

confint(lm_cs_sig.fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

The \(95\%\) CI for \(Price\) is \((-0.06475984, -0.04419543)\), and the CI for \(US_{Yes}\) is \((0.69151957, 1.70776632)\).

### (h)

There are a few points with high leverage, but no significant outliers.

## 12.

### (a)

For regression without an intercept, the estimate for the regression of \(Y\) onto \(X\) is \(\hat\beta = \frac{\sum_i x_i y_i}{\sum_j x_j^2}\), and the estimate for the regression of \(X\) onto \(Y\) is \(\hat\beta' = \frac{\sum_i x_i y_i}{\sum_j y_j^2}\). The numerators are identical, so the two estimates are the same exactly when the denominators agree, i.e. when \(\sum_j x_j^2 = \sum_j y_j^2\).

### (b)

set.seed(13)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 10*x + rnorm(100, sd = 10)  # sum(y^2) far exceeds sum(x^2), so the estimates differ

sum(y^2)
## [1] 33723401
lm.fit_X <- lm(y ~ x + 0)  # regress y onto x, no intercept
lm.fit_y <- lm(x ~ y + 0)  # regress x onto y, no intercept
summary(lm.fit_X)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.7715  -6.9074   0.5694   6.0541  18.8264 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  9.98218    0.01628     613   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.472 on 99 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
## F-statistic: 3.758e+05 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm.fit_y)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.87243 -0.58770 -0.04299  0.70147  1.98753 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y 0.1001521  0.0001634     613   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9487 on 99 degrees of freedom
## Multiple R-squared:  0.9997, Adjusted R-squared:  0.9997 
## F-statistic: 3.758e+05 on 1 and 99 DF,  p-value: < 2.2e-16

### (c)

# y is a permutation of x, so sum(y^2) equals sum(x^2) and the two
# no-intercept regressions produce the same coefficient estimate
y <- 100:1
lm.fit_X <- lm(y ~ x + 0)
lm.fit_y <- lm(x ~ y + 0)
summary(lm.fit_X)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
summary(lm.fit_y)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08