Question 2.

Carefully explain the differences between the KNN classifier and KNN regression methods:

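The KNN classifier predicts a qualitative response (a class or category). For a query point it finds the K nearest training observations, estimates the probability of each class as the fraction of those neighbors that belong to it, and assigns the most common class. KNN regression predicts a quantitative (continuous) response. It finds the K nearest training observations in the same way, but returns the average of their response values.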
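The difference comes down to how the neighbors' responses are combined. Below is a minimal base R sketch for a single numeric predictor; `knn_predict` and the toy data are hypothetical names made up for illustration, not functions or data from a package.

```r
# Minimal KNN sketch for one numeric predictor.
# knn_predict is a hypothetical helper written for this illustration.
knn_predict <- function(x_train, y_train, x0, k, type = c("class", "reg")) {
  type <- match.arg(type)
  # Indices of the k training points closest to the query point x0
  nearest <- order(abs(x_train - x0))[seq_len(k)]
  neighbors <- y_train[nearest]
  if (type == "class") {
    # Classification: the most frequent class among the neighbors
    names(which.max(table(neighbors)))
  } else {
    # Regression: the mean response of the neighbors
    mean(neighbors)
  }
}

x_train <- c(1, 2, 3, 10, 11, 12)
labels  <- c("low", "low", "low", "high", "high", "high")
values  <- c(1, 2, 3, 10, 11, 12)

knn_class <- knn_predict(x_train, labels, x0 = 2.5, k = 3, type = "class")
knn_reg   <- knn_predict(x_train, values, x0 = 2.5, k = 3, type = "reg")
```

The neighbor search is identical in both calls; only the last step (vote versus average) changes.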

Question 9.

a. scatterplot matrix

library(ISLR2)
Auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

b. matrix of correlation

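pairs(Auto[, -9])                      # scatterplot matrix asked for in part (a); column 9 (name) is dropped
cor(Auto[, -9], use = "complete.obs")  # correlation matrix; rows with missing values are skipped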

c. multiple linear regression

lm.fit <- lm(mpg~. -name, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

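The overall F-test gives evidence of a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16. Displacement, weight, year, and origin each have a statistically significant relationship with mpg at the 5% level, while cylinders, horsepower, and acceleration do not. The year coefficient suggests that each additional model year is associated with an increase of about 0.75 mpg with the other predictors held fixed, so newer cars tend to be more fuel efficient.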

d. Diagnostic plots

par(mfrow = c(2,2))
plot(lm.fit)

Comment on any problems you see with the fit.

There is a slight curved pattern in the Residuals vs Fitted plot; ideally no pattern would be discernible. This indicates some non-linearity in the Auto data that the linear model does not capture.

Do the residual plots suggest any unusually large outliers?

A few observations have larger residuals than the rest, but none stands out as unusually large.

Does the leverage plot identify any observations with unusually high leverage?

Yes, the point labeled “14” has unusually high leverage: it lies well outside the range of leverage values of the other observations.
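The visual checks above can be backed up numerically. The sketch below uses simulated data rather than the Auto fit, and the cutoffs (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb assumed here, not thresholds taken from this assignment.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 8                         # move one predictor value far from the rest
fit <- lm(y ~ x)

lev <- hatvalues(fit)             # leverage statistic for each observation
avg_lev <- length(coef(fit)) / n  # average leverage, (p + 1) / n
high_leverage <- which(lev > 3 * avg_lev)      # assumed rule-of-thumb cutoff

stud <- rstudent(fit)             # studentized residuals
possible_outliers <- which(abs(stud) > 3)      # assumed rule-of-thumb cutoff
```

With the model from part c, the same calls would be `hatvalues(lm.fit)` and `rstudent(lm.fit)`.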

e. Use the * and : symbols to fit linear regression models.

summary(lm(mpg ~ weight * displacement, data = Auto))
## 
## Call:
## lm(formula = mpg ~ weight * displacement, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9209  -2.4627  -0.3147   1.8435  17.9674 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.396e+01  1.930e+00  27.954  < 2e-16 ***
## weight              -9.002e-03  8.446e-04 -10.658  < 2e-16 ***
## displacement        -7.936e-02  1.131e-02  -7.017 1.00e-11 ***
## weight:displacement  1.773e-05  2.784e-06   6.368 5.35e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.11 on 393 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7242 
## F-statistic: 347.5 on 3 and 393 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ weight * year, data = Auto))
## 
## Call:
## lm(formula = mpg ~ weight * year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0341 -1.9851 -0.0912  1.6987 12.9292 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.124e+02  1.280e+01  -8.781  < 2e-16 ***
## weight       2.821e-02  4.376e-03   6.447 3.34e-10 ***
## year         2.068e+00  1.699e-01  12.171  < 2e-16 ***
## weight:year -4.672e-04  5.857e-05  -7.977 1.66e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.187 on 393 degrees of freedom
## Multiple R-squared:  0.8354, Adjusted R-squared:  0.8341 
## F-statistic: 664.9 on 3 and 393 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ year * displacement, data = Auto))
## 
## Call:
## lm(formula = mpg ~ year * displacement, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9067  -2.4318  -0.2423   2.0392  17.0413 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -7.341e+01  8.298e+00  -8.846  < 2e-16 ***
## year               1.415e+00  1.092e-01  12.958  < 2e-16 ***
## displacement       2.559e-01  4.048e-02   6.321 7.06e-10 ***
## year:displacement -4.130e-03  5.438e-04  -7.594 2.28e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.737 on 393 degrees of freedom
## Multiple R-squared:  0.7738, Adjusted R-squared:  0.772 
## F-statistic:   448 on 3 and 393 DF,  p-value: < 2.2e-16

All three interaction terms (weight:displacement, weight:year, and year:displacement) are statistically significant, each with a p-value below 1e-9. In conclusion, the effect of one predictor on mpg depends on the value of the other, so a variable that looks less significant on its own can still affect mpg in combination with another.
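For reference, `a * b` in an R formula is shorthand for `a + b + a:b`, while `a:b` alone adds only the interaction term. The sketch below shows this on simulated data; the data frame `d` and its columns are illustrative names, not part of the Auto data.

```r
set.seed(1)
n <- 100
d <- data.frame(a = rnorm(n), b = rnorm(n))
d$y <- 1 + 2 * d$a - d$b + 0.5 * d$a * d$b + rnorm(n)

fit_star  <- lm(y ~ a * b, data = d)        # main effects plus interaction
fit_colon <- lm(y ~ a + b + a:b, data = d)  # the same model written out
fit_only  <- lm(y ~ a:b, data = d)          # interaction term without main effects

names(coef(fit_star))
names(coef(fit_only))
```

The first two fits are identical; the third drops both main effects, which is rarely what is wanted.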

f. Different transformation of variables

fit.log <- lm(mpg ~ log(weight) + displacement + year + origin, data = Auto)
summary(fit.log)
## 
## Call:
## lm(formula = mpg ~ log(weight) + displacement + year + origin, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7232 -1.8862  0.0032  1.5833 12.9262 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  132.626241  10.979843  12.079  < 2e-16 ***
## log(weight)  -21.822624   1.440808 -15.146  < 2e-16 ***
## displacement   0.008248   0.004029   2.047 0.041287 *  
## year           0.810625   0.046100  17.584  < 2e-16 ***
## origin         0.842756   0.248393   3.393 0.000762 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.109 on 392 degrees of freedom
## Multiple R-squared:  0.8438, Adjusted R-squared:  0.8422 
## F-statistic: 529.5 on 4 and 392 DF,  p-value: < 2.2e-16
fit.sqrt <- lm(mpg ~ sqrt(weight) + displacement + year, data = Auto)
summary(fit.sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt(weight) + displacement + year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9005 -2.0727 -0.1043  1.9165 13.9777 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.573974   4.221812   1.557    0.120    
## sqrt(weight) -0.814270   0.058364 -13.952   <2e-16 ***
## displacement  0.005418   0.004398   1.232    0.219    
## year          0.787320   0.048391  16.270   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.272 on 393 degrees of freedom
## Multiple R-squared:  0.8265, Adjusted R-squared:  0.8252 
## F-statistic:   624 on 3 and 393 DF,  p-value: < 2.2e-16
lm.fit2 <-  lm(mpg ~ weight + I(weight^2) + year + origin, data = Auto)
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9160 -1.5656 -0.1268  1.6294 12.7188 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.901e-01  4.007e+00  -0.122   0.9027    
## weight      -2.063e-02  1.521e-03 -13.564   <2e-16 ***
## I(weight^2)  2.244e-06  2.311e-07   9.712   <2e-16 ***
## year         8.299e-01  4.370e-02  18.989   <2e-16 ***
## origin       5.018e-01  2.413e-01   2.079   0.0382 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.013 on 392 degrees of freedom
## Multiple R-squared:  0.8533, Adjusted R-squared:  0.8518 
## F-statistic: 570.1 on 4 and 392 DF,  p-value: < 2.2e-16

Question 10.

a. fit multiple regression models to predict Sales using Price, Urban, and US

attach(Carseats)
library(tidyverse)
fit.lm <- lm(Sales ~ Price+Urban+US, data = Carseats)
summary(fit.lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b. Interpret each coefficient

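For every $1 increase in price, sales decrease by about 0.054 thousand units (roughly 54 units), holding the other predictors fixed. Stores located in urban areas sell about 0.022 thousand fewer units than stores in non-urban areas, but this difference is not statistically significant (p = 0.936). Stores located in the US sell about 1.2 thousand units more than stores located outside of the US.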

c. model equation form

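\[\text{Sales} = 13.043 - 0.054 X_1 - 0.022 X_2 + 1.201 X_3\]

where \(X_1\) is Price, \(X_2 = 1\) if Urban is Yes and 0 otherwise, and \(X_3 = 1\) if US is Yes and 0 otherwise.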
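As a worked example of reading the equation, the sketch below plugs in a hypothetical store with Price = 100 that is both urban and in the US. The coefficients are copied from the summary output in part a; the store itself is invented for illustration.

```r
# Coefficients copied from summary(fit.lm) in part a
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical store: Price = 100, Urban = Yes (1), US = Yes (1)
x <- c(1, 100, 1, 1)

predicted_sales <- sum(b * x)  # in thousands of units
predicted_sales
```

The result is about 8.78, that is, roughly 8,776 units.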

d. For which predictors can you reject the null Hypothesis?

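Reject the null hypothesis for Price and US, whose p-values are both far below 0.05. The null hypothesis cannot be rejected for Urban (p = 0.936).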

e. Smaller model

fit.lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit.lm2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f. How well do the two models fit the data

summary(fit.lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
summary(fit.lm2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

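The two models fit the data about equally well, and neither fits it strongly: each explains only about 24% of the variance in Sales (R^2 = 0.2393). Removing Urban leaves R^2 unchanged, lowers the residual standard error slightly from 2.472 to 2.469, and raises the adjusted R^2 from 0.2335 to 0.2354. Urban is therefore not a significant variable, since almost nothing changed when it was removed.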
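One way to make this comparison formal is a partial F-test on the two nested models. The sketch below uses simulated data; `d`, `x1`, `x2` and the fitted objects are illustrative names, not objects from this document.

```r
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 3 - 2 * d$x1 + rnorm(n)   # x2 plays no role in generating y

full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)

cmp <- anova(reduced, full)      # partial F-test for the dropped term
f_p_value <- cmp[["Pr(>F)"]][2]
t_p_value <- summary(full)$coefficients["x2", "Pr(>|t|)"]
```

When a single term is dropped, the F-test p-value matches the t-test p-value for that term in the larger model, so the analogous call here, `anova(fit.lm2, fit.lm)`, should reproduce the 0.936 reported for UrbanYes.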

g. 95% CI for the coefficients of model e (smaller model)

confint(fit.lm2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
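For a linear model, `confint()` reports the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds the interval by hand on simulated data (the variables are illustrative, not the Carseats fit) so the two can be compared.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 5 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit))  # two-sided 95% critical value

manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
manual_ci
confint(fit)
```

Applied to Price in the smaller model, -0.05448 plus or minus about 1.97 times 0.00523 gives roughly (-0.0648, -0.0442), in line with the interval printed above.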

h. Is there evidence of outliers or high leverage obs in model e?

par(mfrow=c(2,2))
plot(fit.lm2)

There is no evidence of either significant outliers or high leverage observations.

Question 12.

a. Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The two coefficient estimates are equal when the sum of the squared X-values equals the sum of the squared Y-values, or when both estimates are equal to 0 because the sum of the products of X and Y is 0.
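This follows from writing out both no-intercept estimates from (3.38). The subscripted names \(\hat{\beta}_{Y \sim X}\) and \(\hat{\beta}_{X \sim Y}\) are labels introduced here for the two regressions, and the derivation assumes neither sum of squares is zero.

\[\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}, \qquad \hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} Y_i^2}\]

\[\hat{\beta}_{Y \sim X} - \hat{\beta}_{X \sim Y} = \frac{\left(\sum_{i=1}^{n} X_i Y_i\right)\left(\sum_{i=1}^{n} Y_i^2 - \sum_{i=1}^{n} X_i^2\right)}{\left(\sum_{i=1}^{n} X_i^2\right)\left(\sum_{i=1}^{n} Y_i^2\right)}\]

The difference is zero only when one of the two factors in the numerator is zero, which gives the two cases stated above.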

\[\sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} Y_i^2\]

or

\[\sum_{i=1}^{n} X_i Y_i = 0\]

b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)
y <- 2*x + rnorm(n, mean = 0, sd = 1)

fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)

coef(fit1)
##        x 
## 1.993876
coef(fit2)
##         y 
## 0.3911145
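As a check, both coefficients can be computed directly from the no-intercept formula in part a. The sketch below regenerates the same data with the same seed; `beta_y_on_x` and `beta_x_on_y` are names introduced here for illustration.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# No-intercept estimates: sum of cross-products over a sum of squares
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

c(beta_y_on_x, beta_x_on_y)
c(sum(x^2), sum(y^2))   # the sums of squares differ, so the estimates differ
```

The two values should agree with `coef(fit1)` and `coef(fit2)` above.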

c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)
n <- 100
x <- rnorm(n)
y <- x

fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)

coef(fit1)
## x 
## 1
coef(fit2)
## y 
## 1
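Setting y equal to x is the simplest case, but the condition in part a only requires equal sums of squares. A less trivial sketch, assuming a random permutation of x is an acceptable choice for y, is shown below; the object names are illustrative.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- sample(x)   # same values in shuffled order, so sum(y^2) equals sum(x^2)

coef_y_on_x <- unname(coef(lm(y ~ x + 0)))
coef_x_on_y <- unname(coef(lm(x ~ y + 0)))

c(coef_y_on_x, coef_x_on_y)
```

Here y is no longer identical to x, yet the two coefficient estimates still match.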