Question 2

Q: Carefully explain the differences between the KNN classifier and KNN regression methods.

A: KNN regression and the KNN classifier share the same neighborhood idea but differ in the type of response. The KNN classifier is used for classification problems, where Y is qualitative (categorical), while KNN regression is used for regression problems, where Y is quantitative (numerical/continuous).

KNN classifier: for a given X, we find the k training observations closest to X and examine their corresponding Y values. We predict the class that occurs most often among those k neighbors (a majority vote). The smaller k is, the more flexible the method will be.

KNN regression predicts Y for a given value of X by taking the k training points closest to X and averaging their responses. As with the classifier, a small k gives a very flexible fit; with small k, KNN is much more flexible than linear regression.
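
The contrast between the two methods can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the helper names (nearest_k, knn_classify, knn_regress), the Euclidean distance, and the toy data are made up here and are not part of the exercise.

```r
# Hypothetical helper: indices of the k training points closest to x0
# (Euclidean distance; train_x is a numeric matrix, one row per point).
nearest_k <- function(train_x, x0, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]
}

# KNN classifier: majority vote among the neighbors' labels
# (ties go to the first label in table order).
knn_classify <- function(train_x, train_y, x0, k) {
  votes <- table(train_y[nearest_k(train_x, x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average of the neighbors' responses.
knn_regress <- function(train_x, train_y, x0, k) {
  mean(train_y[nearest_k(train_x, x0, k)])
}

# Toy training data, made up for illustration.
train_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
labels  <- c("a", "a", "a", "b", "b", "b")
values  <- c(1, 2, 3, 10, 11, 12)

knn_classify(train_x, labels, x0 = 2.5, k = 3)  # majority label of the 3 nearest
knn_regress(train_x, values, x0 = 2.5, k = 3)   # mean response of the 3 nearest
```

Both functions use the same neighbor search; only the last step (vote versus average) differs.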

Question 9

Q: This question involves the use of multiple linear regression on the Auto data set. (a) Produce a scatterplot matrix which includes all of the variables in the data set.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
pairs(Auto)

  (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"
cor(Auto[c(1:8)])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
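
Instead of selecting columns by position, the qualitative columns can be dropped by type. The sketch below uses a hypothetical helper, num_only, and R's built-in iris data as a stand-in (an assumption made so the example runs without the ISLR package); for the Auto data the call would be cor(num_only(Auto)).

```r
# Hypothetical helper: keep only the numeric columns of a data frame.
num_only <- function(df) df[sapply(df, is.numeric)]

# Stand-in data: iris has four numeric columns and one factor (Species).
cor_mat <- cor(num_only(iris))
round(cor_mat, 2)
```
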
  (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
  i. Is there a relationship between the predictors and the response?

The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and the response. However, cylinders, horsepower, and acceleration do not have individually significant coefficients in this model.

  ii. Which predictors appear to have a statistically significant relationship to the response?

Displacement, weight, year, and origin have a statistically significant relationship to the response (p-values below 0.05).

  iii. What does the coefficient for the year variable suggest?

The coefficient is 0.750773, which means that mpg increases by about 0.75 for each one-year increase in model year, all else constant.

lmauto <- lm(mpg~. -name, data = Auto)
summary(lmauto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
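
The significant predictors can also be read off programmatically rather than by eye. The following is a minimal sketch: significant_terms is a hypothetical helper, and the fit uses R's built-in mtcars data as a stand-in (an assumption, so the block runs without ISLR). With the model above, the call would be significant_terms(lmauto).

```r
# Hypothetical helper: names of the terms whose t-test p-value is below alpha,
# leaving out the intercept.
significant_terms <- function(fit, alpha = 0.05) {
  p <- coef(summary(fit))[, "Pr(>|t|)"]
  names(p)[p < alpha & names(p) != "(Intercept)"]
}

# Stand-in fit on built-in data (not the Auto data).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
significant_terms(fit)
```
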
  (d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

The curved pattern in the residuals vs. fitted plot suggests a non-linear relationship between the predictors and the response. From the Q-Q plot, the residuals appear approximately normally distributed. Lastly, the residuals vs. leverage plot (which also shows Cook's distance contours) shows that observation 14 has high leverage.

par(mfrow = c(2,2))
plot(lmauto)
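
The visual checks can be complemented with numbers. This sketch flags observations by leverage and by studentized residual. The helper flag_points, the cutoff of three times the average leverage, and the mtcars stand-in fit are all assumptions made for illustration; the |studentized residual| > 3 cutoff is a common rule of thumb. With the model above, the call would be flag_points(lmauto).

```r
# Hypothetical helper: flag high-leverage points and large studentized residuals.
flag_points <- function(fit) {
  h <- hatvalues(fit)        # leverage of each observation
  r <- rstudent(fit)         # studentized residuals
  p <- length(coef(fit))     # number of coefficients, intercept included
  n <- length(h)
  list(
    high_leverage = names(h)[h > 3 * p / n],  # average leverage is p / n
    outliers      = names(r)[abs(r) > 3]
  )
}

# Stand-in fit on built-in data (not the Auto data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
flags <- flag_points(fit)
flags
```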

  (e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Taking the most highly correlated pairs for interactions, we first try cylinders with displacement and displacement with weight (lmauto2). The interaction displacement:weight is statistically significant, while cylinders:displacement is not. A second model (lmauto3) with more interactions among correlated variables was then run; there, displacement:weight and displacement:horsepower are statistically significant.

lmauto2 = lm(mpg~. -name + cylinders*displacement + displacement*weight, data = Auto)
summary(lmauto2)
## 
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement + displacement * 
##     weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0609  -1.7589  -0.0494   1.5790  12.1496 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -4.795e+00  4.515e+00  -1.062  0.28883    
## cylinders              -1.091e-01  5.965e-01  -0.183  0.85502    
## displacement           -7.186e-02  1.363e-02  -5.273 2.25e-07 ***
## horsepower             -3.457e-02  1.304e-02  -2.651  0.00836 ** 
## weight                 -1.030e-02  1.064e-03  -9.680  < 2e-16 ***
## acceleration            6.618e-02  8.817e-02   0.751  0.45334    
## year                    7.840e-01  4.566e-02  17.171  < 2e-16 ***
## origin                  5.475e-01  2.643e-01   2.071  0.03901 *  
## cylinders:displacement  1.186e-03  2.715e-03   0.437  0.66251    
## displacement:weight     2.141e-05  3.712e-06   5.768 1.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.967 on 382 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8555 
## F-statistic: 258.2 on 9 and 382 DF,  p-value: < 2.2e-16
lmauto3<- lm(mpg~. -name + cylinders*displacement + displacement*weight + horsepower*displacement + acceleration:horsepower + origin*displacement, data = Auto)
summary(lmauto3)
## 
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement + displacement * 
##     weight + horsepower * displacement + acceleration:horsepower + 
##     origin * displacement, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3250 -1.5778 -0.0658  1.4758 12.4039 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -6.047e+00  6.922e+00  -0.874  0.38291    
## cylinders                7.586e-01  6.331e-01   1.198  0.23159    
## displacement            -9.031e-02  1.916e-02  -4.712 3.44e-06 ***
## horsepower              -7.047e-02  5.853e-02  -1.204  0.22941    
## weight                  -7.088e-03  1.452e-03  -4.882 1.55e-06 ***
## acceleration             2.107e-01  2.316e-01   0.910  0.36354    
## year                     7.593e-01  4.544e-02  16.710  < 2e-16 ***
## origin                  -5.884e-01  9.558e-01  -0.616  0.53856    
## cylinders:displacement  -9.817e-04  2.827e-03  -0.347  0.72862    
## displacement:weight      1.427e-05  4.753e-06   3.002  0.00286 ** 
## displacement:horsepower  2.524e-04  1.074e-04   2.350  0.01930 *  
## horsepower:acceleration -3.540e-03  2.342e-03  -1.512  0.13148    
## displacement:origin      9.991e-03  8.248e-03   1.211  0.22652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.886 on 379 degrees of freedom
## Multiple R-squared:  0.8675, Adjusted R-squared:  0.8633 
## F-statistic: 206.7 on 12 and 379 DF,  p-value: < 2.2e-16
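
The difference between the two formula operators can be checked without fitting anything, by asking R which terms a formula expands to. The variable names y, a, and b below are placeholders, not columns of Auto.

```r
# a * b is shorthand for a + b + a:b (both main effects plus the interaction).
attr(terms(y ~ a * b), "term.labels")

# a:b adds only the interaction term.
attr(terms(y ~ a:b), "term.labels")

# Main effects already present in the formula are not duplicated, which is
# why adding cylinders*displacement to a formula that already contains both
# variables only contributes the interaction.
attr(terms(y ~ a + b + a * b), "term.labels")
```
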
  (f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

According to model lmauto4, log(weight), sqrt(horsepower), and I(acceleration^2) are statistically significant, while I(cylinders^2) and I(displacement^2) are not. The adjusted R-squared also improves over the original model, from about 82% to 86%.

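Next, the diagnostic plots from the previous parts show a non-linear pattern that resembles a log relationship, so we use log(mpg) as the response. All predictors except acceleration are now significant. The R-squared is about 88%, slightly higher than for lmauto4 (about 87%), although R-squared values computed on log(mpg) and on mpg are not directly comparable because the response scales differ.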

lmauto4 = lm(mpg ~ . - name + log(weight) + sqrt(horsepower) + I(cylinders^2) + I(acceleration^2) + I(displacement^2), data = Auto)
summary(lmauto4)
## 
## Call:
## lm(formula = mpg ~ . - name + log(weight) + sqrt(horsepower) + 
##     I(cylinders^2) + I(acceleration^2) + I(displacement^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3593 -1.5249 -0.0286  1.4450 12.3350 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.870e+02  4.855e+01   3.852 0.000138 ***
## cylinders          8.106e-01  1.445e+00   0.561 0.575233    
## displacement      -2.825e-02  2.177e-02  -1.298 0.195182    
## horsepower         1.431e-01  7.112e-02   2.012 0.044978 *  
## weight             4.140e-03  2.286e-03   1.811 0.070964 .  
## acceleration      -1.881e+00  5.784e-01  -3.252 0.001250 ** 
## year               7.784e-01  4.503e-02  17.288  < 2e-16 ***
## origin             5.487e-01  2.653e-01   2.068 0.039279 *  
## log(weight)       -2.385e+01  7.306e+00  -3.265 0.001195 ** 
## sqrt(horsepower)  -4.323e+00  1.584e+00  -2.730 0.006635 ** 
## I(cylinders^2)    -6.156e-02  1.166e-01  -0.528 0.597707    
## I(acceleration^2)  5.074e-02  1.724e-02   2.943 0.003445 ** 
## I(displacement^2)  4.507e-05  3.815e-05   1.181 0.238231    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.875 on 379 degrees of freedom
## Multiple R-squared:  0.8685, Adjusted R-squared:  0.8643 
## F-statistic: 208.5 on 12 and 379 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lmauto4)

lmauto5 <- lm(log(mpg)~.-name, data=Auto)
summary(lmauto5)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40955 -0.06533  0.00079  0.06785  0.33925 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.751e+00  1.662e-01  10.533  < 2e-16 ***
## cylinders    -2.795e-02  1.157e-02  -2.415  0.01619 *  
## displacement  6.362e-04  2.690e-04   2.365  0.01852 *  
## horsepower   -1.475e-03  4.935e-04  -2.989  0.00298 ** 
## weight       -2.551e-04  2.334e-05 -10.931  < 2e-16 ***
## acceleration -1.348e-03  3.538e-03  -0.381  0.70339    
## year          2.958e-02  1.824e-03  16.211  < 2e-16 ***
## origin        4.071e-02  9.955e-03   4.089 5.28e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared:  0.8795, Adjusted R-squared:  0.8773 
## F-statistic: 400.4 on 7 and 384 DF,  p-value: < 2.2e-16
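
With log(mpg) as the response, a coefficient is easier to read as a percent change in mpg. The sketch below copies two estimates from the summary(lmauto5) output above; pct_change is a hypothetical helper, and the 100 lb step for weight is an arbitrary choice for illustration.

```r
# Estimates copied from the summary(lmauto5) output above.
b_year   <- 2.958e-02
b_weight <- -2.551e-04

# Hypothetical helper: percent change in mpg when a predictor changes by
# `delta`, for a model whose response is log(mpg).
pct_change <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

pct_change(b_year)                 # one model year later: roughly +3%
pct_change(b_weight, delta = 100)  # 100 lb heavier: roughly -2.5%
```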

Question 10

This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

data(Carseats)
lmcars <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lmcars)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  (b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

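Sales is recorded in thousands of units. When Price increases by $1 (other variables held constant), Sales decrease by about 54.5 units (0.054459 thousand). A store in an urban location sells about 21.9 fewer units than a store in a non-urban location, all else constant, although this difference is not statistically significant (p = 0.936). A US store sells about 1,200.6 more car seats than a store outside the US, all else constant.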

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

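Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes + error, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).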
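
To see how the dummy variables enter the equation, the fitted model can be evaluated by hand. This is a sketch only: predict_sales is a hypothetical helper, the coefficients are copied from the summary(lmcars) output above, and the price of 100 is an arbitrary example value.

```r
# Hypothetical helper: fitted Sales (in thousands of units) from the
# coefficients reported by summary(lmcars). The comparisons turn the
# "Yes"/"No" values into 1/0 dummy variables.
predict_sales <- function(price, urban, us) {
  13.043469 - 0.054459 * price -
    0.021916 * (urban == "Yes") +
    1.200573 * (us == "Yes")
}

predict_sales(100, urban = "Yes", us = "Yes")  # urban US store
predict_sales(100, urban = "No",  us = "No")   # non-urban store outside the US
```

The difference between the two calls is the sum of the UrbanYes and USYes coefficients, since Price is the same in both.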

  (d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

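The p-values for Price and USYes are below 0.05, so we reject the null hypothesis for these two predictors; they are statistically significant. The p-value for UrbanYes is 0.936, so we cannot reject the null hypothesis for Urban.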

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lmcars2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lmcars2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  (f) How well do the models in (a) and (e) fit the data?

Both models fit similarly: model (a) has an adjusted R-squared of 23.35% and a residual standard error of 2.472, while model (e) has an adjusted R-squared of 23.54% and a residual standard error of 2.469. The model in (e) is slightly better, but not by much; in both cases only about 24% of the variance in Sales is explained.

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lmcars2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
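
The intervals from confint() can be reproduced by hand as estimate ± t-quantile × standard error. The sketch below copies the rounded estimates, standard errors, and residual degrees of freedom from the summary(lmcars2) output above, so the results agree with confint() only up to rounding.

```r
# Values copied from the summary(lmcars2) output above.
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df  <- 397                      # residual degrees of freedom

t_crit <- qt(0.975, df)         # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci
```
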
  (h) Is there evidence of outliers or high leverage observations in the model from (e)?

In the residuals vs. fitted plot, points 51, 69, and 377 are flagged as potential outliers. In the residuals vs. leverage plot (with Cook's distance contours), some high-leverage observations stand out (26, 50, 368).

par(mfrow = c(2, 2))
plot(lmcars2)

Question 12

This problem involves simple linear regression without an intercept. (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

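The two coefficient estimates are the same when the sum of squares of the observed x values equals the sum of squares of the observed y values, ∑xi^2 = ∑yi^2. This is because the estimates are ∑xiyi / ∑xi^2 for Y onto X and ∑xiyi / ∑yi^2 for X onto Y, which share the same numerator.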
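
The no-intercept formula can be checked numerically against lm(). The simulated data below use the same seed and calls as the example in the next part, which is a choice made here for convenience rather than something the question requires.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form estimates for regression through the origin.
beta_y_on_x <- sum(x * y) / sum(x^2)  # Y onto X
beta_x_on_y <- sum(x * y) / sum(y^2)  # X onto Y (roles swapped)

# Each pair should agree: hand computation vs. lm() without an intercept.
c(by_hand = beta_y_on_x, lm = unname(coef(lm(y ~ x + 0))))
c(by_hand = beta_x_on_y, lm = unname(coef(lm(x ~ y + 0))))
```

The numerators are identical, so the two estimates differ only through sum(x^2) versus sum(y^2).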

  (b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lmfit = lm(y~x+0)
lmfit2 = lm(x~y+0)
summary(lmfit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lmfit2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
  (c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = rnorm(100)
y = 1*x
lmfit3 = lm(y~x+0)
lmfit4 = lm(x~y+0)
summary(lmfit3)
## Warning in summary.lm(lmfit3): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 1.000e+00  6.479e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lmfit4)
## Warning in summary.lm(lmfit4): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 1.000e+00  6.479e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
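
Setting y equal to x is the simplest case, but the condition ∑xi^2 = ∑yi^2 does not require a perfect fit. As an alternative sketch (not the approach used above), y can be a random permutation of x: the values are the same, so the sums of squares match, but the pairing is scrambled.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # same values in a different order, so sum(y^2) matches sum(x^2)

fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)

# The two slope estimates agree even though y is not equal to x.
c(y_on_x = unname(coef(fit_yx)), x_on_y = unname(coef(fit_xy)))
```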