Question 2

  1. Carefully explain the differences between the KNN classifier and KNN regression methods.

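The two methods differ in the type of response they predict and in how they combine the K nearest neighbors. The KNN classifier is used when the response variable is categorical: for a test point, it finds the K closest training observations and assigns the class that is most common among them (equivalently, the class with the highest estimated probability). KNN regression is used when the response variable is numerical: it predicts Y as the average response of the K closest training observations, so the prediction is a quantitative value that can be continuous.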
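The contrast can be sketched in a few lines of base R. This is only an illustration on made-up one-dimensional data; the helper names (nearest, knn_classify, knn_regress) and the training values are hypothetical and not part of the exercise.

```r
# Toy training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("A", "A", "A", "B", "B", "B")   # categorical response
train_value <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # numerical response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the k nearest neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbors
knn_regress <- function(x0, k) mean(train_value[nearest(x0, k)])

knn_classify(2.5, k = 3)  # neighbors are x = 2, 3, 1, all of class "A"
knn_regress(2.5, k = 3)   # mean of 1.5, 2.0 and 1.0
```

The only difference between the two functions is the final step: a vote over labels versus a mean over numbers.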

Question 9

This question involves the use of multiple linear regression on the Auto data set.

  1. Produce a scatterplot matrix which includes all of the variables in the data set.

  2. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

  3. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

     1. Is there a relationship between the predictors and the response?
     2. Which predictors appear to have a statistically significant relationship to the response?
     3. What does the coefficient for the year variable suggest?

  4. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

  5. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

  6. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
library(MASS)
plot(Auto)

# Copy the data and drop the qualitative name column, since cor() needs numeric input
Auto_correlation <- Auto
Auto_correlation$name <- NULL
cor(Auto_correlation)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
lin_reg <- lm(mpg ~ . - name, data = Auto)
summary(lin_reg)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

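There is a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the hypothesis that all coefficients are zero. Displacement, weight, year, and origin have statistically significant relationships with mpg, based on their p-values being under 0.05. The coefficient for year tells us that, holding the other predictors fixed, mpg increases by about 0.75 on average for each additional model year.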
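The same conclusions can be read off programmatically rather than from the printed table. The following is a minimal sketch, assuming the ISLR package is installed; the 0.05 cutoff and the names coef_table, p_values, and f are choices made for this illustration.

```r
library(ISLR)  # provides the Auto data set

lin_reg <- lm(mpg ~ . - name, data = Auto)

# Coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(lin_reg))

# Predictors (intercept dropped) whose p-value is below 0.05
p_values <- coef_table[-1, "Pr(>|t|)"]
names(p_values)[p_values < 0.05]

# Overall F-test: is at least one predictor related to mpg?
f <- summary(lin_reg)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
```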

par(mfrow=c(2,2))
plot(lin_reg)

The Residuals vs Fitted plot appears curved, which suggests non-linearity in the relationship. The Q-Q Residuals plot deviates from the reference line towards the end, indicating that normality of the residuals isn’t reasonable to assume. The Scale-Location plot checks for constant variance rather than normality; most values fall within 0.0 - 2.0. The Residuals vs Leverage plot has numerous points outside of the Cook’s distance line, meaning they have a strong influence on the fitted model.
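Reading outliers and leverage off the plots can be backed up with numbers. The sketch below assumes the ISLR package is installed; the cutoffs (absolute studentized residual above 3, leverage above three times the average) are common rules of thumb chosen here for illustration, not thresholds taken from the exercise.

```r
library(ISLR)

lin_reg <- lm(mpg ~ . - name, data = Auto)

n <- nrow(Auto)
p <- length(coef(lin_reg)) - 1   # number of predictors

# Studentized residuals: values beyond about +/-3 are possible outliers
stud_res <- rstudent(lin_reg)
which(abs(stud_res) > 3)

# Leverage: the average is (p + 1) / n, so flag points well above it
lev <- hatvalues(lin_reg)
which(lev > 3 * (p + 1) / n)     # the factor 3 is an assumed rule of thumb

# Cook's distance combines residual size and leverage
head(sort(cooks.distance(lin_reg), decreasing = TRUE), 5)
```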

lin_reg2<-lm(mpg~.:.,Auto_correlation)
summary(lin_reg2)
## 
## Call:
## lm(formula = mpg ~ .:., data = Auto_correlation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
lin_reg2<-lm(mpg~.*.,Auto_correlation)
summary(lin_reg2)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto_correlation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

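Both formulas fit the same model here, which is why the two summaries are identical: crossing every variable with itself in .:. also reproduces the main effects. Based on the results, the interactions displacement:year (p = 0.0135), acceleration:year (p = 0.0303), and acceleration:origin (p = 0.0037) are statistically significant at the 0.05 level.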
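The difference between the two operators is easier to see on named variables than on the dot shorthand. The sketch below uses a small made-up data frame (the column names x1, x2, x3 are placeholders) and inspects only the expanded model terms, without fitting anything to Auto.

```r
# Made-up data frame; values and column names are placeholders
d <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))

# With named variables, ':' gives only the interaction term,
# while '*' gives both main effects plus the interaction
attr(terms(y ~ x1:x2), "term.labels")
attr(terms(y ~ x1 * x2), "term.labels")

# With the dot shorthand, a variable crossed with itself reduces to a
# main effect, so both formulas expand to the same set of terms
terms_colon <- attr(terms(y ~ .:., data = d), "term.labels")
terms_star  <- attr(terms(y ~ . * ., data = d), "term.labels")
setequal(terms_colon, terms_star)
```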

lin_reg3<-lm(mpg~.-name + displacement*log(weight)+year:cylinders+log(acceleration), Auto)
summary(lin_reg3)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement * log(weight) + year:cylinders + 
##     log(acceleration), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0920 -1.5290  0.0143  1.2957 12.9721 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               80.859829  76.160285   1.062 0.289043    
## cylinders                 10.279768   2.259566   4.549 7.25e-06 ***
## displacement              -0.504066   0.196698  -2.563 0.010772 *  
## horsepower                -0.047762   0.012410  -3.849 0.000139 ***
## weight                    -0.004112   0.004851  -0.848 0.397142    
## acceleration               1.421988   0.448480   3.171 0.001644 ** 
## year                       1.453984   0.149493   9.726  < 2e-16 ***
## origin                     0.530919   0.253621   2.093 0.036980 *  
## log(weight)              -14.176185  10.284091  -1.378 0.168873    
## log(acceleration)        -22.593994   7.455537  -3.030 0.002609 ** 
## displacement:log(weight)   0.061791   0.024131   2.561 0.010833 *  
## cylinders:year            -0.132177   0.028685  -4.608 5.56e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.831 on 380 degrees of freedom
## Multiple R-squared:  0.8721, Adjusted R-squared:  0.8684 
## F-statistic: 235.6 on 11 and 380 DF,  p-value: < 2.2e-16

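With log(weight) and log(acceleration) added, acceleration becomes statistically significant (p = 0.0016, compared with 0.415 in the original model), and log(acceleration) is also significant (p = 0.0026). The log(weight) term itself is not significant (p = 0.169), although its interaction with displacement is (p = 0.011). Adjusted R-squared rises from 0.8182 in the original model to 0.8684.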
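One way to compare transformations more systematically is to fit each one to a single predictor and line up the adjusted R-squared values. This is a sketch only, assuming the ISLR package is installed; horsepower is picked arbitrarily as the predictor, and the list name fits is made up for the example.

```r
library(ISLR)

# Candidate transformations of horsepower (chosen for illustration)
fits <- list(
  linear = lm(mpg ~ horsepower, data = Auto),
  log    = lm(mpg ~ log(horsepower), data = Auto),
  sqrt   = lm(mpg ~ sqrt(horsepower), data = Auto),
  square = lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
)

# Adjusted R-squared for each fit, highest first
adj_r2 <- sapply(fits, function(m) summary(m)$adj.r.squared)
sort(adj_r2, decreasing = TRUE)
```

Note that the squared term has to be wrapped in I(); inside a formula a bare ^ is interpreted as formula crossing rather than arithmetic.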

Question 10

This question should be answered using the Carseats data set.

  1. Fit a multiple regression model to predict Sales using Price, Urban, and US.

  2. Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

  3. Write out the model in equation form, being careful to handle the qualitative variables properly.

  4. For which of the predictors can you reject the null hypothesis H0 : βj = 0?

  5. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

  6. How well do the models in parts 1 and 5 fit the data?

  7. Using the model from part 5, obtain 95% confidence intervals for the coefficient(s).

  8. Is there evidence of outliers or high leverage observations in the model from part 5?

lin_reg4 <-lm(Sales~Price+Urban+US,data=Carseats)
summary(lin_reg4)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

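Sales is recorded in thousands of units, so, holding the other predictors fixed: each one-unit increase in Price is associated with a decrease in Sales of about 0.054 (roughly 54 units). Stores in an urban area (UrbanYes) sell about 0.022 less (roughly 22 units) than non-urban stores on average, though this difference is not statistically significant (p = 0.936). Stores in the US (USYes) sell about 1.2 more (roughly 1,200 units) than stores outside the US on average.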

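Sales = 13.043469 - 0.054459 * Price - 0.021916 * Urban + 1.200573 * US, where Urban = 1 if the store is in an urban area and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.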
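The equation can be checked by evaluating it by hand and comparing against R's fitted values. A minimal sketch, assuming the ISLR package is installed; the names b, urban_yes, us_yes, and by_hand are introduced only for this check.

```r
library(ISLR)

lin_reg4 <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(lin_reg4)

# R codes each two-level factor as a 0/1 dummy, with "No" as the baseline
urban_yes <- as.numeric(Carseats$Urban == "Yes")
us_yes    <- as.numeric(Carseats$US == "Yes")

# Evaluate the fitted equation by hand for every store
by_hand <- b["(Intercept)"] + b["Price"] * Carseats$Price +
  b["UrbanYes"] * urban_yes + b["USYes"] * us_yes

# Compare with the fitted values computed by lm()
all.equal(unname(by_hand), unname(fitted(lin_reg4)))
```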

You can reject the null hypothesis for Price and USYes (both p-values are far below 0.05). You cannot reject it for UrbanYes (p = 0.936).

lin_reg5 <-lm(Sales~Price+US,data=Carseats)
summary(lin_reg5)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
anova(lin_reg4,lin_reg5)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    396 2420.8                           
## 2    397 2420.9 -1  -0.03979 0.0065 0.9357

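When UrbanYes is removed, the adjusted R-squared increases by a small margin (from 0.2335 to 0.2354) and the residual standard error drops slightly (from 2.472 to 2.469). The difference between the two models isn’t statistically significant (F = 0.0065, p = 0.9357). Both models fit the data only modestly, explaining about 24% of the variance in Sales (R-squared = 0.2393).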

confint(lin_reg5)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
par(mfrow=c(2,2))
plot(lin_reg5)

For the Residuals vs Fitted plot, most values are centered around zero, and there is no strong evidence of outliers. For the Q-Q Residuals plot, most values are on the line, indicating that the residuals are approximately normal. For the Scale-Location plot, the line is mostly flat and values are evenly distributed between 0.0 and 2.0, which is consistent with constant variance. For the Residuals vs Leverage plot, most points have low leverage and none fall beyond the Cook’s distance boundaries, so no observation appears to be unduly influential.

Question 12

This problem involves simple linear regression without an intercept.

  1. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

  2. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

  3. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

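The two coefficient estimates are the same when the sum of squares of X equals the sum of squares of Y. From (3.38), the estimate for Y onto X is Σ x_i y_i / Σ x_i^2 and the estimate for X onto Y is Σ x_i y_i / Σ y_i^2; the numerators are identical, so the estimates agree exactly when Σ x_i^2 = Σ y_i^2 (or when Σ x_i y_i = 0, in which case both are zero). X and Y having the same sample variance with both sample means equal to 0 is one case where this holds.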
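The no-intercept formula can be verified numerically against lm(). The sketch below uses simulated data; the seed and the slope of 2 are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# Each should match the corresponding lm() fit without an intercept
all.equal(beta_y_on_x, unname(coef(lm(y ~ x + 0))))
all.equal(beta_x_on_y, unname(coef(lm(x ~ y + 0))))

# The two estimates coincide only when these sums of squares are equal
c(sum_x_sq = sum(x^2), sum_y_sq = sum(y^2))
```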

x=rnorm(100)
y=rbinom(100,2,0.3)
eg<-lm(y~x+0)
summary(eg)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34097 -0.00659  0.48499  0.97617  2.05916 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)  
## x  0.13924    0.08049    1.73   0.0868 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8165 on 99 degrees of freedom
## Multiple R-squared:  0.02934,    Adjusted R-squared:  0.01954 
## F-statistic: 2.993 on 1 and 99 DF,  p-value: 0.08676
eg1<-lm(x~y+0)
summary(eg1)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.90368 -0.62272  0.07564  0.57722  2.44874 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)  
## y   0.2107     0.1218    1.73   0.0868 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.004 on 99 degrees of freedom
## Multiple R-squared:  0.02934,    Adjusted R-squared:  0.01954 
## F-statistic: 2.993 on 1 and 99 DF,  p-value: 0.08676

The coefficients are different for the two regressions: 0.139 for Y onto X and 0.211 for X onto Y.

x=1:100
y=100:1
eg3<-lm(y~x+0)
summary(eg3)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
eg4<-lm(x~y+0)
summary(eg4)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
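
Both regressions give the same coefficient estimate, 0.5075, because x and y contain the same values (1 to 100) in opposite order, so their sums of squares are equal. The same idea works for any data: a sketch, with an arbitrary seed and with y built as a shuffled copy of x purely for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- rnorm(100)
y <- sample(x)  # same values in a random order, so the sums of squares match

coef_y_on_x <- unname(coef(lm(y ~ x + 0)))
coef_x_on_y <- unname(coef(lm(x ~ y + 0)))

all.equal(sum(x^2), sum(y^2))        # equal sums of squares
all.equal(coef_y_on_x, coef_x_on_y)  # hence equal coefficient estimates
```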
