Question 2

The KNN classifier and KNN regression methods are closely related. Both start from a test point, find its K nearest training observations, and base the prediction on those neighbors. The classifier predicts a categorical response: it assigns the class that occurs most often among the K neighbors. KNN regression predicts a quantitative response: instead of taking the most common value, it averages the responses of the K neighbors.
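
The contrast between the two methods can be sketched in a few lines of base R. This is a minimal illustration on made-up one-dimensional data; the function names (knn_neighbors, knn_classify, knn_regress) and the toy vectors are hypothetical and not part of the assignment.

```r
# Minimal 1-D KNN sketch on toy data (base R only; all names are illustrative).
knn_neighbors <- function(train_x, x0, k) {
  # Indices of the k training points closest to x0
  order(abs(train_x - x0))[seq_len(k)]
}

knn_classify <- function(train_x, train_y, x0, k) {
  idx   <- knn_neighbors(train_x, x0, k)
  votes <- table(train_y[idx])
  names(votes)[which.max(votes)]   # most frequent class (ties go to the first level)
}

knn_regress <- function(train_x, train_y, x0, k) {
  idx <- knn_neighbors(train_x, x0, k)
  mean(train_y[idx])               # average response of the neighbors
}

x     <- c(1, 2, 3, 6, 7, 8)
label <- c("a", "a", "b", "b", "b", "b")
value <- c(10, 12, 14, 20, 22, 24)

pred_class <- knn_classify(x, label, x0 = 2.4, k = 3)  # neighbors: x = 2, 3, 1
pred_value <- knn_regress(x, value, x0 = 2.4, k = 3)   # mean of 12, 14, 10
```

Both functions share the same neighbor search; only the final step (majority vote versus mean) differs.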

Question 9

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

Auto <- read.table("Auto.data", header = T, na.strings = "?", stringsAsFactors = T)
pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(select(Auto, -name), use = "complete.obs")
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

model1 = lm(mpg~. -name, data = Auto)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

There is a relationship between the predictors and the response, as shown by the overall F-statistic of 252.4 with a p-value of < 2.2e-16.

ii. Which predictors appear to have a statistically significant relationship to the response?

displacement, weight, year, and origin all appear to have a statistically significant relationship to the response at the 0.05 level. cylinders, horsepower, and acceleration do not (p-values of 0.128, 0.220, and 0.415).
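
The same reading of the coefficient table can be done programmatically. The sketch below uses simulated data rather than the Auto data; x1, x2, y, and toy_fit are hypothetical names, and the 0.05 cutoff is the conventional choice assumed here.

```r
# Toy data: y depends on x1 only (simulated, not the Auto data)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

toy_fit  <- lm(y ~ x1 + x2)
coef_tab <- coef(summary(toy_fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
p_vals   <- coef_tab[, "Pr(>|t|)"]

# Names of non-intercept terms whose p-value is below 0.05
sig_terms <- setdiff(names(p_vals)[p_vals < 0.05], "(Intercept)")
print(sig_terms)
```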

iii. What does the coefficient for the year variable suggest?

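The coefficient 0.750773 suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 in mpg. In other words, newer cars tend to be more fuel efficient.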

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
plot(model1)

The residual plot does not show any outstanding outliers; however, the leverage plot does show one observation that stands out from the rest.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

model2 = lm(mpg ~ horsepower * cylinders * weight, data = Auto)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ horsepower * cylinders * weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0975  -2.1526  -0.3461   1.8172  15.3008 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.950e+01  1.776e+01   3.912 0.000108 ***
## horsepower                  -2.828e-01  1.939e-01  -1.458 0.145690    
## cylinders                   -3.750e+00  3.179e+00  -1.180 0.238913    
## weight                      -6.278e-03  6.722e-03  -0.934 0.350920    
## horsepower:cylinders         2.551e-02  2.914e-02   0.876 0.381844    
## horsepower:weight            7.192e-06  7.371e-05   0.098 0.922324    
## cylinders:weight             1.410e-04  1.023e-03   0.138 0.890474    
## horsepower:cylinders:weight  8.170e-07  9.999e-06   0.082 0.934922    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.901 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7502 
## F-statistic: 168.7 on 7 and 384 DF,  p-value: < 2.2e-16
model3 = lm(mpg ~ horsepower * cylinders : weight, data = Auto)
summary(model3)
## 
## Call:
## lm(formula = mpg ~ horsepower * cylinders:weight, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.4898  -2.3941  -0.4798   2.0983  14.8335 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.199e+01  1.634e+00   31.82   <2e-16 ***
## horsepower                  -2.258e-01  2.012e-02  -11.22   <2e-16 ***
## cylinders:weight            -1.069e-03  7.784e-05  -13.74   <2e-16 ***
## horsepower:cylinders:weight  6.341e-06  6.335e-07   10.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.009 on 388 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7382, Adjusted R-squared:  0.7361 
## F-statistic: 364.6 on 3 and 388 DF,  p-value: < 2.2e-16

Comparing the two models above shows that whether an interaction appears significant depends on which other terms are in the model. The first model was lm(mpg ~ horsepower * cylinders * weight, data = Auto) and the second was lm(mpg ~ horsepower * cylinders : weight, data = Auto). They differ only in the symbol between cylinders and weight. In the first model, all three predictors enter individually along with all of their two-way and three-way interactions, and no term other than the intercept is significant. In the second model, horsepower is the only predictor included on its own, and only two interaction terms are included (cylinders:weight and horsepower:cylinders:weight). In this simpler model every coefficient is significant, even though none of the corresponding terms was significant in the first model. The much larger standard errors in the first model are consistent with its many overlapping terms being highly correlated with one another. The second model does fit slightly worse overall (adjusted R-squared 0.7361 versus 0.7502).
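
To see exactly which terms each formula expands to, the term labels can be inspected without fitting anything. This is a small sketch using terms() from base R; star_terms and colon_terms are illustrative names.

```r
# Expand each formula into its individual model terms (no data needed)
star_terms  <- attr(terms(mpg ~ horsepower * cylinders * weight), "term.labels")
colon_terms <- attr(terms(mpg ~ horsepower * cylinders:weight), "term.labels")

print(star_terms)    # three main effects, three two-way and one three-way interaction
print(colon_terms)   # horsepower, cylinders:weight, horsepower:cylinders:weight
```

Because : binds more tightly than *, the second formula is read as horsepower * (cylinders:weight), which is why cylinders and weight never appear on their own.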

(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

model4 = lm(mpg~ log(horsepower) * log(weight) * log(displacement), data = Auto)
summary(model4)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) * log(weight) * log(displacement), 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3760  -2.0492  -0.3067   1.8102  16.2392 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -2513.045   1007.463  -2.494
## log(horsepower)                                 554.944    225.352   2.463
## log(weight)                                     359.082    127.785   2.810
## log(displacement)                               445.825    195.671   2.278
## log(horsepower):log(weight)                     -78.126     28.579  -2.734
## log(horsepower):log(displacement)               -93.910     42.413  -2.214
## log(weight):log(displacement)                   -62.672     24.329  -2.576
## log(horsepower):log(weight):log(displacement)    13.168      5.278   2.495
##                                               Pr(>|t|)   
## (Intercept)                                    0.01304 * 
## log(horsepower)                                0.01423 * 
## log(weight)                                    0.00521 **
## log(displacement)                              0.02325 * 
## log(horsepower):log(weight)                    0.00655 **
## log(horsepower):log(displacement)              0.02740 * 
## log(weight):log(displacement)                  0.01037 * 
## log(horsepower):log(weight):log(displacement)  0.01302 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.85 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7566 
## F-statistic: 174.7 on 7 and 384 DF,  p-value: < 2.2e-16
model5 = lm(mpg~ sqrt(horsepower) * sqrt(weight) * sqrt(displacement), data = Auto)
summary(model5)
## 
## Call:
## lm(formula = mpg ~ sqrt(horsepower) * sqrt(weight) * sqrt(displacement), 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3263  -2.1503  -0.3557   1.7908  16.5264 
## 
## Coefficients:
##                                                   Estimate Std. Error t value
## (Intercept)                                      55.665055  40.422959   1.377
## sqrt(horsepower)                                 -0.537630   4.496039  -0.120
## sqrt(weight)                                      0.574050   0.800020   0.718
## sqrt(displacement)                               -1.113110   3.317851  -0.335
## sqrt(horsepower):sqrt(weight)                    -0.095741   0.086768  -1.103
## sqrt(horsepower):sqrt(displacement)               0.036890   0.300129   0.123
## sqrt(weight):sqrt(displacement)                  -0.040245   0.055324  -0.727
## sqrt(horsepower):sqrt(weight):sqrt(displacement)  0.004393   0.005016   0.876
##                                                  Pr(>|t|)
## (Intercept)                                         0.169
## sqrt(horsepower)                                    0.905
## sqrt(weight)                                        0.473
## sqrt(displacement)                                  0.737
## sqrt(horsepower):sqrt(weight)                       0.271
## sqrt(horsepower):sqrt(displacement)                 0.902
## sqrt(weight):sqrt(displacement)                     0.467
## sqrt(horsepower):sqrt(weight):sqrt(displacement)    0.382
## 
## Residual standard error: 3.868 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.7589, Adjusted R-squared:  0.7545 
## F-statistic: 172.6 on 7 and 384 DF,  p-value: < 2.2e-16
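
The X^2 transformation needs a little care in a formula, because ^ means term crossing there rather than arithmetic. The sketch below illustrates this on simulated data (x, y, and quad_fit are hypothetical names, not the Auto variables).

```r
# Toy data with a genuinely quadratic relationship (simulated)
set.seed(2)
x <- runif(100, 1, 10)
y <- 3 + 0.5 * x^2 + rnorm(100)

# Inside a formula, x^2 is "x crossed with itself", which collapses to just x
bare_terms <- attr(terms(y ~ x + x^2), "term.labels")

# I() protects the expression so that x^2 is computed arithmetically
quad_fit <- lm(y ~ x + I(x^2))
print(bare_terms)
print(names(coef(quad_fit)))
```

log() and sqrt() are ordinary function calls, so they can be used directly in a formula as in the two models above.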

Question 10

library(ISLR2)
## 
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto
attach(Carseats)
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

fit<-lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
coef(fit)[4]
##    USYes 
## 1.200573

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

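Sales is recorded in thousands of units, so each coefficient is a change in thousands of units sold.

The coefficient for Price is -0.054459, which means that for every one-dollar increase in the price of the car seat, sales decrease by about 54 units on average, holding Urban and US fixed.

The coefficient for Urban = Yes is -0.021916, which means urban stores sell about 22 fewer units on average than rural stores at the same price and US status. With a p-value of 0.936, this difference is not statistically distinguishable from zero.

The coefficient for US = Yes is 1.200573, which means that, on average, US stores sell about 1,200 more units than stores outside the US, holding Price and Urban fixed.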

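(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(\text{Sales} = 13.04 - 0.054 \times \text{Price} - 0.022 \times \text{Urban} + 1.20 \times \text{US}\), where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.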
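
The 0/1 coding of the qualitative variables can be checked by looking at the design matrix R builds from a formula. This is a minimal sketch on a small made-up data frame (toy is hypothetical; only the column names mirror Carseats).

```r
# Small made-up data frame with the same column types as the model in (a)
toy <- data.frame(
  Price = c(120, 83, 80, 97),
  Urban = factor(c("Yes", "No", "Yes", "No")),
  US    = factor(c("Yes", "Yes", "No", "No"))
)

# model.matrix() shows the dummy variables lm() would use:
# "No" is the baseline level, "Yes" becomes a 0/1 indicator column
X <- model.matrix(~ Price + Urban + US, data = toy)
print(X)
```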

(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?

Price and US = Yes are significant (see part (b) for their interpretation), so we can reject the null hypothesis \(H_0 : \beta_j = 0\) for those two predictors. We cannot reject it for Urban (p-value 0.936).

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit<-lm(Sales ~ Price + US, data = Carseats)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

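(f) How well do the models in (a) and (e) fit the data?

Neither model fits the data well. Both have a multiple R-squared of 0.2393, so they explain only about 24% of the variance in Sales. The adjusted R-squared is 0.2335 for part (a) and 0.2354 for part (e), so dropping Urban gives a marginally better adjusted fit.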

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
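
As a check on what confint() computes for a linear model, the interval is the estimate plus or minus a t quantile times the standard error. For Price above this is roughly -0.05448 ± 1.966 × 0.00523. The sketch below reproduces the calculation on simulated data (x, y, and toy_fit are hypothetical names).

```r
# Toy regression (simulated data)
set.seed(3)
x <- rnorm(50)
y <- 2 - 1.5 * x + rnorm(50)
toy_fit <- lm(y ~ x)

est <- coef(toy_fit)
se  <- sqrt(diag(vcov(toy_fit)))              # coefficient standard errors
tq  <- qt(0.975, df = df.residual(toy_fit))   # t quantile for a 95% interval

manual_ci <- cbind(lower = est - tq * se, upper = est + tq * se)
print(manual_ci)
print(confint(toy_fit))                       # should agree with manual_ci
```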

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(fit)

summary(influence.measures(fit))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
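
One way to turn the diagnostics into an explicit check is to compare each observation's leverage with the average leverage (p + 1)/n and to flag studentized residuals larger than 3 in absolute value. The sketch below does this on simulated data with one planted high-leverage point; the cutoff of three times the average leverage is a rule of thumb assumed here, and all names are hypothetical.

```r
# Toy data with one planted high-leverage observation (simulated)
set.seed(4)
n <- 100
x <- rnorm(n)
x[1] <- 8                                  # far from the other x values
y <- 1 + 2 * x + rnorm(n)
toy_fit <- lm(y ~ x)

p        <- length(coef(toy_fit)) - 1      # number of predictors
lev      <- hatvalues(toy_fit)             # leverage statistics
avg_lev  <- (p + 1) / n                    # leverages always average to this
high_lev <- which(lev > 3 * avg_lev)       # assumed rule-of-thumb cutoff
outliers <- which(abs(rstudent(toy_fit)) > 3)

print(high_lev)
print(outliers)
```

Applied to the model from (e), the same lines would use fit in place of toy_fit.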

Question 12

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Without an intercept, the estimate for the regression of Y onto X is \(\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2\), and the estimate for the regression of X onto Y is \(\sum_i x_i y_i / \sum_i y_i^2\). The numerators are identical, so the two estimates are the same when \(\sum_i x_i^2 = \sum_i y_i^2\), that is, when X and Y have the same sum of squares.

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(123)  # For reproducibility
n <- 100
X <- rnorm(n, mean = 50, sd = 10)  # Generate X ~ Normal(50, 10)
Y <- 2 * X + rnorm(n, mean = 0, sd = 15)  # Y depends on X, but with added noise
model_X_Y <- lm(X ~ Y)
summary(model_X_Y)
## 
## Call:
## lm(formula = X ~ Y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.768  -3.485   0.295   4.151  14.454 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19.90579    2.65175   7.507 2.82e-11 ***
## Y            0.30938    0.02582  11.984  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.843 on 98 degrees of freedom
## Multiple R-squared:  0.5944, Adjusted R-squared:  0.5903 
## F-statistic: 143.6 on 1 and 98 DF,  p-value: < 2.2e-16
model_Y_X <- lm(Y ~ X)
summary(model_Y_X)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.610 -10.253  -1.312   8.710  49.356 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3933     8.2897   0.289    0.773    
## X             1.9213     0.1603  11.984   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.56 on 98 degrees of freedom
## Multiple R-squared:  0.5944, Adjusted R-squared:  0.5903 
## F-statistic: 143.6 on 1 and 98 DF,  p-value: < 2.2e-16
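
The two fits above include an intercept, whereas part (a) concerns regression without one. A sketch of the no-intercept version on the same simulated data follows; adding + 0 to the formula drops the intercept, and b_yx and b_xy are illustrative names.

```r
# Same simulated data as above
set.seed(123)
n <- 100
X <- rnorm(n, mean = 50, sd = 10)
Y <- 2 * X + rnorm(n, mean = 0, sd = 15)

# No-intercept slopes; "+ 0" removes the intercept from the model
b_yx <- unname(coef(lm(Y ~ X + 0)))   # equals sum(X * Y) / sum(X^2)
b_xy <- unname(coef(lm(X ~ Y + 0)))   # equals sum(X * Y) / sum(Y^2)

print(c(b_yx = b_yx, b_xy = b_xy))
```

The slopes differ because sum(X^2) and sum(Y^2) are far apart in this example.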

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(123)  # For reproducibility
n2 <- 100
X2 <- rnorm(n2, mean = 50, sd = 10)  # Generate X ~ N(50, 10)
Y2 <- X2   # Y2 is an exact copy of X2, so both regressions have slope 1
model_X_Y2 <- lm(X2 ~ Y2)
summary(model_X_Y2)
## Warning in summary.lm(model_X_Y2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = X2 ~ Y2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.048e-15 -2.071e-16  1.830e-17  2.008e-16  1.180e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 0.000e+00  2.516e-16 0.000e+00        1    
## Y2          1.000e+00  4.866e-18 2.055e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.42e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.223e+34 on 1 and 98 DF,  p-value: < 2.2e-16
model_Y_X2 <- lm(Y2 ~ X2)
summary(model_Y_X2)
## Warning in summary.lm(model_Y_X2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = Y2 ~ X2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.048e-15 -2.071e-16  1.830e-17  2.008e-16  1.180e-15 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 0.000e+00  2.516e-16 0.000e+00        1    
## X2          1.000e+00  4.866e-18 2.055e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.42e-16 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.223e+34 on 1 and 98 DF,  p-value: < 2.2e-16
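
Setting Y equal to X is the simplest case, but it produces a perfect fit. For regression without an intercept, the two estimates agree whenever sum(x^2) equals sum(y^2), so a less degenerate example is to let Y be a random permutation of X. The sketch below is one such construction; X3, Y3, b_yx3, and b_xy3 are hypothetical names.

```r
# Y3 holds the same values as X3 in shuffled order, so the sums of squares match
set.seed(123)
n  <- 100
X3 <- rnorm(n, mean = 50, sd = 10)
Y3 <- sample(X3)

# No-intercept slopes in both directions
b_yx3 <- unname(coef(lm(Y3 ~ X3 + 0)))
b_xy3 <- unname(coef(lm(X3 ~ Y3 + 0)))

print(c(b_yx3 = b_yx3, b_xy3 = b_xy3))   # the two slopes should agree
```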