library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3

Homework 2 Chapter 3

Question 2

The KNN classifier is used when the response variable is categorical: it predicts the class of a test observation by a majority vote among its K nearest neighbors. KNN regression is used when the response variable is numeric: it predicts the response at a test observation as the average of the responses of its K nearest neighbors.
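
For illustration, here is a minimal sketch contrasting the two, assuming the class and FNN packages are installed (the data and variable names below are made up for the example):

# Hypothetical data: KNN classification vs. KNN regression
set.seed(1)
train_x = matrix(rnorm(200), ncol = 2)   # 100 training observations, 2 predictors
test_x  = matrix(rnorm(20),  ncol = 2)   # 10 test observations
class_y = factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "Yes", "No"))  # categorical response
num_y   = train_x[, 1] + rnorm(100)                                     # numeric response

class::knn(train_x, test_x, cl = class_y, k = 5)       # KNN classifier: predicted class labels (majority vote)
FNN::knn.reg(train_x, test_x, y = num_y, k = 5)$pred   # KNN regression: predicted numeric values (neighbor average)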

Question 9

data(Auto)

Part A.

pairs(Auto)
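
Since name is a qualitative identifier, a cleaner scatterplot matrix can be produced by dropping it (it is column 9 of Auto), matching the correlation matrix below:

pairs(Auto[, -9])   # scatterplot matrix of the quantitative variables only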

Part B.

cor(Auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
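
For easier scanning, the correlation matrix can also be rounded:

round(cor(Auto[, -9]), 2)   # correlations rounded to two decimal places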

Part C.

Auto$origin[Auto$origin == 1] = "American"
Auto$origin[Auto$origin == 2] = "European"
Auto$origin[Auto$origin == 3] = "Japanese"

Auto$origin = as.factor(Auto$origin)
m1 = lm(mpg ~ .-name, data = Auto)
summary(m1)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16
  1. There is a relationship between the predictors and the response. The overall F-test is statistically significant (p < 2.2e-16), so we can conclude that mpg is related to at least one of the predictors.

  2. Displacement has a significant positive relationship with mpg (p = 0.002), weight has a significant negative relationship with mpg (p < 2e-16), year has a significant positive relationship with mpg (p < 2e-16), and, relative to American origin, both Japanese origin (p = 3.93e-07) and European origin (p = 4.72e-06) have significant positive relationships with mpg.

  3. Holding all other predictors constant, a one-unit (one model year) increase in year is associated with an increase of about 0.777 in mpg (see the sketch after this list for reading this estimate off the fitted model).
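
A minimal sketch for pulling the year estimate and its confidence interval out of the fitted model programmatically:

coef(summary(m1))["year", ]   # estimate, std. error, t value, p-value for year
confint(m1, "year")           # 95% confidence interval for the year coefficient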

Part D

par(mfrow = c(2,2))
plot(m1)

The Q-Q plot shows that the residuals are fairly close to normally distributed. There is no strong pattern in the residuals vs. fitted plot, suggesting roughly constant variance. In the residuals vs. leverage plot, we can see a few outliers.
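
The points flagged in those plots can also be identified numerically; a minimal sketch using studentized residuals and leverage (hat) values:

which(abs(rstudent(m1)) > 3)   # observations with large studentized residuals (possible outliers)
which.max(hatvalues(m1))       # observation with the highest leverage
mean(hatvalues(m1))            # average leverage, (p + 1)/n, for comparison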

Part E.

m2_interactions = lm(mpg ~ . * ., data=Auto[,-9])
summary(m2_interactions)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6008 -1.2863  0.0813  1.2082 12.0382 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.401e+01  5.147e+01   0.855 0.393048    
## cylinders                    3.302e+00  8.187e+00   0.403 0.686976    
## displacement                -3.529e-01  1.974e-01  -1.788 0.074638 .  
## horsepower                   5.312e-01  3.390e-01   1.567 0.117970    
## weight                      -3.259e-03  1.820e-02  -0.179 0.857980    
## acceleration                -6.048e+00  2.147e+00  -2.818 0.005109 ** 
## year                         4.833e-01  5.923e-01   0.816 0.415119    
## originEuropean              -3.517e+01  1.260e+01  -2.790 0.005547 ** 
## originJapanese              -3.765e+01  1.426e+01  -2.640 0.008661 ** 
## cylinders:displacement      -6.316e-03  7.106e-03  -0.889 0.374707    
## cylinders:horsepower         1.452e-02  2.457e-02   0.591 0.555109    
## cylinders:weight             5.703e-04  9.044e-04   0.631 0.528709    
## cylinders:acceleration       3.658e-01  1.671e-01   2.189 0.029261 *  
## cylinders:year              -1.447e-01  9.652e-02  -1.499 0.134846    
## cylinders:originEuropean    -7.210e-01  1.088e+00  -0.662 0.508100    
## cylinders:originJapanese     1.226e+00  1.007e+00   1.217 0.224379    
## displacement:horsepower     -5.407e-05  2.861e-04  -0.189 0.850212    
## displacement:weight          2.659e-05  1.455e-05   1.828 0.068435 .  
## displacement:acceleration   -2.547e-03  3.356e-03  -0.759 0.448415    
## displacement:year            4.547e-03  2.446e-03   1.859 0.063842 .  
## displacement:originEuropean -3.364e-02  4.220e-02  -0.797 0.425902    
## displacement:originJapanese  5.375e-02  4.145e-02   1.297 0.195527    
## horsepower:weight           -3.407e-05  2.955e-05  -1.153 0.249743    
## horsepower:acceleration     -3.445e-03  3.937e-03  -0.875 0.382122    
## horsepower:year             -6.427e-03  3.891e-03  -1.652 0.099487 .  
## horsepower:originEuropean   -4.869e-03  5.061e-02  -0.096 0.923408    
## horsepower:originJapanese    2.289e-02  6.252e-02   0.366 0.714533    
## weight:acceleration         -6.851e-05  2.385e-04  -0.287 0.774061    
## weight:year                 -8.065e-05  2.184e-04  -0.369 0.712223    
## weight:originEuropean        2.277e-03  2.685e-03   0.848 0.397037    
## weight:originJapanese       -4.498e-03  3.481e-03  -1.292 0.197101    
## acceleration:year            6.141e-02  2.547e-02   2.412 0.016390 *  
## acceleration:originEuropean  9.234e-01  2.641e-01   3.496 0.000531 ***
## acceleration:originJapanese  7.159e-01  3.258e-01   2.198 0.028614 *  
## year:originEuropean          2.932e-01  1.444e-01   2.031 0.043005 *  
## year:originJapanese          3.139e-01  1.483e-01   2.116 0.035034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8866 
## F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

Using p < .05, the significant interactions are: cylinders and acceleration, acceleration and year, acceleration and European origin, acceleration and Japanese origin, year and European origin, and year and Japanese origin.
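
To check whether one of these interactions is worth adding to the main-effects model, a partial F-test can be used; a minimal sketch with the year:origin interaction (any of the significant interactions above could be substituted):

m2b = lm(mpg ~ . - name + year:origin, data = Auto)   # main effects plus one interaction
anova(m1, m2b)                                        # partial F-test against the main-effects model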

Part F.

I transformed acceleration three ways: the log, the square root, and the square (second power).

In all three models, the transformed acceleration term is statistically significant. The log transformation gives the largest coefficient (although coefficients on different scales are not directly comparable), and the log model also fits slightly best by R-squared (0.190, vs. 0.186 for the square root and 0.163 for the square).

m3log = lm(mpg ~ log(acceleration), data = Auto)
summary(m3log)
## 
## Call:
## lm(formula = mpg ~ log(acceleration), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0234  -5.6231  -0.9787   4.5943  23.0872 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -27.834      5.373  -5.180 3.56e-07 ***
## log(acceleration)   18.801      1.966   9.565  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.033 on 390 degrees of freedom
## Multiple R-squared:   0.19,  Adjusted R-squared:  0.1879 
## F-statistic: 91.49 on 1 and 390 DF,  p-value: < 2.2e-16
m4sqt = lm(mpg ~ sqrt(acceleration), data = Auto)
summary(m4sqt)
## 
## Call:
## lm(formula = mpg ~ sqrt(acceleration), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.034  -5.601  -1.027   4.713  23.184 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -14.177      4.008  -3.537 0.000453 ***
## sqrt(acceleration)    9.582      1.017   9.424  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.053 on 390 degrees of freedom
## Multiple R-squared:  0.1855, Adjusted R-squared:  0.1834 
## F-statistic: 88.81 on 1 and 390 DF,  p-value: < 2.2e-16
m5p = lm(mpg ~ I(acceleration^2), data = Auto)
summary(m5p)
## 
## Call:
## lm(formula = mpg ~ I(acceleration^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.753  -5.694  -1.432   4.978  23.238 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       14.597505   1.077541  13.547   <2e-16 ***
## I(acceleration^2)  0.035518   0.004075   8.716   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.15 on 390 degrees of freedom
## Multiple R-squared:  0.163,  Adjusted R-squared:  0.1609 
## F-statistic: 75.96 on 1 and 390 DF,  p-value: < 2.2e-16
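
The three transformed fits can be compared side by side on adjusted R-squared; a minimal sketch:

sapply(list(log = m3log, sqrt = m4sqt, squared = m5p),
       function(m) summary(m)$adj.r.squared)   # adjusted R-squared for each transformation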

Question 10

data("Carseats")
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Part A.

m6 = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(m6)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Part B.

With a one-unit ($1) increase in Price, holding all other variables constant, Sales decrease by about 0.054 thousand units (roughly 54 car seats); this effect is statistically significant (p < 2e-16).

Stores located in urban areas sell about 0.022 thousand (roughly 22) fewer units than stores in rural areas, holding all other variables constant; however, this difference is not statistically significant (p = 0.94).

Stores in the US sell about 1.20 thousand (roughly 1,200) more units than stores outside the US, holding all other predictors constant (p = 4.86e-06).

Part C.

Sales = 13.0434689 − 0.0544588 × Price − 0.0219162 × UrbanYes + 1.2005727 × USYes + ε, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
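
The fitted equation can also be applied through predict(); a minimal sketch with made-up predictor values:

predict(m6, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))   # predicted Sales (in thousands of units)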

Part D.

Based on the individual t-tests, we can reject the null hypothesis H0: βj = 0 for Price and US (USYes), but not for Urban.
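
The p-values behind this conclusion can be pulled out of the fitted model directly; a minimal sketch:

summary(m6)$coefficients[, "Pr(>|t|)"]   # p-values for the individual coefficient t-tests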

Part E.

m7 = lm(Sales ~ Price + US, data = Carseats)
summary(m7)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Part F.

Model with Price, Urban, and US: R-squared = 0.2393, adjusted R-squared = 0.2335. This model explains about 23.9% of the variation in Sales.

Model with Price and US: R-squared = 0.2393, adjusted R-squared = 0.2354. This model also explains about 23.9% of the variation in Sales, and its higher adjusted R-squared indicates that dropping Urban does not hurt the fit.
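
A partial F-test comparing the two nested models reaches the same conclusion; a minimal sketch (m7 is nested within m6):

anova(m7, m6)   # tests whether Urban adds anything beyond Price and US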

Part G.

confint(m7)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Part H.

par(mfrow = c(2,2))
plot(m7)

Looking at the residuals vs. leverage plot, we can see there are a few outliers.

Question 12

Part A. The coefficient estimate from regressing Y onto X is sum(x*y) / sum(x^2), while the estimate from regressing X onto Y is sum(x*y) / sum(y^2), so the two estimates are the same exactly when the sum of the squared observed y-values equals the sum of the squared observed x-values.

Part B.

set.seed(1)
x = rnorm(100)
y = 2*x
m8 = lm(y~x+0)
m9 = lm(x~y+0)
summary(m8)
## Warning in summary.lm(m8): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.776e-16 -3.378e-17  2.680e-18  6.113e-17  5.105e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 2.000e+00  1.296e-17 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
summary(m9)
## Warning in summary.lm(m9): essentially perfect fit: summary may be unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##   Estimate Std. Error   t value Pr(>|t|)    
## y 5.00e-01   3.24e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
set.seed(1)
x = rnorm(100)
y = -sample(x, 100)
sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 81.05509
m10 = lm(y~x+0)
m11 = lm(x~y+0)
summary(m10)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2833 -0.6945 -0.1140  0.4995  2.1665 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.07768    0.10020   0.775     0.44
## 
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared:  0.006034,   Adjusted R-squared:  -0.004006 
## F-statistic: 0.601 on 1 and 99 DF,  p-value: 0.4401
summary(m11)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2182 -0.4969  0.1595  0.6782  2.4017 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.07768    0.10020   0.775     0.44
## 
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared:  0.006034,   Adjusted R-squared:  -0.004006 
## F-statistic: 0.601 on 1 and 99 DF,  p-value: 0.4401
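
As a quick confirmation that the two regressions in this last example give the same slope estimate:

coef(m10)   # slope from regressing y onto x
coef(m11)   # slope from regressing x onto y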