Chapter 03 (page 120): 2, 9, 10, 12

Chapter 03 Ex 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

Lovaas answer: For some reason I find it hilarious that I’ve been instructed to do this “carefully.” I am many things, but careful isn’t really one of them! The KNN classifier and KNN regression methods are closely related in concept: both look at the K training observations closest to a test point. The classifier produces a qualitative output, assigning the test point to the class held by the majority of those K neighbors (will voter x vote for candidate y or z?), while KNN regression produces a quantitative output, predicting the average of the K neighbors’ response values (e.g., estimating sales at a point as the mean of the sales at the K most similar observations).
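A minimal sketch of the distinction in R, assuming the class and FNN packages are installed; the two predictor columns and k = 5 are illustrative choices, not part of the exercise:

library(ISLR)  # Auto data, used here purely for illustration
set.seed(1)
train <- sample(nrow(Auto), 300)
X <- scale(Auto[, c("weight", "horsepower")])  # KNN is distance-based, so scale

# Classification: qualitative output -- majority vote among the K neighbors.
high_mpg <- factor(Auto$mpg > median(Auto$mpg), labels = c("low", "high"))
class_pred <- class::knn(X[train, ], X[-train, ], cl = high_mpg[train], k = 5)

# Regression: quantitative output -- average of the K neighbors' responses.
reg_pred <- FNN::knn.reg(X[train, ], X[-train, ], y = Auto$mpg[train], k = 5)

table(class_pred, high_mpg[-train])  # confusion matrix of predicted labels
head(reg_pred$pred)                  # numeric mpg predictions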

Chapter 03 Ex 9

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
pairs(Auto)

cor(subset(Auto, select=-name))
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
lm.fit = lm(mpg~.-name, data = Auto) 
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Is there a relationship between the predictors and the response? Lovaas answer: Yes, there is a relationship between the predictors and the response. The F-statistic (252.4) is far from 1, with a very small p-value, giving strong evidence against the null hypothesis that all coefficients are zero (see the sketch after this list for pulling those numbers out of the fit directly).
  2. Which predictors appear to have a statistically significant relationship to the response? Lovaas answer: Since the p-values are < 0.05 for displacement, weight, year, and origin, we can conclude there is a statistically significant relationship between those predictors and mpg.
  3. What does the coefficient for the year variable suggest? Lovaas answer: The coefficient for year is positive (about 0.75), meaning that as the year increases (time passes), mpg increases: newer cars are more fuel efficient, by roughly 0.75 mpg per model year with the other predictors held fixed. The p-value is quite low, so this is a statistically significant observation.
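A quick sketch of how to extract the overall F-test from the fitted model: summary(lm.fit)$fstatistic is a named vector of the statistic and its two degrees of freedom.

fstat <- summary(lm.fit)$fstatistic
fstat                                                 # value, numdf, dendf
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # overall p-value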
par(mfrow=c(2,2))
plot(lm.fit)

Lovaas answer 9d: The residuals vs fitted plot has a bit of a smiley-face (U-shaped) curve, meaning a simple linear model may not be flexible enough and a transformation may be in order. The normal Q-Q plot reinforces this notion, with the residuals drifting off the line in the upper tail. There’s no strong pattern in the scale-location plot, which is good. R has labeled a few potential outliers in the residual plots (top left and bottom right), and the residuals vs leverage plot flags some high-leverage observations, so the model may not have a great fit yet.
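A sketch for backing up that visual read with numbers; the |3| cutoff is a common rule of thumb, not from the text:

which(abs(rstudent(lm.fit)) > 3)                  # candidate outliers
head(sort(hatvalues(lm.fit), decreasing = TRUE))  # highest-leverage points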

lm.fit2 = lm(mpg~cylinders*displacement+weight*year+cylinders*weight+displacement*year, data = Auto) # By using the asterisk I am including the interaction effects and all 4 base variables
summary(lm.fit2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + weight * year + 
##     cylinders * weight + displacement * year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9577 -1.5989 -0.1133  1.2365 13.9236 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.489e+01  2.246e+01  -1.108   0.2686    
## cylinders              -4.156e+00  7.428e-01  -5.595 4.21e-08 ***
## displacement            1.990e-01  1.045e-01   1.905   0.0575 .  
## weight                 -1.649e-02  1.370e-02  -1.204   0.2295    
## year                    1.198e+00  2.690e-01   4.453 1.11e-05 ***
## cylinders:displacement -2.335e-03  3.085e-03  -0.757   0.4496    
## weight:year             1.850e-05  1.653e-04   0.112   0.9110    
## cylinders:weight        1.518e-03  3.585e-04   4.234 2.88e-05 ***
## displacement:year      -2.567e-03  1.269e-03  -2.022   0.0438 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.009 on 383 degrees of freedom
## Multiple R-squared:  0.8545, Adjusted R-squared:  0.8514 
## F-statistic: 281.1 on 8 and 383 DF,  p-value: < 2.2e-16

Lovaas answer 9e: Only the interaction effects cylinders:weight and displacement:year appear to be significant. The cylinders:displacement and weight:year interactions were not significant.

lm.fit3=lm(mpg~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto)
lm.fit4=lm(log(mpg)~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto)
lm.fit5=lm(log(mpg)~cylinders+sqrt(cylinders)+horsepower+sqrt(horsepower)+displacement+sqrt(displacement), data = Auto)
lm.fit6=lm(mpg~log(cylinders)+log(year)+log(displacement)+log(weight), data = Auto)

summary(lm.fit3)
## 
## Call:
## lm(formula = mpg ~ cylinders + I(cylinders^2) + year + I(year^2) + 
##     displacement + I(displacement^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1138  -1.4426   0.0194   1.5267  13.6306 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.476e+02  7.898e+01   5.667 2.85e-08 ***
## cylinders          5.607e+00  1.543e+00   3.633 0.000318 ***
## I(cylinders^2)    -3.826e-01  1.238e-01  -3.090 0.002144 ** 
## year              -1.183e+01  2.084e+00  -5.676 2.72e-08 ***
## I(year^2)          8.296e-02  1.370e-02   6.055 3.34e-09 ***
## displacement      -1.852e-01  1.405e-02 -13.184  < 2e-16 ***
## I(displacement^2)  2.574e-04  2.449e-05  10.509  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.283 on 385 degrees of freedom
## Multiple R-squared:  0.8257, Adjusted R-squared:  0.823 
## F-statistic: 304.1 on 6 and 385 DF,  p-value: < 2.2e-16
summary(lm.fit4)
## 
## Call:
## lm(formula = log(mpg) ~ cylinders + I(cylinders^2) + year + I(year^2) + 
##     displacement + I(displacement^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45829 -0.06492  0.00738  0.07036  0.50013 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.709e+01  3.151e+00   5.423 1.04e-07 ***
## cylinders          2.308e-01  6.156e-02   3.750 0.000204 ***
## I(cylinders^2)    -1.789e-02  4.938e-03  -3.623 0.000330 ***
## year              -3.949e-01  8.315e-02  -4.750 2.88e-06 ***
## I(year^2)          2.799e-03  5.466e-04   5.122 4.79e-07 ***
## displacement      -6.487e-03  5.605e-04 -11.574  < 2e-16 ***
## I(displacement^2)  8.284e-06  9.772e-07   8.477 4.98e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.131 on 385 degrees of freedom
## Multiple R-squared:  0.8539, Adjusted R-squared:  0.8516 
## F-statistic:   375 on 6 and 385 DF,  p-value: < 2.2e-16
summary(lm.fit5)
## 
## Call:
## lm(formula = log(mpg) ~ cylinders + sqrt(cylinders) + horsepower + 
##     sqrt(horsepower) + displacement + sqrt(displacement), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51384 -0.09954 -0.01098  0.11642  0.60844 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.988975   0.818498   6.095 2.66e-09 ***
## cylinders          -0.093537   0.151596  -0.617  0.53759    
## sqrt(cylinders)     0.390836   0.722566   0.541  0.58889    
## horsepower          0.003520   0.003230   1.090  0.27654    
## sqrt(horsepower)   -0.156775   0.067888  -2.309  0.02145 *  
## displacement        0.002832   0.001450   1.954  0.05146 .  
## sqrt(displacement) -0.120581   0.041027  -2.939  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1589 on 385 degrees of freedom
## Multiple R-squared:  0.785,  Adjusted R-squared:  0.7816 
## F-statistic: 234.3 on 6 and 385 DF,  p-value: < 2.2e-16
summary(lm.fit6)
## 
## Call:
## lm(formula = mpg ~ log(cylinders) + log(year) + log(displacement) + 
##     log(weight), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6326 -1.9688 -0.0678  1.6796 13.3853 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -75.888     17.632  -4.304 2.13e-05 ***
## log(cylinders)       2.555      1.656   1.543   0.1237    
## log(year)           58.164      3.508  16.579  < 2e-16 ***
## log(displacement)   -2.899      1.322  -2.193   0.0289 *  
## log(weight)        -17.820      1.718 -10.374  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.162 on 387 degrees of freedom
## Multiple R-squared:  0.8376, Adjusted R-squared:  0.8359 
## F-statistic: 498.9 on 4 and 387 DF,  p-value: < 2.2e-16

Lovaas notes ex 9f: Of the 4 (somewhat random) models I created, the second one, modelling the log of mpg, had the highest R-squared value (0.8539). Investigating that further could be interesting, with one caveat: an R-squared computed on log(mpg) is not directly comparable to one computed on mpg, since the response scales differ. Best model: lm(log(mpg)~cylinders+I(cylinders^2)+year+I(year^2)+displacement+I(displacement^2), data = Auto). A quick side-by-side comparison is sketched below.
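A minimal sketch for lining the four fits up programmatically; the fit3–fit6 names are just labels, and the log-response caveat above still applies:

models <- list(fit3 = lm.fit3, fit4 = lm.fit4, fit5 = lm.fit5, fit6 = lm.fit6)
sapply(models, function(m) summary(m)$adj.r.squared)  # adjusted R-squared per model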

Chapter 03 Ex 10

lm.fitcar = lm(Sales~Price+Urban+US, data = Carseats)
summary(lm.fitcar)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

R-squared = 0.2393, so the model explains about 24% of the variance in Sales.

Lovaas answer ex 10b: Price has a negative coefficient, meaning the higher the price, the lower the sales. Urban is a qualitative variable: if a store is in an urban location, sales are slightly lower (negative coefficient for UrbanYes), though this effect is not statistically significant. US is also a qualitative variable: either stores are in the US or they are not. Stores in the US have higher sales of carseats (about 1.2 thousand units more, all else equal) than stores outside of the US.

Lovaas answer ex 10c: Sales = 13.04 - 0.05*Price - 0.02*UrbanYes + 1.20*USYes, where UrbanYes and USYes are indicator variables equal to 1 if the store is urban / in the US, and 0 otherwise.
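A small sketch of how R’s dummy coding plays out in predictions; the $100 price point is an arbitrary illustration:

newdata <- data.frame(Price = 100,
                      Urban = c("Yes", "No"),
                      US    = c("Yes", "No"))
predict(lm.fitcar, newdata)  # urban US store vs non-urban, non-US store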

Lovaas answer ex 10d: Price and USYes have p-values < 0.05, so we can reject the null hypothesis that their coefficients are 0 and keep them in our model. UrbanYes has a much weaker relationship, with a p-value well above 0.05, so we fail to reject the null hypothesis for it and leave Urban out of our model.

lm.fitcar2 = lm(Sales~Price+US, data = Carseats)
summary(lm.fitcar2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Lovaas answer ex 10f: The models in (a) and (e) have essentially identical R-squared values of 0.2393, meaning both explain about 24% of the variance of Sales. The model in part (e) has one fewer variable, making it simpler, and its adjusted R-squared is marginally higher (0.2354 vs 0.2335). A formal comparison of the two nested models is sketched below.
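A sketch of that formal comparison: since the model in (e) is nested inside the model in (a), an F-test asks whether adding Urban improves the fit at all.

anova(lm.fitcar2, lm.fitcar)  # a large p-value would confirm Urban adds nothing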

confint(lm.fitcar2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
plot(predict(lm.fitcar2), rstudent(lm.fitcar2))

par(mfrow=c(2,2))
plot(lm.fitcar2)

Lovaas answer ex 10h: All studentized residuals in the first chart fall between -3 and 3, so no outliers are apparent. There are some points standing out on the residuals vs leverage (bottom right) chart, which R has labeled; i.e., there is evidence of high-leverage observations.
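A sketch for checking leverage numerically; the 3*(p+1)/n cutoff is a common rule of thumb rather than anything from the text:

lev <- hatvalues(lm.fitcar2)
p <- 2                        # predictors in the part (e) model
n <- nrow(Carseats)
which(lev > 3 * (p + 1) / n)  # observations with unusually high leverage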

Chapter 03 Ex 12

Lovaas answer ex 12a: The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X when the sum of the squares of the observed y-values equals the sum of the squares of the observed x-values. Concretely:
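From the least squares formula for regression without an intercept (ISLR eq. (3.38)), the two slope estimates are:

$$\hat\beta_{y \sim x} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad \hat\beta_{x \sim y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2}$$

The numerators are identical, so the two estimates coincide exactly when $\sum_i x_i^2 = \sum_i y_i^2$.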

set.seed(1)
x = rnorm(100)
y = 2*x
lm.fity = lm(y~x+0)
lm.fitx = lm(x~y+0)
summary(lm.fity)
## Warning in summary.lm(lm.fity): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.776e-16 -3.378e-17  2.680e-18  6.113e-17  5.105e-16 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 2.000e+00  1.296e-17 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.167e-16 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm.fitx)
## Warning in summary.lm(lm.fitx): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.888e-16 -1.689e-17  1.339e-18  3.057e-17  2.552e-16 
## 
## Coefficients:
##   Estimate Std. Error   t value Pr(>|t|)    
## y 5.00e-01   3.24e-18 1.543e+17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 2.382e+34 on 1 and 99 DF,  p-value: < 2.2e-16
set.seed(1)
x <- rnorm(100)
y <- -sample(x, 100)
lm.fity2 = lm(y~x+0)
lm.fitx2 = lm(x~y+0)
summary(lm.fity2)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2833 -0.6945 -0.1140  0.4995  2.1665 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.07768    0.10020   0.775     0.44
## 
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared:  0.006034,   Adjusted R-squared:  -0.004006 
## F-statistic: 0.601 on 1 and 99 DF,  p-value: 0.4401
summary(lm.fitx2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2182 -0.4969  0.1595  0.6782  2.4017 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.07768    0.10020   0.775     0.44
## 
## Residual standard error: 0.9021 on 99 degrees of freedom
## Multiple R-squared:  0.006034,   Adjusted R-squared:  -0.004006 
## F-statistic: 0.601 on 1 and 99 DF,  p-value: 0.4401

Lovaas note ex 12c: note that the coefficient estimates are the same, 0.07768, for each model. That is because y is just a negated reordering of x, so sum(x^2) = sum(y^2), which is exactly the condition from part (a).
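A one-line sanity check of that condition; all.equal rather than == guards against floating-point summation-order effects:

all.equal(sum(x^2), sum(y^2))  # TRUE: same squared values, just reordered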