knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Chapter 3 Questions

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

For classification, KNN looks at the \(k\) closest data points and assigns the most common class label among them — like deciding whether a fruit is an apple or an orange based on what the nearby fruits are. For regression, it looks at the \(k\) closest points and predicts the average of their response values — like guessing someone’s height by averaging the heights of people with similar ages. Both use the idea of “closest neighbors,” but one outputs a label and the other outputs a number.
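
A minimal sketch of the contrast in R, assuming the iris data, class::knn, and k = 5 purely for illustration (the regression version is hand-rolled rather than taken from a package):

library(class)

set.seed(1)
train_idx <- sample(nrow(iris), 100)

# Classification: majority vote on the class labels of the 5 nearest neighbors
pred_class <- knn(train = iris[train_idx, 1:4],
                  test  = iris[-train_idx, 1:4],
                  cl    = iris$Species[train_idx], k = 5)

# Regression: average the response over the 5 nearest neighbors
knn_reg <- function(train_x, test_x, train_y, k = 5) {
  apply(test_x, 1, function(x) {
    d <- sqrt(colSums((t(train_x) - x)^2))  # Euclidean distance to each training point
    mean(train_y[order(d)[1:k]])            # mean response of the k closest points
  })
}
pred_reg <- knn_reg(iris[train_idx, 2:4], iris[-train_idx, 2:4],
                    iris$Sepal.Length[train_idx])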

9. Auto dataset

library(ISLR2)
library(GGally)
data(Auto)
numeric_vars <- Auto[sapply(Auto, is.numeric)]  # drop the non-numeric name column
ggpairs(numeric_vars)  # scatterplot matrix of every pair of quantitative variables

cor_matrix <- cor(numeric_vars, use = "complete.obs")
print(cor_matrix)
##                     mpg  cylinders displacement horsepower     weight acceleration
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458
##                    year     origin
## mpg           0.5805410  0.5652088
## cylinders    -0.3456474 -0.5689316
## displacement -0.3698552 -0.6145351
## horsepower   -0.4163615 -0.4551715
## weight       -0.3091199 -0.5850054
## acceleration  0.2903161  0.2127458
## year          1.0000000  0.1815277
## origin        0.1815277  1.0000000

The following model explains about 82% of the variance in mpg. Displacement, weight, year, and origin are statistically significant. Cylinders, horsepower, and weight have negative coefficients and so decrease predicted mpg, while the other variables increase it. Origin has the largest coefficient in absolute value, though raw coefficients are not directly comparable across predictors measured in different units. For every one-year increase in the car’s model year, the predicted mpg increases by approximately 0.75 miles per gallon, holding the other predictors fixed.

model <- lm(mpg ~ . - name, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
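
As a quick sanity check on the year interpretation above (a sketch; the “average car” values are purely illustrative), predicting mpg for the same car in two consecutive model years should differ by exactly the year coefficient, about 0.75 mpg:

avg_car <- data.frame(cylinders    = mean(Auto$cylinders),
                      displacement = mean(Auto$displacement),
                      horsepower   = mean(Auto$horsepower),
                      weight       = mean(Auto$weight),
                      acceleration = mean(Auto$acceleration),
                      year         = c(76, 77),  # two consecutive model years
                      origin       = mean(Auto$origin))
diff(predict(model, newdata = avg_car))  # equals coef(model)["year"], ~0.75 mpg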

In the Residuals vs Fitted plot, the points curve slightly, suggesting some nonlinearity that a straight-line model does not fully capture. The Q-Q plot mostly follows a straight line, but a few points at the upper tail stand out, indicating possible outliers. The Scale-Location plot shows the spread of residuals increasing somewhat with the fitted values, a sign of mild heteroscedasticity, so the model fits some cars better than others. The Residuals vs Leverage plot shows that one car (point 14) has high leverage and a large influence on the model, which could affect the accuracy of the predictions.

par(mfrow = c(2, 2)) 
plot(model)
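
To confirm that flagged observation numerically (a small sketch using the fitted model above), the hat values can be ranked directly:

head(sort(hatvalues(model), decreasing = TRUE), 3)  # the point flagged in the plot should rank near the top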

Of the two interaction models below, model1, which adds a horsepower:weight interaction, is the best overall. Both interaction terms are statistically significant in their respective models, but model1 explains about 86% of the variance in mpg, compared with 83% for model2.

model1 <- lm(mpg ~ . - name + horsepower:weight, data = Auto)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ . - name + horsepower:weight, data = Auto)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.589 -1.617 -0.184  1.541 12.001 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.876e+00  4.511e+00   0.638 0.524147    
## cylinders         -2.955e-02  2.881e-01  -0.103 0.918363    
## displacement       5.950e-03  6.750e-03   0.881 0.378610    
## horsepower        -2.313e-01  2.363e-02  -9.791  < 2e-16 ***
## weight            -1.121e-02  7.285e-04 -15.393  < 2e-16 ***
## acceleration      -9.019e-02  8.855e-02  -1.019 0.309081    
## year               7.695e-01  4.494e-02  17.124  < 2e-16 ***
## origin             8.344e-01  2.513e-01   3.320 0.000986 ***
## horsepower:weight  5.529e-05  5.227e-06  10.577  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.931 on 383 degrees of freedom
## Multiple R-squared:  0.8618, Adjusted R-squared:  0.859 
## F-statistic: 298.6 on 8 and 383 DF,  p-value: < 2.2e-16
model2 <- lm(mpg ~ . - name + year:origin, data = Auto)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6072 -2.0439 -0.0596  1.7121 12.3368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.492e+00  9.044e+00   0.939 0.348353    
## cylinders    -5.042e-01  3.192e-01  -1.579 0.115082    
## displacement  1.567e-02  7.530e-03   2.081 0.038060 *  
## horsepower   -1.399e-02  1.364e-02  -1.025 0.305786    
## weight       -6.352e-03  6.449e-04  -9.851  < 2e-16 ***
## acceleration  9.185e-02  9.766e-02   0.941 0.347546    
## year          4.189e-01  1.125e-01   3.723 0.000226 ***
## origin       -1.405e+01  4.699e+00  -2.989 0.002978 ** 
## year:origin   1.989e-01  6.030e-02   3.298 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8228 
## F-statistic: 227.9 on 8 and 383 DF,  p-value: < 2.2e-16
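
Beyond comparing R-squared values, a partial F-test of the base model against model1 (a quick sketch; both models are fitted above) checks whether the interaction adds significant explanatory power; the small p-value supports keeping it:

anova(model, model1)  # tests the horsepower:weight term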

Using transformations made almost every variable statistically significant, but the transformed model still explains slightly less of the variance in mpg (about 85%) than model1 (about 86%).

model_trans <- lm(mpg ~ log(horsepower) + sqrt(weight) + I(displacement^2) + acceleration + cylinders + year + origin, 
                  data = Auto)
summary(model_trans)
## 
## Call:
## lm(formula = mpg ~ log(horsepower) + sqrt(weight) + I(displacement^2) + 
##     acceleration + cylinders + year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3568 -1.8191 -0.1246  1.5970 12.2641 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.441e+01  7.435e+00   4.629 5.05e-06 ***
## log(horsepower)   -7.864e+00  1.511e+00  -5.204 3.18e-07 ***
## sqrt(weight)      -5.863e-01  6.901e-02  -8.496 4.38e-16 ***
## I(displacement^2)  5.272e-05  8.997e-06   5.859 1.00e-08 ***
## acceleration      -1.295e-01  1.002e-01  -1.293   0.1969    
## cylinders         -4.958e-01  2.472e-01  -2.006   0.0456 *  
## year               7.521e-01  4.636e-02  16.224  < 2e-16 ***
## origin             1.140e+00  2.397e-01   4.757 2.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.011 on 384 degrees of freedom
## Multiple R-squared:  0.8538, Adjusted R-squared:  0.8511 
## F-statistic: 320.3 on 7 and 384 DF,  p-value: < 2.2e-16
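
A quick way to back up that comparison (a sketch using the three models fitted above) is AIC, where lower is better; it should favor model1, consistent with the R-squared comparison:

AIC(model, model1, model_trans)  # lower AIC indicates the better trade-off of fit and complexity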

10. Carseats dataset

data(Carseats)
carmodel <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carmodel)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

For every $1 increase in Price, predicted Sales decrease by about 0.054, i.e., roughly 54 units, since Sales is measured in thousands of units; the effect is statistically significant. Urban location has no statistically significant effect on Sales. Stores in the US sell about 1,200 more units than stores outside the US, and this effect is also statistically significant.

\(Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times Urban_{Yes} + 1.200573 \times US_{Yes}\)
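
To see how the dummy variables enter this equation (an illustrative sketch; the $100 price is arbitrary), predict Sales for a non-urban US store:

predict(carmodel, newdata = data.frame(Price = 100, Urban = "No", US = "Yes"))
# 13.043469 - 0.054459 * 100 + 0 + 1.200573, about 8.8 (thousand units)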

Reject the null hypothesis \(H_0: \beta_j = 0\) for the statistically significant variables, Price and US; fail to reject it for Urban.

carmodel1 <- lm(Sales ~ Price + US, data = Carseats)
summary(carmodel1)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Neither model fits the sales data well. Each accounts for only about 24% of the variance in Sales.
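
A partial F-test (a quick sketch with the two models fitted above) confirms that dropping Urban loses essentially no explanatory power:

anova(carmodel1, carmodel)  # large p-value: Urban adds nothing significant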

confint(carmodel1)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Point 377 has a large gap between its actual and predicted value, making it an outlier. A couple of other points, such as 51 and 69, stand out slightly as well. Point 368 has unusual predictor values and could exert a large influence on the model’s overall results, so it is a high-leverage point.

par(mfrow = c(2, 2))
plot(carmodel1)
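
The flagged points can also be checked numerically (a sketch using the fitted model above): studentized residuals beyond roughly ±2 deserve a look as potential outliers, and hat values well above the average \((p+1)/n\) suggest high leverage:

which(abs(rstudent(carmodel1)) > 2)                     # candidate outliers
head(sort(hatvalues(carmodel1), decreasing = TRUE), 3)  # candidate high-leverage points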

12. This problem involves simple linear regression without an intercept.

The two coefficient estimates are the same when \(\sum_i x_i^2 = \sum_i y_i^2\): without an intercept, \(\hat{\beta}_{Y \sim X} = \sum_i x_i y_i / \sum_i x_i^2\) and \(\hat{\beta}_{X \sim Y} = \sum_i x_i y_i / \sum_i y_i^2\), so the two are equal exactly when the denominators match. This holds, for example, when both X and Y are standardized to have mean zero and standard deviation one.

set.seed(123)

X <- rnorm(100, mean = 50, sd = 10)
Y <- 2 * X + rnorm(100, mean = 0, sd = 20)

model_Y_on_X <- lm(Y ~ X)
summary(model_Y_on_X)$coefficients
##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 3.191100 11.0529508 0.2887103 7.734129e-01
## X           1.895057  0.2137572 8.8654627 3.497142e-14
model_X_on_Y <- lm(X ~ Y)
summary(model_X_on_Y)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 27.4991330 2.72704471 10.083859 7.959924e-17
## Y            0.2348544 0.02649093  8.865463 3.497142e-14
set.seed(124)

X_raw <- rnorm(100)
Y_raw <- 3 * X_raw + rnorm(100, sd = 0.5)

X1 <- scale(X_raw)
Y1 <- scale(Y_raw)

model_Y_on_X1 <- lm(Y1 ~ X1)
slope_Y_on_X <- coef(model_Y_on_X1)[2]

model_X_on_Y1 <- lm(X1 ~ Y1)
slope_X_on_Y <- coef(model_X_on_Y1)[2]

cat("Slope of Y ~ X:", round(slope_Y_on_X, 4), "\n")
## Slope of Y ~ X: 0.9863
cat("Slope of X ~ Y:", round(slope_X_on_Y, 4), "\n")
## Slope of X ~ Y: 0.9863