
Question 2

KNN classifier vs KNN regression

The KNN classifier is used when Y is categorical, e.g. is this email spam or not spam? For a test point, it finds the K nearest training observations, estimates the conditional probability of each class as the fraction of those neighbors in that class, and assigns the test point to the class with the highest estimated probability (the majority class). KNN regression is used when Y is quantitative. It finds the K nearest neighbors and predicts the average of their response values.
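The difference between the two methods can be sketched in a few lines of base R. This is only an illustration on a made-up one-dimensional training set; the data and the helper names (`knn_neighbors`, `knn_classify`, `knn_regress`) are hypothetical and not part of the exercise.

```r
# Toy training data (hypothetical, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("ham", "ham", "ham", "spam", "spam", "spam")
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)

# Indices of the k training points closest to x0
knn_neighbors <- function(x0, train_x, k) {
  order(abs(train_x - x0))[seq_len(k)]
}

# Classification: class proportions among the neighbors, then majority vote
knn_classify <- function(x0, train_x, train_class, k) {
  idx   <- knn_neighbors(x0, train_x, k)
  probs <- table(train_class[idx]) / k
  list(probs = probs, label = names(which.max(probs)))
}

# Regression: average response of the neighbors
knn_regress <- function(x0, train_x, train_y, k) {
  mean(train_y[knn_neighbors(x0, train_x, k)])
}

cls <- knn_classify(4.2, train_x, train_class, k = 3)
reg <- knn_regress(4.2, train_x, train_y, k = 3)
cls$probs  # estimated conditional class probabilities
cls$label  # predicted class
reg        # predicted quantitative response
```

Both functions share the same neighbor search; only the final step (vote versus average) differs.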

Question 9

a) Scatterplots

pairs(Auto)

pairs(Auto[, -9])  # no 'name' column

b) Correlations

cor(Auto[, -9])  # no 'name' column
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c) MLR on mpg

lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Q&A

Is there a relationship between the predictors and the response? Yes. The F-statistic is large (252.4 on 7 and 384 degrees of freedom) and its p-value is below 2.2e-16, so we reject the null hypothesis that all slope coefficients are zero. We conclude that at least one predictor is related to mpg.
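As a rough check, the F-statistic can be recovered from the printed R² and degrees of freedom. The sketch below uses only the rounded values shown in the summary, so it agrees with the reported 252.4 only up to rounding.

```r
# Values read off the summary output in part c)
r2 <- 0.8215   # Multiple R-squared (rounded)
n  <- 392      # observations: 384 residual df + 7 predictors + intercept
p  <- 7        # number of predictors

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

f_stat
p_value
```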

Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin are statistically significant at the 0.05 level. Cylinders, horsepower, and acceleration are not.

What does the coefficient for the year variable suggest? The year coefficient 0.750773 suggests that, holding all other predictors constant, each additional year is associated with an increase of about 0.75 mpg. Cars get more fuel-efficient as the years go on.

d) Residuals

par(mfrow = c(2, 2))
plot(lm.fit)

The residuals-versus-fitted plot shows a U-shaped pattern in the red smoothing line, which suggests non-linearity that the linear model does not capture. Observation 14 has high leverage, and observations 327 and 394 stand out as possible outliers.

e) Interaction

summary(lm(mpg ~ . - name + displacement:weight + year:origin + horsepower:weight, data = Auto))
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + year:origin + 
##     horsepower:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4370 -1.6736 -0.0609  1.4670 11.8289 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.847e+01  8.064e+00   2.291  0.02251 *  
## cylinders            4.277e-02  2.899e-01   0.148  0.88278    
## displacement        -3.328e-02  2.044e-02  -1.629  0.10421    
## horsepower          -1.379e-01  5.018e-02  -2.748  0.00628 ** 
## weight              -1.095e-02  7.296e-04 -15.002  < 2e-16 ***
## acceleration        -9.464e-03  9.496e-02  -0.100  0.92066    
## year                 5.300e-01  1.029e-01   5.149 4.21e-07 ***
## origin              -1.084e+01  4.391e+00  -2.470  0.01396 *  
## displacement:weight  1.093e-05  5.675e-06   1.926  0.05483 .  
## year:origin          1.483e-01  5.614e-02   2.641  0.00861 ** 
## horsepower:weight    2.979e-05  1.343e-05   2.219  0.02711 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.907 on 381 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8613 
## F-statistic: 243.8 on 10 and 381 DF,  p-value: < 2.2e-16

Yes. At the 0.05 level, the interaction between year and origin (p = 0.0086) and the interaction between horsepower and weight (p = 0.027) are statistically significant. The interaction between displacement and weight is only marginal (p = 0.055).
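For reference, R's formula syntax offers two ways to write interactions: `a:b` adds only the interaction term, while `a * b` expands to `a + b + a:b`. The sketch below illustrates this on the built-in `mtcars` data, used here as a stand-in so the example runs without ISLR2; it is not the Auto model above.

```r
# `wt * hp` is shorthand for `wt + hp + wt:hp`
fit_star  <- lm(mpg ~ wt * hp, data = mtcars)
fit_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Both formulas produce the same coefficients
same_fit <- all.equal(coef(fit_star), coef(fit_colon))
same_fit
```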

f) Transformations

summary(lm(mpg ~ log(weight) + log(horsepower) + log(displacement) + year + origin, data = Auto))
## 
## Call:
## lm(formula = mpg ~ log(weight) + log(horsepower) + log(displacement) + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2579 -1.8220 -0.0471  1.6804 12.8491 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       109.90829    9.79679  11.219  < 2e-16 ***
## log(weight)       -16.45826    1.78137  -9.239  < 2e-16 ***
## log(horsepower)    -3.44674    1.05602  -3.264 0.001197 ** 
## log(displacement)   0.60829    1.02418   0.594 0.552904    
## year                0.73389    0.04686  15.662  < 2e-16 ***
## origin              0.92964    0.27284   3.407 0.000725 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.088 on 386 degrees of freedom
## Multiple R-squared:  0.8455, Adjusted R-squared:  0.8435 
## F-statistic: 422.3 on 5 and 386 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ sqrt(weight) + sqrt(horsepower) + sqrt(displacement) + year + origin, data = Auto))
## 
## Call:
## lm(formula = mpg ~ sqrt(weight) + sqrt(horsepower) + sqrt(displacement) + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3541 -2.0004 -0.1574  1.7110 12.9879 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.87939    4.36666   0.888   0.3749    
## sqrt(weight)       -0.65926    0.06478 -10.177  < 2e-16 ***
## sqrt(horsepower)   -0.55801    0.21773  -2.563   0.0108 *  
## sqrt(displacement)  0.21411    0.15506   1.381   0.1681    
## year                0.73837    0.04887  15.109  < 2e-16 ***
## origin              1.15315    0.27512   4.192 3.44e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.203 on 386 degrees of freedom
## Multiple R-squared:  0.8338, Adjusted R-squared:  0.8316 
## F-statistic: 387.3 on 5 and 386 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ weight + I(weight^2) + horsepower + I(horsepower^2) + year + origin, data = Auto))
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + horsepower + I(horsepower^2) + 
##     year + origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8841 -1.7292 -0.1211  1.5860 12.1360 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.602e+00  4.082e+00   0.637  0.52426    
## weight          -1.481e-02  1.838e-03  -8.056 1.00e-14 ***
## I(weight^2)      1.568e-06  2.699e-07   5.809 1.32e-08 ***
## horsepower      -1.593e-01  2.960e-02  -5.383 1.27e-07 ***
## I(horsepower^2)  5.088e-04  1.042e-04   4.882 1.55e-06 ***
## year             7.792e-01  4.471e-02  17.426  < 2e-16 ***
## origin           6.685e-01  2.372e-01   2.818  0.00508 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.915 on 385 degrees of freedom
## Multiple R-squared:  0.8626, Adjusted R-squared:  0.8605 
## F-statistic:   403 on 6 and 385 DF,  p-value: < 2.2e-16

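Among the three transformed models, the quadratic one fits best (adjusted R² of 0.8605, versus 0.8435 for the log model and 0.8316 for the square-root model), and both weight² and horsepower² are significant. The log and square-root transformations also improve on the untransformed model from part c) (adjusted R² of 0.8182). In the log and square-root models, displacement is no longer significant; it was not included in the quadratic model.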
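To make the comparison easier to read, the fit statistics printed in parts c), e) and f) can be collected in one table. The numbers below are copied from the summaries above; the model labels are just shorthand for this sketch.

```r
# Adjusted R-squared and residual standard error from the printed summaries
fits <- data.frame(
  model  = c("linear (c)", "interactions (e)", "log (f)", "sqrt (f)", "quadratic (f)"),
  adj_r2 = c(0.8182, 0.8613, 0.8435, 0.8316, 0.8605),
  rse    = c(3.328, 2.907, 3.088, 3.203, 2.915),
  stringsAsFactors = FALSE
)

# Sort from best to worst adjusted R-squared
fits_sorted <- fits[order(-fits$adj_r2), ]
fits_sorted
```

On these figures the quadratic model is the best of the transformed models, and the interaction model from part e) is marginally ahead of it.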

Question 10

a) Predict Sales using Price, Urban and US

carseats.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b) Interpretation

Sales is measured in thousands of units. Holding everything else constant, each $1 increase in price is associated with a decrease of about 54.5 units in sales. Urban stores sell about 21.9 fewer units than rural stores, but this difference is not statistically significant (p = 0.936). US stores sell about 1,200 more units than stores outside the US.

c) Equation

\[\hat{Sales} = 13.04 - 0.054(Price) - 0.022(Urban) + 1.201(US)\] Where Urban = 1 if Yes, 0 if No; US = 1 if Yes, 0 if No.
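The fitted equation can be written as a small function to see how the dummy variables enter. This is a sketch using the unrounded coefficients from the summary in part a); `predict_sales` and the example store (Price = 100, urban, in the US) are hypothetical.

```r
# Predicted Sales (in thousands of units) from the fitted equation
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +   # dummy: 1 if Urban = "Yes", else 0
    1.200573 * (us == "Yes")        # dummy: 1 if US = "Yes", else 0
}

pred <- predict_sales(price = 100, urban = "Yes", us = "Yes")
pred
```

With the fitted model object available, `predict(carseats.fit, newdata = ...)` would give the same kind of prediction directly.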

d) Rejecting the null hypothesis

Price and US are the predictors for which I can reject the null hypothesis \(H_0: \beta_j = 0\).

e) Smaller model using only Price and US

carseats.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(carseats.fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f) Comparison of the 2 models

The models are very similar. Model a) (including Urban) has a slightly higher RSE (2.472 versus 2.469), while model e) (excluding Urban) has a slightly higher adjusted R² (0.2354 versus 0.2335). Both explain only about 24% of the variance in Sales.

g) 95% confidence intervals

The 95% confidence interval for Price is -0.0648 to -0.0442 and for USYes is 0.6915 to 1.7078.
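These intervals can be reproduced by hand from the estimates and standard errors printed for model e), as estimate ± t-quantile × SE with 397 residual degrees of freedom. The sketch below uses the rounded printed values, so it matches only to about four decimals.

```r
# Estimates and standard errors from the summary of carseats.fit2
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)
df_resid <- 397

# Two-sided 95% interval: estimate +/- t_{0.975, df} * SE
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```

With the fitted model in the workspace, `confint(carseats.fit2)` computes the same intervals directly.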

h) Outliers or High Leverage Points

No observations lie beyond the Cook's distance contours in the residuals-versus-leverage plot, and no points stand out with extreme leverage.
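A numerical check can complement the visual one. The sketch below shows one common approach on simulated data with a planted high-leverage point (not the Carseats data); the cutoffs `2 * (p + 1) / n` for leverage and 1 for Cook's distance are rules of thumb assumed here, not thresholds taken from the exercise.

```r
# Simulated data with one planted high-leverage point (illustration only)
set.seed(42)
n <- 50
x <- rnorm(n)
x[n] <- 10                      # far from the other x values
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

p    <- length(coef(fit)) - 1   # number of predictors
lev  <- hatvalues(fit)          # leverage statistics
cook <- cooks.distance(fit)     # Cook's distances

high_leverage <- which(lev > 2 * (p + 1) / n)   # rule-of-thumb cutoff
influential   <- which(cook > 1)                # rule-of-thumb cutoff
high_leverage
influential
```

Applying the same calls to `carseats.fit2` would list any observations worth a closer look.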

Question 11

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit.a <- lm(y ~ x + 0)
fit.b <- lm(x ~ y + 0)
summary(fit.a)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9154 -0.6472 -0.1771  0.5056  2.3109 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   1.9939     0.1065   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
summary(fit.b)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8699 -0.2368  0.1030  0.2858  0.8938 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.39111    0.02089   18.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared:  0.7798, Adjusted R-squared:  0.7776 
## F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16

a) Regression of Y on X

\(\hat{\beta} = 1.9939\), SE = 0.1065, t = 18.73, p < 2e-16. We reject H₀: β = 0. The t-statistic is large and the p-value is small, which is strong evidence of a relationship between x and y.

b) Regression of X on Y

\(\hat{\beta} = 0.3911\), SE = 0.0209, t = 18.73, p < 2e-16. We reject H₀: β = 0. The t-statistic is large and the p-value is small, which is strong evidence of a relationship between x and y.

c) Comparing relationships

The t-statistics and p-values are the same for both a) and b). In a regression without an intercept, the t-statistic depends on x and y only through \(\sum x_i^2\), \(\sum y_i^2\) and \(\sum x_i y_i\), in a way that is symmetric in the two variables, so the test of association does not depend on which variable is treated as the response. Only the coefficient estimates and their standard errors change, which changes the interpretation of each model.

d, e and f) Algebraic Reasoning

The t-statistic is: \[t = \frac{\hat{\beta}}{SE(\hat{\beta})}\]

We are given two formulas. First, the coefficient estimate without an intercept: \[\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}\]

Second, the standard error: \[SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n}(y_i - x_i\hat{\beta})^2}{(n-1)\sum_{i=1}^{n} x_i^2}}\]

Step 1: Expand \((y_i - x_i\hat{\beta})^2\) in the numerator of the SE:

\[\sum(y_i - x_i\hat{\beta})^2 = \sum y_i^2 - 2\hat{\beta}\sum x_i y_i + \hat{\beta}^2 \sum x_i^2\]

Step 2: Substitute \(\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}\):

\[= \sum y_i^2 - 2\frac{\sum x_i y_i}{\sum x_i^2}\sum x_i y_i + \left(\frac{\sum x_i y_i}{\sum x_i^2}\right)^2 \sum x_i^2\]

\[= \sum y_i^2 - 2\frac{(\sum x_i y_i)^2}{\sum x_i^2} + \frac{(\sum x_i y_i)^2}{\sum x_i^2}\]

\[= \sum y_i^2 - \frac{(\sum x_i y_i)^2}{\sum x_i^2}\]

Step 3: Plug back into the SE:

\[SE(\hat{\beta}) = \sqrt{\frac{\sum y_i^2 - \frac{(\sum x_i y_i)^2}{\sum x_i^2}}{(n-1)\sum x_i^2}}\]

\[= \sqrt{\frac{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}{(n-1)(\sum x_i^2)^2}}\]

Step 4: Compute \(t = \hat{\beta} / SE(\hat{\beta})\):

\[t = \frac{\frac{\sum x_i y_i}{\sum x_i^2}}{\sqrt{\frac{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}{(n-1)(\sum x_i^2)^2}}}\]

The \(\sum x_i^2\) terms cancel, leaving:

\[\boxed{t = \frac{\sqrt{n-1} \sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}}}\]

This expression is symmetric in x and y. Swapping x and y leaves it unchanged, which is why the t-statistic is identical whether we regress Y onto X or X onto Y.

Numerical confirmation in R:

n <- 100
t.manual <- (sqrt(n-1) * sum(x*y)) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
t.manual  # should match t-statistics from (a) and (b): 18.72593
## [1] 18.72593

The result t.manual = 18.72593 matches both regression outputs.
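For part f), the same symmetry appears to hold when an intercept is included. The sketch below regenerates the same simulated data and compares the slope t-statistics from the two regressions with intercepts; the object names `t_yx` and `t_xy` are introduced here for illustration.

```r
# Same simulated data as above
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Slope t-statistics from regressions that include an intercept
t_yx <- summary(lm(y ~ x))$coefficients["x", "t value"]
t_xy <- summary(lm(x ~ y))$coefficients["y", "t value"]

c(y_onto_x = t_yx, x_onto_y = t_xy)
all.equal(t_yx, t_xy)
```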

Question 12

a) When are coefficient estimates the same when X and Y are regressed inversely?

The two coefficient estimates are equal when \(\sum x_i^2 = \sum y_i^2\), i.e. when X and Y have the same sum of squares. This is because:

\[\hat{\beta}_{Y \text{ onto } X} = \frac{\sum x_i y_i}{\sum x_i^2}\]

\[\hat{\beta}_{X \text{ onto } Y} = \frac{\sum x_i y_i}{\sum y_i^2}\]

The numerators are identical, so, provided \(\sum x_i y_i \neq 0\), the estimates are equal exactly when the denominators are equal.

b) Different Coefficients with Inverse Regression

Here we generate \(x\) and \(y\) where \(y = 2x + \varepsilon\), which makes \(\sum x_i^2 \neq \sum y_i^2\), giving different coefficient estimates.

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

fit.yx <- lm(y ~ x + 0)
fit.xy <- lm(x ~ y + 0)

coef(fit.yx)  # beta for y onto x
##        x 
## 1.993876
coef(fit.xy)  # beta for x onto y
##         y 
## 0.3911145

The coefficient for Y onto X (\(\approx 1.99\)) differs from X onto Y (\(\approx 0.39\)) because the noise term \(\varepsilon\) and the scaling factor of 2 cause \(\sum y_i^2 \neq \sum x_i^2\).

c) Same Coefficients with Inverse Regression

Here we set \(y = x\), which makes \(\sum x_i^2 = \sum y_i^2\), giving identical coefficient estimates.

set.seed(1)
x <- rnorm(100)
y <- x

fit.yx2 <- lm(y ~ x + 0)
fit.xy2 <- lm(x ~ y + 0)

coef(fit.yx2)  # beta for y onto x
## x 
## 1
coef(fit.xy2)  # beta for x onto y
## y 
## 1

Both coefficients equal exactly 1 because when \(y = x\), the numerator \(\sum x_i y_i = \sum x_i^2 = \sum y_i^2\), so both formulas reduce to:

\[\hat{\beta} = \frac{\sum x_i^2}{\sum x_i^2} = 1\]
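The condition \(\sum x_i^2 = \sum y_i^2\) does not require \(y = x\). As a less trivial sketch, taking y to be a random permutation of x keeps the sum of squares unchanged, so the two coefficients should again agree without being equal to 1. The use of `sample()` here is an illustrative choice, not part of the exercise.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- unname(coef(lm(y ~ x + 0)))  # beta for y onto x
b_xy <- unname(coef(lm(x ~ y + 0)))  # beta for x onto y

c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))
c(b_yx = b_yx, b_xy = b_xy)
all.equal(b_yx, b_xy)
```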