library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.5.3
The KNN classifier is used when the response Y is categorical (for example, is this email spam or not?). It finds the K training observations nearest to the test point, estimates the conditional probability of each class as the fraction of those K neighbors that belong to it, and assigns the test point to the class with the highest estimated probability. KNN regression is used when Y is quantitative. It finds the same K nearest neighbors and predicts the average of their responses.
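The difference between the two methods can be sketched in a few lines of base R. This is only an illustration: the helper names (`nearest_k`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, and Euclidean distance is an assumption.

```r
# Minimal KNN sketch in base R (hypothetical helpers, not from any package).
# train_x: numeric matrix of training predictors; x0: one test point.
nearest_k <- function(train_x, x0, k) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]
}

knn_classify <- function(train_x, train_y, x0, k) {
  idx <- nearest_k(train_x, x0, k)
  # Estimated conditional probabilities: class proportions among the k neighbors
  probs <- table(train_y[idx]) / k
  names(probs)[which.max(probs)]  # ties go to the first class in table order
}

knn_regress <- function(train_x, train_y, x0, k) {
  idx <- nearest_k(train_x, x0, k)
  mean(train_y[idx])  # average response of the k neighbors
}

# Toy data: two well-separated groups in two dimensions
train_x <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
labels  <- c("not spam", "not spam", "not spam", "spam", "spam", "spam")
values  <- c(1, 2, 3, 10, 11, 12)

knn_classify(train_x, labels, x0 = c(0.5, 0.5), k = 3)  # "not spam"
knn_regress(train_x, values, x0 = c(5.5, 5.5), k = 3)   # 11
```

Both functions share the neighbor search; only the final step differs (majority vote versus mean).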
pairs(Auto)
pairs(Auto[, -9]) # no 'name' column
cor(Auto[, -9]) # no 'name' column
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes. The F-statistic is large (252.4) and its p-value is very small (< 2.2e-16), so we reject the null hypothesis that all the coefficients are zero and conclude that at least one predictor is related to mpg.
Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin are statistically significant at the 5% level; cylinders, horsepower, and acceleration are not.
What does the coefficient for the year variable suggest? The year coefficient of 0.750773 suggests that, holding all other predictors constant, each additional model year is associated with an increase of about 0.75 mpg, so newer cars tend to be more fuel-efficient.
par(mfrow = c(2, 2))
plot(lm.fit)
The red smoothing line in the Residuals vs Fitted plot is U-shaped, which suggests that the true relationship is non-linear. Point 14 has high leverage. Points 327 and 394 are flagged as potential outliers.
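The visual reading of the diagnostic plots can be backed up numerically. The sketch below refits the same model and lists the largest leverage values and any large studentized residuals. The cutoff of 3 for studentized residuals is a common rule of thumb, not something taken from the output above.

```r
library(ISLR2)  # provides the Auto data set

lm.fit <- lm(mpg ~ . - name, data = Auto)

# Leverage (hat values); their average is always (p + 1) / n
lev <- hatvalues(lm.fit)
avg.lev <- length(coef(lm.fit)) / nrow(Auto)
head(sort(lev, decreasing = TRUE), 3)  # largest leverage values, named by row
avg.lev                                # baseline for comparison

# Studentized residuals; |value| > 3 is an assumed rule-of-thumb cutoff
stud <- rstudent(lm.fit)
stud[abs(stud) > 3]
```

The names printed by `sort()` are the row names of `Auto`, which are the same labels that `plot(lm.fit)` uses for flagged points.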
summary(lm(mpg ~ . - name + displacement:weight + year:origin + horsepower:weight, data = Auto))
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + year:origin +
## horsepower:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4370 -1.6736 -0.0609 1.4670 11.8289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.847e+01 8.064e+00 2.291 0.02251 *
## cylinders 4.277e-02 2.899e-01 0.148 0.88278
## displacement -3.328e-02 2.044e-02 -1.629 0.10421
## horsepower -1.379e-01 5.018e-02 -2.748 0.00628 **
## weight -1.095e-02 7.296e-04 -15.002 < 2e-16 ***
## acceleration -9.464e-03 9.496e-02 -0.100 0.92066
## year 5.300e-01 1.029e-01 5.149 4.21e-07 ***
## origin -1.084e+01 4.391e+00 -2.470 0.01396 *
## displacement:weight 1.093e-05 5.675e-06 1.926 0.05483 .
## year:origin 1.483e-01 5.614e-02 2.641 0.00861 **
## horsepower:weight 2.979e-05 1.343e-05 2.219 0.02711 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.907 on 381 degrees of freedom
## Multiple R-squared: 0.8649, Adjusted R-squared: 0.8613
## F-statistic: 243.8 on 10 and 381 DF, p-value: < 2.2e-16
Yes. The year:origin interaction (p = 0.0086) and the horsepower:weight interaction (p = 0.027) are statistically significant at the 5% level. The displacement:weight interaction (p = 0.055) is not.
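The individual t-tests can be complemented by a joint test of all three interaction terms. A minimal sketch, assuming a nested-model F-test via `anova()` is an acceptable way to ask whether the interactions improve the fit as a group (the names `fit.main` and `fit.int` are introduced here for illustration):

```r
library(ISLR2)

# Main-effects model and the model with the three interaction terms added
fit.main <- lm(mpg ~ . - name, data = Auto)
fit.int  <- lm(mpg ~ . - name + displacement:weight + year:origin +
                 horsepower:weight, data = Auto)

# F-test of H0: all three interaction coefficients are zero
tab <- anova(fit.main, fit.int)
tab
```

The residual degrees of freedom in the table (384 and 381) should agree with the two summaries above.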
summary(lm(mpg ~ log(weight) + log(horsepower) + log(displacement) + year + origin, data = Auto))
##
## Call:
## lm(formula = mpg ~ log(weight) + log(horsepower) + log(displacement) +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2579 -1.8220 -0.0471 1.6804 12.8491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.90829 9.79679 11.219 < 2e-16 ***
## log(weight) -16.45826 1.78137 -9.239 < 2e-16 ***
## log(horsepower) -3.44674 1.05602 -3.264 0.001197 **
## log(displacement) 0.60829 1.02418 0.594 0.552904
## year 0.73389 0.04686 15.662 < 2e-16 ***
## origin 0.92964 0.27284 3.407 0.000725 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.088 on 386 degrees of freedom
## Multiple R-squared: 0.8455, Adjusted R-squared: 0.8435
## F-statistic: 422.3 on 5 and 386 DF, p-value: < 2.2e-16
summary(lm(mpg ~ sqrt(weight) + sqrt(horsepower) + sqrt(displacement) + year + origin, data = Auto))
##
## Call:
## lm(formula = mpg ~ sqrt(weight) + sqrt(horsepower) + sqrt(displacement) +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3541 -2.0004 -0.1574 1.7110 12.9879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.87939 4.36666 0.888 0.3749
## sqrt(weight) -0.65926 0.06478 -10.177 < 2e-16 ***
## sqrt(horsepower) -0.55801 0.21773 -2.563 0.0108 *
## sqrt(displacement) 0.21411 0.15506 1.381 0.1681
## year 0.73837 0.04887 15.109 < 2e-16 ***
## origin 1.15315 0.27512 4.192 3.44e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.203 on 386 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8316
## F-statistic: 387.3 on 5 and 386 DF, p-value: < 2.2e-16
summary(lm(mpg ~ weight + I(weight^2) + horsepower + I(horsepower^2) + year + origin, data = Auto))
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + horsepower + I(horsepower^2) +
## year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8841 -1.7292 -0.1211 1.5860 12.1360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.602e+00 4.082e+00 0.637 0.52426
## weight -1.481e-02 1.838e-03 -8.056 1.00e-14 ***
## I(weight^2) 1.568e-06 2.699e-07 5.809 1.32e-08 ***
## horsepower -1.593e-01 2.960e-02 -5.383 1.27e-07 ***
## I(horsepower^2) 5.088e-04 1.042e-04 4.882 1.55e-06 ***
## year 7.792e-01 4.471e-02 17.426 < 2e-16 ***
## origin 6.685e-01 2.372e-01 2.818 0.00508 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.915 on 385 degrees of freedom
## Multiple R-squared: 0.8626, Adjusted R-squared: 0.8605
## F-statistic: 403 on 6 and 385 DF, p-value: < 2.2e-16
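Among the transformed models, the quadratic one fits best (adjusted R² = 0.8605), with both weight² and horsepower² being significant. The log model (0.8435) and the square-root model (0.8316) also improve on the original linear fit (0.8182). In the log and square-root models, displacement is no longer significant once weight and horsepower are transformed; the quadratic model does not include displacement.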
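To compare the fits side by side instead of reading four separate summaries, the adjusted R² values can be collected into one vector. This is a sketch; the list name `fits` and its labels are chosen here for illustration.

```r
library(ISLR2)

# The original linear model and the three transformed models from above
fits <- list(
  linear    = lm(mpg ~ . - name, data = Auto),
  log       = lm(mpg ~ log(weight) + log(horsepower) + log(displacement) +
                   year + origin, data = Auto),
  sqrt      = lm(mpg ~ sqrt(weight) + sqrt(horsepower) + sqrt(displacement) +
                   year + origin, data = Auto),
  quadratic = lm(mpg ~ weight + I(weight^2) + horsepower + I(horsepower^2) +
                   year + origin, data = Auto)
)

# Adjusted R-squared for each model
adj.r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
round(adj.r2, 4)
```

Because the models use different sets of predictors, this comparison is only a rough guide to which transformation helps most.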
# Question 10
## a) Predict Sales using Price, Urban and US
carseats.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
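Sales is recorded in thousands of units. Holding everything else constant, each $1 increase in price is associated with a decrease in sales of about 54.5 units (0.0545 thousand). Urban stores sell about 21.9 fewer units than non-urban stores, but this difference is not statistically significant (p = 0.936). US stores sell about 1,200 more units than stores outside the US.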
\[\widehat{\text{Sales}} = 13.04 - 0.054\,\text{Price} - 0.022\,\text{Urban} + 1.201\,\text{US}\]
where Urban = 1 if Yes and 0 if No, and US = 1 if Yes and 0 if No.
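The fitted equation can be evaluated by hand to see the units at work. In the sketch below the coefficients are copied from the `summary(carseats.fit)` output; the store (Price = 100, urban, in the US) is a made-up example. `Sales` in the `Carseats` data is in thousands of units.

```r
# Coefficients copied from the summary(carseats.fit) output above
b <- c(intercept = 13.043469, price = -0.054459,
       urban = -0.021916, us = 1.200573)

# Hypothetical store: Price = 100, Urban = Yes (1), US = Yes (1)
pred <- b[["intercept"]] + b[["price"]] * 100 + b[["urban"]] * 1 + b[["us"]] * 1
pred         # predicted Sales in thousands of units
pred * 1000  # the same prediction in units

# Effect of a $1 price increase, converted from thousands of units to units
b[["price"]] * 1000
```

With the fitted model object available, `predict(carseats.fit, newdata = ...)` would give the same prediction without copying coefficients.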
We can reject the null hypothesis \(H_0: \beta_j = 0\) for Price and US, but not for Urban.
carseats.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(carseats.fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
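The two models fit almost equally well; both explain about 24% of the variance in Sales (R² = 0.2393). The model from a), which includes Urban, has a slightly higher RSE (2.472 vs. 2.469), while the model from e), which excludes Urban, has a slightly higher adjusted R² (0.2354 vs. 0.2335).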
The 95% confidence interval for Price is -0.0648 to -0.0442 and for USYes is 0.6915 to 1.7078.
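These intervals follow from estimate ± t-quantile × standard error. A minimal sketch that rebuilds them from the numbers printed in `summary(carseats.fit2)` (the vector names `est` and `se` are introduced here for illustration):

```r
# Estimates and standard errors copied from summary(carseats.fit2)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df  <- 397  # residual degrees of freedom from the summary

t.crit <- qt(0.975, df)  # two-sided 95% critical value
ci <- cbind(lower = est - t.crit * se, upper = est + t.crit * se)
round(ci, 4)
```

On the fitted object itself, `confint(carseats.fit2)` computes the same intervals directly, including the one for the intercept.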
No observation has a Cook’s distance large enough to flag it as influential, and there are no extreme high-leverage values.
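One way to check this numerically is to compute Cook's distance and leverage for every observation. The sketch below assumes the usual plotting contours at 0.5 and 1 for Cook's distance and a "three times the average leverage" cutoff; both are conventions, not values taken from the output above.

```r
library(ISLR2)

carseats.fit2 <- lm(Sales ~ Price + US, data = Carseats)

cd  <- cooks.distance(carseats.fit2)
lev <- hatvalues(carseats.fit2)

max(cd)  # compare with the conventional 0.5 and 1 contours

# Average leverage is (p + 1) / n; count points above 3x that (assumed cutoff)
avg.lev <- length(coef(carseats.fit2)) / nrow(Carseats)
sum(lev > 3 * avg.lev)

range(rstudent(carseats.fit2))  # studentized residuals, to screen for outliers
```

The same information appears graphically in the Residuals vs Leverage panel of `plot(carseats.fit2)`.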
# Question 11
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit.a <- lm(y ~ x + 0)
fit.b <- lm(x ~ y + 0)
summary(fit.a)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.b)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
a) Regressing y onto x: \(\hat{\beta} = 1.9939\), SE = 0.1065, t = 18.73, p < 2e-16. We reject \(H_0: \beta = 0\). The large t-statistic and small p-value give strong evidence of a relationship between x and y.
b) Regressing x onto y: \(\hat{\beta} = 0.3911\), SE = 0.0209, t = 18.73, p < 2e-16. We again reject \(H_0: \beta = 0\), with the same strength of evidence.
The t-statistics and p-values are identical in a) and b) because both regressions measure the same linear association between x and y. Without an intercept, the t-statistic depends on the data only through \(\sum x_i y_i\), \(\sum x_i^2\), and \(\sum y_i^2\), and it is unchanged when x and y are swapped. Only the coefficient estimate and its standard error change, which affects the interpretation of the slope but not the strength of the evidence.
The t-statistic is: \[t = \frac{\hat{\beta}}{SE(\hat{\beta})}\]
We are given two formulas. First, the coefficient estimate without an intercept: \[\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}\]
Second, the standard error: \[SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n}(y_i - x_i\hat{\beta})^2}{(n-1)\sum_{i=1}^{n} x_i^2}}\]
Step 1: Substitute \(\hat{\beta}\) into the SE formula
Expanding \((y_i - x_i\hat{\beta})^2\) in the numerator of the SE:
\[\sum(y_i - x_i\hat{\beta})^2 = \sum y_i^2 - 2\hat{\beta}\sum x_i y_i + \hat{\beta}^2 \sum x_i^2\]
Step 2: Substitute \(\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}\)
\[= \sum y_i^2 - 2\frac{\sum x_i y_i}{\sum x_i^2}\sum x_i y_i + \left(\frac{\sum x_i y_i}{\sum x_i^2}\right)^2 \sum x_i^2\]
\[= \sum y_i^2 - 2\frac{(\sum x_i y_i)^2}{\sum x_i^2} + \frac{(\sum x_i y_i)^2}{\sum x_i^2}\]
\[= \sum y_i^2 - \frac{(\sum x_i y_i)^2}{\sum x_i^2}\]
Step 3: Plug back into SE
\[SE(\hat{\beta}) = \sqrt{\frac{\sum y_i^2 - \frac{(\sum x_i y_i)^2}{\sum x_i^2}}{(n-1)\sum x_i^2}}\]
\[= \sqrt{\frac{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}{(n-1)(\sum x_i^2)^2}}\]
Step 4: Compute \(t = \hat{\beta} / SE(\hat{\beta})\)
\[t = \frac{\frac{\sum x_i y_i}{\sum x_i^2}}{\sqrt{\frac{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}{(n-1)(\sum x_i^2)^2}}}\]
The \(\sum x_i^2\) terms cancel, leaving:
\[\boxed{t = \frac{\sqrt{n-1} \sum x_i y_i}{\sqrt{\sum x_i^2 \sum y_i^2 - (\sum x_i y_i)^2}}}\]
This expression is symmetric in x and y. Swapping x and y leaves it unchanged, which is why the t-statistic is identical whether we regress Y onto X or X onto Y.
Numerical confirmation in R:
n <- 100
t.manual <- (sqrt(n-1) * sum(x*y)) / sqrt(sum(x^2) * sum(y^2) - sum(x*y)^2)
t.manual # should match t-statistics from (a) and (b): 18.72593
## [1] 18.72593
The result t.manual = 18.72593 matches both regression outputs.
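The two formulas the derivation starts from, for \(\hat{\beta}\) and \(SE(\hat{\beta})\) in a regression without an intercept, can be checked against `lm()` in the same way. A minimal sketch, regenerating the same simulated data (the names `beta.hat`, `se.hat`, and `coefs` are introduced here):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
n <- length(x)

# Closed-form estimate and standard error for regression through the origin
beta.hat <- sum(x * y) / sum(x^2)
se.hat   <- sqrt(sum((y - x * beta.hat)^2) / ((n - 1) * sum(x^2)))

# Compare with the values reported by lm()
coefs <- summary(lm(y ~ x + 0))$coefficients
all.equal(beta.hat, coefs["x", "Estimate"])
all.equal(se.hat,   coefs["x", "Std. Error"])

beta.hat / se.hat  # the t-statistic, rebuilt from the two formulas
```

The denominator uses n - 1 because a model without an intercept estimates a single parameter.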
# Question 12
## a) When are the coefficient estimates the same for the regression of Y onto X and of X onto Y?
The two coefficient estimates are equal when \(\sum x_i^2 = \sum y_i^2\), i.e. when X and Y have the same sum of squares. This is because:
\[\hat{\beta}_{Y \text{ onto } X} = \frac{\sum x_i y_i}{\sum x_i^2}\]
\[\hat{\beta}_{X \text{ onto } Y} = \frac{\sum x_i y_i}{\sum y_i^2}\]
The numerators are identical, so the estimates are equal only when the denominators are equal (apart from the degenerate case \(\sum x_i y_i = 0\), where both estimates are zero).
Here we generate \(x\) and \(y\) where \(y = 2x + \varepsilon\), which makes \(\sum x_i^2 \neq \sum y_i^2\), giving different coefficient estimates.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit.yx <- lm(y ~ x + 0)
fit.xy <- lm(x ~ y + 0)
coef(fit.yx) # beta for y onto x
## x
## 1.993876
coef(fit.xy) # beta for x onto y
## y
## 0.3911145
The coefficient for Y onto X (\(\approx 1.99\)) differs from X onto Y (\(\approx 0.39\)) because the noise term \(\varepsilon\) and the scaling factor of 2 cause \(\sum y_i^2 \neq \sum x_i^2\).
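The role of the two denominators can be made explicit by computing the sums of squares directly. A short sketch on the same simulated data (the names `b.yx` and `b.xy` are introduced here):

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

c(sum.x2 = sum(x^2), sum.y2 = sum(y^2))  # the two denominators differ

b.yx <- sum(x * y) / sum(x^2)  # y onto x
b.xy <- sum(x * y) / sum(y^2)  # x onto y
c(b.yx, b.xy)

# With a shared numerator, the ratio of the estimates is sum(y^2) / sum(x^2)
all.equal(b.yx / b.xy, sum(y^2) / sum(x^2))
```

The ratio of the two estimates therefore measures how far apart the two sums of squares are.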
Here we set \(y = x\), which makes \(\sum x_i^2 = \sum y_i^2\), giving identical coefficient estimates.
set.seed(1)
x <- rnorm(100)
y <- x
fit.yx2 <- lm(y ~ x + 0)
fit.xy2 <- lm(x ~ y + 0)
coef(fit.yx2) # beta for y onto x
## x
## 1
coef(fit.xy2) # beta for x onto y
## y
## 1
Both coefficients equal exactly 1 because when \(y = x\), the numerator \(\sum x_i y_i = \sum x_i^2 = \sum y_i^2\), so both formulas reduce to:
\[\hat{\beta} = \frac{\sum x_i^2}{\sum x_i^2} = 1\]
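Setting \(y = x\) is the simplest case, but the condition only requires equal sums of squares. As a less trivial sketch, assume y is a random reordering of x: the values are the same, so \(\sum y_i^2 = \sum x_i^2\), yet y is no longer a copy of x.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)  # same values in a different order

all.equal(sum(x^2), sum(y^2))  # equal sums of squares

b.yx <- unname(coef(lm(y ~ x + 0)))  # y onto x
b.xy <- unname(coef(lm(x ~ y + 0)))  # x onto y
all.equal(b.yx, b.xy)                # the two estimates coincide
```

Here the common estimate is generally not 1, which shows that the equality comes from the matching denominators and not from x and y being identical.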