Both methods use a distance metric (usually Euclidean) to identify neighbors, and the choice of \(K\) affects the bias-variance trade-off in similar ways.
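To make the shared machinery concrete, here is a minimal base-R sketch. The function names `knn_neighbors`, `knn_classify`, and `knn_regress` and the toy data are illustrative assumptions, not part of the exercise. Both methods run the same Euclidean neighbor search; the classifier then takes a majority vote, while the regression averages the neighbors' responses.

```r
# Indices of the k training rows closest to x0 (Euclidean distance)
knn_neighbors <- function(train_x, x0, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]
}

# KNN classifier: majority vote among the k neighbors
knn_classify <- function(train_x, train_y, x0, k) {
  votes <- table(train_y[knn_neighbors(train_x, x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: mean response of the k neighbors
knn_regress <- function(train_x, train_y, x0, k) {
  mean(train_y[knn_neighbors(train_x, x0, k)])
}

# Toy data (hypothetical): two well-separated clusters
train_x <- matrix(c(0, 0,  1, 0,  0, 1,  5, 5,  6, 5,  5, 6),
                  ncol = 2, byrow = TRUE)
labels  <- c("a", "a", "a", "b", "b", "b")
values  <- c(1, 2, 3, 10, 11, 12)

pred_class <- knn_classify(train_x, labels, c(0.5, 0.5), k = 3)
pred_value <- knn_regress(train_x, values, c(0.5, 0.5), k = 3)
```

A small k gives a flexible, high-variance fit in both cases, and a large k gives a smoother, higher-bias fit.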
library(ISLR2)
Warning: package ‘ISLR2’ was built under R version 4.4.2
pairs(Auto[, -9]) # Exclude 'name' as it's qualitative
cor_matrix <- cor(Auto[, -9]) # Compute correlation matrix without 'name'
print(cor_matrix)
                    mpg  cylinders displacement horsepower     weight acceleration       year     origin
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442    0.4233285  0.5805410  0.5652088
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273   -0.5046834 -0.3456474 -0.5689316
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944   -0.5438005 -0.3698552 -0.6145351
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377   -0.6891955 -0.4163615 -0.4551715
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000   -0.4168392 -0.3091199 -0.5850054
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392    1.0000000  0.2903161  0.2127458
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199    0.2903161  1.0000000  0.1815277
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054    0.2127458  0.1815277  1.0000000
lm_fit <- lm(mpg ~ . -name, data = Auto) # Exclude 'name' from predictors
summary(lm_fit)
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The predictors that are statistically significant at the 0.05 level are displacement (positive), weight (negative), year (positive), and origin (positive); cylinders, horsepower, and acceleration are not. The significant variables are important in explaining variation in mpg.
The coefficient for year suggests that newer cars are generally more fuel-efficient: holding the other predictors fixed, mpg increases by approximately 0.75 for each additional model year.
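The year effect can also be pulled out of the fitted object directly. The sketch below refits the same model and assumes the ISLR2 package is installed; the names `year_coef`, `year_ci`, and `decade_gain` are illustrative.

```r
library(ISLR2)  # provides the Auto data set

lm_fit <- lm(mpg ~ . - name, data = Auto)

# Point estimate of the year effect, other predictors held fixed
year_coef <- coef(lm_fit)["year"]

# 95% confidence interval for the same coefficient
year_ci <- confint(lm_fit, "year", level = 0.95)

# Implied change in mpg over ten model years, all else equal
decade_gain <- 10 * year_coef

print(year_coef)
print(year_ci)
print(decade_gain)
```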
par(mfrow=c(2,2)) # Arrange plots in a 2x2 grid
plot(lm_fit)
Inference:
x <- subset(Auto, select = -name)
summary(lm(mpg ~ . + cylinders*displacement, data = x))
Call:
lm(formula = mpg ~ . + cylinders * displacement, data = x)
Residuals:
Min 1Q Median 3Q Max
-11.6081 -1.7833 -0.0465 1.6821 12.2617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.7096590 4.6858582 -0.578 0.563426
cylinders -2.6962123 0.4094916 -6.584 1.51e-10 ***
displacement -0.0774797 0.0141535 -5.474 7.96e-08 ***
horsepower -0.0476026 0.0133736 -3.559 0.000418 ***
weight -0.0052339 0.0006253 -8.370 1.10e-15 ***
acceleration 0.0597997 0.0918038 0.651 0.515188
year 0.7594500 0.0473354 16.044 < 2e-16 ***
origin 0.7087399 0.2736917 2.590 0.009976 **
cylinders:displacement 0.0136081 0.0017209 7.907 2.84e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.089 on 383 degrees of freedom
Multiple R-squared: 0.8465, Adjusted R-squared: 0.8433
F-statistic: 264.1 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . + displacement:horsepower, data = x))
Call:
lm(formula = mpg ~ . + displacement:horsepower, data = x)
Residuals:
Min 1Q Median 3Q Max
-8.7010 -1.6009 -0.0967 1.4119 12.6734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.894e+00 4.302e+00 -0.440 0.66007
cylinders 6.466e-01 3.017e-01 2.143 0.03275 *
displacement -7.487e-02 1.092e-02 -6.859 2.80e-11 ***
horsepower -1.975e-01 2.052e-02 -9.624 < 2e-16 ***
weight -3.147e-03 6.475e-04 -4.861 1.71e-06 ***
acceleration -2.131e-01 9.062e-02 -2.351 0.01921 *
year 7.379e-01 4.463e-02 16.534 < 2e-16 ***
origin 6.891e-01 2.527e-01 2.727 0.00668 **
displacement:horsepower 5.236e-04 4.813e-05 10.878 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.912 on 383 degrees of freedom
Multiple R-squared: 0.8636, Adjusted R-squared: 0.8608
F-statistic: 303.1 on 8 and 383 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . + cylinders:horsepower*weight, data = x))
Call:
lm(formula = mpg ~ . + cylinders:horsepower * weight, data = x)
Residuals:
Min 1Q Median 3Q Max
-9.2767 -1.5641 -0.0771 1.4277 11.8078
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.275e+00 5.340e+00 0.988 0.32384
cylinders -2.350e+00 8.095e-01 -2.903 0.00391 **
displacement 2.101e-03 6.948e-03 0.302 0.76253
horsepower -2.050e-01 4.858e-02 -4.220 3.06e-05 ***
weight -6.675e-03 1.135e-03 -5.879 8.99e-09 ***
acceleration -1.121e-01 9.147e-02 -1.226 0.22092
year 7.589e-01 4.494e-02 16.885 < 2e-16 ***
origin 8.046e-01 2.496e-01 3.224 0.00137 **
cylinders:horsepower 1.297e-02 1.006e-02 1.290 0.19794
cylinders:horsepower:weight 3.019e-06 1.035e-06 2.917 0.00374 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.9 on 382 degrees of freedom
Multiple R-squared: 0.8651, Adjusted R-squared: 0.8619
F-statistic: 272.2 on 9 and 382 DF, p-value: < 2.2e-16
summary(lm(mpg ~ . + weight:cylinders, data = x))
Call:
lm(formula = mpg ~ . + weight:cylinders, data = x)
Residuals:
Min 1Q Median 3Q Max
-10.9484 -1.7133 -0.1809 1.4530 12.4137
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3143478 5.0076737 1.461 0.14494
cylinders -5.0347425 0.5795767 -8.687 < 2e-16 ***
displacement 0.0156444 0.0068409 2.287 0.02275 *
horsepower -0.0314213 0.0126216 -2.489 0.01322 *
weight -0.0150329 0.0011125 -13.513 < 2e-16 ***
acceleration 0.1006438 0.0897944 1.121 0.26306
year 0.7813453 0.0464139 16.834 < 2e-16 ***
origin 0.8030154 0.2617333 3.068 0.00231 **
cylinders:weight 0.0015058 0.0001657 9.088 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.022 on 383 degrees of freedom
Multiple R-squared: 0.8531, Adjusted R-squared: 0.8501
F-statistic: 278.1 on 8 and 383 DF, p-value: < 2.2e-16
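Inference:
Several interaction effects appear to be statistically significant: cylinders:displacement (p = 2.84e-14), displacement:horsepower (p < 2e-16), cylinders:weight (p < 2e-16), and the three-way cylinders:horsepower:weight term (p = 0.00374). The two-way cylinders:horsepower term is not significant (p = 0.198).
These significant interactions indicate that the effect of one predictor on mpg depends on the level of another predictor. Adjusted R-squared rises from 0.8182 for the main-effects model to between 0.8433 and 0.8619 once an interaction is added, which suggests that models incorporating these interaction effects may capture the variability in mpg more effectively than models with only main effects.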
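The formulas above mix the `*` and `:` operators, which expand differently. As a quick check, `terms()` can list the terms a formula produces without fitting anything. The helper name `expand_terms` and the placeholder variables `a`, `b`, `c` are assumptions for illustration.

```r
# List the model terms that a formula expands to (no data required)
expand_terms <- function(f) attr(terms(f), "term.labels")

t_star  <- expand_terms(y ~ a * b)      # both main effects plus a:b
t_colon <- expand_terms(y ~ a + a:b)    # a:b added without a main effect for b
t_mixed <- expand_terms(y ~ a:b * c)    # ':' binds tighter than '*', so this is (a:b) * c

print(t_star)
print(t_colon)
print(t_mixed)
```

This is why `cylinders:horsepower*weight` produced a `cylinders:horsepower` term and a three-way `cylinders:horsepower:weight` term but no `horsepower:weight` or `cylinders:weight` terms.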
par(mfrow = c(2, 2))
plot(Auto$horsepower, Auto$mpg, cex = 0.2)
plot(log(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(sqrt(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(Auto$horsepower^2, Auto$mpg, cex = 0.2)
x <- subset(Auto, select = -name)
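x$weight <- log(x$horsepower) # NOTE: overwrites weight, so the 'weight' row below is log(horsepower) and the original weight is dropped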
fit <- lm(mpg ~ ., data = x)
summary(fit)
Call:
lm(formula = mpg ~ ., data = x)
Residuals:
Min 1Q Median 3Q Max
-8.971 -1.674 -0.204 1.526 12.277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 114.790956 9.930074 11.560 < 2e-16 ***
cylinders -0.117404 0.299897 -0.391 0.695659
displacement -0.022101 0.006438 -3.433 0.000663 ***
horsepower 0.196982 0.023060 8.542 3.13e-16 ***
weight -33.218155 2.419604 -13.729 < 2e-16 ***
acceleration -0.616145 0.081222 -7.586 2.52e-13 ***
year 0.705540 0.046227 15.262 < 2e-16 ***
origin 0.879875 0.260964 3.372 0.000823 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.055 on 384 degrees of freedom
Multiple R-squared: 0.8495, Adjusted R-squared: 0.8468
F-statistic: 309.7 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit, cex = 0.2)
x1 <- subset(Auto, select = -name)
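x1$weight <- sqrt(x1$horsepower) # NOTE: overwrites weight, so the 'weight' row below is sqrt(horsepower) and the original weight is dropped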
fit2 <- lm(mpg ~ ., data = x1)
summary(fit2)
Call:
lm(formula = mpg ~ ., data = x1)
Residuals:
Min 1Q Median 3Q Max
-8.9036 -1.6589 -0.2271 1.5274 12.1298
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.455179 6.538054 9.247 < 2e-16 ***
cylinders 0.030652 0.301660 0.102 0.919118
displacement -0.023203 0.006441 -3.602 0.000357 ***
horsepower 0.503287 0.043717 11.512 < 2e-16 ***
weight -12.973681 0.940908 -13.788 < 2e-16 ***
acceleration -0.616677 0.081104 -7.603 2.24e-13 ***
year 0.703981 0.046154 15.253 < 2e-16 ***
origin 0.899512 0.260234 3.457 0.000608 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.051 on 384 degrees of freedom
Multiple R-squared: 0.8499, Adjusted R-squared: 0.8472
F-statistic: 310.7 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit2, cex = 0.2)
x2 <- subset(Auto, select = -name)
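x2$weight <- x2$horsepower^2 # NOTE: overwrites weight, so the 'weight' row below is horsepower^2 and the original weight is dropped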
fit3 <- lm(mpg ~ ., data = x2)
summary(fit3)
Call:
lm(formula = mpg ~ ., data = x2)
Residuals:
Min 1Q Median 3Q Max
-9.2180 -1.8046 -0.1561 1.5133 11.9203
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.244e+00 4.446e+00 2.079 0.038292 *
cylinders 4.086e-01 3.133e-01 1.304 0.192950
displacement -2.548e-02 6.550e-03 -3.889 0.000118 ***
horsepower -4.201e-01 2.804e-02 -14.982 < 2e-16 ***
weight 1.262e-03 9.487e-05 13.306 < 2e-16 ***
acceleration -6.144e-01 8.209e-02 -7.485 4.95e-13 ***
year 6.992e-01 4.667e-02 14.981 < 2e-16 ***
origin 1.010e+00 2.618e-01 3.857 0.000134 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.086 on 384 degrees of freedom
Multiple R-squared: 0.8464, Adjusted R-squared: 0.8436
F-statistic: 302.4 on 7 and 384 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(fit3, cex = 0.2)
Summary of Findings on Transformations:
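The square root transformation of horsepower gave the best fit (adjusted R-squared 0.8472), narrowly ahead of the log transformation (0.8468) and the squared transformation (0.8436), so the three are nearly indistinguishable. All three improve on the untransformed model (0.8182). However, each transformed variable was stored in the weight column, so these models also drop the original weight, and the comparison is not purely about the transformation.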
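An alternative setup, sketched here as an assumption rather than the approach used above, keeps weight in the model and adds the transformed horsepower as an extra term. The names `auto_num`, `fit_log`, `fit_sqrt`, and `fit_sq` are illustrative, and the results will differ from the summaries shown earlier.

```r
library(ISLR2)

auto_num <- subset(Auto, select = -name)

# Keep every original predictor and add one transformed horsepower term
fit_log  <- lm(mpg ~ . + log(horsepower),  data = auto_num)
fit_sqrt <- lm(mpg ~ . + sqrt(horsepower), data = auto_num)
fit_sq   <- lm(mpg ~ . + I(horsepower^2),  data = auto_num)

# Compare adjusted R-squared across the three fits
adj_r2 <- sapply(list(log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
                 function(m) summary(m)$adj.r.squared)
print(adj_r2)
```

`I()` is needed around `horsepower^2` because `^` has a special meaning inside a formula.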
library(ISLR2)
data(Carseats)
# Fit the model
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
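The model can be written as:
\[ \text{Sales} = 13.0435 - 0.0545 \times \text{Price} - 0.0219 \times \mathbf{1}(\text{Urban} = \text{Yes}) + 1.2006 \times \mathbf{1}(\text{US} = \text{Yes}) + \epsilon \]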
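For Price and USYes the p-values are far below 0.05, so we reject the null hypothesis that their coefficients are zero. However, for UrbanYes, the p-value is 0.936, which is much higher than 0.05, so we cannot reject the null hypothesis for this predictor. Thus, there is no evidence that the location being urban versus rural significantly affects sales.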
reduced_model <- lm(Sales ~ Price + US, data = Carseats)
summary(reduced_model)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
The Adjusted R-squared for the reduced model (0.2354) is slightly higher than that for the full model (0.2335), while the multiple R-squared is unchanged at 0.2393, indicating that removing the Urban variable didn’t hurt the model’s explanatory power. This suggests that Urban contributes very little to the model.
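The same comparison can be made formally with a partial F-test on the two nested models. This is a sketch that assumes the ISLR2 package is installed; `full_model`, `cmp`, and `urban_p` are illustrative names.

```r
library(ISLR2)

full_model    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced_model <- lm(Sales ~ Price + US, data = Carseats)

# Partial F-test: does adding Urban significantly reduce the residual sum of squares?
cmp <- anova(reduced_model, full_model)
print(cmp)

# p-value for the single added term (second row of the comparison table)
urban_p <- cmp[["Pr(>F)"]][2]
```

With a single added term, this F-test is equivalent to the t-test on UrbanYes in the full model, so it leads to the same conclusion.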
confint(reduced_model)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
Inference:
- Price is negatively associated with Sales; the 95% confidence interval (-0.0648, -0.0442) excludes zero, so this relationship is statistically significant.
- Being in the US (USYes) is positively associated with Sales; the 95% confidence interval (0.6915, 1.7078) excludes zero, so this effect is also statistically significant.
par(mfrow = c(2, 2))
plot(reduced_model)
Inference:
# Set seed for reproducibility
set.seed(42)
# Generate X and Y with different variances
X <- rnorm(100)
Y <- rnorm(100, sd = 5)
# Fit the regression of Y onto X (without intercept)
reg_Y_on_X <- lm(Y ~ X - 1)
summary(reg_Y_on_X)
Call:
lm(formula = Y ~ X - 1)
Residuals:
Min 1Q Median 3Q Max
-9.9075 -2.9733 -0.3705 2.2492 13.8347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X 0.1224 0.4380 0.28 0.78
Residual standard error: 4.54 on 99 degrees of freedom
Multiple R-squared: 0.0007886, Adjusted R-squared: -0.009304
F-statistic: 0.07813 on 1 and 99 DF, p-value: 0.7804
# Fit the regression of X onto Y (without intercept)
reg_X_on_Y <- lm(X ~ Y - 1)
summary(reg_X_on_Y)
Call:
lm(formula = X ~ Y - 1)
Residuals:
Min 1Q Median 3Q Max
-2.9759 -0.6453 0.1102 0.6822 2.2832
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Y 0.006441 0.023044 0.28 0.78
Residual standard error: 1.041 on 99 degrees of freedom
Multiple R-squared: 0.0007886, Adjusted R-squared: -0.009304
F-statistic: 0.07813 on 1 and 99 DF, p-value: 0.7804
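The two slopes differ because, without an intercept, the least-squares coefficient of Y onto X is sum(X*Y)/sum(X^2), while that of X onto Y is sum(X*Y)/sum(Y^2). The t-statistic is symmetric in X and Y, which is why both fits report the same t value. The following sketch checks this numerically on the same simulated data; the names `beta_y_on_x`, `beta_x_on_y`, `t_yx`, and `t_xy` are illustrative.

```r
set.seed(42)
X <- rnorm(100)
Y <- rnorm(100, sd = 5)

# Closed-form no-intercept slopes: same numerator, different denominators
beta_y_on_x <- sum(X * Y) / sum(X^2)
beta_x_on_y <- sum(X * Y) / sum(Y^2)

fit_yx <- lm(Y ~ X - 1)
fit_xy <- lm(X ~ Y - 1)

# The closed forms reproduce the lm() estimates
print(all.equal(unname(coef(fit_yx)), beta_y_on_x))
print(all.equal(unname(coef(fit_xy)), beta_x_on_y))

# The t-statistics of the two regressions coincide
t_yx <- coef(summary(fit_yx))[1, "t value"]
t_xy <- coef(summary(fit_xy))[1, "t value"]
print(c(t_yx, t_xy))
```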
# Set seed for reproducibility
set.seed(42)
# Generate X and Y with Y an exact multiple of X
X <- rnorm(100)
Y <- X * 1.5 # NOTE: Var(Y) = 2.25 * Var(X), so the variances differ; the slopes below are 1.5 and 1/1.5, not equal
# Fit the regression of Y onto X (without intercept)
reg_Y_on_X <- lm(Y ~ X - 1)
summary(reg_Y_on_X)
Warning in summary.lm(reg_Y_on_X) :
essentially perfect fit: summary may be unreliable
Call:
lm(formula = Y ~ X - 1)
Residuals:
Min 1Q Median 3Q Max
-4.839e-16 -7.530e-17 2.200e-18 6.570e-17 6.634e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
X 1.500e+00 6.627e-17 2.263e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.87e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.123e+32 on 1 and 99 DF, p-value: < 2.2e-16
# Fit the regression of X onto Y (without intercept)
reg_X_on_Y <- lm(X ~ Y - 1)
summary(reg_X_on_Y)
Warning in summary.lm(reg_X_on_Y) :
essentially perfect fit: summary may be unreliable
Call:
lm(formula = X ~ Y - 1)
Residuals:
Min 1Q Median 3Q Max
-1.176e-15 -6.392e-17 -1.502e-17 3.599e-17 2.662e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Y 6.667e-01 9.342e-18 7.137e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.453e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.093e+33 on 1 and 99 DF, p-value: < 2.2e-16
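In the fits above the slopes are 1.5 and 0.667, so they are not equal: the two no-intercept coefficients coincide only when sum(X^2) equals sum(Y^2). One way to construct such a case, offered as an illustration rather than the original solution, is to make Y a permutation of X. The names `coef_yx` and `coef_xy` are illustrative.

```r
set.seed(42)
X <- rnorm(100)
Y <- sample(X)   # a permutation of X, so sum(Y^2) equals sum(X^2)

# No-intercept regressions in both directions
coef_yx <- unname(coef(lm(Y ~ X - 1)))
coef_xy <- unname(coef(lm(X ~ Y - 1)))

# The sums of squares match, and therefore so do the two slopes
print(all.equal(sum(X^2), sum(Y^2)))
print(all.equal(coef_yx, coef_xy))
```

Other choices with equal sums of squares, such as Y <- -X, work the same way.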