library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
data(Auto)
Carefully explain the differences between the KNN classifier and KNN regression methods.
K-nearest neighbors (KNN) is a simple, non-parametric prediction method that comes in two forms: KNN classification, which predicts a category, and KNN regression, which predicts a number. Both start the same way: they find the K training observations closest to the point being predicted. The classifier then assigns the most common class among those neighbors (a majority vote), while the regression method averages the neighbors' response values. Classification is used when the response is a qualitative label, like "yes" or "no," while regression is used when the response is quantitative, like a car's mpg or a house price. The two are also evaluated differently: classification performance is judged by the error rate (the fraction of misclassified observations), whereas regression performance is judged by the mean squared error of the predictions.
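To make the contrast concrete, here is a minimal base-R sketch. The knn_predict() helper is written purely for illustration (packages such as class and FNN provide standard implementations), and the example simply reuses the first row of Auto as the query point.
knn_predict <- function(train_x, train_y, new_x, k = 5, type = c("class", "reg")) {
  type <- match.arg(type)
  # Euclidean distance from the query point to every training point
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nbrs <- train_y[order(d)[1:k]]     # responses of the k nearest neighbours
  if (type == "class") {
    names(which.max(table(nbrs)))    # classification: majority vote over labels
  } else {
    mean(nbrs)                       # regression: average of neighbour responses
  }
}
X <- scale(Auto[, c("weight", "horsepower")])                        # standardise the features
knn_predict(X, Auto$mpg, X[1, ], k = 5, type = "reg")                # predicts a number (mpg)
knn_predict(X, factor(Auto$origin), X[1, ], k = 5, type = "class")   # predicts a label (origin)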
This question involves the use of multiple linear regression on the Auto data set.
ggpairs(
Auto,
title = "Scatterplot Matrix of Auto Dataset",
upper = list(continuous = wrap("cor", size = 3)),
lower = list(continuous = wrap("points", alpha = 0.6, size = 0.8)),
diag = list(continuous = wrap("densityDiag", alpha = 0.5)),
cardinality_threshold = 500
) +
theme_minimal(base_size = 10)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The correlation matrix was computed to quantify the relationships between variables. The results indicated strong negative correlations between mpg and weight (-0.83), displacement (-0.81), and horsepower (-0.78). Positive correlations were found between mpg and both year (0.58) and origin (0.57). These findings confirm that lighter, newer, and imported cars generally have higher fuel efficiency.
cor_matrix <- cor(Auto[, -which(names(Auto) == "name")])
print(cor_matrix)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
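To highlight which predictors are most strongly associated with mpg, the mpg row of the correlation matrix can be sorted by absolute correlation:
# Rank the predictors by the absolute value of their correlation with mpg
mpg_cor <- cor_matrix["mpg", colnames(cor_matrix) != "mpg"]
round(mpg_cor[order(abs(mpg_cor), decreasing = TRUE)], 2)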
The F-statistic for the model is 252.4 with a p-value < 2 × 10⁻¹⁶, so the null hypothesis that all slope coefficients are zero is decisively rejected. Combined with an adjusted R² of 0.818, this shows that, taken together, the predictors explain a large share of the variation in miles per gallon (mpg). In short, there is a clear relationship between the set of predictors and the response.
The variables displacement, weight, year, and origin are statistically significant (p < 0.05), meaning each has a measurable association with mpg after adjusting for the other predictors. In contrast, cylinders, horsepower, and acceleration are not statistically significant once the other variables are in the model.
The coefficient for year (≈ 0.75) means that, holding all other variables constant, a car that is one model-year newer is expected to achieve about 0.75 additional miles per gallon. This suggests that fuel efficiency tended to improve steadily from year to year over the period covered by the dataset.
model <- lm(mpg ~ . - name, data = Auto)
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
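As a quick numerical check of the year coefficient, predicting mpg for two otherwise-identical hypothetical cars that differ by one model year reproduces the 0.75 mpg difference exactly (the predictor values below are assumed, chosen only for illustration):
# Two hypothetical cars, identical except for model year (illustrative values)
newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77),
                      origin = 1)
diff(predict(model, newdata = newcars))  # equals the year coefficient, about 0.75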
Diagnostic plots were used to assess the model’s validity. The residuals appeared randomly distributed, the Q-Q plot suggested the residuals were approximately normally distributed, and the scale-location plot indicated fairly constant variance. Although a few high-leverage points were identified, none were extreme or influential. Overall, the regression assumptions were reasonably met.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(model, which = 1:4, pch = 19, col = "#0072B2", cex = 0.7)
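To back this up numerically, the largest Cook's distances can be listed directly; values well below 1 indicate that no single observation dominates the fit:
# Largest Cook's distances for the full additive model
head(sort(cooks.distance(model), decreasing = TRUE), 5)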
Both interaction terms in this model are highly significant. The displacement:weight interaction is negative, meaning the fuel-economy penalty of a larger engine grows as the car gets heavier, while the displacement:horsepower interaction indicates that the effect of engine size on mpg also depends on horsepower. In other words, fuel efficiency reflects how these features work together rather than any single factor alone. Note that this specification drops the main effects of displacement, weight, and horsepower, and its overall fit (adjusted R² of about 0.76) is lower than that of the full additive model.
interaction_model <- lm(mpg ~ displacement * weight + displacement:horsepower + . - name - displacement - weight - horsepower, data = Auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ displacement * weight + displacement:horsepower +
## . - name - displacement - weight - horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5314 -2.4221 -0.0745 2.1164 13.6442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.928e+01 5.021e+00 -5.830 1.17e-08 ***
## cylinders -9.337e-01 3.178e-01 -2.938 0.003502 **
## acceleration 1.218e-01 9.699e-02 1.256 0.209884
## year 7.578e-01 5.778e-02 13.116 < 2e-16 ***
## origin 1.464e+00 2.997e-01 4.886 1.51e-06 ***
## displacement:weight -1.143e-05 1.950e-06 -5.862 9.83e-09 ***
## displacement:horsepower 1.513e-04 4.053e-05 3.733 0.000218 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.824 on 385 degrees of freedom
## Multiple R-squared: 0.7636, Adjusted R-squared: 0.76
## F-statistic: 207.3 on 6 and 385 DF, p-value: < 2.2e-16
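To make these interactions concrete, the implied marginal effect of displacement on mpg (its partial derivative, the displacement:weight coefficient times weight plus the displacement:horsepower coefficient times horsepower) can be evaluated at a few assumed weight values with horsepower held at 100; the numbers are purely illustrative:
# Marginal effect of displacement on mpg at weights of 2000, 3000, 4000 (horsepower = 100)
b <- coef(interaction_model)
b["displacement:weight"] * c(2000, 3000, 4000) + b["displacement:horsepower"] * 100
# The effect becomes more negative as weight increases, holding horsepower fixed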
Several variable transformations were applied to improve model performance. Using log(horsepower), sqrt(weight), or a squared displacement term each raised the adjusted R-squared to roughly 0.833-0.834, compared with 0.818 for the untransformed model, so these transformations capture some of the non-linear relationships between the predictors and mpg. By contrast, the interaction-only model from the previous part fit noticeably worse (adjusted R² of about 0.76), making the transformed models the best-fitting ones considered here.
log_model <- lm(mpg ~ log(horsepower) + weight + year + displacement + acceleration + cylinders + origin, data = Auto)
summary(log_model)
##
## Call:
## lm(formula = mpg ~ log(horsepower) + weight + year + displacement +
## acceleration + cylinders + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3115 -2.0041 -0.1726 1.8393 12.6579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.254005 8.589614 3.173 0.00163 **
## log(horsepower) -9.506436 1.539619 -6.175 1.69e-09 ***
## weight -0.004266 0.000694 -6.148 1.97e-09 ***
## year 0.705329 0.048456 14.556 < 2e-16 ***
## displacement 0.019456 0.006876 2.830 0.00491 **
## acceleration -0.292088 0.103804 -2.814 0.00515 **
## cylinders -0.486206 0.306692 -1.585 0.11372
## origin 1.482435 0.259347 5.716 2.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.18 on 384 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.834
## F-statistic: 281.6 on 7 and 384 DF, p-value: < 2.2e-16
sqrt_model <- lm(mpg ~ sqrt(weight) + horsepower + year + displacement + acceleration + cylinders + origin, data = Auto)
summary(sqrt_model)
##
## Call:
## lm(formula = mpg ~ sqrt(weight) + horsepower + year + displacement +
## acceleration + cylinders + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4018 -2.0112 0.0246 1.7565 12.8943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.840893 4.486253 0.633 0.52695
## sqrt(weight) -0.794322 0.066906 -11.872 < 2e-16 ***
## horsepower -0.010706 0.013111 -0.817 0.41469
## year 0.773764 0.049030 15.781 < 2e-16 ***
## displacement 0.021846 0.007134 3.062 0.00235 **
## acceleration 0.131710 0.094051 1.400 0.16220
## cylinders -0.430040 0.310000 -1.387 0.16618
## origin 1.210091 0.268519 4.507 8.76e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.191 on 384 degrees of freedom
## Multiple R-squared: 0.8359, Adjusted R-squared: 0.8329
## F-statistic: 279.4 on 7 and 384 DF, p-value: < 2.2e-16
sq_model <- lm(mpg ~ I(displacement^2) + horsepower + weight + year + acceleration + cylinders + origin, data = Auto)
summary(sq_model)
##
## Call:
## lm(formula = mpg ~ I(displacement^2) + horsepower + weight +
## year + acceleration + cylinders + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7420 -1.8872 -0.0646 1.6601 12.5615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.378e+01 4.488e+00 -3.070 0.00229 **
## I(displacement^2) 6.951e-05 1.074e-05 6.475 2.91e-10 ***
## horsepower -4.248e-02 1.381e-02 -3.075 0.00225 **
## weight -6.446e-03 5.848e-04 -11.024 < 2e-16 ***
## year 7.644e-01 4.886e-02 15.646 < 2e-16 ***
## acceleration 7.466e-02 9.427e-02 0.792 0.42884
## cylinders -7.083e-01 2.614e-01 -2.710 0.00703 **
## origin 1.337e+00 2.537e-01 5.271 2.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.188 on 384 degrees of freedom
## Multiple R-squared: 0.8361, Adjusted R-squared: 0.8331
## F-statistic: 279.9 on 7 and 384 DF, p-value: < 2.2e-16
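With all candidate models fitted, their adjusted R-squared values can be pulled together to support the comparison above:
# Adjusted R-squared for each fitted model
sapply(list(base = model, interaction = interaction_model, log_hp = log_model,
            sqrt_wt = sqrt_model, disp_sq = sq_model),
       function(m) summary(m)$adj.r.squared)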
This question should be answered using the Carseats data set.
data(Carseats)
model_a <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_a)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The multiple regression model shows how Price, Urban, and US relate to Sales in the Carseats data set. The intercept of 13.04 is the predicted Sales (in thousands of units) when Price is 0 and both Urban and US are "No". The coefficient for Price is -0.0545, indicating that each $1 increase in price is associated with a decrease of about 0.054 thousand units (roughly 54 car seats) in sales, an effect that is highly statistically significant (p < 2e-16). The UrbanYes coefficient is -0.0219, a very small and statistically insignificant difference between urban and non-urban stores (p = 0.936). In contrast, the USYes coefficient is 1.20, meaning US stores are predicted to sell about 1.2 thousand more units than non-US stores at the same price, a difference that is statistically significant (p ≈ 4.86e-06). Overall, Price and US have meaningful relationships with Sales, while Urban does not. The model explains approximately 24% of the variation in Sales (R² = 0.2393), and the overall F-test is significant (F = 41.52, p < 2.2e-16).
The regression model for predicting Sales is:
Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes.
This means that when Price is 0 and the store is neither urban nor in the US, predicted sales are about 13.04 thousand units. Each $1 increase in price lowers predicted sales by about 0.054 thousand units (roughly 54 car seats). Urban stores sell slightly less (about 0.02 thousand units), but this difference is not statistically meaningful. Stores in the US sell about 1.2 thousand more units than those outside the US, which is a significant difference.
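As an illustration of the fitted equation, predicted sales for a US store and a non-US store at the same price differ by exactly the USYes coefficient; the $120 price below is an assumed value used only for the example:
# Predicted sales (in thousands of units) at an assumed price of $120
predict(model_a, newdata = data.frame(Price = 120, Urban = "No", US = c("Yes", "No")))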
The diagnostic plots show that the regression model fits the data fairly well. The Residuals vs Fitted plot suggests a mostly linear relationship, with no major patterns. The Q-Q plot shows the residuals are close to normally distributed, with small deviations in the tails. The Scale-Location plot indicates that the spread of residuals is fairly even, though there is a bit more variation in the middle of the fitted range. The Residuals vs Leverage plot shows a few points with higher leverage (such as observations 368 and 86), but none are strong outliers or influential points. Overall, the model meets the key assumptions and looks reliable.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(model_a,
which = 1:4,
pch = 19,
col = "#0072B2",
cex = 0.7)
The smaller model includes only Price and US, since Urban was not a significant predictor. In this model, both Price and US remain statistically significant. Higher prices are linked to lower sales, and stores in the US sell about 1.2 more units than those outside the US. The adjusted R-squared is 0.2354, almost the same as the original model, which means dropping Urban did not hurt the model’s accuracy. This simpler model is just as effective and easier to interpret.
model_e <- lm(Sales ~ Price + US, data = Carseats)
summary(model_e)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models fit the data well, but the smaller model (e), which includes only Price and US, performs just as well as the full model (a), which also includes Urban. The adjusted R-squared is slightly higher in the smaller model, and the residual standard error is slightly lower. This means the simpler model is just as accurate and easier to work with.
Model | Predictors | Adjusted R² | Residual Std. Error |
---|---|---|---|
Model (a) | Price, Urban, US | 0.2335 | 2.472 |
Model (e) | Price, US | 0.2354 | 2.469 |
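The comparison can also be made formally with a partial F-test of the reduced model against the full model; a large p-value would confirm that Urban adds no explanatory power once Price and US are included:
# Partial F-test: does adding Urban improve on the Price + US model?
anova(model_e, model_a)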
The 95% confidence intervals show that all variables in the smaller model are important. The intercept is between 11.79 and 14.27, which is the expected sales when price is 0 and the store is not in the US. The interval for Price is from -0.0648 to -0.0442, meaning higher prices lead to lower sales. The interval for USYes is from 0.69 to 1.71, showing that US stores sell more than non-US stores. Since none of the intervals include 0, both Price and US have a real effect on sales.
confint(model_e, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The diagnostic plots show that the model fits the data well. The residuals are spread out evenly with no clear pattern, and the Q-Q plot shows the residuals are mostly normal. The variance looks consistent across the fitted values. A few points, like 368 and 86, have slightly higher leverage, but they are not influential. Overall, there are no major issues with outliers or leverage in this model.
par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
plot(model_e,
which = 1:4,
pch = 19,
col = "#0072B2",
cex = 0.7)
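The two flagged observations can also be checked directly; comparing their leverage and Cook's distance against the rest of the data supports the reading of the plots:
# Leverage and Cook's distance for the observations flagged in the leverage plot
hatvalues(model_e)[c(368, 86)]
cooks.distance(model_e)[c(368, 86)]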
This problem involves simple linear regression without an intercept.
In simple linear regression without an intercept, the slope from regressing Y onto X is $\hat{\beta}_{y \sim x} = \sum_i x_i y_i / \sum_i x_i^2$, while the slope from regressing X onto Y is $\hat{\beta}_{x \sim y} = \sum_i x_i y_i / \sum_i y_i^2$. They share the same numerator, so the two slopes coincide exactly when $\sum_i x_i^2 = \sum_i y_i^2$, for example when the $y_i$ values are simply a reordering (or sign flip) of the $x_i$ values. In general the two regressions give different slopes, as the first simulation below shows.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
model_y_on_x <- lm(y ~ x + 0)
coef(model_y_on_x)
## x
## 1.993876
model_x_on_y <- lm(x ~ y + 0)
coef(model_x_on_y)
## y
## 0.3911145
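These fitted slopes match the closed-form expressions given above:
# Closed-form no-intercept slopes: sum(x*y)/sum(x^2) and sum(x*y)/sum(y^2)
c(y_on_x = sum(x * y) / sum(x^2), x_on_y = sum(x * y) / sum(y^2))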
set.seed(2)
x <- rnorm(100)
y <- x
model_y_on_x_equal <- lm(y ~ x + 0)
coef(model_y_on_x_equal)
## x
## 1
model_x_on_y_equal <- lm(x ~ y + 0)
coef(model_x_on_y_equal)
## y
## 1
References:
ISLR Q3.9 - Multiple Linear Regression/Auto. http://www.h4labs.com/ml/islr/chapter03/03_09_melling.html