KNN classification is used when the response is categorical: it assigns the most common class among the k nearest neighbors. KNN regression is used when the response is continuous: it predicts the average of the responses of the k nearest neighbors.
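A minimal sketch of the two modes in R, assuming the class and FNN packages are available (the toy data below is illustrative, not part of the exercise):

library(FNN)  # provides knn.reg() for KNN regression
set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)               # 50 training points
x_test  <- matrix(rnorm(10),  ncol = 2)               # 5 new points
cls     <- factor(ifelse(x_train[, 1] > 0, "A", "B")) # categorical response
y       <- 3 * x_train[, 1] + rnorm(50)               # continuous response
class::knn(x_train, x_test, cl = cls, k = 5)          # classification: majority vote
knn.reg(x_train, x_test, y = y, k = 5)$pred           # regression: neighbor average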
A scatterplot matrix visualizes the pairwise relationships among all numerical variables.
library(ISLR)     # Auto and Carseats data sets
library(ggplot2)
library(corrplot) # correlation-matrix plots
data(Auto)
pairs(Auto)       # scatterplot matrix of every pair of variables
mpg is negatively correlated with cylinders, displacement, horsepower, and weight, and positively correlated with year and origin. Weight, horsepower, and displacement are highly correlated, indicating possible multicollinearity.
cor_matrix <- cor(Auto[, !names(Auto) %in% "name"]) # drop the non-numeric name column
corrplot(cor_matrix, method = "circle")
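To back the observations above with numbers, the mpg row of the correlation matrix can be printed directly (a quick supplementary check, not part of the original output):

round(cor_matrix["mpg", ], 2) # sign and strength of each correlation with mpg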
Yes. The F-statistic of 252.4 (p-value < 2.2e-16) indicates a strong overall relationship between the predictors and mpg, and R² = 0.8215 means the predictors explain about 82% of the variance in mpg.
Significant (p < 0.05): displacement, weight, year, origin
Not significant (p > 0.05): cylinders, horsepower, acceleration
The year coefficient is 0.7508: holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. Newer cars are more fuel-efficient.
model <- lm(mpg ~ . - name, data = Auto) # regress mpg on every predictor except name
summary(model)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
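As a quick sanity check of the year interpretation, predicting mpg for the same hypothetical car in two consecutive model years should reproduce the coefficient (the car's values below are illustrative only):

same_car <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15, year = c(76, 77),
                       origin = 1)
diff(predict(model, newdata = same_car)) # equals the year coefficient, ~0.75 mpg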
Residuals vs Fitted: A slight curve suggests non-linearity, meaning the linear model might not be the best fit.
Q-Q Plot: Some points deviate from the line, indicating non-normality in the residuals.
Scale-Location: No strong pattern overall, though there are hints of heteroscedasticity (non-constant variance).
Residuals vs Leverage: Points 327, 394, and 14 have high leverage, meaning they could strongly influence the fit; point 14 looks particularly influential.
par(mfrow = c(2, 2)) # 2x2 grid for the four diagnostic plots
plot(model)
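To identify the flagged observations numerically rather than reading them off the plot (a sketch; row labels can differ across versions of the data):

head(sort(hatvalues(model), decreasing = TRUE), 3)      # highest-leverage rows
head(sort(cooks.distance(model), decreasing = TRUE), 3) # most influential rows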
Yes. Three interactions are statistically significant (p < 0.05): displacement:year, acceleration:year, and acceleration:origin.
interaction_model <- lm(mpg ~ (cylinders + displacement + horsepower + weight + acceleration + year + origin)^2 - name, data = Auto)
summary(interaction_model)
##
## Call:
## lm(formula = mpg ~ (cylinders + displacement + horsepower + weight +
## acceleration + year + origin)^2 - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
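The significant interaction terms can also be pulled out programmatically instead of scanned by eye:

coefs <- summary(interaction_model)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05 & grepl(":", rownames(coefs))]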
The model explains 65.6% of the variation in mpg (R² = 0.6559), which is lower than the earlier models. The squared weight term is highly significant with a strong negative effect on mpg, the squared horsepower term is only borderline significant (p ≈ 0.067), and the squared displacement term is not significant (p ≈ 0.998). Overall, only weight squared is meaningful, and this model does not improve on the earlier ones.
model_squared <- lm(mpg ~ I(horsepower^2) + I(weight^2) + I(displacement^2), data = Auto)
summary(model_squared)
##
## Call:
## lm(formula = mpg ~ I(horsepower^2) + I(weight^2) + I(displacement^2),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.941 -3.323 -0.771 2.634 17.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.427e+01 6.017e-01 56.955 <2e-16 ***
## I(horsepower^2) -1.033e-04 5.632e-05 -1.834 0.0674 .
## I(weight^2) -9.953e-07 1.018e-07 -9.778 <2e-16 ***
## I(displacement^2) -3.673e-08 1.483e-05 -0.002 0.9980
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.596 on 388 degrees of freedom
## Multiple R-squared: 0.6559, Adjusted R-squared: 0.6532
## F-statistic: 246.5 on 3 and 388 DF, p-value: < 2.2e-16
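Because the squared-term model and the full model use different predictors (they are not nested), a partial F-test is not appropriate; AIC offers a quick comparison instead, lower being better:

AIC(model, model_squared) # both models are fit to the same 392 observations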
data(Carseats)
model_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price is significant and negatively affects Sales. US is significant: stores in the US have higher Sales. Urban is not significant, so there is no evidence that urban location affects Sales.
Sales = 13.04 − 0.0545(Price) − 0.0219(UrbanYes) + 1.20(USYes)
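Plugging illustrative values into the fitted equation (say Price = 100 for an urban US store, a made-up example) reproduces the arithmetic:

# 13.04 - 0.0545*100 - 0.0219 + 1.20 is approximately 8.78
predict(model_carseats, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))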
The significant predictors (p < 0.05) are Price and US, while Urban is not significant.
significant_predictors <- summary(model_carseats)$coefficients[,4] < 0.05
print(significant_predictors)
## (Intercept) Price UrbanYes USYes
## TRUE TRUE FALSE TRUE
We fit a reduced model using only the significant predictors, Price and US.
reduced_model <- lm(Sales ~ Price + US, data = Carseats)
summary(reduced_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have essentially the same R² (0.2393), explaining 23.93% of the variation in Sales, and the adjusted R² actually improves slightly when Urban is dropped (0.2335 → 0.2354). Removing Urban did not hurt the fit, so the reduced model is just as effective but simpler.
AIC(model_carseats, reduced_model)
## df AIC
## model_carseats 5 1865.312
## reduced_model 4 1863.319
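Because the reduced model is nested in the full one, a partial F-test provides a complementary check on whether Urban adds anything:

anova(reduced_model, model_carseats) # F-test for dropping Urban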
confint(reduced_model, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
The diagnostic plots help identify outliers and high-leverage points that could influence the model: standardized residuals well beyond ±2 flag potential outliers, and points far to the right of the Residuals vs Leverage plot flag high leverage. If such extreme points are present, further investigation is needed.
par(mfrow = c(2, 2))
plot(reduced_model)
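Rule-of-thumb numeric checks (a sketch): studentized residuals beyond ±3 suggest outliers, and leverage above 2(p + 1)/n suggests high-leverage points.

sum(abs(rstudent(reduced_model)) > 3)                        # candidate outliers
p <- length(coef(reduced_model)) - 1                         # number of predictors
sum(hatvalues(reduced_model) > 2 * (p + 1) / nrow(Carseats)) # high-leverage rows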
For regression without an intercept, the slope of Y onto X is sum(x_i * y_i) / sum(x_i^2) and the slope of X onto Y is sum(x_i * y_i) / sum(y_i^2). The two coefficient estimates are therefore the same if and only if sum(x_i^2) = sum(y_i^2), i.e., X and Y have the same sum of squares.
The coefficient estimates for Y onto X (1.9829) and X onto Y (0.4972) are different. This occurs because sum(X^2) ≠ sum(Y^2): generating Y = 2X + error puts Y on a different scale than X. Note also that 0.4972 is close to, but not exactly, 1/1.9829; the added noise keeps the two estimates from being exact reciprocals of one another.
set.seed(42)
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- 2 * X + rnorm(n, mean = 0, sd = 3)
model_xy <- lm(Y ~ X - 1)
model_yx <- lm(X ~ Y - 1)
coef_xy <- coef(model_xy)
coef_yx <- coef(model_yx)
print(coef_xy)
## X
## 1.982863
print(coef_yx)
## Y
## 0.497212
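The closed-form no-intercept slopes confirm these numbers directly:

sum(X * Y) / sum(X^2) # coefficient of lm(Y ~ X - 1)
sum(X * Y) / sum(Y^2) # coefficient of lm(X ~ Y - 1)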
The coefficient estimates for Y onto X and X onto Y are both 1, i.e., identical. This occurs because Y = X with no added noise, so sum(X^2) = sum(Y^2) exactly and both regressions recover the same slope.
set.seed(42)
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- X # Perfect linear relationship
model_xy_equal <- lm(Y ~ X - 1)
model_yx_equal <- lm(X ~ Y - 1)
coef_xy_equal <- coef(model_xy_equal)
coef_yx_equal <- coef(model_yx_equal)
print(coef_xy_equal)
## X
## 1
print(coef_yx_equal)
## Y
## 1
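A less trivial example, sketched below: any permutation of X has the same sum of squares as X, so the two no-intercept estimates still agree even though Y is not equal to X elementwise.

set.seed(42)
X_perm <- rnorm(100)
Y_perm <- sample(X_perm)      # same values, shuffled: sum(Y_perm^2) == sum(X_perm^2)
coef(lm(Y_perm ~ X_perm - 1))
coef(lm(X_perm ~ Y_perm - 1)) # identical to the line above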