In KNN classification, the model assigns a class label by majority vote among the k nearest neighbors, which makes it suitable for categorical outcomes.
In KNN regression, the algorithm predicts a continuous value by averaging the responses of the k nearest neighbors, which makes it suitable for continuous target variables.
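The difference between the two variants can be sketched in a few lines of base R. This is only an illustration: knn_predict() is a hypothetical helper written for this note (not a function from any package), and the toy data are made up.

```r
# Minimal k-nearest-neighbours sketch in base R (illustrative only).
# knn_predict() is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  if (is.factor(train_y)) {
    # Classification: majority vote among the k nearest labels
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  } else {
    # Regression: average of the k nearest responses
    mean(train_y[nearest])
  }
}

# Made-up training data: two well-separated clusters
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    8, 8,
                    8, 9,
                    9, 8), ncol = 2, byrow = TRUE)
labels <- factor(c("a", "a", "a", "b", "b", "b"))
values <- c(10, 12, 11, 30, 32, 31)

knn_class <- knn_predict(train_x, labels, c(1.5, 1.5), k = 3)  # a class label
knn_reg   <- knn_predict(train_x, values, c(1.5, 1.5), k = 3)  # a number
```

The same neighbours are used in both calls; only the way their responses are combined changes.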
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
plot(Auto)
Auto1 <- Auto
Auto1$name = NULL
cor(Auto1)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Auto_lm <- lm(mpg ~ . -name, data =Auto)
summary(Auto_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
There is a strong relationship between the predictors and
mpg. The very low p-value of the F-statistic (< 2.2e-16)
means that at least one of the predictors is significantly associated with
mpg. Also, the R-squared value (0.8215)
shows that about 82% of the variation in
mpg is explained by the predictors.
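As a rough check, the overall p-value and the R-squared can be recomputed from the F-statistic and degrees of freedom printed in the summary. The numbers below are copied from that output; the variable names are introduced here only for illustration.

```r
# Values copied from the summary(Auto_lm) output
f_stat <- 252.4
df_num <- 7      # number of predictors
df_den <- 384    # residual degrees of freedom

# p-value of the overall F-test (upper tail of the F distribution)
p_overall <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# R-squared recovered from the F-statistic:
# R2 = F * df_num / (F * df_num + df_den)
r2_from_f <- (f_stat * df_num) / (f_stat * df_num + df_den)
```

Because the F-statistic is rounded in the printout, r2_from_f should agree with the reported R-squared only to about three decimal places.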
ii. Which predictors appear to have a statistically significant relationship to the response?
Displacement
Weight
Year
Origin
iii. What does the coefficient for the year variable suggest?
The coefficient for the year variable (0.75) is positive and
statistically significant. It suggests that, holding all other
predictors constant, mpg increases by about
0.75 on average for each additional model year.
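To give a sense of the uncertainty around this estimate, an approximate 95% confidence interval can be built from the estimate and standard error printed in the summary. This is a sketch using the printed (rounded) values; the variable names are illustrative.

```r
# Estimate and standard error for year, copied from summary(Auto_lm)
est_year <- 0.750773
se_year  <- 0.050973
df_resid <- 384

# 95% confidence interval: estimate +/- t critical value * standard error
t_crit  <- qt(0.975, df_resid)
ci_year <- est_year + c(-1, 1) * t_crit * se_year

# Implied average change in mpg over ten model years, other predictors fixed
gain_10_years <- 10 * est_year
```

Running confint(Auto_lm) on the fitted model should give the same interval up to rounding.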
par(mfrow = c(2,2))
plot(Auto_lm)
The Residuals vs Fitted plot suggests that the residuals do not have constant spread (heteroscedasticity), which violates a linear-model assumption. The Q-Q plot shows that some points depart from the normal reference line, so there may be outliers. The Scale-Location plot confirms that the spread of the residuals changes with the fitted values. The Residuals vs Leverage plot shows that observation 14 has high leverage, meaning it could have a disproportionate influence on the fitted model.
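High-leverage points can also be identified numerically rather than read off the plot. The sketch below uses simulated data (not the Auto data) with one deliberately extreme observation, and the cutoff 2(p + 1)/n is a common rule of thumb assumed here, not a formal test.

```r
# Simulated example: observation 51 is placed far from the other x values
set.seed(1)
x <- c(rnorm(50), 10)
y <- 2 * x + rnorm(51)
fit <- lm(y ~ x)

# Leverage (hat value) of each observation
h <- hatvalues(fit)

# Rule-of-thumb cutoff: twice the average leverage, 2 * (p + 1) / n
cutoff <- 2 * length(coef(fit)) / length(h)

flagged      <- unname(which(h > cutoff))  # observations above the cutoff
most_extreme <- unname(which.max(h))       # observation with largest leverage
```

The same calls, hatvalues(Auto_lm) and which.max(), could be applied to the fitted Auto model to confirm which observation stands out.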
Auto_lm2 <- lm(mpg ~.:., data = Auto1)
summary(Auto_lm2)
##
## Call:
## lm(formula = mpg ~ .:., data = Auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
Auto_lm3 <- lm(mpg ~.*., data = Auto1)
summary(Auto_lm3)
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
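The * and : formulas give identical fits here (same coefficients and
R-squared of 0.8893), because both expand to all main effects plus all
pairwise interactions. Judging by the p-values at the 0.05 level, the
statistically significant interactions are displacement:year,
acceleration:year, and
acceleration:origin.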
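The two summaries print the same coefficient table, and the expansion of each formula can be inspected directly with terms(). The data frame below is a toy stand-in with placeholder column names, not the Auto data.

```r
# Toy data frame; the column names are placeholders
d <- data.frame(y = rnorm(10), a = rnorm(10), b = rnorm(10), c = rnorm(10))

# Expand each formula and list the model terms it produces
labels_colon <- attr(terms(y ~ . : ., data = d), "term.labels")
labels_star  <- attr(terms(y ~ . * ., data = d), "term.labels")

# TRUE when both formulas produce the same set of terms
same_terms <- setequal(labels_colon, labels_star)

# With 7 predictors: 7 main effects plus choose(7, 2) pairwise interactions,
# which matches the 28 numerator degrees of freedom in the F-statistic
n_terms_auto <- 7 + choose(7, 2)
```

Printing labels_colon shows the main effects appearing alongside the pairwise interactions, which is why the two fits coincide.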
Auto_lm4 <- lm(mpg ~ weight + I(sqrt(weight)), data = Auto)
summary(Auto_lm4)
##
## Call:
## lm(formula = mpg ~ weight + I(sqrt(weight)), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.5660 -2.6552 -0.4161 1.7373 16.1001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 109.218284 11.573797 9.437 < 2e-16 ***
## weight 0.013191 0.003828 3.446 0.000631 ***
## I(sqrt(weight)) -2.314535 0.424250 -5.456 8.7e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.181 on 389 degrees of freedom
## Multiple R-squared: 0.7145, Adjusted R-squared: 0.713
## F-statistic: 486.7 on 2 and 389 DF, p-value: < 2.2e-16
Both predictors (weight and
I(sqrt(weight))) are statistically significant at a very
low p-value, indicating that both have meaningful contributions to
predicting mpg. The model explains about
71.45% of the variation in mpg.
Auto_lm5 <- lm(mpg ~ weight+ I(weight^2), data = Auto)
summary(Auto_lm5)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6246 -2.7134 -0.3485 1.8267 16.0866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.226e+01 2.993e+00 20.800 < 2e-16 ***
## weight -1.850e-02 1.972e-03 -9.379 < 2e-16 ***
## I(weight^2) 1.697e-06 3.059e-07 5.545 5.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
## F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16
Both the linear term (weight) and the squared term (I(weight^2))
are statistically significant, which suggests a nonlinear relationship between
weight and mpg. The model explains
71.51% of the variation in mpg, very
close to the R-squared of the square-root model (71.45%), indicating a comparable
model fit.
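One way to read the quadratic fit is to evaluate the fitted curve and find where it stops decreasing. The sketch below uses the rounded coefficients printed in the summary; mpg_hat() is a helper defined here for illustration.

```r
# Coefficients copied (rounded) from summary(Auto_lm5)
b0 <- 6.226e+01
b1 <- -1.850e-02
b2 <- 1.697e-06

# Fitted quadratic curve
mpg_hat <- function(weight) b0 + b1 * weight + b2 * weight^2

# Weight at which the fitted parabola reaches its minimum
turning_point <- -b1 / (2 * b2)

# Fitted mpg for a 3000 lb car
pred_3000 <- mpg_hat(3000)
```

Comparing turning_point with range(Auto$weight) shows whether the upturn of the parabola falls inside the observed data; if it lies beyond the heaviest car, the fitted curve is decreasing over the whole observed range.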
library(ISLR2)
attach(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
Sale_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Sale_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
coef(Sale_lm)
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
Coefficient for ‘Price’:
-0.054459
This means that, holding Urban and US fixed, each one-dollar increase in the price of the car seats
is associated with an average decrease in sales of about 54 units (Sales is recorded in thousands of units).
Coefficient for ‘US = Yes’:
1.200573
This means that, on average and at the same price and Urban status, stores in the US sell about 1,200
more car seats than stores outside the US.
\(Sales = 13.04 - 0.054 \times Price - 0.022 \times Urban + 1.20 \times US\)
where Urban and US equal 1 for Yes and 0 for No.
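The fitted equation can be turned into a small prediction function. This sketch uses the values printed by coef(Sale_lm); predict_sales() and the example store are hypothetical, and Sales is measured in thousands of units.

```r
# Coefficients copied from coef(Sale_lm)
b <- c(intercept = 13.04346894, price = -0.05445885,
       urban_yes = -0.02191615, us_yes = 1.20057270)

# Hypothetical helper: predicted Sales (in thousands of units)
predict_sales <- function(price, urban, us) {
  b[["intercept"]] +
    b[["price"]] * price +
    b[["urban_yes"]] * (urban == "Yes") +
    b[["us_yes"]] * (us == "Yes")
}

# Example store: price of 100, urban location, inside the US
sales_example <- predict_sales(price = 100, urban = "Yes", us = "Yes")

# Coefficients expressed in units rather than thousands of units
units_per_dollar <- 1000 * b[["price"]]   # change in units per extra dollar
units_us_vs_non  <- 1000 * b[["us_yes"]]  # US vs non-US difference in units
```

On the fitted model itself, predict(Sale_lm, newdata = ...) would do the same job without retyping the coefficients.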
The null hypothesis can be rejected for the
Price and US variables, since their p-values are
< 2e-16 and 4.86e-06. It cannot be rejected for Urban (p-value 0.936).
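The t-statistics and p-values behind this conclusion can be recomputed from the printed estimates and standard errors. This is a sketch with the rounded values from summary(Sale_lm); the 0.05 threshold is the conventional choice and is an assumption here.

```r
# Estimates and standard errors copied from summary(Sale_lm)
est <- c(Price = -0.054459, UrbanYes = -0.021916, USYes = 1.200573)
se  <- c(Price =  0.005242, UrbanYes =  0.271650, USYes = 0.259042)
df_resid <- 396

# t-statistic for H0: coefficient = 0, and its two-sided p-value
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df_resid, lower.tail = FALSE)

# TRUE where the null hypothesis is rejected at the 5% level
reject_h0 <- p_val < 0.05
```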
Sale_lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Sale_lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
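Neither model (a) nor model (e) fits the data well: the adjusted R-squared is 0.2335 for model (a) and 0.2354 for model (e), so each explains only about 23% to 24% of the variance in Sales. Removing the “Urban” variable left the multiple R-squared unchanged at 0.2393 and lowered the residual standard error only slightly (2.472 to 2.469), so the smaller model (e) fits essentially as well with one fewer predictor.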
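The small rise in adjusted R-squared follows from its definition, which penalizes the number of predictors. The sketch below applies the standard formula to the rounded R-squared from the two summaries; adj_r2() is a helper defined here for illustration.

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 400  # observations in Carseats (396 residual df + 4 coefficients)

adj_a <- adj_r2(0.2393, n, p = 3)  # model (a): Price + Urban + US
adj_e <- adj_r2(0.2393, n, p = 2)  # model (e): Price + US
```

With the same multiple R-squared, the model with fewer predictors gets the higher adjusted value. The results match the printed figures only approximately, because the R-squared input is rounded to four decimals.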
confint(Sale_lm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
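These intervals can be reproduced by hand as estimate plus or minus a t critical value times the standard error. The sketch uses the rounded estimates and standard errors from summary(Sale_lm2), so it agrees with confint() only to a few decimals.

```r
# Estimates and standard errors copied from summary(Sale_lm2)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price =  0.00523, USYes = 0.25846)

# Critical value for a 95% interval with 397 residual degrees of freedom
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
```

Neither interval contains zero, which is consistent with both predictors being statistically significant.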
par(mfrow=c(2,2))
plot(Sale_lm2)
There are some possible outliers in the Residuals vs Fitted plot and slight deviations in the Q-Q plot. A few high-leverage points appear in the Residuals vs Leverage plot, but they don’t seem very influential.
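For regression without an intercept, the coefficient estimate for Y onto X is sum(XY)/sum(X^2) and the estimate for X onto Y is sum(XY)/sum(Y^2). The two estimates are therefore equal when the sum of X^2 equals the sum of Y^2. Note that the fits below use lm(y ~ x), which includes an intercept; in that case the analogous condition is that X and Y have the same sample variance.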
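The equal-sum-of-squares condition can be checked numerically for regression through the origin. In this sketch y is a random permutation of x, which is one assumed way (among many) to make the two sums of squares equal without setting y equal to x.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # a permutation of x, so sum(y^2) equals sum(x^2)

# Closed-form slopes for regression through the origin
b_y_on_x <- sum(x * y) / sum(x^2)
b_x_on_y <- sum(x * y) / sum(y^2)

# The same slopes from lm() with the intercept removed
fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)
```

Comparing coef(fit_yx) and coef(fit_xy) shows the two estimates agree even though y is not identical to x.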
set.seed(99)
x <- rnorm(100)
y <- 3 * x + rnorm(100)
df1 <- lm(y~x)
summary(df1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7247 -0.7219 -0.1007 0.7467 2.2239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.07963 0.10765 -0.74 0.461
## x 3.07747 0.11931 25.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 98 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8703
## F-statistic: 665.3 on 1 and 98 DF, p-value: < 2.2e-16
df2 <- lm(x~y)
summary(df2)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72090 -0.24436 -0.00792 0.23757 0.75347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.009198 0.032735 0.281 0.779
## y 0.283223 0.010980 25.794 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3244 on 98 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8703
## F-statistic: 665.3 on 1 and 98 DF, p-value: < 2.2e-16
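The two fits above report different slopes but the same R-squared and t-statistic. For simple regression with an intercept, the product of the two slopes equals the squared correlation, which can be checked with the printed values (variable names are illustrative).

```r
# Slopes copied from summary(df1) and summary(df2)
slope_y_on_x <- 3.07747
slope_x_on_y <- 0.283223

# Product of the two slopes; for simple regression this equals R-squared
slope_product <- slope_y_on_x * slope_x_on_y

# The slope of x on y is not simply the reciprocal of the slope of y on x
reciprocal <- 1 / slope_y_on_x
```

Here slope_product should be close to the reported R-squared of 0.8716, while reciprocal is noticeably larger than the fitted slope of x on y.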
set.seed(99)
x <- rnorm(100)
y <- x
df3 <- lm(y~x)
summary(df3)
## Warning in summary.lm(df3): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.974e-15 2.010e-17 3.680e-17 6.080e-17 3.933e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.441e-17 4.142e-17 -1.072e+00 0.286
## x 1.000e+00 4.591e-17 2.178e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.115e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.744e+32 on 1 and 98 DF, p-value: < 2.2e-16
df4 <- lm(x~y)
summary(df4)
## Warning in summary.lm(df4): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.974e-15 2.010e-17 3.680e-17 6.080e-17 3.933e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.441e-17 4.142e-17 -1.072e+00 0.286
## y 1.000e+00 4.591e-17 2.178e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.115e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.744e+32 on 1 and 98 DF, p-value: < 2.2e-16