data(Auto)
Warning in data(Auto): data set 'Auto' not found
KNN classifier and KNN regression: K-Nearest Neighbors (KNN) makes predictions by looking at the closest data points. If you’re predicting a category, like whether a college is private or public, KNN classification looks at the nearby examples and picks the most common label. If you’re predicting a number, like someone’s salary or a house price, KNN regression takes the average of the nearby values. So the difference is simple: use KNN classification when you’re picking a label, and KNN regression when you’re estimating a number (see the sketch below).
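A minimal hand-rolled sketch of both ideas in base R, using made-up toy data (in practice one would use a package such as class or FNN):
# Toy training data: one numeric feature, a class label, and a numeric response
train_x <- c(1.0, 1.5, 3.0, 4.5, 5.0)
train_label <- c("public", "public", "private", "private", "private")
train_salary <- c(40, 42, 65, 80, 85)
new_x <- 3.2   # point we want a prediction for
k <- 3
# Indices of the k nearest neighbors by absolute distance
nearest <- order(abs(train_x - new_x))[1:k]
# KNN classification: majority vote among the neighbors' labels
pred_class <- names(which.max(table(train_label[nearest])))
# KNN regression: average of the neighbors' numeric responses
pred_value <- mean(train_salary[nearest])
pred_class   # most common label among the 3 closest points
pred_value   # mean response among the 3 closest points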
auto <- read.csv("Auto.csv", na.strings = "?")
auto <- na.omit(auto)
pairs(auto[, -which(names(auto) == "name")],
main = "Scatterplot Matrix of Auto Data")
auto_numeric <- auto[, -which(names(auto) == "name")]
cor(auto_numeric[, sapply(auto_numeric, is.numeric)])
mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
model <- lm(mpg ~ . - name, data = auto)
summary(model)
Call:
lm(formula = mpg ~ . - name, data = auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Yes, definitely. The F-statistic is 252.4 with a very small p-value (< 2.2e-16), which means that at least one of the predictors is significantly related to mpg. The multiple R-squared is 0.8215, meaning the model explains about 82% of the variance in mpg. That’s a strong fit.
From the p-value column, the following predictors are significant at the 0.05 level:
displacement (p = 0.00844)
weight (p < 2e-16)
year (p < 2e-16)
origin (p = 4.67e-07)
These have strong relationships with mpg.
The year coefficient is 0.751, which means:
Each additional model year increases the expected mpg by about 0.75, holding all other variables constant.
This suggests that newer cars are more fuel-efficient.
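As a quick check on this interpretation, one could compare the fitted values for two hypothetical cars that differ only in model year (the specific predictor values below are made up for illustration):
# Two hypothetical cars identical except for model year
new_cars <- data.frame(
  cylinders = 4, displacement = 150, horsepower = 90,
  weight = 2500, acceleration = 15, origin = 1,
  year = c(76, 77)
)
predict(model, newdata = new_cars)
# The gap between the two predictions should equal the year coefficient (~0.75)
diff(predict(model, newdata = new_cars))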
model <- lm(mpg ~ . - name, data = auto)
plot(model)
The Residuals vs Fitted plot shows a curved pattern, suggesting some non-linearity; the relationship between the predictors and mpg may not be purely linear.
A few observations (e.g., 323, 327, 326) stand out as potential outliers or high-leverage points.
These points may have a strong influence on the model and should be reviewed.
Overall, the model might benefit from transformations or nonlinear terms to improve fit.
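One way to follow up on those flagged observations is to look at studentized residuals and leverage values directly (a minimal sketch using the model fit above):
# Largest studentized residuals: candidate outliers
head(sort(abs(rstudent(model)), decreasing = TRUE))
# Largest hat values: candidate high-leverage points
head(sort(hatvalues(model), decreasing = TRUE))
# Rule of thumb: leverage well above (p + 1)/n deserves a closer look
p <- length(coef(model)) - 1
n <- nrow(auto)
which(hatvalues(model) > 3 * (p + 1) / n)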
model_interact <- lm(mpg ~ horsepower * weight, data = auto)
summary(model_interact)
Call:
lm(formula = mpg ~ horsepower * weight, data = auto)
Residuals:
Min 1Q Median 3Q Max
-10.7725 -2.2074 -0.2708 1.9973 14.7314
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.356e+01 2.343e+00 27.127 < 2e-16 ***
horsepower -2.508e-01 2.728e-02 -9.195 < 2e-16 ***
weight -1.077e-02 7.738e-04 -13.921 < 2e-16 ***
horsepower:weight 5.355e-05 6.649e-06 8.054 9.93e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.93 on 388 degrees of freedom
Multiple R-squared: 0.7484, Adjusted R-squared: 0.7465
F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
There is a strong and statistically significant interaction between horsepower and weight. This means the effect of horsepower on fuel efficiency (mpg) changes depending on the weight of the car: because the interaction coefficient is positive, the negative effect of an extra horsepower on mpg is smaller for heavier cars than for lighter ones.
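To see how the slope on horsepower shifts with weight, one can evaluate the implied marginal effect at a few weights (the weights chosen below are arbitrary illustrative values):
# Implied effect of one extra horsepower on mpg at several weights:
# d(mpg)/d(horsepower) = beta_horsepower + beta_interaction * weight
b <- coef(model_interact)
weights <- c(2000, 3000, 4000)
b["horsepower"] + b["horsepower:weight"] * weights
# The slope stays negative but becomes less steep as weight increases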
model_year_origin <- lm(mpg ~ year * origin, data = auto)
summary(model_year_origin)
Call:
lm(formula = mpg ~ year * origin, data = auto)
Residuals:
Min 1Q Median 3Q Max
-11.3141 -3.7120 -0.6513 3.3621 15.5859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -83.3809 12.0000 -6.948 1.57e-11 ***
year 1.3089 0.1576 8.305 1.68e-15 ***
origin 17.3752 6.8325 2.543 0.0114 *
year:origin -0.1663 0.0889 -1.871 0.0621 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.199 on 388 degrees of freedom
Multiple R-squared: 0.5596, Adjusted R-squared: 0.5562
F-statistic: 164.4 on 3 and 388 DF, p-value: < 2.2e-16
The year of the car has a strong positive impact on mpg: newer cars tend to get better mileage. Because origin is coded numerically here (1 = American, 2 = European, 3 = Japanese), the positive origin coefficient suggests non-U.S. cars tend to have somewhat higher mpg, but the interaction term (p = 0.062) gives no strong evidence that the relationship between year and mpg changes based on the car’s origin. A version that treats origin as a categorical variable is sketched below.
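Since origin is stored as a numeric code in this data set, one alternative worth trying is to treat it as a factor so each region gets its own intercept and its own year slope (a sketch, not fit in the original analysis):
# Alternative: origin as a categorical variable
model_year_origin_factor <- lm(mpg ~ year * factor(origin), data = auto)
summary(model_year_origin_factor)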
model_log <- lm(mpg ~ log(horsepower) + log(weight), data = auto)
summary(model_log)
Call:
lm(formula = mpg ~ log(horsepower) + log(weight), data = auto)
Residuals:
Min 1Q Median 3Q Max
-10.6665 -2.4028 -0.3842 2.1558 15.3359
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 179.973 7.420 24.25 < 2e-16 ***
log(horsepower) -7.672 1.210 -6.34 6.36e-10 ***
log(weight) -15.244 1.478 -10.32 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.993 on 389 degrees of freedom
Multiple R-squared: 0.7396, Adjusted R-squared: 0.7382
F-statistic: 552.4 on 2 and 389 DF, p-value: < 2.2e-16
Both predictors are statistically significant (p-values < 0.001).
Negative coefficients:
As log(horsepower) increases, mpg decreases (−7.672). As log(weight) increases, mpg decreases (−15.244).
The log-transformed model shows that both horsepower and weight are strong, statistically significant predictors of mpg. As either increases, mpg drops, confirming that heavier and more powerful cars are less fuel-efficient. The model explains about 74% of the variation in mpg, with an average prediction error of around ±4 mpg. Using the log transformation improved how well the model captures non-linear relationships.
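One way to check that the log transformation actually helps is to compare this fit against an untransformed model with the same two predictors (the baseline model below is fit only for this comparison):
# Baseline with untransformed predictors, for comparison
model_raw <- lm(mpg ~ horsepower + weight, data = auto)
# Lower AIC / higher adjusted R-squared favors the log-transformed fit
AIC(model_raw, model_log)
c(raw = summary(model_raw)$adj.r.squared,
  log = summary(model_log)$adj.r.squared)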
model_sqrt <- lm(mpg ~ sqrt(horsepower) + sqrt(weight), data = auto)
summary(model_sqrt)
Call:
lm(formula = mpg ~ sqrt(horsepower) + sqrt(weight), data = auto)
Residuals:
Min 1Q Median 3Q Max
-10.9211 -2.6240 -0.3587 2.2098 15.7776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 68.49726 1.49097 45.941 < 2e-16 ***
sqrt(horsepower) -1.27083 0.23678 -5.367 1.38e-07 ***
sqrt(weight) -0.59713 0.05522 -10.813 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.096 on 389 degrees of freedom
Multiple R-squared: 0.726, Adjusted R-squared: 0.7246
F-statistic: 515.5 on 2 and 389 DF, p-value: < 2.2e-16
Both sqrt(horsepower) and sqrt(weight) are statistically significant predictors of mpg (p-values < 0.001).
The negative coefficients mean that as horsepower or weight increases (even slightly), fuel efficiency decreases.
For each unit increase in sqrt(horsepower), mpg drops by about 1.27. For each unit increase in sqrt(weight), mpg drops by about 0.60.
Model Performance: R-squared = 0.726 → the model explains about 72.6% of the variability in mpg.
Residual Standard Error (RSE) = 4.096 → on average, predictions are off by about ±4 mpg.
Using the square root transformation improves the fit over a basic linear model and effectively captures some non-linear effects. The fit is slightly less strong than the log-transformed model, but still solid.
model_quad <- lm(mpg ~ horsepower + I(horsepower^2) + weight + I(weight^2), data = auto)
summary(model_quad)
Call:
lm(formula = mpg ~ horsepower + I(horsepower^2) + weight + I(weight^2),
data = auto)
Residuals:
Min 1Q Median 3Q Max
-10.7988 -2.2736 -0.2347 2.0022 14.8074
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.444e+01 2.850e+00 22.608 < 2e-16 ***
horsepower -2.163e-01 3.937e-02 -5.495 7.09e-08 ***
I(horsepower^2) 5.727e-04 1.400e-04 4.092 5.21e-05 ***
weight -1.261e-02 2.306e-03 -5.468 8.17e-08 ***
I(weight^2) 1.258e-06 3.483e-07 3.610 0.000346 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.937 on 387 degrees of freedom
Multiple R-squared: 0.7481, Adjusted R-squared: 0.7455
F-statistic: 287.4 on 4 and 387 DF, p-value: < 2.2e-16
R-squared = 0.7481 → About 74.8% of the variation in mpg is explained — a strong fit.
Residual Standard Error (RSE) = 3.937 → Average error in predictions is about ±4 mpg, slightly better than the square root model.
Including squared terms helps capture nonlinear effects between mpg, horsepower, and weight. The improvement is modest compared to log transformations, but it’s a solid approach for modeling curvature while keeping predictors on their original scale.
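To put the interaction, log, square-root, and quadratic fits side by side, one could tabulate their adjusted R-squared and residual standard errors (a small summary sketch using the models fit above):
models <- list(interaction = model_interact, log = model_log,
               sqrt = model_sqrt, quadratic = model_quad)
data.frame(
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  rse    = sapply(models, function(m) summary(m)$sigma)
)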
library(ISLR2)
data("Carseats")
model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(model)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
In this regression model, the intercept of 13.04 represents the baseline sales for a store that is not located in an urban area and is not in the US, with a price of zero (though price = 0 isn’t realistic, it anchors the model). The coefficient for price is −0.054, indicating that for each $1 increase in price, sales are expected to decrease by about 0.054 units, holding other factors constant. The coefficient for UrbanYes is −0.022, suggesting urban stores have slightly lower sales than non-urban ones, but this effect is not statistically significant. On the other hand, US stores have significantly higher sales—about 1.2 units more—than non-US stores, as shown by the USYes coefficient of 1.20.
Sales = 13.04 − 0.054 × Price − 0.022 × UrbanYes + 1.20 × USYes
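Plugging values into that equation is the same as calling predict() on the fitted model; for example, a hypothetical store charging $120 in a US urban location (the price used here is just an illustrative value):
# Predicted Sales for a hypothetical store: Price = 120, Urban = Yes, US = Yes
predict(model,
        newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))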
We can reject the null hypothesis for:
Price (the p-value is extremely small)
USYes (also a very small p-value)
We cannot reject the null hypothesis for:
UrbanYes (the p-value of 0.936 is far too large, so it is not statistically significant)
model_small <- lm(Sales ~ Price + US, data = Carseats)
summary(model_small)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
This model drops Urban because it didn’t show a statistically significant relationship with Sales.
We tried two models to predict Sales. The first model used three things: Price, Urban, and US. The second, simpler model used just Price and US.
The results were very similar, but the second model (with fewer variables) actually fit slightly better by adjusted R-squared (0.2354 vs 0.2335). This means that Urban didn’t really help explain Sales, so we didn’t need it.
Using fewer, more useful predictors worked just as well — and made the model easier to understand.
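A formal way to confirm that dropping Urban costs nothing is to compare the two nested models with an F-test (a short sketch using the fits above):
# Nested-model F-test: does adding Urban improve on Price + US?
anova(model_small, model)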
confint(model_small)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
plot(model_small)
The diagnostic plots suggest that the model fits the data reasonably well. The residuals appear evenly spread, which supports the assumption of constant variance. The Q-Q plot shows that most residuals follow a normal pattern. The scale-location plot also confirms that error sizes stay fairly consistent. Lastly, the leverage plot shows no major influential observations, so there’s no serious concern about individual points skewing the model.
The coefficient from regressing Y onto X with no intercept is sum(x*y) / sum(x^2), while the coefficient from regressing X onto Y with no intercept is sum(x*y) / sum(y^2). The two estimates are equal if and only if sum(x^2) = sum(y^2), i.e. the sums of squares of X and Y match (for centered variables this amounts to X and Y having equal variances).
# Set seed for reproducibility
set.seed(1)
# Generate Y from a normal distribution with sd = 3
Y <- rnorm(100, mean = 0, sd = 3)

# Create X as a noisy version of Y, so sum(X^2) differs from sum(Y^2)
X <- Y + rnorm(100, mean = 0, sd = 1)
# Regression: Y onto X (no intercept)
model_Y_on_X <- lm(Y ~ X + 0)
coef(model_Y_on_X)
X
0.8905322
# Regression: X onto Y (no intercept)
model_X_on_Y <- lm(X ~ Y + 0)
coef(model_X_on_Y)
Y
0.9979587
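These two estimates can also be reproduced from the closed-form no-intercept formulas, which makes the condition for equality explicit (a quick check using the X and Y generated above):
# Closed-form no-intercept slopes:
#   Y on X: sum(x*y) / sum(x^2)
#   X on Y: sum(x*y) / sum(y^2)
sum(X * Y) / sum(X^2)   # matches coef(model_Y_on_X)
sum(X * Y) / sum(Y^2)   # matches coef(model_X_on_Y)
# They coincide only when sum(X^2) equals sum(Y^2)
c(sum_x2 = sum(X^2), sum_y2 = sum(Y^2))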
set.seed(2)
# Generate a base variable Z
Z <- rnorm(100)
# Let X and Y be the same (or linear transforms with same variance)
X <- Z
Y <- Z
# Regression: Y onto X (no intercept)
model_Y_on_X <- lm(Y ~ X + 0)
coef(model_Y_on_X)
X
1
# Regression: X onto Y (no intercept)
model_X_on_Y <- lm(X ~ Y + 0)
coef(model_X_on_Y)
Y
1