Obviously the main difference is the goal. The KNN classifier is used in situations where the response variable is categorical, whereas the KNN regressor would only be appropriate when it is numeric.
For both the classifier & regressor:
Want to predict for some new test observation: x0 We choose some K nearest neighbours, where K∈N We refer to the K training observations closest to x0 by N0
KNN Classifier:
First, the conditional probability for class j is estimated, which is the fraction of the points in N0 where yi = j.
Pr(Y=j|x=x0)=1K∑i∈N0I(yi=j)
After working out the conditional probability for each j class, the class prediction f^(x0) is then simply the class j with the highest conditional probability:
f^(x0)=argmaxj(Pr(Y=j|x=x0))
KNN Regressor:
The KNN regressor identified the K training observations that are closest to the observation it is predicting (x0), then it estimates f(x0) by taking the mean y value of these training observations.
f^(x0)=1K∑i∈N0yi
#Auto <- read.csv("C://Spring 2021//Algo 2//Database", header=TRUE)
Auto = na.omit(Auto)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <chr> "chevrolet chevelle malibu", "buick skylark 320", "ply...
Auto$name= as.factor(Auto$name)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...
pairs(Auto)
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
mpg_lm <- lm(mpg ~ . - name, data = Auto)
summary(mpg_lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i: We answer this question by performing an F-test, where we test the null hypothesis that all of the regression coefficients are zero:
H0:β1=β2=⋯=βp=0Ha:at least one βj≠0
The p-value is given at the bottom of the model summary (p-value: < 2.2e-16), so it’s clear that the probability of the null hypothesis being true (given our data) is practically zero.
We reject the null hypothesis (and hence conclude that there is a relationship between the predictors and mpg).
ii: Using the coefficient p-values in the model output, and p = 0.05 as my threshold for significance, all variables except cylinders, horsepower & acceleration have a statistically significant relationship with the response.
iii:
Extracting the coefficient for year (β7^):
coef(mpg_lm)[7]
## year
## 0.7770269
A coefficient of 0.777 means that the effect of an increase of 1 year (with all other effects held constant) is associated with an increase of 0.777 in the response (mpg). This could perhaps be thought of as the rate of the cars increase in efficiency year-on-year.
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(mpg_lm)
mpg_lm_stats <- broom::augment(mpg_lm) %>%
dplyr::select(.rownames,.fitted, .hat, .resid, .std.resid, .cooksd)%>%
arrange(desc(.cooksd))
mpg_lm_stats
## # A tibble: 392 x 6
## .rownames .fitted .hat .resid .std.resid .cooksd
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 14 19.4 0.194 -5.42 -1.83 0.0895
## 2 394 35.5 0.0646 8.53 2.67 0.0547
## 3 327 32.4 0.0406 11.0 3.41 0.0546
## 4 326 33.9 0.0332 10.4 3.20 0.0391
## 5 245 33.0 0.0323 10.1 3.11 0.0358
## 6 387 28.7 0.0342 9.33 2.87 0.0324
## 7 323 33.2 0.0151 13.4 4.07 0.0282
## 8 112 27.0 0.0294 -9.01 -2.77 0.0257
## 9 328 27.9 0.0297 8.55 2.62 0.0234
## 10 330 34.7 0.0217 9.86 3.01 0.0224
## # ... with 382 more rows
mpg_lm_stats$cooksd_cutoff <- factor(ifelse(mpg_lm_stats$.cooksd >= 4 / nrow(Auto), "True", "False"))
I extract these stats (leverage, residual stats, cooks distance, etc.) for each observation below using the broom package, and highlight those observations with CooksD≥4n, a common threshold for identifying influential observations.
ggplot(mpg_lm_stats, aes(x = .hat, y = .std.resid, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Residuals vs Leverage",
x = "Leverage",
y = "Standardized Residuals",
col = "Cooks Distance >= 4/n")
ggplot(mpg_lm_stats, aes(x = .hat, y = .cooksd, col = cooksd_cutoff)) +
geom_point(alpha = 0.8) +
geom_hline(yintercept = 4 / nrow(Auto), col = "grey30") +
scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Cooks Distance vs Leverage",
x = "Leverage",
y = "Cooks Distance",
col = "Cooks Distance >= 4/n")
#### Interaction Terms #### Q: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Since we have relatively few predictors, we can test all interactions with mpg ~ . * . in the call to lm():
summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
We can see the significant terms (at the 0.05 level) are those with at least one asterisk (*). It is probably unreasonable to use a significance level of 0.05 here, as we are testing such a large number of hypothesis, perhaps a lower threshold for significance (or a p-value correction using the p.adjust() function) would be appropriate.
Using the standard threshold of 0.05, the significant interaction terms are given by:
cylinders:acceleration acceleration:year acceleration:originEuropean acceleration:originJapanese year:originEuropean year:originJapanese
I use my best_predictor() function here, as defined in my chapter 2 solutions.
There is no testing for X−−√ transformations here, but these tend to perform similarly to transformations of the form log(X).
best_predictor <- function(dataframe, response) {
if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
stop("Make sure that all variables are of class numeric/factor!")
}
# pre-allocate vectors
varname <- c()
vartype <- c()
R2 <- c()
R2_log <- c()
R2_quad <- c()
AIC <- c()
AIC_log <- c()
AIC_quad <- c()
y <- dataframe[ ,response]
# # # # # NUMERIC RESPONSE # # # # #
if (is.numeric(y)) {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
R2[i] <- summary(lm(y ~ x))$r.squared
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
} else {
R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
}
} else {
R2_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
} else {
R2_quad[i] <- NA
}
} else {
R2[i] <- NA
R2_log[i] <- NA
R2_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
R2 = round(R2, 3),
R2_log = round(R2_log, 3),
R2_quad = round(R2_quad, 3)) %>%
mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
arrange(desc(max_R2))
# # # # # CATEGORICAL RESPONSE # # # # #
} else {
for (i in 1:ncol(dataframe)) {
x <- dataframe[ ,i]
varname[i] <- names(dataframe)[i]
if (class(x) %in% c("numeric", "integer")) {
vartype[i] <- "numeric"
} else {
vartype[i] <- "categorical"
}
if (!identical(y, x)) {
# linear: y ~ x
AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic
# log-transform: y ~ log(x)
if (is.numeric(x)) {
if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
} else {
AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
}
} else {
AIC_log[i] <- NA
}
# quadratic: y ~ x + x^2
if (is.numeric(x)) {
AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
} else {
AIC_quad[i] <- NA
}
} else {
AIC[i] <- NA
AIC_log[i] <- NA
AIC_quad[i] <- NA
}
}
print(paste("Response variable:", response))
data.frame(varname,
vartype,
AIC = round(AIC, 3),
AIC_log = round(AIC_log, 3),
AIC_quad = round(AIC_quad, 3)) %>%
mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
arrange(min_AIC)
}
}
best_predictor(Auto[ ,-9], "mpg")
## [1] "Response variable: mpg"
## varname vartype R2 R2_log R2_quad max_R2
## 1 weight numeric 0.693 0.713 0.715 0.715
## 2 displacement numeric 0.648 0.686 0.689 0.689
## 3 horsepower numeric 0.606 0.668 0.688 0.688
## 4 cylinders numeric 0.605 0.603 0.607 0.607
## 5 year numeric 0.337 0.332 0.368 0.368
## 6 origin categorical 0.332 NA NA 0.332
## 7 acceleration numeric 0.179 0.190 0.194 0.194
## 8 mpg numeric NA NA NA NA
mpg_predictors <- best_predictor(Auto[ ,-9], "mpg") %>%
dplyr::select(-c(vartype, max_R2)) %>%
gather(key = "key", value = "R2", -varname) %>%
filter(!is.na(R2))
## [1] "Response variable: mpg"
mpg_predictors_order <- best_predictor(Auto[ ,-9], "mpg") %>%
dplyr::select(varname, max_R2) %>%
filter(!is.na(max_R2)) %>%
arrange(desc(max_R2)) %>%
pull(varname)
## [1] "Response variable: mpg"
mpg_predictors$varname <- factor(mpg_predictors$varname,
ordered = T,
levels = mpg_predictors_order)
ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) +
geom_line(col = "grey15") +
geom_point(size = 2) +
theme_light() +
theme(legend.position = "bottom") +
labs(title = "Best predictors (& transformations) for mpg",
col = "Predictor Transformation",
y = "Predictor") +
scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))
We might say from this that we should test quadratic terms for weight, displacement, horsepower & year
Recall, for the standard linear model, we get an adjusted R2 of 0.8205.
mpg_lm_2 <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)
summary(mpg_lm_2)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## originEuropean 1.323e+00 5.304e-01 2.494 0.013068 *
## originJapanese 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
The adjusted R2 has improved quite significantly: 0.8735
Notice how it seems like there is less non-linearity in the residuals.
par(mfrow=c(2,2))
plot(mpg_lm_2)
However, there is still some fan-shape to the residual plot, and the Q-Q plot is showing that the residuals are not as normal as we would hope. Perhaps a log-transformation of the response (mpg) will help this:
mpg_lm_3 <- lm(log(mpg) ~ . - name, data = Auto)
summary(mpg_lm_3)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40380 -0.06679 0.00493 0.06913 0.33036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.712e+00 1.673e-01 10.230 < 2e-16 ***
## cylinders -2.781e-02 1.149e-02 -2.420 0.01598 *
## displacement 7.874e-04 2.738e-04 2.876 0.00425 **
## horsepower -1.520e-03 4.904e-04 -3.100 0.00208 **
## weight -2.639e-04 2.344e-05 -11.260 < 2e-16 ***
## acceleration -1.403e-03 3.513e-03 -0.399 0.68996
## year 3.055e-02 1.852e-03 16.491 < 2e-16 ***
## originEuropean 8.531e-02 2.026e-02 4.210 3.18e-05 ***
## originJapanese 8.145e-02 1.977e-02 4.119 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.879
## F-statistic: 356.1 on 8 and 383 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_3)
Pushing for even higher performance from the linear model, I make the following changes.
Note that I have tried to avoid adding too many complex interactions & transformations to the model, but have instead focused on extracting the relationships that I think are most likely to exist and add performance on new data:
log-transform the response (mpg) introduce some quadratic terms (for those that are non-linearly related to log(mpg)) introduce the most significant interaction terms introduce a brand variable that i created in chapter 2, through text-mining the name variable
Auto_2 <- Auto %>%
mutate(mpg = log(mpg))
# new variable
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split = " "),
function(x) x[1]) # extract the first item from each list element
Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen",
ifelse(Auto_2$brand == "toyouta", "toyota",
ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet",
ifelse(Auto_2$brand == "maxda", "mazda",
Auto_2$brand))))) # fixing typo's
Auto_2$brand <- fct_lump(Auto_2$brand,
n = 9,
other_level = "uncommon") # collapse into 10 categories
Auto_2$name <- NULL
# gives ideas for transformations:
# best_predictor(Auto_2, "mpg")
# gives ideas for interactions:
# summary(lm(mpg ~ . * ., data = Auto_2[ ,-9]))
mpg_lm_4 <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)
summary(mpg_lm_4)
##
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year +
## acceleration:origin, data = Auto_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36578 -0.05924 0.00281 0.05853 0.32154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.407e+01 2.668e+00 5.274 2.28e-07 ***
## cylinders -6.017e-03 1.101e-02 -0.547 0.584993
## displacement -1.286e-04 2.835e-04 -0.454 0.650361
## horsepower -8.124e-03 1.344e-03 -6.046 3.63e-09 ***
## weight -1.541e-04 2.578e-05 -5.978 5.34e-09 ***
## acceleration -1.506e-01 4.536e-02 -3.320 0.000991 ***
## year -2.541e-01 7.136e-02 -3.560 0.000419 ***
## originEuropean -2.801e-01 9.866e-02 -2.839 0.004772 **
## originJapanese -2.721e-01 1.229e-01 -2.214 0.027430 *
## brandbuick 6.683e-02 3.351e-02 1.994 0.046840 *
## brandchevrolet 3.528e-02 2.565e-02 1.375 0.169816
## branddatsun 1.896e-01 3.943e-02 4.809 2.21e-06 ***
## branddodge 5.767e-02 2.936e-02 1.965 0.050209 .
## brandford -3.803e-03 2.573e-02 -0.148 0.882547
## brandplymouth 6.476e-02 2.834e-02 2.285 0.022871 *
## brandtoyota 1.106e-01 3.842e-02 2.880 0.004207 **
## brandvolkswagen 2.639e-02 4.022e-02 0.656 0.512122
## branduncommon 6.968e-02 2.657e-02 2.622 0.009100 **
## I(horsepower^2) 1.789e-05 4.230e-06 4.229 2.96e-05 ***
## I(year^2) 1.685e-03 4.829e-04 3.490 0.000540 ***
## acceleration:year 1.690e-03 5.954e-04 2.839 0.004782 **
## acceleration:originEuropean 1.912e-02 5.590e-03 3.420 0.000697 ***
## acceleration:originJapanese 1.544e-02 7.407e-03 2.084 0.037829 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1043 on 369 degrees of freedom
## Multiple R-squared: 0.9112, Adjusted R-squared: 0.9059
## F-statistic: 172.2 on 22 and 369 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_4)
The heteroscedasticity of the residuals has definitely decreased significantly since the first linear model. The adjusted R2 has improved from 0.8205 to 0.9059, which is pretty impressive given that no additional data has been collected.
Carseats is in the ISLR package, so the data has already been loaded.
glimpse(Carseats)
## Rows: 400
## Columns: 11
## $ Sales <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.85, 6.5...
## $ CompPrice <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, ...
## $ Income <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35...
## $ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, ...
## $ Population <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 5...
## $ Price <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, ...
## $ ShelveLoc <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medi...
## $ Age <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53,...
## $ Education <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18,...
## $ Urban <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes,...
## $ US <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes...
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(sales_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price = -0.054
Interpretation = the effect of a 1-unit increase in Price (for fixed values of Urban & US) is a change in Sales of -0.054 units (54 sales).
Urban = -0.022
Interpretation = the effect of a store being in an urban area (for fixed values of Price & US) is a change in Sales of 0.022 units (22 sales). However, in this case, since the p-value for this variables T-test is so high, we can say that there is no evidence for a relationship between the car seat Sales at a store and whether the store was Urban (or rural).
US = 1.200
Interpretation = the effect of a store being in the US (for fixed values of Price & Urban) is a change in Sales of 1.2 units (1200 sales).
Sales^=13.043469−0.054459⋅Price−0.021916⋅Urban+1.200573⋅US
Where:
Urban = 1 for a store in an urban location, else 0 US = 1 for a store in the US, else 0
This question is asking about the results from the parameter T-tests. Based on the output & comments from part (b), we can reject the null hypothesis for the Price and US predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for Urban is zero.
Removing the Urban variable:
sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(sales_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
For model (a), we have:
R2 = 0.23928 adjusted R2 = 0.23351 For model (e), we have:
R2 = 0.23926 adjusted R2 = 0.23543 In this case, both models explain ~23.9% of the variance in Sales.
This is an interesting case to talk about using training R2 vs R¯2 (adjusted R2) in model selection, since model (a) has a higher R2 but lower R¯2.
Recall:
R2=1−RSSTSS
R¯2=1−RSS/(n−p−1)TSS/(n−1)
RSS=∑i=1n(yi−y^)2
TSS=∑i=1n(yi−y¯)2
TSS will be constant regardless of the model being fit, and so maximizing R¯2 is equivalent to minimizing RSS/(n−p−1).
The fact that RSS will always decrease as variables are added to the model explains why R2 always increases with additional variables.
However, with R¯2, adding new variables will mean that the increase in RSS is now ‘competing’ with the increase in n−p−1 (specifically, the increase in p), so it is less clear whether R¯2 will increase or not.
In this particular case, we can see that model (e) adds a useless variable (Urban) and sees an increase in R2 but a decrease in R¯2, because the boost in RSS that the model sees is insufficient in outweighing the penalty by the increase in p.
We should really choose (e) as the better model in this case, as it provides a more parsimonious model, and this is a good example of why the training R2 or RSS can be very limited.
The confidence intervals are calculated using:
βi±tn−2(0.975)⋅SE(βi)
Where:
βi^ is the parameter estimate tn−2(0.975) is the 97.5% quantile of a t-distribution, with n−2 degrees of freedom (qt(0.975, nrow(data) - 2)) SE(βi^) is the standard error of βi^
confint(sales_lm_2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
We can say, for instance, that there is a 95% probability that the true parameter for Price (β1) falls within the interval: (-0.0648, -0.0442), and a 5% probability that it doesn’t.
Since the average leverage statistic is always equal to (p+1)/n, we can identify observations with a leverage far higher than (p+1)/n as ‘high-leverage’.
round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(sales_lm_2)), 10)
## [1] TRUE
Similarly, we can identify potential outliers as those observations with a standardized residual outside of [−2,2].
broom::augment(sales_lm_2) %>%
dplyr::select(.hat, .std.resid) %>%
ggplot(aes(x = .hat, y = .std.resid)) +
geom_point() +
geom_hline(yintercept = -2, col = "deepskyblue3", size = 1) +
geom_hline(yintercept = 2, col = "deepskyblue3", size = 1) +
geom_vline(xintercept = 3 / nrow(Carseats), col = "mediumseagreen", size = 1) +
theme_light() +
labs(x = "Leverage",
y = "Standardized Residual",
title = "Residuals vs Leverage")
Looking at these thresholds on a graph, it is hard to identify when an observation is large enough in residual & leverage to be influential and worth investigating.
This is why I prefer the approach of using a metric that combines these (e.g. cooks distance):
broom::augment(sales_lm_2) %>%
rownames_to_column("rowid") %>%
arrange(desc(.cooksd)) %>%
dplyr::select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
## Sales Price US .std.resid .hat .cooksd
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
## 1 14.9 82 No 2.58 0.0116 0.0261
## 2 14.4 53 No 1.73 0.0237 0.0243
## 3 10.6 149 No 2.32 0.0126 0.0228
## 4 15.6 72 Yes 2.17 0.0129 0.0205
## 5 0.37 191 Yes -1.42 0.0286 0.0198
## 6 16.3 92 Yes 2.87 0.00664 0.0183
## 7 3.47 81 No -2.10 0.0119 0.0177
## 8 0.53 159 Yes -2.05 0.0119 0.0169
## 9 13.6 89 No 2.18 0.00983 0.0158
## 10 3.02 90 Yes -2.56 0.00710 0.0157
## # ... with 390 more rows
In this case it’s unlikely that you could justify removing any other observations or class them as problematic. The model is very simple with p = 2, and many of these observations would not be an issue if more strong predictors were added.
This problem involves simple linear regression without an intercept.
Using the same notation as previously, where:
y=βaxx=βby
We have:
βa=∑xy∑x2βb=∑yx∑y2βa=βb⟺∑xy∑x2=∑yx∑y2⟺∑i=1nx2i=∑i=1ny2i
All that is required here is that the above equality doesn’t hold (which will happen in all x, y combinations, with the exception of a very few).
I generate 100 observations with the following properties:
y=5⋅x+ϵϵ∼N(0,22)
set.seed(2)
x <- rnorm(100)
y <- 5*x + rnorm(100, sd = 2)
data <- data.frame(x, y)
lm_9 <- lm(y ~ x + 0)
lm_10 <- lm(x ~ y + 0)
We have:
∑ni=1x2i = 133.3521492 ∑ni=1y2i = 3591.2935934 It’s quite clear that βa≠βb:
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_9), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_10), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_9 & lm_10)")
The actual coefficients:
βa^ = 4.904515 βb^ = 0.182115
One such example where ∑ni=1x2i=∑ni=1y2i would hold is obviously where the two vectors are equal (a relationship where R2 = 1, and βb=1βa=11=1):
set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)
lm_11 <- lm(y ~ x + 0)
lm_12 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_11), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_12), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_11 & lm_12)")
We have:
∑ni=1x2i = 72.566061 ∑ni=1y2i = 72.566061 βa^ = 1 βb^ = 1
Another example would be where vectors x and y are permutations of one another. Clearly ∑ni=1x2i=∑ni=1y2i would hold here, so the parameters should also be equal:
set.seed(4)
x <- 1:100
y <- x[order(rnorm(100))]
data <- data.frame(x, y)
lm_13 <- lm(y ~ x + 0)
lm_14 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_13), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_14), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_13 & lm_14)")
We have:
∑ni=1x2i = 3.383510^{5} ∑ni=1y2i = 3.383510{5} βa = 0.692493 βb^ = 0.692493 Note that the previous example showed perfect relationship where βa=βb held true, and this example shows a random relationship where it also holds. You could create a relationship as strong or as weak as desired, simply by permuting less (or more) or the elements of x, and still we would see that βa=βb holds.
Another obvious example would be where y=|x|:
set.seed(5)
x <- rnorm(100)
y <- abs(x)
data <- data.frame(x, y)
lm_15 <- lm(y ~ x + 0)
lm_16 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) +
geom_point(color="red") +
geom_abline(intercept = 0, slope = coef(lm_15), size = 1, col = "deepskyblue3") +
geom_abline(intercept = 0, slope = 1 / coef(lm_16), size = 1, col = "mediumseagreen") +
labs(title = "Regression lines (lm_15 & lm_16)")
We have:
∑ni=1x2i = 88.5627779 ∑ni=1y2i = 88.5627779 βa^ = 0.115944 βb^ = 0.115944