2. KNN: Classification vs Regression

Q: Carefully explain the differences between the KNN classifier and KNN regression methods.

A:

Obviously the main difference is the goal. The KNN classifier is used in situations where the response variable is categorical, whereas the KNN regressor would only be appropriate when it is numeric.

For both the classifier & regressor:

Want to predict for some new test observation: x0 We choose some K nearest neighbours, where K∈N We refer to the K training observations closest to x0 by N0

KNN Classifier:

First, the conditional probability for class j is estimated, which is the fraction of the points in N0 where yi = j.

Pr(Y=j|x=x0)=1K∑i∈N0I(yi=j)

After working out the conditional probability for each j class, the class prediction f^(x0) is then simply the class j with the highest conditional probability:

f^(x0)=argmaxj(Pr(Y=j|x=x0))

KNN Regressor:

The KNN regressor identified the K training observations that are closest to the observation it is predicting (x0), then it estimates f(x0) by taking the mean y value of these training observations.

f^(x0)=1K∑i∈N0yi

9. Multiple Linear Regression:

This question involves the use of multiple linear regression on the Auto data set.

scatterplot matrix:

Q: Produce a scatterplot matrix which includes all of the variables in the data set.

#Auto <- read.csv("C://Spring 2021//Algo 2//Database", header=TRUE)
Auto = na.omit(Auto)
glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <chr> "chevrolet chevelle malibu", "buick skylark 320", "ply...
Auto$name= as.factor(Auto$name)

glimpse(Auto)
## Rows: 392
## Columns: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14...
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, ...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383,...
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170,...
## $ weight       <int> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, ...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8....
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70...
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, ...
## $ name         <fct> chevrolet chevelle malibu, buick skylark 320, plymouth...

A:

pairs(Auto)

(b) Correlation Matrix

Q: Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Multiple Linear Regression

Q: Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary()function to print the results.Comment on the output. For instance:

  1. Is there a relationship between the predictors and the response?
  2. Which predictors appear to have a statistically significant relationship to the response?
  3. What does the coefficient for the year variable suggest?
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

mpg_lm <- lm(mpg ~ . - name, data = Auto)
summary(mpg_lm)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

i: We answer this question by performing an F-test, where we test the null hypothesis that all of the regression coefficients are zero:

H0:β1=β2=⋯=βp=0Ha:at least one βj≠0

The p-value is given at the bottom of the model summary (p-value: < 2.2e-16), so it’s clear that the probability of the null hypothesis being true (given our data) is practically zero.

We reject the null hypothesis (and hence conclude that there is a relationship between the predictors and mpg).

ii: Using the coefficient p-values in the model output, and p = 0.05 as my threshold for significance, all variables except cylinders, horsepower & acceleration have a statistically significant relationship with the response.

iii:

Extracting the coefficient for year (β7^):

coef(mpg_lm)[7]
##      year 
## 0.7770269

A coefficient of 0.777 means that the effect of an increase of 1 year (with all other effects held constant) is associated with an increase of 0.777 in the response (mpg). This could perhaps be thought of as the rate of the cars increase in efficiency year-on-year.

d. Diagnostic Plots

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(mpg_lm)

mpg_lm_stats <- broom::augment(mpg_lm) %>%
  dplyr::select(.rownames,.fitted, .hat, .resid, .std.resid, .cooksd)%>%
  arrange(desc(.cooksd))

mpg_lm_stats
## # A tibble: 392 x 6
##    .rownames .fitted   .hat .resid .std.resid .cooksd
##    <chr>       <dbl>  <dbl>  <dbl>      <dbl>   <dbl>
##  1 14           19.4 0.194   -5.42      -1.83  0.0895
##  2 394          35.5 0.0646   8.53       2.67  0.0547
##  3 327          32.4 0.0406  11.0        3.41  0.0546
##  4 326          33.9 0.0332  10.4        3.20  0.0391
##  5 245          33.0 0.0323  10.1        3.11  0.0358
##  6 387          28.7 0.0342   9.33       2.87  0.0324
##  7 323          33.2 0.0151  13.4        4.07  0.0282
##  8 112          27.0 0.0294  -9.01      -2.77  0.0257
##  9 328          27.9 0.0297   8.55       2.62  0.0234
## 10 330          34.7 0.0217   9.86       3.01  0.0224
## # ... with 382 more rows
mpg_lm_stats$cooksd_cutoff <- factor(ifelse(mpg_lm_stats$.cooksd >= 4 / nrow(Auto), "True", "False"))

I extract these stats (leverage, residual stats, cooks distance, etc.) for each observation below using the broom package, and highlight those observations with CooksD≥4n, a common threshold for identifying influential observations.

ggplot(mpg_lm_stats, aes(x = .hat, y = .std.resid, col = cooksd_cutoff)) + 
  geom_point(alpha = 0.8) + 
  scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
  theme_light() + 
  theme(legend.position = "bottom") +
  labs(title = "Residuals vs Leverage", 
       x = "Leverage", 
       y = "Standardized Residuals", 
       col = "Cooks Distance >= 4/n")

ggplot(mpg_lm_stats, aes(x = .hat, y = .cooksd, col = cooksd_cutoff)) + 
  geom_point(alpha = 0.8) + 
  geom_hline(yintercept = 4 / nrow(Auto), col = "grey30") +
  scale_color_manual(values = c("mediumseagreen", "#BB8FCE")) +
  theme_light() + 
  theme(legend.position = "bottom") +
  labs(title = "Cooks Distance vs Leverage", 
       x = "Leverage", 
       y = "Cooks Distance", 
       col = "Cooks Distance >= 4/n")

#### Interaction Terms #### Q: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

A:

Since we have relatively few predictors, we can test all interactions with mpg ~ . * . in the call to lm():

summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6008 -1.2863  0.0813  1.2082 12.0382 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.401e+01  5.147e+01   0.855 0.393048    
## cylinders                    3.302e+00  8.187e+00   0.403 0.686976    
## displacement                -3.529e-01  1.974e-01  -1.788 0.074638 .  
## horsepower                   5.312e-01  3.390e-01   1.567 0.117970    
## weight                      -3.259e-03  1.820e-02  -0.179 0.857980    
## acceleration                -6.048e+00  2.147e+00  -2.818 0.005109 ** 
## year                         4.833e-01  5.923e-01   0.816 0.415119    
## originEuropean              -3.517e+01  1.260e+01  -2.790 0.005547 ** 
## originJapanese              -3.765e+01  1.426e+01  -2.640 0.008661 ** 
## cylinders:displacement      -6.316e-03  7.106e-03  -0.889 0.374707    
## cylinders:horsepower         1.452e-02  2.457e-02   0.591 0.555109    
## cylinders:weight             5.703e-04  9.044e-04   0.631 0.528709    
## cylinders:acceleration       3.658e-01  1.671e-01   2.189 0.029261 *  
## cylinders:year              -1.447e-01  9.652e-02  -1.499 0.134846    
## cylinders:originEuropean    -7.210e-01  1.088e+00  -0.662 0.508100    
## cylinders:originJapanese     1.226e+00  1.007e+00   1.217 0.224379    
## displacement:horsepower     -5.407e-05  2.861e-04  -0.189 0.850212    
## displacement:weight          2.659e-05  1.455e-05   1.828 0.068435 .  
## displacement:acceleration   -2.547e-03  3.356e-03  -0.759 0.448415    
## displacement:year            4.547e-03  2.446e-03   1.859 0.063842 .  
## displacement:originEuropean -3.364e-02  4.220e-02  -0.797 0.425902    
## displacement:originJapanese  5.375e-02  4.145e-02   1.297 0.195527    
## horsepower:weight           -3.407e-05  2.955e-05  -1.153 0.249743    
## horsepower:acceleration     -3.445e-03  3.937e-03  -0.875 0.382122    
## horsepower:year             -6.427e-03  3.891e-03  -1.652 0.099487 .  
## horsepower:originEuropean   -4.869e-03  5.061e-02  -0.096 0.923408    
## horsepower:originJapanese    2.289e-02  6.252e-02   0.366 0.714533    
## weight:acceleration         -6.851e-05  2.385e-04  -0.287 0.774061    
## weight:year                 -8.065e-05  2.184e-04  -0.369 0.712223    
## weight:originEuropean        2.277e-03  2.685e-03   0.848 0.397037    
## weight:originJapanese       -4.498e-03  3.481e-03  -1.292 0.197101    
## acceleration:year            6.141e-02  2.547e-02   2.412 0.016390 *  
## acceleration:originEuropean  9.234e-01  2.641e-01   3.496 0.000531 ***
## acceleration:originJapanese  7.159e-01  3.258e-01   2.198 0.028614 *  
## year:originEuropean          2.932e-01  1.444e-01   2.031 0.043005 *  
## year:originJapanese          3.139e-01  1.483e-01   2.116 0.035034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8866 
## F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

We can see the significant terms (at the 0.05 level) are those with at least one asterisk (*). It is probably unreasonable to use a significance level of 0.05 here, as we are testing such a large number of hypothesis, perhaps a lower threshold for significance (or a p-value correction using the p.adjust() function) would be appropriate.

Using the standard threshold of 0.05, the significant interaction terms are given by:

cylinders:acceleration acceleration:year acceleration:originEuropean acceleration:originJapanese year:originEuropean year:originJapanese

f. Variable Transformation

Q: Try a few different transformations of the variables, such as log(X), X−−√, X2. Comment on your findings.

A:

I use my best_predictor() function here, as defined in my chapter 2 solutions.

There is no testing for X−−√ transformations here, but these tend to perform similarly to transformations of the form log(X).

best_predictor <- function(dataframe, response) {
  
  if (sum(sapply(dataframe, function(x) {is.numeric(x) | is.factor(x)})) < ncol(dataframe)) {
    stop("Make sure that all variables are of class numeric/factor!")
  }
  
  # pre-allocate vectors
  varname <- c()
  vartype <- c()
  R2 <- c()
  R2_log <- c()
  R2_quad <- c()
  AIC <- c()
  AIC_log <- c()
  AIC_quad <- c()
  y <- dataframe[ ,response]
  
  # # # # # NUMERIC RESPONSE # # # # #
  if (is.numeric(y)) {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # linear: y ~ x
        R2[i] <- summary(lm(y ~ x))$r.squared 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            R2_log[i] <- summary(lm(y ~ log(x + abs(min(x)) + 1)))$r.squared
          } else {
            R2_log[i] <- summary(lm(y ~ log(x)))$r.squared
          }
        } else {
          R2_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          R2_quad[i] <- summary(lm(y ~ x + I(x^2)))$r.squared
        } else {
          R2_quad[i] <- NA
        }
        
      } else {
        R2[i] <- NA
        R2_log[i] <- NA
        R2_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               R2 = round(R2, 3), 
               R2_log = round(R2_log, 3), 
               R2_quad = round(R2_quad, 3)) %>%
      mutate(max_R2 = pmax(R2, R2_log, R2_quad, na.rm = T)) %>%
      arrange(desc(max_R2))
    
    
    # # # # # CATEGORICAL RESPONSE # # # # #
  } else {
    
    for (i in 1:ncol(dataframe)) {
      
      x <- dataframe[ ,i]
      varname[i] <- names(dataframe)[i]
      
      if (class(x) %in% c("numeric", "integer")) {
        vartype[i] <- "numeric"
      } else {
        vartype[i] <- "categorical"
      }
      
      if (!identical(y, x)) {
        
        # linear: y ~ x
        AIC[i] <- summary(glm(y ~ x, family = "binomial"))$aic 
        
        # log-transform: y ~ log(x)
        if (is.numeric(x)) { 
          if (min(x) <= 0) { # if y ~ log(x) for min(x) <= 0, do y ~ log(x + abs(min(x)) + 1)
            AIC_log[i] <- summary(glm(y ~ log(x + abs(min(x)) + 1), family = "binomial"))$aic
          } else {
            AIC_log[i] <- summary(glm(y ~ log(x), family = "binomial"))$aic
          }
        } else {
          AIC_log[i] <- NA
        }
        
        # quadratic: y ~ x + x^2
        if (is.numeric(x)) { 
          AIC_quad[i] <- summary(glm(y ~ x + I(x^2), family = "binomial"))$aic
        } else {
          AIC_quad[i] <- NA
        }
        
      } else {
        AIC[i] <- NA
        AIC_log[i] <- NA
        AIC_quad[i] <- NA
      }
    }
    
    print(paste("Response variable:", response))
    
    data.frame(varname, 
               vartype, 
               AIC = round(AIC, 3), 
               AIC_log = round(AIC_log, 3), 
               AIC_quad = round(AIC_quad, 3)) %>%
      mutate(min_AIC = pmin(AIC, AIC_log, AIC_quad, na.rm = T)) %>%
      arrange(min_AIC)
  } 
}
best_predictor(Auto[ ,-9], "mpg")
## [1] "Response variable: mpg"
##        varname     vartype    R2 R2_log R2_quad max_R2
## 1       weight     numeric 0.693  0.713   0.715  0.715
## 2 displacement     numeric 0.648  0.686   0.689  0.689
## 3   horsepower     numeric 0.606  0.668   0.688  0.688
## 4    cylinders     numeric 0.605  0.603   0.607  0.607
## 5         year     numeric 0.337  0.332   0.368  0.368
## 6       origin categorical 0.332     NA      NA  0.332
## 7 acceleration     numeric 0.179  0.190   0.194  0.194
## 8          mpg     numeric    NA     NA      NA     NA
mpg_predictors <- best_predictor(Auto[ ,-9], "mpg") %>%
  dplyr::select(-c(vartype, max_R2)) %>%
  gather(key = "key", value = "R2", -varname) %>%
  filter(!is.na(R2))
## [1] "Response variable: mpg"
mpg_predictors_order <- best_predictor(Auto[ ,-9], "mpg") %>%
  dplyr::select(varname, max_R2) %>%
  filter(!is.na(max_R2)) %>%
  arrange(desc(max_R2)) %>%
  pull(varname)
## [1] "Response variable: mpg"
mpg_predictors$varname <- factor(mpg_predictors$varname, 
                                 ordered = T, 
                                 levels = mpg_predictors_order)
ggplot(mpg_predictors, aes(x = R2, y = varname, col = key, group = varname)) + 
  geom_line(col = "grey15") +
  geom_point(size = 2) + 
  theme_light() + 
  theme(legend.position = "bottom") +
  labs(title = "Best predictors (& transformations) for mpg", 
       col = "Predictor Transformation", 
       y = "Predictor") + 
  scale_color_discrete(labels = c("None", "Log-Transformed", "Order 2 Polynomial"))

We might say from this that we should test quadratic terms for weight, displacement, horsepower & year

Recall, for the standard linear model, we get an adjusted R2 of 0.8205.

mpg_lm_2 <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)

summary(mpg_lm_2)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) + 
##     I(horsepower^2) + I(year^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4816 -1.5384  0.0735  1.3671 12.0213 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.185e+02  6.966e+01   6.008 4.40e-09 ***
## cylinders          5.073e-01  3.191e-01   1.590 0.112692    
## displacement      -3.328e-02  2.045e-02  -1.627 0.104480    
## horsepower        -1.781e-01  3.953e-02  -4.506 8.81e-06 ***
## weight            -1.114e-02  2.587e-03  -4.306 2.12e-05 ***
## acceleration      -1.700e-01  9.652e-02  -1.762 0.078960 .  
## year              -1.019e+01  1.837e+00  -5.546 5.49e-08 ***
## originEuropean     1.323e+00  5.304e-01   2.494 0.013068 *  
## originJapanese     1.258e+00  5.129e-01   2.452 0.014637 *  
## I(weight^2)        1.182e-06  3.438e-07   3.439 0.000649 ***
## I(displacement^2)  5.839e-05  3.435e-05   1.700 0.089967 .  
## I(horsepower^2)    4.388e-04  1.336e-04   3.284 0.001118 ** 
## I(year^2)          7.210e-02  1.207e-02   5.974 5.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared:  0.8773, Adjusted R-squared:  0.8735 
## F-statistic: 225.9 on 12 and 379 DF,  p-value: < 2.2e-16

The adjusted R2 has improved quite significantly: 0.8735

Notice how it seems like there is less non-linearity in the residuals.

par(mfrow=c(2,2))
plot(mpg_lm_2)

However, there is still some fan-shape to the residual plot, and the Q-Q plot is showing that the residuals are not as normal as we would hope. Perhaps a log-transformation of the response (mpg) will help this:

mpg_lm_3 <- lm(log(mpg) ~ . - name, data = Auto)

summary(mpg_lm_3)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40380 -0.06679  0.00493  0.06913  0.33036 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.712e+00  1.673e-01  10.230  < 2e-16 ***
## cylinders      -2.781e-02  1.149e-02  -2.420  0.01598 *  
## displacement    7.874e-04  2.738e-04   2.876  0.00425 ** 
## horsepower     -1.520e-03  4.904e-04  -3.100  0.00208 ** 
## weight         -2.639e-04  2.344e-05 -11.260  < 2e-16 ***
## acceleration   -1.403e-03  3.513e-03  -0.399  0.68996    
## year            3.055e-02  1.852e-03  16.491  < 2e-16 ***
## originEuropean  8.531e-02  2.026e-02   4.210 3.18e-05 ***
## originJapanese  8.145e-02  1.977e-02   4.119 4.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared:  0.8815, Adjusted R-squared:  0.879 
## F-statistic: 356.1 on 8 and 383 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_3)

Pushing for even higher performance from the linear model, I make the following changes.

Note that I have tried to avoid adding too many complex interactions & transformations to the model, but have instead focused on extracting the relationships that I think are most likely to exist and add performance on new data:

log-transform the response (mpg) introduce some quadratic terms (for those that are non-linearly related to log(mpg)) introduce the most significant interaction terms introduce a brand variable that i created in chapter 2, through text-mining the name variable

Auto_2 <- Auto %>%
  mutate(mpg = log(mpg))

# new variable
Auto_2$brand <- sapply(strsplit(as.character(Auto_2$name), split = " "),
                     function(x) x[1]) # extract the first item from each list element

Auto_2$brand <- factor(ifelse(Auto_2$brand %in% c("vokswagen", "vw"), "volkswagen", 
                            ifelse(Auto_2$brand == "toyouta", "toyota", 
                                   ifelse(Auto_2$brand %in% c("chevroelt", "chevy"), "chevrolet", 
                                          ifelse(Auto_2$brand == "maxda", "mazda", 
                                                 Auto_2$brand))))) # fixing typo's
Auto_2$brand <- fct_lump(Auto_2$brand, 
                       n = 9, 
                       other_level = "uncommon") # collapse into 10 categories

Auto_2$name <- NULL

# gives ideas for transformations:
# best_predictor(Auto_2, "mpg") 

# gives ideas for interactions:
# summary(lm(mpg ~ . * ., data = Auto_2[ ,-9])) 


mpg_lm_4 <- lm(mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + acceleration:origin, data = Auto_2)

summary(mpg_lm_4)
## 
## Call:
## lm(formula = mpg ~ . + I(horsepower^2) + I(year^2) + acceleration:year + 
##     acceleration:origin, data = Auto_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36578 -0.05924  0.00281  0.05853  0.32154 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.407e+01  2.668e+00   5.274 2.28e-07 ***
## cylinders                   -6.017e-03  1.101e-02  -0.547 0.584993    
## displacement                -1.286e-04  2.835e-04  -0.454 0.650361    
## horsepower                  -8.124e-03  1.344e-03  -6.046 3.63e-09 ***
## weight                      -1.541e-04  2.578e-05  -5.978 5.34e-09 ***
## acceleration                -1.506e-01  4.536e-02  -3.320 0.000991 ***
## year                        -2.541e-01  7.136e-02  -3.560 0.000419 ***
## originEuropean              -2.801e-01  9.866e-02  -2.839 0.004772 ** 
## originJapanese              -2.721e-01  1.229e-01  -2.214 0.027430 *  
## brandbuick                   6.683e-02  3.351e-02   1.994 0.046840 *  
## brandchevrolet               3.528e-02  2.565e-02   1.375 0.169816    
## branddatsun                  1.896e-01  3.943e-02   4.809 2.21e-06 ***
## branddodge                   5.767e-02  2.936e-02   1.965 0.050209 .  
## brandford                   -3.803e-03  2.573e-02  -0.148 0.882547    
## brandplymouth                6.476e-02  2.834e-02   2.285 0.022871 *  
## brandtoyota                  1.106e-01  3.842e-02   2.880 0.004207 ** 
## brandvolkswagen              2.639e-02  4.022e-02   0.656 0.512122    
## branduncommon                6.968e-02  2.657e-02   2.622 0.009100 ** 
## I(horsepower^2)              1.789e-05  4.230e-06   4.229 2.96e-05 ***
## I(year^2)                    1.685e-03  4.829e-04   3.490 0.000540 ***
## acceleration:year            1.690e-03  5.954e-04   2.839 0.004782 ** 
## acceleration:originEuropean  1.912e-02  5.590e-03   3.420 0.000697 ***
## acceleration:originJapanese  1.544e-02  7.407e-03   2.084 0.037829 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1043 on 369 degrees of freedom
## Multiple R-squared:  0.9112, Adjusted R-squared:  0.9059 
## F-statistic: 172.2 on 22 and 369 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mpg_lm_4)

The heteroscedasticity of the residuals has definitely decreased significantly since the first linear model. The adjusted R2 has improved from 0.8205 to 0.9059, which is pretty impressive given that no additional data has been collected.

10. APPLIED: The Carseats Dataset (Multiple Linear Regression)

(a) Predicting Sales

Q: Fit a multiple regression model to predict Sales using Price, Urban, and US.

A:

Carseats is in the ISLR package, so the data has already been loaded.

glimpse(Carseats)
## Rows: 400
## Columns: 11
## $ Sales       <dbl> 9.50, 11.22, 10.06, 7.40, 4.15, 10.81, 6.63, 11.85, 6.5...
## $ CompPrice   <dbl> 138, 111, 113, 117, 141, 124, 115, 136, 132, 132, 121, ...
## $ Income      <dbl> 73, 48, 35, 100, 64, 113, 105, 81, 110, 113, 78, 94, 35...
## $ Advertising <dbl> 11, 16, 10, 4, 3, 13, 0, 15, 0, 0, 9, 4, 2, 11, 11, 5, ...
## $ Population  <dbl> 276, 260, 269, 466, 340, 501, 45, 425, 108, 131, 150, 5...
## $ Price       <dbl> 120, 83, 80, 97, 128, 72, 108, 120, 124, 124, 100, 94, ...
## $ ShelveLoc   <fct> Bad, Good, Medium, Medium, Bad, Bad, Medium, Good, Medi...
## $ Age         <dbl> 42, 65, 59, 55, 38, 78, 71, 67, 76, 76, 26, 50, 62, 53,...
## $ Education   <dbl> 17, 10, 12, 14, 13, 16, 15, 10, 10, 17, 10, 13, 18, 18,...
## $ Urban       <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes,...
## $ US          <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, No, Yes, Yes, Yes...
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)

(b) Interpreting Coefficients

Q: Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!

A:

summary(sales_lm)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Price = -0.054

Interpretation = the effect of a 1-unit increase in Price (for fixed values of Urban & US) is a change in Sales of -0.054 units (54 sales).

Urban = -0.022

Interpretation = the effect of a store being in an urban area (for fixed values of Price & US) is a change in Sales of 0.022 units (22 sales). However, in this case, since the p-value for this variables T-test is so high, we can say that there is no evidence for a relationship between the car seat Sales at a store and whether the store was Urban (or rural).

US = 1.200

Interpretation = the effect of a store being in the US (for fixed values of Price & Urban) is a change in Sales of 1.2 units (1200 sales).

(c) Equation-Form

Q: Write out the model in equation form, being careful to handle the qualitative variables properly.

A:

Sales^=13.043469−0.054459⋅Price−0.021916⋅Urban+1.200573⋅US

Where:

Urban = 1 for a store in an urban location, else 0 US = 1 for a store in the US, else 0

(d) T-tests

Q: For which of the predictors can you reject the null hypothesis H0:βj=0

A:

This question is asking about the results from the parameter T-tests. Based on the output & comments from part (b), we can reject the null hypothesis for the Price and US predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for Urban is zero.

(e) Smaller Model

Q: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

A:

Removing the Urban variable:

sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)

summary(sales_lm_2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) Model Fit

Q: How well do the models in (a) and (e) fit the data?

A:

For model (a), we have:

R2 = 0.23928 adjusted R2 = 0.23351 For model (e), we have:

R2 = 0.23926 adjusted R2 = 0.23543 In this case, both models explain ~23.9% of the variance in Sales.

This is an interesting case to talk about using training R2 vs R¯2 (adjusted R2) in model selection, since model (a) has a higher R2 but lower R¯2.

Recall:

R2=1−RSSTSS

R¯2=1−RSS/(n−p−1)TSS/(n−1)

RSS=∑i=1n(yi−y^)2

TSS=∑i=1n(yi−y¯)2

TSS will be constant regardless of the model being fit, and so maximizing R¯2 is equivalent to minimizing RSS/(n−p−1).

The fact that RSS will always decrease as variables are added to the model explains why R2 always increases with additional variables.

However, with R¯2, adding new variables will mean that the increase in RSS is now ‘competing’ with the increase in n−p−1 (specifically, the increase in p), so it is less clear whether R¯2 will increase or not.

In this particular case, we can see that model (e) adds a useless variable (Urban) and sees an increase in R2 but a decrease in R¯2, because the boost in RSS that the model sees is insufficient in outweighing the penalty by the increase in p.

We should really choose (e) as the better model in this case, as it provides a more parsimonious model, and this is a good example of why the training R2 or RSS can be very limited.

(g) Coefficient Confidence Interval

Q: Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

A:

The confidence intervals are calculated using:

βi±tn−2(0.975)⋅SE(βi)

Where:

βi^ is the parameter estimate tn−2(0.975) is the 97.5% quantile of a t-distribution, with n−2 degrees of freedom (qt(0.975, nrow(data) - 2)) SE(βi^) is the standard error of βi^

confint(sales_lm_2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

We can say, for instance, that there is a 95% probability that the true parameter for Price (β1) falls within the interval: (-0.0648, -0.0442), and a 5% probability that it doesn’t.

(h) Outliers and High Leverage Observations

Q: Is there evidence of outliers or high leverage observations in the model from (e)

A:

Since the average leverage statistic is always equal to (p+1)/n, we can identify observations with a leverage far higher than (p+1)/n as ‘high-leverage’.

round(((2 + 1) / nrow(Carseats)), 10) == round(mean(hatvalues(sales_lm_2)), 10)
## [1] TRUE

Similarly, we can identify potential outliers as those observations with a standardized residual outside of [−2,2].

broom::augment(sales_lm_2) %>%
  dplyr::select(.hat, .std.resid) %>%
  ggplot(aes(x = .hat, y = .std.resid)) + 
  geom_point() + 
  geom_hline(yintercept = -2, col = "deepskyblue3", size = 1) + 
  geom_hline(yintercept = 2, col = "deepskyblue3", size = 1) + 
  geom_vline(xintercept = 3 / nrow(Carseats), col = "mediumseagreen", size = 1) + 
  theme_light() + 
  labs(x = "Leverage", 
       y = "Standardized Residual", 
       title = "Residuals vs Leverage")

Looking at these thresholds on a graph, it is hard to identify when an observation is large enough in residual & leverage to be influential and worth investigating.

This is why I prefer the approach of using a metric that combines these (e.g. cooks distance):

broom::augment(sales_lm_2) %>%
  rownames_to_column("rowid") %>%
  arrange(desc(.cooksd)) %>% 
  dplyr::select(Sales, Price, US, .std.resid, .hat, .cooksd)
## # A tibble: 400 x 6
##    Sales Price US    .std.resid    .hat .cooksd
##    <dbl> <dbl> <fct>      <dbl>   <dbl>   <dbl>
##  1 14.9     82 No          2.58 0.0116   0.0261
##  2 14.4     53 No          1.73 0.0237   0.0243
##  3 10.6    149 No          2.32 0.0126   0.0228
##  4 15.6     72 Yes         2.17 0.0129   0.0205
##  5  0.37   191 Yes        -1.42 0.0286   0.0198
##  6 16.3     92 Yes         2.87 0.00664  0.0183
##  7  3.47    81 No         -2.10 0.0119   0.0177
##  8  0.53   159 Yes        -2.05 0.0119   0.0169
##  9 13.6     89 No          2.18 0.00983  0.0158
## 10  3.02    90 Yes        -2.56 0.00710  0.0157
## # ... with 390 more rows

In this case it’s unlikely that you could justify removing any other observations or class them as problematic. The model is very simple with p = 2, and many of these observations would not be an issue if more strong predictors were added.

12. APPLIED: Generated Data (Regression without an intercept)

This problem involves simple linear regression without an intercept.

(a) Reverse Regression: Requirement for Equal Coefficients

Q: Recall that the coefficient estimate βa^ for the linear regression of y onto x without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of x onto y the same as the coefficient estimate for the regression of y onto x?

A:

Using the same notation as previously, where:

y=βaxx=βby

We have:

βa=∑xy∑x2βb=∑yx∑y2βa=βb⟺∑xy∑x2=∑yx∑y2⟺∑i=1nx2i=∑i=1ny2i

(b) Non-equal Coefficients

Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of x onto y is different from the coefficient estimate for the regression of y onto x

A:

All that is required here is that the above equality doesn’t hold (which will happen in all x, y combinations, with the exception of a very few).

I generate 100 observations with the following properties:

y=5⋅x+ϵϵ∼N(0,22)

set.seed(2)
x <- rnorm(100)
y <- 5*x + rnorm(100, sd = 2)
data <- data.frame(x, y)

lm_9 <- lm(y ~ x + 0)
lm_10 <- lm(x ~ y + 0)

We have:

∑ni=1x2i = 133.3521492 ∑ni=1y2i = 3591.2935934 It’s quite clear that βa≠βb:

ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(lm_9), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(lm_10), size = 1, col = "mediumseagreen") + 
  labs(title = "Regression lines (lm_9 & lm_10)")

The actual coefficients:

βa^ = 4.904515 βb^ = 0.182115

(c) Equal Coefficients

Q: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of x onto y is the same from the coefficient estimate for the regression of y onto x

A:

One such example where ∑ni=1x2i=∑ni=1y2i would hold is obviously where the two vectors are equal (a relationship where R2 = 1, and βb=1βa=11=1):

set.seed(3)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)

lm_11 <- lm(y ~ x + 0)
lm_12 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(lm_11), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(lm_12), size = 1, col = "mediumseagreen") + 
  labs(title = "Regression lines (lm_11 & lm_12)")

We have:

∑ni=1x2i = 72.566061 ∑ni=1y2i = 72.566061 βa^ = 1 βb^ = 1

Another example would be where vectors x and y are permutations of one another. Clearly ∑ni=1x2i=∑ni=1y2i would hold here, so the parameters should also be equal:

set.seed(4)
x <- 1:100
y <- x[order(rnorm(100))]
data <- data.frame(x, y)

lm_13 <- lm(y ~ x + 0)
lm_14 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(lm_13), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(lm_14), size = 1, col = "mediumseagreen") + 
  labs(title = "Regression lines (lm_13 & lm_14)")

We have:

∑ni=1x2i = 3.383510^{5} ∑ni=1y2i = 3.383510{5} βa = 0.692493 βb^ = 0.692493 Note that the previous example showed perfect relationship where βa=βb held true, and this example shows a random relationship where it also holds. You could create a relationship as strong or as weak as desired, simply by permuting less (or more) or the elements of x, and still we would see that βa=βb holds.

Another obvious example would be where y=|x|:

set.seed(5)
x <- rnorm(100)
y <- abs(x)
data <- data.frame(x, y)

lm_15 <- lm(y ~ x + 0)
lm_16 <- lm(x ~ y + 0)
ggplot(data, aes(x, y)) + 
  geom_point(color="red") + 
  geom_abline(intercept = 0, slope = coef(lm_15), size = 1, col = "deepskyblue3") + 
  geom_abline(intercept = 0, slope = 1 / coef(lm_16), size = 1, col = "mediumseagreen") + 
  labs(title = "Regression lines (lm_15 & lm_16)")

We have:

∑ni=1x2i = 88.5627779 ∑ni=1y2i = 88.5627779 βa^ = 0.115944 βb^ = 0.115944