Linear Regression

Author

Justin Pons

2

Carefully explain the differences between the KNN classifier and KNN regression methods.

The primary difference between the two is the objective: the KNN classifier predicts a qualitative label (by majority vote among the K nearest neighbors), while KNN regression predicts a continuous value.

For a given prediction point \(x_0\), KNN regression averages the responses of the K training observations nearest to \(x_0\):

\[ \hat{f}(x_0)=\frac{1}{K}\sum_{x_i\in N_0} y_i \]
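
To make the averaging concrete, here is a minimal base-R sketch of KNN regression for a single predictor, assuming Euclidean distance; knn_reg() is a hypothetical helper, not a function used elsewhere in these exercises.

# knn_reg: hypothetical one-predictor KNN regression helper
knn_reg <- function(x, y, x0, K = 5) {
  nbrs <- order(abs(x - x0))[1:K]  # indices of the K training points nearest x0
  mean(y[nbrs])                    # average their responses
}

# toy usage on simulated data
set.seed(1)
x_tr <- runif(50)
y_tr <- sin(2 * pi * x_tr) + rnorm(50, sd = 0.2)
knn_reg(x_tr, y_tr, x0 = 0.5, K = 5)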

9A

Produce a scatterplot matrix which includes all of the variables in the data set.

library(tidyverse)
library(ISLR2)
data(Auto)
pairs(Auto)  # scatterplot matrix of every pair of variables

9B

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Auto |>
  select(-name) |>  # exclude the qualitative name column
  cor()
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000

9C

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

Auto <- Auto |> 
  mutate(origin = as.factor(origin))  # treat the 1/2/3 origin codes as qualitative
lm(mpg ~ . - name, data = Auto) |> 
  summary()

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0095 -2.0785 -0.0982  1.9856 13.3608 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.795e+01  4.677e+00  -3.839 0.000145 ***
cylinders    -4.897e-01  3.212e-01  -1.524 0.128215    
displacement  2.398e-02  7.653e-03   3.133 0.001863 ** 
horsepower   -1.818e-02  1.371e-02  -1.326 0.185488    
weight       -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
acceleration  7.910e-02  9.822e-02   0.805 0.421101    
year          7.770e-01  5.178e-02  15.005  < 2e-16 ***
origin2       2.630e+00  5.664e-01   4.643 4.72e-06 ***
origin3       2.853e+00  5.527e-01   5.162 3.93e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.307 on 383 degrees of freedom
Multiple R-squared:  0.8242,    Adjusted R-squared:  0.8205 
F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

With a significant overall F-statistic (p-value < 2.2e-16), we can reject the null hypothesis that there is no relationship between the predictors and the response.

At the 0.05 level, the variables displacement, weight, year, origin2, and origin3 appear to be significant.

The coefficient for year suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.78 in mpg.
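
As a quick sanity check of that interpretation, the following sketch (the name two_years is illustrative) predicts mpg for the same car at two consecutive model years; because the model is linear, the difference recovers the year coefficient.

fit <- lm(mpg ~ . - name, data = Auto)
two_years <- Auto[c(1, 1), ]                # duplicate one observation's predictors
two_years$year[2] <- two_years$year[2] + 1  # advance the model year by one
diff(predict(fit, two_years))               # equals the year coefficient, ~0.78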

9D

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
lm(mpg ~ . - name, data = Auto) |> 
  plot()

The Q-Q plot appears adequately linear, indicating approximately normal residuals, though a few residuals are unusually large (the maximum raw residual of 13.36 is roughly four residual standard errors), suggesting possible outliers. Observation 14 lies beyond Cook’s distance in the residuals-vs-leverage plot, indicating unusually high leverage and undue influence on the model.
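
To back up the visual reading, leverage can also be checked numerically; a sketch using the common rule of thumb that flags points whose leverage exceeds twice the average:

fit <- lm(mpg ~ . - name, data = Auto)
lev <- hatvalues(fit)               # leverage of each observation
head(sort(lev, decreasing = TRUE))  # points labeled in the leverage plot rank near the top
which(lev > 2 * mean(lev))          # rule-of-thumb flag for high leverage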

9E

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# . * . expands to all main effects plus all pairwise interactions;
# Auto[-9] drops the qualitative name column
lm(mpg ~ . * ., data = Auto[-9]) |> 
  summary()

Call:
lm(formula = mpg ~ . * ., data = Auto[-9])

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6008 -1.2863  0.0813  1.2082 12.0382 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                4.401e+01  5.147e+01   0.855 0.393048    
cylinders                  3.302e+00  8.187e+00   0.403 0.686976    
displacement              -3.529e-01  1.974e-01  -1.788 0.074638 .  
horsepower                 5.312e-01  3.390e-01   1.567 0.117970    
weight                    -3.259e-03  1.820e-02  -0.179 0.857980    
acceleration              -6.048e+00  2.147e+00  -2.818 0.005109 ** 
year                       4.833e-01  5.923e-01   0.816 0.415119    
origin2                   -3.517e+01  1.260e+01  -2.790 0.005547 ** 
origin3                   -3.765e+01  1.426e+01  -2.640 0.008661 ** 
cylinders:displacement    -6.316e-03  7.106e-03  -0.889 0.374707    
cylinders:horsepower       1.452e-02  2.457e-02   0.591 0.555109    
cylinders:weight           5.703e-04  9.044e-04   0.631 0.528709    
cylinders:acceleration     3.658e-01  1.671e-01   2.189 0.029261 *  
cylinders:year            -1.447e-01  9.652e-02  -1.499 0.134846    
cylinders:origin2         -7.210e-01  1.088e+00  -0.662 0.508100    
cylinders:origin3          1.226e+00  1.007e+00   1.217 0.224379    
displacement:horsepower   -5.407e-05  2.861e-04  -0.189 0.850212    
displacement:weight        2.659e-05  1.455e-05   1.828 0.068435 .  
displacement:acceleration -2.547e-03  3.356e-03  -0.759 0.448415    
displacement:year          4.547e-03  2.446e-03   1.859 0.063842 .  
displacement:origin2      -3.364e-02  4.220e-02  -0.797 0.425902    
displacement:origin3       5.375e-02  4.145e-02   1.297 0.195527    
horsepower:weight         -3.407e-05  2.955e-05  -1.153 0.249743    
horsepower:acceleration   -3.445e-03  3.937e-03  -0.875 0.382122    
horsepower:year           -6.427e-03  3.891e-03  -1.652 0.099487 .  
horsepower:origin2        -4.869e-03  5.061e-02  -0.096 0.923408    
horsepower:origin3         2.289e-02  6.252e-02   0.366 0.714533    
weight:acceleration       -6.851e-05  2.385e-04  -0.287 0.774061    
weight:year               -8.065e-05  2.184e-04  -0.369 0.712223    
weight:origin2             2.277e-03  2.685e-03   0.848 0.397037    
weight:origin3            -4.498e-03  3.481e-03  -1.292 0.197101    
acceleration:year          6.141e-02  2.547e-02   2.412 0.016390 *  
acceleration:origin2       9.234e-01  2.641e-01   3.496 0.000531 ***
acceleration:origin3       7.159e-01  3.258e-01   2.198 0.028614 *  
year:origin2               2.932e-01  1.444e-01   2.031 0.043005 *  
year:origin3               3.139e-01  1.483e-01   2.116 0.035034 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.628 on 356 degrees of freedom
Multiple R-squared:  0.8967,    Adjusted R-squared:  0.8866 
F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

At the 0.05 level, the significant interactions are cylinders:acceleration, acceleration:year, acceleration:origin2, acceleration:origin3, year:origin2, and year:origin3.
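
Rather than reading these off the printout by eye, a short sketch can pull the interaction terms significant at the 0.05 level straight out of the coefficient table:

fit <- lm(mpg ~ . * ., data = Auto[-9])
ct <- coef(summary(fit))  # coefficient table including p-values
rownames(ct)[ct[, "Pr(>|t|)"] < 0.05 & grepl(":", rownames(ct))]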

9F

Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

Judging from the pairs() plot in 9A, weight, displacement, and horsepower appear as if they would benefit from some transformation; each shows a curved relationship with mpg.

# add log versions of the three curved predictors alongside the untransformed ones
lm(mpg ~ . + log(weight) + log(displacement) + log(horsepower), data = Auto[-9]) |> 
  summary()

Call:
lm(formula = mpg ~ . + log(weight) + log(displacement) + log(horsepower), 
    data = Auto[-9])

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1123 -1.5756 -0.0977  1.4870 12.2114 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       187.193734  40.148119   4.663 4.33e-06 ***
cylinders          -0.158747   0.299358  -0.530  0.59622    
displacement        0.010936   0.014197   0.770  0.44160    
horsepower          0.095958   0.031361   3.060  0.00237 ** 
weight              0.002127   0.002222   0.957  0.33901    
acceleration       -0.200229   0.100776  -1.987  0.04766 *  
year                0.781173   0.046562  16.777  < 2e-16 ***
origin2             1.098874   0.544930   2.017  0.04445 *  
origin3             1.291908   0.537994   2.401  0.01681 *  
log(weight)       -19.060336   7.360258  -2.590  0.00998 ** 
log(displacement)  -2.017991   2.877048  -0.701  0.48348    
log(horsepower)   -16.566039   3.741005  -4.428 1.24e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.913 on 380 degrees of freedom
Multiple R-squared:  0.8646,    Adjusted R-squared:  0.8607 
F-statistic: 220.7 on 11 and 380 DF,  p-value: < 2.2e-16

The logs of horsepower and weight are significant, and including them raises the adjusted \(R^2\) from 0.8205 for the base lm() fit to 0.8607.
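
The exercise also suggests \(\sqrt{X}\) and \(X^2\); a sketch of one such variant (output not shown), using I() so the squaring happens inside the formula:

# try square-root and squared horsepower alongside the other predictors
lm(mpg ~ . + sqrt(horsepower) + I(horsepower^2), data = Auto[-9]) |> 
  summary()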

10A

Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")
lm(Sales~Price+Urban+US, data = Carseats) |> 
  summary()

Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9206 -1.6220 -0.0564  1.5786  7.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
Price       -0.054459   0.005242 -10.389  < 2e-16 ***
UrbanYes    -0.021916   0.271650  -0.081    0.936    
USYes        1.200573   0.259042   4.635 4.86e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2335 
F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

10B

Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

The intercept coefficient gives the expected value of the response when the numeric predictors equal 0 and the qualitative predictors are at their baseline levels. For every one-dollar increase in Price, Sales decrease by about 0.054 (in thousands of units). When Urban is ‘Yes’, predicted Sales decrease by about 0.022 relative to a non-urban store, although this effect is not significant. When US is ‘Yes’, predicted Sales increase by about 1.2 relative to a non-US store.

10C

Write out the model in equation form, being careful to handle the qualitative variables properly.

\[ \widehat{Sales}=13.04-0.054\,Price-0.022\,Urban+1.20\,US \]

where Urban = 1 if the store is urban and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.
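
To see exactly how R builds the 0/1 dummies in this equation, one can inspect the design matrix:

# first rows of the design matrix: UrbanYes and USYes are the 0/1 dummy columns
head(model.matrix(Sales ~ Price + Urban + US, data = Carseats), 3)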

10D

For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?

Price and US; their p-values are far below 0.05, while Urban’s (0.936) is not.

10E

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm(Sales~Price+US,data=Carseats) |> 
  summary()

Call:
lm(formula = Sales ~ Price + US, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9269 -1.6286 -0.0574  1.5766  7.0515 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
Price       -0.05448    0.00523 -10.416  < 2e-16 ***
USYes        1.19964    0.25846   4.641 4.71e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.2354 
F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

10F

How well do the models in (a) and (e) fit the data?

Both models are significant overall. Model 1 (a) has \(R^2\) = 0.2393 and model 2 (e) also has \(R^2\) = 0.2393, so they explain the same share of the variance in Sales; however, model 2 is preferable because it is simpler and has a slightly higher adjusted \(R^2\) (0.2354 vs. 0.2335) and a slightly lower residual standard error.
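
A more formal comparison is a nested-model F-test; a sketch (output not shown), where a large p-value would confirm that dropping Urban costs essentially nothing:

fit_a <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_e <- lm(Sales ~ Price + US, data = Carseats)
anova(fit_e, fit_a)  # tests whether the extra Urban term improves the fit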

10G

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

lm(Sales~Price+US,data=Carseats) |> 
  confint(level = 0.95)
                  2.5 %      97.5 %
(Intercept) 11.79032020 14.27126531
Price       -0.06475984 -0.04419543
USYes        0.69151957  1.70776632
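
As a sanity check, the same intervals can be rebuilt by hand, since confint() for an lm fit is just the estimate plus or minus a t-quantile times the standard error:

fit <- lm(Sales ~ Price + US, data = Carseats)
est <- coef(fit)
se <- coef(summary(fit))[, "Std. Error"]
tq <- qt(0.975, df = fit$df.residual)                # t quantile on 397 df
cbind(lower = est - tq * se, upper = est + tq * se)  # matches confint() above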

10H

Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a grid
lm(Sales ~ Price + US, data = Carseats) |> 
  plot()

Based on the plots, there do not appear to be any outliers or observations with unusually high leverage; no point falls beyond Cook’s distance.

12A

Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Without an intercept, the slope for regressing Y onto X is \(\sum_i x_i y_i / \sum_i x_i^2\), while the slope for regressing X onto Y is \(\sum_i x_i y_i / \sum_i y_i^2\). The two estimates are therefore the same exactly when

\[ \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} x_i^2 \]

12B

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x <- rnorm(100)
y <- rnorm(100, 2, 2)  # drawn independently of x, so sum(y^2) differs from sum(x^2)
data <- data.frame(x, y)
lm(y ~ x + 0) |>       # regress y onto x without an intercept
  summary()

Call:
lm(formula = y ~ x + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7865  0.7343  1.6627  2.9642  6.4940 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   0.2564     0.3013   0.851    0.397

Residual standard error: 2.712 on 99 degrees of freedom
Multiple R-squared:  0.007265,  Adjusted R-squared:  -0.002763 
F-statistic: 0.7245 on 1 and 99 DF,  p-value: 0.3967
lm(x ~ y + 0) |>  # now regress x onto y without an intercept
  summary()

Call:
lm(formula = x ~ y + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2345 -0.5473  0.0186  0.5927  2.3209 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
y  0.02833    0.03328   0.851    0.397

Residual standard error: 0.9015 on 99 degrees of freedom
Multiple R-squared:  0.007265,  Adjusted R-squared:  -0.002763 
F-statistic: 0.7245 on 1 and 99 DF,  p-value: 0.3967
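
The two estimates can also be computed directly from the no-intercept slope formula, confirming the fitted values above:

sum(x * y) / sum(x^2)  # slope of y ~ x + 0, approximately 0.2564
sum(x * y) / sum(y^2)  # slope of x ~ y + 0, approximately 0.0283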

12C

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(1)
x <- rnorm(100)
y <- x  # identical values, so sum(y^2) equals sum(x^2)
data <- data.frame(x, y)
lm(y ~ x + 0) |> 
  summary()
Warning in summary.lm(lm(y ~ x + 0)): essentially perfect fit: summary may be
unreliable

Call:
lm(formula = y ~ x + 0)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.583e-15 -3.440e-17 -1.600e-18  1.280e-17  1.997e-16 

Coefficients:
   Estimate Std. Error   t value Pr(>|t|)    
x 1.000e+00  4.046e-17 2.471e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.643e-16 on 99 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.107e+32 on 1 and 99 DF,  p-value: < 2.2e-16
lm(x ~ y + 0) |> 
  summary()
Warning in summary.lm(lm(x ~ y + 0)): essentially perfect fit: summary may be
unreliable

Call:
lm(formula = x ~ y + 0)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.583e-15 -3.440e-17 -1.600e-18  1.280e-17  1.997e-16 

Coefficients:
   Estimate Std. Error   t value Pr(>|t|)    
y 1.000e+00  4.046e-17 2.471e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.643e-16 on 99 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 6.107e+32 on 1 and 99 DF,  p-value: < 2.2e-16
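
Since setting y equal to x triggers the perfect-fit warning, an alternative sketch permutes x instead: the values, and hence \(\sum y_i^2\), are unchanged, so the two slopes still coincide without a degenerate fit.

set.seed(1)
x <- rnorm(100)
y <- sample(x)       # same values in a different order, so sum(y^2) == sum(x^2)
coef(lm(y ~ x + 0))  # slope of y onto x
coef(lm(x ~ y + 0))  # identical slope of x onto y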