library(tidyverse)
library(ISLR2)

Linear Regression

2
Carefully explain the differences between the KNN classifier and KNN regression methods.
The primary difference between the two is the objective: the KNN classifier predicts a qualitative label, while KNN regression predicts a continuous value.
For a given prediction point \(x_0\), KNN regression identifies the K training observations nearest to \(x_0\) (the neighborhood \(N_0\)) and averages their responses:
\[ \hat{f}(x_0)=\frac{1}{K}\sum_{x_i\in N_0} y_i \]
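Concretely, a minimal base-R sketch of this averaging step (knn_reg is a hypothetical helper, not a function from ISLR2):

# Minimal KNN regression sketch: for a query point x0, average the
# responses of the K training rows closest to x0 in Euclidean distance.
knn_reg <- function(x0, X, y, K = 5) {
  d <- sqrt(rowSums(sweep(as.matrix(X), 2, x0)^2)) # distance from each row of X to x0
  mean(y[order(d)[1:K]])                           # average the K nearest y_i
}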
9A
Produce a scatterplot matrix which includes all of the variables in the data set.
data(Auto)
pairs(Auto)

9B
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
Auto |>
select(-name) |>
cor()

mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
9C
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
Auto <- Auto |>
  mutate(origin = as.factor(origin)) # origin is coded 1/2/3, so treat it as qualitative
lm(mpg~.-name, data = Auto) |>
summary()
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.0095 -2.0785 -0.0982 1.9856 13.3608
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
cylinders -4.897e-01 3.212e-01 -1.524 0.128215
displacement 2.398e-02 7.653e-03 3.133 0.001863 **
horsepower -1.818e-02 1.371e-02 -1.326 0.185488
weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
acceleration 7.910e-02 9.822e-02 0.805 0.421101
year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.307 on 383 degrees of freedom
Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
With a significant overall p-value, we can reject the null hypothesis that there is no relationship between the predictors and the response.
The variables displacement, weight, year, origin2, and origin3 appear to be significant.
The coefficient for year suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.78 in mpg.
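This can be read off directly from the fitted coefficients (a quick sketch; the model is refit here for illustration):

fit <- lm(mpg ~ . - name, data = Auto)
coef(fit)["year"] # about 0.78 mpg per additional model year, other predictors held fixed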
9D
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
lm(mpg~.-name, data = Auto) |>
plot()

The Q-Q plot appears adequately linear, indicating approximately normal residuals. In the residuals-vs-leverage plot, observation 14 lies beyond Cook's distance, indicating undue influence on the model.
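The visual check can be supplemented with numeric diagnostics (a sketch using base-R helpers; the cutoffs are common rules of thumb, not from the text):

fit <- lm(mpg ~ . - name, data = Auto)
head(sort(cooks.distance(fit), decreasing = TRUE), 3) # largest Cook's distances
which(hatvalues(fit) > 2 * mean(hatvalues(fit)))      # rule-of-thumb leverage cutoff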
9E
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm(mpg~.*.,data=Auto[-9]) |> # Auto[-9] drops the qualitative name column
summary()
Call:
lm(formula = mpg ~ . * ., data = Auto[-9])
Residuals:
Min 1Q Median 3Q Max
-7.6008 -1.2863 0.0813 1.2082 12.0382
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.401e+01 5.147e+01 0.855 0.393048
cylinders 3.302e+00 8.187e+00 0.403 0.686976
displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
horsepower 5.312e-01 3.390e-01 1.567 0.117970
weight -3.259e-03 1.820e-02 -0.179 0.857980
acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
year 4.833e-01 5.923e-01 0.816 0.415119
origin2 -3.517e+01 1.260e+01 -2.790 0.005547 **
origin3 -3.765e+01 1.426e+01 -2.640 0.008661 **
cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
cylinders:origin2 -7.210e-01 1.088e+00 -0.662 0.508100
cylinders:origin3 1.226e+00 1.007e+00 1.217 0.224379
displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
displacement:origin2 -3.364e-02 4.220e-02 -0.797 0.425902
displacement:origin3 5.375e-02 4.145e-02 1.297 0.195527
horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
horsepower:origin2 -4.869e-03 5.061e-02 -0.096 0.923408
horsepower:origin3 2.289e-02 6.252e-02 0.366 0.714533
weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
weight:year -8.065e-05 2.184e-04 -0.369 0.712223
weight:origin2 2.277e-03 2.685e-03 0.848 0.397037
weight:origin3 -4.498e-03 3.481e-03 -1.292 0.197101
acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
acceleration:origin2 9.234e-01 2.641e-01 3.496 0.000531 ***
acceleration:origin3 7.159e-01 3.258e-01 2.198 0.028614 *
year:origin2 2.932e-01 1.444e-01 2.031 0.043005 *
year:origin3 3.139e-01 1.483e-01 2.116 0.035034 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.628 on 356 degrees of freedom
Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
The significant interactions are cylinders:acceleration, acceleration:year, acceleration:origin2, acceleration:origin3, year:origin2, and year:origin3.
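For reference, * expands to the main effects plus the interaction, while : adds only the interaction term, so a targeted refit of one flagged interaction could be written either way (a sketch, output omitted):

lm(mpg ~ acceleration * year, data = Auto)                     # acceleration + year + acceleration:year
lm(mpg ~ acceleration + year + acceleration:year, data = Auto) # equivalent spelling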
9F
Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
Judging from the scatterplot matrix in 9A, weight, displacement, and horsepower appear as if they would benefit from some transformation.
lm(mpg~.+log(weight)+log(displacement)+log(horsepower), data = Auto[-9]) |>
summary()
Call:
lm(formula = mpg ~ . + log(weight) + log(displacement) + log(horsepower),
data = Auto[-9])
Residuals:
Min 1Q Median 3Q Max
-9.1123 -1.5756 -0.0977 1.4870 12.2114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 187.193734 40.148119 4.663 4.33e-06 ***
cylinders -0.158747 0.299358 -0.530 0.59622
displacement 0.010936 0.014197 0.770 0.44160
horsepower 0.095958 0.031361 3.060 0.00237 **
weight 0.002127 0.002222 0.957 0.33901
acceleration -0.200229 0.100776 -1.987 0.04766 *
year 0.781173 0.046562 16.777 < 2e-16 ***
origin2 1.098874 0.544930 2.017 0.04445 *
origin3 1.291908 0.537994 2.401 0.01681 *
log(weight) -19.060336 7.360258 -2.590 0.00998 **
log(displacement) -2.017991 2.877048 -0.701 0.48348
log(horsepower) -16.566039 3.741005 -4.428 1.24e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.913 on 380 degrees of freedom
Multiple R-squared: 0.8646, Adjusted R-squared: 0.8607
F-statistic: 220.7 on 11 and 380 DF, p-value: < 2.2e-16
The logs of horsepower and weight are significant, and including them raises the adjusted \(R^2\) from 0.8205 in the base lm() fit (9C) to 0.8607.
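A squared term is another transformation worth trying; I() keeps the arithmetic from being interpreted as formula syntax (a sketch, output omitted):

lm(mpg ~ horsepower + I(horsepower^2), data = Auto) |>
  summary()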
10A
Fit a multiple regression model to predict Sales using Price, Urban, and US.
data("Carseats")lm(Sales~Price+Urban+US, data = Carseats) |>
summary()
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
10B
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
The intercept gives the expected value of Sales when the numeric predictors equal 0 and the qualitative predictors are at their baseline levels. For every unit increase in Price, predicted Sales decreases by about 0.054. When Urban is 'Yes', predicted Sales decreases by about 0.022 relative to the non-urban baseline. When US is 'Yes', predicted Sales increases by about 1.2 relative to the non-US baseline.
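A small check of the dummy coding (a sketch; the Price = 100 value is arbitrary): predictions for the four Urban/US combinations differ only by the dummy coefficients above.

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
grid <- expand.grid(Price = 100, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(grid, pred = predict(fit, grid)) # four predictions, shifted by the dummy coefficients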
10C
Write out the model in equation form, being careful to handle the qualitative variables properly.
\[ \hat{Sales}=13.04-0.054\,Price-0.022\,Urban+1.20\,US \]
where Urban = 1 if 'Yes' and 0 if 'No', and US = 1 if 'Yes' and 0 if 'No'.
10D
For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
Price and US
10E
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm(Sales~Price+US,data=Carseats) |>
summary()
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
10F
How well do the models in (a) and (e) fit the data?
Both models are significant overall. Model 1 has \(R^2 = 0.2393\) and model 2 has \(R^2 = 0.2393\), so both explain the same share of the variance in Sales; however, model 2 is preferable because it achieves that fit with one fewer predictor, which is also reflected in its slightly higher adjusted \(R^2\) (0.2354 vs. 0.2335).
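The nested models can also be compared directly with an F-test (a sketch, output omitted):

fit_full  <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_small <- lm(Sales ~ Price + US, data = Carseats)
anova(fit_small, fit_full) # tests whether Urban adds explanatory power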
10G
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
lm(Sales~Price+US,data=Carseats) |>
confint(level = 0.95)

2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
10H
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
lm(Sales~Price+US,data=Carseats) |>
plot()

Based on the plots, there do not appear to be any outliers or observations with high leverage.
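As a numeric cross-check of the plots (a sketch; the ±3 and 2(p + 1)/n cutoffs are conventions, not from the text):

fit <- lm(Sales ~ Price + US, data = Carseats)
range(rstudent(fit))                                         # studentized residuals beyond ±3 would suggest outliers
sum(hatvalues(fit) > 2 * length(coef(fit)) / nrow(Carseats)) # count of high-leverage observations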
12A
Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The two estimates are equal when
\[ \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2 \]
since the no-intercept estimates are \(\hat{\beta}_{y \sim x} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}\) and \(\hat{\beta}_{x \sim y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2}\), which share the same numerator.
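A quick numeric check of this condition (a sketch; the seed and sample size are arbitrary): rescaling y so the sums of squares match makes the two no-intercept slopes coincide.

set.seed(42)                       # arbitrary seed for illustration
x <- rnorm(50)
y <- rnorm(50)
y <- y * sqrt(sum(x^2) / sum(y^2)) # enforce sum(y^2) == sum(x^2)
c(coef(lm(y ~ x + 0)), coef(lm(x ~ y + 0))) # the two slopes now match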
12B
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- rnorm(100)
y <- rnorm(100,2,2)
data <- data.frame(x, y)

lm(y ~ x + 0) |>
summary()
Call:
lm(formula = y ~ x + 0)
Residuals:
Min 1Q Median 3Q Max
-1.7865 0.7343 1.6627 2.9642 6.4940
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.2564 0.3013 0.851 0.397
Residual standard error: 2.712 on 99 degrees of freedom
Multiple R-squared: 0.007265, Adjusted R-squared: -0.002763
F-statistic: 0.7245 on 1 and 99 DF, p-value: 0.3967
lm(x ~ y + 0) |>
summary()
Call:
lm(formula = x ~ y + 0)
Residuals:
Min 1Q Median 3Q Max
-2.2345 -0.5473 0.0186 0.5927 2.3209
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y 0.02833 0.03328 0.851 0.397
Residual standard error: 0.9015 on 99 degrees of freedom
Multiple R-squared: 0.007265, Adjusted R-squared: -0.002763
F-statistic: 0.7245 on 1 and 99 DF, p-value: 0.3967
12C
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- rnorm(100)
y <- x
data <- data.frame(x, y)

lm(y ~ x + 0) |>
  summary()

Warning in summary.lm(lm(y ~ x + 0)): essentially perfect fit: summary may be unreliable
Call:
lm(formula = y ~ x + 0)
Residuals:
Min 1Q Median 3Q Max
-3.583e-15 -3.440e-17 -1.600e-18 1.280e-17 1.997e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 1.000e+00 4.046e-17 2.471e+16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.643e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.107e+32 on 1 and 99 DF, p-value: < 2.2e-16
lm(x ~ y + 0) |>
summary()

Warning in summary.lm(lm(x ~ y + 0)): essentially perfect fit: summary may be unreliable
Call:
lm(formula = x ~ y + 0)
Residuals:
Min 1Q Median 3Q Max
-3.583e-15 -3.440e-17 -1.600e-18 1.280e-17 1.997e-16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y 1.000e+00 4.046e-17 2.471e+16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.643e-16 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.107e+32 on 1 and 99 DF, p-value: < 2.2e-16