Given a value for \(K\) and a test observation \(x_0\), the KNN classifier first identifies the \(K\) points in the training data that are closest to \(x_0\), represented by \(N_0\). It then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\). Finally, KNN assigns the test observation \(x_0\) to the class with the largest estimated probability.
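As a concrete illustration, the sketch below runs KNN classification in R with the knn() function from the class package; the two-class data are simulated for this example and are not part of the text.

library(class)
set.seed(1)
# simulated training data: two predictors, two classes
train_x <- matrix(rnorm(200), ncol = 2)
train_y <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))
test_x <- matrix(rnorm(20), ncol = 2)
# knn() finds the K nearest training points to each test point and
# assigns the majority class among them
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
knn_pred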
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) as the average of the training responses in \(N_0\).
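A bare-bones version of KNN regression, written directly from this description (base R, one predictor, simulated data), might look like the following sketch.

set.seed(1)
x_train <- rnorm(100)
y_train <- 2 * x_train + rnorm(100)
# estimate f(x0) as the average response of the K nearest training observations
knn_reg <- function(x0, x, y, K = 5) {
  nearest <- order(abs(x - x0))[1:K]  # indices of the K closest training points
  mean(y[nearest])
}
knn_reg(0.5, x_train, y_train, K = 5)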
This question involves the use of the Auto data set.

a) Produce a scatterplot matrix that includes all of the variables in the data set.
library(ISLR2)
attach(Auto)
library(xtable)
pairs(Auto)
b) Compute the matrix of correlations between the variables using the cor() function. You will need to exclude the name variable, which is qualitative.
cor_xt <- xtable(cor(Auto[, -9]))
print(cor_xt, type = "html")
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| mpg | 1.00 | -0.78 | -0.81 | -0.78 | -0.83 | 0.42 | 0.58 | 0.57 |
| cylinders | -0.78 | 1.00 | 0.95 | 0.84 | 0.90 | -0.50 | -0.35 | -0.57 |
| displacement | -0.81 | 0.95 | 1.00 | 0.90 | 0.93 | -0.54 | -0.37 | -0.61 |
| horsepower | -0.78 | 0.84 | 0.90 | 1.00 | 0.86 | -0.69 | -0.42 | -0.46 |
| weight | -0.83 | 0.90 | 0.93 | 0.86 | 1.00 | -0.42 | -0.31 | -0.59 |
| acceleration | 0.42 | -0.50 | -0.54 | -0.69 | -0.42 | 1.00 | 0.29 | 0.21 |
| year | 0.58 | -0.35 | -0.37 | -0.42 | -0.31 | 0.29 | 1.00 | 0.18 |
| origin | 0.57 | -0.57 | -0.61 | -0.46 | -0.59 | 0.21 | 0.18 | 1.00 |
library(corrplot)
## corrplot 0.92 loaded
corr <- cor(Auto[, -9])
# visualize the pairwise correlations as ellipses
corrplot(corr, method = "ellipse")
c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: Is there a relationship between the predictors and the response? Which predictors appear to have a statistically significant relationship to the response? What does the coefficient for the year variable suggest?
lm_fit <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin,
data = Auto)
summary(lm_fit)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic of 252.4, with a p-value far below 0.05, indicates that there is a relationship between the predictors and the response. The predictors with a statistically significant relationship to mpg are displacement, weight, year, and origin. The coefficient for year suggests that an increase of one year, with all other variables held constant, results in an increase of 0.750773 miles per gallon; each year, cars become slightly more fuel efficient.

d) Use the plot() function to produce diagnostic
plots of the linear regression fit. Comment on any problems you see with
the fit. Do residual plots suggest any unusually large outliers? Does
the leverage plot identify any observations with unusually high
leverage?
par(mfrow = c(2,2))
plot(lm_fit)
e) Use the * and : symbols to fit
linear regression models with interaction effects. Do any interactions
appear to be statistically significant?
lm_fit2 <- lm(mpg ~ . * ., data = Auto[, -9])
summary(lm_fit2)
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The interactions that appear to be statistically significant at the 0.05 level are displacement:year, acceleration:year, and acceleration:origin.

f) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), and \(X^2\). Comment on your findings.
lm_fit3 <- lm(mpg ~ . + I(weight^2) + I(horsepower^2) + I(log(weight)) + I(log(acceleration)) +
    I(displacement^(1/2)) + I(horsepower^(1/2)), data = Auto[, -9])
summary(lm_fit3)
##
## Call:
## lm(formula = mpg ~ . + I(weight^2) + I(horsepower^2) + I(log(weight)) +
## I(log(acceleration)) + I(displacement^(1/2)) + I(horsepower^(1/2)),
## data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1673 -1.5596 0.0183 1.4953 12.0534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.427e+02 1.714e+02 -0.833 0.40563
## cylinders 2.783e-01 3.567e-01 0.780 0.43563
## displacement 4.581e-03 2.949e-02 0.155 0.87664
## horsepower 9.327e-02 3.648e-01 0.256 0.79835
## weight -3.472e-02 1.750e-02 -1.984 0.04795 *
## acceleration 1.367e+00 5.335e-01 2.562 0.01080 *
## year 7.834e-01 4.478e-02 17.492 < 2e-16 ***
## origin 6.542e-01 2.656e-01 2.464 0.01420 *
## I(weight^2) 3.147e-06 1.374e-06 2.291 0.02250 *
## I(horsepower^2) 9.598e-05 4.820e-04 0.199 0.84227
## I(log(weight)) 3.194e+01 2.665e+01 1.199 0.23145
## I(log(acceleration)) -2.485e+01 8.393e+00 -2.960 0.00327 **
## I(displacement^(1/2)) -2.919e-01 8.774e-01 -0.333 0.73955
## I(horsepower^(1/2)) -3.804e+00 5.281e+00 -0.720 0.47179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.869 on 378 degrees of freedom
## Multiple R-squared: 0.8694, Adjusted R-squared: 0.8649
## F-statistic: 193.5 on 13 and 378 DF, p-value: < 2.2e-16
Of the original predictors, weight, acceleration, year, and origin remain statistically significant, although all of them except year are significant only at the 0.05 level. Of the transformed variables, weight squared and the log of acceleration are statistically significant.

This question should be answered using the Carseats data set.

a) Fit a multiple regression model to predict
Sales using Price, Urban, and
US.
car_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
b) Provide an interpretation of each coefficient in the model. Be careful - some variables in the model are qualitative!
summary(car_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
A $1 increase in Price, with Urban and US held constant, is associated with a decrease in Sales of about 0.054 thousand units, or roughly 54 car seats sold. The relationship between Price and Sales appears to be statistically significant, with a p-value near 0.

The relationship between Urban and Sales does not appear to be statistically significant, with a p-value of 0.936.

A store located in the US, with Price and Urban held constant, is associated with an increase in Sales of about 1.2 thousand units, or roughly 1,200 car seats. The relationship between US and Sales appears to be statistically significant, with a p-value near 0.

c) Write out the model in equation form, being careful to handle the qualitative variables properly.
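Using the coefficient estimates above, the fitted model can be written as

\[
\widehat{\text{Sales}} = 13.043 - 0.054 \times \text{Price} - 0.022 \times \text{Urban} + 1.201 \times \text{US},
\]

where Urban equals 1 if the store is in an urban location and 0 otherwise, and US equals 1 if the store is in the US and 0 otherwise.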
d) For which of the predictors can you reject the null hypothesis \(H_{0} : \beta_{j} = 0\)?
The null hypothesis can be rejected for Price and US because of their small p-values. There is not enough evidence to reject the null hypothesis for the predictor Urban because of its large p-value.

e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
car_lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(car_lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f) How well do the models in a) and e) fit the data?
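One way to compare the two fits, as a small supplement to the answer below, is to pull the \(R^2\) and residual standard error out of each model object directly.

# R-squared and residual standard error for the full and reduced models
c(summary(car_lm)$r.squared, summary(car_lm2)$r.squared)
c(summary(car_lm)$sigma, summary(car_lm2)$sigma)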
Both models explain about 23.9% of the variation in Sales around its mean. This \(R^2\) value of 0.239 is relatively low, so much of the variation in Sales is not explained by either model.

g) Using the model from e), obtain 95% confidence intervals for the coefficient(s).
confint(car_lm2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
h) Is there evidence of outliers or high leverage observations in the model from e)?
par(mfrow=c(2,2))
plot(car_lm2)
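Beyond the diagnostic plots, outliers and high-leverage observations can be flagged numerically; the cutoffs used here (studentized residuals beyond \(\pm 3\) and leverage above twice the average \((p + 1)/n\)) are common rules of thumb, not part of the original answer.

# observations with unusually large studentized residuals
which(abs(rstudent(car_lm2)) > 3)
# observations whose leverage exceeds twice the average (p + 1) / n
p <- length(coef(car_lm2)) - 1
n <- nrow(Carseats)
which(hatvalues(car_lm2) > 2 * (p + 1) / n)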
a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?
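From (3.38), the least squares coefficient for regressing \(Y\) onto \(X\) without an intercept is

\[
\hat{\beta}_{y \sim x} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\]

and, by symmetry, the coefficient for regressing \(X\) onto \(Y\) is

\[
\hat{\beta}_{x \sim y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}.
\]

The two estimates share the same numerator, so they are equal exactly when \(\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2\). This is why the example in part c) below, in which the two vectors contain the same values and therefore have the same sum of squares, produces identical coefficient estimates.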
b) Generate an example in R with \(n = 100\) observations in which the
coefficient estimate for the regression of \(X\) onto \(Y\) is different from the
coefficient estimate for the regression of \(Y\) onto \(X\).
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
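# y has roughly twice the spread of x, so the slope from lm(x ~ y) will differ from the slope from lm(y ~ x)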
lmx <- lm(x ~ y)
lmy <- lm(y ~ x)
summary(lmx)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90848 -0.28101 0.06274 0.24570 0.85736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03880 0.04266 0.91 0.365
## y 0.38942 0.02099 18.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4249 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
summary(lmy)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8768 -0.6138 -0.1395 0.5394 2.3462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03769 0.09699 -0.389 0.698
## x 1.99894 0.10773 18.556 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9628 on 98 degrees of freedom
## Multiple R-squared: 0.7784, Adjusted R-squared: 0.7762
## F-statistic: 344.3 on 1 and 98 DF, p-value: < 2.2e-16
c) Generate an example in R with \(n = 100\) observations in which the
coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient
estimate for the regression of \(Y\)
onto \(X\).
set.seed(1)
f <- 1:100
g <- 100:1
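# f and g contain the same values in reverse order, so sum(f^2) == sum(g^2) and, by part a), the two no-intercept regressions give the same slope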
lmf <- lm(f ~ g + 0)
lmg <- lm(g ~ f + 0)
summary(lmf)
##
## Call:
## lm(formula = f ~ g + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## g 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(lmg)
##
## Call:
## lm(formula = g ~ f + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## f 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08