1. (Exercise 3.7.2) Carefully explain the differences between the KNN classifier and KNN regression methods.
In both methods, for a test observation, \(x_{0}\), a neighborhood is created by identifying the \(K\)-nearest training observations to \(x_{0}\), for some positive integer \(K\).
KNN Classifier Method: this method aims to predict the class of a categorical response variable for the test observation \(x_{0}\). It does so by estimating the conditional probability of each response class, \(j\), as the fraction of training observations in the neighborhood of \(x_{0}\) whose response value is \(j\). The method then assigns \(x_{0}\) to whichever response class has the highest estimated conditional probability.
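In symbols, with \(\mathcal{N}_{0}\) denoting the neighborhood of \(x_{0}\), the estimated conditional probability for class \(j\) is \[\Pr(Y = j \mid X = x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} I(y_{i} = j).\]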
KNN Regression Method: this method aims to estimate the numerical response value [\(f(x_{0})\)] of the test observation \(x_{0}\). It does so by calculating the mean response value from the training observations in the neighborhood of \(x_{0}\) and assigning this mean as the estimate of \(f(x_{0})\).
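Correspondingly, the regression estimate is \(\hat{f}(x_{0}) = \frac{1}{K} \sum_{i \in \mathcal{N}_{0}} y_{i}\). As a minimal sketch on hypothetical toy data (the objects below are illustrative, not part of the exercise), the two methods share the neighborhood step and differ only in the final step:
set.seed(1)
x.train <- matrix(rnorm(40), ncol = 2)             # 20 toy training observations
y.class <- sample(c("A", "B"), 20, replace = TRUE) # categorical response
y.num   <- rnorm(20)                               # numerical response
x0 <- c(0, 0); K <- 5
# both methods: identify the K nearest training observations to x0
dists <- sqrt(rowSums(sweep(x.train, 2, x0)^2))
nbhd  <- order(dists)[1:K]
# classifier: assign the class with the highest estimated conditional probability
names(which.max(table(y.class[nbhd])))
# regression: estimate f(x0) as the mean response over the neighborhood
mean(y.num[nbhd])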
2. (Exercise 3.7.9) This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
#Load library containing data set and load Auto data
library(ISLR2)
data("Auto")
#create scatterplot matrix of all variables in the data
pairs(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
#cor matrix of all Auto variables except name
cor(Auto[,1:8])
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442 0.4233285 0.5805410 0.5652088
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273 -0.5046834 -0.3456474 -0.5689316
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944 -0.5438005 -0.3698552 -0.6145351
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377 -0.6891955 -0.4163615 -0.4551715
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000 -0.4168392 -0.3091199 -0.5850054
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392 1.0000000 0.2903161 0.2127458
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199 0.2903161 1.0000000 0.1815277
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054 0.2127458 0.1815277 1.0000000
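As a quick supplementary check (a sketch, not required by the exercise), the predictors can be ranked by the absolute value of their correlation with mpg:
# sort absolute correlations of the predictors with mpg, strongest first
sort(abs(cor(Auto[, 1:8])[, "mpg"])[-1], decreasing = TRUE)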
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: Is there a relationship between the predictors and the response? Which predictors appear to have a statistically significant relationship to the response? What does the coefficient for the year variable suggest?
A relationship between the predictors and response does appear to be present; the overall F-statistic p-value (< 2.2e-16) tells us that there is sufficient evidence to suggest that the model relationship is statistically significant.
The specific predictors which appear to be statistically significant here are displacement, weight, year, and origin.
The coefficient for the year variable (0.750773) suggests that as the model year of a vehicle increases by 1 year, while all other variables are held constant, we can expect the gas mileage to increase by approximately 0.751 miles per gallon.
# create data frame of all variables without name
auto.new <- data.frame(Auto[, 1:8])
#MLR with resp mpg to all other vars w/o name and show reg summary
auto.reg <- lm(mpg ~., data = auto.new)
summary(auto.reg)
Call:
lm(formula = mpg ~ ., data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
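To illustrate this interpretation (a sketch using the fitted auto.reg; the two cars below are hypothetical), the predicted mpg difference between otherwise-identical cars one model year apart equals the year coefficient:
# two hypothetical cars identical except for model year
base  <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                    weight = 2500, acceleration = 15, year = 76, origin = 1)
newer <- transform(base, year = 77)
# difference in predictions is the year coefficient, ~0.751 mpg
predict(auto.reg, newer) - predict(auto.reg, base)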
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
# plot auto regression diagnostics
par(mfrow = c(2,2), mar = c(3.4,3.5,1.2,0), mgp =c(2,1,0))
plot(auto.reg)
There are some problems with the fit. The Residuals vs. Fitted plot shows that we do not have homoscedasticity (the model has non-constant variance), as indicated by the bowed fan shape of the distribution. The errors may also not be normally distributed, given the departure from linearity among the high-end observations in the Q-Q plot. These plots further indicate that observations 323, 326, and 327 may be outliers in the data. The Residuals vs. Leverage plot does not identify any observations with unusually high leverage.
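These visual impressions can be checked numerically (a sketch; the cutoffs below are common rules of thumb, not prescribed by the exercise):
# studentized residuals beyond +/- 3 flag candidate outliers
which(abs(rstudent(auto.reg)) > 3)
# leverages above twice the average, 2*(p + 1)/n, flag candidates worth inspecting
p <- length(coef(auto.reg)) - 1
n <- nrow(auto.new)
which(hatvalues(auto.reg) > 2 * (p + 1) / n)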
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
# reg with main/interaction effects between weight & acceleration
wt.acc.reg <- lm(mpg ~ weight + acceleration + weight:acceleration, data = auto.new)
summary(wt.acc.reg)
Call:
lm(formula = mpg ~ weight + acceleration + weight:acceleration,
data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-10.5823 -2.6411 -0.3517 2.2611 15.6704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.814e+01 4.872e+00 5.776 1.57e-08 ***
weight -3.168e-03 1.461e-03 -2.168 0.03076 *
acceleration 1.117e+00 3.097e-01 3.608 0.00035 ***
weight:acceleration -2.787e-04 9.694e-05 -2.875 0.00426 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.249 on 388 degrees of freedom
Multiple R-squared: 0.706, Adjusted R-squared: 0.7037
F-statistic: 310.5 on 3 and 388 DF, p-value: < 2.2e-16
# reg with main/interaction effects between horsepower & acceleration
hp.acc.reg <- lm(mpg ~ horsepower * acceleration, data = auto.new)
summary(hp.acc.reg)
Call:
lm(formula = mpg ~ horsepower * acceleration, data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-13.3442 -2.7324 -0.4049 2.4210 15.8840
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.512440 3.420187 9.798 < 2e-16 ***
horsepower 0.017590 0.027425 0.641 0.521664
acceleration 0.800296 0.211899 3.777 0.000184 ***
horsepower:acceleration -0.015698 0.002003 -7.838 4.45e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.426 on 388 degrees of freedom
Multiple R-squared: 0.6809, Adjusted R-squared: 0.6784
F-statistic: 275.9 on 3 and 388 DF, p-value: < 2.2e-16
In the first regression (weight + acceleration) all terms appear to be significant, including the interaction.
In the second regression (horsepower + acceleration) horsepower on its own was not shown to be significant, but acceleration and the interaction between horsepower and acceleration both appear to be highly significant.
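Note that the two formula styles are interchangeable: weight * acceleration expands to the same model as weight + acceleration + weight:acceleration. A quick sketch confirming this:
# the * shorthand fits an identical model to the explicit + and : form above
wt.acc.reg2 <- lm(mpg ~ weight * acceleration, data = auto.new)
all.equal(coef(wt.acc.reg2), coef(wt.acc.reg))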
Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^{2}\). Comment on your findings.
#examine relationship: mpg & horsepower
par(mfrow = c(2,2))
plot(auto.new$horsepower, auto.new$mpg)
plot(log(auto.new$horsepower), auto.new$mpg)
plot(sqrt(auto.new$horsepower), auto.new$mpg)
plot((auto.new$horsepower)^2, auto.new$mpg)
# full reg with log-horsepower
log.hp.reg <- lm(mpg ~ cylinders + displacement + log(horsepower) + weight
+ acceleration + year + origin, data = auto.new)
summary(log.hp.reg)
Call:
lm(formula = mpg ~ cylinders + displacement + log(horsepower) +
weight + acceleration + year + origin, data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-9.3115 -2.0041 -0.1726 1.8393 12.6579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.254005 8.589614 3.173 0.00163 **
cylinders -0.486206 0.306692 -1.585 0.11372
displacement 0.019456 0.006876 2.830 0.00491 **
log(horsepower) -9.506436 1.539619 -6.175 1.69e-09 ***
weight -0.004266 0.000694 -6.148 1.97e-09 ***
acceleration -0.292088 0.103804 -2.814 0.00515 **
year 0.705329 0.048456 14.556 < 2e-16 ***
origin 1.482435 0.259347 5.716 2.19e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.18 on 384 degrees of freedom
Multiple R-squared: 0.837, Adjusted R-squared: 0.834
F-statistic: 281.6 on 7 and 384 DF, p-value: < 2.2e-16
#examine relationship: mpg & acceleration
plot(auto.new$acceleration, auto.new$mpg)
plot(log(auto.new$acceleration), auto.new$mpg)
plot(sqrt(auto.new$acceleration), auto.new$mpg)
plot((auto.new$acceleration)^2, auto.new$mpg)
# full reg with log-acceleration
log.acc.reg <- lm(mpg ~ cylinders + displacement + horsepower + weight
+ log(acceleration) + year + origin, data = auto.new)
summary(log.acc.reg)
Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
log(acceleration) + year + origin, data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-9.7774 -2.1790 -0.1636 1.8434 13.1268
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -15.174273 6.443614 -2.355 0.0190 *
cylinders -0.507167 0.323203 -1.569 0.1174
displacement 0.019166 0.007595 2.524 0.0120 *
horsepower -0.024622 0.014198 -1.734 0.0837 .
weight -0.006190 0.000676 -9.157 < 2e-16 ***
log(acceleration) -0.129499 1.631402 -0.079 0.9368
year 0.747224 0.050993 14.654 < 2e-16 ***
origin 1.428083 0.278370 5.130 4.6e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.331 on 384 degrees of freedom
Multiple R-squared: 0.8212, Adjusted R-squared: 0.8179
F-statistic: 251.9 on 7 and 384 DF, p-value: < 2.2e-16
# full reg with sqrt-acceleration
sqrt.acc.reg <- lm(mpg ~ cylinders + displacement + horsepower + weight
+ sqrt(acceleration) + year + origin, data = auto.new)
summary(sqrt.acc.reg)
Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
sqrt(acceleration) + year + origin, data = auto.new)
Residuals:
Min 1Q Median 3Q Max
-9.6792 -2.1496 -0.1413 1.8603 13.0920
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.696e+01 5.556e+00 -3.052 0.00243 **
cylinders -5.022e-01 3.233e-01 -1.553 0.12117
displacement 1.966e-02 7.550e-03 2.604 0.00958 **
horsepower -2.052e-02 1.401e-02 -1.464 0.14395
weight -6.347e-03 6.639e-04 -9.560 < 2e-16 ***
sqrt(acceleration) 3.086e-01 8.101e-01 0.381 0.70343
year 7.490e-01 5.100e-02 14.687 < 2e-16 ***
origin 1.428e+00 2.783e-01 5.131 4.58e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.33 on 384 degrees of freedom
Multiple R-squared: 0.8212, Adjusted R-squared: 0.818
F-statistic: 252 on 7 and 384 DF, p-value: < 2.2e-16
A logarithmic transformation of horsepower gives an approximately linear relationship with mpg, and the regression on the full model with this transformation suggests that log(horsepower) is statistically significant.
None of the transformations used here provided much improvement on the acceleration data in terms of linearity. Regressions were performed on the full model using both transformations of this variable, but significance was not suggested for either transformation.
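As a compact side-by-side (a sketch comparing the fitted models above), the adjusted \(R^{2}\) values tell the same story: only the log(horsepower) model improves on the original fit.
# adjusted R-squared for the original and transformed full models
sapply(list(original = auto.reg, log.hp = log.hp.reg,
            log.acc = log.acc.reg, sqrt.acc = sqrt.acc.reg),
       function(m) summary(m)$adj.r.squared)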
3. (Exercise 3.7.10) This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
data("Carseats")
# create data frame of appropriate variables
seats.new <- data.frame(Carseats[, c(1,6,10:11)])
#MLR
seats.reg <- lm(Sales ~., data = seats.new)
summary(seats.reg)
Call:
lm(formula = Sales ~ ., data = seats.new)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Write out the model in equation form, being careful to handle the qualitative variables properly. \[ \hat{Y}_{Sales} = 13.0435 - 0.0545X_{Price} - 0.0219X_{Urban:Yes} + 1.2006X_{US:Yes} \] where \(X_{Urban:Yes}\) and \(X_{US:Yes}\) are indicator (dummy) variables equal to 1 when the store is in an urban location or in the US, respectively, and 0 otherwise.
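The dummy coding R uses for these qualitative variables can be confirmed directly (a quick sketch):
# R codes Urban and US as 0/1 indicators with "No" as the baseline level
contrasts(Carseats$Urban)
contrasts(Carseats$US)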
For which of the predictors can you reject the null hypothesis \(H_{0}: \beta_{j} = 0\)?
We can reject this null hypothesis for Price and US, as both have p-values small enough to suggest that their respective \(\beta\) values are significantly different from zero.
data("Carseats")
# create data frame of appropriate variables
seats.new <- data.frame(Carseats[, c(1,6,10:11)])
#MLR reduced
seats.reg.red <- lm(Sales ~ Price + US, data = seats.new)
summary(seats.reg.red)
Call:
lm(formula = Sales ~ Price + US, data = seats.new)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
These models fit the data poorly: both have \(R^{2} = 0.2393\), which means the models account for only about 24% of the variance in Sales.
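A formal nested-model comparison (a sketch) makes the same point about the dropped Urban term: the F-test p-value matches the t-test p-value of 0.936 seen above.
# compare the reduced model against the full Price/Urban/US model
anova(seats.reg.red, seats.reg)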
confint(seats.reg.red, level = 0.95)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
# Get regression plots
par(mfrow = c(2,2), mar = c(3.4,3.5,1.2,0), mgp =c(2,1,0))
plot(seats.reg.red)
# Get summary of potentially influential obs
summary(influence.measures(seats.reg.red))
# Remove obs from data and re-run reg
pips <- c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
seats.no.pips <- seats.new[-pips, ]
reg.no.pips2 <- lm(Sales ~ Price + US, data = seats.no.pips)
summary(reg.no.pips2)
The regression plots don't immediately suggest that any influential observations are present. As a check, we generate a list of potentially influential observations, remove them from the data, and re-run the regression.
The two models are very similar, and the coefficient estimates in the model without the potentially influential points are contained within the confidence intervals obtained from the model fit with all observations. Therefore we can safely include all observations without concern that influential points are distorting the fit.
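Since the summary for reg.no.pips2 is not reproduced above, the comparison can be made compactly (a sketch):
# coefficients with and without the flagged observations, beside the full-data CIs
cbind(full = coef(seats.reg.red), no.pips = coef(reg.no.pips2),
      confint(seats.reg.red))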
4. (Exercise 3.7.12) This problem involves simple linear regression without an intercept.
From (3.38), the coefficient estimate for the regression of \(Y\) onto \(X\) without an intercept is \[\hat{\beta}_{Y \to X} = \frac{\sum_{i=1}^{n} x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2}},\] and, by symmetry, \[\hat{\beta}_{X \to Y} = \frac{\sum_{i=1}^{n} x_{i}y_{i}}{\sum_{i=1}^{n} y_{i}^{2}}.\]
Then, provided \(\sum_{i=1}^{n} x_{i}y_{i} \neq 0\), \[\begin{align*} \hat{\beta}_{Y \to X} &= \hat{\beta}_{X \to Y} \\ \implies \frac{\sum_{i=1}^{n} x_{i}y_{i}}{\sum_{i=1}^{n} x_{i}^{2}} &= \frac{\sum_{i=1}^{n} x_{i}y_{i}}{\sum_{i=1}^{n} y_{i}^{2}} \\ \implies \sum_{i=1}^{n} y_{i}^{2} &= \sum_{i=1}^{n} x_{i}^{2}. \end{align*}\]
Therefore, the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\) exactly when \(\sum_{i=1}^{n} y_{i}^{2} = \sum_{i=1}^{n} x_{i}^{2}\).
Generate an example in R with \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- 1:100
sum(x^2)
[1] 338350
y.diff <- 2*x
sum(y.diff^2)
[1] 1353400
reg.yx.diff <- lm(y.diff ~ 0 + x)
reg.xy.diff <- lm(x ~ 0 + y.diff)
summary(reg.yx.diff)
Call:
lm(formula = y.diff ~ 0 + x)
Residuals:
Min 1Q Median 3Q Max
-4.677e-13 -3.180e-15 6.000e-17 3.080e-15 1.537e-14
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 2.000e+00 8.144e-17 2.456e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.737e-14 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.03e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(reg.xy.diff)
Call:
lm(formula = x ~ 0 + y.diff)
Residuals:
Min 1Q Median 3Q Max
-2.338e-13 -1.589e-15 2.900e-17 1.540e-15 7.683e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y.diff 5.000e-01 2.036e-17 2.456e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.369e-14 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.03e+32 on 1 and 99 DF, p-value: < 2.2e-16
Generate an example in R with \(n = 100\) observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).
x <- 1:100
sum(x^2)
[1] 338350
y.same <- -1:-100
sum(y.same^2)
[1] 338350
reg.yx.same <- lm(y.same ~ 0 + x)
reg.xy.same <- lm(x ~ 0 + y.same)
summary(reg.yx.same)
Call:
lm(formula = y.same ~ 0 + x)
Residuals:
Min 1Q Median 3Q Max
-7.683e-15 -1.540e-15 -2.900e-17 1.589e-15 2.338e-13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -1.000e+00 4.072e-17 -2.456e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.369e-14 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.03e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(reg.xy.same)
Call:
lm(formula = x ~ 0 + y.same)
Residuals:
Min 1Q Median 3Q Max
-2.338e-13 -1.589e-15 2.900e-17 1.540e-15 7.683e-15
Coefficients:
Estimate Std. Error t value Pr(>|t|)
y.same -1.000e+00 4.072e-17 -2.456e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.369e-14 on 99 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 6.03e+32 on 1 and 99 DF, p-value: < 2.2e-16