Assignment 2: Chapter 3
Questions: 2, 9, 10, 12
2. Carefully explain the differences between the KNN classifier and KNN regression methods.
Both methods start the same way: for a new observation x0, they find the K training observations closest to x0. They differ in the type of response and in how the neighbors are combined. The KNN classifier is used when the response is qualitative: it estimates the probability of each class as the fraction of the K neighbors that belong to it, and assigns x0 to the most common class. KNN regression is used when the response is quantitative: it predicts the response at x0 as the average of the K neighbors' response values.
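As a rough illustration of the distinction, here is a minimal base-R sketch. The toy data and the helper names `nearest`, `knn_classify`, and `knn_regress` are assumptions made up for this example, not part of the assignment.

```r
# Toy 1-D training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- factor(c("a", "a", "a", "b", "b", "b"))  # qualitative response
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)          # quantitative response

# Indices of the k training points closest to x0
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN classifier: majority vote among the k neighbors
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k neighbors
knn_regress <- function(x0, k) mean(train_y[nearest(x0, k)])

knn_classify(2.5, k = 3)  # returns a class label
knn_regress(2.5, k = 3)   # returns a number
```

The neighbor search is identical in both functions; only the last step (vote vs. average) changes.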
9. Multiple Linear Regression on the Auto dataset.
a)
auto <- ISLR::Auto
# scatterplot matrix of all variables (use the `auto` object created above)
pairs(auto)
b)
# correlations between all variables except the qualitative `name` column (column 9)
cor(auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
c)
m<- lm(mpg~.-name, auto)
m
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Coefficients:
## (Intercept) cylinders displacement horsepower weight
## -17.218435 -0.493376 0.019896 -0.016951 -0.006474
## acceleration year origin
## 0.080576 0.750773 1.426140
summary(m)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
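i. Yes, there is a relationship between the predictors and the response: the F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero.
ii. Displacement, weight, year, and origin have statistically significant relationships with mpg (p < 0.01 for each); cylinders, horsepower, and acceleration do not.
iii. The coefficient for year is about 0.75, which indicates that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year.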
d)
par(mfrow=c(2,2))
plot(m)
The residuals vs. fitted plot suggests some non-linearity in the data, with the residuals clustering near the upper end of the x-axis. The slight tail on the Q-Q plot suggests that the residuals are not perfectly normal in the tail.
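The plots can be backed up with numbers. The sketch below assumes the same model as in part (c), refit here as `fit` so the block stands alone, and uses the common rule of thumb that studentized residuals beyond about 3 in absolute value are possible outliers.

```r
library(ISLR)  # provides the Auto data used above

fit <- lm(mpg ~ . - name, data = Auto)

# Studentized residuals: values beyond roughly +/-3 are possible outliers
stud <- rstudent(fit)
which(abs(stud) > 3)

# Leverage: compare each hat value with the average leverage (p + 1) / n
lev     <- hatvalues(fit)
avg_lev <- length(coef(fit)) / nrow(Auto)
head(sort(lev, decreasing = TRUE), 3)  # three highest-leverage rows
avg_lev
```

Rows whose leverage is several times `avg_lev` are the ones worth checking in the residuals vs. leverage panel.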
e)
m_int <- lm(mpg~.*.-name*.+.-name,data=auto)
summary(m_int)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
There are some significant interactions (p < 0.05): acceleration with year, acceleration with origin, and displacement with year.
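Individual t-tests on 21 interaction terms are hard to read together, so a joint test can help. The sketch below is one possible way to do it, assuming the compact formula `(. - name)^2` as a shorthand for all main effects plus all two-way interactions; the object names `fit_add` and `fit_int` are made up for this example.

```r
library(ISLR)

fit_add <- lm(mpg ~ . - name, data = Auto)      # main effects only
fit_int <- lm(mpg ~ (. - name)^2, data = Auto)  # main effects + two-way interactions

# Number of non-intercept terms: 7 main effects + 21 interactions
length(coef(fit_int)) - 1

# Partial F-test: do the interaction terms jointly improve on the additive fit?
anova(fit_add, fit_int)
```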
f)
par(mfrow=c(2,2))
plot(auto$weight,auto$mpg)
plot((auto$weight)^2,auto$mpg)
plot(log(auto$weight),auto$mpg)
plot(sqrt(auto$weight),auto$mpg)
The plot of mpg against untransformed weight is consistent with the negative relationship seen in the scatterplot matrix from part (a). Of the transformations tried, the log transform of weight gives a more linear relationship with mpg than the untransformed plot.
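The visual comparison can be checked by fitting a single-predictor model for each transformation and comparing R-squared values. This is only a sketch; the list name `fits` and its labels are made up for the example.

```r
library(ISLR)

fits <- list(
  raw    = lm(mpg ~ weight, data = Auto),
  square = lm(mpg ~ I(weight^2), data = Auto),
  log    = lm(mpg ~ log(weight), data = Auto),
  sqrt   = lm(mpg ~ sqrt(weight), data = Auto)
)

# R-squared of each fit; a higher value means the straight line fits more closely
r2 <- sapply(fits, function(f) summary(f)$r.squared)
r2
```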
10. Regression with Carseats data
a)
carseats<- ISLR:: Carseats
car_mod<- lm(Sales~Price+Urban+US, carseats)
summary(car_mod)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
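b) The Price coefficient (-0.054) indicates that, holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of about 0.054 units (Sales is recorded in thousands, so roughly 54 car seats). The US variable is a factor (No/Yes); its coefficient (1.20) indicates that stores in the US sell about 1.2 units (roughly 1,200 car seats) more on average than stores elsewhere. The Urban variable is also a No/Yes factor; its coefficient (-0.022) is slightly negative, which would suggest marginally lower sales in urban stores, but it is not statistically significant (p = 0.936).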
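c) Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.20 * US, where Urban and US equal 1 if the store is urban / in the US (Yes) and 0 otherwise (No).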
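To see how R encodes the two factors in this equation, and to plug in values, something like the following could be used. The store described by `new_store` is hypothetical and chosen only for illustration.

```r
library(ISLR)

# Dummy coding R uses for the two factors: "No" is the baseline (0), "Yes" is 1
contrasts(Carseats$Urban)
contrasts(Carseats$US)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)

# Prediction for a hypothetical urban US store charging a price of 100
new_store <- data.frame(Price = 100, Urban = "Yes", US = "Yes")
predict(fit, newdata = new_store)
```

Plugging the same values into the printed coefficients by hand gives 13.04 - 5.45 - 0.02 + 1.20, or about 8.78 (thousand units).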
d) We can reject the null hypothesis that the coefficient is zero for Price and US (both p < 0.001). We cannot reject it for Urban (p = 0.936), so there is no evidence that Urban is associated with Sales in this model.
e)
lim_mod<- lm(Sales~US+Price, carseats)
summary(lim_mod)
##
## Call:
## lm(formula = Sales ~ US + Price, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
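f) Both models fit the data only modestly: each has a multiple R-squared of 0.2393, so about 24% of the variance in Sales is explained. The adjusted R-squared for the model with US and Price only is 0.2354, versus 0.2335 for the model that also includes Urban, and its residual standard error is slightly lower (2.469 vs. 2.472), so the smaller model (US + Price) is marginally better.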
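Because the two models are nested, they can also be compared with a partial F-test. A minimal sketch, with the object names `full` and `reduced` assumed for this example:

```r
library(ISLR)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US, data = Carseats)

# F-test of the null hypothesis that the Urban coefficient is zero
anova(reduced, full)

# Side-by-side fit measures for the two models
sapply(list(full = full, reduced = reduced),
       function(f) c(adj_r2 = summary(f)$adj.r.squared, rse = summary(f)$sigma))
```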
g)
confint(lim_mod)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## USYes 0.69151957 1.70776632
## Price -0.06475984 -0.04419543
h)
par(mfrow=c(2,2))
plot(lim_mod)
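The diagnostic plots do not show an obvious problem with the linear fit. In particular, the points on the Q-Q plot look close to a straight line, which suggests that the residuals are approximately normal.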
12. Linear regression without intercept
a) I understand this to ask when the coefficient estimate stays the same if the direction of the regression is reversed. That happens when the sum of x^2 equals the sum of y^2.
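The condition follows from the least-squares estimates for regression through the origin, written out here for both directions:

```latex
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
```

The numerators are identical, so (as long as the shared numerator is not zero) the two estimates are equal exactly when the denominators are equal, that is, when $\sum x_i^2 = \sum y_i^2$.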
b)
# regression of x onto y (note: lm() includes an intercept by default)
x<- 1:100
y<- 4*x + rnorm(100)
b<- as.data.frame(cbind(x,y))
m_one<- lm(x~y,b)
summary(m_one)
##
## Call:
## lm(formula = x ~ y, data = b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.55916 -0.19389 -0.01598 0.21107 0.60198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0135154 0.0505008 0.268 0.79
## y 0.2499684 0.0002171 1151.578 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2507 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.326e+06 on 1 and 98 DF, p-value: < 2.2e-16
# regression of y onto x
m_two<- lm(y~x, b)
summary(m_two)
##
## Call:
## lm(formula = y ~ x, data = b)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.41279 -0.85260 0.05622 0.78051 2.22880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.039140 0.202056 -0.194 0.847
## x 4.000210 0.003474 1151.578 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.003 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.326e+06 on 1 and 98 DF, p-value: < 2.2e-16
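The fits above include an intercept, since lm() adds one by default. A version that matches the no-intercept setting of this question might look like the sketch below; the seed is arbitrary and the names `b_yx` and `b_xy` are made up for the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
x <- 1:100
y <- 4 * x + rnorm(100)

# "+ 0" drops the intercept
b_yx <- coef(lm(y ~ x + 0))  # regression of y onto x
b_xy <- coef(lm(x ~ y + 0))  # regression of x onto y

# Each lm() estimate next to its closed-form value
c(b_yx, sum(x * y) / sum(x^2))
c(b_xy, sum(x * y) / sum(y^2))
```

The two slopes differ because the sum of y^2 is much larger than the sum of x^2 here.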
c)
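x <- 1:100
y <- -x  # sum of y^2 equals sum of x^2
c <- as.data.frame(cbind(x, y))
# "+ 0" drops the intercept, as the question requires
cm <- lm(x ~ y + 0, c)
cm_two <- lm(y ~ x + 0, c)
coef(cm)
coef(cm_two)
Both regressions give a slope of -1, because the sum of x^2 equals the sum of y^2.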
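A less trivial example of the same condition is to let y be a reshuffled copy of x, so that the two sums of squares match while x and y are no longer perfectly related. This is a sketch with an arbitrary seed and made-up object names.

```r
set.seed(2)  # arbitrary seed for reproducibility
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so the sums of squares match

b_yx <- coef(lm(y ~ x + 0))  # regression of y onto x, no intercept
b_xy <- coef(lm(x ~ y + 0))  # regression of x onto y, no intercept
c(b_yx, b_xy)
```

The two printed estimates agree, since both equal the sum of x*y divided by the common sum of squares.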