Q: Carefully explain the differences between the KNN classifier and KNN regression methods.
A: KNN regression is closely related to the KNN classifier; the two differ in the type of response. The KNN classifier is used for classification problems (Y is qualitative/categorical), while KNN regression is used for regression problems (Y is quantitative/continuous).
KNN classifier: for a given X, we find the k training observations closest to X and examine their corresponding Y values. We predict the class that occurs most often among those k neighbors. The smaller k is, the more flexible the method will be.
KNN regression: for a given X, we find the k training observations closest to X and predict Y as the average of their responses. If k is small, KNN regression is much more flexible than linear regression.
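The contrast between the two methods can be sketched in a few lines of base R. This is only an illustration on made-up toy data: `knn_predict()` is a hypothetical helper written for this example (it is not a function from any package), and it assumes Euclidean distance and a single query point.

```r
# Minimal base-R sketch of KNN prediction for one query point.
# knn_predict() is a hypothetical helper, not part of any package.
knn_predict <- function(train_x, train_y, query, k) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the query to every training observation
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k closest training observations
  nearest <- order(dists)[seq_len(k)]
  neighbors_y <- train_y[nearest]
  if (is.factor(neighbors_y)) {
    # Classifier: majority vote (ties go to the first level with the max count)
    votes <- table(neighbors_y)
    factor(names(votes)[which.max(votes)], levels = levels(train_y))
  } else {
    # Regression: average of the neighbors' responses
    mean(neighbors_y)
  }
}

# Toy data (made up for illustration): one predictor, six observations
train_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
y_class <- factor(c("a", "a", "a", "b", "b", "b"))
y_num   <- c(1, 2, 3, 10, 11, 12)

class_pred <- knn_predict(train_x, y_class, query = 2.5, k = 3)  # majority class
reg_pred   <- knn_predict(train_x, y_num,   query = 2.5, k = 3)  # neighbor mean
print(class_pred)
print(reg_pred)
```

The only line that differs between the two cases is the final aggregation step: a vote for a categorical Y, a mean for a numerical Y.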
Q: This question involves the use of multiple linear regression on the Auto data set. (a) Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[c(1:8)])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
The p-value of the overall F-statistic is well below 0.05 (< 2.2e-16), so we reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and the response. However, cylinders, horsepower, and acceleration do not have a statistically significant effect on mpg at the 5% level.
Displacement, weight, year, and origin have a statistically significant relationship to the response.
The coefficient for year is 0.750773, which means that mpg increases by about 0.75 units on average for every one-unit increase in year, all else held constant.
lmauto <- lm(mpg~. -name, data = Auto)
summary(lmauto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The curve in the residuals vs. fitted plot suggests a non-linear relationship between the predictors and the response. From the Q-Q plot, the residuals appear to be approximately normally distributed. Lastly, the residuals vs. leverage plot (with its Cook's distance contours) shows that observation 14 has high leverage.
par(mfrow = c(2,2))
plot(lmauto)
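The leverage reading from the plot can also be checked numerically. The sketch below refits the same model and lists the largest leverage and Cook's distance values; the `3 * p / n` cutoff is a common rule of thumb assumed here for illustration, not something taken from the exercise.

```r
library(ISLR)

# Refit the model from above so this block is self-contained
fit <- lm(mpg ~ . - name, data = Auto)

lev  <- hatvalues(fit)        # leverage statistic h_i for each observation
cook <- cooks.distance(fit)   # Cook's distance for each observation

p <- length(coef(fit))        # number of estimated coefficients
n <- nrow(Auto)               # number of observations

# Observations whose leverage exceeds an assumed rule-of-thumb cutoff
high_lev <- lev[lev > 3 * p / n]

# Names are the row labels of Auto, the same labels plot() prints
print(head(sort(lev, decreasing = TRUE), 3))
print(head(sort(cook, decreasing = TRUE), 3))
print(names(high_lev))
```

Leverage values always sum to the number of coefficients, so the average leverage is `p / n`; values several times larger than that are worth a closer look.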
We take the most highly correlated pairs of predictors as interaction candidates: displacement with cylinders, and displacement with weight. The interaction displacement:weight is statistically significant, while cylinders:displacement is not. A second model with more interactions between correlated predictors was then fit. In that model, the interactions displacement:weight and displacement:horsepower are statistically significant.
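One way to pick the candidate pairs without scanning the correlation matrix by eye is to rank the predictor pairs by absolute correlation. This is a sketch only; the names `cors` and `pairs_df` are introduced for this example.

```r
library(ISLR)

# Correlations among the predictors only (drop the response and the name column)
cors <- cor(subset(Auto, select = -c(mpg, name)))

# Keep each pair once by blanking the lower triangle and the diagonal
cors[lower.tri(cors, diag = TRUE)] <- NA

# Reshape to long format: one row per predictor pair
pairs_df <- as.data.frame(as.table(cors))
names(pairs_df) <- c("var1", "var2", "correlation")
pairs_df <- pairs_df[!is.na(pairs_df$correlation), ]

# Sort by strength of association, strongest first
pairs_df <- pairs_df[order(-abs(pairs_df$correlation)), ]
print(head(pairs_df, 4))
```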
lmauto2 = lm(mpg~. -name + cylinders*displacement + displacement*weight, data = Auto)
summary(lmauto2)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement + displacement *
## weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0609 -1.7589 -0.0494 1.5790 12.1496
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.795e+00 4.515e+00 -1.062 0.28883
## cylinders -1.091e-01 5.965e-01 -0.183 0.85502
## displacement -7.186e-02 1.363e-02 -5.273 2.25e-07 ***
## horsepower -3.457e-02 1.304e-02 -2.651 0.00836 **
## weight -1.030e-02 1.064e-03 -9.680 < 2e-16 ***
## acceleration 6.618e-02 8.817e-02 0.751 0.45334
## year 7.840e-01 4.566e-02 17.171 < 2e-16 ***
## origin 5.475e-01 2.643e-01 2.071 0.03901 *
## cylinders:displacement 1.186e-03 2.715e-03 0.437 0.66251
## displacement:weight 2.141e-05 3.712e-06 5.768 1.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 382 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8555
## F-statistic: 258.2 on 9 and 382 DF, p-value: < 2.2e-16
lmauto3<- lm(mpg~. -name + cylinders*displacement + displacement*weight + horsepower*displacement + acceleration:horsepower + origin*displacement, data = Auto)
summary(lmauto3)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders * displacement + displacement *
## weight + horsepower * displacement + acceleration:horsepower +
## origin * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3250 -1.5778 -0.0658 1.4758 12.4039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.047e+00 6.922e+00 -0.874 0.38291
## cylinders 7.586e-01 6.331e-01 1.198 0.23159
## displacement -9.031e-02 1.916e-02 -4.712 3.44e-06 ***
## horsepower -7.047e-02 5.853e-02 -1.204 0.22941
## weight -7.088e-03 1.452e-03 -4.882 1.55e-06 ***
## acceleration 2.107e-01 2.316e-01 0.910 0.36354
## year 7.593e-01 4.544e-02 16.710 < 2e-16 ***
## origin -5.884e-01 9.558e-01 -0.616 0.53856
## cylinders:displacement -9.817e-04 2.827e-03 -0.347 0.72862
## displacement:weight 1.427e-05 4.753e-06 3.002 0.00286 **
## displacement:horsepower 2.524e-04 1.074e-04 2.350 0.01930 *
## horsepower:acceleration -3.540e-03 2.342e-03 -1.512 0.13148
## displacement:origin 9.991e-03 8.248e-03 1.211 0.22652
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.886 on 379 degrees of freedom
## Multiple R-squared: 0.8675, Adjusted R-squared: 0.8633
## F-statistic: 206.7 on 12 and 379 DF, p-value: < 2.2e-16
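The individual t-tests above judge each interaction separately. As a complementary check, a partial F-test can compare the model without interactions to the first interaction model, since the two are nested. The sketch below refits both under new names (`base_fit`, `inter_fit`) so that it runs on its own.

```r
library(ISLR)

# Model without interactions
base_fit <- lm(mpg ~ . - name, data = Auto)

# Same model plus the two interaction terms used in lmauto2
inter_fit <- lm(mpg ~ . - name + cylinders:displacement + displacement:weight,
                data = Auto)

# Partial F-test: do the two interaction terms jointly improve the fit?
cmp <- anova(base_fit, inter_fit)
print(cmp)
```

A small p-value in the second row would indicate that the interaction terms, taken together, explain variation that the main effects alone do not.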
In model lmauto4, the transformed terms log(weight), sqrt(horsepower), and I(acceleration^2) are statistically significant, while I(cylinders^2) and I(displacement^2) are not. The adjusted R-squared also improves on the original model, from 0.8182 (about 82%) to 0.8643 (about 86%).
Next, based on the previous parts, the plots show a non-linear pattern that resembles a log curve, so we use log(mpg) as the response (model lmauto5). All variables except acceleration are now significant. The R-squared is 0.8795 (about 88%), slightly higher than the 0.8685 of lmauto4, although R-squared values computed for different responses (log(mpg) versus mpg) are not directly comparable.
lmauto4 = lm(mpg ~ . - name + log(weight) + sqrt(horsepower) + I(cylinders^2) + I(acceleration^2) + I(displacement^2), data = Auto)
summary(lmauto4)
##
## Call:
## lm(formula = mpg ~ . - name + log(weight) + sqrt(horsepower) +
## I(cylinders^2) + I(acceleration^2) + I(displacement^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3593 -1.5249 -0.0286 1.4450 12.3350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.870e+02 4.855e+01 3.852 0.000138 ***
## cylinders 8.106e-01 1.445e+00 0.561 0.575233
## displacement -2.825e-02 2.177e-02 -1.298 0.195182
## horsepower 1.431e-01 7.112e-02 2.012 0.044978 *
## weight 4.140e-03 2.286e-03 1.811 0.070964 .
## acceleration -1.881e+00 5.784e-01 -3.252 0.001250 **
## year 7.784e-01 4.503e-02 17.288 < 2e-16 ***
## origin 5.487e-01 2.653e-01 2.068 0.039279 *
## log(weight) -2.385e+01 7.306e+00 -3.265 0.001195 **
## sqrt(horsepower) -4.323e+00 1.584e+00 -2.730 0.006635 **
## I(cylinders^2) -6.156e-02 1.166e-01 -0.528 0.597707
## I(acceleration^2) 5.074e-02 1.724e-02 2.943 0.003445 **
## I(displacement^2) 4.507e-05 3.815e-05 1.181 0.238231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.875 on 379 degrees of freedom
## Multiple R-squared: 0.8685, Adjusted R-squared: 0.8643
## F-statistic: 208.5 on 12 and 379 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lmauto4)
lmauto5 <- lm(log(mpg)~.-name, data=Auto)
summary(lmauto5)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40955 -0.06533 0.00079 0.06785 0.33925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.751e+00 1.662e-01 10.533 < 2e-16 ***
## cylinders -2.795e-02 1.157e-02 -2.415 0.01619 *
## displacement 6.362e-04 2.690e-04 2.365 0.01852 *
## horsepower -1.475e-03 4.935e-04 -2.989 0.00298 **
## weight -2.551e-04 2.334e-05 -10.931 < 2e-16 ***
## acceleration -1.348e-03 3.538e-03 -0.381 0.70339
## year 2.958e-02 1.824e-03 16.211 < 2e-16 ***
## origin 4.071e-02 9.955e-03 4.089 5.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 384 degrees of freedom
## Multiple R-squared: 0.8795, Adjusted R-squared: 0.8773
## F-statistic: 400.4 on 7 and 384 DF, p-value: < 2.2e-16
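Because lmauto5 models log(mpg), its R-squared is measured on the log scale. One hedged way to put it on the same footing as the earlier models is to back-transform the fitted values and compute an R-squared on the original mpg scale. This is a rough sketch: `exp(fitted(...))` is a simple back-transformation that ignores retransformation bias, and the names `fit_log` and `r2_mpg_scale` are introduced here.

```r
library(ISLR)

# Refit the log-response model so this block is self-contained
fit_log <- lm(log(mpg) ~ . - name, data = Auto)

# Back-transform fitted values from log(mpg) to mpg
pred_mpg <- exp(fitted(fit_log))

# R-squared computed on the original mpg scale
rss <- sum((Auto$mpg - pred_mpg)^2)
tss <- sum((Auto$mpg - mean(Auto$mpg))^2)
r2_mpg_scale <- 1 - rss / tss

print(r2_mpg_scale)
print(summary(fit_log)$r.squared)  # R-squared on the log scale, for reference
```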
Q: This question should be answered using the Carseats data set. (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
data(Carseats)
lmcars <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lmcars)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
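Sales is recorded in thousands of units, so the coefficients are interpreted as follows.
Price: when the price increases by $1, sales decrease by about 54.5 units (0.054459 thousand) on average, all else held constant.
Urban: a store in an urban location sells about 21.9 fewer units on average than a store in a non-urban location, all else held constant. This difference is not statistically significant (p = 0.936).
US: a store in the US sells about 1,201 more car seats on average than a store outside the US, all else held constant.
Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes + error, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).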
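To make the role of the dummy variables concrete, the sketch below predicts sales for one hypothetical store (Price = 100, urban, in the US; these values are made up for illustration) and reproduces the prediction by plugging the same values into the fitted equation.

```r
library(ISLR)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)

# How R codes the qualitative predictor: "Yes" becomes 1, "No" becomes 0
print(contrasts(Carseats$Urban))

# A hypothetical store, made up for illustration
new_store <- data.frame(
  Price = 100,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)

# Prediction from the fitted model (in thousands of units)
pred <- predict(fit, newdata = new_store)

# Same prediction built by hand from the equation, with both dummies set to 1
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["Price"]] * 100 +
  b[["UrbanYes"]] * 1 + b[["USYes"]] * 1

print(c(predict = unname(pred), manual = manual))
```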
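The p-values for Price and USYes are less than 0.05, so we reject the null hypothesis H0: βj = 0 for these two predictors, which means they are statistically significant. The p-value for UrbanYes (0.936) is greater than 0.05, so we fail to reject the null hypothesis for Urban.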
lmcars2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lmcars2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have a similar fit: model (a) has an adjusted R-squared of 23.35% and a residual standard error of 2.472, while model (e) has an adjusted R-squared of 23.54% and a residual standard error of 2.469. The model in (e) is slightly better, but not by much.
confint(lmcars2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
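The intervals returned by `confint()` are the usual estimate ± t-quantile × standard error. As a sketch, they can be rebuilt by hand from the coefficient table (the names `est`, `se`, `t_crit`, and `manual_ci` are introduced here for illustration):

```r
library(ISLR)

fit <- lm(Sales ~ Price + US, data = Carseats)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(fit))
est <- coef_table[, "Estimate"]
se  <- coef_table[, "Std. Error"]

# Critical value of the t distribution with the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

# 95% confidence intervals: estimate +/- t_crit * standard error
manual_ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(manual_ci)
print(confint(fit))
```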
In the residuals vs. fitted plot, observations 51, 69, and 377 show up as potential outliers. In the residuals vs. leverage plot (with Cook's distance contours), we can see some high-leverage observations (26, 50, 368).
par(mfrow = c(2, 2))
plot(lmcars2)
Q: This problem involves simple linear regression without an intercept. (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
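The two coefficient estimates share the same numerator, ∑xi·yi, and differ only in the denominator (∑xi^2 for Y onto X, ∑yi^2 for X onto Y). They are therefore the same if the sum of squares of the observed y values equals the sum of squares of the observed x values: ∑xi^2 = ∑yi^2.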
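Written out, the condition follows directly from the no-intercept formula. The subscripts Y∼X and X∼Y are notation introduced here to distinguish the two regressions.

```latex
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\quad\Longrightarrow\quad
\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
\quad \text{(assuming } \textstyle\sum_{i} x_i y_i \neq 0 \text{)}.
```

If the cross-product sum is zero, both estimates are zero and coincide regardless of the sums of squares.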
set.seed(1)
x = rnorm(100)
y = 2*x + rnorm(100)
lmfit = lm(y~x+0)
lmfit2 = lm(x~y+0)
summary(lmfit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9154 -0.6472 -0.1771 0.5056 2.3109
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.9939 0.1065 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9586 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
summary(lmfit2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8699 -0.2368 0.1030 0.2858 0.8938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.39111 0.02089 18.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4246 on 99 degrees of freedom
## Multiple R-squared: 0.7798, Adjusted R-squared: 0.7776
## F-statistic: 350.7 on 1 and 99 DF, p-value: < 2.2e-16
set.seed(1)
x = rnorm(100)
y = 1*x
lmfit3 = lm(y~x+0)
lmfit4 = lm(x~y+0)
summary(lmfit3)
## Warning in summary.lm(lmfit3): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
summary(lmfit4)
## Warning in summary.lm(lmfit4): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.888e-16 -1.689e-17 1.339e-18 3.057e-17 2.552e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 6.479e-18 1.543e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833e-17 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.382e+34 on 1 and 99 DF, p-value: < 2.2e-16
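The example above uses y = x, which gives a perfect fit. A less degenerate illustration of the same condition, sketched here as an alternative, takes y to be a random permutation of x: the two vectors then have the same sum of squares without being identical, so the two coefficient estimates agree even though the fit is far from perfect. The names `fit_yx` and `fit_xy` are introduced for this sketch.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # permutation of x: same values, different order

# Regressions through the origin in both directions
fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)

# The sums of squares match (up to floating-point rounding) ...
print(c(sum_x2 = sum(x^2), sum_y2 = sum(y^2)))

# ... so the two coefficient estimates match as well
print(c(y_on_x = unname(coef(fit_yx)), x_on_y = unname(coef(fit_xy))))
```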