Carefully explain the differences between the KNN classifier and KNN regression methods:
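Both methods start by finding the K training observations nearest to a query point x0. The KNN classifier is used when the response is qualitative: it estimates the probability of each class as the fraction of the K neighbors that belong to that class, and assigns x0 to the most common class. KNN regression is used when the response is continuous/quantitative: it predicts the average of the K neighbors' response values.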
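The contrast can be sketched in a few lines of base R. This is only an illustration on simulated data, not part of the exercise: the data and the helper names `knn_neighbors`, `knn_classify`, and `knn_regress` are hypothetical. The neighbor search is shared, and only the final step (vote versus average) differs.

```r
# Minimal sketch: KNN classification vs. KNN regression for one query point.
# Simulated data and helper names are assumptions made for illustration.
set.seed(1)
x_train <- matrix(rnorm(40), ncol = 2)                  # 20 points, 2 predictors
y_class <- factor(ifelse(x_train[, 1] > 0, "A", "B"))   # qualitative response
y_num   <- 3 * x_train[, 1] + rnorm(20, sd = 0.1)       # quantitative response

# Indices of the k training points closest (Euclidean distance) to x0
knn_neighbors <- function(x_train, x0, k) {
  d <- sqrt(rowSums(sweep(x_train, 2, x0)^2))
  order(d)[seq_len(k)]
}

# Classification: majority vote among the k neighbors
knn_classify <- function(x_train, y, x0, k) {
  votes <- table(y[knn_neighbors(x_train, x0, k)])
  names(votes)[which.max(votes)]
}

# Regression: average response of the k neighbors
knn_regress <- function(x_train, y, x0, k) {
  mean(y[knn_neighbors(x_train, x0, k)])
}

x0 <- c(0.5, 0)
pred_class <- knn_classify(x_train, y_class, x0, k = 3)  # a class label
pred_value <- knn_regress(x_train, y_num, x0, k = 3)     # a number
```

With k equal to the number of training points, `knn_regress` simply returns the overall mean of the response, which shows how a large K smooths the prediction.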
a. scatterplot matrix
library(ISLR2)
Auto <- read.csv("Auto.csv", na.strings = "?", stringsAsFactors = TRUE)  # "?" marks missing values
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
pairs(Auto[,-9])
b. matrix of correlation
cor(Auto[, -9], use = "complete.obs")  # drop name; skip rows with missing horsepower
c. multiple linear regression
lm.fit <- lm(mpg~. -name, data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
There is a relationship between the predictors and the response: the overall F-statistic (252.4 on 7 and 384 DF) has a p-value below 2.2e-16. Displacement, weight, year, and origin all have statistically significant coefficients at the 5% level; cylinders, horsepower, and acceleration do not. Holding the other predictors fixed, each one-year increase in model year is associated with an estimated 0.75 mpg increase, signifying that newer autos are more fuel efficient.
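The significant predictors and the overall F-test can also be pulled out of a fitted model programmatically instead of being read off the printed table. The sketch below uses simulated data (the data frame `sim` and its columns are hypothetical stand-ins, not the Auto data) so that it runs on its own.

```r
# Sketch on simulated data: extract per-coefficient p-values and the overall F-test.
set.seed(42)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)        # x2 has no true effect on y

fit <- lm(y ~ ., data = sim)
coefs <- coef(summary(fit))               # columns: Estimate, Std. Error, t value, Pr(>|t|)

p_values <- coefs[-1, "Pr(>|t|)"]         # drop the intercept row
significant <- names(p_values)[p_values < 0.05]

# p-value of the overall F-statistic (the last line of summary())
f <- summary(fit)$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
```

The same calls applied to `lm.fit` would return the coefficient table shown above as a numeric matrix.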
d. Diagnostic plots
par(mfrow = c(2,2))
plot(lm.fit)
Comment on any problems you see with the fit.
There is a slight curved pattern in the Residuals vs Fitted plot; ideally no pattern would be discernible here. This indicates non-linearity in the relationship between mpg and the predictors.
Do the residual plots suggest any unusually large outliers?
Some outliers are present, but none appears unusually large.
Does the leverage plot identify any observations with unusually high leverage?
Yes, observation “14” is a high leverage point; it sits well apart from the other observations in the leverage plot.
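The visual checks above can be backed by numbers. A minimal sketch, using simulated data with one planted high-leverage point and one planted outlier: the data are hypothetical, and the cutoffs (studentized residuals above 3 in absolute value, leverage above three times the average (p + 1)/n) are rules of thumb rather than formal tests.

```r
# Sketch on simulated data: numeric outlier and leverage checks for an lm fit.
set.seed(1)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

sim$x[1] <- 8                      # planted high-leverage point ...
sim$y[1] <- 1 + 2 * 8              # ... that lies on the true line
sim$y[2] <- sim$y[2] + 10          # planted outlier in the response

fit <- lm(y ~ x, data = sim)

stud <- rstudent(fit)              # studentized residuals
lev  <- hatvalues(fit)             # leverage statistics

p <- length(coef(fit)) - 1         # number of predictors
avg_lev <- (p + 1) / n             # average leverage

outliers      <- which(abs(stud) > 3)
high_leverage <- which(lev > 3 * avg_lev)
```

Applied to `lm.fit`, `which.max(hatvalues(lm.fit))` would report the observation with the largest leverage, which can be compared with the point labeled in the plot.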
e. Use the * and : symbols to fit linear regression models.
summary(lm(mpg ~ weight * displacement, data = Auto))
##
## Call:
## lm(formula = mpg ~ weight * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.9209 -2.4627 -0.3147 1.8435 17.9674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.396e+01 1.930e+00 27.954 < 2e-16 ***
## weight -9.002e-03 8.446e-04 -10.658 < 2e-16 ***
## displacement -7.936e-02 1.131e-02 -7.017 1.00e-11 ***
## weight:displacement 1.773e-05 2.784e-06 6.368 5.35e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.11 on 393 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7242
## F-statistic: 347.5 on 3 and 393 DF, p-value: < 2.2e-16
summary(lm(mpg ~ weight * year, data = Auto))
##
## Call:
## lm(formula = mpg ~ weight * year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0341 -1.9851 -0.0912 1.6987 12.9292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.124e+02 1.280e+01 -8.781 < 2e-16 ***
## weight 2.821e-02 4.376e-03 6.447 3.34e-10 ***
## year 2.068e+00 1.699e-01 12.171 < 2e-16 ***
## weight:year -4.672e-04 5.857e-05 -7.977 1.66e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.187 on 393 degrees of freedom
## Multiple R-squared: 0.8354, Adjusted R-squared: 0.8341
## F-statistic: 664.9 on 3 and 393 DF, p-value: < 2.2e-16
summary(lm(mpg ~ year * displacement, data = Auto))
##
## Call:
## lm(formula = mpg ~ year * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9067 -2.4318 -0.2423 2.0392 17.0413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.341e+01 8.298e+00 -8.846 < 2e-16 ***
## year 1.415e+00 1.092e-01 12.958 < 2e-16 ***
## displacement 2.559e-01 4.048e-02 6.321 7.06e-10 ***
## year:displacement -4.130e-03 5.438e-04 -7.594 2.28e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.737 on 393 degrees of freedom
## Multiple R-squared: 0.7738, Adjusted R-squared: 0.772
## F-statistic: 448 on 3 and 393 DF, p-value: < 2.2e-16
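All three interaction terms (weight:displacement, weight:year, and year:displacement) are statistically significant, each with a p-value below 1e-9. A significant interaction means that the effect of one predictor on mpg depends on the value of the other, so a variable that looks less important on its own can still affect mpg in combination with another.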
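As a brief illustration of the two operators: `a * b` is shorthand for `a + b + a:b`, and an interaction makes the slope of one predictor a function of the other. The sketch below checks the first point on simulated data (the data frame `sim` is hypothetical) and the second using the weight:year coefficients reported above; the model years 70 and 82 are chosen only for illustration.

```r
# Sketch: `*` expands to main effects plus the `:` interaction term.
set.seed(1)
n <- 200
sim <- data.frame(a = rnorm(n), b = rnorm(n))
sim$y <- 1 + 2 * sim$a - sim$b + 0.5 * sim$a * sim$b + rnorm(n)

fit_star  <- lm(y ~ a * b, data = sim)
fit_colon <- lm(y ~ a + b + a:b, data = sim)
same_fit  <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))

# With an interaction, the slope of one predictor depends on the other.
# Coefficients copied from the mpg ~ weight * year output above:
b_weight      <- 2.821e-02
b_weight_year <- -4.672e-04
weight_slope  <- b_weight + b_weight_year * c(70, 82)  # mpg per pound of weight
```

Under that fit, the estimated weight slope is about -0.0045 mpg per pound at year 70 and about -0.0101 at year 82, which is what the negative weight:year coefficient expresses.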
f. Different transformation of variables
fit.log <- lm(mpg ~ log(weight) + displacement + year + origin, data = Auto)
summary(fit.log)
##
## Call:
## lm(formula = mpg ~ log(weight) + displacement + year + origin,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.7232 -1.8862 0.0032 1.5833 12.9262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 132.626241 10.979843 12.079 < 2e-16 ***
## log(weight) -21.822624 1.440808 -15.146 < 2e-16 ***
## displacement 0.008248 0.004029 2.047 0.041287 *
## year 0.810625 0.046100 17.584 < 2e-16 ***
## origin 0.842756 0.248393 3.393 0.000762 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.109 on 392 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8422
## F-statistic: 529.5 on 4 and 392 DF, p-value: < 2.2e-16
fit.sqrt <- lm(mpg ~ sqrt(weight) + displacement + year, data = Auto)
summary(fit.sqrt)
##
## Call:
## lm(formula = mpg ~ sqrt(weight) + displacement + year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9005 -2.0727 -0.1043 1.9165 13.9777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.573974 4.221812 1.557 0.120
## sqrt(weight) -0.814270 0.058364 -13.952 <2e-16 ***
## displacement 0.005418 0.004398 1.232 0.219
## year 0.787320 0.048391 16.270 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.272 on 393 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8252
## F-statistic: 624 on 3 and 393 DF, p-value: < 2.2e-16
lm.fit2 <- lm(mpg ~ weight + I(weight^2) + year + origin, data = Auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ weight + I(weight^2) + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9160 -1.5656 -0.1268 1.6294 12.7188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.901e-01 4.007e+00 -0.122 0.9027
## weight -2.063e-02 1.521e-03 -13.564 <2e-16 ***
## I(weight^2) 2.244e-06 2.311e-07 9.712 <2e-16 ***
## year 8.299e-01 4.370e-02 18.989 <2e-16 ***
## origin 5.018e-01 2.413e-01 2.079 0.0382 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.013 on 392 degrees of freedom
## Multiple R-squared: 0.8533, Adjusted R-squared: 0.8518
## F-statistic: 570.1 on 4 and 392 DF, p-value: < 2.2e-16
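In the output above, adjusted R-squared is 0.8422 for the log(weight) model, 0.8252 for the sqrt(weight) model, and 0.8518 for the quadratic model. These fits also differ in which other predictors they include, so the comparison is not purely about the transformation. A cleaner comparison holds everything else fixed and varies only the transformation. A minimal sketch on simulated data (the data frame `sim` and its columns `w` and `y` are hypothetical):

```r
# Sketch on simulated data: compare transformations of a single predictor
# by adjusted R-squared, keeping the rest of the model identical.
set.seed(1)
n <- 300
sim <- data.frame(w = runif(n, 1500, 5000))
sim$y <- 200 - 22 * log(sim$w) + rnorm(n, sd = 2)

fits <- list(
  linear    = lm(y ~ w, data = sim),
  log       = lm(y ~ log(w), data = sim),
  sqrt      = lm(y ~ sqrt(w), data = sim),
  quadratic = lm(y ~ w + I(w^2), data = sim)   # I() protects ^ inside a formula
)

adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
ranked <- sort(adj_r2, decreasing = TRUE)
```

Residual plots for each candidate are worth checking as well, since a higher adjusted R-squared alone does not show whether the curvature has been removed.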
a. fit multiple regression models to predict Sales using Price, Urban, and US
# Carseats comes from ISLR2 (loaded earlier); every model call below passes
# data = Carseats, so attach() is not needed
fit.lm <- lm(Sales ~ Price+Urban+US, data = Carseats)
summary(fit.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
b. Interpret each coefficient
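Holding the other predictors fixed, every $1 increase in price is associated with a decrease in sales of about 0.054 thousand units (54 units). Stores located in urban areas sell about 0.022 thousand (22) fewer units than stores in non-urban areas, but this difference is not statistically significant (p = 0.936). Stores located in the US sell about 1.2 thousand units more than stores located outside of the US.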
c. model equation form
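\[Sales = 13.043 - 0.054 X_1 - 0.022 X_2 + 1.201 X_3\]
where \(X_1\) is Price, \(X_2 = 1\) if the store is in an urban location (0 otherwise), and \(X_3 = 1\) if the store is in the US (0 otherwise).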
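To make the dummy coding concrete, the equation can be evaluated by hand using the coefficients from the summary above. The helper `predict_sales` and the example price of 100 are hypothetical, chosen only for illustration.

```r
# Sketch: evaluate the fitted equation by hand (coefficients from summary(fit.lm)).
b <- c(intercept = 13.043469, price = -0.054459, urban = -0.021916, us = 1.200573)

# urban and us are "Yes"/"No"; the comparisons turn them into 0/1 dummies
predict_sales <- function(price, urban, us) {
  b[["intercept"]] +
    b[["price"]] * price +
    b[["urban"]] * (urban == "Yes") +
    b[["us"]]    * (us == "Yes")
}

urban_us     <- predict_sales(100, urban = "Yes", us = "Yes")
urban_non_us <- predict_sales(100, urban = "Yes", us = "No")
us_effect    <- urban_us - urban_non_us   # equals the USYes coefficient
```

For an urban US store with a price of 100, this gives about 8.776 thousand units, which should agree with `predict(fit.lm, newdata = ...)` up to rounding of the coefficients.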
d. For which predictors can you reject the null Hypothesis?
Reject the null hypothesis for Price and US (both p-values are below 0.001). We fail to reject it for Urban (p = 0.936).
e. Smaller model
fit.lm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
f. How well do the two models fit the data
summary(fit.lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
summary(fit.lm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
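Neither model fits the data especially well: both have R^2 = 0.2393, so each explains only about 24% of the variance in Sales. Removing Urban barely shifts the residual standard error (2.472 vs. 2.469) or the adjusted R^2 (0.2335 vs. 0.2354), so Urban is not a significant variable and the smaller model is slightly preferable.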
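A more formal way to compare two nested models is a partial F-test with `anova()`. The sketch below uses simulated data shaped loosely like Carseats (the data frame `sim` and its columns are hypothetical) so that it runs on its own. When a single coefficient is dropped, the F-test p-value matches the t-test p-value of that coefficient in the larger model.

```r
# Sketch on simulated data: partial F-test for dropping one predictor.
set.seed(1)
n <- 400
sim <- data.frame(
  price = runif(n, 50, 150),
  urban = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  us    = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$sales <- 13 - 0.05 * sim$price + 1.2 * (sim$us == "Yes") + rnorm(n, sd = 2.5)

full    <- lm(sales ~ price + urban + us, data = sim)
reduced <- lm(sales ~ price + us, data = sim)

cmp <- anova(reduced, full)              # H0: the urban coefficient is zero
p_f <- cmp[["Pr(>F)"]][2]
p_t <- coef(summary(full))["urbanYes", "Pr(>|t|)"]
```

On the real data, `anova(fit.lm2, fit.lm)` would therefore be expected to reproduce the p-value of 0.936 reported for UrbanYes.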
g. 95% CI for the coefficients of model e (smaller model)
confint(fit.lm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
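These intervals are the estimate plus or minus a t quantile times the standard error. As a check, the sketch below rebuilds them from the rounded estimates and standard errors printed in the summary of the smaller model, with 397 residual degrees of freedom; small differences in the last digits come from that rounding.

```r
# Sketch: rebuild the 95% confidence intervals by hand from the summary output.
est <- c(Price = -0.05448, USYes = 1.19964)   # estimates from summary(fit.lm2)
se  <- c(Price = 0.00523,  USYes = 0.25846)   # their standard errors
df  <- 397                                    # residual degrees of freedom

t_crit <- qt(0.975, df)                       # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
```

Neither interval contains 0, which is consistent with rejecting the null hypothesis for Price and US.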
h. Is there evidence of outliers or high leverage obs in model e?
par(mfrow=c(2,2))
plot(fit.lm2)
There is no evidence of either significant outliers or high leverage observations.
a. Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
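Without an intercept, the estimate for the regression of Y onto X is \(\sum X_iY_i / \sum X_i^2\), and the estimate for the regression of X onto Y is \(\sum X_iY_i / \sum Y_i^2\). The numerators are identical, so the coefficient estimates are equal when the sum of the squared X-values equals the sum of the squared Y-values, or when both estimates are equal to 0: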
\[\sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} Y_i^2\]
OR
\[\sum_{i=1}^{n} X_iY_i = 0\]
b. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
n <- 100
x <- rnorm(n, mean = 0, sd = 1)
y <- 2*x + rnorm(n, mean = 0, sd = 1)
fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)
coef(fit1)
## x
## 1.993876
coef(fit2)
## y
## 0.3911145
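The two estimates in part (b) can be checked against the closed-form expression for regression without an intercept, where the slope is the sum of cross-products divided by the sum of squares of the predictor. A short sketch that regenerates data with the same calls as above and compares both routes:

```r
# Sketch: no-intercept slopes from lm() versus the closed-form formula.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)

beta_y_on_x <- sum(x * y) / sum(x^2)   # regression of y onto x
beta_x_on_y <- sum(x * y) / sum(y^2)   # regression of x onto y

fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)

match_1 <- isTRUE(all.equal(unname(coef(fit1)), beta_y_on_x))
match_2 <- isTRUE(all.equal(unname(coef(fit2)), beta_x_on_y))
```

Because the numerators are identical, the estimates differ here only because `sum(x^2)` and `sum(y^2)` differ.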
c. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x
fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)
coef(fit1)
## x
## 1
coef(fit2)
## y
## 1
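Setting y equal to x is the simplest case. A slightly less trivial sketch, offered as one possible alternative, takes y to be a random permutation of x: the two vectors are no longer identical, but their sums of squares match, so the two estimates still agree.

```r
# Sketch: equal sums of squares without y being identical to x.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- sample(x)                       # same values in a different order

equal_ss <- isTRUE(all.equal(sum(x^2), sum(y^2)))

fit1 <- lm(y ~ x + 0)
fit2 <- lm(x ~ y + 0)
same_coef <- isTRUE(all.equal(unname(coef(fit1)), unname(coef(fit2))))
```

In this version the common estimate is generally not 1, which separates the sum-of-squares condition from the special case y = x.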