The KNN classifier and KNN regression methods are very similar, with one key difference. The classifier predicts a categorical variable: given a point to predict, it finds a specified number of nearest neighbors and assigns the category that appears most often among them. The regression method follows the same strategy, but instead of taking a majority vote over the nearest neighbors, it averages their numeric responses.
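As a rough sketch of the difference (a toy example made up for illustration, assuming the class package is available for knn()):
library(class) # provides knn() for classification
# Toy training data: two well-separated groups
train <- cbind(x1 = c(1, 2, 3, 6, 7, 8), x2 = c(1, 2, 1, 6, 7, 6))
cls <- factor(c("A", "A", "A", "B", "B", "B")) # categories for the classifier
y <- c(1.0, 1.2, 0.9, 5.8, 6.1, 5.9) # numeric responses for regression
test <- c(2, 2) # the point to predict
# Classification: majority vote among the k = 3 nearest neighbors
knn(train, rbind(test), cl = cls, k = 3) # -> "A"
# Regression "by hand": average the responses of those same 3 neighbors
d <- sqrt((train[, 1] - test[1])^2 + (train[, 2] - test[2])^2)
mean(y[order(d)[1:3]]) # -> (1.0 + 1.2 + 0.9) / 3 = 1.033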
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
Auto <- read.table("Auto.data", header = T, na.strings = "?", stringsAsFactors = T)
pairs(Auto)
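Since name is qualitative with several hundred distinct values, its panels are not very informative; a variant (reusing the same dplyr select() as in part (b)) that may be easier to read:
pairs(select(Auto, -name))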
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(select(Auto, -name), use = "complete.obs")
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
model1 = lm(mpg ~ . - name, data = Auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i. Is there a relationship between the predictors and the response?
There is a relationship between the predictors and the response, as can be seen from the overall F-statistic of 252.4 with a p-value of < 2.2e-16.
ii. Which predictors appear to have a statistically significant relationship to the response?
displacement, weight, year,
and origin all seem to have a statistically significant
relationship to the response variable.
iii. What does the coefficient for the year variable suggest?
The coefficient 0.750773 suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(model1)
The residual plots do not show any outstanding outliers; however, the leverage plot does show one observation that stands out from the rest.
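To identify which observation that is, one quick check (a sketch using base R's hatvalues(); the threshold is the usual rule of thumb) is:
h <- hatvalues(model1)
which.max(h) # index of the observation with the highest leverage
sum(h > 2 * mean(h)) # how many points exceed twice the average leverage (p + 1)/n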
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
model2 = lm(mpg ~ horsepower * cylinders * weight, data = Auto)
summary(model2)
##
## Call:
## lm(formula = mpg ~ horsepower * cylinders * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0975 -2.1526 -0.3461 1.8172 15.3008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.950e+01 1.776e+01 3.912 0.000108 ***
## horsepower -2.828e-01 1.939e-01 -1.458 0.145690
## cylinders -3.750e+00 3.179e+00 -1.180 0.238913
## weight -6.278e-03 6.722e-03 -0.934 0.350920
## horsepower:cylinders 2.551e-02 2.914e-02 0.876 0.381844
## horsepower:weight 7.192e-06 7.371e-05 0.098 0.922324
## cylinders:weight 1.410e-04 1.023e-03 0.138 0.890474
## horsepower:cylinders:weight 8.170e-07 9.999e-06 0.082 0.934922
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.901 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.7547, Adjusted R-squared: 0.7502
## F-statistic: 168.7 on 7 and 384 DF, p-value: < 2.2e-16
model3 = lm(mpg ~ horsepower * cylinders : weight, data = Auto)
summary(model3)
##
## Call:
## lm(formula = mpg ~ horsepower * cylinders:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4898 -2.3941 -0.4798 2.0983 14.8335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.199e+01 1.634e+00 31.82 <2e-16 ***
## horsepower -2.258e-01 2.012e-02 -11.22 <2e-16 ***
## cylinders:weight -1.069e-03 7.784e-05 -13.74 <2e-16 ***
## horsepower:cylinders:weight 6.341e-06 6.335e-07 10.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.009 on 388 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.7382, Adjusted R-squared: 0.7361
## F-statistic: 364.6 on 3 and 388 DF, p-value: < 2.2e-16
Looking at the two models above, an interaction can be significant or not depending on what else is in the model. The formula for the first model was
lm(mpg ~ horsepower * cylinders * weight, data = Auto) and
the formula for the second model was
lm(mpg ~ horsepower * cylinders : weight, data = Auto).
They are almost identical with the exception of the symbol between
cylinders and weight. In the first model, all three predictors are
considered individually along with all of their interactions, and
nothing is significant; these terms are highly correlated with one
another (see the correlation matrix in (b)), which inflates the
standard errors. In the second, much simpler model, horsepower is the
only predictor included individually and only two interaction terms
are considered, and there every coefficient is significant.
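Since the second model's terms are a subset of the first's, a nested-model F-test is one way to check whether the extra interaction terms add explanatory power (a quick sketch):
anova(model3, model2) # does the full interaction model improve on the simpler one?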
(f) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
model4 = lm(mpg ~ log(horsepower) * log(weight) * log(displacement), data = Auto)
summary(model4)
##
## Call:
## lm(formula = mpg ~ log(horsepower) * log(weight) * log(displacement),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3760 -2.0492 -0.3067 1.8102 16.2392
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2513.045 1007.463 -2.494
## log(horsepower) 554.944 225.352 2.463
## log(weight) 359.082 127.785 2.810
## log(displacement) 445.825 195.671 2.278
## log(horsepower):log(weight) -78.126 28.579 -2.734
## log(horsepower):log(displacement) -93.910 42.413 -2.214
## log(weight):log(displacement) -62.672 24.329 -2.576
## log(horsepower):log(weight):log(displacement) 13.168 5.278 2.495
## Pr(>|t|)
## (Intercept) 0.01304 *
## log(horsepower) 0.01423 *
## log(weight) 0.00521 **
## log(displacement) 0.02325 *
## log(horsepower):log(weight) 0.00655 **
## log(horsepower):log(displacement) 0.02740 *
## log(weight):log(displacement) 0.01037 *
## log(horsepower):log(weight):log(displacement) 0.01302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.85 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7566
## F-statistic: 174.7 on 7 and 384 DF, p-value: < 2.2e-16
model5 = lm(mpg ~ sqrt(horsepower) * sqrt(weight) * sqrt(displacement), data = Auto)
summary(model5)
##
## Call:
## lm(formula = mpg ~ sqrt(horsepower) * sqrt(weight) * sqrt(displacement),
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3263 -2.1503 -0.3557 1.7908 16.5264
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 55.665055 40.422959 1.377
## sqrt(horsepower) -0.537630 4.496039 -0.120
## sqrt(weight) 0.574050 0.800020 0.718
## sqrt(displacement) -1.113110 3.317851 -0.335
## sqrt(horsepower):sqrt(weight) -0.095741 0.086768 -1.103
## sqrt(horsepower):sqrt(displacement) 0.036890 0.300129 0.123
## sqrt(weight):sqrt(displacement) -0.040245 0.055324 -0.727
## sqrt(horsepower):sqrt(weight):sqrt(displacement) 0.004393 0.005016 0.876
## Pr(>|t|)
## (Intercept) 0.169
## sqrt(horsepower) 0.905
## sqrt(weight) 0.473
## sqrt(displacement) 0.737
## sqrt(horsepower):sqrt(weight) 0.271
## sqrt(horsepower):sqrt(displacement) 0.902
## sqrt(weight):sqrt(displacement) 0.467
## sqrt(horsepower):sqrt(weight):sqrt(displacement) 0.382
##
## Residual standard error: 3.868 on 384 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.7589, Adjusted R-squared: 0.7545
## F-statistic: 172.6 on 7 and 384 DF, p-value: < 2.2e-16
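Comparing the two transformed fits: in the log model every term is significant at the 5% level (R-squared = 0.761), while in the square-root model no individual term is significant even though the overall fit is nearly identical (R-squared = 0.7589). The log transformation appears to handle the curvature in these predictors somewhat better, though with only three variables neither transformed model matches the R-squared of 0.8215 from the full model in (c).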
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
attach(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
coef(fit)[4]
## USYes
## 1.200573
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
The coefficient for Price is -0.054459: since Sales is recorded in thousands of units, each one-dollar increase in price is associated with a decrease of about 54 units sold on average, holding Urban and US fixed.
The coefficient for UrbanYes is -0.021916 and is not statistically significant, so there is no evidence that sales differ between urban and non-urban locations.
The coefficient for USYes is 1.200573: on average, stores in the US sell about 1,201 more units than stores outside the US, holding the other predictors fixed.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly. \(\widehat{Sales} = 13.043 - 0.054 \cdot Price - 0.022 \cdot Urban + 1.201 \cdot US\), where Urban and US are indicator variables equal to 1 for Yes and 0 for No.
(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
Price and US both have p-values far below 0.05 (see part (b) for their interpretation), so we can reject \(H_0: \beta_j = 0\) for those predictors; for Urban (p = 0.936) we cannot.
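The p-values can also be read off programmatically; a small sketch using the coefficient matrix returned by summary():
summary(fit)$coefficients[, "Pr(>|t|)"] # p-value for each coefficient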
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit <- lm(Sales ~ Price + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
Neither model fits the data particularly well: the adjusted R-squared is 0.2335 for the model in (a) and 0.2354 for the model in (e), so each explains less than a quarter of the variance in Sales.
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
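Neither interval contains zero, consistent with rejecting \(H_0: \beta_j = 0\) for both Price and US in part (d).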
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(fit)
summary(influence.measures(fit))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
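The residuals do not point to extreme outliers: the largest is about 7.05 against a residual standard error of 2.469, i.e. under three standard errors. There is, however, some evidence of high-leverage observations: several points (e.g. 43, 126, 166, 175, 368) are flagged with hat values of 0.02-0.04, well above the average leverage of \((p + 1)/n = 3/400 = 0.0075\).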
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the slope for Y onto X is \(\sum_i x_i y_i / \sum_i x_i^2\), while the slope for X onto Y is \(\sum_i x_i y_i / \sum_i y_i^2\). The numerators are identical, so the two estimates are the same exactly when \(\sum_i x_i^2 = \sum_i y_i^2\), i.e. when X and Y have equal sums of squares.
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(123) # For reproducibility
n <- 100
X <- rnorm(n, mean = 50, sd = 10) # Generate X ~ Normal(50, 10)
Y <- 2 * X + rnorm(n, mean = 0, sd = 15) # Y depends on X, but with added noise
model_X_Y <- lm(X ~ Y)
summary(model_X_Y)
##
## Call:
## lm(formula = X ~ Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.768 -3.485 0.295 4.151 14.454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.90579 2.65175 7.507 2.82e-11 ***
## Y 0.30938 0.02582 11.984 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.843 on 98 degrees of freedom
## Multiple R-squared: 0.5944, Adjusted R-squared: 0.5903
## F-statistic: 143.6 on 1 and 98 DF, p-value: < 2.2e-16
model_Y_X <- lm(Y ~ X)
summary(model_Y_X)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.610 -10.253 -1.312 8.710 49.356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3933 8.2897 0.289 0.773
## X 1.9213 0.1603 11.984 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.56 on 98 degrees of freedom
## Multiple R-squared: 0.5944, Adjusted R-squared: 0.5903
## F-statistic: 143.6 on 1 and 98 DF, p-value: < 2.2e-16
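As expected, the two slope estimates differ: 0.309 for X onto Y versus 1.921 for Y onto X. Note that their product, 0.309 × 1.921 ≈ 0.594, recovers the shared R-squared, a general identity for simple linear regression with an intercept.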
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(123) # For reproducibility
n2 <- 100
X2 <- rnorm(n2, mean = 50, sd = 10) # Generate X ~ N(50, 10)
Y2 <- X2 # Set Y2 equal to X2, so that sum(X2^2) == sum(Y2^2)
model_X_Y2 <- lm(X2 ~ Y2)
summary(model_X_Y2)
## Warning in summary.lm(model_X_Y2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = X2 ~ Y2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.048e-15 -2.071e-16 1.830e-17 2.008e-16 1.180e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000e+00 2.516e-16 0.000e+00 1
## Y2 1.000e+00 4.866e-18 2.055e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.42e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.223e+34 on 1 and 98 DF, p-value: < 2.2e-16
model_Y_X2 <- lm(Y2 ~ X2)
summary(model_Y_X2)
## Warning in summary.lm(model_Y_X2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y2 ~ X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.048e-15 -2.071e-16 1.830e-17 2.008e-16 1.180e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000e+00 2.516e-16 0.000e+00 1
## X2 1.000e+00 4.866e-18 2.055e+17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.42e-16 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.223e+34 on 1 and 98 DF, p-value: < 2.2e-16
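A less degenerate construction (a sketch using the no-intercept form from (3.38)) is to permute the values of X, so the sums of squares match without Y being a copy of X:
set.seed(123)
X3 <- rnorm(100)
Y3 <- sample(X3) # same values in a different order, so sum(X3^2) == sum(Y3^2)
coef(lm(Y3 ~ X3 + 0)) # slope of Y onto X, no intercept
coef(lm(X3 ~ Y3 + 0)) # slope of X onto Y -- identical, since the denominators match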