library(tidyverse)
library(openintro)
library(ISLR)
library(ISLR2)
library(stats)

Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier predicts a qualitative response by taking a majority vote among the classes of the k nearest neighbors, while KNN regression predicts a quantitative response by averaging the response values of the k nearest neighbors.
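To make the contrast concrete, here is a minimal base-R sketch of both predictions for a single test point; the toy data and the choice of k = 3 are made up purely for illustration.

#toy training data: one predictor, a class label, and a numeric response
x_train <- c(1.0, 1.2, 1.4, 3.0, 3.2, 3.4)
class_train <- c("A", "A", "A", "B", "B", "B")
y_train <- c(10, 12, 11, 30, 32, 31)
x_new <- 1.3
k <- 3

#indices of the k nearest neighbors (one-dimensional Euclidean distance)
nn <- order(abs(x_train - x_new))[1:k]

#KNN classifier: majority vote among the neighbors' classes
names(which.max(table(class_train[nn])))

#KNN regression: arithmetic mean of the neighbors' responses
mean(y_train[nn])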
This question involves the use of multiple linear regression on
the Auto data set.
#Load the data
Auto <- read.table("C:/Users/mille/Documents/Auto.data")
#Name the columns
colnames(Auto) <- Auto[1, ]
#Remove row 1
Auto <- Auto[-1, ]
#Convert quantitative variables to numeric format
Auto$mpg <- as.numeric(Auto$mpg)
Auto$cylinders <- as.numeric(Auto$cylinders)
Auto$displacement <- as.numeric(Auto$displacement)
Auto$horsepower <- as.numeric(Auto$horsepower)

## Warning: NAs introduced by coercion
Auto$weight <- as.numeric(Auto$weight)
Auto$acceleration <- as.numeric(Auto$acceleration)
Auto$year <- as.numeric(Auto$year)
Auto$origin <- as.numeric(Auto$origin)
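The coercion warning above arises because the raw Auto.data file marks missing horsepower values with "?". As a sketch, an alternative load that handles the header and the missing values in one step (assuming the same file path) would be:

#alternative load: read the header and treat "?" as NA so no manual coercion is needed
Auto <- read.table("C:/Users/mille/Documents/Auto.data",
                   header = TRUE, na.strings = "?",
                   stringsAsFactors = FALSE)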
Produce a scatterplot matrix which includes all of the variables in the data set.
plot(Auto)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
Auto %>%
drop_na() -> OKAuto
cor(Auto[1:8, 1:8])

## Warning in cor(Auto[1:8, 1:8]): the standard deviation is zero
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 NA -0.8510163 -0.8971403 -0.8365458
## cylinders NA 1 NA NA NA
## displacement -0.8510163 NA 1.0000000 0.9816599 0.9907731
## horsepower -0.8971403 NA 0.9816599 1.0000000 0.9574110
## weight -0.8365458 NA 0.9907731 0.9574110 1.0000000
## acceleration 0.6955204 NA -0.8622310 -0.8763229 -0.8414428
## year NA NA NA NA NA
## origin NA NA NA NA NA
## acceleration year origin
## mpg 0.6955204 NA NA
## cylinders NA NA NA
## displacement -0.8622310 NA NA
## horsepower -0.8763229 NA NA
## weight -0.8414428 NA NA
## acceleration 1.0000000 NA NA
## year NA 1 NA
## origin NA NA 1
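Note that cor(Auto[1:8, 1:8]) only uses the first eight rows of the data, which is why some columns have zero standard deviation and the matrix is mostly NA. A sketch of the intended computation, using every complete observation and excluding the qualitative name column (the ninth column):

#correlations across all complete observations, excluding name
cor(OKAuto[, 1:8])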
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
autolm <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = OKAuto)
summary(autolm)

##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = OKAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and response?
Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero; the adjusted R2 of 0.8182 further indicates that the model explains most of the variation in mpg.
Which predictors appear to have a statistically significant relationship to the response?
It looks like we have 4 statistically significant predictors (displacement, weight, year, origin).
What does the coefficient for the year
variable suggest?
The year coefficient of about 0.75 suggests that, holding the other predictors fixed, average mpg increases by roughly 0.75 for each additional model year; in other words, cars have become more fuel efficient over time. (Note that the raw size of a coefficient does not by itself measure influence, since the predictors are on very different scales.)
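If we wanted to compare the relative influence of predictors that are measured on different scales, one option (a sketch, not part of the fit above) is to refit the model on standardized predictors so the coefficients are in comparable units:

#coefficients are then the change in mpg per one-standard-deviation change in each predictor
stdlm <- lm(mpg ~ scale(cylinders) + scale(displacement) + scale(horsepower) +
              scale(weight) + scale(acceleration) + scale(year) + scale(origin),
            data = OKAuto)
summary(stdlm)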
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(autolm)

The overall shape of the Residuals vs Fitted plot shows a slight upward concavity, suggesting that the linearity assumption does not hold.
The Normal Q-Q plot shows some deviation from the assumption that the residuals of the model are nearly normal, particularly in the upper quantiles.
The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values. While the trend line is roughly horizontal, the spread does appear to increase with the fitted values, suggesting heteroscedasticity.
The Residuals vs Leverage plot shows one observation with a huge amount of leverage (> 0.15) compared to other points.
Before we go further, let's try removing that high-leverage observation:

#identify the highest-leverage observation
which.max(hatvalues(autolm))

## 14
## 14
#remove the high-leverage observation from the data frame
NOOAuto <- OKAuto[-14, ]

#new multiple linear regression model with the high-leverage observation removed
NOOautolm <- lm(mpg ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = NOOAuto)
summary(NOOautolm)

##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year + origin, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.551 -2.147 -0.048 1.889 13.056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.771e+01 4.644e+00 -3.813 0.00016 ***
## cylinders -5.469e-01 3.242e-01 -1.687 0.09247 .
## displacement 2.306e-02 7.745e-03 2.977 0.00309 **
## horsepower -1.105e-02 1.422e-02 -0.777 0.43769
## weight -6.916e-03 7.046e-04 -9.815 < 2e-16 ***
## acceleration 1.163e-01 1.010e-01 1.151 0.25043
## year 7.551e-01 5.093e-02 14.825 < 2e-16 ***
## origin 1.427e+00 2.775e-01 5.142 4.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.32 on 383 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.8188
## F-statistic: 252.7 on 7 and 383 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(NOOautolm)

Success: after removing observation 14, the highest leverage value is roughly half of what it was previously, and the red trend line in the Residuals vs Leverage plot is no longer bent.
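A quick numeric check of that claim is sketched below.

#largest leverage value before and after dropping observation 14
max(hatvalues(autolm))
max(hatvalues(NOOautolm))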
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
#new multiple linear regression model with year by weight interaction
YWautolm <- lm(mpg ~ year*weight, data = NOOAuto)
summary(YWautolm)

##
## Call:
## lm(formula = mpg ~ year * weight, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0547 -1.9960 -0.1011 1.6521 12.9926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.102e+02 1.293e+01 -8.521 3.56e-16 ***
## year 2.038e+00 1.716e-01 11.873 < 2e-16 ***
## weight 2.761e-02 4.408e-03 6.263 1.00e-09 ***
## year:weight -4.588e-04 5.900e-05 -7.777 6.81e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.189 on 387 degrees of freedom
## Multiple R-squared: 0.8341, Adjusted R-squared: 0.8328
## F-statistic: 648.7 on 3 and 387 DF, p-value: < 2.2e-16
#new multiple linear regression model with year by origin interaction
YOautolm <- lm(mpg ~ year*origin, data = NOOAuto)
summary(YOautolm)

##
## Call:
## lm(formula = mpg ~ year * origin, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3141 -3.7120 -0.6567 3.3637 15.5859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -83.38672 12.06282 -6.913 1.97e-11 ***
## year 1.30893 0.15839 8.264 2.27e-15 ***
## origin 17.37735 6.85308 2.536 0.0116 *
## year:origin -0.16635 0.08916 -1.866 0.0628 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.206 on 387 degrees of freedom
## Multiple R-squared: 0.558, Adjusted R-squared: 0.5546
## F-statistic: 162.8 on 3 and 387 DF, p-value: < 2.2e-16
#new multiple linear regression model with origin by weight interaction
OWautolm <- lm(mpg ~ origin*weight, data = NOOAuto)
summary(OWautolm)

##
## Call:
## lm(formula = mpg ~ origin * weight, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4061 -2.7853 -0.4088 2.1673 15.4794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.9978472 2.1965047 17.755 < 2e-16 ***
## origin 4.1005003 1.4931915 2.746 0.00631 **
## weight -0.0055547 0.0007819 -7.104 5.86e-12 ***
## origin:weight -0.0012704 0.0006228 -2.040 0.04203 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.241 on 387 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.7044
## F-statistic: 310.8 on 3 and 387 DF, p-value: < 2.2e-16
Of the interactions tested, year * weight had the lowest p-value as well as the highest adjusted R2 (0.8328). The origin * weight interaction was still statistically significant at the 0.05 level (p = 0.042), with a lower adjusted R2 of 0.7044. Finally, the year * origin model did not show a statistically significant interaction (p = 0.063) and had the poorest adjusted R2 value, at 0.5546.
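Rather than fitting each pair separately, the formula operator ^2 expands to all main effects plus every two-way interaction; a sketch for the three predictors considered above:

#main effects of weight, year, and origin plus all two-way interactions
allint <- lm(mpg ~ (weight + year + origin)^2, data = NOOAuto)
summary(allint)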
Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
invmpg <- lm(I(mpg^-1) ~ cylinders + displacement + horsepower + weight + acceleration + year + origin, data = NOOAuto)
summary(invmpg)

##
## Call:
## lm(formula = I(mpg^-1) ~ cylinders + displacement + horsepower +
## weight + acceleration + year + origin, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0156391 -0.0033534 -0.0001398 0.0028516 0.0238136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.183e-02 7.959e-03 11.538 < 2e-16 ***
## cylinders 1.551e-03 5.557e-04 2.792 0.00551 **
## displacement -2.825e-05 1.327e-05 -2.128 0.03395 *
## horsepower 1.207e-04 2.438e-05 4.953 1.1e-06 ***
## weight 1.122e-05 1.208e-06 9.291 < 2e-16 ***
## acceleration 3.133e-04 1.731e-04 1.810 0.07114 .
## year -1.268e-03 8.729e-05 -14.522 < 2e-16 ***
## origin -1.013e-03 4.756e-04 -2.129 0.03390 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005691 on 383 degrees of freedom
## Multiple R-squared: 0.8848, Adjusted R-squared: 0.8827
## F-statistic: 420.4 on 7 and 383 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(invmpg)

To choose a transformation, I referred back to the scatterplot matrix and noticed that many of the graphs in the mpg row/column had a decreasing curved shape, so I decided to take the inverse of mpg itself. While the heteroscedasticity and the linearity assumption are somewhat improved, the distribution of residuals is still not normal in the upper quantiles. The adjusted R2 value is a robust 0.8827.
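The exercise also suggests log transformations; a sketch of the same predictors with log(mpg) as the response, another common way to handle this kind of curvature (not fitted as part of the analysis above):

#same predictors, but modeling log(mpg) instead of 1/mpg
logmpg <- lm(log(mpg) ~ cylinders + displacement + horsepower + weight +
               acceleration + year + origin, data = NOOAuto)
summary(logmpg)
par(mfrow = c(2, 2))
plot(logmpg)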
sqrtacc <- lm(mpg ~ cylinders + displacement + horsepower + weight + sqrt(acceleration) + year + origin, data = NOOAuto)
summary(sqrtacc)

##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## sqrt(acceleration) + year + origin, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6384 -2.1602 -0.1089 1.8645 13.0869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.802e+01 5.589e+00 -3.224 0.00137 **
## cylinders -5.542e-01 3.245e-01 -1.708 0.08844 .
## displacement 2.275e-02 7.799e-03 2.917 0.00374 **
## horsepower -1.472e-02 1.448e-02 -1.016 0.31011
## weight -6.776e-03 7.189e-04 -9.425 < 2e-16 ***
## sqrt(acceleration) 5.941e-01 8.296e-01 0.716 0.47433
## year 7.531e-01 5.098e-02 14.773 < 2e-16 ***
## origin 1.430e+00 2.778e-01 5.146 4.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.324 on 383 degrees of freedom
## Multiple R-squared: 0.8217, Adjusted R-squared: 0.8184
## F-statistic: 252.1 on 7 and 383 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(sqrtacc)
Again referring to the scatterplot matrix, the relationship between mpg and acceleration looked like it might follow a square-root curve, so I applied sqrt() to acceleration. While the linearity assumption looks worse, the standardized residuals look more normal at the low end of the Q-Q plot. The adjusted R2 value, 0.8184, is slightly lower than for the inverse-mpg model.
comblm <- lm(mpg ~ cylinders + 1/displacement + 1/horsepower + 1/weight + sqrt(acceleration) + year + origin, data = NOOAuto)
summary(comblm)

##
## Call:
## lm(formula = mpg ~ cylinders + 1/displacement + 1/horsepower +
## 1/weight + sqrt(acceleration) + year + origin, data = NOOAuto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0071 -2.4357 -0.3078 2.1841 13.8205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.19791 5.46884 -4.242 2.78e-05 ***
## cylinders -2.51104 0.16927 -14.834 < 2e-16 ***
## sqrt(acceleration) -0.06683 0.69021 -0.097 0.923
## year 0.76009 0.05941 12.794 < 2e-16 ***
## origin 1.83231 0.30777 5.954 5.90e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.008 on 386 degrees of freedom
## Multiple R-squared: 0.7386, Adjusted R-squared: 0.7359
## F-statistic: 272.7 on 4 and 386 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(comblm)
Here, instead of transforming mpg, I tried taking the inverse of the other variables whose scatterplots against mpg showed that decreasing curved shape, so as to preserve the relationship of mpg with sqrt(acceleration). The Residuals vs Fitted and Scale-Location plots look better than for sqrt(acceleration) alone, but the distribution of residuals looks worse. In addition, this model has the lowest adjusted R2 value (0.7359) of the transformations I have tried.
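One caveat about the comblm fit: inside an R model formula, / is a formula (nesting) operator rather than arithmetic division, so the 1/displacement, 1/horsepower, and 1/weight terms are not fitted as reciprocals at all, which is why the summary above lists only cylinders, sqrt(acceleration), year, and origin. A sketch of the model as intended, wrapping the arithmetic in I():

#wrap arithmetic transformations in I() so the formula treats them literally
comblm2 <- lm(mpg ~ cylinders + I(1/displacement) + I(1/horsepower) +
                I(1/weight) + sqrt(acceleration) + year + origin,
              data = NOOAuto)
summary(comblm2)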
This question should be answered using the
Carseats data set.
data("Carseats")

Fit a multiple regression model to predict
Sales using Price,
Urban, and US.
carseatlm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseatlm)

##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
The Price coefficient of approximately -0.054 has a very low p-value: each $1 increase in price is associated with a decrease of about 0.054 in Sales, i.e. roughly 54 fewer car seats sold (Sales is recorded in thousands of units), holding the other predictors fixed. UrbanYes, a dummy variable, has a coefficient of about -0.02 but is not statistically significant, so there is no evidence that an urban location affects Sales. Finally, USYes has a significant p-value and a coefficient of about 1.2, so stores located in the US sell on average about 1,200 more car seats than stores outside the US, at a given price.
Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.043469 - 0.05445885 Price - 0.02191615 UrbanYes + 1.20057270 USYes, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
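To see the dummy coding in action, we could compare predictions for hypothetical stores that differ only in the qualitative variables (a sketch; the price of 120 is arbitrary):

#predicted Sales (in thousands of units) at Price = 120 for each Urban/US combination
newstores <- expand.grid(Price = 120, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(newstores, predicted_sales = predict(carseatlm, newdata = newstores))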
For which of the predictors can you reject the null hypothesis H0 : βj = 0?
Price and USYes
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
carseatlm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(carseatlm2)

##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?
The model from (a) has an adjusted R2 value of 0.2335, and the updated (e) model has a slightly higher adjusted R2 value of 0.2354. Either way, while both models are statistically significant, they explain only about 24% of the variation in Sales, so the fit is weak.
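Because the model in (e) is nested within the model in (a), the two can also be compared formally with an F-test; a sketch:

#F-test for dropping Urban from the full model; a large p-value supports the smaller model
anova(carseatlm2, carseatlm)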
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(carseatlm2)

## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(carseatlm2)

Yes. The Residuals vs Leverage plot shows at least one observation with leverage above 0.04, and several more near 0.03, far above the average leverage of (p + 1)/n = 3/400 = 0.0075, so these are high-leverage observations. The standardized residuals, on the other hand, stay within roughly ±3, so there is little evidence of extreme outliers.
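A rough numeric check of both claims is sketched below; the average leverage here is (p + 1)/n = 3/400 = 0.0075, and studentized residuals beyond roughly ±3 would flag outliers.

#observations with leverage well above the average value of (p + 1)/n
which(hatvalues(carseatlm2) > 0.03)

#largest studentized residuals; values beyond about +/- 3 would suggest outliers
range(rstudent(carseatlm2))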
This problem involves simple linear regression without an intercept.
Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), regressing Y onto X gives β̂ = Σ xiyi / Σ xi², while regressing X onto Y gives β̂ = Σ xiyi / Σ yi², so the two estimates are equal exactly when Σ xi² = Σ yi². That happens, for example, when the values in Y are a permutation of the values in X, or when Y = -X.
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
a <- rnorm(1:100)
b <- rnorm(1:100)
lm1 <- lm(a ~ b)
lm2 <- lm(b ~ a)
lm1

##
## Call:
## lm(formula = a ~ b)
##
## Coefficients:
## (Intercept) b
## 0.03726 0.04175
lm2

##
## Call:
## lm(formula = b ~ a)
##
## Coefficients:
## (Intercept) a
## 0.06968 0.04355
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- rnorm(1:100)
y <- sample(x)
lm1 <- lm(x ~ y)
lm2 <- lm(y ~ x)
lm1

##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## -0.03116 -0.04817
lm2

##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -0.03116 -0.04817
x <- rnorm(1:100)
y <- -x
lm1 <- lm(x ~ y)
lm2 <- lm(y ~ x)
lm1

##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## -4.441e-17 -1.000e+00
lm2

##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 4.441e-17 -1.000e+00
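As a final check (a sketch), note that the estimate in (3.38) is for regression without an intercept, which can be requested with + 0 in the formula; for the permuted example the two no-intercept slopes agree exactly:

#hypothetical seed, purely for reproducibility of this sketch
set.seed(1)
x <- rnorm(100)
y <- sample(x)   #same values, so sum(x^2) equals sum(y^2)

coef(lm(y ~ x + 0))   #slope = sum(x * y) / sum(x^2)
coef(lm(x ~ y + 0))   #slope = sum(x * y) / sum(y^2), identical here
sum(x * y) / sum(x^2) #closed-form estimate from (3.38)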