Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN regression is used for a quantitative response: it predicts the value at a point by averaging the responses of the K nearest training observations. The KNN classifier is used for a qualitative response: it predicts the class at a point by taking a majority vote among the K nearest training observations and assigning the most common class. In short, regression averages to predict a value, while classification votes to predict a category.
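The contrast between averaging and voting can be sketched in a few lines of base R. This is a minimal illustration on one predictor with made-up data; `knn_predict` is a hypothetical helper written for this example, not a function from any package.

```r
# Minimal KNN on a single predictor; knn_predict is a hypothetical helper.
knn_predict <- function(x_train, y_train, x0, k,
                        type = c("regression", "classification")) {
  type <- match.arg(type)
  # Indices of the k training points closest to x0
  nearest <- order(abs(x_train - x0))[seq_len(k)]
  if (type == "regression") {
    mean(y_train[nearest])              # average the neighbors' responses
  } else {
    votes <- table(y_train[nearest])
    names(votes)[which.max(votes)]      # most common class among the neighbors
  }
}

# Made-up training data (an assumption for illustration)
x_train <- c(1, 2, 3, 10, 11, 12)
y_num   <- c(1.0, 1.5, 2.0, 9.0, 9.5, 10.0)
y_class <- c("low", "low", "low", "high", "high", "high")

knn_predict(x_train, y_num,   x0 = 2.2, k = 3, type = "regression")      # 1.5
knn_predict(x_train, y_class, x0 = 2.2, k = 3, type = "classification")  # "low"
```

The neighbor search is identical in both cases; only the final step (mean versus majority vote) differs.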
This question involves the use of multiple linear regression on the Auto data set.
Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR2)
data(Auto)
pairs(Auto)
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
# name is the 9th column
cor(Auto[, -9])
mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
lm.fit <- lm(mpg ~ . - name, data = Auto)
summary(lm.fit)
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 4.67e-07 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance:
- Is there a relationship between the predictors and the response?
Yes. The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero. The R-squared of 0.82 means that the predictors explain about 82% of the variance in mpg.
- Which predictors appear to have a statistically significant relationship to the response?
Looking at the individual t-statistics, displacement, weight, year, and origin are significant at the 0.05 level. Cylinders, horsepower, and acceleration are not.
- What does the coefficient for the year variable suggest?
The coefficient for year is about 0.75, suggesting an increase of about 0.75 mpg per model year with the other predictors held fixed. In other words, fuel efficiency improved over time.
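As a sanity check on the overall F-test, the F-statistic can be recovered approximately from the reported R-squared. The sketch below uses only numbers copied from the summary() output; because R-squared is rounded to four digits, the result matches 252.4 only approximately.

```r
# Values copied from the summary() output above
r2 <- 0.8215   # Multiple R-squared
p  <- 7        # number of predictors
df <- 384      # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r2 / p) / ((1 - r2) / df)
f_stat                                   # about 252.5

# Upper-tail probability of the F distribution, i.e. the overall p-value
p_value <- pf(f_stat, p, df, lower.tail = FALSE)
p_value
```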
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm.fit)
par(mfrow = c(1, 1))
Auto[14, ]
The Residuals vs Fitted plot does not show a random scatter around zero, which suggests that the relationship is non-linear and that the model over-predicts in some ranges and under-predicts in others.
The Q-Q Residuals plot compares the standardized residuals against a normal distribution. Most points fall on the diagonal line, but the upper tail departs from it, meaning that the residuals are right-skewed rather than perfectly normal.
The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values. The spread is not even across the fitted values, indicating non-constant variance.
The Residuals vs Leverage plot is used to find observations that have an unusually large influence on the fit. Row 14, the buick estate wagon (sw), stands out with unusually high leverage.
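Leverage and residual size are separate quantities, and both can be read off numerically instead of from the plot. The sketch below uses simulated data (an assumption for illustration, not the Auto data) with one extreme predictor value, and flags observations whose leverage is more than three times the average, which is a common rule of thumb and not a formal test.

```r
set.seed(1)
# Simulated data with one extreme predictor value in position 50
x <- c(rnorm(49), 8)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)

lev <- hatvalues(fit)                        # leverage of each observation
avg_lev <- length(coef(fit)) / length(lev)   # average leverage, (p + 1) / n
high_lev <- which(lev > 3 * avg_lev)         # unusually high leverage
high_lev

# Largest studentized residual, which need not be the high-leverage point
which.max(abs(rstudent(fit)))
```

The same two calls, `hatvalues(lm.fit)` and `rstudent(lm.fit)`, can be applied to the Auto fit to confirm what the diagnostic plots show.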
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
# * includes both main effects and the interaction
summary(lm(mpg ~ weight * horsepower, data = Auto))
summary(lm(mpg ~ displacement * weight, data = Auto))
# : adds only the interaction
summary(lm(mpg ~ acceleration + horsepower + acceleration:horsepower, data = Auto))
Yes, all three interactions appear to be significant. The weight:horsepower, displacement:weight, and acceleration:horsepower terms all have p-values below 0.05, and even below 0.001, which indicates high significance.
Additionally, horsepower was not significant on its own in the full additive model, but it contributes through its interactions. An interaction term tests whether the effect of one predictor on mpg depends on the value of the other; here the effect of horsepower depends on acceleration and on weight, and those interactions are significant predictors of mpg.
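The difference between the two operators can be checked directly. The sketch below uses simulated data (an assumption for illustration) to show that `a * b` is shorthand for `a + b + a:b`, while `a:b` on its own fits the product term without the main effects.

```r
set.seed(1)
# Simulated data with a true interaction between a and b
a <- rnorm(200)
b <- rnorm(200)
y <- 1 + a + b + 0.5 * a * b + rnorm(200)

fit_star  <- lm(y ~ a * b)          # main effects plus interaction
fit_colon <- lm(y ~ a + b + a:b)    # the same model written term by term
all.equal(coef(fit_star), coef(fit_colon))   # TRUE

fit_only <- lm(y ~ a:b)             # interaction term alone, no main effects
names(coef(fit_only))               # "(Intercept)" "a:b"
```

Fitting the interaction without its main effects is rarely what is wanted, which is why the models above either use `*` or list the main effects explicitly.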
Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.
summary(lm(mpg ~ horsepower, data = Auto))
summary(lm(mpg ~ log(horsepower), data = Auto))
summary(lm(mpg ~ sqrt(horsepower), data = Auto))
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))
Adding the horsepower squared term gives the best fit, with an R-squared of 0.6876 compared to 0.6059 for horsepower alone. This confirms that the relationship between mpg and horsepower is non-linear, and that the log, square root, and squared transformations help to improve the model.
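Instead of reading four summaries one by one, the fits can be compared side by side. The sketch below uses simulated data with a curved relationship (an assumption for illustration, not the Auto data) and reports adjusted R-squared, since the quadratic model has one more term than the others.

```r
set.seed(1)
# Simulated curved relationship between x and y
x <- runif(200, 50, 200)
y <- 60 - 10 * log(x) + rnorm(200)
d <- data.frame(x, y)

forms <- list(
  linear    = y ~ x,
  log       = y ~ log(x),
  sqrt      = y ~ sqrt(x),
  quadratic = y ~ x + I(x^2)
)

# Fit each formula and extract its adjusted R-squared
adj_r2 <- sapply(forms, function(f) summary(lm(f, data = d))$adj.r.squared)
adj_r2
```

Replacing `d` with `Auto`, `x` with `horsepower`, and `y` with `mpg` gives the same comparison for the models fitted above.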
This question should be answered using the Carseats data set.
Fit a multiple regression model to predict Sales using Price, Urban, and US.
library(ISLR2)
data(Carseats)
lm.fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
Call:
lm(formula = Sales ~ Price + Urban + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9206 -1.6220 -0.0564 1.5786 7.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
Price -0.054459 0.005242 -10.389 < 2e-16 ***
UrbanYes -0.021916 0.271650 -0.081 0.936
USYes 1.200573 0.259042 4.635 4.86e-06 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.472 on 396 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative
Price: A quantitative predictor. Holding Urban and US fixed, a one-dollar increase in price is associated with a decrease in sales of 0.0545 thousand units, or about 54 car seats. The coefficient is negative and highly significant (p < 2e-16).
UrbanYes: A qualitative predictor with coefficient -0.0219. Holding the other predictors fixed, a store in an urban location sells 0.0219 thousand units, or about 22 car seats, fewer than a store in a non-urban location. However, the p-value of 0.936 is well above 0.05, so the difference is not significant.
USYes: A qualitative predictor. Holding the other predictors fixed, a store in the US sells 1.2006 thousand units, or about 1,200 car seats, more than a store outside the US. The coefficient is positive and significant (p < 0.05).
Write out the model in equation form, being careful to handle the qualitative variables properly.
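\(Sales = 13.04 - 0.0545\, Price - 0.0219\, Urban + 1.2006\, US\), where \(Urban = 1\) if the store is in an urban location and 0 otherwise, \(US = 1\) if the store is in the US and 0 otherwise, and Sales is measured in thousands of units.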
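To see how the indicator variables enter the equation, it helps to plug in a concrete store. The sketch below uses the coefficients copied from the summary() output; `predict_sales` is a hypothetical helper written for this example, and a price of 100 is an arbitrary choice.

```r
# Coefficients copied from the summary() output above
b0      <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us    <-  1.200573

# predict_sales is a hypothetical helper; urban and us are 1 for "Yes", 0 for "No"
predict_sales <- function(price, urban, us) {
  b0 + b_price * price + b_urban * urban + b_us * us
}

predict_sales(price = 100, urban = 1, us = 1)   # about 8.78 thousand units
predict_sales(price = 100, urban = 0, us = 0)   # about 7.60 thousand units
```

With the fitted model available, `predict(lm.fit, newdata = ...)` does the same calculation without copying coefficients by hand.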
For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0\)?
We can reject the null hypothesis for Price and US since their p values are less than 0.05.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit2)
Call:
lm(formula = Sales ~ Price + US, data = Carseats)
Residuals:
Min 1Q Median 3Q Max
-6.9269 -1.6286 -0.0574 1.5766 7.0515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
Price -0.05448 0.00523 -10.416 < 2e-16 ***
USYes 1.19964 0.25846 4.641 4.71e-06 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.469 on 397 degrees of freedom
Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?
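There is not much difference between the two models: both explain about 24% of the variance in Sales, so neither fits the data especially well. The model in (e) is slightly better because its adjusted R-squared is a bit higher (0.2354 compared to 0.2335) and its residual standard error is slightly lower (2.469 compared to 2.472).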
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm.fit2)
2.5 % 97.5 %
(Intercept) 11.79032020 14.27126531
Price -0.06475984 -0.04419543
USYes 0.69151957 1.70776632
The 95% confidence intervals are (-0.0648, -0.0442) for Price and (0.6915, 1.7078) for USYes.
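These intervals can be reproduced by hand as estimate ± t-quantile × standard error. The sketch below uses the rounded estimates and standard errors copied from summary(lm.fit2), so it agrees with confint() only up to rounding.

```r
# Estimates and standard errors copied from summary(lm.fit2) above
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

# 97.5% quantile of the t distribution with 397 residual degrees of freedom
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```

Neither interval contains zero, which is consistent with both predictors being significant at the 0.05 level.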
Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2, 2))
plot(lm.fit2)
par(mfrow = c(1, 1))
which(abs(rstudent(lm.fit2)) > 3)
named integer(0)
max(abs(rstudent(lm.fit2)))
[1] 2.891521
For the model in (e), there is no evidence of outliers: no studentized residual exceeds 3 in absolute value (the largest is 2.89). There are some high-leverage points, but none seem to be very influential on the results of the model.
This problem involves simple linear regression without an intercept.
##(a)
Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The two coefficient estimates are the same when the sum of the squared x-values equals the sum of the squared y-values, because both estimates share the numerator \(\sum x_i y_i\) and differ only in the denominator. They also coincide trivially when \(\sum x_i y_i = 0\), in which case both are zero.
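The condition follows from writing out both no-intercept estimates using (3.38). The subscripts on \(\hat{\beta}\) below are notation introduced here for clarity and do not appear in the text.

\[
\hat{\beta}_{Y \mid X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \mid Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\]

\[
\hat{\beta}_{Y \mid X} = \hat{\beta}_{X \mid Y}
\iff
\sum_{i=1}^{n} x_i y_i \left( \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2 \right) = 0
\]

So the estimates agree exactly when \(\sum x_i^2 = \sum y_i^2\) or \(\sum x_i y_i = 0\).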
##(b)
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(234)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
coef(lm(y ~ x + 0))
x
2.109122
coef(lm(x ~ y + 0))
y
0.3762149
##(c)
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(1)
x <- rnorm(100)
y <- sample(x)
coef(lm(y ~ x + 0))
x
-0.07767695
coef(lm(x ~ y + 0))
y
-0.07767695
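The example works because `sample(x)` only reorders the values of x, so the two sums of squares are equal. A short check in base R, using the same seed as above, confirms this and compares the closed-form slope with the one from lm().

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)                   # same values as x in a different order

# The two denominators match (up to floating-point summation order)
all.equal(sum(x^2), sum(y^2))    # TRUE

b_yx <- sum(x * y) / sum(x^2)    # no-intercept slope of y on x
b_xy <- sum(x * y) / sum(y^2)    # no-intercept slope of x on y
all.equal(b_yx, b_xy)            # TRUE

# The closed-form slope agrees with lm()
all.equal(b_yx, unname(coef(lm(y ~ x + 0))))   # TRUE
```

Any other construction that preserves the sum of squares, such as `y <- -x`, would satisfy the condition as well.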