Question 2: Carefully explain the differences between the KNN classifier and KNN regression methods.
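Answer: Both methods start by finding the K training observations closest to a test point x0; they differ in what they do with those neighbors. The KNN classifier predicts a qualitative class label (such as blue/red or spam/not spam): it estimates the probability of each class as the fraction of the K neighbors that belong to it, then assigns x0 to the most common class. KNN regression predicts a continuous value (such as house price or temperature) by averaging the responses of the K neighbors. Their evaluation metrics differ as well: classification typically relies on accuracy, precision, recall, and F1 score, while regression relies on MSE, RMSE, and mean absolute error (MAE).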
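To make the contrast concrete, here is a minimal base-R sketch of both methods. The helper knn_predict() and the toy data are assumptions made up for illustration, and Euclidean distance is assumed; it is not meant as a replacement for a tested KNN package.

```r
# Minimal KNN sketch in base R (hypothetical helper, Euclidean distance).
knn_predict <- function(train_x, train_y, x0, k, type = c("class", "reg")) {
  type <- match.arg(type)
  train_x <- as.matrix(train_x)
  # Euclidean distance from every training row to the query point x0
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  nn <- order(d)[seq_len(k)]          # indices of the k nearest neighbors
  if (type == "class") {
    votes <- table(train_y[nn])       # class counts among the neighbors
    names(votes)[which.max(votes)]    # majority vote (ties go to the first class in sort order)
  } else {
    mean(train_y[nn])                 # average response of the neighbors
  }
}

# Toy data: five points on a line (made up for this example)
x <- matrix(c(1, 2, 3, 10, 11), ncol = 1)
labels <- c("blue", "blue", "red", "red", "red")
prices <- c(100, 110, 120, 300, 310)

knn_class <- knn_predict(x, labels, x0 = 2.2, k = 3, type = "class")
knn_reg   <- knn_predict(x, prices, x0 = 2.2, k = 3, type = "reg")
```

The neighbor search is identical in both calls; only the last step changes, from a vote to an average.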
Question 9: This question involves the use of multiple linear regression on the Auto data set.
library(ISLR2)
pairs(~ ., data = Auto,
main = "Auto Dataset Scatterplot Matrix")
B. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
auto_cor <- cor(subset(Auto, select = -name))
print(auto_cor)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
fit <- lm(mpg ~ . - name, data = Auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response? Yes. The overall F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero; at least one predictor is related to mpg.
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin all have p-values below 0.05. cylinders, horsepower, and acceleration do not reach significance in this model.
What does the coefficient for the year variable suggest? Holding all other variables constant, each one-year increase in model year is associated with an increase of about 0.75 mpg, so newer cars tend to be more fuel efficient.
D. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(fit)
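The residuals vs. fitted plot shows a U-shaped curve rather than a random scatter around zero, which suggests the relationship is non-linear and the linear fit is missing some curvature. The Normal Q-Q plot shows a few points departing from the reference line, and the largest residual (13.06, roughly 3.9 residual standard errors) is a candidate outlier. In the residuals vs. leverage plot, observation 14 sits far to the right of the other points, so it has unusually high leverage. High leverage alone does not make it an outlier; that depends on the size of its residual.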
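Leverage and outlyingness can also be checked numerically rather than read off the plots. The sketch below is a hypothetical helper, flag_points(), run on simulated data rather than the Auto data; the cutoffs (twice the average leverage, studentized residuals beyond 3) are common rules of thumb, not fixed thresholds.

```r
# Hypothetical helper: list leverage and studentized residuals for an lm fit.
flag_points <- function(model, resid_cut = 3) {
  h <- hatvalues(model)                 # leverage of each observation
  r <- rstudent(model)                  # externally studentized residuals
  p <- length(coef(model))              # number of coefficients, incl. intercept
  n <- length(h)
  data.frame(
    leverage      = h,
    stud_resid    = r,
    high_leverage = h > 2 * p / n,      # rule of thumb: twice the average leverage
    outlier       = abs(r) > resid_cut  # rule of thumb: beyond +/- 3
  )
}

# Simulated data (not the Auto data): point 51 is far out in x but lies
# exactly on the true line, so it has high leverage and is not expected
# to have a large residual.
set.seed(1)
x <- c(rnorm(50), 10)
y <- 1 + 2 * x + c(rnorm(50, sd = 0.5), 0)
sim_fit <- lm(y ~ x)
flags <- flag_points(sim_fit)
flags[which.max(flags$leverage), ]
```

Calling flag_points(fit) on the Auto model would report the same two quantities for observation 14, which separates the question "is it high leverage?" from "is it an outlier?".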
E. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
fit1 <- lm(mpg ~ displacement:year, data = Auto)
fit2 <- lm(mpg ~ horsepower * weight, data = Auto)
fit3 <- lm(mpg ~ acceleration:year, data = Auto)
summary(fit1)
##
## Call:
## lm(formula = mpg ~ displacement:year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.1566 -3.0276 -0.6339 2.5802 20.1066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.530e+01 5.314e-01 66.42 <2e-16 ***
## displacement:year -8.100e-04 3.227e-05 -25.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.832 on 390 degrees of freedom
## Multiple R-squared: 0.6177, Adjusted R-squared: 0.6168
## F-statistic: 630.2 on 1 and 390 DF, p-value: < 2.2e-16
summary(fit2)
##
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7725 -2.2074 -0.2708 1.9973 14.7314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.356e+01 2.343e+00 27.127 < 2e-16 ***
## horsepower -2.508e-01 2.728e-02 -9.195 < 2e-16 ***
## weight -1.077e-02 7.738e-04 -13.921 < 2e-16 ***
## horsepower:weight 5.355e-05 6.649e-06 8.054 9.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared: 0.7484, Adjusted R-squared: 0.7465
## F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
summary(fit3)
##
## Call:
## lm(formula = mpg ~ acceleration:year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.4083 -4.9868 -0.9834 4.6751 22.5613
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.561770 1.758112 1.457 0.146
## acceleration:year 0.017642 0.001458 12.103 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.663 on 390 degrees of freedom
## Multiple R-squared: 0.273, Adjusted R-squared: 0.2712
## F-statistic: 146.5 on 1 and 390 DF, p-value: < 2.2e-16
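Every interaction term fitted above is statistically significant (all p-values are below 0.001). One caveat: fit1 and fit3 use : alone, so they include the product term without the corresponding main effects. Only fit2, which uses *, tests the interaction after accounting for horsepower and weight individually, and in that model the horsepower:weight interaction remains significant (p = 9.93e-15).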
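The difference between * and : is easy to see by looking at the design matrix R builds for each formula. This is a small sketch on simulated predictors (x1 and x2 are placeholder names, not Auto variables).

```r
# Simulated predictors, used only to inspect how R expands the formulas
set.seed(1)
d <- data.frame(x1 = rnorm(20), x2 = rnorm(20))

cols_star  <- colnames(model.matrix(~ x1 * x2, data = d))  # main effects + interaction
cols_colon <- colnames(model.matrix(~ x1:x2, data = d))    # interaction term only

print(cols_star)
print(cols_colon)
```

The * formula produces columns for x1, x2, and x1:x2, while the : formula produces only x1:x2. For the Auto data, a formula such as mpg ~ displacement * year would therefore keep both main effects alongside displacement:year.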
F. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
fit_log <- lm(log(mpg) ~ cylinders + log(displacement) + log(horsepower) + log(weight) + acceleration + year + origin, data = Auto)
fit_sqrt <- lm(mpg ~ cylinders + sqrt(displacement) + sqrt(horsepower) + sqrt(weight) + acceleration + year + origin, data = Auto)
fit_quad <- lm(mpg ~ cylinders + displacement + horsepower + I(horsepower^2) + weight + I(weight^2) + acceleration + year + origin, data = Auto)
summary(fit_log)
##
## Call:
## lm(formula = log(mpg) ~ cylinders + log(displacement) + log(horsepower) +
## log(weight) + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40203 -0.06561 -0.00048 0.05823 0.38672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.939090 0.376026 18.454 < 2e-16 ***
## cylinders -0.013813 0.010580 -1.306 0.1925
## log(displacement) 0.003877 0.052652 0.074 0.9413
## log(horsepower) -0.254083 0.057923 -4.387 1.49e-05 ***
## log(weight) -0.599957 0.081224 -7.386 9.47e-13 ***
## acceleration -0.008336 0.003800 -2.194 0.0289 *
## year 0.029594 0.001743 16.983 < 2e-16 ***
## origin 0.023370 0.010359 2.256 0.0246 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1139 on 384 degrees of freedom
## Multiple R-squared: 0.8899, Adjusted R-squared: 0.8879
## F-statistic: 443.4 on 7 and 384 DF, p-value: < 2.2e-16
summary(fit_sqrt)
##
## Call:
## lm(formula = mpg ~ cylinders + sqrt(displacement) + sqrt(horsepower) +
## sqrt(weight) + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4030 -1.9807 -0.1672 1.7124 12.9777
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97092 4.94260 1.006 0.3152
## cylinders 0.11130 0.32178 0.346 0.7296
## sqrt(displacement) 0.14430 0.22341 0.646 0.5187
## sqrt(horsepower) -0.64976 0.30327 -2.143 0.0328 *
## sqrt(weight) -0.63983 0.07765 -8.240 2.75e-15 ***
## acceleration -0.04568 0.10247 -0.446 0.6560
## year 0.73646 0.04927 14.946 < 2e-16 ***
## origin 1.13268 0.28152 4.023 6.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared: 0.8339, Adjusted R-squared: 0.8309
## F-statistic: 275.5 on 7 and 384 DF, p-value: < 2.2e-16
summary(fit_quad)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + I(horsepower^2) +
## weight + I(weight^2) + acceleration + year + origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8713 -1.6140 -0.1788 1.4667 12.0738
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.110e+00 4.586e+00 1.332 0.18359
## cylinders 1.600e-01 2.981e-01 0.537 0.59164
## displacement -9.982e-04 7.271e-03 -0.137 0.89087
## horsepower -2.086e-01 3.999e-02 -5.216 3.01e-07 ***
## I(horsepower^2) 6.217e-04 1.286e-04 4.833 1.96e-06 ***
## weight -1.339e-02 2.125e-03 -6.303 8.07e-10 ***
## I(weight^2) 1.420e-06 2.835e-07 5.010 8.35e-07 ***
## acceleration -1.830e-01 1.006e-01 -1.818 0.06979 .
## year 7.724e-01 4.522e-02 17.081 < 2e-16 ***
## origin 7.372e-01 2.530e-01 2.914 0.00378 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.91 on 382 degrees of freedom
## Multiple R-squared: 0.8642, Adjusted R-squared: 0.861
## F-statistic: 270 on 9 and 382 DF, p-value: < 2.2e-16
The non-linear transformations appear to be justified. The original model has an R-squared of 0.8215; the square-root model reaches 0.8339 and the quadratic model 0.8642, and both quadratic terms are significant. The log model reports 0.8899, but its response is log(mpg) rather than mpg, so that value is not directly comparable with the others. The squared terms also address the parabolic shape seen in the residuals vs. fitted plot of the original model.
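One way to put a log-response model on the same footing as the others is to back-transform its fitted values and compute R-squared against the untransformed response. The sketch below does this on simulated data, not the Auto data; r2_on_original_scale() is a hypothetical helper, and exp() of the fitted values is a simple back-transform that ignores retransformation bias.

```r
# Simulated data (not the Auto data) with a multiplicative error structure
set.seed(1)
x <- runif(200, 1, 5)
y <- exp(0.5 + 0.6 * x + rnorm(200, sd = 0.1))

fit_raw <- lm(y ~ x)        # response on the original scale
fit_ln  <- lm(log(y) ~ x)   # response on the log scale

# Hypothetical helper: R-squared measured against the untransformed response
r2_on_original_scale <- function(pred, obs) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

r2_raw <- r2_on_original_scale(fitted(fit_raw), y)
r2_ln  <- r2_on_original_scale(exp(fitted(fit_ln)), y)  # back-transform first
c(raw = r2_raw, log_backtransformed = r2_ln)
```

Applied to fit_log, the same idea would be r2_on_original_scale(exp(fitted(fit_log)), Auto$mpg), which gives a number that can be compared with the R-squared of the models fitted to mpg directly.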
Question 10: This question should be answered using the Carseats data set.
data("Carseats")
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
car_fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(car_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
B. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
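Price - holding Urban and US fixed, each one-unit increase in Price is associated with a decrease in Sales of 0.0545. Sales is recorded in thousands of units, so this is about 54 fewer units sold.
Urban - holding the other predictors fixed, sales at urban stores are about 22 units (0.0219 thousand) lower than at non-urban stores. With a p-value of 0.936, this difference is not statistically distinguishable from zero.
US - holding the other predictors fixed, stores located in the US sell about 1,201 units (1.2006 thousand) more than stores outside the US.
Sales = 13.04 - 0.054(Price) - 0.022(Urban) + 1.20(US), where Urban = 1 for an urban store and 0 otherwise, and US = 1 for a US store and 0 otherwise.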
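The equation can be checked by plugging values in by hand. The sketch below copies the coefficients from the summary(car_fit) output and wraps them in a hypothetical helper, predict_sales(); the price of 100 is an arbitrary example value.

```r
# Coefficients copied from the summary(car_fit) output above
b <- c(intercept = 13.043469, price = -0.054459, urban = -0.021916, us = 1.200573)

# Hypothetical helper: predicted Sales (in thousands) from the fitted equation
predict_sales <- function(price, urban, us) {
  # urban and us are "Yes"/"No"; the dummy is 1 for "Yes" and 0 for "No"
  unname(b["intercept"] + b["price"] * price +
    b["urban"] * (urban == "Yes") + b["us"] * (us == "Yes"))
}

urban_us  <- predict_sales(price = 100, urban = "Yes", us = "Yes")
rural_us  <- predict_sales(price = 100, urban = "No",  us = "Yes")
urban_gap <- urban_us - rural_us   # equals the UrbanYes coefficient
```

Switching a qualitative predictor from "No" to "Yes" shifts the prediction by exactly its coefficient, which is why UrbanYes and USYes are read as differences between groups rather than as slopes.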
D. For which of the predictors can you reject the null hypothesis H0 : βj =0?
We can reject the null hypothesis for Price and US, whose p-values are both far below 0.05. We cannot reject it for Urban (p = 0.936).
E. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
new_model <- lm(Sales ~ Price + US, data = Carseats)
summary(new_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
F. How well do the models in (a) and (e) fit the data?
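Neither model fits especially well: both have an R-squared of 0.2393, so each explains only about 24% of the variance in Sales. The smaller model in (e) is marginally better, with a higher adjusted R-squared (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472).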
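For reference, the three fit statistics quoted here can be rebuilt from the residuals. This sketch uses simulated data (not Carseats) and compares the hand-computed values with what summary() reports; the variable names are made up for the example.

```r
# Simulated data: x2 is pure noise, so it adds a parameter without adding fit
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)
sim_fit <- lm(y ~ x1 + x2)

rss <- sum(residuals(sim_fit)^2)   # residual sum of squares
tss <- sum((y - mean(y))^2)        # total sum of squares
p   <- 2                           # number of predictors

r2     <- 1 - rss / tss                               # share of variance explained
adj_r2 <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))   # penalizes extra predictors
rse    <- sqrt(rss / (n - p - 1))                     # residual standard error
```

Because adjusted R-squared and the residual standard error both divide by n - p - 1, dropping a predictor that contributes nothing can improve them even when R-squared stays the same.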
confint(new_model)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
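The interval for Price can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the rounded estimate and standard error printed by summary(new_model), so it should agree with confint() only to about four decimal places.

```r
est <- -0.05448   # Price estimate from summary(new_model)
se  <- 0.00523    # its standard error
df  <- 397        # residual degrees of freedom

t_crit <- qt(0.975, df)               # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se    # lower and upper limits
ci
```

The interval excludes zero, which matches the conclusion that the null hypothesis can be rejected for Price.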
H. Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(new_model)
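The residuals vs. leverage plot shows some observations with higher leverage than the rest. There is little evidence of outliers: the points in the Normal Q-Q plot fall close to the reference line, and the largest residual (7.05) is under 3 residual standard errors (2.469).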
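Question 12: This problem involves simple linear regression without an intercept.
A. Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Without an intercept, the coefficient for the regression of Y onto X is sum(x_i * y_i) / sum(x_i^2), and the coefficient for the regression of X onto Y is sum(x_i * y_i) / sum(y_i^2). The two share the same numerator, so there are two circumstances in which they are equal. The first is when the denominators match, that is, when the sum of squares of the X values equals the sum of squares of the Y values. The second is when the numerator is 0, that is, when X and Y are orthogonal; both estimates are then 0.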
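The second circumstance (a zero numerator) can be checked with a small constructed example. The vectors below are made up so that sum(x * y) is exactly 0 while the sums of squares differ; the closed-form slopes are compared with lm() fits that drop the intercept.

```r
# Toy vectors chosen so that sum(x * y) = 0 while sum(x^2) != sum(y^2)
x <- rep(c(1, -1, 1, -1), 25)   # n = 100
y <- rep(c(2, 2, -2, -2), 25)

# Closed-form no-intercept slopes
b_y_on_x <- sum(x * y) / sum(x^2)
b_x_on_y <- sum(x * y) / sum(y^2)

# The same estimates from lm() with the intercept removed
lm_y_on_x <- unname(coef(lm(y ~ x - 1)))
lm_x_on_y <- unname(coef(lm(x ~ y - 1)))
```

Both slopes come out as 0 (up to floating-point error in the lm() fits) even though sum(x^2) = 100 and sum(y^2) = 400, so equal sums of squares are sufficient but not necessary for the two estimates to agree.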
B. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(42)
X <- rnorm(100, mean = 0, sd = 1)
Y <- 2 * X + rnorm(100, mean = 0, sd = 0.5)
model_Y_onto_X <- lm(Y ~ X - 1)
beta_Y_onto_X <- coef(model_Y_onto_X)
model_X_onto_Y <- lm(X ~ Y - 1)
beta_X_onto_Y <- coef(model_X_onto_Y)
print(beta_X_onto_Y)
## Y
## 0.4746934
print(beta_Y_onto_X)
## X
## 2.012243
C. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(123)
X <- rnorm(100, mean = 5, sd = 2)
Y <- rev(X)
model_Y_onto_X2 <- lm(Y ~ X - 1)
beta_Y_onto_X2 <- coef(model_Y_onto_X2)
model_X_onto_Y2 <- lm(X ~ Y - 1)
beta_X_onto_Y2 <- coef(model_X_onto_Y2)
print(beta_Y_onto_X2)
## X
## 0.9087063
print(beta_X_onto_Y2)
## Y
## 0.9087063