February 15, 2023
library(tidyverse)
library(openintro)
library(ISLR)
library(MASS)
Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used when the response variable is qualitative (categorical), while KNN regression is used when the response is quantitative (numerical). Both methods start the same way: for a test point x0, they find the K training observations closest to x0. The classifier then estimates the probability of each class as the fraction of those K neighbors belonging to it and predicts the most common class. KNN regression instead predicts Y as the average of the responses of those K neighbors. As a result, the codomain of the classifier is a discrete set of class labels, while the codomain of the regression method is a continuous space.
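The contrast can be sketched in a few lines of base R. This is a minimal one-dimensional illustration, not a production implementation: the function names (`knn_neighbors`, `knn_classify`, `knn_regress`) and the toy data are hypothetical and chosen only for this example.

```r
# Minimal 1-D KNN sketch in base R (illustrative only; all names are hypothetical)
knn_neighbors <- function(x_train, x0, k) {
  # Indices of the k training points closest to x0
  order(abs(x_train - x0))[seq_len(k)]
}

knn_classify <- function(x_train, y_train, x0, k) {
  # Classification: return the most common class among the k neighbors
  votes <- table(y_train[knn_neighbors(x_train, x0, k)])
  names(votes)[which.max(votes)]
}

knn_regress <- function(x_train, y_train, x0, k) {
  # Regression: return the mean response of the k neighbors
  mean(y_train[knn_neighbors(x_train, x0, k)])
}

# Toy training data (made up for the example)
x_train <- c(1, 2, 3, 10, 11, 12)
y_class <- c("low", "low", "low", "high", "high", "high")
y_num   <- c(1.0, 1.5, 2.0, 9.0, 9.5, 10.0)

pred_class <- knn_classify(x_train, y_class, x0 = 2.5, k = 3)  # a class label
pred_value <- knn_regress(x_train, y_num, x0 = 2.5, k = 3)     # a number
pred_class
pred_value
```

The neighbor search is shared; only the final step differs (a vote versus an average).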
A. Produce a scatterplot matrix which includes all of the variables in the data set.
head(Auto)
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 18 8 307 130 3504 12.0 70 1
## 2 15 8 350 165 3693 11.5 70 1
## 3 18 8 318 150 3436 11.0 70 1
## 4 16 8 304 150 3433 12.0 70 1
## 5 17 8 302 140 3449 10.5 70 1
## 6 15 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
plot(Auto)
B. Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable which is qualitative.
correlationMatrix <- cor(Auto[-9])
correlationMatrix
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
C. Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
mlr <- lm(mpg ~ . - name, data = Auto)
summary(mlr)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
I. Is there a relationship between the predictors and the response?
Yes, there is a relationship between at least one predictor and the response variable. The F-statistic tests the null hypothesis that all of the coefficients are zero. Here the F-statistic is 252.4 with a p-value below 2.2e-16, far less than 0.05, so we reject the null hypothesis and conclude that the model as a whole is statistically significant.
The following predictors appear to have a statistically significant relationship to the response: Weight, Year, Displacement, & Origin.
The coefficient for the year variable is positive (0.7508) and significant. This suggests that, holding all other variables constant, mpg increases by about 0.75 for each additional model year, so newer cars tend to be more fuel efficient.
D. Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
The “Residuals vs. Fitted” plot shows a slight U-shaped curve, which suggests non-linearity in the relationship. On the Q-Q plot, several points in the right tail curve upward away from the reference line, which indicates a departure from the normality assumption. In the “Scale-Location” plot the trend line is fairly straight, which speaks to roughly constant variance of the residuals rather than to normality. “Residuals vs. Leverage” shows that most observations fall on the left of the plot, at low leverage. No points fall outside the Cook’s distance contours, so no single observation appears to be unduly influential. However, observation 14 stands apart from the rest and has unusually high leverage.
par(mfrow=c(2,2))
plot(mlr)
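The visual checks above can be complemented with numbers. The sketch below is illustrative only: it uses the built-in mtcars data so that it runs without any packages, not the Auto fit discussed here, and the cutoffs (twice the average leverage, studentized residuals beyond 3 in absolute value) are common rules of thumb that are assumptions of this example, not formal tests.

```r
# Illustrative only: uses the built-in mtcars data, not the Auto fit above
fit <- lm(mpg ~ wt + hp, data = mtcars)

h <- hatvalues(fit)          # leverage of each observation
p <- length(coef(fit)) - 1   # number of predictors
n <- nrow(mtcars)

# Leverages sum to the number of coefficients, so the average is (p + 1) / n
avg_leverage <- (p + 1) / n

# Rule-of-thumb cutoff (an assumption): more than twice the average leverage
high_leverage <- names(h)[h > 2 * avg_leverage]
high_leverage

# Studentized residuals beyond about +/- 3 are often treated as possible outliers
possible_outliers <- names(which(abs(rstudent(fit)) > 3))
possible_outliers
```

The same calls (`hatvalues()`, `rstudent()`) should apply to any `lm` fit, including the one plotted above.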
E. Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
The interactions horsepower:displacement (p < 2e-16), horsepower:origin (p = 8.44e-13), and acceleration:origin (p = 5.34e-11) appear to be statistically significant, while the interaction displacement:cylinders is not.
interact.model1 <- lm(mpg ~ . - name + horsepower:displacement, data = Auto)
summary(interact.model1)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower:displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.7010 -1.6009 -0.0967 1.4119 12.6734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.894e+00 4.302e+00 -0.440 0.66007
## cylinders 6.466e-01 3.017e-01 2.143 0.03275 *
## displacement -7.487e-02 1.092e-02 -6.859 2.80e-11 ***
## horsepower -1.975e-01 2.052e-02 -9.624 < 2e-16 ***
## weight -3.147e-03 6.475e-04 -4.861 1.71e-06 ***
## acceleration -2.131e-01 9.062e-02 -2.351 0.01921 *
## year 7.379e-01 4.463e-02 16.534 < 2e-16 ***
## origin 6.891e-01 2.527e-01 2.727 0.00668 **
## displacement:horsepower 5.236e-04 4.813e-05 10.878 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.912 on 383 degrees of freedom
## Multiple R-squared: 0.8636, Adjusted R-squared: 0.8608
## F-statistic: 303.1 on 8 and 383 DF, p-value: < 2.2e-16
interact.model2 <- lm(mpg ~ . - name + displacement*cylinders, data = Auto)
summary(interact.model2)
interact.model3 <- lm(mpg ~ . - name + horsepower*origin, data = Auto)
summary(interact.model3)
##
## Call:
## lm(formula = mpg ~ . - name + horsepower * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.277 -1.875 -0.225 1.570 12.080
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.196e+01 4.396e+00 -4.996 8.94e-07 ***
## cylinders -5.275e-01 3.028e-01 -1.742 0.0823 .
## displacement -1.486e-03 7.607e-03 -0.195 0.8452
## horsepower 8.173e-02 1.856e-02 4.404 1.38e-05 ***
## weight -4.710e-03 6.555e-04 -7.186 3.52e-12 ***
## acceleration -1.124e-01 9.617e-02 -1.168 0.2434
## year 7.327e-01 4.780e-02 15.328 < 2e-16 ***
## origin 7.695e+00 8.858e-01 8.687 < 2e-16 ***
## horsepower:origin -7.955e-02 1.074e-02 -7.405 8.44e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.116 on 383 degrees of freedom
## Multiple R-squared: 0.8438, Adjusted R-squared: 0.8406
## F-statistic: 258.7 on 8 and 383 DF, p-value: < 2.2e-16
interact.model4 <- lm(mpg ~ . - name + acceleration:origin, data = Auto)
summary(interact.model4)
##
## Call:
## lm(formula = mpg ~ . - name + acceleration:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4106 -1.8805 -0.2471 1.7891 11.9680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3327273 5.0570077 -0.066 0.9476
## cylinders -0.5881258 0.3063127 -1.920 0.0556 .
## displacement 0.0086251 0.0073062 1.181 0.2385
## horsepower -0.0250843 0.0131049 -1.914 0.0564 .
## weight -0.0052351 0.0006439 -8.131 5.98e-15 ***
## acceleration -1.0340600 0.1896960 -5.451 8.98e-08 ***
## year 0.7623813 0.0482774 15.792 < 2e-16 ***
## origin -9.3089774 1.6109675 -5.779 1.56e-08 ***
## acceleration:origin 0.6546959 0.0969263 6.755 5.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.15 on 383 degrees of freedom
## Multiple R-squared: 0.8405, Adjusted R-squared: 0.8371
## F-statistic: 252.2 on 8 and 383 DF, p-value: < 2.2e-16
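The models above mix the `*` and `:` operators. A short sketch of how R expands each one may help. It uses `terms()`, which parses a formula without needing any data, and the variable names are only placeholders taken from the formulas above.

```r
# a * b expands to both main effects plus the interaction
star_terms <- attr(terms(mpg ~ horsepower * origin), "term.labels")
star_terms
# "horsepower" "origin" "horsepower:origin"

# a:b is the interaction term alone
colon_terms <- attr(terms(mpg ~ horsepower:origin), "term.labels")
colon_terms
# "horsepower:origin"

# With the main effects already in the formula, both spellings give the same terms
same <- setequal(
  attr(terms(mpg ~ horsepower + origin + horsepower:origin), "term.labels"),
  star_terms
)
same
```

Because `. - name` already supplies every main effect, adding `horsepower*origin` or `horsepower:origin` to that formula fits the same model.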
F. Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
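In the models below, each set of transformed terms is added alongside the original variable. None of the acceleration terms (acceleration, log(acceleration), acceleration^2, sqrt(acceleration)) is individually significant, with p-values between 0.70 and 0.89, probably because the four terms are highly correlated with one another. By contrast, all four horsepower terms are significant at the 0.05 level, and the adjusted R-squared rises from 0.8182 in the original model to 0.8578. In the scatterplots, log(horsepower) appears to have a more linear relationship with mpg than the other transformations.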
summary(lm(mpg ~ . -name + log(acceleration) + I(acceleration^2) + sqrt(acceleration), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration) + I(acceleration^2) +
## sqrt(acceleration), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0893 -1.8990 0.0073 1.9302 13.3519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.220e+02 7.057e+02 0.315 0.7533
## cylinders -3.559e-01 3.214e-01 -1.107 0.2688
## displacement 1.014e-02 7.966e-03 1.273 0.2039
## horsepower -3.433e-02 1.411e-02 -2.433 0.0154 *
## weight -5.496e-03 6.992e-04 -7.860 3.97e-14 ***
## acceleration 3.452e+01 1.152e+02 0.300 0.7645
## year 7.516e-01 4.973e-02 15.113 < 2e-16 ***
## origin 1.327e+00 2.722e-01 4.877 1.58e-06 ***
## log(acceleration) 3.226e+02 8.453e+02 0.382 0.7030
## I(acceleration^2) -9.014e-02 6.334e-01 -0.142 0.8869
## sqrt(acceleration) -4.156e+02 1.182e+03 -0.352 0.7253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.242 on 381 degrees of freedom
## Multiple R-squared: 0.8319, Adjusted R-squared: 0.8275
## F-statistic: 188.5 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)
summary(lm(mpg ~ . -name + log(horsepower) + I(horsepower^2) +sqrt(horsepower), data=Auto))
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + I(horsepower^2) +
## sqrt(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4353 -1.7487 -0.0456 1.3931 11.7105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.182e+02 1.706e+02 -1.865 0.063006 .
## cylinders -1.213e-01 3.285e-01 -0.369 0.712020
## displacement -3.336e-03 7.316e-03 -0.456 0.648684
## horsepower 6.901e+00 2.728e+00 2.529 0.011826 *
## weight -3.595e-03 6.715e-04 -5.354 1.49e-07 ***
## acceleration -2.722e-01 9.953e-02 -2.735 0.006528 **
## year 7.377e-01 4.522e-02 16.312 < 2e-16 ***
## origin 8.611e-01 2.528e-01 3.407 0.000728 ***
## log(horsepower) 3.312e+02 1.473e+02 2.248 0.025123 *
## I(horsepower^2) -4.864e-03 1.974e-03 -2.464 0.014168 *
## sqrt(horsepower) -1.868e+02 7.617e+01 -2.452 0.014664 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.944 on 381 degrees of freedom
## Multiple R-squared: 0.8614, Adjusted R-squared: 0.8578
## F-statistic: 236.8 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
A. Fit a multiple regression model to predict Sales using Price, Urban, and US.
sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(sales)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
B. Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Sales is recorded in thousands of units, so the coefficients are in thousands of car seats. Holding the other predictors constant, each one-dollar increase in Price (quantitative) is associated with a change in Sales of -0.054459, a decrease of about 54 car seats. The coefficient for UrbanYes (qualitative) is -0.021916, meaning urban stores sell about 22 fewer car seats than non-urban stores on average, but with a p-value of 0.936 it is not statistically significant, so there is no evidence of a relationship between Urban and Sales. The coefficient for USYes (qualitative) is 1.200573, meaning stores located in the US sell about 1,201 more car seats on average than stores outside the US.
C. Write out the model in equation form, being careful to handle the qualitative variables properly.
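Sales = 13.043469 - 0.054459 × Price - 0.021916 × UrbanYes + 1.200573 × USYes, where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.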
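Because both qualitative predictors enter the model as 0/1 indicators, the fitted model can also be written as four parallel lines in Price, one per combination of Urban and US. The intercepts below are just sums of the estimated coefficients from the summary output above.

```latex
\widehat{\text{Sales}} =
\begin{cases}
13.043469 - 0.054459\,\text{Price}, & \text{Urban = No, US = No} \\
13.021553 - 0.054459\,\text{Price}, & \text{Urban = Yes, US = No} \\
14.244042 - 0.054459\,\text{Price}, & \text{Urban = No, US = Yes} \\
14.222126 - 0.054459\,\text{Price}, & \text{Urban = Yes, US = Yes}
\end{cases}
```

The slope on Price is the same in every case because the model contains no interaction terms.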
D. For which of the predictors can you reject the null hypothesis H0: βj = 0?
The null hypothesis can be rejected for the predictors Price and US based on their p-values.
E. On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
salesvar <- lm(Sales ~ Price + US, data = Carseats)
summary(salesvar)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
F. How well do the models in (a) and (e) fit the data?
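The two models fit the data almost identically. Both have a multiple R-squared of 0.2393, so each explains only about 24% of the variance in Sales. Dropping Urban changes the adjusted R-squared from 0.2335 to 0.2354 and the residual standard error from 2.472 to 2.469, and the ANOVA comparison below shows that the difference between the two models is not statistically significant (p = 0.9357). Overall, both models fit the data only moderately well.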
anova(sales,salesvar)
## Analysis of Variance Table
##
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 396 2420.8
## 2 397 2420.9 -1 -0.03979 0.0065 0.9357
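As a check on the table above, the F statistic for dropping a single predictor can be reproduced by hand from the reported residual sums of squares. Here n = 400 observations, the full model has p = 3 predictors, and q = 1 coefficient is dropped; the notation RSS_0 (smaller model) and RSS_1 (full model) is introduced only for this worked example.

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - p - 1)}
  = \frac{0.03979 / 1}{2420.8 / 396}
  \approx \frac{0.03979}{6.1131}
  \approx 0.0065
```

With one dropped coefficient, F equals the square of that coefficient's t statistic in the full model: (-0.021916 / 0.271650)^2 ≈ (-0.0807)^2 ≈ 0.0065, which is why the p-value here (0.9357) matches the p-value for UrbanYes (0.936).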
G. Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(salesvar, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
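The intervals above follow the usual form, estimate ± critical t value × standard error. The sketch below rebuilds two of them from the rounded estimates and standard errors printed in the summary of the smaller model, so the results agree with `confint()` only up to rounding; the object names are hypothetical.

```r
# Estimates and standard errors copied (rounded) from the summary(salesvar) output
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df  <- 397                       # residual degrees of freedom

t_crit <- qt(0.975, df)          # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 4)
```

Neither interval contains zero, which is consistent with rejecting the null hypothesis for both Price and US.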
H. Is there evidence of outliers or high leverage observations in the model from (e)?
There is no evidence of obvious outliers. However, the Residuals vs. Leverage plot shows a few observations with high leverage.
par(mfrow=c(2,2))
plot(salesvar)
A. Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Without an intercept, the two coefficient estimates share the same numerator, the sum of xi·yi, and differ only in the denominator: the sum of xi^2 for the regression of Y onto X, and the sum of yi^2 for the regression of X onto Y. The estimates are therefore the same when the sum of squares of the x values equals the sum of squares of the y values.
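Written out, the condition follows directly from the no-intercept formula. The subscripts on the two estimates are notation introduced here to tell them apart.

```latex
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\quad\Longrightarrow\quad
\hat{\beta}_{Y \text{ onto } X} = \hat{\beta}_{X \text{ onto } Y}
\ \text{ when } \
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2 .
```

The estimates also coincide in the degenerate case where the sum of x_i y_i is zero, since both are then zero.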
B. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x <- rnorm(100)
y <- rbinom(100, 2, 0.5)
example <- lm(y ~ x + 0)
summary(example)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08466 0.92494 1.02138 1.92987 2.07145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x -0.06177 0.13457 -0.459 0.647
##
## Residual standard error: 1.309 on 99 degrees of freedom
## Multiple R-squared: 0.002124, Adjusted R-squared: -0.007956
## F-statistic: 0.2107 on 1 and 99 DF, p-value: 0.6472
example2 <- lm(x ~ y + 0)
summary(example2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4179 -0.7150 0.1459 0.6964 2.0625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y -0.03438 0.07490 -0.459 0.647
##
## Residual standard error: 0.9766 on 99 degrees of freedom
## Multiple R-squared: 0.002124, Adjusted R-squared: -0.007956
## F-statistic: 0.2107 on 1 and 99 DF, p-value: 0.6472
C. Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 1:100
y <- 100:1
example <- lm(y ~ x + 0)
summary(example)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
example2 <- lm(x ~ y + 0)
summary(example2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
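The two estimates match because y holds the same values as x in reverse order, so the two sums of squares are identical. A short check in base R, computing the no-intercept slopes directly from their closed form (the names `b_yx` and `b_xy` are hypothetical):

```r
x <- 1:100
y <- 100:1

# y is x in reverse order, so the sums of squares are equal
equal_ss <- sum(x^2) == sum(y^2)
equal_ss

b_yx <- sum(x * y) / sum(x^2)    # no-intercept slope of y on x
b_xy <- sum(x * y) / sum(y^2)    # no-intercept slope of x on y
c(b_yx, b_xy)

# The closed form agrees with the lm() fit
matches_lm <- all.equal(b_yx, unname(coef(lm(y ~ x + 0))))
matches_lm
```

Any pair of vectors with equal sums of squares would work equally well, for example a random vector and a permutation of it.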