Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier is used when the response variable is qualitative (categorical): it estimates the conditional probability of each class from the K nearest neighbors and assigns the observation to the most common class (for example, Y = 0 or 1). KNN regression is used when the response is quantitative (numerical, possibly continuous): it predicts the value of Y as the average of the responses of the K nearest neighbors. Both methods find the neighbors in the same way; they differ in the output they produce.
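The contrast between the two methods can be sketched in a few lines of base R. This is only an illustrative toy with one predictor; the helper names (`knn_neighbors`, `knn_classify`, `knn_regress`) and the data are made up for this example and are not part of the exercise.

```r
# Minimal base-R sketch of KNN with a single predictor (toy data, hypothetical helpers).
knn_neighbors <- function(x_train, x0, k) {
  # Indices of the k training points closest to x0
  order(abs(x_train - x0))[seq_len(k)]
}

knn_classify <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  votes <- table(y_train[idx])
  names(votes)[which.max(votes)]  # majority class (ties go to the first level)
}

knn_regress <- function(x_train, y_train, x0, k) {
  idx <- knn_neighbors(x_train, x0, k)
  mean(y_train[idx])              # average response of the neighbors
}

x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- c("A", "A", "A", "B", "B", "B")   # qualitative response
y_num   <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # quantitative response

pred_class <- knn_classify(x_train, y_class, x0 = 2.2, k = 3)
pred_num   <- knn_regress(x_train, y_num, x0 = 2.2, k = 3)
```

The neighbor search is shared; only the last step changes, a vote for classification and a mean for regression.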
This question involves the use of multiple linear regression on the Auto data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
library(ISLR2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
plot(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
Auto1<-Auto
Auto1$name=NULL
cor(Auto1)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
AutoLR<-lm(mpg~ .-name,data=Auto)
summary(AutoLR)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance:
I. Is there a relationship between the predictors and the response?
The F-statistic is 252.4 with a p-value < 2.2e-16, well below 0.05, so we reject the null hypothesis that all the coefficients are zero. Hence, there is a relationship between mpg and at least one of the predictors.
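As a rough check, the p-value of the overall F-test can be recomputed from the F-statistic and degrees of freedom printed in the summary. The numbers below are copied from that output; the variable names are just for illustration.

```r
# Values copied from the printed summary(AutoLR) output
f_stat <- 252.4   # F-statistic
df1    <- 7       # numerator degrees of freedom (number of predictors)
df2    <- 384     # denominator (residual) degrees of freedom

# Upper-tail probability of the F distribution = p-value of the overall F-test
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value < 2.2e-16   # compare with the "< 2.2e-16" reported by summary()
```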
Which predictors appear to have a statistically significant relationship to the response?
Based on the individual p-values, displacement, weight, year, and origin have a statistically significant relationship with mpg. Cylinders, horsepower, and acceleration are not significant at the 0.05 level.
What does the coefficient for the year variable suggest?
The coefficient is positive and significant. Holding all the other variables constant, mpg increases by about 0.75 on average for each additional model year, so newer cars tend to be more fuel efficient.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit.
par(mfrow=c(2,2))
plot(AutoLR)
Comment on any problems you see with the fit.
Do the residual plots suggest any unusually large outliers?
Based on the plot of studentized residuals below, several observations have a studentized residual greater than 3, which marks them as possible outliers.
plot(predict(AutoLR), rstudent(AutoLR))
Does the leverage plot identify any observations with unusually high leverage?
Looking at the leverage plot, observation 14 has high leverage but not a large residual. Taken together, the diagnostics suggest that a linear model isn't the best fit in this case.
(e) Use the * and : symbols to fit linear regression models with interaction effects.
Autolm <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto)
summary(Autolm)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + displacement *
## weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2934 -2.5184 -0.3476 1.8399 17.7723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.262e+01 2.237e+00 23.519 < 2e-16 ***
## cylinders 7.606e-01 7.669e-01 0.992 0.322
## displacement -7.351e-02 1.669e-02 -4.403 1.38e-05 ***
## weight -9.888e-03 1.329e-03 -7.438 6.69e-13 ***
## cylinders:displacement -2.986e-03 3.426e-03 -0.872 0.384
## displacement:weight 2.128e-05 5.002e-06 4.254 2.64e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.103 on 386 degrees of freedom
## Multiple R-squared: 0.7272, Adjusted R-squared: 0.7237
## F-statistic: 205.8 on 5 and 386 DF, p-value: < 2.2e-16
Do any interactions appear to be statistically significant?
The interaction between displacement and weight is statistically significant. In contrast, the interaction between cylinders and displacement is not.
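For reference, `a * b` in an R formula is shorthand for `a + b + a:b`, which is why the summary lists both main effects and the interaction terms. A small sketch on simulated stand-in data (not the Auto data set; the variables `a`, `b`, and `y` are invented for this example):

```r
set.seed(1)  # simulated stand-in data; not the Auto data set
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$y <- 1 + 2 * d$a - d$b + 0.5 * d$a * d$b + rnorm(50)

fit_star  <- lm(y ~ a * b, data = d)        # shorthand form
fit_colon <- lm(y ~ a + b + a:b, data = d)  # explicit main effects + interaction

names(coef(fit_star))
same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))
```

Using `:` alone, as in `y ~ a:b`, would fit only the interaction term without the main effects.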
(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.
AutoLR4<-lm(mpg~weight+I((weight)^2),Auto)
summary(AutoLR4)
##
## Call:
## lm(formula = mpg ~ weight + I((weight)^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.6246 -2.7134 -0.3485 1.8267 16.0866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.226e+01 2.993e+00 20.800 < 2e-16 ***
## weight -1.850e-02 1.972e-03 -9.379 < 2e-16 ***
## I((weight)^2) 1.697e-06 3.059e-07 5.545 5.43e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared: 0.7151, Adjusted R-squared: 0.7137
## F-statistic: 488.3 on 2 and 389 DF, p-value: < 2.2e-16
plot(AutoLR4)
The Normal Q-Q plot suggests that the error terms are not normally distributed.
In the 'Residuals vs Leverage' plot, no points fall outside the Cook's distance contours. Therefore, there are no influential points that would substantially change the coefficient estimates.
This question should be answered using the Carseats data set.
data("Carseats")
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
Carseatslm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Carseatslm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Price: the coefficient is negative and significant. Holding Urban and US fixed, a one-unit increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 units).
UrbanYes: the coefficient is not statistically significant (p = 0.936), so there is not enough evidence of a link between whether a store is in an urban location and its sales.
USYes: the coefficient is positive and significant. Holding the other predictors fixed, stores in the US sell about 1.2 thousand (roughly 1,201) more units on average than stores outside the US.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
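Sales = 13.04 - 0.05(Price) - 0.02(UrbanYes) + 1.20(USYes), where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).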
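Written out case by case, the two qualitative predictors only shift the intercept. The block below is a sketch using the rounded coefficients from the summary; the indicator symbols $x_{\text{Urban}}$ and $x_{\text{US}}$ are notation introduced here (1 for "Yes", 0 for "No").

```latex
\widehat{\text{Sales}}
  = 13.04 - 0.05\,\text{Price} - 0.02\,x_{\text{Urban}} + 1.20\,x_{\text{US}}
  = \begin{cases}
      14.22 - 0.05\,\text{Price} & \text{Urban = Yes, US = Yes} \\
      13.02 - 0.05\,\text{Price} & \text{Urban = Yes, US = No} \\
      14.24 - 0.05\,\text{Price} & \text{Urban = No, US = Yes} \\
      13.04 - 0.05\,\text{Price} & \text{Urban = No, US = No}
    \end{cases}
```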
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
The null hypothesis can be rejected for Price and USYes based on the p-values.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
Carseatslm2<- lm(Sales ~ Price + US, data = Carseats)
summary(Carseatslm2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
Based on the RSE and R-squared of the two regressions, both models fit the data similarly, and only moderately well: each explains about 24% of the variance in Sales (R-squared = 0.2393). The model in (e) fits marginally better, with a slightly lower RSE (2.469 vs 2.472) and a slightly higher adjusted R-squared (0.2354 vs 0.2335), so dropping Urban loses essentially nothing.
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(Carseatslm2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
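These intervals follow the usual form estimate ± t × standard error. As a sanity check, the Price interval can be approximately reproduced from the rounded values printed by summary(Carseatslm2); small differences in the last digits are due to rounding of the inputs.

```r
# Values copied from the printed summary(Carseatslm2) output
est <- -0.05448   # Price estimate
se  <-  0.00523   # Price standard error
df  <- 397        # residual degrees of freedom

t_crit <- qt(0.975, df)              # two-sided 95% critical value
ci <- est + c(-1, 1) * t_crit * se   # estimate -/+ t * SE
ci
```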
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
plot(predict(Carseatslm2), rstudent(Carseatslm2))
The studentized residuals appear to be bounded between -3 and 3. Therefore, there is no evidence of outliers in the data.
par(mfrow = c(2, 2))
plot(Carseatslm2)
On the leverage plot, a few observations greatly exceed the average leverage (p+1)/n = 3/400 = 0.0075, which suggests that the corresponding points have high leverage.
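The leverage values can also be inspected numerically with `hatvalues()`. The sketch below uses simulated stand-in data with the same shape as the model in (e) (n = 400, two predictors), not the Carseats data itself; the cutoff of three times the average leverage is a rule-of-thumb assumption, not something prescribed by the exercise.

```r
set.seed(1)  # simulated stand-in data; not the Carseats data set
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.6))
d$y <- 13 - 0.05 * d$x1 + 1.2 * d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)
p   <- length(coef(fit)) - 1   # number of predictors (excluding the intercept)
h   <- hatvalues(fit)          # leverage statistic for each observation

avg_leverage <- (p + 1) / n    # equals mean(h) for a full-rank least-squares fit
flagged <- which(h > 3 * avg_leverage)  # assumed rule-of-thumb cutoff
```

Applied to the fitted model from (e), the same steps would list the row indices of the high-leverage observations instead of reading them off the plot.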
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficients will be the same if $\sum_j x_j^2 = \sum_j y_j^2$.
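The condition follows from writing out the no-intercept estimate (3.38) for both regressions. The subscripted names $\hat\beta_{Y \sim X}$ and $\hat\beta_{X \sim Y}$ are notation introduced here to tell the two estimates apart.

```latex
\hat\beta_{Y \sim X} = \frac{\sum_j x_j y_j}{\sum_j x_j^2},
\qquad
\hat\beta_{X \sim Y} = \frac{\sum_j x_j y_j}{\sum_j y_j^2}
\\[6pt]
\hat\beta_{Y \sim X} = \hat\beta_{X \sim Y}
\iff
\Big(\sum_j x_j y_j\Big)\Big(\sum_j y_j^2 - \sum_j x_j^2\Big) = 0
```

So the estimates coincide when the two sums of squares are equal, and also in the degenerate case $\sum_j x_j y_j = 0$, where both estimates are zero.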
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x=rnorm(100)
y=rbinom(100,2,0.3)
n100<-lm(y~x+0)
summary(n100)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31543 0.02337 0.72568 1.05220 2.22124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x -0.12153 0.09556 -1.272 0.206
##
## Residual standard error: 0.9614 on 99 degrees of freedom
## Multiple R-squared: 0.01608, Adjusted R-squared: 0.006138
## F-statistic: 1.618 on 1 and 99 DF, p-value: 0.2064
n100.2<-lm(x~y+0)
summary(n100.2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.59547 -0.66291 0.04894 0.75250 2.08498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y -0.1323 0.1040 -1.272 0.206
##
## Residual standard error: 1.003 on 99 degrees of freedom
## Multiple R-squared: 0.01608, Adjusted R-squared: 0.006138
## F-statistic: 1.618 on 1 and 99 DF, p-value: 0.2064
The coefficient estimates differ between the two regressions: -0.12153 for y onto x versus -0.1323 for x onto y.
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x=1:100
y=100:1
n200<-lm(y~x+0)
summary(n200)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
n200.2<-lm(x~y+0)
summary(n200.2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
Unlike (b), here we can see that the coefficients are the same based on the results.
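The reason this example works can be checked directly: y is x in reverse order, so the two sums of squares are equal, and the no-intercept slope formula gives the same value in both directions. A short sketch (the names `b_yx` and `b_xy` are just for illustration):

```r
x <- 1:100
y <- 100:1

sum(x^2) == sum(y^2)   # same values in reverse order, so the sums match

# No-intercept slope from (3.38): sum(x * y) divided by the predictor's sum of squares
b_yx <- sum(x * y) / sum(x^2)   # regression of y onto x
b_xy <- sum(x * y) / sum(y^2)   # regression of x onto y
c(b_yx, b_xy)
```

Both values agree with the 0.5075 reported by the two summaries.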
…