Carefully explain the differences between the KNN classifier and KNN regression methods.
The KNN classifier and KNN regression methods both identify a neighborhood consisting of the K training observations nearest to a given point (\(x_0\)), but the two methods differ in how this neighborhood is used. The KNN classifier estimates the conditional probability of each class as the fraction of points in the neighborhood belonging to that class, and assigns \(x_0\) to the most probable class. The KNN regression method instead estimates the regression function \(f(x_0)\) as the average of the response values of the points in the neighborhood.
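In symbols, following the usual definitions, with \(N_0\) denoting the set of K training observations closest to \(x_0\), the classifier estimates \[Pr(Y=j \mid X=x_0)=\frac{1}{K}\sum_{i \in N_0} I(y_i=j)\] and predicts the class with the largest estimated probability, while KNN regression estimates \[\hat{f}(x_0)=\frac{1}{K}\sum_{i \in N_0} y_i.\]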
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
data(Auto)
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
The function names() was used to determine the names of the variables in the Auto data set. Once the variable names were known, the qualitative variable name was excluded so that the correlation matrix could be computed for the remaining eight quantitative variables.
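As a sketch of an equivalent approach, the qualitative column can be dropped by name rather than by position, which avoids relying on name being the last column:
cor(subset(Auto, select = -name))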
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
fxnmpg <- lm(mpg ~ .-name, data=Auto)
summary(fxnmpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
For instance:
i. Is there a relationship between the predictors and the response?
To determine whether there is a relationship between the predictors and the response (mpg), we test the hypothesis \(H_0\): \(\beta_i = 0\) for all predictors i. The output above shows an F-statistic of 252.4 with a p-value < 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero; there is a statistically significant relationship between the predictors and the response.
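The printed summary rounds the F-test p-value; as a sketch, the exact value can be recomputed from the F statistic stored in the summary object:
fstat <- summary(fxnmpg)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # p-value of the overall F-test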
ii. Which predictors appear to have a statistically significant relationship to the response?
The predictors that appear to have a statistically significant relationship to the response mpg are those whose Pr(>|t|) values are less than the significance level \(\alpha\) = 0.05. The predictors that meet this criterion are displacement (p-value = 0.00844), weight (p-value < .001), year (p-value < .001), and origin (p-value < .001).
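As a quick programmatic check (a sketch using the column names produced by summary.lm()), the rows of the coefficient table with p-values below 0.05 can be extracted directly:
coefs <- coef(summary(fxnmpg))        # coefficient matrix from the summary
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]   # terms significant at alpha = 0.05 (the intercept is included in this filter)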
iii. What does the coefficient for the year variable suggest?
The coefficient for the year variable suggests that, holding all other variables constant, each additional model year is associated with an increase of about 0.75 mpg.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow =c(2,2))
plot(fxnmpg)
The diagnostics reveal that the relationship is not quite linear: the residuals vs. fitted plot shows some curvature, and the Q-Q plot indicates that the residuals depart somewhat from a normal distribution. The remaining plots show a few outliers with standardized residuals beyond 2, and observation 14 appears to be an unusually high leverage point.
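These impressions can be checked numerically; the sketch below flags observations with studentized residuals larger than 2 in absolute value and leverages above twice the average leverage, both common rules of thumb:
which(abs(rstudent(fxnmpg)) > 2)                        # candidate outliers
which(hatvalues(fxnmpg) > 2 * mean(hatvalues(fxnmpg)))  # high-leverage observations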
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
intermpg <- lm(mpg ~ cylinders*horsepower + displacement*weight + acceleration*horsepower, data = Auto[, 1:8])
summary(intermpg)
##
## Call:
## lm(formula = mpg ~ cylinders * horsepower + displacement * weight +
## acceleration * horsepower, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7093 -2.1721 -0.4586 1.7839 16.7986
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.567e+01 7.197e+00 9.125 < 2e-16 ***
## cylinders -3.070e+00 1.035e+00 -2.968 0.003189 **
## horsepower -2.435e-01 8.174e-02 -2.979 0.003077 **
## displacement -4.299e-02 1.648e-02 -2.609 0.009429 **
## weight -3.405e-03 1.421e-03 -2.396 0.017039 *
## acceleration 5.307e-02 2.483e-01 0.214 0.830877
## cylinders:horsepower 3.094e-02 8.685e-03 3.562 0.000414 ***
## displacement:weight 6.292e-06 4.398e-06 1.431 0.153346
## horsepower:acceleration -3.787e-03 2.581e-03 -1.467 0.143221
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.849 on 383 degrees of freedom
## Multiple R-squared: 0.7618, Adjusted R-squared: 0.7568
## F-statistic: 153.1 on 8 and 383 DF, p-value: < 2.2e-16
Again, to determine whether an interaction is statistically significant, its Pr(>|t|) must be less than the significance level \(\alpha\) = 0.05. The only interaction that is statistically significant in this example is the one between cylinders and horsepower, with a p-value of 0.000414.
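Rather than hand-picking interaction terms, all pairwise interactions among the predictors can be screened in a single fit (a sketch; with this many terms, some apparently significant interactions may arise by chance):
allinter <- lm(mpg ~ (. - name)^2, data = Auto)  # all main effects plus every two-way interaction
summary(allinter)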
(f) Try a few different transformations of the variables, such as log(X), \(\sqrt{X}\), \(X^2\). Comment on your findings.
par(mfrow=c(2,3))
plot(log(Auto$horsepower),Auto$mpg)
plot(sqrt(Auto$horsepower),Auto$mpg)
plot((Auto$horsepower)^2,Auto$mpg)
plot(log(Auto$acceleration),Auto$mpg)
plot(sqrt(Auto$acceleration),Auto$mpg)
plot((Auto$acceleration)^2,Auto$mpg)
The variables chosen for transformation were horsepower and acceleration, because neither was statistically significant in the original model. Reviewing the plots of each transformation versus the response variable mpg, the only transformation that produces a noticeably stronger, more linear relationship is the log transformation of horsepower. The transformations of acceleration do not appear to improve the strength or linearity of its relationship with mpg.
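The visual impression can be backed up by comparing simple fits of mpg on horsepower with and without the transformation; a sketch:
summary(lm(mpg ~ horsepower, data = Auto))$r.squared       # untransformed
summary(lm(mpg ~ log(horsepower), data = Auto))$r.squared  # log-transformed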
This question should be answered using the Carseats data set.
library(ISLR)
data(Carseats)
(a) Fit a multiple regression model to predict Sales using Price,Urban, and US.
carseats_mod <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(carseats_mod)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!
The summary output above shows that Price (p-value < .001) and US (p-value < .001) are statistically significant predictors of Sales, while Urban appears to have no effect on Sales (p-value = 0.936). Interpretation of Price: holding the other predictors fixed, a $1 increase in price is associated with a decrease in sales of about 54.46 units (Sales is recorded in thousands of units). Interpretation of US: a store located in the United States sells, on average, about 1,200.57 more car seats than a comparable store outside the US.
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
The model equation is: \[Sales = 13.043469 - 0.054459 \times Price - 0.021916 \times UrbanYes + 1.200573 \times USYes\] where UrbanYes equals 1 if the store is in an urban location and 0 otherwise, and USYes equals 1 if the store is in the US and 0 otherwise.
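The 0/1 coding that R uses for the qualitative predictors can be inspected directly from the design matrix (a sketch):
head(model.matrix(carseats_mod))  # shows the dummy columns UrbanYes and USYes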
(d) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?
As stated previously, the predictors for which we can reject the null hypothesis are Price and US (i.e., their p-values are less than \(\alpha\) = 0.05).
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
carseat_modbetter <- lm(Sales ~ Price + US, data = Carseats)
summary(carseat_modbetter)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
Although both models are statistically significant, neither fits the data especially well. The original model with all three predictors has an adjusted \(R^2\) of 0.2335, meaning the predictors explain only about 23.35% of the variance in Sales. The smaller model has an adjusted \(R^2\) of 0.2354, meaning its predictors explain about 23.54% of the variance in Sales.
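Because the model in (e) is nested within the model from (a), the two fits can also be compared with a partial F-test (a sketch):
anova(carseat_modbetter, carseats_mod)  # tests whether adding Urban improves the fit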
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(carseat_modbetter)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow =c(2,2))
plot(carseat_modbetter)
The plots show a few outliers, specifically observations 69 and 377 (with large positive standardized residuals) and 51 (with a large negative standardized residual). There are also some high leverage observations.
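As with the Auto diagnostics, the flagged observations can be listed numerically; the sketch below uses twice the average leverage \((p+1)/n\) as the high-leverage cutoff:
p <- length(coef(carseat_modbetter)) - 1               # number of predictors
n <- nrow(Carseats)
which(hatvalues(carseat_modbetter) > 2 * (p + 1) / n)  # high-leverage observations
which(abs(rstudent(carseat_modbetter)) > 2)            # candidate outliers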
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The coefficient estimate for the regression of Y onto X: \[\hat{\beta}=\frac{\sum_{i=1}^n x_iy_i}{\sum_{j=1}^n x^2_j}\]
The coefficient estimate for X onto Y: \[\hat{\beta}'= \frac{\sum_{i=1}^nx_iy_i}{\sum_{j=1}^ny^2_j}\]
For the coefficient estimates to be the same, we need \(\sum_j x^2_j = \sum_j y^2_j\), i.e. the sums of squares of the observed x and y values must be equal.
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x<-1:100
sum(x^2)
## [1] 338350
y <- 2*x +rnorm(100, sd=0.5)
sum(y^2)
## [1] 1354445
This creates x and y with different sums of squares (338350 vs. 1354445), so the two coefficient estimates will differ.
fit.y <- lm(y~x)
summary(fit.y)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17003 -0.30292 0.00775 0.29257 1.14873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.065833 0.090949 0.724 0.471
## x 1.999774 0.001564 1278.991 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4513 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.636e+06 on 1 and 98 DF, p-value: < 2.2e-16
fit.x <- lm(x~y)
summary(fit.x)
##
## Call:
## lm(formula = x ~ y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57377 -0.14607 -0.00247 0.15111 0.58286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.029893 0.045499 -0.657 0.513
## y 0.500026 0.000391 1278.991 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2257 on 98 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.636e+06 on 1 and 98 DF, p-value: < 2.2e-16
Viewing the two summary outputs above, the coefficient estimates are completely different (about 2.0 for the regression of y onto x and about 0.5 for the regression of x onto y), so the two fitted models are different.
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
X<- 1:100
sum(X^2)
## [1] 338350
Y <- 100:1
sum(Y^2)
## [1] 338350
This creates X and Y with the same sum of squares (338350).
fit.Y<-lm(Y~X)
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.575e-14 -5.302e-15 -2.850e-15 4.300e-16 2.680e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.010e+02 5.598e-15 1.804e+16 <2e-16 ***
## X -1.000e+00 9.624e-17 -1.039e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.08e+32 on 1 and 98 DF, p-value: < 2.2e-16
fit.X<-lm(X~Y)
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = X ~ Y)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.680e-13 -4.300e-16 2.850e-15 5.302e-15 3.575e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.010e+02 5.598e-15 1.804e+16 <2e-16 ***
## Y -1.000e+00 9.624e-17 -1.039e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.778e-14 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.08e+32 on 1 and 98 DF, p-value: < 2.2e-16
Since \(\sum X^2_j = \sum Y^2_j\), the coefficient estimates are the same (the slope is \(-1\) in both regressions), and the two fitted models are identical.