Carefully explain the differences between the KNN classifier and KNN regression methods.
This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
attach(Auto)
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
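Dropping the qualitative column by position (`-9`) works for this data set, but it would silently pick the wrong column if the column order ever changed. As a small sketch, assuming the ISLR package is installed, the same matrix can be computed by excluding `name` by its label (the object names `auto_numeric` and `cor_matrix` are only illustrative):

```r
library(ISLR)

# Drop the qualitative column by name rather than by position.
auto_numeric <- subset(Auto, select = -name)

# Same correlation matrix as cor(Auto[, -9]), rounded for readability.
cor_matrix <- cor(auto_numeric)
print(round(cor_matrix, 2))
```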
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.
lm.fit=lm(mpg~.-name,data = Auto)
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance:
i. Is there a relationship between the predictors and the response?
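Yes. The F-statistic of 252.4 (p-value < 2.2e-16) lets us reject the null hypothesis that all the coefficients are zero, so at least one predictor is related to mpg. The adjusted R-squared of 0.8182 also shows that the predictors explain most of the variance in mpg.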
ii. Which predictors appear to have a statistically significant relationship to the response?
The predictors with a statistically significant relationship to mpg (p-value below 0.05) are displacement, weight, year, and origin. Cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.
iii. What does the coefficient for the year variable suggest?
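The positive coefficient (0.750773) suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year, so newer cars tend to get better fuel economy.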
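The answers to (i) and (ii) can also be read off the fitted object programmatically rather than from the printed table. The following is a minimal sketch, assuming the ISLR package is installed; the 0.05 cutoff and the object names are illustrative choices, not part of the exercise:

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)
fit_summary <- summary(fit)

# fstatistic holds the F value, numerator df, and denominator df.
f <- fit_summary$fstatistic
f_p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Names of predictors whose t-test p-value is below 0.05.
coef_table <- coef(fit_summary)
p_values <- coef_table[-1, "Pr(>|t|)"]   # drop the intercept row
significant <- names(p_values)[p_values < 0.05]

print(f)
print(f_p_value)
print(significant)
```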
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
Yes. The Residuals vs Fitted plot flags observations 323, 326, and 327 as outliers with large residuals. The Residuals vs Leverage plot shows that observation 394 and especially observation 14 have high leverage.
plot(lm.fit)
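Reading outliers and leverage points off the plots can be backed up numerically. The sketch below assumes the ISLR package is installed and uses two common rules of thumb that are assumptions here, not fixed standards: an absolute studentized residual above 3 for outliers, and leverage above three times the average value (p + 1)/n for high-leverage points.

```r
library(ISLR)

fit <- lm(mpg ~ . - name, data = Auto)

# Show the four diagnostic plots in one 2x2 panel.
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)

# Studentized residuals: |value| > 3 is a common rule of thumb for outliers.
stud_res <- rstudent(fit)
outliers <- names(stud_res)[abs(stud_res) > 3]

# Leverage: the average is (p + 1) / n; flag points well above it.
leverage <- hatvalues(fit)
avg_leverage <- length(coef(fit)) / nrow(Auto)
high_leverage <- names(leverage)[leverage > 3 * avg_leverage]

print(outliers)
print(high_leverage)
```

The printed labels are the row names of Auto, which are the same labels that plot() uses to mark points.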
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Among all possible pairwise interactions, the following are statistically significant at the 0.05 level:
cylinders:year, displacement:acceleration, displacement:year, displacement:origin, weight:acceleration, acceleration:year, and year:origin. Note that this model contains only the interaction terms, without main effects.
lm.fit1 = lm(mpg ~ cylinders:displacement + cylinders:horsepower + cylinders:weight + cylinders:acceleration + cylinders:year + cylinders:origin + displacement:horsepower + displacement:weight +displacement:acceleration + displacement:year + displacement:origin + horsepower:weight + horsepower:acceleration + horsepower:year + horsepower:origin + weight:acceleration + weight:year + weight:origin + acceleration:year + acceleration:origin + year:origin)
summary(lm.fit1)
##
## Call:
## lm(formula = mpg ~ cylinders:displacement + cylinders:horsepower +
## cylinders:weight + cylinders:acceleration + cylinders:year +
## cylinders:origin + displacement:horsepower + displacement:weight +
## displacement:acceleration + displacement:year + displacement:origin +
## horsepower:weight + horsepower:acceleration + horsepower:year +
## horsepower:origin + weight:acceleration + weight:year + weight:origin +
## acceleration:year + acceleration:origin + year:origin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.8275 -1.4924 -0.1428 1.2977 15.2100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.721e+00 2.245e+00 2.994 0.00294 **
## cylinders:displacement 1.973e-03 6.785e-03 0.291 0.77139
## cylinders:horsepower -2.427e-03 2.292e-02 -0.106 0.91576
## cylinders:weight -7.059e-04 9.531e-04 -0.741 0.45939
## cylinders:acceleration -1.895e-01 1.576e-01 -1.203 0.22991
## cylinders:year 8.276e-02 3.703e-02 2.235 0.02601 *
## cylinders:origin -6.433e-01 5.231e-01 -1.230 0.21956
## displacement:horsepower 2.755e-04 2.611e-04 1.055 0.29196
## displacement:weight 1.748e-05 1.596e-05 1.095 0.27406
## displacement:acceleration 7.896e-03 3.192e-03 2.474 0.01381 *
## displacement:year -3.999e-03 9.710e-04 -4.119 4.7e-05 ***
## displacement:origin 5.045e-02 2.035e-02 2.479 0.01361 *
## horsepower:weight -1.803e-05 2.947e-05 -0.612 0.54093
## horsepower:acceleration -4.421e-03 3.937e-03 -1.123 0.26225
## horsepower:year 1.093e-03 1.500e-03 0.729 0.46671
## horsepower:origin -4.101e-02 2.487e-02 -1.649 0.10008
## weight:acceleration -6.201e-04 2.112e-04 -2.936 0.00353 **
## weight:year 1.442e-04 8.675e-05 1.663 0.09724 .
## weight:origin -1.972e-03 1.604e-03 -1.229 0.21974
## acceleration:year 2.159e-02 5.137e-03 4.204 3.3e-05 ***
## acceleration:origin 4.872e-02 1.343e-01 0.363 0.71698
## year:origin 6.172e-02 3.090e-02 1.997 0.04651 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.967 on 370 degrees of freedom
## Multiple R-squared: 0.8632, Adjusted R-squared: 0.8554
## F-statistic: 111.2 on 21 and 370 DF, p-value: < 2.2e-16
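The model above spells out all 21 pairwise interactions by hand and contains no main effects. As a hedged alternative, the formula shorthand `.^2` expands to every main effect plus every pairwise interaction. This is a different model from lm.fit1, so its p-values will not match the table above. The sketch assumes the ISLR package is installed, and the object names are illustrative:

```r
library(ISLR)

# Keep only the numeric columns so that "." does not pick up name.
auto_numeric <- subset(Auto, select = -name)

# .^2 expands to all main effects plus all pairwise interactions.
fit_pairwise <- lm(mpg ~ .^2, data = auto_numeric)

coef_table <- coef(summary(fit_pairwise))
is_interaction <- grepl(":", rownames(coef_table), fixed = TRUE)

# Interaction terms with p-value below 0.05 in this (different) model.
interaction_p <- coef_table[is_interaction, "Pr(>|t|)"]
print(sort(interaction_p[interaction_p < 0.05]))
```

Including the main effects follows the hierarchical principle: an interaction is usually interpreted alongside the main effects of the variables involved.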
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
For this part I fit two linear regression models for mpg with displacement as the only predictor. The first model, lm.fit2, uses displacement directly; the second, lm.fit3, uses log(displacement) instead. With the log transformation the adjusted R-squared rises from 0.6473 to 0.6855 and the residual standard error falls from 4.635 to 4.377, so the log-transformed model fits the data better.
lm.fit2= lm(mpg~displacement ,data = Auto)
lm.fit3= lm(mpg~log(displacement) ,data = Auto)
summary(lm.fit2)
##
## Call:
## lm(formula = mpg ~ displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.9170 -3.0243 -0.5021 2.3512 18.6128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.12064 0.49443 71.03 <2e-16 ***
## displacement -0.06005 0.00224 -26.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.635 on 390 degrees of freedom
## Multiple R-squared: 0.6482, Adjusted R-squared: 0.6473
## F-statistic: 718.7 on 1 and 390 DF, p-value: < 2.2e-16
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ log(displacement), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.1204 -2.5843 -0.4217 2.1979 19.9005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85.6906 2.1422 40.00 <2e-16 ***
## log(displacement) -12.1385 0.4155 -29.21 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.377 on 390 degrees of freedom
## Multiple R-squared: 0.6863, Adjusted R-squared: 0.6855
## F-statistic: 853.4 on 1 and 390 DF, p-value: < 2.2e-16
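The comparison above covers only the log transformation. A minimal sketch of how the square-root and squared transformations could be compared in the same way is shown below, assuming the ISLR package is installed. The list name `formulas` and the choice of adjusted R-squared as the single comparison metric are illustrative assumptions.

```r
library(ISLR)

# One single-predictor model per transformation of displacement.
formulas <- list(
  linear  = mpg ~ displacement,
  log     = mpg ~ log(displacement),
  sqrt    = mpg ~ sqrt(displacement),
  squared = mpg ~ I(displacement^2)
)

# Fit each model and collect its adjusted R-squared.
adj_r2 <- sapply(formulas, function(f) summary(lm(f, data = Auto))$adj.r.squared)
print(round(adj_r2, 4))
```

The `I()` wrapper is needed for the squared term because `^` has a special meaning inside a model formula.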
This question should be answered using the Carseats data set.
library(ISLR)
attach(Carseats)
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm.fit=lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
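From the results above, Price and USYes are significant predictors of Sales, while UrbanYes is not (p-value 0.936). Sales is recorded in thousands of units, so, holding the other predictors fixed, each $1 increase in Price is associated with a decrease of about 54 units sold (0.054459 thousand), and stores in the US sell about 1,201 more units on average than stores outside the US. The UrbanYes coefficient implies that urban stores sell about 22 fewer units than rural stores, but this difference is not statistically distinguishable from zero.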
(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Sales = 13.043469 - 0.054459(Price) - 0.021916(UrbanYes) + 1.200573(USYes)
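In the equation, UrbanYes and USYes are dummy variables that equal 1 when the store is urban or in the US, and 0 otherwise. As a quick check, the sketch below evaluates the equation for a hypothetical store (Price of 100, urban, in the US, values chosen only for illustration) and compares it with predict(). With the coefficients printed above this works out to roughly 8.78, that is, about 8,780 units. The ISLR package is assumed to be installed.

```r
library(ISLR)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)

# Hypothetical store: price of 100, urban location, inside the US.
new_store <- data.frame(
  Price = 100,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)

# Prediction from the fitted model (Sales is in thousands of units).
predicted <- unname(predict(fit, newdata = new_store))

# The same value from the equation, with each dummy variable set to 1.
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["Price"] * 100 + b["UrbanYes"] * 1 + b["USYes"] * 1)

print(c(predicted = predicted, by_hand = by_hand))
```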
(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?
Price and US. Both have p-values far below 0.05, whereas Urban (p-value 0.936) does not.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm.fit1=lm(Sales ~ Price + US, data = Carseats)
summary(lm.fit1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the data?
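Both models have a multiple R-squared of 0.2393 and an adjusted R-squared of about 0.23 (0.2335 for (a), 0.2354 for (e)). Only about 24% of the variance in Sales is explained by the predictors, so neither model fits the data well. Dropping Urban costs essentially nothing in fit.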
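Because the model in (e) is nested inside the model in (a), the two can also be compared formally with an F-test. The sketch below assumes the ISLR package is installed, and the object names are illustrative. With a single dropped term, the F-test p-value coincides with the t-test p-value reported for UrbanYes.

```r
library(ISLR)

fit_full  <- lm(Sales ~ Price + Urban + US, data = Carseats)
fit_small <- lm(Sales ~ Price + US, data = Carseats)

# F-test of the null hypothesis that the dropped coefficient (UrbanYes) is zero.
comparison <- anova(fit_small, fit_full)
print(comparison)

# With one dropped term, this p-value equals the t-test p-value for UrbanYes.
f_test_p <- comparison[2, "Pr(>F)"]
t_test_p <- coef(summary(fit_full))["UrbanYes", "Pr(>|t|)"]
print(c(f_test_p = f_test_p, t_test_p = t_test_p))
```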
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(lm.fit1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
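Each interval is the estimate plus or minus a t critical value times the standard error, with 397 residual degrees of freedom for this model. The following sketch reproduces the confint() result by hand, assuming the ISLR package is installed; `manual_ci` is an illustrative name.

```r
library(ISLR)

fit_small <- lm(Sales ~ Price + US, data = Carseats)
coef_table <- coef(summary(fit_small))

# Critical value of the t distribution with the residual degrees of freedom.
t_crit <- qt(0.975, df = df.residual(fit_small))

# estimate +/- t_crit * standard error, one row per coefficient.
manual_ci <- cbind(
  lower = coef_table[, "Estimate"] - t_crit * coef_table[, "Std. Error"],
  upper = coef_table[, "Estimate"] + t_crit * coef_table[, "Std. Error"]
)
print(manual_ci)
```

None of the three intervals contains zero, which agrees with the significance results in (d).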
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
Based on the plots below, observations 68 and 377 appear to be outliers, and observation 368 has high leverage.
plot(lm.fit1)
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
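Without an intercept, the coefficient estimate for the regression of Y onto X is Σ x_i y_i / Σ x_i^2, and the estimate for the regression of X onto Y is Σ x_i y_i / Σ y_i^2. The two estimates are the same when the sum of squares of the observed y values equals the sum of squares of the observed x values, Σ y_i^2 = Σ x_i^2 (or, trivially, when Σ x_i y_i = 0 and both estimates are zero).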
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x=rnorm(100)
y=x^2
coefficients(lm(x ~ y))
## (Intercept) y
## -0.15003887 -0.08993973
coefficients(lm(y ~ x))
## (Intercept) x
## 0.9935323 -0.1625477
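The calls above fit models that include an intercept, while the question concerns regression without one. A hedged sketch of the no-intercept version follows; the seed, the slope of 2, and the object names are arbitrary illustrative choices. Adding `+ 0` to the formula removes the intercept, and each fitted coefficient can be checked against its closed-form expression.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x <- rnorm(100)
y <- 2 * x + rnorm(100)

# "+ 0" removes the intercept from the model.
coef_y_on_x <- unname(coef(lm(y ~ x + 0)))
coef_x_on_y <- unname(coef(lm(x ~ y + 0)))

# Closed-form estimates: sum(x*y) / sum(x^2) and sum(x*y) / sum(y^2).
print(c(coef_y_on_x, sum(x * y) / sum(x^2)))
print(c(coef_x_on_y, sum(x * y) / sum(y^2)))
```

The two coefficients differ here because the sum of squares of y is larger than the sum of squares of x.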
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X
x=rnorm(100)
y=x
coefficients(lm(x ~ y))
## (Intercept) y
## -1.665335e-17 1.000000e+00
coefficients(lm(y ~ x))
## (Intercept) x
## -1.665335e-17 1.000000e+00
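Setting y equal to x is the simplest case. A less trivial sketch, again without an intercept, takes y to be a random permutation of x: the values are paired differently, but the two sums of squares are identical, so the two coefficient estimates coincide. The seed and object names are arbitrary illustrative choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

x <- rnorm(100)
y <- sample(x)  # a random permutation of x, so sum(y^2) equals sum(x^2)

# Regressions without an intercept in both directions.
coef_y_on_x <- unname(coef(lm(y ~ x + 0)))
coef_x_on_y <- unname(coef(lm(x ~ y + 0)))

print(c(coef_y_on_x = coef_y_on_x, coef_x_on_y = coef_x_on_y))
```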