##2## Carefully explain the differences between the KNN classifier and KNN regression methods. The KNN classifier makes a qualitative prediction: it assigns an observation to the class that is most common among its K nearest neighbors. KNN regression makes a quantitative prediction: it estimates the response as the average of the responses of the K nearest neighbors.
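A minimal sketch of the two methods on toy data (the data and variable names here are illustrative, not from the text; it assumes the class package is installed):
library(class)  # provides knn() for classification
set.seed(1)
train.X <- matrix(rnorm(100), ncol = 2)                 # 50 training points in two dimensions
test.X <- matrix(rnorm(20), ncol = 2)                   # 10 test points
cls <- factor(sample(c("A", "B"), 50, replace = TRUE))  # qualitative response
y <- rnorm(50)                                          # quantitative response
# KNN classification: majority vote among the K = 5 nearest neighbors
knn(train.X, test.X, cl = cls, k = 5)
# KNN regression: average the response over the K = 5 nearest neighbors (done by hand)
knn.reg.avg <- function(x0, X, y, k = 5) {
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))  # Euclidean distance from x0 to every training point
  mean(y[order(d)[1:k]])
}
apply(test.X, 1, knn.reg.avg, X = train.X, y = y)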
##9## This question involves the use of multiple linear regression on the Auto data set.
library(ISLR)
attach(Auto)
plot(Auto)  # scatterplot matrix of all the variables
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])  # correlation matrix, excluding the qualitative name variable
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
fit <- lm(mpg ~ ., data = Auto[, 1:8])
summary(fit)
##
## Call:
## lm(formula = mpg ~ ., data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
For instance: i. Is there a relationship between the predictors and the response? Yes: the F-statistic is 252.4 with a p-value below 2.2e-16, so at least one of the predictors is related to the response. ii. Which predictors appear to have a statistically significant relationship to the response? Displacement, weight, year, and origin, each with a small p-value. iii. What does the coefficient for the year variable suggest? Holding the other predictors fixed, each additional model year is associated with an average increase of 0.750773 in mpg. The model also appears to fit reasonably well: the R-squared indicates that 82.15% of the variability in mpg is explained by the predictors.
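The year effect and its uncertainty can also be read off the fitted object directly (a quick check on the fit above; output omitted):
coef(fit)["year"]    # the estimated mpg gain per model year
confint(fit, "year") # 95% confidence interval for the year coefficient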
par(mfrow = c(2, 2))
plot(fit)
# Refit with interaction effects between selected pairs of predictors
fit2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year, data = Auto[, 1:8])
summary(fit2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight +
## acceleration * year, data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3265 -1.5779 0.0389 1.3483 11.6961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.162e+02 1.853e+01 6.274 9.53e-10 ***
## cylinders -1.803e-01 4.776e-01 -0.377 0.7061
## displacement -2.867e-02 1.425e-02 -2.013 0.0449 *
## horsepower -2.261e-01 2.609e-02 -8.664 < 2e-16 ***
## weight -1.019e-02 9.020e-04 -11.296 < 2e-16 ***
## acceleration -7.081e+00 1.158e+00 -6.113 2.41e-09 ***
## year -6.719e-01 2.417e-01 -2.780 0.0057 **
## cylinders:displacement 2.790e-03 2.067e-03 1.350 0.1779
## horsepower:weight 5.154e-05 6.727e-06 7.661 1.53e-13 ***
## acceleration:year 9.113e-02 1.502e-02 6.069 3.10e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.819 on 382 degrees of freedom
## Multiple R-squared: 0.8726, Adjusted R-squared: 0.8696
## F-statistic: 290.6 on 9 and 382 DF, p-value: < 2.2e-16
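From this output, the horsepower:weight and acceleration:year interactions appear statistically significant (both p-values are well below 0.001), while cylinders:displacement does not (p = 0.1779).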
par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
par(mfrow = c(2, 2))
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)
par(mfrow = c(2, 2))
plot(log(Auto$cylinders), Auto$mpg)
plot(sqrt(Auto$cylinders), Auto$mpg)
plot((Auto$cylinders)^2, Auto$mpg)
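Beyond eyeballing the scatterplots, the transformations can be compared by fitting each transformed predictor directly; a minimal sketch (model names are illustrative; output omitted):
fit.log <- lm(mpg ~ log(horsepower), data = Auto)
fit.sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
fit.sq <- lm(mpg ~ I(horsepower^2), data = Auto)
c(log = summary(fit.log)$r.squared,
  sqrt = summary(fit.sqrt)$r.squared,
  squared = summary(fit.sq)$r.squared)  # compare R-squared across transformations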
##10## This question should be answered using the Carseats data set.
library(ISLR)
attach(Carseats)
fit3 <- lm(Sales ~ Price + Urban + US)
summary(fit3)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative! The Price coefficient indicates that, holding the other predictors fixed, a $1 increase in price is associated with a decrease in sales of about 54.5 units (Sales is measured in thousands of units). The Urban coefficient suggests no relationship between Sales and Urban, given its large p-value (0.936). The US coefficient indicates that, on average, a US store sells about 1,200 more units than a non-US store, all else equal.
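To see the US effect concretely, one can compare predicted sales for two otherwise identical stores (the Price of 100 used here is only an illustrative value; output omitted):
predict(fit3, newdata = data.frame(Price = 100, Urban = "Yes", US = c("Yes", "No")))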
Write out the model in equation form, being careful to handle the qualitative variables properly. Sales = 13.043469 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes, where UrbanYes = 1 for an urban store and 0 otherwise, and USYes = 1 for a US store and 0 otherwise.
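The 0/1 dummy coding R uses for the qualitative variables can be verified directly (output omitted):
head(model.matrix(fit3))  # shows the UrbanYes and USYes indicator columns
contrasts(Carseats$Urban) # "No" is the baseline level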
For which of the predictors can you reject the null hypothesis H0 : βj = 0? We can reject the null hypothesis for Price and US, since both have very small p-values; we cannot reject it for Urban.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit4 <- lm(Sales ~ Price + US)
summary(fit4)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data? Neither model fits especially well: in both cases the R-squared shows that only about 23.9% of the variability in Sales is explained by the model, though the smaller model in (e) is no worse (its adjusted R-squared even rises slightly, from 0.2335 to 0.2354).
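An F-test comparing the two nested models confirms that dropping Urban loses essentially nothing (output omitted):
anova(fit4, fit3)  # compares the smaller model (e) against the fuller model (a)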
Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(fit4)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2, 2))
plot(fit4)
##12## This problem involves simple linear regression without an intercept. (a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38): βˆ = (Σ xᵢyᵢ) / (Σ xᵢ²). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? Regressing X onto Y gives (Σ xᵢyᵢ) / (Σ yᵢ²); the numerators are identical, so the two estimates are the same exactly when the denominators agree, i.e. when Σ xᵢ² = Σ yᵢ², as the example for part (c) below demonstrates.
set.seed(1)
# (b) An example where the two estimates differ: sum(x^2) != sum(y^2)
x <- 1:100
sum(x^2)
## [1] 338350
y <- x * -153  # an arbitrary rescaling, so sum(y^2) != sum(x^2) and the slopes differ
sum(y^2)
## [1] 7920435150
fit.X <- lm(y ~ x + 0)  # regression of Y onto X
fit.Y <- lm(x ~ y + 0)  # regression of X onto Y
summary(fit.Y)
## Warning in summary.lm(fit.Y): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.493e-13 -1.826e-15 4.100e-17 1.549e-15 1.140e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y -6.536e-03 2.846e-19 -2.297e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.276e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit.X)
## Warning in summary.lm(fit.X): essentially perfect fit: summary may be unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.926e-12 -2.460e-13 -1.000e-15 2.110e-13 3.282e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x -1.530e+02 5.745e-15 -2.663e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.342e-12 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 7.093e+32 on 1 and 99 DF, p-value: < 2.2e-16
# (c) An example where the two estimates are the same: sum(x^2) == sum(y^2)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1  # the same values in reverse order, so sum(y^2) == sum(x^2)
sum(y^2)
## [1] 338350
fit.Y <- lm(y ~ x + 0)  # regression of Y onto X
fit.X <- lm(x ~ y + 0)  # regression of X onto Y
summary(fit.Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit.X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
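As a closing check on part (a), the closed-form estimates from (3.38) reproduce the 0.5075 slope reported in both summaries above:
sum(x * y) / sum(x^2)  # slope for the regression of Y onto X
sum(x * y) / sum(y^2)  # slope for the regression of X onto Y; equal because sum(x^2) == sum(y^2)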