Carefully explain the differences between the KNN classifier and KNN regression methods.
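The KNN classifier is used when the response is qualitative: it finds the K training observations closest to the test point, estimates the conditional probability of each class as the fraction of those neighbors in that class, and predicts the class with the highest estimated probability.
KNN regression is used when the response is quantitative: it predicts the response as the average of the responses of the K nearest training observations.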
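The contrast can be sketched in a few lines of base R. The training data, the helper `nearest()`, and the functions `knn_classify()` and `knn_regress()` below are all made up for illustration; they are a minimal one-dimensional sketch, not a replacement for a tested KNN implementation.

```r
# Toy one-dimensional training data (hypothetical values for illustration).
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- c("A", "A", "A", "B", "B", "B")   # qualitative response
y_num   <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)   # quantitative response

# Indices of the k training points closest to x0.
nearest <- function(x0, k) order(abs(x_train - x0))[1:k]

# KNN classifier: estimate class probabilities among the neighbors,
# then predict the class with the largest estimated probability.
knn_classify <- function(x0, k) {
  probs <- table(y_class[nearest(x0, k)]) / k
  names(probs)[which.max(probs)]
}

# KNN regression: average the responses of the k nearest neighbors.
knn_regress <- function(x0, k) mean(y_num[nearest(x0, k)])

knn_classify(2.5, k = 3)  # "A": all three neighbors belong to class A
knn_regress(2.5, k = 3)   # 1.5: mean of 1.0, 1.5 and 2.0
```

Both functions use the same neighbor search; only the final step differs (a vote over classes versus an average of values).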
library(MASS)
library(ISLR)
Question 9
auto <- Auto
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
Question 9a
pairs(auto)
Question 9b
cor(auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Question 9c
lm_mpg <- lm(mpg~.-name, data = auto)
summary(lm_mpg)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Question 9ci Yes, there is a relationship between the predictors and the response: the overall F-statistic is 252.4 with a p-value < 2.2e-16, far below 0.05.
Question 9cii displacement, weight, year, and origin have a statistically significant relationship to the response at the 0.05 level; cylinders, horsepower, and acceleration do not.
Question 9ciii The coefficient for year suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 mpg.
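The same conclusions can be read off the fitted model programmatically rather than from the printed table. The sketch below refits the model from the `Auto` data in the ISLR package; the names `fit_summary`, `overall_p`, and `significant` are introduced here for illustration only.

```r
library(ISLR)

lm_mpg <- lm(mpg ~ . - name, data = Auto)
fit_summary <- summary(lm_mpg)

# Overall F-test: p-value for the null that all slope coefficients are zero.
f <- fit_summary$fstatistic
overall_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
overall_p

# Individual t-tests: names of terms with p-value below 0.05.
coef_table <- coef(fit_summary)
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
significant
```

The 0.05 cutoff is the conventional level used in the answers above; a different level would change which terms are listed.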
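Question 9d The linear model does not appear to be a good fit: the Residuals vs Fitted plot shows a curved pattern, which indicates a non-linear relationship. No points in the Residuals vs Leverage plot lie beyond the Cook's distance contour at 0.5, so no single observation appears to be unduly influential. Note that Cook's distance measures influence rather than outlyingness; the largest residual (13.06) is almost four times the residual standard error (3.328), so a few observations may still be worth checking as potential outliers. The leverage plot shows point 14 as having high leverage because it is in the bottom right portion of the plot, far from the other points.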
par(mfrow = c(2, 2))
plot(lm_mpg)
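The diagnostic plots can be backed up with numbers. The following is a minimal sketch using base R's influence functions on the same model; the cutoff of 3 for studentized residuals is a common rule of thumb rather than a fixed rule, and the variable names are illustrative.

```r
library(ISLR)

lm_mpg <- lm(mpg ~ . - name, data = Auto)

# Studentized residuals: absolute values above about 3 are often
# treated as possible outliers.
stud_res <- rstudent(lm_mpg)
which(abs(stud_res) > 3)

# Leverage: compare each hat value with the average leverage (p + 1) / n.
lev <- hatvalues(lm_mpg)
avg_lev <- length(coef(lm_mpg)) / nrow(Auto)
head(sort(lev, decreasing = TRUE), 3)

# Cook's distance combines residual size and leverage into one
# influence measure per observation.
max(cooks.distance(lm_mpg))
```

Observations are labelled by their row names in `Auto`, which is how the plots label them as well.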
Question 9e
Both interaction terms, horsepower:year and horsepower:acceleration, are statistically significant (p-values of 9.87e-12 and 4.91e-06).
lm_interact <- lm(mpg~.-name + year:horsepower + horsepower*acceleration, data = auto)
summary(lm_interact)
##
## Call:
## lm(formula = mpg ~ . - name + year:horsepower + horsepower *
## acceleration, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4237 -1.6168 -0.1576 1.5501 11.4807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.354e+01 9.850e+00 -9.496 < 2e-16 ***
## cylinders 3.693e-01 3.014e-01 1.225 0.221183
## displacement -1.078e-02 7.703e-03 -1.400 0.162320
## horsepower 7.602e-01 9.307e-02 8.167 4.65e-15 ***
## weight -3.814e-03 6.751e-04 -5.649 3.16e-08 ***
## acceleration 5.977e-01 1.618e-01 3.694 0.000253 ***
## year 1.645e+00 1.344e-01 12.235 < 2e-16 ***
## origin 9.583e-01 2.537e-01 3.778 0.000184 ***
## horsepower:year -9.406e-03 1.339e-03 -7.025 9.87e-12 ***
## horsepower:acceleration -8.169e-03 1.762e-03 -4.635 4.91e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.963 on 382 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8558
## F-statistic: 258.9 on 9 and 382 DF, p-value: < 2.2e-16
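Because the model without interactions is nested inside the model with them, the two can also be compared with a partial F-test. This is a sketch of that comparison using `anova()`; it tests both interaction terms jointly, which complements the individual t-tests in the summary.

```r
library(ISLR)

lm_mpg <- lm(mpg ~ . - name, data = Auto)
lm_interact <- lm(mpg ~ . - name + year:horsepower + horsepower * acceleration,
                  data = Auto)

# Partial F-test: null hypothesis is that both interaction
# coefficients are zero.
model_cmp <- anova(lm_mpg, lm_interact)
model_cmp
```

A small p-value in the second row indicates that the interaction terms improve the fit beyond what the main effects achieve.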
Question 9f
Plotting mpg against transformed versions of displacement, the log and square root transformations bring the relationship much closer to linear than squaring does.
par(mfrow = c(2, 2))
plot(log(auto$displacement), auto$mpg)
plot(sqrt(auto$displacement), auto$mpg)
plot((auto$displacement)^2, auto$mpg)
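The visual impression can be checked by fitting a simple regression for each transformation and comparing R-squared. This is a rough sketch: the list `fits` and the vector `r_squared` are illustrative names, and R-squared is only one of several reasonable ways to compare these fits (it is comparable here because the response, mpg, is the same in every model).

```r
library(ISLR)

# One simple regression of mpg per transformation of displacement.
fits <- list(
  raw     = lm(mpg ~ displacement, data = Auto),
  log     = lm(mpg ~ log(displacement), data = Auto),
  sqrt    = lm(mpg ~ sqrt(displacement), data = Auto),
  squared = lm(mpg ~ I(displacement^2), data = Auto)
)

# Proportion of variance in mpg explained by each version.
r_squared <- sapply(fits, function(fit) summary(fit)$r.squared)
round(r_squared, 3)
```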
*******************************************************************************
Question 10
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
?Carseats
carseats <- Carseats
seat_fit <- lm(Sales~Price + Urban + US, data = carseats)
summary(seat_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Question 10b
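As Price increases, sales decrease: holding the other predictors fixed, each one-dollar increase in price is associated with about 54 fewer units sold (Sales is recorded in thousands of units, so -0.054459 corresponds to about -54 units). Stores in the US sell about 1,201 more units on average than stores outside the US. The coefficient for urban stores is slightly negative, but it is not statistically significant (p = 0.936), so we would drop Urban from the final model.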
Question 10c
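Sales = 13.043469 - 0.054459 * Price - 0.021916 * UrbanYes + 1.200573 * USYes + ε, where UrbanYes = 1 for a store in an urban location (0 otherwise) and USYes = 1 for a store in the US (0 otherwise).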
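To make the role of the dummy variables concrete, the fitted equation can be evaluated by hand. The sketch below hard-codes the coefficient estimates from the summary output above; the function `predict_sales()` and the example store (Price = 100) are hypothetical.

```r
# Coefficient estimates copied from the summary(seat_fit) output.
b_intercept <- 13.043469
b_price     <- -0.054459
b_urban     <- -0.021916
b_us        <- 1.200573

# Predicted Sales (in thousands of units); urban and us are logical
# flags that are converted to 0/1 dummy variables.
predict_sales <- function(price, urban, us) {
  b_intercept + b_price * price + b_urban * as.numeric(urban) + b_us * as.numeric(us)
}

predict_sales(price = 100, urban = TRUE,  us = TRUE)   # 8.776226
predict_sales(price = 100, urban = FALSE, us = FALSE)  # 7.597569
```

With the fitted model object available, `predict(seat_fit, newdata = ...)` would give the same values up to rounding of the coefficients.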
Question 10d
We can reject the null hypothesis that the coefficient is zero for Price and US, since both p-values are far below 0.05. We cannot reject it for Urban (p = 0.936).
Question 10e
newseat_fit <- lm(Sales~Price + US, data = carseats)
summary(newseat_fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Question 10f
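Neither model fits the data very well: both explain only about 23.9% of the variability in Sales (multiple R-squared = 0.2393). The smaller model is slightly better, but not by much: its adjusted R-squared is 0.2354 compared to 0.2335 for the larger model, and its residual standard error is 2.469 compared to 2.472.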
Question 10g
confint(newseat_fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
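These intervals follow the usual form of estimate plus or minus a t critical value times the standard error. As a sketch, the Price interval can be reproduced from the rounded estimate, standard error, and residual degrees of freedom printed in the summary of the smaller model; the variable names are illustrative.

```r
# Values copied from the summary(newseat_fit) output for Price.
est      <- -0.05448
se       <- 0.00523
df_resid <- 397

# 95% interval: estimate +/- t critical value * standard error.
t_crit   <- qt(0.975, df = df_resid)
ci_price <- est + c(-1, 1) * t_crit * se
ci_price
```

The result agrees with the Price row of `confint(newseat_fit)` to about four decimal places; the small difference comes from using rounded inputs.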
Question 10h There do not appear to be any outliers. However, the Residuals vs Leverage plot shows one point separated from the rest by a large horizontal gap, which suggests a high-leverage observation.
par(mfrow = c(2, 2))
plot(newseat_fit)
Question 12a
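For regression without an intercept, the coefficient estimate for the regression of Y onto X is sum(x*y) / sum(x^2), and the estimate for the regression of X onto Y is sum(x*y) / sum(y^2). The two estimates are therefore the same when the sum of x^2 equals the sum of y^2.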
Question 12b
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- x * 26
sum(y^2)
## [1] 228724600
X_fit <- lm(y~x + 0)
Y_fit <- lm(x~y + 0)
coef(X_fit)
## x
## 26
coef(Y_fit)
## y
## 0.03846154
Question 12c
set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
X_fit2 <- lm(y~x + 0)
Y_fit2 <- lm(x~y + 0)
coef(X_fit2)
## x
## 0.5074627
coef(Y_fit2)
## y
## 0.5074627
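The matching coefficients can be traced back to the closed-form estimate for regression through the origin, sum(x*y) / sum(x^2). The sketch below recomputes both slopes directly for the same x and y; `slope_y_on_x` and `slope_x_on_y` are names introduced for illustration.

```r
x <- 1:100
y <- 100:1

# Closed-form least squares slopes for regression without an intercept.
slope_y_on_x <- sum(x * y) / sum(x^2)
slope_x_on_y <- sum(x * y) / sum(y^2)

c(slope_y_on_x, slope_x_on_y)  # both approximately 0.5074627

# The closed form agrees with lm() fitted through the origin.
all.equal(slope_y_on_x, unname(coef(lm(y ~ x + 0))))
```

The two slopes share the numerator sum(x*y), so they coincide exactly when the denominators sum(x^2) and sum(y^2) are equal, as they are here because y is a reordering of x.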