R Markdown

Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.
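The KNN classifier predicts a qualitative response: for a test point it finds the K nearest training observations, estimates the conditional probability of each class as the fraction of those neighbors in that class, and assigns the point to the class with the highest estimated probability.
KNN regression predicts a quantitative response: it finds the K nearest training observations in the same way but returns the average of their response values.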
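
The contrast can be sketched in a few lines of base R. This is a minimal one-dimensional illustration, not a library implementation: the helper names (nearest_k, knn_classify, knn_regress) and the toy data are made up for this example.

```r
# Minimal 1-D KNN sketch in base R (illustrative only).

# Indices of the k training points closest to x0.
nearest_k <- function(x_train, x0, k) {
  order(abs(x_train - x0))[seq_len(k)]
}

# KNN classifier: estimate class probabilities as neighbor fractions,
# then predict the class with the highest estimated probability.
knn_classify <- function(x_train, y_class, x0, k) {
  idx <- nearest_k(x_train, x0, k)
  probs <- table(y_class[idx]) / k
  names(probs)[which.max(probs)]
}

# KNN regression: predict the mean response of the same neighbors.
knn_regress <- function(x_train, y_num, x0, k) {
  idx <- nearest_k(x_train, x0, k)
  mean(y_num[idx])
}

# Hypothetical toy data.
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- c("a", "a", "b", "b", "b", "a")
y_num   <- c(1.0, 1.5, 2.0, 6.5, 7.0, 9.0)

pred_class <- knn_classify(x_train, y_class, x0 = 2.4, k = 3)
pred_value <- knn_regress(x_train, y_num, x0 = 2.4, k = 3)
```

Both functions use the same neighbors; only the final step differs (a majority vote versus an average).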

library(MASS)
library(ISLR)

Question 9

auto <- Auto
names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

Question 9a

pairs(auto)

Question 9b

cor(auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Question 9c

lm_mpg <- lm(mpg~.-name, data = auto)
summary(lm_mpg)
## 
## Call:
## lm(formula = mpg ~ . - name, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Question 9ci Yes, there is a relationship between the predictors and the response: the F-statistic is 252.4 on 7 and 384 degrees of freedom, and its p-value (< 2.2e-16) is far below 0.05.
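
The p-value of the overall F-test can also be computed directly rather than read off the printout. The sketch below does this once from the figures printed in the summary above, and once from a fitted model. The helper overall_p and the stand-in model on the built-in mtcars data are assumptions made so the snippet runs without the ISLR package.

```r
# Overall F-test p-value from a fitted lm object.
overall_p <- function(fit) {
  f <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Same calculation from the figures printed in the summary above:
# F = 252.4 on 7 and 384 degrees of freedom.
p_from_output <- pf(252.4, df1 = 7, df2 = 384, lower.tail = FALSE)
p_from_output < 2.2e-16

# Stand-in model on mtcars (assumption: used only for illustration).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
p_demo <- overall_p(fit_demo)
```

With the model from this question, the call would be overall_p(lm_mpg).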

Question 9cii The predictors displacement, weight, year, and origin have a statistically significant relationship with the response (p-values below 0.05); cylinders, horsepower, and acceleration do not.

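Question 9ciii The coefficient for year (0.750773) suggests that, holding the other predictors fixed, mpg increases by about 0.75 on average for each one-year increase in model year, so newer cars tend to be more fuel efficient.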

Question 9d The linear model does not appear to fit well: the Residuals vs Fitted plot shows a curved pattern, which indicates a non-linear relationship. No points in the Residuals vs Leverage plot lie beyond the Cook’s distance line at 0.5, so there do not appear to be any unduly influential outliers. The plot does show point 14 with unusually high leverage: it sits in the bottom right portion of the plot, far from the other points.

par(mfrow = c(2, 2))
plot(lm_mpg)
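
The visual checks can be backed up with numbers. The sketch below collects leverage, studentized residuals, and Cook's distance for each observation. The helper diagnose, the cutoffs (leverage above twice the average, absolute studentized residual above 3), and the stand-in mtcars model are assumptions for illustration, and the cutoffs are common rules of thumb rather than formal tests.

```r
# Numeric companions to the diagnostic plots.
# Assumes the model includes an intercept.
diagnose <- function(fit) {
  p <- length(coef(fit)) - 1   # number of predictors
  n <- nobs(fit)
  data.frame(
    leverage = hatvalues(fit),
    student  = rstudent(fit),
    cooks    = cooks.distance(fit),
    high_lev = hatvalues(fit) > 2 * (p + 1) / n,  # rule of thumb
    outlier  = abs(rstudent(fit)) > 3             # rule of thumb
  )
}

# Stand-in model on mtcars (assumption: used only for illustration).
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
diag_demo <- diagnose(fit_demo)

# Observations flagged by either rule.
flagged <- diag_demo[diag_demo$high_lev | diag_demo$outlier, ]
```

For this question the call would be diagnose(lm_mpg), and the row for observation 14 could be inspected directly.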

Question 9e
Both interaction terms, horsepower:year and horsepower:acceleration, are statistically significant (p-values 9.87e-12 and 4.91e-06).

lm_interact <- lm(mpg~.-name + year:horsepower + horsepower*acceleration, data = auto)
summary(lm_interact)
## 
## Call:
## lm(formula = mpg ~ . - name + year:horsepower + horsepower * 
##     acceleration, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4237 -1.6168 -0.1576  1.5501 11.4807 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -9.354e+01  9.850e+00  -9.496  < 2e-16 ***
## cylinders                3.693e-01  3.014e-01   1.225 0.221183    
## displacement            -1.078e-02  7.703e-03  -1.400 0.162320    
## horsepower               7.602e-01  9.307e-02   8.167 4.65e-15 ***
## weight                  -3.814e-03  6.751e-04  -5.649 3.16e-08 ***
## acceleration             5.977e-01  1.618e-01   3.694 0.000253 ***
## year                     1.645e+00  1.344e-01  12.235  < 2e-16 ***
## origin                   9.583e-01  2.537e-01   3.778 0.000184 ***
## horsepower:year         -9.406e-03  1.339e-03  -7.025 9.87e-12 ***
## horsepower:acceleration -8.169e-03  1.762e-03  -4.635 4.91e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.963 on 382 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8558 
## F-statistic: 258.9 on 9 and 382 DF,  p-value: < 2.2e-16
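
Interaction terms can also be tested jointly by comparing nested models with anova(). The sketch below uses a stand-in pair of models on the built-in mtcars data (an assumption, so the snippet runs without ISLR). With a single added term, the F-test p-value matches the t-test p-value for that term.

```r
# Nested models: main effects only, then main effects plus interaction.
fit_main <- lm(mpg ~ hp + wt, data = mtcars)
fit_int  <- lm(mpg ~ hp * wt, data = mtcars)

# Partial F-test for the added interaction term.
cmp <- anova(fit_main, fit_int)
anova_p <- cmp[["Pr(>F)"]][2]

# t-test p-value for the same term in the larger model.
t_p <- coef(summary(fit_int))["hp:wt", "Pr(>|t|)"]
```

For the models in this question, anova(lm_mpg, lm_interact) would test both interaction terms at once.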

Question 9f
Of the three transformations of displacement, log and square root bring its relationship with mpg much closer to linear than squaring does.

par(mfrow = c(2, 2))
plot(log(auto$displacement), auto$mpg)
plot(sqrt(auto$displacement), auto$mpg)
plot((auto$displacement)^2, auto$mpg)
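
A rough numeric check on the plots is to fit a simple regression on each transformed predictor and compare R-squared. The sketch below uses the disp and mpg columns of the built-in mtcars data as a stand-in (an assumption, so the snippet runs without ISLR); the transforms list is made up for this example.

```r
# Candidate transformations of the predictor.
transforms <- list(
  log    = log,
  sqrt   = sqrt,
  square = function(v) v^2
)

# R-squared of mpg regressed on each transformed predictor.
r2 <- sapply(transforms, function(f) {
  summary(lm(mpg ~ f(disp), data = mtcars))$r.squared
})
r2
```

The same loop would work on the Auto data by replacing mtcars with auto and disp with displacement.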

*******************************************************************************

Question 10

names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
?Carseats
carseats <- Carseats
seat_fit <- lm(Sales~Price + Urban + US, data = carseats)
summary(seat_fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Question 10b
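Sales is measured in thousands of units. Holding the other predictors fixed, each one-dollar increase in Price is associated with a decrease of about 54 units sold (-0.054459 thousand), and stores in the US sell about 1,201 more units on average than stores outside the US (1.200573 thousand). The coefficient for urban stores is slightly negative (-0.021916), but it is not statistically significant (p-value 0.936), so we would drop Urban from the final model.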

Question 10c
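Sales = 13.043469 - 0.054459 * Price - 0.021916 * UrbanYes + 1.200573 * USYes + ε, where UrbanYes is 1 for an urban store and 0 otherwise, and USYes is 1 for a store in the US and 0 otherwise.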
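
To see how the dummy variables enter, the fitted equation above can be evaluated by hand. The function name predict_sales and the example inputs are made up for this sketch; the coefficients are copied from the summary output, and the error term is omitted.

```r
# Fitted value from the equation above, in thousands of units.
# urban and us are logical: TRUE becomes 1, FALSE becomes 0.
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * as.numeric(urban) +
    1.200573 * as.numeric(us)
}

# Hypothetical stores: same price, differing only in US location.
sales_us     <- predict_sales(price = 100, urban = TRUE, us = TRUE)
sales_non_us <- predict_sales(price = 100, urban = TRUE, us = FALSE)

# The gap equals the USYes coefficient.
sales_us - sales_non_us
```

With the fitted model available, predict(seat_fit, newdata = ...) would give the same values without hard-coding coefficients.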

Question 10d
We can reject the null hypothesis H0: βj = 0 for Price and US, since their p-values are far below 0.05. We cannot reject it for Urban (p-value 0.936).

Question 10e

newseat_fit <- lm(Sales~Price + US, data = carseats)
summary(newseat_fit)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Question 10f
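Neither model fits the data very well: both explain only about 23.9% of the variability in Sales (R-squared = 0.2393). The smaller model is slightly better once the number of predictors is taken into account, with an adjusted R-squared of 0.2354 versus 0.2335 and a residual standard error of 2.469 versus 2.472.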

Question 10g

confint(newseat_fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
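
These intervals can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the rounded estimates and standard errors printed in the summary for Question 10e, so it agrees with confint() only up to rounding. The helper name ci_by_hand is made up for this example.

```r
# Two-sided confidence interval from an estimate and its standard error.
ci_by_hand <- function(estimate, std_error, df, level = 0.95) {
  t_crit <- qt(1 - (1 - level) / 2, df)
  c(lower = estimate - t_crit * std_error,
    upper = estimate + t_crit * std_error)
}

# Values copied from the summary of newseat_fit (397 residual df).
price_ci <- ci_by_hand(-0.05448, 0.00523, df = 397)
us_ci    <- ci_by_hand(1.19964, 0.25846, df = 397)
price_ci
us_ci
```

Neither interval contains zero, which is consistent with rejecting the null hypothesis for both predictors.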

Question 10h There do not appear to be any outliers, but the large horizontal gap between points in the Residuals vs Leverage plot suggests that at least one high leverage observation exists.

par(mfrow = c(2, 2))
plot(newseat_fit)


Question 12a
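The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X when the sum of x^2 equals the sum of y^2. Without an intercept, the estimate for Y onto X is sum(x*y) / sum(x^2) and the estimate for X onto Y is sum(x*y) / sum(y^2), so the two share a numerator and agree exactly when the denominators match.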

Question 12b

set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- x * 26
sum(y^2)
## [1] 228724600
X_fit <- lm(y~x + 0)
Y_fit <- lm(x~y + 0)
coef(X_fit)
##  x 
## 26
coef(Y_fit)
##          y 
## 0.03846154

Question 12c

set.seed(1)
x <- 1:100
sum(x^2)
## [1] 338350
y <- 100:1
sum(y^2)
## [1] 338350
X_fit2 <- lm(y~x + 0)
Y_fit2 <- lm(x~y + 0)
coef(X_fit2)
##         x 
## 0.5074627
coef(Y_fit2)
##         y 
## 0.5074627
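
The same result can be checked against the closed-form no-intercept estimate, sum(x*y) / sum(x^2). The sketch below uses a random permutation of x as y, which keeps the sum of squares equal by construction. The helper name slope0 is made up for this example.

```r
# Least-squares slope for a regression through the origin.
slope0 <- function(x, y) sum(x * y) / sum(x^2)

set.seed(1)
x <- 1:100
y <- sample(x)  # permutation of x, so sum(y^2) equals sum(x^2)

b_yx <- slope0(x, y)  # Y onto X
b_xy <- slope0(y, x)  # X onto Y

# The closed form agrees with lm(), and the two slopes agree with each other.
lm_yx <- unname(coef(lm(y ~ x + 0)))
all.equal(b_yx, lm_yx)
all.equal(b_yx, b_xy)
```

Any y with the same sum of squares as x gives matching coefficients, not just the reversed sequence used in Question 12c.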