#Name: Justin Howard
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.2
#Question 2. Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN is a non-parametric method: it does not assume a functional form for f. A small value of K gives a more flexible model with low bias but high variance (prone to overfitting); a large value of K gives a smoother model with lower variance but higher bias (prone to underfitting).
KNN regression is used when the response is quantitative. It predicts the response at a point x0 as the average of the responses of the K training observations nearest to x0.
KNN classification is used when the response is qualitative. It estimates the conditional probability of each class as the fraction of the K nearest training observations that belong to that class, and assigns x0 to the class with the highest estimated probability.
In other words, both methods start from the same K nearest neighbors: KNN regression averages their quantitative responses, while KNN classification takes a majority vote over their qualitative responses.
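The difference can be illustrated with a minimal base-R sketch for a single numeric predictor. `knn_predict()` and the toy data are hypothetical, written only for this illustration; they are not part of ISLR or any package.

```r
# Minimal KNN sketch for one numeric predictor (illustrative helper, not a package function).
knn_predict <- function(x_train, y_train, x0, k) {
  # Indices of the k training points closest to x0
  nearest <- order(abs(x_train - x0))[seq_len(k)]
  neighbors <- y_train[nearest]
  if (is.numeric(neighbors)) {
    mean(neighbors)                      # regression: average the neighbors' responses
  } else {
    names(which.max(table(neighbors)))   # classification: most frequent class among neighbors
  }
}

# Toy training data (made up for the example)
x_train <- c(1, 2, 3, 4, 5, 6)
y_num   <- c(1, 2, 3, 10, 11, 12)
y_class <- factor(c("low", "low", "low", "high", "high", "high"))

reg_pred   <- knn_predict(x_train, y_num,   x0 = 2.2, k = 3)
class_pred <- knn_predict(x_train, y_class, x0 = 2.2, k = 3)
```

The same three neighbors are used in both calls; only the way their responses are combined changes.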
##Question 9. This question involves the use of multiple linear regression on the Auto data set.
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
plot(Auto)
plot(Auto , pch=20 , cex=1.5 , col="#69b3a2")
#names(Auto)
#cor(Auto[, c("mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin")])
cor(Auto[,-c(9)])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
m_lm_auto = lm(formula = mpg ~.-name, data = Auto)
summary (m_lm_auto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
unique(Auto[c('year')])
## year
## 1 70
## 30 71
## 58 72
## 86 73
## 126 74
## 153 75
## 183 76
## 217 77
## 245 78
## 281 79
## 310 80
## 339 81
## 368 82
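Do the residual plots suggest any unusually large outliers? R labels two observations in the upper right corner of the Scale-Location plot. A common rule of thumb treats an observation as an outlier if its standardized residual exceeds 3 in absolute value. The Scale-Location plot shows the square root of the absolute standardized residual, so a value of 3 corresponds to about 1.73 on its vertical axis, not 3. The largest residual in the summary above is 13.06 against a residual standard error of 3.328, a ratio of about 3.9, so at least one observation lies beyond the cut-off and is worth inspecting.
Does the leverage plot identify any observations with unusually high leverage? R labels a few observations on the Residuals vs Leverage plot, but none falls beyond the dashed lines. Those lines are Cook's distance contours, so this indicates that no single observation is unduly influential. Leverage itself is read from the horizontal axis: a point far to the right of the others has high leverage even when it sits inside the contours.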
par(mfrow = c(2,2))
plot(m_lm_auto)
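The visual checks can be cross-checked numerically. The sketch below uses simulated stand-in data so that it runs on its own; `flag_unusual()` is a hypothetical helper, and the cut-offs (absolute standardized residual above 3, leverage above three times the average) are common rules of thumb rather than anything fixed by the assignment.

```r
# Hypothetical helper: flag outliers and high-leverage points for a fitted lm.
flag_unusual <- function(fit, resid_cut = 3, lev_mult = 3) {
  std_res  <- rstandard(fit)    # standardized residuals
  leverage <- hatvalues(fit)    # leverage; its mean equals (p + 1) / n
  list(
    outliers      = which(abs(std_res) > resid_cut),
    high_leverage = which(leverage > lev_mult * mean(leverage)),
    max_cooks     = max(cooks.distance(fit))
  )
}

# Simulated stand-in data with one planted outlier and one high-leverage point
set.seed(1)
sim <- data.frame(x = c(rnorm(49), 8))   # observation 50 sits far from the other x values
sim$y <- 2 * sim$x + rnorm(50)
sim$y[10] <- sim$y[10] + 15              # observation 10 gets a large residual
flags <- flag_unusual(lm(y ~ x, data = sim))
```

Calling `flag_unusual(m_lm_auto)` would apply the same checks to the Auto model fitted above.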
Interaction effect: weight and cylinders
lm_weight <- lm(mpg ~ .-name + weight * cylinders, data = Auto)
summary(lm_weight)
##
## Call:
## lm(formula = mpg ~ . - name + weight * cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9484 -1.7133 -0.1809 1.4530 12.4137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3143478 5.0076737 1.461 0.14494
## cylinders -5.0347425 0.5795767 -8.687 < 2e-16 ***
## displacement 0.0156444 0.0068409 2.287 0.02275 *
## horsepower -0.0314213 0.0126216 -2.489 0.01322 *
## weight -0.0150329 0.0011125 -13.513 < 2e-16 ***
## acceleration 0.1006438 0.0897944 1.121 0.26306
## year 0.7813453 0.0464139 16.834 < 2e-16 ***
## origin 0.8030154 0.2617333 3.068 0.00231 **
## cylinders:weight 0.0015058 0.0001657 9.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8501
## F-statistic: 278.1 on 8 and 383 DF, p-value: < 2.2e-16
Interaction effect: acceleration and horsepower
lm_acc <- lm(mpg ~ .-name + acceleration * horsepower, data = Auto)
summary(lm_acc)
##
## Call:
## lm(formula = mpg ~ . - name + acceleration * horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0329 -1.8177 -0.1183 1.7247 12.4870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.499820 4.923380 -6.601 1.36e-10 ***
## cylinders 0.083489 0.316913 0.263 0.792350
## displacement -0.007649 0.008161 -0.937 0.349244
## horsepower 0.127188 0.024746 5.140 4.40e-07 ***
## weight -0.003976 0.000716 -5.552 5.27e-08 ***
## acceleration 0.983282 0.161513 6.088 2.78e-09 ***
## year 0.755919 0.048179 15.690 < 2e-16 ***
## origin 1.035733 0.268962 3.851 0.000138 ***
## horsepower:acceleration -0.012139 0.001772 -6.851 2.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.145 on 383 degrees of freedom
## Multiple R-squared: 0.841, Adjusted R-squared: 0.8376
## F-statistic: 253.2 on 8 and 383 DF, p-value: < 2.2e-16
Squared transformation of horsepower: the I(horsepower^2) term is significant (p < 2e-16), as is the untransformed horsepower term in the same model.
lm_hpX2 <- lm(mpg ~ -name + horsepower + I(horsepower^2), data = Auto)
summary(lm_hpX2)
##
## Call:
## lm(formula = mpg ~ -name + horsepower + I(horsepower^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7135 -2.5943 -0.0859 2.2868 15.8961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.9000997 1.8004268 31.60 <2e-16 ***
## horsepower -0.4661896 0.0311246 -14.98 <2e-16 ***
## I(horsepower^2) 0.0012305 0.0001221 10.08 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared: 0.6876, Adjusted R-squared: 0.686
## F-statistic: 428 on 2 and 389 DF, p-value: < 2.2e-16
Log transformation of horsepower: the log(horsepower) term is significant (p < 2e-16) and has a smaller p-value than the untransformed horsepower term in the same model (p = 6.34e-05).
lm_hplog <- lm(mpg ~ -name + horsepower + log(horsepower), data = Auto)
summary(lm_hplog)
##
## Call:
## lm(formula = mpg ~ -name + horsepower + log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5118 -2.5018 -0.2533 2.4446 15.3102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.04057 12.08267 12.914 < 2e-16 ***
## horsepower 0.11846 0.02929 4.044 6.34e-05 ***
## log(horsepower) -31.59815 3.28363 -9.623 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.415 on 389 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6801
## F-statistic: 416.6 on 2 and 389 DF, p-value: < 2.2e-16
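Compare two models: horsepower alone versus horsepower plus horsepower squared. Which is better? The F-test in the ANOVA table (F = 101.61, p < 2.2e-16) indicates that adding the squared term significantly improves the fit.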
lm_hp <- lm(mpg ~ horsepower, data = Auto)
anova(lm_hp, lm_hpX2)
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ -name + horsepower + I(horsepower^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7442.0 1 1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
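Compare two models: horsepower alone versus horsepower plus log of horsepower. The F-test in the ANOVA table (F = 92.601, p < 2.2e-16) indicates that adding log(horsepower) significantly improves the fit.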
lm_hp <- lm(mpg ~ horsepower, data = Auto)
anova(lm_hp, lm_hplog)
## Analysis of Variance Table
##
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ -name + horsepower + log(horsepower)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 390 9385.9
## 2 389 7581.2 1 1804.7 92.601 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
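Both ANOVA tables compare a transformed model with the linear baseline. The quadratic and log models are not nested in each other, so an F-test between them does not apply; on Auto the residual sums of squares reported above (7442.0 versus 7581.2, both on 389 degrees of freedom) slightly favor the quadratic model. A sketch of a direct comparison using AIC and adjusted R-squared follows. It uses the built-in `mtcars` data, which also has `mpg` and `hp` columns, purely as a stand-in so the block runs without extra packages.

```r
# Stand-in data: built-in mtcars (mpg, hp). Swap in Auto and horsepower to apply it here.
fit_quad <- lm(mpg ~ hp + I(hp^2), data = mtcars)
fit_log  <- lm(mpg ~ hp + log(hp), data = mtcars)

# The two models are not nested, so compare information criteria and
# adjusted R-squared instead of running anova() (lower AIC is better).
comparison <- data.frame(
  model  = c("quadratic", "log"),
  aic    = c(AIC(fit_quad), AIC(fit_log)),
  adj_r2 = c(summary(fit_quad)$adj.r.squared, summary(fit_log)$adj.r.squared)
)
print(comparison)
```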
##Question 10: This question should be answered using the Carseats data set.
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
lm_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary (lm_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
contrasts(Carseats$US)
## Yes
## No 0
## Yes 1
contrasts(Carseats$Urban)
## Yes
## No 0
## Yes 1
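Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative!
Price: the coefficient is negative. Holding the other predictors fixed, a 1-unit increase in price is associated with a decrease in sales of about 0.054 (Sales is recorded in thousands of units, so roughly 54 car seats).
UrbanYes: the coefficient is negative, which would indicate slightly lower sales in urban stores than in non-urban stores, but it is not significant (p = 0.936).
USYes: the coefficient is positive and significant. Holding the other predictors fixed, stores in the US sell about 1.2 thousand more units than stores outside the US.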
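Write out the model in equation form, being careful to handle the qualitative variables properly. Sales = 13.04 - 0.0545(Price) - 0.0219(UrbanYes) + 1.2006(USYes), where UrbanYes is 1 for an urban store and 0 otherwise, and USYes is 1 for a US store and 0 otherwise. (Urban is included because it is in the fitted model, but it should not be in the final model.)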
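As a worked check of the equation, the sketch below plugs the coefficients from the `summary(lm_carseats)` output into a small function. `predict_sales()` and the example inputs are hypothetical; the factor levels are handled as 0/1 dummies, matching the `contrasts()` output above.

```r
# Coefficients copied from the summary(lm_carseats) output in this document.
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: urban and us take the values "Yes" or "No".
predict_sales <- function(price, urban, us) {
  unname(b["intercept"] +
           b["price"] * price +
           b["urban_yes"] * (urban == "Yes") +   # dummy is 1 for Yes, 0 for No
           b["us_yes"] * (us == "Yes"))
}

pred_us     <- predict_sales(price = 120, urban = "Yes", us = "Yes")
pred_non_us <- predict_sales(price = 120, urban = "Yes", us = "No")
us_effect   <- pred_us - pred_non_us   # equals the USYes coefficient
```

With the fitted model available, `predict(lm_carseats, newdata = ...)` should agree with this up to the rounding of the printed coefficients.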
For which of the predictors can you reject the null hypothesis H0: βj = 0? The null hypothesis can be rejected for Price (p < 2e-16) and US (p = 4.86e-06). It cannot be rejected for Urban (p = 0.936).
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm2_carseats <- lm(Sales ~ Price + US, data = Carseats)
summary (lm2_carseats)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data? Fit is measured by R^2. Both models have a multiple R^2 of 0.2393. The adjusted R^2 is 0.2335 for model (a) and 0.2354 for model (e), a slight improvement after removing Urban, and the residual standard error is marginally lower for (e) (2.469 versus 2.472). Neither model fits very well: each explains only about 24% of the variation in Sales. I would seek ways to improve the model fit and predictive power.
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint (lm2_carseats, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
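The interval for Price can be reproduced by hand as estimate ± t critical value × standard error. The sketch below uses the rounded estimate, standard error, and residual degrees of freedom printed in the `summary(lm2_carseats)` output, so it matches `confint()` only up to rounding.

```r
# Estimate and standard error for Price, copied from summary(lm2_carseats).
est <- -0.05448
se  <- 0.00523
df  <- 397                  # residual degrees of freedom from the summary

t_crit <- qt(0.975, df)     # two-sided 95% critical value of the t distribution
ci <- c(lower = est - t_crit * se, upper = est + t_crit * se)
round(ci, 4)
```

The interval excludes zero, which is consistent with rejecting H0 for Price.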
par(mfrow = c(2,2))
plot(lm2_carseats)
##12. This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38): $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$. For the regression of X onto Y without an intercept, the estimate is $\sum_i x_i y_i / \sum_i y_i^2$.
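Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The numerator, $\sum_i x_i y_i$, is the same in both estimates; only the denominator differs. The two estimates are therefore equal whenever $\sum_i x_i^2 = \sum_i y_i^2$ (or when $\sum_i x_i y_i = 0$, in which case both are zero). X = Y and X = -Y are special cases, but any X and Y with the same sum of squares, for example Y a permutation of X, give the same coefficient estimate for the regression of X onto Y and of Y onto X.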
x <- rnorm(100)
#x
y <- rnorm(100)
#y
yregx_lm <- lm(y~x)
#summary(yregx_lm)
yregx_lm
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 0.006423 -0.075591
xregy_lm <- lm(x~y)
#summary(xregy_lm)
xregy_lm
##
## Call:
## lm(formula = x ~ y)
##
## Coefficients:
## (Intercept) y
## 0.06734 -0.07951
xs <- rnorm(100)
ys <- -xs
yregxs_lm <- lm(ys~xs)
yregxs_lm
##
## Call:
## lm(formula = ys ~ xs)
##
## Coefficients:
## (Intercept) xs
## -5.551e-18 -1.000e+00
xregys_lm <- lm(xs~ys)
xregys_lm
##
## Call:
## lm(formula = xs ~ ys)
##
## Coefficients:
## (Intercept) ys
## 5.551e-18 -1.000e+00
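The fits above include an intercept, whereas the question concerns regression through the origin. A sketch of the no-intercept case follows, using `+ 0` in the formula to drop the intercept. The seed and variable names are arbitrary choices for this illustration. A permutation of x has the same sum of squares as x, so the two coefficients agree; a generic linear relationship does not.

```r
set.seed(42)
x <- rnorm(100)

# Case 1: y is a permutation of x, so sum(y^2) equals sum(x^2)
y_perm <- sample(x)
b_yx <- coef(lm(y_perm ~ x + 0))[[1]]    # Y onto X, no intercept
b_xy <- coef(lm(x ~ y_perm + 0))[[1]]    # X onto Y, no intercept
by_hand <- sum(x * y_perm) / sum(x^2)    # closed form for the no-intercept estimate

# Case 2: y = 2x + noise, so the sums of squares differ
y_lin <- 2 * x + rnorm(100, sd = 0.5)
c_yx <- coef(lm(y_lin ~ x + 0))[[1]]
c_xy <- coef(lm(x ~ y_lin + 0))[[1]]

c(b_yx = b_yx, b_xy = b_xy, c_yx = c_yx, c_xy = c_xy)
```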