HW2

#Name: Justin Howard

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.1.2

#Question 2. Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN is a non-parametric method (it does not assume a form for the function f). A model with a small value of K is more flexible, with low bias but high variance, which can lead to overfitting. A model with a large value of K is smoother, with more bias but less variance, which can lead to underfitting.

KNN regression predicts a quantitative response. For a prediction point x0, it finds the K training observations nearest to x0 and returns the average of their responses.

KNN classification predicts a qualitative response. For a prediction point x0, it finds the K training observations nearest to x0, estimates the conditional probability of each class as the fraction of those K neighbors that belong to it, and assigns x0 to the class with the highest estimated probability.

In other words, KNN regression averages the quantitative responses of the K nearest neighbors, while KNN classification takes a majority vote over the qualitative responses of the K nearest neighbors.
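To make the contrast concrete, here is a minimal base-R sketch of both predictions for a single query point. The toy data and the helper names (`nearest_k`, `knn_regress`, `knn_classify`) are made up for illustration and are not part of the assignment.

```r
# Toy 1-D training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 10, 11, 12)
train_y_num <- c(1, 2, 3, 10, 11, 12)                   # quantitative response
train_y_cls <- factor(c("a", "a", "a", "b", "b", "b"))  # qualitative response

# Indices of the k training points closest to the query point x0
nearest_k <- function(x0, x, k) order(abs(x - x0))[seq_len(k)]

# KNN regression: average the responses of the k nearest neighbors
knn_regress <- function(x0, x, y, k) mean(y[nearest_k(x0, x, k)])

# KNN classification: estimate class probabilities as neighbor fractions,
# then return the class with the largest estimated probability
knn_classify <- function(x0, x, y, k) {
  probs <- table(y[nearest_k(x0, x, k)]) / k
  names(probs)[which.max(probs)]
}

reg_pred <- knn_regress(2.4, train_x, train_y_num, k = 3)
cls_pred <- knn_classify(2.4, train_x, train_y_cls, k = 3)
print(reg_pred)
print(cls_pred)
```

Both functions share the same neighbor search; only the final step (mean versus vote) differs.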

##Question 9. This question involves the use of multiple linear regression on the Auto data set.

Produce a scatterplot matrix which includes all of the variables in the data set.

str(Auto)

## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

plot(Auto)

plot(Auto , pch=20 , cex=1.5 , col="#69b3a2")

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

#names(Auto)
#cor(Auto[, c("mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin")])
cor(Auto[,-c(9)])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
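Dropping the qualitative column by position works here, but the selection can also be done by type. The sketch below uses a small simulated data frame (an assumption standing in for Auto, so that it runs without the ISLR package) to keep only numeric columns before calling cor().

```r
# Simulated stand-in for a data frame with one qualitative column
set.seed(1)
df <- data.frame(
  a     = rnorm(50),
  b     = rnorm(50),
  label = factor(sample(c("x", "y"), 50, replace = TRUE))
)

# Logical vector marking the numeric columns
num_cols <- vapply(df, is.numeric, logical(1))

# Correlation matrix over the numeric columns only
cor_mat <- cor(df[, num_cols])
print(round(cor_mat, 3))
```

Selecting by type avoids hard-coding a column index that would break if the column order changed.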

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

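Is there a relationship between the predictors and the response? Yes. The F-statistic of 252.4 (p-value < 2.2e-16) rejects the null hypothesis that all coefficients are zero, so at least one predictor is related to the response (mpg).
Which predictors appear to have a statistically significant relationship to the response? displacement, weight, year, and origin (each with p-value < 0.05). cylinders, horsepower, and acceleration are not significant in this model.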

m_lm_auto = lm(formula = mpg ~.-name, data = Auto)
summary (m_lm_auto)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
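The overall p-value on the last line of the summary comes from the F distribution. As a rough check, the sketch below recomputes it from the printed F-statistic and degrees of freedom, then shows how the same pieces can be pulled from a fitted model. The second part uses the built-in mtcars data as a stand-in, which is an assumption made so the code runs without ISLR.

```r
# Values copied from the summary() output above
f_stat <- 252.4
df1    <- 7
df2    <- 384

# Upper-tail probability of the F distribution
p_overall <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_overall)

# Same idea on a fitted model (mtcars is a stand-in data set)
fit   <- lm(mpg ~ wt + hp, data = mtcars)
fs    <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
p_fit <- unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE))
print(p_fit)
```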

What does the coefficient for the year variable suggest? Holding the other predictors constant, a 1-unit increase in year is associated with an increase in mpg of about 0.75. Year ranges from 70 to 82, so one unit is one model year: newer cars tend to be more fuel efficient.

unique(Auto[c('year')])

##     year
## 1     70
## 30    71
## 58    72
## 86    73
## 126   74
## 153   75
## 183   76
## 217   77
## 245   78
## 281   79
## 310   80
## 339   81
## 368   82

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Based on the pattern in the residuals vs fitted plot, there appears to be non-linearity in the data.

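Do the residual plots suggest any unusually large outliers? R identifies two observations in the upper right corner of the scale-location plot. An observation with an absolute standardized residual greater than 3 is usually considered a possible outlier. The scale-location plot shows the square root of the absolute standardized residual, so that cutoff corresponds to about 1.73 on its vertical axis, not 3. The largest residual in the summary above (13.06) is about 3.9 times the residual standard error (3.328), so at least one observation likely exceeds the cutoff and deserves a closer look.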

Does the leverage plot identify any observations with unusually high leverage? Observations are labeled on the residuals vs leverage plot; however, none falls beyond the dashed Cook's distance line, indicating that no observation is unduly influential. Leverage itself is read from the horizontal axis of that plot.

par(mfrow = c(2,2))
plot(m_lm_auto)
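The plots can be backed up with numbers. The sketch below computes standardized residuals and leverage values directly. It uses the built-in mtcars data as a stand-in (an assumption, so it runs without ISLR), and the "twice the average leverage" cutoff is a rule of thumb chosen here for illustration, not something the assignment specifies.

```r
# Stand-in model on a built-in data set
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

std_res <- rstandard(fit)   # standardized residuals
lev     <- hatvalues(fit)   # leverage of each observation

# Average leverage is (p + 1) / n
avg_lev <- length(coef(fit)) / nrow(mtcars)

# Candidate outliers and high-leverage points (cutoffs are assumptions)
outliers      <- which(abs(std_res) > 3)
high_leverage <- which(lev > 2 * avg_lev)
print(outliers)
print(high_leverage)

# For the Auto fit above: largest residual relative to the residual SE
print(13.0604 / 3.328)
```

Applying the same calls to m_lm_auto would give the corresponding values for the Auto model.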

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant? Yes. The weight:cylinders interaction is statistically significant (p < 0.05), and the acceleration:horsepower interaction is also statistically significant (p < 0.05).

Interaction effect: weight and cylinders

lm_weight <- lm(mpg ~ .-name + weight * cylinders, data = Auto)
summary(lm_weight)

## 
## Call:
## lm(formula = mpg ~ . - name + weight * cylinders, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9484  -1.7133  -0.1809   1.4530  12.4137 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.3143478  5.0076737   1.461  0.14494    
## cylinders        -5.0347425  0.5795767  -8.687  < 2e-16 ***
## displacement      0.0156444  0.0068409   2.287  0.02275 *  
## horsepower       -0.0314213  0.0126216  -2.489  0.01322 *  
## weight           -0.0150329  0.0011125 -13.513  < 2e-16 ***
## acceleration      0.1006438  0.0897944   1.121  0.26306    
## year              0.7813453  0.0464139  16.834  < 2e-16 ***
## origin            0.8030154  0.2617333   3.068  0.00231 ** 
## cylinders:weight  0.0015058  0.0001657   9.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.8501 
## F-statistic: 278.1 on 8 and 383 DF,  p-value: < 2.2e-16

Interaction effect: acceleration and horsepower

lm_acc <- lm(mpg ~ .-name + acceleration * horsepower, data = Auto)
summary(lm_acc)

## 
## Call:
## lm(formula = mpg ~ . - name + acceleration * horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0329 -1.8177 -0.1183  1.7247 12.4870 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -32.499820   4.923380  -6.601 1.36e-10 ***
## cylinders                 0.083489   0.316913   0.263 0.792350    
## displacement             -0.007649   0.008161  -0.937 0.349244    
## horsepower                0.127188   0.024746   5.140 4.40e-07 ***
## weight                   -0.003976   0.000716  -5.552 5.27e-08 ***
## acceleration              0.983282   0.161513   6.088 2.78e-09 ***
## year                      0.755919   0.048179  15.690  < 2e-16 ***
## origin                    1.035733   0.268962   3.851 0.000138 ***
## horsepower:acceleration  -0.012139   0.001772  -6.851 2.93e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.145 on 383 degrees of freedom
## Multiple R-squared:  0.841,  Adjusted R-squared:  0.8376 
## F-statistic: 253.2 on 8 and 383 DF,  p-value: < 2.2e-16
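In an R formula, a * b is shorthand for a + b + a:b, which is why the summaries above list only one extra term (the main effects were already included through the dot). A minimal sketch of that equivalence, using the built-in mtcars data as a stand-in (an assumption, so it runs without ISLR):

```r
# Two ways of writing the same model
m_star  <- lm(mpg ~ wt * hp, data = mtcars)
m_colon <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# The coefficient vectors should agree term by term
print(all.equal(coef(m_star), coef(m_colon)))
print(names(coef(m_star)))
```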

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

X^2 transformation of horsepower: the horsepower^2 term is significant (p < 2e-16), at the same level as the untransformed horsepower term.

lm_hpX2 <- lm(mpg ~ -name + horsepower + I(horsepower^2), data = Auto)
summary(lm_hpX2)

## 
## Call:
## lm(formula = mpg ~ -name + horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Log transformation of horsepower: the log(horsepower) term is significant (p < 2e-16) and has a smaller p-value than the untransformed horsepower term in the same model (p = 6.34e-05).

lm_hplog <- lm(mpg ~ -name + horsepower + log(horsepower), data = Auto)
summary(lm_hplog)

## 
## Call:
## lm(formula = mpg ~ -name + horsepower + log(horsepower), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5118  -2.5018  -0.2533   2.4446  15.3102 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     156.04057   12.08267  12.914  < 2e-16 ***
## horsepower        0.11846    0.02929   4.044 6.34e-05 ***
## log(horsepower) -31.59815    3.28363  -9.623  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.415 on 389 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6801 
## F-statistic: 416.6 on 2 and 389 DF,  p-value: < 2.2e-16

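Compare two models: horsepower alone versus horsepower plus horsepower^2. Which is better? The F-test (F = 101.61, p < 2.2e-16) indicates that adding the squared term significantly improves the fit.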

lm_hp <- lm(mpg ~ horsepower, data = Auto)
anova(lm_hp, lm_hpX2)

## Analysis of Variance Table
## 
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ -name + horsepower + I(horsepower^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7442.0  1    1943.9 101.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

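Compare two models: horsepower alone versus horsepower plus log(horsepower). The F-test (F = 92.601, p < 2.2e-16) indicates that adding the log term also significantly improves the fit.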

lm_hp <- lm(mpg ~ horsepower, data = Auto)
anova(lm_hp, lm_hplog)

## Analysis of Variance Table
## 
## Model 1: mpg ~ horsepower
## Model 2: mpg ~ -name + horsepower + log(horsepower)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    390 9385.9                                  
## 2    389 7581.2  1    1804.7 92.601 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
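When two nested models differ by a single term, the anova() F-statistic equals the square of that term's t-statistic in the larger model. This can be seen in the output above (10.08^2 is about 101.6 and 9.623^2 is about 92.6). The sketch below checks the identity on simulated data, which is an assumption standing in for Auto.

```r
# Simulated data with a curved relationship (hypothetical values)
set.seed(42)
x <- runif(200, 50, 200)
y <- 60 - 0.5 * x + 0.0015 * x^2 + rnorm(200, sd = 4)

m1 <- lm(y ~ x)
m2 <- lm(y ~ x + I(x^2))

# F-statistic from the nested-model comparison
f_val <- anova(m1, m2)$F[2]

# t-statistic of the added term in the larger model
t_val <- summary(m2)$coefficients["I(x^2)", "t value"]

print(c(F = f_val, t_squared = t_val^2))

# Same check using the rounded t values printed for the Auto models
print(c(quadratic = 10.08^2, log = 9.623^2))
```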

##Question 10: This question should be answered using the Carseats data set.

str(Carseats)

## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

Fit a multiple regression model to predict Sales using Price, Urban (categorical), and US (categorical)

lm_carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary (lm_carseats)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

contrasts(Carseats$US)

##     Yes
## No    0
## Yes   1

contrasts(Carseats$Urban)

##     Yes
## No    0
## Yes   1

Provide an interpretation of each coefficient in the model. Be careful - some of the variables in the model are qualitative! Price has a negative effect on sales: holding the other predictors fixed, a 1-unit increase in price is associated with a decrease in sales of about 0.054 units. The UrbanYes coefficient is negative, indicating lower sales for urban stores than for non-urban stores, but it is not statistically significant (p = 0.936). The USYes coefficient is positive: at a given price and Urban value, stores in the US have sales about 1.20 units higher than stores outside the US.
Write out the model in equation form, being careful to handle the qualitative variables properly. Sales = 13.04 - 0.0545(Price) - 0.0219(Urban = Yes) + 1.2006(US = Yes), where each indicator equals 1 if the condition holds and 0 otherwise. (Urban is included here because it is in the fitted model, but it should not be in the final model.)
For which of the predictors can you reject the null hypothesis H0: βj = 0? The null hypothesis can be rejected for Price (p-value < 0.05) and US (p-value < 0.05), but not for Urban.
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm2_carseats <- lm(Sales ~ Price + US, data = Carseats)
summary (lm2_carseats)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
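Written as a function, the smaller model shows how the qualitative US variable enters through a 0/1 indicator. This is a sketch using the rounded coefficients printed above; the function name `predict_sales` and the example price of 100 are chosen here for illustration.

```r
# Fitted equation for the smaller model, with coefficients copied from
# the summary output above
predict_sales <- function(price, us) {
  us_indicator <- as.numeric(us == "Yes")  # 1 for "Yes", 0 for "No"
  13.03079 - 0.05448 * price + 1.19964 * us_indicator
}

sales_us     <- predict_sales(100, "Yes")
sales_non_us <- predict_sales(100, "No")
print(c(US = sales_us, nonUS = sales_non_us))

# The gap between the two predictions is the USYes coefficient
print(sales_us - sales_non_us)
```

With the fitted object available, predict(lm2_carseats, newdata = ...) would give the same values up to rounding of the coefficients.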

How well do the models in (a) and (e) fit the data? Measuring fit by adjusted R^2, model (a) has an adjusted R^2 of 0.2335 and model (e) has an adjusted R^2 of 0.2354, a slight improvement after removing Urban from the model. I don’t believe 23% is a very good fit: it means the model explains only about 23% of the variation in Y (Sales). I would seek ways to improve the model fit and predictive power.
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint (lm2_carseats, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
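Each interval above is the estimate plus or minus a t quantile times the standard error. As a rough check, the sketch below rebuilds the Price interval from the rounded values in the summary of the smaller model, so small differences in the last digits are expected.

```r
# Values copied from the summary of the smaller model above
est      <- -0.05448   # Price estimate
se       <- 0.00523    # Price standard error
df_resid <- 397        # residual degrees of freedom

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df_resid)

# Lower and upper confidence limits
ci_price <- est + c(-1, 1) * t_crit * se
print(t_crit)
print(ci_price)
```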

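Is there evidence of outliers or high leverage observations in the model from (e)? The variance looks roughly constant in the residuals vs fitted plot, and normality looks reasonable in the normal Q-Q plot. No observation has an absolute standardized residual greater than 3, which corresponds to about 1.73 on the scale-location plot because that plot shows the square root; consistent with this, the largest residual (7.05) is about 2.9 times the residual standard error (2.469). There do appear to be high leverage observations based on the residuals vs leverage plot: multiple observations fall beyond the Cook’s distance dotted line, indicating high leverage observations.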

par(mfrow = c(2,2))
plot(lm2_carseats)

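##12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38):

$$\hat{\beta} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2}$$

For the regression of X onto Y, the corresponding estimate (written β here) is:

$$\beta = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}$$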

Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The numerator is the same for βˆ and β; only the denominator is different. The two estimates are therefore equal whenever $\sum_i x_i^2 = \sum_i y_i^2$. For example, X = Y or X = -Y gives the same coefficient estimate for the regression of X onto Y and of Y onto X.
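The condition can be derived in two lines. The subscripts naming the two regressions are notation introduced here for the sketch, not taken from the textbook.

```latex
\hat{\beta}_{Y \sim X} = \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_i x_i y_i}{\sum_i y_i^2}

\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\sum_i x_i y_i \left( \sum_i y_i^2 - \sum_i x_i^2 \right) = 0
\iff
\sum_i x_i y_i = 0 \ \text{ or } \ \sum_i x_i^2 = \sum_i y_i^2
```

So besides Y = X and Y = -X, any Y with the same sum of squares as X (for instance a permutation of X) gives equal estimates, as does the case where both estimates are zero.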

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

x <- rnorm(100)
#x
y <- rnorm(100)
#y
yregx_lm <- lm(y~x)
#summary(yregx_lm)
yregx_lm

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    0.006423    -0.075591

xregy_lm <- lm(x~y)
#summary(xregy_lm)
xregy_lm

## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##     0.06734     -0.07951

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

xs <- rnorm(100)

ys <- -xs

yregxs_lm <- lm(ys~xs)

yregxs_lm

## 
## Call:
## lm(formula = ys ~ xs)
## 
## Coefficients:
## (Intercept)           xs  
##  -5.551e-18   -1.000e+00

xregys_lm <- lm(xs~ys)

xregys_lm

## 
## Call:
## lm(formula = xs ~ ys)
## 
## Coefficients:
## (Intercept)           ys  
##   5.551e-18   -1.000e+00
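The question concerns regression without an intercept, while the fits above include one. The sketch below repeats the "same coefficient" case with the intercept removed (the + 0 in the formula) and uses a permutation of x for y, so the sums of squares match without y being x or -x. The seed and variable names are choices made here for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # permutation of x: same sum of squares, different order

# No-intercept regressions in both directions
b_yx <- unname(coef(lm(y ~ x + 0)))
b_xy <- unname(coef(lm(x ~ y + 0)))
print(c(y_onto_x = b_yx, x_onto_y = b_xy))

# Closed-form estimates: shared numerator, different denominators
closed_yx <- sum(x * y) / sum(x^2)
closed_xy <- sum(x * y) / sum(y^2)
print(c(closed_yx, closed_xy))
```

Because sum(x^2) and sum(y^2) are equal here, the two slopes agree up to floating-point rounding.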