Question # 2: Carefully explain the differences between the KNN classifier and KNN regression methods.

It comes down to the type of outcome being predicted (a numerical versus a categorical response). Both KNN methods are non-parametric approaches that use some form of local approximation to predict an outcome; the difference lies in how that local information is summarized. KNN regression predicts the value of a quantitative response by averaging the responses of the K nearest neighbors. KNN classification predicts the class to which an observation belongs by estimating the conditional probability of each class from the proportions among the K nearest neighbors, then assigning the most probable class.
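To make the contrast concrete, here is a minimal sketch (assuming the class and FNN packages are installed; the simulated data are purely illustrative) showing that class::knn() returns a predicted class by majority vote, while FNN::knn.reg() returns a predicted value by local averaging.

set.seed(1)
x_train <- matrix(rnorm(100), ncol = 2)   # 50 training observations, 2 predictors
x_test  <- matrix(rnorm(20),  ncol = 2)   # 10 test observations

# Classification: predicted class = majority vote among the K nearest neighbours
y_class <- factor(ifelse(x_train[, 1] + rnorm(50) > 0, "Yes", "No"))
class::knn(train = x_train, test = x_test, cl = y_class, k = 5)

# Regression: predicted value = average response of the K nearest neighbours
y_num <- x_train[, 1] + rnorm(50)
FNN::knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred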

_______________________________________________________________________________________________________________________

9. This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
  (a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

  (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
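Since the tidyverse is already loaded, an equivalent way to drop the qualitative name column by name rather than by position (just a quick sketch, giving the same matrix as above):

Auto %>% select(-name) %>% cor()
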
  (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?
lm_mpg <- lm(mpg ~ . - name, data = Auto)
summary(lm_mpg)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comments for 9(c)

  i. Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 has a p-value below 2.2e-16, so the evidence strongly suggests there is a relationship between at least one of the predictors and the response variable mpg.

  ii. Which predictors appear to have a statistically significant relationship to the response?

Based on the individual t-test p-values, the predictors that appear to have a statistically significant relationship to the response are displacement, weight, year, and origin.

  iii. What does the coefficient for the year variable suggest?

Holding all other variables constant, the coefficient for year suggests that for every one-unit (one model year) increase in year, mpg increases by about 0.75 (0.750773).

  (d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm_mpg)

The Residuals vs Fitted plot indicates some mild non-linearity in the data. The residual plots also flag a few outliers, and the Residuals vs Leverage plot identifies one observation, #14, as having particularly high leverage. It may be worth checking how much influence these points have on the model's fit by trialing models with and without the outliers/high-leverage points.
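As a quick sketch of that idea, one could refit the same model without the high-leverage observation (#14) flagged above and compare the coefficient estimates side by side; the row number is read off the leverage plot and may warrant double-checking.

lm_mpg_drop14 <- lm(mpg ~ . - name, data = Auto[-14, ])
cbind(full_model = coef(lm_mpg), without_obs_14 = coef(lm_mpg_drop14))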

  (e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

After running through a few combinations on my own, I was able to find some significant interaction terms. Here is a summary of those specific findings:

summary(lm(mpg ~ cylinders * acceleration + year * acceleration, data = Auto))
## 
## Call:
## lm(formula = mpg ~ cylinders * acceleration + year * acceleration, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1725  -2.5315  -0.1471   2.1855  14.7314 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            23.02437   29.32140   0.785  0.43279   
## cylinders              -0.57550    0.81688  -0.705  0.48154   
## acceleration           -2.70976    1.81962  -1.489  0.13725   
## year                    0.04577    0.35441   0.129  0.89730   
## cylinders:acceleration -0.17363    0.05406  -3.212  0.00143 **
## acceleration:year       0.04778    0.02205   2.167  0.03082 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.06 on 386 degrees of freedom
## Multiple R-squared:  0.7329, Adjusted R-squared:  0.7295 
## F-statistic: 211.9 on 5 and 386 DF,  p-value: < 2.2e-16

The interaction terms cylinders:acceleration and acceleration:year appear to be statistically significant.
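For a more systematic search than my trial-and-error approach, one could (as a rough sketch, not part of the required answer) loop over every pairwise interaction of the non-name predictors and pull out the p-value of the interaction term from each fit:

preds <- c("cylinders", "displacement", "horsepower", "weight",
           "acceleration", "year", "origin")
combos <- combn(preds, 2)
pvals <- apply(combos, 2, function(v) {
  f <- as.formula(paste("mpg ~", v[1], "*", v[2]))
  coefs <- summary(lm(f, data = Auto))$coefficients
  coefs[nrow(coefs), 4]  # p-value of the interaction term (last row)
})
sort(setNames(pvals, paste(combos[1, ], combos[2, ], sep = ":")))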

  (f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
summary(lm(mpg ~ acceleration + log(acceleration), data = Auto))
## 
## Call:
## lm(formula = mpg ~ acceleration + log(acceleration), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9168  -5.5716  -0.8378   4.6334  22.8515 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       -59.8670    24.5353  -2.440  0.01513 * 
## acceleration       -1.2491     0.9336  -1.338  0.18168   
## log(acceleration)  37.6631    14.2333   2.646  0.00847 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.026 on 389 degrees of freedom
## Multiple R-squared:  0.1937, Adjusted R-squared:  0.1896 
## F-statistic: 46.73 on 2 and 389 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ displacement + sqrt(displacement), data = Auto))
## 
## Call:
## lm(formula = mpg ~ displacement + sqrt(displacement), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2607  -2.3012  -0.2878   2.0179  20.2582 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        63.87634    4.06504  15.714  < 2e-16 ***
## displacement        0.08818    0.02092   4.214 3.12e-05 ***
## sqrt(displacement) -4.27750    0.60071  -7.121 5.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.365 on 389 degrees of freedom
## Multiple R-squared:  0.6888, Adjusted R-squared:  0.6872 
## F-statistic: 430.5 on 2 and 389 DF,  p-value: < 2.2e-16
summary(lm(mpg ~ horsepower + I(horsepower^2), data = Auto))
## 
## Call:
## lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7135  -2.5943  -0.0859   2.2868  15.8961 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
## horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
## I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.374 on 389 degrees of freedom
## Multiple R-squared:  0.6876, Adjusted R-squared:  0.686 
## F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

To look into the effects of predictor transformations, I ran a few MLR models, each containing a variable plus the same variable transformed in some way. In the first model, the log transformation of acceleration was significant while the untransformed acceleration variable was not. I also noticed that in some cases the significance of a variable did not change but the coefficient estimates did; this was the case for the model with a sqrt() transformation of displacement and the model with a squared term for horsepower. More combinations of transformations could be trialed here, but what's important to note is that transforming a predictor can produce real changes in both significance and coefficient estimates.
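One way to make such comparisons more formal (a sketch beyond what the assignment asks for) is a nested-model F-test, which checks whether the transformed term adds explanatory power over the untransformed variable alone; for example, for the quadratic horsepower term:

fit_lin  <- lm(mpg ~ horsepower, data = Auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
anova(fit_lin, fit_quad)  # significant F => the quadratic term improves the fit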

10. This question should be answered using the Carseats data set.

  (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
fit_sales <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit_sales)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  (b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Price and US are significant predictors of Sales, while Urban has no significant effect on Sales. Since Sales is recorded in thousands of units, the Price coefficient of -0.054459 means that a $1 increase in price is associated with roughly 54 fewer carseats sold, all other predictors held fixed. The US coefficient of 1.200573 means that, on average, a US store sells about 1,201 more carseats than a non-US store (all else fixed), while the Urban coefficient is small (about 22 fewer carseats) and not statistically distinguishable from zero.

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = 13.043469 − 0.054459(Price) − 0.021916(UrbanYes) + 1.200573(USYes), where UrbanYes and USYes are indicator variables equal to 1 if the store is in an urban location / in the US, and 0 otherwise.
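As a quick check of how R encodes the qualitative variables (not part of the required answer), one can inspect the dummy contrasts and confirm that plugging values into the equation above reproduces predict(); the Price of 100 here is just an arbitrary illustrative value.

contrasts(Carseats$Urban)  # "Yes" coded as 1, "No" as 0
contrasts(Carseats$US)
b <- coef(fit_sales)
b["(Intercept)"] + b["Price"] * 100 + b["UrbanYes"] * 1 + b["USYes"] * 1
predict(fit_sales, newdata = data.frame(Price = 100, Urban = "Yes", US = "Yes"))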

  (d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Given their low p-values, we can reject the null hypothesis for the predictors Price and US.

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit_sales2 <- lm(Sales ~ Price + US, data = Carseats)
summary(fit_sales2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  (f) How well do the models in (a) and (e) fit the data?

Not especially well: both models explain only about 24% of the variation in Sales (multiple R-squared of 0.2393 in each case), though the smaller model has a marginally higher adjusted R-squared.
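A quick way to pull those fit statistics out side by side (just a sketch using the models already fit above):

data.frame(
  model         = c("Price + Urban + US", "Price + US"),
  r.squared     = c(summary(fit_sales)$r.squared,     summary(fit_sales2)$r.squared),
  adj.r.squared = c(summary(fit_sales)$adj.r.squared, summary(fit_sales2)$adj.r.squared),
  sigma         = c(summary(fit_sales)$sigma,         summary(fit_sales2)$sigma)
)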

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(fit_sales2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
  (h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(fit_sales2)

Visual inspection of the Residuals vs Leverage plot suggests a few points approaching or exceeding the Cook's distance cutoff, as well as some general outliers. If I summarize the potentially influential observations in a list that shows them row by row, I can easily reference that list to exclude them from a new model and see whether doing so has any meaningful effect on the model's performance. By fitting this new model and comparing the results, I can check whether what I visually suspected is confirmed or denied.

summary(influence.measures(fit_sales2))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
outyliers <- c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats_Altered <- Carseats[-outyliers,]
fit_sales2NEW <- lm(Sales ~ Price + US, data = Carseats_Altered)
summary(fit_sales2NEW)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats_Altered)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

Comparing how well the model fits now versus before removing the potentially influential points, the R-squared values differ by less than 1%. It is probably safe to keep all of the data points in the model.

12. This problem involves simple linear regression without an intercept.

  (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

When the sum of the squares of the observed y-values equals the sum of the squares of the observed x-values. From (3.38), the slope for the regression of Y onto X is Σ xᵢyᵢ / Σ xᵢ², while the slope for X onto Y is Σ xᵢyᵢ / Σ yᵢ², so the two estimates coincide exactly when Σ xᵢ² = Σ yᵢ².
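A quick numeric check of that formula (a sketch with simulated data; nothing here comes from the assignment output):

set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
c(formula_3.38 = sum(x * y) / sum(x^2),   # slope from (3.38)
  lm_estimate  = coef(lm(y ~ x + 0)))     # no-intercept fit gives the same number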

  (b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
x = rnorm(100,5,2)
y = rnorm(100,10,1)

lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     9.82121      0.02837
lm(x~y)
## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##       3.710        0.113
  (c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x = rnorm(100,10,1)
y = rnorm(100,10,1)

lm(y~x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##     12.0731      -0.2093
lm(x~y)
## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##     11.9538      -0.2087
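The two slopes above agree only approximately, since x and y were drawn independently and the fits include an intercept, whereas the exercise concerns the no-intercept estimate. A construction that guarantees the sums of squares are exactly equal (a sketch) is to take y as a permutation of x and fit without an intercept:

set.seed(1)
x <- rnorm(100)
y <- sample(x)          # same values in a different order, so sum(y^2) equals sum(x^2)
coef(lm(y ~ x + 0))     # regression of Y onto X without an intercept
coef(lm(x ~ y + 0))     # regression of X onto Y without an intercept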

A helpful reference for understanding problem 12:

https://rstudio-pubs-static.s3.amazonaws.com/289958_811f0797a3454900b45ce7ecad79c101.html