Assignment 2, Ch 3

Prior to starting, I always load up Tidyverse to make sure I’m covering all my bases.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ISLR2)

Question 2: Carefully explain the differences between the KNN classifier and KNN regression methods

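#Both methods start by finding the K training observations closest to a test point x0. The KNN classifier is used for a qualitative response: it assigns x0 to the class that is most common among those K neighbors. KNN regression is used for a quantitative response: it predicts the value at x0 as the average of the responses of those K neighbors (not of all training responses).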
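
The difference is easiest to see side by side. The following is a minimal base-R sketch on made-up one-dimensional data; the helper names nearest_k, knn_classify, and knn_regress are hypothetical and not from any package.

```r
# Toy 1-D training data (made up for illustration)
train_x   <- c(1, 2, 3, 6, 7, 8)
train_cls <- c("a", "a", "a", "b", "b", "b")   # qualitative response
train_y   <- c(1.0, 2.1, 2.9, 6.2, 7.1, 7.9)   # quantitative response

# Indices of the k training points closest to x0
nearest_k <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN classifier: majority vote among the k nearest neighbors
knn_classify <- function(x0, k = 3) {
  votes <- table(train_cls[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: mean response of the k nearest neighbors
knn_regress <- function(x0, k = 3) mean(train_y[nearest_k(x0, k)])

knn_classify(2.5)   # a class label
knn_regress(2.5)    # a number
```

Both functions share the neighbor search; only the last step (vote versus average) differs.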

Question 9: This question involves the use of multiple linear regression on the Auto data set.

Question 9a: Produce a scatterplot matrix which includes all of the variables in the data set.

#Auto <- read.csv("Auto.csv")
attach(Auto)
## The following object is masked from package:ggplot2:
## 
##     mpg
plot(Auto)

Question 9b: Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative

str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Auto$horsepower <- as.numeric(Auto$horsepower)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
cor(Auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
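
Selecting columns by position, as in Auto[1:8], works here because name is the last column. A slightly more defensive sketch selects the numeric columns by type instead. The built-in iris data set is used only as a stand-in so the snippet runs without ISLR2; the same pattern should carry over to Auto.

```r
# Stand-in data: iris has four numeric columns and one factor (Species)
numeric_cols <- Filter(is.numeric, iris)   # keep only the numeric columns
cor_mat <- cor(numeric_cols)
round(cor_mat, 2)
```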

Question 9c: Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

  1. Is there a relationship between the predictors and the response?
  2. Which predictors appear to have a statistically significant relationship to the response?
  3. What does the coefficient for the year variable suggest?
mlr <- lm(mpg~.-name, data=Auto)
summary(mlr) 
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
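#I. Yes. The overall F-statistic is 252.4 with a p-value < 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero: at least one predictor is related to mpg.
#II. The predictors with a p-value below .05 are displacement, weight, year, and origin. Cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.
#III. The year coefficient of 0.75 suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg (i.e., cars became more fuel efficient over time).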
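
The significant predictors can also be read off the coefficient table programmatically rather than by eye. This is a minimal sketch: the helper name significant_terms is made up for illustration, and mtcars is a stand-in so the snippet runs without ISLR2.

```r
# Hypothetical helper: names of terms whose p-value is below alpha
significant_terms <- function(fit, alpha = 0.05) {
  tab  <- coef(summary(fit))                 # matrix with a "Pr(>|t|)" column
  keep <- tab[, "Pr(>|t|)"] < alpha
  setdiff(rownames(tab)[keep], "(Intercept)")
}

fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # stand-in model
significant_terms(fit)
```

For the model fitted above, the equivalent call would be significant_terms(mlr).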

Question 9d: Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(mlr)

#The residual plot shows a U-shaped pattern, which suggests that the relationship is not linear, and the spread of the residuals is larger at the higher fitted values, which suggests non-constant variance. The plots also show some possible outliers among the large positive residuals, and the leverage plot shows observations with high leverage.
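
The visual impression can be backed up with numbers. A rough sketch using common rules of thumb: flag observations whose studentized residual exceeds 3 in absolute value as possible outliers, and observations whose leverage is well above the average leverage (p + 1)/n as high leverage. The multiplier of 2 and the helper name flag_points are assumptions for illustration, and mtcars is a stand-in data set.

```r
# Hypothetical helper: list possible outliers and high-leverage points
flag_points <- function(fit, resid_cut = 3, lev_mult = 2) {
  stud    <- rstudent(fit)                    # studentized residuals
  lev     <- hatvalues(fit)                   # leverage statistics
  avg_lev <- length(coef(fit)) / nobs(fit)    # (p + 1) / n
  list(
    outliers      = names(stud)[abs(stud) > resid_cut],
    high_leverage = names(lev)[lev > lev_mult * avg_lev]
  )
}

fit <- lm(mpg ~ wt + hp, data = mtcars)  # stand-in model
flag_points(fit)
```

For the model above, the call would be flag_points(mlr); the cutoffs are conventions, not hard rules.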

Question 9e: Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

mlr2 <- lm(mpg~acceleration*horsepower + weight:acceleration, data=Auto)
summary(mlr2)
## 
## Call:
## lm(formula = mpg ~ acceleration * horsepower + weight:acceleration, 
##     data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1722  -2.4537  -0.3763   2.1673  15.3486 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              2.759e+01  3.268e+00   8.442 6.34e-16 ***
## acceleration             1.245e+00  2.050e-01   6.073 3.00e-09 ***
## horsepower              -1.036e-02  2.575e-02  -0.403   0.6875    
## acceleration:horsepower -5.676e-03  2.257e-03  -2.515   0.0123 *  
## acceleration:weight     -3.004e-04  3.823e-05  -7.858 3.90e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.116 on 387 degrees of freedom
## Multiple R-squared:  0.7248, Adjusted R-squared:  0.7219 
## F-statistic: 254.8 on 4 and 387 DF,  p-value: < 2.2e-16
#Using the interactions acceleration:horsepower and acceleration:weight, both appear to be statistically significant: acceleration:horsepower at the .05 level (p = 0.0123) and acceleration:weight at the .001 level (p = 3.90e-14).
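
In R's formula syntax, a*b is shorthand for a + b + a:b, while a:b adds only the interaction. This means the model above includes the acceleration:weight interaction without a main effect for weight. One quick way to see which terms a formula expands to, without fitting anything, is terms():

```r
# Expand each formula into its individual terms (no data needed)
f_star  <- mpg ~ acceleration * horsepower
f_model <- mpg ~ acceleration * horsepower + weight:acceleration

attr(terms(f_star), "term.labels")
attr(terms(f_model), "term.labels")
```

If the hierarchical principle is to be respected, weight could be added as a main effect, for example by writing weight*acceleration instead of weight:acceleration.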

Question 9f: Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings

par(mfrow = c(2, 2))
# Column names are case sensitive: the response is Auto$mpg (Auto$MPG is NULL)
plot(Auto$acceleration, Auto$mpg)
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)

plot(Auto$horsepower, Auto$mpg)
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

detach(Auto)

#For acceleration, the log, square-root, and squared transformations do not change the plots much: the relationship with mpg looks no more (or less) linear than in the untransformed data. For horsepower, the log transformation appears to give a more linear relationship with mpg than the untransformed data. And don't forget to detach when you are finished with these questions!
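
Plots aside, one rough way to compare transformations is to fit each one and compare R-squared. This is a sketch under assumptions: mtcars (mpg against hp) stands in for Auto so the snippet runs on its own, and R-squared is only one of several reasonable yardsticks.

```r
# Candidate transformations of a single predictor (stand-in data: mtcars)
forms <- list(
  raw    = mpg ~ hp,
  log    = mpg ~ log(hp),
  sqrt   = mpg ~ sqrt(hp),
  square = mpg ~ I(hp^2)     # I() protects ^ inside a formula
)

# Fit each model and collect its R-squared
r2 <- sapply(forms, function(f) summary(lm(f, data = mtcars))$r.squared)
round(r2, 3)
```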

Question 10: This question should be answered using the Carseats data set.

Question 10a: Fit a multiple regression model to predict Sales using Price, Urban, and US

#Carseats <- read_csv("Carseats.csv")
view(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Carseats$US <- as.factor(Carseats$US)
Carseats$Urban <- as.factor(Carseats$Urban)
Carseats$ShelveLoc <- as.factor(Carseats$ShelveLoc)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
attach(Carseats)
carSeatMRM <- lm(Sales~Price+Urban+US, data=Carseats)
summary(carSeatMRM)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Question 10b: Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

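#Price: holding Urban and US fixed, each $1 increase in price is associated with a decrease in sales of about 0.054 thousand units (roughly 54 car seats), since Sales is measured in thousands of units.

#Urban: on average, sales at urban stores are about 0.022 thousand units lower than at non-urban stores, but this coefficient is not statistically significant (p = 0.936), so there is no evidence that sales differ between urban and non-urban locations.

#US: holding the other predictors fixed, stores in the US sell about 1.2 thousand (roughly 1,200) more car seats on average than non-US stores.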

Question 10c: Write out the model in equation form, being careful to handle the qualitative variables properly.

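#Sales = β0 + β1 × Price + β2 × UrbanYes + β3 × USYes, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise). With the fitted coefficients: Sales = 13.043 - 0.054 × Price - 0.022 × UrbanYes + 1.201 × USYes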
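
To make the role of the dummy variables concrete, here is the fitted equation written as a small function. The coefficients are copied from the summary output in (a); the function name predict_sales and the example store are made up for illustration.

```r
# Fitted equation from (a), coefficients copied from the printed summary
predict_sales <- function(price, urban, us) {
  urban_yes <- as.numeric(urban == "Yes")   # 1 if urban, 0 otherwise
  us_yes    <- as.numeric(us == "Yes")      # 1 if in the US, 0 otherwise
  13.043469 - 0.054459 * price - 0.021916 * urban_yes + 1.200573 * us_yes
}

# Hypothetical store: price of 100, urban, in the US
predict_sales(100, "Yes", "Yes")   # about 8.776 (thousand units)
```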

Question 10d: For which of the predictors can you reject the null hypothesis H0 : βj = 0?

#The null hypothesis can be rejected for Price and US, as their p-values are less than 0.05. It cannot be rejected for Urban (p = 0.936).

Question 10e: On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

carSeatMRM2 <- lm(Sales~Price+US, data=Carseats)
summary(carSeatMRM2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
#Sales = β0 + β1 × Price + β2 × USYes. With the fitted coefficients: Sales = 13.031 - 0.054 × Price + 1.200 × USYes

Question 10f: How well do the models in (a) and (e) fit the data?

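#Both models fit about equally, and neither fits especially well: each has an R-squared of 0.2393, so they explain only about 24% of the variance in Sales. Dropping Urban costs essentially nothing. The smaller model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472), so it is the better choice.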
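
The adjusted R-squared values can be reproduced from R-squared, the sample size n, and the number of predictors p. A small worked sketch (the function name adj_r2 is made up here; the inputs are the rounded values printed in the two summaries):

```r
# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_r2(0.2393, n = 400, p = 3)   # model (a): Price, Urban, US
adj_r2(0.2393, n = 400, p = 2)   # model (e): Price, US
```

Because the printed R-squared is itself rounded, the results should agree with the summaries (0.2335 and 0.2354) only to about three decimal places.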

Question 10g: Using the model from (e), obtain 95% confidence intervals for the coefficient(s)

confint(carSeatMRM2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
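
These intervals are the estimate plus or minus a t quantile times the standard error. As a rough check, the Price interval can be rebuilt by hand from the rounded values in the summary of model (e):

```r
# 95% interval: estimate +/- t quantile * standard error
est <- -0.05448     # Price estimate from the summary of model (e)
se  <-  0.00523     # its standard error
df  <-  397         # residual degrees of freedom

ci <- est + c(-1, 1) * qt(0.975, df) * se
ci
```

Since the inputs are rounded, the result should match the confint() row for Price only to about four decimal places.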

Question 10h: Is there evidence of outliers or high leverage observations in the model from (e)?

plot(carSeatMRM2)

#Yes, the plots show some evidence of outliers and high leverage for the model using only Price and US. It looks like there are two outliers above the bulk of the data and one beneath it, and there looks to be at least one high-leverage point.

detach(Carseats)

Question 12: This problem involves simple linear regression without an intercept

Question 12a: Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

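#Without an intercept, the coefficient for the regression of Y onto X is (∑_j x_j y_j) / (∑_j x_j^2), and the coefficient for the regression of X onto Y is (∑_j x_j y_j) / (∑_j y_j^2). The numerators are the same, so the two estimates are equal when the denominators are equal, i.e., when ∑_j x_j^2 = ∑_j y_j^2.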

Question 12b: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X

x <- 1:100
y <- 10:109
regressionX <- lm(x~y+0)
regressionY <- lm(y~x+0)
coef(regressionX)
##        y 
## 0.877558
coef(regressionY)
##        x 
## 1.134328
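
The same two numbers can be obtained from the closed-form no-intercept estimate, which makes it clear that only the denominators differ. A short check (the names b_y_on_x and b_x_on_y are new here):

```r
x <- 1:100
y <- 10:109

# Closed-form no-intercept estimates: same numerator, different denominators
b_y_on_x <- sum(x * y) / sum(x^2)   # regression of y onto x
b_x_on_y <- sum(x * y) / sum(y^2)   # regression of x onto y

c(b_y_on_x, b_x_on_y)
all.equal(b_y_on_x, unname(coef(lm(y ~ x + 0))))
```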

Question 12c: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X

x2 <- 1:100
y2 <- x2
regressionX2 <- lm(x2~y2+0)
regressionY2 <- lm(y2~x2+0)
coef(regressionX2)
## y2 
##  1
coef(regressionY2)
## x2 
##  1
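
Setting y2 equal to x2 is the simplest case, but the two vectors do not have to be identical. Any y with the same sum of squares as x should work, for instance a reordering of x. A sketch (the names x3 and y3 are new here):

```r
x3 <- 1:100
y3 <- rev(x3)   # same values in reverse order, so sum(y3^2) equals sum(x3^2)

coef(lm(x3 ~ y3 + 0))
coef(lm(y3 ~ x3 + 0))
```

The two coefficients agree even though x3 and y3 differ element by element.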