HW2

Nassos Galiopoulos February 18, 2022

Exercise 2. Carefully explain the differences between the KNN classifier and KNN regression methods.

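We would use KNN regression for a continuous response variable: the prediction at a point is the mean response of the K closest training observations. We would use the KNN classifier for a categorical response: it estimates the conditional probability of each class as the fraction of the K closest observations that belong to that class, and assigns the point to the class with the highest estimated probability. Both methods are non-parametric.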
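The difference can be sketched in a few lines of base R. This is only an illustration: `knn_neighbors()`, `knn_regress()` and `knn_classify()` are hypothetical helper names, and the toy data are made up.

```r
# Minimal base-R sketch of KNN for a single query point x0.
# The helper names below are hypothetical, not library functions.
knn_neighbors <- function(train_x, x0, k) {
  # Euclidean distance from x0 to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, x0)^2))
  order(d)[seq_len(k)]
}

# KNN regression: average the neighbours' numeric responses
knn_regress <- function(train_x, train_y, x0, k) {
  mean(train_y[knn_neighbors(train_x, x0, k)])
}

# KNN classification: majority vote among the neighbours' labels
knn_classify <- function(train_x, train_y, x0, k) {
  votes <- table(train_y[knn_neighbors(train_x, x0, k)])
  names(votes)[which.max(votes)]
}

# Toy data (made up for illustration)
train_x <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
y_num   <- c(1, 2, 3, 10, 11, 12)
y_class <- c("low", "low", "low", "high", "high", "high")

reg_pred <- knn_regress(train_x, y_num, x0 = 2.2, k = 3)     # mean of 1, 2, 3
cls_pred <- knn_classify(train_x, y_class, x0 = 2.2, k = 3)  # vote among three "low" labels
```

The neighbour search is identical in both cases; only the way the neighbours' responses are combined changes (mean versus vote).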

Exercise 9. This question involves the use of multiple linear regression on the Auto data set.

# clean the environment
rm(list = ls())
# load the dataset
library(ISLR2)
summary(Auto)

##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

head(Auto)

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

plot(Auto)

(b) Compute the matrix of correlations between the variables using the function `cor()`. You will need to exclude the `name` variable, which is qualitative.

# compute correlations and exclude qualitative
cor(Auto[ ,-9])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
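To read the matrix more easily, the correlations with `mpg` can be ranked by absolute value. This is a small sketch, assuming the ISLR2 package is installed; the variable names `num_vars`, `mpg_cor` and `ranked` are only for illustration.

```r
library(ISLR2)

# Keep only the numeric columns (this drops the qualitative `name` variable)
num_vars <- Auto[, sapply(Auto, is.numeric)]
cors <- cor(num_vars)

# Correlation of every other variable with mpg, strongest first
mpg_cor <- cors["mpg", colnames(cors) != "mpg"]
ranked  <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
ranked
```

Selecting columns with `is.numeric` avoids relying on `name` being in column 9.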

(c) Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except name as the predictors. Use the `summary()` function to print the results.

Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?

The p-value of the F-statistic is < 2.2e-16, which is very small, and the F-statistic itself is large (252.4), so we feel confident that there is a relationship between the predictors and the response variable mpg.

Displacement, weight, year and origin have p-values below the assumed significance level of α = 0.05. Cylinders, horsepower and acceleration do not.

The coefficient for year is 0.750773. Holding all other predictors constant, each one-unit increase in year (a car one model year newer) is associated with an average increase of about 0.75 mpg.

# perform multiple linear regression on mpg
MilesPerGallon = lm(mpg~.-name,Auto)
summary(MilesPerGallon)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

(d) Use the `plot()` function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

The Q-Q plot follows the reference line fairly closely up to about 2 on the theoretical-quantile axis, after which the residuals pull away from it. However, the residuals vs fitted plot does not show good linearity. In the same plot, the observations with larger fitted values, toward the right, could influence our model. We do not see any observations with a large Cook's distance.

# produce diagnostics plots
par(mfrow=c(2,2))
plot(MilesPerGallon)

cook.d = cooks.distance(MilesPerGallon)
par(mfrow=c(1,1))
plot(cook.d,col="red",pch=19,cex=1)
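The plots can be complemented with a numeric check of leverage and studentized residuals. The following is a sketch under stated assumptions: it refits the same model so that it runs on its own, and the cutoffs (three times the average leverage, and 3 for studentized residuals) are common rules of thumb rather than hard limits.

```r
library(ISLR2)

fit <- lm(mpg ~ . - name, data = Auto)

h <- hatvalues(fit)          # leverage of each observation
r <- rstudent(fit)           # studentized residuals

p <- length(coef(fit)) - 1   # number of predictors
n <- nrow(Auto)
avg_leverage <- (p + 1) / n  # the hat values always average to (p + 1) / n

# Rule-of-thumb cutoffs (assumptions, not hard rules)
high_leverage <- which(h > 3 * avg_leverage)
outliers      <- which(abs(r) > 3)

high_leverage
outliers
```

Any row listed in both vectors would deserve a closer look, since a point that is both an outlier and high leverage can move the fit substantially.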

(e) Use the `*` and `:` symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Both interactions that we tried appear to be statistically significant: displacement:weight (p-value < 2e-16) and cylinders:year (p-value 4.38e-08).

# fit linear regression models with interaction effects
Model1 = lm(mpg~.-name + displacement*weight+year:cylinders, Auto)
summary(Model1)

## 
## Call:
## lm(formula = mpg ~ . - name + displacement * weight + year:cylinders, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5164 -1.5950 -0.1015  1.3021 12.5609 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.632e+01  1.166e+01  -5.686 2.59e-08 ***
## cylinders            1.209e+01  2.161e+00   5.594 4.24e-08 ***
## displacement        -6.739e-02  1.063e-02  -6.340 6.50e-10 ***
## horsepower          -3.994e-02  1.199e-02  -3.332 0.000948 ***
## weight              -1.011e-02  6.936e-04 -14.574  < 2e-16 ***
## acceleration         1.006e-01  8.498e-02   1.184 0.237154    
## year                 1.561e+00  1.455e-01  10.724  < 2e-16 ***
## origin               5.936e-01  2.525e-01   2.351 0.019249 *  
## displacement:weight  2.078e-05  2.200e-06   9.448  < 2e-16 ***
## cylinders:year      -1.543e-01  2.762e-02  -5.588 4.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.853 on 382 degrees of freedom
## Multiple R-squared:  0.8694, Adjusted R-squared:  0.8664 
## F-statistic: 282.6 on 9 and 382 DF,  p-value: < 2.2e-16

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

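Adding log(acceleration) reduced the p-value for acceleration (0.0016, compared with 0.237 in Model1), and the log(acceleration) term is itself significant (p-value 0.0026). However, the F-statistic is now smaller (235.6 compared with 282.6), and the adjusted R-squared is almost unchanged (0.8684 compared with 0.8664).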

# log transformations
Model2 = lm(mpg~.-name + displacement*log(weight)+year:cylinders+log(acceleration), Auto)
summary(Model2)

## 
## Call:
## lm(formula = mpg ~ . - name + displacement * log(weight) + year:cylinders + 
##     log(acceleration), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0920 -1.5290  0.0143  1.2957 12.9721 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               80.859829  76.160285   1.062 0.289043    
## cylinders                 10.279768   2.259566   4.549 7.25e-06 ***
## displacement              -0.504066   0.196698  -2.563 0.010772 *  
## horsepower                -0.047762   0.012410  -3.849 0.000139 ***
## weight                    -0.004112   0.004851  -0.848 0.397142    
## acceleration               1.421988   0.448480   3.171 0.001644 ** 
## year                       1.453984   0.149493   9.726  < 2e-16 ***
## origin                     0.530919   0.253621   2.093 0.036980 *  
## log(weight)              -14.176185  10.284091  -1.378 0.168873    
## log(acceleration)        -22.593994   7.455537  -3.030 0.002609 ** 
## displacement:log(weight)   0.061791   0.024131   2.561 0.010833 *  
## cylinders:year            -0.132177   0.028685  -4.608 5.56e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.831 on 380 degrees of freedom
## Multiple R-squared:  0.8721, Adjusted R-squared:  0.8684 
## F-statistic: 235.6 on 11 and 380 DF,  p-value: < 2.2e-16
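The square-root and squared transformations mentioned in the question can be tried in the same way. The sketch below is only an illustration: the choice of `horsepower` as the predictor to transform is an assumption, and it uses simple one-predictor models so the transformations can be compared directly.

```r
library(ISLR2)

# Same response, one predictor transformed in different ways
fit_raw  <- lm(mpg ~ horsepower,                   data = Auto)
fit_log  <- lm(mpg ~ log(horsepower),              data = Auto)
fit_sqrt <- lm(mpg ~ sqrt(horsepower),             data = Auto)
fit_sq   <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Compare the fits on adjusted R-squared
adj_r2 <- sapply(
  list(raw = fit_raw, log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
  function(m) summary(m)$adj.r.squared
)
adj_r2
```

Note that the squared term must be wrapped in `I()`, because `^` has a special meaning inside a model formula.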

Exercise 10. This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

# exercise 10
# clean the environment
rm(list = ls())
# load the dataset
library(ISLR2)
# fit the multiple regression model
lm.carseats = lm(Sales~Price+Urban+US,data=Carseats)

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

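Sales is recorded in thousands of units. Price: for every one-unit increase in price, sales decrease by about 0.054 thousand units (roughly 54 units), holding the other predictors fixed. UrbanYes: stores in urban locations sell about 0.022 thousand (22) fewer units than the others, but this difference is not statistically significant (p-value 0.936). USYes: a US store sells about 1.2 thousand (1,200) more units than a non-US store, holding the other predictors fixed.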

summary(lm.carseats)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

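Sales = 13.043469 - 0.054459*Price - 0.021916*Urban + 1.200573*US

where Urban = 1 if the store is in an urban location and 0 otherwise, and US = 1 if the store is in the US and 0 otherwise.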
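One way to check how the qualitative variables enter the equation is to rebuild the fitted values by hand from 0/1 indicators and compare them with what `lm()` produces. This is a sketch, assuming the ISLR2 package is installed; `urban_yes`, `us_yes` and `by_hand` are illustrative names.

```r
library(ISLR2)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(fit)

# Indicator (dummy) variables matching the UrbanYes and USYes coefficients
urban_yes <- as.numeric(Carseats$Urban == "Yes")
us_yes    <- as.numeric(Carseats$US == "Yes")

# Fitted values computed directly from the equation
by_hand <- b["(Intercept)"] + b["Price"] * Carseats$Price +
  b["UrbanYes"] * urban_yes + b["USYes"] * us_yes

# Compare with the fitted values stored in the model object
same <- all.equal(unname(by_hand), unname(fitted(fit)))
same
```

The "No" level of each factor is the baseline, which is why only `UrbanYes` and `USYes` appear among the coefficients.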

(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

We can reject the null hypothesis for the Price and US predictor variables, whose p-values are far below 0.05. We cannot reject it for Urban (p-value 0.936), so we should remove Urban as a predictor in the model.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

# fit a new model
lm.carseats2 = lm(Sales~Price+US,data = Carseats)

(f) How well do the models in (a) and (e) fit the data?

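The two models fit about equally well, and neither fits very well: the multiple R-squared is 0.2393 in both, so only about 24% of the variance in Sales is explained. The smaller model has a noticeably higher F-statistic (62.43, up from 41.52), a small increase in adjusted R-squared (0.2354, up from 0.2335) and a slightly lower residual standard error (2.469, down from 2.472).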

summary(lm.carseats2)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
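Because the model from (e) is nested inside the model from (a), the two can also be compared with a partial F-test. The sketch below assumes the ISLR2 package is installed and refits both models under the illustrative names `full` and `reduced`.

```r
library(ISLR2)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)
reduced <- lm(Sales ~ Price + US,         data = Carseats)

# Partial F-test: H0 is that the extra term (Urban) has a zero coefficient
cmp <- anova(reduced, full)
cmp

# With a single dropped coefficient, this p-value matches the t-test for UrbanYes
p_f <- cmp[["Pr(>F)"]][2]
p_t <- summary(full)$coefficients["UrbanYes", "Pr(>|t|)"]
c(partial_F = p_f, t_test = p_t)
```

A large p-value here means the data give no reason to prefer the larger model.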

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

# get  confidence interval for new model
confint(lm.carseats2, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
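The interval for Price can be reproduced by hand as estimate ± t critical value × standard error. This sketch uses the rounded values printed in the `summary(lm.carseats2)` output, so the result agrees with `confint()` only up to rounding.

```r
# Values copied from the summary(lm.carseats2) output for Price
estimate  <- -0.05448
std_error <- 0.00523
df_resid  <- 397

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df = df_resid)

ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

Since the interval does not contain zero, it is consistent with rejecting the null hypothesis for Price at the 5% level.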

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

# take a look at the plots
par(mfrow=c(2,2))
plot(lm.carseats2)

# check for outliers with studentized residuals (both tails)
rs = rstudent(lm.carseats2)
rs[which(abs(rs) > 3)]

## named numeric(0)

No studentized residual exceeds 3 in absolute value, so we see no evidence of outliers.

Exercise 12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The two coefficient estimates share the same numerator, Σ x_i y_i, so they will be the same if they have the same denominator, that is, if Σ x_i² = Σ y_i².
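Written out, the two no-intercept estimates and the condition for equality are as follows. The subscripts Y∼X and X∼Y are notation introduced here for illustration, and both sums of squares are assumed to be non-zero.

```latex
\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\]
\[
\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\left( \sum_{i=1}^{n} x_i y_i \right)
\left( \sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2 \right) = 0
\]
```

So, apart from the degenerate case where the cross-product sum is zero (both estimates are then 0), the estimates agree exactly when the two sums of squares agree.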

# clean the environment
rm(list = ls())

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

# generate an example with different denominators: sum(x1^2) != sum(y1^2)
set.seed (1)
x1=rnorm(100)
y1=x1*x1

# fit a regression to get coefficients
lm.fit = lm(x1~y1+0)
lm.fit2 = lm(y1~x1+0)
# compare coefficients
summary(lm.fit)$coefficients[1,1]

## [1] 0.1078456

summary(lm.fit2)$coefficients[1,1]

## [1] 0.2598674

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

# generate an example with the same denominator
set.seed (1)
x1=rnorm(100)
y1=sample(x1)
# fit a regression to get coefficients
lm.fit = lm(x1~y1+0)
lm.fit2 = lm(y1~x1+0)
# compare coefficients
summary(lm.fit)$coefficients[1,1]

## [1] -0.07767695

summary(lm.fit2)$coefficients[1,1]

## [1] -0.07767695
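Permuting x works because a permutation leaves the sum of squares unchanged. The sketch below checks this and compares the closed-form no-intercept slope with the one from `lm()`; the names `b_y_on_x` and `b_x_on_y` are only for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # a permutation of x, so sum(y^2) equals sum(x^2)

# Closed-form no-intercept slopes
b_y_on_x <- sum(x * y) / sum(x^2)
b_x_on_y <- sum(x * y) / sum(y^2)

# Slope of y onto x as estimated by lm()
b_lm <- unname(coef(lm(y ~ x + 0)))

c(closed_form = b_y_on_x, lm = b_lm)
all.equal(sum(x^2), sum(y^2))
```

`all.equal()` is used instead of `==` because the two sums add the same numbers in a different order, which can differ in the last floating-point digits.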
