2-) Carefully explain the differences between the KNN classifier and KNN regression methods.

The classifier is used when the dependent variable is categorical, while the regression method is used when the dependent variable is continuous (numeric).

For each observation, the classifier estimates the conditional probability of each class as the fraction of the K nearest neighbors belonging to that class, and assigns the class with the highest estimated probability. The regression method instead predicts the mean response of the K nearest neighbors.
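
A minimal sketch of the contrast in R, assuming the class and FNN packages are available (synthetic data, purely for illustration):

library(class)  # class::knn() for classification
library(FNN)    # FNN::knn.reg() for regression

set.seed(1)
x_train <- matrix(rnorm(100))  # 100 training points, 1 feature
x_test  <- matrix(rnorm(10))   # 10 test points

# Classification: predict the majority class among the 5 nearest neighbors
cl_train <- factor(ifelse(x_train > 0, "pos", "neg"))
class::knn(train = x_train, test = x_test, cl = cl_train, k = 5)

# Regression: predict the mean response of the 5 nearest neighbors
y_train <- as.vector(2 * x_train + rnorm(100, sd = 0.5))
FNN::knn.reg(train = x_train, test = x_test, y = y_train, k = 5)$pred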

9-) This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
  (a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

  (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(Auto[ ,1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
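
Equivalently, the qualitative column can be dropped by name rather than by position:

cor(subset(Auto, select = -name))
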
  (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
auto_data <- Auto
# recode origin (1/2/3) as a labelled factor
auto_data$origin <- factor(auto_data$origin, labels = c("American", "European", "Japanese"))
lm_mpg <- lm(mpg ~ . - name, data = auto_data)
summary(lm_mpg)
## 
## Call:
## lm(formula = mpg ~ . - name, data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16
  i. Is there a relationship between the predictors and the response?

The overall F-test p-value is very low (< 2.2e-16), so we reject the null hypothesis that all coefficients are zero and conclude that at least one predictor has a statistically significant relationship with mpg.
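
The overall F-statistic and its p-value can be pulled directly from the summary object:

f_stat <- summary(lm_mpg)$fstatistic  # named vector: value, numdf, dendf
pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)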

  ii. Which predictors appear to have a statistically significant relationship to the response?

All predictors are significant in explaining the variation of mpg except cylinders, horsepower, and acceleration. The variable name was excluded from the model.

  iii. What does the coefficient for the year variable suggest?
lm_mpg$coefficients
##    (Intercept)      cylinders   displacement     horsepower         weight 
##  -17.954602067   -0.489709424    0.023978644   -0.018183464   -0.006710384 
##   acceleration           year originEuropean originJapanese 
##    0.079103036    0.777026939    2.630002360    2.853228228

Since the coefficient is positive, it indicates that as the model year of the car increases, so does the miles per gallon: at a rate of about 0.78 mpg per year, holding all the other variables constant. In other words, the newer the car, the better its fuel efficiency.
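
As a quick illustration, the fitted gap between two otherwise-identical cars built ten model years apart is ten times the year coefficient:

10 * coef(lm_mpg)["year"]  # about 7.77 mpg in favor of the newer car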

  (d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm_mpg)

The Q-Q plot shows a long right tail, indicating that the residuals are right-skewed: a few cars have much higher mpg than the model predicts. In particular, observations 323 and 394 stand out as outliers in the residual plots. The residuals-vs-leverage plot also identifies a few observations with unusually high leverage relative to the rest of the data.
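
The visual impression can be checked numerically with base R's influence measures:

h <- hatvalues(lm_mpg)
which(h > 2 * mean(h))            # leverage more than twice the average
which(abs(rstudent(lm_mpg)) > 3)  # studentized residuals flagging outliers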

  (e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
interaction_lm <- lm(formula = mpg ~ . * ., data = auto_data[, -9])  # column 9 is name
summary(interaction_lm)
## 
## Call:
## lm(formula = mpg ~ . * ., data = auto_data[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6008 -1.2863  0.0813  1.2082 12.0382 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.401e+01  5.147e+01   0.855 0.393048    
## cylinders                    3.302e+00  8.187e+00   0.403 0.686976    
## displacement                -3.529e-01  1.974e-01  -1.788 0.074638 .  
## horsepower                   5.312e-01  3.390e-01   1.567 0.117970    
## weight                      -3.259e-03  1.820e-02  -0.179 0.857980    
## acceleration                -6.048e+00  2.147e+00  -2.818 0.005109 ** 
## year                         4.833e-01  5.923e-01   0.816 0.415119    
## originEuropean              -3.517e+01  1.260e+01  -2.790 0.005547 ** 
## originJapanese              -3.765e+01  1.426e+01  -2.640 0.008661 ** 
## cylinders:displacement      -6.316e-03  7.106e-03  -0.889 0.374707    
## cylinders:horsepower         1.452e-02  2.457e-02   0.591 0.555109    
## cylinders:weight             5.703e-04  9.044e-04   0.631 0.528709    
## cylinders:acceleration       3.658e-01  1.671e-01   2.189 0.029261 *  
## cylinders:year              -1.447e-01  9.652e-02  -1.499 0.134846    
## cylinders:originEuropean    -7.210e-01  1.088e+00  -0.662 0.508100    
## cylinders:originJapanese     1.226e+00  1.007e+00   1.217 0.224379    
## displacement:horsepower     -5.407e-05  2.861e-04  -0.189 0.850212    
## displacement:weight          2.659e-05  1.455e-05   1.828 0.068435 .  
## displacement:acceleration   -2.547e-03  3.356e-03  -0.759 0.448415    
## displacement:year            4.547e-03  2.446e-03   1.859 0.063842 .  
## displacement:originEuropean -3.364e-02  4.220e-02  -0.797 0.425902    
## displacement:originJapanese  5.375e-02  4.145e-02   1.297 0.195527    
## horsepower:weight           -3.407e-05  2.955e-05  -1.153 0.249743    
## horsepower:acceleration     -3.445e-03  3.937e-03  -0.875 0.382122    
## horsepower:year             -6.427e-03  3.891e-03  -1.652 0.099487 .  
## horsepower:originEuropean   -4.869e-03  5.061e-02  -0.096 0.923408    
## horsepower:originJapanese    2.289e-02  6.252e-02   0.366 0.714533    
## weight:acceleration         -6.851e-05  2.385e-04  -0.287 0.774061    
## weight:year                 -8.065e-05  2.184e-04  -0.369 0.712223    
## weight:originEuropean        2.277e-03  2.685e-03   0.848 0.397037    
## weight:originJapanese       -4.498e-03  3.481e-03  -1.292 0.197101    
## acceleration:year            6.141e-02  2.547e-02   2.412 0.016390 *  
## acceleration:originEuropean  9.234e-01  2.641e-01   3.496 0.000531 ***
## acceleration:originJapanese  7.159e-01  3.258e-01   2.198 0.028614 *  
## year:originEuropean          2.932e-01  1.444e-01   2.031 0.043005 *  
## year:originJapanese          3.139e-01  1.483e-01   2.116 0.035034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8866 
## F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

At the 95% confidence level (p-value below 5%), nine terms are significant in explaining the variation of mpg. Six of them are interactions: cylinders:acceleration, acceleration:year, acceleration:originEuropean, acceleration:originJapanese, year:originEuropean, and year:originJapanese.
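
The significant terms can also be extracted programmatically instead of being read off the table:

coefs <- summary(interaction_lm)$coefficients
sig <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
sig[grepl(":", sig)]  # keep only the interaction terms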

  (f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
exp2_lm <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = auto_data)

summary(exp2_lm)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) + 
##     I(horsepower^2) + I(year^2), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4816 -1.5384  0.0735  1.3671 12.0213 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.185e+02  6.966e+01   6.008 4.40e-09 ***
## cylinders          5.073e-01  3.191e-01   1.590 0.112692    
## displacement      -3.328e-02  2.045e-02  -1.627 0.104480    
## horsepower        -1.781e-01  3.953e-02  -4.506 8.81e-06 ***
## weight            -1.114e-02  2.587e-03  -4.306 2.12e-05 ***
## acceleration      -1.700e-01  9.652e-02  -1.762 0.078960 .  
## year              -1.019e+01  1.837e+00  -5.546 5.49e-08 ***
## originEuropean     1.323e+00  5.304e-01   2.494 0.013068 *  
## originJapanese     1.258e+00  5.129e-01   2.452 0.014637 *  
## I(weight^2)        1.182e-06  3.438e-07   3.439 0.000649 ***
## I(displacement^2)  5.839e-05  3.435e-05   1.700 0.089967 .  
## I(horsepower^2)    4.388e-04  1.336e-04   3.284 0.001118 ** 
## I(year^2)          7.210e-02  1.207e-02   5.974 5.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared:  0.8773, Adjusted R-squared:  0.8735 
## F-statistic: 225.9 on 12 and 379 DF,  p-value: < 2.2e-16
exp3_lm <- lm(mpg ~ . - name + I(weight^3) + I(displacement^3) + I(horsepower^3) + I(year^3), data = auto_data)

summary(exp3_lm)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^3) + I(displacement^3) + 
##     I(horsepower^3) + I(year^3), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2784 -1.6036  0.0593  1.3744 12.1152 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.686e+02  4.675e+01   5.746 1.88e-08 ***
## cylinders          6.578e-01  3.334e-01   1.973  0.04923 *  
## displacement      -1.957e-02  1.386e-02  -1.411  0.15897    
## horsepower        -1.182e-01  2.461e-02  -4.802 2.26e-06 ***
## weight            -8.205e-03  1.490e-03  -5.507 6.75e-08 ***
## acceleration      -1.441e-01  9.639e-02  -1.495  0.13573    
## year              -4.618e+00  9.253e-01  -4.991 9.19e-07 ***
## originEuropean     1.442e+00  5.313e-01   2.715  0.00694 ** 
## originJapanese     1.334e+00  5.120e-01   2.605  0.00955 ** 
## I(weight^3)        1.354e-10  3.256e-11   4.159 3.96e-05 ***
## I(displacement^3)  7.671e-08  4.574e-08   1.677  0.09440 .  
## I(horsepower^3)    9.763e-07  3.316e-07   2.944  0.00344 ** 
## I(year^3)          3.106e-04  5.312e-05   5.847 1.08e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.791 on 379 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8721 
## F-statistic: 223.2 on 12 and 379 DF,  p-value: < 2.2e-16
sqr2_lm <- lm(mpg ~ . - name + I(weight^0.5) + I(displacement^0.5) + I(horsepower^0.5) + I(year^0.5), data = auto_data)

summary(sqr2_lm)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^0.5) + I(displacement^0.5) + 
##     I(horsepower^0.5) + I(year^0.5), data = auto_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7371 -1.5567  0.0806  1.2635 12.0194 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.772e+03  2.786e+02   6.359 5.83e-10 ***
## cylinders            1.477e-01  2.937e-01   0.503 0.615288    
## displacement         3.307e-02  2.721e-02   1.215 0.225048    
## horsepower           1.807e-01  6.200e-02   2.914 0.003776 ** 
## weight               9.920e-03  4.365e-03   2.272 0.023618 *  
## acceleration        -2.000e-01  9.640e-02  -2.074 0.038731 *  
## year                 2.333e+01  3.669e+00   6.359 5.85e-10 ***
## originEuropean       1.269e+00  5.237e-01   2.423 0.015854 *  
## originJapanese       1.260e+00  5.136e-01   2.454 0.014582 *  
## I(weight^0.5)       -1.479e+00  5.120e-01  -2.889 0.004084 ** 
## I(displacement^0.5) -1.082e+00  8.293e-01  -1.305 0.192697    
## I(horsepower^0.5)   -5.281e+00  1.395e+00  -3.785 0.000179 ***
## I(year^0.5)         -3.932e+02  6.397e+01  -6.147 2.01e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.774 on 379 degrees of freedom
## Multiple R-squared:  0.8775, Adjusted R-squared:  0.8737 
## F-statistic: 226.3 on 12 and 379 DF,  p-value: < 2.2e-16

The transformations did not change the overall significance of the model: every fit has the same overall F-test p-value (< 2.2e-16). Looking at individual predictors, no single transformation clearly improved significance: some terms improve while others get worse. The adjusted R² is also similar across the fits, all around 0.87. In the end, the model with the interaction terms (and no transformations) had the highest adjusted R².
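
The question also suggests log(X), which was not tried above; a quick sketch following the same pattern (raw predictors plus log terms for the size and power variables):

log_lm <- lm(mpg ~ . - name + log(weight) + log(displacement) + log(horsepower), data = auto_data)
summary(log_lm)$adj.r.squared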

10-) This question should be answered using the Carseats data set.

  (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
lm_carseat <- lm(Sales ~ Price + Urban + US, data = Carseats)
  (b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
summary(lm_carseat)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Price: as the price increases, sales decrease; each one-unit increase in Price is associated with a drop of about 0.054 in Sales (measured in thousands of units), holding the other predictors fixed.

Urban stores have slightly lower sales than rural ones, but this coefficient is not significant, so the apparent difference is likely due to noise rather than a real urban/rural effect.

Stores in the US sell about 1.2 (thousand units) more than stores outside the US.

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Predicted Sales = 13.043469 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes

where UrbanYes = 1 for a store in an urban area and 0 otherwise, and USYes = 1 for a store in the US and 0 otherwise.
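
The 0/1 dummy coding R chose for Urban and US can be confirmed directly from the design matrix:

head(model.matrix(lm_carseat))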

  (d) For which of the predictors can you reject the null hypothesis H0: βj = 0?

For Price and USYes, as their p-values are below 5%.

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm_carseat2 <- lm(Sales ~ Price + US, data = Carseats)
summary(lm_carseat2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
  (f) How well do the models in (a) and (e) fit the data?

They have very close fits: both models have an R² of about 0.2393 (0.23928 for the first versus 0.23926 for the second). Because the first model spends one extra predictor for essentially no gain, its adjusted R² is slightly lower: 0.2335 versus 0.2354 for the second model.
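
The nested models can also be compared formally with a partial F-test; since only Urban was dropped, its p-value will match the t-test above (≈ 0.94):

anova(lm_carseat2, lm_carseat)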

  (g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm_carseat2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
  (h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(lm_carseat2)

There are a few mild outliers, but overall the observations lie close to the normal Q-Q line. The leverage plot shows some observations with leverage noticeably above the average, though none combine high leverage with a large residual.
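
A quick numeric check via Cook's distance, which combines residual size and leverage (values near 1 would flag influential observations):

sort(cooks.distance(lm_carseat2), decreasing = TRUE)[1:5]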

12-) This problem involves simple linear regression without an intercept.

  (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the slope for the regression of Y onto X without an intercept is β̂ = (Σ xᵢyᵢ) / (Σ xᵢ²), and the slope for X onto Y is (Σ xᵢyᵢ) / (Σ yᵢ²). The numerators are identical, so the two coefficient estimates are the same exactly when Σ xᵢ² = Σ yᵢ², i.e. when the sum of squared values of X equals the sum of squared values of Y.

  (b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(51)
x <- rnorm(100)
y <- 3.5 * x + rnorm(100, sd = 3)  # added noise makes sum(y^2) differ from sum(x^2)
data_12b <- data.frame(x, y)

lm_y_12b <- lm(y ~ x + 0, data = data_12b)  # + 0 suppresses the intercept
lm_x_12b <- lm(x ~ y + 0, data = data_12b)
lm_y_12b$coefficients
##        x 
## 3.511354
lm_x_12b$coefficients
##         y 
## 0.1866791
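
As a check, the condition from (a) indeed fails here: the two sums of squares differ.

c(sum(x^2), sum(y^2))  # unequal, hence the two slope estimates differ
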
  (c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(59)
x <- rnorm(100)
y <- x  # identical values, so sum(x^2) == sum(y^2) trivially
data_12c <- data.frame(x, y)

lm_y_12c <- lm(y ~ x + 0, data = data_12c)
lm_x_12c <- lm(x ~ y + 0, data = data_12c)
lm_y_12c$coefficients
## x 
## 1
lm_x_12c$coefficients
## y 
## 1
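
A less trivial example for (c): any permutation of x contains the same values, so sum(x^2) == sum(y^2) holds and, by the condition in (a), the two regressions must agree (though the common coefficient is no longer 1). A sketch:

set.seed(7)
x <- rnorm(100)
y <- sample(x)       # same values, reshuffled
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0))  # matches the line above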