library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
The KNN classifier and KNN regression share the same idea, prediction from the K nearest training observations, but differ in the type of response they handle. KNN classification is used when the response is qualitative (categorical): it assigns an observation to the most common class among its K nearest neighbors. KNN regression is used when the response is quantitative (numeric): it predicts the average of the responses of the K nearest neighbors.
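A minimal sketch on simulated data makes the contrast concrete (it assumes the class and FNN packages are installed; FNN::knn.reg() is one of several KNN regression implementations, not the only option):

library(class)  # class::knn() for classification
library(FNN)    # FNN::knn.reg() for regression

set.seed(1)
train_x = matrix(rnorm(100), ncol = 2)  # 50 training points, 2 predictors
test_x  = matrix(rnorm(20),  ncol = 2)  # 10 test points

# Categorical response: classify each test point by majority vote among
# its 5 nearest training neighbors
train_cl = factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))
class::knn(train_x, test_x, cl = train_cl, k = 5)

# Numeric response: predict each test point as the average response of
# its 5 nearest training neighbors
train_y = train_x[, 1] + rnorm(50)
FNN::knn.reg(train_x, test_x, y = train_y, k = 5)$pred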
data(Auto)
plot(Auto)
cor(Auto[, names(Auto)!="name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
q9lm = lm(mpg ~ . - name, Auto)
summary(q9lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
i: Yes; the F-statistic of 252.4, with a p-value below 2.2e-16, indicates there is a relationship between the predictors and the response. ii: Judging by the individual p-values, displacement, weight, year, and origin each have a statistically significant relationship with mpg. iii: The year coefficient suggests that, holding the other predictors fixed, mpg increases by roughly 0.75 per model year.
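Point iii can be read directly off the fitted object (a quick check using q9lm from above):

coef(q9lm)["year"]     # expected change in mpg per model year, other predictors fixed
confint(q9lm, "year")  # 95% confidence interval for that effect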
par(mfrow=c(2,2))
plot(q9lm)
The Residuals vs Fitted plot shows some curvature, suggesting the relationship between the response and the predictors is not perfectly linear. The Q-Q plot shows the residuals are approximately normally distributed, with some deviation in the upper tail. In the Residuals vs Leverage plot, observation 14 stands out as a potential high-leverage point.
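The leverage reading can be checked numerically (a sketch; the cutoff of twice the average hat value is a common rule of thumb, not the only choice):

hv = hatvalues(q9lm)
which(hv > 2 * mean(hv))  # observations with more than twice the average leverage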
summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
At the 5% level, the following interactions are statistically significant: displacement:year, acceleration:year, and acceleration:origin.
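As a rough follow-up (a sketch, not a formal selection procedure; q9int is a name introduced here), a model keeping only those three interactions retains most of the improvement of the full interaction fit:

q9int = lm(mpg ~ . - name + displacement:year + acceleration:year +
             acceleration:origin, data = Auto)
summary(q9int)$adj.r.squared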
par(mfrow = c(2, 2))
plot(log(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
All three transformations of acceleration show roughly the same, weakly linear relationship with mpg; the log and square-root plots are nearly identical, while squaring compresses most of the points toward the left of the plot.
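A numeric companion to the plots (a sketch comparing one-predictor fits by adjusted R^2):

fits = list(log  = lm(mpg ~ log(acceleration), Auto),
            sqrt = lm(mpg ~ sqrt(acceleration), Auto),
            sq   = lm(mpg ~ I(acceleration^2), Auto))
sapply(fits, function(m) summary(m)$adj.r.squared)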
data(Carseats)
q10lm = lm(Sales ~ Price + Urban + US, Carseats)
summary(q10lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Holding the other predictors fixed, a one-unit increase in Price is associated with a decrease in Sales of about 0.054 units. A store in an urban area is estimated to sell about 0.022 units fewer than a non-urban store, but this coefficient's p-value (0.936) is far from significant, so we cannot conclude there is any Urban effect. A store in the US is estimated to sell about 1.2 units more than a store outside the US.
\[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \cdot \text{Price} - 0.021916 \cdot \text{Urban} + 1.200573 \cdot \text{US} \]
where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
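The equation can be sanity-checked with predict() (the Price value of 100 is just an illustrative input):

# Predicted Sales for a hypothetical urban US store charging 100
predict(q10lm, data.frame(Price = 100, Urban = "Yes", US = "Yes"))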
We can reject the null hypothesis that the coefficient is zero for the predictors Price and US, but not for Urban.
model = lm(Sales ~ Price + US, Carseats)
summary(model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models, from 10a and 10e, fit the data about equally well, with an adjusted R^2 near 0.23; dropping Urban leaves R^2 unchanged (0.2393) and slightly improves the adjusted R^2 (0.2354 vs 0.2335).
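An F-test on the nested models tells the same story, with Urban adding essentially nothing:

anova(model, q10lm)  # reduced model vs. the model that also includes Urban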
confint(model, level=.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2,2))
plot(model)
The Residuals vs Leverage plot shows a few potential outliers, points whose standardized residuals are greater than 2 or less than -2.
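Counting them directly rather than eyeballing the plot (a small cross-check):

sum(abs(rstandard(model)) > 2)                      # standardized residuals beyond +/- 2
sum(hatvalues(model) > 2 * mean(hatvalues(model)))  # unusually high-leverage points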
The coefficient estimate for the regression of Y onto X without an intercept is
\[ \hat{\beta} = \frac{\sum_{i = 1}^{n}{x_i y_i}}{\sum_{i = 1}^{n}{x_i^2}} \]
and for the regression of X onto Y it is
\[ \hat{\alpha} = \frac{\sum_{i = 1}^{n}{x_i y_i}}{\sum_{i = 1}^{n}{y_i^2}} \]
The numerators are identical, so the two estimates are equal exactly when
\[ \sum_{i = 1}^{n}{x_i^2} = \sum_{i = 1}^{n}{y_i^2} \]
set.seed(2)
x = rnorm(100)
y = x*3 + rnorm(100, sd=2)
df = data.frame(x,y)
lm12by = lm(y ~ x + 0)
lm12bx = lm(x ~ y + 0)
summary(lm12bx)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52583 -0.46562 -0.02371 0.42784 1.06727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.25674 0.01507 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5852 on 99 degrees of freedom
## Multiple R-squared: 0.7457, Adjusted R-squared: 0.7432
## F-statistic: 290.3 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm12by)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2442 -1.5840 0.3314 1.4960 4.1668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.9045 0.1705 17.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.968 on 99 degrees of freedom
## Multiple R-squared: 0.7457, Adjusted R-squared: 0.7432
## F-statistic: 290.3 on 1 and 99 DF, p-value: < 2.2e-16
The two coefficient estimates are different: 0.25674 for the regression of x onto y and 2.9045 for y onto x. This is expected from part (a), since sum(x^2) and sum(y^2) are not equal here.
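Verifying the condition from part (a) on this sample:

c(sum_x2 = sum(x^2), sum_y2 = sum(y^2))  # unequal, so the two slopes differ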
set.seed(2)
x = rnorm(100)
y = x
df = data.frame(x,y)
lm12cy = lm(y ~ x + 0)
lm12cx = lm(x ~ y + 0)
summary(lm12cx)
## Warning in summary.lm(lm12cx): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm12cy)
## Warning in summary.lm(lm12cy): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
Both regressions now give the same coefficient estimate of 1: since y = x, sum(x^2) = sum(y^2), so by part (a) the two slope estimates must agree.
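A direct check, trivially true here since y = x:

all.equal(sum(x^2), sum(y^2))  # the condition from part (a) holds
c(coef(lm12cy), coef(lm12cx))  # both slope estimates equal 1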