Assignment 2

Q.2

The KNN classifier is used for classification problems, where the response is qualitative. KNN regression is used for regression problems, where the response is quantitative. For a given observation, the KNN classifier finds the K nearest training observations, estimates the probability of each class as the fraction of those neighbors belonging to it, and assigns the observation to the class with the largest estimated probability. KNN regression finds the same neighborhood of K training observations and then estimates the outcome as the average of the training responses in that neighborhood.
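The difference can be sketched by hand in a few lines of base R. The training data, the test point x0, and K = 3 below are made up purely for illustration; this is not how a KNN library is implemented, only the idea.

```r
# Toy 1-D training data (made up for illustration)
x_train <- c(1, 2, 3, 6, 7, 8)
y_class <- factor(c("a", "a", "a", "b", "b", "b"))  # qualitative response
y_num   <- c(1.0, 2.1, 2.9, 6.2, 6.8, 8.1)          # quantitative response
x0 <- 2.5   # hypothetical test point
K  <- 3

# Indices of the K training points closest to x0
nbrs <- order(abs(x_train - x0))[1:K]

# KNN classifier: estimated class probabilities, then pick the largest
class_prob <- table(y_class[nbrs]) / K
knn_class  <- names(which.max(class_prob))

# KNN regression: average of the responses in the same neighborhood
knn_reg <- mean(y_num[nbrs])
```

Both methods use the same neighborhood; only the final step (majority vote versus average) differs.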

library(readr)
library(MASS)
library(ISLR)

Q.9

Auto = read.csv("Auto.csv", na.strings = "?")
Auto = na.omit(Auto)

# scatterplot matrix
plot(Auto)

# matrix of correlations
corr = cor(Auto[1:8])
print(corr)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

attach(Auto)

# multiple linear regression
lm.fit = lm(mpg~.-name, data = Auto)
summary(lm.fit)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Relationship between the predictors and the response

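There is a relationship between the predictors and the response. The F-statistic of 252.4 has a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero. The model has an R-squared of 0.8215, i.e. it explains about 82% of the variance in mpg.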
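The overall F-test can also be read off the fitted model programmatically. The sketch below assumes that the Auto data frame bundled with the ISLR package matches Auto.csv after na.omit; the object names fit, fstat, and f_pvalue are illustrative.

```r
library(ISLR)  # assumption: ISLR's bundled Auto stands in for Auto.csv

fit <- lm(mpg ~ . - name, data = Auto)

# summary()$fstatistic is a named vector: value, numdf, dendf
fstat <- summary(fit)$fstatistic

# p-value of the overall F-test (H0: all slope coefficients are zero)
f_pvalue <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
               lower.tail = FALSE)

r2 <- summary(fit)$r.squared
```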

Predictors that appear to have a statistically significant relationship to the response

The predictors displacement, weight, year, and origin appear to have a statistically significant relationship with the response variable mpg: each has a p-value below 0.05. Cylinders, horsepower, and acceleration do not (p-values of 0.128, 0.220, and 0.415).

Coefficient for the year variable

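The coefficient for year is 0.75, which suggests that, holding the other predictors fixed, mpg increases by about 0.75 for each additional model year. In other words, newer cars tend to be more fuel efficient.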
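One way to check this reading of the coefficient is to predict mpg for two cars that differ only in model year. The sketch assumes ISLR's bundled Auto data; car_a and car_b are hypothetical cars built from the first row.

```r
library(ISLR)  # assumption: ISLR's bundled Auto stands in for Auto.csv

fit <- lm(mpg ~ . - name, data = Auto)

# Two hypothetical cars, identical except that car_b is one model year newer
car_a <- Auto[1, ]
car_b <- car_a
car_b$year <- car_a$year + 1

# Difference in predicted mpg equals the year coefficient
year_effect <- unname(predict(fit, car_b) - predict(fit, car_a))
year_coef   <- coef(fit)[["year"]]
```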

# diagnostic plots
par(mfrow=c(2,2))
plot(lm.fit)

Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

The residual plot has some unusually large outliers which also have high leverage in the leverage plot. So those outliers are influential in the model.
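The visual reading of the diagnostic plots can be cross-checked numerically. The sketch below assumes ISLR's bundled Auto data, flags observations with studentized residuals above 3 in absolute value, and uses three times the average leverage as a cutoff; that multiplier is an assumed rule of thumb, not a fixed standard.

```r
library(ISLR)  # assumption: ISLR's bundled Auto stands in for Auto.csv

fit <- lm(mpg ~ . - name, data = Auto)

stud_res <- rstudent(fit)        # studentized residuals
lev      <- hatvalues(fit)       # leverage statistics
p <- length(coef(fit)) - 1       # number of predictors
n <- nrow(Auto)

outliers      <- which(abs(stud_res) > 3)       # possible outliers
high_leverage <- which(lev > 3 * (p + 1) / n)   # assumed cutoff: 3x average leverage

# Observations that are both outlying and high leverage
influential <- intersect(names(outliers), names(high_leverage))
```

An observation only counts as both an outlier and a high-leverage point if it appears in influential; cooks.distance(fit) gives a combined measure.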

# interaction effects
inter_eff = lm(mpg ~ . - name + weight * year + weight * origin ,
               data = Auto)
summary(inter_eff)

## 
## Call:
## lm(formula = mpg ~ . - name + weight * year + weight * origin, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6874 -1.8186 -0.1915  1.4850 11.5235 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.142e+02  1.343e+01  -8.509 4.05e-16 ***
## cylinders     -2.002e-01  3.033e-01  -0.660  0.50950    
## displacement   7.467e-03  7.354e-03   1.015  0.31058    
## horsepower    -2.435e-02  1.292e-02  -1.884  0.06033 .  
## weight         2.936e-02  4.647e-03   6.318 7.37e-10 ***
## acceleration   1.159e-01  9.224e-02   1.257  0.20967    
## year           1.979e+00  1.779e-01  11.129  < 2e-16 ***
## origin         4.039e+00  1.246e+00   3.243  0.00129 ** 
## weight:year   -4.470e-04  6.305e-05  -7.090 6.54e-12 ***
## weight:origin -1.261e-03  5.363e-04  -2.352  0.01918 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.066 on 382 degrees of freedom
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8457 
## F-statistic: 239.1 on 9 and 382 DF,  p-value: < 2.2e-16

The interaction terms \(weight \times year\) and \(weight \times origin\) are both statistically significant, with p-values below 0.05. R-squared also increases from 0.8215 to 0.8492.
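Because the model without interactions is nested inside the interaction model, the two can also be compared with a partial F-test. This is a sketch assuming ISLR's bundled Auto data; base_fit and inter_fit are illustrative names.

```r
library(ISLR)  # assumption: ISLR's bundled Auto stands in for Auto.csv

base_fit  <- lm(mpg ~ . - name, data = Auto)
inter_fit <- lm(mpg ~ . - name + weight * year + weight * origin, data = Auto)

# H0: both interaction coefficients are zero
cmp <- anova(base_fit, inter_fit)
cmp
```

A small p-value in the second row indicates that the interaction terms jointly improve the fit.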

# transformation of variables
trans_var = lm(mpg ~ . - name + I(log(weight)) +
                 I(log(origin)) + I(log(weight * year)), data = Auto)
summary(trans_var)

## 
## Call:
## lm(formula = mpg ~ . - name + I(log(weight)) + I(log(origin)) + 
##     I(log(weight * year)), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5239 -1.5448  0.0142  1.4848 12.8274 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.330e+03  4.740e+02   7.025 9.91e-12 ***
## cylinders             -2.983e-01  2.780e-01  -1.073  0.28384    
## displacement           1.160e-02  6.676e-03   1.737  0.08317 .  
## horsepower            -3.786e-02  1.202e-02  -3.150  0.00176 ** 
## weight                 8.398e-03  1.556e-03   5.396 1.20e-07 ***
## acceleration           6.922e-03  8.476e-02   0.082  0.93496    
## year                   1.290e+01  1.868e+00   6.907 2.08e-11 ***
## origin                -3.815e+00  1.643e+00  -2.323  0.02073 *  
## I(log(weight))         8.769e+02  1.419e+02   6.179 1.66e-09 ***
## I(log(origin))         8.318e+00  2.962e+00   2.808  0.00523 ** 
## I(log(weight * year)) -9.183e+02  1.419e+02  -6.470 3.01e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.844 on 381 degrees of freedom
## Multiple R-squared:  0.8706, Adjusted R-squared:  0.8672 
## F-statistic: 256.4 on 10 and 381 DF,  p-value: < 2.2e-16

trans_var2 = lm(mpg ~ . - name + I(sqrt(weight)) +
                  I(sqrt(origin)) + I(sqrt(weight * year)), data = Auto)
summary(trans_var2)

## 
## Call:
## lm(formula = mpg ~ . - name + I(sqrt(weight)) + I(sqrt(origin)) + 
##     I(sqrt(weight * year)), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7099 -1.6043 -0.0566  1.4825 12.0998 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.232e+02  3.127e+01  -3.938 9.75e-05 ***
## cylinders              -1.965e-01  2.813e-01  -0.699  0.48525    
## displacement            1.396e-02  6.686e-03   2.089  0.03741 *  
## horsepower             -3.127e-02  1.195e-02  -2.616  0.00924 ** 
## weight                  1.483e-02  3.099e-03   4.787 2.43e-06 ***
## acceleration            1.032e-01  8.559e-02   1.206  0.22845    
## year                    2.767e+00  3.352e-01   8.255 2.51e-15 ***
## origin                 -9.378e+00  3.278e+00  -2.861  0.00446 ** 
## I(sqrt(weight))         3.366e+00  1.136e+00   2.963  0.00323 ** 
## I(sqrt(origin))         2.756e+01  8.935e+00   3.084  0.00219 ** 
## I(sqrt(weight * year)) -6.559e-01  1.120e-01  -5.859 1.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.863 on 381 degrees of freedom
## Multiple R-squared:  0.8689, Adjusted R-squared:  0.8654 
## F-statistic: 252.5 on 10 and 381 DF,  p-value: < 2.2e-16

trans_var3 = lm(mpg ~ . - name + I(weight ^ 2) + I(origin ^ 2) + I((weight *year) ^ 2), data = Auto)
summary(trans_var3)

## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(origin^2) + I((weight * 
##     year)^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1957 -1.6635 -0.0742  1.5486 12.1525 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.204e+01  8.979e+00  -4.682 3.96e-06 ***
## cylinders            -4.797e-02  2.846e-01  -0.169 0.866225    
## displacement          1.583e-02  6.712e-03   2.359 0.018841 *  
## horsepower           -3.310e-02  1.207e-02  -2.743 0.006376 ** 
## weight               -1.716e-02  1.668e-03 -10.290  < 2e-16 ***
## acceleration          1.033e-01  8.597e-02   1.202 0.230216    
## year                  1.241e+00  9.390e-02  13.212  < 2e-16 ***
## origin                6.160e+00  1.755e+00   3.511 0.000501 ***
## I(weight^2)           3.650e-06  3.603e-07  10.128  < 2e-16 ***
## I(origin^2)          -1.351e+00  4.337e-01  -3.115 0.001981 ** 
## I((weight * year)^2) -3.509e-10  6.917e-11  -5.073 6.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.882 on 381 degrees of freedom
## Multiple R-squared:  0.8671, Adjusted R-squared:  0.8636 
## F-statistic: 248.6 on 10 and 381 DF,  p-value: < 2.2e-16

The log transformation gives the best fit of the three, with the highest R-squared (0.8706), the highest F-statistic (256.4), and the lowest residual standard error (2.844). The model using log transformations is therefore better than the original regression model and the two models that use the square root and square transformations.

detach(Auto)

Q.10

data(Carseats)
attach(Carseats)

contrasts(Urban)

##     Yes
## No    0
## Yes   1

contrasts(US)

##     Yes
## No    0
## Yes   1

The contrasts show that Urban is coded as 1 if the store is in an urban area and 0 otherwise, and US is coded as 1 if the store is in the US and 0 otherwise.
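The same coding can be seen in the design matrix that lm() builds internally. The sketch below uses model.matrix() on the Carseats data from the ISLR package; mm is an illustrative name.

```r
library(ISLR)

# Design matrix columns generated from the two factors
mm <- model.matrix(~ Urban + US, data = Carseats)
head(mm)
```

The columns UrbanYes and USYes hold the 0/1 dummy variables described by contrasts().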

Multiple regression model

# Model 1:
lm.fit2 = lm(Sales ~ Price + Urban + US)
summary(lm.fit2)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Interpretation of coefficients

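The Price coefficient (-0.054) means that, holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of about 0.054 thousand units (Sales is measured in thousands of units in the Carseats data). The UrbanYes coefficient (-0.022) suggests slightly lower sales for urban stores, but it is not statistically significant (p-value 0.936), so there is no evidence that sales differ between urban and non-urban stores. The USYes coefficient (1.201) means that stores in the US sell about 1.2 thousand more units on average than stores outside the US, holding price and urban location fixed.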

Model in equation form

\(Sales = 13.043 - 0.054 \times Price - 0.022 \times UrbanYes + 1.201 \times USYes\)
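As a quick check, the fitted equation (intercept included) can be evaluated by hand and compared with predict(). The store below (price 100, urban, in the US) is hypothetical and chosen only for illustration.

```r
library(ISLR)

fit <- lm(Sales ~ Price + Urban + US, data = Carseats)
b <- coef(fit)

# Hypothetical store: Price = 100, UrbanYes = 1, USYes = 1
by_hand <- b[["(Intercept)"]] + b[["Price"]] * 100 +
  b[["UrbanYes"]] * 1 + b[["USYes"]] * 1

# The same store passed to predict()
new_store <- data.frame(
  Price = 100,
  Urban = factor("Yes", levels = levels(Carseats$Urban)),
  US    = factor("Yes", levels = levels(Carseats$US))
)
by_predict <- unname(predict(fit, new_store))
```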

Rejecting the null hypothesis

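We can reject the null hypothesis \(H_0: \beta_j = 0\) for Price and USYes, whose p-values are far below 0.05. We cannot reject it for UrbanYes, whose p-value is 0.936. The overall F-statistic of 41.52 has a p-value below 2.2e-16, so we also reject the null hypothesis that all coefficients are zero.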

Predictors with evidence of association with the outcome

Only Price and USYes are significant, with p-values below 0.05. UrbanYes is not significant because of its high p-value. So we can fit a smaller model for the outcome by removing the UrbanYes variable.

# fit a smaller model
# Model 2:
lm.fit3 = lm(Sales ~ Price + US)
summary(lm.fit3)

## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Fit of two models

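Model 2 contains only significant predictors, and both models are significant overall (F-statistic p-values below 2.2e-16). However, both have an R-squared of only 0.2393, so they explain about 24% of the variance in Sales. Model 2 is marginally better, with a slightly higher adjusted R-squared (0.2354 versus 0.2335) and a slightly lower residual standard error (2.469 versus 2.472). Neither model fits the data well.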
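Since Model 2 is nested inside Model 1, a partial F-test gives a direct comparison of the two. The sketch below refits both models on the Carseats data from the ISLR package; full and reduced are illustrative names.

```r
library(ISLR)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)  # Model 1
reduced <- lm(Sales ~ Price + US, data = Carseats)          # Model 2

# H0: the Urban coefficient is zero (Model 2 is adequate)
cmp <- anova(reduced, full)

# Adjusted R-squared for both models
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
```

A large p-value in cmp means that dropping Urban does not significantly worsen the fit.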

95% confidence intervals for the coefficient(s) of Model 2

confid_int = confint(lm.fit3, level = 0.95)
confid_int
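For reference, the intervals returned by confint() for a linear model are the estimate plus or minus a t quantile times the standard error. The sketch below reproduces them by hand on the Carseats data from the ISLR package; manual_ci is an illustrative name.

```r
library(ISLR)

fit <- lm(Sales ~ Price + US, data = Carseats)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Two-sided 95% interval uses the 97.5% t quantile on the residual df
tcrit <- qt(0.975, df = df.residual(fit))

manual_ci <- cbind(lower = est - tcrit * se,
                   upper = est + tcrit * se)
```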

Outliers or high leverage observations in Model 2

par(mfrow=c(2,2))
plot(lm.fit3)

There are very few outliers and high leverage observations in the model so they are not very influential.

detach(Carseats)

Q.12

Simple linear regression without an intercept

Circumstances where the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X

The two coefficient estimates are the same whenever X and Y have the same sum of squares, \(\sum_i x_i^2 = \sum_i y_i^2\). X and Y being equal is one special case of this.
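More generally, the standard least-squares estimates for regression without an intercept are:

\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2}
\]

The numerators are identical, so (provided \(\sum_i x_i y_i \neq 0\)) the two estimates agree exactly when \(\sum_i x_i^2 = \sum_i y_i^2\). For example, with \(Y = 1.25X\) the estimates are 1.25 and \(1.25 / 1.25^2 = 0.8\).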

Example where coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X

X= rnorm(n=100)
Y= 0.25*X + X
coef(lm(Y~X))

##   (Intercept)             X 
## -3.885781e-17  1.250000e+00

coef(lm(X~Y))

##  (Intercept)            Y 
## 2.844947e-17 8.000000e-01

Example where the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X

X= rnorm(n=100)
Y= X
coef(lm(Y~X))

##   (Intercept)             X 
## -1.665335e-17  1.000000e+00

coef(lm(X~Y))

##   (Intercept)             Y 
## -1.665335e-17  1.000000e+00
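The estimates can also coincide when X and Y are not equal. In the sketch below, Y is a random permutation of X, so both have the same sum of squares, and `+ 0` in the formula drops the intercept as the exercise heading specifies. The seed and object names are arbitrary choices for illustration.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
X <- rnorm(100)
Y <- sample(X)                   # same values in a different order

# Regression through the origin in both directions
b_yx <- coef(lm(Y ~ X + 0))[["X"]]
b_xy <- coef(lm(X ~ Y + 0))[["Y"]]

all.equal(b_yx, b_xy)            # equal up to floating-point tolerance
```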