Assignment 2: Chapter 3: Linear Regression

Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier and KNN regression both start by identifying the K training observations closest to a test point x0. They differ in the type of response and in how the neighbors are combined. The KNN classifier is used for classification problems with a qualitative response: it estimates the probability of each class as the fraction of the K neighbors that belong to that class, and assigns x0 to the class with the highest estimated probability. KNN regression is used for regression problems with a quantitative response: it estimates f(x0) by averaging the responses of the K neighbors.
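The contrast can be illustrated with a minimal base-R sketch for a single numeric predictor. The function names (`knn_neighbors`, `knn_classify`, `knn_regress`) and the toy data are made up for illustration and are not part of the assignment or of any package.

```r
# Indices of the k training points closest to x0 (one numeric predictor).
knn_neighbors <- function(x_train, x0, k) {
  order(abs(x_train - x0))[seq_len(k)]
}

# KNN classifier: class proportions among the neighbors, then majority vote.
knn_classify <- function(x_train, y_train, x0, k) {
  nb <- knn_neighbors(x_train, x0, k)
  probs <- table(y_train[nb]) / k     # estimated class probabilities
  names(probs)[which.max(probs)]      # ties go to the first class listed
}

# KNN regression: average response of the neighbors.
knn_regress <- function(x_train, y_train, x0, k) {
  nb <- knn_neighbors(x_train, x0, k)
  mean(y_train[nb])
}

# Hypothetical toy data
x_train <- c(1, 2, 3, 6, 7, 8)
labels  <- c("a", "a", "a", "b", "b", "b")   # qualitative response
values  <- c(1, 2, 3, 6, 7, 8)               # quantitative response

pred_class <- knn_classify(x_train, labels, x0 = 2.4, k = 3)
pred_value <- knn_regress(x_train, values, x0 = 2.4, k = 3)
```

Both functions share the same neighbor search; only the final step (vote versus average) differs.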

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

library(ISLR)
attach(Auto)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

plot(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the ‘name’ variable, which is qualitative.

names(Auto)

## [1] "mpg"          "cylinders"    "displacement" "horsepower"   "weight"      
## [6] "acceleration" "year"         "origin"       "name"

cor(Auto[1:8])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: I. Is there a relationship between the predictors and the response? II. Which predictors appear to have a statistically significant relationship to the response? III. What does the coefficient for the year variable suggest?

fit<-lm(mpg~., data = Auto[, 1:8])
summary(fit)

## 
## Call:
## lm(formula = mpg ~ ., data = Auto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

The F-statistic for the model is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all coefficients are zero: at least one of the predictors is related to ‘mpg’.

According to the summary above, ‘displacement’, ‘weight’, ‘year’, and ‘origin’ have a statistically significant relationship with ‘mpg’. Holding all other predictors fixed, a 1-unit increase in ‘displacement’ is associated with an average increase of 0.019896 units in ‘mpg’, and a 1-unit increase in ‘weight’ with an average decrease of 0.006474 units. A 1-unit increase in ‘year’ is associated with an average increase of 0.750773 units in ‘mpg’, which suggests that cars gain about 0.75 mpg per model year. A 1-unit increase in ‘origin’ is associated with an average increase of 1.426141 units in ‘mpg’, although ‘origin’ is a numeric code for a category, so this coefficient should be read with caution. The coefficients of ‘cylinders’, ‘horsepower’, and ‘acceleration’ have large p-values, so there is no evidence that these variables are related to ‘mpg’ once the other predictors are in the model.

The R-squared shows that the model explains about 82.15% of the variability in ‘mpg’, so the model appears to fit the data well.
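As a rough cross-check, the F-statistic, its p-value, and the adjusted R-squared can be recomputed from the rounded values printed in the summary. This sketch uses only numbers copied from the output above (R-squared 0.8215, 7 predictors, 384 residual degrees of freedom, hence n = 392); small differences from the printed values come from rounding.

```r
r2 <- 0.8215   # Multiple R-squared from the summary
p  <- 7        # number of predictors
n  <- 392      # residual df (384) + p + 1

# F-statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(F = f_stat, adj_r2 = adj_r2), 4)
```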

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Based on the Residuals vs Fitted plot, there appears to be non-linearity in the data. Based on the Residuals vs Leverage plot produced by the code below, there appear to be outliers, since some points fall outside the range of -2 to 2 standardized residuals. There also appear to be high-leverage observations, with leverage greater than 0.05; for comparison, the average leverage is (p + 1)/n = 8/392, or about 0.02.

par(mfrow = c(2, 2))
plot(fit)
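The visual check can be backed by a numeric one. The sketch below is one possible way to list suspect observations for any `lm` fit. The helper name `flag_points` and the cut-offs (absolute studentized residual above 3, leverage above twice the average (p + 1)/n) are rule-of-thumb assumptions, not fixed standards, and the demonstration uses simulated data with one planted point rather than the Auto data.

```r
# Flag possible outliers and high-leverage points in a fitted lm object.
flag_points <- function(fit, resid_cut = 3) {
  h <- hatvalues(fit)                          # leverage of each observation
  r <- rstudent(fit)                           # studentized residuals
  lev_cut <- 2 * length(coef(fit)) / length(h) # twice the average leverage
  list(
    outliers      = unname(which(abs(r) > resid_cut)),
    high_leverage = unname(which(h > lev_cut))
  )
}

# Simulated demonstration: observation 100 is far from the rest in x
# and far from the trend in y.
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
x[100] <- 10
y[100] <- -20

flags <- flag_points(lm(y ~ x))
flags
```

For the model above, the same helper could be called as `flag_points(fit)`.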

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

fit2 <- lm(mpg ~ cylinders * displacement + horsepower * weight + acceleration * year, data = Auto[, 1:8])
summary(fit2)

## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + horsepower * weight + 
##     acceleration * year, data = Auto[, 1:8])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3265 -1.5779  0.0389  1.3483 11.6961 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             1.162e+02  1.853e+01   6.274 9.53e-10 ***
## cylinders              -1.803e-01  4.776e-01  -0.377   0.7061    
## displacement           -2.867e-02  1.425e-02  -2.013   0.0449 *  
## horsepower             -2.261e-01  2.609e-02  -8.664  < 2e-16 ***
## weight                 -1.019e-02  9.020e-04 -11.296  < 2e-16 ***
## acceleration           -7.081e+00  1.158e+00  -6.113 2.41e-09 ***
## year                   -6.719e-01  2.417e-01  -2.780   0.0057 ** 
## cylinders:displacement  2.790e-03  2.067e-03   1.350   0.1779    
## horsepower:weight       5.154e-05  6.727e-06   7.661 1.53e-13 ***
## acceleration:year       9.113e-02  1.502e-02   6.069 3.10e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.819 on 382 degrees of freedom
## Multiple R-squared:  0.8726, Adjusted R-squared:  0.8696 
## F-statistic: 290.6 on 9 and 382 DF,  p-value: < 2.2e-16

Based on the summary above, the interaction between ‘cylinders’ and ‘displacement’ is not statistically significant, but the interactions between ‘horsepower’ and ‘weight’ and between ‘acceleration’ and ‘year’ are both significant.
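The difference between the two formula operators can be seen without fitting anything, by asking R which terms a formula expands to. This small sketch uses `terms()` on formulas only, so no data set is needed.

```r
# a * b is shorthand for both main effects plus their interaction
labels_star <- attr(terms(mpg ~ horsepower * weight), "term.labels")

# a : b is the interaction term alone, without main effects
labels_colon <- attr(terms(mpg ~ horsepower:weight), "term.labels")

# The explicit form that a * b expands to
labels_full <- attr(terms(mpg ~ horsepower + weight + horsepower:weight),
                    "term.labels")

labels_star
labels_colon
```

This is why the summary for `fit2` lists each main effect alongside the three interaction terms.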

(f) Try a few different transformations of the variables, such as $\log(X)$, $\sqrt{X}$, $X^2$. Comment on your findings.

The transformations $\log(X)$, $\sqrt{X}$, and $X^2$ were applied to the three variables that did not show a significant relationship with ‘mpg’ in the original model: ‘horsepower’, ‘acceleration’, and ‘cylinders’. Of all the variations plotted, the log of ‘horsepower’ gives the relationship with ‘mpg’ that is closest to linear.

par(mfrow = c(2, 2))
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

par(mfrow = c(2, 2))
plot(log(Auto$acceleration), Auto$mpg)
plot(sqrt(Auto$acceleration), Auto$mpg)
plot((Auto$acceleration)^2, Auto$mpg)

par(mfrow = c(2, 2))
plot(log(Auto$cylinders), Auto$mpg)
plot(sqrt(Auto$cylinders), Auto$mpg)
plot((Auto$cylinders)^2, Auto$mpg)
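The scatterplots can be complemented by comparing how well each transformed predictor fits in a simple regression. The sketch below is a hypothetical helper (`compare_transforms` is not from the assignment) and is demonstrated on simulated data whose true relationship is logarithmic, so it illustrates the approach rather than reporting results for the Auto data.

```r
# R-squared of a simple regression of y on several transformations of x.
# x must be strictly positive for log() and sqrt().
compare_transforms <- function(x, y) {
  fits <- list(
    identity = lm(y ~ x),
    log      = lm(y ~ log(x)),
    sqrt     = lm(y ~ sqrt(x)),
    square   = lm(y ~ I(x^2))   # I() keeps ^ as arithmetic inside a formula
  )
  sapply(fits, function(f) summary(f)$r.squared)
}

# Simulated data: y depends on log(x) plus a little noise
set.seed(1)
x <- runif(200, min = 1, max = 100)
y <- 3 + 2 * log(x) + rnorm(200, sd = 0.2)

r2 <- compare_transforms(x, y)
best <- names(which.max(r2))
round(r2, 3)
```

With the Auto data attached, a call such as `compare_transforms(horsepower, mpg)` would give the corresponding comparison for ‘horsepower’. Because the response is the same in every fit, the R-squared values are directly comparable.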

Problem 10

This question should be answered using the Carseats data set.

library(ISLR)
attach(Carseats)

(a) Fit a multiple regression model to predict Sales using ‘Price’, ‘Urban’, and ‘US’.

fit3<-lm(Sales~Price+Urban+US)
summary(fit3)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

According to the summary above, and noting that ‘Sales’ is measured in thousands of units: the coefficient of ‘Price’ shows that a $1 increase in price is associated with an average decrease of about 54.459 units in sales, holding all other predictors fixed. The coefficient of ‘Urban’ has a large p-value, so there is no evidence of a relationship between ‘Sales’ and whether the store is in an urban location. The coefficient of ‘US’ shows that, on average, a US store sells about 1200.573 more units than a non-US store, holding all other predictors fixed.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

$\text{Sales} = 13.043469 - 0.054459\,\text{Price} - 0.021916\,\text{Urban}_{\text{Yes}} + 1.200573\,\text{US}_{\text{Yes}}$
where $\text{Urban}_{\text{Yes}} = 1$ for an urban store and 0 otherwise, and $\text{US}_{\text{Yes}} = 1$ for a U.S. store and 0 for a non-U.S. store.
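Because both qualitative predictors are 0/1 dummy variables, the fitted model amounts to four parallel lines in ‘Price’ that share one slope and differ only in the intercept. Writing the cases out, with each intercept obtained by adding the relevant dummy coefficients to 13.043469:

$$
\widehat{\text{Sales}} =
\begin{cases}
13.043469 - 0.054459\,\text{Price} & \text{not urban, non-U.S.} \\
13.021553 - 0.054459\,\text{Price} & \text{urban, non-U.S.} \\
14.244042 - 0.054459\,\text{Price} & \text{not urban, U.S.} \\
14.222126 - 0.054459\,\text{Price} & \text{urban, U.S.}
\end{cases}
$$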

(d) For which of the predictors can you reject the null hypothesis $H_0 : \beta_j = 0$?

According to the summary above, the null hypothesis can be rejected for ‘Price’ and ‘US’, because both have very small p-values (below 2e-16 and 4.86e-06 respectively). It cannot be rejected for ‘Urban’, whose p-value is 0.936.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit4<-lm(Sales~Price+US)
summary(fit4)

## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

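The R-squared for both models is 0.2393, so each explains only about 23.93% of the variability in ‘Sales’, and neither model fits the data well. The model in (e) is marginally better, with a slightly higher adjusted R-squared (0.2354 vs 0.2335) and a slightly lower residual standard error (2.469 vs 2.472).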

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(fit4)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
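These intervals follow the usual form, estimate plus or minus a t critical value times the standard error. As a sketch, they can be reproduced approximately from the rounded estimates and standard errors printed in the summary of the smaller model, using 397 residual degrees of freedom; the last digits differ slightly from `confint()` because of rounding.

```r
# Estimates and standard errors copied from the summary output
est <- c("(Intercept)" = 13.03079, Price = -0.05448, USYes = 1.19964)
se  <- c("(Intercept)" = 0.63098,  Price = 0.00523,  USYes = 0.25846)

# Two-sided 95% critical value of the t distribution with 397 df
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
round(ci, 4)
```

None of the three intervals contains zero, which agrees with the small p-values in the summary.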

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

Based on the Residuals vs Fitted plot, the relationship appears to be mostly linear. Based on the Residuals vs Leverage plot produced by the code below, there appear to be outliers, since some points fall outside the range of -2 to 2 standardized residuals. There also appear to be high-leverage observations, with leverage greater than 0.01; for comparison, the average leverage is (p + 1)/n = 3/400 = 0.0075.

par(mfrow = c(2, 2))
plot(fit4)

Problem 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X when $\sum x_i^2 = \sum y_i^2$, as in the example in (c).
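The reasoning can be made explicit by writing out both no-intercept estimates. The subscripts $Y \sim X$ and $X \sim Y$ are notation introduced here to tell the two regressions apart:

$$
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
$$

The numerators are identical, so whenever $\sum x_i y_i \neq 0$ the two estimates are equal exactly when the denominators are equal, that is, when $\sum x_i^2 = \sum y_i^2$. If $\sum x_i y_i = 0$, both estimates are zero regardless.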

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x <- 1:100
sum(x^2)

## [1] 338350

y <- x * -153
sum(y^2)

## [1] 7920435150

fit.X<-lm(y~x + 0)
fit.Y<-lm(x~y + 0)

summary(fit.Y)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.493e-13 -1.826e-15  4.100e-17  1.549e-15  1.140e-14 
## 
## Coefficients:
##     Estimate Std. Error    t value Pr(>|t|)    
## y -6.536e-03  2.846e-19 -2.297e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.532e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.276e+32 on 1 and 99 DF,  p-value: < 2.2e-16

summary(fit.X)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.926e-12 -2.460e-13 -1.000e-15  2.110e-13  3.282e-11 
## 
## Coefficients:
##     Estimate Std. Error    t value Pr(>|t|)    
## x -1.530e+02  5.745e-15 -2.663e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.342e-12 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.093e+32 on 1 and 99 DF,  p-value: < 2.2e-16
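The two estimates reported above can be checked directly from the closed-form expression for regression without an intercept, using the same x and y as in this example. The variable names `beta_y_on_x` and `beta_x_on_y` are chosen here for illustration.

```r
x <- 1:100
y <- -153 * x

# No-intercept least squares: sum(x * y) divided by the sum of squares
# of whichever variable plays the role of predictor
beta_y_on_x <- sum(x * y) / sum(x^2)   # regression of y onto x
beta_x_on_y <- sum(x * y) / sum(y^2)   # regression of x onto y

c(beta_y_on_x, beta_x_on_y)
```

Since y is an exact multiple of x, the two estimates are reciprocals of each other (-153 and -1/153), and they differ because the two sums of squares differ.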

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x<-1:100
sum(x^2)

## [1] 338350

y<-100:1
sum(y^2)

## [1] 338350

fit.Y <- lm(y ~ x + 0)
fit.X <- lm(x ~ y + 0)

summary(fit.Y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

summary(fit.X)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
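Reversing the sequence works because it reorders the same values, which leaves the sum of squares unchanged. As a sketch of the same idea with random data, any permutation of x gives a y with an equal sum of squares; the names `b_yx` and `b_xy` are chosen here for illustration.

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # same values as x in a random order

# Equal sums of squares, up to floating-point rounding
same_ss <- isTRUE(all.equal(sum(x^2), sum(y^2)))

# unname() drops the coefficient labels ("x" and "y") before comparing
b_yx <- unname(coef(lm(y ~ x + 0)))
b_xy <- unname(coef(lm(x ~ y + 0)))
same_coef <- isTRUE(all.equal(b_yx, b_xy))

c(same_ss = same_ss, same_coef = same_coef)
```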

Angelica Maldonado

6/19/2020
