Problem 1

(a) Use the * and : symbols to fit linear regression models with interaction effects. Do any appear to be statistically significant?

# read the Auto data, treating "?" entries as missing
auto <- read.table("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.data", 
                   header = TRUE,
                   na.strings = "?")
auto <- na.omit(auto)    # drop rows with missing values
auto <- auto[, -c(8:9)]  # drop the origin and name columns
attach(auto)             # attach so columns can be referenced by name

str(auto)
## 'data.frame':    392 obs. of  7 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
int_mod1 <- lm(mpg ~ cylinders*weight)
summary(int_mod1)
## 
## Call:
## lm(formula = mpg ~ cylinders * weight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4916  -2.6225  -0.3927   1.7794  16.7087 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      65.3864559  3.7333137  17.514  < 2e-16 ***
## cylinders        -4.2097950  0.7238315  -5.816 1.26e-08 ***
## weight           -0.0128348  0.0013628  -9.418  < 2e-16 ***
## cylinders:weight  0.0010979  0.0002101   5.226 2.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.165 on 388 degrees of freedom
## Multiple R-squared:  0.7174, Adjusted R-squared:  0.7152 
## F-statistic: 328.3 on 3 and 388 DF,  p-value: < 2.2e-16
plot(fitted(int_mod1), resid(int_mod1))

int_mod2 <- lm(horsepower ~ cylinders + year + cylinders:year)
summary(int_mod2)
## 
## Call:
## lm(formula = horsepower ~ cylinders + year + cylinders:year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.778 -12.188  -2.439  11.428  71.569 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -342.7626    69.3281  -4.944 1.14e-06 ***
## cylinders       106.2874    12.5642   8.460 5.54e-16 ***
## year              4.6485     0.9139   5.086 5.69e-07 ***
## cylinders:year   -1.1788     0.1674  -7.042 8.69e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.96 on 388 degrees of freedom
## Multiple R-squared:  0.7591, Adjusted R-squared:  0.7573 
## F-statistic: 407.6 on 3 and 388 DF,  p-value: < 2.2e-16
plot(fitted(int_mod2), resid(int_mod2))

Yes, the interaction between cylinders and weight is a significant predictor of mpg, and the interaction between cylinders and year is a significant predictor of horsepower. All main effects in both models are significant as well. The residual plot for int_mod1 shows a fan-shaped pattern, indicating heteroscedasticity.
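
To back the visual impression with a formal check, a heteroscedasticity test could be run on int_mod1; a minimal sketch, assuming the lmtest package is installed:

library(lmtest)
# Breusch-Pagan test: a small p-value is consistent with the fan-shaped residual plot
bptest(int_mod1)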

(b) Try a few different transformations of the variables. Comment on your findings.

int_mod3 <- lm(mpg ~ cylinders*log10(weight))
summary(int_mod3)
## 
## Call:
## lm(formula = mpg ~ cylinders * log10(weight))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.9668  -2.6279  -0.4572   1.9242  16.6468 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              244.663     29.498   8.294 1.82e-15 ***
## cylinders                -12.870      5.634  -2.284   0.0229 *  
## log10(weight)            -62.657      8.553  -7.326 1.38e-12 ***
## cylinders:log10(weight)    3.445      1.587   2.171   0.0306 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.142 on 388 degrees of freedom
## Multiple R-squared:  0.7205, Adjusted R-squared:  0.7183 
## F-statistic: 333.4 on 3 and 388 DF,  p-value: < 2.2e-16
plot(fitted(int_mod3), resid(int_mod3))

int_mod4 <- lm(horsepower ~ I(cylinders^2)*(year))
summary(int_mod4)
## 
## Call:
## lm(formula = horsepower ~ I(cylinders^2) * (year))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -44.686 -11.456  -2.571  10.665  69.372 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -56.01703   37.07645  -1.511   0.1316    
## I(cylinders^2)        8.73001    1.02780   8.494 4.32e-16 ***
## year                  1.45909    0.48505   3.008   0.0028 ** 
## I(cylinders^2):year  -0.09602    0.01372  -6.999 1.14e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.2 on 388 degrees of freedom
## Multiple R-squared:  0.7781, Adjusted R-squared:  0.7764 
## F-statistic: 453.6 on 3 and 388 DF,  p-value: < 2.2e-16
plot(fitted(int_mod4), resid(int_mod4))

int_mod5 <- lm(horsepower ~ sqrt(cylinders)*year)
summary(int_mod5)
## 
## Call:
## lm(formula = horsepower ~ sqrt(cylinders) * year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.697 -12.092  -2.264  11.240  72.908 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -946.1913   141.0563  -6.708 7.00e-11 ***
## sqrt(cylinders)       514.9939    61.4599   8.379 9.88e-16 ***
## year                   11.3925     1.8681   6.098 2.59e-09 ***
## sqrt(cylinders):year   -5.7391     0.8182  -7.015 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.48 on 388 degrees of freedom
## Multiple R-squared:  0.7457, Adjusted R-squared:  0.7438 
## F-statistic: 379.3 on 3 and 388 DF,  p-value: < 2.2e-16
plot(fitted(int_mod5), resid(int_mod5))

Each of the transformations resulted in a higher R^2 value than its untransformed counterpart, except for int_mod5. int_mod4, which squared the cylinders variable, led to the largest increase in explained variance. Additionally, int_mod3 still showed heteroscedasticity in the same way observed in int_mod1.
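
To compare the fits without reading each summary, the adjusted R^2 values can be collected programmatically; a quick sketch using the models fit above (mpg models and horsepower models should only be compared within their own outcome):

# gather adjusted R^2 for all five interaction models
mods <- list(int_mod1 = int_mod1, int_mod2 = int_mod2, int_mod3 = int_mod3,
             int_mod4 = int_mod4, int_mod5 = int_mod5)
sapply(mods, function(m) summary(m)$adj.r.squared)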

Problem 2

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(MASS)
library(ISLR)
attach(Carseats)

mlr_mod <- lm(Sales ~ Price + Urban + US)

(b) Provide an interpretation of each coefficient in the model.

summary(mlr_mod)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The coefficient for Price suggests that, holding the other variables constant, each additional unit of price predicts a decrease of 0.054 units in Sales. The coefficient for Urban suggests that a store being urban is associated with a 0.022 unit decrease in Sales relative to a non-urban store (again holding the other variables constant); however, this coefficient is not significant, so the apparent association may well be due to chance. The coefficient for US suggests that a store being in the US is associated with a 1.2 unit increase in Sales relative to a non-US store, holding the other variables constant.
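
The interpretation of UrbanYes and USYes follows from how R dummy-codes the factors; a quick way to inspect the coding:

contrasts(Carseats$Urban)    # UrbanYes = 1 for urban stores, 0 otherwise
contrasts(Carseats$US)       # USYes = 1 for US stores, 0 otherwise
head(model.matrix(mlr_mod))  # design matrix containing the dummy columns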

(c) Write out the model in equation form.

Sales = 13.043469 - 0.054459(Price) - 0.021916(Urban) + 1.200573(US)

Urban: 1 if urban, 0 if not

US: 1 if US, 0 if not
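
As a sanity check, the equation can be evaluated by hand and compared against predict(); a sketch for a hypothetical urban US store with Price = 120 (illustrative values, not taken from the data):

# hand-evaluate the fitted equation
13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1
# the same prediction via predict()
predict(mlr_mod, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))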

(d) For which of the predictors can you reject the null hypothesis?

You can reject the null hypothesis that the coefficient is zero for Price and for US (the difference between US and non-US stores); you cannot reject it for Urban.
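
The p-values behind these decisions can be pulled straight from the coefficient table; a quick sketch:

# extract the p-value column from the fitted model's coefficient table
summary(mlr_mod)$coefficients[, "Pr(>|t|)"]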

(e) Fit a smaller model using only the predictors that showed evidence of association with the outcome.

mlr_mod_small <- lm(Sales ~ Price + US)
summary(mlr_mod_small)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

They fit the data equally well, showing that Urban is not a necessary predictor: both explain 23.93% of the variance in Sales. However, the model in (e) has a slightly higher adjusted R-squared because it achieves the same fit with fewer predictors.
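
Because the model in (e) is nested in the model in (a), an F-test makes the comparison explicit; a minimal sketch (with a single dropped term, its p-value matches the t-test on UrbanYes):

# F-test comparing the nested models; dropping Urban should not significantly worsen the fit
anova(mlr_mod_small, mlr_mod)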

(g) Using the model from (e), obtain the 95% confidence intervals for the coefficient(s).

confint(mlr_mod_small)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
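
These intervals can be reproduced from the summary output as estimate ± t * SE, where t is the 0.975 quantile of the t-distribution on the residual degrees of freedom; a sketch for the Price coefficient:

# reproduce the 95% CI for Price from its estimate and standard error
price_row <- coef(summary(mlr_mod_small))["Price", ]
price_row["Estimate"] + c(-1, 1) * qt(0.975, df.residual(mlr_mod_small)) * price_row["Std. Error"]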