Problem Set 5

library(ISLR)

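# Read the Auto data; "?" marks missing values in the raw file
auto<-read.csv("http://faculty.marshall.usc.edu/gareth-james/ISL/Auto.csv",
               header=TRUE,
               na.strings = "?")

# Drop rows with missing values, then drop columns 8 and 9 (origin and name)
auto<-na.omit(auto)
auto<-auto[, -c(8:9)]
attach(auto)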

Problem 1

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant at the 0.05 level?

modInt1<-lm(mpg~year*weight, data=auto)
summary(modInt1)

## 
## Call:
## lm(formula = mpg ~ year * weight, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0397 -1.9956 -0.0983  1.6525 12.9896 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.105e+02  1.295e+01  -8.531 3.30e-16 ***
## year         2.040e+00  1.718e-01  11.876  < 2e-16 ***
## weight       2.755e-02  4.413e-03   6.242 1.14e-09 ***
## year:weight -4.579e-04  5.907e-05  -7.752 8.02e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.193 on 388 degrees of freedom
## Multiple R-squared:  0.8339, Adjusted R-squared:  0.8326 
## F-statistic: 649.3 on 3 and 388 DF,  p-value: < 2.2e-16

Year and weight were chosen as the two explanatory variables because they were the only two found to be statistically significant in Problem 3 of the previous homework (multiple linear regression). Only two-way interactions are fitted here, since the results of higher-order interactions are hard to explain.

With a p-value of 8.02e-14, well below 0.05, the interaction between year and weight (year:weight) appears to be statistically significant.
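
One way to read the year:weight coefficient is as the change in the weight slope per model year. The sketch below plugs the rounded estimates from the summary above into the fitted equation. The names `b_weight`, `b_int`, and `years` are introduced here for illustration, and the three years are assumed to lie within the range of the data.

```r
# Rounded estimates copied from summary(modInt1) above
b_weight <- 2.755e-02    # coefficient on weight
b_int    <- -4.579e-04   # coefficient on year:weight

# In mpg ~ year * weight, the slope of mpg with respect to weight
# depends on year: b_weight + b_int * year
years <- c(70, 76, 82)   # illustrative model years (assumed to be in range)
weight_slope <- b_weight + b_int * years
names(weight_slope) <- years

print(weight_slope)
```

Under these estimates the weight slope is negative for each of the three years and becomes steeper for later years.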

Other possible interactions:

modInt2<-lm(mpg~year*cylinders, data=auto)
summary(modInt2)

## 
## Call:
## lm(formula = mpg ~ year * cylinders, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2164  -2.5792  -0.1558   2.2569  15.2532 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -61.61775   15.10277  -4.080 5.47e-05 ***
## year             1.34054    0.19909   6.733 5.99e-11 ***
## cylinders        5.51044    2.73705   2.013  0.04478 *  
## year:cylinders  -0.11350    0.03647  -3.112  0.00199 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.131 on 388 degrees of freedom
## Multiple R-squared:  0.722,  Adjusted R-squared:  0.7199 
## F-statistic: 335.9 on 3 and 388 DF,  p-value: < 2.2e-16

With a p-value of 0.00199, below 0.05, the interaction between year and cylinders (year:cylinders) appears to be statistically significant.

modInt3<-lm(mpg~weight*cylinders, data=auto)
summary(modInt3)

## 
## Call:
## lm(formula = mpg ~ weight * cylinders, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4916  -2.6225  -0.3927   1.7794  16.7087 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      65.3864559  3.7333137  17.514  < 2e-16 ***
## weight           -0.0128348  0.0013628  -9.418  < 2e-16 ***
## cylinders        -4.2097950  0.7238315  -5.816 1.26e-08 ***
## weight:cylinders  0.0010979  0.0002101   5.226 2.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.165 on 388 degrees of freedom
## Multiple R-squared:  0.7174, Adjusted R-squared:  0.7152 
## F-statistic: 328.3 on 3 and 388 DF,  p-value: < 2.2e-16

With a p-value of 2.83e-07, well below 0.05, the interaction between weight and cylinders (weight:cylinders) appears to be statistically significant.
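
The three models above all use `*`, which is why each summary lists two main effects plus one interaction row. A small sketch of how R expands the two operators, using only the formula and no data; the names `full`, `only`, and `spelled` are introduced here for illustration.

```r
# a * b is shorthand for both main effects plus their interaction
full <- attr(terms(mpg ~ year * weight), "term.labels")

# a : b on its own adds only the interaction term
only <- attr(terms(mpg ~ year:weight), "term.labels")

# Writing the terms out explicitly gives the same model as year * weight
spelled <- attr(terms(mpg ~ year + weight + year:weight), "term.labels")

print(full)
print(only)
print(identical(full, spelled))
```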

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

X^2 transformation, variable=weight

mod1<-lm(mpg~weight+I(weight^2), data=auto)
summary(mod1)

## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.226e+01  2.993e+00  20.800  < 2e-16 ***
## weight      -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
## I(weight^2)  1.697e-06  3.059e-07   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

plot(mod1$fitted.values, mod1$residuals, pch=16)
abline(h=0, col="blue")

plot(mod1)

The residual plot for the X^2 transformation of the weight variable shows no clear pattern, with the residuals spread roughly evenly around 0. However, the Q-Q plot shows the upper tail straying from normality.
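
To see what the quadratic term implies for the fitted curve, the sketch below evaluates the fitted equation using the rounded estimates from the summary above. The function `fit_mpg` and the three example weights are introduced here for illustration and are not taken from the data.

```r
# Rounded estimates copied from summary(mod1) above
b0 <- 6.226e+01
b1 <- -1.850e-02
b2 <- 1.697e-06

# Fitted curve: mpg = b0 + b1 * weight + b2 * weight^2
fit_mpg <- function(w) b0 + b1 * w + b2 * w^2

# The parabola opens upward (b2 > 0), so its minimum is at -b1 / (2 * b2)
turning_point <- -b1 / (2 * b2)

weights <- c(2000, 3000, 4000)  # illustrative weights, not taken from the data
print(round(fit_mpg(weights), 2))
print(round(turning_point))
```

With these rounded estimates the fitted mpg keeps decreasing as weight increases, at a slowing rate, until a weight of roughly 5,450.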

sqrt X, variable=weight

mod2<-lm(mpg~weight+I(weight^(1/2)), data=auto)
summary(mod2)

## 
## Call:
## lm(formula = mpg ~ weight + I(weight^(1/2)), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5660  -2.6552  -0.4161   1.7373  16.1001 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     109.218284  11.573797   9.437  < 2e-16 ***
## weight            0.013191   0.003828   3.446 0.000631 ***
## I(weight^(1/2))  -2.314535   0.424250  -5.456  8.7e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.181 on 389 degrees of freedom
## Multiple R-squared:  0.7145, Adjusted R-squared:  0.713 
## F-statistic: 486.7 on 2 and 389 DF,  p-value: < 2.2e-16

plot(mod2$fitted.values, mod2$residuals, pch=16)
abline(h=0, col="blue")

plot(mod2)

The residual plot for the X^(1/2) transformation of the weight variable shows no clear pattern (perhaps slight fanning), with the residuals spread roughly evenly around 0. However, the Q-Q plot again shows the upper tail straying from normality.
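
The prompt also mentions log(X), which is not fitted above. A minimal sketch of that model follows, assuming the ISLR package is installed and that its bundled `Auto` data frame matches the cleaned CSV used here. The name `mod3` is introduced for this sketch, and no output is shown because this fit was not part of the original analysis.

```r
library(ISLR)  # assumed installed; provides the Auto data frame

# log(X) transformation, variable = weight
mod3 <- lm(mpg ~ log(weight), data = Auto)
summary(mod3)

# Same residual check as for the other transformations
plot(mod3$fitted.values, mod3$residuals, pch = 16)
abline(h = 0, col = "blue")
```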

Problem 2

data(Carseats)
names(Carseats)

##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"

head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age
## 1  9.50       138     73          11        276   120       Bad  42
## 2 11.22       111     48          16        260    83      Good  65
## 3 10.06       113     35          10        269    80    Medium  59
## 4  7.40       117    100           4        466    97    Medium  55
## 5  4.15       141     64           3        340   128       Bad  38
## 6 10.81       124    113          13        501    72       Bad  78
##   Education Urban  US
## 1        17   Yes Yes
## 2        10   Yes Yes
## 3        12   Yes Yes
## 4        14   Yes Yes
## 5        13   Yes  No
## 6        16    No Yes

Fit a multiple regression model to predict Sales using Price, Urban, and US.

CarS1<-lm(Sales~Price+Urban+US, data=Carseats)
summary(CarS1)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Provide an interpretation of each coefficient in the model.

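Price, -0.054459: For every one-unit increase in Price, Sales are expected to decrease by 0.054459, holding Urban and US fixed.

UrbanYes, -0.021916: Shift in the intercept for a store in an urban location. Holding Price and US fixed, expected Sales are 0.021916 lower for Urban = Yes than for Urban = No.

USYes, 1.200573: Shift in the intercept for a store in the US. Holding Price and Urban fixed, expected Sales are 1.200573 higher for US = Yes than for US = No.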

Write out the model in equation form, being careful to handle the qualitative variables properly. (Hint: You can write separate equations)

UrbanNo, USNo (ref): y = -0.054459x + 13.043469

UrbanYes, USYes: y = -0.054459x + (13.043469 - 0.021916 + 1.200573)

UrbanYes, USNo: y = -0.054459x + (13.043469 - 0.021916)

UrbanNo, USYes: y = -0.054459x + (13.043469 + 1.200573)

Note: here y is Sales and x is Price. The slope is the same in all four equations because the model includes no interaction terms between the explanatory variables.
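
The intercepts of the four equations can be worked out numerically. The sketch below uses the estimates from the summary of CarS1 and codes Urban and US as 0/1 dummies; the names `b0`, `b_urban`, `b_us`, and `groups` are introduced here for illustration.

```r
# Estimates copied from summary(CarS1) above
b0      <- 13.043469   # intercept (Urban = No, US = No)
b_urban <- -0.021916   # UrbanYes
b_us    <- 1.200573    # USYes

# All four Urban/US combinations, coded as 0/1 dummies
groups <- expand.grid(Urban = c(0, 1), US = c(0, 1))
groups$intercept <- b0 + b_urban * groups$Urban + b_us * groups$US

print(groups)
```

Each row gives the intercept of one of the four parallel lines; the Price slope is -0.054459 in every case.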

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

summary(CarS1)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The null hypothesis can be rejected for Price and US (USYes): both have p-values well below 0.05 (< 2e-16 and 4.86e-06). It cannot be rejected for Urban (UrbanYes), which has a p-value of 0.936.
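
The t values and p-values in the table can be reproduced from the estimates and standard errors. A minimal sketch, with the numbers copied from the summary of CarS1; the names `est`, `se`, `df_resid`, `t_val`, and `p_val` are introduced here for illustration.

```r
# Estimates and standard errors copied from summary(CarS1) above
est <- c(Price = -0.054459, UrbanYes = -0.021916, USYes = 1.200573)
se  <- c(Price = 0.005242,  UrbanYes = 0.271650,  USYes = 0.259042)
df_resid <- 396  # residual degrees of freedom from the summary

# t statistic and two-sided p-value for H0: beta_j = 0
t_val <- est / se
p_val <- 2 * pt(-abs(t_val), df = df_resid)

print(round(t_val, 3))
print(names(p_val)[p_val < 0.05])
```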

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

CarSmallPrice<-lm(Sales~Price*US, data = Carseats)
summary(CarSmallPrice)

## 
## Call:
## lm(formula = Sales ~ Price * US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9299 -1.6375 -0.0492  1.5765  7.0430 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.974798   0.953079  13.614  < 2e-16 ***
## Price       -0.053986   0.008163  -6.613 1.22e-10 ***
## USYes        1.295775   1.252146   1.035    0.301    
## Price:USYes -0.000835   0.010641  -0.078    0.937    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
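
Because `Price*US` expands to `Price + US + Price:US`, the model above has as many coefficients as the one in (a), and its Price:USYes p-value of 0.937 suggests the interaction adds little. A smaller model in the sense of the question would arguably keep only the two main effects. A minimal sketch, assuming the ISLR package is installed; `CarSmall` and `CarInt` are names introduced here, and output is omitted because this fit is not part of the original analysis.

```r
library(ISLR)  # assumed installed; provides Carseats
data(Carseats)

# Main effects only: Urban dropped, no interaction
CarSmall <- lm(Sales ~ Price + US, data = Carseats)
summary(CarSmall)

# Refit of the interaction model from the text, to compare the two
CarInt <- lm(Sales ~ Price * US, data = Carseats)

# F-test for whether adding Price:US improves on the main-effects model
anova(CarSmall, CarInt)
```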

How well do the models in a) and e) fit the data?

plot(CarS1)

plot(CarSmallPrice)

The diagnostic plots for the models in parts a) and e) look similar. Both residual plots show no pattern, with residuals spread roughly evenly around 0. Both Q-Q plots show very little deviation from normality, and the Cook’s distance plot highlights no outliers with too much leverage on the model fit. However, both models have a multiple R-squared of 0.2393, so each explains only about 24% of the variance in Sales.
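
The adjusted R-squared reported in both summaries can be reproduced from the multiple R-squared and the degrees of freedom. A short sketch using the rounded values printed above; the names `r2`, `n`, `p`, and `adj_r2` are introduced here for illustration.

```r
# Values copied from the summaries above (identical for both models)
r2 <- 0.2393   # multiple R-squared
n  <- 400      # observations: 396 residual df + 4 estimated coefficients
p  <- 3        # predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))
```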

Using the model from (e), obtain the 95% confidence intervals for the coefficient(s).

confint(CarSmallPrice)

##                   2.5 %      97.5 %
## (Intercept) 11.10107096 14.84852478
## Price       -0.07003516 -0.03793731
## USYes       -1.16590989  3.75745964
## Price:USYes -0.02175564  0.02008563
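
These intervals follow the usual form of estimate plus or minus a t critical value times the standard error. The sketch below rebuilds them from the rounded values in the summary of CarSmallPrice, so the results agree with `confint()` only up to rounding; the names `est`, `se`, `t_crit`, and `ci` are introduced here for illustration.

```r
# Estimates and standard errors copied from summary(CarSmallPrice) above
est <- c(Price = -0.053986, USYes = 1.295775, "Price:USYes" = -0.000835)
se  <- c(Price = 0.008163,  USYes = 1.252146, "Price:USYes" = 0.010641)
df_resid <- 396  # residual degrees of freedom from the summary

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)

print(round(ci, 4))

# An interval that contains 0 is consistent with a zero coefficient
print(rownames(ci)[ci[, "lower"] < 0 & ci[, "upper"] > 0])
```

The intervals for USYes and Price:USYes both contain 0, which matches their large p-values in the summary.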

Jack Boydell

10/9/2019
