Prepare the environment for the analysis

library(UsingR)

1. Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.

data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
factor(mtcars$cyl)
##  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
## Levels: 4 6 8
# Since 4 is the first level, it is already the reference level, so no releveling is needed
fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ factor(cyl) + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.9908     1.8878  18.006  < 2e-16 ***
## factor(cyl)6  -4.2556     1.3861  -3.070 0.004718 ** 
## factor(cyl)8  -6.0709     1.6523  -3.674 0.000999 ***
## wt            -3.2056     0.7539  -4.252 0.000213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
# So the factor(cyl)8 coefficient, the adjusted difference between 8 and 4 cylinders, is -6.0709

Answer:

** -6.071

** 33.991

** -4.256

** -3.206
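The reference-level reasoning can be checked with a minimal sketch: if 8 cylinders is made the reference level instead, the coefficient for 4 cylinders should be the same number with the opposite sign. The names `cyl8ref` and `fit_ref8` are illustrative and not part of the original analysis.

```r
# Model from question 1: 4 cylinders is the reference level
fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)

# Same model, but with 8 cylinders as the reference level (illustrative names)
cyl8ref <- relevel(factor(mtcars$cyl), ref = "8")
fit_ref8 <- lm(mpg ~ cyl8ref + wt, data = mtcars)

# 8 vs 4 under the original coding, and 4 vs 8 under the releveled coding
coef(fit)["factor(cyl)8"]
coef(fit_ref8)["cyl8ref4"]
```

The two printed values should differ only in sign, since both describe the same contrast between 4 and 8 cylinders at a fixed weight.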

2. Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg in the weight-adjusted and unadjusted models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?

# Using the results from question 1, now fit a model that does not adjust for wt
fitnowt <- lm(mpg ~ factor(cyl), data = mtcars)
summary(fitnowt)
## 
## Call:
## lm(formula = mpg ~ factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.6636     0.9718  27.437  < 2e-16 ***
## factor(cyl)6  -6.9208     1.5583  -4.441 0.000119 ***
## factor(cyl)8 -11.5636     1.2986  -8.905 8.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.714 
## F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
# Here the factor(cyl)8 coefficient is -11.5636, versus -6.0709 after adjusting for wt in question 1, so we can say that:

Answer:

** Including or excluding weight does not appear to change anything regarding the estimated impact of number of cylinders on mpg.

** Within a given weight, 8 cylinder vehicles have an expected 12 mpg drop in fuel efficiency.

** Holding weight constant, cylinder appears to have less of an impact on mpg than if weight is disregarded.

** Holding weight constant, cylinder appears to have more of an impact on mpg than if weight is disregarded.
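To put the two estimates side by side, a short sketch can pull the 8 versus 4 cylinder coefficient out of both fits. The names `fit_adj`, `fit_unadj`, and `effects` are illustrative only.

```r
# Weight-adjusted and unadjusted models (illustrative names)
fit_adj <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
fit_unadj <- lm(mpg ~ factor(cyl), data = mtcars)

# Collect the 8 vs 4 cylinder coefficient from each model
effects <- c(
  unadjusted = unname(coef(fit_unadj)["factor(cyl)8"]),
  adjusted = unname(coef(fit_adj)["factor(cyl)8"])
)
round(effects, 4)

# The adjusted effect is smaller in magnitude than the unadjusted one
abs(effects["adjusted"]) < abs(effects["unadjusted"])
```

This reproduces the -11.5636 and -6.0709 values from the two summaries above in a single vector, which makes the attenuation after adjusting for weight easy to see.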

3. Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as a confounder. Now fit a second model with mpg as the outcome that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.

# The model with number of cylinders as a factor variable and weight as a confounder is the question 1 case: wt is added to factor(cyl), so we can reuse fit
# To include the interaction between number of cylinders (as a factor variable) and weight, we cross factor(cyl) with wt in fit3
fit3 <- lm(mpg ~ factor(cyl) * wt, data = mtcars)
# Now we can use the analysis of variance table to look at the P-value (for nested lm fits, anova() reports an F test)
anova(fit, fit3)
## Analysis of Variance Table
## 
## Model 1: mpg ~ factor(cyl) + wt
## Model 2: mpg ~ factor(cyl) * wt
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     28 183.06                           
## 2     26 155.89  2     27.17 2.2658 0.1239
# The P-value is 0.1239, larger than 0.05, so we fail to reject the model without the interaction.

Answer:

** The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is necessary.

** The P-value is small (less than 0.05). Thus it is surely true that there is no interaction term in the true model.

** The P-value is small (less than 0.05). Thus it is surely true that there is an interaction term in the true model.

** The P-value is small (less than 0.05). So, according to our criterion, we reject, which suggests that the interaction term is not necessary.

** The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms may not be necessary.

** The P-value is larger than 0.05. So, according to our criterion, we would fail to reject, which suggests that the interaction terms are necessary.
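The `anova()` comparison of two nested `lm` fits reports an F test. As a rough cross-check, and only as a sketch, the P-value can be extracted from the table and compared with a likelihood ratio test built by hand from `logLik()`. The chi-square reference distribution is an asymptotic approximation, and the names `fit_add`, `fit_int`, `p_f`, `lr_stat`, and `p_lrt` are illustrative.

```r
# Additive model and interaction model (illustrative names)
fit_add <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
fit_int <- lm(mpg ~ factor(cyl) * wt, data = mtcars)

# F test from the analysis of variance table; row 2 holds the comparison
cmp <- anova(fit_add, fit_int)
p_f <- cmp[["Pr(>F)"]][2]

# Hand-built likelihood ratio test: twice the log-likelihood difference,
# referred to a chi-square with df equal to the number of extra coefficients
lr_stat <- 2 * (as.numeric(logLik(fit_int)) - as.numeric(logLik(fit_add)))
extra_df <- length(coef(fit_int)) - length(coef(fit_add))
p_lrt <- pchisq(lr_stat, df = extra_df, lower.tail = FALSE)

c(F_test = p_f, LRT = p_lrt)
```

Both P-values are above 0.05 here, so either version of the test leads to the same choice at this benchmark: keep the model without the interaction terms.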

4. Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as

fit4 <- lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
summary(fit4)
## 
## Call:
## lm(formula = mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.991      1.888  18.006  < 2e-16 ***
## I(wt * 0.5)    -6.411      1.508  -4.252 0.000213 ***
## factor(cyl)6   -4.256      1.386  -3.070 0.004718 ** 
## factor(cyl)8   -6.071      1.652  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
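# Remember that wt is expressed in units of 1000 lbs and a ton is 2000 lbs, so wt * 0.5 is the weight in tons and its coefficient is the change in mpg per one ton increase, holding the number of cylinders fixed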

How is the wt coefficient interpreted?

Answer:

** The estimated expected change in MPG per half ton increase in weight for the average number of cylinders.

** The estimated expected change in MPG per half ton increase in weight.

** The estimated expected change in MPG per half ton increase in weight for a specific number of cylinders (4, 6, 8).

** The estimated expected change in MPG per one ton increase in weight.

** The estimated expected change in MPG per one ton increase in weight for a specific number of cylinders (4, 6, 8).
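The unit change can be verified with a small sketch: halving the predictor should double its coefficient while leaving the rest of the fit unchanged. The names `fit_wt` and `fit_ton` are illustrative.

```r
# wt in units of 1000 lbs versus wt * 0.5 in units of 2000 lbs (illustrative names)
fit_wt <- lm(mpg ~ wt + factor(cyl), data = mtcars)
fit_ton <- lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)

# Change in mpg per 2000 lbs, computed two ways
coef(fit_ton)["I(wt * 0.5)"]
2 * coef(fit_wt)["wt"]

# The cylinder coefficients are unaffected by rescaling wt
all.equal(coef(fit_wt)["factor(cyl)8"], coef(fit_ton)["factor(cyl)8"])
```

Because the factor(cyl) terms stay in the model, the weight coefficient in both versions is interpreted at a fixed number of cylinders.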

5. Consider the following data set

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

Give the hat diagonal for the most influential point.

fit5 <- lm(y ~ x)
rstudent(fit5)
##            1            2            3            4            5 
##   1.96142171   0.11888841  -0.02986561  -2.01139691 -11.05925885
# The most influential point is the fifth point, which has by far the largest absolute studentized residual
hatvalues(fit5)
##         1         2         3         4         5 
## 0.2286650 0.2438146 0.2525027 0.2804443 0.9945734
# So the fifth hat value is 0.9946

Answer:

** 0.2025

** 0.9946

** 0.2287

** 0.2804
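As a sketch of where the hat values come from, they are the diagonal of the projection matrix H = X (X'X)^-1 X', which can be built by hand and compared with `hatvalues()`. The names `fit`, `h`, `X`, and `H` are illustrative.

```r
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
fit <- lm(y ~ x)

# Hat values as reported by R
h <- hatvalues(fit)

# Hat matrix built by hand from the design matrix (intercept column plus x)
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# The diagonal of H should match hatvalues()
all.equal(unname(h), unname(diag(H)))

# Index of the point with the largest leverage
which.max(h)
```

The fifth point sits far from the other x values, which is why its leverage is close to 1.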

6. Consider the following data set

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

Give the slope dfbeta for the point with the highest hat value.

fit6 <- lm(y ~ x)
round(hatvalues(fit6)[1:5], 3)
##     1     2     3     4     5 
## 0.229 0.244 0.253 0.280 0.995
# We can see that the fifth point has the highest hat value
round(dfbetas(fit6)[1:5,2],3)
##        1        2        3        4        5 
##   -0.378   -0.029    0.008    0.673 -133.823
# So the slope dfbeta for the fifth point (standardized, as returned by dfbetas()) is about -134

Answer:

** -0.378

** 0.673

** -.00134

** -134
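R provides both `dfbeta()` (the raw change in each coefficient when a point is dropped) and `dfbetas()` (the same change on a standardized scale); the value near -134 comes from the standardized version. A brief sketch, with illustrative names `fit` and `i`, picks the point by its hat value rather than by hard-coding the index.

```r
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
fit <- lm(y ~ x)

# Index of the point with the highest hat value
i <- which.max(hatvalues(fit))

# Raw and standardized influence of that point on the slope
dfbeta(fit)[i, "x"]
dfbetas(fit)[i, "x"]
```

Selecting the row through `which.max()` keeps the lookup tied to the leverage calculation, so it still works if the data are reordered.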

7. Consider a regression relationship between Y and X with and without adjustment for a third variable Z. Which of the following is true about comparing the regression coefficient between Y and X with and without adjustment for Z?

Answer:

** The coefficient can’t change sign after adjustment, except for slight numerical pathological cases.

** For the coefficient to change sign, there must be a significant interaction term.

** Adjusting for another variable can only attenuate the coefficient toward zero. It can’t materially change sign.

** It is possible for the coefficient to reverse sign after adjustment. For example, it can be strongly significant and positive before adjustment and strongly significant and negative after adjustment.
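A small simulation can illustrate how a sign reversal arises. This is a sketch under assumed values: the seed, sample size, coefficients, and noise levels are all arbitrary choices made for illustration, and the variable names are hypothetical. Z drives both X and Y, while the direct effect of X on Y is negative.

```r
set.seed(1)  # arbitrary seed, chosen only for reproducibility
n <- 200

# Z is a confounder: it raises X and, more strongly, raises Y
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.3)
y <- -x + 3 * z + rnorm(n, sd = 0.5)

fit_unadj <- lm(y ~ x)      # ignores Z
fit_adj <- lm(y ~ x + z)    # adjusts for Z

# Slope on x without and with adjustment for z
coef(fit_unadj)["x"]
coef(fit_adj)["x"]

# P-values for the x coefficient in each model
summary(fit_unadj)$coefficients["x", "Pr(>|t|)"]
summary(fit_adj)$coefficients["x", "Pr(>|t|)"]
```

In this setup the unadjusted slope is positive because X mostly acts as a proxy for Z, while the adjusted slope recovers the negative direct effect, and no interaction term is involved.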