Regression Models

data(mtcars)
library(dplyr)

Question 1

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.

mtcars <- mutate(mtcars, cyl=factor(cyl))
fit1 <- lm(mpg ~ cyl + wt, mtcars)
sf1 <- summary(fit1)$coef
sf1

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    33.99      1.888   18.01 6.26e-17
## cyl6           -4.26      1.386   -3.07 4.72e-03
## cyl8           -6.07      1.652   -3.67 9.99e-04
## wt             -3.21      0.754   -4.25 2.13e-04

Answer

-6.071

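Holding weight constant, 8-cyl cars are estimated to get 6.071 mpg less than 4-cyl cars. Note that the intercept (33.991) is the expected mpg of a 4-cyl car at wt = 0, not the mean mpg of 4-cyl cars, so it should not be read as a group mean.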
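One way to check the adjusted comparison is to predict mpg for a 4-cyl and an 8-cyl car at the same weight: in an additive model the difference equals the cyl8 coefficient whatever weight is chosen. This is a minimal sketch; the reference weight of 3 (3000 lbs) and the names cars, fit_adj, new_cars and adj_diff are assumptions made for illustration.

```r
# Work on a copy so the mtcars object used elsewhere is left untouched
cars <- transform(datasets::mtcars, cyl = factor(cyl))
fit_adj <- lm(mpg ~ cyl + wt, data = cars)

# Two hypothetical cars that differ only in cylinder count (wt = 3 is arbitrary)
new_cars <- data.frame(
  cyl = factor(c("4", "8"), levels = levels(cars$cyl)),
  wt  = 3
)
pred <- predict(fit_adj, newdata = new_cars)

# Difference in predicted mpg, 8-cyl minus 4-cyl, at the same weight
adj_diff <- unname(pred[2] - pred[1])
adj_diff
coef(fit_adj)["cyl8"]
```

Changing wt in new_cars shifts both predictions by the same amount, so the difference stays the same.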

Question 2

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg for the weight-adjusted and unadjusted models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?

fit2 <- lm(mpg ~ cyl, mtcars)
sf2 <- summary(fit2)$coef
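# Index the coefficient matrices by name rather than by position
compare <- data.frame(with_wt = sf1["cyl8", "Estimate"],
                      without_wt = sf2["cyl8", "Estimate"],
                      row.names = "8-cyl Est.")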
compare

##            with_wt without_wt
## 8-cyl Est.   -6.07      -11.6

Answer

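Holding weight constant, the estimated difference between 8 and 4 cylinders (-6.07 mpg) is roughly half the unadjusted difference (-11.6 mpg). Cylinder count therefore appears to have less of an impact on mpg once weight is accounted for than when weight is disregarded.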
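The gap between the two estimates is what one would expect if weight is related to both cylinder count and mpg. The sketch below checks both relationships descriptively; the names cars, mean_wt and cor_wt_mpg are assumptions made for illustration.

```r
# Work on a copy so the mtcars object used elsewhere is left untouched
cars <- transform(datasets::mtcars, cyl = factor(cyl))

# Mean weight (in 1000 lbs) within each cylinder group
mean_wt <- tapply(cars$wt, cars$cyl, mean)
mean_wt

# Correlation between weight and mpg across all cars
cor_wt_mpg <- cor(cars$wt, cars$mpg)
cor_wt_mpg
```

If 8-cyl cars are heavier on average and heavier cars get fewer mpg, part of the unadjusted cylinder effect is really a weight effect.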

Question 3

Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as a confounder. Now fit a second model with mpg as the outcome that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.

library(lmtest)
fit3 <- lm(mpg ~ cyl*wt, mtcars)
test <- lrtest(fit1,fit3)
pval <- test$`Pr(>Chisq)`[2]
test

## Likelihood ratio test
## 
## Model 1: mpg ~ cyl + wt
## Model 2: mpg ~ cyl * wt
##   #Df LogLik Df Chisq Pr(>Chisq)  
## 1   5  -73.3                      
## 2   7  -70.7  2  5.14      0.076 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer

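The P-value of 0.076 is larger than 0.05. By our criterion we fail to reject the null hypothesis that the interaction coefficients are zero, which suggests that the interaction terms may not be necessary and that the simpler model, mpg ~ cyl + wt, is adequate.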
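The likelihood ratio statistic can also be reproduced in base R without lmtest: it is twice the difference in log-likelihoods, compared against a chi-squared distribution whose degrees of freedom equal the number of extra parameters. A minimal sketch, with fit_add, fit_int, lr_stat, df_diff and p_lrt as illustrative names:

```r
# Work on a copy so the mtcars object used elsewhere is left untouched
cars <- transform(datasets::mtcars, cyl = factor(cyl))
fit_add <- lm(mpg ~ cyl + wt, data = cars)   # additive model
fit_int <- lm(mpg ~ cyl * wt, data = cars)   # model with cyl:wt interaction

# LR statistic: 2 * (logLik of larger model - logLik of smaller model)
lr_stat <- as.numeric(2 * (logLik(fit_int) - logLik(fit_add)))

# Degrees of freedom: number of additional parameters in the larger model
df_diff <- attr(logLik(fit_int), "df") - attr(logLik(fit_add), "df")

# Upper-tail chi-squared probability
p_lrt <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
c(statistic = lr_stat, df = df_diff, p.value = p_lrt)
```

The values should agree with the Chisq and Pr(>Chisq) columns of the lrtest output.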

Question 4

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars). How is the wt coefficient interpreted?

fit4 <- lm(mpg ~ I(wt * 0.5) + cyl, mtcars)
summary(fit4)$coef

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    33.99       1.89   18.01 6.26e-17
## I(wt * 0.5)    -6.41       1.51   -4.25 2.13e-04
## cyl6           -4.26       1.39   -3.07 4.72e-03
## cyl8           -6.07       1.65   -3.67 9.99e-04

compare <- data.frame(per_half_ton=sf1[4], per_ton=summary(fit4)$coef[2],
                      row.names = "wt coef")
compare

##         per_half_ton per_ton
## wt coef        -3.21   -6.41

Answer

One unit of the weight variable equals 1000 lbs, i.e. half a (short) ton. Multiplying wt by 0.5 expresses weight in tons and doubles the coefficient, which then corresponds to the estimated expected change in mpg per one-ton increase in weight, holding the number of cylinders (4, 6 or 8) fixed.
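The doubling follows from rescaling the predictor: if w = 0.5 * wt, then b_wt * wt = (2 * b_wt) * w, so the coefficient on w must be twice the coefficient on wt. A quick check, using illustrative names b_half_ton, b_ton and ratio:

```r
# Work on a copy so the mtcars object used elsewhere is left untouched
cars <- transform(datasets::mtcars, cyl = factor(cyl))

# Coefficient per unit of wt (1000 lbs, i.e. half a ton)
b_half_ton <- coef(lm(mpg ~ wt + cyl, data = cars))["wt"]

# Coefficient per unit of wt * 0.5 (one ton)
b_ton <- coef(lm(mpg ~ I(wt * 0.5) + cyl, data = cars))["I(wt * 0.5)"]

# Ratio of the two slopes
ratio <- unname(b_ton / b_half_ton)
ratio
```

The fitted values and the cylinder coefficients are unchanged by the rescaling; only the units of the weight slope differ.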

Question 5

Consider the following data set:

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

Give the hat diagonal for the most influential point.

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

fit5 <- lm(y ~ x)
hatvalues(fit5)

##     1     2     3     4     5 
## 0.229 0.244 0.253 0.280 0.995

max_hat <- hatvalues(fit5)[which.max(hatvalues(fit5))]
max_hat

##     5 
## 0.995

# So, the 5th row contains the data for the most influential point.
# Let's plot the data and take a look at the regression line with the extreme
# data point (red line) and without it (blue line)

plot(x, y, pch=19)
abline(fit5, col="red")
fit6 <- lm(y[-which.max(hatvalues(fit5))] ~ x[-which.max(hatvalues(fit5))])
abline(fit6, col="blue")

Answer

The hat value for the most influential point is 0.995.
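For simple linear regression the hat diagonal has a closed form, h_i = 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2), which makes clear that leverage depends only on how far x_i lies from the centre of the x values. The sketch below compares the formula with hatvalues(); x_pts, y_pts, h_manual and h_lm are illustrative names.

```r
x_pts <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y_pts <- c(0.549, -0.026, -0.127, -0.751, 1.344)

n <- length(x_pts)
dev2 <- (x_pts - mean(x_pts))^2

# Closed-form hat diagonal for a model with an intercept and one predictor
h_manual <- 1 / n + dev2 / sum(dev2)

# Same quantity from the fitted model
h_lm <- unname(hatvalues(lm(y_pts ~ x_pts)))

round(cbind(h_manual, h_lm), 3)
```

Because the fifth x value sits far from the other four, nearly all of the spread in x comes from that point and its hat value approaches 1.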

Question 6

Consider the following data set:

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

Give the slope dfbeta for the point with the highest hat value.

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)  
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

fit5 <- lm(y ~ x)
im <- influence.measures(fit5)$infmat
im

##    dfb.1_     dfb.x     dffit cov.r   cook.d   hat
## 1  1.0621 -3.78e-01    1.0679 0.341 2.93e-01 0.229
## 2  0.0675 -2.86e-02    0.0675 2.934 3.39e-03 0.244
## 3 -0.0174  7.92e-03   -0.0174 3.007 2.26e-04 0.253
## 4 -1.2496  6.73e-01   -1.2557 0.342 3.91e-01 0.280
## 5  0.2043 -1.34e+02 -149.7204 0.107 2.70e+02 0.995

max_beta <- im[which.max(abs(im[,"hat"])),"dfb.x"]

Answer

max_beta

## [1] -134
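The dfb.x column from influence.measures() is the standardized version (as returned by dfbetas()): the change in the slope when a point is dropped, divided by a standard error computed with the leave-one-out residual standard deviation. The sketch below rebuilds that number by refitting without the high-leverage point; all object names are illustrative.

```r
x_pts <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y_pts <- c(0.549, -0.026, -0.127, -0.751, 1.344)

fit_all <- lm(y_pts ~ x_pts)
i <- unname(which.max(hatvalues(fit_all)))   # index of the highest-leverage point

# Refit without point i
fit_drop <- lm(y_pts[-i] ~ x_pts[-i])

# Raw change in slope: full-data slope minus leave-one-out slope
slope_change <- unname(coef(fit_all)[2] - coef(fit_drop)[2])

# Scale by the leave-one-out sigma and the slope entry of (X'X)^-1
sigma_drop <- summary(fit_drop)$sigma
xtx_inv <- solve(crossprod(model.matrix(fit_all)))
dfbetas_manual <- slope_change / (sigma_drop * sqrt(xtx_inv[2, 2]))

c(raw = slope_change, standardized = dfbetas_manual)
```

The raw change corresponds to dfbeta() and the standardized value to dfbetas(), so which one to report depends on how the question defines "dfbeta".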

Question 7

Consider a regression relationship between Y and X with and without adjustment for a third variable Z. Which of the following is true about comparing the regression coefficient between Y and X with and without adjustment for Z?

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)  
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
set.seed(4567)
z <- rnorm(5)

fit <- lm(y ~ x)
fitz <- lm(y ~ x + z)

summary(fit)$coef

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.107     0.2354  -0.453   0.6811
## x              0.129     0.0448   2.877   0.0637

summary(fitz)$coef

##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.190     0.1736   -1.09   0.3883
## x              0.188     0.0438    4.28   0.0504
## z             -0.479     0.2437   -1.97   0.1881

Answer

It is possible for the coefficient to reverse sign after adjustment. For example, it can be strongly significant and positive before adjustment and strongly significant and negative after adjustment.

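In this example the sign does not actually change: the x coefficient is 0.129 without z and 0.188 with z, and neither estimate is significant at the 0.05 level. The example only shows that the estimate shifts under adjustment; a sign reversal is nonetheless possible in general.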
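A reversal is easier to see in simulated data where the third variable drives both X and Y. The sketch below uses made-up data: the sample size, seed and coefficients are assumptions chosen so that the true effect of X given Z is negative while the marginal association is positive.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 200
z_sim <- rnorm(n)                              # third variable
x_sim <- z_sim + rnorm(n, sd = 0.5)            # X depends on Z
y_sim <- -1 * x_sim + 3 * z_sim + rnorm(n, sd = 0.5)  # Y: negative X effect, positive Z effect

fit_unadj <- lm(y_sim ~ x_sim)           # ignores Z
fit_adjz  <- lm(y_sim ~ x_sim + z_sim)   # adjusts for Z

slope_unadj <- unname(coef(fit_unadj)["x_sim"])
slope_adj   <- unname(coef(fit_adjz)["x_sim"])
c(unadjusted = slope_unadj, adjusted = slope_adj)

# P-values for the x coefficient in each model
summary(fit_unadj)$coef["x_sim", "Pr(>|t|)"]
summary(fit_adjz)$coef["x_sim", "Pr(>|t|)"]
```

With this setup the unadjusted slope is positive and the adjusted slope is negative, because X partly stands in for Z when Z is left out of the model.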

Regression Models - Quiz 3

Fenton Taylor

September 1, 2016
