Q1

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.

# Make cyl a factor so lm() creates indicator terms (baseline = 4 cylinders)
mtcars$cyl <- factor(mtcars$cyl)
fit <- lm(mpg ~ cyl + wt, data = mtcars)
# Row 3, column 1 is the cyl8 estimate: the 8-vs-4 comparison adjusted for weight
summary(fit)$coef[3, 1]
## [1] -6.07086
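As a quick cross-check (a sketch, not part of the original answer), releveling the factor so that 8 cylinders is the baseline gives the same comparison with the sign flipped:

# Sketch: with 8 cylinders as the reference level, cyl4 should print +6.07086
mtcars2 <- transform(mtcars, cyl = relevel(factor(cyl), ref = "8"))
fit_check <- lm(mpg ~ cyl + wt, data = mtcars2)
coef(fit_check)["cyl4"]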

Q2

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg between the weight-adjusted and unadjusted models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?

mtcars$cyl <- factor(mtcars$cyl)
# fit1 adjusts for weight; fit2 leaves it out
fit1 <- lm(mpg ~ cyl + wt, data = mtcars)
fit2 <- lm(mpg ~ cyl, data = mtcars)
summary(fit1)
## 
## Call:
## lm(formula = mpg ~ cyl + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## cyl6         -4.2556     1.3861  -3.070 0.004718 ** 
## cyl8         -6.0709     1.6523  -3.674 0.000999 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2636 -1.8357  0.0286  1.3893  7.2364 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
## cyl6         -6.9208     1.5583  -4.441 0.000119 ***
## cyl8        -11.5636     1.2986  -8.905 8.57e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.223 on 29 degrees of freedom
## Multiple R-squared:  0.7325, Adjusted R-squared:  0.714 
## F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
#Holding weight constant, the estimated 8-vs-4-cylinder effect shrinks from -11.56 to -6.07 mpg. Including weight changes the estimate substantially, so weight confounds the relationship between cylinders and mpg, and cylinders appear to have less impact on mpg once weight is accounted for.
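A compact way to see the attenuation, reusing the fits above (a sketch, not part of the original answer):

# Adjusted vs. unadjusted 8-vs-4 estimates side by side (~ -6.07 and -11.56)
c(adjusted = coef(fit1)["cyl8"], unadjusted = coef(fit2)["cyl8"])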

Q3

Consider the mtcars data set. Fit a model with mpg as the outcome that considers number of cylinders as a factor variable and weight as a confounder. Now fit a second model with mpg as the outcome that considers the interaction between number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model, using 0.05 as a type I error rate significance benchmark.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
mtcars$cyl <- factor(mtcars$cyl)
fit1 <- lm(mpg ~ cyl + wt, data = mtcars)           # main effects only
fit2 <- lm(mpg ~ cyl + wt + cyl:wt, data = mtcars)  # adds the cyl-by-wt interaction
lrtest(fit1,fit2)
## Likelihood ratio test
## 
## Model 1: mpg ~ cyl + wt
## Model 2: mpg ~ cyl + wt + cyl:wt
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   5 -73.311                       
## 2   7 -70.741  2 5.1412    0.07649 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#The P-value (0.0765) is larger than 0.05. So, according to our criterion, we fail to reject the null, which suggests that the interaction terms are not necessary.
#The anova function leads to the same conclusion, as sketched below.
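For nested Gaussian linear models, anova() performs an F test rather than the chi-squared likelihood ratio test, so its p-value differs slightly from lrtest()'s, but it supports the same fail-to-reject conclusion here (a sketch using the fits above):

# Nested-model F test; compare Pr(>F) to the LRT p-value above
anova(fit1, fit2)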

Q4

Consider the mtcars data set. Fit a model with mpg as the outcome that includes number of cylinders as a factor variable and weight included in the model as lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars). How is the wt coefficient interpreted?

fit <- lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    33.991      1.888  18.006  < 2e-16 ***
## I(wt * 0.5)    -6.411      1.508  -4.252 0.000213 ***
## factor(cyl)6   -4.256      1.386  -3.070 0.004718 ** 
## factor(cyl)8   -6.071      1.652  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
#Since wt is recorded in units of 1000 lb, one unit of I(wt * 0.5) is 2000 lb, i.e. one ton. The coefficient (-6.41) is the estimated expected change in mpg per one-ton increase in weight, holding the number of cylinders constant.
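A quick check of the rescaling (a sketch, not part of the original answer): halving the regressor must exactly double its coefficient, so the I(wt * 0.5) estimate is twice the wt estimate from the unscaled model in Q1 (2 * -3.2056 = -6.411).

fit_unscaled <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit_unscaled)["wt"] * 2   # should match the I(wt * 0.5) coefficient, -6.411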

Q5

Consider the following data set: x <- c(0.586, 0.166, -0.042, -0.614, 11.72); y <- c(0.549, -0.026, -0.127, -0.751, 1.344). Give the hat diagonal for the most influential point.

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
fit <- lm(y ~ x)
#Approach 1: inspect the fitted values, then get the leverages with hat()
predict(fit)
##           1           2           3           4           5 
## -0.03120904 -0.08533002 -0.11213279 -0.18584041  1.40351226
hat(x, intercept = TRUE)
## [1] 0.2286650 0.2438146 0.2525027 0.2804443 0.9945734
#Approach 2: the hat column of influence.measures() lists the same leverages
influence.measures(fit)
## Influence measures of
##   lm(formula = y ~ x) :
## 
##    dfb.1_     dfb.x     dffit cov.r   cook.d   hat inf
## 1  1.0621 -3.78e-01    1.0679 0.341 2.93e-01 0.229   *
## 2  0.0675 -2.86e-02    0.0675 2.934 3.39e-03 0.244    
## 3 -0.0174  7.92e-03   -0.0174 3.007 2.26e-04 0.253   *
## 4 -1.2496  6.73e-01   -1.2557 0.342 3.91e-01 0.280   *
## 5  0.2043 -1.34e+02 -149.7204 0.107 2.70e+02 0.995   *
#Approach 3: hatvalues() extracts the hat diagonals directly; the maximum (0.9946) belongs to the high-leverage point x = 11.72
max(hatvalues(fit))
## [1] 0.9945734
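For intuition, the hat diagonals can also be computed from the definition H = X(X'X)^-1 X' (a sketch, not part of the original answer):

X <- cbind(1, x)                       # design matrix: intercept column plus x
H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix mapping y to the fitted values
diag(H)                                # matches hatvalues(fit); element 5 is 0.9946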

Q6

Consider the following data set: x <- c(0.586, 0.166, -0.042, -0.614, 11.72); y <- c(0.549, -0.026, -0.127, -0.751, 1.344). Give the slope dfbeta for the point with the highest hat value.

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
fit <- lm(y ~ x)
#Point 5 has the highest hat value (0.9946, from Q5); dfb.x is its slope dfbeta
influence.measures(fit)$infmat[5, "dfb.x"]
## [1] -133.8226
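Note that the dfb.x column of influence.measures() is the standardized dfbeta, which dfbetas() returns directly; dfbeta() gives the raw coefficient change. A sketch:

dfbetas(fit)[5, "x"]   # standardized: same value as above, -133.8226
dfbeta(fit)[5, "x"]    # unstandardized change in the slope when point 5 is deleted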

Q7

Consider a regression relationship between Y and X, with and without adjustment for a third variable Z. Which of the following is true when comparing the regression coefficient between Y and X, with and without adjustment for Z?

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
z <- c(0.5, 0.4, 0.6, 0.7, 0.2)
fit <- lm(y ~ x)          # unadjusted
fit_adj <- lm(y ~ x + z)  # adjusted for z
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##        1        2        3        4        5 
##  0.58021  0.05933 -0.01487 -0.56516 -0.05951 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.1067     0.2354  -0.453   0.6811  
## x             0.1289     0.0448   2.877   0.0637 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4702 on 3 degrees of freedom
## Multiple R-squared:  0.7339, Adjusted R-squared:  0.6452 
## F-statistic: 8.275 on 1 and 3 DF,  p-value: 0.06371
summary(fit_adj)
## 
## Call:
## lm(formula = y ~ x + z)
## 
## Residuals:
##        1        2        3        4        5 
##  0.49345 -0.30595  0.09667 -0.25103 -0.03314 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.25676    1.24949   1.006    0.420
## x            0.05232    0.08137   0.643    0.586
## z           -2.46372    2.22023  -1.110    0.383
## 
## Residual standard error: 0.4531 on 2 degrees of freedom
## Multiple R-squared:  0.8353, Adjusted R-squared:  0.6706 
## F-statistic: 5.072 on 2 and 2 DF,  p-value: 0.1647
#Here the x coefficient drops from 0.129 to 0.052 after adjusting for z, but nothing systematic can be said in general: adjustment can move the coefficient toward or away from zero, change its significance, or even reverse its sign.
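To see that adjustment can even reverse the sign, here is a small simulated example (hypothetical data, not from the original; a Simpson's-paradox-style construction):

set.seed(1)                     # hypothetical data for illustration only
g <- rep(0:1, each = 50)        # grouping variable playing the role of z
x2 <- 5 * g + rnorm(100)        # x shifts up with the group
y2 <- x2 - 10 * g + rnorm(100)  # within-group slope is +1, but the groups pull y down
coef(lm(y2 ~ x2))["x2"]         # marginal slope: comes out negative
coef(lm(y2 ~ x2 + g))["x2"]     # adjusted slope: close to +1, sign reversed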