library(haven)
library(tidyverse)

Data

We will be using the Eel.sav dataset, which has information on whether a patient was Cured (1) or Not Cured (0), whether a Intervention was used (Intervention=1, no Intervention = 0) and the duration of their condition before Intervention. Please load this dataset.

Eel <- read_sav("Eel.sav")
head(Eel)
## # A tibble: 6 × 3
##   Cured         Intervention     Duration
##   <dbl+lbl>     <dbl+lbl>           <dbl>
## 1 0 [Not Cured] 0 [No Treatment]        7
## 2 0 [Not Cured] 0 [No Treatment]        7
## 3 0 [Not Cured] 0 [No Treatment]        6
## 4 1 [Cured]     0 [No Treatment]        8
## 5 1 [Cured]     1 [Intervention]        7
## 6 1 [Cured]     0 [No Treatment]        6

Logistic Regression Model

Let’s run a logistic regression model to determine the best model for our data. We will look to see whether Intervention, duration, and their interaction significantly predicts whether the patient gets cured.

For model 1, let’s include Cured as our dependent variable (we want to know what predicts getting cured) and Intervention as our independent variable (does Intervention influence the likelihood of being cured?)

To run this:

model1 <- glm(Cured ~  Intervention, family = "binomial", data=Eel)
summary(model1)
## 
## Call:
## glm(formula = Cured ~ Intervention, family = "binomial", data = Eel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5940  -1.0579   0.8118   0.8118   1.3018  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -0.2877     0.2700  -1.065  0.28671   
## Intervention   1.2287     0.3998   3.074  0.00212 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 154.08  on 112  degrees of freedom
## Residual deviance: 144.16  on 111  degrees of freedom
## AIC: 148.16
## 
## Number of Fisher Scoring iterations: 4

1. Is Intervention a significant predictor of Cured in Model 1?

Intervention is a significant predictor of Cured in Model 1. The p-value associated with the Intervention variable is 0.00212 (indicated by **), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between Intervention and the likelihood of being cured.

For model2, let’s add Duration.

2. Can you come up with the updated code to run this? Please run this model.

model2 <- glm(Cured ~ Intervention + Duration, family = "binomial", data=Eel)
summary(model2)
## 
## Call:
## glm(formula = Cured ~ Intervention + Duration, family = "binomial", 
##     data = Eel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6025  -1.0572   0.8107   0.8161   1.3095  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.234660   1.220563  -0.192  0.84754   
## Intervention  1.233532   0.414565   2.975  0.00293 **
## Duration     -0.007835   0.175913  -0.045  0.96447   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 154.08  on 112  degrees of freedom
## Residual deviance: 144.16  on 110  degrees of freedom
## AIC: 150.16
## 
## Number of Fisher Scoring iterations: 4

Is Duration a significant predictor of Cured in Model 2?

Duration is not a significant predictor of Cured. The p-value associated with Duration in the summary output is 0.96447, which is greater than the significance level of 0.05. Therefore, I fail to reject the null hypothesis, and there is no significant evidence that Duration is associated with the likelihood of a patient being cured.

3. Using the anova() function (with the argument: test = “Chisq”) is Model 2 a significant improvement over Model 1?

anova(model1, model2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Cured ~ Intervention
## Model 2: Cured ~ Intervention + Duration
##   Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
## 1       111     144.16                      
## 2       110     144.16  1 0.0019835   0.9645

The residual deviance for Model 1 is 144.16 with 111 degrees of freedom, and for Model 2, it is 144.16 with 110 degrees of freedom. The difference in deviance is 0.0019835, and the p-value is 0.9645.

Since the p-value (0.9645) is greater than 0.05, I conclude that Model 2 is not a significant improvement over Model 1. There is no significant difference between the two models.

For model3, let’s add the interaction of Duration and Cured. To do this, you can run:

model3 <- glm(Cured ~ Intervention * Duration, family = "binomial", data=Eel)
summary(model3)
## 
## Call:
## glm(formula = Cured ~ Intervention * Duration, family = "binomial", 
##     data = Eel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6510  -1.0611   0.8045   0.8412   1.3285  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)           -0.52033    1.68409  -0.309    0.757
## Intervention           1.85049    2.53509   0.730    0.465
## Duration               0.03436    0.24540   0.140    0.889
## Intervention:Duration -0.08694    0.35200  -0.247    0.805
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 154.08  on 112  degrees of freedom
## Residual deviance: 144.09  on 109  degrees of freedom
## AIC: 152.09
## 
## Number of Fisher Scoring iterations: 4

Note: The * will give us each variable as a main effect and their interaction (you’ll see Intervention, Duration, and the interaction of Intervention and Duration – [represented as Intervention:Duration] in the parameter estimates table.

4. Is Model 3 (the model with Intervention, Duration & Intervention x Duration) a significant improvement over Model 2?

To compare Model 2 and Model 3, I use the anova() function.

anova(model2, model3, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: Cured ~ Intervention + Duration
## Model 2: Cured ~ Intervention * Duration
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       110     144.16                     
## 2       109     144.09  1 0.061022   0.8049

The residual deviances are 144.16 for Model 1 and 144.09 for Model 2, with a difference in deviance of 0.061022. The p-value associated with this difference is 0.8049, which is not statistically significant at the conventional alpha levels (e.g., 0.05).

This result suggests that Model 3 (Intervention, Duration & Intervention x Duration) is not a significant improvement over Model 2 (Intervention and Duration only).

5. Which of the 3 models we ran is the best model for our data?

To determine the best model among the three models, I compare the AIC values.

Model 1 AIC = 148.16 Model 2 AIC = 150.16 Model 3 AIC = 152.09

Model 1 has the lowest AIC value, which suggests that it is the best model among the three for our data. This model includes only the Intervention variable as a predictor of Cured.

We will continue our analyses only looking at the best model. Re-run the best model (as determined in the previous question) and check the assumptions of our model.

Odds Ratio

The odds ratio is outputted for us using the code:

exp(coef(model1))
##  (Intercept) Intervention 
##     0.750000     3.416667

The value for the intercept has little meaning, but the exponent of the Intervention variable is the odds. This tells us that getting the Intervention increases the odds of being cured.

A value greater than 1 means that as the predictor increases, the odds of the outcome occurring increases. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decreases. If the odds ratio is 1 (or very close to 1), it suggests that the variable is not a great predictor of the outcome variable.

6. Is Intervention a significant predictor of cured?

The output summary from glm(formula = Cured ~ Intervention, family = "binomial", data = Eel) showed that Intervention is a significant predictor of Cured, with a p-value of 0.00212 (p < 0.05). This means that there is a statistically significant relationship between receiving the Intervention and the likelihood of being cured.

7. What is the odds ratio?

The odds ratio for the Intervention variable is 3.416667, which is greater than 1. This tells me that getting the Intervention increases the odds of being cured. Specifically, receiving the Intervention increases the odds of being cured by a factor of 3.416667 compared to not receiving the Intervention.

Pseudo-R2

The Pseudo-R2 gives us information about how well our model fits.

library(pscl)
## Warning: package 'pscl' was built under R version 4.2.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
##   McFadden 
## 0.06442071

8. Please report the values of the McFadden R2

The value of McFadden’s R^2 is 0.06442071. This indicates that the model with the Intervention variable explains about 6.44% of the variation compared to the null model.

Testing Assumptions

Let’s look at Cook’s distance. Recall that any value greater than 1 would be concerning for Cook’s distance

plot(model2, which=4, id.n=5)

max(cooks.distance(model2))
## [1] 0.0574928

9. Are there any influential cases based on the Cook’s Distance?

max gives the largest Cook’s distance. If it is greater then one, it suggests that at least one data point has influence on the actual perimeters. max result is 0.0574928. Therefore, I conclude that there are no influential cases.

  • Next, let’s examine our standardized residuals.
### Standardized Residuals are there any outliers? 
standard_res <- rstandard(model1)
plot(standard_res)

10. Are there any standardized residuals greater than the absolute value of 3? If so, how many?

outliers <- sum(abs(standard_res) > 3)
outliers
## [1] 0

There are no outliers greater than 3.

  • Multicollinearity – predictors should not be too highly correlated. Run the vif() function to determine if there are any problems with multicollinearity for model 2 (the model with Duration & Intervention)
library(car)
vif(model2)
## Intervention     Duration 
##     1.075431     1.075431

11. Do our results suggest any problems with multicollinearity?

The VIF values for both Intervention and Duration are 1.075431, which are close to 1. In general, VIF values greater than 10 are considered to indicate multicollinearity issues. Since the VIF values for my predictors are close to 1, it suggests that there are no multicollinearity problems in model2.