library(haven)
library(tidyverse)
We will be using the Eel.sav dataset, which has
information on whether a patient was Cured (1) or Not Cured (0), whether
a Intervention was used (Intervention=1, no Intervention = 0) and the
duration of their condition before Intervention. Please load this
dataset.
Eel <- read_sav("Eel.sav")
head(Eel)
## # A tibble: 6 × 3
## Cured Intervention Duration
## <dbl+lbl> <dbl+lbl> <dbl>
## 1 0 [Not Cured] 0 [No Treatment] 7
## 2 0 [Not Cured] 0 [No Treatment] 7
## 3 0 [Not Cured] 0 [No Treatment] 6
## 4 1 [Cured] 0 [No Treatment] 8
## 5 1 [Cured] 1 [Intervention] 7
## 6 1 [Cured] 0 [No Treatment] 6
Let’s run a logistic regression model to determine the best model for our data. We will look to see whether Intervention, duration, and their interaction significantly predicts whether the patient gets cured.
For model 1, let’s include Cured as our dependent variable (we want to know what predicts getting cured) and Intervention as our independent variable (does Intervention influence the likelihood of being cured?)
To run this:
model1 <- glm(Cured ~ Intervention, family = "binomial", data=Eel)
summary(model1)
##
## Call:
## glm(formula = Cured ~ Intervention, family = "binomial", data = Eel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5940 -1.0579 0.8118 0.8118 1.3018
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2877 0.2700 -1.065 0.28671
## Intervention 1.2287 0.3998 3.074 0.00212 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.08 on 112 degrees of freedom
## Residual deviance: 144.16 on 111 degrees of freedom
## AIC: 148.16
##
## Number of Fisher Scoring iterations: 4
Intervention is a significant predictor of Cured in Model 1. The p-value associated with the Intervention variable is 0.00212 (indicated by **), which is less than the significance level of 0.05. This suggests that there is a statistically significant relationship between Intervention and the likelihood of being cured.
For model2, let’s add Duration.
model2 <- glm(Cured ~ Intervention + Duration, family = "binomial", data=Eel)
summary(model2)
##
## Call:
## glm(formula = Cured ~ Intervention + Duration, family = "binomial",
## data = Eel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6025 -1.0572 0.8107 0.8161 1.3095
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.234660 1.220563 -0.192 0.84754
## Intervention 1.233532 0.414565 2.975 0.00293 **
## Duration -0.007835 0.175913 -0.045 0.96447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.08 on 112 degrees of freedom
## Residual deviance: 144.16 on 110 degrees of freedom
## AIC: 150.16
##
## Number of Fisher Scoring iterations: 4
Duration is not a significant predictor of Cured. The p-value associated with Duration in the summary output is 0.96447, which is greater than the significance level of 0.05. Therefore, I fail to reject the null hypothesis, and there is no significant evidence that Duration is associated with the likelihood of a patient being cured.
anova() function (with the argument:
test = “Chisq”) is Model 2 a significant improvement over
Model 1?anova(model1, model2, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Cured ~ Intervention
## Model 2: Cured ~ Intervention + Duration
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 111 144.16
## 2 110 144.16 1 0.0019835 0.9645
The residual deviance for Model 1 is 144.16 with 111 degrees of freedom, and for Model 2, it is 144.16 with 110 degrees of freedom. The difference in deviance is 0.0019835, and the p-value is 0.9645.
Since the p-value (0.9645) is greater than 0.05, I conclude that Model 2 is not a significant improvement over Model 1. There is no significant difference between the two models.
For model3, let’s add the interaction of Duration and Cured. To do this, you can run:
model3 <- glm(Cured ~ Intervention * Duration, family = "binomial", data=Eel)
summary(model3)
##
## Call:
## glm(formula = Cured ~ Intervention * Duration, family = "binomial",
## data = Eel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6510 -1.0611 0.8045 0.8412 1.3285
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.52033 1.68409 -0.309 0.757
## Intervention 1.85049 2.53509 0.730 0.465
## Duration 0.03436 0.24540 0.140 0.889
## Intervention:Duration -0.08694 0.35200 -0.247 0.805
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 154.08 on 112 degrees of freedom
## Residual deviance: 144.09 on 109 degrees of freedom
## AIC: 152.09
##
## Number of Fisher Scoring iterations: 4
Note: The * will give us each variable as a main effect
and their interaction (you’ll see Intervention, Duration, and the
interaction of Intervention and Duration – [represented as
Intervention:Duration] in the parameter estimates table.
To compare Model 2 and Model 3, I use the anova()
function.
anova(model2, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: Cured ~ Intervention + Duration
## Model 2: Cured ~ Intervention * Duration
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 110 144.16
## 2 109 144.09 1 0.061022 0.8049
The residual deviances are 144.16 for Model 1 and 144.09 for Model 2, with a difference in deviance of 0.061022. The p-value associated with this difference is 0.8049, which is not statistically significant at the conventional alpha levels (e.g., 0.05).
This result suggests that Model 3 (Intervention, Duration & Intervention x Duration) is not a significant improvement over Model 2 (Intervention and Duration only).
To determine the best model among the three models, I compare the AIC values.
Model 1 AIC = 148.16 Model 2 AIC = 150.16 Model 3 AIC = 152.09
Model 1 has the lowest AIC value, which suggests that it is the best model among the three for our data. This model includes only the Intervention variable as a predictor of Cured.
We will continue our analyses only looking at the best model. Re-run the best model (as determined in the previous question) and check the assumptions of our model.
The odds ratio is outputted for us using the code:
exp(coef(model1))
## (Intercept) Intervention
## 0.750000 3.416667
The value for the intercept has little meaning, but the exponent of the Intervention variable is the odds. This tells us that getting the Intervention increases the odds of being cured.
A value greater than 1 means that as the predictor increases, the odds of the outcome occurring increases. A value less than 1 means that as the predictor increases, the odds of the outcome occurring decreases. If the odds ratio is 1 (or very close to 1), it suggests that the variable is not a great predictor of the outcome variable.
The output summary from
glm(formula = Cured ~ Intervention, family = "binomial", data = Eel)
showed that Intervention is a significant predictor of Cured, with a
p-value of 0.00212 (p < 0.05). This means that there is a
statistically significant relationship between receiving the
Intervention and the likelihood of being cured.
The odds ratio for the Intervention variable is 3.416667, which is greater than 1. This tells me that getting the Intervention increases the odds of being cured. Specifically, receiving the Intervention increases the odds of being cured by a factor of 3.416667 compared to not receiving the Intervention.
The Pseudo-R2 gives us information about how well our model fits.
library(pscl)
## Warning: package 'pscl' was built under R version 4.2.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
pR2(model1)["McFadden"]
## fitting null model for pseudo-r2
## McFadden
## 0.06442071
The value of McFadden’s R^2 is 0.06442071. This indicates that the model with the Intervention variable explains about 6.44% of the variation compared to the null model.
Let’s look at Cook’s distance. Recall that any value greater than 1 would be concerning for Cook’s distance
plot(model2, which=4, id.n=5)
max(cooks.distance(model2))
## [1] 0.0574928
max gives the largest Cook’s distance. If it is greater
then one, it suggests that at least one data point has influence on the
actual perimeters. max result is 0.0574928. Therefore, I
conclude that there are no influential cases.
### Standardized Residuals are there any outliers?
standard_res <- rstandard(model1)
plot(standard_res)
outliers <- sum(abs(standard_res) > 3)
outliers
## [1] 0
There are no outliers greater than 3.
vif() function to determine if there are any
problems with multicollinearity for model 2 (the model with
Duration & Intervention)library(car)
vif(model2)
## Intervention Duration
## 1.075431 1.075431
The VIF values for both Intervention and Duration are 1.075431, which
are close to 1. In general, VIF values greater than 10 are considered to
indicate multicollinearity issues. Since the VIF values for my
predictors are close to 1, it suggests that there are no
multicollinearity problems in model2.