Example 1. Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1): win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively, and whether or not the candidate is an incumbent.
Example 2. A researcher is interested in how variables, such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution, affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
Example 3. Resident Management: We seek to model the success of four interventions on resident behaviors. As this is for demonstration purposes, we provide the simple generic case of "A Target Behavior". We consider four interventions:
1. One on One Time
2. Bring Patient to Patio
3. Deep Breathing Exercises
4. Send Resident to Quiet Area
Description of the data
For our data analysis below, we are going to expand on Example 3.
Outcome variable: satis (the intervention had a satisfactory outcome, i.e., the resident is settled), a binary variable coded 1 = satisfactory, 0 = intervention did not have a satisfactory outcome. The predictor variables are time (continuous) and interv (the intervention used, coded 1 through 4 as in Example 3).
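The analysis below assumes the data are already loaded into a data frame named df_intervention. As a minimal sketch, if the data were stored in a CSV file (the file name here is hypothetical), they could be read in with:
# read the intervention data; "intervention.csv" is a placeholder file name
df_intervention <- read.csv("intervention.csv")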
Quick Peek at Data
head(df_intervention)
## satis time interv
## 1 0 18.05 3
## 2 1 18.35 3
## 3 1 20.00 1
## 4 1 15.95 4
## 5 0 14.65 4
## 6 1 15.00 2
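Before modeling, it can be useful to look at basic descriptive statistics for the data frame (base R; output not shown here):
# five-number summaries plus means for every column
summary(df_intervention)
# standard deviation of the continuous predictor
sd(df_intervention$time)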
Crosstabs: Frequency of Categorical Variables
We cross-tabulate the outcome against the categorical predictor to make sure there are no empty cells; combinations of satis and interv with zero observations would make some model terms inestimable.
xtabs(~satis + interv, data = df_intervention)
## interv
## satis 1 2 3 4
## 0 28 97 93 55
## 1 33 54 28 12
Analysis methods you might consider
Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.
Logistic regression: the focus of this page.
Probit regression: Probit analysis will produce results similar to logistic regression. The choice of probit versus logit depends largely on individual preferences.
OLS regression: When used with a binary response variable, this model is known as a linear probability model and can be used as a way to describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and normality-of-errors assumptions of OLS regression, resulting in invalid standard errors and hypothesis tests. For a more thorough discussion of these and other problems with the linear probability model, see Long (1997).
Two-group discriminant function analysis: A multivariate method for dichotomous outcome variables.
Hotelling's T²: The 0/1 outcome is turned into the grouping variable, and the former predictors are turned into outcome variables. This will produce an overall test of significance but will not give individual coefficients for each variable, and it is unclear the extent to which each "predictor" is adjusted for the impact of the other "predictors."
Using the logit model
The code below estimates a logistic regression model using the glm (generalized linear model) function. First, we convert interv to a factor to indicate that it should be treated as a categorical variable.
df_intervention$interv <- factor(df_intervention$interv)
model_logit <- glm(satis ~ time + interv, data = df_intervention, family = "binomial")
summary(model_logit)
##
## Call:
## glm(formula = satis ~ time + interv, family = "binomial", data = df_intervention)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5055 -0.8663 -0.6590 1.1505 2.0913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.46362 1.10031 -3.148 0.001645 **
## time 0.21041 0.06203 3.392 0.000694 ***
## interv2 -0.68097 0.31414 -2.168 0.030181 *
## interv3 -1.39191 0.34190 -4.071 0.0000468 ***
## interv4 -1.59425 0.41521 -3.840 0.000123 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 462.88 on 395 degrees of freedom
## AIC: 472.88
##
## Number of Fisher Scoring iterations: 4
The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable: for every one-unit increase in time, the log odds of a satisfactory outcome increase by 0.21, and each level of interv is compared against the baseline interv=1. We can use the confint function to obtain confidence intervals for the coefficient estimates; for logistic models these are based on the profiled log-likelihood.
CIs using profiled log-likelihood
confint(model_logit)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -5.66169992 -1.33873492
## time 0.09071403 0.33439175
## interv2 -1.30183794 -0.06693339
## interv3 -2.07306948 -0.72919580
## interv4 -2.43820184 -0.80204054
We can also get confidence intervals based on just the standard errors by using the default (Wald) method.
CIs using standard errors
confint.default(model_logit)
## 2.5 % 97.5 %
## (Intercept) -5.62019533 -1.30704172
## time 0.08882621 0.33199488
## interv2 -1.29668376 -0.06526127
## interv3 -2.06202035 -0.72179531
## interv4 -2.40803845 -0.78046309
We can test for an overall effect of interv using the wald.test function from the aod package. The order in which the coefficients appear in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. The argument b supplies the coefficients, Sigma supplies the variance-covariance matrix of the coefficient estimates, and Terms tells R which terms in the model are to be tested; in this case, terms 3, 4, and 5 are the three terms for the levels of interv.
library(aod)  # provides wald.test
wald.test(b = coef(model_logit), Sigma = vcov(model_logit), Terms = 3:5)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 22.9, df = 3, P(> X2) = 0.000042
The chi-squared test statistic of 22.9, with three degrees of freedom, is associated with a p-value of 0.000042, indicating that the overall effect of interv is statistically significant.
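As an aside, the same overall question can also be addressed with a likelihood ratio test. A minimal sketch, comparing the fitted model to a reduced model with interv dropped (standard base R; not run above):
# refit the model without interv and compare the nested models
model_reduced <- update(model_logit, . ~ . - interv)
anova(model_reduced, model_logit, test = "Chisq")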
We can also test additional hypotheses about the differences in the coefficients for the different levels of interv. Below we test that the coefficient for interv=2 is equal to the coefficient for interv=3. The first line of code below creates a vector, l, that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for interv=2 and interv=3 (i.e., the third and fourth terms in the model). To contrast these two terms, we multiply one of them by 1 and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
l <- cbind(0, 0, 1, -1, 0)
wald.test(b = coef(model_logit), Sigma = vcov(model_logit), L = l)
## Wald test:
## ----------
##
## Chi-squared test:
## X2 = 6.4, df = 1, P(> X2) = 0.011
The chi-squared test statistic of 6.4, with one degree of freedom, is associated with a p-value of 0.011, indicating that the difference between the coefficient for interv=2 and the coefficient for interv=3 is statistically significant. The same approach can be used to compare other interventions against each other, but as this is for demonstration purposes, we'll leave this as the sole contrast actually run; a sketch of one further contrast vector follows.
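For illustration only (not run here), a second contrast comparing interv=3 with interv=4, i.e., the fourth and fifth terms in the model, would look like this:
# hypothetical contrast: coefficient for interv3 minus coefficient for interv4
l2 <- cbind(0, 0, 0, 1, -1)
wald.test(b = coef(model_logit), Sigma = vcov(model_logit), L = l2)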
You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of model_logit (coef(model_logit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
odds ratios only
exp(coef(model_logit))
## (Intercept) time interv2 interv3 interv4
## 0.03131624 1.23418464 0.50612454 0.24860056 0.20306061
odds ratios and 95% CI
exp(cbind(OR = coef(model_logit), confint(model_logit)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.03131624 0.003476602 0.2621771
## time 1.23418464 1.094955840 1.3970904
## interv2 0.50612454 0.272031357 0.9352575
## interv3 0.24860056 0.125799050 0.4822967
## interv4 0.20306061 0.087317721 0.4484130
Now we can say that for a one-unit increase in time, the odds of a successfully settled resident (versus the resident persisting with the target behavior) increase by a factor of 1.23 (exp(0.21041) ≈ 1.234). Similarly, the odds of a satisfactory outcome for interv=2 are about 0.51 times the odds for interv=1. Note that while R produces it, the odds ratio for the intercept is not generally interpreted.
You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. In order to create predicted probabilities we first need to create a new data frame with the values we want the independent variables to take on to create our predictions.
We will start by calculating the predicted probability of a successful outcome at each value of interv, holding time at its mean. First we create and view the data frame.
newdata1 <- with(df_intervention, data.frame(time = mean(time), interv = factor(1:4)))
view data frame
head(newdata1)
## time interv
## 1 16.9495 1
## 2 16.9495 2
## 3 16.9495 3
## 4 16.9495 4
These objects must have the same names as the variables in your logistic regression above (e.g., in this example the mean for time must be named time). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create them. The first line of code below is quite compact, so we will break it apart to discuss what the various components do. newdata1$intervP tells R that we want to create a new variable in the data frame newdata1 called intervP; the rest of the command tells R that the values of intervP should be predictions made using the predict() function. The options within the parentheses tell R that the predictions should be based on the analysis model_logit, with values of the predictor variables coming from newdata1, and that the type of prediction is a predicted probability (type = "response"). The second line of the code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.
newdata1$intervP <- predict(model_logit, newdata = newdata1, type = "response")
newdata1
## time interv intervP
## 1 16.9495 1 0.5256612
## 2 16.9495 2 0.3593382
## 3 16.9495 3 0.2159928
## 4 16.9495 4 0.1836943
In the above output we see that the predicted probability of a satisfactory outcome in managing a resident's behavior is 0.53 for "One on One Time" (interv=1) and 0.18 for "Send Resident to Quiet Area" (interv=4). We can do something very similar to create a table of predicted probabilities varying the value of time at each value of interv (i.e., 1, 2, 3, and 4). Here we use 100 evenly spaced values spanning the observed range of time.
newdata2 <- with(df_intervention, data.frame(time = rep(seq(from = min(time), to = max(time),
    length.out = 100), 4), interv = factor(rep(1:4, each = 100))))
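As a quick sanity check (output not shown), the first rows of the new data frame can be inspected with:
head(newdata2)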
The code to generate the predicted probabilities (the first line below) is the same as before, except we are also going to ask for standard errors so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.
newdata3 <- cbind(newdata2, predict(model_logit, newdata = newdata2, type = "link",
se = TRUE))
newdata3 <- within(newdata3, {
    PredictedProb <- plogis(fit)  # back-transform the linear predictor to a probability
    LL <- plogis(fit - (1.96 * se.fit))  # lower 95% confidence limit
    UL <- plogis(fit + (1.96 * se.fit))  # upper 95% confidence limit
})
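To look at the first few rows of newdata3, including the back-transformed probabilities and confidence limits (output not shown):
head(newdata3)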
It can also be helpful to use graphs of predicted probabilities to understand and/or present the model. We will use the ggplot2 package for graphing; a sketch of a plot with the predicted probabilities and 95% confidence intervals follows. Rendered plots to follow next time…
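A minimal sketch of such a plot, assuming the ggplot2 package is installed; the ribbons show the 95% confidence bands and the lines show the predicted probabilities for each intervention:
library(ggplot2)
# predicted probability of a satisfactory outcome against time, by intervention
ggplot(newdata3, aes(x = time, y = PredictedProb)) +
    geom_ribbon(aes(ymin = LL, ymax = UL, fill = interv), alpha = 0.2) +
    geom_line(aes(colour = interv), linewidth = 1)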
For a discussion of the odds ratio, this link has a good discussion for understanding the concepts: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm