Logistic Regression: Discussion and Sample Analysis

Examples

Example 1. Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1): win or lose. The predictor variables of interest are the amount of money spent on the campaign, the amount of time spent campaigning negatively, and whether or not the candidate is an incumbent.

Example 2. A researcher is interested in how variables such as GRE (Graduate Record Exam) scores, GPA (grade point average), and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don’t admit, is binary.

Example 3. Resident Management: We seek to model the success of 4 interventions on resident behaviors. As this is for demonstration purposes, we provide the simple generic case of “A Target Behavior”. We consider 4 interventions: 1. One on One Time, 2. Bring Patient to Patio, 3. Deep Breathing Exercises, 4. Send Resident to Quiet Area.

Description of the data

For our data analysis below, we are going to expand on Example 3.

Outcome variable: satis (the intervention had a satisfactory outcome, i.e., the resident is settled). Binary: 1 = satisfactory outcome, 0 = intervention did not have a satisfactory outcome.

  • X1: time, the minutes spent by the CNA with the resident
  • X2: interv, the intervention utilized. The four interventions include
  1. one-one-time (Spend One on One Time with Resident)
  2. send-patio (Send Resident to Patio)
  3. breathing (Show Resident a Breathing Exercise)
  4. send-quiet-area (Send Resident to Quiet Area)


Quick Peek at Data

head(df_intervention)
##   satis  time interv
## 1     0 18.05      3
## 2     1 18.35      3
## 3     1 20.00      1
## 4     1 15.95      4
## 5     0 14.65      4
## 6     1 15.00      2

Crosstabs: Frequency of Categorical Variables

xtabs(~satis + interv, data = df_intervention)
##      interv
## satis  1  2  3  4
##     0 28 97 93 55
##     1 33 54 28 12

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable, while others have either fallen out of favor or have limitations.

Logistic regression: the focus of this page.

Probit regression: Probit analysis will produce results similar to logistic regression. The choice of probit versus logit depends largely on individual preferences.
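For comparison, a minimal sketch of the probit counterpart to the logit model fit below (the name model_probit is ours; we wrap interv in factor() inline since the conversion to a factor happens later on this page):

# A sketch only: the same specification with a probit rather than logit link
model_probit <- glm(satis ~ time + factor(interv), data = df_intervention,
                    family = binomial(link = "probit"))
summary(model_probit)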

OLS regression: When used with a binary response variable, this model is known as a linear probability model and can be used as a way to describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and normality of errors assumptions of OLS regression, resulting in invalid standard errors and hypothesis tests. For a more thorough discussion of these and other problems with the linear probability model, see Long.
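A minimal sketch of the linear probability model on these data, purely to illustrate the point above (the name model_lpm is ours; this fit is not part of the analysis below):

# A sketch only: OLS on the binary outcome, i.e. a linear probability model;
# fitted values approximate conditional probabilities, but the residuals are
# heteroskedastic by construction, so the usual standard errors are suspect
model_lpm <- lm(satis ~ time + factor(interv), data = df_intervention)
summary(model_lpm)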

Two-group discriminant function analysis: A multivariate method for dichotomous outcome variables.

Hotelling’s T2: The 0/1 outcome is turned into the grouping variable, and the former predictors are turned into outcome variables. This will produce an overall test of significance but will not give individual coefficients for each variable, and it is unclear the extent to which each “predictor” is adjusted for the impact of the other “predictors.”
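For illustration only, a minimal sketch of the discriminant-analysis approach using lda from the MASS package (our choice of function, not part of the analysis below; satis becomes the grouping variable):

# A sketch only: two-group discriminant function analysis with satis as
# the grouping variable; no per-variable significance tests are produced
library(MASS)
fit_lda <- lda(satis ~ time + factor(interv), data = df_intervention)
fit_lda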

Using the logit model

The code below estimates a logistic regression model using the glm (generalized linear model) function. First, we convert interv to a factor to indicate that interv should be treated as a categorical variable.

df_intervention$interv <- factor(df_intervention$interv)
model_logit <- glm(satis ~ time + interv, data = df_intervention, family = "binomial")
summary(model_logit)
## 
## Call:
## glm(formula = satis ~ time + interv, family = "binomial", data = df_intervention)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5055  -0.8663  -0.6590   1.1505   2.0913  
## 
## Coefficients:
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept) -3.46362    1.10031  -3.148  0.001645 ** 
## time         0.21041    0.06203   3.392  0.000694 ***
## interv2     -0.68097    0.31414  -2.168  0.030181 *  
## interv3     -1.39191    0.34190  -4.071 0.0000468 ***
## interv4     -1.59425    0.41521  -3.840  0.000123 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 462.88  on 395  degrees of freedom
## AIC: 472.88
## 
## Number of Fisher Scoring iterations: 4
  • In the output above, the first thing we see is the call; this is R reminding us what model we ran, what options we specified, etc.
  • Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.
  • The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. time is statistically significant, as are the three indicator terms for interv. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
  • For a one unit increase in time, the log odds of achieving a satisfactory outcome (“a settled resident”) increase by 0.210.
  • The indicator variables for interv have a slightly different interpretation. For example, utilizing the send-patio intervention (interv=2) rather than one-one-time (interv=1, the reference level) changes the log odds of successfully managing the resident’s target behavior by -0.681.
  • Below the table of coefficients are fit indices, including the null and residual deviance and the AIC. The sketch just after this list shows an example of how you can use these values to help assess model fit.
  • We can use the confint function to obtain confidence intervals for the coefficient estimates. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function. We can also get CIs based on just the standard errors by using the default method.
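As a brief sketch of that model-fit check (using only the model_logit object above): the drop from the null deviance to the residual deviance is a likelihood-ratio chi-squared statistic comparing the fitted model to an intercept-only model, here 499.98 - 462.88 = 37.1 on 399 - 395 = 4 degrees of freedom.

# A sketch only: p-value for the likelihood-ratio test of the fitted model
# against the null (intercept-only) model, using the deviances reported above
with(model_logit, pchisq(null.deviance - deviance, df.null - df.residual,
                         lower.tail = FALSE))

The resulting p-value is far below conventional thresholds, suggesting the model as a whole fits significantly better than an empty model.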

CIs using profiled log-likelihood

confint(model_logit)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -5.66169992 -1.33873492
## time         0.09071403  0.33439175
## interv2     -1.30183794 -0.06693339
## interv3     -2.07306948 -0.72919580
## interv4     -2.43820184 -0.80204054
confint.default(model_logit)
##                   2.5 %      97.5 %
## (Intercept) -5.62019533 -1.30704172
## time         0.08882621  0.33199488
## interv2     -1.29668376 -0.06526127
## interv3     -2.06202035 -0.72179531
## interv4     -2.40803845 -0.78046309

We can test for an overall effect of interv using the wald.test function from the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. b supplies the coefficients, Sigma supplies the variance-covariance matrix of the error terms, and Terms tells R which terms in the model are to be tested; in this case, terms 3, 4, and 5 are the three indicator terms for the levels of interv.

library(aod)
wald.test(b = coef(model_logit), Sigma = vcov(model_logit), Terms = 3:5)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 22.9, df = 3, P(> X2) = 0.000042

The chi-squared test statistic of 22.9, with three degrees of freedom, is associated with a p-value of 0.000042, indicating that the overall effect of interv is statistically significant.

We can also test additional hypotheses about the differences in the coefficients for the different levels of interv. Below we test that the coefficient for interv=2 is equal to the coefficient for interv=3. The first line of code below creates a vector, l, that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for interv=2 and interv=3 (i.e., the third and fourth terms in the model). To contrast these two terms, we multiply one of them by 1 and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L = l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).

l <- cbind(0, 0, 1, -1, 0)
wald.test(b = coef(model_logit), Sigma = vcov(model_logit), L = l)
## Wald test:
## ----------
## 
## Chi-squared test:
## X2 = 6.4, df = 1, P(> X2) = 0.011

The chi-squared test statistic of 6.4 with 1 degree of freedom is associated with a p-value of 0.011, indicating that the difference between the coefficient for interv=2 and the coefficient for interv=3 is statistically significant. The same approach can be used to compare other pairs of interventions, but as this is for demonstration purposes, we’ll leave this as the sole contrast.

You can also exponentiate the coefficients and interpret them as odds ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and is part of model_logit (coef(model_logit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.

odds ratios only

exp(coef(model_logit))
## (Intercept)        time     interv2     interv3     interv4 
##  0.03131624  1.23418464  0.50612454  0.24860056  0.20306061

odds ratios and 95% CI

exp(cbind(OR = coef(model_logit), confint(model_logit)))
## Waiting for profiling to be done...
##                     OR       2.5 %    97.5 %
## (Intercept) 0.03131624 0.003476602 0.2621771
## time        1.23418464 1.094955840 1.3970904
## interv2     0.50612454 0.272031357 0.9352575
## interv3     0.24860056 0.125799050 0.4822967
## interv4     0.20306061 0.087317721 0.4484130

Now we can say that for a one unit increase in time, the odds of achieving a successfully settled resident (versus resident persisting with target behavior) increase by a factor of 1.23. Note that while R produces it, the odds ratio for the intercept is not generally interpreted.

You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. To create predicted probabilities, we first need to create a new data frame with the values we want the independent variables to take on.

We will start by calculating the predicted probability of a successful outcome at each value of interv, holding time at its mean. First we create and view the data frame.

newdata1 <- with(df_intervention, data.frame(time = mean(time), interv = factor(1:4)))

view data frame

head(newdata1)
##      time interv
## 1 16.9495      1
## 2 16.9495      2
## 3 16.9495      3
## 4 16.9495      4

These objects must have the same names as the variables in your logistic regression above (e.g., in this example the mean for time must be named time). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create the predicted probabilities. The first line of code below is quite compact; we will break it apart to discuss what the various components do. newdata1$intervP tells R that we want to create a new variable in the data frame newdata1 called intervP; the rest of the command tells R that the values of intervP should be predictions made using the predict() function. The options within the parentheses tell R that the predictions should be based on the analysis model_logit, with values of the predictor variables coming from newdata1, and that the type of prediction is a predicted probability (type = "response"). The second line of the code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.

newdata1$intervP <- predict(model_logit, newdata = newdata1, type = "response")
newdata1
##      time interv   intervP
## 1 16.9495      1 0.5256612
## 2 16.9495      2 0.3593382
## 3 16.9495      3 0.2159928
## 4 16.9495      4 0.1836943

In the above output we see that the predicted probability of a satisfactory outcome in managing a resident’s behavior is 0.53 for “one-one-time” (interv=1) and 0.18 for “send-quiet-area” (interv=4). We can do something very similar to create a table of predicted probabilities varying the value of time over its observed range, at each value of interv (i.e., 1, 2, 3, and 4).

newdata2 <- with(df_intervention,
                 data.frame(time = rep(seq(from = min(time), to = max(time), length.out = 100), 4),
                            interv = factor(rep(1:4, each = 100))))

The code to generate the predicted probabilities (the first line below) is the same as before, except we are also going to ask for standard errors so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.

newdata3 <- cbind(newdata2, predict(model_logit, newdata = newdata2, type = "link",
                                    se = TRUE))
newdata3 <- within(newdata3, {
  PredictedProb <- plogis(fit)         # back-transform from the log-odds (link) scale
  LL <- plogis(fit - (1.96 * se.fit))  # lower 95% confidence limit
  UL <- plogis(fit + (1.96 * se.fit))  # upper 95% confidence limit
})

It can also be helpful to use graphs of predicted probabilities to understand and/or present the model. We will use the ggplot2 package for graphing, plotting the predicted probabilities with 95% confidence intervals. Polished plots to follow next time…
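In the meantime, a minimal sketch of such a plot, built from the newdata3 frame computed above (the choice of geoms and styling is ours, not part of the original analysis):

library(ggplot2)

# A sketch only: predicted probability of a satisfactory outcome across time,
# with a 95% confidence ribbon and one curve per intervention
ggplot(newdata3, aes(x = time, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = interv), alpha = 0.2) +
  geom_line(aes(colour = interv), linewidth = 1)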

For understanding the concepts behind the odds ratio, this link has a good discussion: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm