Let’s begin by loading the Mroz data set.
library(car)
Mroz <- Mroz
str(Mroz)
## 'data.frame': 753 obs. of 8 variables:
## $ lfp : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ k5 : int 1 0 1 0 1 0 0 0 0 0 ...
## $ k618: int 0 2 3 3 2 0 2 0 2 2 ...
## $ age : int 32 30 35 34 31 54 37 54 48 39 ...
## $ wc : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 2 1 1 1 ...
## $ hc : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ lwg : num 1.2102 0.3285 1.5141 0.0921 1.5243 ...
## $ inc : num 10.9 19.5 12 6.8 20.1 ...
The Mroz dataset, named after the researcher who gathered the data, contains variables on the labor force participation of 753 women. The response variable, lfp, is a factor with two levels: either the woman is in the labor force (“yes”) or she is not (“no”). A binary response variable like this is exactly the situation that calls for logistic regression.
What determines whether a woman is in the labor force? The outcome is binary, like flipping a coin, so the question becomes: what is the probability of success? (Success can be arbitrarily defined as one of the two outcomes, such as being in the labor force, so the question is the probability that a given woman is in the labor force.) In a logistic regression, that probability may depend on some covariates.
Note that although the response is binary, the data itself is not necessarily from a binomial distribution. If the data were binomial, we would need to specify the fixed N, the number of trials, in the regression. As it says in Fox, “Binary regression is a limiting case of binomial regression with all the N = 1.” In other words, we’ll be looking at binary logistic models.
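For contrast, here is a minimal sketch of how a binomial (rather than binary) logistic regression would be specified in R. The aggregated data frame and its columns below are made up purely for illustration; the two-column cbind() response is what fixes N for each row.
# Hypothetical aggregated data: each row is a group with a fixed number of trials
agg <- data.frame(successes = c(12, 30, 45),
                  failures  = c(88, 70, 55),
                  dose      = c(1, 2, 3))
# cbind(successes, failures) supplies the binomial N (successes + failures) per row
binom.mod <- glm(cbind(successes, failures) ~ dose, data = agg, family = binomial)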
Other variables (i.e., covariates) are:
k5 - number of children age 5 years or younger.
k618 - number of children ages 6-18.
age - the woman’s age.
wc - the woman’s college attendance, a factor, either yes or no.
hc - the husband’s college attendance, a factor, either yes or no.
lwg - log of the expected wage rate. This variable is a little unusual: if the woman is in the labor force, it’s the log of her ACTUAL wage rate; if she is not, the value is IMPUTED based on other variables. We will not cover the details of the imputation here.
inc - family income exclusive of the wife’s income.
Let’s begin with a simple linear regression. We’ll need to convert the lfp variable from a factor to a numeric. In this case, I’m going to use the integer 1 to indicate that the woman is in the labor force. The reason I’m doing this is that it makes interpretation of the beta coefficients easier.
Mroz$lfpn <- ifelse(Mroz$lfp=="yes",1,0)
m1 <- lm(lfpn ~ k5 + k618 + age + wc + hc + lwg + inc, data=Mroz)
summary(m1)
##
## Call:
## lm(formula = lfpn ~ k5 + k618 + age + wc + hc + lwg + inc, data = Mroz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9268 -0.4632 0.1684 0.3906 0.9602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.143548 0.127053 9.001 < 2e-16 ***
## k5 -0.294836 0.035903 -8.212 9.58e-16 ***
## k618 -0.011215 0.013963 -0.803 0.422109
## age -0.012741 0.002538 -5.021 6.45e-07 ***
## wcyes 0.163679 0.045828 3.572 0.000378 ***
## hcyes 0.018951 0.042533 0.446 0.656044
## lwg 0.122740 0.030191 4.065 5.31e-05 ***
## inc -0.006760 0.001571 -4.304 1.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.459 on 745 degrees of freedom
## Multiple R-squared: 0.1503, Adjusted R-squared: 0.1423
## F-statistic: 18.83 on 7 and 745 DF, p-value: < 2.2e-16
There’s nothing technically wrong with running a linear regression on binary data. In that case the fitted values are probabilities, and the beta coefficients are changes in probability. For example, every additional year of age decreases the probability that the woman is in the labor force by about 1.3 percentage points.
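As a quick check of that reading, we can pull the age coefficient straight out of the m1 fit above (a small sketch, nothing beyond what the summary already shows):
coef(m1)[["age"]]       # about -0.013: each year of age lowers the fitted probability by ~1.3 points
10 * coef(m1)[["age"]]  # the predicted change over ten additional years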
In order to make the regression a little more intuitive, I’m going to center the continuous variables of age, log wage and income.
Mroz$age_a <- Mroz$age-mean(Mroz$age)
Mroz$lwg_a <- Mroz$lwg - mean(Mroz$lwg)
Mroz$inc_a <- Mroz$inc - mean(Mroz$inc)
m2 <- lm(lfpn ~ k5 + k618 + age_a + wc + hc + lwg_a + inc_a, data=Mroz)
summary(m2)
##
## Call:
## lm(formula = lfpn ~ k5 + k618 + age_a + wc + hc + lwg_a + inc_a,
## data = Mroz)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9268 -0.4632 0.1684 0.3906 0.9602
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.600150 0.031594 18.996 < 2e-16 ***
## k5 -0.294836 0.035903 -8.212 9.58e-16 ***
## k618 -0.011215 0.013963 -0.803 0.422109
## age_a -0.012741 0.002538 -5.021 6.45e-07 ***
## wcyes 0.163679 0.045828 3.572 0.000378 ***
## hcyes 0.018951 0.042533 0.446 0.656044
## lwg_a 0.122740 0.030191 4.065 5.31e-05 ***
## inc_a -0.006760 0.001571 -4.304 1.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.459 on 745 degrees of freedom
## Multiple R-squared: 0.1503, Adjusted R-squared: 0.1423
## F-statistic: 18.83 on 7 and 745 DF, p-value: < 2.2e-16
Note that the beta coefficients haven’t changed, but the intercept has. Before it was 1.14; now it’s 0.6. Having an intercept of 1.14 is not wrong, it’s just unintuitive from the get-go.
Consider a woman of average age, log wage, and income, with no children and both college factors set to “no”. For her, every predictor is 0, so the intercept is the predicted probability of the response: there is about a 60% probability that she is in the labor force.
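A small sketch to confirm this, using the centered model m2 fit above: predicting for exactly that baseline woman reproduces the intercept.
# With the continuous variables centered, this baseline woman has every predictor equal to 0,
# so her fitted value is just the intercept
baseline <- data.frame(k5 = 0, k618 = 0, age_a = 0,
                       wc = factor("no", levels = c("no", "yes")),
                       hc = factor("no", levels = c("no", "yes")),
                       lwg_a = 0, inc_a = 0)
predict(m2, newdata = baseline)  # equals coef(m2)[[1]], about 0.60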
Here is the real problem that keeps a linear regression from working here: with a binary response, the fitted values can fall above 1 or below 0.
hist(m2$fitted.values)
As the histogram above indicates, some of the fitted “probabilities” fall above 1 or below 0, which violates the definition of a probability, so we should be able to do better.
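A quick numerical check of the same thing (a small addition, using the m2 fit from above):
range(m2$fitted.values)                            # check whether the fitted values stay inside [0, 1]
sum(m2$fitted.values < 0 | m2$fitted.values > 1)   # count how many fall outside that range
With that established, let’s get into the logistic regression.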
Mroz.mod <- glm(lfp ~ k5 + k618 + age + wc + hc + lwg + inc, data=Mroz,
family = binomial(link = "logit"))
Let’s take this apart. This is a generalized linear model - which I won’t explain in depth here - except to say that we use the glm function in R. Besides using the glm function, the only other difference from lm is that we specify the family as binomial. This is the family of distributions; think of it as the distribution of the response variable.
The response variable, which is binary, is binomially distributed (unconditionally, before we bring in the covariates). There is some proportion of successes and a certain sample size. We can calculate the sample proportion and sample size from the data.
x <- sum(Mroz$lfpn);x
## [1] 428
y <- nrow(Mroz); y
## [1] 753
z <- x / y; z # proportion of successes
## [1] 0.5683931
So the response variable shows that out of 753 women (our n), about 57% are in the labor force (our p), which we are defining as a success. The question of this model with respect to statistical inference is- what percentage of women IN THE POPULATION are in the labor force (and how much variation is explained by the predictors)?
Before we look at the predictors, let’s look at the distribution of fitted values.
hist(Mroz.mod$fitted.values)
As you can see, everything is bound between 0 and 1. This is what we’re looking for in the logistic model.
OK, so how do we get there? We’ll need a bit of statistical theory. We’re going to do this in 3 steps.
Going back to the linear regression, we can define the response as an expected value conditional on the covariates, E(y_i | x_i). In the case of a logistic regression, the response is a probability (the probability of being in the labor force), so we’ll call it p. To be clear: p_i = E(y_i | x_i). The response should live between 0 and 1 (except, as we just saw, the linear model’s fitted values don’t respect that).
Now, we’re going to transform the response from a probability to THE ODDS. The odds can be derived from the probability: odds = p/(1-p), or odds = p/q, where q = 1-p. The odds are the number of successes over the number of failures. For example, if the probability of a success is 1/10, then the odds are 1:9. Note the difference in notation. The odds range from 0 to infinity. When the odds are between 0 and 1, there are more failures than successes; when there are more successes than failures, the odds are between 1 and infinity. This is not intuitive.
Finally, we’re going to take the log of the odds, log(p/q). The range is now between negative and positive infinity. This is our response variable in logistic regression! We call it the logit.
logit = log(p/q)
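A small numeric sketch of these three scales, using the 1-in-10 example from above:
p <- 0.1             # probability of success
odds <- p / (1 - p)  # odds of success: 1 to 9, about 0.111
log(odds)            # the logit, about -2.2, can be any real number
# qlogis() and plogis() are R's built-in logit and inverse-logit functions
qlogis(p)            # same as log(p/(1-p))
plogis(log(odds))    # back to the probability, 0.1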
The logit is known as a link function. It is the log of the odds. In glm, the link function is the transformation of the response variable. Essentially, the link function applied to the mean of the response is set equal to the regression equation (the linear predictor).
So, our new regression - our logistic regression - is logit(p) = b0 + b1x_1 + ... + b_kx_k.
You can jump between the logit scale and the probability scale by converting the logit back to the probability of a success. Recall that the response we ultimately care about is p, a probability.
log(p/(1-p)) = b0 + b1x_i -> p = exp(b0 + b1x_i)/(1 + exp(b0 + b1x_i))
Note that this back-transformation applies to the whole linear predictor (the fitted response), not to individual coefficients; the fitted probabilities are usually what is of interest. We can check it by applying the inverse logit to the linear predictor (predict() on the link scale, its default) and comparing against the fitted values.
head(Mroz.mod$fitted.values)
## 1 2 3 4 5 6
## 0.5158291 0.6668165 0.4565831 0.6620169 0.6632299 0.5959744
head(exp(predict(Mroz.mod))/(1 + exp(predict(Mroz.mod))))
## 1 2 3 4 5 6
## 0.5158291 0.6668165 0.4565831 0.6620169 0.6632299 0.5959744
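As an aside, R has the inverse logit built in: plogis() applied to the link-scale predictions reproduces the fitted probabilities in one step.
head(plogis(predict(Mroz.mod)))  # same values as Mroz.mod$fitted.values above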
OK, now we’re ready to interpret our parameters.
summary(Mroz.mod)
##
## Call:
## glm(formula = lfp ~ k5 + k618 + age + wc + hc + lwg + inc, family = binomial(link = "logit"),
## data = Mroz)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1062 -1.0900 0.5978 0.9709 2.1893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.182140 0.644375 4.938 7.88e-07 ***
## k5 -1.462913 0.197001 -7.426 1.12e-13 ***
## k618 -0.064571 0.068001 -0.950 0.342337
## age -0.062871 0.012783 -4.918 8.73e-07 ***
## wcyes 0.807274 0.229980 3.510 0.000448 ***
## hcyes 0.111734 0.206040 0.542 0.587618
## lwg 0.604693 0.150818 4.009 6.09e-05 ***
## inc -0.034446 0.008208 -4.196 2.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1029.75 on 752 degrees of freedom
## Residual deviance: 905.27 on 745 degrees of freedom
## AIC: 921.27
##
## Number of Fisher Scoring iterations: 4
coef(Mroz.mod)
## (Intercept) k5 k618 age wcyes hcyes
## 3.18214046 -1.46291304 -0.06457068 -0.06287055 0.80727378 0.11173357
## lwg inc
## 0.60469312 -0.03444643
First, a caveat. At this point, coefficients get pretty complicated. With GLMs, we’re less interested in studying the parameters and more interested in looking at the predictions.
With that said…
The coefficients for b0 and b1-bn have different interpretations.
b0 is the log odds when all the covariates equal 0. In order to convert it to a probability, you exponentiate b0 and divide by (1 + exp(b0)). The interpretation is likely nonsense, though, since setting every covariate to 0 (e.g., age 0) doesn’t describe a real person.
Mroz.mod$coefficients[[1]]
## [1] 3.18214
exp(Mroz.mod$coefficients[[1]])/(1+exp(Mroz.mod$coefficients[[1]]))
## [1] 0.9601566
b1, etc., are log odds RATIOS.
b0 + b1(x+1) - [b0+ b1(x)] = b1.
Mroz.mod[["coefficients"]][["age"]]
## [1] -0.06287055
What does this mean? The equation above is a difference in logits, so the b1 coefficient is the change in the log odds for a one-unit increase in x. Because the coefficient is small, it can be read loosely as a percentage change in the odds: for example, the odds of working fall by roughly 6% per additional year of age. (Read this way, the coefficients behave like continuously compounded rates.)
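To verify the difference-in-logits identity with the fitted model, we can predict two otherwise-identical women who differ by one year of age on the link (logit) scale; the covariate values below are made up purely for illustration, and the difference in logits comes out to exactly the age coefficient.
# Two hypothetical women, identical except for a one-year difference in age
nd <- data.frame(k5 = 0, k618 = 2, age = c(40, 41),
                 wc = factor("no", levels = c("no", "yes")),
                 hc = factor("no", levels = c("no", "yes")),
                 lwg = 1, inc = 20)
logits <- predict(Mroz.mod, newdata = nd)  # default type is "link", i.e. the logit scale
diff(logits)                               # equals coef(Mroz.mod)[["age"]]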
But this isn’t really how the coefficients are interpreted. It opens you up to some mistakes. Here is how we actually do it.
When we take the exponent of a beta coefficient, we get the odds ratio (sometimes loosely called a risk factor): the multiplicative change in the fitted odds of success for a one-unit change in that predictor.
round(exp(cbind(Estimate = coef(Mroz.mod), confint(Mroz.mod))),2)
## Waiting for profiling to be done...
## Estimate 2.5 % 97.5 %
## (Intercept) 24.10 6.94 87.03
## k5 0.23 0.16 0.34
## k618 0.94 0.82 1.07
## age 0.94 0.92 0.96
## wcyes 2.24 1.43 3.54
## hcyes 1.12 0.75 1.68
## lwg 1.83 1.37 2.48
## inc 0.97 0.95 0.98
When we exponentiate the logit model, we go from an additive model with the logit as the response to a multiplicative model with the odds as the response, where each term multiplies the others: odds = exp(b0) x exp(b1x1) x … x exp(bkxk).
So, going back to the effect of age: an additional year of age multiplies the odds of working by 0.94, i.e. a 6% decrease in the odds (the odds, not the probability). This is where the “x times more likely” style of statement comes from.
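Because the model is multiplicative on the odds scale, multi-year effects compound. A quick sketch with the fitted age coefficient:
or.age <- exp(coef(Mroz.mod)[["age"]])  # odds ratio for one additional year, about 0.94
or.age^10                               # ten additional years: about 0.53, roughly a 47% drop in the odds
exp(10 * coef(Mroz.mod)[["age"]])       # the same thing, computed on the log-odds scale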
An odds ratio of 1 means that there is no difference in odds.
When x is a factor, we often get larger effects. For example, if the woman is college educated, her odds of being in the labor force are 2.24 times those of a comparable woman who is not.
Odds can always be converted back to a probability with p = odds/(1 + odds), e.g. exp(b0)/(1 + exp(b0)) for the intercept, as we computed above. This is just another way of looking at it.
Variables can be judged relevant or not using a chi-square (likelihood ratio) test that compares nested models. Remember, a low p-value indicates that the dropped variables are significant.
Mroz.mod.2 <-update(Mroz.mod, . ~ . -k5 -k618)
anova(Mroz.mod.2, Mroz.mod, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: lfp ~ age + wc + hc + lwg + inc
## Model 2: lfp ~ k5 + k618 + age + wc + hc + lwg + inc
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 747 971.75
## 2 745 905.27 2 66.485 3.655e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculate the deviance for each parameter. This is the better way to determine whether a variable should be included in the model. Unlike the z-tests in the regression summary, it compares two models: one with the variable of interest and one without.
library(car)
Anova(Mroz.mod)
## Analysis of Deviance Table (Type II tests)
##
## Response: lfp
## LR Chisq Df Pr(>Chisq)
## k5 66.484 1 3.527e-16 ***
## k618 0.903 1 0.342042
## age 25.598 1 4.204e-07 ***
## wc 12.724 1 0.000361 ***
## hc 0.294 1 0.587489
## lwg 17.001 1 3.736e-05 ***
## inc 19.504 1 1.004e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Calculate probabilities for each observation. That is, what is the response variable, expressed as a probability, for each observation? Note that, unlike with OLS, the response probability will be between 0 and 1.
glm.probs= predict(Mroz.mod, type = "response")
length(glm.probs)
## [1] 753
Why contrasts? Looking at the contrasts shows how R has coded the factor: “yes” is treated as 1, so the predicted probabilities are probabilities of being in the labor force.
contrasts(Mroz$lfp)
## yes
## no 0
## yes 1
Produce a confusion matrix. (The name itself is a bit confusing.) A confusion matrix is a two-way table that compares the response predicted by the model to the actual response. For example, we can see how often the model correctly predicts that a woman is in the labor force. In short, we compare the model to the data.
glm.pred= rep("no", 753)
glm.pred[glm.probs> 0.5] = "yes"
table(glm.pred, Mroz$lfp)
##
## glm.pred no yes
## no 180 86
## yes 145 342
mean(glm.pred== Mroz$lfp) # about 69% classified correctly
## [1] 0.6932271
It might be tempting to draw the s-curve, but this can only be done against a single continuous x variable (or with the other predictors held fixed at chosen values).
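For completeness, here is one way such a curve could be drawn. This is a sketch that refits the model with age as the only predictor, purely for illustration; it is not the model analyzed above.
# Illustrative single-predictor model so the s-curve can be plotted directly
age.mod <- glm(lfp ~ age, data = Mroz, family = binomial)
age.grid <- data.frame(age = seq(min(Mroz$age), max(Mroz$age), length.out = 100))
age.grid$prob <- predict(age.mod, newdata = age.grid, type = "response")
plot(Mroz$age, Mroz$lfpn, xlab = "age", ylab = "P(lfp = yes)")
lines(age.grid$age, age.grid$prob)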