So far, in all of the models we have examined, the dependent variable y has been a quantitative variable, e.g., wages, GPA, prices, etc.
Can we explain a qualitative (i.e., binary or dummy) variable using multiple regression?
A binary dependent variable takes the values y = 1 or y = 0; e.g., it may indicate whether an adult has a high school education, whether a household owns a house, whether an adult is married, owns a car, etc.
The case y = 1 is called a success, whereas y = 0 is called a failure.
What happens if we regress a 0/1 variable on a set of independent variables? How can we interpret regression coefficients?
2 Linear Probability Model
Under the standard assumptions, the conditional expectation of the dependent variable can be written as follows:
E(y|x) = P(y=1|x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k
The probability of success is given by p(x) = P(y = 1|x). The expression above states that the success probability is a linear function of the x variables.
Slope coefficients are now interpreted as the change in the probability of success:
\Delta P(y=1|x) = \beta_j \Delta x_j
If we use OLS estimation then the OLS sample regression equation is given by
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + \cdots + \hat{\beta}_kx_k
- In this estimated equation, \hat{y} is the predicted probability of success.
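The output below comes from estimating an LPM for married women's labor force participation (the Mroz data). A minimal sketch of the estimation call is given here; the object name lpm1 (reused later) and the wooldridge package as the source of the mroz data frame are assumptions.
Code
library(wooldridge)   # assumed source of the mroz data frame
data(mroz)
lpm1 <- lm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6, data = mroz)
summary(lpm1)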
Call:
lm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6, data = mroz)
Residuals:
Min 1Q Median 3Q Max
-0.93432 -0.37526 0.08833 0.34404 0.99417
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.5855192 0.1541780 3.798 0.000158 ***
nwifeinc -0.0034052 0.0014485 -2.351 0.018991 *
educ 0.0379953 0.0073760 5.151 3.32e-07 ***
exper 0.0394924 0.0056727 6.962 7.38e-12 ***
expersq -0.0005963 0.0001848 -3.227 0.001306 **
age -0.0160908 0.0024847 -6.476 1.71e-10 ***
kidslt6 -0.2618105 0.0335058 -7.814 1.89e-14 ***
kidsge6 0.0130122 0.0131960 0.986 0.324415
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4271 on 745 degrees of freedom
Multiple R-squared: 0.2642, Adjusted R-squared: 0.2573
F-statistic: 38.22 on 7 and 745 DF, p-value: < 2.2e-16
All variables are individually statistically significant except kidsge6. All coefficients have the expected signs based on standard economic theory and intuition.
Interpretation of estimated coefficients:
For example, the coefficient estimate on educ, 0.038, implies that, ceteris paribus, an additional year of education increases the predicted probability of labor force participation by 0.038.
The coefficient estimate on nwifeinc: if non-wife household income increases by 10 units (i.e., $10,000), the probability of labor force participation falls by 0.034.
exper enters quadratically, which means that the effect of past experience on the probability of labor force participation is diminishing.
The number of young children has a big impact on labor force participation. The coefficient estimate on kidslt6 is -0.262, which means that, ceteris paribus, having one additional child under six years old reduces the probability of participation by 0.262.
2.2 Shortcomings of LPM
The predicted probability of success is given by \hat{y}, and it can take values outside the [0, 1] interval. Obviously, this contradicts the rules of probability.
In the example out of 753 observations, 16 have \widehat{inlf} < 0 and 17 have \widehat{inlf} > 1.
If these are relatively few, they can be interpreted as 0 and 1, respectively.
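These counts can be verified directly from the fitted values; a minimal sketch, assuming the LPM object lpm1 from above:
Code
sum(fitted(lpm1) < 0)   # number of fitted probabilities below 0
sum(fitted(lpm1) > 1)   # number of fitted probabilities above 1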
Nevertheless, the major shortcoming of the LPM is not implausible probability predictions. The major problem is that a probability cannot be linearly related to the independent variables for all of their possible values.
In the example, the model predicts that the effect of going from zero children to one young child reduces the probability of working by 0.262.
This is also the predicted drop if the woman goes from one child to two, or from two to three, etc.
It seems more realistic that the first small child would reduce the probability by a large amount, while subsequent children would have a smaller marginal effect.
Thus, the relationship may be nonlinear.
Code
library(ggplot2)
ggplot(mroz, aes(x = kidslt6, y = inlf)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Despite these shortcomings, the LPM is useful and often applied in economics.
It usually works well for values of the independent variables that are near the averages in the sample.
In the previous example, 96% of the women have either no children or one child under six. Thus, the coefficient estimate on kidslt6 (-0.262) essentially measures the impact of the first young child on the probability of labor force participation.
Therefore, we should not use this estimate for changes from 3 to 4 children, or 4 to 5, etc.
The LPM is heteroskedastic
Recall that y is a binary variable following a Bernoulli distribution with variance given by:
Var(u|x) = Var(y|x) = p(x) \times [1 - p(x)]
Since p(x) is a linear function of the x variables, Var(u|x) is not constant.
We learned that in this case OLS is unbiased and consistent but inefficient. The Gauss-Markov theorem fails, and the usual standard errors and inference procedures are not valid.
It is possible to find more efficient estimators than OLS.
Code
plot(fitted(lpm1), resid(lpm1))
abline(0, 0, col = "red")
There are two widely used binary dependent variable models: logit and probit
As stated above, the LPM is heteroskedastic. In order to conduct hypothesis tests and construct confidence intervals for the marginal effects of the explanatory variables on the outcome variable, we must first correct for heteroskedasticity. We can use the White estimator to correct for heteroskedasticity.
We compute the White heteroskedasticity-consistent variance/covariance matrix of the coefficients with a call to vcovHC() (which stands for variance/covariance heteroskedasticity-consistent), provided by the sandwich package.
Then we call coeftest() (from the lmtest package) to use this estimate of the variance/covariance matrix to properly compute the standard errors, t-statistics, and p-values for the coefficients.
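A minimal sketch of these two steps, assuming the LPM object lpm1 from above:
Code
library(sandwich)
library(lmtest)
# White heteroskedasticity-consistent variance/covariance matrix
vc <- vcovHC(lpm1, type = "HC1")
# coefficient tests using the robust standard errors
coeftest(lpm1, vcov = vc)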
In probit regression, the cumulative standard normal distribution function \Phi(.) is used to model the regression function when the dependent variable is binary, that is, we assume
E(Y|X) = P(Y=1|X) = \Phi(\beta_0 + \beta_1X)
\beta_0 + \beta_1X plays the role of a quantile z. Remember that \Phi(z) = P(Z \leq z), where Z \sim N(0,1), so the probit coefficient \beta_1 is the change in z associated with a one-unit change in X.
Although the effect on z of a change in X is linear, the link between z and the dependent variable Y is nonlinear, since \Phi is a nonlinear function of X.
Assume that Y is a binary variable. The model
Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k + u
with
P(Y=1|X_1, X_2, \ldots, X_k) = \Phi(\beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_kX_k)
is the population probit model with multiple regressors.
Look up \Phi(z) in a standard normal table or use the pnorm() function in R.
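For instance, the success probability implied by a hypothetical index value z = 0.5 is \Phi(0.5) \approx 0.691:
Code
pnorm(0.5)   # probability implied by a hypothetical index z = 0.5; about 0.691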
In R, probit models can be estimated using the function glm() from the package stats; via the argument family we specify that we want to use a probit link function.
Code
inlf_probit <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
                   family = binomial(link = "probit"), data = mroz)
summary(inlf_probit)
Call:
glm(formula = inlf ~ nwifeinc + educ + exper + expersq + age +
kidslt6 + kidsge6, family = binomial(link = "probit"), data = mroz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2700736 0.5080782 0.532 0.59503
nwifeinc -0.0120236 0.0049392 -2.434 0.01492 *
educ 0.1309040 0.0253987 5.154 2.55e-07 ***
exper 0.1233472 0.0187587 6.575 4.85e-11 ***
expersq -0.0018871 0.0005999 -3.145 0.00166 **
age -0.0528524 0.0084624 -6.246 4.22e-10 ***
kidslt6 -0.8683247 0.1183773 -7.335 2.21e-13 ***
kidsge6 0.0360056 0.0440303 0.818 0.41350
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1029.7 on 752 degrees of freedom
Residual deviance: 802.6 on 745 degrees of freedom
AIC: 818.6
Number of Fisher Scoring iterations: 4
The magnitudes of the coefficients have no direct intuitive interpretation: they measure the change in the z-index (probit) or in the log odds (logit) associated with a one-unit increase in each explanatory variable.
The marginal effects of a regression are estimates of the impact that one-unit increases in the explanatory variables have on the outcome variable.
In a logistic regression, the marginal effects are defined as the impact that one-unit increases in the explanatory variables have on the probability that the outcome variable is equal to 1.0.
The function maBina() from the package erer can be used to compute the marginal effects of the logistic regression as follows.
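A minimal sketch is given below. It assumes a logit counterpart inlf_logit with the same specification as the probit above (this object is also used in the goodness-of-fit and confusion-matrix steps that follow), and it fits the model with x = TRUE because maBina() uses the stored model matrix.
Code
library(erer)
# logit model with the same specification as the probit above (assumed);
# x = TRUE stores the model matrix, which maBina() uses
inlf_logit <- glm(inlf ~ nwifeinc + educ + exper + expersq + age + kidslt6 + kidsge6,
                  family = binomial(link = "logit"), data = mroz, x = TRUE)
maBina(inlf_logit)   # marginal effects evaluated at the means of the regressors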
Hosmer-Lemeshow test of goodness of fit using hoslem.test() from the ResourceSelection package.
Code
library(ResourceSelection)
hoslem.test(mroz$inlf, inlf_logit$fitted, g = 10)
Hosmer and Lemeshow goodness of fit (GOF) test
data: mroz$inlf, inlf_logit$fitted
X-squared = 12.851, df = 8, p-value = 0.1171
4.3 Confusion Matrix
A confusion matrix gives a short comparison of the actual values of the outcome variable and the values predicted by the model.
The predicted probabilities P(Y=1|X) are converted into predicted 0/1 values; the confusion matrix compares these with the actual values and reports how often we correctly identified Y=1 and Y=0.
The code below creates predicted values for Y based on a cutoff value of 0.5 for the predicted probability.
Code
pred <- as.numeric(inlf_logit$fitted >= 0.5)
The code below computes the confusion matrix along with several other model performance measures. The confusionMatrix() function is in the caret package.
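A minimal sketch, assuming pred and mroz$inlf are the 0/1 vectors defined above:
Code
library(caret)
# compare predicted classes with the actual outcomes (positive class: inlf = 1)
confusionMatrix(factor(pred), factor(mroz$inlf), positive = "1")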