Introduction
The aim of the project is twofold:
- To estimate the impact of management training on company performance;
- To investigate the determinants of exporting
To complete the project, we have three tasks, and we need to describe the results as clearly as possible. The word limit for this project is 5,000 words.
Task 1 [10%]
Discuss briefly the most important features of the data; describe any interesting patterns or correlations in the data and provide some summary statistics/graphs of the variables of interest.
Where and how do we find these features and summaries? From the data or from a model?
Just read An Economist’s Guide to Visualizing Data with a model in your mind
Task 2 [40% - impact]
Senior managers of selected companies were given some training on “modern management techniques” in 2012. It is claimed by the data providers that companies were selected randomly for this purpose.
Using DiD estimation, estimate the impact of this management training on (a) company profitability; (b) unit cost of production. Write a short report on your findings, including any assumptions you may have made and your views on the feasibility of the assumptions. You can use company size and age (or some functions of size and age) as control variables.
What is DiD, and what is the criterion for choosing control variables? What are the main assumptions?
Would your conclusions change if you perform your analysis (a) for private and non-private companies separately; (b) for service and non-service companies separately? You can also conduct any other robustness analysis you think might be appropriate and informative.
What counts as an appropriate and informative robustness analysis? How do we measure the change or make the comparison?
Task 3 [50% - determinants]
Choose any year between 2012 and 2014 and consider the following model of the determinants of exporting by firm i at time t in terms of firm characteristics at time t-1: \[Prob(y_{it} = 1) = F(X_{it}; \beta),\] where i indexes firms; t is 2012, 2013 or 2014. X is a vector of company characteristics that should at least include company size and age.
- An economist suggests that the following qualitative variable should be used as the dependent variable:
\[\begin{equation} y_{it} = \begin{cases} 1, & if \ expint_{it} > 0 \\ 0, & if \ expint_{it} = 0 \end{cases} \end{equation}\] Estimate the probability equation using the above variable as the dependent variable, and carefully interpret your results. Comment on the goodness-of-fit of your models. Investigate the robustness of your results by re-estimating the model for service and non-service sector firms separately. You can also conduct any other robustness analysis you think might be appropriate and informative.
- Another economist suggests that the following qualitative variable be used as the dependent variable:
\[y_{it} = \begin{cases} 0, & if \ expint_{it} = 0 \\ 1, & if \ 0 < expint_{it} < 0.25 \\ 2, & if \ 0.25 \leq expint_{it} < 1 \\ 3, & if \ expint_{it} = 1 \end{cases}\] Estimate the probability equation using the above variable as the dependent variable, and carefully interpret your results. Investigate the robustness of your results by re-estimating your models for service and non-service sector firms separately. You can also conduct any other robustness analysis you deem useful.
- Out of the models you have estimated under Task 2, is there any one you would prefer in terms of either theoretical appropriateness or empirical robustness? Give good reasons for your answer.
Get the feeling and thoughts on data
In this part, I try to give data visualizations for the main variables.
Trend for nonprivate company
Trend for private company
It seems that the training program had a minor effect on profitability but a large effect on cost, especially for private companies.
Trend for nonservice company
Trend for service company
Trend and distribution for export
Now, let’s just focus on the performance of year 2014.
Distribution for skilled and unskilled wage
Distribution for cost and profitability
Distribution for size and age
Distribution for nonservice and service sector (profit)
Distribution for nonservice and service sector (cost)
Give me a story !
Empirical Work Procedure
Better to bear the following in mind:
Assumptions
Model
Data - no target data without theory (GDP, e.g. NeoCM without government - don't use NIPA)
Goodness of fit - criterion
Diagnostic
Inference or Interpretation
Consistency !
Difference in Difference
Example - attending hospital
Now, consider the effect of hospital treatment on health. We collect data!
- This comparison shows that the health of those who attended hospital is worse than those who did not
- Perhaps those who attended hospital would have had worse health even if they had not attended hospital
- Perhaps those who did not attend hospital would have had better health even if they had attended hospital
The unit-level causal effect of the treatment, \(\Delta = Y^1 − Y^0\), cannot be observed because we observe only the outcome of one level of treatment; we do not observe the counterfactual.
This is known as the Fundamental Problem of Causal Inference !
The outcome Y reveals only half of the information contained in the underlying potential outcomes.
How do we solve this problem? DiD with a restrictive assumption!
DiD Model
The key strategy in regression was to estimate causal effects by controlling for confounding factors. A key variable in such a strategy is frequently the outcome of interest in a period before the treatment took place. Differences-in-differences is a strategy to model the role of pre-treatment outcomes in a particular fashion. (Having effect takes time !! - dimension)
We decompose each individual observation into components at different levels. We don't know which specific factors might affect health, but we can assume their existence.
Formally, the assumptions underlying differences-in-differences estimation are as follows. Let
y0 = fast food employment for low minimum wage
y1 = fast food employment for high minimum wage
Decomposition: \[E(y_0|X)=E(y_0|s,t)= \gamma_s + \lambda_t\], where s means state (dimension 1); and t means time (dimension 2).
The treatment, a higher minimum wage, changes the employment level conditional on s and t: \[ E(y_1|s,t)=E(y_0|s,t) + \beta \times 1 = \gamma_s + \lambda_t + \beta \times 1\]
So we can write observed employment in restaurant i as:
\[y_i= \gamma_s+ \lambda_t+ \beta D_{st}+ \varepsilon_i\] , where:
- \(\gamma_s\) can be interpreted as the individual state or characteristic - fixed;
- \(\lambda_t\) can be interpreted as the individual trend or macro trend - stationary;
- \(D_{st}\) is a dummy for the treatment;
- \(\varepsilon\) is the error term - either we have no idea about it, or it is exogenous to the model;
Notice:
\[\begin{aligned} & E(y_i|s=PA;t=Nov; D_{st} = 0) - E(y_i|s=PA;t=Feb; D_{st} = 0) \\ \\ & = (\gamma_{PA}+ \lambda_{Nov}+ \beta \times 0+ \varepsilon_i) - (\gamma_{PA}+ \lambda_{Feb}+ \beta \times 0+ \varepsilon_i) \\ & = \lambda_{Nov} - \lambda_{Feb} \end{aligned}\] also: \[\begin{aligned} & E(y_i|s=NJ;t=Nov; D_{st} = 1) - E(y_i|s=NJ;t=Feb; D_{st} = 1) \\ \\ & = (\gamma_{NJ}+ \lambda_{Nov}+ \beta \times 1+ \varepsilon_i) - (\gamma_{NJ}+ \lambda_{Feb}+ \beta \times 1 + \varepsilon_i) \\ & = \lambda_{Nov} - \lambda_{Feb} + \beta \end{aligned} \]
It’s time to do Difference in Difference
\[\begin{aligned} \bigg\{ \big [E(y_i|s=NJ;t=Nov; D_{st} = 1) - & E(y_i|s=NJ;t=Feb; D_{st} = 1)\big] - \\ \big[& E(y_i|s=PA;t=Nov; D_{st} = 0) - E(y_i|s=PA;t=Feb; D_{st} = 0)\big] \bigg\} \\ &= [\lambda_{Nov} - \lambda_{Feb} + \beta] - [\lambda_{Nov} - \lambda_{Feb}] \\ &= \beta \end{aligned}\]
Question: how do you do DiD in Stata?
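One possible answer (a minimal sketch, not the project's required specification): with panel data, generate a treatment-group dummy and a post-treatment dummy, and regress the outcome on both plus their interaction; the coefficient on the interaction is the DiD estimate. All variable names below (y, treated, post, firm_id) are hypothetical placeholders.
* treated = 1 for units in the treatment group; post = 1 for periods after the treatment
gen treat_post = treated * post
regress y treated post treat_post, vce(cluster firm_id)
* equivalently, with factor-variable notation; the DiD estimate is the coefficient on 1.treated#1.post
regress y i.treated##i.post, vce(cluster firm_id)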
Main assumption
The assumption is that the trends would have been the same in both states in the absence of the treatment. Make sure you can understand this picture.
Much of the recent discussion of differences-in-differences models has been about ascertaining whether the underlying assumption of equal trends in the absence of treatment is a reasonable one.
Difference-in-differences in Regression Format, Multiple Contrasts, and Robustness
The treatment may not be binary but continuous. \[y_{st} = \gamma_s + \lambda_t + \beta M_{st} + \varepsilon_{st}\] Difference over time: \[\begin{aligned} \Delta y_{st} &= \lambda_t - \lambda_{t-1} + \beta M_{st} + \Delta \varepsilon_{st} \\ & = \lambda + \beta M_{st} + \Delta \varepsilon_{st} \end{aligned}\] Given the multiple contrasts from using more than 2 state-level changes, it is now possible to introduce controls for other state-specific factors. It is possible to use parametric controls for trends at the state level, as in \[\Delta y_{st} = \lambda + \beta M_{st} + X_{st} \delta + \Delta \varepsilon_{st}\] The control variable coefficients can be interpreted as:
- With extra controls, to see whether \(\beta\) changes or not;
- The more control variables you add, the more detail you can get on the treatment effect;
- Give me more interpretation
You can also do dynamic DiD by including lags and leads of the treatment.
There are more implicit assumptions about this model, check the lecture slides and reading.
Question: how do we do DiD with multiple states and control variables in Stata?
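A hedged sketch for the many-states case: include state and time fixed effects, the (possibly continuous) treatment measure M, and state-level controls; under the common-trends assumption the coefficient on M is the DiD estimate. Variable names (y, state, year, M, x1, x2) are placeholders, and the panel version assumes one observation per state-year.
* two-way fixed effects DiD with additional controls
regress y i.state i.year M x1 x2, vce(cluster state)
* or, after declaring the panel structure, with the fixed-effects estimator
xtset state year
xtreg y i.year M x1 x2, fe vce(cluster state)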
Logit Models for Binary Data
Motivation
Why do we need logit model? - Let EK explain it to us :)
Bernoulli Distribution
If X is a random variable with this distribution, we have: \[Pr(X = 1) = p; \ Pr(X = 0) = 1 - p;\] The probability mass function f of this distribution, over possible outcomes k, is: \[f(k;p) = \begin{cases} p, &if \ k = 1 \\ 1 - p, &if \ k = 0 \end{cases}\]
Binomial Distribution
In general, if the random variable X follows the binomial distribution with parameters \(n \in ℕ\) and \(p \in [0,1]\), we write \(X \sim B(n, p)\). The probability of getting exactly k successes in n trials is given by the probability mass function: \[Pr(k; n, p) = Pr(X = k) = \binom{n}{k}p^k(1-p)^{n-k}; \ where \binom{n}{k} = \frac{n!}{k!(n-k)!}\]
Now, we view \(y_i\) as a realization of a random variable \(Y_i\) that takes the values \(0, 1, \cdots, n_i\). If the \(n_i\) observations in each group are independent, and they all have the same probability \(\pi_i\) of having the attribute of interest, then the distribution of \(Y_i\) is binomial with parameters \(\pi_i \ and \ n_i\), which we write \[Y_i \sim B(n_i, \pi_i)\] The probability distribution function of \(Y_i\) is given by \(Pr(Y_i = y_i) = \binom{n_i}{y_i} \pi_i^{y_i}(1-\pi_i)^{n_i - y_i}\)
However, we don’t know the probability \(\pi_i\) !!
Example: MLE
Again, we assume it exists.
The Logit Transformation
The next step in defining a model for our data concerns the systematic structure. We would like to have the probability \(\pi_i\) depend on a vector of observed covariates \(x_i\). The simplest idea would be to let \(\pi_i\) be a linear function of the covariates, say \[\pi_i = x_i'\beta\] where \(\beta\) is a vector of regression coefficients. This model is sometimes also called the linear probability model. One problem with this model is that the probability \(\pi_i\) on the LHS has to be between 0 and 1, but the linear predictor \(x_i'\beta\) on the RHS can take any real value, so there is no guarantee that the predicted values will be in the correct range unless complex restrictions are imposed on the coefficients.
A simple solution to this problem is to transform the probability to remove the range restrictions, and model the transformation as a linear function of the covariates. We do this in two steps.
- First, we move from the probability \(\pi\) to the odds, (example \(\frac{1}{1000}; \frac{999}{1000}\)): \[odds_i = \frac{\pi_i}{1-\pi_i}\]
- Second, we take logarithms, calculating the \(logit\) or log-odds: \[\eta_i = logit(\pi_i) = log (\frac{\pi_i}{1-\pi_i}), where \ \eta \in (-\infty, + \infty)\]
Good, good, problem solved ! Happy :)
Now, let’s do some simple algebra and begin to use MLE.
Solving for \(\pi_i\) in the above equation gives \[\pi_i = logit^{-1}(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}\]
In the lecture, the professor gives a different representation: \[\pi_{1i} \equiv Pr(y_i = 1 | x_i) = F(x_i' \beta) \] He calls this \(F(\cdot)\) the link function.
We are now in a position to define the logistic regression model, by assuming that the logit of the probability \(\pi_i\), rather than the probability itself, follows a linear model (caution: the logit): \[logit(\pi_i) = x_i'\beta \ \ \Rightarrow \ \ log(\frac{\pi_i}{1-\pi_i}) = x_i' \beta \ \ \Rightarrow \frac{\pi_i}{1-\pi_i} = e^{x_i'\beta} \ \ \Rightarrow \pi_i = F(x_i'\beta) = \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\]
In this case marginal effect can be obtained as \[\frac{d\pi_i}{dx_{ij}} = \Big[ \frac{e^{x_i'\beta}}{(1+e^{x_i'\beta})^2} \Big]\beta_j = \beta_j \pi_i(1-\pi_i) \]
The log-likelihood function is \[\mathcal{L}_N(\beta) = \sum_{i=1}^N \{y_i \ln F(x_i'\beta) + (1-y_i) \ln\big[1-F(x_i'\beta)\big]\}\]
Another popular link function is the standard normal \(CDF\), leading to the probit model:
\[\pi_i = F(x_i'\beta) = \Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2} du\]
Question: what should the log-likelihood function \(\mathcal{L}_N(\beta)\) look like in this case?
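For the probit link it takes the same form as the expression above, with the standard normal CDF \(\Phi(\cdot)\) as the link function: \[\mathcal{L}_N(\beta) = \sum_{i=1}^N \Big\{y_i \ln \Phi(x_i'\beta) + (1-y_i) \ln\big[1-\Phi(x_i'\beta)\big]\Big\}\]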
Reflection on MLE again
Let me give a discourse on MLE again and make sure you truly understand it.
Philosophy of modeling the world (Epistemology)
We have used different languages to talk about the same things again and again for thousands of years - Heidegger;
Example
Plato: from preferences to choices; from pdf to estimation and inference - MLE;
Aristotle: from choices to preferences; from data to estimation and inference - bootstrap; machine learning;
Notes for Study Pack II (for lectures 3-6)
In this section, I will go through the lecture notes and give the full derivations of the main models. This section serves two purposes:
- Draft for my LaTeX version notes of L14025;
- Practice for the group project.
For every model of discrete data regression, I will give an example and also Stata code to show the following procedure:
Assumptions
Model
Data - no target data without theory (GDP, e.g. NeoCM without government - don't use NIPA)
Goodness of fit - criterion
Diagnostic
Inference or Interpretation
Consistency !
Okay, Let’s kick it off.
Introduction
It is often of interest to know why economic agents choose certain actions when faced by discrete choices among a set of alternatives. For example: i) why do some individuals default on their mortgages and others do not? ii) Why do some consumers prefer one brand of bottled water over another?
In all cases the main interest lies in determining the factors that influence the probability of choosing one of the alternatives. With qualitative response variables, it makes more sense to analyse the determinants of choosing the various alternatives using probability models.
What is the effect of increasing the mortgage rate by one percentage point on the probability of mortgage default?
How does the price of Brand B affect consumers’ choice of Brand A?
What is the effect of having one more friend on the probability of being very happy?
Probability models involve nonlinear regressions and are often estimated by the method of maximum likelihood.
Review of Maximum Likelihood Estimation
Please read the L14020 notes, and understand the main assumptions of MLE, the properties of MLE, and also the asymptotic tests based on MLE.
Make sure you understand rather than memorize the equations!
As all models for discrete variables use MLE, tests such as the Wald, Likelihood Ratio, and Lagrange Multiplier tests will be used very often.
Model selection based on MLE can be conducted using the so-called information criteria. All else equal, the model that minimises an information criterion is preferred. Two information criteria are widely used: the Bayesian information criterion (BIC) and the Akaike information criterion (AIC). BIC penalises the number of independent variables, k, more heavily. \[\begin{aligned} BIC & = -2 \ln L(\beta) + \ln(N)k \\ AIC & = - 2\ln L(\beta) + 2k \end{aligned}\] Although it is best to use them in conjunction, BIC is preferred most of the time.
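In Stata, both criteria can be obtained after any maximum likelihood estimation with estat ic (a minimal sketch; the model and the variable names y, x1, x2 are placeholders):
logit y x1 x2
estat ic   // reports ll(null), ll(model), df, AIC and BIC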
Binary choice model
In a binary choice model, economic agents face two alternatives, often coded as 0 and 1 for convenience. The aim is to model the probability of choosing one of the alternatives. As the sum of the probabilities over the two alternatives is 1, we do not need to explicitly model the probability of the second alternative. The alternative or category that is not explicitly modelled is called the base group. The choice of the base group does not affect the outcome of the econometric analysis. The aim is to model the probability of the event (e.g. default): \[\pi_{1i} = Pr(y_i = 1 \mid x_i) = F(x_i' \beta)\] When dealing with continuous variables, we need a probability density function. For example, given \(y \sim N(X\beta, \sigma^2 I_T)\), we have \[f(y; \beta, \sigma^2) = (2\pi\sigma^2)^{-T/2} e^{[-\frac{1}{2\sigma^2}(y-X\beta)'(y - X\beta)]}\] In this case, we multiply all the marginal densities to get the joint density, then take logs and sum. For binary choice, the probability mass function f of the distribution, over possible outcomes k, is \[f(k;p) = \begin{cases} p, &if \ k = 1 \\ 1 - p, &if \ k = 0 \end{cases}\] or, equivalently, \[f(y_i;p) = p^{y_i}(1-p)^{1-y_i} = \begin{cases} p & if \ y_i = 1 \\ 1-p & if \ y_i = 0 \end{cases}\] In this case, we just take logs of the probabilities and sum: \[\begin{aligned} \mathcal{L}_N(\beta) & = \sum_{i=1}^N \{y_i \ln \pi_i + (1-y_i) \ln[1-\pi_i]\} \\ & = \sum_{i=1}^N \{y_i \ln F(x_i'\beta) + (1-y_i) \ln\big[1-F(x_i'\beta)\big]\} \\ & = \sum_{i=1}^N\sum_{j=0}^1 I_j(y_i) \ln \pi_{ji} \ \ , where \ I_j(y_i) = 1 \ if \ y_i = j; \ 0 \ otherwise \end{aligned}\] The function \(F(\cdot)\) in the above likelihood function is called the link function. We have three choices: the identity function, the logistic CDF and the standard normal CDF. I will only present the logit and probit link functions, as these two are used most often.
Logit model
\[\pi_i = F(x_i'\beta) = \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\] In this case marginal effect can be obtained as \[\frac{d\pi_i}{dx_{ij}} = \Big[ \frac{e^{x_i'\beta}}{(1+e^{x_i'\beta})^2} \Big]\beta_j = \beta_j \pi_i(1-\pi_i) \]
In logit model, \(e^{\hat{\beta_k}}\) can be interpreted by saying that a unit increase in \(x_k\) changes the odds ratio by a factor of \(e^{\hat{\beta_k}}\), keeping all else constant. I will give more examples on interpretation in the following case studies.
Another popular link function is the standard normal \(CDF\), leading to the probit model:
\[\pi_i = F(x_i'\beta) = \Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2} du\] For the probit model, the expression for the marginal effect is \[\frac{\partial \hat{\pi}_{pi}}{\partial x_{ki}} = [\phi(z_i)]\hat{\beta}_k; \ \ where \ \ \phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2}\]
Goodness of fit:
Pseudo R-squared
With logistic or probit regression models, there is no statistic equivalent to the R-squared used in linear models estimated by OLS. However, there are “pseudo” R-squared values that can be used to evaluate model fit. Like standard R-squared values, these “pseudo” R-squareds range from 0 to 1 (though never achieving 0 or 1), with higher values indicating better model fit. However, they cannot be interpreted as one would interpret an OLS R-squared (e.g. as the % of explained variability in the dependent variable, or the square of the correlation between actual and predicted values).
Perhaps the most popular pseudo R-squared is McFadden’s pseudo R-squared which is defined as 1 minus the ratio of the log likelihood of the full model and the log likelihood of the intercept only model:
\[R_{McF}^2 = 1 - \frac{\ln L(\beta_0, \beta_1, \cdots \beta_k)}{\ln L(\beta_0)}\]
Sensitivity and specificity
A more useful measure of goodness of fit is a summary of the predictive ability of the model, given as a 2x2 classification table of the success or failure of the predictions. By default, a prediction is deemed “a success” (“a failure”) if the predicted probability is greater (less) than 0.5 and the actual outcome is y = 1 (y = 0). The researcher can set this cut-off probability to any value between 0 and 1, depending on what makes more sense for the situation at hand (a Stata sketch follows the list below).
- The percentage of correctly predicted y=1 observations is called the model’s sensitivity.
- The percentage of correctly predicted y=0 observations is called the model’s specificity.
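In Stata, the classification table with sensitivity and specificity can be obtained after a logit fit with estat classification (a minimal sketch; variable names are placeholders, and the 0.5 cut-off shown is the default, changeable via the cutoff() option):
logit y x1 x2
estat classification, cutoff(0.5)   // classification table, sensitivity and specificity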
Hosmer-Lemeshow goodness-of-fit test
Intuitively this test compares the deviation of each observation from its predicted or expected value. If this deviation is large, one would conclude that model fit is not good enough. Here we discuss some basic steps of one version of the Hosmer-Lemeshow goodness-of-fit test, and provide an empirical example later on.
The procedure for calculating this test is given in the lecture slides; I will not present it here.
Diagnostic
There is not much material in our lecture slides, but I will show it in the case studies.
Outliers
Influence
Leverage
Case Studies (Empirical examples)
In the following empirical examples, I will give the analysis step by step, following this procedure:
Assumptions - Bernoulli Distribution
Model - Logit and Probit (MLE)
Data - data exploration
Goodness of fit - Pseudo R-squared, sensitivity and specificity; Hosmer-Lemeshow; Model selection (BIC and AIC)
Diagnostic - outlier and influence
Inference or Interpretation
Consistency !
Case I - Labour Force participation
Our first case is from Mroz’s (1987) study of the labor force participation of women, using data from the 1976 Panel Study of Income Dynamics. You can download the data binlfp4.dta from this link. In this case, we want to find:
- What determines the probability of labour force participation.
Here is the explanation of the main variables. The dependent variable in the above list is lfp, which takes the value 1 or 0, indicating whether the agent is in the paid labor force or not. One of the independent variables, agecat, is ordinal.
Assumptions
Most empirical work simply assumes that binary data follow a Bernoulli distribution without checking, because the assumption is not very restrictive. Based on my knowledge, most binary data do follow a Bernoulli distribution, so we will not check this assumption.
Model
Now, we will run the logit model, and then the probit model. You can see that the coefficients are different; the following comparison illustrates this, and a sketch of the code is given below.
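A minimal sketch of the commands (the exact specification is an assumption, borrowed from the diagnostics code used later):
logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store logit_m
probit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store probit_m
estimates table logit_m probit_m, b(%9.3f) t varwidth(30)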
Hypothesis testing
Most often, we are interested in testing \(H_0:\beta_k = 0\). For example, consider the results for the variables k5 and wc from the logit output:
Having young children has a significant effect on the probability of being in the labor force
(z = -7.25, p < 0.01 for a two-tailed test)
The effect of the wife attending college is significant at the 0.01 level
Be careful: the coefficient in the Stata output does not tell us the scale of the effect; it only tells us whether the effect is significant and whether it is positive or negative.
The default test in Stata for the logit model is the Wald test; we can also use the LR test. To test a single coefficient, we begin by fitting the full model and storing the results. Then we fit the model without k5 and store the results. Next, we run lrtest; the code is sketched after the result below. The LR test shows the following:
The effect of having young children is significant at the 0.01 level
(LR Chi-square = 62.55, df = 1, p < 0.01)
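A minimal sketch of the commands behind this comparison (same specification as above):
logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store full
logit lfp k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store drop_k5
lrtest full drop_k5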
We can also test multiple coefficients by using test or lrtest.
We conclude the following:
We reject the hypothesis that the coefficients of the husband's and the wife's
education are simultaneously equal to 0 (chi-square = 17.83, df = 2, p < 0.01)
We conclude the following:
We reject the hypothesis that the coefficients of the husband's and the wife's
education are simultaneously equal to 0 (LR chi-square = 18.68, df = 2, p < 0.01)
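Hedged sketches of both versions of the joint test on the wife's and husband's college indicators (wc and hc):
* Wald test of the joint hypothesis after fitting the full model
quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store full
test 1.wc 1.hc
* LR test against the model that drops both college indicators
quietly logit lfp k5 k618 i.agecat lwg inc, nolog
estimates store nocollege
lrtest full nocollege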
Now, let's compare the LR and Wald tests. Although the LR and Wald tests are asymptotically equivalent, their values differ in finite samples. In our example, we can compare them in the following table. Please review the L14020 notes to understand the comparison between the LR test and the Wald test.
Measure of fit
A scalar measure of fit can be useful when comparing competing models. Information criteria such as BIC and AIC can be used to select among models and are often very useful. We can also use the Hosmer-Lemeshow test to check the goodness of fit; a sketch of the commands is given below.
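A minimal sketch of the commands (fitstat is part of the SPost13 package mentioned under Project Task III below; estat gof with the group() option gives the Hosmer-Lemeshow test):
quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
fitstat                // McFadden's R-squared, AIC, BIC and other scalar measures
estat ic               // information criteria only
estat gof, group(10)   // Hosmer-Lemeshow test with 10 groups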
Diagnostic of model
An index plot is an easy way to examine residuals by plotting them against the observation number.
quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
predict rstd, rstandard
label var rstd "Standardized Residual"
sort inc
generate index = _n
label var index "Observation Number"
graph twoway scatter rstd index, msymbol(Oh) mcolor(black) xlabel(0(200)800) ///
ylabel(-4(2)4, grid gmin gmax) yline(0, lcolor(black))
If you want to identify the outliers, you can label them with their observation number using the following code.
graph twoway scatter rstd index, msymbol(none) mlabel(index) mlabposition(0) xlabel(0(200)800) ///
ylabel(-4(2)4, grid gmin gmax) yline(0, lcolor(black))
Influential Cases
We can also identify the influential cases.
predict deltabeta, dbeta
label var deltabeta "Pregibon's influence statistic"
graph twoway scatter deltabeta index, msymbol(none) mlabel(index) mlabposition(0) ///
xlabel(0(200)800) xtitle("Observation Number") ylabel(0(0.1)0.3, grid gmin gmax)
This plot shows that cases 142, 309, and 752 merit further examination.
Interpretation
Okay, once we have the model, we need to interpret the coefficients. The logit and probit models are not linear, which is why we need to calculate marginal effects.
Effects for the logit model ( but not the probit model ) can be interpreted in terms of changes in the odds. In the logit model, the log odds are a linear combination of the x’s and \(\beta's\). For example, consider a model with three independent variables: \[\ln \big\{ \frac{Pr(y=1|x)}{Pr(y=0|x)} \big\} = \ln \Omega (x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_3\] This leads to a model that is multiplicative instead of linear but in which the outcome is the odds: \[\Omega (x) = e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3}\] Now, if we want to check how odds ratio will change when we change variable \(x_3\), we will have \[\frac{\Omega(x, x_3+1)}{\Omega(x,x_3)} = \frac{e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3} e^{\beta_3}}{e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3}} = e^{\beta_3}\] Accordingly, we can interpret the exponential of the logit coefficients as follows:
- For a unit change in \(x_k\), the odds are expected to change by a factor of exp(\(\beta_k\)), holding other variables constant.
- For a standard deviation change in \(x_k\), the odds are expected to change by a factor of \(exp(\beta_k s_k)\), where \(s_k\) is the standard deviation of \(x_k\), holding all other variables constant.
In our example, once we run the logit regression, we can list the coefficients and also the odds ratios. Here is the code:
logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
listcoef, help
In the above table, b means \(\beta\), \(e^b\) gives the odds ratio, and \(e^{bStdX}\) gives the change in odds for an SD increase in X. Using the coefficient for lwg in the column \(e^{bStdX}\), we can say:
- For a standard deviation increase in the log of the wife’s expected wages, the odds of being in the labor force are 1.43 times greater, holding all other variables constant.
Marginal effects: Changes in Probabilites
We will not repeat the formula for the marginal effects of this model; I will jump directly to the code and interpretation (a sketch is given below). We have two popular marginal effects: marginal effects at the average values (MEA) and average marginal effects (AME). Cameron and Trivedi suggest that ‘it is best to use AME over MEA’.
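A minimal sketch of the commands (the specification follows the diagnostics code above; mchange is an SPost13 command):
quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
margins, dydx(*)           // average marginal effects (AME)
margins, dydx(*) atmeans   // marginal effects at the mean values
mchange, amount(sd)        // SPost13: average discrete changes, including SD changes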
AMEs for continuous variables
From the above table, we can say
- Holding other variables at their observed values, increasing income by one standard deviation, roughly $12,000, decreases the probability of labor force participation on average by 0.09.
AMEs for factor variables
From the above table, we can say
- On average, having attended college increases a woman’s probability of labor force participation from 0.53 to 0.69, a change of 0.16 (\(p < 0.001\)), while the effect of the husband having attended college is not significant.
Graphing predicted probabilities
With a continuous independent variable, you can plot the predicted probabilities over the range of the variable. For example, to examine the effects of inc, we might plot the predicted probability of labor force participation as inc changes, holding other variables at fixed values.
The following code gives the graph
quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store base
margins, at(inc=(0(10)100)) atmeans noatlegend
marginsplot
An effective way to show the effects of two variables is to graph predictions at various levels of one variable as the other variable changes. This can be done with either marginsplot or mgen.
The following code gives the graph
quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
estimates store base
margins agecat, at(inc=(0(10)100)) atmeans noatlegend
marginsplot, noci legend(cols(3))
Adding power terms and plotting predictions
One interesting plot compares the predictions when we add power terms.
quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
mgen, predlabel(linear) atmeans at(inc=(0(10)100)) stub(_lin)
quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg c.inc##c.inc, nolog
mgen, predlabel(quadratic) atmeans at(inc=(0(10)100)) stub(_quad)
graph twoway connected _linpr1 _quadpr1 _quadinc, title("Comparing income specifications") ///
caption("Other variables at their means") ytitle("Pr(In Labor Force)") ///
ylab(0(0.25)1, grid gmin gmax) legend(cols(3))
You can also add confidence interval into the graph by following code
* run the regression and store the estimates first
graph twoway (rarea _linul _linll _lininc, col(gs12)) ///
(connected _linpr1 _quadpr1 _quadinc), title("Comparing income specifications") ///
caption("Other variables at their means") ytitle("Pr(In Labor Force)") ///
ylab(0(0.25)1, grid gmin gmax) legend(cols(3))
Okay, this is the end of the binary model section.
Multinomial logit models
For multinomial models, we use the same method as in the binary model to get the transformation. The multinomial logit model (MNLM) can be written as \[\ln \Omega_{m\mid b}(x) = \ln \frac{Pr(y=m \mid x)}{Pr(y=b \mid x)} = x \beta_{m|b} \ \ for \ m = 1 \ to \ J \] where b is the base outcome, sometimes called the reference category. With these transformations, we can calculate the relevant probabilities conditional on the covariates \(x\): \[Pr(y=m|x) = \frac{e^{x\beta_{m|b}}}{\sum_{j=1}^J e^{x\beta_{j|b}}}, \ \ where \ J \ is \ the \ number \ of \ categories\]
For example, suppose that you have three outcomes and fit the model with alternative 1 as the base, where you would obtain estimates \(\hat{\beta}_{2|1} \ and \ \hat{\beta}_{3|1}, \ with \ \beta_{1|1} = 0\). The probability equation for the second outcome is \[Pr(y=2|x) = \frac{exp(x\beta_{2|1})}{\sum_{j=1}^3exp(x\beta_{j|1})} \] Okay, let’s give the proper definitions of \(m\) and \(J\). There are \(J\) alternatives and the dependent variable y is defined to take value \(m\) if the \(m th\) alternative is taken, \(m = 1, \cdots J\). Define the probability that alternative \(m\) is chosen as \[p_m = Pr(y=m), \ m = 1, \cdots J \] Introduce \(J\) binary variables for each observations y, \[y_m =\begin{cases} 1 & if \ y = m, \\ 0 & if \ y \neq m. \end{cases}\] Thus \(y_m\) equals one if alternative \(m\) is the observed outcome and the remaining \(y_k\) equal zero, so for each observation on \(y\) exactly one of \(y_1, y_2, \cdots , y_J\) will be nonzero. The multinomial density for one observation can then be conveniently written as \[f(y) = p_1^{y_1} \times \cdots \times p_J^{y_J} = \prod_{j=1}^J p_j^{y_j} \ \ \ \ \ \ \ \ \ \ \ \ \ (D1)\] This means \(X \in \{1, \cdots J\}\) be a categorical random variable with \(J\) states, and \(p_m = Pr(y=m), \ m = 1, \cdots J\).
For regression models, introduce a subscript \(i\) for the \(ith\) individual and regressors \(x_i\). Specify a model for the probability that individual \(i\) chooses the \(mth\) alternative: \[p_{im} = Pr(y_i = m) = F_m(x_i, \beta), \ m = 1, \cdots J, \ i = 1, \cdots, N \] The functional form for \(F_m\) should be such that the probabilities lie between 0 and 1 and sum over \(m\) to 1, which means the overall probability across the \(J\) states should be 1.
ML Estimation
The multinomial density for one observation is given in (D1). The likelihood function for a sample of N independent observations is then \(L_N = \prod_{i=1}^N \prod_{j=1}^J p_{ij}^{y_{ij}}\), where the subscript \(i\) denotes the \(ith\) of \(N\) individuals and the subscript \(j\) denotes the \(jth\) of J alternatives. The log-likelihood function is \[\mathcal{L} = \ln L_{N} = \sum_{i=1}^N \sum_{j=1}^J y_{ij} \ln p_{ij}, \ \ \ where \ p_{ij} = F_j(x_i, \beta) \ is \ a \ function \ of \ parameters \ \beta \ and \ regressors. \] The marginal effect of a regressor, say \(x_k\), on the probability of y belonging to category \(j\) can be obtained as \[\frac{\partial \hat{\pi}_{ji}}{\partial x_{ki}} = \hat{\pi}_{ji} \big[\beta_k^j - \sum_{l=1}^J \hat{\pi}_{li} \beta_k^l \big] \]
Independence of irrelevant alternatives
The multinomial model, as well as the conditional logit and rank-ordered logit models, makes the assumption known as the independence of irrelevant alternatives (IIA). In the multinomial model, \[\frac{Pr(y=m|x)}{Pr(y=n|x)} = exp\{x(\beta_{m|b} - \beta_{n|b}) \} \] , where the odds do not depend on other alternatives that are available. In this sense, those alternatives are “irrelevant”. What this means is that adding or deleting alternatives does not affect the odds among the remaining alternatives. The Hausman statistic can be used to check this assumption. If the null hypothesis of IIA is rejected, inference based on estimates from the multinomial logit model would be invalid. In this case there are two possible options:
- Estimate binary logit models for each pair of alternatives. This option is always valid, albeit inefficient.
- Estimate multinomial probit models, but make sure that the program you use does not assume IIA. However, this option can be computationally burdensome, especially with a larger number of choices.
Before we do the conditional logit and ordered logit models, let's look at a case study for the multinomial model.
Case study (MNLM)
The 1992 American National Election Study asked respondents to indicate their political party using one of eight categories. We use these to create party with five categories: strong Democrat (1 = StrDem), Democrat (2 = Dem), Independent (3 = Indep), Republican (4 = Rep), and strong Republican (5 = StrRep). Five regressors are included in the model: age, income, race (black or white), gender and education. You can download the dataset partyid4.dta from this link.
Now, let's just take a look at the data and begin to use the mlogit model
tabulate party, miss
sum age income black female
tabulate educ, miss
mlogit party age income i.black i.female i.educ
We will not present the results of the whole model, but jump to hypothesis testing, goodness of fit, etc. We will come back later to the effects of selecting a different base. Again, as we use MLE, the tests should be Wald, LR or LM tests.
quiet mlogit party age income i.black i.female i.educ, nolog
mlogtest, lr
mlogtest, wald   // Wald test for each independent variable
The results of the LR test, regardless of how they are computed, can be interpreted as follows:
- The effect of age on party affiliation is significant at the 0.01 level.
- The effect of being female is significant at the 0.1 level but not at the 0.05 level
We can also test for combining alternatives.
quiet mlogit party age income i.black i.female i.educ, nolog
mlogtest, combine
From these results, we can reject the hypothesis that the categories StrDem and Dem are indistinguishable. Indeed, our results indicate that all the categories are distinguishable.
Independence of irrelevant alternatives
IIA means that adding or deleting alternatives does not affect the odds among the remaining alternatives. Tests of IIA involve comparing the estimated coefficients from the full model to those from a restricted model that excludes at least one of the alternatives. If the test statistic is significant, the assumption of IIA is rejected, indicating that the multinomial model is inappropriate. We will use the Hausman-McFadden (HM) and Small-Hsiao (SH) tests.
mlogit party age income i.black i.female i.educ, base(5)
mlogtest, hausman
Based on the test results, we cannot reject IIA.
Measure of fit
As with models for binary and ordinal outcomes, many scalar measures of fit for the MNLM can be computed with the SPost command fitstat, and information criteria can be computed with estat ic. I will not repeat them here.
Predicted probabilities with predict
The most basic command for computing probabilities is predict. After fitting the model with mlogit, predicted probabilities for all outcomes within the sample can be calculated with predict.
quiet mlogit party age income i.black i.female i.educ, base(5)
estimate store mlogit
predict mnlmSD mnlmD mnlmI mnlmR mnlmSR
codebook mnlmSD mnlmD mnlmI mnlmR mnlmSR, compact
Now, we run the ordered model and compare the predictions.
quiet ologit party age income i.black i.female i.educ
predict olmSD olmD olmI olmR olmSR
codebook olm*, compact
The following code gives the dot plot of comparison between ordered and multinomial model.
label var olmSD "ologit"
label var mnlmSD "mlogit"
dotplot olmSD mnlmSD, ylabel(0(0.2)0.8, grid) ytitle(Pr(Strong Democrat))
If we look at the middle category of Independent, however, things look quite different, reflecting the abrupt truncation of the distribution of predictions for middle categories that is often found with the OLM.
label var olmI "ologit"
label var mnlmI "mlogit"
dotplot olmI mnlmI, ylabel(0(0.2)0.3, grid) ytitle(Pr(Independent))
Marginal effect
Average marginal effects are a quick and valuable way to assess the effects of all the independent variables in your model. We will calculate the marginal effects and give interpretations.
quiet mlogit party age income i.black i.female i.educ, base(5)
estimate store mlogit
mchange, amount(sd) brief width(8)
Even for this relatively simple model and looking only at a single amount of change for each variable, there is a lot of information to digest. To make it simpler to interpret these results, we plot the changes with mchangeplot. We begin by looking at the average discrete changes for standard deviation increases in age and income.
mchangeplot age income, symbols(D d i r R) min(-0.06) max(0.06) gap(0.02) ///
sig(0.05) xtitle(Average Discrete Change) ysize(1.3) scale(2.1)
Although we probably would not include this graph in a paper, we use it to help describe the effects:
- On average, a standard deviation increase in age, about 17 years, increases the probability of being a strong Republican by 0.03 and of being a strong Democrat by 0.05. The probability of other affiliations all decrease by roughly 0.03. All effects are significant at the 0.01 level.
- On average, a standard deviation increase in income, roughly $28,000, significantly increases the probability of being a Republican by 0.04 and a strong Republican by 0.02, while significantly decreasing the probability of being a strong Democrat by 0.04.
The greater effect of race compared with gender on party affiliation is shown with a plot of their average discrete changes. We use the following command to create the graph
mchangeplot black female, symbols(D d i r R) min(-0.3) max(0.3) gap(0.1) ///
sig(0.05) xtitle(Average Discrete Change) ysize(1.3) scale(2.1)
On average, for people similar on other characteristics, being black increases the probability of being a strong Democrat by nearly 0.3 compared with someone who is white. Except for an increase of 0.07 in Democratic affiliation, the effects of gender are not significant.
Graphing predicted probabilities
It would be very helpful to plot how the predicted probability changes with one variable.
quiet mlogit party age income i.black i.female i.educ, base(5) vsquish
mgen, atmeans at(income=(0(10)100)) stub(mnlmI) replace
label var mnlmIpr1 "Strong Dem"
label var mnlmIpr2 "Democrat"
label var mnlmIpr3 "Independent"
label var mnlmIpr4 "Republican"
label var mnlmIpr5 "Strong Rep"
graph twoway connected mnlmIpr1 mnlmIpr2 mnlmIpr3 mnlmIpr4 mnlmIpr5 mnlmIincome, ///
title("Multinomial logit model: other variables held at their means", ///
pos(11) size(medium)) ytitle(Probability of party affiliation) ///
ylab(0(0.1)0.4, grid gmax gmin) msym(O Oh dh sh s) mcol(gs1 gs5 gs8 gs5 gs1) ///
lpat(solid dash shortdash dash solid) lco(gs1 gs5 gs8 gs5 gs1) ///
legend(rows(2))
Project Task III
Before you run the code for this section, please read pages 9-11 of this document, and make sure you have installed SPost13 in your Stata.
In the following analysis, we will focus on the data from 2013 and use ordinal model (ORM) to answer the question.
The statistical model
A latent-variable model
The ORM is commonly presented as a latent-variable model. Defining \(y^*\) as a latent variable ranging from \(- \infty \ to \ \infty\), the structural model is \[y_i^* = x_i \beta + \varepsilon_i\] , where \(i\) is the observation and \(\varepsilon\) is a random error. The underlying continuous latent variable can be thought of as the propensity to be in a higher category (here, to export more intensively). The observed categories are tied to the latent variable by the measurement model \[y_{it} = \begin{cases} 0, & if \ expint_{it} = 0 \\ 1, & if \ 0 < expint_{it} < 0.25 \\ 2, & if \ 0.25 \leq expint_{it} < 1 \\ 3, & if \ expint_{it} = 1 \end{cases}\] Here, expint is the observed exporting intensity, and \(y_{it}\) is the categorical variable we have to create.
drop if year != 2013
gen expcat = 0 if expint == 0
replace expcat = 1 if expint > 0 & expint <0.25
replace expcat = 2 if expint >= 0.25 & expint < 1
replace expcat = 3 if expint == 1
table expcat
label var expcat "category variable based on exporting intensity range"
codebook company manage expint markets private size lnage expcat service, compact
Now, we can use ologit or oprobit to estimate the model. We put in all the relevant variables as independent variables, just to take a look at what we get.
ologit expcat manage private lnwage_skilled lnwage_unskilled lncost profitability ///
size service lnage, nolog
estimates store ologit
oprobit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability size service lnage, nolog
estimates store oprobit
estimates table ologit oprobit, b(%9.3f) t varlabel varwidth(30)
From the above results, we find that size and age are not significant determinants of the exporting category. Also, the coefficients differ noticeably when we compare the ologit and oprobit models (as expected, since the logistic and standard normal links use different scales). The following picture gives the result. Now, we try adding quadratic terms and some interactions to see whether the main variables become significant factors or not.
ologit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability c.size##c.size service c.lnage##c.lnage, nolog
estimates store ologit1
oprobit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability c.size##c.size service c.lnage##c.lnage, nolog
estimates store oprobit1
Once we add the quadratic terms, the main factors - size and age - become significant.
Testing the parallel regression assumption using oparallel
ssc install oparallel
quiet ologit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability c.size##c.size service c.lnage##c.lnage, nolog
oparallel, ic
Based on the test, we have to reject the parallel regression assumption, which means we cannot use ologit to model this task. We should use a generalised ordered logit or a multinomial logit model to answer the question (we can also use gologit2 to do the ordered analysis).
Before we run the multinomial logit model, it is better to look at the correlation table. As markets is highly correlated with expint, it is better not to include it as an independent variable in the model. We then run the multinomial model to see what we get.
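First, a hedged sketch of the correlation check mentioned above (variable names follow the earlier codebook output):
pwcorr expint markets manage private size lnage profitability, sig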
mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability size service lnage
mlogtest, hausman
Four tests of IIA are reported. The first two show that we cannot reject the independence of those two categories from the others; however, the last two indicate that we should reject IIA. If we remove service, the results are quite different.
mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability size lnage
mlogtest, hausman
We will give the analysis with the full set of variables, check the predictive power, and later use model selection to find the best one.
The following shows how the predicted probability of exporting depends on cost; a sketch of the code is given below.
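A minimal sketch of how such a plot could be produced with mgen after the multinomial model (the range of lncost values and the stub name are assumptions):
quietly mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
profitability size service lnage
mgen, atmeans at(lncost=(0(1)10)) stub(cost_) replace
graph twoway connected cost_pr1 cost_pr2 cost_pr3 cost_pr4 cost_lncost, ///
ytitle("Pr(export category)") legend(rows(1))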