Introduction

The aim of the project is twofold.

To finish the project, we have three tasks, and we need to describe the results as clearly as possible. The word limit for this project is 5,000 words.

Task 1 [10%]

Discuss briefly the most important features of the data; describe any interesting patterns or correlations in the data and provide some summary statistics/graphs of the variables of interest.

    Where and how do we find these features and summaries? From the data or from the model?

Just read An Economist’s Guide to Visualizing Data with a model in your mind

Task 2 [40% - impact]

Senior managers of selected companies were given some training on “modern management techniques” in 2012. It is claimed by the data providers that companies were selected randomly for this purpose.

  1. Using DiD estimation, estimate the impact of this management training on (a) company profitability; (b) unit cost of production. Write a short report on your findings, including any assumptions you may have made and your views on the feasibility of those assumptions. You can use company size and age (or some functions of size and age) as control variables.

    What is DiD, and what is the criterion for choosing control variables?
    What are the main assumptions?
  2. Would your conclusions change if you performed your analysis (a) for private and non-private companies separately; (b) for service and non-service companies separately? You can also conduct any other robustness analysis you think might be appropriate and informative.

    What is an appropriate and informative robustness analysis?
    How do we measure that change or make the comparison?

Task 3 [50% - determinants]

Choose any year between 2012 and 2014 and consider the following model of the determinants of exporting by firm i at time t, in terms of firm characteristics at time t-1: \[Prob(y_{it} = 1) = F(X_{it}; \beta),\] where i indexes firms; t is 2012, 2013 or 2014. X is a vector of company characteristics that should at least include company size and age.

  • An economist suggests that the following qualitative variable should be used as the dependent variable:

\[\begin{equation} y_{it} = \begin{cases} 1, & if \ expint_{it} > 0 \\ 0, & if \ expint_{it} = 0 \end{cases} \end{equation}\] Estimate the probability equation using the above variable as the dependent variable, and carefully interpret your results. Comment on the goodness-of-fit of your models. Investigate the robustness of your results by re-estimating the model for service and non-service sector firms separately. You can also conduct any other robustness analysis you think might be appropriate and informative.

  • Another economist suggests that the following qualitative variable be used as the dependent variable:

\[y_{it} = \begin{cases} 0, & if \ expint_{it} = 0 \\ 1, & if \ 0 < expint_{it} < 0.25 \\ 2, & if \ 0.25 \leq expint_{it} < 1 \\ 3, & if \ expint_{it} = 1 \end{cases}\] Estimate the probability equation using the above variable as the dependent variable, and carefully interpret your results. Investigate the robustness of your results by re-estimating your models for service and non-service sector firms separately. You can also conduct any other robustness analysis you deem useful.

  • Out of the models you have estimated under Task 2, is there any one you would prefer in terms of either theoretical appropriateness or empirical robustness? Give good reasons for your answer.

Get a feel for the data

In this part, I try to give data visualizations for the main variables.

Figure: Trend for non-private companies

Figure: Trend for private companies

It seems the training program has a minor effect on profitability but a large effect on cost, especially for private companies.

Figure: Trend for non-service companies

Figure: Trend for service companies

Figure: Trend and distribution for exports

Now, let’s just focus on performance in 2014.

Figure: Distribution of skilled and unskilled wages

Figure: Distribution of cost and profitability

Figure: Distribution of size and age

Figure: Distribution of profit for the non-service and service sectors

Figure: Distribution of cost for the non-service and service sectors

Give me a story !

Empirical Work Procedure

Better to bear the following in mind:

  1. Assumptions

  2. Model

  3. Data - no target data without theory (GDP, e.g. NeoCM without government - don’t use NIPA)

  4. Goodness of fit - criterion

  5. Diagnostic

  6. Inference or Interpretation

  7. Consistency !

Difference-in-Differences

Example - attending hospital

Now, consider the effect of hospital treatment on health. We collect data!

  • This comparison shows that the health of those who attended hospital is worse than the health of those who did not
  • Perhaps those who attended hospital would have had worse health even if they had not attended hospital
  • Perhaps those who did not attend hospital would have had better health even if they had attended hospital

The unit-level causal effect of the treatment, \(\Delta = Y^1 - Y^0\), cannot be observed because we observe only the outcome of one level of treatment; we do not observe the counterfactual.

This is known as the Fundamental Problem of Causal Inference !

The outcome Y reveals only half of the information contained in the underlying potential outcomes.

How do we solve this problem? DiD, with a restrictive assumption!

DiD Model

The key strategy in regression was to estimate causal effects by controlling for confounding factors. A key variable in such a strategy is frequently the outcome of interest in a period before the treatment took place. Differences-in-differences is a strategy to model the role of pre-treatment outcomes in a particular fashion. (Having an effect takes time!! - the time dimension matters.)

We decompose individual data or observations into components at different levels. We don’t know which specific factors might affect health, but we can assume their existence.

Formally, the assumptions underlying differences-in-differences estimation are as follows. Let

    y0 = fast food employment for low minimum wage
    
    y1 = fast food employment for high minimum wage
    

Decomposition: \[E(y_0|X)=E(y_0|s,t)= \gamma_s + \lambda_t,\] where s means state (dimension 1); and t means time (dimension 2).

The treatment, a higher minimum wage, changes the employment level conditional on s and t: \[ E(y_1|s,t)=E(y_0|s,t) + \beta \times 1 = \gamma_s + \lambda_t + \beta \times 1\]

So we can write observed employment in restaurant i as:

\[y_i= \gamma_s+ \lambda_t+ \beta D_{st}+ \varepsilon_i,\] where:

  • \(\gamma_s\) can be interpreted as the individual state or characteristic - fixed;
  • \(\lambda_t\) can be interpreted as the individual trend or macro trend - stationary;
  • \(D_{st}\) is a dummy for the treatment;
  • \(\varepsilon\) is the error term - either we have no idea about it, or it is exogenous to the model;

Notice:

\[\begin{aligned} & E(y_i|s=PA;t=Nov; D_{st} = 0) - E(y_i|s=PA;t=Feb; D_{st} = 0) \\ \\ & = (\gamma_{PA}+ \lambda_{Nov}+ \beta \times 0+ \varepsilon_i) - (\gamma_{PA}+ \lambda_{Feb}+ \beta \times 0+ \varepsilon_i) \\ & = \lambda_{Nov} - \lambda_{Feb} \end{aligned}\] also: \[\begin{aligned} & E(y_i|s=NJ;t=Nov; D_{st} = 1) - E(y_i|s=NJ;t=Feb; D_{st} = 1) \\ \\ & = (\gamma_{NJ}+ \lambda_{Nov}+ \beta \times 1+ \varepsilon_i) - (\gamma_{NJ}+ \lambda_{Feb}+ \beta \times 1 + \varepsilon_i) \\ & = \lambda_{Nov} - \lambda_{Feb} + \beta \end{aligned} \]

It’s time to do the difference in differences:

\[\begin{aligned} \bigg\{ \big [E(y_i|s=NJ;t=Nov; D_{st} = 1) - & E(y_i|s=NJ;t=Feb; D_{st} = 1)\big] - \\ \big[& E(y_i|s=PA;t=Nov; D_{st} = 0) - E(y_i|s=PA;t=Feb; D_{st} = 0)\big] \bigg\} \\ &= [\lambda_{Nov} - \lambda_{Feb} + \beta] - [\lambda_{Nov} - \lambda_{Feb}] \\ &= \beta \end{aligned}\]

Question: How do you do DiD in Stata?
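
A minimal sketch, assuming a panel with outcome y, a treatment-group dummy treat, a post-treatment dummy post, and a cluster identifier id (all names are placeholders):

    * DiD as a regression: group effect, time effect, and their interaction
    gen did = treat * post
    regress y treat post did, vce(cluster id)
    * equivalently, with factor-variable notation
    regress y i.treat##i.post, vce(cluster id)

The coefficient on did (or on 1.treat#1.post) is the DiD estimate of \(\beta\).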

Main assumption

The assumption is that the trends would have been the same in both states in the absence of the treatment. Make sure you understand the parallel-trends picture.

Much of the recent discussion of differences-in-differences models has been about ascertaining whether the underlying assumption of equal trends in the absence of treatment is a reasonable one.

Difference-in-differences in Regression Format, Multiple Contrasts, and Robustness

The treatment may not be binary but continuous. \[y_{st} = \gamma_s + \lambda_t + \beta M_{st} + \varepsilon_{st}\] Difference over time: \[\begin{aligned} \Delta y_{st} &= \lambda_t - \lambda_{t-1} + \beta \Delta M_{st} + \Delta \varepsilon_{st} \\ & = \lambda + \beta \Delta M_{st} + \Delta \varepsilon_{st} \end{aligned}\] Given the multiple contrasts from using more than two state-level changes, it is now possible to introduce controls for other state-specific factors; it is possible to use parametric controls for trends at the state level, as in \[\Delta y_{st} = \lambda + \beta \Delta M_{st} + X_{st} \delta + \Delta \varepsilon_{st}\] Some notes on the control variables:

  • With extra controls, check whether \(\beta\) changes or not;
  • The more control variables you add, the more detail you get on the treatment effect;
  • Give me more interpretation

You can also do dynamic DiD by including lags and leads.

There are more implicit assumptions about this model, check the lecture slides and reading.

Question: how do we do DiD with multiple states and control variables in Stata?
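
A sketch under the same naming assumptions, now with state and year fixed effects, a continuous treatment M, and controls x1 and x2 (all placeholders):

    * two-way fixed-effects version of the multi-state DiD
    xtset state year
    xtreg y c.M x1 x2 i.year, fe vce(cluster state)

The fe option absorbs the state effects \(\gamma_s\), the year dummies pick up \(\lambda_t\), and the coefficient on M estimates \(\beta\).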

Logit Models for Binary Data

Motivation

Why do we need the logit model? - Let EK explain it to us :)

Bernoulli Distribution

If X is a random variable with this distribution, we have: \[Pr(X = 1) = p; \ Pr(X = 0) = 1 - p;\] The probability mass function f of this distribution, over possible outcomes k, is: \[f(k;p) = \begin{cases} p, &if \ k = 1 \\ 1 - p, &if \ k = 0 \end{cases}\]

Binomial Distribution

In general, if the random variable X follows the binomial distribution with parameters \(n \in \mathbb{N}\) and \(p \in [0,1]\), we write \(X \sim B(n, p)\). The probability of getting exactly k successes in n trials is given by the probability mass function: \[Pr(k; n, p) = Pr(X = k) = \binom{n}{k}p^k(1-p)^{n-k}; \ where \binom{n}{k} = \frac{n!}{k!(n-k)!}\]

Now, we view \(y_i\) as a realization of a random variable \(Y_i\) that takes the values \(0, 1, \cdots, n_i\). If the \(n_i\) observations in each group are independent, and they all have the same probability \(\pi_i\) of having the attribute of interest, then the distribution of \(Y_i\) is binomial with parameters \(\pi_i \ and \ n_i\), which we write \[Y_i \sim B(n_i, \pi_i)\] The probability distribution function of \(Y_i\) is given by \(Pr(Y_i = y_i) = \binom{n_i}{y_i} \pi_i^{y_i}(1-\pi_i)^{n_i - y_i}\)

However, we don’t know the probability \(\pi_i\) !!

Example: MLE

Again, we assume it exists.

The Logit Transformation

The next step in defining a model for our data concerns the systematic structure. We would like to have the probability \(\pi_i\) depend on a vector of observed covariates \(x_i\). The simplest idea would be to let \(\pi_i\) be a linear function of the covariates, say \[\pi_i = x_i'\beta\] where \(\beta\) is a vector of regression coefficients. This model is sometimes also called the linear probability model. One problem with this model is that the probability \(\pi_i\) on the LHS has to be between 0 and 1, but the linear predictor \(x_i'\beta\) on the RHS can take any real value, so there is no guarantee that the predicted values will be in the correct range unless complex restrictions are imposed on the coefficients.

A simple solution to this problem is to transform the probability to remove the range restrictions, and model the transformation as a linear function of the covariates. We do this in two steps.

  1. First, we move from the probability \(\pi\) to the odds, (example \(\frac{1}{1000}; \frac{999}{1000}\)): \[odds_i = \frac{\pi_i}{1-\pi_i}\]
  2. Second, we take logarithms, calculating the \(logit\) or log-odds: \[\eta_i = logit(\pi_i) = log (\frac{\pi_i}{1-\pi_i}), where \ \eta \in (-\infty, + \infty)\]

Good, good, problem solved ! Happy :)

Now, let’s do some simple algebra and begin to use MLE.

Solving for \(\pi_i\) in above equation, it gives \[\pi_i = logit^{-1}(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}\]

In the lecture, the professor gives a different interpretation: \[\pi_{1i} \equiv Pr(y_i = 1 | x_i) = F(x_i' \beta) \] He calls this \(F(\cdot)\) the link function.

We are now in a position to define the logistic regression model, by assuming that the logit of the probability \(\pi_i\), rather than the probability itself, follows a linear model (caution: the logit): \[logit(\pi_i) = x_i'\beta \ \ \Rightarrow \ \ log(\frac{\pi_i}{1-\pi_i}) = x_i' \beta \ \ \Rightarrow \frac{\pi_i}{1-\pi_i} = e^{x_i'\beta} \ \ \Rightarrow \pi_i = F(x_i'\beta) = \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\]

In this case marginal effect can be obtained as \[\frac{d\pi_i}{dx_{ij}} = \Big[ \frac{e^{x_i'\beta}}{(1+e^{x_i'\beta})^2} \Big]\beta_j = \beta_j \pi_i(1-\pi_i) \]

The log-likelihood function is \[\mathcal{L}_N(\beta) = \sum_{i=1}^N \{y_i \ln F(x_i'\beta) + (1-y_i) \ln\big[1-F(x_i'\beta)\big]\}\]

Another popular link function is the standard normal \(CDF\), leading to the probit model:

\[\pi_i = F(x_i'\beta) = \Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2} du\]

Question: what should the log-likelihood function \(\mathcal{L}_N(\beta)\) look like for the probit model?
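
A sketch of the answer: substituting the probit link \(F(\cdot) = \Phi(\cdot)\) into the general binary form above gives \[\mathcal{L}_N(\beta) = \sum_{i=1}^N \Big\{ y_i \ln \Phi(x_i'\beta) + (1-y_i) \ln\big[1-\Phi(x_i'\beta)\big] \Big\}\]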

Reflection on MLE again

Let me give a discourse on MLE again and make sure you truly understand it.

Philosophy of modeling the world (Epistemology)

We use the different language to talk the same things again and again for thousands of years - by Heidegger;

Example

  • Plato: from preferences to choices; from the pdf to estimation and inference - MLE;

  • Aristotle: from choices to preferences; from data to estimation and inference - bootstrap; machine learning;

Notes for Study Pack II (for lectures 3-6)

In this section, I will go through the lecture notes and give full derivations of the main models. This section serves two purposes:

For every model of discrete data regression, I will give an example and also Stata Code to show the following procedures:

  1. Assumptions

  2. Model

  3. Data - no target data without theory (GDP, e.g. NeoCM without government - don’t use NIPA)

  4. Goodness of fit - criterion

  5. Diagnostic

  6. Inference or Interpretation

  7. Consistency !

Okay, let’s kick it off.

Introduction

It is often of interest to know why economic agents choose certain actions when faced with discrete choices among a set of alternatives. For example: i) Why do some individuals default on their mortgages and others do not? ii) Why do some consumers prefer one brand of bottled water over another?

In all cases the main interest lies in determining the factors that influence the probability of choosing one of the alternatives. With qualitative response variables, it makes more sense to analyse the determinants of choosing the various alternatives using probability models.

  1. What is the effect of increasing the mortgage rate by one percentage point on the probability of mortgage default?

  2. How does the price of Brand B affect consumers’ choice of Brand A?

  3. What is the effect of having one more friend on the probability of being very happy?

Probability models involve nonlinear regressions and are often estimated by the method of maximum likelihood estimation.

Review of Maximum Likelihood Estimation

Please read the L14020 notes, and understand the main assumptions of MLE, the properties of MLE, and also the asymptotic tests based on MLE.

Make sure you understand, rather than memorize, the equations!!

As all models for discrete variables use MLE, tests such as the Wald, likelihood ratio, and Lagrange multiplier tests will be used very often.

Model selection based on MLE can be conducted using the so-called information criteria. All else equal, the model that minimises an information criterion is preferred. Two information criteria are widely used: the Bayesian information criterion (BIC) and the Akaike information criterion (AIC). BIC penalises an increase in the number of independent variables, k: \[\begin{aligned} BIC & = -2 \ln L(\beta) + \ln(N)\,k \\ AIC & = -2 \ln L(\beta) + 2k \end{aligned}\] Although it is best to use them in conjunction, BIC is preferred most of the time.
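
In Stata, both criteria are reported after any model fit by ML; a minimal sketch with placeholder names y, x1, x2:

    * fit a model by ML, then report AIC and BIC
    logit y x1 x2
    estat ic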

Binary choice model

In the binary choice model, economic agents face two alternatives, often coded as 0 and 1 for convenience. The aim is to model the probability of choosing one of the alternatives. As the sum of the probabilities over the two alternatives is 1, we do not need to explicitly model the probability of the second alternative. The alternative or category that is not explicitly modelled is called the base group. The choice of the base group does not affect the outcome of the econometric analysis. The aim is to model the probability of default: \[\pi_{1i} = Pr(y_i = 1 \mid x_i) = F(x_i' \beta)\]

When dealing with continuous variables, we need a probability density function. For example, given \(y \sim N(X\beta, \sigma^2 I_T)\), we have \[f(y; \beta, \sigma^2) = (2\pi\sigma^2)^{-T/2} e^{[-\frac{1}{2\sigma^2}(y-X\beta)'(y - X\beta)]}\] In this case, we multiply all the marginal densities to get the joint density, and then take logs and sum.

For binary choice, the probability mass function f of this distribution, over possible outcomes k, is: \[f(k;p) = \begin{cases} p, &if \ k = 1 \\ 1 - p, &if \ k = 0 \end{cases}\] or, equivalently, \[f(y_i;p) = p^{y_i}(1-p)^{1-y_i} = \begin{cases} p & if \ y_i = 1 \\ 1-p & if \ y_i = 0 \end{cases}\] In this case, we just take the log of the probability and sum: \[\begin{aligned} \mathcal{L}_N(\beta) & = \sum_{i=1}^N \{y_i \ln \pi_i + (1-y_i) \ln[1-\pi_i]\} \\ & = \sum_{i=1}^N \{y_i \ln F(x_i'\beta) + (1-y_i) \ln\big[1-F(x_i'\beta)\big]\} \\ & = \sum_{i=1}^N\sum_{j=0}^1 I_j(y_i) \ln \pi_{ij} \ \ , where \ I_j(y_i) = 1 \ if \ y_i = j; \ 0 \ otherwise \end{aligned}\] The function \(F(\cdot)\) in the above likelihood function is called the link function. We have three choices: the identity function, the logistic CDF, and the standard normal CDF. I will only present the logit and probit link functions, as these two are used most of the time.

Logit model

\[\pi_i = F(x_i'\beta) = \frac{e^{x_i'\beta}}{1+e^{x_i'\beta}}\] In this case marginal effect can be obtained as \[\frac{d\pi_i}{dx_{ij}} = \Big[ \frac{e^{x_i'\beta}}{(1+e^{x_i'\beta})^2} \Big]\beta_j = \beta_j \pi_i(1-\pi_i) \]

In logit model, \(e^{\hat{\beta_k}}\) can be interpreted by saying that a unit increase in \(x_k\) changes the odds ratio by a factor of \(e^{\hat{\beta_k}}\), keeping all else constant. I will give more examples on interpretation in the following case studies.

Another popular link function is the standard normal \(CDF\), leading to the probit model:

\[\pi_i = F(x_i'\beta) = \Phi(x_i'\beta) = \int_{-\infty}^{x_i'\beta} \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}u^2} du\] For the probit model, the expression for the marginal effect is \[\frac{\partial \hat{\pi}_{pi}}{\partial x_{ki}} = [\phi(z_i)]\hat{\beta}_k; \ \ where \ \ \phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2}\]

Goodness of fit:

Pseudo R-squared

With logistic or probit regression models, an equivalent to the R-squared statistic used in linear models estimated by OLS does not exist. However, there are “pseudo” R-squared values that can be used to evaluate model fit. Like standard R-squared values, these “pseudo” R-squareds range from 0 to 1 (though never achieving 0 or 1), with higher values indicating better model fit. However, they cannot be interpreted as one would interpret an OLS R-squared (e.g. as the % of explained variability in the dependent variable, or the square of the correlation between actual and predicted values).

Perhaps the most popular pseudo R-squared is McFadden’s pseudo R-squared which is defined as 1 minus the ratio of the log likelihood of the full model and the log likelihood of the intercept only model:

\[R_{McF}^2 = 1 - \frac{\ln L(\beta_0, \beta_1, \cdots \beta_k)}{\ln L(\beta_0)}\]

Sensitivity and specificity

A more useful measure of goodness of fit, summarising the predictive ability of the model, is a 2x2 classification table of the success or failure of predictions. By default, a prediction is deemed “a success” (“a failure”) if the predicted probability is greater (less) than 0.5 and the actual outcome is y = 1 (y = 0). But the researcher can set this cut-off probability to any value between 0 and 1, depending on what makes more sense in the situation at hand. A Stata sketch follows the two definitions below.

  • The percentage of correctly predicted y=1 observations is called the model’s sensitivity.
  • The percentage of correctly predicted y=0 observations is called the model’s specificity.
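
A minimal sketch of the classification table after a logit fit, using Stata’s built-in postestimation command; y, x1, and x2 are placeholder names:

    quietly logit y x1 x2
    estat classification               // default cutoff of 0.5
    estat classification, cutoff(0.3)  // researcher-chosen cutoff

The output reports the model’s sensitivity and specificity directly.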

Hosmer-Lemeshow goodness-of-fit test

Intuitively, this test compares the deviation of each observation from its predicted or expected value. If these deviations are large, one would conclude that the model fit is not good enough. Here we discuss the basic steps of one version of the Hosmer-Lemeshow goodness-of-fit test, and provide an empirical example later on.

The procedure for calculating this test is given in the lecture slides, so I will not present it here.
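
In Stata, a version of the test is available after a logit fit; a minimal sketch, again with placeholder names:

    quietly logit y x1 x2
    estat gof, group(10) table   // Hosmer-Lemeshow test with 10 groups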

Diagnostic

There is not much material in our lecture slides, but I will show it in the case studies.

Outliers

Influence

Leverage

Case Studies (Empirical examples)

In the following empirical examples, I will give the analysis step by step, following:

  1. Assumptions - Bernoulli Distribution

  2. Model - Logit and Probit (MLE)

  3. Data - data exploration

  4. Goodness of fit - Pseudo R-squared, sensitivity and specificity; Hosmer-Lemeshow; Model selection (BIC and AIC)

  5. Diagnostic - outlier and influence

  6. Inference or Interpretation

  7. Consistency !

Case I - Labour Force Participation

Our first case is from Mroz’s (1987) study of the labor force participation of women, using data from the 1976 Panel Study of Income Dynamics. You can download data binlfp4.dta from this link. In this case, we want to find:

  • What determines the probability of labour force participation.

Here is the explanation of the main variables. The dependent variable in the above list is lfp, which takes the value 1 or 0, indicating whether or not the agent is in the paid labor force. One of the independent variables, agecat, is ordinal.

Assumptions

Most empirical work just assumes that binary data follow a Bernoulli distribution, without checking. This is because the assumption is not very restrictive: based on my knowledge, most binary data do follow a Bernoulli distribution. So, we will not check this assumption.

Model

Now, we run the logit model.
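
A sketch of the fit, assuming the specification used in the diagnostics code later in these notes:

    logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog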

And then we run the probit model.
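
The corresponding probit fit, under the same assumed specification:

    probit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog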

You can see that the coefficients are different; the following comparison makes this clear.
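
A sketch of one way to produce the comparison, storing each fit and tabulating them side by side:

    quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    estimates store logit
    quietly probit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    estimates store probit
    estimates table logit probit, b(%9.3f) t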

Hypothesis testing

Most often, we are interested in testing \(H_0:\beta_k = 0\). For example, consider the results for variables k5 and wc from the logit output:

    Having young children has a significant effect on the probability of being in the labor force
    (z = -7.25, p < 0.01 for a two-tailed test)
    The effect of the wife attending college is significant at the 0.01 level
    

Be careful: the coefficient in the Stata output doesn’t tell us the scale of the effect; it only tells us whether the effect is significant or not, and whether it is positive or negative.

The default test in Stata for the logit model is the Wald test; we can also use the LR test. To test a single coefficient, we begin by fitting the full model and storing the results. Then, we fit the model without k5 and store the results. Next, we run lrtest. A sketch of the code:
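
    * sketch: LR test for k5, assuming the specification used elsewhere in these notes
    quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    estimates store full
    quietly logit lfp k618 i.agecat i.wc i.hc lwg inc, nolog
    estimates store dropk5
    lrtest full dropk5

The LR test shows the following: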

    The effect of having young children is significant at the 0.01 level
    (LR Chi-square = 62.55, df = 1, p < 0.01)
    

We can also test multiple coefficients by using test or lrtest, as sketched below.
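
A sketch of both joint tests of the husband’s and wife’s education coefficients (wc and hc enter as factor variables, so their indicator terms are tested):

    * Wald test of the education coefficients
    quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    test 1.wc 1.hc
    * LR test: refit without the education variables and compare
    estimates store full
    quietly logit lfp k5 k618 i.agecat lwg inc, nolog
    estimates store noeduc
    lrtest full noeduc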

From the Wald test, we conclude the following:

    We reject the hypothesis that the coefficients of the husband's and the wife's
    education are simultaneously equal to 0 (chi-square = 17.83, df = 2, p < 0.01)
    

From the LR test, we conclude the following:

    We reject the hypothesis that the coefficients of the husband's and the wife's
    education are simultaneously equal to 0 (LR chi-square = 18.68, df = 2, p < 0.01)

Now, let’s compare the LR and Wald tests. Although the LR and Wald tests are asymptotically equivalent, their values differ in finite samples. In our example, we can compare them in the following table. Please review the L14020 notes to understand the comparison between the LR test and the Wald test.

Measure of fit

A scalar measure of fit can be useful when comparing competing models. Information criteria such as BIC and AIC can be used to select among models and are often very useful. We can use Hosmer-Lemeshow to check the goodness of fit.

Diagnostic of model

An index plot is an easy way to examine residuals by plotting them against the observation number.

    quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog 
    predict rstd, rstandard 
    label var rstd "Standardized Residual" 
    sort inc 
    generate index = _n 
    label var index "Observation Number" 
    graph twoway scatter rstd index, msymbol(Oh) mcolor(black) xlabel(0(200)800) ///
    ylabel(-4(2)4, grid gmin gmax) yline(0, lcolor(black))

If you want to identify the outliers, you can label them with their observation numbers using the following code.

    graph twoway scatter rstd index, msymbol(none) mlabel(index) mlabposition(0) xlabel(0(200)800) ///
    ylabel(-4(2)4, grid gmin gmax) yline(0, lcolor(black))

Influential Cases

We can also identify influential cases.

    predict deltabeta, dbeta 
    label var deltabeta "Pregibon's influence statistic"
    graph twoway scatter deltabeta index, msymbol(none) mlabel(index) mlabposition(0) ///
    xlabel(0(200)800) xtitle("Observation Number") ylabel(0(0.1)0.3, grid gmin gmax)

This plot shows that cases 142, 309, and 752 merit further examination.

Interpretation

Okay, once we have the model, we need to interpret the coefficients. The logit and probit models are not linear; that’s why we need to calculate marginal effects.

Effects for the logit model (but not the probit model) can be interpreted in terms of changes in the odds. In the logit model, the log odds are a linear combination of the x’s and \(\beta's\). For example, consider a model with three independent variables: \[\ln \big\{ \frac{Pr(y=1|x)}{Pr(y=0|x)} \big\} = \ln \Omega (x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3 x_3\] This leads to a model that is multiplicative instead of linear, but in which the outcome is the odds: \[\Omega (x) = e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3}\] Now, if we want to check how the odds change when we increase variable \(x_3\) by one unit, we have \[\frac{\Omega(x, x_3+1)}{\Omega(x,x_3)} = \frac{e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3} e^{\beta_3}}{e^{\beta_0} e^{\beta_1x_1}e^{\beta_2x_2}e^{\beta_3x_3}} = e^{\beta_3}\] Accordingly, we can interpret the exponential of the logit coefficients as follows:

  • For a unit change in \(x_k\), the odds are expected to change by a factor of exp(\(\beta_k\)), holding other variables constant.
  • For a standard deviation change in \(x_k\), the odds are expected to change by a factor of exp(\(\beta_k s_k\)), holding all other variables constant.

In our example, once we run the logit regression, we can list the coefficients and also the odds ratios. Here is the code:

    logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog 
    listcoef, help 

In the above table, b means \(\beta\), \(e^b\) gives the odds ratio, and \(e^{bStdX}\) gives the change in odds for a SD increase in X. Using the coefficient for lwg in the column \(e^{bStdX}\), we can say:

  • For a standard deviation increase in the log of the wife’s expected wages, the odds of being in the labor force are 1.43 times greater, holding all other variables constant.

Marginal effects: Changes in Probabilities

We will not repeat the formula for the marginal effect of this model; I will jump directly to the code and interpretation. We have two popular marginal effects: marginal effects at the average value (MEA) and average marginal effects (AME). Cameron and Trivedi suggest that ‘it is best to use AME over MEA’.
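
A sketch of both, using Stata’s official margins command (SPost’s mchange, used later in these notes, wraps similar calculations):

    quietly logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    margins, dydx(*)           // AME: effects averaged over the sample
    margins, dydx(*) atmeans   // effects evaluated at the averages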

AMEs for continuous variables

From the above table, we can say

  • Holding other variables at their observed values, increasing income by one standard deviation, roughly $12,000, decreases the probability of labor force participation on average by 0.09.

AMEs for factor variables

From the above table, we can say

  • On average, having attended college increases a woman’s probability of labor force participation from 0.53 to 0.69, a change of 0.16 (\(p < 0.001\)), while the effect of the husband having attended college is not significant.

Graphing predicted probabilities

With a continuous independent variable, you can plot the predicted probabilities over the range of the variable. For example, to examine the effects of inc, we might plot the predicted probability of labor force participation as inc changes, holding other variables at fixed values.

The following code gives the graph

    quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog 
    estimates store base
    margins, at(inc=(0(10)100)) atmeans noatlegend 
    marginsplot

An effective way to show the effects of two variables is to graph predictions at various levels of one variable as the other variable changes. This can be done with either marginsplot or mgen.

The following code gives the graph

    quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog 
    estimates store base
    margins agecat, at(inc=(0(10)100)) atmeans noatlegend
    marginsplot, noci legend(cols(3))

Adding power terms and plotting predictions

One interesting plot is to compare the predictive power when we add power terms.

    quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg inc, nolog
    mgen, predlabel(linear) atmeans at(inc=(0(10)100)) stub(_lin) 
    quiet logit lfp k5 k618 i.agecat i.wc i.hc lwg c.inc##c.inc, nolog
    mgen, predlabel(quadratic) atmeans at(inc=(0(10)100)) stub(_quad)
    graph twoway connected _linpr1 _quadpr1 _quadinc, title("Comparing income specifications") ///
    caption("Other variables at their means") ytitle("Pr(In Labor Force)") ///
    ylab(0(0.25)1, grid gmin gmax) legend(cols(3))

You can also add confidence intervals to the graph with the following code:

    * run the regression and store the estimates first
    graph twoway (rarea _linul _linll _lininc, col(gs12)) ///
    (connected _linpr1 _quadpr1 _quadinc), title("Comparing income specifications") ///
    caption("Other variables at their means") ytitle("Pr(In Labor Force)") ///
    ylab(0(0.25)1, grid gmin gmax) legend(cols(3))

Okay, this is the end of the binary model.

Multinomial logit models

For multinomial models, we use the same method as in the binary model to get the transformation. The multinomial logit model (MNLM) can be written as \[\ln \Omega_{m\mid b}(x) = \ln \frac{Pr(y=m \mid x)}{Pr(y=b \mid x)} = x \beta_{m|b} \ \ for \ m = 1 \ to \ J \] where b is the base outcome, sometimes called the reference category. With these transformations, we can calculate the relevant probabilities conditional on the covariates \(x\): \[Pr(y=m|x) = \frac{e^{x\beta_{m|b}}}{\sum_{j=1}^J e^{x\beta_{j|b}}}, \ \ where \ J \ is \ the \ number \ of \ categories\]

For example, suppose that you have three outcomes and fit the model with alternative 1 as the base, where you would obtain estimates \(\hat{\beta}_{2|1} \ and \ \hat{\beta}_{3|1}, \ with \ \beta_{1|1} = 0\). The probability equation for the second outcome is \[Pr(y=2|x) = \frac{exp(x\beta_{2|1})}{\sum_{j=1}^3exp(x\beta_{j|1})} \] Okay, let’s give the proper definitions of \(m\) and \(J\). There are \(J\) alternatives, and the dependent variable y is defined to take value \(m\) if the \(m\)th alternative is taken, \(m = 1, \cdots, J\). Define the probability that alternative \(m\) is chosen as \[p_m = Pr(y=m), \ m = 1, \cdots J \] Introduce \(J\) binary variables for each observation y, \[y_m =\begin{cases} 1 & if \ y = m, \\ 0 & if \ y \neq m. \end{cases}\] Thus \(y_m\) equals one if alternative \(m\) is the observed outcome and the remaining \(y_k\) equal zero, so for each observation on \(y\) exactly one of \(y_1, y_2, \cdots , y_J\) will be nonzero. The multinomial density for one observation can then be conveniently written as \[f(y) = p_1^{y_1} \times \cdots \times p_J^{y_J} = \prod_{j=1}^J p_j^{y_j} \ \ \ \ \ \ \ \ \ \ \ \ \ (D1)\] This means we let \(y \in \{1, \cdots, J\}\) be a categorical random variable with \(J\) states, and \(p_m = Pr(y=m), \ m = 1, \cdots, J\).

For regression models, introduce a subscript \(i\) for the \(i\)th individual and regressors \(x_i\). Specify a model for the probability that individual \(i\) chooses the \(m\)th alternative: \[p_{im} = Pr(y_i = m) = F_m(x_i, \beta), \ m = 1, \cdots J, \ i = 1, \cdots, N \] The functional form for \(F_m\) should be such that the probabilities lie between 0 and 1 and sum over \(m\) to 1, which means the overall probability over the \(J\) states should be 1.

ML Estimation

The multinomial density for one observation is given in (D1). The likelihood function for a sample of N independent observations is then \(L_N = \prod_{i=1}^N \prod_{j=1}^J p_{ij}^{y_{ij}}\), where the subscript \(i\) denotes the \(i\)th of \(N\) individuals and the subscript \(j\) denotes the \(j\)th of J alternatives. The log-likelihood function is \[\mathcal{L} = \ln L_{N} = \sum_{i=1}^N \sum_{j=1}^J y_{ij} \ln p_{ij},\] where \(p_{ij} = F_j(x_i, \beta)\) is a function of the parameters \(\beta\) and the regressors. The marginal effect of a regressor, say \(x_k\), on the probability of y belonging to category \(j\) can be obtained as \[\frac{\partial \hat{\pi}_{ji}}{\partial x_{ki}} = \hat{\pi}_{ji} \big[\beta_k^j - \sum_{j=1}^J p_j \beta_k^j \big] \]

Independence of irrelevant alternatives

The multinomial model, as well as the conditional logit and rank-ordered logit models, makes the assumption known as the independence of irrelevant alternatives (IIA). In the multinomial model, \[\frac{Pr(y=m|x)}{Pr(y=n|x)} = exp\{x(\beta_{m|b} - \beta_{n|b}) \} \] where the odds do not depend on other alternatives that are available. In this sense, those alternatives are “irrelevant”. What this means is that adding or deleting alternatives does not affect the odds among the remaining alternatives. The Hausman statistic can be used to check this assumption. If the null hypothesis of IIA is rejected, inference based on estimates from the multinomial logit model would be invalid. In this case there are two possible options:

  1. Estimate binary logit models for each pair of alternatives. This option is always valid, albeit inefficient.
  2. Estimate multivariate probit models, but make sure that the program you use does not assume IIA. However, this option can be computationally burdensome, especially with a larger number of choices.

Before we do the conditional logit and ordered logit models, let’s have a case study for the multinomial model.

Case study (MNLM)

The 1992 American National Election Study asked respondents to indicate their political party using one of eight categories. We used these to create party with five categories: strong Democrat (1 = StrDem), Democrat (2 = Dem), Independent (3 = Indep), Republican (4 = Rep), and strong Republican (5 = StrRep). Five regressors are included in the model: age, income, race (black or white), gender, and education. You can download the dataset partyid4.dta from this link.

Now, let’s take a look at the data and begin to use the mlogit model.

    tabulate party, miss
    sum age income black female
    tabulate educ, miss
    mlogit party age income i.black i.female i.educ 

We will not present the results of the whole model, but jump to hypothesis testing, goodness of fit, etc. We will come back later to the effects of selecting a different base. Again, as we use MLE, the tests should be Wald, LR, or LM tests.

    quiet mlogit party age income i.black i.female i.educ, nolog
    mlogtest, lr 
    mlogtest, wald   // gives the Wald tests

The results of the LR test, regardless of how they are computed, can be interpreted as follows:

  1. The effect of age on party affiliation is significant at the 0.01 level.
  2. The effect of being female is significant at the 0.1 level but not at the 0.05 level.

We can also test for combining alternatives.

    quiet mlogit party age income i.black i.female i.educ, nolog
    mlogtest, combine 

From these results, we can reject the hypothesis that the categories StrDem and Dem are indistinguishable. Indeed, our results indicate that all the categories are distinguishable.

Independence of irrelevant alternatives

IIA means that adding or deleting alternatives does not affect the odds among the remaining alternatives. Tests of IIA involve comparing the estimated coefficients from the full model to those from a restricted model that excludes at least one of the alternatives. If the test statistic is significant, the assumption of IIA is rejected, indicating that the multinomial model is inappropriate. We will use the Hausman-McFadden (HM) and Small-Hsiao (SH) tests.

    mlogit party age income i.black i.female i.educ, base(5) 
    mlogtest, hausman 

Based on the results of the tests, we cannot reject IIA.

Measure of fit

As with models for binary and ordinal outcomes, many scalar measures of fit for the MNLM can be computed with the SPost command fitstat, and information criteria can be computed with estat ic. I will not repeat the details here.
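
A minimal sketch with the specification used in this case (fitstat requires SPost to be installed):

    quiet mlogit party age income i.black i.female i.educ, base(5)
    fitstat
    estat ic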

Predicted probabilities with predict

The most basic command for computing probabilities is predict. After fitting the model with mlogit, predicted probabilities for all outcomes within the sample can be calculated with predict.

    quiet mlogit party age income i.black i.female i.educ, base(5) 
    estimate store mlogit 
    predict mnlmSD mnlmD mnlmI mnlmR mnlmSR
    codebook mnlmSD mnlmD mnlmI mnlmR mnlmSR, compact 
    

Now, we run the ordered model and compare the predictions.

    quiet ologit party age income i.black i.female i.educ 
    predict olmSD olmD olmI olmR olmSR 
    codebook olm*, compact 

The following code gives a dot plot comparing the ordered and multinomial models.

    label var olmSD "ologit" 
    label var mnlmSD "mlogit"
    dotplot olmSD mnlmSD, ylabel(0(0.2)0.8, grid) ytitle(Pr(Strong Democrat)) 

If we look at the middle category of Independent, however, things look quite different, reflecting the abrupt truncation of the distribution of predictions for middle categories that is often found with the OLM.

    label var olmI "ologit"
    label var mnlmI "mlogit"
    dotplot olmI mnlmI, ylabel(0(0.2)0.3, grid) ytitle(Pr(Independent))

Marginal effect

Average marginal effects are a quick and valuable way to assess the effects of all the independent variables in your model. We will calculate the marginal effects and give interpretations.

    quiet mlogit party age income i.black i.female i.educ, base(5) 
    estimate store mlogit 
    mchange, amount(sd) brief width(8) 

Even for this relatively simple model, and looking only at a single amount of change for each variable, there is a lot of information to digest. To make these results simpler to interpret, we plot the changes with mchangeplot. We begin by looking at the average discrete changes for standard deviation increases in age and income.

    mchangeplot age income, symbols(D d i r R) min(-0.06) max(0.06) gap(0.02) ///
    sig(0.05) xtitle(Average Discrete Change) ysize(1.3) scale(2.1)

Although we probably would not include this graph in a paper, we use it to help describe the effects:

  • On average, a standard deviation increase in age, about 17 years, increases the probability of being a strong Republican by 0.03 and of being a strong Democrat by 0.05. The probabilities of the other affiliations all decrease by roughly 0.03. All effects are significant at the 0.01 level.
  • On average, a standard deviation increase in income, roughly $28,000, significantly increases the probability of being a Republican by 0.04 and a strong Republican by 0.02, while significantly decreasing the probability of being a strong Democrat by 0.04.

The greater effect of race compared with gender on party affiliation is shown with a plot of their average discrete changes. We use the following command to create the graph:

     mchangeplot black female, symbols(D d i r R) min(-0.3) max(0.3) gap(0.1) ///
     sig(0.05) xtitle(Average Discrete Change) ysize(1.3) scale(2.1)

On average, for people similar on other characteristics, being black increases the probability of being a strong Democrat by nearly 0.3 compared with someone who is white. Except for an increase of 0.07 in Democratic affiliation, the effects of gender are not significant.

Graphing predicted probabilities

It would be very helpful if we could plot the trend of the probability for one variable.

    quiet mlogit party age income i.black i.female i.educ, base(5) vsquish 
    mgen, atmeans at(income=(0(10)100)) stub(mnlmI) replace 
    label var mnlmIpr1 "Strong Dem"
    label var mnlmIpr2 "Democrat"
    label var mnlmIpr3 "Independent"
    label var mnlmIpr4 "Republican" 
    label var mnlmIpr5 "Strong Rep" 
    graph twoway connected mnlmIpr1 mnlmIpr2 mnlmIpr3 mnlmIpr4 mnlmIpr5 mnlmIincome, ///
    title("Multinomial logit model: other variables held at their means", ///
    pos(11) size(medium)) ytitle(Probability of party affiliation) ///
    ylab(0(0.1)0.4, grid gmax gmin) msym(O Oh dh sh s) mcol(gs1 gs5 gs8 gs5 gs1) ///
    lpat(solid dash shortdash dash solid) lco(gs1 gs5 gs8 gs5 gs1) ///
    legend(rows(2))

Project Task III

Before you run the code for this section, please read pages 9-11 of this document, and make sure you have installed SPost13 in your Stata.

In the following analysis, we will focus on the data from 2013 and use an ordinal model (ORM) to answer the question.

The statistical model

A latent-variable model

The ORM is commonly presented as a latent-variable model. Defining \(y^*\) as a latent variable ranging from \(-\infty\) to \(\infty\), the structural model is \[y_i^* = x_i \beta + \varepsilon_i\] where \(i\) is the observation and \(\varepsilon\) is a random error. The underlying, continuous latent variable can be thought of as the propensity to export more intensively. The observed categories are tied to the latent variable by the measurement model \[y_{it} = \begin{cases} 0 & if \ expint_{it} = 0 \\ 1 & if \ 0 < expint_{it} < 0.25 \\ 2 & if \ 0.25 \leq expint_{it} < 1 \\ 3 & if \ expint_{it} = 1 \end{cases}\] Here, expint plays the role of the latent variable, and \(y_{it}\) is the category variable we have to create.

    drop if year != 2013 
    gen expcat = 0 if expint == 0
    replace expcat = 1 if expint > 0 & expint < 0.25
    replace expcat = 2 if expint >= 0.25 & expint < 1 
    replace expcat = 3 if expint == 1 
    table expcat 
    label var expcat "category variable based on exporting intensity range" 
    codebook company manage expint markets private size lnage expcat service, compact

Now, we can use ologit or oprobit to estimate the model. We put all the relevant variables in as independent variables, just to take a look at what we get.

    ologit expcat manage private lnwage_skilled lnwage_unskilled lncost profitability ///
    size service lnage, nolog 
    estimates store ologit
    oprobit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability size service lnage, nolog 
    estimates store oprobit
    estimates table ologit oprobit, b(%9.3f) t varlabel varwidth(30)

From the above results, we find that size and age are not significant determinants of the exporting categories. Also, the coefficients are quite different when we compare the ologit and oprobit models. The following picture gives the result. Now, we try to add quadratic terms and some interactions to see whether the main variables become significant factors or not.

    ologit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability c.size##c.size service c.lnage##c.lnage, nolog
    estimates store ologit1
    oprobit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability c.size##c.size service c.lnage##c.lnage, nolog
    estimates store oprobit1

Once we add the quadratic terms, the main factors - size and age - become significant.

Testing the parallel regression assumption using oparallel

    ssc install oparallel
    quiet ologit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability c.size##c.size service c.lnage##c.lnage, nolog
    oparallel, ic

Based on the test, we have to reject the assumption of parallel regressions, which means we cannot use ologit for this task. We should use a generalised ologit or a multinomial logit model to answer the question (we can also use gologit2 to do the ordered analysis).

Before we do the multinomial logit model, it’s better to look at the correlation table; a quick check is sketched below. As markets has a high correlation with expint, it’s better not to put it in as an independent variable in the model.
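
A minimal sketch of the correlation check, using the variables listed in the codebook call above:

    pwcorr expint markets manage private size lnage, sig

With that checked, let’s run the multinomial model to see what we get.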

    mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability size service lnage
    mlogtest, hausman 

Four tests of IIA are reported. The first two show that we cannot reject that these two categories are independent of the others. However, the last two show that we should reject IIA. If we remove service, the results are quite different.

    mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability size lnage
    mlogtest, hausman

We will give the analysis with the full set of variables, just check the predictive power, and later use model selection to find the best one.

The following shows how the probability of exporting depends on cost.
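
A sketch of how such a plot could be produced with mgen, mirroring the earlier graphs; the lncost grid 0(1)10 is a placeholder and should be adjusted to the observed range of lncost:

    quiet mlogit expcat manage private lnwage_skilled lnwage_unskilled lncost ///
    profitability size service lnage
    mgen, atmeans at(lncost=(0(1)10)) stub(cst) replace
    label var cstpr1 "expint = 0"
    label var cstpr2 "0 < expint < 0.25"
    label var cstpr3 "0.25 <= expint < 1"
    label var cstpr4 "expint = 1"
    graph twoway connected cstpr1 cstpr2 cstpr3 cstpr4 cstlncost, ///
    ytitle(Probability of export category) legend(rows(2))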