Limited Dependent Variable Models
1 Introduction and Motivation
Chapters 5 and 10 have shown various uses of dummy variables to capture numerically the information in qualitative variables – for example, day-of-the-week effects, gender, credit ratings, etc. When a dummy is used as an explanatory variable in a regression model, this usually does not give rise to any particular problems (so long as one is careful to avoid the dummy variable trap – see Chapter 10). However, there are many situations in financial research where it is the explained variable, rather than one or more of the explanatory variables, that is qualitative. The qualitative information would then be coded as a dummy variable; such a dependent variable is referred to as a discrete choice variable and needs to be treated differently. The term refers to any problem where the values that the dependent variable may take are limited to certain integers (e.g., \(0\), \(1\), \(2\), \(3\), \(4\)), including the case where it is binary (taking only the values \(0\) or \(1\)), which is then known as a binary choice variable.
Discrete choice variables are one set from among what are known more generally as limited dependent variables, since the values they can take are limited to only certain integers. Another class of limited dependent variables arises where the data that we see are censored or truncated in some way – in other words, we can only observe the true values for part of the distribution, while for the remainder above or below some fixed threshold, the true values remain latent. We will return to censored and truncated series – and the differences between them – later in the chapter. There are numerous examples of instances where the dependent variable may arise from a binary choice, for example where we want to model:
- Why firms choose to list their shares on the NASDAQ rather than the NYSE
- Why some stocks pay dividends while others do not
- What factors affect whether countries default on their sovereign debt
- Why some firms choose to issue new stock to finance an expansion while others issue bonds
- Why some firms choose to engage in stock splits while others do not.
It is fairly easy to see in all these cases that the appropriate form for the dependent variable would be a \(0-1\) dummy variable since there are only two possible outcomes. There are, of course, also situations where it would be more useful to allow the dependent variable to take on other values, but these will be considered later in Section 12.9. We will first examine a simple and obvious, but unfortunately flawed, method for dealing with binary dependent variables, known as the linear probability model.
2 The Linear Probability Model
The linear probability model (LPM) is by far the simplest way of dealing with binary dependent variables, and it is based on an assumption that the probability of an event occurring, \(P_i\), is linearly related to a set of explanatory variables \(x_{2i}, x_{3i}, \ldots, x_{ki}\) \[P_i = p(y_i=1)= \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i}+\cdots+\beta_k x_{ki}+u_i, \quad (12.1)\]
The actual probabilities cannot be observed, so we would estimate a model where the outcomes, \(y_i\) (the series of zeros and ones), would be the dependent variable. This is then a linear regression model and would be estimated by OLS. The set of explanatory variables could include either quantitative variables or dummies or both. The fitted values from this regression are the estimated probabilities for \(y_i = 1\) for each observation \(i\). The slope estimates for the linear probability model can be interpreted as the change in the probability that the dependent variable will equal 1 for a one-unit change in a given explanatory variable, holding the effect of all other explanatory variables fixed. Suppose, for example, that we wanted to model the probability that a firm \(i\) will pay a dividend \((y_i = 1)\) as a function of its market capitalisation (\(x_{2i}\), measured in millions of US dollars), and we fit the following line: \[\hat P_i = -0.3 +0.012 x_{2i}, (12.2)\] where \(\hat P_i\) denotes the fitted or estimated probability for firm \(i\). This model suggests that for every $1m increase in size, the probability that the firm will pay a dividend increases by 0.012 (or 1.2%). A firm whose stock is valued at $50m will have a \(−0.3 + 0.012 \cdot 50 = 0.3\) (or 30%) probability of making a dividend payment. Graphically, this situation may be represented as in Figure 12.1.
Figure 12.1 The fatal flaw of the linear probability model
While the linear probability model is simple to estimate and intuitive to interpret, the diagram should immediately signal a problem with this setup. For any firm whose value is less than $25m, the model-predicted probability of dividend payment is negative, while for any firm worth more than about $108m, the predicted probability is greater than one. Clearly, such predictions cannot be allowed to stand, since probabilities should lie within the range \((0,1)\). An obvious solution is to truncate the probabilities at 0 or 1, so that a probability of −0.3, say, would be set to zero, and a probability of, say, 1.2 would be set to 1. However, there are at least two reasons why this is still not adequate:
- The process of truncation will result in too many observations for which the estimated probabilities are exactly zero or one.
- More importantly, it is simply not plausible to suggest that the firm’s probability of paying a dividend is either exactly zero or exactly one. Are we really certain that very small firms will definitely never pay a dividend and that large firms will always make a payout? Probably not, so a different kind of model is usually used for binary dependent variables – either a logit or a probit specification. These approaches will be discussed in the following sections. But before moving on, it is worth noting that the LPM also suffers from a couple of more standard econometric problems that we have examined in previous chapters. First, since the dependent variable takes only one of two values, for given (fixed in repeated samples) values of the explanatory variables, the disturbance term will also take on only one of two values.1 Consider again equation (12.1). If \(y_i = 1\), then by definition \[u_i = 1-\beta_1 -\beta_2 x_{2i}-\beta_3 x_{3i} -...-\beta_k x_{ki}\] but if \(y_i = 0\), then \[u_i = -\beta_1 -\beta_2 x_{2i}-\beta_3 x_{3i} -...-\beta_k x_{ki}\] Hence the error term cannot plausibly be assumed to be normally distributed. Since \(u_i\) changes systematically with the explanatory variables, the disturbances will also be heteroscedastic. It is therefore essential that heteroscedasticity-robust standard errors are always used in the context of limited dependent variable models; a short code sketch illustrating these points follows this list.
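To make these points concrete, the following sketch (in Python, using the statsmodels package) simulates data loosely in the spirit of the dividend example of equation (12.2), estimates a linear probability model by OLS with heteroscedasticity-robust standard errors, and shows that some fitted ‘probabilities’ fall outside the (0,1) interval. All variable names and parameter values are invented purely for illustration.

```python
# A minimal sketch: fitting a linear probability model (LPM) by OLS on
# simulated data and checking how many fitted "probabilities" fall
# outside (0, 1). Illustrative values only, not real data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500
market_cap = rng.uniform(5, 120, n)              # hypothetical firm size in $m
true_prob = np.clip(-0.3 + 0.012 * market_cap, 0, 1)
pays_dividend = rng.binomial(1, true_prob)       # binary dependent variable

X = sm.add_constant(market_cap)
lpm = sm.OLS(pays_dividend, X).fit(cov_type="HC1")   # robust standard errors
print(lpm.params)                                 # slope ~ change in probability per $1m

fitted = lpm.predict(X)
print("share of fitted values outside (0, 1):",
      np.mean((fitted < 0) | (fitted > 1)))
```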
3 The Logit Model
Both the logit and probit model approaches are able to overcome the limitation of the LPM that it can produce estimated probabilities that are negative or greater than one. They do this by using a function that effectively transforms the regression model so that the fitted values are bounded within the (0,1) interval. Visually, the fitted regression model will appear as an S-shape rather than a straight line, as was the case for the LPM. This is shown in Figure 12.2.
Figure 12.2 The logit model
The logistic function \(F\), which is a function of any random variable \(z\), is \[F(z_i)=\frac{e^{z_i}}{1+e^{z_i}}=\frac{1}{1+e^{-z_i}}, \quad (12.3)\] where \(e\) is the base of the natural logarithm. The model is so called because the function \(F\) is in fact the cumulative logistic distribution function. So under the logit approach, the model estimated is \[P_i=\frac{1}{1+e^{-(\beta_1+\beta_2 x_{2i}+\beta_3 x_{3i}+\cdots+\beta_k x_{ki}+u_i)}}, \quad (12.4)\] where \(P_i\) is the probability that \(y_i = 1\).
With the logistic model, 0 and 1 are asymptotes to the function and thus the probabilities will never actually fall to exactly zero or rise to one, although they may come infinitesimally close. In equation (12.3), as \(z_i\) tends to \(\infty\), \(e^{-z_i}\) tends to zero and \(F(z_i)\) tends to 1; as \(z_i\) tends to \(-\infty\), \(e^{-z_i}\) tends to infinity and \(F(z_i)\) tends to 0. Clearly, this model is not linear (and cannot be made linear by a transformation) and thus is not estimable using OLS. Instead, maximum likelihood is usually used – this is discussed in Section 12.7 and in more detail in the appendix to this chapter.
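The logistic transformation and the estimation of a logit model can be illustrated with the sketch below, which simulates binary outcomes from logistic probabilities as in equation (12.3) and fits the model by maximum likelihood using statsmodels; the data and parameter values are again invented for illustration only.

```python
# A minimal sketch of the logistic transformation in equation (12.3)
# and of estimating a logit model with statsmodels. Simulated data only.
import numpy as np
import statsmodels.api as sm

def logistic(z):
    """Cumulative logistic function F(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 1000
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
z = -0.5 + 1.2 * x2 - 0.8 * x3        # illustrative parameter values
y = rng.binomial(1, logistic(z))      # binary outcomes

X = sm.add_constant(np.column_stack([x2, x3]))
logit_res = sm.Logit(y, X).fit(disp=0)
print(logit_res.summary())
print(logit_res.predict(X)[:5])       # fitted probabilities, all within (0, 1)
```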
4 Using a Logit to Test the Pecking Order Hypothesis
This section examines a study of the pecking order hypothesis due to Helwege and Liang (1996). The theory of firm financing suggests that corporations should use the cheapest methods of financing their activities first (i.e. the sources of funds that require payment of the lowest rates of return to investors) and switch to more expensive methods only when the cheaper sources have been exhausted. This is known as the ‘pecking order hypothesis’, initially proposed by Myers (1984). Differences in the relative cost of the various sources of funds are argued to arise largely from information asymmetries since the firm’s senior managers will know the true riskiness of the business, whereas potential outside investors will not. Hence, all else equal, firms will prefer internal finance and then, if further (external) funding is necessary, the firm’s riskiness will determine the type of funding sought. The more risky the firm is perceived to be, the less accurate will be the pricing of its securities.
Helwege and Liang (1996) examine the pecking order hypothesis in the context of a set of US firms that had been newly listed on the stock market in 1983, with their additional funding decisions being tracked over the 1984–92 period. Such newly listed firms are argued to experience higher rates of growth, and are more likely to require additional external funding than firms which have been stock market listed for many years. They are also more likely to exhibit information asymmetries due to their lack of a track record. The list of initial public offerings (IPOs) came from the Securities Data Corporation and the Securities and Exchange Commission with data obtained from Compustat.
A core objective of the paper is to determine the factors that affect the probability of raising external financing. As such, the dependent variable will be binary – that is, a column of 1s (firm raises funds externally) and 0s (firm does not raise any external funds). Thus OLS would not be appropriate and hence a logit model is used. The explanatory variables are a set that aims to capture the relative degree of information asymmetry and the degree of riskiness of the firm. If the pecking order hypothesis is supported by the data, then firms should be more likely to raise external funding the less internal cash they hold. Hence the variable ‘deficit’ measures (capital expenditures + acquisitions + dividends − earnings). ‘Positive deficit’ is a variable identical to deficit but with any negative deficits (i.e. surpluses) set to zero; ‘surplus’ is equal to the negative of deficit for firms where deficit is negative; ‘positive deficit and operating income’ is an interaction term where the two variables are multiplied together to capture cases where firms have strong investment opportunities but limited access to internal funds; ‘assets’ is used as a measure of firm size; ‘industry asset growth’ is the average rate of growth of assets in that firm’s industry over the 1983–92 period; ‘previous financing’ is a dummy variable equal to 1 for firms that obtained external financing in the previous year. The results from the logit regression are presented in Table 12.1.
Table 12.1 Logit estimation of the probability of external financing
| Variable | (1) | (2) | (3) |
|---|---|---|---|
| Intercept | -0.29 | -0.72 | -0.15 |
| Deficit | 0.04 | 0.02 | |
| Positive deficit | -0.24 | | |
| Surplus | -2.06 | | |
| Positive deficit and operating income | -0.03 | | |
| Assets | 0.0004 | 0.0003 | 0.0004 |
| Industry asset growth | -0.002 | -0.002 | -0.002 |
| Previous financing | 0.79 | | |
Note: a blank cell implies that the particular variable was not included in that regression; t-ratios in parentheses; only figures for all years in the sample are presented. Source: Helwege and Liang (1996). Reprinted with the permission of Elsevier.
The key variable, ‘deficit,’ has a parameter that is not statistically significant and hence the probability of obtaining external financing does not depend on the size of a firm’s cash deficit.3 The parameter on the ‘surplus’ variable has the correct negative sign, indicating that the larger a firm’s surplus, the less likely it is to seek external financing, which provides some limited support for the pecking order hypothesis. Larger firms (with larger total assets) are more likely to use the capital markets, as are firms that have already obtained external financing during the previous year.
5 The Probit Model
Instead of the cumulative logistic function, the cumulative normal distribution is sometimes used to transform the model. This gives rise to the probit model. The function \(F\) in equation (12.3) is replaced by \[F(z_i) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z_i} e^{-\frac{1}{2}z^2}\,dz, \quad (12.5)\] This function is the cumulative distribution function of a standard normally distributed random variable. As for the logistic approach, this function provides a transformation to ensure that the fitted probabilities will lie between zero and one. Also as for the logit model, the marginal impact of a unit change in an explanatory variable, \(x_{4i}\) say, will be given by \(\beta_4 f(z_i)\), where \(\beta_4\) is the parameter attached to \(x_{4i}\) and \(f(z_i)\) is the probability density function evaluated at \(z_i\).
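A corresponding sketch for the probit case is given below, again on simulated data. It also computes the marginal effect of an explanatory variable both ‘by hand’, as the slope parameter times the normal density evaluated at the sample mean of \(z\), and via the package's built-in marginal effects routine; variable names and parameter values are purely illustrative.

```python
# A minimal sketch of a probit model on simulated data, with the
# marginal effect of x2 computed as beta_2 * f(z) at the sample mean
# and compared with statsmodels' get_margeff. Illustrative data only.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
y = rng.binomial(1, norm.cdf(0.3 + 0.9 * x2))   # probit data-generating process

X = sm.add_constant(x2)
probit_res = sm.Probit(y, X).fit(disp=0)

# Marginal effect of x2 evaluated at the mean of z: beta_2 * f(z-bar)
z_bar = X.mean(axis=0) @ probit_res.params
print("manual marginal effect:", probit_res.params[1] * norm.pdf(z_bar))
print(probit_res.get_margeff(at="mean").summary())
```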
6 Choosing Between the Logit and Probit Models
For the majority of applications, the logit and probit models will give very similar characterisations of the data because the densities are very similar. That is, the fitted regression plots (such as Figure 12.2) will be virtually indistinguishable and the implied relationships between the explanatory variables and the probability that \(y_i = 1\) will also be very similar. Both approaches are much preferred to the linear probability model. The only instance where the models may give non-negligibly different results occurs when the split of the \(y_i\) between 0 and 1 is very unbalanced – for example, when \(y_i = 1\) occurs only 10% of the time. Stock and Watson (2011) suggest that the logistic approach was traditionally preferred since the function does not require the evaluation of an integral and thus the model parameters could be estimated faster. However, this argument is no longer relevant given the computational speeds now achievable, and the choice of one specification rather than the other is now usually arbitrary.
7 Estimation of Limited Dependent Variable Models
Given that both logit and probit are non-linear models, they cannot be estimated by OLS. While the parameters could, in principle, be estimated using non-linear least squares (NLS), maximum likelihood (ML) is simpler and is invariably used in practice. As discussed in Chapter 9, the principle is that the parameters are chosen to jointly maximise a log-likelihood function (LLF). The form of this LLF will depend upon whether the logit or probit model is used, but the general principles for parameter estimation described in Chapter 9 will still apply. That is, we form the appropriate log-likelihood function and then the software package will find the values of the parameters that jointly maximise it using an iterative search procedure. A derivation of the ML estimator for logit and probit models is given in the appendix to this chapter. Box 12.1 shows how to interpret the estimated parameters from probit and logit models.
Once the model parameters have been estimated, standard errors can be calculated and hypothesis tests conducted. While t-test statistics are constructed in the usual way, the standard error formulae used following the ML estimation are valid asymptotically only. Consequently, it is common to use the critical values from a normal distribution rather than a t distribution with the implicit assumption that the sample size is sufficiently large.
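As an illustration of the estimation principle (rather than of any particular package's internal algorithm), the sketch below forms the binary log-likelihood for a logit model and maximises it numerically. It assumes arrays `y` and `X` like those in the earlier sketches and uses scipy's general-purpose optimiser as a stand-in for the iterative search procedure described above.

```python
# A minimal sketch of maximum likelihood estimation of a logit model
# "by hand": form the log-likelihood and let an iterative optimiser
# find the parameter values that jointly maximise it. Assumes y (0/1
# array) and X (matrix including a constant) as in the earlier sketches.
import numpy as np
from scipy.optimize import minimize

def neg_loglik_logit(beta, y, X):
    z = X @ beta
    p = 1.0 / (1.0 + np.exp(-z))
    # negative LLF, since 'minimize' minimises rather than maximises
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logit_ml(y, X):
    start = np.zeros(X.shape[1])                        # starting values
    result = minimize(neg_loglik_logit, start, args=(y, X), method="BFGS")
    return result.x, -result.fun                        # estimates, maximised LLF
```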
8 Goodness of Fit Measures for Limited Dependent Variable Models
While it would be possible to calculate the values of the standard goodness of fit measures such as RSS, \(R^2\) or adjusted \(R^2\) for limited dependent variable models, these cease to have any real meaning. The objective of ML is to maximise the value of the LLF, not to minimise the RSS. Moreover, \(R^2\) and adjusted \(R^2\), if calculated in the usual fashion, will be misleading because the fitted probabilities from the model can take any value between zero and one, but the actual values will only be either \(0\) or \(1\). To illustrate, suppose that we are considering a situation where a bank either grants a loan \((y_i = 1)\) or refuses it \((y_i = 0)\). How should a fitted probability strictly between zero and one be interpreted – does it mean the loan is offered or not? To answer this question, sometimes any fitted value \(\hat P_i \geq 0.5\) is rounded up to one and any value \(\hat P_i < 0.5\) rounded down to zero. However, this approach is unlikely to work well when most of the observations on the dependent variable are one or when most are zero. In such cases, it makes more sense to use the unconditional probability that \(y = 1\) (call this \(\bar y\)) as the threshold rather than \(0.5\). So if, for example, only \(20\%\) of the observations have \(y = 1\) (so \(\bar y = 0.2\)), then we would deem the model to have correctly predicted the outcome concerning whether the bank would grant the loan to the customer where \(\hat P_i > \bar y\) and \(y_i = 1\), and where \(\hat P_i \leq \bar y\) and \(y_i = 0\).
Thus if \(y_i = 1\) and \(\hat P_i\) exceeds the threshold, the model has effectively made the correct prediction (either the loan is granted or refused – we cannot have any outcome in between), whereas \(R^2\) and adjusted \(R^2\) will not give it full credit for this. Two goodness of fit measures that are commonly reported for limited dependent variable models are as follows:
- The percentage of \(y_i\) values correctly predicted, defined as 100 times the number of observations predicted correctly divided by the total number of observations: \[\% \text{correct} = \frac{100}{N} \sum_{i=1}^{N}\left[ y_i I(\hat P_i) + (1-y_i)\bigl(1-I(\hat P_i)\bigr)\right], \quad (12.6)\] where \(I(\hat P_i) = 1\) if \(\hat P_i > \bar y\) and \(0\) otherwise. Obviously, the higher this number, the better the fit of the model. Although this measure is intuitive and easy to calculate, Kennedy (2003) suggests that it is not ideal, since it is possible that a ‘naïve predictor’ could do better than any model if the sample is unbalanced between 0 and 1. For example, suppose that \(y_i = 1\) for 80% of the observations. A simple rule that the prediction is always 1 is likely to outperform any more complex model on this measure but is unlikely to be very useful. Kennedy (2003, p. 267) suggests measuring goodness of fit as the percentage of \(y_i = 1\) correctly predicted plus the percentage of \(y_i = 0\) correctly predicted. Algebraically, this can be calculated as \[\% \text{correct} = 100 \cdot \left[ \frac{\sum_i y_i I(\hat P_i)}{\sum_i y_i} + \frac{\sum_i(1-y_i)\bigl(1-I(\hat P_i)\bigr)}{N-\sum_i y_i} \right], \quad (12.9)\] Again, the higher the value of the measure, the better the fit of the model.
- A measure known as ‘pseudo-\(R^2\)’, defined as \[\text{pseudo-}R^2 = 1-\frac{LLF}{LLF_0}, \quad (12.10)\] where \(LLF\) is the maximised value of the log-likelihood function for the logit or probit model and \(LLF_0\) is the value of the log-likelihood function for a restricted model where all of the slope parameters are set to zero (i.e. the model contains only an intercept). Pseudo-\(R^2\) will have a value of zero for the restricted model, as with the traditional \(R^2\), but this is where the similarity ends. Since the likelihood is essentially a joint probability, its value must lie between zero and one, and therefore taking its logarithm to form the LLF must result in a negative number. Thus, as the model fit improves, LLF will become less negative and therefore pseudo-\(R^2\) will rise. The maximum value of one could be reached only if the model fitted perfectly (i.e., all the fitted probabilities \(\hat P_i\) were either exactly zero or one, corresponding to the actual values). This could never occur in reality and therefore pseudo-\(R^2\) has a maximum value of less than one. We also lose the simple interpretation of the standard \(R^2\) that it measures the proportion of variation in the dependent variable that is explained by the model. Indeed, pseudo-\(R^2\) does not have any intuitive interpretation. This definition of pseudo-\(R^2\) is also known as McFadden’s \(R^2\), but it is also possible to specify the metric in other ways. For example, we could define pseudo-\(R^2\) as \(1 − (RSS/TSS)\), where RSS is the residual sum of squares from the fitted model and TSS is the total sum of squares of \(y_i\). A short sketch computing these goodness of fit measures follows this list.
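The sketch below shows how the measures described above might be computed, assuming that the outcomes, the fitted probabilities, and the maximised log-likelihoods of the unrestricted and intercept-only models are already available; the function names are ours, purely for illustration.

```python
# A minimal sketch of the goodness of fit measures discussed above:
# percentage correctly predicted using the unconditional mean of y as
# the threshold, Kennedy's variant, and McFadden's pseudo-R-squared.
# Assumes arrays y (0/1) and p_hat (fitted probabilities), plus the
# maximised log-likelihoods llf and llf_null from an estimated model.
import numpy as np

def percent_correct(y, p_hat):
    threshold = y.mean()                       # unconditional probability that y = 1
    pred = (p_hat > threshold).astype(int)
    return 100.0 * np.mean(pred == y)

def percent_correct_kennedy(y, p_hat):
    threshold = y.mean()
    pred = (p_hat > threshold).astype(int)
    pct_ones = 100.0 * np.mean(pred[y == 1] == 1)    # % of y = 1 correctly predicted
    pct_zeros = 100.0 * np.mean(pred[y == 0] == 0)   # % of y = 0 correctly predicted
    return pct_ones + pct_zeros

def mcfadden_pseudo_r2(llf, llf_null):
    return 1.0 - llf / llf_null
```

For models estimated with statsmodels, the McFadden measure is also reported directly as the `prsquared` attribute of the fitted results object.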
9 Multinomial Limited Dependent Variables
All of the examples that have been considered so far in this chapter have concerned situations where the dependent variable is modelled as a binary (0,1) choice. But there are also many instances where investors or financial agents are faced with more alternatives. For example, a company may be considering listing on the NYSE, the NASDAQ or the AMEX markets; a firm that is intending to take over another may choose to pay by cash, with shares, or with a mixture of both; a retail investor may be choosing between five different mutual funds; a credit ratings agency could assign one of sixteen (AAA to B3/B−) different ratings classifications to a firm’s debt.
Notice that the first three of these examples are different from the last one. In the first three cases, there is no natural ordering of the alternatives: the choice is simply made between them. In the final case, there is an obvious ordering, because a score of 1, denoting a AAA-rated bond, is better than a score of 2, denoting a AA1/AA+-rated bond, and so on (see Section 5.15 in Chapter 5). These two situations need to be distinguished and a different approach used in each case. In the first (when there is no natural ordering), a multinomial logit or probit would be used, while in the second (where there is an ordering), an ordered logit or probit would be used. This latter situation will be discussed in the next section, while multinomial models will be considered now.
When the alternatives are unordered, this is sometimes called a discrete choice or multiple choice problem. The models used are derived from the principles of utility maximisation – that is, the agent chooses the alternative that maximises his utility relative to the others. Econometrically, this is captured using a simple generalisation of the binary setup discussed earlier. When there were only two choices (0,1), we required just one equation to capture the probability that one or the other would be chosen. If there are now three alternatives, we would need two equations; for four alternatives, we would need three equations. In general, if there are m possible alternative choices, we need \(m − 1\) equations.
The situation is best illustrated by first examining a multinomial linear probability model. This still, of course, suffers from the same limitations as it did in the binary case (i.e., the same problems as the LPM), but it nonetheless serves as a simple example by way of introduction.4 The multiple choice example most commonly used is that of the selection of the mode of transport for travel to work.5 Suppose that the journey may be made by car, bus, or bicycle (three alternatives), and suppose that the explanatory variables are the person’s income (I), total hours worked (H), their gender (G) and the distance travelled (D).6 We could set up two equations \[\begin{align*} &\text{BUS}_i = \alpha_1 + \alpha_2 \text{I}_i+ \alpha_3 \text{H}_i+ \alpha_4 \text{G}_i+ \alpha_5 \text{D}_i + u_i, &(12.11) \\ &\text{CAR}_i = \beta_1 + \beta_2 \text{I}_i+ \beta_3 \text{H}_i+ \beta_4 \text{G}_i+ \beta_5 \text{D}_i + v_i, &(12.12) \end{align*}\] where \(\text{BUS}_i=1\) if person \(i\) travels by bus and \(0\) otherwise; \(\text{CAR}_i=1\) if person \(i\) travels by car and \(0\) otherwise.
There is no equation for travel by bicycle and this becomes a sort of reference point, since if the dependent variables in the two equations are both zero, the person must be travelling by bicycle. In fact, we do not need to estimate the third equation (for travel by bicycle) since any quantity of interest can be inferred from the other two. The fitted values from the equations can be interpreted as probabilities and so, together with the third possibility, they must sum to unity. Thus, if, for a particular individual \(i\), the probability of travelling by car is 0.4 and by bus is 0.3, then the probability that she will travel by bicycle must be \(0.3\) \((= 1−0.4−0.3)\). Also, since the fitted probabilities must sum to one for every individual, the intercepts for the three equations (the two estimated equations plus the missing one) must sum to one, and the coefficients on each explanatory variable must sum to zero, across the three modes of transport.
While the fitted probabilities will always sum to unity by construction, as with the binomial case, there is no guarantee that they will all lie between 0 and 1 – it is possible that one or more will be greater than 1 and one or more will be negative. In order to make a prediction about which mode of transport a particular individual will use, given that the parameters in equations (12.11) and (12.12) have been estimated and given the values of the explanatory variables for that individual, the largest fitted probability would be set to 1 and the others set to 0. So, for example, if the estimated probabilities of a particular individual travelling by car, bus and bicycle are 1.1, 0.2 and −0.3, these probabilities would be rounded to 1, 0, and 0. So the model would predict that this person would travel to work by car.
Exactly as the LPM has some important limitations that make logit and probit the preferred models, in the multiple choice context multinomial logit and probit models should be used. These are direct generalisations of the binary cases, and as with the multinomial LPM, \(m − 1\) equations must be estimated where there are \(m\) possible outcomes or choices. The outcome for which an equation is not estimated then becomes the reference choice, and thus the parameter estimates must be interpreted slightly differently. Suppose that travel by bus (B) and travel by car (C) have utilities for person \(i\) that depend on the characteristics described above \((I_i, H_i, G_i, D_i)\); then the car will be chosen if \[(\beta_1 + \beta_2 \text{I}_i + \beta_3 \text{H}_i + \beta_4 \text{G}_i + \beta_5 \text{D}_i + v_i) > (\alpha_1+\alpha_2 \text{I}_i +\alpha_3 \text{H}_i + \alpha_4 \text{G}_i + \alpha_5 \text{D}_i + u_i), (12.13)\] That is, the probability that the car will be chosen will be greater than that of the bus being chosen if the utility from going by car is greater. Equation (12.13) can be rewritten as \[(\beta_1-\alpha_1) + (\beta_2-\alpha_2) I_i + (\beta_3 - \alpha_3) H_i + (\beta_4-\alpha_4)G_i + (\beta_5 - \alpha_5) \text{D}_i > u_i - v_i, (12.14)\]
If it is assumed that \(u_i\) and \(v_i\) each independently follow an extreme value (Gumbel) distribution, then the difference between them will follow a logistic distribution. Thus we can write \[P \left( \frac{C_i}{B_i} \right) = \frac{1}{1+e^{-z_i}}, (12.15)\] where \(z_i\) is the function on the left-hand side of (12.14), i.e., \((\beta_1-\alpha_1) + (\beta_2-\alpha_2) I_i +...\), and travel by bus becomes the reference category. \(P \left( \frac{C_i}{B_i} \right)\) denotes the probability that individual \(i\) would choose to travel by car rather than by bus.
Equation (12.15) implies that the probability of the car being chosen in preference to the bus depends upon the logistic function of the differences in the parameters describing the relationship between the utilities from travelling by each mode of transport. Of course, we cannot recover both \(\beta_2\) and \(\alpha_2\), for example, but only the difference between them (call this \(\gamma_2 = \beta_2 - \alpha_2\)). These parameters measure the impact of marginal changes in the explanatory variables on the probability of travelling by car relative to the probability of travelling by bus. Note that a unit increase in \(I_i\) will lead to a \(\gamma_2 f(z_i)\) increase in the probability and not a \(\gamma_2\) increase – see equations (12.5) and (12.7) above. For this trinomial problem, there would need to be another equation – for example, based on the difference in utilities between travelling by bike and by bus. These two equations would be estimated simultaneously using maximum likelihood.
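As an illustration of the setup just described, the sketch below simulates a three-mode travel choice from hypothetical utilities with Gumbel errors and estimates a multinomial logit using statsmodels' MNLogit. The variable names and parameter values are invented, and the first category plays the role of the reference choice, exactly as bus does in the derivation above.

```python
# A minimal sketch of a multinomial logit for the travel-mode example,
# using statsmodels' MNLogit on simulated data. The variable names
# (income, hours, distance) are purely illustrative; the first outcome
# category (bus = 0) acts as the reference choice.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 1500
income = rng.normal(30, 8, n)
hours = rng.normal(38, 5, n)
distance = rng.uniform(1, 30, n)

# Latent utilities for bus (0), car (1) and bicycle (2), plus Gumbel errors
u_bus = 0.5 - 0.02 * income + 0.03 * distance + rng.gumbel(size=n)
u_car = -1.0 + 0.05 * income + 0.08 * distance + rng.gumbel(size=n)
u_bike = 1.0 - 0.10 * distance + rng.gumbel(size=n)
choice = np.argmax(np.column_stack([u_bus, u_car, u_bike]), axis=1)

X = sm.add_constant(np.column_stack([income, hours, distance]))
mnl = sm.MNLogit(choice, X).fit(disp=0)
print(mnl.summary())                    # two equations, each relative to the reference
print(mnl.predict(X).sum(axis=1)[:5])   # fitted probabilities sum to one
```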
For the multinomial logit model, the error terms in the equations (\(u_i\) and \(v_i\) in the example above) must be assumed to be independent. However, this creates a problem whenever two or more of the choices are very similar to one another. This problem is known as the ‘independence of irrelevant alternatives’. To illustrate how this works, Kennedy (2003, p. 270) uses an example where another choice to travel by bus is introduced and the only thing that differs is the colour of the bus. Suppose that the original probabilities for the car, bus and bicycle were \(0.4\), \(0.3\) and \(0.3\). If a new green bus were introduced in addition to the existing red bus, we would expect that the overall probability of travelling by bus should stay at \(0.3\) and that bus passengers should split between the two (say, with half using each coloured bus). This result arises since the new colour of the bus is irrelevant to those who have already chosen to travel by car or bicycle. Unfortunately, the logit model will not be able to capture this and will seek to preserve the relative odds of the old choices (which could be expressed as \(\frac{4}{10}\), \(\frac{3}{10}\) and \(\frac{3}{10}\) for car, bus and bicycle, respectively). Preserving these relative odds, the probabilities become \(\frac{4}{13}\), \(\frac{3}{13}\), \(\frac{3}{13}\) and \(\frac{3}{13}\) for car, green bus, red bus and bicycle, respectively – a long way from what intuition would lead us to expect.
Fortunately, the multinomial probit model, which is the multiple choice generalisation of the probit model discussed in Section 12.5 above, can handle this. The multinomial probit model would be set up in exactly the same fashion as the multinomial logit model, except that the cumulative normal distribution is used for \((u_i − v_i)\) instead of a cumulative logistic distribution. This is based on an assumption that \(u_i\) and \(v_i\) are multivariate normally distributed but unlike the logit model, they can be correlated. A positive correlation between the error terms can be employed to reflect a similarity in the characteristics of two or more choices. However, such a correlation between the error terms makes estimation of the multinomial probit model using maximum likelihood difficult because multiple integrals must be evaluated. Kennedy (2003, p. 271) suggests that this has resulted in continued use of the multinomial logit approach despite the independence of irrelevant alternatives problem.
10 The Pecking Order Hypothesis Revisited: The Choice Between Financing Methods
In Section 12.4, a logit model was used to evaluate whether there was empirical support for the pecking order hypothesis where the hypothesis boiled down to a consideration of the probability that a firm would seek external financing or not. But suppose that we wish to examine not only whether a firm decides to issue external funds but also which method of funding it chooses when there are a number of alternatives available. As discussed above, the pecking order hypothesis suggests that the least costly methods, which, everything else equal, will arise where there is least information asymmetry, will be used first, and the method used will also depend on the riskiness of the firm. Returning to Helwege and Liang’s study, they argue that if the pecking order is followed, low-risk firms will issue public debt first, while moderately risky firms will issue private debt and the most risky companies will issue equity. Since there is more than one possible choice, this is a multiple choice problem and consequently, a binary logit model is inappropriate and instead, a multinomial logit is used. There are three possible choices here: bond issue, equity issue and private debt issue. As is always the case for multinomial models, we estimate equations for one fewer than the number of possibilities, and so equations are estimated for equities and bonds, but not for private debt. This choice then becomes the reference point, so that the coefficients measure the probability of issuing equity or bonds rather than private debt, and a positive parameter estimate in, say, the equities equation implies that an increase in the value of the variable leads to an increase in the probability that the firm will choose to issue equity rather than private debt.
The set of explanatory variables is slightly different now given the different nature of the problem at hand. The key variable measuring risk is now the ‘unlevered Z-score’, which is Altman’s Z-score constructed as a weighted average of operating earnings before interest and taxes, sales, retained earnings and working capital. All other variable names are largely self-explanatory and so are not discussed in detail, but they are divided into two categories – those measuring the firm’s level of risk (unlevered Z-score, debt, interest expense and variance of earnings) and those measuring the degree of information asymmetry (R&D expenditure, venture-backed, age, age over fifty, plant, property and equipment, industry growth, non-financial equity issuance, and assets). Firms with heavy R&D expenditure, those receiving venture capital financing, younger firms, firms with less property, plant and equipment, and smaller firms are argued to suffer from greater information asymmetry. The parameter estimates for the multinomial logit are presented in Table 12.2, with equity issuance as a (0,1) dependent variable in the second column and bond issuance as a (0,1) dependent variable in the third column.
Table 12.2 Multinomial logit estimation of the type of external financing
| Variable | Equity equation | Bonds equation |
|---|---|---|
| Intercept | -4.67 (-6.17) | -4.68 (-5.48) |
| Unlevered Z-score | 0.14 (1.84) | 0.26 (2.86) |
| Debt | 1.72 (1.60) | 3.28 (2.88) |
| Interest expense | -9.41 (-0.93) | -4.54 (-0.42) |
| Variance of earnings | -0.04 (-0.55) | -0.14 (-1.56) |
| R&D | 0.61 (1.28) | 0.89 (1.59) |
| Venture-backed | 0.70 (2.32) | 0.86 (2.50) |
| Age | -0.01 (-1.10) | -0.03 (-1.85) |
| Age over fifty | 1.58 (1.44) | 1.93 (1.70) |
| Plant, property and equipment | 0.62 (0.94) | 0.34 (0.50) |
| Industry growth | 0.005 (1.14) | 0.003 (0.70) |
| Non-financial equity issuance | 0.008 (3.89) | 0.005 (2.65) |
| Assets | -0.001 (-0.59) | 0.002 (4.11) |
Notes: t-ratios in parentheses; only figures for all years in the sample are presented. Source: Helwege and Liang (1996). Reprinted with the permission of Elsevier.
Overall, the results paint a very mixed picture about whether the pecking order hypothesis is validated or not. The positive (significant) and negative (insignificant) estimates on the unlevered Z-score and interest expense variables, respectively, suggest that firms in good financial health (i.e. less risky firms) are more likely to issue equities or bonds rather than private debt. Yet the positive sign of the parameter on the debt variable suggests that riskier firms are more likely to issue equities or bonds; the variance of earnings variable has the wrong sign but is not statistically significant. Almost all of the asymmetric information variables have statistically insignificant parameters. The only exceptions are that firms with venture backing are more likely to seek capital market financing of either type, as are firms issuing at times when aggregate non-financial equity issuance is high. Finally, larger firms are more likely to issue bonds (but not equity). Thus the authors conclude that the results ‘do not indicate that firms strongly avoid external financing as the pecking order predicts’ and ‘equity is not the least desirable source of financing since it appears to dominate bank loans’ (Helwege and Liang, 1996, p. 458).
11 Ordered Response Limited Dependent Variable Models
Some limited dependent variables can be assigned numerical values that have a natural ordering. The most common example in finance is that of credit ratings, as discussed previously, but a further application is to modelling a security’s bid-ask spread (see, for example, ap Gwilym, Clare and Thomas, 1998). In such cases, it would not be appropriate to use multinomial logit or probit since these techniques cannot take into account any ordering in the dependent variables. Notice that ordinal variables are still distinct from the usual type of data that were employed in the early chapters in this book, such as stock returns, GDP, interest rates, etc. These are examples of cardinal numbers, since additional information can be inferred from their actual values relative to one another. To illustrate, an increase in house prices of 20% represents twice as much growth as a 10% rise. The same is not true of ordinal numbers, where (returning to the credit ratings example) a rating of AAA, assigned a numerical score of 16, is not ‘twice as good’ as a rating of Baa2/BBB, assigned a numerical score of 8. Similarly, for ordinal data, the difference between a score of, say, 15 and of 16 cannot be assumed to be equivalent to the difference between the scores of 8 and 9. All we can say is that as the score increases, there is a monotonic increase in the credit quality. Since only the ordering can be interpreted with such data and not the actual numerical values, OLS cannot be employed and a technique based on ML is used instead. The models used are generalisations of logit and probit, known as ordered logit and ordered probit. Using the credit rating example again, the model is set up so that a particular bond falls in the AA+ category (using Standard and Poor’s terminology) if its unobserved (latent) creditworthiness falls within a certain range that is too low to classify it as AAA and too high to classify it as AA. The boundary values between each rating are then estimated along with the model parameters.
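As a sketch of how such a model might be estimated in practice, the code below simulates an ordinal ratings-style variable from a latent creditworthiness index and fits an ordered probit using the OrderedModel class available in recent versions of statsmodels. The data, category boundaries and variable names are purely illustrative, not taken from any real rating data.

```python
# A minimal sketch of an ordered probit for an ordinal ratings-style
# outcome, using statsmodels' OrderedModel (available in recent
# statsmodels versions). Simulated data and invented variable names;
# the threshold (cut-point) parameters are estimated together with the
# slope coefficients.
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 800
leverage = rng.normal(size=n)
profitability = rng.normal(size=n)
latent = -1.0 * leverage + 1.5 * profitability + rng.normal(size=n)

# Map the latent creditworthiness into four ordered categories
cutpoints = [-1.5, 0.0, 1.5]
rating = np.digitize(latent, cutpoints)          # 0 = lowest, 3 = highest

X = np.column_stack([leverage, profitability])   # no constant: absorbed by thresholds
oprobit = OrderedModel(rating, X, distr="probit").fit(method="bfgs", disp=0)
print(oprobit.summary())   # slopes plus cut points (in statsmodels' parameterisation)
```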
12 Are Unsolicited Credit Ratings Biased Downwards? An Ordered Probit Analysis
Modelling the determinants of credit ratings is one of the most important uses of ordered probit and logit models in finance. The main credit ratings agencies construct what may be termed solicited ratings, which are those where the issuer of the debt contacts the agency and pays them a fee for producing the rating. Many firms globally do not seek a rating (because, for example, the firm believes that the ratings agencies are not well placed to evaluate the riskiness of debt in their country or because they do not plan to issue any debt or because they believe that they would be awarded a low rating), but the agency may produce a rating anyway. Such ‘unwarranted and unwelcome’ ratings are known as unsolicited ratings. All of the major ratings agencies produce unsolicited ratings as well as solicited ones, and they argue that there is a market demand for this information even if the issuer would prefer not to be rated.
Companies in receipt of unsolicited ratings argue that these are biased downwards relative to solicited ratings and that they cannot be justified without the level of detail of information that can be provided only by the rated company itself. A study by Poon (2003) seeks to test the conjecture that unsolicited ratings are biased after controlling for the rated company’s characteristics that pertain to its risk.
The data employed comprise a pooled sample of all companies that appeared on the annual ‘issuer list’ of S&P during the years 1998–2000. This list contains both solicited and unsolicited ratings covering 295 firms over fifteen countries and totalling 595 observations. In a preliminary exploratory analysis of the data, Poon finds that around half of the sample ratings were unsolicited, and indeed the unsolicited ratings in the sample are on average significantly lower than the solicited ratings.9 As expected, the financial characteristics of the firms with unsolicited ratings are significantly weaker than those for firms that requested ratings. The core methodology employs an ordered probit model with explanatory variables comprising firm characteristics and a dummy variable for whether the firm’s credit rating was solicited or not \[R_i^{*} = X_i \beta +\epsilon_i, (12.16)\] with \[R_i = \begin{cases} 1 & \text{if} &R_i^* \leq \mu_1 \\ 2 & \text{if} &\mu_1 < R_i^* \leq \mu_2 \\ 3 & \text{if} &\mu_2 < R_i^* \leq \mu_3 \\ 4 & \text{if} &\mu_3 < R_i^* \leq \mu_4 \\ 5 & \text{if} &\mu_4 < R_i^* \leq \mu_5 \\ 6 & \text{if} &R_i^* > \mu_5 \end{cases}\] where \(R_i\) are the observed ratings scores, given numerical values as follows: AA or above = 6, A = 5, BBB = 4, BB = 3, B = 2 and CCC or below = 1; \(R_i^*\) is the unobservable ‘true rating’ (or ‘an unobserved continuous variable representing S&P’s assessment of the creditworthiness of issuer \(i\)’); \(X_i\) is a vector of variables that explain the variation in ratings; \(\beta\) is a vector of coefficients; the \(\mu_j\) are the threshold parameters to be estimated along with \(\beta\); and \(\epsilon_i\) is a disturbance term that is assumed to be normally distributed.
The explanatory variables attempt to capture the creditworthiness using publicly available information. Two specifications are estimated: the first includes the variables listed below, while the second additionally incorporates an interaction of the main financial variables with a dummy variable for whether the firm’s rating was solicited (SOL) and separately with a dummy for whether the firm is based in Japan.10 The financial variables are ICOV – interest coverage (i.e. earnings divided by interest payments), ROA – return on assets, DTC – total debt to capital, and SDTD – short-term debt to total debt. Three variables – SOVAA, SOVA and SOVBBB – are dummy variables that capture the debt issuer’s sovereign credit rating.11 Table 12.3 presents the results from the ordered probit estimation.

Table 12.3 Ordered probit model results for the determinants of credit ratings
| Explanatory variable | Model 1 coefficient | Model 1 test statistic | Model 2 coefficient | Model 2 test statistic |
|---|---|---|---|---|
| Intercept | 2.324 | 8.960*** | 1.492 | 3.155** |
| SOL | 0.359 | 2.105** | 0.391 | 0.647 |
| JP | -0.548 | -2.949*** | 1.296 | 2.441** |
| JP*SOL | 1.614 | 7.027*** | 1.487 | 5.183** |
| SOVAA | 2.135 | 8.768*** | 2.470 | 8.975** |
| SOVA | 0.554 | 2.552** | 0.925 | 3.968*** |
| SOVBBB | -0.416 | -1.480 | -0.181 | -0.601 |
| ICOV | 0.023 | 3.466*** | -0.005 | -0.172 |
| ROA | 0.104 | 10.306*** | 0.194 | 2.503** |
| DTC | -1.393 | -5.736*** | -0.522 | -1.130 |
| SDTD | -1.212 | -5.228*** | 0.111 | 0.171 |
| SOL*ICOV | - | - | 0.005 | 0.163 |
| SOL*ROA | - | - | -0.116 | -1.476 |
| SOL*DTC | - | - | 0.756 | 1.136 |
| SOL*SDTD | - | - | -0.887 | -1.290 |
| JP*ICOV | - | - | 0.009 | 0.275 |
| JP*ROA | - | - | 0.183 | 2.200** |
| JP*DTC | - | - | -1.865 | -3.214*** |
| JP*SDTD | - | - | -2.443 | -3.437*** |
| AA or above | >5.095 | | >5.578 | |
| A | >3.788 and <5.095 | 25.278* | >4.147 and <5.578 | 23.294* |
| BBB | >2.550 and <3.788 | 19.671*** | >2.803 and <4.147 | 19.204*** |
| BB | >1.287 and <2.550 | 14.342** | >1.432 and <2.803 | 14.324** |
| B | >0 and <1.287 | 7.927*** | >0 and <1.432 | 7.910*** |
| CCC or below | <0 | | <0 | |
Note:*, ** and *** denote significance at the 10%, 5% and 1% levels, respectively. Source: Poon (2003). Reprinted with the permission of Elsevier.
The key finding is that the SOL variable is positive and statistically significant in Model 1 (and it is positive but insignificant in Model 2), indicating that even after accounting for the financial characteristics of the firms, unsolicited firms receive ratings on average 0.359 units lower than an otherwise identical firm that had requested a rating. The parameter estimate for the interaction term between the solicitation and Japanese dummies (JP*SOL) is positive and significant in both specifications, indicating strong evidence that Japanese firms soliciting ratings receive higher scores. On average, firms with stronger financial characteristics (higher interest coverage, higher return on assets, lower debt to total capital, or a lower ratio of short-term debt to total debt) have higher ratings.

A major flaw that potentially exists within the above analysis is the self-selection bias or sample selection bias that may have arisen if firms that would have received lower credit ratings (because they have weak financials) elect not to solicit a rating. If the probit equation for the determinants of ratings is estimated ignoring this potential problem when it is present, the coefficients will be inconsistent. To get around this problem and to control for the sample selection bias, Heckman (1979) proposed a two-step procedure that in this case would involve first estimating a 0–1 probit model for whether the firm chooses to solicit a rating and second estimating the ordered probit model for the determinants of the rating. The first-stage probit model is \[Y_i^* = Z_i \gamma + \xi_i, (12.17)\] where \(Y_i = 1\) if the firm has solicited a rating and 0 otherwise, \(Y_i^*\) denotes the latent propensity of issuer \(i\) to solicit a rating, \(Z_i\) are the variables that explain the choice to be rated or not, and \(\gamma\) are the parameters to be estimated. When this equation has been estimated, the rating \(R_i\) as defined above in equation (12.16) will be observed only if \(Y_i = 1\). The error terms from the two equations, \(\epsilon_i\) and \(\xi_i\), are assumed to follow a bivariate standard normal distribution with correlation \(\rho_{\epsilon \xi}\). Table 12.4 shows the results from the two-step estimation procedure, with the estimates from the binary probit model for the decision concerning whether to solicit a rating in panel A and the determinants of ratings for rated firms in panel B.
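To illustrate the general logic of the two-step correction (rather than the exact ordered probit second stage used in the study), the sketch below simulates a selection problem, estimates a first-stage probit, forms the inverse Mills ratio and includes it in a simple linear second-stage regression. All variable names and parameter values are hypothetical and the linear second stage is a deliberate simplification.

```python
# A minimal sketch of the generic Heckman (1979) two-step idea: a
# first-stage probit for whether an observation is selected, then a
# second-stage regression on the selected sample that adds the inverse
# Mills ratio as a control for selection. A linear second stage is used
# here for simplicity, whereas the study above uses an ordered probit;
# the data are simulated, not Poon's.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 2000
z1 = rng.normal(size=n)                       # drives the selection decision
x1 = rng.normal(size=n)                       # drives the outcome
e_sel, e_out = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], n).T

selected = (0.5 + 1.0 * z1 + e_sel > 0).astype(int)
outcome = 1.0 + 0.8 * x1 + e_out              # observed only if selected == 1

# Step 1: probit for the selection decision, then the inverse Mills ratio
Z = sm.add_constant(np.column_stack([z1, x1]))
probit_fit = sm.Probit(selected, Z).fit(disp=0)
zb = Z @ probit_fit.params
imr = norm.pdf(zb) / norm.cdf(zb)

# Step 2: outcome equation on the selected sample, including the IMR
mask = selected == 1
X2 = sm.add_constant(np.column_stack([x1[mask], imr[mask]]))
second = sm.OLS(outcome[mask], X2).fit()
print(second.params)                          # last coefficient reflects the error correlation
```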
Table 12.4 Two-step ordered probit model allowing for selectivity bias in the determinants of credit ratings
| Explanatory variable | Coefficient | Test statistic |
|---|---|---|
| Panel A: Decision to be rated | | |
| Intercept | 1.624 | 3.935* |
| JP | -0.776 | -4.951** |
| SOVAA | -0.959 | -2.706*** |
| SOVA | -0.614 | -1.794* |
| SOVBBB | -1.130 | -2.899** |
| ICOV | -0.005 | -0.922 |
| ROA | 0.051 | 6.537 |
| DTC | 0.272 | 1.019 |
| SDTD | -1.651 | -5.320** |
| Panel B: Rating determinant equation | | |
| Intercept | 1.368 | 2.890*** |
| JP | 2.456 | 3.141*** |
| SOVAA | 2.315 | 6.121*** |
| SOVA | 0.875 | 2.755** |
| SOVBBB | 0.306 | 0.768 |
| ICOV | 0.002 | 0.118 |
| ROA | 0.038 | 2.408* |
| DTC | -0.330 | -0.512 |
| SDTD | 0.105 | 0.303 |
| JP*ICOV | 0.038 | 1.129 |
| JP*ROA | 0.188 | 2.104** |
| JP*DTC | -0.808 | -0.924 |
| JP*SDTD | -2.823 | -2.430** |
| Estimated correlation | -0.836 | -5.723** |
| AA or above | >4.275 | |
| A | >2.841 and <4.275 | 8.235*** |
| BBB | >1.748 and <2.841 | 9.164*** |
| BB | >0.704 and <1.748 | 6.788 |
| B | >0 and <0.704 | 3.316*** |
| CCC or below | <0 | |
Note:*, ** and *** denote significance at the 10%, 5% and 1% levels, respectively. Source: Poon (2003). Reprinted with the permission of Elsevier.
A positive parameter value in panel A indicates that higher values of the associated variable increase the probability that a firm will elect to be rated. Of the four financial variables, only the return on assets and the short-term debt as a proportion of total debt have correctly signed and significant (positive and negative, respectively) impacts on the decision to be rated. The parameters on the sovereign credit rating dummy variables (SOVAA, SOVA and SOVBBB) are all significant and negative in sign, indicating that any debt issuer in a country with a high sovereign rating is less likely to solicit its own rating from S&P, other things equal.
These sovereign rating dummy variables have the opposite sign in the ratings determinant equation (panel B) as expected, so that firms in countries where government debt is highly rated are themselves more likely to receive a higher rating. Of the four financial variables, only ROA has a significant (and positive) effect on the rating awarded. The dummy for Japanese firms is also positive and significant, and so are three of the four financial variables when interacted with the Japan dummy, indicating that S&P appears to attach different weights to the financial variables when assigning ratings to Japanese firms compared with comparable firms in other countries.
Finally, the estimated correlation between the error terms in the decision to be rated equation and the ratings determinant equation, \(\rho_{\epsilon \xi}\), is significant and negative (−0.836), indicating that the results in Table 12.3 above would have been subject to self-selection bias and hence the results of the two-stage model are to be preferred. The only disadvantage of this approach, however, is that by construction it cannot answer the core question of whether unsolicited ratings are on average lower after allowing for the debt issuer’s financial characteristics, because only firms with solicited ratings are included in the sample at the second stage!
13 Censored and Truncated Dependent Variables
Censored or truncated variables occur when the range of values observable for the dependent variables is limited for some reason. Unlike the types of limited dependent variables examined so far in this chapter, censored or truncated variables may not necessarily be dummies. A standard example is that of charitable donations by individuals. It is likely that some people would actually prefer to make negative donations (that is, to receive from the charity rather than to donate to it), but since this is not possible, there will be many observations at exactly zero. So suppose, for example, that we wished to model the relationship between donations to charity and people’s annual income, in pounds. The situation we might face is illustrated in Figure 12.3.
Figure 12.3 Modelling charitable donations as a function of income
Given the observed data, with many observations on the dependent variable stuck at zero, OLS would yield biased and inconsistent parameter estimates. An obvious but flawed way to get around this would be just to remove all of the zero observations altogether, since we do not know whether they should be truly zero or negative. However, as well as being inefficient (since information would be discarded), this would still yield biased and inconsistent estimates. This arises because the error term, ui, in such a regression would not have an expected value of zero, and it would also be correlated with the explanatory variable(s), violating the assumption that \(\text{Cov}(u_i, x_{ki}) = 0, \forall k\).
The key differences between censored and truncated data are highlighted in Box 12.2. For both censored and truncated data, OLS will not be appropriate, and an approach based on maximum likelihood must be used, although the model in each case would be slightly different. In both cases, we can work out the marginal effects given the estimated parameters, but these are now more complex than in the logit or probit cases.
BOX 12.2 The differences between censored and truncated dependent variables

Although at first sight the two words might appear interchangeable, when the terms are used in econometrics, censored and truncated data are different.
- Censored data occur when the dependent variable has been ‘censored’ at a certain point so that values above (or below) this cannot be observed. Even though the dependent variable is censored, the corresponding values of the independent variables are still observable.
- As an example, suppose that a privatisation IPO is heavily oversubscribed, and you were trying to model the demand for the shares using household income, age, education and region of residence as explanatory variables. The number of shares allocated to each investor may have been capped at, say, 250, so that the distribution of observed allocations piles up at that cap.
- In this example, even though we are likely to have many share allocations at 250 and none above this figure, all of the observations on the independent variables are present and hence the dependent variable is censored, not truncated.
- A truncated dependent variable, meanwhile, occurs when the observations for both the dependent and the independent variables are missing when the dependent variable is above (or below) a certain threshold. Thus the key difference from censored data is that we cannot observe the xis either, and so some observations are completely cut out or truncated from the sample. For example, suppose that a bank were interested in determining the factors (such as age, occupation and income) that affected a customer’s decision as to whether to undertake a transaction in a branch or online. Suppose also that the bank tried to achieve this by encouraging clients to fill in an online questionnaire when they log on. There would be no data at all for those who opted to transact in person since they probably would not have even logged on to the bank’s web-based system and so would not have the opportunity to complete the questionnaire. This is a common problem, which will result whenever data for buyers or users only can be observed while data for non-buyers or non-users cannot. Of course, it is possible, although unlikely, that the population of interest is focused only on those who use the internet for banking transactions, in which case there would be no problem.
13.1 Censored Models
The approach usually used to estimate models with censored dependent variables is known as tobit analysis, named after Tobin (1958). To illustrate, suppose that we wanted to model the demand for privatisation IPO shares, as discussed above, as a function of income \((x_{2i})\), age \((x_{3i})\), education \((x_{4i})\) and region of residence \((x_{5i})\). The model would be \[ y_i^* = \beta_1 +\beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}+ \beta_5 x_{5i} + u_i, \quad (12.18) \\ y_i = \begin{cases} y_i^* &\text{for} & y_i^* < 250 \\ 250 &\text{for} & y_i^* \geq 250 \end{cases} \] where \(y_i^*\) represents the true demand for shares (i.e. the number of shares requested), which is observable only for demand of fewer than 250 shares. Thus 250 effectively acts as a threshold. In Tobin’s original model, the threshold was assumed to be zero, which simplifies matters slightly.
It is important to note in this model that \(\beta_2, \beta_3,...\) represent the impact on the number of shares demanded (of a unit change in x2i, x3i, etc.) and not the impact on the actual number of shares that will be bought (allocated). More generally, a dependent variable can be either right-censored (upper censored) as in the example above where observations above a certain threshold (call it \(b\)) are not observable and are equal to the threshold, or it could be left-censored (lower censored) where observations below a certain threshold (call this \(a\)) are not observable and so are equal to \(a\).
A commonly employed illustration of the latter (left-censored) case is to return to the example above relating to charitable donations made by individuals. To see how this would work, the explanatory variables could be exactly as in the IPO example above, and \(y_i\) would be the actual amount donated while \(y_i^*\) is the unobservable amount that person \(i\) would actually like to give to charity (this may be negative, which would be interpreted as suggesting that the person would prefer to take money from the charity rather than donate to it if that were possible). Algebraically, we would write \[ y_i^* = \beta_1 +\beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}+ \beta_5 x_{5i} + u_i, \quad (12.19) \\ y_i = \begin{cases} y_i^* &\text{for} & y_i^* > 0 \\ 0 &\text{for} & y_i^* \leq 0 \end{cases} \]
A final possibility is that the dependent variable is double-censored, so that neither the true values at or below a certain threshold \(a\) nor those at or above a certain other threshold \(b\) can be observed. We could write this as \[ y_i^* = \beta_1 +\beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i}+ \beta_5 x_{5i} + u_i, \quad (12.20) \\ y_i = \begin{cases} a &\text{for} & y_i^* \leq a \\ y_i^* &\text{for} & a < y_i^* < b \\ b &\text{for} & y_i^* \geq b \end{cases} \]
Tobit models can be estimated in a fairly straightforward fashion using maximum likelihood under the assumption that the threshold(s) (\(a\) and/or \(b\)) are known and that the disturbances, \(u_i\), follow a normal distribution with mean 0 and constant variance \(\sigma^2\). The log-likelihood function for a double-censored tobit model is \[LLF = \sum_{i=1}^N \left[I_i^a \ln F \left( \frac{a-XB}{\sigma}\right) + I_i^b \ln F \left(\frac{XB-b}{\sigma}\right) +(1-I_i^a - I_i^b) \left( \ln f \left( \frac{y_i-XB}{\sigma}\right) -\ln \sigma \right) \right], \quad (12.21)\] where \(XB\) is shorthand for all of the parameters multiplied by their corresponding explanatory variables \((\beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i})\), \(F(.)\) and \(f(.)\) are the standard normal cdf and pdf respectively, and \(I_i^a\) and \(I_i^b\) are indicator functions that pull out the observations at the lower and at the upper threshold, respectively. They can be defined as \[I_i^a = \begin{cases} 1 & \text{if} & y_i \leq a \\ 0 & \text{if} & y_i > a \end{cases}\] and \[I_i^b = \begin{cases} 1 & \text{if} & y_i \geq b \\ 0 & \text{if} & y_i < b \end{cases}\]
Effectively, there is a pdf (in essence, a linear part) for the observed portion of the distribution and a cdf (or two cdfs) for the censored part(s). Equation (12.21) above is the most general form of log-likelihood function in this class, with both left- and right-censoring. A more restricted version would arise where the dependent variable was censored on only one side, in which case one of the first two terms in the equation would drop out: the first term drops out if there is no left-censoring (so that \(a = -\infty\)), while the second term drops out if there is no right-censoring (so that \(b = \infty\)).
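As an illustration of how the log-likelihood in (12.21) could be estimated in practice, the following sketch codes the function directly and maximises it numerically on simulated, double-censored data. This is a minimal sketch rather than a production implementation: the thresholds, parameter values and data are all invented, and in applied work one would typically rely on a dedicated tobit routine.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_neg_llf(params, y, X, a, b):
    """Negative of the double-censored tobit log-likelihood in equation (12.21)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                      # reparameterise so that sigma > 0
    xb = X @ beta
    lo, hi = y <= a, y >= b                        # indicators I_i^a and I_i^b
    mid = ~(lo | hi)                               # uncensored observations
    llf = (norm.logcdf((a - xb[lo]) / sigma).sum()          # left-censored term
           + norm.logcdf((xb[hi] - b) / sigma).sum()        # right-censored term
           + (norm.logpdf((y[mid] - xb[mid]) / sigma) - np.log(sigma)).sum())
    return -llf

# Simulate some double-censored data purely for illustration
rng = np.random.default_rng(0)
n, a, b = 2_000, 0.0, 50.0
X = np.column_stack([np.ones(n), rng.normal(40, 15, n)])
y = np.clip(X @ np.array([5.0, 0.8]) + rng.normal(0, 10, n), a, b)

# Maximise the log-likelihood (i.e., minimise its negative), starting from OLS values
start = np.r_[np.linalg.lstsq(X, y, rcond=None)[0], np.log(y.std())]
res = minimize(tobit_neg_llf, start, args=(y, X, a, b), method="BFGS")
print("beta estimates:", res.x[:-1], " sigma estimate:", np.exp(res.x[-1]))
```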
An interesting financial application of the tobit approach is due to Haushalter (2000), who employs it to model the determinants of the extent of hedging by oil and gas producers using futures or options over the 1992–4 period. The dependent variable used in the regression models, the proportion of production hedged, is clearly censored because around half of all of the observations are exactly zero (i.e., the firm does not hedge at all).12 The censoring of the proportion of production hedged may arise because of high fixed costs that prevent many firms from being able to hedge even if they wished to. Moreover, if companies expect the price of oil or gas to rise in the future, they may wish to increase rather than reduce their exposure to price changes (i.e., ‘negative hedging’), but this would not be observable given the way that the data are constructed in the study.
The main results from the study are that the proportion of exposure hedged is negatively related to creditworthiness, positively related to indebtedness and to the firm's marginal tax rate, and also varies with the location of the firm's production facilities. The extent of hedging is not, however, affected by the size of the firm as measured by its total assets.
Before moving on, two important limitations of tobit modelling should be noted. First, such models are much more seriously affected by non-normality and heteroscedasticity than are standard regression models (see Amemiya, 1984), and biased and inconsistent estimation will result. Second, as Kennedy (2003, p. 283) argues, the tobit model requires it to be plausible that the dependent variable can have values close to the limit. There is no problem with the privatisation IPO example discussed above since the demand could be for 249 shares. However, it would not be appropriate to use the tobit model in situations where this is not the case, such as the number of shares issued by each firm in a particular month. For most companies, this figure will be exactly zero, but for those where it is not, the number will be much higher and thus it would not be feasible to issue, say, one or three or fifteen shares. In this case, an alternative approach should be used.
13.2 Truncated Models
As stated above, a truncated dependent variable occurs when both the dependent and the independent variables are missing or unobservable for a particular section of the population. Dealing with truncated data is therefore really a sample selection problem, because the sample of data that can be observed is not representative of the population of interest – the sample is biased, very likely resulting in biased (towards zero) and inconsistent parameter estimates. We therefore cannot use OLS to estimate the model parameters, and again we use maximum likelihood, with a modification to the likelihood function so that the probabilities still sum to one, but over the observed part of the distribution only. One convenient way to write the appropriate log-likelihood function (although there are several) is \[LLF = \sum_{i=1}^N \left[ \ln f \left(\frac{y_i-XB}{\sigma} \right) - \ln \sigma - \ln \left( F \left( \frac{b-XB}{\sigma} \right) - F \left(\frac{a-XB}{\sigma} \right) \right) \right], (12.22)\] where, as above, \(XB\) is a shorthand for all of the parameters multiplied by their corresponding explanatory variables \((\beta_1 + \beta_2x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i})\), and \(f(.)\) and \(F(.)\) are the standard normal pdf and cdf respectively.
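A corresponding sketch for truncated data is given below, assuming the form of the log-likelihood written in (12.22). Note how, unlike in the censored case, observations outside \((a, b)\) are dropped from the sample entirely rather than piled up at the thresholds. Again, all numbers are purely illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def truncated_neg_llf(params, y, X, a, b):
    """Negative log-likelihood for a doubly truncated regression as in (12.22):
    the normal density is rescaled so that the probabilities sum to one over
    the observed range (a, b) only."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    log_density = norm.logpdf((y - xb) / sigma) - np.log(sigma)
    prob_in_range = norm.cdf((b - xb) / sigma) - norm.cdf((a - xb) / sigma)
    # small floor to guard against numerical underflow during optimisation
    return -(log_density - np.log(np.clip(prob_in_range, 1e-12, None))).sum()

# Simulate truncated data: draw from the latent model and keep only y in (a, b)
rng = np.random.default_rng(3)
a, b = 0.0, 60.0
x = rng.normal(40, 15, 5_000)
y_star = 5.0 + 0.8 * x + rng.normal(0, 10, 5_000)
keep = (y_star > a) & (y_star < b)                 # truncation drops y and x together
y, X = y_star[keep], np.column_stack([np.ones(keep.sum()), x[keep]])

start = np.r_[np.linalg.lstsq(X, y, rcond=None)[0], np.log(y.std())]
res = minimize(truncated_neg_llf, start, args=(y, X, a, b), method="BFGS")
print("beta estimates:", res.x[:-1], " sigma estimate:", np.exp(res.x[-1]))
```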
Usually, however, for truncated data a more general model is employed that contains two equations – one for whether a particular data point will fall into the observed or constrained categories and another for modelling the resulting variable. The second equation is equivalent to the tobit approach. This two-equation methodology allows a different set of factors to affect the sample selection (for example, the decision to set up internet access to a bank account) from those in the equation to be estimated (for example, a model of the factors that affect whether a particular transaction will be conducted online or in a branch). If it is thought that the two sets of factors will be the same, then a single equation can be used and the tobit approach is sufficient. In many cases, however, the researcher may believe that the variables in the sample selection and estimation equations should be different. Thus the equations could be \[\begin{align*} &a_i^* = \alpha_1 +\alpha_2 z_{2i} + \alpha_3 z_{3i} + ...+ \alpha_m z_{mi} + \epsilon_i, &(12.23) \\ &y_i^* = \beta_1 +\beta_2 x_{2i} + \beta_3 x_{3i} + ...+ \beta_m x_{mi} + u_i, &(12.24) \end{align*}\] where \(y_i = y_i^*\) for \(a_i^* > 0\), and \(y_i\) is unobserved for \(a_i^* \leq 0\). \(a_i^*\) denotes the relative ‘advantage’ of being in the observed sample relative to the unobserved sample.
The first equation determines whether the particular data point \(i\) will be observed or not, by regressing a proxy for the latent (unobserved) variable \(a_i^*\) on a set of factors, \(z_i\). The second equation is similar to the tobit model. Ideally, the two equations (12.23) and (12.24) will be fitted jointly by maximum likelihood, usually based on the assumption that the error terms, \(\epsilon_i\) and \(u_i\), are multivariate normally distributed, allowing for any possible correlations between them. However, while joint estimation of the equations is more efficient, it is computationally more complex, and hence a two-stage procedure popularised by Heckman (1976) is often used instead. The Heckman procedure allows for possible correlations between \(\epsilon_i\) and \(u_i\) while estimating the equations separately in a clever way – see Maddala (1983).
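To make the two-stage idea concrete, the sketch below applies the Heckman procedure by hand to simulated data: a probit model is fitted for the selection equation (12.23), the inverse Mills ratio is computed from its fitted index, and this ratio is then added as an extra regressor to the outcome equation (12.24), which is estimated by OLS on the observed subsample. The data-generating values and variable names are assumptions made purely for illustration, and the second-stage standard errors would require a further correction in practice.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulate a simple sample-selection setup (all numbers purely illustrative)
rng = np.random.default_rng(7)
n = 5_000
z = rng.normal(size=n)                       # selection-equation regressor
x = rng.normal(size=n)                       # outcome-equation regressor
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=n)
selected = (0.5 + 1.0 * z + e[:, 0]) > 0     # a_i^* > 0: observation is in the sample
y_star = 2.0 + 1.5 * x + e[:, 1]
y = np.where(selected, y_star, np.nan)       # y unobserved when a_i^* <= 0

# Step 1: probit for the selection equation (12.23), fitted on the full sample
Z = sm.add_constant(z)
probit_fit = sm.Probit(selected.astype(float), Z).fit(disp=0)
index = Z @ probit_fit.params
inv_mills = norm.pdf(index) / norm.cdf(index)        # inverse Mills ratio, lambda_i

# Step 2: OLS for the outcome equation (12.24) on the selected subsample,
# with the inverse Mills ratio added as a regressor to correct the selection bias
X_outcome = sm.add_constant(np.column_stack([x[selected], inv_mills[selected]]))
ols_fit = sm.OLS(y[selected], X_outcome).fit()
print(ols_fit.params)        # constant, coefficient on x, coefficient on lambda
```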
It is useful to note that for both censored and truncated data, the parameter estimates arising from maximum likelihood estimation are the marginal effects for the whole population – that is, we can interpret them in the usual way rather than having to calculate them separately in a second step as we would have to for probit or logit models. The reason is that the latter types of models effectively involve a nonlinear transformation of the data through the normal or logistic functions, which is not the case for censored or truncated data.
14 Appendix
The Maximum Likelihood Estimator for Logit and Probit Models
Recall that under the logit formulation, the estimate of the probability that \(y_i = 1\) will be given from equation (12.4), which was \[P_i = \frac{1}{1+e^{-(\beta_1 + \beta_2 x_{2i}+...+ \beta_k x_{ki} + u_i)}}, (12A.1)\] Setting the error term, \(u_i\), to its expected value of zero for simplicity and again letting \(z_i = \beta_1 + \beta_2 x_{2i}+...+ \beta_k x_{ki}\), we have \[P_i = \frac{1}{1+e^{-z_i}}, (12A.2)\]
We will also need the probability that \(y_i \neq 1\), or equivalently the probability that \(y_i = 0\). This will be given by 1 minus the probability in equation \((12A.2)\). Given that we observe only actual zeros and ones for \(y_i\), rather than probabilities, the likelihood function for each observation \(y_i\) will be \[L_i = \left( \frac{1}{1+e^{-z_i}} \right)^{y_i} \cdot \left( \frac{1}{1+e^{z_i}} \right)^{1-y_i}, (12A.3)\]
The likelihood function that we need will be based on the joint probability for all \(N\) observations rather than an individual observation \(i\). Assuming that each observation on \(y_i\) is independent, the joint likelihood will be the product of all \(N\) marginal likelihoods. Let \(L(\theta|x_{2i}, x_{3i}, ..., x_{ki}; i = 1, \ldots, N)\) denote the likelihood function of the set of parameters \((\beta_1, \beta_2, ..., \beta_k)\) given the data. Then the likelihood function will be given by \[L(\theta) = \prod_{i=1}^N \left( \frac{1}{1+e^{-z_i}} \right)^{y_i} \cdot \left( \frac{1}{1+e^{z_i}} \right)^{1-y_i}, (12A.4)\]
As with maximum likelihood estimation of GARCH models, it is computationally much simpler to maximise an additive function of a set of variables than a multiplicative function, so long as we can ensure that the parameter estimates obtained will be the same. We thus take the natural logarithm of equation (12A.4), and it is this log-likelihood function that is maximised \[LLF = - \sum_{i=1}^N \left[y_i\ln(1+e^{-z_i})+(1-y_i) \ln(1+e^{z_i}) \right], (12A.5)\]
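As a numerical check on equation (12A.5), the sketch below codes the (negative of the) log-likelihood directly and maximises it with a general-purpose optimiser on simulated data. In applied work a canned logit routine would normally be used, and all of the numbers here are invented for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def logit_neg_llf(beta, y, X):
    """Negative of the logit log-likelihood in equation (12A.5)."""
    z = X @ beta
    # np.logaddexp(0, -z) computes ln(1 + e^{-z}) in a numerically stable way
    return np.sum(y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z))

# Simulated example data (purely illustrative)
rng = np.random.default_rng(5)
n = 2_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([0.2, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

# Maximise the log-likelihood (minimise its negative) starting from zeros
res = minimize(logit_neg_llf, x0=np.zeros(X.shape[1]), args=(y, X), method="BFGS")
print("ML estimates:", res.x)
```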
BOX 12.1 Parameter interpretation for probit and logit models
Standard errors and t-ratios will automatically be calculated by the econometric software package used, and hypothesis tests can be conducted in the usual fashion. However, interpretation of the coefficients needs slight care. It is tempting, but incorrect, to state that a 1-unit increase in \(x_{2i}\), for example, causes a \(100 \cdot \beta_2 \%\) increase in the probability that the outcome corresponding to \(y_i = 1\) will be realised. This would have been the correct interpretation for the linear probability model. However, for logit models, this interpretation would be incorrect because the form of the function is not \(P_i = \beta_1 + \beta_2 x_{2i} + u_i\), for example, but rather \(P_i = F(\beta_1 + \beta_2 x_{2i} + u_i)\), where \(F\) represents the (non-linear) logistic function. To obtain the required relationship between changes in \(x_{2i}\) and \(P_i\), we would need to differentiate \(F\) with respect to \(x_{2i}\); since the derivative of the logistic function \(F(z_i)\) with respect to its argument is \(F(z_i)(1-F(z_i))\), the chain rule gives the required derivative as \(\beta_2 F(z_i)(1-F(z_i))\). So in fact, a 1-unit increase in \(x_{2i}\) will cause a \(\beta_2 F(z_i)(1-F(z_i))\) increase in probability. Usually, these impacts of incremental changes in an explanatory variable are evaluated by setting each explanatory variable to its mean value. For example, suppose we have estimated the following logit model with three explanatory variables using maximum likelihood \[\hat P_i = \frac{1}{1+e^{-(0.1+0.3 x_{2i}-0.6 x_{3i}+0.9 x_{4i})}}, (12.7)\] Thus we have \(\hat \beta_1 =0.1, \hat \beta_2=0.3,\hat \beta_3 = -0.6,\hat \beta_4=0.9\). We now need to calculate \(F(z_i)\), for which we need the means of the explanatory variables, where \(z_i\) is defined as before. Suppose that these means are \(\bar x_2 = 1.6\), \(\bar x_3 = 0.2\) and \(\bar x_4 = 0.1\); then the estimate of \(F(z_i)\) will be given by
\[\hat P_i = \frac{1}{1+e^{-(0.1+0.3 \cdot 1.6 -0.6 \cdot 0.2 + 0.9 \cdot 0.1)}}=0.63, (12.8)\]
Thus a 1-unit increase in \(x_2\) will cause an increase in the probability that the outcome corresponding to \(y_i = 1\) will occur by \(0.3 \cdot 0.63 \cdot 0.37 = 0.07\). The corresponding changes in probability for variables \(x_3\) and \(x_4\) are \(−0.6 \cdot 0.63 \cdot 0.37 = −0.14\) and \(0.9 \cdot 0.63 \cdot 0.37 = 0.21\), respectively. These estimates are sometimes known as the marginal effects. There is also another way of interpreting discrete choice models, known as the random utility model. The idea is that we can view the value of y that is chosen by individual \(i\) (either \(0\) or \(1\)) as giving that person a particular level of utility, and the choice that is made will obviously be the one that generates the highest level of utility. This interpretation is particularly useful in the situation where the person faces a choice between more than two possibilities as in Section 12.9 below.
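For completeness, the following snippet simply reproduces the arithmetic behind the marginal effects reported above, evaluated at the assumed sample means.

```python
import numpy as np

# Estimated logit coefficients from equation (12.7) and the assumed sample means
beta = np.array([0.1, 0.3, -0.6, 0.9])       # beta_1, beta_2, beta_3, beta_4
x_bar = np.array([1.0, 1.6, 0.2, 0.1])        # constant plus means of x_2, x_3, x_4

z_bar = beta @ x_bar                          # linear index evaluated at the means
F = 1.0 / (1.0 + np.exp(-z_bar))              # fitted probability, equation (12.8)
marginal_effects = beta[1:] * F * (1.0 - F)   # beta_j * F * (1 - F) for each regressor

print(f"F(z) = {F:.2f}")                                             # approximately 0.63
print("marginal effects at the means:", marginal_effects.round(2))   # [0.07, -0.14, 0.21]
```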