The main objective of this post is to explain the Tobit model, also called the censored regression model, which is used to estimate the relationship between a censored continuous dependent variable and other variables. A variable is called censored (right or left) when every case whose true value lies at or beyond some threshold is recorded at the threshold itself, even though the true value may be more extreme. In left censoring (censoring from below), values that fall at or below some threshold are censored. In right censoring (censoring from above), values that fall at or above some threshold are censored.
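To make the two kinds of censoring concrete, here is a minimal sketch in base R. The values in `y_true` and the thresholds 0 and 5 are made up for illustration and do not come from any dataset used in this post.

```r
# Hypothetical true (uncensored) values, invented for this illustration
y_true <- c(-3, -1, 0, 2, 5, 9)

# Left censoring at 0: anything at or below 0 is recorded as 0
y_left <- pmax(y_true, 0)

# Right censoring at 5: anything at or above 5 is recorded as 5
y_right <- pmin(y_true, 5)

y_left   # values: 0 0 0 2 5 9
y_right  # values: -3 -1 0 2 5 5
```

In both cases the number of observations is unchanged; only the recorded values beyond the threshold lose information.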
Truncation and censoring are two distinct phenomena that cause our samples to be incomplete. They arise in medical sciences, engineering, social sciences, and other research fields. If we ignore truncation or censoring when analyzing our data, our estimates of population parameters will be inconsistent. In the censored regression model, there are data on both buyers and nonbuyers, as there would be if the data were obtained via simple random sampling of the adult population. If, however, the data are collected from sales tax records, then the data would include only buyers: there would be no data at all for nonbuyers. Data in which observations are unavailable above or below a threshold (data for buyers only) are called truncated data. The truncated regression model is a regression model applied to data in which observations are simply unavailable when the dependent variable is above or below a certain cutoff (Introduction to Econometrics by Stock and Watson, Ch. 11).
Censoring or truncation happens during the sampling process. For example, suppose we measure monthly household income and record every value above Rs. 200,000 as 200,000. We then have data on all X-variables for every household, but the response variable is censored from above. In truncated data, households with income above Rs. 200,000 are missing from the sample altogether, so neither their income nor their X-variables are available. A censored sample is therefore representative of the population, with certain values not recorded exactly, while a truncated sample is not representative.
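The income example can be sketched in base R as follows. The data frame `households`, its column names, and all of its numbers are hypothetical; only the Rs. 200,000 threshold comes from the text.

```r
# Hypothetical household data (made-up numbers, rupees per month)
households <- data.frame(
  income  = c(45000, 120000, 200000, 310000, 80000, 650000),
  hh_size = c(4, 5, 3, 6, 2, 4)
)
cap <- 200000  # recording threshold from the example

# Censored sample: every household is kept, but incomes above the cap
# are recorded as the cap
censored <- transform(households, income = pmin(income, cap))

# Truncated sample: households above the cap are missing entirely,
# including their X-variables
truncated <- subset(households, income <= cap)

nrow(censored)   # 6 rows: all households, some incomes capped
nrow(truncated)  # 4 rows: two households dropped
```

The censored sample still contains `hh_size` for the two high-income households, which is exactly the information a truncated sample loses.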
1) There are a number of customers in a mall (buyers and non-buyers). In censored data, non-buyers' spending is recorded as zero while buyers' consumption is observed. In truncated data, only buyers are in the sample.
2) In student evaluation, a CGPA of 4 means that a student scored above a certain percentage of marks, but the 4 does not measure the exact scores of these students. There is therefore a high concentration of values at a CGPA of 4, and the data are right censored.
3) Consider the situation in which we have a measure of academic aptitude (scaled 200-800) which we want to model using reading and math test scores, as well as the type of program the student is enrolled in (academic, general, or vocational). The problem here is that students who answer all questions on the academic aptitude test correctly receive a score of 800, even though it is likely that these students are not “truly” equal in aptitude. The same is true of students who answer all of the questions incorrectly. All such students would have a score of 200, although they may not all be of equal aptitude. This example is taken from Tobit Analysis Using R, linked below.
Tobit Analysis Using R. For more details, the stat blog and R blog, and the citations mentioned in these blogs, may be very useful for understanding the concept. Introductory Econometrics by Wooldridge, Ch. 17, provides very useful guidelines for dealing with limited dependent variables.
Applying OLS to censored or truncated data gives misleading results. For censored data we use the censored (Tobit) regression model, and for truncated data we use truncated regression.
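A small simulation can illustrate why OLS is misleading here. The sketch below uses only base R and simulated data. The sample size, the true coefficients, and the hand-written likelihood function `tobit_nll` are assumptions for illustration, not the implementation used by any package. It compares the OLS slope on the latent outcome, the censored sample, and the truncated sample with a Tobit slope obtained by maximum likelihood.

```r
set.seed(123)

# Simulated data (an assumption for illustration): the true slope is 2
n      <- 5000
x      <- rnorm(n)
y_star <- -1 + 2 * x + rnorm(n)  # latent outcome, not observed in practice
y      <- pmax(y_star, 0)        # observed outcome, left censored at 0

# OLS on the latent outcome (infeasible benchmark), on the censored
# sample, and on the truncated sample (censored rows dropped)
slope_latent    <- coef(lm(y_star ~ x))[["x"]]
slope_censored  <- coef(lm(y ~ x))[["x"]]
slope_truncated <- coef(lm(y ~ x, subset = y > 0))[["x"]]

# Average negative log-likelihood of a Tobit model censored from below at 0.
# par = (intercept, slope, log(sigma)); the log keeps sigma positive.
tobit_nll <- function(par, y, x) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  ll <- ifelse(
    y <= 0,
    pnorm(0, mean = mu, sd = sigma, log.p = TRUE),  # P(y* <= 0), censored rows
    dnorm(y, mean = mu, sd = sigma, log = TRUE)     # density, uncensored rows
  )
  -mean(ll)
}

fit <- optim(c(0, 1, 0), tobit_nll, y = y, x = x, method = "BFGS")
slope_tobit <- fit$par[2]

round(c(latent = slope_latent, censored = slope_censored,
        truncated = slope_truncated, tobit = slope_tobit), 2)
```

In this setup the OLS slopes on the censored and truncated samples should come out noticeably below 2, while the Tobit slope should be close to the value used to generate the data.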
As mentioned above, censored data include a large number of observations for which the dependent variable takes one value, or a limited number of values. An example is the mroz data, where about 43 percent of the women observed are not in the labour force, so their market hours worked are zero.
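The share of zero-hour observations can be checked directly. This sketch assumes the `mroz` data come from the `wooldridge` package, which the post does not state explicitly; the variable name `hours` is the one that appears in the model output further down.

```r
# Assumption: mroz is loaded from the wooldridge package
data("mroz", package = "wooldridge")

n_total    <- nrow(mroz)            # number of women in the sample
n_censored <- sum(mroz$hours == 0)  # women with zero market hours

n_total
n_censored
n_censored / n_total                # share of left-censored observations
```

If the data source matches, these counts should agree with the "Total" and "Left-censored" figures reported in the Tobit output below.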
The graph shows the histogram of the variable wage in the mroz dataset.
To estimate a Tobit model, we can load the AER package and use its tobit() function. Another option is the censReg() function from the censReg package. Both are very easy to use, and the rest is similar to linear regression methods: just enter the formula and the data set and you will get your estimates.
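A call of the following form should produce a summary like the one shown next. The formula and the data name are taken from the `Call:` line of that output; loading `mroz` from the `wooldridge` package is an assumption, since the post does not show where the data come from.

```r
library(AER)  # provides tobit()

# Assumption: mroz is loaded from the wooldridge package
data("mroz", package = "wooldridge")

# tobit() censors from below at 0 by default (left = 0, right = Inf),
# which matches hours worked piling up at zero
tobit_fit <- tobit(hours ~ nwifeinc + educ + exper + expersq + age +
                     kidslt6 + kidsge6, data = mroz)

summary(tobit_fit)
```

Other censoring points can be set through the `left` and `right` arguments of `tobit()`.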
##
## Call:
## tobit(formula = hours ~ nwifeinc + educ + exper + expersq + age +
##     kidslt6 + kidsge6, data = mroz)
##
## Observations:
##          Total  Left-censored     Uncensored Right-censored
##            753            325            428              0
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  965.30530  446.43614   2.162 0.030599 *
## nwifeinc      -8.81424    4.45910  -1.977 0.048077 *
## educ          80.64561   21.58324   3.736 0.000187 ***
## exper        131.56430   17.27939   7.614 2.66e-14 ***
## expersq       -1.86416    0.53766  -3.467 0.000526 ***
## age          -54.40501    7.41850  -7.334 2.24e-13 ***
## kidslt6     -894.02174  111.87804  -7.991 1.34e-15 ***
## kidsge6      -16.21800   38.64139  -0.420 0.674701
## Log(scale)     7.02289    0.03706 189.514  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Scale: 1122
##
## Gaussian distribution
## Number of Newton-Raphson Iterations: 4
## Log-likelihood: -3819 on 9 Df
## Wald-statistic: 253.9 on 7 Df, p-value: < 2.22e-16
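The table that follows compares OLS and Tobit estimates for the same specification. Its layout is the one produced by the stargazer package, but the exact call is not shown in the post, so the following is only a sketch of how such a table could be generated. It again assumes that `mroz` comes from the `wooldridge` package.

```r
library(AER)        # provides tobit()
library(stargazer)  # side-by-side regression tables

# Assumption: mroz is loaded from the wooldridge package
data("mroz", package = "wooldridge")

# Same specification estimated by OLS and by Tobit
ols_fit <- lm(hours ~ nwifeinc + educ + exper + expersq + age +
                kidslt6 + kidsge6, data = mroz)
tobit_fit <- tobit(hours ~ nwifeinc + educ + exper + expersq + age +
                     kidslt6 + kidsge6, data = mroz)

# Plain-text table with OLS in column (1) and Tobit in column (2)
stargazer(ols_fit, tobit_fit, type = "text")
```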
##
## ===============================================================
##                                 Dependent variable:
##                     -------------------------------------------
##                                        hours
##                               OLS                  Tobit
##                               (1)                   (2)
## ---------------------------------------------------------------
## nwifeinc                    -3.447               -8.814**
##                             (2.544)               (4.459)
##
## educ                       28.761**              80.646***
##                            (12.955)              (21.583)
##
## exper                      65.673***            131.564***
##                             (9.963)              (17.279)
##
## expersq                    -0.700**              -1.864***
##                             (0.325)               (0.538)
##
## age                       -30.512***            -54.405***
##                             (4.364)               (7.419)
##
## kidslt6                   -442.090***           -894.022***
##                            (58.847)              (111.878)
##
## kidsge6                     -32.779               -16.218
##                            (23.176)              (38.641)
##
## Constant                 1,330.482***            965.305**
##                            (270.785)             (446.436)
##
## ---------------------------------------------------------------
## Observations                  753                   753
## R2                           0.266
## Adjusted R2                  0.259
## Log Likelihood                                  -3,819.095
## Residual Std. Error   750.179 (df = 745)
## F Statistic         38.495*** (df = 7; 745)
## Wald Test                                   253.862*** (df = 7)
## ===============================================================
## Note:                               *p<0.1; **p<0.05; ***p<0.01