Truncated regression is used to model dependent variables for which some of the observations are not included in the analysis because of the value of the dependent variable.

#install.packages("truncreg")
require(foreign)
require(ggplot2)
require(truncreg)
require(boot)
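
To make the idea concrete before turning to the real data, here is a small simulated illustration (not part of the original example): we generate a complete sample and then drop every case whose outcome falls at or below a cutoff of 40, mimicking a left-truncated sample.

# simulate a left-truncated sample: cases with y <= 40 are never observed
set.seed(1)
x <- rnorm(500, mean = 50, sd = 10)
y <- 10 + 0.8 * x + rnorm(500, sd = 8)
sim <- subset(data.frame(x, y), y > 40)
nrow(sim)  # fewer than 500 rows: the lower tail of y is missing by design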

Examples of truncated regression

Example 1. A study of students in a special GATE (gifted and talented education) program wishes to model achievement as a function of language skills and the type of program in which the student is enrolled. Because students need a minimum achievement score of 40 to enter the program, the sample is truncated at 40.

Description of the data

Let’s pursue Example 1 from above. We have a hypothetical data file, truncreg.dta, with 178 observations. The outcome variable is called achiv, and the language test score variable is called langscore. The variable prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled.

Let’s look at the data. It is always a good idea to start with descriptive statistics.

dat <- read.dta("https://stats.idre.ucla.edu/stat/data/truncreg.dta")

summary(dat)
##        id             achiv         langscore           prog    
##  Min.   :  3.00   Min.   :41.00   Min.   :31.00   general : 40  
##  1st Qu.: 55.25   1st Qu.:47.00   1st Qu.:47.50   academic:101  
##  Median :102.50   Median :52.00   Median :56.00   vocation: 37  
##  Mean   :103.62   Mean   :54.24   Mean   :54.01                 
##  3rd Qu.:151.75   3rd Qu.:63.00   3rd Qu.:61.75                 
##  Max.   :200.00   Max.   :76.00   Max.   :67.00
# histogram of achiv coloured by program type
ggplot(dat, aes(achiv, fill = prog)) +
  geom_histogram(binwidth = 3)

[Figure: histogram of achiv by program type]

# boxplot of achiv by program type
ggplot(dat, aes(prog, achiv)) +
  geom_boxplot() +
  geom_jitter()

ggplot(dat, aes(x = langscore, y = achiv)) +
  geom_point() +
  stat_smooth(method = "loess") +
  facet_grid(. ~ prog, margins = TRUE)
## `geom_smooth()` using formula 'y ~ x'


Analysis methods you might consider

There are several analysis methods you may have encountered for data like these. Some are quite reasonable, while others have either fallen out of favor or have limitations. In particular, OLS regression does not adjust the coefficient estimates for the truncation of the sample at 40, so its estimates may be severely biased; truncated regression addresses exactly this problem.
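
As a point of reference (a sketch, not part of the original analysis), the naive OLS fit is shown below; its estimates take no account of the truncation.

# naive OLS fit that ignores the truncation at achiv = 40; the langscore
# slope is typically attenuated relative to the truncated-regression
# estimate obtained below
m.ols <- lm(achiv ~ langscore + prog, data = dat)
coef(m.ols)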

Truncated regression

Below we use the truncreg function in the truncreg package to estimate a truncated regression model. The point argument indicates where the data are truncated, and the direction argument indicates whether the truncation is from the left or the right (the default is "left").

m <- truncreg(achiv ~ langscore + prog, data = dat, point = 40, direction = "left")

summary(m)
## 
## Call:
## truncreg(formula = achiv ~ langscore + prog, data = dat, point = 40, 
##     direction = "left")
## 
## BFGS maximization method
## 57 iterations, 0h:0m:0s 
## g'(-H)^-1g = 2.5E-05 
##  
## 
## 
## Coefficients :
##              Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)  11.29942    6.77173  1.6686   0.09519 .  
## langscore     0.71267    0.11446  6.2264 4.773e-10 ***
## progacademic  4.06267    2.05432  1.9776   0.04797 *  
## progvocation -1.14422    2.66958 -0.4286   0.66821    
## sigma         8.75368    0.66647 13.1343 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -591.31 on 5 Df


# update old model dropping prog
m2 <- update(m, . ~ . - prog)

# likelihood ratio test: does dropping prog significantly worsen the fit?
pchisq(-2 * (logLik(m2) - logLik(m)), df = 2, lower.tail = FALSE)
## 'log Lik.' 0.02516651 (df=3)

The two degree-of-freedom chi-square test indicates that prog is a statistically significant predictor of achiv. We can get the expected means for each program at the mean of langscore by reparameterizing the model.

# create mean centered langscore to use later
dat <- within(dat, {mlangscore <- langscore - mean(langscore)})

malt <- truncreg(achiv ~ 0 + mlangscore + prog, data = dat, point = 40)
summary(malt)
## 
## Call:
## truncreg(formula = achiv ~ 0 + mlangscore + prog, data = dat, 
##     point = 40)
## 
## BFGS maximization method
## 21 iterations, 0h:0m:0s 
## g'(-H)^-1g = 4.3E-07 
##  
## 
## 
## Coefficients :
##              Estimate Std. Error t-value  Pr(>|t|)    
## mlangscore    0.71259    0.11448  6.2248  4.82e-10 ***
## proggeneral  49.78926    1.89714 26.2443 < 2.2e-16 ***
## progacademic 53.85340    1.15012 46.8242 < 2.2e-16 ***
## progvocation 48.65315    2.14049 22.7299 < 2.2e-16 ***
## sigma         8.75545    0.66684 13.1299 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -591.31 on 5 Df

Notice that all that has changed is that the intercept is gone and the prog coefficients are now the expected values of achiv for each type of program when langscore is at its mean.
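
As a quick check (a sketch using only objects created above), the same expected means can be recovered from the original model m by evaluating its linear predictor at the mean of langscore.

# expected achiv at mean langscore for each program, from the original
# model m; these should match the prog coefficients reported by malt
b <- coef(m)
xbar <- mean(dat$langscore)
c(general  = unname(b["(Intercept)"] + b["langscore"] * xbar),
  academic = unname(b["(Intercept)"] + b["langscore"] * xbar + b["progacademic"]),
  vocation = unname(b["(Intercept)"] + b["langscore"] * xbar + b["progvocation"]))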

We could also calculate the bootstrapped confidence intervals if we wanted to.

First, we define a function that returns the parameters of interest, and then use the boot function to run the bootstrap.

f <- function(data, i) {
  require(truncreg)
  m <- truncreg(formula = achiv ~ langscore + prog, data = data[i, ], point = 40)
  # return the estimates and standard errors interleaved:
  # est1, se1, est2, se2, ...
  as.vector(t(summary(m)$coefficients[, 1:2]))
}
set.seed(10)

(res <- boot(dat, f, R = 1200, parallel = "snow", ncpus = 4))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = dat, statistic = f, R = 1200, parallel = "snow", 
##     ncpus = 4)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1*  11.2994158 -0.0750432302  5.91023663
## t2*   6.7717257  0.0099996401  0.86430871
## t3*   0.7126732  0.0019041192  0.09730007
## t4*   0.1144602  0.0002004845  0.01366713
## t5*   4.0626698 -0.0485288252  1.94566248
## t6*   2.0543191  0.0076934779  0.24494415
## t7*  -1.1442162  0.0425573767  2.81757637
## t8*   2.6695799  0.0219657954  0.29789986
## t9*   8.7536778 -0.0900719907  0.54763875
## t10*  0.6664744 -0.0083607460  0.07505858
## 

We could use the bootstrapped standard errors to get normal-approximation significance tests and confidence intervals for every parameter. Instead, however, we will obtain the percentile and bias-corrected and accelerated (BCa) 95 percent confidence intervals using the boot.ci function.
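
For reference, here is what that normal approximation looks like for a single parameter (a sketch using the res object above; the langscore coefficient is the third element of the statistic returned by f).

# normal-approximation 95% CI for the langscore coefficient,
# built from its bootstrap replicates (column 3 of res$t)
est <- res$t0[3]
se.boot <- sd(res$t[, 3])
c(LL = est - 1.96 * se.boot, UL = est + 1.96 * se.boot)
# boot.ci(res, index = 3, type = "norm") gives the bias-adjusted analogue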

# basic parameter estimates with percentile and bias-corrected CIs;
# odd indices of the bootstrap statistic are estimates, even indices their SEs
parms <- t(sapply(c(1, 3, 5, 7, 9), function(i) {
    out <- boot.ci(res, index = c(i, i + 1), type = c("perc", "bca"))
    with(out, c(Est = t0, pLL = percent[4], pUL = percent[5], bcaLL = bca[4],
        bcaUL = bca[5]))
}))

# add row names
row.names(parms) <- names(coef(m))
# print results
parms
##                     Est         pLL        pUL       bcaLL      bcaUL
## (Intercept)  11.2994158 -0.47667113 22.8722200 -0.55662632 22.8711476
## langscore     0.7126732  0.52197893  0.9055977  0.52286226  0.9065408
## progacademic  4.0626698 -0.06878459  7.8897378  0.04873108  7.9296538
## progvocation -1.1442162 -6.79210189  4.3581860 -6.92519441  4.0693050
## sigma         8.7536778  7.62374802  9.7726403  7.86169145  9.9453352

The conclusions are the same as from the default model tests. You can compute a rough estimate of the degree of association for the overall model by correlating achiv with the predicted value and squaring the result.

dat$yhat <- fitted(m)

# correlation
(r <- with(dat, cor(achiv, yhat)))
## [1] 0.5524392
# rough variance accounted for
r^2
## [1] 0.305189

The calculated value of .31 is a rough estimate of the R2 you would find in an OLS regression: the squared correlation between the observed and predicted achievement values indicates that these predictors account for roughly 30 percent of the variability in the outcome variable.
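
For comparison (again a sketch, not part of the original analysis), the conventional R2 from the naive OLS fit on the same truncated sample can be computed directly; the two measures are not directly comparable, but the contrast is informative.

# conventional R^2 from the naive OLS fit, for a rough comparison
summary(lm(achiv ~ langscore + prog, data = dat))$r.squared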

Things to consider