Some observations are continuous (e.g., temperature), and others are discrete (e.g., counts). A surprising number of data types combine both. In this example, basal area is continuous and positive or discretely zero. How does one accommodate both in a model? Here we use imputation to adapt regression to the case where observations are censored at zero. Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.

Logistics

Resources

  • Software includes:

    • Class code on Canvas/Modules/Code and data: clarkFunctions2026.R

    • R packages: gjam, repmis, MASS

For next time

  • Unit 3 problems (this vignette) to Canvas/Discussion folder

Today’s plan

  • short assessment on Canvas
  • Exploratory Data Analysis (EDA)
  • Data that combine continuous and discrete responses
  • Imputation
    • missing data (predictors and responses)
    • censored data (Tobit model)
    • hierarchical models
    • it is not “pseudoabsence”
  • Bayesian regression with Tobit model

Objectives:

  • Zeros: an imputation approach with the Tobit model
  • Evaluate correlation structure with MDS
  • Build and manipulate a design matrix
  • Interpret elements of a traditional regression model
  • Compare with the Bayesian approach, including the Tobit model


Once you have downloaded the file clarkFunctions2026.R from Canvas, move it to your working directory and source it this way:

source('clarkFunctions2026.R')

Case study: USDA’s Forest Inventory and Analysis (FIA) program

Monitoring networks are increasingly used to evaluate geographic patterns in biodiversity and ecosystem properties, how they are changing, and why. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventory programs.

The FIA network provides an opportunity to illustrate some of the concepts that are important for this course. The full data set is too big to manipulate with the 8GB or 16GB of RAM installed in most modern laptops (or desktops). [If your machine can expand to 32GB, you should do this–it is not expensive and can make a big difference.] Every tree is measured on > \(10^5\) plots, shown below, distributed across the country. Data include > 300 tree species. The latest update includes > 20M rows of tree-years and > 200 variables.

The plots themselves are deployed as a stratified-random design, with the stratification over a uniform geographic grid. Plots are remeasured at intervals of several years, depending on state. Data include tree diameter, crown class, and a large number of codes related to individual condition. There are small seedling plots and larger tree plots, but even the latter are small: approximately 1/14th of a hectare in the eastern US, larger in the West. Environmental variables include slope, aspect, estimated stand age, and site moisture status.

200,000 FIA plots support 20M sampled trees, here showing young stands in the South and East (yellow at left) and dry site classes in mountainous sites and southern Plains (brown at right).

Exploratory data analysis (EDA)

With large observational data sets, exploratory data analysis (EDA) comes before model building. I need to explore whether there is pattern in the data before investing weeks in building a hierarchical model. In this section I introduce a few approaches applied to the FIA.

Although we cannot manipulate the entire data set in class, we can examine a subset of it. To construct the subset, I clustered FIA plots based on their environmental characteristics, aggregating them into approximately 1-ha plots. To manipulate FIA data I use built-in functions and packages. A package is a bundle of functions that together offer tools for a related set of problems. To use a function that is contained in a package, I first need to install the package. Here is the installation of two packages:

install.packages('repmis')  # read from github
install.packages('gjam')    # extract file

Once a package is installed, I can make all of its functions available in my workspace using the library function. Alternatively, I can use a single function without attaching the package, with the syntax package::function.
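
For example, the call source_data(d) in the next block relies on library(repmis); the :: syntax would make that attachment unnecessary (here d is the file URL defined in the next code block):

library( repmis )          # attach the package, then call the function by name
source_data( d )
repmis::source_data( d )   # equivalent, without attaching the package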

In the following code I set a variable biomass to the column for Pinus taeda. I get the data from a remote repository using the function source_data(fileName) in the package repmis:

library( gjam )
library( repmis )
library( MASS )
d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
source_data(d)
## [1] "forestTraits"
data    <- forestTraits$xdata                         # predictors
tmp     <- gjamReZero(forestTraits$treesDeZero)       # decompress
biomass <- tmp[,"pinuTaed"]                           # response
data    <- cbind(biomass, data)                       # all variables

The function gjamReZero, from the package gjam, re-expands a matrix that was stored in compressed form, i.e., without its many zeros.

The data.frame data includes both continuous variables and factors. It is formatted as observations (rows) by variables (columns). I’ll say more about formats as we go. I can use sapply to determine storage modes:

sapply( data, mode )               # all are numeric, but some could be factors
##   biomass      temp   deficit  moisture        u1        u2        u3    stdage 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##      soil 
## "numeric"
which( sapply( data, is.factor ) ) # found it
## soil 
##    9

As always, I start with some simple exploratory data analysis (EDA). First I plot some variables in data using the R function plot. Soil is a factor in the model, meaning that it is stored numerically but takes only discrete levels. It has five levels, one being a reference type. In these first plots I assign colors to soil types.

col <- match( data$soil, levels(data$soil) ) # symbol color from soil type
cex <- .01 + sqrt( biomass/max(biomass) )    # symbol size is response:biomass

par( mfrow=c(1,2), bty='n', mar=c(4,4,3,1), omi = c(.5, .1, .1, .1) )
plot(data$temp, data$deficit, xlab='',         # climate deficit
     ylab='Moisture deficit', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$temp, data$moisture, xlab='',        # local moisture level
     ylab='Site moisture', cex=cex, col = col)
mtext( 'Winter temperature', 1, outer = T )
*Some variables in the data.frame data*

A high moisture deficit (deficit) means dry.

Recall, the syntax data$temp indicates that data is a list. In this instance, data is a specific type of list called a data.frame. The $temp part extracts the element named temp from the list. Use the help(plot) page to interpret the arguments to the function plot.

par(mfrow=c(1,2),bty='n', mar=c(4,4,1,3))

plot(data$deficit, data$moisture, xlab='Deficit', 
     ylab='Site moisture', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$soil, sqrt(biomass), xlab = '', ylab = 'Biomass (sqrt scale)', xaxt = 'n')
axis(1, at = 1:5, las = 2, labels = levels(data$soil) )
*Additional variables, including a factor soil.*

Note that moisture and deficit are not correlated. This lack of (negative) correlation reflects the fact that the moisture variable refers to local drainage, while deficit refers to the regional climate.
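
This can be checked directly with the sample correlation:

cor( data$deficit, data$moisture )   # near zero: deficit and moisture are unrelated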

Here is a pairs plot to examine multiple variables:

pairs( data[,c(1:4, 8, 9)], cex=.1 )

Here is non-metric multidimensional scaling (MDS) to extract structure. I first evaluate the structure between observations. I follow that with the structure between variables. I plot the observations colored by stand age (negative low to positive high) and the names of variables, with arrows pointing outward from the origin of the centered matrix.

dcols <- as.matrix( data[,c(1:4, 8)] )             # response and predictor columns
dcols[,'biomass'] <- sqrt( dcols[,'biomass'] )     # down-scale extremes
omat <- cov2Dist( var( t(dcols) ) )                # MDS by observations
obs  <- isoMDS(omat, k = 3)

vmat <- cov2Dist( var( dcols ) )                   # MDS by variables
vars <- isoMDS(vmat, k = 3)
nv   <- nrow(vmat)
cols <- rampColor( data$stdage )

plot(obs$points[,1], obs$points[,2], pch = 16, cex = .4,
     xlab = 'Axis 1', ylab = 'Axis 2', col = cols, bty = 'n') # axis 1 driven by stdage

text(vars$points[,1], vars$points[,2],             # relationship with variables
     rownames(vars$points), cex = 1.2, col=c(1:nv))
points( 0, 0, pch = 16, cex=1.2 )
arrows( rep(0, nv), rep(0, nv), vars$points[,1], vars$points[,2], 
        col = c(1:nv), lwd = 2 )
MDS plot showing plots (dots) and variables (arrows).

Notice that the stdage variable, to which I’ve assigned a color scale, accounts for much of the variation on the first axis. I have colored the observations to show the gradient in values.
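
The function cov2Dist is a helper from clarkFunctions2026.R. A minimal sketch of the standard covariance-to-distance conversion it performs, assuming the usual definition \(d_{ij}^2 = v_{ii} + v_{jj} - 2 v_{ij}\), would be:

cov2DistSketch <- function( V ){        # V is a covariance matrix
  v  <- diag( V )                       # variances v_ii
  D2 <- outer( v, v, '+' ) - 2*V        # d_ij^2 = v_ii + v_jj - 2*v_ij
  sqrt( pmax( D2, 0 ) )                 # guard against small negative round-off
}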

I am also interested in how biomass differs by soil type. Here is use of the function tapply to obtain the mean and standard deviation of biomass by soil:

mean <- tapply( biomass, data$soil, mean, na.rm = T )
sd   <- tapply( biomass, data$soil, sd, na.rm = T )
biomassTable <- rbind(mean, sd)
print( round( biomassTable, 2 ) )
##      EntVert  Mol reference SpodHist UltKan
## mean    3.57 0.28      4.15     0.40  13.87
## sd      7.13 1.28      9.66     1.98  11.43
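
These means rest on very different sample sizes, which are worth checking before comparing soil types:

table( data$soil )   # number of plots in each soil class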

I’ll return to this example after further discussion of some basic concepts.

Exercise 1: Based on plots of soil moisture and climate deficit, does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?


Traditional regression

Based on the EDA conducted thus far, it makes sense to try a few simple regressions. The pairs plot indicates that no predictor by itself dominates. MDS suggests that stdage and moisture may show the strongest relationships with biomass, but this can be deceiving. These plots offer a “marginal” view, in the sense that they project one variable onto another without accounting for how they act in combination. A model like regression can help me see their influences in combination.

To determine the relationship between abundance and environmental variables, I want a model for the response biomass with predictors. Here are the names of variables in data:

names(data)
## [1] "biomass"  "temp"     "deficit"  "moisture" "u1"       "u2"       "u3"      
## [8] "stdage"   "soil"

Square-root where there are zeros

Because biomass is skewed, I can consider a square-root transformation. I cannot use a log transformation, because biomass includes zeros. I will come back to that later; for now, here is the square root:

data$sqrtBiomass <- sqrt( data$biomass )

These are the column headings in data. To see this, I can look at the first few rows:

biomass sqrtBiomass temp deficit moisture stdage soil
2 1.414214 1.2165433 0.0363791 0.6870299 -0.1669796 reference
0 0.000000 0.1825447 0.2070871 1.6655992 -0.0290727 reference
0 0.000000 -0.9409308 0.2034515 -0.1892591 0.4814762 SpodHist
13 3.605551 0.6435989 0.8153297 0.3900552 -0.0789539 reference
17 4.123106 0.8238641 -0.1755659 0.8500289 0.3600742 reference

Factor: soil type

With the exception of "soil", these are continuous variables. Again, the variable soil is a factor with these levels:

data$soil <- relevel(data$soil, 'reference')
attr(data$soil,'levels')
## [1] "reference" "EntVert"   "Mol"       "SpodHist"  "UltKan"

These are soil ‘orders’, based on parent material.

I start with the standard R function lm for linear regression. In the formula below I specify a model containing moisture with a quadratic effect, written I(moisture^2). The I() function in a formula indicates that the enclosed expression should be evaluated arithmetically, rather than interpreted as formula syntax.

Explore predictors

I can fit a linear regression with response biomass to several combinations of variables using lm and plot responses.

fit1 <- lm( sqrtBiomass ~ moisture, data )   
p1   <- predict( fit1 )
fit2 <- lm( sqrtBiomass ~ moisture + I(moisture^2), data )   # quadratic
p2   <- predict(fit2)                                    # predictive mean

par( mfrow=c(1,3), bty='n', mar=c(4,4,1,1) )
plot( data$moisture, data$sqrtBiomass, cex=.2 )               # biomass by site moisture
points( data$moisture, p1, col='green', cex=.2 )         # add to plot
points( data$moisture, p2, col='blue', cex=.2 )   

#repeat with additional variables
plot( data$deficit, data$sqrtBiomass, cex=.2)
form <- as.formula(sqrtBiomass ~ soil + moisture*stdage + temp + deficit)
fit3 <- lm(form, data)  
p3   <- predict(fit3)
plot( data$sqrtBiomass, p3, cex=.2, xlab = 'Observed', ylab = 'Predicted' )
abline(0, 1, lty=2)
*Biomass fitted to moisture and climate deficit showing observed and predicted.*


For more on use of variable names in the above block of code, see the note about R environments.

Plots show the predictions for a linear and a quadratic model, fit1 and fit2, and for a larger model, fit3, with main effects, an interaction moisture*stdage, and the factor soil.

I can use the function summary to look at the estimates in a fitted object, e.g.,

summary(fit3)
## 
## Call:
## lm(formula = form, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7752 -0.7905 -0.0705  0.4838  6.4013 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.766047   0.039137  19.574  < 2e-16 ***
## soilEntVert     -0.086707   0.127322  -0.681 0.495964    
## soilMol         -0.519219   0.207693  -2.500 0.012521 *  
## soilSpodHist     0.316143   0.086681   3.647 0.000274 ***
## soilUltKan       1.357620   0.175556   7.733 1.84e-14 ***
## moisture        -0.005941   0.031300  -0.190 0.849486    
## stdage          -0.318372   0.032144  -9.904  < 2e-16 ***
## temp             0.928942   0.039853  23.309  < 2e-16 ***
## deficit          0.084968   0.036915   2.302 0.021478 *  
## moisture:stdage  0.126505   0.033733   3.750 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.245 on 1607 degrees of freedom
## Multiple R-squared:  0.4456, Adjusted R-squared:  0.4425 
## F-statistic: 143.5 on 9 and 1607 DF,  p-value: < 2.2e-16
r23 <- summary(fit3)$r.squared

Standard products of a traditional regression include a table of point estimates for the coefficients in \(\boldsymbol{\beta}\), with their standard errors, which are a measure of uncertainty. There is a Student’s t statistic for each estimate, a measure of how far the estimate is from a hypothesized value of zero. For each estimate there is a P value, the probability of obtaining a value of t at least this extreme under the hypothesis that the coefficient is equal to zero. There is an F statistic and P value for the entire model. The degrees of freedom is the sample size \(n\) minus the number of fitted coefficients (here \(1617 - 10 = 1607\)). The R-squared is interpreted as the ‘variance explained’ by the model, r23 = 0.446. The residual standard error is the estimate of the parameter \(\sigma = \sqrt{\sigma^2}\). The 95% confidence intervals for these estimates are approximately the point estimates \(\pm 1.96\) times the standard errors, or:

estimate 0.025 0.975
(Intercept) 0.76600 0.6900 0.8430
soilEntVert -0.08670 -0.3360 0.1620
soilMol -0.51900 -0.9250 -0.1130
soilSpodHist 0.31600 0.1470 0.4860
soilUltKan 1.36000 1.0100 1.7000
moisture -0.00594 -0.0671 0.0553
stdage -0.31800 -0.3810 -0.2560
temp 0.92900 0.8510 1.0100
deficit 0.08500 0.0128 0.1570
moisture:stdage 0.12700 0.0606 0.1920
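
These intervals can be reproduced from the coefficient table (the function confint(fit3) gives the exact t-based version):

est <- summary( fit3 )$coefficients                 # estimates and standard errors
ci  <- cbind( estimate = est[,1],
              '0.025'  = est[,1] - 1.96*est[,2],
              '0.975'  = est[,1] + 1.96*est[,2] )
round( ci, 3 )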

Together, this combination of outputs for the linear regression would contribute to an interpretation of how abundance of this species responds to climate and habitat.

If I were to use \(p < 0.05\) as a threshold, I find that soil, stdage, temp, deficit, and moisture:stdage are significant:

summary(fit3)$coefficients
##                     Estimate Std. Error    t value      Pr(>|t|)
## (Intercept)      0.766046630 0.03913673 19.5736007  1.093334e-76
## soilEntVert     -0.086707456 0.12732205 -0.6810089  4.959640e-01
## soilMol         -0.519218802 0.20769315 -2.4999322  1.252072e-02
## soilSpodHist     0.316142515 0.08668072  3.6472069  2.735636e-04
## soilUltKan       1.357620232 0.17555616  7.7332531  1.837267e-14
## moisture        -0.005940936 0.03130027 -0.1898047  8.494862e-01
## stdage          -0.318372312 0.03214439 -9.9044432  1.728836e-22
## temp             0.928941728 0.03985294 23.3092423 9.208488e-104
## deficit          0.084968286 0.03691497  2.3017295  2.147780e-02
## moisture:stdage  0.126505456 0.03373296  3.7502033  1.829578e-04

The positive coefficient on the moisture \(\times\) stdage interaction means that moisture becomes more important in older stands.
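
One way to read the interaction is that the slope on moisture is \(\beta_{moisture} + \beta_{moisture:stdage} \times stdage\). A short sketch using the fitted coefficients:

b <- coef( fit3 )
stdAge <- c( -1, 0, 1 )                                   # young, average, old (standardized)
round( b['moisture'] + b['moisture:stdage']*stdAge, 3 )   # moisture slope grows with stand age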

Exercise 2. Repeat the regression analysis using function lm with a different species as a response. Interpret the analysis.

Design elements

This model includes a factor and an interaction. A factor has categorical levels that can be represented by words in a data.frame. In this example, data$soil is a factor. The levels for soil are attr(data$soil,'levels') = reference, EntVert, Mol, SpodHist, UltKan. In the foregoing analysis there was a coefficient for all but the first factor level, reference. To understand how factors are treated in an analysis, it’s easiest to look at the design matrix \(\mathbf{X}\) that is generated when factors are included. Here are a few lines of \(\mathbf{X}\) for this model:

tmp  <- model.frame(form, data, na.action=NULL)
x    <- model.matrix(form, data = tmp)
head( x )
##   (Intercept) soilEntVert soilMol soilSpodHist soilUltKan   moisture
## 1           1           0       0            0          0  0.6870299
## 2           1           0       0            0          0  1.6655992
## 3           1           0       0            1          0 -0.1892591
## 4           1           0       0            0          0  0.3900552
## 5           1           0       0            0          0  0.8500289
## 6           1           0       0            0          0 -0.8765680
##        stdage       temp     deficit moisture:stdage
## 1 -0.16697961  1.2165433  0.03637914     -0.11471999
## 2 -0.02907271  0.1825447  0.20708706     -0.04842349
## 3  0.48147624 -0.9409308  0.20345146     -0.09112376
## 4 -0.07895393  0.6435989  0.81532968     -0.03079639
## 5  0.36007415  0.8238641 -0.17556592      0.30607343
## 6 -0.58986965  0.2201759  0.76258277      0.51706086

The missing factor level, called reference, is represented in the model by the intercept, the first column of ones in \(\mathbf{X}\). An observation from the reference soil type will have zeros in the remaining four columns. Any other soil type will have a one in its column, so its effect is the sum of the intercept and its own coefficient.

The interaction term moisture:stdage is the product of the two columns. The coefficient table includes main effects for both moisture and stdage along with the interaction coefficient. If the interaction coefficient is positive, then the effect of moisture increases with stdage (and, symmetrically, the effect of stdage increases with moisture).
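
The design matrix confirms this: the interaction column is the elementwise product of the two main-effect columns:

all.equal( x[,'moisture']*x[,'stdage'], x[,'moisture:stdage'] )   # TRUE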

A Bayesian approach

Leaving technical detail for subsequent units, this section compares the previous result with a Bayesian analysis that incorporates a non-informative prior distribution. As mentioned above, a Bayesian analysis combines the likelihood with a prior distribution. Here, the prior distributions for the coefficients \(\boldsymbol{\beta}\) and residual variance \(\sigma^2\) are taken to be non-informative.

Recall the Bayesian model, here applied to regression for a slope and intercept model:

\[ [ \boldsymbol{\beta}, \sigma^2 | \mathbf{X}, \mathbf{y} ] \propto \prod_{i=1}^n N \left( y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2 \right) MVN( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) IG(\sigma^2 | s_1, s_2 ) \] The first factor on the right-hand side is the likelihood. The second factor is the multivariate normal prior distribution for the regression coefficients. The third factor is the inverse gamma distribution that I use as a prior for the residual variance. The prior distribution for regression coefficients is non-informative, having these prior parameter values:

\[\begin{align} \mathbf{b} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} 1000 & 0 \\ 0 & 1000 \end{bmatrix} \end{align}\]

The prior parameter values for the inverse gamma distribution are also non-informative: \(s_1 = s_2 = 0.1\). Here is how I would specify them in the code:

priorB  <- matrix( 0, 2, 1 ) # prior mean regression parameters
priorVB <- diag( 1000, 2 )   # prior covariance matrix
s1      <- .1                # variance prior values
s2      <- .1

Here is a Bayesian analysis to compare with fit3:

fitb <- bayesReg(form, data, TOBIT = F )
rsqBayes <- fitb$rsq

The function bayesReg organizes output a bit differently from lm, reporting 95% credible intervals [values at 0.025 and 0.975]. The output is simpler than that generated by lm–there are not a lot of statistics. However, I see that point estimates and standard errors for coefficients are nearly the same as I obtained with lm. I also see that the coefficients having credible intervals that do not span zero in the Bayesian analysis are the same coefficients that lm flagged as ‘significant’. I further see the same \(r^2\) values: r23 = 0.446 and rsqBayes = 0.446. Here is a plot of coefficients with errors:

*Same estimates from both methods, showing point estimates and 95% CIs.*

So what’s different about Bayes? The shift from a classical to a Bayesian analysis did not change how I interpret effects of climate and habitat on biomass of this species. However, the Bayesian model can be extended in a number of ways. In the next section I illustrate a hierarchical version having an additional stage.

Zeros are common

At least two problems affect both the traditional and Bayesian regressions: the normal distribution i) does not allow any discrete values, including zeros, and ii) is non-zero for negative values, whereas biomass can only be non-negative. Biomass data diverge from the assumptions of the model in that observations \(Y \in [0, \infty)\)–there is point mass at zero. Recall that the probability of any single value is zero for a continuous variable. Positive biomass values are continuous, but zero is discrete. Here the data are dominated by zeros:

hist(data$sqrtBiomass, nclass=50)                            # distribution
*Response has many zeros.*

length(which(data$sqrtBiomass == 0))/length(data$sqrtBiomass)
## [1] 0.733457


If there were only a few zeros, with most values bounded away from zero, I might argue that the normal model is close enough. That is not the case here. Standard diagnostic plots, such as plot(fit1), make this clear.

If these were discrete data, I might turn to a zero-inflated Poisson or negative binomial model. These are examples of generalized linear models (GLMs), to be discussed in later units. Such models sometimes work well, provided there are not too many zeros, but here zeros dominate. Regardless, these GLMs describe discrete data, so I cannot use either here.

I cannot simply add a small amount to each observation and then transform the data to a log scale. If I do that, every coefficient I estimate depends on the arbitrary value I added.
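
A quick sketch shows the problem, comparing two arbitrary offsets; the fitted coefficients differ, so neither is ‘the’ estimate:

fitA <- lm( log( biomass + 0.1 ) ~ temp + moisture, data )    # offset 0.1
fitB <- lm( log( biomass + 1 )   ~ temp + moisture, data )    # offset 1
rbind( offset.1 = coef( fitA ), offset1 = coef( fitB ) )      # coefficients depend on the offset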

Tobit model

A Tobit regression allows for continuous responses with zeros, and it is readily implemented as a Bayesian model. Part (b) of the model graph includes an additional stage for a variable \(W\). This variable has a normal distribution; it is defined on \((-\infty, \infty)\). The observed \(Y\) is a censored version of \(W\): it equals \(W\) whenever \(W > 0\), and when \(W \leq 0\) the observation is censored at \(Y = 0\):

\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \]

Treating \(Y\) as a censored version of \(W\) allows me to combine continuous and censored data without changing the scale. In a Tobit model the censoring value is zero, but the same idea works with other values.
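
A small simulation makes the censoring rule concrete: zeros in \(y\) arise exactly where the latent \(w\) falls at or below zero. The values 0.3 and 1 here are arbitrary:

set.seed( 1 )
w <- rnorm( 1000, 0.3, 1 )     # latent variable on (-Inf, Inf)
y <- pmax( w, 0 )              # observed: y = w when w > 0, else 0
mean( y == 0 )                 # observed fraction of zeros
pnorm( 0, 0.3, 1 )             # P( w <= 0 ), which it should approximate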

Latent variable imputation

With this model the regression moves to the latent \(W\),

\[w_i \sim N(\mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The model has become hierarchical. I fit this model in a Bayesian setting, which again requires a prior distribution.

Now allowing for the zeros in \(Y\), the fitted coefficients differ substantially from both previous models:

fitTobit <- bayesReg( form, data, TOBIT=T )
## fitted as Tobit model
par( mfrow=c(1,2), bty='n' )
plot( data$sqrtBiomass, p3, cex=.5, ylab='Predicted values',
      ylim = c( -5, 10 ), pch = 16 )
points (data$sqrtBiomass, fitTobit$predict[,2], col=2, cex=.5, pch = 16 )
abline( 0, 1, lty=2 )
abline( h=0)

rbeta <- summary(fit3)$coefficients   # standard regression coeffs
tbeta <- fitTobit$beta                # Bayesian Tobit coeffs
plot( rbeta[,1], tbeta[,1], xlim = c(-2, 8), 
      xlab='Traditional regression', ylab='Tobit')
text( rbeta[,1], tbeta[,1], rownames(tbeta), pos = 4, cex = .8 )
abline( 0, 1, lty=2 )
*Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.*

The \(r^2\) value for the Tobit is substantially improved, fitTobit$rsq = 0.565. Not only is the fit improved, but fewer coefficients are different from zero: soil, stdage, temp. So once we have allowed for the discrete zeros, the interpretation changes.

In terms of the latent \(w_i\), the relationship looks like this:

tmp  <- model.frame(form, data, na.action=NULL)
x    <- model.matrix(form, data=tmp)
what <- x%*%tbeta[,1]                  # latent-scale prediction uses the Tobit coefficients
plot( data$sqrtBiomass, what, cex = .3, xlab = 'y', ylab = 'w', bty = 'n', pch = 16 )

Note that for \(y = 0\) we have \(w < 0\).


Exercise 3. Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression model.

I called the Tobit model ‘hierarchical’, but some could see it differently. The \(W\) stage in the model is ‘partially known’. Note the dashed line in part (c) of the graphical model. If I know the value of \(W\), then I also know the value of \(Y\). So in the ‘generative direction’, from \(W\) to \(Y\), I can view their relationship as deterministic. However, knowing \(Y\) does not necessarily mean that I know \(W\): if \(Y = 0\), then \(W\) is stochastic.

Connection to Gibbs sampling

When we talk about posterior simulation, we’ll discuss conditional sampling from a posterior distribution. A common technique is called Gibbs sampling. Suppose I want to simulate the joint distribution of parameters given the observations. In the case of the Tobit model, the unknowns include not only \(\beta\) and \(\sigma^2\), but also the values of \(w_i\) for which \(y_i = 0\). I represent these unknown values as \(\mathbf{w}_0\). I can write the joint distribution of unknowns like this:

\[ [ \beta, \sigma^2, \mathbf{w}_0 | \mathbf{X}, \mathbf{y} ] \] When we discuss MCMC we’ll see why this joint distribution can be simulated with a three-step process:

  1. sample from the conditional distribution \([ \beta | \sigma^2, \mathbf{w}_0, \mathbf{X}, \mathbf{y} ]\)
  2. then sample from \([ \sigma^2 | \beta, \mathbf{w}_0, \mathbf{X}, \mathbf{y} ]\)
  3. finally, sample from \([ \mathbf{w}_0 | \beta, \sigma^2, \mathbf{X}, \mathbf{y} ]\)

In each case I sample from one group of unknowns conditional on the others. The order of sampling does not matter. Here I initialize parameter values (\(\beta, \sigma^2\)) and \(\mathbf{w}_0\), the latter drawn from a truncated normal to ensure that the values are negative. I then alternately sample as outlined above to show how the estimates of \(\beta\) respond to the updated values of \(\mathbf{w}_0\), and vice versa:

# initial values
sg  <- 1
bg  <- matrix( 0, ncol(x), 1, dimnames = list( colnames(x), NULL ) )
w   <- data$sqrtBiomass
w0  <- which( data$sqrtBiomass == 0 )
w[ w0 ] <- .tnorm( length(w0), -Inf, 0, -1, 1 )

# prior parameters
priorB   <- bg*0                       # normal coefficient mean
priorIVB <- diag( .01, nrow(priorB) )  # normal coefficient precision
s1 <- s2 <- 1                          # IG variance

XX <- crossprod( x )

tcols <- c('(Intercept)','temp')
plot( x[,'temp'], data$sqrtBiomass, ylim = c(-10, 10), pch = 16, col = 'grey', bty = 'n' )
abline( h = 0, lwd = 2, lty = 2, col = 'grey' )

for( g in 1:4){
  xg <- x[,tcols]%*%bg[tcols, , drop = FALSE]                # predicted by current beta
  points( x[w0,'temp'], w[w0], cex = .3, col = g, pch = 16 ) # current w0
  lines( x[,'temp'], xg, col = g )
  
  bg <- updateBeta(x, w, sg, priorIVB, priorB, XX)           # beta | sigma^2, w0
  mu <- x%*%bg                                               # mean with updated beta
  sg <- updateSigma(w, mu, s1, s2)                           # sigma^2 | beta, w0
  w[w0] <- .tnorm(length(w0), -Inf, 0, mu[w0], sqrt(sg) )    # w0 | beta, sigma^2
  readline( 'updated w and prediction -- return to continue ' )
}

## updated w and prediction -- return to continue 
## updated w and prediction -- return to continue 
## updated w and prediction -- return to continue 
## updated w and prediction -- return to continue

Although you don’t yet know what’s happening inside functions like updateBeta, you should see that the loop follows the three-step structure outlined above.

The Tobit model handles the zeros by assuming they occupy the negative end of the scale. We need the model to tell us “how negative”. That comes from sampling the censored values conditional on the current values of the coefficients bg and variance sg.
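
The helper .tnorm comes from clarkFunctions2026.R with arguments (n, lo, hi, mu, sd). A minimal inverse-CDF sketch of truncated normal sampling, assuming that is roughly how it works:

tnormSketch <- function( n, lo, hi, mu, sd ){
  plo <- pnorm( lo, mu, sd )                  # CDF at lower truncation point
  phi <- pnorm( hi, mu, sd )                  # CDF at upper truncation point
  qnorm( runif( n, plo, phi ), mu, sd )       # invert uniform draws between them
}
tnormSketch( 5, -Inf, 0, 0, 1 )               # five draws, all negative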

In summary, the Tobit is not only defensible as a valid model for continuous data with zeros–it also predicts the data better than the traditional regression and changes which effects appear important. Many value the hierarchical framework for its flexibility to admit additional stages.

We’ll add levels to regression (and other models) in subsequent vignettes.

Recap

Large observational data sets present a number of challenges. Analysis always has to start with EDA, which can reveal how variables relate to one another. Here that came through correlation structure (including MDS) and regression.

The application of regression models highlighted two points. First, simple Bayes offered no meaningful benefit over traditional regression: the parameter estimates and \(r^2\) were the same. True, Bayes allows the option of an informative prior distribution, but the best fit to the data will always come from a non-informative prior, because an informative prior pulls the fit away from the likelihood.

Many variables combine continuous values with discrete zeros (or other censored values, such as detection limits for some constituents). Regression is not correct for responses that are censored and therefore collect point mass on discrete values. The Tobit is an example of a model that allows for values censored at zero. Generalized joint attribute modeling (GJAM) does this more generally for multivariate responses.

When there are zeros, the transformation to a square-root scale is preferred to the common log transformation, which is not valid: adding an arbitrary value to \(y\), usually as \(\log(y + 1)\), makes the scale of the fitted coefficients depend on the arbitrary value that was added. There is no “correct” scale for the estimates.

In the Bayesian framework, censoring is readily handled with imputation. The censored values, in this case zeros, are imputed to be negative in sign, but still subject to the same model as the positive values. In other words, imputation extends the scale into the negative range. While the observed \(y\) is zero or positive, the imputed \(w\) can be positive or negative. In the Bayesian framework this imputation is no more complicated than any other hierarchical model.

The big difference with Bayes was not realized by the simple Bayes example; it was the capacity to extend the model that brought the big improvement in fit.

The simple illustration of Gibbs sampling, a type of Markov chain Monte Carlo (MCMC), will be explored further in subsequent vignettes.


Appendix

Note about R environments

To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see the earlier code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment; they exist only within the data.frame data. When I call the function lm, it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way; when I call them, I must specify the data.frame with the variable name, as in data$deficit. Using R is the subject of Unit 2.
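
A short sketch, using objects created earlier in this vignette:

length( biomass )                          # works: biomass is in the global environment
# length( deficit )                        # error: deficit exists only inside data
length( data$deficit )                     # works: extracted from the data.frame
fit <- lm( sqrtBiomass ~ deficit, data )   # lm finds deficit through its data argument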

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set of objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\} \subseteq \{a, b\}\) and \(\{a, b\} \subseteq \{a, b\}\) contained in the set
proper subset \(\{a\} \subset \{a, b\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

Matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text