Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.
Logistics
Resources
Software includes:
Class code on Sakai:
clarkFunctions2022.rR packages:
gjam, repmis
Discussion reading:
- Redefining statistical significance, classical and Bayesian statisticians compromise, if \(P\) values are to used, then \(P = 0.05\) is way too high, Nature.
For next time
- Unit 3 problems
Today’s plan
Breakout: Discussion and problems from Unit 2.
Jim: Foundation materials in Unit 3
Breakout: begin work on Unit 3
Objectives:
- Review basic R concepts and apply them to a data set.
- Evaluate correlation structure in observational data.
- Gain experience with R storage modes, including factors.
- Build and manipulate a design matrix.
- Interpret elements of a traditional regression model.
- Compare with the Bayesian approach, including the Tobit model.
Once you have downloaded from Sakai the file clarkFunctions2022.r, move it to your current directory, and source it this way:
source('clarkFunctions2022.r')Case study: USDA’s Forest Inventory and Analysis (FIA) program
Monitoring networks are increasingly used to evaluate geographic patterns in biodiversity and ecosystem properties, how they are changing, and why. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventory programs.
The FIA network provides opportunity to illustrate some of the concepts that are important for this course. It is too big to manipulate with the standard 8GB or 16GB ram that is installed in most modern laptops (or desktops). [If your machine can expand to 32GB you should do this–not expensive and can make a big difference.] Every tree on > \(10^5\) plots shown below are distributed across the country. Datq include > 300 tree species. The latest update includes > 20M rows for tree-years and > 200 variables. The plots themselves are deployed as a stratified-random design, with the stratification being over a uniform geographic grid. Plots are remeasured at intervals of several years, depending on state. Data include tree diameter, crown class, and a large number of codes related to individual condition. There are small seedling plots and larger tree plots, but even the latter are small, approximately 1/14th of a hectare in the eastern US, larger in the West. Several environmental variables include slope, aspect, estimated stand age, and site moisture status.
200,000 FIA plots support 20M sampled trees, here showing young stands in the South and East (yellow at left) and dry site classes in mountainous sites and southern Plains (brown at right).
I use the application to illustrate some differences between a traditional and Bayesian analysis. Here are some R objects I use in this example:
R packages: gjam, repmis
R objects: list, data.frame,numeric, numeric vector, numeric matrix, source_data
R functions: as.formula, attr, cbind, lm, names, par, plot, points, predict, summary
Data exploration
Although we cannot manipulate the entire data set in class, we can examine a subset of it. To do this, I clustered plots based on their environmental characteristics to result in approximately 1-ha plots. To manipulate FIA data I use built-in functions and packages. A package is a bundle of functions that together offer tools to process a related set of problems. To use a function that is contained in a package, I need to install the package. Here is the installation of two packages:
install.packages('repmis') # read from github
install.packages('gjam') # extract fileOnce a package is installed I can make it available in my work space using the library function. I can use a function without placing all of its functions in my workspace using the syntax package::function.
In the following code I first set a variable biomass to the the column for Pinus taeda. I get the data from a remote repository using the function source_data(fileName) in the package repmis:
d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
repmis::source_data(d)## [1] "forestTraits"
data <- forestTraits$xdata # compressed
tmp <- gjam::gjamReZero(forestTraits$treesDeZero) # decompress
biomass <- tmp[,"pinuTaed"] # response
data <- cbind(biomass, data) # all variablesThe function gjamReZero is a function in the packages gjam that decompresses a file containing mostly zeros.
The data.frame data includes both continuous variables and factors. It is formatted as observations (rows) by variables (columns). I’ll say more about formats as we go.
As always, I start with some simple exploratory data analysis (EDA). First I plot some variables in data using the R function plot. Soil is a factor in the model, meaning that it is numeric, but takes only discrete values. It has five levels, one being a reference type. In these first plots I assign colors to soil types.
# symbol color from soil type
col <- match( data$soil, levels(data$soil) )
# symbol size is response:biomass
cex <- .1 + biomass
cex <- sqrt( cex/max(cex) )
par(mfrow=c(1,2),bty='n', mar=c(4,4,3,1))
plot(data$temp, data$deficit, xlab='Winter temperature', # climate deficit
ylab='Moisture deficit', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
bty='n', cex = .7)
plot(data$temp, data$moisture, xlab='Winter temperature', # local moisture level
ylab='Site moisture', cex=cex, col = col)Some variables in the data.frame data
The syntax data$temp indicates that data is a type of R object termed a list. In this instance, data is a specific type of list called a data.frame. The $temp part of this name means that data is an object known as a list of objects, one of which having the name temp. Use the help(plot) page to interpret the arguments to the function plot.
par(mfrow=c(2,1),bty='n', mar=c(4,4,1,3))
plot(data$deficit, data$moisture, xlab='Deficit',
ylab='Site moisture', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
bty='n', cex = .7)
plot(data$soil, biomass)Additional variables, including a factor soil.
I’ll return to this example after further discussion of some basic concepts.
Exercise: Based on plots of soil moisture and climate deficit does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?
Traditional regression
To determine the relationship between abundance and environmental variables, I want a model for the response biomass with predictors. Here are the names of variables in data:
names(data)## [1] "biomass" "temp" "deficit" "moisture" "u1" "u2" "u3"
## [8] "stdage" "soil"
These are the column headings in data. To see this, I can look at the first two rows:
data[1:2,]With the exception of "soil", these are continuous variables. Again, the variable soil is a factor with these levels:
attr(data$soil,'levels')## [1] "EntVert" "Mol" "reference" "SpodHist" "UltKan"
These are soil ‘orders’, based on parent material.
I start with the standard function lm in R for linear regression. In the formula below I have specified a model containing moisture and with quadratic effects, as I(moisture^2). The I() function in a formula indicates that I want to evaluate the argument as it appears in the function I.
I can fit a linear regression with response biomass to several combinations of variables using lm and plot responses.
par(mfrow=c(1,2),bty='n', mar=c(4,4,1,1))
plot( data$moisture, biomass, cex=.2 ) # wet sites
fit1 <- lm(biomass ~ moisture, data)
p1 <- predict(fit1)
fit2 <- lm(biomass ~ moisture + I(moisture^2), data) # quadratic
p2 <- predict(fit2) # predictive mean
points(data$moisture,p1, col='green', cex=.2) # add to plot
points(data$moisture,p2, col='blue', cex=.2)
#repeat with additional variables
plot( data$deficit, biomass, cex=.2)Biomass fitted to moisture and climate deficit showing observed and predicted.
form <- as.formula(biomass ~ soil + moisture*stdage + temp + deficit)
fit3 <- lm(form, data)
p3 <- predict(fit3)
plot(biomass, p3, cex=.2, xlab = 'Observed', ylab = 'Predicted')
abline(0, 1, lty=2)Biomass fitted to moisture and climate deficit showing observed and predicted.
For more on use of variables names in the above block of code, see the note about R environments.
Plots show the predictions for a linear and quadratic model, fit1 and fit2 and for a large model with main effects, an interaction moisture*stdage and the factor soil.
I can use the function summary to look at the estimates in a fitted objects, e.g.,
summary(fit3)##
## Call:
## lm(formula = form, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.026 -4.051 -0.579 1.673 61.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.0033 0.7137 2.807 0.00506 **
## soilMol -0.9777 1.3819 -0.708 0.47934
## soilreference 1.1961 0.7473 1.601 0.10966
## soilSpodHist 2.2552 0.8486 2.658 0.00795 **
## soilUltKan 7.2434 1.2289 5.894 4.58e-09 ***
## moisture -0.4506 0.1837 -2.453 0.01427 *
## stdage -1.4494 0.1887 -7.682 2.70e-14 ***
## temp 3.8207 0.2339 16.334 < 2e-16 ***
## deficit 0.4002 0.2167 1.847 0.06491 .
## moisture:stdage 0.9336 0.1980 4.715 2.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.308 on 1607 degrees of freedom
## Multiple R-squared: 0.3031, Adjusted R-squared: 0.2992
## F-statistic: 77.66 on 9 and 1607 DF, p-value: < 2.2e-16
Standard products of a traditional regression include a table of point estimates for the coefficients in \(\boldsymbol{\beta}\), with their standard errors, which are a measure of uncertainty. There is a Student’s t statistic for each estimate, a measure of how far the estimate is from a hypothesized value of zero. For each estimate there is a P value, the probability for obtaining a value of t at least this large under the hypothesis that the coefficient is equal to zero. There is a F statistic and P value for the entire model. The degrees of freedom is the sample size \(n\) minus the number of fitted parameters \((2)\). The R-squared is interpreted as the ‘variance explained’ by the model. The residual standard error is the estimate of parameter \(\sigma = \sqrt{\sigma^2}\). The 95% confidence intervals for these estimates would be approximately the point estimates \(\pm 1.96\) times the standard errors, or:
## estimate 0.025 0.975
## (Intercept) 2.000 0.6080 3.4000
## soilMol -0.978 -3.6800 1.7200
## soilreference 1.200 -0.2650 2.6600
## soilSpodHist 2.260 0.5960 3.9100
## soilUltKan 7.240 4.8400 9.6500
## moisture -0.451 -0.8100 -0.0915
## stdage -1.450 -1.8200 -1.0800
## temp 3.820 3.3600 4.2800
## deficit 0.400 -0.0234 0.8240
## moisture:stdage 0.934 0.5470 1.3200
Together, this combination of outputs for the linear regression would contribute to an interpretation of how abundance of this species responds to climate and habitat.
Exercise. Repeat the regression analysis using function lm with a different species as a response. Interpret the analysis.
A Bayesian approach with the Tobit model
Leaving technical detail for subsequent units, this section compares the previous result with a Bayesian analysis, incorporating a non-informative prior distribution. As mentioned above, a Bayesian analysis combines the likelihood with a prior distribution. In this analysis, the prior distribution is taken to be non-informative for coefficients and residual variance \(\boldsymbol{\beta}, \sigma\). Here is a Bayesian analysis to compare with fit3:
fitb <- bayesReg(form, data)##
## Coefficients:
## median std error 0.025 0.975 not zero
## intercept 2.026 0.6964 0.6185 3.351 *
## soilMol -0.9913 1.376 -3.602 1.726
## soilreference 1.154 0.7305 -0.1839 2.643
## soilSpodHist 2.255 0.8279 0.6571 3.87 *
## soilUltKan 7.248 1.239 4.814 9.575 *
## moisture -0.4466 0.1819 -0.8142 -0.1145 *
## stdage -1.449 0.1895 -1.807 -1.053 *
## temp 3.831 0.2313 3.359 4.273 *
## deficit 0.3896 0.2141 -0.03737 0.8193
## moisture:stdage 0.9239 0.1989 0.5387 1.323 *
##
## * indicates that 95% predictive interval does not include zero
##
## Residual standard error 7.312, with 1607 degrees of freedom,
## root mean sq prediction error 7.185.
The function BayesReg organizes output a bit differently from lm, reporting 95% credible intervals [values at (0.025 and 0.975)]. The output is simpler than that generated by lm–there are not a lot of statistics. However, I see that point estimates and standard errors for coefficients are nearly the same as I obtained with lm. I also see that the coefficients having credible intervals that do not span zero in the Bayesian analysis are the same coefficients that lm flagged as ‘signficant’.
Same estimates from both methods.
The header for this section asks ‘what’s different about Bayes?’. The shift from a classical to Bayesian analysis did not change how I interpret effects of climate and habitat on biomass of white oak. However, the Bayesian model can be extended in a number of ways. In the next section I illustrate a hierarchical version having an additional stage.
At least two problems with both the traditional and Bayesian regressions are that the normal distribution i) does not allow any discrete values, including zeros, and ii) it is non-zero for negative values, whereas biomass can only be non-negative. Biomass data diverge from the assumptions of the model in that observations \(Y \in [0, \infty)\)–there is point mass at zero. Recall that the probability for any discrete value is zero for a continuous variable. Positive biomass values are continous, but zero is discrete. Here the data are dominated by zeros:
hist(biomass, nclass=50) # distributionResponse has many zeros.
length(which(biomass == 0))/length(biomass)## [1] 0.733457
If there were a few zeros, with most values bounded away from zero, I might argue that it’s close enough. That’s not the case here. Standard diagnostic plots, such as plot(fit1) make this clear.
If these were discrete data, I might turn to a zero-inflated Poisson or negative binomial model. These are examples of generalized linear models (GLMs) to be discussed in later units. These models sometimes work ok, provided there are not too many zeros, but here zeros dominate. Regardless, these GLMs describe discrete data, so I cannot use either.
I cannot add a small amount to each observation and then transform the data to a log scale. If I do that, every coefficient I estimate depends on the arbitrary value I used.
A Tobit regression allows for continuous responses with zeros, and it is readily implemented as a Bayesian model. Part (b) of the model graph includes an additional stage for a variable \(W\). This variable has a normal distribution; it is defined on \((-\infty, \infty)\). The observed \(Y\) is a censored version of \(W\). It is equal to the response \(Y\) whenever \(Y > 0\). When \(Y = 0\), then the latent variable \(W\) is negative:
\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \]
Treating \(Y\) as a censored version of \(W\) allows me to combine continuous and censored data without changing the scale. In a Tobit model the censored value is zero, but it also works with other values.
With this model the regression moves to the latent \(W\),
\[w_i \sim N(\mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The model has become hierarchical. I fit this model in a Bayesian setting, again, requiring a prior distribution.
Now allowing for the zeros in \(Y\), the fitted coefficients differ substantially from both previous models:
fitTobit <- bayesReg(form, data, TOBIT=T)## fitted as Tobit model
##
## Coefficients:
## median std error 0.025 0.975 not zero
## intercept -28.3 3.015 -33.62 -22.54 *
## soilMol -8.066 7.108 -23.26 4.817
## soilreference 6.477 2.237 2.05 10.94 *
## soilSpodHist -4.646 3.446 -11.48 2.284
## soilUltKan 15.17 2.91 9.493 20.93 *
## moisture 0.04456 0.5713 -1.091 1.178
## stdage -4.119 0.6635 -5.384 -2.813 *
## temp 24.21 1.857 21.13 27.35 *
## deficit -1.115 0.6847 -2.47 0.1955
## moisture:stdage 1.419 0.6101 0.1889 2.604 *
##
## * indicates that 95% predictive interval does not include zero
##
## Residual standard error 14.31, with 1607 degrees of freedom,
## root mean sq prediction error 6.894.
par(mfrow=c(1,2),bty='n')
plot(biomass, p3, cex=.2, ylab='Predicted values')
points(biomass,fitTobit$predictY[,2], col=2, cex=.2)
abline(0,1,lty=2)
abline(h=0)
plot(summary(fit3)$coefficients[,1],fitTobit$coeff[,1],
xlab='Traditional regression',ylab='Tobit')
abline(0,1,lty=2)Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.
Exercise. Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression model.
I called the Tobit model ‘hierarchical’, but some could see it differently. The \(W\) stage in the model is ‘partially known’. Note the dashed line in part (c) of the graphical model. If I know the value of \(W\), then I also know the value of \(Y\). So in the ‘generative direction’ from \(W\) to \(Y\) I can view their relationship as deterministic. However, given \(Y\) does not necessarily mean that I know \(W\). If \(Y = 0\), then \(W\) is stochastic.
In summary, not only is the Tobit defensible as a valid model for continuous data with zeros–it also finds more effects and better predicts data than the traditional regression. Many value the hierarchical framework for its flexibility to admit additional stages.
Appendix
Note about R environments
To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see previous code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment. They only exist within the data.frame data. When I call the function lm it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way. When I call them, I must specify the data.frame with the variable name, data$deficit. Using R is the subject of Unit 2.
Notation
Here are some notation conventions used in these vignettes.
| notation | example | interpretation |
|---|---|---|
| italic | \(x\), \(X\) | scalar quantity, known |
| greek | \(\alpha, \beta, \dots\) | stochastic (fitted) variable, unknown |
| parentheses | \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) | parametric function/density |
| curly brackets | \(\{0, 1, \dots\}\) | a set on objects |
| closed interval | \((-\infty,0]\), \([0, \infty)\) | intervals include zero |
| open interval | \((-\infty,0)\), \((0, \infty)\) | exclude zero |
| distributed as | \(x \sim N(\mu, \sigma^2)\) | distribution or density |
| expectation | \(E[\epsilon] = 0\) | expected value of a variable |
| variance | \(Var[\epsilon] = \sigma^2\) | variance of a variable |
| bracket, distribution | \([A, B, C]\) | unspecified density or distribution |
| proportional | \(f(x) \propto g(x)\) | differ by a constant, \(f(x) = c g(x)\) |
| approximate | \(\pi \approx 3.14159\) | approximately equal |
| real numbers | \(\mathbb{R} = (-\infty, \infty)\) | note: also positive real \(\mathbb{R}_+\) |
| is an element of | \(\pi \in \mathbb{R}\) | \(\pi\) is a real number |
| subset | \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) | in the set |
| proper subset | \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) | cannot include all elements |
| union | \(a \cup b\) | either \(a\) or \(b\) |
| intersection | \(a \cap b\) | both \(a\) and \(b\) |
| sum | \(\sum_{i=1}^n x_i\) | \(x_1 + \dots + x_n\) |
| product | \(\prod_{i=1}^n x_i\) | \(x_1 \times \dots \times x_n\) |
| exponentiate | \(e^x\), \(exp(x)\) | \(e \approx 2.71828\), inverse of \(ln(x)\) |
| natural logarithm | \(ln(x)\) | inverse of \(e^x\) |
##matrices
| notation | example | interpretation |
|---|---|---|
| bold, l.c. | \(\mathbf{x}\), \(\mathbf{x}'\) | column and row vectors, respectively |
| bold, u.c. | \(\mathbf{X}\) | matrix |
| dimensions | \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) | \(n\) rows, \(q\) columns |
| subscript | \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) | element of an array |
| row vector | \(\mathbf{X}_{i\cdot}\) | row \(i\) |
| column vector | \(\mathbf{X}_{\cdot j}\) | column \(j\) |
| transpose | \(\mathbf{X}'\) | rows become columns |
| matrix inverse | \(\mathbf{A}^{-1}\) | solve systems of linear equations |
| identity matrix | \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) | \(p \times p\) with 1’s on the diagonal, zeros elsewhere |
| matrix determinant | \(det\left( \mathbf{A} \right)\) | obtain for a square matrix, e.g., covariance |
| kronecker product | \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) | \(np \times mq\) matrix, defined in text |