Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.
Software includes:
Class code on Sakai: clarkFunctions2024.r
R packages: gjam, repmis, MASS
sapplyfactor, levels, relevellm, Tobit model, BayesObjectives:
Once you have downloaded from Sakai the file
clarkFunctions2022.r, move it to your current directory,
and source it this way:
source('clarkFunctions2024.r')
Monitoring networks are increasingly used to evaluate geographic patterns in biodiversity and ecosystem properties, how they are changing, and why. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventory programs.
The FIA network provides opportunity to illustrate some of the concepts that are important for this course. It is too big to manipulate with the standard 8GB or 16GB ram that is installed in most modern laptops (or desktops). [If your machine can expand to 32GB you should do this–not expensive and can make a big difference.] Every tree on > \(10^5\) plots shown below are distributed across the country. Data include > 300 tree species. The latest update includes > 20M rows for tree-years and > 200 variables. The plots themselves are deployed as a stratified-random design, with the stratification being over a uniform geographic grid. Plots are remeasured at intervals of several years, depending on state. Data include tree diameter, crown class, and a large number of codes related to individual condition. There are small seedling plots and larger tree plots, but even the latter are small, approximately 1/14th of a hectare in the eastern US, larger in the West. Several environmental variables include slope, aspect, estimated stand age, and site moisture status.
200,000 FIA plots support 20M sampled trees, here showing young stands in the South and East (yellow at left) and dry site classes in mountainous sites and southern Plains (brown at right).
I use the application to illustrate some differences between a traditional and Bayesian analysis. Here are some R objects I use in this example:
R packages: MASS, gjam,
repmis
R objects: list,
data.frame,numeric,
numeric vector, numeric matrix,
source_data
R functions: as.formula,
attr, cbind, lm,
names, par, plot,
points, predict, summary
Although we cannot manipulate the entire data set in class, we can
examine a subset of it. To do this, I clustered plots based on their
environmental characteristics to result in approximately 1-ha plots. To
manipulate FIA data I use built-in functions and packages. A
package is a bundle of functions that together offer tools
to process a related set of problems. To use a function
that is contained in a package, I need to install the
package. Here is the installation of two packages:
install.packages('repmis') # read from github
install.packages('gjam') # extract file
Once a package is installed I can make it available in my work space
using the library function. I can use a
function without placing all of its functions in my
workspace using the syntax package::function.
In the following code I first set a variable biomass to
the the column for Pinus taeda. I get the data from a remote
repository using the function source_data(fileName) in the
package repmis:
library( gjam )
library( repmis )
library( MASS )
d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
source_data(d)
## [1] "forestTraits"
data <- forestTraits$xdata # compressed
tmp <- gjamReZero(forestTraits$treesDeZero) # decompress
biomass <- tmp[,"pinuTaed"] # response
data <- cbind(biomass, data) # all variables
The function gjamReZero is a function in the packages
gjam that decompresses a file containing mostly zeros.
The data.frame data includes both continuous variables
and factors. It is formatted as observations (rows) by variables
(columns). I’ll say more about formats as we go. I can use
sapply to determine storage modes:
sapply( data, mode ) # all are numeric, but some could be factors
## biomass temp deficit moisture u1 u2 u3 stdage
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
## soil
## "numeric"
which( sapply( data, is.factor ) ) # found it
## soil
## 9
As always, I start with some simple exploratory data analysis (EDA).
First I plot some variables in data using the R function
plot. Soil is a factor in the model, meaning
that it is numeric, but takes only discrete values. It has
five levels, one being a reference type. In these first
plots I assign colors to soil types.
col <- match( data$soil, levels(data$soil) ) # symbol color from soil type
cex <- .1 + biomass # symbol size is response:biomass
cex <- sqrt( cex/max(cex) )
par( mfrow=c(1,2), bty='n', mar=c(4,4,3,1), omi = c(.5, .1, .1, .1) )
plot(data$temp, data$deficit, xlab='', # climate deficit
ylab='Moisture deficit', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
bty='n', cex = .7)
plot(data$temp, data$moisture, xlab='', # local moisture level
ylab='Site moisture', cex=cex, col = col)
mtext( 'Winter temperature', 1, outer = T )
Some variables in the data.frame data
Recall, the syntax data$temp indicates that
data is a list. In this instance,
data is a specific type of list called a
data.frame. The $temp part refers to an object
within a list. Use the help(plot) page to
interpret the arguments to the function plot.
par(mfrow=c(1,2),bty='n', mar=c(4,4,1,3))
plot(data$deficit, data$moisture, xlab='Deficit',
ylab='Site moisture', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
bty='n', cex = .7)
plot(data$soil, sqrt(biomass), xlab = 'Soil type', ylab = 'Biomass (sqrt scale)')
Additional variables, including a factor soil.
Here is a pairs plot to examine multiple variables:
pairs( data[,c(1:4, 8, 9)], cex=.1 )
Here is non-metric multidimensional scaling (MDS) to extract stucture
dcols <- as.matrix( data[,c(1:4, 8)] )
dcols[,'biomass'] <- sqrt( dcols[,'biomass'] ) # down-scale extremes
omat <- cov2Dist( var( t(dcols) ) ) # by observations
obs <- isoMDS(omat, k = 3)
## initial value 19.129672
## iter 5 value 10.465162
## iter 10 value 7.488707
## iter 15 value 6.760770
## iter 20 value 6.483806
## iter 25 value 6.361688
## final value 6.327042
## converged
vmat <- cov2Dist( var( dcols ) ) # by variables
vars <- isoMDS(vmat, k = 3)
## initial value 2.265770
## iter 5 value 0.028160
## iter 5 value 0.000000
## iter 5 value 0.000000
## final value 0.000000
## converged
nv <- nrow(vmat)
plot(obs$points[,1], obs$points[,2], pch = '.',
xlab = 'Axis 1', ylab = 'Axis 2') # axis 1 driven by stdage
wx <- which( abs(obs$points[,1]) > 5 )
text( obs$points[wx,1], obs$points[wx,2], # note extreme values
round(data$stdage[wx]), cex = .8)
text(vars$points[,1], vars$points[,2], # relationship with variables
rownames(vars$points), cex = 1.2, col=c(1:nv))
points( 0, 0, pch = 16, cex=2 )
arrows( rep(0, nv), rep(0, nv), vars$points[,1], vars$points[,2],
col = c(1:nv), lwd = 2 )
Notice that the stdage variable accounts for most of the
variation on the first axis. I have labeled the values for observations
to see the gradient in values.
I am also interested in how biomass differs by soil type. Here is use
of the function tapply to determine the mean and sd for
biomass by soil:
mean <- tapply( biomass, data$soil, mean, na.rm = T )
sd <- tapply( biomass, data$soil, sd, na.rm = T )
biomassTable <- rbind(mean, sd)
print( round( biomassTable, 2 ) )
## EntVert Mol reference SpodHist UltKan
## mean 3.57 0.28 4.15 0.40 13.87
## sd 7.13 1.28 9.66 1.98 11.43
I’ll return to this example after further discussion of some basic concepts.
Exercise: Based on plots of soil moisture and climate deficit does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?
To determine the relationship between abundance and environmental
variables, I want a model for the response biomass with
predictors. Here are the names of variables in data:
names(data)
## [1] "biomass" "temp" "deficit" "moisture" "u1" "u2" "u3"
## [8] "stdage" "soil"
These are the column headings in data. To see this, I
can look at the first two rows:
data[1:2,]
With the exception of "soil", these are continuous
variables. Again, the variable soil is a
factor with these levels:
data$soil <- relevel(data$soil, 'reference')
attr(data$soil,'levels')
## [1] "reference" "EntVert" "Mol" "SpodHist" "UltKan"
These are soil ‘orders’, based on parent material.
I start with the standard function lm in R for linear
regression. In the formula below I have specified a model
containing moisture and with quadratic effects, as
I(moisture^2). The I() function in a
formula indicates that I want to evaluate the argument as
it appears in the function I.
I can fit a linear regression with response biomass to
several combinations of variables using lm and plot
responses.
fit1 <- lm( biomass ~ moisture, data )
p1 <- predict( fit1 )
fit2 <- lm( biomass ~ moisture + I(moisture^2), data ) # quadratic
p2 <- predict(fit2) # predictive mean
par( mfrow=c(1,3), bty='n', mar=c(4,4,1,1) )
plot( data$moisture, biomass, cex=.2 ) # biomass by site moisture
points( data$moisture, p1, col='green', cex=.2 ) # add to plot
points( data$moisture, p2, col='blue', cex=.2 )
#repeat with additional variables
plot( data$deficit, biomass, cex=.2)
form <- as.formula(biomass ~ soil + moisture*stdage + temp + deficit)
fit3 <- lm(form, data)
p3 <- predict(fit3)
plot( biomass, p3, cex=.2, xlab = 'Observed', ylab = 'Predicted' )
abline(0, 1, lty=2)
Biomass fitted to moisture and climate deficit showing observed and predicted.
For more on use of variable names in the above block of code, see the note about R environments.
Plots show the predictions for a linear and quadratic model,
fit1 and fit2 and for a large model with main
effects, an interaction moisture*stdage and the
factor soil.
I can use the function summary to look at the estimates
in a fitted objects, e.g.,
summary(fit3)
##
## Call:
## lm(formula = form, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.026 -4.051 -0.579 1.673 61.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1995 0.2297 13.928 < 2e-16 ***
## soilEntVert -1.1961 0.7473 -1.601 0.1097
## soilMol -2.1739 1.2190 -1.783 0.0747 .
## soilSpodHist 1.0590 0.5088 2.082 0.0375 *
## soilUltKan 6.0472 1.0304 5.869 5.32e-09 ***
## moisture -0.4506 0.1837 -2.453 0.0143 *
## stdage -1.4494 0.1887 -7.682 2.70e-14 ***
## temp 3.8207 0.2339 16.334 < 2e-16 ***
## deficit 0.4002 0.2167 1.847 0.0649 .
## moisture:stdage 0.9336 0.1980 4.715 2.62e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.308 on 1607 degrees of freedom
## Multiple R-squared: 0.3031, Adjusted R-squared: 0.2992
## F-statistic: 77.66 on 9 and 1607 DF, p-value: < 2.2e-16
Standard products of a traditional regression include a table of point estimates for the coefficients in \(\boldsymbol{\beta}\), with their standard errors, which are a measure of uncertainty. There is a Student’s t statistic for each estimate, a measure of how far the estimate is from a hypothesized value of zero. For each estimate there is a P value, the probability for obtaining a value of t at least this large under the hypothesis that the coefficient is equal to zero. There is a F statistic and P value for the entire model. The degrees of freedom is the sample size \(n\) minus the number of fitted parameters \((2)\). The R-squared is interpreted as the ‘variance explained’ by the model. The residual standard error is the estimate of parameter \(\sigma = \sqrt{\sigma^2}\). The 95% confidence intervals for these estimates would be approximately the point estimates \(\pm 1.96\) times the standard errors, or:
## estimate 0.025 0.975
## (Intercept) 3.200 2.7500 3.6500
## soilEntVert -1.200 -2.6600 0.2650
## soilMol -2.170 -4.5600 0.2100
## soilSpodHist 1.060 0.0644 2.0500
## soilUltKan 6.050 4.0300 8.0600
## moisture -0.451 -0.8100 -0.0915
## stdage -1.450 -1.8200 -1.0800
## temp 3.820 3.3600 4.2800
## deficit 0.400 -0.0234 0.8240
## moisture:stdage 0.934 0.5470 1.3200
Together, this combination of outputs for the linear regression would contribute to an interpretation of how abundance of this species responds to climate and habitat.
Exercise. Repeat the regression analysis using function
lm with a different species as a response. Interpret the
analysis.
This model includes a factor and an interaction. A factor has
categorical levels that can be represented by words in a
data.frame. In this example, data$soil is a
factor. The levels for soil are
attr(data$soil,'levels') = reference, EntVert, Mol,
SpodHist, UltKan. In the fore-going analysis there was a coefficient for
all but the first factor level, reference. To understand
how factors are treated in an analysis, it’s easiest to look at the
design matrix \(\mathbf{X}\) that is
generated when factors are included. Here are a few lines of \(\mathbf{X}\) for this model:
tmp <- model.frame(form, data, na.action=NULL)
x <- model.matrix(form, data=tmp)
head( x )
## (Intercept) soilEntVert soilMol soilSpodHist soilUltKan moisture
## 1 1 0 0 0 0 0.6870299
## 2 1 0 0 0 0 1.6655992
## 3 1 0 0 1 0 -0.1892591
## 4 1 0 0 0 0 0.3900552
## 5 1 0 0 0 0 0.8500289
## 6 1 0 0 0 0 -0.8765680
## stdage temp deficit moisture:stdage
## 1 -0.16697961 1.2165433 0.03637914 -0.11471999
## 2 -0.02907271 0.1825447 0.20708706 -0.04842349
## 3 0.48147624 -0.9409308 0.20345146 -0.09112376
## 4 -0.07895393 0.6435989 0.81532968 -0.03079639
## 5 0.36007415 0.8238641 -0.17556592 0.30607343
## 6 -0.58986965 0.2201759 0.76258277 0.51706086
The missing factor level, called reference,
is represented in the model by the intercept, the first column of ones
in \(\mathbf{X}\). An observation from
the reference soil type will have zeros in the remaining
four columns. Any other soil type will have a one in its column, so its
effect is the sum of the intercept and its own coefficient.
The interaction term moisture:stdage is the product of
two columns. In the coefficient table there is a main
effect of both moisture and stdage
and there is an interaction coefficient. If the interaction coefficient
is positive, then the effect of stage becomes more
important as it increases away from zero (either positive or
negative).
Leaving technical detail for subsequent units, this section compares the previous result with a Bayesian analysis, incorporating a non-informative prior distribution. As mentioned above, a Bayesian analysis combines the likelihood with a prior distribution. In this analysis, the prior distribution is taken to be non-informative for coefficients and residual variance \(\boldsymbol{\beta}, \sigma\).
Recall the Bayesian model, here applied to regression for a slope and intercept model:
\[ [ \boldsymbol{\beta}, \sigma^2 | \mathbf{X}, \mathbf{y} ] \propto \prod_{i=1}^n N \left( y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2 \right) MVN( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) IG(\sigma^2 | s_1, s_2 ) \] The first factor on the right-hand side is the likelihood. The second factor is a multivariate prior distribution for regression coefficients. The third factor it the inverse gamma distribution that I use as a prior for the residual variance. The prior distribution for regression coefficients is non-informative, having these prior parameter values:
\[\begin{align} \mathbf{b} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} 1000 & 0 \\ 0 & 1000 \end{bmatrix} \end{align}\]
The prior parameter values for the inverse gamma distribution are also non-informative: \(s_1 = s_2 = 0.1\). Here is how I would specify them in the code:
priorB <- matrix( 0, 2, 1 ) # prior mean regression parameters
priorVB <- diag( 1000, 2 ) # inverse prior covariance matrix
s1 <- .1 # variance prior values
s2 <- .1
Here is a Bayesian analysis to compare with fit3:
fitb <- bayesReg(form, data)
##
## Coefficients:
## median std error 0.025 0.975 not zero
## intercept 3.1960 0.2344 2.74300 3.64400 *
## soilEntVert -1.1830 0.7456 -2.62300 0.29790
## soilMol -2.1450 1.2030 -4.50300 0.20090
## soilSpodHist 1.0720 0.5120 0.06454 2.03700 *
## soilUltKan 6.0510 1.0310 4.06100 8.06200 *
## moisture -0.4470 0.1848 -0.82130 -0.09579 *
## stdage -1.4510 0.1889 -1.81200 -1.08800 *
## temp 3.8180 0.2401 3.35900 4.28700 *
## deficit 0.4037 0.2226 -0.03556 0.83420
## moisture:stdage 0.9348 0.1985 0.54540 1.32400 *
##
## * indicates that 95% predictive interval does not include zero
##
## Residual standard error 7.314, with 1607 degrees of freedom,
## root mean sq prediction error 7.177.
The function BayesReg organizes output a bit differently
from lm, reporting 95% credible intervals [values at (0.025
and 0.975)]. The output is simpler than that generated by
lm–there are not a lot of statistics. However, I see that
point estimates and standard errors for coefficients are nearly the same
as I obtained with lm. I also see that the coefficients
having credible intervals that do not span zero in the Bayesian analysis
are the same coefficients that lm flagged as
‘signficant’.
Same estimates from both methods, showing point estimates and 95% CIs.
The header for this section asks ‘what’s different about Bayes?’. The shift from a classical to Bayesian analysis did not change how I interpret effects of climate and habitat on biomass of this species. However, the Bayesian model can be extended in a number of ways. In the next section I illustrate a hierarchical version having an additional stage.
At least two problems with both the traditional and Bayesian regressions are that the normal distribution i) does not allow any discrete values, including zeros, and ii) it is non-zero for negative values, whereas biomass can only be non-negative. Biomass data diverge from the assumptions of the model in that observations \(Y \in [0, \infty)\)–there is point mass at zero. Recall that the probability for any discrete value is zero for a continuous variable. Positive biomass values are continous, but zero is discrete. Here the data are dominated by zeros:
hist(biomass, nclass=50) # distribution
Response has many zeros.
length(which(biomass == 0))/length(biomass)
## [1] 0.733457
If there were a few zeros, with most values bounded away from zero, I
might argue that it’s close enough. That’s not the case here. Standard
diagnostic plots, such as plot(fit1) make this clear.
If these were discrete data, I might turn to a zero-inflated Poisson or negative binomial model. These are examples of generalized linear models (GLMs) to be discussed in later units. These models sometimes work ok, provided there are not too many zeros, but here zeros dominate. Regardless, these GLMs describe discrete data, so I cannot use either.
I cannot add a small amount to each observation and then transform the data to a log scale. If I do that, every coefficient I estimate depends on the arbitrary value I used.
A Tobit regression allows for continuous responses with zeros, and it is readily implemented as a Bayesian model. Part (b) of the model graph includes an additional stage for a variable \(W\). This variable has a normal distribution; it is defined on \((-\infty, \infty)\). The observed \(Y\) is a censored version of \(W\). It is equal to the response \(Y\) whenever \(Y > 0\). When \(Y = 0\), then the latent variable \(W\) is negative:
\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \]
Treating \(Y\) as a censored version of \(W\) allows me to combine continuous and censored data without changing the scale. In a Tobit model the censored value is zero, but it also works with other values.
With this model the regression moves to the latent \(W\),
\[w_i \sim N(\mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The model has become hierarchical. I fit this model in a Bayesian setting, again, requiring a prior distribution.
Now allowing for the zeros in \(Y\), the fitted coefficients differ substantially from both previous models:
fitTobit <- bayesReg(form, data, TOBIT=T)
## fitted as Tobit model
##
## Coefficients:
## median std error 0.025 0.975 not zero
## intercept -21.630000 1.8990 -25.0700 -18.500 *
## soilEntVert -6.686000 2.1540 -10.8300 -2.361 *
## soilMol -14.890000 6.5510 -28.4600 -2.644 *
## soilSpodHist -10.990000 3.0690 -16.5300 -4.638 *
## soilUltKan 8.805000 2.1590 4.4950 13.000 *
## moisture -0.008222 0.5599 -1.0910 1.106
## stdage -4.116000 0.6616 -5.3970 -2.814 *
## temp 24.000000 1.8800 21.0600 27.210 *
## deficit -1.108000 0.7125 -2.4820 0.320
## moisture:stdage 1.370000 0.5994 0.2412 2.578 *
##
## * indicates that 95% predictive interval does not include zero
##
## Residual standard error 14.24, with 1607 degrees of freedom,
## root mean sq prediction error 6.871.
par( mfrow=c(1,2), bty='n' )
plot( biomass, p3, cex=.2, ylab='Predicted values' )
points (biomass, fitTobit$predictY[,2], col=2, cex=.2 )
abline( 0, 1, lty=2 )
abline( h=0)
rbeta <- summary(fit3)$coefficients # standard regression coeffs
tbeta <- fitTobit$beta # Bayesian Tobit coeffs
plot( rbeta[,1], tbeta[,1], xlim = c(-2, 8),
xlab='Traditional regression', ylab='Tobit')
text( rbeta[,1], tbeta[,1], rownames(tbeta), pos = 4, cex = .8 )
abline( 0, 1, lty=2 )
Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.
Exercise. Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression model.
I called the Tobit model ‘hierarchical’, but some could see it differently. The \(W\) stage in the model is ‘partially known’. Note the dashed line in part (c) of the graphical model. If I know the value of \(W\), then I also know the value of \(Y\). So in the ‘generative direction’ from \(W\) to \(Y\) I can view their relationship as deterministic. However, given \(Y\) does not necessarily mean that I know \(W\). If \(Y = 0\), then \(W\) is stochastic.
In summary, not only is the Tobit defensible as a valid model for continuous data with zeros–it also finds more effects and better predicts data than the traditional regression. Many value the hierarchical framework for its flexibility to admit additional stages.
To fully understand the block of code for the lm fit, I
need to know that the variable biomass is defined in my
global environment (see previous code block). I can enter
length(biomass) and get an answer, because
biomass exists in the working environment. I have not
assigned the variables deficit and moisture in
my global environment. They only exist within the
data.frame data. When I call the function lm
it knows to look for variables in my global environment or in
data, because I have passed data as an
argument. The functions plot and points do not
look for variables in this way. When I call them, I must specify the
data.frame with the variable name,
data$deficit. Using R is the subject of Unit 2.
Here are some notation conventions used in these vignettes.
| notation | example | interpretation |
|---|---|---|
| italic | \(x\), \(X\) | scalar quantity, known |
| greek | \(\alpha, \beta, \dots\) | stochastic (fitted) variable, unknown |
| parentheses | \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) | parametric function/density |
| curly brackets | \(\{0, 1, \dots\}\) | a set on objects |
| closed interval | \((-\infty,0]\), \([0, \infty)\) | intervals include zero |
| open interval | \((-\infty,0)\), \((0, \infty)\) | exclude zero |
| distributed as | \(x \sim N(\mu, \sigma^2)\) | distribution or density |
| expectation | \(E[\epsilon] = 0\) | expected value of a variable |
| variance | \(Var[\epsilon] = \sigma^2\) | variance of a variable |
| bracket, distribution | \([A, B, C]\) | unspecified density or distribution |
| proportional | \(f(x) \propto g(x)\) | differ by a constant, \(f(x) = c g(x)\) |
| approximate | \(\pi \approx 3.14159\) | approximately equal |
| real numbers | \(\mathbb{R} = (-\infty, \infty)\) | note: also positive real \(\mathbb{R}_+\) |
| is an element of | \(\pi \in \mathbb{R}\) | \(\pi\) is a real number |
| subset | \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) | in the set |
| proper subset | \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) | cannot include all elements |
| union | \(a \cup b\) | either \(a\) or \(b\) |
| intersection | \(a \cap b\) | both \(a\) and \(b\) |
| sum | \(\sum_{i=1}^n x_i\) | \(x_1 + \dots + x_n\) |
| product | \(\prod_{i=1}^n x_i\) | \(x_1 \times \dots \times x_n\) |
| exponentiate | \(e^x\), \(exp(x)\) | \(e \approx 2.71828\), inverse of \(ln(x)\) |
| natural logarithm | \(ln(x)\) | inverse of \(e^x\) |
##matrices
| notation | example | interpretation |
|---|---|---|
| bold, l.c. | \(\mathbf{x}\), \(\mathbf{x}'\) | column and row vectors, respectively |
| bold, u.c. | \(\mathbf{X}\) | matrix |
| dimensions | \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) | \(n\) rows, \(q\) columns |
| subscript | \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) | element of an array |
| row vector | \(\mathbf{X}_{i\cdot}\) | row \(i\) |
| column vector | \(\mathbf{X}_{\cdot j}\) | column \(j\) |
| transpose | \(\mathbf{X}'\) | rows become columns |
| matrix inverse | \(\mathbf{A}^{-1}\) | solve systems of linear equations |
| identity matrix | \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) | \(p \times p\) with 1’s on the diagonal, zeros elsewhere |
| matrix determinant | \(det\left( \mathbf{A} \right)\) | obtain for a square matrix, e.g., covariance |
| kronecker product | \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) | \(np \times mq\) matrix, defined in text |