Basic concepts are introduced with the Forest Inventory and Analysis (FIA) monitoring network.

Logistics

Resources

  • Software includes:

    • Class code on Sakai: clarkFunctions2024.r

    • R packages: gjam, repmis, MASS

For next time

  • Unit 3 problems (this unit) and Duke Forest vignette

Today’s plan

  • Field trip logistics and vignette
  • Highlights from intro to R
  • Jim: Exploratory Data Analysis (EDA) in Unit 3, a few concepts include:
    • storage modes with sapply
    • factor, levels, relevel
    • regression with lm, Tobit model, Bayes

Objectives:

  1. Review basic R concepts and apply them to a data set.
  2. Evaluate correlation structure in observational data.
  3. Gain experience with R storage modes, including factors.
  4. Build and manipulate a design matrix.
  5. Interpret elements of a traditional regression model.
  6. Compare with the Bayesian approach, including the Tobit model.


Once you have downloaded from Sakai the file clarkFunctions2022.r, move it to your current directory, and source it this way:

source('clarkFunctions2024.r')

Case study: USDA’s Forest Inventory and Analysis (FIA) program

Monitoring networks are increasingly used to evaluate geographic patterns in biodiversity and ecosystem properties, how they are changing, and why. The USDA Forest Inventory and Analysis (FIA) program is one of many national inventory programs.

The FIA network provides opportunity to illustrate some of the concepts that are important for this course. It is too big to manipulate with the standard 8GB or 16GB ram that is installed in most modern laptops (or desktops). [If your machine can expand to 32GB you should do this–not expensive and can make a big difference.] Every tree on > \(10^5\) plots shown below are distributed across the country. Data include > 300 tree species. The latest update includes > 20M rows for tree-years and > 200 variables. The plots themselves are deployed as a stratified-random design, with the stratification being over a uniform geographic grid. Plots are remeasured at intervals of several years, depending on state. Data include tree diameter, crown class, and a large number of codes related to individual condition. There are small seedling plots and larger tree plots, but even the latter are small, approximately 1/14th of a hectare in the eastern US, larger in the West. Several environmental variables include slope, aspect, estimated stand age, and site moisture status.

200,000 FIA plots support 20M sampled trees, here showing young stands in the South and East (yellow at left) and dry site classes in mountainous sites and southern Plains (brown at right).

I use the application to illustrate some differences between a traditional and Bayesian analysis. Here are some R objects I use in this example:

R packages: MASS, gjam, repmis

R objects: list, data.frame,numeric, numeric vector, numeric matrix, source_data

R functions: as.formula, attr, cbind, lm, names, par, plot, points, predict, summary

Data exploration

Although we cannot manipulate the entire data set in class, we can examine a subset of it. To do this, I clustered plots based on their environmental characteristics to result in approximately 1-ha plots. To manipulate FIA data I use built-in functions and packages. A package is a bundle of functions that together offer tools to process a related set of problems. To use a function that is contained in a package, I need to install the package. Here is the installation of two packages:

install.packages('repmis')  # read from github
install.packages('gjam')    # extract file

Once a package is installed I can make it available in my work space using the library function. I can use a function without placing all of its functions in my workspace using the syntax package::function.

In the following code I first set a variable biomass to the the column for Pinus taeda. I get the data from a remote repository using the function source_data(fileName) in the package repmis:

library( gjam )
library( repmis )
library( MASS )
d <- "https://github.com/jimclarkatduke/gjam/blob/master/forestTraits.RData?raw=True"
source_data(d)
## [1] "forestTraits"
data    <- forestTraits$xdata                         # compressed
tmp     <- gjamReZero(forestTraits$treesDeZero)       # decompress
biomass <- tmp[,"pinuTaed"]                           # response
data    <- cbind(biomass, data)                       # all variables

The function gjamReZero is a function in the packages gjam that decompresses a file containing mostly zeros.

The data.frame data includes both continuous variables and factors. It is formatted as observations (rows) by variables (columns). I’ll say more about formats as we go. I can use sapply to determine storage modes:

sapply( data, mode )               # all are numeric, but some could be factors
##   biomass      temp   deficit  moisture        u1        u2        u3    stdage 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##      soil 
## "numeric"
which( sapply( data, is.factor ) ) # found it
## soil 
##    9

As always, I start with some simple exploratory data analysis (EDA). First I plot some variables in data using the R function plot. Soil is a factor in the model, meaning that it is numeric, but takes only discrete values. It has five levels, one being a reference type. In these first plots I assign colors to soil types.

col <- match( data$soil, levels(data$soil) ) # symbol color from soil type
cex <- .1 + biomass                          # symbol size is response:biomass
cex <- sqrt( cex/max(cex) )                  

par( mfrow=c(1,2), bty='n', mar=c(4,4,3,1), omi = c(.5, .1, .1, .1) )
plot(data$temp, data$deficit, xlab='',         # climate deficit
     ylab='Moisture deficit', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$temp, data$moisture, xlab='',        # local moisture level
     ylab='Site moisture', cex=cex, col = col)
mtext( 'Winter temperature', 1, outer = T )
*Some variables in the data.frame data*

Some variables in the data.frame data

Recall, the syntax data$temp indicates that data is a list. In this instance, data is a specific type of list called a data.frame. The $temp part refers to an object within a list. Use the help(plot) page to interpret the arguments to the function plot.

par(mfrow=c(1,2),bty='n', mar=c(4,4,1,3))

plot(data$deficit, data$moisture, xlab='Deficit', 
     ylab='Site moisture', cex=cex, col = col)
legend( 'topleft', levels(data$soil), text.col = c(1:nlevels(data$soil)),
        bty='n', cex = .7)
plot(data$soil, sqrt(biomass), xlab = 'Soil type', ylab = 'Biomass (sqrt scale)')
*Additional variables, including a factor soil.*

Additional variables, including a factor soil.

Here is a pairs plot to examine multiple variables:

pairs( data[,c(1:4, 8, 9)], cex=.1 )

Here is non-metric multidimensional scaling (MDS) to extract stucture

dcols <- as.matrix( data[,c(1:4, 8)] )
dcols[,'biomass'] <- sqrt( dcols[,'biomass'] )  # down-scale extremes
omat <- cov2Dist( var( t(dcols) ) )             # by observations
obs  <- isoMDS(omat, k = 3)
## initial  value 19.129672 
## iter   5 value 10.465162
## iter  10 value 7.488707
## iter  15 value 6.760770
## iter  20 value 6.483806
## iter  25 value 6.361688
## final  value 6.327042 
## converged
vmat <- cov2Dist( var( dcols ) )                # by variables
vars <- isoMDS(vmat, k = 3)
## initial  value 2.265770 
## iter   5 value 0.028160
## iter   5 value 0.000000
## iter   5 value 0.000000
## final  value 0.000000 
## converged
nv   <- nrow(vmat)
plot(obs$points[,1], obs$points[,2], pch = '.',
     xlab = 'Axis 1', ylab = 'Axis 2')         # axis 1 driven by stdage
wx <- which( abs(obs$points[,1]) > 5 )
text( obs$points[wx,1], obs$points[wx,2],      # note extreme values
      round(data$stdage[wx]), cex = .8)

text(vars$points[,1], vars$points[,2],         # relationship with variables
     rownames(vars$points), cex = 1.2, col=c(1:nv))
points( 0, 0, pch = 16, cex=2 )
arrows( rep(0, nv), rep(0, nv), vars$points[,1], vars$points[,2], 
        col = c(1:nv), lwd = 2 )

Notice that the stdage variable accounts for most of the variation on the first axis. I have labeled the values for observations to see the gradient in values.

I am also interested in how biomass differs by soil type. Here is use of the function tapply to determine the mean and sd for biomass by soil:

mean <- tapply( biomass, data$soil, mean, na.rm = T )
sd   <- tapply( biomass, data$soil, sd, na.rm = T )
biomassTable <- rbind(mean, sd)
print( round( biomassTable, 2 ) )
##      EntVert  Mol reference SpodHist UltKan
## mean    3.57 0.28      4.15     0.40  13.87
## sd      7.13 1.28      9.66     1.98  11.43

I’ll return to this example after further discussion of some basic concepts.

Exercise: Based on plots of soil moisture and climate deficit does it appear useful to include these predictors in a model for biomass of this species? Do both variables bring information?


Traditional regression

To determine the relationship between abundance and environmental variables, I want a model for the response biomass with predictors. Here are the names of variables in data:

names(data)
## [1] "biomass"  "temp"     "deficit"  "moisture" "u1"       "u2"       "u3"      
## [8] "stdage"   "soil"

These are the column headings in data. To see this, I can look at the first two rows:

 data[1:2,]

With the exception of "soil", these are continuous variables. Again, the variable soil is a factor with these levels:

data$soil <- relevel(data$soil, 'reference')
attr(data$soil,'levels')
## [1] "reference" "EntVert"   "Mol"       "SpodHist"  "UltKan"

These are soil ‘orders’, based on parent material.

I start with the standard function lm in R for linear regression. In the formula below I have specified a model containing moisture and with quadratic effects, as I(moisture^2). The I() function in a formula indicates that I want to evaluate the argument as it appears in the function I.

I can fit a linear regression with response biomass to several combinations of variables using lm and plot responses.

fit1 <- lm( biomass ~ moisture, data )   
p1   <- predict( fit1 )
fit2 <- lm( biomass ~ moisture + I(moisture^2), data )   # quadratic
p2   <- predict(fit2)                                    # predictive mean

par( mfrow=c(1,3), bty='n', mar=c(4,4,1,1) )
plot( data$moisture, biomass, cex=.2 )                   # biomass by site moisture
points( data$moisture, p1, col='green', cex=.2 )         # add to plot
points( data$moisture, p2, col='blue', cex=.2 )   

#repeat with additional variables
plot( data$deficit, biomass, cex=.2)
form <- as.formula(biomass ~ soil + moisture*stdage + temp + deficit)
fit3 <- lm(form, data)  
p3   <- predict(fit3)
plot( biomass, p3, cex=.2, xlab = 'Observed', ylab = 'Predicted' )
abline(0, 1, lty=2)
*Biomass fitted to moisture and climate deficit showing observed and predicted.*

Biomass fitted to moisture and climate deficit showing observed and predicted.


For more on use of variable names in the above block of code, see the note about R environments.

Plots show the predictions for a linear and quadratic model, fit1 and fit2 and for a large model with main effects, an interaction moisture*stdage and the factor soil.

I can use the function summary to look at the estimates in a fitted objects, e.g.,

summary(fit3)
## 
## Call:
## lm(formula = form, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.026  -4.051  -0.579   1.673  61.399 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.1995     0.2297  13.928  < 2e-16 ***
## soilEntVert      -1.1961     0.7473  -1.601   0.1097    
## soilMol          -2.1739     1.2190  -1.783   0.0747 .  
## soilSpodHist      1.0590     0.5088   2.082   0.0375 *  
## soilUltKan        6.0472     1.0304   5.869 5.32e-09 ***
## moisture         -0.4506     0.1837  -2.453   0.0143 *  
## stdage           -1.4494     0.1887  -7.682 2.70e-14 ***
## temp              3.8207     0.2339  16.334  < 2e-16 ***
## deficit           0.4002     0.2167   1.847   0.0649 .  
## moisture:stdage   0.9336     0.1980   4.715 2.62e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.308 on 1607 degrees of freedom
## Multiple R-squared:  0.3031, Adjusted R-squared:  0.2992 
## F-statistic: 77.66 on 9 and 1607 DF,  p-value: < 2.2e-16

Standard products of a traditional regression include a table of point estimates for the coefficients in \(\boldsymbol{\beta}\), with their standard errors, which are a measure of uncertainty. There is a Student’s t statistic for each estimate, a measure of how far the estimate is from a hypothesized value of zero. For each estimate there is a P value, the probability for obtaining a value of t at least this large under the hypothesis that the coefficient is equal to zero. There is a F statistic and P value for the entire model. The degrees of freedom is the sample size \(n\) minus the number of fitted parameters \((2)\). The R-squared is interpreted as the ‘variance explained’ by the model. The residual standard error is the estimate of parameter \(\sigma = \sqrt{\sigma^2}\). The 95% confidence intervals for these estimates would be approximately the point estimates \(\pm 1.96\) times the standard errors, or:

##                 estimate   0.025   0.975
## (Intercept)        3.200  2.7500  3.6500
## soilEntVert       -1.200 -2.6600  0.2650
## soilMol           -2.170 -4.5600  0.2100
## soilSpodHist       1.060  0.0644  2.0500
## soilUltKan         6.050  4.0300  8.0600
## moisture          -0.451 -0.8100 -0.0915
## stdage            -1.450 -1.8200 -1.0800
## temp               3.820  3.3600  4.2800
## deficit            0.400 -0.0234  0.8240
## moisture:stdage    0.934  0.5470  1.3200

Together, this combination of outputs for the linear regression would contribute to an interpretation of how abundance of this species responds to climate and habitat.

Exercise. Repeat the regression analysis using function lm with a different species as a response. Interpret the analysis.

Design elements

This model includes a factor and an interaction. A factor has categorical levels that can be represented by words in a data.frame. In this example, data$soil is a factor. The levels for soil are attr(data$soil,'levels') = reference, EntVert, Mol, SpodHist, UltKan. In the fore-going analysis there was a coefficient for all but the first factor level, reference. To understand how factors are treated in an analysis, it’s easiest to look at the design matrix \(\mathbf{X}\) that is generated when factors are included. Here are a few lines of \(\mathbf{X}\) for this model:

tmp  <- model.frame(form, data, na.action=NULL)
x    <- model.matrix(form, data=tmp)
head( x )
##   (Intercept) soilEntVert soilMol soilSpodHist soilUltKan   moisture
## 1           1           0       0            0          0  0.6870299
## 2           1           0       0            0          0  1.6655992
## 3           1           0       0            1          0 -0.1892591
## 4           1           0       0            0          0  0.3900552
## 5           1           0       0            0          0  0.8500289
## 6           1           0       0            0          0 -0.8765680
##        stdage       temp     deficit moisture:stdage
## 1 -0.16697961  1.2165433  0.03637914     -0.11471999
## 2 -0.02907271  0.1825447  0.20708706     -0.04842349
## 3  0.48147624 -0.9409308  0.20345146     -0.09112376
## 4 -0.07895393  0.6435989  0.81532968     -0.03079639
## 5  0.36007415  0.8238641 -0.17556592      0.30607343
## 6 -0.58986965  0.2201759  0.76258277      0.51706086

The missing factor level, called reference, is represented in the model by the intercept, the first column of ones in \(\mathbf{X}\). An observation from the reference soil type will have zeros in the remaining four columns. Any other soil type will have a one in its column, so its effect is the sum of the intercept and its own coefficient.

The interaction term moisture:stdage is the product of two columns. In the coefficient table there is a main effect of both moisture and stdage and there is an interaction coefficient. If the interaction coefficient is positive, then the effect of stage becomes more important as it increases away from zero (either positive or negative).

A Bayesian approach

Leaving technical detail for subsequent units, this section compares the previous result with a Bayesian analysis, incorporating a non-informative prior distribution. As mentioned above, a Bayesian analysis combines the likelihood with a prior distribution. In this analysis, the prior distribution is taken to be non-informative for coefficients and residual variance \(\boldsymbol{\beta}, \sigma\).

Recall the Bayesian model, here applied to regression for a slope and intercept model:

\[ [ \boldsymbol{\beta}, \sigma^2 | \mathbf{X}, \mathbf{y} ] \propto \prod_{i=1}^n N \left( y_i | \mathbf{x}'_i \boldsymbol{\beta}, \sigma^2 \right) MVN( \boldsymbol{\beta} | \mathbf{b}, \mathbf{B}) IG(\sigma^2 | s_1, s_2 ) \] The first factor on the right-hand side is the likelihood. The second factor is a multivariate prior distribution for regression coefficients. The third factor it the inverse gamma distribution that I use as a prior for the residual variance. The prior distribution for regression coefficients is non-informative, having these prior parameter values:

\[\begin{align} \mathbf{b} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ \mathbf{B} &= \begin{bmatrix} 1000 & 0 \\ 0 & 1000 \end{bmatrix} \end{align}\]

The prior parameter values for the inverse gamma distribution are also non-informative: \(s_1 = s_2 = 0.1\). Here is how I would specify them in the code:

priorB  <- matrix( 0, 2, 1 ) # prior mean regression parameters
priorVB <- diag( 1000, 2 )   # inverse prior covariance matrix
s1      <- .1                # variance prior values
s2      <- .1

Here is a Bayesian analysis to compare with fit3:

fitb <- bayesReg(form, data)
## 
## Coefficients:
##                  median std error    0.025    0.975 not zero
## intercept        3.1960    0.2344  2.74300  3.64400        *
## soilEntVert     -1.1830    0.7456 -2.62300  0.29790         
## soilMol         -2.1450    1.2030 -4.50300  0.20090         
## soilSpodHist     1.0720    0.5120  0.06454  2.03700        *
## soilUltKan       6.0510    1.0310  4.06100  8.06200        *
## moisture        -0.4470    0.1848 -0.82130 -0.09579        *
## stdage          -1.4510    0.1889 -1.81200 -1.08800        *
## temp             3.8180    0.2401  3.35900  4.28700        *
## deficit          0.4037    0.2226 -0.03556  0.83420         
## moisture:stdage  0.9348    0.1985  0.54540  1.32400        *
## 
##  * indicates that 95% predictive interval does not include zero
## 
## Residual standard error 7.314, with 1607 degrees of freedom, 
##  root mean sq prediction error 7.177.

The function BayesReg organizes output a bit differently from lm, reporting 95% credible intervals [values at (0.025 and 0.975)]. The output is simpler than that generated by lm–there are not a lot of statistics. However, I see that point estimates and standard errors for coefficients are nearly the same as I obtained with lm. I also see that the coefficients having credible intervals that do not span zero in the Bayesian analysis are the same coefficients that lm flagged as ‘signficant’.

*Same estimates from both methods, showing point estimates and 95% CIs.*

Same estimates from both methods, showing point estimates and 95% CIs.

The header for this section asks ‘what’s different about Bayes?’. The shift from a classical to Bayesian analysis did not change how I interpret effects of climate and habitat on biomass of this species. However, the Bayesian model can be extended in a number of ways. In the next section I illustrate a hierarchical version having an additional stage.

Zeros are common

At least two problems with both the traditional and Bayesian regressions are that the normal distribution i) does not allow any discrete values, including zeros, and ii) it is non-zero for negative values, whereas biomass can only be non-negative. Biomass data diverge from the assumptions of the model in that observations \(Y \in [0, \infty)\)–there is point mass at zero. Recall that the probability for any discrete value is zero for a continuous variable. Positive biomass values are continous, but zero is discrete. Here the data are dominated by zeros:

hist(biomass, nclass=50)                            # distribution
*Response has many zeros.*

Response has many zeros.

length(which(biomass == 0))/length(biomass)
## [1] 0.733457


If there were a few zeros, with most values bounded away from zero, I might argue that it’s close enough. That’s not the case here. Standard diagnostic plots, such as plot(fit1) make this clear.

If these were discrete data, I might turn to a zero-inflated Poisson or negative binomial model. These are examples of generalized linear models (GLMs) to be discussed in later units. These models sometimes work ok, provided there are not too many zeros, but here zeros dominate. Regardless, these GLMs describe discrete data, so I cannot use either.

I cannot add a small amount to each observation and then transform the data to a log scale. If I do that, every coefficient I estimate depends on the arbitrary value I used.

Tobit model

A Tobit regression allows for continuous responses with zeros, and it is readily implemented as a Bayesian model. Part (b) of the model graph includes an additional stage for a variable \(W\). This variable has a normal distribution; it is defined on \((-\infty, \infty)\). The observed \(Y\) is a censored version of \(W\). It is equal to the response \(Y\) whenever \(Y > 0\). When \(Y = 0\), then the latent variable \(W\) is negative:

\[ y_i = \begin{cases} w_i & \quad w_i > 0\\ 0 & \quad w_i \leq 0\\ \end{cases} \]

Treating \(Y\) as a censored version of \(W\) allows me to combine continuous and censored data without changing the scale. In a Tobit model the censored value is zero, but it also works with other values.

With this model the regression moves to the latent \(W\),

\[w_i \sim N(\mathbf{x}'_i \boldsymbol{\beta}, \sigma^2)\] The model has become hierarchical. I fit this model in a Bayesian setting, again, requiring a prior distribution.

Now allowing for the zeros in \(Y\), the fitted coefficients differ substantially from both previous models:

fitTobit <- bayesReg(form, data, TOBIT=T)
## fitted as Tobit model
## 
## Coefficients:
##                     median std error    0.025   0.975 not zero
## intercept       -21.630000    1.8990 -25.0700 -18.500        *
## soilEntVert      -6.686000    2.1540 -10.8300  -2.361        *
## soilMol         -14.890000    6.5510 -28.4600  -2.644        *
## soilSpodHist    -10.990000    3.0690 -16.5300  -4.638        *
## soilUltKan        8.805000    2.1590   4.4950  13.000        *
## moisture         -0.008222    0.5599  -1.0910   1.106         
## stdage           -4.116000    0.6616  -5.3970  -2.814        *
## temp             24.000000    1.8800  21.0600  27.210        *
## deficit          -1.108000    0.7125  -2.4820   0.320         
## moisture:stdage   1.370000    0.5994   0.2412   2.578        *
## 
##  * indicates that 95% predictive interval does not include zero
## 
## Residual standard error 14.24, with 1607 degrees of freedom, 
##  root mean sq prediction error 6.871.
par( mfrow=c(1,2), bty='n' )
plot( biomass, p3, cex=.2, ylab='Predicted values' )
points (biomass, fitTobit$predictY[,2], col=2, cex=.2 )
abline( 0, 1, lty=2 )
abline( h=0)

rbeta <- summary(fit3)$coefficients   # standard regression coeffs
tbeta <- fitTobit$beta                # Bayesian Tobit coeffs
plot( rbeta[,1], tbeta[,1], xlim = c(-2, 8), 
      xlab='Traditional regression', ylab='Tobit')
text( rbeta[,1], tbeta[,1], rownames(tbeta), pos = 4, cex = .8 )
abline( 0, 1, lty=2 )
*Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.*

Prediction from linear regression (black) and the Tobit model (red). Mean parameter estimates at right show large differences.


Exercise. Using a different species and predictors, interpret differences between estimates for the Tobit and standard regression model.

I called the Tobit model ‘hierarchical’, but some could see it differently. The \(W\) stage in the model is ‘partially known’. Note the dashed line in part (c) of the graphical model. If I know the value of \(W\), then I also know the value of \(Y\). So in the ‘generative direction’ from \(W\) to \(Y\) I can view their relationship as deterministic. However, given \(Y\) does not necessarily mean that I know \(W\). If \(Y = 0\), then \(W\) is stochastic.

In summary, not only is the Tobit defensible as a valid model for continuous data with zeros–it also finds more effects and better predicts data than the traditional regression. Many value the hierarchical framework for its flexibility to admit additional stages.


Appendix

Note about R environments

To fully understand the block of code for the lm fit, I need to know that the variable biomass is defined in my global environment (see previous code block). I can enter length(biomass) and get an answer, because biomass exists in the working environment. I have not assigned the variables deficit and moisture in my global environment. They only exist within the data.frame data. When I call the function lm it knows to look for variables in my global environment or in data, because I have passed data as an argument. The functions plot and points do not look for variables in this way. When I call them, I must specify the data.frame with the variable name, data$deficit. Using R is the subject of Unit 2.

Notation

Here are some notation conventions used in these vignettes.

notation example interpretation
italic \(x\), \(X\) scalar quantity, known
greek \(\alpha, \beta, \dots\) stochastic (fitted) variable, unknown
parentheses \(\phi(\mu, \sigma^2)\), \(N(\mu, \sigma^2)\) parametric function/density
curly brackets \(\{0, 1, \dots\}\) a set on objects
closed interval \((-\infty,0]\), \([0, \infty)\) intervals include zero
open interval \((-\infty,0)\), \((0, \infty)\) exclude zero
distributed as \(x \sim N(\mu, \sigma^2)\) distribution or density
expectation \(E[\epsilon] = 0\) expected value of a variable
variance \(Var[\epsilon] = \sigma^2\) variance of a variable
bracket, distribution \([A, B, C]\) unspecified density or distribution
proportional \(f(x) \propto g(x)\) differ by a constant, \(f(x) = c g(x)\)
approximate \(\pi \approx 3.14159\) approximately equal
real numbers \(\mathbb{R} = (-\infty, \infty)\) note: also positive real \(\mathbb{R}_+\)
is an element of \(\pi \in \mathbb{R}\) \(\pi\) is a real number
subset \(\{a\}\) and \(\{a, b\} \subseteq \{a, b\}\) in the set
proper subset \(\{a\}\), but not \(\{a, b\} \subset \{a, b\}\) cannot include all elements
union \(a \cup b\) either \(a\) or \(b\)
intersection \(a \cap b\) both \(a\) and \(b\)
sum \(\sum_{i=1}^n x_i\) \(x_1 + \dots + x_n\)
product \(\prod_{i=1}^n x_i\) \(x_1 \times \dots \times x_n\)
exponentiate \(e^x\), \(exp(x)\) \(e \approx 2.71828\), inverse of \(ln(x)\)
natural logarithm \(ln(x)\) inverse of \(e^x\)

##matrices

notation example interpretation
bold, l.c. \(\mathbf{x}\), \(\mathbf{x}'\) column and row vectors, respectively
bold, u.c. \(\mathbf{X}\) matrix
dimensions \(\underset{\scriptscriptstyle n\times q}{\mathbf{X}}\) \(n\) rows, \(q\) columns
subscript \(\mathbf{x}_i\), \(\mathbf{X}_{ij}\) element of an array
row vector \(\mathbf{X}_{i\cdot}\) row \(i\)
column vector \(\mathbf{X}_{\cdot j}\) column \(j\)
transpose \(\mathbf{X}'\) rows become columns
matrix inverse \(\mathbf{A}^{-1}\) solve systems of linear equations
identity matrix \(\mathbf{I}_p = \mathbf{A} \mathbf{A}^{-1}\) \(p \times p\) with 1’s on the diagonal, zeros elsewhere
matrix determinant \(det\left( \mathbf{A} \right)\) obtain for a square matrix, e.g., covariance
kronecker product \(\underset{\scriptscriptstyle n\times m}{\mathbf{A}} \otimes \underset{\scriptscriptstyle p\times q}{\mathbf{B}}\) \(np \times mq\) matrix, defined in text