Parsimony and Likelihood

M. Drew LaMar
March 3, 2021

Announcements

Reading assignment for Friday: Anderson, Appendix A: Likelihood Theory

Models - General Definition

Definition: A model is a simplified, abstract (or concrete) representation of objects and their relationships or processes in the real world.

The Importance of Good Data

Quote: “Meaningful data of sufficient quantity are the grist of scientific bread.”

Is the study sound so that an inductive inference can be justified?
- Experimental design should be able to address predictions from…
- …one or a number of scientific hypotheses that have been well thought out.
Are the data analysis methods sound? Relies on…
- …adequate modeling and
- objective approaches to model selection.

Information and Statistics

Quote: “If data are collected in an appropriate manner, then there is information in the sample data about the process or system under study.”

A mathematical model is required (in most cases) to obtain information from the data.
Inductive vs deductive reasoning
Inductive: “inference of a generalized conclusion from particular instances” (sample -> population)
Statistics adds rigor to the inductive process.
The inference comes from a model that approximates the system or process of interest.

Models in Science

Quantification is essential due to variation and complexity.

Quote: “Unless one is engaged in simple descriptive studies, they [the empirical sciences] must deal with mathematical models.”

Quote: “We are not trying to model the data; instead, we are trying to model the information in the data.”

Quote: “Data contain both information and noise; fitting the data perfectly would include modeling the noise and this is counter to our science objective.”

Models in Science

Quote: “Models must be derived to carefully represent each of the science hypotheses.”

\[ H_{1} \Leftrightarrow g_{1}, \ H_{2} \Leftrightarrow g_{2}, \ldots, H_{k} \Leftrightarrow g_{k}. \]

Scientific Question: What is the support or empirical evidence for the \( i \)th hypothesis (via its corresponding model), relative to the others in the set?

Model Selection: What is the evidence for each of the hypotheses (and their associated models), given the data?

Models are Approximations

“All models are wrong, but some are useful.”
- Box

Example: Population survival \[ n_{t+1} = s\cdot n_{t} \]

Assumptions:

Population survival rate \( s \) does not change over time.
Each individual most likely has a different survival rate (\( s \) represents the population average).
Biotic and abiotic factors that influence survival rate are being ignored.
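
As a minimal sketch of what these assumptions imply, the model can be iterated forward with a single fixed \( s \). The function name `project_population` and the values `n0 = 100`, `s = 0.8`, and `steps = 5` are illustrative choices, not taken from any data set.

```r
# Project a population forward under n[t+1] = s * n[t].
# `project_population`, `n0`, `s`, and `steps` are illustrative names.
project_population <- function(n0, s, steps) {
  n <- numeric(steps + 1)
  n[1] <- n0
  for (t in seq_len(steps)) {
    n[t + 1] <- s * n[t]  # the same survival rate s is applied at every step
  }
  n
}

# Hypothetical values: 100 individuals, 80% survival per time step.
traj <- project_population(n0 = 100, s = 0.8, steps = 5)
print(traj)
```

Because \( s \) is constant, the trajectory is the geometric sequence \( n_{t} = s^{t} n_{0} \); any variation among individuals or over time is averaged away.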

Models are Approximations

Discuss: What about Hardy-Weinberg equilibrium? What are the assumptions and approximations that go into this model?

Parameter estimation and model fit

Three common approaches have emerged for general parameter estimation:

least squares, LS (or “regression”),
maximum likelihood, ML, and
Bayesian methods.

Definition: The maximum likelihood estimate (MLE) is the value of the parameter that is most likely, given the data and model.

The Principle of Parsimony

Quote: “A person new to statistical thinking often finds it difficult to relate data, model, and model parameters that must be estimated. These are hard concepts to understand and the concepts are wound into the issue of parsimony. Let the data be fixed and then realize the information in the data is also fixed, then some of this information is ‘expended’ each time a parameter is estimated. Thus, the data will only ‘support’ a certain number of estimates, as this limit is exceeded parameter estimates become either very uncertain (e.g., large standard errors) or reach the point where they are not estimable.”

The Principle of Parsimony

“…too few parameters and the model will be so unrealistic as to make prediction unreliable, but too many parameters and the model will be so specific to the particular data set so to make prediction unreliable.”
- Edwards
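
The trade-off in this quote can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated (a straight line plus noise, with hypothetical values for the intercept, slope, and noise level), and leave-one-out prediction error is used here as one convenient measure of prediction reliability; it is not a method named in the source.

```r
set.seed(1)  # simulated data; nothing here comes from the source

n <- 13                                # a small sample
x <- seq(0, 1, length.out = n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)    # true structure: a straight line

degrees <- 1:6
summary_table <- do.call(rbind, lapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))            # polynomial of degree d
  res <- residuals(fit)
  loo <- res / (1 - hatvalues(fit))    # leave-one-out prediction residuals
  data.frame(n_coef  = d + 1,          # intercept plus d polynomial terms
             rss     = sum(res^2),     # in-sample fit: never gets worse
             loo_sse = sum(loo^2))     # prediction error: need not improve
}))
print(summary_table)
```

The in-sample residual sum of squares can only decrease as coefficients are added, because the models are nested. The leave-one-out error has no such guarantee, which is the sense in which extra parameters can make a model too specific to one data set.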

The Principle of Parsimony

Quote: “Each time a parameter is estimated, some information is ‘taken out’ of the data, leaving less information available for the estimation of still more parameters.”

Quote: “In model selection, we are really asking which is the best model for a given sample size.”

In other words, what's the best model given the amount of information that we have?

Quote: “We are really asking - how much model structure will the data support?”

Tapering Effect Sizes

Large effects -> Medium effects -> Small effects -> …

We achieve the ability to detect ever smaller effects in a system through:

larger sample sizes,
better study designs, and
better models based on
better hypotheses.

Example: Hardening of Portland Cement

Variables

\( x_{1} \): calcium aluminate
\( x_{2} \): tricalcium silicate
\( x_{3} \): tetracalcium alumino ferrite
\( x_{4} \): dicalcium silicate
\( y \): calories of heat per gram of cement following 180 days of hardening

Example: Hardening of Portland Cement

Hypotheses/Models

Example: Hardening of Portland Cement

Data

library(AICcmodavg)
data(cement)
str(cement)

'data.frame':   13 obs. of  5 variables:
 $ x1: int  7 1 11 11 7 11 3 1 2 21 ...
 $ x2: int  26 29 56 31 52 55 71 31 54 47 ...
 $ x3: int  6 15 8 8 6 9 17 22 18 4 ...
 $ x4: int  60 52 20 47 33 22 6 44 22 26 ...
 $ y : num  78.5 74.3 104.3 87.6 95.9 ...

Discuss: How many parameters can we reasonably estimate with this amount of data?

Answer: A common rule of thumb is about one estimable parameter per 10 observations (\( n/10 \)), so with \( n = 13 \) these data support roughly one parameter.

Example: Hardening of Portland Cement

Data

Collinearity between variables!!!

library(dplyr)  # provides %>%, select(), and starts_with()
cor(cement %>% select(starts_with("x")))

           x1         x2         x3         x4
x1  1.0000000  0.2285795 -0.8241338 -0.2454451
x2  0.2285795  1.0000000 -0.1392424 -0.9729550
x3 -0.8241338 -0.1392424  1.0000000  0.0295370
x4 -0.2454451 -0.9729550  0.0295370  1.0000000
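
To make the problematic pairs explicit, the correlation matrix above can be scanned for large absolute values. This is a minimal sketch: the matrix is typed in from the printed output so the block runs without the data package, and the cutoff of 0.8 is an arbitrary illustrative threshold, not a value from the source.

```r
# Correlation matrix copied from the printed output above.
vars <- c("x1", "x2", "x3", "x4")
r <- matrix(c( 1.0000000,  0.2285795, -0.8241338, -0.2454451,
               0.2285795,  1.0000000, -0.1392424, -0.9729550,
              -0.8241338, -0.1392424,  1.0000000,  0.0295370,
              -0.2454451, -0.9729550,  0.0295370,  1.0000000),
            nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Flag each pair once (upper triangle) whose |correlation| exceeds the cutoff.
cutoff <- 0.8  # illustrative threshold, an assumption
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(var1 = rownames(r)[idx[, "row"]],
                      var2 = colnames(r)[idx[, "col"]],
                      r    = r[idx])
print(flagged)
```

With this cutoff the flagged pairs are \( x_{1} \) with \( x_{3} \) and \( x_{2} \) with \( x_{4} \), both strongly negatively correlated.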

Example: Hardening of Portland Cement

Data

Quote: “Rigorous experimental methods were just being developed during the time these data were taken (about 1930). Had such design methods been widely available and the importance of replication understood, then it would have been possible to break the unwanted correlations among the x variables and establish cause and effect if that was a goal.”

Quote: “Orthogonality arises in controlled experiments where the factors and levels are designed to be orthogonal. In observational studies, there is often a high probability that some of the regressor variables will be mutually quite dependent.”

Likelihood Theory

Definition: The Likelihood of model parameters \( \theta \) given the model (\( g \)) and data (\( x \)) is given by: \[ \mathcal{L}(\theta | x, g) \]

Note: Likelihood theory describes how to find the parameter values that are most likely, given the model and the data.

Likelihood Theory: Example

Suppose you observe 10 coin flips and you see the following result:

H H H H H H T T T H

Discuss: What’s the most likely value for the probability of heads on an individual coin flip?

Let \( p \) denote the probability of getting heads.

This follows what is known as a binomial model, with the probability of getting 7 heads out of 10 given by:

\[ \mathrm{Prob}(7 \ \textrm{heads}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]

Likelihood Theory: Example

In this example \[ \mathcal{L}(\theta | x, g) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \] we have

The model \( g \) is the binomial model
The data \( x \) are the number of coin flips (10) and the number of heads (7). In other words, it's what we observed.
The unknown parameter \( \theta \) is \( p \)
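
The same pieces can be written as a short R sketch: the data are held fixed and the likelihood is treated as a function of \( p \) alone. The function name `likelihood` and the candidate values of \( p \) are illustrative; `dbinom()` is base R's binomial probability function.

```r
# Likelihood of p given 7 heads in 10 flips under the binomial model.
# `likelihood` is an illustrative name; the data are the default arguments.
likelihood <- function(p, heads = 7, flips = 10) {
  dbinom(heads, size = flips, prob = p)
}

# Same data, different candidate values of p.
p_candidates <- c(0.3, 0.5, 0.7, 0.9)
lik_values <- likelihood(p_candidates)
print(data.frame(p = p_candidates, likelihood = lik_values))
```

Among these four candidates, \( p = 0.7 \) gives the largest likelihood.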

Likelihood Theory: Example

\[ \mathcal{L}(p | 10, 7; \mathrm{binomial}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]

[Plot: the likelihood \( \mathcal{L}(p | 10, 7; \mathrm{binomial}) \) as a function of \( p \)]

Likelihood Theory: Example

The most likely parameter given the model and data is where the likelihood function is maximized.

It's usually easier to deal with sums rather than products, so we look at the log-likelihood function instead:

\[ \log(\mathcal{L}(\theta | x, g)) \]

which in our case becomes

\[ \log\left(\begin{array}{c}10 \\ 7\end{array}\right) + 7\log p +3\log (1-p) \]
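
As a worked step, the maximum of this log-likelihood can be found by setting its derivative with respect to \( p \) to zero. The binomial coefficient does not depend on \( p \), so it drops out:

\[ \frac{d}{dp}\left[ \log\left(\begin{array}{c}10 \\ 7\end{array}\right) + 7\log p + 3\log (1-p) \right] = \frac{7}{p} - \frac{3}{1-p} = 0 \]

\[ 7(1-p) = 3p \quad \Rightarrow \quad \hat{p} = \frac{7}{10} \]

The second derivative, \( -7/p^{2} - 3/(1-p)^{2} \), is negative for all \( 0 < p < 1 \), so this critical point is a maximum. Here \( \hat{p} \) denotes the maximum likelihood estimate, which equals the observed proportion of heads.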

Likelihood Theory: Example

Log-likelihood function:

[Plot: the log-likelihood as a function of \( p \)]
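
The maximum of the log-likelihood can also be located numerically. The following is a minimal sketch using base R's `optimize()` for a one-dimensional search; the names `log_lik`, `fit`, and `p_hat` are illustrative.

```r
# Log-likelihood of p for 7 heads in 10 flips (binomial model).
log_lik <- function(p, heads = 7, flips = 10) {
  dbinom(heads, size = flips, prob = p, log = TRUE)
}

# One-dimensional search for the maximum over the interval (0, 1).
fit <- optimize(log_lik, interval = c(0, 1), maximum = TRUE)
p_hat <- fit$maximum  # numerical estimate of the most likely p
print(p_hat)
```

The result should agree, up to the optimizer's tolerance, with the observed proportion of heads, 7/10.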