M. Drew LaMar
February 19, 2020
“In its purest form, mathematics is the perfect expression of human thought that marries logic with creative expression.”
- Junaid Mubeen
Quote: “A person new to statistical thinking often finds it difficult to relate data, model, and model parameters that must be estimated. These are hard concepts to understand and the concepts are wound into the issue of parsimony. Let the data be fixed and then realize the information in the data is also fixed; then some of this information is ‘expended’ each time a parameter is estimated. Thus, the data will only ‘support’ a certain number of estimates; as this limit is exceeded, parameter estimates become either very uncertain (e.g., large standard errors) or reach the point where they are not estimable.”
“…too few parameters and the model will be so unrealistic as to make prediction unreliable, but too many parameters and the model will be so specific to the particular data set as to make prediction unreliable.”
- Edwards
Quote: “Each time a parameter is estimated, some information is ‘taken out’ of the data, leaving less information available for the estimation of still more parameters.”
Quote: "In model selection, we are really asking which is the best model
for a given sample size .”
In other words, what's the best model given the amount of information that we have?
Quote: “We are really asking - how much model structure will the data support?”
Large effects -> Medium effects -> Small effects -> …
We achieve the ability to detect ever smaller effects in a system as the amount of information in the data grows.
library(AICcmodavg)  # provides the cement hardening data
data(cement)
str(cement)
'data.frame': 13 obs. of 5 variables:
$ x1: int 7 1 11 11 7 11 3 1 2 21 ...
$ x2: int 26 29 56 31 52 55 71 31 54 47 ...
$ x3: int 6 15 8 8 6 9 17 22 18 4 ...
$ x4: int 60 52 20 47 33 22 6 44 22 26 ...
$ y : num 78.5 74.3 104.3 87.6 95.9 ...
Discuss: How many parameters can we reasonably estimate with this amount of data?
Answer: Rule of thumb: number of estimable parameters \( \approx n/10 \). Here \( n = 13 \), so we can reasonably estimate only about one parameter.
Collinearity between variables!!!
library(dplyr)  # for %>% and select()
cor(cement %>% select(starts_with("x")))
x1 x2 x3 x4
x1 1.0000000 0.2285795 -0.8241338 -0.2454451
x2 0.2285795 1.0000000 -0.1392424 -0.9729550
x3 -0.8241338 -0.1392424 1.0000000 0.0295370
x4 -0.2454451 -0.9729550 0.0295370 1.0000000
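One quick way to see the cost of this collinearity is to fit the full regression and inspect the coefficient standard errors; with \( x_2 \) and \( x_4 \) correlated at about \( -0.97 \), the individual estimates are poorly determined. A minimal sketch (output not shown):
# Fit the full model; near-collinear predictors inflate the
# standard errors of the individual coefficients
full <- lm(y ~ x1 + x2 + x3 + x4, data = cement)
summary(full)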
Quote: “Rigorous experimental methods were just being developed during the time these data were taken (about 1930). Had such design methods been widely available and the importance of replication understood, then it would have been possible to break the unwanted correlations among the x variables and establish cause and effect if that was a goal.”
Quote: “Orthogonality arises in controlled experiments where the factors and levels are designed to be orthogonal. In observational studies, there is often a high probability that some of the regressor variables will be mutually quite dependent.”
Definition: The Likelihood of model parameters \( \theta \) given the model (\( g \)) and data (\( x \)) is given by: \[ \mathcal{L}(\theta | x, g) \]
Note: Likelihood theory describes how to find the parameter values of a model that are most likely given the data.
Suppose you observe 10 coin flips and you see the following result:
H H H H H H T T T H
Discuss: What’s the most likely value for the probability of heads on an individual coin flip?
Let \( p \) denote the probability of getting heads.
This follows what is known as a binomial model, with the probability of getting 7 heads out of 10 given by:
\[ \mathrm{Prob}(7 \ \textrm{heads}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]
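As a quick numerical check, this probability can be evaluated in R for a candidate value such as \( p = 0.7 \) (a small sketch using base R):
choose(10, 7) * 0.7^7 * (1 - 0.7)^3   # direct formula, ~0.267
dbinom(7, size = 10, prob = 0.7)      # same value via base R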
In this example, with \( \theta = p \), the general likelihood \( \mathcal{L}(\theta | x, g) \) becomes
\[ \mathcal{L}(p | 10, 7; \mathrm{binomial}) = \left(\begin{array}{c}10 \\ 7\end{array}\right)p^{7}(1-p)^{3} \]
The most likely parameter value, given the model and data, is the one at which the likelihood function is maximized.
It's usually easier to deal with sums rather than products, so we look at the log-likelihood function instead:
\[ \log(\mathcal{L}(\theta | x, g)) \]
which in our case becomes
\[ \log\left(\begin{array}{c}10 \\ 7\end{array}\right) + 7\log p + 3\log(1-p) \]
Log-likelihood function: [plot of \( \log\mathcal{L}(p) \) versus \( p \), peaking at \( \hat{p} = 0.7 \)]
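A short R sketch reproduces this curve and confirms the maximum numerically (the function name loglik is illustrative):
# Log-likelihood of p given 7 heads in 10 flips
loglik <- function(p) lchoose(10, 7) + 7 * log(p) + 3 * log(1 - p)
# Plot over (0, 1)
curve(loglik(x), from = 0.01, to = 0.99,
      xlab = "p", ylab = "log-likelihood")
# Numerical maximization recovers the MLE
optimize(loglik, interval = c(0, 1), maximum = TRUE)
The numerical maximum agrees with the analytic answer: setting the derivative \( 7/p - 3/(1-p) \) to zero gives \( \hat{p} = 7/10 = 0.7 \).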
AIC uses likelihood and information theory to construct another model selection criterion (different from backward elimination or forward selection).
One problem with using adjusted \( R^2 \) values for model selection is that it does poorly with out-of-sample prediction. In other words, finding a best-fit model using adjusted \( R^2 \) biases the choice toward the particular data set that was used to fit it.
AIC addresses this out-of-sample prediction issue by its very design. The current state of the art is to use AIC for model selection and adjusted \( R^2 \) for model validation (all models in the candidate set could be bad!!).
\[ AIC = -2\log(\mathcal{L}(\hat{\theta} | x, g)) + 2K \]
where \( \hat{\theta} \) is the maximum likelihood estimate of the model parameters and \( K \) is the number of estimable parameters in the model.
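To make the formula concrete, here is a hand computation of AIC for one candidate model, checked against R's built-in AIC(); the model formula is chosen for illustration:
# AIC by hand for one candidate model
fit <- lm(y ~ x1 + x2, data = cement)
K <- length(coef(fit)) + 1               # +1 for the residual variance
-2 * as.numeric(logLik(fit)) + 2 * K     # matches AIC(fit)
AIC(fit)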
For small sample sizes, a second-order corrected version (AICc) is recommended:
\[ \begin{eqnarray*} AICc & = & -2\log(\mathcal{L}(\hat{\theta} | x, g)) + 2K\left(\frac{n}{n-K-1}\right) \\ & = & AIC + \frac{2K(K+1)}{n-K-1} \end{eqnarray*} \]
where \( n \) is the sample size. When \( n \) is large, AICc converges to AIC.
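In practice, the aictab() function from AICcmodavg automates this comparison, ranking a candidate set of models by AICc. A minimal sketch for the cement data (the candidate formulas are illustrative):
# Compare a small candidate set of models by AICc
cands <- list(
  lm(y ~ x1 + x2, data = cement),
  lm(y ~ x1 + x4, data = cement),
  lm(y ~ x1 + x2 + x4, data = cement)
)
aictab(cand.set = cands, modnames = c("x1+x2", "x1+x4", "x1+x2+x4"))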