A model is a conceptualization of a system, often a hypothesis of how we think a system works.
"All models are wrong, but some are useful." - George Box, 1976 & 1978
…is a mathematical representation of relationships between a population's variables.
\[y = f(x) + \varepsilon\]
(We used to call \(Y\) and \(X\) dependent and independent variables, but statisticians no longer recommend this nomenclature.)
\(f(x) = \beta_{0} + \beta_{1} x\), and \(\varepsilon\) follows a normal distribution:
\[y = \beta_{0} + \beta_{1} x + \varepsilon\]
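As a minimal sketch of this model, the snippet below simulates data from \(y = \beta_{0} + \beta_{1} x + \varepsilon\) and recovers the coefficients by ordinary least squares. The parameter values, sample size, and the helper name `fit_simple_ols` are assumptions made for illustration; they do not come from any data set discussed here.

```python
import random

def fit_simple_ols(x, y):
    """Return (b0, b1), the least-squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: covariation of x and y divided by the variation in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical "true" parameters, chosen only for this illustration.
beta0, beta1, sigma = 2.0, 0.5, 1.0

rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(500)]
# epsilon is drawn from a normal distribution with mean 0 and sd sigma.
y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]

b0_hat, b1_hat = fit_simple_ols(x, y)
print(f"estimated intercept: {b0_hat:.3f}, estimated slope: {b1_hat:.3f}")
```

The estimates should land near, but not exactly on, the simulated values, because each sample differs from the population by chance.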
…using various methods.
…can be modeled by including higher order terms.
\[y = \beta_{0} + \beta_{1} x + \beta_{2} x^2 + \dots + \beta_{z} x^z + \varepsilon\]
Generalized linear models (GLMs) describe categorical and/or non-linear relationships by assigning a link function, \(f(y)\), and letting the residuals follow non-normal distributions.
\[f(y) = \beta_{0} + \beta_{1} x + \varepsilon\]
This common GLM, logistic regression, uses the logit link function and binomial \(\varepsilon\) to model the probability of a binary outcome.
\[\ln \left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1} x + \varepsilon\]
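To make the logit link concrete, this sketch converts values of the linear predictor \(\beta_{0} + \beta_{1} x\) back into probabilities with the inverse logit. The coefficients are hypothetical, picked only so the probabilities span a visible range.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map a linear predictor (log-odds) back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -3.0, 0.8

for x in (0, 2, 4, 6):
    eta = beta0 + beta1 * x   # linear on the log-odds scale
    p = inv_logit(eta)        # always between 0 and 1
    print(f"x = {x}: log-odds = {eta:+.1f}, p = {p:.3f}")
```

The relationship is linear on the log-odds scale, yet the probabilities stay between 0 and 1, which is why the link function is useful for binary outcomes.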
Finally, the linear model can allow any number of predictor variables.
\[y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{k} x_{k} + \varepsilon\]
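A sketch of the multiple-predictor case, assuming NumPy is available: the design matrix gets a column of ones for \(\beta_{0}\), and least squares estimates all coefficients at once. The three predictors and their coefficient values are simulated assumptions, not real data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical coefficients beta_0 .. beta_3, for illustration only.
beta = np.array([1.0, 2.0, -0.5, 0.0])

# Three simulated predictors x_1, x_2, x_3.
X = rng.normal(size=(n, 3))

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(n), X])

# y = beta_0 + beta_1 x_1 + beta_2 x_2 + beta_3 x_3 + epsilon
y = design @ beta + rng.normal(scale=1.0, size=n)

# Least-squares estimates of all coefficients.
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(beta_hat, 3))
```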
Ronald Fisher went to war with Jerzy Neyman and Egon Pearson (son of Karl Pearson) during the 1930s over the most philosophically sound method for statistical hypothesis testing.
Our modern null hypothesis significance test, NHST, is a hybrid of the two camps' recommendations.
Hypotheses can never be proven true; they can only ever be proven false.
This philosophy of science says that a hypothesis is scientific only if it is falsifiable.
Following this philosophy, NHST attempts to falsify a null hypothesis.
Examples:
| Hypothesis | \(t\)-Test | ANOVA | Regression | \(\chi^2\) |
|---|---|---|---|---|
| \(\mathrm{H}_0\) | \(\mu_1 = \mu_2\) | All \(\mu_j\) are equal | \(\beta=0\) | O \(=\) E |
| \(\mathrm{H}_{\mathrm{a}}\) | \(\mu_1 \neq \mu_2\) | Not all \(\mu_j\) are equal | \(\beta \neq 0\) | O \(\neq\) E |
Test statistics are standardized measures of deviation from the null-hypothesized condition.
Examples:
| \(t\)-Test | ANOVA | Regression | \(\chi^2\) |
|---|---|---|---|
| \(t = \frac{\bar{x}_1-\bar{x}_2}{{SE}_{1,2}}\) | \(F = \frac{{MS}_{\mathrm{among}}}{{MS}_{\mathrm{within}}}\) | \(t = \frac{\hat{\beta}-0}{{SE}_{\hat{\beta}}}\) | \(\chi^2 = \sum_{j=1}^{k} \frac{(O_j-E_j)^2}{E_j}\) |
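Two of these statistics can be sketched in a few lines. The \(t\) example assumes the unpooled (Welch) form of \({SE}_{1,2}\), which is one common choice and is what yields non-integer degrees of freedom. The function names and all numbers are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for two samples using the unpooled standard error."""
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical measurements, for illustration only.
group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [6.2, 6.8, 7.1, 6.5, 7.4]

t, df = welch_t(group1, group2)
chi2 = chi_square([18, 22, 20], [20, 20, 20])
print(f"t = {t:.2f} with {df:.2f} degrees of freedom")
print(f"chi-square = {chi2:.2f}")
```

In both cases the statistic is a deviation from the null-hypothesized condition, scaled by how much deviation we would expect from sampling variability.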
By chance alone, most samples will exhibit small differences from the population, some will exhibit larger differences, and a very few will exhibit very large differences.
The larger the magnitude of the test statistic, the less likely we are to observe that value due to chance alone when \(\mathrm{H}_0\) is true.
The integral of the appropriate probability density function outside of our test statistic (shaded) gives us \(p\), the probability of getting a test statistic at least as extreme as our observed test statistic by chance alone when \(\mathrm{H}_0\) is true.
Alternatively, we could use randomization methods to find \(p\).
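One randomization approach is a permutation test: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. This is a minimal sketch, not the only randomization method; the function name, the number of shuffles, and the data are assumptions.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    count = 0
    for _ in range(n_perm):
        # Under H0 the labels are exchangeable, so reassign them at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so p is never exactly zero.
    return (count + 1) / (n_perm + 1)

# Hypothetical measurements, for illustration only.
group1 = [5.1, 4.9, 5.6, 5.8, 6.0]
group2 = [6.2, 6.8, 7.1, 6.5, 7.4]

p = permutation_p_value(group1, group2)
print(f"permutation p = {p:.4f}")
```

No probability density function is needed here; the shuffled data supply the null distribution directly.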
We could phrase our question several ways…
\(\mathrm{H}_0:\) \(\mu_\mathrm{F} = \mu_\mathrm{M}\)
\(\mathrm{H}_{\mathrm{a}}:\) \(\mu_\mathrm{F} \neq \mu_\mathrm{M}\)
\(t = \frac{\bar{x}_1-\bar{x}_2}{{SE}_{1,2}} = -11.09\), with 192.86 degrees of freedom.
\(p \approx 0 < 0.05\), therefore reject \(\mathrm{H}_0\).
The probability of observing a \(t\) at least as extreme as -11.09 with 192.86 degrees of freedom by chance when the null hypothesis is true is less than 0.05. Therefore, we reject the null hypothesis in favor of the alternative hypothesis, and conclude that women (exhibiting here a mean of 63.8 inches) are, on average, shorter than men (exhibiting here a mean of 68.7 inches).
Significant just means "measurable," not "important."
How much variability in \(Y\) is explained by \(f(x)\)?
\[R^2 = \frac{{SS}_{\mathrm{regression}}}{{SS}_{\mathrm{total}}}\]
\(R^2\) is the proportion of variability explained by the regression, relative to the total variation in the data.
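A short sketch of this ratio for a simple linear fit, assuming ordinary least squares with an intercept. The helper name `r_squared` and the five data points are hypothetical.

```python
def r_squared(x, y):
    """R^2 = SS_regression / SS_total for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope and intercept.
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    # Variation of the fitted values around the mean of y.
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)
    # Variation of the observed values around the mean of y.
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return ss_regression / ss_total

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2 = r_squared(x, y)
print(f"R^2 = {r2:.4f}")
```

An \(R^2\) near 1 means the fitted line accounts for almost all of the variation in the data; an \(R^2\) near 0 means it accounts for almost none.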
What \(f(x)\) best describes \(Y\)?
We call this cross-validation.
Very flexible models can bring training error to almost 0.
…but those models become hyper-specific to the training set, and don't do a good job at predicting new data.
The \(f(x)\) that minimizes test error is the best at predicting new \(Y\).
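The idea can be sketched with a single train/test split, the simplest form of cross-validation, assuming NumPy is available. Polynomials of increasing degree stand in for increasingly flexible \(f(x)\). The underlying sine curve, noise level, sample sizes, and candidate degrees are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical "true" relationship, unknown in a real analysis.
    return np.sin(2 * np.pi * x)

x = rng.uniform(0, 1, size=60)
y = true_f(x) + rng.normal(scale=0.3, size=60)

# Hold out half of the data; the model never sees it during fitting.
x_train, y_train = x[:30], y[:30]
x_test, y_test = x[30:], y[30:]

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 9, 12):
    # Fit on the training set only.
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, model(x_train)),
                       mse(y_test, model(x_test)))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")

# Choose the degree with the lowest error on the held-out data.
best_degree = min(results, key=lambda d: results[d][1])
print("degree with lowest test error:", best_degree)
```

Training error can only fall as the degree rises, because each polynomial contains the lower-degree ones as special cases. Test error typically falls and then rises again once the model starts fitting noise.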