Fundamentals of Data Analysis II

A model is a conceptualization of a system, often a hypothesis of how we think a system works.

"All models are wrong, but some are useful." - George Box, 1976 & 1978

A statistical model

…is a mathematical representation of relationships between a population's variables.

Statistical models can serve two purposes:

Facilitate understanding of a stochastic process
…and/or predict the outcome of a stochastic process

These purposes are not mutually exclusive, but…

To understand a process, we construct simple models that are easy to interpret, but do not always make very accurate predictions.
To predict an outcome, we construct complex models that can be difficult to interpret, but make more accurate predictions.

Most models take some form of…

\[y = f(x) + \varepsilon\]

\(y\) is a specific value of the response or outcome variable, \(Y\)
\(x\) is a specific value of the explanatory or predictor variable, \(X\)
\(\varepsilon\) describes residual error unexplained by the model
\(f(x)\) describes how \(Y\) and \(X\) are related

(We used to call \(Y\) and \(X\) dependent and independent variables, but statisticians no longer recommend this nomenclature.)

The Linear Model

\(f(x) = \beta_{0} + \beta_{1} x\), and \(\varepsilon\) follows a normal distribution:

\[y = \beta_{0} + \beta_{1} x + \varepsilon\]

Coefficients are derived from data

…using various methods.

Ordinary Least Squares Regression
Maximum Likelihood Estimation
Bayesian Estimation
…etc.
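
As a minimal sketch of the first method in the list, ordinary least squares for a single predictor has a closed form: the slope is the sum of cross-products divided by the sum of squares of \(x\), and the intercept follows from the means. The function name `ols_fit` and the toy data are illustrative assumptions, not part of the lecture.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals for y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # estimated slope
    b0 = y_bar - b1 * x_bar    # estimated intercept
    return b0, b1

# Hypothetical toy data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_fit(x, y)
```

Because the toy data contain no noise, the estimates recover the intercept and slope used to generate them; with real samples they would only approximate the population coefficients.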

Samples still estimate populations

Non-linear relationships

…can be modeled by including higher order terms.

\[y = \beta_{0} + \beta_{1} x + \beta_{2} x^2 + \dots + \beta_{z} x^z + \varepsilon\]
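
A polynomial model is still linear in its coefficients: each power of \(x\) simply acts as another predictor column. The sketch below illustrates that idea under assumed names (`poly_row`, `predict`) and hypothetical coefficients.

```python
def poly_row(x, degree):
    """Design-matrix row [1, x, x^2, ..., x^degree] for one observation."""
    return [x ** j for j in range(degree + 1)]

def predict(betas, x):
    """Evaluate b0 + b1*x + ... + bz*x^z as a dot product with the row."""
    return sum(b * t for b, t in zip(betas, poly_row(x, len(betas) - 1)))

# Hypothetical coefficients for a quadratic: y = 1 + 0*x + 2*x^2
betas = [1.0, 0.0, 2.0]
value = predict(betas, 3.0)
```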

Generalized Linear Models

GLMs describe categorical and/or non-linear relationships by assigning a link function, \(f(y)\), and letting the residuals follow non-normal distributions.

\[f(y) = \beta_{0} + \beta_{1} x + \varepsilon\]

Logistic regression

This common GLM uses the logit link function and binomial \(\varepsilon\) to model the probability of a binary outcome.

\[\ln \left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1} x + \varepsilon\]
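
To make the logit link concrete, the sketch below converts between probabilities and log-odds. The coefficient values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Map a linear predictor eta back to a probability."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients, for illustration only
b0, b1 = -4.0, 2.0
p_at_2 = inv_logit(b0 + b1 * 2.0)   # linear predictor is 0 at x = 2
```

A linear predictor of 0 corresponds to even odds, so the modeled probability at \(x = 2\) is one half for these coefficients.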

Multiple regression

Finally, the linear model can allow any number of predictor variables.

\[y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \dots + \beta_{k} x_{k} + \varepsilon\]
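
One way to estimate these coefficients is to solve the OLS normal equations, \((X^\top X)\hat{\beta} = X^\top y\). The sketch below does so with plain Gaussian elimination; the helper names `solve` and `fit_multiple` and the toy data are assumptions for illustration, and it presumes the predictors are not collinear.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_multiple(rows, y):
    """OLS coefficients [b0, b1, ..., bk] from predictor rows and responses."""
    X = [[1.0] + list(r) for r in rows]            # prepend intercept column
    k = len(X[0])
    XtX = [[sum(xr[i] * xr[j] for xr in X) for j in range(k)] for i in range(k)]
    Xty = [sum(xr[i] * yi for xr, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical toy data generated exactly from y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]
coefs = fit_multiple(rows, y)
```

In practice a numerical library would handle this step more robustly; the point here is only that adding predictors adds columns, not a new kind of model.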

Model Assessment

Assessing a model

Does \(Y\) change with \(X\)? …for a specific \(f(x)\).
- "Significance"
How much variability in \(Y\) is explained by a specific \(f(x)\)? …i.e. "fit."
- \(R^2\)
What \(f(x)\) best describes \(Y\)?
- Machine Learning

Null Hypothesis Significance Testing

Ronald Fisher went to war with Jerzy Neyman and Egon Pearson (son of Karl Pearson) during the 1930s over the most philosophically sound method for statistical hypothesis testing.

Our modern null hypothesis significance test, NHST, is a hybrid of the two camps' recommendations.

Karl Popper's Falsificationism

Hypotheses can never be proven true; they can only ever be proven false.

The hypothesis "all ravens are black" can never be proven true because you can't possibly be assured that all ravens have been observed.
However, finding one white raven disproves "all ravens are black."

This philosophy of science says that a hypothesis is scientific only if it is falsifiable.

Following this philosophy, NHST attempts to falsify a null hypothesis.

The modern NHST

Make null and alternate hypotheses, \(\mathrm{H}_0\) and \(\mathrm{H}_\mathrm{a}\).
Calculate a test statistic measuring our sample's deviation from the null condition.
Assume \(\mathrm{H}_0\) is true.
Determine the probability of observing a deviation at least as large as our sample's deviation by chance alone, the \(p\) value.
If this probability…
- …is small, conclude \(\mathrm{H}_0\) is false, therefore \(\mathrm{H}_\mathrm{a}\) must be true.
- …is not small, conclude \(\mathrm{H}_0\) could be true.

1. Make hypotheses

\(\mathrm{H}_{0}\) describes a null condition, i.e. that our sample statistic is similar to some hypothesized population parameter.
\(\mathrm{H}_\mathrm{a}\) describes every possible condition other than the null condition.

Examples:

Hypothesis	\(t\)-Test	ANOVA	Regression	\(\chi^2\)
\(\mathrm{H}_0\)	\(\mu_1 = \mu_2\)	All \(\mu_j\) are equal	\(\beta=0\)	O \(=\) E
\(\mathrm{H}_{\mathrm{a}}\)	\(\mu_1 \neq \mu_2\)	Not all \(\mu_j\) are equal	\(\beta \neq 0\)	O \(\neq\) E

2. Calculate test statistic

Test statistics are standardized measures of deviation from the null-hypothesized condition.

Examples:

t-Test	ANOVA	Regression	\(\chi^2\)
\(t = \frac{\bar{x}_1-\bar{x}_2}{{SE}_{1,2}}\)	\(F = \frac{{MS}_{\mathrm{among}}}{{MS}_{\mathrm{within}}}\)	\(t = \frac{\hat{\beta}-0}{{SE}_{\hat{\beta}}}\)	\(\chi^2 = \sum_{j=1}^{k} \frac{(O_j-E_j)^2}{E_j}\)
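
Two of the statistics in the table can be sketched in a few lines. The slides do not say how \({SE}_{1,2}\) is computed, so the unpooled (Welch) standard error used here is an assumption, as are the function names and the toy numbers.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """t statistic for a difference in means, using an unpooled standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def chi_square(observed, expected):
    """Sum of squared deviations from expectation, scaled by expectation."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data, for illustration only
t_stat = welch_t([1, 2, 3], [2, 3, 4])
chi2_stat = chi_square([10, 20], [15, 15])
```

Both functions divide a raw deviation by a measure of expected variability, which is what makes the result comparable across samples and units.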

3. Assume \(\mathrm{H}_0\) is true

By chance alone, most samples will exhibit small differences from the population, some will exhibit larger differences, and a very few will exhibit very large differences.

The larger the test statistic, the less likely we are to observe that value due to chance alone when \(\mathrm{H}_0\) is true.

4. Determine the \(p\) value

The integral of the appropriate probability density function outside of our test statistic (shaded) gives us \(p\), the probability of getting a test statistic at least as extreme as our observed test statistic by chance alone when \(\mathrm{H}_0\) is true.

Alternatively, we could use randomization methods to find \(p\).
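
A randomization approach might look like the following sketch: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The function name, the number of shuffles, and the toy data are all assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided p value for a difference in means, by shuffling group labels."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # reassign labels at random
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)              # add-one keeps p above 0

# Hypothetical data: clearly separated groups, then identical groups
p_diff = permutation_p([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
p_same = permutation_p([1, 2, 3, 4], [1, 2, 3, 4])
```

No probability density function is needed here; the shuffled data themselves stand in for the distribution of the statistic when \(\mathrm{H}_0\) is true.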

5. Draw a conclusion

If \(p\) is small, reject \(\mathrm{H}_0\), and conclude \(\mathrm{H}_{\mathrm{a}}\) is true.
If \(p\) is not small, fail to reject \(\mathrm{H}_0\), and conclude…
- …our data are consistent with the null hypothesis.
- ~~…the null hypothesis is true.~~

Other NHST details

How small of a \(p\) is "small enough?" By convention, \(\alpha = 0.05\), but only by convention.
Degrees of freedom are scaling parameters for the probability density functions, and are derived from the sample size and experimental design.
The critical value is the smallest value of the test statistic that would still be considered "significant." Critical values were important before desktop computers, but are largely irrelevant today.

Example: Comparing heights of men and women

We could phrase our question several ways…

Are men taller than women?
Is there an effect of sex on height?
Does height vary by sex, \(\mathrm{h} = f(\mathrm{s})\), where \(f(\mathrm{s})\) is a linear function parameterized through OLS regression?

The difference of means is equivalent to the slope of the line.
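
This equivalence can be checked numerically by coding sex as a 0/1 predictor and fitting a line by OLS. The heights below are hypothetical values chosen for illustration; they are not the lecture's data.

```python
from statistics import mean

# Hypothetical heights in inches (not the lecture's data)
women = [62.0, 64.0, 65.0, 63.0]
men = [68.0, 70.0, 69.0, 67.0]

# Code sex as a 0/1 predictor: 0 = female, 1 = male
s = [0] * len(women) + [1] * len(men)
h = women + men

# Closed-form OLS slope and intercept for h = b0 + b1*s
s_bar, h_bar = mean(s), mean(h)
slope = (sum((si - s_bar) * (hi - h_bar) for si, hi in zip(s, h))
         / sum((si - s_bar) ** 2 for si in s))
intercept = h_bar - slope * s_bar

diff_of_means = mean(men) - mean(women)
```

With this coding the intercept is the mean of the group coded 0 and the slope is the difference between the two group means, so testing \(\beta = 0\) asks the same question as testing \(\mu_\mathrm{F} = \mu_\mathrm{M}\).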

Two sample t-test

\(\mathrm{H}_0:\) \(\mu_\mathrm{F} = \mu_\mathrm{M}\)
\(\mathrm{H}_a:\) \(\mu_\mathrm{F} \neq \mu_\mathrm{M}\)

\(t = \frac{\bar{x}_1-\bar{x}_2}{{SE}_{1,2}}=\) -11.09, with 192.86 degrees of freedom.

\(p \approx 0 < 0.05\), therefore reject \(\mathrm{H}_0\)

The probability of observing a \(|t|\) at least as large as 11.09 with 192.86 degrees of freedom by chance when the null hypothesis is true is less than 0.05. Therefore, we reject the null hypothesis in favor of the alternate hypothesis, and conclude that women (exhibiting here a mean of 63.8 inches) are shorter than men (exhibiting here a mean of 68.7 inches).

Strengths of NHST

Decision making criteria
Well known
Intended for small to moderately sized data

NHST is often misunderstood, misapplied, and misinterpreted

\(p\neq\) the probability that \(\mathrm{H}_0\) is true
Can never conclude \(\mathrm{H}_0\) is true
Never assures a correct conclusion
- \(\alpha\) = Type I error rate
- \(\beta\) = Type II error rate
Not applicable to Big Data
Does not assess model fit
Questionable model selection
P hacking
- Model and \(\alpha\) must be chosen a priori
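
The meaning of \(\alpha\) as the Type I error rate can be illustrated by simulation: draw both groups from the same population, so \(\mathrm{H}_0\) is true by construction, and count how often the test rejects anyway. The sketch assumes a two-sample \(z\) test with a known standard deviation of 1, which keeps it within Python's standard library; the function name and settings are illustrative.

```python
import math
import random
from statistics import NormalDist, mean

def false_positive_rate(n_trials=2000, n=20, alpha=0.05, seed=1):
    """Fraction of two-sample z tests that reject H0 when H0 is actually true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    rejections = 0
    for _ in range(n_trials):
        # Both groups come from the same N(0, 1) population, so H0 is true
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        z = (mean(a) - mean(b)) / math.sqrt(2 / n)   # known sigma = 1
        p = 2 * (1 - std_normal.cdf(abs(z)))         # two-sided p value
        if p < alpha:
            rejections += 1
    return rejections / n_trials

rate = false_positive_rate()
```

The rejection rate should land near \(\alpha\), which is why even a correctly applied test never assures a correct conclusion.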

Don't trust \(p\)-values alone

…because "significant" is not necessarily "informative."

Significant just means "measurable," not "important."

Model fit, \(R^2\)

How much variability in \(Y\) is explained by \(f(x)\)?

\[ R^2 = \frac{SS_{regression}}{SS_{total}}\]

\(R^2\) is the proportion of variability explained by the regression, relative to the total variation in the data.
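
The ratio above can be computed directly from fitted values. This sketch assumes the fitted values come from an OLS line with an intercept, where the regression and residual sums of squares add up to the total; the toy data and the name `r_squared` are illustrative.

```python
from statistics import mean

def r_squared(y, y_hat):
    """Proportion of total variation in y explained by fitted values y_hat."""
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_regression = sum((fi - y_bar) ** 2 for fi in y_hat)
    return ss_regression / ss_total

# Hypothetical toy data; 1.3 and 0.8 are the OLS intercept and slope for it
x = [0, 1, 2, 3]
y = [1, 3, 2, 4]
y_hat = [1.3 + 0.8 * xi for xi in x]
r2 = r_squared(y, y_hat)
```

For this toy data the line explains about 64% of the variation in \(y\), which matches the squared correlation between \(x\) and \(y\).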

Machine learning, AKA statistical learning

What \(f(x)\) best describes \(Y\)?

Fit many different \(f(x)\) to a set of training data
Evaluate the fit of those models on a new set of testing data
Pick \(f(x)\) that minimizes Testing Mean Squared Error

We call this cross-validation

Training error

Very flexible models can bring training error to almost 0.

Training vs testing error

…but those models become hyper-specific to the training set, and don't do a good job at predicting new data.

Which \(f(x)\) is best?

The \(f(x)\) that minimizes test error is the best at predicting new \(Y\).
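
The procedure described in this section can be sketched with three candidate models of increasing flexibility: a constant mean, a straight line, and a nearest-neighbor rule that memorizes the training set. The data are simulated from a linear relationship with noise, and every name and setting below is an assumption for illustration.

```python
import random
from statistics import mean

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return mean((predict(x) - y) ** 2 for x, y in zip(xs, ys))

def fit_mean(xs, ys):
    y_bar = mean(ys)
    return lambda x: y_bar                     # ignores x entirely

def fit_line(xs, ys):
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return lambda x: b0 + b1 * x

def fit_nearest(xs, ys):
    pairs = list(zip(xs, ys))                  # memorize the training set
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Simulated data from y = 3 + 2x + noise (hypothetical population)
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 + 2 * x + rng.gauss(0, 2) for x in xs]
x_train, y_train = xs[:100], ys[:100]
x_test, y_test = xs[100:], ys[100:]

candidates = {"mean only": fit_mean,
              "straight line": fit_line,
              "nearest neighbor": fit_nearest}
results = {}
for name, fit in candidates.items():
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

# Choose the candidate with the smallest testing MSE
best = min(results, key=lambda name: results[name][1])
```

The nearest-neighbor rule reproduces every training point exactly, so its training error is zero, yet it carries the training noise into its predictions for new data. Selection is therefore based on the second number in each pair, the testing error.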

Summary

A statistical model is an approximation of a stochastic system.
Simple models help us to understand a system.
Complex models help us to predict a system.
The linear model is a powerful model that can be either simple or complex.
NHST answers the question "are \(Y\) and \(X\) related?"
\(R^2\) answers the question "how much of \(Y\) is explained by my \(f(x)\)?"
Machine learning answers the question "what is the best \(f(x)\) to predict \(Y\)?"