EVERYTHING FROM BEFORE MIDTERM:
Basic probability definitions (union, intersection, complement, conditional, Bayes Theorem)
Union
\[P(A\,\,OR\,\, B) = P(A\, \cup \, B)
= P(A) + P(B) - P(A\, \cap\, B)\] Note: You subtract the intersection because otherwise it would be counted twice.
Intersection
\[P(A\,\,AND\,\, B) = P(A\, \cap\, B) = P(A) \cdot P(B \mid A)\] Note: This simplifies to \(P(A) \cdot P(B)\) only when A and B are independent.
Complement
The complement of a trait represents anything that does not have that trait.
\[P(A^{c}) = 1 - P(A)\]
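The rules above, together with the conditional probability and Bayes' theorem named in the topic list, can be checked by brute-force enumeration. This is only a sketch in Python (not the R used in the course), assuming a made-up example of one roll of a fair six-sided die; the events `A` and `B` and the helper `prob` are hypothetical names chosen for illustration.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # event "roll is even"
B = {4, 5, 6}   # event "roll is greater than 3"

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_union = prob(A | B)                           # P(A or B)
p_inter = prob(A & B)                           # P(A and B)
p_b_given_a = p_inter / prob(A)                 # conditional probability P(B | A)
p_a_given_b = p_b_given_a * prob(A) / prob(B)   # Bayes' theorem gives P(A | B)
p_complement = prob(sample_space - A)           # P(A^c)

print(p_union, p_inter, p_b_given_a, p_a_given_b, p_complement)
```

In this example A and B are not independent, so P(A and B) differs from P(A) * P(B).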
Difference between sample statistics and population statistics (a.k.a. population parameters)
Statistic: any quantity calculated from the data
Sample statistics are calculated from the sample data and serve as estimates of the corresponding population parameters, which are the true (usually unknown) values for the entire population that the sample was taken from (even if you can’t sample the whole population)
Know the concepts of bias and variance in estimators
Bias:
The difference between the expected value of an estimator and the true value of the parameter it estimates: \[\mbox{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta\] If an estimator is unbiased, then repeated estimates of the parameter will show no systematic tendency toward either overestimates or underestimates.
An estimator is biased when its expected value does not equal the population parameter.
Variance: The variance measures the average squared distance between each point and the mean.
Unbiased variance estimator (divides by n-1): \[ \mbox{variance}_{unbiased} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\] Biased variance estimator (divides by n): \[\mbox{variance}_{biased} = \frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2} \]
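A minimal numerical sketch of the two variance formulas, written in Python rather than the course's R. The data values are made up; `statistics.variance` (n-1 denominator) and `statistics.pvariance` (n denominator) are standard-library functions used here only as a cross-check.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the sample mean.
ss = sum((x - x_bar) ** 2 for x in data)

var_biased = ss / n          # divides by n
var_unbiased = ss / (n - 1)  # divides by n - 1

print(var_biased, statistics.pvariance(data))
print(var_unbiased, statistics.variance(data))
```

The biased version is always the smaller of the two, and the gap shrinks as n grows.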
Definition of degrees of freedom
Degrees of freedom (DoF): The number of degrees of freedom is the number of values in the final calculation that are allowed to vary. For example, when calculating the sample variance, you typically have (n-1) degrees of freedom because the formula uses the sample mean, which is itself estimated from the data: once the mean and n-1 of the values are known, the last value can’t vary.
How to calculate the expected value of a distribution (discrete and continuous)
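For reference, the standard definitions, writing \(p(x)\) for a probability mass function and \(f(x)\) for a probability density function (this notation is an assumption and may differ from lecture): Discrete: \[E[X] = \sum_{x} x \cdot p(x)\] Continuous: \[E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx\]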
Be able to write down the probability density (or mass, for discrete distributions) function, expected value E[X], and variance Var[X], for the (1) Normal, (2) Standard Normal, (3) Log-Normal, (4) Poisson, (5) Binomial
Normal Distribution
Bounded: \((-\infty, \infty)\) (unbounded). Continuous. Can be negative. PDF: \[f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\] E[X]: \[\begin{align} E[X] &= \int_{-\infty}^{\infty}{x \cdot f(x \mid \mu, \sigma)\,dx} \\
&= \int_{-\infty}^{\infty}\frac{x}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \\ &= \mu \end{align}\]
VAR[X]: \[\begin{align} Var[X] &= E[(X- E[X])^2] \\
&= E[(X - \mu)^2] \\
&= E[X^2] - \mu^2 \\
&= \left( \int_{-\infty}^{\infty} x^2 \cdot f(x \mid \mu, \sigma)\,dx \right) - \mu^2 \\
&= \left( \int_{-\infty}^{\infty}\frac{x^{2}}{\sqrt{2 \pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} dx \right) - \mu^2 \\
&= \sigma^2 \end{align}\]
Standard Normal
Definition: \[Z = \frac{X-\mu}{\sigma}\] PDF: \[f(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}z^2}\] E[Z] = 0
VAR[Z] = 1
Log-Normal Distribution
PDF: \[\begin{align} \log(X) &\sim N(\mu,\sigma) \\ X &\sim LN(\mu,\sigma) \end{align}\] \[f(x \mid \mu, \sigma) = \frac{1}{x\sqrt{2 \pi \sigma^2}}e^{-\frac{(\log(x)-\mu)^2}{2\sigma^2}} \\ x \in (0,\infty) \\
\mu \in \mathbb{R} \\
\sigma > 0\]
E[X]: \[E[X] = e^{\mu + \frac{\sigma^2}{2}}\]
VAR[X]: \[Var[X] = e^{2\mu + 2\sigma^2} - e^{2\mu + \sigma^2} = \left(e^{\sigma^2} - 1\right)e^{2\mu + \sigma^2}\]
Poisson Distribution:
You can typically use the Poisson distribution in situations such as:
- The description of random spatial point patterns
- As the frequency distribution of rare but independent events
- As the error distribution in linear models of count data
The Poisson distribution is discrete, hence it has a probability mass function (PMF) instead of a PDF. It cannot be negative and is bounded \([0,\infty)\), taking only integer values.
PMF: \[P(x \mid \lambda)= \frac{e^{-\lambda} \cdot \lambda^x}{x!} \\
\lambda>0 \\
x \in \mathbb{N} \cup \{0\}\]
E[X]: \[\begin{align} E[X] &= \sum_{x=1}^{\infty} x \frac{e^{-\lambda} \cdot \lambda^x}{x!} \\
&= \lambda \cdot e^{-\lambda} \cdot \sum_{x=1}^{\infty} x \frac{\lambda^{x-1}}{x!} \\
&= \lambda \cdot e^{-\lambda} \cdot \sum_{x=1}^{\infty} \frac{\lambda^{x-1}}{(x-1)!}\\
&\mbox{define } y = x-1 \\
&= \lambda \cdot e^{-\lambda} \cdot \sum_{y=0}^{\infty} \frac{\lambda^{y}}{y!} \mbox{ (the sum is now the expansion of the exponential)}\\
&= \lambda \cdot e^{-\lambda} \cdot e^{\lambda} \\
&= \lambda\end{align}\]
VAR[X]: \[Var[X] = \lambda\]
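The results E[X] = λ and Var[X] = λ can be sanity-checked numerically by summing over the PMF. A rough sketch in Python (not the course's R): the rate `lam = 3.5` is an arbitrary choice, the helper `poisson_pmf` is a hypothetical name, and the infinite sum is truncated at 100 terms on the assumption that the remaining tail is negligible for this rate.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass function P(x | lambda)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.5            # arbitrary rate chosen for illustration
xs = range(0, 100)   # truncated support; the tail beyond this is negligible here

total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

print(total, mean, var)
```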
Binomial Distribution
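The binomial distribution is discrete and bounded \([0, n]\). For n independent trials, each with success probability p, the standard textbook results (stated here for reference, in notation that may differ from lecture) are: PMF: \[P(x \mid n, p) = \binom{n}{x} p^{x}(1-p)^{n-x} \\ x \in \{0, 1, \dots, n\} \\ 0 \le p \le 1\] E[X]: \[E[X] = np\] VAR[X]: \[Var[X] = np(1-p)\]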
Be able to recognize the Gamma, Beta, Multinomial, Chi-squared, F, and t-distributions
Multinomial Distribution
Central Limit Theorem
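In its classical form, assuming independent and identically distributed draws \(X_1, \dots, X_n\) with finite mean \(\mu\) and finite variance \(\sigma^2\), the theorem says that the standardized sample mean converges in distribution to a standard normal: \[\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \mbox{ as } n \rightarrow \infty\]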
Know all the relationships between the distributions we discussed in lecture
Standard deviation vs. standard error
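A short sketch of the distinction, in Python rather than the course's R, using made-up measurements: the standard deviation describes the spread of individual observations, while the standard error of the mean describes how much the sample mean would vary across repeated samples of the same size.

```python
import math
import statistics

sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9, 5.2]  # made-up measurements
n = len(sample)

sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)         # standard error of the sample mean

print(sd, se)
```

Collecting more data shrinks the standard error (it scales as 1/sqrt(n)) but does not systematically shrink the standard deviation.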
Know the R functions associated with the univariate distributions (rnorm, pnorm, rchisq, etc.)
Know how to construct a likelihood function, find maximum likelihood estimators, and construct 1-α confidence intervals
Be able to construct the 1-α parameter confidence intervals discussed in class, lab, or on any of the problem sets
understand the concept behind Type I and Type II error and statistical power
Know everything from the Week #5 summary table. The only exception is the d.o.f. for the two sample unpaired t-test, which you will not be expected to know.
know why multiple comparisons are a problem and what to do about it
Be able to discuss all of the objections to null hypothesis testing presented in the papers from the primary literature discussed in class
know how to do the one-sample t-test, the two-sample unpaired t-test, the two-sample paired t-test, Fisher’s F-test for a comparison of variances, comparing two proportions, and comparing two distributions with the K-S test
Understand the relationship between the test statistic T*, the distribution of T under the null hypothesis f(T|H0), and the construction of one-tailed and two-tailed p-values
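A minimal sketch of how the p-values follow from T* and f(T|H0), in Python rather than the course's R. It assumes, purely for illustration, that the null distribution is standard normal (as in a z-test) and that the observed statistic is `t_star = 1.96`; for a t-test the null distribution would be a t-distribution instead.

```python
from statistics import NormalDist

null_dist = NormalDist(mu=0.0, sigma=1.0)  # assumed f(T | H0), for illustration only
t_star = 1.96                              # hypothetical observed test statistic

# One-tailed p-values: probability under H0 of a value at least as extreme
# as T* in the direction named by the alternative hypothesis.
p_upper = 1.0 - null_dist.cdf(t_star)  # alternative: true value is larger
p_lower = null_dist.cdf(t_star)        # alternative: true value is smaller

# Two-tailed p-value for a symmetric null distribution: double the smaller tail.
p_two = 2.0 * min(p_upper, p_lower)

print(p_upper, p_lower, p_two)
```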
Know the R functions for all the hypothesis tests discussed in lab
understand the basic idea behind non-parametric bootstrap, parametric bootstrap, jackknife, and bootstrap-after-jackknife and know how to use each technique to calculate estimator bias and standard error.
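As one concrete case, the jackknife estimates of bias and standard error can be sketched as follows (Python, not the course's R). The data are made up and the function name `jackknife` is hypothetical; the formulas are the standard leave-one-out ones, applied here to the biased (divide-by-n) variance estimator.

```python
import math
import statistics

def jackknife(data, estimator):
    """Return (estimate, jackknife bias, jackknife standard error)."""
    n = len(data)
    theta_hat = estimator(data)
    # Recompute the estimator n times, leaving out one observation each time.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(leave_one_out) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    se = math.sqrt((n - 1) / n * sum((t - theta_bar) ** 2 for t in leave_one_out))
    return theta_hat, bias, se

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample
theta_hat, bias, se = jackknife(data, statistics.pvariance)

print(theta_hat, bias, se)
print(theta_hat - bias)  # bias-corrected estimate
```

For this particular estimator, subtracting the jackknife bias recovers the unbiased (n-1) sample variance.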
EVERYTHING FROM AFTER MIDTERM:
know how to calculate Pearson’s product moment correlation coefficient (and the assumptions behind it)
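A sketch of the calculation in Python (not the course's R), using made-up paired data; `pearson_r` is a hypothetical helper name. It computes the sum of cross-products divided by the square root of the product of the two sums of squares.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient for paired data."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
y = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x, y)
print(r)
```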
understand the difference between the population correlation coefficient \(\rho\) and the sample correlation coefficient r
understand why r has a sampling distribution (but you don’t need to know what it is)
know what Fisher’s transformation is and why/when/how you would use it (don’t need to know its sampling distribution)
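One common use is building an approximate confidence interval for a correlation. The sketch below is in Python (not the course's R) and assumes the usual large-sample approximation in which z = atanh(r) is treated as normal with standard error 1/sqrt(n-3); the function name `fisher_ci` and the example values of r and n are hypothetical.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, alpha=0.05):
    """Approximate 1 - alpha confidence interval for a correlation."""
    z = math.atanh(r)                    # Fisher's transformation of r
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error on the z scale
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Build the interval on the z scale, then back-transform with tanh.
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lower, upper = fisher_ci(r=0.8, n=28)
print(lower, upper)
```

The back-transformed interval is asymmetric around r and always stays inside (-1, 1).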
know how to calculate Spearman’s rank correlation coefficient (and the assumptions behind it)
know how to calculate Kendall’s tau
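Both rank-based coefficients can be sketched in a few lines of Python (not the course's R). The data are made up, the helper names are hypothetical, and both functions assume there are no tied values; with ties, the simple formulas below would need correction.

```python
from itertools import combinations

def ranks(values):
    """Ranks 1..n of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rank correlation via the no-ties shortcut formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [10.1, 12.4, 13.0, 15.2, 18.9]  # made-up data
y = [3.2, 2.9, 4.8, 4.1, 6.0]
rho = spearman_rho(x, y)
tau = kendall_tau(x, y)
print(rho, tau)
```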
know the difference between OLS and RMA(SMA)/MA regression
know how we calculated the estimates for slope and intercept
understand why the slope and intercept have sampling distributions
understand the assumptions of regression
understand the difference between a confidence interval and a prediction interval
understand how to partition the variance for a linear regression into SSR, SSE, and SST (and their degrees of freedom) and how to use that to calculate the coefficient of determination \(r^2\)
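A worked sketch of the partition for a simple linear regression, in Python rather than the course's R, on made-up data. It assumes SSR means the regression (explained) sum of squares and SSE the error (residual) sum of squares; some texts swap these labels.

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up predictor
y = [2.0, 1.0, 4.0, 3.0, 5.0]  # made-up response
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Ordinary least squares estimates of slope and intercept.
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
sxx = sum((a - x_bar) ** 2 for a in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * a for a in x]

sst = sum((b - y_bar) ** 2 for b in y)             # total, n - 1 d.o.f.
ssr = sum((f - y_bar) ** 2 for f in fitted)        # regression, 1 d.o.f.
sse = sum((b - f) ** 2 for b, f in zip(y, fitted)) # error, n - 2 d.o.f.
r_squared = ssr / sst

print(slope, intercept)
print(sst, ssr, sse, r_squared)
```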
understand the basic idea behind robust regression (when you would use it, how it works generally)
know how to interpret regression estimates generally, and also how to interpret the output of the R function ‘lm’
know when you would use a Generalized Linear Model, and when a Bernoulli, Binomial, and Poisson regression would be appropriate and why
be able to write down the model equation for each of the GLMs introduced (basically, from the Model summary table.doc handout).
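A sketch of the three model equations with a single covariate, assuming the canonical link functions (logit for Bernoulli and Binomial, log for Poisson); the handout may use different notation or additional covariates. Bernoulli: \[Y_i \sim Bernoulli(p_i) \\ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i\] Binomial: \[Y_i \sim Binomial(n_i, p_i) \\ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_i\] Poisson: \[Y_i \sim Poisson(\lambda_i) \\ \log(\lambda_i) = \beta_0 + \beta_1 x_i\]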
understand how to interpret the GLM parameter estimates
know what “overdispersion” means in the context of Poisson regression
know what Deviance is, how to calculate it, and what its sampling distribution is
know how to use Deviance to compare two models
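For reference, a sketch of the standard definitions, writing \(\ell\) for a maximized log-likelihood (notation assumed here, not taken from lecture) and assuming the two models being compared are nested: \[D = -2\left(\ell_{model} - \ell_{saturated}\right)\] \[\Delta D = D_{reduced} - D_{full} \sim \chi^2_{q} \mbox{ (approximately, if the reduced model is adequate)}\] where q is the number of additional parameters in the full model.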
understand the basic idea behind splines/LOESS smoothers, and Generalized Additive Models (GAMs)
understand the two methods for looking at the significance of a regression covariate (t-test and comparison of full to reduced model)
know what multicollinearity is, why it’s a problem, how we diagnose it, and what to do about it
FROM THE ANOVA STUDY SHEET
One-way ANOVA:
Write down the model equation
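One common way to write it, using "effect" notation in which i indexes the group and j the observation within the group (the course's notation may differ): \[Y_{ij} = \mu + A_i + \epsilon_{ij} \\ \epsilon_{ij} \sim N(0, \sigma^2)\]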
Know the one-way ANOVA null hypothesis and implied alternative hypothesis
Know why a test of means involves a ratio of variances (i.e. why are we using an F ratio to test a statistical hypothesis?)
Fill out a one-way ANOVA table
Assumptions of ANOVA
Difference between fixed vs. random effects and the different null hypotheses implied
Understand why follow-up analyses to ANOVA are required
Understand Tukey’s HSD
Two-way ANOVA (Factorial and Nested)
Write down the model equation for factorial or nested design using “effect” coding
Write down the model equation for factorial or nested design using “cell means” approach
Know the two-way ANOVA null hypotheses (factorial and nested) and implied alternative hypotheses
Fill out a two-way factorial ANOVA table when A,B are fixed and when A,B are random
Know the difference between sequential (Type I) and marginal (Type III) sums-of-squares and when you have to worry about them
understand the bias-variance trade-off in model selection
know the difference between nested and non-nested models
know how to compare models using the likelihood ratio, AIC, and BIC
know the equations for and the difference between AIC and AICc
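For reference, the standard forms, where \(\hat{L}\) is the maximized likelihood, k the number of estimated parameters, and n the sample size (symbols assumed here): \[AIC = -2\ln(\hat{L}) + 2k\] \[AIC_c = AIC + \frac{2k(k+1)}{n-k-1}\] AICc adds a small-sample correction that vanishes as n grows large relative to k.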
be able to discuss the pros and cons of Likelihood ratio tests vs. AIC/BIC
know how to calculate model weights
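Assuming "model weights" refers to Akaike weights, the calculation can be sketched as follows in Python (not the course's R). The AIC values are made up and `akaike_weights` is a hypothetical helper name.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: relative likelihoods exp(-delta/2), normalized to sum to 1."""
    best = min(aic_values)
    relative = [math.exp(-(aic - best) / 2.0) for aic in aic_values]
    total = sum(relative)
    return [rel / total for rel in relative]

aics = [100.0, 102.0, 110.0]  # made-up AIC values for three candidate models
weights = akaike_weights(aics)
print(weights)
```

Only differences in AIC matter: adding the same constant to every AIC value leaves the weights unchanged.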
be able to describe step-wise regression and discuss the criticisms of stepwise regression
know the difference between Residual, Outlier, Leverage, and Influence
PAPERS TO FOCUS ON (but all assigned papers are fair game):
Cohen 1994
Johnson 1999, Johnson 2002
Shmueli 2010
Hurlbert 1984
Tong 2019