The purpose of this site is to describe some common statistical concepts in a paragraph or two. It is meant as a supplement to the educational slides on Hypothesis Testing and Statistical Power, and as a resource for quickly looking up basic concepts when reading/writing methods sections, when performing analyses, etc.
In general, when performing a statistical analysis we wish to draw conclusions about a certain population. Here we use population to describe all individuals with a certain characteristic. The population of interest could, for example, be all Americans, all >60-year-olds with a certain disease, etc.
Since it is never possible to get to every individual in such a population, we rely on a representative sample of said population. This is a small subset of the population which we deem to be representative of the population as a whole. If chosen appropriately (i.e. big enough, not over- or underrepresenting certain subgroups, etc.), such a sample can provide useful information about the entire population.
For every study, there are a number of parameters of interest. A parameter is a summary of a certain characteristic of the entire population. This could for example be the mean age, the prevalence of a given disease, mean logMAR, etc.
However, since we can never examine the entire population, we can never know the exact value of the parameter. Instead, we rely on a statistic. For example, say we want to know the prevalence of glaucoma in the US for adults aged 40-80 years. Naturally, we cannot examine every person in this population. So the best we can do is estimate the prevalence by using the proportion of a representative sample with glaucoma. This proportion is the statistic.
In short: the parameter is the true value of a feature in the entire population; the statistic is the estimated value of said feature found using the sample at hand.
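To make the distinction concrete, here is a minimal sketch in Python/NumPy. The 8% prevalence, the population size, and the sample size are made-up numbers used only for illustration:

```python
# Sketch: the population prevalence is the parameter; the proportion observed
# in a random sample is the statistic.
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulated population of 1,000,000 people; 8% "true" prevalence (the parameter)
population = rng.random(1_000_000) < 0.08
parameter = population.mean()

# A representative sample of 500 people; the sample proportion is the statistic
sample = rng.choice(population, size=500, replace=False)
statistic = sample.mean()

print(f"parameter (unknown in practice): {parameter:.3f}")
print(f"statistic (what we can compute): {statistic:.3f}")
```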
Here are some of the most common parameters.
Mean/Standard Deviation (SD) The mean is a location parameter. It tells us where to expect the data to be. A commonly used statistic for the mean is the average of a given sample. The standard deviation is used to describe the variability of the data, i.e. it indicates how far from the mean we can expect to observe new data points.
Median/Interquartile Range (IQR) The median is another location parameter. It is found by sorting all observations of the variable of interest and then taking the middle observation (or the mean of the two middle observations, if there is an even number of observations). When used as a statistic, it is more robust to outliers, i.e. a few very extreme observations do not influence the median as much as they influence the average. The interquartile range is to the median what the standard deviation is to the mean: it is the range within which the middle 50% of the data is observed.
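As a quick illustration, these summary statistics can be computed with NumPy. The ages below are made-up values, with one deliberate outlier to show why the median and IQR are more robust:

```python
# Sketch: common summary statistics for a small, made-up sample of ages.
import numpy as np

ages = np.array([61, 64, 66, 67, 70, 72, 75, 79, 95])  # note the outlier at 95

mean = ages.mean()
sd = ages.std(ddof=1)                      # sample standard deviation
median = np.median(ages)
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1                              # interquartile range

print(f"mean = {mean:.1f}, SD = {sd:.1f}")
print(f"median = {median:.1f}, IQR = {q1:.1f}-{q3:.1f} (width {iqr:.1f})")
```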
A statistical hypothesis is an assumption about the data. It can be regarding the difference in means of a certain variable in two groups, difference between the prevalence of a disease, etc.
For more, see here.
A statistical test uses the information in the data to either reject or fail to reject the hypothesis being tested. The result of a test is often a p-value, which is then used to decide whether the hypothesis should be rejected or not.
For more, see here.
There are two classes of tests:
This class of tests assumes that the data come from a specific probability distribution. Examples include the t-test and linear regression analysis, both of which assume the data are normally distributed. If this assumption is not violated, a parametric test provides more statistical power than a non-parametric test. However, if the assumption is violated, the opposite holds.
When there is no evidence that the data come from a specific probability distribution, we turn to non-parametric tests.
A p-value is related to a statistical hypothesis test. It is the probability, assuming the hypothesis is true, of observing data that fit the hypothesis as poorly as, or worse than, the data actually observed.
A small p-value (typically one below a pre-specified significance level, e.g. 0.05) means we reject the hypothesis, whereas a large p-value means we do not reject it.
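For illustration, here is a hedged sketch using SciPy's one-sample t-test on simulated data, showing how a p-value is typically compared to a significance level. The data, the hypothesis (population mean equal to 0), and the choice of \(\alpha = 0.05\) are assumptions made for this example only:

```python
# Sketch: a one-sample t-test; reject H0 if the p-value falls below alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
data = rng.normal(loc=0.5, scale=1.0, size=40)   # simulated measurements

t_stat, p_value = stats.ttest_1samp(data, popmean=0.0)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "do not reject H0")
```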
When performing multiple tests at once, one has to adjust the p-values in order to account for the increased risk of significant findings being due to pure chance (i.e. false positives). Popular methods for adjusting p-values include Bonferroni’s method and Benjamini-Hochberg’s method, which controls the false discovery rate. (Note that while Bonferroni’s method is very popular, mainly due to its simplicity, it is known to be overly conservative in many cases.) For more, see slides 17 and 18 here or this excellent discussion.
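As a sketch, both adjustment methods mentioned above are available in statsmodels via `multipletests`; the p-values below are made up for illustration:

```python
# Sketch: adjusting a set of made-up p-values with Bonferroni and
# Benjamini-Hochberg ('fdr_bh') corrections.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]

for method in ("bonferroni", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adj], reject)
```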
When testing a hypothesis, we either reject the hypothesis or we do not reject it. Similarly, the hypothesis is either true or false. Hence, we always find ourselves in one of four scenarios: we reject a false hypothesis, we reject a true hypothesis, we do not reject a false hypothesis, or we do not reject a true hypothesis. Two of these cases are fine, since we do not draw any wrong conclusions. However, in the other two scenarios we do in fact reach a wrong conclusion: rejecting \(H_0\) when it is in fact true (type I error), or failing to reject \(H_0\) when it is in fact false (type II error).
All scenarios are shown in the table below.
| Test says \ Hypothesis is | false | true |
|---|---|---|
| Reject \(H_0\) | No error | Type I error |
| Do not reject \(H_0\) | Type II error | No error |
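The type I error rate can be illustrated by simulation: if we repeatedly compare two groups drawn from the same distribution (so the hypothesis of equal means is true), we should wrongly reject it in roughly \(5\%\) of studies when using a significance level of 0.05. A sketch with simulated data and assumed settings:

```python
# Sketch: estimating the type I error rate by simulation. Both groups come
# from the same distribution, so H0 (equal means) is true by construction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
alpha, n_sim, false_positives = 0.05, 5_000, 0

for _ in range(n_sim):
    a = rng.normal(size=30)
    b = rng.normal(size=30)          # same distribution: H0 is true
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1         # rejecting a true H0 is a type I error

print(f"estimated type I error rate: {false_positives / n_sim:.3f}")  # ~0.05
```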
Silly example: a study finds that an increase in the amount of ice cream sold leads to an increase in the number of drowning deaths over a given period. However, the researchers forgot that both are correlated with a third, confounding, variable: the season. During the summer more people buy ice cream AND more people go swimming. Hence, the increase in the number of ice creams sold does (probably…) not lead to an increase in the number of drownings; the fact that it is summer is the more probable cause.
When a variable that is not included in the statistical model proves to be associated with both an explanatory variable and the outcome variable, it is a potential confounding variable. In that case it is very hard to say whether the effect initially attributed to the explanatory variable is in fact an effect of that variable or an effect of the confounding variable. Sometimes it is possible to rule one out based on common sense (as in the ice cream example above).
Numerical variables that, in theory, can take any value. Examples are time, age, height, and weight.
Variables which fall into one of a limited number of groups. They can either be ordinal (i.e. the groups can be ranked from smallest to largest; for example age groups, answers to questions like “on a scale from 1 to 5, …”, educational level [low, medium, high], etc.) or categorical/nominal (i.e. there is no ordering of the groups; for example gender, race, etc.). A discrete variable with only two possible outcomes is called a binary variable.
A probability distribution is a function that describes how likely it is to observe specific values of a given variable. See below for an introduction to some of the more important/common distributions. Distributions are either discrete or continuous: discrete distributions are used to describe discrete variables, continuous distributions to describe continuous variables.
If all outcomes of a variable are equally likely, we say that the variable is uniformly distributed. As an example, the outcome of rolling a fair die is uniformly distributed on the integers 1, 2, 3, 4, 5, 6 (the probability of each outcome is 1/6).
If a variable is binary, it is Bernoulli distributed. If the probability of one outcome is \(p\), the probability of the opposite outcome must be \(1-p\).
The standard example of a Bernoulli distributed variable is a coin toss. This is a Bernoulli variable (only two outcomes: heads or tails) with \(p=0.5\) (\(50\%\) chance of heads).
Sometimes we wish to sum a number of independent Bernoulli distributed variables, each with the same probability \(p\). The resulting sum is a single binomially distributed variable.
As an example: suppose we want to assess the risk of having neovascular AMD in a specific population. We select a sample of \(83\) patients from said population and count the number of patients who have neovascular AMD. Since each patient either has neovascular AMD or not, the outcome for each patient can be considered a Bernoulli distributed variable, and hence the total number of patients with neovascular AMD is binomially distributed.
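A small simulation sketch of this example (the prevalence of 15% used below is a made-up number, not the actual prevalence of neovascular AMD):

```python
# Sketch: the count of "successes" among independent Bernoulli trials is
# binomially distributed.
import numpy as np

rng = np.random.default_rng(seed=4)
n, p = 83, 0.15                      # assumed sample size and prevalence

patients = rng.random(n) < p         # 83 Bernoulli(p) outcomes (has AMD or not)
count = patients.sum()               # this sum is Binomial(n=83, p=0.15)

# Equivalently, draw the count directly from the binomial distribution:
count_direct = rng.binomial(n=n, p=p)

print(count, count_direct)
```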
This is by far the most important distribution to be familiar with. The normal distribution (also called the Gaussian distribution, after the German mathematician Johann Carl Friedrich Gauss) owes its importance to several facts. One is that even when the data themselves are not normally distributed, the mean of a sample from those data is approximately normally distributed if the sample is “large enough” (this is the central limit theorem). Another is that many distributions can be approximated by a normal distribution – just take a look at the binomial distributions above and compare them to the normal distributions below.
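A quick simulation sketch of the first point: the raw data below are drawn from a clearly skewed (exponential) distribution, yet the means of repeated samples of size 50 are approximately normally distributed:

```python
# Sketch: sample means of skewed data are approximately normal (central limit theorem).
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

raw = rng.exponential(scale=2.0, size=(10_000, 50))  # 10,000 samples of size 50
sample_means = raw.mean(axis=1)

print(f"skewness of raw data:     {stats.skew(raw.ravel()):.2f}")   # clearly skewed (~2)
print(f"skewness of sample means: {stats.skew(sample_means):.2f}")  # much closer to 0
```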
The t-distribution is another important distribution, used especially when testing for the significance of specific effects. It is often used when testing for differences in means, weighted means, and regression coefficients.
A parametric test that is performed to answer questions regarding the mean of a population. Depending on the question, one of the following tests is performed:
This is a non-parametric counterpart to the t-test. Rather than testing for a difference in means, it tests for a difference in medians.
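For illustration, here is a sketch comparing the parametric and non-parametric approaches on simulated, unpaired data using SciPy. The Mann-Whitney U test is used here as one common non-parametric counterpart for two independent groups (for paired data, `scipy.stats.wilcoxon` would be the analogous choice); the group sizes, means, and spreads are assumptions for this example only:

```python
# Sketch: parametric t-test vs. a non-parametric alternative on two
# independent, simulated groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=6)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.5, scale=2.0, size=30)

t_res = stats.ttest_ind(group_a, group_b)      # parametric: difference in means
u_res = stats.mannwhitneyu(group_a, group_b)   # non-parametric alternative

print(f"t-test:        p = {t_res.pvalue:.4f}")
print(f"Mann-Whitney:  p = {u_res.pvalue:.4f}")
```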