Glossary of epidemiological terms

Photo by Romain Vignes on Unsplash

Half the battle in making sense of epidemiology is getting on top of the terminology. I hope that this summary is useful.

Association

A statistical relationship between an exposure (such as eating butter) and disease (such as cardiovascular disease). This is usually summarized by a relative risk and odds ratio. In order for an association to be present, the relative risk is usually ‘statistically significant’.

Evidence of association is generally regarded as the first step in determining whether or not there is sufficient evidence to believe that an exposure causes a disease. Eliminating other reasons for an association, such as random or systematic error is necessary before a causal argument may be made.

Bayes’ theorem

A method of inverting conditional probabilities (see conditional probability definition). The theorem allows the updating of belief in the light of new evidence, which is an important part of the scientific method. It is based on Bayes theorem:

\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

Where:

\(P(A|B)\) is the probability of \(A\) occurring after we know that \(B\) has occurred.
\(P(B|A)\) is the probability that \(B\) has occurred after we know that \(A\) has occurred.
\(P(A)\) is the probability that \(A\) has occurred. Prior probability or belief of the frequency that \(A\) will occur.
\(P(B)\) is the probability that \(B\) occurs.

It is frequently applied to problems of diagnostic test accuracy. For example, a clinician often wants to know what is the probability of a patient having disease (\(A\)), given that we know the patient has a positive test (\(B\)).

Applying Bayes’ theorem to such a problem, assuming we know the sensitivity \(P(B|A)\) and the specificity \(P(\text{not } B | \text{ not } A)\) is:

\[\begin{align} \text{Positive predictive value} &= \frac{\text{sensitivity} * \text{prevalence}}{P(\text{positive test})} \\ &= \frac{\text{sensitivity} * \text{prevalence}}{P(\text{true positive} + \text{false positive})} \\ &= \frac{\text{sensitivity} * \text{prevalence}}{\text{sensitivity}*\text{prevalence} + (1-\text{specificity})*(1- \text{prevalence})} \\ \end{align}\]

The odds version of Bayes’ theorem is perhaps easier to interpret where:

\[\begin{align} \frac{P(\text{Disease}|\text{Positive test})}{P(\text{No disease}|\text{Positive test})} &= \frac{\text{prevalence}}{1 - \text{prevalence}} * \frac{\text{sensitivity}}{1-\text{specificity}} \\ \end{align}\]

In words, this means that the odds of disease after the test is pretest odds of disease multiplied by sensitivity divided by the complement of the specificity (or the ratio of the true-positive propotion to the false-positive proportion).

Since low prevalences and odds are almost equivalent, the posterior odds is almost equivalent to the prior prevalence multiplied by latter ratio of sensitivity to 1 - specificity.

The posterior odds of disease may be converted to a probability (Positive predictive value) by using the formula:

\[\begin{align} \text{Risk} &= \frac{\text{odds}}{1 + \text{odds}} \end{align}\]

To get from Bayes’ theorem to the above it is necessary to also know the law of total probability, which enables the overall probability of \(B\) to be calculated from observed conditional probabilities of \(A\) given \(B\) (sensitivity and the converse specificity) and the prior probability (here prevalence of disease) of \(A\):

\(P(B) = P(A)*P(B|A) + P(\bar{A})*P(B|\bar{A})\)

Bayesian statistics

A branch of statistics, in which probability is considered a degree of belief in the frequency in which an event will occur. It is based on Bayes’ theorem.

Bradford-Hill’s criteria for causation

When epidemiology was a young science, Sir Austin Bradford-Hill noted that there were many more associations found in studies that turned out to be truly causal. He devised a number of criteria that were helpful in trying to tease out whether or not associations between exposure and disease were likely to be truly causal. Unfortunately, there is still much subjective judgement required in helping decide whether an exposure is causal, but at least they are a useful framework to guide our thinking.

The criteria include:

Strength of association

How strong is the odds ratio or relative risk comparing the exposed to unexposed groups? A stronger association is more indicative of a causal association.

Consistency

If an association is truly causal, then different groups using different methods should come to the same conclusion. Have other groups came to the same or similar conclusions? If not, which evidence is likely to be the most reliable?

Specificity

Does the causal factor only relate to the outcome in question?

Temporality

The evidence for causation should include consideration of whether or not the exposure preceded the outcome. This is generally a given feature of cohort

Biological or dose-response gradient

Does progressively more exposure lead to higher risk of disease? This can usually be assessed by considering different categories of exposure and assessing the measure of association (relative risk or odds ratio) for progressively higher levels of exposure.

Biological plausibility

Is there laboratory or biological work which shows a plausible mechanism by which exposure may lead to the onset of disease?

Coherence

If the exposure really did cause disease, would this help explain other scientific observations related to the disease? Does it fit with what is already known about the characteristics of the disease

Experimental evidence

Has there been randomised trial evidence? This is generally considered more convincing than observational evidence.

Analogy

Is there a similar area of science where exposure leads to disease? For example, acute rheumatic fever and post-streptococcal glomerulonephritis are post-infective complications of streptococcal disease. Since post-streptococcal glomerulonephritis is causally linked to scabies, this strengthens the evidence that the related disease acute rheumatic fever should also be.

Case-control study

An observational epidemiological study in which the study population is sampled by disease (or case status). Cases and controls are then assessed for their exposure status. If cases are exposed at a much greater frequency than controls, then this is considered evidence that the exposure may cause the disease. Obviously, other explanations must be taken into consideration, such as confounding and measurement error.

Categorical variable

A characteristic that is measured in categories or classes. For example: ethnicity may be measured as: Māori, Pacific, Chinese, Indian, ‘Other Asian’ and ‘NZ European and Other’. Another example is the classification of individuals in an epidemiological study as exposed (eg. smokers) and unexposed (eg. non-smokers).

Causation

The relationship by which an exposure influences the onset of a disease or outcome. It is distinct from the statistical idea of association, in which other explanations, such as random error, confounding, selection bias or measurement error may explain the results. Bradford-Hill’s criteria for causation are often used to help decide whether an observed association between exposure and disease should be considered causal.

Chi-square test

This is a statistical test of association between two variables, which are usually considered an exposure and outcome. The concept is related to asking the question “what if there really were no relationship between these two variables” (‘expected’ values may be calculated, based on the frequency of the variables that contribute to each cell) and then these expected values are then compared to observed or actual ones. The sum of the squared differences for each cell divided by the expected value of each cell yields a chi-square value. A probability (P-value) is then estimated from the chi-square value, under the assumption that there is no relationship between the exposure and outcome (the two values are ‘independent’) from the chi-square distribution. If the chance of the observed results are unlikely under this ‘no relationship’ or ‘null-hypothesis’ idea, then this belief is abandoned.

A worked example is provided below.

Imagine a case-control study, where cases of lung cancer are compared with non-cases.

	Lung cancer (col. 1)	No lung cancer (col. 2)
Smoke (row 1)	50	250
Don’t smoke (row 2)	10	290

The expected value of each cell is \(E_{ij}\) in row \(i\) and column \(j\) is given by:

\[ E_{ij} =\frac{i\text{th row total * } j\text{th column total}}{\text{grand total}}\]

So, therefore, here:

\[ E_{11} = \frac{300 * 60}{600} = 30 \] \[ E_{12} = \frac{300 * 540}{600} = 270\] \[ E_{21} = \frac{60 * 300}{600} = 30\] \[ E_{22} = \frac{300 * 540}{600} = 270\]

As you can see the expected values, particularly for the lung cancer and don’t smoke combination are lower than expected, as well as the lung cancer and smoke are higher than expected.

The formula for estimating the chi-square value for this 2 x 2 table is: \[ \chi^2 = \sum_{i = 1, j = 1}^{i = R, j = C}\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \] So, for this example: \[\begin{align} \chi^2 &= \frac{(50 - 30)^2}{30} + \frac{(270 - 250)^2}{270} + \frac{(50 - 30)^2}{30}+ \frac{(270 - 250)^2}{270} \\ &= 13.33 + 1.48 + 13.33 + 1.48 \\ &= 29.6 \end{align}\]

This value is then estimated from the upper tail of the chi-square distribution (1 degree of freedom).

Here, the P-value is less than 0.001, derived from the following R-code

pchisq(29.6, df = 1, lower.tail = FALSE)

## [1] 5.310494e-08

A plot of the chi-square distribution is given below, with the observed chi-square value of 29.6 at the extreme bottom right. I think you can see that this result is on the very ‘tail’ or unlikely part of the probability distribution. The probability of the observed result under the ‘null-hypothesis’ is very high when the difference between observed and expected values is close to zero, with a low chi-square value. With a chi-square of almost 30, the null hypothesis here is unlikely. The 0.05 P-value cutoff is shown in red at 3.84.

sjPlot::dist_chisq(chi2 = 30, deg.f = 1, p = 0.05)

## The equivalent result from the raw 2 x 2 table
epiDisplay::cci(50, 250, 10, 290)

## 
##           Exposure
## Outcome    Non-exposed Exposed Total
##   Negative         290     250   540
##   Positive          10      50    60
##   Total            300     300   600
## 
## OR =  5.8 
## 95% CI =  2.88, 11.68  
## Chi-squared = 29.63, 1 d.f., P value = 0
## Fisher's exact test (2-sided) P value = 0

Notice here that the computer has rounded the P-value in the output to 0 (zero). Stating that a P-value is zero is always false, that is why the conventional notation for low [P-values] is ‘< 0.0001’.

Cohort study

An observational epidemiological study in which the population is sampled by exposure status. The population is then followed-up until outcomes or disease occurs. The association between exposure and outcome may be used to assess whether or not the exposure is likely to cause the disease. For example, a diagnosis of scabies has been associated with the onset of acute rheumatic fever in a cohort study of Auckland children here. Another study considered the link between serum urate and the incidence of cardiovascular disease during follow-up here.

Conditional probability

The natural frequency or risk of an outcome, given or on the condition that we have other information. For example, the cumulative incidence (risk) of lung cancer in a cohort of smokers (the condition) is 10%. Another example is the chance of having angina after a positive treadmill (the condition) test is 70%. Sensitivity and specificity are examples of conditional probabilities.

Continuous variable

A measurement or characteristic, such as height, weight or age that is measured on a numeric scale.

Confounding

A shared common cause of an exposure and outcome. See here for a web app which illustrates this concept with regard to the confounder sugar, exposure rotten teeth and outcome cardiovascular disease.

It is often addressed in observational epidemiology by using either stratification or regression. These methods are called adjusted analyses (rather than crude), since the relationship between the exposure and outcome are now adjusted for confounders. Stratification can only be carried out on categorical variables. Regression is more flexible and may be used to adjust for both categorical and continuous confounding factors.

Egger test

The Egger test is a regression test for the presence of publication bias in a meta-analysis. It is based on a radial plot, which is a scatter plot of summary measures (odds ratios) of \(k\) trials:

\(x_k = \frac{1}{\text{S. E.}(\text{log}(\text{odds ratio}_k))}\)

and

\(y_k = \frac{\text{log}(\text{odds ratio}_k)}{\text{S. E.}(\text{log}(\text{odds ratio}_k))}\)

This plot has the following properties which hold under the fixed effects model:

\(\text{Var}(y_k)\) is one.
For each of the \(k\) studies, the estimated log odds ratio is equal to the slope of the line through the origin (0, 0) and the point (\(x_k, y_k\)).
The pooled odds ratio, using the fixed effects method, is equal the slope parameter \(\beta_1\) in an unweighted linear regression that intersects with the origin (0, 0):

\(y_k = \beta_1 x_k\)

The values \(y_k\) are also called \(z\) scores, since they correspond to the statistic of a test that \(\log(\text{odds ratio}_k)\) is different from zero.

If there is no publication bias, the individual data points are expected to be randomly distributed around the regression line.

The Egger regression test includes an intercept term in the regression equation:

\(y_k = \beta_0 + \beta_1 x_k\)

The test for publication bias is therefore a test for a non-zero intercept (\(\beta_0\)). The null-hypothesis of no bias in the meta-analysis is distributed according to Student’s t-distribution with k - 2 degrees of freedom. In R, the meta::radial() function can be used to draw the radial plot and the meta::metabias() function can be used to conduct Egger’s test.

Ecological study

An observational study which considers the statistical evidence of association between aggregated measures of exposure and outcome. For example, average per capita sugar intake has been correlated with childhood asthma prevalence. See here for the publication.

Exposure

A behaviour, biological or sociological characteristic that may cause a particular disease or outcome. For example, smoking is an exposure that is likely to be a cause of the disease lung cancer.

Fisher’s exact test

The Fisher’s test is an analogue to the chi-square test, and is particularly useful when the cell counts are low. It is used for testing for independence between two categorical variables. In essence, the test calculates the probability of the observed results from the hypergeometric distribution and sums this with all possible values of the observed data (that maintain the marginal cell counts) that are less probable than the observed results. This becomes the P-value of observing the data (or more extreme results), given that the null hypothesis is true. A more detailed explanation is given here.

Frequentist statistics

The branch of statistics most commonly used in medical research. Frequentists see probability as a long run natural frequency, rather than as a degree of belief that an event will occur. P-values and confidence intervals are frequentist ideas. Frequentist methods are often contrasted to Bayesian statistics.

Hazard

Hazards are often used in randomized trials or cohort studies to estimate disease frequency. Hazards have a tricky definition. Technically, the hazard relates to the instantaneous risk of an individual having a disease event, knowing (conditional upon) that an individual survived up until that time point. Hazards are rates (probabilities per unit time), rather than simple probabilities or risks.

Hazard ratios

Hazard ratios are the ratios of two hazards, usually in an exposed and unexposed group in an epidemiological study. They are a measure of association, similar to a risk ratio or odds ratio.

Hypothesis testing

In epidemiology, this is generally related to deciding whether or not study data support a conclusion of association between an exposure and disease. Neyman-Pearson hypothesis testing uses a formal framework, in which test results are compared with the unknowable ‘truth’, embodied by the “null-hypothesis”. If it were believed that sugar caused heart disease, and a study were carried out to interrogate this belief, such as by considering the association between sugar intake and the incidence of heart disease, then the ‘null-hypothesis’ would be that “sugar does not cause heart disease”.

The way hypothesis testing works, is that if the collected data (say calculated risk or odds ratio) are unlikely under the null-hypothesis, then this is rejected and In the following table, the test results are compared with whether the null hypothesis is either true of false. The false-positive and false-negative proportions are possible to fix beforehand or estimate afterward. By convention, the false-positive or \(\alpha\) level is generally fixed at 5%.

In this way, researchers are able to be confident that although they may not always make the right decision about test results, in the long run, they will mostly make the right decision.

	Test result?
Null-hypothesis?	Positive	Negative
True	False-positive	OK
False	OK	False-negative

Incidence

See rates.

Intraclass correlation coefficient (ICC)

The intraclass correlation coefficient as the name suggests, is a statistic that describes how correlated two measures of individuals within the same group or ‘class’ are. It is analogous to the Pearson correlation coefficient which is confined to a value between the range of -1 and 1. A modern take on the intraclass coefficient is if it is considered to be a random effects model such as:

\(Y_{ij} = \mu + \alpha_j + \epsilon_{ij}\)

Where:

\(Y_{ij}\) is the outcome for individual \(i\) in group \(j\). For example, a child’s waist circumference who attends a certain school.
\(\mu\) is the overall mean
\(\alpha_j\) is the random effect shared by all individuals in group \(j\). For example, it would be the average deviation from the overall mean of students who attend a certain school.
\(\epsilon_{ij}\) is the residual term, in the above example, it would be the difference of child’s waist circumference \(Y_{ij}\) from the overall mean minus the school random effect (\(\mu - \alpha_j\)).

The variance of \(\alpha_j\) is \(\sigma_{\alpha}^2\) and the residual variance of \(\epsilon_{ij}\) is \(\sigma_{\epsilon}^2\).

The intraclass correlation coefficient is therefore:

\(\frac{\sigma_{\alpha}^2}{\sigma_{\alpha}^2 + \sigma_{\epsilon}^2}\)

This is then the average between-subject, within-class correlation of two observations from the same group (school in the above example).

Implications of the ICC for survey sample size calculations

The ICC is related to the design effect when using survey methods. The design effect is a multiplication factor which relates a simple random sample survey calculation to that which is required if a more complex cluster sample is conducted.

The design effect is related to the ICC by the following formula:

\(\text{design effect} = 1 + (\text{number of clusters} - 1)*\text{ICC}\)

This means that the less correlated measures are within a cluster, the lower the design effect. If individuals are independent within clusters (ICC = 0), then the design effect is 1 and the sample size is the same as a simple random sample. Conversely, if there is a very strong ICC (\(\approx 1\)), then the within cluster information is less useful and the number of clusters tends toward what would be required from a simple random sample.

Kaplan-Meier estimator

This is a non-parametric estimate of the survival function, often used to compare the survival (death or diagnosis of a certain condition) between different groups, dependent on their exposure status. The method is able to account for different lengths of follow-up in cohort studies of survival or time-to-event, such as the diagnosis of a particular disease.

The survival function is given by:

\[ \widehat{S(t)} = \prod_{i: t_i\leq t}(1 - \frac{d_i}{n_i})\\ \] Where:

\(t_i\) is the time at which at least one event has occurred,

\(d_i\) is the number of events or deaths that has occurred at time \(t_i\),

\(n_i\) is the number of individuals who have survived (not dead or lost to follow-up [right censored]) at time \(t_i\); and

\(i\) is the interval defined by time between events or deaths occur. The interval lengths may vary and are defined by observed events.

The variance of the Kaplan-Meier estimator is given by Greenwood’s formula:

\[ \widehat{\text{Var}}(\widehat{S(t)}) = \widehat{S(t)}^2\sum_{i: t_i\leq t}\frac{d_i}{n_i(n_i-d_i)} \] The overall survival based on the Kaplan-Meier curve may be summarised by a median survival, or a restricted mean survival time (corresponding to the area under the Kaplan-Meier curve) and may be calculated with a 95% confidence interval. The statistics is “restricted” as it cannot be exactly estimated in the presence of censoring. Since survival is often skewed to the right, the median survival time is often preferred to the mean as a ‘typical’ observation.

A useful vdeo tutorial about Kaplan-Meier survival curves is given here.

Log-transformation

Logarithms are the power to which a fixed number (the base, often in epidemiology \(e \approx 2.718\)) must be raised to to produce a given number. For example, \(log_2(8) = 3\). Logarithms are often used to transform continuous variables in data analysis, and in fact, it is suggested to transform positive data that has no fixed upper bound and whose values cover two or more orders of magnitude (Fox and Weisberg, 2019). Conversely, if the range of a variable is less than an order of magnitude, a log transformation of a variable is unlikely to be helpful. Logarithmic transformation of a variable in regression modelling may achieve:

improve linearity between two variables
stabilise conditional variation
simplify models
achieve normality.

The Box-Cox method of transformation is often used to choose an appropriate power transformation, with a log transformation chosen for levels of \(\lambda\) close to zero, and \(x^\lambda\) for values far from zero. The R function car::powerTransform(variable ~ 1, data = data.frame, family = "bcPower") is often used to choose the appropriate form. car::symbox(~ variable, data = data.frame) can be used to trial a variety of transformations for a continuous variable and explore them as box plots.

Mean

This is a measure of central tendency, often used in epidemiological studies. The mean values of a sequence of measurements, such as blood pressure (\(\bar{x}\)) on a series of 1, 2, 3 …, i, …, n measurements is given by: \[\begin{align} \bar{x} &= \frac{\sum_{i=1}^{n}{x_i}}{n}\\ \text{Where:}\\ n &= \text{the total number of measurements}\\ x_i &= \text{the }i\text{th blood pressure measurement} \\ \end{align}\]

Measurement error

A type of bias affecting measures in an epidemiological study. These can affect

the exposure
the outcome
the confounders.

Estimates of measurement error are usually given against a gold-standard, using statistics such as sensitivity or specificity.

Measurement error can be reduced in epidemiological studies by:

favouring objective over subjective measures
calibrating equipment
standardising study procedures
staff training in procedures

Number needed to treat (NNT)

This is a method of reporting mainly trial or epidemiological study results for which the outcome is binary and can help convey the clinical utility of an intervention or exposure. It is based on the absolute risk reduction (ARR), which is the difference between the risk in the exposed (\(\pi_{\text{exposed}}\)) and the risk in the unexposed (\(\pi_{\text{unexposed}}\)).

\(\text{ARR} = \pi_{\text{exposed}} - \pi_{\text{unexposed}}\)

The NNT is the reciprocal of the ARR.

\(\text{NNT} = \frac{1}{(\pi_{\text{exposed}} - \pi_{\text{unexposed}})}\)

The statistic should be interpreted as the average number of people who need to be treated (or exposed) to prevent (or cause) one case. NNTs may be classified as \(\text{NNT}_{\text{benefit}}\) or \(\text{NNT}_{\text{harm}}\). Interval estimates of not-significant NNTs are complex, due to the crossing over from the harm-to-benefit part of the equation. Also, the ‘no effect’ value of an NNT is \(\infty\). A good discussion by Professor Doug Altman is given here.

Odds

The ratio of frequencies that an event occurs, divided by the frequency that it does not occur. If P is the probability of an event occurring, then:

\(\text{odds} = \frac{P}{(1-P)}\)

Odds ratio

The ratio of odds, usually comparing an exposed to an unexposed group. It is a measure of association. Odds ratios have desirable mathematical properties, in that they are reversible if the outcome is changed from a disease event to no event. They are also not bounded by the risk of disease in the unexposed, unlike risk ratios. This makes odds ratios more desirable when comparing measures of association between studies, such as in a meta-analysis. On the other hand, they have the disadvantage of being less interpretable.

P-value

In a study which links an exposure with a disease, this statistic is the long run risk (or probability) of getting the observed results or more extreme, if there is truly no relationship between exposure and disease (ie. that the null hypothesis is true).

Conventionally, a P-value of less than 0.05 (5% or 1/20 times) is considered to be evidence that the ‘no relationship’ hypothesis is unlikely, and that there really is an association or link between exposure and disease.

Since the P-value is influenced strongly by the sample size, it is important to always also consider the magnitude of the association, rather than only focusing on the P-value, when assessing the strength of association.

So, in an epidemiological study, the major influences on a \(P\)-value are:

The effect size, often expressed as a risk or odds ratio.
The sample size of the study. Larger sample sizes result in smaller P-values.
The spread of the data, often expressed as the standard error of the risk or mean measure in each exposure group.

Population attributable fraction

The proportion of the cases in the population that may be prevented if the exposure is removed. It is derived from a measure of association (relative risk or odds ratio) and the prevalence of exposure in the population. A causal assumption is inherent in the calculation that the exposure causes the outcome.

The basic idea is that a proportion of overall cases are attributable to exposure to the risk factor. The fundamental idea is encapsulated by the following formula:

\[ \text{Population attributable fraction} = \frac{P(D) - P(D \mid \bar{E}) }{P(D)} = 1- \frac{P(D \mid \bar{E}) }{P(D)} \] Where:

\(P(D)\) is the probability of disease in the population
\(P(D \mid \bar{E})\) is the probability of disease given no exposure.

Rearranged algebraically, the formula may be expressed by: \[ \text{Population attributable fraction} = \frac{P(E) (\text{RR} - 1)}{1 + P(E) (\text{RR} - 1)} \] Where:

\(P(E)\) is the prevalence of the exposure, and
\(\text{RR}\) is \(\frac{P(D \mid E)}{P(D \mid \bar{E})}\)

This formula is illustrated in the picture below, where the numerator represents the cross-hatched area and the denominator encompasses the unit square hashed and cross-hatched area.

Source

For exposures with different levels of exposure, the relevant formula is:

\[ \text{Population attributable fraction} = \frac{\sum_{i = 1}^{n} (\text{prevalence}_i (\text{RR}_i - 1))}{1+\sum_{i = 1}^{n} (\text{prevalence}_i (\text{RR}_i -1)) } \] In which there are are \(1, 2, i, ... n\) strata, with relative risk, \(\text{RR}_i\).

The population attributable fraction for a group of risk factors for the same disease, where \(r\) is the index for each risk factor.

\[ \text{Population attributable fraction}_\text{combined} = 1 - \prod_{r}(1 - \text{Population attributable fraction}_r) \] Model based estimates of \(PAF\) are available by building a suitable model to predict the outcome. Estimate predictions (probability of having an outcome, from logistic model, for example) from the model, setting the exposure to zero. The sum of predicted probabilities represents the expected number of cases.

The \(PAF\) is therefore:

\[ \text{Population attributable fraction} = \frac{\text{Observed cases} - \text{Expected cases}}{\text{Observed cases}}\]

A good synopsis on the subject is available here and here. Variance estimation of population attributable fractions often rely on jack-knife procedures or the delta method.

Estimating the 95% confidence interval of the PAF

Without confounders, the formula to estimate the 95% confidence interval, using the delta method in a cohort or cross-sectional study, with the following summary 2x2 table is as follows:

Exposed?	Disease	No disease	\(\Sigma\)
Yes	\(\pi_{11}\)	\(\pi_{10}\)	\(\pi_{1.}\)
No	\(\pi_{01}\)	\(\pi_{00}\)	\(\pi_{0.}\)
\(\Sigma\)	\(\pi_{.1}\)	\(\pi_{.0}\)	1

Here, \(\pi_{ij}\) is the cell probability (\(n_{ij}/N\)), where \(n_{ij}\) is the individual cell count and \(N\) is the total count in the study.

If: \[\begin{align} P(\text{D}) &= \pi = \pi_{.1} = \pi_{11} + \pi_{01} \\ P(\text{D | E}) &= \pi_1 = \frac{\pi_{11}}{\pi_{1.}} \\ P(\text{D |} \bar{\text{E}}) &= \pi_0 = \frac{\pi_{01}}{\pi_{0.}} \\ \text{Let } \phi &= \frac{\pi_{01}}{\pi_{0.} \pi_{.1}} \\ \text{and } \text{PAF} &= \frac{\pi - \pi_0}{\pi} = 1 - \phi \\ \widehat{\text{Var}}(\widehat{\phi}) &= \widehat{\phi}^2 \widehat{\text{Var}}(log(\widehat{\phi})) \\ \text{and:} \\ \widehat{\text{Var}}(log(\widehat{\phi})) &= \frac{1- \pi_{01}}{N\pi_{01}} - \frac{\pi_{0.} + \pi_{.1} - 2\pi_{01}}{N\pi_{0.}\pi_{.1}} \\ \end{align}\]

This is then used to estimate the 95% confidence interval by adding and subtracting \(1.96 \sqrt{\widehat{\text{Var}}(\widehat{\phi})}\) from the point estimate.

Positive predictive value

The probability or risk of an individual having a disease given or conditional on them having a positive test. See the definition of Bayes’ theorem for how it is calculated from prevalence of disease in the population, and the characteristics of the test (sensitivity and specificity).

Prevalence

The proportion or percentage of subjects who have a certain characteristic or disease. For example, the prevalence of diabetes in the New Zealand adult population is 10%.

For prevalence estimates using an imperfect test, if sensitivity and specificity are known, then a corrected prevalence may be estimated to adjust for the bias caused by measurement error.

The correction is given by the Rogan-Gladen estimator:

\[ \text{corrected prevalence} = \frac{\text{biased prevalence} + \text{specificity} - 1}{\text{sensitivity} + \text{specificity} - 1}\]

Publication bias

This is an issue which may distort the results of meta-analyses, since papers with statistically significant results are more likely to be accepted by journal editors. A funnel plot or Egger test or both may be used to investigate whether publication bias exists.

Quantile

Is a set of values which divides a set of measurements (eg. blood pressure) into equally sized groups, each containing the same fraction of the total sample. This technique is often used to cut a continuous variable into a categorical one. Dividing the measurements from a population into quarters, for example, will result in quartiles. Dividing into ten equally sized categories will result in deciles. This method is often used when there is no natural cut-off in a set of continuous observations. This method may be used to divide individuals in an epidemiological study into different exposure categories.

Random or sampling error

The error which is linked with taking a sample from a larger population. P-values and 95% confidence intervals account for random error. Random errors generally get smaller with larger samples. A web app that simulates random error, based on dice simulation, is here.

Randomised controlled trial

A longitudinal study in which treatment assignment is randomly assigned, usually using a predetermined, often computer generated, random sequence. Trials have the benefit of controlling for measured and unmeasured confounding factors (see confounding).

Rate

A measure per unit time. In epidemiology and statistics disease frequency is often estimated using incidence rates which are the proportion of new cases of disease in a population divided by the person-time at risk. Hazards are another example of a rate.

Regression

A statistical method of characterizing the relationship between one variable (such as number of rotten teeth) and another (such as sugar intake). It can be used to gather statistical evidence for whether or not sugar intake is likely to cause rotten teeth, for example. It can also be used to adjust for confounders. Regression can be visualized as a line or plane which minimises the vertical distance between points in a two or three dimensional scatterplot for simple cases (see here for a scabies and rheumatic fever example).

A difficulty related to regression is explaining the meaning of the terms in the model, especially the \(\beta\) coefficients. A summary of the interpretation of these terms under different transformations is given below.

Transformation	Model	Interpretation of \(\beta_1\) coefficient
None	\(y = \beta_0 + \beta_1x\)	A one unit increase in \(x\) is associated with an average change of \(\beta_1\) units in \(y\)
log transformed predictor	\(y = \beta_0 + \beta_1 ln(x)\)	A one % increase in \(x\) is associated with an average change of \(\frac{\beta_1}{100}\) units in \(y\).
log transformed outcome	\(ln(y) = \beta_0 + \beta_1 x\)	A one unit increase in \(x\) is associated with an average change of \((e^{\beta_1}) \times y\).
log-log. model	\(ln(y) = \beta_0 + \beta_1ln(x)\)	A one % increase in \(x\) is associated with an average change of \(\beta_1\%\) in \(y\).

Source

Risk

Risk is the long run frequency of an event occurring. It is the ratio of events of interest occurring to the whole number of trials, in which these events may have occurred. For example: if a sample of 100 adults are tested for diabetes, and 10 test positive for the disease, then the risk of disease in this population would be 0.1 or equivalently 10%.

Risk ratio or relative risk

The ratio of the risk of an outcome in an exposed group to the risk of an outcome in an unexposed groups. The definition always implies a comparison of the risk of disease in two different exposure groups. It is important, when talking about risk ratios, to communicate how these two groups are defined. A little known fact about risk ratios is that the upper limit is bounded by the risk of disease which occurs in the unexposed group. For example, in a study of the influence of smoking on lung cancer, if at the end of follow-up, \(\frac{4}{5}\) or 80% of non-smokers had lung cancer, then the upper limit of the risk ratio is \(\frac{5}{4}\) or 1.25. An example and visualisation of an extremely high relative risk is given in this letter to the New Zealand Medical Journal here.

Selection bias

Distortion of the measure of association between exposure and disease that is due to the participants who are selected into the study. The use of hospital rather than community controls in case-control studies are a common reason for selection bias. This is sometimes called Berkson’s bias and usually results in a spurious association between exposure and disease.

Sensitivity

This is usually a feature of a diagnostic test, when compared to a gold standard or ideal test. It is the probability of a positive test, given that an individual has the disease. It may be thought of as the proportion of tests that are “true positives”. For example, if a clinician screens 100 patients with diabetes (diagnosed with a gold-standard HbA1c test) with a capillary glucose and finds that 80 have diabetes, then the sensitivity of the capillary glucose test for identifying true cases of diabetes will be 80%. A highly sensitive test will be useful at ruling out a disease if the test is negative, since the rate of false-negatives is likely to low. A sensitive test is not so useful at ruling in disease. A highly specific test is required for this. See Bayes’ theorem for how this information can be used with Specificity and pre-test Prevalence to estimate the post-test chance (or Positive predictive value) of having disease.

Specificity

This is another feature of a diagnostic test, as compared to a ‘gold standard’ or ideal test. It is the probability of a negative test, given that an individual does not have the disease. It may be thought of as the proportion of diagnostic tests that return a “true negative” result. 1 - specificity is, conversely, the proportion of individuals who are “false negatives”. For example, if a group of 100 adults are known not to have diabetes on the basis of a gold-standard HbA1c test, and then are tested with an inferior fasting capillary glucose test. If 20 of the non-diabetic adults turn out to have positive tests for diabetes on the finger-prick test, then 20% are false-positives (1 - specificity), and the true-negative rate or specificity of the capillary glucose test is 80%. A highly specific test will rule in the disease (if positive), since the false-positive rate is likely to be low. It does not necessarily, however, rule out the disease, in the case of a negative test. A highly sensitive test is required for that. In epidemiological studies, to minimise measurement error, generally specific measures are favoured over sensitive ones. See Bayes’ theorem for how this information can be used with Sensitivity and pre-test Prevalence to estimate the post-test chance (or Positive predictive value) of having disease.

Standard deviation

A measure of spread of a range of values, in particular, of how much values vary from the mean value. The standard deviation \(\sigma\), for example, of a number (1, 2, 3, … i, … N) of height measurements (\(x_i\)) is given by:

\[\begin{align} \sigma &= \sqrt {\frac {\sum_{i=1}^{n}(x_{i}-{\bar{x}})^{2}}{n - 1}} \\ \text{Where:}\\ n &= \text{the total number of measurements} \\ x_i &= \text{the height measurement for the }i \text{th individual} \\ \bar{x} &= \text{the mean value of height} \\ \end{align}\]

Standard error

Standard error is derived from the idea that you are taking a sample from a larger population. It is a measure of sampling variability, so that, if you imagined you were repeating a study over and over again (in the frequentist paradigm), you’d get different results. If you were estimating, for example, the mean blood pressure of adult males in New Zealand, it would be impractical to take measurements on everyone. Instead, you would likely take a sample, and from that sample it would be possible to estimate how accurate that mean measure is likely to be. The formula to calculate the standard error (SE) for the mean is given by:

\[\begin{align} \text{SE} &= \frac{\sigma}{\sqrt{n}} \\ \text{Where:}\\ \sigma &= \text{the standard deviation of the sample} \\ n &= \text{the total number of measurements taken} \\ \end{align}\]

It is an interesting fact that the size of the population from which the sample came from does not determine the size of the standard error of the sample mean. The way to think of the difference between the standard deviation and standard error, is that the standard error is a measure of spread of means or other summary statistics (such as percentages, risk ratio, odds ratios), whereas the standard deviation is a measure of the distribution of the individual values.

Stratification

In response to the problem of confounding, one way of addressing this is to stratify or split the analysis of an epidemiological study by levels of the confounding variable. For example, if smoking were considered a confounder of the relationship between alcohol intake and lung cancer, then the association between this exposure and outcome could be considered within smokers and non-smokers. Since smoking status is the same in each level or strata of the study population, the confounding variable can not influence the exposure - disease relationship. Mantel-Haenszel stratification (see here for details) is often used in epidemiology to estimate a weighted average of stratum specific measures of association (risk or odds ratios for example).

A worked example of stratification, to deal with confounding, is summarised below:

In a case-control study, the primary exposure of interest was low to moderate alcohol intake and the outcome was myocardial infarction (heart attack). Cigarette smoking was considered a potential confounder, since arguably, it is a potential cause of both increased alcohol intake and cardiovascular disease.

In this case, the crude association between the exposure and outcome is as follows:

Alcohol intake	Cases	Controls
Yes	97	1,529
No	71	1,769

The relative risk (approximated by the odds ratio) which summarises the association between cases and controls is: \[\begin{align} \text{odds ratio}_{\text{crude}} &= \frac{97/71}{1,529/1,769}\\ &= 1.58 \\ \end{align}\]

The 95% confidence interval may be estimated by using the following formula: \[\begin{align} \text{standard error}(\text{log}_e(\text{OR})) &= \sqrt{\frac{1}{97} + \frac{1}{71} + \frac{1}{1,529} + \frac{1}{1,769}}\\ &= 0.160\\ \text{Error factor} &= e ^{1.96 \times \text{standard error}(\text{log}_e(\text{OR})) }\\ &= e ^{1.96 \times 0.160} \\ &=1.368\\ \text{95% confidence interval} &= \frac{\text{odds ratio}_{\text{crude}}}{\text{Error factor}} \text{ to } \text{odds ratio}_{\text{crude}} \times \text{Error factor}\\ &= 1.16 \text{ to } 2.16\\ \end{align}\]

Since, the 95% confidence interval does not include the null (odds ratio = 1), it is statistically significant. This means there is statistical evidence that alcohol intake influences incidence of heart attacks.

A detractor may say, “well, smoking could explain this link, because drinkers are more likely to smoke and smoking is likely to cause heart attacks”. To investigate this possibility further, we need to conduct an adjusted analysis.

To adjust for the confounding factor, the population is then divided or stratified into two: those who do and do not smoke.

The one 2x2 table is therefore broken further into two:

Smoking status	Alcohol intake	Cases	Controls
Non-smoker	Yes	19	609
Non-smoker	No	46	1,478
Smoker	Yes	78	920
Smoker	No	25	291

The stratum specific odds ratios are therefore:

\[\begin{align} \text{odds ratio}_{\text{Non-smokers}} &= \frac{19/46}{609/1,478}\\ &= 1.00 \text{ (95% CI: 0.58 to 1.72)}\\ \text{odds ratio}_{\text{Smokers}} &= \frac{78/25}{920/291}\\ &= 0.99 \text{ (95% CI: 0.62 to 1.58)}\\ \end{align}\]

One can now appreciate that there is now no relationship between alcohol intake in each strata of smoking status. The adjusted odds ratio is a weighted-average (Mantel-Haenszel method) of the two stratum-specific odds ratios. This will be a value close to the null value of 1. Thus, the analysis presented here indicates that the apparent crude association between alcohol and heart disease is explained by the third or confounding factor: smoking. The detractor was right! An Excel sheet with these calculations is available here.

Survey methods

Surveys are a powerful way of sampling a small number of individuals to make inferences about a population.

Probabilistic sampling is recommended and requires a list of the population to be sampled, a sampling frame.

A simple random sample can select individuals from a population and can give an unbiased estimate of a quantity in a target population. However, the efficiency of a simple random sample may be improved by stratification or clustering.

Stratification is dividing the population into relatively homogeneous segments, such as by socioeconomic status. Clusters are naturally occurring groups in the population, such as schools. A stratified sample results in increased precision and representation, whereas a cluster sample reduces cost and efficiency of data collection.

If, for example, a survey consists of a survey of 5 childcare centres for scabies, and three are in poor areas and two are in wealthy areas, then the results may be estimated as follows. If the poor childcare centres show that the first centre has 4/13 cases of scabies, the second 1/12 and the third 4/25. To estimate the overall proportion, the total number of cases is divided by the total population, to give 9/50 or 18% prevalence in this strata.

If the wealthy centres have a prevalence of 1/100 and 1/50, then the prevalence in this strata is 2/150 or 1.3%.

If 9% of childcare centres are classified as poor and 91% wealthy, then the overall prevalence is \(0.09*0.18 + 0.91*0.013 = 0.028 \text{ or } 2.8\%\).

This is naturally a very detailed rich area of statistical enquiry. The following books are recommended here and for implementing in R code, here.

Systematic error

Errors that are due to the experimental design are systematic errors. These include measurement error, selection bias and confounding. They are distinct from random or sampling error.

t-test

A test of the difference of two means in a population. For example, if weight loss is the outcome in a trial of two different dietary regimens, a t-test may be used to distinguish whether the difference between the two groups is likely to be due to random error or not. Several different types of t-test may be done, including paired, two sample and one sample. The standard version used most often in epidemiology is the two sample test.

The t statistic estimates how many standard errors the observed results are from the null value of no difference.

The formula for the t statistic is: \[\begin{align} t &= \frac{\text{difference in means}}{\text{standard error of difference in means}} \\ &= \frac{\bar{x}_1 - \bar{x}_2}{\text{standard error of difference in means}} \\ &= \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \\ \text{Where:}\\ \sigma &= \text{the standard deviation of the sample, in groups 1 and 2} \\ n &= \text{the total number of measurements taken in groups 1 and 2} \\ \bar{x} &= \text{the mean of groups 1 and 2} \\ \end{align}\]

The critical threshold for the t statistic varies according to the degrees of freedom which is \(n_1\) + \(n_2\) - 2, where:

\(n_1\) and \(n_2\) are the numbers of observations in each group.

Vaccine efficacy

Observational data may be used to estimate the efficacy of a vaccine by comparing the proportion of people vaccinated among cases of the infectious disease to background population proportions of vaccine coverage. Effective vaccines are expected to reduce the proportion of cases that are vaccinated. Age groups who are heavily exposed to the virus and may have prior disease and immunity should be excluded. A relevant article is available here.

In an investigation of an outbreak, the vaccine efficacy may be estimated as follows:

	Vaccination status?
Clinical status	Vaccinated	Unvaccinated
Cases	a	b
Population	c	d

\[\begin{align} \text{VE (%)} &= \frac{\frac{b}{(b + d)} - \frac{a}{(a + c)}}{\frac{b}{(b + d)}} * 100 \\ \text{Where:}\\ \text{VE} &= \text{Vaccine efficacy (%)} \\ \end{align}\]

For an unmatched case-control study, the equivalent calculation is:

	Vaccination status?
Clinical status	Vaccinated	Unvaccinated
Cases	a	b
Controls	c	d

\[\begin{align} \text{VE (%)} &= \left( 1 - \frac{ad}{bc} \right) \\ \end{align}\]

The expected ratio of vaccinated to unvaccinated people, given a known or assumed vaccine efficacy is:

\[\begin{align} \text{PCV} &= \frac{\text{PPV} - \text{PPV} * \text{VE}}{1 - \text{PPV}*\text{VE}}\\ \text{Where:}\\ \text{PCV} &= \text{Proportion of vaccinated individuals among cases} \\ \text{PPV} &= \text{Proportion of population vaccinated} \\ \text{VE} &= \text{Vaccine efficacy} \\ \end{align}\]

Validation

Assessing the validity of a measure generally means the degree to which a certain measure relates to a gold standard measure. For example, a researcher may ask questions about a person’s salt intake and assess the validity of this measure against a gold standard which is likely to have less measurement error, such as a urinary sodium estimate. The degree of agreement may be expressed by a variety of measures, such as sensitivity and specificity for categorical measures (such as ‘disease’ or ‘no disease’) or regression coefficients for continuous measures. ‘Regression validation’ is a related concept which estimates how accurately predictions from a statistical model conform to future subjects or observations.