Descriptive Summary Statistics
Right now this document serves as a staging/development ground for content regarding summary statistics for use elsewhere.
Intro
Descriptive statistics help make sense of raw data, which often takes the form of a massive list, array, or database of labels and numbers. To summarize the data we can calculate statistics like the mean, median, and interquartile range. We can also visualize the data using graphical devices like histograms, scatterplots, and the empirical CDF. These methods are useful both for communicating the data and for exploring it to gain insight into its structure, such as whether it might follow a familiar probability distribution.
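As an illustration (not part of the original text), a minimal base-R sketch computing these summary statistics and plots for a simulated sample might look like this:
# Hypothetical sample; rnorm() simulates draws from a normal distribution
x <- rnorm(200, mean = 10, sd = 2)

mean(x)        # arithmetic mean
median(x)      # 0.5 quantile
IQR(x)         # interquartile range
quantile(x)    # minimum, quartiles, and maximum
hist(x)        # histogram
plot(ecdf(x))  # empirical CDF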
Summary statistics can be split into two main types:
- Population Statistics – These describe an entire population. They are the actual, true values that summarize every member of the population. Since you usually can’t measure everyone in a population (unless it’s small), these are often theoretical and not directly calculated.
- Sample Statistics – These are calculated from a subset (sample) of the population. Since we usually can’t measure the whole population, we take a sample and use these statistics to infer or estimate the population values; this is done using estimators.
Term | Description |
---|---|
Statistic | A function of the observed data exclusively. Formally, if \(X_1,X_2,\dots,X_n\) are random variables (a sample), a statistic is a function of the data only, \(f(X_1,X_2,\dots,X_n)\), and not of any parameters or other statistics. |
Estimator \(T(X_1,X_2,\dots,X_n)\) | A statistic \(T\) used to estimate some population parameter \(\theta\). |
Estimate \(\hat{\theta}=T(X_1,X_2,\dots,X_n)\) | The numerical result \(\hat{\theta}\) of an estimator applied to a particular sample. |
Bias \(\text{Bias}(\hat{\theta})=\mathbb{E}[\hat{\theta}]-\theta\) | The difference between the expected value of an estimator \(\mathbb{E}[\hat{\theta}]\) and the true parameter \(\theta\) (see the simulation sketch below the table). |
Unbiased Estimator | An estimator whose expected value equals the parameter being estimated, i.e. \(\mathbb{E}[\hat{\theta}]=\theta\). |
Mean Squared Error | MSE considers both bias and variance: \(\text{MSE}(\hat{\theta})=\text{Var}(\hat{\theta})+[\text{Bias}(\hat{\theta})]^2\). An estimator with small MSE may be preferred even if it is biased. |
Consistency | An estimator \(\hat{\theta}_n\) is consistent if it converges in probability to \(\theta\) as the sample size approaches infinity \(\hat{\theta}_n\xrightarrow{P}\theta\) as \(n\to\infty\) |
Efficiency | Among unbiased estimators, the one with the smallest variance is called efficient. |
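To make bias concrete, here is a small simulation sketch (assumed setup: normal samples of size 5 with known variance) comparing the biased \(1/n\) variance estimator with the unbiased \(1/(n-1)\) estimator:
set.seed(1)
n      <- 5
sigma2 <- 4   # true population variance

# 10,000 replicate samples, one per column
samples <- replicate(10000, rnorm(n, mean = 0, sd = sqrt(sigma2)))

var_biased   <- apply(samples, 2, function(x) mean((x - mean(x))^2))  # divide by n
var_unbiased <- apply(samples, 2, var)                                # divide by n - 1

mean(var_biased)    # systematically below 4, by roughly the factor (n-1)/n
mean(var_unbiased)  # close to 4: unbiased
The average of the \(1/n\) estimator falls short of the true variance; that shortfall is exactly the bias the \(n-1\) divisor corrects.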
The frequentist approach treats population statistics as fixed (but often unknown), while sample statistics vary depending on the chosen sample; sample statistics are used to infer population statistics through estimation and hypothesis testing. The Bayesian approach does not treat population parameters as fixed; they have their own probability distributions. Instead of estimating a single number, Bayesians update their beliefs by combining prior knowledge with new data.
Degrees of freedom
In statistics, every time you estimate a parameter (like a mean), you “use up” a degree of freedom.
The sample variance \(s^2=\frac{1}{n-1}\sum(x_i-\bar{x})^2\) is unbiased because it divides by \(n-1\) rather than \(n\): estimating the population mean \(\mu\) with \(\bar{x}\) costs one degree of freedom.
For example, in a one-sample t-test: \(t=\frac{\bar{x}-\mu}{s/\sqrt{n}}\)
Here we estimate both the sample mean \(\bar{x}\) and the sample standard deviation \(s\), and that estimation introduces uncertainty, giving \(df=n-1\). When computing the sample variance \(s^2\) we lose only one degree of freedom, even though it uses the mean: once \(\bar{x}\) is known, the deviations \(x_i-\bar{x}\) must sum to zero, so only \(n-1\) of them are free. Dividing by \(n-1\) (the degrees of freedom) accounts for this loss: \[s^2=\frac{1}{n-1}\sum^n_{i=1}(x_i-\bar{x})^2\] The \(n-1\) divisor inflates the variance slightly (compensating for small-sample bias); this matters for very small \(n\), and the Student’s t-distribution reacts accordingly, widening its tails to reflect the imprecise estimate of the variance. When \(n\) is larger the \(-1\) becomes negligible and the Student’s t-distribution approaches the standard normal z-distribution. The degrees-of-freedom conventions for common tests are listed below, with a short sketch after the list.
- One-sample t-test: we estimate one mean, so \(df=n-1\)
- Two-sample t-test: we estimate two means, so \(df=n_1+n_2-2\) (in the pooled version)
- Regression with \(k\) predictors: \(df=n-k-1\)
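A quick sketch (simulated data; the sample sizes are illustrative) showing the degrees of freedom R reports for these cases:
set.seed(1)
x <- rnorm(12, mean = 5)
y <- rnorm(15, mean = 5)
d <- data.frame(resp = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))

t.test(x, mu = 5)$parameter               # one-sample:        df = 12 - 1      = 11
t.test(x, y, var.equal = TRUE)$parameter  # pooled two-sample: df = 12 + 15 - 2 = 25
df.residual(lm(resp ~ x1 + x2, data = d)) # regression, k = 2: df = 20 - 2 - 1  = 17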
Quantiles
Quantiles describe the probability distribution of a random variable. Given a PMF or PDF, a quantile is the value of the random variable at a given cumulative probability (the lower tail has area equal to that probability); percentiles, deciles, and quartiles are also used. The 60th percentile is the same as the 0.6 quantile, and the median is the 0.5 quantile.
Distributions are often compared by plotting their quantiles against one another (Q-Q plot). Samples and populations are often compared using rug plots: density plots with the quantiles marked in the lower margin using geom_rug.
Population (Theoretical) Quantiles
Population quantiles can be calculated using the appropriate q<dist> function for the distribution. Quantiles for the continuous normal population distribution are calculated with the qnorm function:
library(data.table)
library(ggplot2)

mu    <- 0   # normal distribution mean
sigma <- 1   # normal distribution std.dev

x = seq(-3, 3, 0.1)
data = data.table(
  x = x,
  pPopulation  = dnorm(x, mean = mu, sd = sigma),
  cpPopulation = pnorm(x, mean = mu, sd = sigma)
)

# quantiles: 5%, 10%, 15%, ..., 95%
q <- qnorm(seq(0.05, 0.95, 0.05), mean = mu, sd = sigma)

ggplot(data) +
  geom_line(aes(x = x, y = pPopulation), color = "lightblue", linewidth = 1.5) +
  geom_line(aes(x = x, y = cpPopulation), color = "steelblue", linewidth = 1.5) +
  geom_vline(xintercept = q, color = "red", alpha = 0.65)
Empirical (Sample) Quantiles
Quantiles for the sample data are calculated with the quantile function:
mu    <- 0   # normal distribution mean
sigma <- 1   # normal distribution std.dev

data = data.table(x = rnorm(1000, mean = mu, sd = sigma))

# quantiles: 10%, 20%, 30%, ..., 90%
q <- quantile(data$x, probs = seq(0.1, 0.9, 0.1))

ggplot(data, aes(x = x)) +
  geom_histogram(aes(y = after_stat(count)), binwidth = 0.2, fill = "lightblue") +
  geom_density(aes(y = after_stat(density) * 0.2 * nrow(data)), color = "steelblue") +
  scale_y_continuous(name = "Frequency",
    # `~ .` represents the primary y-axis values
    sec.axis = sec_axis(~ . / (0.2 * nrow(data)), name = "Density")
  ) +
  geom_vline(xintercept = q, color = "red") +
  geom_rug(color = "steelblue")
Credible Intervals (Bayesian Analysis)
Credible Intervals for Closed-Form Posterior PDF/PMF
If you’re working with Bayesian methods, you might use highest posterior density (HPD) intervals or equal-tailed credible intervals. When the posterior PDF/PMF has a closed form, an equal-tailed interval can be read directly from its CDF, as below; the bayestestR package (used in the next section) also provides convenient functions:
library(data.table)
library(ggplot2)

# Define parameters
mu    <- 0
sigma <- 1

x <- seq(-3, 3, 0.1)
data <- data.table(
  x = x,
  pPosterior = dnorm(x, mu, sigma),
  cp = pnorm(x, mu, sigma)
)

ggplot(data, aes(x, pPosterior)) +
  geom_line(color = "steelblue") +
  geom_area(data = data[cp > (1-0.75)/2 & cp < 0.75 + (1-0.75)/2],  # 75% credible mass
            aes(x, pPosterior), fill = "red", alpha = 0.15) +
  geom_area(data = data[cp > (1-0.85)/2 & cp < 0.85 + (1-0.85)/2],  # 85% credible mass
            aes(x, pPosterior), fill = "red", alpha = 0.15) +
  geom_area(data = data[cp > (1-0.95)/2 & cp < 0.95 + (1-0.95)/2],  # 95% credible mass
            aes(x, pPosterior), fill = "red", alpha = 0.15)
Credible Intervals for Non-Closed-Form Posterior PDF/PMF
The credible interval is based purely on posterior samples. As more data becomes available the likelihood dominates and the prior becomes negligible, so we can legitimately approximate the posterior PDF/PMF from the samples alone using kernel density estimation or similar. In this case we can use the ci function to find the quantiles/boundaries for a given credible interval:
library(bayestestR)

data <- data.table(
  posterior = rnorm(60, mean = 0, sd = 1)
)
cred.int. <- ci(data$posterior, 0.75)

ggplot(data, aes(x = posterior)) +
  geom_histogram(aes(y = after_stat(count)), binwidth = 0.2, fill = "lightblue") +
  geom_density(aes(y = after_stat(density) * 0.2 * nrow(data)), color = "steelblue") +
  scale_y_continuous(name = "Frequency",
    # `~ .` represents the primary y-axis values
    sec.axis = sec_axis(~ . / (0.2 * nrow(data)), name = "Density")
  ) +
  annotate("rect", xmin = cred.int.$CI_low, xmax = cred.int.$CI_high,
           ymin = 0, ymax = Inf, fill = "red", alpha = 0.15)
Moments
Name | Sample | Population | Key Characteristics |
---|---|---|---|
Zeroth Central Moment | \(\mu_0 = 1\) | \(\mathbb{E}[(X-\mathbb{E}[X])^0]=1\) | Represents total probability (always 1 for a valid distribution). |
First Raw Moment (Mean) | \(M_1 = \frac{1}{N} \sum x_i\) | \(\mathbb{E}[X]\) | Measures the average value (mean) of the data; serves as the center of the distribution. |
First Central Moment | \(\mu_1 = 0\) | \(\mathbb{E}[X-\mathbb{E}[X]]=0\) | Always 0; measures the deviation from the mean, ensuring symmetry around the mean. |
Second Raw Moment | \(M_2 = \frac{1}{N} \sum x_i^2\) | \(\mathbb{E}[X^2]\) | Measures the spread of the data relative to the origin; includes both spread and location effects. |
Variance | \(\mu_2 = \frac{1}{N} \sum (x_i - \bar{x})^2\) | \(\mathbb{E}[(X-\mathbb{E}[X])^2]\) | Measures dispersion or spread of the distribution; larger values indicate greater variability. |
Skewness | \(\mu_3 = \frac{1}{N} \sum (x_i - \bar{x})^3\) | \(\mathbb{E}[(X-\mathbb{E}[X])^3]\) | Describes asymmetry; positive skew indicates a tail to the right, and negative skew indicates a tail to the left. |
Kurtosis | \(\mu_4 = \frac{1}{N} \sum (x_i - \bar{x})^4\) | \(\mathbb{E}[(X-\mathbb{E}[X])^4]\) | Measures “tailedness”; highlights the prominence of the tails compared to the center of the distribution. |
Hyperskewness | \(\mu_5 = \frac{1}{N} \sum (x_i - \bar{x})^5\) | \(\mathbb{E}[(X-\mathbb{E}[X])^5]\) | Measures asymmetry in the tails; positive values emphasize right-skewed tails, while negative values emphasize left-skewed tails. |
Sixth Central Moment | \(\mu_6 = \frac{1}{N} \sum (x_i - \bar{x})^6\) | \(\mathbb{E}[(X-\mathbb{E}[X])^6]\) | Describes extreme tailedness; sensitive to outliers even more than the fourth moment. |
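The sample-column formulas can be computed directly; a short sketch (a simulated right-skewed exponential sample is assumed purely for illustration) of the raw and central moments:
set.seed(1)
x <- rexp(10000, rate = 1)   # Exp(1): mean 1, variance 1, right-skewed

raw_moment     <- function(x, k) mean(x^k)               # M_k
central_moment <- function(x, k) mean((x - mean(x))^k)   # mu_k

raw_moment(x, 1)       # ~1   (the mean)
raw_moment(x, 2)       # ~2   (spread about the origin)
central_moment(x, 2)   # ~1   (variance)
central_moment(x, 3)   # > 0  (right skew)
central_moment(x, 4)   # emphasizes the heavy right tail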
Derivation of Population Variance Shortcut Formula
We start with the definition of population variance:
\[ \mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \]
Now, expand the squared term:
\[ (X - \mathbb{E}[X])^2 = X^2 - 2X\mathbb{E}[X] + (\mathbb{E}[X])^2 \]
Taking the expectation of both sides:
\[ \mathrm{Var}(X) = \mathbb{E}[X^2 - 2X\mathbb{E}[X] + (\mathbb{E}[X])^2] \]
Using the linearity of expectation, and noting that \(\mathbb{E}[X]\) is a constant:
\[ \mathrm{Var}(X) = \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + (\mathbb{E}[X])^2 \]
Simplifying:
\[ \mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]
This is known as the computational formula or shortcut formula for variance.
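A quick numerical check of the shortcut formula on a simulated sample (illustrative only):
set.seed(1)
x <- rnorm(1e6, mean = 3, sd = 2)

mean((x - mean(x))^2)    # definition: E[(X - E[X])^2], estimated with 1/N
mean(x^2) - mean(x)^2    # shortcut:   E[X^2] - (E[X])^2 -- agrees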
MGF (Moment Generating Function)
A moment generating function \(M_X(t)\) for a random variable \(X\) allows us to calculate moment formulas directly through a little differentiation and substitution of \(t=0\). It is defined as follows: \[\begin{align} M_X(t)&=E\left[e^{tX}\right]\\ &=\int_{-\infty}^{\infty} \text{pdf}(x)\:e^{tx}\:dx\\ &=E\left[1+tX+\frac{t^2X^2}{2!}+\frac{t^3X^3}{3!}+\dots+\frac{t^nX^n}{n!}+\dots\right]\\ &\boxed{=1+tE[X]+\frac{t^2}{2!}E[X^2]+\frac{t^3}{3!}E[X^3]+\dots+\frac{t^n}{n!}E[X^n]+\dots} \end{align} \] We can calculate the \(n^\text{th}\) raw moment as \(\frac{d^nM_X}{dt^n}\Big|_{t=0}\): differentiating \(n\) times removes the lower-order terms, and setting \(t=0\) removes the higher-order terms, leaving \(E[X^n]\). Moment generating functions are usually listed alongside distributions on Wikipedia.
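As a worked example (not tied to any dataset in this document; the exponential MGF is standard), take \(X\sim\text{Exp}(\lambda)\): \[ \begin{align} M_X(t)&=\frac{\lambda}{\lambda-t},\quad t<\lambda\\ E[X]&=M_X'(0)=\left.\frac{\lambda}{(\lambda-t)^2}\right|_{t=0}=\frac{1}{\lambda}\\ E[X^2]&=M_X''(0)=\left.\frac{2\lambda}{(\lambda-t)^3}\right|_{t=0}=\frac{2}{\lambda^2}\\ \text{Var}(X)&=E[X^2]-E[X]^2=\frac{2}{\lambda^2}-\frac{1}{\lambda^2}=\frac{1}{\lambda^2} \end{align} \]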
Raw vs. Central Moments
Raw Moments
Calculated relative to the origin (\(x = 0\)), these moments emphasize the overall structure and location of the distribution but are heavily influenced by the mean and data range.
Central Moments
Calculated relative to the mean (\(\bar{x}\)), they remove the influence of location and focus purely on the distribution’s shape.
Even vs. Odd Moments
Even Moments
These measure the overall distribution of data and are indifferent to the direction of asymmetry. For example, the 2nd and 4th moments measure dispersion and tail prominence, respectively.
Odd Moments
These are sensitive to the asymmetry of the distribution. The 3rd and 5th moments highlight whether the distribution is skewed towards the left or the right.
Tail Sensitivity
Higher-order moments (4th and beyond) emphasize the tails of the distribution more than the center. This makes them useful for identifying outliers or extreme events in a dataset.
Zeroth and First Moments
While higher-order moments capture detailed characteristics of the distribution, the zeroth and first moments are foundational:
- The zeroth moment ensures the total probability is normalized.
- The first moment centers the distribution at the mean, making all higher-order central moments relative to this central value.
Practical Applications
- The variance (2nd moment) is widely used in risk assessment and variability analysis.
- The skewness (3rd moment) is crucial in finance, meteorology, and psychology for identifying biases or asymmetries.
- The kurtosis (4th moment) is essential in modeling financial crashes or extreme weather events where tail behavior is critical.
Scaling with Units
Moments scale with the units of the data raised to the power of the order (e.g., \(\mu_2\) is in squared units, \(\mu_3\) is in cubed units, and so on). Standardized moments (dividing by \(\sigma^n\)) are often used to make comparisons between datasets.
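A small sketch (with assumed, illustrative units) showing that standardized moments are unit-free, i.e. rescaling the data leaves them unchanged:
set.seed(1)
x_m  <- rexp(10000)   # hypothetical lengths in metres
x_cm <- 100 * x_m     # the same lengths in centimetres

# k-th standardized moment: central moment divided by sigma^k
std_moment <- function(x, k) mean((x - mean(x))^k) / sd(x)^k

std_moment(x_m, 3); std_moment(x_cm, 3)   # standardized skewness: identical
std_moment(x_m, 4); std_moment(x_cm, 4)   # standardized kurtosis: identical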
Higher Moments in Practice
While moments of order 5 and above are rarely used in day-to-day statistical analysis, they can provide deep insights in specialized fields like astrophysics, quantitative finance, and machine learning, where subtle details in distribution shape matter.
Distributions Bounded by \(X<1\)
“Inversion” arises because higher powers of \(X\) diminish for \(X\in[0,1]\). Raw moments and central moments are smaller for higher orders, focusing on finer details of the distribution near the bounds.
Tail emphasis shifts to small values. For \(X\in[0,1]\), higher moments magnify deviations near the edges (\(X\approx 0\) or \(X\approx1\)) rather than in large tails, since no large values exist.
Applications must consider boundedness. When analyzing such variables, care must be taken when interpreting moments, as their behavior is fundamentally different from distributions that are unbounded or centered around 0.
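For instance (a sketch assuming a Uniform(0,1) variable), the raw moments \(E[X^k]=\frac{1}{k+1}\) shrink as the order grows:
set.seed(1)
x <- runif(1e5)   # bounded on [0, 1]

sapply(1:6, function(k) mean(x^k))   # approx 1/2, 1/3, 1/4, 1/5, 1/6, 1/7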
Convolution, Correlation & Covariance
Mechanical Analogy/Comparison
Mean & Variance of a Data Set
Equivalently, the mean value \(\mu\) of a dataset \([x_1,\dots,x_N]\) and its second central moment, the variance \(\sigma^2\), are given respectively by \[\mu=\frac{1}{N}\sum_{i=1}^N x_i \qquad\quad \sigma^2=\frac{1}{N}\sum_{i=1}^N (x_i-\mu)^2\]
Mass & Density
At the centre of mass \(x_{CM}\) the net torque due to all mass is zero, \(T|_{CM}=0\). The centre of mass is calculated from the torque about \(x=0\) due to all mass, for discrete point masses and a distributed mass respectively: \[x_{CM}=\frac{\sum_i m_i x_i}{\sum_i m_i} \qquad\quad x_{CM}=\frac{\int x\,\rho(x)\,dx}{\int \rho(x)\,dx}\]
Probability Distributions
Equivalently, the expectation value \(\langle X\rangle\) of a discrete random variable \(X\) is given by \(E[X]=\sum_j x_j\,p(x_j)\).
- We know basic random-variable arithmetic:
  - To add discrete random variables, we add the values of all permutations and then calculate the probability of each of the sums.
  - To multiply discrete random variables, we multiply the values of all permutations and then calculate the probability of each of the products.
- The expected value of a function of a random variable is given by \(E[h(X)] = \sum_j h(x_j)\,p(x_j)\).
\[ \begin{align} \text{Linearity}&:\\\\ E[X+Y]&= \sum_i\left(x_i + y_i\right)P(\omega_i)\\ &= \sum_i x_iP(\omega_i) + \sum_i y_i P(\omega_i)\\ &= E[X]+E[Y]\\ \therefore\quad&\boxed{E\left[\sum_iX_i\right]=\sum_iE[X_i]} \end{align} \]
\[ \begin{align} \text{Shifting \& Scaling}&:\\\\ E[\alpha X+b] &=\sum_iP(x_i)(\alpha\cdot x_i+b)\\ &=\alpha\sum_iP(x_i)x_i+b\sum_iP(x_i)\\ &=\alpha E[X]+b\\ \therefore\quad&\boxed{E[\alpha X+b]=\alpha E[X]+b} \end{align} \]
\[ \begin{align} \text{Linearity}&:\\\\ \text{if }X_i\:&\text{ are all mutually independent, then}\dots\\ \quad&\boxed{Var\left(\sum_iX_i\right)=\sum_iVar[X_i]} \end{align} \]
\[ \begin{align} \text{Shifting \& Scaling}&\\\\ &\quad\text{Let}\:\mu=E[X]\:\text{and recall}\:E[aX+b]=a\mu+b\\\\ &\quad\text{Then}\\ &\quad Var(aX+b) = E[\left(aX+b-(a\mu+b)\right)^2]\\ &\quad\quad = E[(aX-a\mu)^2] \\ &\quad\quad = E[a^2(X-\mu)^2] \\ &\quad\quad= a^2E[(X-\mu)^2] \\ &\quad\quad = a^2Var(X)\\ &\quad \therefore\:\boxed{Var(aX+b) = a^2Var(X)} \end{align} \]
\[ \begin{align} E[(X-\mu)^2] &= E[X^2 - 2\mu X + \mu^2]\\ &= E[X^2] - 2\mu E[X] + \mu^2\\ &= E[X^2] - 2\mu^2 + \mu^2 \\ &= E[X^2] - \mu^2 \\ \therefore\quad&\boxed{Var(X)= E[X^2] - E[X]^2} \end{align} \]
Confusion
- Statistical Measures:
  - Variance & Standard Deviation (Single Dataset)
    - Variance \(\sigma^2\) measures how spread out numbers are from the mean: \(\text{Var}(X)=\frac{1}{N}\sum(x_i-\mu)^2\)
    - Standard deviation is just the square root of the variance, making it easier to interpret since it is in the same units as the data: \(\text{Std. Dev}(X)=\sqrt{\text{Var}(X)}\)
  - Covariance & Correlation (Two Datasets)
    - Covariance measures how two variables change together. Positive covariance means they increase or decrease together, while negative covariance means one increases while the other decreases: \(\text{Cov}(X,Y)=\frac{1}{N}\sum(x_i-\mu_X)(y_i-\mu_Y)\)
    - Correlation is the standardised covariance. Its range is always \(-1\leq\text{Corr}(X,Y)\leq1\): \(\text{Corr}(X,Y)=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}\) (see the R sketch after this list)
- Signal Operations:
- Convolution & Correlation (Signal Operations)
- Convolution combines (sort of superimposes) two signals to produce a third
- \((f*g)(t)=\sum_mf(m)g(t-m)\)
- Correlation measures the similarity between two signals as one shifts over the other
- \((f\circ g)(t)=\sum_m f(m)g(m+t)\)
- Autocorrelation, Cross-correlation and Autocovariance
- Autocorrelation is the correlation of a signal with itself at different time lags. It measures repeating patterns:
- \(R_{xx}(t)=\sum_mx(m)x(m+t)\)
- Cross-correlation is the correlation between two different signals. Measures how similar they are as one shifts over the other:
- \(R_{xy}(t)=\sum_mx(m)y(m+t)\)
- Autocovariance is similar to autocorrelation but it uses covariances (it centres values by subtracting their means):
- \(\text{ACV}(t)=\frac{1}{N}\sum_m(x(m)-\mu)(x(m+t)-\mu)\)
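The sketch referenced above contrasts the statistical measures with the signal operations (simulated data; note that R’s acf()/ccf() follow the statistics convention, mean-centring and normalising, unlike the raw signal-processing sums listed here):
set.seed(1)
x <- rnorm(100)
y <- 0.8 * x + rnorm(100, sd = 0.5)

cov(x, y)   # covariance: how x and y vary together
cor(x, y)   # correlation: standardised covariance, in [-1, 1]

f <- c(1, 2, 3)
g <- c(0, 1, 0.5)
convolve(f, rev(g), type = "open")     # discrete convolution (f * g)

ccf(x, y, lag.max = 5, plot = FALSE)   # cross-correlation of x and y at lags -5..5
acf(x,    lag.max = 5, plot = FALSE)   # autocorrelation of x with itself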
Confidence Intervals (Frequentist Analysis)
For traditional frequentist statistics, confidence intervals (CIs) are commonly used. You can compute and display them using functions like t.test(), prop.test(), or confint().
Student’s t-test / Proportion Tests / Confidence Intervals for Parametric Model
x <- c(5.1, 4.9, 5.5, 5.8, 5.2)  # Sample data
t.test(x)$conf.int               # 95% confidence interval
[1] 4.861005 5.738995
attr(,"conf.level")
[1] 0.95
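The other functions mentioned above work similarly; a brief sketch with illustrative inputs (42 successes out of 100 trials, and the built-in cars dataset):
prop.test(x = 42, n = 100)$conf.int    # CI for a proportion

fit <- lm(dist ~ speed, data = cars)   # simple linear model on built-in data
confint(fit, level = 0.95)             # CIs for the model coefficients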