By the end of this chapter, the student should be able to:
Extreme Value Theory is a branch of probability and statistics concerned with the behaviour of very large or very small observations. In financial risk measurement, the focus is usually on large losses. These large losses may arise from daily negative returns on a financial asset or portfolio, operational losses, catastrophic insurance claims, credit losses, or other high-impact financial events.
Let
\[ X_1, X_2, \ldots, X_n \]
be a sequence of identically distributed random variables with unknown distribution function
\[ F(x)=P(X_i\leq x). \]
In this chapter we work mainly with distribution functions rather than densities. This is important because Extreme Value Theory is concerned with probabilities in the tail, and the distribution function gives a direct way of describing such probabilities.
The variables \(X_i\) may represent losses, negative returns, claims, or other risk quantities. If we use the convention that losses are positive, then the largest observations in the sample represent the most severe losses. EVT asks how such largest observations behave, especially when the sample size becomes large.
The central idea is that the extreme part of a distribution may have a structure that can be modelled, even when the full underlying distribution is unknown. This is very useful in finance because the full distribution of returns or losses is rarely known exactly. Instead of trying to model the entire distribution, EVT concentrates on the tail.
A natural way to study extremes is to choose a high threshold and observe which data points exceed it. Suppose the threshold is denoted by \(u_n\). An observation \(X_i\) is called an exceedance if
\[ X_i>u_n. \]
The number of exceedances in a sample of size \(n\) is
\[ \#\{i:X_i>u_n,\;i=1,\ldots,n\} = \sum_{i=1}^{n} I(X_i>u_n), \]
where
\[ I(X_i>u_n)= \begin{cases} 1, & X_i>u_n,\\ 0, & X_i\leq u_n. \end{cases} \]
If the data are independent and identically distributed, then each observation has the same probability of exceeding the threshold. This probability is
\[ P(X_i>u_n). \]
Therefore, the number of exceedances follows a binomial distribution with parameters \(n\) and \(P(X_i>u_n)\).
For extremes, the threshold should not remain fixed as the sample size increases. If the sample size grows but the threshold remains too low, the selected observations may include too many ordinary observations from the centre of the distribution. EVT therefore considers the case where \(n\to\infty\) and the threshold \(u_n\) also increases in a suitable way.
A key condition is that for some \(\tau>0\),
\[ nP(X_i>u_n)\to \tau, \qquad n\to\infty. \]
This condition says that as the sample size grows and the threshold rises, the expected number of exceedances approaches a finite positive value. The threshold is rising, so exceedances become rarer, but the sample size is also growing, so a meaningful number of exceedances remains.
Under this condition, the number of exceedances converges in distribution to a Poisson distribution with parameter \(\tau\). This result helps explain why exceedances over high thresholds are often modelled using point process ideas.
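The Poisson limit can be checked numerically. The sketch below is only an illustration: the value `tau = 5`, the grid of sample sizes, and the names `n_values` and `max_abs_diff` are assumptions made for this example. It sets the exceedance probability to \(\tau/n\), so that \(nP(X_i>u_n)=\tau\) holds exactly, and compares the Binomial\((n,\tau/n)\) probabilities with the Poisson\((\tau)\) probabilities.

```r
# Assumed illustration values (not from the text): tau and the grid of n.
tau <- 5
n_values <- c(10, 100, 1000, 10000)
k <- 0:20  # exceedance counts at which the two pmfs are compared

# Largest absolute gap between Binomial(n, tau / n) and Poisson(tau) on k.
max_abs_diff <- sapply(n_values, function(n) {
  max(abs(dbinom(k, size = n, prob = tau / n) - dpois(k, lambda = tau)))
})

data.frame(n = n_values, max_abs_diff = max_abs_diff)
```

The gap should shrink as \(n\) grows, which is the binomial-to-Poisson convergence described above.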
set.seed(2426)
n <- 1000
x <- rt(n, df = 4)
threshold <- quantile(x, 0.95)
indicators <- ifelse(x > threshold, 1, 0)
number_exceedances <- sum(indicators)
list(
threshold = threshold,
number_exceedances = number_exceedances,
proportion_exceeding = mean(indicators)
)
## $threshold
## 95%
## 2.092996
##
## $number_exceedances
## [1] 50
##
## $proportion_exceeding
## [1] 0.05
The R output shows the threshold, the number of observations exceeding it, and the proportion of observations above it. Because the threshold is the empirical 95th percentile of the sample, 50 of the 1000 observations (5%) exceed it by construction.
When exceedances are observed in a sample, one may index the exceedance times. If the observations are \(X_1,\ldots,X_n\), the original observation indices are \(1,2,\ldots,n\). However, as \(n\) increases, the interval \([0,n]\) becomes larger and larger. A more convenient representation is obtained by rescaling time to the interval \([0,1]\).
An observation \(X_i\) exceeding \(u_n\) is then represented by its normalized time point
\[ \frac{i}{n}. \]
For an interval \((a,b]\subset[0,1]\), define
\[ N_n((a,b]) = \#\left\{\frac{i}{n}\in(a,b]:X_i>u_n,\;i=1,2,\ldots,n\right\}. \]
This counts the number of exceedances whose normalized times fall in \((a,b]\). The resulting object is called a time-normalized point process of exceedances.
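The counting measure \(N_n((a,b])\) can be computed directly from a sample. The following is a minimal sketch: the helper name `count_exceedances`, the seed, and the simulated Student's t data are assumptions made for illustration.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# counts exceedances of u whose normalized times i/n fall in (a, b].
count_exceedances <- function(x, u, a, b) {
  n <- length(x)
  time_normalized <- seq_len(n) / n
  sum(x > u & time_normalized > a & time_normalized <= b)
}

set.seed(1)                      # arbitrary seed for reproducibility
x <- rt(1000, df = 4)            # simulated data, as an assumption
u <- quantile(x, 0.95)           # empirical 95th percentile as threshold

# Counts over the two halves of [0, 1] add up to the total count.
first_half  <- count_exceedances(x, u, 0, 0.5)
second_half <- count_exceedances(x, u, 0.5, 1)
total       <- count_exceedances(x, u, 0, 1)
c(first_half = first_half, second_half = second_half, total = total)
```

Because the intervals \((0,0.5]\) and \((0.5,1]\) are disjoint and cover all normalized times, their counts sum to the total number of exceedances.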
The important idea is that as the threshold rises and the sample size increases, exceedances become sparse. Under suitable conditions, the point process of exceedances converges to a Poisson process. This supports the use of threshold exceedance models in EVT.
library(tidyverse)  # provides tibble(), ggplot(), map_dfr(), and %>% used below
exceedance_data <- tibble(
index = 1:n,
time_normalized = index / n,
value = x,
exceedance = x > threshold
)
ggplot(exceedance_data, aes(x = time_normalized, y = value)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = threshold, linetype = "dashed") +
labs(
title = "Threshold Exceedances on Normalized Time Scale",
x = "Normalized time i/n",
y = "Observation"
)
The dashed line represents the threshold. The points above the line are exceedances. In later chapters, these exceedances will become the foundation for the Peaks Over Threshold method.
Extreme Value Theory studies the distribution of extreme realizations of a random variable or stochastic process under suitable assumptions. The foundational results are associated with Fisher and Tippett and later Gnedenko, who established that, after suitable rescaling, the distribution of sample extremes can converge to one of only three possible limiting families.
This is one reason EVT is powerful. The underlying distribution \(F\) can take many different forms, yet only three types of non-degenerate limiting distribution can arise for normalized maxima. This is analogous to the role played by the Central Limit Theorem for sample averages.
The Central Limit Theorem tells us that, under suitable conditions, normalized sums or averages converge to the normal distribution. EVT gives a parallel result for maxima. Instead of asking about the average of a sample, EVT asks about the largest observation in the sample.
The three possible limiting families are the Fréchet, Weibull, and Gumbel families, which are defined later in this chapter.
This classification is important in risk management because financial losses often appear heavy-tailed. A heavy-tailed loss distribution can produce extreme losses more frequently than the normal distribution suggests.
EVT is particularly useful in finance because VaR and Expected Shortfall are tail-based quantities. They depend mainly on high quantiles and losses beyond high quantiles. The centre of the distribution is less important for these measures than the behaviour of the tail.
Let
\[ X_1,X_2,\ldots \]
be a sequence of independent and identically distributed non-degenerate random variables with common distribution function \(F\). Define the sample maximum by
\[ M_n=\max(X_1,X_2,\ldots,X_n), \qquad n\geq 2. \]
The random variable \(M_n\) records the largest observation in the sample. If the observations are losses, \(M_n\) is the largest loss observed in the sample.
The distribution function of \(M_n\) can be derived exactly. The event \(M_n\leq x\) means that the largest observation is less than or equal to \(x\). This can happen only if every observation in the sample is less than or equal to \(x\). Therefore,
\[ \begin{aligned} P(M_n\leq x) &=P(X_1\leq x,X_2\leq x,\ldots,X_n\leq x). \end{aligned} \]
If the observations are independent, this joint probability becomes the product of the individual probabilities:
\[ P(M_n\leq x)=P(X_1\leq x)P(X_2\leq x)\cdots P(X_n\leq x). \]
Since the observations are identically distributed,
\[ P(X_i\leq x)=F(x) \]
for each \(i\). Hence,
\[ P(M_n\leq x)=F(x)^n. \]
This formula is simple but very important. It shows that the distribution of the maximum depends on the underlying distribution \(F\), especially on the right tail of \(F\). Extremes occur near the upper end of the support of the distribution.
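The identity \(P(M_n\leq x)=F(x)^n\) can be checked by simulation. The sketch below assumes standard normal observations, so that \(F\) is `pnorm`. The sample size, number of replications, evaluation point `x0`, and seed are arbitrary choices made for illustration.

```r
set.seed(1)            # arbitrary seed
n <- 50                # sample size per replication (assumed)
B <- 20000             # number of simulated samples (assumed)
x0 <- 2.5              # evaluation point (assumed)

# Simulate B sample maxima of n standard normal observations.
maxima <- replicate(B, max(rnorm(n)))

empirical   <- mean(maxima <= x0)   # Monte Carlo estimate of P(M_n <= x0)
theoretical <- pnorm(x0)^n          # F(x0)^n with F the N(0, 1) cdf

c(empirical = empirical, theoretical = theoretical)
```

The two numbers should agree up to Monte Carlo error.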
If the right endpoint of \(F\) is denoted by
\[ x_F=\sup\{x\in\mathbb{R}:F(x)<1\}, \]
then for values \(x<x_F\), one has \(F(x)<1\), and therefore
\[ F(x)^n\to 0, \qquad n\to\infty. \]
If \(x_F<\infty\) and \(x\geq x_F\), then \(F(x)=1\), so
\[ F(x)^n=1. \]
This means that, without rescaling, \(M_n\) tends to the upper endpoint of the distribution. If the endpoint is infinite, the maximum tends to drift upward without settling into a useful non-degenerate distribution. This is why EVT studies centred and normalized maxima.
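A quick numerical illustration of this degeneracy, again assuming a standard normal \(F\) and an arbitrary fixed point `x0 = 3`: the probability that the maximum stays below a fixed level shrinks towards zero as \(n\) grows.

```r
x0 <- 3                               # fixed evaluation point (assumed)
n_values <- c(10, 100, 1000, 10000)   # growing sample sizes (assumed)

# P(M_n <= x0) = F(x0)^n for the standard normal cdf F.
prob_max_below <- pnorm(x0)^n_values

data.frame(n = n_values, prob_max_below = prob_max_below)
```

Even though a value above 3 is rare for a single normal observation, it becomes almost certain to appear somewhere in a sufficiently large sample.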
set.seed(2426)
sample_sizes <- c(10, 50, 100, 500)
B <- 5000
maxima_data <- map_dfr(sample_sizes, function(size) {
maxima <- replicate(B, max(rnorm(size)))
tibble(
sample_size = factor(size),
maximum = maxima
)
})
ggplot(maxima_data, aes(x = maximum)) +
geom_histogram(bins = 60) +
facet_wrap(~ sample_size, scales = "free_y") +
labs(
title = "Distribution of Sample Maxima from Normal Samples",
x = "Sample maximum",
y = "Frequency"
)
The simulation shows that as the sample size increases, the sample maximum tends to move to the right. This is expected because a larger sample gives more opportunities to observe a large value.
The Fisher-Tippett theorem, also called the extremal types theorem, is a fundamental result in EVT. It gives the possible limiting distributions for centred and normalized maxima.
Suppose there exist constants \(c_n>0\) and \(d_n\in\mathbb{R}\), and a non-degenerate distribution function \(H\), such that
\[ \frac{M_n-d_n}{c_n}\xrightarrow{d}H. \]
Then \(H\) must belong to one of only three possible types of extreme value distributions: Fréchet, Weibull, or Gumbel.
The Fréchet distribution is given by
\[ \Phi_\alpha(x)= \begin{cases} 0, & x\leq 0,\\ \exp\{-x^{-\alpha}\}, & x>0, \end{cases} \qquad \alpha>0. \]
The Weibull distribution is given by
\[ \Psi_\alpha(x)= \begin{cases} \exp\{-(-x)^\alpha\}, & x\leq 0,\\ 1, & x>0, \end{cases} \qquad \alpha>0. \]
The Gumbel distribution is given by
\[ \Lambda(x)=\exp\{-e^{-x}\}, \qquad x\in\mathbb{R}. \]
The parameter \(\alpha\) is called the tail index. In the Fréchet case it measures the heaviness of the tail: the smaller the value of \(\alpha\), the heavier the tail.
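The three distribution functions are straightforward to code. The helper names below (`cdf_frechet`, `cdf_weibull_ev`, `cdf_gumbel`) are hypothetical and are defined here only for illustration. Note that base R's `pweibull()` is the ordinary Weibull distribution on the positive half-line, not the extreme value Weibull \(\Psi_\alpha\) above.

```r
# Hypothetical helpers implementing the three standard extreme value cdfs.
cdf_frechet <- function(x, alpha) {
  ifelse(x > 0, exp(-pmax(x, 0)^(-alpha)), 0)
}
cdf_weibull_ev <- function(x, alpha) {
  ifelse(x <= 0, exp(-(-pmin(x, 0))^alpha), 1)
}
cdf_gumbel <- function(x) {
  exp(-exp(-x))
}

# Upper tail probabilities 1 - H(x) at a few points, with alpha = 2 assumed.
x_grid <- c(1, 2, 5)
data.frame(
  x       = x_grid,
  frechet = 1 - cdf_frechet(x_grid, alpha = 2),
  gumbel  = 1 - cdf_gumbel(x_grid),
  weibull = 1 - cdf_weibull_ev(x_grid, alpha = 2)
)
```

The Weibull tail probability is zero for every positive \(x\) because that distribution has a finite upper endpoint at zero.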
The theorem is powerful because it does not require the full distribution \(F\) to be known. It says that if a non-degenerate limiting distribution for normalized maxima exists, then it must be one of these three types.
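As a worked illustration (a standard example, not taken from this chapter), suppose \(F\) is the standard exponential distribution, \(F(x)=1-e^{-x}\) for \(x\geq 0\), and take the constants \(c_n=1\) and \(d_n=\log n\). Then, for every fixed \(x\) and all \(n\) large enough that \(x+\log n\geq 0\),
\[ P\left(\frac{M_n-d_n}{c_n}\leq x\right) = F(x+\log n)^n = \left(1-\frac{e^{-x}}{n}\right)^n \to \exp\{-e^{-x}\}=\Lambda(x), \qquad n\to\infty. \]
In this example the normalized maxima converge to the Gumbel distribution.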
If normalized maxima from a distribution \(F\) converge to an extreme value distribution \(H\), then \(F\) is said to belong to the maximum domain of attraction of \(H\). This is written as
\[ F\in MDA(H). \]
The maximum domain of attraction of a distribution \(H\) is the set of all distributions whose normalized maxima converge to \(H\).
The three major cases are as follows.
The Gumbel domain of attraction contains distributions with relatively thin tails and often infinite upper endpoints. Examples include the normal, lognormal, exponential, and gamma distributions. These distributions may allow very large observations, but the tail decays quickly.
The Fréchet domain of attraction contains heavy-tailed distributions. Examples include Pareto, Cauchy, Student’s t, and stable Paretian distributions. These distributions are important in finance because they can assign much higher probability to extreme losses.
The Weibull domain of attraction contains distributions with finite upper endpoints. In this case, the distribution has a maximum possible value. An example is a beta distribution on a bounded interval.
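The Fréchet case can be illustrated by simulation. The sketch below assumes a Pareto distribution with \(F(x)=1-x^{-\alpha}\) for \(x\geq 1\), for which the constants \(c_n=n^{1/\alpha}\) and \(d_n=0\) give \(F(c_nx)^n=(1-x^{-\alpha}/n)^n\to\Phi_\alpha(x)\). The values of `alpha`, `n`, `B`, and the seed are arbitrary choices made for illustration.

```r
set.seed(1)        # arbitrary seed
alpha <- 2         # Pareto tail index (assumed)
n <- 500           # sample size per replication (assumed)
B <- 5000          # number of replications (assumed)

# If U ~ Uniform(0, 1), then U^(-1/alpha) has cdf F(x) = 1 - x^(-alpha), x >= 1.
# Normalizing constants for this F: c_n = n^(1/alpha), d_n = 0.
normalized_maxima <- replicate(B, max(runif(n)^(-1 / alpha)) / n^(1 / alpha))

x_grid <- c(0.5, 1, 2, 4)
empirical <- sapply(x_grid, function(x) mean(normalized_maxima <= x))
frechet   <- exp(-x_grid^(-alpha))   # Frechet cdf Phi_alpha on the grid

data.frame(x = x_grid, empirical = empirical, frechet = frechet)
```

The empirical distribution of the normalized maxima should lie close to the Fréchet distribution function at each grid point.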
EVT has two major practical approaches.
The first approach is the Block Maxima method. The data are divided into non-overlapping blocks, and the maximum observation from each block is selected. For example, if daily losses are available, one may divide them into months or years and select the largest loss in each month or year. These block maxima are then modelled using an extreme value distribution.
The second approach is the Peaks Over Threshold method. A high threshold is selected, and all observations exceeding that threshold are retained. If the threshold is \(u\), the exceedances are observations satisfying \(X_i>u\). The excesses are the amounts by which those observations exceed the threshold:
\[ Y_i=X_i-u, \qquad X_i>u. \]
The POT approach is often preferred in practical applications because it uses data more efficiently. The block maxima method may discard many large observations if they are not the largest within their blocks. POT keeps all observations above a sufficiently high threshold.
Within the POT class, two styles of analysis are common. The first is semi-parametric and uses estimators such as the Hill estimator. The second is fully parametric and is based on the Generalized Pareto Distribution. These will be studied in later chapters.
set.seed(2426)
n <- 1500
losses <- -rt(n, df = 4) / 100
block_size <- 50
loss_data <- tibble(
time = 1:n,
loss = losses,
block = ceiling(time / block_size)
)
block_maxima <- loss_data %>%
group_by(block) %>%
summarise(maximum_loss = max(loss), .groups = "drop")
threshold <- quantile(loss_data$loss, 0.95)
exceedances <- loss_data %>%
filter(loss > threshold) %>%
mutate(excess = loss - threshold)
list(
number_of_blocks = n_distinct(loss_data$block),
number_of_block_maxima = nrow(block_maxima),
threshold = as.numeric(threshold),
number_of_threshold_exceedances = nrow(exceedances)
)
## $number_of_blocks
## [1] 30
##
## $number_of_block_maxima
## [1] 30
##
## $threshold
## [1] 0.02096107
##
## $number_of_threshold_exceedances
## [1] 75
ggplot(loss_data, aes(x = time, y = loss)) +
geom_line() +
geom_hline(yintercept = threshold, linetype = "dashed") +
labs(
title = "Losses with a High Threshold for POT Analysis",
x = "Time",
y = "Loss"
)
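The block maxima approach produces one maximum per block, here 30 maxima from 30 blocks of 50 observations. The POT approach retains every loss above the threshold, here 75 exceedances of the empirical 95th percentile. In many practical risk problems, the POT method therefore gives a larger set of extreme observations for estimation.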
This application shows, in one place, how to move from raw simulated losses to exceedance counts, block maxima, and the empirical behaviour of maxima.
set.seed(2026)
n <- 2000
losses <- -rt(n, df = 3) / 100
u <- quantile(losses, 0.975)
excesses <- losses[losses > u] - u
block_size <- 100
blocks <- ceiling(seq_along(losses) / block_size)
block_max <- tibble(loss = losses, block = blocks) %>%
group_by(block) %>%
summarise(max_loss = max(loss), .groups = "drop")
summary_table <- tibble(
Total_Observations = n,
Threshold = as.numeric(u),
Number_Exceedances = length(excesses),
Proportion_Exceeding = length(excesses) / n,
Number_Blocks = nrow(block_max),
Mean_Block_Maximum = mean(block_max$max_loss),
Maximum_Observed_Loss = max(losses)
)
summary_table
tibble(Maximum_Loss = block_max$max_loss) %>%
ggplot(aes(x = Maximum_Loss)) +
geom_histogram(bins = 30) +
labs(
title = "Empirical Distribution of Block Maxima",
x = "Block maximum loss",
y = "Frequency"
)
tibble(Excess = excesses) %>%
ggplot(aes(x = Excess)) +
geom_histogram(bins = 30) +
labs(
title = "Excesses Above a High Threshold",
x = "Excess over threshold",
y = "Frequency"
)
The first histogram summarizes block maxima. The second summarizes excesses over a high threshold. These represent the two practical routes through which EVT enters financial risk measurement.
Students often confuse the maximum observation \(M_n\) with the original sample \(X_1,\ldots,X_n\). The maximum is a new random variable formed from the sample.
Another common mistake is to forget the independence assumption when deriving
\[ P(M_n\leq x)=F(x)^n. \]
This formula uses both independence and identical distribution. Without independence, the joint probability does not generally factor into a product.
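The effect of dependence can be seen in a small simulation. The sketch below is an illustration under stated assumptions: it uses an equicorrelated Gaussian model in which every \(X_i\) is standard normal but all observations share a common factor. The correlation `rho`, the evaluation point `x0`, and the other settings are arbitrary choices.

```r
set.seed(1)      # arbitrary seed
n <- 20          # sample size per replication (assumed)
B <- 20000       # number of replications (assumed)
rho <- 0.8       # common pairwise correlation (assumed)
x0 <- 1.5        # evaluation point (assumed)

# Each X_i = sqrt(rho) * Z0 + sqrt(1 - rho) * Z_i is N(0, 1), but the
# shared factor Z0 makes the X_i dependent.
maxima_dependent <- replicate(B, {
  z0 <- rnorm(1)
  max(sqrt(rho) * z0 + sqrt(1 - rho) * rnorm(n))
})

c(
  empirical_dependent = mean(maxima_dependent <= x0),
  product_formula     = pnorm(x0)^n
)
```

With positively dependent observations the maximum tends to be smaller than under independence, so the product formula understates \(P(M_n\leq x)\) in this example.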
Students also sometimes interpret EVT as a method that predicts the exact worst possible loss. That is not correct. EVT provides probabilistic models for tail behaviour. It helps estimate rare-event probabilities and high quantiles, but it does not eliminate uncertainty.
A further mistake is to think that the block maxima method and POT method use the same observations. They do not. Block maxima selects one maximum from each block. POT selects all observations above a high threshold.
Finally, students often mix up the three domains of attraction. A helpful memory aid is: Gumbel is associated with relatively thin tails, Fréchet with heavy tails, and Weibull with finite upper endpoints.
1 / runif(1000). Plot a histogram and comment on its tail behaviour.
Let \(X_1,X_2,\ldots,X_n\) be independent and identically distributed random variables with common distribution function \(F\), and let
\[ M_n=\max(X_1,X_2,\ldots,X_n). \]
Required: