Chebyshev’s inequality states that for any distribution with finite mean \(\mu\) and finite standard deviation \(\sigma\), the proportion of values that lie within the interval \([\mu - k\sigma, \mu + k\sigma]\) is at least:
\[ 1 - \frac{1}{k^2} \quad \text{for } k > 1. \]
This result applies regardless of the shape of the distribution. However, for specific distributions (such as the normal distribution), more precise results exist, such as the empirical rule (68-95-99.7).
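As a quick illustration, the bound can be set beside the exact normal coverage that underlies the empirical rule. The following is a minimal sketch; the helper name `chebyshev_bound` is an assumption introduced here and is not part of the original analysis.

```r
# Hypothetical helper (name assumed for illustration): Chebyshev's lower bound
chebyshev_bound <- function(k) {
  stopifnot(all(k > 0))
  pmax(0, 1 - 1 / k^2)  # the bound is trivial (0) for k <= 1
}

k <- c(1, 2, 3)
# Exact normal coverage of [mu - k*sigma, mu + k*sigma];
# it does not depend on the particular values of mu and sigma
normal_coverage <- pnorm(k) - pnorm(-k)

comparison <- data.frame(k = k,
                         chebyshev = chebyshev_bound(k),
                         normal = round(normal_coverage, 4))
print(comparison)
```

For \(k = 2\), the bound guarantees only 75%, whereas a normal distribution covers about 95.45%.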
In this document, we explore Chebyshev’s inequality through simulations for two distributions: the normal and the exponential.
We generate a sample of 10,000 data points from a normal distribution with a mean \(\mu = 50\) and a standard deviation \(\sigma = 10\). For different values of \(k\), we calculate the proportion of data points that lie within the interval \([\mu - k\sigma, \mu + k\sigma]\) and compare it with Chebyshev’s theoretical bound.
set.seed(123)
n <- 10000
mu <- 50
sigma <- 10
x_norm <- rnorm(n, mean = mu, sd = sigma)
# Values of k to evaluate
k_vals <- seq(1, 3, by = 0.5)
# Proportion of data satisfying |x - mu| < k * sigma
prop_norm <- sapply(k_vals, function(k) {
  mean(abs(x_norm - mu) < k * sigma)
})
# Chebyshev's theoretical lower bound
cheb_bound <- 1 - 1 / (k_vals^2)
data_norm <- data.frame(k = k_vals, Proporcion = prop_norm, Chebyshev = cheb_bound)
data_norm
##     k Proporcion Chebyshev
## 1 1.0     0.6812 0.0000000
## 2 1.5     0.8678 0.5555556
## 3 2.0     0.9547 0.7500000
## 4 2.5     0.9881 0.8400000
## 5 3.0     0.9972 0.8888889
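One way to judge whether the simulated proportions are plausible is to compare them with the exact normal coverage \(2\Phi(k) - 1\) and with the binomial standard error of a proportion estimated from \(n\) independent draws. The sketch below hard-codes the proportions from the table above; the variable names are assumptions made for illustration.

```r
n <- 10000
k_vals <- seq(1, 3, by = 0.5)
# Simulated proportions copied from the table above
prop_sim <- c(0.6812, 0.8678, 0.9547, 0.9881, 0.9972)

# Exact coverage of [mu - k*sigma, mu + k*sigma] under a normal distribution
prop_exact <- 2 * pnorm(k_vals) - 1
# Standard error of a sample proportion based on n independent draws
se <- sqrt(prop_exact * (1 - prop_exact) / n)
# Standardized difference between simulation and theory
z <- (prop_sim - prop_exact) / se

check_norm <- data.frame(k = k_vals, simulated = prop_sim,
                         exact = round(prop_exact, 4), z = round(z, 2))
print(check_norm)
```

Every simulated proportion lies within one standard error of its exact value, so the gap to Chebyshev’s bound reflects the conservativeness of the bound and not sampling noise.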
Below, the comparison between the empirical proportion and the theoretical bound is plotted:
library(ggplot2)
ggplot(data_norm, aes(x = k)) +
  geom_line(aes(y = Proporcion, color = "empirical"), linewidth = 1.2) +
  geom_line(aes(y = Chebyshev, color = "theoretical (Chebyshev)"), linewidth = 1.2, linetype = "dashed") +
  labs(title = "Normal distribution: empirical vs. theoretical Chebyshev bound",
       x = "k value",
       y = "Proportion of data") +
  scale_color_manual("", values = c("empirical" = "blue", "theoretical (Chebyshev)" = "red"))
We generate a sample of 10,000 data points from an exponential distribution with mean \(\mu = 50\). Note that for the exponential distribution, the variance is given by \(\sigma^2 = \mu^2\), so \(\sigma = \mu\).
n <- 10000
mu_exp <- 50
lambda <- 1 / mu_exp   # rate parameter (rate = 1 / mean)
x_exp <- rexp(n, rate = lambda)
sigma_exp <- mu_exp    # for the exponential distribution, sigma = mu
k_vals_exp <- seq(1, 3, by = 0.5)
# Proportion of data satisfying |x - mu| < k * sigma
prop_exp <- sapply(k_vals_exp, function(k) {
  mean(abs(x_exp - mu_exp) < k * sigma_exp)
})
# Chebyshev's theoretical lower bound
cheb_bound_exp <- 1 - 1 / (k_vals_exp^2)
data_exp <- data.frame(k = k_vals_exp, Proporcion = prop_exp, Chebyshev = cheb_bound_exp)
data_exp
##     k Proporcion Chebyshev
## 1 1.0     0.8639 0.0000000
## 2 1.5     0.9159 0.5555556
## 3 2.0     0.9475 0.7500000
## 4 2.5     0.9698 0.8400000
## 5 3.0     0.9826 0.8888889
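For the exponential distribution the coverage has a closed form. With \(\sigma = \mu\) and \(k \ge 1\), the lower limit \(\mu - k\sigma\) is not positive, so the proportion within the interval is \(P(X < \mu(1 + k)) = 1 - e^{-(1 + k)}\). The following is a minimal sketch that checks this against the CDF with `pexp`; the names `coverage_cdf`, `coverage_closed`, and `exact_exp` are illustrative.

```r
mu_exp <- 50
sigma_exp <- mu_exp          # for the exponential distribution, sigma = mu
rate <- 1 / mu_exp
k_vals_exp <- seq(1, 3, by = 0.5)

# Exact coverage from the CDF; pexp() returns 0 for non-positive arguments
coverage_cdf <- pexp(mu_exp + k_vals_exp * sigma_exp, rate = rate) -
  pexp(mu_exp - k_vals_exp * sigma_exp, rate = rate)
# Closed form, valid for k >= 1
coverage_closed <- 1 - exp(-(1 + k_vals_exp))

exact_exp <- data.frame(k = k_vals_exp,
                        exact = round(coverage_cdf, 4),
                        chebyshev = 1 - 1 / k_vals_exp^2)
print(exact_exp)
```

The exact values (about 0.8647 at \(k = 1\) and 0.9817 at \(k = 3\)) are close to the simulated proportions in the table above and stay above Chebyshev’s bound for every \(k\).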
Below, the comparison for the exponential distribution is plotted:
ggplot(data_exp, aes(x = k)) +
  geom_line(aes(y = Proporcion, color = "empirical"), linewidth = 1.2) +
  geom_line(aes(y = Chebyshev, color = "theoretical (Chebyshev)"), linewidth = 1.2, linetype = "dashed") +
  labs(title = "Exponential distribution: empirical vs. theoretical Chebyshev bound",
       x = "k value",
       y = "Proportion of data") +
  scale_color_manual("", values = c("empirical" = "green", "theoretical (Chebyshev)" = "red"))
## How does it behave in each distribution?
library(ggplot2)
library(gridExtra)
# Plot for the normal distribution
p1 <- ggplot(data_norm, aes(x = k)) +
  geom_line(aes(y = Proporcion, color = "empirical"), linewidth = 1.2) +
  geom_line(aes(y = Chebyshev, color = "theoretical (Chebyshev)"), linewidth = 1.2, linetype = "dashed") +
  labs(title = "Normal distribution", x = "k value", y = "Proportion of data") +
  scale_color_manual("", values = c("empirical" = "blue", "theoretical (Chebyshev)" = "red")) +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 11),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
    legend.position = "bottom",
    legend.text = element_text(size = 11),
    legend.title = element_text(size = 11)
  )
# Plot for the exponential distribution
p2 <- ggplot(data_exp, aes(x = k)) +
  geom_line(aes(y = Proporcion, color = "empirical"), linewidth = 1.2) +
  geom_line(aes(y = Chebyshev, color = "theoretical (Chebyshev)"), linewidth = 1.2, linetype = "dashed") +
  labs(title = "Exponential distribution", x = "k value", y = "Proportion of data") +
  scale_color_manual("", values = c("empirical" = "green", "theoretical (Chebyshev)" = "red")) +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 11),
    axis.text = element_text(size = 11),
    plot.title = element_text(size = 12, face = "bold", hjust = 0.5),
    legend.position = "bottom",
    legend.text = element_text(size = 11),
    legend.title = element_text(size = 11)
  )
# Arrange both plots side by side
grid.arrange(p1, p2, ncol = 2)
The simulation results highlight both the usefulness and the limitations of Chebyshev’s inequality. For the normal distribution, the empirical rule (68-95-99.7) indicates that approximately 95% of the data lie within 2 standard deviations of the mean. Our simulation agrees: the empirical proportion at \(k = 2\) is 0.9547, well above Chebyshev’s lower bound of \(1 - \frac{1}{4} = 0.75\). When the shape of the distribution is known, Chebyshev’s inequality is therefore very conservative: it guarantees only a lower bound, which is far less informative than distribution-specific results.
On the other hand, the exponential distribution is asymmetric with a long right tail. Because \(\sigma = \mu\), the lower limit \(\mu - k\sigma\) is at or below zero for every \(k \ge 1\), so the interval contains the entire left side of the distribution and only the right tail can fall outside it. As a result, the simulated proportion is higher than in the normal case for small \(k\) (0.8639 versus 0.6812 at \(k = 1\)) but lower for \(k \ge 2\) (0.9826 versus 0.9972 at \(k = 3\)), where the heavier tail dominates. In every case the proportion stays above Chebyshev’s bound. This underscores the importance of considering the shape of the distribution when analyzing dispersion: methods tailored to the distribution give more precise estimates than the general bound offered by Chebyshev’s inequality.
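That the bound is loose for both simulated distributions does not mean it can be improved in general. A standard textbook example, which is not part of the simulations above, is a three-point distribution that attains the bound exactly. The sketch below uses illustrative names and computes the probabilities directly, without simulation.

```r
k <- 2
# Three-point distribution on the values -k, 0, k
values <- c(-k, 0, k)
probs <- c(1 / (2 * k^2), 1 - 1 / k^2, 1 / (2 * k^2))

mu_3pt <- sum(values * probs)                        # mean: 0
sigma_3pt <- sqrt(sum((values - mu_3pt)^2 * probs))  # standard deviation: 1
# Probability strictly within k standard deviations of the mean
inside <- sum(probs[abs(values - mu_3pt) < k * sigma_3pt])

c(mean = mu_3pt, sd = sigma_3pt, inside = inside, bound = 1 - 1 / k^2)
```

Here the proportion strictly inside the interval equals \(1 - \frac{1}{k^2}\), so no distribution-free bound of this form can be tighter.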
In summary, Chebyshev’s inequality offers a robust, universal lower bound on the proportion of data near the mean. When additional information about the distribution is available, as in the case of the normal distribution, distribution-specific methods (e.g., the empirical rule) give more accurate and relevant estimates.