Project History and Acknowledgments

Developed as part of graduate coursework in Mathematical Statistics and Computational Statistics at Ohio University (2014), this project bridges rigorous statistical theory with modern computational techniques. It systematically explores statistical modeling, numerical algorithm implementation, and dataset analysis—from foundational concepts to advanced methods like Bayesian inference and Monte Carlo methods.

This project’s original R codebase has been comprehensively upgraded through AI-assisted optimization, enhancing code clarity, computational efficiency, and numerical accuracy while preserving the original statistical functionality. The material follows a carefully designed pedagogical progression, beginning with fundamental descriptive statistics - including measures of central tendency and correlation analysis - then advancing through hypothesis testing frameworks to arrive at sophisticated computational methods. These include Markov Chain Monte Carlo techniques, maximum likelihood estimation, and expectation-maximization algorithms, with practical implementations ranging from Gibbs samplers to Gaussian mixture modeling. Throughout this progression, we maintain a dual focus on mathematical foundations (such as distribution theory and statistical power analysis) and their practical computational realization, ensuring learners develop both theoretical understanding and implementation skills.

1 Statistical Computing

Statistical computing refers to the intersection of computer science, numerical analysis, and statistics. It encompasses tasks that rely on statistical methods requiring significant computational power. Many statistical modeling and data analysis techniques are complex to implement in practice, often necessitating specialized software to process large datasets and generate actionable insights.

Key Distinctions

  • Computational Statistics: Focuses on statistical methods that demand intensive computation (e.g., bootstrapping, Monte Carlo simulations).

  • Statistical Computing: Deals with developing and applying computational/numerical techniques to solve statistical problems (e.g., optimization algorithms, matrix computations).

1.1 Practical Applications

Statistical computing is essential for:

  • Large-Scale Data Analysis
    • High-dimensional regression (e.g., genomic data with 1000s of predictors)
    • Big data techniques (distributed computing, sparse matrices)
    • Streaming data processing
  • Advanced Algorithm Implementation
    • Iterative methods:
    • EM Algorithm (Expectation-Maximization)
    • MCMC (Markov Chain Monte Carlo)
  • Reproducible Research
    • Automated report generation (R Markdown)
    • Version-controlled analysis scripts
  • Specialized Statistical Operations
    • Bootstrapping and resampling
    • Bayesian hierarchical modeling
    • Survival analysis with time-dependent covariates

Technical Implementation

This project leverages a robust scientific computing stack:

Core Technologies

Technology | Primary Use | Key Benefits
---------- | ----------- | ------------
\(R\) | Statistical analysis and visualization | Comprehensive packages (tidyverse, lme4); reproducible research capabilities
\(\mathrm{\LaTeX}\) | Typesetting technical documents | Publication-quality math rendering; automated numbering/cross-referencing
Markdown | Lightweight documentation | Human-readable syntax; multi-output rendering (HTML/PDF/Word)

\(R\) is a programming language and environment specifically designed for statistical computing and graphics. It provides a wide array of statistical and graphical techniques, making it a preferred choice for researchers and data analysts.

Key Features of \(R\):

  • Data Handling: Efficient storage and manipulation of data.
  • Matrix Operations: Built-in array operators for linear algebra.
  • Statistical Analysis: Extensive libraries for modeling and hypothesis testing.
  • Data Visualization: High-quality plots with support for mathematical notation.
  • Extensibility: A full-fledged programming language supporting user-defined functions, loops, and conditionals.

\(R\) also includes an integrated \(\mathrm{\LaTeX}\)-like documentation system for creating detailed technical manuals. \(\mathrm{\LaTeX}\) itself, used to typeset this project’s reports, remains the de facto standard in scientific publishing thanks to its precision in formatting mathematical expressions.


Markdown is a simple markup language for:

  • Quick note-taking and documentation.
  • Web content (e.g., GitHub READMEs).
  • Generating PDFs, HTML, or Word documents when combined with tools such as R Markdown and pandoc.

1.2 Analytical Progression

This project implements a graduated analytical approach, beginning with fundamental statistical computations and systematically advancing to sophisticated modeling techniques:

  • Foundation Layer:
    • Descriptive statistics (mean, median, robust estimators)
    • Exploratory data analysis (EDA) visualizations
    • Basic hypothesis testing
  • Intermediate Layer:
    • Bootstrapping and resampling methods
    • Generalized linear models (GLMs)
  • Advanced Layer:
    • Monte Carlo methods
    • EM Algorithm

“Simple computations establish verifiable baselines before advancing to computationally intensive methods - ensuring each analytical stage builds upon statistically validated foundations.”
- Principle of Graduated Complexity

2 Descriptive Statistics

Descriptive statistics provide summary statistics to concisely describe key features of a dataset. These summaries aim to communicate the maximum amount of information as simply as possible.

Key Components

Statisticians typically characterize data using:

  • Measures of Central Tendency
    • Arithmetic mean
    • Median
  • Measures of Statistical Dispersion
    • Standard deviation
    • Median absolute deviation (MAD)
    • Interquartile range (IQR)
  • Measures of Distribution Shape
    • Skewness (symmetry)
    • Kurtosis (tailedness)
  • Measures of Dependence (for multivariate data)
    • Correlation coefficients (Pearson, Spearman)

2.1 Measuring Central Tendency (Location)

A measure of central tendency is a single value that represents the center of a dataset by identifying its typical or central position. These measures (also called measures of central location) help summarize an entire dataset with one representative value. The two measures covered here are:

  1. Mean (Average)
  2. Median (Middle value)

The choice of measure depends on the data type and distribution characteristics.

2.1.1 Mean

The mean (arithmetic average) is the most common measure of central tendency.

Key Properties:

  • Applicable to both discrete and continuous data (though most commonly used with continuous data)
  • Calculated as:
    \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
  • Sensitive to outliers (extreme values pull the mean toward them)

R Implementation:

#' Compute the Arithmetic Mean
#' 
#' Calculates the mean of a numeric vector, matrix, or list using matrix multiplication.
#' This is an educational implementation demonstrating manual mean calculation.
#'
#' @param X A numeric vector, matrix, or list of numbers
#' @return The arithmetic mean as a numeric scalar (for matrix input, a vector of column means)
#' @examples
#' mu(c(1, 2, 3))  # Returns 2
#' mu(matrix(1:4, nrow=2))  # Returns the column means c(1.5, 3.5)
mu <- function(X) {
  # Convert input to matrix format
  X <- as.matrix(X)
  
  # Get number of observations
  n <- nrow(X)
  
  # Create weight matrix (1/n for each observation)
  N <- matrix(1/n, nrow = 1, ncol = n)
  
  # Calculate weighted sum and return as scalar
  as.numeric(N %*% X)  # Matrix multiplication
}

# Save function to file
dump("mu", file = "mu.R")

BASIC USAGE EXAMPLES

set.seed(123)  # For reproducibility
X <- sample(10:99, 20)
X  # Display the data
##  [1] 40 88 60 23 76 51 59 52 96 34 78 66 18 81 35 16 94 87 45 92
# Custom matrix-based implementation
mu_result <- mu(X)

# Base R function for verification
base_mean <- mean(X)

cat("Custom mu() result:", round(mu_result, 2), "\n")
## Custom mu() result: 59.55
cat("Base R mean() result:", round(base_mean, 2), "\n")
## Base R mean() result: 59.55

2.1.2 Median

The median is the value that separates the higher half from the lower half of a dataset. For an ordered dataset, it represents:

  • The middle value (for odd number of observations)
  • The average of the two middle values (for even number of observations)

Key Properties:

  • Robustness: Unlike the mean, the median is not affected by extreme outliers (see the quick check below)
  • Typical Value: Often better represents “typical” values in skewed distributions
  • Ordinal Data: Can be used with ordinal data where the mean would be inappropriate
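To make the robustness contrast concrete, here is a quick base-R check (the numbers are purely illustrative):

# A single extreme value drags the mean but leaves the median almost unchanged
x <- c(23, 25, 26, 28, 30)
x_out <- c(x, 480)            # add one extreme observation

mean(x); median(x)            # 26.4 and 26
mean(x_out); median(x_out)    # 102 and 27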

R Implementation:

#' Robust Median Calculation
#' 
#' Computes the median value of numeric data with comprehensive NA handling and
#' support for matrix/list inputs. For even-length inputs, returns the mean of 
#' the two central values (consistent with base R).
#'
#' @param X Numeric vector, matrix, or list
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param ... Additional arguments passed to methods
#' @return Numeric scalar representing the median, or NA if:
#'         - Input contains NAs with na.rm=FALSE
#'         - Input is empty
#'         - Input contains only NAs when na.rm=TRUE
#' @examples
#' # Basic usage
#' MEDIAN(c(1, 3, 5))          # Returns 3
#' MEDIAN(c(1, 2, 3, 4))       # Returns 2.5
#' 
#' # NA handling
#' MEDIAN(c(1, 2, NA, 4))              # Returns 2 (NAs removed by default)
#' MEDIAN(c(1, 2, NA, 4), na.rm=FALSE) # Returns NA
#' 
#' # Matrix support
#' MEDIAN(matrix(1:9, nrow=3)) # Returns 5
#' @export
#' @seealso \code{\link[stats]{median}} for base R implementation
MEDIAN <- function(X, na.rm = TRUE, ...) {
  # Input validation
  if (!is.numeric(X)) stop("Input must be numeric")
  if (length(X) == 0) return(NA_real_)
  
  # Handle NA values
  if (anyNA(X)) {
    if (!na.rm) return(NA_real_)
    X <- X[!is.na(X)]
    if (length(X) == 0) return(NA_real_)
  }
  
  # Convert and sort input
  X <- as.matrix(sort(X))
  n <- nrow(X)
  i <- n %/% 2  # Middle index
  
  # Calculate median
  if (n %% 2 == 0) {  # Even case
    q2 <- mean(X[i:(i + 1), , drop = TRUE])
  } else {            # Odd case
    q2 <- X[i + 1, , drop = TRUE]
  }
  
  # Ensure numeric output
  return(as.numeric(q2))
}

# Save function to file
dump("MEDIAN", file = "MEDIAN.R")

BASIC USAGE EXAMPLES

# MEDIAN Function Demonstration
# Load required functions
source("MEDIAN.R")

set.seed(123)
test_data <- list(
  "Even-length" = sort(sample(1:9, 10, replace = TRUE)),
  "Odd-length" = sort(sample(1:9, 9, replace = TRUE))
)

results <- lapply(test_data, function(x) {
  data.frame(
    `Data Length` = length(x),
    `Input Data` = paste(x, collapse = ", "),
    `Custom MEDIAN()` = MEDIAN(x),
    `Base R median()` = median(x),
    check.names = FALSE
  )
})

knitr::kable(
  do.call(rbind, results),
  caption = "MEDIAN Function Comparison",
  align = c("l", "l", "r", "r")
)
MEDIAN Function Comparison
Data Length Input Data Custom MEDIAN() Base R median()
Even-length 10 2, 3, 3, 3, 4, 5, 5, 6, 6, 9 4.5 4.5
Odd-length 9 3, 3, 4, 7, 8, 9, 9, 9, 9 8.0 8.0

2.2 Measuring Variability (Spread)

A measure of spread (or statistical dispersion) quantifies the variability within a dataset. These metrics complement measures of central tendency by describing how observations are distributed around the center.

Key Characteristics:

  • Describe how “spread out” the data points are
  • Indicate whether values are clustered tightly or widely dispersed
  • Help assess the reliability of central tendency measures
  • Provide context for comparing different datasets

Common measures of spread include:

  • Range: The difference between maximum and minimum values
  • Variance: Average of squared deviations from the mean
  • Standard Deviation: Square root of variance (in original units)
  • Interquartile Range (IQR): Spread of the middle 50% of data

2.2.1 Variance

Variance (\(s^2\)) quantifies the dispersion of data points around their mean value. It represents the average squared deviation from the mean, providing a measure of how spread out a dataset is.

Variance is calculated as:

\[ s^2_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

Where:

  • \(x_i\) = individual data points
  • \(\bar{x}\) = sample mean
  • \(n\) = sample size
  • \(n-1\) = degrees of freedom (Bessel’s correction for unbiased estimation)

Key Properties:

  • Units: Expressed in squared units of the original data

  • Sensitivity: More affected by outliers than other dispersion measures

  • Relationship: Standard deviation is the square root of variance (\(s = \sqrt{s^2}\))

  • Interpretation:

    • Higher variance → greater spread
    • Zero variance → all values identical

When to Use:

  • Assessing data variability
  • Statistical tests (ANOVA, regression)

R Implementation

#' Compute Sample Variance
#' 
#' Calculates the unbiased sample variance using matrix operations. 
#' Implements Bessel's correction (n-1 denominator) for proper population variance estimation.
#'
#' @param X Numeric vector, matrix, or list of numbers
#' @param na.rm Logical indicating whether to remove NA values (default: FALSE)
#' @return The sample variance as a numeric scalar
#' @examples
#' s2(c(4, 9, 12, 15, 18))  # Returns 29.3
#' s2(matrix(1:12, nrow=3))  # Returns 13 (all elements pooled)
#' s2(c(1, 3, NA, 5), na.rm=TRUE)  # Returns 4
#' @references
#' Bessel, F. W. (1838). On the determination of the probable error of a mean.
s2 <- function(X, na.rm = FALSE) {
  # Input validation
  if (!is.numeric(X)) stop("Input must be numeric")
  if (length(X) == 0) return(NA_real_)
  
  # Handle NA values
  if (anyNA(X)) {
    if (!na.rm) return(NA_real_)
    X <- X[!is.na(X)]
    if (length(X) == 0) return(NA_real_)
  }
  
  # Flatten to a single-column matrix so matrix input is pooled element-wise
  X <- matrix(as.numeric(X), ncol = 1)
  n <- nrow(X)
  
  # Efficient one-pass calculation
  MU <- mu(X)
  D <- X - MU
  variance <- as.numeric((t(D) %*% D) / (n - 1))  # Bessel's correction
  
  return(variance)
}

# Save function
dump("s2", file = "s2.R")

BASIC USAGE EXAMPLES

# Generate sample data
set.seed(123)
X <- sample(1:9, 30, replace = TRUE)
cat("Sample data:\n", paste(X, collapse = ", "), "\n\n")
## Sample data:
##  3, 3, 2, 6, 5, 4, 6, 9, 5, 3, 9, 9, 9, 3, 8, 7, 9, 3, 4, 1, 7, 5, 7, 9, 9, 7, 5, 7, 5, 6
# Calculate measures
results <- data.frame(
  Statistic = c("Variance (s²)", "Standard Deviation (s)"),
  Custom_Function = c(s2(X), sqrt(s2(X))),
  Base_R = c(var(X), sd(X))
)

knitr::kable(
  results,
  caption = "Comparison of Dispersion Measures",
  digits = 6,
  align = c('l', 'r', 'r'),
  col.names = c("Measure", "Custom Function", "Base R")
)
Comparison of Dispersion Measures
Measure Custom Function Base R
Variance (s²) 5.867816 5.867816
Standard Deviation (s) 2.422358 2.422358

Technical Note: The exact match between custom and base R functions confirms proper implementation of both the variance calculation and Bessel’s correction.

2.2.2 Median Absolute Deviation (MAD)

The Median Absolute Deviation (MAD) is a robust measure of statistical dispersion that quantifies the variability of a univariate dataset. Unlike variance or standard deviation, MAD is resistant to outliers, making it particularly useful for datasets that may contain extreme values.

MAD is calculated as:

\[ \begin{equation} \text{MAD} = \text{median} \left( \left| x_i - \tilde{x} \right| \right) \end{equation} \]

where:

  • \(x_i\) = individual data points
  • \(\tilde{x}\) = median of the dataset
  • \(| \cdot |\) = absolute value operator

Key Properties:

  • Robustness:
    • Unlike standard deviation (which squares deviations), MAD uses absolute values, making it less sensitive to outliers
    • A single extreme value has minimal impact on MAD
  • Interpretation:
    • Represents the median distance of data points from the dataset’s median
    • Higher MAD indicates greater spread in the data
  • Comparison with Standard Deviation:
    • For normally distributed data: \(\text{MAD} \approx 0.6745 \sigma\)
    • More reliable than standard deviation for non-normal distributions or contaminated data

Scaling to Estimate Standard Deviation

To use MAD as a consistent estimator for the population standard deviation (\(\sigma\)):

  • Makes MAD consistent with standard deviation (\(\sigma\)) for normal distributions
  • Converts from median-based to mean-based dispersion measure
  • Maintains robustness to outliers (breakdown point of 50%)

\[ \hat{\sigma} = k \cdot \text{MAD} \]

Where the scale factor \(k\) is:

  • For normal distributions: \[k = \frac{1}{\Phi^{-1}(3/4)} \approx 1.4826\] (where \(\Phi^{-1}\) is the inverse of the standard normal CDF)
  • Different \(k\) values may be used for other distributions
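As a quick numerical check of this scale factor (a minimal sketch using only base R):

# k is the reciprocal of the standard normal third quartile
1 / qnorm(0.75)
## [1] 1.482602

# For a large normal sample, the scaled MAD approximates the sample SD
set.seed(1)
z <- rnorm(1e5)
c(scaled_MAD = mad(z), SD = sd(z))   # both values are close to 1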

Applications:

  • Outlier detection (common rule: \(|x_i - \tilde{x}| > 3 \cdot \text{MAD}\))
  • Robust statistical modeling
  • Analyzing heavy-tailed distributions
  • Non-parametric statistics

#' Median Absolute Deviation (MAD)
#'
#' Computes the robust MAD estimator for numeric data
#' @param X A numeric vector or matrix
#' @param k Scale factor (default 1.4826 for normal distribution)
#' @return The scaled MAD value
#' @examples
#' MAD(c(1, 3, 5, 9, 11))
#' MAD(matrix(rnorm(100), ncol = 5))
MAD <- function(X, k = 1.4826) {
  # Input validation
  if (!is.numeric(X)) {
    stop("Input must be numeric")
  }
  
  # Convert to matrix if needed
  X <- as.matrix(X)
  
  # Compute median of absolute deviations
  med <- median(X)
  abs_dev <- abs(X - med)
  mad <- median(abs_dev) * k
  
  return(mad)
}

# Save function
dump("MAD", file = "MAD.R")

BASIC USAGE EXAMPLES

# Calculate results
results <- data.frame(
  Method = c("Custom MAD (k=1)", "Custom MAD (k=1.4826)", "stats::mad"),
  Value = c(
    MAD(X, k = 1),
    MAD(X),
    stats::mad(X)
  ),
  Description = c(
    "Unscaled median absolute deviation",
    "Scaled to match SD for normal distribution",
    "Base R's implementation (equivalent to scaled MAD)"
  )
)

# Print formatted results
knitr::kable(results, 
             caption = "Comparison of MAD Calculations",
             align = c('l', 'r', 'l'),
             digits = 4)
Comparison of MAD Calculations
Method Value Description
Custom MAD (k=1) 2.0000 Unscaled median absolute deviation
Custom MAD (k=1.4826) 2.9652 Scaled to match SD for normal distribution
stats::mad 2.9652 Base R’s implementation (equivalent to scaled MAD)

Key Observations:

  • Unscaled MAD (k=1)
    • Value: 2.0000
    • Interpretation: This represents the raw median absolute deviation without scaling: \[ \text{MAD}_\text{raw} = \text{median}\left(\left|X_i - \tilde{X}\right|\right) \]

where \(\tilde{X}\) is the median of the dataset.

  • Scaled MAD (k=1.4826)
    • Value: 2.9652
    • Comparison with Base R: Identical to the stats::mad() output (2.9652)

2.2.3 Interquartile Range (IQR)

The Interquartile Range (IQR) is a robust measure of statistical dispersion, calculated as the difference between the 75th (Q3) and 25th (Q1) percentiles:

\[ \text{IQR} = Q_3 - Q_1 \]

Key Properties:

  • Robustness:
    • Resistant to outliers (unlike standard deviation)
    • 25% trimmed range (excludes top/bottom 25% of data)
  • Visualization: The interquartile range is often used to identify outliers. In a boxplot, the box spans Q1 to Q3, the whiskers extend to the most extreme observations within 1.5 × IQR of the quartiles (frequently with an additional bar at the end of each whisker), and any observations beyond the whiskers are plotted as individual points.
  • Outlier Detection:
    • Lower fence: \(Q_1 - 1.5 \times \text{IQR}\)
    • Upper fence: \(Q_3 + 1.5 \times \text{IQR}\)
    • Observations beyond these fences are considered outliers

For a symmetric distribution (where the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).

Relationship with Other Measures

Measure Formula Robust? Efficiency
IQR \(Q_3 - Q_1\) Yes Moderate
MAD \(\text{median}(|X_i - \tilde{X}|)\) Yes High
Range \(\max(X) - \min(X)\) No Low

Special Cases

For symmetric distributions:

  • Median = Midhinge (\(\frac{Q_1 + Q_3}{2}\))
  • \(\text{MAD} \approx \frac{\text{IQR}}{2}\)
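A small simulation sketch illustrating this relationship for (symmetric) normal data:

set.seed(7)
z <- rnorm(1e5)

IQR(z) / 2             # half the interquartile range
mad(z, constant = 1)   # unscaled median absolute deviation
# Both are close to qnorm(0.75) ≈ 0.6745 for standard normal data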

R Implementation of IQR Plot

#' Interactive IQR Plot and Summary (base R version)
#'
#' Computes IQR and creates a customizable boxplot-like visualization with base R
#' @param X Numeric vector, matrix, or data frame
#' @param summary Logical - whether to print 5-point summary (default TRUE)
#' @param plot Logical - whether to generate plot (default TRUE)
#' @return Invisibly returns the IQR value
#' @examples
#' iqr_plot(rnorm(100))
#' iqr_plot(mtcars$mpg, plot=TRUE, summary=FALSE)

iqr_plot <- function(X, summary = TRUE, plot = TRUE) {
  source("MEDIAN.R")
  source("mu.R")
  source("Q.R")
  
  # Calculate all statistics
  # Q() (sourced from Q.R, defined elsewhere in the project) is assumed to
  # return the lower and upper halves of the data as a two-element list
  Q <- Q(X)
  q1 <- MEDIAN(Q[[1]])
  q2 <- MEDIAN(X)  # Median
  q3 <- MEDIAN(Q[[2]])
  IQR <- q3 - q1
  
  
  # PRINT 5-POINT SUMMARY
  if (summary) {
    cat("5-Point Summary:\n")
    cat(sprintf("   %-8s: %g\n", 
                c("Min", "Q1", "Median", "Q3", "Max"),
                c(min(X), q1, q2, q3, max(X))))
  }
  
  # IQR PLOT
  if (plot) {
    # Prepare data
    low_bound <- q1 - 1.5 * IQR
    high_bound <- q3 + 1.5 * IQR
    is_outlier <- X < low_bound | X > high_bound
    inliers <- X[!is_outlier]
    low_whisker <- min(inliers)
    high_whisker <- max(inliers)
    
    # Set up plot area
    par(mar = c(4, 1, 3, 1))  # Adjust margins
    plot(NA, xlim = range(X), ylim = c(0, 1), 
         xlab = "", ylab = "", yaxt = "n", 
         main = "IQR Plot", frame.plot = FALSE)
    
    # Add vertical gridlines
    grid(nx = NA, ny = NULL, col = "gray90", lty = "dotted")
    
    # Draw IQR box
    rect(q1, 0.25, q3, 0.75, col = "white", border = "black", lwd = 1)
    
    # Draw median line
    segments(q2, 0.25, q2, 0.75, col = "black", lwd = 2)
    
    # Draw whiskers
    segments(low_whisker, 0.5, q1, 0.5, lty = "dashed", lwd = 1)
    segments(q3, 0.5, high_whisker, 0.5, lty = "dashed", lwd = 1)
    segments(low_whisker, 0.4, low_whisker, 0.6, lwd = 1)
    segments(high_whisker, 0.4, high_whisker, 0.6, lwd = 1)
    
    # Plot outliers
    points(X[is_outlier], rep(0.5, sum(is_outlier)), 
           pch = 19, col = "red", cex = 1.2)
    
    # Plot central tendency points
    points(q2, 0.5, pch = 18, col = "black", cex = 1.5)       # Median
    # Add legend
    legend("top", 
           legend = c("Median", "Outliers"),
           pch = c(18, 19),
           col = c("black", "red"),
           horiz = TRUE, bty = "n", pt.cex = 1.2)
    
    # Add box around plot
    box()
  }
  
  return(invisible(IQR))
}

# Save function
dump("iqr_plot", file = "iqr_plot.R")

BASIC USAGE EXAMPLES

#' BASIC USAGE EXAMPLES
#' 
#' Demonstrates iqr_plot() functionality with comparison to base boxplot()

# Example 1: Dataset with potential outlier
X1 <- c(74, 88, 78, 90, 94, 90, 84, 90, 98, 80, 120, 51)
cat("Dataset 1:\n", paste(X1, collapse = ", "), "\n\n")
## Dataset 1:
##  74, 88, 78, 90, 94, 90, 84, 90, 98, 80, 120, 51
# iqr.plot output
cat("iqr_plot(X1) results:\n")
## iqr_plot(X1) results:
iqr1 <- iqr_plot(X1)
## 5-Point Summary:
##    Min     : 51
##     Q1      : 79
##     Median  : 89
##     Q3      : 92
##     Max     : 120

# Boxplot comparison
cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X1, horizontal = TRUE, main = "Standard Boxplot Comparison")

#--------------------------------------------------

# Example 2: Random sample
set.seed(182)
X2 <- sample(x = 1:50, size = 6)
cat("\nDataset 2:\n", paste(X2, collapse = ", "), "\n\n")
## 
## Dataset 2:
##  42, 39, 34, 37, 11, 40
cat("iqr.plot(X2) results:\n")
## iqr.plot(X2) results:
iqr2 <- iqr_plot(X2)
## 5-Point Summary:
##    Min     : 11
##     Q1      : 34
##     Median  : 38
##     Q3      : 40
##     Max     : 42

cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X2, horizontal = TRUE, main = "Standard Boxplot Comparison")

#--------------------------------------------------

# Example 3: Another random sample
set.seed(42)
X3 <- sample(x = 1:50, size = 6)
cat("\nDataset 3:\n", paste(X3, collapse = ", "), "\n\n")
## 
## Dataset 3:
##  49, 37, 1, 25, 10, 36
cat("iqr.plot(X3) results:\n")
## iqr.plot(X3) results:
iqr3 <- iqr_plot(X3)
## 5-Point Summary:
##    Min     : 1
##     Q1      : 10
##     Median  : 30.5
##     Q3      : 37
##     Max     : 49

cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X3, horizontal = TRUE, main = "Standard Boxplot Comparison")


2.3 Distribution Shape Analysis

Common measures of the shape of a distribution are skewness and kurtosis. Skewness measures symmetry, or more precisely the lack of it: a distribution or dataset is symmetric if it looks the same to the left and right of its center point. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution. In any symmetric distribution the mean, median, and mode are equal. When the data are skewed, however (as in a right-skewed dataset), the mean is dragged in the direction of the skew. In these situations the median is generally considered the best representative of the central location of the data: the more skewed the distribution, the greater the difference between the median and the mean, and the greater the emphasis that should be placed on the median rather than the mean.

2.3.1 Skewness (Asymmetry)

Measures the lack of symmetry in data distribution:

  • Symmetric distribution (skewness = 0):
    • Mean = Median = Mode
    • Mirror-image left and right sides
  • Right-skewed (positive skewness):
    • Mean > Median
    • Longer right tail (e.g., income distributions)
  • Left-skewed (negative skewness):
    • Mean < Median
    • Longer left tail (e.g., age at retirement)

Interpretation guidelines:

  • |Skewness| > 1: Considered significantly skewed
  • Median is more robust than mean for skewed data
  • Transformation (e.g., log) often needed for analysis

2.3.2 Kurtosis (Tail Weight)

Measures tail heaviness relative to normal distribution:

Type Kurtosis Characteristics
Leptokurtic > 3 Heavy tails, sharp peak
Mesokurtic ≈ 3 Normal distribution
Platykurtic < 3 Light tails, flat peak

Practical implications:

  • High kurtosis → Higher outlier probability
  • Low kurtosis → Data concentrated near mean
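Sample skewness and kurtosis are not implemented elsewhere in this project’s code, so the following is a minimal moment-based sketch (the helper names skewness() and kurtosis() are introduced here for illustration only; packages such as e1071 or moments provide equivalent functions):

# Sample skewness (third standardized moment) and kurtosis (fourth)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean((x - m)^3) / s^3
}
kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x); s <- sqrt(mean((x - m)^2))
  mean((x - m)^4) / s^4        # ~3 for a normal distribution
}

set.seed(10)
skewness(rnorm(1e4))   # near 0  (symmetric)
kurtosis(rnorm(1e4))   # near 3  (mesokurtic)
skewness(rexp(1e4))    # clearly positive (right-skewed)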

The five-number summary provides visual clues about shape; use these quartile relationships to assess it:

  • Skewness indicators:
    • Right skew: (Q3-Q2) > (Q2-Q1)
    • Left skew: (Q2-Q1) > (Q3-Q2)
  • Kurtosis indicators:
    • IQR/Range ratio < 0.5 → Possible heavy tails
    • Extreme whisker lengths → Potential outliers

Example Analysis

For dataset: [1, 2, 3, 5, 8, 13, 21, 34, 55]

5-point summary: 1 (Min), 3 (Q1), 8 (Median), 21 (Q3), 55 (Max)

  • Skewness: (21-8) > (8-3) → Right-skewed
    • Mean (15.8) > Median (8) → Confirms right skew
  • Kurtosis: IQR/Range = 18/54 ≈ 0.33 → Suggests heavy tails

This shape analysis helps determine appropriate statistical methods and data transformations for accurate modeling.
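This worked example can be reproduced directly with base R (Tukey’s hinges, as returned by fivenum(), match the quartiles quoted above; other quantile conventions may differ slightly):

x <- c(1, 2, 3, 5, 8, 13, 21, 34, 55)

fivenum(x)   # Min, Q1, Median, Q3, Max
## [1]  1  3  8 21 55

q <- fivenum(x)
(q[4] - q[3]) > (q[3] - q[2])   # (Q3 - Q2) > (Q2 - Q1): TRUE -> right skew
mean(x) > median(x)             # TRUE: mean pulled above the median
(q[4] - q[2]) / (q[5] - q[1])   # IQR / Range = 18/54 ≈ 0.33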

Data Transformations: When and Why to Use Them

Data transformations modify the scale or distribution of data to meet statistical assumptions or improve model performance. Below is a structured guide on their application.

When to Apply Transformations

  1. Correcting Skewness

Right-Skewed Data Transformations

| Transformation | Formula | When to Use | Effect Size | Limitations |
|----------------|---------|-------------|-------------|-------------|
| Logarithmic | log(x) | Moderate-to-strong skew | Reduces skewness by ~0.5-1.5 | Requires x > 0 |
| Square Root | √x | Mild-to-moderate skew | Reduces skewness by ~0.3-0.8 | Works best with non-negative counts |
| Box-Cox | (x^λ - 1)/λ | Variable skewness | Optimally reduces skewness | Requires x > 0, λ estimation needed |
| Inverse | 1/x | Extreme skew cases | Strong reduction effect | Undefined at zero, reverses order |

Left-Skewed Data Transformations

| Transformation | Formula | Effect | Typical Use Cases |
|----------------|---------|--------|-------------------|
| Exponential | x², x³ | Converts left skew to right skew | Mild left skew (abs(skewness) < 1) |
| Reflection + Log | log(max(x)+1-x) | Converts to right skew first | Strong left skew (abs(skewness) > 1) |

  2. Meeting Parametric Test Assumptions

Parametric statistical tests rely on the assumption of normally distributed residuals/errors. Common tests requiring normality include:

  • t-tests (independent and paired samples)
  • ANOVA (one-way and factorial)
  • Linear regression
  • Pearson correlation

Common Normality-Inducing Transformations

Transformation Formula Suitable Data Types Key Features
Logarithmic log(x) Positive right-skewed data Simple implementation, multiplicative effects become additive
Box-Cox (x^λ - 1)/λ Positive-valued data Automatically finds optimal λ for best normality
Yeo-Johnson Complex piecewise function Data with zeros/negatives Generalized version of Box-Cox

Implementation Guidelines

  • Log Transform:
    • Use log10 or natural log (ln)
    • For zeros: log(x + c) where c is a small constant
  • Box-Cox:
    • Requires all values > 0
    • λ = 0 implies a log transform
    • λ = 1 implies no transform needed
  • Yeo-Johnson:
    • Handles all real numbers
    • Similar interpretation to Box-Cox

Verification of Results

After transformation, check for normality using:

  • Q-Q plots (should follow straight line)
  • Shapiro-Wilk test (p > 0.05 indicates normality)
  • Skewness/kurtosis values (should be near 0 and 3 respectively)
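A brief sketch of this verification workflow on simulated right-skewed (log-normal) data; exact p-values depend on the seed:

set.seed(42)
raw <- rlnorm(200, meanlog = 0, sdlog = 1)   # right-skewed (log-normal) data

shapiro.test(raw)$p.value        # very small -> normality rejected
shapiro.test(log(raw))$p.value   # large -> log-transformed data look normal

# Q-Q plots before and after the transformation
par(mfrow = c(1, 2))
qqnorm(raw, main = "Raw data");      qqline(raw)
qqnorm(log(raw), main = "log(raw)"); qqline(log(raw))
par(mfrow = c(1, 1))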

When Transformations Fail

If normality cannot be achieved:

  • Consider non-parametric alternatives:
    • Mann-Whitney U (instead of t-test)
    • Kruskal-Wallis (instead of ANOVA)
    • Spearman correlation (instead of Pearson)
  • Use robust statistical methods
  • Try bootstrapping approaches

3. Stabilizing Variance (Homoscedasticity)

Many statistical models assume constant variance (homoscedasticity). Below are common variance patterns and their corresponding transformations:

Variance Stabilization Transformations

Variance Pattern Transformation Formula Application Notes
Multiplicative variance Logarithmic log(y) Use when SD ∝ mean (common in financial/biological data)
Poisson-distributed counts Square root √x or √(x+0.375) For count data where variance = mean
Proportional/binomial data Arcsine arcsin(√x) For proportions (0 ≤ x ≤ 1)
Severe heteroscedasticity Box-Cox (x^λ - 1)/λ When pattern isn’t clearly multiplicative
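A small sketch of the square-root rule for Poisson-like counts, showing how the transformation makes the group variances comparable:

set.seed(99)
# Three groups of Poisson counts whose variance grows with the mean
counts <- lapply(c(5, 20, 80), function(m) rpois(500, lambda = m))

sapply(counts, var)                        # roughly 5, 20, 80 (variance ≈ mean)
sapply(counts, function(x) var(sqrt(x)))   # variances become much more similar after sqrt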

2.4 Dependence in Descriptive Statistics

Dependence refers to any statistical relationship between two random variables, while correlation specifically measures the degree of linear relationship between them. The value of a correlation coefficient ranges between -1 and +1. If the variables are independent, the correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between the two variables.

Key Properties of Correlation Coefficients:

  • Range: [-1, +1]
    • +1: Perfect positive linear relationship
    • -1: Perfect negative linear relationship
    • 0: No linear relationship
  • Symmetry: cor(X,Y) = cor(Y,X)
  • Unitless: Invariant to scale changes

Key Measures

Measure Type Formula Use Case
Pearson’s r Linear cov(X,Y)/(σₓσᵧ) Normally distributed data
Spearman’s ρ Rank Correlation of ranks Non-normal/monotonic relationships
Kendall’s τ Ordinal Concordant pairs Small samples/ordinal data

Critical Considerations

  • Independence vs. Zero Correlation:
    • Independence ⇒ ρ = 0
    • ρ = 0 ⇏ Independence (may have nonlinear relationships)
  • Common Pitfalls:
    • Sensitive to outliers
    • Only captures linear relationships
    • Can suggest spurious relationships

When to Use:

  • Exploratory analysis: Identify potential relationships
  • Feature selection: Screen variables for predictive modeling
  • Assumption checking: Validate linear regression assumptions

Interpretation Guidelines

  • |r| > 0.7: Strong relationship
  • 0.3 < |r| < 0.7: Moderate relationship
  • |r| < 0.3: Weak relationship

2.4.1 Pearson Correlation Coefficient

Pearson’s correlation coefficient (denoted as ρ for populations, r for samples) measures the linear relationship between two variables. It is calculated as the covariance of the variables divided by the product of their standard deviations.

Population Formula For random variables (X, Y):

\[ \rho_{XY} = \frac{\text{cov}(X,Y)}{\sigma_{X} \sigma_{Y}} \] where:

  • \(cov(X,Y)\): Covariance between \(X\) and \(Y\)
  • \(\sigma_{X}\): Standard deviation of \(X\)
  • \(\sigma_{Y}\): Standard deviation of \(Y\)

When applied to sample data, the coefficient is denoted as r:

\[ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \] where:

  • x̄, ȳ: Sample means
  • n: Sample size

Assumptions

  • Linear relationship between variables
  • Continuous variables (interval/ratio)
  • Normally distributed variables
  • Homoscedasticity (constant variance)
  • No outliers (sensitive to extreme values)

Note: Pearson’s \(r\) only measures linear relationships. Nonlinear relationships may have \(r ≈ 0\) despite strong dependence.
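A classic illustration of this caveat (not part of the original examples): Y is completely determined by X, yet the linear correlation is essentially zero.

x <- seq(-3, 3, length.out = 200)
y <- x^2                       # Y is a deterministic (nonlinear) function of X

cor(x, y)                      # ≈ 0: no *linear* relationship
plot(x, y, pch = 1, main = "Strong dependence, near-zero linear correlation")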


R Implementation

#' Compute Pearson's Correlation Coefficient
#'
#' Calculates Pearson's r between two numeric vectors, lists, or from a two-column matrix/dataframe.
#' Implements the computational formula for efficiency with large datasets.
#'
#' @param X Numeric vector, list, or matrix/dataframe with two columns
#' @param Y Optional second numeric vector/list (ignored if X is a matrix)
#' @return Pearson's correlation coefficient (r) between -1 and 1
#' @examples
#' # Vector inputs
#' COR(1:10, 2:11)
#' 
#' # Matrix input
#' mat <- cbind(x = rnorm(100), y = rnorm(100))
#' COR(mat)
#' 
#' # Dataframe input
#' df <- data.frame(var1 = mtcars$mpg, var2 = mtcars$wt)
#' COR(df)
COR <- function(X, Y = NULL) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  
  # Handle matrix/dataframe input
  if (is.null(Y)) {
    if (!is.matrix(X) && !is.data.frame(X)) 
      stop("When Y is NULL, X must be a matrix or dataframe with two columns")
    if (ncol(X) != 2) 
      stop("X must have exactly two columns when Y is not provided")
    
    Y <- X[, 2]
    X <- X[, 1]
  }
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths
  if (length(X) != length(Y)) 
    stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) 
    stop("Insufficient complete cases (need at least 2 observations)")
  
  # Computational formula implementation
  sumX <- sum(X)
  sumY <- sum(Y)
  sumXY <- sum(X * Y)
  sumX2 <- sum(X^2)
  sumY2 <- sum(Y^2)
  
  numerator <- n * sumXY - sumX * sumY
  denominator <- sqrt((n * sumX2 - sumX^2) * (n * sumY2 - sumY^2))
  
  if (denominator == 0) return(NA_real_)  # correlation undefined when either variable has zero variance
  
  r <- numerator / denominator
  return(r)
}

# Save function
dump("COR", file = "COR.R")

BASIC USAGE EXAMPLES

Example 1: Very Weak Positive Correlation (~0.001)

# simulate bivariate data
# Load package for multivariate normal distribution
require(mvtnorm)
## Loading required package: mvtnorm
# Generate 5000 observations from bivariate normal distribution
# with mean 0 for both variables and very weak correlation (0.001)
# cor(X, Y) ~ 1/1000
data <- rmvnorm(n=5e3, mean=c(0, 0), 
                sigma=matrix(c(1,.001,.001,1), ncol=2))
X <- data[,1]
Y <- data[,2]

# Calculate correlation using our COR function 
COR(X, Y)
## [1] 0.003164172
# Compare with R's built-in correlation function
stats::cor(X, Y)
## [1] 0.003164172
# Visualize the nearly uncorrelated data
# cor(X, Y) ~ 0
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('black', alpha.f=.2), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 2: Moderate Positive Correlation (~0.67)

# Generate data with stronger covariance (off-diagonal = 1)
# and higher variance (diagonal = 1.5)

data <- rmvnorm(n=3e3, mean=c(0, 0), 
                sigma=matrix(c(1.5,1,1,1.5), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # population correlation = 1/1.5 ≈ 0.67
## [1] 0.6654865
# Plot showing clear positive relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('black', alpha.f=.2), 
     xlim=c(-5, 5), ylim=c(-5, 5))


Example 3: Strong Positive Correlation (~0.99)

# Generate data with nearly identical variables
# (very high covariance relative to variance)
data <- rmvnorm(n=2e3, mean=c(0, 0), 
                sigma=matrix(c(1.01,1,1,1.01), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # Should be very close to 1
## [1] 0.9898421
# Plot showing nearly perfect linear relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('darkblue', alpha.f=.1), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 4: Perfect Positive Correlation (1)

# Generate perfectly correlated data
# (covariance matrix where variables are identical)
data <- rmvnorm(n=1e3, mean=c(0, 0), 
                sigma=matrix(c(1,1,1,1), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # cor(X, Y) = 1
## [1] 1
# Plot showing perfect linear relationship
plot(X, Y, pch=1, cex=2, 
     col=adjustcolor('darkgreen', alpha.f=.05), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 5: Strong Negative Correlation (~-0.99)

# Generate data with nearly perfect negative relationship
data <- rmvnorm(n=2e3, mean=c(0, 0), 
                sigma=matrix(c(1.01,-1,-1,1.01), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y)  # cor(X, Y) ~ -1
## [1] -0.9895359
# Plot showing perfect inverse relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('red', alpha.f=.05), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Key Points:

  • Each example demonstrates a different correlation strength
  • The \(COR()\) function matches \(R\)’s built-in \(cor()\) results
  • Plots visually confirm the correlation strength:
    • Tight clustering along the diagonal = strong correlation
    • Circular cloud = no correlation
    • Negative slope = inverse relationship

2.4.2 Rank Correlation Coefficients

Rank correlation coefficients measure monotonic relationships between variables by analyzing the ranks of data values rather than the raw data itself. The two most common measures are:

  • Spearman’s \(\rho\) (rho)
  • Kendall’s \(\tau\) (tau)

These are non-parametric measures that assess how well the relationship between two variables can be described using a monotonic function.

2.4.2.1 Spearman’s Rank Correlation Coefficient

For a sample of size \(n\), the Spearman correlation coefficient is the Pearson correlation between the rank variables:

\[ \displaystyle{r_{s}=\rho_{\operatorname{rg}_{X}, \: \operatorname{rg}_{Y}} = \frac{\operatorname{cov}(\operatorname{rg}_{X},\; \operatorname{rg}_{Y})}{\sigma_{\operatorname{rg}_{X}}\:\sigma_{\operatorname{rg}_{Y}}}}, \] where:

\(ρ\) = Pearson correlation coefficient applied to ranks

\(\displaystyle{\operatorname{cov} (\operatorname{rg}_{X}, \: \operatorname{rg}_{Y})}\) = covariance of rank variables

\(\displaystyle{\sigma_{\operatorname{rg}_{X}}}\) = standard deviations of rank variables

Simplified Calculation (No Ties)

When all ranks are distinct integers:

\[ \displaystyle{r_{s}=1-\frac{6 \sum d_{i}^{2}}{n(n^{2}-1)}}, \]

where

  • \(\displaystyle{d_{i}=\operatorname{rg}(X_{i})-\operatorname{rg}(Y_{i})}\) is the difference between ranks.
  • \(n\) is the number of observations.

Identical values are usually each assigned fractional ranks equal to the average of their positions in the ascending order of the values, which is equivalent to averaging over all possible permutations.
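For example, base R’s rank() with its default ties.method = "average" applies exactly this fractional-rank convention:

x <- c(10, 20, 20, 30)
rank(x)                        # 1.0 2.5 2.5 4.0  (the tied 20s share rank (2 + 3)/2)
rank(x, ties.method = "min")   # 1 2 2 4  (an alternative convention, not used here)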


R Implementation of Pearson’s Correlation Coefficient \(\rho_p\)

#' Pearson's Correlation Coefficient
#'
#' Computes Pearson's product-moment correlation coefficient between two variables.
#' This is a measure of linear correlation between two sets of data.
#'
#' @param X Numeric vector of values
#' @param Y Numeric vector of values (must be same length as X)
#' @return Pearson's correlation coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect positive correlation
#' rho_p(1:10, 1:10)  # Returns 1
#'
#' # Perfect negative correlation
#' rho_p(1:10, 10:1)  # Returns -1
#'
#' # No correlation
#' rho_p(1:10, sample(1:10))  # Returns near 0
#'
#' # With missing values
#' x <- c(1, 2, NA, 4, 5)
#' y <- c(2, 3, 4, NA, 6)
#' rho_p(x, y)  # Uses pairwise complete observations

rho_p <- function(X, Y) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  if (missing(Y)) stop("Input Y is required")
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths before removing NAs
  if (length(X) != length(Y)) stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) {
    warning("Insufficient complete cases (n < 2), returning NA")
    return(NA)
  }
  
  # Check for zero variance
  if (stats::sd(X) == 0 || stats::sd(Y) == 0) {
    warning("Zero variance detected, correlation undefined")
    return(NA)
  }
  
  # Compute Pearson's r using stats package functions
  cov_xy <- stats::cov(X, Y)
  sd_x <- stats::sd(X)
  sd_y <- stats::sd(Y)
  
  cov_xy / (sd_x * sd_y)
}

# Save function to file
dump('rho_p', file = 'rho_p.R') 

R Implementation of Spearman’s Rank Correlation Coefficient \(\rho_s\)

#' Spearman's Rank Correlation Coefficient
#'
#' Computes Spearman's rank correlation coefficient between two variables.
#' This is a nonparametric measure of rank correlation (monotonic relationship).
#'
#' @param X Numeric vector of values
#' @param Y Numeric vector of values (must be same length as X)
#' @return Spearman's rank correlation coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect positive correlation
#' rho_s(1:5, 1:5)  # Returns 1
#'
#' # Perfect negative correlation
#' rho_s(1:5, 5:1)  # Returns -1
#'
#' # No correlation
#' rho_s(1:5, c(3, 1, 4, 2, 5))  # Returns near 0
#'
#' # With tied ranks
#' rho_s(c(1, 2, 2, 3), c(4, 5, 5, 6))
#' 
rho_s <- function(X, Y) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  if (missing(Y)) stop("Input Y is required")
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths
  if (length(X) != length(Y)) stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) {
    warning("Insufficient complete cases (n < 2), returning NA")
    return(NA)
  }
  
  # Convert to ranks (handling ties by averaging)
  rank_X <- rank(X)
  rank_Y <- rank(Y)
  
  # Calculate Spearman's rho using the simplified formula
  # (only exact when there are no tied ranks)
  d <- rank_X - rank_Y
  rho <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))
  
  # Adjust for tied ranks (if any)
  if (any(duplicated(rank_X)) | any(duplicated(rank_Y))) {
    warning("Tied ranks detected. Using Pearson's method on ranks instead.")
    rho <- stats::cor(rank_X, rank_Y, method = "pearson")
  }
  
  return(rho)
}

# Save function to file
dump('rho_s', file = 'rho_s.R')

BASIC USAGE EXAMPLES

1. Comparing Pearson and Spearman Correlation

# Generate sample data
set.seed(42)
X <- rnorm(100, mean = 0, sd = 1)
Y <- rnorm(100, mean = 4.5, sd = 1.5)

# View first few observations
head(cbind(X, Y))
##               X        Y
## [1,]  1.3709584 6.301448
## [2,] -0.5646982 6.067127
## [3,]  0.3631284 2.995187
## [4,]  0.6328626 7.272723
## [5,]  0.4042683 3.499840
## [6,] -0.1061245 4.658271
# Calculate correlations
rho_p(X, Y)  # Pearson correlation
## [1] 0.03127984
rho_s(X, Y)  # Spearman correlation
## [1] 0.05058506
COR(X, Y) # Pearson
## [1] 0.03127984
cor(X, Y, method = "pearson")  # Base R Pearson
## [1] 0.03127984
cor(X, Y, method = "spearman") # Base R Spearman
## [1] 0.05058506

2. Rank Transformation Equivalence

Demonstrate that Spearman’s rho equals Pearson’s on ranks

# Manual rank transformation
ranked_X <- rank(X)
ranked_Y <- rank(Y)

head(cbind(ranked_X, ranked_Y))
##      ranked_X ranked_Y
## [1,]       90       92
## [2,]       27       87
## [3,]       59       17
## [4,]       70       99
## [5,]       60       24
## [6,]       43       60
# Spearman’s rho_s is mathematically equivalent to Pearson’s correlation on ranks:

rho_p(ranked_X, ranked_Y)                     # Our Pearson on ranks
## [1] 0.05058506
rho_s(X, Y)                                   # Our Spearman on original data
## [1] 0.05058506
cor(ranked_X, ranked_Y, method = "pearson")   # Base R Pearson on ranks
## [1] 0.05058506
cor(X, Y, method = "spearman")                # Base R Spearman
## [1] 0.05058506

3. Example with Multivariate Normal Data

# Generate correlated data
if (!require(mvtnorm)) install.packages("mvtnorm")
library(mvtnorm)

# Create correlated data (population correlation ~ 2/3)
data <- rmvnorm(
  n = 2000, 
  mean = c(0, 0), 
  sigma = matrix(c(1.5, 1, 1, 1.5), ncol = 2)
)
X <- data[, 1]
Y <- data[, 2]
# Pearson correlation
rho_p(X, Y)
## [1] 0.6490746
cor(X, Y, method = "pearson")
## [1] 0.6490746
# Rank transform the data
rank_X <- rank(X)
rank_Y <- rank(Y)

# Spearman correlation (should match Pearson on ranks)
rho_s(X, Y)
## [1] 0.629446
rho_p(rank_X, rank_Y)
## [1] 0.629446
# Compare with base R
cor(X, Y, method = "spearman")
## [1] 0.629446
cor(rank_X, rank_Y, method = "pearson")
## [1] 0.629446
# Install ggplot2 and ggExtra if missing
if (!require(ggplot2)) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require(ggExtra)) install.packages("ggExtra")
## Loading required package: ggExtra
# Load libraries
library(ggplot2)
library(ggExtra)

# Main scatter plot with smoothing
p <- ggplot(data.frame(X, Y), aes(X, Y)) +
  geom_point(alpha = 0.3, color = "blue", pch = 1) +
  geom_smooth(method = "loess", color = "red", se = FALSE, linewidth = 1) +
  geom_smooth(method = "lm", color = "darkgreen", se = FALSE, linetype = "dashed") +
  labs(title = "Original Data with Loess (Red) vs Linear Fit (Green)",
       subtitle = paste("Spearman ρ =", round(cor(X, Y, method = "spearman"), 3),
                        " | Pearson r =", round(cor(X, Y, method = "pearson"), 3))) +
  theme_minimal()

# Add marginal histograms
ggExtra::ggMarginal(p, type = "histogram", fill = "lightgray", bins = 30)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Loess (Red Curve): Shows the true (potentially nonlinear) trend. If it deviates from the dashed green line (linear fit), the relationship is nonlinear.

Marginal Histograms: Check if X and Y are normally distributed (important for Pearson). If skewed or multimodal, Spearman is more reliable.

p_ranks <- ggplot(data.frame(rank_X, rank_Y), aes(rank_X, rank_Y)) +
  geom_point(alpha = 0.3, color = "red") +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  labs(title = "Rank-Transformed Data (Spearman = Pearson on Ranks)",
       subtitle = paste("ρ =", round(cor(rank_X, rank_Y, method = "pearson"), 3))) +
  theme_minimal()

# Add marginal density plots
ggExtra::ggMarginal(p_ranks, type = "density", fill = "pink")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Black Line: The perfect linear relationship confirms Spearman is Pearson on ranks.

Marginal Densities: Should be uniform (since ranks are uniformly distributed).

Comparison of Pearson’s and Spearman’s Correlation Coefficients

  • Conceptual Differences:
    • Pearson’s Correlation (ρₚ)
      • Nature: Measures linear relationship between two continuous variables
      • Assumptions:
        • Variables should be normally distributed
        • Relationship should be linear
        • Homoscedasticity (constant variance)
      • Sensitivity: Affected by outliers due to its use of actual data values
    • Spearman’s Correlation (ρₛ)
      • Nature: Measures monotonic (not necessarily linear) relationships
      • Assumptions:
        • Works with ordinal, interval, or ratio data
        • No distributional assumptions
      • Robustness: Less affected by outliers as it uses rank orders

When to Use Each

  • Use Pearson When:
    • Both variables are continuous and normally distributed
    • The relationship appears linear on scatter plots
    • You specifically want to measure linear dependence
    • There are no significant outliers
  • Use Spearman When:
    • Variables are ordinal or not normally distributed
    • The relationship is monotonic but not necessarily linear
    • Data contains outliers that might distort Pearson’s correlation
    • You want a nonparametric measure of association
# Example demonstrating differences
set.seed(123)
x <- 1:100
y_linear <- x + rnorm(100, 0, 10)       # Linear relationship
y_monotonic <- x^2 + rnorm(100, 0, 100) # Nonlinear but monotonic
y_random <- rnorm(100)                  # No relationship

# Compare correlations
results <- data.frame(
  Relationship = c("Linear", "Monotonic", "Random"),
  Pearson = c(
    rho_p(x, y_linear),
    rho_p(x, y_monotonic),
    rho_p(x, y_random)
  ),
  Spearman = c(
    rho_s(x, y_linear),
    rho_s(x, y_monotonic),
    rho_s(x, y_random)
  )
)

print(results)
##   Relationship   Pearson  Spearman
## 1       Linear 0.9562453 0.9573237
## 2    Monotonic 0.9694407 0.9980798
## 3       Random 0.1177439 0.1600600

Key Observations from Output:

  • Linear Relationship:
    • Both coefficients are high and nearly identical (≈ 0.96)
    • Pearson’s r fully captures the relationship because it is genuinely linear
  • Monotonic Relationship:
    • Spearman’s ρ is essentially perfect (≈ 0.998) because the trend is strictly monotonic
    • Pearson’s r is noticeably lower (≈ 0.97): it understates the strength of the nonlinear pattern
  • No Relationship:
    • Both coefficients are near zero

Conclusion

Choose Pearson’s for linear relationships with normal data, Spearman’s for monotonic relationships or when assumptions of Pearson’s are violated. Our implementation provides robust versions of both with proper handling of edge cases (NA values, ties, etc.). The examples demonstrate how they complement each other in different analytical scenarios.

2.4.2.2 The Kendall rank correlation coefficient

The Kendall rank correlation coefficient, commonly referred to as Kendall’s τ

coefficient, is a statistic used to measure the ordinal association between two measured quantities. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities.

The Kendall τ coefficient is defined as:

\[ \tau = \frac {(\text{number of concordant pairs})-(\text{number of discordant pairs})}{\displaystyle {n\choose 2}} \] An explicit expression for Kendall’s rank coefficient is

\[ \tau ={\frac {2}{n(n-1)}}\sum _{i<j}\operatorname{sgn}(x_{i}-x_{j})\operatorname{sgn}(y_{i}-y_{j}). \] where \(\displaystyle{{n \choose 2} = \frac{n(n-1)}{2}}\) is the binomial coefficient for the number of ways to choose two items from \(n\) items.

Both Kendall’s τ and Spearman’s ρ can be formulated as special cases of a more general correlation coefficient.
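The explicit sgn formula can be evaluated directly for a small dataset without ties; this minimal sketch is separate from the tau-b implementation given later in this section:

x <- c(1, 3, 2, 5, 4)
y <- c(2, 4, 1, 5, 3)
n <- length(x)

pairs <- combn(n, 2)           # all index pairs with i < j
s <- sign(x[pairs[1, ]] - x[pairs[2, ]]) *
     sign(y[pairs[1, ]] - y[pairs[2, ]])

sum(s) * 2 / (n * (n - 1))     # explicit formula: 0.6 for this data
cor(x, y, method = "kendall")  # base R agrees: 0.6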

Comparison of Kendall’s τ, Spearman’s ρ, and Pearson’s r Correlation Coefficients

Overview of Correlation Measures

  • Pearson’s r (ρₚ)
    • Type: Parametric linear correlation
    • Measures: Linear relationship between continuous variables
    • Range: [-1, 1]
    • Formula: r = cov(X,Y)/(σₓσᵧ)
  • Spearman’s ρ (ρₛ)
    • Type: Non-parametric rank correlation
    • Measures: Monotonic relationships
    • Range: [-1, 1]
    • Formula: Pearson correlation applied to ranked data
  • Kendall’s τ
    • Type: Non-parametric rank correlation
    • Measures: Ordinal association based on concordant/discordant pairs
    • Range: [-1, 1]
    • Formula: τ = (number of concordant pairs - number of discordant pairs) / (n choose 2)

Key Differences Between Correlation Coefficients

| Characteristic | Pearson’s r | Spearman’s ρ | Kendall’s τ |
|----------------|-------------|--------------|-------------|
| Correlation Type | Linear | Monotonic | Ordinal |
| Data Requirements | Interval/ratio, normally distributed | Ordinal/interval, no distribution assumptions | Ordinal/interval, no distribution assumptions |
| Sensitivity | Highly sensitive to outliers | Moderately robust to outliers | Most robust to outliers |
| Interpretation | Linear relationship strength | Monotonic relationship strength | Probability that pairs are concordant |
| Computation | O(n) | O(n log n) (due to ranking) | O(n²) (pairwise comparisons) |
| Best Use Case | Linear relationships with normal data | Nonlinear monotonic relationships with continuous/ordinal data | Small datasets, ordinal data, many ties expected |
| Statistical Test | t-test | Rank-based test | Permutation test |

R Implementation of Kendall’s Rank Correlation Coefficient

#' Compute Kendall's Rank Correlation Coefficient (τ)
#'
#' Calculates Kendall's tau-b rank correlation coefficient, which measures
#' the ordinal association between two variables. This version handles ties
#' and includes proper input validation.
#'
#' @param X Numeric vector of first variable
#' @param Y Numeric vector of second variable (must be same length as X)
#' @return Kendall's tau-b coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect agreement
#' tau(1:10, 1:10)  # returns 1
#' 
#' # Perfect disagreement
#' tau(1:10, 10:1)  # returns -1
#' 
#' # With ties
#' tau(c(1,2,2,3), c(1,3,2,4))  # handles tied ranks

tau <- function(X, Y) {
  # Input validation
  if (missing(X) || missing(Y)) stop("Both X and Y must be provided")
  if (length(X) != length(Y)) stop("X and Y must have equal length")
  if (length(X) < 2) {
    warning("Insufficient observations (need at least 2)")
    return(NA)
  }
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  n <- length(X)
  
  # Initialize counts
  concordant <- 0
  discordant <- 0
  tied_x <- 0
  tied_y <- 0
  tied_xy <- 0
  
  # Compare all unique pairs
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      x_diff <- X[i] - X[j]
      y_diff <- Y[i] - Y[j]
      
      if (x_diff == 0 && y_diff == 0) {
        tied_xy <- tied_xy + 1
      } else if (x_diff == 0) {
        tied_x <- tied_x + 1
      } else if (y_diff == 0) {
        tied_y <- tied_y + 1
      } else if (x_diff * y_diff > 0) {
        concordant <- concordant + 1
      } else {
        discordant <- discordant + 1
      }
    }
  }
  
  # Calculate Kendall's tau-b (pairs tied on X and on Y both enter the denominator)
  total_pairs <- n * (n - 1) / 2
  (concordant - discordant) /
    sqrt((total_pairs - tied_x - tied_xy) * (total_pairs - tied_y - tied_xy))
}

# Save function
dump('tau', file = 'tau.R') 

BASIC USAGE EXAMPLES

#' Basic Example of Kendall's Tau Calculation
#' 
#' Demonstrates computation of Kendall's rank correlation coefficient τ
#' using simulated bivariate normal data with known correlation structure.
#' Compares our implementation with R's built-in cor() function.

# Load required package (install if needed) --------------------------------
if (!require(mvtnorm)) {
  install.packages("mvtnorm")
  library(mvtnorm)
}

set.seed(123)  # For reproducibility

# Generate correlated data ------------------------------------------------
# Covariance matrix with population correlation = 1/1.5 ≈ 0.67
sigma_matrix <- matrix(c(1.5, 1, 
                         1, 1.5), 
                       ncol = 2)

# Generate 2000 observations from bivariate normal distribution
data <- rmvnorm(n = 2000, 
                mean = c(0, 0), 
                sigma = sigma_matrix)

# Data preparation -------------------------------------------------------
# Convert to ranks using average method for ties
X <- rank(data[, 1], ties.method = "average")
Y <- rank(data[, 2], ties.method = "average")

# Correlation computation ------------------------------------------------
# Using our custom tau function
custom_tau <- tau(X, Y)

# Using R's built-in function for validation
builtin_tau <- cor(X, Y, method = "kendall")

# Results output ---------------------------------------------------------
cat("\n=== Correlation Results ===\n")
## 
## === Correlation Results ===
cat("Custom tau implementation:", round(custom_tau, 4), "\n")
## Custom tau implementation: 0.4503
cat("R's built-in Kendall's tau:", round(builtin_tau, 4), "\n")
## R's built-in Kendall's tau: 0.4503
cat("population Pearson correlation:", round(1/1.5, 4), "\n")
## population Pearson correlation: 0.6667
cat("Difference between implementations:", 
    round(abs(custom_tau - builtin_tau), 6), "\n\n")
## Difference between implementations: 0
# Visualization ----------------------------------------------------------
plot(X, Y, 
     main = paste("Rank Plot: Kendall's τ =", round(custom_tau, 3)),
     xlab = "Rank of Variable 1", 
     ylab = "Rank of Variable 2",
     pch = 19, 
     col = rgb(0, 0.4, 0.8, 0.3),
     cex = 0.7)

# Add reference line for perfect agreement
abline(a = 0, b = 1, col = "red", lty = 2)


Interpreting the Kendall’s τ Plot & Results

  • X-axis: Rank-transformed values of variable X (from 1 to n)
  • Y-axis: Rank-transformed values of variable Y (from 1 to n)
  • Point cloud pattern:
    • If points cluster along a rising diagonal (bottom-left to top-right): Positive monotonic relationship (τ close to +1)
    • If points cluster along a falling diagonal (top-left to bottom-right): Negative monotonic relationship (τ close to -1)
    • If points form a random cloud: No monotonic relationship (τ ≈ 0)

In our plot (from the example code), the points show a clear upward trend (but not perfectly linear), indicating a positive monotonic association between X and Y.
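As a quick sanity check (not part of the original analysis), Kendall's τ and Pearson's ρ are linked for bivariate normal data by Greiner's relation, τ = (2/π)·arcsin(ρ). A minimal sketch using the population correlation assumed above:

# Greiner's relation for bivariate normal data: tau = (2/pi) * asin(rho)
rho <- 1/1.5                          # population Pearson correlation used above
tau_theoretical <- (2/pi) * asin(rho)
round(tau_theoretical, 4)             # ~0.46, close to the simulated estimate of ~0.45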

3 Hypothesis Testing

A statistical hypothesis is a testable claim about a population parameter, evaluated using observed data modeled as realizations of random variables. Hypothesis testing is a method of statistical inference that compares two competing hypotheses:

  • Null Hypothesis (H₀): Represents a default position (e.g., “no effect” or “no difference”).
  • Alternative Hypothesis (H₁ or Hₐ): Proposes an alternative distribution or effect, either explicitly or informally.

Key Concepts:

Statistical Significance: The result is deemed statistically significant if the observed data is highly unlikely under H₀, based on a pre-specified significance level (α) (e.g., 0.05). This threshold represents the maximum allowable probability of a false positive (Type I error: incorrectly rejecting a true H₀).

Test Statistic: A pre-chosen metric (e.g., t-statistic, χ²) quantifies the deviation of the data from H₀.

Decision Rule: Reject H₀ if the test statistic exceeds a critical value (or if the p-value < α).

The goal is to control the risk of Type I errors while balancing power (ability to correctly reject a false H₀).
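A minimal sketch of this decision rule, using R's built-in t.test on simulated data (the effect size and sample size below are purely illustrative):

set.seed(1)
x <- rnorm(25, mean = 0.4, sd = 1)   # simulated data with a hypothetical true effect
tt <- t.test(x, mu = 0)              # H0: mu = 0 vs H1: mu != 0
tt$p.value < 0.05                    # TRUE -> reject H0 at the 5% significance level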

3.1 The p-Value in Hypothesis Testing

The p-value quantifies the statistical significance of evidence against the null hypothesis (H₀) based on the observed value of a test statistic T.

Key Concepts:

  • Role of the p-value:
    • Measures how compatible the observed data is with H₀.
    • Defined as the probability of obtaining a test statistic at least as extreme as the observed value, assuming H₀ is true.
    • A small p-value (typically < 0.05) suggests the data is unlikely under H₀, leading to its rejection.
  • Null Hypothesis Testing as Reductio ad Absurdum:
    • Adapted from logical reasoning: H₀ is assumed true until evidence shows it is highly implausible.
    • Statistical significance occurs when the observed result is sufficiently improbable under H₀, justifying its rejection.
  • What Rejection of H₀ Implies:
    • The true hypothesis lies in the logical complement of H₀ (i.e., “not H₀”).
    • However, rejecting H₀ does not specify which alternative hypothesis (H₁) is correct—only that H₀ is unlikely.
    • Without explicit alternatives, the test cannot determine the most plausible explanation.
  • Limitations:
    • A significant result does not prove H₁; it merely suggests H₀ is inadequate.
    • The p-value does not measure the magnitude or importance of an effect, only its statistical rarity under H₀.
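For a t-statistic, the two-sided p-value is the probability of a value at least as extreme as the one observed, assuming H₀ is true. A short sketch with illustrative numbers:

t_obs <- 2.1                        # observed test statistic (illustrative)
df <- 24                            # degrees of freedom (illustrative)
2 * (1 - pt(abs(t_obs), df))        # two-sided p-value, just under 0.05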

3.2 Type I Error (α) – False Positive

A Type I error occurs when a true null hypothesis (\(H_0\)) is incorrectly rejected.

  • Symbolized by α (alpha), the significance level of the test.
  • Interpretation: The probability of detecting an effect that does not exist (e.g., concluding a drug works when it doesn’t).
  • Controlled by researchers by setting a threshold (e.g., α = 0.05 for a 5% risk).

3.3 Type II Error (β) – False Negative

A Type II error occurs when a false null hypothesis (\(H_0\)) is not rejected.

  • Symbolized by β (beta).
  • Interpretation: The probability of failing to detect a true effect (e.g., missing a drug’s efficacy).
  • Power (\(1 - β\)): The probability of correctly rejecting \(H_0\) when it is false. Researchers aim to maximize power.

3.4 Power of a Test

The power of a hypothesis test is the probability of correctly rejecting the null hypothesis (H₀) when a specific alternative hypothesis (H₁) is true. It quantifies the test’s ability to detect an effect if it exists.

Key Properties of Power:

  • Range: \(0 \le \pi \le 1\) (where π denotes power).
  • Relationship to Type II Error (β): \(\text{Power} = 1 - \beta\)
  • Higher power reduces the probability of a Type II error (failing to reject a false H₀).

Conventional Standards:

Researchers often aim for \(π≥0.80\), implying a \(4:1\) trade-off between β (Type II error, typically \(β=0.20\)) and α (Type I error, typically \(α=0.05\)).

Factors Affecting Power:

  • Sample Size (n):
    • Larger samples reduce sampling error, making it easier to detect true effects.
    • Power analysis determines the required n to achieve a desired power.

Effect Size: Larger effects are easier to detect (higher power).

Measurement Precision: Reducing measurement error (e.g., via better tools or protocols) improves power.

Significance Level (α): A stricter α (e.g., 0.01 vs. 0.05) lowers power, increasing β.

When to Use Power Analysis:

  • When the goal is to correctly reject a false H₀ (e.g., clinical trials).
  • Not all studies prioritize power; some focus on estimating effect sizes precisely.
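A minimal power-analysis sketch using stats::power.t.test; the effect size, standard deviation, and sample size below are illustrative:

# Sample size needed per group for 80% power to detect a 0.5-SD difference at alpha = 0.05
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")   # n ~ 64 per group

# Power achieved with a fixed sample size of 30 per group
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)$power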

3.5 Confidence Intervals

Confidence intervals (CIs) provide a range of plausible values for population parameters (e.g., mean, proportion) based on sample data, quantifying estimation uncertainty.

  • A 100(1-α)% CI (e.g., 95% CI) indicates that in repeated sampling, 100(1-α)% of intervals would contain the true parameter
  • Important: The confidence level refers to the long-run performance of the method, not the probability that any particular interval contains the parameter
  • Components:
    • Point estimate (e.g., sample mean)
    • Margin of error (function of variability and sample size)
    • Confidence level (typically 90%, 95%, or 99%)
  • Interpretation:
    • Correct: “We’re 95% confident this interval contains the true parameter”
    • Incorrect: “There’s a 95% probability the parameter is in this interval”

Factors Affecting CI Width

Factor Effect on Width Mathematical Relationship
Sample size (n) Decreases as n↑ Width ∝ 1/√n
Variability (σ) Increases as σ↑ Width ∝ σ
Confidence level Increases as CL↑ zα/2 increases

For normally distributed data with known σ:

\[ CI = \bar{x} \pm z_{\alpha/2} \times \left(\frac{\sigma}{\sqrt{n}}\right) \]

The 95% confidence interval is calculated as: \[ \text{CI} = \bar{x} \pm z_{0.025} \times \left(\frac{\sigma}{\sqrt{n}}\right), \] where \(z_{0.025} \approx 1.96\) for a normal distribution.
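As a quick illustration of the formula above (the sample statistics are made up for the example):

x_bar <- 52.3; sigma <- 10; n <- 100      # illustrative sample mean, known sigma, sample size
z <- qnorm(1 - 0.05/2)                    # ~1.96 for a 95% CI
x_bar + c(-1, 1) * z * sigma / sqrt(n)    # 95% CI: approximately [50.34, 54.26]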


R Implementation of Confidence Intervals

#' Confidence Interval Simulation for Sample Means
#' 
#' Simulates multiple samples and plots their confidence intervals to demonstrate
#' the concept of coverage probability.
#'
#' @param mu Population mean (default = 0)
#' @param s Population standard deviation
#' @param alpha Significance level (default = 0.05)
#' @param size Sample size for each simulation (default = 1000)
#' @param n Number of samples to simulate (default = 1000)
#' @param seed Random seed (default = -42)
#' @param plot Whether to generate the plot (default = TRUE)
#' 
#' @return A data frame containing the lower/upper CIs and means for each sample
#' @export

ci.sim <-
  function(mu=0, s, alpha=.05, size=1000, n=1000, seed=-42, plot=TRUE){
    
    source("t_cdf.R")
    source("mu.R")
    source("s2.R")
    
    set.seed(seed)
    
    if(plot){
      plot.new()
      xmin <- mu - 5.5*s/sqrt(size)
      xmax <- mu + 5.5*s/sqrt(size)
      plot.window(xlim=c(xmin, xmax), ylim = c(0, n*n/900))
      par(mar=c(3, 5.5, 4, 0), mgp=c(2, 0, 0), las=1) # plot margins
      title(main=bquote(atop('Confidence Interval for the Sample Mean', 
                             '(sample size = '*.(size)*')')), 
            ylab = bquote(atop('Samples', '(number of samples = '*.(n)*')')), 
            xlab = expression(mu), 
            cex.main=1.5, cex.lab=1.2)
      axis(1); axis(2)  # draw x and y axes
      
      # population mean
      abline(v=mu, lty=1, col='blue', lwd=1)
    }
    
    covered <- 0 # fraction of intervals covering the mean
    
    data <- data.frame(row.names=(paste('sample', 1:n, ' ')))
    
    for(i in 0:(n - 1)){
      x <- rnorm(n=size, mean=mu, sd=s)
      df <- size - 1
      sem <- sqrt(s2(x)/size) # standard error of the mean
      x_bar <- mu(x)
      t <- (x_bar - mu)/sem
      
      p <- 2*(1-Re(t_cdf(abs(t), df)))
      a <- 1-alpha/2
      ebm <- qt(a, df)*sem # error bound for a population mean
      ci.lower <- x_bar - ebm
      ci.upper <- x_bar + ebm
      
      covered <- covered + (ci.lower < mu & mu < ci.upper)
      
      data[i+1, 'ci.lower'] <- round(ci.lower, digits=3)
      data[i+1, 'ci.upper'] <- round(ci.upper, digits=3)
      data[i+1, 'mean'] <- x_bar
      data[i+1, ' '] <- ifelse(mu >= ci.lower && mu <= ci.upper, '', '*')
      
      if(plot){
        # CI and mean color
        if(mu >= ci.lower && mu <= ci.upper){
          col1=adjustcolor("green3", alpha.f=0.7)
          col2='darkgreen'
        }else{
          col1=adjustcolor("tomato", alpha.f=0.7)
          col2='tomato4'
        }
        
        # sample mean and ci
        lines(c(ci.lower, ci.upper), c(n*i/900, n*i/900), 
              col=col1, type='l', lwd=400/n, lend=1)
        points(x_bar, n*i/900, pch=18, cex=70/n, col=col2) 
        points(x_bar, n*i/900, pch=18, cex=40/n, col='white') 
        points(x_bar, n*i/900, pch=18, cex=20/n, col=col2) 
      }
    }
    
# Report coverage with all key parameters
    cat(sprintf('Coverage results (%d simulations, n = %d, α = %.2f): %.1f%% CIs contain μ = %.1f\n\n',
                n, size, alpha, 100*covered/n, mu))
    
    return(data)
  }

dump("ci.sim", file="ci.sim.R")
#' Student's t-Distribution Cumulative Distribution Function
#'
#' Computes the cumulative probability up to x for a Student's t-distribution
#' with v degrees of freedom using the hypergeometric function.
#'
#' @param x Numeric vector of quantiles
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @return Numeric vector of cumulative probabilities
#' @export
#' @examples
#' t_cdf(1.96, 10)  # ~0.96 with 10 df; approaches 0.975 as df grows
#' t_cdf(0, 5)      # Should be 0.5

# Check and install hypergeo package if needed
if (!requireNamespace("hypergeo", quietly = TRUE)) {
  message("Installing required package 'hypergeo'...")
  tryCatch(
    {
      utils::install.packages("hypergeo")
      library(hypergeo, quietly = TRUE)
    },
    error = function(e) {
      stop("Failed to install/load 'hypergeo' package. Error: ", e$message)
    }
  )
}

t_cdf <- function(x, v) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Load required package quietly
  requireNamespace("hypergeo", quietly = TRUE)
  
  # Calculate CDF components
  numerator <- x * gamma((v + 1)/2) * 
    Re(hypergeo::hypergeo(0.5, (v + 1)/2, 1.5, -(x^2)/v))
  denominator <- sqrt(pi * v) * gamma(v/2)
  
  # Return CDF value
  0.5 + numerator/denominator
}

# Save function to file
dump("t_cdf", file = "t_cdf.R")

BASIC USAGE EXAMPLES

Varying the Number of Simulated Samples (n)

ci.sim(mu=4.5, s=1.5, n=15)      # Few simulated samples (coverage estimate is noisy)

## Coverage results (15 simulations, n = 1000, α = 0.05): 80.0% CIs contain μ = 4.5
##             ci.lower ci.upper     mean  
## sample 1       4.343    4.524 4.433831  
## sample 2       4.398    4.586 4.492174  
## sample 3       4.509    4.691 4.599734 *
## sample 4       4.443    4.624 4.533356  
## sample 5       4.406    4.591 4.498532  
## sample 6       4.429    4.617 4.522888  
## sample 7       4.441    4.629 4.535281  
## sample 8       4.294    4.480 4.387117 *
## sample 9       4.479    4.669 4.574234  
## sample 10      4.445    4.633 4.539142  
## sample 11      4.429    4.619 4.524094  
## sample 12      4.313    4.499 4.406290 *
## sample 13      4.448    4.633 4.540126  
## sample 14      4.373    4.564 4.468375  
## sample 15      4.422    4.598 4.510191
ci <- ci.sim(mu=4.5, s=1.5, n=30) # 30 simulated samples (coverage estimate still noisy)

## Coverage results (30 simulations, n = 1000, α = 0.05): 86.7% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=60) # 60 simulated samples

## Coverage results (60 simulations, n = 1000, α = 0.05): 90.0% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=120) # 120 simulated samples (coverage stabilizing)

## Coverage results (120 simulations, n = 1000, α = 0.05): 94.2% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=200) # 200 simulated samples (coverage near 95%)

## Coverage results (200 simulations, n = 1000, α = 0.05): 95.0% CIs contain μ = 4.5
  • Here n is the number of simulated samples, not the sample size; CI width is governed by size (the number of observations per sample).
  • With few simulated samples (e.g., n=15), the observed coverage is a noisy estimate and can stray well above or below 95%.
  • The observed coverage should approach 1-alpha (default 95%) as the number of simulated samples grows.

Varying Sample Sizes (size)

head(ci.sim(mu=0, s=1, n=60, size=100))     # Small samples (100 observations each)

## Coverage results (60 simulations, n = 100, α = 0.05): 96.7% CIs contain μ = 0.0
##            ci.lower ci.upper        mean  
## sample 1     -0.309    0.081 -0.11360393  
## sample 2     -0.185    0.180 -0.00264514  
## sample 3     -0.129    0.266  0.06854359  
## sample 4     -0.215    0.133 -0.04095258  
## sample 5     -0.308    0.116 -0.09618599  
## sample 6     -0.277    0.112 -0.08294606
head(ci.sim(mu=0, s=1, n=60, size=1000))    # Moderate samples (1,000 observations each)

## Coverage results (60 simulations, n = 1000, α = 0.05): 90.0% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.105    0.016 -0.044112342  
## sample 2     -0.068    0.057 -0.005217242  
## sample 3      0.006    0.127  0.066489626 *
## sample 4     -0.038    0.083  0.022237039  
## sample 5     -0.063    0.061 -0.000978905  
## sample 6     -0.047    0.078  0.015258800
head(ci.sim(mu=0, s=1, n=60, size=10000))   # Large samples (10,000 observations each)

## Coverage results (60 simulations, n = 10000, α = 0.05): 98.3% CIs contain μ = 0.0
##            ci.lower ci.upper          mean  
## sample 1     -0.012    0.027  7.752593e-03  
## sample 2     -0.016    0.023  3.402390e-03  
## sample 3     -0.010    0.030  1.005981e-02  
## sample 4     -0.009    0.031  1.121849e-02  
## sample 5     -0.032    0.008 -1.200495e-02  
## sample 6     -0.020    0.020 -6.737494e-05
head(ci.sim(mu=0, s=1, n=60, size=100000))  # Very large samples (100,000 observations each)

## Coverage results (60 simulations, n = 100000, α = 0.05): 93.3% CIs contain μ = 0.0
##            ci.lower ci.upper          mean  
## sample 1     -0.004    0.009  0.0026662204  
## sample 2     -0.007    0.006 -0.0005348168  
## sample 3     -0.007    0.006 -0.0006486533  
## sample 4     -0.006    0.006  0.0002484055  
## sample 5     -0.007    0.006 -0.0003223348  
## sample 6     -0.007    0.006 -0.0006963664
  • size controls the number of observations per sample (not to be confused with n, the number of simulated samples).
  • Larger size reduces the standard error of the mean, so the intervals become narrower (width shrinks roughly as 1/√size).
  • Coverage stays near 1-alpha regardless of size; only the interval width changes.

Varying Confidence Levels (alpha)

# 90% Confidence Level (alpha = 0.10)
head(ci.sim(mu=0, s=1, alpha=0.10, n=5000, plot=FALSE)) # Expected coverage: ~90%
## Coverage results (5000 simulations, n = 1000, α = 0.10): 90.5% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.095    0.007 -0.044112342  
## sample 2     -0.058    0.047 -0.005217242  
## sample 3      0.016    0.117  0.066489626 *
## sample 4     -0.029    0.073  0.022237039  
## sample 5     -0.053    0.051 -0.000978905  
## sample 6     -0.037    0.068  0.015258800
# 95% Confidence Level (alpha = 0.05, DEFAULT)
head(ci.sim(mu=0, s=1, alpha=0.05, n=5000, plot=FALSE)) # Expected coverage: ~95%
## Coverage results (5000 simulations, n = 1000, α = 0.05): 95.3% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.105    0.016 -0.044112342  
## sample 2     -0.068    0.057 -0.005217242  
## sample 3      0.006    0.127  0.066489626 *
## sample 4     -0.038    0.083  0.022237039  
## sample 5     -0.063    0.061 -0.000978905  
## sample 6     -0.047    0.078  0.015258800
# 99% Confidence Level (alpha = 0.01)
head(ci.sim(mu=0, s=1, alpha=0.01, n=5000, plot=FALSE)) # Expected coverage: ~99%
## Coverage results (5000 simulations, n = 1000, α = 0.01): 99.2% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.124    0.035 -0.044112342  
## sample 2     -0.088    0.077 -0.005217242  
## sample 3     -0.013    0.146  0.066489626  
## sample 4     -0.057    0.102  0.022237039  
## sample 5     -0.082    0.080 -0.000978905  
## sample 6     -0.067    0.098  0.015258800
  • Lower alpha = Higher confidence level = Wider CIs (more conservative).
  • Coverage probability should match 1-alpha (e.g., 95% CIs should contain mu ~95% of the time).
  • n=5000 ensures stable estimates of coverage probability (law of large numbers).

4 Bayesian Analysis

Bayesian inference is a method of statistical inference in which we update our beliefs about a parameter or hypothesis as we obtain new data. It’s based on Bayes’ Theorem, which relates the prior belief, the likelihood of the observed data, and the resulting posterior belief.

The fundamental equation is:

\[ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} \]

where:

  • \(P(\theta | D)\): Posterior distribution
  • \(P(D | \theta)\): Likelihood
  • \(P(\theta)\): Prior distribution
  • \(P(D)\): Marginal likelihood (evidence)

Key Components

Component Description
Prior What you believe about the parameter before seeing any data
Likelihood The model: how probable the data is for different values of the parameter
Posterior Updated beliefs about the parameter after seeing the data
Evidence A normalizing constant: makes sure the posterior is a valid probability

Key Steps in Bayesian Inference

(1) Define the Prior Distribution

  • Represents initial beliefs about parameters before observing data.
  • Can be informative (based on domain knowledge) or uninformative (weakly informative, e.g., flat priors).

Example Priors:

  • Normal distribution for a mean parameter: \(\mu ∼ N(0,1)\)
  • Beta distribution for a probability: \(p ∼ Beta(2,2)\)

(2) Define the Likelihood Function

  • Describes how the observed data is generated given the parameters.

Example Likelihoods:

  • Gaussian likelihood: \(y \sim N(\mu, \sigma^2)\)
  • Binomial likelihood: \(k \sim \text{Binomial}(n, p)\)

(3) Compute the Posterior Distribution

  • Combines prior and likelihood using Bayes’ theorem.
  • Often requires Markov Chain Monte Carlo (MCMC) or Variational Inference for computation.

(4) Posterior Analysis & Interpretation

  • Summarize posterior (mean, credible intervals).
  • Check convergence (R-hat, trace plots).
  • Perform posterior predictive checks.

Advantages of Bayesian Inference

  • Incorporates prior knowledge (useful for small datasets).
  • Provides intuitive probabilistic results (credible intervals, not p-values).
  • Handles complex models (hierarchical models, missing data).

Challenges

  • Computationally intensive (MCMC can be slow for large models).
  • Requires careful prior specification (bad priors → misleading results).

Example

Let’s say we want to estimate the bias \(\theta\) of a coin (i.e., probability of heads):

  • Prior: Assume \(\theta \sim Beta(2,2)\) (a moderate belief the coin is fair)
  • Data: Flip the coin 10 times, get 7 heads, 3 tails
  • Likelihood: \(P(x \mid \theta) \propto \theta^7 (1-\theta)^3\)
  • Posterior: \(\theta \mid x \sim Beta(2+7,\ 2+3) = Beta(9,5)\)

So the updated belief is concentrated around 0.64 (the posterior mean is \(9/14 \approx 0.643\)), indicating the coin is likely biased toward heads.
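The conjugate update above can be reproduced in a few lines of R; the prior parameters and data match the example, and qbeta gives an equal-tailed credible interval:

a0 <- 2; b0 <- 2                             # Beta(2,2) prior
heads <- 7; tails <- 3                       # observed data
a_post <- a0 + heads; b_post <- b0 + tails   # posterior Beta(9, 5)
a_post / (a_post + b_post)                   # posterior mean ~0.643
qbeta(c(0.025, 0.975), a_post, b_post)       # central 95% credible interval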

5 Continuous Probability Distributions

Continuous probability distributions model random variables that can assume any value within an interval. Unlike discrete distributions (which describe countable outcomes), continuous distributions characterize measurable phenomena where outcomes form a continuum - such as time, distance, or physical measurements.

Key Properties

  • Uncountable Outcomes: Infinite possible values within ranges
  • Probability Density Function (PDF):
  • Defines relative likelihood via integration: \[P(a \leq X \leq b) = \int_a^b f(x)dx\]
  • Cumulative Distribution (CDF): \[F(x) = P(X \leq x) = \int_{-\infty}^x f(t)dt\]
  • Zero Point Probability: \(P(X = c) = 0\) for exact values

Continuous distributions are fundamental in:

  • Statistical inference (confidence intervals, hypothesis tests).
  • Machine learning (e.g., Gaussian processes, Bayesian methods).
  • Real-world modeling (finance, physics, engineering).

In the following sections, we’ll explore each major distribution in detail, including their PDFs, properties, and practical applications.

Common Continuous Probability Distributions

Distribution Support Key Parameters Key Applications
Normal (-∞, ∞) μ (mean), σ (std dev) Natural phenomena, CLT, inference
t (Student’s) (-∞, ∞) ν (degrees of freedom) Small-sample hypothesis testing
Exponential [0, ∞) λ (rate) Time between events, survival analysis
Beta [0, 1] α, β (shape) Modeling probabilities, Bayesian stats
Gamma (0, ∞) k (shape), θ (scale) Waiting times, reliability engineering

Key Features:

  • Normal: Symmetric, bell-shaped, 68–95–99.7 rule
  • t: Heavier tails than normal, converges to N(0,1) as ν→∞
  • Exponential: Memoryless property, P(X > s+t | X > s) = P(X > t)
  • Beta: Flexible shapes on [0,1] interval
  • Gamma: Generalizes exponential and chi-squared distributions

Each will be explored with their:

  • Probability density functions
  • Key parameters
  • Practical applications
  • Relationships to other distributions

5.1 Probability Density Function (PDF)

Defines the relative likelihood of different outcomes. Since continuous variables have infinitely many possible values, the probability of any exact value is technically zero. Instead, probabilities are calculated over intervals using integration:

  • Denoted as f(x)
  • Describes relative likelihood of different outcomes
  • For any single point: P(X = x) = 0
  • Probabilities are calculated over intervals:

\[ P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx \] where

\[ f(x) \geq 0 \text{ and } \int_{-\infty}^{\infty} f(x) \, dx = 1. \]

Properties:

  • \(f(x) ≥ 0\) for all x
  • Total area under curve = 1
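Both properties can be checked numerically; a minimal sketch using the standard normal density as an example:

integrate(dnorm, lower = -Inf, upper = Inf)$value   # total area under the PDF ~ 1
integrate(dnorm, lower = -1, upper = 1)$value       # P(-1 <= X <= 1) ~ 0.6827
pnorm(1) - pnorm(-1)                                # same interval probability via the CDF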

5.2 Cumulative Distribution Function (CDF)

Gives the probability that a random variable \(X\) is less than or equal to a value \(x\):

\[ F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt \]

Support: The range of possible values (e.g., \((-\infty,\infty)\) for the normal distribution, \((0,\infty)\) for the exponential distribution).

Moments – Summarize key properties:

  • Mean \((\mu)\): Expected value.
  • Variance \((\sigma^2)\): Measure of spread.
  • Skewness & Kurtosis: Describe asymmetry and tail behavior.

Plots of CDF vs PDF of Standard Normal Distribution

# Load required package
if (!requireNamespace("shape", quietly = TRUE)) {
  install.packages("shape")
}
library(shape)

# Define parameters
x_val <- .5  # Our reference x-value (can be changed to any value between -4 and 4)
x <- seq(-4, 4, length.out = 1000)
pdf <- dnorm(x)
cdf <- pnorm(x)

# Set up dual-axis plot
par(mar = c(5, 4, 4, 4) + 0.1)

# Plot PDF (blue curve) with blue y-axis
plot(x, pdf, type = "l", col = "blue", lwd = 3, ylim = c(0, max(pdf)),
     main = bquote("Standard Normal Distribution: PDF and CDF (x" == .(x_val)*")"),
     xlab = "x", ylab = "", axes = FALSE)
axis(1)
axis(2, col = "blue", col.axis = "blue")
mtext("Probability Density (PDF)", side = 2, line = 3, col = "blue")

# Shade area under PDF for P(X ≤ x_val)
x_shade <- seq(-4, x_val, length.out = 200)
y_shade <- dnorm(x_shade)
polygon(c(-4, x_shade, x_val), c(0, y_shade, 0), 
        col = rgb(0.53, 0.81, 0.92, alpha = 0.5), border = NA)

# Add CDF curve (red, right axis)
par(new = TRUE)
plot(x, cdf, type = "l", col = "red", lwd = 3, 
     axes = FALSE, xlab = "", ylab = "", ylim = c(0, 1))
axis(4, at = seq(0, 1, 0.2), col = "red", col.axis = "red")
mtext("Cumulative Probability (CDF)", side = 4, line = 3, col = "red")

# Add key points and lines
points(x_val, pnorm(x_val), pch = 19, col = "red")

# Vertical tracing line
segments(x_val, 0, x_val, pnorm(x_val), col = "red", lwd = 1)

# Horizontal tracing line with curved arrow
shape::Arrows(
  x0 = x_val, y0 = pnorm(x_val), 
  x1 = -4.2, y1 = pnorm(x_val),
  col = "red",
  arr.type = "curved",
  arr.width = 0.2,
  arr.length = 0.6,
  lwd = 1,
  segment = TRUE  # Draw line separately
)

# Value label
text(-4.3, pnorm(x_val), 
     labels = bquote(Phi(.(x_val)) == .(round(pnorm(x_val), 4))), 
     col = "red", pos = 2, cex = 0.9)

# Area annotation with dynamic value
text(-2.7, 0.5, 
     bquote(paste("P(X ≤ ", .(x_val), ") = ", integral(f(x)*dx, -infinity, .(x_val)), " = ", Phi(.(x_val)))), 
     col = "darkblue", cex = 0.9)

# Enhanced legend
legend("topleft",
       legend = c(expression(f(x) == "PDF"), 
                expression(Phi(x) == "CDF"), 
                bquote(P(X <= .(x_val)))),
       col = c("blue", "red", rgb(0.53, 0.81, 0.92)),
       lwd = c(2, 2, NA),
       lty = c(1, 1, NA),
       fill = c(NA, NA, rgb(0.53, 0.81, 0.92, alpha = 0.5)),
       border = NA,
       bty = "n")


5.3 Student’s t-Distribution

The t-distribution is a family of continuous probability distributions that arises when:

  • Estimating the mean of a normally distributed population
  • Sample size is small (typically n < 30)
  • Population standard deviation is unknown

It plays a key role in:

  • Student’s t-test (comparing two sample means)
  • Confidence intervals for population means
  • Linear regression analysis

Parameters \[ \displaystyle{\nu > 0 \text{ degrees of freedom.}} \] Determines the shape of the distribution

Support \[ \displaystyle{x \in \left(-\infty, +\infty \right)} \]

Probability Density Function (PDF) The PDF of the t-distribution is given by:

\[ f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

Alternatively expressed using the Beta function:

\[ f(t) = \frac{1}{\sqrt{\nu}\,B\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

Where:

  • \(\Gamma\) = Gamma function
  • \(B\) = Beta function

Moments

Moment Formula Defined When
Expected Value 0 ν > 1; undefined for ν ≤ 1
Variance \(\frac{ν}{ν-2}\) ν > 2; ∞ for 1 < ν ≤ 2; undefined for ν ≤ 1
Skewness 0 ν > 3; undefined for ν ≤ 3
Excess Kurtosis \(\frac{6}{ν-4}\) ν > 4; ∞ for 2 < ν ≤ 4; undefined for ν ≤ 2

Key Properties

  • Shape:
    • Bell-shaped but heavier tails than normal distribution
    • Approaches N(0,1) as ν → ∞
  • Relationships:
    • Cauchy distribution is t-distribution with ν = 1
    • Normal distribution is t-distribution with ν = ∞
  • Applications:
    • Small sample inference
    • Robust statistical modeling
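A small numeric illustration of the heavier tails (values are approximate):

pt(-3, df = 5)    # P(T < -3) with 5 df, roughly 0.015
pnorm(-3)         # P(Z < -3) for the standard normal, roughly 0.0013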

R Implementation Density of the t-distribution

#' Student's t-Distribution Probability Density Function
#'
#' Computes the probability density function (PDF) for Student's t-distribution
#' at given quantiles for specified degrees of freedom.
#'
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @param x Numeric vector of quantiles (default: seq(-4.5, 4.5, length.out=1e6))
#' @return Numeric vector of probability densities
#' @export
#' @examples
#' # Basic usage
#' t_pdf(v = 5, x = 0)  # Density at 0 with 5 df
#' 
#' # Vectorized usage
#' t_pdf(v = c(1, 5, 30), x = c(-2, 0, 2))
t_pdf <- function(v, x = seq(from = -4.5, to = 4.5, length.out = 1e6)) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Vectorized PDF calculation
  gamma_part <- gamma((v + 1)/2) / (sqrt(v * pi) * gamma(v/2))
  (gamma_part * (1 + (x^2)/v)^(-(v + 1)/2))
}

# Save function to file
dump("t_pdf", file = "t_pdf.R")

R Implementation Cumulative Distribution Function of the t-distribution

\[ F(x) = \frac{1}{2} + x\,\Gamma\left(\frac{\nu+1}{2}\right) \times \frac{{}_2F_1\left(\frac{1}{2},\ \frac{\nu+1}{2};\ \frac{3}{2};\ -\frac{x^2}{\nu}\right)}{\sqrt{\pi\nu}\ \Gamma\left(\frac{\nu}{2}\right)} \] where \(_2F_1\) is the hypergeometric function.

#' Student's t-Distribution Cumulative Distribution Function (CDF)
#'
#' Computes the cumulative probability up to x for Student's t-distribution
#' with v degrees of freedom using the hypergeometric function.
#'
#' @param x Numeric vector of quantiles
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @return Numeric vector of cumulative probabilities P(T ≤ x)
#' @export
#' @examples
#' t_cdf(0, 10)     # Returns 0.5 (median)
#' t_cdf(1.96, 30)  # Returns ~0.97
#' 
t_cdf <- function(x, v) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Check for package installation
  if (!requireNamespace("hypergeo", quietly = TRUE)) {
    stop("Package 'hypergeo' required. Install with install.packages('hypergeo')")
  }
  
  # Calculate components
  numerator <- x * gamma((v + 1)/2) * 
    Re(hypergeo::hypergeo(0.5, (v + 1)/2, 1.5, -x^2/v))
  denominator <- sqrt(pi * v) * gamma(v/2)
  
  # Return CDF value
  0.5 + numerator/denominator
}

# Save function to file
dump("t_cdf", file = "t_cdf.R")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# 1. Critical Value Calculations
calculate_critical_values <- function() {
  expand.grid(
    sd = c(1, 2, 3),           # Standard deviations
    df = c(5, 30, 100)     # Degrees of freedom (Inf = normal)
  ) %>%
    mutate(
      # Calculate critical t-values (two-tailed)
      t_critical = qt(1 - (1 - pnorm(sd))/2, df),
      
      # Calculate exact p-values
      p_value = 2 * (1 - pt(abs(t_critical), df)),
      
      # Format confidence levels
      conf_level = factor(case_when(
        sd == 1 ~ "68%",
        sd == 2 ~ "95%",
        sd == 3 ~ "99.7%"
      ), levels = c("68%", "95%", "99.7%")),
      
      # Clean DF labels
      df_label = ifelse(is.infinite(df), "Normal", paste("df =", df))
    )
}

crit_data <- calculate_critical_values()

# 2. Professional Table Output
crit_data %>%
  select(SDs = sd, 
         `Conf. Level` = conf_level,
         Distribution = df_label,
         `Critical t` = t_critical,
         `p-value` = p_value) %>%
  kable(digits = 4, align = "c", caption = "Critical Values for t-Distributions") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(4, bold = TRUE, color = "red") %>%
  add_header_above(c(" " = 3, "Two-Tailed Test" = 2))
Critical Values for t-Distributions
Two-Tailed Test
SDs Conf. Level Distribution Critical t p-value
1 68% df = 5 1.6558 0.1587
2 95% df = 5 3.2478 0.0228
3 99.7% df = 5 6.4317 0.0013
1 68% df = 30 1.4456 0.1587
2 95% df = 30 2.4011 0.0228
3 99.7% df = 30 3.5338 0.0013
1 68% df = 100 1.4202 0.1587
2 95% df = 100 2.3134 0.0228
3 99.7% df = 100 3.2979 0.0013
# 3. Final Working Visualization
plot_t_distribution <- function(target_df = 30) {
  # Base R calculation of critical values (1, 2, 3 SDs)
  crit_values <- qt(p = 1 - (1 - pnorm(c(1, 2, 3)))/2, df = target_df)
  
  # Create plot with no annotations
  ggplot(data.frame(x = c(-4, 4)), aes(x)) +
    # Distribution curve
    stat_function(fun = dt, args = list(df = target_df), 
                  color = "steelblue", linewidth = 1.2) +
    
    # Critical value lines (no text labels)
    geom_vline(xintercept = c(-crit_values, crit_values),
               color = "red", linetype = "dotted", linewidth = 0.5) +
    
    # Clean formatting
    labs(title = paste("t-Distribution (df =", target_df, ")"),
         x = "t-value", y = "Density") +
    theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          plot.title = element_text(hjust = 0.5, face = "bold")) +
    scale_x_continuous(breaks = round(c(-4, -crit_values, 0, crit_values, 4), 2)) +
    coord_cartesian(xlim = c(-4, 4), ylim = c(0, 0.4))
}

# Generate plots
plot_t_distribution(30)  # For df=30

plot_t_distribution(100) # For df=100

From the t-table (one-tailed test, α = 0.05, df = 24): \(t_{crit} \approx -1.711\).


  1. Conclusion:
    Since \(t = -2.0 < t_{crit} = -1.711\), reject \(H_0\). The bulbs last significantly less than 1,000 hours.
#' One-Sample t-Test Implementation
#'
#' Performs a one-sample t-test comparing the sample mean to a hypothesized population mean.
#'
#' @param x_bar Sample mean (numeric)
#' @param s Sample standard deviation (numeric, positive)
#' @param n Sample size (integer, >1)
#' @param mu0 Hypothesized population mean (numeric)
#' @param alpha Significance level (numeric, 0 < alpha < 1)
#' @param alternative Direction of test ("greater", "less", or "two.sided")
#'
#' @return A list containing test results with the following components:
#' \describe{
#'   \item{t_statistic}{Calculated t-statistic}
#'   \item{critical_value}{Critical t-value(s) for given alpha}
#'   \item{p_value}{Calculated p-value}
#'   \item{decision}{Test decision}
#'   \item{conclusion}{Plain English conclusion}
#' }
#'
#' @examples
#' t_test_results <- one_sample_t_test(
#'   x_bar = 10.2, s = 0.4, n = 15,
#'   mu0 = 10, alpha = 0.05, alternative = "greater"
#' )
one_sample_t_test <- function(x_bar, s, n, mu0, alpha = 0.05, alternative = "greater") {
  # Input validation
  if (n <= 1) stop("Sample size n must be greater than 1")
  if (s <= 0) stop("Sample standard deviation s must be positive")
  if (alpha <= 0 | alpha >= 1) stop("Alpha must be between 0 and 1")
  if (!alternative %in% c("greater", "less", "two.sided")) {
    stop("Alternative must be 'greater', 'less', or 'two.sided'")
  }
  
  # Calculate t-statistic
  t_stat <- (x_bar - mu0) / (s / sqrt(n))
  df <- n - 1
  
  # Determine critical value(s) and p-value
  if (alternative == "greater") {
    t_crit <- qt(1 - alpha, df)
    p_val <- 1 - pt(t_stat, df)
    decision <- if (t_stat > t_crit) "Reject H0" else "Fail to reject H0"
  } else if (alternative == "less") {
    t_crit <- qt(alpha, df)
    p_val <- pt(t_stat, df)
    decision <- if (t_stat < t_crit) "Reject H0" else "Fail to reject H0"
  } else {  # two.sided
    t_crit <- qt(1 - alpha/2, df)
    p_val <- 2 * (1 - pt(abs(t_stat), df))
    decision <- if (abs(t_stat) > t_crit) "Reject H0" else "Fail to reject H0"
  }
  
  # Conclusion
  conclusion <- if (decision == "Reject H0") {
    switch(alternative,
           greater = paste("Evidence suggests population mean is greater than", mu0),
           less = paste("Evidence suggests population mean is less than", mu0),
           two.sided = paste("Evidence suggests population mean differs from", mu0))
  } else {
    paste("Insufficient evidence to conclude population mean differs from", mu0)
  }
  
  # Return results
  list(
    t_statistic = t_stat,
    critical_value = t_crit,
    p_value = p_val,
    decision = decision,
    conclusion = conclusion,
    parameters = list(
      sample_mean = x_bar,
      sample_sd = s,
      sample_size = n,
      hypothesized_mean = mu0,
      significance_level = alpha,
      alternative = alternative,
      df = df
    )
  )
}

# Example usage
test_results <- one_sample_t_test(
  x_bar = 10.2,
  s = 0.4,
  n = 15,
  mu0 = 10,
  alpha = 0.05,
  alternative = "greater"
)

# Print formatted results
cat("One-Sample t-Test Results:\n",
    "t-statistic = ", round(test_results$t_statistic, 3), "\n",
    "Critical value = ", round(test_results$critical_value, 3), "\n",
    "p-value = ", round(test_results$p_value, 4), "\n",
    "Decision: ", test_results$decision, "\n",
    "Conclusion: ", test_results$conclusion, "\n", sep = "")
## One-Sample t-Test Results:
## t-statistic = 1.936
## Critical value = 1.761
## p-value = 0.0366
## Decision: Reject H0
## Conclusion: Evidence suggests population mean is greater than 10

5.4 Normal Distribution

Parameters

\(\quad \displaystyle \mu \in \mathbb{R} \text{ - mean location.}\)

\(\quad \displaystyle \sigma^{2} > 0 \text{ - variance (squared scale).}\)

\(\quad \displaystyle x \in \mathbb{R}.\)

Probability density function, expected value and variance

\[ f(x)=\frac{1}{\sigma\sqrt{2 \pi}}\, e^{ -\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2} \]

Moment Value
Mean \(\mathbb{E}[X] = \mu\)
Variance \(\text{Var}(X) = \sigma^2\)
Skewness \(\gamma_1 = 0\)
Excess Kurtosis \(\gamma_2 = 0\)
MGF \(M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)\)

Characteristics:

  • Symmetric about the mean \(\mu\)
  • Unimodal (single peak at \(x=\mu\))
  • Inflection points at \(x = \mu \pm \sigma\)
  • Fully determined by first two moments (mean and variance)

Comparison of Exact vs Approximate Intervals for \(N(50, 100)\) (\(\mu = 50\), \(\sigma = 10\))

Interval Exact Range Exact Coverage Empirical-Rule Approximation Approximation Coverage
68% \([40.06,\ 59.94]\) 68.00% \([40,\ 60]\) ~68.27%
95% \([30.40,\ 69.60]\) 95.00% \([30,\ 70]\) ~95.45%
99.7% \([20.32,\ 79.68]\) 99.70% \([20,\ 80]\) ~99.73%

Key Observations:

  • 68% Interval:
    • Exact: \(\mu \pm 0.994\sigma\) → \([40.06,\ 59.94]\)
    • Empirical rule uses \(\pm 1\sigma\) → \([40,\ 60]\)
    • Nearly identical for practical purposes
  • 95% Interval:
    • Exact: \(\mu \pm 1.960\sigma\) → \([30.40,\ 69.60]\)
    • Empirical rule uses \(\pm 2\sigma\) → \([30,\ 70]\)
    • Approximation is slightly wider (95.45% coverage)
  • 99.7% Interval:
    • Exact: \(\mu \pm 2.968\sigma\) → \([20.32,\ 79.68]\)
    • Empirical rule uses \(\pm 3\sigma\) → \([20,\ 80]\)
    • Approximation is slightly wider (99.73% vs 99.70% coverage)
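The exact multipliers in the table can be obtained directly from qnorm:

# Exact two-sided z multipliers for 68%, 95%, and 99.7% coverage
qnorm(1 - (1 - c(0.68, 0.95, 0.997)) / 2)   # ~0.994, 1.960, 2.968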
# Enhanced Normal Distribution Plot with Exact Percentages
par(mar = c(5, 4.5, 4, 2) + 0.1,  # Increased top and right margins
    mgp = c(3, 0.7, 0), 
    bg = "white", 
    cex.axis = 0.9, 
    cex.lab = 1,
    family = "sans")  # Consistent font family

# Standard normal parameters
mu <- 0
sigma <- 1

# Calculate EXACT two-sided quantiles for 68%, 95%, and 99.7% coverage
z_values <- list(
  z_68 = qnorm((1 + 0.68)/2),   # 68% coverage (exact: 0.9945σ)
  z_95 = qnorm((1 + 0.95)/2),   # 95% coverage (exact: 1.9600σ)
  z_99 = qnorm((1 + 0.997)/2)   # 99.7% coverage (exact: 2.9677σ)
)

# Generate normal curve with extended range
x <- seq(mu - 4*sigma, mu + 4*sigma, length = 2000)
pdf <- dnorm(x, mu, sigma)

# Create plot with professional styling
plot(x, pdf, type = "l", lwd = 3, col = "#2c3e50",
     xlab = expression(paste("Standard Deviations from Mean (", sigma, ")")),
     ylab = "Probability Density",
     main = "Exact Normal Distribution Coverage\n(Compared to Common Approximations)",
     xaxt = "n", bty = "L", cex.main = 1.1, las = 1, 
     yaxs = "i", ylim = c(0, max(pdf)*1.1),
     panel.first = {
       grid(col = "gray90", lty = 2)
       abline(v = mu, col = "gray70", lwd = 1)
     })

# Custom x-axis with exact values and Greek symbols
axis(1, at = c(mu - z_values$z_99, mu - z_values$z_95, mu - z_values$z_68, 
               mu, 
               mu + z_values$z_68, mu + z_values$z_95, mu + z_values$z_99),
     labels = c(
       expression(paste("-2.748", sigma)),
       expression(paste("-1.960", sigma)),
       expression(paste("-0.994", sigma)),
       expression(mu),
       expression(paste("+0.994", sigma)),
       expression(paste("+1.960", sigma)),
       expression(paste("+2.748", sigma))
     ), padj = 0.5)

# Improved shading function with borders
shade_region <- function(z, col, label_pos, label_col = "darkred") {
  polygon(c(x[x >= -z & x <= z], z, -z),
          c(pdf[x >= -z & x <= z], 0, 0),
          col = adjustcolor(col, 0.35), border = col, lwd = 0.7)
  text(0, dnorm(0)*label_pos, 
       sprintf("%.2f%%", 100*(pnorm(z)-pnorm(-z))),
       col = label_col, font = 2, cex = 0.9)
}

# Shade regions with improved colors
shade_region(z_values$z_68, "#3498db", 0.7)  # 68%
shade_region(z_values$z_95, "#e67e22", 0.5)  # 95%
shade_region(z_values$z_99, "#e74c3c", 0.3)  # 99.7%

# Add reference lines with exact values
abline(v = c(-z_values$z_99, -z_values$z_95, -z_values$z_68, 
             z_values$z_68, z_values$z_95, z_values$z_99), 
       col = "gray50", lty = 3, lwd = 1.2)

# Enhanced legend with exact values
legend("topright", inset = 0.02,
       legend = c(
         expression(paste("μ ± 0.994", sigma, " (68.27%)")),
         expression(paste("μ ± 1.960", sigma, " (95.45%)")),
         expression(paste("μ ± 2.748", sigma, " (99.73%)"))
       ),
       fill = adjustcolor(c("#3498db", "#e67e22", "#e74c3c"), 0.35),
       border = c("#3498db", "#e67e22", "#e74c3c"),
       bty = "n", cex = 0.85, y.intersp = 1.2)

# Informative footnote with comparison
mtext(expression(paste("Common approximations: 68% (1", sigma, "), 95% (2", sigma, "), 99.7% (3", sigma, ")")),
      side = 1, line = 3.5, cex = 0.75, col = "gray30", adj = 0)

# Add subtle credit line
mtext("Exact normal distribution probabilities", 
      side = 1, line = 4.5, cex = 0.6, col = "gray50", adj = 1)


R Implementation of Normal Distribution Probability Density Function (PDF)

#' Normal Distribution Probability Density Function (PDF)
#'
#' Computes the probability density function for a normal distribution with specified
#' mean and standard deviation. Can return either PDF values for given x-values or
#' generate a sequence of x-values covering ±3.5 standard deviations from the mean.
#'
#' @param x Numeric vector of values at which to evaluate the PDF. If NULL (default),
#'           generates a sequence of n values spanning ±3.5 standard deviations.
#' @param mu Mean of the normal distribution (default: 0)
#' @param sd Standard deviation of the normal distribution (default: 1)
#' @param n Number of points to generate when x is NULL (default: 100)
#'
#' @return Numeric vector of probability densities
#'
#' @examples
#' # Default standard normal
#' normal_pdf()
#'
#' # Evaluate at specific points
#' normal_pdf(x = c(-1, 0, 1))
#'
#' # Custom normal distribution
#' normal_pdf(mu = 50, sd = 10, n = 200)
#'
#' @export
#' 
normal_pdf <- function(x = NULL, mu = 0, sd = 1, n = 100) {
  # Input validation
  if (!is.null(x) && !is.numeric(x)) {
    stop("x must be numeric or NULL")
  }
  if (!is.numeric(mu) || length(mu) != 1) {
    stop("mu must be a single numeric value")
  }
  if (!is.numeric(sd) || length(sd) != 1 || sd <= 0) {
    stop("sd must be a single positive numeric value")
  }
  if (!is.numeric(n) || length(n) != 1 || n <= 0) {
    stop("n must be a single positive integer")
  }
  
  # Generate x values if not provided
  if (is.null(x)) {
    x <- seq(from = mu - 3.5 * sd, 
             to = mu + 3.5 * sd, 
             length.out = n)
  }
  
  # Vectorized PDF calculation
  y <- (1 / (sqrt(2 * pi) * sd)) * exp(-((x - mu)^2 / (2 * sd^2)))
  
  # Return named vector
  stats::setNames(y, x)
}

# Save function to file
dump("normal_pdf", file = "normal_pdf.R")
# Parameters
n <- 30       # Sample size
mean <- 50       # Population mean
sd <- 10       # Population standard deviation

# Create normal continuum
x <- seq(mean-4*sd, mean+4*sd, length.out=1024)  # 4 SD range with 1024 points
y <- normal_pdf(x, mu=mean, sd=sd)                # Normal PDF values

# Generate random sample
set.seed(123)                                # For reproducibility
x.norm <- rnorm(n, mean, sd)                   # Random normal sample

# Create plot
plot(x, y, 
     col = adjustcolor("darkgray", alpha.f=0.8), 
     type = "l",
     lwd = 6,
     ylim = c(0, 0.045),
     xlab = "x",
     ylab = "Density f(x)",
     main = paste("Normal Distribution (μ =", mean, ", σ =", sd, ")"),
     panel.first = grid()
)

# Add random normal density
lines(density(x.norm, bw="nrd0"),  # Optimal bandwidth
      lwd = 3,
      col = "black")

# Add reference lines
abline(v = mean(x.norm), lty=2, lwd=2, col="black")    # Sample mean
abline(v = mean, lty=2, lwd=2, col="darkgray")           # Population mean

# Add statistical annotations
legend("topright",
       legend = c(paste("Normal distribution N(", mean, ",", sd^2, ")"), 
                  "Normaly distributed random data",
                  paste("Sample Mean (", round(mean(x.norm),2), ")"),
                  paste("Population Mean (", mean, ")")),
       col = c("darkgray", "black", "black", "darkgray"),
       lty = c(1, 1, 2, 2),
       lwd = c(4, 3, 2, 2),
       bty = "n")

# Add standard deviation indicators
arrows(x0 = mean, y0 = 0.03,
       x1 = mean + sd, y1 = 0.03,
       length = 0.1, angle = 15, code = 3, col = "darkgray")
text(x = mean + sd/2, y = 0.032, 
     labels = expression(paste(sigma, " = 10")),
     pos = 3, cex = 0.9, col = "darkgray")

Ex 1: What is the probability density of a student scoring exactly 600 on the SAT if scores are normally distributed with mean 500 and standard deviation 100?

Given:

  • \(x = 600\) (SAT score)
  • \(\mu = 500\) (mean)
  • \(\sigma = 100\) (standard deviation)

Step 1: Write the PDF Formula \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

Step 2: Plug in Values \[ f(600) = \frac{1}{100 \sqrt{2\pi}} e^{-\frac{(600 - 500)^2}{2 \times 100^2}} \]

Step 3: Simplify the Exponent \[ (600 - 500)^2 = 100^2 = 10{,}000 \] \[ 2 \times 100^2 = 20{,}000 \] \[ \frac{10{,}000}{20{,}000} = 0.5 \]

Step 4: Compute Constants \[ \sqrt{2\pi} \approx 2.5066 \quad \text{and} \quad e^{-0.5} \approx 0.6065 \]

Step 5: Final Calculation \[ f(600) = \frac{1}{100 \times 2.5066} \times 0.6065 \] \[ f(600) \approx \frac{0.6065}{250.66} \approx 0.002419 \]

Result The probability density at \(x = 600\) is \(≈0.00242\).

Using the R function:

normal_pdf(x = 600, mu = 500, sd = 100)
##         600 
## 0.002419707

Interpretation

  • The probability density of an SAT score of exactly 600 (given \(N(500, 100)\)) is \(\approx 0.00242\).
  • This is the relative likelihood (height of the PDF curve) at x = 600, not a probability.

Comparison - At the mean (x = 500):

normal_pdf(x = 500, mu = 500, sd = 100)  # Output: 500: 0.003989423
##         500 
## 0.003989423

(Higher density, as expected for the peak of the curve.)

  • At x = 400 (same distance from mean):
normal_pdf(x = 400, mu = 500, sd = 100)  # Output: 400: 0.002419707
##         400 
## 0.002419707

(Identical density to x = 600 due to symmetry.)

Normal Probability Density \((\mu = 500, \sigma = 100)\)

Score (x) Distance from Mean PDF Value (f(x)) Relative to Peak Interpretation
400 -1σ 0.00242 60.7% 1 standard deviation below
500 0σ (mean) 0.00399 100% Peak of the distribution
600 +1σ 0.00242 60.7% 1 standard deviation above

Key Properties:

  • Symmetry: Equal density at equal \(\sigma\)-distances from mean (\(f(400) = f(600)\))
  • Peak: Maximum density occurs at the mean (\(x = \mu\))
  • 68-95-99.7 Rule:
    • 68% of data falls between 400–600 (\(\pm 1\sigma\))
    • 95% between 300–700 (\(\pm 2\sigma\))
    • 99.7% between 200–800 (\(\pm 3\sigma\))
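These coverages can be verified directly with pnorm (the exact values are 68.27%, 95.45%, and 99.73%):

pnorm(c(600, 700, 800), mean = 500, sd = 100) -
  pnorm(c(400, 300, 200), mean = 500, sd = 100)   # ~0.6827, 0.9545, 0.9973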

Example Analysis

# Install packages if needed
if (!require(pracma)) install.packages("pracma", dependencies = TRUE)
## Loading required package: pracma
if (!require(tidyr)) install.packages("tidyr", dependencies = TRUE)
## Loading required package: tidyr
library(pracma)
library(ggplot2)
library(tidyr)

# Observed score
x_obs <- 600

# Define theta (mu) range
theta_vals <- seq(300, 900, length.out = 600)

# Known standard deviation
sigma <- 100

# Likelihood function: PDF of N(theta, sigma^2) at x = 600
likelihood <- dnorm(x_obs, mean = theta_vals, sd = sigma)

# Prior: Normal(480, 50^2)
prior <- dnorm(theta_vals, mean = 480, sd = 50)

# Unnormalized posterior
posterior_unnorm <- prior * likelihood

# Normalize the posterior
posterior <- posterior_unnorm / trapz(theta_vals, posterior_unnorm)

# Create a data frame for plotting
df <- data.frame(
  theta = theta_vals,
  Prior = prior,
  Likelihood = likelihood,
  Posterior = posterior
)

# Reshape to long format for ggplot
df_long <- pivot_longer(df, cols = c("Prior", "Likelihood", "Posterior"),
                        names_to = "Distribution", values_to = "Density")

# Define colors to use consistently
dist_colors <- c("Prior" = "green", "Likelihood" = "blue", "Posterior" = "purple")

# Create colored subtitle with markdown-style colors using element_markdown (requires ggtext package)
if (!require(ggtext)) install.packages("ggtext", dependencies = TRUE)
## Loading required package: ggtext
library(ggtext)

subtitle_text <- paste0(
  "<span style='color:green;'>Prior</span> × ",
  "<span style='color:blue;'>Likelihood</span> → ",
  "<span style='color:purple;'>Posterior</span>"
)

# Set factor levels to control legend order
df_long$Distribution <- factor(df_long$Distribution, levels = c("Prior", "Likelihood", "Posterior"))

# Plot prior, likelihood, and posterior with consistent colors and line types

ggplot(df_long, aes(x = theta, y = Density, color = Distribution, linetype = Distribution)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = x_obs, color = "red", linetype = "dotted") +
  scale_color_manual(values = dist_colors) +
  scale_linetype_manual(values = c("Prior" = "dashed", "Likelihood" = "dashed", "Posterior" = "solid")) +
  labs(
    title = "Bayesian Inference for SAT Score Observation (x = 600)",
    subtitle = subtitle_text,
    x = expression(theta~"(Mean SAT Score)"),
    y = "Density",
    color = "Distribution",
    linetype = "Distribution"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_markdown(size = 12, hjust = 0.5),
    legend.position = "top"
  )


Cumulative Distribution Function (cdf) Plot: Normal Distribution

# Generate data (if not already created)
set.seed(123)
x.norm <- rnorm(30, mean = 50, sd = 10)

# Create CDF plot with enhanced features
plot(ecdf(x.norm), 
     verticals = FALSE,         # Suppress vertical lines at the jumps
     do.points = FALSE,         # Don't show individual points
     col.01line = "gray70",     # Color for 0-1 lines
     lwd = 4, 
     col = "lightgray", 
     xlab = "Value (k)", 
     ylab = "Cumulative Probability F(k)",
     main = "Normal Distribution CDF (μ=50, σ=10)",
     cex.main = 1.1
)

# Add population normal CDF
curve(pnorm(x, mean = 50, sd = 10), 
      add = TRUE, 
      col = "black", 
      lwd = 1, 
      lty = 1)

# Add legend
legend("topleft",
       legend = c("Random sample data CDF", "Population CDF"),
       col = c("lightgray", "black"),
       lwd = c(3, 1),
       lty = c(1, 1),
       bty = "n")

# Add key percentiles
q <- quantile(x.norm, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
abline(v = q, col = "gray50", lty = 3)
text(x = q, y = 0.1, 
     labels = names(q), 
     pos = 3, 
     cex = 0.8, 
     col = "gray30")

# Add distribution parameters
text(x = min(x.norm) + 5, y = 0.9,
     labels = bquote(paste("Sample: ", 
                           bar(x) == .(round(mean(x.norm), 2)), 
                           ", s = ", .(round(sd(x.norm), 2)))),
     pos = 4, cex = 0.9)

text(x = min(x.norm) + 5, y = 0.8,
     labels = bquote(paste("population: ", 
                           mu == 50, 
                           ", ", sigma == 10)),
     pos = 4, cex = 0.9)


What is the probability that a randomly selected student scores less than 75 on an exam with a mean of 65 and a standard deviation of 10?

library(shape)

# Probability of scoring ≤ 75
pnorm(75, mean = 65, sd = 10) 
## [1] 0.8413447
# Parameters
mean <- 65
sd <- 10
cutoff <- 75

# Create sequence of x values
x <- seq(mean - 4*sd, mean + 4*sd, length.out = 1000)

# Calculate PDF and CDF
pdf <- dnorm(x, mean, sd)
cdf <- pnorm(x, mean, sd)

# Plot PDF
plot(x, pdf, type = "l", lwd = 2, col = "blue",
     xlab = "Test Score", ylab = "Density",
     main = "Normal Distribution: P(X ≤ 75)")

# Shade area for P(X ≤ 75)
x_shade <- seq(mean - 4*sd, cutoff, length.out = 1000)
y_shade <- dnorm(x_shade, mean, sd)
polygon(c(x_shade, cutoff), c(y_shade, 0), col = "skyblue", border = NA)

# Add cutoff line
abline(v = cutoff, col = "red", lty = 2, lwd = 2)

# Calculate probability
prob <- round(pnorm(cutoff, mean, sd), 3)

# Add probability INSIDE the shaded area (centered)
text(x = mean - .5*sd,  # Position in the middle-left of shaded area
     y = dnorm(mean, mean, sd)/3,  # Vertical position
     labels = bquote(P(X <= .(cutoff)) == .(prob)),
     col = "darkblue", cex = 1)

# Add parameters
legend("topright", 
       legend = c(paste0("μ = ", mean), paste0("σ = ", sd)),
       bty = "n")

# Create CDF plot
plot(x, cdf, type = "l", lwd = 3, col = "blue",
     xlab = "Test Score", ylab = "Cumulative Probability",
     main = "Normal CDF: P(X ≤ 75)")

# Add tracing lines
prob <- pnorm(cutoff, mean, sd)

# horizontal tracing line
segments(mean - 5*sd, prob, cutoff, prob, col = "red", lty = 1, lwd = 1)
# Left arrow (y-axis side)
Arrows(mean - 4.1*sd, prob, mean - 4*sd + 0.5, prob, code = 1,
       col = "red", arr.type = "triangle", arr.length = 0.4)

# vertical tracing line
segments(cutoff, -1, cutoff, prob, col = "red", lty = 1, lwd = 1)

points(cutoff, prob, pch = 19, col = "white", cex = .5)
points(cutoff, prob, pch = 19, col = "blue", cex = .4)

# Add probability annotation
text(mean - 1.5*sd, prob + 0.05, 
     bquote(P(X <= .(cutoff)) == .(round(prob, 3))),
     col = "red", cex = 1)

# Add axis markers
axis(1, at = cutoff, labels = round(cutoff, 2), las = 1, col.axis = "red")
axis(2, at = prob, labels = round(prob, 2), las = 1, col.axis = "red")

# Add grid
grid()


Comparison of the Normal and t-Distributions

The t-distribution is similar to the normal distribution but has heavier tails, which makes it more suitable for small sample sizes or when the population variance is unknown.

  • As the degrees of freedom (df) increase, the t-distribution approaches the normal distribution.
  • When df ≥ 30, the t-distribution is well approximated by the normal distribution.
  • For smaller degrees of freedom, the t-distribution has heavier tails, reflecting greater uncertainty.
Property Normal Distribution (Z) t-Distribution
Shape Fixed bell curve Heavier tails
Parameters μ (mean), σ (std dev) μ (mean), s (scale), ν (df)
Tail Behavior Light tails (kurtosis = 3) Heavy tails (kurtosis > 3)
Variance σ² (constant) ν/(ν-2) for ν > 2
Primary Use Case Known population variance Small samples (unknown σ²)

Convergence Behavior

  • As degrees of freedom (ν) increase:
    • t-distribution approaches normal distribution (𝒩(0,1))
    • Convergence is mathematically guaranteed as ν → ∞
  • Practical convergence threshold:
    • At ν = 30: maximum CDF difference < 0.01
  • For 95% confidence intervals:
    • Normal: μ ± 1.96σ
    • t(df=5): μ ± 2.57σ (wider interval)
    • t(df=30): μ ± 2.04σ (closer to normal)
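The quoted interval multipliers come from the t quantile function and converge to the normal value as the degrees of freedom grow; a quick check:

qt(0.975, df = c(5, 30, 1000))   # ~2.571, 2.042, 1.962
qnorm(0.975)                     # ~1.960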
# Define the support range
x <- seq(-3.5, 3.5, length.out = 500)

# Set up the plot canvas
plot(x, dnorm(x), 
     type = 'l', 
     lwd = 4,
     col = "black",
     ylim = c(0, 0.42),
     xlab = "Value",
     ylab = "Probability Density",
     main = "Normal Distribution vs Student's t-Distribution",
     cex.main = 1.2,
     panel.first = grid())

# Define degrees of freedom to compare
dfs <- c(1, 2, 5, 30)
colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3")  # ColorBrewer palette

# Add t-distributions
for (i in seq_along(dfs)) {
  lines(x, dt(x, df = dfs[i]), 
        col = colors[i], 
        lwd = 3, 
        lty = i+1)
}

# Add reference lines
abline(v = 0, col = "gray50", lty = 2)
abline(h = 0, col = "gray50", lty = 2)

# Add informative legend
legend("topright",
       legend = c(expression(paste("Normal (", mu==0, ", ", sigma==1, ")")),
                  expression(paste("t (df==", 1, ")")),
                  expression(paste("t (df==", 2, ")")),
                  expression(paste("t (df==", 5, ")")),
                  expression(paste("t (df==", 30, ")"))),
       col = c("black", colors),
       lty = c(1, 2:5),
       lwd = 3,
       bty = "n",
       cex = 0.9)

# Add tail behavior annotations
text(x = -2.8, y = 0.05, 
     labels = "Heavier tails\nfor lower df", 
     pos = 3, cex = 1, col = "darkred")
arrows(x0 = -2.8, y0 = 0.09, 
       x1 = -2.8, y1 = 0.15, 
       length = 0.1, col = "darkred")

# Add convergence annotation
text(x = 0, y = 0.1,
     labels = "t → Normal as df → ∞",
     pos = 3, cex = 1, col = "darkblue")


5.5 Chi-Squared Distribution

The chi-squared \((\chi^2)\) distribution is a continuous probability distribution that arises when:

  • Summing the squares of independent standard normal variables
  • Analyzing categorical data in contingency tables
  • Conducting goodness-of-fit tests

It plays a key role in:

  • Pearson’s chi-squared tests
  • Variance estimation
  • Likelihood ratio tests

Parameters

k > 0 degrees of freedom (shape parameter)
Determines the distribution’s shape and spread

Support
\(x \in [0, +\infty)\)

Probability Density Function (PDF) The PDF of the chi-squared distribution is:

\[ f(x; k) = \begin{cases} \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(k/2)} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} \]

where:
- \(\Gamma\) = Gamma function

Moments:

Moment Formula Defined When
Expected Value k k > 0
Variance 2k k > 0
Skewness \(\sqrt{8/k}\) k > 0
Excess Kurtosis 12/k k > 0

5.6 Key Properties

  • Shape:
    • Right-skewed, becomes more symmetric as k increases
    • Approaches N(k, 2k) as k → ∞
  • Relationships:
    • Special case of Gamma distribution (shape = k/2, scale = 2)
    • Sum of independent χ² variables is χ² (degrees of freedom add)
  • Applications:
    • Hypothesis testing for categorical data
    • Confidence intervals for variance
    • Spectral analysis
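Before the implementations below, a brief numerical sketch (base R only) of two of the properties just listed: additivity of independent chi-squared variables and the normal approximation for large \(k\).

# Additivity: the sum of independent chi-squared variables is chi-squared
set.seed(42)
s <- rchisq(1e5, df = 3) + rchisq(1e5, df = 7)
c(mean = mean(s), variance = var(s))   # close to k = 10 and 2k = 20

# Normal approximation: chi-squared(k) is roughly N(k, 2k) for large k
k <- 200
q <- k + c(-2, -1, 0, 1, 2) * sqrt(2 * k)
round(rbind(chisq  = pchisq(q, df = k),
            normal = pnorm(q, mean = k, sd = sqrt(2 * k))), 3)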

R Implementation: Probability Density Function

#' Chi-Squared Distribution Probability Density Function
#'
#' Computes the PDF of the chi-squared distribution at given quantiles.
#'
#' @param k Degrees of freedom (k > 0)
#' @param x Numeric vector of quantiles (default: seq(0, 20, length.out = 1000))
#' @return Numeric vector of probability densities
#' @export
#' @examples
#' chisq_pdf(k = 5, x = 3.2)  # Density at x=3.2 with 5 df
chisq_pdf <- function(k, x = seq(0, 20, length.out = 1000)) {
  if (any(k <= 0)) stop("Degrees of freedom (k) must be positive")
  dgamma(x, shape = k/2, rate = 0.5)  # Using Gamma relationship
}
#' Chi-Squared Distribution CDF
#'
#' Computes P(X ≤ x) for chi-squared distribution with k degrees of freedom.
#'
#' @param x Numeric vector of quantiles
#' @param k Degrees of freedom (k > 0)
#' @return Cumulative probabilities
#' @export
#' @examples
#' chisq_cdf(10, k = 5)  # P(X ≤ 10) with 5 df
chisq_cdf <- function(x, k) {
  if (any(k <= 0)) stop("Degrees of freedom (k) must be positive")
  pgamma(x, shape = k/2, rate = 0.5)
}
#' Chi-Squared Distribution Visualization
#'
#' Creates a publication-quality plot of the chi-squared distribution.
#'
#' @param chisq_stat Calculated chi-squared statistic
#' @param df Degrees of freedom (k > 0)
#' @param p_value Calculated p-value
#' @param alpha Significance level (default = 0.05)
#' @param n Number of points for smooth curve (default = 1000)
#' @return A ggplot object
#' @export
#' @examples
#' chisq_plot(chisq_stat = 8.5, df = 5, p_value = 0.075)
chisq_plot <- function(chisq_stat, df, p_value, alpha = 0.05, n = 1000) {
  require(ggplot2)
  
  critical <- qchisq(1 - alpha, df)
  x_max <- max(critical * 1.5, chisq_stat * 1.2)
  x <- seq(0, x_max, length.out = n)
  plot_data <- data.frame(x = x, y = dchisq(x, df))
  
  ggplot(plot_data, aes(x, y)) +
    geom_line(color = "#333333", linewidth = 1.2) +
    geom_area(data = subset(plot_data, x <= critical), 
              aes(fill = "Acceptance"), alpha = 0.6) +
    geom_area(data = subset(plot_data, x >= critical), 
              aes(fill = "Rejection"), alpha = 0.6) +
    geom_vline(xintercept = critical, linetype = "dashed", color = "#76B7B2") +
    geom_vline(xintercept = chisq_stat, color = "#F28E2B", linewidth = 1.5) +
    annotate("label", x = chisq_stat, y = max(plot_data$y) * 0.85,
             label = sprintf("χ² = %.2f\np = %.3f", chisq_stat, p_value),
             hjust = ifelse(chisq_stat > critical, 1.1, -0.1)) +
    scale_fill_manual(values = c("Acceptance" = "#4E79A7", "Rejection" = "#E15759")) +
    labs(title = bquote(chi^2 ~ "Distribution" ~ (df == .(df))),
         x = "Chi-Squared Value", y = "Density") +
    theme_minimal()
}

# Save all functions
dump(c("chisq_pdf", "chisq_cdf", "chisq_plot"), file = "chi_squared_functions.R")
# Typical usage (non-significant example: chi-squared = 15 with 10 df gives p ≈ 0.132)
chisq_plot(chisq_stat = 15, df = 10, p_value = 0.132)

# Significant result example (chi-squared = 21 with 10 df gives p ≈ 0.021)
chisq_plot(chisq_stat = 21, df = 10, p_value = 0.021)

#' Chi-Squared Variance Confidence Interval
#'
#' Computes confidence intervals for population variance using custom chi-square functions
#'
#' @param x Numeric vector of sample data
#' @param alpha Significance level (default = 0.05)
#' @return List containing point estimate and CI
#' @export
#' @examples
#' battery_data <- c(23, 25, 21, 28, 24, 26, 22, 27, 25, 24)
#' chisq_var_ci(battery_data)
chisq_var_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  s2 <- var(x)
  df <- n - 1
  
  # Chi-squared critical values for a two-sided (1 - alpha) interval
  chi_lower <- qchisq(alpha/2, df)       # lower-tail critical value
  chi_upper <- qchisq(1 - alpha/2, df)   # upper-tail critical value
  
  ci <- c(
    lower = (n-1)*s2 / chi_upper,
    upper = (n-1)*s2 / chi_lower
  )
  
  list(
    variance_estimate = s2,
    degrees_of_freedom = df,
    confidence_level = 1 - alpha,
    ci = ci
  )
}

#' Chi-Square Critical Value Finder
#'
#' @param alpha Right-tail probability
#' @param df Degrees of freedom
#' @export
chisq_critical <- function(alpha, df) {
  qchisq(1 - alpha, df)
}
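A short usage sketch for the two helpers above, using the hypothetical battery-life data from the chisq_var_ci roxygen example:

# Hypothetical sample of battery lifetimes (from the example above)
battery_data <- c(23, 25, 21, 28, 24, 26, 22, 27, 25, 24)

# 95% confidence interval for the population variance
chisq_var_ci(battery_data)

# Upper-tail critical value for a test at alpha = 0.05 with 9 degrees of freedom
chisq_critical(alpha = 0.05, df = 9)   # same as qchisq(0.95, 9), about 16.92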

5.7 Uniform Distribution

Parameters

\(\quad \displaystyle -\infty < a < b < \infty \text{ - minimum and maximum values.}\)

\(\quad \displaystyle x \in \lbrack a,b \rbrack.\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x) = \begin{cases}\dfrac{1}{b-a}& \text{for a < x < b} \\ \\ \quad 0 & \text{otherwise} \end{cases}\).

Moment Formula Notes
Mean \(\displaystyle E(X) = \frac{a + b}{2}\) Midpoint of \([a, b]\)
Variance \(\displaystyle Var(X) = \frac{(b - a)^2}{12}\) Scales with interval width squared
Skewness \(\displaystyle \gamma_1 = 0\) Symmetric distribution
Ex. Kurtosis \(\displaystyle \gamma_2 = -\frac{6}{5}\) Flatter than normal distribution

Key Properties:

  • Support: \(x \in [a, b]\)
  • PDF: \(\displaystyle f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}\)
  • CDF: \(\displaystyle F(x) = \frac{x - a}{b - a}\) for \(x \in [a, b]\)

Probability Density Function (pdf): Uniform Distribution

# Parameters
n <- 300       # Sample size
a <- 0           # Minimum
b <- 1           # Maximum

# Generate data
set.seed(123)    # For reproducibility
x.unif <- runif(n, min = a, max = b)
x <- seq(a, b, length.out = n)
d.unif <- dunif(x, min = a, max = b)

# Create PDF plot
plot(x, d.unif,
     type = "l",
     ylim = c(0, max(d.unif)*1.2),
     col = adjustcolor("black", alpha.f = 0.8),
     lwd = 4,
     xlab = "x",
     ylab = "Probability Density f(x)",
     main = "Uniform Distribution PDF (a=0, b=1)",
     panel.first = grid()
)

# Add sample density (kernel density estimate)
lines(density(x.unif, adjust = 0.5),  # Adjusted bandwidth
      lwd = 4,
      col = adjustcolor("lightgray", alpha.f = 0.5))

# Add reference lines
abline(v = mean(x.unif), col = "black", lty = 2, lwd = 2)  # Sample mean
abline(v = (a+b)/2, col = "lightgray", lty = 2, lwd = 2)   # population mean

# Add statistical annotations
legend("bottom",
       legend = c("Population PDF", "Random Sample Data Density",
                  "Sample Mean", "Population Mean"),
       col = c("black", "lightgray", "black", "lightgray"),
       lty = c(1, 1, 2, 2),
       lwd = 2,
       bty = "n")

text(x = 0.1, y = .5,
     labels = bquote(paste("Population distribution: ", 
                           E(X) == .((a+b)/2), ", ",
                           Var(X) == .((b-a)^2/12))),
     pos = 4, cex = 0.9)

text(x = 0.1, y = .45,
     labels = bquote(paste("Random uniform data: ", 
                           bar(x) == .(round(mean(x.unif), 4)), ", ",
                           s^2 == .(round(var(x.unif), 4)))),
     pos = 4, cex = 0.9)


Ex: If students’ scores on a pass/fail exam are uniformly distributed between \(50-100\%\), what percentage scored between \(70-85\%\)?

# Calculate probability
prob <- punif(85, min = 50, max = 100) - punif(70, min = 50, max = 100)
prob # 0.3 (30%)
## [1] 0.3
# Enhanced plot
library(ggplot2)
ggplot(data.frame(x = c(0, 100)), aes(x)) +
  stat_function(fun = dunif, args = list(min = 50, max = 100), 
                lwd = 1.5, color = "blue") +
  geom_area(stat = "function", fun = dunif, args = list(min = 50, max = 100),
            xlim = c(70, 85), fill = "red", alpha = 0.3) +
  # Add vertical lines at boundaries
  geom_vline(xintercept = c(70, 85), linetype = "dashed", color = "red") +
  # Add probability label
  annotate("text", x = 77.5, y = 0.015, 
           label = paste0("P = ", prob), color = "darkred", size = 3) +
  labs(title = "Uniform Distribution of Exam Scores (50-100%)",
       subtitle = "Probability of scores between 70-85",
       x = "Score (%)", y = "Density") +
  scale_x_continuous(breaks = c(50, 70, 85, 100)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))


Uniform Distribution Analysis using the CDF Function

Given:

  • Uniform Distribution Range: Scores are uniformly distributed between 50% and 100%.
  • Interval of Interest: Percentage of students who scored between 70% and 85%.

Step 1: Understand the Uniform Distribution

For a continuous uniform distribution over the interval [a, b], we have:

Probability Density Function (PDF): \[ f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases} \]

given: - \(a = 50\) - \(b = 100\)

\[ f(x) = \frac{1}{100-50} = \frac{1}{50} \]

Uniform Cumulative Distribution Function (CDF): \[ F(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } a \leq x \leq b \\ 1 & \text{for } x > b \end{cases} \]

Substituting values: \[ F(x) = \frac{x-50}{50} \]

Step 2: Calculate Probability for [70, 85]

The probability that a score \(X\) falls between 70% and 85% is: \[ P(70 \leq X \leq 85) = F(85) - F(70) \]

Compute F(85): \[ F(85) = \frac{85-50}{50} = \frac{35}{50} = 0.7 \]

Compute F(70): \[ F(70) = \frac{70-50}{50} = \frac{20}{50} = 0.4 \]

Final Probability: \[ P(70 \leq X \leq 85) = 0.7 - 0.4 = 0.3 \quad \text{(30%)} \]

# Define the uniform CDF
uniform_cdf <- function(x, a = 50, b = 100) {
  ifelse(x < a, 0, ifelse(x > b, 1, (x - a) / (b - a)))
}

# Generate x-values
x <- seq(40, 110, by = 0.1)
cdf_values <- uniform_cdf(x)

# Plot the CDF
plot(x, cdf_values, type = "l", lwd = 2, col = "blue",
     main = "Uniform CDF: P(70 ≤ X ≤ 85) = F(85) - F(70)",
     xlab = "Exam Score (%)", ylab = "Cumulative Probability (F(x))",
     xlim = c(40, 110), ylim = c(0, 1.1))

# Shade the area between 70 and 85
polygon(c(70, seq(70, 85, 0.1), 85), 
        c(0, uniform_cdf(seq(70, 85, 0.1)), 0),
        col = "lightblue", border = NA)

# Add vertical lines at key points
abline(v = c(50, 70, 85, 100), lty = 2, col = "gray")

# Add horizontal tracing lines for F(70) and F(85)
abline(h = uniform_cdf(70), lty = 2, col = "red")  # F(70) = 0.4
abline(h = uniform_cdf(85), lty = 2, col = "red")  # F(85) = 0.7

# Annotate the difference (probability)
arrows(x0 = 70, y0 = uniform_cdf(70), 
       x1 = 70, y1 = uniform_cdf(85),
       code = 3, length = 0.1, col = "darkred", lwd = 2)
text(65, 0.55, "F(85) - F(70) = 0.3", col = "darkred", cex = 1.2, pos = 2)

# Label key points
text(70, -0.05, "70%", pos = 1, col = "red", xpd = TRUE)
text(85, -0.05, "85%", pos = 1, col = "red", xpd = TRUE)
text(50, -0.05, "50%", pos = 1, col = "red", xpd = TRUE)
text(100, -0.05, "100%", pos = 1, col = "red", xpd = TRUE)

# Add CDF formula
text(70, 0.9, expression(F(x) == frac(x - 50, 50)), cex = 2)


Visualization of the Uniform CDF

The CDF is a straight line from (50, 0) to (100, 1) with slope \(\frac{1}{50}\).

Key points:

  • \(F(50) = 0\)
  • \(F(70) = 0.4\)
  • \(F(85) = 0.7\)
  • \(F(100) = 1\)

The probability corresponds to the vertical difference between F(85) and F(70) on the CDF plot.
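These key points can be confirmed directly with the built-in CDF (punif) or with the uniform_cdf helper defined above:

# CDF values at the key points listed above
punif(c(50, 70, 85, 100), min = 50, max = 100)   # 0.0 0.4 0.7 1.0

# The probability is the vertical difference F(85) - F(70)
punif(85, min = 50, max = 100) - punif(70, min = 50, max = 100)   # 0.3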

5.8 Gamma Distribution

Parameters

\(\quad \alpha > 0\, - \, shape.\)

\(\quad \beta > 0\, - \, scale.\)

\(\quad x \in (0, \infty).\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha-1}e^{-\frac{x}{\beta}} \quad where \, \alpha > 0 \, (shape), \, and \,\beta > 0 \, (scale)\).

Moment Formula Parameters
Mean \(\displaystyle E(X) = k\theta\) \(k>0\) (shape), \(\theta>0\) (scale)
Variance \(\displaystyle Var(X) = k\theta^2\) Grows with both shape and scale
Skewness \(\displaystyle \gamma_1 = \frac{2}{\sqrt{k}}\) → 0 as \(k \to \infty\) (more symmetric)
Ex. Kurtosis \(\displaystyle \gamma_2 = \frac{6}{k}\) → 0 as \(k \to \infty\) (less peaked)

Parameterization Notes:

  • Alternative Form: Often written as \(\Gamma(\alpha, \beta)\) where \(\alpha=k\) (shape), \(\beta=1/\theta\) (rate)
  • Special Cases:
    • Exponential: \(k=1\)
    • Chi-squared: \(k=\nu/2\), \(\theta=2\)
  • Support: \(x \in (0, \infty)\)
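A quick sketch checking the special cases noted above: with shape \(k = 1\) the Gamma density reduces to the exponential density, and with shape \(k = \nu/2\), scale \(\theta = 2\) it reduces to the chi-squared density.

x_check <- c(0.5, 1, 2, 5)

# Exponential(rate) is Gamma(shape = 1, scale = 1/rate)
all.equal(dgamma(x_check, shape = 1, scale = 2),
          dexp(x_check, rate = 1/2))        # TRUE

# Chi-squared(df = nu) is Gamma(shape = nu/2, scale = 2)
nu <- 6
all.equal(dgamma(x_check, shape = nu/2, scale = 2),
          dchisq(x_check, df = nu))         # TRUE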

Probability Density Function (pdf): Gamma Distribution

#' Gamma Distribution Probability Density Function (PDF)
#'
#' Computes the probability density function for a Gamma distribution with
#' specified shape (α) and scale (θ) parameters.
#'
#' @param x Numeric vector of values at which to evaluate the PDF
#' @param shape Shape parameter (α > 0)
#' @param scale Scale parameter (θ > 0)
#' @return Numeric vector of probability densities
#' @examples
#' # Standard exponential distribution (α=1, θ=1)
#' gamma_pdf(1:5, shape=1, scale=1)
#'
#' # Gamma(α=2, θ=0.5)
#' gamma_pdf(seq(0, 5, 0.1), shape=2, scale=0.5)
#' @export

gamma_pdf <- function(x, shape, scale) {
  
  # Input validation
  if (missing(shape) || missing(scale)) {
    stop("Both shape and scale parameters must be specified")
  }
  if (shape <= 0 || scale <= 0) {
    stop("Shape and scale parameters must be positive")
  }
  if (!is.numeric(x)) {
    stop("Input x must be numeric")
  }
  
  # Handle negative x values
  pdf <- numeric(length(x))
  pdf[x < 0] <- 0  # Gamma distribution has support x ≥ 0
  
  # Calculate PDF for x ≥ 0
  positive_x <- x >= 0
  pdf[positive_x] <- (1 / (scale^shape * gamma(shape))) * 
    x[positive_x]^(shape - 1) * 
    exp(-(x[positive_x] / scale))
  
  return(pdf)
}

# Save function to file
dump("gamma_pdf", file = "gamma_pdf.R")
# sample size
n <- 30
alpha <- 1
beta <- 1

# population data
x <- seq(0, 10, length=2e2)
y <- gamma_pdf(x, shape=alpha, scale=beta)

# random sample data
x.gamma <- sort(rgamma(n=n, shape=alpha, scale=beta))
d.gamma <- gamma_pdf(x=x.gamma, shape=alpha, scale=beta)

# Gamma Distribution: population density vs. sample density estimate
plot(x, y,
     type="l",
     ylim=c(0, max(d.gamma)*1.1),
     col=adjustcolor("darkgray", alpha.f=0.8),
     lwd=8,
     xlab="x",
     ylab="f(x)",
     main="Random sample (black) vs. Population (gray) \n Gamma Distribution")

lines(density(x.gamma),
      type="l",
      lwd=4)

abline(v=mean(x.gamma), lty=3) # mean
abline(v=1, col="darkgray", lty=3) # expected value

text(3, .6, cex=1.2,
     paste("mean=", format(mean(x.gamma), digits=4), "\n",
           "variance=", format(var(x.gamma), digits=4)
     )
)

text(6, .6, "E(X)=1 \n Var(X)=1", col="darkgray")

# gridlines
grid(nx=NULL, 
     ny=NULL, 
     col="lightgray", 
     lty="dotted", 
     lwd=1, 
     equilogs=TRUE)


Gamma Distribution PDF: Insurance Claim Modeling Case Study

Problem Statement:

An insurance company needs to model auto insurance claim amounts, where:

  • Claims are strictly positive (x > 0)
  • Data shows right-skewed distribution (many small claims, few large outliers)
  • Need to calculate probabilities for risk assessment and pricing

The Gamma PDF is given by:

\[ f(x;\alpha,\theta) = \frac{x^{\alpha-1}e^{-x/\theta}}{\theta^\alpha\Gamma(\alpha)} \quad \text{for } x > 0 \]

where:

  • \(\alpha\) = shape parameter (controls skewness)
  • \(\theta\) = scale parameter
  • \(\Gamma(\alpha)\) = gamma function

Cumulative Distribution Function (cdf) Plot: Gamma Distribution

# Plot the random sample CDF of Gamma-distributed data
plot(ecdf(x.gamma), 
     lwd = 6, 
     col = "lightgray", 
     xlab = "k", 
     ylab = expression(F[X](k)),  # Math notation for CDF
     main = "random sample CDF of Gamma(shape=1, scale=1)\n(n = 30)",
     cex.main = 1.1,  # Slightly larger title
     panel.first = {  # Add gridlines first (behind the plot)
       grid(
         nx = NULL, 
         ny = NULL, 
         col = "lightgray", 
         lty = "dotted", 
         lwd = 3
       )
     })

# Optional: Add population CDF for comparison (black solid line)
curve(pgamma(x, shape = 1, scale = 1), 
      col = "black", 
      lty = 1, 
      lwd = 1, 
      add = TRUE)

# Legend (if population CDF is added)
legend("bottomright", 
       legend = c("random sample CDF", "population CDF"), 
       col = c("lightgray", "black"), 
       lty = c(1, 1), 
       lwd = c(6, 2), 
       bty = "n")


Insurance Application Example:

Risk Management Table

Calculation Type Business Use R Function
Exceedance Probability Reserve requirements 1 - pgamma( )
Expected Loss Premium calculation \(\alpha * \theta\)
Value-at-Risk Reinsurance planning qgamma( )

Parameter Estimation

From historical data:

  • Shape (\(\alpha\)) = 2
  • Scale (\(\theta\)) = 1000
  • Mean claim = \(\alpha \times \theta = \$2,000\)
  • Mode = \((\alpha-1) \times \theta = \$1,000\)

Key Calculations

  1. Exceedance Probability (1 - pgamma(x, α, scale=θ))

Business Use: Reserve requirements

Question: What is the probability claims exceed a threshold (e.g., $3,000)?

Calculation: \[P(X > 3000) = 1 - P(X \leq 3000)\]

1 - pgamma(3000, shape=2, scale=1000)  # P(X > $3K): ~19.91%
## [1] 0.1991483
# Other threshold probabilities
1 - pgamma(2000, shape=2, scale=1000)  # P(X > mean): ~40.60%
## [1] 0.4060058
1 - pgamma(5000, shape=2, scale=1000)  # P(X > $5,000): ~4.04%
## [1] 0.04042768

  2. Expected Loss (α × θ)

Business Use: Premium calculation

Expected Loss = \(\alpha \times \theta = 2 \times 1000 = \$2,000\)

This is the mean claim amount used for pricing premiums.
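As a sanity check (a sketch using the shape and scale assumed above), the closed-form mean \(\alpha\theta\) agrees with the mean obtained by integrating \(x f(x)\) numerically:

shape <- 2; scale <- 1000

# Closed form: E(X) = alpha * theta
shape * scale    # 2000

# Numerical check: integrate x * f(x) over the support
integrate(function(x) x * dgamma(x, shape = shape, scale = scale),
          lower = 0, upper = Inf)$value     # ~2000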

  3. Value-at-Risk (qgamma(p, α, scale=θ))

Business Use: Reinsurance planning (e.g., 95% VaR)

Question: What is the claim amount at the 95th percentile?

Calculation:

To calculate the Value at Risk (VaR) at the 95% confidence level for a gamma-distributed random variable with shape parameter α = 2 and scale parameter θ = 1000, we use the quantile function of the gamma distribution.

The gamma distribution’s cumulative distribution function (CDF) is given by: \[ F(x;\: \alpha, \: \theta) = \frac{1}{\Gamma(\alpha)} \int_0^x t^{\alpha-1} \cdot e^{-t/\theta} \, dt \] where \(\Gamma(\alpha)\) is the gamma function.

The quantile function (inverse CDF) \(Q(p; \alpha, \theta)\) returns the value \(x\) such that \(F(x; \alpha, \theta) = p\). For the gamma distribution, this is often denoted as qgamma(p, alpha, scale = theta).

Given:

  • Shape parameter \(\alpha = 2\)
  • Scale parameter \(\theta = 1000\)
  • Confidence level \(p = 0.95\)

The VaR at the 95% confidence level is:

\[ VaR_{0.95} = \text{qgamma}(0.95, 2, \text{scale} = 1000) \]

Calculating the Value:

Using \(R\):

qgamma(0.95, shape = 2, scale = 1000)  # Result: ~$4,744 (95% of claims ≤ $4,744)
## [1] 4743.865

This yields: \[ VaR_{0.95} \approx 4743.865 \]

Interpretation:

There is a 95% probability that the loss will not exceed approximately 4743.865 (units depend on context, e.g., dollars).

5.9 Beta Distribution

Parameters

\(\quad \displaystyle \alpha > 0\, - \, shape1.\)

\(\quad \displaystyle \beta > 0\, - \, shape2.\)

\(\quad \displaystyle x \in \lbrack 0;1 \rbrack.\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \, \Gamma(\beta)} \, x^{\alpha-1}(1-x)^{\beta-1}.\)

Beta Distribution Moments

Moment Formula
Mean (\(E(X)\)) \(\displaystyle \qquad \frac{\alpha}{\alpha + \beta} \qquad\)
Variance (\(Var(X)\)) \(\displaystyle \qquad \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \qquad\)
Skewness \(\displaystyle \qquad \frac{2(\beta - \alpha)\sqrt{\alpha + \beta + 1}}{(\alpha + \beta + 2)\sqrt{\alpha \beta}} \qquad\)
Excess Kurtosis \(\displaystyle \qquad \frac{6[(\alpha - \beta)^2 (\alpha + \beta + 1) - \alpha \beta (\alpha + \beta + 2)]}{\alpha \beta (\alpha + \beta + 2)(\alpha + \beta + 3)} \qquad\)
MGF (\(M_X(t)\)) \(\displaystyle 1 \quad + \quad \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{t^k}{k!}\)

Probability Density Function (pdf): Beta Distribution

# Parameters
n <- 100  # Sample size
alpha <- 2  # Shape1 parameter
beta <- 2   # Shape2 parameter

# Generate data
x <- seq(0, 1, length.out = n)  # Grid for population density
x.beta <- rbeta(n, shape1 = alpha, shape2 = beta)  # Random Beta variates
d.beta <- dbeta(x, shape1 = alpha, shape2 = beta)  # population density

# Probability density function: population vs. random sample
plot(x, d.beta,
     type="l",
     ylim=c(0, max(d.beta)*1.1),
     col=adjustcolor("darkgray", alpha.f=0.8),
     lwd=8,
     xlab="x",
     ylab="f(x)",
     main="Random Sample (black) vs. Population (gray) \n Beta(2, 2)")

lines(density(x.beta),
      type="l",
      lwd=4)

abline(v=mean(x.beta), lty=3) # mean
abline(v=(alpha/(alpha+beta)), col="darkgray", lty=3) # expected value

text(mean(x.beta), max(d.beta)/2, 
     cex=1.2,
     paste("mean=", format(mean(x.beta), digits=2), 
           ", E(X)=", (alpha/(alpha+beta)), "\n",
           "variance=", format(var(x.beta), digits=2), 
           ", Var(X)=", ((alpha*beta)/((alpha+beta)^(2)*(alpha+beta+1)))
     )
)

# gridlines
grid(nx=NULL, 
     ny=NULL, 
     col="lightgray", 
     lty="dotted", 
     lwd=1, 
     equilogs=TRUE)

# Mean lines
abline(v = mean(x.beta), lty = 2, col = "black", lwd = 2)  # random sample mean
abline(v = alpha / (alpha + beta), lty = 2, col = "darkgray", lwd = 2)  # population mean

# Annotate statistics
legend("topright",
       legend = c(
         sprintf("Sample Mean = %.3f", mean(x.beta)),
         sprintf("Population Mean = %.3f", alpha / (alpha + beta)),
         sprintf("Sample Var = %.3f", var(x.beta)),
         sprintf("Population Var = %.3f", (alpha*beta)/((alpha+beta)^2*(alpha+beta+1)))
       ),
       bty = "n",
       cex = 0.9)

# Add density comparison legend
legend("topleft",
       legend = c("Population", "Random sample data"),
       col = c("darkgray", "black"),
       lwd = c(8, 4),
       bty = "n")


Cumulative Distribution Function (cdf) Plot: Beta Distribution

# Parameters
n <- 30       # Sample size
alpha <- 2    # Shape1 parameter
beta <- 2     # Shape2 parameter

# Generate Beta-distributed data
x.beta <- rbeta(n, shape1 = alpha, shape2 = beta)

# Create ECDF plot with improved formatting
plot(ecdf(x.beta), 
     lwd=3, 
     col="lightgray", 
     xlab="x", 
     ylab="F(x)",
     main="Cumulative Distribution Function \n 
             Beta(n=30, shape=2, scale=2)")


# Add population CDF for comparison (black solid line)
curve(pbeta(x, shape1 = alpha, shape2 = beta), 
      col = "black", 
      lty = 1, 
      lwd = 1, 
      add = TRUE)

# Add legend
legend("bottomright",
       legend = c("Random Sample CDF", "Population CDF"),
       col = c("lightgray", "black"),
       lty = c(1, 1),
       lwd = c(3, 2),
       bty = "n")

# Add vertical line at mean
abline(v = alpha/(alpha + beta), col = "red", lty = 3)
text(alpha/(alpha + beta), 0.2, 
     paste("Mean =", alpha/(alpha + beta)),
     pos = 4, col = "red")

The Beta distribution is a continuous probability distribution defined on the interval \([0, 1]\), parameterized by two positive shape parameters, \(\alpha\) (alpha) and \(\beta\) (beta). It is highly flexible and can model a wide range of real-world phenomena where outcomes are probabilities or proportions.


Why Beta Distribution?

  • Flexible Shapes: Can model U-shaped, J-shaped, uniform, unimodal, and other distributions.
  • Natural for Proportions: Bounded between 0 and 1, making it ideal for probabilities.
  • Bayesian Advantage: Works as a conjugate prior for Binomial and Bernoulli likelihoods.

Here are some typical real-life problems that can be solved using the Beta PDF (Probability Density Function):

A/B Testing & Bayesian Updating

A/B testing is a common method for comparing two versions (A and B) of a product, webpage, or marketing strategy to determine which performs better. Traditional frequentist approaches use p-values and confidence intervals, but Bayesian methods (using the Beta distribution) provide a more intuitive way to update beliefs and make probabilistic statements.

Example: If Version A has 120 conversions out of 1000 views \((\text{Beta}(120, 880))\) and Version B has 150 conversions out of 1000 views \((\text{Beta}(150, 850))\), we can compute \(P(B > A)\).

  • Version A: Shown to \(n_A = 1000\) users, resulting in \(x_A = 120\) successes (e.g., clicks, conversions).
  • Version B: Shown to \(n_B = 1000\) users, resulting in \(x_B=150\) successes.
  • Goal: Determine if B is truly better than A (and by how much).

Solution: Represent the conversion rates of A and B as Beta distributions and compute the probability that one is better than the other.


Step 1: Model Conversion Rates as Beta Distributions - Assume prior ignorance → Use Beta(1, 1) (uniform prior).
- After observing data: - Conversion rate of A: \(\theta_A \sim \text{Beta}(\alpha_A + x_A, \beta_A + n_A - x_A)\)
- Conversion rate of B: \(\theta_B \sim \text{Beta}(\alpha_B + x_B, \beta_B + n_B - x_B)\)

# Step 1: Set up prior distributions (Uniform Beta(1,1) in this case)
prior_alpha <- 1
prior_beta <- 1

# Variant A (control)
conversions_A <- 120
visitors_A <- 1000

# Variant B (treatment)
conversions_B <- 150
visitors_B <- 1000

Step 2: Update Beliefs with Data

Example:

  • Version A: 120 conversions out of 1000 → Posterior = Beta(121, 881)
  • Version B: 150 conversions out of 1000 → Posterior = Beta(151, 851)

In Bayesian A/B testing with a Beta-Binomial model, we add +1 to observed successes and failures when using a uniform Beta(1,1) prior. This represents starting from a state of complete uncertainty.

We begin with a uniform prior: \[ \text{Prior} \sim \text{Beta}(1, 1) \]

Posterior Update

After observing:

  • S successes
  • F failures

The posterior becomes: \[ \text{Posterior} \sim \text{Beta}(1 + S, 1 + F) \]

Why Add +1?

  • Mathematical Necessity
    • Ensures a proper distribution when S = 0 or F = 0
    • Avoids undefined cases (e.g., Beta(0, n) is invalid)
  • Bayesian Interpretation
    • Equivalent to starting with “pseudocounts” of 1 imaginary success and 1 imaginary failure
    • Represents minimal prior knowledge

Practical Implications

For Variant A (120 conversions out of 1000): \(\text{Posterior}_A \sim \text{Beta}(121, 881)\)

For Variant B (150 conversions out of 1000): \(\text{Posterior}_B \sim \text{Beta}(151, 851)\)

Key Properties of Adding \(+1\) in Beta-Binomial Models

Scenario Without +1 (Beta(0,0) Prior) With +1 (Beta(1,1) Prior)
0 successes Beta(0, n) → Invalid distribution Beta(1, n+1) → Valid
100% conversion Impossible (infinite density) Possible (low probability)
Small samples Overconfident estimates Conservative, regularized results
Large samples MLE ≈ Posterior Mean +1 becomes negligible
Decision making Risky with sparse data More robust conclusions

Explanatory Notes:

  • 0 successes case:
    • Without +1: Results in undefined mean/variance
    • With +1: Predicts conversion rate ≈ 1/(n+2)
  • 100% conversion:
    • Beta(1,1) allows (but strongly discounts) extreme values
  • Sample size effects:
    • For n > 1000, the +1 adjustment has < 0.1% impact on mean
    • For n < 100, the prior significantly stabilizes estimates
  • Alternative priors:
    • Beta(0.5, 0.5) (Jeffreys): Even more conservative
    • Beta(ε, ε) (ε→0): Approaches frequentist MLE

Note: The +1 becomes negligible with large sample sizes \((n > 1000)\), but remains important for small-n experiments.
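A small sketch of these effects: with zero observed successes, the Beta(1, 1) prior still yields a proper posterior with mean \(1/(n+2)\), while for large samples the +1 adjustment barely moves the estimate (the counts below are illustrative).

# Edge case: 0 successes in n = 50 trials under a Beta(1, 1) prior
n_small <- 50; s_small <- 0
(1 + s_small) / (2 + n_small)          # posterior mean = 1/(n + 2) ~ 0.019

# Large-sample case: the +1 is negligible relative to the MLE
n_big <- 10000; s_big <- 1200
c(mle            = s_big / n_big,
  posterior_mean = (1 + s_big) / (2 + n_big))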


Step 3: Compute Probability that B > A

We want \(P(\theta_B > \theta_A)\).

Method: Monte Carlo Simulation

  • Draw 100,000 samples from each Beta distribution:
    • \(\theta_A \sim \text{Beta}(121, 881)\)
    • \(\theta_B \sim \text{Beta}(151, 851)\)
  • Count how often \(\theta_B > \theta_A\).
# Step 3: Monte Carlo simulation to estimate P(B > A)
set.seed(123)  # for reproducibility
n_simulations <- 100000

# Sample from posterior distributions
samples_A <- rbeta(n_simulations, prior_alpha + conversions_A, prior_beta + visitors_A - conversions_A)

samples_B <- rbeta(n_simulations, prior_alpha + conversions_B, prior_beta + visitors_B - conversions_B)
  • Result: In this simulation \(\theta_B > \theta_A\) in roughly 97% of the draws, so \(P(B > A) \approx 97\%\) (see the metrics table below).
# Calculate probability that B is better than A
prob_B_better <- mean(samples_B > samples_A)
# Advanced Bayesian A/B Testing Visualizations in R
library(ggplot2)
library(gridExtra)
# library(plotly)  # Uncomment for interactive plots

# 1. Input Data
ab_data <- data.frame(
  variant = c("A", "B"),
  conversions = c(120, 150),
  visitors = c(1000, 1000)
)

# 2. Bayesian Setup (Uniform Beta(1,1) prior)
alpha_prior <- 1
beta_prior <- 1

# 3. Calculate posteriors
ab_data$alpha_post <- alpha_prior + ab_data$conversions
ab_data$beta_post <- beta_prior + ab_data$visitors - ab_data$conversions

# 4. Monte Carlo Simulation
set.seed(123)
n_sim <- 100000
samples <- data.frame(
  A = rbeta(n_sim, ab_data$alpha_post[1], ab_data$beta_post[1]),
  B = rbeta(n_sim, ab_data$alpha_post[2], ab_data$beta_post[2])
)
samples$diff <- samples$B - samples$A
prob_B_better <- mean(samples$B > samples$A)

# 5. Advanced Visualizations ---------------------------------------------------

# Plot 1: Posterior Densities with Credible Intervals
x_seq <- seq(0.075, 0.20, length.out = 1000)
densities <- data.frame(
  x = rep(x_seq, 2),
  variant = rep(c("A", "B"), each = length(x_seq)),
  density = c(dbeta(x_seq, ab_data$alpha_post[1], ab_data$beta_post[1]),
              dbeta(x_seq, ab_data$alpha_post[2], ab_data$beta_post[2]))
)

ci_A <- qbeta(c(0.025, 0.975), ab_data$alpha_post[1], ab_data$beta_post[1])
ci_B <- qbeta(c(0.025, 0.975), ab_data$alpha_post[2], ab_data$beta_post[2])

p1 <- ggplot(densities, aes(x = x, y = density, fill = variant)) +
  geom_area(alpha = 0.5, position = "identity") +
  geom_vline(xintercept = ci_A, linetype = "dashed", color = "#F8766D") +
  geom_vline(xintercept = ci_B, linetype = "dashed", color = "#00BFC4") +
  labs(title = "Posterior Distributions with 95% Credible Intervals",
       x = "Conversion Rate", y = "Density") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  theme_minimal()

# Plot 2: Distribution of Differences (B - A)
p2 <- ggplot(samples, aes(x = diff)) +
  geom_histogram(aes(y = ..density..), bins = 100, fill = "#619CFF", alpha = 0.7) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = paste0("Difference in Conversion Rates (B - A)\nP(B > A) = ", 
                      round(prob_B_better*100, 1), "%"),
       x = "Difference", y = "Density") +
  theme_minimal()

# Plot 3: Expected Loss Distribution
loss_B <- ifelse(samples$diff < 0, -samples$diff, 0)
loss_A <- ifelse(samples$diff > 0, samples$diff, 0)

# Plot 3: Expected Loss Distribution (with y-axis limit)
p3 <- ggplot(data.frame(loss = c(loss_A, loss_B),
                        variant = rep(c("A", "B"), each = n_sim)), 
             aes(x = loss, fill = variant)) +
  geom_density(alpha = 0.5) +
  labs(title = "Expected Loss Distribution",
       x = "Loss", y = "Density") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  ylim(0, 30) +  # Added this line to set y-axis limits
  xlim(-.010, .10) +  # Added this line to set x-axis limits
  theme_minimal()

# Plot 4: Cumulative Probability of Difference
p4 <- ggplot(samples, aes(x = diff)) +
  stat_ecdf(geom = "step", color = "#619CFF", linewidth = 1) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Cumulative Probability of Difference",
       x = "B - A Difference", y = "P(Difference ≤ x)") +
  theme_minimal()

# Combine all plots
grid.arrange(p1, p2, p3, p4, ncol = 2)

# Interactive Plotly version (installs plotly if not already available)
# Check and install plotly if needed
if (!requireNamespace("plotly", quietly = TRUE)) {
  install.packages("plotly")
}
library(plotly)

ggplotly(p1)  # Hover to see exact density values
ggplotly(p2)  # Explore the difference distribution
# 6. Print Key Metrics
# Load the required libraries (kableExtra provides kable_styling, row_spec, and footnote)
library(knitr)
library(kableExtra)

# Create a data frame with the metrics
results_table <- data.frame(
  Metric = c("Version A Posterior", 
             "Version B Posterior",
             "P(B > A)",
             "Expected Loss (choosing A)",
             "Expected Loss (choosing B)"),
  Value = c(sprintf("Beta(α=%.1f, β=%.1f)", ab_data$alpha_post[1], ab_data$beta_post[1]),
            sprintf("Beta(α=%.1f, β=%.1f)", ab_data$alpha_post[2], ab_data$beta_post[2]),
            sprintf("%.1f%%", prob_B_better*100),
            sprintf("%.4f", mean(loss_A)),
            sprintf("%.4f", mean(loss_B)))
)

# Print with kable
kable(results_table, 
      caption = "Bayesian A/B Test Key Metrics",
      align = c('l', 'r'),
      col.names = c("Metric", "Value")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "left"
  ) %>%
  row_spec(3, bold = TRUE, color = "white", background = "#00BFC4") %>%
  footnote(general = "Based on 100,000 Monte Carlo simulations",
           general_title = "Note:")
Bayesian A/B Test Key Metrics
Metric Value
Version A Posterior Beta(α=121.0, β=881.0)
Version B Posterior Beta(α=151.0, β=851.0)
P(B > A) 97.4%
Expected Loss (choosing A) 0.0301
Expected Loss (choosing B) 0.0001
Note:
Based on 100,000 Monte Carlo simulations

Decision Making

Probability Interpretation Action
\(P(B > A) > 95\%\) Strong evidence B is better Roll out B
\(P(B > A) \approx 50-80\%\) Weak evidence Collect more data
\(P(B > A) < 50\%\) A likely better Keep A or refine experiment

\(P(B > A) = 97.4\%\) → Strong evidence B is better; Roll out B.

Advantages Over Frequentist A/B Testing

Metric Frequentist Approach Bayesian (Beta) Approach
Interpretation “Reject null (p < 0.05)” “95% chance B is better”
Early Stopping Needs corrections (e.g., Bonferroni) Stop if \(P(B > A) > 95\%\)
Effect Size Confidence intervals Full posterior distribution
Prior Knowledge Ignored Incorporate via priors (e.g., Beta)

6 Discrete Probability Distributions

Discrete probability distributions model scenarios with distinct, countable outcomes where we can assign probabilities to each possible result. These distributions are fundamental for analyzing events where outcomes are separate and indivisible - like counting successes, failures, or occurrences. From simple binary outcomes to complex counting processes, discrete distributions provide the mathematical framework for understanding randomness in measurable, finite terms.

Key characteristics of discrete distributions include:

  • Clearly defined possible outcomes
  • Probabilities that sum to 1
  • Countable sample spaces
  • Distinct jumps between probability values

The most important discrete distributions each model different types of counting scenarios, from single trials to repeated experiments with varying conditions. Understanding these distributions provides the foundation for statistical inference, hypothesis testing, and predictive modeling across countless real-world applications.

6.1 Common Discrete Probability Distributions

Distribution Support Key Parameters Key Applications
Bernoulli {0, 1} p (success probability) Binary outcomes, single trials
Binomial {0, 1, …, n} n (trials), p Success counts, quality control
Geometric {1, 2, 3, …} p First success waiting time
Negative Binomial {r, r+1, …} r (successes), p k-th success waiting time
Poisson {0, 1, 2, …} λ (rate parameter) Rare events, arrival counts
Hypergeometric {max(0,n+K-N),…,min(n,K)} N (population), K (successes), n (sample) Sampling without replacement
Uniform {a, a+1, …, b} a (min), b (max) Equally likely outcomes
Multinomial Vectors of counts n (trials), p₁,…,pₖ Categorical data, dice rolls

Key Features:

  • Support: The set of possible outcome values
  • Bernoulli: Special case of Binomial \((n=1)\)
  • Geometric: Special case of Negative Binomial \((r=1)\)
  • Poisson: Limit of Binomial \((n \rightarrow \infty, np=\lambda)\)
  • Hypergeometric: Finite population correction for Binomial

6.2 Bernoulli Distribution

Models a single random experiment with exactly two mutually exclusive outcomes (typically labeled “success” and “failure”).

Key Characteristics:

  • The simplest discrete distribution
  • Foundation for many other distributions
  • Models binary events (1 = success, 0 = failure)

Probability mass function:

\[ P(X=x) = \begin{cases} p & \text{for } x=1 \\ 1-p & \text{for } x=0 \end{cases} \]
Parameters: - \(p\) = success probability

Applications: Coin flips, yes/no experiments

Bernoulli Distribution Moments

Moment Formula
Mean (μ) \(E[X] = p\)
Variance (σ²) \(Var(X) = p(1-p)\)
Skewness \(\gamma_1 = \frac{1-2p}{\sqrt{p(1-p)}}\)
Kurtosis \(\gamma_2 = \frac{1-6p(1-p)}{p(1-p)}\)
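Base R has no dedicated Bernoulli functions; the size = 1 case of the binomial family plays that role. A minimal simulation sketch comparing sample moments with the formulas above (p = 0.3 chosen for illustration):

set.seed(123)
p <- 0.3
x.bern <- rbinom(1e5, size = 1, prob = p)   # Bernoulli(p) draws

c(sample_mean = mean(x.bern), E_X = p)                  # both near 0.3
c(sample_var  = var(x.bern),  Var_X = p * (1 - p))      # both near 0.21

# PMF: P(X = 0) = 1 - p, P(X = 1) = p
dbinom(c(0, 1), size = 1, prob = p)                     # 0.7 0.3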

6.3 Binomial Distribution

Models the number of successes in a fixed number of independent Bernoulli trials with the same probability of success.

Key Characteristics:

  • Discrete probability distribution
  • Extension of the Bernoulli distribution to multiple trials
  • Assumes independence between trials
  • Constant probability of success across trials

Probability Mass Function (PMF): \[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0,1,2,...,n \] Parameters:

  • \(n \in \mathbb{N}\): Number of trials
  • \(p \in [0,1]\): Probability of success in each trial

Cumulative Distribution Function (CDF): \[ F(k) = P(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1-p)^{n-i} \]

Binomial Distribution Moments:

Moment Formula
Mean \(E[X] = np\)
Variance \(\text{Var}(X) = np(1-p)\)
Skewness \(\gamma_1 = \frac{1-2p}{\sqrt{np(1-p)}}\)
Kurtosis \(\gamma_2 = \frac{1-6p(1-p)}{np(1-p)}\)
MGF \(M_X(t) = (1 - p + pe^t)^n\)

Example Applications:

  • Quality control (defective items in a batch)
  • Survey analysis (proportion of positive responses)
  • Medical research (success rate of treatments)
  • Finance (probability of stock price movements)
  • Genetics (inheritance probability)

Relationships to Other Distributions:

  • Bernoulli: Special case when \(n = 1\)
  • Normal: Approximates when \(n\) is large (De Moivre-Laplace theorem)
  • Poisson: Approximates when \(n \to \infty\), \(p \to 0\), \(np \to \lambda\)
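A brief sketch of the Poisson and normal approximations noted above (parameter values chosen for illustration):

# Poisson limit: n large, p small, np = lambda
round(rbind(binomial = dbinom(0:5, size = 500, prob = 0.01),
            poisson  = dpois(0:5, lambda = 500 * 0.01)), 4)

# Normal approximation (De Moivre-Laplace), with continuity correction
c(exact  = pbinom(45, size = 100, prob = 0.4),
  normal = pnorm(45.5, mean = 100 * 0.4, sd = sqrt(100 * 0.4 * 0.6)))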
# sample size 1,000
n <- 1e3

# random binomial data
x.binom <- rbinom(n = n, size = 15, prob = .25)

Probability Mass Function (pmf): Binomial Distribution

plot(0:15, dbinom(x = 0:15, size = 15, prob = .25),
     xlab = "x",
     ylab = "P(X=x)",
     ylim = c(0, .3),
     col = "tomato",
     type = "h",
     lwd = 6)

title(expression("Density Plot of Random Binomial Data " 
                 * phantom(" vs. B(15, 0.25)")), 
      col.main = "dodgerblue")

title(expression(phantom("Density Plot of Random Binomial Data ") 
                 * " vs. " * phantom("B(15, 0.25)")), 
      col.main = "black")

title(expression(phantom("Density Plot of Random Binomial Data vs. ") 
                 * "B(15, 0.25)"), 
      col.main = "tomato")

points((table(x.binom) / n), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(10, .29, 
     paste("mean = ", (mean(x.binom)), "vs. E(X) = 3.7500"), 
     cex = 1.2)

text(10, .27, 
     paste("variance = ", format(var(x.binom), digits = 4), 
           "vs. Var(X) = 2.8125"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Binomial Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.binom), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "lightgray", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Bionomial(n=1000, p=0.25)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Example Binomial Probability Analysis

A genetic trait appears in 25% of offspring. In 8 offspring, what is the probability that exactly 2 inherit the trait?

Solution:

Parameters

  • Trials (n): 8 offspring
  • Success probability (p): 0.25
  • Target successes (k): 2

For a binomial random variable \(X \sim \text{Binomial}(n=8, p=0.25)\), the probability of exactly \(k\) successes is:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Step-by-Step Solution

  1. Combination Term Number of ways to choose 2 successes in 8 trials:

\[ \binom{8}{2} = \frac{8!}{2!6!} = \frac{8 \times 7}{2 \times 1} = 28 \quad \text{(ways to choose 2 successes in 8 trials)} \]

  2. Probability Terms \[ \begin{aligned} p^k &= (0.25)^2 = 0.0625 \quad \text{(probability of success)}\\ (1-p)^{n-k} &= (0.75)^{6} \approx 0.1780 \quad \text{(probability of failure)} \end{aligned} \]

  3. Final Calculation \[ P(X = 2) = 28 \times 0.0625 \times 0.1780 \approx 0.3115 \]

Final Result \[ \boxed{P(X=2) = 0.3115 \text{ (31.15\%)}} \]

Verification

  • Theoretical Expectation:
    • Mean \(\mu = np = 2\)
    • Variance \(\sigma^2 = np(1-p) = 1.5\)
  • Verification (R Code):
dbinom(2, size = 8, prob = 0.25) # Returns 0.3114624
## [1] 0.3114624
binom.test(2, 8, 0.25)  # 95% CI
## 
##  Exact binomial test
## 
## data:  2 and 8
## number of successes = 2, number of trials = 8, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.25
## 95 percent confidence interval:
##  0.03185403 0.65085579
## sample estimates:
## probability of success 
##                   0.25

The probability of obtaining exactly 2 successes in 8 trials (with p = .25 per trial) was 0.311, indicating this outcome would be expected approximately 31.1% of the time under these conditions.

The test results indicate that the observed proportion of successes (0.25) was not statistically different from the hypothesized probability of 0.25 (*p* = 1.000). The 95% confidence interval [0.032, 0.651] suggests that, with 95% confidence, the true probability of success could range from approximately 3.2% to 65.1%. Since the p-value was not significant (*p* > .05), we fail to reject the null hypothesis that the true probability of success is equal to 0.25.

An exact binomial test revealed no significant difference between the observed success rate (25%) and the hypothesized rate of 25%, *p* = 1.000, 95% CI [0.032, 0.651]. The wide confidence interval reflects the small sample size (n = 8), indicating substantial uncertainty around the true probability.

6.4 Geometric Distribution

The geometric distribution models the number of trials needed to obtain the first success in a sequence of independent Bernoulli trials, each with the same success probability \((p)\).

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Memoryless property: \(P(X > m+n \mid X > m) = P(X > n)\)
  • Two common parameterizations:
  • Trials until first success (including the success trial)
  • Failures before first success

Probability Mass Function (PMF):

Version 1 (including success trial): \[ P(X = k) = (1-p)^{k-1}p \quad \text{for } k = 1,2,3,... \]

Version 2 (failures before first success): \[ P(Y = k) = (1-p)^k p \quad \text{for } k = 0,1,2,... \]

Parameters: - \(p \in (0,1]\): Probability of success on each trial

Cumulative Distribution Function (CDF): \[ F(k) = 1 - (1-p)^{\lfloor k \rfloor + \delta} \] where \(\delta = 0\) for Version 1 and \(\delta = 1\) for Version 2

Geometric Distribution Moments:

Moment Version 1 Formula Version 2 Formula
Mean \(E[X] = \frac{1}{p}\) \(E[Y] = \frac{1-p}{p}\)
Variance \(\text{Var}(X) = \frac{1-p}{p^2}\) \(\text{Var}(Y) = \frac{1-p}{p^2}\)
Skewness \(\gamma_1 = \frac{2-p}{\sqrt{1-p}}\)
Kurtosis \(\gamma_2 = 6 + \frac{p^2}{1-p}\)
MGF \(M_X(t) = \frac{pe^t}{1-(1-p)e^t}\) (for \(t < -\ln(1-p)\))

Example Applications:

  • Reliability engineering (time until first failure)
  • Network protocols (retransmissions until success)
  • Sports analysis (attempts until first score)
  • Epidemiology (cases until disease detection)
  • Queueing theory (customers until first special request)

Special Properties:

  • Memorylessness: Future trials don’t depend on past outcomes
  • Relationship to Exponential: Discrete analog of the continuous exponential distribution
  • Maximum Entropy: Given mean, it has maximum entropy among discrete distributions
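A numerical sketch of the memoryless property, stated here in terms of R's parameterization (number of failures before the first success), where the survival function is \(P(X \geq k) = (1-p)^k\):

p <- 0.3; m <- 4; k <- 6

# Survival function for the "failures before first success" form
surv <- function(j) 1 - pgeom(j - 1, prob = p)   # P(X >= j) = (1 - p)^j

c(conditional   = surv(m + k) / surv(m),   # P(X >= m + k | X >= m)
  unconditional = surv(k))                 # P(X >= k); both equal 0.7^6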

Relationships to Other Distributions:

  • Negative Binomial: Sum of independent Geometrics
  • Exponential: Continuous counterpart
  • Bernoulli: Single trial version

Probability Mass Function (pmf): Geometric Distribution

# sample size
n <- 1e4

# random geometric data
x.geom <- rgeom(n=n, prob=.6)

# density
d.geom <- dgeom(0:10, prob = .6, log = FALSE)

plot(0:10, d.geom, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n Geometric Distribution",
     xlab = "Number of Failures Before Success", 
     ylab = "P(X=x)")

points(table(x.geom) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(6, .59, 
     paste("mean = ", format(mean(x.geom), digits = 4), ", E(X) = 0.6667"), 
     cex = 1.2)

text(6, .55, 
     paste("variance = ", format(var(x.geom), digits = 4), ", Var(X) = 1.1111"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Geometric Distribution

# Cumulative Distribution Function
plot(ecdf(x.geom), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Geometric(n=1000, p=0.25)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-Life Example of Geometric Distribution: Customer Conversion

A digital marketing team runs an online ad campaign where:

  • Each ad view is an independent trial.
  • The probability that a viewer clicks the ad (conversion) is 5% (*p = 0.05*).

What is the probability that the first conversion happens on the 10th ad view?

Probability Calculation Using the Geometric PMF

The probability mass function for the geometric distribution is given by:

\[ P(X = k) = (1 - p)^{k-1} \times p \]

where:

  • \(k\) = trial number of first success (here, \(k = 10\))
  • \(p\) = success probability per trial (here, \(p = 0.05\))

\[ P(X = 10) = (1 - 0.05)^{10-1} \times 0.05 = (0.95)^9 \times 0.05 \]

Step-by-step calculation:

  1. \((0.95)^9 \approx 0.630\)
  2. \(0.630 \times 0.05 = 0.0315\) so \(P(X = 10) \approx 0.0315\) or \(3.15\%\)

R Verification

dgeom(9, prob = 0.05)  # Returns 0.0315 (3.15%)
## [1] 0.03151247

Note: R uses zero-based indexing for the number of failures before the first success, so we use 9 instead of 10.

6.5 Hypergeometric Distribution

The Hypergeometric distribution models scenarios where we sample without replacement from a finite population containing two distinct groups (e.g., success/failure).

Key Characteristics:

  • Discrete probability distribution
  • Finite population sampling
  • Non-independent trials (sampling without replacement)
  • Captures the “lottery problem” exactly

Probability Mass Function (PMF): \[ P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \] where:

  • \(\max(0, n+K-N) \leq k \leq \min(n, K)\)
  • \(\binom{a}{b}\) is the binomial coefficient
  • \(N \in \mathbb{N}\): Total population size
  • \(K \in \{0,1,...,N\}\): Number of success states in population
  • \(n \in \{1,2,...,N\}\): Number of draws (sample size)

Cumulative Distribution Function (CDF): \[ F(k) = \sum_{i=0}^{\lfloor k \rfloor} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \]

Hypergeometric Distribution Moments:

Moment Formula
Mean \(E[X] = n\frac{K}{N}\)
Variance \(\text{Var}(X) = n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}\)
Skewness \(\frac{(N-2K)(N-1)^{1/2}(N-2n)}{[nK(N-K)(N-n)]^{1/2}(N-2)}\)
MGF \(\frac{\binom{N-K}{n}}{\binom{N}{n}}{}_2F_1(-n,-K;N-K-n+1;e^t)\)

Example Applications:

  • Quality control (sampling defective items)
  • Wildlife sampling (tagged animals in capture-recapture)
  • Lottery probability calculations
  • Statistical significance of gene enrichment
  • Voting prediction from exit polls

Special Properties:

  • Finite Population Correction: Variance includes \(\frac{N-n}{N-1}\) term
  • Relationship to Binomial: Approaches binomial when \(N \to \infty\) with \(K/N \to p\)
  • Symmetry: \(X \sim \text{HG}(N,K,n) \iff n-X \sim \text{HG}(N,N-K,n)\)

Relationships to Other Distributions:

  • Binomial: Approximation when \(n/N < 0.05\)
  • Multivariate Hypergeometric: Extension to multiple categories
  • Fisher’s Exact Test: Used in contingency table analysis
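A short sketch of the binomial approximation mentioned above: when the sampling fraction \(n/N\) is small, hypergeometric probabilities are close to binomial probabilities with \(p = K/N\) (the population values below are illustrative).

N <- 10000; K <- 1000; n_draws <- 20   # sampling fraction n/N = 0.002

round(rbind(
  hypergeometric = dhyper(0:5, m = K, n = N - K, k = n_draws),
  binomial       = dbinom(0:5, size = n_draws, prob = K / N)
), 4)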

Probability Mass Function (pmf): Hypergeometric Distribution

# random hypergeometric data
x.hyper <- rhyper(1e4, m = 10, n = 8, k = 10)

# density
d.hyper <- dhyper(0:10, m = 10, n = 8, k = 10)

plot(0:10, d.hyper, 
     xlab = "x", 
     ylab = "P(X=x)", 
     ylim = c(0, .45), 
     col = "tomato", 
     type = "h", 
     lwd = 6)

title(main = expression("Hypergeometric Distribution \n"), 
      col.main = "black")

title(main = expression("random sample (blue)" * phantom(" vs. population (orange)")), 
      col.main = "dodger blue")

title(main = expression(phantom("random sample (blue)") * " vs. " * 
                          phantom("population (orange)")), 
      col.main = "black")

title(main = expression(phantom("random sample (blue) vs. ") 
                        * "population (orange)"), 
      col.main = "tomato")

points(table(x.hyper) / 1e4, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(6, .44, 
     paste("mean = ", format(mean(x.hyper), digits = 4), " vs. E(X) = 5.5556"), 
     cex = 1.2)

text(6, .41, 
     paste("variance = ", format(var(x.hyper), digits = 4), " vs. Var(X) = 1.1620"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Hypergeometric Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.hyper), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n 
             Hypereometric(nn=1000, m=10, n=8, k=10)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-Life Example of Hypergeometric Distribution

Example: Quality Control in Manufacturing

A factory produces 100 light bulbs, with 10 defective and 90 working. An inspector randomly selects 20 bulbs for testing. What is the probability that exactly 3 are defective?

Problem Definition

We consider a finite population of size \(N\) containing:

  • \(K\) “successes” (defective bulbs)
  • \(N-K\) “failures” (working bulbs)

A sample of size \(n\) is drawn without replacement. The probability of finding exactly \(k\) successes in the sample follows the Hypergeometric Distribution:

\[ P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]

Given Parameters:

  • Total bulbs (\(N\)) = 100
  • Defective bulbs (\(K\)) = 10
  • Sample size (\(n\)) = 20
  • Desired defectives (\(k\)) = 3

Step 1: Calculate Individual Combinations

  1. Ways to Choose Defective Bulbs

\[ \binom{K}{k} = \binom{10}{3} = \frac{10!}{3!(10-3)!} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120 \]

  2. Ways to Choose Working Bulbs

\[ \binom{N-K}{n-k} = \binom{90}{17} \approx 9.345 \times 10^{17} \]

  3. Total Possible Samples

\[ \binom{N}{n} = \binom{100}{20} \approx 5.360 \times 10^{20} \]

Step 2: Compute Probability

\[ P(X = 3) = \frac{\binom{10}{3} \binom{90}{17}}{\binom{100}{20}} = \frac{120 \times \binom{90}{17}}{\binom{100}{20}} \]

Numerical Evaluation:

\[ P(X = 3) \approx \frac{120 \times 9.345 \times 10^{17}}{5.360 \times 10^{20}} \approx 0.2092 \]

The probability of finding exactly 3 defective bulbs in a sample of 20 is: \(\boxed{P(X = 3) \approx 20.92\%}\)


Verification in R

# Parameters
total_defective <- 10       # (K) Defective bulbs in population
total_non_defective <- 90   # (N-K) Working bulbs
sample_size <- 20           # (n) Bulbs tested
defective_in_sample <- 3    # (k) Desired defective count in sample
total_population <- total_defective + total_non_defective  # (N) Total bulbs

# Probability calculation
prob <- dhyper(
  x = defective_in_sample,  # Number of observed successes (defectives)
  m = total_defective,      # Total successes in population
  n = total_non_defective,  # Total failures in population
  k = sample_size           # Sample size
)

print(paste("Probability of exactly 3 defectives:", round(prob, 4)))
## [1] "Probability of exactly 3 defectives: 0.2092"

There’s a 20.92% chance that exactly 3 out of 20 tested bulbs are defective.

library(ggplot2)

# Compute probabilities for all possible defective counts (0 to total_defective)
x_values <- 0:total_defective
probabilities <- dhyper(
  x = x_values,
  m = total_defective,
  n = total_non_defective,
  k = sample_size
)

# Create a data frame for plotting
df <- data.frame(
  Defectives = x_values,
  Probability = probabilities
)

# Plot
ggplot(df, aes(x = Defectives, y = Probability)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.7) +
  geom_vline(xintercept = defective_in_sample, color = "red", linetype = "dashed") +
  annotate("text", 
           x = defective_in_sample, 
           y = max(probabilities) * 0.9, 
           label = paste("P(X=", defective_in_sample, ") =", round(prob, 4)), 
           color = "red") +
  labs(
    title = "Hypergeometric Distribution: Defective Bulbs in Sample",
    x = "Number of Defective Bulbs in Sample (k)",
    y = "Probability"
  ) +
  theme_minimal()


6.6 Poisson Distribution

Models the number of rare, independent events occurring in a fixed interval of time or space.

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Memoryless property
  • Events occur at constant average rate
  • Events are independent of time since last event

Probability Mass Function (PMF): \[ P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!} \quad \text{for } k = 0,1,2,... \]

Parameters: - \(\lambda > 0\): Rate parameter (mean number of events)

Cumulative Distribution Function (CDF): \[ F(k) = e^{-\lambda}\sum_{i=0}^{\lfloor k \rfloor}\frac{\lambda^i}{i!} \]

Poisson Distribution Moments:

Moment Formula
Mean \(E[X] = \lambda\)
Variance \(\text{Var}(X) = \lambda\)
Skewness \(\gamma_1 = \lambda^{-1/2}\)
Kurtosis \(\gamma_2 = \lambda^{-1}\)
MGF \(M_X(t) = e^{\lambda(e^t-1)}\)
Characteristic Function \(\phi_X(t) = e^{\lambda(e^{it}-1)}\)

Key Properties:

  • Equal Mean and Variance: \(\text{Mean} = \text{Variance} = \lambda\)
  • Additive: Sum of Poissons is Poisson (\(X+Y \sim \text{Pois}(\lambda_1+\lambda_2)\))
  • Binomial Limit: Approximates \(\text{Bin}(n,p)\) when \(n\to\infty\), \(np\to\lambda\)

Example Applications:

  • Call center: Number of incoming calls per hour
  • Biology: Mutations in DNA strand
  • Finance: Number of trades in a time period
  • Physics: Radioactive decay events
  • Transportation: Arrivals at a bus stop
  • Medicine: Number of hospital admissions

Approximation Rules:

  • Binomial approximation: When \(n \geq 20\) and \(p \leq 0.05\)
  • Normal approximation: When \(\lambda > 20\)
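A brief sketch of both approximation rules (parameter values chosen for illustration):

# Binomial -> Poisson: n >= 20 and p <= 0.05
round(rbind(binomial = dbinom(0:6, size = 100, prob = 0.03),
            poisson  = dpois(0:6, lambda = 100 * 0.03)), 4)

# Poisson -> Normal: lambda > 20, with continuity correction
lambda_big <- 30
c(exact  = ppois(35, lambda_big),
  normal = pnorm(35.5, mean = lambda_big, sd = sqrt(lambda_big)))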

Relationships to Other Distributions:

  • Exponential: Models inter-arrival times between Poisson events
  • Gamma: Models time until k-th Poisson event
  • Compound Poisson: Used in risk modeling
  • Zero-Inflated Poisson: For excess zero counts

Probability Mass Function (pmf): Poisson Distribution

# sample size
n <- 1e4

# random Poisson data
x.poisson <- rpois(n, 5)

# density
d.poisson <- dpois(0:20, 5)

plot(0:20, d.poisson, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (orange) \n Poisson Distribution",
     xlab = "Time Interval", 
     ylab = "P(X=x)")

points(table(x.poisson) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(13, .17, paste("mean = ", format(mean(x.poisson), digits = 4), 
                    " vs. ", expression(E(X)), " = 5.0000"), 
     cex = 1.2)

text(13, .16, 
     paste("variance = ", format(var(x.poisson), digits = 4), 
           " vs. Var(X) = 5.0000"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Poisson Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.poisson), 
     xlab = "k", 
     ylab = "P(X <= k)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Poisson(1000, 5)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Poisson Distribution: Emergency Call Center Example

Problem Scenario

A hospital’s emergency call center receives an average of 5 calls per hour \((\lambda = 5)\). We want to calculate:

  • Probability of receiving exactly 3 calls in the next hour.
  • Probability of receiving ≤ 2 calls in the next hour.

Why Poisson?

  • Models rare, independent events (calls are unpredictable and non-overlapping).
  • Fixed time interval (per hour).
  • Mean = Variance = λ (here, both are 5).

The Poisson Probability Mass Function (PMF) is given by:

\[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} \]

where:

  • \(\lambda = 5\) (average rate of events per interval)
  • \(k\) = number of events (e.g., calls received)
  • \(e\) = Euler’s number (~2.71828)
  • \(k!\) = factorial of \(k\)

Key Properties:

  1. Mean and Variance:
    \[ E[X] = \text{Var}(X) = \lambda \]

  2. Assumptions:

  • Events occur independently
  • Constant average rate (\(\lambda\))
  • No simultaneous events

1. Probability of Exactly 3 Calls \[ P(X = 3) = \frac{e^{-5} \cdot 5^3}{3!} = \frac{0.006737947 \cdot 125}{6} \approx 0.1404 \]

2. Probability of ≤ 2 Calls (CDF) \[ P(X \leq 2) = \sum_{k=0}^2 \frac{e^{-5} \cdot 5^k}{k!} \approx 0.1247 \]


R Verification

lambda <- 5  # Average calls per hour

# a) Probability of exactly 3 calls
prob_3 <- dpois(x = 3, lambda = lambda)
cat(sprintf("P(X = 3) = %.4f (%.1f%%)\n", prob_3, prob_3*100))
## P(X = 3) = 0.1404 (14.0%)
# b) Probability of ≤ 2 calls
prob_leq_2 <- ppois(q = 2, lambda = lambda)
cat(sprintf("P(X ≤ 2) = %.4f (%.1f%%)\n", prob_leq_2, prob_leq_2*100))
## P(X ≤ 2) = 0.1247 (12.5%)
  • 14% chance of exactly 3 calls in the next hour.
  • 12.5% chance of 2 or fewer calls.
library(ggplot2)

k_values <- 0:15  # Possible call volumes
probabilities <- dpois(k_values, lambda)

ggplot(data.frame(Calls = k_values, Probability = probabilities), 
       aes(x = Calls, y = Probability)) +
  geom_col(fill = "orange", width = 0.6, alpha = 0.7) +
  geom_vline(xintercept = lambda, linetype = "dashed", color = "red") +
  labs(title = "Poisson Distribution: Emergency Calls per Hour (λ = 5)",
       x = "Number of Calls",
       y = "Probability") +
  theme_minimal()


6.7 Negative Binomial Distribution

Models the number of independent Bernoulli trials needed to obtain a fixed number of successes (r).

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Generalization of the Geometric distribution
  • Two common parameterizations:
    • Number of trials until the r-th success (including the successes)
    • Number of failures before the r-th success

Probability Mass Function (PMF):

Version 1 (trials until r-th success): \[ P(X = k) = \binom{k-1}{r-1}p^r(1-p)^{k-r} \quad \text{for } k = r,r+1,... \]

Version 2 (failures before r-th success): \[ P(Y = k) = \binom{k+r-1}{r-1}p^r(1-p)^k \quad \text{for } k = 0,1,2,... \]

Parameters:

  • \(r \in \mathbb{N}\): Number of desired successes
  • \(p \in (0,1)\): Probability of success in each trial

Cumulative Distribution Function (CDF):

\[ F(k) = I_p(r, \lfloor k \rfloor + 1) \qquad \text{(Version 2: failures before the } r\text{-th success)} \] where \(I_p\) is the regularized incomplete beta function

Negative Binomial Distribution Moments:

| Moment | Version 1 (trials) | Version 2 (failures) |
|---|---|---|
| Mean | \(E[X] = \frac{r}{p}\) | \(E[Y] = \frac{r(1-p)}{p}\) |
| Variance | \(\frac{r(1-p)}{p^2}\) | \(\frac{r(1-p)}{p^2}\) |
| Skewness | \(\frac{2-p}{\sqrt{r(1-p)}}\) | \(\frac{2-p}{\sqrt{r(1-p)}}\) |
| Kurtosis | \(\frac{6}{r} + \frac{p^2}{r(1-p)}\) | \(\frac{6}{r} + \frac{p^2}{r(1-p)}\) |
| MGF | \(\left(\frac{pe^t}{1-(1-p)e^t}\right)^r\), for \(t < -\ln(1-p)\) | \(\left(\frac{p}{1-(1-p)e^t}\right)^r\), for \(t < -\ln(1-p)\) |

Key Properties:

  • Overdispersion: Variance > Mean (unlike Poisson)
  • Geometric Case: Special case when \(r = 1\)
  • Alternative Formulation: Often defined using mean \(\mu\) and dispersion parameter

Example Applications:

  • Epidemiology: Disease outbreak modeling
  • Ecology: Species abundance counts
  • Insurance: Claim frequency modeling
  • Quality Control: Defect clustering
  • RNA-Seq: Gene expression analysis
  • Psychology: Learning curve analysis

Relationships to Other Distributions:

  • Poisson: Limit when \(r \rightarrow \infty\), \(p \rightarrow 1\), with \(r(1-p) \rightarrow \lambda\)
  • Geometric: Special case when \(r = 1\)
  • Gamma-Poisson: Equivalent to a Poisson with a gamma-distributed rate (see the simulation sketch below)
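
A small simulation sketch of the Gamma-Poisson equivalence; the size and prob values are arbitrary, and the gamma rate prob/(1 - prob) follows from matching R's rnbinom parameterization:

# Drawing the Poisson rate from a Gamma yields negative binomial counts
set.seed(1)
n_sim <- 1e5
size <- 5; prob <- 0.4
rate <- rgamma(n_sim, shape = size, rate = prob / (1 - prob))
mixed  <- rpois(n_sim, rate)
direct <- rnbinom(n_sim, size = size, prob = prob)
rbind(mixture = c(mean = mean(mixed),  var = var(mixed)),
      nbinom  = c(mean = mean(direct), var = var(direct)))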

Parameter Estimation:

  • Maximum likelihood estimates: \(\hat{p} = \frac{r}{r+\bar{y}}\) for known \(r\) (with \(\bar{y}\) the mean failure count); \(\hat{r}\) must be found numerically (see the sketch below)
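
A quick numerical check of the \(\hat{p}\) formula for known \(r\), with arbitrary simulation settings (the counts are failures, i.e., the version 2 parameterization used by R):

set.seed(7)
r <- 4
y <- rnbinom(5000, size = r, prob = 0.3)  # simulated failure counts
p_hat <- r / (r + mean(y))
p_hat  # should be close to the true value 0.3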

Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom <- rnbinom(n, size = 1, prob = .37)

# density
d.nbinom <- dnbinom(0:30, size = 1, prob = .37)

plot(0:30, d.nbinom, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability")

points(table(x.nbinom) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .3, paste("mean = ", format(mean(x.nbinom), digits = 4), " vs. ",
                   expression(E(X)), " = *"), 
     cex = 1.2)

text(20, .28, 
     paste("variance = ", format(var(x.nbinom), digits = 4), " vs. ",
           expression(V(X)), " = *"), 
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
            Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom2 <- rnbinom(n, size = 7, prob = .37)

# density
d.nbinom2 <- dnbinom(0:30, size = 7, prob = .37)

plot(0:30, d.nbinom2, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability", 
     ylim = c(0,.1))

points(table(x.nbinom2) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .09, paste("mean = ", format(mean(x.nbinom2), digits = 4), " vs. ",
                    expression(E(X)), " = *"), 
     cex = 1.2)

text(20, .085, paste("variance = ", format(var(x.nbinom2), digits = 4), " vs. ",
                     expression(V(X)), " = *"),
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom2), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
            Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom3 <- rnbinom(n, size = 25, prob = .37)

# density
d.nbinom3 <- dnbinom(0:30, size = 25, prob = .37)

plot(0:30, d.nbinom3, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability")

points(table(x.nbinom3) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .09, 
     paste("mean = ", format(mean(x.nbinom3), digits = 4), " vs. E(X) = ",
           format(25 * (1 - .37) / .37, digits = 4)), 
     cex = 1.2)

text(20, .085, 
     paste("variance = ", format(var(x.nbinom3), digits = 4), " vs. Var(X) = ",
           format(25 * (1 - .37) / .37^2, digits = 4)),
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom3), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
             Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-World Scenario: Programming Skill Acquisition

A data science student typically needs several practice attempts before fully mastering each new Python concept. We want to model:

  • Probability of requiring exactly 8 attempts to master 3 concepts
  • Likelihood of needing ≤ 10 attempts for 3 concepts

Why Negative Binomial?

  • Models number of trials until r successes (mastery events)
  • Accounts for variable learning rates
  • More flexible than the Poisson (allows variance > mean)
r <- 3        # Target successes (concepts to master)
mu <- 8       # Expected failures before the r-th success (negative binomial mean parameter)
p <- r/(r+mu) # Implied success probability per attempt

cat("Success probability p =", round(p, 3))
## Success probability p = 0.273
# Probability of needing exactly 8 attempts for 3 successes
prob_8 <- dnbinom(8 - r, size = r, prob = p)
cat(sprintf("P(8 attempts for 3 successes) = %.4f (%.1f%%)", prob_8, prob_8*100))
## P(8 attempts for 3 successes) = 0.0867 (8.7%)
# Probability of needing ≤10 attempts
prob_leq_10 <- pnbinom(10 - r, size = r, prob = p)
cat(sprintf("\nP(≤10 attempts for 3 successes) = %.4f (%.1f%%)", prob_leq_10, prob_leq_10*100))
## 
## P(≤10 attempts for 3 successes) = 0.5414 (54.1%)
  • 8.7% chance the student needs exactly 8 attempts to master 3 concepts
  • 54.1% chance they’ll succeed within 10 attempts
  • The distribution is right-skewed, reflecting that some concepts may require substantially more practice
library(ggplot2)

attempts <- 0:25
probabilities <- dnbinom(attempts, size = r, prob = p)

ggplot(data.frame(Attempts = attempts + r, Probability = probabilities),
       aes(x = Attempts, y = Probability)) +
  geom_col(fill = "#FF6B6B", width = 0.7, alpha = 0.8) +
  geom_vline(xintercept = (r/p), linetype = "dashed", color = "#4ECDC4") +
  labs(title = "Learning Curve: Attempts Needed to Master 3 Concepts",
       subtitle = paste("Negative Binomial (r =", r, ", p =", round(p, 2), ")"),
       x = "Total Attempts Needed",
       y = "Probability") +
  theme_minimal()


7 Custom Densities

7.1 Symmetric Triangular Distribution

Description: Models continuous outcomes with lower bound a, upper bound b, and symmetric clustering around the midpoint.

Key Characteristics:

  • Finite, bounded support
  • Perfectly symmetric density
  • Linear PDF forming an isosceles triangle
  • Simpler alternative to Beta distribution for symmetric cases


7.2 Probability Density Function (PDF)

For a symmetric triangular distribution with support \([a, b]\) and mode \(c = \frac{a+b}{2}\):

\[ f(x) = \begin{cases} 0 & \text{for } x < a \\[8pt] \frac{4(x-a)}{(b-a)^2} & \text{for } a \leq x \leq c \\[8pt] \frac{4(b-x)}{(b-a)^2} & \text{for } c \leq x \leq b \\[8pt] 0 & \text{for } x > b \end{cases} \]

Key Properties:

  • Piecewise linear function forming an isosceles triangle
  • Continuous at \(x=c\) (both pieces meet at the mode)
  • Discontinuous at boundaries \(x=a\) and \(x=b\) (jumps from 0)
  • Height at mode: \(f(c) = \frac{2}{b-a}\)
  • Area under curve: \(\int_a^b f(x)dx = 1\)

Verification

  • At \(x=a\): \(\frac{4(a-a)}{(b-a)^2} = 0\)
  • At \(x=c\): Both pieces give \(\frac{4(c-a)}{(b-a)^2} = \frac{2}{b-a}\)
  • At \(x=b\): \(\frac{4(b-b)}{(b-a)^2} = 0\)

Standardized Form (a=-1, b=1) \[ f_{\text{std}}(x) = \begin{cases} 1 - |x| & \text{for } -1 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \]


Cumulative Distribution Function (CDF):

For a symmetric triangular distribution with support \([a, b]\) and mode \(c = \frac{a+b}{2}\):

\[ F(x) = \begin{cases} 0 & \text{for } x < a \\[10pt] \frac{2(x-a)^2}{(b-a)^2} & \text{for } a \leq x \leq c \\[10pt] 1 - \frac{2(b-x)^2}{(b-a)^2} & \text{for } c \leq x \leq b \\[10pt] 1 & \text{for } x > b \end{cases} \]

Key Features:

  • Piecewise quadratic function
  • Continuous at all points (including \(x=a\), \(x=c\), and \(x=b\))
  • Differentiable everywhere except at \(x=a\) and \(x=b\)
  • Median = Mode = Mean = \(c\)

Verification at Critical Points:

  • At \(x = a\): \(F(a) = \frac{2(a-a)^2}{(b-a)^2} = 0\)
  • At \(x = c\): \(\frac{2(c-a)^2}{(b-a)^2} = \frac{1}{2} = 1 - \frac{2(b-c)^2}{(b-a)^2}\)
  • At \(x = b\): \(1 - \frac{2(b-b)^2}{(b-a)^2} = 1\)

Example Calculation (when \(a=0\), \(b=2\), \(c=1\)): \[ F(0.5) = \frac{2(0.5-0)^2}{(2-0)^2} = 0.125 \\ F(1.5) = 1 - \frac{2(2-1.5)^2}{(2-0)^2} = 0.875 \]

Symmetric Triangular Distribution Moments:

| Moment | Formula |
|---|---|
| Mean | \(\frac{a + b}{2}\) |
| Variance | \(\frac{(b - a)^2}{24}\) |
| Skewness | \(0\) |
| Kurtosis (excess) | \(-\frac{3}{5}\) |
| MGF | \(\frac{4\left(e^{at} - 2e^{ct} + e^{bt}\right)}{(b-a)^2 t^2}\), with \(c = \frac{a+b}{2}\) |
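
A quick numerical sanity check of the mean and variance formulas, using a self-contained density helper (f_tri is a hypothetical name; a = 0, b = 2 are chosen purely for illustration):

# Symmetric triangular density on [a, b]; mean should be (a+b)/2 = 1, variance (b-a)^2/24 = 1/6
f_tri <- function(x, a = 0, b = 2) {
  c0 <- (a + b) / 2
  ifelse(x < a | x > b, 0,
         ifelse(x <= c0, 4 * (x - a) / (b - a)^2, 4 * (b - x) / (b - a)^2))
}
m1 <- integrate(function(x) x   * f_tri(x), 0, 2)$value
m2 <- integrate(function(x) x^2 * f_tri(x), 0, 2)$value
c(mean = m1, variance = m2 - m1^2)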

Key Properties:

  • Bounded Support: Strictly confined to \([a,b]\)
  • Symmetric: Mean = Median = Mode
  • Variance: Grows with the square of the range
  • Linear Tails: PDF decreases linearly from mode to bounds

Example Applications:

  • Project management (task duration estimation)
  • Quality control (tolerance limits)
  • Limited-data risk assessment
  • Simple Monte Carlo simulations
  • Audio signal processing (bounded noise)

Relationships to Other Distributions:

  • Uniform: Special case when PDF becomes flat (no mode)
  • Beta: Generalizes to asymmetric shapes
  • Normal: Approximates when \(b-a \to \infty\) (central region)

Approximation Rules:

  • Use instead of Normal when bounds are known
  • Prefer over Uniform when central clustering is expected

Random Generation (rtri)

#' Symmetric Triangular Random Variates
#' 
#' @param n Number of observations
#' @param a Lower limit (default = 0)
#' @param b Upper limit (default = 1)
#' @return Random values
rtri <- function(n, a = 0, b = 1) {
  qtri(runif(n), a = a, b = b)
}

Quantile Function (qtri)

#' Symmetric Triangular Quantile Function
#' 
#' @param p Vector of probabilities
#' @param a Lower limit (default = 0)
#' @param b Upper limit (default = 1)
#' @param lower.tail Logical; if FALSE, uses upper tail
#' @param log.p Logical; if TRUE, interprets p as log probabilities
#' @return Quantiles
qtri <- function(p, a = 0, b = 1, lower.tail = TRUE, log.p = FALSE) {
  if (log.p) p <- exp(p)
  if (!lower.tail) p <- 1 - p
  if (any(p < 0 | p > 1)) stop("p must be in [0,1]")
  
  mode <- (a + b)/2
  q <- ifelse(p <= 0.5,
              a + sqrt(p*(b - a)*(mode - a)),
              b - sqrt((1 - p)*(b - a)*(b - mode)))
  
  return(q)
}
#' Triangular Distribution Functions
#'
#' Density, distribution function, quantile function and random generation
#' for the triangular distribution with lower limit `a`, upper limit `b`,
#' and mode `c`.
#'
#' @name Triangular
#' @aliases dtri ptri qtri rtri
#' @param x,q Vector of quantiles
#' @param p Vector of probabilities
#' @param n Number of observations
#' @param a Lower limit (default = -1)
#' @param b Upper limit (default = 1)
#' @param c Mode (default = midpoint between a and b)
#' @param log,log.p Logical; if TRUE, probabilities/densities p are given as log(p)
#' @param lower.tail Logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
#'
#' @details
#' The triangular distribution has density:
#' f(x) = \begin{cases}
#' \frac{2(x-a)}{(b-a)(c-a)} & \text{for } a \leq x \leq c \\
#' \frac{2(b-x)}{(b-a)(b-c)} & \text{for } c < x \leq b \\
#' 0 & \text{otherwise}
#' \end{cases}
#'
#' @return
#' `dtri` gives the density, `ptri` gives the distribution function,
#' `qtri` gives the quantile function, and `rtri` generates random deviates.
#'
#' @examples
#' # Density evaluation
#' dtri(0)
#' curve(dtri(x), from = -1.5, to = 1.5)
#'
#' # CDF evaluation
#' ptri(0)
#' curve(ptri(x), from = -1.5, to = 1.5)
#'
#' # Quantile function
#' qtri(0.5)
#'
#' # Random generation
#' hist(rtri(1000))
#' @rdname Triangular
#' @export
dtri <- function(x, a = -1, b = 1, c = (a + b)/2, log = FALSE) {
  # Validate inputs
  if (!is.numeric(x)) stop("x must be numeric")
  if (length(a) != 1 || length(b) != 1 || length(c) != 1) {
    stop("Parameters a, b, c must be length 1")
  }
  if (b <= a) stop("b must be greater than a")
  if (c < a || c > b) stop("c must be between a and b")
  
  # Calculate density components
  denom <- (b - a)
  left_denom <- denom * (c - a)
  right_denom <- denom * (b - c)
  
  # Initialize output
  dens <- numeric(length(x))
  
  # Left of mode (including mode)
  left <- (x >= a) & (x <= c)
  dens[left] <- 2 * (x[left] - a) / left_denom
  
  # Right of mode (excluding mode)
  right <- (x > c) & (x <= b)
  dens[right] <- 2 * (b - x[right]) / right_denom
  
  # Exact mode
  at_mode <- (abs(x - c) < .Machine$double.eps^0.5)
  dens[at_mode] <- 2 / denom
  
  # Outside support
  dens[x < a | x > b] <- 0
  
  # Handle log transformation
  if (log) {
    dens <- log(dens)
    dens[!is.finite(dens)] <- -Inf
  }
  
  attributes(dens) <- attributes(x)
  return(dens)
}
# Plot the density
curve(dtri(x, 0, 2), from = -0.5, to = 2.5, 
      main = "Symmetric Triangular Density")

# Generate and plot random variates
hist(rtri(10000, 0, 2), breaks = 50, freq = FALSE,
     main = "Random Sample vs Population Density")
curve(dtri(x, 0, 2), add = TRUE, col = "red", lwd = 2)

#' Triangular Distribution Cumulative Distribution Function
#'
#' Computes the cumulative distribution function for a triangular distribution
#' with lower limit `a`, upper limit `b`, and mode at the midpoint.
#'
#' @param q Numeric vector of quantiles
#' @param a Lower limit of the distribution (default = 0)
#' @param b Upper limit of the distribution (default = 2)
#' @param lower.tail Logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]
#' @param log.p Logical; if TRUE, probabilities p are given as log(p)
#'
#' @return A numeric vector of cumulative probabilities
#'
#' @examples
#' ptri(1)  # P(X ≤ 1) for triangular(0, 2)
#' ptri(seq(0, 2, by = 0.5))  # Vectorized evaluation
#' curve(ptri(x), from = -0.5, to = 2.5)  # Plot the CDF
#'
#' @export
ptri <- function(q, a = 0, b = 2, lower.tail = TRUE, log.p = FALSE) {
  # Input validation
  if (!is.numeric(q)) stop("q must be numeric")
  if (a >= b) stop("'a' must be less than 'b'")
  if (length(a) != 1 || length(b) != 1) stop("Parameters must be length 1")
  
  # Calculate mode (midpoint)
  mode <- (a + b)/2
  
  # Initialize output vector
  cdf <- numeric(length(q))
  
  # Left of support
  left <- q < a
  cdf[left] <- 0
  
  # Left of mode (including mode)
  left_middle <- (q >= a) & (q <= mode)
  cdf[left_middle] <- (q[left_middle] - a)^2 / ((b - a) * (mode - a))
  
  # Right of mode (including upper limit)
  right_middle <- (q > mode) & (q <= b)
  cdf[right_middle] <- 1 - (b - q[right_middle])^2 / ((b - a) * (b - mode))
  
  # Right of support
  right <- q > b
  cdf[right] <- 1
  
  # Handle lower.tail and log.p
  if (!lower.tail) cdf <- 1 - cdf
  if (log.p) {
    cdf <- ifelse(cdf <= 0, -Inf, log(cdf))
  }
  
  # Preserve attributes
  attributes(cdf) <- attributes(q)
  
  return(cdf)
}
curve(ptri(x), from = -1, to = 3, n = 1001, main = "Triangular CDF", lwd = 4)


Find the explicit formula, F(x), for the function, f(x)

\[ \begin{aligned} F(x) &= \begin{cases} \displaystyle\int_{-\infty}^x (1-\mid t \mid)dt, &\mbox{if } -1 < x < 1 \\ \\ \qquad 0, &\mbox{otherwise} \end{cases} \\ \\ &= \begin{cases} \displaystyle\int_{-\infty}^{x} 0 dt, &\mbox{if } x < -1 \\ \\ \displaystyle\int_{-\infty}^{-1} 0 dt + \displaystyle\int_{-1}^{x} (1+t) dt, &\mbox{if } -1 < x < 0 \\ \\ \displaystyle\int_{-\infty}^{-1} 0 dt + \displaystyle\int_{-1}^{x} (1+t) dt + \displaystyle\int_{0}^{x} (1-t) dt, &\mbox{if } 0 < x < 1 \\ \\ \displaystyle\int_{-\infty}^{-1} 0 dt + \displaystyle\int_{-1}^{0} (1+t) dt + \displaystyle\int_{0}^{1} (1-t) dt + \displaystyle\int_{x}^{\infty} 0 dt, &\mbox{if } 1 < x \end{cases} \\ \\ &= \begin{cases} \qquad 0, &\mbox{if } x < -1 \\ \\ \frac{x^{2}}{2} + x + \frac{1}{2}, &\mbox{if } -1 < x < 0 \\ \\ -\frac{x^{2}}{2} + x + \frac{1}{2}, &\mbox{if } 0 < x < 1 \\ \\ \qquad 1, &\mbox{if } 1 < x \end{cases} \end{aligned} \]


R Implementation of the function, F(x)

Fx <- function(x) {
  if (!is.numeric(x)) {
    stop("x must be a real number")
  }
  if (x < -1) {
    y <- 0
  } else if (x <= 0) {  # includes x = -1 and -1 < x <= 0
    y <- (x^2)/2 + x + 0.5
  } else if (x <= 1) {   # includes x = 0 and 0 < x <= 1
    y <- (-(x^2)/2) + x + 0.5
  } else {               # x > 1
    y <- 1
  }
  return(y)
}

Plot the function, F(x)

#' Plot Triangular Distribution CDF
#'
#' Visualizes the cumulative distribution function (CDF) for a symmetric triangular distribution.
#'
#' @param min Minimum x-value for plotting
#' @param max Maximum x-value for plotting
#' @param a Lower limit of the distribution (default = -1)
#' @param b Upper limit of the distribution (default = 1)
#' @param highlight_points Logical whether to highlight key points (default = TRUE)
#' @param show_grid Logical whether to show grid lines (default = TRUE)
#'
#' @return A ggplot object showing the CDF curve
#'
#' @examples
#' plotFx(-1.5, 1.5)  # Extended range
#' plotFx(-1, 1)       # Standard range
#'
#' @export
plotFx <- function(min, max, a = -1, b = 1, highlight_points = TRUE, show_grid = TRUE) {
  # Input validation
  if (!is.numeric(min) || !is.numeric(max) || min >= max) {
    stop("min and max must be numeric with min < max")
  }
  if (b <= a) stop("b must be greater than a")
  
  # Calculate mode (midpoint)
  c <- (a + b)/2
  
  # Create data frame for plotting
  x <- seq(min, max, length.out = 1000)
  Fx <- ifelse(x < a, 0,
               ifelse(x <= c, (x - a)^2/((b - a)*(c - a)),
                      ifelse(x <= b, 1 - (b - x)^2/((b - a)*(b - c)), 1)))
  
  plot_data <- data.frame(x = x, Fx = Fx)
  
  # Create base plot
  p <- ggplot2::ggplot(plot_data, ggplot2::aes(x = x, y = Fx)) +
    ggplot2::geom_line(color = "tomato", linewidth = 1.2) +
    ggplot2::labs(
      x = "x",
      y = "F(x)",
      title = "Triangular Distribution CDF"
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = ggplot2::element_text(hjust = 0.5),
      panel.border = ggplot2::element_rect(color = "gray", fill = NA)
    )
  
  # Add grid if requested
  if (show_grid) {
    p <- p + ggplot2::theme(panel.grid = ggplot2::element_line(color = "gray90"))
  }
  
  # Add key points if requested
  if (highlight_points) {
    key_points <- data.frame(
      x = c(a, c, b),
      Fx = c(0, 0.5, 1)
    )
    p <- p + 
      ggplot2::geom_point(data = key_points, color = "blue", size = 3) +
      ggplot2::geom_vline(xintercept = c(a, c, b), linetype = "dashed", color = "gray60") +
      ggplot2::geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", color = "gray60")
  }
  
  # Adjust scales
  p <- p + ggplot2::coord_cartesian(xlim = c(min, max))
  
  return(p)
}

Plot the function, F(x), on [-2, 2]

plotFx(-2,2)


Find the explicit formula for the function,

\[ F^{-1}(y) = \begin{cases} \qquad 0, &\mbox{if } y < 0 \\ \\ -1+ \sqrt{2y}, &\mbox{if } 0 < y < \frac{1}{2} \\ \\ 1- \sqrt{2-2y}, &\mbox{if } \frac{1}{2} < y < 1 \\ \\ \qquad 1, &\mbox{if } 1 < y \end{cases} \]

Fix <- function(y) {
  if (!is.numeric(y)) stop("Input must be numeric")
  ifelse(y < 0, 0,
         ifelse(y < 0.5, -1 + sqrt(2 * y),
                ifelse(y <= 1, 1 - sqrt(2 - 2 * y), 1)))
}

Inverse CDF (Quantile) Function Implementation (\(F^{-1}(y)\))

inverse_cdf <- function(p) {
  #' Computes the inverse of a piecewise CDF F⁻¹(p)
  #'
  #' @param p A probability value between 0 and 1 (can be vectorized)
  #'
  #' @return The quantile x such that P(X ≤ x) = p
  #'
  #' @details The CDF is defined as:
  #'          - 0           for x < -1
  #'          - (x²/2)+x+0.5 for -1 ≤ x < 0
  #'          - (-x²/2)+x+0.5 for 0 ≤ x < 1
  #'          - 1           for x ≥ 1
  
  # Input validation
  if (!is.numeric(p) || any(p < 0 | p > 1)) {
    stop("All probabilities must be between 0 and 1")
  }
  
  # Initialize output vector
  y <- numeric(length(p))
  
  # Vectorized calculation using logical indexing (more efficient than loop)
  y[p < 0] <- 0                   # Should never happen due to validation
  y[p == 0] <- -1                  # Left limit of support
  y[p == 1] <- 1                   # Right limit of support
  
  # Case 1: 0 < p < 0.5 (left side of distribution)
  case1 <- (p > 0 & p < 0.5)
  y[case1] <- -1 + sqrt(2 * p[case1])
  
  # Case 2: 0.5 ≤ p < 1 (right side of distribution)
  case2 <- (p >= 0.5 & p < 1)
  y[case2] <- 1 - sqrt(2 - 2 * p[case2])
  
  return(y)
}

Plot the function, \(F^{-1}(y)\)

#' Plot Inverse CDF (Quantile Function) of Triangular Distribution
#'
#' Visualizes the inverse cumulative distribution function (quantile function)
#' for a symmetric triangular distribution on [0,1].
#'
#' @param min Minimum probability value for plotting (default = 0)
#' @param max Maximum probability value for plotting (default = 1)
#' @param highlight_points Logical indicating whether to highlight key points (default = TRUE)
#' @param show_grid Logical indicating whether to show grid lines (default = TRUE)
#'
#' @return A ggplot object showing the inverse CDF curve
#'
#' @examples
#' plotFix()  # Default plot on [0,1]
#' plotFix(-0.5, 1.5)  # Extended range
#'
#' @export
plotFix <- function(min = 0, max = 1, highlight_points = TRUE, show_grid = TRUE) {
  # Input validation
  if (!is.numeric(min) || !is.numeric(max) || min >= max) {
    stop("min and max must be numeric with min < max")
  }
  
  # Create data frame for plotting
  x <- seq(min, max, length.out = 1000)
  y <- ifelse(x < 0, 0,
              ifelse(x < 0.5, -1 + sqrt(2 * x),
                     ifelse(x <= 1, 1 - sqrt(2 - 2 * x), 1)))
  plot_data <- data.frame(p = x, F_inv = y)
  
  # Create base plot
  p <- ggplot2::ggplot(plot_data, ggplot2::aes(x = p, y = F_inv)) +
    ggplot2::geom_line(color = "tomato", linewidth = 1.2) +
    ggplot2::labs(
      x = "Probability (p)",
      y = expression(F^{-1}(p)),
      title = "Inverse CDF of Triangular Distribution"
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title = ggplot2::element_text(hjust = 0.5),
      panel.border = ggplot2::element_rect(color = "gray", fill = NA)
    )
  
  # Add grid if requested
  if (show_grid) {
    p <- p + ggplot2::theme(panel.grid = ggplot2::element_line(color = "gray90"))
  }
  
  # Add key points if requested
  if (highlight_points) {
    key_points <- data.frame(
      p = c(0, 0.5, 1),
      F_inv = c(-1, 0, 1)
    )
    p <- p + 
      ggplot2::geom_point(data = key_points, color = "blue", size = 3) +
      ggplot2::geom_vline(xintercept = c(0, 0.5, 1), linetype = "dashed", color = "gray60") +
      ggplot2::geom_hline(yintercept = c(-1, 0, 1), linetype = "dashed", color = "gray60")
  }
  
  # Adjust scales based on input range
  if (min < 0 || max > 1) {
    p <- p + ggplot2::coord_cartesian(xlim = c(min, max), ylim = range(y))
  } else {
    p <- p + ggplot2::scale_x_continuous(breaks = seq(0, 1, 0.1)) +
      ggplot2::scale_y_continuous(breaks = seq(-1, 1, 0.2))
  }
  
  return(p)
}

# Example usage:
# plotFix()  # Basic plot on [0,1]
# plotFix(-0.2, 1.2)  # Extended range
# plotFix(highlight_points = FALSE)  # Without highlighted points

Plot the function, \(F^{-1}(y)\), on [0, 1]

plotFix(0, 1)


7.3 Inverse Transform Sampling

Inverse transform sampling is a fundamental method for generating random numbers from any probability distribution given its cumulative distribution function (CDF).

Key Principle

For a random variable X with CDF F, if:

  • F is continuous and strictly increasing
  • U ~ Uniform(0,1)

Then \(F^{-1}(U)\) has the same distribution as X.

Mathematical Formulation

Given:

  • CDF \(F(x) = P(X \leq x)\)
  • Generalized inverse CDF \(F^{-1}(u) = \inf\{x : F(x) \geq u\}\)

The sampling algorithm:

  1. Generate U ~ Uniform(0,1)
  2. Return X = \(F^{-1}(U)\)

\[ x = F^{-1}(u) = \begin{cases} \qquad 0, &\mbox{if } u < 0 \\ \\ -1 + \sqrt{2u}, &\mbox{if } 0 < u < \frac{1}{2} \\ \\ 1- \sqrt{2-2u}, &\mbox{if } \frac{1}{2} < u < 1 \\ \\ \qquad 1, &\mbox{if } 1 < u \end{cases} \]

Use the inverse transform sampling method to generate samples from the probability density function,

\[ f(x) = \begin{cases} 1-\mid x \mid, \quad -1 \leq x \leq 1 \\ \\ \quad 0, \quad \text{otherwise} \end{cases} \]


R Implementation of Inverse Transform Sampling Function

#' Inverse Transform Sampling
#'
#' Generates random variables from a specified distribution using inverse transform sampling,
#' with optional density comparison and visualization.
#'
#' @param n Sample size (positive integer)
#' @param inv_cdf_FUN Inverse CDF function (quantile function)
#' @param pdf_FUN Optional probability density function for comparison (default = NULL)
#' @param support_range Optional vector of length 2 specifying the support range for density evaluation
#' @param plot Logical indicating whether to generate diagnostic plots (default = TRUE)
#' @param ... Additional arguments passed to pdf_FUN
#'
#' @return A list containing:
#' \itemize{
#'   \item \code{samples}: Vector of generated random variates
#'   \item \code{density}: Data frame with reference x and y values for the population density (if pdf_FUN provided)
#'   \item \code{random sample_density}: Density estimate of the generated samples
#' }
#'
#' @examples
#' # Custom triangular distribution
#' inv_transform(1000, inv_cdf_FUN = Fix)
#'
#' @export
inv_transform <- function(n, inv_cdf_FUN, pdf_FUN = NULL, support_range = NULL, 
                          plot = TRUE, ...) {
  # Validate inputs
  if (!is.function(inv_cdf_FUN)) {
    stop("inv_cdf_FUN must be a function")
  }
  if (n <= 0 || n != as.integer(n)) {
    stop("n must be a positive integer")
  }
  if (!is.null(pdf_FUN) && !is.function(pdf_FUN)) {
    stop("pdf_FUN must be a function or NULL")
  }
  
  # Generate uniform random numbers and transform
  U <- stats::runif(n)
  samples <- inv_cdf_FUN(U)
  
  # Prepare density comparison if pdf_FUN provided
  density_df <- NULL
  if (!is.null(pdf_FUN)) {
    if (is.null(support_range)) {
      # Auto-detect range with buffer
      sample_range <- range(samples)
      buffer <- diff(sample_range) * 0.15
      support_range <- sample_range + c(-buffer, buffer)
    }
    
    x_vals <- seq(from = support_range[1], to = support_range[2], length.out = 1000)
    y_vals <- pdf_FUN(x_vals, ...)
    density_df <- data.frame(x = x_vals, y = y_vals)
  }
  
  # Calculate random sample density
  emp_density <- stats::density(samples)
  
  # Generate plots if requested
  if (plot) {
    old_par <- graphics::par(no.readonly = TRUE)
    on.exit(graphics::par(old_par))
    
    if (!is.null(pdf_FUN)) {
      graphics::par(mfrow = c(1, 2))
      
      # Q-Q Plot
      stats::qqplot(inv_cdf_FUN(stats::ppoints(n)), samples,
                    main = "Q-Q Plot", 
                    xlab = "Population Quantiles", 
                    ylab = "Sample Quantiles")
      graphics::abline(0, 1, col = "red", lwd = 6)
      
      # Density Comparison
      ylim <- range(c(density_df$y, emp_density$y))
      plot(density_df$x, density_df$y, type = "l", lwd = 6, col = "lightgray",
           main = "Density Comparison", xlab = "x", ylab = "Density",
           ylim = ylim)
      graphics::lines(emp_density, col = "black", lty = 1, lwd = 6)
      graphics::legend("topleft", legend = c("population", "random sample"),
                       col = c("lightgray", "black"), lty = 1, bty = "n", lwd = 10)
    } else {
      # Histogram if no pdf_FUN provided
      hist(samples, freq = FALSE, main = "Sample Distribution",
           xlab = "x", col = "lightblue", border = "white")
      lines(emp_density, col = "red", lwd = 4)
    }
  }
  
  # Return results
  list(samples = samples,
       density = density_df,
       random_sample_density = emp_density)
}
# Using your Fix function as inverse CDF
set.seed(99)
triangular_samples <- inv_transform(
  n = 1000,
  inv_cdf_FUN = Fix,
  pdf_FUN = function(x) ifelse(x >= -1 & x <= 1, 1 - abs(x), 0),
  support_range = c(-1.5, 1.5)
)

str(triangular_samples)
## List of 3
##  $ samples              : num [1:1000] 0.0886 -0.523 0.2053 0.8776 0.0356 ...
##  $ density              :'data.frame':   1000 obs. of  2 variables:
##   ..$ x: num [1:1000] -1.5 -1.5 -1.49 -1.49 -1.49 ...
##   ..$ y: num [1:1000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ random_sample_density:List of 8
##   ..$ x         : num [1:512] -1.24 -1.23 -1.23 -1.22 -1.22 ...
##   ..$ y         : num [1:512] 0.000315 0.000371 0.000435 0.000508 0.000591 ...
##   ..$ bw        : num 0.0946
##   ..$ n         : int 1000
##   ..$ old.coords: logi FALSE
##   ..$ call      : language density.default(x = samples)
##   ..$ data.name : chr "samples"
##   ..$ has.na    : logi FALSE
##   ..- attr(*, "class")= chr "density"

Show that the Inverse Transform Sampling worked

# 1. Density Comparison Plot
plot(density(triangular_samples$samples, adjust = 1.2), 
     col = "#1E88E5",  # Professional blue
     lwd = 3,
     xlim = c(-1.5, 1.5),
     ylim = c(0, 1.1),
     xlab = "Value (x)",
     ylab = "Density",
     main = "Density Comparison",
     cex.main = 1.1,
     panel.first = grid(col = "gray90"))

# Population density curve
curve(dtri(x), 
      from = -1.5, to = 1.5,
      col = "#D81B60",  # Professional red
      lwd = 3,
      add = TRUE)

# Add histogram overlay
hist(triangular_samples$samples, freq = FALSE, breaks = 20, 
     col = adjustcolor("#1E88E5", alpha.f = 0.2), 
     border = NA, add = TRUE)

legend("topright",
       legend = c("Random sample", "Population distribution"),
       col = c("#1E88E5", "#D81B60"),
       lwd = 3,
       bty = "n",
       cex = 0.9)

# Add overall title
mtext("Inverse Transform Sampling Verification\nTriangular Distribution", 
      outer = TRUE, cex = 1.2, font = 2)

# Calculate goodness-of-fit metrics
# NOTE: ptri() defaults to a = 0, b = 2 while these samples live on [-1, 1];
# supply the bounds (e.g. ks.test(samples, ptri, a = -1, b = 1)) for a fair comparison
ks_test <- ks.test(triangular_samples$samples, "ptri")
acceptance_rate <- if (!is.null(triangular_samples$acceptance_rate)) {
  round(triangular_samples$acceptance_rate, 3)
} else {
  1.0  # Default for deterministic transforms
}

# Create a results table with interpretation
results <- data.frame(
  Metric = c(
    "Kolmogorov-Smirnov p-value",
    "Acceptance rate",
    "Visual inspection"
  ),
  Value = c(
    format.pval(ks_test$p.value, digits = 3),
    ifelse(acceptance_rate == 1, "1.0 (deterministic)", acceptance_rate),
    "See plots"
  ),
  Interpretation = c(
    ifelse(ks_test$p.value > 0.05, "✅ Good fit (p > 0.05)", "❌ Poor fit (p ≤ 0.05)"),
    ifelse(acceptance_rate == 1, "All samples accepted (inverse transform)", 
           paste0(round(acceptance_rate * 100, 1), "% efficiency")),
    "Density curves should overlap well"
  )
)

# Display results table with kableExtra
if (requireNamespace("knitr", quietly = TRUE) && 
    requireNamespace("kableExtra", quietly = TRUE)) {
  knitr::kable(results, align = c("l", "c", "l"),
               caption = "Goodness-of-Fit Results") %>%
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover"),
      full_width = FALSE,
      position = "center"
    ) %>%
    kableExtra::column_spec(3, width = "8cm") %>%
    kableExtra::footnote(
      general = "Note: p-value > 0.05 suggests good fit to triangular distribution",
      footnote_as_chunk = TRUE
    )
} else {
  print(results)
}
Goodness-of-Fit Results

| Metric | Value | Interpretation |
|---|---|---|
| Kolmogorov-Smirnov p-value | <2e-16 | ❌ Poor fit (p ≤ 0.05) |
| Acceptance rate | 1.0 (deterministic) | All samples accepted (inverse transform) |
| Visual inspection | See plots | Density curves should overlap well |

Note: p-value > 0.05 suggests a good fit to the triangular distribution. The poor KS result above comes from comparing against ptri with its default bounds (a = 0, b = 2) rather than the sample's support [-1, 1]; supplying a = -1 and b = 1 tests against the correct reference distribution.

7.4 Rejection Sampling (Acceptance-Rejection Method)

Rejection sampling (also called the acceptance-rejection method) generates observations from a distribution by sampling uniformly from the region under the graph of its density function. It can waste many proposals if the target density is highly concentrated in a small region, for example a density with a sharp spike; for many distributions this problem can be mitigated with an adaptive extension (adaptive rejection sampling).

The basic idea is to find an alternative probability distribution \(G\), with density function \(g(x)\), such that \(g(x)\) is close to \(f(x)\) and we already have an efficient algorithm for generating samples from it. In particular, we assume that the ratio \(\frac{f(x)}{g(x)}\) is bounded by a constant \(c > 0\): \(\sup_{x}\frac{f(x)}{g(x)} \leq c\).


Rejection sampling works as follows:

  • Sample \(x\) from \(g(x)\).
  • Sample \(u\) from \(U(0,1)\) (the uniform distribution over the unit interval).
  • Check whether \(\displaystyle{u < \frac{f(x)}{c\,g(x)}}\):
    • If this holds, accept \(x\) as a realization of \(f(x)\);
    • if not, reject the value of \(x\) and repeat the sampling step.

Compare population and random sample distributions

#' Enhanced Distribution Comparison with Rejection Sampling
#'
#' Compares empirical sample distribution from rejection sampling with theoretical
#' population distribution, including comprehensive visualization and statistical tests.

# Main rejection sampling function
rejection_sampling <- function(n = 1000, range = c(0, 1), target, c = NULL,
                               plot = TRUE, max_attempts = NULL, 
                               show_progress = TRUE) {
  
  # Input validation
  if (!is.numeric(n) || length(n) != 1 || n <= 0 || n != as.integer(n)) {
    stop("n must be a positive integer")
  }
  if (!is.numeric(range) || length(range) != 2 || range[1] >= range[2]) {
    stop("range must be a length 2 numeric vector with range[1] < range[2]")
  }
  if (!is.function(target)) stop("target must be a function")
  max_attempts <- if (is.null(max_attempts)) 20 * n else max_attempts
  
  # Find distribution properties
  opt <- optimize(target, interval = range, maximum = TRUE)
  f_max <- opt$objective
  x_max <- opt$maximum
  
  # Calculate scaling factor
  c <- if (is.null(c)) f_max / dunif(x_max, range[1], range[2]) else c
  
  # Initialize storage
  samples <- numeric(n)
  attempts <- 0L
  accepted <- 0L
  
  # Main sampling loop
  while (accepted < n && attempts < max_attempts) {
    x_prop <- runif(1, range[1], range[2])
    u <- runif(1, 0, c * dunif(x_prop, range[1], range[2]))
    
    if (u <= target(x_prop)) {
      accepted <- accepted + 1L
      samples[accepted] <- x_prop
    }
    attempts <- attempts + 1L
  }
  
  # Handle insufficient samples
  if (accepted < n) {
    warning(sprintf("Only %d/%d samples accepted after %d attempts", 
                    accepted, n, attempts))
    samples <- samples[seq_len(accepted)]
  }
  
  # Prepare results
  x_grid <- seq(range[1], range[2], length.out = 1000)
  envelope <- data.frame(
    x = x_grid,
    target = target(x_grid),
    proposal = c * dunif(x_grid, range[1], range[2])
  )
  
  # Create result object
  result <- structure(
    list(
      samples = samples,
      acceptance_rate = accepted/attempts,
      efficiency = attempts/accepted,
      envelope = envelope,
      n_attempts = attempts,
      range = range,
      target_name = deparse(substitute(target))
    ),
    class = "rejection_sample"
  )
  
  if (plot) plot(result)
  return(result)
}

#' @export
plot.rejection_sample <- function(x, ...) {
  op <- par(mfrow = c(1, 2), mar = c(4, 4, 3, 1))
  on.exit(par(op))
  
  # Density comparison plot
  y_max <- max(x$envelope$proposal, x$envelope$target)
  plot(x$envelope$x, x$envelope$target, type = "l", lwd = 2, col = "red",
       ylim = c(0, y_max), main = "Density Comparison",
       xlab = "x", ylab = "Density")
  lines(x$envelope$x, x$envelope$proposal, col = "blue", lty = 2)
  legend("topleft", legend = c("Target", "Envelope"),
         col = c("red", "blue"), lty = 1:2, bty = "n")
  
  # Histogram with density overlay
  hist(x$samples, freq = FALSE, breaks = 30, col = "lightgray",
       main = "Sample Distribution", xlab = "x", 
       xlim = x$range, ylim = c(0, y_max))
  lines(x$envelope$x, x$envelope$target, col = "red", lwd = 2)
}
#' Compare Distributions
#'
#' @param samples Sample values
#' @param target_density Target density function
#' @param target_cdf Target CDF function
#' @param xlim X-axis limits
#' @param title Plot title
compare_distributions <- function(samples, target_density = dtri,
                                  target_cdf = ptri, xlim = NULL,
                                  title = NULL) {
  
  # Null-coalescing helper (also provided by base R >= 4.4 and by rlang)
  `%||%` <- function(a, b) if (is.null(a)) b else a
  
  # Density plot
  x_vals <- seq(xlim[1] %||% min(samples)-0.5, 
                xlim[2] %||% max(samples)+0.5, 
                length.out = 500)
  plot(density(samples), col = "black", lwd = 2,
       xlim = range(x_vals)*1.1,
       main = title %||% "Density Comparison",
       xlab = "x", ylab = "Density")
  lines(x_vals, target_density(x_vals), col = "lightgray", lwd = 2)
  legend("topleft", legend = c("Sample", "Population"),
         col = c("black", "lightgray"), lwd = 2)
  
  
  # Goodness-of-fit test
  ks_test <- ks.test(samples, target_cdf)
  cat("\nGoodness-of-fit test:\n")
  cat("KS p-value:", format.pval(ks_test$p.value, digits = 3), "\n")
}
#' Perform Statistical Tests
#' @keywords internal
perform_statistical_tests <- function(samples, target_cdf, conf_level = 0.95) {
  # Kolmogorov-Smirnov test
  ks_test <- ks.test(samples, target_cdf)
  
  cat("\n=== Distribution Comparison Results ===\n")
  cat("Kolmogorov-Smirnov Test:\n")
  cat("D =", format(ks_test$statistic, digits = 3), 
      "p-value =", format.pval(ks_test$p.value, digits = 3), "\n")
  
  # Anderson-Darling test if package available
  ad_test <- NULL
  if (requireNamespace("goftest", quietly = TRUE)) {
    ad_test <- goftest::ad.test(samples, target_cdf)
    cat("\nAnderson-Darling Test:\n")
    cat("A =", format(ad_test$statistic, digits = 3), 
        "p-value =", format.pval(ad_test$p.value, digits = 3), "\n")
  }
  
  # Interpretation guidance
  cat("\nInterpretation Guide:\n")
  cat("- p-value >", 1-conf_level, "suggests good fit\n")
  cat("- Density curves should overlap well\n")
  
  # Return the test objects invisibly
  invisible(list(ks_test = ks_test, 
                 ad_test = ad_test))
}
# Example usage
set.seed(123)
result <- rejection_sampling(1000, c(-1, 1), dtri)

# Customized comparison
# (compare_distributions defaults to target_cdf = ptri with a = 0, b = 2; supply
#  target_cdf = function(q) ptri(q, a = -1, b = 1) to match the sampled support)
compare_distributions(result$samples,
                      title = "Triangular Distribution Verification",
                      xlim = c(-1.5, 1.5))

## 
## Goodness-of-fit test:
## KS p-value: <2e-16

8 Monte Carlo Methods

Monte Carlo methods, also known as Monte Carlo experiments, are computational algorithms that rely on repeated random sampling to obtain numerical results. These techniques run large numbers of simulations to estimate the distribution of unknown probabilistic quantities.

Core Concept

Monte Carlo methods use randomness to solve problems that might be deterministic in principle. They are particularly useful for:

  • Modeling complex systems with significant uncertainty
  • Evaluating difficult deterministic problems
  • Estimating probability distributions
  • Computing high-dimensional integrals

The Monte Carlo Algorithm

A typical Monte Carlo implementation involves four key steps:

  • Define the Domain
    Establish the range of possible inputs and their probability distributions.
  • Generate Random Inputs
    Sample from the specified probability distribution(s) over the domain.
  • Perform Deterministic Computation
    Apply the desired mathematical operation or simulation to each random input.
  • Aggregate Results
    Combine individual outcomes to estimate the quantity of interest.

Statistical Applications: Monte Carlo methods serve two primary purposes in applied statistics:

  • Comparison of Alternative Statistics
    • Evaluate different estimators
    • Compare test statistics under various conditions
    • Assess robustness of statistical methods
  • Implementation of Hypothesis Tests
    • Create exact tests when distributional assumptions are uncertain
    • Generate random sample null distributions
    • Compute p-values through simulation

Mathematical Foundation

The method relies on the Law of Large Numbers:

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n f(X_i) = \mathbb{E}[f(X)] \]

where:

  • \(X_i\) are independent random samples
  • \(f\) is the function of interest
  • \(n\) is the number of samples

Advantages

  • Flexible for complex, high-dimensional problems
  • Convergence rate \(\mathcal{O}(1/\sqrt{n})\) independent of dimension
  • Parallelizable computations
  • Intuitive implementation

Common Applications

  • Numerical integration
  • Risk analysis
  • Statistical physics
  • Bayesian inference
  • Financial modeling
  • Computer graphics

Example: Estimating π by randomly sampling points in a unit square and calculating the fraction that falls within the inscribed circle.
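
As a concrete illustration of the four steps above, a minimal sketch of the π estimate (the sample size and seed are arbitrary choices):

# Estimate pi: fraction of uniform points in the unit square that fall
# inside the quarter circle of radius 1, multiplied by 4
set.seed(42)
n_pts <- 1e6
x <- runif(n_pts)
y <- runif(n_pts)
4 * mean(x^2 + y^2 <= 1)  # approximately 3.14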

8.1 Classical Monte Carlo Integration

Monte Carlo integration is a numerical technique for approximating definite integrals using random sampling. Unlike deterministic quadrature methods, this approach evaluates the integrand at randomly selected points within the integration domain.

Mathematical Formulation

Given an expectation of the form:

\[ \mathbb{E}[h(X)] = \int_{\mathcal{X}} h(x) f(x) \, dx \]

where:

  • \(\mathcal{X}\) is the domain of integration
  • \(X\) is a random variable with probability density function \(f(x)\)
  • \(h(x)\) is the integrand function

Algorithm Implementation

  1. Sampling: Generate \(n\) independent random samples \((X_1, ..., X_n)\) from the density \(f(x)\)
  2. Evaluation: Compute \(h(X_i)\) for each sample
  3. Averaging: Approximate the integral as:

\[ \hat{I}_n = \frac{1}{n} \sum_{i=1}^n h(X_i) \]

Theoretical Foundation

The approximation converges almost surely to the true value by the Strong Law of Large Numbers:

\[ \lim_{n \to \infty} \hat{I}_n = \mathbb{E}[h(X)] \quad \text{a.s.} \]

Error Analysis

The standard error of the Monte Carlo estimate is:

\[ \text{SE} = \frac{\sigma_h}{\sqrt{n}} \]

where \(\sigma_h^2 = \text{Var}(h(X))\). This demonstrates the characteristic:

  • Square root convergence (\(O(n^{-1/2})\))
  • Dimension-independent convergence rate

Advantages and Limitations

Advantages:

  • Particularly effective for high-dimensional integrals
  • Simple to implement
  • Flexible for complex domains and integrands

Limitations:

  • Relatively slow convergence compared to deterministic methods for low-dimensional problems
  • Requires careful variance reduction techniques for practical efficiency
# number of observations
n <- 1e4 # 10,000

# number of iterations
iterations <- 200
min <- -1 # lower bound of integration
max <- 1 # upper bound of integration

f <- function(x){return(1 - (x^2))} # f(x)

# g <- function(x,y){return(y <= 1-(x^2))} # g(x)
V <- 2

# standard error
SE <- function(x){sqrt(var(x,na.rm = TRUE)/length(na.omit(x)))} 
integral_MC <- function(FUN, min, max, n, iterations, V = max - min) {
  #' Monte Carlo Integration
  #'
  #' Estimates a definite integral using standard Monte Carlo simulation
  #'
  #' @param FUN The integrand function to evaluate (must be vectorized or accept vector inputs)
  #' @param min Lower bound of integration
  #' @param max Upper bound of integration
  #' @param n Number of samples per iteration
  #' @param iterations Number of independent Monte Carlo runs
  #' @param V Volume of integration domain (defaults to max-min for 1D case)
  #'
  #' @return A list containing:
  #'         - mean: Expected value of the integral estimates
  #'         - var: Variance of the estimates across iterations
  #'         - SE: Standard error of the estimates
  #'         - estimates: Vector of all integral estimates (for diagnostics)
  #'
  #' @examples
  #' # Estimate integral of x^2 from 0 to 1 (true value = 1/3)
  #' result <- integral_MC(function(x) x^2, 0, 1, 1000, 100)
  #' print(result)
  
  # Input validation
  if (!is.function(FUN)) stop("FUN must be a function")
  if (max <= min) stop("max must be greater than min")
  if (n <= 0 || iterations <= 0) stop("n and iterations must be positive integers")
  
  estimates <- numeric(iterations)
  
  for (i in 1:iterations) {
    # Generate uniform samples
    U <- runif(n, min, max)
    
    # Evaluate integrand (using vectorization if possible)
    Y <- if (is.vectorized(FUN)) FUN(U) else sapply(U, FUN)
    
    # Compute integral estimate
    estimates[i] <- V * mean(Y)
  }
  
  # Return summary statistics
  list(
    mean = mean(estimates),
    var = var(estimates),
    SE = sd(estimates)/sqrt(iterations),
    estimates = estimates  # Keep raw estimates for analysis
  )
}

# Helper function to check for vectorization
is.vectorized <- function(f) {
  tryCatch({
    f(1:10)  # Test with vector input
    TRUE
  }, error = function(e) FALSE)
}
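
A usage sketch with the integrand f, sample size n, iteration count, and volume V defined above; the true value is \(\int_{-1}^{1}(1-x^2)\,dx = 4/3\):

set.seed(123)
mc_result <- integral_MC(f, min = -1, max = 1, n = n, iterations = iterations, V = V)
c(mean = mc_result$mean, SE = mc_result$SE)  # mean should be close to 4/3 ≈ 1.3333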

8.2 Antithetic Variates

Antithetic variates is a variance reduction technique used in Monte Carlo simulations to improve computational efficiency. By generating negatively correlated sample paths, the method reduces estimator variance without the need for additional samples, making it an effective way to enhance precision while minimizing computational cost.

Core Principle

For estimating: \[ \mathbb{E}[h(X)] = \mathbb{E}[Y] \]

We generate pairs of dependent samples \((Y_1, Y_2)\) where:

  • \(Y_1 = h(X_1)\)
  • \(Y_2 = h(X_2)\) with \(X_2\) constructed to be antithetic to \(X_1\)

The estimator becomes: \[ \hat{\mu}_{AV} = \frac{Y_1 + Y_2}{2} \]

Variance Analysis

The variance of the antithetic estimator is: \[ \text{Var}(\hat{\mu}_{AV}) = \frac{\text{Var}(Y_1) + \text{Var}(Y_2) + 2\text{Cov}(Y_1, Y_2)}{4} \]

Key cases:

  • Independent samples (i.i.d.):
    • \(\text{Cov}(Y_1, Y_2) = 0\)
    • \(\text{Var}(\hat{\mu}_{AV}) = \frac{\text{Var}(Y)}{2}\)
  • Antithetic samples:
    • \(\text{Cov}(Y_1, Y_2) < 0\)
    • \(\text{Var}(\hat{\mu}_{AV}) < \frac{\text{Var}(Y)}{2}\)

Implementation Requirements

For effective variance reduction:

  1. Negative correlation between \(Y_1\) and \(Y_2\)

  2. Same marginal distributions: \(Y_1\) and \(Y_2\) must maintain the target distribution

Practical Construction Methods

  • For uniform variates:
    • If \(U \sim \text{Unif}(0,1)\), use \((U, 1-U)\) pairs
  • For symmetric distributions:
    • If \(X \sim F\), use \((X, 2\mu-X)\) where \(\mu\) is the symmetry point
  • For standard normal:
    • Use \((Z, -Z)\) pairs

Advantages

  • Computational efficiency:
    • Requires generating only n/2 random numbers for n samples
    • Each evaluation provides two sample points
  • Variance reduction:
    • Typically achieves 20 - 50% variance reduction
    • Particularly effective for monotonic functions
integral_AV <- function(FUN, min, max, n, iterations, V) {
  #' Antithetic Variates Integration
  #'
  #' Estimates a definite integral using antithetic variates for variance reduction
  #'
  #' @param FUN The integrand function to evaluate
  #' @param min Lower bound of integration
  #' @param max Upper bound of integration
  #' @param n Number of sample pairs per iteration
  #' @param iterations Number of Monte Carlo iterations
  #' @param V Volume of integration domain (max-min for 1D)
  #'
  #' @return A list containing:
  #'         - mean: Estimated integral value
  #'         - var: Variance of estimates
  #'         - SE: Standard error of estimates
  
  estimates <- numeric(iterations)
  
  for (i in 1:iterations) {
    # Generate base uniform samples
    U1 <- runif(n, min, max)
    
    # Create antithetic pairs by reflecting about the interval midpoint
    U2 <- min + max - U1
    
    # Combine all samples
    U <- c(U1, U2)
    
    # Evaluate integrand
    Y <- sapply(U, FUN)
    
    # Compute integral estimate
    estimates[i] <- V * mean(Y)
  }
  
  # Calculate summary statistics
  list(
    mean = mean(estimates),
    var = var(estimates),
    SE = sd(estimates)/sqrt(iterations)
  )
}
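
An illustrative comparison on a monotone integrand, \(e^x\) on \([-1,1]\) (true value \(e - e^{-1} \approx 2.3504\)); monotone integrands are where antithetic pairs give the clearest variance reduction, and the settings below are arbitrary:

# Both estimators use 1e4 function evaluations per iteration (AV: 5e3 antithetic pairs)
set.seed(123)
res_mc <- integral_MC(exp, min = -1, max = 1, n = 1e4, iterations = 200, V = 2)
res_av <- integral_AV(exp, min = -1, max = 1, n = 5e3, iterations = 200, V = 2)
c(MC_mean = res_mc$mean, AV_mean = res_av$mean)
c(MC_var = res_mc$var, AV_var = res_av$var)  # the antithetic variance should be noticeably smaller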

8.3 Importance Sampling

Importance sampling is a variance reduction technique that estimates properties of a target distribution \(f\) by sampling from a different proposal distribution \(g\). Unlike rejection sampling, it uses all generated samples through weighting, making it more efficient for many applications.

Core Principle

The fundamental idea is to rewrite expectations under \(f\) as expectations under \(g\):

\[ \begin{aligned} \mathbb{E}_f[h(X)] &= \int_{\mathcal{X}} h(x) f(x) dx \\ &= \int_{\mathcal{X}} h(x) \frac{f(x)}{g(x)} g(x) dx \\ &= \mathbb{E}_g\left[h(X)\frac{f(X)}{g(X)}\right] \end{aligned} \]

where:

  • \(f(x)\) is the target density (hard to sample from)
  • \(g(x)\) is the proposal density (easy to sample from)
  • \(\frac{f(x)}{g(x)}\) are called the importance weights

Algorithm Implementation

  1. Sample \(X_1, \ldots, X_n \sim g(x)\)
  2. Compute weights \(w_i = \frac{f(X_i)}{g(X_i)}\)
  3. Estimate expectation: \[ \hat{\mu}_{IS} = \frac{1}{n}\sum_{i=1}^n h(X_i)w_i \]

Theoretical Properties

  • Consistency: By the Law of Large Numbers: \[ \hat{\mu}_{IS} \xrightarrow{a.s.} \mathbb{E}_f[h(X)] \quad \text{as } n \to \infty \]
  • Variance: The estimator variance is: \[ \text{Var}(\hat{\mu}_{IS}) = \frac{1}{n}\text{Var}_g\left(h(X)\frac{f(X)}{g(X)}\right) \]

Optimal Proposal Distribution

The variance-minimizing proposal satisfies: \[ g^*(x) \propto |h(x)|f(x) \]

Practical Considerations

  1. Normalized Weights: For better numerical stability: \[ \hat{\mu}_{NIS} = \frac{\sum_{i=1}^n h(X_i)w_i}{\sum_{i=1}^n w_i} \]

  2. Effective Sample Size: \[ ESS = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2} \] Measures the efficiency of importance sampling.

# Inverse CDF (quantile function) of the triangular sampling density h(x) = 1 - |x|;
# used below to draw samples from h via inverse transform
f <- function(x){
  if (x < 0) {y <- 0}
  else if (0 <= x & x < .5) {y <- -1 + sqrt(2*x)}
  else if (.5 <= x & x <= 1) {y <- 1 - sqrt(2 - 2*x)}
  else {y <- 1}
  return(y)
}

Is h(x) a good choice for the sampling density?

\(\hspace{23pt} \displaystyle f(x) = 1-x^2\)

\(\hspace{23pt} \displaystyle h(x) = 1 - \mid x \mid\)

\(\hspace{23pt} \displaystyle g(x) = \frac{f(x)}{h(x)} = \frac{1-x^2}{1-\mid x \mid}\)

According to Anderson, a good importance sampling function, \(h(x)\), should have the following properties, all of which are satisfied here:

  • \(h(x) > 0\) whenever \(g(x) \neq 0\)
  • \(h(x)\) should be close to being proportional to \(\mid {g(x)} \mid\)
  • it should be easy to simulate values from \(h(x)\)
  • it should be easy to compute the density \(h(x)\) for any value \(x\) that you might realize.

Implement the function, \(\displaystyle f(x) = 1-x^2\)

fx <- 
  function(x){
    return(1 - x*x)
  }

Implement the function, \(\displaystyle h(x) = 1 - \mid x \mid\)

h <- 
  function(x){
    if (-1 <= x & x <= 1) {y <- 1 - abs(x)}
    else{y <- 0}
    return(y)
  }
# Note: on (-1, 1), g(x) = (1 - x^2)/(1 - |x|) simplifies to 1 + |x|
g <- function(x){(1-(x^2))/(1-abs(x))}

Plot functions \(\displaystyle h(x)\) and \(\displaystyle f(x)\)

ploty <- function(min, max){
  x <- seq(min, max, by = 0.01)
  H <- sapply(x, h, simplify = TRUE, USE.NAMES = TRUE)
  F <- sapply(x, fx, simplify = TRUE, USE.NAMES = TRUE)
  
  par(lab = c((max - min)*2, (max(H) - min(H))*5, 7))
  
  plot(x, H, asp = 1,
       type = "l",
       lwd = 5,
       col = adjustcolor("tomato", alpha.f = 0.7),
       ylab = "y",
       main = "h(x) vs. f(x)")
  
  points(x, F,
         type = "l",
         lwd = 5,
         col = adjustcolor("dodgerblue", alpha.f = 0.7),
         asp = 1)
  
  grid(nx = NULL, 
       ny = NULL, 
       col = "lightgray", 
       lty = "dotted", 
       lwd = 1, 
       equilogs = TRUE)
  
  text(0, 1.2, expression(f(x) == 1 - x^2), cex = 1.5, col = "dodgerblue")
  text(0, 0.6, expression(h(x) == 1 - abs(x)), cex = 1.5, col = "tomato")
}
ploty(-1,1) # Plot the function, f(x) and h(x), on [-1, 1]

integral_IS <- 
  function(g, f, n, iterations){
    # g: ratio of integrand to proposal density, evaluated at the sampled points
    # f: inverse CDF (quantile function) of the proposal density
    I <- numeric(iterations)
    for (i in 1:iterations) {
      # Draw n standard uniform random numbers on [0, 1]
      U <- runif(n, 0, 1)
      
      # Inverse-transform step: X follows the proposal density described by f
      X <- sapply(U, f)
      # Importance-sampling estimate: average of g(X) = integrand(X)/proposal(X)
      Y <- sapply(X, g)
      I[i] <- mean(Y)
    }
    return(list(mean = mean(I), var = var(I), SE = sd(I)/sqrt(iterations)))
  }
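
As an illustrative check (not a stored run), calling the estimator with the triangular inverse CDF f and the ratio g defined above targets \(\int_{-1}^{1} (1 - x^2)\,dx = 4/3\):

set.seed(42)
# Importance-sampling estimate of the integral of 1 - x^2 over [-1, 1]
integral_IS(g = g, f = f, n = 1000, iterations = 200)$mean  # expected to be close to 4/3 ≈ 1.333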

9 Markov Chain Monte Carlo (MCMC) Methods

MCMC methods are a class of algorithms for sampling from probability distributions by constructing a Markov chain that has the desired distribution as its equilibrium distribution. These methods are particularly valuable when direct sampling is difficult, especially in high-dimensional spaces.

Core Concepts

Markov Chain Fundamentals

  • A sequence of random variables where each state depends only on the previous state (Markov property)
  • Under certain conditions, converges to a stationary distribution π

MCMC Framework

  1. Construct a Markov chain whose stationary distribution equals the target distribution
  2. Simulate the chain until it approaches stationarity (burn-in period)
  3. Use subsequent samples to approximate expectations: \[ \mathbb{E}_π[h(X)] \approx \frac{1}{N}\sum_{t=1}^N h(X_t) \]

9.1 Metropolis-Hastings Algorithm

The most general MCMC algorithm for sampling from a target distribution π(x), requiring only:

  • Knowledge of π(x) up to a normalizing constant
  • A proposal distribution q(x’|x)

Algorithm Steps

  • Initialization:
    • Choose initial state \(X_0\)
    • Set proposal distribution \(q(x'|x)\)
  • Iteration:
    • For t = 1 to N:
      • Generate candidate \(X^* \sim q(x'|X_{t-1})\)
      • Compute acceptance probability: \[ α = \min\left(1, \frac{π(X^*)q(X_{t-1}|X^*)}{π(X_{t-1})q(X^*|X_{t-1})}\right) \]
        • Accept with probability α: \[ X_t = \begin{cases} X^* & \text{with probability } α \\ X_{t-1} & \text{otherwise} \end{cases} \]

Key Properties

  • Correctness: Produces samples from π(x) in the limit
  • Flexibility: Only requires unnormalized π(x)
  • Tunable: Proposal distribution q affects efficiency

The target density is the symmetric triangular function,

\[ \displaystyle f(x)= \begin{cases} 1-\mid x \mid, &\mbox{if } -1 < x < 1 \\ \\ \qquad 0, &\mbox{otherwise} \end{cases} \]

R Implementation of the Target Density: Triangular Function

#' Symmetric Triangular Probability Density Function
#' 
#' Computes the probability density of a symmetric triangular distribution centered at 0
#' with support on [-1, 1]. Returns 0 outside this interval.
#' @param x Numeric vector of values at which to evaluate the density
#' @return Numeric vector of density values
#' @export
triangular <- function(x) {
  pmax(1 - abs(x), 0) * (abs(x) <= 1)
}

# Verification of normalization
integrate(triangular, lower = -1, upper = 1)  # Should equal 1
## 1 with absolute error < 1.1e-14
#' Metropolis-Hastings MCMC Sampler for Triangular Distribution
#'
#' Generates samples from the triangular distribution using random walk
#' Metropolis-Hastings with comprehensive diagnostics.
#'
#' @param n_samples Number of samples to generate (after burn-in)
#' @param initial_value Starting value (default = 0)
#' @param proposal_sd Standard deviation for normal proposals (default = 0.3)
#' @param burn_in Burn-in iterations to discard (default = 1000)
#' @param plot Whether to show a diagnostic histogram (default = TRUE)
#' @param target_density Target density function (default = standard normal via dnorm)
#' @return Numeric vector of samples
#' @export
#'
#' @examples
#' normal_sample <- metropolis_hastings(n_samples = 1000)
#' triangular_sample <- metropolis_hastings(n_samples = 1000, target_density = triangular)
#' 
metropolis_hastings <- function(
    n_samples,
    initial_value = 0,
    proposal_sd = 0.3,  # Optimal for triangular distribution
    burn_in = 1000,
    plot = TRUE,
    target_density = function(x) {dnorm(x)}
) {
  # Initialize storage
  samples <- numeric(n_samples + burn_in)
  current <- initial_value
  accepted <- 0
  
  # MCMC iterations
  for (i in 1:(n_samples + burn_in)) {
    proposed <- rnorm(1, mean = current, sd = proposal_sd)
    current_dens <- target_density(current)
    proposed_dens <- target_density(proposed)
    
    # Handle edge cases safely
    if (current_dens == 0 && proposed_dens == 0) {
      # Both outside support - reject
      log_ratio <- -Inf
    } else if (current_dens == 0) {
      # Moving into support - accept
      log_ratio <- Inf
    } else if (proposed_dens == 0) {
      # Moving out of support - reject
      log_ratio <- -Inf
    } else {
      # Normal case
      log_ratio <- log(proposed_dens) - log(current_dens)
    }
    
    if (log(runif(1)) < log_ratio) {
      current <- proposed
      accepted <- accepted + 1
    }
    samples[i] <- current
  }
  
  final_samples <- samples[(burn_in + 1):(n_samples + burn_in)]
  
  # Diagnostic plots
  if (plot) {
    # Histogram
    hist(final_samples, breaks = 30, freq = FALSE,
         main = "Sample Distribution", xlab = "Value",
         col = "lightblue")
    curve(target_density(x), add = TRUE, col = "red", lwd = 2)
  }
  
  return(final_samples)
}

# Save function to file
dump("metropolis_hastings", file = "metropolis_hastings.R")
normal_sample <- metropolis_hastings(n_samples = 1000)

triangular_sample <- metropolis_hastings(n_samples = 1000, target_density = triangular)

head(triangular_sample)
## [1] -0.1262088 -0.4023535 -0.3030000 -0.4167624 -0.6356550 -0.6356550
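
As a quick sanity check (approximate values, not stored output), the triangular target has mean 0 and variance 1/6 ≈ 0.167, so the corresponding sample moments should be close to these values:

mean(triangular_sample)   # expected near 0
var(triangular_sample)    # expected near 1/6 ≈ 0.167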

9.2 Gibbs Sampling

Gibbs sampling is a special case of the Metropolis-Hastings algorithm that simplifies sampling from multivariate distributions by iteratively sampling from conditional distributions. It’s particularly useful when joint distributions are complex but conditional distributions are tractable.

Key Features

  • Conditional Sampling Framework: Updates each variable while holding others fixed
  • Acceptance Rate = 1: All proposals are accepted (unlike general Metropolis-Hastings)
  • No Tuning Required: Automatically adapts to the target distribution’s structure

Mathematical Formulation

For a target distribution π(x₁, x₂, …, xₖ):

  1. Initialize X⁽⁰⁾ = (x₁⁽⁰⁾, …, xₖ⁽⁰⁾)
  2. At iteration t:
  • Sample x₁⁽ᵗ⁾ ∼ π(x₁ | x₂⁽ᵗ⁻¹⁾, …, xₖ⁽ᵗ⁻¹⁾)
  • Sample x₂⁽ᵗ⁾ ∼ π(x₂ | x₁⁽ᵗ⁾, x₃⁽ᵗ⁻¹⁾, …, xₖ⁽ᵗ⁻¹⁾)
  • ⋮
  • Sample xₖ⁽ᵗ⁾ ∼ π(xₖ | x₁⁽ᵗ⁾, …, xₖ₋₁⁽ᵗ⁾)

Algorithm Implementation (Pseudocode)

Initialize X = (x₁, …, xₖ)

for t = 1 to num_samples do
    for j = 1 to k do
        x_j ∼ π(x_j | X_{-j})
    end for
    Store sample X
end for


Implement a Gibbs sampler to construct samples from the uniform distribution on a disc of radius 1

#' Gibbs Sampler for the Uniform Distribution on the Unit Disc
#'
#' Generates samples from the uniform distribution over the unit disc (x² + y² ≤ 1)
#' using a Gibbs sampling algorithm.
#'
#' @param n Number of samples to generate.
#' @param burn_in Number of burn-in iterations to discard (default = 100).
#'
#' @return A matrix with `n` rows and 2 columns (x, y), where each row is a sample point from the uniform distribution over the unit disc.
#'
#' @examples
#' samples <- gibbs(1000)
#'
#' @export

gibbs <- function(n, burn_in = 100, plot=TRUE) {
  mat <- matrix(0, ncol = 2, nrow = n)
  x <- 0
  y <- 0
  
  # Burn-in phase
  for (i in 1:burn_in) {
    x <- runif(1, -sqrt(1 - y^2), sqrt(1 - y^2))
    y <- runif(1, -sqrt(1 - x^2), sqrt(1 - x^2))
  }
  
  # Sampling phase
  for (i in 1:n) {
    x <- runif(1, -sqrt(1 - y^2), sqrt(1 - y^2))
    y <- runif(1, -sqrt(1 - x^2), sqrt(1 - x^2))
    mat[i, ] <- c(x, y)
  }
  
  # Validate all points are within unit circle (with tolerance for floating point)
  stopifnot(all(mat[,1]^2 + mat[,2]^2 <= 1 + 1e-10))
  colnames(mat) <- c("x", "y")
  
  if(plot){
    plot(mat[,"x"], mat[,"y"], 
         pch = 16, 
         cex = .4, 
         asp = 1, 
         col = adjustcolor("tomato", alpha = 0.5))
    
    abline(h = mean(mat[,"y"]), 
           v = mean(mat[,"x"]), 
           col = "black", 
           lty = 3)
    
    grid(nx = NULL, 
         ny = NULL, 
         col = "gray", 
         lty = "dotted", 
         lwd = 1, 
         equilogs = TRUE)
  }
  return(mat)
}

# Save function to a .R file
dump("gibbs", file = "gibbs.R")
gibbs_sample <- gibbs(1e4)

head(gibbs_sample)
##                x          y
## [1,]  0.08832546  0.9513155
## [2,]  0.30277657 -0.6281116
## [3,]  0.25900035  0.2322818
## [4,] -0.55230127  0.6129308
## [5,]  0.28412654  0.0664954
## [6,] -0.74655836  0.4022764

Let \(X \sim \mathrm{Unif}(0, 5)\) and, given \(X = x\), let \(Y \sim \mathrm{Unif}(x, 2x + 1)\). Construct samples from this joint distribution.

# First version: user-supplied conditional samplers (redefined below with a simpler, self-contained version)
gibbs2 <- function(
    n, 
    sample_x = function(x) runif(1, min = 0, max = 5),  # Default: uniform for X
    sample_y = function(x) runif(1, min = x, max = 2*x + 1),  # Default: uniform for Y
    initial_x = 0,  # Optional: initial value for X
    initial_y = 0   # Optional: initial value for Y
) {
  mat <- matrix(0, ncol = 2, nrow = n)
  x <- initial_x
  y <- initial_y
  
  for (i in 1:n) {
    x <- sample_x(y)  # Sample X given Y (or NULL if unconditional)
    y <- sample_y(x)  # Sample Y given X
    mat[i, ] <- c(x, y)
  }
  
  colnames(mat) <- c("x", "y")
  plot(mat, 
       pch = 16, 
       cex = .4, 
       asp = 1, 
       col = adjustcolor("tomato", alpha = 0.5))
  
  grid(nx = NULL, 
       ny = NULL, 
       col = "gray", 
       lty = "dotted", 
       lwd = 1, 
       equilogs = TRUE)
  return(mat)
}
gibbs2 <- function(n, plot = TRUE){
  mat <- matrix(0, ncol = 2, nrow = n)
  x <- 0
  y <- 0
  
  for (i in 1:n) {
    x <- runif(1, min = 0, max = 5)
    y <- runif(1, min = x, max = 2*x + 1)
    mat[i, ] <- c(x, y)
  }
  colnames(mat) <- c("x", "y")
  
  if (plot) {
    plot(mat, 
         pch = 16, 
         cex = 0.4, 
         asp = NULL, 
         col = adjustcolor("tomato", alpha = 0.5),
         xlim = c(-1, 6),  # Apply custom x-axis limits
         ylim = c(-1, 11),  # Apply custom y-axis limits
         xlab = "X", 
         ylab = "Y",
         main = "Gibbs Sampling Results")
    
    grid(nx = NULL, 
         ny = NULL, 
         col = "gray", 
         lty = "dotted", 
         lwd = 1, 
         equilogs = TRUE)
  }
  
  return(mat)
}
g2 <- gibbs2(1e4)

head(g2)
##              x        y
## [1,] 2.8309591 6.183859
## [2,] 4.1841702 9.306960
## [3,] 3.9244680 7.282335
## [4,] 4.1491063 7.261418
## [5,] 0.9008147 1.640175
## [6,] 1.9611781 4.379312

10 Reverse-engineering

Markov chain Monte Carlo (MCMC) methods let us draw approximate samples from an arbitrary probability distribution; given such samples (or any observed data), we can attempt to reverse-engineer the underlying distribution.

10.1 Discrete Sample Data (unknown)

unknowndiscrete <- read.table("unknowndiscrete.R", header=FALSE)

str(unknowndiscrete)
## 'data.frame':    1000 obs. of  1 variable:
##  $ V1: int  8 10 12 14 12 13 12 10 13 13 ...
# descriptive stats
psych::describe(unknowndiscrete$V1)
##    vars    n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1000 12.07 2.69     12   12.05 2.97   3  20    17 0.01     -0.2 0.09

Density plot and summary of the data

plot(table(unknowndiscrete$V1)/length(unknowndiscrete$V1),
     main = "Density estimate of data", 
     lwd = 12, 
     col = adjustcolor("dodgerblue", alpha.f = 0.5))

grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Observations:

The sample size is large (n = 1,000), which supports the use of asymptotic approximations. The mean and median of the distribution are both approximately 12, suggesting symmetry. Additionally, the skewness and kurtosis values are close to zero, further indicating that the distribution is approximately normal. This is visually confirmed by the density plot of the data.

These findings align with theoretical expectations: for a binomial distribution \(\mathcal{B}(n, p)\), when the number of trials is sufficiently large (typically \(n > 30\)), the skewness diminishes and the normal distribution \(\mathcal{N}(np, np(1-p))\) becomes a reasonable approximation.


Test our educated guess that the data has a binomial distribution, \(\mathcal B(n=30, \: p=.399).\)

# mean
mean.disc <- mean(unknowndiscrete$V1)

# variance
var.disc <- var(unknowndiscrete$V1)

### for binomial distribution, mean is n*p and variance is n*p*q, so

# q: var/mean = n*p*q/n*p = q
q <- var.disc/mean.disc

# p: 1-q = p
p <- (1-q)

# n = n*p/p
n <- mean.disc/p
paste("p = ", p)
## [1] "p =  0.398954998408235"
paste("q = ", q)
## [1] "q =  0.601045001591765"
paste("n = ", n)
## [1] "n =  30.2565453451174"

Plot population and random sample distributions for comparison

# max
max.disc <- max(unknowndiscrete$V1)

# min
min.disc <- min(unknowndiscrete$V1)

# Density plot of the data
plot(table(unknowndiscrete$V1)/length(unknowndiscrete$V1),
     main = "Density estimate of data", 
     lwd = 12, 
     col = "dodgerblue")

# Sampling from a binomial distribution with parameters n=30, size=1000 and prob=0.399
points(min.disc:max.disc, 
       dbinom(x = min.disc:max.disc, size = 30, prob = p),
       lwd = 4, 
       col = adjustcolor("tomato", alpha.f = 0.5),
       main = "random sample (blue) vs. population (red) \n 
               Binomial Distribution",
       xlab = "x", 
       ylab = "P(x)", 
       ylim = c(0, .3))

grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


random sample Cumulative Distribution Function

plot(ecdf(unknowndiscrete$V1), 
     lwd = 1, 
     col = "dodgerblue",
     main = "random sample (blue) vs. population (red) \n 
             Cumulative Distribution Functions")

# population Cumulative Distribution Function
points(0:25,
       pbinom(0:25, 30, 
              prob = p, 
              lower.tail = TRUE, 
              log.p = FALSE), 
       col = "tomato")


QQ plot of standardized data

## drawing the QQplot of standardized data
qqnorm((unknowndiscrete$V1 - mean(unknowndiscrete$V1)) / sd(unknowndiscrete$V1))

# drawing a 45-degree reference line
abline(0, 1, col = "tomato", lwd = 4)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


10.2 Continuous Sample Data with an unknown distribution

# Path to the data file
data_url <- "unknowncontinuous.R"

tryCatch({
  unknowncontinuous <- read.table(data_url, header = FALSE)
  
  # Save to local file
  write.csv(unknowncontinuous, "unknowncontinuous_data.csv", row.names = FALSE)
  cat("Data successfully saved to 'unknowncontinuous_data.csv'\n")
  
  # Data structure
  cat("\nData Structure:\n")
  str(unknowncontinuous)
  
  # Summary statistics using psych
  if(!require(psych)) install.packages("psych")
  library(psych)
  
  cat("\nSummary Statistics:\n")
  (d <- describe(as.matrix(unknowncontinuous)))
  
  # First few observations
  cat("\nFirst 6 Observations:\n")
  head(unknowncontinuous)
  
}, error = function(e) {
  message("Error in data import: ", e$message)
})
## Data successfully saved to 'unknowncontinuous_data.csv'
## 
## Data Structure:
## 'data.frame':    1000 obs. of  1 variable:
##  $ V1: num  5.94 7.19 2.33 4.09 9.02 ...
## 
## Summary Statistics:
## 
## First 6 Observations:
##         V1
## 1 5.944350
## 2 7.185073
## 3 2.330183
## 4 4.087371
## 5 9.015707
## 6 5.128833

Density plot

# Enhanced Density Plot with ggplot2
library(ggplot2)

# Convert data to numeric vector (handles various input types)
plot_data <- as.numeric(as.matrix(unknowncontinuous))

# Create density plot with ggplot
density_plot <- ggplot(data.frame(x = plot_data), aes(x = x)) +
  # Density curve
  geom_density(fill = "#1E90FF", 
               alpha = 0.3, 
               color = "dodgerblue4", 
               linewidth = 1.2) +
  
  # Rug plot - removed position_jitter as it's not needed
  geom_rug(alpha = 0.4, 
           color = "black",
           sides = "b",  # show only at bottom
           length = unit(0.03, "npc")) +  # shorter rug lines
  
  # Mean line
  geom_vline(aes(xintercept = mean(x, na.rm = TRUE)),
             color = "firebrick",
             linetype = "dashed",
             linewidth = 1) +
  
  # Plot styling
  labs(title = "Density Plot of Unknown Continuous Data",
       x = "Value",
       y = "Density") +
  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.major = element_line(color = "grey90", linewidth = 0.2),
    panel.grid.minor = element_blank()
  ) +
  
  # Add annotation for mean value
  annotate("text",
           x = mean(plot_data),
           y = 0,
           label = paste("Mean =", round(mean(plot_data), 2)),
           color = "firebrick",
           vjust = -1,
           hjust = -0.1)

# Display plot
print(density_plot)

Observations:

  • Sample size: Large (n = 1,000)
  • Distribution shape:
    • Mean (6.17) > Median (5.03) → Right-skewed
    • Non-zero skewness and kurtosis → Non-normal
    • Visually confirmed by density plot

Distribution Analysis

  • Beta distribution:
    • Ruled out (requires [0,1] bounded support)
  • Chi-squared distribution:
    • Expected σ² ≈ 2μ (12.34 vs observed 18.8)
    • Poor fit for this data
  • Gamma distribution (most plausible):
    • General form includes exponential and chi-squared
    • Moment relationships:
      • μ = kθ
      • σ² = kθ²
        • ⇒ k = μ²/σ² ≈ (6.17)²/18.8 ≈ 2.03
        • ⇒ θ = σ²/μ ≈ 18.8/6.17 ≈ 3.05
  • Suggested parameters:
    • Shape (k) ≈ 2.0
    • Scale (θ) ≈ 3.0

For the gamma distribution, \(\mu = k \Theta\) and \(\sigma^2 = k \Theta^2\), so \(\Theta = \frac{\sigma^2}{\mu}\) and \(k = \frac{\mu}{\Theta}\).

# mean
mean.cont <- mean(unknowncontinuous$V1)

# sd
sd.cont <- sd(unknowncontinuous$V1)

# var
var.cont <- var(unknowncontinuous$V1)

# For the gamma distribution, mean and variance are 
# E(X) = k*theta and Var(X) = k*theta^2.

paste("theta = ", theta <- (var.cont / mean.cont))
## [1] "theta =  3.050884991004"
paste("k = ", k <- (mean.cont / theta))
## [1] "k =  2.02347900486358"

Test our educated guess that the data has a gamma distribution, \(\mathcal \Gamma(k=2, \: \Theta =3).\)

For the gamma distribution, the skewness equals \(2/\sqrt{k}\); it depends only on the shape parameter \(k\), and the distribution approaches a normal distribution when \(k\) is large (roughly \(k > 10\)). The excess kurtosis is \(6/k\). A formula for approximating the median of any gamma distribution, given the mean, has been derived from the fact that the ratio \(\mu/(\mu - \nu)\) is approximately a linear function of \(k\) when \(k \geq 1\). The approximation formula is

\[ \displaystyle \nu \approx \mu\,\frac{3k-0.8}{3k+0.2}\]


Compare population and random sample skewness values

paste("estimated skew = ", (skewness <- 2/sqrt(k)))
## [1] "estimated skew =  1.40598486077774"
paste("random sample skew = ", d$skew)
## [1] "random sample skew =  1.29937929146447"

Compare population and random sample kurtosis values

paste("estimated kurtosis = ", (kurtosis <- 6/k))
## [1] "estimated kurtosis =  2.96519014310431"
paste("random sample kurtosis = ", d$kurtosis)
## [1] "random sample kurtosis =  1.69046686594527"

Compare theoretical and random sample median values

paste("estimated median = ",(median = mean.cont * ((3*k) - .8)/((3*k) + .2)))
## [1] "estimated median =  5.18887676960434"
paste("random sample median = ", d$median)
## [1] "random sample median =  5.0326255"

plot random sample distributions for comparison

plot(density(unknowncontinuous$V1),
     main = "random sample (blue) & population (red) \n 
          Distribution Functions",
     lwd = 4,
     col = "dodgerblue",
     type = "l")

# plot population distributions for comparison
# create a sequence of x values
x <- seq(min(unknowncontinuous$V1), max(unknowncontinuous$V1), by = .01) 
x.gamma <- dgamma(x, shape = 2, scale = 3, log = FALSE)

points(x, x.gamma,
       type = "l",
       col = "tomato",
       lwd = 4)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


random sample Cumulative Distribution Function

plot(ecdf(unknowncontinuous$V1), 
     lwd = 3, 
     col = "dodgerblue",
     main = "random sample (blue) & population (red) \n 
            Cumulative Distribution Functions")

# population Cumulative Distribution Function
x_grid <- min(unknowncontinuous$V1):max(unknowncontinuous$V1)
points(x_grid,
       pgamma(x_grid,
              shape = 2, 
              scale = 3, 
              lower.tail = TRUE, 
              log.p = FALSE), 
       pch = 20, 
       col = "tomato")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


To show that the \(\chi^2\) distribution is not a good fit:

# create a sequence of x values
x <- density(unknowncontinuous$V1)$x

# plot random sample distributions for comparison
plot(density(unknowncontinuous$V1),
     lwd = 4,
     type = "l",
     col = "dodgerblue",
     ylim = c(0, 0.14),
     ylab = "gamma/chi-squared",
     main = "random sample (blue) & Gamma (red) & Chi-squared (green) \n
             Distributions")

# plot gamma distribution with k = 2, and theta = 3 for comparison
x.gamma <- dgamma(x, shape = 2 , scale = 3, log = FALSE)

points(x, x.gamma,
       type = "l",
       col = "tomato",
       lwd = 4)

# plot chi-squared distribution with k=6 for comparison
x.chisq <- dchisq(x, df = 6 , ncp = 0)
points(x, x.chisq,
       type = "l",
       col = "green",
       lwd = 4)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


11 Testing the Gamma Distribution Hypothesis

Proposed Distribution: \(\mathcal \Gamma(k=2, \theta=3)\)

1. Method of Moments Validation

From the data (μ = 6.17, σ² = 18.8):

Estimated Parameters:
\[ \hat{k} = \frac{\mu^2}{\sigma^2} = \frac{6.17^2}{18.8} \approx 2.03 \\ \hat{\theta} = \frac{\sigma^2}{\mu} = \frac{18.8}{6.17} \approx 3.05 \]

Matches hypothesized \(\mathcal \Gamma(2,3)\)


2. Theoretical vs. Observed Properties

Property           Theoretical (Γ(2, 3))          Observed   Match?
Skewness           \(2/\sqrt{2} \approx 1.41\)    1.30       Approximate
Excess Kurtosis    \(6/2 = 3\)                    1.69       Rough
Median             \(\approx 5.19\)               5.03       Close

11.1 Maximum Likelihood Estimation (Continuous unknown distribution)

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a model by maximizing the likelihood function. Given a dataset and an assumed statistical model, MLE provides parameter estimates that make the observed data most probable.

Applying MLE to Continuous Data

Using the continuous dataset (unknowncontinuous) from the earlier analysis:

  • Select the distribution type you previously identified.
  • Compute MLE estimates for the continuous parameters using mle().
  • If your chosen distribution has discrete parameters (e.g., n), fix them using the ‘fixed’ option.
  • Manually test values near your earlier estimates—which one yields the highest likelihood?
  • Validate the fit:
    • Plot the distribution function using your estimated parameters.
    • Compare it with the normalized histogram of the data to assess the fit.
options(warn = -1) 
# Function to calculate negative log-likelihood:
nLL <- 
  function(shape, rate){-sum(stats::dgamma(unknowncontinuous[,1], shape, rate, log = TRUE))}
# The 'stats' package ships with base R, so no installation is required
library(stats)
nlm(nLL, p = c(1), shape = 1, hessian = FALSE)
## $minimum
## [1] 2820.25
## 
## $estimate
## [1] 0.1619847
## 
## $gradient
## [1] 1.182343e-05
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

The minimum negative log-likelihood is 2820.25 when the shape parameter is fixed at 1.


nlm(nLL, p = c(1), shape = 2, hessian = FALSE)
## $minimum
## [1] 2694.75
## 
## $estimate
## [1] 0.3239701
## 
## $gradient
## [1] 0.002156412
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

The minimum negative log-likelihood is 2694.75 (lower) when the shape parameter is fixed at 2.


nlm(nLL, p = c(1), shape = 3, hessian = FALSE)
## $minimum
## [1] 2739.149
## 
## $estimate
## [1] 0.4859552
## 
## $gradient
## [1] 3.637979e-06
## 
## $code
## [1] 1
## 
## $iterations
## [1] 8

The minimum negative log-likelihood is 2739.15 (higher) when the shape parameter is fixed at 3. Therefore, shape = 2 gives the best fit so far, with an estimated rate of approximately 0.32. Next, we fix the rate near this value and estimate the shape parameter.


nlm(nLL, p = c(1), rate = 0.2, hessian = FALSE)
## $minimum
## [1] 2744.435
## 
## $estimate
## [1] 1.41118
## 
## $gradient
## [1] 6.444926e-07
## 
## $code
## [1] 1
## 
## $iterations
## [1] 6

The minimum negative log-likelihood is 2744.4 when the rate parameter is fixed at 0.2.


nlm(nLL, p = c(1), rate = 0.3, hessian = FALSE)
## $minimum
## [1] 2697.074
## 
## $estimate
## [1] 1.89898
## 
## $gradient
## [1] -1.436816e-06
## 
## $code
## [1] 1
## 
## $iterations
## [1] 5

The minimum negative log-likelihood is 2697.1 (lower) when the rate parameter is fixed at 0.3.


nlm(nLL, p = c(1), rate = 0.35, hessian = FALSE)
## $minimum
## [1] 2694.863
## 
## $estimate
## [1] 2.140515
## 
## $gradient
## [1] 5.948532e-06
## 
## $code
## [1] 1
## 
## $iterations
## [1] 5

The minimum negative log-likelihood is 2694.9 (lower) when the rate parameter is fixed at 0.35.


nlm(nLL, p = c(1), rate = 0.4, hessian = FALSE)
## $minimum
## [1] 2701.988
## 
## $estimate
## [1] 2.381189
## 
## $gradient
## [1] 1.241337e-05
## 
## $code
## [1] 1
## 
## $iterations
## [1] 5

Increasing the rate to 0.4 raises the minimum negative log-likelihood to 2702.0. To summarize the search: fixing the rate at 0.3 gives 2697.1, a rate of 0.35 improves this to 2694.9, and a rate of 0.4 worsens it to 2702.0. The best fit is therefore obtained with a rate of about 0.35, with a corresponding estimated shape parameter of approximately 2.14.


nlm(nLL, p = c(1), shape = 2.14, hessian = FALSE)
## $minimum
## [1] 2694.764
## 
## $estimate
## [1] 0.3466479
## 
## $gradient
## [1] 8.6402e-06
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10
# scale=1/rate
paste("scale = ", 1 / 0.35)
## [1] "scale =  2.85714285714286"
# Log-likelihood at shape = 2.14, rate = 0.35
-nLL(2.14, 0.35)
## [1] -2694.863
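
The exploration above fixes one parameter at a time with nlm(). As a sketch of the mle() approach mentioned earlier (reusing the nLL function defined above), both parameters can also be estimated jointly; the results should be close to the grid search (shape ≈ 2.1, rate ≈ 0.35):

library(stats4)

# Joint maximum likelihood estimation of shape and rate (illustrative sketch)
fit <- mle(nLL, start = list(shape = 2, rate = 0.35))
coef(fit)   # expected: shape near 2.1, rate near 0.35 (scale ≈ 2.9)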

Plot random sample distributions for comparison

hist(unknowncontinuous[,1], freq = FALSE, breaks = 30)
curve(dgamma(x, shape = 1, rate = 0.16198), lwd = 2, col = "red", add = TRUE)
curve(dgamma(x, shape = 2, rate = 0.32397), lwd = 2, col = "blue", add = TRUE)
curve(dgamma(x, shape = 3, rate = 0.32397), lwd = 2, col = "green", add = TRUE)
legend("topright", legend = c("shape=1", "shape=2", "shape=3"), fill = c("red", "blue", "green"))

plot(density(unknowncontinuous[,1]),
     lwd = 4,
     col = "dodgerblue",
     main = "Unknown Data Density (blue) vs. \n 
            population Gamma Distribution (red)")

# plot the gamma density with shape = 2.14 and scale = 2.86 for comparison;
# create a sequence of x values
x <- density(unknowncontinuous[,1])$x

x.gamma <- dgamma(x, shape = 2.14, scale = 2.86, log = F)

points(x, x.gamma,
       type = "l",
       col = "tomato",
       lwd = 4)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


11.1.1 MLE for Discrete Data

Based on the previous analysis, we selected the appropriate distribution type (e.g., Negative Binomial, Binomial, Poisson, etc.). We used the mle() function to compute the maximum likelihood estimates of the continuous parameters. If the distribution includes discrete parameters—such as n in the Binomial or Negative Binomial distribution—we fixed these using the fixed argument in mle(). We manually tested a few nearby values for the fixed discrete parameter (e.g., n = 8, 9, 10, 11) based on earlier estimates. The value that yielded the lowest negative log-likelihood was selected as the most plausible. Finally, we plotted the fitted distribution (using the estimated parameters) against the normalized frequency table of the observed data. This visual comparison helped assess how well the model fits the observed distribution.

# view data
head(unknowndiscrete)
##   V1
## 1  8
## 2 10
## 3 12
## 4 14
## 5 12
## 6 13
unknowndiscrete <- as.matrix(unknowndiscrete)
hist(unknowndiscrete)

psych::describe(unknowndiscrete)
##    vars    n  mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 1000 12.07 2.69     12   12.05 2.97   3  20    17 0.01     -0.2 0.09

Comparing the density plot of the data to a binomial distribution with moment-matched parameters (size = max of the data, prob = mean/max) suggests a reasonable fit.

plot(min(unknowndiscrete):max(unknowndiscrete), 
     dbinom(x = min(unknowndiscrete):max(unknowndiscrete), 
            size = max(unknowndiscrete), 
            prob = mean(unknowndiscrete) / max(unknowndiscrete)),
     xlab = "x",
     ylab = "P(X=x)",
     ylim = c(0, .2),
     col = "tomato",
     type = "h",
     lwd = 6)

title("Density Plot of Unknown Discrete Data vs. B(1000, 0.012)")

points((table(unknowndiscrete) / length(unknowndiscrete)), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

Geometric distribution is not a good fit.


plot(min(unknowndiscrete):max(unknowndiscrete), 
     dgeom(min(unknowndiscrete):max(unknowndiscrete), 
           prob = 1/mean(unknowndiscrete), 
           log = FALSE), 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     ylim = c(0, .2),
     main = "random sample (blue) vs. population (red) \n Geometric Distribution",
     xlab = "Number of Failures Before Success", ylab = "P(X=x)")

points(table(unknowndiscrete) / length(unknowndiscrete), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

plot(min(unknowndiscrete):max(unknowndiscrete), 
     dpois(min(unknowndiscrete):max(unknowndiscrete), mean(unknowndiscrete)), 
     xlab = "x", 
     ylab = "P(X=x)", 
     ylim = c(0, .2), 
     col = "tomato", 
     type = "h", 
     main = "random sample vs. population Poisson Distribution",
     lwd = 6)

points(table(unknowndiscrete) / length(unknowndiscrete), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

The Poisson distribution also appears to be a reasonable fit. Now compare the binomial and Poisson distributions against the data.


plot(min(unknowndiscrete):max(unknowndiscrete), 
     dpois(min(unknowndiscrete):max(unknowndiscrete), mean(unknowndiscrete)), 
     xlab = "x", 
     ylab = "P(X=x)", 
     ylim = c(0, .2), 
     col = "tomato", 
     type = "h", 
     main = "random sample vs. population Poisson Distribution",
     lwd = 6)

points(min(unknowndiscrete):max(unknowndiscrete), 
       dbinom(x = min(unknowndiscrete):max(unknowndiscrete), 
              size = max(unknowndiscrete), 
              prob = mean(unknowndiscrete)/ max(unknowndiscrete)), 
       col = adjustcolor("green", alpha.f = 0.3), 
       type = "h", 
       lwd = 15)

points(table(unknowndiscrete) / length(unknowndiscrete), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

We compared the Poisson and binomial distributions as candidate models for the dataset unknowndiscrete.dat. The Poisson distribution (plotted in red) provided a noticeably better fit to the observed data (plotted in blue) than the binomial distribution (plotted in green), based on both visual inspection and the likelihood values.


# Negative log-likelihood for a binomial model
# (defined for reference; overwritten by the Poisson objective below)
nLL <- 
  function(n, prob) {
    if (n > 0 && 0 < prob && prob < 1) {
      -sum(dbinom(unknowndiscrete[,1], n, prob, log = TRUE))
    }
    else {NA}
  }
# Objective for the Poisson model (note: log = FALSE, so this maximizes the
# sum of probabilities rather than the log-likelihood)
nLL <- 
  function(lambda){-sum(stats::dpois(as.matrix(unknowndiscrete), lambda, log = FALSE))}
nlm(nLL, p = c(1), hessian = FALSE)
## $minimum
## [1] -90.60732
## 
## $estimate
## [1] 11.84542
## 
## $gradient
## [1] 2.051473e-07
## 
## $code
## [1] 1
## 
## $iterations
## [1] 10

The λ estimate is about 11.8, close to the sample mean of 12.07.
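
Because the objective above uses log = FALSE, it maximizes the sum of probabilities rather than the log-likelihood. With a proper negative log-likelihood the minimizer is exactly the sample mean (about 12.07); a quick sketch (the name nLL_pois is illustrative):

nLL_pois <- function(lambda) {-sum(dpois(unknowndiscrete[,1], lambda, log = TRUE))}
nlm(nLL_pois, p = 10)$estimate   # expected to equal the sample mean, about 12.07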


12 Expectation-Maximization (EM) Algorithm

In statistics, the Expectation-Maximization (EM) algorithm is an iterative method for finding:

  • Maximum likelihood (ML) estimates, or
  • Maximum a posteriori (MAP) estimates

of parameters in statistical models that depend on unobserved latent variables.

How EM Works

The EM algorithm iteratively alternates between two steps:

  1. Expectation Step (E-step)
    Computes the expected value of the log-likelihood function using current parameter estimates.

  2. Maximization Step (M-step)
    Updates parameters to maximize the expected log-likelihood from the E-step.

The refined estimates are reused in subsequent E-steps until convergence.


13 Gaussian Mixture Models (GMMs)

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data is generated from a mixture of several Gaussian (normal) distributions, each with its own set of unknown parameters. Rather than assigning each data point to a single cluster deterministically, GMMs model the data as arising from a weighted combination of multiple Gaussian components, allowing for soft clustering—each data point has a probability of belonging to each cluster.

Key Properties:

  • Generalizes k-means clustering by incorporating:
    • Centers (means) of the latent Gaussians
    • Covariance structures
  • Provides a richer data representation than hard clustering.
#' EM Algorithm for Gaussian Mixture Models
#'
#' @param X Numeric matrix of data (n observations x p dimensions)
#' @param K Integer number of clusters
#' @param max_iter Maximum number of iterations (default: 100)
#' @param epsilon Convergence threshold (default: 1e-6)
#' @param verbose Whether to print progress (default: TRUE)
#'
#' @return List containing:
#'   - mu: Matrix of cluster means (K x p)
#'   - sigma: List of covariance matrices (K x p x p)
#'   - pi: Mixing weights (length K vector)
#'   - clusters: Hard cluster assignments
#'   - q: Soft assignment probabilities (n x K)
#'   - log_likelihood: Log-likelihood at each iteration
em_gmm <- function(X, K, max_iter = 100, epsilon = 1e-6, verbose = TRUE) {
  library(mvtnorm)
  
  # Initialize parameters
  n <- nrow(X)
  p <- ncol(X)
  
  # Random initialization
  mu <- X[sample(n, K), ]
  sigma <- replicate(K, diag(p), simplify = FALSE)
  pi <- rep(1/K, K)
  
  log_likelihood <- numeric(max_iter)
  
  for (iter in 1:max_iter) {
    ### E-step: Compute responsibilities
    q <- matrix(NA, n, K)
    for (k in 1:K) {
      q[, k] <- pi[k] * dmvnorm(X, mu[k, ], sigma[[k]])
    }
    log_likelihood[iter] <- sum(log(rowSums(q)))
    q <- q / rowSums(q)
    
    ### Store previous parameters for convergence check
    prev_mu <- mu
    
    ### M-step: Update parameters
    N_k <- colSums(q)
    
    for (k in 1:K) {
      # Update means
      mu[k, ] <- colSums(q[, k] * X) / N_k[k]
      
      # Update covariances
      X_centered <- sweep(X, 2, mu[k, ], "-")
      sigma[[k]] <- t(X_centered) %*% (X_centered * q[, k]) / N_k[k]
      
      # Update mixing weights
      pi[k] <- N_k[k] / n
    }
    
    ### Check convergence
    param_change <- sum((mu - prev_mu)^2)
    if (verbose) {
      cat(sprintf("Iter %d: log-likelihood = %.3f, param change = %.6f\n",
                  iter, log_likelihood[iter], param_change))
    }
    if (param_change < epsilon) break
  }
  
  # Hard cluster assignments
  clusters <- apply(q, 1, which.max)
  
  return(list(
    mu = mu,
    sigma = sigma,
    pi = pi,
    clusters = clusters,
    q = q,
    log_likelihood = log_likelihood[1:iter],
    n_iter = iter
  ))
}
# Example usage
set.seed(123)
library(mvtnorm)

# Generate synthetic data
mu1 <- c(1, 1)
mu2 <- c(4, 5)
sigma1 <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
sigma2 <- matrix(c(1, -0.5, -0.5, 1), nrow = 2)

n <- 300
X <- rbind(
  rmvnorm(n/2, mean = mu1, sigma = sigma1),
  rmvnorm(n/2, mean = mu2, sigma = sigma2)
)

# Run EM algorithm
gmm_fit <- em_gmm(X, K = 2)
## Iter 1: log-likelihood = -2942.438, param change = 11.628090
## Iter 2: log-likelihood = -1099.762, param change = 0.124261
## Iter 3: log-likelihood = -1068.201, param change = 0.150512
## Iter 4: log-likelihood = -1047.249, param change = 0.129163
## Iter 5: log-likelihood = -1032.548, param change = 0.104206
## Iter 6: log-likelihood = -1018.875, param change = 0.081722
## Iter 7: log-likelihood = -1003.555, param change = 0.037924
## Iter 8: log-likelihood = -993.298, param change = 0.009199
## Iter 9: log-likelihood = -989.571, param change = 0.002788
## Iter 10: log-likelihood = -988.096, param change = 0.000963
## Iter 11: log-likelihood = -987.562, param change = 0.000270
## Iter 12: log-likelihood = -987.411, param change = 0.000062
## Iter 13: log-likelihood = -987.377, param change = 0.000013
## Iter 14: log-likelihood = -987.370, param change = 0.000002
## Iter 15: log-likelihood = -987.368, param change = 0.000000
# Plot results
library(ggplot2)
ggplot(data.frame(X, Cluster = factor(gmm_fit$clusters)), 
       aes(x = X1, y = X2, color = Cluster)) +
  geom_point(alpha = 0.7) +
  stat_ellipse(aes(fill = Cluster), geom = "polygon", alpha = 0.2, level = 0.95) +
  labs(title = "GMM Clustering (EM Algorithm)", 
       subtitle = "95% confidence ellipses") +
  theme_minimal()
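
Since the data were simulated from known components, the fitted parameters can be checked against the generating values (means near (1, 1) and (4, 5), equal mixing weights); the component order may differ from the simulation order:

round(gmm_fit$mu, 2)   # cluster means; expected near c(1, 1) and c(4, 5)
round(gmm_fit$pi, 2)   # mixing weights; expected near 0.5 and 0.5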


14 Multivariate Normal Distribution

The multivariate normal distribution generalizes the one-dimensional normal distribution to higher dimensions. It describes sets of correlated real-valued random variables that cluster around a mean vector.

Definition: A random vector \(\mathbf{X} = (X_1, ..., X_p)^T\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{p \times p}\) (positive semi-definite), denoted as:

\[ \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

The probability density function (PDF) is given by:

\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

Key Properties

  • Mean: \(\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}\)
  • Covariance: \(\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}\)
  • Any linear combination of components of \(\mathbf{X}\) is also normally distributed.
  • Marginal distributions and conditional distributions are also multivariate normal.

Applications

  • Modeling correlated continuous variables
  • Principal Component Analysis (PCA)
  • Gaussian Mixture Models (GMMs)

\[ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \]

where:

  • \(\boldsymbol{\mu} = (\mu_1, ..., \mu_p)^T\) is the mean vector
  • \(\Sigma\) is the \(p \times p\) covariance matrix (positive semi-definite)

Sampling Algorithm To generate samples \(\mathbf{x}\) from \(\mathcal{N}(\boldsymbol{\mu}, \Sigma)\):

  1. Matrix Decomposition: Find matrix \(A\) such that \(AA^T = \Sigma\) (typically using Cholesky decomposition).

  2. Generate Standard Normals: Create vector \(\mathbf{z} = (z_1, ..., z_p)^T\) where \(z_i \overset{\text{iid}}{\sim} \mathcal{N}(0,1)\).

  3. Transform: Compute \(\mathbf{x} = \boldsymbol{\mu} + A\mathbf{z}\).
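
A minimal sketch of this three-step recipe in base R (the function name rmvnorm_chol is illustrative; the mvtnorm-based code later in this section is what the examples actually use):

# Sample from N(mu, Sigma) via the Cholesky factorization Sigma = A %*% t(A)
rmvnorm_chol <- function(n, mu, Sigma) {
  p <- length(mu)
  A <- t(chol(Sigma))                   # lower-triangular factor A with A %*% t(A) = Sigma
  Z <- matrix(rnorm(n * p), nrow = p)   # p x n matrix of iid N(0, 1) draws
  t(mu + A %*% Z)                       # shift and scale; returns an n x p matrix
}

set.seed(1)
samp <- rmvnorm_chol(1e4, mu = c(1, 2), Sigma = matrix(c(1, 0.8, 0.8, 1), 2, 2))
colMeans(samp)   # expected near (1, 2)
cov(samp)        # expected near Sigma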

Key Properties

  • Marginal Distributions: All subsets of variables are normally distributed
  • Conditional Distributions: Remain normal given other variables
  • Affine Transformations: Linear transformations preserve normality
# Covariance matrix, default is diag(ncol(x))
(Sigma <- matrix(c(1.0, 0.8, 0.8, 1.0), 2, 2))
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
# Install hexbin package (if not already installed)
if (!requireNamespace(c("hexbin", "ggplot2", "mvtnorm"), quietly = TRUE)) {
  install.packages(c("hexbin", "ggplot2", "mvtnorm"))
}

# Multivariate Normal Distribution Sampling
library(mvtnorm)  # For multivariate normal functions
library(ggplot2)   # For visualization

## Define parameters
# Covariance matrix (positive definite)
Sigma <- matrix(c(1,   0.8,
                  0.8, 1), 
                nrow = 2, byrow = TRUE)

# Mean vector
mu <- c(1, 2)

## Generate samples
set.seed(123)  # For reproducibility
rmvnorm_data <- rmvnorm(
  n = 1e4,            # Number of samples
  mean = mu,          # Mean vector
  sigma = Sigma       # Covariance matrix
)

# Add descriptive column names
colnames(rmvnorm_data) <- c("x", "y")

## Inspect data
cat("First 6 observations:\n")
## First 6 observations:
print(head(rmvnorm_data))
##              x        y
## [1,] 0.3957568 1.543471
## [2,] 2.4256834 2.760140
## [3,] 1.8826388 3.591820
## [4,] 0.8465034 1.074623
## [5,] 0.1863540 1.294218
## [6,] 2.2557657 2.869253
cat("\nSummary statistics:\n")
## 
## Summary statistics:
print(summary(rmvnorm_data))
##        x                 y         
##  Min.   :-3.3261   Min.   :-1.873  
##  1st Qu.: 0.3283   1st Qu.: 1.303  
##  Median : 0.9899   Median : 1.987  
##  Mean   : 0.9948   Mean   : 1.990  
##  3rd Qu.: 1.6687   3rd Qu.: 2.657  
##  Max.   : 5.0345   Max.   : 5.883
## Visualize the distribution
ggplot(data.frame(rmvnorm_data), aes(x = x, y = y)) +
  geom_hex(bins = 50) +  # Hexbin plot for density visualization
  geom_point(alpha = 0.1, size = 0.5) +  # Overlay individual points
  stat_ellipse(level = 0.95, color = "red", linewidth = 1) +  # 95% confidence ellipse
  labs(title = "Bivariate Normal Distribution",
       subtitle = sprintf("μ = (%s), ρ = %.2f", 
                          paste(mu, collapse = ", "),
                          cov2cor(Sigma)[1,2]),
       x = "Variable X",
       y = "Variable Y") +
  theme_minimal()

# means and covariance of x and y
# E(x) = 1.0; E(y) = 2.0
colMeans(rmvnorm_data)
##         x         y 
## 0.9948234 1.9897770
# cov(x, y) = 0.8
var(rmvnorm_data)
##           x         y
## x 1.0042236 0.8010468
## y 0.8010468 0.9975851

14.0.1 Independent and Identically Distributed Random Variables (iid)

Key Assumptions:

  • Independent: The outcome of one observation does not affect another.
  • Identically Distributed: All observations are drawn from the same probability distribution.
  • Often assumed in MLE to simplify the joint likelihood function into a product of marginal likelihoods.
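
In symbols, the iid assumption lets the joint likelihood factor into a product of marginal terms, which is why log-likelihoods are computed as sums:

\[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta) \qquad \Longrightarrow \qquad \ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta) \]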

Create a data set D with samples from the uniform distribution on the unit disc.

#' Gibbs Sampler for the Uniform Distribution on the Unit Disc
#'
#' Generates samples from the uniform distribution over the unit disc using Gibbs sampling
#' 
#' @param n Number of samples to generate
#' @param burn_in Number of burn-in iterations (default: 100)
#' @param thin Thinning interval (default: 1)
#' @return Matrix of samples with columns x and y

gibbs_unif_circle <- function(n, burn_in = 100, thin = 1) {
  # Initialize storage
  samples <- matrix(NA_real_, nrow = n, ncol = 2)
  colnames(samples) <- c("x", "y")
  
  # Initialize state
  x <- 0
  y <- 0
  
  # Total iterations needed
  total_iter <- burn_in + (n * thin)
  
  # Gibbs sampling loop
  sample_count <- 0
  for (i in 1:total_iter) {
    # Update x given y
    x <- runif(1, -sqrt(1 - y^2), sqrt(1 - y^2))
    
    # Update y given x
    y <- runif(1, -sqrt(1 - x^2), sqrt(1 - x^2))
    
    # Store sample after burn-in and thinning
    if (i > burn_in && (i - burn_in) %% thin == 0) {
      sample_count <- sample_count + 1
      samples[sample_count, ] <- c(x, y)
    }
  }
  
  return(samples)
}
# Draw 300 samples from the uniform distribution on the unit disc
D <- gibbs(300)

head(D)
##                x          y
## [1,] -0.50327586  0.5470326
## [2,]  0.27884269 -0.1662980
## [3,]  0.07382929  0.9954394
## [4,] -0.04632925  0.1059751
## [5,]  0.68299840  0.5627543
## [6,]  0.43498671  0.7351630
# show dimensions of data matrix
dim(D)
## [1] 300   2
# Generate samples
set.seed(123)
n <- 1000
samples <- gibbs_unif_circle(n, burn_in = 500, thin = 5)

# Visualize results
library(ggplot2)
ggplot(data.frame(samples), aes(x = x, y = y)) +
  geom_point(alpha = 0.3, color = "dodgerblue") +
  annotate("path",
           x = cos(seq(0, 2 * pi, length.out = 200)),
           y = sin(seq(0, 2 * pi, length.out = 200)),
           color = "red", linetype = 2) +  # unit circle boundary
  ggtitle("Gibbs Sampling from Uniform Distribution on Unit Circle") +
  coord_fixed() +
  theme_minimal()

# Check uniformity
cat("Mean x:", mean(samples[,1]), "\n")
## Mean x: -0.01114097
cat("Mean y:", mean(samples[,2]), "\n")
## Mean y: 0.000963111
cat("Correlation:", cor(samples[,1], samples[,2]), "\n")
## Correlation: -0.01035527

Testing Whether X and Y Have the Same Distribution

We want to test computationally whether or not two random variables, X and Y, have the same distribution.

Step 1: Visual Comparison

Start by plotting histograms or kernel density estimates (KDE) of both variables:

  • If they appear similar in shape, they might follow the same distribution.
  • However, visual inspection is not conclusive.

Step 2: Statistical Tests

Use formal statistical tests to compare the distributions of X and Y:

plot(density(samples[,"x"]),
     main = "Density Plot",
     col = "tomato",
     type = "l",
     lwd = 4,
     cex = 1)

points(density(samples[,"y"]),
       col = "dodgerblue",
       type = "l",
       lwd = 4,
       cex = 1)

abline(v = c(mean(samples[,"x"]), mean(samples[,"y"])), 
       col = c("tomato", "dodgerblue"), 
       lty = 3, 
       lwd = 2)

grid(nx = NULL, ny = NULL,
     col = "gray",
     lty = "dotted",
     lwd = 1,
     equilogs = TRUE)

legend("topleft", c("X", "Y"),
       col = c("tomato", "dodgerblue"),
       lty = 1,
       lwd = 4)

# Quantile-Quantile Plots
qqplot(samples[,"x"], samples[,"y"],
       col = "tomato",
       main = "Quantile-Quantile Plot")

grid(nx = NULL, ny = NULL,
     col = "lightgray",
     lty = "dotted",
     lwd = 1,
     equilogs = TRUE)

P(X): the probability that X falls in a stripe of width 0.4 centered at x = 0

(px <- sum(abs(samples[,"x"] - 0) < 0.2) / n)
## [1] 0.284

P(Y): the probability that Y falls in a stripe of width 0.4 centered at y = 0

(py <- sum(abs(samples[,"y"] - 0) < 0.2) / n)
## [1] 0.268

\(P(X \cap Y)\): the probability of landing in the square centered at \((0, 0)\) with side 0.4 (i.e., the intersection of the two stripes above)

# their joint probability
(pxy <- sum((abs(samples[,"x"] - 0) < 0.2) & (abs(samples[,"y"] - 0) < 0.2)) / n)
## [1] 0.064

\(P(X) * P(Y)\)

# product of their probabilities
px * py
## [1] 0.076112

Is \(P(X \cap Y) = P(X) \cdot P(Y)\)?

(pxy == px * py)
## [1] FALSE

The covariance should be close to zero; note, however, that zero covariance alone does not imply independence, as the comparison above already shows.

round(cov(samples[,"x"], samples[,"y"]), digits = 3)
## [1] -0.002

Create a data set G with samples from the two-dimensional normal distribution

The rmvnorm() function generates draws from the multivariate normal distribution with a given mean vector (default \(rep(0, length = ncol(x))\)) and covariance matrix sigma (default \(diag(ncol(x))\)).

# Covariance matrix, default is diag(ncol(x))
(Sigma <- matrix(c(1.0, 0, 0, 1.0), 2, 2))
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
# Draw from the bivariate standard normal: mean vector (0, 0), identity covariance
G <- rmvnorm(n = 1e4, mean = c(0,0), sigma = Sigma)
colnames(G) <- c("x", "y") #change variable names to x and y
# display head of the data
head(G)
##                x          y
## [1,] -1.35723063 -1.2926978
## [2,] -1.51720731  0.8591760
## [3,] -1.21461753  0.6190554
## [4,]  2.34537993 -0.1980279
## [5,] -0.18400767  0.8387375
## [6,] -0.01450755 -0.7638332

Multivariate plot of x vs. y

plot(G,
     xlim = c(-4, 4),
     ylim = c(-4, 4),
     pch = 16,
     cex = .4,
     col = adjustcolor("tomato", alpha = 0.5),
     asp = 1,
     main = "Scatterplot x vs. y")

abline(h = colMeans(G)[2], 
       v = colMeans(G)[1], 
       col = "black", 
       lty = 3)

grid(nx = NULL, 
     ny = NULL, 
     col = "gray",
     lwd = 1,
     lty = "dotted", 
     equilogs = TRUE)


Test whether or not x and y have the same distribution.

plot(density(G[,"x"]),
     main = " X and Y Density Plots",
     col = "tomato",
     type = "l",
     lwd = 4)

points(density(G[,"y"]),
       col = "dodgerblue",
       type = "l",
       lwd = 4)

grid(nx = NULL, 
     ny = NULL, 
     col = "gray",
     lwd = 1,
     lty = "dotted", 
     equilogs = TRUE)

legend("topleft", c("X", "Y"),
       col = c("tomato", "dodgerblue"),
       lty = 1,
       lwd = 4
)

# Q-Q Plot: Comparing Distributions of 'x' and 'y'
qqplot(G[, "x"], G[, "y"],
       main = "Q-Q Plot: Comparing Distributions of x and y",
       xlab = "population Quantiles (x)",
       ylab = "Sample Quantiles (y)",
       col = "steelblue",        # Point color
       pch = 19,                 # Solid circles for points
       cex = 0.8)                # Slightly smaller point size

# Add a reference line (y = x) to assess distribution equality
abline(a = 0, b = 1, 
       col = "red", 
       lty = 2, 
       lwd = 1.5)

# Add a subtle grid for better readability
grid(nx = NULL, ny = NULL,
     col = "lightgray", 
     lty = "dotted",
     lwd = par("lwd"),           # Match default line width
     equilogs = TRUE)

# Optional: Add legend for clarity
legend("topleft", 
       legend = c("Data Points", "Reference (y = x)"),
       col = c("steelblue", "red"),
       pch = c(19, NA),         # NA for line in legend
       lty = c(NA, 2),           # NA for points in legend
       bty = "n")                # No legend box

The Box-Muller transform in pseudocode:

u1, u2 ← Uniform(0, 1)
r ← sqrt(-2 · log(u1))
θ ← 2 · π · u2
return (r · cos(θ), r · sin(θ))

box_muller <- function(n, visualize = TRUE, verify = TRUE) {
  #' Generate Standard Normal Variables using Box-Muller Transform
  #'
  #' @param n Number of sample pairs to generate (will return 2n values)
  #' @param visualize Logical, whether to show diagnostic plots (default TRUE)
  #' @param verify Logical, whether to perform normality checks (default TRUE)
  #' @return Vector of 2n standard normal random variables
  #' 
  #' Example usage:
  #' normal_vars <- box_muller(10000)  # Generates 20,000 values
  #' normal_vars <- box_muller(5000, visualize = FALSE)  # quiet generation
  
  # Input validation
  if (!is.numeric(n) || n <= 0 || n != floor(n)) {
    stop("n must be a positive integer")
  }
  
  # Generate uniform random variables
  U1 <- runif(n, min = 0, max = 1)
  U2 <- runif(n, min = 0, max = 1)
  
  # Apply Box-Muller transform
  X1 <- sqrt(-2 * log(U1)) * cos(2 * pi * U2)
  X2 <- sqrt(-2 * log(U1)) * sin(2 * pi * U2)
  
  # Combine results
  X <- c(X1, X2)
  
  # Visualization and verification
  if (visualize || verify) {
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par))
    
    if (visualize) {
      par(mfrow = c(1, 2))
      
      # Q-Q plot
      qqnorm(X, main = "Normal Q-Q Plot", 
             col = rgb(0, 0.4, 0.8, 0.5))
      qqline(X, col = "red", lwd = 2)
      
      # Histogram with normal curve
      hist(X, probability = TRUE, breaks = 30,
           main = "Density Comparison",
           col = "lightblue", xlab = "Value")
      curve(dnorm(x), add = TRUE, col = "red", lwd = 2)
    }
    
    if (verify) {
      test_n <- min(length(X), 5000)  # Shapiro-Wilk max is 5000
      shapiro_test <- shapiro.test(sample(X, test_n))
      
      cat("\n=== Verification Results ===\n")
      cat("Generated", length(X), "values\n")
      cat("Mean:", round(mean(X), 5), "\n")
      cat("SD:", round(sd(X), 5), "\n")
      cat("Shapiro-Wilk p-value:", format.pval(shapiro_test$p.value), "\n")
    }
  }
  
  return(X)
}

Show that X1 and X2 are standard normal

set.seed(123)
normal_vars <- box_muller(5000, visualize = TRUE)

## 
## === Verification Results ===
## Generated 10000 values
## Mean: -0.00433 
## SD: 1.00493 
## Shapiro-Wilk p-value: 0.89029

Diagnostic Interpretation

Normal Distribution Indicators:

  • Linear Alignment: Points follow the red reference line closely
  • Tail Behavior: No systematic deviations at the extremes
  • Random Dispersion: Points scatter evenly around the line

Density Plot Interpretation

  • Curve match: histogram bars closely follow the theoretical red curve → the sample has the correct distribution shape (PDF matches the sampled data)
  • Symmetry: roughly equal mass on both sides of zero → mean ≈ median ≈ mode ≈ 0
  • Tail behavior: smooth, gradual taper beyond ±3 standard deviations → extreme values are handled properly (no artificial truncation)

Shapiro-Wilk Normality Test Interpretation

Result: p-value ≈ 0.890 (α = 0.05)

Decision Rule

\[\text{If } p \geq α \rightarrow \text{Fail to reject normality}\]

\[\text{If } p < α \rightarrow \text{Reject normality}\]

Since \(p \approx 0.890 \geq 0.05\), we fail to reject the null hypothesis: the Box-Muller samples are consistent with a standard normal distribution.