Project History and Acknowledgments

Developed as part of graduate coursework in Mathematical Statistics and Computational Statistics at Ohio University (2014), this project bridges rigorous statistical theory with modern computational techniques. It offers a systematic exploration of statistical modeling, numerical algorithm implementation, and dataset analysis—progressing from foundational concepts to advanced methods like Bayesian inference and Monte Carlo methods.

The original R codebase for this project has been extensively enhanced through AI-assisted optimization, improving code clarity, computational efficiency, and numerical accuracy, while preserving the original statistical functionality. The material follows a structured pedagogical progression, starting with basic descriptive statistics—such as measures of central tendency and correlation analysis—before advancing through hypothesis testing frameworks and culminating in more sophisticated computational methods. These include Markov Chain Monte Carlo techniques, maximum likelihood estimation, and expectation-maximization algorithms, with practical implementations ranging from Gibbs samplers to Gaussian mixture modeling.

Throughout this progression, we maintain a dual focus on mathematical foundations (such as distribution theory and statistical power analysis) and their practical computational realization, ensuring that learners develop both theoretical understanding and hands-on implementation skills.


1 Statistical Computing

Statistical computing refers to the intersection of computer science, numerical analysis, and statistics. It encompasses tasks that rely on statistical methods requiring significant computational power. Many statistical modeling and data analysis techniques are complex to implement in practice, often necessitating specialized software to process large datasets and generate actionable insights.

Technical Implementation

This project leverages a robust scientific computing stack:

Core Technologies

| Technology | Primary Use | Key Benefits |
|---|---|---|
| \(R\) | Statistical analysis and visualization | Comprehensive packages (tidyverse, lme4); reproducible research capabilities |
| \(\mathrm{\LaTeX}\) | Typesetting technical documents | Publication-quality math rendering; automated numbering/cross-referencing |
| Markdown | Lightweight documentation | Human-readable syntax; multi-output rendering (HTML/PDF/Word) |

1.1 Analytical Progression

This project implements a graduated analytical approach, beginning with fundamental statistical computations and systematically advancing to sophisticated modeling techniques:

  • Foundation Layer:
    • Descriptive statistics (mean, median, robust estimators)
    • Exploratory data analysis (EDA) visualizations
    • Basic hypothesis testing
  • Intermediate Layer:
    • Bootstrapping and resampling methods
    • Generalized linear models (GLMs)
  • Advanced Layer:
    • Monte Carlo methods
    • EM Algorithm

“Simple computations establish verifiable baselines before advancing to computationally intensive methods - ensuring each analytical stage builds upon statistically validated foundations.”
- Principle of Graduated Complexity


2 Descriptive Statistics

Descriptive statistics are summary measures that concisely describe the key features of a dataset. These summaries aim to communicate as much information as possible, as simply as possible.

Key Components

Statisticians typically characterize data using:

  • Measures of Central Tendency
    • Arithmetic mean
    • Median
  • Measures of Statistical Dispersion
    • Standard deviation
    • Median absolute deviation (MAD)
    • Interquartile range (IQR)
  • Measures of Distribution Shape
    • Skewness (symmetry)
    • Kurtosis (tailedness)
  • Measures of Dependence (for multivariate data)
    • Correlation coefficients (Pearson, Spearman)

2.1 Measuring Central Tendency (Location)

A measure of central tendency is a single value that represents the center of a dataset by identifying its typical or central position. These measures (also called measures of central location) help summarize an entire dataset with one representative value. The two measures covered here are:

  1. Mean (Average)
  2. Median (Middle value)

The choice of measure depends on the data type and distribution characteristics.

2.1.1 Mean

The mean (arithmetic average) is the most common measure of central tendency.

Key Properties:

  • Applicable to both discrete and continuous data (though most often used with continuous data)
  • Calculated as:
    \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
  • Sensitive to outliers (extreme values pull the mean toward them)

R Implementation:

#' Compute the Arithmetic Mean
#' 
#' Calculates the mean of a numeric vector, matrix, or list using matrix multiplication.
#' This is an educational implementation demonstrating manual mean calculation.
#'
#' @param X A numeric vector, matrix, or list of numbers
#' @return The arithmetic mean as a numeric scalar (a vector of column means for matrix input)
#' @examples
#' mu(c(1, 2, 3))  # Returns 2
#' mu(matrix(1:4, nrow=2))  # Returns the column means 1.5 and 3.5
mu <- function(X) {
  # Convert input to matrix format
  X <- as.matrix(X)
  
  # Get number of observations
  n <- nrow(X)
  
  # Create weight matrix (1/n for each observation)
  N <- matrix(1/n, nrow = 1, ncol = n)
  
  # Calculate weighted sum and return as scalar
  as.numeric(N %*% X)  # Matrix multiplication
}

# Save function to file
dump("mu", file = "mu.R")

BASIC USAGE EXAMPLES

set.seed(123)  # For reproducibility
X <- sample(10:99, 20)
X  # Display the data
##  [1] 40 88 60 23 76 51 59 52 96 34 78 66 18 81 35 16 94 87 45 92
# Custom matrix-based implementation
mu_result <- mu(X)

# Base R function for verification
base_mean <- mean(X)

cat("Custom mu() result:", round(mu_result, 2), "\n")
## Custom mu() result: 59.55
cat("Base R mean() result:", round(base_mean, 2), "\n")
## Base R mean() result: 59.55

2.1.2 Median

The median is the value that separates the higher half from the lower half of a dataset. For an ordered dataset, it represents:

  • The middle value (for odd number of observations)
  • The average of the two middle values (for even number of observations)

Key Properties:

  • Robustness: Unlike the mean, the median is not affected by extreme outliers
  • Typical Value: Often better represents “typical” values in skewed distributions
  • Ordinal Data: Can be used with ordinal data where the mean would be inappropriate

R Implementation:

#' Robust Median Calculation
#' 
#' Computes the median value of numeric data with comprehensive NA handling and
#' support for matrix/list inputs. For even-length inputs, returns the mean of 
#' the two central values (consistent with base R).
#'
#' @param X Numeric vector, matrix, or list
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param ... Additional arguments passed to methods
#' @return Numeric scalar representing the median, or NA if:
#'         - Input contains NAs with na.rm=FALSE
#'         - Input is empty
#'         - Input contains only NAs when na.rm=TRUE
#' @examples
#' # Basic usage
#' MEDIAN(c(1, 3, 5))          # Returns 3
#' MEDIAN(c(1, 2, 3, 4))       # Returns 2.5
#' 
#' # NA handling (NAs are removed by default; set na.rm = FALSE to propagate them)
#' MEDIAN(c(1, 2, NA, 4))               # Returns 2
#' MEDIAN(c(1, 2, NA, 4), na.rm=FALSE)  # Returns NA
#' 
#' # Matrix support
#' MEDIAN(matrix(1:9, nrow=3)) # Returns 5
#' @export
#' @seealso \code{\link[stats]{median}} for base R implementation
MEDIAN <- function(X, na.rm = TRUE, ...) {
  # Input validation
  if (!is.numeric(X)) stop("Input must be numeric")
  if (length(X) == 0) return(NA_real_)
  
  # Handle NA values
  if (anyNA(X)) {
    if (!na.rm) return(NA_real_)
    X <- X[!is.na(X)]
    if (length(X) == 0) return(NA_real_)
  }
  
  # Convert and sort input
  X <- as.matrix(sort(X))
  n <- nrow(X)
  i <- n %/% 2  # Middle index
  
  # Calculate median
  if (n %% 2 == 0) {  # Even case
    q2 <- mean(X[i:(i + 1), , drop = TRUE])
  } else {            # Odd case
    q2 <- X[i + 1, , drop = TRUE]
  }
  
  # Ensure numeric output
  return(as.numeric(q2))
}

# Save function to file
dump("MEDIAN", file = "MEDIAN.R")

BASIC USAGE EXAMPLES

# MEDIAN Function Demonstration
# Load required functions
source("MEDIAN.R")

set.seed(123)
test_data <- list(
  "Even-length" = sort(sample(1:9, 10, replace = TRUE)),
  "Odd-length" = sort(sample(1:9, 9, replace = TRUE))
)

results <- lapply(test_data, function(x) {
  data.frame(
    `Data Length` = length(x),
    `Input Data` = paste(x, collapse = ", "),
    `Custom MEDIAN()` = MEDIAN(x),
    `Base R median()` = median(x),
    check.names = FALSE
  )
})

knitr::kable(
  do.call(rbind, results),
  caption = "MEDIAN Function Comparison",
  align = c("l", "l", "r", "r")
)
MEDIAN Function Comparison

| | Data Length | Input Data | Custom MEDIAN() | Base R median() |
|---|---|---|---|---|
| Even-length | 10 | 2, 3, 3, 3, 4, 5, 5, 6, 6, 9 | 4.5 | 4.5 |
| Odd-length | 9 | 3, 3, 4, 7, 8, 9, 9, 9, 9 | 8.0 | 8.0 |

2.2 Measuring Variability (Spread)

A measure of spread (or statistical dispersion) quantifies the variability within a dataset. These metrics complement measures of central tendency by describing how observations are distributed around the center.

Key Characteristics:

  • Describe how “spread out” the data points are
  • Indicate whether values are clustered tightly or widely dispersed
  • Help assess the reliability of central tendency measures
  • Provide context for comparing different datasets

Common measures of spread include:

  • Range: The difference between maximum and minimum values
  • Variance: Average of squared deviations from the mean
  • Standard Deviation: Square root of variance (in original units)
  • Interquartile Range (IQR): Spread of the middle 50% of data
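
For orientation, the base R equivalents of these four spread measures can be computed directly. This is a quick illustrative snippet on arbitrary simulated data, not part of the project code:

# Base R versions of the four spread measures listed above
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

diff(range(x))   # Range: max - min
var(x)           # Sample variance (n - 1 denominator)
sd(x)            # Standard deviation
IQR(x)           # Interquartile range (Q3 - Q1)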

2.2.1 Variance

Variance (\(s^2\)) quantifies the dispersion of data points around their mean value. It represents the average squared deviation from the mean, providing a measure of how spread out a dataset is.

Variance is calculated as:

\[ s^2_x = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

Where:

  • \(x_i\) = individual data points
  • \(\bar{x}\) = sample mean
  • \(n\) = sample size
  • \(n-1\) = degrees of freedom (Bessel’s correction for unbiased estimation)

Key Properties:

  • Units: Expressed in squared units of the original data

  • Sensitivity: More affected by outliers than other dispersion measures

  • Relationship: Standard deviation is the square root of variance (\(s = \sqrt{s^2}\))

  • Interpretation:

    • Higher variance → greater spread
    • Zero variance → all values identical

When to Use:

  • Assessing data variability
  • Statistical tests (ANOVA, regression)

R Implementation

#' Compute Sample Variance
#' 
#' Calculates the unbiased sample variance using matrix operations. 
#' Implements Bessel's correction (n-1 denominator) for proper population variance estimation.
#'
#' @param X Numeric vector, matrix, or list of numbers
#' @param na.rm Logical indicating whether to remove NA values (default: FALSE)
#' @return The sample variance as a numeric scalar
#' @examples
#' s2(c(4, 9, 12, 15, 18))  # Returns 29.3
#' s2(1:12)  # Returns 13
#' s2(c(1, 3, NA, 5), na.rm=TRUE)  # Returns 4
#' @references
#' Bessel, F. W. (1838). On the determination of the probable error of a mean.
s2 <- function(X, na.rm = FALSE) {
  # Input validation
  if (!is.numeric(X)) stop("Input must be numeric")
  if (length(X) == 0) return(NA_real_)
  
  # Handle NA values
  if (anyNA(X)) {
    if (!na.rm) return(NA_real_)
    X <- X[!is.na(X)]
    if (length(X) == 0) return(NA_real_)
  }
  
  # Convert to matrix and calculate
  X <- as.matrix(X)
  n <- nrow(X)
  
  # Efficient one-pass calculation
  MU <- mu(X)
  D <- X - MU
  variance <- as.numeric((t(D) %*% D) / (n - 1))  # Bessel's correction
  
  return(variance)
}

# Save function
dump("s2", file = "s2.R")

BASIC USAGE EXAMPLES

# Generate sample data
set.seed(123)
X <- sample(1:9, 30, replace = TRUE)
cat("Sample data:\n", paste(X, collapse = ", "), "\n\n")
## Sample data:
##  3, 3, 2, 6, 5, 4, 6, 9, 5, 3, 9, 9, 9, 3, 8, 7, 9, 3, 4, 1, 7, 5, 7, 9, 9, 7, 5, 7, 5, 6
# Calculate measures
results <- data.frame(
  Statistic = c("Variance (s²)", "Standard Deviation (s)"),
  Custom_Function = c(s2(X), sqrt(s2(X))),
  Base_R = c(var(X), sd(X))
)

knitr::kable(
  results,
  caption = "Comparison of Dispersion Measures",
  digits = 6,
  align = c('l', 'r', 'r'),
  col.names = c("Measure", "Custom Function", "Base R")
)
Comparison of Dispersion Measures

| Measure | Custom Function | Base R |
|---|---|---|
| Variance (s²) | 5.867816 | 5.867816 |
| Standard Deviation (s) | 2.422358 | 2.422358 |

Technical Note: The exact match between custom and base R functions confirms proper implementation of both the variance calculation and Bessel’s correction.


2.2.2 Median Absolute Deviation (MAD)

The Median Absolute Deviation (MAD) is a robust measure of statistical dispersion that quantifies the variability of a univariate dataset. Unlike variance or standard deviation, MAD is resistant to outliers, making it particularly useful for datasets that may contain extreme values.

MAD is calculated as:

\[ \begin{equation} \text{MAD} = \text{median} \left( \left| x_i - \tilde{x} \right| \right) \end{equation} \]

where:

  • \(x_i\) = individual data points
  • \(\tilde{x}\) = median of the dataset
  • \(| \cdot |\) = absolute value operator

Key Properties:

  • Robustness:
    • Unlike standard deviation (which squares deviations), MAD uses absolute values, making it less sensitive to outliers
    • A single extreme value has minimal impact on MAD
  • Interpretation:
    • Represents the median distance of data points from the dataset’s median
    • Higher MAD indicates greater spread in the data
  • Comparison with Standard Deviation:
    • For normally distributed data: \(\text{MAD} \approx 0.6745 \sigma\)
    • More reliable than standard deviation for non-normal distributions or contaminated data

Scaling to Estimate Standard Deviation

To use MAD as a consistent estimator of the population standard deviation (\(\sigma\)), multiply it by a scale factor \(k\):

\[ \hat{\sigma} = k \cdot \text{MAD} \]

This scaling:

  • Makes MAD consistent with the standard deviation (\(\sigma\)) for normal distributions
  • Converts the median-based dispersion measure to the standard-deviation scale
  • Maintains robustness to outliers (breakdown point of 50%)

The scale factor \(k\) is:

  • For normal distributions: \[k = \frac{1}{\Phi^{-1}(3/4)} \approx 1.4826\] (where \(\Phi^{-1}\) is the inverse of the standard normal CDF)
  • Different \(k\) values may be used for other distributions
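
The 1.4826 constant can be verified numerically. A small illustrative check using only base R’s qnorm() and mad() (not part of the original code set):

# k = 1 / qnorm(0.75): the factor that makes MAD consistent with sigma under normality
k <- 1 / qnorm(0.75)
k                                 # 1.4826...

# Sanity check on simulated normal data: the scaled MAD approaches sigma
set.seed(42)
x <- rnorm(1e5, mean = 0, sd = 3)
c(scaled_MAD = mad(x), true_sigma = 3)   # stats::mad() uses constant = 1.4826 by default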

Applications:

  • Outlier detection (common rule: \(|x_i - \tilde{x}| > 3 \cdot \text{MAD}\))
  • Robust statistical modeling
  • Analyzing heavy-tailed distributions
  • Non-parametric statistics

R Implementation

#' Median Absolute Deviation (MAD)
#'
#' Computes the robust MAD estimator for numeric data
#' @param X A numeric vector, matrix, or data frame
#' @param k Scale factor (default 1.4826 for normal distribution)
#' @return The scaled MAD value
#' @examples
#' MAD(c(1, 3, 5, 9, 11))
#' MAD(matrix(rnorm(100), ncol = 5))
MAD <- function(X, k = 1.4826) {
  # Input validation
  if (!is.numeric(X)) {
    stop("Input must be numeric")
  }
  
  # Convert to matrix if needed
  X <- as.matrix(X)
  
  # Compute median of absolute deviations
  med <- median(X)
  abs_dev <- abs(X - med)
  mad <- median(abs_dev) * k
  
  return(mad)
}

# Save function
dump("MAD", file = "MAD.R")

BASIC USAGE EXAMPLES

# Calculate results
results <- data.frame(
  Method = c("Custom MAD (k=1)", "Custom MAD (k=1.4826)", "stats::mad"),
  Value = c(
    MAD(X, k = 1),
    MAD(X),
    stats::mad(X)
  ),
  Description = c(
    "Unscaled median absolute deviation",
    "Scaled to match SD for normal distribution",
    "Base R's implementation (equivalent to scaled MAD)"
  )
)

# Print formatted results
knitr::kable(results, 
             caption = "Comparison of MAD Calculations",
             align = c('l', 'r', 'l'),
             digits = 4)
Comparison of MAD Calculations

| Method | Value | Description |
|---|---|---|
| Custom MAD (k=1) | 2.0000 | Unscaled median absolute deviation |
| Custom MAD (k=1.4826) | 2.9652 | Scaled to match SD for normal distribution |
| stats::mad | 2.9652 | Base R’s implementation (equivalent to scaled MAD) |

Key Observations:

  • Unscaled MAD (k = 1)
    • Value: 2.0000
    • Interpretation: the raw median absolute deviation without scaling:
    \[ \text{MAD}_\text{raw} = \text{median}\left(\left|X_i - \tilde{X}\right|\right) \]
    where \(\tilde{X}\) is the median of the dataset.
  • Scaled MAD (k = 1.4826)
    • Value: 2.9652
    • Comparison with Base R: identical to the stats::mad() output (2.9652)

2.2.3 Interquartile Range (IQR)

The Interquartile Range (IQR) is a robust measure of statistical dispersion, calculated as the difference between the 75th (Q3) and 25th (Q1) percentiles:

\[ \text{IQR} = Q_3 - Q_1 \]

Key Properties:

  • Robustness:
    • Resistant to outliers (unlike the standard deviation)
    • A 25% trimmed range: it ignores the top and bottom 25% of the data
  • Outlier Detection:
    • Lower fence: \(Q_1 - 1.5 \times \text{IQR}\)
    • Upper fence: \(Q_3 + 1.5 \times \text{IQR}\)
    • Observations beyond these fences are considered outliers
  • Visualization: In a boxplot, the whiskers extend to the most extreme observations within \(\pm 1.5 \times \text{IQR}\) of the quartiles (frequently with an additional bar at the end of each whisker), and any points beyond the whiskers are drawn individually as outliers.

For a symmetric distribution (where the median equals the midhinge, the average of the first and third quartiles), half the IQR equals the median absolute deviation (MAD).

Relationship with Other Measures

| Measure | Formula | Robust? | Efficiency |
|---|---|---|---|
| IQR | \(Q_3 - Q_1\) | Yes | Moderate |
| MAD | \(\text{median}(|X_i - \tilde{X}|)\) | Yes | High |
| Range | \(\max(X) - \min(X)\) | No | Low |

Special Cases

For symmetric distributions:

  • Median = Midhinge (\(\frac{Q_1 + Q_3}{2}\))
  • \(\text{MAD} \approx \frac{\text{IQR}}{2}\)
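
A quick empirical check of the MAD ≈ IQR/2 relationship on a symmetric sample (illustrative only; mad() with constant = 1 gives the raw, unscaled MAD):

# For a symmetric (here: normal) sample, the raw MAD is roughly half the IQR
set.seed(7)
x <- rnorm(1e5)

IQR(x) / 2              # ~ 0.674 (the upper quartile of a standard normal)
mad(x, constant = 1)    # raw (unscaled) MAD, also ~ 0.674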

R Implementation of IQR Plot

#' Interactive IQR Plot and Summary (base R version)
#'
#' Computes IQR and creates a customizable boxplot-like visualization with base R
#' @param X Numeric vector, matrix, or data frame
#' @param summary Logical - whether to print 5-point summary (default TRUE)
#' @param plot Logical - whether to generate plot (default TRUE)
#' @return Invisibly returns the IQR value
#' @examples
#' iqr_plot(rnorm(100))
#' iqr_plot(mtcars$mpg, plot=TRUE, summary=FALSE)

iqr_plot <- function(X, summary = TRUE, plot = TRUE) {
  source("MEDIAN.R")
  source("mu.R")
  source("Q.R")
  
  # Calculate all statistics
  Q <- Q(X)
  q1 <- MEDIAN(Q[[1]])
  q2 <- MEDIAN(X)  # Median
  q3 <- MEDIAN(Q[[2]])
  IQR <- q3 - q1
  
  
  # PRINT 5-POINT SUMMARY
  if (summary) {
    cat("5-Point Summary:\n")
    cat(sprintf("   %-8s: %g\n", 
                c("Min", "Q1", "Median", "Q3", "Max"),
                c(min(X), q1, q2, q3, max(X))))
  }
  
  # IQR PLOT
  if (plot) {
    # Prepare data
    low_bound <- q1 - 1.5 * IQR
    high_bound <- q3 + 1.5 * IQR
    is_outlier <- X < low_bound | X > high_bound
    inliers <- X[!is_outlier]
    low_whisker <- min(inliers)
    high_whisker <- max(inliers)
    
    # Set up plot area
    par(mar = c(4, 1, 3, 1))  # Adjust margins
    plot(NA, xlim = range(X), ylim = c(0, 1), 
         xlab = "", ylab = "", yaxt = "n", 
         main = "IQR Plot", frame.plot = FALSE)
    
    # Add vertical gridlines
    grid(nx = NA, ny = NULL, col = "gray90", lty = "dotted")
    
    # Draw IQR box
    rect(q1, 0.25, q3, 0.75, col = "white", border = "black", lwd = 1)
    
    # Draw median line
    segments(q2, 0.25, q2, 0.75, col = "black", lwd = 2)
    
    # Draw whiskers
    segments(low_whisker, 0.5, q1, 0.5, lty = "dashed", lwd = 1)
    segments(q3, 0.5, high_whisker, 0.5, lty = "dashed", lwd = 1)
    segments(low_whisker, 0.4, low_whisker, 0.6, lwd = 1)
    segments(high_whisker, 0.4, high_whisker, 0.6, lwd = 1)
    
    # Plot outliers
    points(X[is_outlier], rep(0.5, sum(is_outlier)), 
           pch = 19, col = "red", cex = 1.2)
    
    # Plot central tendency points
    points(q2, 0.5, pch = 18, col = "black", cex = 1.5)       # Median
    # Add legend
    legend("top", 
           legend = c("Median", "Outliers"),
           pch = c(18, 19),
           col = c("black", "red"),
           horiz = TRUE, bty = "n", pt.cex = 1.2)
    
    # Add box around plot
    box()
  }
  
  return(invisible(IQR))
}

# Save function
dump("iqr_plot", file = "iqr_plot.R")

BASIC USAGE EXAMPLES

#' BASIC USAGE EXAMPLES
#' 
#' Demonstrates iqr_plot() functionality with comparison to base boxplot()

# Example 1: Dataset with potential outlier
X1 <- c(74, 88, 78, 90, 94, 90, 84, 90, 98, 80, 120, 51)
cat("Dataset 1:\n", paste(X1, collapse = ", "), "\n\n")
## Dataset 1:
##  74, 88, 78, 90, 94, 90, 84, 90, 98, 80, 120, 51
# iqr.plot output
cat("iqr_plot(X1) results:\n")
## iqr_plot(X1) results:
iqr1 <- iqr_plot(X1)
## 5-Point Summary:
##    Min     : 51
##     Q1      : 79
##     Median  : 89
##     Q3      : 92
##     Max     : 120

# Boxplot comparison
cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X1, horizontal = TRUE, main = "Standard Boxplot Comparison")

#--------------------------------------------------

# Example 2: Random sample
set.seed(182)
X2 <- sample(x = 1:50, size = 6)
cat("\nDataset 2:\n", paste(X2, collapse = ", "), "\n\n")
## 
## Dataset 2:
##  42, 39, 34, 37, 11, 40
cat("iqr.plot(X2) results:\n")
## iqr.plot(X2) results:
iqr2 <- iqr_plot(X2)
## 5-Point Summary:
##    Min     : 11
##     Q1      : 34
##     Median  : 38
##     Q3      : 40
##     Max     : 42

cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X2, horizontal = TRUE, main = "Standard Boxplot Comparison")

#--------------------------------------------------

# Example 3: Another random sample
set.seed(42)
X3 <- sample(x = 1:50, size = 6)
cat("\nDataset 3:\n", paste(X3, collapse = ", "), "\n\n")
## 
## Dataset 3:
##  49, 37, 1, 25, 10, 36
cat("iqr.plot(X3) results:\n")
## iqr.plot(X3) results:
iqr3 <- iqr_plot(X3)
## 5-Point Summary:
##    Min     : 1
##     Q1      : 10
##     Median  : 30.5
##     Q3      : 37
##     Max     : 49

cat("\nEquivalent boxplot:\n")
## 
## Equivalent boxplot:
boxplot(X3, horizontal = TRUE, main = "Standard Boxplot Comparison")


2.3 Distribution Shape Analysis

Common measures of the shape of a distribution are skewness and kurtosis. Skewness measures symmetry, or more precisely the lack of it: a distribution or dataset is symmetric if it looks the same to the left and right of its center point. Kurtosis measures whether the data are heavy-tailed or light-tailed relative to a normal distribution. In any symmetric distribution the mean, median, and mode are equal. When the data are skewed, however, as in a right-skewed dataset, the mean is dragged in the direction of the skew (toward the long tail). In these situations the median is generally considered the better representative of the central location of the data: the more skewed the distribution, the greater the difference between median and mean, and the greater the emphasis that should be placed on the median rather than the mean.


2.3.1 Skewness (Asymmetry)

Measures the lack of symmetry in data distribution:

  • Symmetric distribution (skewness = 0):
    • Mean = Median = Mode
    • Mirror-image left and right sides
  • Right-skewed (positive skewness):
    • Mean > Median
    • Longer right tail (e.g., income distributions)
  • Left-skewed (negative skewness):
    • Mean < Median
    • Longer left tail (e.g., age at retirement)

Interpretation guidelines:

  • |Skewness| > 1: Considered significantly skewed
  • Median is more robust than mean for skewed data
  • Transformation (e.g., log) often needed for analysis
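
A minimal moment-based skewness sketch to accompany the guidelines above (the function name skewness_m3 is illustrative; packages such as e1071 or moments provide equivalent estimators):

# Third standardized moment as a simple skewness estimate
skewness_m3 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  s <- sqrt(sum((x - m)^2) / n)   # population-style standard deviation
  sum((x - m)^3) / (n * s^3)
}

set.seed(1)
right_skewed <- rexp(1e4)                   # long right tail
skewness_m3(right_skewed)                   # clearly positive (~ 2 for an exponential)
mean(right_skewed) > median(right_skewed)   # TRUE: mean pulled toward the tail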

2.3.2 Kurtosis (Tail Weight)

Measures tail heaviness relative to normal distribution:

| Type | Kurtosis | Characteristics |
|---|---|---|
| Leptokurtic | > 3 | Heavy tails, sharp peak |
| Mesokurtic | ≈ 3 | Normal distribution |
| Platykurtic | < 3 | Light tails, flat peak |

Practical implications:

  • High kurtosis → Higher outlier probability
  • Low kurtosis → Data concentrated near mean
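
A matching sketch for kurtosis, using the fourth standardized moment (again an illustrative helper, not part of the original code set):

# Fourth standardized moment as a simple (non-excess) kurtosis estimate
kurtosis_m4 <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  m <- mean(x)
  v <- sum((x - m)^2) / n
  sum((x - m)^4) / (n * v^2)
}

set.seed(1)
kurtosis_m4(rnorm(1e5))        # ~ 3   (mesokurtic)
kurtosis_m4(rt(1e5, df = 5))   # > 3   (leptokurtic, heavy tails)
kurtosis_m4(runif(1e5))        # ~ 1.8 (platykurtic, light tails)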

The five-number summary provides visual clues about shape. Use the following quartile relationships to assess it:

  • Skewness indicators:
    • Right skew: (Q3-Q2) > (Q2-Q1)
    • Left skew: (Q2-Q1) > (Q3-Q2)
  • Kurtosis indicators:
    • IQR/Range ratio < 0.5 → Possible heavy tails
    • Extreme whisker lengths → Potential outliers

Example Analysis

For dataset: [1, 2, 3, 5, 8, 13, 21, 34, 55]

5-point summary: 1 (Min), 3 (Q1), 8 (Median), 21 (Q3), 55 (Max)

  • Skewness: (21-8) > (8-3) → Right-skewed
    • Mean (15.8) > Median (8) → Confirms right skew
  • Kurtosis: IQR/Range = 18/54 ≈ 0.33 → Suggests heavy tails

This shape analysis helps determine appropriate statistical methods and data transformations for accurate modeling.
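
The worked example above can be reproduced directly with base R’s fivenum(), whose Tukey-style hinges match the quoted quartiles (a small verification sketch):

x <- c(1, 2, 3, 5, 8, 13, 21, 34, 55)

q <- fivenum(x)   # 1, 3, 8, 21, 55  (min, Q1, median, Q3, max)
q

(q[4] - q[3]) > (q[3] - q[2])   # TRUE: (Q3 - Q2) > (Q2 - Q1), right-skew indicator
mean(x) > median(x)             # TRUE: mean (15.8) dragged above the median (8)
(q[4] - q[2]) / (q[5] - q[1])   # ~ 0.33: IQR / Range ratio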

Data Transformations: When and Why to Use Them

Data transformations modify the scale or distribution of data to meet statistical assumptions or improve model performance. Below is a structured guide on their application.

When to Apply Transformations

1. Correcting Skewness

Right-Skewed Data Transformations

| Transformation | Formula | When to Use | Effect Size | Limitations |
|---|---|---|---|---|
| Logarithmic | \(\log(x)\) | Moderate-to-strong skew | Reduces skewness by ~0.5–1.5 | Requires \(x>0\) |
| Square Root | \(\sqrt{x}\) | Mild-to-moderate skew | Reduces skewness by ~0.3–0.8 | Works best with non-negative counts |
| Box–Cox | \(\frac{x^\lambda - 1}{\lambda}\) | Variable skewness | Optimally reduces skewness | Requires \(x>0\); \(\lambda\) estimation needed |
| Inverse | \(1/x\) | Extreme skew cases | Strong reduction effect | Undefined at zero, reverses order |

Left-Skewed Data Transformations

| Transformation | Formula | Effect | Typical Use Cases |
|---|---|---|---|
| Exponential | \(x^2\), \(x^3\) | Converts left skew to right skew | Mild left skew |
| Reflection + Log | \(\log(\max(x)+1-x)\) | Converts to right skew first | Strong left skew |

2. Meeting Parametric Test Assumptions

Parametric statistical tests rely on the assumption of normally distributed residuals/errors. Common tests requiring normality include:

  • t-tests (independent and paired samples)
  • ANOVA (one-way and factorial)
  • Linear regression
  • Pearson correlation

Common Normality-Inducing Transformations

| Transformation | Formula | Suitable Data Types | Key Features |
|---|---|---|---|
| Logarithmic | \(\log(x)\) | Positive right-skewed data | Simple implementation; multiplicative effects become additive |
| Box-Cox | \(\frac{x^\lambda - 1}{\lambda}\) | Positive-valued data | Automatically finds the optimal λ for best normality |
| Yeo-Johnson | Complex piecewise function | Data with zeros/negatives | Generalized version of Box-Cox |

Implementation Guidelines

  • Log Transform:
    • Use log10 or the natural log (ln)
    • For zeros: log(x + c), where c is a small constant
  • Box-Cox:
    • Requires all values > 0
    • λ = 0 implies a log transform
    • λ = 1 implies no transform is needed
  • Yeo-Johnson:
    • Handles all real numbers
    • Similar interpretation to Box-Cox

Verification of Results

After transformation, check for normality using:

  • Q-Q plots (should follow straight line)
  • Shapiro-Wilk test (p > 0.05 indicates normality)
  • Skewness/kurtosis values (should be near 0 and 3 respectively)
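
A small before/after sketch of these checks on simulated right-skewed data, using only a log transform and base R tests (illustrative, not part of the original code set):

set.seed(10)
x <- rlnorm(200, meanlog = 0, sdlog = 1)   # strongly right-skewed, all values > 0
log_x <- log(x)

# Normality checks before vs. after the transform
shapiro.test(x)$p.value       # very small: normality rejected for the raw data
shapiro.test(log_x)$p.value   # large: transformed data consistent with normality

# Q-Q plots: the transformed data should track the reference line
op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Raw data");     qqline(x)
qqnorm(log_x, main = "log(x)");   qqline(log_x)
par(op)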

When Transformations Fail

If normality cannot be achieved:

  • Consider non-parametric alternatives:
    • Mann-Whitney U (instead of the t-test)
    • Kruskal-Wallis (instead of ANOVA)
    • Spearman correlation (instead of Pearson)
  • Use robust statistical methods
  • Try bootstrapping approaches

3. Stabilizing Variance (Homoscedasticity)

Many statistical models assume constant variance (homoscedasticity). Below are common variance patterns and their corresponding transformations:

Variance Stabilization Transformations

| Variance Pattern | Transformation | Formula | Application Notes |
|---|---|---|---|
| Multiplicative variance | Logarithmic | \(\log(y)\) | Use when SD ∝ mean (common in financial/biological data) |
| Poisson-distributed counts | Square root | \(\sqrt{x}\) or \(\sqrt{x + 0.375}\) | For count data where variance = mean |
| Proportional/binomial data | Arcsine | \(\arcsin(\sqrt{x})\) | For proportions (0 ≤ x ≤ 1) |
| Severe heteroscedasticity | Box-Cox | \(\frac{x^\lambda - 1}{\lambda}\) | When the pattern isn’t clearly multiplicative |
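
A brief check of the square-root rule for Poisson counts from the table above (illustrative simulation):

# For Poisson counts the variance grows with the mean; sqrt() roughly stabilizes it
set.seed(3)
lambdas <- c(2, 8, 32)
counts  <- lapply(lambdas, function(m) rpois(500, lambda = m))

sapply(counts, sd)                        # SD grows roughly like sqrt(mean)
sapply(counts, function(y) sd(sqrt(y)))   # ~ 0.5 in every group after the transform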

2.4 Dependence in Descriptive Statistics

Dependence refers to any statistical relationship between two random variables, while correlation specifically measures the degree of linear relationship between them. The value of a correlation coefficient ranges between -1 and +1. If the variables are independent, the correlation coefficient is 0, but the converse is not true, because the correlation coefficient detects only linear dependencies between two variables.

Key Properties of Correlation Coefficients:

  • Range: [-1, +1]
    • +1: Perfect positive linear relationship
    • -1: Perfect negative linear relationship
    • 0: No linear relationship
  • Symmetry: cor(X,Y) = cor(Y,X)
  • Unitless: Invariant to scale changes

Key Measures

| Measure | Type | Formula | Use Case |
|---|---|---|---|
| Pearson’s r | Linear | cov(X,Y)/(σₓσᵧ) | Normally distributed data |
| Spearman’s ρ | Rank | Correlation of ranks | Non-normal/monotonic relationships |
| Kendall’s τ | Ordinal | Concordant pairs | Small samples/ordinal data |

Critical Considerations

  • Independence vs. Zero Correlation:
    • Independence ⇒ ρ = 0
    • ρ = 0 ⇏ Independence (there may be nonlinear relationships; see the sketch after this list)
  • Common Pitfalls:
    • Sensitive to outliers
    • Only captures linear relationships
    • Can suggest spurious relationships
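
The point that ρ = 0 does not imply independence is easy to demonstrate; a minimal sketch with a deterministic but nonlinear relationship:

# Y is a deterministic function of X, yet the (linear) correlation is ~ 0
set.seed(5)
X <- rnorm(1e4)
Y <- X^2                        # perfectly dependent on X, but not linearly

cor(X, Y)                       # ~ 0: Pearson misses the relationship
cor(X, Y, method = "spearman")  # also ~ 0: the relationship is not monotonic either
plot(X, Y, pch = ".", main = "Dependent but (linearly) uncorrelated")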

When to Use:

  • Exploratory analysis: Identify potential relationships
  • Feature selection: Screen variables for predictive modeling
  • Assumption checking: Validate linear regression assumptions

Interpretation Guidelines

  • |r| > 0.7: Strong relationship
  • 0.3 < |r| < 0.7: Moderate relationship
  • |r| < 0.3: Weak relationship

2.4.1 Pearson Correlation Coefficient

Pearson’s correlation coefficient (denoted as ρ for populations, r for samples) measures the linear relationship between two variables. It is calculated as the covariance of the variables divided by the product of their standard deviations.

Population Formula For random variables (X, Y):

\[ \rho_{XY} = \frac{\text{cov}(X,Y)}{\sigma_{X} \sigma_{Y}} \] where:

  • \(cov(X,Y)\): Covariance between \(X\) and \(Y\)
  • \(\sigma_{X}\): Standard deviation of \(X\)
  • \(\sigma_{Y}\): Standard deviation of \(Y\)

When applied to sample data, the coefficient is denoted as r:

\[ r_{xy} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \] where:

  • x̄, ȳ: Sample means
  • n: Sample size

Assumptions

  • Linear relationship between variables
  • Continuous variables (interval/ratio)
  • Normally distributed variables
  • Homoscedasticity (constant variance)
  • No outliers (sensitive to extreme values)

Note: Pearson’s \(r\) only measures linear relationships. Nonlinear relationships may have \(r ≈ 0\) despite strong dependence.


R Implementation

#' Compute Pearson's Correlation Coefficient
#'
#' Calculates Pearson's r between two numeric vectors, lists, or from a two-column matrix/dataframe.
#' Implements the computational formula for efficiency with large datasets.
#'
#' @param X Numeric vector, list, or matrix/dataframe with two columns
#' @param Y Optional second numeric vector/list (ignored if X is a matrix)
#' @return Pearson's correlation coefficient (r) between -1 and 1
#' @examples
#' # Vector inputs
#' COR(1:10, 2:11)
#' 
#' # Matrix input
#' mat <- cbind(x = rnorm(100), y = rnorm(100))
#' COR(mat)
#' 
#' # Dataframe input
#' df <- data.frame(var1 = mtcars$mpg, var2 = mtcars$wt)
#' COR(df)
COR <- function(X, Y = NULL) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  
  # Handle matrix/dataframe input
  if (is.null(Y)) {
    if (!is.matrix(X) && !is.data.frame(X)) 
      stop("When Y is NULL, X must be a matrix or dataframe with two columns")
    if (ncol(X) != 2) 
      stop("X must have exactly two columns when Y is not provided")
    
    Y <- X[, 2]
    X <- X[, 1]
  }
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths
  if (length(X) != length(Y)) 
    stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) 
    stop("Insufficient complete cases (need at least 2 observations)")
  
  # Computational formula implementation
  sumX <- sum(X)
  sumY <- sum(Y)
  sumXY <- sum(X * Y)
  sumX2 <- sum(X^2)
  sumY2 <- sum(Y^2)
  
  numerator <- n * sumXY - sumX * sumY
  denominator <- sqrt((n * sumX2 - sumX^2) * (n * sumY2 - sumY^2))
  
  if (denominator == 0) {
    warning("Zero variance detected, correlation undefined")
    return(NA_real_)
  }
  
  r <- numerator / denominator
  return(r)
}

# Save function
dump("COR", file = "COR.R")

BASIC USAGE EXAMPLES

Example 1: Very Weak Positive Correlation (~0.001)

# simulate bivariate data
# Load package for multivariate normal distribution
require(mvtnorm)
## Loading required package: mvtnorm
# Generate 5000 observations from bivariate normal distribution
# with mean 0 for both variables and very weak correlation (0.001)
# cor(X, Y) ~ 1/1000
data <- rmvnorm(n=5e3, mean=c(0, 0), 
                sigma=matrix(c(1,.001,.001,1), ncol=2))
X <- data[,1]
Y <- data[,2]

# Calculate correlation using our COR function 
COR(X, Y)
## [1] 0.003164172
# Compare with R's built-in correlation function
stats::cor(X, Y)
## [1] 0.003164172
# Visualize the nearly uncorrelated data
# cor(X, Y) ~ 0
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('black', alpha.f=.2), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 2: Moderate Positive Correlation (~0.67)

# Generate data with stronger covariance (off-diagonal = 1)
# and higher variance (diagonal = 1.5)

data <- rmvnorm(n=3e3, mean=c(0, 0), 
                sigma=matrix(c(1.5,1,1,1.5), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # population correlation = 1/1.5 ≈ 0.67
## [1] 0.6654865
# Plot showing clear positive relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('black', alpha.f=.2), 
     xlim=c(-5, 5), ylim=c(-5, 5))


Example 3: Strong Positive Correlation (~0.99)

# Generate data with nearly identical variables
# (very high covariance relative to variance)
data <- rmvnorm(n=2e3, mean=c(0, 0), 
                sigma=matrix(c(1.01,1,1,1.01), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # Should be very close to 1
## [1] 0.9898421
# Plot showing nearly perfect linear relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('darkblue', alpha.f=.1), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 4: Perfect Positive Correlation (1)

# Generate perfectly correlated data
# (covariance matrix where variables are identical)
data <- rmvnorm(n=1e3, mean=c(0, 0), 
                sigma=matrix(c(1,1,1,1), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y) # cor(X, Y) = 1
## [1] 1
# Plot showing perfect linear relationship
plot(X, Y, pch=1, cex=2, 
     col=adjustcolor('darkgreen', alpha.f=.05), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Example 5: Strong Negative Correlation (~-0.99)

# Generate data with nearly perfect negative relationship
data <- rmvnorm(n=2e3, mean=c(0, 0), 
                sigma=matrix(c(1.01,-1,-1,1.01), ncol=2))
X <- data[,1]
Y <- data[,2]

COR(X, Y)  # cor(X, Y) ~ -1
## [1] -0.9895359
# Plot showing perfect inverse relationship
plot(X, Y, pch=1, cex=1.2, 
     col=adjustcolor('red', alpha.f=.05), 
     xlim=c(-3, 3), ylim=c(-3, 3))


Key Points:

  • Each example demonstrates a different correlation strength
  • The \(COR()\) function matches \(R\)’s built-in \(cor()\) results
  • Plots visually confirm the correlation strength:
    • Tight clustering along the diagonal = strong correlation
    • Circular cloud = no correlation
    • Negative slope = inverse relationship

2.4.2 Rank Correlation Coefficients

Rank correlation coefficients measure monotonic relationships between variables by analyzing the ranks of data values rather than the raw data itself. The two most common measures are:

  • Spearman’s \(\rho\) (rho)
  • Kendall’s \(\tau\) (tau)

These are non-parametric measures that assess how well the relationship between two variables can be described using a monotonic function.

2.4.2.1 Spearman’s Rank Correlation Coefficient

For a sample of size \(n\), the Spearman correlation coefficient is the Pearson correlation between the rank variables:

\[ \displaystyle{r_{s}=\rho_{\operatorname{rg}_{X}, \: \operatorname{rg}_{Y}} = \frac{\operatorname{cov}(\operatorname{rg}_{X},\; \operatorname{rg}_{Y})}{\sigma_{\operatorname{rg}_{X}}\:\sigma_{\operatorname{rg}_{Y}}}}, \] where:

  • \(\rho\) = Pearson correlation coefficient applied to the rank variables
  • \(\operatorname{cov}(\operatorname{rg}_{X}, \operatorname{rg}_{Y})\) = covariance of the rank variables
  • \(\sigma_{\operatorname{rg}_{X}}\), \(\sigma_{\operatorname{rg}_{Y}}\) = standard deviations of the rank variables

Simplified Calculation (No Ties)

When all ranks are distinct integers:

\[ \displaystyle{r_{s}=1-\frac{6 \sum d_{i}^{2}}{n(n^{2}-1)}}, \]

where

  • \(\displaystyle{d_{i}=\operatorname{rg}(X_{i})-\operatorname{rg}(Y_{i})}\) is the difference between ranks.
  • \(n\) is the number of observations.

Identical values are usually each assigned fractional ranks equal to the average of their positions in the ascending order of the values, which is equivalent to averaging over all possible permutations.


R Implementation of Pearson’s Correlation Coefficient \(\rho_p\)

#' Pearson's Correlation Coefficient
#'
#' Computes Pearson's product-moment correlation coefficient between two variables.
#' This is a measure of linear correlation between two sets of data.
#'
#' @param X Numeric vector of values
#' @param Y Numeric vector of values (must be same length as X)
#' @return Pearson's correlation coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect positive correlation
#' rho_p(1:10, 1:10)  # Returns 1
#'
#' # Perfect negative correlation
#' rho_p(1:10, 10:1)  # Returns -1
#'
#' # No correlation
#' rho_p(1:10, sample(1:10))  # Returns near 0
#'
#' # With missing values
#' x <- c(1, 2, NA, 4, 5)
#' y <- c(2, 3, 4, NA, 6)
#' rho_p(x, y)  # Uses pairwise complete observations

rho_p <- function(X, Y) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  if (missing(Y)) stop("Input Y is required")
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths before removing NAs
  if (length(X) != length(Y)) stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) {
    warning("Insufficient complete cases (n < 2), returning NA")
    return(NA)
  }
  
  # Check for zero variance
  if (stats::sd(X) == 0 || stats::sd(Y) == 0) {
    warning("Zero variance detected, correlation undefined")
    return(NA)
  }
  
  # Compute Pearson's r using stats package functions
  cov_xy <- stats::cov(X, Y)
  sd_x <- stats::sd(X)
  sd_y <- stats::sd(Y)
  
  cov_xy / (sd_x * sd_y)
}

# Save function to file
dump('rho_p', file = 'rho_p.R') 

R Implementation of Spearman’s Rank Correlation Coefficient \(\rho_s\)

#' Spearman's Rank Correlation Coefficient
#'
#' Computes Spearman's rank correlation coefficient between two variables.
#' This is a nonparametric measure of rank correlation (monotonic relationship).
#'
#' @param X Numeric vector of values
#' @param Y Numeric vector of values (must be same length as X)
#' @return Spearman's rank correlation coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect positive correlation
#' rho_s(1:5, 1:5)  # Returns 1
#'
#' # Perfect negative correlation
#' rho_s(1:5, 5:1)  # Returns -1
#'
#' # No correlation
#' rho_s(1:5, c(3, 1, 4, 2, 5))  # Returns near 0
#'
#' # With tied ranks
#' rho_s(c(1, 2, 2, 3), c(4, 5, 5, 6))
#' 
rho_s <- function(X, Y) {
  # Input validation
  if (missing(X)) stop("Input X is required")
  if (missing(Y)) stop("Input Y is required")
  
  # Convert to numeric vectors
  X <- as.numeric(X)
  Y <- as.numeric(Y)
  
  # Check equal lengths
  if (length(X) != length(Y)) stop("X and Y must have the same length")
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  
  # Check minimum sample size
  n <- length(X)
  if (n < 2) {
    warning("Insufficient complete cases (n < 2), returning NA")
    return(NA)
  }
  
  # Convert to ranks (handling ties by averaging)
  rank_X <- rank(X)
  rank_Y <- rank(Y)
  
  # Calculate Spearman's rho using the simplified formula
  # (only exact when there are no tied ranks)
  d <- rank_X - rank_Y
  rho <- 1 - (6 * sum(d^2)) / (n * (n^2 - 1))
  
  # Adjust for tied ranks (if any)
  if (any(duplicated(rank_X)) | any(duplicated(rank_Y))) {
    warning("Tied ranks detected. Using Pearson's method on ranks instead.")
    rho <- stats::cor(rank_X, rank_Y, method = "pearson")
  }
  
  return(rho)
}

# Save function to file
dump('rho_s', file = 'rho_s.R')

BASIC USAGE EXAMPLES

1. Comparing Pearson and Spearman Correlation

# Generate sample data
set.seed(42)
X <- rnorm(100, mean = 0, sd = 1)
Y <- rnorm(100, mean = 4.5, sd = 1.5)

# View first few observations
head(cbind(X, Y))
##               X        Y
## [1,]  1.3709584 6.301448
## [2,] -0.5646982 6.067127
## [3,]  0.3631284 2.995187
## [4,]  0.6328626 7.272723
## [5,]  0.4042683 3.499840
## [6,] -0.1061245 4.658271
# Calculate correlations
rho_p(X, Y)  # Pearson correlation
## [1] 0.03127984
rho_s(X, Y)  # Spearman correlation
## [1] 0.05058506
COR(X, Y) # Pearson
## [1] 0.03127984
cor(X, Y, method = "pearson")  # Base R Pearson
## [1] 0.03127984
cor(X, Y, method = "spearman") # Base R Spearman
## [1] 0.05058506

2. Rank Transformation Equivalence

Demonstrate that Spearman’s rho equals Pearson’s on ranks

# Manual rank transformation
ranked_X <- rank(X)
ranked_Y <- rank(Y)

head(cbind(ranked_X, ranked_Y))
##      ranked_X ranked_Y
## [1,]       90       92
## [2,]       27       87
## [3,]       59       17
## [4,]       70       99
## [5,]       60       24
## [6,]       43       60
# Spearman’s rho_s is mathematically equivalent to Pearson’s correlation on ranks:

rho_p(ranked_X, ranked_Y)                     # Our Pearson on ranks
## [1] 0.05058506
rho_s(X, Y)                                   # Our Spearman on original data
## [1] 0.05058506
cor(ranked_X, ranked_Y, method = "pearson")   # Base R Pearson on ranks
## [1] 0.05058506
cor(X, Y, method = "spearman")                # Base R Spearman
## [1] 0.05058506

3. Example with Multivariate Normal Data

# Generate correlated data
if (!require(mvtnorm)) install.packages("mvtnorm")
library(mvtnorm)

# Create correlated data (population correlation ~ 2/3)
data <- rmvnorm(
  n = 2000, 
  mean = c(0, 0), 
  sigma = matrix(c(1.5, 1, 1, 1.5), ncol = 2)
)
X <- data[, 1]
Y <- data[, 2]
# Pearson correlation
rho_p(X, Y)
## [1] 0.6490746
cor(X, Y, method = "pearson")
## [1] 0.6490746
# Rank transform the data
rank_X <- rank(X)
rank_Y <- rank(Y)

# Spearman correlation (should match Pearson on ranks)
rho_s(X, Y)
## [1] 0.629446
rho_p(rank_X, rank_Y)
## [1] 0.629446
# Compare with base R
cor(X, Y, method = "spearman")
## [1] 0.629446
cor(rank_X, rank_Y, method = "pearson")
## [1] 0.629446
# Install and load packages
library(ggplot2)
library(ggside)

df <- data.frame(X, Y)

p <- ggplot(df, aes(X, Y)) +
  geom_point(alpha = 0.3, color = "blue", pch = 1) +
  geom_smooth(method = "loess", color = "red", se = FALSE, linewidth = 1) +
  geom_smooth(method = "lm", color = "darkgreen", se = FALSE, linetype = "dashed") +
  geom_rug(alpha = 0.2, sides = "bl", color = "blue") +   # rugs, same color as points
  labs(
    title = "Original Data with Loess (Red) vs Linear Fit (Green)",
    subtitle = paste(
      "Spearman ρ =", round(cor(X, Y, method = "spearman"), 3),
      " | Pearson r =", round(cor(X, Y, method = "pearson"), 3)
    )
  ) +
  theme_minimal()

# Transparent histograms + semi-transparent density lines
p +
  geom_xsidehistogram(
    aes(y = after_stat(density)),
    bins = 30, fill = NA, color = "darkgreen", alpha = 0.4
  ) +
  geom_xsidedensity(
    aes(y = after_stat(density)),
    color = "blue", alpha = 0.4, linewidth = 1
  ) +
  geom_ysidehistogram(
    aes(x = after_stat(density)),
    bins = 30, fill = NA, color = "darkgreen", alpha = 0.4
  ) +
  geom_ysidedensity(
    aes(x = after_stat(density)),
    color = "blue", alpha = 0.4, linewidth = 1
  ) +
  theme(
    ggside.panel.scale = 0.3,
    axis.text.x.top = element_blank(),
    axis.text.y.right = element_blank(),
    axis.ticks.x.top = element_blank(),
    axis.ticks.y.right = element_blank()
  )

Loess (Red Curve): Shows the true (potentially nonlinear) trend. If it deviates from the dashed green line (linear fit), the relationship is nonlinear.

Marginal Histograms: Check if X and Y are normally distributed (important for Pearson). If skewed or multimodal, Spearman is more reliable.

library(ggplot2)
library(ggExtra)

p_ranks <- ggplot(data.frame(rank_X, rank_Y), aes(rank_X, rank_Y)) +
  geom_point(alpha = 0.4, color = "firebrick", size = 1.8, shape = 16) +   # softer points
  geom_smooth(method = "lm", color = "black", se = FALSE, linewidth = 1) + # cleaner fit line
  labs(
    title = "Rank-Transformed Data",
    subtitle = paste("Spearman ρ =", round(cor(rank_X, rank_Y, method = "pearson"), 3))
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

# Marginals: transparent fill + outline
ggExtra::ggMarginal(
  p_ranks,
  type = "density",
  fill = "pink",
  color = "firebrick",
  alpha = 0.4
)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Black Line: The perfect linear relationship confirms Spearman is Pearson on ranks.

Marginal Densities: Should be uniform (since ranks are uniformly distributed).

Comparison of Pearson’s and Spearman’s Correlation Coefficients

  • Conceptual Differences:
    • Pearson’s Correlation (ρₚ)
      • Nature: Measures linear relationship between two continuous variables
      • Assumptions:
        • Variables should be normally distributed
        • Relationship should be linear
        • Homoscedasticity (constant variance)
      • Sensitivity: Affected by outliers due to its use of actual data values
    • Spearman’s Correlation (ρₛ)
      • Nature: Measures monotonic (not necessarily linear) relationships
      • Assumptions:
        • Works with ordinal, interval, or ratio data
        • No distributional assumptions
      • Robustness: Less affected by outliers as it uses rank orders

When to Use Each

  • Use Pearson When:
    • Both variables are continuous and normally distributed
    • The relationship appears linear on scatter plots
    • You specifically want to measure linear dependence
    • There are no significant outliers
  • Use Spearman When:
    • Variables are ordinal or not normally distributed
    • The relationship is monotonic but not necessarily linear
    • Data contains outliers that might distort Pearson’s correlation
    • You want a nonparametric measure of association
# Example demonstrating differences
set.seed(123)
x <- 1:100
y_linear <- x + rnorm(100, 0, 10)       # Linear relationship
y_monotonic <- x^2 + rnorm(100, 0, 100) # Nonlinear but monotonic
y_random <- rnorm(100)                  # No relationship

# Compare correlations
results <- data.frame(
  Relationship = c("Linear", "Monotonic", "Random"),
  Pearson = c(
    rho_p(x, y_linear),
    rho_p(x, y_monotonic),
    rho_p(x, y_random)
  ),
  Spearman = c(
    rho_s(x, y_linear),
    rho_s(x, y_monotonic),
    rho_s(x, y_random)
  )
)

print(results)
##   Relationship   Pearson  Spearman
## 1       Linear 0.9562453 0.9573237
## 2    Monotonic 0.9694407 0.9980798
## 3       Random 0.1177439 0.1600600

Key Observations from Output:

  • Linear Relationship:
    • Both coefficients show a strong, nearly identical correlation (≈ 0.96)
  • Monotonic Relationship:
    • Spearman’s is essentially 1, capturing the monotonic trend exactly
    • Pearson’s is somewhat lower because the pattern is nonlinear
  • No Relationship:
    • Both coefficients are near zero

Conclusion

Choose Pearson’s for linear relationships with normal data, Spearman’s for monotonic relationships or when assumptions of Pearson’s are violated. Our implementation provides robust versions of both with proper handling of edge cases (NA values, ties, etc.). The examples demonstrate how they complement each other in different analytical scenarios.


2.4.2.2 The Kendall rank correlation coefficient

The Kendall rank correlation coefficient, commonly referred to as Kendall’s τ coefficient, is a statistic used to measure the ordinal association between two measured quantities. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities.

The Kendall τ coefficient is defined as:

\[ \tau = \frac {(\text{number of concordant pairs})-(\text{number of discordant pairs})}{\displaystyle {n\choose 2}} \] An explicit expression for Kendall’s rank coefficient is

\[ \tau ={\frac {2}{n(n-1)}}\sum _{i<j}\operatorname{sgn}(x_{i}-x_{j})\operatorname{sgn}(y_{i}-y_{j}). \] where \(\displaystyle{{n \choose 2} = \frac{n(n-1)}{2}}\) is the binomial coefficient for the number of ways to choose two items from \(n\) items.

Both Kendall’s τ and Spearman’s ρ can be formulated as special cases of a more general correlation coefficient.

Comparison of Kendall’s τ, Spearman’s ρ, and Pearson’s r Correlation Coefficients

Overview of Correlation Measures

  • Pearson’s r (ρₚ)
    • Type: Parametric linear correlation
    • Measures: Linear relationship between continuous variables
    • Range: [-1, 1]
    • Formula: r = cov(X,Y)/(σₓσᵧ)
  • Spearman’s ρ (ρₛ)
    • Type: Non-parametric rank correlation
    • Measures: Monotonic relationships
    • Range: [-1, 1]
    • Formula: Pearson correlation applied to ranked data
  • Kendall’s τ
    • Type: Non-parametric rank correlation
    • Measures: Ordinal association based on concordant/discordant pairs
    • Range: [-1, 1]
    • Formula: τ = (number of concordant pairs - number of discordant pairs) / (n choose 2)

Key Differences Between Correlation Coefficients

| Characteristic | Pearson’s r | Spearman’s ρ | Kendall’s τ |
|---|---|---|---|
| Correlation Type | Linear | Monotonic | Ordinal |
| Data Requirements | Interval/ratio, normally distributed | Ordinal/interval, no distribution assumptions | Ordinal/interval, no distribution assumptions |
| Sensitivity | Highly sensitive to outliers | Moderately robust to outliers | Most robust to outliers |
| Interpretation | Linear relationship strength | Monotonic relationship strength | Probability that pairs are concordant |
| Computation | O(n) | O(n log n) (due to ranking) | O(n²) (pairwise comparisons) |
| Best Use Case | Linear relationships with normal data | Nonlinear monotonic relationships with continuous/ordinal data | Small datasets, ordinal data, many ties expected |
| Statistical Test | t-test | Rank-based test | Permutation test |

R Implementation of Kendall’s Rank Correlation Coefficient

#' Compute Kendall's Rank Correlation Coefficient (τ)
#'
#' Calculates Kendall's tau-b rank correlation coefficient, which measures
#' the ordinal association between two variables. This version handles ties
#' and includes proper input validation.
#'
#' @param X Numeric vector of first variable
#' @param Y Numeric vector of second variable (must be same length as X)
#' @return Kendall's tau-b coefficient between -1 and 1
#' @export
#' @examples
#' # Perfect agreement
#' tau(1:10, 1:10)  # returns 1
#' 
#' # Perfect disagreement
#' tau(1:10, 10:1)  # returns -1
#' 
#' # With ties
#' tau(c(1,2,2,3), c(1,3,2,4))  # handles tied ranks

tau <- function(X, Y) {
  # Input validation
  if (missing(X) || missing(Y)) stop("Both X and Y must be provided")
  if (length(X) != length(Y)) stop("X and Y must have equal length")
  if (length(X) < 2) {
    warning("Insufficient observations (need at least 2)")
    return(NA)
  }
  
  # Remove NA values pairwise
  complete <- complete.cases(X, Y)
  X <- X[complete]
  Y <- Y[complete]
  n <- length(X)
  if (n < 2) {
    warning("Insufficient complete cases (need at least 2)")
    return(NA)
  }
  
  # Initialize counts
  concordant <- 0
  discordant <- 0
  tied_x <- 0
  tied_y <- 0
  tied_xy <- 0
  
  # Compare all unique pairs
  for (i in 1:(n-1)) {
    for (j in (i+1):n) {
      x_diff <- X[i] - X[j]
      y_diff <- Y[i] - Y[j]
      
      if (x_diff == 0 && y_diff == 0) {
        tied_xy <- tied_xy + 1
      } else if (x_diff == 0) {
        tied_x <- tied_x + 1
      } else if (y_diff == 0) {
        tied_y <- tied_y + 1
      } else if (x_diff * y_diff > 0) {
        concordant <- concordant + 1
      } else {
        discordant <- discordant + 1
      }
    }
  }
  
  # Calculate Kendall's tau-b (adjusts for ties): pairs tied on X, including
  # pairs tied on both variables, are removed from the X term of the
  # denominator, and likewise for Y
  total_pairs <- n * (n - 1) / 2
  (concordant - discordant) /
    sqrt((total_pairs - tied_x - tied_xy) * (total_pairs - tied_y - tied_xy))
}

# Save function
dump('tau', file = 'tau.R') 

BASIC USAGE EXAMPLES

#' Basic Example of Kendall's Tau Calculation
#' 
#' Demonstrates computation of Kendall's rank correlation coefficient τ
#' using simulated bivariate normal data with known correlation structure.
#' Compares our implementation with R's built-in cor() function.

# Load required package (install if needed) --------------------------------
if (!require(mvtnorm)) {
  install.packages("mvtnorm")
  library(mvtnorm)
}

set.seed(123)  # For reproducibility

# Generate correlated data ------------------------------------------------
# Covariance matrix with population correlation = 1/1.5 ≈ 0.67
sigma_matrix <- matrix(c(1.5, 1, 
                         1, 1.5), 
                       ncol = 2)

# Generate 2000 observations from bivariate normal distribution
data <- rmvnorm(n = 2000, 
                mean = c(0, 0), 
                sigma = sigma_matrix)

# Data preparation -------------------------------------------------------
# Convert to ranks using average method for ties
X <- rank(data[, 1], ties.method = "average")
Y <- rank(data[, 2], ties.method = "average")

# Correlation computation ------------------------------------------------
# Using our custom tau function
custom_tau <- tau(X, Y)

# Using R's built-in function for validation
builtin_tau <- cor(X, Y, method = "kendall")

# Results output ---------------------------------------------------------
cat("\n=== Correlation Results ===\n")
## 
## === Correlation Results ===
cat("Custom tau implementation:", round(custom_tau, 4), "\n")
## Custom tau implementation: 0.4503
cat("R's built-in Kendall's tau:", round(builtin_tau, 4), "\n")
## R's built-in Kendall's tau: 0.4503
cat("population Pearson correlation:", round(1/1.5, 4), "\n")
## population Pearson correlation: 0.6667
cat("Difference between implementations:", 
    round(abs(custom_tau - builtin_tau), 6), "\n\n")
## Difference between implementations: 0
# Visualization ----------------------------------------------------------
plot(X, Y, 
     main = paste("Rank Plot: Kendall's τ =", round(custom_tau, 3)),
     xlab = "Rank of Variable 1", 
     ylab = "Rank of Variable 2",
     pch = 19, 
     col = rgb(0, 0.4, 0.8, 0.3),
     cex = 0.7)

# Add reference line for perfect agreement
abline(a = 0, b = 1, col = "red", lty = 2)


Interpreting the Kendall’s τ Plot & Results

  • X-axis: Rank-transformed values of variable X (from 1 to n)
  • Y-axis: Rank-transformed values of variable Y (from 1 to n)
  • Point cloud pattern:
    • If points cluster along a rising diagonal (bottom-left to top-right): Positive monotonic relationship (τ close to +1)
    • If points cluster along a falling diagonal (top-left to bottom-right): Negative monotonic relationship (τ close to -1)
    • If points form a random cloud: No monotonic relationship (τ ≈ 0)

In our plot (from the example code), the points show a clear upward trend (but not perfectly linear), indicating a positive monotonic association between X and Y.


3 Hypothesis Testing

A statistical hypothesis is a testable claim about a population parameter, evaluated using observed data modeled as realizations of random variables. Hypothesis testing is a method of statistical inference that compares two competing hypotheses:

  • Null Hypothesis (H₀): Represents a default position (e.g., “no effect” or “no difference”).
  • Alternative Hypothesis (H₁ or Hₐ): Proposes an alternative distribution or effect, either explicitly or informally.

Key Concepts:

Statistical Significance: The result is deemed statistically significant if the observed data is highly unlikely under H₀, based on a pre-specified significance level (α) (e.g., 0.05). This threshold represents the maximum allowable probability of a false positive (Type I error: incorrectly rejecting a true H₀).

Test Statistic: A pre-chosen metric (e.g., t-statistic, χ²) quantifies the deviation of the data from H₀.

Decision Rule: Reject H₀ if the test statistic exceeds a critical value (or if the p-value < α).

The goal is to control the risk of Type I errors while balancing power (ability to correctly reject a false H₀).
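A minimal sketch of this decision rule in R (the data are simulated, so the numbers are purely illustrative): compute a p-value with t.test() and compare it against a pre-specified α.

# Decision rule sketch: reject H0 when the p-value falls below alpha
set.seed(1)
alpha <- 0.05
x <- rnorm(30, mean = 0.4, sd = 1)   # illustrative sample drawn under a true effect
test <- t.test(x, mu = 0)            # H0: population mean equals 0
if (test$p.value < alpha) {
  cat("Reject H0 (p =", round(test$p.value, 4), ")\n")
} else {
  cat("Fail to reject H0 (p =", round(test$p.value, 4), ")\n")
}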


3.1 The p-Value in Hypothesis Testing

The p-value quantifies the statistical significance of evidence against the null hypothesis (H₀) based on the observed value of a test statistic T.

Key Concepts:

  • Role of the p-value:
    • Measures how compatible the observed data is with H₀.
    • Defined as the probability of obtaining a test statistic at least as extreme as the observed value, assuming H₀ is true.
    • A small p-value (typically < 0.05) suggests the data is unlikely under H₀, leading to its rejection.
  • Null Hypothesis Testing as Reductio ad Absurdum:
    • Adapted from logical reasoning: H₀ is assumed true until evidence shows it is highly implausible.
    • Statistical significance occurs when the observed result is sufficiently improbable under H₀, justifying its rejection.
  • What Rejection of H₀ Implies:
    • The true hypothesis lies in the logical complement of H₀ (i.e., “not H₀”).
    • However, rejecting H₀ does not specify which alternative hypothesis (H₁) is correct—only that H₀ is unlikely.
    • Without explicit alternatives, the test cannot determine the most plausible explanation.
  • Limitations:
    • A significant result does not prove H₁; it merely suggests H₀ is inadequate.
    • The p-value does not measure the magnitude or importance of an effect, only its statistical rarity under H₀.
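To make the definition concrete, here is a small sketch computing a two-sided p-value directly from an observed test statistic (a z-statistic assumed to be standard normal under H₀):

# Two-sided p-value for an observed z-statistic under H0: Z ~ N(0, 1)
z_obs <- 2.1
p_value <- 2 * (1 - pnorm(abs(z_obs)))   # P(|Z| >= |z_obs|) under H0
round(p_value, 4)                        # approximately 0.0357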

3.2 Type I Error (α) – False Positive

A Type I error occurs when a true null hypothesis (\(H_0\)) is incorrectly rejected.

  • Symbolized by α (alpha), the significance level of the test.
  • Interpretation: The probability of detecting an effect that does not exist (e.g., concluding a drug works when it doesn’t).
  • Controlled by researchers by setting a threshold (e.g., α = 0.05 for a 5% risk).

3.3 Type II Error (β) – False Negative

A Type II error occurs when a false null hypothesis (\(H_0\)) is not rejected.

  • Symbolized by β (beta).
  • Interpretation: The probability of failing to detect a true effect (e.g., missing a drug’s efficacy).
  • Power (\(1 - β\)): The probability of correctly rejecting \(H_0\) when it is false. Researchers aim to maximize power.
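Both error rates can be checked by simulation. The sketch below (illustrative values only) estimates the Type I error rate by applying t.test() to many samples generated under a true H₀; the rejection proportion should land near α = 0.05.

# Monte Carlo check of the Type I error rate (data generated under a true H0)
set.seed(42)
alpha <- 0.05
rejections <- replicate(5000, {
  x <- rnorm(25, mean = 0, sd = 1)        # H0: mu = 0 is true by construction
  t.test(x, mu = 0)$p.value < alpha
})
mean(rejections)                          # should be close to 0.05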

3.4 Power of a Test

The power of a hypothesis test is the probability of correctly rejecting the null hypothesis (H₀) when a specific alternative hypothesis (H₁) is true. It quantifies the test’s ability to detect an effect if it exists.

Key Properties of Power:

  • Range: \(0 \le \pi \le 1\) (where \(\pi\) denotes power).
  • Relationship to Type II Error (β): \(\text{Power} = 1 - \beta\)
  • Higher power reduces the probability of a Type II error (failing to reject a false H₀).

Conventional Standards:

Researchers often aim for \(π≥0.80\), implying a \(4:1\) trade-off between β (Type II error, typically \(β=0.20\)) and α (Type I error, typically \(α=0.05\)).

Factors Affecting Power:

  • Sample Size (n):
    • Larger samples reduce sampling error, making it easier to detect true effects.
    • Power analysis determines the required n to achieve a desired power.
  • Effect Size: Larger effects are easier to detect (higher power).
  • Measurement Precision: Reducing measurement error (e.g., via better tools or protocols) improves power.
  • Significance Level (α): A stricter α (e.g., 0.01 vs. 0.05) lowers power, increasing β.

When to Use Power Analysis:

  • When the goal is to correctly reject a false H₀ (e.g., clinical trials).
  • Not all studies prioritize power; some focus on estimating effect sizes precisely.
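Base R's power.t.test() makes these trade-offs tangible. The sketch below (assumed settings: a two-sample, two-sided test detecting a difference of 0.5 standard deviations) solves for the per-group sample size that achieves 80% power at α = 0.05.

# Required sample size per group for a two-sample t-test
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# n comes out to roughly 64 per group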

3.5 Confidence Intervals

Confidence intervals (CIs) provide a range of plausible values for population parameters (e.g., mean, proportion) based on sample data, quantifying estimation uncertainty.

  • A 100(1-α)% CI (e.g., 95% CI) indicates that in repeated sampling, 100(1-α)% of intervals would contain the true parameter
  • Important: The confidence level refers to the long-run performance of the method, not the probability that any particular interval contains the parameter
  • Components:
    • Point estimate (e.g., sample mean)
    • Margin of error (function of variability and sample size)
    • Confidence level (typically 90%, 95%, or 99%)
  • Interpretation:
    • Correct: “We’re 95% confident this interval contains the true parameter”
    • Incorrect: “There’s a 95% probability the parameter is in this interval”

Factors Affecting CI Width

Factor Effect on Width Mathematical Relationship
Sample size (n) Decreases as n↑ Width ∝ 1/√n
Variability (σ) Increases as σ↑ Width ∝ σ
Confidence level Increases as CL↑ zα/2 increases

For normally distributed data with known σ:

\[ CI = \bar{x} \pm z_{\alpha/2} \times \left(\frac{\sigma}{\sqrt{n}}\right) \]

The 95% confidence interval is calculated as: \[ \text{CI} = \bar{x} \pm z_{0.025} \times \left(\frac{\sigma}{\sqrt{n}}\right), \] where \(z_{0.025} \approx 1.96\) for a normal distribution.
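Before the full simulation below, a direct computation of this interval for a single simulated sample may help fix ideas (a sketch with an assumed known σ, matching the formula above):

# 95% CI for the mean with known sigma (matches the z-based formula above)
set.seed(7)
sigma <- 2
x <- rnorm(50, mean = 10, sd = sigma)
x_bar <- mean(x)
z <- qnorm(0.975)                                     # approximately 1.96
x_bar + c(-1, 1) * z * sigma / sqrt(length(x))        # lower and upper bounds
# When sigma is unknown, replace z and sigma with qt(0.975, length(x) - 1) and sd(x).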


R Implementation of Confidence Intervals

#' Confidence Interval Simulation for Sample Means
#' 
#' Simulates multiple samples and plots their confidence intervals to demonstrate
#' the concept of coverage probability.
#'
#' @param mu Population mean (default = 0)
#' @param s Population standard deviation
#' @param alpha Significance level (default = 0.05)
#' @param size Sample size for each simulation (default = 1000)
#' @param n Number of samples to simulate (default = 1000)
#' @param seed Random seed (default = -42)
#' @param plot Whether to generate the plot (default = TRUE)
#' 
#' @return A data frame containing the lower/upper CIs and means for each sample
#' @export

ci.sim <-
  function(mu=0, s, alpha=.05, size=1000, n=1000, seed=-42, plot=TRUE){
    
    source("t_cdf.R")
    source("mu.R")
    source("s2.R")
    
    set.seed(seed)
    
    if(plot){
      plot.new()
      xmin <- mu - 5.5*s/sqrt(size)
      xmax <- mu + 5.5*s/sqrt(size)
      plot.window(xlim=c(xmin, xmax), ylim = c(0, n*n/900))
      par(mar=c(3, 5.5, 4, 0), mgp=c(2, 0, 0), las=1) # plot margins
      title(main=bquote(atop('Confidence Interval for the Sample Mean', 
                             '(sample size = '*.(size)*')')), 
            ylab = bquote(atop('Samples', '(number of samples = '*.(n)*')')), 
            xlab = expression(mu), 
            cex.main=1.5, cex.lab=1.2)
      axis(1)
      axis(2)
      
      # population mean
      abline(v=mu, lty=1, col='blue', lwd=1)
    }
    
    covered <- 0 # fraction of intervals covering the mean
    
    data <- data.frame(row.names=(paste('sample', 1:n, ' ')))
    
    for(i in 0:(n - 1)){
      x <- rnorm(n=size, mean=mu, sd=s)
      df <- size - 1
      sem <- sqrt(s2(x)/size) # standard error of the mean
      x_bar <- mu(x)
      t <- (x_bar - mu)/sem
      
      p <- 2*(1-Re(t_cdf(abs(t), df)))
      a <- 1-alpha/2
      ebm <- qt(a, df)*sem # error bound for a population mean
      ci.lower <- x_bar - ebm
      ci.upper <- x_bar + ebm
      
      covered <- covered + (ci.lower < mu & mu < ci.upper)
      
      data[i+1, 'ci.lower'] <- round(ci.lower, digits=3)
      data[i+1, 'ci.upper'] <- round(ci.upper, digits=3)
      data[i+1, 'mean'] <- x_bar
      data[i+1, ' '] <- ifelse(mu >= ci.lower && mu <= ci.upper, '', '*')
      
      if(plot){
        # CI and mean color
        if(mu >= ci.lower && mu <= ci.upper){
          col1=adjustcolor("green3", alpha.f=0.7)
          col2='darkgreen'
        }else{
          col1=adjustcolor("tomato", alpha.f=0.7)
          col2='tomato4'
        }
        
        # sample mean and ci
        lines(c(ci.lower, ci.upper), c(n*i/900, n*i/900), 
              col=col1, type='l', lwd=400/n, lend=1)
        points(x_bar, n*i/900, pch=18, cex=70/n, col=col2) 
        points(x_bar, n*i/900, pch=18, cex=40/n, col='white') 
        points(x_bar, n*i/900, pch=18, cex=20/n, col=col2) 
      }
    }
    
# Report coverage with all key parameters
    cat(sprintf('Coverage results (%d simulations, n = %d, α = %.2f): %.1f%% CIs contain μ = %.1f\n\n',
                n, size, alpha, 100*covered/n, mu))
    
    return(data)
  }

dump("ci.sim", file="ci.sim.R")
#' Student's t-Distribution Cumulative Distribution Function
#'
#' Computes the cumulative probability up to x for a Student's t-distribution
#' with v degrees of freedom using the hypergeometric function.
#'
#' @param x Numeric vector of quantiles
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @return Numeric vector of cumulative probabilities
#' @export
#' @examples
#' t_cdf(1.96, 10)  # Approximately 0.96; approaches 0.975 as df grows
#' t_cdf(0, 5)      # Should be 0.5

# Check and install hypergeo package if needed
if (!requireNamespace("hypergeo", quietly = TRUE)) {
  message("Installing required package 'hypergeo'...")
  tryCatch(
    {
      utils::install.packages("hypergeo")
      library(hypergeo, quietly = TRUE)
    },
    error = function(e) {
      stop("Failed to install/load 'hypergeo' package. Error: ", e$message)
    }
  )
}

t_cdf <- function(x, v) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Load required package quietly
  requireNamespace("hypergeo", quietly = TRUE)
  
  # Calculate CDF components
  numerator <- x * gamma((v + 1)/2) * 
    Re(hypergeo::hypergeo(0.5, (v + 1)/2, 1.5, -(x^2)/v))
  denominator <- sqrt(pi * v) * gamma(v/2)
  
  # Return CDF value
  0.5 + numerator/denominator
}

# Save function to file
dump("t_cdf", file = "t_cdf.R")

BASIC USAGE EXAMPLES

Varying the Number of Simulated Samples (n)

ci.sim(mu=4.5, s=1.5, n=15)      # Few simulated samples (coverage estimate is noisy)

## Coverage results (15 simulations, n = 1000, α = 0.05): 80.0% CIs contain μ = 4.5
##             ci.lower ci.upper     mean  
## sample 1       4.343    4.524 4.433831  
## sample 2       4.398    4.586 4.492174  
## sample 3       4.509    4.691 4.599734 *
## sample 4       4.443    4.624 4.533356  
## sample 5       4.406    4.591 4.498532  
## sample 6       4.429    4.617 4.522888  
## sample 7       4.441    4.629 4.535281  
## sample 8       4.294    4.480 4.387117 *
## sample 9       4.479    4.669 4.574234  
## sample 10      4.445    4.633 4.539142  
## sample 11      4.429    4.619 4.524094  
## sample 12      4.313    4.499 4.406290 *
## sample 13      4.448    4.633 4.540126  
## sample 14      4.373    4.564 4.468375  
## sample 15      4.422    4.598 4.510191
ci <- ci.sim(mu=4.5, s=1.5, n=30) # More simulated samples (coverage estimate steadier)

## Coverage results (30 simulations, n = 1000, α = 0.05): 86.7% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=60) # Coverage estimate approaching the nominal level

## Coverage results (60 simulations, n = 1000, α = 0.05): 90.0% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=120) # Many simulated samples (coverage estimate stabilizing)

## Coverage results (120 simulations, n = 1000, α = 0.05): 94.2% CIs contain μ = 4.5
ci <- ci.sim(mu=4.5, s=1.5, n=200) # Coverage estimate close to the nominal 95%

## Coverage results (200 simulations, n = 1000, α = 0.05): 95.0% CIs contain μ = 4.5
  • Here n is the number of simulated samples; each sample still contains size = 1000 observations, so the width of the individual CIs is essentially unchanged across these runs.
  • With few samples (e.g., n=15), the observed coverage is a noisy estimate of the true coverage probability (a single missed interval shifts it by several percentage points).
  • As n grows, the observed coverage converges to the nominal 1-alpha level (95% by default).

Varying Sample Sizes (size)

head(ci.sim(mu=0, s=1, n=60, size=100))     # Small samples (size = 100 observations each)

## Coverage results (60 simulations, n = 100, α = 0.05): 96.7% CIs contain μ = 0.0
##            ci.lower ci.upper        mean  
## sample 1     -0.309    0.081 -0.11360393  
## sample 2     -0.185    0.180 -0.00264514  
## sample 3     -0.129    0.266  0.06854359  
## sample 4     -0.215    0.133 -0.04095258  
## sample 5     -0.308    0.116 -0.09618599  
## sample 6     -0.277    0.112 -0.08294606
head(ci.sim(mu=0, s=1, n=60, size=1000))    # Moderate samples (size = 1000 observations each)

## Coverage results (60 simulations, n = 1000, α = 0.05): 90.0% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.105    0.016 -0.044112342  
## sample 2     -0.068    0.057 -0.005217242  
## sample 3      0.006    0.127  0.066489626 *
## sample 4     -0.038    0.083  0.022237039  
## sample 5     -0.063    0.061 -0.000978905  
## sample 6     -0.047    0.078  0.015258800
head(ci.sim(mu=0, s=1, n=60, size=10000))   # Large samples (size = 10000 observations each)

## Coverage results (60 simulations, n = 10000, α = 0.05): 98.3% CIs contain μ = 0.0
##            ci.lower ci.upper          mean  
## sample 1     -0.012    0.027  7.752593e-03  
## sample 2     -0.016    0.023  3.402390e-03  
## sample 3     -0.010    0.030  1.005981e-02  
## sample 4     -0.009    0.031  1.121849e-02  
## sample 5     -0.032    0.008 -1.200495e-02  
## sample 6     -0.020    0.020 -6.737494e-05
head(ci.sim(mu=0, s=1, n=60, size=100000))  # Very large samples (size = 100000 observations each)

## Coverage results (60 simulations, n = 100000, α = 0.05): 93.3% CIs contain μ = 0.0
##            ci.lower ci.upper          mean  
## sample 1     -0.004    0.009  0.0026662204  
## sample 2     -0.007    0.006 -0.0005348168  
## sample 3     -0.007    0.006 -0.0006486533  
## sample 4     -0.006    0.006  0.0002484055  
## sample 5     -0.007    0.006 -0.0003223348  
## sample 6     -0.007    0.006 -0.0006963664
  • size controls the number of observations per sample (not to be confused with n, the number of simulated samples).
  • Larger size reduces the standard error of each sample mean, producing narrower CIs.
  • Observed coverage stays near the nominal 95% at every size; what changes is the CI width, which shrinks roughly in proportion to 1/√size.

Varying Confidence Levels (alpha)

# 90% Confidence Level (alpha = 0.10)
head(ci.sim(mu=0, s=1, alpha=0.10, n=5000, plot=FALSE)) # Expected coverage: ~90%
## Coverage results (5000 simulations, n = 1000, α = 0.10): 90.5% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.095    0.007 -0.044112342  
## sample 2     -0.058    0.047 -0.005217242  
## sample 3      0.016    0.117  0.066489626 *
## sample 4     -0.029    0.073  0.022237039  
## sample 5     -0.053    0.051 -0.000978905  
## sample 6     -0.037    0.068  0.015258800
# 95% Confidence Level (alpha = 0.05, DEFAULT)
head(ci.sim(mu=0, s=1, alpha=0.05, n=5000, plot=FALSE)) # Expected coverage: ~95%
## Coverage results (5000 simulations, n = 1000, α = 0.05): 95.3% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.105    0.016 -0.044112342  
## sample 2     -0.068    0.057 -0.005217242  
## sample 3      0.006    0.127  0.066489626 *
## sample 4     -0.038    0.083  0.022237039  
## sample 5     -0.063    0.061 -0.000978905  
## sample 6     -0.047    0.078  0.015258800
# 99% Confidence Level (alpha = 0.01)
head(ci.sim(mu=0, s=1, alpha=0.01, n=5000, plot=FALSE)) # Expected coverage: ~99%
## Coverage results (5000 simulations, n = 1000, α = 0.01): 99.2% CIs contain μ = 0.0
##            ci.lower ci.upper         mean  
## sample 1     -0.124    0.035 -0.044112342  
## sample 2     -0.088    0.077 -0.005217242  
## sample 3     -0.013    0.146  0.066489626  
## sample 4     -0.057    0.102  0.022237039  
## sample 5     -0.082    0.080 -0.000978905  
## sample 6     -0.067    0.098  0.015258800
  • Lower alpha = Higher confidence level = Wider CIs (more conservative).
  • Coverage probability should match 1-alpha (e.g., 95% CIs should contain mu ~95% of the time).
  • n=5000 ensures stable estimates of coverage probability (law of large numbers).

4 Bayesian Analysis

Bayesian inference is a method of statistical inference in which we update our beliefs about a parameter or hypothesis as we obtain new data. It’s based on Bayes’ Theorem, which relates the prior belief, the likelihood of the observed data, and the resulting posterior belief.

The fundamental equation is:

\[ P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)} \]

where:

  • \(P(\theta | D)\): Posterior distribution
  • \(P(D | \theta)\): Likelihood
  • \(P(\theta)\): Prior distribution
  • \(P(D)\): Marginal likelihood (evidence)

Key Components

Component Description
Prior What you believe about the parameter before seeing any data
Likelihood The model: how probable the data is for different values of the parameter
Posterior Updated beliefs about the parameter after seeing the data
Evidence A normalizing constant: makes sure the posterior is a valid probability

Key Steps in Bayesian Inference

(1) Define the Prior Distribution

  • Represents initial beliefs about parameters before observing data.
  • Can be informative (based on domain knowledge) or uninformative (weakly informative, e.g., flat priors).

Example Priors:

  • Normal distribution for a mean parameter: \(\mu ∼ N(0,1)\)
  • Beta distribution for a probability: \(p ∼ Beta(2,2)\)

(2) Define the Likelihood Function

  • Describes how the observed data is generated given the parameters.

Example Likelihoods:

  • Gaussian likelihood: \(y \sim N(\mu, \sigma^2)\)
  • Binomial likelihood: \(k \sim \mathrm{Binomial}(n, p)\)

(3) Compute the Posterior Distribution

  • Combines prior and likelihood using Bayes’ theorem.
  • Often requires Markov Chain Monte Carlo (MCMC) or Variational Inference for computation.

(4) Posterior Analysis & Interpretation

  • Summarize posterior (mean, credible intervals).
  • Check convergence (R-hat, trace plots).
  • Perform posterior predictive checks.

Advantages of Bayesian Inference

  • Incorporates prior knowledge (useful for small datasets).
  • Provides intuitive probabilistic results (credible intervals, not p-values).
  • Handles complex models (hierarchical models, missing data).

Challenges

  • Computationally intensive (MCMC can be slow for large models).
  • Requires careful prior specification (bad priors → misleading results).

Example

Let’s say we want to estimate the bias \(\theta\) of a coin (i.e., probability of heads):

  • Prior: Assume \(\theta ∼ Beta(2,2)\) (a moderate belief the coin is fair)
  • Data: Flip the coin 10 times, get 7 heads, 3 tails
  • Likelihood: \(P(x \mid \theta) \propto \theta^7 (1-\theta)^3\)
  • Posterior: \(\theta∣x ∼ Beta(9,5)\)

So the updated belief, Beta(9,5), is concentrated around 0.64-0.67 (posterior mean 9/14 ≈ 0.64, posterior mode 2/3 ≈ 0.67), suggesting the coin is likely biased toward heads.
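This conjugate update takes only a few lines of R. The sketch below reproduces the numbers in the example (prior Beta(2,2), 7 heads out of 10 flips) and summarizes the resulting Beta(9,5) posterior:

# Beta-Binomial conjugate update for the coin example
a_prior <- 2; b_prior <- 2                 # prior Beta(2, 2)
heads <- 7; tails <- 3                     # observed data
a_post <- a_prior + heads                  # 9
b_post <- b_prior + tails                  # 5

a_post / (a_post + b_post)                 # posterior mean: 9/14, about 0.64
qbeta(c(0.025, 0.975), a_post, b_post)     # equal-tailed 95% credible interval

curve(dbeta(x, a_post, b_post), from = 0, to = 1,
      xlab = expression(theta), ylab = "Density",
      main = "Prior Beta(2, 2) vs Posterior Beta(9, 5)")
curve(dbeta(x, a_prior, b_prior), add = TRUE, lty = 2)   # prior for comparison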


5 Continuous Probability Distributions

Continuous probability distributions model random variables that can assume any value within an interval. Unlike discrete distributions (which describe countable outcomes), continuous distributions characterize measurable phenomena where outcomes form a continuum - such as time, distance, or physical measurements.

Key Properties

  • Uncountable Outcomes: Infinite possible values within ranges
  • Probability Density Function (PDF):
  • Defines relative likelihood via integration: \[P(a \leq X \leq b) = \int_a^b f(x)dx\]
  • Cumulative Distribution (CDF): \[F(x) = P(X \leq x) = \int_{-\infty}^x f(t)dt\]
  • Zero Point Probability: \(P(X = c) = 0\) for exact values

Continuous distributions are fundamental in:

  • Statistical inference (confidence intervals, hypothesis tests).
  • Machine learning (e.g., Gaussian processes, Bayesian methods).
  • Real-world modeling (finance, physics, engineering).

In the following sections, we’ll explore each major distribution in detail, including their PDFs, properties, and practical applications.

Common Continuous Probability Distributions

Distribution Support Key Parameters Key Applications
Normal (-∞, ∞) μ (mean), σ (std dev) Natural phenomena, CLT, inference
t (Student’s) (-∞, ∞) ν (degrees of freedom) Small-sample hypothesis testing
Exponential [0, ∞) λ (rate) Time between events, survival analysis
Beta [0, 1] α, β (shape) Modeling probabilities, Bayesian stats
Gamma (0, ∞) k (shape), θ (scale) Waiting times, reliability engineering

Key Features:

  • Normal: Symmetric, bell-shaped, follows the 68-95-99.7 rule
  • t: Heavier tails than the normal, converges to N(0,1) as ν→∞
  • Exponential: Memoryless property, P(X > s+t | X > s) = P(X > t) (checked numerically after this list)
  • Beta: Flexible shapes on the [0,1] interval
  • Gamma: Generalizes the exponential and chi-squared distributions
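A quick numerical check of the memoryless property (a sketch with an arbitrary rate λ = 2 and arbitrary s and t):

# Memoryless property of the exponential: P(X > s + t | X > s) = P(X > t)
lambda <- 2; s <- 0.7; t <- 1.3
lhs <- pexp(s + t, rate = lambda, lower.tail = FALSE) /
       pexp(s,     rate = lambda, lower.tail = FALSE)    # conditional survival probability
rhs <- pexp(t, rate = lambda, lower.tail = FALSE)
all.equal(lhs, rhs)                                      # TRUE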

Each will be explored with their:

  • Probability density functions
  • Key parameters
  • Practical applications
  • Relationships to other distributions

5.1 Probability Density Function (PDF)

Defines the relative likelihood of different outcomes. Since continuous variables have infinitely many possible values, the probability of any exact value is technically zero. Instead, probabilities are calculated over intervals using integration:

  • Denoted as f(x)
  • Describes relative likelihood of different outcomes
  • For any single point: P(X = x) = 0
  • Probabilities are calculated over intervals:

\[ P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx \] where

\[ f(x) \geq 0 \text{ and } \int_{-\infty}^{\infty} f(x) \, dx = 1. \]

Properties:

  • \(f(x) ≥ 0\) for all x
  • Total area under curve = 1
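Both properties can be verified numerically for any density; a small sketch with the standard normal:

# Interval probability via the PDF, compared with the CDF
integrate(dnorm, lower = -1, upper = 1)$value   # about 0.6827
pnorm(1) - pnorm(-1)                            # same probability via the CDF

# Total area under the density equals 1
integrate(dnorm, lower = -Inf, upper = Inf)$value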

5.2 Cumulative Distribution Function (CDF)

Gives the probability that a random variable \(X\) is less than or equal to a value \(x\):

\[ F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) \, dt \]

Support: The range of possible values (e.g., \((-\infty,\infty)\) for the normal distribution, \((0,\infty)\) for the exponential distribution).

Moments – Summarize key properties:

  • Mean \((\mu)\): Expected value.
  • Variance \((\sigma^2)\): Measure of spread.
  • Skewness & Kurtosis: Describe asymmetry and tail behavior.
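The first two moments can likewise be obtained by integrating against the density. A small sketch using the exponential distribution with rate λ = 2 (where the true mean is 1/λ = 0.5 and the true variance is 1/λ² = 0.25):

# Mean and variance of Exponential(rate = 2) by numerical integration
lambda <- 2
ex  <- integrate(function(x) x   * dexp(x, rate = lambda), 0, Inf)$value   # E[X]   = 1/lambda
ex2 <- integrate(function(x) x^2 * dexp(x, rate = lambda), 0, Inf)$value   # E[X^2] = 2/lambda^2
c(mean = ex, variance = ex2 - ex^2)                                        # about 0.5 and 0.25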

Plots of CDF vs PDF of Standard Normal Distribution

# Load required package
if (!requireNamespace("shape", quietly = TRUE)) {
  install.packages("shape")
}
library(shape)

# Define parameters
x_val <- .5  # Our reference x-value (can be changed to any value between -4 and 4)
x <- seq(-4, 4, length.out = 1000)
pdf <- dnorm(x)
cdf <- pnorm(x)

# Set up dual-axis plot
par(mar = c(5, 4, 4, 4) + 0.1)

# Plot PDF (blue curve) with blue y-axis
plot(x, pdf, type = "l", col = "blue", lwd = 3, ylim = c(0, max(pdf)),
     main = bquote("Standard Normal Distribution: PDF and CDF (x" == .(x_val)*")"),
     xlab = "x", ylab = "", axes = FALSE)
axis(1)
axis(2, col = "blue", col.axis = "blue")
mtext("Probability Density (PDF)", side = 2, line = 3, col = "blue")

# Shade area under PDF for P(X ≤ x_val)
x_shade <- seq(-4, x_val, length.out = 200)
y_shade <- dnorm(x_shade)
polygon(c(-4, x_shade, x_val), c(0, y_shade, 0), 
        col = rgb(0.53, 0.81, 0.92, alpha = 0.5), border = NA)

# Add CDF curve (red, right axis)
par(new = TRUE)
plot(x, cdf, type = "l", col = "red", lwd = 3, 
     axes = FALSE, xlab = "", ylab = "", ylim = c(0, 1))
axis(4, at = seq(0, 1, 0.2), col = "red", col.axis = "red")
mtext("Cumulative Probability (CDF)", side = 4, line = 3, col = "red")

# Add key points and lines
points(x_val, pnorm(x_val), pch = 19, col = "red")

# Vertical tracing line
segments(x_val, 0, x_val, pnorm(x_val), col = "red", lwd = 1)

# Horizontal tracing line with curved arrow
shape::Arrows(
  x0 = x_val, y0 = pnorm(x_val), 
  x1 = -4.2, y1 = pnorm(x_val),
  col = "red",
  arr.type = "curved",
  arr.width = 0.2,
  arr.length = 0.6,
  lwd = 1,
  segment = TRUE  # Draw line separately
)

# Value label
text(-4.3, pnorm(x_val), 
     labels = bquote(Phi(.(x_val)) == .(round(pnorm(x_val), 4))), 
     col = "red", pos = 2, cex = 0.9)

# Area annotation with dynamic value
text(-2.7, 0.5, 
     bquote(paste("P(X ≤ ", .(x_val), ") = ", integral(f(x)*dx, -infinity, .(x_val)), " = ", Phi(.(x_val)))), 
     col = "darkblue", cex = 0.9)

# Enhanced legend
legend("topleft",
       legend = c(expression(f(x) == "PDF"), 
                expression(Phi(x) == "CDF"), 
                bquote(P(X <= .(x_val)))),
       col = c("blue", "red", rgb(0.53, 0.81, 0.92)),
       lwd = c(2, 2, NA),
       lty = c(1, 1, NA),
       fill = c(NA, NA, rgb(0.53, 0.81, 0.92, alpha = 0.5)),
       border = NA,
       bty = "n")


5.3 Student’s t-Distribution

The t-distribution is a family of continuous probability distributions that arises when:

  • Estimating the mean of a normally distributed population
  • Sample size is small (typically n < 30)
  • Population standard deviation is unknown

It plays a key role in:

  • Student’s t-test (comparing two sample means)
  • Confidence intervals for population means
  • Linear regression analysis

Parameters \[ \displaystyle{\nu > 0 \text{ degrees of freedom.}} \] Determines the shape of the distribution

Support \[ \displaystyle{x \in \left(-\infty, +\infty \right)} \]

Probability Density Function (PDF) The PDF of the t-distribution is given by:

\[ f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

Alternatively expressed using the Beta function:

\[ f(t) = \frac{1}{\sqrt{\nu}\,B\left(\frac{1}{2}, \frac{\nu}{2}\right)} \left(1+\frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

Where:

  • \(\Gamma\) = Gamma function
  • \(B\) = Beta function

Moments

Moment Formula Defined When
Expected Value 0 ν > 1 (undefined for ν ≤ 1)
Variance \(\frac{ν}{ν-2}\) ν > 2 (∞ for 1 < ν ≤ 2; undefined for ν ≤ 1)
Skewness 0 ν > 3 (undefined for ν ≤ 3)
Excess Kurtosis \(\frac{6}{ν-4}\) ν > 4 (∞ for 2 < ν ≤ 4; undefined for ν ≤ 2)

Key Properties

  • Shape:
    • Bell-shaped, but with heavier tails than the normal distribution (compared numerically after this list)
    • Approaches N(0,1) as ν → ∞
  • Relationships:
    • The Cauchy distribution is the t-distribution with ν = 1
    • The t-distribution converges to the standard normal as ν → ∞
  • Applications:
    • Small sample inference
    • Robust statistical modeling
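The heavier tails are easy to see numerically. A short comparison of tail probabilities and quantiles for the t-distribution and the standard normal (the df values are chosen arbitrarily for illustration):

# Tail probability P(T > 3): noticeably larger than the normal for small df
sapply(c(1, 5, 30, 1000), function(df) pt(3, df, lower.tail = FALSE))
pnorm(3, lower.tail = FALSE)                 # normal benchmark, about 0.00135

# The 97.5% quantile shrinks toward 1.96 as df grows
sapply(c(1, 5, 30, 1000), function(df) qt(0.975, df))
qnorm(0.975)                                 # 1.959964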

R Implementation Density of the t-distribution

#' Student's t-Distribution Probability Density Function
#'
#' Computes the probability density function (PDF) for Student's t-distribution
#' at given quantiles for specified degrees of freedom.
#'
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @param x Numeric vector of quantiles (default: seq(-4.5, 4.5, length.out=1e6))
#' @return Numeric vector of probability densities
#' @export
#' @examples
#' # Basic usage
#' t_pdf(v = 5, x = 0)  # Density at 0 with 5 df
#' 
#' # Vectorized usage
#' t_pdf(v = c(1, 5, 30), x = c(-2, 0, 2))
t_pdf <- function(v, x = seq(from = -4.5, to = 4.5, length.out = 1e6)) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Vectorized PDF calculation
  gamma_part <- gamma((v + 1)/2) / (sqrt(v * pi) * gamma(v/2))
  (gamma_part * (1 + (x^2)/v)^(-(v + 1)/2))
}

# Save function to file
dump("t_pdf", file = "t_pdf.R")

R Implementation Cumulative Distribution Function of the t-distribution

\[ F(x) = \frac{1}{2} + x\,\Gamma\left(\frac{\nu+1}{2}\right) \times \frac{{}_2F_1\left(\frac{1}{2},\, \frac{\nu+1}{2};\, \frac{3}{2};\, -\frac{x^2}{\nu}\right)}{\sqrt{\pi\nu}\,\Gamma\left(\frac{\nu}{2}\right)} \] where \(_2F_1\) is the hypergeometric function.

#' Student's t-Distribution Cumulative Distribution Function (CDF)
#'
#' Computes the cumulative probability up to x for Student's t-distribution
#' with v degrees of freedom using the hypergeometric function.
#'
#' @param x Numeric vector of quantiles
#' @param v Numeric vector of degrees of freedom (v > 0)
#' @return Numeric vector of cumulative probabilities P(T ≤ x)
#' @export
#' @examples
#' t_cdf(0, 10)     # Returns 0.5 (median)
#' t_cdf(1.96, 30)  # Returns ~0.97 (approaches 0.975 as df grows)
#' 
t_cdf <- function(x, v) {
  # Input validation
  if (any(v <= 0)) {
    stop("Degrees of freedom (v) must be positive")
  }
  
  # Check for package installation
  if (!requireNamespace("hypergeo", quietly = TRUE)) {
    stop("Package 'hypergeo' required. Install with install.packages('hypergeo')")
  }
  
  # Calculate components
  numerator <- x * gamma((v + 1)/2) * 
    Re(hypergeo::hypergeo(0.5, (v + 1)/2, 1.5, -x^2/v))
  denominator <- sqrt(pi * v) * gamma(v/2)
  
  # Return CDF value
  0.5 + numerator/denominator
}

# Save function to file
dump("t_cdf", file = "t_cdf.R")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# 1. Critical Value Calculations
calculate_critical_values <- function() {
  expand.grid(
    sd = c(1, 2, 3),           # Standard deviations
    df = c(5, 30, 100)         # Degrees of freedom
  ) %>%
    mutate(
      # Calculate critical t-values (two-tailed)
      t_critical = qt(1 - (1 - pnorm(sd))/2, df),
      
      # Calculate exact p-values
      p_value = 2 * (1 - pt(abs(t_critical), df)),
      
      # Format confidence levels
      conf_level = factor(case_when(
        sd == 1 ~ "68%",
        sd == 2 ~ "95%",
        sd == 3 ~ "99.7%"
      ), levels = c("68%", "95%", "99.7%")),
      
      # Clean DF labels
      df_label = ifelse(is.infinite(df), "Normal", paste("df =", df))
    )
}

crit_data <- calculate_critical_values()

# 2. Professional Table Output
crit_data %>%
  dplyr::select(
    SDs = sd,
    `Conf. Level` = conf_level,
    Distribution = df_label,
    `Critical t` = t_critical,
    `p-value` = p_value
  ) %>%
  knitr::kable(digits = 4, align = "c",
               caption = "Critical Values for t-Distributions") %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  kableExtra::column_spec(4, bold = TRUE, color = "red") %>%
  kableExtra::add_header_above(c(" " = 3, "Two-Tailed Test" = 2))
Critical Values for t-Distributions
Two-Tailed Test
SDs Conf. Level Distribution Critical t p-value
1 68% df = 5 1.6558 0.1587
2 95% df = 5 3.2478 0.0228
3 99.7% df = 5 6.4317 0.0013
1 68% df = 30 1.4456 0.1587
2 95% df = 30 2.4011 0.0228
3 99.7% df = 30 3.5338 0.0013
1 68% df = 100 1.4202 0.1587
2 95% df = 100 2.3134 0.0228
3 99.7% df = 100 3.2979 0.0013
# 3. Final Working Visualization
plot_t_distribution <- function(target_df = 30) {
  # Base R calculation of critical values (1, 2, 3 SDs)
  crit_values <- qt(p = 1 - (1 - pnorm(c(1, 2, 3)))/2, df = target_df)
  
  # Create plot with no annotations
  ggplot(data.frame(x = c(-4, 4)), aes(x)) +
    # Distribution curve
    stat_function(fun = dt, args = list(df = target_df), 
                  color = "steelblue", linewidth = 1.2) +
    
    # Critical value lines (no text labels)
    geom_vline(xintercept = c(-crit_values, crit_values),
               color = "red", linetype = "dotted", linewidth = 0.5) +
    
    # Clean formatting
    labs(title = paste("t-Distribution (df =", target_df, ")"),
         x = "t-value", y = "Density") +
    theme_minimal() +
    theme(panel.grid.minor = element_blank(),
          plot.title = element_text(hjust = 0.5, face = "bold")) +
    scale_x_continuous(breaks = round(c(-4, -crit_values, 0, crit_values, 4), 2)) +
    coord_cartesian(xlim = c(-4, 4), ylim = c(0, 0.4))
}

# Generate plots
plot_t_distribution(30)  # For df=30

plot_t_distribution(100) # For df=100

From the t-table: \(t_{crit} \approx -1.711\).


  1. Conclusion:
    Since \(t = -2.0 < t_{crit} = -1.711\), reject \(H_0\). The bulbs last significantly less than 1,000 hours.
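The same conclusion follows from the p-value. A quick check, assuming 24 degrees of freedom (n = 25), which is what a one-tailed critical value of about -1.711 at α = 0.05 implies:

# p-value for the observed statistic under the assumed df = 24
pt(-2.0, df = 24)   # about 0.028, below alpha = 0.05, so reject H0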
#' One-Sample t-Test Implementation
#'
#' Performs a one-sample t-test comparing the sample mean to a hypothesized population mean.
#'
#' @param x_bar Sample mean (numeric)
#' @param s Sample standard deviation (numeric, positive)
#' @param n Sample size (integer, >1)
#' @param mu0 Hypothesized population mean (numeric)
#' @param alpha Significance level (numeric, 0 < alpha < 1)
#' @param alternative Direction of test ("greater", "less", or "two.sided")
#'
#' @return A list containing test results with the following components:
#' \describe{
#'   \item{t_statistic}{Calculated t-statistic}
#'   \item{critical_value}{Critical t-value(s) for given alpha}
#'   \item{p_value}{Calculated p-value}
#'   \item{decision}{Test decision}
#'   \item{conclusion}{Plain English conclusion}
#' }
#'
#' @examples
#' t_test_results <- one_sample_t_test(
#'   x_bar = 10.2, s = 0.4, n = 15,
#'   mu0 = 10, alpha = 0.05, alternative = "greater"
#' )
one_sample_t_test <- function(x_bar, s, n, mu0, alpha = 0.05, alternative = "greater") {
  # Input validation
  if (n <= 1) stop("Sample size n must be greater than 1")
  if (s <= 0) stop("Sample standard deviation s must be positive")
  if (alpha <= 0 | alpha >= 1) stop("Alpha must be between 0 and 1")
  if (!alternative %in% c("greater", "less", "two.sided")) {
    stop("Alternative must be 'greater', 'less', or 'two.sided'")
  }
  
  # Calculate t-statistic
  t_stat <- (x_bar - mu0) / (s / sqrt(n))
  df <- n - 1
  
  # Determine critical value(s) and p-value
  if (alternative == "greater") {
    t_crit <- qt(1 - alpha, df)
    p_val <- 1 - pt(t_stat, df)
    decision <- if (t_stat > t_crit) "Reject H0" else "Fail to reject H0"
  } else if (alternative == "less") {
    t_crit <- qt(alpha, df)
    p_val <- pt(t_stat, df)
    decision <- if (t_stat < t_crit) "Reject H0" else "Fail to reject H0"
  } else {  # two.sided
    t_crit <- qt(1 - alpha/2, df)
    p_val <- 2 * (1 - pt(abs(t_stat), df))
    decision <- if (abs(t_stat) > t_crit) "Reject H0" else "Fail to reject H0"
  }
  
  # Conclusion
  conclusion <- if (decision == "Reject H0") {
    switch(alternative,
           greater = paste("Evidence suggests population mean is greater than", mu0),
           less = paste("Evidence suggests population mean is less than", mu0),
           two.sided = paste("Evidence suggests population mean differs from", mu0))
  } else {
    paste("Insufficient evidence to conclude population mean differs from", mu0)
  }
  
  # Return results
  list(
    t_statistic = t_stat,
    critical_value = t_crit,
    p_value = p_val,
    decision = decision,
    conclusion = conclusion,
    parameters = list(
      sample_mean = x_bar,
      sample_sd = s,
      sample_size = n,
      hypothesized_mean = mu0,
      significance_level = alpha,
      alternative = alternative,
      df = df
    )
  )
}

# Example usage
test_results <- one_sample_t_test(
  x_bar = 10.2,
  s = 0.4,
  n = 15,
  mu0 = 10,
  alpha = 0.05,
  alternative = "greater"
)

# Print formatted results
cat("One-Sample t-Test Results:\n",
    "t-statistic = ", round(test_results$t_statistic, 3), "\n",
    "Critical value = ", round(test_results$critical_value, 3), "\n",
    "p-value = ", round(test_results$p_value, 4), "\n",
    "Decision: ", test_results$decision, "\n",
    "Conclusion: ", test_results$conclusion, "\n", sep = "")
## One-Sample t-Test Results:
## t-statistic = 1.936
## Critical value = 1.761
## p-value = 0.0366
## Decision: Reject H0
## Conclusion: Evidence suggests population mean is greater than 10
# ==============================
# TWO‑SAMPLE T‑TEST EXPLORER
# ==============================

library(shiny) # Interactive web apps
## 
## Attaching package: 'shiny'
## The following object is masked from 'package:ggExtra':
## 
##     runExample
library(ggplot2) # Publication‑ready plots
library(dplyr) # Data manipulation
library(patchwork) # Combine ggplots
library(DT) # Interactive tables
## 
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
## 
##     dataTableOutput, renderDataTable
# ——————————————————————————————
# 1. EFFECT SIZE & CI FUNCTIONS
# ——————————————————————————————
#' Calculate Cohen's d Effect Size for Two Independent Groups
#'
#' Computes Cohen's d, a standardized measure of effect size for the difference
#' between two means. The function handles both independent samples and includes
#' checks for valid input.
#'
#' @param x Numeric vector of values for group 1
#' @param y Numeric vector of values for group 2
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param hedges_correction Logical indicating whether to apply Hedges' g correction
#' for small sample sizes (default: FALSE)
#'
#' @return A numeric value representing Cohen's d effect size. The sign indicates
#' the direction of the effect (positive when x > y, negative when x < y).
#'
#' @details
#' The function calculates the pooled standard deviation version of Cohen's d,
#' which is appropriate for independent samples. When \code{hedges_correction = TRUE},
#' it applies Hedges' correction factor to reduce bias in small samples.
#'
#' @examples
#' # Basic usage
#' group1 <- c(1, 2, 3, 4, 5)
#' group2 <- c(6, 7, 8, 9, 10)
#' cohens.d(group1, group2)
#'
#' # With NA values
#' group1 <- c(1, 2, NA, 4, 5)
#' cohens.d(group1, group2, na.rm = TRUE)
#'
#' # With Hedges' correction
#' cohens.d(group1, group2, hedges_correction = TRUE)
#'
#' @references
#' Cohen, J. (1988). Statistical power analysis for the behavioral
#' sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
#'
#' Hedges, L. V. (1981). Distribution theory for Glass's estimator
#' of effect size and related estimators. Journal of Educational
#' Statistics, 6(2), 107-128.
#'
#' @export
cohens.d <- function(x, y, na.rm = TRUE, hedges_correction = FALSE) {
  # Input validation
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("Both x and y must be numeric vectors")
  }
  
  if (na.rm) {
    x <- x[!is.na(x)]
    y <- y[!is.na(y)]
  } else if (any(is.na(c(x, y)))) {
    stop("NA values found and na.rm = FALSE")
  }
  
  if (length(x) < 2 || length(y) < 2) {
    stop("Each group must have at least 2 observations")
  }
  
  # Calculate means and variances
  mean_x <- mean(x)
  mean_y <- mean(y)
  var_x <- var(x)
  var_y <- var(y)
  n_x <- length(x)
  n_y <- length(y)
  
  # Pooled standard deviation
  pooled_sd <- sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y)/(n_x + n_y - 2))
  
  # Cohen's d (with direction)
  d <- (mean_x - mean_y)/pooled_sd
  
  # Apply Hedges' correction if requested
  if (hedges_correction) {
    df <- n_x + n_y - 2
    correction_factor <- 1 - (3/(4 * df - 1))
    d <- d * correction_factor
  }
  
  return(d)
}
#' Calculate Alternative Cohen's d Effect Size (N-denominator version)
#'
#' Computes an alternative version of Cohen's d that uses N (rather than N-2)
#' in the denominator of the pooled standard deviation calculation. This version
#' matches some textbook definitions of Cohen's d for independent samples.
#'
#' @param x Numeric vector of values for group 1
#' @param y Numeric vector of values for group 2
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param absolute Logical indicating whether to return absolute value (default: TRUE)
#'
#' @return Numeric value representing the effect size. By default returns absolute value.
#'
#' @details
#' This version differs from cohens.d() in the denominator of the pooled SD calculation:
#'
#' # Standard version
#' - cohens.d(): sqrt(((n1-1)*var1 + (n2-1)*var2)/(n1+n2-2))
#'
#' # Alternative version
#' - cohens_d2(): sqrt(((n1-1)*var1 + (n2-1)*var2)/(n1+n2))
#'
#' The absolute parameter controls whether to return signed (FALSE) or absolute (TRUE) values.
#'
#' @examples
#' group1 <- c(1, 2, 3, 4, 5)
#' group2 <- c(6, 7, 8, 9, 10)
#' cohens_d2(group1, group2) # Absolute effect size
#' cohens_d2(group1, group2, absolute = FALSE) # Signed effect size
#'
#' @references
#' Cohen, J. (1988). Statistical power analysis for the behavioral sciences.
#' Hillsdale, NJ: Lawrence Erlbaum Associates.
#'
#' @seealso \code{\link{cohens.d}} for the more common N-2 denominator version
#' @export
#'
cohens_d2 <- function(x, y, na.rm = TRUE, absolute = TRUE) {
  
  # Input validation
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("Both x and y must be numeric vectors")
  }
  
  if (na.rm) {
    x <- x[!is.na(x)]
    y <- y[!is.na(y)]
  } else if (any(is.na(c(x, y)))) {
    stop("NA values found and na.rm = FALSE")
  }
  
  if (length(x) < 2 || length(y) < 2) {
    stop("Each group must have at least 2 observations")
  }
  
  # Calculate components
  n_x <- length(x)
  n_y <- length(y)
  var_x <- var(x)
  var_y <- var(y)
  
  # Alternative pooled SD calculation (using N rather than N-2 in
  # denominator)
  pooled_sd <- sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y)/(n_x + n_y))
  
  # Calculate effect size (optionally keep direction)
  d <- (mean(x) - mean(y))/pooled_sd
  
  if (absolute) {
    return(abs(d))
  } else {
    return(d)
  }
}
#' Calculate Standard Error of the Mean Difference Between Two Independent Groups
#'
#' Computes the standard error for the difference between means of two independent samples,
#' using pooled variance. This is appropriate when comparing means between two groups
#' in t-tests and confidence interval calculations.
#'
#' @param x Numeric vector of values for group 1
#' @param y Numeric vector of values for group 2
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param pooled Logical indicating whether to use pooled variance (default: TRUE).
#' If FALSE, uses separate variances (Welch-Satterthwaite approximation).
#'
#' @return Numeric value representing the standard error of the mean difference
#'
#' @details
#' When pooled=TRUE (default), the calculation is:
#' \deqn{SE = \sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}}
#' Where \eqn{s_p^2} is the pooled variance.
#'
#' When pooled=FALSE, uses Welch's adjustment:
#' \deqn{SE = \sqrt{\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}}}
#'
#' @examples
#' group1 <- c(1, 2, 3, 4, 5)
#' group2 <- c(6, 7, 8, 9, 10)
#' se.m(group1, group2) # Pooled variance
#' se.m(group1, group2, pooled = FALSE) # Welch's version
#'
#' @references
#' Satterthwaite, F. E. (1946). An approximate distribution of estimates
#' of variance components. Biometrics Bulletin, 2(6), 110-114.
#'
#' Welch, B. L. (1947). The generalization of 'Student's' problem when
#' several different population variances are involved. Biometrika,
#' 34(1-2), 28-35.
#' @export
se.m <- function(x, y, na.rm = TRUE, pooled = TRUE) {
  
  # Input validation
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("Both x and y must be numeric vectors")
  }
  
  if (na.rm) {
    x <- x[!is.na(x)]
    y <- y[!is.na(y)]
  } else if (any(is.na(c(x, y)))) {
    stop("NA values found and na.rm = FALSE")
  }
  
  if (length(x) < 2 || length(y) < 2) {
    stop("Each group must have at least 2 observations")
  }
  
  # Calculate components
  n_x <- length(x)
  n_y <- length(y)
  var_x <- var(x)
  var_y <- var(y)
  
  if (pooled) {
    # Pooled variance version (equal variances assumed)
    s_pooled <- ((n_x - 1) * var_x + (n_y - 1) * var_y)/(n_x + n_y - 2)
    se <- sqrt((s_pooled/n_x) + (s_pooled/n_y))
  } else {
    # Welch-Satterthwaite version (unequal variances)
    se <- sqrt((var_x/n_x) + (var_y/n_y))
  }
  
  return(se)
}
#' Calculate Confidence Interval for Mean Difference Between Two Independent Groups
#'
#' Computes the mean difference and its confidence interval between two independent samples,
#' using either pooled or unpooled standard error estimates.
#'
#' @param x Numeric vector of values for group 1
#' @param y Numeric vector of values for group 2
#' @param na.rm Logical indicating whether to remove NA values (default: TRUE)
#' @param pooled Logical indicating whether to use pooled variance (default: TRUE).
#' If FALSE, uses separate variances (Welch-Satterthwaite approximation).
#' @param conf.level Confidence level for the interval (default: 0.95)
#' @param absolute Logical indicating whether to return absolute mean difference (default: TRUE)
#'
#' @return A list containing:
#' \itemize{
#' \item mean.difference - The difference between group means
#' \item se - Standard error of the mean difference
#' \item lower.ci - Lower bound of confidence interval
#' \item upper.ci - Upper bound of confidence interval
#' \item conf.level - The confidence level used
#' }
#'
#' @details
#' The confidence interval is calculated as:
#' \deqn{CI = \bar{x} - \bar{y} \pm t_{\alpha/2, df} \times SE}
#' Where SE is calculated either with pooled variance or Welch's method,
#' and degrees of freedom are adjusted for the Welch test when pooled=FALSE.
#'
#' @examples
#' group1 <- c(1, 2, 3, 4, 5)
#' group2 <- c(6, 7, 8, 9, 10)
#' ci(group1, group2) # Pooled variance 95% CI
#' ci(group1, group2, pooled = FALSE, conf.level = 0.99) # Welch's 99% CI
#'
#' @seealso \code{\link{se.m}} for the standard error calculation
#' @export
ci <- function(x, y, na.rm = TRUE, pooled = TRUE, conf.level = 0.95, absolute = TRUE) {
  
  # Input validation
  if (!is.numeric(x) || !is.numeric(y)) {
    stop("Both x and y must be numeric vectors")
  }
  
  if (na.rm) {
    x <- x[!is.na(x)]
    y <- y[!is.na(y)]
  } else if (any(is.na(c(x, y)))) {
    stop("NA values found and na.rm = FALSE")
  }
  
  if (length(x) < 2 || length(y) < 2) {
    stop("Each group must have at least 2 observations")
  }
  
  if (conf.level <= 0 || conf.level >= 1) {
    stop("conf.level must be between 0 and 1")
  }
  
  # Calculate components
  mean_diff <- mean(x) - mean(y)
  if (absolute)
    mean_diff <- abs(mean_diff)
  se <- se.m(x, y, na.rm = na.rm, pooled = pooled)
  
  # Calculate critical value
  if (pooled) {
    df <- length(x) + length(y) - 2
  } else {
    # Welch-Satterthwaite degrees of freedom
    vx <- var(x)
    vy <- var(y)
    nx <- length(x)
    ny <- length(y)
    df <- (vx/nx + vy/ny)^2/((vx/nx)^2/(nx - 1) + (vy/ny)^2/(ny - 1))
  }
  
  crit_val <- qt((1 + conf.level)/2, df = df)
  
  # Calculate confidence interval
  lower <- mean_diff - crit_val * se
  upper <- mean_diff + crit_val * se
  
  return(list(mean.difference = mean_diff, se = se, lower.ci = lower, upper.ci = upper,
              conf.level = conf.level, method = ifelse(pooled, "Pooled variance", "Welch's unequal variances")))
}

# ——————————————————————————————
# ============ UI ==============
# ——————————————————————————————

ui <- fluidPage(
  div(class = "full-title", "Two-Sample T-Test with T-Distribution Visualization"),
  
  tags$head(tags$style(HTML("
    /* YOUR ORIGINAL CSS + NEW ACCORDION */
    .full-title {text-align:center;width:100vw;position:relative;left:50%;right:50%;
                 margin-left:-50vw;margin-right:-50vw;padding:1.5rem 0;
                 background:linear-gradient(135deg,#3498db,#2c3e50);color:white;
                 font-size:clamp(1.8rem,5vw,2.8rem);font-weight:700;
                 box-shadow:0 4px 15px rgba(0,0,0,.25);z-index:10;}
    .card {background:white;padding:clamp(12px,3vw,20px);margin:12px 0;
           border-radius:12px;box-shadow:0 3px 10px rgba(0,0,0,.12);}
    .card:hover{transform:translateY(-3px);}
    .card h4{margin:0 0 10px;color:#2c3e50;border-bottom:2px solid #3498db;}
    @media (max-width:768px){.sidebar-panel,.main-panel{width:100%!important;}}
    .shiny-plot-output img{max-width:100%;height:auto;}
    
    /* NEW ACCORDION */
    .acc{padding:0;margin:10px 0;border-radius:12px;overflow:hidden;
         box-shadow:0 4px 15px rgba(0,0,0,.15);background:white;}
    .acc-head{background:#3498db;color:white;padding:16px 20px;cursor:pointer;
               font-weight:600;display:flex;justify-content:space-between;
               align-items:center;transition:.2s;}
    .acc-head:hover{background:#2980b9;}
    .acc-head::after{content:'▼';font-size:1.2em;}
    .acc-head.active::after{content:'▲';}
    .acc-body{padding:20px;background:#f8fbff;display:none;}
    .acc-body.active{display:block;}
    .icon{margin-right:12px;font-size:1.3em;}
  "))),
  
  tags$script(HTML("
    $(document).on('click','.acc-head',function(){
      $(this).toggleClass('active').next().toggleClass('active');
    });
  ")),
  
  sidebarLayout(
    sidebarPanel(width = 4,
                 # ── 1. SCENARIO ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Pick a Scenario"),
                     div(class="acc-body active",
                         radioButtons("data_scenario", NULL,
                                      c("Clear Difference"="clear_diff","Small Difference"="small_diff",
                                        "Overlapping"="overlapping","High Variance"="high_var"),
                                      "clear_diff")
                     )
                 ),
                 
                 # ── 2. SAMPLE SIZE ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Sample Size (per group)"),
                     div(class="acc-body active",
                         sliderInput("sample_size", NULL, 10, 999, 30, 10)
                     )
                 ),
                 
                 # ── 3. GROUP 1 ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Group 1"),
                     div(class="acc-body active",
                         fluidRow(
                           column(6, numericInput("group1_mean","Mean",70,0,100,1)),
                           column(6, numericInput("group1_sd","SD",10,1,30,1))
                         )
                     )
                 ),
                 
                 # ── 4. GROUP 2 ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Group 2"),
                     div(class="acc-body active",
                         fluidRow(
                           column(6, numericInput("group2_mean","Mean",78,0,100,1)),
                           column(6, numericInput("group2_sd","SD",10,1,30,1))
                         )
                     )
                 ),
                 
                 # ── 5. TEST SETTINGS ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Test Settings"),
                     div(class="acc-body active",
                         sliderInput("alpha_level","α",0.001,0.20,0.05,0.01),
                         checkboxInput("var_equal","Assume equal variances",TRUE)
                     )
                 ),
                 
                 # ── 6. LIVE T-DIST ──
                 div(class="acc",
                     div(class="acc-head active", span(class="icon",""), "Live t-Distribution"),
                     div(class="acc-body active",
                         plotOutput("t_distribution_plot", height="260px")
                     )
                 ),
                 
                 # ── 7. GUIDE ──
                 div(class="acc",
                     div(class="acc-head", span(class="icon",""), "Interpretation Guide"),
                     div(class="acc-body",
                         tags$ul(style="font-size:13px;line-height:1.5;margin:8px 0;",
                                 tags$li("Red = rejection zones"),
                                 tags$li("Purple = your t"),
                                 tags$li("CI overlap = visual significance")
                         )
                     )
                 )
    ),
    
    mainPanel(width = 8,
              tabsetPanel(
                tabPanel("T-Test Analysis", plotOutput("t_analysis_plot", height="600px"),
                         verbatimTextOutput("detailed_interpretation")),
                tabPanel("Effect Sizes", verbatimTextOutput("effect_size_output"),
                         plotOutput("effect_size_plot")),
                tabPanel("Data Summary", verbatimTextOutput("data_summary"),
                         DTOutput("data_preview"), plotOutput("raw_data_plot")),
                tabPanel("Statistical Details", verbatimTextOutput("statistical_details"),
                         plotOutput("power_plot"))
              )
    )
  )
)

# ——————————————————————————————
# ========== SERVER ============
# ——————————————————————————————

server <- function(input, output, session) {
  
  # Update parameters based on scenario
  observeEvent(input$data_scenario, {
    scenario <- input$data_scenario
    if(scenario == "clear_diff") {
      updateNumericInput(session, "group1_mean", value = 70)
      updateNumericInput(session, "group1_sd", value = 10)
      updateNumericInput(session, "group2_mean", value = 85)
      updateNumericInput(session, "group2_sd", value = 10)
    } else if(scenario == "small_diff") {
      updateNumericInput(session, "group1_mean", value = 70)
      updateNumericInput(session, "group1_sd", value = 8)
      updateNumericInput(session, "group2_mean", value = 73)
      updateNumericInput(session, "group2_sd", value = 8)
    } else if(scenario == "overlapping") {
      updateNumericInput(session, "group1_mean", value = 70)
      updateNumericInput(session, "group1_sd", value = 12)
      updateNumericInput(session, "group2_mean", value = 72)
      updateNumericInput(session, "group2_sd", value = 12)
    } else if(scenario == "high_var") {
      updateNumericInput(session, "group1_mean", value = 70)
      updateNumericInput(session, "group1_sd", value = 25)
      updateNumericInput(session, "group2_mean", value = 75)
      updateNumericInput(session, "group2_sd", value = 25)
    }
  })
  
  # (No extra synchronization observer is needed here: the observeEvent above
  #  already pushes the scenario presets into the numeric inputs.)
  
  # Generate data reactively based on parameters (real-time)
  generated_data <- reactive({
    set.seed(123) # For reproducibility
    
    n <- input$sample_size
    
    group1 <- rnorm(n, mean = input$group1_mean, sd = input$group1_sd)
    group2 <- rnorm(n, mean = input$group2_mean, sd = input$group2_sd)
    
    # Ensure scores stay within reasonable bounds (0-100)
    group1 <- pmax(0, pmin(100, group1))
    group2 <- pmax(0, pmin(100, group2))
    
    return(list(group1 = group1, group2 = group2))
  })
  
  # Perform t-test reactively (real-time)
  test_results <- reactive({
    data <- generated_data()
    test <- t.test(data$group1, data$group2,
                   var.equal = input$var_equal,
                   conf.level = 1 - input$alpha_level)
    return(test)
  })
  
  # T-distribution plot for sidebar
  output$t_distribution_plot <- renderPlot({
    req(test_results())
    t_result <- test_results()
    
    # Create detailed t-distribution plot
    x_range <- seq(-4, 4, length.out = 200)
    t_dist <- data.frame(
      x = x_range,
      y = dt(x_range, df = t_result$parameter)
    )
    
    critical_val <- qt(1 - input$alpha_level/2, df = t_result$parameter)
    t_stat <- t_result$statistic
    
    # Create shaded regions for critical areas
    critical_region_left <- data.frame(x = seq(-4, -critical_val, length.out = 50))
    critical_region_left$y <- dt(critical_region_left$x, df = t_result$parameter)
    
    critical_region_right <- data.frame(x = seq(critical_val, 4, length.out = 50))
    critical_region_right$y <- dt(critical_region_right$x, df = t_result$parameter)
    
    ggplot(t_dist, aes(x = x, y = y)) +
      geom_line(color = "blue", size = 1.2) +
      geom_area(data = critical_region_left, aes(x = x, y = y), fill = "red", alpha = 0.3) +
      geom_area(data = critical_region_right, aes(x = x, y = y), fill = "red", alpha = 0.3) +
      geom_vline(xintercept = c(-critical_val, critical_val),
                 linetype = "dashed", color = "darkred", size = 1) +
      geom_vline(xintercept = t_stat, color = "purple", size = 1.5, alpha = 0.8) +
      annotate("text", x = 0, y = max(t_dist$y)*0.9,
               label = paste("t =", round(t_stat, 3)), size = 4, fontface = "bold") +
      annotate("text", x = 0, y = max(t_dist$y)*0.8,
               label = paste("df =", t_result$parameter), size = 3) +
      annotate("text", x = 0, y = max(t_dist$y)*0.7,
               label = paste("p =", format.pval(t_result$p.value, digits = 3)), size = 3) +
      annotate("text", x = critical_val, y = max(t_dist$y)*0.1,
               label = paste("Critical t = ±", round(critical_val, 2)),
               hjust = -0.1, color = "darkred", size = 3) +
      labs(title = paste("T-Distribution (α =", input$alpha_level, ")"),
           x = "t-value", y = "Density") +
      theme_minimal() +
      theme(plot.title = element_text(size = 12),
            axis.text = element_text(size = 10),
            axis.title = element_text(size = 10))
  })
  
  # Modified plot function for the main analysis - FIXED VERSION
  output$t_analysis_plot <- renderPlot({
    req(generated_data(), test_results())
    data <- generated_data()
    
    all_values <- c(data$group1, data$group2)
    x_min <- min(all_values) - sd(all_values)
    x_max <- max(all_values) + sd(all_values)
    x_limits <- c(x_min, x_max)
    
    t_result <- test_results()
    
    plot_data <- data.frame(
      value = c(data$group1, data$group2),
      group = rep(c("Group 1", "Group 2"),
                  times = c(length(data$group1), length(data$group2)))
    )
    
    ci_data <- plot_data %>%
      group_by(group) %>%
      summarise(
        mean = mean(value),
        lower = t.test(value)$conf.int[1],
        upper = t.test(value)$conf.int[2],
        .groups = "drop"
      )
    
    ci_overlap <- (ci_data$lower[1] <= ci_data$upper[2]) &
      (ci_data$lower[2] <= ci_data$upper[1])
    
    p1 <- ggplot(plot_data, aes(x = value, fill = group)) +
      geom_density(alpha = 0.6) +
      geom_vline(data = ci_data, aes(xintercept = mean, color = group),
                 linetype = "dashed", size = 1) +
      scale_fill_brewer(palette = "Set1") +
      scale_color_brewer(palette = "Set1") +
      labs(title = "Distribution Comparison",
           x = "Value", y = "Density") +
      theme_minimal() +
      theme(legend.position = "none") +
      coord_cartesian(xlim = x_limits) +
      expand_limits(x = x_limits)
    
    p2 <- ggplot(ci_data, aes(x = group, y = mean, color = group)) +
      geom_point(size = 4) +
      geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2, size = 1) +
      labs(title = ifelse(ci_overlap,
                          "95% CIs Overlap – Difference May Not Be Significant",
                          "95% CIs Do Not Overlap – Difference Is Significant"),
           x = "", y = "Score") +
      theme_minimal() +
      scale_color_brewer(palette = "Set1") +
      coord_flip() +
      theme(legend.position = "none")
    
    p_legend <- ggplot(plot_data, aes(x = value, fill = group)) +
      geom_density(alpha = 0.6) +
      scale_fill_brewer(palette = "Set1", name = "Groups") +
      theme(legend.position = "bottom",
            legend.direction = "horizontal")
    
    design <- "AA
               BB
               CC"
    
    combined <- p1 + p2 + guide_area() +
      plot_layout(design = design,
                  heights = c(2, 1, 0.2),
                  guides = "collect") &
      theme(legend.position = "bottom")
    
    combined + plot_annotation(
      title = paste("Two-Sample T-Test:",
                    ifelse(t_result$p.value < input$alpha_level,
                           "SIGNIFICANT", "NOT SIGNIFICANT")),
      theme = theme(plot.title = element_text(size = 16,
                                              face = "bold",
                                              hjust = 0.5))
    )
  })
  
  output$detailed_interpretation <- renderPrint({
    req(test_results(), generated_data())
    test <- test_results()
    data <- generated_data()
    
    cat("=== TWO-SAMPLE T-TEST INTERPRETATION ===\n\n")
    cat("GROUP STATISTICS:\n")
    cat("• Group 1: M =", round(mean(data$group1), 2),
        ", SD =", round(sd(data$group1), 2),
        ", 95% CI [", round(t.test(data$group1)$conf.int[1], 2), ", ",
        round(t.test(data$group1)$conf.int[2], 2), "]\n", sep = "")
    cat("• Group 2: M =", round(mean(data$group2), 2),
        ", SD =", round(sd(data$group2), 2),
        ", 95% CI [", round(t.test(data$group2)$conf.int[1], 2), ", ",
        round(t.test(data$group2)$conf.int[2], 2), "]\n\n", sep = "")
    
    cat("TEST RESULTS:\n")
    cat("• t(", test$parameter, ") = ", round(test$statistic, 3), "\n", sep = "")
    cat("• p-value = ", format.pval(test$p.value, digits = 3), "\n")
    cat("• ", 100*(1-input$alpha_level), "% CI for difference: [",
        round(test$conf.int[1], 3), ", ", round(test$conf.int[2], 3), "]\n\n", sep = "")
    
    cat("STATISTICAL DECISION:\n")
    if(test$p.value < input$alpha_level) {
      cat("• ✓ REJECT null hypothesis (p < α)\n")
      cat("• Significant difference between groups\n")
      if(mean(data$group1) > mean(data$group2)) {
        cat("• Group 1 > Group 2\n")
      } else {
        cat("• Group 2 > Group 1\n")
      }
    } else {
      cat("• × FAIL TO REJECT null hypothesis (p ≥ α)\n")
      cat("• No significant difference between groups\n")
    }
    
    ci1 <- t.test(data$group1)$conf.int
    ci2 <- t.test(data$group2)$conf.int
    ci_overlap <- (ci1[1] <= ci2[2]) & (ci2[1] <= ci1[2])
    cat("• Confidence intervals overlap:", ifelse(ci_overlap, "YES", "NO"), "\n")
  })
  
  output$effect_size_output <- renderPrint({
    req(generated_data())
    data <- generated_data()
    
    cat("=== EFFECT SIZE ANALYSIS ===\n\n")
    d1 <- cohens.d(data$group1, data$group2)
    d2 <- cohens_d2(data$group1, data$group2)
    
    cat("Effect Size Measures:\n")
    cat("• Cohen's d (pooled SD): ", round(d1, 3), "\n")
    cat("• Cohen's d (N denominator): ", round(d2, 3), "\n\n")
    
    cat("EFFECT SIZE INTERPRETATION:\n")
    cat("• |d| < 0.2: Very small\n")
    cat("• 0.2 ≤ |d| < 0.5: Small\n")
    cat("• 0.5 ≤ |d| < 0.8: Medium\n")
    cat("• |d| ≥ 0.8: Large\n\n")
    
    cat("YOUR EFFECT SIZES:\n")
    effects <- c(d1, d2)
    names <- c("Cohen's d (pooled)", "Cohen's d (N-denom)")
    
    for(i in 1:length(effects)) {
      abs_effect <- abs(effects[i])
      interpretation <- ifelse(abs_effect < 0.2, "Very small",
                               ifelse(abs_effect < 0.5, "Small",
                                      ifelse(abs_effect < 0.8, "Medium", "Large")))
      direction <- ifelse(effects[i] > 0, "Group 1 > Group 2", "Group 2 > Group 1")
      cat("• ", names[i], ": ", round(effects[i], 3), " (", interpretation, ", ", direction, ")\n", sep = "")
    }
  })
  
  output$effect_size_plot <- renderPlot({
    req(generated_data())
    data <- generated_data()
    
    effect_sizes <- data.frame(
      Method = c("Cohen's d (Pooled)", "Cohen's d (N-denom)"),
      Value = c(
        cohens.d(data$group1, data$group2),
        cohens_d2(data$group1, data$group2)
      )
    )
    
    ggplot(effect_sizes, aes(x = Method, y = abs(Value), fill = Method)) +
      geom_col(alpha = 0.7, width = 0.6) +
      geom_hline(yintercept = c(0.2, 0.5, 0.8), linetype = "dashed",
                 color = c("green", "orange", "red"), size = 1) +
      geom_text(aes(label = round(Value, 3)), vjust = -0.5, size = 5, fontface = "bold") +
      annotate("text", x = 0.5, y = 0.1, label = "Very Small", color = "darkgreen", hjust = 0) +
      annotate("text", x = 0.5, y = 0.35, label = "Small", color = "orange", hjust = 0) +
      annotate("text", x = 0.5, y = 0.65, label = "Medium", color = "darkorange", hjust = 0) +
      annotate("text", x = 0.5, y = 0.9, label = "Large", color = "red", hjust = 0) +
      labs(title = "Effect Size Comparison",
           y = "Absolute Effect Size (Cohen's d)") +
      theme_minimal() +
      theme(
        legend.position = "none",
        axis.text.x = element_text(angle = 0, hjust = 0.5, size = 16, face = "bold"),
        axis.title = element_text(size = 18, face = "bold"),
        plot.title = element_text(size = 24, face = "bold", hjust = 0.5)
      )
  })
  
  output$data_summary <- renderPrint({
    req(generated_data())
    data <- generated_data()
    
    cat("=== DATA SUMMARY ===\n\n")
    cat("Group 1 (n = ", length(data$group1), "):\n", sep = "")
    cat("• Mean: ", round(mean(data$group1), 2), "\n")
    cat("• SD: ", round(sd(data$group1), 2), "\n")
    cat("• Median: ", round(median(data$group1), 2), "\n")
    cat("• Range: [", round(min(data$group1), 2), ", ", round(max(data$group1), 2), "]\n\n", sep = "")
    
    cat("Group 2 (n = ", length(data$group2), "):\n", sep = "")
    cat("• Mean: ", round(mean(data$group2), 2), "\n")
    cat("• SD: ", round(sd(data$group2), 2), "\n")
    cat("• Median: ", round(median(data$group2), 2), "\n")
    cat("• Range: [", round(min(data$group2), 2), ", ", round(max(data$group2), 2), "]\n\n", sep = "")
    
    cat("Difference (Group 1 - Group 2):\n")
    cat("• Mean difference: ", round(mean(data$group1) - mean(data$group2), 2), "\n")
    cat("• Pooled SD: ", round(sqrt((var(data$group1) + var(data$group2))/2), 2), "\n")
  })
  
  output$data_preview <- renderDT({
    req(generated_data())
    data <- generated_data()
    
    full_data <- data.frame(
      ID = 1:length(data$group1),
      Group1 = round(data$group1, 1),
      Group2 = round(data$group2, 1)
    )
    
    datatable(
      full_data,
      options = list(
        pageLength = 10,
        lengthMenu = c(10, 25, 50, 100),
        searching = TRUE,
        ordering = TRUE,
        autoWidth = TRUE,
        dom = 'lftip'
      ),
      class = 'display compact',
      rownames = FALSE,
      caption = "Interactive Data Table - Click column headers to sort, use search box to filter data",
      filter = 'top'
    ) %>%
      formatStyle(columns = 1:3, fontSize = '12px')
  })
  
  output$raw_data_plot <- renderPlot({
    req(generated_data())
    data <- generated_data()
    
    all_vals <- c(data$group1, data$group2)
    y_pad <- 2 * sd(all_vals)
    y_limits <- c(min(all_vals) - y_pad, max(all_vals) + y_pad)
    
    plot_data <- data.frame(
      Group = rep(c("Group 1", "Group 2"),
                  times = c(length(data$group1), length(data$group2))),
      Value = c(data$group1, data$group2)
    )
    
    ggplot(plot_data, aes(x = Group, y = Value, fill = Group)) +
      geom_boxplot(alpha = 0.7, outlier.alpha = 0.6) +
      geom_jitter(width = 0.25, alpha = 0.5, size = 1) +
      stat_summary(fun = mean, geom = "point",
                   shape = 23, size = 4, fill = "white", stroke = 1.5) +
      labs(title = "Boxplot + Jitter + Means (white diamonds)",
           y = "Score") +
      scale_fill_brewer(palette = "Set1") +
      theme_minimal(base_size = 14) +
      theme(legend.position = "bottom") +
      coord_cartesian(ylim = y_limits)
  })
  
  output$statistical_details <- renderPrint({
    req(test_results(), generated_data())
    test <- test_results()
    data <- generated_data()
    
    cat("=== TECHNICAL DETAILS ===\n\n")
    cat("TEST SPECIFICATIONS:\n")
    cat("• Test type: Two-sample t-test\n")
    cat("• Variances: ", ifelse(input$var_equal, "Equal", "Unequal"), "\n")
    cat("• Alternative hypothesis: Two-sided\n")
    cat("• α level: ", input$alpha_level, "\n")
    cat("• Confidence level: ", 100*(1-input$alpha_level), "%\n\n", sep = "")
    
    cat("DEGREES OF FREEDOM:\n")
    if(input$var_equal) {
      cat("• df = n₁ + n₂ - 2 = ", length(data$group1), " + ", length(data$group2),
          " - 2 = ", test$parameter, "\n", sep = "")
    } else {
      cat("• df = ", round(test$parameter, 1), " (Welch-Satterthwaite adjustment)\n", sep = "")
    }
    
    cat("\nEFFECT SIZE CALCULATIONS:\n")
    pooled_sd <- sqrt(((length(data$group1)-1)*var(data$group1) +
                         (length(data$group2)-1)*var(data$group2)) /
                        (length(data$group1) + length(data$group2) - 2))
    cat("• Cohen's d = (M₁ - M₂) / pooled SD\n")
    cat("• = (", round(mean(data$group1), 2), " - ", round(mean(data$group2), 2),
        ") / ", round(pooled_sd, 2), " = ",
        round((mean(data$group1) - mean(data$group2))/pooled_sd, 3), "\n", sep = "")
  })
  
  output$power_plot <- renderPlot({
    req(generated_data())
    data <- generated_data()
    
    sample_sizes <- seq(10, 999, by = 10)
    effect_size <- abs(cohens.d(data$group1, data$group2))
    
    power_approx <- pnorm(sqrt(sample_sizes/2) * effect_size - qnorm(1 - input$alpha_level/2))
    
    power_data <- data.frame(
      SampleSize = sample_sizes,
      Power = power_approx
    )
    
    current_power <- pnorm(sqrt(input$sample_size/2) * effect_size - qnorm(1 - input$alpha_level/2))
    
    ggplot(power_data, aes(x = SampleSize, y = Power)) +
      geom_line(color = "blue", size = 1.5) +
      geom_hline(yintercept = 0.8, linetype = "dashed", color = "red") +
      geom_vline(xintercept = input$sample_size, linetype = "dashed", color = "green") +
      geom_point(aes(x = input$sample_size, y = current_power),
                 color = "green", size = 3) +
      annotate("text", x = input$sample_size, y = 0.5,
               label = paste("Your study\nn =", input$sample_size, "\nPower =",
                             round(current_power, 2)),
               color = "green", hjust = -0.1) +
      annotate("text", x = 300, y = 0.82, label = "80% Power", color = "red") +
      labs(title = "Statistical Power by Sample Size",
           subtitle = paste("Assuming effect size d =", round(effect_size, 3),
                            "and α =", input$alpha_level),
           x = "Sample Size per Group", y = "Statistical Power") +
      ylim(0, 1) +
      theme_minimal()
  })
}

# ——————————————————————————————
# LAUNCH
# ——————————————————————————————
shinyApp(ui = ui, server = server)


5.4 Normal Distribution

Parameters

\(\quad \displaystyle \mu \in \mathbb{R} \text{ - mean location.}\)

\(\quad \displaystyle \sigma^{2} > 0 \text{ - variance (squared scale).}\)

\(\quad \displaystyle x \in \mathbb{R}.\)

Probability density function, expected value and variance

\[ \begin{equation} f(x)=\displaystyle{\frac{1}{(\sigma\sqrt{2 \pi})}} e^{\displaystyle{ -\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}} \end{equation} \]

Moment Value
Mean \(\mathbb{E}[X] = \mu\)
Variance \(\text{Var}(X) = \sigma^2\)
Skewness \(\gamma_1 = 0\)
Excess Kurtosis \(\gamma_2 = 0\)
MGF \(M_X(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}\)

Characteristics:

  • Symmetric about the mean \(\mu\)
  • Unimodal (single peak at \(x=\mu\))
  • Inflection points at \(x = \mu \pm \sigma\)
  • Fully determined by first two moments (mean and variance)
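
A quick numerical check of the last two properties (a minimal sketch, assuming the standard normal, so \(\mu = 0\) and \(\sigma = 1\)): the second derivative of the density is \(f''(x) = (x^2 - 1)\,f(x)\), negative inside \((\mu - \sigma,\ \mu + \sigma)\) and positive outside, which confirms the inflection points at \(\mu \pm \sigma\).

# Second derivative of the standard normal density: f''(x) = (x^2 - 1) * dnorm(x)
curvature <- function(x) (x^2 - 1) * dnorm(x)
curvature(c(-2, -1, 0, 1, 2))   # positive, zero, negative, zero, positive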

Comparison of Exact vs Approximate Intervals for \(N(50, 100)\)

Interval Exact Range Exact Coverage Empirical Rule Approximation Approximation Coverage
68% \([40.06,\ 59.94]\) 68.00% \([40,\ 60]\) ~68.27%
95% \([30.40,\ 69.60]\) 95.00% \([30,\ 70]\) ~95.45%
99.7% \([20.32,\ 79.68]\) 99.70% \([20,\ 80]\) ~99.73%

Key Observations:

  • 68% Interval:
    • Exact: \(\mu \pm 0.994\sigma\) → \([40.06,\ 59.94]\)
    • Empirical rule uses \(\pm 1\sigma\) → \([40,\ 60]\)
    • Nearly identical for practical purposes
  • 95% Interval:
    • Exact: \(\mu \pm 1.960\sigma\) → \([30.40,\ 69.60]\)
    • Empirical rule uses \(\pm 2\sigma\) → \([30,\ 70]\)
    • Approximation is slightly wider (95.45% vs 95.00% coverage)
  • 99.7% Interval:
    • Exact: \(\mu \pm 2.968\sigma\) → \([20.32,\ 79.68]\)
    • Empirical rule uses \(\pm 3\sigma\) → \([20,\ 80]\)
    • Approximation is slightly wider (99.73% vs 99.70% coverage)
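
The exact multipliers and the rule's coverages quoted above come straight from the normal quantile and distribution functions; a minimal check:

# Exact z-multipliers for 68%, 95%, and 99.7% central coverage
qnorm((1 + c(0.68, 0.95, 0.997)) / 2)         # approximately 0.994, 1.960, 2.968

# Coverage of the rounded ±1σ, ±2σ, ±3σ empirical rule
round(100 * (pnorm(1:3) - pnorm(-(1:3))), 2)  # 68.27, 95.45, 99.73
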
# Enhanced Normal Distribution Plot with Exact Percentages
par(mar = c(5, 4.5, 4, 2) + 0.1,  # Increased top and right margins
    mgp = c(3, 0.7, 0), 
    bg = "white", 
    cex.axis = 0.9, 
    cex.lab = 1,
    family = "sans")  # Consistent font family

# Standard normal parameters
mu <- 0
sigma <- 1

# Calculate EXACT quantiles using precise probabilities
z_values <- list(
  z_68 = qnorm((1 + 0.68)/2),   # exact multiplier for 68% coverage (≈ 0.9945σ)
  z_95 = qnorm((1 + 0.95)/2),   # exact multiplier for 95% coverage (≈ 1.9600σ)
  z_99 = qnorm((1 + 0.997)/2)   # exact multiplier for 99.7% coverage (≈ 2.9677σ)
)

# Generate normal curve with extended range
x <- seq(mu - 4*sigma, mu + 4*sigma, length = 2000)
pdf <- dnorm(x, mu, sigma)

# Create plot with professional styling
plot(x, pdf, type = "l", lwd = 3, col = "#2c3e50",
     xlab = expression(paste("Standard Deviations from Mean (", sigma, ")")),
     ylab = "Probability Density",
     main = "Exact Normal Distribution Coverage\n(Compared to Common Approximations)",
     xaxt = "n", bty = "L", cex.main = 1.1, las = 1, 
     yaxs = "i", ylim = c(0, max(pdf)*1.1),
     panel.first = {
       grid(col = "gray90", lty = 2)
       abline(v = mu, col = "gray70", lwd = 1)
     })

# Custom x-axis with exact values and Greek symbols
axis(1, at = c(mu - z_values$z_99, mu - z_values$z_95, mu - z_values$z_68, 
               mu, 
               mu + z_values$z_68, mu + z_values$z_95, mu + z_values$z_99),
     labels = c(
       expression(paste("-2.748", sigma)),
       expression(paste("-1.960", sigma)),
       expression(paste("-0.994", sigma)),
       expression(mu),
       expression(paste("+0.994", sigma)),
       expression(paste("+1.960", sigma)),
       expression(paste("+2.748", sigma))
     ), padj = 0.5)

# Improved shading function with borders
shade_region <- function(z, col, label_pos, label_col = "darkred") {
  polygon(c(x[x >= -z & x <= z], z, -z),
          c(pdf[x >= -z & x <= z], 0, 0),
          col = adjustcolor(col, 0.35), border = col, lwd = 0.7)
  text(0, dnorm(0)*label_pos, 
       sprintf("%.2f%%", 100*(pnorm(z)-pnorm(-z))),
       col = label_col, font = 2, cex = 0.9)
}

# Shade regions with improved colors
shade_region(z_values$z_68, "#3498db", 0.7)  # 68% central region
shade_region(z_values$z_95, "#e67e22", 0.5)  # 95% central region
shade_region(z_values$z_99, "#e74c3c", 0.3)  # 99.7% central region

# Add reference lines with exact values
abline(v = c(-z_values$z_99, -z_values$z_95, -z_values$z_68, 
             z_values$z_68, z_values$z_95, z_values$z_99), 
       col = "gray50", lty = 3, lwd = 1.2)

# Enhanced legend with exact values
legend("topright", inset = 0.02,
       legend = c(
         expression(paste("μ ± 0.994", sigma, " (68.27%)")),
         expression(paste("μ ± 1.960", sigma, " (95.45%)")),
         expression(paste("μ ± 2.748", sigma, " (99.73%)"))
       ),
       fill = adjustcolor(c("#3498db", "#e67e22", "#e74c3c"), 0.35),
       border = c("#3498db", "#e67e22", "#e74c3c"),
       bty = "n", cex = 0.85, y.intersp = 1.2)

# Informative footnote with comparison
mtext(expression(paste("Common approximations: 68% (1", sigma, "), 95% (2", sigma, "), 99.7% (3", sigma, ")")),
      side = 1, line = 3.5, cex = 0.75, col = "gray30", adj = 0)

# Add subtle credit line
mtext("Exact normal distribution probabilities", 
      side = 1, line = 4.5, cex = 0.6, col = "gray50", adj = 1)


R Implementation of Normal Distribution Probability Density Function (PDF)

#' Normal Distribution Probability Density Function (PDF)
#'
#' Computes the probability density function for a normal distribution with specified
#' mean and standard deviation. Can return either PDF values for given x-values or
#' generate a sequence of x-values covering ±3.5 standard deviations from the mean.
#'
#' @param x Numeric vector of values at which to evaluate the PDF. If NULL (default),
#'           generates a sequence of n values spanning ±3.5 standard deviations.
#' @param mu Mean of the normal distribution (default: 0)
#' @param sd Standard deviation of the normal distribution (default: 1)
#' @param n Number of points to generate when x is NULL (default: 100)
#'
#' @return Numeric vector of probability densities
#'
#' @examples
#' # Default standard normal
#' normal_pdf()
#'
#' # Evaluate at specific points
#' normal_pdf(x = c(-1, 0, 1))
#'
#' # Custom normal distribution
#' normal_pdf(mu = 50, sd = 10, n = 200)
#'
#' @export
#' 
normal_pdf <- function(x = NULL, mu = 0, sd = 1, n = 100) {
  # Input validation
  if (!is.null(x) && !is.numeric(x)) {
    stop("x must be numeric or NULL")
  }
  if (!is.numeric(mu) || length(mu) != 1) {
    stop("mu must be a single numeric value")
  }
  if (!is.numeric(sd) || length(sd) != 1 || sd <= 0) {
    stop("sd must be a single positive numeric value")
  }
  if (!is.numeric(n) || length(n) != 1 || n <= 0) {
    stop("n must be a single positive integer")
  }
  
  # Generate x values if not provided
  if (is.null(x)) {
    x <- seq(from = mu - 3.5 * sd, 
             to = mu + 3.5 * sd, 
             length.out = n)
  }
  
  # Vectorized PDF calculation
  y <- (1 / (sqrt(2 * pi) * sd)) * exp(-((x - mu)^2 / (2 * sd^2)))
  
  # Return named vector
  stats::setNames(y, x)
}

# Save function to file
dump("normal_pdf", file = "normal_pdf.R")
# Parameters
n <- 30       # Sample size
mean <- 50       # Population mean
sd <- 10       # Population standard deviation

# Create normal continuum
x <- seq(mean-4*sd, mean+4*sd, length.out=1024)  # 4 SD range with 1024 points
y <- normal_pdf(x, mu=mean, sd=sd)                # Normal PDF values

# Generate random sample
set.seed(123)                                # For reproducibility
x.norm <- rnorm(n, mean, sd)                   # Random normal sample

# Create plot
plot(x, y, 
     col = adjustcolor("darkgray", alpha.f=0.8), 
     type = "l",
     lwd = 6,
     ylim = c(0, 0.045),
     xlab = "x",
     ylab = "Density f(x)",
     main = paste("Normal Distribution (μ =", mean, ", σ =", sd, ")"),
     panel.first = grid()
)

# Add random normal density
lines(density(x.norm, bw="nrd0"),  # Optimal bandwidth
      lwd = 3,
      col = "black")

# Add reference lines
abline(v = mean(x.norm), lty=2, lwd=2, col="black")    # Sample mean
abline(v = mean, lty=2, lwd=2, col="darkgray")           # Population mean

# Add statistical annotations
legend("topright",
       legend = c(paste("Normal distribution N(", mean, ",", sd^2, ")"), 
                  "Normaly distributed random data",
                  paste("Sample Mean (", round(mean(x.norm),2), ")"),
                  paste("Population Mean (", mean, ")")),
       col = c("darkgray", "black", "black", "darkgray"),
       lty = c(1, 1, 2, 2),
       lwd = c(4, 3, 2, 2),
       bty = "n")

# Add standard deviation indicators
arrows(x0 = mean, y0 = 0.03,
       x1 = mean + sd, y1 = 0.03,
       length = 0.1, angle = 15, code = 3, col = "darkgray")
text(x = mean + sd/2, y = 0.032, 
     labels = expression(paste(sigma, " = 10")),
     pos = 3, cex = 0.9, col = "darkgray")

Ex 1: What is the probability density of a student scoring exactly 600 on the SAT if scores are \(N(\mu = 500,\ \sigma = 100)\)?

Given:

  • \(x = 600\) (SAT score)
  • \(\mu = 500\) (mean)
  • \(\sigma = 100\) (standard deviation)

Step 1: Write the PDF Formula \[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]

Step 2: Plug in Values \[ f(600) = \frac{1}{100 \sqrt{2\pi}} e^{-\frac{(600 - 500)^2}{2 \times 100^2}} \]

Step 3: Simplify the Exponent \[ (600 - 500)^2 = 100^2 = 10{,}000 \] \[ 2 \times 100^2 = 20{,}000 \] \[ \frac{10{,}000}{20{,}000} = 0.5 \]

Step 4: Compute Constants \[ \sqrt{2\pi} \approx 2.5066 \quad \text{and} \quad e^{-0.5} \approx 0.6065 \]

Step 5: Final Calculation \[ f(600) = \frac{1}{100 \times 2.5066} \times 0.6065 \] \[ f(600) \approx \frac{0.6065}{250.66} \approx 0.002419 \]

Result The probability density at \(x = 600\) is \(≈0.00242\).

Verification with the R function:

normal_pdf(x = 600, mu = 500, sd = 100)
##         600 
## 0.002419707

Interpretation

  • The probability density of an SAT score of exactly 600 (given \(N(500, 100)\)) is \(\approx 0.00242\).
  • This is the relative likelihood (height of the PDF curve) at x = 600, not a probability.
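
To make the density-versus-probability distinction concrete, a small check (assuming the same \(\mu = 500\), \(\sigma = 100\)) compares the exact probability of scoring within ±5 points of 600 with the rectangle approximation \(f(600) \times 10\):

# Exact probability of a score in [595, 605]
pnorm(605, mean = 500, sd = 100) - pnorm(595, mean = 500, sd = 100)   # approximately 0.0242

# Rectangle approximation: density at 600 times the interval width
dnorm(600, mean = 500, sd = 100) * 10                                 # approximately 0.0242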

Comparison - At the mean (x = 500):

normal_pdf(x = 500, mu = 500, sd = 100)  # Output: 500: 0.003989423
##         500 
## 0.003989423

(Higher density, as expected for the peak of the curve.)

  • At x = 400 (same distance from mean):
normal_pdf(x = 400, mu = 500, sd = 100)  # Output: 400: 0.002419707
##         400 
## 0.002419707

(Identical density to x = 600 due to symmetry.)

Normal Probability Density \((\mu = 500, \sigma = 100)\)

Score (x) Distance from Mean PDF Value (f(x)) Relative to Peak Interpretation
400 -1σ 0.00242 60.7% 1 standard deviation below
500 0σ (mean) 0.00399 100% Peak of the distribution
600 +1σ 0.00242 60.7% 1 standard deviation above

Key Properties:

  • Symmetry: Equal density at equal \(\sigma\)-distances from mean (\(f(400) = f(600)\))
  • Peak: Maximum density occurs at the mean (\(x = \mu\))
  • 68-95-99.7 Rule:
    • 68% of data falls between 400–600 (\(\pm 1\sigma\))
    • 95% between 300–700 (\(\pm 2\sigma\))
    • 99.7% between 200–800 (\(\pm 3\sigma\))
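
These three percentages are easy to verify with pnorm (a quick check, reusing \(\mu = 500\) and \(\sigma = 100\)):

k <- 1:3
round(pnorm(500 + k * 100, 500, 100) - pnorm(500 - k * 100, 500, 100), 4)
# approximately 0.6827, 0.9545, 0.9973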

Example Analysis

# Install packages if needed
if (!require(pracma)) install.packages("pracma", dependencies = TRUE)
## Loading required package: pracma
if (!require(tidyr)) install.packages("tidyr", dependencies = TRUE)
## Loading required package: tidyr
library(pracma)
library(ggplot2)
library(tidyr)

# Observed score
x_obs <- 600

# Define theta (mu) range
theta_vals <- seq(300, 900, length.out = 600)

# Known standard deviation
sigma <- 100

# Likelihood function: PDF of N(theta, sigma^2) at x = 600
likelihood <- dnorm(x_obs, mean = theta_vals, sd = sigma)

# Prior: Normal(480, 50^2)
prior <- dnorm(theta_vals, mean = 480, sd = 50)

# Unnormalized posterior
posterior_unnorm <- prior * likelihood

# Normalize the posterior
posterior <- posterior_unnorm / trapz(theta_vals, posterior_unnorm)

# Create a data frame for plotting
df <- data.frame(
  theta = theta_vals,
  Prior = prior,
  Likelihood = likelihood,
  Posterior = posterior
)

# Reshape to long format for ggplot
df_long <- pivot_longer(df, cols = c("Prior", "Likelihood", "Posterior"),
                        names_to = "Distribution", values_to = "Density")

# Define colors to use consistently
dist_colors <- c("Prior" = "green", "Likelihood" = "blue", "Posterior" = "purple")

# Create colored subtitle with markdown-style colors using element_markdown (requires ggtext package)
if (!require(ggtext)) install.packages("ggtext", dependencies = TRUE)
## Loading required package: ggtext
library(ggtext)

subtitle_text <- paste0(
  "<span style='color:green;'>Prior</span> × ",
  "<span style='color:blue;'>Likelihood</span> → ",
  "<span style='color:purple;'>Posterior</span>"
)

# Plot
# Set factor levels to control legend order
df_long$Distribution <- factor(df_long$Distribution, levels = c("Prior", "Likelihood", "Posterior"))

# Build the plot with consistent colors and line types

ggplot(df_long, aes(x = theta, y = Density, color = Distribution, linetype = Distribution)) +
  geom_line(size = 1) +
  geom_vline(xintercept = x_obs, color = "red", linetype = "dotted") +
  scale_color_manual(values = dist_colors) +
  scale_linetype_manual(values = c("Prior" = "dashed", "Likelihood" = "dashed", "Posterior" = "solid")) +
  labs(
    title = "Bayesian Inference for SAT Score Observation (x = 600)",
    subtitle = subtitle_text,
    x = expression(theta~"(Mean SAT Score)"),
    y = "Density",
    color = "Distribution",
    linetype = "Distribution"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    plot.subtitle = element_markdown(size = 12, hjust = 0.5),
    legend.position = "top"
  )
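
Because a normal prior combined with a normal likelihood with known \(\sigma\) is conjugate, the posterior above is also normal and has a closed form. The short check below (a sketch using the same assumed prior \(N(480, 50^2)\), \(\sigma = 100\), and observation \(x = 600\)) should agree closely with the grid-based posterior in the plot:

# Closed-form normal-normal update (known sigma)
prior_mean <- 480; prior_sd <- 50
sigma <- 100; x_obs <- 600

post_prec <- 1 / prior_sd^2 + 1 / sigma^2                             # posterior precision
post_var  <- 1 / post_prec                                            # = 2000
post_mean <- post_var * (prior_mean / prior_sd^2 + x_obs / sigma^2)

post_mean        # 504: pulled from the prior mean (480) toward the observation (600)
sqrt(post_var)   # approximately 44.7, smaller than the prior SD of 50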


Cumulative Distribution Function (cdf) Plot: Normal Distribution

# Generate data (if not already created)
set.seed(123)
x.norm <- rnorm(30, mean = 50, sd = 10)

# Create CDF plot with enhanced features
plot(ecdf(x.norm), 
     verticals = FALSE,         # Don't draw vertical connecting lines
     do.points = FALSE,         # Don't show individual points
     col.01line = "gray70",     # Color for 0-1 lines
     lwd = 4, 
     col = "lightgray", 
     xlab = "Value (k)", 
     ylab = "Cumulative Probability F(k)",
     main = "Normal Distribution CDF (μ=50, σ=10)",
     cex.main = 1.1
)

# Add population normal CDF
curve(pnorm(x, mean = 50, sd = 10), 
      add = TRUE, 
      col = "black", 
      lwd = 1, 
      lty = 1)

# Add legend
legend("topleft",
       legend = c("Random sample data CDF", "Population CDF"),
       col = c("lightgray", "black"),
       lwd = c(3, 1),
       lty = c(1, 1),
       bty = "n")

# Add key percentiles
q <- quantile(x.norm, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
abline(v = q, col = "gray50", lty = 3)
text(x = q, y = 0.1, 
     labels = names(q), 
     pos = 3, 
     cex = 0.8, 
     col = "gray30")

# Add distribution parameters
text(x = min(x.norm) + 5, y = 0.9,
     labels = bquote(paste("Sample: ", 
                           bar(x) == .(round(mean(x.norm), 2)), 
                           ", s = ", .(round(sd(x.norm), 2)))),
     pos = 4, cex = 0.9)

text(x = min(x.norm) + 5, y = 0.8,
     labels = bquote(paste("population: ", 
                           mu == 50, 
                           ", ", sigma == 10)),
     pos = 4, cex = 0.9)


What is the probability that a randomly selected student scores less than 75 on an exam with a mean of 65 and a standard deviation of 10?

library(shape)

# Probability of scoring ≤ 75
pnorm(75, mean = 65, sd = 10) 
## [1] 0.8413447
# Parameters
mean <- 65
sd <- 10
cutoff <- 75

# Create sequence of x values
x <- seq(mean - 4*sd, mean + 4*sd, length.out = 1000)

# Calculate PDF and CDF
pdf <- dnorm(x, mean, sd)
cdf <- pnorm(x, mean, sd)

# Plot PDF
plot(x, pdf, type = "l", lwd = 2, col = "blue",
     xlab = "Test Score", ylab = "Density",
     main = "Normal Distribution: P(X ≤ 75)")

# Shade area for P(X ≤ 75)
x_shade <- seq(mean - 4*sd, cutoff, length.out = 1000)
y_shade <- dnorm(x_shade, mean, sd)
polygon(c(x_shade, cutoff), c(y_shade, 0), col = "skyblue", border = NA)

# Add cutoff line
abline(v = cutoff, col = "red", lty = 2, lwd = 2)

# Calculate probability
prob <- round(pnorm(cutoff, mean, sd), 3)

# Add probability INSIDE the shaded area (centered)
text(x = mean - .5*sd,  # Position in the middle-left of shaded area
     y = dnorm(mean, mean, sd)/3,  # Vertical position
     labels = bquote(P(X <= .(cutoff)) == .(prob)),
     col = "darkblue", cex = 1)

# Add parameters
legend("topright", 
       legend = c(paste0("μ = ", mean), paste0("σ = ", sd)),
       bty = "n")

# Create CDF plot
plot(x, cdf, type = "l", lwd = 3, col = "blue",
     xlab = "Test Score", ylab = "Cumulative Probability",
     main = "Normal CDF: P(X ≤ 75)")

# Add tracing lines
prob <- pnorm(cutoff, mean, sd)

# horizontal tracing line
segments(mean - 5*sd, prob, cutoff, prob, col = "red", lty = 1, lwd = 1)
# Left arrow (y-axis side)
Arrows(mean - 4.1*sd, prob, mean - 4*sd + 0.5, prob, code = 1,
       col = "red", arr.type = "triangle", arr.length = 0.4)

# vertical tracing line
segments(cutoff, -1, cutoff, prob, col = "red", lty = 1, lwd = 1)

points(cutoff, prob, pch = 19, col = "white", cex = .5)
points(cutoff, prob, pch = 19, col = "blue", cex = .4)

# Add probability annotation
text(mean - 1.5*sd, prob + 0.05, 
     bquote(P(X <= .(cutoff)) == .(round(prob, 3))),
     col = "red", cex = 1)

# Add axis markers
axis(1, at = cutoff, labels = round(cutoff, 2), las = 1, col.axis = "red")
axis(2, at = prob, labels = round(prob, 2), las = 1, col.axis = "red")

# Add grid
grid()


Comparison of the Normal and t-Distributions

The t-distribution is similar to the normal distribution but has heavier tails, which makes it more suitable for small sample sizes or when the population variance is unknown.

  • As the degrees of freedom (df) increase, the t-distribution approaches the normal distribution.
  • When df ≥ 30, the t-distribution is well approximated by the normal distribution.
  • For smaller degrees of freedom, the t-distribution has heavier tails, reflecting greater uncertainty.
Property Normal Distribution (Z) t-Distribution
Shape Fixed bell curve Heavier tails
Parameters μ (mean), σ (std dev) μ (mean), s (scale), ν (df)
Tail Behavior Light tails (kurtosis = 3) Heavy tails (kurtosis > 3)
Variance σ² (constant) ν/(ν-2) for ν > 2
Primary Use Case Known population variance Small samples (unknown σ²)

Convergence Behavior

  • As degrees of freedom (ν) increase:
    • t-distribution approaches normal distribution (𝒩(0,1))
    • Convergence is mathematically guaranteed as ν → ∞
  • Practical convergence threshold:
    • At ν = 30: maximum CDF difference < 0.01
  • For 95% confidence intervals:
    • Normal: μ ± 1.96σ
    • t(df=5): μ ± 2.57σ (wider interval)
    • t(df=30): μ ± 2.04σ (closer to normal)
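
A one-line check of the interval multipliers quoted above (two-sided 95% critical values):

qnorm(0.975)        # approximately 1.96 (normal)
qt(0.975, df = 5)   # approximately 2.57 (t with 5 df)
qt(0.975, df = 30)  # approximately 2.04 (t with 30 df)
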
# Define the support range
x <- seq(-3.5, 3.5, length.out = 500)

# Set up the plot canvas
plot(x, dnorm(x), 
     type = 'l', 
     lwd = 4,
     col = "black",
     ylim = c(0, 0.42),
     xlab = "Value",
     ylab = "Probability Density",
     main = "Normal Distribution vs Student's t-Distribution",
     cex.main = 1.2,
     panel.first = grid())

# Define degrees of freedom to compare
dfs <- c(1, 2, 5, 30)
colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3")  # ColorBrewer palette

# Add t-distributions
for (i in seq_along(dfs)) {
  lines(x, dt(x, df = dfs[i]), 
        col = colors[i], 
        lwd = 3, 
        lty = i+1)
}

# Add reference lines
abline(v = 0, col = "gray50", lty = 2)
abline(h = 0, col = "gray50", lty = 2)

# Add informative legend
legend("topright",
       legend = c(expression(paste("Normal (", mu==0, ", ", sigma==1, ")")),
                  expression(paste("t (df==", 1, ")")),
                  expression(paste("t (df==", 2, ")")),
                  expression(paste("t (df==", 5, ")")),
                  expression(paste("t (df==", 30, ")"))),
       col = c("black", colors),
       lty = c(1, 2:5),
       lwd = 3,
       bty = "n",
       cex = 0.9)

# Add tail behavior annotations
text(x = -2.8, y = 0.05, 
     labels = "Heavier tails\nfor lower df", 
     pos = 3, cex = 1, col = "darkred")
arrows(x0 = -2.8, y0 = 0.09, 
       x1 = -2.8, y1 = 0.15, 
       length = 0.1, col = "darkred")

# Add convergence annotation
text(x = 0, y = 0.1,
     labels = "t → Normal as df → ∞",
     pos = 3, cex = 1, col = "darkblue")


5.5 Chi-Squared Distribution

The chi-squared \((\chi^2)\) distribution is a continuous probability distribution that arises when:

  • Summing the squares of independent standard normal variables
  • Analyzing categorical data in contingency tables
  • Conducting goodness-of-fit tests

It plays a key role in:

  • Pearson’s chi-squared tests
  • Variance estimation
  • Likelihood ratio tests

Parameters

k > 0 degrees of freedom (shape parameter)
Determines the distribution’s shape and spread

Support
\(x \in [0, +\infty)\)

Probability Density Function (PDF) The PDF of the chi-squared distribution is:

\[ f(x; k) = \begin{cases} \frac{x^{k/2-1}e^{-x/2}}{2^{k/2}\Gamma(k/2)} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases} \]

where:
- \(\Gamma\) = Gamma function

Moments:

Moment Formula Defined When
Expected Value k k > 0
Variance 2k k > 0
Skewness \(\sqrt{8/k}\) k > 0
Excess Kurtosis 12/k k > 0

5.6 Key Properties

  • Shape:
    • Right-skewed, becomes more symmetric as k increases
    • Approaches N(k, 2k) as k → ∞
  • Relationships:
    • Special case of Gamma distribution (shape = k/2, scale = 2)
    • Sum of independent χ² variables is χ² (degrees of freedom add)
  • Applications:
    • Hypothesis testing for categorical data
    • Confidence intervals for variance
    • Spectral analysis
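
Two of the relationships listed above are easy to verify numerically; a minimal sketch (degrees of freedom chosen arbitrarily for illustration):

# Chi-squared with k df is a Gamma(shape = k/2, scale = 2) distribution
all.equal(dchisq(4.3, df = 6), dgamma(4.3, shape = 3, scale = 2))   # TRUE

# Sum of independent chi-squared variables: degrees of freedom add
set.seed(1)
s <- rchisq(1e5, df = 3) + rchisq(1e5, df = 4)
c(mean(s), var(s))   # approximately 7 and 14, the mean and variance of chi-squared(7)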

R Implementation: Probability Density Function

#' Chi-Squared Distribution Probability Density Function
#'
#' Computes the PDF of the chi-squared distribution at given quantiles.
#'
#' @param k Degrees of freedom (k > 0)
#' @param x Numeric vector of quantiles (default: seq(0, 20, length.out = 1000))
#' @return Numeric vector of probability densities
#' @export
#' @examples
#' chisq_pdf(k = 5, x = 3.2)  # Density at x=3.2 with 5 df
chisq_pdf <- function(k, x = seq(0, 20, length.out = 1000)) {
  if (any(k <= 0)) stop("Degrees of freedom (k) must be positive")
  dgamma(x, shape = k/2, rate = 0.5)  # Using Gamma relationship
}
#' Chi-Squared Distribution CDF
#'
#' Computes P(X ≤ x) for chi-squared distribution with k degrees of freedom.
#'
#' @param x Numeric vector of quantiles
#' @param k Degrees of freedom (k > 0)
#' @return Cumulative probabilities
#' @export
#' @examples
#' chisq_cdf(10, k = 5)  # P(X ≤ 10) with 5 df
chisq_cdf <- function(x, k) {
  if (any(k <= 0)) stop("Degrees of freedom (k) must be positive")
  pgamma(x, shape = k/2, rate = 0.5)
}
#' Chi-Squared Distribution Visualization
#'
#' Creates a publication-quality plot of the chi-squared distribution.
#'
#' @param chisq_stat Calculated chi-squared statistic
#' @param df Degrees of freedom (k > 0)
#' @param p_value Calculated p-value
#' @param alpha Significance level (default = 0.05)
#' @param n Number of points for smooth curve (default = 1000)
#' @return A ggplot object
#' @export
#' @examples
#' chisq_plot(chisq_stat = 8.5, df = 5, p_value = 0.075)
chisq_plot <- function(chisq_stat, df, p_value, alpha = 0.05, n = 1000) {
  require(ggplot2)
  
  critical <- qchisq(1 - alpha, df)
  x_max <- max(critical * 1.5, chisq_stat * 1.2)
  x <- seq(0, x_max, length.out = n)
  plot_data <- data.frame(x = x, y = dchisq(x, df))
  
  ggplot(plot_data, aes(x, y)) +
    geom_line(color = "#333333", linewidth = 1.2) +
    geom_area(data = subset(plot_data, x <= critical), 
              aes(fill = "Acceptance"), alpha = 0.6) +
    geom_area(data = subset(plot_data, x >= critical), 
              aes(fill = "Rejection"), alpha = 0.6) +
    geom_vline(xintercept = critical, linetype = "dashed", color = "#76B7B2") +
    geom_vline(xintercept = chisq_stat, color = "#F28E2B", linewidth = 1.5) +
    annotate("label", x = chisq_stat, y = max(plot_data$y) * 0.85,
             label = sprintf("χ² = %.2f\np = %.3f", chisq_stat, p_value),
             hjust = ifelse(chisq_stat > critical, 1.1, -0.1)) +
    scale_fill_manual(values = c("Acceptance" = "#4E79A7", "Rejection" = "#E15759")) +
    labs(title = bquote(Chi^2 ~ Distribution ~ (df == .(df))),
         x = "Chi-Squared Value", y = "Density") +
    theme_minimal()
}

# Save all functions
dump(c("chisq_pdf", "chisq_cdf", "chisq_plot"), file = "chi_squared_functions.R")
# Typical usage (non-significant example)
chisq_plot(chisq_stat = 15, df = 10, p_value = 0.132)

# Significant result example
chisq_plot(chisq_stat = 21, df = 10, p_value = 0.021)

#' Chi-Squared Variance Confidence Interval
#'
#' Computes confidence intervals for population variance using custom chi-square functions
#'
#' @param x Numeric vector of sample data
#' @param alpha Significance level (default = 0.05)
#' @return List containing point estimate and CI
#' @export
#' @examples
#' battery_data <- c(23, 25, 21, 28, 24, 26, 22, 27, 25, 24)
#' chisq_var_ci(battery_data)
chisq_var_ci <- function(x, alpha = 0.05) {
  n <- length(x)
  s2 <- var(x)
  df <- n - 1
  
  # Chi-squared critical values for the two-sided interval
  chi_upper <- qchisq(alpha/2, df)
  chi_lower <- qchisq(1 - alpha/2, df)
  
  ci <- c(
    lower = (n-1)*s2 / chi_lower,
    upper = (n-1)*s2 / chi_upper
  )
  
  list(
    variance_estimate = s2,
    degrees_of_freedom = df,
    confidence_level = 1 - alpha,
    ci = ci
  )
}

#' Chi-Square Critical Value Finder
#'
#' @param alpha Right-tail probability
#' @param df Degrees of freedom
#' @export
chisq_critical <- function(alpha, df) {
  qchisq(1 - alpha, df)
}

5.7 Uniform Distribution

Parameters

\(\quad \displaystyle -\infty < a < b < \infty \text{ - minimum and maximum values.}\)

\(\quad \displaystyle x \in \lbrack a,b \rbrack.\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x) = \begin{cases}\dfrac{1}{b-a}& \text{for } a < x < b \\ \\ \quad 0 & \text{otherwise} \end{cases}\).

Moment Formula Notes
Mean \(\displaystyle E(X) = \frac{a + b}{2}\) Midpoint of \([a, b]\)
Variance \(\displaystyle Var(X) = \frac{(b - a)^2}{12}\) Scales with interval width squared
Skewness \(\displaystyle \gamma_1 = 0\) Symmetric distribution
Ex. Kurtosis \(\displaystyle \gamma_2 = -\frac{6}{5}\) Flatter than normal distribution

Key Properties:

  • Support: \(x \in [a, b]\)
  • PDF: \(\displaystyle f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases}\)
  • CDF: \(\displaystyle F(x) = \frac{x - a}{b - a}\) for \(x \in [a, b]\)

Probability Density Function (pdf): Uniform Distribution

# Parameters
n <- 300       # Sample size
a <- 0           # Minimum
b <- 1           # Maximum

# Generate data
set.seed(123)    # For reproducibility
x.unif <- runif(n, min = a, max = b)
x <- seq(a, b, length.out = n)
d.unif <- dunif(x, min = a, max = b)

# Create PDF plot
plot(x, d.unif,
     type = "l",
     ylim = c(0, max(d.unif)*1.2),
     col = adjustcolor("black", alpha.f = 0.8),
     lwd = 4,
     xlab = "x",
     ylab = "Probability Density f(x)",
     main = "Uniform Distribution PDF (a=0, b=1)",
     panel.first = grid()
)

# Add sample density (kernel density estimate)
lines(density(x.unif, adjust = 0.5),  # Adjusted bandwidth
      lwd = 4,
      col = adjustcolor("lightgray", alpha.f = 0.5))

# Add reference lines
abline(v = mean(x.unif), col = "black", lty = 2, lwd = 2)  # Sample mean
abline(v = (a+b)/2, col = "lightgray", lty = 2, lwd = 2)   # population mean

# Add statistical annotations
legend("bottom",
       legend = c("Population PDF", "Random Sample Data Density",
                  "Sample Mean", "Population Mean"),
       col = c("black", "lightgray", "black", "lightgray"),
       lty = c(1, 1, 2, 2),
       lwd = 2,
       bty = "n")

text(x = 0.1, y = .5,
     labels = bquote(paste("Population distribution: ", 
                           E(X) == .((a+b)/2), ", ",
                           Var(X) == .((b-a)^2/12))),
     pos = 4, cex = 0.9)

text(x = 0.1, y = .45,
     labels = bquote(paste("Random uniform data: ", 
                           bar(x) == .(round(mean(x.unif), 4)), ", ",
                           s^2 == .(round(var(x.unif), 4)))),
     pos = 4, cex = 0.9)


Ex: If students’ scores on a pass/fail exam are uniformly distributed between \(50-100\%\), what percentage scored between \(70-85\%\)?

# Calculate probability
prob <- punif(85, min = 50, max = 100) - punif(70, min = 50, max = 100)
prob # 0.3 (30%)
## [1] 0.3
# Enhanced plot
library(ggplot2)
ggplot(data.frame(x = c(0, 100)), aes(x)) +
  stat_function(fun = dunif, args = list(min = 50, max = 100), 
                lwd = 1.5, color = "blue") +
  geom_area(stat = "function", fun = dunif, args = list(min = 50, max = 100),
            xlim = c(70, 85), fill = "red", alpha = 0.3) +
  # Add vertical lines at boundaries
  geom_vline(xintercept = c(70, 85), linetype = "dashed", color = "red") +
  # Add probability label
  annotate("text", x = 77.5, y = 0.015, 
           label = paste0("P = ", prob), color = "darkred", size = 3) +
  labs(title = "Uniform Distribution of Exam Scores (50-100%)",
       subtitle = "Probability of scores between 70-85",
       x = "Score (%)", y = "Density") +
  scale_x_continuous(breaks = c(50, 70, 85, 100)) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))


Uniform Distribution Analysis using the CDF Function

Given:

  • Uniform Distribution Range: Scores are uniformly distributed between 50% and 100%.
  • Interval of Interest: Percentage of students who scored between 70% and 85%.

Step 1: Understand the Uniform Distribution

For a continuous uniform distribution over the interval [a, b], we have:

Probability Density Function (PDF): \[ f(x) = \begin{cases} \frac{1}{b-a} & \text{for } a \leq x \leq b \\ 0 & \text{otherwise} \end{cases} \]

Given \(a = 50\) and \(b = 100\):

\[ f(x) = \frac{1}{100-50} = \frac{1}{50} \]

Uniform Cumulative Distribution Function (CDF): \[ F(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } a \leq x \leq b \\ 1 & \text{for } x > b \end{cases} \]

Substituting values: \[ F(x) = \frac{x-50}{50} \]

Step 2: Calculate Probability for [70, 85]

The probability that a score \(X\) falls between 70% and 85% is: \[ P(70 \leq X \leq 85) = F(85) - F(70) \]

Compute F(85): \[ F(85) = \frac{85-50}{50} = \frac{35}{50} = 0.7 \]

Compute F(70): \[ F(70) = \frac{70-50}{50} = \frac{20}{50} = 0.4 \]

Final Probability: \[ P(70 \leq X \leq 85) = 0.7 - 0.4 = 0.3 \quad \text{(30%)} \]

# Define the uniform CDF
uniform_cdf <- function(x, a = 50, b = 100) {
  ifelse(x < a, 0, ifelse(x > b, 1, (x - a) / (b - a)))
}

# Generate x-values
x <- seq(40, 110, by = 0.1)
cdf_values <- uniform_cdf(x)

# Plot the CDF
plot(x, cdf_values, type = "l", lwd = 2, col = "blue",
     main = "Uniform CDF: P(70 ≤ X ≤ 85) = F(85) - F(70)",
     xlab = "Exam Score (%)", ylab = "Cumulative Probability (F(x))",
     xlim = c(40, 110), ylim = c(0, 1.1))

# Shade the area between 70 and 85
polygon(c(70, seq(70, 85, 0.1), 85), 
        c(0, uniform_cdf(seq(70, 85, 0.1)), 0),
        col = "lightblue", border = NA)

# Add vertical lines at key points
abline(v = c(50, 70, 85, 100), lty = 2, col = "gray")

# Add horizontal tracing lines for F(70) and F(85)
abline(h = uniform_cdf(70), lty = 2, col = "red")  # F(70) = 0.4
abline(h = uniform_cdf(85), lty = 2, col = "red")  # F(85) = 0.7

# Annotate the difference (probability)
arrows(x0 = 70, y0 = uniform_cdf(70), 
       x1 = 70, y1 = uniform_cdf(85),
       code = 3, length = 0.1, col = "darkred", lwd = 2)
text(65, 0.55, "F(85) - F(70) = 0.3", col = "darkred", cex = 1.2, pos = 2)

# Label key points
text(70, -0.05, "70%", pos = 1, col = "red", xpd = TRUE)
text(85, -0.05, "85%", pos = 1, col = "red", xpd = TRUE)
text(50, -0.05, "50%", pos = 1, col = "red", xpd = TRUE)
text(100, -0.05, "100%", pos = 1, col = "red", xpd = TRUE)

# Add CDF formula
text(70, 0.9, expression(F(x) == frac(x - 50, 50)), cex = 2)


Visualization of the Uniform CDF

The CDF is a straight line from (50, 0) to (100, 1) with slope \(\frac{1}{50}\).

Key points:

  • $ F(50) = 0 $
  • $ F(70) = 0.4 $
  • $ F(85) = 0.7 $
  • $ F(100) = 1 $

The probability corresponds to the vertical difference between F(85) and F(70) on the CDF plot.


5.8 Gamma Distribution

Parameters

\(\quad \alpha > 0\, - \, shape.\)

\(\quad \beta > 0\, - \, scale.\)

\(\quad x \in (0, \infty).\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x) = \frac{1}{\Gamma(\alpha) \beta^{\alpha}} x^{\alpha-1}e^{-\frac{x}{\beta}} \quad \text{where } \alpha > 0 \text{ (shape) and } \beta > 0 \text{ (scale)}\).

Moment Formula Parameters
Mean \(\displaystyle E(X) = k\theta\) \(k>0\) (shape), \(\theta>0\) (scale)
Variance \(\displaystyle Var(X) = k\theta^2\) Grows with both shape and scale
Skewness \(\displaystyle \gamma_1 = \frac{2}{\sqrt{k}}\) → 0 as \(k \to \infty\) (more symmetric)
Ex. Kurtosis \(\displaystyle \gamma_2 = \frac{6}{k}\) → 0 as \(k \to \infty\) (less peaked)

Parameterization Notes:

  • Alternative Form: Often written as \(\Gamma(\alpha, \beta)\) where \(\alpha=k\) (shape), \(\beta=1/\theta\) (rate)
  • Special Cases:
    • Exponential: \(k=1\)
    • Chi-squared: \(k=\nu/2\), \(\theta=2\)
  • Support: \(x \in (0, \infty)\)
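
The special cases and moments noted above can be checked against base R's densities; a quick sketch:

x <- c(0.5, 1, 2.5)

# Shape k = 1: the Gamma distribution reduces to the exponential (rate = 1/theta)
all.equal(dgamma(x, shape = 1, scale = 2), dexp(x, rate = 1/2))     # TRUE

# Moments E(X) = k*theta and Var(X) = k*theta^2 (Monte Carlo check with k = 3, theta = 2)
set.seed(42)
g <- rgamma(1e5, shape = 3, scale = 2)
c(mean(g), var(g))   # approximately 6 and 12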

Probability Density Function (pdf): Gamma Distribution

#' Gamma Distribution Probability Density Function (PDF)
#'
#' Computes the probability density function for a Gamma distribution with
#' specified shape (α) and scale (θ) parameters.
#'
#' @param x Numeric vector of values at which to evaluate the PDF
#' @param shape Shape parameter (α > 0)
#' @param scale Scale parameter (θ > 0)
#' @return Numeric vector of probability densities
#' @examples
#' # Standard exponential distribution (α=1, θ=1)
#' gamma_pdf(1:5, shape=1, scale=1)
#'
#' # Gamma(α=2, θ=0.5)
#' gamma_pdf(seq(0, 5, 0.1), shape=2, scale=0.5)
#' @export

gamma_pdf <- function(x, shape, scale) {
  
  # Input validation
  if (missing(shape) || missing(scale)) {
    stop("Both shape and scale parameters must be specified")
  }
  if (shape <= 0 || scale <= 0) {
    stop("Shape and scale parameters must be positive")
  }
  if (!is.numeric(x)) {
    stop("Input x must be numeric")
  }
  
  # Handle negative x values
  pdf <- numeric(length(x))
  pdf[x < 0] <- 0  # Gamma distribution has support x ≥ 0
  
  # Calculate PDF for x ≥ 0
  positive_x <- x >= 0
  pdf[positive_x] <- (1 / (scale^shape * gamma(shape))) * 
    x[positive_x]^(shape - 1) * 
    exp(-(x[positive_x] / scale))
  
  return(pdf)
}

# Save function to file
dump("gamma_pdf", file = "gamma_pdf.R")
# Parameters
n <- 30          # Sample size
alpha <- 1       # Shape
beta <- 1        # Scale

# Population density over a grid
x <- seq(0, 10, length = 2e2)
y <- gamma_pdf(x, shape = alpha, scale = beta)

# Random sample data
x.gamma <- sort(rgamma(n = n, shape = alpha, scale = beta))
d.gamma <- gamma_pdf(x = x.gamma, shape = alpha, scale = beta)

# Gamma Distribution
plot(x, y,
     type="l",
     ylim=c(0, max(d.gamma)*1.1),
     col=adjustcolor("darkgray", alpha.f=0.8),
     lwd=8,
     xlab="x",
     ylab="F(x)",
     main="Random sample (red) vs. Population (blue) \n 
             Gamma Distribution")

lines(density(x.gamma),
      type="l",
      lwd=4)

abline(v=mean(x.gamma), lty=3) # mean
abline(v=1, col="darkgray", lty=3) # expected value

text(3, .6, cex=1.2,
     paste("mean=", format(mean(x.gamma), digits=4), "\n",
           "variance=", format(var(x.gamma), digits=4)
     )
)

text(6, .6, "E(X)=1 \n Var(X)=1", col="darkgray")

# gridlines
grid(nx=NULL, 
     ny=NULL, 
     col="lightgray", 
     lty="dotted", 
     lwd=1, 
     equilogs=TRUE)


Gamma Distribution PDF: Insurance Claim Modeling Case Study

Problem Statement:

An insurance company needs to model auto insurance claim amounts, where:

  • Claims are strictly positive (x > 0)
  • Data shows right-skewed distribution (many small claims, few large outliers)
  • Need to calculate probabilities for risk assessment and pricing

The Gamma PDF is given by:

\[ f(x;\alpha,\theta) = \frac{x^{\alpha-1}e^{-x/\theta}}{\theta^\alpha\Gamma(\alpha)} \quad \text{for } x > 0 \]

where:

  • \(\alpha\) = shape parameter (controls skewness)
  • \(\theta\) = scale parameter
  • \(\Gamma(\alpha)\) = gamma function

Cumulative Distribution Function (cdf) Plot: Gamma Distribution

# Plot the random sample CDF of Gamma-distributed data
plot(ecdf(x.gamma), 
     lwd = 6, 
     col = "lightgray", 
     xlab = "k", 
     ylab = expression(F[X](k)),  # Math notation for CDF
     main = "random sample CDF of Gamma(shape=1, scale=1)\n(n = 30)",
     cex.main = 1.1,  # Slightly larger title
     panel.first = {  # Add gridlines first (behind the plot)
       grid(
         nx = NULL, 
         ny = NULL, 
         col = "lightgray", 
         lty = "dotted", 
         lwd = 3
       )
     })

# Optional: Add population CDF for comparison (black solid line)
curve(pgamma(x, shape = 1, scale = 1), 
      col = "black", 
      lty = 1, 
      lwd = 1, 
      add = TRUE)

# Legend (if population CDF is added)
legend("bottomright", 
       legend = c("random sample CDF", "population CDF"), 
       col = c("lightgray", "black"), 
       lty = c(1, 1), 
       lwd = c(6, 2), 
       bty = "n")


Insurance Application Example:

Risk Management Table

Calculation Type Business Use R Function
Exceedance Probability Reserve requirements 1 - pgamma( )
Expected Loss Premium calculation \(\alpha * \theta\)
Value-at-Risk Reinsurance planning qgamma( )

Parameter Estimation

From historical data:

  • Shape (\(\alpha\)) = 2
  • Scale (\(\theta\)) = 1000
  • Mean claim = \(\alpha \times \theta = \$2,000\)
  • Mode = \((\alpha-1) \times \theta = \$1,000\)

Key Calculations

  1. Exceedance Probability (1 - pgamma(x, α, scale=θ))

Business Use: Reserve requirements

Question: What is the probability claims exceed a threshold (e.g., $3,000)?

Calculation: \[P(X>3000)=1−P(X≤3000)\]

1 - pgamma(3000, shape=2, scale=1000)  # P(X > $3,000): ~19.91%
## [1] 0.1991483
# Other threshold probabilities
1 - pgamma(2000, shape=2, scale=1000)  # P(X > mean of $2,000): ~40.60%
## [1] 0.4060058
1 - pgamma(5000, shape=2, scale=1000)  # P(X > $5,000): ~4.04%
## [1] 0.04042768

  2. Expected Loss (α × θ)

Business Use: Premium calculation

Expected Loss = \(\alpha \times \theta = 2 \times 1000 = \$2,000\)

This is the mean claim amount used for pricing premiums.
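
A quick simulation-based sanity check of these pricing figures (a sketch, assuming the same \(\alpha = 2\), \(\theta = 1000\)):

set.seed(2025)
claims <- rgamma(1e5, shape = 2, scale = 1000)
mean(claims)          # approximately 2000, the expected loss alpha * theta

d <- density(claims)
d$x[which.max(d$y)]   # approximately 1000, the mode (alpha - 1) * theta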

  3. Value-at-Risk (qgamma(p, α, scale=θ))

Business Use: Reinsurance planning (e.g., 95% VaR)

Question: What is the claim amount at the 95th percentile?

Calculation:

To calculate the Value at Risk (VaR) at the 95% confidence level for a gamma-distributed random variable with shape parameter α = 2 and scale parameter θ = 1000, we use the quantile function of the gamma distribution.

The gamma distribution’s cumulative distribution function (CDF) is given by: \[ F(x;\: \alpha, \: \theta) = \frac{1}{\Gamma(\alpha)\,\theta^{\alpha}} \int_0^x t^{\alpha-1} \cdot e^{-t/\theta} \, dt \] where \(\Gamma(\alpha)\) is the gamma function.

The quantile function (inverse CDF) \(Q(p; \alpha, \theta)\) returns the value \(x\) such that \(F(x; \alpha, \theta) = p\). For the gamma distribution, this is often denoted as qgamma(p, alpha, scale = theta).

Given:

  • Shape parameter \(\alpha = 2\)
  • Scale parameter \(\theta = 1000\)
  • Confidence level \(p = 0.95\)

The VaR at the 95% confidence level is:

\[ VaR_{0.95} = \text{qgamma}(0.95, 2, \text{scale} = 1000) \]

Calculating the Value:

Using \(R\):

qgamma(0.95, shape = 2, scale = 1000)  # Result: ~$4,744 (95% of claims ≤ $4,744)
## [1] 4743.865

This yields: \[ VaR_{0.95} \approx 4743.865 \]

Interpretation:

There is a 95% probability that the loss will not exceed approximately 4743.865 (units depend on context, e.g., dollars).
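
A round-trip check confirms the quantile: plugging the 95% VaR back into the CDF returns the confidence level, and a simulated portfolio of claims exceeds it about 5% of the time.

# Inverse-CDF consistency check
pgamma(qgamma(0.95, shape = 2, scale = 1000), shape = 2, scale = 1000)   # 0.95

# Monte Carlo check: share of simulated claims above the 95% VaR
set.seed(7)
mean(rgamma(1e5, shape = 2, scale = 1000) > 4743.865)                    # approximately 0.05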

5.9 Beta Distribution

Parameters

\(\quad \displaystyle \alpha > 0\, - \, shape1.\)

\(\quad \displaystyle \beta > 0\, - \, shape2.\)

\(\quad \displaystyle x \in \lbrack 0,1 \rbrack.\)

Probability density function, expected value and variance

\(\quad \displaystyle f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha) \, \Gamma(\beta)} \, x^{\alpha-1}(1-x)^{\beta-1}.\)

Beta Distribution Moments

Moment Formula
Mean (\(E(X)\)) \(\displaystyle \qquad \frac{\alpha}{\alpha + \beta} \qquad\)
Variance (\(Var(X)\)) \(\displaystyle \qquad \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} \qquad\)
Skewness \(\displaystyle \qquad \frac{2(\beta - \alpha)\sqrt{\alpha + \beta + 1}}{(\alpha + \beta + 2)\sqrt{\alpha \beta}} \qquad\)
Excess Kurtosis \(\displaystyle \qquad \frac{6[(\alpha - \beta)^2 (\alpha + \beta + 1) - \alpha \beta (\alpha + \beta + 2)]}{\alpha \beta (\alpha + \beta + 2)(\alpha + \beta + 3)} \qquad\)
MGF (\(M_X(t)\)) \(\displaystyle 1 \quad + \quad \sum_{k=1}^{\infty} \left( \prod_{r=0}^{k-1} \frac{\alpha + r}{\alpha + \beta + r} \right) \frac{t^k}{k!}\)

Probability Density Function (pdf): Beta Distribution

# Parameters
n <- 100  # Sample size
alpha <- 2  # Shape1 parameter
beta <- 2   # Shape2 parameter

# Generate data
x <- seq(0, 1, length.out = n)  # Grid for population density
x.beta <- rbeta(n, shape1 = alpha, shape2 = beta)  # Random Beta variates
d.beta <- dbeta(x, shape1 = alpha, shape2 = beta)  # population density

# Probability density function plot
plot(x, d.beta,
     type="l",
     ylim=c(0, max(d.beta)*1.1),
     col=adjustcolor("darkgray", alpha.f=0.8),
     lwd=8,
     xlab="x",
     ylab="f(x)",
     main="Random Sample (black) vs. Population (gray) \n Beta(2, 2)")

lines(density(x.beta),
      type="l",
      lwd=4)

abline(v=mean(x.beta), lty=3) # mean
abline(v=(alpha/(alpha+beta)), col="darkgray", lty=3) # expected value

text(mean(x.beta), max(d.beta)/2, 
     cex=1.2,
     paste("mean=", format(mean(x.beta), digits=2), 
           ", E(X)=", (alpha/(alpha+beta)), "\n",
           "variance=", format(var(x.beta), digits=2), 
           ", Var(X)=", ((alpha*beta)/((alpha+beta)^(2)*(alpha+beta+1)))
     )
)

# gridlines
grid(nx=NULL, 
     ny=NULL, 
     col="lightgray", 
     lty="dotted", 
     lwd=1, 
     equilogs=TRUE)

# Mean lines
abline(v = mean(x.beta), lty = 2, col = "black", lwd = 2)  # random sample mean
abline(v = alpha / (alpha + beta), lty = 2, col = "darkgray", lwd = 2)  # population mean

# Annotate statistics
legend("topright",
       legend = c(
         sprintf("Sample Mean = %.3f", mean(x.beta)),
         sprintf("Population Mean = %.3f", alpha / (alpha + beta)),
         sprintf("Sample Var = %.3f", var(x.beta)),
         sprintf("Population Var = %.3f", (alpha*beta)/((alpha+beta)^2*(alpha+beta+1)))
       ),
       bty = "n",
       cex = 0.9)

# Add density comparison legend
legend("topleft",
       legend = c("Population", "Random sample data"),
       col = c("darkgray", "black"),
       lwd = c(8, 4),
       bty = "n")


Cumulative Distribution Function (cdf) Plot: Beta Distribution

# Parameters
n <- 30       # Sample size
alpha <- 2    # Shape1 parameter
beta <- 2     # Shape2 parameter

# Generate Beta-distributed data
x.beta <- rbeta(n, shape1 = alpha, shape2 = beta)

# Create ECDF plot with improved formatting
plot(ecdf(x.beta), 
     lwd=3, 
     col="lightgray", 
     xlab="x", 
     ylab="F(x)",
     main="Cumulative Distribution Function \n 
             Beta(n=30, shape=2, scale=2)")


# Add population CDF for comparison (black solid line)
curve(pbeta(x, shape1 = alpha, shape2 = beta), 
      col = "black", 
      lty = 1, 
      lwd = 1, 
      add = TRUE)

# Add legend
legend("bottomright",
       legend = c("Random Sample CDF", "Population CDF"),
       col = c("lightgray", "black"),
       lty = c(1, 1),
       lwd = c(3, 2),
       bty = "n")

# Add vertical line at mean
abline(v = alpha/(alpha + beta), col = "red", lty = 3)
text(alpha/(alpha + beta), 0.2, 
     paste("Mean =", alpha/(alpha + beta)),
     pos = 4, col = "red")

The Beta distribution is a continuous probability distribution defined on the interval \([0, 1]\), parameterized by two positive shape parameters, \(\alpha\) (alpha) and \(\beta\) (beta). It is highly flexible and can model a wide range of real-world phenomena where outcomes are probabilities or proportions.


Why Beta Distribution?

  • Flexible Shapes: Can model U-shaped, J-shaped, uniform, unimodal, and other distributions.
  • Natural for Proportions: Bounded between 0 and 1, making it ideal for probabilities.
  • Bayesian Advantage: Works as a conjugate prior for Binomial and Bernoulli likelihoods.

Here are some typical real-life problems that can be solved using the Beta PDF (Probability Density Function):

A/B Testing & Bayesian Updating

A/B testing is a common method for comparing two versions (A and B) of a product, webpage, or marketing strategy to determine which performs better. Traditional frequentist approaches use p-values and confidence intervals, but Bayesian methods (using the Beta distribution) provide a more intuitive way to update beliefs and make probabilistic statements.

Example: If Version A has 120 conversions out of 1000 views and Version B has 150 conversions out of 1000 views, we can model each conversion rate with a Beta distribution and compute \(P(B > A)\).

  • Version A: Shown to \(n_A = 1000\) users, resulting in \(x_A = 120\) successes (e.g., clicks, conversions).
  • Version B: Shown to \(n_B = 1000\) users, resulting in \(x_B=150\) successes.
  • Goal: Determine if B is truly better than A (and by how much).

Solution: Represent the conversion rates of A and B as Beta distributions and compute the probability that one is better than the other.


Step 1: Model Conversion Rates as Beta Distributions

  • Assume prior ignorance → use Beta(1, 1) (uniform prior).
  • After observing data:
    • Conversion rate of A: \(\theta_A \sim \text{Beta}(\alpha_A + x_A, \beta_A + n_A - x_A)\)
    • Conversion rate of B: \(\theta_B \sim \text{Beta}(\alpha_B + x_B, \beta_B + n_B - x_B)\)

# Step 1: Set up prior distributions (Uniform Beta(1,1) in this case)
prior_alpha <- 1
prior_beta <- 1

# Variant A (control)
conversions_A <- 120
visitors_A <- 1000

# Variant B (treatment)
conversions_B <- 150
visitors_B <- 1000

Step 2: Update Beliefs with Data

Example:

  • Version A: 120 conversions out of 1000 → Posterior = Beta(121, 881)
  • Version B: 150 conversions out of 1000 → Posterior = Beta(151, 851)

In Bayesian A/B testing with a Beta-Binomial model, we add +1 to observed successes and failures when using a uniform Beta(1,1) prior. This represents starting from a state of complete uncertainty.

We begin with a uniform prior: \[ \text{Prior} \sim \text{Beta}(1, 1) \]

Posterior Update

After observing:

  • S successes
  • F failures

The posterior becomes: \[ \text{Posterior} \sim \text{Beta}(1 + S, 1 + F) \]

Why Add +1?

  • Mathematical Necessity
    • Ensures a proper distribution when S = 0 or F = 0
    • Avoids undefined cases (e.g., Beta(0, n) is invalid)
  • Bayesian Interpretation
    • Equivalent to starting with “pseudocounts” of 1 imaginary success and 1 imaginary failure
    • Represents minimal prior knowledge

Practical Implications

For Variant A (120 conversions out of 1000): \(\text{Posterior}_A \sim \text{Beta}(121, 881)\)

For Variant B (150 conversions out of 1000): \(\text{Posterior}_B \sim \text{Beta}(151, 851)\)
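
As a minimal sketch of this update step (using only the counts already given above), the posterior means and 95% credible intervals follow directly from qbeta():

# Posteriors under the uniform Beta(1, 1) prior, using the counts from the example
post_A <- c(shape1 = 1 + 120, shape2 = 1 + 1000 - 120)   # Beta(121, 881)
post_B <- c(shape1 = 1 + 150, shape2 = 1 + 1000 - 150)   # Beta(151, 851)

# Posterior means: alpha / (alpha + beta)
post_A["shape1"] / sum(post_A)   # ~0.1208
post_B["shape1"] / sum(post_B)   # ~0.1507

# 95% equal-tailed credible intervals
qbeta(c(0.025, 0.975), post_A["shape1"], post_A["shape2"])
qbeta(c(0.025, 0.975), post_B["shape1"], post_B["shape2"])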

Key Properties of Adding \(+1\) in Beta-Binomial Models

Scenario Without +1 (Beta(0,0) Prior) With +1 (Beta(1,1) Prior)
0 successes Beta(0, n) → Invalid distribution Beta(1, n+1) → Valid
100% conversion Impossible (infinite density) Possible (low probability)
Small samples Overconfident estimates Conservative, regularized results
Large samples MLE ≈ Posterior Mean +1 becomes negligible
Decision making Risky with sparse data More robust conclusions

Explanatory Notes:

  • 0 successes case:
    • Without +1: Results in undefined mean/variance
    • With +1: Predicts conversion rate ≈ 1/(n+2)
  • 100% conversion:
    • Beta(1,1) allows (but strongly discounts) extreme values
  • Sample size effects:
    • For n > 1000, the +1 adjustment has < 0.1% impact on mean
    • For n < 100, the prior significantly stabilizes estimates
  • Alternative priors:
    • Beta(0.5, 0.5) (Jeffreys): Even more conservative
    • Beta(ε, ε) (ε→0): Approaches frequentist MLE

Note: The +1 becomes negligible with large sample sizes \((n > 1000)\), but remains important for small-n experiments.
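
To make this concrete, here is a small illustrative sketch (the counts are arbitrary and not taken from the A/B example) comparing the Beta(1,1) posterior mean with the raw maximum-likelihood estimate:

# Posterior mean under a Beta(1,1) prior vs. the raw MLE (illustrative numbers)
posterior_mean <- function(successes, n) (1 + successes) / (2 + n)   # mean of Beta(1+S, 1+F)
mle            <- function(successes, n) successes / n

posterior_mean(2, 10);     mle(2, 10)       # 0.25 vs. 0.20     -- prior noticeably regularizes
posterior_mean(200, 1000); mle(200, 1000)   # ~0.2006 vs. 0.2000 -- prior is negligible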


Step 3: Compute Probability that B > A

We want \(P(\theta_B > \theta_A)\).

Method: Monte Carlo Simulation

  • Draw a large number of samples (100,000 in the code below) from each posterior:
    • \(\theta_A \sim \text{Beta}(121, 881)\)
    • \(\theta_B \sim \text{Beta}(151, 851)\)
  • Count how often \(\theta_B > \theta_A\).
# Step 3: Monte Carlo simulation to estimate P(B > A)
set.seed(123)  # for reproducibility
n_simulations <- 100000

# Sample from posterior distributions
samples_A <- rbeta(n_simulations, prior_alpha + conversions_A, prior_beta + visitors_A - conversions_A)

samples_B <- rbeta(n_simulations, prior_alpha + conversions_B, prior_beta + visitors_B - conversions_B)
  • Result: The proportion of draws with \(\theta_B > \theta_A\) is the Monte Carlo estimate of \(P(B > A)\); for this data it comes out to roughly 97% (computed below).
# Calculate probability that B is better than A
prob_B_better <- mean(samples_B > samples_A)
# Advanced Bayesian A/B Testing Visualizations in R
library(ggplot2)
library(gridExtra)
# library(plotly)  # Uncomment for interactive plots

# 1. Input Data
ab_data <- data.frame(
  variant = c("A", "B"),
  conversions = c(120, 150),
  visitors = c(1000, 1000)
)

# 2. Bayesian Setup (Uniform Beta(1,1) prior)
alpha_prior <- 1
beta_prior <- 1

# 3. Calculate posteriors
ab_data$alpha_post <- alpha_prior + ab_data$conversions
ab_data$beta_post <- beta_prior + ab_data$visitors - ab_data$conversions

# 4. Monte Carlo Simulation
set.seed(123)
n_sim <- 100000
samples <- data.frame(
  A = rbeta(n_sim, ab_data$alpha_post[1], ab_data$beta_post[1]),
  B = rbeta(n_sim, ab_data$alpha_post[2], ab_data$beta_post[2])
)
samples$diff <- samples$B - samples$A
prob_B_better <- mean(samples$B > samples$A)

# 5. Advanced Visualizations ---------------------------------------------------

# Plot 1: Posterior Densities with Credible Intervals
x_seq <- seq(0.075, 0.20, length.out = 1000)
densities <- data.frame(
  x = rep(x_seq, 2),
  variant = rep(c("A", "B"), each = length(x_seq)),
  density = c(dbeta(x_seq, ab_data$alpha_post[1], ab_data$beta_post[1]),
              dbeta(x_seq, ab_data$alpha_post[2], ab_data$beta_post[2]))
)

ci_A <- qbeta(c(0.025, 0.975), ab_data$alpha_post[1], ab_data$beta_post[1])
ci_B <- qbeta(c(0.025, 0.975), ab_data$alpha_post[2], ab_data$beta_post[2])

p1 <- ggplot(densities, aes(x = x, y = density, fill = variant)) +
  geom_area(alpha = 0.5, position = "identity") +
  geom_vline(xintercept = ci_A, linetype = "dashed", color = "#F8766D") +
  geom_vline(xintercept = ci_B, linetype = "dashed", color = "#00BFC4") +
  labs(title = "Posterior Distributions with 95% Credible Intervals",
       x = "Conversion Rate", y = "Density") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  theme_minimal()

# Plot 2: Distribution of Differences (B - A)
p2 <- ggplot(samples, aes(x = diff)) +
  geom_histogram(aes(y = after_stat(density)), bins = 100, fill = "#619CFF", alpha = 0.7) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = paste0("Difference in Conversion Rates (B - A)\nP(B > A) = ", 
                      round(prob_B_better*100, 1), "%"),
       x = "Difference", y = "Density") +
  theme_minimal()

# Plot 3: Expected Loss Distribution
loss_B <- ifelse(samples$diff < 0, -samples$diff, 0)
loss_A <- ifelse(samples$diff > 0, samples$diff, 0)

# Plot 3: Expected Loss Distribution (with y-axis limit)
p3 <- ggplot(data.frame(loss = c(loss_A, loss_B),
                        variant = rep(c("A", "B"), each = n_sim)), 
             aes(x = loss, fill = variant)) +
  geom_density(alpha = 0.5) +
  labs(title = "Expected Loss Distribution",
       x = "Loss", y = "Density") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4")) +
  ylim(0, 30) +  # Added this line to set y-axis limits
  xlim(-.010, .10) +  # Added this line to set x-axis limits
  theme_minimal()

# Plot 4: Cumulative Probability of Difference
p4 <- ggplot(samples, aes(x = diff)) +
  stat_ecdf(geom = "step", color = "#619CFF", linewidth = 1) +
  geom_vline(xintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Cumulative Probability of Difference",
       x = "B - A Difference", y = "P(Difference ≤ x)") +
  theme_minimal()

# Combine all plots
grid.arrange(p1, p2, p3, p4, ncol = 2)

# Interactive Plotly version (optional)
# Check and install plotly if needed
if (!requireNamespace("plotly", quietly = TRUE)) {
  install.packages("plotly")
}
library(plotly)

ggplotly(p1)  # Hover to see exact density values
ggplotly(p2)  # Explore the difference distribution
# 6. Print Key Metrics
# Load the packages needed for the formatted table
library(knitr)
library(kableExtra)   # provides kable_styling(), row_spec(), footnote() and the %>% pipe

# Create a data frame with the metrics
results_table <- data.frame(
  Metric = c("Version A Posterior", 
             "Version B Posterior",
             "P(B > A)",
             "Expected Loss (choosing A)",
             "Expected Loss (choosing B)"),
  Value = c(sprintf("Beta(α=%.1f, β=%.1f)", ab_data$alpha_post[1], ab_data$beta_post[1]),
            sprintf("Beta(α=%.1f, β=%.1f)", ab_data$alpha_post[2], ab_data$beta_post[2]),
            sprintf("%.1f%%", prob_B_better*100),
            sprintf("%.4f", mean(loss_A)),
            sprintf("%.4f", mean(loss_B)))
)

# Print with kable
kable(results_table, 
      caption = "Bayesian A/B Test Key Metrics",
      align = c('l', 'r'),
      col.names = c("Metric", "Value")) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "left"
  ) %>%
  row_spec(3, bold = TRUE, color = "white", background = "#00BFC4") %>%
  footnote(general = "Based on 100,000 Monte Carlo simulations",
           general_title = "Note:")
Bayesian A/B Test Key Metrics
Metric Value
Version A Posterior Beta(α=121.0, β=881.0)
Version B Posterior Beta(α=151.0, β=851.0)
P(B > A) 97.4%
Expected Loss (choosing A) 0.0301
Expected Loss (choosing B) 0.0001
Note:
Based on 100,000 Monte Carlo simulations

Decision Making

Probability Interpretation Action
\(P(B > A) > 95\%\) Strong evidence B is better Roll out B
\(P(B > A) \approx 50-80\%\) Weak evidence Collect more data
\(P(B > A) < 50\%\) A likely better Keep A or refine experiment

\(P(B > A) = 97.4\%\) → Strong evidence B is better; Roll out B.

Advantages Over Frequentist A/B Testing

Metric Frequentist Approach Bayesian (Beta) Approach
Interpretation “Reject null (p < 0.05)” “95% chance B is better”
Early Stopping Needs corrections (e.g., Bonferroni) Stop if \(P(B > A) > 95\%\)
Effect Size Confidence intervals Full posterior distribution
Prior Knowledge Ignored Incorporate via priors (e.g., Beta)

6 Discrete Probability Distributions

Discrete probability distributions model scenarios with distinct, countable outcomes, where a probability can be assigned to each possible result. These distributions are fundamental for analyzing events whose outcomes are separate and indivisible, such as counts of successes, failures, or occurrences. From simple binary outcomes to more complex counting processes, discrete distributions provide the mathematical framework for understanding randomness over countable sample spaces.

Key characteristics of discrete distributions include:

  • Clearly defined possible outcomes
  • Probabilities that sum to 1
  • Countable sample spaces
  • Distinct jumps between probability values

The most important discrete distributions each model different types of counting scenarios, from single trials to repeated experiments with varying conditions. Understanding these distributions provides the foundation for statistical inference, hypothesis testing, and predictive modeling across countless real-world applications.

6.1 Common Discrete Probability Distributions

Distribution Support Key Parameters Key Applications
Bernoulli {0, 1} p (success probability) Binary outcomes, single trials
Binomial {0, 1, …, n} n (trials), p Success counts, quality control
Geometric {1, 2, 3, …} p First success waiting time
Negative Binomial {r, r+1, …} r (successes), p r-th success waiting time
Poisson {0, 1, 2, …} λ (rate parameter) Rare events, arrival counts
Hypergeometric {max(0,n+K-N),…,min(n,K)} N (population), K (successes), n (sample) Sampling without replacement
Uniform {a, a+1, …, b} a (min), b (max) Equally likely outcomes
Multinomial Vectors of counts n (trials), p₁,…,pₖ Categorical data, dice rolls

Key Features:

  • Support: The set of possible outcome values
  • Bernoulli: Special case of Binomial \((n=1)\)
  • Geometric: Special case of Negative Binomial \((r=1)\)
  • Poisson: Limit of Binomial \((n \rightarrow \infty, np=\lambda)\)
  • Hypergeometric: Finite population correction for Binomial
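
As a quick numerical illustration of the Poisson limit noted above (a sketch with arbitrarily chosen n and p, not part of the original analysis), the two pmfs can be compared side by side:

# Poisson limit of the Binomial: large n, small p, with np = lambda
n_trials <- 1000
p_small  <- 0.005
lambda   <- n_trials * p_small   # 5

k <- 0:15
round(cbind(binomial = dbinom(k, size = n_trials, prob = p_small),
            poisson  = dpois(k, lambda)), 4)
# The two columns agree to roughly three decimal places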

6.2 Bernoulli Distribution

Models a single random experiment with exactly two mutually exclusive outcomes (typically labeled “success” and “failure”).

Key Characteristics:

  • The simplest discrete distribution
  • Foundation for many other distributions
  • Models binary events (1 = success, 0 = failure)

Probability mass function:

\[ P(X=x) = \begin{cases} p & \text{for } x=1 \\ 1-p & \text{for } x=0 \end{cases} \]
Parameters: - \(p\) = success probability

Applications: Coin flips, yes/no experiments

Bernoulli Distribution Moments

Moment Formula
Mean (μ) \(E[X] = p\)
Variance (σ²) \(Var(X) = p(1-p)\)
Skewness \(\gamma_1 = \frac{1-2p}{\sqrt{p(1-p)}}\)
Kurtosis \(\gamma_2 = \frac{1-6p(1-p)}{p(1-p)}\)
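
Base R has no dedicated Bernoulli functions; a Bernoulli(p) variable is simply a Binomial with size = 1. A minimal simulation sketch (with an arbitrary p = 0.3) checking the moment formulas above:

# Bernoulli(p) via rbinom(size = 1): compare sample moments to the formulas above
set.seed(1)
p <- 0.3
x <- rbinom(1e5, size = 1, prob = p)

mean(x)                         # close to E[X] = p = 0.3
var(x)                          # close to Var(X) = p(1 - p) = 0.21
dbinom(1, size = 1, prob = p)   # P(X = 1) = p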

6.3 Binomial Distribution

Models the number of successes in a fixed number of independent Bernoulli trials with the same probability of success.

Key Characteristics:

  • Discrete probability distribution
  • Extension of the Bernoulli distribution to multiple trials
  • Assumes independence between trials
  • Constant probability of success across trials

Probability Mass Function (PMF): \[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{for } k = 0,1,2,...,n \] Parameters:

  • \(n \in \mathbb{N}\): Number of trials
  • \(p \in [0,1]\): Probability of success in each trial

Cumulative Distribution Function (CDF): \[ F(k) = P(X \leq k) = \sum_{i=0}^{\lfloor k \rfloor} \binom{n}{i} p^i (1-p)^{n-i} \]

Binomial Distribution Moments:

Moment Formula
Mean \(E[X] = np\)
Variance \(\text{Var}(X) = np(1-p)\)
Skewness \(\gamma_1 = \frac{1-2p}{\sqrt{np(1-p)}}\)
Kurtosis \(\gamma_2 = \frac{1-6p(1-p)}{np(1-p)}\)
MGF \(M_X(t) = (1 - p + pe^t)^n\)

Example Applications:

  • Quality control (defective items in a batch)
  • Survey analysis (proportion of positive responses)
  • Medical research (success rate of treatments)
  • Finance (probability of stock price movements)
  • Genetics (inheritance probability)

Relationships to Other Distributions:

  • Bernoulli: Special case when \(n = 1\)
  • Normal: Approximates when \(n\) is large (De Moivre-Laplace theorem)
  • Poisson: Approximates when \(n \to \infty\), \(p \to 0\), \(np \to \lambda\)
# sample size 1,000
n <- 1e3

# random binomial data
x.binom <- rbinom(n = n, size = 15, prob = .25)

Probability Mass Function (pmf): Binomial Distribution

plot(0:15, dbinom(x = 0:15, size = 15, prob = .25),
     xlab = "x",
     ylab = "P(X=x)",
     ylim = c(0, .3),
     col = "tomato",
     type = "h",
     lwd = 6)

title(expression("Density Plot of Random Binomial Data " 
                 * phantom(" vs. B(15, 0.25)")), 
      col.main = "dodgerblue")

title(expression(phantom("Density Plot of Random Binomial Data ") 
                 * "vs." *  phantom("B(15, 0.25)"), 
                 col.main = "black"))

title(expression(phantom("Density Plot of Random Binomial Data vs. ") 
                 * "B(15, 0.25)"), 
      col.main = "tomato")

points((table(x.binom) / n), 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(10, .29, 
     paste("mean = ", (mean(x.binom)), "vs. E(X) = 3.7500"), 
     cex = 1.2)

text(10, .27, 
     paste("variance = ", format(var(x.binom), digits = 4), 
           "vs. Var(X) = 2.8125"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Binomial Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.binom), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "lightgray", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Bionomial(n=1000, p=0.25)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "gray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Example Binomial Probability Analysis

A genetic trait appears in 25% of offspring. In 8 offspring, what is the probability that exactly 2 inherit the trait?

Solution:

Parameters

  • Trials (n): 8 offspring
  • Success probability (p): 0.25
  • Target successes (k): 2

For a binomial random variable \(X \sim \text{Binomial}(n=8, p=0.25)\), the probability of exactly \(k\) successes is:

\[ P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \]

Step-by-Step Solution

  1. Combination Term: Number of ways to choose 2 successes in 8 trials:

\[ \binom{8}{2} = \frac{8!}{2!6!} = \frac{8 \times 7}{2 \times 1} = 28 \quad \text{(ways to choose 2 successes in 8 trials)} \]

  2. Probability Terms \[ \begin{aligned} p^k &= (0.25)^2 = 0.0625 \quad \text{(probability of the 2 successes)}\\ (1-p)^{n-k} &= (0.75)^{6} \approx 0.1780 \quad \text{(probability of the 6 failures)} \end{aligned} \]

  3. Final Calculation \[ P(X = 2) = 28 \times 0.0625 \times 0.1780 \approx 0.3115 \]

Final Result \[ \boxed{P(X=2) = 0.3115 \text{ (31.15\%)}} \]

Verification

  • Theoretical Expectation:
    • Mean \(\mu = np = 2\)
    • Variance \(\sigma^2 = np(1-p) = 1.5\)
  • Numerical Verification (R Code)
dbinom(2, size = 8, prob = 0.25) # Returns 0.3114624
## [1] 0.3114624
binom.test(2, 8, 0.25)  # 95% CI
## 
##  Exact binomial test
## 
## data:  2 and 8
## number of successes = 2, number of trials = 8, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.25
## 95 percent confidence interval:
##  0.03185403 0.65085579
## sample estimates:
## probability of success 
##                   0.25

The probability of obtaining exactly 2 successes in 8 trials (with p = .25 per trial) is 0.311, indicating this outcome would be expected approximately 31.1% of the time under these conditions.

The test results indicate that the observed proportion of successes (0.25) was not statistically different from the hypothesized probability of 0.25 (*p* = 1.000). The 95% confidence interval [0.032, 0.651] suggests that, with 95% confidence, the true probability of success could range from approximately 3.2% to 65.1%. Since the p-value was not significant (*p* > .05), we fail to reject the null hypothesis that the true probability of success is equal to 0.25.

An exact binomial test revealed no significant difference between the observed success rate (25%) and the hypothesized rate of 25%, *p* = 1.000, 95% CI [0.032, 0.651]. The wide confidence interval reflects the small sample size (n = 8), indicating substantial uncertainty around the true probability.

6.4 Geometric Distribution

The geometric distribution models the number of trials needed to obtain the first success in a sequence of independent Bernoulli trials, each with the same success probability \((p)\).

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Memoryless property: \(P(X > m+n \mid X > m) = P(X > n)\)
  • Two common parameterizations:
  • Trials until first success (including the success trial)
  • Failures before first success

Probability Mass Function (PMF):

Version 1 (including success trial): \[ P(X = k) = (1-p)^{k-1}p \quad \text{for } k = 1,2,3,... \]

Version 2 (failures before first success): \[ P(Y = k) = (1-p)^k p \quad \text{for } k = 0,1,2,... \]

Parameters: - \(p \in (0,1]\): Probability of success on each trial

Cumulative Distribution Function (CDF): \[ F(k) = 1 - (1-p)^{\lfloor k \rfloor + \delta} \] where \(\delta = 0\) for Version 1 and \(\delta = 1\) for Version 2

Geometric Distribution Moments:

Moment Version 1 Formula Version 2 Formula
Mean \(E[X] = \frac{1}{p}\) \(E[Y] = \frac{1-p}{p}\)
Variance \(\text{Var}(X) = \frac{1-p}{p^2}\) \(\text{Var}(Y) = \frac{1-p}{p^2}\)
Skewness \(\gamma_1 = \frac{2-p}{\sqrt{1-p}}\)
Kurtosis \(\gamma_2 = 6 + \frac{p^2}{1-p}\)
MGF \(M_X(t) = \frac{pe^t}{1-(1-p)e^t}\) (for \(t < -\ln(1-p)\))

Example Applications:

  • Reliability engineering (time until first failure)
  • Network protocols (retransmissions until success)
  • Sports analysis (attempts until first score)
  • Epidemiology (cases until disease detection)
  • Queueing theory (customers until first special request)

Special Properties:

  • Memorylessness: Future trials don’t depend on past outcomes
  • Relationship to Exponential: Discrete analog of the continuous exponential distribution
  • Maximum Entropy: Among discrete distributions on \(\{1, 2, 3, \dots\}\) with a given mean, the geometric has maximum entropy

Relationships to Other Distributions:

  • Negative Binomial: Sum of independent Geometrics
  • Exponential: Continuous counterpart
  • Bernoulli: Single trial version
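
The memoryless property listed above can be checked numerically with pgeom(); this sketch uses an arbitrary p = 0.3 and works with R's failures-before-first-success convention:

# Memorylessness: P(Y >= m + k | Y >= m) = P(Y >= k) for the failure-count version
p <- 0.3
m <- 4
k <- 6

surv <- function(j) pgeom(j - 1, prob = p, lower.tail = FALSE)   # P(Y >= j) = (1 - p)^j

surv(m + k) / surv(m)   # conditional survival probability
surv(k)                 # unconditional version -- identical
(1 - p)^k               # closed form, same value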

Probability Mass Function (pmf): Geometric Distribution

# sample size
n <- 1e4

# random geometric data
x.geom <- rgeom(n=n, prob=.6)

# density
d.geom <- dgeom(0:10, prob = .6, log = FALSE)

plot(0:10, d.geom, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n Geometric Distribution",
     xlab = "Number of Failures Before Success", 
     ylab = "P(X=x)")

points(table(x.geom) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(6, .59, 
     paste("mean = ", format(mean(x.geom), digits = 4), ", E(X) = 0.6667"), 
     cex = 1.2)

text(6, .55, 
     paste("variance = ", format(var(x.geom), digits = 4), ", Var(X) = 1.1111"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Geometric Distribution

# Cumulative Distribution Function
plot(ecdf(x.geom), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Geometric(n=1000, p=0.25)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-Life Example of Geometric Distribution: Customer Conversion

A digital marketing team runs an online ad campaign where:

  • Each ad view is an independent trial.
  • The probability that a viewer clicks the ad (conversion) is 5% (*p = 0.05*).

What is the probability that the first conversion happens on the 10th ad view?

Probability Calculation Using the Geometric PMF

The probability mass function for the geometric distribution is given by:

\[ P(X = k) = (1 - p)^{k-1} \times p \]

where:

  • \(k\) = trial number of first success (here, \(k = 10\))
  • \(p\) = success probability per trial (here, \(p = 0.05\))

\[ P(X = 10) = (1 - 0.05)^{10-1} \times 0.05 = (0.95)^9 \times 0.05 \]

Step-by-step calculation:

  1. \((0.95)^9 \approx 0.630\)
  2. \(0.630 \times 0.05 = 0.0315\) so \(P(X = 10) \approx 0.0315\) or \(3.15\%\)

R Verification

dgeom(9, prob = 0.05)  # Returns 0.0315 (3.15%)
## [1] 0.03151247

Note: R uses zero-based indexing for the number of failures before the first success, so we use 9 instead of 10.
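
A natural follow-up question (not part of the original problem) is the chance that the first conversion arrives within the first 20 views; a short sketch using the same zero-based convention:

# P(first conversion within 20 views): with the failure-count convention this is pgeom(19, p)
pgeom(19, prob = 0.05)    # ~0.642
1 - (1 - 0.05)^20         # closed-form check: 1 - 0.95^20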

6.5 Hypergeometric Distribution

The Hypergeometric distribution models scenarios where we sample without replacement from a finite population containing two distinct groups (e.g., success/failure).

Key Characteristics:

  • Discrete probability distribution
  • Finite population sampling
  • Non-independent trials (sampling without replacement)
  • Captures the “lottery problem” exactly

Probability Mass Function (PMF): \[ P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \] where:

  • \(\max(0, n+K-N) \leq k \leq \min(n, K)\)
  • \(\binom{a}{b}\) is the binomial coefficient
  • \(N \in \mathbb{N}\): Total population size
  • \(K \in \{0,1,...,N\}\): Number of success states in population
  • \(n \in \{1,2,...,N\}\): Number of draws (sample size)

Cumulative Distribution Function (CDF): \[ F(k) = \sum_{i=0}^{\lfloor k \rfloor} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}} \]

Hypergeometric Distribution Moments:

Moment Formula
Mean \(E[X] = n\frac{K}{N}\)
Variance \(\text{Var}(X) = n\frac{K}{N}\frac{N-K}{N}\frac{N-n}{N-1}\)
Skewness \(\frac{(N-2K)(N-1)^{1/2}(N-2n)}{[nK(N-K)(N-n)]^{1/2}(N-2)}\)
MGF \(\frac{\binom{N-K}{n}}{\binom{N}{n}}{}_2F_1(-n,-K;N-K-n+1;e^t)\)

Example Applications:

  • Quality control (sampling defective items)
  • Wildlife sampling (tagged animals in capture-recapture)
  • Lottery probability calculations
  • Statistical significance of gene enrichment
  • Voting prediction from exit polls

Special Properties:

  • Finite Population Correction: Variance includes \(\frac{N-n}{N-1}\) term
  • Relationship to Binomial: Approaches binomial when \(N \to \infty\) with \(K/N \to p\)
  • Symmetry: \(X \sim \text{HG}(N,K,n) \iff n-X \sim \text{HG}(N,N-K,n)\)

Relationships to Other Distributions:

  • Binomial: Approximation when \(n/N < 0.05\)
  • Multivariate Hypergeometric: Extension to multiple categories
  • Fisher’s Exact Test: Used in contingency table analysis
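
The binomial approximation listed above can be illustrated numerically; this sketch uses arbitrarily chosen population values (N = 10,000, K = 1,000, sample of 50, i.e., a 0.5% sampling fraction):

# Binomial approximation to the hypergeometric when the sampling fraction is small
N <- 10000; K <- 1000; n_draw <- 50    # n_draw / N = 0.005 < 0.05
x_vals <- 0:15

round(cbind(hypergeometric = dhyper(x_vals, m = K, n = N - K, k = n_draw),
            binomial       = dbinom(x_vals, size = n_draw, prob = K / N)), 4)
# The columns agree closely: a small sample from a large population behaves
# almost like sampling with replacement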

Probability Mass Function (pmf): Hypergeometric Distribution

# random hypergeometric data
x.hyper <- rhyper(1e4, m = 10, n = 8, k = 10)

# density
d.hyper <- dhyper(0:10, m = 10, n = 8, k = 10)

plot(0:10, d.hyper, 
     xlab = "x", 
     ylab = "P(X=x)", 
     ylim = c(0, .45), 
     col = "tomato", 
     type = "h", 
     lwd = 6)

title(main = expression("Hypergeometric Distribution \n"), 
      col.main = "black")

title(main = expression("random sample (blue)" * phantom(" vs. population (orange)")), 
      col.main = "dodger blue")

title(main = expression(phantom("random sample (blue)") * " vs. " * 
                          phantom("population (orange)")), 
      col.main = "black")

title(main = expression(phantom("random sample (blue) vs. ") 
                        * "population (orange)"), 
      col.main = "tomato")

points(table(x.hyper) / 1e4, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(6, .44, 
     paste("mean = ", format(mean(x.hyper), digits = 4), " vs. E(X) = 5.5556"), 
     cex = 1.2)

text(6, .41, 
     paste("variance = ", format(var(x.hyper), digits = 4), " vs. Var(X) = 1.1620"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Hypergeometric Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.hyper), 
     xlab = "x", 
     ylab = "P(X <= x)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n 
             Hypereometric(nn=1000, m=10, n=8, k=10)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-Life Example of Hypergeometric Distribution

Example: Quality Control in Manufacturing

A factory produces 100 light bulbs, with 10 defective and 90 working. An inspector randomly selects 20 bulbs for testing. What is the probability that exactly 3 are defective?

Problem Definition

We consider a finite population of size \(N\) containing:

  • \(K\) “successes” (defective bulbs)
  • \(N-K\) “failures” (working bulbs)

A sample of size \(n\) is drawn without replacement. The probability of finding exactly \(k\) successes in the sample follows the Hypergeometric Distribution:

\[ P(X = k) = \frac{\binom{K}{k} \binom{N-K}{n-k}}{\binom{N}{n}} \]

Given Parameters:

  • Total bulbs (\(N\)) = 100
  • Defective bulbs (\(K\)) = 10
  • Sample size (\(n\)) = 20
  • Desired defectives (\(k\)) = 3

Step 1: Calculate Individual Combinations

  1. Ways to Choose Defective Bulbs

\[ \binom{K}{k} = \binom{10}{3} = \frac{10!}{3!(10-3)!} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120 \]

  2. Ways to Choose Working Bulbs

\[ \binom{N-K}{n-k} = \binom{90}{17} \approx 9.344 \times 10^{17} \]

  3. Total Possible Samples

\[ \binom{N}{n} = \binom{100}{20} \approx 5.360 \times 10^{20} \]

Step 2: Compute Probability

\[ P(X = 3) = \frac{\binom{10}{3} \binom{90}{17}}{\binom{100}{20}} = \frac{120 \times \binom{90}{17}}{\binom{100}{20}} \]

Numerical Evaluation:

\[ P(X = 3) \approx \frac{120 \times 9.344 \times 10^{17}}{5.360 \times 10^{20}} \approx 0.2092 \]

The probability of finding exactly 3 defective bulbs in a sample of 20 is: \(\boxed{P(X = 3) \approx 20.92\%}\)


Verification in R

# Parameters
total_defective <- 10       # (K) Defective bulbs in population
total_non_defective <- 90   # (N-K) Working bulbs
sample_size <- 20           # (n) Bulbs tested
defective_in_sample <- 3    # (k) Desired defective count in sample
total_population <- total_defective + total_non_defective  # (N) Total bulbs

# Probability calculation
prob <- dhyper(
  x = defective_in_sample,  # Number of observed successes (defectives)
  m = total_defective,      # Total successes in population
  n = total_non_defective,  # Total failures in population
  k = sample_size           # Sample size
)

print(paste("Probability of exactly 3 defectives:", round(prob, 4)))
## [1] "Probability of exactly 3 defectives: 0.2092"

There’s a 20.92% chance that exactly 3 out of 20 tested bulbs are defective.

library(ggplot2)

# Compute probabilities for all possible defective counts (0 to total_defective)
x_values <- 0:total_defective
probabilities <- dhyper(
  x = x_values,
  m = total_defective,
  n = total_non_defective,
  k = sample_size
)

# Create a data frame for plotting
df <- data.frame(
  Defectives = x_values,
  Probability = probabilities
)

# Plot
ggplot(df, aes(x = Defectives, y = Probability)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.7) +
  geom_vline(xintercept = defective_in_sample, color = "red", linetype = "dashed") +
  annotate("text", 
           x = defective_in_sample, 
           y = max(probabilities) * 0.9, 
           label = paste("P(X=", defective_in_sample, ") =", round(prob, 4)), 
           color = "red") +
  labs(
    title = "Hypergeometric Distribution: Defective Bulbs in Sample",
    x = "Number of Defective Bulbs in Sample (k)",
    y = "Probability"
  ) +
  theme_minimal()


6.6 Poisson Distribution

Models the number of rare, independent events occurring in a fixed interval of time or space.

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Arises from a memoryless process (exponential inter-arrival times between events)
  • Events occur at constant average rate
  • Events are independent of time since last event

Probability Mass Function (PMF): \[ P(X = k) = \frac{e^{-\lambda}\lambda^k}{k!} \quad \text{for } k = 0,1,2,... \]

Parameters: - \(\lambda > 0\): Rate parameter (mean number of events)

Cumulative Distribution Function (CDF): \[ F(k) = e^{-\lambda}\sum_{i=0}^{\lfloor k \rfloor}\frac{\lambda^i}{i!} \]

Poisson Distribution Moments:

Moment Formula
Mean \(E[X] = \lambda\)
Variance \(\text{Var}(X) = \lambda\)
Skewness \(\gamma_1 = \lambda^{-1/2}\)
Kurtosis \(\gamma_2 = \lambda^{-1}\)
MGF \(M_X(t) = e^{\lambda(e^t-1)}\)
Characteristic Function \(\phi_X(t) = e^{\lambda(e^{it}-1)}\)

Key Properties:

  • Equal Mean and Variance: \(\text{Mean} = \text{Variance} = \lambda\)
  • Additive: Sum of independent Poissons is Poisson (\(X+Y \sim \text{Pois}(\lambda_1+\lambda_2)\)); see the simulation sketch below
  • Binomial Limit: Approximates \(\text{Bin}(n,p)\) when \(n\to\infty\), \(np\to\lambda\)

Example Applications:

  • Call center: Number of incoming calls per hour
  • Biology: Mutations in DNA strand
  • Finance: Number of trades in a time period
  • Physics: Radioactive decay events
  • Transportation: Arrivals at a bus stop
  • Medicine: Number of hospital admissions

Approximation Rules:

  • Binomial approximation: When \(n \geq 20\) and \(p \leq 0.05\)
  • Normal approximation: When \(\lambda > 20\)

Relationships to Other Distributions:

  • Exponential: Models inter-arrival times between Poisson events
  • Gamma: Models time until k-th Poisson event
  • Compound Poisson: Used in risk modeling
  • Zero-Inflated Poisson: For excess zero counts
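
The additivity property from the Key Properties list above can be checked by simulation; this sketch uses arbitrary rates λ = 2 and λ = 3:

# Additivity check: X ~ Pois(2), Y ~ Pois(3)  =>  X + Y ~ Pois(5)
set.seed(42)
x <- rpois(1e5, lambda = 2)
y <- rpois(1e5, lambda = 3)
s <- x + y

mean(s); var(s)   # both close to 5, as expected for Pois(5)

emp <- as.vector(table(factor(s, levels = 0:10))) / length(s)
round(rbind(empirical = emp, theoretical = dpois(0:10, lambda = 5)), 4)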

Probability Mass Function (pmf): Poisson Distribution

# sample size
n <- 1e4

# random Poisson data
x.poisson <- rpois(n, 5)

# density
d.poisson <- dpois(0:20, 5)

plot(0:20, d.poisson, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (orange) \n Poisson Distribution",
     xlab = "Time Interval", 
     ylab = "P(X=x)")

points(table(x.poisson) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(13, .17, paste("mean = ", format(mean(x.poisson), digits = 4), 
                    " vs. ", expression(E(X)), " = 5.0000"), 
     cex = 1.2)

text(13, .16, 
     paste("variance = ", format(var(x.poisson), digits = 4), 
           " vs. Var(X) = 5.0000"), 
     cex = 1.2)

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Cumulative Distribution Function (cdf) Plot: Poisson Distribution

# random sample Cumulative Distribution Function
plot(ecdf(x.poisson), 
     xlab = "k", 
     ylab = "P(X <= k)",
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21,
     main = "Cumulative Distribution Function \n Poisson(1000, 5)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Poisson Distribution: Emergency Call Center Example

Problem Scenario

A hospital’s emergency call center receives an average of 5 calls per hour \((\lambda = 5)\). We want to calculate:

  • Probability of receiving exactly 3 calls in the next hour.
  • Probability of receiving ≤ 2 calls in the next hour.

Why Poisson?

  • Models rare, independent events (calls are unpredictable and non-overlapping).
  • Fixed time interval (per hour).
  • Mean = Variance = λ (here, both are 5).

The Poisson Probability Mass Function (PMF) is given by:

\[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} \]

where:

  • \(\lambda = 5\) (average rate of events per interval)
  • \(k\) = number of events (e.g., calls received)
  • \(e\) = Euler’s number (~2.71828)
  • \(k!\) = factorial of \(k\)

Key Properties:

  1. Mean and Variance:
    \[ E[X] = \text{Var}(X) = \lambda \]

  2. Assumptions:

  • Events occur independently
  • Constant average rate (\(\lambda\))
  • No simultaneous events

1. Probability of Exactly 3 Calls \[ P(X = 3) = \frac{e^{-5} \cdot 5^3}{3!} = \frac{0.006737947 \cdot 125}{6} \approx 0.1404 \]

2. Probability of ≤ 2 Calls (CDF) \[ P(X \leq 2) = \sum_{k=0}^2 \frac{e^{-5} \cdot 5^k}{k!} \approx 0.1247 \]


R Verification

lambda <- 5  # Average calls per hour

# a) Probability of exactly 3 calls
prob_3 <- dpois(x = 3, lambda = lambda)
cat(sprintf("P(X = 3) = %.4f (%.1f%%)\n", prob_3, prob_3*100))
## P(X = 3) = 0.1404 (14.0%)
# b) Probability of ≤ 2 calls
prob_leq_2 <- ppois(q = 2, lambda = lambda)
cat(sprintf("P(X ≤ 2) = %.4f (%.1f%%)\n", prob_leq_2, prob_leq_2*100))
## P(X ≤ 2) = 0.1247 (12.5%)
  • 14% chance of exactly 3 calls in the next hour.
  • 12.5% chance of 2 or fewer calls.
library(ggplot2)

k_values <- 0:15  # Possible call volumes
probabilities <- dpois(k_values, lambda)

ggplot(data.frame(Calls = k_values, Probability = probabilities), 
       aes(x = Calls, y = Probability)) +
  geom_col(fill = "orange", width = 0.6, alpha = 0.7) +
  geom_vline(xintercept = lambda, linetype = "dashed", color = "red") +
  labs(title = "Poisson Distribution: Emergency Calls per Hour (λ = 5)",
       x = "Number of Calls",
       y = "Probability") +
  theme_minimal()


6.7 Negative Binomial Distribution

Models the number of independent Bernoulli trials needed to obtain a fixed number of successes (r).

Key Characteristics:

  • Discrete probability distribution with infinite support
  • Generalization of the Geometric distribution
  • Two common parameterizations:
  • Number of trials until r-th success (including successes)
  • Number of failures before r-th success

Probability Mass Function (PMF):

Version 1 (trials until r-th success): \[ P(X = k) = \binom{k-1}{r-1}p^r(1-p)^{k-r} \quad \text{for } k = r,r+1,... \]

Version 2 (failures before r-th success): \[ P(Y = k) = \binom{k+r-1}{r-1}p^r(1-p)^k \quad \text{for } k = 0,1,2,... \]

Parameters:

  • \(r \in \mathbb{N}\): Number of desired successes
  • \(p \in (0,1)\): Probability of success in each trial

Cumulative Distribution Function (CDF):

\[ F(k) = I_p(r, \lfloor k \rfloor + 1) \] where \(I_p\) is the regularized incomplete beta function (Version 2; for Version 1, replace \(\lfloor k \rfloor\) by \(\lfloor k \rfloor - r\))

Negative Binomial Distribution Moments:

Moment Version 1 Formula Version 2 Formula
Mean \(E[X] = \frac{r}{p}\) \(E[Y] = \frac{r(1-p)}{p}\)
Variance \(\text{Var}(X) = \frac{r(1-p)}{p^2}\) \(\text{Var}(Y) = \frac{r(1-p)}{p^2}\)
Skewness \(\frac{2-p}{\sqrt{r(1-p)}}\)
Kurtosis \(\frac{6}{r} + \frac{p^2}{r(1-p)}\)
MGF \(\left(\frac{pe^t}{1-(1-p)e^t}\right)^r\) (for \(t < -\ln(1-p)\))

Key Properties:

  • Overdispersion: Variance > Mean (unlike Poisson)
  • Geometric Case: Special case when \(r = 1\)
  • Alternative Formulation: Often defined using mean \(\mu\) and dispersion parameter

Example Applications:

  • Epidemiology: Disease outbreak modeling
  • Ecology: Species abundance counts
  • Insurance: Claim frequency modeling
  • Quality Control: Defect clustering
  • RNA-Seq: Gene expression analysis
  • Psychology: Learning curve analysis

Relationships to Other Distributions:

  • Poisson: Limit when \(r \rightarrow \infty\), \(p \rightarrow 1\), with \(r(1-p) \rightarrow \lambda\)
  • Geometric: Special case when \(r = 1\)
  • Gamma-Poisson: Equivalent to Poisson with gamma-distributed rate

Parameter Estimation:

  • Maximum Likelihood Estimates: \(\hat{p} = \frac{r}{r+\bar{y}}\) \(\hat{r}\) via numerical methods
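
As a sketch of that numerical fitting step (using simulated data and MASS::fitdistr(), which works in the (size, mu) parameterization; the "true" values below are arbitrary):

# Maximum-likelihood fit of a negative binomial to simulated counts
library(MASS)

set.seed(7)
y <- rnbinom(5000, size = 3, prob = 0.4)   # true r = 3, p = 0.4  (mu = r(1-p)/p = 4.5)

fit <- fitdistr(y, densfun = "negative binomial")
fit$estimate                               # size (r-hat) and mu (mean number of failures)

# Recover p-hat from the (size, mu) parameterization: p = r / (r + mu)
r_hat  <- fit$estimate["size"]
mu_hat <- fit$estimate["mu"]
unname(r_hat / (r_hat + mu_hat))           # close to the true p = 0.4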

Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom <- rnbinom(n, size = 1, prob = .37)

# density
d.nbinom <- dnbinom(0:30, size = 1, prob = .37)

plot(0:30, d.nbinom, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability")

points(table(x.nbinom) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .3, paste("mean = ", format(mean(x.nbinom), digits = 4), " vs. ",
                   expression(E(X)), " = *"), 
     cex = 1.2)

text(20, .28, 
     paste("variance = ", format(var(x.nbinom), digits = 4), " vs. ",
           expression(V(X)), " = *"), 
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
            Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom2 <- rnbinom(n, size = 7, prob = .37)

# density
d.nbinom2 <- dnbinom(0:30, size = 7, prob = .37)

plot(0:30, d.nbinom2, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability", 
     ylim = c(0,.1))

points(table(x.nbinom2) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .09, paste("mean = ", format(mean(x.nbinom2), digits = 4), " vs. ",
                    expression(E(X)), " = *"), 
     cex = 1.2)

text(20, .085, paste("variance = ", format(var(x.nbinom2), digits = 4), " vs. ",
                     expression(V(X)), " = *"),
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom2), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
            Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Probability Mass Function (pmf): Negative Binomial Distribution

# sample size
n <- 1e4

# random negative binomial data
x.nbinom3 <- rnbinom(n, size = 25, prob = .37)

# density
d.nbinom3 <- dnbinom(0:30, size = 25, prob = .37)

plot(0:30, d.nbinom3, 
     col = "tomato", 
     type = "h", 
     lwd = 6,
     main = "random sample (blue) vs. population (red) \n 
           Negative Binomial Distribution",
     xlab = "Total Trials", 
     ylab = "Probability")

points(table(x.nbinom3) / n, 
       col = adjustcolor("dodgerblue", alpha.f = 0.3), 
       lwd = 20)

text(20, .09, 
     paste("mean = ", format(mean(x.nbinom3), digits = 4), " vs. ",
           expression(E(X)), " = 42.5676"), 
     cex = 1.2)

text(20, .085, 
     paste("variance = ", format(var(x.nbinom3), digits = 4), " vs. ",
           expression(V(X)), " = 115.0475"),
     cex = 1.2)


Cumulative Distribution Function (cdf) Plot: Negative Binomial Distribution

# Cumulative Distribution Function
plot(ecdf(x.nbinom3), 
     col = "maroon", 
     cex = 2, 
     lwd = 6, 
     pch = 21, 
     xlab = "k", 
     ylab = "P(X <= k)",
     main = "Cumulative Distribution Function \n 
             Negative Binomial(n=1000, size=7, prob=0.37)")

# gridlines
grid(nx = NULL, 
     ny = NULL, 
     col = "lightgray", 
     lty = "dotted", 
     lwd = 1, 
     equilogs = TRUE)


Real-World Scenario: Programming Skill Acquisition

A data science student needs an average of 5 practice attempts to fully master each new Python concept. We want to model:

  • Probability of requiring exactly 8 attempts to master 3 concepts
  • Likelihood of needing ≤ 10 attempts for 3 concepts

Why Negative Binomial?

  • Models number of trials until r successes (mastery events)
  • Accounts for variable learning rates
  • More flexible than Poisson (where variance ≠ mean)
r <- 3        # Target successes (concepts to master)
mu <- 8       # NB mean: expected number of failed attempts before the 3rd success
p <- r/(r+mu) # Success probability per attempt

cat("Success probability p =", round(p, 3))
## Success probability p = 0.273
# Probability of needing exactly 8 attempts for 3 successes
prob_8 <- dnbinom(8 - r, size = r, prob = p)
cat(sprintf("P(8 attempts for 3 successes) = %.4f (%.1f%%)", prob_8, prob_8*100))
## P(8 attempts for 3 successes) = 0.0867 (8.7%)
# Probability of needing ≤10 attempts
prob_leq_10 <- pnbinom(10 - r, size = r, prob = p)
cat(sprintf("\nP(≤10 attempts for 3 successes) = %.4f (%.1f%%)", prob_leq_10, prob_leq_10*100))
## 
## P(≤10 attempts for 3 successes) = 0.5414 (54.1%)
  • 8.7% chance the student needs exactly 8 attempts to master 3 concepts
  • 54.1% chance they’ll succeed within 10 attempts
  • The distribution is right-skewed, reflecting that some concepts may require substantially more practice
library(ggplot2)

attempts <- 0:25
probabilities <- dnbinom(attempts, size = r, prob = p)

ggplot(data.frame(Attempts = attempts + r, Probability = probabilities),
       aes(x = Attempts, y = Probability)) +
  geom_col(fill = "#FF6B6B", width = 0.7, alpha = 0.8) +
  geom_vline(xintercept = (r/p), linetype = "dashed", color = "#4ECDC4") +
  labs(title = "Learning Curve: Attempts Needed to Master 3 Concepts",
       subtitle = paste("Negative Binomial (r =", r, ", p =", round(p, 2), ")"),
       x = "Total Attempts Needed",
       y = "Probability") +
  theme_minimal()


7 Random Number Generation & Simulation Methods

Random Number Generation (RNG) and Simulation Methods form the computational backbone of modern statistics, machine learning, and data science. They encompass the algorithms and techniques for producing random samples from probability distributions and for using those samples to approximate integrals, quantify uncertainty, and perform inference when closed-form solutions are intractable.


7.1 Demonstration Example: Symmetric Triangular Distribution

Throughout this chapter, we will use the symmetric triangular distribution on \([−1, 1]\), defined by

\[ f(x) = \begin{cases} 1 - |x|, & -1 \leq x \leq 1, \\[6pt] 0, & \text{otherwise.} \end{cases} \]

as a running example.

  • It is simple enough to allow closed-form derivations of its CDF and inverse CDF.
  • Its shape (peaked at 0, linear tails to 0 at the boundaries) makes it ideal for illustrating Inverse Transform Sampling, Rejection Sampling, and comparisons with Importance Sampling and MCMC methods.
  • Because the triangular distribution is not included in most standard statistical libraries, it forces us to implement custom algorithms — giving a more pedagogical, “hands-on” experience rather than relying on black-box functions.

Symmetric Triangular Distribution Moments:

Moment Formula
Mean \(\frac{a + b}{2}\)
Variance \(\frac{(b - a)^2}{24}\)
Skewness \(0\)
Kurtosis \(-\frac{3}{5}\)
MGF \(\frac{4\left(e^{ta} - 2e^{tc} + e^{tb}\right)}{t^2(b-a)^2}\), with \(c = \tfrac{a+b}{2}\)

Key Properties:

  • Finite, bounded support: Strictly confined to \([a,b]\)
  • Perfectly symmetric density: Mean = Median = Mode
  • Variance: Grows with the square of the range
  • Linear PDF forming an isosceles triangle: PDF decreases linearly from mode to bounds

In short: RNG provides the raw randomness, and simulation methods harness it to solve statistical problems that are analytically intractable. The symmetric triangular distribution will serve as our guiding example to demonstrate these techniques in practice.

To apply these methods, we first need a precise mathematical description of our chosen distribution. We begin by defining its Probability Density Function (PDF), which specifies how probability mass is distributed across its support and serves as the foundation for deriving the CDF and ultimately the inverse CDF needed for simulation.


This mini “distribution family” implements the symmetric triangular distribution on \([a,b]\) with mode \(c=\tfrac{a+b}{2}\), following base R’s d/p/q/r conventions:

  • dtri(x, a, b) → density \(f(x)\) (piecewise linear; zero outside \([a,b]\)).
  • ptri(q, a, b) → CDF \(F(x)\) (piecewise quadratic; continuous and increasing).
  • qtri(p, a, b) → quantile \(F^{-1}(p)\) (closed form for the symmetric case).
  • rtri(n, a, b) → random generation via inverse transform: \(X=F^{-1}(U)\), \(U\sim\mathrm{Unif}(0,1)\).

Design notes

  • Assumes symmetry (mode at midpoint). For an asymmetric triangular with arbitrary mode \(c\in[a,b]\), the formulas generalize naturally.
  • Functions are vectorized; arguments mirror base R (lower.tail, log, log.p where appropriate).
  • Endpoints are handled explicitly: \(F(a)=0\), \(F(b)=1\); qtri(0)=a, qtri(1)=b.


Triangular Distribution function on \([−1,1]\) with \(f(x) = 1−|x|\)

We’ll define the PDF, CDF, and inverse CDF (quantile) for the symmetric triangular distribution:

#' Triangular Distribution (Symmetric) — Density, CDF, Quantile, Random Generation
#'
#' Symmetric triangular distribution on [a, b] with mode c = (a+b)/2.
#' Provides:
#'  - density (`dtri`), 
#'  - distribution function (`ptri`),
#'  - quantile function (`qtri`), and 
#'  - random generation (`rtri`).
#'
#' PDF (general form):
#'   a <= x <= c:  f(x) =  2(x-a) / ((b-a)(c-a))
#'   c <  x <= b:  f(x) =  2(b-x) / ((b-a)(b-c))
#' For the symmetric case c = (a+b)/2, these simplify to:
#'   a <= x <= c:  f(x) = 4(x-a) / (b-a)^2
#'   c <  x <= b:  f(x) = 4(b-x) / (b-a)^2
#' which matches the standardized form on [-1,1]: f(x) = 1 - |x|.
#'
#' CDF (symmetric):
#'   F(x) =
#'     0,                                   x < a
#'     2 (x-a)^2 / (b-a)^2,                 a <= x <= c
#'     1 - 2 (b-x)^2 / (b-a)^2,             c <  x <= b
#'     1,                                   x > b
#'
#' Inverse CDF (symmetric):
#'   For 0 <= p <= 0.5:  x = a + (b-a) * sqrt(p/2)
#'   For 0.5 <  p <= 1:  x = b - (b-a) * sqrt((1-p)/2)
#'
#' @name Triangular
#' @aliases dtri ptri qtri rtri
#' @param x,q Numeric vector of quantiles.
#' @param p Numeric vector of probabilities in [0,1].
#' @param n Integer; number of observations.
#' @param a,b Numeric; lower/upper limits with a < b.
#' @param lower.tail Logical; if TRUE (default), probabilities are P[X <= x]; otherwise P[X > x].
#' @param log,log.p Logical; if TRUE, return log densities / treat p as log-probabilities.
#' @return For `dtri`: density; for `ptri`: distribution function values;
#'   for `qtri`: quantiles; for `rtri`: random variates.
#' @examples
#' # Density / CDF / Quantile / Random
#' curve(dtri(x, 0, 2), from = -0.5, to = 2.5, main = "Triangular PDF", lwd = 2)
#' curve(ptri(x, 0, 2), from = -0.5, to = 2.5, main = "Triangular CDF", lwd = 2)
#' qtri(c(0, 0.25, 0.5, 0.75, 1), 0, 2)
#' set.seed(123); hist(rtri(1e4, 0, 2), freq = FALSE, breaks = 60,
#'                     main = "rtri vs. true PDF", col = "gray90", border = NA)
#' curve(dtri(x, 0, 2), add = TRUE, col = "red", lwd = 2)

7.1.1 Probability Density Function (PDF):

For a symmetric triangular distribution with support \([a, b]\) and mode \(c = \frac{a+b}{2}\):

\[ f(x) = \begin{cases} 0 & \text{for } x < a \\[8pt] \frac{4(x-a)}{(b-a)^2} & \text{for } a \leq x \leq c \\[8pt] \frac{4(b-x)}{(b-a)^2} & \text{for } c \leq x \leq b \\[8pt] 0 & \text{for } x > b \end{cases} \]

Key Properties:

  • Piecewise linear function forming an isosceles triangle
  • Continuous at \(x=c\) (both pieces meet at the mode)
  • Continuous at the boundaries \(x=a\) and \(x=b\) as well (the density decreases linearly to 0, with no jump), though not differentiable there
  • Height at mode: \(f(c) = \frac{2}{b-a}\)
  • Area under curve: \(\int_a^b f(x)dx = 1\)

Verification

  • At \(x=a\): \(\frac{4(a-a)}{(b-a)^2} = 0\)
  • At \(x=c\): Both pieces give \(\frac{4(c-a)}{(b-a)^2} = \frac{2}{b-a}\)
  • At \(x=b\): \(\frac{4(b-b)}{(b-a)^2} = 0\)

Standardized Form (a=-1, b=1) \[ f_{\text{std}}(x) = \begin{cases} 1 - |x| & \text{for } -1 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases} \]
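
Before implementing dtri() below, a quick numerical sanity check of the standardized form using base R's integrate() (the variance 1/6 matches \((b-a)^2/24\) with \(a=-1\), \(b=1\)):

# Sanity checks for the standardized triangular density f(x) = 1 - |x| on [-1, 1]
f_std <- function(x) pmax(1 - abs(x), 0)

integrate(f_std, -1, 1)$value                          # total probability: 1
integrate(function(x) x * f_std(x), -1, 1)$value       # mean: 0
integrate(function(x) x^2 * f_std(x), -1, 1)$value     # E[X^2] = Var(X) = 1/6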

#--------------------------------------
# Density
#--------------------------------------
#' @rdname Triangular
#' @export
dtri <- function(x, a = -1, b = 1, log = FALSE) {
  # Basic checks
  if (!is.numeric(x)) stop("x must be numeric")
  if (!is.numeric(a) || !is.numeric(b) || length(a) != 1 || length(b) != 1) {
    stop("a and b must be numeric scalars")
  }
  if (a >= b) stop("'a' must be less than 'b'")

  c <- (a + b) / 2               # symmetric mode
  denom <- (b - a)               # interval length
  # For symmetry, both denominators are identical: (b-a)*(c-a) = (b-a)*(b-c) = (b-a)^2/2
  left_denom  <- denom * (c - a)     # = (b-a)^2 / 2
  right_denom <- denom * (b - c)     # = (b-a)^2 / 2

  dens <- numeric(length(x))     # preallocate output
  left  <- (x >= a) & (x <= c)   # mask: left half incl. mode
  right <- (x >  c) & (x <= b)   # mask: right half

  # Piecewise linear density (vectorized)
  dens[left]  <- 2 * (x[left]  - a) / left_denom
  dens[right] <- 2 * (b - x[right]) / right_denom
  dens[x < a | x > b] <- 0       # zero outside support

  if (log) {
    dens <- log(dens)
    dens[!is.finite(dens)] <- -Inf  # guard for log(0)
  }
  dens
}

# save
dump("dtri", "dtri.R")

7.1.2 Cumulative Distribution Function (CDF):

Derive the explicit formula for the CDF, \(F(x)\), from the density \(f(x)\):

\[ \begin{aligned} F(x) &= \int_{-\infty}^{x} f(t)\,dt \\ \\ &= \begin{cases} \displaystyle\int_{-\infty}^{x} 0\, dt, &\mbox{if } x < -1 \\ \\ \displaystyle\int_{-\infty}^{-1} 0\, dt + \displaystyle\int_{-1}^{x} (1+t)\, dt, &\mbox{if } -1 \le x < 0 \\ \\ \displaystyle\int_{-\infty}^{-1} 0\, dt + \displaystyle\int_{-1}^{0} (1+t)\, dt + \displaystyle\int_{0}^{x} (1-t)\, dt, &\mbox{if } 0 \le x < 1 \\ \\ \displaystyle\int_{-\infty}^{-1} 0\, dt + \displaystyle\int_{-1}^{0} (1+t)\, dt + \displaystyle\int_{0}^{1} (1-t)\, dt, &\mbox{if } x \ge 1 \end{cases} \\ \\ &= \begin{cases} \qquad 0, &\mbox{if } x < -1 \\ \\ \frac{x^{2}}{2} + x + \frac{1}{2}, &\mbox{if } -1 \le x < 0 \\ \\ -\frac{x^{2}}{2} + x + \frac{1}{2}, &\mbox{if } 0 \le x < 1 \\ \\ \qquad 1, &\mbox{if } x \ge 1 \end{cases} \end{aligned} \]

For a symmetric triangular distribution with support \([a, b]\) and mode \(c = \frac{a+b}{2}\):

\[ F(x) = \begin{cases} 0 & \text{for } x < a \\[10pt] \frac{2(x-a)^2}{(b-a)^2} & \text{for } a \leq x \leq c \\[10pt] 1 - \frac{2(b-x)^2}{(b-a)^2} & \text{for } c \leq x \leq b \\[10pt] 1 & \text{for } x > b \end{cases} \]

Key Features:

  • Piecewise quadratic function
  • Continuous at all points (including \(x=a\), \(x=c\), and \(x=b\))
  • Differentiable everywhere, with \(F'(x) = f(x)\) (the derivative merely has kinks at \(x=a\), \(x=c\), and \(x=b\))
  • Median = Mode = Mean = \(c\)

Verification at Critical Points:

  • At \(x = a\): \[ F(a) = \frac{2(a-a)^2}{(b-a)^2} = 0 \quad \checkmark \]

  • At \(x = c\): \[ F(c) = \frac{2(c-a)^2}{(b-a)^2} = \frac{2\left(\tfrac{b-a}{2}\right)^2}{(b-a)^2} = \frac{1}{2} = 1 - \frac{2(b-c)^2}{(b-a)^2} \quad \checkmark \]

  • At \(x = b\): \[ F(b) = 1 - \frac{2(b-b)^2}{(b-a)^2} = 1 \quad \checkmark \]

Example Calculation (when \(a=0\), \(b=2\), \(c=1\)):

\[ F(0.5) = \frac{2(0.5-0)^2}{(2-0)^2} = 0.125 \\ F(1.5) = 1 - \frac{2(2-1.5)^2}{(2-0)^2} = 0.875 \]

#--------------------------------------
# CDF
#--------------------------------------
#' @rdname Triangular
#' @export
ptri <- function(q, a = -1, b = 1, lower.tail = TRUE, log.p = FALSE) {
  if (!is.numeric(q)) stop("'q' must be numeric")
  if (!is.numeric(a) || !is.numeric(b) || !(a < b)) stop("'a' and 'b' must be numeric with a < b")

  nm <- names(q)
  q  <- as.numeric(q)
  width <- b - a
  c <- (a + b) / 2

  # Initialize CDF, set non-finite qs to NA
  cdf <- rep(NA_real_, length(q))
  finite <- is.finite(q)

  # Masks
  left_support  <- finite & (q <  a)
  right_support <- finite & (q >  b)
  mid_left      <- finite & (q >= a) & (q <= c)
  mid_right     <- finite & (q >  c) & (q <= b)

  # Piecewise symmetric triangular CDF
  cdf[left_support]  <- 0
  cdf[mid_left]      <- 2 * (q[mid_left]  - a)^2 / width^2
  cdf[mid_right]     <- 1 - 2 * (b - q[mid_right])^2 / width^2
  cdf[right_support] <- 1

  # Tails / log
  if (!lower.tail) cdf <- 1 - cdf
  if (log.p) {
    cdf <- ifelse(is.na(cdf), NA_real_, ifelse(cdf <= 0, -Inf, log(cdf)))
  }

  names(cdf) <- nm
  cdf
}

# save
dump("ptri", "ptri.R")

Plot Triangular Distribution CDF, F(x)

#' Plot Triangular Distribution CDF (Square Grid) using ptri()
#'
#' Visualizes the CDF F(x) for a symmetric triangular distribution via `ptri()`,
#' with square grid cells (1:1 data aspect), **colored highlight points**, and
#' a bit of **padding** so axes/guides don’t get clipped. Falls back to
#' standardized `ptri(s)` or closed-form if needed.
#'
#' @param min,max Numeric. x-range to display.
#' @param a,b     Numeric. Distribution bounds (a < b). Default: -1, 1.
#' @param grid_step Numeric. Same step used on x and y (square cells). Default 0.1.
#' @param x_pad,y_pad Optional numeric padding added to x/y limits. Defaults to
#'        max(grid_step/2, 2% of the axis range).
#' @param highlight_points Logical. Mark x={a, (a+b)/2, b}. Default TRUE.
#' @param highlight_colors Character(3). Colors for the three highlight points/lines.
#'        Default: c("#E41A1C", "#377EB8", "#4DAF4A").
#' @param show_grid Logical. Show grid lines. Default TRUE.
#' @param show_legend Logical. Show legend for highlights. Default FALSE.
#' @return A ggplot object
#' @export
plot_ptri <- function(min, max, a = -1, b = 1,
                      grid_step = 0.1,
                      x_pad = NULL, y_pad = NULL,
                      highlight_points = TRUE,
                      highlight_colors = c("#E41A1C", "#377EB8", "#4DAF4A"),
                      show_grid = TRUE,
                      show_legend = FALSE) {
  if (!is.numeric(min) || !is.numeric(max) || min >= max) stop("min < max")
  if (!is.numeric(a) || !is.numeric(b) || b <= a) stop("'b' must be > 'a'")
  if (!is.numeric(grid_step) || grid_step <= 0) stop("grid_step > 0")

  # Default padding: half a grid step or 2% of range
  if (is.null(x_pad)) x_pad <- max(grid_step/2, 0.02 * (max - min))
  if (is.null(y_pad)) y_pad <- max(grid_step/2, 0.02 * 1)  # CDF range is [0,1]

  c_mid <- (a + b) / 2
  x <- seq(min, max, length.out = 1000)

  # 1) Try ptri(x, a, b) -> 2) standardized ptri(s) -> 3) closed form
  Fx <- tryCatch(
    ptri(x, a = a, b = b),
    error = function(e1) {
      s <- (2 * (x - a) / (b - a)) - 1
      tryCatch(
        ptri(s),
        error = function(e2) {
          Fx_cf <- numeric(length(x))
          iL  <- x <= a
          iR  <- x >= b
          iML <- x > a & x <= c_mid
          iMR <- x > c_mid & x <  b
          Fx_cf[iL]  <- 0
          Fx_cf[iML] <- 2 * (x[iML] - a)^2 / (b - a)^2
          Fx_cf[iMR] <- 1 - 2 * (b - x[iMR])^2 / (b - a)^2
          Fx_cf[iR]  <- 1
          Fx_cf
        }
      )
    }
  )

  df <- data.frame(x = x, Fx = Fx)

  # Padded limits & square-grid breaks
  x_lims   <- c(min - x_pad, max + x_pad)
  y_lims   <- c(0   - y_pad, 1   + y_pad)
  x_breaks <- seq(min, max, by = grid_step)
  y_breaks <- seq(0,   1,   by = grid_step)

  p <- ggplot2::ggplot(df, ggplot2::aes(x, Fx)) +
    ggplot2::geom_line(color = "tomato", linewidth = 1.2) +
    ggplot2::labs(x = "x", y = "F(x)", title = sprintf("Inverse CDF (Quantile) on [%g, %g]", a, b)) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title  = ggplot2::element_text(hjust = 0.5),
      panel.border = ggplot2::element_rect(color = "gray50", fill = NA),
      panel.grid.minor = ggplot2::element_line(color = "gray92")
    ) +
    ggplot2::scale_x_continuous(breaks = x_breaks, limits = x_lims, expand = c(0, 0)) +
    ggplot2::scale_y_continuous(breaks = y_breaks, limits = y_lims, expand = c(0, 0)) +
    ggplot2::coord_equal(ratio = 1)

  if (show_grid) {
    p <- p + ggplot2::theme(panel.grid.major = ggplot2::element_line(color = "gray85"))
  } else {
    p <- p + ggplot2::theme(panel.grid = ggplot2::element_blank())
  }

  if (highlight_points) {
    cols   <- rep(highlight_colors, length.out = 3)
    labels <- c("x = a", "x = c", "x = b")
    key_x  <- c(a, c_mid, b)
    # same robust path for the highlighted F(x)
    key_Fx <- tryCatch(
      ptri(key_x, a = a, b = b),
      error = function(e1) {
        s <- (2 * (key_x - a) / (b - a)) - 1
        tryCatch(ptri(s), error = function(e2) c(0, 0.5, 1))
      }
    )
    key_df <- data.frame(x = key_x, Fx = key_Fx, label = labels)

    p <- p +
      ggplot2::geom_point(data = key_df,
                          ggplot2::aes(color = label),
                          size = 3) +
      ggplot2::geom_vline(data = key_df,
                          ggplot2::aes(xintercept = x, color = label),
                          linetype = "dashed") +
      ggplot2::geom_hline(data = key_df,
                          ggplot2::aes(yintercept = Fx, color = label),
                          linetype = "dashed") +
      ggplot2::scale_color_manual(values = setNames(cols, labels),
                                  guide = if (show_legend) "legend" else "none")
  }

  p
}

Plot the function, F(x), on [-1, 1]

plot_ptri(-1,1)


7.1.3 Inverse CDF (Quantile) Function

The inverse CDF (quantile function) maps a probability level \(y \in [0,1]\) to the value \(x\) such that \(P(X \le x) = y\). It’s the key ingredient for inverse transform sampling.


Standardized case on \([-1,1]\)

\[ F^{-1}(y)= \begin{cases} -1, & y \le 0, \\[6pt] -1+\sqrt{2y}, & 0< y \le \tfrac{1}{2}, \\[8pt] 1-\sqrt{2-2y}, & \tfrac{1}{2}< y < 1, \\[8pt] 1, & y \ge 1. \end{cases} \]

Checks: \(F^{-1}(0)=-1\), \(F^{-1}(0.5)=0\), \(F^{-1}(1)=1\).


General symmetric case on \([a,b]\) with mode \(c=\tfrac{a+b}{2}\)

\[ F^{-1}(y)= \begin{cases} a, & y \le 0, \\[6pt] a + (b-a)\sqrt{\dfrac{y}{2}}, & 0< y \le \tfrac{1}{2}, \\[10pt] b - (b-a)\sqrt{\dfrac{1-y}{2}}, & \tfrac{1}{2}< y < 1, \\[10pt] b, & y \ge 1. \end{cases} \]

Usage note: Given \(U \sim \mathrm{Uniform}(0,1)\), the inverse transform sample is \(X = F^{-1}(U)\), producing draws from the triangular distribution.

#--------------------------------------
# Quantile (Inverse CDF)
#--------------------------------------
#' @rdname Triangular
#' @export
qtri <- function(p, a = -1, b = 1, lower.tail = TRUE, log.p = FALSE) {
  # Convert tails/log-probs to standard p in [0,1]
  if (log.p) p <- exp(p)
  if (!lower.tail) p <- 1 - p
  if (any(!is.finite(p)) || any(p < 0 | p > 1)) {
    stop("p must be in [0,1]")
  }

  out <- numeric(length(p))     # preallocate output

  # Left branch: p in [0, 0.5]
  i1 <- (p <= 0.5)
  out[i1] <- a + (b - a) * sqrt(p[i1] / 2)

  # Right branch: p in (0.5, 1]
  i2 <- (p > 0.5)
  out[i2] <- b - (b - a) * sqrt((1 - p[i2]) / 2)

  # Exact endpoints (avoid tiny numerical drift)
  out[p == 0] <- a
  out[p == 1] <- b
  out
}

# save 
dump("qtri", "qtri.R")

Plot Inverse CDF (Quantile) Function, \(F^{-1}(y)\)

#' Plot Inverse CDF (Quantile) — Square Grid via qtri()
#'
#' @param min,max Probability range to plot on the x-axis
#' @param a,b     Distribution bounds with a < b (default -1..1)
#' @param grid_step Data-unit step used for both axes to form square cells (default = 0.1)
#' @param x_pad,y_pad Optional padding added to x/y limits.
#'        Defaults to max(grid_step/2, 2% of the range) for each axis.
#' @param highlight_points Logical; mark p = {0, 0.5, 1} (default TRUE)
#' @param highlight_colors Character vector of 3 colors for the points/guide lines
#' @param show_grid Logical; show grid lines (default TRUE)
#' @param show_legend Logical; show legend for highlight points (default FALSE)
#' @return ggplot object
#' @export
plot_qtri <- function(min = 0, max = 1, a = -1, b = 1,
                      grid_step = 0.1,
                      x_pad = NULL, y_pad = NULL,
                      highlight_points = TRUE,
                      highlight_colors = c("#E41A1C", "#377EB8", "#4DAF4A"),
                      show_grid = TRUE,
                      show_legend = FALSE) {
  if (!is.numeric(min) || !is.numeric(max) || min >= max) stop("min < max")
  if (!is.numeric(a) || !is.numeric(b) || a >= b)          stop("'a' < 'b'")
  if (!is.numeric(grid_step) || grid_step <= 0)            stop("grid_step > 0")

  # Default padding: half a grid step or 2% of range
  if (is.null(x_pad)) x_pad <- max(grid_step/2, 0.02 * (max - min))
  if (is.null(y_pad)) y_pad <- max(grid_step/2, 0.02 * (b - a))

  # Probability grid (clamp for evaluation)
  p_grid <- seq(min, max, length.out = 1000)
  p_eval <- pmin(pmax(p_grid, 0), 1)

  # Robust quantile evaluation
  q_vals <- tryCatch(
    qtri(p_eval, a = a, b = b),
    error = function(e1) {
      q_std <- tryCatch(qtri(p_eval), error = function(e2) NULL)
      if (!is.null(q_std)) {
        a + (b - a) * (q_std + 1) / 2
      } else {
        ifelse(p_eval <= 0.5,
               a + (b - a) * sqrt(p_eval / 2),
               b - (b - a) * sqrt((1 - p_eval) / 2))
      }
    }
  )

  df <- data.frame(p = p_grid, F_inv = q_vals)

  # Padded limits
  x_lims <- c(min - x_pad, max + x_pad)
  y_lims <- c(a   - y_pad, b   + y_pad)

  p <- ggplot2::ggplot(df, ggplot2::aes(p, F_inv)) +
    ggplot2::geom_line(linewidth = 1.2, color = "tomato") +
    ggplot2::labs(
      x = "Probability (p)",
      y = expression(F^{-1}(p)),
      title = sprintf("Inverse CDF (Quantile) on [%g, %g]", a, b)
    ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(
      plot.title  = ggplot2::element_text(hjust = 0.5),
      panel.border = ggplot2::element_rect(color = "gray50", fill = NA),
      panel.grid.minor = ggplot2::element_line(color = "gray92")
    )

  if (show_grid) {
    p <- p + ggplot2::theme(panel.grid.major = ggplot2::element_line(color = "gray85"))
  } else {
    p <- p + ggplot2::theme(panel.grid = ggplot2::element_blank())
  }

  # Equal data units + equal step size -> square grid cells
  p <- p +
    ggplot2::scale_x_continuous(breaks = seq(min, max, by = grid_step),
                                limits = x_lims, expand = c(0, 0)) +
    ggplot2::scale_y_continuous(breaks = seq(a, b, by = grid_step),
                                limits = y_lims, expand = c(0, 0)) +
    ggplot2::coord_equal(ratio = 1)

  if (highlight_points) {
    cols <- rep(highlight_colors, length.out = 3)
    key_p <- c(0, 0.5, 1)
    labels <- c("p = 0", "p = 0.5", "p = 1")

    key_q <- tryCatch(
      qtri(key_p, a = a, b = b),
      error = function(e1) {
        q_std <- tryCatch(qtri(key_p), error = function(e2) NULL)
        if (!is.null(q_std)) a + (b - a) * (q_std + 1) / 2
        else ifelse(key_p <= 0.5,
                    a + (b - a) * sqrt(key_p / 2),
                    b - (b - a) * sqrt((1 - key_p) / 2))
      }
    )

    key_df <- data.frame(p = key_p, F_inv = key_q, label = labels)

    p <- p +
      ggplot2::geom_point(data = key_df,
                          ggplot2::aes(color = label),
                          size = 3) +
      ggplot2::geom_vline(data = key_df,
                          ggplot2::aes(xintercept = p, color = label),
                          linetype = "dashed") +
      ggplot2::geom_hline(data = key_df,
                          ggplot2::aes(yintercept = F_inv, color = label),
                          linetype = "dashed") +
      ggplot2::scale_color_manual(values = setNames(cols, labels),
                                  guide = if (show_legend) "legend" else "none")
  }

  p
}

Plot the function, \(F^{-1}(y)\), on [0, 1]

# Plot the inverse CDF F^{-1}(y) for the symmetric triangular on [-1, 1]
# over the probability domain y ∈ [0, 1]
# p in [0,1], x in [-1,1], 0.1-step on both axes -> square grid cells
plot_qtri(0, 1, a = -1, b = 1, grid_step = 0.1)

#--------------------------------------
# Random Generation (Inverse Transform)
#--------------------------------------
#' @rdname Triangular
#' @export
rtri <- function(n, a = -1, b = 1) {
  # Argument checks
  if (length(n) != 1 || !is.finite(n) || n <= 0 || n != as.integer(n)) {
    stop("n must be a positive integer")
  }
  if (a >= b) stop("'a' must be less than 'b'")

  # Draw U ~ Uniform(0,1), then map via quantile function: X = F^{-1}(U)
  u <- runif(n)
  qtri(u, a = a, b = b)
}

Quick sanity checks / examples

# Quick sanity checks / examples
# (Assumes dtri, ptri, qtri are already defined as above)

set.seed(123)

a <- 0; b <- 2; c_mid <- (a + b)/2

## --- 1) PDF & CDF plots ---
op <- par(mfrow = c(1, 2), mar = c(4,4,3,1))
on.exit(par(op), add = TRUE)

# PDF
curve(dtri(x, a, b), from = a - 0.5, to = b + 0.5, lwd = 2,
      main = sprintf("PDF Tri(a=%g,b=%g)", a, b),
      xlab = "x", ylab = "f(x)")
abline(v = c(a, c_mid, b), lty = 3, col = "gray60")

# CDF
curve(ptri(x, a, b), from = a - 0.5, to = b + 0.5, lwd = 2,
      main = sprintf("CDF Tri(a=%g,b=%g)", a, b),
      xlab = "x", ylab = "F(x)")
abline(h = c(0, 0.5, 1), v = c(a, c_mid, b), lty = 3, col = "gray60")

par(op)  # reset

## --- 2) Quantiles ---
p <- c(0, 0.25, 0.5, 0.75, 1)
q <- qtri(p, a, b)
q_expected <- c(
  a,
  a + (b - a) * sqrt(0.25/2),
  c_mid,
  b - (b - a) * sqrt(0.25/2),
  b
)
quantile_check <- data.frame(p = p, qtri = q, expected = q_expected)
print(quantile_check, row.names = FALSE)
##     p      qtri  expected
##  0.00 0.0000000 0.0000000
##  0.25 0.7071068 0.7071068
##  0.50 1.0000000 1.0000000
##  0.75 1.2928932 1.2928932
##  1.00 2.0000000 2.0000000
## --- 3) Random generation vs. true PDF ---
xx <- rtri(1e4, a, b)

hist(xx, freq = FALSE, breaks = 60, col = "gray90", border = NA,
     main = "rtri sample vs. true PDF", xlab = "x")
curve(dtri(x, a, b), add = TRUE, col = "red", lwd = 2)

## --- 4) Moment checks (sample vs. theory) ---
theory_mean <- (a + b)/2
theory_var  <- (b - a)^2 / 24
cat("\nMoment checks:\n")
## 
## Moment checks:
cat(" sample mean  =", round(mean(xx), 5), " | theory mean  =", theory_mean, "\n")
##  sample mean  = 0.99613  | theory mean  = 1
cat(" sample var   =", round(var(xx), 5),  " | theory var   =", theory_var,  "\n")
##  sample var   = 0.16406  | theory var   = 0.1666667
## --- 5) KS goodness-of-fit (sample vs. CDF) ---
ks_out <- ks.test(xx, function(t) ptri(t, a, b))
cat("\nKS test:\n")
## 
## KS test:
cat(" D =", unname(ks_out$statistic), " | p-value =", ks_out$p.value, "\n")
##  D = 0.00739211  | p-value = 0.6453502

7.2 Inverse Transform Sampling

Inverse transform sampling is a fundamental technique for generating random numbers from a specified probability distribution. The method works by using the distribution’s cumulative distribution function (CDF) to transform uniform random samples into samples that follow the desired distribution.

Algorithm:

  1. Generate a uniform random variable
    Draw a random number \(u \sim \text{Uniform}(0,1)\).

  2. Apply the inverse CDF
    Solve for \(x\) such that
    \[ F(x) = u, \]
    where \(F(x)\) is the CDF of the target distribution.

  3. Obtain the sample
    The solution \(x = F^{-1}(u)\) is then a random draw from the target distribution.


Example 1: Exponential Distribution

Suppose we want to generate samples from an exponential distribution with rate parameter \(\lambda > 0\).

  • The CDF of the exponential distribution is:
    \[ F(x) = 1 - e^{-\lambda x}, \quad x \geq 0 \]

  • Solve for the inverse:
    \[ F(x) = u \implies x = -\frac{1}{\lambda} \ln(1 - u) \]

  • Since \(1-u\) is also uniformly distributed on \([0,1]\), we can simplify to:
    \[ x = -\frac{1}{\lambda} \ln(u) \]

Procedure:
1. Generate \(u \sim \text{Uniform}(0,1)\)
2. Compute \(x = -\frac{1}{\lambda} \ln(u)\)
3. The result \(x\) follows an exponential distribution (see the sketch below).
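
A minimal R sketch of this procedure, with an illustrative rate \(\lambda = 2\):

# Inverse transform sampling for an Exponential(rate = 2) target
set.seed(1)
lambda <- 2
u <- runif(1e4)                                  # step 1: u ~ Uniform(0, 1)
x <- -log(u) / lambda                            # step 2: x = -(1/lambda) * ln(u)
c(sample_mean = mean(x), theory = 1 / lambda)    # both approximately 0.5
ks.test(x, pexp, rate = lambda)                  # should not reject Exp(2)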


Advantages

  • Conceptually simple and broadly applicable.
  • Works for both discrete and continuous distributions, provided the inverse CDF is available in closed form.

Limitations

  • Requires knowing or being able to compute the inverse CDF.
  • For complex distributions without a closed-form inverse, alternative methods (e.g., rejection sampling, MCMC) may be more practical.

Example 2: Piecewise Inverse CDF

Inverse transform sampling also applies to distributions with piecewise CDFs. Consider the probability density function (PDF):

\[ f(x) = \begin{cases} 1 - |x|, & -1 \leq x \leq 1 \\[6pt] 0, & \text{otherwise} \end{cases} \]

This is a triangular distribution centered at 0.


7.2.1 Step 1: Compute the CDF

For \(-1 \leq x \leq 0\):
\[ F(x) = \int_{-1}^x (1 - |t|) \, dt = \int_{-1}^x (1 + t) \, dt = \frac{(x+1)^2}{2} \]

For \(0 < x \leq 1\):
\[ F(x) = \frac{1}{2} + \int_{0}^x (1 - t) \, dt = \frac{1}{2} + \left(x - \frac{x^2}{2}\right) = 1 - \frac{(1-x)^2}{2} \]

So the full CDF is:
\[ F(x) = \begin{cases} 0, & x < -1 \\[6pt] \dfrac{(x+1)^2}{2}, & -1 \leq x \leq 0 \\[10pt] 1 - \dfrac{(1-x)^2}{2}, & 0 < x \leq 1 \\[10pt] 1, & x > 1 \end{cases} \]


7.2.2 Step 2: Invert the CDF

Solve \(F(x) = u\) for \(x\).

  • For \(0 < u \leq \tfrac{1}{2}\):
    \[ u = \frac{(x+1)^2}{2} \implies x = -1 + \sqrt{2u} \]

  • For \(\tfrac{1}{2} < u < 1\):
    \[ u = 1 - \frac{(1-x)^2}{2} \implies x = 1 - \sqrt{2 - 2u} \]

Thus, the inverse CDF is:

\[ x = F^{-1}(u) = \begin{cases} -1, & u \leq 0 \\[6pt] -1 + \sqrt{2u}, & 0 < u \leq \tfrac{1}{2} \\[6pt] 1 - \sqrt{2 - 2u}, & \tfrac{1}{2} < u < 1 \\[6pt] 1, & u \geq 1 \end{cases} \]


7.2.3 Step 3: Sampling Procedure

  1. Generate \(u \sim \text{Uniform}(0,1)\)
  2. Apply the inverse CDF above to compute \(x = F^{-1}(u)\)
  3. The result \(x\) is a random sample from the triangular distribution defined by \(f(x)\) (see the sketch below).
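
A minimal sketch of these three steps, applying the derived inverse directly (equivalent to qtri() with a = -1, b = 1):

# Inverse transform for the triangular density f(x) = 1 - |x| on [-1, 1]
set.seed(2)
u <- runif(1e4)                                               # step 1
x <- ifelse(u <= 0.5, -1 + sqrt(2 * u), 1 - sqrt(2 - 2 * u))  # step 2
c(sample_mean = mean(x), sample_var = var(x))                 # theory: mean 0, variance 1/6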

R Implementation of Inverse Transform Sampling Function

## R Implementation of Inverse Transform Sampling
#' Generates random variables from a specified distribution using inverse transform sampling,
#' returning the samples together with density summaries for downstream comparison and plotting.
#'
#' @param n Integer. Sample size (n > 0).
#' @param inv_cdf_FUN Function. Inverse CDF (quantile function) taking a numeric vector u in [0,1].
#' @param pdf_FUN Function or NULL. Density function f(x) for comparison; if NULL, only sample plots.
#' @param support_range Numeric length-2 or NULL. Range for drawing the population density curve.
#' @param ... Passed to pdf_FUN.
#'
#' @return A list with:
#'   - samples: numeric vector of generated values
#'   - density: data.frame with columns x,y for the population density (if pdf_FUN provided)
#'   - sample_density: kernel density estimate (stats::density) of samples
#'
#' @examples
#' # See example further below.
#' @export
inv_transform <- function(n, inv_cdf_FUN, pdf_FUN = NULL, support_range = NULL, ...) {
  # ---- Validate inputs ----
  if (!is.function(inv_cdf_FUN)) stop("inv_cdf_FUN must be a function.")
  if (!is.null(pdf_FUN) && !is.function(pdf_FUN)) stop("pdf_FUN must be a function or NULL.")
  if (length(n) != 1 || !is.finite(n) || n <= 0 || n != as.integer(n)) {
    stop("n must be a positive integer.")
  }

  # ---- Generate samples via inverse transform ----
  U <- stats::runif(n)
  samples <- inv_cdf_FUN(U)

  # ---- Prepare population density grid if provided ----
  density_df <- NULL
  if (!is.null(pdf_FUN)) {
    if (is.null(support_range)) {
      srg <- range(samples)
      buf <- diff(srg) * 0.15
      support_range <- srg + c(-buf, buf)
    }
    x_vals <- seq(support_range[1], support_range[2], length.out = 1000)
    y_vals <- pdf_FUN(x_vals, ...)
    density_df <- data.frame(x = x_vals, y = y_vals)
  }

  # ---- Empirical density of the sample ----
  emp_density <- stats::density(samples)

  # Return the samples plus density summaries (matches the documented return value)
  invisible(list(samples = samples, density = density_df, sample_density = emp_density))
}

# save 
dump("inv_transform", "inv_transform.R")

Universal RNG diagnostics plotter

#' Universal RNG diagnostics plotter
#'
#' Plots diagnostics for random samples produced by any generator.
#' Accepts either a numeric vector of samples or a list with at least a
#' \code{samples} component. Optionally, the list may include
#' \code{envelope}, \code{range}, and \code{acceptance_rate}.
#' Can overlay a supplied target density/PMF, compare ECDF vs theoretical CDF,
#' and draw a Q–Q plot when the inverse CDF is provided.
#'
#' Panel B (continuous) styling is fixed to:
#' - KDE in blue (#1E88E5) with \code{adjust = 1.2},
#' - population density curve in red (#D81B60) when \code{target_density} is provided,
#' - a transparent histogram overlay,
#' - a light grid background.
#'
#' @param x Numeric vector of samples, or a list with at least \code{samples}.
#'   If a list, optional components \code{envelope}, \code{range}, and
#'   \code{acceptance_rate} will be used when present.
#' @param target_density Optional function \code{f(x)} for overlay (PDF or PMF).
#' @param target_cdf Optional function \code{F(x)} for ECDF comparison.
#' @param inv_cdf Optional function \code{F^{-1}(u)} for Q–Q diagnostics.
#' @param envelope Optional data frame with columns \code{x}, \code{target},
#'   and \code{envelope}. If omitted but \code{x} is a list with \code{envelope},
#'   that will be used.
#' @param xlim,ylim Plot limits for the sample distribution panel.
#' @param breaks Histogram breaks (continuous case); default \code{"FD"}.
#' @param discrete Logical or \code{NA}. If \code{NA} (default), a heuristic
#'   decides between discrete and continuous displays.
#' @param show_kde Logical; intended to toggle the kernel density estimate
#'   (the current implementation always draws the KDE for continuous samples).
#' @param show_rug Logical; add a rug of sample points (continuous samples).
#' @param main Optional title for the sample distribution panel.
#' @param compute_envelope Logical; if \code{TRUE} and no envelope is supplied/found,
#'   rebuild an envelope as \code{c * g(x)} on \code{envelope_range}.
#' @param envelope_range Numeric length-2 \code{c(min, max)} for the envelope grid
#'   and for optimizing \code{f(x)/g(x)} if \code{envelope_c} is \code{NULL}.
#' @param proposal_d Function \code{g(x, range)} returning proposal density; if
#'   \code{NULL} and \code{compute_envelope = TRUE}, defaults to \code{dunif(x, range)}.
#' @param envelope_c Optional numeric; if \code{NULL}, computed as
#'   \eqn{\sup_x f(x)/g(x)} over \code{envelope_range}.
#' @param envelope_n Integer; number of grid points for the envelope (default 1000).
#' @param ... Passed to lower-level plotting functions (currently unused).
#'
#' @return Invisibly returns a list with \code{samples}, \code{envelope} (if any),
#'   and \code{xlim}.
#'
#' @examples
#' set.seed(1)
#' x <- rnorm(1000)
#' plot_RNG(x, target_density = dnorm, target_cdf = pnorm, inv_cdf = qnorm)
#'
#' # Auto-build envelope for a triangular target with Uniform proposal on [-1,1]
#' dtri <- function(z) ifelse(z >= -1 & z <= 1, 1 - abs(z), 0)
#' ptri <- function(z) { out <- numeric(length(z)); out[z < -1] <- 0
#'   i1 <- z >= -1 & z <= 0; out[i1] <- ((z[i1] + 1)^2)/2
#'   i2 <- z > 0  & z <= 1; out[i2] <- 1 - ((1 - z[i2])^2)/2
#'   out[z > 1] <- 1; out }
#' plot_RNG(x, target_density = dtri, target_cdf = ptri,
#'          inv_cdf = function(u) qnorm(u), # replace with qtri for real tri samples
#'          compute_envelope = TRUE,
#'          envelope_range = c(-1, 1),
#'          proposal_d = function(z, range) dunif(z, range[1], range[2]))
#'
#' @importFrom graphics par plot lines legend hist barplot points rug mtext grid abline
#' @importFrom stats density ecdf ppoints qqplot optimize dunif
#' @importFrom grDevices adjustcolor
#' @export
plot_RNG <- function(x,
                     target_density = NULL,
                     target_cdf = NULL,
                     inv_cdf = NULL,
                     envelope = NULL,
                     xlim = NULL, ylim = NULL,
                     breaks = "FD",
                     discrete = NA,
                     show_kde = TRUE,
                     show_rug = FALSE,
                     main = NULL,
                     compute_envelope = FALSE,
                     envelope_range = NULL,
                     proposal_d = NULL,
                     envelope_c = NULL,
                     envelope_n = 1000,
                     ...) {

  # ---- Extract samples & metadata ----
  if (is.numeric(x)) {
    samples   <- x
    rng_range <- NULL
    acc_rate  <- NULL
    env       <- envelope
  } else if (is.list(x) && !is.null(x$samples)) {
    samples   <- x$samples
    rng_range <- if (!is.null(x$range)) x$range else NULL
    acc_rate  <- if (!is.null(x$acceptance_rate)) x$acceptance_rate else NULL
    env       <- if (!is.null(envelope)) envelope else x$envelope
  } else {
    stop("'x' must be a numeric vector of samples or a list with $samples.")
  }

  # ---- Optionally rebuild envelope if missing ----
  if (is.null(env) && isTRUE(compute_envelope)) {
    if (!is.function(target_density)) {
      stop("To compute envelope, provide target_density = f(x).")
    }
    erange <- if (!is.null(envelope_range)) {
      envelope_range
    } else if (!is.null(rng_range)) {
      rng_range
    } else {
      r <- range(samples, finite = TRUE)
      if (!all(is.finite(r))) stop("Cannot infer envelope_range from samples.")
      r
    }
    if (is.null(proposal_d)) {
      proposal_d <- function(z, range) stats::dunif(z, range[1], range[2])
    }
    if (is.null(envelope_c)) {
      ratio_fn <- function(z) {
        gx <- proposal_d(z, erange)
        fx <- target_density(z)
        as.numeric(ifelse(gx > 0 & is.finite(gx), fx / gx, -Inf))
      }
      opt <- stats::optimize(ratio_fn, interval = erange, maximum = TRUE)
      envelope_c <- opt$objective
      if (!is.finite(envelope_c) || envelope_c <= 0) {
        stop("Computed envelope_c is not finite/positive. Check f/g and envelope_range.")
      }
    }
    xg <- seq(erange[1], erange[2], length.out = envelope_n)
    env <- data.frame(
      x        = xg,
      target   = target_density(xg),
      envelope = envelope_c * proposal_d(xg, erange)
    )
  }

  # Accept legacy name if present
  if (!is.null(env)) {
    if (!("envelope" %in% names(env)) && "proposal" %in% names(env)) {
      env$envelope <- env$proposal
    }
    stopifnot(all(c("x", "target", "envelope") %in% names(env)))
  }

  # ---- Determine plotting xlim ----
  if (is.null(xlim)) {
    xlim <- if (!is.null(rng_range)) {
      rng_range
    } else {
      r <- range(samples, finite = TRUE)
      pad <- diff(r) * 0.15; if (!is.finite(pad)) pad <- 1
      r + c(-pad, pad)
    }
  }

  # Discrete vs continuous heuristic
  if (is.na(discrete)) {
    uniq <- unique(samples[is.finite(samples)])
    discrete <- (length(uniq) <= min(50, length(samples) / 5)) &&
                all(abs(uniq - round(uniq)) < 1e-8)
  }

  # Panels
  show_env <- !is.null(env)
  show_cdf <- is.function(target_cdf)
  show_qq  <- is.function(inv_cdf)
  n_panels <- 1 + as.integer(show_env) + as.integer(show_cdf) + as.integer(show_qq)

  # Layout
  if (n_panels <= 2) {
    op <- graphics::par(mfrow = c(1, n_panels), mar = c(4, 4, 3, 1))
  } else {
    rows <- ceiling(n_panels / 2); cols <- 2
    op <- graphics::par(mfrow = c(rows, cols), mar = c(4, 4, 3, 1))
  }
  on.exit(graphics::par(op), add = TRUE)

  # ---- Panel A: Density vs Envelope (if provided) ----
  if (show_env) {
    y_max <- max(env$envelope, env$target, na.rm = TRUE)
    graphics::plot(env$x, env$target, type = "l", lwd = 6,
                   col = "red", ylim = c(0, y_max),
                   main = "Density vs Envelope", xlab = "x", ylab = "Density")
    graphics::lines(env$x, env$envelope, col = "blue", lty = 4, lwd = 6)
    graphics::legend("topleft",
                     legend = c("Target f(x)", "Envelope c g(x)"),
                     col = c("red", "blue"), lty = c(1, 4), lwd = c(6, 6), bty = "n")
  }

  # ---- Panel B: Sample distribution ----
  if (isTRUE(discrete)) {
    # Discrete: barplot + optional theoretical PMF
    tab <- table(factor(samples, levels = sort(unique(samples))))
    xs  <- as.numeric(names(tab))
    p_emp <- as.numeric(tab) / sum(tab)
    y_max <- max(p_emp)
    if (is.function(target_density)) {
      p_th <- target_density(xs); y_max <- max(y_max, p_th, na.rm = TRUE)
    } else p_th <- NULL
    graphics::barplot(height = p_emp, names.arg = xs, space = 0.2,
                      main = if (!is.null(main)) main else "Sample PMF",
                      xlab = "x", ylab = "Probability", ylim = c(0, y_max))
    if (!is.null(p_th)) {
      graphics::points(seq_along(xs), p_th, pch = 16)
      graphics::lines(seq_along(xs), p_th, lwd = 6)
      graphics::legend("topright", legend = c("Empirical PMF", "Target PMF"),
                       lty = c(0, 1), pch = c(15, 16), lwd = c(NA, 6),
                       col = c("gray40", "black"), bty = "n")
    }
  } else {
    # Continuous: KDE (blue), population density (red), translucent histogram, grid
    # target density grid (if provided)
    if (is.function(target_density)) {
      x_grid <- seq(xlim[1], xlim[2], length.out = 512)
      y_th   <- target_density(x_grid)
      y_th[!is.finite(y_th)] <- NA_real_
      ymax_th <- max(y_th, na.rm = TRUE)
    } else {
      x_grid <- y_th <- NULL
      ymax_th <- NA_real_
    }

    # KDE with fixed smoothing
    kde <- try(stats::density(samples, adjust = 1.2), silent = TRUE)

    # y-limits
    if (!is.null(ylim)) {
      y_lim <- ylim
    } else if (!inherits(kde, "try-error") && is.finite(max(kde$y))) {
      y_lim <- range(c(kde$y, ymax_th), finite = TRUE)
    } else if (is.finite(ymax_th)) {
      y_lim <- c(0, ymax_th)
    } else {
      y_lim <- NULL
    }

    # 1) Base layer: KDE (blue) with grid background
    graphics::plot(kde,
                   col = "#1E88E5", lwd = 3,
                   xlim = xlim, ylim = y_lim,
                   xlab = "Value (x)", ylab = "Density",
                   main = if (is.null(main)) "Density Comparison" else main,
                   cex.main = 1.1,
                   panel.first = graphics::grid(col = "gray90"))

    # 2) Population density (red)
    if (!is.null(x_grid)) {
      graphics::lines(x_grid, y_th, col = "#D81B60", lwd = 3)
    }

    # 3) Transparent histogram overlay
    #    default 'breaks' is "FD"; if so, use 20 bins to match example
    breaks_local <- if (identical(breaks, "FD")) 20 else breaks
    graphics::hist(samples, freq = FALSE, breaks = breaks_local,
                   col = grDevices::adjustcolor("#1E88E5", alpha.f = 0.20),
                   border = NA, add = TRUE)

    # 4) Legend
    graphics::legend("topright",
                     legend = c("Random sample", "Population distribution"),
                     col = c("#1E88E5", "#D81B60"),
                     lwd = 3, bty = "n", cex = 0.9)

    if (isTRUE(show_rug)) graphics::rug(samples)
    
    # annotate acceptance rate if available
    if (!is.null(acc_rate)) {
    graphics::mtext(sprintf("  Acceptance rate: %.2f%%", 100 * acc_rate),
                    side = 3, line = -1.5, adj = 0, cex = 1)
    }
  }

  # ---- Panel C: ECDF vs CDF (if provided) ----
  if (is.function(target_cdf)) {
    F_emp <- stats::ecdf(samples)
    xs <- seq(xlim[1], xlim[2], length.out = 512)
    ecdf_col   <- grDevices::adjustcolor("gray40", alpha.f = 0.6)
    th_cdf_col <- grDevices::adjustcolor("red",     alpha.f = 0.75)
    graphics::plot(F_emp, main = "ECDF vs CDF", xlab = "x", ylab = "Cumulative prob.",
                   verticals = TRUE, do.points = TRUE,
                   col = ecdf_col, lwd = 6, pch = 16, cex = 1.6)
    graphics::lines(xs, target_cdf(xs), col = th_cdf_col, lwd = 6)
    graphics::legend("bottomright", legend = c("Empirical CDF", "Target CDF"),
                     col = c(ecdf_col, th_cdf_col), lty = c(1, 1), lwd = c(6, 6),
                     pch = c(16, NA), pt.cex = c(1.6, NA), bty = "n")
  }

  # ---- Panel D: Q–Q (if inverse CDF provided) ----
  if (is.function(inv_cdf)) {
    u    <- stats::ppoints(length(samples))
    th_q <- inv_cdf(u)
    pt_col   <- grDevices::adjustcolor("gray20", alpha.f = 0.35)
    line_col <- grDevices::adjustcolor("red",     alpha.f = 0.8)
    stats::qqplot(th_q, samples,
                  main = "Q–Q (Theoretical vs Sample)",
                  xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
                  pch = 19, cex = 2, col = pt_col)
    graphics::abline(0, 1, lwd = 3, col = line_col)
  }

  invisible(list(samples = samples, envelope = env, xlim = xlim))
}

# save
dump("plot_RNG", "plot_RNG.R")

Generate Samples via Inverse Transform

# Use qtri() as the inverse CDF and dtri() as the reference density
set.seed(99)
res <- inv_transform(
  n = 1000,
  inv_cdf_FUN = qtri,
  pdf_FUN = dtri,
  support_range = c(-1.5, 1.5)   # a little buffer beyond support [-1,1]
)

plot_RNG(res,
         target_density = dtri,          # population (red) curve in the density panel
         target_cdf     = ptri,          # ECDF vs CDF panel
         inv_cdf        = function(u) qtri(u, -1, 1),  # Q-Q panel
         # note: the KDE bandwidth adjustment (1.2) is fixed inside plot_RNG()
         xlim = c(-1.5, 1.5),
         ylim = c(0, 1.1))


7.2.4 Goodness-of-fit diagnostics (Kolmogorov–Smirnov & Anderson–Darling)

#' Goodness-of-fit diagnostics (Kolmogorov–Smirnov & Anderson–Darling)
#'
#' Runs Kolmogorov–Smirnov and Anderson–Darling tests comparing samples
#' to a target CDF. Accepts either a numeric vector or a list with
#' \code{$samples} (e.g., from a simulator). Optionally prints a
#' sample-vs-theory quantile table at selected probabilities.
#'
#' @param x Numeric vector of samples, or a list with a \code{samples} element.
#' @param target_cdf Function \code{F(x)} giving the target cumulative
#'   distribution function.
#' @param inv_cdf Function \code{F^{-1}(u)} giving theoretical quantiles at \code{probs}.
#' @param probs Numeric vector of probabilities in (0, 1) at which to report
#'   quantiles. Default: \code{c(0.01, 0.05, 0.95, 0.99)}.
#'
#' @return Invisibly returns a list with:
#'   \itemize{
#'     \item \code{ks_test}: \code{stats::htest} object (KS)
#'     \item \code{ad_test}: \code{htest}-like object from \code{goftest::ad.test}
#'     \item \code{quantiles}: \code{data.frame} with \code{prob}, \code{sample},
#'           and (if available) \code{theory}
#'   }
#'
#' @details The Anderson–Darling test requires the \pkg{goftest} package.
#' If it is not installed, the function stops with an informative message.
#'
#' @examples
#' set.seed(123)
#' # Triangular target on [-1,1]
#' dtri <- function(x) ifelse(x >= -1 & x <= 1, 1 - abs(x), 0)
#' ptri <- function(x) {
#'   out <- numeric(length(x)); out[x < -1] <- 0
#'   i1 <- x >= -1 & x <= 0; out[i1] <- ((x[i1] + 1)^2)/2
#'   i2 <- x > 0  & x <= 1; out[i2] <- 1 - ((1 - x[i2])^2)/2
#'   out[x > 1] <- 1; out
#' }
#' qtri <- function(p, a=-1, b=1) {
#'   out <- numeric(length(p)); i1 <- p <= 0.5; i2 <- p > 0.5
#'   out[i1] <- a + (b - a) * sqrt(p[i1] / 2)
#'   out[i2] <- b - (b - a) * sqrt((1 - p[i2]) / 2)
#'   out[p==0] <- a; out[p==1] <- b; out
#' }
#' samples <- stats::runif(1000, -1, 1)  # replace with your generator's output
#' gof_diagnostics(samples, target_cdf = ptri, inv_cdf = function(u) qtri(u, -1, 1))
#'
#' @importFrom stats ks.test quantile
#' @export
gof_diagnostics <- function(x, target_cdf, inv_cdf,
                            probs = c(0.01, 0.05, 0.95, 0.99)) {
  # Extract samples
  samples <- if (is.list(x) && !is.null(x$samples)) x$samples else as.numeric(x)
  samples <- samples[is.finite(samples)]
  if (!length(samples)) stop("No finite samples provided.")
  if (!is.function(target_cdf)) stop("'target_cdf' must be a function F(x).")
  if (!is.function(inv_cdf))   stop("'inv_cdf' must be a function F^{-1}(u).")
  if (!is.numeric(probs) || any(probs <= 0 | probs >= 1)) {
    stop("'probs' must be in (0,1).")
  }

  # Tests
  ks_out <- stats::ks.test(samples, function(t) target_cdf(t))
  if (!requireNamespace("goftest", quietly = TRUE)) {
    stop("Package 'goftest' is required for the Anderson–Darling test. ",
         "Install it with: install.packages('goftest')")
  }
  ad_out <- goftest::ad.test(samples, function(t) target_cdf(t))

  # Quantiles (with percent labels as row names)
  q_sample <- as.numeric(stats::quantile(samples, probs, names = FALSE))
  q_theory <- as.numeric(inv_cdf(probs))
  rowlab   <- paste0(formatC(probs * 100, format = "f", digits = 0), "%")
  qt_tab   <- data.frame(sample = q_sample, theory = q_theory, row.names = rowlab)

  # Print
  cat("\n=== Goodness-of-Fit Diagnostics ===\n")
  cat("Kolmogorov–Smirnov:\n")
  cat("  D =", format(ks_out$statistic, digits = 3),
      "| p-value =", format.pval(ks_out$p.value, digits = 3), "\n")
  cat("Anderson–Darling:\n")
  cat("  A =", format(ad_out$statistic, digits = 3),
      "| p-value =", format.pval(ad_out$p.value, digits = 3), "\n")
  cat("\n")
  print(qt_tab, right = TRUE, digits = 7)

  invisible(list(ks_test = ks_out, ad_test = ad_out, quantiles = qt_tab))
}
set.seed(123)

gof_diagnostics(res, target_cdf = ptri, inv_cdf = qtri)
## 
## === Goodness-of-Fit Diagnostics ===
## Kolmogorov–Smirnov:
##   D = 0.0271 | p-value = 0.455 
## Anderson–Darling:
##   A = 0.901 | p-value = 0.414 
## 
##         sample     theory
## 1%  -0.8834588 -0.8585786
## 5%  -0.6692522 -0.6837722
## 95%  0.6917262  0.6837722
## 99%  0.8636028  0.8585786

Goodness-of-Fit — Interpretation:

KS test: D = 0.0271, p = 0.455 → fail to reject at α = 0.05.
- Max CDF gap is ~2.7%, which is small.

Anderson–Darling: A = 0.901, p = 0.414 → fail to reject at α = 0.05.
- No evidence of tail mismatches.

Takeaway: The sample is consistent with the target distribution; both center (KS) and tails (AD) look fine within sampling noise.


Quantile check — interpretation:

(sample − theory)

  • 1%: −0.88346 vs −0.85858 → −0.02488
    More negative than theory → left extreme tail a bit heavier.

  • 5%: −0.66925 vs −0.68377 → +0.01452
    Less negative than theory → left mid-tail a bit lighter.

  • 95%: 0.69173 vs 0.68377 → +0.00795
    Slightly larger than theory → right mid-tail slightly heavier.

  • 99%: 0.86360 vs 0.85858 → +0.00502
    Slightly larger than theory → right extreme tail slightly heavier.

Overall: tiny deviations (≈0.005–0.025). The pattern suggests slightly heavier tails (especially the left tail at 1%) with a mild right-side tilt. Since both the KS (p = 0.455) and AD (p = 0.414) tests fail to reject, these differences are consistent with sampling noise at this \(n\). To probe further, increase \(n\) (e.g., 5,000–10,000) and recheck the Q–Q and ECDF vs CDF panels.


7.3 Rejection Sampling (Acceptance–Rejection Method)

Rejection sampling is a classical technique for generating random draws from a target distribution \(f(x)\) using samples from a simpler proposal distribution \(g(x)\). The method works by placing an “envelope” over the target density and accepting or rejecting candidate draws based on how well they represent \(f(x)\).

Although conceptually simple, rejection sampling can be inefficient when the target density is sharply peaked or highly concentrated, since many proposals may be discarded. Variants such as adaptive rejection sampling have been developed to reduce this inefficiency.


Key Idea

  1. Select a proposal distribution \(G\) with density \(g(x)\) such that:
    • \(g(x)\) is easy to sample from, and
    • there exists a constant \(c > 0\) with
      \[ f(x) \leq c \times g(x) \quad \text{for all } x. \]
    Here, \(c \times g(x)\) acts as an upper bound (envelope) for the target density.

Algorithm: Rejection Sampling

  1. Sample a candidate
    Draw \(x \sim g(x)\).

  2. Sample a uniform variate
    Draw \(u \sim U(0,1)\).

  3. Acceptance rule
    Accept \(x\) if
    \[ u < \frac{f(x)}{c \times g(x)}. \]
    Otherwise, reject \(x\) and return to step 1.

  4. Output
    Accepted samples follow the target distribution \(f(x)\).


Notes

  • Efficiency depends on how closely \(c\,g(x)\) tracks the shape of \(f(x)\); larger \(c\) means more rejections, since the overall acceptance probability is \(1/c\) (see the sketch after these notes).
  • Choice of proposal is critical—good proposals approximate the target distribution’s shape.
  • Adaptive rejection sampling adjusts the proposal distribution iteratively to improve acceptance probability for difficult targets.
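
A minimal vectorized sketch (separate from the full implementation below), assuming the triangular target \(f(x) = 1 - |x|\) on \([-1, 1]\) and a Uniform\((-1, 1)\) proposal, for which \(c = \sup_x f(x)/g(x) = 2\):

# Rejection sampling sketch: target f(x) = 1 - |x| on [-1, 1], proposal g = Uniform(-1, 1)
set.seed(3)
n_prop <- 2e4
x_cand <- runif(n_prop, -1, 1)                 # step 1: candidates from g
u      <- runif(n_prop)                        # step 2: uniform variates
accept <- u < (1 - abs(x_cand)) / (2 * 0.5)    # step 3: u < f(x) / (c * g(x)), with c = 2 and g(x) = 0.5
x_acc  <- x_cand[accept]                       # step 4: accepted draws follow f
mean(accept)                                   # close to 1/c = 0.5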

#' Rejection Sampling with Pluggable Target and Proposal
#'
#' Generates random variates from a target distribution \eqn{f(x)} using
#' acceptance–rejection with a user-specified proposal \eqn{g(x)}.
#' The envelope constant \eqn{c} satisfies \eqn{f(x) \le c\,g(x)} on \code{range}.
#'
#' @details
#' If \code{c} is \code{NULL}, it is estimated by maximizing
#' \eqn{f(x)/g(x)} over \code{range} via \code{\link[stats]{optimize}}.
#' The function can either return just the numeric sample vector
#' (\code{return = "samples"}) or a rich object containing diagnostics
#' (\code{return = "object"}), including an envelope grid for plotting.
#'
#' @param n Integer (>\,0). Number of samples to generate.
#' @param range Numeric length-2 \code{c(min, max)} giving proposal support and
#'   the search interval for \eqn{c}.
#' @param target_density Function \code{f(x)} returning target density values.
#' @param target_cdf Optional function \code{F(x)}. Stored in the returned object
#'   for downstream tests/plots; not used by the sampler itself.
#' @param proposal_r Function to draw from proposal \eqn{g}. Signature:
#'   \code{function(n, range) -> numeric}.
#'   Defaults to \code{runif(n, range[1], range[2])}.
#' @param proposal_d Function to evaluate proposal density \eqn{g(x)}. Signature:
#'   \code{function(x, range) -> numeric}.
#'   Defaults to \code{dunif(x, range[1], range[2])}.
#' @param c Positive numeric or \code{NULL}. Envelope constant; if \code{NULL}
#'   it is computed as \eqn{\sup_x f(x)/g(x)} on \code{range}.
#' @param max_attempts Integer. Hard cap on total proposals; default \code{20 * n}.
#' @param show_progress Logical. Emit progress every 10,000 attempts.
#' @param return Character, one of \code{"samples"} or \code{"object"}.
#'   Controls the return type.
#'
#' @return If \code{return = "samples"}, a numeric vector of length \eqn{\le n}
#'   (exactly \eqn{n} unless the attempt cap is hit).
#'   If \code{return = "object"}, an object of class \code{"rejection_sample"}
#'   with components:
#'   \itemize{
#'     \item \code{samples} Numeric vector of accepted draws.
#'     \item \code{acceptance_rate} Proportion accepted.
#'     \item \code{avg_attempts_per_accept} Attempts per accept.
#'     \item \code{envelope} Data frame with columns \code{x}, \code{target},
#'       \code{envelope = c * g(x)} for diagnostics.
#'     \item \code{n_attempts}, \code{range}, \code{c}, \code{target_cdf}.
#'   }
#'
#' @examples
#' # Triangular target on [-1, 1]: f(x) = 1 - |x|
#' dtri <- function(x) ifelse(x >= -1 & x <= 1, 1 - abs(x), 0)
#' ptri <- function(x) {
#'   out <- numeric(length(x))
#'   out[x < -1] <- 0
#'   i1 <- x >= -1 & x <= 0; out[i1] <- ((x[i1] + 1)^2)/2
#'   i2 <- x > 0  & x <= 1; out[i2] <- 1 - ((1 - x[i2])^2)/2
#'   out[x > 1] <- 1; out
#' }
#'
#' set.seed(123)
#' # (1) Just samples
#' xs <- rejection_sampling(n = 1000, range = c(-1, 1),
#'                          target_density = dtri, return = "samples")
#' # (2) Full object for plotting/diagnostics
#' res <- rejection_sampling(n = 1000, range = c(-1, 1),
#'                           target_density = dtri, target_cdf = ptri,
#'                           return = "object")
#' # if you also define plot.rejection_sample(), then:
#' # plot(res)
#'
#' @seealso \code{\link[stats]{optimize}}, \code{\link[stats]{runif}},
#'   \code{\link[stats]{dunif}}
#' @references
#' Devroye, L. (1986). \emph{Non-Uniform Random Variate Generation}. Springer. \cr
#' Robert, C. P., & Casella, G. (2004). \emph{Monte Carlo Statistical Methods}. Springer.
#'
#' @importFrom stats runif dunif optimize
#' @export
rejection_sampling <- function(n = 10000,
                               range = c(0, 1),
                               target_density,
                               target_cdf = NULL,
                               proposal_r = function(n, range) stats::runif(n, range[1], range[2]),
                               proposal_d = function(x, range) stats::dunif(x, range[1], range[2]),
                               c = NULL,
                               max_attempts = NULL,
                               show_progress = FALSE,
                               return = c("samples", "object")) {
  return <- match.arg(return)

  # ---- Input checks ----
  stopifnot(length(n) == 1, is.finite(n), n == as.integer(n), n > 0)
  stopifnot(is.numeric(range), length(range) == 2, range[1] < range[2])
  if (!is.function(target_density)) stop("'target_density' must be a function")
  if (!is.null(target_cdf) && !is.function(target_cdf)) stop("'target_cdf' must be a function or NULL")
  if (!is.function(proposal_r)) stop("'proposal_r' must be a function")
  if (!is.function(proposal_d)) stop("'proposal_d' must be a function")
  max_attempts <- if (is.null(max_attempts)) 20L * n else as.integer(max_attempts)

  # ---- Compute envelope constant c if not given: sup_x f(x)/g(x) over 'range' ----
  if (is.null(c)) {
    ratio_fn <- function(x) {
      gx <- proposal_d(x, range)
      fx <- target_density(x)
      r  <- ifelse(gx > 0 & is.finite(gx), fx / gx, -Inf)
      as.numeric(r)
    }
    opt <- optimize(ratio_fn, interval = range, maximum = TRUE)
    c <- opt$objective
    if (!is.finite(c) || c <= 0) stop("Computed envelope constant 'c' is not finite/positive. Check densities and 'range'.")
  }

  # ---- Storage ----
  samples  <- numeric(n)
  attempts <- 0L
  accepted <- 0L

  # ---- Main loop ----
  while (accepted < n && attempts < max_attempts) {
    x_prop <- proposal_r(1L, range)
    gx <- proposal_d(x_prop, range)
    fx <- target_density(x_prop)
    if (!is.finite(gx) || gx <= 0) { attempts <- attempts + 1L; next }
    u <- stats::runif(1L)
    if (u < fx / (c * gx)) {
      accepted <- accepted + 1L
      samples[accepted] <- x_prop
    }
    attempts <- attempts + 1L
    if (show_progress && attempts %% 10000L == 0L) {
      message("Attempts: ", attempts, " | Accepted: ", accepted)
    }
  }

  if (accepted < n) {
    warning(sprintf("Only %d/%d samples accepted after %d attempts",
                    accepted, n, attempts))
    samples <- samples[seq_len(accepted)]
  }

  # ---- Return mode ----
  if (return == "samples") {
    return(samples)
  }

  # Build envelope only if returning full object
  x_grid <- seq(range[1], range[2], length.out = 1000)
  env_df <- data.frame(
    x        = x_grid,
    target   = target_density(x_grid),
    envelope = c * proposal_d(x_grid, range)
  )

  structure(list(
    samples                 = samples,
    acceptance_rate         = accepted / attempts,
    avg_attempts_per_accept = if (accepted > 0) attempts / accepted else Inf,
    envelope                = env_df,
    n_attempts              = attempts,
    range                   = range,
    c                       = c,
    target_cdf              = target_cdf
  ), class = "rejection_sample")
}

# save
dump("rejection_sampling", "rejection_sampling.R")

set.seed(123)

# Just the numeric vector (no envelope returned)
res <- rejection_sampling(
  n = 10000, range = c(-1, 1),
  target_density = dtri,
  return = "samples"
)
plot_RNG(
  res,
  target_density = dtri,
  target_cdf = ptri,
  inv_cdf = function(u) qtri(u, a = -1, b = 1),
  compute_envelope = TRUE,
  envelope_range = c(-1, 1),
  proposal_d = function(z, range) stats::dunif(z, range[1], range[2])
)

set.seed(123)
samples <- rejection_sampling(
  n = 10000, range = c(-1, 1),
  target_density = dtri,
  return = "samples"
)

gof_diagnostics(samples, target_cdf = ptri, inv_cdf = qtri)
## 
## === Goodness-of-Fit Diagnostics ===
## Kolmogorov–Smirnov:
##   D = 0.0076 | p-value = 0.61 
## Anderson–Darling:
##   A = 0.379 | p-value = 0.869 
## 
##         sample     theory
## 1%  -0.8531725 -0.8585786
## 5%  -0.6843214 -0.6837722
## 95%  0.6796308  0.6837722
## 99%  0.8534387  0.8585786

Goodness-of-Fit Diagnostics — Interpretation:

KS test: D = 0.0076, p = 0.61 → fail to reject (no evidence of mismatch near the center).
AD test: A = 0.379, p = 0.869 → fail to reject (no evidence of tail mismatch).

Takeaway: The samples are consistent with the target distribution. The maximum CDF difference is ~0.76%, which is tiny and well within sampling noise at common α levels (0.05, 0.10).


Quantile check — interpretation:

Differences (sample − theory):

  • 1%: −0.85317 vs −0.85858 → +0.00541 (left extreme tail slightly lighter)
  • 5%: −0.68432 vs −0.68377 → −0.00055 (left mid-tail essentially exact; a hair heavier)
  • 95%: 0.67963 vs 0.68377 → −0.00414 (right mid-tail slightly lighter)
  • 99%: 0.85344 vs 0.85858 → −0.00514 (right extreme tail slightly lighter)

Bottom line: all deviations are tiny (≤ ~0.005), symmetric, and fully consistent with sampling noise for typical \(n\) (especially in the 1%/99% tails). This supports an excellent match to the target distribution.


8 Monte Carlo Methods

Monte Carlo methods, also known as Monte Carlo experiments, are computational algorithms that rely on repeated random sampling to obtain numerical results. These techniques run large numbers of simulations to estimate the distribution of unknown probabilistic quantities.

Core Concept

Monte Carlo methods use randomness to solve problems that might be deterministic in principle. They are particularly useful for:

  • Modeling complex systems with significant uncertainty
  • Evaluating difficult deterministic problems
  • Estimating probability distributions
  • Computing high-dimensional integrals

The Monte Carlo Algorithm

A typical Monte Carlo implementation involves four key steps:

  • Define the Domain
    Establish the range of possible inputs and their probability distributions.
  • Generate Random Inputs
    Sample from the specified probability distribution(s) over the domain.
  • Perform Deterministic Computation
    Apply the desired mathematical operation or simulation to each random input.
  • Aggregate Results
    Combine individual outcomes to estimate the quantity of interest.

Statistical Applications: Monte Carlo methods serve two primary purposes in applied statistics:

  • Comparison of Alternative Statistics
    • Evaluate different estimators
    • Compare test statistics under various conditions
    • Assess robustness of statistical methods
  • Implementation of Hypothesis Tests
    • Create exact tests when distributional assumptions are uncertain
    • Simulate null distributions by random sampling
    • Compute p-values through simulation (see the sketch below)
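
A toy sketch of a simulation-based p-value, using hypothetical data and a standard normal null (both are illustrative assumptions):

# Monte Carlo p-value: is the observed sample mean unusually large under H0: X ~ N(0, 1)?
set.seed(7)
obs    <- c(0.9, 1.4, 0.3, 1.1, 0.8)                  # hypothetical sample
t_obs  <- mean(obs)                                   # observed test statistic
t_null <- replicate(1e4, mean(rnorm(length(obs))))    # simulated null distribution of the mean
mean(t_null >= t_obs)                                 # one-sided Monte Carlo p-value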

Mathematical Foundation

The method relies on the Law of Large Numbers:

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n f(X_i) = \mathbb{E}[f(X)] \]

where:

  • \(X_i\) are independent random samples
  • \(f\) is the function of interest
  • \(n\) is the number of samples

Advantages

  • Flexible for complex, high-dimensional problems
  • Convergence rate \(\mathcal{O}(1/\sqrt{n})\) independent of dimension
  • Parallelizable computations
  • Intuitive implementation

Common Applications

  • Numerical integration
  • Risk analysis
  • Statistical physics
  • Bayesian inference
  • Financial modeling
  • Computer graphics

Example: Estimating π by randomly sampling points in a unit square and calculating the fraction that falls within the inscribed circle.
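
A minimal sketch of this example, assuming points drawn uniformly on the unit square with the inscribed circle centered at (0.5, 0.5):

# Estimate pi from the fraction of uniform points inside the inscribed circle
set.seed(42)
n <- 1e6
x <- runif(n); y <- runif(n)                   # points in [0, 1]^2
inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.5^2   # inside the inscribed circle (radius 0.5)
4 * mean(inside)                               # circle/square area ratio is pi/4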

8.1 Classical Monte Carlo Integration

Monte Carlo integration is a numerical technique for approximating definite integrals using random sampling. Unlike deterministic quadrature methods, this approach evaluates the integrand at randomly selected points within the integration domain.

Mathematical Formulation

Given an expectation of the form:

\[ \mathbb{E}[h(X)] = \int_{\mathcal{X}} h(x) f(x) \, dx \]

where:

  • \(\mathcal{X}\) is the domain of integration
  • \(X\) is a random variable with probability density function \(f(x)\)
  • \(h(x)\) is the integrand function

Algorithm

  1. Sampling: Generate \(n\) independent random samples \((X_1, ..., X_n)\) from the density \(f(x)\)
  2. Evaluation: Compute \(h(X_i)\) for each sample
  3. Averaging: Approximate the integral as:

\[ \hat{I}_n = \frac{1}{n} \sum_{i=1}^n h(X_i) \]

Theoretical Foundation

The approximation converges almost surely to the true value by the Strong Law of Large Numbers:

\[ \lim_{n \to \infty} \hat{I}_n = \mathbb{E}[h(X)] \quad \text{a.s.} \]

Error Analysis

The standard error of the Monte Carlo estimate is:

\[ \text{SE} = \frac{\sigma_h}{\sqrt{n}} \]

where \(\sigma_h^2 = \text{Var}(h(X))\). This demonstrates the characteristic:

  • Square root convergence (\(O(n^{-1/2})\))
  • Dimension-independent convergence rate
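
To see the \(O(n^{-1/2})\) rate empirically, the sketch below (assuming the integrand \(f(x) = 1 - x^2\) on \([-1, 1]\) used later in this section) quadruples \(n\) and checks that the standard error roughly halves:

# Empirical check of the O(1/sqrt(n)) convergence rate for plain Monte Carlo
set.seed(10)
mc_se <- function(n, reps = 200) {
  est <- replicate(reps, {
    x <- runif(n, -1, 1)
    2 * mean(1 - x^2)        # (b - a) * mean(h(X)); exact integral is 4/3
  })
  sd(est)                    # spread of the estimates = Monte Carlo standard error
}
sapply(c(1e3, 4e3, 1.6e4), mc_se)   # each SE should be roughly half the previous one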

Advantages and Limitations

Advantages:

  • Particularly effective for high-dimensional integrals
  • Simple to implement
  • Flexible for complex domains and integrands

Limitations:

  • Relatively slow convergence compared to deterministic methods for low-dimensional problems
  • Requires careful variance reduction techniques for practical efficiency

# number of observations
n <- 1e4 # 10,000

# number of iterations
iterations <- 200
min <- -1 # lower bound of integration
max <- 1 # upper bound of integration

f <- function(x){return(1 - (x^2))} # f(x)

# g <- function(x,y){return(y <= 1-(x^2))} # g(x)
V <- 2

# standard error
SE <- function(x){sqrt(var(x,na.rm = TRUE)/length(na.omit(x)))} 
#' Classical Monte Carlo Integration on a 1D Interval
#'
#' @description
#' `monte_carlo_integrate()` approximates a one-dimensional definite integral
#' \eqn{\int_{a}^{b} f(x)\,dx} by standard (plain) Monte Carlo using
#' uniform sampling on \([a,b]\). The function is robust to non-vectorized
#' integrands: it first attempts vectorized evaluation and falls back to a
#' safe elementwise evaluation if needed.
#'
#' @param FUN Function. The integrand \code{f(x)} to be integrated.
#'   May be vectorized; if not, the function will evaluate it safely elementwise.
#' @param min,max Numeric. Integration bounds with \code{max > min}.
#' @param n Integer (length 1). Number of uniform samples per iteration.
#' @param iterations Integer (length 1). Number of independent Monte Carlo
#'   replications; results across iterations are summarized into mean/SE/CI.
#' @param V Numeric. Volume/length of the integration domain. Defaults to
#'   \code{max - min} for the 1D case.
#' @param conf_level Numeric in (0,1). Confidence level for the normal-approximation
#'   interval around the Monte Carlo mean across iterations. Default \code{0.95}.
#' @param ... Additional arguments passed to \code{FUN}.
#'
#' @returns A list with components:
#' \itemize{
#'   \item \code{mean} — Monte Carlo mean of the integral estimates across iterations
#'   \item \code{var} — Sample variance of the integral estimates across iterations
#'   \item \code{SE} — Standard error of the mean across iterations
#'   \item \code{CI} — Normal-approximation confidence interval at \code{conf_level}
#'   \item \code{conf_level} — The confidence level used
#'   \item \code{estimates} — Vector of per-iteration integral estimates
#'   \item \code{n_per_iter} — Number of samples per iteration (\code{n})
#'   \item \code{iterations} — Number of MC replications
#'   \item \code{volume} — The domain length \code{V}
#'   \item \code{method} — Description string
#' }
#'
#' @details
#' For each iteration, the method draws \eqn{X_i \sim \mathrm{Unif}(a,b)}, evaluates
#' \eqn{Y_i = f(X_i)}, and returns the estimate
#' \deqn{\hat{I} = (b-a)\,\frac{1}{n}\sum_{i=1}^{n} Y_i.}
#' Across \code{iterations}, the function reports the mean, variance, standard error,
#' and a normal-approximation confidence interval for the mean. The convergence rate
#' is \eqn{\mathcal{O}(n^{-1/2})} w.r.t. the number of samples per iteration.
#'
#' If you later compare variance-reduction methods (e.g., antithetic or importance
#' sampling), keep the same \emph{total} sample budget to make SE comparisons fair.
#'
#' @examples
#' # Integrate f(x) = 1 - x^2 over [-1, 1]; exact value is 4/3 ≈ 1.3333
#' f <- function(x) 1 - x^2
#' set.seed(1)
#' out <- monte_carlo_integrate(f, min = -1, max = 1, n = 1e4, iterations = 200)
#' out$mean
#' out$SE
#' out$CI
#'
#' # Non-vectorized integrand is fine:
#' g <- function(x) sapply(x, function(t) exp(-t^2))
#' set.seed(2)
#' out2 <- monte_carlo_integrate(g, min = -2, max = 2, n = 5000, iterations = 100)
#' out2$mean
#'
#' @seealso \code{\link[stats]{integrate}} for adaptive quadrature in 1D.
#'
#' @importFrom stats runif var sd qnorm
#' @export
#' 
monte_carlo_integrate <- function(FUN, min, max, n, iterations, V = max - min,
                                  conf_level = 0.95, ...) {
  # ---- Input checks ----
  if (!is.function(FUN)) stop("FUN must be a function.")
  if (!is.numeric(min) || !is.numeric(max) || !is.finite(min) || !is.finite(max) || max <= min) {
    stop("'min' and 'max' must be finite numerics with max > min.")
  }
  if (length(n) != 1 || !is.finite(n) || n <= 0 || n != as.integer(n)) {
    stop("'n' must be a positive integer.")
  }
  if (length(iterations) != 1 || !is.finite(iterations) || iterations <= 0 ||
      iterations != as.integer(iterations)) {
    stop("'iterations' must be a positive integer.")
  }
  if (!is.numeric(V) || !is.finite(V) || V <= 0) stop("'V' must be a positive numeric.")
  if (!is.numeric(conf_level) || conf_level <= 0 || conf_level >= 1) {
    stop("'conf_level' must be in (0, 1).")
  }

  f <- match.fun(FUN)
  estimates <- numeric(iterations)

  for (i in seq_len(iterations)) {
    U <- stats::runif(n, min, max)

    # Try vectorized evaluation; if shape mismatches or error, fall back
    Y <- try(f(U, ...), silent = TRUE)
    if (inherits(Y, "try-error") || length(Y) != length(U)) {
      Y <- vapply(U, f, numeric(1), ...)
    }

    estimates[i] <- V * mean(Y)
  }

  m  <- mean(estimates)
  v  <- stats::var(estimates)
  se <- stats::sd(estimates) / sqrt(iterations)

  z  <- stats::qnorm((1 + conf_level) / 2)
  ci <- c(lower = m - z * se, upper = m + z * se)

  list(
    mean       = m,
    var        = v,
    SE         = se,
    CI         = ci,
    conf_level = conf_level,
    estimates  = estimates,
    n_per_iter = n,
    iterations = iterations,
    volume     = V,
    method     = "Classical Monte Carlo (Uniform on [min,max])"
  )
}
# Integrate f(x) = 1 - x^2 over [-1, 1]; exact value is 4/3 ≈ 1.3333

f <- function(x) 1 - x^2
set.seed(1)
out <- monte_carlo_integrate(f, min = -1, max = 1, n = 1e4, iterations = 200)

mc_report <- function(x, exact = NULL, digits = 6) {
  if (!is.list(x) || is.null(x$mean) || is.null(x$SE)) stop("Not an mc result.")
  fmt <- function(v) formatC(v, digits = digits, format = "fg", flag = "#")
  cat("\n── Monte Carlo Integral — Summary ─────────────────────────\n")
  cat("Method: ", x$method %||% "Classical Monte Carlo", "\n", sep = "")
  if (!is.null(x$volume)) cat("Domain length (V): ", fmt(x$volume), "\n", sep = "")
  if (!is.null(x$n_per_iter) && !is.null(x$iterations)) {
    cat("Samples per iter: ", x$n_per_iter, 
        " | Iterations: ", x$iterations,
        " | Total evals: ", x$n_per_iter * x$iterations, "\n", sep = "")
  }
  cl <- x$conf_level %||% 0.95
  cat("Confidence level: ", fmt(cl), "\n", sep = "")
  cat("\nEstimate: ", fmt(x$mean), "  (± ", fmt(x$SE), " SE)\n", sep = "")
  if (!is.null(x$CI)) {
    cat("CI: [", fmt(x$CI[1]), ", ", fmt(x$CI[2]), "]\n", sep = "")
  }
  if (!is.null(exact)) {
    rel_err <- (x$mean - exact) / exact
    cat("Exact:   ", fmt(exact), 
        "  |  Abs. error: ", fmt(x$mean - exact),
        "  |  Rel. error: ", fmt(rel_err), "\n", sep = "")
  }
  cat("────────────────────────────────────────────────────────────\n")
  invisible(x)
}

# small helper like %||% (fallback)
`%||%` <- function(a, b) if (is.null(a)) b else a

mc_report(out, exact = 4/3)
## 
## ── Monte Carlo Integral — Summary ─────────────────────────
## Method: Classical Monte Carlo (Uniform on [min,max])
## Domain length (V): 2.00000
## Samples per iter: 10000 | Iterations: 200 | Total evals: 2e+06
## Confidence level: 0.950000
## 
## Estimate: 1.33364  (± 0.000400828 SE)
## CI: [1.33286, 1.33443]
## Exact:   1.33333  |  Abs. error: 0.000309288  |  Rel. error: 0.000231966
## ────────────────────────────────────────────────────────────

8.2 Antithetic Variates

Antithetic variates is a variance reduction technique used in Monte Carlo simulations to improve computational efficiency. By generating negatively correlated sample paths, the method reduces estimator variance without the need for additional samples, making it an effective way to enhance precision while minimizing computational cost.

Core Principle

For estimating: \[ \mathbb{E}[h(X)] = \mathbb{E}[Y] \]

We generate pairs of dependent samples \((Y_1, Y_2)\) where:

  • \(Y_1 = h(X_1)\)
  • \(Y_2 = h(X_2)\) with \(X_2\) constructed to be antithetic to \(X_1\)

The estimator becomes: \[ \hat{\mu}_{AV} = \frac{Y_1 + Y_2}{2} \]

Variance Analysis

The variance of the antithetic estimator is: \[ \text{Var}(\hat{\mu}_{AV}) = \frac{\text{Var}(Y_1) + \text{Var}(Y_2) + 2\text{Cov}(Y_1, Y_2)}{4} \]

Key cases:

  • Independent samples (i.i.d.):
    • \(\text{Cov}(Y_1, Y_2) = 0\)
    • \(\text{Var}(\hat{\mu}_{AV}) = \frac{\text{Var}(Y)}{2}\)
  • Antithetic samples:
    • \(\text{Cov}(Y_1, Y_2) < 0\)
    • \(\text{Var}(\hat{\mu}_{AV}) < \frac{\text{Var}(Y)}{2}\)

Implementation Requirements

For effective variance reduction:

  1. Negative correlation between \(Y_1\) and \(Y_2\)

  2. Same marginal distributions: \(Y_1\) and \(Y_2\) must maintain the target distribution

Practical Construction Methods

  • For uniform variates:
    • If \(U \sim \text{Unif}(0,1)\), use \((U, 1-U)\) pairs
  • For symmetric distributions:
    • If \(X \sim F\), use \((X, 2\mu-X)\) where \(\mu\) is the symmetry point
  • For standard normal:
    • Use \((Z, -Z)\) pairs
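
A minimal sketch of the uniform construction above (the exponential integrand is an arbitrary monotone choice for illustration): pairing \(U\) with \(1-U\) makes the two evaluations negatively correlated, which is exactly what shrinks the variance of the paired average relative to an i.i.d. pair.

# Sketch: antithetic pairs (U, 1-U) give negatively correlated evaluations
# for a monotone function (illustrative choice: h(u) = exp(u) on [0, 1])
set.seed(1)
u  <- runif(1e5)
y1 <- exp(u)                # h(U)
y2 <- exp(1 - u)            # h(1 - U), the antithetic partner
cor(y1, y2)                 # negative for monotone h
var((y1 + y2) / 2)          # variance of the antithetic pair average
var(exp(runif(1e5))) / 2    # Var(Y)/2: what an i.i.d. pair would give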

Advantages

  • Computational efficiency:
    • Requires generating only n/2 random numbers for n sample points
    • Each uniform draw yields two evaluation points
  • Variance reduction:
    • Reductions in the 20–50% range are commonly reported
    • Particularly effective when the integrand is monotone
# Helper for nullable defaults
`%||%` <- function(a, b) if (is.null(a)) b else a

#' Antithetic-variates Monte Carlo integration (1D)
#'
#' Estimates ∫_min^max f(x) dx using antithetic pairs:
#' u ~ Unif(0,1); x1 = min + (max-min) * u; x2 = min + (max-min) * (1-u).
#'
#' @param FUN Integrand f(x).
#' @param min,max Integration bounds.
#' @param n_pairs Integer. Antithetic pairs per iteration (total samples = 2*n_pairs).
#' @param iterations Integer. Number of Monte Carlo replications.
#' @param conf_level Confidence level for normal-approx CI across iterations.
#' @param ... Passed to FUN.
#' @return An object of class "mc_integral" with summary fields.
antithetic_integrate <- function(FUN, min, max, n_pairs, iterations = 1L,
                                 conf_level = 0.95, ...) {
  stopifnot(is.numeric(min), is.numeric(max), max > min)
  stopifnot(length(n_pairs) == 1, n_pairs > 0, n_pairs == as.integer(n_pairs))
  stopifnot(length(iterations) == 1, iterations > 0, iterations == as.integer(iterations))
  if (!is.numeric(conf_level) || conf_level <= 0 || conf_level >= 1)
    stop("'conf_level' must be in (0,1).")

  f <- match.fun(FUN)
  V <- max - min
  estimates <- numeric(iterations)

  for (i in seq_len(iterations)) {
    u  <- runif(n_pairs)         # on [0,1]
    x1 <- min + V * u
    x2 <- min + V * (1 - u)      # antithetic partner
    x  <- c(x1, x2)

    y <- try(f(x, ...), silent = TRUE)
    if (inherits(y, "try-error") || length(y) != length(x)) {
      y <- vapply(x, f, numeric(1), ...)
    }
    estimates[i] <- V * mean(y)
  }

  m  <- mean(estimates)
  se <- sd(estimates) / sqrt(iterations)
  v  <- var(estimates)
  z  <- stats::qnorm((1 + conf_level) / 2)
  ci <- c(lower = m - z * se, upper = m + z * se)

  out <- list(
    mean        = m,
    var         = v,
    SE          = se,
    CI          = ci,
    conf_level  = conf_level,
    estimates   = estimates,
    n_pairs     = n_pairs,
    n_per_iter  = 2L * n_pairs,
    iterations  = iterations,
    min         = min,
    max         = max,
    volume      = V,
    method      = "antithetic variates"
  )
  class(out) <- c("mc_integral", class(out))
  out
}

# Pretty printer for MC integrals (works for antithetic & plain MC objects)
# Usage: print(out) will show a clean summary.
print.mc_integral <- function(x, digits = 6, exact = NULL, ...) {
  fmt <- function(v) formatC(v, digits = digits, format = "fg", flag = "#")
  total_evals <- (x$n_per_iter %||% NA_integer_) * (x$iterations %||% NA_integer_)

  cat("\n── Monte Carlo Integral — Summary ─────────────────────────\n")
  cat("Method:          ", x$method %||% "Monte Carlo", "\n", sep = "")
  if (!is.null(x$min) && !is.null(x$max)) {
    cat("Domain:          [", fmt(x$min), ", ", fmt(x$max), "]  (V = ", fmt(x$volume %||% NA_real_), ")\n", sep = "")
  } else if (!is.null(x$volume)) {
    cat("Domain length V: ", fmt(x$volume), "\n", sep = "")
  }
  if (!is.null(x$n_per_iter) && !is.null(x$iterations)) {
    cat("Per-iter draws:  ", x$n_per_iter, 
        "  | Iterations: ", x$iterations,
        "  | Total evals: ", if (is.finite(total_evals)) total_evals else "—", "\n", sep = "")
  }
  cat("Confidence lvl:  ", fmt(x$conf_level %||% 0.95), "\n", sep = "")

  cat("\nEstimate:        ", fmt(x$mean), "  (± ", fmt(x$SE), " SE)\n", sep = "")
  if (!is.null(x$CI)) {
    cat("Normal CI:       [", fmt(x$CI[1]), ", ", fmt(x$CI[2]), "]\n", sep = "")
  }
  if (!is.null(exact)) {
    rel_err <- (x$mean - exact) / exact
    cat("Exact value:     ", fmt(exact),
        "  | Abs. error: ", fmt(x$mean - exact),
        "  | Rel. error: ", fmt(rel_err), "\n", sep = "")
  }
  cat("────────────────────────────────────────────────────────────\n")
  invisible(x)
}
# f(x) = x^2 on [0,1]; exact = 1/3
f <- function(x) x^2
set.seed(42)
out <- antithetic_integrate(f, min = 0, max = 1, n_pairs = 1000, iterations = 200)

# Nice one-line summary (optionally supply the exact value)
print(out, exact = 1/3)
## 
## ── Monte Carlo Integral — Summary ─────────────────────────
## Method:          antithetic variates
## Domain:          [0, 1.00000]  (V = 1.00000)
## Per-iter draws:  2000  | Iterations: 200  | Total evals: 4e+05
## Confidence lvl:  0.950000
## 
## Estimate:        0.333678  (± 0.000178599 SE)
## Normal CI:       [0.333328, 0.334028]
## Exact value:     0.333333  | Abs. error: 0.000344284  | Rel. error: 0.00103285
## ────────────────────────────────────────────────────────────

8.3 Importance Sampling

Importance sampling is a variance reduction technique that estimates properties of a target distribution \(f\) by sampling from a different proposal distribution \(g\). Unlike rejection sampling, it uses all generated samples through weighting, making it more efficient for many applications.

Core Principle

The fundamental idea is to rewrite expectations under \(f\) as expectations under \(g\):

\[ \begin{aligned} \mathbb{E}_f[h(X)] &= \int_{\mathcal{X}} h(x) f(x) dx \\ &= \int_{\mathcal{X}} h(x) \frac{f(x)}{g(x)} g(x) dx \\ &= \mathbb{E}_g\left[h(X)\frac{f(X)}{g(X)}\right] \end{aligned} \]

where:

  • \(f(x)\) is the target density (hard to sample from)
  • \(g(x)\) is the proposal density (easy to sample from)
  • \(\frac{f(x)}{g(x)}\) are called the importance weights

Algorithm Implementation

  1. Sample \(X_1, \ldots, X_n \sim g(x)\)
  2. Compute weights \(w_i = \frac{f(X_i)}{g(X_i)}\)
  3. Estimate expectation: \[ \hat{\mu}_{IS} = \frac{1}{n}\sum_{i=1}^n h(X_i)w_i \]

Estimator Properties

  • Consistency: By the Law of Large Numbers: \[ \hat{\mu}_{IS} \xrightarrow{a.s.} \mathbb{E}_f[h(X)] \quad \text{as } n \to \infty \]
  • Variance: The estimator variance is: \[ \text{Var}(\hat{\mu}_{IS}) = \frac{1}{n}\text{Var}_g\left(h(X)\frac{f(X)}{g(X)}\right) \]

Optimal Proposal Distribution

The variance-minimizing proposal satisfies: \[ g^*(x) \propto |h(x)|f(x) \]
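
For intuition, when \(h(x) \ge 0\) this optimal choice makes the weighted integrand constant, so its variance under \(g^*\) is zero:

\[ h(x)\,\frac{f(x)}{g^*(x)} \;=\; h(x)f(x)\cdot\frac{\int h(y)f(y)\,dy}{h(x)f(x)} \;=\; \mathbb{E}_f[h(X)] \qquad \text{wherever } h(x)f(x) > 0. \]

In practice \(g^*\) depends on the unknown quantity \(\mathbb{E}_f[h(X)]\) itself, so it serves as a guide: choose proposals that roughly track \(|h(x)|f(x)\).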

Practical Considerations

  1. Normalized Weights: For better numerical stability: \[ \hat{\mu}_{NIS} = \frac{\sum_{i=1}^n h(X_i)w_i}{\sum_{i=1}^n w_i} \]

  2. Effective Sample Size: \[ ESS = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2} \] Measures the efficiency of importance sampling.
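
A minimal sketch combining the two ideas above; the helper name normalized_is and its arguments (h, dtarget, dprop, rprop) are illustrative and not used elsewhere in this section.

# Sketch: self-normalized importance sampling and effective sample size
normalized_is <- function(h, dtarget, dprop, rprop, n) {
  x   <- rprop(n)
  w   <- dtarget(x) / dprop(x)            # importance weights f(x)/g(x)
  est <- sum(h(x) * w) / sum(w)           # self-normalized estimate of E_f[h(X)]
  ess <- sum(w)^2 / sum(w^2)              # effective sample size
  list(estimate = est, ESS = ess, n = n)
}

# Example: E_f[X^2] under the triangular target f(x) = 1 - |x| on [-1, 1],
# sampled via a Uniform(-1, 1) proposal; the exact value is 1/6.
set.seed(1)
normalized_is(
  h       = function(x) x^2,
  dtarget = function(x) pmax(1 - abs(x), 0),
  dprop   = function(x) rep(0.5, length(x)),
  rprop   = function(n) runif(n, -1, 1),
  n       = 1e4
)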

# Inverse CDF (quantile function) of the triangular density h(x) = 1 - |x| on [-1, 1]:
# maps u in [0, 1] to a draw from h (out-of-range u is clamped to the support endpoints).
# A vectorized equivalent, qtri(), appears later in this section.
f <- function(u){
  if (u < 0) {y <- -1}
  else if (0 <= u & u < .5) {y <- -1 + sqrt(2*u)}
  else if (.5 <= u & u <= 1) {y <- 1 - sqrt(2 - 2*u)}
  else {y <- 1}
  return(y)
}

Is h(x) a good choice for the sampling density?

\(\hspace{23pt} \displaystyle f(x) = 1-x^2\)

\(\hspace{23pt} \displaystyle h(x) = 1 - \mid x \mid\)

\(\hspace{23pt} \displaystyle g(x) = \frac{f(x)}{h(x)} = \frac{1-x^2}{1-\mid x \mid}\)

According to Anderson, a good importance sampling function \(h(x)\) should have the following properties, all of which are satisfied here (in these criteria, \(g(x)\) denotes the integrand being integrated):

  • \(h(x) > 0\) whenever \(g(x) \neq 0\)
  • \(h(x)\) should be close to being proportional to \(\mid {g(x)} \mid\)
  • it should be easy to simulate values from \(h(x)\)
  • it should be easy to compute the density \(h(x)\) for any value \(x\) that you might realize.

Implement the function, \(\displaystyle f(x) = 1-x^2\)

fx <- function(x){
    return(1 - x*x)
  }

Implement the function, \(\displaystyle h(x) = 1 - | x |\)

h <- 
  function(x){
    if (-1 <= x & x <= 1) {y <- 1 - abs(x)}
    else{y <- 0}
    return(y)
  }
g <- function(x){(1-(x^2))/(1-abs(x))}  # ratio f(x)/h(x); simplifies to 1 + |x| on (-1, 1)

Plot functions \(\displaystyle h(x)\) and \(\displaystyle f(x)\)

ploty <- function(min, max){
  x <- seq(min, max, by = 0.01)
  H <- sapply(x, h)    # h() and fx() are scalar functions, so evaluate elementwise
  F <- sapply(x, fx)
  
  op <- par(lab = c((max - min) * 2, (max(H) - min(H)) * 5, 7))  # denser axis ticks
  on.exit(par(op), add = TRUE)                                   # restore par on exit
  
  plot(x, H, asp = 1,
       type = "l",
       lwd = 5,
       col = adjustcolor("tomato", alpha.f = 0.7),
       ylab = "y",
       main = "h(x) vs. f(x)")
  
  lines(x, F,
        lwd = 5,
        col = adjustcolor("dodgerblue", alpha.f = 0.7))
  
  grid(nx = NULL, ny = NULL,
       col = "lightgray", lty = "dotted", lwd = 1)
  
  text(0, 1.2, expression(f(x) == 1 - x^2), cex = 1.5, col = "dodgerblue")
  text(0, 0.6, expression(h(x) == 1 - abs(x)), cex = 1.5, col = "tomato")
}
ploty(-1,1) # Plot the function, f(x) and h(x), on [-1, 1]

#' Importance Sampling for Numerical Integration
#'
#' @description
#' Estimate an (unnormalized) integral
#' \deqn{I = \int h(x)\,dx}
#' using importance sampling with proposal density \eqn{g}. Draw
#' \eqn{X_i \sim g} via \code{rprop}, compute weights \eqn{w_i = 1/g(X_i)},
#' and use the estimator \eqn{\hat I = \frac{1}{n}\sum_{i=1}^n h(X_i)\,w_i}.
#'
#' @details
#' The estimator is unbiased when:
#' \itemize{
#'   \item \eqn{g(x) > 0} wherever \eqn{|h(x)| > 0} (support condition), and
#'   \item \eqn{\mathbb{E}_g\!\left[\,|h(X)|/g(X)\,\right] < \infty}.
#' }
#' The function runs \code{iterations} independent replications and returns the
#' mean/variance/SE across replication-level estimates.
#'
#' \strong{Expectations under a target density:} To estimate
#' \eqn{\mathbb{E}_{f}[ \phi(X) ]}, use weights \eqn{f(x)/g(x)} instead of
#' \eqn{1/g(x)} (i.e., modify the line computing \code{w}).
#'
#' Robustness: \code{FUN} is evaluated vectorized if possible; otherwise it
#' falls back to a safe elementwise evaluation. Only \emph{named} arguments
#' in \code{...} that match \code{FUN}'s formals are forwarded (unnamed or
#' unknown names are ignored with a warning).
#'
#' @param FUN Function \code{h(x)} — the integrand.
#' @param rprop Function generating proposal draws: \code{rprop(n)} for \eqn{X \sim g}.
#' @param dprop Function giving proposal density: \code{dprop(x)} returns \eqn{g(x)}.
#' @param n Integer (length 1). Number of proposal draws per replication.
#' @param iterations Integer (length 1). Number of independent Monte Carlo replications.
#' @param ... Additional \emph{named} arguments forwarded to \code{FUN} when names match.
#'
#' @return An object of class \code{"mc_integral"} (a list) with components:
#' \itemize{
#'   \item \code{mean} — Monte Carlo mean of the replication estimates
#'   \item \code{var} — Sample variance of the replication estimates
#'   \item \code{SE} — Standard error of the mean across replications
#'   \item \code{estimates} — Vector of per-replication estimates
#'   \item \code{method} — Description string
#'   \item \code{n_per_iter}, \code{iterations} — sampling metadata
#' }
#'
#' @examples
#' ## Integrate h(x) = 1 - x^2 over [-1,1]; exact value = 4/3
#' h <- function(x) ifelse(abs(x) <= 1, 1 - x^2, 0)
#'
#' ## Triangular proposal g(x) = 1 - |x| on [-1,1]
#' dtri <- function(x) ifelse(abs(x) <= 1, 1 - abs(x), 0)
#' qtri <- function(u) ifelse(u <= 0.5, -1 + sqrt(2*u), 1 - sqrt(2*(1-u)))
#' rtri <- function(n) qtri(runif(n))
#'
#' set.seed(1)
#' out <- importance_sampling_integrate(
#'   FUN  = h,
#'   rprop = rtri,
#'   dprop = dtri,
#'   n = 5000,
#'   iterations = 200
#' )
#' print(out)           # pretty summary
#' out$mean             # ~ 1.3333
#' out$SE
#'
#' @seealso \code{\link{monte_carlo_integrate}}, \code{\link{antithetic_integrate}}
#' @importFrom stats var sd
#' @export
importance_sampling_integrate <- function(FUN, rprop, dprop, n, iterations = 1L, ...) {
  stopifnot(is.function(FUN), is.function(rprop), is.function(dprop))
  if (length(n) != 1 || n <= 0 || n != as.integer(n)) stop("'n' must be a positive integer")
  if (length(iterations) != 1 || iterations <= 0 || iterations != as.integer(iterations)) {
    stop("'iterations' must be a positive integer")
  }

  f <- match.fun(FUN)

  # sanitize ... : keep only NAMED args that FUN actually accepts
  dots <- list(...)
  if (length(dots)) {
    nm <- names(dots)
    if (is.null(nm)) nm <- rep("", length(dots))
    if (any(nm == "")) {
      warning("Ignoring unnamed arguments in `...` (e.g., a stray value).")
    }
    fun_formals <- names(formals(f))
    keep <- nzchar(nm) & nm %in% fun_formals   # keep only named args that FUN accepts
    dropped <- nm[nzchar(nm) & !keep]
    if (length(dropped)) warning("Dropping unknown args in `...`: ", paste(dropped, collapse = ", "))
    dots <- dots[keep]
  }

  estimates <- replicate(iterations, {
    x <- rprop(n)

    # vectorized attempt; fallback to elementwise
    h <- try(do.call(f, c(list(x), dots)), silent = TRUE)
    if (inherits(h, "try-error") || length(h) != length(x)) {
      h <- vapply(x, function(t) do.call(f, c(list(t), dots)), numeric(1))
    }

    gx <- dprop(x)
    if (any(!is.finite(gx) | gx <= 0)) stop("dprop(x) returned nonpositive/NA values on support.")
    mean(h / gx)  # unbiased estimate of ∫ h(x) dx
  })

  res <- list(
    mean       = mean(estimates),
    var        = stats::var(estimates),
    SE         = stats::sd(estimates) / sqrt(iterations),
    estimates  = estimates,
    method     = "importance sampling (integral)",
    n_per_iter = n,
    iterations = iterations
  )
  class(res) <- c("mc_integral", class(res))
  res
}

#' Pretty printer for Monte Carlo integral results
#'
#' Note: this method replaces the more general \code{print.mc_integral} defined
#' earlier and is tailored to importance-sampling results (which carry no CI field).
#'
#' @param x An object returned by \code{importance_sampling_integrate()} (class "mc_integral").
#' @param digits Number of significant digits to print.
#' @param exact Optional exact value of the integral to report error metrics.
#' @param ... Ignored.
#' @export
#' @method print mc_integral
print.mc_integral <- function(x, digits = 6, exact = NULL, ...) {
  fmt <- function(v) formatC(v, digits = digits, format = "fg", flag = "#")
  reps <- if (!is.null(x$estimates)) length(x$estimates) else NA_integer_

  cat("\n── Importance Sampling — Integral Summary ────────────────\n")
  cat("Method:         ", if (!is.null(x$method)) x$method else "importance sampling", "\n", sep = "")
  if (!is.null(x$n_per_iter) && !is.null(x$iterations)) {
    cat("Replications:   ", reps, "  |  Draws/rep: ", x$n_per_iter,
        "  |  Total draws: ", x$n_per_iter * x$iterations, "\n", sep = "")
  }
  cat("\nEstimate:       ", fmt(x$mean), "  (± ", fmt(x$SE), " SE)\n", sep = "")
  if (!is.null(x$var)) cat("Between-rep var:", fmt(x$var), "\n")
  if (!is.null(exact)) {
    abs_err <- x$mean - exact
    rel_err <- abs_err / exact
    cat("Exact:          ", fmt(exact),
        "  | Abs. error: ", fmt(abs_err),
        "  | Rel. error: ", fmt(rel_err), "\n", sep = "")
  }
  cat("───────────────────────────────────────────────────────────\n")
  invisible(x)
}

Integrand and proposal on \([-1,1]\): \[ h(x)= \begin{cases} 1 - x^{2}, & |x|\le 1,\\[2pt] 0, & \text{otherwise}, \end{cases} \qquad g(x)= \begin{cases} 1 - |x|, & |x|\le 1,\\[2pt] 0, & \text{otherwise}. \end{cases} \]

Equivalent indicator-function form: \[ h(x)=(1-x^2)\,\mathbf{1}_{\{|x|\le 1\}},\qquad g(x)=(1-|x|)\,\mathbf{1}_{\{|x|\le 1\}}, \] where

\[ \text{Indicator: }\quad \mathbf{1}_{A}(x)= \begin{cases} 1, & x\in A,\\ 0, & x\notin A. \end{cases} \]

Quick check it’s a valid pdf:

\[ \int_{-\!1}^{1} (1-|x|)\,dx = 2\int_{0}^{1} (1-x)\,dx = 2\left[ x-\tfrac{x^{2}}{2}\right]_{0}^{1} = 2\cdot\tfrac{1}{2} = 1. \]


Importance-sampling setup: \[ I \;=\; \int_{-1}^{1} f_{\mathrm{int}}(x)\,dx, \qquad g(x) \;=\; (1-|x|)\,\mathbf{1}_{\{|x|\le 1\}}. \]

Estimator: \[ X_i \sim g,\quad w_i=\frac{1}{g(X_i)},\qquad \hat I \;=\; \frac{1}{n}\sum_{i=1}^n f_{\mathrm{int}}(X_i)\,w_i. \]

Let \[ f(x) \;=\; (1 - x^2)\,\mathbf{1}_{\{|x|\le 1\}} \quad\Rightarrow\quad I \;=\; \int_{-1}^{1} f(x)\,dx \;=\; \frac{4}{3}. \]

Choose the triangular proposal \[ g(x) \;=\; (1 - |x|)\,\mathbf{1}_{\{|x|\le 1\}},\qquad \int_{\mathbb{R}} g(x)\,dx = 1. \]

Draw i.i.d. samples \(X_1,\dots,X_n \sim g\) and use \[ \widehat{I}_n \;=\; \frac{1}{n}\sum_{i=1}^n \frac{f(X_i)}{g(X_i)} \;=\; \frac{1}{n}\sum_{i=1}^n \frac{1 - X_i^2}{\,1 - |X_i|\,}\,\mathbf{1}_{\{|X_i|\le 1\}}. \]

# Integrand on [-1,1]
f_int <- function(x) ifelse(abs(x) <= 1, 1 - x^2, 0)

# Triangular proposal g(x)=1-|x| on [-1,1]
dtri <- function(x) ifelse(abs(x) <= 1, 1 - abs(x), 0)
qtri <- function(u) ifelse(u <= 0.5, -1 + sqrt(2*u), 1 - sqrt(2*(1-u)))
rtri <- function(n) qtri(runif(n))
set.seed(1)
out <- importance_sampling_integrate(
  FUN  = f_int,   # your integrand
  rprop = rtri,   # proposal sampler
  dprop = dtri,   # proposal density
  n = 5000,
  iterations = 200
)

print(out, exact = 4/3)  # pretty, one-line summary (optional exact)
## 
## ── Importance Sampling — Integral Summary ────────────────
## Method:         importance sampling (integral)
## Replications:   200  |  Draws/rep: 5000  |  Total draws: 1e+06
## 
## Estimate:       1.33322  (± 0.000241839 SE)
## Between-rep var: 0.0000116973 
## Exact:          1.33333  | Abs. error: -0.000114553  | Rel. error: -0.0000859148
## ───────────────────────────────────────────────────────────

9 Markov Chain Monte Carlo (MCMC) Methods

MCMC methods are a class of algorithms for sampling from probability distributions by constructing a Markov chain that has the desired distribution as its equilibrium distribution. These methods are particularly valuable when direct sampling is difficult, especially in high-dimensional spaces.

Core Concepts

Markov Chain Fundamentals

  • A sequence of random variables where each state depends only on the previous state (Markov property)
  • Under certain conditions, converges to a stationary distribution π

MCMC Framework

  1. Construct a Markov chain whose stationary distribution equals the target distribution.
  2. Simulate the chain until it approaches stationarity (burn-in period).
  3. Use subsequent samples to approximate expectations: \[ \mathbb{E}_\pi[h(X)] \approx \frac{1}{N}\sum_{t=1}^N h(X_t) \]

9.1 Metropolis–Hastings Algorithm

Metropolis–Hastings (MH) is a general MCMC method for sampling from a target distribution \(\pi(x)\) when you can evaluate \(\pi(x)\) up to a normalizing constant and can sample from a proposal \(q(x'|x)\).

Algorithm

  • Inputs: target \(\pi(x)\) (unnormalized OK), proposal \(q(x'|x)\), initial state \(X_0\).
  • For \(t = 1,\dots,N\):
    1. Propose \(X^\ast \sim q(\,\cdot\,|X_{t-1})\).
    2. Accept with probability \[ \alpha \;=\; \min\!\left(1,\; \frac{\pi(X^\ast)\,q(X_{t-1}\,|\,X^\ast)}{\pi(X_{t-1})\,q(X^\ast\,|\,X_{t-1})}\right). \]
    3. Update \[ X_t \;=\; \begin{cases} X^\ast, & \text{with probability } \alpha,\\[4pt] X_{t-1}, & \text{otherwise.} \end{cases} \]

Random-walk MH (symmetric proposal): if \(q(x'|x)=q(x|x')\) (e.g., \(x' = x + \varepsilon\) with \(\varepsilon\) symmetric), the \(q\)-terms cancel and \[ \alpha \;=\; \min\!\left(1,\; \frac{\pi(X^\ast)}{\pi(X_{t-1})}\right). \]

Key Properties

  • Correctness: Produces samples from \(\pi(x)\) in the limit (under standard ergodicity conditions).
  • Flexibility: Only requires unnormalized \(\pi(x)\).
  • Tunable: Proposal \(q\) controls efficiency/mixing.

Triangular target on \([-1,1]\)

A simple target density: \[ f(x) \;=\; (1 - |x|)\,\mathbf{1}_{\{|x|\le 1\}} \;=\; \begin{cases} 1-|x|, & -1 < x < 1,\\[4pt] 0, & \text{otherwise}. \end{cases} \]


R Implementation of the Target Density: Triangular Function

#' Symmetric Triangular Probability Density Function on [-1, 1]
#'
#' @param x Numeric vector.
#' @return Density values of g(x) = 1 - |x| on [-1,1], 0 otherwise.
#' @export
triangular <- function(x) {
  pmax(1 - abs(x), 0)  # already 0 for |x| >= 1
}

# quick check: should be ~1
integrate(triangular, lower = -1, upper = 1)
## 1 with absolute error < 1.1e-14
#' Metropolis–Hastings (Random-Walk) Sampler
#'
#' Samples from a target density using a normal random-walk proposal.
#' For symmetric proposals, the acceptance ratio is f(x*)/f(x).
#'
#' @param n_samples Integer. Number of post–burn-in draws to return.
#' @param initial_value Numeric. Starting state (default 0).
#' @param proposal_sd Positive numeric. Std. dev. of Normal proposal (default 0.3).
#' @param burn_in Integer. Number of burn-in iterations (default 1000).
#' @param target_density Function f(x) returning an (unnormalized) density (default: triangular).
#' @return A list with:
#'   - samples: numeric vector (length n_samples)
#'   - acceptance_rate: scalar in [0,1]
#'   - proposal_sd, burn_in, initial_value
#' @export
metropolis_hastings <- function(
  n_samples,
  initial_value = 0,
  proposal_sd   = 0.3,
  burn_in       = 1000,
  target_density = triangular
) {
  stopifnot(n_samples > 0, burn_in >= 0, proposal_sd > 0, is.function(target_density))

  total <- n_samples + burn_in
  samples  <- numeric(total)
  current  <- initial_value
  accepted <- 0L

  for (i in seq_len(total)) {
    proposed <- rnorm(1, mean = current, sd = proposal_sd)

    f_curr <- target_density(current)
    f_prop <- target_density(proposed)

    # Robust acceptance logic for symmetric q:
    # - if f_prop <= 0: reject
    # - else if f_curr <= 0: accept (moving into support)
    # - else accept with probability min(1, f_prop/f_curr)
    accept <- FALSE
    if (f_prop > 0) {
      if (f_curr <= 0) {
        accept <- TRUE
      } else {
        log_ratio <- log(f_prop) - log(f_curr)
        accept <- (log(runif(1)) < log_ratio)
      }
    }

    if (accept) {
      current <- proposed
      accepted <- accepted + 1L
    }
    samples[i] <- current
  }

  final_samples <- samples[(burn_in + 1L):total]
  acc_rate <- accepted / total

  list(
    samples         = final_samples,
    acceptance_rate = acc_rate,
    proposal_sd     = proposal_sd,
    burn_in         = burn_in,
    initial_value   = initial_value
  )
}


# Save function to file
dump("metropolis_hastings", file = "metropolis_hastings.R")
set.seed(1)

## Normal target (just to show it works for any target)
mh_norm <- metropolis_hastings(
  n_samples = 5000,
  target_density = dnorm,   # unnormalized also fine
  proposal_sd = 1.0,
  burn_in = 1000
)

plot_RNG(
  mh_norm,
  target_density = dnorm,
  target_cdf     = pnorm,
  inv_cdf        = qnorm
)

# Triangular target on [-1,1]
# (qtri was defined above; ptri, the triangular CDF, is defined in the bootstrap section below)
set.seed(1)
mh_tri <- metropolis_hastings(
  n_samples = 5000,
  target_density = triangular,
  proposal_sd = 0.3,
  burn_in = 1000
)

plot_RNG(
  mh_tri,
  target_density = triangular,
  target_cdf     = ptri,
  inv_cdf        = qtri
)

gof_diagnostics(mh_tri, target_cdf = ptri, inv_cdf = qtri)
## Warning in ks.test.default(samples, function(t) target_cdf(t)): ties should not
## be present for the one-sample Kolmogorov-Smirnov test
## 
## === Goodness-of-Fit Diagnostics ===
## Kolmogorov–Smirnov:
##   D = 0.0156 | p-value = 0.173 
## Anderson–Darling:
##   A = 1.74 | p-value = 0.129 
## 
##         sample     theory
## 1%  -0.8777081 -0.8585786
## 5%  -0.6999842 -0.6837722
## 95%  0.6956260  0.6837722
## 99%  0.8524011  0.8585786

How to interpret these results

Warning: “ties should not be present for the one-sample Kolmogorov–Smirnov test”
- The classical one-sample KS test assumes continuous targets and i.i.d. samples—so no duplicate values.
- MCMC (e.g., Metropolis–Hastings) produces dependent draws and often repeated states (on rejections), so ties are expected and the KS p-value isn’t strictly valid. The same i.i.d. caveat affects AD as well.


What the numbers say (heuristically)

  • KS: \(D = 0.0156,\; p = 0.173\) → No evidence against the target distribution at common levels.
  • AD: \(A = 1.74,\; p = 0.129\) → Consistent with KS; likewise, no rejection.
  • Bottom line: The sample looks compatible with the target, but treat these p-values cautiously due to dependence/ties.

Below is a parametric-bootstrap goodness-of-fit check that mimics the MH dependence structure: it re-simulates many MH chains under the null (same proposal, burn-in, and chain length), builds the null distributions of the KS/AD statistics, and reports empirical p-values.

##--- Helpers ---------------------------------------------------------------

# Compute KS and AD (with tiny jitter to break ties consistently)
.ks_ad_stats <- function(x, target_cdf, jitter_sd = 1e-9) {
  xj <- if (jitter_sd > 0) x + rnorm(length(x), sd = jitter_sd) else x

  # KS
  ks_res <- suppressWarnings(stats::ks.test(xj, function(t) target_cdf(t)))

  # AD (optional if goftest missing)
  if (requireNamespace("goftest", quietly = TRUE)) {
    ad_res <- goftest::ad.test(xj, function(t) target_cdf(t))
  } else {
    ad_res <- NULL
  }

  list(
    ks_D = unname(ks_res$statistic),
    ad_A = if (!is.null(ad_res)) unname(ad_res$statistic) else NA_real_,
    ks_obj = ks_res,
    ad_obj = ad_res
  )
}

# Empirical (bootstrap) p-values with +1 smoothing
.emp_pval <- function(t_obs, t_boot) {
  (1 + sum(t_boot >= t_obs)) / (length(t_boot) + 1)
}

##--- Main bootstrap --------------------------------------------------------

#' Parametric-bootstrap GOF for MH samples (KS & AD)
#'
#' @param x_obs numeric vector of observed draws (correlated MH samples)
#' @param target_cdf function F(x) of the target distribution
#' @param sim_fun function(n) -> numeric vector of length n that simulates
#'        an MH chain under the null with the SAME settings as used for x_obs
#'        (same proposal_sd, burn-in, etc.). See example below.
#' @param B number of bootstrap replicates (default 500)
#' @param jitter_sd numeric, sd for tiny jitter to break ties (default 1e-9)
#' @param progress logical, print progress every ~10%
#' @return list with observed stats, bootstrap distributions, and p-values
mh_gof_bootstrap <- function(x_obs, target_cdf, sim_fun,
                             B = 500, jitter_sd = 1e-9, progress = TRUE) {
  stopifnot(is.numeric(x_obs), is.function(target_cdf), is.function(sim_fun))
  n <- length(x_obs)

  # observed stats (with the same jitter policy used in bootstrap)
  obs <- .ks_ad_stats(x_obs, target_cdf, jitter_sd = jitter_sd)

  ks_boot <- numeric(B)
  ad_boot <- numeric(B)

  for (b in seq_len(B)) {
    xb <- sim_fun(n)                 # simulate a null MH chain of length n
    st <- .ks_ad_stats(xb, target_cdf, jitter_sd = jitter_sd)
    ks_boot[b] <- st$ks_D
    ad_boot[b] <- st$ad_A
    if (progress && (b %% max(1L, floor(B/10))) == 0L) {
      message("Bootstrap ", b, "/", B)
    }
  }

  list(
    observed  = list(KS_D = obs$ks_D, AD_A = obs$ad_A,
                     KS = obs$ks_obj, AD = obs$ad_obj),
    bootstrap = list(KS_D = ks_boot, AD_A = ad_boot),
    p_values  = list(
      KS = .emp_pval(obs$ks_D, ks_boot),
      AD = if (all(is.na(ad_boot))) NA_real_ else .emp_pval(obs$ad_A, ad_boot)
    ),
    B = B,
    jitter_sd = jitter_sd
  )
}
# Target density & CDF (triangular)
triangular <- function(x) pmax(1 - abs(x), 0)
ptri <- function(x) {
  out <- numeric(length(x))
  out[x < -1] <- 0
  i1 <- x >= -1 & x <= 0; out[i1] <- ((x[i1] + 1)^2) / 2
  i2 <- x > 0  & x <= 1;  out[i2] <- 1 - ((1 - x[i2])^2) / 2
  out[x > 1] <- 1
  out
}

# The random-walk MH sampler defined above is assumed, with signature:
# metropolis_hastings(n_samples, initial_value, proposal_sd, burn_in, target_density)

set.seed(1)
obs <- metropolis_hastings(
  n_samples      = 3000,
  burn_in        = 1000,
  proposal_sd    = 0.3,
  target_density = triangular
)$samples

# Build a simulator that MIMICS the null (same MH settings)
sim_fun <- function(n) {
  metropolis_hastings(
    n_samples      = n,
    burn_in        = 1000,
    proposal_sd    = 0.3,
    target_density = triangular
  )$samples
}

# Run the bootstrap
set.seed(123)
boot <- mh_gof_bootstrap(
  x_obs     = obs,
  target_cdf= ptri,
  sim_fun   = sim_fun,
  B         = 500
)
## Bootstrap 50/500
## Bootstrap 100/500
## Bootstrap 150/500
## Bootstrap 200/500
## Bootstrap 250/500
## Bootstrap 300/500
## Bootstrap 350/500
## Bootstrap 400/500
## Bootstrap 450/500
## Bootstrap 500/500
# Results
boot$p_values   # empirical p-values (KS, AD)
## $KS
## [1] 0.9640719
## 
## $AD
## [1] 0.9780439
boot$observed   # observed KS/AD stats
## $KS_D
## [1] 0.0145351
## 
## $AD_A
## [1] 0.8071911
## 
## $KS
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  xj
## D = 0.014535, p-value = 0.5505
## alternative hypothesis: two-sided
## 
## 
## $AD
## 
##  Anderson-Darling test of goodness-of-fit
##  Null hypothesis: distribution 'function(t) target_cdf(t)'
##  Parameters assumed to be fixed
## 
## data:  xj
## An = 0.80719, p-value = 0.4759
# Optional: quick visuals of the bootstrap null distributions
par(mfrow = c(1,2))
hist(boot$bootstrap$KS_D, breaks = 30, col = "gray85",
     main = "Bootstrap null of KS D", xlab = "D")
abline(v = boot$observed$KS_D, col = "red", lwd = 2)

if (!all(is.na(boot$bootstrap$AD_A))) {
  hist(boot$bootstrap$AD_A, breaks = 30, col = "gray85",
       main = "Bootstrap null of AD A", xlab = "A")
  abline(v = boot$observed$AD_A, col = "red", lwd = 2)
}

par(mfrow = c(1,1))

Interpretation

Observed test statistics
- KS statistic: D = 0.0145
- Anderson–Darling statistic: A = 0.807

Asymptotic (i.i.d.) p-values
- KS: 0.5505
- AD: 0.4759
These assume independent, continuous samples. For MCMC, treat them as heuristic.

Parametric-bootstrap p-values (MH-matched)
- KS: 0.9641
- AD: 0.9780
These are more appropriate because they replicate our MH dependence/ties. Both are very large (≫ 0.05), so we fail to reject the hypothesis that the chain’s marginal matches the target CDF.

Bottom line: Our MH draws look consistent with the target distribution. KS (max deviation) and AD (tail-sensitive) agree.


9.2 Gibbs Sampling

Gibbs sampling is a special case of Metropolis–Hastings that generates samples from a multivariate target distribution by iteratively drawing from its full conditional distributions. It is especially useful when the joint density is complicated but each conditional is easy to sample.

Key features

  • Conditional updates: Cycle through coordinates (or blocks), sampling each \(x_j\) from \(\pi(x_j \mid x_{-j})\) while holding the others fixed.
  • Acceptance rate = 1: Because proposals are exact draws from the full conditionals, every update is accepted.
  • Minimal tuning: No proposal scale to tune; efficiency hinges on how you block and reparameterize variables.

Mathematical formulation

For a target \(\pi(x_1,\dots,x_k)\):

  1. Initialize \(X^{(0)}=(x_1^{(0)},\dots,x_k^{(0)})\).
  2. For iterations \(t=1,2,\dots,N\), update sequentially (systematic scan): \[ \begin{aligned} x_1^{(t)} &\sim \pi\!\big(x_1 \mid x_2^{(t-1)},\dots,x_k^{(t-1)}\big),\\ x_2^{(t)} &\sim \pi\!\big(x_2 \mid x_1^{(t)},x_3^{(t-1)},\dots,x_k^{(t-1)}\big),\\ &\ \ \vdots \\ x_k^{(t)} &\sim \pi\!\big(x_k \mid x_1^{(t)},\dots,x_{k-1}^{(t)}\big). \end{aligned} \] Under standard conditions (irreducibility, aperiodicity), the Markov chain has \(\pi\) as its stationary distribution, and ergodic averages converge: \[ \frac{1}{N}\sum_{t=1}^{N} h\!\big(X^{(t)}\big)\ \xrightarrow{\ \ a.s.\ \ }\ \mathbb{E}_\pi[h(X)]. \]

Practical notes

  • Scan strategy: You can use systematic scan (fixed order) or random scan (pick coordinates at random); both are common.
  • Blocking: Updating groups of highly correlated variables together can dramatically improve mixing.
  • Diagnostics: Use trace plots, ACF/ESS, and multiple chains; include a burn-in phase before retaining samples.
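
As a rough single-chain illustration of the ACF/ESS point above (the helper ess_acf and its truncation rule are ad hoc, not a standard package routine):

# Sketch: crude effective sample size from one chain's autocorrelations,
# truncating the ACF sum at the first non-positive lag
ess_acf <- function(chain, max_lag = min(1000L, length(chain) - 1L)) {
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]                                   # first non-positive lag
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1L)]
  length(chain) / (1 + 2 * sum(rho))
}

# e.g., on the Metropolis–Hastings chain from Section 9.1:
# ess_acf(mh_tri$samples)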

Algorithm (pseudocode)



Initialize X = (x₁, …, x_k)

for t = 1 to num_samples do
  for j = 1 to k do
    draw xj ~ π(xj | X−j)
  end for
  store sample X
end for

Implement a Gibbs sampler to construct samples from the uniform distribution on a disc of radius 1

#' Generic Gibbs Sampler (k-dimensional)
#'
#' A flexible Gibbs sampler that updates coordinates using user-supplied
#' full-conditional samplers. Works for any dimension \code{k >= 1}.
#'
#' @param n Integer. Number of kept samples (after burn-in/thinning).
#' @param conditionals List of length \code{k}. Each element is a function
#'   \code{f_j(state, ...)} that returns a scalar draw for coordinate \code{j}
#'   given the current \code{state} (a numeric vector of length \code{k}).
#' @param init Numeric vector of length \code{k}. Initial state.
#' @param burn_in Integer. Number of burn-in iterations to discard (default 100).
#' @param thin Integer. Keep every \code{thin}-th state after burn-in (default 1).
#' @param plot Logical. If \code{TRUE} and \code{k == 2}, draws a scatter plot.
#' @param xlim,ylim Optional numeric length-2; plot limits when \code{k == 2}.
#' @param ... Passed to the conditional sampler functions.
#'
#' @return A numeric matrix of size \code{n x k} with column names
#'   \code{x1, x2, ...} (or names from \code{init} if provided).
#'
#' @examples
#' ## Example: Uniform on the unit disc {(x,y): x^2 + y^2 <= 1}
#' sqrt_pos <- function(z) sqrt(pmax(z, 0))
#' cond_unit_disc <- list(
#'   function(state) { y <- state[2]; runif(1, -sqrt_pos(1 - y^2),  sqrt_pos(1 - y^2)) }, # x | y
#'   function(state) { x <- state[1]; runif(1, -sqrt_pos(1 - x^2),  sqrt_pos(1 - x^2)) }  # y | x
#' )
#' set.seed(1)
#' S <- gibbs_sampler(
#'   n = 5000,
#'   conditionals = cond_unit_disc,
#'   init = c(0, 0),
#'   burn_in = 500,
#'   thin = 1,
#'   plot = TRUE
#' )
#' head(S)
#'
#' ## Example: Your original unit-disc special case as a wrapper
#' #' Gibbs sampler for the unit disc
#' gibbs_unit_disc <- function(n, burn_in = 100, thin = 1, init = c(0,0), plot = TRUE) {
#'   sqrt_pos <- function(z) sqrt(pmax(z, 0))
#'   conds <- list(
#'     function(state) { y <- state[2]; runif(1, -sqrt_pos(1 - y^2),  sqrt_pos(1 - y^2)) },
#'     function(state) { x <- state[1]; runif(1, -sqrt_pos(1 - x^2),  sqrt_pos(1 - x^2)) }
#'   )
#'   gibbs_sampler(n, conds, init, burn_in = burn_in, thin = thin, plot = plot)
#' }
#'
#' @export
gibbs_sampler <- function(
  n,
  conditionals,
  init,
  burn_in = 100,
  thin = 1,
  plot = TRUE,
  xlim = NULL, ylim = NULL,
  ...
) {
  # ---- Checks ----
  stopifnot(length(n) == 1, n == as.integer(n), n > 0)
  stopifnot(length(burn_in) == 1, burn_in == as.integer(burn_in), burn_in >= 0)
  stopifnot(length(thin) == 1, thin == as.integer(thin), thin >= 1)
  if (!is.list(conditionals) || !all(vapply(conditionals, is.function, TRUE))) {
    stop("'conditionals' must be a list of functions f_j(state, ...)->scalar.")
  }
  k <- length(conditionals)
  init <- as.numeric(init)
  if (length(init) != k) stop("'init' must be a numeric vector of length equal to length(conditionals).")

  # ---- Main loop ----
  total <- burn_in + n * thin
  state <- init
  out <- matrix(NA_real_, nrow = n, ncol = k)
  keep <- 0L

  for (t in seq_len(total)) {
    # Systematic scan j = 1..k
    for (j in seq_len(k)) {
      val <- conditionals[[j]](state, ...)
      if (length(val) != 1L || !is.finite(val)) {
        stop(sprintf("Conditional %d must return one finite numeric value.", j))
      }
      state[j] <- as.numeric(val)
    }
    # Store after burn-in, with thinning
    if (t > burn_in && ((t - burn_in) %% thin == 0L)) {
      keep <- keep + 1L
      out[keep, ] <- state
    }
  }

  # ---- Names & plotting ----
  if (!is.null(names(init))) {
    colnames(out) <- names(init)
  } else {
    colnames(out) <- paste0("x", seq_len(k))
  }

  if (isTRUE(plot) && k == 2) {
    if (is.null(xlim)) xlim <- range(out[,1], finite = TRUE) + c(-0.5, 0.5)
    if (is.null(ylim)) ylim <- range(out[,2], finite = TRUE) + c(-0.5, 0.5)
    graphics::plot(out[,1], out[,2],
                   pch = 16, cex = 0.4, asp = 1,
                   col = grDevices::adjustcolor("tomato", alpha.f = 0.5),
                   xlab = colnames(out)[1], ylab = colnames(out)[2],
                   main = "Gibbs samples")
    graphics::grid(col = "gray80", lty = "dotted")
  }

  return(out)
}

# save 
dump("gibbs_sampler", file = "gibbs_sampler.R")
#' Gibbs sampler for the unit disc
gibbs_unit_disc <- function(n, burn_in = 100, thin = 1, init = c(0,0), plot = TRUE) {
  sqrt_pos <- function(z) sqrt(pmax(z, 0))
  conds <- list(
    function(state) { y <- state[2]; runif(1, -sqrt_pos(1 - y^2),  sqrt_pos(1 - y^2)) },
    function(state) { x <- state[1]; runif(1, -sqrt_pos(1 - x^2),  sqrt_pos(1 - x^2)) }
  )
  gibbs_sampler(n, conds, init, burn_in = burn_in, thin = thin, plot = plot)
}

samples <- gibbs_unit_disc(n=10000)


Let \(X \sim \mathrm{Unif}(0,5)\) and \(Y \mid X = x \sim \mathrm{Unif}(x,\, 2x+1)\), with joint density \(f(x,y)\). Construct samples from this distribution.

This joint density factorizes as \[ f(x,y) = \underbrace{\frac{1}{5}\,\mathbf{1}_{\{0\le x\le 5\}}(x)}_{\text{Unif}(0,5)} \;\times\; \underbrace{\frac{1}{(2x+1)-x}\,\mathbf{1}_{\{x\le y\le 2x+1\}}(y)}_{\text{Unif}(x,\,2x+1)} = \frac{1}{5(x+1)}\,\mathbf{1}_{\{0\le x\le 5,\; x\le y\le 2x+1\}}. \] A Gibbs sampler needs the full conditionals:

  • \(Y \mid X=x \sim \mathrm{Unif}(x,\,2x+1)\) (given).

  • \(X \mid Y=y\) has density proportional to \(\dfrac{1}{x+1}\) on
    \(x \in [a(y),\,b(y)]\) where

    \[ a(y)=\max\!\left\{0,\frac{y-1}{2}\right\},\qquad b(y)=\min\{5,\,y\}. \]

    Inverse–CDF sampler: \[ X \;=\; \exp\!\Big(\log(1+a) + U\,\big[\log(1+b)-\log(1+a)\big]\Big) - 1, \quad U\sim\mathrm{Unif}(0,1). \]

Plug into gibbs_sampler():

# X | Y = y  ~ density ∝ 1/(x+1) on [a(y), b(y)]
x_given_y <- function(state) {
  y <- state[2]
  a <- max(0, (y - 1)/2)
  b <- min(5, y)
  if (a > b) stop("Empty support for X|Y; check initial state / transitions.")
  u <- runif(1)
  # inverse-CDF for 1/(x+1) on [a,b]
  exp(log1p(a) + u * (log1p(b) - log1p(a))) - 1
}

# Y | X = x  ~ Unif(x, 2x+1)
y_given_x <- function(state) {
  x <- state[1]
  runif(1, x, 2*x + 1)
}

conds <- list(x_given_y, y_given_x)

## Run Gibbs 

set.seed(1)

S <- gibbs_sampler(
  n = 5000,
  conditionals = conds,
  init = c(0, 0),     # feasible: 0 <= 0 <= 1
  burn_in = 1000,
  thin = 1,
  plot = TRUE         # shows a scatter of (x,y)
)

# Here the conditionals are exact, so the X marginal has a known form (Unif(0,5));
# the Y marginal has no standard closed form, so use univariate diagnostics on each margin.

par(mfrow = c(1,2))
plot_RNG(S[,1], main = "X marginal (KDE + diagnostics)")

plot_RNG(S[,2], main = "Y marginal (KDE + diagnostics)")

par(mfrow = c(1,1))

Quick checks (optional)

  • \(X\) marginal should be \(\mathrm{Unif}(0,5)\) → histogram flat at height \(1/5\).
  • Constraint check: \(x \le y \le 2x+1\) holds for all sampled pairs.
# X marginal should be Uniform(0,5)
hist(S[,1], freq = FALSE, breaks = 50, col = "gray90", border = "white",
     main = "X marginal vs Uniform(0,5)", xlab = "x")
abline(h = 1/5, col = "red", lwd = 2)   # density of Unif(0,5)

# Given X, Y should lie in [X, 2X+1]
all(S[,2] >= S[,1] & S[,2] <= 2*S[,1] + 1)  # should be TRUE
## [1] TRUE
# Illustrative Gaussian conditionals (placeholders, not derived from a specific joint)
sx <- function(y) rnorm(1, mean = y/2, sd = 0.5)   # draw x | y
sy <- function(x) rnorm(1, mean = x + 1, sd = 0.7) # draw y | x

# wrap them to match gibbs_sampler(state)-style conditionals
conds <- list(
  function(state) sx(state[2]),  # x | y
  function(state) sy(state[1])   # y | x
)

set.seed(2)
S2 <- gibbs_sampler(
  n = 3000,
  conditionals = conds,
  init = c(0, 0),
  burn_in = 1000,
  thin = 5,
  plot = TRUE
)

# X|Y and Y|X were Gaussian placeholders, so we don't have closed-form targets.
# Just use the univariate diagnostics on each margin.

par(mfrow = c(1,2))
plot_RNG(S2[,1], main = "X marginal (KDE + diagnostics)")

plot_RNG(S2[,2], main = "Y marginal (KDE + diagnostics)")

par(mfrow = c(1,1))

10 Reverse-engineering

Reverse-engineering an unknown data-generating mechanism entails inferring a probability model, or its governing parameters, from observed data. Within a Bayesian framework this amounts to characterizing the posterior distribution \(\pi(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)\). Markov chain Monte Carlo (MCMC) methods provide a principled means of approximating such posteriors when direct sampling is infeasible, including cases where the normalizing constant is intractable. By constructing an ergodic Markov chain with stationary distribution \(\pi(\theta \mid x)\), MCMC yields asymptotically exact samples, enabling estimation of parameters and predictive functionals, together with coherent quantification of uncertainty via Monte Carlo error assessments.

Idea

Specify a model \(p(x\mid\theta)\) for the data and a prior \(p(\theta)\). Given observations \(x_{1:n}\),

\[ p(\theta\mid x_{1:n}) \;\propto\; p(x_{1:n}\mid \theta)\, p(\theta), \] and MCMC generates draws \(\theta^{(t)} \sim p(\theta\mid x_{1:n})\). From these you can recover:

  • Parameter estimates: means/medians/credible intervals of \(\theta\).
  • Predictive distribution: \[ p(x_{\text{new}}\mid x_{1:n}) \;=\; \int p(x_{\text{new}}\mid \theta)\, p(\theta\mid x_{1:n})\, d\theta \;\approx\; \frac{1}{T}\sum_{t=1}^T p\big(x_{\text{new}}\mid \theta^{(t)}\big). \]
  • Functionals of the unknown distribution: expectations, quantiles, tail probs, etc.
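
A minimal sketch of the predictive approximation above, assuming the Poisson likelihood with a conjugate Gamma(a0, b0) prior that reappears in Section 10.1; the helper posterior_predictive_pois and the simulated data are illustrative only.

# Sketch: Monte Carlo approximation of the posterior predictive pmf,
# p(x_new | x) ≈ (1/T) * sum_t p(x_new | theta^(t)); the conjugate posterior
# gives exact draws here, but draws from any MCMC sampler work the same way.
posterior_predictive_pois <- function(x, x_new, a0 = 0.5, b0 = 0.5, n_draws = 4000) {
  lambda <- rgamma(n_draws, shape = a0 + sum(x), rate = b0 + length(x))  # posterior draws
  sapply(x_new, function(k) mean(dpois(k, lambda)))                      # average the likelihood
}

# Usage with simulated data: predictive probabilities for counts 0..10
set.seed(1)
x_sim <- rpois(50, lambda = 3)
round(posterior_predictive_pois(x_sim, x_new = 0:10), 3)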

Practical workflow

  1. Choose a likelihood \(p(x\mid\theta)\) and prior \(p(\theta)\).
  2. Run an MCMC sampler (e.g., Metropolis–Hastings, Gibbs) targeting \(p(\theta\mid x)\).
  3. Check convergence/mixing (trace plots, R-hat, ESS).
  4. Summarize the posterior and form the posterior predictive.

Notes & caveats

  • If you only have samples with no model, use density estimation (KDE, mixtures, normalizing flows). MCMC still applies if you fit a flexible Bayesian model (e.g., mixture models, Dirichlet process mixtures).
  • MCMC doesn’t require the normalizing constant of the posterior, but it does need a computable likelihood (or a likelihood-free scheme like ABC-MCMC).
  • Watch for identifiability, model misspecification, and poor mixing in high dimensions.

In short, MCMC doesn’t “find” the distribution automatically; it lets you approximate the posterior implied by your modeling assumptions, from which you can reconstruct the underlying distribution and its uncertainties.

10.1 Discrete Sample Data (unknown)

## ============================================================
## Reverse-engineering an unknown discrete distribution (base R)
## ============================================================

## ---------------------------
## Robust CSV loader → integer vector
## ---------------------------
load_unknowndiscrete <- function(
  filename   = "unknowndiscrete.csv",
  header     = FALSE,
  col        = 1,
  as_vector  = TRUE
) {
  candidates <- unique(na.omit(c(
    filename,
    file.path("data", filename),
    if (!is.null(knitr::opts_knit$get("root.dir")))
      file.path(knitr::opts_knit$get("root.dir"), filename) else NA,
    tryCatch(file.path(dirname(knitr::current_input()), filename),
             error = function(e) NA),
    if (requireNamespace("here", quietly = TRUE))
      here::here(filename) else NA
  )))
  message("Working dir: ", getwd())
  message("Looking for file at:\n", paste("  - ", candidates, collapse = "\n"))

  path_found <- NULL
  for (p in candidates) if (!is.na(p) && file.exists(p)) { path_found <- p; break }
  if (is.null(path_found)) {
    if (interactive()) {
      message("File not found — please choose it manually.")
      path_found <- file.choose()
    } else stop("Could not find '", filename, "'.")
  }
  message("Reading: ", path_found)

  df <- read.csv(path_found, header = header,
                 na.strings = c("", "NA", "NaN"),
                 fileEncoding = "UTF-8",
                 check.names = FALSE, stringsAsFactors = FALSE)
  if (!as_vector) return(df)

  v <- df[[col]]  # column selected by name or position
  if (is.factor(v)) v <- as.character(v)
  v_num <- suppressWarnings(as.numeric(v))
  v_num <- v_num[is.finite(v_num)]
  if (!length(v_num)) stop("No finite numeric values found.")
  if (!all(abs(v_num - round(v_num)) < .Machine$double.eps^0.5)) {
    warning("Non-integers detected; rounding to nearest integer.")
    v_num <- round(v_num)
  }
  as.integer(v_num)
}

## --- Load & basic EDA objects ---
unknowndiscrete <- load_unknowndiscrete("unknowndiscrete.csv", header = FALSE, col = 1, as_vector = TRUE)
## Working dir: /Users/salvadorcastro/Desktop/RCode/Computational_Statistics
## Looking for file at:
##   -  unknowndiscrete.csv
##   -  data/unknowndiscrete.csv
##   -  ./unknowndiscrete.csv
##   -  /Users/salvadorcastro/Desktop/RCode/Computational_Statistics/unknowndiscrete.csv
## Reading: unknowndiscrete.csv
x   <- unknowndiscrete
n   <- length(x); rng <- range(x); med <- median(x); mu <- mean(x); v <- var(x)

cat("Loaded n =", n, " | min/median/mean/var/max =",
    rng[1], "/", med, "/", round(mu,3), "/", round(v,3), "/", rng[2], "\n")
## Loaded n = 1000  | min/median/mean/var/max = 3 / 12 / 12.071 / 7.255 / 20
tab <- sort(table(x))
print(tab)
## x
##   3   4  20   5  19   6  18   7  17  16   8  15   9  10  13  14  11  12 
##   1   1   1   4   5  11  15  22  35  45  49  83  88 112 120 125 131 152
cat("\nEmpirical probabilities:\n"); print(round(tab / n, 3))
## 
## Empirical probabilities:
## x
##     3     4    20     5    19     6    18     7    17    16     8    15     9 
## 0.001 0.001 0.001 0.004 0.005 0.011 0.015 0.022 0.035 0.045 0.049 0.083 0.088 
##    10    13    14    11    12 
## 0.112 0.120 0.125 0.131 0.152
## ---------------------------
## 1) EDA plots
## ---------------------------
op <- par(mfrow = c(1,2))
barplot(table(x), main = "Empirical PMF", xlab = "Value", ylab = "Count")
plot(as.numeric(names(tab)), as.numeric(tab)/n, type = "h", lwd = 3,
     main = "Empirical Probabilities", xlab = "Value", ylab = "Probability")

par(op)

## ---------------------------------------------------------
## 2) WAIC utility
## ---------------------------------------------------------
waic_from_loglik_draws <- function(loglik_draws) {
  # loglik_draws: matrix [S x n] (S posterior draws, n data points)
  lppd    <- sum(log(colMeans(exp(loglik_draws))))
  p_waic  <- sum(apply(loglik_draws, 2, var))
  elpd    <- lppd - p_waic
  list(WAIC = -2 * elpd, lppd = lppd, p_waic = p_waic, elpd_waic = elpd)
}

## ---------------------------------------------------------
## 3) Poisson(λ) with Gamma prior
## ---------------------------------------------------------
pois_fit <- function(x, S = 4000, burn = 1000, a0 = 0.5, b0 = 0.5) {
  a_post <- a0 + sum(x); b_post <- b0 + length(x)
  lambda_draws <- rgamma(S, shape = a_post, rate = b_post)[(burn+1):S]
  loglik <- sapply(x, function(xi) dpois(xi, lambda_draws, log = TRUE))
  list(par_draws = data.frame(lambda = lambda_draws),
       loglik = as.matrix(loglik), prior = c(a0 = a0, b0 = b0))
}

## ---------------------------------------------------------
## 4) Geometric(p) on {0,1,...} with shift = min(x)
## ---------------------------------------------------------
geom_fit <- function(x, S = 4000, burn = 1000, alpha = 0.5, beta = 0.5) {
  shift <- min(x); y <- x - shift
  alpha_post <- alpha + length(y); beta_post <- beta + sum(y)
  p_draws <- rbeta(S, alpha_post, beta_post)[(burn+1):S]
  loglik <- sapply(x, function(xi) { yi <- xi - shift; log(p_draws) + yi*log1p(-p_draws) })
  list(par_draws = data.frame(p = p_draws, shift = shift),
       loglik = as.matrix(loglik), prior = c(alpha = alpha, beta = beta), shift = shift)
}

## ---------------------------------------------------------
## 5) NegBin NB(r,p), r~Gamma, p~Beta (MH-within-Gibbs)
## ---------------------------------------------------------
nb_fit <- function(x, S = 5000, burn = 2000,
                   a_r = 0.5, b_r = 0.5, alpha = 0.5, beta = 0.5,
                   r_init = 1.0, p_init = 0.5, proposal_sd = 0.25) {
  n <- length(x); sx <- sum(x)
  loglik_nb <- function(x, r, p) sum(lgamma(x + r) - lgamma(r) - lgamma(x + 1) + x*log1p(-p) + r*log(p))
  logpost_r <- function(r, p) {
    if (r <= 0) return(-Inf)
    loglik_nb(x, r, p) + dgamma(r, shape = a_r, rate = b_r, log = TRUE)
  }
  r <- numeric(S); p <- numeric(S); accept <- logical(S)
  r[1] <- r_init; p[1] <- p_init; accept[1] <- NA
  for (s in 2:S) {
    p[s] <- rbeta(1, alpha + r[s-1]*n, beta + sx)               # Gibbs for p|r,x
    r_prop <- abs(r[s-1] + rnorm(1, 0, proposal_sd))            # MH for r|p,x
    logacc <- logpost_r(r_prop, p[s]) - logpost_r(r[s-1], p[s])
    if (log(runif(1)) < logacc) { r[s] <- r_prop; accept[s] <- TRUE } else { r[s] <- r[s-1]; accept[s] <- FALSE }
  }
  draws <- data.frame(r = r[(burn+1):S], p = p[(burn+1):S])
  acc_rate <- mean(accept[(burn+1):S], na.rm = TRUE)
  loglik <- sapply(x, function(xi) lgamma(xi + draws$r) - lgamma(draws$r) - lgamma(xi + 1) +
                     xi*log1p(-draws$p) + draws$r*log(draws$p))
  list(par_draws = draws, loglik = as.matrix(loglik),
       priors = c(a_r = a_r, b_r = b_r, alpha = alpha, beta = beta),
       acc_rate_r = acc_rate)
}

## ---------------------------------------------------------
## 6) Multinomial over observed support w/ Dirichlet prior
## ---------------------------------------------------------
multi_dirichlet_fit <- function(x, S = 4000, burn = 1000, alpha0 = 0.5) {
  vals <- sort(unique(x)); k <- length(vals)
  idx <- match(x, vals); counts <- tabulate(idx, nbins = k)
  alpha_post <- rep(alpha0, k) + counts
  g <- matrix(0, nrow = S, ncol = k)
  for (j in 1:k) g[, j] <- rgamma(S, shape = alpha_post[j], rate = 1)
  p_draws <- g / rowSums(g); p_draws <- p_draws[(burn+1):S, , drop = FALSE]
  loglik <- sapply(seq_along(x), function(i) log(p_draws[, idx[i]]))
  list(par_draws = list(p = p_draws, support = vals),
       loglik = as.matrix(loglik), prior_alpha0 = alpha0)
}

## ---------------------------------------------------------
## 7) Binomial(n,p), optional left-truncation at L = min(x)
## ---------------------------------------------------------
binom_fit <- function(x, n_fixed = NULL, S = 4000, burn = 1000,
                      a = 0.5, b = 0.5, use_trunc = TRUE) {
  L <- min(x); ntrials <- if (is.null(n_fixed)) max(x) else n_fixed
  if (any(x > ntrials)) stop("Data exceed chosen n in Binomial model.")
  sum_x <- sum(x); n_obs <- length(x)
  # Note: p is drawn from the untruncated conjugate posterior; the truncation
  # enters only through the likelihood used for WAIC below.
  p_draws <- rbeta(S, a + sum_x, b + ntrials*n_obs - sum_x)[(burn+1):S]
  if (!use_trunc) {
    loglik <- sapply(x, function(xi) dbinom(xi, size = ntrials, prob = p_draws, log = TRUE))
  } else {
    den <- 1 - pbinom(L - 1, size = ntrials, prob = p_draws)
    loglik <- sapply(x, function(xi) dbinom(xi, size = ntrials, prob = p_draws, log = TRUE) - log(den))
  }
  list(par_draws = data.frame(p = p_draws, n = ntrials, L = if (use_trunc) L else NA),
       loglik = as.matrix(loglik), priors = c(a = a, b = b), truncated = use_trunc)
}

## ---------------------------------------------------------
## 8) Beta–Binomial(n, alpha, beta), optional left-truncation
## ---------------------------------------------------------
betabinom_fit <- function(x, n_fixed = NULL, S = 6000, burn = 3000,
                          alpha_init = 2, beta_init = 2,
                          a_h = 0.1, b_h = 0.1, prop_sd = 0.20, use_trunc = TRUE) {
  ntrials <- if (is.null(n_fixed)) max(x) else n_fixed
  if (any(x > ntrials)) stop("Data exceed chosen n in Beta–Binomial model.")
  L <- min(x)

  log_dbetabinom <- function(k, n, a, b) lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b)
  log_tail_prob  <- function(a, b) {
    ks <- L:ntrials
    ll <- vapply(ks, function(k) log_dbetabinom(k, ntrials, a, b), numeric(1))
    m <- max(ll); m + log(sum(exp(ll - m)))
  }
  logpost <- function(a, b) {
    if (a <= 0 || b <= 0) return(-Inf)
    lt <- if (use_trunc) log_tail_prob(a, b) else 0
    ll <- sum(vapply(x, function(xi) log_dbetabinom(xi, ntrials, a, b) - lt, numeric(1)))
    ll + dgamma(a, shape = a_h, rate = b_h, log = TRUE) + dgamma(b, shape = a_h, rate = b_h, log = TRUE)
  }

  a_draw <- numeric(S); b_draw <- numeric(S); a_draw[1] <- alpha_init; b_draw[1] <- beta_init
  acc <- 0L
  for (s in 2:S) {
    a_prop <- abs(a_draw[s-1] + rnorm(1, 0, prop_sd))
    b_prop <- abs(b_draw[s-1] + rnorm(1, 0, prop_sd))
    logacc <- logpost(a_prop, b_prop) - logpost(a_draw[s-1], b_draw[s-1])
    if (log(runif(1)) < logacc) { a_draw[s] <- a_prop; b_draw[s] <- b_prop; acc <- acc + 1L
    } else { a_draw[s] <- a_draw[s-1]; b_draw[s] <- b_draw[s-1] }
  }
  keep  <- (burn+1):S
  draws <- data.frame(alpha = a_draw[keep], beta = b_draw[keep])
  acc_rate <- acc / (S - 1)

  loglik <- sapply(x, function(xi) {
    ll <- lchoose(ntrials, xi) + lbeta(xi + draws$alpha, ntrials - xi + draws$beta) - lbeta(draws$alpha, draws$beta)
    if (use_trunc) {
      ks <- L:ntrials
      lt <- apply(draws, 1, function(row) {
        a <- row[["alpha"]]; b <- row[["beta"]]
        llk <- lchoose(ntrials, ks) + lbeta(ks + a, ntrials - ks + b) - lbeta(a, b)
        m <- max(llk); m + log(sum(exp(llk - m)))
      })
      ll <- ll - lt
    }
    as.numeric(ll)
  })

  list(par_draws = draws, loglik = as.matrix(loglik),
       n = ntrials, L = if (use_trunc) L else NA,
       acc_rate = acc_rate, priors = c(a_h = a_h, b_h = b_h))
}

## ---------------------------------------------------------
## 9) Fit all models
## ---------------------------------------------------------
set.seed(123)
pois   <- pois_fit(x)
geom   <- geom_fit(x)
nb     <- nb_fit(x)
multi  <- multi_dirichlet_fit(x)
binom  <- binom_fit(x, n_fixed = max(x), use_trunc = TRUE)     # set n = max(x)
bb     <- betabinom_fit(x, n_fixed = max(x), use_trunc = TRUE) # set n = max(x)

## ---------------------------------------------------------
## 10) WAIC comparison
## ---------------------------------------------------------
waic_pois  <- waic_from_loglik_draws(pois$loglik)
waic_geom  <- waic_from_loglik_draws(geom$loglik)
waic_nb    <- waic_from_loglik_draws(nb$loglik)
waic_multi <- waic_from_loglik_draws(multi$loglik)
waic_binom <- waic_from_loglik_draws(binom$loglik)
waic_bb    <- waic_from_loglik_draws(bb$loglik)

waic_table <- data.frame(
  Model = c("Poisson", "Geometric(+shift)", "NegBin", "Multinomial/Dirichlet",
            paste0("Binomial (n=", unique(binom$par_draws$n), ", left-trunc)"),
            paste0("Beta–Binomial (n=", bb$n, ", left-trunc)")),
  WAIC  = c(waic_pois$WAIC, waic_geom$WAIC, waic_nb$WAIC, waic_multi$WAIC,
            waic_binom$WAIC, waic_bb$WAIC),
  lppd  = c(waic_pois$lppd, waic_geom$lppd, waic_nb$lppd, waic_multi$lppd,
            waic_binom$lppd, waic_bb$lppd),
  p_waic= c(waic_pois$p_waic, waic_geom$p_waic, waic_nb$p_waic, waic_multi$p_waic,
            waic_binom$p_waic, waic_bb$p_waic)
)
waic_table <- waic_table[order(waic_table$WAIC), ]
cat("\n=== WAIC comparison (lower is better) ===\n"); print(waic_table, row.names = FALSE)
## 
## === WAIC comparison (lower is better) ===
##                             Model     WAIC      lppd      p_waic
##  Beta–Binomial (n=20, left-trunc) 4828.751 -2411.658  2.71750110
##             Multinomial/Dirichlet 4840.573 -2403.503 16.78337267
##       Binomial (n=20, left-trunc) 4932.424 -2464.641  1.57092136
##                           Poisson 4933.474 -2466.145  0.59209783
##                            NegBin 5007.365 -2503.123  0.55975928
##                 Geometric(+shift) 6517.678 -3258.757  0.08231579
## ---------------------------------------------------------
## 11) Posterior summaries
## ---------------------------------------------------------
cat("\n--- Posterior summaries ---\n")
## 
## --- Posterior summaries ---
cat("Poisson lambda (mean, median, 95% CI):\n")
## Poisson lambda (mean, median, 95% CI):
with(pois$par_draws, { qs <- quantile(lambda, c(.025,.5,.975)); print(c(mean = mean(lambda), median = qs[2], lo = qs[1], hi = qs[3])) })
##       mean median.50%    lo.2.5%   hi.97.5% 
##   12.06484   12.06346   11.84971   12.27786
cat("\nGeometric p (on Y = X - shift). shift =", geom$shift, "\n")
## 
## Geometric p (on Y = X - shift). shift = 3
with(geom$par_draws, { qs <- quantile(p, c(.025,.5,.975)); print(c(mean = mean(p), median = qs[2], lo = qs[1], hi = qs[3])) })
##       mean median.50%    lo.2.5%   hi.97.5% 
## 0.09931742 0.09928719 0.09350139 0.10521312
cat("\nNegBin (r, p):\n")
## 
## NegBin (r, p):
apply(nb$par_draws, 2, function(v) { qs <- quantile(v, c(.025,.5,.975)); c(mean = mean(v), median = qs[2], lo = qs[1], hi = qs[3]) }) |> print()
##                   r         p
## mean       67.58059 0.8473284
## median.50% 65.16806 0.8439512
## lo.2.5%    57.90968 0.8276519
## hi.97.5%   81.43385 0.8715837
cat("\nMultinomial/Dirichlet: posterior mean p over support values:\n")
## 
## Multinomial/Dirichlet: posterior mean p over support values:
multi_mean_p <- colMeans(multi$par_draws$p); multi_support <- multi$par_draws$support
print(setNames(round(multi_mean_p, 4), paste0("x=", multi_support)))
##    x=3    x=4    x=5    x=6    x=7    x=8    x=9   x=10   x=11   x=12   x=13 
## 0.0015 0.0015 0.0044 0.0114 0.0224 0.0491 0.0874 0.1114 0.1306 0.1511 0.1194 
##   x=14   x=15   x=16   x=17   x=18   x=19   x=20 
## 0.1245 0.0826 0.0448 0.0352 0.0156 0.0055 0.0014
cat("\nBinomial (left-trunc) p (mean, median, 95% CI); n, L:\n")
## 
## Binomial (left-trunc) p (mean, median, 95% CI); n, L:
with(binom$par_draws, { qs <- quantile(p, c(.025,.5,.975)); print(c(mean = mean(p), median = qs[2], lo = qs[1], hi = qs[3])) })
##       mean median.50%    lo.2.5%   hi.97.5% 
##  0.6034562  0.6034862  0.5964546  0.6104089
print(list(n = unique(binom$par_draws$n), L = binom$par_draws$L[1]))
## $n
## [1] 20
## 
## $L
## [1] 3
cat("\nBeta–Binomial (left-trunc) alpha, beta (mean, median, 95% CI):\n")
## 
## Beta–Binomial (left-trunc) alpha, beta (mean, median, 95% CI):
apply(bb$par_draws, 2, function(v) { qs <- quantile(v, c(.025,.5,.975)); c(mean = mean(v), median = qs[2], lo = qs[1], hi = qs[3]) }) |> print()
##               alpha      beta
## mean       18.65844 12.255527
## median.50% 17.52741 11.533313
## lo.2.5%    14.29682  9.410973
## hi.97.5%   25.54418 16.733848
cat(sprintf("Beta–Binom MH accept rate: %.3f\n", bb$acc_rate))
## Beta–Binom MH accept rate: 0.670
## ---------------------------------------------------------
## 12) Posterior predictive checks (means)
## ---------------------------------------------------------
ppc_summary <- function(sim_fn, S = 1000) {
  sims <- replicate(S, sim_fn()); c(mean = mean(sims), var = var(sims))
}
n_obs <- length(x)

pois_ppc <- ppc_summary(function() {
  lam <- sample(pois$par_draws$lambda, 1); mean(rpois(n_obs, lam))
})
geom_ppc <- ppc_summary(function() {
  pp <- sample(geom$par_draws$p, 1); shift <- geom$shift
  mean(rgeom(n_obs, prob = pp) + shift)
})
nb_ppc <- ppc_summary(function() {
  row <- nb$par_draws[sample.int(nrow(nb$par_draws), 1), ]; mean(rnbinom(n_obs, size = row$r, prob = row$p))
})
multi_ppc <- ppc_summary(function() {
  idx <- sample.int(nrow(multi$par_draws$p), 1)
  p <- multi$par_draws$p[idx, ]; vals <- multi$par_draws$support
  mean(sample(vals, size = n_obs, replace = TRUE, prob = p))
})
binom_ppc <- ppc_summary(function() {
  p <- sample(binom$par_draws$p, 1); ntr <- unique(binom$par_draws$n); L <- binom$par_draws$L[1]
  samp <- rbinom(5*n_obs, ntr, p); samp <- samp[samp >= L]; if (length(samp) < n_obs) samp <- c(samp, rep(L, n_obs - length(samp)))
  mean(sample(samp, n_obs))
})
bb_ppc <- ppc_summary(function() {
  row <- bb$par_draws[sample.int(nrow(bb$par_draws), 1), ]; a <- row[["alpha"]]; b <- row[["beta"]]
  ntr <- bb$n; L <- bb$L; p <- rbeta(1, a, b)
  samp <- rbinom(5*n_obs, ntr, p); samp <- samp[samp >= L]; if (length(samp) < n_obs) samp <- c(samp, rep(L, n_obs - length(samp)))
  mean(sample(samp, n_obs))
})

cat("\n--- Posterior Predictive (mean of dataset mean; variance across reps) ---\n")
## 
## --- Posterior Predictive (mean of dataset mean; variance across reps) ---
ppc_tab <- rbind(
  Poisson                  = pois_ppc,
  Geometric_shift          = geom_ppc,
  NegBin                   = nb_ppc,
  Multinomial              = multi_ppc,
  Binomial_left_trunc      = binom_ppc,
  Beta_Binomial_left_trunc = bb_ppc
)
print(round(ppc_tab, 4))
##                             mean    var
## Poisson                  12.0609 0.0247
## Geometric_shift          12.0955 0.1812
## NegBin                   12.0619 0.0291
## Multinomial              12.0677 0.0151
## Binomial_left_trunc      12.0637 0.0090
## Beta_Binomial_left_trunc 12.0714 3.2259
## ---------------------------------------------------------
## 13) Empirical vs posterior-predicted PMF overlays (improved)
## ---------------------------------------------------------

# Use x if defined; else pull from unknowndiscrete
if (!exists("x")) x <- unknowndiscrete
stopifnot(is.numeric(x))
n_obs <- length(x)

# Common support (observed values)
support_all <- sort(unique(x))

# Empirical PMF aligned to support
emp_counts <- table(factor(x, levels = support_all))
emp_p <- as.numeric(emp_counts) / n_obs

set.seed(123)

# Posterior-predictive PMFs (averaged over posterior draws)
pred_pmf_pois <- function(vals) {
  lam <- sample(pois$par_draws$lambda, 1000, replace = TRUE)
  sapply(vals, function(v) mean(dpois(v, lam)))
}
pred_pmf_geom <- function(vals) {
  pp <- sample(geom$par_draws$p, 1000, replace = TRUE)
  shift <- geom$shift
  sapply(vals, function(v) {
    y <- v - shift
    if (y < 0) return(0)
    mean(pp * (1 - pp)^y)
  })
}
pred_pmf_nb <- function(vals) {
  rows <- nb$par_draws[sample.int(nrow(nb$par_draws), 1000, replace = TRUE), ]
  sapply(vals, function(v) mean(dnbinom(v, size = rows$r, prob = rows$p)))
}
pred_pmf_multi <- function(vals) {
  # posterior mean over p_k for matched support (0 if value not in support)
  map <- match(vals, multi$par_draws$support)
  sapply(seq_along(vals), function(j) {
    idx <- map[j]; if (is.na(idx)) return(0)
    mean(multi$par_draws$p[, idx])
  })
}
pred_pmf_binom <- function(vals) {
  pp  <- sample(binom$par_draws$p, 1000, replace = TRUE)
  ntr <- unique(binom$par_draws$n)
  L   <- binom$par_draws$L[1]
  sapply(vals, function(v) {
    if (!isTRUE(binom$truncated)) return(mean(dbinom(v, ntr, pp)))
    tp <- 1 - pbinom(L - 1, ntr, pp)
    mean(dbinom(v, ntr, pp) / tp * as.numeric(v >= L))
  })
}
pred_pmf_betabinom <- function(vals) {
  ntr <- bb$n; L <- bb$L
  # average truncated Beta–Binomial pmf over posterior draws
  sapply(vals, function(v) {
    means <- apply(bb$par_draws, 1, function(row) {
      a <- row[["alpha"]]; b <- row[["beta"]]
      ll <- lchoose(ntr, v) + lbeta(v + a, ntr - v + b) - lbeta(a, b)
      if (is.na(L)) return(exp(ll))
      ks  <- L:ntr
      llk <- lchoose(ntr, ks) + lbeta(ks + a, ntr - ks + b) - lbeta(a, b)
      m <- max(llk); lt <- m + log(sum(exp(llk - m)))
      exp(ll - lt) * as.numeric(v >= L)
    })
    mean(means)
  })
}

# Precompute predictions to set a common ylim
pred_list <- list(
  Poisson                 = pred_pmf_pois(support_all),
  `Geometric(+shift)`     = pred_pmf_geom(support_all),
  NegBin                  = pred_pmf_nb(support_all),
  `Multinomial/Dirichlet` = pred_pmf_multi(support_all),
  `Binomial (left-trunc)` = pred_pmf_binom(support_all),
  `Beta–Binomial (left-trunc)` = pred_pmf_betabinom(support_all)
)
y_max <- max(c(emp_p, unlist(pred_list)), na.rm = TRUE) * 1.15

# Simple panel helper
panel <- function(title, pred) {
  M <- rbind(emp = emp_p, pred = pred)
  barplot(M, beside = TRUE,
          names.arg = support_all,
          ylim = c(0, y_max),
          col = c("grey70", adjustcolor("tomato", alpha.f = 0.6)),
          border = NA,
          las = 1,
          main = title, ylab = "Probability", xlab = "Value")
  box(bty = "l")
}

op <- par(mfrow = c(3,2), mar = c(4,4,3,1))
panel("Poisson",                          pred_list$Poisson)
panel("Geometric(+shift)",                pred_list$`Geometric(+shift)`)
panel("NegBin",                           pred_list$NegBin)
panel("Multinomial/Dirichlet",            pred_list$`Multinomial/Dirichlet`)
panel("Binomial (left-trunc)",            pred_list$`Binomial (left-trunc)`)
# Add legend on the last panel
panel("Beta–Binomial (left-trunc)",       pred_list$`Beta–Binomial (left-trunc)`)
legend("topright", inset = 0.02, bty = "n",
       fill = c("grey70", adjustcolor("tomato", 0.6)),
       legend = c("Empirical", "Posterior predictive"))

par(op)

## ---------------------------------------------------------
## 14) Recommendation (unchanged)
## ---------------------------------------------------------
best <- waic_table$Model[which.min(waic_table$WAIC)]
cat("\n>>> By WAIC, the leading candidate is:", best, "\n")
## 
## >>> By WAIC, the leading candidate is: Beta–Binomial (n=20, left-trunc)
cat("Use the corresponding posterior draws for parameters and posterior predictive inference.\n")
## Use the corresponding posterior draws for parameters and posterior predictive inference.

We analyzed a univariate discrete sample with n = 1000 taking values in {3,…,20}.

Summary: min = 3, median = 12, mean = 12.071, variance = 7.255, max = 20.
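These figures can be reproduced directly from the cleaned integer vector x:

round(c(min = min(x), median = median(x), mean = mean(x), var = var(x), max = max(x)), 3)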

The empirical PMF is unimodal and roughly centered near 12, with light tails.

Modeling goals:

Infer a plausible generative mechanism, quantify posterior uncertainty for parameters, and evaluate posterior-predictive fit.

Candidate models:

All models are Bayesian with weak priors; posteriors are obtained via conjugacy or simple MCMC. We compare by WAIC.

  1. Poisson(λ) (Gamma prior).
  2. Geometric(p) with shift \(Y=X-\min(x)\) (Beta prior on \(p\)).
  3. Negative Binomial \(\mathrm{NB}(r,p)\) (Gamma prior on \(r\), Beta prior on \(p\)).
  4. Multinomial/Dirichlet on the observed support (descriptive baseline).
  5. Binomial(n,p) with left truncation at \(L=\min(x)=3\); \(n=20\) fixed; Beta prior on \(p\).
  6. Beta–Binomial(n, α, β) with left truncation at \(L=3\); weak Gamma priors on \(α,β\).
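Several of the candidates admit conjugate updates, so no MCMC is needed for them. As a minimal sketch of the idea behind model (1), with illustrative hyperparameters a and b (the pois_fit function defined earlier uses its own prior settings):

# Conjugate Gamma–Poisson update: prior lambda ~ Gamma(a, b) (rate parameterization)
# yields posterior lambda | x ~ Gamma(a + sum(x), b + n), which can be sampled directly.
pois_conjugate_sketch <- function(x, S = 4000, a = 0.01, b = 0.01) {
  rgamma(S, shape = a + sum(x), rate = b + length(x))
}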

Truncated likelihood

For models (5)–(6) the observed data obey \(X \ge L\). If \(f(x\mid\theta)\) is the non-truncated pmf, \[ f_L(x\mid\theta)=\frac{f(x\mid\theta)}{\sum_{k=L}^{n} f(k\mid\theta)}\,,\qquad x=L,\dots,n. \]
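For concreteness, a minimal sketch of this renormalization for the Binomial case with an illustrative p = 0.6 (for the Beta–Binomial the denominator has no simple closed form and is accumulated via log-sum-exp, as in log_tail_prob above):

p_demo <- 0.6; n_demo <- 20; L_demo <- 3
f_L <- function(k) dbinom(k, n_demo, p_demo) / (1 - pbinom(L_demo - 1, n_demo, p_demo))
sum(f_L(L_demo:n_demo))   # the truncated pmf sums to 1 over k = L, ..., n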

Inference & diagnostics

For the Beta–Binomial we used a random-walk MH on \((α,β)\) with 4 chains.

  • \(\hat R\): alpha 1.0025, beta 1.0025 (target < 1.01).
  • Acceptance rates (per chain): 0.436, 0.439, 0.312, 0.437 (target 0.20 – 0.60).
  • ESS per chain ≈ 120 – 132 (sum ≈ 493 overall).

Conclusion: convergence/mixing are adequate for reporting.

Model comparison (WAIC)

Lower is better.

Model WAIC lppd p_waic
Beta–Binomial (n=20, left-trunc) 4828.8 −2411.658 2.718
Multinomial/Dirichlet 4840.6 −2403.503 16.783
Binomial (n=20, left-trunc) 4932.4 −2464.641 1.571
Poisson 4933.5 −2466.145 0.592
NegBin 5007.4 −2503.123 0.560
Geometric(+shift) 6517.7 −3258.757 0.082

Winner: left-truncated Beta–Binomial (n=20).

Multinomial/Dirichlet is competitive in WAIC but far less parsimonious (p_waic ≈ 17, roughly one free parameter per support point).

Posterior summary:

  • Beta–Binomial (n=20, L=3):
    • \(\alpha\) mean 18.66 (95% CI 14.30 – 25.54)
    • \(\beta\) mean 12.26 (95% CI 9.41 – 16.73)
    • implied \(\mathbb E[p]=\alpha/(\alpha+\beta) \approx\) 0.603

10.2 Posterior-predictive checks

Mean of the dataset mean across replicated datasets and its Monte Carlo variance:

Model mean var
Poisson 12.061 0.0247
Geometric(+shift) 12.096 0.1812
NegBin 12.062 0.0291
Multinomial/Dirichlet 12.068 0.0151
Binomial (left-trunc) 12.064 0.0090
Beta–Binomial (left-trunc) 12.071 3.2259

Remark: The Beta–Binomial shows larger replicate-to-replicate variability for the dataset mean because each replicate draws its own \(p\sim\mathrm{Beta}(\alpha,\beta)\), adding between-dataset variation.
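A minimal simulation illustrates the point, using rounded values near the posterior means and ignoring truncation for simplicity: redrawing \(p\) once per replicate inflates the variance of the replicated dataset mean by roughly \(n^2\,\mathrm{Var}(p)\) with \(n = 20\) trials, whereas holding \(p\) fixed leaves only the within-dataset sampling noise.

set.seed(1)
n_rep <- 1000; n_obs <- 1000; ntr <- 20
a_hat <- 18.7; b_hat <- 12.3; p_bar <- a_hat / (a_hat + b_hat)
means_fixed   <- replicate(n_rep, mean(rbinom(n_obs, ntr, p_bar)))                    # p held at its mean
means_redrawn <- replicate(n_rep, mean(rbinom(n_obs, ntr, rbeta(1, a_hat, b_hat))))   # fresh p per replicate
c(var_fixed = var(means_fixed), var_redrawn = var(means_redrawn))                     # the second is far larger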

Visual fit (PMF overlays)

Empirical bars closely match the Beta–Binomial posterior-predictive PMF (with truncation), respecting the bounded support 3–20, matching the peak near 12, and capturing tail thickness. Poisson/NegBin can’t encode the hard upper bound; shifted Geometric is shape-mismatched. Multinomial/Dirichlet tracks data pointwise but lacks a compact mechanism.

Interpretation

A coherent generative story:

  • Each observation is the number of successes out of \(n=20\) trials.
  • The success probability varies across units: \(p\sim\mathrm{Beta}(\alpha,\beta)\) with posterior means \(\alpha\approx18.7,\ \beta\approx12.3 \Rightarrow \mathbb E[p]\approx0.603\).
  • Observations below 3 are excluded by design/protocol (left truncation at \(L=3\)).

This model is best-supported by WAIC, parsimonious (effective parameter count ≈ 2.7), and predictively calibrated.


Methods appendix (formal model statement)

We model \(X_i\) as a left-truncated Beta–Binomial with \(n=20\) and \(L=3\): \[ X_i \mid p_i \sim \mathrm{Binomial}(20,p_i),\qquad p_i \sim \mathrm{Beta}(\alpha,\beta),\qquad \Pr(X_i=k\mid X_i\ge 3)=\frac{\Pr(X_i=k)}{\sum_{j=3}^{20}\Pr(X_i=j)}. \] Priors: \(\alpha,\beta \stackrel{\text{iid}}{\sim}\mathrm{Gamma}(0.1,0.1)\).
Posterior sampling via random-walk MH for \((\alpha,\beta)\); other candidates fitted by conjugacy or simple MH-within-Gibbs.
Model comparison by WAIC: \[ \mathrm{WAIC}=-2\left(\sum_{i=1}^n \log \mathbb E_{\theta} [p(x_i\mid\theta)] \;-\; \sum_{i=1}^n \mathrm{Var}_{\theta}[\log p(x_i\mid\theta)]\right). \]
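For reference, computing these quantities from an S × n matrix of pointwise log-likelihood draws can be done as sketched below; the waic_from_loglik_draws helper used earlier plays this role, and this sketch only illustrates the formula rather than replacing that helper.

waic_sketch <- function(loglik) {                # rows = posterior draws, cols = observations
  lppd_i <- apply(loglik, 2, function(ll) {      # log pointwise predictive density
    m <- max(ll); m + log(mean(exp(ll - m)))     # log-mean-exp for numerical stability
  })
  p_waic_i <- apply(loglik, 2, var)              # pointwise posterior variance of log-lik
  lppd <- sum(lppd_i); p_waic <- sum(p_waic_i)
  list(WAIC = -2 * (lppd - p_waic), lppd = lppd, p_waic = p_waic)
}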


set.seed(123)

# 1) Robustly extract the numeric vector
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else unknowndiscrete
stopifnot(is.numeric(x), all(is.finite(x)))
x <- as.integer(round(x))  # ensure integer counts

n_obs <- length(x)
cat("Discrete x: n =", n_obs, "  min/max =", min(x), "/", max(x), "\n")
## Discrete x: n = 1000   min/max = 3 / 20
# 2) Choose n_fixed safely
if (any(x > 20L)) {
  cat("Values above 20 detected:\n"); print(table(x[x > 20L]))
  n_fix <- max(x)
  cat("Setting n_fixed <-", n_fix, "\n")
} else {
  n_fix <- 20L
}

# 3) Fit the left-truncated Beta–Binomial
bb <- betabinom_fit(
  x, n_fixed = n_fix,
  S = 20000, burn = 10000,
  alpha_init = 20, beta_init = 13,
  prop_sd = 0.20, use_trunc = TRUE
)

cat(sprintf("MH acceptance rate: %.3f\n", bb$acc_rate))
## MH acceptance rate: 0.704
# 4) Trace plots with a light moving average overlay
op <- par(mfrow = c(1,2), mar = c(4,4,2,1))
plot(bb$par_draws$alpha, type = "l", col = "dodgerblue",
     main = "Traceplot: alpha", xlab = "Iteration", ylab = "alpha")
grid(col = "grey85", lty = "dotted")
ma_k <- 200
alpha_ma <- as.numeric(stats::filter(bb$par_draws$alpha, rep(1/ma_k, ma_k), sides = 1))
lines(alpha_ma, col = "navy", lwd = 2)

plot(bb$par_draws$beta, type = "l", col = "tomato",
     main = "Traceplot: beta", xlab = "Iteration", ylab = "beta")
grid(col = "grey85", lty = "dotted")
beta_ma <- as.numeric(stats::filter(bb$par_draws$beta, rep(1/ma_k, ma_k), sides = 1))
lines(beta_ma, col = "firebrick", lwd = 2)

par(op)
# ------------ Gelman–Rubin R-hat for multiple chains ------------
rhat <- function(chains_mat){
  # chains_mat: n x m (rows = iterations, cols = chains)
  m <- ncol(chains_mat); n <- nrow(chains_mat)
  stopifnot(m >= 2, n >= 2)
  chain_means <- colMeans(chains_mat)
  B <- n * stats::var(chain_means)               # between-chain
  W <- mean(apply(chains_mat, 2, stats::var))    # within-chain
  var_hat <- ((n - 1) / n) * W + B / n
  sqrt(var_hat / W)
}

# ------------ ESS (per chain): sum initial positive ACFs ------------
ess_one <- function(draws, max_lag = 1000){
  n <- length(draws)
  if (n < 3) return(n)
  a <- stats::acf(draws, plot = FALSE,
                  lag.max = min(max_lag, n - 1),
                  na.action = na.pass)$acf[-1]
  if (!length(a)) return(n)
  # keep only the initial run of positive autocorrelations
  k <- rle(a > 0)$lengths[1]
  if (is.na(k) || k < 1) return(n)
  tau <- 1 + 2 * sum(a[seq_len(k)])
  eff <- n / max(tau, 1e-8)
  max(1L, floor(eff))
}

# ------------ Single-chain Beta–Binomial RW on log-scale ------------
betabinom_fit_logrw <- function(x, n_fixed = NULL, S = 40000, burn = 20000,
                                alpha_init = 2, beta_init = 2,
                                a_h = 0.1, b_h = 0.1,     # Gamma priors on α,β
                                sd_a = 0.35, sd_b = 0.35, # proposal sds on log-scale
                                use_trunc = TRUE, pilot = 4000, target = 0.35){

  stopifnot(is.numeric(x), all(is.finite(x)), length(x) > 0)
  x <- as.integer(round(x))

  ntr <- if (is.null(n_fixed)) max(x) else n_fixed
  if (any(x > ntr)) stop("Data exceed chosen n in Beta–Binomial model.")
  L <- min(x)

  log_dbetabinom <- function(k, n, a, b){
    lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b)
  }
  log_tail_prob <- function(a, b){
    ks <- L:ntr
    ll <- vapply(ks, function(k) log_dbetabinom(k, ntr, a, b), numeric(1))
    m <- max(ll); m + log(sum(exp(ll - m)))  # log-sum-exp
  }
  logpost <- function(a, b){
    if (a <= 0 || b <= 0) return(-Inf)
    if (!use_trunc) {
      ll <- sum(vapply(x, function(xi) log_dbetabinom(xi, ntr, a, b), numeric(1)))
    } else {
      lt <- log_tail_prob(a, b)
      ll <- sum(vapply(x, function(xi) log_dbetabinom(xi, ntr, a, b) - lt, numeric(1)))
    }
    ll + dgamma(a, shape = a_h, rate = b_h, log = TRUE) +
         dgamma(b, shape = a_h, rate = b_h, log = TRUE)
  }
  # work on log-scale with Jacobian
  logpost_u <- function(u_a, u_b){
    a <- exp(u_a); b <- exp(u_b)
    logpost(a, b) + u_a + u_b
  }

  # ---- pilot adaptation (tune sd_a, sd_b toward 'target') ----
  u_a <- log(alpha_init); u_b <- log(beta_init)
  acc <- 0L
  for (t in 1:pilot){
    u_a_prop <- u_a + rnorm(1, 0, sd_a)
    u_b_prop <- u_b + rnorm(1, 0, sd_b)
    logacc <- logpost_u(u_a_prop, u_b_prop) - logpost_u(u_a, u_b)
    if (log(runif(1)) < logacc){ u_a <- u_a_prop; u_b <- u_b_prop; acc <- acc + 1L }
    if (t %% 200 == 0){
      ar <- acc / 200; acc <- 0L
      adj <- if (ar > target) 1.2 else 0.8
      sd_a <- sd_a * adj; sd_b <- sd_b * adj
    }
  }

  # ---- main sampling ----
  keep <- S - burn
  a_draw <- numeric(keep); b_draw <- numeric(keep)
  acc_main <- 0L
  for (s in 1:S){
    u_a_prop <- u_a + rnorm(1, 0, sd_a)
    u_b_prop <- u_b + rnorm(1, 0, sd_b)
    logacc <- logpost_u(u_a_prop, u_b_prop) - logpost_u(u_a, u_b)
    if (log(runif(1)) < logacc){
      u_a <- u_a_prop; u_b <- u_b_prop
      if (s > burn) acc_main <- acc_main + 1L
    }
    if (s > burn){
      a_draw[s - burn] <- exp(u_a)
      b_draw[s - burn] <- exp(u_b)
    }
  }
  acc_rate <- acc_main / keep
  draws <- data.frame(alpha = a_draw, beta = b_draw)

  # pointwise log-lik matrix [S_keep x n_obs] for WAIC
  loglik <- sapply(x, function(xi){
    ll <- lchoose(ntr, xi) + lbeta(xi + draws$alpha, ntr - xi + draws$beta) - lbeta(draws$alpha, draws$beta)
    if (use_trunc){
      ks <- L:ntr
      lt <- apply(draws, 1, function(row){
        a <- row[["alpha"]]; b <- row[["beta"]]
        llk <- lchoose(ntr, ks) + lbeta(ks + a, ntr - ks + b) - lbeta(a, b)
        m <- max(llk); m + log(sum(exp(llk - m)))
      })
      ll <- ll - lt
    }
    as.numeric(ll)
  })

  list(par_draws = draws, loglik = as.matrix(loglik),
       n = ntr, L = L, acc_rate = acc_rate,
       tuned_sd = c(sd_a = sd_a, sd_b = sd_b))
}

# ------------ Multi-chain wrapper with diagnostics ------------
betabinom_fit_multi_logrw <- function(x, n_fixed = NULL,
                                      chains = 4, S = 40000, burn = 20000,
                                      seeds = NULL, use_trunc = TRUE){
  stopifnot(is.numeric(x), all(is.finite(x)))
  x <- as.integer(round(x))
  if (is.null(seeds)) seeds <- sample.int(1e9, chains)

  fits <- vector("list", chains)
  for (i in seq_len(chains)){
    set.seed(seeds[i])
    fit_i <- betabinom_fit_logrw(
      x, n_fixed = n_fixed, S = S, burn = burn,
      alpha_init = runif(1, 10, 25), beta_init = runif(1, 8, 18),
      use_trunc = use_trunc
    )
    fit_i$par_draws$chain <- i
    fit_i$par_draws$iter  <- seq_len(nrow(fit_i$par_draws))
    fits[[i]] <- fit_i
  }

  # align by min length
  n_keep <- min(sapply(fits, function(f) nrow(f$par_draws)))
  A <- sapply(fits, function(f) f$par_draws$alpha[seq_len(n_keep)])
  B <- sapply(fits, function(f) f$par_draws$beta [seq_len(n_keep)])

  list(
    draws = do.call(rbind, lapply(fits, function(f) f$par_draws[seq_len(n_keep), ])),
    n = fits[[1]]$n, L = fits[[1]]$L,
    rhat = c(alpha = rhat(A), beta = rhat(B)),
    ess  = data.frame(
      chain = 1:chains,
      ESS_alpha = sapply(fits, \(f) ess_one(f$par_draws$alpha)),
      ESS_beta  = sapply(fits, \(f) ess_one(f$par_draws$beta))
    ),
    accept_rates = sapply(fits, `[[`, "acc_rate"),
    tuned_sds    = t(sapply(fits, `[[`, "tuned_sd"))
  )
}
set.seed(123)

# --- Extract a clean integer vector x from unknowndiscrete ---
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else unknowndiscrete
stopifnot(is.numeric(x))
x <- as.integer(round(x))

# --- Sanity checks on x (not on the data.frame) ---
stopifnot(all(is.finite(x)), all(x >= 0))

# Keep n_fixed = 20 unless data exceed 20
n_fix <- 20L
if (max(x) > n_fix) {
  stop(sprintf("max(x)=%d exceeds n_fixed=%d. Set n_fixed <- max(x) or clean data.", max(x), n_fix))
}

cat("Discrete x: n =", length(x), "  min/max =", min(x), "/", max(x), "\n")
## Discrete x: n = 1000   min/max = 3 / 20
# --- Use smaller S during knit to keep it fast ---
S_knit    <- 200
burn_knit <- 100

bbmc <- betabinom_fit_multi_logrw(
  x, n_fixed = n_fix,
  chains = 4, S = S_knit, burn = burn_knit,
  use_trunc = TRUE
)

# --- Diagnostics ---
rh  <- bbmc$rhat            # expect ~1.00–1.01
essd <- bbmc$ess            # per-chain ESS; larger is better
acc  <- bbmc$accept_rates   # target 0.20–0.60
sds  <- as.data.frame(bbmc$tuned_sds)  # learned step sizes (log-scale)

# Status flags
ok_rhat <- all(is.finite(rh)) && all(rh < 1.01)
ok_acc  <- all(is.finite(acc)) && all(acc >= 0.20 & acc <= 0.60)

# Per-chain ESS mins & sums
ess_min_alpha <- min(essd$ESS_alpha, na.rm = TRUE)
ess_min_beta  <- min(essd$ESS_beta,  na.rm = TRUE)
ess_sum_alpha <- sum(essd$ESS_alpha, na.rm = TRUE)
ess_sum_beta  <- sum(essd$ESS_beta,  na.rm = TRUE)

fmt <- function(x) paste(formatC(x, digits = 4, format = "f"), collapse = ", ")

cat("### MCMC Diagnostics\n\n")
## ### MCMC Diagnostics
cat(sprintf("- **R-hat**: alpha = %.4f, beta = %.4f %s (target < 1.01)\n",
            rh["alpha"], rh["beta"], if (ok_rhat) "✅" else "⚠️"))
## - **R-hat**: alpha = 2.5375, beta = 2.4823 ⚠️ (target < 1.01)
cat(sprintf("- **Acceptance rates (per chain)**: %s %s (target 0.20–0.60)\n",
            fmt(acc), if (ok_acc) "✅" else "⚠️"))
## - **Acceptance rates (per chain)**: 0.4600, 0.3700, 0.3200, 0.2700 ✅ (target 0.20–0.60)
cat(sprintf("- **ESS (min per chain)**: alpha = %d, beta = %d\n",
            ess_min_alpha, ess_min_beta))
## - **ESS (min per chain)**: alpha = 6, beta = 5
cat(sprintf("- **ESS (sum across chains)**: alpha = %d, beta = %d\n\n",
            ess_sum_alpha, ess_sum_beta))
## - **ESS (sum across chains)**: alpha = 36, beta = 32
# Per-chain ESS table
knitr::kable(essd, caption = "Per-chain Effective Sample Sizes (ESS)")
Per-chain Effective Sample Sizes (ESS)
chain ESS_alpha ESS_beta
1 6 5
2 10 8
3 7 6
4 13 13
# Tuned proposal sds (log-scale) table
colnames(sds) <- c("sd_a", "sd_b")
sds$chain <- seq_len(nrow(sds))
knitr::kable(
  sds[, c("chain","sd_a","sd_b")],
  digits = 6,
  caption = "Tuned log-scale step sizes by chain"
)
Tuned log-scale step sizes by chain
chain sd_a sd_b
1 0.030642 0.030642
2 0.045964 0.045964
3 0.045964 0.045964
4 0.045964 0.045964
# Posterior summaries for alpha, beta, and p = alpha/(alpha+beta)
d <- bbmc$draws
p <- with(d, alpha/(alpha+beta))

qs <- function(v) setNames(quantile(v, c(.025, .5, .975), na.rm = TRUE),
                           c("2.5%", "50%", "97.5%"))

post_tab <- rbind(
  data.frame(parameter = "alpha",
             mean = mean(d$alpha), t(qs(d$alpha))),
  data.frame(parameter = "beta",
             mean = mean(d$beta ), t(qs(d$beta ))),
  data.frame(parameter = "p = alpha/(alpha+beta)",
             mean = mean(p),       t(qs(p)))
)

knitr::kable(post_tab, digits = 6,
             caption = "Posterior summaries (means and central 95% intervals)")
Posterior summaries (means and central 95% intervals)
parameter mean X2.5. X50. X97.5.
alpha 20.437474 15.869642 21.150061 24.140738
beta 13.405585 10.566549 13.831235 15.631938
p = alpha/(alpha+beta) 0.603707 0.591783 0.603818 0.610158
# Short verdict line (robust)
verdict <- if (ok_rhat && ok_acc && ess_min_alpha >= 100 && ess_min_beta >= 100) {
  "✅ Convergence and mixing look good for reporting."
} else {
  "⚠️ Consider longer runs or retuning proposals before final reporting."
}
knitr::asis_output(paste0("**Verdict:** ", verdict))

Verdict: ⚠️ Consider longer runs or retuning proposals before final reporting.
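When this warning appears, a longer run of the same sampler is the first remedy; the sketch below (slow, so intended for an interactive session rather than knitting) would be expected to bring R-hat under 1.01 and raise the per-chain ESS well above the values from the S_knit = 200 toy run.

# Slow: run interactively rather than at knit time.
bbmc_long <- betabinom_fit_multi_logrw(x, n_fixed = n_fix,
                                       chains = 4, S = 40000, burn = 20000,
                                       use_trunc = TRUE)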

## ===== MCSE summaries + refreshed PMF/CDF overlays (multi-chain) =====

# Use a clean integer vector x
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else unknowndiscrete
stopifnot(is.numeric(x)); x <- as.integer(round(x))

# Pull from multi-chain fit
d <- bbmc$draws            # needs columns: alpha, beta, chain
n <- bbmc$n
L <- bbmc$L

# --- helpers: ESS per chain + MCSE for mean ---
ess_one <- function(draws, max_lag = 1000){
  a <- stats::acf(draws, plot = FALSE, lag.max = min(max_lag, length(draws)-1))$acf[-1]
  if (!length(a)) return(length(draws))
  k <- rle(a > 0)$lengths[1]
  if (is.na(k) || k < 1) return(length(draws))
  tau <- 1 + 2 * sum(a[seq_len(k)])
  max(1L, floor(length(draws) / max(tau, 1e-8)))
}
mcse_mean <- function(x, ess) stats::sd(x) / sqrt(ess)

# --- combined ESS across chains (sum of per-chain ESS) ---
ESS_alpha <- sum(sapply(split(d$alpha, d$chain), ess_one))
ESS_beta  <- sum(sapply(split(d$beta , d$chain), ess_one))

# --- derived p and its ESS/MCSE ---
p_draws <- with(d, alpha / (alpha + beta))
ESS_p   <- sum(sapply(split(p_draws, d$chain), ess_one))

# --- posterior means/quantiles + MCSEs ---
qs <- function(v) stats::quantile(v, c(.025,.5,.975))
summ <- list(
  alpha = c(mean = mean(d$alpha),
            `q2.5` = qs(d$alpha)[1], `q50` = qs(d$alpha)[2], `q97.5` = qs(d$alpha)[3],
            ESS = ESS_alpha,
            MCSE_mean = mcse_mean(d$alpha, ESS_alpha)),
  beta  = c(mean = mean(d$beta),
            `q2.5` = qs(d$beta)[1],  `q50` = qs(d$beta)[2],  `q97.5` = qs(d$beta)[3],
            ESS = ESS_beta,
            MCSE_mean = mcse_mean(d$beta, ESS_beta)),
  p     = c(mean = mean(p_draws),
            `q2.5` = qs(p_draws)[1], `q50` = qs(p_draws)[2], `q97.5` = qs(p_draws)[3],
            ESS = ESS_p,
            MCSE_mean = mcse_mean(p_draws, ESS_p))
)
print(summ)
## $alpha
##        mean   q2.5.2.5%     q50.50% q97.5.97.5%         ESS   MCSE_mean 
##  20.4374744  15.8696421  21.1500614  24.1407380  36.0000000   0.4178856 
## 
## $beta
##        mean   q2.5.2.5%     q50.50% q97.5.97.5%         ESS   MCSE_mean 
##  13.4055848  10.5665486  13.8312346  15.6319377  32.0000000   0.2773333 
## 
## $p
##         mean    q2.5.2.5%      q50.50%  q97.5.97.5%          ESS    MCSE_mean 
## 6.037073e-01 5.917831e-01 6.038183e-01 6.101578e-01 8.400000e+01 4.639857e-04
# --- population params for overlays (posterior means) ---
alpha_hat <- unname(summ$alpha["mean"])
beta_hat  <- unname(summ$beta ["mean"])

# ---- Beta–Binomial pmf (untruncated) + truncated helpers (use lbeta to avoid 'beta' clash) ----
dbetabinom <- function(k, n, alpha, beta_par){
  choose(n, k) * exp(lbeta(k + alpha, n - k + beta_par) - lbeta(alpha, beta_par))
}
tbb_pmf <- function(k, n, alpha, beta_par, L = 0){
  ks  <- 0:n
  pmf <- dbetabinom(ks, n, alpha, beta_par)
  if (L > 0){
    z <- sum(pmf[ks >= L]); pmf[ks < L] <- 0; pmf <- pmf / z
  }
  pmf[match(k, ks)]
}
tbb_cdf <- function(k_vals, n, alpha, beta_par, L = 0){
  ks  <- 0:n
  pmf <- dbetabinom(ks, n, alpha, beta_par)
  if (L > 0){
    z <- sum(pmf[ks >= L]); pmf[ks < L] <- 0; pmf <- pmf / z
  }
  cdf <- cumsum(pmf)
  stats::approx(x = ks, y = cdf, xout = k_vals, method = "constant", rule = 2, f = 1)$y
}

# --- Empirical PMF aligned to full truncated support ---
support  <- L:n
emp_tab  <- table(factor(x, levels = support))
emp_p    <- as.numeric(emp_tab) / length(x)

# --- Population PMF using (alpha_hat, beta_hat) ---
pop_p    <- tbb_pmf(support, n, alpha_hat, beta_hat, L = L)
ylim_max <- max(c(emp_p, pop_p)) * 1.2

# --- PMF overlay ---
op <- par(mfrow = c(1,2), mar = c(4,4,3,1))
plot(NA, xlim = c(L - 0.5, n + 0.5), ylim = c(0, ylim_max),
     xlab = "x", ylab = "P(x)",
     main = "Observed (blue) vs Theoretical (red)\nTruncated Beta–Binomial PMF")
segments(support, 0, support, emp_p, lwd = 16, col = adjustcolor("dodgerblue", 0.6))
points(support, pop_p, type = "h", lwd = 8, col = adjustcolor("tomato", 0.75))
legend("topright", bty = "n", lwd = c(16, 8),
       col = c(adjustcolor("dodgerblue", 0.6), adjustcolor("tomato", 0.75)),
       legend = c("Empirical PMF", sprintf("Pop PMF (α=%.2f, β=%.2f)", alpha_hat, beta_hat)))

# --- CDF overlay: ECDF vs population CDF ---
plot(stats::ecdf(x), verticals = TRUE, do.points = FALSE, lwd = 3,
     col = adjustcolor("dodgerblue", 0.9), xlab = "x", ylab = "Cumulative probability",
     main = "Observed (blue) vs Population (red)\nTruncated Beta–Binomial CDF",
     xlim = c(L, n), ylim = c(0, 1))
lines(support, tbb_cdf(support, n, alpha_hat, beta_hat, L = L),
      type = "s", lwd = 3, col = adjustcolor("tomato", 0.9))
grid(nx = NA, ny = NULL, col = "lightgray", lty = "dotted")

par(op)

Key takeaways:

  • Beta–Binomial (left-trunc, n=20) is best
    • Lowest WAIC (4828.8).
    • Much better than plain Binomial (4932.4), showing the need for extra dispersion.
    • \(p_{\text{waic}} \approx 2.7\) → effective parameter complexity is small (parsimonious).
  • Multinomial/Dirichlet is competitive but less parsimonious
    • WAIC ~4840, only ~12 points worse.
    • But \(p_{\text{waic}} \approx 17\): that’s like estimating 17 free parameters (almost one per support point).
    • Good empirical fit, but no simple mechanistic story.
  • Poisson & NegBin are notably worse
    • WAIC ~4933 and ~5007, far higher.
    • They don’t respect the bounded support (0…∞ instead of 3…20).
    • NegBin’s extra dispersion doesn’t rescue it because the variance structure isn’t right.
  • Geometric(+shift) is way off
    • WAIC ~6518, essentially disqualified.
    • Support/shape mismatch.

The best-supported generative mechanism for the discrete data is a left-truncated Beta–Binomial distribution with n=20 trials.
This model assumes each observation arises from a Binomial(\(n=20, p\)), where \(p\) itself varies across units with a Beta(\(\alpha \approx 20, \beta \approx 13\)) distribution.
The truncation at 3 accounts for the fact that no values below 3 were observed.

Compared with alternatives (plain Binomial, Poisson, NegBin, Geometric, Multinomial), this model balances parsimony and fit: it achieves the lowest WAIC while using only ~3 effective parameters, and it provides a coherent generative story.


Overlay plots: interpretation

  • Poisson / NegBin (top-left, middle-left): too much mass in the tails and spillover beyond the support. Reasonable near the center, but can’t respect the hard cutoff at 20.
  • Geometric(+shift) (top-right): totally the wrong shape — too heavy on the left tail, doesn’t capture the symmetry around 12.
  • Multinomial/Dirichlet (middle-right): matches the data very closely at every point, as expected (empirical smoothing). But this is descriptive, not mechanistic.
  • Binomial (bottom-left): already good — the hump aligns around 12, tails taper nicely, but variance is too low.
  • Beta–Binomial (bottom-right): nearly perfect overlay. Respects bounded support (3–20 with truncation), matches the peak at 12, and has the right spread.

We model each observation \(X_i\) as arising from a left-truncated Beta–Binomial distribution with \(n=20\) trials:

\[ X_i \sim \text{Beta–Binomial}_{[3,20]}(n=20,\;\alpha,\;\beta). \]

The untruncated Beta–Binomial pmf is

\[ \Pr(X=k \mid \alpha,\beta,n) \;=\; \binom{n}{k}\; \frac{B(k+\alpha,\; n-k+\beta)}{B(\alpha,\beta)}, \qquad k=0,\dots,n. \]

Because the dataset contains no counts below 3, we normalize over the truncated support:

\[ \Pr(X=k \mid \alpha,\beta,n, X \ge 3) \;=\; \frac{\Pr(X=k \mid \alpha,\beta,n)}{\sum_{j=3}^n \Pr(X=j \mid \alpha,\beta,n)}, \qquad k=3,\dots,n. \]

We assign weakly informative Gamma priors:

\[ \alpha,\beta \sim \text{Gamma}(0.1, 0.1). \]

Posterior inference for \((\alpha,\beta)\) is carried out via random-walk Metropolis–Hastings (on the log scale in the multi-chain refit above).


Interpretation:

  • \(n=20\) matches the clear upper bound in the data.
  • Posterior mean \(\alpha \approx 20,\;\beta \approx 13\) → mean success probability
    \[ E[p] = \frac{\alpha}{\alpha+\beta} \approx 0.61. \]
  • The Beta prior on \(p\) induces extra-binomial variation across units, giving the right amount of dispersion (see the quick check after this list).
  • Truncation at 3 accounts for the absence of small counts in the sample.
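A quick check of the dispersion claim, using rounded posterior means and the untruncated Beta–Binomial moments (truncation shifts these slightly, so the numbers are only indicative):

a_hat <- 20; b_hat <- 13; n_tr <- 20
p_bar   <- a_hat / (a_hat + b_hat)                                               # ~0.606
mean_bb <- n_tr * p_bar                                                          # ~12.1 (observed mean 12.071)
var_bb  <- n_tr * p_bar * (1 - p_bar) * (1 + (n_tr - 1) / (a_hat + b_hat + 1))   # ~7.4  (observed variance 7.255)
c(mean = mean_bb, var = var_bb)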

10.3 Continuous Sample Data with an unknown distribution

## ---------------------------
## Robust loader for unknowncontinuous.csv
## ---------------------------
load_unknowncontinuous <- function(
  filename          = "unknowncontinuous.csv",
  header            = FALSE,
  col               = 1,        # which column to read (name or index)
  as_vector         = TRUE,     # TRUE: return numeric vector; FALSE: return data.frame
  strip_non_numeric = TRUE      # try to clean strings like "5.2 ms" -> "5.2"
) {
  # Candidate locations to search
  candidates <- unique(na.omit(c(
    filename,
    file.path("data", filename),
    # knitr root dir (if set)
    if (!is.null(knitr::opts_knit$get("root.dir")))
      file.path(knitr::opts_knit$get("root.dir"), filename) else NA,
    # alongside the current Rmd (while knitting)
    tryCatch(file.path(dirname(knitr::current_input()), filename),
             error = function(e) NA),
    # project root via {here}
    if (requireNamespace("here", quietly = TRUE))
      here::here(filename) else NA
  )))

  message("Working dir: ", getwd())
  message("Looking for file at:\n", paste("  - ", candidates, collapse = "\n"))

  # Find a readable path
  path_found <- NULL
  for (p in candidates) {
    if (!is.na(p) && file.exists(p)) { path_found <- p; break }
  }
  if (is.null(path_found)) {
    if (interactive()) {
      message("File not found — please choose it manually.")
      path_found <- file.choose()
    } else {
      stop("Could not find '", filename, "'. Place it next to the Rmd or in a 'data/' folder,\n",
           "or set knitr root dir: knitr::opts_knit$set(root.dir = here::here())")
    }
  }
  message("Reading: ", path_found)

  # Read CSV
  df <- read.csv(path_found,
                 header = header,
                 na.strings = c("", "NA", "NaN"),
                 fileEncoding = "UTF-8",
                 check.names = FALSE,
                 stringsAsFactors = FALSE)

  if (!as_vector) return(df)

  # Select column
  if (is.character(col)) {
    if (!col %in% names(df)) stop("Column '", col, "' not found in CSV.")
    v <- df[[col]]
  } else {
    if (col < 1 || col > ncol(df)) stop("Column index ", col, " out of range.")
    v <- df[[col]]
  }

  # Clean and coerce to numeric
  if (is.factor(v)) v <- as.character(v)
  if (strip_non_numeric && is.character(v)) {
    # keep digits, sign, decimal, exponent markers
    v <- gsub("[^0-9eE+\\-\\.]", "", v)
  }
  v_num <- suppressWarnings(as.numeric(v))

  # Drop non-finite
  dropped <- sum(!is.finite(v_num))
  if (dropped > 0) message("Dropped ", dropped, " non-numeric / non-finite entries.")
  v_num <- v_num[is.finite(v_num)]

  if (!length(v_num)) stop("No finite numeric values found in selected column.")

  as.numeric(v_num)
}

## --- Load & basic EDA for the continuous sample ---
unknowncontinuous <- load_unknowncontinuous("unknowncontinuous.csv", header = FALSE, col = 1, as_vector = TRUE)
z   <- unknowncontinuous

n   <- length(z)
rng <- range(z)
med <- median(z)
mu  <- mean(z)
v   <- var(z)

cat("Loaded n =", n, " | min/median/mean/var/max =",
    round(rng[1], 3), "/", round(med, 3), "/", round(mu, 3), "/",
    round(v, 3), "/", round(rng[2], 3), "\n")
## Loaded n = 1001  | min/median/mean/var/max = 0.081 / 5.032 / 6.168 / 18.842 / 25.481
# Optional quick peek
cat("First 6 values:", paste(round(head(z, 6), 3), collapse = ", "), "\n")
## First 6 values: 1, 5.944, 7.185, 2.33, 4.087, 9.016
cat("Last  6 values:", paste(round(tail(z, 6), 3), collapse = ", "), "\n")
## Last  6 values: 2.015, 0.866, 4.724, 1.713, 7.274, 0.747

Density plot

library(ggplot2)

# 1) Coerce to a clean numeric vector
plot_data <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else unknowncontinuous
plot_data <- suppressWarnings(as.numeric(plot_data))
plot_data <- plot_data[is.finite(plot_data)]
stopifnot(length(plot_data) > 1)

# 2) Get density to position annotations nicely
d_est  <- density(plot_data)
y_max  <- max(d_est$y)
y_mean <- y_max * 0.90
y_med  <- y_max * 0.80

# 3) Build plot
ggplot(data.frame(x = plot_data), aes(x = x)) +
  geom_density(fill = "#1E90FF", alpha = 0.30, color = "dodgerblue4", linewidth = 1.2, na.rm = TRUE) +
  geom_rug(alpha = 0.4, color = "black", sides = "b", length = grid::unit(0.03, "npc")) +
  geom_vline(aes(xintercept = mean(x, na.rm = TRUE)), color = "firebrick", linetype = "dashed", linewidth = 1) +
  geom_vline(aes(xintercept = median(x, na.rm = TRUE)), color = "blue",     linetype = "dashed", linewidth = 1) +
  annotate("text",
           x = mean(plot_data), y = y_mean,
           label = paste("Mean =", round(mean(plot_data), 2)),
           color = "firebrick", vjust = -0.3, hjust = -0.05) +
  annotate("text",
           x = median(plot_data), y = y_med,
           label = paste("Median =", round(median(plot_data), 2)),
           color = "blue", vjust = -0.3, hjust = -0.05) +
  labs(title = "Density Plot of Unknown Continuous Data",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    panel.grid.major = element_line(color = "grey90", linewidth = 0.2),
    panel.grid.minor = element_blank()
  )

Observations:

  • Sample size: Large (n = 1,001)
  • Distribution shape:
    • Mean (6.17) > Median (5.03) → Right-skewed
    • Nonzero skewness and excess kurtosis → non-normal
    • Visually confirmed by density plot

Distribution Analysis

  • Beta distribution:
    • Ruled out (requires [0,1] bounded support)
  • Chi-squared distribution:
    • Expected σ² ≈ 2μ (12.34 vs observed 18.8)
    • Poor fit for this data
  • Gamma distribution (most plausible):
    • General form includes exponential and chi-squared
    • Moment relationships:
      • μ = kθ
      • σ² = kθ²
        • ⇒ k = μ²/σ² ≈ (6.17)²/18.8 ≈ 2.03
        • ⇒ θ = σ²/μ ≈ 18.8/6.17 ≈ 3.05
  • Suggested parameters:
    • Shape (k) ≈ 2.0
    • Scale (θ) ≈ 3.0

# ================= Robust candidate fitting & comparison (fixed) =================

# 0) Clean numeric vector z
to_numeric <- function(obj) {
  if (is.numeric(obj)) return(obj)
  if (is.data.frame(obj)) obj <- obj[[1]]
  v <- as.character(as.vector(obj))
  v <- gsub("[^0-9eE+\\-\\.]", "", v)   # strip non-numeric chars
  suppressWarnings(as.numeric(v))
}

z_raw <- to_numeric(unknowncontinuous)
dropped <- sum(!is.finite(z_raw))
if (dropped > 0) message("Dropped ", dropped, " non-numeric/non-finite entries.")
z <- z_raw[is.finite(z_raw)]
stopifnot(length(z) > 5)

# 1) Fit candidates safely
if (!requireNamespace("fitdistrplus", quietly = TRUE)) install.packages("fitdistrplus")
library(fitdistrplus)

safe_fit <- function(x, dist, ...) {
  suppressWarnings( tryCatch(fitdist(x, dist, ...), error = function(e) NULL) )
}

fits <- list()
fits$normal <- safe_fit(z, "norm")

if (min(z) > 0) {
  fits$lognormal <- safe_fit(z, "lnorm")
  fits$gamma     <- safe_fit(z, "gamma")
  fits$weibull   <- safe_fit(z, "weibull")
} else {
  message("Non-positive values detected; skipping lognormal/gamma/weibull.")
}

# Keep only successful fits
cand <- Filter(Negate(is.null), fits)
if (length(cand) == 0) stop("No distributions could be fitted. Check your data.")

# 2) GOF & model selection
gs <- suppressWarnings(gofstat(cand))  # KS, CvM, AD, etc.
aic <- sapply(cand, function(f) f$aic)
bic <- sapply(cand, function(f) f$bic)

comp_tbl <- data.frame(
  model = names(cand),
  AIC   = as.numeric(aic),
  BIC   = as.numeric(bic),
  KS    = as.numeric(gs$kstest),   # note: gs$kstest holds a character decision, so this coerces to NA; gs$ks holds the KS statistic
  CvM   = as.numeric(gs$cvm),      # smaller is better
  row.names = NULL
)
comp_tbl <- comp_tbl[order(comp_tbl$AIC), ]
print(comp_tbl)
##       model      AIC      BIC KS       CvM
## 3     gamma 5398.102 5407.920 NA 0.1291074
## 4   weibull 5418.442 5428.259 NA 0.4937047
## 2 lognormal 5474.124 5483.942 NA 0.5733925
## 1    normal 5782.752 5792.569 NA 4.7928921
# 3) Visual diagnostics — thicker lines, no arg collisions
cols <- c("#D55E00","#0072B2","#009E73","#CC79A7","#E69F00","#56B4E9")[seq_along(cand)]
lw   <- rep(3, length(cand))
lt   <- rep(1, length(cand))  # solid

op <- par(mfrow = c(2,2)); on.exit(par(op), add = TRUE)

# 1) Density comparison: use fitcol/fitlwd/fitlty + datacol
denscomp(cand,
         legendtext = names(cand),
         main = "Density comparison",
         datacol = "grey85",
         fitcol  = cols,
         fitlwd  = lw,
         fitlty  = lt)

# 2) CDF comparison
cdfcomp(cand,
        legendtext = names(cand),
        main = "CDF comparison",
        fitcol = cols,
        fitlwd = lw,
        fitlty = lt)

# 3) QQ comparison: points + solid 45° line
qqcomp(cand,
       legendtext = names(cand),
       main = "Q–Q comparison",
       fitcol = cols,
       fitpch = 16)
abline(0, 1, col = "grey20", lwd = 2, lty = 1)

# 4) P–P comparison: points + solid 45° line
ppcomp(cand,
       legendtext = names(cand),
       main = "P–P comparison",
       fitcol = cols,
       fitpch = 16)
abline(0, 1, col = "grey20", lwd = 2, lty = 1)

# 4) ggplot overlay of best-AIC model — thicker curve
best_name <- comp_tbl$model[1]
best_fit  <- cand[[best_name]]
est <- best_fit$estimate

best_dens <- switch(best_name,
  "normal"    = function(x) dnorm(x, mean = est["mean"], sd = est["sd"]),
  "lognormal" = function(x) dlnorm(x, meanlog = est["meanlog"], sdlog = est["sdlog"]),
  "gamma"     = function(x) {
    rate <- if ("rate" %in% names(est)) est["rate"] else 1/est["scale"]
    dgamma(x, shape = est["shape"], rate = rate)
  },
  "weibull"   = function(x) dweibull(x, shape = est["shape"], scale = est["scale"])
)

library(ggplot2)
ggplot(data.frame(x = z), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey90", color = "white", linewidth = 0.6) +
  stat_function(fun = best_dens, linewidth = 2.2, color = "tomato") +
  labs(
    title = paste0("Best fit: ", best_name, " (red) over histogram"),
    subtitle = paste("AIC:", round(comp_tbl$AIC[1],1),
                     "| BIC:", round(comp_tbl$BIC[1],1)),
    x = "Value", y = "Density"
  ) +
  theme_minimal(base_size = 12)

For the Gamma distribution, \(\mu = k\theta\) and \(\sigma^2 = k\theta^2\), so \(\frac{\sigma^2}{\mu} = \theta\) and \(\frac{\mu^2}{\sigma^2} = k\).

# Clean numeric vector
z <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else unknowncontinuous
z <- suppressWarnings(as.numeric(z))
z <- z[is.finite(z)]
stopifnot(length(z) > 1)

if (any(z <= 0)) {
  warning("Gamma assumes positive support; values <= 0 detected.")
}

# Summary stats
n   <- length(z)
m   <- mean(z, na.rm = TRUE)
s2  <- var(z,  na.rm = TRUE)          # sample variance
s2p <- s2 * (n - 1) / n               # population variance (optional)

# Method-of-moments parameters
theta <- s2p / m                      # scale
k     <- m / theta                    # shape  (equivalently m^2 / s2p)
rate  <- 1 / theta

cat(sprintf("n = %d | mean = %.4f | var(sample) = %.4f | var(pop) = %.4f\n", n, m, s2, s2p))
## n = 1001 | mean = 6.1682 | var(sample) = 18.8422 | var(pop) = 18.8234
cat(sprintf("Gamma MoM: shape k = %.6f, scale θ = %.6f  (rate = %.6f)\n", k, theta, rate))
## Gamma MoM: shape k = 2.021264, scale θ = 3.051671  (rate = 0.327689)

Density overlay: sample vs. fitted Gamma distributions

# --- 1) Clean numeric vector ---
z <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else unknowncontinuous
z <- suppressWarnings(as.numeric(z))
z <- z[is.finite(z)]
stopifnot(length(z) > 1)

# Gamma needs positive support
if (any(z <= 0)) warning("Gamma assumes positive support; values <= 0 detected.")
zpos <- z[z > 0]
stopifnot(length(zpos) > 1)

# --- 2) Method-of-moments Gamma params (shape k, scale theta) ---
n   <- length(zpos)
m   <- mean(zpos)
s2  <- var(zpos)              # sample variance
s2p <- s2 * (n - 1) / n       # population variance (optional; smoother)
theta_hat <- s2p / m
k_hat     <- m / theta_hat

# Optional: MLE fit (shape, rate); comment out if not desired
mle_ok <- requireNamespace("MASS", quietly = TRUE)
if (mle_ok) {
  fit <- MASS::fitdistr(zpos, "gamma")  # returns shape & rate
  k_mle <- unname(fit$estimate["shape"])
  r_mle <- unname(fit$estimate["rate"])
}

# --- 3) Density overlay plot (transparent lines) ---
dens <- density(zpos)
xgrid <- seq(max(min(zpos), .Machine$double.eps), max(zpos), length.out = 512)
g_mom <- dgamma(xgrid, shape = k_hat, scale = theta_hat)
ymax  <- 1.1 * max(c(dens$y, g_mom, if (mle_ok) dgamma(xgrid, shape = k_mle, rate = r_mle) else 0))

plot(dens,
     main = "Unknown continuous (green) vs. Gamma fits (orange/red)",
     xlab = "x", ylab = "Density",
     lwd = 4, col = adjustcolor("darkgreen", 0.8),
     xlim = range(xgrid), ylim = c(0, ymax), type = "l")
rug(zpos, col = adjustcolor("black", 0.3))

# MoM Gamma (orange)
lines(xgrid, g_mom, col = adjustcolor("tomato", 0.85), lwd = 4)

# Optional: MLE Gamma (red)
if (mle_ok) {
  lines(xgrid, dgamma(xgrid, shape = k_mle, rate = r_mle),
        col = adjustcolor("red3", 0.75), lwd = 3, lty = 2)
}

grid(nx = NA, ny = NULL, col = "lightgray", lty = "dotted", lwd = 1)

legend("topright", bty = "n", cex = 0.9,
       col = c(adjustcolor("darkgreen", 0.8),
               adjustcolor("tomato", 0.85),
               if (mle_ok) adjustcolor("red3", 0.75)),
       lwd = c(4, 4, if (mle_ok) 3),
       lty = c(1, 1, if (mle_ok) 2),
       legend = c(
         "Kernel density (sample)",
         sprintf("Gamma MoM: k=%.2f, θ=%.2f", k_hat, theta_hat),
         if (mle_ok) sprintf("Gamma MLE: k=%.2f, rate=%.2f", k_mle, r_mle)
       ))


Cumulative Distribution Function

## --- Improved ECDF vs Gamma CDF (base R) ---

# 1) Clean numeric vector
x <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else unknowncontinuous
x <- suppressWarnings(as.numeric(x))
x <- x[is.finite(x)]
stopifnot(length(x) > 1)

# Gamma needs positive support → restrict to > 0
x <- x[x > 0]
n <- length(x); stopifnot(n > 1)

# 2) Choose Gamma parameters
#    Option A: use your fixed values
shape <- 2.02
scale <- 3.05

#    Option B (toggle to TRUE): method-of-moments from data
if (FALSE) {
  m  <- mean(x); s2 <- var(x) * (n - 1) / n   # population variance
  shape <- m^2 / s2
  scale <- s2 / m
}

# 3) Plot ECDF with DKW ± band
Fn <- stats::ecdf(x)
x_min <- min(x); x_max <- max(x)
xgrid <- seq(max(.Machine$double.eps, x_min), x_max, length.out = 600)

plot(Fn,
     verticals = TRUE, do.points = FALSE,
     lwd = 4, col = adjustcolor("darkgreen", 0.5),
     main = "Unknown CDF (green) vs Gamma CDF (orange)",
     xlab = "x", ylab = "Cumulative probability",
     xlim = c(x_min, x_max), ylim = c(0, 1))

# DKW band: simultaneous level (1 - alpha) confidence band for the ECDF
alpha <- 0.05
eps   <- sqrt(log(2/alpha) / (2*n))
xx    <- sort(unique(x))
Fn_x  <- Fn(xx)
lines(xx, pmax(0, pmin(1, Fn_x + eps)), col = adjustcolor("darkgreen", 0.35), lty = 2, lwd = 2)
lines(xx, pmax(0, pmin(1, Fn_x - eps)), col = adjustcolor("darkgreen", 0.35), lty = 2, lwd = 2)

# 4) Overlay Gamma CDF
lines(xgrid, pgamma(xgrid, shape = shape, scale = scale),
      col = adjustcolor("tomato", 0.9), lwd = 4)

# 5) KS statistic (naive; valid if params fixed a priori)
ks_naive <- suppressWarnings(stats::ks.test(x, "pgamma", shape = shape, scale = scale))
mtext(sprintf("KS (naive) D = %.3f, p = %.3f", ks_naive$statistic, ks_naive$p.value),
      side = 3, line = 0.5, cex = 0.9)

grid(col = "grey85", lty = "dotted", lwd = 1)
legend("bottomright", bty = "n",
       col = c("darkgreen", adjustcolor("darkgreen", 0.35), "tomato"),
       lwd = c(4, 2, 4), lty = c(1, 2, 1),
       legend = c("Empirical CDF", "DKW ± band (95%)", sprintf("Gamma CDF (shape=%.2f, scale=%.2f)", shape, scale)))


10.4 Maximum Likelihood Estimation (Continuous unknown distribution)

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a model by maximizing the likelihood function. Given a dataset and an assumed statistical model, MLE provides parameter estimates that make the observed data most probable.
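For the Gamma model used below (shape \(k\), rate \(\lambda\)), the quantity being maximized is

\[ \ell(k,\lambda) = nk\log\lambda - n\log\Gamma(k) + (k-1)\sum_{i=1}^{n}\log x_i - \lambda\sum_{i=1}^{n} x_i. \]

For a fixed shape, setting \(\partial\ell/\partial\lambda = 0\) gives the closed form \(\hat\lambda = k/\bar{x}\); no closed form exists for \(k\), which is why the shape is obtained numerically (here via mle() on the log scale).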

Applying MLE to Continuous Data

Using the continuous dataset unknowncontinuous.dat from earlier analysis:

  • Select the distribution type you previously identified.
  • Compute MLE estimates for the continuous parameters using mle().
  • If your chosen distribution has discrete parameters (e.g., n), fix them using the ‘fixed’ option.
  • Manually test values near your earlier estimates—which one yields the highest likelihood?
  • Validate the fit:
    • Plot the distribution function using your estimated parameters.
    • Compare it with the normalized histogram of the data to assess the fit.
## =======================
## MLE for Gamma via stats4::mle
## =======================

# 0) Clean numeric, positive data vector z
z <- if (exists("unknowncontinuous")) {
  if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else unknowncontinuous
} else if (file.exists("unknowncontinuous.dat")) {
  scan("unknowncontinuous.dat", what = numeric(), quiet = TRUE)
} else if (file.exists("unknowncontinuous.csv")) {
  read.csv("unknowncontinuous.csv", header = FALSE)[[1]]
} else stop("Data not found: provide `unknowncontinuous`, or unknowncontinuous.dat/csv in the working dir.")

z <- suppressWarnings(as.numeric(z))
z <- z[is.finite(z) & z > 0]
stopifnot(length(z) > 1)

# 1) Negative log-likelihood (log-parameterization enforces positivity)
nll_gamma <- function(logshape, lograte) {
  shape <- exp(logshape); rate <- exp(lograte)
  -sum(dgamma(z, shape = shape, rate = rate, log = TRUE))
}

# 2) Free MLE (shape & rate)
s0 <- log(2.0)     # start near your earlier shape ~2
r0 <- log(0.35)    # start near your earlier rate ~0.35
if (!requireNamespace("stats4", quietly = TRUE)) stop("Please install 'stats4'.")
library(stats4)

fit_free <- mle(nll_gamma,
                start  = list(logshape = s0, lograte = r0),
                method = "Nelder-Mead")

cf  <- coef(fit_free)
sh  <- exp(cf["logshape"])
rt  <- exp(cf["lograte"])
sc  <- 1/rt
LLf <- as.numeric(logLik(fit_free))
NLLf<- -LLf

cat(sprintf("Free MLE (Gamma): shape = %.4f, rate = %.4f (scale = %.4f);  -logLik = %.2f\n",
            sh, rt, sc, NLLf))
## Free MLE (Gamma): shape = 2.0632, rate = 0.3345 (scale = 2.9892);  -logLik = 2697.05
# SEs & 95% CI (delta method from vcov)
vc <- vcov(fit_free)
se_logshape <- sqrt(vc["logshape","logshape"])
se_lograte  <- sqrt(vc["lograte","lograte"])
se_shape    <- sh * se_logshape
se_rate     <- rt * se_lograte

ci_log <- suppressWarnings(confint(fit_free))  # on log-scale
## Profiling...
ci_shape <- exp(ci_log["logshape", ])
ci_rate  <- exp(ci_log["lograte", ])

cat(sprintf("  shape SE ≈ %.4f;  95%% CI: [%.4f, %.4f]\n", se_shape, ci_shape[1], ci_shape[2]))
##   shape SE ≈ 0.0858;  95% CI: [1.8997, 2.2362]
cat(sprintf("  rate  SE ≈ %.4f;  95%% CI: [%.4f, %.4f]\n\n", se_rate,  ci_rate[1],  ci_rate[2]))
##   rate  SE ≈ 0.0157;  95% CI: [0.3045, 0.3662]
# 3) Fixed-parameter MLEs (examples)
## (a) Fix rate at 0.35, estimate shape
fit_fix_rate <- mle(nll_gamma,
                    start = list(logshape = s0),
                    fixed = list(lograte = log(0.35)),
                    method = "Nelder-Mead")
cf_fr <- coef(fit_fix_rate)
sh_fr <- exp(cf_fr["logshape"])
NLL_fr <- -as.numeric(logLik(fit_fix_rate))
cat(sprintf("Fix rate=0.35 → shapê = %.4f;  -logLik = %.2f\n", sh_fr, NLL_fr))
## Fix rate=0.35 → shapê = 2.1378;  -logLik = 2697.52
## (b) Fix shape at 2, estimate rate
nll_gamma_rate <- function(lograte, logshape_fixed) {
  nll_gamma(logshape = logshape_fixed, lograte = lograte)
}
fit_fix_shape <- mle(nll_gamma_rate,
                     start = list(lograte = r0),
                     fixed = list(logshape_fixed = log(2)),
                     method = "Nelder-Mead")
rt_fs  <- exp(coef(fit_fix_shape)["lograte"])
NLL_fs <- -as.numeric(logLik(fit_fix_shape))
cat(sprintf("Fix shape=2 → ratê = %.4f;  -logLik = %.2f\n\n", rt_fs, NLL_fs))
## Fix shape=2 → ratê = 0.3243;  -logLik = 2697.33
# 4) Manual grid search near earlier estimates (small, fast)
shape_grid <- seq(max(0.5, sh - 0.8), sh + 0.8, length.out = 13)
rate_grid  <- seq(max(0.05, rt - 0.15), rt + 0.15, length.out = 13)

nll_grid <- matrix(NA_real_, length(shape_grid), length(rate_grid),
                   dimnames = list(sprintf("sh=%.3f", shape_grid),
                                   sprintf("rt=%.3f",  rate_grid)))
for (i in seq_along(shape_grid)) {
  for (j in seq_along(rate_grid)) {
    nll_grid[i, j] <- -sum(dgamma(z, shape = shape_grid[i], rate = rate_grid[j], log = TRUE))
  }
}
min_idx <- which(nll_grid == min(nll_grid), arr.ind = TRUE)
sh_grid <- shape_grid[min_idx[1]]
rt_grid <- rate_grid[min_idx[2]]
NLL_grid<- nll_grid[min_idx]
cat(sprintf("Grid search min near: shape ≈ %.4f, rate ≈ %.4f;  -logLik = %.2f\n\n",
            sh_grid, rt_grid, NLL_grid))
## Grid search min near: shape ≈ 2.0632, rate ≈ 0.3345;  -logLik = 2697.05
# 5) Validation plots
op <- par(mfrow = c(1,2)); on.exit(par(op), add = TRUE)

# (A) Histogram + MLE PDF
hist(z, breaks = "FD", freq = FALSE, col = "grey90", border = "white",
     main = "Gamma fit over histogram", xlab = "x", ylab = "Density")
xx <- seq(min(z), max(z), length.out = 600)
lines(xx, dgamma(xx, shape = sh, rate = rt),
      col = adjustcolor("tomato", 0.9), lwd = 3)
rug(z, col = adjustcolor("black", 0.25))
legend("topright", bty = "n",
       col = c(adjustcolor("tomato", 0.9)),
       lwd = 3, legend = sprintf("MLE: shape=%.2f, rate=%.2f", sh, rt))

# (B) ECDF vs Gamma CDF
Fn <- ecdf(z)
plot(Fn, verticals = TRUE, do.points = FALSE, lwd = 3,
     col = adjustcolor("darkgreen", 0.7),
     main = "ECDF (green) vs Gamma CDF (orange)",
     xlab = "x", ylab = "Cumulative probability")
lines(xx, pgamma(xx, shape = sh, rate = rt),
      col = adjustcolor("tomato", 0.9), lwd = 3)
grid(col = "grey85", lty = "dotted")
legend("bottomright", bty = "n",
       col = c(adjustcolor("darkgreen", 0.7), adjustcolor("tomato", 0.9)),
       lwd = 3, legend = c("Empirical CDF", "Gamma CDF (MLE)"))


MLE report for the continuous sample (Gamma model)

We model \(X\) with a Gamma distribution parameterized by shape \(k\) and rate \(\lambda\) (so scale \(\theta = 1/\lambda\)): \[ f(x\mid k,\lambda)=\frac{\lambda^k}{\Gamma(k)}\,x^{k-1}e^{-\lambda x},\quad x>0. \]

Key moments under this parameterization: \[ \mathbb{E}[X]=\frac{k}{\lambda}=k\theta,\qquad \mathrm{Var}(X)=\frac{k}{\lambda^2}=k\theta^2. \]


Free MLE (both parameters estimated)

  • Shape \(\hat{k} = 2.0632\)
  • Rate \(\hat{\lambda} = 0.3345\)
  • Scale \(\hat{\theta} = 2.9892\)
  • Log-likelihood \(\ell(\hat{k},\hat{\lambda}) = -2697.05\) \(\Rightarrow\) \(-\ell\) = 2697.05

Uncertainty (delta-method SEs; profile-likelihood CIs):

  • \(\text{SE}(\hat{k}) \approx 0.0858\), 95% CI: [1.8997, 2.2362]
  • \(\text{SE}(\hat{\lambda}) \approx 0.0157\), 95% CI: [0.3045, 0.3662]

Sensitivity: fixing one parameter

  • Fix \(\lambda=0.35\), estimate \(k\):
    • \(\hat{k} = 2.1378\)
    • \(-\ell = 2697.52\)
  • Fix \(k=2\), estimate \(\lambda\):
    • \(\hat{\lambda} = 0.3243\)
    • \(-\ell = 2697.33\)

Takeaway: both constrained fits yield slightly higher \(-\ell\) than the free MLE, confirming the free MLE \((k,\lambda)=(2.0632,\,0.3345)\) is preferred.


10.5 Grid search (local verification)

A coarse grid search around the free estimates finds its minimum at \(k \approx 2.0632\), \(\lambda \approx 0.3345\), with \(-\ell = 2697.05\).
This matches the free MLE above.


Interpretation

With \((\hat{k},\hat{\lambda}) = (2.0632,\,0.3345)\):

  • Mean \(\approx \hat{k}/\hat{\lambda} \approx 6.17\)
  • Variance \(\approx \hat{k}/\hat{\lambda}^2 \approx 18.4\)

These align with earlier exploratory summaries and the visual fit diagnostics (PDF over histogram, CDF vs ECDF).
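
As a quick numerical sanity check (a minimal sketch, reusing z, sh, and rt from the fitting code above), the moments implied by the Gamma MLE can be compared directly with the sample moments:

# Moments implied by the fitted Gamma versus empirical moments of z
fitted_mean <- as.numeric(sh / rt)      # k / lambda
fitted_var  <- as.numeric(sh / rt^2)    # k / lambda^2

round(c(fitted_mean = fitted_mean, sample_mean = mean(z),
        fitted_var  = fitted_var,  sample_var  = var(z)), 3)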


  • Best fit (Gamma, rate form): \(k=2.0632\) \([1.8997, 2.2362]\), \(\lambda=0.3345\) \([0.3045, 0.3662]\), \(\theta=2.9892\); \(-\ell=2697.05\).
  • Constrained checks: fixing either \(k\) or \(\lambda\) near prior estimates worsens the fit slightly.
  • Conclusion: The Gamma model with freely estimated parameters best fits the continuous data, capturing its dispersion and mild right-skew. The minimum negative log-likelihood is 2697.05 at shape = 2.0632, rate = 0.3345 (scale = 2.9892). Constraining either rate = 0.35 or shape = 2 increases −logLik slightly (to 2697.52 and 2697.33, respectively), so the unrestricted MLE is preferred.

Plot random sample distributions for comparison

# make a clean numeric vector z from whatever unknowncontinuous is
z <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else as.numeric(unknowncontinuous)
z <- z[is.finite(z) & z > 0]  # Gamma needs positives

hist(z, freq = FALSE, breaks = 30,
     col = "grey92", border = "white",
     main = "Sample vs Gamma overlays", xlab = "x", ylab = "Density")

curve(dgamma(x, shape = 1, rate = 0.16198), lwd = 2, col = "red",   add = TRUE)
curve(dgamma(x, shape = 2, rate = 0.32397), lwd = 2, col = "blue",  add = TRUE)
curve(dgamma(x, shape = 3, rate = 0.48595), lwd = 2, col = "green", add = TRUE)  # keep same mean

legend("topright", bty = "n",
       legend = c("shape=1, rate=0.16198",
                  "shape=2, rate=0.32397",
                  "shape=3, rate=0.48595"),
       col = c("red","blue","green"), lwd = 2, lty = 1)

# --- Make a clean numeric vector z ---
z <- if (is.data.frame(unknowncontinuous)) unknowncontinuous[[1]] else as.numeric(unknowncontinuous)
z <- z[is.finite(z) & z > 0]  # Gamma needs positives
stopifnot(length(z) > 1)

# Kernel density of the sample
dens <- density(z)

# Plot sample density
plot(dens,
     lwd = 4,
     col = "dodgerblue",
     main = "Unknown Data Density (blue) vs\nGamma Distribution(s) (red/orange)",
     xlab = "x", ylab = "Density")

# Use the same x-grid as the kernel density
x <- dens$x

# --- Overlay Gamma curves ---
# Example 1: your alternative params (shape=2.14, scale=2.86)
lines(x, dgamma(x, shape = 2.14, scale = 2.86),
      col = adjustcolor("tomato", 0.9), lwd = 4)

# Example 2: free MLE from earlier (shape=2.0632, scale=2.9892)
lines(x, dgamma(x, shape = 2.0632, scale = 2.9892),
      col = adjustcolor("orange3", 0.8), lwd = 3, lty = 2)

# Optional: mean/median lines
abline(v = mean(z),   col = adjustcolor("firebrick", 0.7), lty = 3)
abline(v = median(z), col = adjustcolor("steelblue", 0.7), lty = 3)

# Grid + legend
grid(nx = NA, ny = NULL, col = "lightgray", lty = "dotted", lwd = 1)
legend("topright", bty = "n",
       col = c("dodgerblue", "tomato", "orange3", "firebrick", "steelblue"),
       lwd = c(4,4,3,1,1), lty = c(1,1,2,3,3),
       legend = c("Sample density",
                  "Gamma: shape=2.14, scale=2.86",
                  "Gamma (MLE): shape=2.0632, scale=2.9892",
                  "Mean", "Median"))


10.5.1 MLE for Discrete Data

Based on the previous analysis, we selected the appropriate distribution type (e.g., Negative Binomial, Binomial, Poisson, etc.). We used the mle() function to compute the maximum likelihood estimates of the continuous parameters. If the distribution includes discrete parameters—such as n in the Binomial or Negative Binomial distribution—we fixed these using the fixed argument in mle(). We manually tested a few nearby values for the fixed discrete parameter (e.g., n = 8, 9, 10, 11) based on earlier estimates. The value that yielded the lowest negative log-likelihood was selected as the most plausible. Finally, we plotted the fitted distribution (using the estimated parameters) against the normalized frequency table of the observed data. This visual comparison helped assess how well the model fits the observed distribution.

# Clean integer vector x from whatever 'unknowndiscrete' is
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else as.integer(unknowndiscrete)
x <- x[is.finite(x)]
stopifnot(length(x) > 1, all(x >= 0))
n_obs <- length(x)
L <- min(x)           # left-truncation point (3 in your data)
x_max <- max(x)       # 20 in your data

# quick EDA
cat(sprintf("n=%d  min/median/mean/var/max = %d / %d / %.3f / %.3f / %d\n",
            n_obs, min(x), median(x), mean(x), var(x), max(x)))
## n=1000  min/median/mean/var/max = 3 / 12 / 12.071 / 7.255 / 20
tab <- table(x); emp_p <- tab / n_obs; support <- as.integer(names(tab))

Comparing the density plot with the population Binomial distribution that has the same parameters suggests a good fit.

# NLL for Binomial with fixed n; parameterized by logit(p) for stability
logit <- function(p) log(p/(1-p)); inv_logit <- function(z) 1/(1+exp(-z))

nll_binom <- function(logit_p, n, left_trunc = TRUE, L) {  # n, left_trunc, L are supplied via mle()'s fixed = list(...)
  p <- inv_logit(logit_p)
  if (!left_trunc) {
    return(-sum(dbinom(x, size = n, prob = p, log = TRUE)))
  } else {
    # truncated at L: divide by P(X>=L) for each observation
    den <- 1 - pbinom(L - 1, size = n, prob = p)
    return(-sum(dbinom(x, size = n, prob = p, log = TRUE) - log(den)))
  }
}

# Fit with n fixed at 20 (your earlier analyses used n=20)
n_fix <- 20
p_init <- mean(x)/n_fix
fit_binom_20 <- mle(nll_binom,
                    start  = list(logit_p = logit(p_init)),
                    fixed  = list(n = n_fix, left_trunc = TRUE, L = L),
                    method = "Nelder-Mead")
## Warning in optim(start, f, method = method, hessian = TRUE, ...): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
p_hat_20 <- inv_logit(coef(fit_binom_20)["logit_p"])
nll_20   <- -as.numeric(logLik(fit_binom_20))
cat(sprintf("Binomial (n=%d, left-trunc) MLE: p = %.6f; -logLik = %.2f\n",
            n_fix, p_hat_20, nll_20))
## Binomial (n=20, left-trunc) MLE: p = 0.603550; -logLik = 2464.91

# Try a small grid of n around max(x); keep left-truncation
n_grid <- seq(max(x), max(x)+5)  # e.g., 20:25
res <- sapply(n_grid, function(n_try) {
  p0 <- min(max(mean(x)/n_try, 1e-6), 1-1e-6)
  fit <- mle(nll_binom,
             start = list(logit_p = logit(p0)),
             fixed = list(n = n_try, left_trunc = TRUE, L = L),
             method = "Nelder-Mead")
  c(n = n_try,
    p = inv_logit(coef(fit)["logit_p"]),
    nll = -as.numeric(logLik(fit)))
})
## Warning in optim(start, f, method = method, hessian = TRUE, ...): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
## (this warning is repeated for each of the six values of n in the grid)
res <- t(res); colnames(res) <- c("n","p","-logLik"); print(res, digits = 6)
##       n        p -logLik
## [1,] 20 0.603550 2464.91
## [2,] 21 0.574810 2445.19
## [3,] 22 0.548682 2432.54
## [4,] 23 0.524826 2424.14
## [5,] 24 0.502958 2418.50
## [6,] 25 0.482840 2414.72
best <- res[which.min(res[,3]), , drop=FALSE]
cat(sprintf("Best Binomial (left-trunc) on grid: n=%d, p=%.6f; -logLik=%.2f\n",
            as.integer(best[1]), best[2], best[3]))
## Best Binomial (left-trunc) on grid: n=25, p=0.482840; -logLik=2414.72

# Clean integer data
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else as.integer(unknowndiscrete)
x <- x[is.finite(x)]
stopifnot(length(x) > 1)

# Setup
n_obs <- length(x)
L     <- min(x)       # left-truncation (3)
n_fix <- 20           # Binomial n (use your chosen value)
lam   <- mean(x)      # Poisson MLE
p_hat <- mean(x)/n_fix

# Support and empirical PMF
k_all <- seq(L, max(x))
tab   <- table(factor(x, levels = k_all))
emp_p <- as.numeric(tab) / n_obs

# Truncated Poisson PMF
pois_p    <- dpois(k_all, lam)
pois_tail <- 1 - ppois(L - 1, lam)
pois_tr_p <- ifelse(k_all >= L, pois_p / pois_tail, 0)

# Truncated Binomial PMF
pmf_binom_trunc <- function(k, n, p, L){
  num <- dbinom(k, n, p)
  den <- 1 - pbinom(L - 1, n, p)
  ifelse(k >= L, num/den, 0)
}
binom_tr_p <- pmf_binom_trunc(k_all, n_fix, p_hat, L)

# Offsets: Poisson left, Empirical middle, Binomial right
off_pois <- -0.22
off_emp  <-  0.00
off_bino <-  0.22

# Same bar thickness for all three
bar_lwd <- 8
ylim_max <- max(emp_p, pois_tr_p, binom_tr_p, na.rm = TRUE) * 1.25

plot(NA, xlim = c(min(k_all)-0.6, max(k_all)+0.6), ylim = c(0, ylim_max),
     xlab = "x", ylab = "Probability",
     main = sprintf("Empirical (center) vs Truncated Poisson & Binomial (L=%d)", L))
grid(col = "grey90", lty = "dotted")

# Poisson (left)
segments(k_all + off_pois, 0, k_all + off_pois, pois_tr_p,
         lwd = bar_lwd, col = adjustcolor("tomato", 0.85))

# Empirical (middle)
segments(k_all + off_emp, 0, k_all + off_emp, emp_p,
         lwd = bar_lwd, col = adjustcolor("dodgerblue", 0.85))

# Binomial (right)
segments(k_all + off_bino, 0, k_all + off_bino, binom_tr_p,
         lwd = bar_lwd, col = adjustcolor("darkgreen", 0.85))

legend("topright", bty = "n",
       col = c("tomato","dodgerblue","darkgreen"),
       lwd = bar_lwd, lty = 1,
       legend = c(sprintf("Poisson (trunc): λ̂=%.2f", lam),
                  "Empirical PMF",
                  sprintf("Binomial (trunc): n=%d, p̂=%.3f", n_fix, p_hat)))


lambda_hat_cf <- mean(x)   # exact MLE (no truncation adjustment)
cat(sprintf("Poisson MLE (closed form): lambda = %.4f\n", lambda_hat_cf))
## Poisson MLE (closed form): lambda = 12.0710
# If you want mle() too (same answer):
if (!requireNamespace("stats4", quietly = TRUE)) stop("Install 'stats4'")
library(stats4)
nll_pois <- function(loglambda) {
  lam <- exp(loglambda)
  # untruncated log-likelihood (matches closed-form)
  -sum(dpois(x, lam, log = TRUE))
}
fit_pois <- mle(nll_pois, start = list(loglambda = log(lambda_hat_cf)), method = "Nelder-Mead")
## Warning in optim(start, f, method = method, hessian = TRUE, ...): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize() directly
lambda_hat_mle <- exp(coef(fit_pois)["loglambda"])
cat(sprintf("Poisson via mle(): lambda = %.4f; -logLik = %.2f\n",
            lambda_hat_mle, -as.numeric(logLik(fit_pois))))
## Poisson via mle(): lambda = 12.0710; -logLik = 2465.95
## =========================
## MLE: Beta–Binomial (left-trunc), n fixed
## =========================
# x: integer vector from `unknowndiscrete`
x <- if (is.data.frame(unknowndiscrete)) unknowndiscrete[[1]] else as.integer(unknowndiscrete)
x <- x[is.finite(x)]
stopifnot(length(x) > 1, all(x >= 0))

n_obs <- length(x)
L     <- min(x)                # left-truncation (3 in your data)
n_fix <- 20                    # fixed n (as in your WAIC winner)
if (max(x) > n_fix) stop("Data exceed chosen n; set n_fix <- max(x).")

# ---- helpers ----
logsumexp <- function(v) { m <- max(v); m + log(sum(exp(v - m))) }
log_dbetabinom <- function(k, n, a, b) {
  lchoose(n, k) + lbeta(k + a, n - k + b) - lbeta(a, b)
}

# Negative log-likelihood on (log_alpha, log_beta) to enforce positivity
nll_bb_trunc <- function(log_alpha, log_beta) {
  a <- exp(log_alpha); b <- exp(log_beta)
  # normalizing term for truncation (depends on a,b but not on each xi)
  ks <- L:n_fix
  lt <- logsumexp(vapply(ks, function(k) log_dbetabinom(k, n_fix, a, b), numeric(1)))
  # total log-lik
  ll <- sum(vapply(x, function(xi) log_dbetabinom(xi, n_fix, a, b) - lt, numeric(1)))
  return(-ll)
}

# ---- fit via stats4::mle ----
if (!requireNamespace("stats4", quietly = TRUE)) stop("Please install 'stats4'.")

library(stats4)
# good starting values (from your Bayesian run or MOM): alpha ~ 19, beta ~ 12
start_list <- list(log_alpha = log(19), log_beta = log(12))

fit_nm <- mle(nll_bb_trunc, start = start_list, method = "Nelder-Mead")
coef_nm <- coef(fit_nm)

# (optional) refine with BFGS to get Hessian/SEs
fit_bb <- tryCatch(
  mle(nll_bb_trunc, start = as.list(coef_nm), method = "BFGS"),
  error = function(e) fit_nm
)

# point estimates (back-transform)
co <- coef(fit_bb)
alpha_hat <- exp(co[["log_alpha"]])
beta_hat  <- exp(co[["log_beta"]])
p_hat     <- alpha_hat / (alpha_hat + beta_hat)

cat(sprintf("Beta–Binomial (left-trunc, n=%d) MLE:\n  alpha = %.4f, beta = %.4f,   p = alpha/(alpha+beta) = %.4f\n  -logLik = %.2f\n",
            n_fix, alpha_hat, beta_hat, p_hat, -as.numeric(logLik(fit_bb))))
## Beta–Binomial (left-trunc, n=20) MLE:
##   alpha = 21.2182, beta = 13.9378,   p = alpha/(alpha+beta) = 0.6035
##   -logLik = 2410.88
# approximate SEs via delta method (if Hessian available)
vc <- tryCatch(vcov(fit_bb), error = function(e) NULL)
if (!is.null(vc)) {
  se_log_a <- sqrt(vc["log_alpha","log_alpha"])
  se_log_b <- sqrt(vc["log_beta","log_beta"])
  se_alpha <- alpha_hat * se_log_a
  se_beta  <- beta_hat  * se_log_b
  cat(sprintf("Approx SEs:  se(alpha) ≈ %.3f,  se(beta) ≈ %.3f\n", se_alpha, se_beta))
}
## Approx SEs:  se(alpha) ≈ 2.722,  se(beta) ≈ 1.784
## -------- overlay: empirical vs fitted truncated Beta–Binomial PMF --------
k_all  <- seq(L, n_fix)
tab    <- table(factor(x, levels = k_all))
emp_p  <- as.numeric(tab) / n_obs

# fitted truncated PMF at MLE
pmf_bb_trunc <- function(k, n, alpha, beta, L) {
  num <- exp(log_dbetabinom(k, n, alpha, beta))
  den <- sum(exp(vapply(L:n, function(j) log_dbetabinom(j, n, alpha, beta), numeric(1))))
  ifelse(k >= L, num/den, 0)
}
bb_p <- vapply(k_all, function(k) pmf_bb_trunc(k, n_fix, alpha_hat, beta_hat, L), numeric(1))

# draw three equal-width bars with empirical in the middle (and a blank 3rd to keep spacing)
off_emp <- 0.00; off_bb <- 0.22; off_blank <- -0.22
bar_lwd <- 10
ylim_max <- 1.25 * max(emp_p, bb_p)

plot(NA, xlim = c(min(k_all)-0.6, max(k_all)+0.6), ylim = c(0, ylim_max),
     xlab = "x", ylab = "Probability",
     main = sprintf("Empirical (center) vs Beta–Binomial MLE (n=%d, L=%d)", n_fix, L))
grid(col = "grey90", lty = "dotted")

# (blank left slot just to maintain symmetric spacing)
segments(k_all + off_blank, 0, k_all + off_blank, 0, lwd = bar_lwd, col = NA)

# empirical middle
segments(k_all + off_emp, 0, k_all + off_emp, emp_p,
         lwd = bar_lwd, col = adjustcolor("dodgerblue", 0.85))

# Beta–Binomial right
segments(k_all + off_bb, 0, k_all + off_bb, bb_p,
         lwd = bar_lwd, col = adjustcolor("tomato", 0.85))

legend("topright", bty = "n",
       col = c("dodgerblue","tomato"),
       lwd = bar_lwd, lty = 1,
       legend = c("Empirical PMF",
                  sprintf("Beta–Binomial MLE (α=%.2f, β=%.2f)", alpha_hat, beta_hat)))

  • Poisson MLE (untruncated). The MLE is
    \[ \hat\lambda=\bar x \approx 12.07. \] Conditioning on \(x\ge 3\) penalizes Poisson because it allocates non-negligible mass below 3 and (being unbounded) above 20, neither of which is supported by the data.

  • Binomial (left-truncated). With fixed \(n=20\) and
    \[ \hat p=\bar x/n \approx 0.603, \] the model respects the bounded support and typically outperforms Poisson on this dataset.

  • Beta–Binomial (left-truncated, \(n=20\)) — MLE. Treating \(X_i\mid p_i\sim\mathrm{Bin}(n,p_i)\), \(p_i\sim\mathrm{Beta}(\alpha,\beta)\) and conditioning on \(X\ge 3\), the MLEs are
    \[ \hat\alpha=21.22,\quad \hat\beta=13.94,\quad \hat p=\frac{\hat\alpha}{\hat\alpha+\hat\beta}\approx 0.6035. \] Fit metrics: \(-\log L=2410.88\) \(\Rightarrow\) \(-2\log L=4821.76\),
    \(\mathrm{AIC}\approx 4825.76\), \(\mathrm{BIC}\approx 4835.58\).
    This aligns with the Bayesian comparison (WAIC \(\approx 4828.8\)) and captures extra-binomial dispersion while honoring the \(3\ldots 20\) support.
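
For reference, the information criteria quoted above follow directly from the minimized negative log-likelihood (a minimal sketch using the values reported above: two free parameters and 1000 observations):

# Reproduce AIC/BIC for the left-truncated Beta-Binomial fit
nll_bb <- 2410.88    # minimized negative log-likelihood (from the fit above)
k_par  <- 2          # free parameters: alpha, beta
n_pts  <- 1000       # sample size

c(AIC = 2 * k_par + 2 * nll_bb,
  BIC = k_par * log(n_pts) + 2 * nll_bb)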

Bottom line: Among practical parametric options, the left-truncated Beta–Binomial (\(n=20\)) provides the best combination of fit and parsimony for these data.


11 Expectation-Maximization (EM) Algorithm

In statistics, the Expectation-Maximization (EM) algorithm is an iterative method for finding:

  • Maximum likelihood (ML) estimates, or
  • Maximum a posteriori (MAP) estimates

of parameters in statistical models that depend on unobserved latent variables.

How EM Works

The EM algorithm iteratively alternates between two steps:

  1. Expectation Step (E-step)
    Computes the expected value of the log-likelihood function using current parameter estimates.

  2. Maximization Step (M-step)
    Updates parameters to maximize the expected log-likelihood from the E-step.

The refined estimates are reused in subsequent E-steps until convergence.
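
In symbols, with observed data \(X\), latent variables \(Z\), and parameters \(\theta\), iteration \(t\) computes

\[ Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\!\left[\log L(\theta; X, Z)\right] \quad \text{(E-step)}, \qquad \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)}) \quad \text{(M-step)}. \]

Each iteration is guaranteed not to decrease the observed-data log-likelihood, which is why monitoring the log-likelihood (as in the GMM implementation below) is a standard convergence diagnostic.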


12 Gaussian Mixture Models (GMMs)

A Gaussian Mixture Model (GMM) is a probabilistic model that assumes the data is generated from a mixture of several Gaussian (normal) distributions, each with its own set of unknown parameters. Rather than assigning each data point to a single cluster deterministically, GMMs model the data as arising from a weighted combination of multiple Gaussian components, allowing for soft clustering—each data point has a probability of belonging to each cluster.

Key Properties:

  • Generalizes k-means clustering by incorporating:
    • Centers (means) of the latent Gaussians
    • Covariance structures
  • Provides a richer data representation than hard clustering.
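
Formally, a \(K\)-component GMM models the density as a weighted sum of Gaussian components with mixing weights that sum to one:

\[ f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1. \]

The soft assignments computed in the E-step below are the responsibilities \(q_{ik} \propto \pi_k \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\), normalized over \(k\).
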
#' EM Algorithm for Gaussian Mixture Models
#'
#' @param X Numeric matrix of data (n observations x p dimensions)
#' @param K Integer number of clusters
#' @param max_iter Maximum number of iterations (default: 100)
#' @param epsilon Convergence threshold (default: 1e-6)
#' @param verbose Whether to print progress (default: TRUE)
#'
#' @return List containing:
#'   - mu: Matrix of cluster means (K x p)
#'   - sigma: List of covariance matrices (K x p x p)
#'   - pi: Mixing weights (length K vector)
#'   - clusters: Hard cluster assignments
#'   - q: Soft assignment probabilities (n x K)
#'   - log_likelihood: Log-likelihood at each iteration
em_gmm <- function(X, K, max_iter = 100, epsilon = 1e-6, verbose = TRUE) {
  library(mvtnorm)
  
  # Initialize parameters
  n <- nrow(X)
  p <- ncol(X)
  
  # Random initialization
  mu <- X[sample(n, K), ]
  sigma <- replicate(K, diag(p), simplify = FALSE)
  pi <- rep(1/K, K)
  
  log_likelihood <- numeric(max_iter)
  
  for (iter in 1:max_iter) {
    ### E-step: Compute responsibilities
    q <- matrix(NA, n, K)
    for (k in 1:K) {
      q[, k] <- pi[k] * dmvnorm(X, mu[k, ], sigma[[k]])
    }
    log_likelihood[iter] <- sum(log(rowSums(q)))
    q <- q / rowSums(q)
    
    ### Store previous parameters for convergence check
    prev_mu <- mu
    
    ### M-step: Update parameters
    N_k <- colSums(q)
    
    for (k in 1:K) {
      # Update means
      mu[k, ] <- colSums(q[, k] * X) / N_k[k]
      
      # Update covariances
      X_centered <- sweep(X, 2, mu[k, ], "-")
      sigma[[k]] <- t(X_centered) %*% (X_centered * q[, k]) / N_k[k]
      
      # Update mixing weights
      pi[k] <- N_k[k] / n
    }
    
    ### Check convergence
    param_change <- sum((mu - prev_mu)^2)
    if (verbose) {
      cat(sprintf("Iter %d: log-likelihood = %.3f, param change = %.6f\n",
                  iter, log_likelihood[iter], param_change))
    }
    if (param_change < epsilon) break
  }
  
  # Hard cluster assignments
  clusters <- apply(q, 1, which.max)
  
  return(list(
    mu = mu,
    sigma = sigma,
    pi = pi,
    clusters = clusters,
    q = q,
    log_likelihood = log_likelihood[1:iter],
    n_iter = iter
  ))
}
# Example usage
set.seed(123)
library(mvtnorm)

# Generate synthetic data
mu1 <- c(1, 1)
mu2 <- c(4, 5)
sigma1 <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)
sigma2 <- matrix(c(1, -0.5, -0.5, 1), nrow = 2)

n <- 300
X <- rbind(
  rmvnorm(n/2, mean = mu1, sigma = sigma1),
  rmvnorm(n/2, mean = mu2, sigma = sigma2)
)

# Run EM algorithm
gmm_fit <- em_gmm(X, K = 2)
## Iter 1: log-likelihood = -2942.438, param change = 11.628090
## Iter 2: log-likelihood = -1099.762, param change = 0.124261
## Iter 3: log-likelihood = -1068.201, param change = 0.150512
## Iter 4: log-likelihood = -1047.249, param change = 0.129163
## Iter 5: log-likelihood = -1032.548, param change = 0.104206
## Iter 6: log-likelihood = -1018.875, param change = 0.081722
## Iter 7: log-likelihood = -1003.555, param change = 0.037924
## Iter 8: log-likelihood = -993.298, param change = 0.009199
## Iter 9: log-likelihood = -989.571, param change = 0.002788
## Iter 10: log-likelihood = -988.096, param change = 0.000963
## Iter 11: log-likelihood = -987.562, param change = 0.000270
## Iter 12: log-likelihood = -987.411, param change = 0.000062
## Iter 13: log-likelihood = -987.377, param change = 0.000013
## Iter 14: log-likelihood = -987.370, param change = 0.000002
## Iter 15: log-likelihood = -987.368, param change = 0.000000
# Plot results
library(ggplot2)
ggplot(data.frame(X, Cluster = factor(gmm_fit$clusters)), 
       aes(x = X1, y = X2, color = Cluster)) +
  geom_point(alpha = 0.7) +
  stat_ellipse(aes(fill = Cluster), geom = "polygon", alpha = 0.2, level = 0.95) +
  labs(title = "GMM Clustering (EM Algorithm)", 
       subtitle = "95% confidence ellipses") +
  theme_minimal()


13 Multivariate Normal Distribution

The multivariate normal distribution generalizes the one-dimensional normal distribution to higher dimensions. It describes sets of correlated real-valued random variables that cluster around a mean vector.

Definition: A random vector \(\mathbf{X} = (X_1, ..., X_p)^T\) follows a multivariate normal distribution with mean vector \(\boldsymbol{\mu} \in \mathbb{R}^p\) and covariance matrix \(\boldsymbol{\Sigma} \in \mathbb{R}^{p \times p}\) (positive semi-definite), denoted as:

\[ \mathbf{X} \sim \mathcal{N}_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]

The probability density function (PDF) is given by:

\[ f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) \]

Key Properties

  • Mean: \(\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}\)
  • Covariance: \(\text{Cov}(\mathbf{X}) = \boldsymbol{\Sigma}\)
  • Any linear combination of components of \(\mathbf{X}\) is also normally distributed.
  • Marginal distributions and conditional distributions are also multivariate normal.

Applications

  • Modeling correlated continuous variables
  • Principal Component Analysis (PCA)
  • Gaussian Mixture Models (GMMs)
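
As a quick numerical illustration of the density formula (a minimal sketch assuming the mvtnorm package; the evaluation point and parameters here are arbitrary), the closed-form expression can be checked against mvtnorm::dmvnorm():

library(mvtnorm)

mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)
x0    <- c(0.5, 1.5)

# Evaluate the closed-form MVN density at x0
p_dim    <- length(mu)
d        <- x0 - mu
f_manual <- exp(-0.5 * t(d) %*% solve(Sigma) %*% d) /
  ((2 * pi)^(p_dim / 2) * sqrt(det(Sigma)))

# Compare with the packaged implementation (the two numbers should agree)
c(manual = as.numeric(f_manual), dmvnorm = dmvnorm(x0, mean = mu, sigma = Sigma))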

\[ \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \]

where:

  • \(\boldsymbol{\mu} = (\mu_1, ..., \mu_p)^T\) is the mean vector
  • \(\Sigma\) is the \(p \times p\) covariance matrix (positive semi-definite)

Sampling Algorithm To generate samples \(\mathbf{x}\) from \(\mathcal{N}(\boldsymbol{\mu}, \Sigma)\):

  1. Matrix Decomposition: Find matrix \(A\) such that \(AA^T = \Sigma\) (typically using Cholesky decomposition).

  2. Generate Standard Normals: Create vector \(\mathbf{z} = (z_1, ..., z_p)^T\) where \(z_i \overset{\text{iid}}{\sim} \mathcal{N}(0,1)\).

  3. Transform: Compute \(\mathbf{x} = \boldsymbol{\mu} + A\mathbf{z}\).
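
A minimal sketch of this algorithm in base R (illustrative values; note that chol() returns an upper-triangular factor R with t(R) %*% R = Sigma, so A = t(R)):

set.seed(42)

mu    <- c(1, 2)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2, 2)

A <- t(chol(Sigma))                       # lower-triangular factor: A %*% t(A) = Sigma
z <- matrix(rnorm(2 * 5000), nrow = 2)    # iid standard normal draws (p x n)
samples <- t(mu + A %*% z)                # each column transformed: x = mu + A z

colMeans(samples)   # should be close to mu
cov(samples)        # should be close to Sigma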

Key Properties

  • Marginal Distributions: All subsets of variables are normally distributed
  • Conditional Distributions: Remain normal given other variables
  • Affine Transformations: Linear transformations preserve normality
# Covariance matrix, default is diag(ncol(x))
(Sigma <- matrix(c(1.0, 0.8, 0.8, 1.0), 2, 2))
##      [,1] [,2]
## [1,]  1.0  0.8
## [2,]  0.8  1.0
# Install required packages (if not already installed)
for (pkg in c("hexbin", "ggplot2", "mvtnorm")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}

# Multivariate Normal Distribution Sampling
library(mvtnorm)  # For multivariate normal functions
library(ggplot2)   # For visualization

## Define parameters
# Covariance matrix (positive definite)
Sigma <- matrix(c(1,   0.8,
                  0.8, 1), 
                nrow = 2, byrow = TRUE)

# Mean vector
mu <- c(1, 2)

## Generate samples
set.seed(123)  # For reproducibility
rmvnorm_data <- rmvnorm(
  n = 1e4,            # Number of samples
  mean = mu,          # Mean vector
  sigma = Sigma       # Covariance matrix
)

# Add descriptive column names
colnames(rmvnorm_data) <- c("x", "y")

## Inspect data
cat("First 6 observations:\n")
## First 6 observations:
print(head(rmvnorm_data))
##              x        y
## [1,] 0.3957568 1.543471
## [2,] 2.4256834 2.760140
## [3,] 1.8826388 3.591820
## [4,] 0.8465034 1.074623
## [5,] 0.1863540 1.294218
## [6,] 2.2557657 2.869253
cat("\nSummary statistics:\n")
## 
## Summary statistics:
print(summary(rmvnorm_data))
##        x                 y         
##  Min.   :-3.3261   Min.   :-1.873  
##  1st Qu.: 0.3283   1st Qu.: 1.303  
##  Median : 0.9899   Median : 1.987  
##  Mean   : 0.9948   Mean   : 1.990  
##  3rd Qu.: 1.6687   3rd Qu.: 2.657  
##  Max.   : 5.0345   Max.   : 5.883
## Visualize the distribution
ggplot(data.frame(rmvnorm_data), aes(x = x, y = y)) +
  geom_hex(bins = 50) +  # Hexbin plot for density visualization
  geom_point(alpha = 0.1, size = 0.5) +  # Overlay individual points
  stat_ellipse(level = 0.95, color = "red", linewidth = 1) +  # 95% confidence ellipse
  labs(title = "Bivariate Normal Distribution",
       subtitle = sprintf("μ = (%s), ρ = %.2f", 
                          paste(mu, collapse = ", "),
                          cov2cor(Sigma)[1,2]),
       x = "Variable X",
       y = "Variable Y") +
  theme_minimal()

# means and covariance of x and y
# E(x) = 1.0; E(y) = 2.0
colMeans(rmvnorm_data)
##         x         y 
## 0.9948234 1.9897770
# cov(x, y) = 0.8
var(rmvnorm_data)
##           x         y
## x 1.0042236 0.8010468
## y 0.8010468 0.9975851

13.0.1 Independent and Identically Distributed Random Variables (iid)

Key Assumptions:

  • Independent: The outcome of one observation does not affect another.
  • Identically Distributed: All observations are drawn from the same probability distribution.
  • Often assumed in MLE to simplify the joint likelihood function into a product of marginal likelihoods.
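
Under the iid assumption the joint likelihood factorizes, so the log-likelihood becomes a sum:

\[ L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta), \qquad \ell(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta). \]

This is exactly the form used by the negative log-likelihood functions earlier in this section (e.g., nll_gamma and nll_pois), which sum the log-densities over the observations.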

Create a data set D with samples from the uniform distribution on the unit disc.

## Create a dataset D of samples from the Uniform unit disc using gibbs_sampler()

set.seed(1)
n <- 10000

# Full-conditionals for the unit disc: 
# x | y ~ Unif(-sqrt(1 - y^2), +sqrt(1 - y^2))
# y | x ~ Unif(-sqrt(1 - x^2), +sqrt(1 - x^2))
sqrt_pos <- function(z) sqrt(pmax(z, 0))

x_given_y <- function(state) {
  y <- state[2]
  runif(1, -sqrt_pos(1 - y^2),  sqrt_pos(1 - y^2))
}

y_given_x <- function(state) {
  x <- state[1]
  runif(1, -sqrt_pos(1 - x^2),  sqrt_pos(1 - x^2))
}

conds <- list(x_given_y, y_given_x)

# Generate samples (burn-in to approach stationarity)
D <- as.data.frame(
  gibbs_sampler(
    n = n,
    conditionals = conds,
    init = c(0, 0),    # feasible start
    burn_in = 1000,
    thin = 1,
    plot = TRUE        # set to FALSE to skip the scatter plot
  )
)

names(D) <- c("x", "y")

# Sanity check: all points are within the unit disc (allow tiny numeric tolerance)
stopifnot(all(D$x^2 + D$y^2 <= 1 + 1e-10))
# show dimensions of data matrix
dim(D)
## [1] 10000     2

Testing Whether X and Y Have the Same Distribution

We want to test computationally whether or not two random variables, X and Y, have the same distribution.

Step 1: Visual Comparison

Start by plotting histograms or kernel density estimates (KDE) of both variables:

  • If they appear similar in shape, they might follow the same distribution.
  • However, visual inspection is not conclusive.

Step 2: Statistical Tests

Use formal statistical tests to compare the distributions of X and Y:
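
For example, a two-sample Kolmogorov-Smirnov test can be applied directly (a minimal sketch reusing the Gibbs-sampled data D from above; because the draws are serially correlated, the p-value should be read as a rough guide rather than an exact test):

# Two-sample KS test: H0 is that x and y come from the same distribution
ks.test(D[, "x"], D[, "y"])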

plot(density(D[,"x"]),
     main = "Density Plot",
     col = "tomato",
     type = "l",
     lwd = 4,
     cex = 1)

points(density(D[,"y"]),
       col = "dodgerblue",
       type = "l",
       lwd = 4,
       cex = 1)

abline(v = c(mean(D[,"x"]), mean(D[,"y"])), 
       col = c("tomato", "dodgerblue"), 
       lty = 3, 
       lwd = 2)

grid(nx = NULL, ny = NULL,
     col = "gray",
     lty = "dotted",
     lwd = 1,
     equilogs = TRUE)

legend("topleft", c("X", "Y"),
       col = c("tomato", "dodgerblue"),
       lty = 1,
       lwd = 4)

# Quantile-Quantile Plots
qqplot(D[,"x"], D[,"y"],
       col = "tomato",
       main = "Quantile-Quantile Plot")

grid(nx = NULL, ny = NULL,
     col = "lightgray",
     lty = "dotted",
     lwd = 1,
     equilogs = TRUE)

\(P(X)\): the probability that \(X\) falls in a stripe centered at \(x = 0\) that is 0.4 wide

(px <- sum(abs(D[,"x"] - 0) < 0.2) / n)
## [1] 0.2581

\(P(Y)\): the probability that \(Y\) falls in a stripe centered at \(y = 0\) that is 0.4 wide

(py <- sum(abs(D[,"y"] - 0) < 0.2) / n)
## [1] 0.2477

\(P(X \cap Y)\): the probability that \((X, Y)\) falls in a square of side 0.4 centered at \((0, 0)\) (i.e., the intersection of the two stripes above)

# their joint probability
(pxy <- sum((abs(D[,"x"] - 0) < 0.2) & (abs(D[,"y"] - 0) < 0.2)) / n)
## [1] 0.0503

\(P(X) * P(Y)\)

# product of their probabilities
px * py
## [1] 0.06393137

\(P(X \cap Y) \overset{?}{=} P(X)\,P(Y)\)

(pxy == px * py)
## [1] FALSE

The covariance should be approximately zero: for the uniform distribution on the unit disc, \(X\) and \(Y\) are uncorrelated even though they are not independent (as the comparison above shows, \(P(X \cap Y) \ne P(X)\,P(Y)\)).

round(cov(D[,"x"], D[,"y"]), digits = 3)
## [1] 0.001

Create a data set G with samples from the two-dimensional normal distribution

rmvnorm() generates random draws from the multivariate normal distribution with the given mean vector (default rep(0, length = ncol(x))) and covariance matrix sigma (default diag(ncol(x))).

# Covariance matrix, default is diag(ncol(x))
(Sigma <- matrix(c(1.0, 0, 0, 1.0), 2, 2))
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
# Draw 10,000 samples from the bivariate standard normal (mean c(0, 0), covariance Sigma)
G <- rmvnorm(n = 1e4, mean = c(0,0), sigma = Sigma)
colnames(G) <- c("x", "y") #change variable names to x and y
# display head of the data
head(G)
##               x            y
## [1,] -1.4115219  1.083869657
## [2,]  1.1702224  0.294754540
## [3,] -0.5544277 -0.403440689
## [4,] -1.2681232 -0.009138438
## [5,]  0.3417593  0.394039668
## [6,] -0.7629955  1.490328015

Multivariate plot of x vs. y

plot(G,
     xlim = c(-4, 4),
     ylim = c(-4, 4),
     pch = 16,
     cex = .4,
     col = adjustcolor("tomato", alpha = 0.5),
     asp = 1,
     main = "Scatterplot x vs. y")

abline(h = colMeans(G)[2], 
       v = colMeans(G)[1], 
       col = "black", 
       lty = 3)

grid(nx = NULL, 
     ny = NULL, 
     col = "gray",
     lwd = 1,
     lty = "dotted", 
     equilogs = TRUE)


Test whether or not x and y have the same distribution.

plot(density(G[,"x"]),
     main = " X and Y Density Plots",
     col = "tomato",
     type = "l",
     lwd = 4)

points(density(G[,"y"]),
       col = "dodgerblue",
       type = "l",
       lwd = 4)

grid(nx = NULL, 
     ny = NULL, 
     col = "gray",
     lwd = 1,
     lty = "dotted", 
     equilogs = TRUE)

legend("topleft", c("X", "Y"),
       col = c("tomato", "dodgerblue"),
       lty = 1,
       lwd = 4
)

# Q-Q Plot: Comparing Distributions of 'x' and 'y'
qqplot(G[, "x"], G[, "y"],
       main = "Q-Q Plot: Comparing Distributions of x and y",
       xlab = "population Quantiles (x)",
       ylab = "Sample Quantiles (y)",
       col = "steelblue",        # Point color
       pch = 19,                 # Solid circles for points
       cex = 0.8)                # Slightly smaller point size

# Add a reference line (y = x) to assess distribution equality
abline(a = 0, b = 1, 
       col = "red", 
       lty = 2, 
       lwd = 1.5)

# Add a subtle grid for better readability
grid(nx = NULL, ny = NULL,
     col = "lightgray", 
     lty = "dotted",
     lwd = par("lwd"),           # Match default line width
     equilogs = TRUE)

# Optional: Add legend for clarity
legend("topleft", 
       legend = c("Data Points", "Reference (y = x)"),
       col = c("steelblue", "red"),
       pch = c(19, NA),         # NA for line in legend
       lty = c(NA, 2),           # NA for points in legend
       bty = "n")                # No legend box

The Box-Muller transform converts two independent Uniform(0, 1) draws \(u_1, u_2\) into two independent standard normal draws: with \(r = \sqrt{-2\ln u_1}\) and \(\theta = 2\pi u_2\), the pair \((r\cos\theta,\; r\sin\theta)\) is distributed as two independent \(\mathcal{N}(0,1)\) variables.

box_muller <- function(n, visualize = TRUE, verify = TRUE) {
  #' Generate Standard Normal Variables using Box-Muller Transform
  #'
  #' @param n Number of sample pairs to generate (will return 2n values)
  #' @param visualize Logical, whether to show diagnostic plots (default TRUE)
  #' @param verify Logical, whether to perform normality checks (default TRUE)
  #' @return Vector of 2n standard normal random variables
  #' 
  #' Example usage:
  #' normal_vars <- box_muller(10000)  # Generates 20,000 values
  #' normal_vars <- box_muller(5000, visualize = FALSE)  # Quiet      #' generation
  
  # Input validation
  if (!is.numeric(n) || n <= 0 || n != floor(n)) {
    stop("n must be a positive integer")
  }
  
  # Generate uniform random variables
  U1 <- runif(n, min = 0, max = 1)
  U2 <- runif(n, min = 0, max = 1)
  
  # Apply Box-Muller transform
  X1 <- sqrt(-2 * log(U1)) * cos(2 * pi * U2)
  X2 <- sqrt(-2 * log(U1)) * sin(2 * pi * U2)
  
  # Combine results
  X <- c(X1, X2)
  
  # Visualization and verification
  if (visualize || verify) {
    old_par <- par(no.readonly = TRUE)
    on.exit(par(old_par))
    
    if (visualize) {
      par(mfrow = c(1, 2))
      
      # Q-Q plot
      qqnorm(X, main = "Normal Q-Q Plot", 
             col = rgb(0, 0.4, 0.8, 0.5))
      qqline(X, col = "red", lwd = 2)
      
      # Histogram with normal curve
      hist(X, probability = TRUE, breaks = 30,
           main = "Density Comparison",
           col = "lightblue", xlab = "Value")
      curve(dnorm(x), add = TRUE, col = "red", lwd = 2)
    }
    
    if (verify) {
      test_n <- min(length(X), 5000)  # Shapiro-Wilk max is 5000
      shapiro_test <- shapiro.test(sample(X, test_n))
      
      cat("\n=== Verification Results ===\n")
      cat("Generated", length(X), "values\n")
      cat("Mean:", round(mean(X), 5), "\n")
      cat("SD:", round(sd(X), 5), "\n")
      cat("Shapiro-Wilk p-value:", format.pval(shapiro_test$p.value), "\n")
    }
  }
  
  return(X)
}

Show that X1 and X2 are standard normal

set.seed(123)
normal_vars <- box_muller(5000, visualize = TRUE)

## 
## === Verification Results ===
## Generated 10000 values
## Mean: -0.00433 
## SD: 1.00493 
## Shapiro-Wilk p-value: 0.89029

Diagnostic Interpretation

Normal Distribution Indicators:

  • Linear Alignment: Points follow the red reference line closely
  • Tail Behavior: No systematic deviations at the extremes
  • Random Dispersion: Points scatter evenly around the line

Density Plot Interpretation

Each feature, its visual cue, and its statistical meaning:

  • Curve match: the histogram bars closely follow the reference normal curve, indicating the sample's shape matches the \(\mathcal{N}(0,1)\) density.
  • Symmetry: roughly equal mass on both sides of zero, so mean ≈ median ≈ 0.
  • Tail behavior: a smooth, gradual taper beyond ±3 standard deviations, with no artificial truncation of extreme values.

Shapiro-Wilk Normality Test Interpretation

Result: p-value ≈ 0.890 (α = 0.05), as reported in the verification output above.

Decision Rule

\[\text{If } p \geq \alpha \rightarrow \text{Fail to reject normality}\]

\[\text{If } p < \alpha \rightarrow \text{Reject normality}\]

Since \(p \approx 0.890 \geq 0.05\), we fail to reject the null hypothesis that the Box-Muller sample is normally distributed, consistent with the Q-Q plot and density diagnostics above.
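
As a programmatic version of this rule (a minimal sketch, reusing normal_vars from the Box-Muller run above; the sub-sample is needed because shapiro.test() accepts at most 5000 observations):

# Apply the decision rule to the Box-Muller sample
p_val <- shapiro.test(sample(normal_vars, 5000))$p.value
if (p_val >= 0.05) "Fail to reject normality" else "Reject normality"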