Statistical Distributions for Simulation in R

Author
Affiliations

Dr. Md. Nasif Hossain

Department of Statistics, Pabna University of Science and Technology, Pabna, Bangladesh

Published

May 7, 2026

PUST Campus

Loading Package in R

library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggeffects)

Reproducibility and set.seed()

# Without a seed: different results every time
sample(1:100, 5)
[1] 78 39 58 66 88
sample(1:100, 5)
[1]  8 15 92 50 12
set.seed(2026)   # reset to the same seed
sample(1:100, 5) # same result as above
[1] 93 97 38 45 91
# Recommended practice: one seed at the top of the script
set.seed(2026)

# From this point on, all random calls are reproducible
x <- rnorm(100)
y <- sample(letters, 10)
z <- rbinom(50, size = 1, prob = 0.3)

head(x, 5); y; head(z, 5)
[1]  0.52058907 -1.07969076  0.13923812 -0.08474878 -0.66663962
 [1] "x" "c" "k" "v" "p" "f" "u" "a" "d" "m"
[1] 1 0 0 0 0

Simulating from Statistical Distributions

R distribution function naming convention

Prefix Function Example
r Random samples rnorm(n, mean, sd)
d Density (PDF/PMF) dnorm(x, mean, sd)
p Cumulative probability (CDF) pnorm(q, mean, sd)
q Quantile (inverse CDF) qnorm(p, mean, sd)

The following table summarises which distributions are most useful for linguistic data:

Distribution guide for linguistic data simulation
Distribution R function Typical linguistic use
Normal rnorm() Pitch, formants, log-transformed RT, log word frequency
Log-normal exp(rnorm()) Raw reaction times, response latencies
Binomial rbinom() Accuracy (correct/incorrect), binary grammaticality judgements
Poisson rpois() Error counts, disfluency counts, word occurrences in short texts
Negative binomial rnbinom() Overdispersed counts: word frequencies, collocation counts
Uniform runif() Ages, random positions, presentation order
Beta rbeta() Proportions bounded in (0, 1): accent ratings, similarity scores
Gamma rgamma() Skewed positive continuous: inter-pausal intervals, vowel duration

Normal Distribution

set.seed(2026)

# Simulate log-transformed reaction times
# Mean ~6.4 ≈ exp(6.4) ≈ 600 ms; SD = 0.3 on the log scale
log_rt <- rnorm(n = 500, mean = 6.4, sd = 0.3)
rt_ms  <- exp(log_rt)   # back-transform to milliseconds

data.frame(log_rt = log_rt, rt_ms = rt_ms) |>
  ggplot(aes(x = rt_ms)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated reaction times (log-normal distribution)",
       subtitle = "n = 500 | Mean ≈ 600 ms",
       x = "Reaction time (ms)", y = "Density")

cat(sprintf("Mean RT: %.1f ms | SD: %.1f ms | Range: %.0f–%.0f ms\n",
            mean(rt_ms), sd(rt_ms), min(rt_ms), max(rt_ms)))
Mean RT: 637.9 ms | SD: 198.2 ms | Range: 280–1545 ms

Binomial Distribution

set.seed(2026)

# Simulate accuracy in a lexical decision task
# 100 participants, 80 trials each, average accuracy 85%
n_participants <- 100
n_trials       <- 80
accuracy_prob  <- 0.85

n_correct <- rbinom(n = n_participants, size = n_trials, prob = accuracy_prob)
accuracy  <- n_correct / n_trials

data.frame(accuracy = accuracy) |>
  ggplot(aes(x = accuracy)) +
  geom_histogram(binwidth = 0.02, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean(accuracy), color = "firebrick",
             linetype = "dashed", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated accuracy in a lexical decision task",
       subtitle = sprintf("n = %d | %d trials | P(correct) = %.2f | Mean = %.3f",
                          n_participants, n_trials, accuracy_prob, mean(accuracy)),
       x = "Accuracy", y = "Count")

Poisson Distribution

set.seed(2026)

# Simulate disfluency counts per minute for 200 speakers
corrections <- rpois(n = 200, lambda = 1.8)

data.frame(corrections = corrections) |>
  ggplot(aes(x = corrections)) +
  geom_bar(fill = "steelblue", color = "white", alpha = 0.8) +
  theme_bw() +
  labs(title = "Simulated disfluency counts per minute (Poisson)",
       subtitle = sprintf("n = 200 | λ = 1.8 | Mean = %.2f | Variance = %.2f",
                          mean(corrections), var(corrections)),
       x = "Disfluencies per minute", y = "Count")

Negative Binomial Distribution

set.seed(2026)

# Compare Poisson vs. Negative Binomial: same mean, different variance
pois_counts <- rpois(n = 500, lambda = 3.0)
nb_counts   <- rnbinom(n = 500, mu = 3.0, size = 0.8)  # size controls overdispersion

cat(sprintf("Poisson:    Mean = %.2f | Variance = %.2f\n",
            mean(pois_counts), var(pois_counts)))
Poisson:    Mean = 2.90 | Variance = 2.73
cat(sprintf("Neg. Binom: Mean = %.2f | Variance = %.2f\n",
            mean(nb_counts), var(nb_counts)))
Neg. Binom: Mean = 2.79 | Variance = 13.41
data.frame(
  count = c(pois_counts, nb_counts),
  dist  = rep(c("Poisson (λ=3)", "Neg. Binomial (μ=3, size=0.8)"), each = 500)
) |>
  ggplot(aes(x = count, fill = dist)) +
  geom_bar(position = "dodge", alpha = 0.8, color = "white") +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() + theme(legend.position = "top") +
  labs(title = "Poisson vs. Negative Binomial: same mean, very different variance",
       x = "Count", y = "Frequency", fill = "")

Beta Distribution

set.seed(2026)

# Simulate accent ratings (0 = native-like, 1 = heavily accented)
# Beta(2, 5): right-skewed, most ratings near the native-like end
# Beta(5, 2): left-skewed, most ratings toward heavily-accented end
# Beta(2, 2): symmetric, centred at 0.5

ratings_native  <- rbeta(200, shape1 = 2, shape2 = 5)
ratings_l2      <- rbeta(200, shape1 = 5, shape2 = 2)

data.frame(
  rating = c(ratings_native, ratings_l2),
  group  = rep(c("L1 speakers", "Advanced L2 speakers"), each = 200)
) |>
  ggplot(aes(x = rating, fill = group)) +
  geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.6, color = "white") +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() + theme(legend.position = "top") +
  labs(title = "Simulated accent ratings (Beta distribution)",
       subtitle = "L1: Beta(2,5) — mostly native-like | L2: Beta(5,2) — more accented",
       x = "Accent rating (0 = native-like, 1 = heavily accented)", y = "Count",
       fill = "")

Gamma Distribution

set.seed(2026)

# Simulate inter-pausal intervals in spontaneous speech (ms)
# shape = 2, rate = 0.01 → mean = shape/rate = 200 ms
pauses <- rgamma(n = 500, shape = 2, rate = 0.01)

data.frame(pause_ms = pauses) |>
  ggplot(aes(x = pause_ms)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated inter-pausal intervals (Gamma distribution)",
       subtitle = sprintf("n = 500 | Mean = %.0f ms | SD = %.0f ms",
                          mean(pauses), sd(pauses)),
       x = "Inter-pausal interval (ms)", y = "Density")

Basic Simulation models in R

Simulate data

set.seed(123)

n <- 1000   # sample size

# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10)         # continuous predictor (Normal)
x2 <- rbinom(n, size = 1, prob = 0.4)      # binary predictor (Binomial)
x3 <- rpois(n, lambda = 3)                 # count predictor (Poisson)

income <- factor(
  sample(c("Low", "Medium", "High"),
         n,
         replace = TRUE,
         prob = c(0.15, 0.35, 0.50)),
  levels = c("Low", "Medium", "High")
  )

income<-relevel(income, ref = "Low")

df=data.frame(x1,x2,x3,income)

Linear Regression (Normal outcome)

# Simulate outcome (Normal)
y_normal <- 5 + 0.3*x1 - 2*x2 + 1.5*x3 + rnorm(n, 0, 5)

df <- data.frame(
  y_normal = y_normal,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit linear regression
model_lm <- lm(y_normal ~ x1 + x2 + x3+income,
               data=df)

summary(model_lm)

Call:
lm(formula = y_normal ~ x1 + x2 + x3 + income, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2083  -3.2377  -0.0841   3.2627  16.9846 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.23824    0.95819   5.467 5.79e-08 ***
x1            0.28551    0.01603  17.807  < 2e-16 ***
x2           -1.67738    0.32288  -5.195 2.48e-07 ***
x3            1.53071    0.08975  17.055  < 2e-16 ***
incomeMedium -0.06113    0.51822  -0.118    0.906    
incomeHigh    0.56059    0.49851   1.125    0.261    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.009 on 994 degrees of freedom
Multiple R-squared:  0.4069,    Adjusted R-squared:  0.4039 
F-statistic: 136.4 on 5 and 994 DF,  p-value: < 2.2e-16
levels(income)
[1] "Low"    "Medium" "High"  

Logistic Regression (Binomial outcome)

# Generate probability using logistic function
logit_p <- -2 + 0.05*x1 - 1*x2 + 0.3*x3
p <- 1 / (1 + exp(-logit_p))

# Simulate binary outcome
y_binomial <- rbinom(n, size = 1, prob = p)

df <- data.frame(
  y_binomial = y_binomial,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit logistic regression
model_logit <- glm(y_binomial ~ x1 + x2 + x3+income,
                   data = df,
                   family = binomial)

summary(model_logit)

Call:
glm(formula = y_binomial ~ x1 + x2 + x3 + income, family = binomial, 
    data = df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.019479   0.463241  -6.518 7.12e-11 ***
x1            0.065382   0.008084   8.088 6.05e-16 ***
x2           -0.828648   0.150819  -5.494 3.92e-08 ***
x3            0.293158   0.046807   6.263 3.77e-10 ***
incomeMedium  0.234193   0.240161   0.975    0.329    
incomeHigh    0.220943   0.230439   0.959    0.338    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1223.4  on 999  degrees of freedom
Residual deviance: 1073.7  on 994  degrees of freedom
AIC: 1085.7

Number of Fisher Scoring iterations: 4

Prediction of logistic regression

# Prediction of x1 only
pred_1 <- ggpredict(model_logit, terms = c("x1 [all]"))
# Plot for predicted value 
plot(pred_1)

# Prediction of x1 by income 
pred_2 <- ggpredict(model_logit, terms = c("x1 [all]", "income"))
# Plot for predicted value

plot(pred_2)

Poisson Regression (Count outcome)

# Generate lambda (mean count)
lambda <- exp(1 + 0.02*x1 - 0.5*x2 + 0.2*x3)

# Simulate count outcome
y_poisson <- rpois(n, lambda = lambda)

df <- data.frame(
  y_poisson = y_poisson,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit Poisson regression
model_pois <- glm(y_poisson ~ x1 + x2 + x3+income, 
                  data=df,
                  family = poisson)

summary(model_pois)

Call:
glm(formula = y_poisson ~ x1 + x2 + x3 + income, family = poisson, 
    data = df)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0343374  0.0572523  18.066   <2e-16 ***
x1            0.0198665  0.0009362  21.220   <2e-16 ***
x2           -0.5216223  0.0199425 -26.156   <2e-16 ***
x3            0.1974561  0.0046577  42.393   <2e-16 ***
incomeMedium -0.0297095  0.0292233  -1.017    0.309    
incomeHigh   -0.0174807  0.0280185  -0.624    0.533    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4084.39  on 999  degrees of freedom
Residual deviance:  937.71  on 994  degrees of freedom
AIC: 5147.5

Number of Fisher Scoring iterations: 4

Negative Binomial Model

set.seed(2026)

n <- 1000   # sample size

# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10)         # continuous predictor
x2 <- rbinom(n, size = 1, prob = 0.4)      # binary predictor
x3 <- rnbinom(n, mu = 3.0, size = 0.8)  # size                # count predictor

# Optional categorical variable
income <- factor(sample(c("Low", "Middle", "High"),
                        n,
                        replace = TRUE))

# Linear predictor
eta <- 1 + 0.02*x1 - 0.5*x2 + 0.2*x3

# Mean parameter
mu <- exp(eta)

# Dispersion parameter (theta/size)
theta <- 2

# Simulate Negative Binomial outcome
y_nb <- rnbinom(n,
                size = theta,
                mu = mu)

# Create data frame
df_nb <- data.frame(
  y_nb = y_nb,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Load MASS package
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
# Fit Negative Binomial regression
model_nb <- glm.nb(y_nb ~ x1 + x2 + x3 + income,
                   data = df_nb)

# Summary
summary(model_nb)

Call:
glm.nb(formula = y_nb ~ x1 + x2 + x3 + income, data = df_nb, 
    init.theta = 2.111601872, link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.893065   0.131539   6.789 1.13e-11 ***
x1            0.023590   0.002438   9.678  < 2e-16 ***
x2           -0.449619   0.049522  -9.079  < 2e-16 ***
x3            0.194489   0.006396  30.407  < 2e-16 ***
incomeLow    -0.039276   0.059052  -0.665    0.506    
incomeMiddle -0.068216   0.058632  -1.163    0.245    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.1116) family taken to be 1)

    Null deviance: 3096.9  on 999  degrees of freedom
Residual deviance: 1085.6  on 994  degrees of freedom
AIC: 6803.7

Number of Fisher Scoring iterations: 1

              Theta:  2.112 
          Std. Err.:  0.113 

 2 x log-likelihood:  -6789.660 

Prediction of negative binomial model

# Prediction of x1 by income 
pred_nb <- ggpredict(model_nb, terms = c("x1 [all]", "income"))
plot(pred_nb)

Simulated model for a realistic example

Simulate a realistic epidemiological dataset

Variable Meaning

Variable Meaning
child_age Age of child (months)
temperature_anomaly Deviation from average temperature
household_size Number of people in household
region_type Urban vs rural setting
infection_status Whether child had infection (yes/no)
diarrhea_episodes Count of diarrhea cases
clinic_visits Healthcare visits (overdispersed counts)
# ==========================================
# Simulate realistic epidemiological dataset
# ==========================================

set.seed(123)

# -------------------------------
# 1. Sample size
# -------------------------------
n <- 1000

# -------------------------------
# 2. Predictors (real-life names)
# -------------------------------

child_age <- rnorm(n, mean = 24, sd = 10)         # Age in months
temperature_anomaly <- rnorm(n, mean = 0, sd = 1) # Climate variable
household_size <- sample(3:8, n, replace = TRUE)  # Household members
region_type <- factor(sample(c("Urban", "Rural"), n, replace = TRUE))

# -------------------------------
# 3. BINOMIAL OUTCOME
# -------------------------------
# Outcome: Infection status (0 = no, 1 = yes)

lin_pred_bin <- -2 +
  0.03 * child_age +
  0.6 * temperature_anomaly +
  0.2 * household_size +
  0.5 * (region_type == "Rural")

prob_infection <- exp(lin_pred_bin) / (1 + exp(lin_pred_bin))

infection_status <- rbinom(n, size = 1, prob = prob_infection)

# -------------------------------
# 4. POISSON OUTCOME
# -------------------------------
# Outcome: Number of diarrheal episodes

lambda <- exp(-1 +
  0.02 * child_age +
  0.4 * temperature_anomaly +
  0.1 * household_size +
  0.3 * (region_type == "Rural"))

diarrhea_episodes <- rpois(n, lambda)

# -------------------------------
# 5. NEGATIVE BINOMIAL OUTCOME
# -------------------------------
# Outcome: Number of clinic visits (overdispersed)

mu <- exp(-1 +
  0.02 * child_age +
  0.4 * temperature_anomaly +
  0.1 * household_size +
  0.3 * (region_type == "Rural"))

dispersion <- 1.2

clinic_visits <- rnegbin(n, mu = mu, theta = dispersion)

# -------------------------------
# 6. Final dataset
# -------------------------------

sim_data <- data.frame(
  child_age,
  temperature_anomaly,
  household_size,
  region_type,
  infection_status,      # Binomial
  diarrhea_episodes,     # Poisson
  clinic_visits          # Negative binomial
)

# View data
head(sim_data)
  child_age temperature_anomaly household_size region_type infection_status
1  18.39524         -0.99579872              4       Urban                1
2  21.69823         -1.03995504              6       Rural                0
3  39.58708         -0.01798024              4       Rural                1
4  24.70508         -0.13217513              7       Rural                1
5  25.29288         -2.54934277              7       Urban                0
6  41.15065          1.04057346              4       Urban                1
  diarrhea_episodes clinic_visits
1                 0             0
2                 1             0
3                 2             1
4                 2             0
5                 0             1
6                 3             1

Building three different models with the simulated dataset

# Logistic regression
model_bin <- glm(infection_status ~ child_age + temperature_anomaly +
                   household_size + region_type,
                 family = binomial, data = sim_data)

# Poisson regression
model_pois <- glm(diarrhea_episodes ~ child_age + temperature_anomaly +
                    household_size + region_type,
                  family = poisson, data = sim_data)

# Negative binomial regression
model_nb <- glm.nb(clinic_visits ~ child_age + temperature_anomaly +
                     household_size + region_type,
                   data = sim_data)

Prediction model (Binomial Model)

pred_binomial<-ggpredict(model_bin, terms =c("temperature_anomaly [all", "region_type"))
Data were 'prettified'. Consider using `terms="temperature_anomaly
  [all]"` to get smooth plots.

Prediction plot (Binomial Model)

plot(pred_binomial)+
  labs(
    title = "Binomial model",
    x = "Temperature Anomaly",
    y = "Predicted Probability",
    color = "Region Type"
  )

Prediction model (Poisson Model)

pred_poisson<-ggpredict(model_pois, terms =c("temperature_anomaly [all", "region_type"))

Prediction plot (Poisson Model)

plot(pred_poisson)+   
  labs(title = "Poisson model", 
       x = "Temperature Anomaly",
       y = "Predicted Probability", 
       color = "Region Type"   )   

Prediction model (Negative Binomial Model)

pred_NegBin<-ggpredict(model_nb, terms =c("temperature_anomaly [all", "region_type"))

Prediction plot (Negative Binomial Model)

plot(pred_NegBin)+      
  labs(title = "Negative Binomial Model",
       x = "Temperature Anomaly",
       y = "Predicted Probability",
       color = "Region Type"   )   

In this simulated epidemiological dataset, three generalized linear models were used to examine how demographic and environmental factors influence child health outcomes. Each model serves a distinct purpose depending on the nature of the outcome variable.

  • The binomial (logistic) model showed how predictors such as child age, temperature anomalies, household size, and region type influence the probability of infection. Results are interpreted in terms of odds ratios, indicating how risk factors increase or decrease the likelihood of disease occurrence.

  • The Poisson model quantified how the same predictors affect the expected number of diarrheal episodes. It assumes that the mean and variance of the counts are equal and provides incidence rate ratios (IRR) to describe multiplicative changes in disease frequency.

  • The negative binomial model extended the Poisson approach by accounting for overdispersion, which is common in real-world health data. It provided more reliable estimates of the relationship between predictors and healthcare utilization (clinic visits) when variability exceeds the Poisson assumption.

Concluding Remark

This practical session introduced the fundamental concepts and tools required for statistical simulation in R. We explored how to load and manage packages, ensure reproducibility using set.seed(), and generate random data from different statistical distributions. In addition, simple simulation models were implemented to demonstrate how probability distributions can be applied in real-world data analysis and research. These foundational skills are essential for statistical computing, data science, and simulation-based research, and they provide a strong basis for more advanced modeling and analytical techniques in R.

Acknowledgement

We sincerely thank Martin Schweinberger and the Language Technology and Data Analysis Laboratory for their valuable educational resources. The content used here has been adapted from the LDAL website solely for learning and teaching purposes.

References

Cedergren, Henrietta J., and David Sankoff. 1974. “Variable Rules: Performance as a Statistical Reflection of Competence.” Language 50 (2): 333–55. https://doi.org/10.2307/412441.

Sankoff, David, and William Labov. 1979. “On the Uses of Variable Rules.” Language in Society 8 (2): 189–222. https://doi.org/10.1017/S0047404500006307.