library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggeffects)Statistical Distributions for Simulation in R
PUST Campus
Loading Package in R
Reproducibility and set.seed()
# Without a seed: different results every time
sample(1:100, 5)[1] 78 39 58 66 88
sample(1:100, 5)[1] 8 15 92 50 12
set.seed(2026) # reset to the same seed
sample(1:100, 5) # same result as above[1] 93 97 38 45 91
# Recommended practice: one seed at the top of the script
set.seed(2026)
# From this point on, all random calls are reproducible
x <- rnorm(100)
y <- sample(letters, 10)
z <- rbinom(50, size = 1, prob = 0.3)
head(x, 5); y; head(z, 5)[1] 0.52058907 -1.07969076 0.13923812 -0.08474878 -0.66663962
[1] "x" "c" "k" "v" "p" "f" "u" "a" "d" "m"
[1] 1 0 0 0 0
Simulating from Statistical Distributions
R distribution function naming convention
| Prefix | Function | Example |
|---|---|---|
r |
Random samples | rnorm(n, mean, sd) |
d |
Density (PDF/PMF) | dnorm(x, mean, sd) |
p |
Cumulative probability (CDF) | pnorm(q, mean, sd) |
q |
Quantile (inverse CDF) | qnorm(p, mean, sd) |
The following table summarises which distributions are most useful for linguistic data:
| Distribution guide for linguistic data simulation | ||
| Distribution | R function | Typical linguistic use |
| Normal | rnorm() |
Pitch, formants, log-transformed RT, log word frequency |
| Log-normal | exp(rnorm()) |
Raw reaction times, response latencies |
| Binomial | rbinom() |
Accuracy (correct/incorrect), binary grammaticality judgements |
| Poisson | rpois() |
Error counts, disfluency counts, word occurrences in short texts |
| Negative binomial | rnbinom() |
Overdispersed counts: word frequencies, collocation counts |
| Uniform | runif() |
Ages, random positions, presentation order |
| Beta | rbeta() |
Proportions bounded in (0, 1): accent ratings, similarity scores |
| Gamma | rgamma() |
Skewed positive continuous: inter-pausal intervals, vowel duration |
Normal Distribution
set.seed(2026)
# Simulate log-transformed reaction times
# Mean ~6.4 ≈ exp(6.4) ≈ 600 ms; SD = 0.3 on the log scale
log_rt <- rnorm(n = 500, mean = 6.4, sd = 0.3)
rt_ms <- exp(log_rt) # back-transform to milliseconds
data.frame(log_rt = log_rt, rt_ms = rt_ms) |>
ggplot(aes(x = rt_ms)) +
geom_histogram(aes(y = after_stat(density)),
bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "firebrick", linewidth = 1) +
theme_bw() +
labs(title = "Simulated reaction times (log-normal distribution)",
subtitle = "n = 500 | Mean ≈ 600 ms",
x = "Reaction time (ms)", y = "Density")cat(sprintf("Mean RT: %.1f ms | SD: %.1f ms | Range: %.0f–%.0f ms\n",
mean(rt_ms), sd(rt_ms), min(rt_ms), max(rt_ms)))Mean RT: 637.9 ms | SD: 198.2 ms | Range: 280–1545 ms
Binomial Distribution
set.seed(2026)
# Simulate accuracy in a lexical decision task
# 100 participants, 80 trials each, average accuracy 85%
n_participants <- 100
n_trials <- 80
accuracy_prob <- 0.85
n_correct <- rbinom(n = n_participants, size = n_trials, prob = accuracy_prob)
accuracy <- n_correct / n_trials
data.frame(accuracy = accuracy) |>
ggplot(aes(x = accuracy)) +
geom_histogram(binwidth = 0.02, fill = "steelblue", color = "white", alpha = 0.8) +
geom_vline(xintercept = mean(accuracy), color = "firebrick",
linetype = "dashed", linewidth = 1) +
theme_bw() +
labs(title = "Simulated accuracy in a lexical decision task",
subtitle = sprintf("n = %d | %d trials | P(correct) = %.2f | Mean = %.3f",
n_participants, n_trials, accuracy_prob, mean(accuracy)),
x = "Accuracy", y = "Count")Poisson Distribution
set.seed(2026)
# Simulate disfluency counts per minute for 200 speakers
corrections <- rpois(n = 200, lambda = 1.8)
data.frame(corrections = corrections) |>
ggplot(aes(x = corrections)) +
geom_bar(fill = "steelblue", color = "white", alpha = 0.8) +
theme_bw() +
labs(title = "Simulated disfluency counts per minute (Poisson)",
subtitle = sprintf("n = 200 | λ = 1.8 | Mean = %.2f | Variance = %.2f",
mean(corrections), var(corrections)),
x = "Disfluencies per minute", y = "Count")Negative Binomial Distribution
set.seed(2026)
# Compare Poisson vs. Negative Binomial: same mean, different variance
pois_counts <- rpois(n = 500, lambda = 3.0)
nb_counts <- rnbinom(n = 500, mu = 3.0, size = 0.8) # size controls overdispersion
cat(sprintf("Poisson: Mean = %.2f | Variance = %.2f\n",
mean(pois_counts), var(pois_counts)))Poisson: Mean = 2.90 | Variance = 2.73
cat(sprintf("Neg. Binom: Mean = %.2f | Variance = %.2f\n",
mean(nb_counts), var(nb_counts)))Neg. Binom: Mean = 2.79 | Variance = 13.41
data.frame(
count = c(pois_counts, nb_counts),
dist = rep(c("Poisson (λ=3)", "Neg. Binomial (μ=3, size=0.8)"), each = 500)
) |>
ggplot(aes(x = count, fill = dist)) +
geom_bar(position = "dodge", alpha = 0.8, color = "white") +
scale_fill_manual(values = c("steelblue", "firebrick")) +
theme_bw() + theme(legend.position = "top") +
labs(title = "Poisson vs. Negative Binomial: same mean, very different variance",
x = "Count", y = "Frequency", fill = "")Beta Distribution
set.seed(2026)
# Simulate accent ratings (0 = native-like, 1 = heavily accented)
# Beta(2, 5): right-skewed, most ratings near the native-like end
# Beta(5, 2): left-skewed, most ratings toward heavily-accented end
# Beta(2, 2): symmetric, centred at 0.5
ratings_native <- rbeta(200, shape1 = 2, shape2 = 5)
ratings_l2 <- rbeta(200, shape1 = 5, shape2 = 2)
data.frame(
rating = c(ratings_native, ratings_l2),
group = rep(c("L1 speakers", "Advanced L2 speakers"), each = 200)
) |>
ggplot(aes(x = rating, fill = group)) +
geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.6, color = "white") +
scale_fill_manual(values = c("steelblue", "firebrick")) +
theme_bw() + theme(legend.position = "top") +
labs(title = "Simulated accent ratings (Beta distribution)",
subtitle = "L1: Beta(2,5) — mostly native-like | L2: Beta(5,2) — more accented",
x = "Accent rating (0 = native-like, 1 = heavily accented)", y = "Count",
fill = "")Gamma Distribution
set.seed(2026)
# Simulate inter-pausal intervals in spontaneous speech (ms)
# shape = 2, rate = 0.01 → mean = shape/rate = 200 ms
pauses <- rgamma(n = 500, shape = 2, rate = 0.01)
data.frame(pause_ms = pauses) |>
ggplot(aes(x = pause_ms)) +
geom_histogram(aes(y = after_stat(density)),
bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
geom_density(color = "firebrick", linewidth = 1) +
theme_bw() +
labs(title = "Simulated inter-pausal intervals (Gamma distribution)",
subtitle = sprintf("n = 500 | Mean = %.0f ms | SD = %.0f ms",
mean(pauses), sd(pauses)),
x = "Inter-pausal interval (ms)", y = "Density")Basic Simulation models in R
Simulate data
set.seed(123)
n <- 1000 # sample size
# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10) # continuous predictor (Normal)
x2 <- rbinom(n, size = 1, prob = 0.4) # binary predictor (Binomial)
x3 <- rpois(n, lambda = 3) # count predictor (Poisson)
income <- factor(
sample(c("Low", "Medium", "High"),
n,
replace = TRUE,
prob = c(0.15, 0.35, 0.50)),
levels = c("Low", "Medium", "High")
)
income<-relevel(income, ref = "Low")
df=data.frame(x1,x2,x3,income)Linear Regression (Normal outcome)
# Simulate outcome (Normal)
y_normal <- 5 + 0.3*x1 - 2*x2 + 1.5*x3 + rnorm(n, 0, 5)
df <- data.frame(
y_normal = y_normal,
x1 = x1,
x2 = x2,
x3 = x3,
income = income
)
# Fit linear regression
model_lm <- lm(y_normal ~ x1 + x2 + x3+income,
data=df)
summary(model_lm)
Call:
lm(formula = y_normal ~ x1 + x2 + x3 + income, data = df)
Residuals:
Min 1Q Median 3Q Max
-15.2083 -3.2377 -0.0841 3.2627 16.9846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.23824 0.95819 5.467 5.79e-08 ***
x1 0.28551 0.01603 17.807 < 2e-16 ***
x2 -1.67738 0.32288 -5.195 2.48e-07 ***
x3 1.53071 0.08975 17.055 < 2e-16 ***
incomeMedium -0.06113 0.51822 -0.118 0.906
incomeHigh 0.56059 0.49851 1.125 0.261
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.009 on 994 degrees of freedom
Multiple R-squared: 0.4069, Adjusted R-squared: 0.4039
F-statistic: 136.4 on 5 and 994 DF, p-value: < 2.2e-16
levels(income)[1] "Low" "Medium" "High"
Logistic Regression (Binomial outcome)
# Generate probability using logistic function
logit_p <- -2 + 0.05*x1 - 1*x2 + 0.3*x3
p <- 1 / (1 + exp(-logit_p))
# Simulate binary outcome
y_binomial <- rbinom(n, size = 1, prob = p)
df <- data.frame(
y_binomial = y_binomial,
x1 = x1,
x2 = x2,
x3 = x3,
income = income
)
# Fit logistic regression
model_logit <- glm(y_binomial ~ x1 + x2 + x3+income,
data = df,
family = binomial)
summary(model_logit)
Call:
glm(formula = y_binomial ~ x1 + x2 + x3 + income, family = binomial,
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.019479 0.463241 -6.518 7.12e-11 ***
x1 0.065382 0.008084 8.088 6.05e-16 ***
x2 -0.828648 0.150819 -5.494 3.92e-08 ***
x3 0.293158 0.046807 6.263 3.77e-10 ***
incomeMedium 0.234193 0.240161 0.975 0.329
incomeHigh 0.220943 0.230439 0.959 0.338
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1223.4 on 999 degrees of freedom
Residual deviance: 1073.7 on 994 degrees of freedom
AIC: 1085.7
Number of Fisher Scoring iterations: 4
Prediction of logistic regression
# Prediction of x1 only
pred_1 <- ggpredict(model_logit, terms = c("x1 [all]"))# Plot for predicted value
plot(pred_1)# Prediction of x1 by income
pred_2 <- ggpredict(model_logit, terms = c("x1 [all]", "income"))# Plot for predicted value
plot(pred_2)Poisson Regression (Count outcome)
# Generate lambda (mean count)
lambda <- exp(1 + 0.02*x1 - 0.5*x2 + 0.2*x3)
# Simulate count outcome
y_poisson <- rpois(n, lambda = lambda)
df <- data.frame(
y_poisson = y_poisson,
x1 = x1,
x2 = x2,
x3 = x3,
income = income
)
# Fit Poisson regression
model_pois <- glm(y_poisson ~ x1 + x2 + x3+income,
data=df,
family = poisson)
summary(model_pois)
Call:
glm(formula = y_poisson ~ x1 + x2 + x3 + income, family = poisson,
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0343374 0.0572523 18.066 <2e-16 ***
x1 0.0198665 0.0009362 21.220 <2e-16 ***
x2 -0.5216223 0.0199425 -26.156 <2e-16 ***
x3 0.1974561 0.0046577 42.393 <2e-16 ***
incomeMedium -0.0297095 0.0292233 -1.017 0.309
incomeHigh -0.0174807 0.0280185 -0.624 0.533
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4084.39 on 999 degrees of freedom
Residual deviance: 937.71 on 994 degrees of freedom
AIC: 5147.5
Number of Fisher Scoring iterations: 4
Negative Binomial Model
set.seed(2026)
n <- 1000 # sample size
# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10) # continuous predictor
x2 <- rbinom(n, size = 1, prob = 0.4) # binary predictor
x3 <- rnbinom(n, mu = 3.0, size = 0.8) # size # count predictor
# Optional categorical variable
income <- factor(sample(c("Low", "Middle", "High"),
n,
replace = TRUE))
# Linear predictor
eta <- 1 + 0.02*x1 - 0.5*x2 + 0.2*x3
# Mean parameter
mu <- exp(eta)
# Dispersion parameter (theta/size)
theta <- 2
# Simulate Negative Binomial outcome
y_nb <- rnbinom(n,
size = theta,
mu = mu)
# Create data frame
df_nb <- data.frame(
y_nb = y_nb,
x1 = x1,
x2 = x2,
x3 = x3,
income = income
)
# Load MASS package
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
# Fit Negative Binomial regression
model_nb <- glm.nb(y_nb ~ x1 + x2 + x3 + income,
data = df_nb)
# Summary
summary(model_nb)
Call:
glm.nb(formula = y_nb ~ x1 + x2 + x3 + income, data = df_nb,
init.theta = 2.111601872, link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.893065 0.131539 6.789 1.13e-11 ***
x1 0.023590 0.002438 9.678 < 2e-16 ***
x2 -0.449619 0.049522 -9.079 < 2e-16 ***
x3 0.194489 0.006396 30.407 < 2e-16 ***
incomeLow -0.039276 0.059052 -0.665 0.506
incomeMiddle -0.068216 0.058632 -1.163 0.245
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(2.1116) family taken to be 1)
Null deviance: 3096.9 on 999 degrees of freedom
Residual deviance: 1085.6 on 994 degrees of freedom
AIC: 6803.7
Number of Fisher Scoring iterations: 1
Theta: 2.112
Std. Err.: 0.113
2 x log-likelihood: -6789.660
Prediction of negative binomial model
# Prediction of x1 by income
pred_nb <- ggpredict(model_nb, terms = c("x1 [all]", "income"))plot(pred_nb)Simulated model for a realistic example
Simulate a realistic epidemiological dataset
Variable Meaning
| Variable | Meaning |
|---|---|
child_age |
Age of child (months) |
temperature_anomaly |
Deviation from average temperature |
household_size |
Number of people in household |
region_type |
Urban vs rural setting |
infection_status |
Whether child had infection (yes/no) |
diarrhea_episodes |
Count of diarrhea cases |
clinic_visits |
Healthcare visits (overdispersed counts) |
# ==========================================
# Simulate realistic epidemiological dataset
# ==========================================
set.seed(123)
# -------------------------------
# 1. Sample size
# -------------------------------
n <- 1000
# -------------------------------
# 2. Predictors (real-life names)
# -------------------------------
child_age <- rnorm(n, mean = 24, sd = 10) # Age in months
temperature_anomaly <- rnorm(n, mean = 0, sd = 1) # Climate variable
household_size <- sample(3:8, n, replace = TRUE) # Household members
region_type <- factor(sample(c("Urban", "Rural"), n, replace = TRUE))
# -------------------------------
# 3. BINOMIAL OUTCOME
# -------------------------------
# Outcome: Infection status (0 = no, 1 = yes)
lin_pred_bin <- -2 +
0.03 * child_age +
0.6 * temperature_anomaly +
0.2 * household_size +
0.5 * (region_type == "Rural")
prob_infection <- exp(lin_pred_bin) / (1 + exp(lin_pred_bin))
infection_status <- rbinom(n, size = 1, prob = prob_infection)
# -------------------------------
# 4. POISSON OUTCOME
# -------------------------------
# Outcome: Number of diarrheal episodes
lambda <- exp(-1 +
0.02 * child_age +
0.4 * temperature_anomaly +
0.1 * household_size +
0.3 * (region_type == "Rural"))
diarrhea_episodes <- rpois(n, lambda)
# -------------------------------
# 5. NEGATIVE BINOMIAL OUTCOME
# -------------------------------
# Outcome: Number of clinic visits (overdispersed)
mu <- exp(-1 +
0.02 * child_age +
0.4 * temperature_anomaly +
0.1 * household_size +
0.3 * (region_type == "Rural"))
dispersion <- 1.2
clinic_visits <- rnegbin(n, mu = mu, theta = dispersion)
# -------------------------------
# 6. Final dataset
# -------------------------------
sim_data <- data.frame(
child_age,
temperature_anomaly,
household_size,
region_type,
infection_status, # Binomial
diarrhea_episodes, # Poisson
clinic_visits # Negative binomial
)
# View data
head(sim_data) child_age temperature_anomaly household_size region_type infection_status
1 18.39524 -0.99579872 4 Urban 1
2 21.69823 -1.03995504 6 Rural 0
3 39.58708 -0.01798024 4 Rural 1
4 24.70508 -0.13217513 7 Rural 1
5 25.29288 -2.54934277 7 Urban 0
6 41.15065 1.04057346 4 Urban 1
diarrhea_episodes clinic_visits
1 0 0
2 1 0
3 2 1
4 2 0
5 0 1
6 3 1
Building three different models with the simulated dataset
# Logistic regression
model_bin <- glm(infection_status ~ child_age + temperature_anomaly +
household_size + region_type,
family = binomial, data = sim_data)
# Poisson regression
model_pois <- glm(diarrhea_episodes ~ child_age + temperature_anomaly +
household_size + region_type,
family = poisson, data = sim_data)
# Negative binomial regression
model_nb <- glm.nb(clinic_visits ~ child_age + temperature_anomaly +
household_size + region_type,
data = sim_data)Prediction model (Binomial Model)
pred_binomial<-ggpredict(model_bin, terms =c("temperature_anomaly [all", "region_type"))Data were 'prettified'. Consider using `terms="temperature_anomaly
[all]"` to get smooth plots.
Prediction plot (Binomial Model)
plot(pred_binomial)+
labs(
title = "Binomial model",
x = "Temperature Anomaly",
y = "Predicted Probability",
color = "Region Type"
)Prediction model (Poisson Model)
pred_poisson<-ggpredict(model_pois, terms =c("temperature_anomaly [all", "region_type"))Prediction plot (Poisson Model)
plot(pred_poisson)+
labs(title = "Poisson model",
x = "Temperature Anomaly",
y = "Predicted Probability",
color = "Region Type" ) Prediction model (Negative Binomial Model)
pred_NegBin<-ggpredict(model_nb, terms =c("temperature_anomaly [all", "region_type"))Prediction plot (Negative Binomial Model)
plot(pred_NegBin)+
labs(title = "Negative Binomial Model",
x = "Temperature Anomaly",
y = "Predicted Probability",
color = "Region Type" ) In this simulated epidemiological dataset, three generalized linear models were used to examine how demographic and environmental factors influence child health outcomes. Each model serves a distinct purpose depending on the nature of the outcome variable.
The binomial (logistic) model showed how predictors such as child age, temperature anomalies, household size, and region type influence the probability of infection. Results are interpreted in terms of odds ratios, indicating how risk factors increase or decrease the likelihood of disease occurrence.
The Poisson model quantified how the same predictors affect the expected number of diarrheal episodes. It assumes that the mean and variance of the counts are equal and provides incidence rate ratios (IRR) to describe multiplicative changes in disease frequency.
The negative binomial model extended the Poisson approach by accounting for overdispersion, which is common in real-world health data. It provided more reliable estimates of the relationship between predictors and healthcare utilization (clinic visits) when variability exceeds the Poisson assumption.
Concluding Remark
This practical session introduced the fundamental concepts and tools required for statistical simulation in R. We explored how to load and manage packages, ensure reproducibility using set.seed(), and generate random data from different statistical distributions. In addition, simple simulation models were implemented to demonstrate how probability distributions can be applied in real-world data analysis and research. These foundational skills are essential for statistical computing, data science, and simulation-based research, and they provide a strong basis for more advanced modeling and analytical techniques in R.
Acknowledgement
We sincerely thank Martin Schweinberger and the Language Technology and Data Analysis Laboratory for their valuable educational resources. The content used here has been adapted from the LDAL website solely for learning and teaching purposes.
References
Cedergren, Henrietta J., and David Sankoff. 1974. “Variable Rules: Performance as a Statistical Reflection of Competence.” Language 50 (2): 333–55. https://doi.org/10.2307/412441.
Sankoff, David, and William Labov. 1979. “On the Uses of Variable Rules.” Language in Society 8 (2): 189–222. https://doi.org/10.1017/S0047404500006307.