Statistical Distributions for Simulation in R

Author
Affiliations

Dr. Md. Nasif Hossain

Department of Statistics, Pabna University of Science and Technology, Pabna, Bangladesh

Published

May 7, 2026

PUST Campus

Loading Package in R

library(tidyverse)
library(dplyr)
library(ggplot2)
library(ggeffects)

Reproducibility and set.seed()

# Without a seed: different results every time
sample(1:100, 5)
[1] 89 26 68 40 79
sample(1:100, 5)
[1] 11 63 75 40 27
set.seed(2026)   # reset to the same seed
sample(1:100, 5) # same result as above
[1] 93 97 38 45 91
# Recommended practice: one seed at the top of the script
set.seed(2026)

# From this point on, all random calls are reproducible
x <- rnorm(100)
y <- sample(letters, 10)
z <- rbinom(50, size = 1, prob = 0.3)

head(x, 5); y; head(z, 5)
[1]  0.52058907 -1.07969076  0.13923812 -0.08474878 -0.66663962
 [1] "x" "c" "k" "v" "p" "f" "u" "a" "d" "m"
[1] 1 0 0 0 0

Simulating from Statistical Distributions

R distribution function naming convention

Prefix Function Example
r Random samples rnorm(n, mean, sd)
d Density (PDF/PMF) dnorm(x, mean, sd)
p Cumulative probability (CDF) pnorm(q, mean, sd)
q Quantile (inverse CDF) qnorm(p, mean, sd)

The following table summarises which distributions are most useful for linguistic data:

Distribution guide for linguistic data simulation
Distribution R function Typical linguistic use
Normal rnorm() Pitch, formants, log-transformed RT, log word frequency
Log-normal exp(rnorm()) Raw reaction times, response latencies
Binomial rbinom() Accuracy (correct/incorrect), binary grammaticality judgements
Poisson rpois() Error counts, disfluency counts, word occurrences in short texts
Negative binomial rnbinom() Overdispersed counts: word frequencies, collocation counts
Uniform runif() Ages, random positions, presentation order
Beta rbeta() Proportions bounded in (0, 1): accent ratings, similarity scores
Gamma rgamma() Skewed positive continuous: inter-pausal intervals, vowel duration

Normal Distribution

set.seed(2026)

# Simulate log-transformed reaction times
# Mean ~6.4 ≈ exp(6.4) ≈ 600 ms; SD = 0.3 on the log scale
log_rt <- rnorm(n = 500, mean = 6.4, sd = 0.3)
rt_ms  <- exp(log_rt)   # back-transform to milliseconds

data.frame(log_rt = log_rt, rt_ms = rt_ms) |>
  ggplot(aes(x = rt_ms)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated reaction times (log-normal distribution)",
       subtitle = "n = 500 | Mean ≈ 600 ms",
       x = "Reaction time (ms)", y = "Density")

cat(sprintf("Mean RT: %.1f ms | SD: %.1f ms | Range: %.0f–%.0f ms\n",
            mean(rt_ms), sd(rt_ms), min(rt_ms), max(rt_ms)))
Mean RT: 637.9 ms | SD: 198.2 ms | Range: 280–1545 ms

Binomial Distribution

set.seed(2026)

# Simulate accuracy in a lexical decision task
# 100 participants, 80 trials each, average accuracy 85%
n_participants <- 100
n_trials       <- 80
accuracy_prob  <- 0.85

n_correct <- rbinom(n = n_participants, size = n_trials, prob = accuracy_prob)
accuracy  <- n_correct / n_trials

data.frame(accuracy = accuracy) |>
  ggplot(aes(x = accuracy)) +
  geom_histogram(binwidth = 0.02, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_vline(xintercept = mean(accuracy), color = "firebrick",
             linetype = "dashed", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated accuracy in a lexical decision task",
       subtitle = sprintf("n = %d | %d trials | P(correct) = %.2f | Mean = %.3f",
                          n_participants, n_trials, accuracy_prob, mean(accuracy)),
       x = "Accuracy", y = "Count")

Poisson Distribution

set.seed(2026)

# Simulate disfluency counts per minute for 200 speakers
corrections <- rpois(n = 200, lambda = 1.8)

data.frame(corrections = corrections) |>
  ggplot(aes(x = corrections)) +
  geom_bar(fill = "steelblue", color = "white", alpha = 0.8) +
  theme_bw() +
  labs(title = "Simulated disfluency counts per minute (Poisson)",
       subtitle = sprintf("n = 200 | λ = 1.8 | Mean = %.2f | Variance = %.2f",
                          mean(corrections), var(corrections)),
       x = "Disfluencies per minute", y = "Count")

Negative Binomial Distribution

set.seed(2026)

# Compare Poisson vs. Negative Binomial: same mean, different variance
pois_counts <- rpois(n = 500, lambda = 3.0)
nb_counts   <- rnbinom(n = 500, mu = 3.0, size = 0.8)  # size controls overdispersion

cat(sprintf("Poisson:    Mean = %.2f | Variance = %.2f\n",
            mean(pois_counts), var(pois_counts)))
Poisson:    Mean = 2.90 | Variance = 2.73
cat(sprintf("Neg. Binom: Mean = %.2f | Variance = %.2f\n",
            mean(nb_counts), var(nb_counts)))
Neg. Binom: Mean = 2.79 | Variance = 13.41
data.frame(
  count = c(pois_counts, nb_counts),
  dist  = rep(c("Poisson (λ=3)", "Neg. Binomial (μ=3, size=0.8)"), each = 500)
) |>
  ggplot(aes(x = count, fill = dist)) +
  geom_bar(position = "dodge", alpha = 0.8, color = "white") +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() + theme(legend.position = "top") +
  labs(title = "Poisson vs. Negative Binomial: same mean, very different variance",
       x = "Count", y = "Frequency", fill = "")

Beta Distribution

set.seed(2026)

# Simulate accent ratings (0 = native-like, 1 = heavily accented)
# Beta(2, 5): right-skewed, most ratings near the native-like end
# Beta(5, 2): left-skewed, most ratings toward heavily-accented end
# Beta(2, 2): symmetric, centred at 0.5

ratings_native  <- rbeta(200, shape1 = 2, shape2 = 5)
ratings_l2      <- rbeta(200, shape1 = 5, shape2 = 2)

data.frame(
  rating = c(ratings_native, ratings_l2),
  group  = rep(c("L1 speakers", "Advanced L2 speakers"), each = 200)
) |>
  ggplot(aes(x = rating, fill = group)) +
  geom_histogram(binwidth = 0.05, position = "identity", alpha = 0.6, color = "white") +
  scale_fill_manual(values = c("steelblue", "firebrick")) +
  theme_bw() + theme(legend.position = "top") +
  labs(title = "Simulated accent ratings (Beta distribution)",
       subtitle = "L1: Beta(2,5) — mostly native-like | L2: Beta(5,2) — more accented",
       x = "Accent rating (0 = native-like, 1 = heavily accented)", y = "Count",
       fill = "")

Gamma Distribution

set.seed(2026)

# Simulate inter-pausal intervals in spontaneous speech (ms)
# shape = 2, rate = 0.01 → mean = shape/rate = 200 ms
pauses <- rgamma(n = 500, shape = 2, rate = 0.01)

data.frame(pause_ms = pauses) |>
  ggplot(aes(x = pause_ms)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40, fill = "steelblue", color = "white", alpha = 0.8) +
  geom_density(color = "firebrick", linewidth = 1) +
  theme_bw() +
  labs(title = "Simulated inter-pausal intervals (Gamma distribution)",
       subtitle = sprintf("n = 500 | Mean = %.0f ms | SD = %.0f ms",
                          mean(pauses), sd(pauses)),
       x = "Inter-pausal interval (ms)", y = "Density")

Basic Simulation models in R

Simulate data

set.seed(123)

n <- 1000   # sample size

# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10)         # continuous predictor (Normal)
x2 <- rbinom(n, size = 1, prob = 0.4)      # binary predictor (Binomial)
x3 <- rpois(n, lambda = 3)                 # count predictor (Poisson)

income <- factor(
  sample(c("Low", "Medium", "High"),
         n,
         replace = TRUE,
         prob = c(0.15, 0.35, 0.50)),
  levels = c("Low", "Medium", "High")
  )

income<-relevel(income, ref = "Low")

df=data.frame(x1,x2,x3,income)

Linear Regression (Normal outcome)

# Simulate outcome (Normal)
y_normal <- 5 + 0.3*x1 - 2*x2 + 1.5*x3 + rnorm(n, 0, 5)

df <- data.frame(
  y_normal = y_normal,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit linear regression
model_lm <- lm(y_normal ~ x1 + x2 + x3+income,
               data=df)

summary(model_lm)

Call:
lm(formula = y_normal ~ x1 + x2 + x3 + income, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2083  -3.2377  -0.0841   3.2627  16.9846 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.23824    0.95819   5.467 5.79e-08 ***
x1            0.28551    0.01603  17.807  < 2e-16 ***
x2           -1.67738    0.32288  -5.195 2.48e-07 ***
x3            1.53071    0.08975  17.055  < 2e-16 ***
incomeMedium -0.06113    0.51822  -0.118    0.906    
incomeHigh    0.56059    0.49851   1.125    0.261    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.009 on 994 degrees of freedom
Multiple R-squared:  0.4069,    Adjusted R-squared:  0.4039 
F-statistic: 136.4 on 5 and 994 DF,  p-value: < 2.2e-16
levels(income)
[1] "Low"    "Medium" "High"  

Logistic Regression (Binomial outcome)

# Generate probability using logistic function
logit_p <- -2 + 0.05*x1 - 1*x2 + 0.3*x3
p <- 1 / (1 + exp(-logit_p))

# Simulate binary outcome
y_binomial <- rbinom(n, size = 1, prob = p)

df <- data.frame(
  y_binomial = y_binomial,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit logistic regression
model_logit <- glm(y_binomial ~ x1 + x2 + x3+income,
                   data = df,
                   family = binomial)

summary(model_logit)

Call:
glm(formula = y_binomial ~ x1 + x2 + x3 + income, family = binomial, 
    data = df)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -3.019479   0.463241  -6.518 7.12e-11 ***
x1            0.065382   0.008084   8.088 6.05e-16 ***
x2           -0.828648   0.150819  -5.494 3.92e-08 ***
x3            0.293158   0.046807   6.263 3.77e-10 ***
incomeMedium  0.234193   0.240161   0.975    0.329    
incomeHigh    0.220943   0.230439   0.959    0.338    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1223.4  on 999  degrees of freedom
Residual deviance: 1073.7  on 994  degrees of freedom
AIC: 1085.7

Number of Fisher Scoring iterations: 4

Prediction of logistic regression

# Prediction of x1 only
pred_1 <- ggpredict(model_logit, terms = c("x1 [all]"))
# Plot for predicted value 
plot(pred_1)

# Prediction of x1 by income 
pred_2 <- ggpredict(model_logit, terms = c("x1 [all]", "income"))
# Plot for predicted value

plot(pred_2)

Poisson Regression (Count outcome)

# Generate lambda (mean count)
lambda <- exp(1 + 0.02*x1 - 0.5*x2 + 0.2*x3)

# Simulate count outcome
y_poisson <- rpois(n, lambda = lambda)

df <- data.frame(
  y_poisson = y_poisson,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Fit Poisson regression
model_pois <- glm(y_poisson ~ x1 + x2 + x3+income, 
                  data=df,
                  family = poisson)

summary(model_pois)

Call:
glm(formula = y_poisson ~ x1 + x2 + x3 + income, family = poisson, 
    data = df)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0343374  0.0572523  18.066   <2e-16 ***
x1            0.0198665  0.0009362  21.220   <2e-16 ***
x2           -0.5216223  0.0199425 -26.156   <2e-16 ***
x3            0.1974561  0.0046577  42.393   <2e-16 ***
incomeMedium -0.0297095  0.0292233  -1.017    0.309    
incomeHigh   -0.0174807  0.0280185  -0.624    0.533    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4084.39  on 999  degrees of freedom
Residual deviance:  937.71  on 994  degrees of freedom
AIC: 5147.5

Number of Fisher Scoring iterations: 4

Negative Binomial Model

set.seed(2026)

n <- 1000   # sample size

# Predictor variables
x1 <- rnorm(n, mean = 50, sd = 10)         # continuous predictor
x2 <- rbinom(n, size = 1, prob = 0.4)      # binary predictor
x3 <- rnbinom(n, mu = 3.0, size = 0.8)  # size                # count predictor

# Optional categorical variable
income <- factor(sample(c("Low", "Middle", "High"),
                        n,
                        replace = TRUE))

# Linear predictor
eta <- 1 + 0.02*x1 - 0.5*x2 + 0.2*x3

# Mean parameter
mu <- exp(eta)

# Dispersion parameter (theta/size)
theta <- 2

# Simulate Negative Binomial outcome
y_nb <- rnbinom(n,
                size = theta,
                mu = mu)

# Create data frame
df_nb <- data.frame(
  y_nb = y_nb,
  x1 = x1,
  x2 = x2,
  x3 = x3,
  income = income
)

# Load MASS package
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
# Fit Negative Binomial regression
model_nb <- glm.nb(y_nb ~ x1 + x2 + x3 + income,
                   data = df_nb)

# Summary
summary(model_nb)

Call:
glm.nb(formula = y_nb ~ x1 + x2 + x3 + income, data = df_nb, 
    init.theta = 2.111601872, link = log)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.893065   0.131539   6.789 1.13e-11 ***
x1            0.023590   0.002438   9.678  < 2e-16 ***
x2           -0.449619   0.049522  -9.079  < 2e-16 ***
x3            0.194489   0.006396  30.407  < 2e-16 ***
incomeLow    -0.039276   0.059052  -0.665    0.506    
incomeMiddle -0.068216   0.058632  -1.163    0.245    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.1116) family taken to be 1)

    Null deviance: 3096.9  on 999  degrees of freedom
Residual deviance: 1085.6  on 994  degrees of freedom
AIC: 6803.7

Number of Fisher Scoring iterations: 1

              Theta:  2.112 
          Std. Err.:  0.113 

 2 x log-likelihood:  -6789.660 

Prediction of negative binomial model

# Prediction of x1 by income 
pred_nb <- ggpredict(model_nb, terms = c("x1 [all]", "income"))
plot(pred_nb)

Simulated model for a realistic example

Simulate a realistic epidemiological dataset

Variable Meaning

Variable Meaning
child_age Age of child (months)
temperature_anomaly Deviation from average temperature
household_size Number of people in household
region_type Urban vs rural setting
infection_status Whether child had infection (yes/no)
diarrhea_episodes Count of diarrhea cases
clinic_visits Healthcare visits (overdispersed counts)
# ==========================================
# Simulate realistic epidemiological dataset
# ==========================================

set.seed(123)

# -------------------------------
# 1. Sample size
# -------------------------------
n <- 1000

# -------------------------------
# 2. Predictors (real-life names)
# -------------------------------

child_age <- rnorm(n, mean = 24, sd = 10)         # Age in months
temperature_anomaly <- rnorm(n, mean = 0, sd = 1) # Climate variable
household_size <- sample(3:8, n, replace = TRUE)  # Household members
region_type <- factor(sample(c("Urban", "Rural"), n, replace = TRUE))

# -------------------------------
# 3. BINOMIAL OUTCOME
# -------------------------------
# Outcome: Infection status (0 = no, 1 = yes)

lin_pred_bin <- -2 +
  0.03 * child_age +
  0.6 * temperature_anomaly +
  0.2 * household_size +
  0.5 * (region_type == "Rural")

prob_infection <- exp(lin_pred_bin) / (1 + exp(lin_pred_bin))

infection_status <- rbinom(n, size = 1, prob = prob_infection)

# -------------------------------
# 4. POISSON OUTCOME
# -------------------------------
# Outcome: Number of diarrheal episodes

lambda <- exp(-1 +
  0.02 * child_age +
  0.4 * temperature_anomaly +
  0.1 * household_size +
  0.3 * (region_type == "Rural"))

diarrhea_episodes <- rpois(n, lambda)

# -------------------------------
# 5. NEGATIVE BINOMIAL OUTCOME
# -------------------------------
# Outcome: Number of clinic visits (overdispersed)

mu <- exp(-1 +
  0.02 * child_age +
  0.4 * temperature_anomaly +
  0.1 * household_size +
  0.3 * (region_type == "Rural"))

dispersion <- 1.2

clinic_visits <- rnegbin(n, mu = mu, theta = dispersion)

# -------------------------------
# 6. Final dataset
# -------------------------------

sim_data <- data.frame(
  child_age,
  temperature_anomaly,
  household_size,
  region_type,
  infection_status,      # Binomial
  diarrhea_episodes,     # Poisson
  clinic_visits          # Negative binomial
)

# View data
head(sim_data)
  child_age temperature_anomaly household_size region_type infection_status
1  18.39524         -0.99579872              4       Urban                1
2  21.69823         -1.03995504              6       Rural                0
3  39.58708         -0.01798024              4       Rural                1
4  24.70508         -0.13217513              7       Rural                1
5  25.29288         -2.54934277              7       Urban                0
6  41.15065          1.04057346              4       Urban                1
  diarrhea_episodes clinic_visits
1                 0             0
2                 1             0
3                 2             1
4                 2             0
5                 0             1
6                 3             1

Building three different models with the simulated dataset

# Logistic regression
model_bin <- glm(infection_status ~ child_age + temperature_anomaly +
                   household_size + region_type,
                 family = binomial, data = sim_data)

# Poisson regression
model_pois <- glm(diarrhea_episodes ~ child_age + temperature_anomaly +
                    household_size + region_type,
                  family = poisson, data = sim_data)

# Negative binomial regression
model_nb <- glm.nb(clinic_visits ~ child_age + temperature_anomaly +
                     household_size + region_type,
                   data = sim_data)

Prediction model (Binomial Model)

pred_binomial<-ggpredict(model_bin, terms =c("temperature_anomaly [all", "region_type"))
Data were 'prettified'. Consider using `terms="temperature_anomaly
  [all]"` to get smooth plots.

Prediction plot (Binomial Model)

plot(pred_binomial)+
  labs(
    title = "Binomial model",
    x = "Temperature Anomaly",
    y = "Predicted Probability",
    color = "Region Type"
  )

Prediction model (Poisson Model)

pred_poisson<-ggpredict(model_pois, terms =c("temperature_anomaly [all", "region_type"))

Prediction plot (Poisson Model)

plot(pred_poisson)+   
  labs(title = "Poisson model", 
       x = "Temperature Anomaly",
       y = "Predicted Probability", 
       color = "Region Type"   )   

Prediction model (Negative Binomial Model)

pred_NegBin<-ggpredict(model_nb, terms =c("temperature_anomaly [all", "region_type"))

Prediction plot (Negative Binomial Model)

plot(pred_NegBin)+      
  labs(title = "Negative Binomial Model",
       x = "Temperature Anomaly",
       y = "Predicted Probability",
       color = "Region Type"   )   

AIC(model_bin) 
[1] 1278.94
AIC(model_pois) 
[1] 2804.183
AIC(model_nb)
[1] 3070.739

In this simulated epidemiological dataset, three generalized linear models were used to examine how demographic and environmental factors influence child health outcomes. Each model serves a distinct purpose depending on the nature of the outcome variable.

  • The binomial (logistic) model showed how predictors such as child age, temperature anomalies, household size, and region type influence the probability of infection. Results are interpreted in terms of odds ratios, indicating how risk factors increase or decrease the likelihood of disease occurrence.

  • The Poisson model quantified how the same predictors affect the expected number of diarrheal episodes. It assumes that the mean and variance of the counts are equal and provides incidence rate ratios (IRR) to describe multiplicative changes in disease frequency.

  • The negative binomial model extended the Poisson approach by accounting for overdispersion, which is common in real-world health data. It provided more reliable estimates of the relationship between predictors and healthcare utilization (clinic visits) when variability exceeds the Poisson assumption.

Exercise

a. Simulate 100 observations (use your roll-number as seed) containing the following five variables:

Variable Name Type Description
age Continuous Age of individuals (e.g., Normal distribution with mean 30 and variance 5)
temperature Continuous (Exposure) Environmental temperature (e.g., Normal distribution with mean 25 and variance 3)
gender Categorical Male / Female
region Categorical Urban / Rural
doctor_visits Count (Outcome) Number of doctor visits in a year (Poisson distribution with mean 2)

b. Explore (i) histogram of outcome variable and (ii) the relationship by using a scatter plot of temperature vs doctor_visits.

c. Find the association between temperature and doctor_visits by fitting (i) a linear model, (ii) Poisson model, and (iii) Negative binomial model

d. Compare models using: (i) AIC values and (ii) Overdispersion check

e. Identify the most appropriate model and justify your choice.

Solution

a. Simulate Data

# Use your roll number as seed (example: 12345)
set.seed(12345)

# Number of observations
n <- 100

# Continuous variables
age <- rnorm(n, mean = 30, sd = sqrt(5))
temperature <- rnorm(n, mean = 25, sd = sqrt(3))

# Categorical variables
gender <- factor(sample(c("Male", "Female"), n, replace = TRUE))
region <- factor(sample(c("Urban", "Rural"), n, replace = TRUE))

# Count outcome (Poisson distribution)
doctor_visits <- rpois(n, lambda = 2)

# Create data frame
data <- data.frame(age, temperature, gender, region, doctor_visits)

# View first few rows
head(data)
       age temperature gender region doctor_visits
1 31.30928    25.38785   Male  Urban             3
2 31.58641    22.99736 Female  Urban             0
3 29.75559    25.73165   Male  Urban             2
4 28.98595    22.70546   Male  Urban             3
5 31.35481    25.24437   Male  Urban             1
6 25.93493    24.07154 Female  Urban             3

(b) Exploratory Analysis

b (i) Histogram of Outcome Variable (doctor_visits)

# Histogram of doctor visits
hist(data$doctor_visits,
     main = "Histogram of Doctor Visits",
     xlab = "Number of Doctor Visits",
     col = "lightblue",
     border = "black")

b (ii) Scatter Plot: Temperature vs Doctor Visits

# Scatter plot
plot(data$temperature, data$doctor_visits,
     main = "Temperature vs Doctor Visits",
     xlab = "Temperature",
     ylab = "Doctor Visits",
     pch = 16,
     col = "darkgreen")

# Add regression line
abline(lm(doctor_visits ~ temperature, data = data),
       col = "red", lwd = 2)

c (i) Linear model

# Linear regression
lm_model <- lm(doctor_visits ~ temperature, data = data)

# Summary of model
summary(lm_model)

Call:
lm(formula = doctor_visits ~ temperature, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6190 -1.1896 -0.1827  0.9292  3.5578 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  5.39238    2.17937   2.474   0.0151 *
temperature -0.12690    0.08669  -1.464   0.1465  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.511 on 98 degrees of freedom
Multiple R-squared:  0.0214,    Adjusted R-squared:  0.01141 
F-statistic: 2.143 on 1 and 98 DF,  p-value: 0.1465

c (ii) Poisson Regression Model

# Poisson regression
pois_model <- glm(doctor_visits ~ temperature,
                  data = data,
                  family = poisson(link = "log"))

# Summary of model
summary(pois_model)

Call:
glm(formula = doctor_visits ~ temperature, family = poisson(link = "log"), 
    data = data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.24140    0.97290   2.304   0.0212 *
temperature -0.05796    0.03897  -1.487   0.1370  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 123.80  on 99  degrees of freedom
Residual deviance: 121.58  on 98  degrees of freedom
AIC: 360.08

Number of Fisher Scoring iterations: 5

c (iii) Negative Binomial Model

# Load MASS package
library(MASS)

# Negative binomial regression
nb_model <- glm.nb(doctor_visits ~ temperature, data = data)
Warning in glm.nb(doctor_visits ~ temperature, data = data): alternation limit
reached
# Summary of model
summary(nb_model)

Call:
glm.nb(formula = doctor_visits ~ temperature, data = data, init.theta = 193.7046517, 
    link = log)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.23974    0.97850   2.289   0.0221 *
temperature -0.05789    0.03919  -1.477   0.1397  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(193.7047) family taken to be 1)

    Null deviance: 122.64  on 99  degrees of freedom
Residual deviance: 120.44  on 98  degrees of freedom
AIC: 362.08

Number of Fisher Scoring iterations: 1

              Theta:  194 
          Std. Err.:  2612 
Warning while fitting theta: alternation limit reached 

 2 x log-likelihood:  -356.079 

d (i) Compare using AIC

# Extract AIC values
AIC(lm_model, pois_model, nb_model)
           df      AIC
lm_model    3 370.3010
pois_model  2 360.0850
nb_model    3 362.0794

d (ii) Overdispersion Check (for Poisson model)

mean(data$doctor_visits)
[1] 2.21
var(data$doctor_visits)
[1] 2.30899

The mean (2.21) and variance (2.31) of doctor visits are very similar, satisfying the Poisson assumption that the mean equals the variance.

# Calculate dispersion
dispersion <- sum(residuals(pois_model, type = "pearson")^2) / pois_model$df.residual
dispersion
[1] 1.027491

The dispersion statistic (1.027) is very close to 1, indicating no overdispersion in the data.

e appropriate model
The Poisson regression model is the most appropriate for this analysis. The outcome variable is count data, which suits the Poisson framework. The overdispersion checks show that the dispersion statistic (1.027) is very close to 1, and the mean (2.21) is approximately equal to the variance (2.31), indicating that the Poisson assumption is satisfied. Therefore, there is no evidence of overdispersion, and the Negative Binomial model is unnecessary. The linear model is also less appropriate due to the nature of the outcome variable. Hence, the Poisson model is the best choice.

Concluding Remark

This practical session introduced the fundamental concepts and tools required for statistical simulation in R. We explored how to load and manage packages, ensure reproducibility using set.seed(), and generate random data from different statistical distributions. In addition, simple simulation models were implemented to demonstrate how probability distributions can be applied in real-world data analysis and research. These foundational skills are essential for statistical computing, data science, and simulation-based research, and they provide a strong basis for more advanced modeling and analytical techniques in R.

Acknowledgement

We sincerely thank Martin Schweinberger and the Language Technology and Data Analysis Laboratory for their valuable educational resources. The content used here has been adapted from the LDAL website solely for learning and teaching purposes.

References

Cedergren, Henrietta J., and David Sankoff. 1974. “Variable Rules: Performance as a Statistical Reflection of Competence.” Language 50 (2): 333–55. https://doi.org/10.2307/412441.

Sankoff, David, and William Labov. 1979. “On the Uses of Variable Rules.” Language in Society 8 (2): 189–222. https://doi.org/10.1017/S0047404500006307.