library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(purrr)
library(broom)

Simulation is a very useful way to understand the behavior of a statistical model or parameter, and R is a great environment for it. In this note, I would like to share some simple R code for simulating data.

Simulating a binomial variable

Let us start with a simple binomial simulation. It is well known that slightly more boys than girls are born; the proportion of boys is around 51.2%. We can simulate the number of boys in a sample of 1000 births using the rbinom function as follows:

nboys = rbinom(1, 1000, 0.512)
head(nboys)
## [1] 529

which shows how many boys we could observe in 1000 births. If we repeat this simulation, the number of boys will differ from run to run, but the average proportion remains 51.2%. Now, let us repeat the simulation 10,000 times:

nsims = 10000
nboys = rep(NA, nsims)
for (i in 1:nsims)
{
  nboys[i] = rbinom(1, 1000, 0.512) 
}
mean(nboys)
## [1] 512.1432
hist(nboys, col="blue", border="white")

As can be seen, across 10,000 samples of 1000 births each, the number of boys ranges mostly between about 470 and 550, with a mean of about 512.
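
Because rbinom is vectorized, the loop above can also be replaced by a single call (a minimal sketch of the same simulation; the exact counts will differ from run to run unless a seed is set):

# draw all 10,000 binomial counts at once, each from 1000 births with p = 0.512
nboys = rbinom(nsims, size = 1000, prob = 0.512)
mean(nboys)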

Simulating a normally distributed variable

A normally distributed variable is characterised by its mean mu and standard deviation sd. In R, we can easily simulate such a random variable using the rnorm function. In the example below, we take a random sample of 1000 people from a population whose mean bone mineral density (BMD) is 1.00 and whose sd is 0.12.

bmd = rnorm(n=1000, mean=1.00, sd=0.12)
mean(bmd)
## [1] 0.9948937
ggplot() + geom_histogram(aes(x=bmd), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As can be seen, the observed values of bmd range mostly between 0.8 and 1.2, with the mean being close to 1.0 as expected. Now, let us repeat the simulation 10,000 times to see how the mean of BMD behaves.

nsims = 10000
mean.bmd = rep(NA, nsims)
for (i in 1:nsims)
{
  bmd = rnorm(1000, mean=1.00, sd=0.12) 
  mean.bmd[i] = mean(bmd)
}
mean(mean.bmd)
## [1] 1.000015
sd(mean.bmd)
## [1] 0.003768532
ggplot() + geom_histogram(aes(x=mean.bmd), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
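
The standard deviation of the simulated means agrees closely with the theoretical standard error of the mean, sd/sqrt(n), computed from the parameters used above:

# theoretical standard error of the mean: 0.12/sqrt(1000), about 0.0038
0.12/sqrt(1000)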

Simulating the Central Limit Theorem (CLT)

Let us start with a simple exponential distribution with lambda = 0.2. We will generate 30 observations:

x = rexp(30, 0.2)
mean(x)
## [1] 3.993741
hist(x, col="blue", border="white")

So, x is not normally distributed; it is right-skewed. Its theoretical mean is 1/lambda = 5, and the sample mean here is around 4.

The CLT states that if we have a population with mean mu and standard deviation sd and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal with mean mu and standard deviation sd/sqrt(n). This is true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30 is quoted as a rule of thumb).

Let us illustrate this theorem by R simulation. In the example below, we generate 1000 samples, each with n = 30 observations from the exponential distribution with lambda = 0.2, and compute the mean of each sample:

set.seed(1234) # for reproducibility
mean.x = rep(NA, 1000)
n = 30
lambda = 0.2 
for (i in 1:1000) {
  x = rexp(n, lambda)
  mean.x[i] = mean(x)
}
ggplot() + geom_histogram(aes(x=mean.x), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now, as can be seen, the distribution of the sample means is approximately normal, even though the underlying population is exponentially distributed, exactly as the CLT predicts.
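
We can also compare the simulated mean and standard deviation of the sample means with what the CLT predicts (a quick check; the exact numbers depend on the seed):

# CLT prediction for exponential(lambda = 0.2): mean = 1/lambda = 5,
# sd of the sample mean = (1/lambda)/sqrt(n) = 5/sqrt(30), about 0.91
mean(mean.x)
sd(mean.x)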

Simulating a linear regression model

The simple linear regression model states that the relationship between y (the dependent variable) and x (the independent variable) is: y = a + b*x + e, where a is the intercept, b is the slope, and e is an independent, normally distributed error term. We can simulate this model easily in R.

# generating 100 normal observations from a population with mean=50 and sd=9 
x = rnorm(100, 50, 9) 
# generating 100 error observations from a population with mean=0 and sd=10 
error = rnorm(100, 0, 10)
# generating the dependent variable y = 150 - 4*x + error (intercept=150, slope=-4)
y = 150 - 4*x + error

Now let us check the model. The estimated slope is not exactly -4 and the intercept deviates slightly from 150, because of the random error term in the model:

ggplot() + geom_point(aes(x=x, y=y), pch=16, col="blue")

model = lm(y ~ x)
summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.7923  -6.3434   0.3619   6.5473  19.3064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 147.6651     5.4444   27.12   <2e-16 ***
## x            -3.9562     0.1059  -37.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.813 on 98 degrees of freedom
## Multiple R-squared:  0.9344, Adjusted R-squared:  0.9337 
## F-statistic:  1396 on 1 and 98 DF,  p-value: < 2.2e-16
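
As a further check (a small sketch), the 95% confidence intervals for the fitted coefficients should usually cover the true values of 150 and -4:

# 95% confidence intervals for the intercept and slope
confint(model)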

In the above example, we simulated a simple model with only one predictor. We can also simulate a model with multiple predictors. In the example below, we simulate a model with two predictors, one of which is a categorical variable.

set.seed(1234)
ngroups = 2
nrep = 50 
b0 = 5; b1 = -2; sd = 2

# each group has 50 observations 
x = rnorm(ngroups*nrep)
group = rep(c("A", "B"), each=nrep)
error = rnorm(n=ngroups*nrep, mean=0, sd=sd)
y = b0 + 0.5*x + b1*(group == "A") + error 

summary(lm(y ~ x + group))
## 
## Call:
## lm(formula = y ~ x + group)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8283 -1.2186  0.0342  1.2138  5.9012 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0078     0.3109   9.675 6.74e-16 ***
## x             0.4290     0.2183   1.965   0.0523 .  
## groupB        2.1272     0.4364   4.874 4.26e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.084 on 97 degrees of freedom
## Multiple R-squared:  0.2734, Adjusted R-squared:  0.2584 
## F-statistic: 18.25 on 2 and 97 DF,  p-value: 1.87e-07

We can wrap the above code in a function and then use the base R function replicate to run it many times; the purrr and broom packages will then help us summarise the fitted models:

sim = function(nrep=50, b0=5, b1=-2, sigma=2) {
     ngroups = 2
     x = rnorm(ngroups*nrep)
     group = rep(c("A", "B"), each=nrep)
     error = rnorm(n=ngroups*nrep, mean=0, sd=sigma)
     y = b0 + 0.5*x + b1*(group == "A") + error 
     
     simdat = data.frame(y, x, group)
     fit = lm(y ~ x + group, data=simdat)
     fit
}
sims = replicate(n=1000, sim(nrep=50), simplify=F)

Now, let us have a look at the distribution of the estimated coefficient for groupB. Its true value is 2, because the b1 = -2 effect is applied to group A and group A is the reference level:

sims %>% 
  map_df(tidy) %>% 
  filter(term == "groupB") %>% 
  ggplot(aes(x = estimate)) + 
    geom_density(fill = "blue", alpha = 0.5) + 
    geom_vline(xintercept = 2)
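
We can also summarise the estimates numerically with the same purrr/broom pipeline (a small sketch); the mean of the estimates should be close to the true value of 2:

sims %>% 
  map_df(tidy) %>% 
  filter(term == "groupB") %>% 
  summarise(mean_estimate = mean(estimate), sd_estimate = sd(estimate))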

Simulating a t-test

We can use simulation to learn about the behavior of an unpaired t-test for comparing two groups. Below, we repeatedly draw two samples (n1 = 10 and n2 = 15) from normal populations whose means differ by 5, and store the 95% confidence interval for the difference in means from each test.

nsims = 1000 
n1 = 10; n2 = 15
ci = matrix(NA, ncol=2, nrow=nsims)
for (i in 1:nsims) {
  sample1 = rnorm(n1, mean=10, sd=5)
  sample2 = rnorm(n2, mean=15, sd=5)
  ttest = t.test(sample1, sample2)
  ci[i, ] = ttest$conf.int   # 95% CI for the difference in means
  rm(ttest, sample1, sample2)
}

ci = as.data.frame(ci)
names(ci)=c("LowerCI", "UpperCI")
ci$sn = 1:nrow(ci)

ggplot(ci) + theme_bw() + geom_linerange(aes(x=sn, ymin=LowerCI, ymax=UpperCI)) + geom_hline(yintercept=0, linetype="dashed") 
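
Since the true difference in means is 5, most of the intervals should exclude zero. The proportion that do so estimates the power of the test under these sample sizes (a quick sketch based on the simulated intervals above):

# proportion of 95% CIs that exclude zero = empirical power
mean(ci$UpperCI < 0 | ci$LowerCI > 0)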

Simulating a logistic regression model

In the example below, we simulate a logistic regression model with two predictor variables (age and sex). Age is drawn uniformly between 18 and 80, and sex has an approximate 50:50 male:female distribution. The simulated study has 100 observations. On the log-odds scale, the intercept is -9 and the coefficients associated with sex and age are 3.5 and 0.2, respectively.

set.seed(1234)
sex = sample(c(0, 1), size=100, replace=T)   # sex coded 0/1, approx. 50:50 split
age = round(runif(100, 18, 80))              # age uniform between 18 and 80
linear = -9 + 3.5*sex + 0.2*age              # linear predictor on the log-odds scale
p = 1/(1+exp(-linear))                       # convert log-odds to probability
y = rbinom(n = 100, size=1, prob=p)          # simulate the binary outcome
df = data.frame(age, sex, y)

# Checking the simulation
fit = glm(y ~ sex + age, family = "binomial", data=df)
summary(fit)
## 
## Call:
## glm(formula = y ~ sex + age, family = "binomial", data = df)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.44839   0.00413   0.04476   0.26091   1.93797  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.08429    2.56033  -3.939 8.19e-05 ***
## sex           3.71439    1.19314   3.113  0.00185 ** 
## age           0.23290    0.05528   4.213 2.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 105.382  on 99  degrees of freedom
## Residual deviance:  36.592  on 97  degrees of freedom
## AIC: 42.592
## 
## Number of Fisher Scoring iterations: 7
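
As a final check (a small sketch using broom, which is already loaded), we can line up the fitted coefficients against the true values used in the simulation; with only 100 observations they will not match exactly, but they should be reasonably close:

# compare estimates with the true values (intercept = -9, sex = 3.5, age = 0.2)
tidy(fit) %>% mutate(true = c(-9, 3.5, 0.2))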