library(ggplot2)
library(tidyverse)
library(purrr)
library(broom)
Simulation is a very useful way to understand the behavior of a statistical model or parameter, and R is a great environment for data simulation. In this note, I would like to share with you some simple R code for simulating data.
Let us start with a simple binomial simulation. We know that slightly more baby boys than girls are born; the proportion of boys is around 51.2%. We can simulate a sample of 1000 births using the rbinom function as follows:
nboys = rbinom(n=1, size=1000, prob=0.512)  # one simulated count of boys in 1000 births
nboys
## [1] 529
which shows how many boys we could observe in 1000 births. If we repeat this simulation multiple times, the number of boys will differ each time, but the average proportion remains 51.2%. Now, let us repeat the simulation 10,000 times:
nsims = 10000
nboys = rep(NA, nsims)
for (i in 1:nsims) {
  nboys[i] = rbinom(1, 1000, 0.512)
}
mean(nboys)
## [1] 512.1432
hist(nboys, col="blue", border="white")
As can be seen, if we take 10,000 samples of 1000 births each, the number of boys mostly ranges between 470 and 550, with the mean being about 512.
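As an aside, rbinom is vectorized, so the loop above can be replaced by a single call that draws all 10,000 counts at once:
nboys = rbinom(nsims, size=1000, prob=0.512)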
A normally distributed variable has mean mu and standard deviation sd. In R, we can easily simulate such a random variable using the rnorm function. In the example below, we take a random sample of 1000 people from a population whose mean bone mineral density (BMD) is 1.00, with sd = 0.12.
bmd = rnorm(n=1000, mean=1.00, sd=0.12)
mean(bmd)
## [1] 0.9948937
ggplot() + geom_histogram(aes(x=bmd), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As can be seen, the observed BMD values mostly range between 0.8 and 1.2, with the mean being 1.0 as expected. Now, let us repeat the simulation 10,000 times to see how the mean of BMD behaves.
nsims = 10000
mean.bmd = rep(NA, nsims)
for (i in 1:nsims) {
  bmd = rnorm(1000, mean=1.00, sd=0.12)
  mean.bmd[i] = mean(bmd)
}
mean(mean.bmd)
## [1] 1.000015
sd(mean.bmd)
## [1] 0.003768532
ggplot() + geom_histogram(aes(x=mean.bmd), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
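Note that the standard deviation of the 10,000 sample means closely matches the theoretical standard error of the mean, sd/sqrt(n):
0.12 / sqrt(1000)
## [1] 0.003794733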
Next, consider a skewed distribution: a simple exponential distribution with rate lambda = 0.2. We will generate 30 observations:
x = rexp(30, 0.2)
mean(x)
## [1] 3.993741
hist(x, col="blue", border="white")
So, x is clearly not normally distributed. Its theoretical mean is 1/lambda = 5, and the sample mean here is about 4.
The CLT states that if we have a population with mean mu and standard deviation sd and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal. This holds regardless of whether the source population is normal or skewed, provided the samples are large enough (usually n > 30).
Let us demonstrate this theorem by R simulation. In the example below, we generate 1000 samples of n = 30 observations each from the exponential distribution above and compute the mean of each sample:
set.seed(1234) # for reproducibility
n = 30
lambda = 0.2
mean.x = rep(NA, 1000)
for (i in 1:1000) {
  x = rexp(n, lambda)
  mean.x[i] = mean(x)
}
ggplot() + geom_histogram(aes(x=mean.x), fill="blue", col="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, as can be seen, the distribution of the sample means is approximately normal, even though the underlying population is exponentially distributed. So, the CLT holds.
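We can check normality more formally with a normal quantile-quantile plot; here is a minimal sketch using ggplot2's geom_qq, in which the points should fall close to the reference line:
ggplot() + geom_qq(aes(sample=mean.x)) + geom_qq_line(aes(sample=mean.x))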
Simple linear regression states that the relationship between y (the dependent variable) and x (the independent variable) is y = a + bx + e, where a is the intercept, b is the slope, and e is an independent, normally distributed error term. We can simulate this equation easily in R.
# generating 100 normal observations from a population with mean=50 and sd = 9
x = rnorm(100, 50, 9)
# generating 100 error observations from a population with mean=0 and sd = 10
error = rnorm(100, 0, 10)
# generating dependent variable y = 150 - 4*x + error (intercept=150 and slope=-4)
y = 150 - 4*x + error
Now let us check the model. The estimated slope will not be exactly -4, and the intercept will deviate slightly from 150, because of the random error term in the model:
ggplot() + geom_point(aes(x=x, y=y), pch=16, col="blue")
model = lm(y ~ x)
summary(model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.7923 -6.3434 0.3619 6.5473 19.3064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 147.6651 5.4444 27.12 <2e-16 ***
## x -3.9562 0.1059 -37.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.813 on 98 degrees of freedom
## Multiple R-squared: 0.9344, Adjusted R-squared: 0.9337
## F-statistic: 1396 on 1 and 98 DF, p-value: < 2.2e-16
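As a further check, the 95% confidence intervals for the coefficients should cover the true values of 150 and -4 in most runs (no seed was set above, so the exact intervals will vary):
confint(model)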
In the above example, we simulated a simple model with only one predictor. We can also generate a model with multiple predictors. In the example below, we simulate a model with two predictors, one of which is categorical.
set.seed(1234)
ngroups = 2
nrep = 50  # each group has 50 observations
b0 = 5; b1 = -2; sigma = 2
x = rnorm(ngroups*nrep)
group = rep(c("A", "B"), each=nrep)
error = rnorm(n=ngroups*nrep, mean=0, sd=sigma)
y = b0 + 0.5*x + b1*(group == "A") + error
summary(lm(y ~ x + group))
##
## Call:
## lm(formula = y ~ x + group)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8283 -1.2186 0.0342 1.2138 5.9012
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0078 0.3109 9.675 6.74e-16 ***
## x 0.4290 0.2183 1.965 0.0523 .
## groupB 2.1272 0.4364 4.874 4.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.084 on 97 degrees of freedom
## Multiple R-squared: 0.2734, Adjusted R-squared: 0.2584
## F-statistic: 18.25 on 2 and 97 DF, p-value: 1.87e-07
We can create a function based on the above code and then use the base R function replicate to run it many times. Note that because b1 = -2 is added only to observations in group A, group A is the reference level in the fitted model, and the groupB coefficient reported by lm estimates -b1 = 2:
sim = function(nrep=50, b0=5, b1=-2, sigma=2) {
  ngroups = 2
  x = rnorm(ngroups*nrep)
  group = rep(c("A", "B"), each=nrep)
  error = rnorm(n=ngroups*nrep, mean=0, sd=sigma)
  y = b0 + 0.5*x + b1*(group == "A") + error
  simdat = data.frame(y, x, group)
  fit = lm(y ~ x + group, data=simdat)
  fit
}
sims = replicate(n=1000, sim(nrep=50), simplify=FALSE)
Now, let us have a look at the distribution of the estimated coefficient for group:
sims %>%
  map_df(tidy) %>%
  filter(term == "groupB") %>%
  ggplot(aes(x=estimate)) +
  geom_density(fill="blue", alpha=0.5) +
  geom_vline(xintercept=2)
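The density should be centered near 2, the negative of b1. We can also summarize the sampling distribution numerically (a sketch; the exact values vary from run to run):
sims %>%
  map_df(tidy) %>%
  filter(term == "groupB") %>%
  summarise(mean = mean(estimate), sd = sd(estimate))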
We can use simulation to learn about the behavior of an unpaired t-test comparing two groups. In the example below, we draw samples from two normal populations whose means differ by 5 and record the 95% confidence interval for the difference from each test:
nsims = 1000
n1 = 10; n2 = 15
ci = matrix(NA, ncol=2, nrow=nsims)  # holds the 95% CI from each simulation
for (i in 1:nsims) {
  sample1 = rnorm(n1, mean=10, sd=5)
  sample2 = rnorm(n2, mean=15, sd=5)
  ttest = t.test(sample1, sample2)
  ci[i, ] = ttest$conf.int
  rm(ttest, sample1, sample2)
}
ci = as.data.frame(ci)
names(ci)=c("LowerCI", "UpperCI")
ci$sn = 1:nrow(ci)
ggplot(ci) +
  theme_bw() +
  geom_linerange(aes(x=sn, ymin=LowerCI, ymax=UpperCI)) +
  geom_hline(yintercept=0, linetype="dashed")
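Since the true difference in means is -5, most intervals should exclude zero. The proportion of intervals that exclude zero estimates the power of the test under this design (the exact value varies from run to run):
mean(ci$UpperCI < 0 | ci$LowerCI > 0)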
In the example below, we simulate a logistic regression model with two predictor variables (age and sex). The variable 'age' is uniformly distributed between 18 and 80, and sex has an approximate 50:50 male:female split. The simulated 'study' has 100 observations. On the log-odds scale, the intercept is -9 and the coefficients associated with sex and age are 3.5 and 0.2, respectively.
set.seed(1234)
sex = sample(c(0, 1), size=100, replace=TRUE)  # roughly 50:50 males:females
age = round(runif(100, 18, 80))                # uniform ages between 18 and 80
linear = -9 + 3.5*sex + 0.2*age                # linear predictor on the log-odds scale
p = 1/(1 + exp(-linear))                       # inverse logit gives the probability
y = rbinom(n=100, size=1, prob=p)              # simulate binary outcomes
df = data.frame(age, sex, y)
# Checking the simulation
fit = glm(y ~ sex + age, family = "binomial", data=df)
summary(fit)
##
## Call:
## glm(formula = y ~ sex + age, family = "binomial", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44839 0.00413 0.04476 0.26091 1.93797
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.08429 2.56033 -3.939 8.19e-05 ***
## sex 3.71439 1.19314 3.113 0.00185 **
## age 0.23290 0.05528 4.213 2.52e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 105.382 on 99 degrees of freedom
## Residual deviance: 36.592 on 97 degrees of freedom
## AIC: 42.592
##
## Number of Fisher Scoring iterations: 7
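The estimated coefficients are close to the true values of -9, 3.5, and 0.2. To interpret them on the odds-ratio scale, we can simply exponentiate them:
exp(coef(fit))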