Simulation

Generate Random Numbers

From Uniform Distribution runif

The generic command for runif is as follows: runif(n, min = 0, max = 1). We need to define the number of draws (n) and the minimum and maximum values of the distribution using min and max. The defaults are min = 0 and max = 1.

runif(5, min = 100, max = 150)
## [1] 142.6100 134.3366 145.4520 147.3470 146.4783
set.seed(971)
runif(5)
## [1] 0.06579037 0.55917226 0.40300621 0.51901064 0.50780445
set.seed(971)
runif(5, min = 0, max = 1)
## [1] 0.06579037 0.55917226 0.40300621 0.51901064 0.50780445
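As a quick sanity check, the sample mean of many uniform draws should be close to (min + max)/2, and every draw should fall inside [min, max]; a minimal sketch:

x <- runif(1e5, min = 100, max = 150)
mean(x)  # should be close to (100 + 150)/2 = 125
range(x) # all draws lie within [100, 150]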
Let's generate two independent uniform variables and plot them:

library(ggplot2) # for ggplot()
library(ggExtra) # for ggMarginal()

mydata=data.frame(a=runif(1000, min = 0, max = 1),
                  b=runif(1000, min = 0, max = 1))
fig1<-ggplot(data=mydata)+aes(x=a,y=b)+
  geom_point()+geom_rug()+
  geom_smooth(method="lm", se=FALSE)+
  labs(title="Uniform")+theme_minimal()

fig1<-ggMarginal(fig1, type = "histogram", fill="transparent")
fig1

From Normal Distribution rnorm

The generic command for rnorm is as follows: rnorm(n, mean = 0, sd = 1). We need to define the number of draws (n) and the mean and standard deviation of the distribution using mean and sd. The defaults are mean = 0 and sd = 1, i.e., a standard normal distribution. To get reproducible random numbers, we need to set the seed via set.seed().

rnorm(5, mean = 100, sd = 4)
## [1] 100.7370 102.1424 104.0383 103.9875 103.3148
set.seed(971)
rnorm(5)
## [1] -1.50789755 -0.24557347  0.01956411 -0.84374377  1.46568186
set.seed(971)
rnorm(5, mean = 0, sd = 1)
## [1] -1.50789755 -0.24557347  0.01956411 -0.84374377  1.46568186

For mean and standard deviation, we can use a vector of values, which R recycles across draws, as follows:

rnorm(10, mean = c(0,10, 1000), sd = c(1,50,200))
##  [1]   -0.1899982  -22.3862215 1314.3204961    0.6631383    4.9611506
##  [6]  873.7887468   -0.7996139   -2.2557492 1165.2253458    0.7635781
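Note the recycling: draws 1, 4, 7, and 10 use mean = 0 and sd = 1, draws 2, 5, and 8 use mean = 10 and sd = 50, and so on. A minimal sketch makes the recycling easy to see:

rnorm(6, mean = c(0, 1000)) # draws alternate between N(0, 1) and N(1000, 1)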

Let’s generate two unrelated variables and plot them:

mydata=data.frame(a=rnorm(1000, mean = 0, sd = 1),
                  b=rnorm(1000, mean = 0, sd = 1))
fig2<-ggplot(data=mydata)+aes(x=a,y=b)+
  geom_point()+geom_rug()+
  geom_smooth(method="lm", se=FALSE)+
  labs(title="Normal")+theme_minimal()

fig2<-ggMarginal(fig2, type = "histogram", fill="transparent")
fig2

From Bernoulli and Binomial Distribution rbinom

The rbinom function can be used to simulate the outcome of Bernoulli trials (a fancy statistical term for flipping coins). You can use it to count the number of successes in a set of pass/fail trials, where each trial succeeds with probability prob. R's rbinom function simulates a series of Bernoulli trials and returns the results. The generic command is as follows:

rbinom(n, size, prob)

  • n: Number of observations you want to see
  • size: Number of trials per observation
  • prob: probability of success for each trial

If size=1, we have a Bernoulli distribution:

rbinom(100, size=1, prob = 0.3)
##   [1] 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0
##  [38] 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0
##  [75] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0

If size>1, we have a binomial distribution:

rbinom(100, size=100, prob = 0.3)
##   [1] 25 36 31 33 32 27 29 30 39 31 26 37 30 35 27 31 29 31 26 31 28 28 33 27 24
##  [26] 17 38 24 41 37 32 30 33 40 25 28 33 24 28 31 24 27 29 36 30 29 32 24 26 29
##  [51] 21 27 38 35 28 30 26 30 22 30 31 39 38 30 30 29 26 25 34 25 24 31 29 31 19
##  [76] 29 27 22 35 29 33 36 35 25 27 33 22 38 31 27 26 33 32 26 31 29 25 20 30 26
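As a sanity check, the mean of many Bernoulli draws should approach prob, and the mean of many binomial draws should approach size * prob; a minimal sketch:

mean(rbinom(1e5, size = 1, prob = 0.3))   # should be close to 0.3
mean(rbinom(1e5, size = 100, prob = 0.3)) # should be close to 100 * 0.3 = 30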
mydata=data.frame(a=rbinom(1000, size=100, prob = 0.3),
                  b=rbinom(1000, size=100, prob = 0.5))
fig3<-ggplot(data=mydata)+aes(x=a,y=b)+
  geom_jitter()+geom_rug()+
  geom_smooth(method="lm", se=FALSE)+
  labs(title="Binomial")+theme_minimal()

fig3<-ggMarginal(fig3, type = "density", fill="transparent")
fig3

From Poisson Distribution rpois

We can generate non-negative integer counts (including 0) from a Poisson distribution with rpois().

The generic command for rpois is as follows: rpois(n, lambda). Here lambda is the mean (and also the variance) of the distribution, and we define it.

rpois(5, lambda = 100)
## [1] 101 104  95 108  98
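A defining property of the Poisson distribution is that its mean and variance both equal lambda; a quick sketch confirms this with simulated draws:

x <- rpois(1e5, lambda = 5)
mean(x) # should be close to 5
var(x)  # should also be close to 5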
mydata=data.frame(a=rpois(1000, lambda=5),
                  b=rpois(1000, lambda=5))
fig4<-ggplot(data=mydata)+aes(x=a,y=b)+
  geom_jitter()+geom_rug()+
  geom_smooth(method="lm", se=FALSE)+
  labs(title="Poisson")+theme_minimal()

fig4<-ggMarginal(fig4, type = "density", fill="transparent")
fig4

Repeatedly simulating with replicate()

Let’s say we want to simulate some values from a normal distribution, which we can do using the rnorm() function as above. But now, instead of drawing values from a distribution once, we want to do it many times.

set.seed(971)
replicate(n = 3, 
          expr = rnorm(n = 5, mean = 0, sd = 1), 
          simplify = FALSE )
## [[1]]
## [1] -1.50789755 -0.24557347  0.01956411 -0.84374377  1.46568186
## 
## [[2]]
## [1] -0.1899982 -0.6477244  1.5716025  0.6631383 -0.1007770
## 
## [[3]]
## [1] -0.6310563 -0.7996139 -0.2451150  0.8261267  0.7635781
set.seed(971)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
  list1[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1
## [[1]]
## [1] -1.50789755 -0.24557347  0.01956411 -0.84374377  1.46568186
## 
## [[2]]
## [1] -0.1899982 -0.6477244  1.5716025  0.6631383 -0.1007770
## 
## [[3]]
## [1] -0.6310563 -0.7996139 -0.2451150  0.8261267  0.7635781
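Because both approaches draw the same random numbers after the same seed, they should produce identical lists; a minimal check:

set.seed(971)
rep_out <- replicate(n = 3, expr = rnorm(n = 5, mean = 0, sd = 1), simplify = FALSE)
set.seed(971)
loop_out <- list()
for (i in 1:3) loop_out[[i]] <- rnorm(n = 5, mean = 0, sd = 1)
identical(rep_out, loop_out) # TRUE: same seed, same draws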

Now let’s say I create the following dataset:

group <- rep(c("Male","Female"), times=3) # we repeat "M, F" three times.
response <- rnorm(n = 6, mean = c(5,10), sd = 1)
mydata<-data.frame(group,
           response)
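Because group alternates Male/Female and the mean vector c(5, 10) is recycled in the same order, male responses are drawn from N(5, 1) and female responses from N(10, 1). A quick check of the group means (rough, since there are only three observations per group):

tapply(mydata$response, mydata$group, mean) # Female near 10, Male near 5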

Now I want to repeatedly simulate this dataset:

sim<-  replicate(n = 3, 
          expr = data.frame(group =  rep(c("Male","Female"), times=3),
                            response = rnorm(n = 6, mean = c(5,10), sd = 1) ),
          simplify = FALSE)
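sim is a list of three data frames. Individual replicates can be pulled out with [[ ]], or stacked into a single data frame with a replicate id; a sketch:

library(purrr) # map_dfr() binds list elements row-wise

head(sim[[1]])                            # the first simulated dataset
allsims <- map_dfr(sim, ~.x, .id = "rep") # stack the replicates, labeled 1-3
table(allsims$rep)                        # six rows per replicate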

Does linear regression recover true intercept and slope?

OK, now we can simulate a regression:

twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     ngroup = 2
     gender = rep( c("Male", "Female"), each = nrep) # nrep males, then nrep females
     eps = rnorm(ngroup*nrep, 0, sigma)              # random error term
     ReadingScore = b0 + b1*(gender == "Male") + eps # the true data-generating model
     ReadingScoreFit = lm(ReadingScore ~ gender)     # fit the regression
     ReadingScoreFit
}

twogroup_fun()
## 
## Call:
## lm(formula = ReadingScore ~ gender)
## 
## Coefficients:
## (Intercept)   genderMale  
##       4.941       -1.208

I will repeat the data generation and model fitting many times using replicate(). The result is a list of fitted models. I’ll run the function 1000 times:

library(purrr) # for map_dfr(); broom is called via broom::tidy()

a<-replicate(1000, expr=twogroup_fun(nrep = 10), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+
  facet_grid(cols=vars(term))+theme_minimal()+xlim(-5,7.5)
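We can also summarize the sampling distribution numerically: across the 1000 fits, the average estimates should sit near the true values b0 = 5 and b1 = -2. A sketch using base aggregate():

aggregate(estimate ~ term, data = mydat, FUN = mean) # averages near 5 and -2
aggregate(estimate ~ term, data = mydat, FUN = sd)   # spread of the estimates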

Changing the sample size

Let’s increase nrep to 1000:

a<-replicate(1000, expr=twogroup_fun(nrep = 1000), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+
  facet_grid(cols=vars(term))+theme_minimal()+xlim(-5,7.5)
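The densities are noticeably narrower now. Comparing the standard deviations of the estimates confirms that a larger sample size shrinks the sampling variability; a sketch:

aggregate(estimate ~ term, data = mydat, FUN = sd) # much smaller than with nrep = 10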

Changing the standard deviation of the error term

Let’s increase sigma from 2 to 20:

a<-replicate(1000, expr=twogroup_fun(nrep = 1000, sigma = 20), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+
  facet_grid(cols=vars(term))+theme_minimal()+xlim(-5,7.5)

Significance of Intercept and Slope Due to Chance

twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
     ngroup = 2
     gender = rep( c("Male", "Female"), each = nrep)
     ReadingScore = rnorm(ngroup*nrep, 0, sigma) # now a completely random vector: gender has no effect (b0 and b1 go unused)
     ReadingScoreFit = lm(ReadingScore ~ gender)
     ReadingScoreFit
}

a<-replicate(1000, expr=twogroup_fun(nrep = 100, sigma = 2), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+
  facet_grid(cols=vars(term))+theme_minimal()

ggplot(data=mydat)+geom_density(aes(x=p.value))+
  facet_grid(cols=vars(term))+theme_minimal()

# Proportion of regressions with a significant intercept (at the 0.01 level)
sum(mydat$p.value[mydat$term=="(Intercept)"]<0.01)/1000
## [1] 0.009
# Proportion of regressions with a significant slope (at the 0.01 level)
sum(mydat$p.value[mydat$term=="genderMale"]<0.01)/1000
## [1] 0.009
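Because the outcome is pure noise, we should see rejections at any threshold alpha in roughly a fraction alpha of the simulations. Checking the conventional 0.05 level (a sketch):

# Proportion of regressions with a significant slope at the 0.05 level; expect about 0.05
sum(mydat$p.value[mydat$term=="genderMale"]<0.05)/1000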

Permutation Tests

The t-test assumes equal variances, random sampling, random assignment, and normality; the permutation test does not require these assumptions.

set.seed(971) ## for reproducibility

nrep=100
gender = rep( c("Male", "Female"), times = nrep)
score = rnorm(2*nrep, mean=c(10,14), sd=c(2,4) )
data<-data.frame(gender,score)

ggplot(data)+aes(x=gender,y=score)+ geom_boxplot()+theme_minimal()

Let’s do permutation test:

set.seed(971) ## for reproducibility
nrep=5
gender = rep( c("Male", "Female"), times = nrep)
score = rnorm(2*nrep, mean=c(10,14), sd=c(2,4))
data<-data.frame(gender,score)

mypermut = function() {
    perm <- sample(nrow(data))     # random permutation of the row indices
    bdat <- data
    bdat$score <- data$score[perm] # shuffle the scores across the gender labels
    mean(bdat$score[bdat$gender=="Male"]) -
      mean(bdat$score[bdat$gender=="Female"])
}

permresults<-replicate(1000, expr=mypermut(), simplify=FALSE)
mydat<-data.frame(x = unlist(permresults)) # collect the permuted mean differences into a column named x

observed<-mean(data[data$gender=="Male","score"])-
        mean(data[data$gender=="Female","score"])

ggplot(data=mydat)+aes(x=x)+geom_histogram()+
  geom_vline(xintercept = observed, color="red")+theme_minimal()

Our observed mean difference appears to be quite extreme relative to the distribution of mean differences we would observe if the outcome were independent of group assignment. The p-value is the probability of obtaining a result at least as extreme as the observed test statistic when the null hypothesis is true. We obtain the p-value for our mean difference by counting how many permuted mean differences are at least as extreme as the one we observed in our actual data, and then dividing by the number of permutations we requested in replicate(1000, ...):

sum(mydat$x<=observed)/nrow(mydat)
## [1] 0.016
sum(abs(mydat$x)>abs(observed))/nrow(mydat)
## [1] 0.04

We find that the proportion of permuted differences at least as extreme as -4.1561031 is 0.016, which is below our significance threshold of 0.05. Therefore, we reject the null hypothesis. In other words, the probability of observing a difference of -4.1561031 between the two groups due to random chance alone is about 0.016.
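As a cross-check, a Welch two-sample t-test on the same ten observations should give a broadly comparable, though not identical, p-value; a sketch:

t.test(score ~ gender, data = data) # compare its p-value with the permutation result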