runifThe generic command for rnorm is as follows: runif(n, min = 0, max = 1) We need to define the number of draws (n), mean and minimum and maximum values of distribution using min and max. Default is min=0 and max=1.
## [1] 142.6100 134.3366 145.4520 147.3470 146.4783
## [1] 0.06579037 0.55917226 0.40300621 0.51901064 0.50780445
## [1] 0.06579037 0.55917226 0.40300621 0.51901064 0.50780445
mydata=data.frame(a=runif(1000, min = 0, max = 1),
b=runif(1000, min = 0, max = 1)
)
fig1<-ggplot(data=mydata)+aes(x=a,y=b)+geom_point()+geom_rug()+geom_smooth(method="lm", se=F)+labs(title="Uniform")+theme_minimal()
fig1<-ggMarginal(fig1, type = "histogram", fill="transparent")
fig1rnormThe generic command for rnorm is as follows: rnorm(n, mean = 0, sd = 1) We need to define the number of draws (n), mean and standard deviation for our distribution using mean and sd. Default is mean=0 and sd=1. So, we have a standard normal distribution. To get reproducible random numbers, we need to set the seed via set.seed()
## [1] 100.7370 102.1424 104.0383 103.9875 103.3148
## [1] -1.50789755 -0.24557347 0.01956411 -0.84374377 1.46568186
## [1] -1.50789755 -0.24557347 0.01956411 -0.84374377 1.46568186
For meanand standard deviation, we can use a vector of values as follows:
## [1] -0.1899982 -22.3862215 1314.3204961 0.6631383 4.9611506
## [6] 873.7887468 -0.7996139 -2.2557492 1165.2253458 0.7635781
Let’s generate two unrelated variables and plot them:
mydata=data.frame(a=rnorm(1000, mean = 0, sd = 1),
b=rnorm(1000, mean = 0, sd = 1)
)
fig2<-ggplot(data=mydata)+aes(x=a,y=b)+geom_point()+geom_rug()+geom_smooth(method="lm", se=F)+labs(title="Normal")+theme_minimal()
fig2<-ggMarginal(fig2, type = "histogram", fill="transparent")
fig2rbinom:The rbinom function can be used to simulate the outcome of Bernoulli trials. This is a fancy statistical word for flipping coins. You can use it to calculate the number of successes in a set of pass/fail trials with success estimated at probability p. R’s rbinom function simulates a series of Bernoulli trials and return the results. The generic command is as follows:
rbinom(n, size, prob)
n: Number of observations you want to seesize: Number of trials per observationprob: probability of success for each trialIf size=1, we have a bernoulli distribution:
## [1] 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 1 0 0 0 0 0
## [38] 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0
## [75] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0
If size>1, we have a binomial distribution:
## [1] 25 36 31 33 32 27 29 30 39 31 26 37 30 35 27 31 29 31 26 31 28 28 33 27 24
## [26] 17 38 24 41 37 32 30 33 40 25 28 33 24 28 31 24 27 29 36 30 29 32 24 26 29
## [51] 21 27 38 35 28 30 26 30 22 30 31 39 38 30 30 29 26 25 34 25 24 31 29 31 19
## [76] 29 27 22 35 29 33 36 35 25 27 33 22 38 31 27 26 33 32 26 31 29 25 20 30 26
mydata=data.frame(a=rbinom(1000, size=100, prob = 0.3),
b=rbinom(1000, size=100, prob = 0.5)
)
fig3<-ggplot(data=mydata)+aes(x=a,y=b)+geom_jitter()+geom_rug()+geom_smooth(method="lm", se=F)+labs(title="Binomial")+theme_minimal()
fig3<-ggMarginal(fig3, type = "density", fill="transparent")
fig3rpois:We can generate discrete integers (including 0) from a Poisson distribution with rpois().
The generic command for rnorm is as follows: rpois(n, lambda) lambda is the mean and we define it.
## [1] 101 104 95 108 98
mydata=data.frame(a=rpois(1000, lambda=5),
b=rpois(1000, lambda=5)
)
fig4<-ggplot(data=mydata)+aes(x=a,y=b)+geom_jitter()+geom_rug()+geom_smooth(method="lm", se=F)+labs(title="Poisson")+theme_minimal()
fig4<-ggMarginal(fig4, type = "density", fill="transparent")
fig4Let’s say we want to simulate some values from a normal distribution, which we can do using the rnorm() function as above. But now instead of drawing some number of values from a distribution one time we want to do it many times.
## [[1]]
## [1] -1.50789755 -0.24557347 0.01956411 -0.84374377 1.46568186
##
## [[2]]
## [1] -0.1899982 -0.6477244 1.5716025 0.6631383 -0.1007770
##
## [[3]]
## [1] -0.6310563 -0.7996139 -0.2451150 0.8261267 0.7635781
set.seed(971)
list1 = list() # Make an empty list to save output in
for (i in 1:3) { # Indicate number of iterations with "i"
list1[[i]] = rnorm(n = 5, mean = 0, sd = 1) # Save output in list for each iteration
}
list1## [[1]]
## [1] -1.50789755 -0.24557347 0.01956411 -0.84374377 1.46568186
##
## [[2]]
## [1] -0.1899982 -0.6477244 1.5716025 0.6631383 -0.1007770
##
## [[3]]
## [1] -0.6310563 -0.7996139 -0.2451150 0.8261267 0.7635781
Now let’s say I create the following dataset:
group <- rep(c("Male","Female"), times=3) # we repeat "M, F" three times.
response <- rnorm(n = 6, mean = c(5,10), sd = 1)
mydata<-data.frame(group,
response)Now I want to repeatedly simulate this dataset:
OK, now we can simulate a regression:
twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
gender = rep( c("Male", "Female"), each = nrep)
eps = rnorm(ngroup*nrep, 0, sigma)
ReadingScore = b0 + b1*(gender == "Male") + eps
ReadingScoreFit = lm(ReadingScore ~ gender)
ReadingScoreFit
}
twogroup_fun()##
## Call:
## lm(formula = ReadingScore ~ gender)
##
## Coefficients:
## (Intercept) genderMale
## 4.941 -1.208
I will repeat the data generation and model fitting many times using replicate(). The result is a list of fitted models. I’ll run the function 1000 times:
a<-replicate(1000, exp=twogroup_fun(nrep = 10), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+facet_grid(col=vars(term))+theme_minimal()+xlim(-5,7.5)Let’s increase the nrep=1000
a<-replicate(1000, exp=twogroup_fun(nrep = 1000), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+facet_grid(col=vars(term))+theme_minimal()+xlim(-5,7.5)Let’s increase from sigma=2 to sigma=10:
a<-replicate(1000, exp=twogroup_fun(nrep = 1000,sigma = 20), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+facet_grid(col=vars(term))+theme_minimal()+xlim(-5,7.5)twogroup_fun = function(nrep = 10, b0 = 5, b1 = -2, sigma = 2) {
ngroup = 2
gender = rep( c("Male", "Female"), each = nrep)
ReadingScore = rnorm(ngroup*nrep, 0, sigma) ## Now this is completely a random vector
ReadingScoreFit = lm(ReadingScore ~ gender)
ReadingScoreFit
}
a<-replicate(1000, exp=twogroup_fun(nrep = 100,sigma = 2), simplify=FALSE)
mydat<-map_dfr(a, broom::tidy, .id = "model")
ggplot(data=mydat)+geom_density(aes(x=estimate))+facet_grid(col=vars(term))+theme_minimal()# Percentage of Regressions with a significant intercept
sum(mydat$p.value[mydat$term=="(Intercept)"]<0.01)/1000## [1] 0.009
# Percentage of Regressions with a significant slope
sum(mydat$p.value[mydat$term=="genderMale"]<0.01)/1000## [1] 0.009
t-test assumes both equal variance, random sampling, random assignment, normality, permutation test does not have these assumptions.
set.seed(971) ## for reproducibility
nrep=100
gender = rep( c("Male", "Female"), times = nrep)
score = rnorm(2*nrep, mean=c(10,14), sd=c(2,4) )
data<-data.frame(gender,score)
ggplot(data)+aes(x=gender,y=score)+ geom_boxplot()+theme_minimal()Let’s do permutation test:
set.seed(971) ## for reproducibility
nrep=5
gender = rep( c("Male", "Female"), times = nrep)
score = rnorm(2*nrep, mean=c(10,14), sd=c(2,4))
data<-data.frame(gender,score)
mypermut = function() {
perm <- sample(nrow(data))
bdat <- data
bdat$score<-score[perm]
result<- mean(bdat$score[bdat$gender=="Male"])-
mean(bdat$score[bdat$gender=="Female"])
}
permresults<-replicate(1000, exp=mypermut(), simplify=FALSE)
mydat<-map_dfr(permresults, broom::tidy)
observed<-mean(data[data$gender=="Male","score"])-
mean(data[data$gender=="Female","score"])
ggplot(data=mydat)+aes(x=x)+geom_histogram()+geom_vline(xintercept = observed, color="red")+theme_minimal()Our observed mean-difference appears to be quite extreme in terms of the distribution of possible mean-differences observable were the outcome independent of treatment assignment. The P-value is the probability of obtaining a result at least as extreme as the test statistic when the null hypothesis is true. We obtain the p-value for our mean-difference by counting how many mean-differences have greater magnitude than the one we observed in our actual data. We can then divide this by the number of items in our permutation distribution as we defined in replicate(1000...):
## [1] 0.016
## [1] 0.04
We find that the proportion of values as extreme as -4.1561031 is 0.016, indicating that this is smaller than our p-value threshold of 0.05. Therefore, we reject the null hypothesis. In other words, observing a difference of -4.1561031 between the two groups due to random chance is 0.016.