R allows you to simulate datasets with known statistical properties. In this example, we introduce a simple example using the normal distribution. We begin by simulating normal distributions with different measures of location (mean) and dispersion (standard deviation). After the Student’s t-distribution is introduced, we will use it to test how often a 1000 simulated means differ significantly than the expected value, as a measure of performance of the function used to simulate the data.
R allows you to simulate data based an a number of probability distributions. Some familiar distributions are the normal, poisson, and the binomial, but there are many, many others. Here, we use the normal distribution which should be familiar to anyone who has taken an elementary statistics course.
set.seed (1234) # This fixes the random data as a set point
rdist = rnorm (10) #Assigns an array of 10 values to ‘rdist’
An initial value is introducted with the R code set.seed
so that the numbers generated here will be the same ones generated on your R console, in case you are following along. As in the other introductory document (Part I), the text that follows the hashtag is an informational remark and does not need to be typed into the console.
The full function to generate random, normal deviates is rnorm(n, mean = 0, sd = 1)
, where \(n\) = the number of values in the resulting array (i.e., 10 in this case). The default values for the mean and standard deviation are 0 and 1, respectively. After we introduce this very simple example, we will show the effect of changing \(n\) or the default settings.
In this first example, the sample size is purposefully low so that you can easily view the entire array by tabulating the raw data, summarizing the descriptive statistics of the R object, and plotting a histogram. As in the other introductory document (Part I), the R code is set in a box and followed by output whose lines begin with a double hashtag or a plot.
rdist #Tabulate the data object
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247 0.5060559
## [7] -0.5747400 -0.5466319 -0.5644520 -0.8900378
summary(rdist) #Create descriptive statistics of the data object
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.3460 -0.8112 -0.5555 -0.3832 0.3912 1.0840
hist(rdist) #Plot a histogram of the data object
Since we are using the default values, the expected mean of rdist
is zero and the expected standard deviation is 1. The histogram of rdist
looks rather choppy, so much so that it is not obviously following a normal distribution; however, its small sample size is the problem. Sample size matters in terms of how normally distributed the sample appears.
par (mfrow = c(1,3)) #This is a graphic parameters to put three figures together
hist(rnorm(10), xlab=''); hist(rnorm(100), xlab='Value of the random deviates'); hist(rnorm(1000), xlab='')
As sample size increases, the sample distribution increasingly resembles a normal population distribution. Note that the distribution of rnorm(10)
is different in the figure panel on the left than plotted the previously object rdist
, because a new draw of random deviates has occurred.
The default options rnorm(n, mean = 0, sd = 1)
can be changed.
par (mfrow = c(1,2))
x <- seq(-4, 4, length=100)
plot(x,dnorm(x, mean = -1, sd= 1), type='l', lty=2, xlab='x value', ylab='Density')
lines(x,dnorm(x, mean = 0, sd= 1), type='l', lty=1, xlab='', ylab='')
lines(x,dnorm(x, mean = 1, sd= 1), type='l', lty=3, xlab='', ylab='')
legend("topleft", inset=0.05, title="Mean value", c("-1", " 0", " 1"), lty=c(2,1,3), cex=0.5)
plot(x,dnorm(x, mean = 0, sd= 1), type='l', lty=1, xlab='x value', ylab='Density')
lines(x,dnorm(x, mean = 0, sd= 1.5), type='l', lty=2, xlab='', ylab='')
lines(x,dnorm(x, mean = 0, sd= 2), type='l', lty=3, xlab='', ylab='')
legend("topleft", inset=0.05, title="SD value", c("1", "1.5", "2"), lty=c(1,2,3), cex=0.5)
On the left, the mean is varied from -1 to 1, while holding the sd constant at 1. On the right, the sd is varied from 1 to 2, while holding the mean constant at 0. In both panels, the default is the solid line.
In most cases, a sample mean generated by the default values of rnorm()
function will not be significantly different than zero, which is the expected mean. Nonetheless, there is some probability that even a randomly drawn set of values will be different than zero. We can test if this is true for our original sample rdist
using a Student’s t-test. The t-test should be familiar to anyone who has taken an elementary statistics course.
t.result = t.test (rdist, mu=0) #Explicitly tests against a value of zero
summary (t.result) #You can list all of the output using the summary function
## Length Class Mode
## statistic 1 -none- numeric
## parameter 1 -none- numeric
## p.value 1 -none- numeric
## conf.int 2 -none- numeric
## estimate 1 -none- numeric
## null.value 1 -none- numeric
## alternative 1 -none- character
## method 1 -none- character
## data.name 1 -none- character
t.result #This is how R packages the output
##
## One Sample t-test
##
## data: rdist
## t = -1.2168, df = 9, p-value = 0.2546
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -1.0955009 0.3291861
## sample estimates:
## mean of x
## -0.3831574
The output provides a t-statistic (-1.2167757), and relative to the degrees of freedom (9), it generates a p-value (0.2546342). It states the alternative to the null hypothesis, \(H_o\), confirming that this sample mean (-0.3831574) will be compared to an expected value of 0. A measure of confidence states that the sample mean lies somewhere between -1.0955009 and 0.3291861, which overlaps with the expected value (0), with 95% certainty.
When using a falsification approach to statistical inference, one sets some probability threshold, \(\alpha\), to accept or reject the the null hypothsis. At a value of \(\alpha\) = 0.05, which is a particularly common threshold, a sample mean of -0.3831574 is not different than zero because its corresponding p-value, 0.2546342, is greater than 0.05. You can set different values of \(\alpha\) but the value itself reflects how often you are willing to be wrong, in this case, given that the rnorm()
function works correctly.
rnorm()
function work as expected?Of course, if our alpha value is 0.05, then there is a reasonable expectation that if we repeat this test with 19 additional samples then we might observe that one of the 20 simulated datasets is actually different than zero. To satisfy our curiosity about this, we generate 1000 datasets (= samples, = R objects) of 1000 random deviates each. Remember that large sample sizes are useful to characterize the sample with confidence. Fortunately, producing large numbers of large samples can be achieved easily in R using a for
loop.
P <- rep(NA, 1000) #Set up an array of missing values (NA) to hold the p values later
head(P) #Check the first 6 values to confirm they are NA
## [1] NA NA NA NA NA NA
for (i in 1:1000) #Use an i loop to count from 1 to 1000
{ #Start a set of instructions; use {} for loops
rdist = rnorm (1000) #Generate a new sample of 1000 random numbers
t.result = t.test (rdist, mu=0) #Redo the t-test
P[i] <- t.result$p.value #Record the P value for each sample
} #Close the loop
head(P, 25) #Check some of these P values
## [1] 0.50474901 0.20901210 0.77337323 0.19412544 0.25064192 0.93203469
## [7] 0.10407735 0.69221319 0.34097789 0.27092923 0.27632157 0.45773232
## [13] 0.75405689 0.47000086 0.12072564 0.28288186 0.93876771 0.89541676
## [19] 0.77423716 0.87762442 0.22766161 0.41039241 0.01283842 0.48231633
## [25] 0.63807759
R makes simulation so easy that the amount of data can be quickly overwhelming. The array of P values printed above is just for the first 25 (of 1000) samples. Let’s evaluate the patterns from all 1000 samples with some graphics.
par (mfrow = c(1,2))
plot (P) #You can plot the P values in order of each test run (Fig. 1A)
hist (P, breaks=20, main='', col='gray') #Or you can plot the P values as a histogram (Fig. 1B)
The figure above plots the P values for each t-test of a randomly generated sample of standard deviates (left) . The index value indicates the order of each of the 1000 test runs. A histogram of the P values for all 1000 t-tests is also depicted (right).
But exactly how many of the arrays had a p value < 0.05?
sum (P<0.05) #This checks for any outcomes significantly different from zero
## [1] 43
This is a performance measure of how well the rnorm()
function works. It does, in fact, produce sample means that differ significantly from zero about 5% of the time (i.e., 43 out of 1000 times or 4.3% in this simulation). Overall, the performance of the rnorm ()
function looks good based on this initial simulation.
Technically, we don’t know much about how this performance will vary. Will it always be < 0.05 or will it fluctuate around 0.05. If the latter, how much will it fluctuate? Using a for
loop within a for
loop, we can rerun the simulation a number of times, say 30 times, to investigate this better.
P.1000 <- rep(NA, 1000)
R.1000 <- rep(NA, 30) #An array for rejection rates
for (j in 1:30)
{
for (i in 1:1000)
{
rdist = rnorm (1000)
t.result = t.test (rdist, mu=0)
P.1000[i] <- t.result$p.value
}
R.1000[j] <- sum(P.1000<0.05)/length(P.1000)
}
summary (R.1000)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04000 0.04825 0.05200 0.05270 0.05675 0.06700
The performance of rnorm()
still looks good. The range of rejection rates falls between 0.04 and 0.067.
Of course, we might still be curious to see if sample size inputted into rnorm()
has an effect on these rejections rates. There is no need to show the code here, which just repeats the same double for
loop structure; instead let’s focus on the results.
R.rates<-data.frame(R.1000,R.100,R.10)
boxplot(R.rates, ylim=c(0, 0.2), ylab='Rejection rates', main="Rejection rates based on sample size")
abline(0.05,0,lty=2)
This box-and-whisker plot depicts the rejection rates (R.) based on 30 runs with different sample sizes (i.e., N=1000, 100, 10). In a box-and-whicker plot, the heavy line is the median, the box is an interquartile range, and the whiskers depict the range. Overlaid in the panel, the hortizontal dashed line = 0.05, our \(\alpha\) value.
The performance of rnorm()
still looks good but clearly the largest sample size provides the most consistent results. The main problem for the smallest sample size is that number of P values < 0.05 per only 10 simulations is so few that the sum(P.10<0.05)
is almost a binomial occurrence (i.e., 0 or 1 and ocassionally 2). In this way, a sample size of 10 is not adequate to evaluate the performance of rnorm()
. Visually, the other two sample sizes suggest that rnorm()
is a reasonably precise and unbiased generator of random normal deviates.
This chapter introduces the idea of simulating data to fit a probability distribution. It does so by using the function rnorm()
to simulate a contrasting series of R data objects that follow the normal probablity distribution. Using a for
loop procedure, including a double for
loop, 1000s of R objects with normal distributions were also generated, and the performance of the rnorm()
function was evaluted with the Student’s t-test.
The subject of probability and statistical inference is vast. This chapter was only designed to introduce the concepts of simulating data while getting more comfortable with the syntax of R, so that a novice can more easily understand the other ‘non-introductory’ chapters that follow.
It should also be apparent that simulation modeling, including this example, can help clarify the limitations of data sets. Here we find that the lowest sample size provided inconsistent results. If this was our original target sample size, and simulating 1000s of samples was actually a serious obstacle, then we should continue the simulation in increments of 10 (i.e., R.10, R.20, R.30, etc.) to determine a larger sample size that could optimally reduce this variance relative to the cost of collecting more data.