One of the surprising outcomes of the revolution in computing technology has been the discovery of diverse uses for randomness in the analysis of data and in science generally. One of the simplest operators for generating random events in R is resample. This function takes two arguments: 1) a set of items to choose from at random, and 2) how many events to generate. Each item is equally likely to be chosen. For example, here is a simulation of a coin flip:
coin = c("H", "T")
resample(coin, 5)
## [1] "H" "T" "T" "T" "H"
resample(coin, 5)
## [1] "H" "T" "H" "H" "H"
The first command create an object holding the possible outcome of each event (“H”: Heads, “T”: Tails), called coin. The next two commands generated five events each, with each event being a random choice of the outcomes in coin. Another simple example is rolling dice. First, we construct a set of the possible outcomes (in this case, 6 possible sides):
dice = seq(1, 6)
dice
## [1] 1 2 3 4 5 6
Then generate random events. Here is a roll of two dice:
resample(dice, 2)
## [1] 4 2
The resample() function is also useful for selecting cases at random from a data.frame. This use will be the basis for statistical methods introduced in later lectures and examples…
Although resample() is useful for random sampling, it can work only with finite set of possible outcomes, such as H/T or 1/2/3/4/5/6 or the cases in a data.frame. By default, the underlying probability model used by resample() is equiprobability (i.e., each possible outcomes is equally likely). You can specify another probability model by using the prob= argument to resample(). For instance, to flip coins that are very likely to come up heads, we could use:
resample(coin, 10, prob=c(0.9, 0.1))
## [1] "H" "H" "H" "H" "H" "T" "H" "H" "H" "H"
R provides other operators that allows draws to be made from outcome sets that are infinite. For example, to make random draws from a normal probability distribution, we can use the rnorm() function (r: random, norm: normal). The required argument tells how many draws to make. Optional named arguments let you specify the mean and standard deviation of the particular normal distribution you want. To illustrate, here is a set of 15 random numbers from a normal distribution with mean 1000 and standard deviation 75:
samps = rnorm(15, mean=1000, sd=75)
samps
## [1] 921.9934 1024.3006 900.6254 1022.4405 1017.9001 978.6061 982.1767
## [8] 1085.4456 887.9998 898.1780 1107.7794 982.9020 934.6675 1017.0118
## [15] 1028.3428
In this example, the output was assigned to an object called samps to facilitate some additional computations to the values, for example, here is the mean and standard deviation of the above samples:
mean(samps)
## [1] 986.0247
sd(samps)
## [1] 66.92377
Don’t be surprised if the mean and standard deviation of the sample don’t match exactly the parameters that were set with the arguments (mean=1000, sd=75). The sample was drawn at random, and so the sample statistics are going to reflect the sampling variability.
It is possible to generate very large samples (here we are generating 100,000!):
samps = rnorm(100000, mean=1000, sd=75)
mean(samps)
## [1] 1000.126
sd(samps)
## [1] 75.18456
Notice that the sample mean and standard deviation are quite close to the population parameters in this large sample! The simulations that we will do in later exercises will be much more elaborate than the simple draws here. Even with current computing technology, we will likely limit our trials to a few hundred…
R provides a large set of operators like rnorm for different probability models. All of these operators work in the same way, more or less:
All the operators start with the letter r (random), and are followed by the name of the probability model:
| Family | R name |
Parameters | Type |
|---|---|---|---|
| Normal | rnorm |
mean, sd |
Continuous |
| Uniform | runif |
min,max |
Continuous |
| Binomial | rbinom |
size,prob |
Discrete |
| Poisson | rpois |
rate (lambda) |
Discrete |
| Exponential | rexp |
rate |
Continuous |
| Lognormal | rlnorm |
meanlog, sdlog |
Continuous |
| \(\chi^2\) | rchisq |
DoF (df) |
Continuous |
| t | rt |
DoF (df) |
Continuous |
| F | rf |
DoF (df1, df2) |
Continuous |
To use these operators, you first must choose a particular probability model based on the setting that applies to your situation. This setting will usually indicate what the population parameters should be. Some examples:
You are in charge of a hiring committee that is going to interview three candidates selected from a population of job applicants that is 63% female. How many of the interviewees will be female? Modeling this as random selection from the application pool, a binomial model is appropriate. The size of each trial is 3, the probability of being female is 63%, so:
samps = rbinom(40, size=3, prob=0.63)
samps
## [1] 1 3 1 2 2 2 0 3 2 2 1 1 1 1 2 1 3 0 1 2 2 0 3 3 2 2 3 3 1 1 2 2 2 3 2
## [36] 0 2 3 2 0
There are 40 trials here, since the first argument was set to 40. Recall that each of the trails is a simulation of one hiring event. In the first simulated event, 1 of the interviewees were female, in the third, there was 1 females. Typically, you will simply summarize all the simulations to see how likely each possible outcome is:
tally(samps)
## First argument should be a formula... But I'll try to guess what you meant
##
## 0 1 2 3
## 5 10 16 9You want to simulate the number of customers who come into a store over the course of an hour. The average rate is 15 per hour. To simulate a situation where customers arrive randomly, the Poisson model is appropriate:
rpois(25, lambda=15)
## [1] 14 20 15 8 10 11 13 24 12 15 12 12 13 18 18 24 19 20 17 12 14 18 10
## [24] 20 11You want to generate s simulation of the interval between earthquakes. The simulate the random intervals with a typical rate of 0.03 earthquakes per year, you would use:
rexp(15, rate=0.03)
## [1] 1.9053395 7.2016256 43.1535210 34.1805837 80.7710750
## [6] 30.8534622 69.1087319 39.2869470 6.5969589 16.6047956
## [11] 39.5466026 145.2781885 43.8666947 3.0923205 0.2636432
Notice the huge variation in the intervals, from less than half a year to almost 90 years between earthquakes.
Often, we need to compute coverage intervals in order to describe the range of likely outcomes from a random process. R provides a series of operators for this purpose; a separate operator for each named probability model. The operators all begin with q, standing for quantiles. In all cases, the first argument is the set of quantiles that you want to calculate for the particular probability model. The optional named arguments are the parameters.
Recall that to find a 95% coverage interval, you need the 0.025 and 0.975 quantiles. For a 99% interval, you need the 0.005 and 0.995 quantiles (they are always symmetric). For example, here are 95% coverage intervals for a few probability models:
A normal distribution with mean 0 and standard deviation 1:
qnorm(c(0.025, 0.975), mean=0, sd=1)
## [1] -1.959964 1.959964The hiring committee situation modeled by a binomial distribution with size=3 and prob=0.63:
qbinom(c(0.025, 0.975), size=3, prob=0.63)
## [1] 0 3
You may be surprised to see that the coverage interval includes all the possible outcomes. This is because the number of cases in each trial (n=3) is quite small.
The number of customers entering a store during an hour as modeled by a Poisson distribution, again with an average rate of 15 per hour:
qpois(c(0.025, 0.975), lambda=15)
## [1] 8 23The interval between earthquakes modeled by an exponential distribution with a typical rate of 0.03 earthquakes per year:
qexp(c(0.025, 0.975), rate=0.03)
## [1] 0.8439269 122.9626485You can also use the q operators to find the value that would be at a particular percentile. For example, the exponential model with rate=0.03 gives the 25th percentile of the interval between earthquakes as:
qexp(0.25, rate=0.03)
## [1] 9.589402
A quarter of the time, the interval between earthquakes will be 9.59 years or less. It is entirely feasible to calculate percentiles and coverage intervals by combining the random-number generators with quantile. For example, here is the 95% coverage interval from a normal distribution with mean=0 and sd=1:
samps = rnorm(10000, mean=0, sd=1)
qdata(c(0.025, 0.975), samps)
## quantile p
## 2.5% -1.939839 0.025
## 97.5% 1.943178 0.975
The disadvantage here is that it is a simulation, and the results will vary randomly. By making the sample size large enough (here it is n=10,000), we can reduce the random variation. However, using the q operators uses mathematical analysis to give us what is effectively an infinite sample size. For this reason, it is advisable to use the q operators when you can. However, for many of the techniques to be introduced in later lectures, you will have to generate a random sample and then apply quantile to approximate the coverage intervals…
This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.