Modeling randomness

Randomness

One of the surprising outcomes of the revolution in computing technology has been the discovery of diverse uses for randomness in the analysis of data and in science generally. One of the simplest operators for generating random events in R is resample. This function takes two arguments: 1) a set of items to choose from at random, and 2) how many events to generate. Each item is equally likely to be chosen. For example, here is a simulation of a coin flip:

coin = c("H", "T")
resample(coin, 5)
## [1] "H" "T" "T" "T" "H"
resample(coin, 5)
## [1] "H" "T" "H" "H" "H"

The first command create an object holding the possible outcome of each event (“H”: Heads, “T”: Tails), called coin. The next two commands generated five events each, with each event being a random choice of the outcomes in coin. Another simple example is rolling dice. First, we construct a set of the possible outcomes (in this case, 6 possible sides):

dice = seq(1, 6)
dice
## [1] 1 2 3 4 5 6

Then generate random events. Here is a roll of two dice:

resample(dice, 2)
## [1] 4 2

The resample() function is also useful for selecting cases at random from a data.frame. This use will be the basis for statistical methods introduced in later lectures and examples…

Random draws from probability models

Although resample() is useful for random sampling, it can work only with finite set of possible outcomes, such as H/T or 1/2/3/4/5/6 or the cases in a data.frame. By default, the underlying probability model used by resample() is equiprobability (i.e., each possible outcomes is equally likely). You can specify another probability model by using the prob= argument to resample(). For instance, to flip coins that are very likely to come up heads, we could use:

resample(coin, 10, prob=c(0.9, 0.1))
##  [1] "H" "H" "H" "H" "H" "T" "H" "H" "H" "H"

R provides other operators that allows draws to be made from outcome sets that are infinite. For example, to make random draws from a normal probability distribution, we can use the rnorm() function (r: random, norm: normal). The required argument tells how many draws to make. Optional named arguments let you specify the mean and standard deviation of the particular normal distribution you want. To illustrate, here is a set of 15 random numbers from a normal distribution with mean 1000 and standard deviation 75:

samps = rnorm(15, mean=1000, sd=75)
samps
##  [1]  921.9934 1024.3006  900.6254 1022.4405 1017.9001  978.6061  982.1767
##  [8] 1085.4456  887.9998  898.1780 1107.7794  982.9020  934.6675 1017.0118
## [15] 1028.3428

In this example, the output was assigned to an object called samps to facilitate some additional computations to the values, for example, here is the mean and standard deviation of the above samples:

mean(samps)
## [1] 986.0247
sd(samps)
## [1] 66.92377

Don’t be surprised if the mean and standard deviation of the sample don’t match exactly the parameters that were set with the arguments (mean=1000, sd=75). The sample was drawn at random, and so the sample statistics are going to reflect the sampling variability.

It is possible to generate very large samples (here we are generating 100,000!):

samps = rnorm(100000, mean=1000, sd=75)
mean(samps)
## [1] 1000.126
sd(samps)
## [1] 75.18456

Notice that the sample mean and standard deviation are quite close to the population parameters in this large sample! The simulations that we will do in later exercises will be much more elaborate than the simple draws here. Even with current computing technology, we will likely limit our trials to a few hundred…

Standard probability models

R provides a large set of operators like rnorm for different probability models. All of these operators work in the same way, more or less:

  • Each has a required first argument that specifies the number of draws to make
  • Each has an optional set of parameters that specify the particular probability distribution that you want

All the operators start with the letter r (random), and are followed by the name of the probability model:

Family R name Parameters Type
Normal rnorm mean, sd Continuous
Uniform runif min,max Continuous
Binomial rbinom size,prob Discrete
Poisson rpois rate (lambda) Discrete
Exponential rexp rate Continuous
Lognormal rlnorm meanlog, sdlog Continuous
\(\chi^2\) rchisq DoF (df) Continuous
t rt DoF (df) Continuous
F rf DoF (df1, df2) Continuous

To use these operators, you first must choose a particular probability model based on the setting that applies to your situation. This setting will usually indicate what the population parameters should be. Some examples:

  • You are in charge of a hiring committee that is going to interview three candidates selected from a population of job applicants that is 63% female. How many of the interviewees will be female? Modeling this as random selection from the application pool, a binomial model is appropriate. The size of each trial is 3, the probability of being female is 63%, so:

    samps = rbinom(40, size=3, prob=0.63)
    samps
    ##  [1] 1 3 1 2 2 2 0 3 2 2 1 1 1 1 2 1 3 0 1 2 2 0 3 3 2 2 3 3 1 1 2 2 2 3 2
    ## [36] 0 2 3 2 0

    There are 40 trials here, since the first argument was set to 40. Recall that each of the trails is a simulation of one hiring event. In the first simulated event, 1 of the interviewees were female, in the third, there was 1 females. Typically, you will simply summarize all the simulations to see how likely each possible outcome is:

    tally(samps)
    ## First argument should be a formula... But I'll try to guess what you meant
    ## 
    ##  0  1  2  3 
    ##  5 10 16  9
  • You want to simulate the number of customers who come into a store over the course of an hour. The average rate is 15 per hour. To simulate a situation where customers arrive randomly, the Poisson model is appropriate:

    rpois(25, lambda=15)
    ##  [1] 14 20 15  8 10 11 13 24 12 15 12 12 13 18 18 24 19 20 17 12 14 18 10
    ## [24] 20 11
  • You want to generate s simulation of the interval between earthquakes. The simulate the random intervals with a typical rate of 0.03 earthquakes per year, you would use:

    rexp(15, rate=0.03)
    ##  [1]   1.9053395   7.2016256  43.1535210  34.1805837  80.7710750
    ##  [6]  30.8534622  69.1087319  39.2869470   6.5969589  16.6047956
    ## [11]  39.5466026 145.2781885  43.8666947   3.0923205   0.2636432

    Notice the huge variation in the intervals, from less than half a year to almost 90 years between earthquakes.

Coverage intervals

Often, we need to compute coverage intervals in order to describe the range of likely outcomes from a random process. R provides a series of operators for this purpose; a separate operator for each named probability model. The operators all begin with q, standing for quantiles. In all cases, the first argument is the set of quantiles that you want to calculate for the particular probability model. The optional named arguments are the parameters.

Recall that to find a 95% coverage interval, you need the 0.025 and 0.975 quantiles. For a 99% interval, you need the 0.005 and 0.995 quantiles (they are always symmetric). For example, here are 95% coverage intervals for a few probability models:

  • A normal distribution with mean 0 and standard deviation 1:

    qnorm(c(0.025, 0.975), mean=0, sd=1)
    ## [1] -1.959964  1.959964
  • The hiring committee situation modeled by a binomial distribution with size=3 and prob=0.63:

    qbinom(c(0.025, 0.975), size=3, prob=0.63)
    ## [1] 0 3

    You may be surprised to see that the coverage interval includes all the possible outcomes. This is because the number of cases in each trial (n=3) is quite small.

  • The number of customers entering a store during an hour as modeled by a Poisson distribution, again with an average rate of 15 per hour:

    qpois(c(0.025, 0.975), lambda=15)
    ## [1]  8 23
  • The interval between earthquakes modeled by an exponential distribution with a typical rate of 0.03 earthquakes per year:

    qexp(c(0.025, 0.975), rate=0.03)
    ## [1]   0.8439269 122.9626485

Percentiles

You can also use the q operators to find the value that would be at a particular percentile. For example, the exponential model with rate=0.03 gives the 25th percentile of the interval between earthquakes as:

qexp(0.25, rate=0.03)
## [1] 9.589402

A quarter of the time, the interval between earthquakes will be 9.59 years or less. It is entirely feasible to calculate percentiles and coverage intervals by combining the random-number generators with quantile. For example, here is the 95% coverage interval from a normal distribution with mean=0 and sd=1:

samps = rnorm(10000, mean=0, sd=1)
qdata(c(0.025, 0.975), samps)
##        quantile     p
## 2.5%  -1.939839 0.025
## 97.5%  1.943178 0.975

The disadvantage here is that it is a simulation, and the results will vary randomly. By making the sample size large enough (here it is n=10,000), we can reduce the random variation. However, using the q operators uses mathematical analysis to give us what is effectively an infinite sample size. For this reason, it is advisable to use the q operators when you can. However, for many of the techniques to be introduced in later lectures, you will have to generate a random sample and then apply quantile to approximate the coverage intervals…

Reference

This demo is based directly on material from ‘Statistical Modeling: A Fresh Approach (2nd Edition)’ by Daniel Kaplan.