Although the R scripts written in these labs are relatively short, it is still important to follow some best practices that help make the code easier to read and modify.
Organize the code in a clear, logical manner. Be sure that the script can be run from beginning to end, line-by-line, without errors.
Annotate the code with comments to make it easier to identify the purpose of specific sections.
Start out by defining the parameters (i.e., variables) that will be used in the script. Use variables as much as possible so that the code is easy to re-use in similar settings.
Simulation allows for an intuitive way to understand the definition of probability as a proportion of times that an outcome of interest occurs if the random phenomenon could be repeated infinitely. With a programming language like R, it is possible to feasibly conduct simulations with enough replicates such that the proportion of occurrences with a particular outcome closely approximates the probability \(p\).
In this section (click to expand), we introduce the basic elements of
R programming required to conduct such simulations: the
sample() command and for loop control
structure.
sample() and for loops
sample()In a probability setting, it is not always the case that each outcome
of an experiment is equally likely. The sample( ) function
has the generic structure
sample(x, size = , replace = FALSE, prob = NULL)
where the prob argument allows for the probability of
sampling each element in x to be specified as a vector.
When prob is omitted, the function will sample each element
with equal probability.
The following code simulates the outcome of tossing a biased coin ten
times, where the probability of a heads is 0.6; a heads is represented
by 1 and a tails is represented by 0 The first
argument is the vector (0, 1), and the prob
argument is the vector (0.4, 0.6); the order indicates that
the first element (0) is to be sampled with probability 0.4
and the second element (1) is to be sampled with
probability 0.6.
#set the seed for a pseudo-random sample
set.seed(2018)
outcomes = sample(c(0, 1), size = 10, prob = c(0.4, 0.6), replace = TRUE)
outcomes
## [1] 1 1 1 1 1 1 0 1 0 1
sum ()The sum () function is used in the lab to return the sum
of the outcomes vector, where outcomes are recorded as either
0 or 1. For simplicity, the function was used
in its most basic form:
sum(outcomes) #number of 1's (heads)
## [1] 8
10 - sum(outcomes) #number of 0's (tails)
## [1] 2
It is often convenient to combine the sum() function
with logical operators.
sum(outcomes == 1) #number of 1's (heads)
## [1] 8
sum(outcomes == 0) #number of 0's (tails)
## [1] 2
This provides more flexibility and can make the code easier to parse
quickly, such as for cases where there are more than two outcomes, or
when more than two outcomes are of interest. For example, the following
application of sum() identifies the number of rolls (out of
twenty) of a fair six-sided that are either 1 or greater than 4.
#set the seed for a pseudo-random sample
set.seed(2018)
dice.rolls = sample(1:6, size = 20, replace = TRUE)
dice.rolls
## [1] 3 4 5 2 5 1 3 4 2 4 3 3 6 1 1 6 5 3 1 3
sum(dice.rolls == 1 | dice.rolls > 4)
## [1] 9
Additionally, since logical operators work with text strings, it is
possible to use them with sum() to return the number of
heads if heads is represented by, for example, H.
#set the seed for a pseudo-random sample
set.seed(2018)
outcomes = sample(c("T", "H"), size = 10, prob = c(0.4, 0.6), replace = TRUE)
outcomes
## [1] "H" "H" "H" "H" "H" "H" "T" "H" "T" "H"
sum(outcomes == "H") #number of heads
## [1] 8
for LoopsA loop allows for a set of code to be repeated under a specific set
of conditions; the for loop is one of the several types of
loops available in R.
A for loop has the basic structure
`for( counter ) \{ instructions \}. The loop below will
calculate the squares of the integers from 1 through 5.
Prior to running the loop, an empty vector squares
is created to store the results of the loop. This step is referred to as
initialization.
The counter consists of the index variable that keeps track of
each iteration of the loop; the index is typically a letter like
i, j, or k, but can be any
sequence such as ii. The index variable is conceptually
similar to the index of summation \(k\)
in sigma notation (\(\sum_{k = 1}^n\)).
In the example below, the counter can be read as “for every \(k\) in 1 through 5, repeat the following
instructions…”
The instructions are enclosed within the pair of curly braces.
For each iteration, the value \(k^2\)
is to be stored in the \(k^{th}\)
element of the vector squares. The empty vector
squares was created prior to running the loop using
vector().
So, for the first iteration, R sets \(k
= 1\), calculates \(1^2\), and
fills in the first element of squares with the result. In
the next iteration, \(k = 2\)… and this
process repeats until \(k = 5\) and
\(5^2\) is stored as the fifth element
of squares.
After the loop is done, the vector squares is no
longer empty and instead consists of the squares of the integers from 1
through 5: 1, 4, 9, 16, and 25.
#create empty vector to store results (initialize)
squares = vector("numeric", 5)
#run the loop
for(k in 1:5){
squares[k] = k^2
}
#print the results
squares
## [1] 1 4 9 16 25
Of course, the same result could be easily achieved without a
for loop:
(1:5)^2
## [1] 1 4 9 16 25
The for loop is a useful tool when the set of
instructions to be repeated is more complicated, such as in the coin
tossing scenario from the lab. The loop from Question 2 is reproduced
here for reference.
The loop is set to run for every \(k\) from 1 to the value of
number.replicate, which has been previously defined as
50.
There are two vectors defined within the curly braces.
The first vector, outcomes.replicate, consists of
the outcomes for the tosses in a single replicate (i.e., iteration of
the loop). The number of tosses in a single experiment has been defined
as 5. These values are produced from sample(). This
vector’s values are re-populated each time the loop is run.
The second vector, outcomes, is created by
calculating the sum of outcomes.replicate for each
iteration of the loop. Once the loop has run, outcomes is a
record of the number of 1’s that occurred for each
iteration.
In the language of the coin tossing scenario: the loop starts by
tossing 5 coins, counting how many heads occur, then recording that
number as the first element of outcomes. Next, the loop
tosses 5 coins again and records the number of heads in the second
element of outcomes. The loop stops once it has repeated
the experiment (of tossing 5 coins) 50 times.
The distinction between the outcomes.replicate
vector and the outcomes vector, for each run of the loop,
is that the outcomes.replicate vector stores the exact
sequence of results while outcomes only records the number
of heads.
For this particular problem, the only information needed is the
number of heads; thus, one can think of outcomes as
summarizing the relevant information from the experiment.
for(k in 1:number.replicates){
outcomes.replicate = sample(c(0, 1), size = number.tosses,
prob = c(1 - prob.heads, prob.heads), replace = TRUE)
outcomes[k] = sum(outcomes.replicate)
}