Overview

  1. Basic Functioning
  2. Manipulate the Index
  3. Theoretical Probability
  4. Experimental Probability: Sampling

Basic Functioning

Loops are used to iterate (repeat) commands over an index. For-loops are especially useful when we need to repeat command(s) a large number of times.

For-loops “execute for a prescribed number of times, as controlled by a counter or an index, incremented at each iteration cycle.” (Click here for the source)

The diagram below describes the process of a for-loop.


A for-loop has two parts: an iterator, and a body, enclosed in curly brackets, that contains the command(s).

The iterator, with i as the index, reads as: "for each item i in a sequence from i_1 to i_2, do {this}". It defines the index, here with the letter i, and the range of the index.

The body (“this”) tells R what we want to do (what command to repeat).

# Print the index sequentially 
for(i in (10:20)){  # start the iterator: iterate over an index sequence that ranges from 10 to 20.
  print(i)         # the body: prints the value of index
} # close the iterator
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
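
The index does not have to be printed directly; it can also be used to pick out elements of a vector. The following is a minimal sketch, assuming a small made-up vector `fruits` (not part of the original material), that loops over the positions returned by `seq_along()`.

```r
# Hypothetical example vector (an assumption for illustration)
fruits <- c("apple", "banana", "cherry")

# Pre-allocate a vector to store one label per element
labels <- character(length(fruits))

for (i in seq_along(fruits)) {       # i takes the values 1, 2, 3
  labels[i] <- paste(i, fruits[i])   # combine the index and the element it points to
}

labels
```

Using `seq_along()` rather than `1:length(fruits)` is a small design choice: it still behaves sensibly if the vector happens to be empty.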

Manipulate the Index

For-loops are also useful when vectorized operations, which work element-wise, are not enough. When a computation needs to refer to previous or subsequent elements in a vector, a for-loop is a natural way to write it.
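
As a minimal sketch of such a case, consider a vector in which each element depends on elements computed earlier in the same vector, so it has to be filled in order. The Fibonacci sequence is used here purely as an illustration; it and the names `n` and `fib` are not part of the original material.

```r
n <- 10
fib <- numeric(n)   # pre-allocate a vector of zeros
fib[1:2] <- 1       # the first two values are given

for (i in 3:n) {
  # each new value is the sum of the two values before it
  fib[i] <- fib[i - 1] + fib[i - 2]
}

fib
```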

# Basic index manipulation 
for(i in (10:20)){  
  print(i + 1)  # Prints the subsequent index value   
}
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21

Question 1

How would you change the code above to print the index value two positions behind?


A common use of for-loops in data analysis is for lagging variables when working with time series (variables measured across time). For instance, we may suspect that current GDP is strongly correlated with past GDP values. To check that hypothesis, we would need to first lag the GDP variable.

The following code is a simple example of lagging a vector by 1 position.

# Vector of data that we wish to lag (definition assumed from the printed output below)
f <- 1:15
f
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
# Create an empty vector that we will populate with the loop
f_lagged <- rep(NA,length(f)) # Repeating NA by the number of elements in f

for(t in (2:length(f))){  # starts the range of the index at 2, since there is no previous value for the first observation in f
  f_lagged[t] <- f[t-1]  # Assigning the previous value in f to f_lagged       
}

f_lagged
##  [1] NA  1  2  3  4  5  6  7  8  9 10 11 12 13 14
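
For comparison, a simple lag like this one can also be written without a loop by shifting the vector. This is only a sketch: it assumes `f` holds the integers 1 to 15, as the printed output suggests, and `f_lagged_vec` is a name introduced here for illustration.

```r
f <- 1:15  # assumed from the printed output

# Drop the last element and put an NA in front to shift everything by one position
f_lagged_vec <- c(NA, head(f, -1))

# Loop version, for comparison
f_lagged <- rep(NA, length(f))
for (t in 2:length(f)) {
  f_lagged[t] <- f[t - 1]
}

# Check that both approaches give the same values
all.equal(f_lagged_vec, f_lagged)
```

The loop version is more verbose, but it generalizes more easily once the body has to do more than copy a single earlier value.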

Exercise 1

Use a loop to create a vector whose i-th element is the difference between the i-th value of f and the previous value of f.

Theoretical Probability

When we know all the potential outcomes, we can use basic operations in R to calculate probabilities.

In this example, we will calculate the probability of drawing a green marble on the first draw when drawing a total of two marbles with replacement. The bag contains 10 marbles: 5 red marbles (“Rx”), 3 yellow marbles (“Yx”), and 2 green marbles (“Gx”).

# Create a vector with 10 marbles
marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2") 

# Use expand.grid() to create all the potential combinations of two marbles
marbles_outcomes <- expand.grid(marbles, marbles) # Inputting the marble vector as many times as there are draws
nrow(marbles_outcomes)
## [1] 100
# Computing how many potential outcomes have a green marble in the first draw
# Subset the first column of the marbles_outcomes data frame to get the first draw,
# then use logical operators to count how many elements evaluate as TRUE for the condition of being green
numerator <- length(which(marbles_outcomes[,1] == "G1" | marbles_outcomes[,1] == "G2"))

# Computing probability
numerator/nrow(marbles_outcomes) # dividing the number of potential outcomes with a green marble as first draw by the total number of possible outcomes
## [1] 0.2
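
The same enumeration approach extends to other events. As a sketch, the code below counts the outcomes with at least one green marble across the two draws and compares the result with the complement rule, 1 - (8/10)^2. The names `greens`, `at_least_one_green`, `p_enumerated`, and `p_formula` are introduced here for illustration.

```r
marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2")
marbles_outcomes <- expand.grid(marbles, marbles)

greens <- c("G1", "G2")

# TRUE for every outcome where the first or the second draw is green
at_least_one_green <- marbles_outcomes[, 1] %in% greens |
  marbles_outcomes[, 2] %in% greens

# The mean of a logical vector is the proportion of TRUE values
p_enumerated <- mean(at_least_one_green)

# Complement rule: 1 - P(no green marble on either draw)
p_formula <- 1 - (8 / 10)^2

c(p_enumerated, p_formula)
```

Both values should agree, since 36 of the 100 equally likely outcomes contain at least one green marble.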

Experimental Probability: Sampling

Sometimes the number of potential outcomes is too large to enumerate. In this case, we can use repeated sampling to estimate the probability of a given outcome.

In this section, we will use a dataset of baby names for babies born between 1880 and 2017 in the US. First, let’s preview the data.

library(babynames) # assumes the data come from the babynames package
str(babynames) # Almost 2 million observations
## Classes 'tbl_df', 'tbl' and 'data.frame':    1924665 obs. of  5 variables:
##  $ year: num  1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
##  $ sex : chr  "F" "F" "F" "F" ...
##  $ name: chr  "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ n   : int  7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
##  $ prop: num  0.0724 0.0267 0.0205 0.0199 0.0179 ...
length(unique(babynames$name)) # Almost 100 000 unique names 
## [1] 97310
head(babynames) # There is one row for each year-name-sex combination
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162
# Which name-year combinations have the highest proportions across all years?
library(dplyr)
babynames %>% 
  slice_max(prop, n=10) # Get the 10 rows with the highest values of the prop variable
## # A tibble: 10 x 5
##     year sex   name        n   prop
##    <dbl> <chr> <chr>   <int>  <dbl>
##  1  1880 M     John     9655 0.0815
##  2  1881 M     John     8769 0.0810
##  3  1880 M     William  9532 0.0805
##  4  1883 M     John     8894 0.0791
##  5  1881 M     William  8524 0.0787
##  6  1882 M     John     9557 0.0783
##  7  1884 M     John     9388 0.0765
##  8  1882 M     William  9298 0.0762
##  9  1886 M     John     9026 0.0758
## 10  1885 M     John     8756 0.0755
# In 2017
babynames %>% 
  filter(year == 2017) %>% 
  slice_max(prop, n=10)
## # A tibble: 10 x 5
##     year sex   name         n    prop
##    <dbl> <chr> <chr>    <int>   <dbl>
##  1  2017 F     Emma     19738 0.0105 
##  2  2017 F     Olivia   18632 0.00994
##  3  2017 M     Liam     18728 0.00954
##  4  2017 M     Noah     18326 0.00933
##  5  2017 F     Ava      15902 0.00848
##  6  2017 F     Isabella 15100 0.00805
##  7  2017 F     Sophia   14831 0.00791
##  8  2017 M     William  14904 0.00759
##  9  2017 M     James    14232 0.00725
## 10  2017 F     Mia      13437 0.00717

Sampling

Let’s sample 10 baby names at random, without replacement.

set.seed(234) # set a starting number for the random number generator, to make the example reproducible
# Get the index of baby names
index_babies <- seq(from=1, to=nrow(babynames))

# Get a random sample of index positions
s1 <- sample(x=index_babies, size=10, replace=FALSE) # We are sampling 10 index values without replacement
s1 # if we rerun the code from set.seed() to this line, we will get the same result because of the seed
##  [1] 1435069 1504534   38565 1493703  128780 1241012 1788752 1381216
##  [9] 1785575  547046
# Use the vector of random index positions to get a random sample of baby names
babynames$name[s1] 
##  [1] "Iyla"    "Kylynne" "Roxanna" "Abi"     "Jac"     "Daivon"  "Dharius"
##  [8] "Quinten" "Evon"    "Helmut"
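
To see what the seed does, here is a small sketch on a toy vector (the built-in `letters`, used here instead of the baby names data). Resetting the same seed before a call reproduces the sample, while drawing again without resetting it continues the random sequence and will generally give a different sample. The names `draw_a`, `draw_b`, and `draw_c` are introduced for illustration.

```r
set.seed(234)
draw_a <- sample(x = letters, size = 5, replace = FALSE)

set.seed(234)   # reset the generator to the same starting point
draw_b <- sample(x = letters, size = 5, replace = FALSE)

# No reset this time: the generator continues from where it left off
draw_c <- sample(x = letters, size = 5, replace = FALSE)

identical(draw_a, draw_b)   # same seed, same sample
draw_c                      # a further draw from the same generator
```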

Exercise 2

Get two random samples of 10 baby names for babies born in 1880 without using a seed.

Using an Experiment to Compute Probabilities

We can use for-loops and sampling to create a sampling distribution and estimate the probability of a given event.

Let’s say we want to compute the probability that a baby born in 1880 is named “Alfred”. We don’t have access to the full population and can only take samples.

# Load the babies1880 dataset
load("babies1880.Rda") 
head(babies1880) # each row is a baby
## # A tibble: 6 x 3
##    year sex   name 
##   <dbl> <chr> <chr>
## 1  1880 F     Mary 
## 2  1880 F     Mary 
## 3  1880 F     Mary 
## 4  1880 F     Mary 
## 5  1880 F     Mary 
## 6  1880 F     Mary
# Set the number of samples and sample size 
samples <- 200
sample_size <- 1000

# Create an empty vector, with one element for each sample, that we will populate with a loop
Alfred_p <- rep(NA, samples)

# Set the seed
set.seed(234)

# Create the loop that stores the percentage of Alfreds in each sample
for(i in (1:samples)){
  random_index <- sample(x=seq(from=1, to=nrow(babies1880)), size=sample_size, replace=FALSE)
  random_names <- babies1880$name[random_index]
  Alfred_p[i] <- length(which(random_names == "Alfred")) / sample_size * 100 # Get the percentage of Alfreds in a given sample
}

# Check the sampling distribution
plot(density(Alfred_p), xlab = "Percentage (%)", main = "Sampling Distribution of the Percentage of\nBabies Named Alfred in 1880")

# Compute the mean to estimate the probability (in %) of a baby being named Alfred 
mean(Alfred_p)
## [1] 0.238
# Compare to the actual percentage in the population
length(which(babies1880$name == "Alfred"))/nrow(babies1880)*100 # Pretty close!
## [1] 0.2327728
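
The same kind of experiment can be tried on a case where the true answer is known. The sketch below reuses the marble bag from the Theoretical Probability section and estimates the probability of a green first draw by repeated sampling. The number of repetitions (`n_experiments`), the seed, and the name `first_is_green` are arbitrary choices made for illustration.

```r
marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2")
n_experiments <- 10000

# Empty vector with one element per experiment
first_is_green <- rep(NA, n_experiments)

set.seed(234)
for (i in 1:n_experiments) {
  draws <- sample(x = marbles, size = 2, replace = TRUE)   # two draws with replacement
  first_is_green[i] <- draws[1] %in% c("G1", "G2")         # TRUE if the first draw is green
}

# Proportion of experiments with a green first draw; the theoretical value is 0.2
mean(first_is_green)
```

With more repetitions the estimate tends to settle closer to the theoretical value, which is the reasoning behind averaging over many samples in the Alfred example.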

This work by Sarah Lachance is licensed under CC BY-NC-ND 4.0