Loops are used to iterate (repeat) commands over an index. For-loops are especially useful when we need to repeat command(s) a large number of times.
For-loops “execute for a prescribed number of times, as controlled by a counter or an index, incremented at each iteration cycle.” (Click here for the source)
The diagram below describes the process of a for-loop.
Source, with modification: https://www.geeksforgeeks.org/loops-in-r-for-while-repeat/
A for-loop consists of an iterator followed by a body. The body contains the command(s) to repeat and is enclosed in curly brackets.
The iterator, with i as the index, reads as: "for each item i in a sequence from i_1 to i_2, do {this}". It defines the index, here with the letter i, and the range of the index.
The body (“this”) tells R what we want to do (what command to repeat).
# Print the index sequentially
for(i in (10:20)){ # start the iterator: iterate over an index sequence that ranges from 10 to 20.
print(i) # the body: prints the value of index
} # closing bracket: ends the body of the loop
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
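The index does not have to be a range of numbers typed by hand. As a minimal sketch (the vector `fruits` and the object `labels` are made-up names, not part of the original material), `seq_along()` generates one index value per element of a vector, and the body can use that index to look up each element:

```r
# Hypothetical example vector (not from the original text)
fruits <- c("apple", "banana", "cherry")

# Empty vector that the loop will fill in
labels <- rep(NA, length(fruits))

for (i in seq_along(fruits)) { # i takes the values 1, 2, 3
  labels[i] <- paste(i, fruits[i]) # combine the index and the element it points to
}
labels
```

Using `seq_along(fruits)` rather than `1:length(fruits)` avoids an accidental loop over `1:0` when the vector is empty.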
For-loops are also useful when vectorized operations, which work element-wise, are not enough. When a computation needs to refer to previous or subsequent elements in a vector, a for-loop is a straightforward way to do it.
# Basic index manipulation
for(i in (10:20)){
print(i + 1) # Prints the subsequent index value
}
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
How would you change the code above to print the index value two positions behind?
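Separately from the exercise, the contrast between element-wise operations and loops that look at neighbouring elements can be sketched as follows. The vector `x` and the running-total task are illustrative assumptions, not taken from the original material.

```r
x <- c(3, 1, 4, 1, 5) # hypothetical data

# Element-wise task: a vectorized operation needs no loop
doubled <- x * 2

# Task that depends on the previous result: a running total
running_total <- rep(NA, length(x))
running_total[1] <- x[1] # the first total is just the first value
for (i in 2:length(x)) {
  running_total[i] <- running_total[i - 1] + x[i] # previous total plus current value
}
running_total
```

Base R's `cumsum(x)` returns the same running total, so it can be used to check the loop.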
A common use of for-loops in data analysis is for lagging variables when working with time series (variables measured across time). For instance, we may suspect that current GDP is strongly correlated with past GDP values. To check that hypothesis, we would need to first lag the GDP variable.
The following code is a simple example of lagging a vector by 1 position.
# Vector of data that we wish to lag
f <- 1:15 # assumed definition (not shown in the original), chosen to match the printed output
f
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
# Create an empty vector that we will populate with the loop
f_lagged <- rep(NA,length(f)) # Repeating NA by the number of elements in f
for(t in (2:length(f))){ # starts the range of the index at 2, since there is no previous value for the first observation in f
f_lagged[t] <- f[t-1] # Assigning the previous value in f to f_lagged
}
f_lagged
## [1] NA 1 2 3 4 5 6 7 8 9 10 11 12 13 14
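The same idea extends to lags of more than one position. The helper `lag_vector()` below is a hypothetical function written for this illustration; it is not part of base R or of the original material, and it assumes `k` is at least 1 and smaller than the length of the vector.

```r
# Hypothetical helper: lag a vector x by k positions using a for-loop
lag_vector <- function(x, k = 1) {
  lagged <- rep(NA, length(x)) # the first k positions stay NA
  for (t in (k + 1):length(x)) { # start at k + 1: earlier positions have no value k steps back
    lagged[t] <- x[t - k] # assign the value from k positions earlier
  }
  lagged
}

g <- c(2, 4, 6, 8, 10) # example data
lag_vector(g, k = 2)
```

To look at the relationship between a series and its lag, one option is `cor(g, lag_vector(g, k = 1), use = "complete.obs")`, which drops the positions where the lagged value is `NA`.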
Use a loop to create a vector whose element i is the difference between element i of f and the previous element of f.
When we know all the potential outcomes, we can use basic operations in R to calculate probabilities.
In this example, we will calculate the probability of drawing a green marble in the first draw when drawing a total of two marbles with replacement. The bag contains 10 marbles: 5 red marbles (“Rx”), 3 yellow marbles (“Yx”), and 2 green marbles (“Gx”).
# Create a vector with 10 marbles
marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2")
# Use expand.grid() to create all the potential combinations of two marbles
marbles_outcomes <- expand.grid(marbles, marbles) # Inputting the marble vector as many times as there are draws
nrow(marbles_outcomes)
## [1] 100
# Computing how many potential outcomes have a green marble in the first draw
numerator <- length(which(marbles_outcomes[,1] == "G1"| marbles_outcomes[,1] == "G2" )) # subsetting the first column of the marbles_outcomes data frame to get the first draw; using logical operators to count how many elements evaluate as TRUE for the condition of being green
# Computing probability
numerator/nrow(marbles_outcomes) # dividing the number of potential outcomes with a green marble as first draw by the total number of possible outcomes
## [1] 0.2
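The same table of outcomes can answer other questions. As a sketch, the code below counts the outcomes in which at least one of the two draws is green; the object names `green`, `first_green`, `second_green`, and `p_at_least_one` are introduced here for illustration. The result can be checked by hand: the probability of drawing no green marble in two draws with replacement is 0.8 × 0.8 = 0.64, so the probability of at least one green marble is 0.36.

```r
marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2")
marbles_outcomes <- expand.grid(marbles, marbles) # all 100 ordered pairs of draws

green <- c("G1", "G2")
first_green <- marbles_outcomes[, 1] %in% green # TRUE where the first draw is green
second_green <- marbles_outcomes[, 2] %in% green # TRUE where the second draw is green

# Share of outcomes with a green marble in the first draw, the second draw, or both
p_at_least_one <- sum(first_green | second_green) / nrow(marbles_outcomes)
p_at_least_one
```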
Sometimes the number of potential outcomes is too large to enumerate. In this case, we can use repeated sampling to estimate the probability of given outcomes.
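To see that repeated sampling recovers a probability we already know, the marble example can be simulated. This is a sketch under assumptions: the number of simulated draws (`n_draws`) and the seed are arbitrary choices, and the estimate will vary slightly with both.

```r
set.seed(234) # arbitrary seed, only to make the sketch reproducible

marbles <- c("R1","R2","R3","R4","R5","Y1","Y2","Y3","G1","G2")
n_draws <- 10000 # assumed number of repetitions

# Empty vector that will hold the first marble of each simulated pair of draws
first_draw <- rep(NA, n_draws)

for (i in 1:n_draws) {
  draw <- sample(x = marbles, size = 2, replace = TRUE) # two draws with replacement
  first_draw[i] <- draw[1] # keep only the first draw
}

# Share of simulations in which the first marble is green
p_hat <- mean(first_draw %in% c("G1", "G2"))
p_hat # should be close to the exact value of 0.2
```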
In this section, we will use a dataset of baby names for babies born between 1880 and 2017 in the US. First, let’s preview the data.
library(babynames) # assumes the data come from the babynames package
str(babynames) # Almost 2 million observations
## Classes 'tbl_df', 'tbl' and 'data.frame': 1924665 obs. of 5 variables:
## $ year: num 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
## $ sex : chr "F" "F" "F" "F" ...
## $ name: chr "Mary" "Anna" "Emma" "Elizabeth" ...
## $ n : int 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
## $ prop: num 0.0724 0.0267 0.0205 0.0199 0.0179 ...
length(unique(babynames$name)) # Almost 100 000 unique names
## [1] 97310
head(babynames) # There is one row for each year-sex-name combination
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
# Which year-sex-name combinations have the highest proportions across all years?
library(dplyr)
babynames %>%
slice_max(prop, n=10) # Keep the 10 rows with the highest values of the prop variable
## # A tibble: 10 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 M John 9655 0.0815
## 2 1881 M John 8769 0.0810
## 3 1880 M William 9532 0.0805
## 4 1883 M John 8894 0.0791
## 5 1881 M William 8524 0.0787
## 6 1882 M John 9557 0.0783
## 7 1884 M John 9388 0.0765
## 8 1882 M William 9298 0.0762
## 9 1886 M John 9026 0.0758
## 10 1885 M John 8756 0.0755
# In 2017
babynames %>%
filter(year == 2017) %>%
slice_max(prop, n=10)
## # A tibble: 10 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 F Emma 19738 0.0105
## 2 2017 F Olivia 18632 0.00994
## 3 2017 M Liam 18728 0.00954
## 4 2017 M Noah 18326 0.00933
## 5 2017 F Ava 15902 0.00848
## 6 2017 F Isabella 15100 0.00805
## 7 2017 F Sophia 14831 0.00791
## 8 2017 M William 14904 0.00759
## 9 2017 M James 14232 0.00725
## 10 2017 F Mia 13437 0.00717
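The calls above rank rows across the whole table or within a single year. To get the top name within every year, one option is to group by year before calling `slice_max()`. The sketch below uses a small made-up data frame, `toy_names`, rather than the real data, so the names and proportions are purely illustrative.

```r
library(dplyr)

# Hypothetical toy data (not taken from babynames)
toy_names <- data.frame(
  year = c(2000, 2000, 2000, 2001, 2001, 2001),
  name = c("Ann", "Bob", "Cy", "Ann", "Bob", "Cy"),
  prop = c(0.03, 0.05, 0.01, 0.06, 0.02, 0.04)
)

top_by_year <- toy_names %>%
  group_by(year) %>% # treat each year separately
  slice_max(prop, n = 1) %>% # keep the row with the highest prop within each year
  ungroup()
top_by_year
```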
Let’s sample 10 baby names at random, without replacement.
set.seed(234) # set a starting number for the random number generator, to make the example reproducible
# Get the index of baby names
index_babies <- seq(from=1, to=nrow(babynames))
# Get a random sample of index positions
s1 <- sample(x=index_babies, size=10, replace=FALSE) # We are sampling 10 index values without replacement
s1 # if we rerun these lines starting from set.seed(), we will get the same result because of the seed
## [1] 1435069 1504534 38565 1493703 128780 1241012 1788752 1381216
## [9] 1785575 547046
# Use the vector of random index positions to get a random sample of baby names
babynames$name[s1]
## [1] "Iyla" "Kylynne" "Roxanna" "Abi" "Jac" "Daivon" "Dharius"
## [8] "Quinten" "Evon" "Helmut"
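The effect of the seed can be checked directly. In this sketch, `pool` is a hypothetical stand-in for the vector of index positions; the objects `a`, `b`, and `c_sample` are names introduced for the illustration.

```r
pool <- 1:1000 # hypothetical index values

set.seed(234)
a <- sample(x = pool, size = 10, replace = FALSE)

set.seed(234) # resetting the same seed restarts the random number generator
b <- sample(x = pool, size = 10, replace = FALSE)

# Sampling again without resetting the seed continues the random sequence
c_sample <- sample(x = pool, size = 10, replace = FALSE)

identical(a, b) # TRUE: same seed, same sample
identical(a, c_sample) # almost certainly FALSE: the generator has moved on
```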
Get two random samples of 10 baby names for babies born in 1880 without using a seed.
We can use for-loops and sampling to create a sampling distribution and find the probability of a given event.
Let’s say we want to compute the probability that a baby born in 1880 is named “Alfred”. We don’t have access to the full population and can only take samples.
# Load the babies1880 dataset
load("babies1880.Rda")
head(babies1880) # each row is a baby
## # A tibble: 6 x 3
## year sex name
## <dbl> <chr> <chr>
## 1 1880 F Mary
## 2 1880 F Mary
## 3 1880 F Mary
## 4 1880 F Mary
## 5 1880 F Mary
## 6 1880 F Mary
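The original does not show how `babies1880` was built. One plausible way to turn a table of counts into a table with one row per baby is to repeat each row index as many times as its count. The sketch below does this in base R on made-up counts; `counts`, `row_index`, and `babies_toy` are hypothetical names.

```r
# Hypothetical counts (not the real 1880 figures)
counts <- data.frame(
  name = c("Mary", "Anna", "Emma"),
  n = c(3, 2, 1)
)

# Repeat each row number n times: 1 1 1 2 2 3
row_index <- rep(seq_len(nrow(counts)), times = counts$n)

# One row per baby, keeping only the name column
babies_toy <- counts[row_index, "name", drop = FALSE]
nrow(babies_toy) # equals sum(counts$n)
table(babies_toy$name)
```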
# Set the number of samples and sample size
samples <- 200
sample_size <- 1000
# Create an empty vector, with one element for each sample, that we will populate with a loop
Alfred_p <- rep(NA, samples)
# Set the seed
set.seed(234)
# Create the loop that stores the percentage of Alfreds in each sample
for(i in (1:samples)){
random_index <- sample(x=seq(from=1, to=nrow(babies1880)), size=sample_size, replace=FALSE)
random_names <- babies1880$name[random_index]
Alfred_p[i] <- length(which(random_names == "Alfred")) / sample_size * 100 # Get the percentage of Alfreds in a given sample
}
# Check the sampling distribution
plot(density(Alfred_p), xlab = "Percentage (%)", main = "Sampling Distribution of the Percentage of\nBabies Named Alfred in 1880")
# Compute the mean of the sample percentages to estimate the percentage of babies named Alfred
mean(Alfred_p)
## [1] 0.238
# Compare to the actual percentage in the population
length(which(babies1880$name == "Alfred"))/nrow(babies1880)*100 # Pretty close!
## [1] 0.2327728
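Because `babies1880.Rda` is needed to run the code above, here is a self-contained sketch of the same procedure on a made-up population. Everything in it is an assumption for illustration: the population size, the 0.23% share of Alfreds, and the seed. It also reports the spread of the sampling distribution and compares it to the usual approximation for the standard error of a proportion, which ignores the small correction for sampling without replacement.

```r
set.seed(234) # arbitrary seed

# Hypothetical population: 230 Alfreds among 100,000 babies (0.23%)
population <- c(rep("Alfred", 230), rep("Other", 99770))
true_pct <- mean(population == "Alfred") * 100

samples <- 200
sample_size <- 1000
pct <- rep(NA, samples) # one element per sample

for (i in 1:samples) {
  drawn <- sample(x = population, size = sample_size, replace = FALSE)
  pct[i] <- mean(drawn == "Alfred") * 100 # percentage of Alfreds in this sample
}

mean(pct) # estimate of the population percentage
sd(pct) # spread of the sampling distribution, in percentage points

# Approximate theoretical standard error, in percentage points
sqrt(true_pct / 100 * (1 - true_pct / 100) / sample_size) * 100
```

Increasing `sample_size` narrows the sampling distribution, while increasing `samples` mainly makes its shape and mean more stable.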
This work by Sarah Lachance is licensed under CC BY-NC-ND 4.0