0. Introduction

In this module, we will learn more details about R data structures and functions to strengthen our basics. The topics covered include:


1. List


We have learned how to create vectors that contain multiple values of the same type with the c() function. These vectors are called atomic vectors, of which there are six types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors. In this course we won’t use complex and raw vectors, so we ignore them here.

We can use the typeof() function to check the type of a vector:

typeof(c(1,2,3))
## [1] "double"
typeof(c(1L, 2L, 3L))
## [1] "integer"
typeof(c("a", "b", "c"))
## [1] "character"
typeof(c(T, F, TRUE, FALSE))
## [1] "logical"

Note that a number by default is considered a double type. To force it into an integer, we need to place an L after the number.
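For example, a quick check with typeof() confirms the difference the L suffix makes:

typeof(1)
## [1] "double"
typeof(1L)
## [1] "integer"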

These vectors are homogeneous in the sense that all values must be of the same type. If not, they will be forced into the same type.

typeof(c(1L, 2, 3))
## [1] "double"
typeof(c(1, 2, "a"))
## [1] "character"

In the example above, a mixture of integers and doubles is coerced to a double vector, while a mixture of numbers and strings is coerced to a character vector.

To create a heterogeneous vector that contains values of different atomic types, we have to use lists, which are sometimes called recursive vectors because lists can contain other lists. The diagram below shows the hierarchy of R’s vector types.

There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
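For example, NULL has length 0 and silently disappears when combined with other values:

length(NULL)
## [1] 0
c(1, 2, NULL)
## [1] 1 2
typeof(NULL)
## [1] "NULL"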

Every vector has two key properties: its type, which you can determine with typeof(), and its length, which you can determine with length().

typeof(letters)
## [1] "character"
typeof(1:10)
## [1] "integer"
x <- list("a", "b", 1:10)
length(x)
## [1] 3

Vectors can also contain arbitrary additional metadata in the form of attributes. These attributes are used to create augmented vectors, which build additional behaviour on top of basic vectors. There are three important types of augmented vector:

  • Factors, which are built on top of integer vectors.
  • Dates and date-times, which are built on top of numeric vectors.
  • Data frames and tibbles, which are built on top of lists.
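As a small illustration, a factor is an augmented vector: under the hood it is an integer vector carrying levels and class attributes.

f <- factor(c("b", "a", "b"))
typeof(f)
## [1] "integer"
attributes(f)
## $levels
## [1] "a" "b"
## 
## $class
## [1] "factor"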


Coercion

There are two ways to convert, or coerce, one type of vector to another:

  • Explicit coercion happens when you call a function like as.logical(), as.integer(), as.double(), or as.character(). Whenever you find yourself using explicit coercion, you should always check whether you can make the fix upstream, so that the vector never had the wrong type in the first place. For example, you may need to tweak your readr col_types specification.

  • Implicit coercion happens when you use a vector in a specific context that expects a certain type of vector. For example, when you use a logical vector with a numeric summary function, or when you use a double vector where an integer vector is expected.

Explicit coercion is used relatively rarely, and is largely easy to understand. For implicit coercion, we’ve already seen an example - using a logical vector in a numeric context. In this case TRUE is converted to 1 and FALSE converted to 0:

x <- 1:100
y <- x <= 40
print(y)
##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [37]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE
sum(y)
## [1] 40
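By the same coercion, mean() of a logical vector gives the proportion of TRUE values:

mean(y)
## [1] 0.4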


Recycling Rules

As well as implicitly coercing the types of vectors to be compatible, R will also implicitly coerce the length of vectors. This is called vector recycling, because the shorter vector is repeated, or recycled, to the same length as the longer vector.

This is generally most useful when you are mixing vectors and “scalars”. I put scalars in quotes because R doesn’t actually have scalars: instead, a single number is a vector of length 1. Because there are no scalars, most built-in functions are vectorised, meaning that they will operate on a vector of numbers. That’s why, for example, this code works:

1:10 + 100
##  [1] 101 102 103 104 105 106 107 108 109 110
1:10 > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

In R, basic mathematical operations work with vectors. That means that you should never need to perform explicit iteration when performing simple mathematical computations.

c(1,3,5)/2
## [1] 0.5 1.5 2.5
c(1,3,5)^2
## [1]  1  9 25

It’s intuitive what should happen if you add two vectors of the same length, or a vector and a “scalar”, but what happens if you add two vectors of different lengths?

c(1,3,5,7,9,11) + c(1,2)
## [1]  2  5  6  9 10 13

Here, R expands the shorter vector to the same length as the longer one, which is called recycling. This is silent except when the length of the longer vector is not an integer multiple of the length of the shorter one:

c(1,3,5) + c(1,3)
## Warning in c(1, 3, 5) + c(1, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 6 6

While vector recycling can be used to write very succinct, clever code, it can also silently conceal problems. For this reason, the vectorised functions in the tidyverse throw an error when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep():

tibble(x = 1:4, y = 1:2)
tibble(x = 1:4, y = rep(1:2, 2))
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     1
## 4     4     2
tibble(x = 1:4, y = rep(1:2, each = 2))
## # A tibble: 4 × 2
##       x     y
##   <int> <int>
## 1     1     1
## 2     2     1
## 3     3     2
## 4     4     2


Naming Vectors

All types of vectors can be named. You can name them during creation with c() or with purrr::set_names():

v1 <- c(x = 1, y = 2, x = 4)
v2 <- set_names(1:3, c("a", "b", "c"))

Named vectors are like dictionaries in Python. We can retrieve values by their names using [] or [[]]:

v1["x"]
## x 
## 1
v1[c("x","y")]
## x y 
## 1 2
v1[["x"]]
## [1] 1
v2[["b"]]
## [1] 2

The difference between [ and [[ is that [[ only extracts a single element, and always drops names. It’s a good idea to use it whenever you want to make it clear that you’re extracting a single item, as in a for loop. The distinction between [ and [[ is most important for lists, as we’ll see shortly.


Lists

Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():

x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3

A very useful tool for working with lists is str() because it focusses on the structure, not the contents.

str(x)
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
##  $ a: num 1
##  $ b: num 2
##  $ c: num 3

Unlike atomic vectors, lists can contain a mix of object types and can even contain other lists.

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE
z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ :List of 2
##   ..$ : num 3
##   ..$ : num 4


Subsetting a list

There are three ways to subset a list, which we’ll illustrate with a list named a:

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
  • [ extracts a sub-list. The result will always be a list.
str(a[1:2])
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ b: chr "a string"
str(a[4])
## List of 1
##  $ d:List of 2
##   ..$ : num -1
##   ..$ : num -5
  • [[ extracts a single component from a list. It removes a level of hierarchy from the list.
str(a[[1]])
##  int [1:3] 1 2 3
str(a[[4]])
## List of 2
##  $ : num -1
##  $ : num -5
  • $ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3


Application of lists - ggplot object

You may wonder why we need lists at all in data analysis, since each column of a data frame holds values of a single type. For example:

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

So each column is a vector of either character, integer or double. What is the use of lists here? Let’s create a plot.

plot1 <- ggplot(mpg) + geom_bar(mapping = aes(x = as.factor(cyl), fill = drv), position = "dodge")
plot1

We assigned the plot to the name plot1. You might think that plot1 is merely a picture; indeed, if you type plot1 into the console, the same plot is displayed.

However, plot1 actually stores all the information used to build the plot. If we check its type, we will see that:

typeof(plot1)
## [1] "list"

So plot1 is actually a list. What is in it? The answer is: everything that defines the plot. If we want to see the data set that was used to create the plot, we can use

plot1$data
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

So the first element in the plot1 list is the original data frame. If we want to see the labels of the plot, we can use

plot1$labels
## $x
## [1] "as.factor(cyl)"
## 
## $fill
## [1] "drv"
## 
## $y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
## 
## $weight
## [1] "weight"
## attr(,"fallback")
## [1] TRUE

This returns another list, which again has several items in it. For example, the x-label and y-label are

plot1$labels$x
## [1] "as.factor(cyl)"
plot1$labels$y
## [1] "count"
## attr(,"fallback")
## [1] TRUE

We can change the labels by directly modifying these values:

plot1$labels$x <- "New_x_label"
plot1$labels$y <- "New_y_label"
plot1


Use ggplot_build to adjust all details of a plot

We can use the ggplot_build() function on a ggplot object to obtain all raw data in producing the plot, and adjust any details.

plot1_data <- ggplot_build(plot1)
plot1_data
## $data
## $data[[1]]
##      fill  y count prop     x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23    23    1 0.775       FALSE     1     1    0   23 0.55 1.00
## 2 #00BA38 58    58    1 1.225       FALSE     1     2    0   58 1.00 1.45
## 3 #00BA38  4     4    1 2.000       FALSE     1     3    0    4 1.55 2.45
## 4 #F8766D 32    32    1 2.700       FALSE     1     4    0   32 2.55 2.85
## 5 #00BA38 43    43    1 3.000       FALSE     1     5    0   43 2.85 3.15
## 6 #619CFF  4     4    1 3.300       FALSE     1     6    0    4 3.15 3.45
## 7 #F8766D 48    48    1 3.700       FALSE     1     7    0   48 3.55 3.85
## 8 #00BA38  1     1    1 4.000       FALSE     1     8    0    1 3.85 4.15
## 9 #619CFF 21    21    1 4.300       FALSE     1     9    0   21 4.15 4.45
##   colour linewidth linetype alpha
## 1     NA       0.5        1    NA
## 2     NA       0.5        1    NA
## 3     NA       0.5        1    NA
## 4     NA       0.5        1    NA
## 5     NA       0.5        1    NA
## 6     NA       0.5        1    NA
## 7     NA       0.5        1    NA
## 8     NA       0.5        1    NA
## 9     NA       0.5        1    NA
## 
## 
## $layout
## <ggproto object: Class Layout, gg>
##     coord: <ggproto object: Class CoordCartesian, Coord, gg>
##         aspect: function
##         backtransform_range: function
##         clip: on
##         default: TRUE
##         distance: function
##         expand: TRUE
##         is_free: function
##         is_linear: function
##         labels: function
##         limits: list
##         modify_scales: function
##         range: function
##         render_axis_h: function
##         render_axis_v: function
##         render_bg: function
##         render_fg: function
##         setup_data: function
##         setup_layout: function
##         setup_panel_guides: function
##         setup_panel_params: function
##         setup_params: function
##         train_panel_guides: function
##         transform: function
##         super:  <ggproto object: Class CoordCartesian, Coord, gg>
##     coord_params: list
##     facet: <ggproto object: Class FacetNull, Facet, gg>
##         compute_layout: function
##         draw_back: function
##         draw_front: function
##         draw_labels: function
##         draw_panels: function
##         finish_data: function
##         init_scales: function
##         map_data: function
##         params: list
##         setup_data: function
##         setup_params: function
##         shrink: TRUE
##         train_scales: function
##         vars: function
##         super:  <ggproto object: Class FacetNull, Facet, gg>
##     facet_params: list
##     finish_data: function
##     get_scales: function
##     layout: data.frame
##     map_position: function
##     panel_params: list
##     panel_scales_x: list
##     panel_scales_y: list
##     render: function
##     render_labels: function
##     reset_scales: function
##     setup: function
##     setup_panel_guides: function
##     setup_panel_params: function
##     train_position: function
##     xlabel: function
##     ylabel: function
##     super:  <ggproto object: Class Layout, gg>
## 
## $plot

## 
## attr(,"class")
## [1] "ggplot_built"

So here we have another list with three elements $data, $layout and $plot. If we check $data we see all details of the plot.

plot1_data$data[[1]]
##      fill  y count prop     x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23    23    1 0.775       FALSE     1     1    0   23 0.55 1.00
## 2 #00BA38 58    58    1 1.225       FALSE     1     2    0   58 1.00 1.45
## 3 #00BA38  4     4    1 2.000       FALSE     1     3    0    4 1.55 2.45
## 4 #F8766D 32    32    1 2.700       FALSE     1     4    0   32 2.55 2.85
## 5 #00BA38 43    43    1 3.000       FALSE     1     5    0   43 2.85 3.15
## 6 #619CFF  4     4    1 3.300       FALSE     1     6    0    4 3.15 3.45
## 7 #F8766D 48    48    1 3.700       FALSE     1     7    0   48 3.55 3.85
## 8 #00BA38  1     1    1 4.000       FALSE     1     8    0    1 3.85 4.15
## 9 #619CFF 21    21    1 4.300       FALSE     1     9    0   21 4.15 4.45
##   colour linewidth linetype alpha
## 1     NA       0.5        1    NA
## 2     NA       0.5        1    NA
## 3     NA       0.5        1    NA
## 4     NA       0.5        1    NA
## 5     NA       0.5        1    NA
## 6     NA       0.5        1    NA
## 7     NA       0.5        1    NA
## 8     NA       0.5        1    NA
## 9     NA       0.5        1    NA

The $data element stores the data for each panel of the plot. This graph has only one panel, so $data contains a single element; we extract it with [[1]] and obtain a data frame.

class(plot1_data$data[[1]])
## [1] "data.frame"
typeof(plot1_data$data[[1]])
## [1] "list"

Note that class() returns the name of the augmented type, while typeof() returns the underlying storage type. A data frame is built on top of a list, so typeof() reports a list, but it is an augmented vector with its own class: data.frame.
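The same distinction holds for tibbles; a quick check on mpg shows the class names layered on top of the underlying list type:

class(mpg)
## [1] "tbl_df"     "tbl"        "data.frame"
typeof(mpg)
## [1] "list"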

This data frame stores the details of the bar plot, such as the range of x-coordinates covered by each bar. Let’s change one of these values and see how the plot is updated:

plot1_data$data[[1]]$xmin[3] <- 2
plot(ggplot_gtable(plot1_data))

So we have successfully changed the left border of the bar on top of “5”. Note that here “4”, “5”, “6”, “8” are labels, and the actual x-coordinates are 1 to 4 respectively. So when we set the xmin of the “5” bar to 2, it becomes what we see above.

The ggplot_gtable() function assembles the built plot data into a gtable (a grid graphics object), which can then be displayed with the plot() function.

Lab Exercise: Manually change the count of the “5” bar to 100 by modifying plot1_data$data[[1]], then redraw the plot.


2. For loops

Imagine we have this simple tibble:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

We want to compute the median of each column. You could do it with copy-and-paste:

median(df$a)
## [1] 0.2631598
median(df$b)
## [1] -0.1114329
median(df$c)
## [1] -0.230409
median(df$d)
## [1] 0.162175

But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:

output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
## [1]  0.2631598 -0.1114329 -0.2304090  0.1621750

The template of for loops in R is similar to that in Python:

for (<iter_var> in <iterable>) {
  # Body
}

Here the seq_along() function gives a vector of indices for the given vector or list:

a = letters # The character vector of a-z
seq_along(a)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26
seq_along(mpg) # The indices are for each element of a list - in this case each column
##  [1]  1  2  3  4  5  6  7  8  9 10 11

Every for loop has three components:

  • The output: output <- vector("double", ncol(df)). Before you start the loop, allocate enough space for the output; growing the output inside the loop is very slow.

  • The sequence: i in seq_along(df). This determines what to loop over: each run of the loop assigns i to a different value from seq_along(df).

  • The body: output[[i]] <- median(df[[i]]). This is the code that does the work; it is run repeatedly, each time with a different value of i.

You might wonder why we use seq_along(df) instead of the more familiar 1:length(df). The reason is that seq_along() does the right thing for a zero-length vector, while 1:length() does not:

y <- vector("double", 0)
seq_along(y)
## integer(0)
1:length(y)
## [1] 1 0
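To see the consequence, here is a small sketch using the empty vector y defined above: the 1:length() version still runs twice (with i equal to 1 and then 0), while the seq_along() version correctly runs zero times.

for (i in 1:length(y)) {
  print(i)   # runs even though y is empty
}
## [1] 1
## [1] 0
for (i in seq_along(y)) {
  print(i)   # never runs, which is the correct behaviour
}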

That’s all there is to the for loop! Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below. Then we’ll move on some variations of the for loop that help you solve other problems that will crop up in practice.


Lab Exercises:

  1. Write for loops to:
  • Compute the mean of every column in mtcars.
  • Determine the type of each column in nycflights13::flights.
  • Compute the number of unique values in each column of iris.


For loops variations (self-study)


Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don’t forget about them once you’ve mastered the FP techniques you’ll learn about in the next section.

There are four variations on the basic theme of the for loop:

  • Modifying an existing object, instead of creating a new object.
  • Looping over names or values, instead of indices.
  • Handling outputs of unknown length.
  • Handling sequences of unknown length.


Modifying an existing object

Sometimes you want to use a for loop to modify an existing object. For example, remember our challenge from the module on functions: we wanted to rescale every column in a data frame.

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

To solve this with a for loop we again think about the three components:

  • Output: we already have the output — it’s the same as the input!

  • Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).

  • Body: apply rescale01().

This gives us:

for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}


Looping Patterns

There are three basic ways to loop over a vector. So far I’ve shown you the most general: looping over the numeric indices with for (i in seq_along(xs)), and extracting the value with xs[[i]]. There are two other forms:

  1. Loop over the elements: for (x in xs). This is most useful if you only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently since we don’t have the indices.

  2. Loop over the names: for (nm in names(xs)). This gives you the name, which you can use to access the value with xs[[nm]]. This is useful if you want to use the name in a plot title or a file name. If you’re creating named output, make sure to name the results vector like so:

results <- vector("list", length(mpg))
names(results) <- names(mpg)

for (nm in names(mpg)) {
  results[[nm]] = is.numeric(mpg[[nm]]) # Output whether each column in "mpg" is a numeric one or not
}

results
## $manufacturer
## [1] FALSE
## 
## $model
## [1] FALSE
## 
## $displ
## [1] TRUE
## 
## $year
## [1] TRUE
## 
## $cyl
## [1] TRUE
## 
## $trans
## [1] FALSE
## 
## $drv
## [1] FALSE
## 
## $cty
## [1] TRUE
## 
## $hwy
## [1] TRUE
## 
## $fl
## [1] FALSE
## 
## $class
## [1] FALSE

Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:

for (i in seq_along(x)) {
  name <- names(x)[[i]]
  value <- x[[i]]
}


Unknown output length

Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:

means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)
##  num [1:144] -0.7611 -0.0208 1.243 0.2511 -1.1472 ...

But this is not very efficient in terms of computational time, because growing the vector with c() copies all of the existing elements on every iteration. A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:

out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)
## List of 3
##  $ : num [1:13] 1.215 -0.563 0.124 -1.423 1.309 ...
##  $ : num [1:93] -0.405 0.455 0.751 1.765 2.009 ...
##  $ : num [1:14] 1.4 1.48 2.33 4.42 1.44 ...
str(unlist(out))
##  num [1:120] 1.215 -0.563 0.124 -1.423 1.309 ...

Here we’ve used unlist() to flatten a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl() — it will throw an error if the input isn’t a list of doubles.

As another example, you might be generating a big data frame. Instead of sequentially binding rows together in each iteration, save each iteration’s output in a list, then use dplyr::bind_rows(output) to combine the pieces into a single data frame.

x <- letters
result <- vector("list", length(x))

for (i in seq_along(x)) {
  result[[i]] = tibble(x1 = x[[i]], x2 = str_c(x[[i]], x[[i]]), x3 = str_c(x[[i]], x[[i]], x[[i]]))
}

bind_rows(result)
## # A tibble: 26 × 3
##    x1    x2    x3   
##    <chr> <chr> <chr>
##  1 a     aa    aaa  
##  2 b     bb    bbb  
##  3 c     cc    ccc  
##  4 d     dd    ddd  
##  5 e     ee    eee  
##  6 f     ff    fff  
##  7 g     gg    ggg  
##  8 h     hh    hhh  
##  9 i     ii    iii  
## 10 j     jj    jjj  
## # … with 16 more rows


Unknown sequence length

Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with a for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:

while (condition) {
  # body
}

A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:

for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

Here’s how we could use a while loop to find how many tries it takes to get three heads in a row:

flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
## [1] 10

Here the sample() function draws a sample of size one from the two equally likely outcomes “T” and “H”. So each time we run flip(), we simulate tossing a fair coin.
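As a quick sanity check (the exact counts are random and will vary from run to run), we can simulate many flips and tabulate the outcomes:

table(replicate(1000, flip()))  # roughly 500 heads and 500 tails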


3. Functionals

For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.

To see why this is important, consider (again) this simple data frame:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

Imagine you want to compute the mean of every column. You could do that with a for loop:

output <- vector("double", length(df))
for (i in seq_along(df)) {
  output[[i]] <- mean(df[[i]])
}
output
## [1]  0.19406328 -0.03098851  0.18946388  0.15469845

You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:

col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}

Now we can apply the function to any data frame:

col_mean(df)
## [1]  0.19406328 -0.03098851  0.18946388  0.15469845
col_mean(mtcars)
##  [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
##  [7]  17.848750   0.437500   0.406250   3.687500   2.812500

But then you think it’d also be helpful to be able to compute the median, and the standard deviation, so you copy and paste your col_mean() function and replace the mean() with median() and sd():

col_median <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
  }
  output
}
col_sd <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- sd(df[[i]])
  }
  output
}

Obviously, it is not convenient to define a separate function for each type of summary. Instead, we can generalise this into a single function that takes the summary function as an argument:

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, median)
## [1] -0.05732161  0.44834778 -0.10577387  0.33546271
col_summary(df, mean)
## [1]  0.19406328 -0.03098851  0.18946388  0.15469845

In the code above, the function fun itself becomes an argument of another function, col_summary. Passing functions as arguments like this is what we refer to as functional programming.
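Because fun is just an argument, we can pass any function we like, including an anonymous one. The helper col_range below is invented here purely for illustration:

col_range <- function(x) max(x) - min(x)       # range width of a column
col_summary(df, col_range)                     # pass the named function
col_summary(df, function(x) max(x) - min(x))   # or pass an anonymous function directly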

Next, we’ll learn about and use the purrr package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc.) solves a similar problem, but purrr is more consistent and thus easier to learn.


The map function

The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:

  • map() makes a list.
  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.

Each function takes two key inputs: the first is a vector (or list), and the second is a function. The map family applies the function to each element of the input, and then returns a new vector (or list) of the same length (and with the same names) as the input.

For example, a data frame is a list whose elements are its columns, so the map functions apply the supplied function to each column.

df
## # A tibble: 10 × 4
##         a      b       c      d
##     <dbl>  <dbl>   <dbl>  <dbl>
##  1  0.741 -0.364 -0.420   0.691
##  2  0.861  0.460  1.82    1.19 
##  3  1.85   0.508  0.253   0.280
##  4 -0.448 -1.17   0.861   1.50 
##  5  0.115  0.698  1.44   -1.30 
##  6  1.09  -2.35  -0.481  -0.560
##  7 -0.229 -1.66  -0.637   0.391
##  8 -0.937  1.75  -0.728  -0.829
##  9 -0.776  0.437 -0.0755 -0.281
## 10 -0.325  1.37  -0.136   0.467
map(df, mean)
## $a
## [1] 0.1940633
## 
## $b
## [1] -0.03098851
## 
## $c
## [1] 0.1894639
## 
## $d
## [1] 0.1546985
map(df, median)
## $a
## [1] -0.05732161
## 
## $b
## [1] 0.4483478
## 
## $c
## [1] -0.1057739
## 
## $d
## [1] 0.3354627
map(df, sd)
## $a
## [1] 0.9046091
## 
## $b
## [1] 1.327154
## 
## $c
## [1] 0.8941286
## 
## $d
## [1] 0.8908309

The map function returns a list as the result. In this example, we can also return a double vector using the map_dbl() function:

map_dbl(df, mean)
##           a           b           c           d 
##  0.19406328 -0.03098851  0.18946388  0.15469845
map_dbl(df, median)
##           a           b           c           d 
## -0.05732161  0.44834778 -0.10577387  0.33546271
map_dbl(df, sd)
##         a         b         c         d 
## 0.9046091 1.3271540 0.8941286 0.8908309


Shortcuts for functions and additional arguments

The map family has some features that make it very convenient to use. First, we can pass additional arguments of the mapped function directly inside the map call:

map_dbl(df, mean, trim = 0.1)
##          a          b          c          d 
## 0.12898642 0.03602351 0.10079341 0.16864996

Here trim is an argument of the mean() function that requests a trimmed mean (removing 10% of the data from each end before computing the mean). We can simply pass such arguments through map_dbl() or any other map family function.
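In other words, the call above applies mean(column, trim = 0.1) to each column; for instance, for column a:

mean(df$a, trim = 0.1)
## [1] 0.1289864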

Second, there are a few shortcuts that you can use in place of the function name. For example, to compute the number of NA values in each column, we can use a one-sided formula:

map_dbl(flights, ~sum(is.na(.)))
##           year          month            day       dep_time sched_dep_time 
##              0              0              0           8255              0 
##      dep_delay       arr_time sched_arr_time      arr_delay        carrier 
##           8255           8713              0           9430              0 
##         flight        tailnum         origin           dest       air_time 
##              0           2512              0              0           9430 
##       distance           hour         minute      time_hour 
##              0              0              0              0

These functions also conveniently create names for the output vector or list. Here the . refers to the current list element (in this case, the current column).
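The formula is just shorthand for an anonymous function, so the call above (assuming nycflights13 is loaded, as before) is equivalent to:

map_dbl(flights, function(x) sum(is.na(x)))  # same output as the formula version above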

As another example, suppose we want to compute the correlation coefficients between mpg and all the other variables in the mtcars data set. We can do the following:

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
map_dbl(mtcars, ~cor(mtcars$mpg, .))
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  1.0000000 -0.8521620 -0.8475514 -0.7761684  0.6811719 -0.8676594  0.4186840 
##         vs         am       gear       carb 
##  0.6640389  0.5998324  0.4802848 -0.5509251

By doing this we see all the correlation coefficients in one shot!

What if there are some non-numeric columns? We can filter them out and then do the same thing. Let’s try this with the mpg data set. First, we need to keep only the columns of numeric type. We can do this easily with the map_lgl() function, which returns a logical vector.

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows
column_numeric <- map_lgl(mpg, is.numeric) # See whether each column is numeric or not
mpg_numeric <- mpg[column_numeric]  # Only keep numeric columns
mpg_numeric
## # A tibble: 234 × 5
##    displ  year   cyl   cty   hwy
##    <dbl> <int> <int> <int> <int>
##  1   1.8  1999     4    18    29
##  2   1.8  1999     4    21    29
##  3   2    2008     4    20    31
##  4   2    2008     4    21    30
##  5   2.8  1999     6    16    26
##  6   2.8  1999     6    18    26
##  7   3.1  2008     6    18    27
##  8   1.8  1999     4    18    26
##  9   1.8  1999     4    16    25
## 10   2    2008     4    20    28
## # … with 224 more rows

Now we can do the same as above to get the correlation coefficients. Let’s use cty as the measure of fuel efficiency.

map_dbl(mpg_numeric, ~cor(mpg_numeric$cty, .))
##       displ        year         cyl         cty         hwy 
## -0.79852397 -0.03723229 -0.80577141  1.00000000  0.95591591

As expected, hwy is highly correlated with cty, while year has little to do with fuel efficiency. Larger engines and more cylinders are associated with lower fuel efficiency.


Example for self-study

The following code computes the p-values of t-tests between the attrition flag and all numeric variables in the BankChurners.csv data set.

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
column_num <- map_lgl(bank_data, is.numeric) # See whether each column is numeric or not
bank_num <- bank_data[column_num]  # Only keep numeric columns
bank_num
## # A tibble: 10,127 × 14
##    Customer_Age Depend…¹ Month…² Total…³ Month…⁴ Conta…⁵ Credi…⁶ Total…⁷ Avg_O…⁸
##           <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1           45        3      39       5       1       3   12691     777   11914
##  2           49        5      44       6       1       2    8256     864    7392
##  3           51        3      36       4       1       0    3418       0    3418
##  4           40        4      34       3       4       1    3313    2517     796
##  5           40        3      21       5       1       0    4716       0    4716
##  6           44        2      36       3       1       2    4010    1247    2763
##  7           51        4      46       6       1       3   34516    2264   32252
##  8           32        0      27       2       2       2   29081    1396   27685
##  9           37        3      36       5       2       0   22352    2517   19835
## 10           48        2      36       6       3       3   11656    1677    9979
## # … with 10,117 more rows, 5 more variables: Total_Amt_Chng_Q4_Q1 <dbl>,
## #   Total_Trans_Amt <dbl>, Total_Trans_Ct <dbl>, Total_Ct_Chng_Q4_Q1 <dbl>,
## #   Avg_Utilization_Ratio <dbl>, and abbreviated variable names
## #   ¹​Dependent_count, ²​Months_on_book, ³​Total_Relationship_Count,
## #   ⁴​Months_Inactive_12_mon, ⁵​Contacts_Count_12_mon, ⁶​Credit_Limit,
## #   ⁷​Total_Revolving_Bal, ⁸​Avg_Open_To_Buy
map_dbl(bank_num, ~t.test(.[bank_data$Attrition_Flag == "Existing Customer"], .[bank_data$Attrition_Flag == "Attrited Customer"])$p.value)
##             Customer_Age          Dependent_count           Months_on_book 
##             5.771863e-02             5.251960e-02             1.603851e-01 
## Total_Relationship_Count   Months_Inactive_12_mon    Contacts_Count_12_mon 
##             3.225023e-48             1.717553e-60             6.687312e-89 
##             Credit_Limit      Total_Revolving_Bal          Avg_Open_To_Buy 
##             1.642963e-02            7.089719e-113             9.771547e-01 
##     Total_Amt_Chng_Q4_Q1          Total_Trans_Amt           Total_Trans_Ct 
##             1.305897e-39            6.349082e-106             0.000000e+00 
##      Total_Ct_Chng_Q4_Q1    Avg_Utilization_Ratio 
##            7.156056e-173             2.782074e-72

The following code computes the p-values of chi-square tests between the attrition flag and all categorical variables in the BankChurners.csv data set.

bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
column_chr <- map_lgl(bank_data, is.character) # See whether each column is character or not
bank_chr <- bank_data[column_chr]  # Only keep character columns
bank_chr
## # A tibble: 10,127 × 6
##    Attrition_Flag    Gender Education_Level Marital_Status Income_Cate…¹ Card_…²
##    <chr>             <chr>  <chr>           <chr>          <chr>         <chr>  
##  1 Existing Customer M      High School     Married        $60K - $80K   Blue   
##  2 Existing Customer F      Graduate        Single         Less than $4… Blue   
##  3 Existing Customer M      Graduate        Married        $80K - $120K  Blue   
##  4 Existing Customer F      High School     Unknown        Less than $4… Blue   
##  5 Existing Customer M      Uneducated      Married        $60K - $80K   Blue   
##  6 Existing Customer M      Graduate        Married        $40K - $60K   Blue   
##  7 Existing Customer M      Unknown         Married        $120K +       Gold   
##  8 Existing Customer M      High School     Unknown        $60K - $80K   Silver 
##  9 Existing Customer M      Uneducated      Single         $60K - $80K   Blue   
## 10 Existing Customer M      Graduate        Single         $80K - $120K  Blue   
## # … with 10,117 more rows, and abbreviated variable names ¹​Income_Category,
## #   ²​Card_Category
result <- map_dbl(bank_chr, ~chisq.test(bank_chr$Attrition_Flag, .)$p.value)
result
##  Attrition_Flag          Gender Education_Level  Marital_Status Income_Category 
##    0.0000000000    0.0001963585    0.0514891315    0.1089126339    0.0250024257 
##   Card_Category 
##    0.5252382798

We can further filter for the columns that have a p-value lower than 0.05:

result[result < 0.05]
##  Attrition_Flag          Gender Income_Category 
##    0.0000000000    0.0001963585    0.0250024257


Predicate functions

The approach above for keeping columns of a particular type is not the most convenient one. A number of purrr functions work with predicate functions (such as is.factor, is.character, etc.) that return a single TRUE or FALSE.

keep() and discard() keep elements of the input where the predicate is TRUE or FALSE respectively:

mpg %>%
  keep(is.character) %>%
  print()
## # A tibble: 234 × 6
##    manufacturer model      trans      drv   fl    class  
##    <chr>        <chr>      <chr>      <chr> <chr> <chr>  
##  1 audi         a4         auto(l5)   f     p     compact
##  2 audi         a4         manual(m5) f     p     compact
##  3 audi         a4         manual(m6) f     p     compact
##  4 audi         a4         auto(av)   f     p     compact
##  5 audi         a4         auto(l5)   f     p     compact
##  6 audi         a4         manual(m5) f     p     compact
##  7 audi         a4         auto(av)   f     p     compact
##  8 audi         a4 quattro manual(m5) 4     p     compact
##  9 audi         a4 quattro auto(l5)   4     p     compact
## 10 audi         a4 quattro manual(m6) 4     p     compact
## # … with 224 more rows
mpg %>%
  discard(is.character) %>%
  print()
## # A tibble: 234 × 5
##    displ  year   cyl   cty   hwy
##    <dbl> <int> <int> <int> <int>
##  1   1.8  1999     4    18    29
##  2   1.8  1999     4    21    29
##  3   2    2008     4    20    31
##  4   2    2008     4    21    30
##  5   2.8  1999     6    16    26
##  6   2.8  1999     6    18    26
##  7   3.1  2008     6    18    27
##  8   1.8  1999     4    18    26
##  9   1.8  1999     4    16    25
## 10   2    2008     4    20    28
## # … with 224 more rows

some() and every() determine if the predicate is true for any or for all of the elements.

some(mpg, is.character)
## [1] TRUE
every(mpg, is_character)
## [1] FALSE
every(mpg, is_vector)
## [1] TRUE
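Predicate functions combine naturally with the map family. For example, here is a sketch that keeps only the numeric columns of mpg and computes each column mean in a single pipeline (this reproduces the mpg_numeric approach from earlier):

mpg %>%
  keep(is.numeric) %>%
  map_dbl(mean)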


Lab Exercises:

For your midterm project data set, find all columns that have a p-value less than 0.05 when doing t-test or chi-square test with the target variable.