In this module, we will learn more about lists, how to write for loops in R, and a set of powerful mapping functions, called functionals, that let us avoid writing explicit loops.
The diagram below shows the hierarchy of R’s vector types
There’s one other related object: NULL. NULL is often used to represent the absence of a vector (as opposed to NA, which is used to represent the absence of a value in a vector). NULL typically behaves like a vector of length 0.
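The difference between NULL and NA can be checked with a few lines of base R. This is a small illustrative sketch; the variable name v is arbitrary.

```r
# NULL is the absence of a vector; it has length 0
length(NULL)
is.null(NULL)

# NA is a missing value inside a vector; it still occupies a slot
v <- c(1, NA, 3)
length(v)
is.na(v)

# Combining with NULL adds nothing; combining with NA adds a missing element
length(c(1, NULL, 3))
length(c(1, NA, 3))
```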
Lists are a step up in complexity from atomic vectors, because lists can contain other lists. This makes them suitable for representing hierarchical or tree-like structures. You create a list with list():
x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
A very useful tool for working with lists is str(), because it focuses on the structure, not the contents.
str(x)
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
Unlike atomic vectors, a list can contain a mix of object types, and can even contain other lists.
y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
z <- list(list(1, 2), list(3, 4))
str(z)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ :List of 2
## ..$ : num 3
## ..$ : num 4
There are three ways to subset a list, which we’ll illustrate with a list named a:
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
[ extracts a sub-list. The result will always be a list.
str(a[1:2])
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
str(a[4])
## List of 1
## $ d:List of 2
## ..$ : num -1
## ..$ : num -5
[[ extracts a single component from a list. It removes a level of hierarchy from the list.
str(a[[1]])
## int [1:3] 1 2 3
str(a[[4]])
## List of 2
## $ : num -1
## $ : num -5
$ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
a$a
## [1] 1 2 3
a[["a"]]
## [1] 1 2 3
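To make the difference between the three operators concrete, the sketch below applies them to a small list and inspects what each one returns. The name rec and its contents are made up for illustration.

```r
rec <- list(id = 1:3, label = "a string", nested = list(-1, -5))

# [ keeps the list wrapper: the result is a list of length 1
sub <- rec["id"]
is.list(sub)
length(sub)

# [[ and $ drop one level and return the element itself
elem <- rec[["id"]]
is.list(elem)
identical(elem, rec$id)

# Operators can be chained to reach into nested lists
rec$nested[[2]]
rec[["nested"]][[2]]
```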
ggplot object
You may wonder why we need lists at all in data analysis, since every column of a typical data frame is an atomic vector with a single data type. For example:
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
So each column is a vector of either character, integer or double. What is the use of lists here? Let’s create a plot.
plot1 <- ggplot(mpg) + geom_bar(mapping = aes(x = as.factor(cyl), fill = drv), position = "dodge")
plot1
We stored the plot in plot1. You might think that plot1 is merely a picture, and indeed typing plot1 into the console displays the plot. However, all the information needed to draw the plot is stored in plot1. If we check its type, we see that:
typeof(plot1)
## [1] "list"
So plot1 is actually a list. What is in it? The answer is everything. If we want to see the data set that was used to create the plot, we can use:
plot1$data
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
So the first element in the list plot1 is the original data frame. If we want to see the labels of the plot, we can use:
plot1$labels
## $x
## [1] "as.factor(cyl)"
##
## $fill
## [1] "drv"
##
## $y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
##
## $weight
## [1] "weight"
## attr(,"fallback")
## [1] TRUE
We have another list returned which has again a few items in it. For example, the x-label and y-label are
plot1$labels$x
## [1] "as.factor(cyl)"
plot1$labels$y
## [1] "count"
## attr(,"fallback")
## [1] TRUE
We can actually change the labels by directly changing the values here
plot1$labels$x <- "New_x_label"
plot1$labels$y <- "New_y_label"
plot1
ggplot_build to adjust all details of a plot
We can use the ggplot_build() function on a ggplot object to obtain all the raw data used to produce the plot, and adjust any details.
plot1_data <- ggplot_build(plot1)
plot1_data
## $data
## $data[[1]]
## fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23 23 1 0.775 FALSE 1 1 0 23 0.55 1.00
## 2 #00BA38 58 58 1 1.225 FALSE 1 2 0 58 1.00 1.45
## 3 #00BA38 4 4 1 2.000 FALSE 1 3 0 4 1.55 2.45
## 4 #F8766D 32 32 1 2.700 FALSE 1 4 0 32 2.55 2.85
## 5 #00BA38 43 43 1 3.000 FALSE 1 5 0 43 2.85 3.15
## 6 #619CFF 4 4 1 3.300 FALSE 1 6 0 4 3.15 3.45
## 7 #F8766D 48 48 1 3.700 FALSE 1 7 0 48 3.55 3.85
## 8 #00BA38 1 1 1 4.000 FALSE 1 8 0 1 3.85 4.15
## 9 #619CFF 21 21 1 4.300 FALSE 1 9 0 21 4.15 4.45
## colour linewidth linetype alpha
## 1 NA 0.5 1 NA
## 2 NA 0.5 1 NA
## 3 NA 0.5 1 NA
## 4 NA 0.5 1 NA
## 5 NA 0.5 1 NA
## 6 NA 0.5 1 NA
## 7 NA 0.5 1 NA
## 8 NA 0.5 1 NA
## 9 NA 0.5 1 NA
##
##
## $layout
## <ggproto object: Class Layout, gg>
## coord: <ggproto object: Class CoordCartesian, Coord, gg>
## aspect: function
## backtransform_range: function
## clip: on
## default: TRUE
## distance: function
## expand: TRUE
## is_free: function
## is_linear: function
## labels: function
## limits: list
## modify_scales: function
## range: function
## render_axis_h: function
## render_axis_v: function
## render_bg: function
## render_fg: function
## setup_data: function
## setup_layout: function
## setup_panel_guides: function
## setup_panel_params: function
## setup_params: function
## train_panel_guides: function
## transform: function
## super: <ggproto object: Class CoordCartesian, Coord, gg>
## coord_params: list
## facet: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## facet_params: list
## finish_data: function
## get_scales: function
## layout: data.frame
## map_position: function
## panel_params: list
## panel_scales_x: list
## panel_scales_y: list
## render: function
## render_labels: function
## reset_scales: function
## setup: function
## setup_panel_guides: function
## setup_panel_params: function
## train_position: function
## xlabel: function
## ylabel: function
## super: <ggproto object: Class Layout, gg>
##
## $plot
##
## attr(,"class")
## [1] "ggplot_built"
So here we have another list with three elements: $data, $layout and $plot. If we check $data, we see all details of the plot.
plot1_data$data[[1]]
## fill y count prop x flipped_aes PANEL group ymin ymax xmin xmax
## 1 #F8766D 23 23 1 0.775 FALSE 1 1 0 23 0.55 1.00
## 2 #00BA38 58 58 1 1.225 FALSE 1 2 0 58 1.00 1.45
## 3 #00BA38 4 4 1 2.000 FALSE 1 3 0 4 1.55 2.45
## 4 #F8766D 32 32 1 2.700 FALSE 1 4 0 32 2.55 2.85
## 5 #00BA38 43 43 1 3.000 FALSE 1 5 0 43 2.85 3.15
## 6 #619CFF 4 4 1 3.300 FALSE 1 6 0 4 3.15 3.45
## 7 #F8766D 48 48 1 3.700 FALSE 1 7 0 48 3.55 3.85
## 8 #00BA38 1 1 1 4.000 FALSE 1 8 0 1 3.85 4.15
## 9 #619CFF 21 21 1 4.300 FALSE 1 9 0 21 4.15 4.45
## colour linewidth linetype alpha
## 1 NA 0.5 1 NA
## 2 NA 0.5 1 NA
## 3 NA 0.5 1 NA
## 4 NA 0.5 1 NA
## 5 NA 0.5 1 NA
## 6 NA 0.5 1 NA
## 7 NA 0.5 1 NA
## 8 NA 0.5 1 NA
## 9 NA 0.5 1 NA
$data stores one data frame per layer of the plot. This graph has only one layer (the bars), so the list has only one element. To extract it, we use [[1]], which gives a data frame.
class(plot1_data$data[[1]])
## [1] "data.frame"
typeof(plot1_data$data[[1]])
## [1] "list"
Note that class() returns the class of the augmented vector. A data frame is built on top of a list, so typeof() reports "list"; the class attribute is what gives it its own name, data frame.
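The same relationship between class() and typeof() can be seen on a small data frame built by hand. This is only a sketch using base R; the name scores is hypothetical.

```r
scores <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))

typeof(scores)      # the underlying storage type
class(scores)       # the class attribute layered on top
is.list(scores)     # a data frame passes the list check

# Because it is a list of columns, list tools work on it
length(scores)      # number of columns, not rows
names(scores)
scores[["value"]]

# Removing the class attribute exposes the plain list underneath
str(unclass(scores))
```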
This data frame stores the details of this bar plot, for example the range of x for each bar. Let’s change this and see how the plot is updated:
plot1_data$data[[1]]$xmin[3] <- 2
plot(ggplot_gtable(plot1_data))
So we have successfully changed the left border of the bar above “5”. Note that “4”, “5”, “6” and “8” are labels; the actual x-coordinates are 1 to 4, respectively. So when we set xmin of the bar for “5” to 2, we get what we see above.
The ggplot_gtable() function converts the built plot data into a gtable, a grid graphics object that can be displayed with the plot() function.
Change the count of “5” to 100 manually by modifying plot1_data$data[[1]].
Imagine we have this simple tibble:
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
We want to compute the median of each column. You could do this with copy-and-paste:
median(df$a)
## [1] 0.01869644
median(df$b)
## [1] -0.6875783
median(df$c)
## [1] -0.2067631
median(df$d)
## [1] -0.4444836
But that breaks our rule of thumb: never copy and paste more than twice. Instead, we could use a for loop:
output <- vector("double", ncol(df)) # 1. output
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output
## [1] 0.01869644 -0.68757830 -0.20676309 -0.44448360
The template of for loops in R is similar to that in Python:
for (<iter_var> in <iterable>) {
# Body
}
Here the seq_along() function gives a vector of indices for the given vector or list:
a = letters # The character vector of a-z
seq_along(a)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26
seq_along(mpg) # The indices are for each element of a list - in this case each column
## [1] 1 2 3 4 5 6 7 8 9 10 11
Every for loop has three components:
The output: output <- vector("double", ncol(df)). Before you start the loop, you must always allocate sufficient space for the output. This is very important for efficiency: if you grow the output at each iteration using c() (for example), your for loop will be very slow. A general way of creating an empty vector of a given length is the vector() function. It has two arguments: the type of the vector (“logical”, “integer”, “double”, “character”, etc.) and the length of the vector.
The sequence: i in seq_along(df). This determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df). You might not have seen seq_along() before. It’s a safe version of the familiar 1:length(l), with an important difference: if you have a zero-length vector, seq_along() does the right thing:
y <- vector("double", 0)
seq_along(y)
## integer(0)
1:length(y)
## [1] 1 0
The body: output[[i]] <- median(df[[i]]). This is the code that does the work. It’s run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on. Here we use [[]] since we want to work on single elements of a list or vector.
That’s all there is to the for loop! Now is a good time to practice creating some basic (and not so basic) for loops using the exercises below. Then we’ll move on to some variations of the for loop that help you solve other problems that will crop up in practice.
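The zero-length behaviour of seq_along() described above matters inside a loop. The sketch below counts how many times the loop body runs for an empty vector under each approach; the counter names are made up for illustration.

```r
empty <- vector("double", 0)

# With seq_along(), the body never runs for an empty input
runs_safe <- 0
for (i in seq_along(empty)) {
  runs_safe <- runs_safe + 1
}

# With 1:length(), the sequence is c(1, 0), so the body runs twice
runs_unsafe <- 0
for (i in 1:length(empty)) {
  runs_unsafe <- runs_unsafe + 1
}

runs_safe
runs_unsafe
```

In a real loop the second form would typically fail or produce nonsense, because it tries to read elements 1 and 0 of an empty vector.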
Compute the mean of every column in mtcars.
Determine the type of each column in nycflights13::flights.
Compute the number of unique values in each column of iris.
Once you have the basic for loop under your belt, there are some variations that you should be aware of. These variations are important regardless of how you do iteration, so don’t forget about them once you’ve mastered the FP techniques you’ll learn about in the next section.
There are four variations on the basic theme of the for loop:
Modifying an existing object, instead of creating a new object.
Looping over names or values, instead of indices.
Handling outputs of unknown length.
Handling sequences of unknown length.
Sometimes you want to use a for loop to modify an existing object. For example, remember our challenge from functions. We wanted to rescale every column in a data frame:
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
To solve this with a for loop we again think about the three components:
Output: we already have the output — it’s the same as the input!
Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).
Body: apply rescale01().
This gives us:
for (i in seq_along(df)) {
df[[i]] <- rescale01(df[[i]])
}
There are three basic ways to loop over a vector. So far I’ve shown you the most general: looping over the numeric indices with for (i in seq_along(xs)), and extracting the value with xs[[i]]. There are two other forms:
Loop over the elements: for (x in xs). This is most useful if you only care about side-effects, like plotting or saving a file, because it’s difficult to save the output efficiently since we don’t have the indices.
Loop over the names: for (nm in names(xs)). This gives you the name, which you can use to access the value with xs[[nm]]. This is useful if you want to use the name in a plot title or a file name. If you’re creating named output, make sure to name the results vector like so:
results <- vector("list", length(mpg))
names(results) <- names(mpg)
for (nm in names(mpg)) {
results[[nm]] = is.numeric(mpg[[nm]]) # Output whether each column in "mpg" is a numeric one or not
}
results
## $manufacturer
## [1] FALSE
##
## $model
## [1] FALSE
##
## $displ
## [1] TRUE
##
## $year
## [1] TRUE
##
## $cyl
## [1] TRUE
##
## $trans
## [1] FALSE
##
## $drv
## [1] FALSE
##
## $cty
## [1] TRUE
##
## $hwy
## [1] TRUE
##
## $fl
## [1] FALSE
##
## $class
## [1] FALSE
Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:
for (i in seq_along(x)) {
name <- names(x)[[i]]
value <- x[[i]]
}
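The three looping patterns can be compared side by side on a small named vector. This is an illustrative sketch in base R; the vector prices and its values are invented.

```r
prices <- c(apple = 1.2, pear = 0.8, kiwi = 0.5)

# 1. Loop over elements: convenient for side effects such as printing
for (p in prices) {
  print(p)
}

# 2. Loop over names: fill a pre-named output
doubled <- vector("double", length(prices))
names(doubled) <- names(prices)
for (nm in names(prices)) {
  doubled[[nm]] <- 2 * prices[[nm]]
}

# 3. Loop over indices: both the name and the value are available
labels <- vector("character", length(prices))
for (i in seq_along(prices)) {
  labels[[i]] <- paste0(names(prices)[[i]], ": ", prices[[i]])
}

doubled
labels
```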
Sometimes you might not know how long the output will be. For example, imagine you want to simulate some random vectors of random lengths. You might be tempted to solve this problem by progressively growing the vector:
means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
n <- sample(100, 1)
output <- c(output, rnorm(n, means[[i]]))
}
str(output)
## num [1:154] -1.909 -2.543 -1.269 0.243 1.123 ...
But this is not very efficient in terms of computational time. A better solution is to save the results in a list, and then combine them into a single vector after the loop is done:
out <- vector("list", length(means))
for (i in seq_along(means)) {
n <- sample(100, 1)
out[[i]] <- rnorm(n, means[[i]])
}
str(out)
## List of 3
## $ : num [1:55] -0.446 1.705 -0.32 -1.392 -1.218 ...
## $ : num [1:88] 2.2881 0.3702 -0.1459 -0.0999 1.8746 ...
## $ : num [1:85] 3.34 1.88 1.39 2.4 2.26 ...
str(unlist(out))
## num [1:228] -0.446 1.705 -0.32 -1.392 -1.218 ...
Here we’ve used unlist() to flatten a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl(): it will throw an error if the input isn’t a list of doubles.
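As a rough check that the list-based approach gives the same result as growing the vector, the sketch below runs both with the same random seed. The seed value and the names grown, pieces and collected are arbitrary choices for illustration.

```r
means <- c(0, 1, 2)

# Approach 1: grow the vector with c() at every iteration (slow for long loops)
set.seed(42)
grown <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  grown <- c(grown, rnorm(n, means[[i]]))
}

# Approach 2: store each piece in a pre-allocated list, combine once at the end
set.seed(42)
pieces <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  pieces[[i]] <- rnorm(n, means[[i]])
}
collected <- unlist(pieces)

identical(grown, collected)
```

Both loops make the same random draws in the same order, so only the way the results are stored differs.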
The same pattern applies when you are generating a big data frame. Instead of sequentially binding rows in each iteration, save the output in a list, then use dplyr::bind_rows() to combine the list into a single data frame.
x <- letters
result <- vector("list", length(x))
for (i in seq_along(x)) {
result[[i]] = tibble(x1 = x[[i]], x2 = str_c(x[[i]], x[[i]]), x3 = str_c(x[[i]], x[[i]], x[[i]]))
}
bind_rows(result)
## # A tibble: 26 × 3
## x1 x2 x3
## <chr> <chr> <chr>
## 1 a aa aaa
## 2 b bb bbb
## 3 c cc ccc
## 4 d dd ddd
## 5 e ee eee
## 6 f ff fff
## 7 g gg ggg
## 8 h hh hhh
## 9 i ii iii
## 10 j jj jjj
## # ℹ 16 more rows
Sometimes you don’t even know how long the input sequence should run for. This is common when doing simulations. For example, you might want to loop until you get three heads in a row. You can’t do that sort of iteration with a for loop. Instead, you can use a while loop. A while loop is simpler than a for loop because it only has two components, a condition and a body:
while (condition) {
# body
}
A while loop is also more general than a for loop, because you can rewrite any for loop as a while loop, but you can’t rewrite every while loop as a for loop:
for (i in seq_along(x)) {
# body
}
# Equivalent to
i <- 1
while (i <= length(x)) {
# body
i <- i + 1
}
Here’s how we could use a while loop to find how many tries it takes to get three heads in a row:
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
if (flip() == "H") {
nheads <- nheads + 1
} else {
nheads <- 0
}
flips <- flips + 1
}
flips
## [1] 4
Here the sample() function draws a sample of size one from the two possible outcomes “T” and “H”, each with equal chance. So each time we run flip(), we simulate tossing a fair coin.
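Wrapping the while loop in a function makes it easy to repeat the simulation many times. The function name flips_until_streak, its argument k and the seed are hypothetical choices; this is a sketch, not part of the original example.

```r
# Count coin flips until k heads appear in a row
flips_until_streak <- function(k = 3) {
  flips <- 0
  nheads <- 0
  while (nheads < k) {
    if (sample(c("T", "H"), 1) == "H") {
      nheads <- nheads + 1
    } else {
      nheads <- 0
    }
    flips <- flips + 1
  }
  flips
}

set.seed(1)
results <- replicate(1000, flips_until_streak(3))
min(results)   # can never be smaller than 3
mean(results)  # average number of flips over the simulated runs
```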
For loops are not as important in R as they are in other languages because R is a functional programming language. This means that it’s possible to wrap up for loops in a function, and call that function instead of using the for loop directly.
To see why this is important, consider (again) this simple data frame:
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
Imagine you want to compute the mean of every column. You could do that with a for loop:
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[[i]] <- mean(df[[i]])
}
output
## [1] -0.4658461 0.2339690 0.4616592 0.2435819
You realise that you’re going to want to compute the means of every column pretty frequently, so you extract it out into a function:
col_mean <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- mean(df[[i]])
}
output
}
Now we can apply the function to any data frame:
col_mean(df)
## [1] -0.4658461 0.2339690 0.4616592 0.2435819
col_mean(mtcars)
## [1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
## [7] 17.848750 0.437500 0.406250 3.687500 2.812500
But then you think it’d also be helpful to be able to compute the median and the standard deviation, so you copy and paste your col_mean() function and replace mean() with median() and sd():
col_median <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- median(df[[i]])
}
output
}
col_sd <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- sd(df[[i]])
}
output
}
Obviously, it is not convenient to define so many different functions for each type of summary. So we will consider generalising this into a single function:
col_summary <- function(df, fun) {
out <- vector("double", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, median)
## [1] -0.28949036 0.01286251 0.39718194 0.75517353
col_summary(df, mean)
## [1] -0.4658461 0.2339690 0.4616592 0.2435819
In the code above, the function fun is itself an argument of another function, col_summary(). Passing functions as arguments to other functions is a core idea of functional programming.
Next, we’ll learn about and use the purrr package, which provides functions that eliminate the need for many common for loops. The apply family of functions in base R (apply(), lapply(), tapply(), etc.) solves a similar problem, but purrr is more consistent and thus easier to learn.
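For comparison, the base R apply family mentioned above can express the same column summaries. The sketch below uses only base R on the built-in mtcars data; the result names are arbitrary.

```r
# lapply() always returns a list, one element per column
means_list <- lapply(mtcars, mean)

# vapply() returns an atomic vector and checks that each result
# is a single double
means_vec <- vapply(mtcars, mean, FUN.VALUE = numeric(1))

# Extra arguments after FUN.VALUE are passed on to the function
medians_vec <- vapply(mtcars, median, FUN.VALUE = numeric(1), na.rm = TRUE)

str(means_list[1:2])
means_vec[1:3]
```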
map function
The pattern of looping over a vector, doing something to each element and saving the results is so common that the purrr package provides a family of functions to do it for you. There is one function for each type of output:
map() makes a list.
map_lgl() makes a logical vector.
map_int() makes an integer vector.
map_dbl() makes a double vector.
map_chr() makes a character vector.
Each function takes two key inputs. The first input is a vector (or list), and the second is a function. The map family applies the function to each element of the vector, and then returns a new vector (or list) that’s the same length (and has the same names) as the input.
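A minimal sketch of the map family on a small hand-made list follows. The list scores and its contents are invented for illustration, and the purrr package is assumed to be installed.

```r
library(purrr)

scores <- list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9, 10))

map(scores, mean)            # a list of three means
map_dbl(scores, mean)        # a named double vector
map_int(scores, length)      # a named integer vector
map_lgl(scores, is.numeric)  # a named logical vector
map_chr(scores, typeof)      # a named character vector
```

The typed variants fail with an error when a result does not fit the requested type, which makes mistakes visible early.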
For example, each data frame is a list with each element being a column. Therefore map functions apply the function to each of the columns.
df
## # A tibble: 10 × 4
## a b c d
## <dbl> <dbl> <dbl> <dbl>
## 1 -1.39 0.231 0.929 1.01
## 2 -0.313 0.523 0.389 1.17
## 3 0.538 -0.497 -0.509 1.37
## 4 -0.0803 0.0114 1.00 -1.97
## 5 -0.128 1.25 0.301 0.165
## 6 -0.795 -0.111 0.678 -0.739
## 7 0.240 1.45 1.44 0.580
## 8 -0.310 -0.0163 -0.269 0.930
## 9 -0.269 0.0143 0.254 1.65
## 10 -2.15 -0.510 0.405 -1.72
map(df, mean)
## $a
## [1] -0.4658461
##
## $b
## [1] 0.233969
##
## $c
## [1] 0.4616592
##
## $d
## [1] 0.2435819
map(df, median)
## $a
## [1] -0.2894904
##
## $b
## [1] 0.01286251
##
## $c
## [1] 0.3971819
##
## $d
## [1] 0.7551735
map(df, sd)
## $a
## [1] 0.7926949
##
## $b
## [1] 0.6624582
##
## $c
## [1] 0.5832125
##
## $d
## [1] 1.291152
The map function returns a list as the result. In this example, we can also return a double vector using the map_dbl() function:
map_dbl(df, mean)
## a b c d
## -0.4658461 0.2339690 0.4616592 0.2435819
map_dbl(df, median)
## a b c d
## -0.28949036 0.01286251 0.39718194 0.75517353
map_dbl(df, sd)
## a b c d
## 0.7926949 0.6624582 0.5832125 1.2911519
The map family has some features that make it very convenient to use. First, we can pass additional arguments for the function directly through the map functions.
map_dbl(df, mean, trim = 0.1)
## a b c d
## -0.3809874 0.1754418 0.4613353 0.3444179
Here trim is an argument of the mean function that requests a trimmed mean (trimming 10% of the data from each end before computing the mean). We simply add it to map_dbl(), or any other map family function, and it is passed along.
Second, there are a few shortcuts that you can use in place of a function name. For example, to compute the number of NA values in each column, we can use a formula:
map_dbl(flights, ~sum(is.na(.)))
## year month day dep_time sched_dep_time
## 0 0 0 8255 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 8255 8713 0 9430 0
## flight tailnum origin dest air_time
## 0 2512 0 0 9430
## distance hour minute time_hour
## 0 0 0 0
These functions also carry the names of the input over to the output vector or list. In the formula, . refers to the current list element.
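The formula shortcut is equivalent to writing the function out in full. The sketch below counts missing values per column in three equivalent ways, using the built-in airquality data (chosen here only because it contains NA values) and assuming purrr is installed. The helper name count_na is hypothetical.

```r
library(purrr)

# 1. A named helper function
count_na <- function(x) sum(is.na(x))
by_name <- map_int(airquality, count_na)

# 2. An anonymous function
by_anon <- map_int(airquality, function(x) sum(is.na(x)))

# 3. The formula shortcut, where . stands for the current column
by_formula <- map_int(airquality, ~ sum(is.na(.)))

by_formula
identical(by_name, by_formula)
```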
As another example, to compute the correlation coefficients between mpg and all other variables in the mtcars data set, we can do the following:
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
map_dbl(mtcars, ~cor(mtcars$mpg, .))
## mpg cyl disp hp drat wt qsec
## 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719 -0.8676594 0.4186840
## vs am gear carb
## 0.6640389 0.5998324 0.4802848 -0.5509251
By doing this we see all the correlation coefficients in one shot!
What if there are some non-numeric columns? We can filter them out and do the same thing. Let’s do the same for the mpg data set.
First, we need to keep columns of numeric type only. We can easily do this by using the map_lgl function, which returns a vector of logical values.
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
column_numeric <- map_lgl(mpg, is.numeric) # See whether each column is numeric or not
mpg_numeric <- mpg[column_numeric] # Only keep numeric columns
mpg_numeric
## # A tibble: 234 × 5
## displ year cyl cty hwy
## <dbl> <int> <int> <int> <int>
## 1 1.8 1999 4 18 29
## 2 1.8 1999 4 21 29
## 3 2 2008 4 20 31
## 4 2 2008 4 21 30
## 5 2.8 1999 6 16 26
## 6 2.8 1999 6 18 26
## 7 3.1 2008 6 18 27
## 8 1.8 1999 4 18 26
## 9 1.8 1999 4 16 25
## 10 2 2008 4 20 28
## # ℹ 224 more rows
Now we can do the same as above to get the correlation coefficients. Let’s use cty as the measure of fuel efficiency.
map_dbl(mpg_numeric, ~cor(mpg_numeric$cty, .))
## displ year cyl cty hwy
## -0.79852397 -0.03723229 -0.80577141 1.00000000 0.95591591
As expected, hwy is highly correlated with cty, while year has little to do with fuel efficiency. Larger engines and more cylinders are associated with lower fuel efficiency.
The following code computes the p-values for t-tests between the Attrition flag and all numeric variables in the BankChurners.csv data set.
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
column_num <- map_lgl(bank_data, is.numeric) # See whether each column is numeric or not
bank_num <- bank_data[column_num] # Only keep numeric columns
bank_num
## # A tibble: 10,127 × 14
## Customer_Age Dependent_count Months_on_book Total_Relationship_Count
## <dbl> <dbl> <dbl> <dbl>
## 1 45 3 39 5
## 2 49 5 44 6
## 3 51 3 36 4
## 4 40 4 34 3
## 5 40 3 21 5
## 6 44 2 36 3
## 7 51 4 46 6
## 8 32 0 27 2
## 9 37 3 36 5
## 10 48 2 36 6
## # ℹ 10,117 more rows
## # ℹ 10 more variables: Months_Inactive_12_mon <dbl>,
## # Contacts_Count_12_mon <dbl>, Credit_Limit <dbl>, Total_Revolving_Bal <dbl>,
## # Avg_Open_To_Buy <dbl>, Total_Amt_Chng_Q4_Q1 <dbl>, Total_Trans_Amt <dbl>,
## # Total_Trans_Ct <dbl>, Total_Ct_Chng_Q4_Q1 <dbl>,
## # Avg_Utilization_Ratio <dbl>
map_dbl(bank_num, ~t.test(.[bank_data$Attrition_Flag == "Existing Customer"], .[bank_data$Attrition_Flag == "Attrited Customer"])$p.value)
## Customer_Age Dependent_count Months_on_book
## 5.771863e-02 5.251960e-02 1.603851e-01
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 3.225023e-48 1.717553e-60 6.687312e-89
## Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy
## 1.642963e-02 7.089719e-113 9.771547e-01
## Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct
## 1.305897e-39 6.349082e-106 0.000000e+00
## Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 7.156056e-173 2.782074e-72
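The long formula above can be easier to read when the test is pulled out into a helper. The sketch below applies the same pattern to the built-in mtcars data, using am (0 = automatic, 1 = manual) as the grouping variable in place of Attrition_Flag. The helper name t_p_value is hypothetical, and purrr is assumed to be installed.

```r
library(purrr)

# p-value of a two-sample t-test comparing x between the two groups in g
t_p_value <- function(x, g) {
  groups <- unique(g)
  t.test(x[g == groups[1]], x[g == groups[2]])$p.value
}

# Drop the grouping column itself before testing the others
predictors <- mtcars[setdiff(names(mtcars), "am")]

# The extra argument g is passed through map_dbl() to t_p_value()
p_values <- map_dbl(predictors, t_p_value, g = mtcars$am)
p_values[p_values < 0.05]
```

The grouping column is removed first because testing it against itself would compare two constant samples, which t.test() cannot handle.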
The following code computes the p-values for chi-square tests between the Attrition flag and all other categorical variables in the BankChurners.csv data set.
bank_data <- read_csv("~/Documents/Fei Tian/Course_STA305_Statistical_Computing_and_Graphics_Fall2023/Datasets/BankChurners.csv")
column_chr <- map_lgl(bank_data, is.character) # See whether each column is character or not
bank_chr <- bank_data[column_chr] # Only keep character columns
bank_chr
## # A tibble: 10,127 × 6
## Attrition_Flag Gender Education_Level Marital_Status Income_Category
## <chr> <chr> <chr> <chr> <chr>
## 1 Existing Customer M High School Married $60K - $80K
## 2 Existing Customer F Graduate Single Less than $40K
## 3 Existing Customer M Graduate Married $80K - $120K
## 4 Existing Customer F High School Unknown Less than $40K
## 5 Existing Customer M Uneducated Married $60K - $80K
## 6 Existing Customer M Graduate Married $40K - $60K
## 7 Existing Customer M Unknown Married $120K +
## 8 Existing Customer M High School Unknown $60K - $80K
## 9 Existing Customer M Uneducated Single $60K - $80K
## 10 Existing Customer M Graduate Single $80K - $120K
## # ℹ 10,117 more rows
## # ℹ 1 more variable: Card_Category <chr>
result <- map_dbl(bank_chr, ~chisq.test(bank_chr$Attrition_Flag, .)$p.value)
result
## Attrition_Flag Gender Education_Level Marital_Status Income_Category
## 0.0000000000 0.0001963585 0.0514891315 0.1089126339 0.0250024257
## Card_Category
## 0.5252382798
We can further filter the columns that have a p-value lower than 0.05:
result[result < 0.05]
## Attrition_Flag Gender Income_Category
## 0.0000000000 0.0001963585 0.0250024257
The way we filtered columns by type above is not the most convenient one. A number of purrr functions work with predicate functions (such as is.factor, is.character, etc.) that return either a single TRUE or FALSE.
keep() and discard() keep elements of the input where the predicate is TRUE or FALSE respectively:
mpg %>%
keep(is.character) %>%
print()
## # A tibble: 234 × 6
## manufacturer model trans drv fl class
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 audi a4 auto(l5) f p compact
## 2 audi a4 manual(m5) f p compact
## 3 audi a4 manual(m6) f p compact
## 4 audi a4 auto(av) f p compact
## 5 audi a4 auto(l5) f p compact
## 6 audi a4 manual(m5) f p compact
## 7 audi a4 auto(av) f p compact
## 8 audi a4 quattro manual(m5) 4 p compact
## 9 audi a4 quattro auto(l5) 4 p compact
## 10 audi a4 quattro manual(m6) 4 p compact
## # ℹ 224 more rows
mpg %>%
discard(is.character) %>%
print()
## # A tibble: 234 × 5
## displ year cyl cty hwy
## <dbl> <int> <int> <int> <int>
## 1 1.8 1999 4 18 29
## 2 1.8 1999 4 21 29
## 3 2 2008 4 20 31
## 4 2 2008 4 21 30
## 5 2.8 1999 6 16 26
## 6 2.8 1999 6 18 26
## 7 3.1 2008 6 18 27
## 8 1.8 1999 4 18 26
## 9 1.8 1999 4 16 25
## 10 2 2008 4 20 28
## # ℹ 224 more rows
some() and every() determine if the predicate is true for any or for all of the elements.
some(mpg, is.character)
## [1] TRUE
every(mpg, is_character)
## [1] FALSE
every(mpg, is_vector)
## [1] TRUE
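The predicate helpers also work on ordinary lists, not only data frames. A small sketch with a made-up list named mixed, assuming purrr is installed:

```r
library(purrr)

mixed <- list(n = 1:3, s = "text", d = 2.5, f = TRUE)

kept <- keep(mixed, is.numeric)        # elements n and d
dropped <- discard(mixed, is.numeric)  # elements s and f

names(kept)
names(dropped)

some(mixed, is.character)   # at least one element is character
every(mixed, is.numeric)    # not every element is numeric
```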
For your midterm project data set, find all columns that have a p-value less than 0.05 when doing t-test or chi-square test with the target variable.