suppressPackageStartupMessages(library("tidyverse"))
package ‘tidyverse’ was built under R version 3.6.3
suppressPackageStartupMessages(library("stringr"))
#The package microbenchmark is used for timing code.
suppressPackageStartupMessages(library("microbenchmark"))
package ‘microbenchmark’ was built under R version 3.6.3
1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.
files <- dir("data/", pattern = "\\.csv$", full.names = TRUE)
files
[1] "data/file1.csv" "data/file2.csv" "data/file3.csv"
#> [1] "data//file1.csv" "data//file2.csv" "data//file3.csv"
Since the number of files is known, pre-allocate a list with a length equal to the number of files.
df_list <- vector("list", length(files))
Then, read each file into a data frame, and assign it to an element in that list. The result is a list of data frames.
for (i in seq_along(files)) {
  df_list[[i]] <- read_csv(files[[i]])
}
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
print(df_list)
[[1]]
[[2]]
[[3]]
NA
Finally, use bind_rows() to combine the list of data frames into a single data frame.
df <- bind_rows(df_list)
print(df)
Alternatively, I could have pre-allocated a list with the names of the files.
df2_list <- vector("list", length(files))
names(df2_list) <- files
for (fname in files) {
  df2_list[[fname]] <- read_csv(fname)
}
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_character()
)
df2 <- bind_rows(df2_list)
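One possible benefit of the named-list variant is that the names can be carried into the combined data frame. The sketch below is self-contained and rests on some assumptions: it writes two small made-up CSV files to a temporary directory, and the names demo_files, demo_list, demo_df, and the column name source_file are all hypothetical. It uses the .id argument of bind_rows() to record which file each row came from.

```r
library(readr)
library(dplyr)

# Create a throwaway directory with two small CSV files (made-up data).
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
write_csv(tibble(id = 1:2, label = c("a", "b")), file.path(tmp_dir, "file1.csv"))
write_csv(tibble(id = 3:4, label = c("c", "d")), file.path(tmp_dir, "file2.csv"))

demo_files <- dir(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Pre-allocate the list and name each slot after its file.
demo_list <- vector("list", length(demo_files))
names(demo_list) <- basename(demo_files)

for (i in seq_along(demo_files)) {
  demo_list[[i]] <- read_csv(
    demo_files[[i]],
    col_types = cols(id = col_integer(), label = col_character())
  )
}

# .id adds a column holding the list names, here the file names.
demo_df <- bind_rows(demo_list, .id = "source_file")
print(demo_df)
```

Passing col_types explicitly also silences the column specification messages that read_csv() otherwise prints for each file.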
2. What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?
Let’s try it out and see what happens. When there are no names for the vector, it does not run the code in the loop. In other words, it runs zero iterations of the loop.
x <- c(11, 12, 13)
print(names(x))
NULL
for (nm in names(x)) {
  print(nm)
  print(x[[nm]])
}
Note that the length of NULL is zero:
length(NULL)
[1] 0
If only some of the elements are named, then we get an error when the loop tries to access an element whose name is the empty string.
x <- c(a = 11, 12, c = 13)
names(x)
[1] "a" "" "c"
for (nm in names(x)) {
  print(nm)
  print(x[[nm]])
}
[1] "a"
[1] 11
[1] ""
Error in x[[nm]] : subscript out of bounds
Finally, if the vector contains duplicate names, then x[[nm]] returns the first element with that name, so any later element sharing the name is never reached.
x <- c(a = 11, a = 12, c = 13)
names(x)
[1] "a" "a" "c"
for (nm in names(x)) {
  print(nm)
  print(x[[nm]])
}
[1] "a"
[1] 11
[1] "a"
[1] 11
[1] "c"
[1] 13
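All three problems come from looking elements up by name. As a rough sketch of one alternative, the loop can iterate over positions with seq_along() and treat the name only as a label. The helper describe_elements() and the "<unnamed>" placeholder are hypothetical, introduced here purely for illustration.

```r
# Hypothetical helper: visit every element by position, using the name
# only as a label, so missing or duplicated names cannot skip or repeat
# elements.
describe_elements <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    # No names at all: fall back to empty labels of the right length.
    nms <- rep("", length(x))
  }
  out <- character(length(x))
  for (i in seq_along(x)) {
    label <- if (nzchar(nms[[i]])) nms[[i]] else "<unnamed>"
    out[[i]] <- paste0(label, ": ", x[[i]])
  }
  out
}

describe_elements(c(11, 12, 13))             # no names
describe_elements(c(a = 11, 12, c = 13))     # some names
describe_elements(c(a = 11, a = 12, c = 13)) # duplicate names
```

Because the index i identifies each element uniquely, the second element named a is reported with its own value rather than the first one's.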
3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:
# show_mean(iris)
# > Sepal.Length: 5.84
# > Sepal.Width:  3.06
# > Petal.Length: 3.76
# > Petal.Width:  1.20
Extra challenge: what function did I use to make sure that the numbers lined up nicely, even though the variable names had different lengths?
There may be other functions to do this, but I’ll use str_pad() and str_length() to ensure that the space given to the variable names is the same. I adjusted the digits and nsmall arguments of format() until I got two decimal places.
show_mean <- function(df, digits = 2) {
  # Get max length of all variable names in the dataset
  maxstr <- max(str_length(names(df)))
  for (nm in names(df)) {
    if (is.numeric(df[[nm]])) {
      cat(
        str_c(str_pad(str_c(nm, ":"), maxstr + 1L, side = "right"),
          format(mean(df[[nm]]), digits = digits, nsmall = digits),
          sep = " "
        ),
        "\n"
      )
    }
  }
}
show_mean(iris)
Sepal.Length: 5.84
Sepal.Width:  3.06
Petal.Length: 3.76
Petal.Width:  1.20
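For comparison, the same alignment can probably be achieved without stringr. The following is a minimal sketch in base R, not the approach used above: show_mean_base() is a hypothetical name, and it relies on formatC() for both the left-justified padding and the fixed number of decimals.

```r
# Hypothetical base R variant of show_mean().
show_mean_base <- function(df, digits = 2) {
  # Keep only the names of numeric columns.
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  if (length(numeric_cols) == 0) {
    return(invisible(NULL))
  }
  # Width of the longest label, plus one for the trailing colon.
  width <- max(nchar(numeric_cols)) + 1L
  for (nm in numeric_cols) {
    # flag = "-" left-justifies the label within the given width.
    label <- formatC(paste0(nm, ":"), width = width, flag = "-")
    # format = "f" prints a fixed number of decimal places.
    value <- formatC(mean(df[[nm]]), format = "f", digits = digits)
    cat(label, " ", value, "\n", sep = "")
  }
  invisible(NULL)
}

show_mean_base(iris)
```

One design difference: this version computes the padding width from the numeric columns only, whereas show_mean() measures every column name, including non-numeric ones that are never printed.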
4. What does this code do? How does it work?
trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) {
    factor(x, labels = c("auto", "manual"))
  }
)
for (var in names(trans)) {
  mtcars[[var]] <- trans[[var]](mtcars[[var]])
}
This code mutates the disp and am columns:
- disp is multiplied by 0.0163871
- am is replaced by a factor variable.
The code works by looping over a named list of functions. It calls the named function in the list on the column of mtcars with the same name, and replaces the values of that column.
This is a function.
trans[["disp"]]
function(x) x * 0.0163871
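This applies the function to the column of mtcars with the same name. Note that the values printed below are far smaller than a single conversion of the original column would give (160 * 0.0163871 is about 2.62 for the first car): mtcars[["disp"]] had already been overwritten by earlier runs of the loop when this output was produced, so the factor has been applied more than once.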
trans[["disp"]](mtcars[["disp"]])
[1] 0.0007040869 0.0007040869 0.0004752587 0.0011353402 0.0015841956 0.0009901223 0.0015841956
[8] 0.0006455597 0.0006195965 0.0007375311 0.0007375311 0.0012136699 0.0012136699 0.0012136699
[15] 0.0020770565 0.0020242500 0.0019362391 0.0003463228 0.0003331211 0.0003128786 0.0005285053
[22] 0.0013993728 0.0013377652 0.0015401902 0.0017602174 0.0003476429 0.0005293854 0.0004184917
[29] 0.0015445907 0.0006380788 0.0013245636 0.0005324658
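Because the loop overwrites mtcars in place, each rerun applies the disp conversion again. A minimal sketch of one way to avoid this, assuming it is acceptable to work on a copy: the name mtcars_copy is hypothetical, and datasets::mtcars is used to get the unmodified data regardless of what has happened to mtcars in the workspace.

```r
trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) {
    factor(x, labels = c("auto", "manual"))
  }
)

# Start from the pristine dataset so the result does not depend on
# how many times the loop has already been run in this session.
mtcars_copy <- datasets::mtcars

for (var in names(trans)) {
  # Look up the function by column name and apply it to that column.
  mtcars_copy[[var]] <- trans[[var]](mtcars_copy[[var]])
}

head(mtcars_copy[, c("disp", "am")], 3)
```

Rerunning this whole block always gives the same result, since every run starts again from the original data.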