Recently I found myself needing to process a number of CSV data dumps from an ERP system in order to prepare the data for analysis. R is super useful for this type of basic data cleaning and prepartation, but it took me a while to wrap my head around how to make use of vectorization, and specifically how functions like lapply() can help. What follows are some examples of how we can use lapply() to make our code more compact, efficient, and easier to manage.

First off, what exactly does lapply do?

lapply is part of a small family of “apply” functions in base R that accepts something as input (such as a matrix, list, or vector), applies a function on its elements, and spits back out a list or vector. In the case of lapply(), you get back a list. Here’s a quick example to get us started, where we’ll return a list containing the means of 2 vectors.

list<-list()
x<-rnorm(10)
y<-rnorm(10)

list[[1]]<-x
list[[2]]<-y

z<-lapply(list, mean)
z

## [[1]]
## [1] 0.2003034
## 
## [[2]]
## [1] 0.1214118

As expected, lapply() returned a list containing the means of our 2 vectors.

Using vectorization to build a list of dataframes

Let’s load up our CSVs. There are a few ways we could do this, but for the purposes of this discussion I’m simply creating a dataframe for each CSV.

library(dplyr)
library(data.table)

data_2013 <- read.csv("2013_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")
data_2014 <- read.csv("2014_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")
data_2015 <- read.csv("2015_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")
data_2016 <- read.csv("2016_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")
data_2017 <- read.csv("2017_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")
data_2018 <- read.csv("2018_D20_11192018.csv", skip=4, header=TRUE, colClasses = "character")

Now here’s the thing. We’re going to want to perform some basic cleanup activites on each of these dataframes. And we want to make it easy to add additional data in the future. We could really go about it one of two ways:

Go through each dataframe and transformation individually. This might seems like the simple approach, but it’ll get cumbersome. Firtsly our code will be bloated and inefficient. It’ll also be difficult to add another data source in the future, as we’d need to make sure to add code for each transformation individually.
Add all the dataframes to a list, and use vectorization in R to transform the contents of the list. As you’ll see this makes for more compact, cleaner code, that makes managing changes in the future much easier.

We’ll go ahead with option 2, and in order to do this we’ll first need to initiate a list and add our dataframes to it. Let’s contrast old-school and vectorized approaches, starting with the old-school approach.

Here we simply go through each element of our list and add a dataframe to it. Note that if we ever pull in more data in the future, we’ll need to add another line of code here as well.

df_list <- list()
df_list[[1]] <-data_2013
df_list[[2]] <-data_2014
df_list[[3]] <-data_2015
df_list[[4]] <-data_2016
df_list[[5]] <-data_2017
df_list[[6]] <-data_2018

But a much better way to do this is to take advantage of vecotorization to load all dataframes in our environment whose names start with the pattern “data_”. This way, if we add more dataframes in the future we just need to use the same prefix, and won’t need to add any additional code.

df_list <- lapply(ls(pattern="data_"), function(x) get(x))

How does this work?

We use the function lapply(list, function(element of list)) to apply a function to each element of our new list. This function can be anything. It can be a function you defined externally, defined inside lapply(). In this case, we’re using lapply() on the list of all dataframes in our environment starting with the pattern “data_”, using ls(pattern="data_"). Then for each list element, we call get(x) which returns an object (in this case a dataframe) with the name we passed as an argument to function(x).

Sometimes I find it easier to understand lapply() if I define the function externally. This way we can clearly see we’re passing each dataframe name to our get_dataframes() function:

get_dataframes <- function(x){
  get(x)
}
df_list <- lapply(ls(pattern="data_"), get_dataframes)

Data-cleaning tasks with vectorization

The cool thing about lapply is that it works just as well on lists of dataframes. So let’s use it to run some transformations on our dataframes. To begin with, we want to take the field names from one dataset, and apply it to the others to ensure common naming (we know the ordering is the same).

Using the old-school method, I use setnames on each dataframe, applying the names from data_2014 to the others. Note that if I ever pull in more data (in new DFs), I’ll need to update the code here.

setnames(data_2013, old = names(data_2013)[2:20], new = names(data_2014)[2:20])
setnames(data_2015, old = names(data_2015)[2:20], new = names(data_2014)[2:20])
setnames(data_2016, old = names(data_2016)[2:20], new = names(data_2014)[2:20])
setnames(data_2017, old = names(data_2017)[2:20], new = names(data_2014)[2:20])
setnames(data_2018, old = names(data_2018)[2:20], new = names(data_2014)[2:20])

In the vectorized case, I can simply use setnames() inside lapply(). Just like our previous example first we’ll create a function to which we will pass a list element (a dataframe(), and then rename the columns of that dataframe.

list_rename <- function(x) {
  setnames(x, old = names(x)[2:20], new = names(data_2014)[2:20])
}
df_list<-lapply(df_list, list_rename)

I can also use the same approach to drop columns I don’t need from each dataframe. You can see that in this case, we pass each list elememt (a dataframe) as an argument to function(x), where query for the columns found in our keeps list.

#Keep only the data cols
keeps <-names(data_2014)[2:20]
df_list<-lapply(df_list, function(x){x[ , (names(x) %in% keeps)]})

Lastly, in this example I want to convert a couple value columns from character to numeric. However, in order to do this I first need to remove commas from the value fields. I can do this using gsub() prior to the as.numeric(). We need to think about our function a bit here though. gsub() will return a vector, which isn’t quite whatI want. I need to update the columns, then return the entire dataframe to lapply(). As such, I’ll explicitly call return() in my function definition.

clean_numeric <- function(x) {
  x$Acquisitio <- gsub(",","",x$Acquisitio)
  x$AUC.Transf <- gsub(",","",x$AUC.Transf)
  x$Acquisitio <- as.numeric(x$Acquisitio)
  x$AUC.Transf <- as.numeric(x$AUC.Transf)
  return(x)
}
df_list<-lapply(df_list, clean_numeric)

Dumping the list back to a single dataframe

Once we’ve worked through all of our transformations, we can easily dump everything back to a single dataframe, potentially to write out a new CSV for use in external applications.

We’ll use rbindlist() from the data.table package

consolidated<-rbindlist(df_list)

I hope you found these examples useful in improving your workflow with vectorization and getting a better handle on the lapply function.

Taking advantage of vectorization in R to improve your data workflow

Phil Jette

November 22, 2018

First off, what exactly does lapply do?

Using vectorization to build a list of dataframes

How does this work?

Data-cleaning tasks with vectorization

Dumping the list back to a single dataframe