Data Wrangling Faster and Bigger with data.table

Nick Huntington-Klein

05 February, 2021

Data Wrangling

Welcome to the Data Wrangling Workshop!

  • The goal of data wrangling
  • How to think about data wrangling
  • Technical tips for data wrangling in R using the data.table package
  • A walkthrough example

Limitations

  • If you already attended the tidyverse version of this workshop, there will be some overlap in content, although of course the technical details will be different.
  • We only have so much time! I won’t be going into great detail on the use of all the technical commands, but by the end of this you will know what’s out there and generally how it’s used
  • As with any computer skill, a teacher’s comparative advantage is in letting you know what’s out there. The real learning comes from practice and Googling. So take what you see here today, find yourself a project, and do it! It will be awful but you will learn an astounding amount by the end

data.table notes

  • data.table is a package for working with data
  • It is extremely fast. Its functions are much faster than comparable tidyverse functions, and for many purposes it is faster than pandas in Python as well. Julia outperforms it sometimes, but then you have to learn Julia
  • It’s also great at handling big data. It’s basically your technically-best option for working with data in-memory. Once you move to working from a database (e.g. SQL) you’re into something different

Tidyverse notes

  • Throughout this talk I’ll be using the pipe (%>%), which simply means “take whatever’s on the left and make it the first argument of the thing on the right”
  • Very handy for chaining together operations and making code more readable.
  • This is more of a tidyverse tool but I like it for data.table too. It can be loaded with library(magrittr), by loading the tidyverse, or via one of the many other packages that come with the pipe.

The pipe

scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1) can be rewritten

mtcars %>%
  `[[`('am') %>%
  mean(na.rm = TRUE) %>%
  scales::percent(accuracy = .1)
  • Like a conveyor belt! Nice and easy. Note that R 4.1 adds a native pipe, |>, which won’t require a package load. Also, you can chain data.table operations by just using a bunch of []s (we’ll get to it)
  • [[ is the “[[ function” - i.e. this is doing mtcars[['am']], equivalent to mtcars$am

Data Wrangling

What is data wrangling?

  • You have data
  • It’s not ready for you to run your model
  • You want to get it ready to run your model
  • Ta-da!

The Core of Data Wrangling

  • Always look directly at your data so you know what it looks like
  • Always think about what you want your data to look like when you’re done
  • Think about how you can take information from where it is and put it where you want it to be
  • After every step, look directly at your data again to make sure it’s doing what you think it’s doing

I help a lot of people with their data-wrangling problems. Their issues almost always come from skipping one of these four steps, much more so than from trouble coding or anything like that

The Core of Data Wrangling

  • How can you “look at your data”?
  • Literally looking at it is one way - click on the data set, or do View() to look at it
  • Summary statistics tables: sumtable() or vtable(lush = TRUE) in vtable for example
  • Checking what values it takes: table() or summary() on individual variables
  • Look for: What values are there, what the observations look like, presence of missing or unusable data, how the data is structured (quick sketch below)
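
A minimal looking-around routine might go like this (a sketch using mtcars as a stand-in for your data; sumtable() is from the vtable package):

library(data.table)
library(vtable)

dt <- as.data.table(mtcars)

View(dt)        # eyeball the raw data (in RStudio)
str(dt)         # structure: variable types and the first few values
summary(dt$mpg) # distribution of a single variable
table(dt$cyl)   # what values does cyl take, and how often?
sumtable(dt)    # summary statistics for the whole data set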

The Stages of Data Wrangling

  • From records to data
  • From data to tidy data
  • From tidy data to data for your analysis

From Records to Data

From Records to Data

Not something we’ll be focusing on today! But any time the data isn’t in a workable format, like a spreadsheet or database, someone’s got to get it there!

  • “Google Trends has information on the popularity of our marketing terms, go get it!”
  • “Here’s a 600-page unformatted PDF of our sales records for the past three years. Turn it into a database.”
  • “Here are scans of the 15,000 handwritten doctor’s notes at the hospital over the past year”
  • “Here’s access to the website. The records are in there somewhere.”
  • “Go do a survey”

From Records to Data: Tips!

  • Do as little by hand as possible. It’s a lot of work and you will make mistakes
  • Look at the data a lot!
  • Check for changes in formatting - it’s common for things like “this enormous PDF of our tables” or “eight hundred text files with the different responses/orders” to change formatting halfway through
  • When working with something like a PDF or a bunch of text files, think “how can I tell a computer to spot where the actual data is?”
  • If push comes to shove, or if the data set is small enough, you can do by-hand data entry. Be very careful!

Reading Files

One common thing you run across is data split into multiple files. How can we read these in and compile them?

  • list.files() produces a vector of filenames (tip: full.names = TRUE gives full filepaths)
  • Use map() from purrr to iterate over that vector and read in the data. This gives a list of tibbles (data.frames) read in
  • Create your own function to process each, use map with that too (if you want some processing before you combine)
  • Turn each into a data.table if it isn’t already - data read with fread() already is; otherwise use map(as.data.table)
  • Combine the results with rbindlist()!

Reading Files

For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together.

# For reading Excel
library(readxl)
# For map
library(purrr)
# For data.table() and rbindlist()
library(data.table)

# Get the list of 200 reports
filelist <- list.files(path = '../Monthly_reports/', pattern = 'sales', full.names = TRUE)

Reading Files

We can simplify by making a little function that processes each of the reports as it’s read. Then, use map() with read_excel() and then our function, then bind it together!

How do I know to use df[[1, 3]], etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it!

process_file <- function(df) {
  # read_excel() returns a tibble; [[ with two indices pulls out a single value
  sales <- df[[1, 3]]     # cell C2: first row below the header, third column
  employee <- df[[42, 2]] # cell B43: row 42 below the header, second column
  return(data.table(sales = sales, employee = employee))
}

compiled_data <- filelist %>%
  map(read_excel) %>%
  map(process_file) %>%
  rbindlist()

From Data to Tidy Data

From Data to Tidy Data

  • Data is any time you have your records stored in some structured format
  • But there are many such structures! They could be across a bunch of different tables, or perhaps a spreadsheet with different variables stored randomly in different areas, or one table per observation
  • These structures can be great for looking up values. That’s why they are often used in business or other settings where you say “I wonder what the value of X is for person/day/etc. N”
  • They’re rarely good for doing analysis (calculating statistics, fitting models, making visualizations)
  • For that, we will aim to get ourselves tidy data (see this walkthrough)

Tidy Data

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy Data

The variables in tidy data come in two types:

  1. Identifying Variables/Keys are the columns you’d look at to locate a particular observation.
  2. Measures/Values are the actual data.

Which are they in this data?
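
The example table didn’t survive the conversion to text, so here’s a hypothetical reconstruction (the Points and ShrimpConsumption values are made up; the structure matches the next slide’s discussion):

tidy_data <- data.table(Person = c('Chidi', 'Chidi', 'Eleanor', 'Eleanor'),
                        Year = c(2017, 2018, 2017, 2018),
                        Points = c(14, 22, 30, 41),
                        ShrimpConsumption = c(0, 1, 12, 21))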

Tidy Data

  • Person and Year are our identifying variables. The combination of person and year uniquely identifies a row in the data. Our “observation level” is person and year. There’s only one row with Person == “Chidi” and Year == 2018
  • Points and ShrimpConsumption are our measures. They are the things we have measured for each of our observations
  • Notice how there’s one row per observation (combination of Person and Year), and one column per variable
  • Also this table contains only variables that are at the Person-Year observation level. Variables at a different level (perhaps things that vary between Person but don’t change over Year) would go in a different table, although this last one is less important

Tidying Non-Tidy Data

  • So what might data look like when it’s not like this, and how can we get it this way?
  • Here’s one common example, a count table (not tidy!) where each column is a value, not a variable
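
A hypothetical sketch of such a count table - the 2017 and 2018 columns are values of a would-be Year variable, not variables themselves:

count_table <- data.table(Person = c('Chidi', 'Eleanor'),
                          `2017` = c(14, 30),
                          `2018` = c(22, 41))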

Tidying Non-tidy Data

  • Here’s another, where the “chart position” variable is split across 52 columns, one for each week

Tidying Non-Tidy Data

  • The first big tool in our tidying toolbox is the pivot, which in data.table is the function pair melt() and dcast()
  • A pivot takes a single row with K columns and turns it into K rows with 1 column, using the identifying variables/keys to keep things lined up.
  • This can also be referred to as going from “wide” data to “long” data (melt)
  • Long to wide is also an option (dcast)
  • In every statistics package, pivot functions are notoriously fiddly. Always read the help file, and do trial-and-error! Make sure it worked as intended.

Tidying Non-Tidy Data

Check our steps!

  • We looked at the data
  • Think about how we want the data to look - one row per (keys) artist, track, and week, and a column for the chart position of that artist/track in that week, and the date entered for that artist/track (value)
  • How can we carry information from where it is to where we want it to be? With a pivot!
  • And afterwards we’ll look at the result (and, likely, go back and fix our pivot code - the person who gets a pivot right the first try is a mysterious genius)

Pivot

  • Here we want wide-to-long so we use melt()
  • This asks for:
  • data (the data set you’re working with, also the first argument so we can pipe to it)
  • id.vars (a vector of identifying/key columns, either numbers for position or character for names)
  • measure.vars (the columns to pivot) - by default everything not in id.vars
  • variable.name (the name of the variable to store which column a given row came from, here “week”)
  • value.name (the variable to store the value in)
  • Many other options (see help(melt))

Pivot

billboard %>%
  melt(measure.vars = patterns('^wk'), # patterns helps us pick columns based on name patterns
             variable.name = 'week',
             value.name = 'chart_position') 

Variables Stored as Rows

  • Here we have tax form data where each variable is a row, but we have multiple tables. For this one we can use dcast(), and then combine multiple individuals with rbind()

Variables Stored as Rows

  • dcast() needs:
  • data (first argument, the data we’re working with)
  • formula (this tells us the observation level of the wide data and how it expands to the long data)
  • Many others! See help(dcast)
  • Here, the new observation level doesn’t have a key (.), but the old one is TaxFormRow

Variables Stored as Rows

taxdata %>%
  dcast(. ~ TaxFormRow)
##    .   AGI Deductible Income        Person
## 1: . 88341      24000 112341 James Acaster

(Note that the variables are all stored as character variables, not numbers - that’s because the “Person” row is a character, which forced the rest to be too. We’ll go through how to fix that later)

Variables Stored as Rows

We can use rbind() to stack data sets with the same variables together, handy for compiling data from different sources (rbindlist() binds a list of data.tables)

##    .    AGI Deductible Income        Person
## 1: .  88341      24000 112341 James Acaster
## 2: . 309122      16000 325122  Eddie Izzard

Merging Data

  • Commonly, you will need to link two datasets together based on some shared keys
  • For example, if one dataset has the variables “Person”, “Year”, and “Income” and the other has “Person” and “Birthplace”

Merging Data

That was person_year_data. And now for person_data:

Merging Data

  • The data.table merge function will do this (see help(merge), making sure you get the data.table one instead of the base-R one).
  • The by option will specify the columns to merge on. The all.x and all.y options will specify whether to keep rows from the first and second data sets, respectively, that don’t find a match

Merging Data

person_year_data %>%
  merge(person_data, by = 'Person', all.x = TRUE)
##     Person Year Income      Birthplace
## 1:   David 2014 102452            <NA>
## 2:   David 2014 105133            <NA>
## 3:  Ramesh 2014  81314         Crawley
## 4:  Ramesh 2015  82155         Crawley
## 5: Whitney 2014 131292 Washington D.C.
## 6: Whitney 2015 141262 Washington D.C.
person_year_data %>%
  merge(person_data, by = 'Person', all.y = TRUE)
##     Person Year Income      Birthplace
## 1:  Ramesh 2014  81314         Crawley
## 2:  Ramesh 2015  82155         Crawley
## 3: Whitney 2014 131292 Washington D.C.
## 4: Whitney 2015 141262 Washington D.C.

Merging Data

  • Things work great if the list of variables in by is the exact observation level in at least one of the two data sets
  • But if there are multiple observations per combination of by variables in both, that’s a problem! It will create all the potential matches, which may not be what you want:
a <- data.table(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- data.table(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% merge(b, by = 'Name')
##    Name Year Value Characteristic
## 1:    A 2014     1             Up
## 2:    A 2014     1           Down
## 3:    A 2015     2             Up
## 4:    A 2015     2           Down
## 5:    B 2014     3             Up
## 6:    C 2014     4           Left
## 7:    C 2014     4          Right

Merging Data

  • This is why it’s super important to always know the observation level of your data. You can check it by seeing if there are any duplicate rows among what you think are your key variables: if we think that Name is a key for data set a, then a[, .(Name)] %>% duplicated() %>% any() will return TRUE, showing us we’re wrong
  • At that point you can figure out how you want to proceed - drop observations so it’s the observation level in one? Accept the multi-match? Pick only one of the multi-matches?

Merging data

  • Another way to merge data.tables is to use the special data.table syntax DT1[DT2, on = .(keys)].
  • This approach is a little harder to work through syntax-wise, but it is (a little) faster
  • And lets you do neat tricks by matching in non-exact ways
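
A quick sketch of that bracket-join syntax, reusing person_year_data and person_data from above (the rolling join in the comment uses hypothetical tables, just to gesture at the tricks):

# Equivalent (up to column order) to
# merge(person_year_data, person_data, by = 'Person', all.x = TRUE)
person_data[person_year_data, on = .(Person)]
# Non-equi and rolling joins are where this shines, e.g.:
# prices[trades, on = .(ticker, date), roll = TRUE] # match the most recent price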

From Tidy Data to Your Analysis

From Tidy Data to Your Analysis

  • Okay! We now have, hopefully, a nice tidy data set with one column per variable and one row per observation, and we know what the observation level is!
  • That doesn’t mean our data is ready to go! We likely have plenty of cleaning and manipulation to go before we are ready for analysis

Working with data.tables

  • data.table() syntax is extremely simple
  • DT[filter, variable operations, grouping] aka DT[i, j, by]
  • We can use this to do just about anything we like!
  • Thankfully, there are also some “wrapper” functions for this that can automate some operations, and other helper functions like fcase()
  • See this data.table cheat sheet or this one

In-Place Manipulation

  • Normally in R, to change something, you must reassign it, which takes up space in memory: a <- a + 1
  • Many data.table operations can be done directly at the existing memory location. Blazing fast.
  • For variable manipulation, you can do in-place with the walrus operator :=
  • Other in-place functions begin with set like setnames(), setorder(), set(), etc.
  • Note that these generally have versions like setorderv() that take column-name vectors instead of direct column names

data.table oddities

  • Some things about data.tables clash with typical R practice. In-place manipulation is one of them
  • These changes “stick” even inside a function, and generally shouldn’t be combined with pipes
  • And if you copy a data.table, say with dt2 <- dt1, changes to dt2 will also happen to dt1, unless you do dt2 <- copy(dt1) instead for a “deep copy” (see the sketch below)
  • Also note that putting a data.table by itself on a line won’t print it - you need to add [] after
  • By the way feel free to chain multiple []s together for chained operations!
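
A minimal sketch of the copy gotcha:

dt1 <- data.table(x = 1:3)
dt2 <- dt1         # NOT a copy - both names point to the same table in memory
dt2[, x := x * 10]
dt1[]              # dt1 changed too! x is now 10, 20, 30
dt3 <- copy(dt1)   # a true deep copy
dt3[, x := 0]
dt1[]              # unaffected this time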

Filtering

  • The first argument limits the data to the observations that fulfill a certain logical condition. It picks rows.
  • For example, Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise. data[Income > 100000] would return just the rows of data that have Income > 100000
merge(person_year_data, person_data, by = 'Person')[Income > 10000]
##     Person Year Income      Birthplace
## 1:  Ramesh 2014  81314         Crawley
## 2:  Ramesh 2015  82155         Crawley
## 3: Whitney 2014 131292 Washington D.C.
## 4: Whitney 2015 141262 Washington D.C.

Logical Conditions

  • A lot of programming in general is based on writing logical conditions that check whether something is true
  • In R, if the condition is true, it returns TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant)
  • Also, if ifelse() is what you want, it’s still not quite what you want! data.table’s fifelse() does the same thing but faster

Logical Conditions Tips

Handy tools for constructing logical conditions:

a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal

a %in% c(b, c, d, e, f) checks whether a is any of the values b, c, d, e, or f. Works for text too!

Logical Conditions Tips

Whatever your condition is (condition), just put a ! (“not”) in front to reverse TRUE/FALSE. 2 + 2 == 4 is TRUE, but !(2 + 2 == 4) is FALSE

Chain multiple conditions together! & is “and”, | is “or”. Be careful with parentheses if combining them!
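
Putting a few of these together in filters (a sketch using the merged data from earlier):

merged <- merge(person_year_data, person_data, by = 'Person')
merged[Income > 100000 & Year == 2015]            # both conditions must hold
merged[Birthplace == 'Crawley' | Income > 140000] # either one will do
merged[!(Person %in% c('Ramesh', 'Whitney'))]     # everyone not in the list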

Column operations

  • Column operations in data.tables go in the second argument, i.e. the j in DT[i,j,by]
  • There are two main ways to do them! In-place and not-in-place.

Column operations

  • Not-in-place operations are done by providing a list of variables, and, if you want to reassign them, what they will be equal to. All variables not mentioned in the list (or in by) are dropped
  • The .() function in data.table is a shortcut for list()
  • To store it, save over the data.table
mtcars <- as.data.table(mtcars)
# Select just these columns
mtcars[, .(mpg, hp)]
mtcars[, c(1,4)]
# Select just those columns and also add the ratio variable
mtcars[, .(mpg, hp, ratio = mpg/hp)]

Column Operations

  • When to use a list (.()) and when to use c()?
  • Generally, use a list if you want to refer to columns by name directly (“unquoted”)
  • And use c() to refer to column numbers, or to pass in names as strings (handy in programming!)
  • Sometimes if using a string variable in j, you’ll also need the option with = FALSE
varnames <- c('mpg','hp')
mtcars[, varnames, with = FALSE]
mtcars[, c('mpg','hp')]

Column Operations

  • Note that these sorts of column operations are really about what to calculate. It just so happens that often we want to assign that calculation to a column. But sometimes we don’t!
  • For example, instead of mean(mtcars$hp) we can do mtcars[, mean(hp)]
  • We can also pull a variable out of a data.table entirely with mtcars[, hp] (this gives back a numeric vector, not a one-column data.table! For a one-column data.table we’d do mtcars[, .(hp)])
  • Mix n match! Guess what mtcars[am == 1, mean(hp)] does… you can start to see the appeal!

In-Place Column Operations

  • Most of the time when it comes to creating variables you’ll probably do in-place operations. This will just add/replace columns. Non-mentioned columns stay intact
# Create ratio variable
mtcars[, ratio := mpg/hp]
# Create two variables at once
mtcars[, `:=`(ratio = mpg/hp, hp_square = hp^2)]
# Drop a single variable by setting it to NULL
mtcars[, am := NULL]

fcase()

  • A function that comes in handy a lot when creating a categorical variable is fcase(), which is sort of like ifelse() except it can cleanly handle way more than one condition
  • Provide fcase() with a series of if, then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one.
  • As soon as it finds one, it stops looking, so you can assume anyone that satisfied an earlier condition doesn’t count any more.
  • Also note the default option for what to do with observations that are FALSE for everything else

fcase()

person_year_data[, .(Income = Income,
                     IncomeBracket = fcase(
                       Income <= 50000, 'Under 50k',
                       Income > 50000 & Income <= 100000, '50-100k',
                       Income > 100000 & Income < 120000, '100-120k',
                       default = 'Above 120k'))]
##    Income IncomeBracket
## 1:  81314       50-100k
## 2:  82155       50-100k
## 3: 131292    Above 120k
## 4: 141262    Above 120k
## 5: 102452      100-120k
## 6: 105133      100-120k

fcase()

  • Note that the then doesn’t have to be a value, it can be a calculation, for example
person_year_data[, .(Income = Income,
                     Year = Year,
                     Inflation_Adjusted_Income = fcase(
                       Year == 2014, Income*1.001, 
                       Year == 2015, Income))]

Changing Some Observations

  • A related problem to fcase() is when you want to make an adjustment to just some of the observations. Say we realized that David’s income was reported in pounds, not dollars, so we need to adjust it.
  • For this, just apply both a filter and an in-place update. We could do
person_year_data[Person == 'David', Income := Income*1.34]
  • If you attended the tidyverse version of this, you may recall this was a huge pain in the tidyverse. Easy here!

Grouping

  • The third argument of data.table (DT[i,j,by]) performs the j operations by group, separately for each combination of the grouping variables
  • Can just give the variable (or condition!) by itself, or list multiples with by = .(a, b) or by = c('a','b')
person_year_data[, .(Income = Income,
                     Income_Relative_to_Mean = Income - mean(Income)),
                 by = Person]
##     Person Income Income_Relative_to_Mean
## 1:  Ramesh  81314                  -420.5
## 2:  Ramesh  82155                   420.5
## 3: Whitney 131292                 -4985.0
## 4: Whitney 141262                  4985.0
## 5:   David 102452                 -1340.5
## 6:   David 105133                  1340.5

Keys

  • You can, as mentioned, tell data.table to do a calculation by group by giving those groups to by
  • data.tables also have explicit keys - when the keys are set, the data.table is pre-sorted by those keys
  • Any grouping, merging, etc., by those keys becomes insanely faster
  • So if you’re going to use the same grouping multiple times, set the key!!
setkey(person_year_data, Person)
# Perform a by operation and set the key at the same time
person_year_data[, Income_Relative_to_Mean := Income - mean(Income), keyby = Person]

Grouping

  • How is grouping useful in preparing data?
  • Remember, we want to look at where information is and think about how we can get it where we need it to be
  • Grouping helps us move information from one row to another within the groups defined by a key variable - otherwise a difficult move!
  • It can also let us change the observation level depending on our use of j
  • Tip: .N gives the number of rows in the group - handy! And seq_len(.N) gives each observation’s row number within its group (sketch below)
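
A quick sketch of .N and seq_len(.N) at work:

setorder(person_year_data, Person, Year)
person_year_data[, years_tracked := .N, by = Person]        # rows per person, repeated on every row
person_year_data[, year_number := seq_len(.N), by = Person] # 1 for each person's first year, 2 for the next...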

Changing Observation Level

  • Using := with by maintains the original observation level. But .() with by changes the observation level to the level implied by the functions you give it!
  • So give it a function returning one row per group and you get one row per group - your new observation level!
person_year_data[, .(Mean_Income = mean(Income),
                     Years_Tracked = .N),
                 by = Person]
##     Person Mean_Income Years_Tracked
## 1:  Ramesh     81734.5             2
## 2: Whitney    136277.0             2
## 3:   David    103792.5             2

Sorting data.tables

  • It’s often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate
  • There are also some data manipulation tricks that rely on the position of the data
  • You could sort by just passing an order() to the i argument, but more often you’ll just do in-place setorder()
setorder(person_year_data, Person, Year)
person_year_data[]
##     Person Year Income
## 1:   David 2014 102452
## 2:   David 2014 105133
## 3:  Ramesh 2014  81314
## 4:  Ramesh 2015  82155
## 5: Whitney 2014 131292
## 6: Whitney 2015 141262

Variable Types

Manipulating Variables

  • Those are the base data.table actions we need to think about
  • They can be combined to do all sorts of things!
  • But an important part of using column operations is handling the different variable types
  • A lot of data cleaning is making an already-tidy variable usable!

Variable Types

Common variable types:

  • Numeric
  • Character/string
  • Factor
  • Date

Variable Types

  • You can check the types of your variables with is. and then the type, or class(), or vtable::vtable(data) or doing str(data)
  • You can generally convert between types using as. and then the type
widetax <- dcast(taxdata, . ~ TaxFormRow)
widetax[, `:=`(Person = as.factor(Person),
         Income = as.numeric(Income),
         Deductible = as.numeric(Deductible),
         AGI = as.numeric(AGI))]
sapply(widetax, class)
##           .         AGI  Deductible      Income      Person 
## "character"   "numeric"   "numeric"   "numeric"    "factor"

Numeric Notes

  • Numeric data actually comes in multiple formats based on the level of acceptable precision: integer, double, and so on
  • Often you won’t have to worry about this - R will just make the data whatever numeric type makes sense at the time
  • But a common problem is that reading in very big integers (like ID numbers) will sometimes create doubles that are stored in scientific notation - lumping multiple groups together! Avoid this with options like colClasses in your data-reading function
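
A sketch of heading this off at read-in (the file and its id column are hypothetical):

# Read the ID column as character so long IDs don't get mangled by double precision
dt <- fread('big_ids.csv', colClasses = c(id = 'character'))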

Character/string

  • Specified with '' or ""
  • Use paste0() to stick stuff together! paste0('h', 'ello') is “hello”; for a separator use paste(), as in paste('h', 'ello', sep = '_'), which is “h_ello”
  • Messy data often defaults to character. For example, a “1,000,000” in your Excel sheet might not be parsed as 1000000 but instead as a literal “1,000,000” with commas
  • Lots of details on working with these - back to them in a moment

Factors

  • Factors are for categorical data - you’re in one category or another
  • The factor() function lets you specify these labels, and also specify the levels they go in - factors can be ordered!
incdata <- data.table(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+'))[,
  .(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+')))]
setorder(incdata, Income)
incdata
##           Income
## 1: Less than 50k
## 2:      50k-100k
## 3:      50k-100k
## 4:         100k+
## 5:         100k+

Dates

  • Dates are the scourge of data cleaners everywhere. They’re just plain hard to work with!
  • There are Date variables, Datetime variables, both of multiple different formats… eugh!
  • I won’t go into detail here, but I strongly recommend using the lubridate package whenever working with dates. See the cheatsheet
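
Just a taste, since date parsing comes up constantly (a sketch):

library(lubridate)

ymd('2021-02-05')       # parse year-month-day text into a Date
mdy('February 5, 2021') # month-day-year; lots of formats handled
d <- ymd('2021-02-05')
year(d)                 # 2021
floor_date(d, 'month')  # roll back to the first of the month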

Characters/strings

  • Back to strings!
  • Even if your data isn’t textual, working with strings is a very common aspect of preparing data for analysis
  • Some are straightforward, for example using DT[condition, var := fix] to fix typos/misspellings in the data
  • But other common tasks in data cleaning include: getting substrings, splitting strings, cleaning strings, and detecting patterns in strings
  • For this we will be using the stringr package, see the cheatsheet

Getting Substrings

  • When working with things like nested IDs (for example, NAICS codes are six digits, but the first two and first four digits have their own meaning), you will commonly want to pick just a certain range of characters
  • str_sub(string, start, end) will do this. str_sub('hello', 2, 4) is 'ell'
  • Note negative values read from end of string. str_sub('hello', -1) is 'o'

Getting Substrings

  • For example, geographic Census Block Group indicators are 13 digits, the first two of which are the state FIPS code
cbgdata <- data.table(cbg = c(0152371824231, 1031562977281)) 
cbgdata[, cbg := as.character(cbg)] # Make it a string to work with
cbgdata[, state_fips := fcase(
  nchar(cbg) == 12, str_sub(cbg, 1, 1), # Leading zeroes!
  nchar(cbg) == 13, str_sub(cbg, 1, 2)
)]
cbgdata[]
##              cbg state_fips
## 1:  152371824231          1
## 2: 1031562977281         10

Strings

  • Lots of data will try to stick multiple pieces of information in a single cell, so you need to split it out!
  • Generically, str_split() will do this. str_split('a,b', ',')[[1]] is c('a','b')
  • But to split into columns, use data.table’s tstrsplit(), which transposes the split so each piece lands in its own column
deptdata <- data.table(category = c('Sales,Marketing','H&R,Marketing'))
deptdata[, c('Category1', 'Category2') := tstrsplit(category, ',')]
deptdata[]
##           category Category1 Category2
## 1: Sales,Marketing     Sales Marketing
## 2:   H&R,Marketing       H&R Marketing

Cleaning Strings

  • Strings sometimes come with unwelcome extras! Garbage or extra whitespace at the beginning or end, or badly-used characters
  • str_trim() removes beginning/end whitespace, str_squish() removes additional whitespace from the middle too. str_trim(' hi hello ') is 'hi hello'.
  • str_replace_all() is often handy for eliminating (or fixing) unwanted characters
numdata <- data.table(number = c('1,000', '2,003,124')) 
numdata[, number := str_replace_all(number, ',', '') %>% as.numeric()]
numdata[]
##     number
## 1:    1000
## 2: 2003124

Detecting Patterns in Strings

  • Often we want to do something a bit more complex. Unfortunately, this requires we dip our toes into the bottomless well that is regular expressions
  • Regular expressions are ways of describing patterns in strings so that the computer can recognize them. Technically this is what we did with str_replace_all(',','') - ',' is a regular expression saying “look for a comma”
  • There are a lot of options here. See the guide
  • Common: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide.
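
A couple of quick checks to get the flavor (a sketch; str_detect() and str_extract() are from stringr):

library(stringr)

str_detect('Room A113', '[0-9]')   # TRUE - contains at least one digit
str_extract('Room A113', '[0-9]+') # "113" - the + means "one or more"
str_detect('hello', '^h')          # TRUE - ^ anchors to the start of the string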

Detecting Patterns in Strings

  • For example, some companies are publicly listed and we want to indicate that, but not keep the ticker. A simple str_split() won’t do it here, not easily!
  • On the next page we’ll use the regular expression '\\([A-Z].*\\)'
  • '\\([A-Z].*\\)' says “look for a (” (note the \\ to treat the usually-special ( character as an actual character), then “look for a capital letter [A-Z]”, then “allow any number of further characters .*”, then “look for a )”

Detecting Patterns in Strings

  • For detecting patterns, str_detect() from stringr is fine, but a touch faster is data.table’s %like% operator or like() function, the latter having options for ignoring case and regular-expression syntax
companydata <- data.table(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)'))
companydata[, `:=`(publicly_listed  = name %like% '\\([A-Z].*\\)',
                   name = str_replace_all(name, '\\([A-Z].*\\)', ''))]
companydata[]
##                           name publicly_listed
## 1:            Amazon  Holdings            TRUE
## 2: Cargill Corp. (cool place!)           FALSE

Using Data Structure

Using Data Structure

  • One of the core steps of data wrangling we discussed is thinking about how to get information from where it is now to where you want it
  • A tough thing about tidy data is that it can be a little tricky to move data into different rows than it currently is
  • This is often necessary when changing observation level, or when doing things like “calculate growth from an initial value”
  • But we can solve this with the use of setorder() along with other-row-referencing functions like first(), last(), and shift()

Using Data Structure

  • first() and last() refer to the first and last row, naturally
stockdata <- data.table(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
       date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
       stock_price = c(103,103.4,107,85.2, 86.3, 85.6))
setorder(stockdata, ticker, date)
stockdata[, price_growth_since_march_4 := stock_price/first(stock_price) - 1, by = ticker]
stockdata[]
##    ticker       date stock_price price_growth_since_march_4
## 1:   AMZN 2020-03-04       103.0                0.000000000
## 2:   AMZN 2020-03-05       103.4                0.003883495
## 3:   AMZN 2020-03-06       107.0                0.038834951
## 4:    WMT 2020-03-04        85.2                0.000000000
## 5:    WMT 2020-03-05        86.3                0.012910798
## 6:    WMT 2020-03-06        85.6                0.004694836

Using Data Structure

  • shift() looks to the row a certain number above/below this one, based on the n argument
  • Careful! shift() doesn’t care about time structure, it only cares about data structure. If you want daily growth but the row above is last year, you’ll get the wrong result!

Using Data Structure

setorder(stockdata, ticker, date)
stockdata[, price_growth_daily := stock_price/shift(stock_price, 1) - 1, by = ticker]
stockdata
##    ticker       date stock_price price_growth_since_march_4 price_growth_daily
## 1:   AMZN 2020-03-04       103.0                0.000000000                 NA
## 2:   AMZN 2020-03-05       103.4                0.003883495        0.003883495
## 3:   AMZN 2020-03-06       107.0                0.038834951        0.034816248
## 4:    WMT 2020-03-04        85.2                0.000000000                 NA
## 5:    WMT 2020-03-05        86.3                0.012910798        0.012910798
## 6:    WMT 2020-03-06        85.6                0.004694836       -0.008111240

Trickier Stuff

  • Sometimes the kind of data you want to move from one row to another is more complex!
  • You can use first()/last() to get stuff that might not normally be first or last with things like stockdata[, targetdate := date == as.Date('2020-03-05')] and setorder(stockdata, ticker, -targetdate)
  • For even more complex stuff, I often find it useful to use DT[condition, operation] to create a new variable that only picks data from the rows you want, then a grouped in-place column operation to spread the data from those rows across the other rows in the group

Trickier Stuff

testdata <- data.table(person = c('Adam','James','Diego','Beth','Francis','Qian','Ryan','Selma'),
       school_grade = c(6,7,7,8,6,7,8,8),
       subject = c('Math','Math','English','Science','English','Science','Math','PE'),
       test_score = c(80,84,67,87,55,75,85,70))
testdata[subject == 'Math', Math_Scores := test_score]
testdata[, Math_Average_In_This_Grade := mean(Math_Scores, na.rm = TRUE), by = school_grade]
testdata[, Math_Scores := NULL]
testdata[]
##     person school_grade subject test_score Math_Average_In_This_Grade
## 1:    Adam            6    Math         80                         80
## 2:   James            7    Math         84                         84
## 3:   Diego            7 English         67                         84
## 4:    Beth            8 Science         87                         85
## 5: Francis            6 English         55                         80
## 6:    Qian            7 Science         75                         84
## 7:    Ryan            8    Math         85                         85
## 8:   Selma            8      PE         70                         85

Automation

Automation

  • Data cleaning is often very repetitive
  • You shouldn’t let it be!
  • Not just to save yourself work and tedium, but also because standardizing your process so you only have to write the code once both reduces errors and means that if you have to change something you only have to change it once
  • So let’s automate! Three ways we’ll do it here: .SD, writing functions, and purrr

.SD

  • If you have a lot of variables, cleaning them all can be a pain. Who wants to write out the same thing a million times, say to convert all those read-in-as-text variables to numeric?
  • .SD refers to the entire set of data being analyzed other than any variables in by (i.e. the whole thing, or the current group if grouped)

.SD

  • You can pass this to lapply() to apply a function to every variable! (plenty of fancier applications too but we’ll stick here for now)
  • Or just a subset: you can specify only some columns to be in .SD with the .SDcols argument (which takes patterns())
  • It really does work like the dataset! .SD[1] gives the first row of all columns, etc.

.SD

  • Let’s apply the same function to every column. First just to get some summary stats:
mtcars[, lapply(.SD, mean)]
##         mpg    cyl     disp       hp     drat      wt     qsec     vs      am
## 1: 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
##      gear   carb
## 1: 3.6875 2.8125

.SD

  • And then to apply a function to each of them, changing the original data in place:
mtcars[, (names(mtcars)) := lapply(.SD, function(x) x + 1)]

.SDcols

  • We can be choosier and transform only specific columns (while still doing several at once)
  • patterns('price_growth') on the next slide is the same here as 3:4, c('price_growth_since_march_4', 'price_growth_daily'), or price_growth_since_march_4:price_growth_daily
  • The new names on the left of := are just there to avoid overwriting the old columns. We could overwrite by putting (3:4) there instead, or skip the walrus entirely to just return the result

.SDcols

stockgrowth <- stockdata[, .(ticker, date, price_growth_since_march_4, price_growth_daily)]
stockgrowth[, c('bps_march4', 'bps_daily') := lapply(.SD, function(x) x*10000), 
            .SDcols = patterns('price_growth')]
stockgrowth
##    ticker       date price_growth_since_march_4 price_growth_daily bps_march4
## 1:   AMZN 2020-03-04                0.000000000                 NA    0.00000
## 2:   AMZN 2020-03-05                0.003883495        0.003883495   38.83495
## 3:   AMZN 2020-03-06                0.038834951        0.034816248  388.34951
## 4:    WMT 2020-03-04                0.000000000                 NA    0.00000
## 5:    WMT 2020-03-05                0.012910798        0.012910798  129.10798
## 6:    WMT 2020-03-06                0.004694836       -0.008111240   46.94836
##    bps_daily
## 1:        NA
## 2:  38.83495
## 3: 348.16248
## 4:        NA
## 5: 129.10798
## 6: -81.11240

.SDcols

  • That version created new columns alongside the originals (that’s what the new names on the left of := do); putting the original names there would replace them instead
  • Also, you can use a c() of lapply()s instead of just one to do multiple calculations at the same time

.SDcols

stockgrowth[, c('bps_march4', 'bps_daily',
                'pct_march4', 'pct_daily') := 
              c(lapply(.SD, function(x) x*10000),
                lapply(.SD, function(x) x*100)), .SDcols = patterns('price_growth')]
stockgrowth[]
##    ticker       date price_growth_since_march_4 price_growth_daily bps_march4
## 1:   AMZN 2020-03-04                0.000000000                 NA    0.00000
## 2:   AMZN 2020-03-05                0.003883495        0.003883495   38.83495
## 3:   AMZN 2020-03-06                0.038834951        0.034816248  388.34951
## 4:    WMT 2020-03-04                0.000000000                 NA    0.00000
## 5:    WMT 2020-03-05                0.012910798        0.012910798  129.10798
## 6:    WMT 2020-03-06                0.004694836       -0.008111240   46.94836
##    bps_daily pct_march4  pct_daily
## 1:        NA  0.0000000         NA
## 2:  38.83495  0.3883495  0.3883495
## 3: 348.16248  3.8834951  3.4816248
## 4:        NA  0.0000000         NA
## 5: 129.10798  1.2910798  1.2910798
## 6: -81.11240  0.4694836 -0.8111240

.SDcols

  • Another common issue is wanting to apply the same transformation to all variables of the same type
  • For example, converting all characters to factors, or converting a bunch of dollar values to pounds
  • Use sapply(DT, is.type) for this - e.g., sapply(DT, is.numeric), as on the next slide

.SDcols

justprice <- stockdata[, .(ticker, date, stock_price)]
numeric_col_names <- names(justprice)[sapply(justprice,is.numeric)]
newnames <- paste0(numeric_col_names, '_pounds')
justprice[, (newnames) := lapply(.SD, function(x) x/1.36), .SDcols = numeric_col_names]
justprice[]
##    ticker       date stock_price stock_price_pounds
## 1:   AMZN 2020-03-04       103.0           75.73529
## 2:   AMZN 2020-03-05       103.4           76.02941
## 3:   AMZN 2020-03-06       107.0           78.67647
## 4:    WMT 2020-03-04        85.2           62.64706
## 5:    WMT 2020-03-05        86.3           63.45588
## 6:    WMT 2020-03-06        85.6           62.94118

Rowwise Operations

  • A lot of business data especially might record values in a bunch of categories, each category in its own column, but not report the total
  • This is annoying! But .SD helps here
  • Especially if you just want to sum, then it’s easy with rowSums()

Summing over Columns

deptdata <- data.table(year = c(1994, 1995, 1996), sales = c(104, 106, 109), marketing = c(100, 200, 174), rnd = c(423,123,111)) 
deptdata[, total_spending := rowSums(.SD), .SDcols = sales:rnd]
deptdata[]
##    year sales marketing rnd total_spending
## 1: 1994   104       100 423            627
## 2: 1995   106       200 123            429
## 3: 1996   109       174 111            394

Writing Functions

  • We’ve already done a bit of function-writing here, in the file read-in and with lapply()
  • Generally, if you’re going to do the same thing more than once, you’re probably better off writing a function
  • Reduces errors, saves time, makes code reusable later!
function_name <- function(argument1 = default1, argument2 = default2, etc.) {
  some code
  result <- more code
  return(result)
  # (or just do result by itself - the value of the last expression will be automatically returned if there's no return())
}

Function-writing tips

  • Make sure to think about what kind of values your function accepts and make sure that what it returns is consistent so you know what you’re getting
  • This is a really deep topic, too deep to cover in two slides; mostly I just want to poke you and encourage you to do it. At least, if you find yourself doing something a bunch of times in a row, take that code, stick it inside a function() wrapper, and replace the repeats with calls to that function (see the sketch after this list)
  • More information here.
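
For instance, a hypothetical cleaning step you might otherwise copy-paste for every file:

library(stringr)

# Hypothetical: turn '1,234'-style text columns into numbers and tag the source year
clean_sales <- function(df, year) {
  df <- as.data.table(df)
  df[, sales := as.numeric(str_replace_all(sales, ',', ''))]
  df[, report_year := year]
  return(df)
}

Then cleaning every file is one call per file (or a map2() over files and years) instead of a copy-pasted block.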

Unnamed Functions

  • There are other ways to do functions in R: unnamed functions
  • Notice how in the lapply() examples I didn’t have to do bps <- function(x) x*10000, I just did function(x) x*10000? That’s an “unnamed function”
  • If your function is very small like this and you’re only going to use it once, it’s great for that!
  • In R 4.1, you will be able to just do \(x) instead of function(x)

purrr

  • One good way to apply functions iteratively (yours or not) is with the map() functions in purrr
  • We already did this to read in files, but it applies much more broadly! map() usually generates a list, map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble

purrr

  • It iterates through a list, a data.frame/tibble/data.table (which are technically lists), or a vector, and applies a function to each of the elements
person_year_data %>%
  map_chr(class)
##      Person        Year      Income 
## "character"   "numeric"   "numeric"

purrr

  • Obviously handy for processing many files, as in our reading-in-files example
  • Or looping more generally for diagnostic or wrangling purposes. Perhaps you have a summary_profile() function you’ve made, and want to check each state’s data to see if its data looks right. You could do
data[, state] %>% unique() %>% map(summary_profile)

Finishing Up, and an Example!

Some Final Notes

  • We can’t possibly cover everything. So one last note, about saving your data!
  • What to do when you’re done and want to save your processed data?
  • Saving data in R format: save() saves many objects, which are all put back in the environment with load(). Often preferable is saveRDS(), which saves a single object (like a data.table) in compressed format, loadable with df <- readRDS() (sketch below)
  • Saving data for sharing: fwrite() makes a CSV. Yay!
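
A sketch, with hypothetical file names:

saveRDS(person_year_data, 'person_year_data.rds')   # compressed, single-object save
person_year_data <- readRDS('person_year_data.rds') # ...and back again
fwrite(person_year_data, 'person_year_data.csv')    # a CSV for sharing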

Some Final Notes

  • Also, please, please, please DOCUMENT YOUR DATA
  • At the very least, keep a spreadsheet or document with a set of descriptions for each of your variables
  • Also look into the sjlabelled or haven packages to add variable labels directly to the data set itself
  • Once you have your variables labelled, vtable() in vtable can generate a documentation file for sharing

A Walkthrough

  • Let’s clean some data!