Welcome to the Data Wrangling Workshop!
- data.table is a package for working with data
- We'll also use the pipe (%>%), which simply means "take whatever's on the left and make it the first argument of the thing on the right"
- The pipe requires library(magrittr) or loading the tidyverse or one of many packages that come with the pipe. For example, scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1) can be rewritten mtcars[['am']] %>% mean(na.rm = TRUE) %>% scales::percent(accuracy = .1)
- As of R 4.1, base R has |> for the pipe, which won't require a package load. Also, you can chain data.table() operations by just using a bunch of []s (we'll get to it)
- The [[ is the "[[ function" - i.e. this is doing mtcars[['am']], equivalent to mtcars$am

What is data wrangling?
I help a lot of people with their problems with data wrangling. Their issues are almost always from not doing one of these four things, much more so than from having trouble coding or anything like that.
- Look at the data! Use View() to look at it
- Or get a summary: sumtable() or vtable(lush = TRUE) in the vtable package, for example
- Or run table() or summary() on individual variables
- Getting the data in the first place is not something we'll be focusing on today! But any time the data isn't in a workable format, like a spreadsheet or database, someone's got to get it there!
One common thing you run across is data split into multiple files. How can we read these in and compile them?
- list.files() produces a vector of filenames (tip: full.names = TRUE gives full filepaths)
- Use map() from purrr to iterate over that vector and read in the data. This gives a list of the tibbles (data.frames) read in
- You can do any pre-combination processing with map() too (if you want some processing before you combine)
- Make sure each element is a data.table (if it isn't already, say from fread()) with map(as.data.table)
- Then bind the list together with rbindlist()!

For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together.
We can simplify by making a little function that processes each of the reports as it's read. Then, use map() with read_excel() and our function, and bind it all together!
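A minimal sketch of that pattern - the folder name, cell positions, and column names here are hypothetical, so adjust them to your actual reports:

library(purrr)
library(readxl)
library(data.table)

# Hypothetical: read one report, pulling just the two cells we care about
read_report <- function(filename) {
  sheet <- read_excel(filename, col_names = FALSE)
  data.table(file = filename,
             total_sales = as.numeric(sheet[[3]][2]),          # cell C2: column 3, row 2
             employee_of_month = as.character(sheet[[2]][43])) # cell B43: column 2, row 43
}

# Read every file in the (hypothetical) reports/ folder and stack the results
list.files('reports/', full.names = TRUE) %>%
  map(read_report) %>%
  rbindlist()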
How do I know it's df[1,3] I want, etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it!
In tidy data:

- Each variable gets a column
- Each observation gets a row
- Each type of observational unit gets a table
The variables in tidy data come in two types:

- Identifying/key variables, which pinpoint which observation each row describes
- Measures, the actual values recorded for that observation
Which are they in this data?
- In data.table, pivoting is done with melt() and dcast()
- Wide-to-long is melt()
- Long-to-wide is dcast()
- Check our steps!
melt() needs:

- data (the data set you're working with, also the first argument so we can pipe to it)
- id.vars (a vector of identifying/key columns, either numbers for position or character for names)
- measure.vars (the columns to pivot) - by default everything not in id.vars
- variable.name (the name of the variable to store which column a given row came from, here "week")
- value.name (the variable to store the value in)
- (lots more options - see help(melt))

(In the tidyverse you could instead pivot each individual with pivot_wider(), and then combine multiple individuals with bind_rows().)

dcast() needs:

- data (first argument, the data we're working with)
- formula (this tells us the observation level of the wide data and how it expands to the long data - more options in help(dcast))

Here the new observation level is just a placeholder (.), but the old one is TaxFormRow:

## . AGI Deductible Income Person
## 1: . 88341 24000 112341 James Acaster
(Note that the variables are all stored as character variables, not numbers - that's because the "Person" row is character, which forced the rest to be too. We'll go through how to fix that later.)
We can use rbind() to stack data sets with the same variables together, handy for compiling data from different sources (rbindlist() binds a list of data.tables)
## . AGI Deductible Income Person
## 1: . 88341 24000 112341 James Acaster
## 2: . 309122 16000 325122 Eddie Izzard
That was person_year_data. And now for person_data:
- The merge function will do this (see help(merge), making sure you get the data.table version instead of the base-R one)
- The by option specifies the columns to merge on
- The all.x and all.y options specify whether to keep rows from the first and second data sets, respectively, that don't find a match

## Person Year Income Birthplace
## 1: David 2014 102452 <NA>
## 2: David 2014 105133 <NA>
## 3: Ramesh 2014 81314 Crawley
## 4: Ramesh 2015 82155 Crawley
## 5: Whitney 2014 131292 Washington D.C.
## 6: Whitney 2015 141262 Washington D.C.
## Person Year Income Birthplace
## 1: Ramesh 2014 81314 Crawley
## 2: Ramesh 2015 82155 Crawley
## 3: Whitney 2014 131292 Washington D.C.
## 4: Whitney 2015 141262 Washington D.C.
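For reference, a hedged sketch of the two merges that could produce the outputs above, assuming person_data contains Person and Birthplace:

person_year_data %>% merge(person_data, by = 'Person', all.x = TRUE)  # keeps David, with NA Birthplace
person_year_data %>% merge(person_data, by = 'Person')                # matched rows only, so David drops out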
Important: make sure by is the exact observation level in at least one of the two data sets. If there are duplicate combinations of the by variables in both, that's a problem! It will create all the potential matches, which may not be what you want:

a <- data.table(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- data.table(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% merge(b, by = 'Name')

## Name Year Value Characteristic
## 1: A 2014 1 Up
## 2: A 2014 1 Down
## 3: A 2015 2 Up
## 4: A 2015 2 Down
## 5: B 2014 3 Up
## 6: C 2014 4 Left
## 7: C 2014 4 Right
- If we want to check whether Person is a key for data set a, then a[, .(Person)] %>% duplicated() %>% max() returning 1 (i.e. TRUE) shows us we're wrong
- Another way to merge data.tables is the special data.table syntax DT1[DT2, on = .(keys)]
- The basic data.table() syntax is extremely simple: DT[filter, variable operations, grouping], aka DT[i, j, by] (we'll also meet helpers like fcase() along the way)
- Instead of modify-and-reassign like a <- a + 1, many data.table operations can be done in place, directly at the existing memory location. Blazing fast.
- In-place column creation and modification uses :=
- There are also in-place set functions like setnames(), setorder(), set(), etc., and variants like setorderv() that take column-name vectors instead of direct column names
- Some features of data.tables clash with typical R practice. In-place manipulation is one of them: if you make a copy of a data.table, say by dt2 <- dt1, changes to dt2 will also happen to dt1, unless you do dt2 <- copy(dt1) instead for a "deep copy"
- Also, modifying a data.table in place and then typing it by itself on a line won't print it - you need to add [] after
- You can string []s together for chained operations! (a short sketch follows the output below)
- Filtering rows: Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise. data[Income > 100000] would return just the rows of data that have Income > 100000

## Person Year Income Birthplace
## 1: Ramesh 2014 81314 Crawley
## 2: Ramesh 2015 82155 Crawley
## 3: Whitney 2014 131292 Washington D.C.
## 4: Whitney 2015 141262 Washington D.C.
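As promised, a minimal sketch of in-place modification and chaining, using a small hypothetical table:

dt <- data.table(x = 1:4, g = c('a','a','b','b'))
dt[, y := x * 2]                          # in-place: no <- needed, and nothing prints
dt[]                                      # a trailing [] prints it
dt[x > 1][, .(mean_y = mean(y)), by = g]  # chain []s: filter, then summarize by group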
- If a logical condition is true, it returns TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant)
- Even when ifelse is what you want, it's not what you want! data.table's fifelse is the same but faster

Handy tools for constructing logical conditions:
a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal
a %in% c(b, c, d, e, f) checks whether a is any of the values b, c, d, e, or f. Works for text too!
Whatever your condition is (condition), just put a ! (“not”) in front to reverse TRUE/FALSE. 2 + 2 == 4 is TRUE, but !(2 + 2 == 4) is FALSE
Chain multiple conditions together! & is “and”, | is “or”. Be careful with parentheses if combining them!
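For example, a hypothetical filter combining several of these tools (note the parentheses around the | part):

person_year_data[Year == 2014 & (Income > 100000 | Person %in% c('Ramesh', 'Whitney'))]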
- Variable operations on data.tables go in the second argument, i.e. the j in DT[i,j,by]
- By default, any columns not used in j (or in by) are dropped
- The .() function in data.table is a shortcut for list()
- When to use .() and when to use c()? Use c() to refer to column numbers, or to pass in names as strings (handy in programming!) - if you do pass a character vector of names in j, you'll also need the option with = FALSE
- Instead of mean(mtcars$hp) we can do mtcars[, mean(hp)]
- Or mtcars[, hp] (this gives back a numeric vector, not a one-column data.table! For a one-column data.table we'd do mtcars[, .(hp)])
- Think about what mtcars[am == 1, mean(hp)] does… you can start to see the appeal!

Next up is fcase(), which is sort of like ifelse() except it can cleanly handle way more than one condition:

- You use fcase() with a series of if, then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one
- There's also a default option for what to do with observations that are FALSE for everything else

person_year_data[, .(Income = Income,
IncomeBracket = fcase(
Income <= 50000, 'Under 50k',
Income > 50000 & Income <= 100000, '50-100k',
Income > 100000 & Income < 120000, '100-120k',
default = 'Above 120k'))]

## Income IncomeBracket
## 1: 81314 50-100k
## 2: 82155 50-100k
## 3: 131292 Above 120k
## 4: 141262 Above 120k
## 5: 102452 100-120k
## 6: 105133 100-120k
then doesn’t have to be a value, it can be a calculation, for examplefcase() is when you want to make an adjustment to just some of the observations. Say we realized that David’s income was reported in pounds, not dollars, so we need to adjust it.data.table (DT[i,j,by]) performs the j operations by group, splitting each combination of the grouping variablesby = .(a, b) or by = c('a','b')person_year_data[, .(Income = Income,
Income_Relative_to_Mean = Income - mean(Income)),
by = Person]

## Person Income Income_Relative_to_Mean
## 1: Ramesh 81314 -420.5
## 2: Ramesh 82155 420.5
## 3: Whitney 131292 -4985.0
## 4: Whitney 141262 4985.0
## 5: David 102452 -1340.5
## 6: David 105133 1340.5
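And the David-in-pounds fix mentioned above - a hedged sketch, with a made-up exchange rate:

person_year_data[, Income := fcase(
  Person == 'David', Income * 1.25,  # hypothetical pounds-to-dollars rate
  Person != 'David', Income)]
# or, more simply, filter-and-assign:
# person_year_data[Person == 'David', Income := Income * 1.25]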
- We can ask data.table to do a calculation by group by giving those groups to by
- data.tables also have explicit keys - when the keys are set, the data.table is pre-sorted by those keys
- Handy helpers for use in j: .N gives the number of rows in the group, and seq_len(.N) gives the row number within its group of that observation
- := with by maintains the original observation level. But .() with by changes the observation level to the level implied by the functions you give it!

## Person Mean_Income Years_Tracked
## 1: Ramesh 81734.5 2
## 2: Whitney 136277.0 2
## 3: David 103792.5 2
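A sketch of the kind of call that produces the summary above (two years per person, hence .N = 2):

person_year_data[, .(Mean_Income = mean(Income), Years_Tracked = .N), by = Person]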
To sort, you can pass order() to the i argument, but more often you'll just do an in-place setorder() (both sketched after the output below):

## Person Year Income
## 1: David 2014 102452
## 2: David 2014 105133
## 3: Ramesh 2014 81314
## 4: Ramesh 2015 82155
## 5: Whitney 2014 131292
## 6: Whitney 2015 141262
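The two approaches side by side - a sketch, where merged_data stands in for whatever table you're sorting:

merged_data[order(Person, Year)]     # sorted copy via i
setorder(merged_data, Person, Year)  # sorts in place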
Common variable types: numeric (integer, double), character (string), factor, logical, and dates.
- Check a variable's type with is. and then the type (is.numeric(), say), or with class(), or check the whole data set with vtable::vtable(data) or str(data)
- Convert types with as. and then the type (as.numeric(), as.character(), …)

widetax <- dcast(taxdata, . ~ TaxFormRow)
widetax[, `:=`(Person = as.factor(Person),
Income = as.numeric(Income),
Deductible = as.numeric(Deductible),
AGI = as.numeric(AGI))]
sapply(widetax, class)

## . AGI Deductible Income Person
## "character" "numeric" "numeric" "numeric" "factor"
- Numeric variables come as integer, double, and so on
- Watch out for ID numbers read in as doubles that get stored in scientific notation - lumping multiple groups together! Avoid this with options like colClasses in your data-reading function
- Character (string) variables are wrapped in '' or ""
- Use paste0() to stick stuff together! paste0('h', 'ello') is "hello", and paste('h', 'ello', sep = '_') is "h_ello"
- Sometimes a number isn't stored as 1000000 but instead as a literal "1,000,000" with commas - a string you'll have to clean before converting
- Factors are for categorical variables with labels. The factor() function lets you specify these labels, and also specify the order the levels go in - factors can be ordered!

incdata <- data.table(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+'))[,
.(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+')))]
setorder(incdata, Income)
incdata

## Income
## 1: Less than 50k
## 2: 50k-100k
## 3: 50k-100k
## 4: 100k+
## 5: 100k+
- Use DT[condition, var := fix] to fix typos/misspellings in the data
- To pull out part of a string by position, str_sub(string, start, end) will do this. str_sub('hello', 2, 4) is 'ell'
- Negative positions count from the end: str_sub('hello', -1) is 'o'

cbgdata <- data.table(cbg = c(0152371824231, 1031562977281))
cbgdata[, cbg := as.character(cbg)] # Make it a string to work with
cbgdata[, state_fips := fcase(
nchar(cbg) == 12, str_sub(cbg, 1, 1), # Leading zeroes!
nchar(cbg) == 13, str_sub(cbg, 1, 2)
)]
cbgdata[]

## cbg state_fips
## 1: 152371824231 1
## 2: 1031562977281 10
- To split a string on a delimiter, str_split() will do this. str_split('a,b', ',')[[1]] is c('a','b')
- Splitting one column into several works neatly with data.tables - data.table's tstrsplit() transposes the split so each piece becomes its own column

deptdata <- data.table(category = c('Sales,Marketing','H&R,Marketing'))
deptdata[, c('Category1', 'Category2') := tstrsplit(category, ',')]
deptdata[]

## category Category1 Category2
## 1: Sales,Marketing Sales Marketing
## 2: H&R,Marketing H&R Marketing
- str_trim() removes beginning/end whitespace, and str_squish() removes extra whitespace from the middle too. str_trim(' hi hello ') is 'hi hello'
- str_replace_all() is often handy for eliminating (or fixing) unwanted characters

numdata <- data.table(number = c('1,000', '2,003,124'))
numdata[, number := str_replace_all(number, ',', '') %>% as.numeric()]
numdata[]

## number
## 1: 1000
## 2: 2003124
- In str_replace_all(',', ''), the ',' is a regular expression saying "look for a comma"
- Regular expressions are a whole language for describing text patterns: [0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide.
- (tidyr's separate() won't do it here, not easily!)
- Consider the regular expression '\\([A-Z].*\\)'. It says "look for a (" (note the \\ to treat the usually-special ( character as an actual character), then "look for a capital letter [A-Z]", then "keep matching any characters .*", then "look for a )"
- To detect a pattern, str_detect() from stringr is fine, but a touch faster is data.table's %like% operator or like() function, the latter having options for ignoring case and for skipping regular-expression syntax

companydata <- data.table(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)'))
companydata[, `:=`(publicly_listed = name %like% '\\([A-Z].*\\)',
name = str_replace_all(name, '\\([A-Z].*\\)', ''))]
companydata[]

## name publicly_listed
## 1: Amazon Holdings TRUE
## 2: Cargill Corp. (cool place!) FALSE
- To refer to values on other rows, use setorder() along with other-row-referencing functions like first(), last(), and shift()
- first() and last() refer to the first and last row (within the group), naturally

stockdata <- data.table(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
stock_price = c(103,103.4,107,85.2, 86.3, 85.6))
setorder(stockdata, ticker, date)
stockdata[, price_growth_since_march_4 := stock_price/first(stock_price) - 1, by = ticker]
stockdata[]

## ticker date stock_price price_growth_since_march_4
## 1: AMZN 2020-03-04 103.0 0.000000000
## 2: AMZN 2020-03-05 103.4 0.003883495
## 3: AMZN 2020-03-06 107.0 0.038834951
## 4: WMT 2020-03-04 85.2 0.000000000
## 5: WMT 2020-03-05 86.3 0.012910798
## 6: WMT 2020-03-06 85.6 0.004694836
- shift() looks to the row a certain number above/below this one, based on the n argument
- Note shift() doesn't care about time structure, it only cares about data structure. If you want daily growth but the row above is last year, you'll get the wrong result!

setorder(stockdata, ticker, date)
stockdata[, price_growth_daily := stock_price/shift(stock_price, 1) - 1, by = ticker]
stockdata

## ticker date stock_price price_growth_since_march_4 price_growth_daily
## 1: AMZN 2020-03-04 103.0 0.000000000 NA
## 2: AMZN 2020-03-05 103.4 0.003883495 0.003883495
## 3: AMZN 2020-03-06 107.0 0.038834951 0.034816248
## 4: WMT 2020-03-04 85.2 0.000000000 NA
## 5: WMT 2020-03-05 86.3 0.012910798 0.012910798
## 6: WMT 2020-03-06 85.6 0.004694836 -0.008111240
- You can trick first()/last() into getting stuff that might not normally be first or last with things like stockdata[, targetdate := date == as.Date('2020-03-05')] and setorder(stockdata, ticker, -(targetdate))
- Or use DT[condition, operation] to create a new variable that only picks data from the rows you want, then a grouped in-place column operation to spread the data from those rows across the other rows in the group:

testdata <- data.table(person = c('Adam','James','Diego','Beth','Francis','Qian','Ryan','Selma'),
school_grade = c(6,7,7,8,6,7,8,8),
subject = c('Math','Math','English','Science','English','Science','Math','PE'),
test_score = c(80,84,67,87,55,75,85,70))
testdata[subject == 'Math', Math_Scores := test_score]
testdata[, Math_Average_In_This_Grade := mean(Math_Scores, na.rm = TRUE), by = school_grade]
testdata[, Math_Scores := NULL]
testdata[]

## person school_grade subject test_score Math_Average_In_This_Grade
## 1: Adam 6 Math 80 80
## 2: James 7 Math 84 84
## 3: Diego 7 English 67 84
## 4: Beth 8 Science 87 85
## 5: Francis 6 English 55 80
## 6: Qian 7 Science 75 84
## 7: Ryan 8 Math 85 85
## 8: Selma 8 PE 70 85
.SD, writing functions, and purrr

- .SD refers to the entire set of data being analyzed other than any variables in by (i.e. the whole thing, or the current group if grouped)
- Combine .SD with lapply() to apply a function to every variable! (plenty of fancier applications too but we'll stick here for now)
- Limit which columns .SD covers with the .SDcols argument (which takes patterns())
- .SD can be indexed like a table, too: .SD[1] gives the first row of all columns, etc.

## mpg cyl disp hp drat wt qsec vs am
## 1: 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625
## gear carb
## 1: 3.6875 2.8125
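A sketch of a call that produces output like the above - mean() applied to every column of mtcars:

mtcars_dt <- as.data.table(mtcars)
mtcars_dt[, lapply(.SD, mean)]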
- The patterns('price_growth') below picks out the same columns as .SDcols = 4:5, or c('price_growth_since_march_4', 'price_growth_daily'), or price_growth_since_march_4:price_growth_daily
- But beware: once new columns get added, positions like 4:5 shift, so patterns() stays safe in the next example too. (You could also avoid the walrus := entirely and build a new table with .() instead)

stockgrowth <- stockdata[, .(ticker, date, price_growth_since_march_4, price_growth_daily)]
stockgrowth[, c('bps_march4', 'bps_daily') := lapply(.SD, function(x) x*10000),
.SDcols = patterns('price_growth')]
stockgrowth

## ticker date price_growth_since_march_4 price_growth_daily bps_march4
## 1: AMZN 2020-03-04 0.000000000 NA 0.00000
## 2: AMZN 2020-03-05 0.003883495 0.003883495 38.83495
## 3: AMZN 2020-03-06 0.038834951 0.034816248 388.34951
## 4: WMT 2020-03-04 0.000000000 NA 0.00000
## 5: WMT 2020-03-05 0.012910798 0.012910798 129.10798
## 6: WMT 2020-03-06 0.004694836 -0.008111240 46.94836
## bps_daily
## 1: NA
## 2: 38.83495
## 3: 348.16248
## 4: NA
## 5: 129.10798
## 6: -81.11240
Provide a c() of lapply()s instead of just one to do multiple calculations at the same time (the new names line up with the combined list, in order):

stockgrowth[, c('bps_march4', 'bps_daily',
'pct_march4', 'pct_daily') :=
c(lapply(.SD, function(x) x*10000),
lapply(.SD, function(x) x*100)), .SDcols = patterns('price_growth')]
stockgrowth[]

## ticker date price_growth_since_march_4 price_growth_daily bps_march4
## 1: AMZN 2020-03-04 0.000000000 NA 0.00000
## 2: AMZN 2020-03-05 0.003883495 0.003883495 38.83495
## 3: AMZN 2020-03-06 0.038834951 0.034816248 388.34951
## 4: WMT 2020-03-04 0.000000000 NA 0.00000
## 5: WMT 2020-03-05 0.012910798 0.012910798 129.10798
## 6: WMT 2020-03-06 0.004694836 -0.008111240 46.94836
## bps_daily pct_march4 pct_daily
## 1: NA 0.0000000 NA
## 2: 38.83495 0.3883495 0.3883495
## 3: 348.16248 3.8834951 3.4816248
## 4: NA 0.0000000 NA
## 5: 129.10798 1.2910798 1.2910798
## 6: -81.11240 0.4694836 -0.8111240
You can also apply a function to just the columns of a certain type, by first collecting the names of those columns (use sapply(DT, is.type) for this):

justprice <- stockdata[, .(ticker, date, stock_price)]
numeric_col_names <- names(justprice)[sapply(justprice,is.numeric)]
newnames <- paste0(numeric_col_names, '_pounds')
justprice[, (newnames) := lapply(.SD, function(x) x/1.36), .SDcols = numeric_col_names]
justprice[]

## ticker date stock_price stock_price_pounds
## 1: AMZN 2020-03-04 103.0 75.73529
## 2: AMZN 2020-03-05 103.4 76.02941
## 3: AMZN 2020-03-06 107.0 78.67647
## 4: WMT 2020-03-04 85.2 62.64706
## 5: WMT 2020-03-05 86.3 63.45588
## 6: WMT 2020-03-06 85.6 62.94118
.SD helps here too: to sum across several columns within each row, pair it with rowSums()

deptdata <- data.table(year = c(1994, 1995, 1996), sales = c(104, 106, 109), marketing = c(100, 200, 174), rnd = c(423,123,111))
deptdata[, total_spending := rowSums(.SD), .SDcols = sales:rnd]
deptdata[]

## year sales marketing rnd total_spending
## 1: 1994 104 100 423 627
## 2: 1995 106 200 123 429
## 3: 1996 109 174 111 394
- You don't strictly need lapply() and a function() wrapper - you could instead just write a bunch of calls to that function in a row, once per column, but that gets repetitive fast
- Notice in the lapply() examples I didn't have to do bps <- function(x) x*10000, I just did function(x) x*10000? That's an "unnamed function"
- As of R 4.1 you can shorten this further: \(x) instead of function(x)
- Similar are the map() functions in purrr
- map() usually generates a list(), map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble()…
- Each takes a list, a data.frame/tibble/data.table (which are technically lists), or a vector, and then applies a function to each of the elements

## Person Year Income
## "character" "numeric" "numeric"
- Say you have a summary_profile() function you've made, and want to check each state's data to see if its data looks right. You could map() over the states rather than write a for() loop (see the sketch below)
- Saving data: save() saves many objects, which are all put back in the environment with load(). Often preferable is saveRDS(), which saves a single data.frame() in compressed format, loadable with df <- readRDS()
- fwrite() makes a CSV. Yay!
- vtable() in vtable can generate a documentation file for sharing
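As mentioned above, a hedged sketch of that check-each-state pattern - full_data, its state column, and summary_profile() are hypothetical stand-ins for your own data and function:

full_data %>%
  split(by = 'state') %>%  # split.data.table: one data.table per state
  map(summary_profile)     # run our check on each piece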