Welcome to the Data Wrangling Workshop!
tibbles” instead of data.frames - these are very similar to data.frames but look nicer when you print them, and can accept list() columns, as well as some other neat stuff%>%), which simply means “take whatever’s on the left and make it the first argument of the thing on the right”scales::percent(mean(mtcars$am, na.rm = TRUE), accuracy = .1) can be rewritten
|> for the pipepull() is a dplyr function that says “give me back this one variable instead of a data set” but in a pipe-friendly way, so mtcars %>% pull(am) is the same as mtcars$am or mtcars[['am']]What is data wrangling?
I help a lot of people with their problems with data wrangling. Their issues are almost always not doing one of these four things, much more so than having trouble coding or anything like that
View() to look at itsumtable() or vtable(lush = TRUE) in vtable for exampletable() or summary() on individual variablesNot something we’ll be focusing on today! But any time the data isn’t in a workable format, like a spreadsheet or database, someone’s got to get it there!
One common thing you run across is data split into multiple files. How can we read these in and compile them?
list.files() produces a vector of filenames (tip: full.names = TRUE gives full filepaths)map() from purrr to iterate over that vector and read in the data. This gives a list of tibbles (data.frames) read inmap with that too (if you want some processing before you combine)bind_rows()!For example, imagine you have 200 monthly sales reports in Excel files. You just want to pull cell C2 (total sales) and cell B43 (employee of the month) and combine them together.
We can simplify by making a little function that processes each of the reports as it’s read. Then, use map() with read_excel() and then our function, then bind it together!
How do I get df[1,3], etc.? Because I look straight at the files and check where the data I want is, so I can pull it and put it where I want it!
In tidy data:
The variables in tidy data come in two types:
Which are they in this data?
## Tidy DataCheck our steps!
pivot_longer() and pivot_wider(). Here we want wide-to-long so we use pivot_longer()data (the data set you’re working with, also the first argument so we can pipe to it)cols (the columns to pivot) - it will assume anything not named here are the keysnames_to (the name of the variable to store which column a given row came from, here “week”)values_to (the name of the vairable to store the value in)help(pivot_longer))billboard %>%
pivot_longer(cols = starts_with('wk'), # tidyselect functions help us pick columns based on name patterns
names_to = 'week',
names_prefix = 'wk', # Remove the "wk" at the start of the column names
values_to = 'chart_position',
values_drop_na = TRUE) # Drop any key combination with a missing valuepivot_wider(), and then combine multiple individuals with bind_rows()pivot_wider() needs:data (first argument, the data we’re working with)id_cols (the columns that give us the key - what should it be here?)names_from (the column containing what will be the new variable names)values_from (the column containing the new values)help(pivot_wider)## # A tibble: 1 x 4
## Person Income Deductible AGI
## <chr> <chr> <chr> <chr>
## 1 James Acaster 112341 24000 88341
(note that the variables are all stored as character variables not numbers - that’s because the “person” row is a character, which forced the rest to be too. we’ll go through how to fix that later)
We can use bind_rows() to stack data sets with the same variables together, handy for compiling data from different sources
## # A tibble: 2 x 4
## Person Income Deductible AGI
## <chr> <chr> <chr> <chr>
## 1 James Acaster 112341 24000 88341
## 2 Eddie Izzard 325122 16000 309122
That was person_year_data. And now for person_data:
join family of functions will do this (see help(join)). The different varieties just determine what to do with rows you don’t find a match for. left_join() keeps non-matching rows from the first dataset but not the second, right_join() from the second not the first, full_join() from both, inner_join() from neither, and anti_join() JUST keeps non-matches## Person Year Income Birthplace
## 1 Ramesh 2014 81314 Crawley
## 2 Ramesh 2015 82155 Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
## 5 David 2014 102452 <NA>
## 6 David 2015 105133 <NA>
## Person Year Income Birthplace
## 1 Ramesh 2014 81314 Crawley
## 2 Ramesh 2015 82155 Crawley
## 3 Whitney 2014 131292 Washington D.C.
## 4 Whitney 2015 141262 Washington D.C.
by is the exact observation level in at least one of the two data setsby variables in both, that’s a problem! It will create all the potential matches, which may not be what you want:a <- tibble(Name = c('A','A','B','C'), Year = c(2014, 2015, 2014, 2014), Value = 1:4)
b <- tibble(Name = c('A','A','B','C','C'), Characteristic = c('Up','Down','Up','Left','Right'))
a %>% left_join(b, by = 'Name')## # A tibble: 7 x 4
## Name Year Value Characteristic
## <chr> <dbl> <int> <chr>
## 1 A 2014 1 Up
## 2 A 2014 1 Down
## 3 A 2015 2 Up
## 4 A 2015 2 Down
## 5 B 2014 3 Up
## 6 C 2014 4 Left
## 7 C 2014 4 Right
Person is a key for data set a, then a %>% select(Person) %>% duplicated() %>% max() will return TRUE, showing us we’re wrongsafe_join() in the pmdplyr package, which will check for you that you’re doing the kind of merge you think you’re doing.inexact_join() family of functions which can help join data sets that don’t line up exactly, like if you want to match on time, but on the most recent match, not an exact match. The fuzzyjoin package has similar functions for matching inexactly for text variablesfilter(), select(), arrange(), mutate(), group_by(), and summarize().pull() (which we covered), case_when()filter() limits the data to the observations that fulfill a certain logical condition. It picks rows.Income > 100000 is TRUE for everyone with income above 100000, and FALSE otherwise. filter(data, Income > 100000) would return just the rows of data that have Income > 100000## Person Year Income Birthplace
## 1 Whitney 2014 131292 Washington D.C.
## 2 Whitney 2015 141262 Washington D.C.
## 3 David 2014 102452 <NA>
## 4 David 2015 105133 <NA>
TRUE, which turns into 1 if you do a calculation with it. If false, it returns FALSE, which turns into 0. (tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant)Handy tools for constructing logical conditions:
a > b, a >= b, a < b, a <= b, a == b, or a != b to compare two numbers and check if a is above, above-or-equal, below, below-or-equal, equal (note == to check equality, not =), or not equal
a %in% c(b, c, d, e, f) checks whether a is any of the values b, c, d, e, or f. Works for text too!
Whatever your condition is (condition), just put a ! (“not”) in front to reverse TRUE/FALSE. 2 + 2 == 4 is TRUE, but !(2 + 2 == 4) is FALSE
Chain multiple conditions together! & is “and”, | is “or”. Be careful with parentheses if combining them! In filter specifically, you can use , instead of &.
select() gives you back just a subset of the columns. It picks columns- to not pick certain columnsIf our data has the columns “Person”, “Year”, and “Income”, then all of these do the same thing:
no_income <- person_year_data %>% select(Person, Year)
no_income <- person_year_data %>% select(1:2)
no_income <- person_year_data %>% select(-Income)
print(no_income)## Person Year
## 1 Ramesh 2014
## 2 Ramesh 2015
## 3 Whitney 2014
## 4 Whitney 2015
## 5 David 2014
## 6 David 2015
arrange() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.## Person Year Income
## 1 David 2014 102452
## 2 David 2015 105133
## 3 Ramesh 2014 81314
## 4 Ramesh 2015 82155
## 5 Whitney 2014 131292
## 6 Whitney 2015 141262
mutate() assigns columns/variables, i.e. you can create variables with it (note also its sibling transmute() which does the same thing and then drops any variables you don’t explicitly specify in the function)mutate() call, separated by commas (,)## Person Year Income NextYear Above100k
## 1 Ramesh 2014 81314 2015 FALSE
## 2 Ramesh 2015 82155 2016 FALSE
## 3 Whitney 2014 131292 2015 TRUE
## 4 Whitney 2015 141262 2016 TRUE
## 5 David 2014 102452 2015 TRUE
## 6 David 2015 105133 2016 TRUE
case_when(), which is sort of like ifelse() except it can cleanly handle way more than one conditioncase_when() with a series of if ~ then conditions, separated by commas, and it will go through the ifs one by one for each observation until it finds a fitting one.if be TRUE to give a value for anyone who hasn’t been caught yetperson_year_data %>%
mutate(IncomeBracket = case_when(
Income <= 50000 ~ 'Under 50k',
Income > 50000 & Income <= 100000 ~ '50-100k',
Income > 100000 & Income < 120000 ~ '100-120k',
TRUE ~ 'Above 120k'
))## Person Year Income IncomeBracket
## 1 Ramesh 2014 81314 50-100k
## 2 Ramesh 2015 82155 50-100k
## 3 Whitney 2014 131292 Above 120k
## 4 Whitney 2015 141262 Above 120k
## 5 David 2014 102452 100-120k
## 6 David 2015 105133 100-120k
then doesn’t have to be a value, it can be a calculation, for examplecase_when() to change the values of just some of the observations.NA, you must use the type-appropriate NA. NA_character_, NA_real_, etc.group_by() turns the dataset into a grouped data set, splitting each combination of the grouping variablesmutate() or (up next) summarize() or (if you want to get fancy) group_map() then process the data separately by each group## # A tibble: 6 x 4
## # Groups: Person [3]
## Person Year Income Income_Relative_to_Mean
## <chr> <dbl> <dbl> <dbl>
## 1 Ramesh 2014 81314 -420.
## 2 Ramesh 2015 82155 420.
## 3 Whitney 2014 131292 -4985
## 4 Whitney 2015 141262 4985
## 5 David 2014 102452 -1340.
## 6 David 2015 105133 1340.
group_by() it, or ungroup() it, or summarize() it (which removes one of the grouping variables)group_by() helps us move information from one row to another in a key variable - otherwise a difficult move!summarize()n() gives the number of rows in the group - handy! and row_number() gives the row number within its group of that observationsummarize() changes the observation level to a broader levelgroup_by()person_year_data %>%
group_by(Person) %>%
summarize(Mean_Income = mean(Income),
Years_Tracked = n())## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 3
## Person Mean_Income Years_Tracked
## <chr> <dbl> <int>
## 1 David 103792. 2
## 2 Ramesh 81734. 2
## 3 Whitney 136277 2
mutate()s and our summarizes()Common variable types:
tibble(), or is. and then the type, or doing str(data)as. and then the typetaxdata %>%
pivot_wider(names_from = 'TaxFormRow',
values_from = 'Value') %>%
mutate(Person = as.factor(Person),
Income = as.numeric(Income),
Deductible = as.numeric(Deductible),
AGI = as.numeric(AGI))## # A tibble: 1 x 4
## Person Income Deductible AGI
## <fct> <dbl> <dbl> <dbl>
## 1 James Acaster 112341 24000 88341
integer, double, and so ondoubles that are stored in scientific notation - lumping multiple groups together! Avoid this with options like col_types in your data-reading function'' or ""paste0() to stick stuff together! paste0('h','ello', sep = '_') is ‘’h_ello’`1000000 but instead as a literal “1,000,000” with commasfactor() function lets you specify these labels, and also specify the levels they go in - factors can be ordered!tibble(Income = c('50k-100k','Less than 50k', '50k-100k', '100k+', '100k+')) %>%
mutate(Income = factor(Income, levels = c('Less than 50k','50k-100k','100k+'))) %>%
arrange(Income)## # A tibble: 5 x 1
## Income
## <fct>
## 1 Less than 50k
## 2 50k-100k
## 3 50k-100k
## 4 100k+
## 5 100k+
mutate() and case_when() to fix typos/misspellings in the datastr_sub(string, start, end) will do this. str_sub('hello', 2, 4) is 'ell'str_sub('hello', -1) is 'o'tibble(cbg = c(0152371824231, 1031562977281)) %>%
mutate(cbg = as.character(cbg)) %>% # Make it a string to work with
mutate(state_fips = case_when(
nchar(cbg) == 12 ~ str_sub(cbg, 1, 1), # Leading zeroes!
nchar(cbg) == 13 ~ str_sub(cbg, 1, 2)
))## # A tibble: 2 x 2
## cbg state_fips
## <chr> <chr>
## 1 152371824231 1
## 2 1031562977281 10
str_split() will do this. str_split('a,b', ',')[[1]] is c('a','b')separate() from tidyr. Make sure you list enough new into columns to get everything!tibble(category = c('Sales,Marketing','H&R,Marketing')) %>%
separate(category, into = c('Category1', 'Category2'), ',')## # A tibble: 2 x 2
## Category1 Category2
## <chr> <chr>
## 1 Sales Marketing
## 2 H&R Marketing
str_trim() removes beginning/end whitespace, str_squish() removes additional whitespace from the middle too. str_trim(' hi hello ') is 'hi hello'.str_replace_all() is often handy for eliminating (or fixing) unwanted characterstibble(number = c('1,000', '2,003,124')) %>%
mutate(number = number %>% str_replace_all(',', '') %>% as.numeric())## # A tibble: 2 x 1
## number
## <dbl>
## 1 1000
## 2 2003124
str_replace_all(',','') - ',' is a regular expression saying “look for a comma”[0-9] to look for a digit, [a-zA-Z] for letters, * to repeat until you see the next thing… hard to condense here. Read the guide.separate() won’t do it here, not easily!'\\([A-Z].*\\)''\\([A-Z].*\\)' says “look for a (” (note the \\ to treat the usually-special ( character as an actual character), then “Look for a capital letter [A-Z]”, then “keep looking for capital letters .*”, then “look for a )”tibble(name = c('Amazon (AMZN) Holdings','Cargill Corp. (cool place!)')) %>%
mutate(publicly_listed = str_detect(name, '\\([A-Z].*\\)'),
name = str_replace_all(name, '\\([A-Z].*\\)', ''))## # A tibble: 2 x 2
## name publicly_listed
## <chr> <lgl>
## 1 Amazon Holdings TRUE
## 2 Cargill Corp. (cool place!) FALSE
summarize()ing, or when doing things like “calculate growth from an initial value”first(), last(), and lag()first() and last() refer to the first and last row, naturallystockdata <- tibble(ticker = c('AMZN','AMZN', 'AMZN', 'WMT', 'WMT','WMT'),
date = as.Date(rep(c('2020-03-04','2020-03-05','2020-03-06'), 2)),
stock_price = c(103,103.4,107,85.2, 86.3, 85.6))
stockdata %>%
arrange(ticker, date) %>%
group_by(ticker) %>%
mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1)## # A tibble: 6 x 4
## # Groups: ticker [2]
## ticker date stock_price price_growth_since_march_4
## <chr> <date> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 0
## 2 AMZN 2020-03-05 103. 0.00388
## 3 AMZN 2020-03-06 107 0.0388
## 4 WMT 2020-03-04 85.2 0
## 5 WMT 2020-03-05 86.3 0.0129
## 6 WMT 2020-03-06 85.6 0.00469
lag() looks to the row a certain number above/below this one, based on the n argumentdplyr::lag() doesn’t care about time structure, it only cares about data structure. If you want daily growth but the row above is last year, too bad!stockdata %>%
arrange(ticker, date) %>%
group_by(ticker) %>%
mutate(daily_price_growth = stock_price/lag(stock_price, 1) - 1)## # A tibble: 6 x 4
## # Groups: ticker [2]
## ticker date stock_price daily_price_growth
## <chr> <date> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 NA
## 2 AMZN 2020-03-05 103. 0.00388
## 3 AMZN 2020-03-06 107 0.0348
## 4 WMT 2020-03-04 85.2 NA
## 5 WMT 2020-03-05 86.3 0.0129
## 6 WMT 2020-03-06 85.6 -0.00811
first()/last() to get stuff that might not normally be first or last with things like arrange(ticker, -(date == as.Date('2020-03-05')))case_when() to create a new variable that only picks data from the rows you want, then a group_by() and mutate() to spread the data from those rows across the other rows in the grouptibble(person = c('Adam','James','Diego','Beth','Francis','Qian','Ryan','Selma'),
school_grade = c(6,7,7,8,6,7,8,8),
subject = c('Math','Math','English','Science','English','Science','Math','PE'),
test_score = c(80,84,67,87,55,75,85,70)) %>%
mutate(Math_Scores = case_when(subject == 'Math' ~ test_score,
TRUE ~ NA_real_)) %>%
group_by(school_grade) %>%
mutate(Math_Average_In_This_Grade = mean(Math_Scores, na.rm = TRUE)) %>%
select(-Math_Scores)## # A tibble: 8 x 5
## # Groups: school_grade [3]
## person school_grade subject test_score Math_Average_In_This_Grade
## <chr> <dbl> <chr> <dbl> <dbl>
## 1 Adam 6 Math 80 80
## 2 James 7 Math 84 84
## 3 Diego 7 English 67 84
## 4 Beth 8 Science 87 85
## 5 Francis 6 English 55 80
## 6 Qian 7 Science 75 84
## 7 Ryan 8 Math 85 85
## 8 Selma 8 PE 70 85
across(), writing functions, and purrrmutate_at() or mutate_if(). As of dplyr 1.0.0, these have been deprecated in favor of across()across() lets you use all the variable-selection tricks available in select(), like starts_with() or a:z or 1:5, but then lets you apply functions to each of them in mutate() or summarize()rowwise() and c_across() lets you do stuff like “add up a bunch of columns”starts_with('price_growth') is the same here as 4:5 or c(price_growth_since_march_4, price_growth_daily)stockgrowth <- stockdata %>%
arrange(ticker, date) %>%
group_by(ticker) %>%
mutate(price_growth_since_march_4 = stock_price/first(stock_price) - 1,
price_growth_daily = stock_price/lag(stock_price, 1) - 1)
stockgrowth %>%
mutate(across(starts_with('price_growth'), function(x) x*10000)) # Convert to basis points## # A tibble: 6 x 5
## # Groups: ticker [2]
## ticker date stock_price price_growth_since_march_4 price_growth_daily
## <chr> <date> <dbl> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 0 NA
## 2 AMZN 2020-03-05 103. 38.8 38.8
## 3 AMZN 2020-03-06 107 388. 348.
## 4 WMT 2020-03-04 85.2 0 NA
## 5 WMT 2020-03-05 86.3 129. 129.
## 6 WMT 2020-03-06 85.6 46.9 -81.1
.nameslist() of functions instead of just one to do multiple calculations at the same timewhere(is.type) for this## # A tibble: 6 x 4
## ticker date stock_price stock_price_stock_price_pounds
## <chr> <date> <dbl> <dbl>
## 1 AMZN 2020-03-04 103 75.7
## 2 AMZN 2020-03-05 103. 76.0
## 3 AMZN 2020-03-06 107 78.7
## 4 WMT 2020-03-04 85.2 62.6
## 5 WMT 2020-03-05 86.3 63.5
## 6 WMT 2020-03-06 85.6 62.9
rowwise() and c_across()tibble(year = c(1994, 1995, 1996), sales = c(104, 106, 109), marketing = c(100, 200, 174), rnd = c(423,123,111)) %>%
rowwise() %>%
mutate(total_spending = sum(c_across(sales:rnd))) %>%
mutate(across(sales:rnd, function(x) x/total_spending, .names = '{.col}_pct'))## # A tibble: 3 x 8
## # Rowwise:
## year sales marketing rnd total_spending sales_pct marketing_pct rnd_pct
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1994 104 100 423 627 0.166 0.159 0.675
## 2 1995 106 200 123 429 0.247 0.466 0.287
## 3 1996 109 174 111 394 0.277 0.442 0.282
across()function() wrapper, and instead use a bunch of calls to that function in a rowacross() examples I didn’t have to do bps <- function(x) x*10000, I just did function(x) x*10000? That’s an “unnamed function”\(x) instead of function(x)map() functions in purrrmap() usually generates a list(), map_dbl() a numeric vector, map_chr() a character vector, map_df() a tibble()…list, data.frame/tibble (which are technically lists, or vector, and then applies a function to each of the elements## Person Year Income
## "character" "numeric" "numeric"
summary_profile() function you’ve made, and want to check each state’s data to see if its data looks right. You could dofor() loopsave() saves many objects, which are all put back in the environment with load(). Often preferable is saveRDS() which saves a single data.frame() in compressed format, loadable with df <- readRDS()write_csv() makes a CSV. Yay!vtable() in vtable can generate a documentation file for sharing