library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.2
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
R Tutorial 4: Data Tidying
MKT 410: Marketing Analytics
Learning Objectives
In the previous tutorial, we covered data importing. We will now go to the next step of data analysis and talk about data tidying. Our objectives are to discuss:
- The definition of tidy data and see it applied to a simple toy dataset
- The primary tool used for tidying data, pivoting, which involves either lengthening or widening data
- Lengthening data
- Widening data
Prerequisites
The focus of this tutorial is on the tidyr package, which is included in the tidyverse.
Tidy Data
You can represent the same underlying data in multiple ways. The three datasets below show the same values for four variables: country, year, population, and the number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.
table1
#> # A tibble: 6 × 4
#> country year cases population
#> <chr> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table2
#> # A tibble: 12 × 4
#> country year type count
#> <chr> <dbl> <chr> <dbl>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> # ℹ 6 more rows
table3
#> # A tibble: 6 × 3
#> country year rate
#> <chr> <dbl> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
Of these three datasets, table1 will be much easier to work with using the tidyverse because it’s tidy. A dataset is tidy if:
- Each variable is a column; each column is a variable
- Each observation is a row; each row is an observation
- Each value is a cell; each cell is a single value
There are two main advantages to making sure your data is tidy:
- Consistency: With a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity
- Vectorization: Placing variables in columns allows R’s vectorized nature to shine, making transforming tidy data feel particularly natural
Some examples of how you might work with tidy data are shown below (we will go over data transformations and visualizations in the next two tutorials):
# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 × 5
#> country year cases population rate
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1999 745 19987071 0.373
#> 2 Afghanistan 2000 2666 20595360 1.29
#> 3 Brazil 1999 37737 172006362 2.19
#> 4 Brazil 2000 80488 174504898 4.61
#> 5 China 1999 212258 1272915272 1.67
#> 6 China 2000 213766 1280428583 1.67
# Compute total cases per year
table1 %>%
  group_by(year) %>%
  summarize(total_cases = sum(cases))
#> # A tibble: 2 × 2
#> year total_cases
#> <dbl> <dbl>
#> 1 1999 250740
#> 2 2000 296920
# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000))
Lengthening Data
tidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll start with pivot_longer() because it is the most common case.
Data in Column Names
The billboard dataset records the Billboard rank of songs in the year 2000:
billboard
#> # A tibble: 317 × 79
#> artist track date.entered wk1 wk2 wk3 wk4 wk5
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby Don't Cry (Ke… 2000-02-26 87 82 72 77 87
#> 2 2Ge+her The Hardest Part O… 2000-09-02 91 87 92 NA NA
#> 3 3 Doors Down Kryptonite 2000-04-08 81 70 68 67 66
#> 4 3 Doors Down Loser 2000-10-21 76 76 72 69 67
#> 5 504 Boyz Wobble Wobble 2000-04-15 57 34 25 17 17
#> 6 98^0 Give Me Just One N… 2000-08-19 51 39 34 26 26
#> # ℹ 311 more rows
#> # ℹ 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, …
Each observation (row) is a song. The first three columns (artist, track, date.entered) are variables that describe the song. The next 76 columns (wk1 … wk76) describe the rank of the song (in the Billboard Top 100) in each week.
Why is this dataset, as it stands, not tidy?
In order to tidy this data, we’ll use the pivot_longer() function:
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )
#> # A tibble: 24,092 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#> # ℹ 24,086 more rows
There are three key arguments (after the data argument):
- cols: specifies which columns need to be pivoted (i.e. which columns aren’t variables), using the same syntax as select()
  - These can be specified using a vector of column names c(), !c() for all columns not in the vector, or using one of a series of helper functions, including:
    - starts_with("abc"): matches names that begin with “abc”
    - ends_with("xyz"): matches names that end with “xyz”
    - contains("ijk"): matches names that contain “ijk”
    - num_range("x", 1:3): matches x1, x2, and x3
  - In the above example, we could use a variety of different ways to select the columns to pivot: !c(artist, track, date.entered), starts_with("wk"), or num_range("wk", 1:76)
- names_to: names the variable stored in the column names (in our example, we named that variable week)
- values_to: names the variable stored in the cell values (in our example, we named that variable rank)
Note that "week" and "rank" are quoted because those are new variables we’re creating that don’t yet exist in the data when we run the pivot_longer() call.
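As a quick check of the column-selection options described above, the sketch below pivots billboard three ways and compares the results. The object names (by_prefix, by_negation, by_range) are made up for illustration; billboard ships with tidyr.

```r
library(tidyr)

# Three ways to select the same 76 week columns
by_prefix <- billboard %>%
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank")

by_negation <- billboard %>%
  pivot_longer(cols = !c(artist, track, date.entered), names_to = "week", values_to = "rank")

by_range <- billboard %>%
  pivot_longer(cols = num_range("wk", 1:76), names_to = "week", values_to = "rank")

# Compare the pivoted columns across the three versions
identical(by_prefix$week, by_negation$week)
identical(by_prefix$week, by_range$week)
identical(by_prefix$rank, by_negation$rank)
identical(by_prefix$rank, by_range$rank)
```

Which form to use is mostly a matter of readability: starts_with("wk") is the shortest here, while !c(...) is convenient when the columns to keep are easier to list than the columns to pivot.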
Let’s examine the output again. We notice that 2 Pac’s song “Baby Don’t Cry” was only in the top 100 for 7 weeks (the other weeks have missing values). We can ask pivot_longer() to get rid of these missing values by setting values_drop_na = TRUE:
billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )
#> # A tibble: 5,307 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#> # ℹ 5,301 more rows
We might also want to convert the week column from character strings to numeric values. We can do this by using mutate() with parse_number():
billboard_longer <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  ) %>%
  mutate(
    week = parse_number(week)
  )
billboard_longer
#> # A tibble: 5,307 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <dbl> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
#> # ℹ 5,301 more rows
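To see the conversion step in isolation, here is a small sketch of parse_number() applied to a few week labels, followed by one possible alternative that does the conversion inside pivot_longer() itself. The names_prefix and names_transform arguments are pivot_longer() options not covered in this tutorial, and billboard_int is a name made up for this example.

```r
library(readr)
library(tidyr)

# parse_number() extracts the number from each string and drops the other characters
parse_number(c("wk1", "wk12", "wk76"))

# Alternative sketch: strip the "wk" prefix and convert to integer during the pivot
billboard_int <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",                        # remove the leading "wk"
    names_transform = list(week = as.integer),  # convert what remains to integer
    values_to = "rank",
    values_drop_na = TRUE
  )
class(billboard_int$week)
```

Note that parse_number() returns doubles, which is why week prints as <dbl> above; the alternative is only needed if a true integer column matters for later steps.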
Now that we have week in one variable and rank in another, we can visualize how song ranks vary over time. We’ll return to discussing visualizations in depth in a later tutorial.
billboard_longer %>%
  ggplot(aes(x = week, y = rank, group = track)) +
  geom_line(alpha = 0.25) +
  scale_y_reverse()
How Does Pivoting Work?
Now that we’ve seen pivoting in action, let’s get some intuition about what pivoting does to the data. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient.
df <- tibble(
  id = c("A", "B", "C"),
  bp1 = c(100, 140, 120),
  bp2 = c(120, 115, 125)
)
df
#> # A tibble: 3 × 3
#> id bp1 bp2
#> <chr> <dbl> <dbl>
#> 1 A 100 120
#> 2 B 140 115
#> 3 C 120 125
We want three variables in our new tidy (reshaped) dataset:
- id: already exists
- measurement: currently the column names
- value: the cell values
To achieve this, we’ll use pivot_longer() again:
df %>%
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
#> # A tibble: 6 × 3
#> id measurement value
#> <chr> <chr> <dbl>
#> 1 A bp1 100
#> 2 A bp2 120
#> 3 B bp1 140
#> 4 B bp2 115
#> 5 C bp1 120
#> 6 C bp2 125
What happened? The values in a column that was already a variable in the original dataset (id) needed to be repeated, once for each column that is pivoted.
The column names become values in a new variable, whose name is defined by names_to (which we called measurement), and need to be repeated once for each row in the original dataset.
The cell values also become values in a new variable, with a name defined by values_to (which we called value). They are unwound row by row.
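The three steps just described can be sketched by hand with rep() and a transpose. This is only an illustration of the idea, not how pivot_longer() is implemented internally; pivot_cols, manual, and pivoted are names made up for this example.

```r
library(tibble)
library(tidyr)

df <- tibble(
  id = c("A", "B", "C"),
  bp1 = c(100, 140, 120),
  bp2 = c(120, 115, 125)
)
pivot_cols <- c("bp1", "bp2")

manual <- tibble(
  # each id is repeated once per pivoted column
  id = rep(df$id, each = length(pivot_cols)),
  # the column names are repeated once per original row
  measurement = rep(pivot_cols, times = nrow(df)),
  # the cell values are unwound row by row (transpose, then read down the columns)
  value = as.vector(t(as.matrix(df[pivot_cols])))
)

pivoted <- df %>%
  pivot_longer(cols = bp1:bp2, names_to = "measurement", values_to = "value")

# Check that the hand-built version matches pivot_longer() column by column
identical(manual$id, pivoted$id)
identical(manual$measurement, pivoted$measurement)
identical(manual$value, pivoted$value)
```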
Many Variables in Column Names
A more challenging situation occurs when you have multiple pieces of information within each column name. For example, take the who2 dataset, which records information about tuberculosis diagnoses.
who2
#> # A tibble: 7,240 × 58
#> country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Afghanistan 1980 NA NA NA NA NA
#> 2 Afghanistan 1981 NA NA NA NA NA
#> 3 Afghanistan 1982 NA NA NA NA NA
#> 4 Afghanistan 1983 NA NA NA NA NA
#> 5 Afghanistan 1984 NA NA NA NA NA
#> 6 Afghanistan 1985 NA NA NA NA NA
#> # ℹ 7,234 more rows
#> # ℹ 51 more variables: sp_m_5564 <dbl>, sp_m_65 <dbl>, sp_f_014 <dbl>, …
After the first two columns, country and year, we are given some very weird looking columns like sp_m_014, ep_m_4554, and rel_m_3544. There is, however, a pattern to these columns. Each part of the column name separated by _ tells us a different piece of information:
- The first piece: sp/rel/ep describes the method used for the diagnosis
- The second piece: m/f tells us the gender (coded as a binary variable in this dataset)
- The third piece: 014/1524/2534/… tells us the age range of the patient
Thus, there are six variables in the dataset: country, year, method of diagnosis, gender, age range, and the count of patients within the specific category. We can use pivot_longer() to better represent the data:
who2 %>%
  pivot_longer(
    cols = !c("country", "year"),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
#> # A tibble: 405,440 × 6
#> country year diagnosis gender age count
#> <chr> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Afghanistan 1980 sp m 014 NA
#> 2 Afghanistan 1980 sp m 1524 NA
#> 3 Afghanistan 1980 sp m 2534 NA
#> 4 Afghanistan 1980 sp m 3544 NA
#> 5 Afghanistan 1980 sp m 4554 NA
#> 6 Afghanistan 1980 sp m 5564 NA
#> # ℹ 405,434 more rows
When names_to contains a character vector of length >1, we need to specify names_sep (or alternatively, names_pattern, which is beyond the scope of this tutorial), which tells the function how to split the column names that need to be pivoted. In this case, we want, for example, sp_m_014 to be split using "_", resulting in three separate columns: diagnosis (sp), gender (m), and age (014).
Conceptually, what’s happening is similar to the previous pivot, except the columns themselves are pivoted into multiple columns (instead of a single column).
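A toy version makes the split easier to follow. The sketch below uses a small hypothetical tibble (toy, with made-up countries and counts) whose column names follow the same diagnosis_gender_age pattern as who2.

```r
library(tibble)
library(tidyr)

# Hypothetical data: two countries, two columns that each encode three variables
toy <- tibble(
  country = c("X", "Y"),
  sp_m_014 = c(1, 2),
  ep_f_1524 = c(3, 4)
)

toy_long <- toy %>%
  pivot_longer(
    cols = !country,
    names_to = c("diagnosis", "gender", "age"),  # one new column per piece of the name
    names_sep = "_",                             # split each column name at "_"
    values_to = "count"
  )
toy_long
```

Each original column name contributes one value to each of diagnosis, gender, and age, so the two pivoted columns become three name-derived columns plus count.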
Data and Variable Names in Column Headers
An even more complex dataset is one in which the column names include a mix of variable values and variable names. Take the household dataset, for example:
household
#> # A tibble: 5 × 5
#> family dob_child1 dob_child2 name_child1 name_child2
#> <int> <date> <date> <chr> <chr>
#> 1 1 1998-11-26 2000-01-29 Susan Jose
#> 2 2 1996-06-22 NA Mark <NA>
#> 3 3 2002-07-11 2004-04-05 Sam Seth
#> 4 4 2004-10-10 2009-08-27 Craig Khai
#> 5 5 2000-12-05 2005-02-28 Parker Gracie
This dataset contains data about five families with the names and dates of birth of up to two children. The challenge here is that the column names contain the names of two variables (dob, name) and the values of another (child, with values of 1 or 2).
What do we want the final tidy dataset to look like?
- Each column needs to be a variable: family, child, name, dob
- Each row is an observation: a specific child
- Each cell is a value (corresponding to the variable)
To solve this problem, we’ll need to use a special value in the names_to argument: ".value". This unique value tells pivot_longer() to override the usual values_to argument and use the first component of the pivoted column name as a variable name in the output.
household %>%
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"), # ".value" is a placeholder for the first part
                                     # of each of the original column names
    names_sep = "_",
    values_drop_na = TRUE
  )
#> # A tibble: 9 × 4
#> family child dob name
#> <int> <chr> <date> <chr>
#> 1 1 child1 1998-11-26 Susan
#> 2 1 child2 2000-01-29 Jose
#> 3 2 child1 1996-06-22 Mark
#> 4 3 child1 2002-07-11 Sam
#> 5 3 child2 2004-04-05 Seth
#> 6 4 child1 2004-10-10 Craig
#> # ℹ 3 more rows
The below figure illustrates the basic idea with a simple example. When you use ".value" in names_to, the column names in the input contribute to both values and variable names in the output.
In the above example, pivoting with names_to = c(".value", "num") splits the column names (x_1, x_2, etc.) into two components:
- The first part of x_1, x_2, etc. determines the output column name (x or y)
- The second part of x_1, x_2, etc. determines the value of the num column
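The same idea can be reproduced in code. The sketch below builds a small tibble with columns x_1, x_2, y_1, and y_2; the name df_xy and the cell values are invented for illustration and are not taken from the figure.

```r
library(tibble)
library(tidyr)

# Hypothetical input: column names mix a variable name (x or y) and a value (1 or 2)
df_xy <- tibble(
  id = c("A", "B"),
  x_1 = c(1, 5),
  x_2 = c(2, 6),
  y_1 = c(3, 7),
  y_2 = c(4, 8)
)

df_xy_long <- df_xy %>%
  pivot_longer(
    cols = !id,
    names_to = c(".value", "num"),  # first part becomes a column name, second part a value
    names_sep = "_"
  )
df_xy_long
```

The result has one row per id and num combination, with x and y as ordinary columns, which is why no values_to argument is needed here.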
Widening Data
Now let’s talk about pivot_wider(), which makes datasets wider by increasing columns and reducing rows. This helps when one observation is spread across multiple rows.
Let’s take a look at cms_patient_experience, a dataset from the Centers for Medicare & Medicaid Services that collects data about patient experiences:
cms_patient_experience
#> # A tibble: 500 × 5
#> org_pac_id org_nm measure_cd measure_title prf_rate
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 CAHPS for MIPS… 63
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 CAHPS for MIPS… 87
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 CAHPS for MIPS… 86
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 CAHPS for MIPS… 57
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8 CAHPS for MIPS… 85
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS… 24
#> # ℹ 494 more rows
The core unit being studied is an organization, but each organization is spread across six rows, with one row for each measurement taken in the survey organization.
The complete set of values for measure_cd (measurement code) and measure_title (measurement item) is shown below (there are six in total):
cms_patient_experience %>%
  distinct(measure_cd, measure_title)
#> # A tibble: 6 × 2
#> measure_cd measure_title
#> <chr> <chr>
#> 1 CAHPS_GRP_1 CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
#> 2 CAHPS_GRP_2 CAHPS for MIPS SSM: How Well Providers Communicate
#> 3 CAHPS_GRP_3 CAHPS for MIPS SSM: Patient's Rating of Provider
#> 4 CAHPS_GRP_5 CAHPS for MIPS SSM: Health Promotion and Education
#> 5 CAHPS_GRP_8 CAHPS for MIPS SSM: Courteous and Helpful Office Staff
#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources
What do we want to do here? Notice that the prf_rate (performance rate) column has values in each row that correspond to a different performance metric, with every group of six rows corresponding to a specific organization (in org_pac_id, a unique identifier, and org_nm, the organization name). A better way to view the data is if we had:
- Each column (after the identifier columns) is a different performance measure
- Each row is a unique organization
- Each value is the score prf_rate for the corresponding performance measure and corresponding organization
With pivot_wider(), we can make each value of a column into a new column by using the names_from argument and specifying the values for the new columns using values_from. Additionally, we will need to specify id_cols, which tells the function what the “level” of each observation needs to be in the resulting data frame (in our case, an organization identifier).
That is, pivot_wider() has the following main arguments:
- id_cols: (Optional) the column(s) that uniquely identify each row (by default, this is all the columns that are not given by names_from and values_from)
- names_from: the column(s) to get the names of the output columns from
- values_from: the column(s) to get the cell values of each new output column from
cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
#> # A tibble: 95 × 8
#> org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 0446157747 USC CARE MEDICA… 63 87 86 57
#> 2 0446162697 ASSOCIATION OF … 59 85 83 63
#> 3 0547164295 BEAVER MEDICAL … 49 NA 75 44
#> 4 0749333730 CAPE PHYSICIANS… 67 84 85 65
#> 5 0840104360 ALLIANCE PHYSIC… 66 87 87 64
#> 6 0840109864 REX HOSPITAL INC 73 87 84 67
#> # ℹ 89 more rows
#> # ℹ 2 more variables: CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>
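To see why id_cols matters in this example, the sketch below runs the same pivot with and without it and compares the shapes. The object names wide_default and wide_by_org are made up for illustration.

```r
library(tidyr)

# Without id_cols, every column not used in names_from/values_from identifies a row,
# which here includes measure_title
wide_default <- cms_patient_experience %>%
  pivot_wider(
    names_from = measure_cd,
    values_from = prf_rate
  )

# With id_cols, only the organization columns identify a row
wide_by_org <- cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )

dim(wide_default)
dim(wide_by_org)
"measure_title" %in% names(wide_default)
```

Because measure_title varies within an organization, leaving it among the identifier columns keeps several rows per organization, so the default output is not yet one row per organization.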
How Does pivot_wider() Work?
Let’s create another dataset with two patients, where we have three measurements for patient A and two measurements for patient B. We’ll use tribble() (a function for row-wise tibble creation) to create this simple dataset:
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)
We’ll use pivot_wider() to create a dataset with a row for each patient and a column for each type of measurement:
df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
#> # A tibble: 2 × 4
#> id bp1 bp2 bp3
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 100 120 105
#> 2 B 140 115 NA
The first step of pivot_wider() is figuring out what will go in the rows and columns.
The columns are specified in the names_from argument: i.e. the values in the measurement column in the original data:
df %>%
  distinct(measurement) %>%
  pull() # pull() takes the values of a column and represents them as a vector
#> [1] "bp1" "bp2" "bp3"
The rows are, by default, determined by all the variables that aren’t going into the new names or values. These are called the id_cols. In this case, there is just one column, but there can be any number (and they are specified in the id_cols argument):
df %>%
  select(-measurement, -value) %>%
  distinct()
#> # A tibble: 2 × 1
#>   id
#>   <chr>
#> 1 A
#> 2 B
pivot_wider() then combines these results to generate an empty data frame:
df %>%
  select(-measurement, -value) %>%
  distinct() %>%
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
#> # A tibble: 2 × 4
#> id bp1 bp2 bp3
#> <chr> <lgl> <lgl> <lgl>
#> 1 A NA NA NA
#> 2 B NA NA NA
It then fills in all the missing values using the data in the input. When the input has no row for a given cell, that cell stays NA: in the example above, patient B has no bp3 measurement, so bp3 is NA for B.
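The fill step can be imitated by hand to make the logic concrete. This is a sketch of the idea only, not the actual implementation of pivot_wider(); ids, measures, wide, and rows are names made up for this example.

```r
library(tibble)

df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "B", "bp1", 140,
  "B", "bp2", 115,
  "A", "bp2", 120,
  "A", "bp3", 105
)

ids <- unique(df$id)               # one output row per id
measures <- unique(df$measurement) # one output column per measurement

wide <- tibble(id = ids)
for (m in measures) {
  rows <- df[df$measurement == m, ]
  # look up each id among the rows for this measurement; ids with no match get NA
  wide[[m]] <- rows$value[match(ids, rows$id)]
}
wide
```

The match() lookup is what leaves bp3 as NA for patient B: there is simply no input row to copy a value from.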
What happens if there are multiple rows in the input that correspond to one cell in the output? This can happen in real world data, for example when repeated measures are taken. Let’s say there are two rows that correspond to patient “A” and measurement “bp1”:
df <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "A", "bp1", 102,
  "A", "bp2", 120,
  "B", "bp1", 140,
  "B", "bp2", 115
)
If we use pivot_wider() on this new dataset, we will get a warning:
df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
#> Warning: Values from `value` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#> {data} |>
#> dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |>
#> dplyr::filter(n > 1L)
#> # A tibble: 2 × 3
#> id bp1 bp2
#> <chr> <list> <list>
#> 1 A <dbl [2]> <dbl [1]>
#> 2 B <dbl [1]> <dbl [1]>
Following the hint, we can see where the pivot went wrong:
df %>%
  group_by(id, measurement) %>%
  summarize(n = n(), .groups = "drop") %>%
  filter(n > 1)
#> # A tibble: 1 × 3
#> id measurement n
#> <chr> <chr> <int>
#> 1 A bp1 2
As we can see, for a specific patient id and measure measurement, there are two total observations (i.e. a repeated measurement was made). This causes pivot_wider() to give us list-type columns in order to accommodate multiple values within each cell.
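The warning itself suggests one way forward: pass a summary function through values_fn. The sketch below uses mean() to collapse the repeated measurement; whether averaging is appropriate is an assumption that depends on the analysis, and df_dup and wide_mean are names made up for this example.

```r
library(tibble)
library(tidyr)

# Same data as above: patient A has two bp1 readings
df_dup <- tribble(
  ~id, ~measurement, ~value,
  "A", "bp1", 100,
  "A", "bp1", 102,
  "A", "bp2", 120,
  "B", "bp1", 140,
  "B", "bp2", 115
)

wide_mean <- df_dup %>%
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean  # summarise duplicates so each cell holds a single value
  )
wide_mean
```

With a summary function supplied, the output columns are ordinary numeric columns rather than list-columns. Other choices (for example max, or keeping only the first reading) would be equally valid sketches; the right one depends on what the repeated measurements mean.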
Summary
In this tutorial, we learned how to ensure the data we view is tidy. We talked about how to reshape data to a tidy format by using either pivot_longer() or pivot_wider().
By now, we have covered the tools to import, tidy, and transform data. In the next tutorial, we’ll talk about the fourth component of data analysis that helps us make sense of our data: data visualization.