R Tutorial 4: Data Tidying

MKT 410: Marketing Analytics

Author

Levin Zhu

Learning Objectives

In the previous tutorial, we covered data importing. We will now go to the next step of data analysis and talk about data tidying. Our objectives are to discuss:

The definition of tidy data and see it applied to a simple toy dataset
The primary tool used for tidying data, pivoting, which involves either lengthening or widening data
- Lengthening data
- Widening data

Prerequisites

The focus of this tutorial is on the tidyr package, which is included in tidyverse.

library(tidyverse)
#> ── Attaching core tidyverse packages ───────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Tidy Data

You can represent the same underlying data in multiple ways. The three datasets below show the same values for four variables: country, year, population and the number of documented cases of TB (tuberculosis), but each dataset organizes the values in a different way.

table1
#> # A tibble: 6 × 4
#>   country      year  cases population
#>   <chr>       <dbl>  <dbl>      <dbl>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table2
#> # A tibble: 12 × 4
#>   country      year type           count
#>   <chr>       <dbl> <chr>          <dbl>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # ℹ 6 more rows
table3
#> # A tibble: 6 × 3
#>   country      year rate             
#>   <chr>       <dbl> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

Of these three datasets, table1 will be much easier to work with using the tidyverse because it’s tidy. A dataset is tidy if:

Each variable is a column; each column is a variable
Each observation is a row; each row is an observation
Each value is a cell; each cell is a single value

There are two main advantages for making sure your data is tidy:

Consistency: With a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity
Vectorization: Placing variables in columns allows R’s vectorized nature to shine, making transforming tidy data feel particularly natural

Some examples of how you might work with tidy data are shown below (we will go over data transformations and visualizations in the next two tutorials):

# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 × 5
#>   country      year  cases population  rate
#>   <chr>       <dbl>  <dbl>      <dbl> <dbl>
#> 1 Afghanistan  1999    745   19987071 0.373
#> 2 Afghanistan  2000   2666   20595360 1.29 
#> 3 Brazil       1999  37737  172006362 2.19 
#> 4 Brazil       2000  80488  174504898 4.61 
#> 5 China        1999 212258 1272915272 1.67 
#> 6 China        2000 213766 1280428583 1.67

# Compute total cases per year
table1 %>%
  group_by(year) %>%
  summarize(total_cases = sum(cases))
#> # A tibble: 2 × 2
#>    year total_cases
#>   <dbl>       <dbl>
#> 1  1999      250740
#> 2  2000      296920

# Visualize changes over time
ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country, shape = country)) +
  scale_x_continuous(breaks = c(1999, 2000))

Lengthening Data

tidyr provides two functions for pivoting data: pivot_longer() and pivot_wider(). We’ll start with pivot_longer() because it is the more common case.

Data in Column Names

The billboard dataset records the Billboard rank of songs in the year 2000:

billboard
#> # A tibble: 317 × 79
#>   artist       track               date.entered   wk1   wk2   wk3   wk4   wk5
#>   <chr>        <chr>               <date>       <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac        Baby Don't Cry (Ke… 2000-02-26      87    82    72    77    87
#> 2 2Ge+her      The Hardest Part O… 2000-09-02      91    87    92    NA    NA
#> 3 3 Doors Down Kryptonite          2000-04-08      81    70    68    67    66
#> 4 3 Doors Down Loser               2000-10-21      76    76    72    69    67
#> 5 504 Boyz     Wobble Wobble       2000-04-15      57    34    25    17    17
#> 6 98^0         Give Me Just One N… 2000-08-19      51    39    34    26    26
#> # ℹ 311 more rows
#> # ℹ 71 more variables: wk6 <dbl>, wk7 <dbl>, wk8 <dbl>, wk9 <dbl>, …

Each observation (row) is a song. The first three columns (artist, track, date.entered) are variables that describe the song. The next 76 columns (wk1…wk76) describe the rank of the song (in the Billboard Top 100) in each week.

Why is this dataset, as it stands, not tidy?

In order to tidy this data, we’ll use the pivot_longer() function:

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )
#> # A tibble: 24,092 × 5
#>   artist track                   date.entered week   rank
#>   <chr>  <chr>                   <date>       <chr> <dbl>
#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
#> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
#> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
#> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
#> # ℹ 24,086 more rows

There are three key arguments (after the data argument):

cols: specifies which columns need to be pivoted (i.e. which columns aren’t variables), using the same syntax as select()
- These can be specified using a vector of column names c(), !c() for all columns not in the vector, or using one of a series of helper functions, including:
  - starts_with("abc"): matches names that begin with “abc”
  - ends_with("xyz"): matches names that end with “xyz”
  - contains("ijk"): matches names that contain “ijk”
  - num_range("x", 1:3): matches x1, x2, and x3
- In the above example, we could use a variety of different ways to select the columns to pivot: !c(artist, track, date.entered), starts_with("wk"), or num_range("wk", 1:76)
names_to: names the variable stored in the column names (in our example, we named that variable week)
values_to: names the variable stored in the cell values (in our example, we named that variable rank)

Note that "week" and "rank" are quoted because those are new variables we’re creating that don’t yet exist in the data when we run the pivot_longer() call.
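
As a quick check of the point about cols, the sketch below pivots billboard three times, once with each of the selections mentioned above, and compares the results. The object names by_prefix, by_exclusion, and by_range are made up for this illustration.

```r
library(tidyverse)

# Three ways to select the same 76 week columns
by_prefix <- billboard %>%
  pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank")

by_exclusion <- billboard %>%
  pivot_longer(cols = !c(artist, track, date.entered), names_to = "week", values_to = "rank")

by_range <- billboard %>%
  pivot_longer(cols = num_range("wk", 1:76), names_to = "week", values_to = "rank")

# Each comparison returns TRUE when the two selections are equivalent
all.equal(by_prefix, by_exclusion)
all.equal(by_prefix, by_range)
```

Which form to use is mostly a matter of readability: starts_with("wk") states the intent directly, while !c(artist, track, date.entered) is convenient when the columns to pivot share no common prefix.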

Let’s examine the output again. We notice that 2 Pac’s song “Baby Don’t Cry” was only in the top 100 for 7 weeks (the other weeks have missing values). We can ask pivot_longer() to get rid of these missing values by setting values_drop_na = TRUE:

billboard %>%
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  )
#> # A tibble: 5,307 × 5
#>   artist track                   date.entered week   rank
#>   <chr>  <chr>                   <date>       <chr> <dbl>
#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
#> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
#> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
#> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
#> # ℹ 5,301 more rows
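
One way to read values_drop_na = TRUE is as a shortcut for pivoting first and then filtering out the rows with a missing rank. The sketch below compares the two approaches; dropped_in_pivot and dropped_afterwards are illustrative names, not objects used elsewhere in this tutorial.

```r
library(tidyverse)

# Drop missing ranks as part of the pivot
dropped_in_pivot <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

# Pivot first, then remove rows where rank is missing
dropped_afterwards <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  ) %>%
  filter(!is.na(rank))

# Both approaches should leave the same number of rows
nrow(dropped_in_pivot)
nrow(dropped_afterwards)
```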

We might also want to convert the week column to numeric values. We can do this by using mutate() with parse_number(), which extracts the number from each string (e.g. "wk1" becomes 1):

billboard_longer <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank",
    values_drop_na = TRUE
  ) %>%
  mutate(
    week = parse_number(week)
  )

billboard_longer
#> # A tibble: 5,307 × 5
#>   artist track                   date.entered  week  rank
#>   <chr>  <chr>                   <date>       <dbl> <dbl>
#> 1 2 Pac  Baby Don't Cry (Keep... 2000-02-26       1    87
#> 2 2 Pac  Baby Don't Cry (Keep... 2000-02-26       2    82
#> 3 2 Pac  Baby Don't Cry (Keep... 2000-02-26       3    72
#> 4 2 Pac  Baby Don't Cry (Keep... 2000-02-26       4    77
#> 5 2 Pac  Baby Don't Cry (Keep... 2000-02-26       5    87
#> 6 2 Pac  Baby Don't Cry (Keep... 2000-02-26       6    94
#> # ℹ 5,301 more rows
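
As an aside, the conversion can also be done inside the pivot itself. The sketch below uses two further pivot_longer() arguments, names_prefix and names_transform, which are not covered in this tutorial; billboard_longer_alt is an illustrative name. Unlike parse_number(), which returns doubles, this version stores week as an integer.

```r
library(tidyverse)

billboard_longer_alt <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    names_prefix = "wk",                       # strip the "wk" prefix from each column name
    names_transform = list(week = as.integer), # convert the remaining text to an integer
    values_to = "rank",
    values_drop_na = TRUE
  )

billboard_longer_alt
```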

Now that we have week in one variable and rank in another, we can visualize how song ranks vary over time. We’ll return to discussing visualizations in depth in a later tutorial.

billboard_longer %>%
  ggplot(aes(x = week, y = rank, group = track)) + 
  geom_line(alpha = 0.25) + 
  scale_y_reverse()

How Does Pivoting Work?

Now that we’ve seen pivoting in action, let’s get some intuition about what pivoting does to the data. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient.

df <- tibble(
  id = c("A", "B", "C"),
  bp1 = c(100, 140, 120),
  bp2 = c(120, 115, 125)
)
df
#> # A tibble: 3 × 3
#>   id      bp1   bp2
#>   <chr> <dbl> <dbl>
#> 1 A       100   120
#> 2 B       140   115
#> 3 C       120   125

We want three variables in our new tidy (reshaped) dataset:

id: already exists
measurement: currently the column names
value: the cell values

To achieve this, we’ll use pivot_longer() again:

df %>%
  pivot_longer(
    cols = bp1:bp2,
    names_to = "measurement",
    values_to = "value"
  )
#> # A tibble: 6 × 3
#>   id    measurement value
#>   <chr> <chr>       <dbl>
#> 1 A     bp1           100
#> 2 A     bp2           120
#> 3 B     bp1           140
#> 4 B     bp2           115
#> 5 C     bp1           120
#> 6 C     bp2           125

What happened? The values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted.

The column names become values in a new variable, whose name is defined by names_to (which we called measurement), and need to be repeated once for each row in the original dataset.

The cell values also become values in a new variable, with a name defined by values_to (which we called value). They are unwound row by row.
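
The three steps just described can be sketched by hand with base R's rep() and t(). This is only meant to build intuition; it is not how tidyr implements pivot_longer(), and the names pivot_cols, manual, and pivoted are made up for the illustration.

```r
library(tidyverse)

df <- tibble(
  id = c("A", "B", "C"),
  bp1 = c(100, 140, 120),
  bp2 = c(120, 115, 125)
)

pivot_cols <- c("bp1", "bp2")

manual <- tibble(
  # each id is repeated once per pivoted column
  id = rep(df$id, each = length(pivot_cols)),
  # the column names are repeated once per original row
  measurement = rep(pivot_cols, times = nrow(df)),
  # the cell values are read off row by row
  value = as.vector(t(as.matrix(df[pivot_cols])))
)

pivoted <- df %>%
  pivot_longer(cols = bp1:bp2, names_to = "measurement", values_to = "value")

# Returns TRUE when the hand-built table matches the pivoted one
all.equal(manual, pivoted)
```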

Many Variables in Column Names

A more challenging situation occurs when you have multiple pieces of information within each column name. For example, take the who2 dataset, which records information about tuberculosis diagnoses.

who2
#> # A tibble: 7,240 × 58
#>   country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554
#>   <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
#> 1 Afghanistan  1980       NA        NA        NA        NA        NA
#> 2 Afghanistan  1981       NA        NA        NA        NA        NA
#> 3 Afghanistan  1982       NA        NA        NA        NA        NA
#> 4 Afghanistan  1983       NA        NA        NA        NA        NA
#> 5 Afghanistan  1984       NA        NA        NA        NA        NA
#> 6 Afghanistan  1985       NA        NA        NA        NA        NA
#> # ℹ 7,234 more rows
#> # ℹ 51 more variables: sp_m_5564 <dbl>, sp_m_65 <dbl>, sp_f_014 <dbl>, …

After the first two columns, country and year, we are given some very weird looking columns like sp_m_014, ep_m_4554, and rel_m_3544. There is, however, a pattern to these columns. Each part of the column name separated by _ tells us a different piece of information:

The first piece: sp/rel/ep describes the method used for the diagnosis
The second piece: m/f tells us the gender (coded as a binary variable in this dataset)
The third piece: 014/1524/2534/… tells us the age range of the patient

Thus, there are six variables in the dataset: country, year, method of diagnosis, gender, age range, and the count of patients within the specific category. We can use pivot_longer() to better represent the data:

who2 %>%
  pivot_longer(
    cols = !c("country", "year"),
    names_to = c("diagnosis", "gender", "age"),
    names_sep = "_",
    values_to = "count"
  )
#> # A tibble: 405,440 × 6
#>   country      year diagnosis gender age   count
#>   <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
#> 1 Afghanistan  1980 sp        m      014      NA
#> 2 Afghanistan  1980 sp        m      1524     NA
#> 3 Afghanistan  1980 sp        m      2534     NA
#> 4 Afghanistan  1980 sp        m      3544     NA
#> 5 Afghanistan  1980 sp        m      4554     NA
#> 6 Afghanistan  1980 sp        m      5564     NA
#> # ℹ 405,434 more rows

When names_to contains a character vector of length >1, we need to specify names_sep (or alternatively, names_pattern, which is beyond the scope of this tutorial) which tells the function how to split the column names that need to be pivoted. In this case, we want, for example, sp_m_014 to be split using "_" resulting in three separate columns: diagnosis (sp), gender (m), and age (014).

Conceptually, what’s happening is similar to the previous pivot, except the columns themselves are pivoted into multiple columns (instead of a single column).
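
To see the split on something small enough to read in full, here is a sketch on a toy table laid out like a tiny slice of who2. The table who_toy and its counts are invented for this illustration.

```r
library(tidyverse)

# Hypothetical counts; only the column naming pattern mirrors who2
who_toy <- tribble(
  ~country, ~sp_m_014, ~sp_f_014, ~ep_m_1524,
  "X",              1,         2,          3,
  "Y",              4,         5,          6
)

who_toy_long <- who_toy %>%
  pivot_longer(
    cols = !country,
    names_to = c("diagnosis", "gender", "age"), # one new column per piece of the name
    names_sep = "_",                            # split each column name at the underscores
    values_to = "count"
  )

who_toy_long
```

Each of the two input rows produces three output rows, one per pivoted column, and each column name such as sp_m_014 is spread across diagnosis, gender, and age.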

Data and Variable Names in Column Headers

An even more complex dataset is one in which the column names include a mix of variable values and variable names. Take the household dataset, for example:

household
#> # A tibble: 5 × 5
#>   family dob_child1 dob_child2 name_child1 name_child2
#>    <int> <date>     <date>     <chr>       <chr>      
#> 1      1 1998-11-26 2000-01-29 Susan       Jose       
#> 2      2 1996-06-22 NA         Mark        <NA>       
#> 3      3 2002-07-11 2004-04-05 Sam         Seth       
#> 4      4 2004-10-10 2009-08-27 Craig       Khai       
#> 5      5 2000-12-05 2005-02-28 Parker      Gracie

This dataset contains data about five families with the names and dates of birth of up to two children. The challenge here is that the column names contain the names of two variables (dob, name) and the values of another (child, with values of 1 or 2).

What do we want the final tidy dataset to look like?

Each column needs to be a variable: family, child, name, dob
Each row is an observation: a specific child
Each cell is a value (corresponding to the variable)

To solve this problem, we’ll need to use a special value in the names_to argument: ".value". This unique value tells pivot_longer() to override the usual values_to argument and use the first component of the pivoted column name as a variable name in the output.

household %>%
  pivot_longer(
    cols = !family,
    names_to = c(".value", "child"), # ".value" is a placeholder for the first part
                                     # of each of the original column names
    names_sep = "_",
    values_drop_na = TRUE
  )
#> # A tibble: 9 × 4
#>   family child  dob        name 
#>    <int> <chr>  <date>     <chr>
#> 1      1 child1 1998-11-26 Susan
#> 2      1 child2 2000-01-29 Jose 
#> 3      2 child1 1996-06-22 Mark 
#> 4      3 child1 2002-07-11 Sam  
#> 5      3 child2 2004-04-05 Seth 
#> 6      4 child1 2004-10-10 Craig
#> # ℹ 3 more rows

The below figure illustrates the basic idea with a simple example. When you use ".value" in names_to, the column names in the input contribute to both values and variable names in the output.

In the above example, pivoting with names_to = c(".value", "num") splits the column names (x_1, x_2, etc.) into two components:

The first part of x_1, x_2, etc. determines the output column name (x or y)
The second part of x_1, x_2, etc. determines the value of the num column
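
The same idea can be tried in code. The sketch below builds a small table with columns named x_1, x_2, y_1, and y_2 and pivots it with ".value"; the table df_xy and its values are made up, and only the column naming pattern matters.

```r
library(tidyverse)

# Made-up values; the column names follow the <variable>_<num> pattern
df_xy <- tribble(
  ~id, ~x_1, ~x_2, ~y_1, ~y_2,
  "A",    1,    2,    3,    4,
  "B",    5,    6,    7,    8
)

df_xy_long <- df_xy %>%
  pivot_longer(
    cols = !id,
    names_to = c(".value", "num"), # first piece names the output column, second fills num
    names_sep = "_"
  )

df_xy_long
```

The result has one row per id and num, with x and y as separate columns. Note that values_to is not supplied, because the output column names come from the first piece of each input column name.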

Widening Data

Now let’s talk about pivot_wider(), which makes datasets wider by increasing columns and reducing rows. This helps when one observation is spread across multiple rows.

Let’s take a look at cms_patient_experience, a dataset from the Centers for Medicare & Medicaid Services that collects data about patient experiences:

cms_patient_experience
#> # A tibble: 500 × 5
#>   org_pac_id org_nm                     measure_cd   measure_title   prf_rate
#>   <chr>      <chr>                      <chr>        <chr>              <dbl>
#> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1  CAHPS for MIPS…       63
#> 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2  CAHPS for MIPS…       87
#> 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3  CAHPS for MIPS…       86
#> 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5  CAHPS for MIPS…       57
#> 5 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_8  CAHPS for MIPS…       85
#> 6 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_12 CAHPS for MIPS…       24
#> # ℹ 494 more rows

The core unit being studied is an organization, but each organization is spread across multiple rows (up to six), with one row for each measurement taken in the survey.

The complete set of values for measure_cd (measurement code) and measure_title (measurement item) is shown below; there are six in total:

cms_patient_experience %>%
  distinct(measure_cd, measure_title)
#> # A tibble: 6 × 2
#>   measure_cd   measure_title                                                 
#>   <chr>        <chr>                                                         
#> 1 CAHPS_GRP_1  CAHPS for MIPS SSM: Getting Timely Care, Appointments, and In…
#> 2 CAHPS_GRP_2  CAHPS for MIPS SSM: How Well Providers Communicate            
#> 3 CAHPS_GRP_3  CAHPS for MIPS SSM: Patient's Rating of Provider              
#> 4 CAHPS_GRP_5  CAHPS for MIPS SSM: Health Promotion and Education            
#> 5 CAHPS_GRP_8  CAHPS for MIPS SSM: Courteous and Helpful Office Staff        
#> 6 CAHPS_GRP_12 CAHPS for MIPS SSM: Stewardship of Patient Resources

What do we want to do here? Notice that the prf_rate (performance rate) column has values in each row that correspond to a different performance metric, with each group of rows corresponding to a specific organization (identified by org_pac_id, a unique identifier, and org_nm, the organization name). The data would be easier to work with if:

Each column (after the identifier columns) is a different performance measure
Each row is a unique organization
Each value is the score prf_rate for the corresponding performance measure and corresponding organization

With pivot_wider(), we can make each value of a column into a new column by using the names_from argument and specifying the values for the new columns using values_from. Additionally, we will need to specify id_cols that tell the function what the “level” of each observation needs to be in the resulting dataframe (in our case, an organization identifier).

That is, pivot_wider() has the following main arguments:

id_cols: (Optional) the column(s) that uniquely identify each row (by default, this is all the columns that are not given by names_from and values_from)
names_from: the column(s) to get the name of the output column
values_from: the column(s) to get the cell values of each new output column from

cms_patient_experience %>%
  pivot_wider(
    id_cols = starts_with("org"),
    names_from = measure_cd,
    values_from = prf_rate
  )
#> # A tibble: 95 × 8
#>   org_pac_id org_nm           CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5
#>   <chr>      <chr>                  <dbl>       <dbl>       <dbl>       <dbl>
#> 1 0446157747 USC CARE MEDICA…          63          87          86          57
#> 2 0446162697 ASSOCIATION OF …          59          85          83          63
#> 3 0547164295 BEAVER MEDICAL …          49          NA          75          44
#> 4 0749333730 CAPE PHYSICIANS…          67          84          85          65
#> 5 0840104360 ALLIANCE PHYSIC…          66          87          87          64
#> 6 0840109864 REX HOSPITAL INC          73          87          84          67
#> # ℹ 89 more rows
#> # ℹ 2 more variables: CAHPS_GRP_8 <dbl>, CAHPS_GRP_12 <dbl>
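
The same call pattern also tidies table2 from the start of this tutorial, where each country-year observation is spread across two rows (one for cases, one for population). A minimal sketch, with table2_wide as an illustrative name:

```r
library(tidyverse)

table2_wide <- table2 %>%
  pivot_wider(
    id_cols = c(country, year), # one output row per country and year
    names_from = type,          # "cases" and "population" become column names
    values_from = count         # the counts fill the new columns
  )

table2_wide
```

The result has the same layout as table1: one row per country and year, with cases and population in their own columns.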

How Does `pivot_wider()` Work?

Let’s create another dataset with two patients, where we have three measurements for patient A and two measurements for patient B. We’ll use tribble() (a function for row-wise tibble creation) to create this simple dataset:

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We’ll use pivot_wider() to create a dataset with a row for each patient and a column for each type of measurement:

df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
#> # A tibble: 2 × 4
#>   id      bp1   bp2   bp3
#>   <chr> <dbl> <dbl> <dbl>
#> 1 A       100   120   105
#> 2 B       140   115    NA

The first step of pivot_wider() is figuring out what will go in the rows and columns.

The columns are specified in the names_from argument, i.e. the values in the measurement column in the original data:

df %>%
  distinct(measurement) %>%
  pull() # pull() takes the values of a column and represents them as a vector
#> [1] "bp1" "bp2" "bp3"

The rows are, by default, determined by all the variables that aren’t going into the new names or values. These are called the id_cols. In this case, there is just one column, but there can be any number (specified with the id_cols argument):

df %>%
  select(-measurement, -value) %>%
  distinct()
#> # A tibble: 2 × 1
#>   id   
#>   <chr>
#> 1 A    
#> 2 B

pivot_wider() then combines these results to generate an empty data frame:

df %>%
  select(-measurement, -value) %>%
  distinct() %>%
  mutate(bp1 = NA, bp2 = NA, bp3 = NA)
#> # A tibble: 2 × 4
#>   id    bp1   bp2   bp3  
#>   <chr> <lgl> <lgl> <lgl>
#> 1 A     NA    NA    NA   
#> 2 B     NA    NA    NA

It then fills in the cells using the data in the input. Where the input has no matching row (here, patient B has no bp3 measurement), the cell stays NA, as in the example above.
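
The fill-in step can be sketched with an explicit loop. This is only an illustration of the idea, not how tidyr implements pivot_wider(); the object name filled is made up, and the NA columns are created as doubles (NA_real_) so that they can hold the measurements.

```r
library(tidyverse)

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115,
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

# Start from the empty frame: one row per id, one NA column per measurement
filled <- df %>%
  distinct(id) %>%
  mutate(bp1 = NA_real_, bp2 = NA_real_, bp3 = NA_real_)

# Copy each input value into the cell given by its id (row) and measurement (column)
for (i in seq_len(nrow(df))) {
  filled[[df$measurement[i]]][filled$id == df$id[i]] <- df$value[i]
}

filled
```

Patient B has no bp3 row in the input, so that cell is never overwritten and remains NA.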

What happens if there are multiple rows in the input that correspond to one cell in the output? This can happen in real world data, for example when repeated measures are taken. Let’s say there are two rows that correspond to patient “A” and measurement “bp1”:

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140, 
  "B",        "bp2",    115
)

If we use pivot_wider() on this new dataset, we will get a warning:

df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
#> Warning: Values from `value` are not uniquely identified; output will contain
#> list-cols.
#> • Use `values_fn = list` to suppress this warning.
#> • Use `values_fn = {summary_fun}` to summarise duplicates.
#> • Use the following dplyr code to identify duplicates.
#>   {data} |>
#>   dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |>
#>   dplyr::filter(n > 1L)
#> # A tibble: 2 × 3
#>   id    bp1       bp2      
#>   <chr> <list>    <list>   
#> 1 A     <dbl [2]> <dbl [1]>
#> 2 B     <dbl [1]> <dbl [1]>

Following the hint, we can see where the pivot went wrong:

df %>%
  group_by(id, measurement) %>%
  summarize(n = n(), .groups = "drop") %>%
  filter(n > 1)
#> # A tibble: 1 × 3
#>   id    measurement     n
#>   <chr> <chr>       <int>
#> 1 A     bp1             2

As we can see, for patient A and measurement bp1 there are two observations (i.e. a repeated measurement was made). This causes pivot_wider() to return list-columns in order to accommodate multiple values within a single cell.
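
The warning also suggests a way forward: supply a summary function through values_fn. The sketch below averages repeated measurements with mean(); whether averaging is the right way to combine duplicates is an assumption that depends on the data, and df_mean is an illustrative name.

```r
library(tidyverse)

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140,
  "B",        "bp2",    115
)

df_mean <- df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value,
    values_fn = mean # collapse repeated measurements into their average
  )

df_mean
```

With this choice, patient A's two bp1 readings (100 and 102) collapse to a single value of 101, and the output columns are ordinary numeric columns rather than list-columns.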

Summary

In this tutorial, we learned how to ensure the data we view is tidy. We talked about how to reshape data to a tidy format by using either pivot_longer() or pivot_wider().

By now, we have covered the tools to import, tidy, and transform data. In the next tutorial, we’ll talk about the fourth component of data analysis that helps us make sense of our data: data visualization.