The problem

Frequently data you receive or create will be in formats that are unsuitable for analysis as they stand and require modification in some way. This process is known as “tidying”, or perhaps “munging”. The goal is generally to get the data into a “tidy” format: one row per case, one column per field, and one cell per value. (You can read more about “tidy” data here.)

One of the most common examples of this issue encountered is data where things that should be in the rows are in the columns or vice versa. It’s generally not a transpose, either; some of the content is right where you want it, but not all of it. In these cases, we need to rearrange the data we’ve received into a form that is easier to deal with. The two commands we want to look at today are gather (move columns into rows) and spread (move rows into columns). We’re going to use the functions in tidyr, part of the tidyverse packages; though, as with many tasks in R, there are a number of other packages with similar functionality you could use instead.

The example

To start with, we’ll need some untidy data. For this, we’ll use some data showing the general reasons people were admitted to hospital by financial year from July 1993 to June 1998.

library(tidyverse)
seps <- read_csv("http://www.mm-c.me/mdsi/hospitals93to98.csv")
head(seps)
## # A tibble: 6 × 8
##                             IcdChapter       Field  FY1993  FY1994  FY1995
##                                  <chr>       <chr>   <dbl>   <dbl>   <dbl>
## 1                      0. Not Reported PatientDays  257965   55582  128507
## 2                      0. Not Reported Separations   37178    6146    3832
## 3 1. Infectious and Parasitic Diseases PatientDays  311221  313386  324693
## 4 1. Infectious and Parasitic Diseases Separations   75857   78323   84631
## 5                         2. Neoplasms PatientDays 1686919 1707437 1795751
## 6                         2. Neoplasms Separations  301928  336447  348905
## # ... with 3 more variables: FY1996 <dbl>, FY1997 <dbl>, FY1998 <dbl>

As you can see, the data isn’t in a great shape for analysis. The years are in columns, which is fine for side-by-side eyeballing, but terrible for charting or similar; and the bed days and separations counts are muddled together. That’s going to make it hard to work with.

The solution

Wide to long

Let’s start by fixing the year issue. To push data that is currently in columns into rows, we need to use the gather() command:

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)

  • data: The dataset to be modified (in our case, seps)
  • key: the name of the new “naming” variable (year)
  • value: the name of the new “result” variable (value)
  • na.rm: whether missing values are removed (this dataset doesn’t have any, so it isn’t a problem)
  • convert: convert anything that seems like it should be in another format to that other format, e.g. numeric to numeric (since we used read_csv we don’t need this one either)

So, to move the years into a column, we’ll run this:

inprogress<-gather(seps,year,value,FY1993:FY1998)
head(inprogress)
## # A tibble: 6 × 4
##                             IcdChapter       Field   year   value
##                                  <chr>       <chr>  <chr>   <dbl>
## 1                      0. Not Reported PatientDays FY1993  257965
## 2                      0. Not Reported Separations FY1993   37178
## 3 1. Infectious and Parasitic Diseases PatientDays FY1993  311221
## 4 1. Infectious and Parasitic Diseases Separations FY1993   75857
## 5                         2. Neoplasms PatientDays FY1993 1686919
## 6                         2. Neoplasms Separations FY1993  301928

Okay, this is starting to look at little more like it should. We still have the bed days and separations merged, but at least we’ve pulled together the years.

Long to wide

Next, let’s pull the values into their own columns. For this, we need spread():

spread(data, key, value, fill = NA, convert = FALSE)

The format of this one is similar to gather():

  • data: The data to be reformatted (inprogress)
  • key: The column you want to split apart (Field)
  • value: The column you want to use to populate the new columns (the value column we just created in the spread step)
  • fill: what to substitute if there are combinations that don’t exist (not a problem here)
  • convert: whether to fix incorrect data types as it goes (not a problem here)

So, the resulting command would look like this:

rearranged <- spread(inprogress,Field,value)
head(rearranged)
## # A tibble: 6 × 4
##        IcdChapter   year PatientDays Separations
##             <chr>  <chr>       <dbl>       <dbl>
## 1 0. Not Reported FY1993      257965       37178
## 2 0. Not Reported FY1994       55582        6146
## 3 0. Not Reported FY1995      128507        3832
## 4 0. Not Reported FY1996      182226        4861
## 5 0. Not Reported FY1997       61599        1558
## 6 0. Not Reported FY1998      685879       53575

That looks much better, doesn’t it?

In One Step

Because this is using tidyverse packages, we can also use the magrittr package’s pipe command (%>%) to do this all in one step if preferred:

seps %>%
  gather(year,value,FY1993:FY1998) %>%
  spread(Field,value)
## # A tibble: 114 × 4
##                              IcdChapter   year PatientDays Separations
## *                                 <chr>  <chr>       <dbl>       <dbl>
## 1                       0. Not Reported FY1993      257965       37178
## 2                       0. Not Reported FY1994       55582        6146
## 3                       0. Not Reported FY1995      128507        3832
## 4                       0. Not Reported FY1996      182226        4861
## 5                       0. Not Reported FY1997       61599        1558
## 6                       0. Not Reported FY1998      685879       53575
## 7  1. Infectious and Parasitic Diseases FY1993      311221       75857
## 8  1. Infectious and Parasitic Diseases FY1994      313386       78323
## 9  1. Infectious and Parasitic Diseases FY1995      324693       84631
## 10 1. Infectious and Parasitic Diseases FY1996      311560       80864
## # ... with 104 more rows

Happy analysis!

Bibliography/Further Reading:

The data used as an example was pulled from the Australian Institute of Health and Welfare Principal Diagnosis Data Cubes. These data cubes can be accessed here: http://www.aihw.gov.au/hospitals-data/principal-diagnosis-data-cubes/

The content itself is based on these sources: