1 Learning Objectives

  • You will understand what a wide data format is, and what a long data format is

  • You will know how to pivot long data to wide data using pivot_long()

  • You will know how to pivot wide data to long data using pivot_wider()

  • You have the intuition why the long data format is easier for plotting and wrangling

2 Packages


3 What do wide and long mean?

  • Understand by examples

  • Example 1: blood pressure data from three patients on three days.

    WIDE format:

Fig: wide dataset for a timeseries of patients.
Fig: wide dataset for a timeseries of patients.

LONG format:

Fig: long dataset for a timeseries of patients.
Fig: long dataset for a timeseries of patients.

  • Example 2: observational units are countries

LONG format:

Fig: long dataset where the unique observation unit is a country.
Fig: long dataset where the unique observation unit is a country.

WIDE format:

Fig: the equivalent wide dataset
Fig: the equivalent wide dataset

  • Examples above = time-series datasets

  • Example 3: the number of patients in different units of three hospitals

WIDE format:

Fig: wide dataset, where each hospital is an observational unit
Fig: wide dataset, where each hospital is an observational unit

LONG format:

Fig: the equivalent long dataset
Fig: the equivalent long dataset

“observational units” = “statistical units” = primary entities described by the columns/variables

For example:

  • a patient

  • a country

  • a hospital


Consider the mock dataset created below:

temperatures <- 
  data.frame(
    country = c("Sweden", "Denmark", "Norway"),
    avgtemp.1994 = 1:3,
    avgtemp.1995 = 3:5,
    avgtemp.1996 = 5:7)
temperatures
##   country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1  Sweden            1            3            5
## 2 Denmark            2            4            6
## 3  Norway            3            5            7

Is this data in a wide or long format?

# Enter the string wide or the string long
# Assign your answer to the object Q_data_type
# Then run the CHECK function below
Q_data_type <- "WIDE"

4 When should you use wide vs long data?

  • Depends on what you want to do!

  • wide format = great for displaying data = easy to visually compare values

  • long format = best for data analysis tasks = like grouping & plotting

  • know how to switch from one format to the other

  • switching from the wide to the long format = pivoting


5 Pivoting wide to long

  • Data from Gapminder on the number of infant deaths in specific countries over several years.

Gapminder is a good source of clean, rich health-relevant datasets.

infant_deaths_wide <- read_csv(here("data/gapminder_infant_deaths.csv"))
infant_deaths_wide
## # A tibble: 193 x 7
##    country              x2010 x2011 x2012 x2013 x2014 x2015
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Afghanistan          74600 72000 69500 67100 64800 62700
##  2 Angola               79100 76400 73700 71200 69000 67200
##  3 Albania                420   384   354   331   313   301
##  4 United Arab Emirates   683   687   686   681   672   658
##  5 Argentina             9550  9230  8860  8480  8100  7720
##  6 Armenia                723   686   649   614   579   545
##  7 Antigua and Barbuda     12    11    11    10    10    10
##  8 Australia             1200  1160  1110  1080  1050  1030
##  9 Austria                280   273   266   260   255   253
## 10 Azerbaijan            5210  5070  4890  4670  4420  4140
## # i 183 more rows

  • observational unit = 1 country = 1 row

  • repeated measurements spread out across multiple columns

  • a wide format


  • convert to a long format = pivot_longer.

  • cols argument = which columns we want to pivot

infant_deaths_wide %>%  pivot_longer(cols = x2010:x2015)
## # A tibble: 1,158 x 3
##    country     name  value
##    <chr>       <chr> <dbl>
##  1 Afghanistan x2010 74600
##  2 Afghanistan x2011 72000
##  3 Afghanistan x2012 69500
##  4 Afghanistan x2013 67100
##  5 Afghanistan x2014 64800
##  6 Afghanistan x2015 62700
##  7 Angola      x2010 79100
##  8 Angola      x2011 76400
##  9 Angola      x2012 73700
## 10 Angola      x2013 71200
## # i 1,148 more rows
  # we pivot longer the columns x2010 to x2015

  • each country = 5 rows = one row per year

  • years are in the variable names

  • death count values are in the variable values

  • infant deaths values used to be in matrix format (2 dimensions; 2D)

  • now in a vector format (1 dimension; 1D).


  • default names of the variables, names and values = unsatisfactory

  • do not describe what the variables contain

  • custom column names with names_to & values_to

infant_deaths_wide %>% 
  # we will set our names to "year" and the values to "death_counts"
  pivot_longer(cols = x2010:x2015,
               names_to = "year", 
               values_to ="deaths_count" )
## # A tibble: 1,158 x 3
##    country     year  deaths_count
##    <chr>       <chr>        <dbl>
##  1 Afghanistan x2010        74600
##  2 Afghanistan x2011        72000
##  3 Afghanistan x2012        69500
##  4 Afghanistan x2013        67100
##  5 Afghanistan x2014        64800
##  6 Afghanistan x2015        62700
##  7 Angola      x2010        79100
##  8 Angola      x2011        76400
##  9 Angola      x2012        73700
## 10 Angola      x2013        71200
## # i 1,148 more rows

Long format = more informative than the original wide format.

Why?

In the wide format, no idea what the 2D matrix of death counts represents

In the long format, the variable has a name.


  • Side-note :

  • Remove the x in front of each year.

infant_deaths_wide %>% 
  pivot_longer(cols = x2010:x2015,
               names_to = "year", 
               values_to = "deaths_count") %>% mutate(year=parse_number(year))
## # A tibble: 1,158 x 3
##    country      year deaths_count
##    <chr>       <dbl>        <dbl>
##  1 Afghanistan  2010        74600
##  2 Afghanistan  2011        72000
##  3 Afghanistan  2012        69500
##  4 Afghanistan  2013        67100
##  5 Afghanistan  2014        64800
##  6 Afghanistan  2015        62700
##  7 Angola       2010        79100
##  8 Angola       2011        76400
##  9 Angola       2012        73700
## 10 Angola       2013        71200
## # i 1,148 more rows
  # we use parse_number inside of mutate

Let’s store the data:

infant_deaths_long <- 
  infant_deaths_wide %>% 
  pivot_longer(cols = x2010:x2015,
               names_to = "year", 
               values_to = "deaths_count")

For this practice question, you will use the euro_births_wide dataset from Eurostat. It shows the annual number of births in 50 European countries:

euro_births_wide <- 
  read_csv(here("data/euro_births_wide.csv"))
head(euro_births_wide)
## # A tibble: 6 x 8
##   country   x2015  x2016  x2017  x2018  x2019  x2020  x2021
##   <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Belgium  122274 121896 119690 118319 117695 114350 118349
## 2 Bulgaria  65950  64984  63955  62197  61538  59086  58678
## 3 Czechia  110764 112663 114405 114036 112231 110200 111793
## 4 Denmark   58205  61614  61397  61476  61167  60937  63473
## 5 Germany  737575 792141 784901 787523 778090 773144 795492
## 6 Estonia   13907  14053  13784  14367  14099  13209  13272

The data is in a wide format. Convert it to a long format, with the following column names: “country”, “year” and “births_count”

Q_euro_births_long <- 
  euro_births_wide %>% # complete the code with your answer
  pivot_longer(cols = x2015:x2021, names_to = "year", values_to = "births_count")

6 Pivoting long to wide

  • pivot the other way, from long to wide?

  • Let’s first consider where you’re likely to run into long vs wide data:

    • Wide data tend to come from external sources

    • Long data is likely to be created by you while data wrangling.

  • Let’s see an example of this now.


  • Let’s see an example of this now.

  • Datase of patient records from an Ebola outbreak in Sierra Leone in 2014

  • We extract this data from the {outbreaks} package and perform some simplifying manipulations on it.

ebola <- 
  outbreaks::ebola_sierraleone_2014 %>% 
  as_tibble() %>% mutate(year=lubridate::year(date_of_onset)) %>% 
  select(patient_id=id, district,year_of_onset=year)
  ebola
## # A tibble: 11,903 x 3
##    patient_id district year_of_onset
##         <int> <fct>            <dbl>
##  1          1 Kailahun          2014
##  2          2 Kailahun          2014
##  3          3 Kailahun          2014
##  4          4 Kailahun          2014
##  5          5 Kailahun          2014
##  6          6 Kailahun          2014
##  7          7 Kailahun          2014
##  8          8 Kailahun          2014
##  9          9 Kailahun          2014
## 10         10 Kailahun          2014
## # i 11,893 more rows

Now, consider the following grouped summary of the ebola dataset, which counts the number of patients recorded in each district in each year:

cases_per_district_per_year <- 
  ebola %>% 
  group_by(district) %>% 
  count(year_of_onset) %>% 
  ungroup()

cases_per_district_per_year
## # A tibble: 28 x 3
##    district year_of_onset     n
##    <fct>            <dbl> <int>
##  1 Bo                2014   397
##  2 Bo                2015   209
##  3 Bombali           2014  1070
##  4 Bombali           2015   120
##  5 Bonthe            2014     7
##  6 Bonthe            2015    77
##  7 Kailahun          2014   535
##  8 Kailahun          2015    35
##  9 Kambia            2014   127
## 10 Kambia            2015   294
## # i 18 more rows
  • This is a quintessentially “long” dataset! Each observational unit (each district) occupies multiple rows, with one row for each measurement (each year).

  • Now, let’s see how to convert such long data into a wide format with pivot_wider().

  • Within this function, there are two important arguments:

    • values_from and names_from
cases_per_district_per_year %>% 
  pivot_wider(values_from=n, 
              names_from = year_of_onset)
## # A tibble: 14 x 3
##    district      `2014` `2015`
##    <fct>          <int>  <int>
##  1 Bo               397    209
##  2 Bombali         1070    120
##  3 Bonthe             7     77
##  4 Kailahun         535     35
##  5 Kambia           127    294
##  6 Kenema           641    139
##  7 Koinadugu        142     15
##  8 Kono             328    223
##  9 Moyamba          258    191
## 10 Port Loko       1119    582
## 11 Pujehun           45     79
## 12 Tonkolili        541     42
## 13 Western Rural   1181    341
## 14 Western Urban   2068   1097

The population dataset from the tidyr package shows the populations of 219 countries over time.

Pivot this data into a wide format. Your answer should have 20 columns and 219 rows.

Q_population_widen <- 
  tidyr::population %>% pivot_wider(names_from = year, values_from = population)

7 Why is long data better for analysis?

  • Long data is best for a majority of data analysis tasks

7.1 Filtering grouped data

  • Example using the infant deaths dataset. Imagine that we want to answer the following question:

    • For each country, which year had the highest number of child deaths?

This is how we would do so with the long format of the data :

infant_deaths_long %>% 
  group_by(country) %>% 
  filter(deaths_count == max(deaths_count))
## # A tibble: 240 x 3
## # Groups:   country [193]
##    country              year  deaths_count
##    <chr>                <chr>        <dbl>
##  1 Afghanistan          x2010        74600
##  2 Angola               x2010        79100
##  3 Albania              x2010          420
##  4 United Arab Emirates x2011          687
##  5 Argentina            x2010         9550
##  6 Armenia              x2010          723
##  7 Antigua and Barbuda  x2010           12
##  8 Australia            x2010         1200
##  9 Austria              x2010          280
## 10 Azerbaijan           x2010         5210
## # i 230 more rows

Easy right? We can easily see, for example, that Afghanistan had its highest infant death count in 2010, and the United Arab Emirates had its highest death count in 2011.


  • Answering the same question with wide data:

  • You could try an approach like this with rowise():

infant_deaths_wide %>% 
  rowwise() %>% 
  mutate(max_count = max(x2010:x2015))
## # A tibble: 193 x 8
## # Rowwise: 
##    country              x2010 x2011 x2012 x2013 x2014 x2015 max_count
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <int>
##  1 Afghanistan          74600 72000 69500 67100 64800 62700     74600
##  2 Angola               79100 76400 73700 71200 69000 67200     79100
##  3 Albania                420   384   354   331   313   301       420
##  4 United Arab Emirates   683   687   686   681   672   658       683
##  5 Argentina             9550  9230  8860  8480  8100  7720      9550
##  6 Armenia                723   686   649   614   579   545       723
##  7 Antigua and Barbuda     12    11    11    10    10    10        12
##  8 Australia             1200  1160  1110  1080  1050  1030      1200
##  9 Austria                280   273   266   260   255   253       280
## 10 Azerbaijan            5210  5070  4890  4670  4420  4140      5210
## # i 183 more rows

This works, but we still don’t know which year is attached to that value in max_count. We would have to take that value and index it back to its respective year column… what a hassle! There are solutions to find this but all are very painful. Why make your life complicated when you can just pivot to long format and use the beauty of group_by() and filter()?

Here we used a special {dplyr} function: rowwise(). rowwise() is a function which allows further operations to be applied on the rows rather than on the columns. It is equivalent to creating one group for each row (group_by(row_number())).

Without rowwise() you would get this :

infant_deaths_wide %>% 
  mutate(max_count = max(x2010, x2011, x2012, x2013, x2014, x2015))
## # A tibble: 193 x 8
##    country              x2010 x2011 x2012 x2013 x2014 x2015 max_count
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
##  1 Afghanistan          74600 72000 69500 67100 64800 62700   1170000
##  2 Angola               79100 76400 73700 71200 69000 67200   1170000
##  3 Albania                420   384   354   331   313   301   1170000
##  4 United Arab Emirates   683   687   686   681   672   658   1170000
##  5 Argentina             9550  9230  8860  8480  8100  7720   1170000
##  6 Armenia                723   686   649   614   579   545   1170000
##  7 Antigua and Barbuda     12    11    11    10    10    10   1170000
##  8 Australia             1200  1160  1110  1080  1050  1030   1170000
##  9 Austria                280   273   266   260   255   253   1170000
## 10 Azerbaijan            5210  5070  4890  4670  4420  4140   1170000
## # i 183 more rows

The maximum count over all columns.

For this practice question, you will perform a grouped filter on the long format population dataset from the tidyr package. Use group_by() and filter() to obtain a dataset that shows the maximum population recorded for each country, and the year in which that maximum population was recorded.

Q_population_max <- 
  population %>% 
  group_by(country) %>%
  filter(population == max(population)) %>%
  ungroup()

7.2 Summarizing grouped data

  • Grouped summaries are also difficult to perform on wide data.

  • Example: Using the infant_deaths_long dataset once again, if you want to ask:

  • For each country, what was the mean number of infant deaths and the standard deviation (variation) in deaths ?

  • With long data it is simple:

infant_deaths_long %>% 
  group_by(country) %>% 
  summarize(mean_deaths = mean(deaths_count), 
            sd_deaths = sd(deaths_count))
## # A tibble: 193 x 3
##    country             mean_deaths sd_deaths
##    <chr>                     <dbl>     <dbl>
##  1 Afghanistan             68450    4466.   
##  2 Albania                   350.     45.2  
##  3 Algeria                 21033.    484.   
##  4 Angola                  72767.   4513.   
##  5 Antigua and Barbuda        10.7     0.816
##  6 Argentina                8657.    691.   
##  7 Armenia                   633.     66.6  
##  8 Australia                1105      65.3  
##  9 Austria                   264.     10.6  
## 10 Azerbaijan               4733.    405.   
## # i 183 more rows

With wide data, on the other hand, finding the mean is less intuitive…

infant_deaths_wide %>% 
  rowwise() %>% 
  mutate(mean_deaths = sum(x2010, x2011, x2012, 
                           x2013, x2014, x2015, na.rm = T)/6) 
## # A tibble: 193 x 8
## # Rowwise: 
##    country              x2010 x2011 x2012 x2013 x2014 x2015 mean_deaths
##    <chr>                <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>       <dbl>
##  1 Afghanistan          74600 72000 69500 67100 64800 62700     68450  
##  2 Angola               79100 76400 73700 71200 69000 67200     72767. 
##  3 Albania                420   384   354   331   313   301       350. 
##  4 United Arab Emirates   683   687   686   681   672   658       678. 
##  5 Argentina             9550  9230  8860  8480  8100  7720      8657. 
##  6 Armenia                723   686   649   614   579   545       633. 
##  7 Antigua and Barbuda     12    11    11    10    10    10        10.7
##  8 Australia             1200  1160  1110  1080  1050  1030      1105  
##  9 Austria                280   273   266   260   255   253       264. 
## 10 Azerbaijan            5210  5070  4890  4670  4420  4140      4733. 
## # i 183 more rows

And finding the standard deviation would be very difficult. (We can’t think of any way to achieve this, actually.)

For this practice question, you will again work with the long format population dataset from the tidyr package.

Use group_by() and summarize() to obtain, for each country, the maximum reported population, the minimum reported population, and the mean reported population across the years available in the data. Your data should have four columns, “country”, “max_population”, “min_population” and “mean_population”.

Q_population_summaries <- 
  population %>% 
  group_by(country) %>%
  summarize(
    max_population = max(population),
    min_population = min(population),
    mean_population = mean(population, na.rm = TRUE)
  ) %>%
  ungroup()

7.3 Plotting

Finally, one of the data analysis tasks that is MOST hindered by wide formats is plotting. You may not yet have any prior knowledge of {ggplot} and how to plot so we will see the figures without going in depth with the code. What you need to remember is: many plots with with ggplot are also only possible with long-format data

Consider again the infant_deaths data infant_deaths_long. We will plot the number of deaths for Belgium per year:

infant_deaths_long %>% 
  filter(country == "Belgium") %>% 
  ggplot() + 
  geom_col(aes(x = year, y = deaths_count))

The plotting works because we can give the variable year for the x-axis. In the long format, year is a variable variable of its own. In the wide format, each year is a column and your counts are a matrix, not a vector. Plots need vectors of data!


Another plot that would not be possible without a long format:

infant_deaths_long %>% 
  head(30) %>% 
  ggplot(aes(x = year, y = deaths_count, group = country, color = country)) + 
  geom_line() + 
  geom_point()

Once again, the reason is the same, we need to tell the plot what to use as an x-axis and a y-axis and it is necessary to have these variables in their own column (as organized in the long format).

Pivoting can be hard …

We have mostly looked at very simple examples here, but in the wild, you will run into not-so-neat situations. We strongly recommend some further reading of this additional resource.

Wrap Up !

You have now explored different datasets and how they are either in a long or wide format. In the end, it’s just about how you present the information. Sometimes one format will be more convenient, and other times another could be best. Now, you are no longer limited by the format of your data: don’t like it? change it !

Contributors

The following team members contributed to this lesson: