You will understand what a wide data format is, and what a long data format is
You will know how to pivot long data to wide data using
pivot_long()
You will know how to pivot wide data to long data using
pivot_wider()
You have the intuition why the long data format is easier for plotting and wrangling
Understand by examples
Example 1: blood pressure data from three patients on three days.
WIDE format:
LONG format:
LONG format:
WIDE format:
Examples above = time-series datasets
Example 3: the number of patients in different units of three hospitals
WIDE format:
LONG format:
“observational units” = “statistical units” = primary entities described by the columns/variables
For example:
a patient
a country
a hospital
Consider the mock dataset created below:
temperatures <-
data.frame(
country = c("Sweden", "Denmark", "Norway"),
avgtemp.1994 = 1:3,
avgtemp.1995 = 3:5,
avgtemp.1996 = 5:7)
temperatures
## country avgtemp.1994 avgtemp.1995 avgtemp.1996
## 1 Sweden 1 3 5
## 2 Denmark 2 4 6
## 3 Norway 3 5 7
Is this data in a wide or long format?
Depends on what you want to do!
wide format = great for displaying data = easy to visually compare values
long format = best for data analysis tasks = like grouping & plotting
know how to switch from one format to the other
switching from the wide to the long format = pivoting
Gapminder is a good source of clean, rich health-relevant datasets.
## # A tibble: 193 x 7
## country x2010 x2011 x2012 x2013 x2014 x2015
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 74600 72000 69500 67100 64800 62700
## 2 Angola 79100 76400 73700 71200 69000 67200
## 3 Albania 420 384 354 331 313 301
## 4 United Arab Emirates 683 687 686 681 672 658
## 5 Argentina 9550 9230 8860 8480 8100 7720
## 6 Armenia 723 686 649 614 579 545
## 7 Antigua and Barbuda 12 11 11 10 10 10
## 8 Australia 1200 1160 1110 1080 1050 1030
## 9 Austria 280 273 266 260 255 253
## 10 Azerbaijan 5210 5070 4890 4670 4420 4140
## # i 183 more rows
observational unit = 1 country = 1 row
repeated measurements spread out across multiple columns
a wide format
convert to a long format = pivot_longer
.
cols
argument = which columns we want to
pivot
## # A tibble: 1,158 x 3
## country name value
## <chr> <chr> <dbl>
## 1 Afghanistan x2010 74600
## 2 Afghanistan x2011 72000
## 3 Afghanistan x2012 69500
## 4 Afghanistan x2013 67100
## 5 Afghanistan x2014 64800
## 6 Afghanistan x2015 62700
## 7 Angola x2010 79100
## 8 Angola x2011 76400
## 9 Angola x2012 73700
## 10 Angola x2013 71200
## # i 1,148 more rows
each country = 5 rows = one row per year
years are in the variable names
death count values are in the variable
values
infant deaths values used to be in matrix format (2 dimensions; 2D)
now in a vector format (1 dimension; 1D).
default names of the variables, names
and
values
= unsatisfactory
do not describe what the variables contain
custom column names with names_to
&
values_to
infant_deaths_wide %>%
# we will set our names to "year" and the values to "death_counts"
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to ="deaths_count" )
## # A tibble: 1,158 x 3
## country year deaths_count
## <chr> <chr> <dbl>
## 1 Afghanistan x2010 74600
## 2 Afghanistan x2011 72000
## 3 Afghanistan x2012 69500
## 4 Afghanistan x2013 67100
## 5 Afghanistan x2014 64800
## 6 Afghanistan x2015 62700
## 7 Angola x2010 79100
## 8 Angola x2011 76400
## 9 Angola x2012 73700
## 10 Angola x2013 71200
## # i 1,148 more rows
Long format = more informative than the original wide format.
Why?
In the wide format, no idea what the 2D matrix of death counts represents
In the long format, the variable has a name.
Side-note :
Remove the x
in front of each year.
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to = "deaths_count") %>% mutate(year=parse_number(year))
## # A tibble: 1,158 x 3
## country year deaths_count
## <chr> <dbl> <dbl>
## 1 Afghanistan 2010 74600
## 2 Afghanistan 2011 72000
## 3 Afghanistan 2012 69500
## 4 Afghanistan 2013 67100
## 5 Afghanistan 2014 64800
## 6 Afghanistan 2015 62700
## 7 Angola 2010 79100
## 8 Angola 2011 76400
## 9 Angola 2012 73700
## 10 Angola 2013 71200
## # i 1,148 more rows
Let’s store the data:
infant_deaths_long <-
infant_deaths_wide %>%
pivot_longer(cols = x2010:x2015,
names_to = "year",
values_to = "deaths_count")
For this practice question, you will use the
euro_births_wide
dataset from Eurostat.
It shows the annual number of births in 50 European countries:
## # A tibble: 6 x 8
## country x2015 x2016 x2017 x2018 x2019 x2020 x2021
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Belgium 122274 121896 119690 118319 117695 114350 118349
## 2 Bulgaria 65950 64984 63955 62197 61538 59086 58678
## 3 Czechia 110764 112663 114405 114036 112231 110200 111793
## 4 Denmark 58205 61614 61397 61476 61167 60937 63473
## 5 Germany 737575 792141 784901 787523 778090 773144 795492
## 6 Estonia 13907 14053 13784 14367 14099 13209 13272
The data is in a wide format. Convert it to a long format, with the following column names: “country”, “year” and “births_count”
pivot the other way, from long to wide?
Let’s first consider where you’re likely to run into long vs wide data:
Wide data tend to come from external sources
Long data is likely to be created by you while data wrangling.
Let’s see an example of this now.
Let’s see an example of this now.
Datase of patient records from an Ebola outbreak in Sierra Leone in 2014
We extract this data from the {outbreaks} package and perform some simplifying manipulations on it.
ebola <-
outbreaks::ebola_sierraleone_2014 %>%
as_tibble() %>% mutate(year=lubridate::year(date_of_onset)) %>%
select(patient_id=id, district,year_of_onset=year)
ebola
## # A tibble: 11,903 x 3
## patient_id district year_of_onset
## <int> <fct> <dbl>
## 1 1 Kailahun 2014
## 2 2 Kailahun 2014
## 3 3 Kailahun 2014
## 4 4 Kailahun 2014
## 5 5 Kailahun 2014
## 6 6 Kailahun 2014
## 7 7 Kailahun 2014
## 8 8 Kailahun 2014
## 9 9 Kailahun 2014
## 10 10 Kailahun 2014
## # i 11,893 more rows
Now, consider the following grouped summary of the ebola
dataset, which counts the number of patients recorded in each district
in each year:
cases_per_district_per_year <-
ebola %>%
group_by(district) %>%
count(year_of_onset) %>%
ungroup()
cases_per_district_per_year
## # A tibble: 28 x 3
## district year_of_onset n
## <fct> <dbl> <int>
## 1 Bo 2014 397
## 2 Bo 2015 209
## 3 Bombali 2014 1070
## 4 Bombali 2015 120
## 5 Bonthe 2014 7
## 6 Bonthe 2015 77
## 7 Kailahun 2014 535
## 8 Kailahun 2015 35
## 9 Kambia 2014 127
## 10 Kambia 2015 294
## # i 18 more rows
This is a quintessentially “long” dataset! Each observational unit (each district) occupies multiple rows, with one row for each measurement (each year).
Now, let’s see how to convert such long data into a wide format
with pivot_wider()
.
Within this function, there are two important arguments:
values_from
and names_from
## # A tibble: 14 x 3
## district `2014` `2015`
## <fct> <int> <int>
## 1 Bo 397 209
## 2 Bombali 1070 120
## 3 Bonthe 7 77
## 4 Kailahun 535 35
## 5 Kambia 127 294
## 6 Kenema 641 139
## 7 Koinadugu 142 15
## 8 Kono 328 223
## 9 Moyamba 258 191
## 10 Port Loko 1119 582
## 11 Pujehun 45 79
## 12 Tonkolili 541 42
## 13 Western Rural 1181 341
## 14 Western Urban 2068 1097
The population
dataset from the tidyr
package shows the populations of 219 countries over time.
Pivot this data into a wide format. Your answer should have 20 columns and 219 rows.
Example using the infant deaths dataset. Imagine that we want to answer the following question:
This is how we would do so with the long format of the data :
## # A tibble: 240 x 3
## # Groups: country [193]
## country year deaths_count
## <chr> <chr> <dbl>
## 1 Afghanistan x2010 74600
## 2 Angola x2010 79100
## 3 Albania x2010 420
## 4 United Arab Emirates x2011 687
## 5 Argentina x2010 9550
## 6 Armenia x2010 723
## 7 Antigua and Barbuda x2010 12
## 8 Australia x2010 1200
## 9 Austria x2010 280
## 10 Azerbaijan x2010 5210
## # i 230 more rows
Easy right? We can easily see, for example, that Afghanistan had its highest infant death count in 2010, and the United Arab Emirates had its highest death count in 2011.
Answering the same question with wide data:
You could try an approach like this with rowise():
## # A tibble: 193 x 8
## # Rowwise:
## country x2010 x2011 x2012 x2013 x2014 x2015 max_count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Afghanistan 74600 72000 69500 67100 64800 62700 74600
## 2 Angola 79100 76400 73700 71200 69000 67200 79100
## 3 Albania 420 384 354 331 313 301 420
## 4 United Arab Emirates 683 687 686 681 672 658 683
## 5 Argentina 9550 9230 8860 8480 8100 7720 9550
## 6 Armenia 723 686 649 614 579 545 723
## 7 Antigua and Barbuda 12 11 11 10 10 10 12
## 8 Australia 1200 1160 1110 1080 1050 1030 1200
## 9 Austria 280 273 266 260 255 253 280
## 10 Azerbaijan 5210 5070 4890 4670 4420 4140 5210
## # i 183 more rows
This works, but we still don’t know which year is attached to that
value in max_count
. We would have to take that value and
index it back to its respective year column… what a hassle! There are
solutions to find this but all are very painful. Why make your life
complicated when you can just pivot to long format and use the beauty of
group_by()
and filter()
?
Here we used a special {dplyr} function: rowwise()
.
rowwise()
is a function which allows further operations to
be applied on the rows rather than on the columns. It is equivalent to
creating one group for each row
(group_by(row_number())
).
Without rowwise()
you would get this :
## # A tibble: 193 x 8
## country x2010 x2011 x2012 x2013 x2014 x2015 max_count
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 74600 72000 69500 67100 64800 62700 1170000
## 2 Angola 79100 76400 73700 71200 69000 67200 1170000
## 3 Albania 420 384 354 331 313 301 1170000
## 4 United Arab Emirates 683 687 686 681 672 658 1170000
## 5 Argentina 9550 9230 8860 8480 8100 7720 1170000
## 6 Armenia 723 686 649 614 579 545 1170000
## 7 Antigua and Barbuda 12 11 11 10 10 10 1170000
## 8 Australia 1200 1160 1110 1080 1050 1030 1170000
## 9 Austria 280 273 266 260 255 253 1170000
## 10 Azerbaijan 5210 5070 4890 4670 4420 4140 1170000
## # i 183 more rows
The maximum count over all columns.
For this practice question, you will perform a grouped filter on the
long format population
dataset from the tidyr
package. Use group_by()
and filter()
to obtain
a dataset that shows the maximum population recorded for each country,
and the year in which that maximum population was recorded.
Grouped summaries are also difficult to perform on wide data.
Example: Using the infant_deaths_long
dataset once
again, if you want to ask:
For each country, what was the mean number of infant deaths and the standard deviation (variation) in deaths ?
With long data it is simple:
infant_deaths_long %>%
group_by(country) %>%
summarize(mean_deaths = mean(deaths_count),
sd_deaths = sd(deaths_count))
## # A tibble: 193 x 3
## country mean_deaths sd_deaths
## <chr> <dbl> <dbl>
## 1 Afghanistan 68450 4466.
## 2 Albania 350. 45.2
## 3 Algeria 21033. 484.
## 4 Angola 72767. 4513.
## 5 Antigua and Barbuda 10.7 0.816
## 6 Argentina 8657. 691.
## 7 Armenia 633. 66.6
## 8 Australia 1105 65.3
## 9 Austria 264. 10.6
## 10 Azerbaijan 4733. 405.
## # i 183 more rows
With wide data, on the other hand, finding the mean is less intuitive…
infant_deaths_wide %>%
rowwise() %>%
mutate(mean_deaths = sum(x2010, x2011, x2012,
x2013, x2014, x2015, na.rm = T)/6)
## # A tibble: 193 x 8
## # Rowwise:
## country x2010 x2011 x2012 x2013 x2014 x2015 mean_deaths
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 74600 72000 69500 67100 64800 62700 68450
## 2 Angola 79100 76400 73700 71200 69000 67200 72767.
## 3 Albania 420 384 354 331 313 301 350.
## 4 United Arab Emirates 683 687 686 681 672 658 678.
## 5 Argentina 9550 9230 8860 8480 8100 7720 8657.
## 6 Armenia 723 686 649 614 579 545 633.
## 7 Antigua and Barbuda 12 11 11 10 10 10 10.7
## 8 Australia 1200 1160 1110 1080 1050 1030 1105
## 9 Austria 280 273 266 260 255 253 264.
## 10 Azerbaijan 5210 5070 4890 4670 4420 4140 4733.
## # i 183 more rows
And finding the standard deviation would be very difficult. (We can’t think of any way to achieve this, actually.)
For this practice question, you will again work with the long format
population
dataset from the tidyr
package.
Use group_by()
and summarize()
to obtain,
for each country, the maximum reported population, the minimum reported
population, and the mean reported population across the years available
in the data. Your data should have four columns, “country”,
“max_population”, “min_population” and “mean_population”.
Finally, one of the data analysis tasks that is MOST hindered by wide formats is plotting. You may not yet have any prior knowledge of {ggplot} and how to plot so we will see the figures without going in depth with the code. What you need to remember is: many plots with with ggplot are also only possible with long-format data
Consider again the infant_deaths data
infant_deaths_long
. We will plot the number of deaths for
Belgium per year:
infant_deaths_long %>%
filter(country == "Belgium") %>%
ggplot() +
geom_col(aes(x = year, y = deaths_count))
The plotting works because we can give the variable year
for the x-axis. In the long format, year
is a variable
variable of its own. In the wide format, each year is a column and your
counts are a matrix, not a vector. Plots need vectors of data!
Another plot that would not be possible without a long format:
infant_deaths_long %>%
head(30) %>%
ggplot(aes(x = year, y = deaths_count, group = country, color = country)) +
geom_line() +
geom_point()
Once again, the reason is the same, we need to tell the plot what to use as an x-axis and a y-axis and it is necessary to have these variables in their own column (as organized in the long format).
We have mostly looked at very simple examples here, but in the wild, you will run into not-so-neat situations. We strongly recommend some further reading of this additional resource.
You have now explored different datasets and how they are either in a long or wide format. In the end, it’s just about how you present the information. Sometimes one format will be more convenient, and other times another could be best. Now, you are no longer limited by the format of your data: don’t like it? change it !
The following team members contributed to this lesson: