As we have seen, having data in tidy
format is what makes the tidyverse
flow. After the first step in the data analysis process, importing data, a common next step is to reshape the data into a form that facilitates the rest of the analysis. The tidyr
package includes several functions that are useful for tidying data.
gather
One of the most used functions in this package is gather
, which converts wide data into tidy data. Let’s see a simple example with a subset of the gapminder data. Here we have annual fertility rates for Germany and Korea in wide format:
library(tidyverse)
path <- system.file("extdata", package="dslabs")
filename <- file.path(path, "fertility-two-countries-example.csv")
wide_data <- read_csv(filename)
head(wide_data)
## # A tibble: 2 x 57
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968` `1969`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37 2.28 2.17
## 2 South K~ 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85 4.73 4.62
## # ... with 46 more variables: 1970 <dbl>, 1971 <dbl>, 1972 <dbl>, 1973 <dbl>,
## # 1974 <dbl>, 1975 <dbl>, 1976 <dbl>, 1977 <dbl>, 1978 <dbl>, 1979 <dbl>,
## # 1980 <dbl>, 1981 <dbl>, 1982 <dbl>, 1983 <dbl>, 1984 <dbl>, 1985 <dbl>,
## # 1986 <dbl>, 1987 <dbl>, 1988 <dbl>, 1989 <dbl>, 1990 <dbl>, 1991 <dbl>,
## # 1992 <dbl>, 1993 <dbl>, 1994 <dbl>, 1995 <dbl>, 1996 <dbl>, 1997 <dbl>,
## # 1998 <dbl>, 1999 <dbl>, 2000 <dbl>, 2001 <dbl>, 2002 <dbl>, 2003 <dbl>,
## # 2004 <dbl>, 2005 <dbl>, 2006 <dbl>, 2007 <dbl>, 2008 <dbl>, 2009 <dbl>, ...
Recall that the gapminder data we used had a column named year
and a column named fertility_rate
. We would like to convert this subset into that format. We will use the gather
function for this.
In the third argument of the gather
function you specify the columns that will be gathered. The default is to gather all columns, so in most cases we have to specify the columns. Here we want columns 1960
, 1961
, up to 2015
. The first argument sets the column/variable name that will hold the variable that is currently kept in the wide data column names. In our case it makes sense to set the name to year
, but we can name it anything. The second argument sets the column/variable name that will hold the values in the column cells. In this case we call it fertility
since this is what is stored in this file. Note that nowhere in this file does it tell us this is fertility data. Instead, this information was kept in the file name.
The gathering code looks like this:
new_tidy_data <- wide_data %>% gather(year, fertility, `1960`:`2015`)
We can see that the data have been converted to tidy format with columns year
and fertility
:
head(new_tidy_data)
## # A tibble: 6 x 3
## country year fertility
## <chr> <chr> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
However, each year resulted in two rows since we have two countries and this column was not gathered. A somewhat quicker way to write this code is to specify which column will not be gathered rather than all the columns that will be gathered:
new_tidy_data <- wide_data %>% gather(year, fertility, -country)
new_tidy_data
## # A tibble: 112 x 3
## country year fertility
## <chr> <chr> <dbl>
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
## 7 Germany 1963 2.49
## 8 South Korea 1963 5.57
## 9 Germany 1964 2.49
## 10 South Korea 1964 5.36
## # ... with 102 more rows
This data looks a lot like the original tidy_data
we used. There is just one minor difference. Can you spot it? Look at the data type of the year column:
library(dslabs)
## Warning: package 'dslabs' was built under R version 4.0.5
data("gapminder")
tidy_data<- gapminder %>%
filter(country %in% c("South Korea", "Germany")) %>%
select(country, year, fertility)
class(tidy_data$year)
## [1] "integer"
class(new_tidy_data$year)
## [1] "character"
The gather
function assumes that column names are characters. So we need a bit more wrangling before we are ready to make a plot. We need to convert the column to numbers. The gather
function has an argument for that, the convert
argument:
new_tidy_data <- wide_data %>%
gather(year, fertility, -country, convert= TRUE)
class(new_tidy_data$year)
## [1] "integer"
We could have also used the mutate
and as.numeric
functions. Now that the data is tidy we can use the same ggplot as before:
new_tidy_data %>% ggplot(aes(year, fertility, color=country)) + geom_point()
spread
As we will see in later examples it is sometimes useful for data wrangling purposes to convert tidy data into wide data. We often use this as an intermediate step in tidying up data. The spread
function is basically the inverse of gather
. The first argument tells spread
which variable will be used as the column names. The second argument specifies which variable to use to fill out the cells:
new_wide_data <- new_tidy_data %>% spread(year, fertility)
select(new_wide_data, country, `1960`:`1967`)
## # A tibble: 2 x 9
## country `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 2.44 2.47 2.49 2.49 2.48 2.44 2.37
## 2 South Korea 6.16 5.99 5.79 5.57 5.36 5.16 4.99 4.85
separate
The data wrangling shown above was simple compared to what is usually required. In our example spreadsheet files we include an example that is slightly more complicated. It includes two variables: life expectancy as well as fertility. However, the way it is stored is not tidy and, as we will explain, not optimal.
path <- system.file("extdata", package = "dslabs")
filename <- file.path(path, "life-expectancy-and-fertility-two-countries-example.csv")
raw_dat <- read_csv(filename)
select(raw_dat, 1:5)
## # A tibble: 2 x 5
## country `1960_fertility` `1960_life_expec~ `1961_fertility` `1961_life_expec~
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Germany 2.41 69.3 2.44 69.8
## 2 South K~ 6.16 53.0 5.99 53.8
First note that the data is in wide format. Second, note that now there are values for two variables with the column names encoding which column represents which variable. We can start the data wrangling with the gather
function, but we should no longer use the column name year
for the new column since since it also contains the variable type. We will call it key
, the default, for now:
dat <- raw_dat %>% gather(key, value, -country)
head(dat)
## # A tibble: 6 x 3
## country key value
## <chr> <chr> <dbl>
## 1 Germany 1960_fertility 2.41
## 2 South Korea 1960_fertility 6.16
## 3 Germany 1960_life_expectancy 69.3
## 4 South Korea 1960_life_expectancy 53.0
## 5 Germany 1961_fertility 2.44
## 6 South Korea 1961_fertility 5.99
The result is not exactly what we refer to as tidy since each observation is associated with two rows instead of one. We want to have the values from the two variables, fertility and life expectancy, in two separate columns. The first challenge to achieve this is to separate the key
column into the year and the variable type. Note that the entries in this column separate the year from the variable name with an underscore:
dat$key[1:5]
## [1] "1960_fertility" "1960_fertility" "1960_life_expectancy"
## [4] "1960_life_expectancy" "1961_fertility"
Encoding multiple variables in a column name is such a common problem that the readr
package includes a function to separate these columns into two or more. Apart from the data, the separate
function takes three arguments: the name of the column to be separated, the names to be used for the new columns and the character that separates the variables. So a first attempt at this is:
dat %>% separate(key, c("year", "variable_name"), "_")
Because “_” is the default separator we actually can simply write:
dat %>% separate(key, c("year", "variable_name"))
## Warning: Expected 2 pieces. Additional pieces discarded in 112 rows [3, 4, 7, 8,
## 11, 12, 15, 16, 19, 20, 23, 24, 27, 28, 31, 32, 35, 36, 39, 40, ...].
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life 69.3
## 4 South Korea 1960 life 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life 69.8
## 8 South Korea 1961 life 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
## # ... with 214 more rows
However, we run into a problem. Note that we receive the warning Too many values at 112 locations:
and that the life_exepectancy
variable is truncated to life
. This is because the _
is used to separate life
and expectancy
not just year and variable name. We could add a third column to catch this and let the separate
function know which column to fill in with missing values, NA
, when there is no third value. Here we tell it to fill the column on the right:
dat %>% separate(key,
c("year", "first_variable_name", "second_variable_name"), fill = "right")
## # A tibble: 224 x 5
## country year first_variable_name second_variable_name value
## <chr> <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility <NA> 2.41
## 2 South Korea 1960 fertility <NA> 6.16
## 3 Germany 1960 life expectancy 69.3
## 4 South Korea 1960 life expectancy 53.0
## 5 Germany 1961 fertility <NA> 2.44
## 6 South Korea 1961 fertility <NA> 5.99
## 7 Germany 1961 life expectancy 69.8
## 8 South Korea 1961 life expectancy 53.8
## 9 Germany 1962 fertility <NA> 2.47
## 10 South Korea 1962 fertility <NA> 5.79
## # ... with 214 more rows
However, if we read the separate
help file we find that a better approach is to merge the last two variables when there is an extra separation:
dat %>% separate(key,
c("year", "variable_name"), sep = "_" , extra = "merge")
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility 2.41
## 2 South Korea 1960 fertility 6.16
## 3 Germany 1960 life_expectancy 69.3
## 4 South Korea 1960 life_expectancy 53.0
## 5 Germany 1961 fertility 2.44
## 6 South Korea 1961 fertility 5.99
## 7 Germany 1961 life_expectancy 69.8
## 8 South Korea 1961 life_expectancy 53.8
## 9 Germany 1962 fertility 2.47
## 10 South Korea 1962 fertility 5.79
## # ... with 214 more rows
This achieves the separation we wanted. However, we are not done yet. We need to create a column for each variable. As we learned, the spread
function can do this:
dat %>% separate(key,
c("year", "variable_name"), sep = "_" , extra = "merge") %>%
spread(variable_name, value)
## # A tibble: 112 x 4
## country year fertility life_expectancy
## <chr> <chr> <dbl> <dbl>
## 1 Germany 1960 2.41 69.3
## 2 Germany 1961 2.44 69.8
## 3 Germany 1962 2.47 70.0
## 4 Germany 1963 2.49 70.1
## 5 Germany 1964 2.49 70.7
## 6 Germany 1965 2.48 70.6
## 7 Germany 1966 2.44 70.8
## 8 Germany 1967 2.37 71.0
## 9 Germany 1968 2.28 70.6
## 10 Germany 1969 2.17 70.5
## # ... with 102 more rows
The data is now in tidy format with one row for each observation with three variables: year
, fertility
and life expectancy
.
unite
It is sometimes useful to do the inverse of separate
, i.e. unite two columns into one. So, although this is not an optimal approach, had we used this command to separate:
dat %>% separate(key,
c("year", "variable_name", "second_variable_name"), fill = "right")
## # A tibble: 224 x 5
## country year variable_name second_variable_name value
## <chr> <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility <NA> 2.41
## 2 South Korea 1960 fertility <NA> 6.16
## 3 Germany 1960 life expectancy 69.3
## 4 South Korea 1960 life expectancy 53.0
## 5 Germany 1961 fertility <NA> 2.44
## 6 South Korea 1961 fertility <NA> 5.99
## 7 Germany 1961 life expectancy 69.8
## 8 South Korea 1961 life expectancy 53.8
## 9 Germany 1962 fertility <NA> 2.47
## 10 South Korea 1962 fertility <NA> 5.79
## # ... with 214 more rows
we can achieve the same final result by uniting the second and third column like this:
dat %>%
separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
unite(variable_name, first_variable_name, second_variable_name, sep = "_")
## # A tibble: 224 x 4
## country year variable_name value
## <chr> <chr> <chr> <dbl>
## 1 Germany 1960 fertility_NA 2.41
## 2 South Korea 1960 fertility_NA 6.16
## 3 Germany 1960 life_expectancy 69.3
## 4 South Korea 1960 life_expectancy 53.0
## 5 Germany 1961 fertility_NA 2.44
## 6 South Korea 1961 fertility_NA 5.99
## 7 Germany 1961 life_expectancy 69.8
## 8 South Korea 1961 life_expectancy 53.8
## 9 Germany 1962 fertility_NA 2.47
## 10 South Korea 1962 fertility_NA 5.79
## # ... with 214 more rows
Then spreading the columns:
dat %>%
separate(key, c("year", "first_variable_name", "second_variable_name"), fill = "right") %>%
unite(variable_name, first_variable_name, second_variable_name, sep = "_") %>%
spread(variable_name, value) %>%
rename(fertility = fertility_NA)
## # A tibble: 112 x 4
## country year fertility life_expectancy
## <chr> <chr> <dbl> <dbl>
## 1 Germany 1960 2.41 69.3
## 2 Germany 1961 2.44 69.8
## 3 Germany 1962 2.47 70.0
## 4 Germany 1963 2.49 70.1
## 5 Germany 1964 2.49 70.7
## 6 Germany 1965 2.48 70.6
## 7 Germany 1966 2.44 70.8
## 8 Germany 1967 2.37 71.0
## 9 Germany 1968 2.28 70.6
## 10 Germany 1969 2.17 70.5
## # ... with 102 more rows