Data tidying using tidyr

Sources

Borrowed from: bradleyboehmke

Loading Necessay Libraries

library(dplyr)
library(tidyr)
library(knitr)
library(printr)

What is tidy data ?

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003). Well structured data Makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.

Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together → leads to difficult-to-read nested functions and/or choppy code. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways → leads to more efficient code, easier to remember syntax, and easier to read syntax.

tidy data

%>% Operator

Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.

This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value)
or

data %>% filter(variable == numeric_value)

Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. As your function tasks get longer the %>% operator becomes more efficient and makes your code more legible. The %>% operator allows you to flow from data manipulation tasks straight into vizualization functions (via ggplot and ggvis) and also into many analytic functions. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as:

Nested Option:

arrange(
        summarize(
            filter(data, variable == numeric_value),
            Total = sum(variable)
        ),
    desc(Total)
)

Multiple Object Option:

 a <- filter(data, variable == numeric_value)
 b <- summarise(a, Total = sum(variable))
 c <- arrange(b, desc(Total))

%>% Option :

 data %>%
        filter(variable == “value”) %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))

tidyr Operations

There are four fundamental functions of data tidying:

gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer
spread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data wider
separate() splits a single column into multiple columns
unite() combines multiple columns into a single column

gather() Function

Objective: Reshaping wide format to long format

Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

Complement to: spread()

Fucntion summary:

Function:       gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as:        data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)

Arguments:
        data:           data frame
        key:            column name representing new variable
        value:          column name representing variable values
        ...:            names of columns to gather (or not gather)
        na.rm:          option to remove observations with missing values (represented by NAs)
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Schematic description:

Example:

Reading data :

data <- read.csv("Data/data1.csv", head=TRUE,sep="")
data

Group	Year	Qtr.1	Qtr.2	Qtr.3	Qtr.4
1	2006	15	16	19	17
1	2007	12	13	27	23
1	2008	22	22	24	20
1	2009	10	14	20	16
2	2006	12	12	25	18
2	2007	16	14	21	19
2	2008	13	11	29	15
2	2009	23	20	26	20
3	2006	11	12	22	16
3	2007	13	11	27	21
3	2008	17	12	23	19
3	2009	14	9	31	24

This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.

long_data <- data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
        
kable(head (long_data,24),align = 'c')

Group	Year	Quarter	Revenue
1	2006	Qtr.1	15
1	2007	Qtr.1	12
1	2008	Qtr.1	22
1	2009	Qtr.1	10
2	2006	Qtr.1	12
2	2007	Qtr.1	16
2	2008	Qtr.1	13
2	2009	Qtr.1	23
3	2006	Qtr.1	11
3	2007	Qtr.1	13
3	2008	Qtr.1	17
3	2009	Qtr.1	14
1	2006	Qtr.2	16
1	2007	Qtr.2	13
1	2008	Qtr.2	22
1	2009	Qtr.2	14
2	2006	Qtr.2	12
2	2007	Qtr.2	14
2	2008	Qtr.2	11
2	2009	Qtr.2	20
3	2006	Qtr.2	12
3	2007	Qtr.2	11
3	2008	Qtr.2	12
3	2009	Qtr.2	9

Similar operations :

These all produce the same results:
        data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
        data %>% gather(Quarter, Revenue, -Group, -Year)
        data %>% gather(Quarter, Revenue, 3:6)
        data %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)

Also note that if you do not supply arguments for na.rm or convert values then the defaults are used

spread() Function

Objective: Reshaping long format to wide format

Description: There are times when we are required to turn long formatted data into wide formatted data. The spread() function spreads a key-value pair across multiple columns.

Complement to: gather()

Function summary:

Function:       spread(data, key, value, fill = NA, convert = FALSE)
Same as:        data %>% spread(key, value, fill = NA, convert = FALSE)

Arguments:
        data:           data frame
        key:            column values to convert to multiple columns
        value:          single column values to convert to multiple columns' values 
        fill:           If there isn't a value for every combination of the other variables and the key 
                        column, this value will be substituted
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Example:

Consider the long format data long_data we created in the last section. Now we turn it back to a wide format.

wide_data <- long_data %>% spread(Quarter, Revenue)

kable(wide_data,align = 'c')

Group	Year	Qtr.1	Qtr.2	Qtr.3	Qtr.4
1	2006	15	16	19	17
1	2007	12	13	27	23
1	2008	22	22	24	20
1	2009	10	14	20	16
2	2006	12	12	25	18
2	2007	16	14	21	19
2	2008	13	11	29	15
2	2009	23	20	26	20
3	2006	11	12	22	16
3	2007	13	11	27	21
3	2008	17	12	23	19
3	2009	14	9	31	24

separate() Function:

Objective: Splitting a single variable into two

Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don’t care about. In these cases, our objective may be to separate characters within the variable string. This can be accomplished using the separate() function which turns a single character column into multiple columns.

Complement to: unite()

Function summary:

Function:       separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as:        data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)

Arguments:
        data:           data frame
        col:            column name representing current variable
        into:           names of variables representing new variables
        sep:            how to separate current variable (char, num, or symbol)
        remove:         if TRUE, remove input column from output data frame
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Example :

We can go back to our long_data dataframe we created above in which way may desire to clean up or separate the Quarter variable.

Group	Year	Quarter	Revenue
1	2006	Qtr.1	15
1	2007	Qtr.1	12
1	2008	Qtr.1	22
1	2009	Qtr.1	10
2	2006	Qtr.1	12
2	2007	Qtr.1	16
2	2008	Qtr.1	13
2	2009	Qtr.1	23
3	2006	Qtr.1	11
3	2007	Qtr.1	13

By applying the separate() function we get the following:

separate_data <- long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))

kable(head (separate_data,10),align = 'c')

Group	Year	Time_Interval	Interval_ID	Revenue
1	2006	Qtr	1	15
1	2007	Qtr	1	12
1	2008	Qtr	1	22
1	2009	Qtr	1	10
2	2006	Qtr	1	12
2	2007	Qtr	1	16
2	2008	Qtr	1	13
2	2009	Qtr	1	23
3	2006	Qtr	1	11
3	2007	Qtr	1	13

Similar operations :

These produce the same results:
        long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
        long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")

unite() Function:

Objective: Merging two variables into one

Description: There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.

Complement to: separate()

Function summary:

Function:       unite(data, col, ..., sep = " ", remove = TRUE)
Same as:        data %>% unite(col, ..., sep = " ", remove = TRUE)

Arguments:
        data:           data frame
        col:            column name of new "merged" column
        ...:            names of columns to merge
        sep:            separator to use between merged values
        remove:         if TRUE, remove input column from output data frame

Example :

Using the separate_data dataframe we created above, shown below :

kable(head (separate_data,10),align = 'c')

Group	Year	Time_Interval	Interval_ID	Revenue
1	2006	Qtr	1	15
1	2007	Qtr	1	12
1	2008	Qtr	1	22
1	2009	Qtr	1	10
2	2006	Qtr	1	12
2	2007	Qtr	1	16
2	2008	Qtr	1	13
2	2009	Qtr	1	23
3	2006	Qtr	1	11
3	2007	Qtr	1	13

We can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_data dataframe.

unite_data <- separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")

kable(head (unite_data,10),align = 'c')

Group	Year	Quarter	Revenue
1	2006	Qtr.1	15
1	2007	Qtr.1	12
1	2008	Qtr.1	22
1	2009	Qtr.1	10
2	2006	Qtr.1	12
2	2007	Qtr.1	16
2	2008	Qtr.1	13
2	2009	Qtr.1	23
3	2006	Qtr.1	11
3	2007	Qtr.1	13

Similar operations :

These produce the same results:
        separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
        separate_data %>% unite(Quarter, Time_Interval, Interval_ID)

If no spearator is identified, "_" will automatically be used

Additional Resources

Hadley Wickham’s paper on Tidy Data provides a great explanation behind the concept of “tidy data”. Link

Data tidying using tidyr

Davood Astaraky

Sources

Loading Necessay Libraries

What is tidy data ?

%>% Operator

tidyr Operations

gather() Function

spread() Function

separate() Function:

unite() Function:

Additional Resources