Sources

Loading Necessay Libraries

library(dplyr)
library(tidyr)
library(knitr)
library(printr)

What is tidy data ?

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003). Well structured data Makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.

Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together → leads to difficult-to-read nested functions and/or choppy code. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways → leads to more efficient code, easier to remember syntax, and easier to read syntax.

tidy data

tidy data

%>% Operator

Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.

This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:

filter(data, variable == numeric_value)
or

data %>% filter(variable == numeric_value)

Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. As your function tasks get longer the %>% operator becomes more efficient and makes your code more legible. The %>% operator allows you to flow from data manipulation tasks straight into vizualization functions (via ggplot and ggvis) and also into many analytic functions. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as:

Nested Option:

arrange(
        summarize(
            filter(data, variable == numeric_value),
            Total = sum(variable)
        ),
    desc(Total)
)

Multiple Object Option:

 a <- filter(data, variable == numeric_value)
 b <- summarise(a, Total = sum(variable))
 c <- arrange(b, desc(Total))

%>% Option :

 data %>%
        filter(variable == “value”) %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))

tidyr Operations

There are four fundamental functions of data tidying:

gather() Function

Objective: Reshaping wide format to long format

Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.

Complement to: spread()

Fucntion summary:

Function:       gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as:        data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)

Arguments:
        data:           data frame
        key:            column name representing new variable
        value:          column name representing variable values
        ...:            names of columns to gather (or not gather)
        na.rm:          option to remove observations with missing values (represented by NAs)
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Schematic description:

Example:

Reading data :

data <- read.csv("Data/data1.csv", head=TRUE,sep="")
data
Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
1 2006 15 16 19 17
1 2007 12 13 27 23
1 2008 22 22 24 20
1 2009 10 14 20 16
2 2006 12 12 25 18
2 2007 16 14 21 19
2 2008 13 11 29 15
2 2009 23 20 26 20
3 2006 11 12 22 16
3 2007 13 11 27 21
3 2008 17 12 23 19
3 2009 14 9 31 24

This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.

long_data <- data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
        
kable(head (long_data,24),align = 'c')
Group Year Quarter Revenue
1 2006 Qtr.1 15
1 2007 Qtr.1 12
1 2008 Qtr.1 22
1 2009 Qtr.1 10
2 2006 Qtr.1 12
2 2007 Qtr.1 16
2 2008 Qtr.1 13
2 2009 Qtr.1 23
3 2006 Qtr.1 11
3 2007 Qtr.1 13
3 2008 Qtr.1 17
3 2009 Qtr.1 14
1 2006 Qtr.2 16
1 2007 Qtr.2 13
1 2008 Qtr.2 22
1 2009 Qtr.2 14
2 2006 Qtr.2 12
2 2007 Qtr.2 14
2 2008 Qtr.2 11
2 2009 Qtr.2 20
3 2006 Qtr.2 12
3 2007 Qtr.2 11
3 2008 Qtr.2 12
3 2009 Qtr.2 9

Similar operations :

These all produce the same results:
        data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
        data %>% gather(Quarter, Revenue, -Group, -Year)
        data %>% gather(Quarter, Revenue, 3:6)
        data %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)

Also note that if you do not supply arguments for na.rm or convert values then the defaults are used

spread() Function

Objective: Reshaping long format to wide format

Description: There are times when we are required to turn long formatted data into wide formatted data. The spread() function spreads a key-value pair across multiple columns.

Complement to: gather()

Function summary:

Function:       spread(data, key, value, fill = NA, convert = FALSE)
Same as:        data %>% spread(key, value, fill = NA, convert = FALSE)

Arguments:
        data:           data frame
        key:            column values to convert to multiple columns
        value:          single column values to convert to multiple columns' values 
        fill:           If there isn't a value for every combination of the other variables and the key 
                        column, this value will be substituted
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Example:

Consider the long format data long_data we created in the last section. Now we turn it back to a wide format.

wide_data <- long_data %>% spread(Quarter, Revenue)

kable(wide_data,align = 'c')
Group Year Qtr.1 Qtr.2 Qtr.3 Qtr.4
1 2006 15 16 19 17
1 2007 12 13 27 23
1 2008 22 22 24 20
1 2009 10 14 20 16
2 2006 12 12 25 18
2 2007 16 14 21 19
2 2008 13 11 29 15
2 2009 23 20 26 20
3 2006 11 12 22 16
3 2007 13 11 27 21
3 2008 17 12 23 19
3 2009 14 9 31 24

separate() Function:

Objective: Splitting a single variable into two

Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don’t care about. In these cases, our objective may be to separate characters within the variable string. This can be accomplished using the separate() function which turns a single character column into multiple columns.

Complement to: unite()

Function summary:

Function:       separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as:        data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)

Arguments:
        data:           data frame
        col:            column name representing current variable
        into:           names of variables representing new variables
        sep:            how to separate current variable (char, num, or symbol)
        remove:         if TRUE, remove input column from output data frame
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

Example :

We can go back to our long_data dataframe we created above in which way may desire to clean up or separate the Quarter variable.

Group Year Quarter Revenue
1 2006 Qtr.1 15
1 2007 Qtr.1 12
1 2008 Qtr.1 22
1 2009 Qtr.1 10
2 2006 Qtr.1 12
2 2007 Qtr.1 16
2 2008 Qtr.1 13
2 2009 Qtr.1 23
3 2006 Qtr.1 11
3 2007 Qtr.1 13

By applying the separate() function we get the following:

separate_data <- long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))

kable(head (separate_data,10),align = 'c')
Group Year Time_Interval Interval_ID Revenue
1 2006 Qtr 1 15
1 2007 Qtr 1 12
1 2008 Qtr 1 22
1 2009 Qtr 1 10
2 2006 Qtr 1 12
2 2007 Qtr 1 16
2 2008 Qtr 1 13
2 2009 Qtr 1 23
3 2006 Qtr 1 11
3 2007 Qtr 1 13

Similar operations :

These produce the same results:
        long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
        long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")

unite() Function:

Objective: Merging two variables into one

Description: There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.

Complement to: separate()

Function summary:

Function:       unite(data, col, ..., sep = " ", remove = TRUE)
Same as:        data %>% unite(col, ..., sep = " ", remove = TRUE)

Arguments:
        data:           data frame
        col:            column name of new "merged" column
        ...:            names of columns to merge
        sep:            separator to use between merged values
        remove:         if TRUE, remove input column from output data frame

Example :

Using the separate_data dataframe we created above, shown below :

kable(head (separate_data,10),align = 'c')
Group Year Time_Interval Interval_ID Revenue
1 2006 Qtr 1 15
1 2007 Qtr 1 12
1 2008 Qtr 1 22
1 2009 Qtr 1 10
2 2006 Qtr 1 12
2 2007 Qtr 1 16
2 2008 Qtr 1 13
2 2009 Qtr 1 23
3 2006 Qtr 1 11
3 2007 Qtr 1 13

We can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_data dataframe.

unite_data <- separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")

kable(head (unite_data,10),align = 'c')
Group Year Quarter Revenue
1 2006 Qtr.1 15
1 2007 Qtr.1 12
1 2008 Qtr.1 22
1 2009 Qtr.1 10
2 2006 Qtr.1 12
2 2007 Qtr.1 16
2 2008 Qtr.1 13
2 2009 Qtr.1 23
3 2006 Qtr.1 11
3 2007 Qtr.1 13

Similar operations :

These produce the same results:
        separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
        separate_data %>% unite(Quarter, Time_Interval, Interval_ID)

If no spearator is identified, "_" will automatically be used

Additional Resources