library(dplyr)
library(tidyr)
library(knitr)
library(printr)
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003). Well structured data Makes data suitable for software processing whether that be mathematical functions, visualization, etc. and reveals information and insights.
Although many fundamental data processing functions exist in R, they have been a bit convoluted to date and have lacked consistent coding and the ability to easily flow together → leads to difficult-to-read nested functions and/or choppy code. As a result, a lot of data processing tasks are becoming packaged in more cohesive and consistent ways → leads to more efficient code, easier to remember syntax, and easier to read syntax.
tidy data
Although not required, the tidyr and dplyr packages make use of the pipe operator %>% developed by Stefan Milton Bache in the R package magrittr. Although all the functions in tidyr and dplyr can be used without the pipe operator, one of the great conveniences these packages provide is the ability to string multiple functions together by incorporating %>%.
This operator will forward a value, or the result of an expression, into the next function call/expression. For instance a function to filter data can be written as:
filter(data, variable == numeric_value)
or
Both functions complete the same task and the benefit of using %>% is not evident; however, when you desire to perform multiple functions its advantage becomes obvious. As your function tasks get longer the %>% operator becomes more efficient and makes your code more legible. The %>% operator allows you to flow from data manipulation tasks straight into vizualization functions (via ggplot and ggvis) and also into many analytic functions. For instance, if we want to filter some data, summarize it, and then order the summarized results we would write it out as:
Nested Option:
arrange(
summarize(
filter(data, variable == numeric_value),
Total = sum(variable)
),
desc(Total)
)
Multiple Object Option:
a <- filter(data, variable == numeric_value)
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
%>% Option :
data %>%
filter(variable == “value”) %>%
summarise(Total = sum(variable)) %>%
arrange(desc(Total))
There are four fundamental functions of data tidying:
gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longerspread() takes two columns (key & value) and spreads in to multiple columns, it makes “long” data widerseparate() splits a single column into multiple columnsunite() combines multiple columns into a single columnObjective: Reshaping wide format to long format
Description: There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the gather() function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.
Complement to: spread()
Fucntion summary:
Function: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as: data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)
Arguments:
data: data frame
key: column name representing new variable
value: column name representing variable values
...: names of columns to gather (or not gather)
na.rm: option to remove observations with missing values (represented by NAs)
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
Schematic description:
Example:
Reading data :
data <- read.csv("Data/data1.csv", head=TRUE,sep="")
data
| Group | Year | Qtr.1 | Qtr.2 | Qtr.3 | Qtr.4 |
|---|---|---|---|---|---|
| 1 | 2006 | 15 | 16 | 19 | 17 |
| 1 | 2007 | 12 | 13 | 27 | 23 |
| 1 | 2008 | 22 | 22 | 24 | 20 |
| 1 | 2009 | 10 | 14 | 20 | 16 |
| 2 | 2006 | 12 | 12 | 25 | 18 |
| 2 | 2007 | 16 | 14 | 21 | 19 |
| 2 | 2008 | 13 | 11 | 29 | 15 |
| 2 | 2009 | 23 | 20 | 26 | 20 |
| 3 | 2006 | 11 | 12 | 22 | 16 |
| 3 | 2007 | 13 | 11 | 27 | 21 |
| 3 | 2008 | 17 | 12 | 23 | 19 |
| 3 | 2009 | 14 | 9 | 31 | 24 |
This data is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.
long_data <- data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
kable(head (long_data,24),align = 'c')
| Group | Year | Quarter | Revenue |
|---|---|---|---|
| 1 | 2006 | Qtr.1 | 15 |
| 1 | 2007 | Qtr.1 | 12 |
| 1 | 2008 | Qtr.1 | 22 |
| 1 | 2009 | Qtr.1 | 10 |
| 2 | 2006 | Qtr.1 | 12 |
| 2 | 2007 | Qtr.1 | 16 |
| 2 | 2008 | Qtr.1 | 13 |
| 2 | 2009 | Qtr.1 | 23 |
| 3 | 2006 | Qtr.1 | 11 |
| 3 | 2007 | Qtr.1 | 13 |
| 3 | 2008 | Qtr.1 | 17 |
| 3 | 2009 | Qtr.1 | 14 |
| 1 | 2006 | Qtr.2 | 16 |
| 1 | 2007 | Qtr.2 | 13 |
| 1 | 2008 | Qtr.2 | 22 |
| 1 | 2009 | Qtr.2 | 14 |
| 2 | 2006 | Qtr.2 | 12 |
| 2 | 2007 | Qtr.2 | 14 |
| 2 | 2008 | Qtr.2 | 11 |
| 2 | 2009 | Qtr.2 | 20 |
| 3 | 2006 | Qtr.2 | 12 |
| 3 | 2007 | Qtr.2 | 11 |
| 3 | 2008 | Qtr.2 | 12 |
| 3 | 2009 | Qtr.2 | 9 |
Similar operations :
These all produce the same results:
data %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
data %>% gather(Quarter, Revenue, -Group, -Year)
data %>% gather(Quarter, Revenue, 3:6)
data %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4)
Also note that if you do not supply arguments for na.rm or convert values then the defaults are used
Objective: Reshaping long format to wide format
Description: There are times when we are required to turn long formatted data into wide formatted data. The spread() function spreads a key-value pair across multiple columns.
Complement to: gather()
Function summary:
Function: spread(data, key, value, fill = NA, convert = FALSE)
Same as: data %>% spread(key, value, fill = NA, convert = FALSE)
Arguments:
data: data frame
key: column values to convert to multiple columns
value: single column values to convert to multiple columns' values
fill: If there isn't a value for every combination of the other variables and the key
column, this value will be substituted
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
Example:
Consider the long format data long_data we created in the last section. Now we turn it back to a wide format.
wide_data <- long_data %>% spread(Quarter, Revenue)
kable(wide_data,align = 'c')
| Group | Year | Qtr.1 | Qtr.2 | Qtr.3 | Qtr.4 |
|---|---|---|---|---|---|
| 1 | 2006 | 15 | 16 | 19 | 17 |
| 1 | 2007 | 12 | 13 | 27 | 23 |
| 1 | 2008 | 22 | 22 | 24 | 20 |
| 1 | 2009 | 10 | 14 | 20 | 16 |
| 2 | 2006 | 12 | 12 | 25 | 18 |
| 2 | 2007 | 16 | 14 | 21 | 19 |
| 2 | 2008 | 13 | 11 | 29 | 15 |
| 2 | 2009 | 23 | 20 | 26 | 20 |
| 3 | 2006 | 11 | 12 | 22 | 16 |
| 3 | 2007 | 13 | 11 | 27 | 21 |
| 3 | 2008 | 17 | 12 | 23 | 19 |
| 3 | 2009 | 14 | 9 | 31 | 24 |
Objective: Splitting a single variable into two
Description: Many times a single column variable will capture multiple variables, or even parts of a variable you just don’t care about. In these cases, our objective may be to separate characters within the variable string. This can be accomplished using the separate() function which turns a single character column into multiple columns.
Complement to: unite()
Function summary:
Function: separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as: data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)
Arguments:
data: data frame
col: column name representing current variable
into: names of variables representing new variables
sep: how to separate current variable (char, num, or symbol)
remove: if TRUE, remove input column from output data frame
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
Example :
We can go back to our long_data dataframe we created above in which way may desire to clean up or separate the Quarter variable.
| Group | Year | Quarter | Revenue |
|---|---|---|---|
| 1 | 2006 | Qtr.1 | 15 |
| 1 | 2007 | Qtr.1 | 12 |
| 1 | 2008 | Qtr.1 | 22 |
| 1 | 2009 | Qtr.1 | 10 |
| 2 | 2006 | Qtr.1 | 12 |
| 2 | 2007 | Qtr.1 | 16 |
| 2 | 2008 | Qtr.1 | 13 |
| 2 | 2009 | Qtr.1 | 23 |
| 3 | 2006 | Qtr.1 | 11 |
| 3 | 2007 | Qtr.1 | 13 |
By applying the separate() function we get the following:
separate_data <- long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
kable(head (separate_data,10),align = 'c')
| Group | Year | Time_Interval | Interval_ID | Revenue |
|---|---|---|---|---|
| 1 | 2006 | Qtr | 1 | 15 |
| 1 | 2007 | Qtr | 1 | 12 |
| 1 | 2008 | Qtr | 1 | 22 |
| 1 | 2009 | Qtr | 1 | 10 |
| 2 | 2006 | Qtr | 1 | 12 |
| 2 | 2007 | Qtr | 1 | 16 |
| 2 | 2008 | Qtr | 1 | 13 |
| 2 | 2009 | Qtr | 1 | 23 |
| 3 | 2006 | Qtr | 1 | 11 |
| 3 | 2007 | Qtr | 1 | 13 |
Similar operations :
These produce the same results:
long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
long_data %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.")
Objective: Merging two variables into one
Description: There may be a time in which we would like to combine the values of two variables. The unite() function is a convenience function to paste together multiple variable values into one. In essence, it combines two variables of a single observation into one variable.
Complement to: separate()
Function summary:
Function: unite(data, col, ..., sep = " ", remove = TRUE)
Same as: data %>% unite(col, ..., sep = " ", remove = TRUE)
Arguments:
data: data frame
col: column name of new "merged" column
...: names of columns to merge
sep: separator to use between merged values
remove: if TRUE, remove input column from output data frame
Example :
Using the separate_data dataframe we created above, shown below :
kable(head (separate_data,10),align = 'c')
| Group | Year | Time_Interval | Interval_ID | Revenue |
|---|---|---|---|---|
| 1 | 2006 | Qtr | 1 | 15 |
| 1 | 2007 | Qtr | 1 | 12 |
| 1 | 2008 | Qtr | 1 | 22 |
| 1 | 2009 | Qtr | 1 | 10 |
| 2 | 2006 | Qtr | 1 | 12 |
| 2 | 2007 | Qtr | 1 | 16 |
| 2 | 2008 | Qtr | 1 | 13 |
| 2 | 2009 | Qtr | 1 | 23 |
| 3 | 2006 | Qtr | 1 | 11 |
| 3 | 2007 | Qtr | 1 | 13 |
We can re-unite the Time_Interval and Interval_ID variables we created and re-create the original Quarter variable we had in the long_data dataframe.
unite_data <- separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".")
kable(head (unite_data,10),align = 'c')
| Group | Year | Quarter | Revenue |
|---|---|---|---|
| 1 | 2006 | Qtr.1 | 15 |
| 1 | 2007 | Qtr.1 | 12 |
| 1 | 2008 | Qtr.1 | 22 |
| 1 | 2009 | Qtr.1 | 10 |
| 2 | 2006 | Qtr.1 | 12 |
| 2 | 2007 | Qtr.1 | 16 |
| 2 | 2008 | Qtr.1 | 13 |
| 2 | 2009 | Qtr.1 | 23 |
| 3 | 2006 | Qtr.1 | 11 |
| 3 | 2007 | Qtr.1 | 13 |
Similar operations :
These produce the same results:
separate_data %>% unite(Quarter, Time_Interval, Interval_ID, sep = "_")
separate_data %>% unite(Quarter, Time_Interval, Interval_ID)
If no spearator is identified, "_" will automatically be used