Tidy Data - Three Approaches

Intro

This page demonstrates three different ways of tidying up the data.
1. Natural way - Using Tidyr package
2. Fast(in terms of runtime) way - Using data.table
3. Hard way - Without using any built-in function/package

Most of the operations are explained as code comments within the code chunck.

Natural Way - Using Tidyr Package

I choose Jason Joseph’s problem as that was a good wide data candidate for Tidyr package.

Jason has extracted the data from http://www.census.gov/foreign-trade/data/index.html and provided a nice sample file. Thanks Jason for the csv.

library(tidyr)
library(dplyr)
library(knitr)
trade_wide=read.table(file="https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/International Trade in Goods and Services.csv",stringsAsFactors = F,header = T,na.strings = "",sep = ",")
print(trade_wide)

##      Period        X   X2013     X.1   X2014     X.2   X2015     X.3
## 1      <NA>     <NA> Exports Imports Exports Imports Exports Imports
## 2      <NA>     <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
## 3   January   Goods  131,228 191,745 133,738 193,706 129,292 192,242
## 4      <NA> Services   56883  37,995  59,141  38,635  59,733  40,371
## 5      <NA>     <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
## 6  February   Goods   132084 193,731 131,768 193,060 126,329 184,370
## 7      <NA> Services  56,828  37,935  57,726  39,270  59,560  40,057
## 8      <NA>     <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
## 9     March   Goods  130,093 186,326 135,923 198,973 127,183 198,347
## 10     <NA> Services  56,647  37,938  58,837  38,908  59,583  40,581
## 11     <NA>     <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>
## 12    April   Goods  131,468 189,507 135,556 199,877 129,376 190,967
## 13     <NA> Services  56,592  38,318  59,468  39,417  59,766  40,432

#Remove Rows which has all the values as NA
trade_wide=trade_wide[rowSums(is.na(trade_wide))!=8,]

#Rename the columns in line with the values of the column
colnames(trade_wide)=c("Month","Type","Exports 2013","Imports 2013","Exports 2014","Imports 2014","Exports 2015","Imports 2015")

#Now let's use tidyr and dplyr to transform the data

trade=trade_wide %>% 
  gather(cat,value,3:8) %>% #Collapse column 3 to 8 into two column (key and value pair)
  separate(cat,c("Trade","year")) %>% #Split the Trade type and year which is stored in one column to two separate columns
  fill(1) %>% #Fill missing values with previous row values
  filter(!is.na(Type)) %>% #Remove rows with type NA
  mutate(value=as.numeric(gsub(",","",value))) #Remove thousand separator in numbers  

head(trade)

##      Month     Type   Trade year  value
## 1  January   Goods  Exports 2013 131228
## 2  January Services Exports 2013  56883
## 3 February   Goods  Exports 2013 132084
## 4 February Services Exports 2013  56828
## 5    March   Goods  Exports 2013 130093
## 6    March Services Exports 2013  56647

#Now that we got the data in desired format, let's answer the question average trade by month and type
kable(trade %>% group_by(Month,Type) %>% summarise(Avg=mean(value)))

Month	Type	Avg
April	Goods	162791.83
April	Services	48998.83
February	Goods	160223.67
February	Services	48562.67
January	Goods	161991.83
January	Services	48793.00
March	Goods	162807.50
March	Services	48749.00

Fast way - Using data.table

For this approach I choose to work with the data provided by Christina Taylor. This data require some transformation and join and so I wanted to try that out in data.table. In general data.table operations tend to execute much faster than tidyr.

library(data.table)
library(ggplot2)

jobopening=fread("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/JobOpen.csv") #Note: fread is faster than read.csv and does a better job in detecting separator and column classes
head(jobopening)

##    Opening ID           Role Location          Recruiter Target Open Date
## 1:          1 Administration       US              Bilbo       11/20/2014
## 2:          2 Administration       US              Jerry       11/18/2014
## 3:          3 Administration       US            Gandalf       11/30/2014
## 4:          4 Administration       US Thorin Oakenshield        10/4/2014
## 5:          5 Administration       US             Shamus       10/13/2014
## 6:          6 Administration       US              Balin        12/2/2014
##    Target Start Date Actual Open Date Offer Accepted Date
## 1:         2/18/2015       11/23/2014          12/25/2014
## 2:         2/16/2015       11/23/2014          12/14/2014
## 3:         2/28/2015        12/6/2014           2/10/2015
## 4:          1/2/2015        10/4/2014          12/23/2014
## 5:         1/11/2015       10/22/2014          11/19/2014
## 6:          3/2/2015        12/8/2014           2/12/2015
##    Actual Start Date employee ID of hired guy
## 1:          1/7/2015                        1
## 2:        12/21/2014                        2
## 3:         2/20/2015                        3
## 4:        12/30/2014                        4
## 5:        11/28/2014                        5
## 6:         2/22/2015                        6

jobopening[,`Actual Start Date`:=as.Date(`Actual Start Date`,"%m/%d/%Y")] #Convert date string to Date type

#Compute Quater and year of actual start Date
jobopening[,`:=`(Qtr=paste0("Q",(as.numeric(format(`Actual Start Date`,"%m"))-1)%/%3+1),Year=format(`Actual Start Date`,"%Y"))] 

head(jobopening)

##    Opening ID           Role Location          Recruiter Target Open Date
## 1:          1 Administration       US              Bilbo       11/20/2014
## 2:          2 Administration       US              Jerry       11/18/2014
## 3:          3 Administration       US            Gandalf       11/30/2014
## 4:          4 Administration       US Thorin Oakenshield        10/4/2014
## 5:          5 Administration       US             Shamus       10/13/2014
## 6:          6 Administration       US              Balin        12/2/2014
##    Target Start Date Actual Open Date Offer Accepted Date
## 1:         2/18/2015       11/23/2014          12/25/2014
## 2:         2/16/2015       11/23/2014          12/14/2014
## 3:         2/28/2015        12/6/2014           2/10/2015
## 4:          1/2/2015        10/4/2014          12/23/2014
## 5:         1/11/2015       10/22/2014          11/19/2014
## 6:          3/2/2015        12/8/2014           2/12/2015
##    Actual Start Date employee ID of hired guy Qtr Year
## 1:        2015-01-07                        1  Q1 2015
## 2:        2014-12-21                        2  Q4 2014
## 3:        2015-02-20                        3  Q1 2015
## 4:        2014-12-30                        4  Q4 2014
## 5:        2014-11-28                        5  Q4 2014
## 6:        2015-02-22                        6  Q1 2015

hctarget=fread("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/hctarget.csv")[-.N]
head(hctarget)

##              Role Location Current_HeadCount Q1_2015 Q2_2015 Q3_2015
## 1: Administration       US                33      39      47      55
## 2: Administration       DZ                33      39      47      55
## 3: Administration       GF                34      40      48      57
## 4:          Sales       US                66      78      93     111
## 5:          Sales       DZ                66      78      93     111
## 6:          Sales       GF                67      80      95     113
##    Q4_2015
## 1:      66
## 2:      66
## 3:      68
## 4:     132
## 5:     132
## 6:     134

#Collapse Quaterly columns to Qtr/Value column
hctarget=melt(hctarget,measure.vars = c("Q1_2015","Q2_2015","Q3_2015","Q4_2015"),variable.name="Qtr",value.name="Target")

#Split Qtr column to Qtr/Year
hctarget[,c("Qtr","Year"):=tstrsplit(Qtr,"_")]
head(hctarget)

##              Role Location Current_HeadCount Qtr Target Year
## 1: Administration       US                33  Q1     39 2015
## 2: Administration       DZ                33  Q1     39 2015
## 3: Administration       GF                34  Q1     40 2015
## 4:          Sales       US                66  Q1     78 2015
## 5:          Sales       DZ                66  Q1     78 2015
## 6:          Sales       GF                67  Q1     80 2015

#Set Key values so that we can do data table joins
setkey(jobopening,Role,Location,Qtr,Year)
setkey(hctarget,Role,Location,Qtr,Year)

#Compute total hires and extract only required column
hired=jobopening[!is.na(Year),.(hires=.N),by=.(Role,Location,Qtr,Year)]
#Join hired data and target data
results=hired[hctarget]
#Convert NAs to zero
results$hires[is.na(results$hires)]=0

head(results)

##              Role Location Qtr Year hires Current_HeadCount Target
## 1: Administration       DZ  Q1 2015     6                33     39
## 2: Administration       DZ  Q2 2015     3                33     47
## 3: Administration       DZ  Q3 2015     0                33     55
## 4: Administration       DZ  Q4 2015     0                33     66
## 5: Administration       GF  Q1 2015     7                34     40
## 6: Administration       GF  Q2 2015     4                34     48

#Heat Map of (Actual Vs Target) by year and Quarter using ggplot facet
ggplot(results,aes(Location, Role)) + geom_tile(aes(fill=hires/(Target-Current_HeadCount)*100)) + scale_fill_gradient2(low="white",high="green") + facet_grid(~Year+Qtr) + labs(fill="Achieved%") + theme(axis.text.x=element_text(size=12),axis.text.y=element_text(size=12))

Hard way - Without using any built-in function/package

What if the data that we get is so messy and we couldn’t use any package? I wanted to try and tidy up data without using any package and see how it goes. I choose the data posted by Maxwell Wagner Th data is sourced from https://en.wikipedia.org/wiki/Motocross_des_Nations.

For this problem I came up with solution assuming that tidyr and ddplyr are not available and I have to do the tidy part myself.

I defined a function that pretty much performs the tidyr::gather function

my_gather=function(x,cols,keyname,valuename){
  all_keyvalue=NULL
  for (y in cols){c
    varname=colnames(x)[y]
    keyvalue=cbind(varname,x[,y,with=FALSE],x[,.I])
    colnames(keyvalue)=c(keyname,valuename,"row")#The extra row column is to keep track of the row to which this column needs to be attached to
    all_keyvalue=rbind(all_keyvalue,keyvalue)
  }
  all_keyvalue
  
}

Then, I performed the data cleansing and tidying as shown below

moto_wide=fread("https://raw.githubusercontent.com/mkds/MSDA/master/IS607/data/motocross.csv",stringsAsFactors = F)
head(moto_wide)

##          Team  \n2000  \n2001  \n2002  \n2003  \n2004  \n2005  \n2006
## 1:  Australia     7th     6th            14th    13th    16th    11th
## 2:    Austria    13th                            22nd    23rd    21st
## 3:    Belarus                            27th                        
## 4:    Belgium     3rd     2nd     2nd     1st     1st     3rd     2nd
## 5:     Brazil    23rd    23rd                                        
## 6:     Canada    28th    26th                     8th    17th    18th
##     \n2007  \n2008  \n2009  \n2010  \n2011  \n2012  \n2013  \n2014  \n2015
## 1:    20th     6th     7th     6th     3rd    10th     4th    14th     7th
## 2:                    23rd    12th    20th    24th    12th            10th
## 3:                                                                        
## 4:     3rd     3rd     3rd     2nd     5th     2nd     1st     2nd     3rd
## 5:    16th    14th    14th    18th    27th    32nd    31st    27th    27th
## 6:    10th    13th    27th    16th                    23rd    17th

keyvalues=my_gather(moto_wide,2:17,"year","position")#Collapse columns to key value pair  
head(keyvalues)

##       year position row
## 1:  \n2000      7th   1
## 2:  \n2000     13th   2
## 3:  \n2000            3
## 4:  \n2000      3rd   4
## 5:  \n2000     23rd   5
## 6:  \n2000     28th   6

#Extract Numbers from the strings
keyvalues[,`:=`(year=as.numeric(gsub("\n","",year)),position=as.numeric(gsub("(\\d+).*","\\1",position)))]
#Add row number as column so that we can join by row
moto_wide[,row:=.I]
setkey(moto_wide,row)
setkey(keyvalues,row)
#Join keyvalues with the required values from original data
moto=keyvalues[moto_wide[,.(Team,row)]][,.(Team,year,position)]
head(moto)

##          Team year position
## 1:  Australia 2000        7
## 2:  Australia 2001        6
## 3:  Australia 2002       NA
## 4:  Australia 2003       14
## 5:  Australia 2004       13
## 6:  Australia 2005       16

#Note that all above function could be performed easily using tidyr as given below
# moto=moto_wide %>% gather(year,position,2:17) %>% 
#      mutate(year=as.numeric(gsub("X..","",year)),position=as.numeric(gsub("(\\d+).*","\\1",position)))
#

#Now that we got the long data let's try to answer the question of  average finishes and average standing
kable(moto %>% 
  group_by(Team) %>% 
  summarize(finishes=sum(!is.na(position)),Avg_position=mean(position,na.rm=T)) %>%
  arrange(Avg_position))

Team	finishes	Avg_position
USA	13	1.538461
Belgium	16	2.375000
France	16	4.500000
Great Britain	15	6.133333
Italy	16	8.312500
Germany	16	8.937500
Australia	15	9.600000
Estonia	16	9.687500
South Africa	11	10.454546
New Zealand	14	11.571429
Spain	16	12.625000
Finland	15	12.866667
Switzerland	15	13.133333
Netherlands	13	13.461538
Portugal	16	13.875000
Japan	13	15.384615
Czech Republic	15	16.600000
Ireland	16	16.750000
Sweden	16	16.750000
Denmark	14	16.928571
Latvia	16	17.875000
Austria	10	18.000000
Chile	1	18.000000
Russia	13	18.153846
Canada	11	18.454545
Slovenia	14	21.071429
Ecuador	2	21.500000
Norway	11	21.727273
Puerto Rico	9	22.444444
Brazil	11	22.909091
Costa Rica	2	25.000000
Guatemala	4	25.250000
Poland	3	25.666667
Morocco	2	26.000000
Slovakia	8	26.000000
Venezuela	10	26.200000
Belarus	1	27.000000
Mexico	2	27.000000
Hungary	8	27.750000
Croatia	10	27.800000
Lithuania	11	27.818182
Ukraine	8	27.875000
Philippines	1	29.000000
Dominican Republic	1	30.000000
Greece	10	30.500000
Iceland	7	31.142857
Romania	1	32.000000
China	2	32.500000
Cyprus	2	32.500000
Mongolia	2	32.500000
Luxembourg	4	32.750000
Thailand	4	35.000000
Kuwait	1	36.000000
Israel	3	36.666667

Tidy Data - Three Approaches

Mohan Kandaraj

Convert untidy Data to Tidy Data - Three approaches

Intro

Natural Way - Using Tidyr Package

Fast way - Using data.table

Hard way - Without using any built-in function/package