This report was prepared by data analyst and a Statistician Martin Nyamu using on June 23rd 2019. The aim of this report is to demostrate the use of ‘dplyr’ package in R in data cleaning. For more information you can contact me on mail: analystnyamu79@gmail.com I will start by loading the relevant packages in R.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(hflights)
We will use hflights data in R ## Explore data
data("hflights") # Loading hflights data
head(hflights,5) # Displaying the first 5 observations of hflights data
A local dataframe is a rapper for a dataframe that prints nicely We first Convert our data to a local dataframe as:
flights<-tbl_df(hflights)
flights # printing only shows 10 rows as many colmns as it can fit on the screen
You can also specify if you want to see more rows as follows:
print(flights, n=15)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## 11 2011 1 11 2 1429 1539 AA 428
## 12 2011 1 12 3 1419 1515 AA 428
## 13 2011 1 13 4 1358 1501 AA 428
## 14 2011 1 14 5 1357 1504 AA 428
## 15 2011 1 15 6 1359 1459 AA 428
## # ... with 2.275e+05 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
data.frame(head(flights)) #convert to a normal data flame to see all the columns
How to use a filter function
A filter keep rows matching criteria. In Base R approach filtering is forced to repeat the dataframe’s name. However, in dplyr approach, it is simple to write and read. The command structure for all dplyr verbs shows that:
It is noted that ‘dplyr’ generally does not preserve row names.
Using base R approach to view all flights on January 1
flights[flights$Month==1 & flights$DayofMonth==1,]
Using ‘dplyr’ approach to do the same job above. Note that you can use comma or ampersand to represent ‘AND’ condition.
filter(flights, Month==1, DayofMonth==1)
Moreover, dplyr provides an easier than ‘Base R’. We can use ‘pipe’ (%in%) instead of ‘OR’ (|) condition.
filter(flights,UniqueCarrier=="AA" | UniqueCarrier=="UA")
filter(flights,UniqueCarrier %in% c("AA", "UA")) #using %in% operator
Base R approach is arkward to type and to read
flights[,c("DepTime", "ArrTime","FlightNum")] # Base R approach to select Deptime, ArrTime, and FlightNum columns
On the other hand, ‘dplyr’ approach can also use similar syntax to filter like a ‘SELECT’ in ‘SQL’
flights[,c("DepTime", "ArrTime","FlightNum")] # dplyr using c() function
select(flights,DepTime,ArrTime,FlightNum) # dplyr using select () function
In ‘dplyr’approach, we can use colon to select multiple contiguous columns, and use contains to match columns by name. Note that ’starts with’, ‘ends_with’, and matches (for regular expressions). Besides, it can also be used to match columns by name.
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
The usual way to perform multiple operations in one line is by nesting. In addition, we can also write commands in a natural order by using teh ‘%>%’ infix operator. The operator can be pronounced as ‘then’.
‘Nesting method’ to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes.
filter(select(flights, UniqueCarrier, DepDelay),DepDelay >60)
Chaining Method
flights%>%
select(UniqueCarrier, DepDelay)%>%
filter(DepDelay>60)
Unlike nesting method, chaining increases readability significantly when there are many commands. The operator using in chaining is automatically imported from the magrittr package. Therefore, chaining can be used to replace nesting in R commands outside of dplyr.
An Example to demonstrate the advantage of using chaining method
x1<-1:5; x2<-2:6
sqrt(sum((x1-x2)^2)) # create two vectors and calculate euclidian distance between them
## [1] 2.236068
(x1-x2)^2 %>% sum() %>% sqrt() # chaining method
## [1] 2.236068
‘Base R’ approach to select uniqueCarrier and DepDelay column and sort by DepDelay
flights[order(flights$DepDelay), c("UniqueCarrier","DepDelay")]
Using ‘dplyr’ approach:
flights %>%
select(UniqueCarrier,DepDelay) %>%
arrange(DepDelay)
Reordering in ‘dplyr’ approach, we can specify the descending order by using ‘desc()’
flights %>%
select(UniqueCarrier,DepDelay) %>%
arrange(desc(DepDelay)) # use desc for descending
Using ‘dplyr’ approach we can create new variables that are functions of existing variables.
base R approach to crate a new variable (in mph)
flights$speed<-flights$Distance/flights$AirTime*60
flights[,c("Distance","AirTime","speed")]
dplyr approach. The method prints the new variable but does not store it.
flights %>%
select(Distance, AirTime) %>%
mutate(speed=Distance/AirTime*60) # Creating the new variable 'speed' using mutate() function
After creating the new variable, we need to store it in our dataframe as shown below:
flights<-flights%>%mutate(speed=Distance/AirTime*60) #store the new variable
‘Summarise’is primarily usesful with data that has been grouped by one or more varibles. Therefore, ’group_by’ creates the groups that will be operated on. Futhermore, ‘summarise’ uses the provided aggration function to summarise each group.
Let compare the ‘Base R’ and ‘dplyr’ approaches to calculate the average arrival delay to each destination.
Base R approach
head(with(flights, tapply(ArrDelay,Dest,mean, na.rm=TRUE)))
## ABQ AEX AGS AMA ANC ASE
## 7.226259 5.839437 4.000000 6.840095 26.080645 6.794643
dplyr approach In this appoach we create a table grouped by Dest, and then ‘summarise’ each group by taking the mean of arrdelay.
flights%>%
group_by(Dest) %>%
summarise(avg_delay=mean(ArrDelay, na.rm=TRUE))
‘dplyr’ approach provides another function ‘summarise_each’ that allows you to apply the same summary function to multiple columns once. Approach also uses ‘mutate_each’ when working with multiple columns.
flights %>%
group_by(UniqueCarrier) %>% # for each carrier, calculate the percentage of flights cancelled or diverted
summarise_each(funs(mean), Cancelled, Diverted)
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
##
## # Before:
## funs(name = f(.))
##
## # After:
## list(name = ~ f(.))
## This warning is displayed once per session.
flights %>%
group_by(UniqueCarrier) %>% #for each carrier, calcualte the minimum and maximum arrival and departure delays
summarise_each(funs(min(., na.rm=TRUE),max(., na.rm=TRUE)), matches("Delay"))
‘dplyr’ approach also allows you to use helper function, ‘n()’ within ‘summarise’ to count the number of rows in a group. Moreover, it can also use helper function, ‘n_distinct(vector)’ to count the number of unique items in a vector.
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count=n()) %>%
arrange(desc(flight_count)) #for each day of the year, count the total number of flights and sort in descending order
The task above of counting the number of observations in a group can be made more simple with the ‘tally’ function.
flights %>%
group_by(Month, DayofMonth) %>%
tally(sort=T) # rewrite more simple with the 'tally' function
** More Examples** *For each destination, count the total number of flights and the number distinct planes that flow there.
flights %>%
group_by(Dest) %>%
summarise(flight_count=n(), plane_count=n_distinct (TailNum))
*For each destination, show the numberof cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select((Cancelled)) %>%
table() %>%
head()
## Adding missing grouping variables: `Dest`
## Cancelled
## Dest 0 1
## ABQ 2787 25
## AEX 712 12
## AGS 1 0
## AMA 1265 32
## ANC 125 0
## ASE 120 5
Like ‘mean’, aggregation function takes ‘n’ inputs and returns 1 value. Other functions like (‘lead’ and ‘lag’), and cumulative aggregates like ‘cummean’ takes ‘n’ inputs and return n values.
For instance, for each month, calculate the number of flights and the change from the previous month
##
flights %>%
group_by(Month) %>%
summarise(flight_count=n()) %>%
mutate(change=flight_count-lag(flight_count))
This can be rewrited in more simply way with the ‘tally’ function as:
flights %>%
group_by(Month) %>%
tally() %>%
mutate(change=n-lag(n))
We can use pipeling to randomly sample a fixed number of rows, without replacement
flights %>% sample_n(5)
Or with replacement
flights %>% sample_frac(0.25, replace=T)
We can also use ‘Base R’ or ‘dplyr’ approaches to view the structure of an object. Nevertheless, ‘dplyr’ approach has better formatting and unlike ‘Base R’ it adapts to your screen width.
str(flights) # base R approach to view the structure of an object
## Classes 'tbl_df', 'tbl' and 'data.frame': 227496 obs. of 22 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ speed : num 336 299 280 345 305 ...
glimpse(flights) # dplyr approach: better formatting, and adapts to your screen width
## Observations: 227,496
## Variables: 22
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ speed <dbl> 336.0000, 298.6667, 280.0000, 344.6154, 305....
Reference
Peng, R. (2012). Exploratory data analysis with R. Lulu. com.