Data clening using dplyr package in R

Let Get Started

We will use hflights data in R ## Explore data

data("hflights") # Loading hflights data
head(hflights,5) # Displaying the first 5 observations of hflights data

A local dataframe is a rapper for a dataframe that prints nicely We first Convert our data to a local dataframe as:

flights<-tbl_df(hflights)
flights # printing only shows 10 rows as many colmns as it can fit on the screen

You can also specify if you want to see more rows as follows:

print(flights, n=15)

## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## 11  2011     1         11         2    1429    1539 AA                  428
## 12  2011     1         12         3    1419    1515 AA                  428
## 13  2011     1         13         4    1358    1501 AA                  428
## 14  2011     1         14         5    1357    1504 AA                  428
## 15  2011     1         15         6    1359    1459 AA                  428
## # ... with 2.275e+05 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

data.frame(head(flights)) #convert to  a normal data flame to see all the columns

Filter() Function

How to use a filter function

A filter keep rows matching criteria. In Base R approach filtering is forced to repeat the dataframe’s name. However, in dplyr approach, it is simple to write and read. The command structure for all dplyr verbs shows that:

First argument is a dataframe
Return value is a dataframe
And nothing is modified in place

It is noted that ‘dplyr’ generally does not preserve row names.

Comparision between ‘Base R’ and ‘dplyr’ approach

Using base R approach to view all flights on January 1

flights[flights$Month==1 & flights$DayofMonth==1,]

Using ‘dplyr’ approach to do the same job above. Note that you can use comma or ampersand to represent ‘AND’ condition.

filter(flights, Month==1, DayofMonth==1)

Moreover, dplyr provides an easier than ‘Base R’. We can use ‘pipe’ (%in%) instead of ‘OR’ (|) condition.

filter(flights,UniqueCarrier=="AA" | UniqueCarrier=="UA")

filter(flights,UniqueCarrier %in% c("AA", "UA")) #using %in% operator

Selecting columns by name

Base R approach is arkward to type and to read

flights[,c("DepTime", "ArrTime","FlightNum")] # Base R approach to select Deptime, ArrTime, and FlightNum columns

On the other hand, ‘dplyr’ approach can also use similar syntax to filter like a ‘SELECT’ in ‘SQL’

flights[,c("DepTime", "ArrTime","FlightNum")] # dplyr using c() function

select(flights,DepTime,ArrTime,FlightNum) # dplyr using select () function

In ‘dplyr’approach, we can use colon to select multiple contiguous columns, and use contains to match columns by name. Note that ’starts with’, ‘ends_with’, and matches (for regular expressions). Besides, it can also be used to match columns by name.

select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))

Chaining or Pipelining

The usual way to perform multiple operations in one line is by nesting. In addition, we can also write commands in a natural order by using teh ‘%>%’ infix operator. The operator can be pronounced as ‘then’.

‘Nesting method’ to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes.

filter(select(flights, UniqueCarrier, DepDelay),DepDelay >60)

Chaining Method

flights%>%
  select(UniqueCarrier, DepDelay)%>%
  filter(DepDelay>60)

Unlike nesting method, chaining increases readability significantly when there are many commands. The operator using in chaining is automatically imported from the magrittr package. Therefore, chaining can be used to replace nesting in R commands outside of dplyr.

An Example to demonstrate the advantage of using chaining method

x1<-1:5; x2<-2:6
sqrt(sum((x1-x2)^2)) # create two vectors and calculate euclidian distance between them

## [1] 2.236068

(x1-x2)^2 %>% sum() %>% sqrt() # chaining method

## [1] 2.236068

Arrange or reorder

‘Base R’ approach to select uniqueCarrier and DepDelay column and sort by DepDelay

flights[order(flights$DepDelay), c("UniqueCarrier","DepDelay")]

Using ‘dplyr’ approach:

flights %>%
  select(UniqueCarrier,DepDelay) %>%
  arrange(DepDelay)

Reordering in ‘dplyr’ approach, we can specify the descending order by using ‘desc()’

flights %>%
  select(UniqueCarrier,DepDelay) %>%
  arrange(desc(DepDelay)) # use desc for descending

Mutate: Add new variables

Using ‘dplyr’ approach we can create new variables that are functions of existing variables.

base R approach to crate a new variable (in mph)

flights$speed<-flights$Distance/flights$AirTime*60
flights[,c("Distance","AirTime","speed")]

dplyr approach. The method prints the new variable but does not store it.

flights %>%
  select(Distance, AirTime) %>%
  mutate(speed=Distance/AirTime*60) # Creating the new variable 'speed' using mutate() function

After creating the new variable, we need to store it in our dataframe as shown below:

flights<-flights%>%mutate(speed=Distance/AirTime*60) #store the new variable

Summarise: Reduce varibles to values

‘Summarise’is primarily usesful with data that has been grouped by one or more varibles. Therefore, ’group_by’ creates the groups that will be operated on. Futhermore, ‘summarise’ uses the provided aggration function to summarise each group.

Let compare the ‘Base R’ and ‘dplyr’ approaches to calculate the average arrival delay to each destination.

Base R approach

head(with(flights, tapply(ArrDelay,Dest,mean, na.rm=TRUE)))

##       ABQ       AEX       AGS       AMA       ANC       ASE 
##  7.226259  5.839437  4.000000  6.840095 26.080645  6.794643

dplyr approach In this appoach we create a table grouped by Dest, and then ‘summarise’ each group by taking the mean of arrdelay.

flights%>%
  group_by(Dest) %>%
  summarise(avg_delay=mean(ArrDelay, na.rm=TRUE))

‘dplyr’ approach provides another function ‘summarise_each’ that allows you to apply the same summary function to multiple columns once. Approach also uses ‘mutate_each’ when working with multiple columns.

flights %>%
  group_by(UniqueCarrier) %>%  # for each carrier, calculate the percentage of flights cancelled or diverted
  summarise_each(funs(mean), Cancelled, Diverted)

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## please use list() instead
## 
##   # Before:
##   funs(name = f(.))
## 
##   # After: 
##   list(name = ~ f(.))
## This warning is displayed once per session.

flights %>% 
  group_by(UniqueCarrier) %>%  #for each carrier, calcualte the minimum and maximum arrival and departure delays
  summarise_each(funs(min(., na.rm=TRUE),max(., na.rm=TRUE)), matches("Delay"))

‘dplyr’ approach also allows you to use helper function, ‘n()’ within ‘summarise’ to count the number of rows in a group. Moreover, it can also use helper function, ‘n_distinct(vector)’ to count the number of unique items in a vector.

flights %>%
  group_by(Month, DayofMonth) %>%
  summarise(flight_count=n()) %>%
  arrange(desc(flight_count)) #for each day of the year, count the total number of flights and sort in descending order

The task above of counting the number of observations in a group can be made more simple with the ‘tally’ function.

flights %>% 
  group_by(Month, DayofMonth) %>%
  tally(sort=T)  # rewrite more simple with the 'tally' function

** More Examples** *For each destination, count the total number of flights and the number distinct planes that flow there.

flights %>%
  group_by(Dest) %>%
  summarise(flight_count=n(), plane_count=n_distinct (TailNum))

*For each destination, show the numberof cancelled and not cancelled flights

flights %>%
  group_by(Dest) %>%
  select((Cancelled)) %>%
  table() %>%
  head()

## Adding missing grouping variables: `Dest`

##      Cancelled
## Dest     0    1
##   ABQ 2787   25
##   AEX  712   12
##   AGS    1    0
##   AMA 1265   32
##   ANC  125    0
##   ASE  120    5

Window Functions

Like ‘mean’, aggregation function takes ‘n’ inputs and returns 1 value. Other functions like (‘lead’ and ‘lag’), and cumulative aggregates like ‘cummean’ takes ‘n’ inputs and return n values.

For instance, for each month, calculate the number of flights and the change from the previous month

## 
flights %>%
  group_by(Month) %>%
  summarise(flight_count=n()) %>%
  mutate(change=flight_count-lag(flight_count))

This can be rewrited in more simply way with the ‘tally’ function as:

flights %>%
  group_by(Month) %>%
  tally() %>%
  mutate(change=n-lag(n))

Other Useful Convenience Functions

We can use pipeling to randomly sample a fixed number of rows, without replacement

flights %>% sample_n(5)

Or with replacement

flights %>% sample_frac(0.25, replace=T)

We can also use ‘Base R’ or ‘dplyr’ approaches to view the structure of an object. Nevertheless, ‘dplyr’ approach has better formatting and unlike ‘Base R’ it adapts to your screen width.

str(flights) # base R approach to view the structure of an object

## Classes 'tbl_df', 'tbl' and 'data.frame':    227496 obs. of  22 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ speed            : num  336 299 280 345 305 ...

glimpse(flights) # dplyr approach: better formatting, and adapts to your screen width

## Observations: 227,496
## Variables: 22
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ speed             <dbl> 336.0000, 298.6667, 280.0000, 344.6154, 305....

Reference

Peng, R. (2012). Exploratory data analysis with R. Lulu. com.

DATA CLEANING IN RSTUDIO

Statistician Martin Nyamu

June 23, 2019

Introduction