Title: “dplr package role in R programming in R”

Author:“adusu” :

Date: “October 10, 2015”

Output: html_document


dplr package in R programming


Advantages

  • dplr is a package for faster data manipulation and exploration.

  • save the time when writing and reading the code.

  • performance is good.

Steps to follow before we start

STEP :1

  • install.packages(“dplyr”)

STEP :2

STEP :3

  • Let me test the data set
#LOAD PACKAGES
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(hflights)

#EXPLORE THE DATA
data("hflights")
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0

dplyr verbs

First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation:

1.filter() :

->filters rows from the data frame by some criterion

2.arrange() :

-> arranges rows ascending or descending based on the value(s) of one or more columns

3.select() :

->allows you to select one or more columns

4.mutate() :

-> allows you to add new columns to a data frame that are transformations of other columns

5.group_by() and summarize() :

both are usually used together, and allow you to compute values grouped by some other variable,

Lets Practice!

Creating Local DataFrame

#convert local dataframe
flightsdata<-tbl_df(hflights)
flightsdata
## Source: local data frame [227,496 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#if you want see more rows
print(flightsdata,n=20)
## Source: local data frame [227,496 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          2         7    1401    1501            AA
## 3   2011     1          3         1    1352    1502            AA
## 4   2011     1          4         2    1403    1513            AA
## 5   2011     1          5         3    1405    1507            AA
## 6   2011     1          6         4    1359    1503            AA
## 7   2011     1          7         5    1359    1509            AA
## 8   2011     1          8         6    1355    1454            AA
## 9   2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## 11  2011     1         11         2    1429    1539            AA
## 12  2011     1         12         3    1419    1515            AA
## 13  2011     1         13         4    1358    1501            AA
## 14  2011     1         14         5    1357    1504            AA
## 15  2011     1         15         6    1359    1459            AA
## 16  2011     1         16         7    1359    1509            AA
## 17  2011     1         17         1    1530    1634            AA
## 18  2011     1         18         2    1408    1508            AA
## 19  2011     1         19         3    1356    1503            AA
## 20  2011     1         20         4    1507    1622            AA
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)
#if you want to see more columns
data.frame(head(flightsdata))
##   Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011     1          1         6    1400    1500            AA       428
## 2 2011     1          2         7    1401    1501            AA       428
## 3 2011     1          3         1    1352    1502            AA       428
## 4 2011     1          4         2    1403    1513            AA       428
## 5 2011     1          5         3    1405    1507            AA       428
## 6 2011     1          6         4    1359    1503            AA       428
##   TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1  N576AA                60      40      -10        0    IAH  DFW      224
## 2  N557AA                60      45       -9        1    IAH  DFW      224
## 3  N541AA                70      48       -8       -8    IAH  DFW      224
## 4  N403AA                70      39        3        3    IAH  DFW      224
## 5  N492AA                62      44       -3        5    IAH  DFW      224
## 6  N262AA                64      45       -7       -1    IAH  DFW      224
##   TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1      7      13         0                         0
## 2      6       9         0                         0
## 3      5      17         0                         0
## 4      9      22         0                         0
## 5      9       9         0                         0
## 6      6      13         0                         0

1.Filter:

+filter(localdf,cond1,cond2….)

** Scenario: to view all flights on Jan 1

R aproach

flightsdata[flightsdata$Month==1 & flightsdata$DayofMonth==1, ]
## Source: local data frame [552 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

dplr approch

filter(flightsdata,Month==1,DayofMonth==1)
## Source: local data frame [552 x 21]
## 
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    (int) (int)      (int)     (int)   (int)   (int)         (chr)
## 1   2011     1          1         6    1400    1500            AA
## 2   2011     1          1         6     728     840            AA
## 3   2011     1          1         6    1631    1736            AA
## 4   2011     1          1         6    1756    2112            AA
## 5   2011     1          1         6    1012    1347            AA
## 6   2011     1          1         6    1211    1325            AA
## 7   2011     1          1         6     557     906            AA
## 8   2011     1          1         6    1824    2106            AS
## 9   2011     1          1         6     654    1124            B6
## 10  2011     1          1         6    1639    2110            B6
## ..   ...   ...        ...       ...     ...     ...           ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
##   (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
##   (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
##   CancellationCode (chr), Diverted (int)

2.Select :

** Scenario: I need these columns (DepTime ArrTime UniqueCarrier)

R aproach

flightsdata[,c("DepTime","ArrTime","UniqueCarrier") ]
## Source: local data frame [227,496 x 3]
## 
##    DepTime ArrTime UniqueCarrier
##      (int)   (int)         (chr)
## 1     1400    1500            AA
## 2     1401    1501            AA
## 3     1352    1502            AA
## 4     1403    1513            AA
## 5     1405    1507            AA
## 6     1359    1503            AA
## 7     1359    1509            AA
## 8     1355    1454            AA
## 9     1443    1554            AA
## 10    1443    1553            AA
## ..     ...     ...           ...

dplr approach

select(flightsdata,DepTime,ArrTime,UniqueCarrier)
## Source: local data frame [227,496 x 3]
## 
##    DepTime ArrTime UniqueCarrier
##      (int)   (int)         (chr)
## 1     1400    1500            AA
## 2     1401    1501            AA
## 3     1352    1502            AA
## 4     1403    1513            AA
## 5     1405    1507            AA
## 6     1359    1503            AA
## 7     1359    1509            AA
## 8     1355    1454            AA
## 9     1443    1554            AA
## 10    1443    1553            AA
## ..     ...     ...           ...
# use : to select continous columns, and use 'contains' and match by column name
select(flightsdata,FlightNum:ArrDelay ,contains("Taxi") )
## Source: local data frame [227,496 x 7]
## 
##    FlightNum TailNum ActualElapsedTime AirTime ArrDelay TaxiIn TaxiOut
##        (int)   (chr)             (int)   (int)    (int)  (int)   (int)
## 1        428  N576AA                60      40      -10      7      13
## 2        428  N557AA                60      45       -9      6       9
## 3        428  N541AA                70      48       -8      5      17
## 4        428  N403AA                70      39        3      9      22
## 5        428  N492AA                62      44       -3      9       9
## 6        428  N262AA                64      45       -7      6      13
## 7        428  N493AA                70      43       -1     12      15
## 8        428  N477AA                59      40      -16      7      12
## 9        428  N476AA                71      41       44      8      22
## 10       428  N504AA                70      45       43      6      19
## ..       ...     ...               ...     ...      ...    ...     ...

nested method

#1,2,3,4,5 are steps to execute r (somany jumps over here)
#filter(4)(select(2)(flightsdata (1),UniqueCarrier(3),DepDelay(3)), DepDelay>30(5))

filter(select(flightsdata,UniqueCarrier,DepDelay), DepDelay>30)
## Source: local data frame [22,662 x 2]
## 
##    UniqueCarrier DepDelay
##            (chr)    (int)
## 1             AA       43
## 2             AA       43
## 3             AA       90
## 4             AA       67
## 5             AA       41
## 6             AA       55
## 7             AA       40
## 8             AA       74
## 9             AA       31
## 10            AA       38
## ..           ...      ...

Alternative Method %>% : Then

flightsdata %>%
  select(UniqueCarrier,DepDelay) %>%
  filter(DepDelay>30)
## Source: local data frame [22,662 x 2]
## 
##    UniqueCarrier DepDelay
##            (chr)    (int)
## 1             AA       43
## 2             AA       43
## 3             AA       90
## 4             AA       67
## 5             AA       41
## 6             AA       55
## 7             AA       40
## 8             AA       74
## 9             AA       31
## 10            AA       38
## ..           ...      ...

3.Arrange :

** Scenario: I need to sort by DepDelay

flightsdata %>%
  select(UniqueCarrier,DepDelay) %>%
  filter(DepDelay>30) %>%
  arrange(desc(DepDelay))
## Source: local data frame [22,662 x 2]
## 
##    UniqueCarrier DepDelay
##            (chr)    (int)
## 1             CO      981
## 2             AA      970
## 3             MQ      931
## 4             UA      869
## 5             MQ      814
## 6             MQ      803
## 7             CO      780
## 8             CO      758
## 9             DL      730
## 10            MQ      691
## ..           ...      ...

4.Mutate :

** Scenario: I need speed (speed= distance/time)

# R approach

flightsdata$Speed <- flightsdata$Distance / flightsdata$AirTime * 60 
 flightsdata[,c("Speed","Distance","AirTime")]
## Source: local data frame [227,496 x 3]
## 
##       Speed Distance AirTime
##       (dbl)    (int)   (int)
## 1  336.0000      224      40
## 2  298.6667      224      45
## 3  280.0000      224      48
## 4  344.6154      224      39
## 5  305.4545      224      44
## 6  298.6667      224      45
## 7  312.5581      224      43
## 8  336.0000      224      40
## 9  327.8049      224      41
## 10 298.6667      224      45
## ..      ...      ...     ...

dplr approach

 flightsdata %>%
   select(Distance,AirTime) %>%
   mutate(Speed = Distance/AirTime * 60)
## Source: local data frame [227,496 x 3]
## 
##    Distance AirTime    Speed
##       (int)   (int)    (dbl)
## 1       224      40 336.0000
## 2       224      45 298.6667
## 3       224      48 280.0000
## 4       224      39 344.6154
## 5       224      44 305.4545
## 6       224      45 298.6667
## 7       224      43 312.5581
## 8       224      40 336.0000
## 9       224      41 327.8049
## 10      224      45 298.6667
## ..      ...     ...      ...

STORE THE NEW VARIABLE

flightsdata<- 
flightsdata %>%
   select(Distance,AirTime) %>%
   mutate(Speed = Distance/AirTime * 60)

Hope you understood!