Title: “dplr package role in R programming in R”
Author:“adusu” :
Date: “October 10, 2015”
Output: html_document
dplr is a package for faster data manipulation and exploration.
save the time when writing and reading the code.
performance is good.
STEP :1
install.packages(“dplyr”)
STEP :2
we have to import sample dataset:)
install.packages(“hflights”,repos=‘http://cran.us.r-project.org’)
STEP :3
#LOAD PACKAGES
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(hflights)
#EXPLORE THE DATA
data("hflights")
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
First, dplyr works on data frames, and introduces a few “verbs” that allow you to do some basic manipulation:
1.filter() :
->filters rows from the data frame by some criterion
2.arrange() :
-> arranges rows ascending or descending based on the value(s) of one or more columns
3.select() :
->allows you to select one or more columns
4.mutate() :
-> allows you to add new columns to a data frame that are transformations of other columns
5.group_by() and summarize() :
both are usually used together, and allow you to compute values grouped by some other variable,
#convert local dataframe
flightsdata<-tbl_df(hflights)
flightsdata
## Source: local data frame [227,496 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## (int) (int) (int) (int) (int) (int) (chr)
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## .. ... ... ... ... ... ... ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
## (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
#if you want see more rows
print(flightsdata,n=20)
## Source: local data frame [227,496 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## (int) (int) (int) (int) (int) (int) (chr)
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## 11 2011 1 11 2 1429 1539 AA
## 12 2011 1 12 3 1419 1515 AA
## 13 2011 1 13 4 1358 1501 AA
## 14 2011 1 14 5 1357 1504 AA
## 15 2011 1 15 6 1359 1459 AA
## 16 2011 1 16 7 1359 1509 AA
## 17 2011 1 17 1 1530 1634 AA
## 18 2011 1 18 2 1408 1508 AA
## 19 2011 1 19 3 1356 1503 AA
## 20 2011 1 20 4 1507 1622 AA
## .. ... ... ... ... ... ... ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
## (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
#if you want to see more columns
data.frame(head(flightsdata))
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance
## 1 N576AA 60 40 -10 0 IAH DFW 224
## 2 N557AA 60 45 -9 1 IAH DFW 224
## 3 N541AA 70 48 -8 -8 IAH DFW 224
## 4 N403AA 70 39 3 3 IAH DFW 224
## 5 N492AA 62 44 -3 5 IAH DFW 224
## 6 N262AA 64 45 -7 -1 IAH DFW 224
## TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 1 7 13 0 0
## 2 6 9 0 0
## 3 5 17 0 0
## 4 9 22 0 0
## 5 9 9 0 0
## 6 6 13 0 0
1.Filter:
filters rows from the data frame by some criterion
syntax :
+filter(localdf,cond1,cond2….)
** Scenario: to view all flights on Jan 1
flightsdata[flightsdata$Month==1 & flightsdata$DayofMonth==1, ]
## Source: local data frame [552 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## (int) (int) (int) (int) (int) (int) (chr)
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 1 6 728 840 AA
## 3 2011 1 1 6 1631 1736 AA
## 4 2011 1 1 6 1756 2112 AA
## 5 2011 1 1 6 1012 1347 AA
## 6 2011 1 1 6 1211 1325 AA
## 7 2011 1 1 6 557 906 AA
## 8 2011 1 1 6 1824 2106 AS
## 9 2011 1 1 6 654 1124 B6
## 10 2011 1 1 6 1639 2110 B6
## .. ... ... ... ... ... ... ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
## (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
filter(flightsdata,Month==1,DayofMonth==1)
## Source: local data frame [552 x 21]
##
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## (int) (int) (int) (int) (int) (int) (chr)
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 1 6 728 840 AA
## 3 2011 1 1 6 1631 1736 AA
## 4 2011 1 1 6 1756 2112 AA
## 5 2011 1 1 6 1012 1347 AA
## 6 2011 1 1 6 1211 1325 AA
## 7 2011 1 1 6 557 906 AA
## 8 2011 1 1 6 1824 2106 AS
## 9 2011 1 1 6 654 1124 B6
## 10 2011 1 1 6 1639 2110 B6
## .. ... ... ... ... ... ... ...
## Variables not shown: FlightNum (int), TailNum (chr), ActualElapsedTime
## (int), AirTime (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest
## (chr), Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
## CancellationCode (chr), Diverted (int)
2.Select :
pick columns by name
syntax :
select(localdf,col1,col2….)
** Scenario: I need these columns (DepTime ArrTime UniqueCarrier)
flightsdata[,c("DepTime","ArrTime","UniqueCarrier") ]
## Source: local data frame [227,496 x 3]
##
## DepTime ArrTime UniqueCarrier
## (int) (int) (chr)
## 1 1400 1500 AA
## 2 1401 1501 AA
## 3 1352 1502 AA
## 4 1403 1513 AA
## 5 1405 1507 AA
## 6 1359 1503 AA
## 7 1359 1509 AA
## 8 1355 1454 AA
## 9 1443 1554 AA
## 10 1443 1553 AA
## .. ... ... ...
select(flightsdata,DepTime,ArrTime,UniqueCarrier)
## Source: local data frame [227,496 x 3]
##
## DepTime ArrTime UniqueCarrier
## (int) (int) (chr)
## 1 1400 1500 AA
## 2 1401 1501 AA
## 3 1352 1502 AA
## 4 1403 1513 AA
## 5 1405 1507 AA
## 6 1359 1503 AA
## 7 1359 1509 AA
## 8 1355 1454 AA
## 9 1443 1554 AA
## 10 1443 1553 AA
## .. ... ... ...
# use : to select continous columns, and use 'contains' and match by column name
select(flightsdata,FlightNum:ArrDelay ,contains("Taxi") )
## Source: local data frame [227,496 x 7]
##
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay TaxiIn TaxiOut
## (int) (chr) (int) (int) (int) (int) (int)
## 1 428 N576AA 60 40 -10 7 13
## 2 428 N557AA 60 45 -9 6 9
## 3 428 N541AA 70 48 -8 5 17
## 4 428 N403AA 70 39 3 9 22
## 5 428 N492AA 62 44 -3 9 9
## 6 428 N262AA 64 45 -7 6 13
## 7 428 N493AA 70 43 -1 12 15
## 8 428 N477AA 59 40 -16 7 12
## 9 428 N476AA 71 41 44 8 22
## 10 428 N504AA 70 45 43 6 19
## .. ... ... ... ... ... ... ...
#1,2,3,4,5 are steps to execute r (somany jumps over here)
#filter(4)(select(2)(flightsdata (1),UniqueCarrier(3),DepDelay(3)), DepDelay>30(5))
filter(select(flightsdata,UniqueCarrier,DepDelay), DepDelay>30)
## Source: local data frame [22,662 x 2]
##
## UniqueCarrier DepDelay
## (chr) (int)
## 1 AA 43
## 2 AA 43
## 3 AA 90
## 4 AA 67
## 5 AA 41
## 6 AA 55
## 7 AA 40
## 8 AA 74
## 9 AA 31
## 10 AA 38
## .. ... ...
Alternative Method %>% : Then
flightsdata %>%
select(UniqueCarrier,DepDelay) %>%
filter(DepDelay>30)
## Source: local data frame [22,662 x 2]
##
## UniqueCarrier DepDelay
## (chr) (int)
## 1 AA 43
## 2 AA 43
## 3 AA 90
## 4 AA 67
## 5 AA 41
## 6 AA 55
## 7 AA 40
## 8 AA 74
## 9 AA 31
## 10 AA 38
## .. ... ...
3.Arrange :
arranges rows ascending or descending based on the value(s) of one or more columns
like sort
syntax :
arrange(asc/desc col1,col2….)
** Scenario: I need to sort by DepDelay
flightsdata %>%
select(UniqueCarrier,DepDelay) %>%
filter(DepDelay>30) %>%
arrange(desc(DepDelay))
## Source: local data frame [22,662 x 2]
##
## UniqueCarrier DepDelay
## (chr) (int)
## 1 CO 981
## 2 AA 970
## 3 MQ 931
## 4 UA 869
## 5 MQ 814
## 6 MQ 803
## 7 CO 780
## 8 CO 758
## 9 DL 730
## 10 MQ 691
## .. ... ...
4.Mutate :
Add new variable
Create new variables that are functions of existing variables .
** Scenario: I need speed (speed= distance/time)
# R approach
flightsdata$Speed <- flightsdata$Distance / flightsdata$AirTime * 60
flightsdata[,c("Speed","Distance","AirTime")]
## Source: local data frame [227,496 x 3]
##
## Speed Distance AirTime
## (dbl) (int) (int)
## 1 336.0000 224 40
## 2 298.6667 224 45
## 3 280.0000 224 48
## 4 344.6154 224 39
## 5 305.4545 224 44
## 6 298.6667 224 45
## 7 312.5581 224 43
## 8 336.0000 224 40
## 9 327.8049 224 41
## 10 298.6667 224 45
## .. ... ... ...
flightsdata %>%
select(Distance,AirTime) %>%
mutate(Speed = Distance/AirTime * 60)
## Source: local data frame [227,496 x 3]
##
## Distance AirTime Speed
## (int) (int) (dbl)
## 1 224 40 336.0000
## 2 224 45 298.6667
## 3 224 48 280.0000
## 4 224 39 344.6154
## 5 224 44 305.4545
## 6 224 45 298.6667
## 7 224 43 312.5581
## 8 224 40 336.0000
## 9 224 41 327.8049
## 10 224 45 298.6667
## .. ... ... ...
flightsdata<-
flightsdata %>%
select(Distance,AirTime) %>%
mutate(Speed = Distance/AirTime * 60)