O dplyr é um pacote que facilita o trabalho com dados, com uma gramática de manipulação de dados simples e flexível. Foi construído com o intuito de obter uma forma mais rápida e expressiva de tratar os dados.
As funções do dplyr mais comuns para manipulação de dados são chamadas de funções verbais. O dataset, em particular dataframe ou datatable, é sempre o primeiro argumento destas. As funções verbais são:
Instalação e inicialização do pacote:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
O dplyr também é utilizado como interface para manipular Spark dataframes (remotas) ao traduzir os comandos em instruções do Spark SQL. Fontes de dados remotas usam também as cinco funções verbais.
Funções verbais e comandos SQL correspondentes:
O tbl é uma forma moderna de trabalhar com o dataframe. O mesmo comparado com o dataframe tradicional:
Quando um tbl é impresso, ele mostra apenas as primeiras dez linhas e todas as colunas que se encaixam na tela. Também imprime uma descrição abreviada do tipo de coluna.
Em resumo, o tbl oferece um melhor formato e verificação rigorosa dos dados. Esta é então a versão de dataframe mais conveniente para usar com dplyr.
O conjunto de dados hflights contém todos os vôos que partem dos aeroportos de Houston IAH (George Bush Intercontinental) e HOU (Houston Hobby). O dataframe contém 227.496 linhas e 21 colunas:
Inicir pacote:
library(hflights)
conhecer os dados:
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
converter o data.frame tradicional em um tbl dataframe. Usar a função tbl_df():
hflights <- tbl_df(hflights)
melhor visualização do dataset:
print(hflights)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## * <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1400 1500 AA
## 2 2011 1 2 7 1401 1501 AA
## 3 2011 1 3 1 1352 1502 AA
## 4 2011 1 4 2 1403 1513 AA
## 5 2011 1 5 3 1405 1507 AA
## 6 2011 1 6 4 1359 1503 AA
## 7 2011 1 7 5 1359 1509 AA
## 8 2011 1 8 6 1355 1454 AA
## 9 2011 1 9 7 1443 1554 AA
## 10 2011 1 10 1 1443 1553 AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
imprimir ainda melhor o tbl de forma que tenha todas as colunas e alguns do seus dados iniciais:
glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
Observação: As funções verbais não modificam o dataframe original.
seleciona colunas 14 e 15:
select(hflights,14,15)
## # A tibble: 227,496 x 2
## Origin Dest
## * <chr> <chr>
## 1 IAH DFW
## 2 IAH DFW
## 3 IAH DFW
## 4 IAH DFW
## 5 IAH DFW
## 6 IAH DFW
## 7 IAH DFW
## 8 IAH DFW
## 9 IAH DFW
## 10 IAH DFW
## # ... with 227,486 more rows
seleciona colunas Origin e Dest:
select(hflights,Origin,Dest)
## # A tibble: 227,496 x 2
## Origin Dest
## * <chr> <chr>
## 1 IAH DFW
## 2 IAH DFW
## 3 IAH DFW
## 4 IAH DFW
## 5 IAH DFW
## 6 IAH DFW
## 7 IAH DFW
## 8 IAH DFW
## 9 IAH DFW
## 10 IAH DFW
## # ... with 227,486 more rows
seleciona da coluna DepTime até a coluna Dest, excluido a coluna TailNum:
select(hflights,DepTime:Dest,-TailNum)
## # A tibble: 227,496 x 10
## DepTime ArrTime UniqueCarrier FlightNum ActualElapsedTime AirTime
## * <int> <int> <chr> <int> <int> <int>
## 1 1400 1500 AA 428 60 40
## 2 1401 1501 AA 428 60 45
## 3 1352 1502 AA 428 70 48
## 4 1403 1513 AA 428 70 39
## 5 1405 1507 AA 428 62 44
## 6 1359 1503 AA 428 64 45
## 7 1359 1509 AA 428 70 43
## 8 1355 1454 AA 428 59 40
## 9 1443 1554 AA 428 71 41
## 10 1443 1553 AA 428 70 45
## # ... with 227,486 more rows, and 4 more variables: ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>
devolver as seguintes colunas: TailNum, TaxiIn, TaxiOut:
select(hflights,TailNum,TaxiIn,TaxiOut)
## # A tibble: 227,496 x 3
## TailNum TaxiIn TaxiOut
## * <chr> <int> <int>
## 1 N576AA 7 13
## 2 N557AA 6 9
## 3 N541AA 5 17
## 4 N403AA 9 22
## 5 N492AA 9 9
## 6 N262AA 6 13
## 7 N493AA 12 15
## 8 N477AA 7 12
## 9 N476AA 8 22
## 10 N504AA 6 19
## # ... with 227,486 more rows
dplyr também contém um conjunto de funções auxiliares que podem ajudar a selecionar grupos de variáveis, entre elas:
devolver as seguintes colunas com select() utilizando funções auxiliares: TailNum, TaxiIn, TaxiOut:
select(hflights,starts_with("T"))
## # A tibble: 227,496 x 3
## TailNum TaxiIn TaxiOut
## * <chr> <int> <int>
## 1 N576AA 7 13
## 2 N557AA 6 9
## 3 N541AA 5 17
## 4 N403AA 9 22
## 5 N492AA 9 9
## 6 N262AA 6 13
## 7 N493AA 12 15
## 8 N477AA 7 12
## 9 N476AA 8 22
## 10 N504AA 6 19
## # ... with 227,486 more rows
Operadores para o filtro: >, >=, <, <=, ==, !=, is.na, !is.na
selecionar os voos que tiveram o MIA como destino:
MIADest<-filter(hflights, Dest =="MIA")
glimpse(MIADest)
## Observations: 2,463
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1756, 1823, 1755, 1757, 1751, 1746, 1757, 17...
## $ ArrTime <int> 2112, 2132, 2106, 2122, 2100, 2120, 2108, 21...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 1294, 1294, 1294, 1294, 1294, 1294, 1294, 12...
## $ TailNum <chr> "N3DGAA", "N3CCAA", "N3ARAA", "N3DFAA", "N3B...
## $ ActualElapsedTime <int> 136, 129, 131, 145, 129, 154, 131, 131, 133,...
## $ AirTime <int> 113, 112, 112, 108, 107, 106, 107, 107, 110,...
## $ ArrDelay <int> -3, 17, -9, 7, -15, 5, -7, -15, 8, 80, 41, -...
## $ DepDelay <int> 1, 28, 0, 2, -4, -9, 2, -6, 15, 99, 53, -7, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "MIA", "MIA", "MIA", "MIA", "MIA", "MIA", "M...
## $ Distance <int> 964, 964, 964, 964, 964, 964, 964, 964, 964,...
## $ TaxiIn <int> 9, 6, 5, 26, 11, 34, 9, 11, 7, 3, 7, 7, 36, ...
## $ TaxiOut <int> 14, 11, 14, 11, 11, 14, 15, 13, 16, 11, 10, ...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
selecionar os voos que tiveram o MIA ou SEA como destino:
MIA_SEA <- filter(hflights, Dest %in% c("MIA","SEA"))
result <-MIA_SEA$Dest
unique(MIA_SEA$Dest)
## [1] "MIA" "SEA"
todos os voos onde TaxiIn + TaxiOut > AirTime:
filter(hflights, TaxiIn + TaxiOut > AirTime)
## # A tibble: 1,389 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 24 1 731 904 AA
## 2 2011 1 30 7 1959 2132 AA
## 3 2011 1 24 1 1621 1749 AA
## 4 2011 1 10 1 941 1113 AA
## 5 2011 1 31 1 1301 1356 CO
## 6 2011 1 31 1 2113 2215 CO
## 7 2011 1 31 1 1434 1539 CO
## 8 2011 1 31 1 900 1006 CO
## 9 2011 1 30 7 1304 1408 CO
## 10 2011 1 30 7 2004 2128 CO
## # ... with 1,379 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
Operador pipe(%>%) para realizar múltiplas operações numa linha.
devolver um tbl com as colunas UniqueCarrier e Distance, onde Distance >= 3500 milhas:
hflights %>% select(UniqueCarrier,Distance) %>% filter(Distance>3500)
## # A tibble: 402 x 2
## UniqueCarrier Distance
## <chr> <int>
## 1 CO 3904
## 2 CO 3904
## 3 CO 3904
## 4 CO 3904
## 5 CO 3904
## 6 CO 3904
## 7 CO 3904
## 8 CO 3904
## 9 CO 3904
## 10 CO 3904
## # ... with 392 more rows
combinar as variáveis DayofMonth e Month para criar uma nova coluna de nome Day_Month. Utlizar o separador “/”:
MutCol<- mutate(hflights, Day_Month = paste(DayofMonth,Month, sep= '/'))
glimpse(MutCol)
## Observations: 227,496
## Variables: 22
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Day_Month <chr> "1/1", "2/1", "3/1", "4/1", "5/1", "6/1", "7...
adicionar uma nova coluna Speed : Distance / AirTime * 60:
Sp<- mutate(hflights,Speed = Distance / AirTime * 60)
glimpse(Sp)
## Observations: 227,496
## Variables: 22
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Speed <dbl> 336.0000, 298.6667, 280.0000, 344.6154, 305....
Observação: arrange() organiza as linhas do menor ao maior
ordenar por DepTime:
arrange(hflights,DepTime)
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## <int> <int> <int> <int> <int> <int> <chr>
## 1 2011 1 1 6 1 621 CO
## 2 2011 3 11 5 1 557 CO
## 3 2011 4 29 5 1 510 CO
## 4 2011 6 24 5 1 515 CO
## 5 2011 11 15 2 1 55 XE
## 6 2011 12 24 6 1 642 CO
## 7 2011 12 9 5 1 633 CO
## 8 2011 7 8 5 2 53 WN
## 9 2011 12 30 5 2 611 CO
## 10 2011 7 8 5 3 521 CO
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## # TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
organizar o dataset para que os voos do mesmo destino estejam juntos:
glimpse(arrange(hflights,Dest))
## Observations: 227,496
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth <int> 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, ...
## $ DayOfWeek <int> 1, 7, 6, 5, 4, 3, 2, 1, 7, 6, 5, 4, 3, 2, 1,...
## $ DepTime <int> 1733, 1750, 1732, 1733, 1741, 1732, 1729, 17...
## $ ArrTime <int> 1901, 1913, 1837, 1848, 1854, 1853, 1858, 18...
## $ UniqueCarrier <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "C...
## $ FlightNum <int> 1715, 1715, 1715, 1715, 1715, 1715, 1715, 17...
## $ TailNum <chr> "N19638", "N19621", "N14613", "N14653", "N17...
## $ ActualElapsedTime <int> 148, 143, 125, 135, 133, 141, 149, 131, 138,...
## $ AirTime <int> 120, 124, 109, 105, 109, 122, 120, 113, 115,...
## $ ArrDelay <int> -4, 8, -23, -17, -11, -12, -7, -20, -12, -17...
## $ DepDelay <int> -2, 15, -3, -2, 6, -3, -6, -1, 0, -2, -7, 6,...
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest <chr> "ABQ", "ABQ", "ABQ", "ABQ", "ABQ", "ABQ", "A...
## $ Distance <int> 744, 744, 744, 744, 744, 744, 744, 744, 744,...
## $ TaxiIn <int> 8, 5, 4, 6, 7, 5, 6, 1, 6, 6, 8, 6, 6, 6, 6,...
## $ TaxiOut <int> 20, 14, 12, 24, 17, 14, 23, 17, 17, 12, 22, ...
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
devolver um tbl com as colunas UniqueCarrier e Distance, onde Distance é ordenada em ordem decrescente:
hflights %>% select(UniqueCarrier,Distance) %>% arrange(desc(Distance))
## # A tibble: 227,496 x 2
## UniqueCarrier Distance
## <chr> <int>
## 1 CO 3904
## 2 CO 3904
## 3 CO 3904
## 4 CO 3904
## 5 CO 3904
## 6 CO 3904
## 7 CO 3904
## 8 CO 3904
## 9 CO 3904
## 10 CO 3904
## # ... with 227,486 more rows
Algumas funções de agregação:
devolver o menor valor de AirTime
summarise(hflights,min_airtime = min(AirTime,na.rm = TRUE))
## # A tibble: 1 x 1
## min_airtime
## <dbl>
## 1 11
Para cada transportador, encontrar a média dos tempos de voo:
MedFly <- hflights %>% group_by(UniqueCarrier) %>% summarise(avg = mean(AirTime,na.rm= TRUE))
glimpse(MedFly)
## Observations: 15
## Variables: 2
## $ UniqueCarrier <chr> "AA", "AS", "B6", "CO", "DL", "EV", "F9", "FL", ...
## $ avg <dbl> 69.65261, 254.18407, 183.98514, 145.45787, 97.80...
Para cada dia do ano, contar a quantidade de voos. Utilizar a função n() para contar a quantidade de linhas no grupo:
CantFly <- hflights %>% group_by(Month,DayofMonth) %>% summarise(total = n())
glimpse(CantFly)
## Observations: 365
## Variables: 3
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ total <int> 552, 678, 702, 583, 590, 660, 661, 500, 602, 659, 5...