Pacote dplyr

O dplyr é um pacote que facilita o trabalho com dados, com uma gramática de manipulação de dados simples e flexível. Foi construído com o intuito de obter uma forma mais rápida e expressiva de tratar os dados.

As funções do dplyr mais comuns para manipulação de dados são chamadas de funções verbais. O dataset, em particular dataframe ou datatable, é sempre o primeiro argumento destas. As funções verbais são:

  • select(): subconjunto das colunas ou variáveis do dataset
  • filter(): subconjunto das linhas ou observações
  • mutate(): adicionar colunas
  • arrange(): reordenar as linhas com respeito a determinadas colunas
  • summarise() [+groupby()]: reduz cada grupo numa única linha sob alguma medida estatística

Instalação e inicialização do pacote:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

O dplyr também é utilizado como interface para manipular Spark dataframes (remotas) ao traduzir os comandos em instruções do Spark SQL. Fontes de dados remotas usam também as cinco funções verbais.

Funções verbais e comandos SQL correspondentes:

  • select ~ SELECT
  • filter ~ WHERE
  • arrange ~ ORDER
  • summarise ~ aggregators: sum, min, sd, etc.
  • mutate ~ operators: +, *, log, etc.

O dataframe moderno: Tibble (tbl)

O tbl é uma forma moderna de trabalhar com o dataframe. O mesmo comparado com o dataframe tradicional:

  • Nunca converte um string como factor
  • Nunca muda os nomes das variáveis
  • Nunca usa row.names()

Quando um tbl é impresso, ele mostra apenas as primeiras dez linhas e todas as colunas que se encaixam na tela. Também imprime uma descrição abreviada do tipo de coluna.

Em resumo, o tbl oferece um melhor formato e verificação rigorosa dos dados. Esta é então a versão de dataframe mais conveniente para usar com dplyr.

Caso de estudo: dataframe hflights

O conjunto de dados hflights contém todos os vôos que partem dos aeroportos de Houston IAH (George Bush Intercontinental) e HOU (Houston Hobby). O dataframe contém 227.496 linhas e 21 colunas:

  • Year, Month, DayofMonth: data de partida
  • DayOfWeek: dia da semana de partida
  • DepTime, ArrTime: horários de partida e chegada
  • UniqueCarrier: abreviatura única para um transportador
  • FlightNum: número do vôo
  • TailNum: número do avião
  • ActualElapsedTime: tempo percorrido pelo vôo
  • AirTime: tempo de duração do vôo
  • ArrDelay, DepDelay: atrasos de chegada e partida
  • Origin, Dest: códigos dos aeroportos de origem e destino.
  • Distance: distância da trajetória do vôo
  • TaxiIn, TaxiOut: tempo de taxi- in e taxi-out
  • Cancelled: indicador de cancelamento de voo
  • CancellationCode: motivo do cancelamento
  • Diverted: indicador de desvio

Inicir pacote:

library(hflights)

conhecer os dados:

summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 

converter o data.frame tradicional em um tbl dataframe. Usar a função tbl_df():

hflights <- tbl_df(hflights)

melhor visualização do dataset:

print(hflights)
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##  * <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          1         6    1400    1500            AA
##  2  2011     1          2         7    1401    1501            AA
##  3  2011     1          3         1    1352    1502            AA
##  4  2011     1          4         2    1403    1513            AA
##  5  2011     1          5         3    1405    1507            AA
##  6  2011     1          6         4    1359    1503            AA
##  7  2011     1          7         5    1359    1509            AA
##  8  2011     1          8         6    1355    1454            AA
##  9  2011     1          9         7    1443    1554            AA
## 10  2011     1         10         1    1443    1553            AA
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

imprimir ainda melhor o tbl de forma que tenha todas as colunas e alguns do seus dados iniciais:

glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

Trabalho com as funções verbais

select

Observação: As funções verbais não modificam o dataframe original.

seleciona colunas 14 e 15:

select(hflights,14,15)
## # A tibble: 227,496 x 2
##    Origin  Dest
##  *  <chr> <chr>
##  1    IAH   DFW
##  2    IAH   DFW
##  3    IAH   DFW
##  4    IAH   DFW
##  5    IAH   DFW
##  6    IAH   DFW
##  7    IAH   DFW
##  8    IAH   DFW
##  9    IAH   DFW
## 10    IAH   DFW
## # ... with 227,486 more rows

seleciona colunas Origin e Dest:

select(hflights,Origin,Dest)
## # A tibble: 227,496 x 2
##    Origin  Dest
##  *  <chr> <chr>
##  1    IAH   DFW
##  2    IAH   DFW
##  3    IAH   DFW
##  4    IAH   DFW
##  5    IAH   DFW
##  6    IAH   DFW
##  7    IAH   DFW
##  8    IAH   DFW
##  9    IAH   DFW
## 10    IAH   DFW
## # ... with 227,486 more rows

seleciona da coluna DepTime até a coluna Dest, excluido a coluna TailNum:

select(hflights,DepTime:Dest,-TailNum)
## # A tibble: 227,496 x 10
##    DepTime ArrTime UniqueCarrier FlightNum ActualElapsedTime AirTime
##  *   <int>   <int>         <chr>     <int>             <int>   <int>
##  1    1400    1500            AA       428                60      40
##  2    1401    1501            AA       428                60      45
##  3    1352    1502            AA       428                70      48
##  4    1403    1513            AA       428                70      39
##  5    1405    1507            AA       428                62      44
##  6    1359    1503            AA       428                64      45
##  7    1359    1509            AA       428                70      43
##  8    1355    1454            AA       428                59      40
##  9    1443    1554            AA       428                71      41
## 10    1443    1553            AA       428                70      45
## # ... with 227,486 more rows, and 4 more variables: ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>

devolver as seguintes colunas: TailNum, TaxiIn, TaxiOut:

select(hflights,TailNum,TaxiIn,TaxiOut)
## # A tibble: 227,496 x 3
##    TailNum TaxiIn TaxiOut
##  *   <chr>  <int>   <int>
##  1  N576AA      7      13
##  2  N557AA      6       9
##  3  N541AA      5      17
##  4  N403AA      9      22
##  5  N492AA      9       9
##  6  N262AA      6      13
##  7  N493AA     12      15
##  8  N477AA      7      12
##  9  N476AA      8      22
## 10  N504AA      6      19
## # ... with 227,486 more rows

dplyr também contém um conjunto de funções auxiliares que podem ajudar a selecionar grupos de variáveis, entre elas:

  • starts_with()
  • ends_with()
  • contains()

devolver as seguintes colunas com select() utilizando funções auxiliares: TailNum, TaxiIn, TaxiOut:

select(hflights,starts_with("T"))
## # A tibble: 227,496 x 3
##    TailNum TaxiIn TaxiOut
##  *   <chr>  <int>   <int>
##  1  N576AA      7      13
##  2  N557AA      6       9
##  3  N541AA      5      17
##  4  N403AA      9      22
##  5  N492AA      9       9
##  6  N262AA      6      13
##  7  N493AA     12      15
##  8  N477AA      7      12
##  9  N476AA      8      22
## 10  N504AA      6      19
## # ... with 227,486 more rows

filter

Operadores para o filtro: >, >=, <, <=, ==, !=, is.na, !is.na

selecionar os voos que tiveram o MIA como destino:

MIADest<-filter(hflights, Dest =="MIA") 
glimpse(MIADest)
## Observations: 2,463
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1756, 1823, 1755, 1757, 1751, 1746, 1757, 17...
## $ ArrTime           <int> 2112, 2132, 2106, 2122, 2100, 2120, 2108, 21...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 1294, 1294, 1294, 1294, 1294, 1294, 1294, 12...
## $ TailNum           <chr> "N3DGAA", "N3CCAA", "N3ARAA", "N3DFAA", "N3B...
## $ ActualElapsedTime <int> 136, 129, 131, 145, 129, 154, 131, 131, 133,...
## $ AirTime           <int> 113, 112, 112, 108, 107, 106, 107, 107, 110,...
## $ ArrDelay          <int> -3, 17, -9, 7, -15, 5, -7, -15, 8, 80, 41, -...
## $ DepDelay          <int> 1, 28, 0, 2, -4, -9, 2, -6, 15, 99, 53, -7, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "MIA", "MIA", "MIA", "MIA", "MIA", "MIA", "M...
## $ Distance          <int> 964, 964, 964, 964, 964, 964, 964, 964, 964,...
## $ TaxiIn            <int> 9, 6, 5, 26, 11, 34, 9, 11, 7, 3, 7, 7, 36, ...
## $ TaxiOut           <int> 14, 11, 14, 11, 11, 14, 15, 13, 16, 11, 10, ...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

selecionar os voos que tiveram o MIA ou SEA como destino:

MIA_SEA <- filter(hflights, Dest %in% c("MIA","SEA"))
result <-MIA_SEA$Dest 
unique(MIA_SEA$Dest)
## [1] "MIA" "SEA"

todos os voos onde TaxiIn + TaxiOut > AirTime:

filter(hflights, TaxiIn + TaxiOut > AirTime)
## # A tibble: 1,389 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1         24         1     731     904            AA
##  2  2011     1         30         7    1959    2132            AA
##  3  2011     1         24         1    1621    1749            AA
##  4  2011     1         10         1     941    1113            AA
##  5  2011     1         31         1    1301    1356            CO
##  6  2011     1         31         1    2113    2215            CO
##  7  2011     1         31         1    1434    1539            CO
##  8  2011     1         31         1     900    1006            CO
##  9  2011     1         30         7    1304    1408            CO
## 10  2011     1         30         7    2004    2128            CO
## # ... with 1,379 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Operador pipe(%>%) para realizar múltiplas operações numa linha.

devolver um tbl com as colunas UniqueCarrier e Distance, onde Distance >= 3500 milhas:

hflights %>% select(UniqueCarrier,Distance) %>% filter(Distance>3500)
## # A tibble: 402 x 2
##    UniqueCarrier Distance
##            <chr>    <int>
##  1            CO     3904
##  2            CO     3904
##  3            CO     3904
##  4            CO     3904
##  5            CO     3904
##  6            CO     3904
##  7            CO     3904
##  8            CO     3904
##  9            CO     3904
## 10            CO     3904
## # ... with 392 more rows

mutate

combinar as variáveis DayofMonth e Month para criar uma nova coluna de nome Day_Month. Utlizar o separador “/”:

MutCol<- mutate(hflights, Day_Month = paste(DayofMonth,Month, sep= '/'))
glimpse(MutCol)
## Observations: 227,496
## Variables: 22
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Day_Month         <chr> "1/1", "2/1", "3/1", "4/1", "5/1", "6/1", "7...

adicionar uma nova coluna Speed : Distance / AirTime * 60:

Sp<- mutate(hflights,Speed = Distance / AirTime * 60)
glimpse(Sp)
## Observations: 227,496
## Variables: 22
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Speed             <dbl> 336.0000, 298.6667, 280.0000, 344.6154, 305....

arrange

Observação: arrange() organiza as linhas do menor ao maior

ordenar por DepTime:

arrange(hflights,DepTime)
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
##    <int> <int>      <int>     <int>   <int>   <int>         <chr>
##  1  2011     1          1         6       1     621            CO
##  2  2011     3         11         5       1     557            CO
##  3  2011     4         29         5       1     510            CO
##  4  2011     6         24         5       1     515            CO
##  5  2011    11         15         2       1      55            XE
##  6  2011    12         24         6       1     642            CO
##  7  2011    12          9         5       1     633            CO
##  8  2011     7          8         5       2      53            WN
##  9  2011    12         30         5       2     611            CO
## 10  2011     7          8         5       3     521            CO
## # ... with 227,486 more rows, and 14 more variables: FlightNum <int>,
## #   TailNum <chr>, ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

organizar o dataset para que os voos do mesmo destino estejam juntos:

glimpse(arrange(hflights,Dest))
## Observations: 227,496
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ DayofMonth        <int> 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, ...
## $ DayOfWeek         <int> 1, 7, 6, 5, 4, 3, 2, 1, 7, 6, 5, 4, 3, 2, 1,...
## $ DepTime           <int> 1733, 1750, 1732, 1733, 1741, 1732, 1729, 17...
## $ ArrTime           <int> 1901, 1913, 1837, 1848, 1854, 1853, 1858, 18...
## $ UniqueCarrier     <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "C...
## $ FlightNum         <int> 1715, 1715, 1715, 1715, 1715, 1715, 1715, 17...
## $ TailNum           <chr> "N19638", "N19621", "N14613", "N14653", "N17...
## $ ActualElapsedTime <int> 148, 143, 125, 135, 133, 141, 149, 131, 138,...
## $ AirTime           <int> 120, 124, 109, 105, 109, 122, 120, 113, 115,...
## $ ArrDelay          <int> -4, 8, -23, -17, -11, -12, -7, -20, -12, -17...
## $ DepDelay          <int> -2, 15, -3, -2, 6, -3, -6, -1, 0, -2, -7, 6,...
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "ABQ", "ABQ", "ABQ", "ABQ", "ABQ", "ABQ", "A...
## $ Distance          <int> 744, 744, 744, 744, 744, 744, 744, 744, 744,...
## $ TaxiIn            <int> 8, 5, 4, 6, 7, 5, 6, 1, 6, 6, 8, 6, 6, 6, 6,...
## $ TaxiOut           <int> 20, 14, 12, 24, 17, 14, 23, 17, 17, 12, 22, ...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

devolver um tbl com as colunas UniqueCarrier e Distance, onde Distance é ordenada em ordem decrescente:

hflights %>% select(UniqueCarrier,Distance) %>% arrange(desc(Distance)) 
## # A tibble: 227,496 x 2
##    UniqueCarrier Distance
##            <chr>    <int>
##  1            CO     3904
##  2            CO     3904
##  3            CO     3904
##  4            CO     3904
##  5            CO     3904
##  6            CO     3904
##  7            CO     3904
##  8            CO     3904
##  9            CO     3904
## 10            CO     3904
## # ... with 227,486 more rows

summarise

Algumas funções de agregação:

  • min(): valor minimo
  • max(): valor maximo
  • mean(): valor medio
  • sd(): desvio padrão
  • var(): variância

devolver o menor valor de AirTime

summarise(hflights,min_airtime = min(AirTime,na.rm = TRUE))
## # A tibble: 1 x 1
##   min_airtime
##         <dbl>
## 1          11

Para cada transportador, encontrar a média dos tempos de voo:

MedFly <- hflights %>% group_by(UniqueCarrier) %>% summarise(avg = mean(AirTime,na.rm= TRUE))
glimpse(MedFly)
## Observations: 15
## Variables: 2
## $ UniqueCarrier <chr> "AA", "AS", "B6", "CO", "DL", "EV", "F9", "FL", ...
## $ avg           <dbl> 69.65261, 254.18407, 183.98514, 145.45787, 97.80...

Para cada dia do ano, contar a quantidade de voos. Utilizar a função n() para contar a quantidade de linhas no grupo:

CantFly <- hflights %>% group_by(Month,DayofMonth) %>% summarise(total = n())
glimpse(CantFly)
## Observations: 365
## Variables: 3
## $ Month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ total      <int> 552, 678, 702, 583, 590, 660, 661, 500, 602, 659, 5...

Referencias Bibliográficas