Tibble objects

A tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble(). The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.

Visualizacion de las variables e información

Partimos de un objeto de tipo data.frame. Los comandos stry glimpse dan información sobre el tipo de las variables contenidas en un objeto tibble. classda información sobre el tipo de objeto. Con head visualizamos las primeras filas de la BD.

data(hflights)
head(hflights)
##      Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011     1          1         6    1400    1500            AA
## 5425 2011     1          2         7    1401    1501            AA
## 5426 2011     1          3         1    1352    1502            AA
## 5427 2011     1          4         2    1403    1513            AA
## 5428 2011     1          5         3    1405    1507            AA
## 5429 2011     1          6         4    1359    1503            AA
##      FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424       428  N576AA                60      40      -10        0    IAH
## 5425       428  N557AA                60      45       -9        1    IAH
## 5426       428  N541AA                70      48       -8       -8    IAH
## 5427       428  N403AA                70      39        3        3    IAH
## 5428       428  N492AA                62      44       -3        5    IAH
## 5429       428  N262AA                64      45       -7       -1    IAH
##      Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424  DFW      224      7      13         0                         0
## 5425  DFW      224      6       9         0                         0
## 5426  DFW      224      5      17         0                         0
## 5427  DFW      224      9      22         0                         0
## 5428  DFW      224      9       9         0                         0
## 5429  DFW      224      6      13         0                         0
str(hflights)
## 'data.frame':    227496 obs. of  21 variables:
##  $ Year             : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ Month            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DayofMonth       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DayOfWeek        : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ DepTime          : int  1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
##  $ ArrTime          : int  1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
##  $ UniqueCarrier    : chr  "AA" "AA" "AA" "AA" ...
##  $ FlightNum        : int  428 428 428 428 428 428 428 428 428 428 ...
##  $ TailNum          : chr  "N576AA" "N557AA" "N541AA" "N403AA" ...
##  $ ActualElapsedTime: int  60 60 70 70 62 64 70 59 71 70 ...
##  $ AirTime          : int  40 45 48 39 44 45 43 40 41 45 ...
##  $ ArrDelay         : int  -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
##  $ DepDelay         : int  0 1 -8 3 5 -1 -1 -5 43 43 ...
##  $ Origin           : chr  "IAH" "IAH" "IAH" "IAH" ...
##  $ Dest             : chr  "DFW" "DFW" "DFW" "DFW" ...
##  $ Distance         : int  224 224 224 224 224 224 224 224 224 224 ...
##  $ TaxiIn           : int  7 6 5 9 9 6 12 7 8 6 ...
##  $ TaxiOut          : int  13 9 17 22 9 13 15 12 22 19 ...
##  $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ CancellationCode : chr  "" "" "" "" ...
##  $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...
glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
class(hflights)
## [1] "data.frame"
summary(hflights)
##       Year          Month          DayofMonth      DayOfWeek    
##  Min.   :2011   Min.   : 1.000   Min.   : 1.00   Min.   :1.000  
##  1st Qu.:2011   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2.000  
##  Median :2011   Median : 7.000   Median :16.00   Median :4.000  
##  Mean   :2011   Mean   : 6.514   Mean   :15.74   Mean   :3.948  
##  3rd Qu.:2011   3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:6.000  
##  Max.   :2011   Max.   :12.000   Max.   :31.00   Max.   :7.000  
##                                                                 
##     DepTime        ArrTime     UniqueCarrier        FlightNum   
##  Min.   :   1   Min.   :   1   Length:227496      Min.   :   1  
##  1st Qu.:1021   1st Qu.:1215   Class :character   1st Qu.: 855  
##  Median :1416   Median :1617   Mode  :character   Median :1696  
##  Mean   :1396   Mean   :1578                      Mean   :1962  
##  3rd Qu.:1801   3rd Qu.:1953                      3rd Qu.:2755  
##  Max.   :2400   Max.   :2400                      Max.   :7290  
##  NA's   :2905   NA's   :3066                                    
##    TailNum          ActualElapsedTime    AirTime         ArrDelay      
##  Length:227496      Min.   : 34.0     Min.   : 11.0   Min.   :-70.000  
##  Class :character   1st Qu.: 77.0     1st Qu.: 58.0   1st Qu.: -8.000  
##  Mode  :character   Median :128.0     Median :107.0   Median :  0.000  
##                     Mean   :129.3     Mean   :108.1   Mean   :  7.094  
##                     3rd Qu.:165.0     3rd Qu.:141.0   3rd Qu.: 11.000  
##                     Max.   :575.0     Max.   :549.0   Max.   :978.000  
##                     NA's   :3622      NA's   :3622    NA's   :3622     
##     DepDelay          Origin              Dest              Distance     
##  Min.   :-33.000   Length:227496      Length:227496      Min.   :  79.0  
##  1st Qu.: -3.000   Class :character   Class :character   1st Qu.: 376.0  
##  Median :  0.000   Mode  :character   Mode  :character   Median : 809.0  
##  Mean   :  9.445                                         Mean   : 787.8  
##  3rd Qu.:  9.000                                         3rd Qu.:1042.0  
##  Max.   :981.000                                         Max.   :3904.0  
##  NA's   :2905                                                            
##      TaxiIn           TaxiOut         Cancelled       CancellationCode  
##  Min.   :  1.000   Min.   :  1.00   Min.   :0.00000   Length:227496     
##  1st Qu.:  4.000   1st Qu.: 10.00   1st Qu.:0.00000   Class :character  
##  Median :  5.000   Median : 14.00   Median :0.00000   Mode  :character  
##  Mean   :  6.099   Mean   : 15.09   Mean   :0.01307                     
##  3rd Qu.:  7.000   3rd Qu.: 18.00   3rd Qu.:0.00000                     
##  Max.   :165.000   Max.   :163.00   Max.   :1.00000                     
##  NA's   :3066      NA's   :2947                                         
##     Diverted       
##  Min.   :0.000000  
##  1st Qu.:0.000000  
##  Median :0.000000  
##  Mean   :0.002853  
##  3rd Qu.:0.000000  
##  Max.   :1.000000  
## 

Y lo convertimos a tibble con tbl_df para una mejor visualizacion, sin afectar a las propiedades de la BD. La forma de recurrir a las variables es similar a la de un data frame (por ejemplo hflights$UniqueCarrier).

hflights=tbl_df(hflights)
glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
class(hflights)
## [1] "tbl_df"     "tbl"        "data.frame"
hflights
## # A tibble: 227,496 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>

Cuando queremos cambiar las etiquetas, definimos un vector relacionando las antiguas etiquetas con las nuevas.

# Las etiquetas originales de UniqueCarrier son
unique(hflights$UniqueCarrier)
##  [1] "AA" "AS" "B6" "CO" "DL" "OO" "UA" "US" "WN" "EV" "F9" "FL" "MQ" "XE"
## [15] "YV"
# definimos un vector con reasignaciones de las etiquetas
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

# Definimos una nueva variable Carrier que asigna las nuevas etiquetas en funcion de las antiguas
hflights$Carrier <- lut[hflights$UniqueCarrier]
head(hflights$Carrier)
## [1] "American" "American" "American" "American" "American" "American"
# Glimpse at hflights
glimpse(hflights)
## Observations: 227,496
## Variables: 22
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Carrier           <chr> "American", "American", "American", "American"…
#Creamos una variable *Code* con una recodificación de CancellationCode
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")

# Add the Code column
hflights$Code <-lut[hflights$CancellationCode]

También podemos utilizar la librería forcats, con la que podemos prescindir o modificar niveles de un factor, o recodificarlos fácilmente.

Verbos en dplyr

The dplyr package contains five key data manipulation functions, also called verbs:

select

# Print out a tbl with the four columns of hflights related to delay
select(hflights,ActualElapsedTime, AirTime, ArrDelay, DepDelay)
## # A tibble: 227,496 x 4
##    ActualElapsedTime AirTime ArrDelay DepDelay
##                <int>   <int>    <int>    <int>
##  1                60      40      -10        0
##  2                60      45       -9        1
##  3                70      48       -8       -8
##  4                70      39        3        3
##  5                62      44       -3        5
##  6                64      45       -7       -1
##  7                70      43       -1       -1
##  8                59      40      -16       -5
##  9                71      41       44       43
## 10                70      45       43       43
## # … with 227,486 more rows
# Print out the columns Origin up to Cancelled of hflights
names(hflights)
##  [1] "Year"              "Month"             "DayofMonth"       
##  [4] "DayOfWeek"         "DepTime"           "ArrTime"          
##  [7] "UniqueCarrier"     "FlightNum"         "TailNum"          
## [10] "ActualElapsedTime" "AirTime"           "ArrDelay"         
## [13] "DepDelay"          "Origin"            "Dest"             
## [16] "Distance"          "TaxiIn"            "TaxiOut"          
## [19] "Cancelled"         "CancellationCode"  "Diverted"         
## [22] "Carrier"           "Code"
select(hflights,Origin,Dest, Distance, TaxiIn, TaxiOut,Cancelled)
## # A tibble: 227,496 x 6
##    Origin Dest  Distance TaxiIn TaxiOut Cancelled
##    <chr>  <chr>    <int>  <int>   <int>     <int>
##  1 IAH    DFW        224      7      13         0
##  2 IAH    DFW        224      6       9         0
##  3 IAH    DFW        224      5      17         0
##  4 IAH    DFW        224      9      22         0
##  5 IAH    DFW        224      9       9         0
##  6 IAH    DFW        224      6      13         0
##  7 IAH    DFW        224     12      15         0
##  8 IAH    DFW        224      7      12         0
##  9 IAH    DFW        224      8      22         0
## 10 IAH    DFW        224      6      19         0
## # … with 227,486 more rows
# Answer to last question: be concise!
select(hflights,1:4)
## # A tibble: 227,496 x 4
##     Year Month DayofMonth DayOfWeek
##    <int> <int>      <int>     <int>
##  1  2011     1          1         6
##  2  2011     1          2         7
##  3  2011     1          3         1
##  4  2011     1          4         2
##  5  2011     1          5         3
##  6  2011     1          6         4
##  7  2011     1          7         5
##  8  2011     1          8         6
##  9  2011     1          9         7
## 10  2011     1         10         1
## # … with 227,486 more rows
# Print out a tbl containing just ArrDelay and DepDelay
select(hflights,contains("Delay"))
## # A tibble: 227,496 x 2
##    ArrDelay DepDelay
##       <int>    <int>
##  1      -10        0
##  2       -9        1
##  3       -8       -8
##  4        3        3
##  5       -3        5
##  6       -7       -1
##  7       -1       -1
##  8      -16       -5
##  9       44       43
## 10       43       43
## # … with 227,486 more rows
# Print out a tbl as described in the second instruction, using both helper functions and variable names
select(hflights,UniqueCarrier,ends_with("Num"),starts_with("Cancel"))
## # A tibble: 227,496 x 5
##    UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##    <chr>             <int> <chr>       <int> <chr>           
##  1 AA                  428 N576AA          0 ""              
##  2 AA                  428 N557AA          0 ""              
##  3 AA                  428 N541AA          0 ""              
##  4 AA                  428 N403AA          0 ""              
##  5 AA                  428 N492AA          0 ""              
##  6 AA                  428 N262AA          0 ""              
##  7 AA                  428 N493AA          0 ""              
##  8 AA                  428 N477AA          0 ""              
##  9 AA                  428 N476AA          0 ""              
## 10 AA                  428 N504AA          0 ""              
## # … with 227,486 more rows
# Print out a tbl as described in the third instruction, using only helper functions.
select(hflights,ends_with("Time"),ends_with("Delay"))
## # A tibble: 227,496 x 6
##    DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
##      <int>   <int>             <int>   <int>    <int>    <int>
##  1    1400    1500                60      40      -10        0
##  2    1401    1501                60      45       -9        1
##  3    1352    1502                70      48       -8       -8
##  4    1403    1513                70      39        3        3
##  5    1405    1507                62      44       -3        5
##  6    1359    1503                64      45       -7       -1
##  7    1359    1509                70      43       -1       -1
##  8    1355    1454                59      40      -16       -5
##  9    1443    1554                71      41       44       43
## 10    1443    1553                70      45       43       43
## # … with 227,486 more rows
# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights,starts_with("Taxi"),Distance)
#ex1d<-select(hflights,16:18)

# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights,1:6,-3)

# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights,starts_with("Ta"))

mutate

# hflights and dplyr are loaded and ready to serve you.

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1=mutate(hflights,ActualGroundTime=ActualElapsedTime-AirTime)

# Add the new variable GroundTime to g1. Save the result as g2.
g2=mutate(g1,GroundTime=TaxiIn+TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3.
g3=mutate(g2,AverageSpeed=60 * Distance / AirTime)

# Print out g3
g3
## # A tibble: 227,496 x 26
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          1         6    1400    1500 AA                  428
##  2  2011     1          2         7    1401    1501 AA                  428
##  3  2011     1          3         1    1352    1502 AA                  428
##  4  2011     1          4         2    1403    1513 AA                  428
##  5  2011     1          5         3    1405    1507 AA                  428
##  6  2011     1          6         4    1359    1503 AA                  428
##  7  2011     1          7         5    1359    1509 AA                  428
##  8  2011     1          8         6    1355    1454 AA                  428
##  9  2011     1          9         7    1443    1554 AA                  428
## 10  2011     1         10         1    1443    1553 AA                  428
## # … with 227,486 more rows, and 18 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>, ActualGroundTime <int>,
## #   GroundTime <int>, AverageSpeed <dbl>

Adding multiple variables with mutate

# hflights and dplyr are ready, are you?

# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay,
                        loss_ratio=loss/DepDelay)

# Add the three variables as described in the third instruction: m2
m2=mutate(hflights,TotalTaxi=TaxiIn+TaxiOut,
ActualGroundTime=ActualElapsedTime-AirTime, Diff=TotalTaxi-ActualGroundTime)

filter and logical operators

R comes with a set of logical operators that you can use inside filter():

x < y, TRUE if x is less than y x <= y, TRUE if x is less than or equal to y x == y, TRUE if x equals y x != y, TRUE if x does not equal y x >= y, TRUE if x is greater than or equal to y x > y, TRUE if x is greater than y x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

# hflights is at your disposal as a tbl, with clean carrier names

# All flights that traveled 3000 miles or more
filter(hflights,Distance>=3000)
## # A tibble: 527 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         31         1     924    1413 CO                    1
##  2  2011     1         30         7     925    1410 CO                    1
##  3  2011     1         29         6    1045    1445 CO                    1
##  4  2011     1         28         5    1516    1916 CO                    1
##  5  2011     1         27         4     950    1344 CO                    1
##  6  2011     1         26         3     944    1350 CO                    1
##  7  2011     1         25         2     924    1337 CO                    1
##  8  2011     1         24         1    1144    1605 CO                    1
##  9  2011     1         23         7     926    1335 CO                    1
## 10  2011     1         22         6     942    1340 CO                    1
## # … with 517 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# All flights flown by one of JetBlue, Southwest, or Delta
filter(hflights,UniqueCarrier %in% c("JetBlue", "Southwest","Delta"))
## # A tibble: 0 x 23
## # … with 23 variables: Year <int>, Month <int>, DayofMonth <int>,
## #   DayOfWeek <int>, DepTime <int>, ArrTime <int>, UniqueCarrier <chr>,
## #   FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
## #   AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## #   Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## #   Cancelled <int>, CancellationCode <chr>, Diverted <int>,
## #   Carrier <chr>, Code <chr>
# All flights where taxiing took longer than flying
filter(hflights,AirTime<(TaxiOut+TaxiIn))
## # A tibble: 1,389 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         24         1     731     904 AA                  460
##  2  2011     1         30         7    1959    2132 AA                  533
##  3  2011     1         24         1    1621    1749 AA                 1121
##  4  2011     1         10         1     941    1113 AA                 1436
##  5  2011     1         31         1    1301    1356 CO                  241
##  6  2011     1         31         1    2113    2215 CO                 1533
##  7  2011     1         31         1    1434    1539 CO                 1541
##  8  2011     1         31         1     900    1006 CO                 1583
##  9  2011     1         30         7    1304    1408 CO                  241
## 10  2011     1         30         7    2004    2128 CO                  423
## # … with 1,379 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>

Combining tests using boolean operators

R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:

filter(df, a > 0 & b > 0) filter(df, a > 0, b > 0)

filter(df, !is.na(x)) keeps the observations in df for which the variable x is not NA.

# All flights that departed before 5am or arrived after 10pm
filter(hflights, (DepTime<500) | (ArrTime>2200))
## # A tibble: 27,799 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          4         2    2100    2207 AA                  533
##  2  2011     1         14         5    2119    2229 AA                  533
##  3  2011     1         10         1    1934    2235 AA                 1294
##  4  2011     1         26         3    1905    2211 AA                 1294
##  5  2011     1         30         7    1856    2209 AA                 1294
##  6  2011     1          9         7    1938    2228 AS                  731
##  7  2011     1         31         1    1919    2231 CO                  190
##  8  2011     1         31         1    2116    2344 CO                  209
##  9  2011     1         31         1    1850    2211 CO                  250
## 10  2011     1         31         1    2102    2216 CO                  299
## # … with 27,789 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# All flights that departed late but arrived ahead of schedule
filter(hflights, (DepDelay>0) & (ArrDelay<0))
## # A tibble: 27,712 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1          2         7    1401    1501 AA                  428
##  2  2011     1          5         3    1405    1507 AA                  428
##  3  2011     1         18         2    1408    1508 AA                  428
##  4  2011     1         18         2     721     827 AA                  460
##  5  2011     1         12         3    2015    2113 AA                  533
##  6  2011     1         13         4    2020    2116 AA                  533
##  7  2011     1         26         3    2009    2103 AA                  533
##  8  2011     1          1         6    1631    1736 AA                 1121
##  9  2011     1         10         1    1639    1740 AA                 1121
## 10  2011     1         12         3    1631    1739 AA                 1121
## # … with 27,702 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# All flights that were cancelled after being delayed
filter(hflights, (Cancelled==1) & (DepDelay>0))
## # A tibble: 40 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         26         3    1926      NA CO                  310
##  2  2011     1         11         2    1100      NA US                  944
##  3  2011     1         19         3    1811      NA XE                 2376
##  4  2011     1          7         5    2028      NA XE                 3050
##  5  2011     2          4         5    1638      NA AA                 1121
##  6  2011     2          8         2    1057      NA CO                  408
##  7  2011     2          2         3     802      NA XE                 2189
##  8  2011     2          9         3     904      NA XE                 2605
##  9  2011     2          1         2    1508      NA OO                 5812
## 10  2011     3         31         4    1016      NA CO                  586
## # … with 30 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>

select and filter (together)

# Select the flights that had JFK as their destination: c1
c1=filter(hflights,Dest=="JFK")

# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2=mutate(c1,Date=paste(Year,Month,DayofMonth,sep="-"))

# Print out a selection of columns of c2
select(c2,Date,DepTime,ArrTime,TailNum)
## # A tibble: 695 x 4
##    Date     DepTime ArrTime TailNum
##    <chr>      <int>   <int> <chr>  
##  1 2011-1-1     654    1124 N324JB 
##  2 2011-1-1    1639    2110 N324JB 
##  3 2011-1-2     703    1113 N324JB 
##  4 2011-1-2    1604    2040 N324JB 
##  5 2011-1-3     659    1100 N229JB 
##  6 2011-1-3    1801    2200 N206JB 
##  7 2011-1-4     654    1103 N267JB 
##  8 2011-1-4    1608    2034 N267JB 
##  9 2011-1-5     700    1103 N708JB 
## 10 2011-1-5    1544    1954 N644JB 
## # … with 685 more rows

arrange

Sirve para ordenar los datos respecto de algún criterio/variable.

# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

# Arrange dtc by departure delays
arrange(dtc,DepDelay)
## # A tibble: 68 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     7         23         6     605      NA F9                  225
##  2  2011     1         17         1     916      NA XE                 3068
##  3  2011    12          1         4     541      NA US                  282
##  4  2011    10         12         3    2022      NA MQ                 3724
##  5  2011     7         29         5    1424      NA CO                 1079
##  6  2011     9         29         4    1639      NA OO                 2062
##  7  2011     2          9         3     555      NA MQ                 3265
##  8  2011     5          9         1     715      NA OO                 1177
##  9  2011     1         20         4    1413      NA UA                  552
## 10  2011     1         17         1     831      NA WN                    1
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# Arrange dtc so that cancellation reasons are grouped
arrange(dtc,CancellationCode)
## # A tibble: 68 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     1         20         4    1413      NA UA                  552
##  2  2011     1          7         5    2028      NA XE                 3050
##  3  2011     2          4         5    1638      NA AA                 1121
##  4  2011     2          8         2    1057      NA CO                  408
##  5  2011     2          1         2    1508      NA OO                 5812
##  6  2011     2         21         1    2257      NA OO                 1111
##  7  2011     2          9         3     555      NA MQ                 3265
##  8  2011     3         18         5     727      NA UA                  109
##  9  2011     4          4         1    1632      NA DL                    8
## 10  2011     4          8         5    1608      NA WN                    4
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# Arrange dtc according to carrier and departure delays
arrange(dtc,UniqueCarrier,DepDelay)
## # A tibble: 68 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     8         18         4    1808      NA AA                 1294
##  2  2011     2          4         5    1638      NA AA                 1121
##  3  2011     7         29         5    1424      NA CO                 1079
##  4  2011     1         26         3    1703      NA CO                  410
##  5  2011     8         11         4    1320      NA CO                 1669
##  6  2011     7         25         1    1654      NA CO                 1422
##  7  2011     1         26         3    1926      NA CO                  310
##  8  2011     3         31         4    1016      NA CO                  586
##  9  2011     2          8         2    1057      NA CO                  408
## 10  2011     4          4         1    1632      NA DL                    8
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>

Reverse the order of arranging By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

# dplyr and the hflights tbl are available

# Arrange according to carrier and decreasing departure delays
arrange(hflights,UniqueCarrier,desc(DepDelay))
## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011    12         12         1     650     808 AA                 1740
##  2  2011    11         19         6    1752    1910 AA                 1903
##  3  2011    12         22         4    1728    1848 AA                 1903
##  4  2011    10         23         7    2305       2 AA                  742
##  5  2011     9         27         2    1206    1300 AA                 1948
##  6  2011     3         17         4    1647    1747 AA                 1505
##  7  2011     6         21         2     955    1315 AA                  466
##  8  2011     5         20         5    2359     130 AA                  426
##  9  2011     4         19         2    2023    2142 AA                 1925
## 10  2011     5         12         4    2133      53 AA                 1294
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>
# Arrange flights by total delay (normal order).
arrange(hflights,DepDelay+ArrDelay)
## # A tibble: 227,496 x 23
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     7          3         7    1914    2039 XE                 2804
##  2  2011     8         31         3     934    1039 OO                 2040
##  3  2011     8         21         7     935    1039 OO                 2001
##  4  2011     8         28         7    2059    2206 OO                 2003
##  5  2011     8         29         1     935    1041 OO                 2040
##  6  2011    12         25         7     741     926 OO                 4591
##  7  2011     1         30         7     620     812 OO                 4461
##  8  2011     8          3         3    1741    1810 XE                 2603
##  9  2011     8          4         4     930    1041 OO                 1171
## 10  2011     8         18         4     939    1043 OO                 2001
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>, Carrier <chr>, Code <chr>

summarize / summarise

# hflights and dplyr are loaded in the workspace

# Print out a summary with variables min_dist and max_dist
summarise(hflights,min_dist=min(Distance),max_dist=max(Distance))
## # A tibble: 1 x 2
##   min_dist max_dist
##      <int>    <int>
## 1       79     3904
# Print out a summary with variable max_div
c1=filter(hflights,Diverted==1)
summarise(c1,max_div=max(Distance))
## # A tibble: 1 x 1
##   max_div
##     <int>
## 1    3904

You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:

min(x) - minimum value of vector x. max(x) - maximum value of vector x. mean(x) - mean value of vector x. median(x) - median value of vector x. quantile(x, p) - pth quantile of vector x. sd(x) - standard deviation of vector x. var(x) - variance of vector x. IQR(x) - Inter Quartile Range (IQR) of vector x. diff(range(x)) - total range of vector x.

# Remove rows that have NA ArrDelay: temp1
temp1=filter(hflights,!is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1,earliest=min(ArrDelay),average=mean(ArrDelay),latest=max(ArrDelay),
sd=sd(ArrDelay))
## # A tibble: 1 x 4
##   earliest average latest    sd
##      <int>   <dbl>  <int> <dbl>
## 1      -70    7.09    978  30.7
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2=filter(hflights,!is.na(TaxiIn) & !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarize()
summarise(temp2,max_taxi_diff=max(abs(TaxiIn-TaxiOut)))
## # A tibble: 1 x 1
##   max_taxi_diff
##           <int>
## 1           160

dplyr aggregate functions dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) - The first element of vector x. last(x) - The last element of vector x. nth(x, n) - The nth element of vector x. n() - The number of rows in the data.frame or group of observations that summarize() describes. n_distinct(x) - The number of unique values in vector x.

Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE’s and FALSE’s. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. sum() then represents the total number of observations that passed the test; mean() represents the proportion.

# Generate summarizing statistics for hflights
summarize(hflights,
          n_obs = n(),
          n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest))
## # A tibble: 1 x 3
##    n_obs n_carrier n_dest
##    <int>     <int>  <int>
## 1 227496        15    116
# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa 
summarize(aa,n_flights=n(),
n_canc=sum(Cancelled),avg_delay=mean(ArrDelay,na.rm=TRUE))
## # A tibble: 1 x 3
##   n_flights n_canc avg_delay
##       <int>  <int>     <dbl>
## 1         0      0       NaN

pipe >%>

# Write the 'piped' version of the English sentences.
hflights %>% mutate(diff=TaxiOut-TaxiIn) %>%
            filter(!is.na(diff)) %>%
            summarize(avg=mean(diff))
## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1  8.99
# Chain together mutate(), filter() and summarize()
hflights %>% mutate(RealTime=ActualElapsedTime+100,
                    mph=60*Distance/RealTime)%>%
             filter(!is.na(mph),mph<70) %>%
             summarize(n_less=n(),n_dest=n_distinct(Dest),
              min_dist=min(Distance),max_dist=max(Distance))
## # A tibble: 1 x 4
##   n_less n_dest min_dist max_dist
##    <int>  <int>    <int>    <int>
## 1   6726     13       79      305
# Finish the command with a filter() and summarize() call
hflights %>%
  mutate(
    RealTime = ActualElapsedTime + 100, 
    mph = 60 * Distance / RealTime) %>%
    filter(mph<105 | Cancelled==1 | Diverted==1)%>%
    summarize(
    n_non=n(),
    n_dest=n_distinct(Dest),
    min_dist=min(Distance),
    max_dist=max(Distance))
## # A tibble: 1 x 4
##   n_non n_dest min_dist max_dist
##   <int>  <int>    <int>    <int>
## 1 42400    113       79     3904
# Count the number of overnight flights
hflights %>% filter(!is.na(DepTime),
                    !is.na(ArrTime),DepTime>ArrTime) %>%
            summarize(num=n()) 
## # A tibble: 1 x 1
##     num
##   <int>
## 1  2718

group_by

Resumimos cancelaciones y retrasos por compañía aérea.

# Make an ordered per-carrier summary of hflights
hflights %>%
  group_by(UniqueCarrier) %>%
  summarize(
    p_canc = mean(Cancelled)*100,
    avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
    arrange(avg_delay,p_canc)
## # A tibble: 15 x 3
##    UniqueCarrier p_canc avg_delay
##    <chr>          <dbl>     <dbl>
##  1 US             1.13     -0.631
##  2 AA             1.85      0.892
##  3 FL             0.982     1.85 
##  4 AS             0         3.19 
##  5 YV             1.27      4.01 
##  6 DL             1.59      6.08 
##  7 CO             0.678     6.10 
##  8 MQ             2.90      7.15 
##  9 EV             3.45      7.26 
## 10 WN             1.55      7.59 
## 11 F9             0.716     7.67 
## 12 XE             1.55      8.19 
## 13 OO             1.39      8.69 
## 14 B6             2.59      9.86 
## 15 UA             1.64     10.5
# Ordered overview of average arrival delays per carrier
hflights %>% filter(!is.na(ArrDelay), ArrDelay>0) %>%
            group_by(UniqueCarrier)%>%
            summarize(avg=mean(ArrDelay))%>%
            mutate(rank=rank(avg)) %>%
            arrange(rank)
## # A tibble: 15 x 3
##    UniqueCarrier   avg  rank
##    <chr>         <dbl> <dbl>
##  1 YV             18.7     1
##  2 F9             18.7     2
##  3 US             20.7     3
##  4 CO             22.1     4
##  5 AS             22.9     5
##  6 OO             24.1     6
##  7 XE             24.2     7
##  8 WN             25.3     8
##  9 FL             27.9     9
## 10 AA             28.5    10
## 11 DL             32.1    11
## 12 UA             32.5    12
## 13 MQ             38.8    13
## 14 EV             40.2    14
## 15 B6             45.5    15

How many airplanes only flew to one destination?

hflights %>%
  group_by(TailNum) %>%
  summarize(ndest=n_distinct(Dest)) %>%
  filter(ndest==1) %>%
  summarize(nplanes=sum(ndest))
## # A tibble: 1 x 1
##   nplanes
##     <int>
## 1    1526
# Find the most visited destination for each carrier
hflights %>%
  group_by(UniqueCarrier,Dest) %>%
  summarize(n=n()) %>%
  mutate(rank=rank(desc(n))) %>%
  filter(rank==1)
## # A tibble: 15 x 4
## # Groups:   UniqueCarrier [15]
##    UniqueCarrier Dest      n  rank
##    <chr>         <chr> <int> <dbl>
##  1 AA            DFW    2105     1
##  2 AS            SEA     365     1
##  3 B6            JFK     695     1
##  4 CO            EWR    3924     1
##  5 DL            ATL    2396     1
##  6 EV            DTW     851     1
##  7 F9            DEN     837     1
##  8 FL            ATL    2029     1
##  9 MQ            DFW    2424     1
## 10 OO            COS    1335     1
## 11 UA            SFO     643     1
## 12 US            CLT    2212     1
## 13 WN            DAL    8243     1
## 14 XE            CRP    3175     1
## 15 YV            CLT      71     1

dplyr deals with different types hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
hflights2 <- as.data.table(hflights)

# hflights2 is pre-loaded as a data.table

# Use summarize to calculate n_carrier
summarize(hflights2,n_carrier=n_distinct(UniqueCarrier))
##   n_carrier
## 1        15

dplyr and mySQL databases

# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr", 
      host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "student",
                   password = "datacamp")

# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
glimpse(nycflights)
## Observations: ??
## Variables: 17
## Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ tailnum   <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ flight    <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ origin    <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest      <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time  <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance  <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, …
## $ minute    <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, 58…
# Ordered, grouped summary of nycflights
 nycflights %>% group_by(carrier) %>%
 summarise(n_flights=n(),avg_delay=mean(arr_delay))%>%
 arrange(avg_delay)
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source:     lazy query [?? x 3]
## # Database:   mysql 5.6.34-log
## #   [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
##    carrier n_flights avg_delay
##    <chr>       <dbl>     <dbl>
##  1 AS            714    -9.86 
##  2 HA            342    -6.92 
##  3 AA          32729     0.356
##  4 DL          48110     1.63 
##  5 VX           5162     1.75 
##  6 US          20536     2.06 
##  7 UA          58665     3.50 
##  8 9E          18460     6.91 
##  9 B6          54635     9.36 
## 10 WN          12275     9.47 
## # … with more rows