A tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble(). The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.
Partimos de un objeto de tipo data.frame. Los comandos stry glimpse dan información sobre el tipo de las variables contenidas en un objeto tibble. classda información sobre el tipo de objeto. Con head visualizamos las primeras filas de la BD.
data(hflights)
head(hflights)
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier
## 5424 2011 1 1 6 1400 1500 AA
## 5425 2011 1 2 7 1401 1501 AA
## 5426 2011 1 3 1 1352 1502 AA
## 5427 2011 1 4 2 1403 1513 AA
## 5428 2011 1 5 3 1405 1507 AA
## 5429 2011 1 6 4 1359 1503 AA
## FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin
## 5424 428 N576AA 60 40 -10 0 IAH
## 5425 428 N557AA 60 45 -9 1 IAH
## 5426 428 N541AA 70 48 -8 -8 IAH
## 5427 428 N403AA 70 39 3 3 IAH
## 5428 428 N492AA 62 44 -3 5 IAH
## 5429 428 N262AA 64 45 -7 -1 IAH
## Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
## 5424 DFW 224 7 13 0 0
## 5425 DFW 224 6 9 0 0
## 5426 DFW 224 5 17 0 0
## 5427 DFW 224 9 22 0 0
## 5428 DFW 224 9 9 0 0
## 5429 DFW 224 6 13 0 0
str(hflights)
## 'data.frame': 227496 obs. of 21 variables:
## $ Year : int 2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DayofMonth : int 1 2 3 4 5 6 7 8 9 10 ...
## $ DayOfWeek : int 6 7 1 2 3 4 5 6 7 1 ...
## $ DepTime : int 1400 1401 1352 1403 1405 1359 1359 1355 1443 1443 ...
## $ ArrTime : int 1500 1501 1502 1513 1507 1503 1509 1454 1554 1553 ...
## $ UniqueCarrier : chr "AA" "AA" "AA" "AA" ...
## $ FlightNum : int 428 428 428 428 428 428 428 428 428 428 ...
## $ TailNum : chr "N576AA" "N557AA" "N541AA" "N403AA" ...
## $ ActualElapsedTime: int 60 60 70 70 62 64 70 59 71 70 ...
## $ AirTime : int 40 45 48 39 44 45 43 40 41 45 ...
## $ ArrDelay : int -10 -9 -8 3 -3 -7 -1 -16 44 43 ...
## $ DepDelay : int 0 1 -8 3 5 -1 -1 -5 43 43 ...
## $ Origin : chr "IAH" "IAH" "IAH" "IAH" ...
## $ Dest : chr "DFW" "DFW" "DFW" "DFW" ...
## $ Distance : int 224 224 224 224 224 224 224 224 224 224 ...
## $ TaxiIn : int 7 6 5 9 9 6 12 7 8 6 ...
## $ TaxiOut : int 13 9 17 22 9 13 15 12 22 19 ...
## $ Cancelled : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CancellationCode : chr "" "" "" "" ...
## $ Diverted : int 0 0 0 0 0 0 0 0 0 0 ...
glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
class(hflights)
## [1] "data.frame"
summary(hflights)
## Year Month DayofMonth DayOfWeek
## Min. :2011 Min. : 1.000 Min. : 1.00 Min. :1.000
## 1st Qu.:2011 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2.000
## Median :2011 Median : 7.000 Median :16.00 Median :4.000
## Mean :2011 Mean : 6.514 Mean :15.74 Mean :3.948
## 3rd Qu.:2011 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:6.000
## Max. :2011 Max. :12.000 Max. :31.00 Max. :7.000
##
## DepTime ArrTime UniqueCarrier FlightNum
## Min. : 1 Min. : 1 Length:227496 Min. : 1
## 1st Qu.:1021 1st Qu.:1215 Class :character 1st Qu.: 855
## Median :1416 Median :1617 Mode :character Median :1696
## Mean :1396 Mean :1578 Mean :1962
## 3rd Qu.:1801 3rd Qu.:1953 3rd Qu.:2755
## Max. :2400 Max. :2400 Max. :7290
## NA's :2905 NA's :3066
## TailNum ActualElapsedTime AirTime ArrDelay
## Length:227496 Min. : 34.0 Min. : 11.0 Min. :-70.000
## Class :character 1st Qu.: 77.0 1st Qu.: 58.0 1st Qu.: -8.000
## Mode :character Median :128.0 Median :107.0 Median : 0.000
## Mean :129.3 Mean :108.1 Mean : 7.094
## 3rd Qu.:165.0 3rd Qu.:141.0 3rd Qu.: 11.000
## Max. :575.0 Max. :549.0 Max. :978.000
## NA's :3622 NA's :3622 NA's :3622
## DepDelay Origin Dest Distance
## Min. :-33.000 Length:227496 Length:227496 Min. : 79.0
## 1st Qu.: -3.000 Class :character Class :character 1st Qu.: 376.0
## Median : 0.000 Mode :character Mode :character Median : 809.0
## Mean : 9.445 Mean : 787.8
## 3rd Qu.: 9.000 3rd Qu.:1042.0
## Max. :981.000 Max. :3904.0
## NA's :2905
## TaxiIn TaxiOut Cancelled CancellationCode
## Min. : 1.000 Min. : 1.00 Min. :0.00000 Length:227496
## 1st Qu.: 4.000 1st Qu.: 10.00 1st Qu.:0.00000 Class :character
## Median : 5.000 Median : 14.00 Median :0.00000 Mode :character
## Mean : 6.099 Mean : 15.09 Mean :0.01307
## 3rd Qu.: 7.000 3rd Qu.: 18.00 3rd Qu.:0.00000
## Max. :165.000 Max. :163.00 Max. :1.00000
## NA's :3066 NA's :2947
## Diverted
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.002853
## 3rd Qu.:0.000000
## Max. :1.000000
##
Y lo convertimos a tibble con tbl_df para una mejor visualizacion, sin afectar a las propiedades de la BD. La forma de recurrir a las variables es similar a la de un data frame (por ejemplo hflights$UniqueCarrier).
hflights=tbl_df(hflights)
glimpse(hflights)
## Observations: 227,496
## Variables: 21
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
class(hflights)
## [1] "tbl_df" "tbl" "data.frame"
hflights
## # A tibble: 227,496 x 21
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 13 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>
Cuando queremos cambiar las etiquetas, definimos un vector relacionando las antiguas etiquetas con las nuevas.
# Las etiquetas originales de UniqueCarrier son
unique(hflights$UniqueCarrier)
## [1] "AA" "AS" "B6" "CO" "DL" "OO" "UA" "US" "WN" "EV" "F9" "FL" "MQ" "XE"
## [15] "YV"
# definimos un vector con reasignaciones de las etiquetas
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
# Definimos una nueva variable Carrier que asigna las nuevas etiquetas en funcion de las antiguas
hflights$Carrier <- lut[hflights$UniqueCarrier]
head(hflights$Carrier)
## [1] "American" "American" "American" "American" "American" "American"
# Glimpse at hflights
glimpse(hflights)
## Observations: 227,496
## Variables: 22
## $ Year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ Month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DayofMonth <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,…
## $ DayOfWeek <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7…
## $ DepTime <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 1355…
## $ ArrTime <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 1454…
## $ UniqueCarrier <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "AA"…
## $ FlightNum <int> 428, 428, 428, 428, 428, 428, 428, 428, 428, 4…
## $ TailNum <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N492A…
## $ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, 56…
## $ AirTime <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, 41…
## $ ArrDelay <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29, 5…
## $ DepDelay <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, -2…
## $ Origin <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "IAH…
## $ Dest <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "DFW…
## $ Distance <int> 224, 224, 224, 224, 224, 224, 224, 224, 224, 2…
## $ TaxiIn <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6, …
## $ TaxiOut <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11, …
## $ Cancelled <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ CancellationCode <chr> "", "", "", "", "", "", "", "", "", "", "", ""…
## $ Diverted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Carrier <chr> "American", "American", "American", "American"…
#Creamos una variable *Code* con una recodificación de CancellationCode
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")
# Add the Code column
hflights$Code <-lut[hflights$CancellationCode]
También podemos utilizar la librería forcats, con la que podemos prescindir o modificar niveles de un factor, o recodificarlos fácilmente.
The dplyr package contains five key data manipulation functions, also called verbs:
# Print out a tbl with the four columns of hflights related to delay
select(hflights,ActualElapsedTime, AirTime, ArrDelay, DepDelay)
## # A tibble: 227,496 x 4
## ActualElapsedTime AirTime ArrDelay DepDelay
## <int> <int> <int> <int>
## 1 60 40 -10 0
## 2 60 45 -9 1
## 3 70 48 -8 -8
## 4 70 39 3 3
## 5 62 44 -3 5
## 6 64 45 -7 -1
## 7 70 43 -1 -1
## 8 59 40 -16 -5
## 9 71 41 44 43
## 10 70 45 43 43
## # … with 227,486 more rows
# Print out the columns Origin up to Cancelled of hflights
names(hflights)
## [1] "Year" "Month" "DayofMonth"
## [4] "DayOfWeek" "DepTime" "ArrTime"
## [7] "UniqueCarrier" "FlightNum" "TailNum"
## [10] "ActualElapsedTime" "AirTime" "ArrDelay"
## [13] "DepDelay" "Origin" "Dest"
## [16] "Distance" "TaxiIn" "TaxiOut"
## [19] "Cancelled" "CancellationCode" "Diverted"
## [22] "Carrier" "Code"
select(hflights,Origin,Dest, Distance, TaxiIn, TaxiOut,Cancelled)
## # A tibble: 227,496 x 6
## Origin Dest Distance TaxiIn TaxiOut Cancelled
## <chr> <chr> <int> <int> <int> <int>
## 1 IAH DFW 224 7 13 0
## 2 IAH DFW 224 6 9 0
## 3 IAH DFW 224 5 17 0
## 4 IAH DFW 224 9 22 0
## 5 IAH DFW 224 9 9 0
## 6 IAH DFW 224 6 13 0
## 7 IAH DFW 224 12 15 0
## 8 IAH DFW 224 7 12 0
## 9 IAH DFW 224 8 22 0
## 10 IAH DFW 224 6 19 0
## # … with 227,486 more rows
# Answer to last question: be concise!
select(hflights,1:4)
## # A tibble: 227,496 x 4
## Year Month DayofMonth DayOfWeek
## <int> <int> <int> <int>
## 1 2011 1 1 6
## 2 2011 1 2 7
## 3 2011 1 3 1
## 4 2011 1 4 2
## 5 2011 1 5 3
## 6 2011 1 6 4
## 7 2011 1 7 5
## 8 2011 1 8 6
## 9 2011 1 9 7
## 10 2011 1 10 1
## # … with 227,486 more rows
# Print out a tbl containing just ArrDelay and DepDelay
select(hflights,contains("Delay"))
## # A tibble: 227,496 x 2
## ArrDelay DepDelay
## <int> <int>
## 1 -10 0
## 2 -9 1
## 3 -8 -8
## 4 3 3
## 5 -3 5
## 6 -7 -1
## 7 -1 -1
## 8 -16 -5
## 9 44 43
## 10 43 43
## # … with 227,486 more rows
# Print out a tbl as described in the second instruction, using both helper functions and variable names
select(hflights,UniqueCarrier,ends_with("Num"),starts_with("Cancel"))
## # A tibble: 227,496 x 5
## UniqueCarrier FlightNum TailNum Cancelled CancellationCode
## <chr> <int> <chr> <int> <chr>
## 1 AA 428 N576AA 0 ""
## 2 AA 428 N557AA 0 ""
## 3 AA 428 N541AA 0 ""
## 4 AA 428 N403AA 0 ""
## 5 AA 428 N492AA 0 ""
## 6 AA 428 N262AA 0 ""
## 7 AA 428 N493AA 0 ""
## 8 AA 428 N477AA 0 ""
## 9 AA 428 N476AA 0 ""
## 10 AA 428 N504AA 0 ""
## # … with 227,486 more rows
# Print out a tbl as described in the third instruction, using only helper functions.
select(hflights,ends_with("Time"),ends_with("Delay"))
## # A tibble: 227,496 x 6
## DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
## <int> <int> <int> <int> <int> <int>
## 1 1400 1500 60 40 -10 0
## 2 1401 1501 60 45 -9 1
## 3 1352 1502 70 48 -8 -8
## 4 1403 1513 70 39 3 3
## 5 1405 1507 62 44 -3 5
## 6 1359 1503 64 45 -7 -1
## 7 1359 1509 70 43 -1 -1
## 8 1355 1454 59 40 -16 -5
## 9 1443 1554 71 41 44 43
## 10 1443 1553 70 45 43 43
## # … with 227,486 more rows
# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights,starts_with("Taxi"),Distance)
#ex1d<-select(hflights,16:18)
# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights,1:6,-3)
# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights,starts_with("Ta"))
# hflights and dplyr are loaded and ready to serve you.
# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1=mutate(hflights,ActualGroundTime=ActualElapsedTime-AirTime)
# Add the new variable GroundTime to g1. Save the result as g2.
g2=mutate(g1,GroundTime=TaxiIn+TaxiOut)
# Add the new variable AverageSpeed to g2. Save the result as g3.
g3=mutate(g2,AverageSpeed=60 * Distance / AirTime)
# Print out g3
g3
## # A tibble: 227,496 x 26
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 1 6 1400 1500 AA 428
## 2 2011 1 2 7 1401 1501 AA 428
## 3 2011 1 3 1 1352 1502 AA 428
## 4 2011 1 4 2 1403 1513 AA 428
## 5 2011 1 5 3 1405 1507 AA 428
## 6 2011 1 6 4 1359 1503 AA 428
## 7 2011 1 7 5 1359 1509 AA 428
## 8 2011 1 8 6 1355 1454 AA 428
## 9 2011 1 9 7 1443 1554 AA 428
## 10 2011 1 10 1 1443 1553 AA 428
## # … with 227,486 more rows, and 18 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>, ActualGroundTime <int>,
## # GroundTime <int>, AverageSpeed <dbl>
Adding multiple variables with mutate
# hflights and dplyr are ready, are you?
# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay,
loss_ratio=loss/DepDelay)
# Add the three variables as described in the third instruction: m2
m2=mutate(hflights,TotalTaxi=TaxiIn+TaxiOut,
ActualGroundTime=ActualElapsedTime-AirTime, Diff=TotalTaxi-ActualGroundTime)
R comes with a set of logical operators that you can use inside filter():
x < y, TRUE if x is less than y x <= y, TRUE if x is less than or equal to y x == y, TRUE if x equals y x != y, TRUE if x does not equal y x >= y, TRUE if x is greater than or equal to y x > y, TRUE if x is greater than y x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
# hflights is at your disposal as a tbl, with clean carrier names
# All flights that traveled 3000 miles or more
filter(hflights,Distance>=3000)
## # A tibble: 527 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 31 1 924 1413 CO 1
## 2 2011 1 30 7 925 1410 CO 1
## 3 2011 1 29 6 1045 1445 CO 1
## 4 2011 1 28 5 1516 1916 CO 1
## 5 2011 1 27 4 950 1344 CO 1
## 6 2011 1 26 3 944 1350 CO 1
## 7 2011 1 25 2 924 1337 CO 1
## 8 2011 1 24 1 1144 1605 CO 1
## 9 2011 1 23 7 926 1335 CO 1
## 10 2011 1 22 6 942 1340 CO 1
## # … with 517 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# All flights flown by one of JetBlue, Southwest, or Delta
filter(hflights,UniqueCarrier %in% c("JetBlue", "Southwest","Delta"))
## # A tibble: 0 x 23
## # … with 23 variables: Year <int>, Month <int>, DayofMonth <int>,
## # DayOfWeek <int>, DepTime <int>, ArrTime <int>, UniqueCarrier <chr>,
## # FlightNum <int>, TailNum <chr>, ActualElapsedTime <int>,
## # AirTime <int>, ArrDelay <int>, DepDelay <int>, Origin <chr>,
## # Dest <chr>, Distance <int>, TaxiIn <int>, TaxiOut <int>,
## # Cancelled <int>, CancellationCode <chr>, Diverted <int>,
## # Carrier <chr>, Code <chr>
# All flights where taxiing took longer than flying
filter(hflights,AirTime<(TaxiOut+TaxiIn))
## # A tibble: 1,389 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 24 1 731 904 AA 460
## 2 2011 1 30 7 1959 2132 AA 533
## 3 2011 1 24 1 1621 1749 AA 1121
## 4 2011 1 10 1 941 1113 AA 1436
## 5 2011 1 31 1 1301 1356 CO 241
## 6 2011 1 31 1 2113 2215 CO 1533
## 7 2011 1 31 1 1434 1539 CO 1541
## 8 2011 1 31 1 900 1006 CO 1583
## 9 2011 1 30 7 1304 1408 CO 241
## 10 2011 1 30 7 2004 2128 CO 423
## # … with 1,379 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following two calls are completely equivalent:
filter(df, a > 0 & b > 0) filter(df, a > 0, b > 0)
filter(df, !is.na(x)) keeps the observations in df for which the variable x is not NA.
# All flights that departed before 5am or arrived after 10pm
filter(hflights, (DepTime<500) | (ArrTime>2200))
## # A tibble: 27,799 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 4 2 2100 2207 AA 533
## 2 2011 1 14 5 2119 2229 AA 533
## 3 2011 1 10 1 1934 2235 AA 1294
## 4 2011 1 26 3 1905 2211 AA 1294
## 5 2011 1 30 7 1856 2209 AA 1294
## 6 2011 1 9 7 1938 2228 AS 731
## 7 2011 1 31 1 1919 2231 CO 190
## 8 2011 1 31 1 2116 2344 CO 209
## 9 2011 1 31 1 1850 2211 CO 250
## 10 2011 1 31 1 2102 2216 CO 299
## # … with 27,789 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# All flights that departed late but arrived ahead of schedule
filter(hflights, (DepDelay>0) & (ArrDelay<0))
## # A tibble: 27,712 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 2 7 1401 1501 AA 428
## 2 2011 1 5 3 1405 1507 AA 428
## 3 2011 1 18 2 1408 1508 AA 428
## 4 2011 1 18 2 721 827 AA 460
## 5 2011 1 12 3 2015 2113 AA 533
## 6 2011 1 13 4 2020 2116 AA 533
## 7 2011 1 26 3 2009 2103 AA 533
## 8 2011 1 1 6 1631 1736 AA 1121
## 9 2011 1 10 1 1639 1740 AA 1121
## 10 2011 1 12 3 1631 1739 AA 1121
## # … with 27,702 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# All flights that were cancelled after being delayed
filter(hflights, (Cancelled==1) & (DepDelay>0))
## # A tibble: 40 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 26 3 1926 NA CO 310
## 2 2011 1 11 2 1100 NA US 944
## 3 2011 1 19 3 1811 NA XE 2376
## 4 2011 1 7 5 2028 NA XE 3050
## 5 2011 2 4 5 1638 NA AA 1121
## 6 2011 2 8 2 1057 NA CO 408
## 7 2011 2 2 3 802 NA XE 2189
## 8 2011 2 9 3 904 NA XE 2605
## 9 2011 2 1 2 1508 NA OO 5812
## 10 2011 3 31 4 1016 NA CO 586
## # … with 30 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# Select the flights that had JFK as their destination: c1
c1=filter(hflights,Dest=="JFK")
# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2=mutate(c1,Date=paste(Year,Month,DayofMonth,sep="-"))
# Print out a selection of columns of c2
select(c2,Date,DepTime,ArrTime,TailNum)
## # A tibble: 695 x 4
## Date DepTime ArrTime TailNum
## <chr> <int> <int> <chr>
## 1 2011-1-1 654 1124 N324JB
## 2 2011-1-1 1639 2110 N324JB
## 3 2011-1-2 703 1113 N324JB
## 4 2011-1-2 1604 2040 N324JB
## 5 2011-1-3 659 1100 N229JB
## 6 2011-1-3 1801 2200 N206JB
## 7 2011-1-4 654 1103 N267JB
## 8 2011-1-4 1608 2034 N267JB
## 9 2011-1-5 700 1103 N708JB
## 10 2011-1-5 1544 1954 N644JB
## # … with 685 more rows
Sirve para ordenar los datos respecto de algún criterio/variable.
# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))
# Arrange dtc by departure delays
arrange(dtc,DepDelay)
## # A tibble: 68 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 7 23 6 605 NA F9 225
## 2 2011 1 17 1 916 NA XE 3068
## 3 2011 12 1 4 541 NA US 282
## 4 2011 10 12 3 2022 NA MQ 3724
## 5 2011 7 29 5 1424 NA CO 1079
## 6 2011 9 29 4 1639 NA OO 2062
## 7 2011 2 9 3 555 NA MQ 3265
## 8 2011 5 9 1 715 NA OO 1177
## 9 2011 1 20 4 1413 NA UA 552
## 10 2011 1 17 1 831 NA WN 1
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# Arrange dtc so that cancellation reasons are grouped
arrange(dtc,CancellationCode)
## # A tibble: 68 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 1 20 4 1413 NA UA 552
## 2 2011 1 7 5 2028 NA XE 3050
## 3 2011 2 4 5 1638 NA AA 1121
## 4 2011 2 8 2 1057 NA CO 408
## 5 2011 2 1 2 1508 NA OO 5812
## 6 2011 2 21 1 2257 NA OO 1111
## 7 2011 2 9 3 555 NA MQ 3265
## 8 2011 3 18 5 727 NA UA 109
## 9 2011 4 4 1 1632 NA DL 8
## 10 2011 4 8 5 1608 NA WN 4
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# Arrange dtc according to carrier and departure delays
arrange(dtc,UniqueCarrier,DepDelay)
## # A tibble: 68 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 8 18 4 1808 NA AA 1294
## 2 2011 2 4 5 1638 NA AA 1121
## 3 2011 7 29 5 1424 NA CO 1079
## 4 2011 1 26 3 1703 NA CO 410
## 5 2011 8 11 4 1320 NA CO 1669
## 6 2011 7 25 1 1654 NA CO 1422
## 7 2011 1 26 3 1926 NA CO 310
## 8 2011 3 31 4 1016 NA CO 586
## 9 2011 2 8 2 1057 NA CO 408
## 10 2011 4 4 1 1632 NA DL 8
## # … with 58 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
Reverse the order of arranging By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behavior with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().
# dplyr and the hflights tbl are available
# Arrange according to carrier and decreasing departure delays
arrange(hflights,UniqueCarrier,desc(DepDelay))
## # A tibble: 227,496 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 12 12 1 650 808 AA 1740
## 2 2011 11 19 6 1752 1910 AA 1903
## 3 2011 12 22 4 1728 1848 AA 1903
## 4 2011 10 23 7 2305 2 AA 742
## 5 2011 9 27 2 1206 1300 AA 1948
## 6 2011 3 17 4 1647 1747 AA 1505
## 7 2011 6 21 2 955 1315 AA 466
## 8 2011 5 20 5 2359 130 AA 426
## 9 2011 4 19 2 2023 2142 AA 1925
## 10 2011 5 12 4 2133 53 AA 1294
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# Arrange flights by total delay (normal order).
arrange(hflights,DepDelay+ArrDelay)
## # A tibble: 227,496 x 23
## Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
## <int> <int> <int> <int> <int> <int> <chr> <int>
## 1 2011 7 3 7 1914 2039 XE 2804
## 2 2011 8 31 3 934 1039 OO 2040
## 3 2011 8 21 7 935 1039 OO 2001
## 4 2011 8 28 7 2059 2206 OO 2003
## 5 2011 8 29 1 935 1041 OO 2040
## 6 2011 12 25 7 741 926 OO 4591
## 7 2011 1 30 7 620 812 OO 4461
## 8 2011 8 3 3 1741 1810 XE 2603
## 9 2011 8 4 4 930 1041 OO 1171
## 10 2011 8 18 4 939 1043 OO 2001
## # … with 227,486 more rows, and 15 more variables: TailNum <chr>,
## # ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## # DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## # TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## # Diverted <int>, Carrier <chr>, Code <chr>
# hflights and dplyr are loaded in the workspace
# Print out a summary with variables min_dist and max_dist
summarise(hflights,min_dist=min(Distance),max_dist=max(Distance))
## # A tibble: 1 x 2
## min_dist max_dist
## <int> <int>
## 1 79 3904
# Print out a summary with variable max_div
c1=filter(hflights,Diverted==1)
summarise(c1,max_div=max(Distance))
## # A tibble: 1 x 1
## max_div
## <int>
## 1 3904
You can use any function you like in summarize() so long as the function can take a vector of data and return a single number. R contains many aggregating functions, as dplyr calls them:
min(x) - minimum value of vector x. max(x) - maximum value of vector x. mean(x) - mean value of vector x. median(x) - median value of vector x. quantile(x, p) - pth quantile of vector x. sd(x) - standard deviation of vector x. var(x) - variance of vector x. IQR(x) - Inter Quartile Range (IQR) of vector x. diff(range(x)) - total range of vector x.
# Remove rows that have NA ArrDelay: temp1
temp1=filter(hflights,!is.na(ArrDelay))
# Generate summary about ArrDelay column of temp1
summarise(temp1,earliest=min(ArrDelay),average=mean(ArrDelay),latest=max(ArrDelay),
sd=sd(ArrDelay))
## # A tibble: 1 x 4
## earliest average latest sd
## <int> <dbl> <int> <dbl>
## 1 -70 7.09 978 30.7
# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2=filter(hflights,!is.na(TaxiIn) & !is.na(TaxiOut))
# Print the maximum taxiing difference of temp2 with summarize()
summarise(temp2,max_taxi_diff=max(abs(TaxiIn-TaxiOut)))
## # A tibble: 1 x 1
## max_taxi_diff
## <int>
## 1 160
dplyr aggregate functions dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:
first(x) - The first element of vector x. last(x) - The last element of vector x. nth(x, n) - The nth element of vector x. n() - The number of rows in the data.frame or group of observations that summarize() describes. n_distinct(x) - The number of unique values in vector x.
Next to these dplyr-specific functions, you can also turn a logical test into an aggregating function with sum() or mean(). A logical test returns a vector of TRUE’s and FALSE’s. When you apply sum() or mean() to such a vector, R coerces each TRUE to a 1 and each FALSE to a 0. sum() then represents the total number of observations that passed the test; mean() represents the proportion.
# Generate summarizing statistics for hflights
summarize(hflights,
n_obs = n(),
n_carrier = n_distinct(UniqueCarrier),
n_dest = n_distinct(Dest))
## # A tibble: 1 x 3
## n_obs n_carrier n_dest
## <int> <int> <int>
## 1 227496 15 116
# All American Airline flights
aa <- filter(hflights, UniqueCarrier == "American")
# Generate summarizing statistics for aa
summarize(aa,n_flights=n(),
n_canc=sum(Cancelled),avg_delay=mean(ArrDelay,na.rm=TRUE))
## # A tibble: 1 x 3
## n_flights n_canc avg_delay
## <int> <int> <dbl>
## 1 0 0 NaN
# Write the 'piped' version of the English sentences.
hflights %>% mutate(diff=TaxiOut-TaxiIn) %>%
filter(!is.na(diff)) %>%
summarize(avg=mean(diff))
## # A tibble: 1 x 1
## avg
## <dbl>
## 1 8.99
# Chain together mutate(), filter() and summarize()
hflights %>% mutate(RealTime=ActualElapsedTime+100,
mph=60*Distance/RealTime)%>%
filter(!is.na(mph),mph<70) %>%
summarize(n_less=n(),n_dest=n_distinct(Dest),
min_dist=min(Distance),max_dist=max(Distance))
## # A tibble: 1 x 4
## n_less n_dest min_dist max_dist
## <int> <int> <int> <int>
## 1 6726 13 79 305
# Finish the command with a filter() and summarize() call
hflights %>%
mutate(
RealTime = ActualElapsedTime + 100,
mph = 60 * Distance / RealTime) %>%
filter(mph<105 | Cancelled==1 | Diverted==1)%>%
summarize(
n_non=n(),
n_dest=n_distinct(Dest),
min_dist=min(Distance),
max_dist=max(Distance))
## # A tibble: 1 x 4
## n_non n_dest min_dist max_dist
## <int> <int> <int> <int>
## 1 42400 113 79 3904
# Count the number of overnight flights
hflights %>% filter(!is.na(DepTime),
!is.na(ArrTime),DepTime>ArrTime) %>%
summarize(num=n())
## # A tibble: 1 x 1
## num
## <int>
## 1 2718
Resumimos cancelaciones y retrasos por compañía aérea.
# Make an ordered per-carrier summary of hflights
hflights %>%
group_by(UniqueCarrier) %>%
summarize(
p_canc = mean(Cancelled)*100,
avg_delay = mean(ArrDelay,na.rm=TRUE)) %>%
arrange(avg_delay,p_canc)
## # A tibble: 15 x 3
## UniqueCarrier p_canc avg_delay
## <chr> <dbl> <dbl>
## 1 US 1.13 -0.631
## 2 AA 1.85 0.892
## 3 FL 0.982 1.85
## 4 AS 0 3.19
## 5 YV 1.27 4.01
## 6 DL 1.59 6.08
## 7 CO 0.678 6.10
## 8 MQ 2.90 7.15
## 9 EV 3.45 7.26
## 10 WN 1.55 7.59
## 11 F9 0.716 7.67
## 12 XE 1.55 8.19
## 13 OO 1.39 8.69
## 14 B6 2.59 9.86
## 15 UA 1.64 10.5
# Ordered overview of average arrival delays per carrier
hflights %>% filter(!is.na(ArrDelay), ArrDelay>0) %>%
group_by(UniqueCarrier)%>%
summarize(avg=mean(ArrDelay))%>%
mutate(rank=rank(avg)) %>%
arrange(rank)
## # A tibble: 15 x 3
## UniqueCarrier avg rank
## <chr> <dbl> <dbl>
## 1 YV 18.7 1
## 2 F9 18.7 2
## 3 US 20.7 3
## 4 CO 22.1 4
## 5 AS 22.9 5
## 6 OO 24.1 6
## 7 XE 24.2 7
## 8 WN 25.3 8
## 9 FL 27.9 9
## 10 AA 28.5 10
## 11 DL 32.1 11
## 12 UA 32.5 12
## 13 MQ 38.8 13
## 14 EV 40.2 14
## 15 B6 45.5 15
How many airplanes only flew to one destination?
hflights %>%
group_by(TailNum) %>%
summarize(ndest=n_distinct(Dest)) %>%
filter(ndest==1) %>%
summarize(nplanes=sum(ndest))
## # A tibble: 1 x 1
## nplanes
## <int>
## 1 1526
# Find the most visited destination for each carrier
hflights %>%
group_by(UniqueCarrier,Dest) %>%
summarize(n=n()) %>%
mutate(rank=rank(desc(n))) %>%
filter(rank==1)
## # A tibble: 15 x 4
## # Groups: UniqueCarrier [15]
## UniqueCarrier Dest n rank
## <chr> <chr> <int> <dbl>
## 1 AA DFW 2105 1
## 2 AS SEA 365 1
## 3 B6 JFK 695 1
## 4 CO EWR 3924 1
## 5 DL ATL 2396 1
## 6 EV DTW 851 1
## 7 F9 DEN 837 1
## 8 FL ATL 2029 1
## 9 MQ DFW 2424 1
## 10 OO COS 1335 1
## 11 UA SFO 643 1
## 12 US CLT 2212 1
## 13 WN DAL 8243 1
## 14 XE CRP 3175 1
## 15 YV CLT 71 1
dplyr deals with different types hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:
hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.
Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
hflights2 <- as.data.table(hflights)
# hflights2 is pre-loaded as a data.table
# Use summarize to calculate n_carrier
summarize(hflights2,n_carrier=n_distinct(UniqueCarrier))
## n_carrier
## 1 15
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")
# glimpse at nycflights
glimpse(nycflights)
## Observations: ??
## Variables: 17
## Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558,…
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849…
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", …
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N39…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, …
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA"…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD"…
## $ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, …
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733,…
## $ hour <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, …
## $ minute <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, 58…
# Ordered, grouped summary of nycflights
nycflights %>% group_by(carrier) %>%
summarise(n_flights=n(),avg_delay=mean(arr_delay))%>%
arrange(avg_delay)
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source: lazy query [?? x 3]
## # Database: mysql 5.6.34-log
## # [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
## carrier n_flights avg_delay
## <chr> <dbl> <dbl>
## 1 AS 714 -9.86
## 2 HA 342 -6.92
## 3 AA 32729 0.356
## 4 DL 48110 1.63
## 5 VX 5162 1.75
## 6 US 20536 2.06
## 7 UA 58665 3.50
## 8 9E 18460 6.91
## 9 B6 54635 9.36
## 10 WN 12275 9.47
## # … with more rows