Abstract
Exercise 2 of the Data Science course (Tópicos Especiais em Estatística e Experimentação Agropecuária), taught by Professor Paulo Henrique in the graduate program of the Universidade Federal de Lavras. The exercise was proposed in the course at http://rpubs.com/phsg13/530258.
The task is to use the flights dataset from the nycflights13 package, or the file at https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv, to answer the questions that follow:
Note: "|" denotes "or"; "&" denotes "and".
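As a small illustration of the two operators, the sketch below applies them element by element to toy vectors. The vector names and values are made up for the example and are not taken from flights.

```r
# Toy vectors (hypothetical values, not rows of flights)
month   <- c(1, 7, 8, 12)
carrier <- c("UA", "AA", "B6", "DL")

# "|" is TRUE when at least one of the two conditions holds
summer_or_ua <- month %in% c(7, 8) | carrier == "UA"

# "&" is TRUE only when both conditions hold
summer_and_aa <- month %in% c(7, 8) & carrier == "AA"

print(summer_or_ua)   # TRUE  TRUE  TRUE FALSE
print(summer_and_aa)  # FALSE TRUE FALSE FALSE
```

Inside filter(), the same logical vectors decide which rows are kept.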
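library(nycflights13)  # provides the flights dataset
library(dplyr)         # provides %>%, filter(), and glimpse()
flights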
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600,…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, …
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3,…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, …
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN",…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", …
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", …
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, …
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944,…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-0…
The dataset has 336,776 observations and 19 variables with on-time information for all flights that departed from New York City (that is, from JFK, LGA, or EWR) in 2013. The data were made publicly available by the United States Department of Transportation (DOT) and describe the country's extensive airport system and flight patterns.
summary(flights)
##       year          month             day           dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907
##  Median :2013   Median : 7.000   Median :16.00   Median :1401
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400
##                                                  NA's   :8255
##  sched_dep_time    dep_delay         arr_time     sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124
##  Median :1359   Median :  -2.00   Median :1535   Median :1556
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359
##                 NA's   :8255      NA's   :8713
##    arr_delay          carrier              flight       tailnum
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character
##  Mean   :   6.895                      Mean   :1972
##  3rd Qu.:  14.000                      3rd Qu.:3465
##  Max.   :1272.000                      Max.   :8500
##  NA's   :9430
##     origin              dest              air_time        distance
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502
##  Mode  :character   Mode  :character   Median :129.0   Median : 872
##                                        Mean   :150.7   Mean   :1040
##                                        3rd Qu.:192.0   3rd Qu.:1389
##                                        Max.   :695.0   Max.   :4983
##                                        NA's   :9430
##       hour           minute        time_hour
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:22:54
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00
##
names(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
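# %in% tests membership; == would recycle c("IAH","HOU") row by row and silently drop matching flights
df <- flights %>% filter(dest %in% c("IAH","HOU"))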
unique(df$dest)
## [1] "IAH" "HOU"
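When a column is matched against several values, `==` and `%in%` behave differently: `==` recycles the right-hand vector position by position, while `%in%` tests membership. Checking unique() on the result does not reveal the difference, because both approaches return only the requested values. The sketch below shows this on a toy vector with made-up values (not rows of flights), using base R subsetting.

```r
# Toy destination vector (hypothetical values)
dest <- c("IAH", "HOU", "HOU", "IAH", "ATL", "HOU")

# `==` compares dest[1] with "IAH", dest[2] with "HOU", dest[3] with "IAH", ...
by_equals <- dest[dest == c("IAH", "HOU")]

# `%in%` keeps every element that matches any of the listed values
by_in <- dest[dest %in% c("IAH", "HOU")]

print(length(by_equals))  # 3: two Houston entries were dropped
print(length(by_in))      # 5: every Houston entry is kept
```

No warning is raised here because the length of dest is a multiple of 2, which is why the problem is easy to miss.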
Carrier codes for the airlines: United Air Lines Inc. (UA), American Airlines Inc. (AA), and Delta Air Lines Inc. (DL).
# Two ways to do this:
df_1 <- flights %>% filter(carrier %in% c("UA","AA","DL"))
df_2 <- flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL" )
all.equal(df_1,df_2)
## [1] TRUE
# arr_delay is in minutes: this keeps flights that arrived at least 2 minutes late
df <- flights %>% filter(arr_delay >= 2)
glimpse(df)
## Observations: 127,929
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 554, 555, 558, 558, 559, 600, 602,…
## $ sched_dep_time <int> 515, 529, 540, 558, 600, 600, 600, 600, 600, 605,…
## $ dep_delay <dbl> 2, 4, 2, -4, -5, -2, -2, -1, 0, -3, 8, 11, 3, -8,…
## $ arr_time <int> 830, 850, 923, 740, 913, 753, 924, 941, 837, 821,…
## $ sched_arr_time <int> 819, 830, 850, 728, 854, 745, 917, 910, 825, 805,…
## $ arr_delay <dbl> 11, 20, 33, 12, 19, 8, 7, 31, 12, 16, 32, 14, 4, …
## $ carrier <chr> "UA", "UA", "AA", "UA", "B6", "AA", "UA", "AA", "…
## $ flight <int> 1545, 1714, 1141, 1696, 507, 301, 194, 707, 4650,…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N39463", "N516JB",…
## $ origin <chr> "EWR", "LGA", "JFK", "EWR", "EWR", "LGA", "JFK", …
## $ dest <chr> "IAH", "IAH", "MIA", "ORD", "FLL", "ORD", "LAX", …
## $ air_time <dbl> 227, 227, 160, 150, 158, 138, 345, 257, 134, 105,…
## $ distance <dbl> 1400, 1416, 1089, 719, 1065, 733, 2475, 1389, 762…
## $ hour <dbl> 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 58, 0, 0, 0, 0, 0, 5, 0, 0, 10, 30, 1…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-0…
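In nycflights13, arr_delay is recorded in minutes, so a threshold of 2 selects flights that arrived at least two minutes late. If the intended cutoff were two hours, the threshold would be 120. The sketch below contrasts the two thresholds on a toy vector with made-up values, using base R subsetting. The explicit !is.na() mimics dplyr's filter(), which drops rows where the condition evaluates to NA.

```r
# Toy arrival delays in minutes (hypothetical values; NA stands for a flight with no recorded arrival)
arr_delay <- c(11, 20, 135, -18, NA, 120)

# At least two minutes late
late_2_min <- arr_delay[!is.na(arr_delay) & arr_delay >= 2]

# At least two hours (120 minutes) late
late_2_hours <- arr_delay[!is.na(arr_delay) & arr_delay >= 120]

print(late_2_min)    # 11  20 135 120
print(late_2_hours)  # 135 120
```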
df <- flights %>% filter(month %in% c(7,8,9))
unique(df$month)
## [1] 7 8 9
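Because the three months are consecutive, the same selection can also be written as a range test. The sketch below checks on a toy vector (made-up values) that both conditions pick the same elements.

```r
# Toy month vector (hypothetical values)
month <- c(1, 6, 7, 8, 9, 10, 12)

# Membership test, as in the filter above
by_in <- month %in% c(7, 8, 9)

# Equivalent range test, combining two comparisons with "&"
by_range <- month >= 7 & month <= 9

print(identical(by_in, by_range))  # TRUE
print(month[by_range])             # 7 8 9
```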