Getting the data.

Exercise proposed in the Data Science course, available at http://rpubs.com/phsg13/530258.

The task is to use the flights dataset from the nycflights13 package (or the file at https://raw.githubusercontent.com/ismayc/pnwflights14/master/data/flights.csv) to answer the questions that follow:

Note: “|” denotes “or”; “&” denotes “and”.
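
Both operators are vectorized in R: they compare element by element and return a logical vector, which is the kind of condition that dplyr's filter() takes. A minimal sketch on hypothetical toy vectors (the values are illustrative and not taken from the dataset):

```r
# Hypothetical toy data, for illustration only
carrier <- c("UA", "AA", "B6", "DL")
month   <- c(1, 7, 7, 12)

# "|" is TRUE where at least one condition holds
is_ua_or_aa <- carrier == "UA" | carrier == "AA"

# "&" is TRUE only where both conditions hold
is_aa_in_july <- carrier == "AA" & month == 7

print(is_ua_or_aa)
print(is_aa_in_july)
```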

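library(nycflights13)  # provides the flights dataset
library(dplyr)         # provides %>%, filter() and glimpse()
flights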
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558,…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600,…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, …
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3,…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, …
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN",…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", …
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", …
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, …
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944,…
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-0…

Answering the questions.

Give a brief description of this dataset;

The dataset has 336,776 observations and 19 variables with on-time information for all flights that departed from New York City (that is, from JFK, LGA, or EWR) in 2013. The data were made publicly available by the US Department of Transportation (DOT) and describe the country's extensive airport system and flight patterns.

summary(flights)
##       year          month             day           dep_time   
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400  
##                                                  NA's   :8255  
##  sched_dep_time   dep_delay          arr_time    sched_arr_time
##  Min.   : 106   Min.   : -43.00   Min.   :   1   Min.   :   1  
##  1st Qu.: 906   1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124  
##  Median :1359   Median :  -2.00   Median :1535   Median :1556  
##  Mean   :1344   Mean   :  12.64   Mean   :1502   Mean   :1536  
##  3rd Qu.:1729   3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945  
##  Max.   :2359   Max.   :1301.00   Max.   :2400   Max.   :2359  
##                 NA's   :8255      NA's   :8713                 
##    arr_delay          carrier              flight       tailnum         
##  Min.   : -86.000   Length:336776      Min.   :   1   Length:336776     
##  1st Qu.: -17.000   Class :character   1st Qu.: 553   Class :character  
##  Median :  -5.000   Mode  :character   Median :1496   Mode  :character  
##  Mean   :   6.895                      Mean   :1972                     
##  3rd Qu.:  14.000                      3rd Qu.:3465                     
##  Max.   :1272.000                      Max.   :8500                     
##  NA's   :9430                                                           
##     origin              dest              air_time        distance   
##  Length:336776      Length:336776      Min.   : 20.0   Min.   :  17  
##  Class :character   Class :character   1st Qu.: 82.0   1st Qu.: 502  
##  Mode  :character   Mode  :character   Median :129.0   Median : 872  
##                                        Mean   :150.7   Mean   :1040  
##                                        3rd Qu.:192.0   3rd Qu.:1389  
##                                        Max.   :695.0   Max.   :4983  
##                                        NA's   :9430                  
##       hour           minute        time_hour                  
##  Min.   : 1.00   Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 9.00   1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :13.00   Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :13.18   Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:17.00   3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :23.00   Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

The variables in the dataset are:

names(flights)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"
  • year, month, day: date of departure of the flight.
  • dest: destination of the flight.
  • dep_time, arr_time: actual departure and arrival times, in the local time zone.
  • dep_delay, arr_delay: departure and arrival delays, in minutes. Negative values represent early departures/arrivals.
  • carrier: two-letter airline abbreviation.
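
In nycflights13, dep_time and sched_dep_time are stored as integers in HHMM form (517 means 05:17), so subtracting them directly does not give minutes. A minimal sketch of the conversion, using a hypothetical helper named hhmm_to_minutes and the values from rows 1 and 5 of the tibble printed above:

```r
# Hypothetical helper: convert clock times stored as HHMM integers
# into minutes elapsed since midnight
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100) * 60 + hhmm %% 100
}

dep_time       <- c(517, 554)
sched_dep_time <- c(515, 600)

# Difference in minutes; this ignores flights that cross midnight
delay <- hhmm_to_minutes(dep_time) - hhmm_to_minutes(sched_dep_time)
print(delay)
```

The result, 2 and -6, agrees with the dep_delay values shown for those two rows.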

Find the flights that went to Houston (IAH or HOU);

# %in% tests membership; == would recycle c("IAH","HOU") and silently miss rows
df <- flights %>% filter(dest %in% c("IAH","HOU"))
unique(df$dest)
## [1] "IAH" "HOU"
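
When matching a column against several values, `==` and `%in%` behave differently: `==` recycles the shorter vector and compares position by position, while `%in%` tests each element for membership in the set. Checking unique(df$dest) does not reveal the difference, because both codes still appear in either case. A minimal sketch on a hypothetical toy vector:

```r
# Hypothetical toy vector of destination codes
dest <- c("HOU", "IAH", "MIA", "IAH")

# == recycles c("IAH", "HOU") to c("IAH", "HOU", "IAH", "HOU")
# and compares position by position
by_equals <- dest == c("IAH", "HOU")

# %in% tests each element for membership in the set
by_in <- dest %in% c("IAH", "HOU")

print(by_equals)
print(by_in)
```

Here `==` matches none of the three Houston entries, because each one happens to sit opposite the other code, whereas `%in%` finds all three.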

Find the flights that were operated by United, American, or Delta;

Airline carrier codes: United Air Lines Inc. (UA), American Airlines Inc. (AA), and Delta Air Lines Inc. (DL).

# Two equivalent ways to do this:
df_1 <- flights %>% filter(carrier %in% c("UA","AA","DL"))
df_2 <- flights %>% filter(carrier == "UA" | carrier == "AA" | carrier == "DL" )
all.equal(df_1,df_2)
## [1] TRUE

Find the flights that arrived two or more hours late.

# arr_delay is measured in minutes, so two hours is 120
df <- flights %>% filter(arr_delay >= 120)
glimpse(df)
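
arr_delay is recorded in minutes, so a two-hour threshold corresponds to 120. The column also contains missing values (9,430 according to the summary above), and a comparison with NA yields NA; dplyr's filter() keeps only the rows where the condition is TRUE, so those rows are dropped. A minimal base R sketch with hypothetical values:

```r
# Hypothetical arrival delays in minutes, including a missing value
arr_delay <- c(11, 120, 345, NA, -18)

# Comparison with NA gives NA rather than TRUE or FALSE
is_late <- arr_delay >= 120

# which() keeps only the positions that are TRUE, mirroring how
# filter() drops rows where the condition is NA
late_delays <- arr_delay[which(is_late)]

print(is_late)
print(late_delays)
```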

Find the flights that departed in July, August, and September.

df <- flights %>% filter(month %in% c(7,8,9))
unique(df$month)
## [1] 7 8 9