Homework 3 - dplyr

Packages used in these exercises

library(nycflights13)
library(dplyr)
library(stringr)
library(kableExtra)
library(janitor)
library(ggplot2)

\[ \star \]

Exercise 1

Information about the dataset is available on the help page of the flights table (?flights).

ny_voos <- nycflights13::flights

glimpse(ny_voos)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 8...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-0...

Ex (a)

ny_voos %>% filter(arr_delay >= 120)

## # A tibble: 10,200 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      811            630       101     1047            830
##  2  2013     1     1      848           1835       853     1001           1950
##  3  2013     1     1      957            733       144     1056            853
##  4  2013     1     1     1114            900       134     1447           1222
##  5  2013     1     1     1505           1310       115     1638           1431
##  6  2013     1     1     1525           1340       105     1831           1626
##  7  2013     1     1     1549           1445        64     1912           1656
##  8  2013     1     1     1558           1359       119     1718           1515
##  9  2013     1     1     1732           1630        62     2028           1825
## 10  2013     1     1     1803           1620       103     2008           1750
## # ... with 10,190 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 10,200 flights arrived two or more hours late.
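One detail worth keeping in mind with this kind of count: filter() keeps only rows where the condition is TRUE, so flights with a missing arr_delay are dropped rather than counted. A minimal sketch on a made-up tibble (the `toy` data and its values are assumptions for illustration, not rows from flights):

```r
library(dplyr)

# Toy data (hypothetical values): one arrival delay is missing
toy <- tibble(flight = 1:4, arr_delay = c(130, NA, 120, 15))

# Rows where the comparison evaluates to NA are dropped, just like FALSE rows
late <- toy %>% filter(arr_delay >= 120)

# nrow() returns the count directly instead of reading it off the printed header
n_late <- nrow(late)
n_late
```

Using nrow() (or count()) avoids copying the number by hand from the tibble header.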

\[ \cdots \]

Ex (b)

ny_voos %>% filter(dest == "HOU" | dest == "IAH")

## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The %in% operator can simplify the code in cases like this.

ny_voos %>% filter(dest %in% c("HOU", "IAH"))

## # A tibble: 9,313 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      623            627        -4      933            932
##  4  2013     1     1      728            732        -4     1041           1038
##  5  2013     1     1      739            739         0     1104           1038
##  6  2013     1     1      908            908         0     1228           1219
##  7  2013     1     1     1028           1026         2     1350           1339
##  8  2013     1     1     1044           1045        -1     1352           1351
##  9  2013     1     1     1114            900       134     1447           1222
## 10  2013     1     1     1205           1200         5     1503           1505
## # ... with 9,303 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 9,313 flights had Houston (HOU or IAH) as their destination.
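To illustrate why the two filters return the same rows, here is a small sketch on made-up destinations (the `toy` tibble is an assumption for illustration). Both forms drop the missing value: the chain of == comparisons yields NA for it, and %in% yields FALSE.

```r
library(dplyr)

# Toy data (hypothetical values), including one missing destination
toy <- tibble(dest = c("HOU", "ATL", "IAH", "MIA", NA))

# Chained comparisons joined with |
with_or <- toy %>% filter(dest == "HOU" | dest == "IAH")

# Set membership with %in%
with_in <- toy %>% filter(dest %in% c("HOU", "IAH"))

same_rows <- identical(with_or$dest, with_in$dest)
same_rows
```

The %in% form scales better as the list of values grows, since only the vector on the right-hand side changes.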

\[ \cdots \]

Ex (c)

First, I want to find the codes of these airlines.

airlines %>%
    filter(name %in% c("American Airlines Inc.",
                       "United Air Lines Inc.",
                       "Delta Air Lines Inc.")) %>%
    knitr::kable(col.names = c("Code", "Airline")) %>%
    kable_styling(full_width = FALSE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Code	Airline
AA	American Airlines Inc.
DL	Delta Air Lines Inc.
UA	United Air Lines Inc.

Knowing the airline codes, we can filter on the carrier column.

ny_voos %>% filter(carrier %in% c("AA", "DL", "UA"))

## # A tibble: 139,504 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      558            600        -2      753            745
##  7  2013     1     1      558            600        -2      924            917
##  8  2013     1     1      558            600        -2      923            937
##  9  2013     1     1      559            600        -1      941            910
## 10  2013     1     1      559            600        -1      854            902
## # ... with 139,494 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 139,504 flights were operated by these airlines.

\[ \cdots \]

Ex (d)

Since the month column is numeric, we filter by the number corresponding to each month (July, August, and September).

ny_voos %>% 
    filter(month %in% c(7, 8, 9))

## # A tibble: 86,326 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 86,316 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 86,326 flights departed in the summer.

\[ \cdots \]

Ex (e)

ny_voos %>% filter(dep_delay <= 0 &
                   arr_delay >= 120)

## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1    27     1419           1420        -1     1754           1550
##  2  2013    10     7     1350           1350         0     1736           1526
##  3  2013    10     7     1357           1359        -2     1858           1654
##  4  2013    10    16      657            700        -3     1258           1056
##  5  2013    11     1      658            700        -2     1329           1015
##  6  2013     3    18     1844           1847        -3       39           2219
##  7  2013     4    17     1635           1640        -5     2049           1845
##  8  2013     4    18      558            600        -2     1149            850
##  9  2013     4    18      655            700        -5     1213            950
## 10  2013     5    22     1827           1830        -3     2217           2010
## # ... with 19 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 29 flights did not depart late but still arrived at least two hours late.
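As a side note, filter() combines comma-separated conditions with a logical AND, so the & above could also be written as two arguments. A small sketch with invented delays (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(dep_delay = c(-2, 0, 15, -5),
              arr_delay = c(130, 90, 150, 120))

# Explicit logical AND
with_and <- toy %>% filter(dep_delay <= 0 & arr_delay >= 120)

# Comma-separated conditions are combined with AND as well
with_comma <- toy %>% filter(dep_delay <= 0, arr_delay >= 120)

same_rows <- identical(with_and$arr_delay, with_comma$arr_delay)
same_rows
```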

\[ \cdots \]

Ex (f)

ny_voos %>% filter(dep_time == 2400 |
                   dep_time <= 600)

## # A tibble: 9,373 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 9,363 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Therefore, 9,373 flights departed between midnight and 6 a.m.
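The special case for midnight exists because dep_time stores midnight as 2400 rather than 0. One possible alternative, sketched here on a hypothetical vector of HHMM values, is to fold 2400 onto 0 with the modulo operator so that a single comparison covers both cases:

```r
# Hypothetical HHMM departure times, including midnight recorded as 2400
dep_time <- c(517L, 2400L, 601L, 600L, 1L, 2359L)

# Condition as written in the exercise
explicit <- dep_time == 2400 | dep_time <= 600

# 2400 %% 2400 is 0, so midnight passes the same <= 600 test
modular <- dep_time %% 2400 <= 600

same_result <- identical(explicit, modular)
kept <- dep_time[modular]
kept
```

Inside a pipeline this would read filter(dep_time %% 2400 <= 600); whether it is clearer than the explicit version is a matter of taste.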

\[ \cdots \]

Exercise 2

Sorting the flights to find the most delayed ones.

ny_voos %>% arrange(desc(dep_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     6    27      959           1900       899     1236           2226
##  9  2013     7    22     2257            759       898      121           1026
## 10  2013    12     5      756           1700       896     1058           2020
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
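A point that is easy to miss when sorting by delay: arrange() always places missing values at the end, even when desc() is used, so flights without a recorded dep_delay never appear at the top. A minimal sketch with made-up rows (the `toy` tibble is an assumption for illustration):

```r
library(dplyr)

# Toy data (hypothetical values): flight "B" has no recorded delay
toy <- tibble(flight = c("A", "B", "C", "D"),
              dep_delay = c(5, NA, 1301, -3))

# Largest delay first; the NA row still ends up last
sorted <- toy %>% arrange(desc(dep_delay))
sorted$flight
```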

\[ \cdots \]

Exercise 3

As described in the dataset documentation, distance is measured in miles and air time in minutes. Let's create a column with the speed in km/h: since 1 km is about 0.62137 miles, the distance is divided by 0.62137 to get kilometers, and air time is divided by 60 to get hours. I create a new, modified tbl_df and sort it by the fastest flights.

ny_voos_mdf <- ny_voos %>% mutate(vel = (distance / 0.62137) / (air_time / 60))
ny_voos_mdf %>% arrange(desc(vel))

## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     5    25     1709           1700         9     1923           1937
##  2  2013     7     2     1558           1513        45     1745           1719
##  3  2013     5    13     2040           2025        15     2225           2226
##  4  2013     3    23     1914           1910         4     2045           2043
##  5  2013     1    12     1559           1600        -1     1849           1917
##  6  2013    11    17      650            655        -5     1059           1150
##  7  2013     2    21     2355           2358        -3      412            438
##  8  2013    11    17      759            800        -1     1212           1255
##  9  2013    11    16     2003           1925        38       17             36
## 10  2013    11    16     2349           2359       -10      402            440
## # ... with 336,766 more rows, and 12 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   vel <dbl>
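To make the unit conversion concrete, here is a worked sketch using the first row shown by glimpse() (distance 1400 miles, air_time 227 minutes). The helper name `speed_kmh` is hypothetical and only wraps the same formula used in the mutate() call.

```r
# Hypothetical helper: same formula as the vel column
speed_kmh <- function(distance_miles, air_time_min) {
  km <- distance_miles / 0.62137   # miles to kilometers
  hours <- air_time_min / 60       # minutes to hours
  km / hours
}

# First flight in the dataset: 1400 miles in 227 minutes
vel_first <- speed_kmh(1400, 227)
round(vel_first, 1)
```

The result is a little under 600 km/h, which is a plausible average speed for a commercial flight once taxiing and climbing are included in the distance-over-time average.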

\[ \cdots \]

Exercise 4

Even when a variable name is included several times inside select(), the function selects that variable only once.

ny_voos %>% select(month, month, month)

## # A tibble: 336,776 x 1
##    month
##    <int>
##  1     1
##  2     1
##  3     1
##  4     1
##  5     1
##  6     1
##  7     1
##  8     1
##  9     1
## 10     1
## # ... with 336,766 more rows
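The same behavior can be checked when other columns are mixed in: the repeated name is ignored and the column keeps the position of its first mention. A small sketch on a made-up tibble (the `toy` data is an assumption for illustration):

```r
library(dplyr)

# Toy data (hypothetical values)
toy <- tibble(month = 1:2, day = 3:4, year = 2013L)

# month is named twice; only its first position counts
picked <- toy %>% select(month, day, month)
names(picked)
```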

\[ \cdots \]

Exercise 5

ny_voos %>%
    arrange(desc(dep_delay)) %>%
    # Rank 1 is the largest departure delay
    mutate(rank_dpdelay = min_rank(desc(dep_delay))) %>%
    filter(rank_dpdelay <= 10) %>%
    select(dep_delay, rank_dpdelay) %>%
    knitr::kable(col.names = c("Departure delay*", "Rank")) %>%
    kable_styling(full_width = FALSE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Departure delay*	Rank
1301	1
1137	2
1126	3
1014	4
1005	5
960	6
911	7
899	8
898	9
896	10

* Delay in minutes
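In this table every rank is distinct, but min_rank() gives tied values the same rank and then skips the following ones, so a filter on the top 10 ranks can return more than 10 rows when there are ties. A minimal sketch with invented delays (the `delays` vector is an assumption for illustration):

```r
library(dplyr)

# Hypothetical delays with a tie at the top and one missing value
delays <- c(30, 90, 90, 10, NA)

# Both 90s get rank 1, the next value gets rank 3; NA stays NA
ranks <- min_rank(desc(delays))
ranks
```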

\[ \cdots \]

Exercise 6

ny_voos <- ny_voos %>%
    mutate(air_time_mean = mean(air_time, na.rm = TRUE))

Therefore, the mean time in the air is about 150.69 minutes (150.6864602), ignoring flights with a missing air_time.
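Because mutate() repeats the mean on every row, a single summary value can also be obtained with summarise(). The sketch below uses a made-up tibble (the `toy` data is an assumption for illustration) to show the difference between the two.

```r
library(dplyr)

# Toy data (hypothetical values) with one missing air time
toy <- tibble(air_time = c(100, 200, NA, 150))

# mutate(): same number of rows, mean repeated in a new column
with_column <- toy %>% mutate(air_time_mean = mean(air_time, na.rm = TRUE))

# summarise(): collapses to a single row holding the mean
one_value <- toy %>% summarise(air_time_mean = mean(air_time, na.rm = TRUE))
one_value$air_time_mean
```

Without na.rm = TRUE both versions would return NA, since one air time is missing.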

\[ \cdots \]

Exercise 7

Since the dep_time column stores times in HHMM format (for example, 517 means 5:17), I converted the column to strings and picked out the hour and minute characters of each value to do the arithmetic.

vet <- c(ny_voos_mdf$dep_time) %>% as.character()

# Convert each HHMM string into minutes since midnight
vet2 <- list()
for (i in seq_along(vet)) {
    if (is.na(vet[i])) {
        # A missing departure time stays missing
        vet2[i] <- NA

    } else if (str_length(vet[i]) == 4) {
        # "HHMM": two hour digits, two minute digits
        s <- (as.numeric(str_sub(vet[i], end = 2)) * 60) + as.numeric(str_sub(vet[i], start = 3))
        # %% 1440 maps 2400 (midnight) to 0, as in the other solutions
        vet2[i] <- s %% 1440

    } else if (str_length(vet[i]) == 3) {
        # "HMM": one hour digit, two minute digits
        s <- (as.numeric(str_sub(vet[i], end = 1)) * 60) + as.numeric(str_sub(vet[i], start = 2))
        vet2[i] <- s

    } else {
        # One or two characters: minutes only (00:01 to 00:59)
        vet2[i] <- as.numeric(vet[i])
    }
}

Adding a column with the continuous time in minutes to the data frame.

ny_voos_mdf <- ny_voos_mdf %>% mutate(dep_time_min = vet2)

Another way to solve it

# Integer division gives the hours, the remainder gives the minutes
ny_voos_teste <- ny_voos %>% 
    mutate(dp_tm_min = ((dep_time %/% 100) * 60) + dep_time %% 100)

vet_teste <- c(ny_voos_teste$dp_tm_min)
for (i in seq_along(vet_teste)) {
    if (is.na(vet_teste[i])) {
        # Guard: comparing NA inside if() would raise an error
        vet_teste[i] <- NA

    } else if (vet_teste[i] == 1440) {
        # Midnight recorded as 2400 becomes minute 0
        vet_teste[i] <- 0
    }
}

ny_voos_teste <- ny_voos_teste %>% 
    mutate(dp_tm_min = vet_teste)
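The loop above can also be avoided altogether, since the arithmetic operators are vectorized and NA propagates on its own. The following is a sketch of that idea, not the approach used in the homework; the function name `to_minutes` is hypothetical.

```r
# Hypothetical helper: HHMM integer to minutes since midnight
to_minutes <- function(hhmm) {
  hours <- hhmm %/% 100      # integer division keeps the hour digits
  minutes <- hhmm %% 100     # remainder keeps the minute digits
  (hours * 60 + minutes) %% 1440   # 2400 gives 1440, which wraps to 0
}

converted <- to_minutes(c(517L, 2400L, 1L, 2359L, NA))
converted
```

In the pipeline this would be mutate(dp_tm_min = to_minutes(dep_time)). The first value, 517, converts to 317 minutes, matching the first row of the comparison table in this exercise.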

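A third way to solve it

# hour and minute hold the scheduled departure; adding the delay gives the actual one
ny_voos_teste2 <- ny_voos %>% 
    mutate(dp_tm_min = (hour * 60) + minute + dep_delay)

vet_teste2 <- c(ny_voos_teste2$dp_tm_min)
for (i in seq_along(vet_teste2)) {
    if (is.na(vet_teste2[i])) {
        # Guard: comparing NA inside if() would raise an error
        vet_teste2[i] <- NA

    } else if (vet_teste2[i] == 1440) {
        # Exactly midnight becomes minute 0
        vet_teste2[i] <- 0

    } else if (vet_teste2[i] > 1440) {
        # The delay pushed the departure into the next day
        vet_teste2[i] <- vet_teste2[i] - 1440
    }
}

ny_voos_teste2 <- ny_voos_teste2 %>% 
    mutate(dp_tm_min = vet_teste2)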
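One case this loop does not cover is a negative sum, which would occur if a flight scheduled just after midnight left early, on the previous day. A possible way to handle both directions at once is the modulo operator, because in R the result of %% takes the sign of the divisor. The values below are invented for illustration.

```r
# Hypothetical scheduled departures, in minutes since midnight
sched_min <- c(5, 1439, 600)

# Hypothetical departure delays in minutes (negative means early)
dep_delay <- c(-10, 30, 0)

# Wraps values below 0 and values of 1440 or more back into 0..1439
actual_min <- (sched_min + dep_delay) %% 1440
actual_min
```

Here the first flight wraps back to 23:55 (minute 1435) and the second wraps forward to 00:29 (minute 29).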

In a new data frame, I converted the column to a numeric value and then built a table comparing, for the first 20 rows, the continuous time in minutes with the time in HHMM format.

ny_voos_mdf2 <- ny_voos_mdf %>% 
    mutate_at(c("dep_time_min"), as.numeric)

ny_voos_mdf2 %>% select(dep_time_min, dep_time) %>% head(20) %>% 
    knitr::kable(col.names = c("In minutes", "HHMM")) %>% 
    kable_styling(full_width = FALSE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))

In minutes	HHMM
317	517
333	533
342	542
344	544
354	554
354	554
355	555
357	557
357	557
358	558
358	558
358	558
358	558
358	558
359	559
359	559
359	559
360	600
360	600
361	601

\[ \cdots \]

Exercise 8

cia <- airlines

ny_voos %>% 
    tabyl(carrier) %>% 
    adorn_pct_formatting() %>% 
    knitr::kable(col.names = c("Airline", "Observations", "Percentage")) %>% 
    kable_styling(full_width = FALSE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	Observations	Percentage
9E	18460	5.5%
AA	32729	9.7%
AS	714	0.2%
B6	54635	16.2%
DL	48110	14.3%
EV	54173	16.1%
F9	685	0.2%
FL	3260	1.0%
HA	342	0.1%
MQ	26397	7.8%
OO	32	0.0%
UA	58665	17.4%
US	20536	6.1%
VX	5162	1.5%
WN	12275	3.6%
YV	601	0.2%

# Mean, median, and standard deviation of the departure delay per airline
atraso_cia <- ny_voos %>% group_by(carrier)
atraso_cia <- atraso_cia %>%
    summarise(atr_mean = mean(dep_delay, na.rm = TRUE),
              atr_med = median(dep_delay, na.rm = TRUE),
              atr_sd = sd(dep_delay, na.rm = TRUE)) 
atraso_cia %>%
    arrange(desc(atr_mean)) %>%   
    knitr::kable(col.names = c("Airline", "Mean delay", "Median delay", "SD")) %>% 
    kable_styling(full_width = FALSE,
                  bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	Mean delay	Median delay	SD
F9	20.215543	0.5	58.36265
EV	19.955390	-1.0	46.55235
YV	18.996330	-2.0	49.17227
FL	18.726075	1.0	52.66160
WN	17.711744	1.0	43.34435
9E	16.725769	-2.0	45.90604
B6	13.022522	-1.0	38.50337
VX	12.869421	0.0	44.81510
OO	12.586207	-6.0	43.06599
UA	12.106073	0.0	35.71660
MQ	10.552041	-3.0	39.18457
DL	9.264504	-2.0	39.73505
AA	8.586016	-3.0	37.35486
AS	5.804775	-3.0	31.36303
HA	4.900585	-4.0	74.10990
US	3.782418	-4.0	28.05633

Delays in minutes

# Empirical cumulative distribution of the departure delay
i_acumulada1 <- ggplot(ny_voos, aes(dep_delay)) +
    stat_ecdf(geom = "step",
              color = "blue") +
    labs(x = "Departure delay (in minutes)",
         y = "F(x)",
         title = "ECDF of departure delays for flights from New York")
i_acumulada1

quantile(ny_voos$dep_delay, c(.60, .74, .92, .97), na.rm = TRUE)

## 60% 74% 92% 97% 
##   0  10  61 120

From the plot and the medians, we can see that most flights have little or no delay: about 60% of flights depart on time or early, and about 97% depart with at most two hours of delay. The flights that are heavily delayed are outliers, and they pull the mean away from 0.
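The effect of a few extreme delays on the mean can be reproduced on a tiny example. The vector below is invented for illustration and is not taken from the flights data.

```r
# Hypothetical departure delays: mostly near zero, one extreme value
delays <- c(-4, -2, -1, 0, 1, 3, 600)

delay_mean <- mean(delays)       # pulled up by the single 600-minute delay
delay_median <- median(delays)   # middle value, unaffected by the outlier

# ecdf() returns a function; evaluating it at 0 gives the share of delays <= 0
share_on_time <- ecdf(delays)(0)

c(mean = delay_mean, median = delay_median, share_on_time = share_on_time)
```

The median stays at 0 while the mean rises above 80 minutes, the same pattern seen in the per-airline table, where the medians sit near 0 and the means do not.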

\[ \star \]

Gustavo Prado

October 23, 2020
