Load the libraries and view the “flights” dataset

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.1.3

Arriving Late is the worst!

Let us consider only the flights that arrived 3 or more hours late!

Filter the observations to see which flights had an arrival delay of three or more hours. Remember, arr_delay is measured in minutes, so three hours is 180 minutes. Name the new subset: arrivelate

arrivelate <- filter(flights, arr_delay >= 180)
arrivelate
## # A tibble: 3,897 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      848           1835       853     1001           1950
##  2  2013     1     1     1815           1325       290     2120           1542
##  3  2013     1     1     1842           1422       260     1958           1535
##  4  2013     1     1     2006           1630       216     2230           1848
##  5  2013     1     1     2115           1700       255     2330           1920
##  6  2013     1     1     2205           1720       285       46           2040
##  7  2013     1     1     2312           2000       192       21           2110
##  8  2013     1     1     2343           1724       379      314           1938
##  9  2013     1     2     1244            900       224     1431           1104
## 10  2013     1     2     1332            904       268     1616           1128
## # ... with 3,887 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Which carrier had the highest arrival delay?

1- Use the dplyr “select” function to create a new subset that includes only carrier and arr_delay. Select your observations from arrivelate.

2- Name the new subset: latecarrier. Note, your new data frame should contain the same number of observations (3,897), but only two variables: carrier and arr_delay.

latecarrier <- select(arrivelate, carrier, arr_delay)
summary(latecarrier)
##    carrier            arr_delay     
##  Length:3897        Min.   : 180.0  
##  Class :character   1st Qu.: 197.0  
##  Mode  :character   Median : 225.0  
##                     Mean   : 247.9  
##                     3rd Qu.: 270.0  
##                     Max.   :1272.0
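
The summary() output above describes all 3,897 late flights together, so it does not yet say which carrier is behind the largest delay. One possible way to break it down by carrier is sketched below. The names carrier_summary, n_late, median_delay and max_delay are illustrative choices, not part of the original exercise.

```r
library(dplyr)
library(nycflights13)

# Same filter as above: arrival delay of 180 minutes (3 hours) or more.
# filter() also drops rows where arr_delay is NA.
arrivelate <- filter(flights, arr_delay >= 180)

# One row per carrier, sorted so the carrier with the largest single delay comes first.
# Column names here are illustrative.
carrier_summary <- arrivelate %>%
  group_by(carrier) %>%
  summarise(
    n_late       = n(),               # number of flights 3+ hours late
    median_delay = median(arr_delay), # typical delay among those flights
    max_delay    = max(arr_delay)     # worst single delay
  ) %>%
  arrange(desc(max_delay))

carrier_summary
```

Keeping n_late next to the delay statistics makes it easier to see when a carrier's median or maximum rests on only a handful of flights.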

Now create one data visualization with this dataset

Create side by side boxplots

Since one variable is categorical (carrier) and the other is quantitative (arr_delay), side-by-side boxplots are a good way to compare the delay distributions across carriers.

plot1 <- ggplot(data = latecarrier, aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Most Delayed Carrier, NY 2013 \n")+
  xlab("Carrier") +
  ylab("Arrival Delay Time in minutes")
plot1

What do you notice?

1- Among these late flights, all carriers have a similar median arrival delay (around 200 min, or about 3.3 hrs).

2- Hawaiian Airlines (HA carrier) had the longest single delay, about 1,270 min, or roughly 21 hrs!

3- Alaska Airlines (AS carrier) had the shortest delay.

What further information would I like to know?

I should compare the total number of flights per carrier first, because having 10 out of 100 flights late (10%) is worse than having 10 out of 500 late (2%). I would also cross-reference the departure time with the arrival delay to see if there is a pattern there.
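
Both follow-up questions could be explored along the lines of the sketch below. It makes several assumptions that are not in the original analysis: flights with a missing arr_delay are dropped before computing rates, the scheduled departure hour (the hour column) stands in for "departure time", and the names late_rate, pct_late and delay_by_hour are made up for illustration.

```r
library(dplyr)
library(nycflights13)

# Share of each carrier's flights that arrived 3+ hours late.
# Assumption: flights with no recorded arr_delay are excluded first.
late_rate <- flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),                     # all flights with a recorded arrival delay
    n_late    = sum(arr_delay >= 180),   # flights 3+ hours late
    pct_late  = 100 * n_late / n_flights # late flights as a percentage
  ) %>%
  arrange(desc(pct_late))

late_rate

# Number and median delay of the late flights by scheduled departure hour.
delay_by_hour <- flights %>%
  filter(arr_delay >= 180) %>%
  group_by(hour) %>%
  summarise(
    n_late       = n(),
    median_delay = median(arr_delay)
  )

delay_by_hour
```

Ranking carriers by pct_late rather than by raw counts puts small and large carriers on the same footing, which is the point of the 10-out-of-100 versus 10-out-of-500 comparison.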

Thank you!