Load the libraries and view the “flights” dataset
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.1.3
Arriving Late is the worst!
Let us consider only the flights that were delayed by 3 or more
hours!
Filter the observations to see which flights had an arrival delay of
three or more hours. Remember, time is measured in minutes, so three
hours is 180 minutes. Name the new subset: arrivelate
arrivelate <- filter(flights, arr_delay >= 180)
arrivelate
## # A tibble: 3,897 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 848 1835 853 1001 1950
## 2 2013 1 1 1815 1325 290 2120 1542
## 3 2013 1 1 1842 1422 260 1958 1535
## 4 2013 1 1 2006 1630 216 2230 1848
## 5 2013 1 1 2115 1700 255 2330 1920
## 6 2013 1 1 2205 1720 285 46 2040
## 7 2013 1 1 2312 2000 192 21 2110
## 8 2013 1 1 2343 1724 379 314 1938
## 9 2013 1 2 1244 900 224 1431 1104
## 10 2013 1 2 1332 904 268 1616 1128
## # ... with 3,887 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
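The threshold logic above can be checked on a tiny example. This is a minimal sketch on made-up data (the `toy` data frame and the `threshold_min` name are hypothetical, not part of the assignment). It shows that `>= 180` keeps a delay of exactly three hours, and that `filter()` drops rows where the condition evaluates to `NA`, such as flights with no recorded arrival delay.

```r
library(dplyr)

# Hypothetical toy data, NOT taken from the flights dataset
toy <- data.frame(
  flight    = c(1, 2, 3, 4),
  arr_delay = c(179, 180, 400, NA)  # minutes; NA = no recorded arrival delay
)

threshold_min <- 3 * 60  # three hours expressed in minutes

# filter() keeps only rows where the condition is TRUE,
# so the 179-minute row and the NA row are both dropped
toy_late <- filter(toy, arr_delay >= threshold_min)
toy_late
```

Only flights 2 and 3 remain, so no separate `!is.na(arr_delay)` condition is needed in the filter step.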
Which carrier had the highest arrival delay?
1- Use the dplyr “select” function to create a new subset that includes
only carrier and arr_delay. Select your observations from arrivelate.
2- Name the new subset: latecarrier. Note, your
new data frame should contain the same number of observations, but only
two variables: carrier and arr_delay.
latecarrier <- select(arrivelate, carrier, arr_delay)
summary(latecarrier)
## carrier arr_delay
## Length:3897 Min. : 180.0
## Class :character 1st Qu.: 197.0
## Mode :character Median : 225.0
## Mean : 247.9
## 3rd Qu.: 270.0
## Max. :1272.0
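The summary() output describes arr_delay over all carriers combined, so it does not show which carrier the maximum belongs to. One possible way to break it out by carrier is sketched below, assuming the same filter and select steps as above; the `by_carrier` name and its column names are illustrative choices, not part of the assignment.

```r
library(dplyr)
library(nycflights13)

# Rebuild the two-column subset so this block runs on its own
latecarrier <- flights %>%
  filter(arr_delay >= 180) %>%
  select(carrier, arr_delay)

# One row per carrier: number of late flights, median delay, longest delay
by_carrier <- latecarrier %>%
  group_by(carrier) %>%
  summarise(
    n_late       = n(),
    median_delay = median(arr_delay),
    max_delay    = max(arr_delay)
  ) %>%
  arrange(desc(max_delay))  # carrier with the longest single delay first

by_carrier
```

Sorting by `max_delay` answers "longest single delay"; sorting by `median_delay` instead would answer "typically most delayed", which is a different question.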
Now create one data visualization with this dataset
Create side-by-side boxplots
Since one variable is categorical (carrier) and the other is
quantitative (arr_delay), side-by-side boxplots are a good way to
visualize the situation.
plot1 <- ggplot(data = latecarrier, aes(x = carrier, y = arr_delay, fill = carrier)) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Most Delayed Carrier, NY 2013 \n") +
  xlab("Carrier") +
  ylab("Arrival Delay Time in minutes")
plot1
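As an optional variant of the plot above, the carriers can be ordered by their median delay so that the boxes are easier to compare. This is only a sketch: the `plot2` name and the title text are assumptions, and `reorder()` is base R rather than anything specific to this assignment.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Rebuild the subset so this block runs on its own
latecarrier <- flights %>%
  filter(arr_delay >= 180) %>%
  select(carrier, arr_delay)

plot2 <- ggplot(
  data = latecarrier,
  # reorder() sorts the carrier codes by the median of arr_delay
  aes(x = reorder(carrier, arr_delay, FUN = median), y = arr_delay, fill = carrier)
) +
  geom_boxplot() +
  coord_flip() +
  ggtitle("Arrival Delays of 3+ Hours by Carrier, NYC 2013") +
  xlab("Carrier") +
  ylab("Arrival delay (minutes)") +
  theme(legend.position = "none")  # the axis already labels each carrier

plot2
```

Hiding the legend is a design choice: with `fill = carrier` the legend repeats the axis labels, so removing it leaves more room for the boxes.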

What do you notice?
1- All carriers have a similar median arrival delay (around 200 min,
or about 3.3 hrs).
2- Hawaiian Airlines (HA carrier) had the longest delay, about
1,270 min (roughly 21 hrs), which matches the maximum of 1272 in the summary!
3- Alaska Airlines (AS carrier) had the shortest delays among these flights.
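The notes above refer to carriers by name, while the data only stores two-letter codes. The nycflights13 package includes an `airlines` lookup table (columns `carrier` and `name`), so the codes can be translated with a join. A minimal sketch, where `late_names` is an illustrative name:

```r
library(dplyr)
library(nycflights13)

# Carrier codes that appear among flights delayed 3+ hours,
# matched to full airline names from the airlines lookup table
late_names <- flights %>%
  filter(arr_delay >= 180) %>%
  distinct(carrier) %>%
  left_join(airlines, by = "carrier") %>%
  arrange(carrier)

late_names
```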