NYC Flights Homework

Load the libraries and view the “flights” dataset

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
library(viridis)
## Loading required package: viridisLite
library(ggExtra)
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Now create one data visualization with this dataset

Your assignment is to create one plot to visualize one aspect of this dataset. The plot may be any type we have covered so far in this class (bar graphs, scatterplots, boxplots, histograms, treemaps, heatmaps, streamgraphs, or alluvials).

Requirements for the plot:

  1. Include at least one dplyr command (filter, arrange, summarize, group_by, select, mutate, ...)
  2. Include labels for the x- and y-axes
  3. Include a title
  4. Your plot must incorporate at least 2 colors
  5. Include a legend that indicates what the colors represent
  6. Write a brief paragraph that describes the visualization you have created and at least one aspect of the plot that you would like to highlight.

Start early so that if you do have trouble, you can email me with questions

Summary of flights

summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                     
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00.00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00.00  
##  Median :29.00   Median :2013-07-03 10:00:00.00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54.64  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00.00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00.00  
## 
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

Check dimensions

dim(flights)
## [1] 336776     19

Departure delays by carrier

# Keep only the carrier code and the departure delay (minutes)
flight_delay <- flights %>%
  select(carrier, dep_delay)
view(flight_delay)

Count missing values in dep_delay

flights %>%
  summarize(count = sum(is.na(dep_delay)))
## # A tibble: 1 × 1
##   count
##   <int>
## 1  8255
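
The same check can be extended to every column at once, which shows where else the data set has gaps. This is a minimal sketch, assuming dplyr, tidyr, and nycflights13 are installed; `na_counts` is a name introduced here for illustration only.

```r
library(dplyr)
library(tidyr)
library(nycflights13)

# Count missing values in every column, then reshape to one row per column
na_counts <- flights %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "column", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%        # keep only columns that have gaps
  arrange(desc(n_missing))

na_counts
```

The row for dep_delay should match the count above; carrier should not appear at all, since it has no missing values.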

Box plot of departure delays by airline

flights %>%
  drop_na(dep_delay) %>%   # the missing values are in dep_delay, not carrier
  ggplot(aes(x = carrier,
             y = dep_delay,
             fill = carrier)) +
  geom_boxplot() +
  xlab("Airline (carrier code)") +
  ylab("Departure delay (minutes)") +
  ggtitle("Departure Delays by Airline") +
  theme_classic()

Box plot with jitter to see the individual flights for each airline

flights %>%
  drop_na(dep_delay) %>%   # the missing values are in dep_delay, not carrier
  ggplot(aes(x = carrier,
             y = dep_delay,
             fill = carrier)) +
  geom_boxplot() +
  geom_jitter(size = 0.1, alpha = 0.1) +   # one faint point per flight
  coord_cartesian(ylim = c(-10, 30)) +     # zoom in without dropping data
  xlab("Airline (carrier code)") +
  ylab("Departure delay (minutes)") +
  ggtitle("Departure Delays by Airline, with Individual Flights") +
  theme_classic()
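
The number of delayed flights per airline can also be read from a table instead of from the density of the jittered points. This is a rough sketch, assuming that "delayed" means a departure delay greater than 0 minutes (the source does not define a threshold); `delayed_counts` is a hypothetical name.

```r
library(dplyr)
library(nycflights13)

# Flights and delayed flights per carrier.
# Assumption: a flight counts as delayed when dep_delay > 0.
delayed_counts <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(
    n_flights     = n(),
    n_delayed     = sum(dep_delay > 0),
    share_delayed = n_delayed / n_flights
  ) %>%
  arrange(desc(n_delayed))

delayed_counts
```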

Flipped the box plot so that delay time reads along the horizontal axis

flights %>%
  drop_na(dep_delay) %>%   # the missing values are in dep_delay, not carrier
  ggplot(aes(x = carrier,
             y = dep_delay,
             fill = carrier)) +
  geom_boxplot() +
  # A plot keeps only one coordinate system, so the zoom goes inside coord_flip()
  coord_flip(ylim = c(-10, 50)) +
  xlab("Airline (carrier code)") +
  ylab("Departure delay (minutes)") +
  ggtitle("Departure Delays by Airline") +
  theme_classic()

Final box plot, zoomed in without jitter

flights %>%
  drop_na(dep_delay) %>%   # the missing values are in dep_delay, not carrier
  ggplot(aes(x = carrier,
             y = dep_delay,
             fill = carrier)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-10, 30)) +   # zoom in without dropping data
  xlab("Airline (carrier code)") +
  ylab("Departure delay (minutes)") +
  ggtitle("Departure Delays by Airline") +
  theme_classic()

I made box plots from the nycflights13 data set to compare departure delays across airlines. The first box plot shows departure delay for every airline over the full range of the data. Because a few extreme delays stretch the y-axis, the boxes are too compressed to compare. For the second box plot I zoomed in on the y-axis (-10 to 30 minutes) and added geom_jitter to show the individual flights for each airline. The y-axis shows departure delay in minutes and the x-axis shows each airline's carrier code. Most flights departed less than 25 minutes late. The most extreme outlier was a flight delayed 1301 minutes, nearly 22 hours. The mean departure delay across all airlines was about 12.6 minutes. For the third graph I flipped the box plot so that delay time reads along the horizontal axis.

The best chart is the last one, because it shows the variation in departure delay for each airline most clearly. For most airlines, the bulk of flights departed within 25 minutes of the scheduled time. Some airlines had a wider interquartile range than others. For example, the interquartile ranges of ExpressJet Airlines (EV) and Mesa Airlines (YV) were wider than most. This could mean that these airlines need more maintenance staff, or that they fly less often, so that when they are delayed the delays tend to be longer than those of larger airlines with bigger fleets.
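
The numbers quoted in this description can be checked against a per-airline summary table. This is a minimal sketch, assuming dplyr and nycflights13 are installed; `delay_by_carrier` is a name introduced here for illustration, and the `airlines` lookup table ships with nycflights13.

```r
library(dplyr)
library(nycflights13)

# Per-carrier summary of departure delay (minutes),
# using only rows where dep_delay is recorded
delay_by_carrier <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(carrier) %>%
  summarize(
    n_flights    = n(),
    median_delay = median(dep_delay),
    iqr_delay    = IQR(dep_delay),
    mean_delay   = mean(dep_delay)
  ) %>%
  left_join(airlines, by = "carrier") %>%   # adds the full airline name
  arrange(desc(iqr_delay))                  # widest interquartile range first

delay_by_carrier

# The flight (or flights, if tied) with the longest departure delay
flights %>%
  slice_max(dep_delay, n = 1) %>%
  select(carrier, flight, origin, dest, dep_delay)
```

Sorting by iqr_delay puts the airlines with the widest boxes at the top, and the n_flights column gives a way to check whether those airlines really fly less often than the others.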