NYC Flights Homework

Author

Gabriel Castillo Lopez

Loading the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(treemap)
library(RColorBrewer)
data(flights)
#?flights

Removing the NAs

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay)) # drop rows where distance or arr_delay is NA
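As a quick check of what this filter does, here is a minimal sketch on made-up data. The `toy` tibble and its values are hypothetical and are not part of nycflights13. It also shows that `tidyr::drop_na()`, restricted to the same two columns, should keep the same rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: three flights, two of them with a missing value
toy <- tibble(
  flight    = c(1, 2, 3),
  distance  = c(500, NA, 1200),
  arr_delay = c(-5, 10, NA)
)

# Same condition as the filter used on the flights data
kept_filter <- toy |>
  filter(!is.na(distance) & !is.na(arr_delay))

# drop_na() limited to the same two columns
kept_drop_na <- toy |>
  drop_na(distance, arr_delay)

nrow(kept_filter)
nrow(kept_drop_na)
```

Only the row with both values present survives in either version, so the two approaches are interchangeable here.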

My dataset I will be working with

fastest_flights <- flights_nona |> 
  filter(arr_delay <= -1) |> #Only looking at flights that arrived early
  mutate(airtime_hour = air_time / 60) |> # convert airtime from minutes to hours
  mutate(MPH = distance / airtime_hour) |> # average speed in miles per hour
  mutate(airline = fct_recode(carrier, # replace carrier codes with full names
    "United Airlines" = "UA", "American Airlines" = "AA",
    "Jet Blue" = "B6", "Endeavor Air Inc" = "9E",
    "Alaska Airlines" = "AS", "Delta Airlines" = "DL",
    "ExpressJet Airlines" = "EV", "Frontier Airlines" = "F9",
    "AirTran Airways" = "FL", "Hawaiian Airlines" = "HA",
    "Envoy Air" = "MQ", "SkyWest Airlines" = "OO",
    "US Airways" = "US", "Virgin America" = "VX",
    "Southwest Airlines" = "WN", "Mesa Airlines" = "YV"))
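To make the speed calculation concrete, here is a small worked example on made-up numbers. The `toy` tibble is hypothetical; only the column names and units (distance in miles, air_time in minutes) follow the flights data.

```r
library(dplyr)

# Hypothetical flights: distance in miles, air_time in minutes
toy <- tibble(
  distance = c(1400, 200),
  air_time = c(210, 40)
)

# Same two steps as above: minutes to hours, then miles per hour
toy_speed <- toy |>
  mutate(airtime_hour = air_time / 60,
         MPH = distance / airtime_hour)

toy_speed
```

The first row works out as 210 / 60 = 3.5 hours and 1400 / 3.5 = 400 MPH, which is the same arithmetic applied to every row of fastest_flights.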

A summary table for all carriers with more than 500 early flights (AirlineData)

fast <- fastest_flights |>
  group_by(airline) |> # one row per carrier
  summarise(count = n(), # number of early flights
            Avgmph = mean(MPH), # average speed in MPH
            delay = mean(arr_delay), # average arrival delay (negative, since only early flights remain)
            Avg_Airtime = mean(airtime_hour), # average airtime in hours
            distance = mean(distance)) # average distance in miles

AirlineData <- filter(fast, count > 500) # keep only carriers with more than 500 early flights

Making a tree graph

treemap(AirlineData, index = "airline", vSize = "count",
        vColor = "Avg_Airtime", type = "manual",
        title = "Average Airtime of Early Flights by Carrier, Sized by Number of Early Flights",
        # note: type = "manual" maps the numeric vColor column onto the palette;
        # "RdYlBu" is the red-yellow-blue palette
        palette = "RdYlBu")

Source: FAA Aircraft registry, https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/

I created a treemap of early flights only (arr_delay <= -1), restricted to carriers with more than 500 early flights, showing the number of early flights and the average airtime in hours for each carrier. Before building it, I used several mutate() calls to explore the data, such as converting airtime from minutes to hours. I then used summarise() to find the means of the variables I was interested in, for example airtime_hour and arr_delay. Because I was interested in the carriers, I replaced the carrier abbreviations with their full names so the viewer can tell which carrier is which. To make the numbers more presentable, I built a summary table that combines those averages with the carrier names before drawing the treemap. The main thing the treemap highlights is that Delta and American Airlines have the most early flights, both with an average airtime of about 3 to 3.5 hours. Another is that Virgin America had one of the lowest numbers of early flights but the highest average airtime, about 5.5 hours. I liked how the treemap came out, and it is now my new favorite way to present data.
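The observations above can be checked against the numbers directly by sorting the summary table. This is a minimal sketch that rebuilds a reduced version of the summary from the raw flights data; it assumes the nycflights13 package is installed. The names `early_summary`, `by_count` and `by_airtime` are introduced here for illustration only, and the table uses carrier codes (for example DL, AA, VX) rather than the full names.

```r
library(dplyr)
library(nycflights13)

# Rebuild a reduced summary: early flights only, carriers with more than 500 of them
early_summary <- flights |>
  filter(!is.na(distance), !is.na(arr_delay), arr_delay <= -1) |>
  group_by(carrier) |>
  summarise(count = n(),
            Avg_Airtime = mean(air_time / 60)) |>
  filter(count > 500)

# Carriers ordered by number of early flights, largest first
by_count <- early_summary |> arrange(desc(count))

# Carriers ordered by average airtime in hours, longest first
by_airtime <- early_summary |> arrange(desc(Avg_Airtime))

by_count
by_airtime
```

Reading the top rows of each table is a quick way to confirm which carriers have the largest tiles and which have the most extreme colors in the treemap.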