NYC Flights Homework

Author

M Madinko

Load the libraries and the data

library(tidyverse)
library(nycflights23)
data(flights)

View the data using the “head” function

head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2023     1     1        1           2038       203      328              3
2  2023     1     1       18           2300        78      228            135
3  2023     1     1       31           2344        47      500            426
4  2023     1     1       33           2140       173      238           2352
5  2023     1     1       36           2048       228      223           2252
6  2023     1     1      503            500         3      808            815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Use the “filter” function to keep July flights and remove rows with a missing arrival delay

july_flights <- flights |>
  filter(month == 7 & !is.na(arr_delay))
july_flights
# A tibble: 32,771 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     7     1        2           2256        66      401            300
 2  2023     7     1       49           2106       223      256           2338
 3  2023     7     1      200           2135       265      423             20
 4  2023     7     1      457            500        -3      741            808
 5  2023     7     1      457            500        -3      743            812
 6  2023     7     1      459            500        -1      640            658
 7  2023     7     1      459            507        -8      824            902
 8  2023     7     1      528            535        -7      754            826
 9  2023     7     1      548            550        -2      812            836
10  2023     7     1      554            600        -6      832            903
# ℹ 32,761 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
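
Why the missing values are dropped before averaging can be seen in a small sketch. The toy tibble below is made up for illustration and is not taken from flights; the names toy, toy_july, mean_with_na and mean_july are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, not taken from the flights dataset
toy <- tibble(
  month     = c(7, 7, 7, 8),
  arr_delay = c(10, NA, 30, 5)
)

# mean() returns NA as soon as one value is missing
mean_with_na <- mean(toy$arr_delay)

# Keep only July rows that have a recorded arrival delay
toy_july <- toy |>
  filter(month == 7, !is.na(arr_delay))

# With the NA row removed, the mean is a regular number
mean_july <- mean(toy_july$arr_delay)
```

An alternative would be to keep the NA rows and call mean(arr_delay, na.rm = TRUE) later; filtering first simply makes the row count reflect the flights that were actually averaged.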

Use the “group_by” and “summarise” functions to create a summary table

by_carrier <- july_flights |>
  group_by(carrier) |>
  summarise(avg_arr_delay = mean(arr_delay)) |>
  arrange(desc(avg_arr_delay))
by_carrier
# A tibble: 13 × 2
   carrier avg_arr_delay
   <chr>           <dbl>
 1 HA               54.9
 2 F9               49.8
 3 B6               43.6
 4 UA               39.1
 5 OO               37.9
 6 AS               34.3
 7 G4               31.5
 8 NK               27.4
 9 WN               26.0
10 AA               19.8
11 DL               17.6
12 9E               11.9
13 YX               11.8
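
An average on its own does not show how many flights it is based on. As a possible extension (not part of the analysis above), the sketch below adds a flight count with n() next to the mean. The toy data and the column name n_flights are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for july_flights
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "HA"),
  arr_delay = c(10, 20, 30, 60)
)

# One row per carrier: mean delay plus the number of flights behind it
toy_summary <- toy |>
  group_by(carrier) |>
  summarise(
    avg_arr_delay = mean(arr_delay),
    n_flights     = n()
  ) |>
  arrange(desc(avg_arr_delay))

toy_summary
```

In this toy case the carrier with the highest average has only one flight, which is the kind of situation a count column would make visible.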

Load the airlines dataset

data(airlines)

Join the airlines dataset

new_by_carrier <- left_join(by_carrier, airlines, by = "carrier")
new_by_carrier$name <- gsub("Inc\\.|Co\\.", "", new_by_carrier$name)
new_by_carrier
# A tibble: 13 × 3
   carrier avg_arr_delay name                 
   <chr>           <dbl> <chr>                
 1 HA               54.9 "Hawaiian Airlines " 
 2 F9               49.8 "Frontier Airlines " 
 3 B6               43.6 "JetBlue Airways"    
 4 UA               39.1 "United Air Lines "  
 5 OO               37.9 "SkyWest Airlines "  
 6 AS               34.3 "Alaska Airlines "   
 7 G4               31.5 "Allegiant Air"      
 8 NK               27.4 "Spirit Air Lines"   
 9 WN               26.0 "Southwest Airlines "
10 AA               19.8 "American Airlines " 
11 DL               17.6 "Delta Air Lines "   
12 9E               11.9 "Endeavor Air "      
13 YX               11.8 "Republic Airline"   
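
Several names in the output above are printed in quotes because removing "Inc." or "Co." leaves a trailing space behind. One way to tidy this up, shown here as a minimal sketch, is to wrap the gsub call in trimws. The raw_names vector is a hypothetical example assumed to mirror the form of the original airline names.

```r
# Hypothetical example names, assumed to resemble the raw airlines$name values
raw_names <- c("Hawaiian Airlines Inc.", "Southwest Airlines Co.", "JetBlue Airways")

# Remove the suffix, then strip the whitespace that is left at the ends
cleaned <- trimws(gsub("Inc\\.|Co\\.", "", raw_names))

cleaned
```

Applied to the table above, the same idea would be new_by_carrier$name <- trimws(gsub("Inc\\.|Co\\.", "", new_by_carrier$name)).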

Bar graph visualization

# Order airlines by average delay; geom_col() draws bars from the values as given
ggplot(new_by_carrier, aes(x = reorder(name, avg_arr_delay), y = avg_arr_delay, fill = avg_arr_delay)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Arrival Delay by Airline in July",
    x = "Airline",
    y = "Average Arrival Delay (minutes)",
    caption = "Source: nycflights23 dataset"
  )

Brief Paragraph

The bar graph above shows the average arrival delay of each airline for flights departing New York City airports in July 2023. To build it, I used the filter function to keep only July flights and to drop rows where arr_delay is NA. I then used group_by and summarise to calculate the average arrival delay for each carrier. Because the flights dataset identifies airlines only by a two-character carrier code, I joined the airlines dataset to get their full names, which makes the graph easier to read. Each bar represents one airline, and the reorder function sorts the bars so that the airline with the highest average delay appears at the top and the one with the lowest at the bottom. The fill color also encodes the average delay, so the most punctual airlines and those with the largest delays stand out quickly.

In this data, Hawaiian Airlines had the highest average arrival delay (54.9 minutes), followed by Frontier Airlines (49.8) and JetBlue Airways (43.6), while Republic Airline (11.8) and Endeavor Air (11.9) had the lowest. These averages cover a single month and do not show how many flights each airline operated, so they are a rough guide rather than a guarantee, but they can still help travelers make more informed decisions when choosing a carrier for July trips out of New York.
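
How reorder produces the bar order can be checked on a small example. The sketch below uses made-up airline labels and delays; reorder returns a factor whose levels are sorted by the second argument, from lowest to highest.

```r
# Hypothetical labels and delays, for illustration only
name  <- c("Airline A", "Airline B", "Airline C")
delay <- c(30, 10, 20)

# Levels are ordered by delay, smallest first
ordered_names <- reorder(name, delay)

levels(ordered_names)
```

After coord_flip(), the first level is drawn at the bottom of the axis, which is why the airline with the largest average delay ends up at the top of the chart.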