NYC Flights Homework
Load the libraries and the data
library(tidyverse)
library(nycflights23)
data(flights)
View the data using the head function
head(flights)
# A tibble: 6 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 1 1 1 2038 203 328 3
2 2023 1 1 18 2300 78 228 135
3 2023 1 1 31 2344 47 500 426
4 2023 1 1 33 2140 173 238 2352
5 2023 1 1 36 2048 228 223 2252
6 2023 1 1 503 500 3 808 815
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Use the filter function to keep July flights and drop rows with a missing (NA) arrival delay
july_flights <- flights |>
  filter(month == 7, !is.na(arr_delay))
july_flights
# A tibble: 32,771 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2023 7 1 2 2256 66 401 300
2 2023 7 1 49 2106 223 256 2338
3 2023 7 1 200 2135 265 423 20
4 2023 7 1 457 500 -3 741 808
5 2023 7 1 457 500 -3 743 812
6 2023 7 1 459 500 -1 640 658
7 2023 7 1 459 507 -8 824 902
8 2023 7 1 528 535 -7 754 826
9 2023 7 1 548 550 -2 812 836
10 2023 7 1 554 600 -6 832 903
# ℹ 32,761 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
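The NA filter matters because mean() returns NA as soon as one value in a group is missing. The sketch below illustrates this on a small made-up tibble (the toy data and the object names with_na, filtered, and na_rm are assumptions for illustration, not part of the homework data). It also shows na.rm = TRUE as an alternative that gives the same averages without dropping rows first.

```r
library(dplyr)

# Hypothetical toy data: three flights, one with a missing arrival delay
toy <- tibble(
  carrier   = c("AA", "AA", "DL"),
  arr_delay = c(10, NA, 30)
)

# Without handling NA, the average for AA is NA
with_na <- toy |>
  group_by(carrier) |>
  summarise(avg_arr_delay = mean(arr_delay))

# Option 1: drop the incomplete rows first (the approach used in this homework)
filtered <- toy |>
  filter(!is.na(arr_delay)) |>
  group_by(carrier) |>
  summarise(avg_arr_delay = mean(arr_delay))

# Option 2: keep every row and let mean() skip the missing values
na_rm <- toy |>
  group_by(carrier) |>
  summarise(avg_arr_delay = mean(arr_delay, na.rm = TRUE))
```

Both options produce the same averages. Filtering first also makes the row count of the filtered table equal to the number of flights that actually enter the averages.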
Use the group_by and summarise functions to create the summary table
by_carrier <- july_flights |>
  group_by(carrier) |>
  summarise(avg_arr_delay = mean(arr_delay)) |>
  arrange(desc(avg_arr_delay))
by_carrier
# A tibble: 13 × 2
carrier avg_arr_delay
<chr> <dbl>
1 HA 54.9
2 F9 49.8
3 B6 43.6
4 UA 39.1
5 OO 37.9
6 AS 34.3
7 G4 31.5
8 NK 27.4
9 WN 26.0
10 AA 19.8
11 DL 17.6
12 9E 11.9
13 YX 11.8
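An average alone does not say how many flights it summarizes, so a carrier with only a handful of July flights could rank high because of a few bad days. One possible refinement, sketched here on made-up data, is to add a flight count with n() next to the average. The helper name summarise_delays and the toy values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical helper: average arrival delay plus flight count per carrier
summarise_delays <- function(flights_df) {
  flights_df |>
    group_by(carrier) |>
    summarise(
      avg_arr_delay = mean(arr_delay),
      n_flights     = n()
    ) |>
    arrange(desc(avg_arr_delay))
}

# Toy data: one HA flight with a long delay, three DL flights with shorter ones
toy <- tibble(
  carrier   = c("HA", "DL", "DL", "DL"),
  arr_delay = c(60, 10, 20, 30)
)

toy_summary <- summarise_delays(toy)
```

Calling summarise_delays(july_flights) would apply the same idea to the real data; the counts it returns are not shown here because they were not part of the original output.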
Load the airlines dataset
data(airlines)
Join the airlines dataset
new_by_carrier <- left_join(by_carrier, airlines, by = "carrier")
# Drop the "Inc." and "Co." suffixes, then trim the space they leave behind
new_by_carrier$name <- trimws(gsub("Inc\\.|Co\\.", "", new_by_carrier$name))
new_by_carrier
# A tibble: 13 × 3
carrier avg_arr_delay name
<chr> <dbl> <chr>
1 HA 54.9 Hawaiian Airlines
2 F9 49.8 Frontier Airlines
3 B6 43.6 JetBlue Airways
4 UA 39.1 United Air Lines
5 OO 37.9 SkyWest Airlines
6 AS 34.3 Alaska Airlines
7 G4 31.5 Allegiant Air
8 NK 27.4 Spirit Air Lines
9 WN 26.0 Southwest Airlines
10 AA 19.8 American Airlines
11 DL 17.6 Delta Air Lines
12 9E 11.9 Endeavor Air
13 YX 11.8 Republic Airline
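left_join keeps every row of the left table, so a carrier code with no match in the lookup table would get an NA name instead of being dropped. All 13 carriers matched here, but a quick check with anti_join can confirm that. The sketch below uses made-up tables; the code "ZZ" and the object names are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables: "ZZ" has no entry in the lookup table
delays <- tibble(carrier = c("AA", "ZZ"), avg_arr_delay = c(19.8, 5))
lookup <- tibble(carrier = "AA", name = "American Airlines Inc.")

# left_join keeps both rows; the unmatched carrier gets name = NA
joined <- left_join(delays, lookup, by = "carrier")

# anti_join returns only the rows of delays that found no match
unmatched <- anti_join(delays, lookup, by = "carrier")
```

On the real tables, anti_join(by_carrier, airlines, by = "carrier") should return zero rows if every carrier code has a full name.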
Bar graph visualization
ggplot(new_by_carrier, aes(x = reorder(name, avg_arr_delay), y = avg_arr_delay, fill = avg_arr_delay)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  coord_flip() +
  labs(
    title = "Average Arrival Delay by Airline in July 2023",
    x = "Airline",
    y = "Average Arrival Delay (minutes)",
    caption = "Source: nycflights23 dataset"
  )
Brief Paragraph
The bar graph above shows the average arrival delay of each airline for flights departing New York City in July 2023. To build it, I used the filter function to keep only July flights and to drop rows where arr_delay is NA. Then I used the group_by and summarise functions to calculate the average arrival delay for each airline. Since the flights dataset identifies airlines only by their two-character carrier codes, I joined the airlines dataset to get their full names, which makes the graph clearer. Each bar represents one airline, and the reorder function sorts the bars by average delay, so after coord_flip the airline with the highest delay (Hawaiian Airlines, 54.9 minutes) is at the top and the one with the lowest (Republic Airline, 11.8 minutes) is at the bottom. The fill color also tracks the average delay, so the most punctual airlines and those with the largest delays are easy to spot. Every airline averaged a late arrival in July, but the averages differ by more than 40 minutes between the best and worst carriers. Travelers planning a July trip out of New York can use this comparison to favor carriers that were more reliable during this period.
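To see what reorder does on its own, here is a minimal base R sketch with made-up values (the vectors name and avg are assumptions for illustration). reorder turns the names into a factor whose levels are sorted by the second argument in ascending order, and ggplot draws the bars in level order.

```r
# Hypothetical toy values: three airlines and their average delays
name <- c("A", "B", "C")
avg  <- c(20, 50, 10)

# Levels are sorted by avg, smallest first
ordered_names <- reorder(name, avg)
levels(ordered_names)  # "C" "A" "B"
```

With coord_flip, the first level is drawn at the bottom of the axis, which is why the airline with the largest average delay ends up at the top of the chart.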