This data set, compiled by Chester Ismay, contains information about all flights that departed from the three main New York City airports in 2023. For more information on the data set, see: https://CRAN.R-project.org/package=nycflights23
Next, let us take a look at the first 6 rows of this data set to see what we are working with.
Next, I correct the carrier column, replacing the abbreviations with the full names. I had to use “select(-airline_name)” to remove the extra airline_name column I mistakenly created along the way. The mutate function replaces the abbreviations with the full carrier names from the data frame I created.
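The carrier replacement described above could look roughly like the sketch below. It is an illustration under assumptions, not the exact code behind this report: `flights_small` and `airline_lookup` are hypothetical toy tables (with the real data, the lookup would presumably come from the package's airlines table), and `airline_name` is the helper column mentioned in the text.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of the flights data
flights_small <- tibble(
  carrier   = c("UA", "DL", "B6"),
  dep_delay = c(203, 78, 47)
)

# Hypothetical lookup table: carrier abbreviation -> full name
airline_lookup <- tibble(
  carrier      = c("UA", "DL", "B6"),
  airline_name = c("United Airlines", "Delta Air Lines", "JetBlue")
)

flights_named <- flights_small %>%
  left_join(airline_lookup, by = "carrier") %>%  # adds the airline_name column
  mutate(carrier = airline_name) %>%             # overwrite abbreviation with full name
  select(-airline_name)                          # drop the helper column again

flights_named
```

Using a join keeps the mapping in one table, so a carrier missing from the lookup shows up as NA instead of silently keeping its abbreviation.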
Here I have mapped the airport of origin from its abbreviation to its full name.
df3_flights$origin[df3_flights$origin == "LGA"] <- "LaGuardia"
df3_flights$origin[df3_flights$origin == "EWR"] <- "Newark Liberty International"
df3_flights$origin[df3_flights$origin == "JFK"] <- "John F. Kennedy International"
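The three assignments above can also be written as a single lookup. The following is a small base R sketch of that alternative, using a hypothetical `origin` vector in place of the real column; it is one possible way to do the mapping, not the method used in this report.

```r
# Hypothetical stand-in for the origin column
origin <- c("LGA", "EWR", "JFK", "EWR")

# Lookup vector: names are the airport codes, values are the full names
airport_names <- c(
  LGA = "LaGuardia",
  EWR = "Newark Liberty International",
  JFK = "John F. Kennedy International"
)

# Index the lookup by code; unname() drops the code labels from the result
origin_full <- unname(airport_names[origin])
origin_full
```

One difference worth knowing: a code that is not in the lookup vector becomes NA here, whereas the one-by-one assignments leave unknown codes unchanged.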
Let's take a look at the filtered data set.
head(df3_flights)
# A tibble: 6 × 7
  month   carrier           origin              dest  dep_delay arr_delay flight
  <chr>   <chr>             <chr>               <chr>     <dbl>     <dbl>  <int>
1 January United Airlines   Newark Liberty Int… SMF         203       205    628
2 January Delta Air Lines   John F. Kennedy In… ATL          78        53    393
3 January JetBlue           John F. Kennedy In… BQN          47        34    371
4 January JetBlue           John F. Kennedy In… CHS         173       166   1053
5 January United Airlines   Newark Liberty Int… DTW         228       211    219
6 January American Airlines Newark Liberty Int… MIA           3        -7    499
Now let's look at the summary statistics.
summary(df3_flights)
    month             carrier             origin              dest
 Length:435352      Length:435352      Length:435352      Length:435352
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
   dep_delay          arr_delay            flight
 Min.   : -50.00   Min.   : -97.000   Min.   :   1.0
 1st Qu.:  -6.00   1st Qu.: -22.000   1st Qu.: 364.0
 Median :  -2.00   Median : -10.000   Median : 734.0
 Mean   :  13.84   Mean   :   4.345   Mean   : 785.2
 3rd Qu.:  10.00   3rd Qu.:   9.000   3rd Qu.:1188.0
 Max.   :1813.00   Max.   :1812.000   Max.   :1972.0
 NA's   :10738     NA's   :12534
Looking at the summary, we have missing values in both departure delays (10,738) and arrival delays (12,534), so we remove those rows.
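Removing the rows with missing delays could be done along these lines. This is a base R sketch on a hypothetical `delays` data frame, not the code actually run for this report; with the tidyverse loaded, `tidyr::drop_na(dep_delay, arr_delay)` would be an equivalent choice.

```r
# Hypothetical stand-in with some missing delays
delays <- data.frame(
  carrier   = c("UA", "DL", "B6", "AA"),
  dep_delay = c(203, NA, 47, 3),
  arr_delay = c(205, 53, NA, -7)
)

# Keep only rows where both delay columns are present
keep <- complete.cases(delays[, c("dep_delay", "arr_delay")])
delays_clean <- delays[keep, ]

nrow(delays_clean)
```

Restricting the check to the two delay columns avoids dropping rows that are complete for this analysis but have gaps in some unrelated column.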
`summarise()` has grouped output by 'carrier'. You can override using the
`.groups` argument.
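The message above is what dplyr prints after a grouped `summarise()` on more than one grouping variable. The code that produced `df_facet_month` is not shown here, so the following is only a sketch of how such a table might be built. The toy `flights_small` data, the `Departure` and `Arrival` labels, and the conversion of `month` to a factor are all assumptions; only the column names `month`, `carrier`, `Average_Delay`, and `Delay_Type` come from the plotting code.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the cleaned flights data
flights_small <- tibble(
  month     = c("January", "January", "February", "February"),
  carrier   = c("JetBlue", "JetBlue", "JetBlue", "United Airlines"),
  dep_delay = c(10, 20, 5, 30),
  arr_delay = c(0, 10, -5, 25)
)

df_facet_month <- flights_small %>%
  # Calendar order instead of alphabetical order on the x axis
  mutate(month = factor(month, levels = month.name)) %>%
  group_by(carrier, month) %>%
  summarise(
    Departure = mean(dep_delay),
    Arrival   = mean(arr_delay),
    .groups   = "drop"  # return an ungrouped tibble and silence the message
  ) %>%
  # One row per carrier, month, and delay type, as the faceted plot expects
  pivot_longer(
    cols      = c(Departure, Arrival),
    names_to  = "Delay_Type",
    values_to = "Average_Delay"
  )

df_facet_month
```

Because `month` is stored as text in this data set, turning it into a factor with `month.name` as levels is what would keep the bars in calendar order; left as character, ggplot2 sorts the months alphabetically.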
ggplot(df_facet_month, aes(x = month, y = Average_Delay, fill = carrier)) +
  geom_col(position = "dodge") +
  facet_wrap(~ Delay_Type, scales = "free") +
  labs(title = "Comparison of Departure & Arrival Delays by Month",
       x = "Month",
       y = "Average Delay (mins)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This visual does not give me a lot of insight into individual airlines, but it does show a possible trend: average arrival and departure delays are highest around June and July, possibly because more people fly during this period for vacations.
## The NYC flight dataset for the year 2023
I first considered what questions could be asked of the data, and then started arranging it.
After loading the dataset, I could see that it covered nearly every day from January to December 2023, so I knew I would be able to take a detailed look at various parts of it.
I chose to find out whether I could identify a trend in departure or arrival delays by comparing them across different variables. Across the airline carriers in the data, could I determine whether specific planes tend to arrive early or late? In questions like this, a hidden variable could be at play, such as which pilot flew that day. Depending on the plane type, the engine installed and the weight of the passengers relative to the fuel weight could also be important factors.
For example, most commercial planes could perform perhaps twice as well, or fly faster, so to speak, if their pilots pushed the engines beyond regulated guidelines. However, since flight is a delicate process, planes cruise at perhaps 50 to 60 percent of the engines' capability, which leaves a wide margin of extra power for emergencies.