1 Ask

1.1 Problem Statement

I aim to conduct a comprehensive analysis of the flights dataset to extract actionable insights that can be utilized by airport management and airlines to improve operational efficiency, enhance customer satisfaction and optimize resource allocation.

1.2 Management Business Requests:

  • Punctuality Analysis: Evaluate the punctuality of flights by analyzing departure and arrival delays.
  • Route Optimization: Identify routes with the highest and lowest on-time performance to optimize scheduling and resource allocation.
  • Customer Experience Improvement: Investigate factors contributing to flight delays and cancellations to enhance the overall customer experience.
  • Resource Allocation: Analyze flight demand and distribution throughout the day to optimize resource allocation such as staff, gates, and aircraft.
  • Competitive Analysis: Compare the performance of different airlines in terms of on-time performance, flight duration, and customer satisfaction.
  • Trend Identification: Identify trends in flight delays, cancellations, and passenger traffic to predict future demand and plan accordingly.

2 Prepare

2.1 Data Source Description

This data is made available by Mahoora00135 as a Kaggle dataset with a CC0: Public Domain License.

2.1.1 Dataset Description

Variable Description
id A unique identifier assigned to each flight record in this dataset.
year The year in which the flight took place. The dataset includes flights from the year 2013
month The month of the year in which the flight occurred, represented by an integer ranging from 1 January to 12 Dec
day The day of the month on which the flight took place, represented by an integer from 1 to 31
dep_time The actual departure time of the flight, represented in 24-hour format (hhmm)
sched_dep_time The locally scheduled departure time of the flight, presented in a 24-hour format (hhmm)
dep_delay The delay in flight departure, calculated as the difference (in minutes) between the actual and scheduled departure times. Positive values indicate a delay, while negative values indicate an early departure.
arr_time The actual arrival time of the flight, represented in 24-hour format (hhmm)
sched_arr_time The locally scheduled arrival time of the flight, presented in a 24-hour format (hhmm)
arr_delay The delay in flight arrival, calculated as the difference (in minutes) between the actual and scheduled arrival times. Positive values indicate a delay, while negative values indicate an early arrival
carrier A two-letter code representing the airline carrier responsible for the flight
flight The designated number of the flight
tailnum A unique identifier associated with the aircraft used for the flight
origin A three-letter code signifying the airport from which the flight departed
dest A three-letter code representing the airport at which the flight arrived
air_time The duration of the flight, measured in minutes
distance The total distance (in miles) between the origin and destination airports
hour The hour component of the scheduled departure time, expressed in local time
minute The minute component of the scheduled departure time, expressed in local time
time_hour The scheduled departure time of the flight, represented in local time and formatted as “yyyy-mm-dd hh:mm:ss”
name The full name of the airline carrier responsible for the flight

2.2 Load Packages and Create Dataframes

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
Flights <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\flights.csv")

3 Process

3.1 Cleaning and Preprocessing

The steps I took for the data wrangling includes Data Discovery, Data structuring and Cleaning which includes reformatting of some wrongly formatted data types, Checking and handling data entry inconsistency/irregularities, handling blank values, checking for duplicates and so on. Perform data quality checks to ensure accuracy and reliability.

3.1.1 Step 1 - Data Discovery

View(Flights)
glimpse(Flights) 
## Rows: 336,776
## Columns: 21
## $ id             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <dbl> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <dbl> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <int> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <int> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <chr> "2013-01-01 05:00:00", "2013-01-01 05:00:00", "2013-01-…
## $ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
str(Flights)
## 'data.frame':    336776 obs. of  21 variables:
##  $ id            : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : num  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : num  830 850 923 1004 812 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : int  1400 1416 1089 1576 762 719 1065 229 944 733 ...
##  $ hour          : int  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : int  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : chr  "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
##  $ name          : chr  "United Air Lines Inc." "United Air Lines Inc." "American Airlines Inc." "JetBlue Airways" ...

The view function showed that the dataset contains date, time, text, blank values and some numerical values. Previewing the structure of a flights data frame after running the glimpse and str functions, the following where observed;

  • Date columns like year, month, day are stored as ‘int’
  • Time columns like dep_time,sched_dep_time, arr_time, sched_arr_time, are stored as and represented as e.g. 830 instead of 8:30
  • time_hour stored as ‘chr’ represented in local time and formatted as “yyyy-mm-dd hh:mm:ss”

3.1.2 Step 2 Data Structuring & Cleaning

  • Checking for Data Entry Inconsistency/Irregularities. Before converting the Columns into the right data type, I check data columns for some inconsistency:
  • For the time_hour column, To be sure that the length of the values are all the same in the columns.
Flights %>%
  summarize(max_length = max(nchar(time_hour)),
            min_length = min(nchar(time_hour))) # maxlength is 19 and min length is 19
##   max_length min_length
## 1         19         19
  • Correcting wrong data types - how data are stored and Dropping Unused columns
  • Correct the date columns. Convert year, month, day columns to Date data type and delete original columns (e.g. time_hour, being duplicated) and dropping unused columns
Flights <- Flights %>%
  mutate(schd_dep_date = as.Date(paste(year, month, day, sep = "-"))) %>%
  select(id, schd_dep_date, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier,
         flight, tailnum, origin, dest, air_time, distance, name, -hour, -minute, -year, -month, -day, -time_hour)
time_columns <- c("dep_time", "sched_dep_time", "arr_time", "sched_arr_time")
Flights <- Flights %>%
  mutate(across(all_of(time_columns), ~ as.POSIXct(sprintf("%04d", .), format = "%H%M"))) %>%
  mutate(across(all_of(time_columns), ~ format(., format = "%H:%M")))
  • Glimpse to check the format of time columns
glimpse(Flights)
## Rows: 336,776
## Columns: 16
## $ id             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ schd_dep_date  <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-0…
## $ dep_time       <chr> "05:17", "05:33", "05:42", "05:44", "05:54", "05:54", "…
## $ sched_dep_time <chr> "05:15", "05:29", "05:40", "05:45", "06:00", "05:58", "…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <chr> "08:30", "08:50", "09:23", "10:04", "08:12", "07:40", "…
## $ sched_arr_time <chr> "08:19", "08:30", "08:50", "10:22", "08:37", "07:28", "…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
  • Checking and Handling of blanks
sum(is.na(Flights)) # 44,083 cells
## [1] 44083
null_obs <- Flights[apply(is.na(Flights), 1, any), ] 

Out of the 336,776 obs, 9,430 obs had at least 1 NA. Thus 2.8% of the original rows had no information. A figure of 44,083, indicates that there are multiple null values within some observations. I noticed some null values in arr_delay when we had data in both arr_time and sched_arr_time. In a bid to reduce null values, we can calculate arr_delay.

Flights <- Flights %>%
  mutate(
    arr_delay = ifelse(!is.na(arr_time) & !is.na(sched_arr_time),
                       as.numeric(gsub(":", "", arr_time)) - as.numeric(gsub(":", "", sched_arr_time)),
                       NA)
  )

This reduced the null values (cells) by 717

sum(is.na(Flights)) #43,366 cells.
## [1] 43366

On investigation of the null observations, it was observed that variables with missing data are air_time, tailnum, dep_time, arr_time.

  • Possible Reasons for Missing Data
  • Missing departure and arrival times could be attributed to:
  • Cancelled flights
  • Incomplete data
  • Data entry error
  • Delayed Flights
  • Only Air Time
  • Data recording errors during flight
  • Technical issues with flight tracking systems
  • Data may not be available for certain types of flights (e.g., military or private flights)
  • Flights with no recorded air time (e.g., canceled, diverted, or unknown reasons)
  • Data entry errors
  • Unavailability of flight duration data due to technical limitations or data sources
  • Incomplete or unreliable flight records
  • Data processing errors
  • Only Airplane Tail Number
  • Some flights might not be assigned a tail number, such as military or private flights
  • Technical issues with aircraft transponders or data transmission
  • Missing or incomplete aircraft registration information
  • Data entry errors
  • Unavailable or corrupted data from aircraft maintenance records
  • Inconsistent or unreliable data sources
  • Only Departure Time - Data recording errors during departure
  • Technical issues with departure tracking systems
  • Flights with no recorded departure time (e.g., canceled, diverted, or unknown reasons)
  • Data entry errors
  • Unavailability of departure time data due to technical limitations or data sources
  • Incomplete or unreliable flight records
  • Only Arrival Time - Data recording errors during arrival
  • Technical issues with arrival tracking systems
  • Flights with no recorded arrival time (e.g., canceled, diverted, or unknown reasons)
  • Data entry errors
  • Unavailability of arrival time data due to technical limitations or data sources
  • Incomplete or unreliable flight records
airline_counts_null<- table(null_obs$name)
airline_counts <- table(Flights$name)
print(airline_counts)
## 
## AirTran Airways Corporation        Alaska Airlines Inc. 
##                        3260                         714 
##      American Airlines Inc.        Delta Air Lines Inc. 
##                       32729                       48110 
##           Endeavor Air Inc.                   Envoy Air 
##                       18460                       26397 
##    ExpressJet Airlines Inc.      Frontier Airlines Inc. 
##                       54173                         685 
##      Hawaiian Airlines Inc.             JetBlue Airways 
##                         342                       54635 
##          Mesa Airlines Inc.       SkyWest Airlines Inc. 
##                         601                          32 
##      Southwest Airlines Co.       United Air Lines Inc. 
##                       12275                       58665 
##             US Airways Inc.              Virgin America 
##                       20536                        5162
print(airline_counts_null)
## 
## AirTran Airways Corporation        Alaska Airlines Inc. 
##                          85                           5 
##      American Airlines Inc.        Delta Air Lines Inc. 
##                         782                         452 
##           Endeavor Air Inc.                   Envoy Air 
##                        1166                        1360 
##    ExpressJet Airlines Inc.      Frontier Airlines Inc. 
##                        3065                           4 
##             JetBlue Airways          Mesa Airlines Inc. 
##                         586                          57 
##       SkyWest Airlines Inc.      Southwest Airlines Co. 
##                           3                         231 
##       United Air Lines Inc.             US Airways Inc. 
##                         883                         705 
##              Virgin America 
##                          46

Hawaiian Airlines Inc. is the only airline carrier that does not appear in the null_obs (that does not have a null value in any row). Because of the varying possibilities of null values and ALL airlines but one (Hawaiian Airlines Inc.) having at least 1 null value in 9,430 observations, I am leaving the NA values for now

  • Creating New Columns The origin (Origin) and destination (Dest) columns are just 3 Letter code and may not be clear to the public conversant with the IATA airport code. I downloaded the code and created a table called airport code and will import it and join it with the Flights dataframe and create 4 new columns - Origin_Airport, Origin_Airport_Add, Dest_Airport, and Dest_Airport_Add to aid analysis.
Airport_Code <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\Airport Code.csv")
Airport_Code_Geotag <- read.csv("C:\\Users\\Asuquo Bassey\\Downloads\\Compressed\\Datasets\\Flights datasets\\Airport Code_geotag.csv")
glimpse(Flights)
## Rows: 336,776
## Columns: 16
## $ id             <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ schd_dep_date  <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01, 2013-0…
## $ dep_time       <chr> "05:17", "05:33", "05:42", "05:44", "05:54", "05:54", "…
## $ sched_dep_time <chr> "05:15", "05:29", "05:40", "05:45", "06:00", "05:58", "…
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <chr> "08:30", "08:50", "09:23", "10:04", "08:12", "07:40", "…
## $ sched_arr_time <chr> "08:19", "08:30", "08:50", "10:22", "08:37", "07:28", "…
## $ arr_delay      <dbl> 11, 20, 73, -18, -25, 12, 59, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…
glimpse(Airport_Code)
## Rows: 108
## Columns: 5
## $ Code          <chr> "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD", "MCO", …
## $ Airport.Name  <chr> " George Bush Intercontinental Airport", " Miami Interna…
## $ Town.City     <chr> " Houston ", " Miami ", " Aguadilla ", " Atlanta ", " Ch…
## $ State..Region <chr> " Texas", " Florida", " Puerto Rico", " Georgia", " Illi…
## $ Country       <chr> "  USA", "  USA", "  USA", "  USA", "  USA", "  USA", " …
Airport_Code <- unique(Airport_Code)
Airport_Code_Geotag <- unique(Airport_Code_Geotag)
Flights2 <- left_join(Flights, Airport_Code_Geotag, by = c("origin" = "original_Code"))
# Rename columns for origin airports
colnames(Flights2)[17:22] <- c("Origin_Airport.Name", "Origin_Town.City", "Origin_State..Region", "Origin_Country","Origin_latitude","Origin_longitude")
# Merge Flights dataframe with Airport_Code dataframe for destination airports
Flights2 <- left_join(Flights2, Airport_Code_Geotag, by = c("dest" = "original_Code"))
# Rename columns for destination airports
colnames(Flights2)[23:28] <- c("Dest_Airport.Name", "Dest_Town.City", "Dest_State..Region", "Dest_Country","Dest_latitude","Dest_longitude")
# View the first few rows of the resulting dataframe
head(Flights2)
##   id schd_dep_date dep_time sched_dep_time dep_delay arr_time sched_arr_time
## 1  0    2013-01-01    05:17          05:15         2    08:30          08:19
## 2  1    2013-01-01    05:33          05:29         4    08:50          08:30
## 3  2    2013-01-01    05:42          05:40         2    09:23          08:50
## 4  3    2013-01-01    05:44          05:45        -1    10:04          10:22
## 5  4    2013-01-01    05:54          06:00        -6    08:12          08:37
## 6  5    2013-01-01    05:54          05:58        -4    07:40          07:28
##   arr_delay carrier flight tailnum origin dest air_time distance
## 1        11      UA   1545  N14228    EWR  IAH      227     1400
## 2        20      UA   1714  N24211    LGA  IAH      227     1416
## 3        73      AA   1141  N619AA    JFK  MIA      160     1089
## 4       -18      B6    725  N804JB    JFK  BQN      183     1576
## 5       -25      DL    461  N668DN    LGA  ATL      116      762
## 6        12      UA   1696  N39463    EWR  ORD      150      719
##                     name                    Origin_Airport.Name
## 1  United Air Lines Inc.   Newark Liberty International Airport
## 2  United Air Lines Inc.                      LaGuardia Airport
## 3 American Airlines Inc.  John F. Kennedy International Airport
## 4        JetBlue Airways  John F. Kennedy International Airport
## 5   Delta Air Lines Inc.                      LaGuardia Airport
## 6  United Air Lines Inc.   Newark Liberty International Airport
##   Origin_Town.City Origin_State..Region Origin_Country Origin_latitude
## 1          Newark            New Jersey            USA        40.73566
## 2           Queens             New York            USA        40.71351
## 3           Queens             New York            USA        40.71351
## 4           Queens             New York            USA        40.71351
## 5           Queens             New York            USA        40.71351
## 6          Newark            New Jersey            USA        40.73566
##   Origin_longitude                                 Dest_Airport.Name
## 1        -74.17237              George Bush Intercontinental Airport
## 2        -73.82831              George Bush Intercontinental Airport
## 3        -73.82831                       Miami International Airport
## 4        -73.82831                          Rafael Hernández Airport
## 5        -73.82831  Hartsfield-Jackson Atlanta International Airport
## 6        -74.17237                      O'Hare International Airport
##   Dest_Town.City Dest_State..Region Dest_Country Dest_latitude Dest_longitude
## 1       Houston               Texas          USA      29.75894      -95.36770
## 2       Houston               Texas          USA      29.75894      -95.36770
## 3         Miami             Florida          USA      25.77417      -80.19362
## 4     Aguadilla         Puerto Rico          USA      18.42979      -67.15422
## 5       Atlanta             Georgia          USA      33.74899      -84.39026
## 6       Chicago            Illinois          USA      41.87556      -87.62442
# Duplication
    # Identify duplicate rows
duplicated_rows <- Flights[duplicated(Flights), ]

# Print or view the duplicated rows
print(duplicated_rows)
##  [1] id             schd_dep_date  dep_time       sched_dep_time dep_delay     
##  [6] arr_time       sched_arr_time arr_delay      carrier        flight        
## [11] tailnum        origin         dest           air_time       distance      
## [16] name          
## <0 rows> (or 0-length row.names)

I noticed multiple variables and see possible reasons; tailnum, name of carrier - Aircraft Rotation, Aircraft Substitutions, Leased or Shared Aircraft. Having multiple tail numbers associated with flights is generally acceptable as long as it reflects real-world operations and is consistent with the airline’s operations.

4 Analyze

I will do only minimal analysis and visualizations -Airline/Carrier summary, relationship between fleet size and arrival & departure delays and calculation of correlation coefficients (see Below). Then I will export the data to Tableau.

4.1 Create dataframe for Airline/Carrier summary.

Airline_Summary <- Flights2 %>% 
  group_by(name) %>% 
  summarize(Fleet_Size = n_distinct(tailnum, na.rm = TRUE),
            Average_Distance = mean(distance, na.rm = TRUE),
            Average_Flight_Hours = mean(air_time, na.rm = TRUE),
            Average_Departure_Delay = mean(dep_delay, na.rm = TRUE),
            Average_Arrival_Delay = mean(arr_delay, na.rm = TRUE),
            Max_Individual_Distance_Flown = max(distance)) %>% 
  arrange(desc(Fleet_Size))

head(Airline_Summary)
## # A tibble: 6 × 7
##   name   Fleet_Size Average_Distance Average_Flight_Hours Average_Departure_De…¹
##   <chr>       <int>            <dbl>                <dbl>                  <dbl>
## 1 Delta…        629            1237.                174.                    9.26
## 2 Unite…        621            1529.                212.                   12.1 
## 3 Ameri…        601            1340.                189.                    8.59
## 4 South…        583             996.                148.                   17.7 
## 5 Expre…        316             563.                 90.1                  20.0 
## 6 US Ai…        290             553.                 88.6                   3.78
## # ℹ abbreviated name: ¹​Average_Departure_Delay
## # ℹ 2 more variables: Average_Arrival_Delay <dbl>,
## #   Max_Individual_Distance_Flown <int>

4.2 Relationship between Fleet Size and Arrival & Departure Delays

Airline_Summary %>% 
  ggplot(aes(Fleet_Size,Average_Departure_Delay,colour=Average_Arrival_Delay))+
  geom_point(alpha=0.4,)+
  scale_color_gradient(low = "blue", high = "red") +  # Color gradient for calories
  geom_smooth()+
  labs(x = "Fleet_Size", 
       y = "Average_Departure_Delay", 
       color = "Average_Arrival_Delay",
       title = "Relationship between Fleet Size and Arrival & Departure Delays") +
  theme_bw()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

4.3 Calculate correlation coefficients

# Calculate correlation coefficients

correlation_delay_arrival <- cor(Flights2$arr_delay, Flights2$dep_delay,use = "complete.obs")
print(correlation_delay_arrival)
## [1] -0.1941299
correlation_distance_arrival <- cor(Flights2$distance,Flights2$arr_delay,use = "complete.obs") 
print(correlation_distance_arrival)
## [1] -0.03437079
correlation_flight_time_distance <- cor(Flights2$air_time,Flights2$distance,use = "complete.obs")
print(correlation_distance_arrival)
## [1] -0.03437079

Based on the correlation coefficient results:

  • Correlation between Departure Delay and Arrival Delay:

  • The correlation coefficient of approximately -0.19 suggests a weak negative linear relationship between departure delay and arrival delay.

  • This indicates that, on average, flights with longer departure delays tend to have slightly shorter arrival delays, and vice versa.

  • However, the strength of this relationship is weak, indicating that departure delay alone may not be a strong predictor of arrival delay.

  • Correlation between Distance and Arrival Delay:

  • The correlation coefficient of approximately -0.03 suggests a very weak negative linear relationship between distance traveled and arrival delay.

  • This indicates that there is little to no meaningful relationship between the distance a flight travels and the delay in its arrival time.

  • It implies that longer flights do not necessarily experience more or less arrival delay compared to shorter flights.

  • Correlation between Flight Time and Distance:

  • The correlation coefficient of approximately -0.03 indicates a very weak negative linear relationship between flight time and distance traveled.

  • This suggests that, on average, longer flights tend to have slightly shorter flight times, and vice versa.

  • However, the strength of this relationship is very weak, indicating that flight time alone may not be a strong predictor of the distance traveled.

The dataset was exported as CSV file to Tableau for further analysis and visualizations. Please,Visit Tableau Public for multi-dashboard visualizations to provide insights on different route analysis, airline & airport operations/performance.

5 Exporting file

# Specify file path to save the .csv file
file_path <- "flights2_data.csv"

# Write the dataframe to a .csv file}
write.csv(Flights, file = file_path, row.names = FALSE)

# Optionally, print a message to confirm the file has been saved
cat("DataFrame saved as", file_path, "\n")
## DataFrame saved as flights2_data.csv
# Get the current working directory
current_directory <- getwd()
print(current_directory)
## [1] "C:/Users/Asuquo Bassey/Downloads/Compressed/Datasets/Flights datasets"
print(unique(Flights$dest))
##   [1] "IAH" "MIA" "BQN" "ATL" "ORD" "FLL" "IAD" "MCO" "PBI" "TPA" "LAX" "SFO"
##  [13] "DFW" "BOS" "LAS" "MSP" "DTW" "RSW" "SJU" "PHX" "BWI" "CLT" "BUF" "DEN"
##  [25] "SNA" "MSY" "SLC" "XNA" "MKE" "SEA" "ROC" "SYR" "SRQ" "RDU" "CMH" "JAX"
##  [37] "CHS" "MEM" "PIT" "SAN" "DCA" "CLE" "STL" "MYR" "JAC" "MDW" "HNL" "BNA"
##  [49] "AUS" "BTV" "PHL" "STT" "EGE" "AVL" "PWM" "IND" "SAV" "CAK" "HOU" "LGB"
##  [61] "DAY" "ALB" "BDL" "MHT" "MSN" "GSO" "CVG" "BUR" "RIC" "GSP" "GRR" "MCI"
##  [73] "ORF" "SAT" "SDF" "PDX" "SJC" "OMA" "CRW" "OAK" "SMF" "TUL" "TYS" "OKC"
##  [85] "PVD" "DSM" "PSE" "BHM" "CAE" "HDN" "BZN" "MTJ" "EYW" "PSP" "ACK" "BGR"
##  [97] "ABQ" "ILM" "MVY" "SBN" "LEX" "CHO" "TVC" "ANC" "LGA"
print(length(unique(Flights$dest)))
## [1] 105

I pulled in data, reviewed and investigated it. Checked for data entry Inconsistencies and irregularities. Managed data types and blanks. Deleted unused columns and created new ones. I was able to show the ability to parsed Date and Time, Detect and correct badly formatted data and check for other error that can skew my analysis. After having a well wrangled data that is well formatted and can be used for further analysis, I proceeded with analysis and visualization without any fear and most especially;

6 Share

We conducted exploratory data analysis (EDA) to uncover trends and patterns in the dataset. This involved data cleaning, visualization, and statistical analysis.

6.1 Key findings

  • Correlations Overall, the correlation coefficients reveal weak or very weak relationships between the variables examined. It’s important to note that correlation does not imply causation, and other factors not considered in this analysis may influence the relationships between these variables. Further consideration of additional variables may provide more insights into the factors affecting flight delays and travel characteristics.

  • Airport Operations

  • Total Flights for the period in view 337,000 flights with 39.1% of those flights departing behind scheduled time.

  • For the period in view, LaGuardia Airport had the best performance in respect of departure delays with an average departure delay time of 10 minutes with 105,000 flights

  • Busiest days of the week were weekdays with Monday leading others marginally with 50,700 flights.Saturday had the least number of flights - 38,700.

  • Airlines Performance

  • Delta Air Line Inc. has the highest number of fleet size with Hawaiian Airline with the least.

  • Skywest Airlines Inc. had the highest average departure delay of 30 minutes with 31% of there 32 flights departing behind scheduled time.

  • United Airlines Inc. has flown the most distance, had the highest number of flights (58,665) but with 47% Late departure

  • Route Analysis

  • Top Destination is Chicago

  • Busiest route is JFK to LAX

  • Route EWR to TUL had the worst arrival time

6.2 Recommendations

Based on the analysis findings, we propose the following recommendations:

  • Additional Variables Analysis: Conduct further analysis to identify additional variables that may influence flight delays and travel characteristics. Factors such as weather conditions, air traffic congestion, and airport infrastructure could be explored.
  • Weekday vs. Weekend Operations: Consider adjusting staffing levels or operational procedures to better handle peak periods to balance weekday vs weekend operations.
  • Customer Communication: Proactively communicate with passengers about potential delays on the busiest routes, such as JFK to LAX, to manage expectations and minimize inconvenience. Providing real-time updates and alternative travel options can help enhance the passenger experience.
  • Route Diversification: Explore opportunities to diversify flight routes and reduce congestion on high-traffic routes like JFK to LAX. Introducing alternative routes or adjusting flight frequencies could help alleviate congestion and improve overall service reliability.
  • Performance Improvement Programs: Airline with poor statistics should implement performance improvement programs to reduce the percentage of late departures. This could involve optimizing flight scheduling, improving turnaround times, and enhancing operational efficiency.

6.3 Conclusions

In conclusion, our analysis reveals significant insights into airline, airport performance and route analysis by examining key as provided in the dataset. We have identified other metrics like fleet size, and route popularity and gained valuable insights into the operational efficiency and service quality of airlines and airport.

Other factors not included in this dataset and report such as weather conditions, air traffic congestion, and airport infrastructure could be explored to give a comprehensive analysis.

6.4 Next Steps

To further refine our analysis and drive actionable insights, we recommend the following next steps:

  • Search and include other factors data like weather conditions, air traffic congestion, airport infrastructure, airplane models, in flight entertainment, etc.
  • Explore qualitative research methods such as customer surveys and interviews to gain deeper insights into customer preferences and pain points.

Thank You!

7 Session Information

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.3 (2024-02-29 ucrt)
##  os       Windows 11 x64 (build 22631)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_Nigeria.utf8
##  ctype    English_Nigeria.utf8
##  tz       Africa/Lagos
##  date     2024-04-29
##  pandoc   3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  bslib         0.7.0   2024-03-29 [1] CRAN (R 4.3.3)
##  cachem        1.0.8   2023-05-01 [1] CRAN (R 4.3.3)
##  cli           3.6.2   2023-12-11 [1] CRAN (R 4.3.3)
##  colorspace    2.1-0   2023-01-23 [1] CRAN (R 4.3.3)
##  digest        0.6.35  2024-03-11 [1] CRAN (R 4.3.3)
##  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.3.3)
##  evaluate      0.23    2023-11-01 [1] CRAN (R 4.3.3)
##  fansi         1.0.6   2023-12-08 [1] CRAN (R 4.3.3)
##  farver        2.1.1   2022-07-06 [1] CRAN (R 4.3.3)
##  fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.3.3)
##  forcats     * 1.0.0   2023-01-29 [1] CRAN (R 4.3.3)
##  generics      0.1.3   2022-07-05 [1] CRAN (R 4.3.3)
##  ggplot2     * 3.5.0   2024-02-23 [1] CRAN (R 4.3.3)
##  glue          1.7.0   2024-01-09 [1] CRAN (R 4.3.3)
##  gtable        0.3.4   2023-08-21 [1] CRAN (R 4.3.3)
##  highr         0.10    2022-12-22 [1] CRAN (R 4.3.3)
##  hms           1.1.3   2023-03-21 [1] CRAN (R 4.3.3)
##  htmltools     0.5.8   2024-03-25 [1] CRAN (R 4.3.3)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.3.3)
##  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.3.3)
##  knitr         1.45    2023-10-30 [1] CRAN (R 4.3.3)
##  labeling      0.4.3   2023-08-29 [1] CRAN (R 4.3.1)
##  lattice       0.22-5  2023-10-24 [2] CRAN (R 4.3.3)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.3.3)
##  lubridate   * 1.9.3   2023-09-27 [1] CRAN (R 4.3.3)
##  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.3.3)
##  Matrix        1.6-5   2024-01-11 [2] CRAN (R 4.3.3)
##  mgcv          1.9-1   2023-12-21 [2] CRAN (R 4.3.3)
##  munsell       0.5.0   2018-06-12 [1] CRAN (R 4.3.3)
##  nlme          3.1-164 2023-11-27 [2] CRAN (R 4.3.3)
##  pillar        1.9.0   2023-03-22 [1] CRAN (R 4.3.3)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.3.3)
##  purrr       * 1.0.2   2023-08-10 [1] CRAN (R 4.3.3)
##  R6            2.5.1   2021-08-19 [1] CRAN (R 4.3.3)
##  readr       * 2.1.5   2024-01-10 [1] CRAN (R 4.3.3)
##  rlang         1.1.3   2024-01-10 [1] CRAN (R 4.3.3)
##  rmarkdown     2.26    2024-03-05 [1] CRAN (R 4.3.3)
##  rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.3.3)
##  sass          0.4.9   2024-03-15 [1] CRAN (R 4.3.3)
##  scales        1.3.0   2023-11-28 [1] CRAN (R 4.3.3)
##  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.3.3)
##  stringi       1.8.3   2023-12-11 [1] CRAN (R 4.3.2)
##  stringr     * 1.5.1   2023-11-14 [1] CRAN (R 4.3.3)
##  tibble      * 3.2.1   2023-03-20 [1] CRAN (R 4.3.3)
##  tidyr       * 1.3.1   2024-01-24 [1] CRAN (R 4.3.3)
##  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.3.3)
##  tidyverse   * 2.0.0   2023-02-22 [1] CRAN (R 4.3.3)
##  timechange    0.3.0   2024-01-18 [1] CRAN (R 4.3.3)
##  tzdb          0.4.0   2023-05-12 [1] CRAN (R 4.3.3)
##  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.3.3)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.3.3)
##  withr         3.0.0   2024-01-16 [1] CRAN (R 4.3.3)
##  xfun          0.43    2024-03-25 [1] CRAN (R 4.3.3)
##  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.3.2)
## 
##  [1] C:/Users/Asuquo Bassey/AppData/Local/R/win-library/4.3
##  [2] C:/Program Files/R/R-4.3.3/library
## 
## ──────────────────────────────────────────────────────────────────────────────