Introduction

The airline industry generates enormous volumes of data on flights, destinations, passenger behavior, and delays. Analyzing this data is critical to improving airline operations and the passenger experience. This paper applies association rule mining to a dataset of flights and destinations to reveal hidden patterns and relationships between them.

The objective of this paper is to identify association rules in flight data. Such rules can support better scheduling, demand forecasting, and operational efficiency, and can ultimately improve customer satisfaction.

Dataset

The dataset contains 98,619 records with detailed information on flights, passengers, and airports. It includes passenger details such as ID, name, gender, age, and nationality, along with flight information like departure date, arrival airport, pilot name, and flight status. Additionally, airport details cover the airport name, country, and continent.

The dataset was taken from https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset

library(dplyr)
library(arules)

df<-read.csv("C:/Users/Filip/Desktop/association rules/Airline Dataset Updated - v2.csv", sep=",", dec=".", header=TRUE)
summary(df)
##  Passenger.ID        First.Name         Last.Name            Gender         
##  Length:98619       Length:98619       Length:98619       Length:98619      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Age       Nationality        Airport.Name       Airport.Country.Code
##  Min.   : 1.0   Length:98619       Length:98619       Length:98619        
##  1st Qu.:23.0   Class :character   Class :character   Class :character    
##  Median :46.0   Mode  :character   Mode  :character   Mode  :character    
##  Mean   :45.5                                                             
##  3rd Qu.:68.0                                                             
##  Max.   :90.0                                                             
##  Country.Name       Airport.Continent   Continents        Departure.Date    
##  Length:98619       Length:98619       Length:98619       Length:98619      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Arrival.Airport     Pilot.Name        Flight.Status     
##  Length:98619       Length:98619       Length:98619      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
cat("Number of observations in the dataset:", nrow(df))
## Number of observations in the dataset: 98619
cat("Number of variables in the analysis:", ncol(df))
## Number of variables in the analysis: 15

The dataset initially contained both categorical and non-categorical data across various columns, making it less suitable for association rule mining. To improve its usability, I removed variables that did not contribute to meaningful pattern discovery: passenger ID and names, pilot name, gender, age, the airport country code, and the raw departure date. Before dropping the departure date, I transformed it into a seasonal category (spring, summer, fall, or winter), allowing for more effective analysis of travel patterns based on seasonal trends. These modifications were essential to streamline the dataset and focus on relevant features for generating association rules.

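# Parse the departure date; this format string assumes month-day-year values separated by dashes
df$Departure.Date <- as.Date(df$Departure.Date, format = "%m-%d-%Y")

# Extract the two-digit month
df$Month <- format(df$Departure.Date, "%m")

# Map each month to a season; any value not matched below (including unparsed dates) falls through to "Fall"
df$Season <- ifelse(df$Month %in% c("12", "01", "02"), "Winter",
                      ifelse(df$Month %in% c("03", "04", "05"), "Spring",
                             ifelse(df$Month %in% c("06", "07", "08"), "Summer", "Fall")))

# Drop the columns that are not used for rule mining
df <- df %>% select(-c(Passenger.ID, First.Name, Last.Name, Pilot.Name, Gender, Age, Airport.Country.Code, Departure.Date))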


write.csv(df, file = "flight_data.csv", row.names = FALSE)
data <- read.transactions("C:/Users/Filip/Desktop/association rules/flight_data.csv", format = "basket", sep = ",", skip=1)
summary(data)
## transactions as itemMatrix in sparse format with
##  59658 rows (elements/itemsets/transactions) and
##  18351 columns (items) and a density of 0.0003259648 
## 
## most frequent items:
## Cancelled   On Time   Delayed    Summer    Spring   (Other) 
##     19908     19876     19874     15272     15107    266824 
## 
## element (itemset/transaction) length distribution:
## sizes
##     5     6 
##  1087 58571 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   6.000   6.000   5.982   6.000   6.000 
## 
## includes extended item information - examples:
##                    labels
## 1                       -
## 2                       0
## 3 28 de Noviembre Airport
itemFrequencyPlot(data, topN=25, type="relative", main="ItemFrequency") 
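
Reading the data in basket format turns every cell into a bare item label, which is why labels such as "-" and "0" appear without any indication of the column they came from. One possible alternative, sketched here on a hypothetical toy data frame that only mirrors a few column names from the dataset, is to coerce a data frame of factors directly to transactions. The item labels then take the form column=value. This is an illustration under those assumptions, not the method used in this analysis.

```r
library(arules)

# Hypothetical toy rows; the values are made up for illustration only
flights <- data.frame(
  Country.Name    = c("Brazil", "Brazil", "China"),
  Arrival.Airport = c("0", "GRU", "PEK"),
  Flight.Status   = c("Delayed", "On Time", "Cancelled"),
  Season          = c("Winter", "Summer", "Fall"),
  stringsAsFactors = TRUE  # the coercion below expects factor (or logical) columns
)

# Each row becomes one transaction; each column/value pair becomes one item
trans <- as(flights, "transactions")

# Labels now carry their column name, e.g. "Arrival.Airport=0"
print(itemLabels(trans))
```

With labels of this form, an item such as 0 can be traced back to the column it came from before it is interpreted.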

The Apriori algorithm

For the analysis, I applied the Apriori algorithm, a widely used method in association rule mining that efficiently identifies frequent itemsets and generates rules based on minimum support and confidence thresholds. Initially, I ran the algorithm with a minimum support of 0.1 and a minimum confidence of 0.65 to explore general patterns in the dataset.
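
For reference, these are the standard definitions of the measures reported in the rule tables, for a rule X => Y over a set of transactions D. The symbols X, Y, D, and t are notation introduced here and do not come from the dataset.

```latex
\begin{align*}
\mathrm{supp}(X) &= \frac{|\{\, t \in D : X \subseteq t \,\}|}{|D|} \\
\mathrm{supp}(X \Rightarrow Y) &= \mathrm{supp}(X \cup Y) \\
\mathrm{conf}(X \Rightarrow Y) &= \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)} \\
\mathrm{lift}(X \Rightarrow Y) &= \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\end{align*}
```

A lift above 1 means that X and Y occur together more often than would be expected if they were independent.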

rules<-apriori(data, parameter=list(supp=0.1, conf=0.65)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.65    0.1    1 none FALSE            TRUE       5     0.1      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 5965 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[18351 item(s), 59658 transaction(s)] done [0.10s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Unfortunately, the algorithm found no rules at these thresholds.

To find any rules in the dataset, the minimum support and minimum confidence thresholds had to be lowered. They were set to 0.001 and 0.5, respectively.
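
As a quick check on what these thresholds mean in absolute terms, the relative minimum support can be converted to a transaction count. The sketch below uses the 59,658 transactions reported by summary(data). Truncating with floor() is an assumption, chosen because it reproduces the absolute minimum support counts printed in the Apriori output in this section; it is not a statement about how arules computes them internally.

```r
# Number of transactions reported by summary(data)
n_transactions <- 59658

# The two minimum support thresholds used in this analysis
min_support <- c(initial = 0.1, lowered = 0.001)

# Convert relative support to an absolute transaction count (truncated)
min_count <- floor(min_support * n_transactions)
print(min_count)
```

At the initial threshold an itemset has to appear in several thousand transactions, whereas the lowered threshold admits itemsets that appear only a few dozen times.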

rules<-apriori(data, parameter=list(supp=0.001, conf=0.5)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.001      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 59 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[18351 item(s), 59658 transaction(s)] done [0.09s].
## sorting and recoding items ... [177 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Lowering the minimum support and confidence thresholds resulted in 9 rules.

library(arulesViz)
set.seed(240) 
plot_flights <- plot(rules, measure=c("support","lift"), shading="confidence", main="Flight Rules")
plot(rules, method="graph", measure="support", shading="lift", main="Graph for 9 rules")
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

The visualization shows that every rule points to Brazil, with the item labelled 0 (read here as O’Hare International Airport) acting as the central node in these patterns. The flight statuses (cancelled, delayed, and on time) and the four seasons (spring, summer, fall, and winter) each appear alongside item 0 in rules that lead to Brazil, so the association holds across flight outcomes and throughout the year. The smaller node for China indicates that the combination of item 0 and China is relatively uncommon. Because the statuses and seasons appear with similar support, the graph on its own does not show that any single flight outcome or season dominates these routes.

Confidence

rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE) 
inspect(head(rules.by.conf))
##     lhs            rhs      support     confidence coverage    lift     count
## [1] {0}         => {Brazil} 0.009185692 1          0.009185692 12.15278 548  
## [2] {0, China}  => {Brazil} 0.001726508 1          0.001726508 12.15278 103  
## [3] {0, Winter} => {Brazil} 0.002279661 1          0.002279661 12.15278 136  
## [4] {0, Fall}   => {Brazil} 0.002279661 1          0.002279661 12.15278 136  
## [5] {0, Spring} => {Brazil} 0.002447283 1          0.002447283 12.15278 146  
## [6] {0, Summer} => {Brazil} 0.002179087 1          0.002179087 12.15278 130

Sorted by confidence, all of the top rules have Brazil on the right-hand side, with a confidence of 1.0 and a lift of 12.15: every transaction that contains item 0 also contains Brazil. The rule {0} => {Brazil} has the highest support (0.009), while the seasonal rules (Winter, Fall, Spring, Summer) and {0, China} have lower support but the same confidence and lift. This is expected for rules that only add an item to the left-hand side of {0} => {Brazil}.
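
Because every rule in the table has a confidence of 1, the lift reduces to 1 / supp(Brazil), so the table also implies how often Brazil occurs overall. A small sketch of that arithmetic follows, using only values printed in this report; the variable names are illustrative.

```r
# Values taken from the printed output
n_transactions <- 59658    # from summary(data)
confidence     <- 1        # confidence of {0} => {Brazil}
lift           <- 12.15278 # lift of {0} => {Brazil}

# lift = confidence / supp(Brazil), so supp(Brazil) = confidence / lift
supp_brazil  <- confidence / lift
count_brazil <- supp_brazil * n_transactions

cat("Implied support of Brazil:", round(supp_brazil, 4), "\n")
cat("Implied number of transactions containing Brazil:", round(count_brazil), "\n")
```

Under these numbers, Brazil appears in roughly 8% of all transactions, and 548 of those transactions also contain item 0.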

Lift

rules.by.lift<-sort(rules, by="lift", decreasing=TRUE) 
inspect(head(rules.by.lift))
##     lhs            rhs      support     confidence coverage    lift     count
## [1] {0}         => {Brazil} 0.009185692 1          0.009185692 12.15278 548  
## [2] {0, China}  => {Brazil} 0.001726508 1          0.001726508 12.15278 103  
## [3] {0, Winter} => {Brazil} 0.002279661 1          0.002279661 12.15278 136  
## [4] {0, Fall}   => {Brazil} 0.002279661 1          0.002279661 12.15278 136  
## [5] {0, Spring} => {Brazil} 0.002447283 1          0.002447283 12.15278 146  
## [6] {0, Summer} => {Brazil} 0.002179087 1          0.002179087 12.15278 130

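Sorting by lift returns the same six rules, since all of them share a lift of 12.15. The rules differ only in the season or nationality added to the left-hand side, so the association between item 0 and Brazil holds across seasons.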
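
Rules that only add an item to the left-hand side of a more general rule, without improving its confidence, are usually considered redundant. A minimal sketch of how such rules could be filtered with is.redundant() from arules is shown below on a hypothetical toy set of transactions rather than the flight data. The item names and thresholds are assumptions chosen for illustration.

```r
library(arules)

# Hypothetical toy transactions; not taken from the flight dataset
baskets <- list(
  c("0", "Brazil", "Winter"),
  c("0", "Brazil", "Summer"),
  c("0", "Brazil", "Winter"),
  c("JFK", "USA", "Winter"),
  c("JFK", "USA", "Summer"),
  c("GRU", "Brazil", "Summer")
)
trans <- as(baskets, "transactions")

# Mine rules with at least one item on each side
rules <- apriori(trans,
                 parameter = list(supp = 0.1, conf = 0.9, minlen = 2),
                 control = list(verbose = FALSE))

# Keep only rules for which no more general rule has equal or higher confidence
pruned <- rules[!is.redundant(rules)]

cat("Rules before pruning:", length(rules), "\n")
cat("Rules after pruning: ", length(pruned), "\n")
inspect(pruned)
```

Applied to the rules object from this analysis, the same filter would be expected to keep {0} => {Brazil} and drop its specializations, although that step was not run here.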

Support

rules.by.supp<-sort(rules, by="support", decreasing=TRUE) 
inspect(head(rules.by.supp))
##     lhs               rhs      support     confidence coverage    lift    
## [1] {0}            => {Brazil} 0.009185692 1          0.009185692 12.15278
## [2] {0, Cancelled} => {Brazil} 0.003151296 1          0.003151296 12.15278
## [3] {0, On Time}   => {Brazil} 0.003084247 1          0.003084247 12.15278
## [4] {0, Delayed}   => {Brazil} 0.002950149 1          0.002950149 12.15278
## [5] {0, Spring}    => {Brazil} 0.002447283 1          0.002447283 12.15278
## [6] {0, Winter}    => {Brazil} 0.002279661 1          0.002279661 12.15278
##     count
## [1] 548  
## [2] 188  
## [3] 184  
## [4] 176  
## [5] 146  
## [6] 136

Sorted by support, the rules that combine item 0 with a flight status rank directly below {0} => {Brazil}: cancelled (188 transactions), on time (184), and delayed (176). These counts are close to one another, so no single flight outcome dominates. The seasonal rules for spring and winter follow, which indicates that the association with Brazil persists across flight outcomes and times of the year.
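
To put the status rules in context, the counts in the table can be expressed as shares of the 548 transactions that contain item 0 and compared with the overall item frequencies reported by summary(data). A short sketch using only those printed counts:

```r
# Counts of each flight status among transactions containing item 0 (from the support table)
status_with_0 <- c(Cancelled = 188, "On Time" = 184, Delayed = 176)

# Counts of each flight status across all transactions (from summary(data))
status_overall <- c(Cancelled = 19908, "On Time" = 19876, Delayed = 19874)

# Convert both sets of counts to shares
share_with_0  <- status_with_0 / sum(status_with_0)
share_overall <- status_overall / sum(status_overall)

print(round(rbind(share_with_0, share_overall), 3))
```

Both sets of shares are close to one third per status, which suggests that the status mix among transactions containing item 0 is similar to the mix in the dataset as a whole.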

Summary

The analysis focuses on flight patterns using the Apriori algorithm. The data was prepared by removing unused variables and categorizing departure dates into seasons. No rules were found at the initial thresholds; lowering the minimum support to 0.001 and the minimum confidence to 0.5 produced 9 rules, all with item 0 on the left-hand side and Brazil on the right-hand side. These rules have a confidence of 1.0 and a lift of 12.15, and they hold across cancelled, delayed, and on-time flights as well as across all four seasons.