The airline industry generates enormous volumes of data on flights, destinations, passenger behavior, and delays. Analyzing this data is critical to improving airline operations and the passenger experience. This paper applies association rule mining to a dataset of flights and destinations to reveal hidden patterns and relationships among them.
The objective of this paper is to identify association rules for flights. Such rules can support optimized scheduling, demand forecasting, and operational efficiency, and can thereby improve operational performance and customer satisfaction.
The dataset contains 98,619 records with detailed information on flights, passengers, and airports. It includes passenger details such as ID, name, gender, age, and nationality, along with flight information like departure date, arrival airport, pilot name, and flight status. Additionally, airport details cover the airport name, country, and continent.
The dataset was taken from Kaggle: https://www.kaggle.com/datasets/iamsouravbanerjee/airline-dataset
library(dplyr)
library(arules)
df<-read.csv("C:/Users/Filip/Desktop/association rules/Airline Dataset Updated - v2.csv", sep=",", dec=".", header=TRUE)
summary(df)
## Passenger.ID First.Name Last.Name Gender
## Length:98619 Length:98619 Length:98619 Length:98619
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Age Nationality Airport.Name Airport.Country.Code
## Min. : 1.0 Length:98619 Length:98619 Length:98619
## 1st Qu.:23.0 Class :character Class :character Class :character
## Median :46.0 Mode :character Mode :character Mode :character
## Mean :45.5
## 3rd Qu.:68.0
## Max. :90.0
## Country.Name Airport.Continent Continents Departure.Date
## Length:98619 Length:98619 Length:98619 Length:98619
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Arrival.Airport Pilot.Name Flight.Status
## Length:98619 Length:98619 Length:98619
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
cat("Number of observations in the dataset:", nrow(df))
## Number of observations in the dataset: 98619
cat("Number of variables in the analysis:", ncol(df))
## Number of variables in the analysis: 15
The dataset initially contained both categorical and non-categorical data across various columns, making it less suitable for association rule mining. To improve its usability, I removed variables that did not contribute to meaningful pattern discovery, such as passenger names and IDs, along with pilot name, gender, age, and airport country code. I also transformed the departure dates into seasonal categories (spring, summer, fall, and winter) to allow analysis of travel patterns by season. These modifications streamline the dataset and focus it on the features relevant for generating association rules.
df$Departure.Date <- as.Date(df$Departure.Date, format = "%m-%d-%Y")
df$Month <- format(df$Departure.Date, "%m")
df$Season <- ifelse(df$Month %in% c("12", "01", "02"), "Winter",
ifelse(df$Month %in% c("03", "04", "05"), "Spring",
ifelse(df$Month %in% c("06", "07", "08"), "Summer", "Fall")))
df <- df %>% select(-c(Passenger.ID, First.Name, Last.Name, Pilot.Name, Gender, Age, Airport.Country.Code, Departure.Date ))
write.csv(df, file = "flight_data.csv", row.names = FALSE)
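One property of the nested ifelse() above is worth noting: a date that fails to parse yields a missing month, and a missing month matches none of the three tests, so the row is labelled "Fall". The following is a minimal sketch of an alternative mapping that keeps such rows as NA. The helper name season_of and the sample dates are hypothetical and are not part of the analysis.

```r
# Hypothetical helper: map date strings to seasons, keeping unparsed dates as NA.
season_of <- function(dates, format = "%m-%d-%Y") {
  # Month number 1..12, or NA when the string does not match `format`
  month <- as.integer(format(as.Date(dates, format = format), "%m"))
  # Lookup table indexed by month number (January first)
  lookup <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
              "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  lookup[month]
}

# Made-up inputs; the last one uses "/" and therefore does not match the format.
sample_dates <- c("01-15-2022", "07-04-2022", "10-31-2022", "6/28/2022")
sample_seasons <- season_of(sample_dates)
print(sample_seasons)
```

Counting the NA values with sum(is.na(...)) before building the transactions would show how many rows, if any, are affected.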
data <- read.transactions("C:/Users/Filip/Desktop/association rules/flight_data.csv", format = "basket", sep = ",", skip=1)
summary(data)
## transactions as itemMatrix in sparse format with
## 59658 rows (elements/itemsets/transactions) and
## 18351 columns (items) and a density of 0.0003259648
##
## most frequent items:
## Cancelled On Time Delayed Summer Spring (Other)
## 19908 19876 19874 15272 15107 266824
##
## element (itemset/transaction) length distribution:
## sizes
## 5 6
## 1087 58571
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 6.000 6.000 5.982 6.000 6.000
##
## includes extended item information - examples:
## labels
## 1 -
## 2 0
## 3 28 de Noviembre Airport
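In the basket format used above, each value becomes an item without its column name, so an item such as Brazil or China does not show which column it came from. The summary also reports 59,658 transactions, fewer than the 98,619 rows in df, which is worth checking before interpreting the rules. As one possible alternative, the sketch below coerces a data frame directly to transactions, which labels every item as column=value. The toy data frame is hypothetical (its column names mirror the dataset, its values are made up), and this is not the pipeline used in this analysis.

```r
library(arules)

# Hypothetical toy rows; values are invented for illustration only.
toy_df <- data.frame(
  Nationality   = c("China", "Brazil", "Brazil"),
  Country.Name  = c("Brazil", "Brazil", "China"),
  Flight.Status = c("On Time", "Delayed", "Cancelled"),
  stringsAsFactors = TRUE
)

# Coercing a data frame of factors yields one transaction per row,
# with items labelled "column=value".
toy_trans <- as(toy_df, "transactions")
print(itemLabels(toy_trans))
print(length(toy_trans))
```

With this encoding, Nationality=Brazil and Country.Name=Brazil are distinct items, so a rule's meaning can be read directly from its labels.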
itemFrequencyPlot(data, topN=25, type="relative", main="ItemFrequency")
For the analysis, I applied the Apriori algorithm, a widely used method in association rule mining that efficiently identifies frequent itemsets and generates rules based on minimum support and confidence thresholds. Initially, I ran the algorithm with a minimum support of 0.1 and a minimum confidence of 0.65 to explore general patterns in the dataset.
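To make the role of the two thresholds concrete, here is a minimal sketch on four hypothetical baskets that are not taken from the airline data. With a minimum support of 0.5, an itemset must occur in at least two of the four baskets, and with a minimum confidence of 1, the consequent must appear in every basket that contains the antecedent.

```r
library(arules)

# Hypothetical toy baskets, invented for illustration.
toy <- list(
  c("0", "Brazil", "Summer"),
  c("0", "Brazil", "Winter"),
  c("Canada", "Summer"),
  c("Brazil", "Winter")
)
toy_trans <- as(toy, "transactions")

# minlen = 2 excludes rules with an empty antecedent.
toy_rules <- apriori(
  toy_trans,
  parameter = list(supp = 0.5, conf = 1, minlen = 2),
  control = list(verbose = FALSE)
)
inspect(toy_rules)
```

Only {0} => {Brazil} and {Winter} => {Brazil} pass both thresholds here. The reverse rule {Brazil} => {0} is dropped because Brazil occurs in three baskets and only two of them contain 0, giving a confidence of 2/3.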
rules<-apriori(data, parameter=list(supp=0.1, conf=0.65))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.65 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 5965
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[18351 item(s), 59658 transaction(s)] done [0.10s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Unfortunately, the algorithm found no rules at these thresholds: only 10 items reached the minimum support, and none of their combinations produced a rule.
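The empty result can be checked with simple arithmetic on the numbers reported by summary(data). The sketch below computes the relative support of the five most frequent items and the absolute support counts that correspond to the thresholds 0.1 and 0.001. The variable names are illustrative, and the use of floor() is an assumption that happens to reproduce the counts printed in the Apriori logs.

```r
# Figures copied from summary(data)
n_transactions <- 59658
item_counts <- c(Cancelled = 19908, `On Time` = 19876, Delayed = 19874,
                 Summer = 15272, Spring = 15107)

# Relative support of each item
item_support <- item_counts / n_transactions
print(round(item_support, 3))

# Absolute count implied by a relative support threshold
# (assumption: truncation, which matches the logged values 5965 and 59)
min_count <- function(supp) floor(supp * n_transactions)
print(min_count(0.1))
print(min_count(0.001))
```

Each flight status occurs in about a third of the transactions. If status is roughly independent of season (an assumption, not something tested here), a rule such as {Summer} => {Cancelled} would have a confidence near 0.33, well below the 0.65 threshold.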
To find rules in the dataset, I lowered the minimum support and minimum confidence thresholds to 0.001 and 0.5, respectively.
rules<-apriori(data, parameter=list(supp=0.001, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 59
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[18351 item(s), 59658 transaction(s)] done [0.09s].
## sorting and recoding items ... [177 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Lowering the minimum support and confidence thresholds yielded 9 rules.
library(arulesViz)
set.seed(240)
plot_flights <- plot(rules, measure=c("support","lift"), shading="confidence", main="Flight Rules")
plot(rules, method="graph", measure="support", shading="lift", main="Graph for 9 rules")
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The visualization shows that all nine rules point to Brazil, with O’Hare International Airport (item 0) as the central node shared by every rule. The remaining antecedent items are the three flight statuses (Cancelled, On Time, and Delayed), the four seasons (spring, summer, fall, and winter), and China. The three status rules have similar support, so the graph does not indicate that delays or cancellations are more common than on-time flights in this group. The seasonal rules are likewise close in support, which suggests that the association holds throughout the year rather than in a single season. The node for China is the smallest, with the lowest support of the nine rules, so that combination is relatively uncommon.
rules.by.conf<-sort(rules, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {0} => {Brazil} 0.009185692 1 0.009185692 12.15278 548
## [2] {0, China} => {Brazil} 0.001726508 1 0.001726508 12.15278 103
## [3] {0, Winter} => {Brazil} 0.002279661 1 0.002279661 12.15278 136
## [4] {0, Fall} => {Brazil} 0.002279661 1 0.002279661 12.15278 136
## [5] {0, Spring} => {Brazil} 0.002447283 1 0.002447283 12.15278 146
## [6] {0, Summer} => {Brazil} 0.002179087 1 0.002179087 12.15278 130
These rules all have Brazil as the consequent, with a confidence of 1.0 and a lift of 12.15: every transaction that contains item 0 also contains Brazil, and Brazil is about 12 times more frequent among those transactions than in the dataset overall. The rule with O’Hare International Airport (item 0) alone as the antecedent has the highest support (0.009). The rules that add a season or China to the antecedent have lower support but the same confidence and lift, so the association with Brazil holds across seasons.
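The measures in the table can be reproduced from the counts with the standard definitions: support is the rule count divided by the number of transactions, and lift is confidence divided by the support of the consequent. The sketch below recomputes them for {0} => {Brazil}. The number of transactions containing Brazil is not printed anywhere in the output, so the value derived here is an inference from the reported lift.

```r
# Figures copied from summary(data) and the inspect() output
n_transactions  <- 59658
count_rule      <- 548       # transactions containing both 0 and Brazil
confidence_rule <- 1
lift_rule       <- 12.15278

# support(rule) = count / N
support_rule <- count_rule / n_transactions
print(support_rule)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
support_brazil <- confidence_rule / lift_rule
print(support_brazil)

# Inferred (not reported) number of transactions containing Brazil
count_brazil <- round(support_brazil * n_transactions)
print(count_brazil)
```

Because the confidence is 1 for every rule and the consequent is always Brazil, the lift depends only on the support of Brazil, which is why all nine rules share the same value.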
rules.by.lift<-sort(rules, by="lift", decreasing=TRUE)
inspect(head(rules.by.lift))
## lhs rhs support confidence coverage lift count
## [1] {0} => {Brazil} 0.009185692 1 0.009185692 12.15278 548
## [2] {0, China} => {Brazil} 0.001726508 1 0.001726508 12.15278 103
## [3] {0, Winter} => {Brazil} 0.002279661 1 0.002279661 12.15278 136
## [4] {0, Fall} => {Brazil} 0.002279661 1 0.002279661 12.15278 136
## [5] {0, Spring} => {Brazil} 0.002447283 1 0.002447283 12.15278 146
## [6] {0, Summer} => {Brazil} 0.002179087 1 0.002179087 12.15278 130
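Sorting by lift returns the same six rules, because all rules share a lift of 12.15. The ranking therefore adds no new information beyond confirming that the association with Brazil is equally strong in every season and for {0, China}.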
rules.by.supp<-sort(rules, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence coverage lift
## [1] {0} => {Brazil} 0.009185692 1 0.009185692 12.15278
## [2] {0, Cancelled} => {Brazil} 0.003151296 1 0.003151296 12.15278
## [3] {0, On Time} => {Brazil} 0.003084247 1 0.003084247 12.15278
## [4] {0, Delayed} => {Brazil} 0.002950149 1 0.002950149 12.15278
## [5] {0, Spring} => {Brazil} 0.002447283 1 0.002447283 12.15278
## [6] {0, Winter} => {Brazil} 0.002279661 1 0.002279661 12.15278
## count
## [1] 548
## [2] 188
## [3] 184
## [4] 176
## [5] 146
## [6] 136
After the base rule {0} => {Brazil}, the rules with the highest support add a flight status to the antecedent: Cancelled (188 transactions), On Time (184), and Delayed (176). The three counts are close, so cancellations, on-time flights, and delays are about equally common within this group. The seasonal rules for spring (146) and winter (136) follow, which indicates that the association with Brazil persists across different times of the year.
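Every rule in the tables above has the same confidence and lift as {0} => {Brazil}, so adding a status or a season to the antecedent does not improve the rule. The arules package provides is.redundant() to flag such rules. The sketch below applies it to hypothetical toy baskets, not to the airline data, and restricts the consequent to Brazil to mirror the structure of the rules found.

```r
library(arules)

# Hypothetical toy baskets, invented for illustration.
toy <- list(
  c("0", "Brazil", "Summer"),
  c("0", "Brazil", "Summer"),
  c("0", "Brazil", "Winter"),
  c("Canada", "Winter"),
  c("Canada", "Summer")
)
toy_trans <- as(toy, "transactions")

# Mine rules with Brazil as the only allowed consequent.
toy_rules <- apriori(
  toy_trans,
  parameter  = list(supp = 0.1, conf = 1, minlen = 2),
  appearance = list(rhs = "Brazil", default = "lhs"),
  control    = list(verbose = FALSE)
)

# A rule is flagged as redundant when a more general rule
# has the same or higher confidence.
pruned <- toy_rules[!is.redundant(toy_rules)]
inspect(pruned)
```

In this toy example the three mined rules reduce to {0} => {Brazil}. Applying the same filter to the rules object from the analysis would be one way to check whether the nine rules carry more than one independent pattern.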
This analysis examined flight patterns using the Apriori algorithm. The data was prepared by removing variables not needed for rule mining and categorizing travel dates into seasons. With a minimum support of 0.001 and a minimum confidence of 0.5, the algorithm found nine rules, all with Brazil as the consequent and item 0 in the antecedent, each with a confidence of 1.0 and a lift of 12.15. Within these rules, cancelled, on-time, and delayed flights have similar support, and the four seasons are represented about equally, so the association with Brazil is consistent across flight statuses and seasons.