Detection of anomaly in flight search

Problem statement

The process of flight aggregation, performed by the flight team, gathers all possible flights (and combinations) for a particular user query. The objective of the data team is to remove extremely unlikely flights (eg., very long duration, very high fare, multiple stopovers etc.) from the aggregated flight list, for better user experience.

Proposal

The anomaly detection algorithm in machine learning could be used to detect anomalies (extrememly unlikely flights) and remove them from the aggregated flight list. An anomalous observation is defined as one that has an extremely low probability.

Rationale

Each key features (fare, duration, stopovers, etc.) that influence the decision have a distribution. From the booking_data (training set), the mean and standard deviation (sd) for each feature is calculated. For every new flight search, the probability of the flight being booked by a user can be calculated based on the trained mean and sd of each component feature. If this probability is less than a certain threshold, which is learned from the training and testing data, then the flight is deemed non-bookable.

Assumptions and process flow

The features are sector specific. Hence, the selected features have to be trained separately for each sector. For the purpose of demonstration, I chose the amd-blr sector. We will choose the duration, fare, and stopovers as the features. We will quantify the stopovers variable (0 for NA, 1 for single stopovers, etc). Days to journey is a feature that influences the other features, but for the time-being, let us not use it.

Below I have listed out the process flow: 1. We will use the booking_data for training and testing the algorithm 2. We will use input_data for cross-validation purpose 3. Sector amd-blr is chosen for this demonstration 4. Duration, fare, and stopovers will be chosen as features 5. The mean and sd are calculated for all the 3 features using the training data 6. Based on these values, calculate the probabilities of each feature for each observation (flight) in the test dataset 7. The total probability of each flight is the product of the individual probabilities of its features (indenpendence of features assumed) 8. Each flight in the testing dataset can be labeled as bookable (y = 0) or non-bookable (y = 1) by chosing a threshold of probability. The threshold should be such that so that most of the flights in the test dataset is marked bookable (since the test dataset is part of the booking_data) 9. Using this threshold and the mean and sd values from step 5, the cross-validation dataset (input_data) could be labeled. 10. Inspection of the outcome and error analysis may shine light on the efficacy of the algorithm.

Note: The exercise here is for the purpose of demonstration, hence, the scaling of the code (for all the sectors) or the benchmarking (for speed of implementation) are overlooked here.

Anomaly detection - example

Let me go through a simple example to demonstrate anomaly detection. Consider the quality check of an aircraft engine. Two features were measured for the qc purpose (x1 and x2). Figure 1 represents the process of the anomaly detection for this case (source: Andrew Ng).

Illustration of the anomaly detection process.

The top left shows a scatterplot of the two features x1 and x2, for engines deemed to be ok based on historical data. On the top right of the figure is an illustration of the distribution of the two features with their respective mean and sd. Based on the spread of the two features, a threshold can be selected to demarkate the spread of engines that are fit. Say, measurement of the features for a new flight is specified by the green x-mark (denoted x_test) in the scatterplot. Then, the probability for this observation is calculated as shown below.

Here, n = 2 for the engine example (for the 2 features x1 and x2). Hence, p(x) will be a product of two probabilities p(x1) and p(x2). The two features are assumed to be independent for this to be true.

Method

In this section, we will walk-through the process of implementation of the anomaly detection for the flight aggregator problem.

# relevant libraries
suppressMessages(library(dplyr))
suppressMessages(library(caret))
suppressMessages(library(caTools))

# Read the data in booking_data.tsv as an R data.frame
booking.df <- read.table("./dataset/booking_data.tsv", header = TRUE, sep = "\t")
# Cross-check if the tsv file is properly loaded
dim(booking.df)

## [1] 386907     10

# A simple summary of the data
str(booking.df)

## 'data.frame':    386907 obs. of  10 variables:
##  $ category       : Factor w/ 3 levels "domestic","international",..: 1 1 1 3 1 1 3 1 1 1 ...
##  $ transactionid  : Factor w/ 281341 levels "0000#0450a82d16ef0053ded2b767dcf84ae3",..: 165266 144839 9971 281070 150015 184580 2512 14211 164475 212502 ...
##  $ sector         : Factor w/ 15 levels "amd-blr","amd-dxb",..: 10 11 10 6 15 11 13 11 3 11 ...
##  $ days_to_journey: int  1 8 0 1 1 1 8 9 1 57 ...
##  $ traveldate     : Factor w/ 66547 levels "2015-06-01 09:15:00",..: 1639 17341 49922 32635 50949 30783 57290 34975 581 29050 ...
##  $ duration       : int  160 125 160 530 75 115 440 125 80 125 ...
##  $ fare           : num  5177 4140 7127 24464 4063 ...
##  $ stopovers      : Factor w/ 50 levels "amd","amd,bom",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ roundtrip      : Factor w/ 2 levels "n","y": 1 1 1 2 1 1 2 1 1 1 ...
##  $ flight_no      : Factor w/ 1881 levels "2t105","2t200",..: 34 462 181 1380 138 1448 95 499 1182 47 ...

# Tabulate the data to check the number of observations in each sector
table(booking.df$sector)

## 
##   amd-blr   amd-dxb   blr-pnq   blr-sin   bom-bkk bom-bkk-r bom-dxb-r 
##     20657      2417     38965       269       839      9175      7476 
## bom-sin-r del-bkk-r   del-blr   del-bom   del-dxb del-dxb-r   hyd-kul 
##      5280      9255    112469    134099      8889      8848       469 
##   hyd-maa 
##     27800

If time permits, I will include a detailed summary and exploration of the whole booking.df dataframe at the end as an Appendix section. I checked the quantiles of each feature, plotted scatterplots, histograms, boxplots, time-series plots etc. to understand the booking data. Moreover, there is not much meaning derivable from the time series plots, since the datetime in our booking_data includes bookings upto 10th month of 2017, with booking frequency, and totalsales going down drastically towards the end.

Subsetting a sector (Amd-Blr)

As stated earlier, we will chose the amd-blr sector for this demonstration. There are sufficient observations in this sector (20657). This is sufficient for us to split this data into a training set (70%) and a test set (30%).

#Choose amd-blr from the booking_data
sectAmdBlr <- filter(booking.df, sector == "amd-blr")

# i. Remove roundtrip data
sectAmdBlr <- filter(sectAmdBlr, roundtrip != "y")

Next, we will quantify the stopovers feature as 0, 1, 2 etc. based on how many stopovers each observations have.

# ii. replace NAs with 0 in stopover column
# (below code replaces every NA in the dataframe, but not a problem in our case)
sectAmdBlr$stopovers <- as.character(sectAmdBlr$stopovers)
sectAmdBlr[is.na(sectAmdBlr)] <- 0

# iii. quantify stopover data (0, 1, 2, etc.)
sectAmdBlr$stopovers[grep(",",sectAmdBlr$stopovers)] <- 2
sectAmdBlr$stopovers[!(sectAmdBlr$stopovers %in% "0" | 
                         sectAmdBlr$stopovers %in% "2")] <- "1"

sectAmdBlr$stopovers <- as.numeric(sectAmdBlr$stopovers)

Now, we will chose the features to train the model. We chose duration (minutes), stopovers, and fare (Rupees?).

# iv. let us take days-to-journey for later
# v. choose only duration, stopovers, and fare columns.
AmdBlrFeat <- select(sectAmdBlr, c(duration, stopovers, fare))

Unlike the engine qc example, we have chosen 3 features for our problem, so it will be hard to visualize the data on all features. Let us view a scatterplot of duration vs fare for the selected sector.

plot(sectAmdBlr$duration,sectAmdBlr$fare, xlab = "Duration", ylab = "Fare", pch = 3)

We see that the bookings are concentrated around lower values of duration and booking values. The bookigs become less probable as the fare price and the travel duration increase to very high values. Let us check the quantile values of fare and duration for a probability of 0.995.

# vi. check the quartile value for prob = 0.995 for durations and fare; 
# inspect the features
quantile(AmdBlrFeat$duration, 0.995)

## 99.5% 
##   670

quantile(AmdBlrFeat$fare, 0.995)

## 99.5% 
## 11156

Data spliting

The next step is to split the booking.df data into training and test set with a 70/30 proportion. We will be training the mean and sd for each feature using the training set, and we will use the test set to chose a relevent p-value as a threshold.

# vii. split the data into training and test data (later, we will test the 
# idea of adding more outliers into the testset, not now...)
# createDataPartition from caret is a better function for split, but for now,
# let us split the data as below.
set.seed(101) 
sample = sample.split(AmdBlrFeat$duration, SplitRatio = .70)
train = subset(AmdBlrFeat, sample == TRUE)
test = subset(AmdBlrFeat, sample == FALSE)

We will calculate the mean and standard deviation for each feature in our training dataset (3 features).

# x. let us compute mean and standard dev. of the features
muTrain <- apply(train,2,mean)
sdTrain <- apply(train,2,sd)

Before using the test set, lets calculate p-values for a few test cases manually. Below, I have created 4 vectors (4 test cases) with 3 features each. To calulate the p-value for each test case, I have implemented a nested for-loop (vectorization - will be useful later for testing any number of cases) and also performed some matrix-dataframe transformations that are necessary to perform the computations.

# xi. Calculate the probability for non-bookable and bookable-tickets
x1 <- matrix(c(2000,4,20000),1,3) # non-bookable
x2 <- matrix(c(170,0,5000),1,3)   # bookable (very likely)
x3 <- matrix(c(100,0,1000),1,3)   # bookable (extremely likely)
x4 <- matrix(c(900,0,5000),1,3)   # bookable (less likely)

x1234 <- rbind(x1,x2,x3,x4) # a matrix of dim 4 x 3

px123 <- matrix(rep(0,12),4,3)

# Vectorization is not complete yet!
for(j in 1:3){
  for(i in 1:4){
    px123[i,j] = (pnorm(x1234[i,j],muTrain[j],sdTrain[j], lower.tail = FALSE))
  }
}
px.df <- cbind(px123, px123[,1]*px123[,2]*px123[,3])
y <- ifelse(px.df[,4] < 1e-15, 1, 0)

px.df <- data.frame(px.df, y)

print(px.df)

##             X1           X2           X3            X4 y
## 1 2.320019e-82 6.757268e-21 8.345405e-20 1.308308e-121 1
## 2 5.016592e-01 6.928325e-01 4.015810e-01  1.395758e-01 0
## 3 7.698323e-01 6.928325e-01 9.818297e-01  5.236735e-01 0
## 4 9.907008e-15 6.928325e-01 4.015810e-01  2.756411e-15 0

In the above matrix, column x4 consists of the calculated total probability p(x) for each of the 4 manually created test cases. I have also added a column labeled y that denotes whether the test case flight is bookable (0) or not (1), based on an arbitrarily chosen threshold for the p-value.

If we inspect these test cases by comparing them with the Duration vs Fare scatterplot (disregarding the stopovers feature), we see that the 1 example, labeled non-bookable lies in an extreme corner. We could change the label in other cases, if we alter the threshold accordingly.

Testing the model and fine-tuning the threshold

Now, let us calculate the probabilities for flights in the test cases, and we will try to come up with a reasonably good threshold value.

# xii. label the test dataset as bookable (y = 0) or non-bookable (y = 1) 
# based on the product of probabilities of each variable
testmat <- as.matrix(test) # a matrix of dim 6195 x 3

pxtest <- matrix(rep(0,6195*3),6195,3)

for(j in 1:3){
  for(i in 1:6195){
    pxtest[i,j] = (pnorm(testmat[i,j],muTrain[j],sdTrain[j], lower.tail = FALSE))
  }
}
pxtest.df <- cbind(pxtest, pxtest[,1]*pxtest[,2]*pxtest[,3])
ytest <- ifelse(pxtest.df[,4] < 5e-15, 1, 0) #remember 5e-14 for comparison

pxtest.df <- data.frame(pxtest.df, ytest)
pxtest.df <- cbind(pxtest.df, dur = test$duration, far = test$fare)

plot(pxtest.df$dur, pxtest.df$far, pch = 3, col = pxtest.df$ytest+1)

Above, we see that for a chosen threshold (5e-15), the trained model has chosen the red points as non-bookable. I have chosen this value for a particular reason. If we look at the coordinates of duration ~ 700 and fare ~ 6000, two points are very close to one another, one marked black (bookable) and the other in red (non-bookable). Infact, the black point has a slightly higher price than the red point, but not seen in the plot is the third dimension (stopovers). The black point has 1 stopovers and the red point has 2 stopover. Hence, for a given fare and duration at that extreme, the model has marked the flight with extra stopover as non-bookable.

Cross-validation (testing out on input_data)

Now, for the purpose of cross-validation (on input_data), I am going to use a more relaxed threshold (5e-20), because all the data that we trained and tested so far are supposed to be with a label (y = 0). These are from booked data.

# Read the data in input_data.tsv as an R data.frame
input.df <- read.table("./dataset/input_data.tsv", header = TRUE, sep = "\t")

#Choose amd-blr from the booking_data
sectAmdBlr1 <- filter(input.df, sector == "amd-blr")

# i. Remove roundtrip data
sectAmdBlr1 <- filter(sectAmdBlr1, roundtrip != "y")

# ii. replace NAs with 0 in stopover column
# (below code replaces every NA in the dataframe, but not a problem in our case)
sectAmdBlr1$stopovers <- as.character(sectAmdBlr1$stopovers)
sectAmdBlr1[is.na(sectAmdBlr1)] <- 0

# iii. quantify stopover data (0, 1, 2, etc.)
sectAmdBlr1$stopovers[grep(",",sectAmdBlr1$stopovers)] <- 2
sectAmdBlr1$stopovers[!(sectAmdBlr1$stopovers %in% "0" | 
                         sectAmdBlr1$stopovers %in% "2")] <- "1"

sectAmdBlr1$stopovers <- as.numeric(sectAmdBlr1$stopovers)

# v. choose only duration, stopovers, and fare columns.
AmdBlrFeat1 <- select(sectAmdBlr1, c(duration, stopovers, fare))

# xii. label the test dataset as bookable (y = 0) or non-bookable (y = 1) 
# based on the product of probabilities of each variable
cvmat <- as.matrix(AmdBlrFeat1) # a matrix of dim 6195 x 3

pxcv <- matrix(rep(0,370*3),370,3)

for(j in 1:3){
  for(i in 1:370){
    pxcv[i,j] = (pnorm(cvmat[i,j],muTrain[j],sdTrain[j], lower.tail = FALSE))
  }
}
pxcv.df <- cbind(pxcv, pxcv[,1]*pxcv[,2]*pxcv[,3])
ytest1 <- ifelse(pxcv.df[,4] < 5e-20, 1, 0) #remember 5e-14 for comparison

pxcv.df <- data.frame(pxcv.df, ytest1)
pxcv.df <- cbind(pxcv.df, dur = AmdBlrFeat1$duration, far = AmdBlrFeat1$fare)

plot(pxcv.df$dur, pxcv.df$far, pch = 3, col = pxcv.df$ytest+1)

Based on the anomaly detection model that we have trained, and for the chosen threshold, the model has labeled all the red points in the above scatterplot to be non-bookable flights. Inspecting the plot, the predictions look reasonable.

Summary and conclusions

Without information about whether the flights in input_data were booked or not, it is not possible to perform an error analysis. The code provided here is for prototyping and demonstration of the algorithm. There has to be a model for each sector, or a vectorized implementation that calculates the parameters and stores information for each sector separately. There are advanced (robust) anomaly detection tools that are used for time-series scenarios. It has to be noted that the features involved are dynamic, hence the algorithm has to be trained frequently (a certain window of datetime updated everyweek or fortnight?). That should take care of the trend. Inclusion of some form of a time component as a feature might be useful in inclusion of the seasonal effects.