1. INTRODUCTION


The airline industry is a rapidly growing, highly competitive sector in which small changes in a few critical parameters can have drastic effects. Applying data mining to decision making is therefore critical for long-term survival and success. Our primary objective is to build a model for one of the major airlines in the US, to determine whether it should begin operating to/from newly opened airports and how it should price flights on these new routes. The analysis is based on historical data collected from an airline on 638 air routes in the United States.

Data Code  Description
S_CODE     Starting airport's code
S_CITY     Starting city
E_CODE     Ending airport's code
E_CITY     Ending city
COUPON     Average number of coupons (a one-coupon flight is a non-stop flight, a two-coupon flight is a one-stop flight, etc.)
NEW        Number of new carriers entering that route
VACATION   Whether a vacation route (Yes) or not (No)
SW         Whether Southwest Airlines serves that route (Yes) or not (No)
HI         Herfindahl Index - a measure of market concentration
S_INCOME   Starting city's average personal income
E_INCOME   Ending city's average personal income
S_POP      Starting city's population
E_POP      Ending city's population
SLOT       Whether the route's airports are slot-controlled (Controlled) or not (Free)
GATE       Whether the route's airports are gate-constrained (Constrained) or not (Free)
DISTANCE   Distance between the two airports (miles)
PAX        Number of passengers on that route
FARE       Average fare on that route (the response variable we predict)

2. PACKAGES

The packages used for this analysis are listed below.

. dplyr   : Easy functions to perform data manipulation in R.
. caTools : Contains several basic utility functions, including statistics, read/write, and sampling.
. ggplot2 : Data visualisation in R.
. corrplot: Plots a correlation matrix showing the relation between pairs of variables.
. caret   : Used here to create dummy variables.
. rpart   : Used for creating regression-based tree models.
. xgboost : Used for creating boosted models, in which each newly trained model learns from the previous series of models; efficient, flexible and portable.
. glmnet  : Fits linear/generalized linear models via penalized maximum likelihood, with LASSO/Ridge regression.
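
For reference, assuming the packages above are already installed, they can be loaded with:

library(dplyr)     # data manipulation
library(caTools)   # sample.split for train/test partitioning
library(ggplot2)   # plotting
library(corrplot)  # correlation matrix plots
library(caret)     # dummyVars for dummy-variable encoding
library(rpart)     # regression trees
library(xgboost)   # gradient boosting
library(glmnet)    # LASSO/Ridge regression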

3. DATA LOADING

Loading the data from the csv file. By default, columns such as FARE and S_INCOME are read in as factors and must be converted to numbers. Since the amount columns are in currency format in the input file, R treats them as non-numeric values, so it is necessary to strip the currency symbols and convert them to numeric format.

Airfare.data = read.csv("Airfares.csv")
Airfare.data = Airfare.data[,-19]   # dropping the extra 19th column; the data dictionary has 18 variables


dim(Airfare.data)  #638  18, meaning 638 rows and 18 columns
## [1] 638  18
str(Airfare.data)  # Structure of original data
## 'data.frame':    638 obs. of  18 variables:
##  $ S_CODE  : Factor w/ 8 levels "*","DCA","EWR",..: 1 1 1 8 7 1 1 1 1 1 ...
##  $ S_CITY  : Factor w/ 51 levels "Albuquerque         NM",..: 14 3 7 9 9 11 14 18 23 25 ...
##  $ E_CODE  : Factor w/ 8 levels "*","DCA","EWR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ E_CITY  : Factor w/ 68 levels "Amarillo            TX",..: 1 2 2 2 2 2 2 2 2 2 ...
##  $ COUPON  : num  1 1.06 1.06 1.06 1.06 1.01 1.28 1.15 1.33 1.6 ...
##  $ NEW     : int  3 3 3 3 3 3 3 3 3 2 ...
##  $ VACATION: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ SW      : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 1 2 2 2 ...
##  $ HI      : num  5292 5419 9185 2657 2657 ...
##  $ S_INCOME: Factor w/ 50 levels "$14,600","$18,933",..: 37 34 44 41 41 29 37 33 35 26 ...
##  $ E_INCOME: Factor w/ 67 levels "$14,600","$18,851",..: 7 57 57 57 57 57 57 57 57 57 ...
##  $ S_POP   : int  3036732 3532657 5787293 7830332 7830332 2230955 3036732 1440377 3770125 1694803 ...
##  $ E_POP   : int  205711 7145897 7145897 7145897 7145897 7145897 7145897 7145897 7145897 7145897 ...
##  $ SLOT    : Factor w/ 2 levels "Controlled","Free": 2 2 2 1 2 2 2 2 2 2 ...
##  $ GATE    : Factor w/ 2 levels "Constrained",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ DISTANCE: int  312 576 364 612 612 309 1220 921 1249 964 ...
##  $ PAX     : int  7864 8820 6452 25144 25144 13386 4625 5512 7811 4657 ...
##  $ FARE    : Factor w/ 451 levels "$100.36","$100.80",..: 371 170 219 427 427 345 241 48 168 41 ...
Airfare.data$FARE = as.numeric(substr(as.character(Airfare.data$FARE),2,10)) # To remove currency $

## S_INCOME and E_INCOME are factors; we need to convert them to numbers and remove the currency symbol

Airfare.data$S_INCOME = as.numeric(gsub("\\$|,","",Airfare.data$S_INCOME))
Airfare.data$E_INCOME = as.numeric(gsub("\\$|,","",Airfare.data$E_INCOME))

Airfare = Airfare.data[,5:18]                   # Dropping the unwanted text columns (airport codes and city names)

dim(Airfare)      # Getting the dimension ( no of rows and columns in our data)
## [1] 638  14
summary(Airfare)   # Summary of data
##      COUPON           NEW        VACATION    SW            HI       
##  Min.   :1.000   Min.   :0.000   No :468   No :444   Min.   : 1230  
##  1st Qu.:1.040   1st Qu.:3.000   Yes:170   Yes:194   1st Qu.: 3090  
##  Median :1.150   Median :3.000                       Median : 4208  
##  Mean   :1.202   Mean   :2.754                       Mean   : 4442  
##  3rd Qu.:1.298   3rd Qu.:3.000                       3rd Qu.: 5481  
##  Max.   :1.940   Max.   :3.000                       Max.   :10000  
##     S_INCOME        E_INCOME         S_POP             E_POP        
##  Min.   :14600   Min.   :14600   Min.   :  29838   Min.   : 111745  
##  1st Qu.:24706   1st Qu.:23903   1st Qu.:1862106   1st Qu.:1228816  
##  Median :28637   Median :26409   Median :3532657   Median :2195215  
##  Mean   :27760   Mean   :27664   Mean   :4557004   Mean   :3194503  
##  3rd Qu.:29694   3rd Qu.:31981   3rd Qu.:7830332   3rd Qu.:4549784  
##  Max.   :38813   Max.   :38813   Max.   :9056076   Max.   :9056076  
##          SLOT              GATE        DISTANCE           PAX       
##  Controlled:182   Constrained:124   Min.   : 114.0   Min.   : 1504  
##  Free      :456   Free       :514   1st Qu.: 455.0   1st Qu.: 5328  
##                                     Median : 850.0   Median : 7792  
##                                     Mean   : 975.7   Mean   :12782  
##                                     3rd Qu.:1306.2   3rd Qu.:14090  
##                                     Max.   :2764.0   Max.   :73892  
##       FARE       
##  Min.   : 42.47  
##  1st Qu.:106.29  
##  Median :144.60  
##  Mean   :160.88  
##  3rd Qu.:209.35  
##  Max.   :402.02

4. EXPLORATORY DATA ANALYSIS

Creating a correlation matrix to find the correlation among the input variables.

Airfare.corr = select_if(Airfare, is.numeric) # selecting the columns which are numeric
corrplot(cor(Airfare.corr), type = "upper", method = "number")
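
To back up the visual reading with numbers, the correlations with FARE can also be ranked directly (a small base-R check):

sort(cor(Airfare.corr)[, "FARE"], decreasing = TRUE)  # correlation of each numeric variable with FARE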

From the plot, it can be seen that DISTANCE and COUPON are the most positively correlated with our response variable, FARE. Creating individual plots to observe these relationships.

plot(x = Airfare$DISTANCE, y = Airfare$FARE, type = "p", main = "Relation between Distance and Fare", xlab = "Distance between the two airports", ylab = "Avg Fare")

plot(x = Airfare$COUPON, y = Airfare$FARE, type = "p", main = "Relation between Number of Stops and Fare", xlab = "Avg number of coupons", ylab = "Avg Fare")
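
Since ggplot2 is among the loaded packages, the same relationship could equally be drawn with it; a minimal sketch for the distance plot:

ggplot(Airfare, aes(x = DISTANCE, y = FARE)) +
  geom_point(alpha = 0.5) +                      # one point per route
  geom_smooth(method = "lm", se = FALSE) +       # linear trend line
  labs(title = "Relation between Distance and Fare",
       x = "Distance between the two airports", y = "Avg Fare")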

5. DATA PRE-PROCESSING


Data pre-processing consists of three main steps:

1. Converting categorical variables into dummy variables.

2. Standardising or normalising the data columns: columns such as S_POP and E_INCOME have much larger values than columns such as DISTANCE or the dummy variables, so to remove this bias we normalise the data with min-max scaling, x' = (x - min(x)) / (max(x) - min(x)).

3. Splitting the dataset into training and testing sets with an 80:20 split ratio.

# Converting categorical variables into dummy variables
DummyVar = dummyVars("~.", data = Airfare)
Airfare = data.frame(predict(DummyVar, newdata = Airfare))

## FEATURE SCALING (min-max normalisation); column 18 is FARE, the response
Airfare[,-18] <- lapply(Airfare[,-18], function(x) if(is.numeric(x)){(x - min(x))/(max(x) - min(x))} else x)

# SPLITTING DATA INTO TRAINING AND TESTING SETS
set.seed(100)                                              # for a reproducible split
sampleSplit = sample.split(Airfare$FARE, SplitRatio = 0.8) # sample.split expects a vector, not a data frame
Airfare.training = subset(Airfare, sampleSplit == TRUE)
Airfare.test = subset(Airfare, sampleSplit == FALSE)
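
A quick sanity check that the split came out close to 80:20:

dim(Airfare.training)   # roughly 80% of the 638 rows
dim(Airfare.test)       # roughly 20% of the 638 rows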

6. MODEL BUILDING PROCESS


The main step is to develop a model that can effectively predict the airfare from the input variables. The analysis was carried out using four selected models:

  1. Multiple Linear Regression model
  2. Tree-based model
  3. Extreme Gradient Boosting model
  4. LASSO Regression model

Multiple Linear Regression

Multiple linear regression is the most common form of linear regression analysis. As a predictive technique, it is used to explain the relationship between one continuous dependent variable and two or more independent variables. Here our dependent variable is FARE.

set.seed(100)
modelLr = lm(FARE ~., data = Airfare.training)
summary(modelLr)
## 
## Call:
## lm(formula = FARE ~ ., data = Airfare.training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -104.934  -21.552   -1.492   20.767  125.886 
## 
## Coefficients: (4 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -40.103     10.668  -3.759 0.000191 ***
## COUPON              9.604     13.254   0.725 0.469057    
## NEW                -7.839      5.942  -1.319 0.187713    
## VACATION.No        31.858      3.981   8.003 9.15e-15 ***
## VACATION.Yes           NA         NA      NA       NA    
## SW.No              40.772      4.180   9.754  < 2e-16 ***
## SW.Yes                 NA         NA      NA       NA    
## HI                 72.849      9.438   7.719 6.85e-14 ***
## S_INCOME           38.803     14.540   2.669 0.007872 ** 
## E_INCOME           25.799     10.064   2.563 0.010668 *  
## S_POP              25.196      6.398   3.938 9.43e-05 ***
## E_POP              47.410      7.517   6.307 6.44e-10 ***
## SLOT.Controlled    15.446      4.260   3.626 0.000319 ***
## SLOT.Free              NA         NA      NA       NA    
## GATE.Constrained   25.100      4.493   5.586 3.89e-08 ***
## GATE.Free              NA         NA      NA       NA    
## DISTANCE          201.908     10.778  18.733  < 2e-16 ***
## PAX               -61.815     11.334  -5.454 7.89e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.48 on 481 degrees of freedom
## Multiple R-squared:  0.8039, Adjusted R-squared:  0.7986 
## F-statistic: 151.7 on 13 and 481 DF,  p-value: < 2.2e-16

From the result, COUPON and NEW are not statistically significant, and S_INCOME is the weakest of the remaining predictors, so we remove these three variables and re-run the model. (The NA coefficients arise because each pair of complementary dummy variables, e.g. VACATION.No and VACATION.Yes, is perfectly collinear, so R drops one of each pair as a singularity.)

set.seed(100)
modelLR = lm(FARE ~ VACATION.No+SW.No+HI+E_INCOME+S_POP+E_POP+SLOT.Controlled+GATE.Constrained+DISTANCE+PAX, data = Airfare.training)
LR.predict = predict(modelLR, newdata = Airfare.test[,-18])

## Calculating Mean Absolute Error (sum of absolute residuals divided by n)
AccuracyLR = sum(abs(LR.predict - Airfare.test[,18]))/length(Airfare.test[,18])
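
This same mean-absolute-error computation recurs for every model below, so it could be wrapped once in a small helper (mae is a hypothetical name, not from any package):

# Hypothetical helper: mean absolute error between predictions and actuals
mae = function(predicted, actual) mean(abs(predicted - actual))
# Equivalent to the line above: AccuracyLR = mae(LR.predict, Airfare.test[,18])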

## Plotting the graph
plot(modelLR)

UNDERSTANDING THE PLOTS:
The "residuals versus fitted" plot is the most frequently used diagnostic. It is a scatter plot of residuals on the y-axis against fitted values (estimated responses) on the x-axis, and is generally used to detect non-linearity, unequal error variances, and outliers. Here, no distinctive pattern can be seen.

The "Q-Q plot" (quantile-quantile plot) shows whether a set of data plausibly came from a theoretical distribution such as the Normal or exponential. The residuals lie well along the straight dashed line.

The "Scale-Location" plot shows whether residuals are spread equally along the range of the predictors. The plot looks fine, satisfying the condition of homoscedasticity (equal variance): the points are spread roughly evenly above and below the line.

The "Residuals vs Leverage" plot helps to find influential cases or outliers, if any. No influential case can be directly observed here.

Decision Tree model

A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.

set.seed(100)
modelDT = rpart(FARE ~.,data = Airfare.training)
DT.Predict = predict(modelDT, newdata = Airfare.test[,-18])

## Calculating Mean Absolute Error
AccuracyDT = sum(abs(DT.Predict - Airfare.test[,18]))/length(Airfare.test[,18])
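
To inspect the fitted tree itself, base R plotting (supported directly by rpart) can be used; a minimal sketch:

plot(modelDT, uniform = TRUE, margin = 0.1)   # draw the tree skeleton
text(modelDT, cex = 0.7)                      # label the splits and leaf values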

Extreme Gradient Boosting

Gradient boosting is an ensemble technique in supervised machine learning that generates its output by building many individual models in sequence. Each new model tries to reduce the error rate (the difference between actual and predicted values) left behind by the previous model, and the final strong model is a combination of all these individual models.

set.seed(100)

lab_matrix = as.matrix(Airfare.training$FARE)     # response as a matrix
data_trainX = as.matrix(Airfare.training[,-18])   # predictors as a matrix
dtrain = xgb.DMatrix(data = data_trainX, label = lab_matrix)
dim(as.matrix(Airfare.training$FARE))
## [1] 495   1
dtest = xgb.DMatrix(data = as.matrix(Airfare.test[,-18]), label = as.matrix(Airfare.test$FARE))

# Defining the parameters
parameters = list(booster = "gblinear",
                  objective = "reg:linear",    # older alias of "reg:squarederror"
                  eta = 0.1,            # vary between 0.1-0.3
                  nthread = 5,          # increase this to improve speed
                  max_depth = 15,       # tree-booster parameter; ignored by gblinear
                  lambda = 0.5,         # L2 penalty; vary between 0-3
                  alpha = 0.5,          # L1 penalty; vary between 0-3
                  min_child_weight = 2, # tree-booster parameter; ignored by gblinear
                  eval_metric = "rmse")

model.Xgb = xgboost(params = parameters, data = dtrain, nrounds = 53)
## [1]  train-rmse:114.233131 
## [2]  train-rmse:86.010605 
## [3]  train-rmse:73.702103 
## [4]  train-rmse:68.309982 
## [5]  train-rmse:65.643623 
## [6]  train-rmse:64.136803 
## [7]  train-rmse:63.163280 
## [8]  train-rmse:62.448975 
## [9]  train-rmse:61.898884 
## [10] train-rmse:61.453812 
## [11] train-rmse:61.077629 
## [12] train-rmse:60.755955 
## [13] train-rmse:60.473797 
## [14] train-rmse:60.230156 
## [15] train-rmse:60.023586 
## [16] train-rmse:59.838963 
## [17] train-rmse:59.674778 
## [18] train-rmse:59.533527 
## [19] train-rmse:59.411934 
## [20] train-rmse:59.296909 
## [21] train-rmse:59.199776 
## [22] train-rmse:59.111588 
## [23] train-rmse:59.038662 
## [24] train-rmse:58.969128 
## [25] train-rmse:58.906765 
## [26] train-rmse:58.858486 
## [27] train-rmse:58.813683 
## [28] train-rmse:58.771828 
## [29] train-rmse:58.738911 
## [30] train-rmse:58.705547 
## [31] train-rmse:58.677078 
## [32] train-rmse:58.653545 
## [33] train-rmse:58.633549 
## [34] train-rmse:58.618248 
## [35] train-rmse:58.603817 
## [36] train-rmse:58.592804 
## [37] train-rmse:58.582569 
## [38] train-rmse:58.573158 
## [39] train-rmse:58.569313 
## [40] train-rmse:58.564503 
## [41] train-rmse:58.560307 
## [42] train-rmse:58.557976 
## [43] train-rmse:58.557796 
## [44] train-rmse:58.555031 
## [45] train-rmse:58.555511 
## [46] train-rmse:58.556698 
## [47] train-rmse:58.559086 
## [48] train-rmse:58.562485 
## [49] train-rmse:58.564533 
## [50] train-rmse:58.565876 
## [51] train-rmse:58.569130 
## [52] train-rmse:58.574169 
## [53] train-rmse:58.579826
model.Predict = predict(model.Xgb, dtest)

AccuracyXgb = sum(abs(model.Predict - Airfare.test[,18]))/length(Airfare.test[,18])   # Mean Absolute Error
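
Note that train-rmse stops improving after roughly round 44 and then creeps back up, so the choice of nrounds = 53 is somewhat arbitrary. One option is xgb.train with a watchlist and early stopping, sketched below under the same parameters (in practice a separate validation set, rather than the test set, should drive the stopping decision):

model.Xgb.es = xgb.train(params = parameters, data = dtrain, nrounds = 200,
                         watchlist = list(train = dtrain, eval = dtest),
                         early_stopping_rounds = 10,  # stop after 10 rounds without improvement
                         verbose = 0)
model.Xgb.es$best_iteration                           # round with the best evaluation RMSE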

LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) is a powerful method that performs two main tasks: regularization and feature selection. The LASSO method constrains the sum of the absolute values of the model coefficients to stay below a fixed upper bound. To do so, it applies a shrinking (regularization) process that penalizes the regression coefficients, shrinking some of them to zero; variables that still have a non-zero coefficient after shrinking are selected to be part of the model. The goal of this process is to minimize the prediction error.

The tuning parameter lambda controls the strength of the penalty: the larger lambda is, the more coefficients are shrunk to zero, while lambda = 0 gives an OLS (Ordinary Least Squares) regression. The alpha value chooses between the LASSO and Ridge methods (alpha = 1 LASSO, alpha = 0 Ridge regularization). In the coefficient-path plot produced below, the x-axis shows different values of lambda and each line represents one explanatory variable's coefficient.

set.seed(100)
# glmnet requires matrix input
Airfare_X = as.matrix(Airfare[,-18])
Airfare_Y = Airfare$FARE
lasso_fit = glmnet(x = Airfare_X, y = Airfare_Y, family = "gaussian", alpha = 1)

plot(lasso_fit, xvar = "lambda", label = TRUE)

# Cross-validation to find the best value of lambda
cv_lasso = cv.glmnet(x = Airfare_X, y = Airfare_Y, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv_lasso)

# Predicting with LASSO, using the lambda that minimises cross-validated error
Lasso.Predict = predict(lasso_fit, newx = Airfare_X, s = cv_lasso$lambda.min)
# Mean Absolute Error at lambda.min
AccuracyLS.min = sum(abs(Lasso.Predict - Airfare_Y))/length(Airfare_Y) #27.50271
# Predicting with LASSO, using lambda.1se (the largest lambda within one standard error of the minimum)
Lasso.Predict1se = predict(lasso_fit, newx = Airfare_X, s = cv_lasso$lambda.1se)
# Mean Absolute Error at lambda.1se
AccuracyLS.1se = sum(abs(Lasso.Predict1se - Airfare_Y))/length(Airfare_Y) #28.71035
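
To see which variables survive the shrinkage at the chosen lambda, the coefficients can be inspected directly (standard glmnet usage):

coef(lasso_fit, s = cv_lasso$lambda.min)   # variables shown with a dot have been shrunk to zero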

7. SUMMARY


The main objective of the analysis was to build a suitable machine learning model that could help Southwest Airlines predict the airfare for a new route.

a. The data was cleaned and manipulated for our analysis. Various correlation graphs were plotted to identify the relation among different variables.

b. Various Machine Learning algorithms were used to predict the price and to identify the best model.

c. The most accurate model, i.e. the model with the lowest mean absolute error (the ERROR column below), was identified from the four models:

MACHINE_LEARNING_MODELS = c("Multiple Linear Regression","Decision Tree Model","Extreme Gradient Boosting","LASSO Regression")
ERROR = c(AccuracyLR,AccuracyDT,AccuracyXgb,AccuracyLS.min)

df = data.frame(MACHINE_LEARNING_MODELS,ERROR)
df
##      MACHINE_LEARNING_MODELS    ERROR
## 1 Multiple Linear Regression 31.67563
## 2        Decision Tree Model 29.04397
## 3  Extreme Gradient Boosting 44.40906
## 4           LASSO Regression 27.50271


From the analysis, the lowest error rate was obtained with the LASSO Regression model, so we use it as our standard model for predicting airfares on the new routes.

Error rates from the linear regression and decision tree models were reasonably close. The XGBoost method did not predict as well, possibly because the training dataset was small.