The airline industry is a rapidly growing, highly competitive sector in which small changes in a few critical parameters can have drastic effects. Applying data mining to decision making is therefore critical for long-term survival and success. Our primary objective is to build a model for one of the major US airlines to determine whether it should start operating to/from newly opened airports and how it should price flights on these new routes. The analysis is based on historical data collected from an airline on 638 air routes in the United States.
Data Code | Description |
---|---|
S_CODE | Starting airport's code |
S_CITY | Starting city |
E_CODE | Ending airport's code |
E_CITY | Ending city |
COUPON | Average number of coupons (a one-coupon flight is non-stop, a two-coupon flight has one stop, etc.) |
NEW | Number of new carriers entering that route |
VACATION | Whether a vacation route (Yes) or not (No) |
SW | Whether Southwest Airlines serves that route (Yes) or not (No) |
HI | Herfindahl Index, a measure of market concentration |
S_INCOME | Starting city's average personal income |
E_INCOME | Ending city's average personal income |
S_POP | Starting city's population |
E_POP | Ending city's population |
SLOT | Whether the endpoint airport slots are Controlled or Free |
GATE | Whether the endpoint airport gates are Constrained or Free |
DISTANCE | Distance between the two airports |
PAX | Number of passengers on that route |
FARE | Average fare on that route (the variable to be predicted) |
The packages used for this analysis are listed below, followed by a sketch of loading them.
- dplyr: easy-to-use functions for data manipulation in R.
- caTools: basic utility functions, including statistics, read/write, and sampling (used here for the train/test split).
- ggplot2: data visualisation in R.
- corrplot: plots a correlation matrix showing the relation between each pair of variables.
- caret: creating dummy variables in this case.
- rpart: creating regression tree models.
- xgboost: gradient-boosted models in which each new model learns from the errors of the previous ones; efficient, flexible, and portable.
- glmnet: fits linear models / generalized linear models via penalized maximum likelihood, with LASSO/ridge regression.
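A minimal sketch of loading these packages (assuming they are already installed):
library(dplyr)    # data manipulation
library(caTools)  # sample.split for the train/test split
library(ggplot2)  # plotting
library(corrplot) # correlation matrix plot
library(caret)    # dummyVars for dummy-variable creation
library(rpart)    # regression trees
library(xgboost)  # gradient boosting
library(glmnet)   # LASSO / ridge regression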
Loading the data from the csv file. By default, columns such as FARE and S_INCOME are read in as factors and must be converted to numbers. Because the amount columns are in currency format in the input file, R treats them as non-numeric values, so the currency symbols must be removed before converting them to numeric.
Airfare.data = read.csv("Airfares.csv") # Read the raw data
Airfare.data = Airfare.data[,-19] # Drop the 19th column, which is not used in the analysis
dim(Airfare.data) # 638 rows and 18 columns
## [1] 638 18
str(Airfare.data) # Structure of original data
## 'data.frame': 638 obs. of 18 variables:
## $ S_CODE : Factor w/ 8 levels "*","DCA","EWR",..: 1 1 1 8 7 1 1 1 1 1 ...
## $ S_CITY : Factor w/ 51 levels "Albuquerque NM",..: 14 3 7 9 9 11 14 18 23 25 ...
## $ E_CODE : Factor w/ 8 levels "*","DCA","EWR",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ E_CITY : Factor w/ 68 levels "Amarillo TX",..: 1 2 2 2 2 2 2 2 2 2 ...
## $ COUPON : num 1 1.06 1.06 1.06 1.06 1.01 1.28 1.15 1.33 1.6 ...
## $ NEW : int 3 3 3 3 3 3 3 3 3 2 ...
## $ VACATION: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ SW : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 1 2 2 2 ...
## $ HI : num 5292 5419 9185 2657 2657 ...
## $ S_INCOME: Factor w/ 50 levels "$14,600","$18,933",..: 37 34 44 41 41 29 37 33 35 26 ...
## $ E_INCOME: Factor w/ 67 levels "$14,600","$18,851",..: 7 57 57 57 57 57 57 57 57 57 ...
## $ S_POP : int 3036732 3532657 5787293 7830332 7830332 2230955 3036732 1440377 3770125 1694803 ...
## $ E_POP : int 205711 7145897 7145897 7145897 7145897 7145897 7145897 7145897 7145897 7145897 ...
## $ SLOT : Factor w/ 2 levels "Controlled","Free": 2 2 2 1 2 2 2 2 2 2 ...
## $ GATE : Factor w/ 2 levels "Constrained",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ DISTANCE: int 312 576 364 612 612 309 1220 921 1249 964 ...
## $ PAX : int 7864 8820 6452 25144 25144 13386 4625 5512 7811 4657 ...
## $ FARE : Factor w/ 451 levels "$100.36","$100.80",..: 371 170 219 427 427 345 241 48 168 41 ...
Airfare.data$FARE = as.numeric(substr(as.character(Airfare.data$FARE),2,10)) # Drop the leading "$" and convert to numeric
## S_INCOME and E_INCOME are also factors; strip the currency symbol and commas, then convert to numeric
Airfare.data$S_INCOME = as.numeric(gsub("\\$|,","",Airfare.data$S_INCOME))
Airfare.data$E_INCOME = as.numeric(gsub("\\$|,","",Airfare.data$E_INCOME))
Airfare = Airfare.data[,5:18] # Drop the first four columns (airport codes and city names), which are text identifiers
dim(Airfare) # Dimensions (number of rows and columns) of the working data
## [1] 638 14
summary(Airfare) # Summary of data
## COUPON NEW VACATION SW HI
## Min. :1.000 Min. :0.000 No :468 No :444 Min. : 1230
## 1st Qu.:1.040 1st Qu.:3.000 Yes:170 Yes:194 1st Qu.: 3090
## Median :1.150 Median :3.000 Median : 4208
## Mean :1.202 Mean :2.754 Mean : 4442
## 3rd Qu.:1.298 3rd Qu.:3.000 3rd Qu.: 5481
## Max. :1.940 Max. :3.000 Max. :10000
## S_INCOME E_INCOME S_POP E_POP
## Min. :14600 Min. :14600 Min. : 29838 Min. : 111745
## 1st Qu.:24706 1st Qu.:23903 1st Qu.:1862106 1st Qu.:1228816
## Median :28637 Median :26409 Median :3532657 Median :2195215
## Mean :27760 Mean :27664 Mean :4557004 Mean :3194503
## 3rd Qu.:29694 3rd Qu.:31981 3rd Qu.:7830332 3rd Qu.:4549784
## Max. :38813 Max. :38813 Max. :9056076 Max. :9056076
## SLOT GATE DISTANCE PAX
## Controlled:182 Constrained:124 Min. : 114.0 Min. : 1504
## Free :456 Free :514 1st Qu.: 455.0 1st Qu.: 5328
## Median : 850.0 Median : 7792
## Mean : 975.7 Mean :12782
## 3rd Qu.:1306.2 3rd Qu.:14090
## Max. :2764.0 Max. :73892
## FARE
## Min. : 42.47
## 1st Qu.:106.29
## Median :144.60
## Mean :160.88
## 3rd Qu.:209.35
## Max. :402.02
Creating a correlation matrix to examine the correlations among the numeric input variables.
Airfare.corr = select_if(Airfare, is.numeric) # Keep only the numeric columns
correlation = corrplot(cor(Airfare.corr), type = "upper", method = "number")
From the plot, it can be seen that DISTANCE and COUPON are the most positively correlated with the response variable, FARE. Creating individual plots to observe their behaviour.
plot(x = Airfare$DISTANCE, y = Airfare$FARE, type = "p", main = "Relation between Distance and Fare", xlab = "Distance between two airports", ylab = "Avg Fare")
plot(x = Airfare$COUPON, y = Airfare$FARE, type = "p", main = "Relation between Number of Coupons and Fare", xlab = "Avg number of coupons (stops + 1)", ylab = "Avg Fare")
Data pre-processing consists of three main steps:
1. Converting categorical variables into dummy variables.
2. Normalising the numeric columns: columns such as S_POP and E_INCOME have much larger magnitudes than columns such as COUPON and DISTANCE, so to remove scale bias we rescale the data using min-max normalisation (see the formula after this list).
3. Splitting the dataset into training and testing sets with an 80:20 split ratio.
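Step 2 uses min-max normalisation, which rescales every numeric column to the [0, 1] range (this is exactly what the scaling line in the code below computes):
x_scaled = (x - min(x)) / (max(x) - min(x))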
# Converting categorical variables into dummy variables
DummyVar = dummyVars("~.", data = Airfare)
Airfare = data.frame(predict(DummyVar, newdata = Airfare))
## FEATURE SCALING: min-max normalisation (column 18, FARE, is the response and is left unscaled)
Airfare[,-18] <- lapply(Airfare[,-18], function(x) if(is.numeric(x)){(x - min(x))/(max(x) - min(x))} else x)
# SPLITTING DATA INTO TRAINING AND TESTING SETS
set.seed(100) # for a reproducible split
sampleSplit = sample.split(Airfare$FARE, SplitRatio = 0.8) # sample.split expects the outcome vector, not the whole data frame
Airfare.training = subset(Airfare, sampleSplit == TRUE)
Airfare.test = subset(Airfare, sampleSplit == FALSE)
The main step is to develop a model that can effectively predict the airfare from the input variables. The analysis was carried out using four selected models: multiple linear regression, a decision tree, extreme gradient boosting (XGBoost), and LASSO regression.
Multiple linear regression is the most common form of linear regression analysis. As a predictive technique, it models the relationship between one continuous dependent variable and two or more independent variables. Here the dependent variable is FARE.
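In general form, the model fitted below assumes
FARE = b0 + b1*x1 + b2*x2 + ... + bk*xk + error
where the x's are the predictor columns and the coefficients b are estimated by least squares.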
set.seed(100)
modelLr = lm(FARE ~., data = Airfare.training)
summary(modelLr)
##
## Call:
## lm(formula = FARE ~ ., data = Airfare.training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -104.934 -21.552 -1.492 20.767 125.886
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.103 10.668 -3.759 0.000191 ***
## COUPON 9.604 13.254 0.725 0.469057
## NEW -7.839 5.942 -1.319 0.187713
## VACATION.No 31.858 3.981 8.003 9.15e-15 ***
## VACATION.Yes NA NA NA NA
## SW.No 40.772 4.180 9.754 < 2e-16 ***
## SW.Yes NA NA NA NA
## HI 72.849 9.438 7.719 6.85e-14 ***
## S_INCOME 38.803 14.540 2.669 0.007872 **
## E_INCOME 25.799 10.064 2.563 0.010668 *
## S_POP 25.196 6.398 3.938 9.43e-05 ***
## E_POP 47.410 7.517 6.307 6.44e-10 ***
## SLOT.Controlled 15.446 4.260 3.626 0.000319 ***
## SLOT.Free NA NA NA NA
## GATE.Constrained 25.100 4.493 5.586 3.89e-08 ***
## GATE.Free NA NA NA NA
## DISTANCE 201.908 10.778 18.733 < 2e-16 ***
## PAX -61.815 11.334 -5.454 7.89e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.48 on 481 degrees of freedom
## Multiple R-squared: 0.8039, Adjusted R-squared: 0.7986
## F-statistic: 151.7 on 13 and 481 DF, p-value: < 2.2e-16
From the result, COUPON and NEW are the least significant predictors; these, along with S_INCOME, are removed and the model is re-run. (The NA coefficients arise because each dummy pair, e.g. VACATION.No/VACATION.Yes, is perfectly collinear, so R drops one level of each pair.)
set.seed(100)
modelLR = lm(FARE ~ VACATION.No+SW.No+HI+E_INCOME+S_POP+E_POP+SLOT.Controlled+GATE.Constrained+DISTANCE+PAX, data = Airfare.training)
LR.predict = predict(modelLR, newdata = Airfare.test[,-18]) # Column 18 is FARE
## Calculating the Mean Absolute Error (MAE) on the test set
AccuracyLR = sum(abs(LR.predict - Airfare.test[,18]))/length(Airfare.test[,18])
## Plotting the graph
plot(modelLR)
UNDERSTANDING THE PLOTS:
“Residuals versus fits” is the most frequently used diagnostic plot. It is a scatter plot of residuals on the y-axis against fitted values (estimated responses) on the x-axis, and it is generally used to detect non-linearity, unequal error variances, and outliers. Here, no distinctive pattern can be seen.
“The Q-Q plot”, or quantile-quantile plot, shows whether the residuals plausibly come from a theoretical distribution such as the Normal or exponential. The residuals line up well along the straight dashed line.
“Scale-Location” shows whether the residuals are spread equally along the range of fitted values. The plot looks acceptable: the points scatter roughly equally above and below the line, satisfying the homoscedasticity (equal variance) condition.
“Residuals vs Leverage” helps identify influential cases or outliers, if any. No strongly influential case can be directly observed in the plot.
A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. Decision trees can handle both categorical and numerical data.
set.seed(100)
modelDT = rpart(FARE ~.,data = Airfare.training)
DT.Predict = predict(modelDT, newdata = Airfare.test[,-18])
## Calculating the Mean Absolute Error (MAE) on the test set
AccuracyDT = sum(abs(DT.Predict - Airfare.test[,18]))/length(Airfare.test[,18])
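To see which splits the fitted tree actually uses, the rpart object can be drawn with base graphics (a minimal sketch; the rpart.plot package would give prettier output):
# Visualise the fitted regression tree
plot(modelDT, uniform = TRUE, margin = 0.1)
text(modelDT, cex = 0.7)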
Gradient boosting is an ensemble technique in supervised machine learning that builds many individual models in sequence. Each new model tries to minimise the error rate (the difference between actual and predicted values) left by the previous model, and the final strong model is a combination of these individual models.
set.seed(100)
lab_matrix = as.matrix(Airfare.training$FARE) # Response as a matrix
data_trainX = as.matrix(Airfare.training[,-18]) # Predictors as a matrix
dtrain = xgb.DMatrix(data = data_trainX, label = lab_matrix) # xgboost's internal data format
dim(as.matrix(Airfare.training$FARE))
## [1] 495 1
dtest = xgb.DMatrix(data = as.matrix(Airfare.test[,-18]), label = as.matrix(Airfare.test$FARE))
# Defining the parameters
parameters = list(booster = "gblinear",
objective = "reg:linear",
eta = 0.1, # Learning rate; vary between 0.1 and 0.3
nthread = 5, # Increase this to improve speed
max_depth = 15, # Tree-booster parameter; ignored by gblinear
lambda = 0.5, # L2 penalty; vary between 0 and 3
alpha = 0.5, # L1 penalty; vary between 0 and 3
min_child_weight = 2, # Tree-booster parameter; vary between 1 and 10
eval_metric = "rmse")
model.Xgb = xgboost(params = parameters, data = dtrain, nrounds = 53)
## [1] train-rmse:114.233131
## [2] train-rmse:86.010605
## [3] train-rmse:73.702103
## [4] train-rmse:68.309982
## [5] train-rmse:65.643623
## [6] train-rmse:64.136803
## [7] train-rmse:63.163280
## [8] train-rmse:62.448975
## [9] train-rmse:61.898884
## [10] train-rmse:61.453812
## [11] train-rmse:61.077629
## [12] train-rmse:60.755955
## [13] train-rmse:60.473797
## [14] train-rmse:60.230156
## [15] train-rmse:60.023586
## [16] train-rmse:59.838963
## [17] train-rmse:59.674778
## [18] train-rmse:59.533527
## [19] train-rmse:59.411934
## [20] train-rmse:59.296909
## [21] train-rmse:59.199776
## [22] train-rmse:59.111588
## [23] train-rmse:59.038662
## [24] train-rmse:58.969128
## [25] train-rmse:58.906765
## [26] train-rmse:58.858486
## [27] train-rmse:58.813683
## [28] train-rmse:58.771828
## [29] train-rmse:58.738911
## [30] train-rmse:58.705547
## [31] train-rmse:58.677078
## [32] train-rmse:58.653545
## [33] train-rmse:58.633549
## [34] train-rmse:58.618248
## [35] train-rmse:58.603817
## [36] train-rmse:58.592804
## [37] train-rmse:58.582569
## [38] train-rmse:58.573158
## [39] train-rmse:58.569313
## [40] train-rmse:58.564503
## [41] train-rmse:58.560307
## [42] train-rmse:58.557976
## [43] train-rmse:58.557796
## [44] train-rmse:58.555031
## [45] train-rmse:58.555511
## [46] train-rmse:58.556698
## [47] train-rmse:58.559086
## [48] train-rmse:58.562485
## [49] train-rmse:58.564533
## [50] train-rmse:58.565876
## [51] train-rmse:58.569130
## [52] train-rmse:58.574169
## [53] train-rmse:58.579826
model.Predict = predict(model.Xgb, dtest)
# Mean Absolute Error (MAE) on the test set
AccuracyXgb = sum(abs(model.Predict - Airfare.test[,18]))/length(Airfare.test[,18])
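The training log shows the RMSE bottoming out around round 44 and then creeping back up, so the number of rounds could be tuned automatically rather than fixed at 53. A sketch using xgboost's built-in early stopping (same parameters as above; the held-out set here doubles as the validation set):
# Stop boosting once the test RMSE has not improved for 10 rounds
model.Xgb.es = xgb.train(params = parameters, data = dtrain, nrounds = 200,
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 10)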
LASSO (Least Absolute Shrinkage and Selection Operator) is a powerful method that performs two main tasks: regularization and feature selection. LASSO constrains the sum of the absolute values of the model coefficients to lie below a fixed value (an upper bound). To do so, it applies a shrinking (regularization) process that penalizes the regression coefficients, shrinking some of them to zero; variables that retain a non-zero coefficient after shrinking are selected into the model. The goal of this process is to minimize the prediction error.
The tuning parameter lambda controls the strength of the penalty: the larger lambda is, the more coefficients are shrunk to zero, while lambda = 0 reduces to an OLS (Ordinary Least Squares) regression. The alpha value chooses between the LASSO and ridge methods (alpha = 1 for LASSO, alpha = 0 for ridge regularization). In the coefficient-path plot below, the x-axis shows different values of lambda, and each line represents one explanatory variable's coefficient in the model.
set.seed(100)
# glmnet requires matrix input for the predictors
Airfare_X = as.matrix(Airfare[,-18])
Airfare_Y = Airfare$FARE
lasso_fit = glmnet(x = Airfare_X, y = Airfare_Y, family = "gaussian", alpha = 1) # alpha = 1 selects LASSO
plot(lasso_fit, xvar = "lambda", label = TRUE) # coefficient paths as lambda grows
# Cross-validation to choose the value of lambda
cv_lasso = cv.glmnet(x = Airfare_X, y = Airfare_Y, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv_lasso)
# Predicting with lambda.min, the lambda that minimises the cross-validated error
Lasso.Predict = predict(lasso_fit, newx = Airfare_X, s = cv_lasso$lambda.min)
# Mean absolute error at lambda.min (note: the LASSO is fit and evaluated on the full dataset, so this error is in-sample)
AccuracyLS.min = sum(abs(Lasso.Predict - Airfare_Y))/length(Airfare_Y) #27.50271
# Predicting with lambda.1se, the largest lambda within one standard error of the minimum
Lasso.Predict1se = predict(lasso_fit, newx = Airfare_X, s = cv_lasso$lambda.1se)
# Mean absolute error at lambda.1se
AccuracyLS.1se = sum(abs(Lasso.Predict1se - Airfare_Y))/length(Airfare_Y) #28.71035
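Which variables survived the shrinkage can be checked directly from the fitted object (a quick sketch):
# Inspect the coefficients LASSO kept at lambda.min (dropped variables are exactly zero)
lasso_coefs = as.matrix(coef(lasso_fit, s = cv_lasso$lambda.min))
lasso_coefs[lasso_coefs[, 1] != 0, , drop = FALSE]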
The main objective of the analysis was to build a suitable machine learning model that could help Southwest Airlines predict the airfare for a new route.
a. The data was cleaned and manipulated for the analysis, and correlation plots were created to identify the relations among the variables.
b. Several machine learning algorithms were used to predict the fare and to identify the best model.
c. The most accurate model, i.e. the one with the lowest mean absolute error (MAE), was identified from the four models shown below:
MACHINE_LEARNING_MODELS = c("Multiple Linear Regression","Decision Tree Model","Extreme Gradient Boosting","LASSO Regression")
ERROR = c(AccuracyLR,AccuracyDT,AccuracyXgb,AccuracyLS.min) # mean absolute errors computed above
df = data.frame(MACHINE_LEARNING_MODELS,ERROR)
df
## MACHINE_LEARNING_MODELS ERROR
## 1 Multiple Linear Regression 31.67563
## 2 Decision Tree Model 29.04397
## 3 Extreme Gradient Boosting 44.40906
## 4 LASSO Regression 27.50271
From the analysis, the lowest error was obtained with the LASSO regression model, so we adopt it as our standard model for predicting airfares on the new routes.
Error rates from the linear regression and tree models were close behind. The XGBoost model did not predict as well, possibly because the training dataset was small.
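As a closing sketch, the chosen LASSO model can price a hypothetical new route; the feature values must be supplied on the same min-max normalised scale used in training (the row reused below is purely illustrative):
# Price a hypothetical new route with the selected LASSO model
new_route = Airfare_X[1, , drop = FALSE] # template row with the right columns; replace with the new route's (normalised) features
predict(lasso_fit, newx = new_route, s = cv_lasso$lambda.min)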