Objective

In this article we are going to work on predicting airfares between new routes from the Airfares.xls data set by fitting a Linear Regression model.

Approach:

  • Loading & Cleansing Data in R
  • Created New Fields (Calculated), if required!
  • Data Visualizations
  • Checking for Multicollinearity
  • Split the data into Training & Testing
  • Fitting Full Model
  • Interpretation & Deviations of Standard Regression Assumptions
  • Tuning the Model
  • Feature Pruning
  • Anova Test
  • Predicting on Test Data

Loading & Cleansing Data in R

Loading:

rm(list = ls())
setwd('D:\\personal\\ISB\\SA2\\Assignment1')
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS)
# library(leaps) 

##Loading required Functions
source(".\\PredictingFare\\UtilityFunctions.R")
airfares <- read.csv("Airfares.csv")
head(airfares)
##   S_CODE                 S_CITY E_CODE                 E_CITY COUPON NEW
## 1      * Dallas/Fort Worth   TX      * Amarillo            TX   1.00   3
## 2      * Atlanta             GA      * Baltimore/Wash Intl MD   1.06   3
## 3      * Boston              MA      * Baltimore/Wash Intl MD   1.06   3
## 4    ORD Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 5    MDW Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 6      * Cleveland           OH      * Baltimore/Wash Intl MD   1.01   3
##   VACATION  SW      HI S_INCOME E_INCOME   S_POP   E_POP       SLOT GATE
## 1       No Yes 5291.99  $28,637  $21,112 3036732  205711       Free Free
## 2       No  No 5419.16  $26,993  $29,838 3532657 7145897       Free Free
## 3       No  No 9185.28  $30,124  $29,838 5787293 7145897       Free Free
## 4       No Yes 2657.35  $29,260  $29,838 7830332 7145897 Controlled Free
## 5       No Yes 2657.35  $29,260  $29,838 7830332 7145897       Free Free
## 6       No Yes 3408.11  $26,046  $29,838 2230955 7145897       Free Free
##   DISTANCE   PAX    FARE
## 1      312  7864  $64.11
## 2      576  8820 $174.47
## 3      364  6452 $207.76
## 4      612 25144  $85.47
## 5      612 25144  $85.47
## 6      309 13386  $56.76

Cleansing:

As we can see from the head command, there are some special characters like($ and ,) in few of the columns, lets cleanse them;

  airfares$FARE <- as.numeric(substring(as.character(airfares$FARE),2)) 
  toberemovedChars <- c("$",",")
  for(c in toberemovedChars)
  {
    airfares$S_INCOME <- sub(x = as.character(airfares$S_INCOME), pattern = c,replacement = "",fixed = TRUE)
    airfares$E_INCOME <-  sub(x = as.character(airfares$E_INCOME), pattern = c,replacement = "",fixed = TRUE)
  }
  airfares$S_INCOME <- as.numeric(airfares$S_INCOME)
  airfares$E_INCOME <- as.numeric(airfares$E_INCOME)
  airfares$S_POP <- as.numeric(airfares$S_POP)
  airfares$E_POP <- as.numeric(airfares$E_POP)

Created new Fields

After looking at the data, we can understand that S_CODE & E_CODE are having 3 character airport code for high traffic cities or cities with multiple airports and ’*’ otherwise. I’m going to create a new field “HavingAtleastOneHighTraficCity” which will be used during the model creation.

  airfares$HavingAtleastOneHighTraficCity <- as.factor(airfares$S_CODE !='*' | airfares$E_CODE !='*')

Data Visualizations

Its always useful to understand the relationship between the predictors and the response variable (FARE) by plotting them and check linear relationship between them. This seciton creates a bunch of plots to explore and understand the structure of the data;

Histogram of FARE

  hist(airfares$FARE)

Observation 1

  • The Histogram of the response variable FARE resembles a RIGHT SKEWED distribution.

Scatter Plots between FARE & Other Numeric Predictors

My Figure

My Figure

Observation 2

  • Predictors DISTANCE & COUPON exhibits a linear relationsship with response variable. All other columns does not exhibit a good linear relationship

Checking for Multicollinearity

This step will help us identify predictors that are strongly linearly related to other predictor variables in the dataset. Including strongly related variablesin the model will effect the Reliability and Accurancy of the model. SO, it is always recommended to removed such predictors from the model as anyways the explanatory power about response variable in already included in the other predictors.

  airfares_x = subset(airfares, select = -c(FARE))
  numericData <- airfares_x[sapply(airfares_x, is.numeric)]
  
  #Calculating Correlation
  descrCor <- cor(numericData)
  highlyCorrelated <- findCorrelation(descrCor, cutoff=0.6)
  highlyCorCol <- colnames(numericData)[highlyCorrelated]
  descrCor
##               COUPON         NEW          HI    S_INCOME   E_INCOME
## COUPON    1.00000000  0.02022307 -0.34725207 -0.08840265  0.0468892
## NEW       0.02022307  1.00000000  0.05414685  0.02659673  0.1133766
## HI       -0.34725207  0.05414685  1.00000000 -0.02738221  0.0823926
## S_INCOME -0.08840265  0.02659673 -0.02738221  1.00000000 -0.1388642
## E_INCOME  0.04688920  0.11337664  0.08239260 -0.13886420  1.0000000
## S_POP    -0.10776336 -0.01667212 -0.17249541  0.51718718 -0.1440586
## E_POP     0.09496994  0.05856818 -0.06245600 -0.27228027  0.4584181
## DISTANCE  0.74680521  0.08096520 -0.31237457  0.02815334  0.1765307
## PAX      -0.33697358  0.01049527 -0.16896078  0.13819710  0.2599611
##                S_POP       E_POP    DISTANCE         PAX
## COUPON   -0.10776336  0.09496994  0.74680521 -0.33697358
## NEW      -0.01667212  0.05856818  0.08096520  0.01049527
## HI       -0.17249541 -0.06245600 -0.31237457 -0.16896078
## S_INCOME  0.51718718 -0.27228027  0.02815334  0.13819710
## E_INCOME -0.14405857  0.45841806  0.17653074  0.25996105
## S_POP     1.00000000 -0.28014283  0.01843667  0.28461056
## E_POP    -0.28014283  1.00000000  0.11563970  0.31469750
## DISTANCE  0.01843667  0.11563970  1.00000000 -0.10248160
## PAX       0.28461056  0.31469750 -0.10248160  1.00000000
  highlyCorCol
## [1] "COUPON"

Observation 3

  • At 0.6 pair-wise cutoff COUPON is highly correlated with DISTANCE predictor with 74.68% correlation coefficient. Now, we can use this column or remove while building the model

Split the data into Training & Testing

In this section, we will split the data into training (75%) and testing (25%) subsets;

  smp_size <- floor(0.75 * nrow(airfares))
  ## set the seed to make your partition reproductible
  set.seed(123)
  train_ind <- sample(seq_len(nrow(airfares)), size = smp_size)
  train <- airfares[train_ind, ]
  test <- airfares[-train_ind, ]

Fitting Full Model

Now, lets start simply by fitting a linear regression model with all the predictors (FULL Model) on our Training dataset;

  model_allfeatures <- lm(FARE~.,data = train)
  smr <- summary(model_allfeatures)
  paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste("         Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared:  0.898179          Adjusted R-squared:  0.860436"

Regression diagnostic plots

  #par(col.lab="maroon",col.main="blue")
   par(mfrow = c(1:2),col.lab="maroon")
   plot(model_allfeatures)

Interpretation & Deviations of Standard Regression Assumptions

  1. Residual vs Fitted: Under the standard assumptions of regression, this plot should be a random scatter of points (no pattern) and we shoud see a flat horizontal line.

Observation 4

Deviation: We see a ‘U’ shaped curved instead of a flat line.

  1. Normal Q-Q Plot This plot helps us evaluating one of the key assumptions of a least-squares regression is that the errors are normally distributed.Ideally, this plot should resemble a (nearly) straight line with intercept 0 and a slope of 1 and most if not all the points to lie on the dashed line.

Observation 5

Deviation: We can observe that especially at the ends, there are decent amount of points lie outside the dashed line, indicating that errors are not completed normal distributed.

Tuning the Model

After interpreting the observations

we can conclude that response variable FARE requires LOG transformation.

Log Transformation

  log_model_allfeatures <- lm(log(FARE)~.,data = train)
  smr <- summary(log_model_allfeatures)
  paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste("         Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared:  0.925601          Adjusted R-squared:  0.898022"

Observation 5

Multiple R-squared & Adjusted R-squared are increased by ~3%.

Regression diagnostic plots (with LOG Transformation)

  #par(col.lab="maroon",col.main="blue")
  par(mfrow = c(1:2),col.lab="maroon")
  plot(log_model_allfeatures)

Observation 6

Residual vs Fitted plot approaches a flat horizontal line.

Feature Pruning

Now, by making use of

I’m able to create a REDUCED model by eliminating below predictors; ####S_INCOME , E_INCOME , S_POP , E_POP , COUPON , SLOT , GATE , S_CODE , E_CODE , NEW by retaining the Multiple R-squared & Adjusted R-squared ~90%

   model_reducedfeatures_LogTrans <- lm(log(FARE)~.-(S_INCOME + E_INCOME +S_POP + E_POP + COUPON + SLOT + GATE + S_CODE + E_CODE + NEW),data = train)
   smr <- summary(model_reducedfeatures_LogTrans)
  paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste("         Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared:  0.921432          Adjusted R-squared:  0.895897"

Anova Test

In this section, we’ll test the last two models statistically using ANOVA. This test has a NULL hypothesis “No Significant Difference in SSE of FULL and REDUCED Models”

  anova(log_model_allfeatures,model_reducedfeatures_LogTrans)
## Analysis of Variance Table
## 
## Model 1: log(FARE) ~ S_CODE + S_CITY + E_CODE + E_CITY + COUPON + NEW + 
##     VACATION + SW + HI + S_INCOME + E_INCOME + S_POP + E_POP + 
##     SLOT + GATE + DISTANCE + PAX + HavingAtleastOneHighTraficCity
## Model 2: log(FARE) ~ (S_CODE + S_CITY + E_CODE + E_CITY + COUPON + NEW + 
##     VACATION + SW + HI + S_INCOME + E_INCOME + S_POP + E_POP + 
##     SLOT + GATE + DISTANCE + PAX + HavingAtleastOneHighTraficCity) - 
##     (S_INCOME + E_INCOME + S_POP + E_POP + COUPON + SLOT + GATE + 
##         S_CODE + E_CODE + NEW)
##   Res.Df    RSS  Df Sum of Sq     F  Pr(>F)  
## 1    348 9.3484                              
## 2    360 9.8723 -12  -0.52384 1.625 0.08281 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We fail to reject the NULL hypothesis at significance level 10%. Though the RSS is slighthly incresed, but since we are able to reduce the model we can proceed with the reduced model as final model. However, this assumption is crossed validated on Test Data.

Predicting on Test Data

Once, we have decided on the model, we can then validate the model performance on the TEST data that we splitted early from the original dataset. Now the below code PREDICTS the FARE for the test data;

  #removing rows where we dont have those cities in train data. Othwerwise this is throwing errors
 testWithOUtErros <-  test[!(test$S_CITY %in% c("Honolulu (Intl)     HI", "Nashville           TN", "Omaha               NE","Spokane             WA")), ]
 testWithOUtErros <-  testWithOUtErros[!(testWithOUtErros$E_CITY %in% c("El Paso             TX", "Little Rock         AR")), ]
   
  y_hat <- predict.lm(model_reducedfeatures_LogTrans, newdata=testWithOUtErros, se.fit=TRUE)$fit 
  y_hat <- as.vector(y_hat)
  dev <- log(testWithOUtErros$FARE)-(y_hat) 
  num<-sum(dev^2) 
  dev1<-log(testWithOUtErros$FARE)-mean(log(testWithOUtErros$FARE)) 
  den<-sum(dev1^2) 
  Predicted.Rsq<-1-(num/den) 
  Predicted.Rsq
## [1] 0.8209606

As, we can see we got 82% prediction accuracy on the test data!

Thanks,

SivaG