Predicting Airfares on new routes

Several new airports have opened in major cities, opening the market for new routes (a route refers to a pair of airports), and Southwest has not announced whether it will cover routes to/from these cities. In order to price flights on these routes, a major airline collected information on 638 air routes in the United States. Some factors are known about these new routes: the distance travelled, demographics of the city where the new airport is located, and whether this city is a vacation destination. Other factors are yet unknown(e.g., the number of passenges who will travel this route). A major unknown factor is whether Southwest or another discount airline will travel on these new routes. Southwest's strategy (point-to-point routes covering only major cities, use of secondary airports, standardized fleet, low fares) has been very different from the model followed by the older and bigger airlines (hub-and-spoke model extending to even smaller cities, presence in primary airports, variety in fleet, pursuit of high-end business travelers). The presence of discount airlines is therefore believed to reduce the fares greatly.

The file Airfares.xlsx contains real data that were collected for the third quarter of 1996. They consist of the predictors and responses. Note that some cities are served by more than one airport, and in those cases the aairports are distinguished by their three-letter code.

1. Data Insights and Metrics

Data Load

# Clear the workspace
rm(list=ls())
# libraries 
library("xlsx")
library("corrplot")
library("car")
library("MASS")
library("perturb")
library("VIF")
# Read the file and assign to an data frame
setwd('J:\\ISB Business Analytics\\Stastical Analysis\\SA2 Regression\\SA2 Assignment 1')

airport = read.xlsx("Airfares.xlsx", sheetName = "data")
#attach(airport)
# Dimensions of the dataset
colnames(airport)
##  [1] "S_CODE"   "S_CITY"   "E_CODE"   "E_CITY"   "COUPON"   "NEW"     
##  [7] "VACATION" "SW"       "HI"       "S_INCOME" "E_INCOME" "S_POP"   
## [13] "E_POP"    "SLOT"     "GATE"     "DISTANCE" "PAX"      "FARE"    
## [19] "NA."

Column names and Description

1.       S_CODE : starting airport’s code
2.       S_CITY : starting city
3.       E_CODE : ending airport’s code
4.       E_CITY : ending city
5.       COUPON : average number of coupons (a one-coupon flight is a non-stop flight, a two-coupon flight is a one stop flight, etc.) for that route
6.       NEW : number of new carriers entering that route between Q3-96 and Q2-97
7.       VACATION: whether a vacation route (Yes) or not (No); Florida and Las Vegas routes are generally considered vacation routes
8.       SW : whether Southwest Airlines serves that route (Yes) or not (No)
9.       HI : Herfindel Index - measure of market concentration
10.   S_INCOME: starting city’s average personal income
11.   E_INCOME: ending city’s average personal income
12.   S_POP : starting city’s population
13.   E_POP : ending city’s population
14.   SLOT : whether either endpoint airport is slot controlled or not; this is a measure of airport congestion
15.   GATE : whether either endpoint airport has gate constraints or not; this is another measure of airport congestion
16.   DISTANCE: distance between two endpoint airports in miles
17.   PAX : number of passengers on that route during period of data collection
18.   FARE : average fare on that route

Summary of the data

# Dimensions of the data
dim(airport)
## [1] 638  19
# sample data 
head(airport)
##   S_CODE                 S_CITY E_CODE                 E_CITY COUPON NEW
## 1      * Dallas/Fort Worth   TX      * Amarillo            TX   1.00   3
## 2      * Atlanta             GA      * Baltimore/Wash Intl MD   1.06   3
## 3      * Boston              MA      * Baltimore/Wash Intl MD   1.06   3
## 4    ORD Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 5    MDW Chicago             IL      * Baltimore/Wash Intl MD   1.06   3
## 6      * Cleveland           OH      * Baltimore/Wash Intl MD   1.01   3
##   VACATION  SW       HI S_INCOME E_INCOME   S_POP   E_POP       SLOT GATE
## 1       No Yes 5291.991    28637    21112 3036732  205711       Free Free
## 2       No  No 5419.161    26993    29838 3532657 7145897       Free Free
## 3       No  No 9185.283    30124    29838 5787293 7145897       Free Free
## 4       No Yes 2657.352    29260    29838 7830332 7145897 Controlled Free
## 5       No Yes 2657.352    29260    29838 7830332 7145897       Free Free
## 6       No Yes 3408.106    26046    29838 2230955 7145897       Free Free
##   DISTANCE   PAX   FARE NA.
## 1      312  7864  64.11  NA
## 2      576  8820 174.47  NA
## 3      364  6452 207.76  NA
## 4      612 25144  85.47  NA
## 5      612 25144  85.47  NA
## 6      309 13386  56.76  NA
#summary of the attributes of the data 
summary(airport)
##      S_CODE                       S_CITY        E_CODE   
##  *      :454   Chicago             IL: 90   *      :501  
##  MDW    : 45   New York/Newark     NY: 88   DCA    : 27  
##  ORD    : 45   Atlanta             GA: 41   IAD    : 27  
##  EWR    : 30   Dallas/Fort Worth   TX: 36   EWR    : 25  
##  JFK    : 29   Los Angeles         CA: 33   JFK    : 25  
##  LGA    : 29   Boston              MA: 31   LGA    : 25  
##  (Other):  6   (Other)               :319   (Other):  8  
##                     E_CITY        COUPON           NEW        VACATION 
##  New York/Newark     NY: 75   Min.   :1.000   Min.   :0.000   No :468  
##  Washington          DC: 54   1st Qu.:1.040   1st Qu.:3.000   Yes:170  
##  Phoenix             AZ: 25   Median :1.150   Median :3.000            
##  Baltimore/Wash Intl MD: 23   Mean   :1.202   Mean   :2.754            
##  Orlando             FL: 23   3rd Qu.:1.298   3rd Qu.:3.000            
##  San Francisco       CA: 21   Max.   :1.940   Max.   :3.000            
##  (Other)               :417                                            
##    SW            HI           S_INCOME        E_INCOME    
##  No :444   Min.   : 1230   Min.   :14600   Min.   :14600  
##  Yes:194   1st Qu.: 3090   1st Qu.:24706   1st Qu.:23903  
##            Median : 4208   Median :28637   Median :26409  
##            Mean   : 4442   Mean   :27760   Mean   :27664  
##            3rd Qu.: 5481   3rd Qu.:29694   3rd Qu.:31981  
##            Max.   :10000   Max.   :38813   Max.   :38813  
##                                                           
##      S_POP             E_POP                 SLOT              GATE    
##  Min.   :  29838   Min.   : 111745   Controlled:182   Constrained:124  
##  1st Qu.:1862106   1st Qu.:1228816   Free      :456   Free       :514  
##  Median :3532657   Median :2195215                                     
##  Mean   :4557004   Mean   :3194503                                     
##  3rd Qu.:7830332   3rd Qu.:4549784                                     
##  Max.   :9056076   Max.   :9056076                                     
##                                                                        
##     DISTANCE           PAX             FARE          NA.         
##  Min.   : 114.0   Min.   : 1504   Min.   : 42.47   Mode:logical  
##  1st Qu.: 455.0   1st Qu.: 5328   1st Qu.:106.29   NA's:638      
##  Median : 850.0   Median : 7792   Median :144.60                 
##  Mean   : 975.7   Mean   :12782   Mean   :160.88                 
##  3rd Qu.:1306.2   3rd Qu.:14090   3rd Qu.:209.35                 
##  Max.   :2764.0   Max.   :73892   Max.   :402.02                 
## 
from above summary Chicago is having more number of flights starting followed by New York/Newark and  New York/Newark is having more number of flights with endpoint. 

Creation of training and test data for model

The data is split into parts with 80% of data into training data and remaining 20% data into test data randomly. 
The below code depicts the process to split the data. 
# Select rows randomly
row.number    <- sample(1:nrow(airport), size=0.2*nrow(airport))

# test Data
Airport_test  <-  airport[row.number,]

# train Data
Airport_train <- airport[-row.number,]

Dimensions of training and test datasets

# Size of the testing set
dim(Airport_test)   
## [1] 127  19
# Size of the training set
dim(Airport_train)  
## [1] 511  19

Analysis on Training data

The below code depicts the process to show the data graphically.
# analysis on training data

attach(Airport_train)

par(mfrow=c(2, 2))
boxplot(S_INCOME, main="Starting City Avg Personal Income")
boxplot(S_POP, main="Starting City Population")
boxplot(E_INCOME, main="Ending City Avg Personal Income")
boxplot(E_POP, main="Ending City Population")

par(mfrow=c(2, 2))
boxplot(DISTANCE, main="Distance b/w two endpoints")
boxplot(HI, main="Market Concentration")
boxplot(PAX, main="No. of Passengers")
boxplot(COUPON, main="Avg no. of coupons")

par(mfrow=c(1, 2))
boxplot(NEW, main="No. of new Carriers")
boxplot(FARE, main="Fare")

From above graphical representation, it's clearly evident that Market Concentration(HI), Number of passengers on that route(PAX), average number of passengers(COUPON) and fare are skewed to right. Hence the logirithmic transformation can be applied on these variables.

Transformation of variables

#Transformation of variables
HI_Log       <- log10(HI)
PAX_Log      <- log10(PAX)
COUPON_Log   <- log10(COUPON)
FARE_Log     <- log10(FARE)

Box Plots Comparasion:

par(mfrow=c(1, 2))
boxplot(HI, main="HI")
boxplot(HI_Log, main="HI_Log")

par(mfrow=c(1, 2))
boxplot(PAX, main="PAX")
boxplot(PAX_Log, main="PAX_Log")

par(mfrow=c(1, 2))
boxplot(COUPON, main="COUPON")
boxplot(COUPON_Log, main="COUPON_Log")

par(mfrow=c(1, 2))
boxplot(FARE, main="FARE")
boxplot(FARE_Log , main="FARE_Log")

Initial Model

#Initial Model
Airport_train1 <- cbind(Airport_train,HI_Log,PAX_Log,COUPON_Log,FARE_Log)
attach(Airport_train1)
## The following objects are masked _by_ .GlobalEnv:
## 
##     COUPON_Log, FARE_Log, HI_Log, PAX_Log
## The following objects are masked from Airport_train:
## 
##     COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
##     HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
##     VACATION
airfareInitialModel <- lm(FARE_Log ~  COUPON_Log + NEW + VACATION + SW +HI_Log +S_INCOME+
                            E_INCOME+ S_POP+ E_POP+ SLOT + GATE + DISTANCE + PAX_Log)

Summary of Initial Model

#summary of the model
options(scipen=999)
summary(airfareInitialModel)
## 
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + VACATION + SW + HI_Log + 
##     S_INCOME + E_INCOME + S_POP + E_POP + SLOT + GATE + DISTANCE + 
##     PAX_Log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.268491 -0.063011 -0.002275  0.063858  0.245049 
## 
## Coefficients:
##                    Estimate      Std. Error t value             Pr(>|t|)
## (Intercept)  1.847695940213  0.160184254408  11.535 < 0.0000000000000002
## COUPON_Log   0.050554696965  0.133150612545   0.380               0.7043
## NEW         -0.002823634176  0.006042096403  -0.467               0.6405
## VACATIONYes -0.069636648392  0.011285823752  -6.170  0.00000000141515992
## SWYes       -0.145329850941  0.011453255748 -12.689 < 0.0000000000000002
## HI_Log       0.187139919636  0.031441612275   5.952  0.00000000500812979
## S_INCOME     0.000002378864  0.000001543714   1.541               0.1240
## E_INCOME     0.000002705500  0.000001152634   2.347               0.0193
## S_POP        0.000000014299  0.000000001952   7.325  0.00000000000097225
## E_POP        0.000000018296  0.000000002338   7.824  0.00000000000003092
## SLOTFree    -0.051319488062  0.012002006994  -4.276  0.00002283239218052
## GATEFree    -0.055839281883  0.012283009838  -4.546  0.00000687314860581
## DISTANCE     0.000185203401  0.000011767663  15.738 < 0.0000000000000002
## PAX_Log     -0.168218267891  0.020666735096  -8.140  0.00000000000000321
##                
## (Intercept) ***
## COUPON_Log     
## NEW            
## VACATIONYes ***
## SWYes       ***
## HI_Log      ***
## S_INCOME       
## E_INCOME    *  
## S_POP       ***
## E_POP       ***
## SLOTFree    ***
## GATEFree    ***
## DISTANCE    ***
## PAX_Log     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09689 on 497 degrees of freedom
## Multiple R-squared:  0.8071, Adjusted R-squared:  0.802 
## F-statistic: 159.9 on 13 and 497 DF,  p-value: < 0.00000000000000022

Residual Plots of Initial Model

#residual plots
residualPlots(airfareInitialModel)

##            Test stat Pr(>|t|)
## COUPON_Log    -7.149    0.000
## NEW            0.062    0.950
## VACATION          NA       NA
## SW                NA       NA
## HI_Log        -1.537    0.125
## S_INCOME      -3.605    0.000
## E_INCOME      -1.101    0.272
## S_POP         -5.760    0.000
## E_POP         -4.469    0.000
## SLOT              NA       NA
## GATE              NA       NA
## DISTANCE      -7.306    0.000
## PAX_Log        1.103    0.271
## Tukey test    -3.990    0.000
From above residual plots, aplly the quadratic transforamtion on S_POP, E_POP and DISTANCE.

Q-Q Plot of Residuals

#residual plots
qqPlot(airfareInitialModel, main="QQ Plot of residuals: Model_1")

2. Model Iteration 2

Transformation of variables

#Transformation of variables
DISTANCE_SQRT <- sqrt(DISTANCE)
S_POP_SQRT <- sqrt(S_POP)
E_POP_SQRT <- sqrt(E_POP)
E_INCOME_SQRT <- sqrt(E_INCOME)

Box Plots Comparasion:

par(mfrow=c(1, 2))
boxplot(S_POP, main="S_POP")
boxplot(S_POP_SQRT, main="S_POP_SQRT")

par(mfrow=c(1, 2))
boxplot(E_POP, main="E_POP")
boxplot(E_POP_SQRT, main="E_POP_SQRT")

par(mfrow=c(1, 2))
boxplot(DISTANCE, main="DISTANCE")
boxplot(DISTANCE_SQRT, main="DISTANCE_SQRT")

Model 2

#Iterative Model 2
Airport_train2 <-  cbind(Airport_train1,DISTANCE_SQRT,S_POP_SQRT,E_POP_SQRT)
attach(Airport_train2)
## The following objects are masked _by_ .GlobalEnv:
## 
##     COUPON_Log, DISTANCE_SQRT, E_POP_SQRT, FARE_Log, HI_Log,
##     PAX_Log, S_POP_SQRT
## The following objects are masked from Airport_train1:
## 
##     COUPON, COUPON_Log, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP,
##     FARE, FARE_Log, GATE, HI, HI_Log, NA., NEW, PAX, PAX_Log,
##     S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW, VACATION
## The following objects are masked from Airport_train:
## 
##     COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
##     HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
##     VACATION
airfareInitialModel2 <- lm(FARE_Log ~  COUPON_Log + NEW +SW + VACATION + HI_Log +S_INCOME+
                            E_INCOME+ S_POP_SQRT+ E_POP_SQRT+ SLOT + GATE + DISTANCE_SQRT + PAX_Log)

Summary of Model 2

#summary of the model 2
options(scipen=999)
summary(airfareInitialModel2)
## 
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + 
##     S_INCOME + E_INCOME + S_POP_SQRT + E_POP_SQRT + SLOT + GATE + 
##     DISTANCE_SQRT + PAX_Log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.251808 -0.057595  0.003556  0.063584  0.235500 
## 
## Coefficients:
##                   Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept)    1.761738674  0.146646499  12.014 < 0.0000000000000002 ***
## COUPON_Log    -0.163641659  0.125492150  -1.304              0.19284    
## NEW           -0.006934307  0.005580665  -1.243              0.21462    
## SWYes         -0.142146643  0.010549820 -13.474 < 0.0000000000000002 ***
## VACATIONYes   -0.073116267  0.010476098  -6.979     0.00000000000953 ***
## HI_Log         0.151202898  0.028772741   5.255     0.00000021999304 ***
## S_INCOME       0.000002745  0.000001428   1.922              0.05519 .  
## E_INCOME       0.000002861  0.000001060   2.700              0.00718 ** 
## S_POP_SQRT     0.000062078  0.000007043   8.814 < 0.0000000000000002 ***
## E_POP_SQRT     0.000074601  0.000007914   9.427 < 0.0000000000000002 ***
## SLOTFree      -0.053162064  0.010883387  -4.885     0.00000139911626 ***
## GATEFree      -0.054000185  0.011291855  -4.782     0.00000228792506 ***
## DISTANCE_SQRT  0.012887430  0.000707105  18.226 < 0.0000000000000002 ***
## PAX_Log       -0.191783611  0.019208720  -9.984 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08935 on 497 degrees of freedom
## Multiple R-squared:  0.8359, Adjusted R-squared:  0.8316 
## F-statistic: 194.7 on 13 and 497 DF,  p-value: < 0.00000000000000022

Residual Plots of Model 2

#residual plots for model 2
residualPlots(airfareInitialModel2)

##               Test stat Pr(>|t|)
## COUPON_Log       -4.169    0.000
## NEW               0.167    0.867
## SW                   NA       NA
## VACATION             NA       NA
## HI_Log           -0.922    0.357
## S_INCOME         -2.305    0.022
## E_INCOME         -1.158    0.248
## S_POP_SQRT       -1.683    0.093
## E_POP_SQRT       -2.588    0.010
## SLOT                 NA       NA
## GATE                 NA       NA
## DISTANCE_SQRT    -2.448    0.015
## PAX_Log           1.997    0.046
## Tukey test       -0.994    0.320

Q-Q Plot of Residuals

#residual plots
qqPlot(airfareInitialModel2, main="QQ Plot of residuals: Model_2")

3. Model Iteration 3

Transformation of variables

#Transformation of variables
LOG_E_POP_SQRT    <- log10(E_POP_SQRT)

S_INCOME_SQR      <- (S_INCOME)^2
LOG_S_INCOME_SQR  <- log10(S_INCOME_SQR)

E_INCOME_SQR      <- (E_INCOME)^2 
LOG_E_INCOME_SQR  <- log10(E_INCOME_SQR)

Box Plots Comparasion:

par(mfrow=c(1, 2))
boxplot(E_POP_SQRT, main="E_POP_SQRT")
boxplot(LOG_E_POP_SQRT, main="LOG_E_POP_SQRT")

par(mfrow=c(1, 2))
boxplot(S_INCOME, main="S_INCOME")
boxplot(LOG_S_INCOME_SQR, main="LOG_S_INCOME_SQR")

par(mfrow=c(1, 2))
boxplot(E_INCOME, main="E_INCOME")
boxplot(LOG_E_INCOME_SQR, main="LOG_E_INCOME_SQR")

Model 3

#Iterative Model 2

Airport_train3 <-  cbind(Airport_train2,LOG_E_POP_SQRT,S_INCOME_SQR,LOG_S_INCOME_SQR,
                         E_INCOME_SQR,LOG_E_INCOME_SQR)
attach(Airport_train3)
## The following objects are masked _by_ .GlobalEnv:
## 
##     COUPON_Log, DISTANCE_SQRT, E_INCOME_SQR, E_POP_SQRT, FARE_Log,
##     HI_Log, LOG_E_INCOME_SQR, LOG_E_POP_SQRT, LOG_S_INCOME_SQR,
##     PAX_Log, S_INCOME_SQR, S_POP_SQRT
## The following objects are masked from Airport_train2:
## 
##     COUPON, COUPON_Log, DISTANCE, DISTANCE_SQRT, E_CITY, E_CODE,
##     E_INCOME, E_POP, E_POP_SQRT, FARE, FARE_Log, GATE, HI, HI_Log,
##     NA., NEW, PAX, PAX_Log, S_CITY, S_CODE, S_INCOME, S_POP,
##     S_POP_SQRT, SLOT, SW, VACATION
## The following objects are masked from Airport_train1:
## 
##     COUPON, COUPON_Log, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP,
##     FARE, FARE_Log, GATE, HI, HI_Log, NA., NEW, PAX, PAX_Log,
##     S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW, VACATION
## The following objects are masked from Airport_train:
## 
##     COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
##     HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
##     VACATION
airfareInitialModel3 <- lm(FARE_Log ~  COUPON_Log + NEW +SW + VACATION + HI_Log +LOG_S_INCOME_SQR+
                             LOG_E_INCOME_SQR+ S_POP_SQRT+ LOG_E_POP_SQRT+ SLOT + GATE + DISTANCE_SQRT + PAX_Log)

Summary of Model 3

#summary of the model 3
options(scipen=999)
summary(airfareInitialModel3)
## 
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + 
##     LOG_S_INCOME_SQR + LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + 
##     SLOT + GATE + DISTANCE_SQRT + PAX_Log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.248014 -0.057798  0.003629  0.061475  0.227275 
## 
## Coefficients:
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)      -0.44688974  0.54924198  -0.814              0.41624    
## COUPON_Log       -0.14988757  0.12503703  -1.199              0.23120    
## NEW              -0.00689033  0.00556401  -1.238              0.21616    
## SWYes            -0.14415499  0.01043857 -13.810 < 0.0000000000000002 ***
## VACATIONYes      -0.07716396  0.01026424  -7.518  0.00000000000026199 ***
## HI_Log            0.13840821  0.02866601   4.828  0.00000183598158633 ***
## LOG_S_INCOME_SQR  0.08950873  0.04485762   1.995              0.04655 *  
## LOG_E_INCOME_SQR  0.11023745  0.03409547   3.233              0.00131 ** 
## S_POP_SQRT        0.00005586  0.00000687   8.131  0.00000000000000342 ***
## LOG_E_POP_SQRT    0.23981358  0.02533558   9.465 < 0.0000000000000002 ***
## SLOTFree         -0.05926245  0.01069567  -5.541  0.00000004890652654 ***
## GATEFree         -0.05833226  0.01115756  -5.228  0.00000025274801071 ***
## DISTANCE_SQRT     0.01276965  0.00070484  18.117 < 0.0000000000000002 ***
## PAX_Log          -0.18564038  0.01878021  -9.885 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08908 on 497 degrees of freedom
## Multiple R-squared:  0.8369, Adjusted R-squared:  0.8326 
## F-statistic: 196.2 on 13 and 497 DF,  p-value: < 0.00000000000000022

Residual Plots of Model 3

#residual plots for model 3
residualPlots(airfareInitialModel3)

##                  Test stat Pr(>|t|)
## COUPON_Log          -4.160    0.000
## NEW                  0.141    0.888
## SW                      NA       NA
## VACATION                NA       NA
## HI_Log              -1.168    0.243
## LOG_S_INCOME_SQR    -1.621    0.106
## LOG_E_INCOME_SQR    -1.230    0.219
## S_POP_SQRT          -1.671    0.095
## LOG_E_POP_SQRT       2.101    0.036
## SLOT                    NA       NA
## GATE                    NA       NA
## DISTANCE_SQRT       -2.501    0.013
## PAX_Log              2.301    0.022
## Tukey test          -0.737    0.461

Q-Q Plot of Residuals

#residual plots
qqPlot(airfareInitialModel3, main="QQ Plot of residuals: Model_3")

Distribution of studentized residuals

# distribution of studentized residuals
stu.resid1 <- studres(airfareInitialModel3)
hist(stu.resid1, freq=FALSE,
     main="Distribution of Studentized Residuals:Model_3")
xfit1<-seq(min(stu.resid1),max(stu.resid1),length=40)
yfit1<-dnorm(xfit1)
lines(xfit1, yfit1)

From above plot, it can be infered that there is no problem.

Best Subset selection

step <- stepAIC(airfareInitialModel3, direction="both")
## Start:  AIC=-2457.61
## FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR + 
##     LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + 
##     DISTANCE_SQRT + PAX_Log
## 
##                    Df Sum of Sq    RSS     AIC
## - COUPON_Log        1   0.01140 3.9553 -2458.1
## - NEW               1   0.01217 3.9560 -2458.0
## <none>                          3.9439 -2457.6
## - LOG_S_INCOME_SQR  1   0.03160 3.9754 -2455.5
## - LOG_E_INCOME_SQR  1   0.08295 4.0268 -2449.0
## - HI_Log            1   0.18499 4.1288 -2436.2
## - GATE              1   0.21689 4.1607 -2432.3
## - SLOT              1   0.24362 4.1875 -2429.0
## - VACATION          1   0.44848 4.3923 -2404.6
## - S_POP_SQRT        1   0.52460 4.4685 -2395.8
## - LOG_E_POP_SQRT    1   0.71097 4.6548 -2374.9
## - PAX_Log           1   0.77537 4.7192 -2367.9
## - SW                1   1.51336 5.4572 -2293.7
## - DISTANCE_SQRT     1   2.60457 6.5484 -2200.5
## 
## Step:  AIC=-2458.14
## FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR + 
##     LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + 
##     DISTANCE_SQRT + PAX_Log
## 
##                    Df Sum of Sq    RSS     AIC
## - NEW               1    0.0107 3.9660 -2458.8
## <none>                          3.9553 -2458.1
## + COUPON_Log        1    0.0114 3.9439 -2457.6
## - LOG_S_INCOME_SQR  1    0.0344 3.9897 -2455.7
## - LOG_E_INCOME_SQR  1    0.0850 4.0402 -2449.3
## - GATE              1    0.2141 4.1694 -2433.2
## - SLOT              1    0.2375 4.1928 -2430.3
## - HI_Log            1    0.2510 4.2063 -2428.7
## - VACATION          1    0.4397 4.3950 -2406.3
## - S_POP_SQRT        1    0.5345 4.4898 -2395.4
## - LOG_E_POP_SQRT    1    0.6996 4.6549 -2376.9
## - PAX_Log           1    0.9618 4.9170 -2348.9
## - SW                1    1.5030 5.4582 -2295.6
## - DISTANCE_SQRT     1    5.4389 9.3941 -2018.1
## 
## Step:  AIC=-2458.75
## FARE_Log ~ SW + VACATION + HI_Log + LOG_S_INCOME_SQR + LOG_E_INCOME_SQR + 
##     S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + DISTANCE_SQRT + 
##     PAX_Log
## 
##                    Df Sum of Sq    RSS     AIC
## <none>                          3.9660 -2458.8
## + NEW               1    0.0107 3.9553 -2458.1
## + COUPON_Log        1    0.0100 3.9560 -2458.0
## - LOG_S_INCOME_SQR  1    0.0336 3.9996 -2456.4
## - LOG_E_INCOME_SQR  1    0.0798 4.0458 -2450.6
## - GATE              1    0.2127 4.1787 -2434.1
## - SLOT              1    0.2318 4.1978 -2431.7
## - HI_Log            1    0.2470 4.2130 -2429.9
## - VACATION          1    0.4413 4.4073 -2406.8
## - S_POP_SQRT        1    0.5406 4.5066 -2395.4
## - LOG_E_POP_SQRT    1    0.6974 4.6634 -2378.0
## - PAX_Log           1    0.9580 4.9240 -2350.2
## - SW                1    1.4974 5.4634 -2297.1
## - DISTANCE_SQRT     1    5.4288 9.3948 -2020.1
step$anova # display results
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR + 
##     LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + 
##     DISTANCE_SQRT + PAX_Log
## 
## Final Model:
## FARE_Log ~ SW + VACATION + HI_Log + LOG_S_INCOME_SQR + LOG_E_INCOME_SQR + 
##     S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + DISTANCE_SQRT + 
##     PAX_Log
## 
## 
##           Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                  497   3.943852 -2457.612
## 2 - COUPON_Log  1 0.01140297       498   3.955255 -2458.137
## 3        - NEW  1 0.01073811       499   3.965993 -2458.751
From above only COUPON_Log is dropped.    

3. Best Model

Best model summary

#Transformation of variables

Bestmodel_Airfares <- lm(FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR + 
  LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + 
  DISTANCE_SQRT + PAX_Log)

summary(Bestmodel_Airfares)
## 
## Call:
## lm(formula = FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR + 
##     LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + 
##     DISTANCE_SQRT + PAX_Log)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.247430 -0.060890  0.003952  0.061548  0.221853 
## 
## Coefficients:
##                      Estimate   Std. Error t value             Pr(>|t|)
## (Intercept)      -0.568713907  0.539994882  -1.053              0.29277
## NEW              -0.006458914  0.005554798  -1.163              0.24548
## SWYes            -0.143394154  0.010423829 -13.756 < 0.0000000000000002
## VACATIONYes      -0.076144879  0.010233462  -7.441    0.000000000000442
## HI_Log            0.150647603  0.026797804   5.622    0.000000031546438
## LOG_S_INCOME_SQR  0.093235070  0.044769417   2.083              0.03780
## LOG_E_INCOME_SQR  0.111518623  0.034093661   3.271              0.00115
## S_POP_SQRT        0.000056302  0.000006863   8.204    0.000000000000002
## LOG_E_POP_SQRT    0.236253991  0.025172004   9.386 < 0.0000000000000002
## SLOTFree         -0.058380974  0.010675045  -5.469    0.000000071769817
## GATEFree         -0.057934051  0.011157502  -5.192    0.000000303055187
## DISTANCE_SQRT     0.012133047  0.000463647  26.169 < 0.0000000000000002
## PAX_Log          -0.173375184  0.015755202 -11.004 < 0.0000000000000002
##                     
## (Intercept)         
## NEW                 
## SWYes            ***
## VACATIONYes      ***
## HI_Log           ***
## LOG_S_INCOME_SQR *  
## LOG_E_INCOME_SQR ** 
## S_POP_SQRT       ***
## LOG_E_POP_SQRT   ***
## SLOTFree         ***
## GATEFree         ***
## DISTANCE_SQRT    ***
## PAX_Log          ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08912 on 498 degrees of freedom
## Multiple R-squared:  0.8364, Adjusted R-squared:  0.8325 
## F-statistic: 212.2 on 12 and 498 DF,  p-value: < 0.00000000000000022

Residual Plots of Best Model

#residual plots for Best Model 
residualPlots(Bestmodel_Airfares)

##                  Test stat Pr(>|t|)
## NEW                  0.177    0.860
## SW                      NA       NA
## VACATION                NA       NA
## HI_Log              -1.386    0.166
## LOG_S_INCOME_SQR    -1.661    0.097
## LOG_E_INCOME_SQR    -1.248    0.213
## S_POP_SQRT          -1.781    0.076
## LOG_E_POP_SQRT       2.029    0.043
## SLOT                    NA       NA
## GATE                    NA       NA
## DISTANCE_SQRT       -2.532    0.012
## PAX_Log              2.161    0.031
## Tukey test          -0.965    0.335

Q-Q Plot of Residuals of Best Model

#residual plots
qqPlot(Bestmodel_Airfares, main="QQ Plot of residuals: Best Model ")

Distribution of studentized residuals for Best Model

# distribution of studentized residuals
stu.resid1 <- studres(Bestmodel_Airfares)
hist(stu.resid1, freq=FALSE,
     main="Distribution of Studentized Residuals:Best Model ")
xfit1<-seq(min(stu.resid1),max(stu.resid1),length=40)
yfit1<-dnorm(xfit1)
lines(xfit1, yfit1)

3. Validation with Test Data

Validation of test data with test data created at initial stage of data split.
attach(Airport_test)

HI_Log    <-log10(Airport_test$HI)
PAX_Log   <- log10(Airport_test$PAX)
FARE_Log    <-log10(Airport_test$FARE)
DISTANCE_SQRT <- sqrt(Airport_test$DISTANCE)
S_INCOME_SQR <- (Airport_test$S_INCOME)^2
LOG_S_INCOME_SQR <- log10(S_INCOME_SQR)
E_INCOME_SQR <- (Airport_test$E_INCOME)^2 
LOG_E_INCOME_SQR <- log10(E_INCOME_SQR)
S_POP_SQRT <- sqrt(Airport_test$S_POP)
E_POP_SQRT <- sqrt(Airport_test$E_POP)
LOG_E_POP_SQRT  <- log10(E_POP_SQRT)

Airport_test1 <- cbind(Airport_test,HI_Log,PAX_Log,FARE_Log,DISTANCE_SQRT,S_INCOME_SQR,LOG_S_INCOME_SQR,
                      E_INCOME_SQR,LOG_E_INCOME_SQR,S_POP_SQRT,E_POP_SQRT,LOG_E_POP_SQRT)
attach(Airport_test1)

validatemodel_Airfares <- lm(Airport_test1$FARE_Log ~ Airport_test1$NEW + Airport_test1$SW + 
                               Airport_test1$VACATION + Airport_test1$HI_Log + Airport_test1$LOG_S_INCOME_SQR + 
                               Airport_test1$LOG_E_INCOME_SQR + Airport_test1$S_POP_SQRT + 
                               Airport_test1$LOG_E_POP_SQRT + Airport_test1$SLOT + Airport_test1$GATE + 
                               Airport_test1$DISTANCE_SQRT + Airport_test1$PAX_Log)
summary(validatemodel_Airfares)
## 
## Call:
## lm(formula = Airport_test1$FARE_Log ~ Airport_test1$NEW + Airport_test1$SW + 
##     Airport_test1$VACATION + Airport_test1$HI_Log + Airport_test1$LOG_S_INCOME_SQR + 
##     Airport_test1$LOG_E_INCOME_SQR + Airport_test1$S_POP_SQRT + 
##     Airport_test1$LOG_E_POP_SQRT + Airport_test1$SLOT + Airport_test1$GATE + 
##     Airport_test1$DISTANCE_SQRT + Airport_test1$PAX_Log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17399 -0.06492  0.00429  0.06272  0.17000 
## 
## Coefficients:
##                                   Estimate  Std. Error t value
## (Intercept)                    -3.96176335  1.17714235  -3.366
## Airport_test1$NEW              -0.01614090  0.00916214  -1.762
## Airport_test1$SWYes            -0.14238075  0.02212986  -6.434
## Airport_test1$VACATIONYes      -0.08017008  0.01936002  -4.141
## Airport_test1$HI_Log            0.13761262  0.05386018   2.555
## Airport_test1$LOG_S_INCOME_SQR  0.47778909  0.11206396   4.264
## Airport_test1$LOG_E_INCOME_SQR  0.11280092  0.06491233   1.738
## Airport_test1$S_POP_SQRT        0.00001316  0.00001542   0.853
## Airport_test1$LOG_E_POP_SQRT    0.25565703  0.05002118   5.111
## Airport_test1$SLOTFree         -0.03949117  0.01897550  -2.081
## Airport_test1$GATEFree         -0.06867279  0.02120831  -3.238
## Airport_test1$DISTANCE_SQRT     0.01162326  0.00093722  12.402
## Airport_test1$PAX_Log          -0.15183605  0.03119038  -4.868
##                                            Pr(>|t|)    
## (Intercept)                                 0.00104 ** 
## Airport_test1$NEW                           0.08080 .  
## Airport_test1$SWYes                   0.00000000303 ***
## Airport_test1$VACATIONYes             0.00006658542 ***
## Airport_test1$HI_Log                        0.01194 *  
## Airport_test1$LOG_S_INCOME_SQR        0.00004167980 ***
## Airport_test1$LOG_E_INCOME_SQR              0.08496 .  
## Airport_test1$S_POP_SQRT                    0.39538    
## Airport_test1$LOG_E_POP_SQRT          0.00000130298 ***
## Airport_test1$SLOTFree                      0.03966 *  
## Airport_test1$GATEFree                      0.00158 ** 
## Airport_test1$DISTANCE_SQRT    < 0.0000000000000002 ***
## Airport_test1$PAX_Log                 0.00000365521 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0835 on 114 degrees of freedom
## Multiple R-squared:  0.8642, Adjusted R-squared:  0.8499 
## F-statistic: 60.46 on 12 and 114 DF,  p-value: < 0.00000000000000022
y_hat<-predict.lm(validatemodel_Airfares,newdata=Airport_test1, se.fit=TRUE)$fit
y_hat<-as.vector(y_hat)
dev<-log(FARE)-(y_hat)
num<-sum(dev^2)
dev1<-log(FARE)-mean(log(FARE))
den<-sum(dev1^2)
Predicted.Rsq<- 1- (num/den)
As per the above statistcs the adjusted r-square is same and the predicted value is same.