Several new airports have opened in major cities, opening the market for new routes (a route refers to a pair of airports), and Southwest has not announced whether it will cover routes to/from these cities. In order to price flights on these routes, a major airline collected information on 638 air routes in the United States. Some factors are known about these new routes: the distance travelled, demographics of the city where the new airport is located, and whether this city is a vacation destination. Other factors are yet unknown(e.g., the number of passenges who will travel this route). A major unknown factor is whether Southwest or another discount airline will travel on these new routes. Southwest's strategy (point-to-point routes covering only major cities, use of secondary airports, standardized fleet, low fares) has been very different from the model followed by the older and bigger airlines (hub-and-spoke model extending to even smaller cities, presence in primary airports, variety in fleet, pursuit of high-end business travelers). The presence of discount airlines is therefore believed to reduce the fares greatly.
The file Airfares.xlsx contains real data that were collected for the third quarter of 1996. They consist of the predictors and responses. Note that some cities are served by more than one airport, and in those cases the aairports are distinguished by their three-letter code.
# Clear the workspace
rm(list=ls())
# libraries
library("xlsx")
library("corrplot")
library("car")
library("MASS")
library("perturb")
library("VIF")
# Read the file and assign to an data frame
setwd('J:\\ISB Business Analytics\\Stastical Analysis\\SA2 Regression\\SA2 Assignment 1')
airport = read.xlsx("Airfares.xlsx", sheetName = "data")
#attach(airport)
# Dimensions of the dataset
colnames(airport)
## [1] "S_CODE" "S_CITY" "E_CODE" "E_CITY" "COUPON" "NEW"
## [7] "VACATION" "SW" "HI" "S_INCOME" "E_INCOME" "S_POP"
## [13] "E_POP" "SLOT" "GATE" "DISTANCE" "PAX" "FARE"
## [19] "NA."
1.      S_CODE : starting airport’s code
2.      S_CITY : starting city
3.      E_CODE : ending airport’s code
4.      E_CITY : ending city
5.      COUPON : average number of coupons (a one-coupon flight is a non-stop flight, a two-coupon flight is a one stop flight, etc.) for that route
6.      NEW : number of new carriers entering that route between Q3-96 and Q2-97
7.      VACATION: whether a vacation route (Yes) or not (No); Florida and Las Vegas routes are generally considered vacation routes
8.      SW : whether Southwest Airlines serves that route (Yes) or not (No)
9.      HI : Herfindel Index - measure of market concentration
10.  S_INCOME: starting city’s average personal income
11.  E_INCOME: ending city’s average personal income
12.  S_POP : starting city’s population
13.  E_POP : ending city’s population
14.  SLOT : whether either endpoint airport is slot controlled or not; this is a measure of airport congestion
15.  GATE : whether either endpoint airport has gate constraints or not; this is another measure of airport congestion
16.  DISTANCE: distance between two endpoint airports in miles
17.  PAX : number of passengers on that route during period of data collection
18.  FARE : average fare on that route
# Dimensions of the data
dim(airport)
## [1] 638 19
# sample data
head(airport)
## S_CODE S_CITY E_CODE E_CITY COUPON NEW
## 1 * Dallas/Fort Worth TX * Amarillo TX 1.00 3
## 2 * Atlanta GA * Baltimore/Wash Intl MD 1.06 3
## 3 * Boston MA * Baltimore/Wash Intl MD 1.06 3
## 4 ORD Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 5 MDW Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 6 * Cleveland OH * Baltimore/Wash Intl MD 1.01 3
## VACATION SW HI S_INCOME E_INCOME S_POP E_POP SLOT GATE
## 1 No Yes 5291.991 28637 21112 3036732 205711 Free Free
## 2 No No 5419.161 26993 29838 3532657 7145897 Free Free
## 3 No No 9185.283 30124 29838 5787293 7145897 Free Free
## 4 No Yes 2657.352 29260 29838 7830332 7145897 Controlled Free
## 5 No Yes 2657.352 29260 29838 7830332 7145897 Free Free
## 6 No Yes 3408.106 26046 29838 2230955 7145897 Free Free
## DISTANCE PAX FARE NA.
## 1 312 7864 64.11 NA
## 2 576 8820 174.47 NA
## 3 364 6452 207.76 NA
## 4 612 25144 85.47 NA
## 5 612 25144 85.47 NA
## 6 309 13386 56.76 NA
#summary of the attributes of the data
summary(airport)
## S_CODE S_CITY E_CODE
## * :454 Chicago IL: 90 * :501
## MDW : 45 New York/Newark NY: 88 DCA : 27
## ORD : 45 Atlanta GA: 41 IAD : 27
## EWR : 30 Dallas/Fort Worth TX: 36 EWR : 25
## JFK : 29 Los Angeles CA: 33 JFK : 25
## LGA : 29 Boston MA: 31 LGA : 25
## (Other): 6 (Other) :319 (Other): 8
## E_CITY COUPON NEW VACATION
## New York/Newark NY: 75 Min. :1.000 Min. :0.000 No :468
## Washington DC: 54 1st Qu.:1.040 1st Qu.:3.000 Yes:170
## Phoenix AZ: 25 Median :1.150 Median :3.000
## Baltimore/Wash Intl MD: 23 Mean :1.202 Mean :2.754
## Orlando FL: 23 3rd Qu.:1.298 3rd Qu.:3.000
## San Francisco CA: 21 Max. :1.940 Max. :3.000
## (Other) :417
## SW HI S_INCOME E_INCOME
## No :444 Min. : 1230 Min. :14600 Min. :14600
## Yes:194 1st Qu.: 3090 1st Qu.:24706 1st Qu.:23903
## Median : 4208 Median :28637 Median :26409
## Mean : 4442 Mean :27760 Mean :27664
## 3rd Qu.: 5481 3rd Qu.:29694 3rd Qu.:31981
## Max. :10000 Max. :38813 Max. :38813
##
## S_POP E_POP SLOT GATE
## Min. : 29838 Min. : 111745 Controlled:182 Constrained:124
## 1st Qu.:1862106 1st Qu.:1228816 Free :456 Free :514
## Median :3532657 Median :2195215
## Mean :4557004 Mean :3194503
## 3rd Qu.:7830332 3rd Qu.:4549784
## Max. :9056076 Max. :9056076
##
## DISTANCE PAX FARE NA.
## Min. : 114.0 Min. : 1504 Min. : 42.47 Mode:logical
## 1st Qu.: 455.0 1st Qu.: 5328 1st Qu.:106.29 NA's:638
## Median : 850.0 Median : 7792 Median :144.60
## Mean : 975.7 Mean :12782 Mean :160.88
## 3rd Qu.:1306.2 3rd Qu.:14090 3rd Qu.:209.35
## Max. :2764.0 Max. :73892 Max. :402.02
##
from above summary Chicago is having more number of flights starting followed by New York/Newark and New York/Newark is having more number of flights with endpoint.
The data is split into parts with 80% of data into training data and remaining 20% data into test data randomly.
The below code depicts the process to split the data.
# Select rows randomly
row.number <- sample(1:nrow(airport), size=0.2*nrow(airport))
# test Data
Airport_test <- airport[row.number,]
# train Data
Airport_train <- airport[-row.number,]
# Size of the testing set
dim(Airport_test)
## [1] 127 19
# Size of the training set
dim(Airport_train)
## [1] 511 19
The below code depicts the process to show the data graphically.
# analysis on training data
attach(Airport_train)
par(mfrow=c(2, 2))
boxplot(S_INCOME, main="Starting City Avg Personal Income")
boxplot(S_POP, main="Starting City Population")
boxplot(E_INCOME, main="Ending City Avg Personal Income")
boxplot(E_POP, main="Ending City Population")
par(mfrow=c(2, 2))
boxplot(DISTANCE, main="Distance b/w two endpoints")
boxplot(HI, main="Market Concentration")
boxplot(PAX, main="No. of Passengers")
boxplot(COUPON, main="Avg no. of coupons")
par(mfrow=c(1, 2))
boxplot(NEW, main="No. of new Carriers")
boxplot(FARE, main="Fare")
From above graphical representation, it's clearly evident that Market Concentration(HI), Number of passengers on that route(PAX), average number of passengers(COUPON) and fare are skewed to right. Hence the logirithmic transformation can be applied on these variables.
#Transformation of variables
HI_Log <- log10(HI)
PAX_Log <- log10(PAX)
COUPON_Log <- log10(COUPON)
FARE_Log <- log10(FARE)
Box Plots Comparasion:
par(mfrow=c(1, 2))
boxplot(HI, main="HI")
boxplot(HI_Log, main="HI_Log")
par(mfrow=c(1, 2))
boxplot(PAX, main="PAX")
boxplot(PAX_Log, main="PAX_Log")
par(mfrow=c(1, 2))
boxplot(COUPON, main="COUPON")
boxplot(COUPON_Log, main="COUPON_Log")
par(mfrow=c(1, 2))
boxplot(FARE, main="FARE")
boxplot(FARE_Log , main="FARE_Log")
#Initial Model
Airport_train1 <- cbind(Airport_train,HI_Log,PAX_Log,COUPON_Log,FARE_Log)
attach(Airport_train1)
## The following objects are masked _by_ .GlobalEnv:
##
## COUPON_Log, FARE_Log, HI_Log, PAX_Log
## The following objects are masked from Airport_train:
##
## COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
## HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
## VACATION
airfareInitialModel <- lm(FARE_Log ~ COUPON_Log + NEW + VACATION + SW +HI_Log +S_INCOME+
E_INCOME+ S_POP+ E_POP+ SLOT + GATE + DISTANCE + PAX_Log)
#summary of the model
options(scipen=999)
summary(airfareInitialModel)
##
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + VACATION + SW + HI_Log +
## S_INCOME + E_INCOME + S_POP + E_POP + SLOT + GATE + DISTANCE +
## PAX_Log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.268491 -0.063011 -0.002275 0.063858 0.245049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.847695940213 0.160184254408 11.535 < 0.0000000000000002
## COUPON_Log 0.050554696965 0.133150612545 0.380 0.7043
## NEW -0.002823634176 0.006042096403 -0.467 0.6405
## VACATIONYes -0.069636648392 0.011285823752 -6.170 0.00000000141515992
## SWYes -0.145329850941 0.011453255748 -12.689 < 0.0000000000000002
## HI_Log 0.187139919636 0.031441612275 5.952 0.00000000500812979
## S_INCOME 0.000002378864 0.000001543714 1.541 0.1240
## E_INCOME 0.000002705500 0.000001152634 2.347 0.0193
## S_POP 0.000000014299 0.000000001952 7.325 0.00000000000097225
## E_POP 0.000000018296 0.000000002338 7.824 0.00000000000003092
## SLOTFree -0.051319488062 0.012002006994 -4.276 0.00002283239218052
## GATEFree -0.055839281883 0.012283009838 -4.546 0.00000687314860581
## DISTANCE 0.000185203401 0.000011767663 15.738 < 0.0000000000000002
## PAX_Log -0.168218267891 0.020666735096 -8.140 0.00000000000000321
##
## (Intercept) ***
## COUPON_Log
## NEW
## VACATIONYes ***
## SWYes ***
## HI_Log ***
## S_INCOME
## E_INCOME *
## S_POP ***
## E_POP ***
## SLOTFree ***
## GATEFree ***
## DISTANCE ***
## PAX_Log ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09689 on 497 degrees of freedom
## Multiple R-squared: 0.8071, Adjusted R-squared: 0.802
## F-statistic: 159.9 on 13 and 497 DF, p-value: < 0.00000000000000022
#residual plots
residualPlots(airfareInitialModel)
## Test stat Pr(>|t|)
## COUPON_Log -7.149 0.000
## NEW 0.062 0.950
## VACATION NA NA
## SW NA NA
## HI_Log -1.537 0.125
## S_INCOME -3.605 0.000
## E_INCOME -1.101 0.272
## S_POP -5.760 0.000
## E_POP -4.469 0.000
## SLOT NA NA
## GATE NA NA
## DISTANCE -7.306 0.000
## PAX_Log 1.103 0.271
## Tukey test -3.990 0.000
From above residual plots, aplly the quadratic transforamtion on S_POP, E_POP and DISTANCE.
#residual plots
qqPlot(airfareInitialModel, main="QQ Plot of residuals: Model_1")
#Transformation of variables
DISTANCE_SQRT <- sqrt(DISTANCE)
S_POP_SQRT <- sqrt(S_POP)
E_POP_SQRT <- sqrt(E_POP)
E_INCOME_SQRT <- sqrt(E_INCOME)
Box Plots Comparasion:
par(mfrow=c(1, 2))
boxplot(S_POP, main="S_POP")
boxplot(S_POP_SQRT, main="S_POP_SQRT")
par(mfrow=c(1, 2))
boxplot(E_POP, main="E_POP")
boxplot(E_POP_SQRT, main="E_POP_SQRT")
par(mfrow=c(1, 2))
boxplot(DISTANCE, main="DISTANCE")
boxplot(DISTANCE_SQRT, main="DISTANCE_SQRT")
#Iterative Model 2
Airport_train2 <- cbind(Airport_train1,DISTANCE_SQRT,S_POP_SQRT,E_POP_SQRT)
attach(Airport_train2)
## The following objects are masked _by_ .GlobalEnv:
##
## COUPON_Log, DISTANCE_SQRT, E_POP_SQRT, FARE_Log, HI_Log,
## PAX_Log, S_POP_SQRT
## The following objects are masked from Airport_train1:
##
## COUPON, COUPON_Log, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP,
## FARE, FARE_Log, GATE, HI, HI_Log, NA., NEW, PAX, PAX_Log,
## S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW, VACATION
## The following objects are masked from Airport_train:
##
## COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
## HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
## VACATION
airfareInitialModel2 <- lm(FARE_Log ~ COUPON_Log + NEW +SW + VACATION + HI_Log +S_INCOME+
E_INCOME+ S_POP_SQRT+ E_POP_SQRT+ SLOT + GATE + DISTANCE_SQRT + PAX_Log)
#summary of the model 2
options(scipen=999)
summary(airfareInitialModel2)
##
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log +
## S_INCOME + E_INCOME + S_POP_SQRT + E_POP_SQRT + SLOT + GATE +
## DISTANCE_SQRT + PAX_Log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.251808 -0.057595 0.003556 0.063584 0.235500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.761738674 0.146646499 12.014 < 0.0000000000000002 ***
## COUPON_Log -0.163641659 0.125492150 -1.304 0.19284
## NEW -0.006934307 0.005580665 -1.243 0.21462
## SWYes -0.142146643 0.010549820 -13.474 < 0.0000000000000002 ***
## VACATIONYes -0.073116267 0.010476098 -6.979 0.00000000000953 ***
## HI_Log 0.151202898 0.028772741 5.255 0.00000021999304 ***
## S_INCOME 0.000002745 0.000001428 1.922 0.05519 .
## E_INCOME 0.000002861 0.000001060 2.700 0.00718 **
## S_POP_SQRT 0.000062078 0.000007043 8.814 < 0.0000000000000002 ***
## E_POP_SQRT 0.000074601 0.000007914 9.427 < 0.0000000000000002 ***
## SLOTFree -0.053162064 0.010883387 -4.885 0.00000139911626 ***
## GATEFree -0.054000185 0.011291855 -4.782 0.00000228792506 ***
## DISTANCE_SQRT 0.012887430 0.000707105 18.226 < 0.0000000000000002 ***
## PAX_Log -0.191783611 0.019208720 -9.984 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08935 on 497 degrees of freedom
## Multiple R-squared: 0.8359, Adjusted R-squared: 0.8316
## F-statistic: 194.7 on 13 and 497 DF, p-value: < 0.00000000000000022
#residual plots for model 2
residualPlots(airfareInitialModel2)
## Test stat Pr(>|t|)
## COUPON_Log -4.169 0.000
## NEW 0.167 0.867
## SW NA NA
## VACATION NA NA
## HI_Log -0.922 0.357
## S_INCOME -2.305 0.022
## E_INCOME -1.158 0.248
## S_POP_SQRT -1.683 0.093
## E_POP_SQRT -2.588 0.010
## SLOT NA NA
## GATE NA NA
## DISTANCE_SQRT -2.448 0.015
## PAX_Log 1.997 0.046
## Tukey test -0.994 0.320
#residual plots
qqPlot(airfareInitialModel2, main="QQ Plot of residuals: Model_2")
#Transformation of variables
LOG_E_POP_SQRT <- log10(E_POP_SQRT)
S_INCOME_SQR <- (S_INCOME)^2
LOG_S_INCOME_SQR <- log10(S_INCOME_SQR)
E_INCOME_SQR <- (E_INCOME)^2
LOG_E_INCOME_SQR <- log10(E_INCOME_SQR)
Box Plots Comparasion:
par(mfrow=c(1, 2))
boxplot(E_POP_SQRT, main="E_POP_SQRT")
boxplot(LOG_E_POP_SQRT, main="LOG_E_POP_SQRT")
par(mfrow=c(1, 2))
boxplot(S_INCOME, main="S_INCOME")
boxplot(LOG_S_INCOME_SQR, main="LOG_S_INCOME_SQR")
par(mfrow=c(1, 2))
boxplot(E_INCOME, main="E_INCOME")
boxplot(LOG_E_INCOME_SQR, main="LOG_E_INCOME_SQR")
#Iterative Model 2
Airport_train3 <- cbind(Airport_train2,LOG_E_POP_SQRT,S_INCOME_SQR,LOG_S_INCOME_SQR,
E_INCOME_SQR,LOG_E_INCOME_SQR)
attach(Airport_train3)
## The following objects are masked _by_ .GlobalEnv:
##
## COUPON_Log, DISTANCE_SQRT, E_INCOME_SQR, E_POP_SQRT, FARE_Log,
## HI_Log, LOG_E_INCOME_SQR, LOG_E_POP_SQRT, LOG_S_INCOME_SQR,
## PAX_Log, S_INCOME_SQR, S_POP_SQRT
## The following objects are masked from Airport_train2:
##
## COUPON, COUPON_Log, DISTANCE, DISTANCE_SQRT, E_CITY, E_CODE,
## E_INCOME, E_POP, E_POP_SQRT, FARE, FARE_Log, GATE, HI, HI_Log,
## NA., NEW, PAX, PAX_Log, S_CITY, S_CODE, S_INCOME, S_POP,
## S_POP_SQRT, SLOT, SW, VACATION
## The following objects are masked from Airport_train1:
##
## COUPON, COUPON_Log, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP,
## FARE, FARE_Log, GATE, HI, HI_Log, NA., NEW, PAX, PAX_Log,
## S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW, VACATION
## The following objects are masked from Airport_train:
##
## COUPON, DISTANCE, E_CITY, E_CODE, E_INCOME, E_POP, FARE, GATE,
## HI, NA., NEW, PAX, S_CITY, S_CODE, S_INCOME, S_POP, SLOT, SW,
## VACATION
airfareInitialModel3 <- lm(FARE_Log ~ COUPON_Log + NEW +SW + VACATION + HI_Log +LOG_S_INCOME_SQR+
LOG_E_INCOME_SQR+ S_POP_SQRT+ LOG_E_POP_SQRT+ SLOT + GATE + DISTANCE_SQRT + PAX_Log)
#summary of the model 3
options(scipen=999)
summary(airfareInitialModel3)
##
## Call:
## lm(formula = FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log +
## LOG_S_INCOME_SQR + LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT +
## SLOT + GATE + DISTANCE_SQRT + PAX_Log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.248014 -0.057798 0.003629 0.061475 0.227275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.44688974 0.54924198 -0.814 0.41624
## COUPON_Log -0.14988757 0.12503703 -1.199 0.23120
## NEW -0.00689033 0.00556401 -1.238 0.21616
## SWYes -0.14415499 0.01043857 -13.810 < 0.0000000000000002 ***
## VACATIONYes -0.07716396 0.01026424 -7.518 0.00000000000026199 ***
## HI_Log 0.13840821 0.02866601 4.828 0.00000183598158633 ***
## LOG_S_INCOME_SQR 0.08950873 0.04485762 1.995 0.04655 *
## LOG_E_INCOME_SQR 0.11023745 0.03409547 3.233 0.00131 **
## S_POP_SQRT 0.00005586 0.00000687 8.131 0.00000000000000342 ***
## LOG_E_POP_SQRT 0.23981358 0.02533558 9.465 < 0.0000000000000002 ***
## SLOTFree -0.05926245 0.01069567 -5.541 0.00000004890652654 ***
## GATEFree -0.05833226 0.01115756 -5.228 0.00000025274801071 ***
## DISTANCE_SQRT 0.01276965 0.00070484 18.117 < 0.0000000000000002 ***
## PAX_Log -0.18564038 0.01878021 -9.885 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08908 on 497 degrees of freedom
## Multiple R-squared: 0.8369, Adjusted R-squared: 0.8326
## F-statistic: 196.2 on 13 and 497 DF, p-value: < 0.00000000000000022
#residual plots for model 3
residualPlots(airfareInitialModel3)
## Test stat Pr(>|t|)
## COUPON_Log -4.160 0.000
## NEW 0.141 0.888
## SW NA NA
## VACATION NA NA
## HI_Log -1.168 0.243
## LOG_S_INCOME_SQR -1.621 0.106
## LOG_E_INCOME_SQR -1.230 0.219
## S_POP_SQRT -1.671 0.095
## LOG_E_POP_SQRT 2.101 0.036
## SLOT NA NA
## GATE NA NA
## DISTANCE_SQRT -2.501 0.013
## PAX_Log 2.301 0.022
## Tukey test -0.737 0.461
#residual plots
qqPlot(airfareInitialModel3, main="QQ Plot of residuals: Model_3")
# distribution of studentized residuals
stu.resid1 <- studres(airfareInitialModel3)
hist(stu.resid1, freq=FALSE,
main="Distribution of Studentized Residuals:Model_3")
xfit1<-seq(min(stu.resid1),max(stu.resid1),length=40)
yfit1<-dnorm(xfit1)
lines(xfit1, yfit1)
From above plot, it can be infered that there is no problem.
step <- stepAIC(airfareInitialModel3, direction="both")
## Start: AIC=-2457.61
## FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR +
## LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE +
## DISTANCE_SQRT + PAX_Log
##
## Df Sum of Sq RSS AIC
## - COUPON_Log 1 0.01140 3.9553 -2458.1
## - NEW 1 0.01217 3.9560 -2458.0
## <none> 3.9439 -2457.6
## - LOG_S_INCOME_SQR 1 0.03160 3.9754 -2455.5
## - LOG_E_INCOME_SQR 1 0.08295 4.0268 -2449.0
## - HI_Log 1 0.18499 4.1288 -2436.2
## - GATE 1 0.21689 4.1607 -2432.3
## - SLOT 1 0.24362 4.1875 -2429.0
## - VACATION 1 0.44848 4.3923 -2404.6
## - S_POP_SQRT 1 0.52460 4.4685 -2395.8
## - LOG_E_POP_SQRT 1 0.71097 4.6548 -2374.9
## - PAX_Log 1 0.77537 4.7192 -2367.9
## - SW 1 1.51336 5.4572 -2293.7
## - DISTANCE_SQRT 1 2.60457 6.5484 -2200.5
##
## Step: AIC=-2458.14
## FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR +
## LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE +
## DISTANCE_SQRT + PAX_Log
##
## Df Sum of Sq RSS AIC
## - NEW 1 0.0107 3.9660 -2458.8
## <none> 3.9553 -2458.1
## + COUPON_Log 1 0.0114 3.9439 -2457.6
## - LOG_S_INCOME_SQR 1 0.0344 3.9897 -2455.7
## - LOG_E_INCOME_SQR 1 0.0850 4.0402 -2449.3
## - GATE 1 0.2141 4.1694 -2433.2
## - SLOT 1 0.2375 4.1928 -2430.3
## - HI_Log 1 0.2510 4.2063 -2428.7
## - VACATION 1 0.4397 4.3950 -2406.3
## - S_POP_SQRT 1 0.5345 4.4898 -2395.4
## - LOG_E_POP_SQRT 1 0.6996 4.6549 -2376.9
## - PAX_Log 1 0.9618 4.9170 -2348.9
## - SW 1 1.5030 5.4582 -2295.6
## - DISTANCE_SQRT 1 5.4389 9.3941 -2018.1
##
## Step: AIC=-2458.75
## FARE_Log ~ SW + VACATION + HI_Log + LOG_S_INCOME_SQR + LOG_E_INCOME_SQR +
## S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + DISTANCE_SQRT +
## PAX_Log
##
## Df Sum of Sq RSS AIC
## <none> 3.9660 -2458.8
## + NEW 1 0.0107 3.9553 -2458.1
## + COUPON_Log 1 0.0100 3.9560 -2458.0
## - LOG_S_INCOME_SQR 1 0.0336 3.9996 -2456.4
## - LOG_E_INCOME_SQR 1 0.0798 4.0458 -2450.6
## - GATE 1 0.2127 4.1787 -2434.1
## - SLOT 1 0.2318 4.1978 -2431.7
## - HI_Log 1 0.2470 4.2130 -2429.9
## - VACATION 1 0.4413 4.4073 -2406.8
## - S_POP_SQRT 1 0.5406 4.5066 -2395.4
## - LOG_E_POP_SQRT 1 0.6974 4.6634 -2378.0
## - PAX_Log 1 0.9580 4.9240 -2350.2
## - SW 1 1.4974 5.4634 -2297.1
## - DISTANCE_SQRT 1 5.4288 9.3948 -2020.1
step$anova # display results
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## FARE_Log ~ COUPON_Log + NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR +
## LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE +
## DISTANCE_SQRT + PAX_Log
##
## Final Model:
## FARE_Log ~ SW + VACATION + HI_Log + LOG_S_INCOME_SQR + LOG_E_INCOME_SQR +
## S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE + DISTANCE_SQRT +
## PAX_Log
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 497 3.943852 -2457.612
## 2 - COUPON_Log 1 0.01140297 498 3.955255 -2458.137
## 3 - NEW 1 0.01073811 499 3.965993 -2458.751
From above only COUPON_Log is dropped.
#Transformation of variables
Bestmodel_Airfares <- lm(FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR +
LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE +
DISTANCE_SQRT + PAX_Log)
summary(Bestmodel_Airfares)
##
## Call:
## lm(formula = FARE_Log ~ NEW + SW + VACATION + HI_Log + LOG_S_INCOME_SQR +
## LOG_E_INCOME_SQR + S_POP_SQRT + LOG_E_POP_SQRT + SLOT + GATE +
## DISTANCE_SQRT + PAX_Log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.247430 -0.060890 0.003952 0.061548 0.221853
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.568713907 0.539994882 -1.053 0.29277
## NEW -0.006458914 0.005554798 -1.163 0.24548
## SWYes -0.143394154 0.010423829 -13.756 < 0.0000000000000002
## VACATIONYes -0.076144879 0.010233462 -7.441 0.000000000000442
## HI_Log 0.150647603 0.026797804 5.622 0.000000031546438
## LOG_S_INCOME_SQR 0.093235070 0.044769417 2.083 0.03780
## LOG_E_INCOME_SQR 0.111518623 0.034093661 3.271 0.00115
## S_POP_SQRT 0.000056302 0.000006863 8.204 0.000000000000002
## LOG_E_POP_SQRT 0.236253991 0.025172004 9.386 < 0.0000000000000002
## SLOTFree -0.058380974 0.010675045 -5.469 0.000000071769817
## GATEFree -0.057934051 0.011157502 -5.192 0.000000303055187
## DISTANCE_SQRT 0.012133047 0.000463647 26.169 < 0.0000000000000002
## PAX_Log -0.173375184 0.015755202 -11.004 < 0.0000000000000002
##
## (Intercept)
## NEW
## SWYes ***
## VACATIONYes ***
## HI_Log ***
## LOG_S_INCOME_SQR *
## LOG_E_INCOME_SQR **
## S_POP_SQRT ***
## LOG_E_POP_SQRT ***
## SLOTFree ***
## GATEFree ***
## DISTANCE_SQRT ***
## PAX_Log ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08912 on 498 degrees of freedom
## Multiple R-squared: 0.8364, Adjusted R-squared: 0.8325
## F-statistic: 212.2 on 12 and 498 DF, p-value: < 0.00000000000000022
#residual plots for Best Model
residualPlots(Bestmodel_Airfares)
## Test stat Pr(>|t|)
## NEW 0.177 0.860
## SW NA NA
## VACATION NA NA
## HI_Log -1.386 0.166
## LOG_S_INCOME_SQR -1.661 0.097
## LOG_E_INCOME_SQR -1.248 0.213
## S_POP_SQRT -1.781 0.076
## LOG_E_POP_SQRT 2.029 0.043
## SLOT NA NA
## GATE NA NA
## DISTANCE_SQRT -2.532 0.012
## PAX_Log 2.161 0.031
## Tukey test -0.965 0.335
#residual plots
qqPlot(Bestmodel_Airfares, main="QQ Plot of residuals: Best Model ")
# distribution of studentized residuals
stu.resid1 <- studres(Bestmodel_Airfares)
hist(stu.resid1, freq=FALSE,
main="Distribution of Studentized Residuals:Best Model ")
xfit1<-seq(min(stu.resid1),max(stu.resid1),length=40)
yfit1<-dnorm(xfit1)
lines(xfit1, yfit1)
Validation of test data with test data created at initial stage of data split.
attach(Airport_test)
HI_Log <-log10(Airport_test$HI)
PAX_Log <- log10(Airport_test$PAX)
FARE_Log <-log10(Airport_test$FARE)
DISTANCE_SQRT <- sqrt(Airport_test$DISTANCE)
S_INCOME_SQR <- (Airport_test$S_INCOME)^2
LOG_S_INCOME_SQR <- log10(S_INCOME_SQR)
E_INCOME_SQR <- (Airport_test$E_INCOME)^2
LOG_E_INCOME_SQR <- log10(E_INCOME_SQR)
S_POP_SQRT <- sqrt(Airport_test$S_POP)
E_POP_SQRT <- sqrt(Airport_test$E_POP)
LOG_E_POP_SQRT <- log10(E_POP_SQRT)
Airport_test1 <- cbind(Airport_test,HI_Log,PAX_Log,FARE_Log,DISTANCE_SQRT,S_INCOME_SQR,LOG_S_INCOME_SQR,
E_INCOME_SQR,LOG_E_INCOME_SQR,S_POP_SQRT,E_POP_SQRT,LOG_E_POP_SQRT)
attach(Airport_test1)
validatemodel_Airfares <- lm(Airport_test1$FARE_Log ~ Airport_test1$NEW + Airport_test1$SW +
Airport_test1$VACATION + Airport_test1$HI_Log + Airport_test1$LOG_S_INCOME_SQR +
Airport_test1$LOG_E_INCOME_SQR + Airport_test1$S_POP_SQRT +
Airport_test1$LOG_E_POP_SQRT + Airport_test1$SLOT + Airport_test1$GATE +
Airport_test1$DISTANCE_SQRT + Airport_test1$PAX_Log)
summary(validatemodel_Airfares)
##
## Call:
## lm(formula = Airport_test1$FARE_Log ~ Airport_test1$NEW + Airport_test1$SW +
## Airport_test1$VACATION + Airport_test1$HI_Log + Airport_test1$LOG_S_INCOME_SQR +
## Airport_test1$LOG_E_INCOME_SQR + Airport_test1$S_POP_SQRT +
## Airport_test1$LOG_E_POP_SQRT + Airport_test1$SLOT + Airport_test1$GATE +
## Airport_test1$DISTANCE_SQRT + Airport_test1$PAX_Log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.17399 -0.06492 0.00429 0.06272 0.17000
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -3.96176335 1.17714235 -3.366
## Airport_test1$NEW -0.01614090 0.00916214 -1.762
## Airport_test1$SWYes -0.14238075 0.02212986 -6.434
## Airport_test1$VACATIONYes -0.08017008 0.01936002 -4.141
## Airport_test1$HI_Log 0.13761262 0.05386018 2.555
## Airport_test1$LOG_S_INCOME_SQR 0.47778909 0.11206396 4.264
## Airport_test1$LOG_E_INCOME_SQR 0.11280092 0.06491233 1.738
## Airport_test1$S_POP_SQRT 0.00001316 0.00001542 0.853
## Airport_test1$LOG_E_POP_SQRT 0.25565703 0.05002118 5.111
## Airport_test1$SLOTFree -0.03949117 0.01897550 -2.081
## Airport_test1$GATEFree -0.06867279 0.02120831 -3.238
## Airport_test1$DISTANCE_SQRT 0.01162326 0.00093722 12.402
## Airport_test1$PAX_Log -0.15183605 0.03119038 -4.868
## Pr(>|t|)
## (Intercept) 0.00104 **
## Airport_test1$NEW 0.08080 .
## Airport_test1$SWYes 0.00000000303 ***
## Airport_test1$VACATIONYes 0.00006658542 ***
## Airport_test1$HI_Log 0.01194 *
## Airport_test1$LOG_S_INCOME_SQR 0.00004167980 ***
## Airport_test1$LOG_E_INCOME_SQR 0.08496 .
## Airport_test1$S_POP_SQRT 0.39538
## Airport_test1$LOG_E_POP_SQRT 0.00000130298 ***
## Airport_test1$SLOTFree 0.03966 *
## Airport_test1$GATEFree 0.00158 **
## Airport_test1$DISTANCE_SQRT < 0.0000000000000002 ***
## Airport_test1$PAX_Log 0.00000365521 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0835 on 114 degrees of freedom
## Multiple R-squared: 0.8642, Adjusted R-squared: 0.8499
## F-statistic: 60.46 on 12 and 114 DF, p-value: < 0.00000000000000022
y_hat<-predict.lm(validatemodel_Airfares,newdata=Airport_test1, se.fit=TRUE)$fit
y_hat<-as.vector(y_hat)
dev<-log(FARE)-(y_hat)
num<-sum(dev^2)
dev1<-log(FARE)-mean(log(FARE))
den<-sum(dev1^2)
Predicted.Rsq<- 1- (num/den)
As per the above statistcs the adjusted r-square is same and the predicted value is same.