In this article we are going to work on predicting airfares between new routes from the Airfares.xls data set by fitting a Linear Regression model.
rm(list = ls())
setwd('D:\\personal\\ISB\\SA2\\Assignment1')
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(MASS)
# library(leaps)
##Loading required Functions
source(".\\PredictingFare\\UtilityFunctions.R")
airfares <- read.csv("Airfares.csv")
head(airfares)
## S_CODE S_CITY E_CODE E_CITY COUPON NEW
## 1 * Dallas/Fort Worth TX * Amarillo TX 1.00 3
## 2 * Atlanta GA * Baltimore/Wash Intl MD 1.06 3
## 3 * Boston MA * Baltimore/Wash Intl MD 1.06 3
## 4 ORD Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 5 MDW Chicago IL * Baltimore/Wash Intl MD 1.06 3
## 6 * Cleveland OH * Baltimore/Wash Intl MD 1.01 3
## VACATION SW HI S_INCOME E_INCOME S_POP E_POP SLOT GATE
## 1 No Yes 5291.99 $28,637 $21,112 3036732 205711 Free Free
## 2 No No 5419.16 $26,993 $29,838 3532657 7145897 Free Free
## 3 No No 9185.28 $30,124 $29,838 5787293 7145897 Free Free
## 4 No Yes 2657.35 $29,260 $29,838 7830332 7145897 Controlled Free
## 5 No Yes 2657.35 $29,260 $29,838 7830332 7145897 Free Free
## 6 No Yes 3408.11 $26,046 $29,838 2230955 7145897 Free Free
## DISTANCE PAX FARE
## 1 312 7864 $64.11
## 2 576 8820 $174.47
## 3 364 6452 $207.76
## 4 612 25144 $85.47
## 5 612 25144 $85.47
## 6 309 13386 $56.76
As we can see from the head command, there are some special characters like($ and ,) in few of the columns, lets cleanse them;
airfares$FARE <- as.numeric(substring(as.character(airfares$FARE),2))
toberemovedChars <- c("$",",")
for(c in toberemovedChars)
{
airfares$S_INCOME <- sub(x = as.character(airfares$S_INCOME), pattern = c,replacement = "",fixed = TRUE)
airfares$E_INCOME <- sub(x = as.character(airfares$E_INCOME), pattern = c,replacement = "",fixed = TRUE)
}
airfares$S_INCOME <- as.numeric(airfares$S_INCOME)
airfares$E_INCOME <- as.numeric(airfares$E_INCOME)
airfares$S_POP <- as.numeric(airfares$S_POP)
airfares$E_POP <- as.numeric(airfares$E_POP)
After looking at the data, we can understand that S_CODE & E_CODE are having 3 character airport code for high traffic cities or cities with multiple airports and ’*’ otherwise. I’m going to create a new field “HavingAtleastOneHighTraficCity” which will be used during the model creation.
airfares$HavingAtleastOneHighTraficCity <- as.factor(airfares$S_CODE !='*' | airfares$E_CODE !='*')
Its always useful to understand the relationship between the predictors and the response variable (FARE) by plotting them and check linear relationship between them. This seciton creates a bunch of plots to explore and understand the structure of the data;
hist(airfares$FARE)
My Figure
This step will help us identify predictors that are strongly linearly related to other predictor variables in the dataset. Including strongly related variablesin the model will effect the Reliability and Accurancy of the model. SO, it is always recommended to removed such predictors from the model as anyways the explanatory power about response variable in already included in the other predictors.
airfares_x = subset(airfares, select = -c(FARE))
numericData <- airfares_x[sapply(airfares_x, is.numeric)]
#Calculating Correlation
descrCor <- cor(numericData)
highlyCorrelated <- findCorrelation(descrCor, cutoff=0.6)
highlyCorCol <- colnames(numericData)[highlyCorrelated]
descrCor
## COUPON NEW HI S_INCOME E_INCOME
## COUPON 1.00000000 0.02022307 -0.34725207 -0.08840265 0.0468892
## NEW 0.02022307 1.00000000 0.05414685 0.02659673 0.1133766
## HI -0.34725207 0.05414685 1.00000000 -0.02738221 0.0823926
## S_INCOME -0.08840265 0.02659673 -0.02738221 1.00000000 -0.1388642
## E_INCOME 0.04688920 0.11337664 0.08239260 -0.13886420 1.0000000
## S_POP -0.10776336 -0.01667212 -0.17249541 0.51718718 -0.1440586
## E_POP 0.09496994 0.05856818 -0.06245600 -0.27228027 0.4584181
## DISTANCE 0.74680521 0.08096520 -0.31237457 0.02815334 0.1765307
## PAX -0.33697358 0.01049527 -0.16896078 0.13819710 0.2599611
## S_POP E_POP DISTANCE PAX
## COUPON -0.10776336 0.09496994 0.74680521 -0.33697358
## NEW -0.01667212 0.05856818 0.08096520 0.01049527
## HI -0.17249541 -0.06245600 -0.31237457 -0.16896078
## S_INCOME 0.51718718 -0.27228027 0.02815334 0.13819710
## E_INCOME -0.14405857 0.45841806 0.17653074 0.25996105
## S_POP 1.00000000 -0.28014283 0.01843667 0.28461056
## E_POP -0.28014283 1.00000000 0.11563970 0.31469750
## DISTANCE 0.01843667 0.11563970 1.00000000 -0.10248160
## PAX 0.28461056 0.31469750 -0.10248160 1.00000000
highlyCorCol
## [1] "COUPON"
In this section, we will split the data into training (75%) and testing (25%) subsets;
smp_size <- floor(0.75 * nrow(airfares))
## set the seed to make your partition reproductible
set.seed(123)
train_ind <- sample(seq_len(nrow(airfares)), size = smp_size)
train <- airfares[train_ind, ]
test <- airfares[-train_ind, ]
Now, lets start simply by fitting a linear regression model with all the predictors (FULL Model) on our Training dataset;
model_allfeatures <- lm(FARE~.,data = train)
smr <- summary(model_allfeatures)
paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste(" Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared: 0.898179 Adjusted R-squared: 0.860436"
#par(col.lab="maroon",col.main="blue")
par(mfrow = c(1:2),col.lab="maroon")
plot(model_allfeatures)
Deviation: We see a ‘U’ shaped curved instead of a flat line.
Deviation: We can observe that especially at the ends, there are decent amount of points lie outside the dashed line, indicating that errors are not completed normal distributed.
After interpreting the observations
we can conclude that response variable FARE requires LOG transformation.
log_model_allfeatures <- lm(log(FARE)~.,data = train)
smr <- summary(log_model_allfeatures)
paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste(" Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared: 0.925601 Adjusted R-squared: 0.898022"
Multiple R-squared & Adjusted R-squared are increased by ~3%.
#par(col.lab="maroon",col.main="blue")
par(mfrow = c(1:2),col.lab="maroon")
plot(log_model_allfeatures)
Residual vs Fitted plot approaches a flat horizontal line.
Now, by making use of
I’m able to create a REDUCED model by eliminating below predictors; ####S_INCOME , E_INCOME , S_POP , E_POP , COUPON , SLOT , GATE , S_CODE , E_CODE , NEW by retaining the Multiple R-squared & Adjusted R-squared ~90%
model_reducedfeatures_LogTrans <- lm(log(FARE)~.-(S_INCOME + E_INCOME +S_POP + E_POP + COUPON + SLOT + GATE + S_CODE + E_CODE + NEW),data = train)
smr <- summary(model_reducedfeatures_LogTrans)
paste(paste("Multiple R-squared: ",round(smr$r.squared,digits=6)), paste(" Adjusted R-squared: ",round(smr$adj.r.squared,digits=6)))
## [1] "Multiple R-squared: 0.921432 Adjusted R-squared: 0.895897"
In this section, we’ll test the last two models statistically using ANOVA. This test has a NULL hypothesis “No Significant Difference in SSE of FULL and REDUCED Models”
anova(log_model_allfeatures,model_reducedfeatures_LogTrans)
## Analysis of Variance Table
##
## Model 1: log(FARE) ~ S_CODE + S_CITY + E_CODE + E_CITY + COUPON + NEW +
## VACATION + SW + HI + S_INCOME + E_INCOME + S_POP + E_POP +
## SLOT + GATE + DISTANCE + PAX + HavingAtleastOneHighTraficCity
## Model 2: log(FARE) ~ (S_CODE + S_CITY + E_CODE + E_CITY + COUPON + NEW +
## VACATION + SW + HI + S_INCOME + E_INCOME + S_POP + E_POP +
## SLOT + GATE + DISTANCE + PAX + HavingAtleastOneHighTraficCity) -
## (S_INCOME + E_INCOME + S_POP + E_POP + COUPON + SLOT + GATE +
## S_CODE + E_CODE + NEW)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 348 9.3484
## 2 360 9.8723 -12 -0.52384 1.625 0.08281 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We fail to reject the NULL hypothesis at significance level 10%. Though the RSS is slighthly incresed, but since we are able to reduce the model we can proceed with the reduced model as final model. However, this assumption is crossed validated on Test Data.
Once, we have decided on the model, we can then validate the model performance on the TEST data that we splitted early from the original dataset. Now the below code PREDICTS the FARE for the test data;
#removing rows where we dont have those cities in train data. Othwerwise this is throwing errors
testWithOUtErros <- test[!(test$S_CITY %in% c("Honolulu (Intl) HI", "Nashville TN", "Omaha NE","Spokane WA")), ]
testWithOUtErros <- testWithOUtErros[!(testWithOUtErros$E_CITY %in% c("El Paso TX", "Little Rock AR")), ]
y_hat <- predict.lm(model_reducedfeatures_LogTrans, newdata=testWithOUtErros, se.fit=TRUE)$fit
y_hat <- as.vector(y_hat)
dev <- log(testWithOUtErros$FARE)-(y_hat)
num<-sum(dev^2)
dev1<-log(testWithOUtErros$FARE)-mean(log(testWithOUtErros$FARE))
den<-sum(dev1^2)
Predicted.Rsq<-1-(num/den)
Predicted.Rsq
## [1] 0.8209606
As, we can see we got 82% prediction accuracy on the test data!
Thanks,
SivaG