Analysis of Airline Ticket Pricing
NAME: Monesh Kumar Sharma
EMAIL: monesh.sharma23992@gmail.com
COLLEGE / COMPANY: Welingkar Institute of Management, Mumbai

Problem Definition

Use data set Airline prices.
1. Analysis of Data.(Summarize, plots etc.)
2. Find out Dependent and Independent Variable for Economy and Premium Economy Class.
3. Find out which all columns / features impact changes in Price of Airline ticket.
4. Predict the Price of Airline Ticket with some dummy values.

Data Location

Data is provided on below location,
https://drive.google.com/file/d/0B3oXSJHLqZqXX28wTkxncnVmdFE/view

Data Description

Size: 39KB 458 obs. of 18 variables:

Attributes:
Airline: Airline Name, 6 different Airlines
Aircraft: Boeing or Airbus
FlightDuration: Total time taken by Flight
TravelMonth: Month of the year in which flight was taken
IsInternational: Domestic or International Flight
SeatsEconomy: No. of seats in Economy class
SeatsPremium: No of seats in Premium Economy class
PitchEconomy: Pitch in the Economy class
PitchPremium: Pitch in the Premium Economy class
WidthEconomy: Width in Economy class
WidthPremium: Width in Premium economy class
PriceEconomy: Price of ticket in Economy Class
PricePremium: Price of ticket in premium Economy class
PriceRelative: Price relativity between Economy and Premium economy class
SeatsTotal: Total no of seats in flight
PitchDifference: Difference in Pitch between economy and Premium economy class
WidthDifference: Width difference between Economy and Premium economy class
PercentPremiumSeats: Percentage of Premium economy seats compare to toal seats

Setup

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

Functions

detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 1.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  #inp <- count(inp[is.na(otp)])
  sum(is.na(otp))
}

Non_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

Remove_Outliers <- function ( z, na.rm = TRUE){
 Out <- Non_outliers(z)
 Out <-as.data.frame (Out)
 z <- Out$Out[match(z, Out$Out)]
 z
}

Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
            geom_boxplot(aes(fill=input), color="green") +
            labs(title="Outliers")
Plot
}

Dataset

setwd("D:/Welingkar/My/IL/Project/Dataset")
dfrModel <- read.csv("./SixAirlinesDataV2.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British   Boeing          12.25         Jul   International          122
## 2 British   Boeing          12.25         Aug   International          122
## 3 British   Boeing          12.25         Sep   International          122
## 4 British   Boeing          12.25         Oct   International          122
## 5 British   Boeing           8.16         Aug   International          122
## 6 British   Boeing           8.16         Sep   International          122
##   SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1           40           31           38           18           19
## 2           40           31           38           18           19
## 3           40           31           38           18           19
## 4           40           31           38           18           19
## 5           40           31           38           18           19
## 6           40           31           38           18           19
##   PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1         2707         3725          0.38        162               7
## 2         2707         3725          0.38        162               7
## 3         2707         3725          0.38        162               7
## 4         2707         3725          0.38        162               7
## 5         1793         2999          0.67        162               7
## 6         1793         2999          0.67        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69

Observation 1. There are total ‘intRowCount’ data records in the file.
2. From data we can see that Airline, Aircraft, Travel Month, IsInternational are having data type of Character so for analysis we have to change it to categorical variable.

Data Cleaning

#Changing the Airline name to Categorical Numeric data
dfrModel$Airline <- ifelse(dfrModel$Airline == "AirFrance",1, 
                         ifelse(dfrModel$Airline == "British", 2,
                         ifelse(dfrModel$Airline == "Delta" , 3,
                         ifelse(dfrModel$Airline == "Jet", 4,
                         ifelse(dfrModel$Airline == "Singapore", 5, 
                         ifelse(dfrModel$Airline == "Virgin", 6, 0))))))


#Changing the Aircraft to Categorical Numeric data
dfrModel$Aircraft <- ifelse(dfrModel$Aircraft == "AirBus",1, 
                         ifelse(dfrModel$Aircraft == "Boeing", 2,0))

#Changing the Month to Categorical Numeric data
dfrModel$TravelMonth <-  ifelse(dfrModel$TravelMonth == "Jan",1, 
                         ifelse(dfrModel$TravelMonth == "Feb", 2,
                         ifelse(dfrModel$TravelMonth == "Mar" , 3,
                         ifelse(dfrModel$TravelMonth == "Apr", 4,
                         ifelse(dfrModel$TravelMonth == "May", 5, 
                         ifelse(dfrModel$TravelMonth == "Jun", 6, 
                         ifelse(dfrModel$TravelMonth == "Jul", 7, 
                         ifelse(dfrModel$TravelMonth == "Aug", 8, 
                         ifelse(dfrModel$TravelMonth == "Sep", 9, 
                         ifelse(dfrModel$TravelMonth == "Oct", 10, 
                         ifelse(dfrModel$TravelMonth == "Nov", 11, 
                         ifelse(dfrModel$TravelMonth == "Dec", 12, 
                                0))))))))))))


#Changing the Flight is Domestic or International to Categorical Numeric data
dfrModel$IsInternational <- ifelse(dfrModel$IsInternational == "Domestic",1, 
                         ifelse(dfrModel$IsInternational == "International", 2,0))


head(dfrModel)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1       2        2          12.25           7               2          122
## 2       2        2          12.25           8               2          122
## 3       2        2          12.25           9               2          122
## 4       2        2          12.25          10               2          122
## 5       2        2           8.16           8               2          122
## 6       2        2           8.16           9               2          122
##   SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1           40           31           38           18           19
## 2           40           31           38           18           19
## 3           40           31           38           18           19
## 4           40           31           38           18           19
## 5           40           31           38           18           19
## 6           40           31           38           18           19
##   PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1         2707         3725          0.38        162               7
## 2         2707         3725          0.38        162               7
## 3         2707         3725          0.38        162               7
## 4         2707         3725          0.38        162               7
## 5         1793         2999          0.67        162               7
## 6         1793         2999          0.67        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69

Observation

All the character variables have been changed to Numeric categorical data.

Summary

#summary(dfrModel)
lapply(dfrModel, FUN=summary)
## $Airline
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   3.009   4.000   6.000 
## 
## $Aircraft
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.67    2.00    2.00 
## 
## $FlightDuration
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.250   4.260   7.790   7.578  10.620  14.660 
## 
## $TravelMonth
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.000   8.000   9.000   8.672  10.000  10.000 
## 
## $IsInternational
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   1.913   2.000   2.000 
## 
## $SeatsEconomy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    78.0   133.0   185.0   202.3   243.0   389.0 
## 
## $SeatsPremium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   21.00   36.00   33.65   40.00   66.00 
## 
## $PitchEconomy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   31.00   31.00   31.22   32.00   33.00 
## 
## $PitchPremium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.00   38.00   38.00   37.91   38.00   40.00 
## 
## $WidthEconomy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   18.00   18.00   17.84   18.00   19.00 
## 
## $WidthPremium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17.00   19.00   19.00   19.47   21.00   21.00 
## 
## $PriceEconomy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      65     413    1242    1327    1909    3593 
## 
## $PricePremium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    86.0   528.8  1737.0  1845.3  2989.0  7414.0 
## 
## $PriceRelative
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0200  0.1000  0.3650  0.4872  0.7400  1.8900 
## 
## $SeatsTotal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      98     166     227     236     279     441 
## 
## $PitchDifference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   6.000   7.000   6.688   7.000  10.000 
## 
## $WidthDifference
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.633   3.000   4.000 
## 
## $PercentPremiumSeats
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.71   12.28   13.21   14.65   15.36   24.69

T test
Null Hypothesis: There is no difference between the mean of Prices of Economy class and Premium Economy Class

Alternative Hypothesis: There is difference between the mean of prices of Economy class and Premium economy class

t.test(dfrModel$PriceEconomy, dfrModel$PricePremium)
## 
##  Welch Two Sample t-test
## 
## data:  dfrModel$PriceEconomy and dfrModel$PricePremium
## t = -6.8304, df = 856.56, p-value = 1.605e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -667.0831 -369.2793
## sample estimates:
## mean of x mean of y 
##  1327.076  1845.258

** T test Observations**
1. We can see that P value is less than 0.05 which is for 95% Confidence interval so it is rejecting Null Hypothesis and accepting the Alternative hypothesis.
2. So there should be difference between mean of prices of Economy class and Premium economy class
3. Mean of Economy class prices is 1327.076
4. Mean of Premium Class pricess is 1845. 258

#detect_outliers(dfrModel$WidthDifference)
lapply(dfrModel, FUN=detect_outliers)
## $Airline
## [1] 0
## 
## $Aircraft
## [1] 0
## 
## $FlightDuration
## [1] 0
## 
## $TravelMonth
## [1] 0
## 
## $IsInternational
## [1] 40
## 
## $SeatsEconomy
## [1] 0
## 
## $SeatsPremium
## [1] 0
## 
## $PitchEconomy
## [1] 0
## 
## $PitchPremium
## [1] 94
## 
## $WidthEconomy
## [1] 154
## 
## $WidthPremium
## [1] 0
## 
## $PriceEconomy
## [1] 0
## 
## $PricePremium
## [1] 3
## 
## $PriceRelative
## [1] 12
## 
## $SeatsTotal
## [1] 0
## 
## $PitchDifference
## [1] 94
## 
## $WidthDifference
## [1] 0
## 
## $PercentPremiumSeats
## [1] 100

Outliers Obsevations
1. We can see that there are outliers in the data, still we will go with outliers as no of data records are less

lapply(dfrModel, FUN=Graph_Boxplot)
## $Airline

## 
## $Aircraft

## 
## $FlightDuration

## 
## $TravelMonth

## 
## $IsInternational

## 
## $SeatsEconomy

## 
## $SeatsPremium

## 
## $PitchEconomy

## 
## $PitchPremium

## 
## $WidthEconomy

## 
## $WidthPremium

## 
## $PriceEconomy

## 
## $PricePremium

## 
## $PriceRelative

## 
## $SeatsTotal

## 
## $PitchDifference

## 
## $WidthDifference

## 
## $PercentPremiumSeats

Observation
Here we see that many predictor variables numeric categoric variable. We should convert these to factor variables

dfrModel1 <- dfrModel
dfrModel1$Airline <- as.factor(dfrModel$Airline)
levels(dfrModel1$Airline) <- c("AirFrance", "British" , "Delta" , "Jet" , "Singapore", "Virgin") 
dfrModel1$Aircraft <- as.factor(dfrModel$Aircraft)
levels(dfrModel1$Aircraft) <- c("AirBus", "Boeing")
dfrModel1$TravelMonth <- as.factor(dfrModel$TravelMonth)
levels(dfrModel1$TravelMonth) <- c("Jul", "Aug", "Sep", "Oct")
dfrModel1$IsInternational <- as.factor(dfrModel$IsInternational)
levels(dfrModel1$IsInternational) <- c("Domestic", "International")
head(dfrModel1)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British   Boeing          12.25         Jul   International          122
## 2 British   Boeing          12.25         Aug   International          122
## 3 British   Boeing          12.25         Sep   International          122
## 4 British   Boeing          12.25         Oct   International          122
## 5 British   Boeing           8.16         Aug   International          122
## 6 British   Boeing           8.16         Sep   International          122
##   SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1           40           31           38           18           19
## 2           40           31           38           18           19
## 3           40           31           38           18           19
## 4           40           31           38           18           19
## 5           40           31           38           18           19
## 6           40           31           38           18           19
##   PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1         2707         3725          0.38        162               7
## 2         2707         3725          0.38        162               7
## 3         2707         3725          0.38        162               7
## 4         2707         3725          0.38        162               7
## 5         1793         2999          0.67        162               7
## 6         1793         2999          0.67        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69

Correlation Plot

pairs(dfrModel)

** Observations**
As given above after checking the correlation between the Dependent (Price) variable and each independent variable, there are may variables for which correlation is very low.

head(dfrModel)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1       2        2          12.25           7               2          122
## 2       2        2          12.25           8               2          122
## 3       2        2          12.25           9               2          122
## 4       2        2          12.25          10               2          122
## 5       2        2           8.16           8               2          122
## 6       2        2           8.16           9               2          122
##   SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1           40           31           38           18           19
## 2           40           31           38           18           19
## 3           40           31           38           18           19
## 4           40           31           38           18           19
## 5           40           31           38           18           19
## 6           40           31           38           18           19
##   PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1         2707         3725          0.38        162               7
## 2         2707         3725          0.38        162               7
## 3         2707         3725          0.38        162               7
## 4         2707         3725          0.38        162               7
## 5         1793         2999          0.67        162               7
## 6         1793         2999          0.67        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69

Removing Outliers

#To Remove Outliers from each continuous variables which are having outliers
#dfrModel$PitchPremium <- Remove_Outliers(dfrModel$PitchPremium)
#dfrModel$WidthEconomy <- Remove_Outliers(dfrModel$WidthEconomy)
#dfrModel$PricePremium <- Remove_Outliers(dfrModel$PricePremium)
#dfrModel$PriceRelative <- Remove_Outliers(dfrModel$PriceRelative)
#dfrModel$PitchDifference <- Remove_Outliers(dfrModel$PitchDifference)
#dfrModel$PercentPremiumSeats <- Remove_Outliers(dfrModel$PercentPremiumSeats)
#dfrModel <- dfrModel[complete.cases(dfrModel), ]
#nrow(dfrModel)
#head(dfrModel)

Observations
1. We can also remove outliers to improve the accuracy of our data
2. Outliers should be removed only for continuous variables
3. Outliers should not be removed for Categorical variables
4. As we are having less data and less outliers so we are going with outliers

Data Frames for Economy and Premium Class

## Data frame for Economy Class
dfrModel_Economy <- dfrModel
dfrModel_Economy$SeatsPremium <- NULL
dfrModel_Economy$PitchPremium <- NULL
dfrModel_Economy$WidthPremium <- NULL
dfrModel_Economy$PricePremium <- NULL
dfrModel_Economy$PriceRelative <- NULL
#dfrModel_Economy$WidthDifference <- NULL
#dfrModel_Economy$PercentPremiumSeats <- NULL
head(dfrModel_Economy)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1       2        2          12.25           7               2          122
## 2       2        2          12.25           8               2          122
## 3       2        2          12.25           9               2          122
## 4       2        2          12.25          10               2          122
## 5       2        2           8.16           8               2          122
## 6       2        2           8.16           9               2          122
##   PitchEconomy WidthEconomy PriceEconomy SeatsTotal PitchDifference
## 1           31           18         2707        162               7
## 2           31           18         2707        162               7
## 3           31           18         2707        162               7
## 4           31           18         2707        162               7
## 5           31           18         1793        162               7
## 6           31           18         1793        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69
## Data frame for Premium Class
dfrModel_Premium <- dfrModel
dfrModel_Premium$SeatsEconomy <- NULL
dfrModel_Premium$PitchEconomy <- NULL
dfrModel_Premium$WidthEconomy <- NULL
dfrModel_Premium$PriceEconomy <- NULL
dfrModel_Premium$PriceRelative <- NULL
#dfrModel_Premium$WidthDifference <- NULL
#dfrModel_Premium$PercentPremiumSeats <- NULL
head(dfrModel_Premium)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsPremium
## 1       2        2          12.25           7               2           40
## 2       2        2          12.25           8               2           40
## 3       2        2          12.25           9               2           40
## 4       2        2          12.25          10               2           40
## 5       2        2           8.16           8               2           40
## 6       2        2           8.16           9               2           40
##   PitchPremium WidthPremium PricePremium SeatsTotal PitchDifference
## 1           38           19         3725        162               7
## 2           38           19         3725        162               7
## 3           38           19         3725        162               7
## 4           38           19         3725        162               7
## 5           38           19         2999        162               7
## 6           38           19         2999        162               7
##   WidthDifference PercentPremiumSeats
## 1               1               24.69
## 2               1               24.69
## 3               1               24.69
## 4               1               24.69
## 5               1               24.69
## 6               1               24.69

Observation of Data Frames
1. Two Data frames have been created for Price Economy and Price Premium so that they can be checked to find out the different independent variables

Visualize

dfrGraph <- gather(dfrModel_Economy, variable, value, -PriceEconomy)
head(dfrGraph)
##   PriceEconomy variable value
## 1         2707  Airline     2
## 2         2707  Airline     2
## 3         2707  Airline     2
## 4         2707  Airline     2
## 5         1793  Airline     2
## 6         1793  Airline     2
ggplot(dfrGraph) +
geom_jitter(aes(value,PriceEconomy, colour=variable)) + 
geom_smooth(aes(value,PriceEconomy, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Price With Other Features")

Observation
There is some impact of all the features with Price.

Find Best Multi Linear Model for Economy Class
Choose the best linear model by using step(). Choose a model by AIC in a Stepwise Algorithm
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.

#?step()
stpModel=step(lm(data=dfrModel_Economy, PriceEconomy~.), trace=0, steps=1000)
stpSummary <- summary(stpModel)
stpSummary 
## 
## Call:
## lm(formula = PriceEconomy ~ Airline + Aircraft + FlightDuration + 
##     TravelMonth + IsInternational + WidthEconomy + PitchDifference + 
##     WidthDifference + PercentPremiumSeats, data = dfrModel_Economy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2131.06  -279.11    39.33   368.10  1343.59 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6440.877   1993.124  -3.232  0.00132 ** 
## Airline              -504.056     40.746 -12.371  < 2e-16 ***
## Aircraft               99.888     64.268   1.554  0.12084    
## FlightDuration        110.791     10.072  10.999  < 2e-16 ***
## TravelMonth           -41.233     24.742  -1.667  0.09631 .  
## IsInternational      1623.256    265.284   6.119 2.06e-09 ***
## WidthEconomy          447.869    128.312   3.490  0.00053 ***
## PitchDifference      -672.698     41.339 -16.273  < 2e-16 ***
## WidthDifference      1031.084     87.078  11.841  < 2e-16 ***
## PercentPremiumSeats    24.349      5.788   4.207 3.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 551.1 on 448 degrees of freedom
## Multiple R-squared:  0.6951, Adjusted R-squared:  0.689 
## F-statistic: 113.5 on 9 and 448 DF,  p-value: < 2.2e-16

Observation
As we can see that Aircraft and Travel Month is not much significant to affect the price of Economy class ticket we can remove it.

Data Cleaning 2

dfrModel_Economy <- select(dfrModel_Economy, -c(Aircraft, TravelMonth))
head(dfrModel_Economy)
##   Airline FlightDuration IsInternational SeatsEconomy PitchEconomy
## 1       2          12.25               2          122           31
## 2       2          12.25               2          122           31
## 3       2          12.25               2          122           31
## 4       2          12.25               2          122           31
## 5       2           8.16               2          122           31
## 6       2           8.16               2          122           31
##   WidthEconomy PriceEconomy SeatsTotal PitchDifference WidthDifference
## 1           18         2707        162               7               1
## 2           18         2707        162               7               1
## 3           18         2707        162               7               1
## 4           18         2707        162               7               1
## 5           18         1793        162               7               1
## 6           18         1793        162               7               1
##   PercentPremiumSeats
## 1               24.69
## 2               24.69
## 3               24.69
## 4               24.69
## 5               24.69
## 6               24.69
#?step()
stpModel=step(lm(data=dfrModel_Economy, PriceEconomy~.), trace=0, steps=1000)
stpSummary <- summary(stpModel)
stpSummary 
## 
## Call:
## lm(formula = PriceEconomy ~ Airline + FlightDuration + IsInternational + 
##     WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats, 
##     data = dfrModel_Economy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2089.23  -297.36    50.12   345.41  1308.06 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6095.761   1902.645  -3.204 0.001452 ** 
## Airline              -501.027     40.770 -12.289  < 2e-16 ***
## FlightDuration        116.198      9.459  12.284  < 2e-16 ***
## IsInternational      1500.318    258.224   5.810 1.18e-08 ***
## WidthEconomy          419.594    126.207   3.325 0.000958 ***
## PitchDifference      -649.806     38.658 -16.809  < 2e-16 ***
## WidthDifference      1020.026     86.325  11.816  < 2e-16 ***
## PercentPremiumSeats    25.613      5.717   4.480 9.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 553.1 on 450 degrees of freedom
## Multiple R-squared:  0.6916, Adjusted R-squared:  0.6868 
## F-statistic: 144.1 on 7 and 450 DF,  p-value: < 2.2e-16

Observations of Economy Class Model
1. PriceEconomy ~ Airline + FlightDuration + IsInternational + WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats
2. Price of Economy class is dependent of all the variables which are given above
3. R square value is around 0.6868 which is good as it is more than 0.6
4. P value is less than 0.05 which is rejecting the NULL hypothesis which means all the above 7 variables are affecting the price of Economy class flight.

Make Final Multi Linear Model

x1 <- dfrModel_Economy$Airline
x2 <- dfrModel_Economy$FlightDuration
x3 <- dfrModel_Economy$IsInternational
x4 <- dfrModel_Economy$WidthEconomy
x5 <- dfrModel_Economy$PitchDifference
x6 <- dfrModel_Economy$WidthDifference
x7 <- dfrModel_Economy$PercentPremiumSeats
y1 <- dfrModel_Economy$PriceEconomy
slmModel1 <- lm(y1~x1+x2+x3+x4+x5+x6+x7, data=dfrModel_Economy)

Observation
No errors. Model successfully created.

Show Model

# print summary
summary(slmModel1)
## 
## Call:
## lm(formula = y1 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = dfrModel_Economy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2089.23  -297.36    50.12   345.41  1308.06 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6095.761   1902.645  -3.204 0.001452 ** 
## x1           -501.027     40.770 -12.289  < 2e-16 ***
## x2            116.198      9.459  12.284  < 2e-16 ***
## x3           1500.318    258.224   5.810 1.18e-08 ***
## x4            419.594    126.207   3.325 0.000958 ***
## x5           -649.806     38.658 -16.809  < 2e-16 ***
## x6           1020.026     86.325  11.816  < 2e-16 ***
## x7             25.613      5.717   4.480 9.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 553.1 on 450 degrees of freedom
## Multiple R-squared:  0.6916, Adjusted R-squared:  0.6868 
## F-statistic: 144.1 on 7 and 450 DF,  p-value: < 2.2e-16

Test Data

dfrEconomy_test <- data.frame(x1=c(2), x2=c(14.25),x3=c(2), x4=c(18), x5=c(7),x5=c(9), x6=c(1),x7=c(27.69))
dfrEconomy_test 
##   x1    x2 x3 x4 x5 x5.1 x6    x7
## 1  2 14.25  2 18  7    9  1 27.69

Observation
Test Data successfully created.

Predict

result <-  predict(slmModel1, dfrEconomy_test)
print(result)
##        1 
## 2291.959

** Observation**
1. Prediction is working fine but have some deviation from true values as r squared values was less.

Premium Class Price Model

Visualize

dfrGraph1 <- gather(dfrModel_Premium, variable, value, -PricePremium)
head(dfrGraph1)
##   PricePremium variable value
## 1         3725  Airline     2
## 2         3725  Airline     2
## 3         3725  Airline     2
## 4         3725  Airline     2
## 5         2999  Airline     2
## 6         2999  Airline     2
ggplot(dfrGraph1) +
geom_jitter(aes(value,PricePremium, colour=variable)) + 
geom_smooth(aes(value,PricePremium, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Price With Other Features")

Observation
There is some impact of all the features with Price.

Find Best Multi Linear Model
Choose the best linear model by using step(). Choose a model by AIC in a Stepwise Algorithm
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.

#?step()
stpModel2=step(lm(data=dfrModel_Premium, PricePremium~.), trace=0, steps=1000)
stpSummary2 <- summary(stpModel2)
stpSummary2
## 
## Call:
## lm(formula = PricePremium ~ Airline + FlightDuration + IsInternational + 
##     PitchPremium + WidthPremium + PitchDifference + WidthDifference + 
##     PercentPremiumSeats, data = dfrModel_Premium)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2220.6  -451.0    43.0   436.8  4501.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         22765.101   5886.672   3.867 0.000126 ***
## Airline              -431.472     58.870  -7.329 1.08e-12 ***
## FlightDuration        192.269     13.575  14.164  < 2e-16 ***
## IsInternational      3216.347    474.020   6.785 3.67e-11 ***
## PitchPremium         -857.550    166.856  -5.139 4.12e-07 ***
## WidthPremium          269.890    180.927   1.492 0.136478    
## PitchDifference      -229.753     92.017  -2.497 0.012887 *  
## WidthDifference       770.438    102.693   7.502 3.39e-13 ***
## PercentPremiumSeats    20.417      9.481   2.154 0.031809 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 792.4 on 449 degrees of freedom
## Multiple R-squared:  0.6282, Adjusted R-squared:  0.6216 
## F-statistic: 94.84 on 8 and 449 DF,  p-value: < 2.2e-16

Observation
As we can see that WidthPremium is not much significant to affect the price of Economy class ticket we can remove it.

Data Cleaning 2

dfrModel_Premium <- select(dfrModel_Premium, -c(WidthPremium))
head(dfrModel_Premium)
##   Airline Aircraft FlightDuration TravelMonth IsInternational SeatsPremium
## 1       2        2          12.25           7               2           40
## 2       2        2          12.25           8               2           40
## 3       2        2          12.25           9               2           40
## 4       2        2          12.25          10               2           40
## 5       2        2           8.16           8               2           40
## 6       2        2           8.16           9               2           40
##   PitchPremium PricePremium SeatsTotal PitchDifference WidthDifference
## 1           38         3725        162               7               1
## 2           38         3725        162               7               1
## 3           38         3725        162               7               1
## 4           38         3725        162               7               1
## 5           38         2999        162               7               1
## 6           38         2999        162               7               1
##   PercentPremiumSeats
## 1               24.69
## 2               24.69
## 3               24.69
## 4               24.69
## 5               24.69
## 6               24.69
#?step()
stpModel2=step(lm(data=dfrModel_Premium, PricePremium~.), trace=0, steps=1000)
stpSummary2 <- summary(stpModel2)
stpSummary2
## 
## Call:
## lm(formula = PricePremium ~ Airline + FlightDuration + IsInternational + 
##     PitchPremium + PitchDifference + WidthDifference + PercentPremiumSeats, 
##     data = dfrModel_Premium)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2237.7  -481.9    56.8   478.9  4479.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         26516.624   5329.654   4.975 9.29e-07 ***
## Airline              -355.923     30.052 -11.844  < 2e-16 ***
## FlightDuration        193.455     13.570  14.256  < 2e-16 ***
## IsInternational      3578.516    407.676   8.778  < 2e-16 ***
## PitchPremium         -848.442    166.971  -5.081 5.50e-07 ***
## PitchDifference      -222.751     92.022  -2.421   0.0159 *  
## WidthDifference       884.506     68.638  12.886  < 2e-16 ***
## PercentPremiumSeats    20.165      9.492   2.124   0.0342 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 793.5 on 450 degrees of freedom
## Multiple R-squared:  0.6264, Adjusted R-squared:  0.6206 
## F-statistic: 107.8 on 7 and 450 DF,  p-value: < 2.2e-16

Observations of Premium Class Model
1. PricePremium ~ Airline + FlightDuration + IsInternational + PitchPremium + PitchDifference + WidthDifference + PercentPremiumSeats
2. Price of Premium class is dependent of all the variables which are given above
3. R square value is around 0.6206 which is good as it is more than 0.6
4. P value is less than 0.05 which is rejecting the NULL hypothesis which means all the above 7 variables are affecting the price of Economy class flight.

Make Final Multi Linear Model

x11 <- dfrModel_Premium$Airline
x12 <- dfrModel_Premium$FlightDuration
x13 <- dfrModel_Premium$IsInternational
x14 <- dfrModel_Premium$PitchPremium
x15 <- dfrModel_Premium$PitchDifference
x16 <- dfrModel_Premium$WidthDifference
x17 <- dfrModel_Premium$PercentPremiumSeats
y2 <- dfrModel_Premium$PricePremium
slmModel2 <- lm(y2~x11+x12+x13+x14+x15+x16+x17, data=dfrModel_Premium)

Observation
No errors. Model successfully created.

Show Model

# print summary
summary(slmModel2)
## 
## Call:
## lm(formula = y2 ~ x11 + x12 + x13 + x14 + x15 + x16 + x17, data = dfrModel_Premium)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2237.7  -481.9    56.8   478.9  4479.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26516.624   5329.654   4.975 9.29e-07 ***
## x11          -355.923     30.052 -11.844  < 2e-16 ***
## x12           193.455     13.570  14.256  < 2e-16 ***
## x13          3578.516    407.676   8.778  < 2e-16 ***
## x14          -848.442    166.971  -5.081 5.50e-07 ***
## x15          -222.751     92.022  -2.421   0.0159 *  
## x16           884.506     68.638  12.886  < 2e-16 ***
## x17            20.165      9.492   2.124   0.0342 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 793.5 on 450 degrees of freedom
## Multiple R-squared:  0.6264, Adjusted R-squared:  0.6206 
## F-statistic: 107.8 on 7 and 450 DF,  p-value: < 2.2e-16

Premium Test Data

# find price of Premium as per below details
dfrPremium_test <- data.frame(x11=c(4), x12=c(14.25),x13=c(2), x14=c(18), x15=c(9), x16=c(1),x17=c(27.69))
dfrPremium_test 
##   x11   x12 x13 x14 x15 x16   x17
## 1   4 14.25   2  18   9   1 27.69

Observation
Test Data successfully created.

Predict

result <-  predict(slmModel2, dfrPremium_test)
print(result)
##        1 
## 19172.85

** Observation**
1. Due to the Intercept Value Prediction is very high

Summary

  1. Data has been loaded successfully
  2. Data has been summarized to know the different statistical values
  3. Outliers has been find out in each variable and Evry variable is plotted on Box plot to know about the outliers
  4. T test is tested to check is there any difference between the mean of Economy price and Premium price
  5. This Model is after without removing the outliers.
    All the required assumptions for Regression model is tested successfully.
  1. There must be correlation between dependent and independent variable
  2. There should not be any correlation between two independent variables.
  1. Both the Model is created for Economy Price and Premium price
  2. For Economy Class
    PriceEconomy ~ Airline + FlightDuration + IsInternational + WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats

    Model is having R square value as 0.6868 as well as it is rejecting the NULL Hypothesis for Regression model as P Value is less than 0.05

  3. For Premium Class PricePremium ~ Airline + FlightDuration + IsInternational + PitchPremium + PitchDifference + WidthDifference + PercentPremiumSeats
    Model is having R square value as 0.62 as well as it is rejecting the NULL Hypothesis for Regression model as P Value is less than 0.05