Use data set Airline prices.
1. Analysis of Data.(Summarize, plots etc.)
2. Find out Dependent and Independent Variable for Economy and Premium Economy Class.
3. Find out which all columns / features impact changes in Price of Airline ticket.
4. Predict the Price of Airline Ticket with some dummy values.
Data is provided on below location,
https://drive.google.com/file/d/0B3oXSJHLqZqXX28wTkxncnVmdFE/view
Size: 39KB 458 obs. of 18 variables:
Attributes:
Airline: Airline Name, 6 different Airlines
Aircraft: Boeing or Airbus
FlightDuration: Total time taken by Flight
TravelMonth: Month of the year in which flight was taken
IsInternational: Domestic or International Flight
SeatsEconomy: No. of seats in Economy class
SeatsPremium: No of seats in Premium Economy class
PitchEconomy: Pitch in the Economy class
PitchPremium: Pitch in the Premium Economy class
WidthEconomy: Width in Economy class
WidthPremium: Width in Premium economy class
PriceEconomy: Price of ticket in Economy Class
PricePremium: Price of ticket in premium Economy class
PriceRelative: Price relativity between Economy and Premium economy class
SeatsTotal: Total no of seats in flight
PitchDifference: Difference in Pitch between economy and Premium economy class
WidthDifference: Width difference between Economy and Premium economy class
PercentPremiumSeats: Percentage of Premium economy seats compare to toal seats
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
Functions
detect_outliers <- function(inp, na.rm=TRUE) {
i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
i.max <- 1.5 * IQR(inp, na.rm=na.rm)
otp <- inp
otp[inp < (i.qnt[1] - i.max)] <- NA
otp[inp > (i.qnt[2] + i.max)] <- NA
#inp <- count(inp[is.na(otp)])
sum(is.na(otp))
}
Non_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
Remove_Outliers <- function ( z, na.rm = TRUE){
Out <- Non_outliers(z)
Out <-as.data.frame (Out)
z <- Out$Out[match(z, Out$Out)]
z
}
Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
geom_boxplot(aes(fill=input), color="green") +
labs(title="Outliers")
Plot
}
Dataset
setwd("D:/Welingkar/My/IL/Project/Dataset")
dfrModel <- read.csv("./SixAirlinesDataV2.csv", header=T, stringsAsFactors=F)
intRowCount <- nrow(dfrModel)
head(dfrModel)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British Boeing 12.25 Jul International 122
## 2 British Boeing 12.25 Aug International 122
## 3 British Boeing 12.25 Sep International 122
## 4 British Boeing 12.25 Oct International 122
## 5 British Boeing 8.16 Aug International 122
## 6 British Boeing 8.16 Sep International 122
## SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1 40 31 38 18 19
## 2 40 31 38 18 19
## 3 40 31 38 18 19
## 4 40 31 38 18 19
## 5 40 31 38 18 19
## 6 40 31 38 18 19
## PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1 2707 3725 0.38 162 7
## 2 2707 3725 0.38 162 7
## 3 2707 3725 0.38 162 7
## 4 2707 3725 0.38 162 7
## 5 1793 2999 0.67 162 7
## 6 1793 2999 0.67 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
Observation 1. There are total ‘intRowCount’ data records in the file.
2. From data we can see that Airline, Aircraft, Travel Month, IsInternational are having data type of Character so for analysis we have to change it to categorical variable.
Data Cleaning
#Changing the Airline name to Categorical Numeric data
dfrModel$Airline <- ifelse(dfrModel$Airline == "AirFrance",1,
ifelse(dfrModel$Airline == "British", 2,
ifelse(dfrModel$Airline == "Delta" , 3,
ifelse(dfrModel$Airline == "Jet", 4,
ifelse(dfrModel$Airline == "Singapore", 5,
ifelse(dfrModel$Airline == "Virgin", 6, 0))))))
#Changing the Aircraft to Categorical Numeric data
dfrModel$Aircraft <- ifelse(dfrModel$Aircraft == "AirBus",1,
ifelse(dfrModel$Aircraft == "Boeing", 2,0))
#Changing the Month to Categorical Numeric data
dfrModel$TravelMonth <- ifelse(dfrModel$TravelMonth == "Jan",1,
ifelse(dfrModel$TravelMonth == "Feb", 2,
ifelse(dfrModel$TravelMonth == "Mar" , 3,
ifelse(dfrModel$TravelMonth == "Apr", 4,
ifelse(dfrModel$TravelMonth == "May", 5,
ifelse(dfrModel$TravelMonth == "Jun", 6,
ifelse(dfrModel$TravelMonth == "Jul", 7,
ifelse(dfrModel$TravelMonth == "Aug", 8,
ifelse(dfrModel$TravelMonth == "Sep", 9,
ifelse(dfrModel$TravelMonth == "Oct", 10,
ifelse(dfrModel$TravelMonth == "Nov", 11,
ifelse(dfrModel$TravelMonth == "Dec", 12,
0))))))))))))
#Changing the Flight is Domestic or International to Categorical Numeric data
dfrModel$IsInternational <- ifelse(dfrModel$IsInternational == "Domestic",1,
ifelse(dfrModel$IsInternational == "International", 2,0))
head(dfrModel)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 2 2 12.25 7 2 122
## 2 2 2 12.25 8 2 122
## 3 2 2 12.25 9 2 122
## 4 2 2 12.25 10 2 122
## 5 2 2 8.16 8 2 122
## 6 2 2 8.16 9 2 122
## SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1 40 31 38 18 19
## 2 40 31 38 18 19
## 3 40 31 38 18 19
## 4 40 31 38 18 19
## 5 40 31 38 18 19
## 6 40 31 38 18 19
## PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1 2707 3725 0.38 162 7
## 2 2707 3725 0.38 162 7
## 3 2707 3725 0.38 162 7
## 4 2707 3725 0.38 162 7
## 5 1793 2999 0.67 162 7
## 6 1793 2999 0.67 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
Observation
All the character variables have been changed to Numeric categorical data.
Summary
#summary(dfrModel)
lapply(dfrModel, FUN=summary)
## $Airline
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 3.009 4.000 6.000
##
## $Aircraft
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 2.00 1.67 2.00 2.00
##
## $FlightDuration
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 4.260 7.790 7.578 10.620 14.660
##
## $TravelMonth
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.000 8.000 9.000 8.672 10.000 10.000
##
## $IsInternational
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 1.913 2.000 2.000
##
## $SeatsEconomy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 78.0 133.0 185.0 202.3 243.0 389.0
##
## $SeatsPremium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 21.00 36.00 33.65 40.00 66.00
##
## $PitchEconomy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 31.00 31.00 31.22 32.00 33.00
##
## $PitchPremium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 38.00 38.00 37.91 38.00 40.00
##
## $WidthEconomy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 18.00 18.00 17.84 18.00 19.00
##
## $WidthPremium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 19.00 19.00 19.47 21.00 21.00
##
## $PriceEconomy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65 413 1242 1327 1909 3593
##
## $PricePremium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.0 528.8 1737.0 1845.3 2989.0 7414.0
##
## $PriceRelative
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0200 0.1000 0.3650 0.4872 0.7400 1.8900
##
## $SeatsTotal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 98 166 227 236 279 441
##
## $PitchDifference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 6.000 7.000 6.688 7.000 10.000
##
## $WidthDifference
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.633 3.000 4.000
##
## $PercentPremiumSeats
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.71 12.28 13.21 14.65 15.36 24.69
T test
Null Hypothesis: There is no difference between the mean of Prices of Economy class and Premium Economy Class
Alternative Hypothesis: There is difference between the mean of prices of Economy class and Premium economy class
t.test(dfrModel$PriceEconomy, dfrModel$PricePremium)
##
## Welch Two Sample t-test
##
## data: dfrModel$PriceEconomy and dfrModel$PricePremium
## t = -6.8304, df = 856.56, p-value = 1.605e-11
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -667.0831 -369.2793
## sample estimates:
## mean of x mean of y
## 1327.076 1845.258
** T test Observations**
1. We can see that P value is less than 0.05 which is for 95% Confidence interval so it is rejecting Null Hypothesis and accepting the Alternative hypothesis.
2. So there should be difference between mean of prices of Economy class and Premium economy class
3. Mean of Economy class prices is 1327.076
4. Mean of Premium Class pricess is 1845. 258
#detect_outliers(dfrModel$WidthDifference)
lapply(dfrModel, FUN=detect_outliers)
## $Airline
## [1] 0
##
## $Aircraft
## [1] 0
##
## $FlightDuration
## [1] 0
##
## $TravelMonth
## [1] 0
##
## $IsInternational
## [1] 40
##
## $SeatsEconomy
## [1] 0
##
## $SeatsPremium
## [1] 0
##
## $PitchEconomy
## [1] 0
##
## $PitchPremium
## [1] 94
##
## $WidthEconomy
## [1] 154
##
## $WidthPremium
## [1] 0
##
## $PriceEconomy
## [1] 0
##
## $PricePremium
## [1] 3
##
## $PriceRelative
## [1] 12
##
## $SeatsTotal
## [1] 0
##
## $PitchDifference
## [1] 94
##
## $WidthDifference
## [1] 0
##
## $PercentPremiumSeats
## [1] 100
Outliers Obsevations
1. We can see that there are outliers in the data, still we will go with outliers as no of data records are less
lapply(dfrModel, FUN=Graph_Boxplot)
## $Airline
##
## $Aircraft
##
## $FlightDuration
##
## $TravelMonth
##
## $IsInternational
##
## $SeatsEconomy
##
## $SeatsPremium
##
## $PitchEconomy
##
## $PitchPremium
##
## $WidthEconomy
##
## $WidthPremium
##
## $PriceEconomy
##
## $PricePremium
##
## $PriceRelative
##
## $SeatsTotal
##
## $PitchDifference
##
## $WidthDifference
##
## $PercentPremiumSeats
Observation
Here we see that many predictor variables numeric categoric variable. We should convert these to factor variables
dfrModel1 <- dfrModel
dfrModel1$Airline <- as.factor(dfrModel$Airline)
levels(dfrModel1$Airline) <- c("AirFrance", "British" , "Delta" , "Jet" , "Singapore", "Virgin")
dfrModel1$Aircraft <- as.factor(dfrModel$Aircraft)
levels(dfrModel1$Aircraft) <- c("AirBus", "Boeing")
dfrModel1$TravelMonth <- as.factor(dfrModel$TravelMonth)
levels(dfrModel1$TravelMonth) <- c("Jul", "Aug", "Sep", "Oct")
dfrModel1$IsInternational <- as.factor(dfrModel$IsInternational)
levels(dfrModel1$IsInternational) <- c("Domestic", "International")
head(dfrModel1)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 British Boeing 12.25 Jul International 122
## 2 British Boeing 12.25 Aug International 122
## 3 British Boeing 12.25 Sep International 122
## 4 British Boeing 12.25 Oct International 122
## 5 British Boeing 8.16 Aug International 122
## 6 British Boeing 8.16 Sep International 122
## SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1 40 31 38 18 19
## 2 40 31 38 18 19
## 3 40 31 38 18 19
## 4 40 31 38 18 19
## 5 40 31 38 18 19
## 6 40 31 38 18 19
## PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1 2707 3725 0.38 162 7
## 2 2707 3725 0.38 162 7
## 3 2707 3725 0.38 162 7
## 4 2707 3725 0.38 162 7
## 5 1793 2999 0.67 162 7
## 6 1793 2999 0.67 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
Correlation Plot
pairs(dfrModel)
** Observations**
As given above after checking the correlation between the Dependent (Price) variable and each independent variable, there are may variables for which correlation is very low.
head(dfrModel)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 2 2 12.25 7 2 122
## 2 2 2 12.25 8 2 122
## 3 2 2 12.25 9 2 122
## 4 2 2 12.25 10 2 122
## 5 2 2 8.16 8 2 122
## 6 2 2 8.16 9 2 122
## SeatsPremium PitchEconomy PitchPremium WidthEconomy WidthPremium
## 1 40 31 38 18 19
## 2 40 31 38 18 19
## 3 40 31 38 18 19
## 4 40 31 38 18 19
## 5 40 31 38 18 19
## 6 40 31 38 18 19
## PriceEconomy PricePremium PriceRelative SeatsTotal PitchDifference
## 1 2707 3725 0.38 162 7
## 2 2707 3725 0.38 162 7
## 3 2707 3725 0.38 162 7
## 4 2707 3725 0.38 162 7
## 5 1793 2999 0.67 162 7
## 6 1793 2999 0.67 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
Removing Outliers
#To Remove Outliers from each continuous variables which are having outliers
#dfrModel$PitchPremium <- Remove_Outliers(dfrModel$PitchPremium)
#dfrModel$WidthEconomy <- Remove_Outliers(dfrModel$WidthEconomy)
#dfrModel$PricePremium <- Remove_Outliers(dfrModel$PricePremium)
#dfrModel$PriceRelative <- Remove_Outliers(dfrModel$PriceRelative)
#dfrModel$PitchDifference <- Remove_Outliers(dfrModel$PitchDifference)
#dfrModel$PercentPremiumSeats <- Remove_Outliers(dfrModel$PercentPremiumSeats)
#dfrModel <- dfrModel[complete.cases(dfrModel), ]
#nrow(dfrModel)
#head(dfrModel)
Observations
1. We can also remove outliers to improve the accuracy of our data
2. Outliers should be removed only for continuous variables
3. Outliers should not be removed for Categorical variables
4. As we are having less data and less outliers so we are going with outliers
Data Frames for Economy and Premium Class
## Data frame for Economy Class
dfrModel_Economy <- dfrModel
dfrModel_Economy$SeatsPremium <- NULL
dfrModel_Economy$PitchPremium <- NULL
dfrModel_Economy$WidthPremium <- NULL
dfrModel_Economy$PricePremium <- NULL
dfrModel_Economy$PriceRelative <- NULL
#dfrModel_Economy$WidthDifference <- NULL
#dfrModel_Economy$PercentPremiumSeats <- NULL
head(dfrModel_Economy)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsEconomy
## 1 2 2 12.25 7 2 122
## 2 2 2 12.25 8 2 122
## 3 2 2 12.25 9 2 122
## 4 2 2 12.25 10 2 122
## 5 2 2 8.16 8 2 122
## 6 2 2 8.16 9 2 122
## PitchEconomy WidthEconomy PriceEconomy SeatsTotal PitchDifference
## 1 31 18 2707 162 7
## 2 31 18 2707 162 7
## 3 31 18 2707 162 7
## 4 31 18 2707 162 7
## 5 31 18 1793 162 7
## 6 31 18 1793 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
## Data frame for Premium Class
dfrModel_Premium <- dfrModel
dfrModel_Premium$SeatsEconomy <- NULL
dfrModel_Premium$PitchEconomy <- NULL
dfrModel_Premium$WidthEconomy <- NULL
dfrModel_Premium$PriceEconomy <- NULL
dfrModel_Premium$PriceRelative <- NULL
#dfrModel_Premium$WidthDifference <- NULL
#dfrModel_Premium$PercentPremiumSeats <- NULL
head(dfrModel_Premium)
## Airline Aircraft FlightDuration TravelMonth IsInternational SeatsPremium
## 1 2 2 12.25 7 2 40
## 2 2 2 12.25 8 2 40
## 3 2 2 12.25 9 2 40
## 4 2 2 12.25 10 2 40
## 5 2 2 8.16 8 2 40
## 6 2 2 8.16 9 2 40
## PitchPremium WidthPremium PricePremium SeatsTotal PitchDifference
## 1 38 19 3725 162 7
## 2 38 19 3725 162 7
## 3 38 19 3725 162 7
## 4 38 19 3725 162 7
## 5 38 19 2999 162 7
## 6 38 19 2999 162 7
## WidthDifference PercentPremiumSeats
## 1 1 24.69
## 2 1 24.69
## 3 1 24.69
## 4 1 24.69
## 5 1 24.69
## 6 1 24.69
Observation of Data Frames
1. Two Data frames have been created for Price Economy and Price Premium so that they can be checked to find out the different independent variables
Visualize
dfrGraph <- gather(dfrModel_Economy, variable, value, -PriceEconomy)
head(dfrGraph)
## PriceEconomy variable value
## 1 2707 Airline 2
## 2 2707 Airline 2
## 3 2707 Airline 2
## 4 2707 Airline 2
## 5 1793 Airline 2
## 6 1793 Airline 2
ggplot(dfrGraph) +
geom_jitter(aes(value,PriceEconomy, colour=variable)) +
geom_smooth(aes(value,PriceEconomy, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Price With Other Features")
Observation
There is some impact of all the features with Price.
Find Best Multi Linear Model for Economy Class
Choose the best linear model by using step(). Choose a model by AIC in a Stepwise Algorithm
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.
The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Hence, AIC provides a means for model selection.
#?step()
stpModel=step(lm(data=dfrModel_Economy, PriceEconomy~.), trace=0, steps=1000)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = PriceEconomy ~ Airline + Aircraft + FlightDuration +
## TravelMonth + IsInternational + WidthEconomy + PitchDifference +
## WidthDifference + PercentPremiumSeats, data = dfrModel_Economy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2131.06 -279.11 39.33 368.10 1343.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6440.877 1993.124 -3.232 0.00132 **
## Airline -504.056 40.746 -12.371 < 2e-16 ***
## Aircraft 99.888 64.268 1.554 0.12084
## FlightDuration 110.791 10.072 10.999 < 2e-16 ***
## TravelMonth -41.233 24.742 -1.667 0.09631 .
## IsInternational 1623.256 265.284 6.119 2.06e-09 ***
## WidthEconomy 447.869 128.312 3.490 0.00053 ***
## PitchDifference -672.698 41.339 -16.273 < 2e-16 ***
## WidthDifference 1031.084 87.078 11.841 < 2e-16 ***
## PercentPremiumSeats 24.349 5.788 4.207 3.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 551.1 on 448 degrees of freedom
## Multiple R-squared: 0.6951, Adjusted R-squared: 0.689
## F-statistic: 113.5 on 9 and 448 DF, p-value: < 2.2e-16
Observation
As we can see that Aircraft and Travel Month is not much significant to affect the price of Economy class ticket we can remove it.
Data Cleaning 2
dfrModel_Economy <- select(dfrModel_Economy, -c(Aircraft, TravelMonth))
head(dfrModel_Economy)
## Airline FlightDuration IsInternational SeatsEconomy PitchEconomy
## 1 2 12.25 2 122 31
## 2 2 12.25 2 122 31
## 3 2 12.25 2 122 31
## 4 2 12.25 2 122 31
## 5 2 8.16 2 122 31
## 6 2 8.16 2 122 31
## WidthEconomy PriceEconomy SeatsTotal PitchDifference WidthDifference
## 1 18 2707 162 7 1
## 2 18 2707 162 7 1
## 3 18 2707 162 7 1
## 4 18 2707 162 7 1
## 5 18 1793 162 7 1
## 6 18 1793 162 7 1
## PercentPremiumSeats
## 1 24.69
## 2 24.69
## 3 24.69
## 4 24.69
## 5 24.69
## 6 24.69
#?step()
stpModel=step(lm(data=dfrModel_Economy, PriceEconomy~.), trace=0, steps=1000)
stpSummary <- summary(stpModel)
stpSummary
##
## Call:
## lm(formula = PriceEconomy ~ Airline + FlightDuration + IsInternational +
## WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats,
## data = dfrModel_Economy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2089.23 -297.36 50.12 345.41 1308.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6095.761 1902.645 -3.204 0.001452 **
## Airline -501.027 40.770 -12.289 < 2e-16 ***
## FlightDuration 116.198 9.459 12.284 < 2e-16 ***
## IsInternational 1500.318 258.224 5.810 1.18e-08 ***
## WidthEconomy 419.594 126.207 3.325 0.000958 ***
## PitchDifference -649.806 38.658 -16.809 < 2e-16 ***
## WidthDifference 1020.026 86.325 11.816 < 2e-16 ***
## PercentPremiumSeats 25.613 5.717 4.480 9.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 553.1 on 450 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6868
## F-statistic: 144.1 on 7 and 450 DF, p-value: < 2.2e-16
Observations of Economy Class Model
1. PriceEconomy ~ Airline + FlightDuration + IsInternational + WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats
2. Price of Economy class is dependent of all the variables which are given above
3. R square value is around 0.6868 which is good as it is more than 0.6
4. P value is less than 0.05 which is rejecting the NULL hypothesis which means all the above 7 variables are affecting the price of Economy class flight.
Make Final Multi Linear Model
x1 <- dfrModel_Economy$Airline
x2 <- dfrModel_Economy$FlightDuration
x3 <- dfrModel_Economy$IsInternational
x4 <- dfrModel_Economy$WidthEconomy
x5 <- dfrModel_Economy$PitchDifference
x6 <- dfrModel_Economy$WidthDifference
x7 <- dfrModel_Economy$PercentPremiumSeats
y1 <- dfrModel_Economy$PriceEconomy
slmModel1 <- lm(y1~x1+x2+x3+x4+x5+x6+x7, data=dfrModel_Economy)
Observation
No errors. Model successfully created.
Show Model
# print summary
summary(slmModel1)
##
## Call:
## lm(formula = y1 ~ x1 + x2 + x3 + x4 + x5 + x6 + x7, data = dfrModel_Economy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2089.23 -297.36 50.12 345.41 1308.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6095.761 1902.645 -3.204 0.001452 **
## x1 -501.027 40.770 -12.289 < 2e-16 ***
## x2 116.198 9.459 12.284 < 2e-16 ***
## x3 1500.318 258.224 5.810 1.18e-08 ***
## x4 419.594 126.207 3.325 0.000958 ***
## x5 -649.806 38.658 -16.809 < 2e-16 ***
## x6 1020.026 86.325 11.816 < 2e-16 ***
## x7 25.613 5.717 4.480 9.48e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 553.1 on 450 degrees of freedom
## Multiple R-squared: 0.6916, Adjusted R-squared: 0.6868
## F-statistic: 144.1 on 7 and 450 DF, p-value: < 2.2e-16
Test Data
dfrEconomy_test <- data.frame(x1=c(2), x2=c(14.25),x3=c(2), x4=c(18), x5=c(7),x5=c(9), x6=c(1),x7=c(27.69))
dfrEconomy_test
## x1 x2 x3 x4 x5 x5.1 x6 x7
## 1 2 14.25 2 18 7 9 1 27.69
Observation
Test Data successfully created.
Predict
result <- predict(slmModel1, dfrEconomy_test)
print(result)
## 1
## 2291.959
** Observation**
1. Prediction is working fine but have some deviation from true values as r squared values was less.
For Economy Class
PriceEconomy ~ Airline + FlightDuration + IsInternational + WidthEconomy + PitchDifference + WidthDifference + PercentPremiumSeats
Model is having R square value as 0.6868 as well as it is rejecting the NULL Hypothesis for Regression model as P Value is less than 0.05
For Premium Class PricePremium ~ Airline + FlightDuration + IsInternational + PitchPremium + PitchDifference + WidthDifference + PercentPremiumSeats
Model is having R square value as 0.62 as well as it is rejecting the NULL Hypothesis for Regression model as P Value is less than 0.05