Image by shealingreer from Pinterest
As summer approaches and travel activity picks up, I’m interested in understanding what factors drive flight prices in other countries. My research question was: What impact do flight duration, number of stops, and flight class have on the pricing of Vistara flights? This project focuses specifically on Vistara ticket prices in India. The dataset was sourced from EaseMyTrip.com, a popular travel booking platform in India. The company was founded in 2008 by three brothers and has grown to become the second-largest online travel platform in India. The flight data is compiled directly from recorded flight listings and bookings, so the pricing information reflects real-world values across different routes and travel conditions. This project aims to uncover patterns and build a model to estimate ticket prices.
Variables I Used:
- airline: Name of the airline.
- stops: Number of layovers (zero, one, two or more).
- class: Economy or Business.
- duration: Flight time in hours.
- price: Ticket price (in rupees).
First, I loaded my libraries and dataset. Then I checked the different airlines in the dataset so I could choose which one to focus on; I decided on Vistara. Next, I created a new dataset with the variables I wanted to focus on and fit a multiple linear regression model to it, along with diagnostic plots, to see whether my chosen indicators are significant in explaining price variation. I then fit three more linear regression models, each focusing on one indicator, to see which was the strongest in explaining the variation in prices. Finally, I created a scatter plot to see the spread in prices based on duration and stops, and an interactive box plot to show how flight class impacts prices.
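Load libraries and set working directory
library(dplyr)      # select(), filter(), arrange()
library(ggplot2)    # ggplot()
library(ggfortify)  # autoplot() for lm diagnostic plots
library(plotly)     # plot_ly(), layout()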
Load the dataset, preview it, and check the airline names to choose which one to filter on
flight_prices <- read.csv("flight_prices.csv")
head(flight_prices)
## X airline flight source_city departure_time stops arrival_time
## 1 0 SpiceJet SG-8709 Delhi Evening zero Night
## 2 1 SpiceJet SG-8157 Delhi Early_Morning zero Morning
## 3 2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning
## 4 3 Vistara UK-995 Delhi Morning zero Afternoon
## 5 4 Vistara UK-963 Delhi Morning zero Morning
## 6 5 Vistara UK-945 Delhi Morning zero Afternoon
## destination_city class duration days_left price
## 1 Mumbai Economy 2.17 1 5953
## 2 Mumbai Economy 2.33 1 5953
## 3 Mumbai Economy 2.17 1 5956
## 4 Mumbai Economy 2.25 1 5955
## 5 Mumbai Economy 2.33 1 5955
## 6 Mumbai Economy 2.33 1 5955
unique(flight_prices$airline)
## [1] "SpiceJet" "AirAsia" "Vistara" "GO_FIRST" "Indigo" "Air_India"
Create a new data frame filtered to Vistara flights that keeps the airline, stops, class, duration, and price columns. Sort by price in descending order to see the most expensive flights.
vistara_flights <- flight_prices |>
select(airline, stops, class, duration, price) |>
filter(airline == "Vistara") |>
arrange(desc(price))
head(vistara_flights)
## airline stops class duration price
## 1 Vistara one Business 13.50 123071
## 2 Vistara two_or_more Business 10.92 117307
## 3 Vistara two_or_more Business 21.08 116562
## 4 Vistara one Business 16.42 115211
## 5 Vistara one Business 9.50 114705
## 6 Vistara one Business 15.08 114704
Create multiple linear regression model to see variation of price based on duration, stops, and class
flights_model1 <- lm(price ~ duration + stops + class, data = vistara_flights)
summary(flights_model1)
##
## Call:
## lm(formula = price ~ duration + stops + class, data = vistara_flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27254 -4121 -1707 3861 66314
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57284.467 66.472 861.779 <2e-16 ***
## duration -39.106 3.979 -9.829 <2e-16 ***
## stopstwo_or_more 3669.370 109.695 33.451 <2e-16 ***
## stopszero -16239.354 99.371 -163.422 <2e-16 ***
## classEconomy -47988.176 47.860 -1002.684 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8439 on 127854 degrees of freedom
## Multiple R-squared: 0.8916, Adjusted R-squared: 0.8916
## F-statistic: 2.63e+05 on 4 and 127854 DF, p-value: < 2.2e-16
Equation of model:
Estimated Price = 57,284.47 - 39.11 * Duration + 3,669.37 * Stops(two or more) - 16,239.35 * Stops(zero) - 47,988.18 * Class(Economy)
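To make the equation concrete, here is a minimal sketch that plugs values into it by hand. The coefficients are copied from the summary output above; the helper name `predict_vistara_price` and the three example flights are hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(flights_model1) above.
coefs <- c(intercept = 57284.467,
           duration = -39.106,
           stops_two_or_more = 3669.370,
           stops_zero = -16239.354,
           class_economy = -47988.176)

# Hypothetical helper (not part of the original analysis): evaluates the
# fitted equation for one flight. The baseline is a one-stop Business flight.
predict_vistara_price <- function(duration, stops = "one", class = "Business") {
  unname(
    coefs["intercept"] +
      coefs["duration"] * duration +
      coefs["stops_two_or_more"] * (stops == "two_or_more") +
      coefs["stops_zero"] * (stops == "zero") +
      coefs["class_economy"] * (class == "Economy")
  )
}

# Example inputs are invented for illustration.
business_one_stop <- predict_vistara_price(10, "one", "Business")
economy_one_stop  <- predict_vistara_price(10, "one", "Economy")
economy_nonstop   <- predict_vistara_price(2.25, "zero", "Economy")

print(round(c(business_one_stop, economy_one_stop, economy_nonstop), 2))
```

The three predictions come out to roughly 56,893, 8,905, and -7,031 rupees. The last value is negative, which no real ticket price can be; it is one concrete sign that a purely additive linear form does not fit every combination of predictors well.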
Coefficients/p-values (each effect holds the other predictors constant):
Intercept: 57284.467 (p-value: <2e-16); predicted price for a Vistara business class flight with one stop and a duration of 0 hours (not realistic)
Duration: -39.106 (p-value: <2e-16); each additional hour of flight time is predicted to lower the price by about 39 rupees
Stops (two or more): 3669.37 (p-value: <2e-16); flights with two or more stops are predicted to cost 3,669 rupees more than one-stop flights
Stops (zero): -16239.354 (p-value: <2e-16); nonstop flights are predicted to cost 16,239 rupees less than one-stop flights
Class (Economy): -47988.176 (p-value: <2e-16); economy tickets are predicted to be 47,988 rupees cheaper than business tickets
All the p-values of the predictors are extremely small. This is strong evidence that each predictor is associated with price, although with almost 128,000 observations even small effects will be statistically significant.
R-Squared: The adjusted R-squared of the model is 0.8916, which means that around 89% of the variation in Vistara flight prices is explained by the model. This indicates a very strong model fit.
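The coefficients are all measured against a baseline of one-stop Business flights because R sorts the levels of a categorical predictor alphabetically and treats the first level as the reference. The sketch below uses a small made-up vector rather than the real data to show that ordering, and how `relevel()` could make nonstop flights the baseline instead. The re-leveling is an illustration only; it was not done in the analysis above.

```r
# Made-up values for illustration only.
stops <- factor(c("zero", "one", "two_or_more", "one", "zero"))

# Default levels are sorted alphabetically, so "one" comes first and
# becomes the reference category when the factor is used in lm().
default_levels <- levels(stops)
print(default_levels)

# relevel() moves a chosen level to the front, making it the new reference.
stops_zero_ref <- relevel(stops, ref = "zero")
releveled <- levels(stops_zero_ref)
print(releveled)
```

With nonstop as the reference, the stop coefficients would read as "extra cost relative to a direct flight", which some readers may find more natural. The fitted values and R-squared would not change, only the way the coefficients are expressed.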
Look at model’s diagnostic plots
autoplot(flights_model1, 1:4, nrow=2, ncol=2)
Check variation of price by individual variables to see which variable explains the most
duration_model <- lm(price ~ duration, data = vistara_flights)
summary(duration_model)
##
## Call:
## lm(formula = price ~ duration, data = vistara_flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30011 -23730 -15035 25010 92637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27493.70 157.91 174.11 <2e-16 ***
## duration 217.82 10.56 20.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25590 on 127857 degrees of freedom
## Multiple R-squared: 0.003316, Adjusted R-squared: 0.003308
## F-statistic: 425.3 on 1 and 127857 DF, p-value: < 2.2e-16
stops_model <- lm(price ~ stops, data = vistara_flights)
summary(stops_model)
##
## Call:
## lm(formula = price ~ stops, data = vistara_flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29124 -25498 -9913 23149 98456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32353.15 75.37 429.26 <2e-16 ***
## stopstwo_or_more -13502.38 322.62 -41.85 <2e-16 ***
## stopszero -15936.88 259.30 -61.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25130 on 127856 degrees of freedom
## Multiple R-squared: 0.03911, Adjusted R-squared: 0.03909
## F-statistic: 2602 on 2 and 127856 DF, p-value: < 2.2e-16
class_model <- lm(price ~ class, data = vistara_flights)
summary(class_model)
##
## Call:
## lm(formula = price ~ class, data = vistara_flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37873 -3190 -869 4331 67594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55477.03 38.69 1433.8 <2e-16 ***
## classEconomy -47670.08 53.34 -893.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9524 on 127857 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.862
## F-statistic: 7.987e+05 on 1 and 127857 DF, p-value: < 2.2e-16
By visualizing the data, I can identify which factors have the strongest associations with pricing and whether those relationships are consistent across categories.
I created a scatter plot to show the relationship between flight duration and price, coloring each flight by its number of stops. There is a clear horizontal gap in prices between roughly 25,000 and 30,000 rupees; the reason for the gap is explored in the next visualization.
ggplot(vistara_flights, aes(x = duration, y = price, color = stops)) +
geom_point(size = 2, alpha = 0.2) +
labs(title = "Vistara Flight Prices by Duration and Stops",
x = "Duration (hours)",
y = "Price (in Rupees)",
color = "Stops",
caption = "Source: EaseMyTrip.com") +
theme_bw() +
scale_color_manual(values = c("one" = "#8fceff", "two_or_more" = "#ff8fd5", "zero" = "#000000"))
I created an interactive box plot to display the distribution of flight prices across different flight classes and stop counts. This helps identify patterns and variations in pricing based on these categorical variables. You can hover over each box to see exact values, which makes it easier to compare pricing trends across flight classes and number of stops.
I followed an example of an interactive box plot (linked in references) to compose the code below.
Interactive box plot that shows the summary statistics of flight prices based on seat class and number of stops
plotly_flight <- plot_ly(data = vistara_flights,
x = ~class,
y = ~price,
color = ~stops,
type = "box",
colors = c("#f4df30", "#f40f9d", "#6deece")
)
plotly_flight <- plotly_flight |>
layout(
title = "Vistara Flights Price Distribution by Class and Number of Stops \n Source: EaseMyTrip.com",
xaxis = list(title = "Flight Class"),
yaxis = list(title = "Price (in rupees)"),
boxmode = "group",
legend = list(title = list(text = "# of Stops")))
plotly_flight
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'
Interpretations:
The research question was answered. Flight duration, number of stops, and flight class are all associated with the pricing of Vistara flights; flight class matters most, as the plots show a clear price gap between business and economy. My model was statistically a good fit, with an adjusted R-squared of about 0.89.
With the linear regression models focusing on just one indicator, my results came out to be:
Adjusted R-squared of model with only duration: 0.003308
Adjusted R-squared of model with only stops: 0.03909
Adjusted R-squared of model with only class: 0.862
This shows that flight class explains the most variation in ticket price in this dataset.
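The same comparison can be collected in one step instead of reading each summary separately. The sketch below is a hedged illustration on synthetic data: the `toy` data frame and every number used to generate it are invented, and only the pattern of fitting one single-predictor model per variable mirrors the analysis above.

```r
set.seed(1)
n <- 600

# Synthetic stand-in for vistara_flights; all values are invented.
toy <- data.frame(
  duration = runif(n, 1, 30),
  stops = sample(c("zero", "one", "two_or_more"), n, replace = TRUE),
  class = sample(c("Economy", "Business"), n, replace = TRUE)
)
# Price is driven mostly by class, with small stop and duration effects plus noise.
toy$price <- 8000 +
  47000 * (toy$class == "Business") +
  3000 * (toy$stops != "zero") +
  50 * toy$duration +
  rnorm(n, sd = 8000)

# Fit one single-predictor model per variable and collect adjusted R-squared.
predictors <- c("duration", "stops", "class")
adj_r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "price"), data = toy)
  summary(fit)$adj.r.squared
})

print(round(adj_r2, 3))
print(names(which.max(adj_r2)))  # name of the strongest single predictor
```

Swapping `toy` for the real data frame would reproduce the three adjusted R-squared values listed above in a single named vector.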
The summary statistics for the multiple linear regression model suggest that it is statistically strong (an adjusted R-squared of 0.8916 and small p-values for the predictors). However, the diagnostic plots show that it may not be as accurate as it seems. The Residuals vs Fitted plot shows a curved pattern, suggesting that there may be a non-linear relationship between the predictors and price. The Normal Q-Q plot shows that the residuals deviate from normality, which may affect the reliability of the model’s results. The Scale-Location plot shows that my model’s prediction errors aren’t evenly spread across all levels of predicted price, meaning it’s accurate at some predicted values but inaccurate at others. Lastly, the Cook’s Distance plot shows a few observations with slightly higher influence, but none seem to be very extreme. These plots suggest that while the model explains most of the variance in flight prices, it doesn’t fully meet the assumptions of linear regression, which affects how confidently we can interpret its estimates. In the future, I may refit the model using a log-transformed price variable to capture non-linear patterns.
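A log-transformed refit could look something like the sketch below. It runs on synthetic data (the `toy` data frame and all of its generating numbers are invented), so it only shows the mechanics of fitting `log(price)` and converting predictions back to rupees; it says nothing about how the real Vistara data would behave.

```r
set.seed(2)
n <- 500

# Synthetic data, invented for illustration: prices are generated on a
# multiplicative (log-linear) scale.
toy <- data.frame(
  duration = runif(n, 1, 30),
  stops = sample(c("zero", "one", "two_or_more"), n, replace = TRUE),
  class = sample(c("Economy", "Business"), n, replace = TRUE)
)
toy$price <- exp(8.8 +
                   2.0 * (toy$class == "Business") +
                   0.3 * (toy$stops != "zero") +
                   0.01 * toy$duration +
                   rnorm(n, sd = 0.3))

# Same predictors as the original model, but with log(price) as the response.
log_model <- lm(log(price) ~ duration + stops + class, data = toy)

# Predictions come back on the log scale; exp() returns them to rupees.
# This simple back-transform estimates a median-like value, not the mean.
new_flight <- data.frame(duration = 2.25, stops = "zero", class = "Economy")
predicted_price <- unname(exp(predict(log_model, newdata = new_flight)))
print(predicted_price)

# On the log scale, exp(coefficient) is a price ratio rather than a
# rupee difference.
print(round(exp(coef(log_model)), 3))
```

One practical consequence of this form is that back-transformed predictions are always positive, and effects such as class are expressed as ratios (for example, "economy costs a fraction of business") rather than fixed rupee differences.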
The scatter plot of flight prices against duration and number of stops provides additional insight into the relationship between flight duration and cost. Flights with no stops are clustered in the lower-left portion of the graph, showing that direct flights are typically shorter and less expensive. On the other hand, flights with one or more stops cover a wide range of durations and prices, suggesting that there’s little correlation between number of stops and duration. The relationship between duration and price is also not exactly linear, as some longer flights are priced about the same as shorter ones. This variation suggests that while flight duration plays a role in pricing, other factors may be stronger indicators. Together, the plots support the conclusion that flight class, number of stops, and duration are all associated with Vistara flight pricing. It was surprising to see that longer flight durations did not always correspond to higher prices.
The box plot of flight prices by flight class and number of stops reveals that business class flights are consistently priced higher than economy class flights, indicating that flight class is a major indicator of ticket cost. The median prices for business flights are far higher than the median prices for economy flights, regardless of how many stops the flight has. For both classes, flights with two or more stops have the highest median price of the three stop categories.
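The medians read off the box plot can also be computed directly as a table. Here is a minimal sketch using base R's `aggregate()`; the `toy` data frame and its prices are made up for illustration and are not taken from the dataset.

```r
# Made-up prices for illustration only; not taken from the dataset.
toy <- data.frame(
  class = c("Business", "Business", "Business", "Business",
            "Economy", "Economy", "Economy", "Economy"),
  stops = c("one", "one", "zero", "zero",
            "one", "one", "zero", "zero"),
  price = c(60000, 56000, 40000, 44000,
            9000, 7000, 4000, 5000)
)

# Median price for every class/stops combination.
group_medians <- aggregate(price ~ class + stops, data = toy, FUN = median)
print(group_medians)
```

Running the same call on the real data frame would give one row per class and stop count, which makes the comparison easy to quote without hovering over each box.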
In the future, I want to explore additional variables like departure time, discounts, seasons, and destination popularity, and to analyze other airlines. These variables will give me a more detailed understanding of what impacts flight pricing.
References: