Image by shealingreer from Pinterest

As summer approaches and travel activity picks up, I’m interested in understanding what drives flight prices in other countries. My research question was: What impact do flight duration, number of stops, and flight class have on the pricing of Vistara flights? This project focuses specifically on Vistara airline ticket prices in India. The dataset was sourced from EaseMyTrip.com, a popular travel booking platform in India. The company was founded in 2008 by three brothers and has grown to become the second-largest online travel platform in India. The flight data is compiled from recorded flight listings and bookings, so the prices reflect real-world values across different routes and travel conditions. The project aims to uncover patterns in these prices and build a model that estimates ticket prices.

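Variables I Used:

- airline: Name of the airline.
- stops: Number of stops (zero, one, or two_or_more).
- class: Seat class (Economy or Business).
- duration: Flight duration in hours.
- price: Ticket price in rupees.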

First, I loaded my libraries and dataset. Then I checked which airlines appear in the dataset so I could choose one to focus on; I decided on Vistara. Next, I created a new data frame with the variables I wanted to study and fit a multiple linear regression model to it, along with diagnostic plots, to see whether my chosen predictors are significant in explaining price variation. I then fit three more linear regression models, each using a single predictor, to see which one explains the most variation in price. Finally, I created a scatter plot to show the spread in prices by duration and stops, and an interactive box plot to show how flight class affects price.

Load libraries and set working directory
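The code for this step is not shown on the page. A minimal sketch, assuming the package list can be inferred from the functions called later in the report (`select()`, `filter()`, and `arrange()` from dplyr, `ggplot()` from ggplot2, `autoplot()` on an `lm` object from ggfortify, and `plot_ly()` from plotly). The working directory is unknown, so it appears only as a commented, hypothetical placeholder:

```r
# Assumed package list, inferred from the functions used in this report
library(dplyr)      # select(), filter(), arrange()
library(ggplot2)    # ggplot() scatter plot
library(ggfortify)  # autoplot() method for lm diagnostic plots
library(plotly)     # plot_ly() interactive box plot

# Hypothetical placeholder: point this at the folder holding flight_prices.csv
# setwd("path/to/project")
```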

Load the dataset, inspect it, and check the airline names to choose which one to filter on

flight_prices <- read.csv("flight_prices.csv")

head(flight_prices)
##   X  airline  flight source_city departure_time stops  arrival_time
## 1 0 SpiceJet SG-8709       Delhi        Evening  zero         Night
## 2 1 SpiceJet SG-8157       Delhi  Early_Morning  zero       Morning
## 3 2  AirAsia  I5-764       Delhi  Early_Morning  zero Early_Morning
## 4 3  Vistara  UK-995       Delhi        Morning  zero     Afternoon
## 5 4  Vistara  UK-963       Delhi        Morning  zero       Morning
## 6 5  Vistara  UK-945       Delhi        Morning  zero     Afternoon
##   destination_city   class duration days_left price
## 1           Mumbai Economy     2.17         1  5953
## 2           Mumbai Economy     2.33         1  5953
## 3           Mumbai Economy     2.17         1  5956
## 4           Mumbai Economy     2.25         1  5955
## 5           Mumbai Economy     2.33         1  5955
## 6           Mumbai Economy     2.33         1  5955
unique(flight_prices$airline)
## [1] "SpiceJet"  "AirAsia"   "Vistara"   "GO_FIRST"  "Indigo"    "Air_India"

Create a new data frame filtered to Vistara flights that keeps the airline, stops, class, duration, and price columns. Sort by price in descending order to see the most expensive flights.

vistara_flights <- flight_prices |>
  select(airline, stops, class, duration, price) |>
  filter(airline == "Vistara") |>
  arrange(desc(price))

head(vistara_flights)
##   airline       stops    class duration  price
## 1 Vistara         one Business    13.50 123071
## 2 Vistara two_or_more Business    10.92 117307
## 3 Vistara two_or_more Business    21.08 116562
## 4 Vistara         one Business    16.42 115211
## 5 Vistara         one Business     9.50 114705
## 6 Vistara         one Business    15.08 114704

Create multiple linear regression model to see variation of price based on duration, stops, and class

flights_model1 <- lm(price ~ duration + stops + class, data = vistara_flights)
summary(flights_model1)
## 
## Call:
## lm(formula = price ~ duration + stops + class, data = vistara_flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27254  -4121  -1707   3861  66314 
## 
## Coefficients:
##                    Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)       57284.467     66.472   861.779   <2e-16 ***
## duration            -39.106      3.979    -9.829   <2e-16 ***
## stopstwo_or_more   3669.370    109.695    33.451   <2e-16 ***
## stopszero        -16239.354     99.371  -163.422   <2e-16 ***
## classEconomy     -47988.176     47.860 -1002.684   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8439 on 127854 degrees of freedom
## Multiple R-squared:  0.8916, Adjusted R-squared:  0.8916 
## F-statistic: 2.63e+05 on 4 and 127854 DF,  p-value: < 2.2e-16

Equation of model:

Estimated Price = 57,284.47 - 39.11 * Duration + 3,669.37 * Stops(two or more) - 16,239.35 * Stops(zero) - 47,988.18 * Class(Economy)

Coefficients/p-values:

Intercept: 57284.467 (p-value: <2e-16); the predicted price for a Vistara business class flight with one stop and a duration of 0 hours (not a realistic flight)

Duration: -39.106 (p-value: <2e-16); each additional hour of flight time is predicted to lower the price by about 39 rupees, holding stops and class constant

Stops (two or more): 3669.37 (p-value: <2e-16); a flight with two or more stops is predicted to cost about 3,669 rupees more than a one-stop flight

Stops (zero): -16239.354 (p-value: <2e-16); a nonstop flight is predicted to cost about 16,239 rupees less than a one-stop flight

Class (Economy): -47988.176 (p-value: <2e-16); an economy ticket is predicted to be about 47,988 rupees cheaper than a business ticket

All the p-values of the predictors are extremely small. This is strong evidence that each predictor is associated with price, although with nearly 128,000 observations even small effects reach statistical significance.

R-squared: The adjusted R-squared of the model is 0.8916, meaning that around 89% of the variation in Vistara flight prices is explained by the model. This is a very strong model fit.
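To make the fitted equation concrete, it can be evaluated by hand for individual flights. The sketch below copies the coefficients from the model summary; `predict_price()` is a hypothetical helper written for this illustration, not part of the original analysis:

```r
# Coefficients copied from summary(flights_model1) above
coefs <- c(intercept = 57284.467, duration = -39.106,
           stops_two_or_more = 3669.370, stops_zero = -16239.354,
           class_economy = -47988.176)

# Hypothetical helper: evaluates the fitted equation for one flight.
# The baseline categories are one stop and Business class.
predict_price <- function(duration, stops, class) {
  unname(
    coefs["intercept"] +
      coefs["duration"] * duration +
      coefs["stops_two_or_more"] * (stops == "two_or_more") +
      coefs["stops_zero"] * (stops == "zero") +
      coefs["class_economy"] * (class == "Economy")
  )
}

# Business, one stop, 13.5 hours (the most expensive listing, actual price 123071)
predict_price(13.5, "one", "Business")   # about 56,757 rupees

# Economy, nonstop, 2.25 hours (UK-995 in head(flight_prices), actual price 5955)
predict_price(2.25, "zero", "Economy")   # about -7,031 rupees
```

The negative prediction for a short nonstop economy flight shows that this additive model can produce implausible values for some combinations of predictors, which is consistent with the concerns raised by the diagnostic plots.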

Look at model’s diagnostic plots

autoplot(flights_model1, 1:4, nrow=2, ncol=2)

Check variation of price by individual variables to see which variable explains the most

duration_model <- lm(price ~ duration, data = vistara_flights)
summary(duration_model)
## 
## Call:
## lm(formula = price ~ duration, data = vistara_flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30011 -23730 -15035  25010  92637 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27493.70     157.91  174.11   <2e-16 ***
## duration      217.82      10.56   20.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25590 on 127857 degrees of freedom
## Multiple R-squared:  0.003316,   Adjusted R-squared:  0.003308 
## F-statistic: 425.3 on 1 and 127857 DF,  p-value: < 2.2e-16
stops_model <- lm(price ~ stops, data = vistara_flights)
summary(stops_model)
## 
## Call:
## lm(formula = price ~ stops, data = vistara_flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29124 -25498  -9913  23149  98456 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       32353.15      75.37  429.26   <2e-16 ***
## stopstwo_or_more -13502.38     322.62  -41.85   <2e-16 ***
## stopszero        -15936.88     259.30  -61.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25130 on 127856 degrees of freedom
## Multiple R-squared:  0.03911,    Adjusted R-squared:  0.03909 
## F-statistic:  2602 on 2 and 127856 DF,  p-value: < 2.2e-16
class_model <- lm(price ~ class, data = vistara_flights)
summary(class_model)
## 
## Call:
## lm(formula = price ~ class, data = vistara_flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37873  -3190   -869   4331  67594 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   55477.03      38.69  1433.8   <2e-16 ***
## classEconomy -47670.08      53.34  -893.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9524 on 127857 degrees of freedom
## Multiple R-squared:  0.862,  Adjusted R-squared:  0.862 
## F-statistic: 7.987e+05 on 1 and 127857 DF,  p-value: < 2.2e-16

By visualizing the data, I can identify which factors have the strongest associations with pricing and if those relationships are consistent across categories.

I created a scatter plot to show the relationship between flight duration and price, with points colored by the number of stops. There is a clear horizontal gap in prices at roughly 25,000-30,000 rupees; the reason for the gap is explored in the next visualization.

ggplot(vistara_flights, aes(x = duration, y = price, color = stops)) +
  geom_point(size = 2, alpha = 0.2) +
  labs(title = "Vistara Flight Prices by Duration and Stops",
    x = "Duration (hours)",
    y = "Price (in Rupees)",
    color = "Stops",
    caption = "Source: EaseMyTrip.com") +
  theme_bw() +
  scale_color_manual(values = c("one" = "#8fceff", "two_or_more" = "#ff8fd5", "zero" = "#000000"))

I created an interactive box plot to display the distribution of flight prices across flight classes and stop counts. This helps identify patterns and variations in pricing based on these categorical variables. You can hover over each box to see exact values, which makes it easier to compare pricing trends across flight classes and number of stops.

I followed an example of an interactive box-plot (linked in references) in order to compose the code below.

Interactive box plot that shows the summary statistics of flight prices based on seat class and number of stops

plotly_flight <- plot_ly(data = vistara_flights,
  x = ~class,
  y = ~price,
  color = ~stops,
  type = "box",
  colors = c("#f4df30", "#f40f9d", "#6deece")
)

plotly_flight <- plotly_flight |>
  layout(
    title = "Vistara Flights Price Distribution by Class and Number of Stops \n Source: EaseMyTrip.com",
    xaxis = list(title = "Flight Class"),
    yaxis = list(title = "Price (in rupees)"),
    boxmode = "group",
    legend = list(title = list(text = "# of Stops")))

plotly_flight
## Warning: 'layout' objects don't have these attributes: 'boxmode'
## Valid attributes include:
## '_deprecated', 'activeshape', 'annotations', 'autosize', 'autotypenumbers', 'calendar', 'clickmode', 'coloraxis', 'colorscale', 'colorway', 'computed', 'datarevision', 'dragmode', 'editrevision', 'editType', 'font', 'geo', 'grid', 'height', 'hidesources', 'hoverdistance', 'hoverlabel', 'hovermode', 'images', 'legend', 'mapbox', 'margin', 'meta', 'metasrc', 'modebar', 'newshape', 'paper_bgcolor', 'plot_bgcolor', 'polar', 'scene', 'selectdirection', 'selectionrevision', 'separators', 'shapes', 'showlegend', 'sliders', 'smith', 'spikedistance', 'template', 'ternary', 'title', 'transition', 'uirevision', 'uniformtext', 'updatemenus', 'width', 'xaxis', 'yaxis', 'barmode', 'bargap', 'mapType'

Interpretations:

The research question was answered: the duration of the flight, number of stops, and flight class are all associated with the pricing of Vistara flights. Flight class matters the most, as the plots show a clear price gap between business and economy tickets. My model was a statistically good fit, with an adjusted R-squared of about 0.89.

For the linear regression models that each use just one predictor, my results were:

Adjusted R-squared of model with only duration: 0.003308

Adjusted R-squared of model with only stops: 0.03909

Adjusted R-squared of model with only class: 0.862

This shows that flight class explains the most variation in ticket price in this dataset.
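The comparison can also be written as a short calculation. This sketch only reuses the adjusted R-squared values reported in the model summaries; the variable names are introduced here for illustration:

```r
# Adjusted R-squared values copied from the model summaries above
adj_r2_single <- c(duration = 0.003308, stops = 0.03909, class = 0.862)
adj_r2_full   <- 0.8916

# Strongest single predictor
best <- names(which.max(adj_r2_single))
best            # "class"

# Gain in adjusted R-squared from adding duration and stops to class
gain <- adj_r2_full - adj_r2_single[["class"]]
round(gain, 4)  # 0.0296
```

In other words, once class is in the model, adding duration and stops raises the adjusted R-squared by only about 0.03.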

The summary of the multiple linear regression model suggests that it is statistically strong (an adjusted R-squared of 0.8916 and small p-values for all predictors). However, the diagnostic plots show that it may not be as accurate as it seems. The Residuals vs Fitted plot shows a curved pattern, suggesting that there may be a non-linear relationship between the predictors and price. The Normal Q-Q plot shows that the residuals deviate from normality, which may affect the reliability of the model’s results. The Scale-Location plot shows that the model’s prediction errors aren’t evenly spread across all levels of predicted price, meaning it is accurate at some predicted values but inaccurate at others. Lastly, the Cook’s Distance plot shows a few observations with slightly higher influence, but none seem to be very extreme. These plots suggest that while the model explains most of the variance in flight prices, it doesn’t fully meet the assumptions of linear regression, which limits how confidently we can interpret its estimates. In the future, I may refit the model using a log-transformed price variable to capture non-linear patterns.
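A log-transformed refit might look like the sketch below. It runs on synthetic toy data generated only so the example is self-contained; the numbers are made up and are not drawn from the EaseMyTrip data, so the printed values say nothing about real Vistara prices:

```r
# Synthetic toy data for illustration only; NOT drawn from the EaseMyTrip data
set.seed(42)
toy <- expand.grid(
  stops = c("one", "two_or_more", "zero"),
  class = c("Business", "Economy"),
  rep   = 1:5
)
toy$duration <- runif(nrow(toy), min = 2, max = 20)
toy$price <- exp(
  10.9 - 2.1 * (toy$class == "Economy") - 0.3 * (toy$stops == "zero") +
    0.01 * toy$duration + rnorm(nrow(toy), sd = 0.1)
)

# Same predictors as flights_model1, but with log(price) as the response
log_model <- lm(log(price) ~ duration + stops + class, data = toy)

# On the log scale, exp(coefficient) - 1 is the estimated proportional
# change in price relative to the baseline category
pct_change <- (exp(coef(log_model)) - 1) * 100
round(pct_change[c("stopszero", "classEconomy")], 1)
```

With the real data, the corresponding call would be `lm(log(price) ~ duration + stops + class, data = vistara_flights)`, and the coefficients would then describe percentage differences in price rather than differences in rupees.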

The scatter plot of flight prices against duration and number of stops provides additional insight into the relationship between flight duration and cost. Flights with no stops are clustered in the lower left portion of the graph, showing that direct flights are typically shorter and less expensive. On the other hand, flights with one or more stops have a wide range of durations and prices, showing that there’s little correlation between number of stops and duration. However, the relationship between duration and price is not exactly linear, as some longer flights are priced about the same as shorter ones. This variation suggests that while flight duration plays a role in pricing, other factors may be stronger indicators. Together, the plots support the conclusion that flight class, number of stops, and duration all significantly influence Vistara flight pricing. It was surprising to see that longer flight durations did not always correspond to higher prices.

The box plot of flight prices by flight class and number of stops reveals that business class flights are consistently priced higher than economy class flights, indicating that flight class is a major indicator of ticket cost. The median prices for business flights are far higher than the median prices for economy flights regardless of how many stops the flight has. For both classes, flights with two or more stops have the highest median price.
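The medians shown in the box plot can also be computed as a table. This sketch uses only the nine Vistara rows printed earlier in the report (three from `head(flight_prices)` and six from `head(vistara_flights)`), so it is far too small and too skewed toward the most expensive flights to support any conclusion; it only shows the mechanics. The name `sample_rows` is introduced here for illustration:

```r
# Rows copied from the printed output above: the three Vistara rows in
# head(flight_prices) and the six rows in head(vistara_flights).
# This sample is tiny and not representative of the full data.
sample_rows <- data.frame(
  stops = c("zero", "zero", "zero", "one", "two_or_more", "two_or_more",
            "one", "one", "one"),
  class = c(rep("Economy", 3), rep("Business", 6)),
  price = c(5955, 5955, 5955, 123071, 117307, 116562, 115211, 114705, 114704)
)

# Median price for each class/stops combination present in the data
medians <- aggregate(price ~ class + stops, data = sample_rows, FUN = median)
medians
```

On the full data, replacing `sample_rows` with `vistara_flights` should give the same medians that the interactive box plot displays on hover.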

In the future, I want to explore additional variables such as departure time, discounts, seasons, and destination popularity, and to analyze other airlines. These variables would give me a more detailed understanding of what affects flight pricing.

References:

  1. https://www.kaggle.com/datasets/shubhambathwal/flight-price-prediction?select=Clean_Dataset.csv

  2. https://www.easemytrip.com/about-us.html

  3. https://plotly.com/r/box-plots/