November 08, 2024

Introduction

Flight Cost Prediction

  • An analysis of how well flight ticket prices can be predicted from flight details.

Data Overview

A Quick Look at the Dataset
X  airline   flight   source_city  departure_time  stops  arrival_time   destination_city  class    duration  days_left  price
0  SpiceJet  SG-8709  Delhi        Evening         zero   Night          Mumbai            Economy  2.17      1          5953
1  SpiceJet  SG-8157  Delhi        Early_Morning   zero   Morning        Mumbai            Economy  2.33      1          5953
2  AirAsia   I5-764   Delhi        Early_Morning   zero   Early_Morning  Mumbai            Economy  2.17      1          5956
3  Vistara   UK-995   Delhi        Morning         zero   Afternoon      Mumbai            Economy  2.25      1          5955
4  Vistara   UK-963   Delhi        Morning         zero   Morning        Mumbai            Economy  2.33      1          5955
5  Vistara   UK-945   Delhi        Morning         zero   Afternoon      Mumbai            Economy  2.33      1          5955

Summary of all the data

Overall Summary of the Dataset
Numeric columns:

Statistic  X       duration  days_left  price
Min.       0       0.83      1          1105
1st Qu.    75038   6.83      15         4783
Median     150076  11.25     26         7425
Mean       150076  12.22     26         20890
3rd Qu.    225114  16.17     38         42521
Max.       300152  49.83     49         123071

The remaining eight columns (airline, flight, source_city, departure_time, stops, arrival_time, destination_city, class) are character columns of length 300153.
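
The preview and summary above look like the output of head() and summary() in base R. As a rough sketch, the six preview rows can be loaded into a data frame and summarised the same way. The data frame name preview is a placeholder, and the source does not show how the full file was read.

```r
# Rebuild the six preview rows shown in the deck (not the full 300153-row dataset).
preview <- read.table(text = "
X airline flight source_city departure_time stops arrival_time destination_city class duration days_left price
0 SpiceJet SG-8709 Delhi Evening zero Night Mumbai Economy 2.17 1 5953
1 SpiceJet SG-8157 Delhi Early_Morning zero Morning Mumbai Economy 2.33 1 5953
2 AirAsia I5-764 Delhi Early_Morning zero Early_Morning Mumbai Economy 2.17 1 5956
3 Vistara UK-995 Delhi Morning zero Afternoon Mumbai Economy 2.25 1 5955
4 Vistara UK-963 Delhi Morning zero Morning Mumbai Economy 2.33 1 5955
5 Vistara UK-945 Delhi Morning zero Afternoon Mumbai Economy 2.33 1 5955
", header = TRUE, stringsAsFactors = FALSE)

# First rows, then per-column summaries (quartiles for numeric columns,
# length/class/mode for character columns).
print(head(preview))
print(summary(preview))
```

On the full dataset the same two calls would give the tables shown on these slides, assuming the file was read with character columns left as strings.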

Exploratory Data Analysis

  • Initial exploration of variables influencing flight costs.

Linear Regression Analysis

  • Predicting flight cost based on days_left and duration.
Linear Regression Model Summary
term         estimate    std.error   statistic  p.value
(Intercept)  16799.5997  113.070271  148.57663  0
days_left    -140.7304   2.981942    -47.19421  0
duration     634.1303    5.622655    112.78128  0
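
A table of this shape can be produced by fitting a linear model with lm() and extracting its coefficient matrix. The sketch below uses simulated data as a stand-in, so its numbers will not match the table; the data frame name flights and the simulation rule are assumptions made for illustration. The column names in the table (term, estimate, std.error, statistic, p.value) resemble the output of broom::tidy(), but the source does not show which function was used.

```r
# Simulated stand-in for the flight data: values are NOT from the real dataset.
set.seed(42)
n <- 500
flights <- data.frame(
  days_left = sample(1:49, n, replace = TRUE),
  duration  = runif(n, min = 0.83, max = 49.83)
)
# Arbitrary linear rule plus noise, for illustration only.
flights$price <- 15000 - 100 * flights$days_left +
  500 * flights$duration + rnorm(n, sd = 2000)

# Fit price on days_left and duration, as in the slide.
fit <- lm(price ~ days_left + duration, data = flights)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(fit)$coefficients
print(coef_table)
```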

Correlation Plot

  • Studying the relationship between duration and price by airline.

Distribution of Price

Linear Regression Equation

  • The formula for the predicted price in terms of days_left and duration:

Model Equation

The linear regression model can be written as: \[ \text{Price} = \beta_0 + \beta_1 \times \text{Days Left} + \beta_2 \times \text{Duration} + \epsilon \]

Where:

  • \(\beta_0\) is the intercept, representing the baseline price when both predictors (Days Left and Duration) are zero.

  • \(\beta_1\) is the coefficient for Days Left, showing the expected change in price for a one-day increase in days left, holding Duration constant.

  • \(\beta_2\) is the coefficient for Duration, showing the expected change in price for each additional hour of flight duration, holding Days Left constant.

  • \(\epsilon\) is the error term, accounting for variability not explained by the model.
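
As an illustration, substituting the estimates from the Linear Regression Model Summary table (rounded to two decimals) gives the fitted equation: \[ \widehat{\text{Price}} = 16799.60 - 140.73 \times \text{Days Left} + 634.13 \times \text{Duration} \]

Read this way, each additional day before departure is associated with a predicted price about 140.73 lower, and each additional hour of duration with a predicted price about 634.13 higher, in the price units of the dataset.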

Residuals

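The residual \(e_i\) for each observation \(i\) is the difference between the actual price and the predicted price: \[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 \times x_{1,i} + \hat{\beta}_2 \times x_{2,i}) \] where \(y_i\) is the observed price, \(\hat{y}_i\) is the predicted price, \(x_{1,i}\) and \(x_{2,i}\) are the Days Left and Duration values for observation \(i\), and \(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2\) are the estimated coefficients.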
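
A minimal sketch of this calculation for a single observation, using the coefficient estimates from the model summary table and the first row of the data preview (days_left = 1, duration = 2.17, price = 5953). The variable names are placeholders chosen for this example.

```r
# Coefficient estimates copied from the regression summary table.
beta <- c(intercept = 16799.5997, days_left = -140.7304, duration = 634.1303)

# First row of the data preview.
obs <- list(days_left = 1, duration = 2.17, price = 5953)

# Predicted price: intercept + beta1 * days_left + beta2 * duration.
predicted <- unname(beta["intercept"] +
  beta["days_left"] * obs$days_left +
  beta["duration"] * obs$duration)

# Residual: observed price minus predicted price.
residual <- obs$price - predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

The predicted price comes out near 18035 and the residual near -12082, so for this Economy fare the two-predictor model overshoots the observed price considerably.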

Hypothesis Testing

Hypothesis Testing for Model Coefficients
term         estimate    std.error   statistic  p.value
(Intercept)  16799.5997  113.070271  148.57663  0
days_left    -140.7304   2.981942    -47.19421  0
duration     634.1303    5.622655    112.78128  0

Hypothesis Testing

Hypothesis test to assess if days_left significantly affects price.

\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \]

For \(\beta_1\) (Days Left) and \(\beta_2\) (Duration):

  • Null Hypothesis \(H_0\): \(\beta_j = 0\) (predictor \(j\) has no effect on price)

  • Alternative Hypothesis \(H_1\): \(\beta_j \neq 0\) (predictor \(j\) has a nonzero effect on price)

The p-values obtained in the regression output help us determine whether we can reject the null hypothesis for each coefficient.

  • Compare the p-values for days_left and duration against the 0.05 significance level. A p-value below 0.05 means we reject \(H_0\) and conclude that the predictor significantly affects ticket price.

  • In this model:

    • The \(p\)-value for Days Left is reported as 0 (t statistic -47.19), which is below 0.05: Days Left significantly affects ticket price.

    • The \(p\)-value for Duration is reported as 0 (t statistic 112.78), which is below 0.05: Duration significantly affects ticket price.
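
The test statistics in the coefficient table can be reproduced by hand as estimate divided by standard error, with a two-sided p-value from the t distribution. The sketch below assumes the model was fit on all 300153 rows, so the residual degrees of freedom are taken as 300153 - 3; the source does not state this explicitly.

```r
# Estimates and standard errors copied from the coefficient table.
estimate  <- c(days_left = -140.7304, duration = 634.1303)
std_error <- c(days_left = 2.981942, duration = 5.622655)

# Assumption: all 300153 rows were used and three coefficients were estimated.
n_obs <- 300153
df_resid <- n_obs - 3

# t statistic for H0: beta_j = 0.
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Decision at the 0.05 significance level.
reject_h0 <- p_value < 0.05
print(data.frame(t_stat, p_value, reject_h0))
```

With t statistics this far from zero, the p-values are vanishingly small, which is why the table prints them as 0.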

R-code Output

Code to create a ggplot of days_left vs. price:

library(ggplot2)

# Scatter plot of price against days_left with a fitted linear trend line
ggplot(data, aes(x = days_left, y = price)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Days Left Until Flight vs Price")