Flight Cost Prediction
- An analysis on predicting flight ticket prices based on flight details.
November 08, 2024
| X | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SpiceJet | SG-8709 | Delhi | Evening | zero | Night | Mumbai | Economy | 2.17 | 1 | 5953 |
| 1 | SpiceJet | SG-8157 | Delhi | Early_Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5953 |
| 2 | AirAsia | I5-764 | Delhi | Early_Morning | zero | Early_Morning | Mumbai | Economy | 2.17 | 1 | 5956 |
| 3 | Vistara | UK-995 | Delhi | Morning | zero | Afternoon | Mumbai | Economy | 2.25 | 1 | 5955 |
| 4 | Vistara | UK-963 | Delhi | Morning | zero | Morning | Mumbai | Economy | 2.33 | 1 | 5955 |
| 5 | Vistara | UK-945 | Delhi | Morning | zero | Afternoon | Mumbai | Economy | 2.33 | 1 | 5955 |
| X | airline | flight | source_city | departure_time | stops | arrival_time | destination_city | class | duration | days_left | price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0 | Length:300153 | Length:300153 | Length:300153 | Length:300153 | Length:300153 | Length:300153 | Length:300153 | Length:300153 | Min. : 0.83 | Min. : 1 | Min. : 1105 |
| 1st Qu.: 75038 | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 6.83 | 1st Qu.:15 | 1st Qu.: 4783 |
| Median :150076 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :11.25 | Median :26 | Median : 7425 |
| Mean :150076 | NA | NA | NA | NA | NA | NA | NA | NA | Mean :12.22 | Mean :26 | Mean : 20890 |
| 3rd Qu.:225114 | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:16.17 | 3rd Qu.:38 | 3rd Qu.: 42521 |
| Max. :300152 | NA | NA | NA | NA | NA | NA | NA | NA | Max. :49.83 | Max. :49 | Max. :123071 |
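The two tables above have the shape of `head()` and `summary()` output. As a minimal sketch, the snippet below rebuilds the six rows shown in the first table and summarises their numeric columns. The name `flights_head` is an assumption made for illustration, and the six rows are only a stand-in for the full 300,153-row dataset, so the printed summary will not match the table above.

```r
# Six rows copied from the preview table; `flights_head` is a hypothetical name.
flights_head <- data.frame(
  airline          = c("SpiceJet", "SpiceJet", "AirAsia", "Vistara", "Vistara", "Vistara"),
  flight           = c("SG-8709", "SG-8157", "I5-764", "UK-995", "UK-963", "UK-945"),
  source_city      = "Delhi",
  departure_time   = c("Evening", "Early_Morning", "Early_Morning", "Morning", "Morning", "Morning"),
  stops            = "zero",
  arrival_time     = c("Night", "Morning", "Early_Morning", "Afternoon", "Morning", "Afternoon"),
  destination_city = "Mumbai",
  class            = "Economy",
  duration         = c(2.17, 2.33, 2.17, 2.25, 2.33, 2.33),
  days_left        = 1,
  price            = c(5953, 5953, 5956, 5955, 5955, 5955),
  stringsAsFactors = FALSE
)

# Preview the rows, then summarise only the numeric columns.
print(head(flights_head))
numeric_summary <- summary(flights_head[, c("duration", "days_left", "price")])
print(numeric_summary)
```

Note that `days_left` is constant in these six rows, so they cannot be used on their own to estimate its effect on price.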
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 16799.5997 | 113.070271 | 148.57663 | 0 |
| days_left | -140.7304 | 2.981942 | -47.19421 | 0 |
| duration | 634.1303 | 5.622655 | 112.78128 | 0 |
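A coefficient table of this kind can be produced by fitting `price` on `days_left` and `duration` with `lm()`. The sketch below uses simulated data, since the full dataset is not reproduced here: the data frame `flights_sim`, the sample size, the noise level, and the generating coefficients are all assumptions chosen only to resemble the reported estimates.

```r
# Simulated stand-in data (hypothetical); not the real flight dataset.
set.seed(42)
n <- 500
flights_sim <- data.frame(
  days_left = sample(1:49, n, replace = TRUE),
  duration  = runif(n, min = 0.83, max = 49.83)
)

# Assumed generating model, loosely based on the reported estimates, plus noise.
flights_sim$price <- 16800 - 140 * flights_sim$days_left +
  634 * flights_sim$duration + rnorm(n, sd = 5000)

# Fit the two-predictor linear model and extract the coefficient table.
fit <- lm(price ~ days_left + duration, data = flights_sim)
coef_table <- summary(fit)$coefficients
print(coef_table)
```

The base R table uses the column names `Estimate`, `Std. Error`, `t value`, and `Pr(>|t|)`, which correspond to `estimate`, `std.error`, `statistic`, and `p.value` in the table above.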
The linear regression model can be written as: \[ \text{Price} = \beta_0 + \beta_1 \times \text{Days Left} + \beta_2 \times \text{Duration} + \epsilon \]
Where:

- \(\beta_0\) is the intercept, representing the baseline price when both predictors (Days Left and Duration) are zero.
- \(\beta_1\) is the coefficient for Days Left, showing the expected change in price for a one-day increase in days left, holding Duration fixed.
- \(\beta_2\) is the coefficient for Duration, showing the expected change in price for each additional hour of flight duration, holding Days Left fixed.
- \(\epsilon\) is the error term, accounting for variability not explained by the model.
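To make the coefficient interpretation concrete, the sketch below plugs the estimates from the regression table into the model equation. The helper `predict_price()` and the example inputs (10 days left, a 2-hour flight) are hypothetical and chosen only for illustration.

```r
# Coefficient estimates copied from the regression table.
b0 <- 16799.5997   # intercept
b1 <- -140.7304    # days_left
b2 <- 634.1303     # duration

# Hypothetical helper: predicted price from the fitted equation.
predict_price <- function(days_left, duration) {
  b0 + b1 * days_left + b2 * duration
}

base_price <- predict_price(days_left = 10, duration = 2)

# Changing one predictor by one unit shifts the prediction by its coefficient.
effect_one_day  <- predict_price(11, 2) - base_price
effect_one_hour <- predict_price(10, 3) - base_price

print(c(base = base_price, one_more_day = effect_one_day, one_more_hour = effect_one_hour))
```

Under these estimates, booking one day earlier lowers the predicted price by about 140.73, and each extra hour of flight time raises it by about 634.13.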
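The residual \(e_i\) for each observation \(i\) is the difference between the actual price and the predicted price, where the prediction uses the estimated coefficients \(\hat{\beta}_j\): \[ e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 \times x_{1,i} + \hat{\beta}_2 \times x_{2,i}) \] where \(y_i\) is the observed price, \(\hat{y}_i\) is the predicted price, and \(x_{1,i}\) and \(x_{2,i}\) are the Days Left and Duration values for observation \(i\).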
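As a worked example of the residual formula, the sketch below uses the reported coefficient estimates and the first row of the data preview (SpiceJet SG-8709: 1 day left, duration 2.17 hours, price 5953). The variable names are chosen for illustration only.

```r
# Coefficient estimates copied from the regression table.
b <- c(intercept = 16799.5997, days_left = -140.7304, duration = 634.1303)

# First row of the data preview.
obs <- list(days_left = 1, duration = 2.17, price = 5953)

# Predicted price, then residual = observed - predicted.
y_hat <- b[["intercept"]] + b[["days_left"]] * obs$days_left +
  b[["duration"]] * obs$duration
residual <- obs$price - y_hat

print(c(predicted = y_hat, residual = residual))
```

The predicted price is about 18034.93, so the residual is about -12081.93: for this Economy fare, the two-predictor model overpredicts by a wide margin.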
We use a hypothesis test to assess whether days_left has a statistically significant effect on price:
\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_a: \beta_1 \neq 0 \]
More generally, for each coefficient \(\beta_j\), with \(j = 1\) (Days Left) and \(j = 2\) (Duration):

- Null Hypothesis \(H_0\): \(\beta_j = 0\) (predictor \(j\) has no linear effect on price)
- Alternative Hypothesis \(H_a\): \(\beta_j \neq 0\) (predictor \(j\) has a nonzero linear effect on price)
The p-values in the regression output determine whether we can reject the null hypothesis for each coefficient. At the 0.05 significance level, we reject \(H_0\) for a predictor when its p-value is below 0.05. In the regression table, the p-values for both days_left and duration are reported as 0, meaning they are smaller than the displayed precision, so:

- Days Left: \(p\)-value < 0.05, so we reject \(H_0\) and conclude that Days Left significantly affects ticket price.
- Duration: \(p\)-value < 0.05, so we reject \(H_0\) and conclude that Duration significantly affects ticket price.
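The test statistics and p-values can be reproduced from the reported estimates and standard errors. This sketch assumes all 300,153 rows were used in the fit, giving 300153 - 3 residual degrees of freedom; the variable names are illustrative.

```r
# Estimates and standard errors copied from the regression table.
estimate  <- c(days_left = -140.7304, duration = 634.1303)
std_error <- c(days_left = 2.981942,  duration = 5.622655)

# Assumed residual degrees of freedom: observations minus estimated coefficients.
df_resid <- 300153 - 3

# t statistic and two-sided p-value for H0: beta_j = 0.
t_stat  <- estimate / std_error
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

alpha <- 0.05
reject_h0 <- p_value < alpha
print(data.frame(t_stat, p_value, reject_h0))
```

The t statistics match the `statistic` column (about -47.19 and 112.78), and both p-values are far below 0.05, so both null hypotheses are rejected.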
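The following ggplot2 code plots days_left against price and overlays a fitted linear regression line:

    library(ggplot2)

    # Scatter plot of price vs. days left, with a red least-squares line
    ggplot(data, aes(x = days_left, y = price)) +
      geom_point() +
      geom_smooth(method = "lm", color = "red") +
      labs(title = "Days Left Until Flight vs Price")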