Assignment_#6

Author

Hanh Nguyen

Flights Information - scraping from GoogleFlight website

The topic of interest for this assignment is the flights information of one-way flights from Cincinnati to Ho Chi Minh city (Vietnam) on the Sunday May 11, 2025 and Monday May 12, 2025 to see which factors affect the price the most.

Data Preparation

Scraping Information

The information I wanted to scrape from the GoogleFlight website is:

Departure time
Arrival Time
Airline Company and their partners (since this is an international flight)
Amount of stops
Duration of the flights
Price

Forming a data frame with information from the website

Some cleaning need to be conducted for facilitating the analysis:

Separating the airline companies for national and international flights
Changing departure time from 12-hour display to 24-hour display for further analysis
Creating a new variable of “rough_estimated_duration” to be the number of hours of the flight

Hypothesis Questions

The main 3 hypothesis questions I want to answer based on this data frame:

Do the duration of the flights, or the departure time in the day have more significant affect on the price of the flight?
Does the price of the flight vary by the airline companies? If it does, which companies have the highest price?
Does the day of booking the ticket effect the price of the flights (difference between Sunday and Monday price)?

Analysis

Question number 1

Does the duration of the flights, or the departure time in the day have significant affect on the price of the flight?

Price ~ Departure time

In order to determine whether or not the departure time of the flights during the day has significant effect on the price of the flights, we create a new categorical variable based on the departure time to determine if the flights are in early morning (before 08:00 AM), normal day time (from 09:00 AM to 06:00 PM), and night flights( from 06:00 PM to 10:00 PM).

Before going into any analysis or running any test, I wanted to have a general look at the association between the time frame of the the flights and the price


Early morning  Night flight        Normal 
           26            27            43

From the visualization above, the average price of the night flights seem to be much higher than early morning flights or normal day time flights. This could be caused by the comfort elements of night flights compared to other time frames for international flights.

The table above indicates the number of flights between early morning, normal, and night time. This table could help prove that the significant high price of night flights are not caused by larger sample size of night flights category.

To ensure that there are a significant difference between the mean price of different time categories of the flights, I will conduct an ANOVA test with price and time_frame variables. Why ANOVA? –> Because the independent variable is a categorical variable with 3 levels (early morning, normal, late night)

Suppose that \(\alpha = 0.05\)

Null hypothesis (\(H_0\)): The mean price is the same across all time frames.
Alternative hypothesis (\(H_A\)): The mean price is different for at least one time frame.

            Df    Sum Sq  Mean Sq F value  Pr(>F)    
time_frame   2  49764487 24882243   10.23 9.6e-05 ***
Residuals   93 226091362  2431090                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

p-value = 0.000096 <\(\alpha\)

\(\implies\)We reject the null hypothesis, which means that the mean prices are significantly different between at least some of the time frame. Therefore, we can conclude that time frame does have a significant effect on flight prices.

Price ~ Duration

Before running any analysis, I want to look at the general association between the price of the flights and the amount of time of the flights.

Based on the visualization above, it seems that the variable “rough_estimated_duration” does not have a strong association and effect on the price of the flight since there is no obvious pattern between the 2 variables.

To make sure there is no significant association between the variables, I generated a regression line model between price and rough_estimated_duration and test assumptions of this model.


Call:
lm(formula = analysis1$price ~ analysis1$rough_estimated_duration)

Residuals:
    Min      1Q  Median      3Q     Max 
-2755.2  -954.8  -692.3   643.5  5361.7 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)   
(Intercept)                         -343.08     839.96  -0.408  0.68388   
analysis1$rough_estimated_duration    81.29      26.94   3.018  0.00327 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1636 on 94 degrees of freedom
Multiple R-squared:  0.08834,   Adjusted R-squared:  0.07864 
F-statistic: 9.109 on 1 and 94 DF,  p-value: 0.003274

Although based on the hypothesis test of the regression line model(since p-value is less than 0.05), the rough_estimated_time has a significant association, but the R-squared measure is not high (~10%) and the model violated the linearity and normality assumption test. Therefore, we could conclude that the amount of time of a flight does not have any effect on the price of the flight, there might be other factors involved.

Conclusion:

Through the analysis we ran above, there are two conclusions I could give for the first hypothesis question:

The time_frame or the departure_time has significant effect on the price of the flights. Night flights are statistically more expensive than early morning and normal daytime flights.
The duration of the flights generally does not have impact on the price of the flights. Although this conclusion might go against initial assumptions, we need to take in consideration that there might be other factors involved( such as airline company, or number of stops, etc.)

Question number 2

Does the price of the flight vary by the airline companies? If it does, which companies have the highest price?

Since this flight is an international flight, there would be usually 2 airline companies cooperating together. Therefore, I would run the analysis separately for the airline companies inside the US, and the ones that handle international flights, and then I would run them together to see if there are any significant interactions.

Price ~ US airline companies

I run a box-plot graph below to see the general difference between the average flights’ price of different airline companies. Moreover, to ensure the significance level of the average price among companies, an ANOVA (Analysis of Variance)

Based on the visualization above, we could see that the average price of flights varies significantly among the companies. While Delta and Alaska Airline seem to offer the most affordable options for their customers, United Airline and British Airways seem to have the most expensive average price for their flights.

ANOVA TEST

                      Df    Sum Sq  Mean Sq F value Pr(>F)    
as.factor(airline_US)  5 200248533 40049707   47.67 <2e-16 ***
Residuals             90  75607316   840081                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis (\(H_0\)): The mean price is the same across all the airline companies.
Alternative hypothesis (\(H_A\)): The mean price is different for at least one airline company.

From the anova model above, we see that p-value = \(2*e^-16\) < \(\alpha\) = 0.05

\(\implies\) We reject the null hypothesis, which means that the price varies among different airline companies for the flights in the same day to the same destination.

Price ~ International Airline (partner with US airline companies)

Running the initial general visualization between the price and partner companies. Based on the visualization below, we have an issue which is that there are companies only work with the US airline companies for a very few times (once or twice). Therefore, I have a question is that whether or not I could remove the companies that rarely work with the US companies, and if the removal would effect (better or worse) to the ANOVA test or the impact of companies to the price.


                ANA              Asiana      Cathay Pacific      China Airlines 
                 11                   2                   7                   2 
      China Eastern      China Southern            Emirates             EVA Air 
                  1                   1                   1                   6 
                JAL          Korean Air   Malaysia Airlines Philippine Airlines 
                  9                  16                   3                   1 
      Qatar Airways    STARLUX Airlines                THAI           THAISWISS 
                 14                   2                   3                   1 
   Turkish Airlines    Vietnam Airlines 
                  1                  15

Therefore, I generated a testing data set in which removed any companies that appear less than 3 times, and ran the ANOVA test again on this testing data frame. Then, I would do some comparison between the AIC (Akaike Information Criterion) of the original and testing results.

Based in the AIC measurement below, the AIC of testing data set is lower than the original data set. Therefore we could conclude that the testing data set keep the most information from the full data set.

               df      AIC
ano_model_test  8 1362.358
ano_model2     19 1678.559

                           Df    Sum Sq Mean Sq F value Pr(>F)  
as.factor(airline_outside)  6  33129986 5521664   2.737 0.0189 *
Residuals                  71 143210516 2017050                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA summary of the testing data set, the difference between the average price among the airline companies is significance. Therefore, we could also conclude that the price also varies by the international partner airline companies.

Two-way ANOVA: Price ~ Interactions between domestic and international airlines

Though when we conducted one-way ANOVA test for Price~airline_outside, we used the testing data set. We should use the original data when we run two-way ANOVA now due to the match of length of the 2 variables.

                           Df    Sum Sq   Mean Sq F value   Pr(>F)    
airline_US                  5 4.689e-06 9.378e-07  25.882 1.90e-14 ***
airline_outside            17 3.474e-06 2.043e-07   5.639 1.24e-07 ***
airline_US:airline_outside  6 4.930e-07 8.220e-08   2.268   0.0472 *  
Residuals                  67 2.428e-06 3.620e-08                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have 2 visualization, but it will be better if we only he visualization above was built based on the testing

Conclusion

Based on the results of the one-way and two-way ANOVA tests results, we can conclude that the price of the flights varies by the airline companies.

From the first one-way ANOVA test, we have the conclusion that British Airway has the most expensive range of price. Now from the visualization above of the interaction between the airlines, British Airway*Qatar Airway and British Airway*Vietnam Airlines

Question number 3

Does the day of booking the ticket effect the price of the flights (difference between Sunday and Monday price)?

For initial visualization, we generated a box-plot to indicate the differences in price of hte flights between 2 different booking date .

Based on the graph above, there is no significant difference in the price of booking ticket on Monday and booking ticket on Sunday.

For further analysis, conducting one-way ANOVA test is necessary to confirm the significance level of the variation of the price.

one_way <- aov(price ~ booking_date, data = analysis2)
summary(one_way)

              Df    Sum Sq Mean Sq F value Pr(>F)
booking_date   1   2300375 2300375   0.751  0.387
Residuals    177 542064558 3062512

Since p-value > 0.05, we can conclude that one day apart of booking date does not have significant effect on the price of the flights.