Dataset Information

The “Air Traffic Passenger Statistics” dataset contains information about flights around the world. Each entry records the time frame it covers, when the data was logged, and what kind of flight it was (either domestic or international). Selected variables are examined to answer the question: is there a significant difference between the number of domestic flights and the number of international flights?

Key Variables:
- GEO.Summary
- Passenger.Count
- data_as_of
- binary_type_of_flight_result (Imputed)

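Hypothesis

\(H_0\): \(p_d = p_i = 0.5\)

\(H_a\): \(p_d \neq p_i\) (at least one of the two proportions is greater than 0.5)

Here \(p_d\) and \(p_i\) are the proportions of domestic and international flights in the dataset.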

α : 0.05

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(zoo)
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(forcats)
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
airTravelDataSet <- read.csv("Air_Traffic_Passenger_Statistics.csv")
head(airTravelDataSet)
##   Activity.Period Activity.Period.Start.Date
## 1          199907                 1999/07/01
## 2          199907                 1999/07/01
## 3          199907                 1999/07/01
## 4          199907                 1999/07/01
## 5          199907                 1999/07/01
## 6          199907                 1999/07/01
##                         Operating.Airline Operating.Airline.IATA.Code
## 1                            ATA Airlines                          TZ
## 2                            ATA Airlines                          TZ
## 3                            ATA Airlines                          TZ
## 4 Aeroflot Russian International Airlines                            
## 5 Aeroflot Russian International Airlines                            
## 6                              Air Canada                          AC
##                         Published.Airline Published.Airline.IATA.Code
## 1                            ATA Airlines                          TZ
## 2                            ATA Airlines                          TZ
## 3                            ATA Airlines                          TZ
## 4 Aeroflot Russian International Airlines                            
## 5 Aeroflot Russian International Airlines                            
## 6                              Air Canada                          AC
##     GEO.Summary GEO.Region Activity.Type.Code Price.Category.Code   Terminal
## 1      Domestic         US           Deplaned            Low Fare Terminal 1
## 2      Domestic         US           Enplaned            Low Fare Terminal 1
## 3      Domestic         US     Thru / Transit            Low Fare Terminal 1
## 4 International     Europe           Deplaned               Other Terminal 2
## 5 International     Europe           Enplaned               Other Terminal 2
## 6 International     Canada           Deplaned               Other Terminal 1
##   Boarding.Area Passenger.Count             data_as_of         data_loaded_at
## 1             B           31432 2025/11/20 02:00:28 PM 2025/11/22 03:02:24 PM
## 2             B           31353 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 3             B            2518 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 4             D            1324 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 5             D            1198 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 6             B           24124 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
str(airTravelDataSet)
## 'data.frame':    38893 obs. of  15 variables:
##  $ Activity.Period            : int  199907 199907 199907 199907 199907 199907 199907 199907 199907 199907 ...
##  $ Activity.Period.Start.Date : chr  "1999/07/01" "1999/07/01" "1999/07/01" "1999/07/01" ...
##  $ Operating.Airline          : chr  "ATA Airlines" "ATA Airlines" "ATA Airlines" "Aeroflot Russian International Airlines" ...
##  $ Operating.Airline.IATA.Code: chr  "TZ" "TZ" "TZ" "" ...
##  $ Published.Airline          : chr  "ATA Airlines" "ATA Airlines" "ATA Airlines" "Aeroflot Russian International Airlines" ...
##  $ Published.Airline.IATA.Code: chr  "TZ" "TZ" "TZ" "" ...
##  $ GEO.Summary                : chr  "Domestic" "Domestic" "Domestic" "International" ...
##  $ GEO.Region                 : chr  "US" "US" "US" "Europe" ...
##  $ Activity.Type.Code         : chr  "Deplaned" "Enplaned" "Thru / Transit" "Deplaned" ...
##  $ Price.Category.Code        : chr  "Low Fare" "Low Fare" "Low Fare" "Other" ...
##  $ Terminal                   : chr  "Terminal 1" "Terminal 1" "Terminal 1" "Terminal 2" ...
##  $ Boarding.Area              : chr  "B" "B" "B" "D" ...
##  $ Passenger.Count            : int  31432 31353 2518 1324 1198 24124 23613 4983 4604 205 ...
##  $ data_as_of                 : chr  "2025/11/20 02:00:28 PM" "2025/11/20 02:00:29 PM" "2025/11/20 02:00:29 PM" "2025/11/20 02:00:29 PM" ...
##  $ data_loaded_at             : chr  "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" ...

Cleaning The Dataset

The column names contain periods and capital letters, so they are cleaned in two steps. The gsub function replaces every period in the column names with an underscore, and the tolower function converts every column name to lowercase.

names(airTravelDataSet) <- gsub("\\.", "_", names(airTravelDataSet))
names(airTravelDataSet) <- tolower(names(airTravelDataSet))

head(airTravelDataSet)
##   activity_period activity_period_start_date
## 1          199907                 1999/07/01
## 2          199907                 1999/07/01
## 3          199907                 1999/07/01
## 4          199907                 1999/07/01
## 5          199907                 1999/07/01
## 6          199907                 1999/07/01
##                         operating_airline operating_airline_iata_code
## 1                            ATA Airlines                          TZ
## 2                            ATA Airlines                          TZ
## 3                            ATA Airlines                          TZ
## 4 Aeroflot Russian International Airlines                            
## 5 Aeroflot Russian International Airlines                            
## 6                              Air Canada                          AC
##                         published_airline published_airline_iata_code
## 1                            ATA Airlines                          TZ
## 2                            ATA Airlines                          TZ
## 3                            ATA Airlines                          TZ
## 4 Aeroflot Russian International Airlines                            
## 5 Aeroflot Russian International Airlines                            
## 6                              Air Canada                          AC
##     geo_summary geo_region activity_type_code price_category_code   terminal
## 1      Domestic         US           Deplaned            Low Fare Terminal 1
## 2      Domestic         US           Enplaned            Low Fare Terminal 1
## 3      Domestic         US     Thru / Transit            Low Fare Terminal 1
## 4 International     Europe           Deplaned               Other Terminal 2
## 5 International     Europe           Enplaned               Other Terminal 2
## 6 International     Canada           Deplaned               Other Terminal 1
##   boarding_area passenger_count             data_as_of         data_loaded_at
## 1             B           31432 2025/11/20 02:00:28 PM 2025/11/22 03:02:24 PM
## 2             B           31353 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 3             B            2518 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 4             D            1324 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 5             D            1198 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 6             B           24124 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
summary(airTravelDataSet)
##  activity_period  activity_period_start_date operating_airline 
##  Min.   :199907   Length:38893               Length:38893      
##  1st Qu.:200611   Class :character           Class :character  
##  Median :201310   Mode  :character           Mode  :character  
##  Mean   :201295                                                
##  3rd Qu.:201910                                                
##  Max.   :202509                                                
##  operating_airline_iata_code published_airline  published_airline_iata_code
##  Length:38893                Length:38893       Length:38893               
##  Class :character            Class :character   Class :character           
##  Mode  :character            Mode  :character   Mode  :character           
##                                                                            
##                                                                            
##                                                                            
##  geo_summary         geo_region        activity_type_code price_category_code
##  Length:38893       Length:38893       Length:38893       Length:38893       
##  Class :character   Class :character   Class :character   Class :character   
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character   
##                                                                              
##                                                                              
##                                                                              
##    terminal         boarding_area      passenger_count   data_as_of       
##  Length:38893       Length:38893       Min.   :     0   Length:38893      
##  Class :character   Class :character   1st Qu.:  4358   Class :character  
##  Mode  :character   Mode  :character   Median :  8592   Mode  :character  
##                                        Mean   : 27821                     
##                                        3rd Qu.: 19615                     
##                                        Max.   :856501                     
##  data_loaded_at    
##  Length:38893      
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
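
As a small illustration of the renaming step, the sketch below applies the same two calls to a short vector of column names taken from the dataset. The vector original_names and the fixed = TRUE variant are assumptions added for illustration; they are not part of the original analysis.

```r
# Hypothetical sample of the original column names
original_names <- c("GEO.Summary", "Passenger.Count", "data_as_of")

# "\\." escapes the period, which would otherwise match any character
cleaned_names <- tolower(gsub("\\.", "_", original_names))
print(cleaned_names)

# Alternative: fixed = TRUE treats "." as a literal period, so no escaping is needed
cleaned_fixed <- tolower(gsub(".", "_", original_names, fixed = TRUE))
print(identical(cleaned_names, cleaned_fixed))
```

Names that already use underscores and lowercase letters, such as data_as_of, pass through unchanged.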

Imputing Dataset

A new dataset, “imputed_airTravelDataSet”, is derived from the original “airTravelDataSet”. The mutate function creates a new column that converts the values of geo_summary into a binary code (0 = domestic flight, 1 = international flight), and the select function then keeps only the key variables. No missing values are filled in at this step; “imputed” here refers only to the added, derived column.

imputed_airTravelDataSet <- airTravelDataSet |>
  # Encode the flight type: 0 = Domestic, 1 = International
  mutate(binary_type_of_flight_result = ifelse(geo_summary == "Domestic", 0, 1)) |>
  # Keep only the key variables
  select(geo_summary, passenger_count, data_as_of, binary_type_of_flight_result)

head(imputed_airTravelDataSet)
##     geo_summary passenger_count             data_as_of
## 1      Domestic           31432 2025/11/20 02:00:28 PM
## 2      Domestic           31353 2025/11/20 02:00:29 PM
## 3      Domestic            2518 2025/11/20 02:00:29 PM
## 4 International            1324 2025/11/20 02:00:29 PM
## 5 International            1198 2025/11/20 02:00:29 PM
## 6 International           24124 2025/11/20 02:00:29 PM
##   binary_type_of_flight_result
## 1                            0
## 2                            0
## 3                            0
## 4                            1
## 5                            1
## 6                            1
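
The binary encoding can be checked on a handful of values. The following is a minimal sketch in base R, assuming the column holds only the two labels "Domestic" and "International"; the toy vector and the variable names are hypothetical.

```r
# Hypothetical toy values mirroring the first six rows shown above
geo_summary <- c("Domestic", "Domestic", "Domestic",
                 "International", "International", "International")

# Same encoding as in the analysis: 0 = Domestic, 1 = International
binary_ifelse <- ifelse(geo_summary == "Domestic", 0, 1)

# Equivalent encoding when only two labels exist: a logical comparison converted to 0/1
binary_logical <- as.numeric(geo_summary == "International")

print(binary_ifelse)
print(all(binary_ifelse == binary_logical))
```

The two encodings agree only when every value is one of the two labels; any other label would be coded 1 by ifelse and 0 by the logical comparison.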

Creating a Domestic Flight Sub-Dataset

The filter function selects only the rows of “imputed_airTravelDataSet” that contain domestic flights and stores them in a sub-dataset called “domestic_flight_dataset”.

domestic_flight_dataset <- imputed_airTravelDataSet |>
  filter(geo_summary == "Domestic")

head(domestic_flight_dataset)
##   geo_summary passenger_count             data_as_of
## 1    Domestic           31432 2025/11/20 02:00:28 PM
## 2    Domestic           31353 2025/11/20 02:00:29 PM
## 3    Domestic            2518 2025/11/20 02:00:29 PM
## 4    Domestic           41433 2025/11/20 02:00:29 PM
## 5    Domestic           47409 2025/11/20 02:00:29 PM
## 6    Domestic            4135 2025/11/20 02:00:29 PM
##   binary_type_of_flight_result
## 1                            0
## 2                            0
## 3                            0
## 4                            0
## 5                            0
## 6                            0

Creating an International Flight Sub-Dataset

The filter function selects only the rows of “imputed_airTravelDataSet” that contain international flights and stores them in a sub-dataset called “international_flight_dataset”.

international_flight_dataset <- imputed_airTravelDataSet |>
  filter(geo_summary == "International")

head(international_flight_dataset)
##     geo_summary passenger_count             data_as_of
## 1 International            1324 2025/11/20 02:00:29 PM
## 2 International            1198 2025/11/20 02:00:29 PM
## 3 International           24124 2025/11/20 02:00:29 PM
## 4 International           23613 2025/11/20 02:00:29 PM
## 5 International            4983 2025/11/20 02:00:29 PM
## 6 International            4604 2025/11/20 02:00:29 PM
##   binary_type_of_flight_result
## 1                            1
## 2                            1
## 3                            1
## 4                            1
## 5                            1
## 6                            1
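
The two filter steps can also be viewed as a single split by flight type. The sketch below uses base R's split function on a toy data frame built from the first six rows printed earlier; toy_flights and the other names are hypothetical and only illustrate the idea.

```r
# Toy data frame built from the first six rows shown in head(airTravelDataSet)
toy_flights <- data.frame(
  geo_summary = c("Domestic", "Domestic", "Domestic",
                  "International", "International", "International"),
  passenger_count = c(31432, 31353, 2518, 1324, 1198, 24124)
)

# split() returns a named list with one data frame per flight type
flights_by_type <- split(toy_flights, toy_flights$geo_summary)

# Row counts per group; together they account for every row of the toy data
group_sizes <- sapply(flights_by_type, nrow)
print(group_sizes)
print(sum(group_sizes) == nrow(toy_flights))
```

Checking that the group sizes add up to the total row count is a quick way to confirm that no row was dropped or counted twice.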

Chi-Square Test Comparing The Amount of Domestic Flights to International Flights

The chi-square test compares the number of flights categorized as domestic with the number categorized as international and checks whether each category has a probability of 0.5. The count of each flight type is first stored in the observed variable. The chisq.test function is then called on observed with the parameter p = c(0.5, 0.5), which tests whether each category makes up 50% of the dataset. Calling chisq.test(observed) without p gives the same result, because equal probabilities are the default.

# Observed counts of each flight type (13538 domestic, 25355 international)
observed <- c(sum(airTravelDataSet$geo_summary == "Domestic"),
              sum(airTravelDataSet$geo_summary == "International"))

# Proportion of each flight type as observed in the dataset
theoretical_proportions <- observed / sum(observed)

chisq.test(observed)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 3590.4, df = 1, p-value < 2.2e-16
chisq.test(observed, p = c(0.5, 0.5))
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 3590.4, df = 1, p-value < 2.2e-16

The p-value from the chi-squared test is less than 2.2e-16 (about 0.00000000000000022), which is far below α = 0.05, so the result is statistically significant and we have sufficient evidence to reject the null hypothesis: domestic and international flights do not each have a probability of 50%. The test has one degree of freedom (df = 1) because the variable has two categories, domestic and international, and the degrees of freedom equal the number of categories minus one.
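
To make the reported statistic easier to follow, the sketch below recomputes it by hand from the two counts given in the analysis. The variable names are assumptions chosen for illustration; only the counts 13538 and 25355 come from the dataset.

```r
# Observed counts reported in the analysis
observed <- c(Domestic = 13538, International = 25355)

# Expected counts under H0: each flight type holds half of all records
expected_h0 <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq_stat <- sum((observed - expected_h0)^2 / expected_h0)

# Degrees of freedom: number of categories minus one
deg_freedom <- length(observed) - 1

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_sq_stat, df = deg_freedom, lower.tail = FALSE)

print(round(chi_sq_stat, 1))
print(deg_freedom)
print(p_value < 0.05)
```

The hand-computed statistic rounds to 3590.4 with one degree of freedom, matching the chisq.test output above.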

Checking Expected Values

This chunk multiplies the proportion of each flight type by the total number of flights observed in the dataset. Because theoretical_proportions is computed from the data itself, the result reproduces the observed counts of domestic and international flights. Under the null hypothesis, each type would instead be expected to account for half of the 38893 flights.

expected <- theoretical_proportions * sum(observed)
expected
## [1] 13538 25355
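
For comparison, the expected counts under the null hypothesis can be obtained from the hypothesized proportions rather than the observed ones. This is a minimal sketch; null_proportions, expected_under_null, and test_result are hypothetical names introduced here.

```r
# Observed counts reported in the analysis
observed <- c(Domestic = 13538, International = 25355)

# Proportions stated by the null hypothesis
null_proportions <- c(0.5, 0.5)

# Expected counts if H0 were true: hypothesized proportion times total count
expected_under_null <- null_proportions * sum(observed)
print(expected_under_null)

# chisq.test() stores the same expected counts in its result object
test_result <- chisq.test(observed, p = null_proportions)
print(test_result$expected)

# Gap between what was observed and what H0 predicts
print(observed - expected_under_null)
```

Both approaches give 19446.5 per flight type, so the observed domestic count falls well below the value the null hypothesis predicts and the international count well above it.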

Extra Check for Expected Values

This chunk of code double-checks the number of domestic and international flights from the previous chunk.

cat("Number of domestic flights: ", sum(airTravelDataSet$geo_summary == "Domestic"), "\n")
## Number of domestic flights:  13538
cat("Number of international flights: ", sum(airTravelDataSet$geo_summary == "International"), "\n")
## Number of international flights:  25355
summary(imputed_airTravelDataSet)
##  geo_summary        passenger_count   data_as_of       
##  Length:38893       Min.   :     0   Length:38893      
##  Class :character   1st Qu.:  4358   Class :character  
##  Mode  :character   Median :  8592   Mode  :character  
##                     Mean   : 27821                     
##                     3rd Qu.: 19615                     
##                     Max.   :856501                     
##  binary_type_of_flight_result
##  Min.   :0.0000              
##  1st Qu.:0.0000              
##  Median :1.0000              
##  Mean   :0.6519              
##  3rd Qu.:1.0000              
##  Max.   :1.0000
summary(domestic_flight_dataset)
##  geo_summary        passenger_count   data_as_of       
##  Length:13538       Min.   :     0   Length:13538      
##  Class :character   1st Qu.:  5587   Class :character  
##  Mode  :character   Median : 23776   Mode  :character  
##                     Mean   : 60899                     
##                     3rd Qu.: 71277                     
##                     Max.   :856501                     
##  binary_type_of_flight_result
##  Min.   :0                   
##  1st Qu.:0                   
##  Median :0                   
##  Mean   :0                   
##  3rd Qu.:0                   
##  Max.   :0
summary(international_flight_dataset)
##  geo_summary        passenger_count   data_as_of       
##  Length:25355       Min.   :     1   Length:25355      
##  Class :character   1st Qu.:  4006   Class :character  
##  Mode  :character   Median :  7358   Mode  :character  
##                     Mean   : 10159                     
##                     3rd Qu.: 11716                     
##                     Max.   :128468                     
##  binary_type_of_flight_result
##  Min.   :1                   
##  1st Qu.:1                   
##  Median :1                   
##  Mean   :1                   
##  3rd Qu.:1                   
##  Max.   :1
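
The mean of binary_type_of_flight_result in the first summary (0.6519) is the proportion of international flights, because the mean of a 0/1 column is the share of ones. The sketch below rebuilds such a column from the two counts to show this; the variable names are hypothetical.

```r
# Observed counts reported in the analysis
observed <- c(Domestic = 13538, International = 25355)

# Rebuild a 0/1 indicator: 13538 zeros followed by 25355 ones
binary_indicator <- rep(c(0, 1), times = observed)

# The mean of a 0/1 vector is the proportion of ones
proportion_international <- mean(binary_indicator)
print(round(proportion_international, 4))

# The same proportions computed directly from the counts
print(round(prop.table(observed), 4))
```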

Bar Chart Visualization

The bar chart “Type of Flights Comparison Bar Plot” is created with the barplot function and displays two bars showing the number of flights of each type, with domestic flights on the left and international flights on the right. As depicted, there are more international flights than domestic flights, reaffirming the chi-square result that the two types of flight are not equally represented in the dataset.

barplot(table(imputed_airTravelDataSet$geo_summary),
        main = "Type of Flights Comparison Bar Plot",
        xlab = "Type of Flight",
        ylab = "Amount of Flights")
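
One possible refinement, sketched below under the assumption that the two counts from the analysis are used directly, is to print each count and its percentage above the bars. The names counts, share_labels, and bar_midpoints are hypothetical.

```r
# Observed counts reported in the analysis
counts <- c(Domestic = 13538, International = 25355)

# Label for each bar: the count followed by its share of the total
share_labels <- paste0(counts, " (", round(100 * counts / sum(counts), 1), "%)")

# barplot() returns the x positions of the bar midpoints
bar_midpoints <- barplot(counts,
                         main = "Type of Flights Comparison Bar Plot",
                         xlab = "Type of Flight",
                         ylab = "Amount of Flights",
                         ylim = c(0, max(counts) * 1.15))

# Place each label just above its bar
text(x = bar_midpoints, y = counts, labels = share_labels, pos = 3)
```

Extending ylim slightly beyond the tallest bar leaves room for the label above it.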

Conclusion

The p-value from the chi-squared test is less than 2.2e-16 (about 0.00000000000000022), which is far below α = 0.05, so the results are significant. Therefore, the null hypothesis is rejected: domestic and international flights do not share an equal probability of 50%. The probability of an international flight is not equal to the probability of a domestic flight, as there are more international flights (25355) than domestic flights (13538) in the dataset. These results could potentially suggest future trends in how demand for air travel changes over time, or give insight into how the shipping of goods and imports may cause more international flights to be made than domestic flights.

Works Cited

Air Traffic Passenger Statistics. (2023, March 13). Data.gov; data.sfgov.org. https://catalog.data.gov/dataset/air-traffic-passenger-statistics