The dataset “Air Traffic Passenger Statistics” contains records of air traffic activity. Each record shows the time frame in which the flights were made, when the data was logged, and what kind of flight it covers (either domestic or international). Selected variables are examined to answer the question: is there a significant difference between the number of domestic flights and the number of international flights?
Key Variables:
- GEO.Summary
- Passenger.Count
- data_as_of
- binary_type_of_flight_result (derived)
Hypothesis
\(H_0\): \(p_d = p_i = 0.5\)
\(H_a\): \(p_d \neq 0.5\) (equivalently \(p_i \neq 0.5\), since \(p_d + p_i = 1\))
α = 0.05
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats)
library(ggplot2)
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
airTravelDataSet <- read.csv("Air_Traffic_Passenger_Statistics.csv")
head(airTravelDataSet)
## Activity.Period Activity.Period.Start.Date
## 1 199907 1999/07/01
## 2 199907 1999/07/01
## 3 199907 1999/07/01
## 4 199907 1999/07/01
## 5 199907 1999/07/01
## 6 199907 1999/07/01
## Operating.Airline Operating.Airline.IATA.Code
## 1 ATA Airlines TZ
## 2 ATA Airlines TZ
## 3 ATA Airlines TZ
## 4 Aeroflot Russian International Airlines
## 5 Aeroflot Russian International Airlines
## 6 Air Canada AC
## Published.Airline Published.Airline.IATA.Code
## 1 ATA Airlines TZ
## 2 ATA Airlines TZ
## 3 ATA Airlines TZ
## 4 Aeroflot Russian International Airlines
## 5 Aeroflot Russian International Airlines
## 6 Air Canada AC
## GEO.Summary GEO.Region Activity.Type.Code Price.Category.Code Terminal
## 1 Domestic US Deplaned Low Fare Terminal 1
## 2 Domestic US Enplaned Low Fare Terminal 1
## 3 Domestic US Thru / Transit Low Fare Terminal 1
## 4 International Europe Deplaned Other Terminal 2
## 5 International Europe Enplaned Other Terminal 2
## 6 International Canada Deplaned Other Terminal 1
## Boarding.Area Passenger.Count data_as_of data_loaded_at
## 1 B 31432 2025/11/20 02:00:28 PM 2025/11/22 03:02:24 PM
## 2 B 31353 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 3 B 2518 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 4 D 1324 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 5 D 1198 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 6 B 24124 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
str(airTravelDataSet)
## 'data.frame': 38893 obs. of 15 variables:
## $ Activity.Period : int 199907 199907 199907 199907 199907 199907 199907 199907 199907 199907 ...
## $ Activity.Period.Start.Date : chr "1999/07/01" "1999/07/01" "1999/07/01" "1999/07/01" ...
## $ Operating.Airline : chr "ATA Airlines" "ATA Airlines" "ATA Airlines" "Aeroflot Russian International Airlines" ...
## $ Operating.Airline.IATA.Code: chr "TZ" "TZ" "TZ" "" ...
## $ Published.Airline : chr "ATA Airlines" "ATA Airlines" "ATA Airlines" "Aeroflot Russian International Airlines" ...
## $ Published.Airline.IATA.Code: chr "TZ" "TZ" "TZ" "" ...
## $ GEO.Summary : chr "Domestic" "Domestic" "Domestic" "International" ...
## $ GEO.Region : chr "US" "US" "US" "Europe" ...
## $ Activity.Type.Code : chr "Deplaned" "Enplaned" "Thru / Transit" "Deplaned" ...
## $ Price.Category.Code : chr "Low Fare" "Low Fare" "Low Fare" "Other" ...
## $ Terminal : chr "Terminal 1" "Terminal 1" "Terminal 1" "Terminal 2" ...
## $ Boarding.Area : chr "B" "B" "B" "D" ...
## $ Passenger.Count : int 31432 31353 2518 1324 1198 24124 23613 4983 4604 205 ...
## $ data_as_of : chr "2025/11/20 02:00:28 PM" "2025/11/20 02:00:29 PM" "2025/11/20 02:00:29 PM" "2025/11/20 02:00:29 PM" ...
## $ data_loaded_at : chr "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" "2025/11/22 03:02:24 PM" ...
The column names contain periods and capital letters, so they are cleaned first. The gsub function replaces every period in the column names with an underscore, and the tolower function converts the column names to lower case.
names(airTravelDataSet) <- gsub("\\.", "_", names(airTravelDataSet))
names(airTravelDataSet) <- tolower(names(airTravelDataSet))
head(airTravelDataSet)
## activity_period activity_period_start_date
## 1 199907 1999/07/01
## 2 199907 1999/07/01
## 3 199907 1999/07/01
## 4 199907 1999/07/01
## 5 199907 1999/07/01
## 6 199907 1999/07/01
## operating_airline operating_airline_iata_code
## 1 ATA Airlines TZ
## 2 ATA Airlines TZ
## 3 ATA Airlines TZ
## 4 Aeroflot Russian International Airlines
## 5 Aeroflot Russian International Airlines
## 6 Air Canada AC
## published_airline published_airline_iata_code
## 1 ATA Airlines TZ
## 2 ATA Airlines TZ
## 3 ATA Airlines TZ
## 4 Aeroflot Russian International Airlines
## 5 Aeroflot Russian International Airlines
## 6 Air Canada AC
## geo_summary geo_region activity_type_code price_category_code terminal
## 1 Domestic US Deplaned Low Fare Terminal 1
## 2 Domestic US Enplaned Low Fare Terminal 1
## 3 Domestic US Thru / Transit Low Fare Terminal 1
## 4 International Europe Deplaned Other Terminal 2
## 5 International Europe Enplaned Other Terminal 2
## 6 International Canada Deplaned Other Terminal 1
## boarding_area passenger_count data_as_of data_loaded_at
## 1 B 31432 2025/11/20 02:00:28 PM 2025/11/22 03:02:24 PM
## 2 B 31353 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 3 B 2518 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 4 D 1324 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 5 D 1198 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
## 6 B 24124 2025/11/20 02:00:29 PM 2025/11/22 03:02:24 PM
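The renaming step can also be tried in isolation on a small character vector. The sketch below is illustrative only: raw_names is a hypothetical vector that copies three of the original column names, and fixed = TRUE is used as an alternative to escaping the period in a regular expression.

```r
# Hypothetical vector copying three of the raw column names
raw_names <- c("GEO.Summary", "Passenger.Count", "Activity.Period.Start.Date")

# fixed = TRUE treats "." as a literal period rather than a regex wildcard
clean_names <- tolower(gsub(".", "_", raw_names, fixed = TRUE))

clean_names
# "geo_summary" "passenger_count" "activity_period_start_date"
```

Without fixed = TRUE or the escaped pattern, gsub would treat the period as "any character" and replace every character with an underscore.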
summary(airTravelDataSet)
## activity_period activity_period_start_date operating_airline
## Min. :199907 Length:38893 Length:38893
## 1st Qu.:200611 Class :character Class :character
## Median :201310 Mode :character Mode :character
## Mean :201295
## 3rd Qu.:201910
## Max. :202509
## operating_airline_iata_code published_airline published_airline_iata_code
## Length:38893 Length:38893 Length:38893
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## geo_summary geo_region activity_type_code price_category_code
## Length:38893 Length:38893 Length:38893 Length:38893
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## terminal boarding_area passenger_count data_as_of
## Length:38893 Length:38893 Min. : 0 Length:38893
## Class :character Class :character 1st Qu.: 4358 Class :character
## Mode :character Mode :character Median : 8592 Mode :character
## Mean : 27821
## 3rd Qu.: 19615
## Max. :856501
## data_loaded_at
## Length:38893
## Class :character
## Mode :character
##
##
##
A derived dataset, “imputed_airTravelDataSet”, is built from the original “airTravelDataSet”. The mutate function creates a new column that converts the values of geo_summary into a binary code (0 = domestic flights, 1 = international flights), and the select function then keeps only the key variables.
imputed_airTravelDataSet <- airTravelDataSet |>
  # Encode the flight type as 0 (Domestic) or 1 (International)
  mutate(binary_type_of_flight_result = ifelse(geo_summary == "Domestic", 0, 1)) |>
  # Keep only the key variables
  select(geo_summary, passenger_count, data_as_of, binary_type_of_flight_result)
head(imputed_airTravelDataSet)
## geo_summary passenger_count data_as_of
## 1 Domestic 31432 2025/11/20 02:00:28 PM
## 2 Domestic 31353 2025/11/20 02:00:29 PM
## 3 Domestic 2518 2025/11/20 02:00:29 PM
## 4 International 1324 2025/11/20 02:00:29 PM
## 5 International 1198 2025/11/20 02:00:29 PM
## 6 International 24124 2025/11/20 02:00:29 PM
## binary_type_of_flight_result
## 1 0
## 2 0
## 3 0
## 4 1
## 5 1
## 6 1
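The binary coding can be illustrated on a small made-up data frame. This is a minimal sketch, assuming dplyr is installed; toy_flights and its values are hypothetical, and only the column names follow this report. It uses as.integer on a logical comparison as an alternative to ifelse.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow the report
toy_flights <- data.frame(
  geo_summary = c("Domestic", "International", "International"),
  passenger_count = c(100L, 250L, 75L)
)

# TRUE/FALSE converts to 1/0, so "not Domestic" becomes 1
toy_flights <- toy_flights |>
  mutate(binary_type_of_flight_result = as.integer(geo_summary != "Domestic"))

toy_flights$binary_type_of_flight_result        # 0 1 1
mean(toy_flights$binary_type_of_flight_result)  # share of international rows
```

One convenient property of a 0/1 column is that its mean is the proportion of rows coded 1, here the share of international rows.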
The filter function selects only the rows of “imputed_airTravelDataSet” that contain domestic flights and stores them in a subset called “domestic_flight_dataset”.
domestic_flight_dataset <- imputed_airTravelDataSet |>
  filter(geo_summary == "Domestic")
head(domestic_flight_dataset)
## geo_summary passenger_count data_as_of
## 1 Domestic 31432 2025/11/20 02:00:28 PM
## 2 Domestic 31353 2025/11/20 02:00:29 PM
## 3 Domestic 2518 2025/11/20 02:00:29 PM
## 4 Domestic 41433 2025/11/20 02:00:29 PM
## 5 Domestic 47409 2025/11/20 02:00:29 PM
## 6 Domestic 4135 2025/11/20 02:00:29 PM
## binary_type_of_flight_result
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
The filter function selects only the rows of “imputed_airTravelDataSet” that contain international flights and stores them in a subset called “international_flight_dataset”.
international_flight_dataset <- imputed_airTravelDataSet |>
  filter(geo_summary == "International")
head(international_flight_dataset)
## geo_summary passenger_count data_as_of
## 1 International 1324 2025/11/20 02:00:29 PM
## 2 International 1198 2025/11/20 02:00:29 PM
## 3 International 24124 2025/11/20 02:00:29 PM
## 4 International 23613 2025/11/20 02:00:29 PM
## 5 International 4983 2025/11/20 02:00:29 PM
## 6 International 4604 2025/11/20 02:00:29 PM
## binary_type_of_flight_result
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
A chi-square goodness-of-fit test checks whether flights are split equally between the domestic and international categories, that is, whether each category has a probability of 0.5. The number of flights in each category is counted first and stored in the observed variable. The chisq.test function is then called on observed, once with its default of equal probabilities and once with the explicit parameter p = c(0.5, 0.5); the two calls are equivalent and give the same result.
# Observed counts: 13538 domestic and 25355 international
observed <- c(sum(airTravelDataSet$geo_summary == "Domestic"),
              sum(airTravelDataSet$geo_summary == "International"))
# Proportion of each flight type actually observed in the data
theoretical_proportions <- observed / sum(observed)
chisq.test(observed)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 3590.4, df = 1, p-value < 2.2e-16
chisq.test(observed, p = c(0.5, 0.5))
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 3590.4, df = 1, p-value < 2.2e-16
The p-value from the chi-squared test is less than 2.2e-16 (0.00000000000000022), far below α = 0.05, so the result is statistically significant and there is sufficient evidence to reject the null hypothesis: domestic and international flights do not share an equal probability of 50%. The test has one degree of freedom (df = 1) because the variable has two categories, domestic and international, and the degrees of freedom for a goodness-of-fit test are the number of categories minus one.
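The test statistic can be reproduced by hand from the two counts, which makes clear where X-squared = 3590.4 comes from. The following is a minimal sketch, assuming the observed counts reported in this document are typed in directly; the names expected_h0, chi_sq and p_value are illustrative.

```r
# Observed counts taken from the output in this report
observed <- c(domestic = 13538, international = 25355)

# Under H0 each category is expected to hold half of all rows
expected_h0 <- sum(observed) * c(0.5, 0.5)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((observed - expected_h0)^2 / expected_h0)

# Upper-tail probability with (number of categories - 1) degrees of freedom
p_value <- pchisq(chi_sq, df = length(observed) - 1, lower.tail = FALSE)

round(chi_sq, 1)  # 3590.4
p_value < 0.05    # TRUE
```

Each category deviates from its expected count of 19446.5 by 5908.5 rows, so the two terms of the sum are equal and each contributes about 1795.2.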
As a check on the counts, each proportion is multiplied by the total number of flights observed in the dataset. Because theoretical_proportions holds the proportions observed in the data, this recovers the observed counts of domestic and international flights. The expected counts under the null hypothesis would instead be 0.5 × 38893 = 19446.5 for each type of flight.
expected <- theoretical_proportions * sum(observed)
expected
## [1] 13538 25355
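For comparison, the counts expected under the null hypothesis can be read from the object that chisq.test returns. This is a minimal sketch, assuming the observed counts reported in this document are entered by hand; the names observed_counts and test_result are illustrative.

```r
# Observed counts taken from the output in this report
observed_counts <- c(13538, 25355)

test_result <- chisq.test(observed_counts, p = c(0.5, 0.5))

# Expected counts under H0: half of the 38893 rows in each category
test_result$expected                    # 19446.5 19446.5

# Difference between observed and expected counts
observed_counts - test_result$expected  # -5908.5 5908.5
```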
Extra check for expected values: this chunk counts the domestic and international flights directly, confirming the values from the previous chunk.
cat("Number of domestic flights: ", sum(airTravelDataSet$geo_summary == "Domestic"), "\n")
## Number of domestic flights: 13538
cat("Number of international flights: ", sum(airTravelDataSet$geo_summary == "International"), "\n")
## Number of international flights: 25355
summary(imputed_airTravelDataSet)
## geo_summary passenger_count data_as_of
## Length:38893 Min. : 0 Length:38893
## Class :character 1st Qu.: 4358 Class :character
## Mode :character Median : 8592 Mode :character
## Mean : 27821
## 3rd Qu.: 19615
## Max. :856501
## binary_type_of_flight_result
## Min. :0.0000
## 1st Qu.:0.0000
## Median :1.0000
## Mean :0.6519
## 3rd Qu.:1.0000
## Max. :1.0000
summary(domestic_flight_dataset)
## geo_summary passenger_count data_as_of
## Length:13538 Min. : 0 Length:13538
## Class :character 1st Qu.: 5587 Class :character
## Mode :character Median : 23776 Mode :character
## Mean : 60899
## 3rd Qu.: 71277
## Max. :856501
## binary_type_of_flight_result
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
summary(international_flight_dataset)
## geo_summary passenger_count data_as_of
## Length:25355 Min. : 1 Length:25355
## Class :character 1st Qu.: 4006 Class :character
## Mode :character Median : 7358 Mode :character
## Mean : 10159
## 3rd Qu.: 11716
## Max. :128468
## binary_type_of_flight_result
## Min. :1
## 1st Qu.:1
## Median :1
## Mean :1
## 3rd Qu.:1
## Max. :1
The bar chart “Type of Flights Comparison Bar Plot” is created with the barplot function and displays one bar per type of flight, with domestic flights on the left and international flights on the right. As depicted, there are more international flights than domestic flights, reaffirming the chi-square result that the two types of flight are not equally represented in the dataset.
barplot(table(imputed_airTravelDataSet$geo_summary),
        main = "Type of Flights Comparison Bar Plot",
        xlab = "Type of Flight",
        ylab = "Amount of Flights")
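Plotting proportions instead of counts makes the comparison with the hypothesized value of 0.5 more direct. The sketch below is one possible variant, not the chart used in this report; it assumes the counts reported in this document are entered by hand, and the names counts and proportions, the plot title and the dashed reference line are illustrative additions.

```r
# Counts taken from the output in this report
counts <- c(Domestic = 13538, International = 25355)

# Convert counts to shares of all rows
proportions <- prop.table(counts)
round(proportions, 4)  # Domestic 0.3481, International 0.6519

barplot(proportions,
        main = "Proportion of Flights by Type",
        xlab = "Type of Flight",
        ylab = "Proportion of Flights",
        ylim = c(0, 1))

# Dashed line at the proportion assumed under the null hypothesis
abline(h = 0.5, lty = 2)
```

The international share of 0.6519 matches the mean of binary_type_of_flight_result shown in the summary output.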
Because the p-value from the chi-squared test is less than 2.2e-16 (0.00000000000000022), far below α = 0.05, the results are significant and the null hypothesis is rejected: domestic and international flights do not share an equal probability of 50%. The observed counts show the direction of the difference, with more international flights (25355) than domestic flights (13538). These counts refer to rows in the dataset rather than to passengers; the summaries show a higher mean passenger count for domestic rows (60899) than for international rows (10159). The results could prompt further study of how demand for air travel changes over time, or of whether factors such as shipping of goods and imports contribute to more international flights being recorded than domestic flights.
Air Traffic Passenger Statistics. (2023, March 13). Data.gov; data.sfgov.org. https://catalog.data.gov/dataset/air-traffic-passenger-statistics