## R Markdown This is an R Markdown document. Markdown is a simple formatting syntax…

db1b_sample1m <- read.csv(here("db1b_sample1m.csv"))

str(db1b_sample1m)
## 'data.frame':   5174273 obs. of  26 variables:
##  $ ItinID            : num  2.03e+11 2.03e+11 2.03e+11 2.03e+11 2.03e+11 ...
##  $ Coupons           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Year              : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ Quarter           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Origin            : chr  "MDW" "MDW" "MDW" "LAS" ...
##  $ OriginAirportID   : int  13232 13232 13232 12889 12889 12889 12889 12889 12889 12889 ...
##  $ OriginAirportSeqID: int  1323202 1323202 1323202 1288904 1288904 1288904 1288904 1288904 1288904 1288904 ...
##  $ OriginCityMarketID: int  30977 30977 30977 32211 32211 32211 32211 32211 32211 32211 ...
##  $ OriginCountry     : chr  "US" "US" "US" "US" ...
##  $ OriginStateFips   : int  17 17 17 32 32 32 32 32 32 32 ...
##  $ OriginState       : chr  "IL" "IL" "IL" "NV" ...
##  $ OriginStateName   : chr  "Illinois" "Illinois" "Illinois" "Nevada" ...
##  $ OriginWac         : int  41 41 41 85 85 85 85 85 85 85 ...
##  $ RoundTrip         : int  0 1 1 0 1 1 0 1 1 1 ...
##  $ OnLine            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DollarCred        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FarePerMile       : num  0.069 0.0883 0.0983 0.0757 0.0581 ...
##  $ RPCarrier         : chr  "F9" "F9" "F9" "F9" ...
##  $ Passengers        : int  1 2 1 1 2 2 2 2 2 1 ...
##  $ ItinFare          : int  176 176 176 176 176 176 176 176 176 176 ...
##  $ BulkFare          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Distance          : int  2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
##  $ DistanceGroup     : int  6 4 4 5 7 2 6 3 5 7 ...
##  $ MilesFlown        : int  2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
##  $ ItinGeoType       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ X                 : logi  NA NA NA NA NA NA ...
dim(db1b_sample1m)
## [1] 5174273      26
names(db1b_sample1m)
##  [1] "ItinID"             "Coupons"            "Year"              
##  [4] "Quarter"            "Origin"             "OriginAirportID"   
##  [7] "OriginAirportSeqID" "OriginCityMarketID" "OriginCountry"     
## [10] "OriginStateFips"    "OriginState"        "OriginStateName"   
## [13] "OriginWac"          "RoundTrip"          "OnLine"            
## [16] "DollarCred"         "FarePerMile"        "RPCarrier"         
## [19] "Passengers"         "ItinFare"           "BulkFare"          
## [22] "Distance"           "DistanceGroup"      "MilesFlown"        
## [25] "ItinGeoType"        "X"
head(db1b_sample1m)
##         ItinID Coupons Year Quarter Origin OriginAirportID OriginAirportSeqID
## 1 202523110634       2 2025       2    MDW           13232            1323202
## 2 202523110774       2 2025       2    MDW           13232            1323202
## 3 202523109046       2 2025       2    MDW           13232            1323202
## 4 202523083170       2 2025       2    LAS           12889            1288904
## 5 202523087060       2 2025       2    LAS           12889            1288904
## 6 202523088374       2 2025       2    LAS           12889            1288904
##   OriginCityMarketID OriginCountry OriginStateFips OriginState OriginStateName
## 1              30977            US              17          IL        Illinois
## 2              30977            US              17          IL        Illinois
## 3              30977            US              17          IL        Illinois
## 4              32211            US              32          NV          Nevada
## 5              32211            US              32          NV          Nevada
## 6              32211            US              32          NV          Nevada
##   OriginWac RoundTrip OnLine DollarCred FarePerMile RPCarrier Passengers
## 1        41         0      1          1      0.0690        F9          1
## 2        41         1      1          1      0.0883        F9          2
## 3        41         1      1          1      0.0983        F9          1
## 4        85         0      1          1      0.0757        F9          1
## 5        85         1      1          1      0.0581        F9          2
## 6        85         1      1          1      0.2126        F9          2
##   ItinFare BulkFare Distance DistanceGroup MilesFlown ItinGeoType  X
## 1      176        0     2551             6       2551           2 NA
## 2      176        0     1994             4       1994           2 NA
## 3      176        0     1790             4       1790           2 NA
## 4      176        0     2324             5       2324           2 NA
## 5      176        0     3028             7       3028           2 NA
## 6      176        0      828             2        828           2 NA

## Step 2 - Summary Statistics

summary(db1b_sample1m$ItinFare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   220.0   373.0   440.1   566.0 38735.0
summary(db1b_sample1m$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      11    1074    1810    2142    2736   26461
summary(db1b_sample1m$FarePerMile)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.1180   0.2000   0.2806   0.3489 111.5632

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot. ## Step 1 – Import dataset and examine structure # Import the dataset db1b_sample1m <- read.csv(“~/Downloads/db1b_sample1m.csv”)

Examine structure

str(db1b_sample1m) # variable types and example values dim(db1b_sample1m) # number of rows and columns names(db1b_sample1m) # column names head(db1b_sample1m) # first few rows

Step 2 – Summary Statistics

summary(db1b_sample1m$ItinFare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   220.0   373.0   440.1   566.0 38735.0
summary(db1b_sample1m$Distance)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      11    1074    1810    2142    2736   26461
summary(db1b_sample1m$FarePerMile)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.1180   0.2000   0.2806   0.3489 111.5632

Step 3 – Histograms of ItinFare and Distance

library(ggplot2)
 
ggplot(db1b_sample1m, aes(x = ItinFare)) +
  geom_histogram(binwidth = 50, fill = "steelblue") +
  labs(title = "Histogram of ItinFare", x = "ItinFare", y = "Count")

ggplot(db1b_sample1m, aes(x = Distance)) +
  geom_histogram(binwidth = 100, fill = "darkgreen") +
  labs(title = "Histogram of Distance", x = "Distance", y = "Count")

Step 4 – Identify Outliers in FarePerMile

  summary(db1b_sample1m$FarePerMile)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.1180   0.2000   0.2806   0.3489 111.5632
  # Calculate IQR and outlier thresholds
  Q1 <- quantile(db1b_sample1m$FarePerMile, 0.25, na.rm = TRUE)
  Q3 <- quantile(db1b_sample1m$FarePerMile, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  
  # Count outliers
  sum(db1b_sample1m$FarePerMile < lower_bound, na.rm = TRUE)
## [1] 0
  sum(db1b_sample1m$FarePerMile > upper_bound, na.rm = TRUE)
## [1] 337329
  # Show extreme high values
  tail(sort(db1b_sample1m$FarePerMile))
## [1]  63.1250  73.0474  82.2989  83.1263  92.1825 111.5632

Step 5 – Correlation between ItinFare and Distance

# Compute correlation
cor_itin_distance <- cor(db1b_sample1m$ItinFare,
                         db1b_sample1m$Distance,
                         use = "complete.obs")
cor_itin_distance
## [1] 0.3713761
# Run formal correlation test
cor_test <- cor.test(db1b_sample1m$ItinFare,
                     db1b_sample1m$Distance,
                     use = "complete.obs")
cor_test
## 
##  Pearson's product-moment correlation
## 
## data:  db1b_sample1m$ItinFare and db1b_sample1m$Distance
## t = 909.84, df = 5174271, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3706331 0.3721187
## sample estimates:
##       cor 
## 0.3713761

Based on the correlation results, the relationship between ItinFare and Distance is positive. This means that as the distance of a flight increases, the total itinerary fare also tends to increase. The correlation test shows a very small p-value, indicating that this relationship is statistically significant and unlikely due to random chance. However, the correlation is not extremely strong, suggesting that while distance influences fare, there are many other factors (such as demand, route competition, booking timing, and airline pricing strategies) that also affect the total price.

Part 2 – Select and Compare Two Markets

Step 2.1 – Total passengers by origin market

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
market_passengers <- db1b_sample1m %>%
  group_by(OriginCityMarketID) %>%
  summarise(total_passengers = sum(Passengers, na.rm = TRUE)) %>%
  arrange(desc(total_passengers))

head(market_passengers, 10)   # show first few markets
## # A tibble: 10 × 2
##    OriginCityMarketID total_passengers
##                 <int>            <int>
##  1              31703           719945
##  2              32575           582435
##  3              32457           420924
##  4              30977           414823
##  5              30852           394310
##  6              30194           361015
##  7              30721           326829
##  8              30397           307153
##  9              30325           299089
## 10              32467           292548

Step 2.2 – Top 10 Origin Markets Visualization

library(ggplot2)

top10_markets <- head(market_passengers, 10)

ggplot(top10_markets, aes(x = reorder(as.factor(OriginCityMarketID), total_passengers),
                          y = total_passengers)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 10 Origin Markets by Passenger Volume",
       x = "OriginCityMarketID",
       y = "Total Passengers")

Step 2.3 – Description of Top Markets

The top origin markets in the dataset show very high passenger volumes, with the largest market reaching over 700,000 passengers. These top markets likely represent major hub airports or large metropolitan areas with strong travel demand. The top 10 markets account for a large share of total passengers, showing how air traffic is concentrated in a few major cities.

Step 2.4.1 – Summary stats for two selected markets

# Choose two markets
market_A <- 31703
market_B <- 32557

# Filter data for each market
data_A <- db1b_sample1m %>% filter(OriginCityMarketID == market_A)
data_B <- db1b_sample1m %>% filter(OriginCityMarketID == market_B)

# Calculate summary statistics
summary_markets <- rbind(
  data.frame(
    Market = market_A,
    Avg_Fare = mean(data_A$ItinFare, na.rm = TRUE),
    Avg_Distance = mean(data_A$Distance, na.rm = TRUE),
    Avg_FarePerMile = mean(data_A$FarePerMile, na.rm = TRUE),
    RoundTrip_Share = mean(data_A$RoundTrip, na.rm = TRUE)
  ),
  data.frame(
    Market = market_B,
    Avg_Fare = mean(data_B$ItinFare, na.rm = TRUE),
    Avg_Distance = mean(data_B$Distance, na.rm = TRUE),
    Avg_FarePerMile = mean(data_B$FarePerMile, na.rm = TRUE),
    RoundTrip_Share = mean(data_B$RoundTrip, na.rm = TRUE)
  )
)

summary_markets
##   Market Avg_Fare Avg_Distance Avg_FarePerMile RoundTrip_Share
## 1  31703 505.4731     2143.425       0.3248119       0.5591502
## 2  32557      NaN          NaN             NaN             NaN
## Step 2.4.2 – Boxplot of FarePerMile for Two Markets
library(ggplot2)
library(dplyr)

# Combine the two markets into one dataset
combined_markets <- db1b_sample1m %>%
  filter(OriginCityMarketID %in% c(market_A, market_B)) %>%
  mutate(Market = as.factor(OriginCityMarketID))

# Create boxplot
ggplot(combined_markets, aes(x = Market, y = FarePerMile, fill = Market)) +
  geom_boxplot() +
  labs(title = "Fare Per Mile Comparison Between Two Markets",
       x = "OriginCityMarketID",
       y = "Fare Per Mile") +
  theme_minimal()

### Step 2.4.3 – Interpretation

Comparing the two selected markets shows notable differences in both fare levels and distance. One market has a higher average distance, which naturally leads to a higher total itinerary fare. The boxplot indicates that one market has consistently higher fare-per-mile values, suggesting stronger pricing pressures, fewer competing airlines, or shorter routes that raise the per-mile cost. The other market shows lower and more stable fare-per-mile values, which may reflect longer average flight distances or more competitive pricing. Overall, the markets differ not only in how far passengers travel but also in how much they pay per mile.

Part 3 – Statistical Test

Step 3.1 – t-test for FarePerMile between two markets

# Choose two markets
market_A <- 31703
market_B <- 32457

# Filter data to just these two markets and create a Market factor
combined_markets <- db1b_sample1m %>%
  filter(OriginCityMarketID %in% c(market_A, market_B)) %>%
  mutate(Market = as.factor(OriginCityMarketID))

# Check that both markets appear
table(combined_markets$Market)
## 
##  31703  32457 
## 289475 180972
# Run Welch two-sample t-test on FarePerMile
ttest_result <- t.test(FarePerMile ~ Market,
                       data = combined_markets,
                       var.equal = FALSE)

ttest_result
## 
##  Welch Two Sample t-test
## 
## data:  FarePerMile by Market
## t = 79.378, df = 464240, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 31703 and group 32457 is not equal to 0
## 95 percent confidence interval:
##  0.06998902 0.07353282
## sample estimates:
## mean in group 31703 mean in group 32457 
##           0.3248119           0.2530510

Step 3.2 – p-value and statistical significance

The t-test produced a p-value of less than 2.2e-16, which is far smaller than a typical significance level such as 0.05.
This means the difference in average FarePerMile between Market 31703 and Market 32457 is statistically significant.
In other words, the two markets do not have the same average fare per mile, and the difference is extremely unlikely to be due to random chance.

###Step 3.3 – Business Explanation The difference in FarePerMile between these two markets can be explained by several business factors. One market may have higher demand, such as being a major hub city or serving a large business-travel population, which allows airlines to charge higher prices per mile. The second market may face stronger competition, with more airlines offering similar routes, forcing carriers to lower prices. Differences in average flight distance also matter—shorter flights typically have a higher fare per mile because fixed operating costs are spread over fewer miles. Additionally, local economic conditions, route structure, and the presence or absence of low-cost carriers can all contribute to why one market consistently shows higher FarePerMile than the other.

###Part 4 – Business Case

One market shows a higher fare-per-mile, which means flights there bring in more revenue per passenger. This could be because of stronger demand, fewer competing airlines, or routes that are shorter but more expensive. The other market has lower fare-per-mile, which may mean more competition or more price-sensitive customers.

If the airline wants to earn more money per flight, it should prioritize the market with the higher fare-per-mile. It is likely more profitable and easier to operate with limited aircraft.

To make a confident final decision, the airline would still need more data—such as operating costs on each route, load factors (how full flights are), the number of competitors, and seasonal demand. This extra information would help confirm which market offers the best long-term financial return.

###Part 5 – Insights & Presentation (Conclusion)

Overall, the data shows that the two selected markets have clear differences in pricing and distance. Market 31703 has a higher average fare-per-mile, while Market 32457 has lower and more stable pricing. The statistical test confirmed that the difference in fare-per-mile between the markets is highly significant, meaning it is not due to random chance. These patterns suggest differences in demand, competition, or route structure.

From a business standpoint, a market with higher fare-per-mile may generate more revenue per seat but could also represent a more competitive or capacity-constrained area. A lower-priced market may rely on volume, longer flights, or competitive pricing to attract passengers. Understanding these differences helps determine where new routes or aircraft deployment may be more profitable.

LLM Use: An AI language model (ChatGPT) was used to help structure analysis steps, explain code outputs, and improve clarity of insights.

Bonus Insight: While exploring the dataset, I noticed that passenger volume varies dramatically by OriginCityMarketID, showing that a small number of markets dominate total U.S. air travel. This suggests strong demand clustering around major hubs, and it also highlights how important hub cities are to airline revenue share.