Statistical Analysis of U.S. Domestic Air Travel (Jan–July 2025)

Project Overview

This statistical analysis examines seasonal and operational patterns in U.S. domestic air travel using the T‑100 Domestic Market (U.S. Carriers) dataset from January to July 2025. The study employs hypothesis testing to investigate two key dimensions of air travel dynamics.

Research Questions & Methodology

1. Seasonal Flight Distribution

We tested whether flight frequencies are uniformly distributed across months using a chi‑square goodness‑of‑fit test. The analysis provides strong evidence against an equal monthly distribution (χ² = 155.54, df = 6, p < 2.2e-16), consistent with airlines adjusting schedules in response to seasonal travel patterns, although the effect size is small (Cramer's V = 0.075).

2. Carrier‑Type Performance

We evaluated differences in flight frequencies and passenger volumes across three carrier classifications—major, national, and regional—using:

Chi‑square tests for flight frequency comparisons

ANOVA for passenger volume analysis

Both tests yielded statistically significant results (p < 0.001 in each case), rejecting the null hypothesis of equal flight frequencies and passenger volumes across carrier types. Major carriers dominate operations, accounting for 61% of flights and 57.6% of passengers.

Analytical Framework

This report details the inferential procedures, test assumptions, and practical interpretations of these findings, providing evidence‑based insights into seasonal scheduling strategies and carrier‑type operational differentiation in the U.S. domestic aviation market.

# import Libraries

library(GGally)
## Loading required package: ggplot2
library(scales)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard()    masks scales::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ dplyr::lag()        masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Read Flights

url <- "https://raw.githubusercontent.com/mehreengillani/Final_project_Data606/refs/heads/main/flights_data_clean.csv"
flights <- read_csv(url)
## Rows: 32476 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): unique_carrier, unique_carrier_name, origin_city_name, dest_city_n...
## dbl  (6): passengers, freight, mail, distance, month, distance_group
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Verify the loaded data
head(flights)
## # A tibble: 6 × 16
##   passengers freight  mail distance unique_carrier unique_carrier_name          
##        <dbl>   <dbl> <dbl>    <dbl> <chr>          <chr>                        
## 1         20       0     0       54 C5             CommuteAir LLC dba CommuteAir
## 2         20       0     0       59 M5             Kenmore Air Harbor           
## 3         20       0     0       64 J5             Kalinin Aviation LLC d/b/a A…
## 4         20       0     0       73 C5             CommuteAir LLC dba CommuteAir
## 5         20       0     0       76 6F             FOX AIRCRAFT, LLC            
## 6         20       0     0       79 8V             Wright Air Service           
## # ℹ 10 more variables: origin_city_name <chr>, dest_city_name <chr>,
## #   month <dbl>, distance_group <dbl>, class <chr>, distance_cat <chr>,
## #   carrier_type <chr>, month_name <chr>, route <chr>, distance_simple <chr>

Research Questions

1. Seasonal Flight Operations: How do flight frequencies vary across months?

Statistical Framework:

Null Hypothesis (H₀): Flights are distributed uniformly across months, so each month accounts for the same proportion of flights (p_jan = p_feb = p_mar = p_apr = p_may = p_jun = p_jul = 1/7)

Alternative Hypothesis (H₁): At least one month's proportion of flights differs from 1/7, with summer months (June-July) expected to show elevated frequencies driven by seasonal travel patterns

Response: flight_count (Numerical)

Explanatory: month (Categorical)

# Create summary data for monthly analysis
monthly_summary <- flights %>%
  group_by(month) %>%
  summarise(flight_count = n()) %>%
  mutate(month_name = factor(c("January", "February", "March", "April", "May", "June", "July")[month],
                            levels = c("January", "February", "March", "April", "May", "June", "July")))

# View the summary data
print(monthly_summary)
## # A tibble: 7 × 3
##   month flight_count month_name
##   <dbl>        <int> <fct>     
## 1     1         4991 January   
## 2     2         4573 February  
## 3     3         4907 March     
## 4     4         4407 April     
## 5     5         4792 May       
## 6     6         4811 June      
## 7     7         3995 July
# Basic statistics
cat("\n=== SUMMARY STATISTICS ===\n")
## 
## === SUMMARY STATISTICS ===
cat("Total flights:", format(sum(monthly_summary$flight_count), big.mark = ","), "\n")
## Total flights: 32,476
cat("Average monthly flights:", format(mean(monthly_summary$flight_count), big.mark = ","), "\n")
## Average monthly flights: 4,639.429
# as.character() prints the month label rather than the factor's integer code
cat("Month with highest flights:", as.character(monthly_summary$month_name[which.max(monthly_summary$flight_count)]), "\n")
## Month with highest flights: January
cat("Month with lowest flights:", as.character(monthly_summary$month_name[which.min(monthly_summary$flight_count)]), "\n")
## Month with lowest flights: July
# Calculate percentage change from lowest to highest month
lowest_count <- min(monthly_summary$flight_count)
highest_count <- max(monthly_summary$flight_count)
percentage_increase <- ((highest_count - lowest_count) / lowest_count) * 100
cat("Seasonal increase from lowest to highest month:", round(percentage_increase, 1), "%\n")
## Seasonal increase from lowest to highest month: 24.9 %
# Create the main visualization - Bar chart
ggplot(monthly_summary, aes(x = month_name, y = flight_count, fill = flight_count)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = format(flight_count, big.mark = ",")), vjust = -0.5, size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Flight Count") +
  labs(title = "Flight Frequencies by Month",
       subtitle = "Analysis of Seasonal Patterns in Flight Operations",
       x = "Month",
       y = "Number of Flights",
       caption = paste("Total flights analyzed:", format(sum(monthly_summary$flight_count), big.mark = ","))) +
  scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"))

# Calculate month-over-month changes
monthly_changes <- monthly_summary %>%
  mutate(previous_month = lag(flight_count),
         mom_change = flight_count - previous_month,
         mom_percentage = ((flight_count - previous_month) / previous_month) * 100)
cat("\n=== MONTH-OVER-MONTH CHANGES ===\n")
## 
## === MONTH-OVER-MONTH CHANGES ===
print(monthly_changes %>% select(month_name, flight_count, mom_change, mom_percentage))
## # A tibble: 7 × 4
##   month_name flight_count mom_change mom_percentage
##   <fct>             <int>      <int>          <dbl>
## 1 January            4991         NA         NA    
## 2 February           4573       -418         -8.38 
## 3 March              4907        334          7.30 
## 4 April              4407       -500        -10.2  
## 5 May                4792        385          8.74 
## 6 June               4811         19          0.396
## 7 July               3995       -816        -17.0
# Chi-Square Test: Are flights distributed equally across months?
cat("\n=== CHI-SQUARE GOODNESS OF FIT TEST ===\n")
## 
## === CHI-SQUARE GOODNESS OF FIT TEST ===
monthly_counts <- flights %>%
  count(month_name) %>%
  arrange(match(month_name, c("jan", "feb", "mar", "apr", "may", "jun", "jul")))
#print(monthly_counts$n)
# Expected frequencies (if equal distribution)
expected <- rep(sum(monthly_counts$n) / 7, 7)
print(expected)
## [1] 4639.429 4639.429 4639.429 4639.429 4639.429 4639.429 4639.429
chi_test <- chisq.test(monthly_counts$n, p = rep(1/7, 7))
print(chi_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  monthly_counts$n
## X-squared = 155.54, df = 6, p-value < 2.2e-16
# Effect size
cramers_v <- sqrt(chi_test$statistic / (sum(monthly_counts$n) * (7-1)/7))
cat("Effect size (Cramer's V):", round(cramers_v, 3), "\n")
## Effect size (Cramer's V): 0.075
# Measures how strong the monthly differences are (0 = no effect, 1 = maximum effect)
# Adjusts for sample size and number of categories

Observed vs expected:

Month   Observed   Expected   Difference
jan         4991   4639.429       +351.6
feb         4573   4639.429        -66.4
mar         4907   4639.429       +267.6
apr         4407   4639.429       -232.4
may         4792   4639.429       +152.6
jun         4811   4639.429       +171.6
jul         3995   4639.429       -644.4

What the Chi-Square Test Does: It measures how far the observed counts (actual data) fall from the expected counts (equal distribution). The large deviations, especially in January and July, are what produce the significant p-value.
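
The calculation described above can be reproduced by hand from the monthly counts printed earlier. The following is a minimal sketch; the names `observed`, `expected`, `contributions`, and `chi_sq_manual` are illustrative and do not appear in the original analysis.

```r
# Monthly flight counts as printed in the summary above (January to July)
observed <- c(jan = 4991, feb = 4573, mar = 4907, apr = 4407,
              may = 4792, jun = 4811, jul = 3995)

# Under H0 every month is expected to hold one seventh of all flights
expected <- rep(sum(observed) / length(observed), length(observed))

# Each month's contribution to the statistic: (O - E)^2 / E
contributions <- (observed - expected)^2 / expected

# The chi-square statistic is the sum of the contributions
chi_sq_manual <- sum(contributions)
df <- length(observed) - 1

# Upper-tail probability under the chi-square distribution
p_value <- pchisq(chi_sq_manual, df = df, lower.tail = FALSE)

print(round(contributions, 1))
cat("Chi-square:", round(chi_sq_manual, 2), "on", df, "df\n")
```

Sorting `contributions` shows which months drive the result; July's shortfall is the largest single term.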

Statistical Conclusion: REJECT the Null Hypothesis

Decision rule:

p < 0.05: Reject H₀ - Significant evidence that flight frequencies vary by month

p ≥ 0.05: Fail to reject H₀ - No significant evidence of monthly variation

Interpretation:

p-value < 2.2e-16 (extremely small) → Strong evidence against H₀

X-squared = 155.54 with 6 degrees of freedom → Large test statistic

Result: Flight frequencies are NOT equally distributed across months

Statistical Significance: If flights were truly spread evenly across months, the probability of observing a distribution at least this uneven would be virtually zero. The differences in flight frequencies across months are therefore statistically significant.

Effect size (Cramer’s V): 0.075 indicates a SMALL EFFECT SIZE

Interpreting Cramer’s V:

Value Range   Effect Size   Interpretation
0.00 - 0.06   Negligible    No practical difference
0.07 - 0.20   Small         Weak association
0.21 - 0.35   Medium        Moderate association
0.36+         Large         Strong association

Real-World Interpretation:

The monthly differences are real and unlikely to be due to chance

The magnitude of these differences is quite small

Month alone explains very little of the variation in flight frequencies
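
Effect-size conventions for goodness-of-fit tests vary, and the denominator used in this report, n * (k - 1) / k, is only one of them. As a hedged comparison, the sketch below recomputes the reported value alongside two other commonly used normalizations: a goodness-of-fit form of Cramer's V that divides by n * (k - 1), and Cohen's w, which divides by n. The variable names are illustrative and not part of the original analysis.

```r
# Values reported in the chi-square output above
chi_sq <- 155.54   # test statistic
n <- 32476         # total number of flights
k <- 7             # number of months

# Formula used in this report: divides by n * (k - 1) / k
v_report <- sqrt(chi_sq / (n * (k - 1) / k))

# Common goodness-of-fit variant of Cramer's V: divides by n * (k - 1)
v_gof <- sqrt(chi_sq / (n * (k - 1)))

# Cohen's w: divides by n only
cohen_w <- sqrt(chi_sq / n)

print(round(c(v_report = v_report, v_gof = v_gof, cohen_w = cohen_w), 3))
```

All three normalizations give values below 0.1, so the conclusion that the practical effect is small does not depend on which one is chosen.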

# Check actual monthly distribution
monthly_counts %>%
  mutate(percentage = round(n/sum(n)*100, 1)) %>%
  arrange(month_name)
## # A tibble: 7 × 3
##   month_name     n percentage
##   <chr>      <int>      <dbl>
## 1 apr         4407       13.6
## 2 feb         4573       14.1
## 3 jan         4991       15.4
## 4 jul         3995       12.3
## 5 jun         4811       14.8
## 6 mar         4907       15.1
## 7 may         4792       14.8

Statistical Conclusion: REJECT NULL HYPOTHESIS ✅

Statistical Interpretation

“We reject the null hypothesis that flight frequencies are uniformly distributed across months. The Chi-square goodness of fit test reveals statistically significant variation in flight frequencies across different months (χ² = 155.54, df = 6, p-value < 2.2e-16).”

Research Question: How do flight frequencies and passenger volumes differ across carrier types (major, national, and regional carriers)?

Hypotheses:

Null Hypothesis (H₀):

There is no significant difference in flight frequencies and passenger volumes among major carriers, national carriers, and regional carriers.

Statistical Form: μ_major = μ_national = μ_regional for mean passenger volumes; for flight frequencies, each carrier type accounts for an equal share of flights (p_major = p_national = p_regional = 1/3)

Alternative Hypothesis (H₁):

There are significant differences in flight frequencies and passenger volumes among carrier types, with major carriers handling higher volumes than national and regional carriers.

Statistical Form: At least one carrier type has significantly different flight frequencies or mean passenger volumes, with major carriers expected to demonstrate the highest values.

Variables:

Response variables: flight_count (numerical), passengers (numerical)

Explanatory variable: carrier_type (categorical, 3 levels: major, national, regional)

# Statistical Analysis: Carrier Type vs Flight Frequencies
cat("\n=== CARRIER TYPE ANALYSIS ===\n")
## 
## === CARRIER TYPE ANALYSIS ===
# 1. Count flights by carrier type
carrier_summary <- flights %>%
  group_by(carrier_type) %>%
  summarise(
    flight_count = n(),
    total_passengers = sum(passengers),
    avg_passengers = mean(passengers),
    .groups = 'drop'
  )

print(carrier_summary)
## # A tibble: 3 × 4
##   carrier_type      flight_count total_passengers avg_passengers
##   <chr>                    <int>            <dbl>          <dbl>
## 1 major carriers           19811          5072805           256.
## 2 national carriers         7282          2653957           364.
## 3 regional carriers         5383          1083427           201.
# 2. Visualize the differences
ggplot(carrier_summary, aes(x = carrier_type, y = flight_count, fill = carrier_type)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = flight_count), vjust = -0.5, size = 5) +
  labs(title = "Flight Counts by Carrier Type",
       x = "Carrier Type", y = "Number of Flights") +
  theme_classic()

# 3. Perform Chi-Square Test (Simplest approach)
cat("\n=== CHI-SQUARE TEST ===\n")
## 
## === CHI-SQUARE TEST ===
chi_test <- chisq.test(carrier_summary$flight_count)
print(chi_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  carrier_summary$flight_count
## X-squared = 11355, df = 2, p-value < 2.2e-16

Statistical Conclusion: REJECT THE NULL HYPOTHESIS

Interpretation:

p-value < 2.2e-16: Extremely significant (a pattern this uneven would be virtually impossible if flights were split equally across carrier types)

X-squared = 11,355: Massive test statistic

df = 2: Comparing 3 carrier types

What This Means:

STRONG EVIDENCE that flight frequencies are NOT equally distributed across carrier types

# 4. Compare passenger volumes (ANOVA)
cat("\n=== PASSENGER VOLUME COMPARISON ===\n")
## 
## === PASSENGER VOLUME COMPARISON ===
anova_test <- aov(passengers ~ carrier_type, data = flights)
summary(anova_test)
##                 Df    Sum Sq  Mean Sq F value Pr(>F)    
## carrier_type     2 9.419e+07 47096599   615.1 <2e-16 ***
## Residuals    32473 2.486e+09    76563                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Statistical Conclusion: REJECT THE NULL HYPOTHESIS

ANOVA Results Interpretation:

p-value < 2e-16: Extremely significant

F-value = 615.1: Very large F-statistic

Significance code ***: Highest significance level (p < 0.001)

What This Means:

STRONG EVIDENCE that passenger volumes differ significantly across carrier types
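
With more than 32,000 rows, even modest differences in means produce a very large F-statistic, so an effect size is a useful companion to the p-value. The following sketch recomputes F and derives eta squared from the rounded sums of squares printed in the ANOVA table above; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom from the ANOVA table above
ss_between <- 9.419e7   # carrier_type
ss_within  <- 2.486e9   # Residuals
df_between <- 2
df_within  <- 32473

# F is the ratio of the two mean squares
f_value <- (ss_between / df_between) / (ss_within / df_within)

# Eta squared: share of total variation in passengers explained by carrier type
eta_sq <- ss_between / (ss_between + ss_within)

cat("F:", round(f_value, 1), "\n")
cat("Eta squared:", round(eta_sq, 4), "\n")
```

Eta squared comes out under 4 percent, so carrier type accounts for only a small share of the variation in passengers despite the highly significant test.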

Key Findings:

Flight counts differ by carrier type (Chi-square test)

Passenger volumes differ by carrier type (ANOVA test)

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# 5. Simple percentage breakdown
cat("\n=== PERCENTAGE BREAKDOWN ===\n")
## 
## === PERCENTAGE BREAKDOWN ===
carrier_summary %>%
  mutate(
    flight_percent = round(flight_count / sum(flight_count) * 100, 1),
    passenger_percent = round(total_passengers / sum(total_passengers) * 100, 1)
  ) %>%
  select(carrier_type, flight_count, flight_percent, total_passengers, avg_passengers, passenger_percent)%>%
  kable()
carrier_type        flight_count  flight_percent  total_passengers  avg_passengers  passenger_percent
major carriers             19811            61.0           5072805        256.0600               57.6
national carriers           7282            22.4           2653957        364.4544               30.1
regional carriers           5383            16.6           1083427        201.2683               12.3

Key Insight:

Major carriers dominate the market, handling:

61% of all flights

57.6% of all passengers

# Create summary data for analysis
flight_summary <- flights %>%
  group_by(month, carrier_type) %>%
  summarise(flight_count = n(), .groups = 'drop')

# View the summary data
print(flight_summary)
## # A tibble: 21 × 3
##    month carrier_type      flight_count
##    <dbl> <chr>                    <int>
##  1     1 major carriers            2967
##  2     1 national carriers         1313
##  3     1 regional carriers          711
##  4     2 major carriers            2501
##  5     2 national carriers         1362
##  6     2 regional carriers          710
##  7     3 major carriers            2789
##  8     3 national carriers         1252
##  9     3 regional carriers          866
## 10     4 major carriers            2778
## # ℹ 11 more rows
# Create the main visualization
ggplot(flight_summary, aes(x = factor(month), y = flight_count, fill = carrier_type)) +
  geom_col(position = "dodge") +
  labs(title = "Flight Frequencies by Month and Carrier Type",
       subtitle = "Analysis of Seasonal Patterns Across Different Carrier Types",
       x = "Month",
       y = "Flight Count",
       fill = "Carrier Type") +
  scale_y_continuous(labels = comma) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Alternative: Line plot to better show trends
ggplot(flight_summary, aes(x = month, y = flight_count, color = carrier_type, group = carrier_type)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 2) +
  labs(title = "Seasonal Flight Frequency Trends by Carrier Type",
       subtitle = "Monthly variation in flight operations",
       x = "Month",
       y = "Flight Count",
       color = "Carrier Type") +
  scale_y_continuous(labels = comma) +
  scale_x_continuous(breaks = 1:7) +
  theme_minimal()

# Calculate percentage distribution
monthly_totals <- flight_summary %>%
  group_by(month) %>%
  summarise(total_flights = sum(flight_count))

flight_percentage <- flight_summary %>%
  left_join(monthly_totals, by = "month") %>%
  mutate(percentage = (flight_count / total_flights) * 100)

# View percentage distribution
print(flight_percentage)
## # A tibble: 21 × 5
##    month carrier_type      flight_count total_flights percentage
##    <dbl> <chr>                    <int>         <int>      <dbl>
##  1     1 major carriers            2967          4991       59.4
##  2     1 national carriers         1313          4991       26.3
##  3     1 regional carriers          711          4991       14.2
##  4     2 major carriers            2501          4573       54.7
##  5     2 national carriers         1362          4573       29.8
##  6     2 regional carriers          710          4573       15.5
##  7     3 major carriers            2789          4907       56.8
##  8     3 national carriers         1252          4907       25.5
##  9     3 regional carriers          866          4907       17.6
## 10     4 major carriers            2778          4407       63.0
## # ℹ 11 more rows
# Chi-square goodness-of-fit test across all 21 month-by-carrier cells
cat("\n=== CHI-SQUARE TEST ===\n")
## 
## === CHI-SQUARE TEST ===
chi_test <- chisq.test(flight_summary$flight_count)
print(chi_test)
## 
##  Chi-squared test for given probabilities
## 
## data:  flight_summary$flight_count
## X-squared = 11898, df = 20, p-value < 2.2e-16
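
Passing a single vector to `chisq.test()` checks whether all 21 cells are equally frequent. If the question is instead whether the carrier mix changes from month to month, a test of independence on a month-by-carrier table is the usual tool. The sketch below illustrates the setup using only the nine counts for months 1 to 3 printed above, so its result is not a substitute for running the test on the full data; the object names are illustrative.

```r
# Counts for months 1-3 as printed in the flight_summary output above
# (rows = month, columns = carrier type)
counts <- matrix(
  c(2967, 1313, 711,
    2501, 1362, 710,
    2789, 1252, 866),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    month = c("1", "2", "3"),
    carrier_type = c("major carriers", "national carriers", "regional carriers")
  )
)

# Test of independence: does the carrier mix change from month to month?
indep_test <- chisq.test(counts)
print(indep_test)

# With the full data frame, the equivalent call would be
# chisq.test(table(flights$month, flights$carrier_type))
```

For a 3 x 3 table the test has (3 - 1) * (3 - 1) = 4 degrees of freedom; the full 7 x 3 table would have 12.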

Total Flights by month

# Calculate total flights by month
monthly_flights <- flights %>%
  group_by(month) %>%
  summarise(
    total_flights = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(total_flights))

# Create month labels
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul")

# Create proper month order while showing highest to lowest
monthly_flights_ordered <- monthly_flights %>%
  mutate(
    month_name = factor(month, 
                       levels = 1:7,
                       labels = month_labels[1:7]),
    rank = rank(-total_flights)
  ) %>%
  arrange(desc(total_flights))

# Create the plot
ggplot(monthly_flights_ordered, aes(x = reorder(month_name, -total_flights), 
                                   y = total_flights,
                                   fill = total_flights)) +
  geom_col(alpha = 0.8) +
  geom_text(aes(label = paste0(format(total_flights, big.mark = ","), 
                              "\n(", rank, ")")), 
            vjust = -0.08, size = 1.9, fontface = "bold") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", 
                     name = "Total Flights") +
  labs(title = "Monthly Flight Volume (January - July)",
       subtitle = "Ranked from highest to lowest, with rank numbers in parentheses",
       x = "Month",
       y = "Total Number of Flights") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 10),
        plot.subtitle = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 0, hjust = 0.5))

# Correlation between month and total flights
cor(monthly_flights$month, monthly_flights$total_flights, use = "complete.obs")
## [1] -0.5844159

There is a moderate negative correlation (r ≈ -0.58) between month and total flights, based on only seven monthly totals.
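
Whether a correlation of this size is distinguishable from zero can be checked with `cor.test()` from base R. The sketch below re-enters the monthly totals printed earlier rather than reading the data file; the variable names are illustrative.

```r
# Monthly totals as printed in the monthly summary (January to July)
month <- 1:7
total_flights <- c(4991, 4573, 4907, 4407, 4792, 4811, 3995)

# Pearson correlation, matching the cor() call above
r <- cor(month, total_flights)

# Significance test and confidence interval for the correlation
ct <- cor.test(month, total_flights)

cat("r =", round(r, 4), "\n")
cat("p-value =", round(ct$p.value, 3), "\n")
print(ct$conf.int)
```

With seven points the test has five degrees of freedom, and for these counts it does not reach the 0.05 level, so the downward trend should be read as suggestive rather than established.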

Passenger by carrier

# 6. ANOVA to test categorical relationships
cat("\n=== ANOVA: PASSENGERS BY CARRIER TYPE ===\n")
## 
## === ANOVA: PASSENGERS BY CARRIER TYPE ===
anova_carrier <- aov(passengers ~ carrier_type, data = flights)
summary(anova_carrier)
##                 Df    Sum Sq  Mean Sq F value Pr(>F)    
## carrier_type     2 9.419e+07 47096599   615.1 <2e-16 ***
## Residuals    32473 2.486e+09    76563                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n=== ANOVA: PASSENGERS BY DISTANCE CATEGORY ===\n")
## 
## === ANOVA: PASSENGERS BY DISTANCE CATEGORY ===
anova_distance <- aov(passengers ~ distance_cat, data = flights)
summary(anova_distance)
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## distance_cat    10 4.177e+07 4176560   53.41 <2e-16 ***
## Residuals    32465 2.539e+09   78197                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LINEAR REGRESSION FOR PASSENGERS

# linear regression

better_model <- lm(passengers ~ distance + carrier_type + month, data = flights)
summary(better_model)
## 
## Call:
## lm(formula = passengers ~ distance + carrier_type + month, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -376.4 -192.3 -112.7  143.2  859.2 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   306.571653   4.530172  67.673   <2e-16 ***
## distance                       -0.049186   0.002705 -18.186   <2e-16 ***
## carrier_typenational carriers  92.783794   3.881768  23.902   <2e-16 ***
## carrier_typeregional carriers -88.636315   4.623451 -19.171   <2e-16 ***
## month                          -0.271610   0.773559  -0.351    0.726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.3 on 32471 degrees of freedom
## Multiple R-squared:  0.04623,    Adjusted R-squared:  0.04611 
## F-statistic: 393.5 on 4 and 32471 DF,  p-value: < 2.2e-16
# linear regression with carrier_type only
model <- lm(passengers ~ carrier_type, data = flights)
summary(model)
## 
## Call:
## lm(formula = passengers ~ carrier_type, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -344.4 -198.1 -115.3  146.9  797.7 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    256.060      1.966  130.25   <2e-16 ***
## carrier_typenational carriers  108.394      3.792   28.59   <2e-16 ***
## carrier_typeregional carriers  -54.792      4.253  -12.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 276.7 on 32473 degrees of freedom
## Multiple R-squared:  0.0365, Adjusted R-squared:  0.03644 
## F-statistic: 615.1 on 2 and 32473 DF,  p-value: < 2.2e-16
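
In a model with a single categorical predictor, the intercept is the mean of the baseline group (here major carriers) and each coefficient is the difference between a group's mean and that baseline. A short sketch using the coefficients printed above recovers the group means; the variable names are illustrative.

```r
# Coefficients from the carrier_type-only model above
intercept     <- 256.060   # baseline: major carriers
coef_national <- 108.394   # national carriers minus major carriers
coef_regional <- -54.792   # regional carriers minus major carriers

# Group means implied by the dummy-coded coefficients
group_means <- c(
  "major carriers"    = intercept,
  "national carriers" = intercept + coef_national,
  "regional carriers" = intercept + coef_regional
)

print(round(group_means, 1))
```

These values agree with the `avg_passengers` column in the carrier summary table, which is also why this model's F-statistic matches the one-way ANOVA.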
# linear regression with distance only
model <- lm(passengers ~ distance, data = flights)
summary(model)
## 
## Call:
## lm(formula = passengers ~ distance, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -279.7 -207.8 -127.6  137.6  812.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 301.66731    2.58209  116.83   <2e-16 ***
## distance     -0.03705    0.00251  -14.76   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 280.9 on 32474 degrees of freedom
## Multiple R-squared:  0.006665,   Adjusted R-squared:  0.006635 
## F-statistic: 217.9 on 1 and 32474 DF,  p-value: < 2.2e-16
# linear regression with distance AND carrier_type
model <- lm(passengers ~ distance + carrier_type, data = flights)
summary(model)
## 
## Call:
## lm(formula = passengers ~ distance + carrier_type, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -375.7 -192.4 -112.8  143.1  859.8 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   305.500101   3.347903   91.25   <2e-16 ***
## distance                       -0.049204   0.002704  -18.20   <2e-16 ***
## carrier_typenational carriers  92.899465   3.867711   24.02   <2e-16 ***
## carrier_typeregional carriers -88.661428   4.622835  -19.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.3 on 32472 degrees of freedom
## Multiple R-squared:  0.04623,    Adjusted R-squared:  0.04614 
## F-statistic: 524.6 on 3 and 32472 DF,  p-value: < 2.2e-16
# linear regression with a distance-by-carrier_type interaction
model <- lm(passengers ~ distance * carrier_type , data = flights)
summary(model)
## 
## Call:
## lm(formula = passengers ~ distance * carrier_type, data = flights)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -420.9 -192.4 -110.9  142.3  927.7 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             3.268e+02  3.629e+00  90.035  < 2e-16
## distance                               -7.036e-02  3.047e-03 -23.095  < 2e-16
## carrier_typenational carriers           1.586e+01  6.850e+00   2.315   0.0206
## carrier_typeregional carriers          -1.308e+02  6.128e+00 -21.350  < 2e-16
## distance:carrier_typenational carriers  1.020e-01  7.648e-03  13.339  < 2e-16
## distance:carrier_typeregional carriers  8.723e-02  1.063e-02   8.203 2.44e-16
##                                           
## (Intercept)                            ***
## distance                               ***
## carrier_typenational carriers          *  
## carrier_typeregional carriers          ***
## distance:carrier_typenational carriers ***
## distance:carrier_typeregional carriers ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 274.4 on 32470 degrees of freedom
## Multiple R-squared:  0.05274,    Adjusted R-squared:  0.05259 
## F-statistic: 361.5 on 5 and 32470 DF,  p-value: < 2.2e-16
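
In the interaction model, the `distance` coefficient is the slope for the baseline group (major carriers), and each interaction term is added to it to obtain the slope for another carrier type. The sketch below does that arithmetic with the printed coefficients; the variable names are illustrative.

```r
# Coefficients from the interaction model above
base_slope   <- -7.036e-02   # distance (major carriers)
int_national <-  1.020e-01   # distance:national carriers
int_regional <-  8.723e-02   # distance:regional carriers

# Distance slope for each carrier type
slopes <- c(
  major    = base_slope,
  national = base_slope + int_national,
  regional = base_slope + int_regional
)

# Change in passengers per 100 units of distance
print(round(slopes * 100, 2))
```

The slope is negative for major carriers but positive for national and regional carriers, which a single pooled distance coefficient cannot show.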
# Create aggregated features
origin_features <- flights %>%
  group_by(origin_city_name) %>%
  summarise(
    origin_avg_passengers = mean(passengers),
    origin_flight_count = n(),
    .groups = "drop"
  )

dest_features <- flights %>%
  group_by(dest_city_name) %>%
  summarise(
    dest_avg_passengers = mean(passengers),
    dest_flight_count = n(),
    .groups = "drop"
  )

# Merge and use
flights_features <- flights %>%
  left_join(origin_features, by = "origin_city_name") %>%
  left_join(dest_features, by = "dest_city_name")

model_features <- lm(passengers ~ distance + carrier_type + 
                       origin_avg_passengers + dest_avg_passengers, 
                     data = flights_features)
summary(model_features)
## 
## Call:
## lm(formula = passengers ~ distance + carrier_type + origin_avg_passengers + 
##     dest_avg_passengers, data = flights_features)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -759.91 -166.73  -54.95  106.45  903.95 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -1.350e+02  5.469e+00 -24.682  < 2e-16 ***
## distance                      -3.378e-02  2.393e-03 -14.115  < 2e-16 ***
## carrier_typenational carriers  6.477e+01  3.428e+00  18.894  < 2e-16 ***
## carrier_typeregional carriers  2.966e+01  4.265e+00   6.954 3.61e-12 ***
## origin_avg_passengers          7.587e-01  1.314e-02  57.736  < 2e-16 ***
## dest_avg_passengers            7.694e-01  1.292e-02  59.533  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 243.1 on 32470 degrees of freedom
## Multiple R-squared:  0.2562, Adjusted R-squared:  0.2561 
## F-statistic:  2237 on 5 and 32470 DF,  p-value: < 2.2e-16
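
Because `origin_avg_passengers` and `dest_avg_passengers` are averaged over the same rows the model is fitted on, each row's own passenger count feeds into its predictors, which tends to inflate R-squared. One common way to limit this, shown here as a sketch on hypothetical toy data rather than as the method used in this report, is a leave-one-out mean that excludes the current row. The data frame `toy` and the column `origin_avg_loo` are invented for illustration.

```r
# Hypothetical toy data; not taken from the flights dataset
toy <- data.frame(
  origin = c("A", "A", "A", "B", "B"),
  passengers = c(10, 20, 30, 100, 200)
)

# Group sum and group size, repeated for every row of the group
group_sum <- ave(toy$passengers, toy$origin, FUN = sum)
group_n   <- ave(toy$passengers, toy$origin, FUN = length)

# Leave-one-out mean: average of the other rows in the same group
# (NA when a group has a single row, since nothing is left to average)
toy$origin_avg_loo <- ifelse(group_n > 1,
                             (group_sum - toy$passengers) / (group_n - 1),
                             NA_real_)

print(toy)
```

Applied to the real data, the same idea would replace `mean(passengers)` within each origin or destination group; whether it changes the fit materially would need to be checked.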

Linear regression for flights

# Aggregate flight counts and passengers by carrier type and month
flight_counts <- flights %>%
  group_by(carrier_type, month) %>%
  summarise(
    flight_count = n(),
    total_passengers = sum(passengers),
    avg_passengers = mean(passengers),
    .groups = "drop"
  )

# 1. Simple carrier type model
model1 <- lm(flight_count ~ carrier_type, data = flight_counts)
summary(model1)
## 
## Call:
## lm(formula = flight_count ~ carrier_type, data = flight_counts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -395.3  -59.0  -35.0  136.9  368.9 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2830.14      81.65   34.66  < 2e-16 ***
## carrier_typenational carriers -1789.86     115.47  -15.50 7.43e-12 ***
## carrier_typeregional carriers -2061.14     115.47  -17.85 6.79e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216 on 18 degrees of freedom
## Multiple R-squared:  0.9543, Adjusted R-squared:  0.9493 
## F-statistic: 188.1 on 2 and 18 DF,  p-value: 8.611e-13
# 2. Add month (now meaningful with aggregated data)
model2 <- lm(flight_count ~ carrier_type + month, data = flight_counts)
summary(model2)
## 
## Call:
## lm(formula = flight_count ~ carrier_type + month, data = flight_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391.69 -145.29   18.26  108.82  400.13 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2955.24     121.90  24.244 1.26e-14 ***
## carrier_typenational carriers -1789.86     112.85 -15.860 1.27e-11 ***
## carrier_typeregional carriers -2061.14     112.85 -18.264 1.31e-12 ***
## month                           -31.27      23.04  -1.358    0.192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.1 on 17 degrees of freedom
## Multiple R-squared:  0.9588, Adjusted R-squared:  0.9515 
## F-statistic: 131.9 on 3 and 17 DF,  p-value: 5.66e-12
# 3. Interaction model
model3 <- lm(flight_count ~ carrier_type * month, data = flight_counts)
summary(model3)
## 
## Call:
## lm(formula = flight_count ~ carrier_type * month, data = flight_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -312.93  -55.75  -24.25   98.54  360.75 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          2797.714    150.539  18.585 9.11e-12 ***
## carrier_typenational carriers       -1304.714    212.895  -6.128 1.93e-05 ***
## carrier_typeregional carriers       -2073.714    212.895  -9.741 7.05e-08 ***
## month                                   8.107     33.662   0.241   0.8129    
## carrier_typenational carriers:month  -121.286     47.605  -2.548   0.0223 *  
## carrier_typeregional carriers:month     3.143     47.605   0.066   0.9482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 178.1 on 15 degrees of freedom
## Multiple R-squared:  0.9741, Adjusted R-squared:  0.9655 
## F-statistic:   113 on 5 and 15 DF,  p-value: 2.34e-11
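The interaction model adds two terms to the additive model, and only one of them is individually significant. One way to judge the two terms jointly is a partial F-test on the nested fits. The sketch below is a minimal illustration on a synthetic stand-in for flight_counts: the values are made up, the object names (toy, m_add, m_int) are hypothetical, and "major carriers" is assumed to be the label of the reference level, so the printed numbers say nothing about the T-100 data.

```r
# Synthetic stand-in for flight_counts: 3 carrier types x 7 months.
# All values are invented for illustration only.
set.seed(1)
toy <- expand.grid(
  carrier_type = c("major carriers", "national carriers", "regional carriers"),
  month = 1:7,
  stringsAsFactors = FALSE
)
level_means <- c("major carriers" = 2800,
                 "national carriers" = 1000,
                 "regional carriers" = 750)
toy$flight_count <- round(unname(level_means[toy$carrier_type]) +
                            rnorm(nrow(toy), sd = 150))

# Additive model and interaction model (nested: m_add is a special case of m_int)
m_add <- lm(flight_count ~ carrier_type + month, data = toy)
m_int <- lm(flight_count ~ carrier_type * month, data = toy)

# Partial F-test: do separate month slopes per carrier type improve the fit?
cmp <- anova(m_add, m_int)
print(cmp)
```

With the real data, the same call would be `anova(model2, model3)` on the two fits above; the test has 2 numerator and 15 denominator degrees of freedom, matching the residual degrees of freedom reported in the summaries.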
# 4. Visualize
library(ggplot2)

# Plot 1: Flight counts by carrier
p1 <- ggplot(flight_counts, 
             aes(x = carrier_type, y = flight_count, fill = carrier_type)) +
  geom_boxplot() +
  labs(title = "Flight Counts by Carrier Type",
       x = "Carrier Type", y = "Number of Flights") +
  theme_minimal()

# Plot 2: Monthly trends by carrier
p2 <- ggplot(flight_counts, 
             aes(x = month, y = flight_count, color = carrier_type, 
                 group = carrier_type)) +
  geom_line(linewidth = 1) +   # linewidth replaces the deprecated size argument for lines
  geom_point(size = 2) +
  # The data cover January to July only, so label months 1 through 7
  scale_x_continuous(breaks = 1:7, labels = month.abb[1:7]) +
  labs(title = "Monthly Flight Trends by Carrier",
       x = "Month", y = "Number of Flights") +
  theme_minimal()
#print(p1)
#print(p2)
# Rebuild the carrier-by-month aggregate (same columns as flight_counts above) for the passenger models
flight_data <- flights %>%
  group_by(carrier_type, month) %>%  # grouping variables
  summarise(
    flight_count = n(),               # Number of flights
    total_passengers = sum(passengers),  # Total passengers
    avg_passengers = mean(passengers),   # Average per flight
    .groups = "drop"
  )
print(summary(flight_data))
##  carrier_type           month    flight_count  total_passengers avg_passengers 
##  Length:21          Min.   :1   Min.   : 645   Min.   :131136   Min.   :184.7  
##  Class :character   1st Qu.:2   1st Qu.: 784   1st Qu.:166225   1st Qu.:212.8  
##  Mode  :character   Median :4   Median : 996   Median :374916   Median :255.2  
##                     Mean   :4   Mean   :1546   Mean   :419533   Mean   :277.4  
##                     3rd Qu.:6   3rd Qu.:2566   3rd Qu.:620443   3rd Qu.:330.2  
##                     Max.   :7   Max.   :3199   Max.   :862468   Max.   :438.6
# Does passenger volume predict flight frequency?
model1 <- lm(flight_count ~ total_passengers, data = flight_data)
summary(model1)
## 
## Call:
## lm(formula = flight_count ~ total_passengers, data = flight_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -461.8 -284.2  151.8  201.5  513.7 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.456e+01  1.408e+02   0.174    0.863    
## total_passengers 3.628e-03  2.902e-04  12.502 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 324 on 19 degrees of freedom
## Multiple R-squared:  0.8916, Adjusted R-squared:  0.8859 
## F-statistic: 156.3 on 1 and 19 DF,  p-value: 1.293e-10
# Alternative predictor: average passengers per flight
model2 <- lm(flight_count ~ avg_passengers, data = flight_data)
summary(model2)
## 
## Call:
## lm(formula = flight_count ~ avg_passengers, data = flight_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -975.0 -850.9 -376.1  992.4 1624.3 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    1961.177    774.337   2.533   0.0203 *
## avg_passengers   -1.495      2.684  -0.557   0.5840  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 976.1 on 19 degrees of freedom
## Multiple R-squared:  0.01607,    Adjusted R-squared:  -0.03572 
## F-statistic: 0.3103 on 1 and 19 DF,  p-value: 0.584
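Part of the contrast between the two fits above follows from how the aggregate is built: within each carrier-by-month cell, total_passengers equals flight_count times avg_passengers, so total_passengers already contains the response. The sketch below checks that identity on a synthetic flights table. The data are invented, the names (toy_flights, cell) are hypothetical, and base R is used instead of dplyr only so the example runs on its own.

```r
# Synthetic flight-level records; values are invented for illustration only.
set.seed(4)
toy_flights <- data.frame(
  carrier_type = sample(c("major carriers", "national carriers", "regional carriers"),
                        300, replace = TRUE),
  month = sample(1:7, 300, replace = TRUE),
  passengers = rpois(300, lambda = 250)
)

# One cell per carrier type and month, mirroring group_by(carrier_type, month)
cell <- interaction(toy_flights$carrier_type, toy_flights$month, drop = TRUE)

flight_count     <- tapply(toy_flights$passengers, cell, length)  # n()
total_passengers <- tapply(toy_flights$passengers, cell, sum)     # sum(passengers)
avg_passengers   <- tapply(toy_flights$passengers, cell, mean)    # mean(passengers)

# sum = n * mean holds in every cell, up to floating-point rounding
identity_holds <- isTRUE(all.equal(as.numeric(total_passengers),
                                   as.numeric(flight_count * avg_passengers)))
print(identity_holds)
```

Because of this identity, a strong fit of flight_count on total_passengers is expected by construction and is weaker evidence of a demand effect than it may appear.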
# Flight count predicted by carrier type AND passengers
model3 <- lm(flight_count ~ carrier_type + avg_passengers, data = flight_data)
summary(model3)
## 
## Call:
## lm(formula = flight_count ~ carrier_type + avg_passengers, data = flight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -360.82 -115.64  -23.24   73.48  377.24 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3708.937    327.160  11.337 2.39e-09 ***
## carrier_typenational carriers -1382.129    178.213  -7.755 5.56e-07 ***
## carrier_typeregional carriers -2249.404    120.246 -18.707 8.90e-13 ***
## avg_passengers                   -3.432      1.248  -2.750   0.0137 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 184.9 on 17 degrees of freedom
## Multiple R-squared:  0.9684, Adjusted R-squared:  0.9628 
## F-statistic: 173.7 on 3 and 17 DF,  p-value: 5.977e-13
# Largest additive model: carrier type, average passengers, and month
best_model <- lm(flight_count ~ carrier_type + avg_passengers + month, 
                 data = flight_data)
summary(best_model)
## 
## Call:
## lm(formula = flight_count ~ carrier_type + avg_passengers + month, 
##     data = flight_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303.59 -118.31  -24.26   83.84  392.79 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3697.722    331.616  11.151 5.91e-09 ***
## carrier_typenational carriers -1417.538    186.365  -7.606 1.06e-06 ***
## carrier_typeregional carriers -2233.054    123.645 -18.060 4.58e-12 ***
## avg_passengers                   -3.134      1.323  -2.368   0.0308 *  
## month                           -16.276     21.391  -0.761   0.4578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 187.3 on 16 degrees of freedom
## Multiple R-squared:  0.9695, Adjusted R-squared:  0.9619 
## F-statistic: 127.2 on 4 and 16 DF,  p-value: 6.548e-12
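In the output above, adding month lowers the adjusted R-squared slightly (0.9628 for carrier type plus average passengers, 0.9619 with month added), so the largest model is not automatically the preferred one. A side-by-side table of fit criteria makes that comparison easier to read. The sketch below builds such a table on synthetic data; the values are invented, the names (toy, models, comparison) are hypothetical, and AIC is included here as an additional, commonly used criterion rather than something the original analysis reports.

```r
# Synthetic stand-in for flight_data; all values are invented for illustration.
set.seed(3)
toy <- expand.grid(
  carrier_type = c("major carriers", "national carriers", "regional carriers"),
  month = 1:7,
  stringsAsFactors = FALSE
)
level_means <- c("major carriers" = 3700,
                 "national carriers" = 2300,
                 "regional carriers" = 1500)
toy$avg_passengers <- round(runif(nrow(toy), min = 180, max = 440), 1)
toy$flight_count <- round(unname(level_means[toy$carrier_type]) -
                            3 * toy$avg_passengers +
                            rnorm(nrow(toy), sd = 180))

# Candidate models of increasing size
models <- list(
  carrier_only             = lm(flight_count ~ carrier_type, data = toy),
  carrier_passengers       = lm(flight_count ~ carrier_type + avg_passengers, data = toy),
  carrier_passengers_month = lm(flight_count ~ carrier_type + avg_passengers + month,
                                data = toy)
)

# Collect adjusted R-squared (higher is better) and AIC (lower is better)
comparison <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC),
  row.names = NULL
)
print(comparison)
```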
# Model 1: Flight count by carrier type
model1 <- lm(flight_count ~ carrier_type, data = flight_counts)
summary(model1)
## 
## Call:
## lm(formula = flight_count ~ carrier_type, data = flight_counts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -395.3  -59.0  -35.0  136.9  368.9 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2830.14      81.65   34.66  < 2e-16 ***
## carrier_typenational carriers -1789.86     115.47  -15.50 7.43e-12 ***
## carrier_typeregional carriers -2061.14     115.47  -17.85 6.79e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216 on 18 degrees of freedom
## Multiple R-squared:  0.9543, Adjusted R-squared:  0.9493 
## F-statistic: 188.1 on 2 and 18 DF,  p-value: 8.611e-13
# Model 2: Flight count by month (seasonality)
model2 <- lm(flight_count ~ month, data = flight_counts)  
summary(model2)
## 
## Call:
## lm(formula = flight_count ~ month, data = flight_counts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -929.3 -711.8 -487.9 1113.3 1683.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  1671.57     479.10   3.489  0.00246 **
## month         -31.27     107.13  -0.292  0.77351   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 981.9 on 19 degrees of freedom
## Multiple R-squared:  0.004465,   Adjusted R-squared:  -0.04793 
## F-statistic: 0.08522 on 1 and 19 DF,  p-value: 0.7735
# Model 3: Both predictors
model3 <- lm(flight_count ~ carrier_type + month, data = flight_counts)
summary(model3)
## 
## Call:
## lm(formula = flight_count ~ carrier_type + month, data = flight_counts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -391.69 -145.29   18.26  108.82  400.13 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    2955.24     121.90  24.244 1.26e-14 ***
## carrier_typenational carriers -1789.86     112.85 -15.860 1.27e-11 ***
## carrier_typeregional carriers -2061.14     112.85 -18.264 1.31e-12 ***
## month                           -31.27      23.04  -1.358    0.192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.1 on 17 degrees of freedom
## Multiple R-squared:  0.9588, Adjusted R-squared:  0.9515 
## F-statistic: 131.9 on 3 and 17 DF,  p-value: 5.66e-12
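The models above enter month as a single numeric term, which can only capture a straight-line trend from January to July. If seasonality is not linear, coding month as a factor lets each month have its own level, at the cost of more parameters on only 21 rows. The sketch below contrasts the two codings on synthetic data; the values are invented and the names (toy, m_numeric, m_factor) are hypothetical.

```r
# Synthetic stand-in for flight_counts; all values are invented for illustration.
set.seed(2)
toy <- expand.grid(
  carrier_type = c("major carriers", "national carriers", "regional carriers"),
  month = 1:7,
  stringsAsFactors = FALSE
)
level_means <- c("major carriers" = 2800,
                 "national carriers" = 1000,
                 "regional carriers" = 750)
toy$flight_count <- round(unname(level_means[toy$carrier_type]) +
                            rnorm(nrow(toy), sd = 150))

# Month as a numeric trend: one slope shared by all months
m_numeric <- lm(flight_count ~ carrier_type + month, data = toy)

# Month as a factor: a separate level for each of the 7 months
m_factor <- lm(flight_count ~ carrier_type + factor(month), data = toy)

# Coefficient counts: 4 for the numeric coding, 9 for the factor coding
print(length(coef(m_numeric)))
print(length(coef(m_factor)))

# The numeric coding is nested in the factor coding, so an F-test compares them
print(anova(m_numeric, m_factor))
```

With three observations per month, the factor coding leaves 12 residual degrees of freedom, so any month effects it estimates should be read cautiously.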