This statistical analysis examines seasonal and operational patterns in U.S. domestic air travel using the T‑100 Domestic Market (U.S. Carriers) dataset from January to July 2025. The study employs hypothesis testing to investigate two key dimensions of air travel dynamics.
We tested whether flight frequencies are uniformly distributed across months using a chi‑square goodness‑of‑fit test. Our analysis provides strong evidence against equal monthly distribution (χ² = 155.54, df = 6, p < 2.2e-16), consistent with airlines adjusting schedules in response to seasonal travel patterns.
We evaluated differences in flight frequencies and passenger volumes across three carrier classifications—major, national, and regional—using:
- Chi‑square tests for flight frequency comparisons
- ANOVA for passenger volume analysis

Both tests yielded statistically significant results (p < 2e-16), rejecting the null hypothesis of equal performance across carrier types. Major carriers dominate operations, accounting for 61% of flights and 57.6% of passengers.
This report details the inferential procedures, test assumptions, and practical interpretations of these findings, providing evidence‑based insights into seasonal scheduling strategies and carrier‑type operational differentiation in the U.S. domestic aviation market.
# import Libraries
library(GGally)
## Loading required package: ggplot2
library(scales)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Read Flights
url <- "https://raw.githubusercontent.com/mehreengillani/Final_project_Data606/refs/heads/main/flights_data_clean.csv"
flights <- read_csv(url)
## Rows: 32476 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): unique_carrier, unique_carrier_name, origin_city_name, dest_city_n...
## dbl (6): passengers, freight, mail, distance, month, distance_group
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Verify the loaded data
head(flights)
## # A tibble: 6 × 16
## passengers freight mail distance unique_carrier unique_carrier_name
## <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 20 0 0 54 C5 CommuteAir LLC dba CommuteAir
## 2 20 0 0 59 M5 Kenmore Air Harbor
## 3 20 0 0 64 J5 Kalinin Aviation LLC d/b/a A…
## 4 20 0 0 73 C5 CommuteAir LLC dba CommuteAir
## 5 20 0 0 76 6F FOX AIRCRAFT, LLC
## 6 20 0 0 79 8V Wright Air Service
## # ℹ 10 more variables: origin_city_name <chr>, dest_city_name <chr>,
## # month <dbl>, distance_group <dbl>, class <chr>, distance_cat <chr>,
## # carrier_type <chr>, month_name <chr>, route <chr>, distance_simple <chr>
1. Seasonal Flight Operations: How do flight frequencies vary across months?
Statistical Framework:
Null Hypothesis (H₀): There is no significant difference in flight frequencies across months (μ_jan = μ_feb = μ_mar = μ_apr = μ_may = μ_jun = μ_jul)
Alternative Hypothesis (H₁): Flight frequencies differ significantly across months, with summer months (June-July) expected to show elevated frequencies driven by seasonal travel patterns
Response: flight_count (Numerical)
Explanatory: month (Categorical)
# Create summary data for monthly analysis
monthly_summary <- flights %>%
group_by(month) %>%
summarise(flight_count = n()) %>%
mutate(month_name = factor(c("January", "February", "March", "April", "May", "June", "July")[month],
levels = c("January", "February", "March", "April", "May", "June", "July")))
# View the summary data
print(monthly_summary)
## # A tibble: 7 × 3
## month flight_count month_name
## <dbl> <int> <fct>
## 1 1 4991 January
## 2 2 4573 February
## 3 3 4907 March
## 4 4 4407 April
## 5 5 4792 May
## 6 6 4811 June
## 7 7 3995 July
# Basic statistics
cat("\n=== SUMMARY STATISTICS ===\n")
##
## === SUMMARY STATISTICS ===
cat("Total flights:", format(sum(monthly_summary$flight_count), big.mark = ","), "\n")
## Total flights: 32,476
cat("Average monthly flights:", format(mean(monthly_summary$flight_count), big.mark = ","), "\n")
## Average monthly flights: 4,639.429
# as.character() makes cat() print the month label instead of the factor's integer code
cat("Month with highest flights:", as.character(monthly_summary$month_name[which.max(monthly_summary$flight_count)]), "\n")
## Month with highest flights: January
cat("Month with lowest flights:", as.character(monthly_summary$month_name[which.min(monthly_summary$flight_count)]), "\n")
## Month with lowest flights: July
# Calculate percentage change from lowest to highest month
lowest_count <- min(monthly_summary$flight_count)
highest_count <- max(monthly_summary$flight_count)
percentage_increase <- ((highest_count - lowest_count) / lowest_count) * 100
cat("Seasonal increase from lowest to highest month:", round(percentage_increase, 1), "%\n")
## Seasonal increase from lowest to highest month: 24.9 %
# Create the main visualization - Bar chart
ggplot(monthly_summary, aes(x = month_name, y = flight_count, fill = flight_count)) +
geom_col(alpha = 0.8) +
geom_text(aes(label = format(flight_count, big.mark = ",")), vjust = -0.5, size = 3.5) +
scale_fill_gradient(low = "lightblue", high = "darkblue", name = "Flight Count") +
labs(title = "Flight Frequencies by Month",
subtitle = "Analysis of Seasonal Patterns in Flight Operations",
x = "Month",
y = "Number of Flights",
caption = paste("Total flights analyzed:", format(sum(monthly_summary$flight_count), big.mark = ","))) +
scale_y_continuous(labels = comma, expand = expansion(mult = c(0, 0.1))) +
theme_minimal(base_size = 12) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"))
# Calculate month-over-month changes
monthly_changes <- monthly_summary %>%
mutate(previous_month = lag(flight_count),
mom_change = flight_count - previous_month,
mom_percentage = ((flight_count - previous_month) / previous_month) * 100)
cat("\n=== MONTH-OVER-MONTH CHANGES ===\n")
##
## === MONTH-OVER-MONTH CHANGES ===
print(monthly_changes %>% select(month_name, flight_count, mom_change, mom_percentage))
## # A tibble: 7 × 4
## month_name flight_count mom_change mom_percentage
## <fct> <int> <int> <dbl>
## 1 January 4991 NA NA
## 2 February 4573 -418 -8.38
## 3 March 4907 334 7.30
## 4 April 4407 -500 -10.2
## 5 May 4792 385 8.74
## 6 June 4811 19 0.396
## 7 July 3995 -816 -17.0
# Chi-Square Test: Are flights distributed equally across months?
cat("\n=== CHI-SQUARE GOODNESS OF FIT TEST ===\n")
##
## === CHI-SQUARE GOODNESS OF FIT TEST ===
monthly_counts <- flights %>%
count(month_name) %>%
arrange(match(month_name, c("jan", "feb", "mar", "apr", "may", "jun", "jul")))
#print(monthly_counts$n)
# Expected frequencies (if equal distribution)
expected <- rep(sum(monthly_counts$n) / 7, 7)
print(expected)
## [1] 4639.429 4639.429 4639.429 4639.429 4639.429 4639.429 4639.429
chi_test <- chisq.test(monthly_counts$n, p = rep(1/7, 7))
print(chi_test)
##
## Chi-squared test for given probabilities
##
## data: monthly_counts$n
## X-squared = 155.54, df = 6, p-value < 2.2e-16
# Effect size
cramers_v <- sqrt(chi_test$statistic / (sum(monthly_counts$n) * (7-1)/7))
cat("Effect size (Cramer's V):", round(cramers_v, 3), "\n")
## Effect size (Cramer's V): 0.075
# Measures how strong the monthly differences are (0 = no effect, 1 = maximum effect)
# Adjusts for sample size and number of categories
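The effect-size formula above divides the statistic by n(k − 1)/k. A more commonly reported effect size for a goodness-of-fit test is Cohen's w = sqrt(χ²/n). The sketch below computes it from the monthly counts printed earlier, as an alternative view rather than a replacement for the figure above; the names `observed`, `gof`, and `cohens_w` are chosen for this example only.

```r
# Monthly flight counts (January to July) as printed in the summary table
observed <- c(jan = 4991, feb = 4573, mar = 4907, apr = 4407,
              may = 4792, jun = 4811, jul = 3995)
n <- sum(observed)
k <- length(observed)

# Goodness-of-fit test against equal monthly proportions
gof <- chisq.test(observed, p = rep(1 / k, k))

# Cohen's w: square root of the chi-square statistic divided by the sample size
cohens_w <- sqrt(unname(gof$statistic) / n)
cat("Cohen's w:", round(cohens_w, 3), "\n")
```

By Cohen's conventional benchmarks (0.1 small, 0.3 medium, 0.5 large), this value falls below the "small" threshold, which points in the same direction as the small effect reported above.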
Observed vs. expected:
| Month | Observed | Expected | Difference |
|---|---|---|---|
| jan | 4991 | 4639.429 | +351.6 |
| feb | 4573 | 4639.429 | -66.4 |
| mar | 4907 | 4639.429 | +267.6 |
| apr | 4407 | 4639.429 | -232.4 |
| may | 4792 | 4639.429 | +152.6 |
| jun | 4811 | 4639.429 | +171.6 |
| jul | 3995 | 4639.429 | -644.4 |
What the Chi-Square Test Does: It calculates how much your observed counts (actual data) differ from these expected counts (equal distribution). The large differences (especially January and July) are what produced your significant p-value.
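As a check on the description above, the statistic can be reproduced by hand as the sum of (observed − expected)² / expected. This is a minimal sketch using the monthly counts reported in this section; `observed`, `expected`, and `contrib` are names chosen for the example.

```r
# Monthly counts, January to July, taken from the summary above
observed <- c(4991, 4573, 4907, 4407, 4792, 4811, 3995)

# Under H0 every month receives the same share of the total
expected <- rep(sum(observed) / length(observed), length(observed))

# Per-month contribution to the statistic, then the statistic itself
contrib <- (observed - expected)^2 / expected
chi_sq  <- sum(contrib)
df      <- length(observed) - 1

# Upper-tail probability from the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

cat("Chi-square:", round(chi_sq, 2), "on", df, "df\n")
print(round(100 * contrib / chi_sq, 1))  # percent of the statistic from each month
```

The per-month percentages show which months drive the result; the largest contributions come from July and January.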
Statistical Conclusion: REJECT the Null Hypothesis
Decision rule:
- p < 0.05: Reject H₀ - Significant evidence that flight frequencies vary by month
- p ≥ 0.05: Fail to reject H₀ - No significant evidence of monthly variation
Interpretation:
- p-value < 2.2e-16 (extremely small) → Strong evidence against H₀
- X-squared = 155.54 with 6 degrees of freedom → Large test statistic
Result: Flight frequencies are NOT equally distributed across months
Statistical Significance: If flights were equally distributed across months, a departure this large would be extremely unlikely to arise by chance. There are statistically significant differences in flight frequencies across months.
Effect size (Cramer’s V): 0.075 means you have a SMALL EFFECT SIZE
Interpreting Cramer’s V:
| Value Range | Effect Size | Interpretation |
|---|---|---|
| 0.00 - 0.06 | Negligible | No practical difference |
| 0.07 - 0.20 | Small | Weak association |
| 0.21 - 0.35 | Medium | Moderate association |
| 0.36+ | Large | Strong association |
Real-World Interpretation:
While monthly differences are statistically detectable, the magnitude of these differences is quite small. Month alone explains very little of the variation in flight frequencies.
# Check actual monthly distribution
monthly_counts %>%
mutate(percentage = round(n/sum(n)*100, 1)) %>%
arrange(month_name)
## # A tibble: 7 × 3
## month_name n percentage
## <chr> <int> <dbl>
## 1 apr 4407 13.6
## 2 feb 4573 14.1
## 3 jan 4991 15.4
## 4 jul 3995 12.3
## 5 jun 4811 14.8
## 6 mar 4907 15.1
## 7 may 4792 14.8
Statistical Conclusion: REJECT NULL HYPOTHESIS ✅
Statistical Interpretation
“We reject the null hypothesis that flight frequencies are uniformly distributed across months. The Chi-square goodness of fit test reveals statistically significant variation in flight frequencies across different months (χ² = 155.54, df = 6, p-value < 2.2e-16).”
2. How do flight frequencies and passenger volumes differ across carrier types (major, national, and regional carriers)?
Hypotheses:
Null Hypothesis (H₀): There is no significant difference in flight frequencies and passenger volumes among major carriers, national carriers, and regional carriers. (Statistical Form: All carrier types have equal mean flight frequencies or passenger volumes)
Alternative Hypothesis (H₁): There are significant differences in flight frequencies and passenger volumes among carrier types, with major carriers handling higher volumes than national and regional carriers. Statistical Form: At least one carrier type has significantly different mean flight frequencies or passenger volumes, with major carriers expected to demonstrate the highest values.
Explanatory: carrier_type (Categorical: major, national, regional)
# Statistical Analysis: Carrier Type vs Flight Frequencies
cat("\n=== CARRIER TYPE ANALYSIS ===\n")
##
## === CARRIER TYPE ANALYSIS ===
# 1. Count flights by carrier type
carrier_summary <- flights %>%
group_by(carrier_type) %>%
summarise(
flight_count = n(),
total_passengers = sum(passengers),
avg_passengers = mean(passengers),
.groups = 'drop'
)
print(carrier_summary)
## # A tibble: 3 × 4
## carrier_type flight_count total_passengers avg_passengers
## <chr> <int> <dbl> <dbl>
## 1 major carriers 19811 5072805 256.
## 2 national carriers 7282 2653957 364.
## 3 regional carriers 5383 1083427 201.
# 2. Visualize the differences
ggplot(carrier_summary, aes(x = carrier_type, y = flight_count, fill = carrier_type)) +
geom_bar(stat = "identity") +
geom_text(aes(label = flight_count), vjust = -0.5, size = 5) +
labs(title = "Flight Counts by Carrier Type",
x = "Carrier Type", y = "Number of Flights") +
theme_classic()
# 3. Perform Chi-Square Test (Simplest approach)
cat("\n=== CHI-SQUARE TEST ===\n")
##
## === CHI-SQUARE TEST ===
chi_test <- chisq.test(carrier_summary$flight_count)
print(chi_test)
##
## Chi-squared test for given probabilities
##
## data: carrier_summary$flight_count
## X-squared = 11355, df = 2, p-value < 2.2e-16
Statistical Conclusion: REJECT THE NULL HYPOTHESIS
Interpretation:
- p-value < 2.2e-16: Extremely significant (counts this uneven would be very unlikely if flights were split equally across carrier types)
- X-squared = 11,355: Massive test statistic
- df = 2: Comparing 3 carrier types
What This Means:
STRONG EVIDENCE that flight frequencies are NOT equally distributed across carrier types
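To see which carrier types drive the statistic, one option is to inspect the Pearson residuals, (observed − expected) / sqrt(expected), that `chisq.test()` returns. The sketch below re-enters the three flight counts from the carrier summary; `carrier_counts`, `carrier_test`, and `pearson_res` are names chosen for this example.

```r
# Flight counts by carrier type, as printed in the carrier summary
carrier_counts <- c(major = 19811, national = 7282, regional = 5383)

# Goodness-of-fit test against an equal three-way split (the default p)
carrier_test <- chisq.test(carrier_counts)

# Pearson residuals: (observed - expected) / sqrt(expected)
pearson_res <- carrier_test$residuals
print(round(pearson_res, 1))

# The squared residuals sum to the chi-square statistic
cat("Sum of squared residuals:", round(sum(pearson_res^2)), "\n")
```

Positive residuals mark carrier types with more flights than an equal split would predict, and negative residuals mark those with fewer. Note that an equal split is only the default benchmark of the test, not necessarily a realistic expectation for carrier types of different size.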
# 4. Compare passenger volumes (ANOVA)
cat("\n=== PASSENGER VOLUME COMPARISON ===\n")
##
## === PASSENGER VOLUME COMPARISON ===
anova_test <- aov(passengers ~ carrier_type, data = flights)
summary(anova_test)
## Df Sum Sq Mean Sq F value Pr(>F)
## carrier_type 2 9.419e+07 47096599 615.1 <2e-16 ***
## Residuals 32473 2.486e+09 76563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Statistical Conclusion: REJECT THE NULL HYPOTHESIS
ANOVA Results Interpretation:
- p-value < 2e-16: Extremely significant
- F-value = 615.1: Very large F-statistic
- Significance: ***: Highest significance level
What This Means:
STRONG EVIDENCE that passenger volumes differ significantly across carrier types
Key Findings:
- Flight counts differ by carrier type (Chi-square test)
- Passenger volumes differ by carrier type (ANOVA test)
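With more than 32,000 rows, even small differences in means produce a very large F-statistic, so an effect size is a useful companion to the p-value. The sketch below computes eta-squared (between-group sum of squares over total sum of squares) from the rounded values printed in the ANOVA table above; the variable names are chosen for this example.

```r
# Sums of squares as printed in the ANOVA table (rounded values)
ss_between <- 9.419e7   # Sum Sq for carrier_type
ss_within  <- 2.486e9   # Sum Sq for Residuals

# Eta-squared: share of total variance in passengers explained by carrier type
eta_sq <- ss_between / (ss_between + ss_within)
cat("Eta-squared:", round(eta_sq, 4), "\n")
```

On these figures, carrier type accounts for roughly 3.7% of the variance in passengers per record, so the difference is statistically clear but modest in size.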
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
# 5. Simple percentage breakdown
cat("\n=== PERCENTAGE BREAKDOWN ===\n")
##
## === PERCENTAGE BREAKDOWN ===
carrier_summary %>%
mutate(
flight_percent = round(flight_count / sum(flight_count) * 100, 1),
passenger_percent = round(total_passengers / sum(total_passengers) * 100, 1)
) %>%
select(carrier_type, flight_count, flight_percent, total_passengers, avg_passengers, passenger_percent)%>%
kable()
| carrier_type | flight_count | flight_percent | total_passengers | avg_passengers | passenger_percent |
|---|---|---|---|---|---|
| major carriers | 19811 | 61.0 | 5072805 | 256.0600 | 57.6 |
| national carriers | 7282 | 22.4 | 2653957 | 364.4544 | 30.1 |
| regional carriers | 5383 | 16.6 | 1083427 | 201.2683 | 12.3 |
Key Insight:
Major carriers dominate the market, handling:
- 61% of all flights
- 58% of all passengers
# Create summary data for analysis
flight_summary <- flights %>%
group_by(month, carrier_type) %>%
summarise(flight_count = n(), .groups = 'drop')
# View the summary data
print(flight_summary)
## # A tibble: 21 × 3
## month carrier_type flight_count
## <dbl> <chr> <int>
## 1 1 major carriers 2967
## 2 1 national carriers 1313
## 3 1 regional carriers 711
## 4 2 major carriers 2501
## 5 2 national carriers 1362
## 6 2 regional carriers 710
## 7 3 major carriers 2789
## 8 3 national carriers 1252
## 9 3 regional carriers 866
## 10 4 major carriers 2778
## # ℹ 11 more rows
# Create the main visualization
ggplot(flight_summary, aes(x = factor(month), y = flight_count, fill = carrier_type)) +
geom_col(position = "dodge") +
labs(title = "Flight Frequencies by Month and Carrier Type",
subtitle = "Analysis of Seasonal Patterns Across Different Carrier Types",
x = "Month",
y = "Flight Count",
fill = "Carrier Type") +
scale_y_continuous(labels = comma) +
theme_minimal() +
theme(legend.position = "bottom")
# Alternative: Line plot to better show trends
ggplot(flight_summary, aes(x = month, y = flight_count, color = carrier_type, group = carrier_type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Seasonal Flight Frequency Trends by Carrier Type",
subtitle = "Monthly variation in flight operations",
x = "Month",
y = "Flight Count",
color = "Carrier Type") +
scale_y_continuous(labels = comma) +
scale_x_continuous(breaks = 1:7) +
theme_minimal()
# Calculate percentage distribution
monthly_totals <- flight_summary %>%
group_by(month) %>%
summarise(total_flights = sum(flight_count))
flight_percentage <- flight_summary %>%
left_join(monthly_totals, by = "month") %>%
mutate(percentage = (flight_count / total_flights) * 100)
# View percentage distribution
print(flight_percentage)
## # A tibble: 21 × 5
## month carrier_type flight_count total_flights percentage
## <dbl> <chr> <int> <int> <dbl>
## 1 1 major carriers 2967 4991 59.4
## 2 1 national carriers 1313 4991 26.3
## 3 1 regional carriers 711 4991 14.2
## 4 2 major carriers 2501 4573 54.7
## 5 2 national carriers 1362 4573 29.8
## 6 2 regional carriers 710 4573 15.5
## 7 3 major carriers 2789 4907 56.8
## 8 3 national carriers 1252 4907 25.5
## 9 3 regional carriers 866 4907 17.6
## 10 4 major carriers 2778 4407 63.0
## # ℹ 11 more rows
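# Chi-square goodness-of-fit test over all 21 month x carrier-type cells
# (tests whether every cell has the same count; it is not a test of independence)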
cat("\n=== CHI-SQUARE TEST ===\n")
##
## === CHI-SQUARE TEST ===
chi_test <- chisq.test(flight_summary$flight_count)
print(chi_test)
##
## Chi-squared test for given probabilities
##
## data: flight_summary$flight_count
## X-squared = 11898, df = 20, p-value < 2.2e-16
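The test above treats the 21 month-by-carrier cells as one list of counts. To ask instead whether the carrier mix changes from month to month, a chi-square test of independence on the two-way table is the usual tool. The sketch below is illustrative only: it uses the nine January to March cells visible in the printed summary, because the remaining rows are truncated in the output. On the full data one would pass the complete table, for example `table(flights$month, flights$carrier_type)`.

```r
# January to March counts by carrier type, from the printed flight_summary rows
mix <- matrix(
  c(2967, 1313, 711,
    2501, 1362, 710,
    2789, 1252, 866),
  nrow = 3, byrow = TRUE,
  dimnames = list(month = c("Jan", "Feb", "Mar"),
                  carrier_type = c("major", "national", "regional"))
)

# Test of independence between month and carrier type
indep_test <- chisq.test(mix)
print(indep_test)

# Expected counts if the carrier mix were identical in every month
print(round(indep_test$expected, 1))
```

For an r × c table the degrees of freedom are (r − 1)(c − 1), so this 3 × 3 subset has 4 and the full 7 × 3 table would have 12.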
# Calculate total flights by month
monthly_flights <- flights %>%
group_by(month) %>%
summarise(
total_flights = n(),
.groups = 'drop'
) %>%
arrange(desc(total_flights))
# Create month labels
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul")
# Create proper month order while showing highest to lowest
monthly_flights_ordered <- monthly_flights %>%
mutate(
month_name = factor(month,
levels = 1:7,
labels = month_labels[1:7]),
rank = rank(-total_flights)
) %>%
arrange(desc(total_flights))
# Create the plot
ggplot(monthly_flights_ordered, aes(x = reorder(month_name, -total_flights),
y = total_flights,
fill = total_flights)) +
geom_col(alpha = 0.8) +
geom_text(aes(label = paste0(format(total_flights, big.mark = ","),
"\n(", rank, ")")),
vjust = -0.08, size = 1.9, fontface = "bold") +
scale_fill_gradient(low = "lightblue", high = "darkblue",
name = "Total Flights") +
labs(title = "Monthly Flight Volume (January - July)",
subtitle = "Ranked from highest to lowest, with rank numbers in parentheses",
x = "Month",
y = "Total Number of Flights") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10),
plot.subtitle = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 0, hjust = 0.5))
# Correlation between month and total flights
cor(monthly_flights$month, monthly_flights$total_flights, use = "complete.obs")
## [1] -0.5844159
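There is a moderate negative correlation (r ≈ -0.58) between month and total flights. With only seven monthly observations, this estimate is imprecise and should be read with caution.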
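Because the correlation above rests on only seven monthly totals, it helps to attach a p-value and a confidence interval to it. The sketch below re-enters the monthly totals reported earlier and uses `cor.test()` from base R; `month`, `total_flights`, and `ct` are names chosen for this example.

```r
# Month index and total flights per month, from the monthly summary
month         <- 1:7
total_flights <- c(4991, 4573, 4907, 4407, 4792, 4811, 3995)

# Pearson correlation with a significance test and 95% confidence interval
ct <- cor.test(month, total_flights)

cat("r =", round(unname(ct$estimate), 3),
    " p-value =", round(ct$p.value, 3), "\n")
print(ct$conf.int)
```

If the interval spans zero, the downward trend across months cannot be distinguished from no trend at this sample size.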
# 6. ANOVA to test categorical relationships
cat("\n=== ANOVA: PASSENGERS BY CARRIER TYPE ===\n")
##
## === ANOVA: PASSENGERS BY CARRIER TYPE ===
anova_carrier <- aov(passengers ~ carrier_type, data = flights)
summary(anova_carrier)
## Df Sum Sq Mean Sq F value Pr(>F)
## carrier_type 2 9.419e+07 47096599 615.1 <2e-16 ***
## Residuals 32473 2.486e+09 76563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cat("\n=== ANOVA: PASSENGERS BY DISTANCE CATEGORY ===\n")
##
## === ANOVA: PASSENGERS BY DISTANCE CATEGORY ===
anova_distance <- aov(passengers ~ distance_cat, data = flights)
summary(anova_distance)
## Df Sum Sq Mean Sq F value Pr(>F)
## distance_cat 10 4.177e+07 4176560 53.41 <2e-16 ***
## Residuals 32465 2.539e+09 78197
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# linear regression
better_model <- lm(passengers ~ distance + carrier_type + month, data = flights)
summary(better_model)
##
## Call:
## lm(formula = passengers ~ distance + carrier_type + month, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -376.4 -192.3 -112.7 143.2 859.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306.571653 4.530172 67.673 <2e-16 ***
## distance -0.049186 0.002705 -18.186 <2e-16 ***
## carrier_typenational carriers 92.783794 3.881768 23.902 <2e-16 ***
## carrier_typeregional carriers -88.636315 4.623451 -19.171 <2e-16 ***
## month -0.271610 0.773559 -0.351 0.726
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.3 on 32471 degrees of freedom
## Multiple R-squared: 0.04623, Adjusted R-squared: 0.04611
## F-statistic: 393.5 on 4 and 32471 DF, p-value: < 2.2e-16
# linear regression with carrier_type only
model <- lm(passengers ~ carrier_type, data = flights)
summary(model)
##
## Call:
## lm(formula = passengers ~ carrier_type, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -344.4 -198.1 -115.3 146.9 797.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 256.060 1.966 130.25 <2e-16 ***
## carrier_typenational carriers 108.394 3.792 28.59 <2e-16 ***
## carrier_typeregional carriers -54.792 4.253 -12.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 276.7 on 32473 degrees of freedom
## Multiple R-squared: 0.0365, Adjusted R-squared: 0.03644
## F-statistic: 615.1 on 2 and 32473 DF, p-value: < 2.2e-16
# linear regression with distance only
model <- lm(passengers ~ distance, data = flights)
summary(model)
##
## Call:
## lm(formula = passengers ~ distance, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -279.7 -207.8 -127.6 137.6 812.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 301.66731 2.58209 116.83 <2e-16 ***
## distance -0.03705 0.00251 -14.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 280.9 on 32474 degrees of freedom
## Multiple R-squared: 0.006665, Adjusted R-squared: 0.006635
## F-statistic: 217.9 on 1 and 32474 DF, p-value: < 2.2e-16
# linear regression with distance AND carrier_type
model <- lm(passengers ~ distance + carrier_type, data = flights)
summary(model)
##
## Call:
## lm(formula = passengers ~ distance + carrier_type, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -375.7 -192.4 -112.8 143.1 859.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 305.500101 3.347903 91.25 <2e-16 ***
## distance -0.049204 0.002704 -18.20 <2e-16 ***
## carrier_typenational carriers 92.899465 3.867711 24.02 <2e-16 ***
## carrier_typeregional carriers -88.661428 4.622835 -19.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 275.3 on 32472 degrees of freedom
## Multiple R-squared: 0.04623, Adjusted R-squared: 0.04614
## F-statistic: 524.6 on 3 and 32472 DF, p-value: < 2.2e-16
# linear regression with a distance x carrier_type interaction
model <- lm(passengers ~ distance * carrier_type, data = flights)
summary(model)
##
## Call:
## lm(formula = passengers ~ distance * carrier_type, data = flights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -420.9 -192.4 -110.9 142.3 927.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.268e+02 3.629e+00 90.035 < 2e-16
## distance -7.036e-02 3.047e-03 -23.095 < 2e-16
## carrier_typenational carriers 1.586e+01 6.850e+00 2.315 0.0206
## carrier_typeregional carriers -1.308e+02 6.128e+00 -21.350 < 2e-16
## distance:carrier_typenational carriers 1.020e-01 7.648e-03 13.339 < 2e-16
## distance:carrier_typeregional carriers 8.723e-02 1.063e-02 8.203 2.44e-16
##
## (Intercept) ***
## distance ***
## carrier_typenational carriers *
## carrier_typeregional carriers ***
## distance:carrier_typenational carriers ***
## distance:carrier_typeregional carriers ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.4 on 32470 degrees of freedom
## Multiple R-squared: 0.05274, Adjusted R-squared: 0.05259
## F-statistic: 361.5 on 5 and 32470 DF, p-value: < 2.2e-16
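In an interaction model the `distance` coefficient is the slope for the reference group (major carriers), and each interaction term is the difference in slope for another group. The sketch below adds the printed coefficients to recover one distance slope per carrier type; the variable names are chosen for this example, and the values are the rounded estimates from the output above.

```r
# Coefficients copied from the interaction model output (rounded as printed)
b_distance     <- -7.036e-02  # slope for major carriers (reference level)
b_int_national <-  1.020e-01  # distance:carrier_typenational carriers
b_int_regional <-  8.723e-02  # distance:carrier_typeregional carriers

# Slope of passengers on distance within each carrier type
slopes <- c(
  major    = b_distance,
  national = b_distance + b_int_national,
  regional = b_distance + b_int_regional
)

# Change in passengers per 100 units of distance
print(round(slopes * 100, 2))
```

On these estimates, passengers per record fall with distance for major carriers but rise slightly with distance for national and regional carriers, which is why the pooled distance slope in the additive models understates the differences between groups.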
# Create aggregated features
origin_features <- flights %>%
group_by(origin_city_name) %>%
summarise(
origin_avg_passengers = mean(passengers),
origin_flight_count = n(),
.groups = "drop"
)
dest_features <- flights %>%
group_by(dest_city_name) %>%
summarise(
dest_avg_passengers = mean(passengers),
dest_flight_count = n(),
.groups = "drop"
)
# Merge and use
flights_features <- flights %>%
left_join(origin_features, by = "origin_city_name") %>%
left_join(dest_features, by = "dest_city_name")
model_features <- lm(passengers ~ distance + carrier_type +
origin_avg_passengers + dest_avg_passengers,
data = flights_features)
summary(model_features)
##
## Call:
## lm(formula = passengers ~ distance + carrier_type + origin_avg_passengers +
## dest_avg_passengers, data = flights_features)
##
## Residuals:
## Min 1Q Median 3Q Max
## -759.91 -166.73 -54.95 106.45 903.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.350e+02 5.469e+00 -24.682 < 2e-16 ***
## distance -3.378e-02 2.393e-03 -14.115 < 2e-16 ***
## carrier_typenational carriers 6.477e+01 3.428e+00 18.894 < 2e-16 ***
## carrier_typeregional carriers 2.966e+01 4.265e+00 6.954 3.61e-12 ***
## origin_avg_passengers 7.587e-01 1.314e-02 57.736 < 2e-16 ***
## dest_avg_passengers 7.694e-01 1.292e-02 59.533 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 243.1 on 32470 degrees of freedom
## Multiple R-squared: 0.2562, Adjusted R-squared: 0.2561
## F-statistic: 2237 on 5 and 32470 DF, p-value: < 2.2e-16
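One caveat on the model above: `origin_avg_passengers` and `dest_avg_passengers` are averaged over the same rows the model is fitted on, so each row's own passenger count feeds into its predictors, which may inflate R-squared. A possible hedge is a leave-one-out mean that excludes the current row. The sketch below shows the idea on a small hypothetical data frame (`toy` and its values are invented for illustration and are not from the flights data).

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  origin_city_name = c("A", "A", "A", "B", "B"),
  passengers       = c(10, 20, 30, 40, 60)
)

toy_loo <- toy %>%
  group_by(origin_city_name) %>%
  mutate(
    origin_n = n(),
    # Mean of the other rows in the same origin; NA when the origin has one row
    origin_loo_avg = if_else(origin_n > 1,
                             (sum(passengers) - passengers) / (origin_n - 1),
                             NA_real_)
  ) %>%
  ungroup()

print(toy_loo)
```

The same pattern could be applied to `dest_city_name`. Whether it materially changes the fit here would need to be checked on the actual data.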
# Aggregate flight counts and passenger totals by carrier type and month
flight_counts <- flights %>%
group_by(carrier_type, month) %>%
summarise(
flight_count = n(),
total_passengers = sum(passengers),
avg_passengers = mean(passengers),
.groups = "drop"
)
# 1. Simple carrier type model
model1 <- lm(flight_count ~ carrier_type, data = flight_counts)
summary(model1)
##
## Call:
## lm(formula = flight_count ~ carrier_type, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.3 -59.0 -35.0 136.9 368.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2830.14 81.65 34.66 < 2e-16 ***
## carrier_typenational carriers -1789.86 115.47 -15.50 7.43e-12 ***
## carrier_typeregional carriers -2061.14 115.47 -17.85 6.79e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216 on 18 degrees of freedom
## Multiple R-squared: 0.9543, Adjusted R-squared: 0.9493
## F-statistic: 188.1 on 2 and 18 DF, p-value: 8.611e-13
# 2. Add month (now meaningful with aggregated data)
model2 <- lm(flight_count ~ carrier_type + month, data = flight_counts)
summary(model2)
##
## Call:
## lm(formula = flight_count ~ carrier_type + month, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391.69 -145.29 18.26 108.82 400.13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2955.24 121.90 24.244 1.26e-14 ***
## carrier_typenational carriers -1789.86 112.85 -15.860 1.27e-11 ***
## carrier_typeregional carriers -2061.14 112.85 -18.264 1.31e-12 ***
## month -31.27 23.04 -1.358 0.192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 211.1 on 17 degrees of freedom
## Multiple R-squared: 0.9588, Adjusted R-squared: 0.9515
## F-statistic: 131.9 on 3 and 17 DF, p-value: 5.66e-12
# 3. Interaction model
model3 <- lm(flight_count ~ carrier_type * month, data = flight_counts)
summary(model3)
##
## Call:
## lm(formula = flight_count ~ carrier_type * month, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -312.93 -55.75 -24.25 98.54 360.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2797.714 150.539 18.585 9.11e-12 ***
## carrier_typenational carriers -1304.714 212.895 -6.128 1.93e-05 ***
## carrier_typeregional carriers -2073.714 212.895 -9.741 7.05e-08 ***
## month 8.107 33.662 0.241 0.8129
## carrier_typenational carriers:month -121.286 47.605 -2.548 0.0223 *
## carrier_typeregional carriers:month 3.143 47.605 0.066 0.9482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 178.1 on 15 degrees of freedom
## Multiple R-squared: 0.9741, Adjusted R-squared: 0.9655
## F-statistic: 113 on 5 and 15 DF, p-value: 2.34e-11
# 4. Visualize
library(ggplot2)
# Plot 1: Flight counts by carrier
p1 <- ggplot(flight_counts,
aes(x = carrier_type, y = flight_count, fill = carrier_type)) +
geom_boxplot() +
labs(title = "Flight Counts by Carrier Type",
x = "Carrier Type", y = "Number of Flights") +
theme_minimal()
# Plot 2: Monthly trends by carrier
p2 <- ggplot(flight_counts,
aes(x = month, y = flight_count, color = carrier_type,
group = carrier_type)) +
geom_line(linewidth = 1) +
geom_point(size = 2) +
scale_x_continuous(breaks = 1:7, labels = month.abb[1:7]) +
labs(title = "Monthly Flight Trends by Carrier",
x = "Month", y = "Number of Flights") +
theme_minimal()
#print(p1)
#print(p2)
# Create aggregated data with both flight counts and passenger info
flight_data <- flights %>%
group_by(carrier_type, month) %>% # grouping variables
summarise(
flight_count = n(), # Number of flights
total_passengers = sum(passengers), # Total passengers
avg_passengers = mean(passengers), # Average per flight
.groups = "drop"
)
print(summary(flight_data))
## carrier_type month flight_count total_passengers avg_passengers
## Length:21 Min. :1 Min. : 645 Min. :131136 Min. :184.7
## Class :character 1st Qu.:2 1st Qu.: 784 1st Qu.:166225 1st Qu.:212.8
## Mode :character Median :4 Median : 996 Median :374916 Median :255.2
## Mean :4 Mean :1546 Mean :419533 Mean :277.4
## 3rd Qu.:6 3rd Qu.:2566 3rd Qu.:620443 3rd Qu.:330.2
## Max. :7 Max. :3199 Max. :862468 Max. :438.6
# Does passenger volume predict flight frequency?
model1 <- lm(flight_count ~ total_passengers, data = flight_data)
summary(model1)
##
## Call:
## lm(formula = flight_count ~ total_passengers, data = flight_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -461.8 -284.2 151.8 201.5 513.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.456e+01 1.408e+02 0.174 0.863
## total_passengers 3.628e-03 2.902e-04 12.502 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324 on 19 degrees of freedom
## Multiple R-squared: 0.8916, Adjusted R-squared: 0.8859
## F-statistic: 156.3 on 1 and 19 DF, p-value: 1.293e-10
# Or average passengers
model2 <- lm(flight_count ~ avg_passengers, data = flight_data)
summary(model2)
##
## Call:
## lm(formula = flight_count ~ avg_passengers, data = flight_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -975.0 -850.9 -376.1 992.4 1624.3
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1961.177    774.337   2.533   0.0203 *
## avg_passengers   -1.495      2.684  -0.557   0.5840
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 976.1 on 19 degrees of freedom
## Multiple R-squared: 0.01607, Adjusted R-squared: -0.03572
## F-statistic: 0.3103 on 1 and 19 DF, p-value: 0.584
# Flight count predicted by carrier type AND average passengers per flight
model3 <- lm(flight_count ~ carrier_type + avg_passengers, data = flight_data)
summary(model3)
##
## Call:
## lm(formula = flight_count ~ carrier_type + avg_passengers, data = flight_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -360.82 -115.64 -23.24 73.48 377.24
##
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    3708.937    327.160  11.337 2.39e-09 ***
## carrier_typenational carriers -1382.129    178.213  -7.755 5.56e-07 ***
## carrier_typeregional carriers -2249.404    120.246 -18.707 8.90e-13 ***
## avg_passengers                   -3.432      1.248  -2.750   0.0137 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 184.9 on 17 degrees of freedom
## Multiple R-squared: 0.9684, Adjusted R-squared: 0.9628
## F-statistic: 173.7 on 3 and 17 DF, p-value: 5.977e-13
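# Most comprehensive model: carrier type, average passengers, and month
# (month is numeric here, so it enters as a single linear trend)
best_model <- lm(flight_count ~ carrier_type + avg_passengers + month,
                 data = flight_data)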
summary(best_model)
##
## Call:
## lm(formula = flight_count ~ carrier_type + avg_passengers + month,
## data = flight_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303.59 -118.31 -24.26 83.84 392.79
##
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    3697.722    331.616  11.151 5.91e-09 ***
## carrier_typenational carriers -1417.538    186.365  -7.606 1.06e-06 ***
## carrier_typeregional carriers -2233.054    123.645 -18.060 4.58e-12 ***
## avg_passengers                   -3.134      1.323  -2.368   0.0308 *
## month                           -16.276     21.391  -0.761   0.4578
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 187.3 on 16 degrees of freedom
## Multiple R-squared: 0.9695, Adjusted R-squared: 0.9619
## F-statistic: 127.2 on 4 and 16 DF, p-value: 6.548e-12
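In the output above the `month` term is not significant (p = 0.4578), and the adjusted R-squared drops slightly, from 0.9628 to 0.9619, once it is added. A partial F-test is one way to check formally whether a larger nested model improves on a smaller one. The sketch below is illustrative only: `compare_nested()` is a hypothetical helper, not part of the original analysis, and it assumes both models were fitted with `lm()` on the same rows.

```r
# Hypothetical helper (not part of the original analysis): compare two nested
# lm() fits with a partial F-test and report the AIC of each.
compare_nested <- function(small, big) {
  stopifnot(inherits(small, "lm"), inherits(big, "lm"))
  f_test <- anova(small, big)            # partial F-test for the added terms
  list(
    f_test  = f_test,
    p_value = f_test[["Pr(>F)"]][2],     # row 2 tests the larger model
    aic     = c(small = AIC(small), big = AIC(big))
  )
}

# Usage, assuming model3 and best_model from above are still in the workspace:
# compare_nested(model3, best_model)
```

When the larger model adds a single numeric term, as here, the F-test p-value equals the t-test p-value already reported for that term, so the main benefit of the helper is the side-by-side AIC and reuse for comparisons that add several terms at once.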
# Model 1: Flight count by carrier type
# (fitted on flight_counts; this reuses the name model1 and overwrites the earlier fit)
model1 <- lm(flight_count ~ carrier_type, data = flight_counts)
summary(model1)
##
## Call:
## lm(formula = flight_count ~ carrier_type, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -395.3 -59.0 -35.0 136.9 368.9
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    2830.14      81.65   34.66  < 2e-16 ***
## carrier_typenational carriers -1789.86     115.47  -15.50 7.43e-12 ***
## carrier_typeregional carriers -2061.14     115.47  -17.85 6.79e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 216 on 18 degrees of freedom
## Multiple R-squared: 0.9543, Adjusted R-squared: 0.9493
## F-statistic: 188.1 on 2 and 18 DF, p-value: 8.611e-13
# Model 2: Flight count by month
# (month is numeric, so this tests a linear trend rather than a general seasonal effect)
model2 <- lm(flight_count ~ month, data = flight_counts)
summary(model2)
##
## Call:
## lm(formula = flight_count ~ month, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -929.3 -711.8 -487.9 1113.3 1683.8
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1671.57     479.10   3.489  0.00246 **
## month         -31.27     107.13  -0.292  0.77351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 981.9 on 19 degrees of freedom
## Multiple R-squared: 0.004465, Adjusted R-squared: -0.04793
## F-statistic: 0.08522 on 1 and 19 DF, p-value: 0.7735
# Model 3: Carrier type and month together
model3 <- lm(flight_count ~ carrier_type + month, data = flight_counts)
summary(model3)
##
## Call:
## lm(formula = flight_count ~ carrier_type + month, data = flight_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391.69 -145.29 18.26 108.82 400.13
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    2955.24     121.90  24.244 1.26e-14 ***
## carrier_typenational carriers -1789.86     112.85 -15.860 1.27e-11 ***
## carrier_typeregional carriers -2061.14     112.85 -18.264 1.31e-12 ***
## month                           -31.27      23.04  -1.358    0.192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 211.1 on 17 degrees of freedom
## Multiple R-squared: 0.9588, Adjusted R-squared: 0.9515
## F-statistic: 131.9 on 3 and 17 DF, p-value: 5.66e-12
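The models above enter `month` as a number, so each estimates one slope per month rather than a separate level for each month. A pattern such as a spring dip followed by a summer peak would not show up as a linear trend. One way to let every month have its own effect is to convert it to a factor and test the month terms jointly. The sketch below is an illustration under stated assumptions: `fit_month_factor()` is a hypothetical helper, not part of the original analysis, and it assumes a data frame with the columns `flight_count`, `carrier_type`, and `month`, as `flight_counts` appears to have.

```r
# Hypothetical helper (not part of the original analysis): refit the
# carrier-type model with month as a factor and test the month terms jointly.
fit_month_factor <- function(data) {
  needed <- c("flight_count", "carrier_type", "month")
  stopifnot(all(needed %in% names(data)))
  data$month_f <- factor(data$month)     # one level per calendar month
  carrier_only <- lm(flight_count ~ carrier_type, data = data)
  seasonal     <- lm(flight_count ~ carrier_type + month_f, data = data)
  list(
    model  = seasonal,
    f_test = anova(carrier_only, seasonal)  # joint test of all month effects
  )
}

# Usage, assuming flight_counts from above is still in the workspace:
# fit_month_factor(flight_counts)$f_test
```

With 21 rows and seven months, this model spends six degrees of freedom on month, so the joint test has limited power and its result should be read with caution.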