## 'data.frame': 5174273 obs. of 26 variables:
## $ ItinID : num 2.03e+11 2.03e+11 2.03e+11 2.03e+11 2.03e+11 ...
## $ Coupons : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Year : int 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
## $ Quarter : int 2 2 2 2 2 2 2 2 2 2 ...
## $ Origin : chr "MDW" "MDW" "MDW" "LAS" ...
## $ OriginAirportID : int 13232 13232 13232 12889 12889 12889 12889 12889 12889 12889 ...
## $ OriginAirportSeqID: int 1323202 1323202 1323202 1288904 1288904 1288904 1288904 1288904 1288904 1288904 ...
## $ OriginCityMarketID: int 30977 30977 30977 32211 32211 32211 32211 32211 32211 32211 ...
## $ OriginCountry : chr "US" "US" "US" "US" ...
## $ OriginStateFips : int 17 17 17 32 32 32 32 32 32 32 ...
## $ OriginState : chr "IL" "IL" "IL" "NV" ...
## $ OriginStateName : chr "Illinois" "Illinois" "Illinois" "Nevada" ...
## $ OriginWac : int 41 41 41 85 85 85 85 85 85 85 ...
## $ RoundTrip : int 0 1 1 0 1 1 0 1 1 1 ...
## $ OnLine : int 1 1 1 1 1 1 1 1 1 1 ...
## $ DollarCred : int 1 1 1 1 1 1 1 1 1 1 ...
## $ FarePerMile : num 0.069 0.0883 0.0983 0.0757 0.0581 ...
## $ RPCarrier : chr "F9" "F9" "F9" "F9" ...
## $ Passengers : int 1 2 1 1 2 2 2 2 2 1 ...
## $ ItinFare : int 176 176 176 176 176 176 176 176 176 176 ...
## $ BulkFare : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Distance : int 2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
## $ DistanceGroup : int 6 4 4 5 7 2 6 3 5 7 ...
## $ MilesFlown : int 2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
## $ ItinGeoType : int 2 2 2 2 2 2 2 2 2 2 ...
## $ X : logi NA NA NA NA NA NA ...
##     ItinFare          Distance       FarePerMile
##  Min.   :    0.0   Min.   :   11   Min.   :  0.0000
##  1st Qu.:  220.0   1st Qu.: 1074   1st Qu.:  0.1180
##  Median :  373.0   Median : 1810   Median :  0.2000
##  Mean   :  440.1   Mean   : 2142   Mean   :  0.2806
##  3rd Qu.:  566.0   3rd Qu.: 2736   3rd Qu.:  0.3489
##  Max.   :38735.0   Max.   :26461   Max.   :111.5632
## 1. Summary Statistics (ItinFare, Distance, FarePerMile)
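## The summary statistics point to right-skewed fares and fares per mile. The mean exceeds the median for ItinFare (440.1 vs. 373.0) and for FarePerMile (0.2806 vs. 0.2000), and the maxima (38,735 and 111.56) lie far above the third quartiles (566 and 0.3489). A small number of unusually high-priced itineraries therefore pulls the averages upward, so the medians are the better guide to a typical itinerary. Distance shows the same pattern to a lesser degree (mean 2,142 vs. median 1,810 miles).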
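
The skewness reading above can be checked by comparing the mean with the median. The sketch below is a minimal illustration on a made-up vector; `fares` and its values are hypothetical and are not drawn from the dataset.

```r
# Hypothetical fares (illustrative values, not taken from the dataset)
fares <- c(120, 180, 220, 300, 373, 410, 566, 700, 950, 4500)

# In a right-skewed sample the mean sits above the median
fare_mean   <- mean(fares)
fare_median <- median(fares)
skew_ratio  <- fare_mean / fare_median

# Share of the total fare contributed by the single largest value
top_share <- max(fares) / sum(fares)

cat("mean:", fare_mean, "median:", fare_median, "\n")
cat("mean/median ratio:", round(skew_ratio, 2), "\n")
cat("share of total from largest fare:", round(top_share, 2), "\n")
```

Applied to the reported statistics, the same ratio is about 1.18 for ItinFare (440.1 / 373.0) and about 1.40 for FarePerMile (0.2806 / 0.2000).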
## [1] "ItinID" "Coupons" "Year"
## [4] "Quarter" "Origin" "OriginAirportID"
## [7] "OriginAirportSeqID" "OriginCityMarketID" "OriginCountry"
## [10] "OriginStateFips" "OriginState" "OriginStateName"
## [13] "OriginWac" "RoundTrip" "OnLine"
## [16] "DollarCred" "FarePerMile" "RPCarrier"
## [19] "Passengers" "ItinFare" "BulkFare"
## [22] "Distance" "DistanceGroup" "MilesFlown"
## [25] "ItinGeoType" "X"
Top 10 Origin Markets by Passenger Volume

| Origin market (OriginCityMarketID) | Passenger volume |
|---|---|
| 31703 | 719945 |
| 32575 | 582435 |
| 32457 | 420924 |
| 30977 | 414823 |
| 30852 | 394310 |
| 30194 | 361015 |
| 30721 | 326829 |
| 30397 | 307153 |
| 30325 | 299089 |
| 32467 | 292548 |

## 2. Top 10 Origin Markets by Passenger Volume
## The table and bar chart show that passenger volume is concentrated in a few origin markets. The largest, 31703, accounts for 719,945 passengers, roughly 2.5 times the volume of the tenth-ranked market, 32467 (292,548). The leading markets likely correspond to major hubs or large metropolitan regions. This concentration indicates where airlines may focus capacity or competitive strategy.
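
A ranking like this one can be produced with a grouped sum. The sketch below assumes that "passenger volume" means `Passengers` summed per `OriginCityMarketID`, which the report does not state explicitly. The data frame `itins` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical itineraries; column names mirror the dataset,
# market IDs and passenger counts are made up for illustration
itins <- data.frame(
  OriginCityMarketID = c(1, 1, 2, 2, 2, 3, 4),
  Passengers         = c(2, 1, 5, 1, 1, 4, 1)
)

# Assumption: passenger volume = Passengers summed per origin market
top_markets <- itins %>%
  group_by(OriginCityMarketID) %>%
  summarise(TotalPassengers = sum(Passengers), .groups = "drop") %>%
  arrange(desc(TotalPassengers)) %>%
  slice_head(n = 3)  # use n = 10 for a top-10 table

print(top_markets)
```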

## 3. Histogram of Itinerary Fare (ItinFare)
## The histogram shows that most itinerary fares fall in a lower-to-mid price range: half of all itineraries cost between 220 and 566 (the interquartile range), with a median of 373. The long right tail, which extends to a maximum of 38,735, suggests that premium or last-minute fares occur but are relatively rare. This pattern is typical of consumer flight pricing, where standard fares dominate.

## 4. Histogram of Distance
## Most itineraries in the dataset cover short to medium distances, as shown by the strong clustering on the left side of the histogram: the median is 1,810 miles and three quarters of itineraries fall below 2,736 miles. A long right tail, reaching 26,461 miles, represents far fewer long-haul itineraries. This distribution is consistent with a domestic network in which shorter routes are more common than long-haul ones.

## 5. Boxplot of FarePerMile
## The boxplot highlights high variability in fare per mile, including extreme outliers: the maximum of 111.56 lies far above the third quartile of 0.3489. These outliers indicate itineraries with unusually high cost per mile, likely short trips where a high base fare is spread over few miles. This suggests that pricing per mile varies widely across itineraries.
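
The outliers a boxplot draws can also be listed numerically with the usual 1.5 × IQR rule. The sketch below is a minimal example on a hypothetical vector `fpm`. It uses `quantile()`, whereas `boxplot()` works from hinges, so the fences can differ slightly from the plotted whiskers.

```r
# Hypothetical fare-per-mile values (illustrative, not from the dataset)
fpm <- c(0.08, 0.12, 0.15, 0.20, 0.22, 0.28, 0.35, 0.40, 2.50, 9.00)

# First and third quartiles and the interquartile range
q   <- quantile(fpm, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey fences: points more than 1.5 * IQR beyond the quartiles are flagged
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
outliers    <- fpm[fpm < lower_fence | fpm > upper_fence]

print(outliers)
```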
## [1] 0.3713761
## 6. Correlation Between ItinFare and Distance
## The correlation of 0.371 indicates a moderate positive linear relationship between fare and distance: longer itineraries generally cost more. The relationship is far from perfect, though. Squaring the coefficient gives about 0.14, so a linear fit on distance would account for only about 14% of the variance in fares. Other factors, such as demand and competition, also influence pricing.
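
To make the strength of the relationship concrete, the coefficient can be squared to give the share of fare variance that a straight-line fit on distance would account for. The sketch below uses hypothetical `distance` and `fare` vectors, not the dataset itself.

```r
# Hypothetical distance (miles) and fare pairs, for illustration only
distance <- c(300, 600, 900, 1500, 2000, 2500, 3000)
fare     <- c(210, 150, 320, 280, 450, 300, 520)

# Pearson correlation (the default method of cor())
r <- cor(distance, fare)

# Squared correlation: share of fare variance a straight-line fit
# on distance would account for
r_squared <- r^2

cat("r =", round(r, 3), " r^2 =", round(r_squared, 3), "\n")

# The same calculation for the coefficient reported in the output above
reported_r  <- 0.3713761
reported_r2 <- reported_r^2
cat("reported r^2 =", round(reported_r2, 3), "\n")
```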
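Comparison of Selected Origin Markets

| Origin market | Mean fare | Mean distance | Mean fare per mile | (unlabeled in source) | Passenger volume |
|---|---|---|---|---|---|
| 31703 | 505.4731 | 2143.425 | 0.3248119 | 0.5591502 | 719945 |
| 32457 | 488.6489 | 2630.363 | 0.2530510 | 0.6156422 | 420924 |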
## 7. Comparison of Two Markets (Table)
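## The comparison table shows clear differences between the two markets. Market 31703 has the higher average fare (505.47 vs. 488.65), the shorter average distance (2,143 vs. 2,630 miles), the higher average fare per mile (0.325 vs. 0.253), and the larger passenger volume (719,945 vs. 420,924). These metrics help identify which market generates more demand and which is priced lower per mile. Such comparisons support strategic decisions regarding pricing or route expansion.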

## 8. Boxplot: Fare Per Mile for Selected Markets
## The boxplot places the fare-per-mile distributions of the two markets side by side, showing whether one market has consistently higher values than the other. Differences in median and spread illustrate how pricing behavior varies between the markets. This visual complements the t-test results.
##
## Welch Two Sample t-test
##
## data: market1_data and market2_data
## t = -79.378, df = 464240, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07353282 -0.06998902
## sample estimates:
## mean of x mean of y
## 0.2530510 0.3248119
## 9. T-Test on FarePerMile Between Markets
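## The Welch t-test checks whether the average fare per mile differs between the two markets. The sample means are 0.2531 for market 32457 (x) and 0.3248 for market 31703 (y), giving t = -79.378 and p < 2.2e-16, so a difference this large is very unlikely to be due to chance. The 95% confidence interval for the difference (x minus y) runs from -0.0735 to -0.0700, meaning market 32457 averages about 0.07 less per mile. With samples this large, even small differences reach statistical significance, so the size of the gap matters more than the p-value when judging whether it is actionable.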
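
The test above can be sketched on simulated data as follows. The vectors `market1_fpm` and `market2_fpm` are hypothetical stand-ins for the two markets' fare-per-mile values, and the distribution parameters are chosen only for illustration. The example also reads off the confidence interval and formats the p-value so that a very small value is not printed as 0.

```r
# Hypothetical fare-per-mile samples for two markets (simulated, not real data)
set.seed(42)
market1_fpm <- rnorm(200, mean = 0.25, sd = 0.10)
market2_fpm <- rnorm(200, mean = 0.32, sd = 0.15)

# t.test() runs Welch's unequal-variance test by default (var.equal = FALSE)
result <- t.test(market1_fpm, market2_fpm)

# Difference in sample means (x minus y) and its 95% confidence interval
mean_diff <- unname(result$estimate[1] - result$estimate[2])
ci        <- result$conf.int

# format.pval() reports p-values below eps as "<eps" rather than as 0
p_text <- format.pval(result$p.value, eps = 2.2e-16)

cat("difference:", round(mean_diff, 4), "\n")
cat("95% CI:", round(ci[1], 4), "to", round(ci[2], 4), "\n")
cat("p-value:", p_text, "\n")
```

Reporting the interval alongside the p-value keeps the focus on how large the gap is, which matters when the sample is large enough to make almost any difference significant.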
## Summary and Insights
## The analysis shows differences in passenger volume and fare per mile between the two selected markets. The Welch t-test returned p < 2.2e-16, indicating that the difference in mean fare per mile (0.253 for market 32457 vs. 0.325 for market 31703) is statistically significant. Based on demand and fare efficiency, the California market 32457 may offer better potential for expansion than the New York market 31703.