## 'data.frame':    5174273 obs. of  26 variables:
##  $ ItinID            : num  2.03e+11 2.03e+11 2.03e+11 2.03e+11 2.03e+11 ...
##  $ Coupons           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Year              : int  2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 ...
##  $ Quarter           : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ Origin            : chr  "MDW" "MDW" "MDW" "LAS" ...
##  $ OriginAirportID   : int  13232 13232 13232 12889 12889 12889 12889 12889 12889 12889 ...
##  $ OriginAirportSeqID: int  1323202 1323202 1323202 1288904 1288904 1288904 1288904 1288904 1288904 1288904 ...
##  $ OriginCityMarketID: int  30977 30977 30977 32211 32211 32211 32211 32211 32211 32211 ...
##  $ OriginCountry     : chr  "US" "US" "US" "US" ...
##  $ OriginStateFips   : int  17 17 17 32 32 32 32 32 32 32 ...
##  $ OriginState       : chr  "IL" "IL" "IL" "NV" ...
##  $ OriginStateName   : chr  "Illinois" "Illinois" "Illinois" "Nevada" ...
##  $ OriginWac         : int  41 41 41 85 85 85 85 85 85 85 ...
##  $ RoundTrip         : int  0 1 1 0 1 1 0 1 1 1 ...
##  $ OnLine            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ DollarCred        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FarePerMile       : num  0.069 0.0883 0.0983 0.0757 0.0581 ...
##  $ RPCarrier         : chr  "F9" "F9" "F9" "F9" ...
##  $ Passengers        : int  1 2 1 1 2 2 2 2 2 1 ...
##  $ ItinFare          : int  176 176 176 176 176 176 176 176 176 176 ...
##  $ BulkFare          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Distance          : int  2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
##  $ DistanceGroup     : int  6 4 4 5 7 2 6 3 5 7 ...
##  $ MilesFlown        : int  2551 1994 1790 2324 3028 828 2507 1166 2138 3042 ...
##  $ ItinGeoType       : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ X                 : logi  NA NA NA NA NA NA ...
##     ItinFare          Distance      FarePerMile      
##  Min.   :    0.0   Min.   :   11   Min.   :  0.0000  
##  1st Qu.:  220.0   1st Qu.: 1074   1st Qu.:  0.1180  
##  Median :  373.0   Median : 1810   Median :  0.2000  
##  Mean   :  440.1   Mean   : 2142   Mean   :  0.2806  
##  3rd Qu.:  566.0   3rd Qu.: 2736   3rd Qu.:  0.3489  
##  Max.   :38735.0   Max.   :26461   Max.   :111.5632
## 1. Summary Statistics (ItinFare, Distance, FarePerMile)
## Half of all itineraries cost 373 or less, cover 1,810 miles or less, and have a fare per mile of 0.20 or less. Fares and fare per mile are clearly right-skewed: the mean exceeds the median in both cases (440.1 vs. 373.0 and 0.2806 vs. 0.2000), and the maxima (38,735 and 111.56) sit far above the third quartiles (566 and 0.3489). A small number of unusually high-priced itineraries therefore pull the overall averages upward.
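The skew described above can be checked directly from the printed summary. The following is a minimal sketch in Python, not the code used to produce this report; it assumes only the mean and median values shown in the `summary()` output, and the helper name `mean_to_median_ratio` is illustrative.

```python
# Mean and median values copied from the summary() output above.
summary_stats = {
    "ItinFare": {"mean": 440.1, "median": 373.0},
    "Distance": {"mean": 2142.0, "median": 1810.0},
    "FarePerMile": {"mean": 0.2806, "median": 0.2000},
}


def mean_to_median_ratio(stats):
    """Return mean / median; a value above 1 is a rough sign of right skew."""
    return stats["mean"] / stats["median"]


skew_ratios = {name: mean_to_median_ratio(s) for name, s in summary_stats.items()}

for name, ratio in skew_ratios.items():
    print(f"{name}: mean/median = {ratio:.2f}")
```

A ratio well above 1, as for `FarePerMile`, means the mean is being pulled up by the right tail. This is only a rough indicator and does not replace a proper skewness statistic computed on the raw data.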
##  [1] "ItinID"             "Coupons"            "Year"              
##  [4] "Quarter"            "Origin"             "OriginAirportID"   
##  [7] "OriginAirportSeqID" "OriginCityMarketID" "OriginCountry"     
## [10] "OriginStateFips"    "OriginState"        "OriginStateName"   
## [13] "OriginWac"          "RoundTrip"          "OnLine"            
## [16] "DollarCred"         "FarePerMile"        "RPCarrier"         
## [19] "Passengers"         "ItinFare"           "BulkFare"          
## [22] "Distance"           "DistanceGroup"      "MilesFlown"        
## [25] "ItinGeoType"        "X"
Top 10 Origin Markets by Passenger Volume

| OriginCityMarketID | total_passengers |
|---|---|
| 31703 | 719945 |
| 32575 | 582435 |
| 32457 | 420924 |
| 30977 | 414823 |
| 30852 | 394310 |
| 30194 | 361015 |
| 30721 | 326829 |
| 30397 | 307153 |
| 30325 | 299089 |
| 32467 | 292548 |

## 2. Top 10 Origin Markets by Passenger Volume
## The table and bar chart show that passenger volume is concentrated in a small number of origin markets. The largest market (31703, with 719,945 passengers) carries roughly 2.5 times the volume of the tenth-ranked market (32467, with 292,548). The leading markets likely reflect major hubs or large metropolitan regions, and this concentration indicates where airlines may focus capacity or competitive strategy.
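To put a number on that concentration, the sketch below computes how much of the top-10 passenger total the leading markets account for. It is written in Python and uses only the counts from the table above; the function name `share_of_top10` is illustrative. The shares are relative to the top-10 total, not to all passengers in the dataset, because the overall total is not shown in this report.

```python
# (OriginCityMarketID, total_passengers) pairs copied from the top-10 table.
top_markets = [
    (31703, 719945),
    (32575, 582435),
    (32457, 420924),
    (30977, 414823),
    (30852, 394310),
    (30194, 361015),
    (30721, 326829),
    (30397, 307153),
    (30325, 299089),
    (32467, 292548),
]

top10_total = sum(passengers for _, passengers in top_markets)


def share_of_top10(n):
    """Share of the top-10 passenger total carried by the n largest markets."""
    leading = sum(passengers for _, passengers in top_markets[:n])
    return leading / top10_total


print(f"Top-10 total passengers: {top10_total}")
print(f"Largest market share of top 10: {share_of_top10(1):.1%}")
print(f"Three largest markets share of top 10: {share_of_top10(3):.1%}")
```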

## 3. Histogram of Itinerary Fare (ItinFare)
## The histogram shows that most itinerary fares fall in a lower-to-mid price range (the middle half lies between 220 and 566), with far fewer high-cost tickets. The right skew suggests that premium or last-minute fares occur but are relatively rare. This pattern is typical of passenger airline pricing, where standard fares dominate.

## 4. Histogram of Distance
## Most itineraries in the dataset cover short to medium distances (the median is 1,810 miles), as shown by the strong clustering on the left side of the histogram. A long right tail, reaching a maximum of 26,461 miles, represents the far smaller number of long-haul itineraries. This distribution is consistent with a domestic network in which shorter routes are more common.

## 5. Boxplot of FarePerMile
## The boxplot highlights high variability in fare per mile, including extreme outliers: the maximum is 111.56, against a third quartile of 0.3489. These outliers indicate itineraries with an unusually high cost per mile, likely short trips with low mileage but high base fares. This suggests that fare per mile differs noticeably across markets.
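For a sense of where the boxplot starts marking points as outliers, the sketch below applies the common 1.5 × IQR rule to the printed quartiles. It is written in Python and assumes the plot uses that default rule, which this report does not state; the function name `tukey_upper_fence` is illustrative.

```python
# Quartiles and maximum of FarePerMile copied from the summary() output.
FPM_Q1 = 0.1180
FPM_Q3 = 0.3489
FPM_MAX = 111.5632


def tukey_upper_fence(q1, q3, k=1.5):
    """Upper outlier threshold under the k * IQR rule (k = 1.5 is an assumption)."""
    return q3 + k * (q3 - q1)


fare_per_mile_fence = tukey_upper_fence(FPM_Q1, FPM_Q3)

print(f"Upper fence for FarePerMile: {fare_per_mile_fence:.4f}")
print(f"Maximum is {FPM_MAX / fare_per_mile_fence:.0f} times the fence")
```

Under this assumption, any itinerary priced above the fence would be drawn as an individual outlier point, which is consistent with the extreme values seen in the plot.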
## [1] 0.3713761
## 6. Correlation Between ItinFare and Distance
## The correlation of about 0.37 indicates a moderate positive relationship between fare and distance: longer itineraries generally cost more. The relationship is far from perfect, so distance alone does not determine fare levels. Other factors such as demand and competition likely also influence pricing.
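One way to read the strength of this relationship is to square the correlation, which gives the share of variance in fare that a simple linear fit on distance would account for. The sketch below is in Python and assumes the printed value is a Pearson correlation, which is not stated in this report.

```python
# Correlation between ItinFare and Distance, copied from the output above.
fare_distance_r = 0.3713761

# For a simple linear fit, the squared Pearson correlation equals R-squared.
r_squared = fare_distance_r ** 2
unexplained_share = 1 - r_squared

print(f"Share of fare variance linked to distance: {r_squared:.1%}")
print(f"Share left to other factors: {unexplained_share:.1%}")
```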
Comparison of Selected Origin Markets

| OriginCityMarketID | avg_fare | avg_distance | avg_fpm | roundtrip_share | total_passengers |
|---|---|---|---|---|---|
| 31703 | 505.4731 | 2143.425 | 0.3248119 | 0.5591502 | 719945 |
| 32457 | 488.6489 | 2630.363 | 0.2530510 | 0.6156422 | 420924 |
## 7. Comparison of Two Markets (Table)
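## The comparison table shows clear differences between the two markets. Market 31703 has the higher average fare (505.47 vs. 488.65), the higher average fare per mile (0.325 vs. 0.253), and the larger passenger volume (719,945 vs. 420,924), while market 32457 has the longer average distance (2,630 vs. 2,143 miles) and the higher round-trip share (61.6% vs. 55.9%). These metrics help identify which market operates more efficiently or generates higher demand, and such comparisons support strategic decisions regarding pricing or route expansion.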

## 8. Boxplot: Fare Per Mile for Selected Markets
## The boxplot comparison shows whether one market consistently has higher fare-per-mile values than the other. Differences in spread and median illustrate how pricing behaviors vary between the markets. This visual supports the statistical findings from the t-test.
## 
##  Welch Two Sample t-test
## 
## data:  market1_data and market2_data
## t = -79.378, df = 464240, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07353282 -0.06998902
## sample estimates:
## mean of x mean of y 
## 0.2530510 0.3248119
## 9. T-Test on FarePerMile Between Markets
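## The Welch t-test checks whether the average fare per mile differs between the two markets. The p-value is below 2.2e-16, and the 95 percent confidence interval for the difference (market 32457 minus market 31703) runs from -0.0735 to -0.0700, so the lower fare per mile in market 32457 is very unlikely to be due to chance. With samples this large, even small differences reach statistical significance, so the size of the gap (about 0.07 per mile) matters as much as the p-value when judging whether the difference is actionable.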
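The pieces of the t-test output fit together as follows. The sketch below, in Python, recovers the standard error from the printed means and t statistic and rebuilds the 95 percent confidence interval. It assumes a normal quantile is an adequate stand-in for the t quantile, which is reasonable at roughly 464,000 degrees of freedom; all variable names are illustrative.

```python
from statistics import NormalDist

# Values copied from the Welch t-test output above.
mean_x = 0.2530510   # market1_data (matches avg_fpm of market 32457)
mean_y = 0.3248119   # market2_data (matches avg_fpm of market 31703)
t_stat = -79.378

# t = difference / standard error, so the standard error can be recovered.
diff = mean_x - mean_y
std_error = diff / t_stat

# Normal approximation to the t quantile (assumption: df is very large).
z = NormalDist().inv_cdf(0.975)
ci_low = diff - z * std_error
ci_high = diff + z * std_error

# Size of the gap relative to the higher-priced market.
relative_gap = abs(diff) / mean_y

print(f"Difference in means: {diff:.7f}")
print(f"Standard error: {std_error:.7f}")
print(f"Approximate 95% CI: ({ci_low:.5f}, {ci_high:.5f})")
print(f"Relative gap: {relative_gap:.1%}")
```

The rebuilt interval agrees with the printed one to within rounding, and the relative gap gives a sense of practical size that the p-value alone does not.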
## Summary and Insights
## The analysis shows differences in passenger volume and fare per mile between the two selected markets. A t-test produced a p-value below 2.2e-16, indicating that the difference in fare per mile between these markets is statistically significant. Based on demand and fare efficiency, the California market 32457 may offer better potential for expansion than the New York market 31703.