Dataset Summary
The ramen ratings dataset contains 3,180 observations and 6 variables, including review number, brand, variety, style, country, and star rating.
The dataset includes both categorical variables (brand, variety, style, and country) and numerical variables (review_number and stars). A total of 456 unique brands, 44 countries, and 8 ramen styles are represented.
The star rating ranges from 0 to 5, with:
- Mean rating = 3.69
- Median rating = 3.75
Most ratings fall between 3.25 and 4.5, indicating generally favorable reviews. There are 14 missing values in the star rating variable.
ramen_ratings %>%
  filter(style == "Pack", !is.na(stars)) %>%
  ggplot(aes(x = stars)) +
  geom_histogram(binwidth = 0.5) +
  labs(
    title = "Distribution of Ramen Ratings (Pack)",
    x = "Stars",
    y = "Count"
  )

Interpretation of Distribution (Pack Style)
The distribution of ramen ratings for the “Pack” style shows that most products receive relatively high ratings. The majority of observations fall between 3 and 5 stars, with a clear concentration around 3.5 to 4 stars.
The distribution is left-skewed, indicating that while most ramen products are rated favorably, there are a smaller number of low-rated products extending toward 0. Ratings below 2 stars are uncommon.
Overall, this suggests that “Pack” style ramen is generally well-reviewed, with most products receiving moderate to high ratings.
This pattern aligns with the overall dataset summary, where the mean (3.69) and median (3.75) also indicate generally positive ratings.
ramen_ratings %>%
  filter(style == "Pack", !is.na(stars)) %>%
  ggplot(aes(x = stars)) +
  geom_histogram(binwidth = 0.5) +
  labs(
    title = "Distribution of Ramen Ratings (Pack)",
    x = "Stars",
    y = "Count"
  )

Comparison of Ramen Ratings by Style
The faceted histograms illustrate the distribution of star ratings across different ramen styles. Overall, most styles show a similar pattern, with ratings concentrated between 3 and 5 stars, indicating generally favorable reviews regardless of packaging type.
The Pack style has the largest number of observations and shows a strong concentration of ratings between 3.5 and 5 stars, suggesting consistently high ratings. Similarly, Cup and Bowl styles also display a high frequency of ratings in this range, though with fewer observations than Pack.
In contrast, styles such as Bar, Can, Box, and Restaurant have very few observations, making it difficult to draw meaningful conclusions about their rating distributions. These smaller categories show limited variability due to the small sample sizes.
Overall, while most ramen styles are well-rated, the Pack, Cup, and Bowl styles dominate the dataset. Their ratings appear consistently high, but the less common styles have too few observations to support a reliable comparison.
ggplot(ramen_ratings %>% filter(!is.na(stars)), aes(x = stars)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~style) +
  labs(title = "Ramen Ratings by Style",
       x = "Stars",
       y = "Count")

Ramen Products by Country
The bar chart shows that Japan has the highest number of ramen products in the dataset by a wide margin, significantly exceeding all other countries. This is consistent with Japan’s central role in the global ramen market.
A second tier of countries, including the United States, South Korea, Taiwan, China and Thailand, also contribute a substantial number of products, though at much lower levels than Japan.
Beyond these top contributors, there is a sharp decline in the number of products by country. Most countries have relatively few ramen products represented, with many contributing fewer than 100 observations.
Overall, the distribution is highly skewed, with a small number of countries accounting for most of the products and a long tail of countries with minimal representation.
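I noticed duplicate country entries caused by inconsistent labels ("USA" versus "United States") and a typographical error ("Phlippines" versus "Philippines"). Before creating the final bar chart, I standardized the United States labels so that the duplicate categories were combined into a single, accurate count. The misspelled "Phlippines" entry is not recoded by the code shown here and still appears as a separate category in the later country summary, so it would need the same treatment. This data cleaning did not change the interpretation.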
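A minimal sketch of this kind of label standardization, using a small made-up vector rather than the real data. The lookup table fixes is a hypothetical name; "USA" and "Phlippines" are the two inconsistent labels visible in this report.

```r
# Toy country labels (illustrative only, not the real dataset)
country <- c("USA", "United States", "Phlippines", "Philippines", "Japan")

# Hypothetical lookup table: inconsistent label -> standard label
fixes <- c("USA" = "United States", "Phlippines" = "Philippines")

# Replace labels found in the lookup table; keep all others unchanged
needs_fix <- country %in% names(fixes)
cleaned <- country
cleaned[needs_fix] <- unname(fixes[country[needs_fix]])

table(cleaned)
```

Inside the pipeline, the same mapping could be applied to the country column within mutate().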
ramen_ratings %>%
  count(country) %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = reorder(country, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Ramen Products by Country",
       x = "Country",
       y = "Count")
ramen_ratings_clean <- ramen_ratings %>%
  mutate(country = if_else(country == "USA", "United States", country))
ramen_ratings_clean %>%
  count(country) %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = reorder(country, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Ramen Products by Country",
       x = "Country",
       y = "Count")

Highest Mean Stars Rating by Country
After grouping the data by country and calculating the average star rating, Cambodia had the highest mean stars rating at 4.20 stars. This indicates that, on average, ramen products from Cambodia received the strongest ratings in the dataset.
Because some countries have far fewer observations than others, the highest average rating may be based on a relatively small number of ramen products. So while Cambodia has the highest mean rating in this dataset, countries with more products may provide a more stable estimate of average quality.
ramen_ratings_clean %>%
  group_by(country) %>%
  summarise(mean_stars = mean(stars, na.rm = TRUE)) %>%
  arrange(desc(mean_stars)) %>%
  print(n = Inf)
## # A tibble: 43 × 2
## country mean_stars
## <chr> <dbl>
## 1 Cambodia 4.2
## 2 France 4.19
## 3 Malaysia 4.16
## 4 Indonesia 4.11
## 5 Singapore 4.10
## 6 Brazil 4.04
## 7 Sarawak 4
## 8 Myanmar 3.95
## 9 Japan 3.91
## 10 Fiji 3.88
## 11 South Korea 3.81
## 12 Hong Kong 3.81
## 13 Taiwan 3.79
## 14 Mexico 3.69
## 15 Hungary 3.61
## 16 Bangladesh 3.59
## 17 Dubai 3.58
## 18 Finland 3.58
## 19 Ukraine 3.58
## 20 Germany 3.58
## 21 Holland 3.56
## 22 Nepal 3.52
## 23 United States 3.51
## 24 Estonia 3.5
## 25 Ghana 3.5
## 26 Phlippines 3.5
## 27 China 3.47
## 28 Thailand 3.41
## 29 India 3.37
## 30 Philippines 3.36
## 31 Italy 3.33
## 32 Colombia 3.29
## 33 Australia 3.26
## 34 Russia 3.25
## 35 Sweden 3.25
## 36 Vietnam 3.17
## 37 Poland 3.08
## 38 New Zealand 3
## 39 UK 2.99
## 40 Pakistan 2.92
## 41 Netherlands 2.5
## 42 Nigeria 2.38
## 43 Canada 2.26
There are multiple ways to select variables in a dataset. Columns can
be selected directly by name, or by using helper functions such as
starts_with(), ends_with(), and
contains(). Another flexible option is to create a vector
of variable names and use all_of() to select those
variables.
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 7
## dep_time sched_dep_time arr_time sched_arr_time air_time dep_delay arr_delay
## <int> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 517 515 830 819 227 2 11
## 2 533 529 850 830 227 4 20
## 3 542 540 923 850 160 2 33
## 4 544 545 1004 1022 183 -1 -18
## 5 554 600 812 837 116 -6 -25
## 6 554 558 740 728 150 -4 12
## 7 555 600 913 854 158 -5 19
## 8 557 600 709 723 53 -3 -14
## 9 557 600 838 846 140 -3 -8
## 10 558 600 753 745 138 -2 8
## # ℹ 336,766 more rows
## # A tibble: 336,776 × 8
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ℹ 336,766 more rows
## # ℹ 2 more variables: dep_delay <dbl>, arr_delay <dbl>
# Using a vector
vars <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
select(flights, all_of(vars))
## # A tibble: 336,776 × 4
## dep_time dep_delay arr_time arr_delay
## <int> <dbl> <int> <dbl>
## 1 517 2 830 11
## 2 533 4 850 20
## 3 542 2 923 33
## 4 544 -1 1004 -18
## 5 554 -6 812 -25
## 6 554 -4 740 12
## 7 555 -5 913 19
## 8 557 -3 709 -14
## 9 557 -3 838 -8
## 10 558 -2 753 8
## # ℹ 336,766 more rows
If a variable is included multiple times in a select()
statement, it will only appear once in the resulting dataset. The
select() function ignores the repeated names and keeps the
column at the position where it was first mentioned.
## # A tibble: 336,776 × 2
## dep_time arr_time
## <int> <int>
## 1 517 830
## 2 533 850
## 3 542 923
## 4 544 1004
## 5 554 812
## 6 554 740
## 7 555 913
## 8 557 709
## 9 557 838
## 10 558 753
## # ℹ 336,766 more rows
The any_of() function selects variables from a vector
but does not produce an error if some variables are missing. Instead, it
only selects the variables that exist in the dataset. This is useful
when working with variable lists that may not always match the dataset
exactly.
## # A tibble: 336,776 × 5
## year month day dep_delay arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 2 11
## 2 2013 1 1 4 20
## 3 2013 1 1 2 33
## 4 2013 1 1 -1 -18
## 5 2013 1 1 -6 -25
## 6 2013 1 1 -4 12
## 7 2013 1 1 -5 19
## 8 2013 1 1 -3 -14
## 9 2013 1 1 -3 -8
## 10 2013 1 1 -2 8
## # ℹ 336,766 more rows
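A minimal sketch of the difference between any_of() and all_of(), assuming the dplyr package is installed. The tibble toy and the column name air_speed are made up for illustration and are not part of the flights data.

```r
library(dplyr)

# Toy data standing in for flights (illustrative only)
toy <- tibble(year = 2013L, dep_delay = c(2, 4), arr_delay = c(11, 20))

# "air_speed" is deliberately not a column of toy
wanted <- c("year", "dep_delay", "air_speed")

# any_of() keeps the names that exist and silently skips the rest
picked <- select(toy, any_of(wanted))
names(picked)

# all_of() is strict: the same vector raises an error, caught here
strict <- tryCatch(select(toy, all_of(wanted)), error = function(e) "error")
strict
```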
The result may be surprising because contains("TIME")
still matches variables such as dep_time and
arr_time. This is because select helper functions are
case-insensitive by default. This behavior can be changed by setting
ignore.case = FALSE, which forces case-sensitive
matching.
## # A tibble: 336,776 × 6
## dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
## <int> <int> <int> <int> <dbl> <dttm>
## 1 517 515 830 819 227 2013-01-01 05:00:00
## 2 533 529 850 830 227 2013-01-01 05:00:00
## 3 542 540 923 850 160 2013-01-01 05:00:00
## 4 544 545 1004 1022 183 2013-01-01 05:00:00
## 5 554 600 812 837 116 2013-01-01 06:00:00
## 6 554 558 740 728 150 2013-01-01 05:00:00
## 7 555 600 913 854 158 2013-01-01 06:00:00
## 8 557 600 709 723 53 2013-01-01 06:00:00
## 9 557 600 838 846 140 2013-01-01 06:00:00
## 10 558 600 753 745 138 2013-01-01 06:00:00
## # ℹ 336,766 more rows
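The case-sensitivity behavior described above can be sketched on a small made-up tibble (toy is a hypothetical stand-in for flights, and dplyr is assumed to be installed):

```r
library(dplyr)

# Toy data with two "time" columns (illustrative only)
toy <- tibble(dep_time = 517L, arr_time = 830L, carrier = "UA")

# Default: matching ignores case, so "TIME" matches dep_time and arr_time
loose <- select(toy, contains("TIME"))
names(loose)

# Case-sensitive matching: no column name contains upper-case "TIME"
strict <- select(toy, contains("TIME", ignore.case = FALSE))
ncol(strict)
```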
The variables dep_time and sched_dep_time
are stored in HHMM format, which makes them difficult to analyze. By
converting them into minutes since midnight, we create continuous
numeric variables that are easier to use in calculations and
comparisons.
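One possible form of this conversion is sketched below. The helper name hhmm_to_minutes is an assumption made for illustration; the integer division extracts the hours and the remainder extracts the minutes.

```r
# Hypothetical helper: convert HHMM clock times to minutes since midnight.
# The final %% 1440 maps 2400 (midnight) to 0.
hhmm_to_minutes <- function(hhmm) {
  (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440
}

# 517 means 5:17 AM, i.e. 5 * 60 + 17 minutes after midnight
converted <- hhmm_to_minutes(c(517, 600, 2400))
converted
```

In a pipeline, the helper could be applied to dep_time and sched_dep_time inside mutate().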
We would expect air_time to be similar to the difference
between arrival and departure times. However, this difference can be
inaccurate due to issues such as crossing midnight or differences in
time zones. To fix this, time variables should first be converted into a
continuous format (such as minutes since midnight), and adjustments
should be made for overnight flights.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## 20.0 82.0 129.0 150.7 192.0 695.0 9430
## Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
## -2346.0 156.0 216.0 153.2 292.0 1170.0 8713
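A rough sketch of the overnight adjustment, assuming both clock times are in the same time zone and that no flight lasts 24 hours or more. Time-zone differences are not handled here. The second flight in the example is made up.

```r
# Hypothetical helper: HHMM -> minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440

dep_time <- c(517, 2330)   # second flight departs late in the evening
arr_time <- c(830, 45)     # ... and lands after midnight

# Taking the difference modulo 1440 wraps negative values into the next day
elapsed <- (to_minutes(arr_time) - to_minutes(dep_time)) %% 1440
elapsed
```

Without the modulo step, the overnight flight would get a large negative duration, which is the kind of value seen in the minimum of the second summary.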
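We would expect the departure delay to equal the difference between
actual and scheduled departure times. However, because times are stored
in HHMM format, subtracting them directly is only correct when both
times fall within the same hour. In the output below, a flight that
departed at 554 against a scheduled 600 gets a calculated delay of -46,
while the recorded dep_delay is -6. After converting both times into
minutes since midnight, the calculated delay should closely match the
recorded dep_delay.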
# Direct HHMM subtraction (no conversion to minutes since midnight)
flights <- flights %>%
  dplyr::mutate(
    calculated_delay = dep_time - sched_dep_time
  )
dplyr::select(flights, dep_time, sched_dep_time, dep_delay, calculated_delay) %>%
  head()
## # A tibble: 6 × 4
## dep_time sched_dep_time dep_delay calculated_delay
## <int> <int> <dbl> <int>
## 1 517 515 2 2
## 2 533 529 4 4
## 3 542 540 2 2
## 4 544 545 -1 -1
## 5 554 600 -6 -46
## 6 554 558 -4 -4
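As a check on the six rows shown above, the delay can be recomputed after converting both times to minutes since midnight. The helper to_minutes is a hypothetical name, and this sketch does not handle delays that cross midnight.

```r
# First six rows copied from the output above
dep_time       <- c(517, 533, 542, 544, 554, 554)
sched_dep_time <- c(515, 529, 540, 545, 600, 558)
dep_delay      <- c(2, 4, 2, -1, -6, -4)

# Hypothetical helper: HHMM -> minutes since midnight
to_minutes <- function(hhmm) (hhmm %/% 100 * 60 + hhmm %% 100) %% 1440

# Difference in real minutes instead of raw HHMM values
calculated_delay <- to_minutes(dep_time) - to_minutes(sched_dep_time)
calculated_delay

# TRUE when every recomputed delay matches the recorded one
all(calculated_delay == dep_delay)
```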
The min_rank() function assigns the same rank to tied
values, which means that if multiple flights share the same delay, they
will receive the same rank. This can result in more than 10 rows being
returned when ties are present.
flights %>%
  mutate(rank = min_rank(desc(dep_delay))) %>%
  filter(rank <= 10) %>%
  select(year, month, day, dep_time, dep_delay, rank)
## # A tibble: 10 × 6
## year month day dep_time dep_delay rank
## <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 1301 1
## 2 2013 1 10 1121 1126 3
## 3 2013 12 5 756 896 10
## 4 2013 3 17 2321 911 7
## 5 2013 4 10 1100 960 6
## 6 2013 6 15 1432 1137 2
## 7 2013 6 27 959 899 8
## 8 2013 7 22 845 1005 5
## 9 2013 7 22 2257 898 9
## 10 2013 9 20 1139 1014 4
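The effect of ties can be sketched with base R on made-up delays. This assumes that rank() with ties.method = "min" gives the same ranks as min_rank() for non-missing values, which is how the dplyr documentation describes it.

```r
# Toy delays with a tie for second place (illustrative only)
dep_delay <- c(30, 20, 20, 10)

# Ranking the negated values puts the largest delay first,
# mirroring min_rank(desc(dep_delay))
delay_rank <- rank(-dep_delay, ties.method = "min")
delay_rank

# Asking for the "top 2" returns three rows because of the tie
top_two <- dep_delay[delay_rank <= 2]
top_two
```

In the flights output above there are no ties among the ten largest delays, so exactly 10 rows are returned.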
The expression 1:3 + 1:10 returns a vector of length 10. This occurs because of R’s vector recycling behavior, where the shorter vector (1:3) is repeated to match the length of the longer vector (1:10) before performing element-wise addition.
However, since the length of the longer vector (10) is not a multiple of the length of the shorter vector (3), R produces a warning indicating that the recycling is uneven. The shorter vector is repeated three full times and then cut off after its first element.
## [1] 2 4 6 5 7 9 8 10 12 11
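The recycling step can be spelled out explicitly with rep(). This is only an illustration of what R does implicitly; the variable names are made up.

```r
short <- 1:3
long  <- 1:10

# Spell out the recycling: repeat short until it has 10 elements
recycled <- rep(short, length.out = length(long))
recycled

# Element-wise addition of two equal-length vectors, no warning
manual <- recycled + long
manual

# The built-in addition gives the same values (warning suppressed here)
builtin <- suppressWarnings(short + long)
identical(manual, builtin)
```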
R provides a full set of trigonometric functions, including
sin(), cos(), and tan(), as well as
their inverses such as asin(), acos(), and
atan(). These functions operate on numeric inputs. The
forward functions expect angles in radians, and the inverse functions
return angles in radians.
There are several ways to assess typical delay characteristics:
For the given scenarios:
- Flights that are equally likely to be early or late will have a mean near zero but high variability.
- Flights that are always late by a similar amount will have a consistent delay and low variability.
- Rare extreme delays (e.g., 1% very late flights) will increase the mean but leave the median almost unchanged.
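The three scenarios can be illustrated with made-up delay values (100 flights each; the specific numbers are assumptions chosen only to show the pattern):

```r
# Made-up arrival delays in minutes, 100 flights per scenario
early_or_late <- rep(c(-15, 15), times = 50)   # half early, half late
always_late   <- rep(10, times = 100)          # always 10 minutes late
rare_extreme  <- c(rep(0, 99), 120)            # 1% two hours late

scenarios <- list(early_or_late = early_or_late,
                  always_late   = always_late,
                  rare_extreme  = rare_extreme)

# Mean, median, and standard deviation for each scenario
summary_stats <- sapply(scenarios, function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x))
})
summary_stats
```

Comparing the mean and median rows shows why the median is the more robust summary when a few extreme delays are present.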
Arrival delay is generally more important than departure delay, since it reflects the total impact on passengers reaching their destination, including delays incurred during the flight.
The same result as count() can be created using
group_by() and summarise().
- count(dest) is equivalent to grouping by dest and using summarise(n = n()).
- count(tailnum, wt = distance) is equivalent to grouping by tailnum and summing the distance variable with summarise(n = sum(distance, na.rm = TRUE)).
This works because count() is essentially a shortcut for
grouping observations and then summarizing them.
# Equivalent to: not_cancelled %>% count(dest)
not_cancelled %>%
  group_by(dest) %>%
  summarise(n = n())
## # A tibble: 104 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ℹ 94 more rows
# Equivalent to: not_cancelled %>% count(tailnum, wt = distance)
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance, na.rm = TRUE))
## # A tibble: 4,037 × 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ℹ 4,027 more rows
The definition is.na(dep_delay) | is.na(arr_delay) is
suboptimal because missing arrival delay does not always indicate
cancellation. In my summary, arr_delay had more missing
values (9,430) than dep_time and dep_delay
(8,255 each), suggesting that some flights have missing arrival delay
for other reasons. The most important column is dep_time,
since a missing departure time is the clearest indicator that a flight
was cancelled.
flights %>%
  summarise(
    missing_dep_time = sum(is.na(dep_time)),
    missing_dep_delay = sum(is.na(dep_delay)),
    missing_arr_delay = sum(is.na(arr_delay))
  )
## # A tibble: 1 × 3
## missing_dep_time missing_dep_delay missing_arr_delay
## <int> <int> <int>
## 1 8255 8255 9430
The daily summary shows that the number of cancelled flights varies across the year. While the overall cancellation rate is relatively low, it fluctuates from day to day, indicating that cancellations are not evenly distributed. For example, in early January, the number of cancelled flights ranges from 1 to 10 per day, with corresponding variation in cancellation rates.
Average departure delay also varies across days. Some days with higher cancellation counts tend to also have higher average delays, suggesting that disruptions affecting airline operations, such as weather or congestion, may impact both cancellations and delays simultaneously. For example, January 2 and January 3 show relatively higher cancellation rates and higher average delays compared to later days with fewer cancellations and lower delays.
Overall, there appears to be a potential relationship between cancellation rates and average delay, where more disruptive days may lead to both increased cancellations and longer delays. However, this relationship is not entirely clear from the table alone, and a visual analysis (such as a scatterplot) would provide a better assessment of the strength of this relationship.
flights %>%
  mutate(cancelled = is.na(dep_time)) %>%
  group_by(year, month, day) %>%
  summarise(
    total_flights = n(),
    cancelled_flights = sum(cancelled),
    cancel_rate = cancelled_flights / total_flights,
    avg_delay = mean(dep_delay, na.rm = TRUE)
  )
## # A tibble: 365 × 7
## # Groups: year, month [12]
## year month day total_flights cancelled_flights cancel_rate avg_delay
## <int> <int> <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 842 4 0.00475 11.5
## 2 2013 1 2 943 8 0.00848 13.9
## 3 2013 1 3 914 10 0.0109 11.0
## 4 2013 1 4 915 6 0.00656 8.95
## 5 2013 1 5 720 3 0.00417 5.73
## 6 2013 1 6 832 1 0.00120 7.15
## 7 2013 1 7 933 3 0.00322 5.42
## 8 2013 1 8 899 4 0.00445 2.55
## 9 2013 1 9 902 5 0.00554 2.28
## 10 2013 1 10 932 3 0.00322 2.84
## # ℹ 355 more rows
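A rough sketch of the suggested visual check, using only the ten daily rows printed above. Ten days are far too few to judge the relationship, so this is an illustration of the approach rather than a result; the full analysis would use all 365 daily rows.

```r
# First ten daily values copied from the summary output above
daily <- data.frame(
  cancel_rate = c(0.00475, 0.00848, 0.0109, 0.00656, 0.00417,
                  0.00120, 0.00322, 0.00445, 0.00554, 0.00322),
  avg_delay   = c(11.5, 13.9, 11.0, 8.95, 5.73, 7.15, 5.42, 2.55, 2.28, 2.84)
)

# Scatterplot of cancellation rate against average departure delay
plot(daily$cancel_rate, daily$avg_delay,
     xlab = "Cancellation rate",
     ylab = "Average departure delay (min)")

# Pearson correlation as a single-number summary of the same relationship
r <- cor(daily$cancel_rate, daily$avg_delay)
r
```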
The results show clear differences in average arrival delays across carriers. Some airlines appear to have significantly higher delays than others. For example, carriers such as F9 and FL have the highest average delays, at approximately 21.9 and 20.1 minutes, respectively. In contrast, carriers such as AS and HA have negative average delays, indicating that their flights tend to arrive earlier than scheduled on average.
Although this summary suggests that some airlines perform worse than others, the comparison is potentially misleading. It does not account for differences in routes, destinations, or operating conditions. For instance, some carriers may fly more frequently to congested airports or over longer distances, both of which can increase delays.
Additionally, airlines serving specific regions or operating long-haul flights may experience different patterns of delay than those operating shorter or less congested routes. As a result, the observed differences in average delays may reflect factors beyond the airlines’ control rather than true differences in performance.
Overall, while the analysis identifies carriers with higher average delays, a fair comparison would require comparing airlines on similar routes or adjusting for factors such as route, distance, and airport congestion.
flights %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 × 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 21.9
## 2 FL 20.1
## 3 EV 15.8
## 4 YV 15.6
## 5 OO 11.9
## 6 MQ 10.8
## 7 WN 9.65
## 8 B6 9.46
## 9 9E 7.38
## 10 UA 3.56
## 11 US 2.13
## 12 VX 1.76
## 13 DL 1.64
## 14 AA 0.364
## 15 HA -6.92
## 16 AS -9.93
R provides a full set of trigonometric functions, including
sin(), cos(), and tan(), as well
as their inverse functions such as asin(),
acos(), and atan(). These functions operate on
numeric inputs and return values based on angles measured in
radians.
For example, sin(pi / 2) = 1, cos(0) = 1,
and tan(pi / 4) = 1, which are expected values for these
standard angles. This confirms that R uses radians rather than degrees
by default.
These functions are vectorized, meaning they can be applied to entire vectors of values efficiently, making them useful for mathematical and statistical computations.
## [1] 1
## [1] 1
## [1] 1
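A short sketch of the points above: standard angles in radians, conversion from degrees, and vectorized evaluation. The variable names are illustrative.

```r
# Angles are interpreted as radians
sin(pi / 2)
cos(0)
tan(pi / 4)

# Degrees must be converted first: radians = degrees * pi / 180
degrees <- c(0, 30, 90)
sines <- sin(degrees * pi / 180)   # vectorized over the whole vector
sines

# Inverse functions return radians, converted back to degrees here
angle_deg <- asin(1) * 180 / pi
angle_deg
```

Because of floating-point rounding, results such as tan(pi / 4) are best compared with all.equal() rather than ==.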
When mutate() is used with group_by(), the
calculations are performed within each group rather than across the
entire dataset. For example, means or ranks are computed separately for
each group instead of globally.
Similarly, when filter() is used with
group_by(), the filtering conditions are applied within
each group. For example, filtering for the maximum value will return the
maximum within each group rather than the overall maximum.
Overall, grouping changes the context of operations so that they are applied to each subgroup independently rather than the dataset as a whole.
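The difference between grouped and ungrouped operations can be sketched on a small made-up tibble (toy is hypothetical, and dplyr is assumed to be installed):

```r
library(dplyr)

# Toy data: two flights per carrier (illustrative only)
toy <- tibble(
  carrier   = c("AA", "AA", "UA", "UA"),
  arr_delay = c(10, 30, 5, 50)
)

# Grouped mutate(): the mean is computed per carrier, not overall
with_means <- toy %>%
  group_by(carrier) %>%
  mutate(carrier_mean = mean(arr_delay)) %>%
  ungroup()

# Grouped filter(): keeps the worst flight of each carrier (two rows)
worst_per_carrier <- toy %>%
  group_by(carrier) %>%
  filter(arr_delay == max(arr_delay)) %>%
  ungroup()

# Ungrouped filter(): keeps only the single overall worst flight
worst_overall <- toy %>%
  filter(arr_delay == max(arr_delay))
```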
After filtering to include only planes with at least 20 flights, the plane with the worst on-time record was N203FR, with an average arrival delay of 59.1 minutes across 41 flights.
This is a more reliable result than the initial ranking because it excludes planes with only one or two flights, which can produce misleadingly high average delays. By restricting the analysis to planes with more observations, the average delay provides a more stable estimate of a plane’s typical performance.
flights %>%
  group_by(tailnum) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE),
            n = n()) %>%
  arrange(desc(avg_delay))
## # A tibble: 4,044 × 3
## tailnum avg_delay n
## <chr> <dbl> <int>
## 1 N844MH 320 1
## 2 N911DA 294 1
## 3 N922EV 276 1
## 4 N587NW 264 1
## 5 N851NW 219 1
## 6 N928DN 201 1
## 7 N7715E 188 1
## 8 N654UA 185 1
## 9 N665MQ 175. 6
## 10 N427SW 157 1
## # ℹ 4,034 more rows
flights %>%
  group_by(tailnum) %>%
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  ) %>%
  filter(n >= 20) %>%
  arrange(desc(avg_delay))
## # A tibble: 3,164 × 3
## tailnum avg_delay n
## <chr> <dbl> <int>
## 1 N203FR 59.1 41
## 2 N645MQ 51 25
## 3 N956AT 47.6 36
## 4 N988AT 44.3 37
## 5 N521VA 42.2 27
## 6 N353AT 41.2 21
## 7 N942AT 41.2 20
## 8 N6716C 40.3 25
## 9 N908MQ 38.5 22
## 10 N657MQ 38.5 39
## # ℹ 3,154 more rows
Flights departing earlier in the day tend to have the lowest average delays. In this dataset, flights leaving at 5 AM had the smallest average delay (about 0.7 minutes), followed by departures at 6 AM and 7 AM. Average delays increase steadily through the day and peak between 7 PM and 9 PM at about 24 minutes, before falling somewhat for the 10 PM and 11 PM departures. This suggests that early morning is the best time to fly to minimize delays, likely because delays accumulate as the day progresses. The 1 AM hour shows NaN because none of its flights has a recorded departure delay, so it is filtered out in the second summary.
flights %>%
  group_by(hour) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(avg_delay)
## # A tibble: 20 × 2
## hour avg_delay
## <dbl> <dbl>
## 1 5 0.688
## 2 6 1.64
## 3 7 1.91
## 4 8 4.13
## 5 9 4.58
## 6 10 6.50
## 7 11 7.19
## 8 12 8.61
## 9 13 11.4
## 10 14 13.8
## 11 23 14.0
## 12 15 16.9
## 13 16 18.8
## 14 22 18.8
## 15 17 21.1
## 16 18 21.1
## 17 21 24.2
## 18 20 24.3
## 19 19 24.8
## 20 1 NaN
flights %>%
  group_by(hour) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  filter(!is.nan(avg_delay)) %>%
  arrange(avg_delay)
## # A tibble: 19 × 2
## hour avg_delay
## <dbl> <dbl>
## 1 5 0.688
## 2 6 1.64
## 3 7 1.91
## 4 8 4.13
## 5 9 4.58
## 6 10 6.50
## 7 11 7.19
## 8 12 8.61
## 9 13 11.4
## 10 14 13.8
## 11 23 14.0
## 12 15 16.9
## 13 16 18.8
## 14 22 18.8
## 15 17 21.1
## 16 18 21.1
## 17 21 24.2
## 18 20 24.3
## 19 19 24.8
By grouping flights by destination and calculating the total arrival
delay within each group, this analysis shows how much each flight
contributes to the overall delay experienced at its destination. Flights
with larger positive delay_prop values account for a
greater share of delays, while negative values represent flights that
arrived early.
This approach is useful for identifying whether delays at a destination are concentrated in a few unusually delayed flights or distributed more evenly across many flights. In the output, destinations with relatively small total delay can show larger proportional contributions from a single delayed flight, whereas destinations with very large total delay tend to have much smaller proportions for individual flights.
flights %>%
  group_by(dest) %>%
  mutate(
    total_delay = sum(arr_delay, na.rm = TRUE),
    delay_prop = arr_delay / total_delay
  ) %>%
  select(dest, arr_delay, total_delay, delay_prop)
## # A tibble: 336,776 × 4
## # Groups: dest [105]
## dest arr_delay total_delay delay_prop
## <chr> <dbl> <dbl> <dbl>
## 1 IAH 11 30046 0.000366
## 2 IAH 20 30046 0.000666
## 3 MIA 33 3467 0.00952
## 4 BQN -18 7322 -0.00246
## 5 ATL -25 190260 -0.000131
## 6 ORD 12 97352 0.000123
## 7 FLL 19 96153 0.000198
## 8 IAD -14 74631 -0.000188
## 9 MCO -8 76185 -0.000105
## 10 ORD 8 97352 0.0000822
## # ℹ 336,766 more rows
The correlation results show a positive relationship between a flight’s delay and the delay of the immediately preceding flight at each airport. The correlations were 0.254 for EWR, 0.238 for JFK, and 0.282 for LGA, indicating that delays tend to carry forward over time.
Although the correlations are only moderate, they provide evidence that delays are not completely independent. Instead, earlier delays may contribute to later delays, likely because operational disruptions can affect multiple flights in sequence.
flights %>%
  arrange(origin, time_hour) %>%
  group_by(origin) %>%
  mutate(prev_delay = lag(dep_delay)) %>%
  select(origin, time_hour, dep_delay, prev_delay) %>%
  print(n = 20)
## # A tibble: 336,776 × 4
## # Groups: origin [3]
## origin time_hour dep_delay prev_delay
## <chr> <dttm> <dbl> <dbl>
## 1 EWR 2013-01-01 05:00:00 2 NA
## 2 EWR 2013-01-01 05:00:00 -4 2
## 3 EWR 2013-01-01 06:00:00 -5 -4
## 4 EWR 2013-01-01 06:00:00 -2 -5
## 5 EWR 2013-01-01 06:00:00 -1 -2
## 6 EWR 2013-01-01 06:00:00 1 -1
## 7 EWR 2013-01-01 06:00:00 -4 1
## 8 EWR 2013-01-01 06:00:00 0 -4
## 9 EWR 2013-01-01 06:00:00 8 0
## 10 EWR 2013-01-01 06:00:00 0 8
## 11 EWR 2013-01-01 06:00:00 -8 0
## 12 EWR 2013-01-01 06:00:00 -6 -8
## 13 EWR 2013-01-01 06:00:00 -2 -6
## 14 EWR 2013-01-01 06:00:00 -1 -2
## 15 EWR 2013-01-01 06:00:00 24 -1
## 16 EWR 2013-01-01 06:00:00 -3 24
## 17 EWR 2013-01-01 06:00:00 -2 -3
## 18 EWR 2013-01-01 06:00:00 8 -2
## 19 EWR 2013-01-01 06:00:00 1 8
## 20 EWR 2013-01-01 06:00:00 47 1
## # ℹ 336,756 more rows
flights %>%
  arrange(origin, year, month, day, sched_dep_time) %>%
  group_by(origin) %>%
  mutate(prev_delay = lag(dep_delay)) %>%
  select(origin, year, month, day, sched_dep_time, dep_delay, prev_delay)
## # A tibble: 336,776 × 7
## # Groups: origin [3]
## origin year month day sched_dep_time dep_delay prev_delay
## <chr> <int> <int> <int> <int> <dbl> <dbl>
## 1 EWR 2013 1 1 515 2 NA
## 2 EWR 2013 1 1 558 -4 2
## 3 EWR 2013 1 1 600 -5 -4
## 4 EWR 2013 1 1 600 -2 -5
## 5 EWR 2013 1 1 600 -1 -2
## 6 EWR 2013 1 1 600 1 -1
## 7 EWR 2013 1 1 600 8 1
## 8 EWR 2013 1 1 607 0 8
## 9 EWR 2013 1 1 608 24 0
## 10 EWR 2013 1 1 610 -4 24
## # ℹ 336,766 more rows
flights %>%
  arrange(origin, year, month, day, sched_dep_time) %>%
  group_by(origin) %>%
  mutate(prev_delay = lag(dep_delay)) %>%
  summarise(correlation = cor(dep_delay, prev_delay, use = "complete.obs"))
## # A tibble: 3 × 2
## origin correlation
## <chr> <dbl>
## 1 EWR 0.254
## 2 JFK 0.238
## 3 LGA 0.282
By comparing each flight’s air time to the minimum observed air time
for the same destination, this analysis creates a relative measure of
flight duration. A time_ratio of 1 indicates that a flight
matched the shortest observed time for that route, while larger values
indicate longer flights relative to the route minimum.
The current output is dominated by flights with a ratio of 1 because
the results were sorted in ascending order. This highlights the fastest
flights for each destination. To identify flights that were unusually
slow relative to the minimum, it would be more informative to sort the
output in descending order of time_ratio.
flights %>%
  filter(!is.na(air_time)) %>%
  group_by(dest) %>%
  mutate(
    min_air_time = min(air_time, na.rm = TRUE),
    time_ratio = air_time / min_air_time
  ) %>%
  arrange(time_ratio) %>%
  select(dest, air_time, min_air_time, time_ratio) %>%
  print(n = 20)
## # A tibble: 327,346 × 4
## # Groups: dest [104]
## dest air_time min_air_time time_ratio
## <chr> <dbl> <dbl> <dbl>
## 1 BWI 31 31 1
## 2 PBI 105 105 1
## 3 PWM 38 38 1
## 4 BDL 20 20 1
## 5 PWM 38 38 1
## 6 PWM 38 38 1
## 7 ACK 35 35 1
## 8 ACK 35 35 1
## 9 MSN 102 102 1
## 10 MDW 92 92 1
## 11 CAK 53 53 1
## 12 DCA 32 32 1
## 13 ORF 36 36 1
## 14 BUF 38 38 1
## 15 BHM 105 105 1
## 16 ILM 63 63 1
## 17 BQN 173 173 1
## 18 BQN 173 173 1
## 19 PSE 179 179 1
## 20 SJU 170 170 1
## # ℹ 327,326 more rows
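A minimal sketch of the descending sort suggested above, written in base R on made-up data so that it runs on its own. In the dplyr pipeline, the corresponding change would be arrange(desc(time_ratio)).

```r
# Toy flights (illustrative only)
toy <- data.frame(
  dest     = c("BOS", "BOS", "BOS", "ORD", "ORD"),
  air_time = c(40, 50, 80, 100, 130)
)

# Minimum air time per destination, repeated on every row of that destination
toy$min_air_time <- ave(toy$air_time, toy$dest, FUN = min)
toy$time_ratio   <- toy$air_time / toy$min_air_time

# Sort from the slowest flight relative to its route minimum to the fastest
slowest_first <- toy[order(-toy$time_ratio), ]
slowest_first
```

Sorted this way, the flights that took longest relative to the fastest flight on the same route appear at the top.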
Top shared destinations: ATL, BOS, CLT, ORD, TPA (7 carriers each)
Highest average delays: F9 (21.9), FL (20.3), EV (15.7), YV (15.6)
Lowest average delays: HA (-6.92), AS (-9.93)
After restricting the analysis to destinations served by at least two carriers, the ranking of airlines becomes more comparable because the carriers are being evaluated on overlapping routes rather than completely different destination mixes.
The results show that F9 and FL have the highest average arrival delays, at approximately 21.9 minutes and 20.3 minutes, respectively. In contrast, AS and HA have negative average delays, indicating that they tend to arrive early on average.
This comparison is fairer than ranking airlines across all destinations because it reduces the influence of route-specific factors such as distance, weather, and airport congestion. Although differences in performance still remain, the results are more interpretable when airlines are compared only on destinations they have in common.
flights %>%
  distinct(dest, carrier) %>%
  group_by(dest) %>%
  summarise(n_carriers = n()) %>%
  filter(n_carriers >= 2) %>%
  arrange(desc(n_carriers))
## # A tibble: 76 × 2
## dest n_carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ℹ 66 more rows
shared_dests <- flights %>%
  distinct(dest, carrier) %>%
  group_by(dest) %>%
  summarise(n_carriers = n()) %>%
  filter(n_carriers >= 2)
flights %>%
  semi_join(shared_dests, by = "dest") %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
## # A tibble: 16 × 2
## carrier avg_delay
## <chr> <dbl>
## 1 F9 21.9
## 2 FL 20.3
## 3 EV 15.7
## 4 YV 15.6
## 5 OO 11.9
## 6 MQ 10.8
## 7 B6 9.67
## 8 WN 8.29
## 9 9E 7.38
## 10 UA 3.72
## 11 US 2.13
## 12 VX 1.82
## 13 DL 1.64
## 14 AA 0.364
## 15 HA -6.92
## 16 AS -9.93
This analysis counts how many flights each plane completed before experiencing its first departure delay greater than 60 minutes. By arranging flights chronologically within each tail number and tracking when the first major delay occurred, it is possible to measure how long each plane operated before a substantial delay appeared.
The results show that tail number N954UW completed the most flights before its first major delay, with 206 flights. Other planes with long runs before a major delay included N952UW with 163 flights and N957UW with 142 flights.
Planes with higher values may reflect more stable operations or more favorable scheduling conditions. However, the results should be interpreted with caution, because some planes may appear to perform well simply because they did not experience a delay greater than 60 minutes during the time period covered by the dataset.
N954UW had the highest number of flights before its first major departure delay, with 206 flights.
flights %>%
  filter(!is.na(tailnum)) %>%
  arrange(tailnum, year, month, day, sched_dep_time) %>%
  group_by(tailnum) %>%
  mutate(
    # TRUE for departure delays over 60 minutes; missing delays count as FALSE
    big_delay = !is.na(dep_delay) & dep_delay > 60,
    # Running count of big delays seen so far for this plane
    delay_seen = cumsum(big_delay)
  ) %>%
  filter(delay_seen == 0) %>%
  summarise(flights_before_delay = n()) %>%
  arrange(desc(flights_before_delay))
## # A tibble: 3,815 × 2
## tailnum flights_before_delay
## <chr> <int>
## 1 N954UW 206
## 2 N952UW 163
## 3 N957UW 142
## 4 N5FAAA 117
## 5 N516JB 102
## 6 N38727 99
## 7 N3742C 98
## 8 N5EWAA 98
## 9 N705TW 97
## 10 N765US 97
## # ℹ 3,805 more rows
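One way to address the caveat about planes that never had a major delay is to record, alongside the count, whether a delay over 60 minutes ever occurred. The sketch below uses base R and made-up data; flights_toy, before, and ever are hypothetical names, and the rows are assumed to be already in chronological order within each plane.

```r
# Toy flights: plane A has a major delay, plane B never does
flights_toy <- data.frame(
  tailnum   = c("A", "A", "A", "B", "B"),
  dep_delay = c(5, 90, 10, 0, NA)
)

# Major delay flag; missing delays count as FALSE
big <- !is.na(flights_toy$dep_delay) & flights_toy$dep_delay > 60

# Running count of major delays within each plane
seen <- ave(as.integer(big), flights_toy$tailnum, FUN = cumsum)

# Flights before the first major delay, and whether one ever occurred
before <- tapply(seen == 0, flights_toy$tailnum, sum)
ever   <- tapply(big, flights_toy$tailnum, any)

data.frame(tailnum = names(before),
           flights_before_delay = as.integer(before),
           ever_delayed = as.logical(ever))
```

Planes with ever_delayed equal to FALSE have a count that simply equals their total number of flights in the data, so they can be reported separately.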
…