· This test must be solved using R Markdown (.Rmd).
· Your submission must include code, output, and written explanation of the logic used.
· Use dplyr with pipelines (%>%) wherever appropriate.
· Simply writing code is not sufficient; explain the reasoning behind each transformation.
· Close all other tabs, AI tools, and windows except RStudio during the test.
· Label each answer clearly in your R Markdown file.
You are working as a data analyst for an airline operations team. The management wants to understand flight delays, airline performance, and route efficiency so they can improve scheduling decisions. You have been given the flights dataset from the nycflights13 dataset inside the dslabs package.
library(dslabs)
data('nycflights13')
library(dslabs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
data('nycflights13')
## Warning in data("nycflights13"): data set 'nycflights13' not found
flights <- nycflights13::flights
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
· The airline suspects that some flights appear to have negative delays, meaning they arrived earlier than expected. Calculate the percentage of flights that arrived earlier than scheduled (arr_delay < 0).
· Identify which airline carrier has the highest proportion of early arrivals.
· Explain whether early arrival necessarily means efficient airline performance or if other factors might influence this.
# Percentage of flights arriving early
early_flights =
  flights |>
  filter(arr_delay < 0)
early_flights
## # A tibble: 188,933 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 544 545 -1 1004 1022
## 2 2013 1 1 554 600 -6 812 837
## 3 2013 1 1 557 600 -3 709 723
## 4 2013 1 1 557 600 -3 838 846
## 5 2013 1 1 558 600 -2 849 851
## 6 2013 1 1 558 600 -2 853 856
## 7 2013 1 1 558 600 -2 923 937
## 8 2013 1 1 559 559 0 702 706
## 9 2013 1 1 559 600 -1 854 902
## 10 2013 1 1 600 600 0 851 858
## # ℹ 188,923 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
early_percentage =
  flights |>
  summarise(
    # The denominator counts every row, including flights with a missing arr_delay
    percent_early = length(early_flights$arr_delay) / length(arr_delay) * 100
  )
early_percentage
## # A tibble: 1 × 1
## percent_early
## <dbl>
## 1 56.1
# Airline with highest proportion of early arrivals
early_by_carrier =
  flights |>
  group_by(carrier) |>
  summarise(
    # No na.rm = TRUE: any carrier with a missing arr_delay gets NA
    early_prop = mean(arr_delay < 0)
  ) |>
  arrange(desc(early_prop))
early_by_carrier
## # A tibble: 16 × 2
## carrier early_prop
## <chr> <dbl>
## 1 HA 0.705
## 2 9E NA
## 3 AA NA
## 4 AS NA
## 5 B6 NA
## 6 DL NA
## 7 EV NA
## 8 F9 NA
## 9 FL NA
## 10 MQ NA
## 11 OO NA
## 12 UA NA
## 13 US NA
## 14 VX NA
## 15 WN NA
## 16 YV NA
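# Explanation
# Filtered flights to those with arr_delay < 0, then used summarise() to divide the number of early flights by the total number of flights, giving 56.1% early arrivals. The denominator counts every row, including flights with a missing arr_delay.
# Grouped by carrier and took the mean of arr_delay < 0 to get each carrier's proportion of early arrivals. Only HA (0.705) has a value in the output; every other carrier is NA because arr_delay contains missing values and mean() was called without na.rm = TRUE, so this ranking is not conclusive.
# Early arrival does not necessarily mean efficiency, because airlines may add buffer time to schedules, weather conditions may shorten travel time, or air traffic may be low.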
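The NA results above suggest that missing arr_delay values need explicit handling. The following is a minimal sketch of one way to do that, run on a small hypothetical data frame (toy, with made-up values) rather than on the real flights data. The n_arrived column is an illustrative addition, not part of the original answer.

```r
library(dplyr)

# Hypothetical toy data: arr_delay is NA for a flight with no recorded arrival
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "HA", "HA"),
  arr_delay = c(-5, 10, NA, -3, -8)
)

# Overall percentage of early arrivals among flights with a recorded arr_delay
early_percentage <- toy |>
  summarise(percent_early = mean(arr_delay < 0, na.rm = TRUE) * 100)

# Proportion of early arrivals per carrier, ignoring missing arr_delay values
early_by_carrier <- toy |>
  group_by(carrier) |>
  summarise(
    early_prop = mean(arr_delay < 0, na.rm = TRUE),
    n_arrived  = sum(!is.na(arr_delay))  # flights that actually have an arr_delay
  ) |>
  arrange(desc(early_prop))

early_percentage
early_by_carrier
```

With na.rm = TRUE each proportion is based only on flights that have a recorded arrival. Whether flights with no recorded arrival should count in the denominator is a judgment call that should be stated in the written explanation.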
· Create a new variable: delay_per_100_miles = arr_delay / (distance / 100).
· Identify the top 10 worst performing routes (origin–destination pairs) based on average delay_per_100_miles.
· Filter routes that have at least 50 flights to avoid misleading results.
· Explain why filtering by flight count is important.
· Explain why delay normalized by distance is more informative than raw delay.
# Create normalized delay variable
flights2 =
  flights |>
  mutate(
    delay_per_100_miles = arr_delay / (distance / 100)
  )
# Top 10 worst performing routes with at least 50 flights
worst_routes =
  flights2 |>
  group_by(origin, dest) |>
  summarise(
    # No na.rm = TRUE: any route with a missing arr_delay averages to NA
    avg_delay_per_100 = mean(delay_per_100_miles),
    flight_count = n()
  ) |>
  filter(flight_count >= 50) |>
  arrange(desc(avg_delay_per_100)) |>
  head(10)
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
worst_routes
## # A tibble: 10 × 4
## # Groups: origin [3]
## origin dest avg_delay_per_100 flight_count
## <chr> <chr> <dbl> <int>
## 1 JFK ABQ 0.240 254
## 2 JFK HNL -0.139 342
## 3 LGA SAV -1.46 68
## 4 EWR ALB NA 439
## 5 EWR ATL NA 5022
## 6 EWR AUS NA 968
## 7 EWR AVL NA 265
## 8 EWR BDL NA 443
## 9 EWR BNA NA 2336
## 10 EWR BOS NA 5327
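# Explanation
# Created delay_per_100_miles to express each flight's arrival delay per 100 miles flown.
# Only three routes in the output have a numeric average. The other seven are NA because mean() was called without na.rm = TRUE, and arrange() sorts NA values last, so head(10) filled the remaining places with NA routes. The list is therefore not a true top 10.
# Filtering to routes with at least 50 flights avoids misleading averages caused by very small sample sizes.
# Delay normalized by distance is more informative because the same delay is a larger share of a short trip than of a long one, so dividing by distance allows a fairer comparison across routes of different lengths.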
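The route ranking can be made robust to missing delays in the same way. The following is a sketch on hypothetical toy data, not the real flights table. The values are made up, and min_flights is an illustrative name that is set to 2 here only so that the tiny example has something to filter (the assignment uses 50).

```r
library(dplyr)

# Hypothetical toy data: one JFK-ABQ flight has no recorded arrival delay
toy <- tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA", "EWR"),
  dest      = c("ABQ", "ABQ", "ABQ", "SAV", "SAV", "ALB"),
  arr_delay = c(30, NA, 10, -5, 15, 200),
  distance  = c(1826, 1826, 1826, 722, 722, 143)
)

min_flights <- 2  # illustrative threshold; the assignment asks for 50

worst_routes <- toy |>
  mutate(delay_per_100_miles = arr_delay / (distance / 100)) |>
  group_by(origin, dest) |>
  summarise(
    avg_delay_per_100 = mean(delay_per_100_miles, na.rm = TRUE),
    flight_count      = n(),   # counts every row, including flights with NA delay
    .groups = "drop"           # return an ungrouped tibble, no grouping message
  ) |>
  filter(flight_count >= min_flights) |>
  arrange(desc(avg_delay_per_100)) |>
  head(10)

worst_routes
```

The single EWR to ALB flight has by far the largest delay per 100 miles, but it is dropped by the flight-count filter. That is the small-sample effect the filter is meant to guard against.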
· For each airline carrier calculate: Average arrival delay and Standard deviation of arrival delay.
· Identify airlines that have average delay below the overall dataset average AND lower variability than the dataset average.
· Explain why variability (standard deviation) is an important metric for airline reliability.
· Create a scatter plot: X-axis = Average delay, Y-axis = Standard deviation of delay, Label each airline carrier.
· Explain what the plot tells you about risk vs reliability.
# Average delay and standard deviation per airline
carrier_stats =
  flights |>
  group_by(carrier) |>
  summarise(
    # No na.rm = TRUE: carriers with any missing arr_delay return NA for both
    avg_delay = mean(arr_delay),
    sd_delay = sd(arr_delay)
  )
carrier_stats
## # A tibble: 16 × 3
## carrier avg_delay sd_delay
## <chr> <dbl> <dbl>
## 1 9E NA NA
## 2 AA NA NA
## 3 AS NA NA
## 4 B6 NA NA
## 5 DL NA NA
## 6 EV NA NA
## 7 F9 NA NA
## 8 FL NA NA
## 9 HA -6.92 75.1
## 10 MQ NA NA
## 11 OO NA NA
## 12 UA NA NA
## 13 US NA NA
## 14 VX NA NA
## 15 WN NA NA
## 16 YV NA NA
# Overall dataset averages (both evaluate to NA here: arr_delay has missing values and na.rm is not set)
dataset_avg_delay = mean(flights$arr_delay)
dataset_sd_delay = sd(flights$arr_delay)
# Airlines with lower delay and lower variability
reliable_airlines =
  carrier_stats |>
  filter(
    avg_delay < dataset_avg_delay,
    sd_delay < dataset_sd_delay
  )
reliable_airlines
## # A tibble: 0 × 3
## # ℹ 3 variables: carrier <chr>, avg_delay <dbl>, sd_delay <dbl>
# Scatter plot
# label = carrier is mapped, but no text layer uses it, so the points are not labelled
ggplot(carrier_stats, aes(x = avg_delay, y = sd_delay, label = carrier)) +
  geom_point() +
  labs(
    title = "Airline Reliability: Average Delay vs Variability",
    x = "Average Arrival Delay",
    y = "Standard Deviation of Arrival Delay"
  )
## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_point()`).
# Explanation
# Grouped by carrier and used summarise() to compute the mean and standard deviation of arrival delay, then compared each carrier against the mean and standard deviation of the whole dataset to find carriers that are lower on both.
# Because mean() and sd() were called without na.rm = TRUE, every carrier except HA is NA, and dataset_avg_delay and dataset_sd_delay are NA as well. The filter() comparisons therefore evaluate to NA for every row, which is why reliable_airlines is empty.
# The scatter plot compares average arrival delay with the standard deviation of arrival delay, but 15 of the 16 carriers are dropped as missing, and no text layer is added, so the one remaining point is not labelled.
# Standard deviation measures variability in delays. Airlines with low variability are more reliable because their arrival times are more consistent.
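The following is a hedged sketch of how the carrier comparison and the labelled scatter plot could be written so that missing delays do not propagate. It runs on hypothetical toy data with made-up values. The names overall, reliable_airlines and p are illustrative, and the choice of geom_text() for labelling is one option among several.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two flights have no recorded arrival delay
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "HA", "HA", "HA", "EV", "EV", "EV"),
  arr_delay = c(0, 10, NA, -10, -5, NA, 20, 60, 40)
)

# Dataset-wide benchmarks, ignoring missing values
overall <- toy |>
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    sd_delay  = sd(arr_delay, na.rm = TRUE)
  )

# Per-carrier mean and standard deviation, ignoring missing values
carrier_stats <- toy |>
  group_by(carrier) |>
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    sd_delay  = sd(arr_delay, na.rm = TRUE)
  )

# Carriers below the dataset benchmark on both average delay and variability
reliable_airlines <- carrier_stats |>
  filter(avg_delay < overall$avg_delay, sd_delay < overall$sd_delay)

# Scatter plot with one text label per carrier; print(p) draws it
p <- ggplot(carrier_stats, aes(x = avg_delay, y = sd_delay, label = carrier)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  labs(
    x = "Average Arrival Delay",
    y = "Standard Deviation of Arrival Delay"
  )

reliable_airlines
```

Here the benchmark for variability is the standard deviation of all flights pooled together, which follows the approach in the code above. Comparing against the average of the per-carrier standard deviations would be a different, equally defensible reading of the question.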
· Create a new variable departure_hour using dep_time.
· Compute average departure delay for each hour of the day.
· Identify which hour experiences the worst delays.
· Create a visualization that clearly shows delay trends throughout the day.
· Explain whether delays gradually increase or are concentrated in specific hours and provide a possible operational explanation.
# Question 4
# Create departure hour variable
flights3 =
  flights |>
  mutate(
    # dep_time is stored as HHMM, so dividing by 100 gives fractional values (e.g. 517 becomes 5.17)
    departure_hour = dep_time / 100
  )
# Average departure delay by hour
hourly_delay =
  flights3 |>
  group_by(departure_hour) |>
  summarise(
    avg_dep_delay = mean(dep_delay)
  )
hourly_delay
## # A tibble: 1,319 × 2
## departure_hour avg_dep_delay
## <dbl> <dbl>
## 1 0.01 78.8
## 2 0.02 97.3
## 3 0.03 67.6
## 4 0.04 62.3
## 5 0.05 78.2
## 6 0.06 100.
## 7 0.07 59.1
## 8 0.08 105.
## 9 0.09 105.
## 10 0.1 121.
## # ℹ 1,309 more rows
# Hour with worst delays
worst_hour =
  hourly_delay |>
  arrange(desc(avg_dep_delay)) |>
  head(1)
worst_hour
## # A tibble: 1 × 2
## departure_hour avg_dep_delay
## <dbl> <dbl>
## 1 3.53 503
# Visualization of delay trends
ggplot(hourly_delay, aes(x = departure_hour, y = avg_dep_delay)) +
  geom_line() +
  labs(
    title = "Average Departure Delay by Hour",
    x = "Departure Hour",
    y = "Average Departure Delay"
  )
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).
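The summary above has 1,319 groups rather than one per hour because departure_hour keeps the minutes as a decimal fraction. The following is a minimal sketch of extracting the whole hour with integer division, run on hypothetical toy data with made-up values. n_flights is an illustrative extra column, and slice_max() is used here in place of arrange() and head().

```r
library(dplyr)

# Hypothetical toy data: dep_time is in HHMM format, NA for a flight that never departed
toy <- tibble(
  dep_time  = c(517, 533, 1759, 1805, 2359, NA),
  dep_delay = c(2, 4, 30, 50, 90, NA)
)

hourly_delay <- toy |>
  filter(!is.na(dep_time)) |>                    # drop flights with no departure time
  mutate(departure_hour = dep_time %/% 100) |>   # integer division: 517 -> 5, 1759 -> 17
  group_by(departure_hour) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    n_flights     = n()
  )

# Hour with the highest average departure delay
worst_hour <- hourly_delay |>
  slice_max(avg_dep_delay, n = 1)

hourly_delay
worst_hour
```

Keeping n_flights next to each average makes it easier to see when an hour's extreme value rests on only a handful of flights.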
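# Explanation
# Created departure_hour from dep_time, computed the average departure delay for each value, identified the value with the worst average delay, and plotted average departure delay against departure_hour. Because dep_time is stored as HHMM and was divided by 100, departure_hour takes fractional values (3.53 corresponds to 03:53), so the summary has 1,319 groups instead of one per hour and the "worst hour" is really a single minute of the day.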
# Delays generally increase later in the day because delays from earlier flights