Continuous Assessment – Practical Test

Maximum Marks: 30

Instructions

· This test must be solved using R Markdown (.Rmd).

· Your submission must include code, output, and written explanation of the logic used.

· Use dplyr with pipelines (%>%) wherever appropriate.

· Simply writing code is not sufficient; explain the reasoning behind each transformation.

· Close all other tabs, AI tools, and windows except RStudio during the test.

· Label each answer clearly in your R Markdown file.

Scenario:

You are working as a data analyst for an airline operations team. Management wants to understand flight delays, airline performance, and route efficiency so that it can improve scheduling decisions. You have been given the flights dataset from the nycflights13 package.

Load the data in R:

library(dslabs)
data('nycflights13')

flights <- nycflights13::flights

The dataset contains information such as airline carrier, departure and arrival delays, distance, flight time, origin and destination airports, and time/date information.

Your task is to analyze airline performance and communicate insights to management.

Loading the Libraries and Dataset

library(dslabs)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)


data('nycflights13')

## Warning in data("nycflights13"): data set 'nycflights13' not found

flights <- nycflights13::flights
flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 1

· The airline suspects that some flights appear to have negative delays, meaning they arrived earlier than expected. Calculate the percentage of flights that arrived earlier than scheduled (arr_delay < 0).

· Identify which airline carrier has the highest proportion of early arrivals.

· Explain whether early arrival necessarily means efficient airline performance or if other factors might influence this.

# Percentage of flights arriving early
early_flights =
flights |>
  filter(arr_delay < 0)

early_flights

## # A tibble: 188,933 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      544            545        -1     1004           1022
##  2  2013     1     1      554            600        -6      812            837
##  3  2013     1     1      557            600        -3      709            723
##  4  2013     1     1      557            600        -3      838            846
##  5  2013     1     1      558            600        -2      849            851
##  6  2013     1     1      558            600        -2      853            856
##  7  2013     1     1      558            600        -2      923            937
##  8  2013     1     1      559            559         0      702            706
##  9  2013     1     1      559            600        -1      854            902
## 10  2013     1     1      600            600         0      851            858
## # ℹ 188,923 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

early_percentage =
flights |>
  summarise(
    # early flights divided by all flights, as a percentage
    percent_early = nrow(early_flights) / n() * 100
  )

early_percentage

## # A tibble: 1 × 1
##   percent_early
##           <dbl>
## 1          56.1

# Airline with highest proportion of early arrivals
early_by_carrier =
flights |>
  group_by(carrier) |>
  summarise(
    early_prop = mean(arr_delay < 0)
  ) |>
  arrange(desc(early_prop))

early_by_carrier

## # A tibble: 16 × 2
##    carrier early_prop
##    <chr>        <dbl>
##  1 HA           0.705
##  2 9E          NA    
##  3 AA          NA    
##  4 AS          NA    
##  5 B6          NA    
##  6 DL          NA    
##  7 EV          NA    
##  8 F9          NA    
##  9 FL          NA    
## 10 MQ          NA    
## 11 OO          NA    
## 12 UA          NA    
## 13 US          NA    
## 14 VX          NA    
## 15 WN          NA    
## 16 YV          NA

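# Explanation

# Filtered flights to arr_delay < 0 to isolate the early arrivals, then divided that count by the total number of flights inside summarise() to get the percentage (56.1%).
# Grouped by carrier and used summarise() to compute each carrier's proportion of early arrivals, sorted in descending order.
# Only HA shows a value (0.705) in the output. The other carriers are NA because their arr_delay values include missing entries and mean() was called without na.rm = TRUE, so the ranking above does not yet compare all carriers.

# Early arrival does not necessarily mean efficiency: airlines may pad their schedules with buffer time, favourable weather may shorten travel time, or air traffic may be light.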
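
The NA values in early_by_carrier can be avoided by dropping missing delays inside mean(). The following is a minimal sketch on a small made-up table; the `toy` tibble and its values are hypothetical and are not taken from nycflights13, and the result names are chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for flights.
# NA marks a flight with no recorded arrival delay.
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "HA", "HA"),
  arr_delay = c(-5, 10, NA, -3, -8)
)

# Overall percentage of early arrivals among flights with a recorded delay
percent_early_toy <- mean(toy$arr_delay < 0, na.rm = TRUE) * 100

early_by_carrier_toy <- toy |>
  group_by(carrier) |>
  summarise(
    # Without na.rm = TRUE the AA group would be NA, because one delay is missing
    early_prop = mean(arr_delay < 0, na.rm = TRUE),
    n_flights  = n()
  ) |>
  arrange(desc(early_prop))

early_by_carrier_toy
```

Applied to flights, the same pattern would give every carrier a comparable proportion. Note that with na.rm = TRUE the denominator counts only flights with a recorded arrival delay, so the overall percentage would differ from one computed over all rows.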

Question 2

· Create a new variable: delay_per_100_miles = arr_delay / (distance / 100).

· Identify the top 10 worst performing routes (origin–destination pairs) based on average delay_per_100_miles.

· Filter routes that have at least 50 flights to avoid misleading results.

· Explain why filtering by flight count is important.

· Explain why delay normalized by distance is more informative than raw delay.

# Create normalized delay variable
flights2 =
flights |>
  mutate(
    delay_per_100_miles = arr_delay / (distance / 100)
  )


# Top 10 worst performing routes with at least 50 flights
worst_routes =
flights2 |>
  group_by(origin, dest) |>
  summarise(
    avg_delay_per_100 = mean(delay_per_100_miles),
    flight_count = n()
  ) |>
  filter(flight_count >= 50) |>
  arrange(desc(avg_delay_per_100)) |>
  head(10)

## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.

worst_routes

## # A tibble: 10 × 4
## # Groups:   origin [3]
##    origin dest  avg_delay_per_100 flight_count
##    <chr>  <chr>             <dbl>        <int>
##  1 JFK    ABQ               0.240          254
##  2 JFK    HNL              -0.139          342
##  3 LGA    SAV              -1.46            68
##  4 EWR    ALB              NA              439
##  5 EWR    ATL              NA             5022
##  6 EWR    AUS              NA              968
##  7 EWR    AVL              NA              265
##  8 EWR    BDL              NA              443
##  9 EWR    BNA              NA             2336
## 10 EWR    BOS              NA             5327

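# Explanation

# Created delay_per_100_miles to express arrival delay per 100 miles flown, then averaged it for each origin-destination pair.
# Filtering to routes with at least 50 flights avoids misleading averages caused by very small sample sizes, where one or two extreme delays dominate.
# Most routes in the output show NA because mean() was called without na.rm = TRUE. Since arrange() places NA last, only the three routes with complete data are actually ranked.

# Delay normalized by distance is more informative because the same delay is a larger relative disruption on a short route than on a long one, so dividing by distance allows routes of different lengths to be compared fairly.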
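
A sketch of the route ranking with missing delays removed, again on hypothetical data: the `toy` tibble is made up, and `min_flights` is an illustrative stand-in for the threshold of 50 so that the filter has a visible effect on six rows.

```r
library(dplyr)

# Hypothetical toy data; one JFK-BOS flight has no recorded arrival delay
toy <- tibble(
  origin    = c("JFK", "JFK", "JFK", "LGA", "LGA", "EWR"),
  dest      = c("BOS", "BOS", "BOS", "ORD", "ORD", "ALB"),
  arr_delay = c(20, NA, 40, 10, -10, 300),
  distance  = c(200, 200, 200, 700, 700, 150)
)

min_flights <- 2  # stand-in for the threshold of 50 used in the test

worst_routes_toy <- toy |>
  mutate(delay_per_100_miles = arr_delay / (distance / 100)) |>
  group_by(origin, dest) |>
  summarise(
    # na.rm = TRUE keeps one missing delay from turning the whole route into NA
    avg_delay_per_100 = mean(delay_per_100_miles, na.rm = TRUE),
    flight_count      = n(),
    .groups = "drop"  # return an ungrouped tibble and silence the grouping message
  ) |>
  filter(flight_count >= min_flights) |>
  arrange(desc(avg_delay_per_100)) |>
  head(10)

worst_routes_toy
```

In this toy table the single EWR-ALB flight has by far the largest normalized delay, yet it is excluded by the flight-count filter, which illustrates why the threshold matters.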

Question 3

· For each airline carrier calculate: Average arrival delay and Standard deviation of arrival delay.

· Identify airlines that have average delay below the overall dataset average AND lower variability than the dataset average.

· Explain why variability (standard deviation) is an important metric for airline reliability.

· Create a scatter plot: X-axis = Average delay, Y-axis = Standard deviation of delay, Label each airline carrier.

· Explain what the plot tells you about risk vs reliability.

# Average delay and standard deviation per airline
carrier_stats =
flights |>
  group_by(carrier) |>
  summarise(
    avg_delay = mean(arr_delay),
    sd_delay = sd(arr_delay)
  )

carrier_stats

## # A tibble: 16 × 3
##    carrier avg_delay sd_delay
##    <chr>       <dbl>    <dbl>
##  1 9E          NA        NA  
##  2 AA          NA        NA  
##  3 AS          NA        NA  
##  4 B6          NA        NA  
##  5 DL          NA        NA  
##  6 EV          NA        NA  
##  7 F9          NA        NA  
##  8 FL          NA        NA  
##  9 HA          -6.92     75.1
## 10 MQ          NA        NA  
## 11 OO          NA        NA  
## 12 UA          NA        NA  
## 13 US          NA        NA  
## 14 VX          NA        NA  
## 15 WN          NA        NA  
## 16 YV          NA        NA

# Overall dataset averages
dataset_avg_delay = mean(flights$arr_delay)
dataset_sd_delay = sd(flights$arr_delay)


# Airlines with lower delay and lower variability
reliable_airlines =
carrier_stats |>
  filter(
    avg_delay < dataset_avg_delay,
    sd_delay < dataset_sd_delay
  )

reliable_airlines

## # A tibble: 0 × 3
## # ℹ 3 variables: carrier <chr>, avg_delay <dbl>, sd_delay <dbl>

# Scatter plot
ggplot(carrier_stats, aes(x = avg_delay, y = sd_delay, label = carrier)) +
  geom_point() +
  labs(
    title = "Airline Reliability: Average Delay vs Variability",
    x = "Average Arrival Delay",
    y = "Standard Deviation of Arrival Delay"
  )

## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Explanation

# Grouped by carrier and used summarise() to compute the mean and standard deviation of arrival delay, then filtered for carriers whose mean and standard deviation are both below the corresponding values for the whole dataset.
# Drew a scatter plot of average arrival delay against standard deviation of arrival delay.
# Because mean() and sd() were called without na.rm = TRUE, every carrier except HA is NA and the dataset-wide values are NA as well, so the filter returns no rows and the plot drops 15 of the 16 carriers.

# Standard deviation measures variability in delays. Airlines with low variability are more reliable because their arrival times are more consistent.
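
The carrier statistics, the reliability filter, and a labelled scatter plot can be sketched as follows. The `toy` tibble is hypothetical, and geom_text() with vjust = -0.8 is one possible way to label the points, not necessarily the intended one.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data; one DL flight has no recorded arrival delay
toy <- tibble(
  carrier   = c("AA", "AA", "AA", "DL", "DL", "DL", "UA", "UA", "UA"),
  arr_delay = c(0, 2, 4, -10, 40, NA, 10, 20, 30)
)

carrier_stats_toy <- toy |>
  group_by(carrier) |>
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    sd_delay  = sd(arr_delay, na.rm = TRUE)
  )

# Dataset-wide benchmarks, also ignoring missing delays
overall_avg <- mean(toy$arr_delay, na.rm = TRUE)
overall_sd  <- sd(toy$arr_delay, na.rm = TRUE)

# Carriers that beat both benchmarks
reliable_toy <- carrier_stats_toy |>
  filter(avg_delay < overall_avg, sd_delay < overall_sd)

# Scatter plot with each point labelled by carrier code
p <- ggplot(carrier_stats_toy, aes(x = avg_delay, y = sd_delay, label = carrier)) +
  geom_point() +
  geom_text(vjust = -0.8) +
  labs(x = "Average Arrival Delay", y = "Standard Deviation of Arrival Delay")

reliable_toy
```

In such a plot, carriers toward the lower left combine low average delay with low variability, while carriers toward the top are less predictable even when their average delay is modest.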

Question 4

· Create a new variable departure_hour using dep_time.

· Compute average departure delay for each hour of the day.

· Identify which hour experiences the worst delays.

· Create a visualization that clearly shows delay trends throughout the day.

· Explain whether delays gradually increase or are concentrated in specific hours and provide a possible operational explanation.

# Question 4


# Create departure hour variable
flights3 =
flights |>
  mutate(
    departure_hour = dep_time / 100
  )


# Average departure delay by hour
hourly_delay =
flights3 |>
  group_by(departure_hour) |>
  summarise(
    avg_dep_delay = mean(dep_delay)
  )

hourly_delay

## # A tibble: 1,319 × 2
##    departure_hour avg_dep_delay
##             <dbl>         <dbl>
##  1           0.01          78.8
##  2           0.02          97.3
##  3           0.03          67.6
##  4           0.04          62.3
##  5           0.05          78.2
##  6           0.06         100. 
##  7           0.07          59.1
##  8           0.08         105. 
##  9           0.09         105. 
## 10           0.1          121. 
## # ℹ 1,309 more rows

# Hour with worst delays
worst_hour =
hourly_delay |>
  arrange(desc(avg_dep_delay)) |>
  head(1)

worst_hour

## # A tibble: 1 × 2
##   departure_hour avg_dep_delay
##            <dbl>         <dbl>
## 1           3.53           503

# Visualization of delay trends
ggplot(hourly_delay, aes(x = departure_hour, y = avg_dep_delay)) +
  geom_line() +
  labs(
    title = "Average Departure Delay by Hour",
    x = "Departure Hour",
    y = "Average Departure Delay"
  )

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_line()`).

# Explanation

# Created a departure_hour variable from dep_time, computed the average departure delay for each hour, identified the hour with the worst delay, and plotted average departure delay against departure hour.
# Because dep_time is stored as HHMM, dividing it by 100 yields fractional values such as 0.01 and 3.53 rather than whole hours, which is why the output has 1,319 groups instead of one per hour.

# Delays generally increase later in the day because delays from earlier flights carry over to later ones.
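
A sketch of extracting a whole-number hour from dep_time with integer division, assuming dep_time is encoded as HHMM (for example, 517 means 05:17). The `toy` tibble is hypothetical; the NA row stands for a flight with no recorded departure time.

```r
library(dplyr)

# Hypothetical toy data in the same HHMM encoding as dep_time
toy <- tibble(
  dep_time  = c(517, 554, 559, 1805, 1830, 2359, NA),
  dep_delay = c(2, -6, 1, 30, 50, 120, NA)
)

hourly_delay_toy <- toy |>
  filter(!is.na(dep_time)) |>                   # drop flights with no departure time
  mutate(departure_hour = dep_time %/% 100) |>  # integer division: 517 -> 5, 1805 -> 18
  group_by(departure_hour) |>
  summarise(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    n_flights     = n()
  )

# Hour with the highest average departure delay
worst_hour_toy <- hourly_delay_toy |>
  slice_max(avg_dep_delay, n = 1)

hourly_delay_toy
```

With one row per hour, a line or bar chart of avg_dep_delay against departure_hour shows the trend across the day directly. Keeping n_flights alongside the average helps flag hours where the mean rests on very few flights.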

DSR CA1

Sparsh Verma

2026-03-09