Exploring and Summarizing Data

In this second lab, “Introduction to Data,” I learned essential R techniques for summarizing and describing raw dataset information. Before conducting any analysis, data scientists must become familiar with their data’s contents. Since datasets can encompass petabytes of information, it’s impossible to understand trends simply by examining individual rows and columns. A more systematic approach is required, and the exercises in this lab demonstrate how to effectively summarize and preprocess data.

Prior to beginning the data analysis, I loaded the necessary packages and added design code to ensure a professional looking graphs.

library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
knitr::opts_chunk$set(
  echo = TRUE,
  fig.show = "hold",        
  fig.align = "center",     
  fig.width = 6,           
  fig.height = 4,           
  out.width = "80%",     
  dev = "png",             
  dpi = 300                 
)

Loading the dataset that will be used for the data analysis.

data("nycflights")

Using glimpse() to analyse the dataset.

glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month     <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day       <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time  <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time  <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier   <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum   <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight    <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest      <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time  <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance  <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour      <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute    <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
head(nycflights, 5)
## # A tibble: 5 × 16
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
## 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
## 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
## 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
## 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
## # ℹ 6 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

Exercise 1

In Exercise 1, I analyzed three histograms, each shown below. At a glance, you can tell that all three share the same shape: a right-skewed distribution. All histograms cover the same range of values for the variable on the x-axis, departure delay (dep_delay). The maximum count of departure delays for the three graphs varies from 27,000 to 30,000 delays.

All three histograms are made from the same dataset. The main difference in the visualization is binwidth, which varies from 15 to 150, with one graph having a default bandwidth calculated using built-in algorithms. In graph 1 and graph 2, the distribution of data is detailed to the point that you can see how the changes in the count of departure delay occur. Graph 3 has a large binwidth that obscures the details of the data but clearly shows that the data is a right-skewed distribution.

In both Graph 1 (default binwidth) and Graph 3 (binwidth = 150), the right tail of the distribution extends to about a value of 380 for departure delays. The maximum values of the peak of each graph are different. In graph 1, the count of departure delays is just below 27,000. In Graph 3, with the largest binwidth of 150, the peak of the graph extends to approximately 32,000 counts. In Graph 2, the graph with the binwidth of 15, the peak is at 20,100.

In summary, Graph 3 is good for a quick snapshot of the distribution, while Graphs 1 and 2 are best for analyzing details about the graph.

Graph 1 (Default Binwidth)
  • Provides optimal balance between detail and readability
  • Peak count just below 27,000 delays
Graph 2 (Binwidth = 15)
  • Shows maximum granular detail with jagged fluctuations
  • Lowest peak at 20,100 counts
  • Best for identifying specific patterns in the data
Graph 3 (Binwidth = 150)
  • Creates smooth visualization of overall distribution pattern
  • Highest peak at ~32,000 counts
  • Ideal for quick assessment of data shape
ggplot(nycflights, aes(x = dep_delay)) + 
  geom_histogram() +
labs(title = "Graph 1: binwidth is default value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Binwidth = 15
ggplot(nycflights, aes(x = dep_delay)) + 
  geom_histogram(binwidth = 15) +
labs(title = "Graph 2: binwidth is 15")

# Binwidth = 150
ggplot(nycflights, aes(x = dep_delay)) + 
  geom_histogram(binwidth = 150) +
labs(title = "Graph 3: binwidth is 150")

Exercise 2

# Filtering and counting the rows in the new
sfo_feb_flights <-  nycflights %>%
  filter(dest == "SFO", month == 2)
  nrow(sfo_feb_flights)
## [1] 68

Based on the above code, 68 flights met the criterion of being flights to San Francisco from New York City that occurred in the month of February.

Exercise 3

ggplot(sfo_feb_flights, aes(x = arr_delay)) + 
  geom_histogram(binwidth = 15, fill = "navy", color = "grey") +
  labs(
    title = "Arrival Delays of Flights to San Francisco from NYC in February",
    x = "Arrival Delay (minutes)",
    y = "Count of Flights"
  ) + 
  theme_minimal()

The above shows a right skewed histogram. The summary statistics chosen for the sfo_feb_flights dataset are:
  • Median - A value perfect for this distribution because it is resistant to the outliers present
  • IQR - A measure of the spread of the middle 50% of the dataset
  • The 25th and 75th percentiles to define boundaries where 25% and 75% of the data falls
  • Range - A complete spread of the minimum and maximum flight delays
# Appropriate Summary Statistics
median_delay <- median(sfo_feb_flights$arr_delay, na.rm = TRUE)
iqr_delay <- IQR(sfo_feb_flights$arr_delay, na.rm = TRUE)
q25_delay <- quantile(sfo_feb_flights$arr_delay, 0.25, na.rm = TRUE)
q75_delay <- quantile(sfo_feb_flights$arr_delay, 0.75, na.rm = TRUE)
range_delay <- range(sfo_feb_flights$arr_delay, na.rm = TRUE)
## The median arrival delay for NYC to SFO flights in February is -11 minutes.
## The IQR (interquartile range) is 23.25 minutes, showing the spread of the middle 50% of delays.
## 25% of flights arrive -21.25 minutes or earlier than scheduled.
## 75% of flights arrive within 2 minutes of their scheduled time.
## Arrival delays range from -66 minutes (early) to 196 minutes (late).

Exercise 4

# Group by carrier
sfo_feb_flights %>%
  group_by(carrier) %>%
#Summarize the data based on median and interquartile range of arrival delay and remove missing values
  summarise(
    median_delay = median(arr_delay, na.rm = TRUE),
    iqr_delay = IQR(arr_delay, na.rm = TRUE),
  ) %>%
# Sort by IQR Descending
  arrange(desc(iqr_delay))  # Sort by IQR to see most variable first
## # A tibble: 5 × 3
##   carrier median_delay iqr_delay
##   <chr>          <dbl>     <dbl>
## 1 DL             -15        22  
## 2 UA             -10        22  
## 3 VX             -22.5      21.2
## 4 AA               5        17.5
## 5 B6             -10.5      12.2
ggplot(sfo_feb_flights, aes(x = carrier, y = arr_delay)) +
  geom_boxplot() +
  labs(
    title = "Arrival Delay Variability by Carrier",
    subtitle = "Box height shows variability (IQR)",
    x = "Carrier",
    y = "Arrival Delay (minutes)"
  ) +
  theme_minimal()

Analysis Box Plot of Arrival Delays

Delta (DL) and United (UA) show the highest variability in arrival delays with IQR values of 22 minutes each, indicating the most inconsistent arrival performance among the carriers.

JetBlue (B6) demonstrates the most consistent arrival times with the lowest IQR of 12.2 minutes, making it the most reliable carrier for predictable arrival delays, while Virgin America (VX) has the best overall performance with flights arriving an average of 22.5 minutes early.

Exercise 5

The pros and cons of planing trip hy viewing the mean departure delay vs. median departure delay;
library(dplyr)

# Create monthly summary with month names
monthly_summary <- nycflights %>%
  group_by(month) %>%
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),
    median_delay = median(dep_delay, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(month_name = month.name[month])

# Find months with lowest delays (including the actual values)
month_lowest_mean <- monthly_summary %>% 
  filter(mean_delay == min(mean_delay)) %>%
  select(month, month_name, mean_delay)

month_lowest_median <- monthly_summary %>% 
  filter(median_delay == min(median_delay)) %>%
  select(month, month_name, median_delay)

# Print with actual values
print("Month with the lowest mean delay:")
## [1] "Month with the lowest mean delay:"
print(month_lowest_mean)
## # A tibble: 1 × 3
##   month month_name mean_delay
##   <int> <chr>           <dbl>
## 1    10 October          5.88
cat("Month", month_lowest_mean$month, "(", month_lowest_mean$month_name, 
    ") has the lowest mean delay of", round(month_lowest_mean$mean_delay, 2), "minutes\n\n")
## Month 10 ( October ) has the lowest mean delay of 5.88 minutes
print("Month with the lowest median delay:")
## [1] "Month with the lowest median delay:"
print(month_lowest_median)
## # A tibble: 2 × 3
##   month month_name median_delay
##   <int> <chr>             <dbl>
## 1     9 September            -3
## 2    10 October              -3
cat("Month", month_lowest_median$month, "(", month_lowest_median$month_name, 
    ") has the lowest median delay of", round(month_lowest_median$median_delay, 2), "minutes\n")
## Month 9 10 ( September October ) has the lowest median delay of -3 -3 minutes

Considerations for Using the Median Monthy Flight Delay Data

Advantages:

  • Value not inflated by extreme delays
  • Better represents typical experience
  • Half of flights will have delays at or below the median
  • Extreme outliers do not distort the picture

Disadvantages:

  • Ignores Extreme Delays
  • No insight into variability of delays

Considerations for Using the Mean Flight Delay Data

Advantages:

  • Reveals hidden patters
  • Gives a realistic picture of waht to expect
  • Better for estimating total travel time _ Shows traveller is using risk planning

Disadvantages:

  • Skewed by extreme events
  • You might over-plan and arrive too early unnecessarily
  • Less predictive as the average” delay might rarely occur in practice

Practical tip: Plan for slightly more than the median delay to cover most scenarios, but don’t base your planning on the mean if it’s much higher due to outliers.

Exercise 6

Selecting which NYC airport to fly out of, based on the percentage of on time flights
library(dplyr)

# Calculate on-time performance by airport
airport_performance <- nycflights %>%
  filter(!is.na(dep_delay)) %>%  # Remove flights with missing delay data
  group_by(origin) %>%
  summarise(
    total_flights = n(),
    on_time_flights = sum(dep_delay <= 15, na.rm = TRUE),  # ≤15 min = on-time
    delayed_flights = sum(dep_delay > 15, na.rm = TRUE),   # >15 min = delayed
    on_time_percentage = (on_time_flights / total_flights) * 100,
    avg_delay = mean(dep_delay, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  arrange(desc(on_time_percentage))

print("NYC Airport On-Time Performance:")
## [1] "NYC Airport On-Time Performance:"
print(airport_performance)
## # A tibble: 3 × 6
##   origin total_flights on_time_flights delayed_flights on_time_percentage
##   <chr>          <int>           <int>           <int>              <dbl>
## 1 LGA            10067            8160            1907               81.1
## 2 JFK            10897            8657            2240               79.4
## 3 EWR            11771            8829            2942               75.0
## # ℹ 1 more variable: avg_delay <dbl>
# Find the best airport
best_airport <- airport_performance$origin[1]
best_percentage <- round(airport_performance$on_time_percentage[1], 2)

cat("Best NYC airport for on-time departures:", best_airport, 
    "with", best_percentage, "% on-time performance\n")
## Best NYC airport for on-time departures: LGA with 81.06 % on-time performance
library(ggplot2)
library(dplyr)

# Create the data
airport_performance <- nycflights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(origin) %>%
  summarise(
    total_flights = n(),
    on_time_percentage = sum(dep_delay <= 15) / n() * 100,
    .groups = 'drop'
  )

# Create ggplot showing airport and ontime percentage
ggplot(airport_performance, aes(x = origin, y = on_time_percentage, fill = origin)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = paste0(round(on_time_percentage, 1), "%")), 
            vjust = -0.5, size = 4) +
  labs(title = "On-Time Departure Percentage by NYC Airport",
       subtitle = "Flights departing within 15 minutes of scheduled time",
       x = "Airport",
       y = "On-Time Percentage (%)") +
  theme_minimal() +
  scale_fill_manual(values = c("EWR" = "skyblue", "JFK" = "lightgreen", "LGA" = "coral")) +
  ylim(0, max(airport_performance$on_time_percentage) + 5) +
  theme(legend.position = "none")

Additional Practice, Exercises 7-9

Exercise 7

library(dplyr)

# Using the mutate function to add avg_speed column
nycflights_with_speed <- nycflights %>%
  mutate(
    # Convert air_time from minutes to hours by dividing by 60
    air_time_hours = air_time / 60,
    # Calculate average speed: distance (miles) / time (hours) = mph
    avg_speed = distance / air_time_hours
  )
# Print first few rows of nyc_flights_speed to view new speed column
print(head(nycflights_with_speed))
## # A tibble: 6 × 18
##    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum flight
##   <int> <int> <int>    <int>     <dbl>    <int>     <dbl> <chr>   <chr>    <int>
## 1  2013     6    30      940        15     1216        -4 VX      N626VA     407
## 2  2013     5     7     1657        -3     2104        10 DL      N3760C     329
## 3  2013    12     8      859        -1     1238        11 DL      N712TW     422
## 4  2013     5    14     1841        -4     2122       -34 DL      N914DL    2391
## 5  2013     7    21     1102        -3     1230        -8 9E      N823AY    3652
## 6  2013     1     1     1817        -3     2008         3 AA      N3AXAA     353
## # ℹ 8 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, air_time_hours <dbl>, avg_speed <dbl>

Exercise 8

Scatterplot of avaerage speed vs distance
ggplot(nycflights_with_speed, aes(x = distance, y = avg_speed, color = origin)) +
  geom_point(alpha = 0.4, size = 0.9) +
  scale_x_continuous(limits = c(0, 3500)) +  # Extend x-axis from 0 to 3500
  labs(
    title = "Relationship Between Average Speed and Distance",
    x = "Distance (miles)",
    y = "Average Speed (mph)",
    caption = "Source: NYC Flights 2013"
  ) +
  theme_minimal()
## Warning: Removed 66 rows containing missing values or values outside the scale range
## (`geom_point()`).

Exercise 9

Image A: Graph to Recreate

Image A: Graph to Recreate

#Filter the dataset to get obtain these airlines AA", "DL", "UA"

library(dplyr)
library(ggplot2)

# Filter for the three specific airlines
filtered_flights <- nycflights %>%
  filter(carrier %in% c("AA", "DL", "UA")) %>%  # AA=American, DL=Delta, UA=United
  filter(!is.na(dep_delay), !is.na(arr_delay))  # Remove missing values

# Create the plot: departure delay vs arrival delay, colored by carrier
ggplot(filtered_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
  geom_point(alpha = 0.3, size = 0.8) +  # Semi-transparent points
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") +  # Reference line
  labs(
    title = "Graph of Image A, created with code",
    x = "Departure Delay (minutes)",
    y = "Arrival Delay (minutes)",
    color = "Carrier"
  ) +
  theme_minimal() +
  scale_color_manual(
    values = c("AA" = "red", "DL" = "blue", "UA" = "green"),
    labels = c("AA" = "American", "DL" = "Delta", "UA" = "United")
  )