In this second lab, “Introduction to Data,” I learned essential R techniques for summarizing and describing raw dataset information. Before conducting any analysis, data scientists must become familiar with their data’s contents. Since datasets can encompass petabytes of information, it’s impossible to understand trends simply by examining individual rows and columns. A more systematic approach is required, and the exercises in this lab demonstrate how to effectively summarize and preprocess data.
Prior to beginning the data analysis, I loaded the necessary packages and added design code to ensure a professional looking graphs.
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(dplyr)
knitr::opts_chunk$set(
echo = TRUE,
fig.show = "hold",
fig.align = "center",
fig.width = 6,
fig.height = 4,
out.width = "80%",
dev = "png",
dpi = 300
)
Loading the dataset that will be used for the data analysis.
data("nycflights")
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
head(nycflights, 5)
## # A tibble: 5 × 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## # ℹ 6 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>
In Exercise 1, I analyzed three histograms, each shown below. At a glance, you can tell that all three share the same shape: a right-skewed distribution. All histograms cover the same range of values for the variable on the x-axis, departure delay (dep_delay). The maximum count of departure delays for the three graphs varies from 27,000 to 30,000 delays.
All three histograms are made from the same dataset. The main difference in the visualization is binwidth, which varies from 15 to 150, with one graph having a default bandwidth calculated using built-in algorithms. In graph 1 and graph 2, the distribution of data is detailed to the point that you can see how the changes in the count of departure delay occur. Graph 3 has a large binwidth that obscures the details of the data but clearly shows that the data is a right-skewed distribution.
In both Graph 1 (default binwidth) and Graph 3 (binwidth = 150), the right tail of the distribution extends to about a value of 380 for departure delays. The maximum values of the peak of each graph are different. In graph 1, the count of departure delays is just below 27,000. In Graph 3, with the largest binwidth of 150, the peak of the graph extends to approximately 32,000 counts. In Graph 2, the graph with the binwidth of 15, the peak is at 20,100.
In summary, Graph 3 is good for a quick snapshot of the distribution, while Graphs 1 and 2 are best for analyzing details about the graph.
ggplot(nycflights, aes(x = dep_delay)) +
geom_histogram() +
labs(title = "Graph 1: binwidth is default value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Binwidth = 15
ggplot(nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15) +
labs(title = "Graph 2: binwidth is 15")
# Binwidth = 150
ggplot(nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150) +
labs(title = "Graph 3: binwidth is 150")
# Filtering and counting the rows in the new
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
nrow(sfo_feb_flights)
## [1] 68
Based on the above code, 68 flights met the criterion of being flights to San Francisco from New York City that occurred in the month of February.
ggplot(sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 15, fill = "navy", color = "grey") +
labs(
title = "Arrival Delays of Flights to San Francisco from NYC in February",
x = "Arrival Delay (minutes)",
y = "Count of Flights"
) +
theme_minimal()
# Appropriate Summary Statistics
median_delay <- median(sfo_feb_flights$arr_delay, na.rm = TRUE)
iqr_delay <- IQR(sfo_feb_flights$arr_delay, na.rm = TRUE)
q25_delay <- quantile(sfo_feb_flights$arr_delay, 0.25, na.rm = TRUE)
q75_delay <- quantile(sfo_feb_flights$arr_delay, 0.75, na.rm = TRUE)
range_delay <- range(sfo_feb_flights$arr_delay, na.rm = TRUE)
## The median arrival delay for NYC to SFO flights in February is -11 minutes.
## The IQR (interquartile range) is 23.25 minutes, showing the spread of the middle 50% of delays.
## 25% of flights arrive -21.25 minutes or earlier than scheduled.
## 75% of flights arrive within 2 minutes of their scheduled time.
## Arrival delays range from -66 minutes (early) to 196 minutes (late).
# Group by carrier
sfo_feb_flights %>%
group_by(carrier) %>%
#Summarize the data based on median and interquartile range of arrival delay and remove missing values
summarise(
median_delay = median(arr_delay, na.rm = TRUE),
iqr_delay = IQR(arr_delay, na.rm = TRUE),
) %>%
# Sort by IQR Descending
arrange(desc(iqr_delay)) # Sort by IQR to see most variable first
## # A tibble: 5 × 3
## carrier median_delay iqr_delay
## <chr> <dbl> <dbl>
## 1 DL -15 22
## 2 UA -10 22
## 3 VX -22.5 21.2
## 4 AA 5 17.5
## 5 B6 -10.5 12.2
ggplot(sfo_feb_flights, aes(x = carrier, y = arr_delay)) +
geom_boxplot() +
labs(
title = "Arrival Delay Variability by Carrier",
subtitle = "Box height shows variability (IQR)",
x = "Carrier",
y = "Arrival Delay (minutes)"
) +
theme_minimal()
Delta (DL) and United (UA) show the highest variability in arrival delays with IQR values of 22 minutes each, indicating the most inconsistent arrival performance among the carriers.
JetBlue (B6) demonstrates the most consistent arrival times with the lowest IQR of 12.2 minutes, making it the most reliable carrier for predictable arrival delays, while Virgin America (VX) has the best overall performance with flights arriving an average of 22.5 minutes early.
library(dplyr)
# Create monthly summary with month names
monthly_summary <- nycflights %>%
group_by(month) %>%
summarise(
mean_delay = mean(dep_delay, na.rm = TRUE),
median_delay = median(dep_delay, na.rm = TRUE),
.groups = 'drop'
) %>%
mutate(month_name = month.name[month])
# Find months with lowest delays (including the actual values)
month_lowest_mean <- monthly_summary %>%
filter(mean_delay == min(mean_delay)) %>%
select(month, month_name, mean_delay)
month_lowest_median <- monthly_summary %>%
filter(median_delay == min(median_delay)) %>%
select(month, month_name, median_delay)
# Print with actual values
print("Month with the lowest mean delay:")
## [1] "Month with the lowest mean delay:"
print(month_lowest_mean)
## # A tibble: 1 × 3
## month month_name mean_delay
## <int> <chr> <dbl>
## 1 10 October 5.88
cat("Month", month_lowest_mean$month, "(", month_lowest_mean$month_name,
") has the lowest mean delay of", round(month_lowest_mean$mean_delay, 2), "minutes\n\n")
## Month 10 ( October ) has the lowest mean delay of 5.88 minutes
print("Month with the lowest median delay:")
## [1] "Month with the lowest median delay:"
print(month_lowest_median)
## # A tibble: 2 × 3
## month month_name median_delay
## <int> <chr> <dbl>
## 1 9 September -3
## 2 10 October -3
cat("Month", month_lowest_median$month, "(", month_lowest_median$month_name,
") has the lowest median delay of", round(month_lowest_median$median_delay, 2), "minutes\n")
## Month 9 10 ( September October ) has the lowest median delay of -3 -3 minutes
Advantages:
Disadvantages:
Advantages:
Disadvantages:
Practical tip: Plan for slightly more than the median delay to cover most scenarios, but don’t base your planning on the mean if it’s much higher due to outliers.
library(dplyr)
# Calculate on-time performance by airport
airport_performance <- nycflights %>%
filter(!is.na(dep_delay)) %>% # Remove flights with missing delay data
group_by(origin) %>%
summarise(
total_flights = n(),
on_time_flights = sum(dep_delay <= 15, na.rm = TRUE), # ≤15 min = on-time
delayed_flights = sum(dep_delay > 15, na.rm = TRUE), # >15 min = delayed
on_time_percentage = (on_time_flights / total_flights) * 100,
avg_delay = mean(dep_delay, na.rm = TRUE),
.groups = 'drop'
) %>%
arrange(desc(on_time_percentage))
print("NYC Airport On-Time Performance:")
## [1] "NYC Airport On-Time Performance:"
print(airport_performance)
## # A tibble: 3 × 6
## origin total_flights on_time_flights delayed_flights on_time_percentage
## <chr> <int> <int> <int> <dbl>
## 1 LGA 10067 8160 1907 81.1
## 2 JFK 10897 8657 2240 79.4
## 3 EWR 11771 8829 2942 75.0
## # ℹ 1 more variable: avg_delay <dbl>
# Find the best airport
best_airport <- airport_performance$origin[1]
best_percentage <- round(airport_performance$on_time_percentage[1], 2)
cat("Best NYC airport for on-time departures:", best_airport,
"with", best_percentage, "% on-time performance\n")
## Best NYC airport for on-time departures: LGA with 81.06 % on-time performance
library(ggplot2)
library(dplyr)
# Create the data
airport_performance <- nycflights %>%
filter(!is.na(dep_delay)) %>%
group_by(origin) %>%
summarise(
total_flights = n(),
on_time_percentage = sum(dep_delay <= 15) / n() * 100,
.groups = 'drop'
)
# Create ggplot showing airport and ontime percentage
ggplot(airport_performance, aes(x = origin, y = on_time_percentage, fill = origin)) +
geom_bar(stat = "identity") +
geom_text(aes(label = paste0(round(on_time_percentage, 1), "%")),
vjust = -0.5, size = 4) +
labs(title = "On-Time Departure Percentage by NYC Airport",
subtitle = "Flights departing within 15 minutes of scheduled time",
x = "Airport",
y = "On-Time Percentage (%)") +
theme_minimal() +
scale_fill_manual(values = c("EWR" = "skyblue", "JFK" = "lightgreen", "LGA" = "coral")) +
ylim(0, max(airport_performance$on_time_percentage) + 5) +
theme(legend.position = "none")
library(dplyr)
# Using the mutate function to add avg_speed column
nycflights_with_speed <- nycflights %>%
mutate(
# Convert air_time from minutes to hours by dividing by 60
air_time_hours = air_time / 60,
# Calculate average speed: distance (miles) / time (hours) = mph
avg_speed = distance / air_time_hours
)
# Print first few rows of nyc_flights_speed to view new speed column
print(head(nycflights_with_speed))
## # A tibble: 6 × 18
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # ℹ 8 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, air_time_hours <dbl>, avg_speed <dbl>
ggplot(nycflights_with_speed, aes(x = distance, y = avg_speed, color = origin)) +
geom_point(alpha = 0.4, size = 0.9) +
scale_x_continuous(limits = c(0, 3500)) + # Extend x-axis from 0 to 3500
labs(
title = "Relationship Between Average Speed and Distance",
x = "Distance (miles)",
y = "Average Speed (mph)",
caption = "Source: NYC Flights 2013"
) +
theme_minimal()
## Warning: Removed 66 rows containing missing values or values outside the scale range
## (`geom_point()`).
Image A: Graph to Recreate
#Filter the dataset to get obtain these airlines AA", "DL", "UA"
library(dplyr)
library(ggplot2)
# Filter for the three specific airlines
filtered_flights <- nycflights %>%
filter(carrier %in% c("AA", "DL", "UA")) %>% # AA=American, DL=Delta, UA=United
filter(!is.na(dep_delay), !is.na(arr_delay)) # Remove missing values
# Create the plot: departure delay vs arrival delay, colored by carrier
ggplot(filtered_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
geom_point(alpha = 0.3, size = 0.8) + # Semi-transparent points
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray") + # Reference line
labs(
title = "Graph of Image A, created with code",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)",
color = "Carrier"
) +
theme_minimal() +
scale_color_manual(
values = c("AA" = "red", "DL" = "blue", "UA" = "green"),
labels = c("AA" = "American", "DL" = "Delta", "UA" = "United")
)