#install.packages("nycflights23")
library(nycflights23)
library(tidyverse)
data(flights)
data(airlines)
Heatmaps Treemaps and Alluvials
So many ways to visualize data
Use the dataset NYCFlights23 to explore late arrivals
Source: FAA Aircraft registry, https://www.faa.gov/licenses_certificates/aircraft_certification/ aircraft_registry/releasable_aircraft_download/
Create an initial scatterplot with loess smoother for distance to delays
Use “group_by” together with summarize functions
<- flights |>
flights_nona filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))
# remove na's for distance, arr_delay, departure delay
Use group_by and summarise to create a summary table
The table includes counts for each destination, mean distance traveled, mean arrival delay, and mean departure delay
<- flights_nona |>
by_dest group_by(dest) |> # group all destinations
summarise(count = n(), # counts totals for each destination
avg_dist = mean(distance), # calculates the mean distance traveled
avg_arr_delay = mean(arr_delay), # calculates the mean arrival delay
avg_dep_delay = mean(dep_delay), # calculates the mean dep delay
.groups = "drop") |> # remove the grouping structure after summarizing
arrange(avg_arr_delay) |>
filter(avg_dist < 3000)
head(by_dest)
# A tibble: 6 × 5
dest count avg_dist avg_arr_delay avg_dep_delay
<chr> <int> <dbl> <dbl> <dbl>
1 PNS 71 1030 -10.6 -1.24
2 HHH 461 695. -9.95 1.38
3 HDN 27 1728 -9.93 8.78
4 VPS 107 988 -9.41 2.62
5 AVP 140 93 -8.53 -0.957
6 GSO 2857 456. -7.77 3.81
Heatmaps
A heatmap is a way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
Heatmap of average departure delays, arrival delays, distance and flight times
<- data.matrix(by_dest[, -1]) # drop dest from matrix so it won't show in heatmap
by_dest_matrix row.names(by_dest_matrix) <- by_dest$dest # restore row names
library(viridis)
Loading required package: viridisLite
<- heatmap(by_dest_matrix,
by_dest_heatmap Rowv=NA,
Colv=NA,
col = viridis(25),
cexCol = .7, # shrink x-axis label size
scale="column",
xlab = "",
ylab = "",
main = "Heatmap of Flight Destinations from NY Airports")
Which 6 destination airports have the highest average arrival delay from NYC?
PSE - Ponce Mercedita Airport, PR ABQ - Albuquerque, NM BQN - Rafael Hernández Airport, PR SJC - San José Mineta, CA MCO - Orlando International, FL FLL - Fort Lauderdale International, FL
Treemaps
Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data.[1] Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The Downside to Treemaps
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
Join the delay_punctuality dataset with the airlines dataset
Also remove “Inc.” or “Co.” from the Carrier Name
<- left_join(flights_nona, airlines, by = "carrier")
flights2 $name <- gsub("Inc\\.|Co\\.", "", flights2$name) flights2
# Convert months from numerical to abbreviated labels
<- flights2 |>
flights3 group_by(name)|>
summarise(avg_dist = mean(distance), # calculates the mean distance traveled
avg_arr_delay = mean(arr_delay)) # calculates the mean arrival delay
#flights2$month_label <- month(flights2$month, label = TRUE, abbr = TRUE)
Create a treemap for NYC FLights
The index is a categorical variable - carrier (name)
The size of the box is by average distance
The heatmap color is average arrival delay
Notice how the treemap includes a legend for average arrival delay
library(RColorBrewer)
library(treemap)
treemap(flights3,
index="name", # airline
vSize="avg_dist",
vColor="avg_arr_delay",
type="manual",
palette="RdYlBu", #Use RColorBrewer palette
title = "Average Distance and Arrival Delay by Carrier", # plot title
title.legend = "Avg Arrival Delay (min)" ) # legend label
Graph On-Time Performance using Departure Delay and Arrival Delay
Some of the most important data that is collected for reporting is to analyze key performance indicators (KPIs) and the subset that agencies look at the most is “On-Time Performance” (OTP) which is usually defined as arriving at the origin location within 15 minutes of the requested/scheduled pickup time. The following code will create a bidirectional bar graph that has both the departure delay percentage and arrival delay percentage for each carrier.
Calculate the percentage of flights with less than 15 minutes delay (OTP)
<- flights2 |>
delay_OTP group_by(name) |> #name = airline
summarize(Departure_Percentage = sum(dep_delay <= 15)
/ n() * 100,
Arrival_Percentage = sum(arr_delay <= 15) / n() * 100)
Create a bidirectional horizontal bar chart
ggplot(delay_OTP, aes(x = -Departure_Percentage, y = reorder(name, Departure_Percentage))) +
geom_text(aes(label = paste0(round(Departure_Percentage, 0), "%")),
hjust = 1.1, size = 3.5) + #departure % labels
geom_bar(aes(fill = "Departure_Percentage"), stat = "identity", width = .75) +
geom_bar(aes(x = Arrival_Percentage, fill = "Arrival_Percentage"),
stat = "identity", width = .75) +
geom_text(aes(x = Arrival_Percentage, label = paste0(round(Arrival_Percentage, 0), "%")),
hjust =-.1, size = 3.5) + # arrival % labels
labs(x = "Departures < On-Time Performance > Arrivals",
y = "",
title = "On-Time Performance of Airline Carriers \n (Percent of Flights < 15 Minutes Delay)",
caption = "Source: FAA") +
scale_fill_manual(
name = "Performance",
breaks = c("Departure_Percentage", "Arrival_Percentage"), # Specify the order of legend items
values = c("Departure_Percentage" = "#8bd3c7", "Arrival_Percentage" = "#beb9db"),
labels = c("Departure_Percentage" = "Departure", "Arrival_Percentage" = "Arrival")
+
)
scale_x_continuous(labels = abs, limits = c(-120, 120)) + # Positive negative axis
theme_minimal()
Alluvials
Load the alluvial package
Refugees is a prebuilt dataset in the alluvial package
If you want to save the prebuilt dataset to your folder, use the write_csv function
library(alluvial)
library(ggalluvial)
data(Refugees)
Show UNHCR-recognised refugees
Top 10 most affected countries causing refugees from 2003-2013 Alluvials need the variables: time-variable, value, category
<- Refugees |>
ggalluv ggplot(aes(x = year, y = refugees, alluvium = country)) +
theme_bw() +
geom_alluvium(aes(fill = country),
color = "white",
width = .1,
alpha = .8,
decreasing = FALSE) +
scale_fill_brewer(palette = "Spectral") +
# Spectral has enough colors for all countries listed
scale_x_continuous(lim = c(2002, 2013)) +
labs(title = "UNHCR-Recognised Refugees Top 10 Countries\n (2003-2013)",
# \n breaks the long title
y = "Number of Refugees",
fill = "Country",
caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")
Plot the Alluvial
ggalluv
A final touch to fix the y-axis scale
Notice the y-values are in scientific notation. We can convert them to standard notation with options scipen function
options(scipen = 999)
ggalluv