Heatmaps, Treemaps, and Alluvials

So many ways to visualize data

https://www.travelsavvy.agency/blog/what-airlines-fly-to-los-angeles

Use the dataset NYCFlights23 to explore late arrivals

Source: FAA Aircraft Registry, https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/

#install.packages("nycflights23")
library(nycflights23)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(flights)
data(airlines)

Create an initial scatterplot with loess smoother for distance to delays

Use “group_by” together with summarize functions

Never use the function “na.omit”! It drops every row that has an NA in any column, including columns you are not analyzing.

flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))  
# remove na's for distance, arr_delay, departure delay
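The difference between na.omit and a targeted filter is easiest to see on a small table. The sketch below uses made-up data (toy and its values are hypothetical, not taken from flights), where one row is missing a value in a column the analysis never uses.

```r
library(dplyr)

# Hypothetical toy data: row 2 lacks distance, row 3 lacks only tailnum
toy <- tibble(
  distance  = c(500, NA, 1200, 800),
  arr_delay = c(10, 5, -3, 20),
  tailnum   = c("N1", "N2", NA, "N4")   # not needed for the delay analysis
)

# na.omit() removes every row with an NA in ANY column
all_complete <- na.omit(toy)

# A targeted filter removes only rows missing the columns we analyze
targeted <- toy |>
  filter(!is.na(distance), !is.na(arr_delay))

nrow(all_complete)  # 2 rows remain
nrow(targeted)      # 3 rows remain
```

The targeted filter keeps row 3, which na.omit would have discarded for a reason unrelated to the analysis.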

Use group_by and summarise to create a summary table

The table includes counts for each destination, mean distance traveled, mean arrival delay, and mean departure delay

by_dest <- flights_nona |>
  group_by(dest) |>  # group all destinations
  summarise(count = n(),   # counts totals for each destination
            avg_dist = mean(distance), # calculates the mean distance traveled
            avg_arr_delay = mean(arr_delay),  # calculates the mean arrival delay
            avg_dep_delay = mean(dep_delay), # calculates the mean dep delay
            .groups = "drop") |>  # remove the grouping structure after summarizing
  arrange(avg_arr_delay) |>
  filter(avg_dist < 3000)
head(by_dest)
# A tibble: 6 × 5
  dest  count avg_dist avg_arr_delay avg_dep_delay
  <chr> <int>    <dbl>         <dbl>         <dbl>
1 PNS      71    1030         -10.6         -1.24 
2 HHH     461     695.         -9.95         1.38 
3 HDN      27    1728          -9.93         8.78 
4 VPS     107     988          -9.41         2.62 
5 AVP     140      93          -8.53        -0.957
6 GSO    2857     456.         -7.77         3.81 
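A summary table like this can feed the scatterplot with a loess smoother mentioned earlier. The following is a minimal sketch, assuming a made-up stand-in table (toy_dest and its values are hypothetical); with the real data you would pass by_dest to ggplot instead.

```r
library(ggplot2)

# Hypothetical stand-in for by_dest: 40 made-up destinations
set.seed(1)
toy_dest <- data.frame(
  avg_dist = runif(40, min = 100, max = 2800),
  count    = sample(50:5000, 40)
)
toy_dest$avg_arr_delay <- 10 - toy_dest$avg_dist / 400 + rnorm(40, sd = 3)

# Scatterplot of distance against arrival delay with a loess curve
p <- ggplot(toy_dest, aes(x = avg_dist, y = avg_arr_delay)) +
  geom_point(aes(size = count), alpha = 0.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Average Distance (miles)",
       y = "Average Arrival Delay (min)",
       size = "Flights")
p
```

Mapping count to point size is a design choice that makes busy destinations stand out; it is optional.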

Heatmaps

A heatmap is a way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)

Heatmap of flight counts, average distance, average arrival delays, and average departure delays

by_dest_matrix <- data.matrix(by_dest[, -1])  # drop dest from matrix so it won't show in heatmap
row.names(by_dest_matrix) <- by_dest$dest     # restore row names
library(viridis)
Loading required package: viridisLite
by_dest_heatmap <- heatmap(by_dest_matrix, 
                       Rowv=NA, 
                       Colv=NA, 
                       col = viridis(250), 
                       cexCol = .7,  # shrink x-axis label size 
                       scale="column", 
                       xlab = "",
                       ylab = "",
                       main = "")
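The scale = "column" argument matters here because the columns use different units (counts, miles, minutes). It centers and scales the values within each column before colors are assigned. As a rough illustration, base R's scale() does the same kind of column standardization; the matrix below is made up for the example.

```r
# Hypothetical matrix: two columns on very different scales
m <- matrix(c(100, 200, 300,
              1,   2,   6),
            ncol = 2,
            dimnames = list(c("AAA", "BBB", "CCC"),
                            c("avg_dist", "avg_arr_delay")))

# Center each column on 0 and divide by its standard deviation
z <- scale(m)

col_means <- colMeans(z)       # approximately 0 for every column
col_sds   <- apply(z, 2, sd)   # 1 for every column
round(z, 2)
```

After standardizing, a color compares a cell only to other cells in the same column, so a dark cell in one column and a dark cell in another do not mean the same raw value.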

Which 6 destination airports have the highest average arrival delay from NYC?

  • PSE - Ponce Mercedita Airport, PR

  • ABQ - Albuquerque, NM

  • BQN - Rafael Hernández Airport, PR

  • SJC - San José Mineta, CA

  • MCO - Orlando International, FL

  • FLL - Fort Lauderdale International, FL
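One way to answer a question like this from a summary table is to keep the rows with the largest average arrival delay. Here is a minimal sketch using dplyr's slice_max() on hypothetical values (toy_by_dest and its numbers are made up); with the real table you would call slice_max(by_dest, avg_arr_delay, n = 6).

```r
library(dplyr)

# Hypothetical summary table with made-up destinations and delays
toy_by_dest <- tibble(
  dest          = c("AAA", "BBB", "CCC", "DDD"),
  avg_arr_delay = c(-5.2, 12.4, 3.1, 20.8)
)

# Keep the 2 destinations with the highest average arrival delay
top2 <- toy_by_dest |>
  slice_max(avg_arr_delay, n = 2)

top2$dest  # "DDD" "BBB"
```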

Treemaps

Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.

When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.

The Downside to Treemaps

The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)

Join the flights_nona dataset with the airlines dataset

Also remove “Inc.” or “Co.” from the Carrier Name

flights2 <- left_join(flights_nona, airlines, by = "carrier")
# Strip "Inc." and "Co." from carrier names, then trim the leftover trailing space
flights2$name <- trimws(gsub("Inc\\.|Co\\.", "", flights2$name))
flights3 <- flights2 |>
  group_by(name) |>
  summarise(avg_dist = mean(distance), # calculates the mean distance traveled
            avg_arr_delay = mean(arr_delay))  # calculates the mean arrival delay
# Optional: convert months from numerical to abbreviated labels
#flights2$month_label <- month(flights2$month, label = TRUE, abbr = TRUE)
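To see what the join and the name cleanup do, it helps to run them on a few rows. The sketch below uses made-up tables (toy_flights and toy_airlines are hypothetical, and the carrier code "ZZ" is invented to show an unmatched row).

```r
library(dplyr)

# Hypothetical flights table; "ZZ" has no match in the lookup table
toy_flights <- tibble(
  carrier   = c("DL", "WN", "ZZ"),
  arr_delay = c(5, -2, 30)
)

# Hypothetical lookup table of carrier names
toy_airlines <- tibble(
  carrier = c("DL", "WN"),
  name    = c("Delta Air Lines Inc.", "Southwest Airlines Co.")
)

# left_join keeps every flight; unmatched carriers get NA for name
joined <- left_join(toy_flights, toy_airlines, by = "carrier")

# Remove the suffixes, then trim the space they leave behind
joined$name <- trimws(gsub("Inc\\.|Co\\.", "", joined$name))
joined
```

Because left_join keeps all rows from the left table, an unmatched carrier stays in the data with a missing name instead of disappearing silently.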

Create a treemap for NYC Flights

  • The index is a categorical variable - carrier

  • The size of the box is by average distance

  • The heatmap color is average arrival delay

  • Notice how the treemap includes a legend for average arrival delay

library(RColorBrewer)
library(treemap)
treemap(flights3, 
        index="name", 
        vSize="avg_dist", 
        vColor="avg_arr_delay", 
        type="manual",    
        palette="RdYlBu",  #Use RColorBrewer palette
        title = "Average Distance and Arrival Delay by Carrier",  # plot title
        title.legend = "Avg Arrival Delay (min)" )   # legend label 

Graph On-Time Performance using Departure Delay and Arrival Delay

Much of the data collected for reporting is used to analyze key performance indicators (KPIs). The KPI that agencies look at most is “On-Time Performance” (OTP), which is usually defined as arriving at the origin location within 15 minutes of the requested/scheduled pickup time. For flights, the code below counts a departure or arrival as on time when its delay is 15 minutes or less. It then creates a bidirectional bar graph that shows the on-time departure percentage and the on-time arrival percentage for each carrier.

Calculate the percentage of flights delayed 15 minutes or less (OTP)

delay_OTP <- flights2 |>
  group_by(name) |>
  summarize(Departure_Percentage = sum(dep_delay <= 15) / n() * 100,
            Arrival_Percentage = sum(arr_delay <= 15) / n() * 100)
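The percentage calculation can be checked by hand on a few rows. In this minimal sketch the carriers and delays are made up (toy_delays is hypothetical); it also shows that taking the mean of a logical vector gives the same share as sum(...) / n().

```r
library(dplyr)

# Hypothetical delays in minutes for two made-up carriers
toy_delays <- tibble(
  name      = c("A", "A", "A", "A", "B", "B"),
  dep_delay = c(0, 20, 15, -3, 40, 5)
)

toy_otp <- toy_delays |>
  group_by(name) |>
  summarize(otp_sum  = sum(dep_delay <= 15) / n() * 100,  # same form as above
            otp_mean = mean(dep_delay <= 15) * 100)       # equivalent shortcut

toy_otp
# Carrier A: 3 of 4 flights on time = 75
# Carrier B: 1 of 2 flights on time = 50
```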

Create a bidirectional horizontal bar chart

ggplot(delay_OTP, aes(x = -Departure_Percentage, y = reorder(name, Departure_Percentage))) +
  geom_text(aes(label = paste0(round(Departure_Percentage, 0), "%")), 
            hjust = 1.1, size = 3.5) +  #departure % labels
  geom_bar(aes(fill = "Departure_Percentage"), stat = "identity", width = .75) +
  geom_bar(aes(x = Arrival_Percentage, fill = "Arrival_Percentage"), 
           stat = "identity", width = .75) +
  geom_text(aes(x = Arrival_Percentage, label = paste0(round(Arrival_Percentage, 0), "%")),
            hjust =-.1, size = 3.5) +  # arrival % labels
  
  labs(x = "Departures < On-Time Performance > Arrivals", 
       y = "Carrier",
       title = "On-Time Performance of Airline Carriers \n (Percent of Flights with 15 Minutes or Less Delay)",
       caption = "Source: FAA") +
  
  scale_fill_manual(
    name = "Performance",
    breaks = c("Departure_Percentage", "Arrival_Percentage"),  # Specify the order of legend items
    values = c("Departure_Percentage" = "#8bd3c7", "Arrival_Percentage" = "#beb9db"),
    labels = c("Departure_Percentage" = "Departure", "Arrival_Percentage" = "Arrival")
  ) +
  
  scale_x_continuous(labels = abs, limits = c(-120, 120)) +  # Positive negative axis
  theme_minimal()

Alluvials

Load the alluvial package

Refugees is a prebuilt dataset in the alluvial package

If you want to save the prebuilt dataset to your folder, use the write_csv function

library(alluvial)
library(ggalluvial)
data(Refugees)
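Saving the prebuilt dataset might look like the sketch below. The output location is an assumption for illustration (a temporary folder); replace out_path with a path in your own project folder.

```r
library(readr)
library(alluvial)

data(Refugees)

# Hypothetical output location; swap in a path inside your own folder
out_path <- file.path(tempdir(), "refugees.csv")
write_csv(Refugees, out_path)

# Read the file back to confirm it holds the same number of rows
refugees_csv <- read_csv(out_path, show_col_types = FALSE)
nrow(refugees_csv) == nrow(Refugees)
```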

Show UNHCR-recognised refugees

Top 10 countries of origin for refugees, 2003-2013.

Alluvials need three variables: a time variable, a value, and a category.
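Refugees already comes in this long layout, with one row per country and year. If your own data is wide (one column per year), it can be reshaped first. Here is a minimal sketch with tidyr's pivot_longer(), using a made-up table (wide and its values are hypothetical).

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("X", "Y"),
  `2003`  = c(10, 20),
  `2004`  = c(15, 18),
  check.names = FALSE   # keep the year column names as written
)

# Reshape to long form: category (country), time (year), value (refugees)
long <- pivot_longer(wide,
                     cols = -country,
                     names_to = "year",
                     values_to = "refugees")
long$year <- as.integer(long$year)   # years arrive as text; make them numeric
long
```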

ggalluv <- Refugees |>
  ggplot(aes(x = year, y = refugees, alluvium = country)) + 
  theme_bw() +
  geom_alluvium(aes(fill = country), 
                color = "white",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") + 
  # Spectral has enough colors for all countries listed
  scale_x_continuous(limits = c(2002, 2013)) +
  labs(title = "UNHCR-Recognised Refugees Top 10 Countries\n (2003-2013)",
         # \n breaks the long title
       y = "Number of Refugees", 
       fill = "Country",
       caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")

Plot the Alluvial

ggalluv

A final touch to fix the y-axis scale

Notice the y-values are in scientific notation. We can convert them to standard notation by setting the scipen option with the options() function.

options(scipen = 999)
ggalluv
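The effect of scipen can be seen outside of a plot as well. This short sketch formats one large number before and after changing the option, then restores the earlier setting; the number itself is arbitrary.

```r
# Start from R's default and remember the previous setting
old <- options(scipen = 0)

before <- format(1e6)   # "1e+06": scientific notation is shorter, so R prefers it

# A large scipen value penalizes scientific notation
options(scipen = 999)
after <- format(1e6)    # "1000000"

# Restore whatever setting was active before this example
options(old)

c(before = before, after = after)
```

Note that options(scipen = 999) changes number formatting for the whole R session. If you only want to change one plot, an alternative is to add scale_y_continuous(labels = scales::label_comma()) to ggalluv, assuming the scales package is installed.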