── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13) # Provides the flights and airlines datasets
data("flights")
data("airlines")
Create an Initial Scatterplot with a Loess Smoother for Distance vs. Delays
Use “group_by” together with summarize functions
Never use the function “na.omit” here: it drops every row that has an NA in any column, not just in the columns being analyzed.
# Remove NAs for distance, arrival delay, and departure delay
flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))
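To see why the targeted filter is preferred, here is a minimal sketch on a small made-up table (the `toy` data frame and its values are illustrative assumptions, not part of nycflights13). A row that is missing only an unrelated column survives the filter but is lost with na.omit().

```r
library(dplyr)

# Toy data (hypothetical values): row 2 is missing only air_time,
# a column that this analysis does not use
toy <- tibble(
  distance  = c(100, 200, 300, NA),
  arr_delay = c(5, 10, NA, 20),
  dep_delay = c(0, 3, 7, 9),
  air_time  = c(30, NA, 60, 80)
)

# na.omit() removes every row with an NA in ANY column
omit_all <- na.omit(toy)

# filter() removes only rows with an NA in the columns we care about
keep_needed <- toy |>
  filter(!is.na(distance) & !is.na(arr_delay) & !is.na(dep_delay))

nrow(omit_all)     # rows surviving na.omit()
nrow(keep_needed)  # rows surviving the targeted filter
```

On the real flights table the same effect applies to any column with missing values that is not needed for the summary.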
Use “group_by” and “summarise” to Create a Summary Table
The table includes counts for each destination, mean distance traveled, mean arrival delay, and mean departure delay
by_dest <- flights_nona |>
  group_by(dest) |> # Group all destinations
  summarise(
    count = n(), # Counts totals for each destination
    avg_dist = mean(distance), # Calculates the mean distance traveled
    avg_arr_delay = mean(arr_delay), # Calculates the mean arrival delay
    avg_dep_delay = mean(dep_delay), # Calculates the mean departure delay
    .groups = "drop" # Removes the grouping structure after summarizing
  ) |>
  arrange(avg_arr_delay) |>
  filter(avg_dist < 3000)
head(by_dest)
Average Arrival Delay is Only Slightly Related to Average Distance Flown by a Plane
Show a scatterplot of distance versus arrival delay
ggplot(by_dest, aes(avg_dist, avg_arr_delay)) +
  geom_point(aes(size = count), alpha = 0.3) +
  geom_smooth(se = FALSE) + # Remove the error band
  scale_size_area() +
  theme_bw() +
  labs(
    x = "Average Flight Distance (miles)",
    y = "Average Arrival Delay (minutes)",
    size = "Number of Flights \n Per Destination", # \n brings the text to the next line
    caption = "FAA Aircraft Registry",
    title = "Average Distance and Average Arrival Delays from Flights from NY"
  )
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Heatmaps
A heatmap is a way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heatmaps: the cluster heatmap and the spatial heatmap. In a cluster heatmap, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heatmap is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
Heatmap of Average Departure Delays, Arrival Delays, Distance and Flight Times
# Drop dest from the matrix so it won't show in the heatmap
by_dest_matrix <- data.matrix(by_dest[, -1])
# Restore the destination codes as row names
row.names(by_dest_matrix) <- by_dest$dest
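The call that draws the heatmap does not appear above. The following is a minimal sketch, assuming base R's heatmap() and a small made-up matrix standing in for by_dest_matrix. The values, airport codes, color palette, margins, and title are all illustrative choices, not the author's.

```r
# Stand-in for by_dest_matrix with hypothetical values and airport codes
toy_matrix <- matrix(
  c(500, 750, 1000, 2500,   # avg_dist
    2, 8, 5, 12,            # avg_arr_delay
    6, 10, 9, 15),          # avg_dep_delay
  nrow = 4,
  dimnames = list(c("AAA", "BBB", "CCC", "DDD"),
                  c("avg_dist", "avg_arr_delay", "avg_dep_delay"))
)

# scale = "column" standardizes each column so that distance (hundreds of
# miles) does not swamp the delays (minutes); Rowv = NA and Colv = NA keep
# the existing row and column order instead of clustering
hm <- heatmap(
  toy_matrix,
  Rowv = NA, Colv = NA,
  scale = "column",
  col = hcl.colors(9, "YlOrRd", rev = TRUE),
  margins = c(10, 5),
  main = "Toy Heatmap of Destination Averages"
)
```

With the real data, by_dest_matrix would take the place of toy_matrix. Dropping Rowv = NA lets heatmap() reorder destinations by similarity, which produces the cluster heatmap described in the definition above.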
Which 6 Destination Airports Have the Highest Average Arrival Delay from NYC?
PSE-Ponce Mercedita Airport, PR
ABQ-Albuquerque, NM
BQN-Rafael Hernandez Airport, PR
SJC-San Jose Mineta, CA
MCO-Orlando International, FL
FLL-Fort Lauderdale International, FL
Treemaps
Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of tree maps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The Downside to Treemaps
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
Join the “flights_nona” Dataset with the Airlines Dataset
Also remove “Inc.” or “Co.” from the Carrier Name
flights2 <- left_join(flights_nona, airlines, by = "carrier")
flights2$name <- gsub("Inc\\.|Co\\.", "", flights2$name)
# Summarize mean distance and mean arrival delay for each carrier
flights3 <- flights2 |>
  group_by(name) |>
  summarise(
    avg_dist = mean(distance), # Calculates the mean distance traveled
    avg_arr_delay = mean(arr_delay) # Calculates the mean arrival delay
  )
# Convert months from numerical to abbreviated labels
flights2$month_label <- month(flights2$month, label = TRUE, abbr = TRUE)
Create a Treemap for NYC Flights
The index is a categorical variable: the carrier name
The size of each box is the average distance
The color of each box is the average arrival delay
Notice how the treemap includes a legend for average arrival delay
library(RColorBrewer)
library(treemap)
treemap(flights3,
        index = "name",
        vSize = "avg_dist",
        vColor = "avg_arr_delay",
        type = "manual",
        palette = "RdYlBu", # Use RColorBrewer palette
        title = "Average Distance and Arrival Delay by Carrier", # Plot title
        title.legend = "Avg Arrival Delay (min)") # Legend label
Graph On-Time Performance Using Departure Delay and Arrival Delay
Key performance indicators (KPIs) are among the most important data collected for reporting, and the one agencies look at most is “On-Time Performance” (OTP), usually defined as arriving at the origin location within 15 minutes of the requested/scheduled pickup time. The following code creates a bidirectional bar graph that shows both the departure delay percentage and the arrival delay percentage for each carrier.
Calculate the Percentage of Flights with Less Than 15 Minutes Delay (OTP)
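A minimal sketch of this calculation and of the bidirectional bar graph, run on a small made-up table so that the numbers are easy to check by hand. The `toy_flights` data, the helper name `otp_by_carrier`, and the plot styling are assumptions for illustration; only the 15-minute threshold and the per-carrier grouping come from the text above.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for flights2 (hypothetical carriers and delays, in minutes)
toy_flights <- tibble(
  name      = c("A Air", "A Air", "A Air", "A Air", "B Jet", "B Jet"),
  dep_delay = c(0, 20, 5, 30, 10, 40),
  arr_delay = c(-5, 10, 14, 45, 16, 50)
)

# Hypothetical helper: percentage of flights delayed by less than 15 minutes
otp_by_carrier <- function(df) {
  df |>
    group_by(name) |>
    summarise(
      dep_otp = 100 * mean(dep_delay < 15),
      arr_otp = 100 * mean(arr_delay < 15),
      .groups = "drop"
    )
}

otp <- otp_by_carrier(toy_flights)

# Bidirectional bars: departure percentages point left (negative values),
# arrival percentages point right (positive values)
otp_long <- otp |>
  pivot_longer(c(dep_otp, arr_otp), names_to = "type", values_to = "pct") |>
  mutate(signed_pct = if_else(type == "dep_otp", -pct, pct))

otp_plot <- ggplot(otp_long, aes(x = signed_pct, y = name, fill = type)) +
  geom_col() +
  scale_x_continuous(labels = abs) + # Show magnitudes on both sides of zero
  theme_bw() +
  labs(x = "Flights with Less Than 15 Minutes Delay (%)",
       y = "Carrier",
       fill = "Delay Type")
otp_plot
```

Because the helper only needs the columns name, dep_delay, and arr_delay, calling otp_by_carrier(flights2) should give the per-carrier percentages for the NYC flights data.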
Alluvial Plot of Refugees from the Top 10 Most Affected Countries, 2003-2013
Alluvial plots need three variables: a time variable, a value, and a category
library(ggalluvial) # Provides geom_alluvium()
data(Refugees, package = "alluvial") # The Refugees dataset ships with the alluvial package
ggalluv <- Refugees |>
  ggplot(aes(x = year, y = refugees, alluvium = country)) +
  theme_bw() +
  geom_alluvium(aes(fill = country),
                color = "white",
                width = 0.1,
                alpha = 0.8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") + # Spectral has enough colors for all countries listed
  scale_x_continuous(limits = c(2002, 2013)) +
  labs(title = "UNHCR-Recognised Refugees Top 10 Countries \n (2003-2013)",
       y = "Number of Refugees",
       fill = "Country",
       caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")
Plot the Alluvial
ggalluv
A Final Touch to Fix the Y-Axis Scale
Notice that the y-axis values are in scientific notation. We can convert them to standard notation by setting the scipen option with the options() function.
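A minimal sketch of that step. The value 999 is a common convention rather than something stated above, and saving and restoring the old setting is an optional extra.

```r
# Record the current setting so it can be restored later
old <- options(scipen = 999) # A large penalty makes R prefer fixed notation

fixed_label <- format(1e6)   # Formatted without an exponent under this setting
fixed_label

# Printing the plot again now draws the y-axis labels in standard notation:
# ggalluv

options(old)                 # Restore the previous setting if desired
```

The option is global, so it affects all numbers printed in the session until it is reset.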