Use group_by and summarise to create a summary table
The table includes counts for each destination, mean distance traveled, mean arrival delay, and mean departure delay
by_dest <- flights_nona |>
  group_by(dest) |> # group all destinations
  summarise(count = n(), # counts totals for each destination
            avg_dist = mean(distance), # calculates the mean distance traveled
            avg_arr_delay = mean(arr_delay), # calculates the mean arrival delay
            avg_dep_delay = mean(dep_delay), # calculates the mean dep delay
            .groups = "drop") |> # remove the grouping structure after summarizing
  arrange(avg_arr_delay) |>
  filter(avg_dist < 3000)
head(by_dest)
Average arrival delay is only slightly related to average distance flown by a plane
Show a scatterplot of average distance versus average arrival delay
ggplot(by_dest, aes(avg_dist, avg_arr_delay)) +
  geom_point(aes(size = count), alpha = .3) +
  geom_smooth(se = FALSE) + # remove the error band
  scale_size_area() +
  theme_bw() +
  labs(x = "Average Flight Distance (miles)",
       y = "Average Arrival Delay (minutes)",
       size = "Number of Flights \n Per Destination",
       caption = "FAA Aircraft registry",
       title = "Average Distance and Average Arrival Delays from Flights from NY")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Heatmaps
A heatmap is a way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
Heatmap of average departure delays, arrival delays, distance and flight times
by_dest_matrix <- data.matrix(by_dest[, -1]) # drop dest from matrix so it won't show in heatmap
row.names(by_dest_matrix) <- by_dest$dest # restore row names
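The call that draws the heatmap is not shown here. A minimal sketch of how such a matrix could be drawn with base R's heatmap() follows. The toy matrix, its row names, and its values are made up for illustration and stand in for by_dest_matrix; the argument choices are assumptions, not necessarily those used for the original figure.

```r
# Toy matrix standing in for by_dest_matrix (hypothetical destinations and values)
toy_matrix <- matrix(
  c(120,  500,  3.1, 6.0,
     80, 1100, -1.4, 2.2,
     45, 2400,  7.8, 9.5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("AAA", "BBB", "CCC"),
                  c("count", "avg_dist", "avg_arr_delay", "avg_dep_delay"))
)

hm <- heatmap(toy_matrix,
              Rowv = NA, Colv = NA,   # keep the current row/column order, no dendrograms
              scale = "column",       # rescale each column so different units are comparable
              col = heat.colors(25),  # passed through to image()
              main = "Toy heatmap of per-destination averages")
```

Setting scale = "column" matters because counts, miles, and minutes sit on very different scales; without it, the largest column would dominate the colors.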
Which 6 destination airports have the highest average arrival delay from NYC?
PSE - Ponce Mercedita Airport, PR
ABQ - Albuquerque, NM
BQN - Rafael Hernández Airport, PR
SJC - San José Mineta, CA
MCO - Orlando International, FL
FLL - Fort Lauderdale International, FL
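One way to answer this kind of question directly from a summary table is dplyr's slice_max(). The sketch below uses a small made-up table (toy_by_dest, with hypothetical destination codes and delays) in place of by_dest, so it shows the technique only and does not reproduce the list above.

```r
library(dplyr)

# Toy summary table standing in for by_dest (hypothetical values)
toy_by_dest <- data.frame(
  dest          = c("AAA", "BBB", "CCC", "DDD"),
  avg_arr_delay = c(4.2, -1.0, 12.5, 7.3)
)

# Keep the n rows with the largest average arrival delay, largest first
top_delays <- toy_by_dest |>
  slice_max(avg_arr_delay, n = 2)

top_delays
```

With the real table, the same call with n = 6 would return the six destinations with the highest average arrival delay.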
Treemaps
Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data.[1] Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The Downside to Treemaps
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
Join the flights_nona dataset with the airlines dataset
Also remove “Inc.” or “Co.” from the Carrier Name
flights2 <- left_join(flights_nona, airlines, by = "carrier")
flights2$name <- gsub("Inc\\.|Co\\.", "", flights2$name)
# Convert months from numerical to abbreviated labels
flights3 <- flights2 |>
  group_by(name) |>
  summarise(avg_dist = mean(distance), # calculates the mean distance traveled
            avg_arr_delay = mean(arr_delay)) # calculates the mean arrival delay
#flights2$month_label <- month(flights2$month, label = TRUE, abbr = TRUE)
Create a treemap for NYC Flights
The index is a categorical variable - carrier
The size of the box is by average distance
The heatmap color is average arrival delay
Notice how the treemap includes a legend for average arrival delay
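The treemap call itself is not shown here. A minimal sketch of how the settings listed above could map onto the treemap package follows. The toy data frame (toy_carriers) and its values are hypothetical stand-ins for flights3, and the type and palette choices are assumptions.

```r
library(treemap)

# Toy carrier summary standing in for flights3 (hypothetical values)
toy_carriers <- data.frame(
  name          = c("Carrier A", "Carrier B", "Carrier C"),
  avg_dist      = c(500, 1200, 2400),
  avg_arr_delay = c(2.5, 8.0, -1.5)
)

tm <- treemap(toy_carriers,
              index   = "name",           # categorical variable: one box per carrier
              vSize   = "avg_dist",       # box area is proportional to average distance
              vColor  = "avg_arr_delay",  # box color encodes average arrival delay
              type    = "value",          # map vColor to a diverging palette centered on 0
              palette = "RdYlBu",
              title   = "Toy treemap of carriers")
```

With type = "value", the legend is drawn for the vColor variable, which is how a legend for average arrival delay ends up on the plot.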
Graph On-Time Performance using Departure Delay and Arrival Delay
Much of the data collected for reporting is used to analyze key performance indicators (KPIs). The KPI that agencies look at most is “On-Time Performance” (OTP), which is usually defined as arriving at the origin location within 15 minutes of the requested/scheduled pickup time. The following code will create a bidirectional bar graph that shows both the departure delay percentage and the arrival delay percentage for each carrier.
Calculate the percentage of flights with less than 15 minutes delay (OTP)
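A minimal sketch of this calculation and of a bidirectional bar graph is given below. It assumes the column names carrier, dep_delay, and arr_delay from the flights data; toy_flights and its values are made up, and dep_otp and arr_otp are hypothetical names introduced for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for flights_nona (hypothetical values)
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "AA", "AA", "UA", "UA", "UA", "UA"),
  dep_delay = c(-3,  5, 20, 40,  0, 10, 14, 60),
  arr_delay = c(-10, 2, 30, 35, -5, 16, 12, 50)
)

# Percentage of each carrier's flights delayed by less than 15 minutes
otp <- toy_flights |>
  group_by(carrier) |>
  summarise(dep_otp = mean(dep_delay < 15) * 100,
            arr_otp = mean(arr_delay < 15) * 100,
            .groups = "drop")

# Bidirectional bars: departure OTP extends left (negative), arrival OTP extends right
otp_plot <- ggplot(otp, aes(x = carrier)) +
  geom_col(aes(y = -dep_otp, fill = "Departure")) +
  geom_col(aes(y = arr_otp, fill = "Arrival")) +
  coord_flip() +
  scale_y_continuous(labels = abs) + # show both sides as positive percentages
  theme_bw() +
  labs(x = "Carrier", y = "Flights with < 15 minutes delay (%)", fill = "Delay type")

otp_plot
```

Taking mean() of a logical vector gives the share of TRUE values, so no separate counting step is needed.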
Top 10 most affected countries causing refugees from 2003-2013
Alluvials need the variables: time-variable, value, category
ggalluv <- Refugees |>
  ggplot(aes(x = year, y = refugees, alluvium = country)) +
  theme_bw() +
  geom_alluvium(aes(fill = country),
                color = "white",
                width = .1,
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") + # Spectral has enough colors for all countries listed
  scale_x_continuous(lim = c(2002, 2013)) +
  labs(title = "UNHCR-Recognised Refugees Top 10 Countries\n (2003-2013)", # \n breaks the long title
       y = "Number of Refugees",
       fill = "Country",
       caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")
Plot the Alluvial
ggalluv
A final touch to fix the y-axis scale
Notice that the y-values are in scientific notation. We can convert them to standard notation by setting the scipen option with the options() function.
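A small base R sketch of the effect is shown below. The value 999 is a commonly used "large enough" penalty and is an assumption here, not necessarily the setting used for the plot.

```r
old_opts <- options(scipen = 0)   # R's default penalty
sci_label <- format(100000)       # "1e+05": scientific notation is shorter, so R prefers it

options(scipen = 999)             # heavily penalize scientific notation
fixed_label <- format(100000)     # "100000"

options(old_opts)                 # restore the previous setting
```

After setting options(scipen = 999), printing ggalluv again should draw the y-axis labels in standard notation, since the default axis labels are produced with format().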