Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data.[1] Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
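To make the "area proportional to a dimension" idea concrete, here is a minimal sketch in base R of the simplest possible tiling: one parent rectangle cut into vertical strips. The function name slice_layout and the page-view numbers are hypothetical, and real treemap layouts use more elaborate tiling rules to keep aspect ratios reasonable.

```r
# Slice layout: split a parent rectangle into vertical strips whose
# widths (and therefore areas) are proportional to the values.
slice_layout <- function(values, width = 1, height = 1) {
  stopifnot(is.numeric(values), all(values > 0))
  v     <- unname(values)
  w     <- v / sum(v) * width          # strip width = share of the total
  right <- cumsum(w)                   # right edge of each strip
  data.frame(item = if (is.null(names(values))) seq_along(v) else names(values),
             xmin = right - w, xmax = right,
             ymin = 0,         ymax = height,
             area = w * height)
}

# Hypothetical view counts for three categories
views <- c(Mapping = 600, Featured = 300, Statistics = 100)
rects <- slice_layout(views)
rects
```

Each row of rects describes one leaf rectangle. Applying the same function again inside a strip, with the direction swapped, would give the nested structure described above.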
Use Nathan Yau’s dataset from the FlowingData website (http://datasets.flowingdata.com/post-data.txt). You will need the packages “treemap” and “RColorBrewer”.
The data is a CSV file that records the number of views and the number of comments for posts in the various categories of Yau’s visualization work.
# This chunk works around an error I got when running library(tidyverse):
# Error: package or namespace load failed for ‘tidyverse’ in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]): namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.0.1 is required
#remove.packages("rlang")
#install.packages("rlang")
library(rlang)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#install.packages("treemap")
#install.packages("RColorBrewer")
library(treemap)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v stringr 1.4.0
## v tidyr 1.1.4 v forcats 0.5.1
## v readr 2.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x purrr::%@%() masks rlang::%@%()
## x purrr::as_function() masks rlang::as_function()
## x dplyr::filter() masks stats::filter()
## x purrr::flatten() masks rlang::flatten()
## x purrr::flatten_chr() masks rlang::flatten_chr()
## x purrr::flatten_dbl() masks rlang::flatten_dbl()
## x purrr::flatten_int() masks rlang::flatten_int()
## x purrr::flatten_lgl() masks rlang::flatten_lgl()
## x purrr::flatten_raw() masks rlang::flatten_raw()
## x purrr::invoke() masks rlang::invoke()
## x dplyr::lag() masks stats::lag()
## x purrr::splice() masks rlang::splice()
library(RColorBrewer)
data <- read.csv("http://datasets.flowingdata.com/post-data.txt")
head(data)
## id views comments category
## 1 5019 148896 28 Artistic Visualization
## 2 1416 81374 26 Visualization
## 3 1416 81374 26 Featured
## 4 3485 80819 37 Featured
## 5 3485 80819 37 Mapping
## 6 3485 80819 37 Data Sources
A heatmap is a literal way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
This dataset appears to contain NBA player statistics from 2008.
# How to make a heatmap
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
#apparently you have to use read.csv here instead of read_csv
nba
## Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA
## 1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5
## 2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7
## 3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1
## 4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1
## 5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7
## 6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1
## 7 Kevin Martin 51 38.2 24.6 6.7 15.9 0.420 9.0 10.3 0.867 2.3 5.4
## 8 Al Jefferson 50 36.6 23.1 9.7 19.5 0.497 3.7 5.0 0.738 0.0 0.1
## 9 Chris Paul 78 38.5 22.8 8.1 16.1 0.503 5.8 6.7 0.868 0.8 2.3
## 10 Carmelo Anthony 66 34.5 22.8 8.1 18.3 0.443 5.6 7.1 0.793 1.0 2.6
## 11 Chris Bosh 77 38.1 22.7 8.0 16.4 0.487 6.5 8.0 0.817 0.2 0.6
## 12 Brandon Roy 78 37.2 22.6 8.1 16.9 0.480 5.3 6.5 0.824 1.1 2.8
## 13 Antawn Jamison 81 38.2 22.2 8.3 17.8 0.468 4.2 5.6 0.754 1.4 3.9
## 14 Tony Parker 72 34.1 22.0 8.9 17.5 0.506 3.9 5.0 0.782 0.3 0.9
## 15 Amare Stoudemire 53 36.8 21.4 7.6 14.1 0.539 6.1 7.3 0.835 0.1 0.1
## 16 Joe Johnson 79 39.5 21.4 7.8 18.0 0.437 3.8 4.6 0.826 1.9 5.2
## 17 Devin Harris 69 36.1 21.3 6.6 15.1 0.438 7.2 8.8 0.820 0.9 3.2
## 18 Michael Redd 33 36.4 21.2 7.5 16.6 0.455 4.0 4.9 0.814 2.1 5.8
## 19 David West 76 39.3 21.0 8.0 17.0 0.472 4.8 5.5 0.884 0.1 0.3
## 20 Zachary Randolph 50 35.1 20.8 8.3 17.5 0.475 3.6 4.9 0.734 0.6 1.9
## 21 Caron Butler 67 38.6 20.8 7.3 16.2 0.453 5.1 6.0 0.858 1.0 3.1
## 22 Vince Carter 80 36.8 20.8 7.4 16.8 0.437 4.2 5.1 0.817 1.9 4.9
## 23 Stephen Jackson 59 39.7 20.7 7.0 16.9 0.414 5.0 6.0 0.826 1.7 5.2
## 24 Ben Gordon 82 36.6 20.7 7.3 16.0 0.455 4.0 4.7 0.864 2.1 5.1
## 25 Dwight Howard 79 35.7 20.6 7.1 12.4 0.572 6.4 10.7 0.594 0.0 0.0
## 26 Paul Pierce 81 37.4 20.5 6.7 14.6 0.457 5.7 6.8 0.830 1.5 3.8
## 27 Al Harrington 73 34.9 20.1 7.3 16.6 0.439 3.2 4.0 0.793 2.3 6.4
## 28 Jamal Crawford 65 38.1 19.7 6.4 15.7 0.410 4.6 5.3 0.872 2.2 6.1
## 29 Yao Ming 77 33.6 19.7 7.4 13.4 0.548 4.9 5.7 0.866 0.0 0.0
## 30 Richard Jefferson 82 35.9 19.6 6.5 14.9 0.439 5.1 6.3 0.805 1.4 3.6
## 31 Jason Terry 74 33.6 19.6 7.3 15.8 0.463 2.7 3.0 0.880 2.3 6.2
## 32 Deron Williams 68 36.9 19.4 6.8 14.5 0.471 4.8 5.6 0.849 1.0 3.3
## 33 Tim Duncan 75 33.7 19.3 7.4 14.8 0.504 4.5 6.4 0.692 0.0 0.0
## 34 Monta Ellis 25 35.6 19.0 7.8 17.2 0.451 3.1 3.8 0.830 0.3 1.0
## 35 Rudy Gay 79 37.3 18.9 7.2 16.0 0.453 3.3 4.4 0.767 1.1 3.1
## 36 Pau Gasol 81 37.1 18.9 7.3 12.9 0.567 4.2 5.4 0.781 0.0 0.0
## 37 Andre Iguodala 82 39.8 18.8 6.6 14.0 0.473 4.6 6.4 0.724 1.0 3.2
## 38 Corey Maggette 51 31.1 18.6 5.7 12.4 0.461 6.7 8.1 0.824 0.5 1.9
## 39 O.J. Mayo 82 38.0 18.5 6.9 15.6 0.438 3.0 3.4 0.879 1.8 4.6
## 40 John Salmons 79 37.5 18.3 6.5 13.8 0.472 3.6 4.4 0.830 1.6 3.8
## 41 Richard Hamilton 67 34.0 18.3 7.0 15.6 0.447 3.3 3.9 0.848 1.0 2.8
## 42 Ray Allen 79 36.3 18.2 6.3 13.2 0.480 3.0 3.2 0.952 2.5 6.2
## 43 LaMarcus Aldridge 81 37.1 18.1 7.4 15.3 0.484 3.2 4.1 0.781 0.1 0.3
## 44 Josh Howard 52 31.9 18.0 6.8 15.1 0.451 3.3 4.2 0.782 1.1 3.2
## 45 Maurice Williams 81 35.0 17.8 6.5 13.9 0.467 2.6 2.8 0.912 2.3 5.2
## 46 Shaquille O'neal 75 30.1 17.8 6.8 11.2 0.609 4.1 6.9 0.595 0.0 0.0
## 47 Rashard Lewis 79 36.2 17.7 6.1 13.8 0.439 2.8 3.4 0.836 2.8 7.0
## 48 Chauncey Billups 79 35.3 17.7 5.2 12.4 0.418 5.3 5.8 0.913 2.1 5.0
## 49 Allen Iverson 57 36.7 17.5 6.1 14.6 0.417 4.8 6.1 0.781 0.5 1.7
## 50 Nate Robinson 74 29.9 17.2 6.1 13.9 0.437 3.4 4.0 0.841 1.7 5.2
## X3PP ORB DRB TRB AST STL BLK TO PF
## 1 0.317 1.1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
## 2 0.344 1.3 6.3 7.6 7.2 1.7 1.1 3.0 1.7
## 3 0.351 1.1 4.1 5.2 4.9 1.5 0.5 2.6 2.3
## 4 0.359 1.1 7.3 8.4 2.4 0.8 0.8 1.9 2.2
## 5 0.404 0.7 4.4 5.1 2.7 1.0 1.4 2.5 3.1
## 6 0.422 1.0 5.5 6.5 2.8 1.3 0.7 3.0 1.8
## 7 0.415 0.6 3.0 3.6 2.7 1.2 0.2 2.9 2.3
## 8 0.000 3.4 7.5 11.0 1.6 0.8 1.7 1.8 2.8
## 9 0.364 0.9 4.7 5.5 11.0 2.8 0.1 3.0 2.7
## 10 0.371 1.6 5.2 6.8 3.4 1.1 0.4 3.0 3.0
## 11 0.245 2.8 7.2 10.0 2.5 0.9 1.0 2.3 2.5
## 12 0.377 1.3 3.4 4.7 5.1 1.1 0.3 1.9 1.6
## 13 0.351 2.4 6.5 8.9 1.9 1.2 0.3 1.5 2.7
## 14 0.292 0.4 2.7 3.1 6.9 0.9 0.1 2.6 1.5
## 15 0.429 2.2 5.9 8.1 2.0 0.9 1.1 2.8 3.1
## 16 0.360 0.8 3.6 4.4 5.8 1.1 0.2 2.5 2.2
## 17 0.291 0.4 2.9 3.3 6.9 1.7 0.2 3.1 2.4
## 18 0.366 0.7 2.5 3.2 2.7 1.1 0.1 1.6 1.4
## 19 0.240 2.1 6.4 8.5 2.3 0.6 0.9 2.1 2.7
## 20 0.330 3.1 6.9 10.1 2.1 0.9 0.3 2.3 2.7
## 21 0.310 1.8 4.4 6.2 4.3 1.6 0.3 3.1 2.5
## 22 0.385 0.9 4.2 5.1 4.7 1.0 0.5 2.1 2.9
## 23 0.338 1.2 3.9 5.1 6.5 1.5 0.5 3.9 2.6
## 24 0.410 0.6 2.8 3.5 3.4 0.9 0.3 2.4 2.2
## 25 0.000 4.3 9.6 13.8 1.4 1.0 2.9 3.0 3.4
## 26 0.391 0.7 5.0 5.6 3.6 1.0 0.3 2.8 2.7
## 27 0.364 1.4 4.9 6.2 1.4 1.2 0.3 2.2 3.1
## 28 0.360 0.4 2.6 3.0 4.4 0.9 0.2 2.3 1.4
## 29 1.000 2.6 7.2 9.9 1.8 0.4 1.9 3.0 3.3
## 30 0.397 0.7 3.9 4.6 2.4 0.8 0.2 2.0 3.1
## 31 0.366 0.5 1.9 2.4 3.4 1.3 0.3 1.6 1.9
## 32 0.310 0.4 2.5 2.9 10.7 1.1 0.3 3.4 2.0
## 33 0.000 2.7 8.0 10.7 3.5 0.5 1.7 2.2 2.3
## 34 0.308 0.6 3.8 4.3 3.7 1.6 0.3 2.7 2.7
## 35 0.351 1.4 4.2 5.5 1.7 1.2 0.7 2.6 2.8
## 36 0.500 3.2 6.4 9.6 3.5 0.6 1.0 1.9 2.1
## 37 0.307 1.1 4.6 5.7 5.3 1.6 0.4 2.7 1.9
## 38 0.253 1.0 4.6 5.5 1.8 0.9 0.2 2.4 3.8
## 39 0.384 0.7 3.1 3.8 3.2 1.1 0.2 2.8 2.5
## 40 0.417 0.7 3.5 4.2 3.2 1.1 0.3 2.1 2.3
## 41 0.368 0.7 2.4 3.1 4.4 0.6 0.1 2.0 2.6
## 42 0.409 0.8 2.7 3.5 2.8 0.9 0.2 1.7 2.0
## 43 0.250 2.9 4.6 7.5 1.9 1.0 1.0 1.5 2.6
## 44 0.345 1.1 3.9 5.1 1.6 1.1 0.6 1.7 2.6
## 45 0.436 0.6 2.9 3.4 4.1 0.9 0.1 2.2 2.7
## 46 0.000 2.5 5.9 8.4 1.7 0.7 1.4 2.2 3.4
## 47 0.397 1.2 4.6 5.7 2.6 1.0 0.6 2.0 2.5
## 48 0.408 0.4 2.6 3.0 6.4 1.2 0.2 2.2 2.0
## 49 0.283 0.5 2.5 3.0 5.0 1.5 0.1 2.6 1.5
## 50 0.325 1.3 2.6 3.9 4.1 1.3 0.1 1.9 2.8
nba <- nba[order(nba$PTS),]
row.names(nba) <- nba$Name
nba <- nba[,2:19]
nba_matrix <- data.matrix(nba)
nba_heatmap <- heatmap(nba_matrix, Rowv = NA, Colv = NA,
                       col = cm.colors(256), scale = "column", margins = c(5, 10),
                       xlab = "NBA Player Stats",
                       ylab = "NBA Players",
                       main = "NBA Player Stats in 2008")
nba_heatmap <- heatmap(nba_matrix, Rowv = NA, Colv = NA, col = heat.colors(256),
                       scale = "column", margins = c(5, 10),
                       xlab = "NBA Player Stats",
                       ylab = "NBA Players",
                       main = "NBA Player Stats in 2008")
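The scale = "column" argument matters because the columns are on very different scales (games played, points, shooting percentages). With that setting, heatmap() centers and scales each column before mapping values to colors, so every statistic uses the full palette. A small sketch of the same standardization with base R's scale(), using three rows and three columns copied from the table above:

```r
# Three players, three statistics on very different scales
stats <- matrix(c(79, 81, 82,             # G
                  30.2, 28.4, 26.8,       # PTS
                  0.491, 0.489, 0.467),   # FGP
                nrow = 3,
                dimnames = list(c("Dwyane Wade", "LeBron James", "Kobe Bryant"),
                                c("G", "PTS", "FGP")))

scaled <- scale(stats)   # per column: subtract the mean, divide by the sd
round(scaled, 2)
```

After scaling, every column has mean 0 and standard deviation 1, so a dark cell means "high relative to the other players on this statistic" rather than "a large raw number".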
Try using direction = -1 in the viridis() call, then try removing it; you will see that it reverses the order of the colors in the heatmap.
library(viridis)
## Loading required package: viridisLite
nba_heatmap <- heatmap(nba_matrix, Rowv = NA, Colv = NA, col = viridis(25, direction = -1),
                       scale = "column", margins = c(5, 10),
                       xlab = "NBA Player Stats",
                       ylab = "NBA Players",
                       main = "NBA Player Stats in 2008")
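Reversing a palette can also be checked without drawing a plot. The sketch below is a stand-in rather than the viridis package itself: it uses base R's hcl.colors() with its "viridis" palette, and the rev argument plays the role that direction = -1 plays in viridis(). That substitution is an assumption made so the example runs without extra packages.

```r
forward  <- hcl.colors(5, palette = "viridis")              # dark purple to yellow
reversed <- hcl.colors(5, palette = "viridis", rev = TRUE)  # yellow to dark purple

# The reversed palette is the same set of colors in the opposite order
identical(reversed, rev(forward))
```

Because heatmap() assigns the first color to the lowest values, reversing the vector swaps which end of the scale looks dark.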
treemap(data, index = "category", vSize = "views",
        vColor = "comments", type = "value",    # note: type = "value"
        palette = "RdYlBu")
treemap(data, index = "category", vSize = "views",
        vColor = "comments", type = "manual",   # note: type = "manual"
        palette = "RdYlBu")
library(nycflights13)
library(RColorBrewer)
flights <- flights
#view(flights)
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
This was modified from Raul Miranda’s work.
Remove observations with NA values in the distance and arr_delay variables. Notice that the number of rows changes from 336,776 to 327,346.
flights_nona <- flights %>%
  filter(!is.na(distance) & !is.na(arr_delay))
Create a dataframe composed of summary statistics.
delays <- flights_nona %>%                   # create a delays dataframe by:
  group_by(dest) %>%                         # grouping by destination
  summarize(count = n(),                     # number of flights to each destination
            dist = mean(distance),           # mean distance flown to each destination
            delay = mean(arr_delay),         # mean arrival delay at each destination
            delaycost = mean(count * delay / dist))  # delay cost index, defined as
                                             # (number of flights) * delay / distance
delays <- arrange(delays, desc(delaycost)) # sort the rows by delay cost
head(delays) # look at the data
## # A tibble: 6 x 5
## dest count dist delay delaycost
## <chr> <int> <dbl> <dbl> <dbl>
## 1 DCA 9111 211. 9.07 391.
## 2 IAD 5383 225. 13.9 332.
## 3 ATL 16837 757. 11.3 251.
## 4 BOS 15022 191. 2.91 230.
## 5 CLT 13674 538. 7.36 187.
## 6 RDU 7770 427. 10.1 183.
top100 <- delays %>% # select the 100 largest delay costs
head(100) %>%
arrange(delaycost) # sort ascending so the heatmap displays descending costs
row.names(top100) <- top100$dest # rename the rows according to destination airport codes
## Warning: Setting row names on a tibble is deprecated.
delays_mat <- data.matrix(top100) # convert delays dataframe to a matrix (required by heatmap)
delays_mat2 <- delays_mat[,2:5] # remove the redundant column of destination airport codes
varcols = setNames(colorRampPalette(brewer.pal(nrow(delays_mat2), "YlGnBu"))(nrow(delays_mat2)),
rownames(delays_mat2)) # parameter for RowSideColors
## Warning in brewer.pal(nrow(delays_mat2), "YlGnBu"): n too large, allowed maximum for palette YlGnBu is 9
## Returning the palette you asked for with that many colors
heatmap(delays_mat2,
Rowv = NA, Colv = NA,
col= colorRampPalette(brewer.pal(nrow(delays_mat2), "YlGnBu"))(nrow(delays_mat2)),
s=0.6, v=1, scale="column",
margins=c(7,10),
main = "Cost of Late Arrivals",
xlab = "Flight Characteristics",
ylab="Arrival Airport", labCol = c("Flights","Distance","Delay","Cost Index"),
cexCol=1, cexRow =1, RowSideColors = varcols)
## layout: widths = 0.05 0.2 4 , heights = 0.25 4 ; lmat=
## [,1] [,2] [,3]
## [1,] 0 0 4
## [2,] 3 1 2
## Warning in brewer.pal(nrow(delays_mat2), "YlGnBu"): n too large, allowed maximum for palette YlGnBu is 9
## Returning the palette you asked for with that many colors
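The "Setting row names on a tibble is deprecated" warning appears because top100 is a tibble. One way to avoid it, shown here only as a sketch, is to work with a plain data frame (for example via as.data.frame()) before assigning row names. The toy table below is hypothetical in shape but reuses the rounded values printed by head(delays).

```r
# A plain data.frame accepts row names without a warning
toy <- data.frame(dest  = c("DCA", "IAD", "ATL"),
                  count = c(9111, 5383, 16837),
                  dist  = c(211, 225, 757),
                  delay = c(9.07, 13.9, 11.3))

rownames(toy) <- toy$dest            # label rows by airport code
toy_mat <- data.matrix(toy[, -1])    # drop the code column, keep numeric columns
toy_mat
```

The resulting matrix keeps the airport codes as row labels, which is what heatmap() uses for the row axis.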
“Cost index” is defined as a measure of how arrival delays impact the cost of flying into each airport and is calculated as number of flights * mean delay / mean flight distance. For airlines it is a measure of how much the cost to fly to an airport increases due to frequent delays of arrival. Cost index is inversely proportional to distance because delays affect short flights more than long flights and because the profit per seat increases with distance due to the larger and more efficient planes used for longer distances.
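As a quick arithmetic check of that definition, the sketch below recomputes the index for the first three airports using the rounded values printed by head(delays). Because the inputs are rounded, the results only approximately match the delaycost column.

```r
# Rounded inputs copied from the head(delays) output
flights_n  <- c(DCA = 9111, IAD = 5383, ATL = 16837)   # number of flights
mean_dist  <- c(DCA = 211,  IAD = 225,  ATL = 757)     # mean distance (miles)
mean_delay <- c(DCA = 9.07, IAD = 13.9, ATL = 11.3)    # mean arrival delay (minutes)

# cost index = number of flights * mean delay / mean distance
cost_index <- flights_n * mean_delay / mean_dist
round(cost_index, 1)
```

DCA comes out highest of the three even though ATL has many more flights, because its mean flight distance is much shorter.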
The variance in delays across airports is mainly due to (a) airline traffic congestion relative to the airport size; and (b) regional climate and weather events. It is not strongly dependent upon airline carrier or tail number.
Therefore, airports such as ORD and BOS have high cost index because they are highly congested and are frequently delayed due to weather. Airports like IAD, PHL, DTW, etc., are very congested despite their large size and also show high cost index. Smaller airports such as HDN, SNA, HNL, LEX, etc., have null to slightly negative cost index because they are not congested and keep flights on time.
This type of visualisation is a variation of a stacked area graph, but instead of plotting values against a fixed, straight axis, a streamgraph has values displaced around a varying central baseline. Streamgraphs display the changes in data over time of different categories through the use of flowing, organic shapes that somewhat resemble a river-like stream. This makes streamgraphs aesthetically pleasing and more engaging to look at.
The size of each individual stream shape is proportional to the values in each category. The axis that a streamgraph flows parallel to is used for the timescale. Color can be used to either distinguish each category or to visualize each category’s additional quantitative values through varying the color shade.
Streamgraphs are ideal for displaying high-volume datasets, in order to discover trends and patterns over time across a wide range of categories. For example, seasonal peaks and troughs in the stream shape can suggest a periodic pattern. A streamgraph could also be used to visualize the volatility for a large group of assets over a certain period of time.
The downside of streamgraphs is that they suffer from legibility issues, as they are often very cluttered. Categories with smaller values are often drowned out by categories with much larger values, making it impossible to see all the data. It is also impossible to read the exact values visualized, as there is no axis to use as a reference.
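The "varying central baseline" can be illustrated with a little arithmetic. This is a minimal sketch in base R with hypothetical counts, using the simplest symmetric rule: at each time point the stack starts at minus half of the total, so the whole shape is centered on zero. Streamgraph implementations may use other baseline rules, so treat this as one illustrative choice.

```r
# Rows = time points, columns = categories (hypothetical counts)
counts <- matrix(c(2, 4, 6, 4,    # category A
                   1, 3, 2, 1,    # category B
                   3, 3, 5, 2),   # category C
                 nrow = 4,
                 dimnames = list(paste0("t", 1:4), c("A", "B", "C")))

totals   <- rowSums(counts)
baseline <- -totals / 2                           # symmetric baseline per time point

upper <- baseline + t(apply(counts, 1, cumsum))   # top edge of each layer
lower <- upper - counts                           # bottom edge of each layer

round(cbind(bottom = lower[, "A"], top = upper[, "C"]), 2)
```

The thickness of each layer (upper minus lower) still equals the original value, which is why stream width is proportional to the data even though there is no fixed axis to read it from.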
The code for making streamgraphs has changed with recent updates to R. On Windows, you first have to download and install Rtools40 from https://cran.rstudio.com/bin/windows/Rtools/ and then use the code provided below.