library(treemap)
library(tidyverse)
library(RColorBrewer)
Heatmaps, Treemaps, and Alluvials Part 1
So many ways to visualize data
Load the packages and the data from flowingdata.com website
Heatmaps
A heatmap is a literal way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
Load the nba data from Yau’s website
This dataset appears to contain NBA player statistics from 2008.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
# apparently you have to use read.csv here instead of read_csv
head(nba)
Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA X3PP ORB
1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5 0.317 1.1
2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7 0.344 1.3
3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1 0.351 1.1
4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1 0.359 1.1
5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7 0.404 0.7
6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1 0.422 1.0
DRB TRB AST STL BLK TO PF
1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 5.5 6.5 2.8 1.3 0.7 3.0 1.8
Create a cool-color heatmap
The base R heatmap() function requires the data to be a numeric matrix, which data.matrix() produces.
nba <- nba[order(nba$PTS),]
row.names(nba) <- nba$Name
nba2 <- nba[,2:19]
nba_matrix <- data.matrix(nba2)
#?heatmap # this gives documentation about the code used to create a heatmap using this function
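To see what the conversion does, here is a minimal sketch on a three-row stand-in for the nba data frame (values copied from the head(nba) output above; the name toy is only for illustration). The player names become row names, and the remaining numeric columns become a numeric matrix.

```r
# Three-row stand-in for nba, using values from the head(nba) output
toy <- data.frame(
  Name = c("Dwyane Wade", "LeBron James", "Kobe Bryant"),
  PTS  = c(30.2, 28.4, 26.8),
  AST  = c(7.5, 7.2, 4.9)
)

row.names(toy) <- toy$Name              # players become row labels
toy_matrix <- data.matrix(toy[, 2:3])   # drop Name, keep the numeric columns

is.matrix(toy_matrix)
dimnames(toy_matrix)
```

The row names carried into the matrix are what heatmap() uses to label the rows.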
The basic layout of the heatmap relies on the parameters rows, columns and values. You can think of them like aesthetics in ggplot2::ggplot(), similar to something like aes(x = columns, y = rows, fill = values).
Use a warm color palette
nba_heatmap <- heatmap(nba_matrix,
                       Rowv=NA,
                       Colv=NA,
                       col = heat.colors(20),
                       scale="column",
                       margins=c(5,10),
                       xlab = "NBA Player Stats",
                       ylab = "NBA Players",
                       main = "NBA Player Stats in 2008")
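The scale = "column" argument matters because the statistics are on very different scales (points per game versus shooting percentages). With that setting, heatmap() centers each column and divides it by its standard deviation before assigning colors. A minimal sketch of the same standardization, using base R scale() on a few values from head(nba) (the names stats and z are only for illustration):

```r
# Two columns on very different scales, taken from head(nba)
stats <- cbind(PTS = c(30.2, 28.4, 26.8),
               FGP = c(0.491, 0.489, 0.467))
rownames(stats) <- c("Dwyane Wade", "LeBron James", "Kobe Bryant")

z <- scale(stats)   # subtract the column mean, divide by the column sd

colMeans(z)         # each column is now centered near 0
apply(z, 2, sd)     # and has standard deviation 1
```

After standardizing, colors compare players within a statistic rather than comparing one statistic with another.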
Use the viridis color palette
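The call below omits Colv = NA, so heatmap() clusters the columns and draws a dendrogram (the branches) above them. This is the default behavior of heatmap(), not an effect of the viridis palette from the viridisLite package. Setting keep.dendro = FALSE only controls whether the dendrogram is stored in the returned object, not whether it is drawn.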
library(viridis)
Loading required package: viridisLite
# Loading required package: viridis
nba_heatmap <- heatmap(nba_matrix,
                       Rowv=NA,
                       col = viridis(5),
                       scale="column",
                       margins=c(5,10),
                       xlab = "NBA Player Stats",
                       ylab = "NBA Players",
                       keep.dendro = FALSE,
                       main = "NBA Player Stats in 2008")
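To drop the column branches, one option is to pass Colv = NA, as in the warm-palette heatmap. A minimal sketch, assuming a small toy matrix built from the head(nba) output and the base R hcl.colors() viridis palette as a stand-in for viridis(5):

```r
# Toy matrix from the first four rows of head(nba)
toy <- cbind(PTS = c(30.2, 28.4, 26.8, 25.9),
             AST = c(7.5, 7.2, 4.9, 2.4),
             STL = c(2.2, 1.7, 1.5, 0.8))
rownames(toy) <- c("Dwyane Wade", "LeBron James", "Kobe Bryant", "Dirk Nowitzki")

# Rowv = NA and Colv = NA turn off clustering, so no dendrograms are drawn
toy_heatmap <- heatmap(toy,
                       Rowv = NA,
                       Colv = NA,
                       col = hcl.colors(5, "viridis"),  # base R stand-in for viridis(5)
                       scale = "column",
                       margins = c(5, 10))

toy_heatmap$colInd   # the column order is left as given
```

With clustering off, the columns stay in their original order, which makes the plot easier to compare with the table.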
Treemaps
Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The Downside to Treemaps
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
Use Nathan Yau’s dataset from the flowingdata website: http://datasets.flowingdata.com/post-data.txt You will need the package “treemap” and the package “RColorBrewer”.
Create a treemap which explores categories of views
Load the data for the treemap from Nathan Yau’s FlowingData site. It records the number of views and comments for posts in different categories on his website.
flowingdata <- read.csv("http://datasets.flowingdata.com/post-data.txt")
# again, here use read.csv instead of read_csv
head(flowingdata)
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
Use RColorBrewer to change the palette to RdYlBu
treemap(flowingdata, index="category", vSize="views",
vColor="comments", type="manual",
# note: type = "manual" changes to red yellow blue
palette="RdYlBu")
Notice the following:
The index is a categorical variable - in this case, “category” of post
The size of the box is by number of views of the post
The heatmap color is by number of comments for the post
Notice how the treemap includes a legend for number of comments
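To see what the box sizes represent, the totals can be worked out by hand. A minimal sketch, assuming treemap() uses its default sum aggregation within each index level, and using only the six rows shown in head(flowingdata) (the names posts and totals are only for illustration):

```r
# The six rows from head(flowingdata)
posts <- data.frame(
  id       = c(5019, 1416, 1416, 3485, 3485, 3485),
  views    = c(148896, 81374, 81374, 80819, 80819, 80819),
  comments = c(28, 26, 26, 37, 37, 37),
  category = c("Artistic Visualization", "Visualization", "Featured",
               "Featured", "Mapping", "Data Sources")
)

# Total views and comments per category
totals <- aggregate(cbind(views, comments) ~ category, data = posts, FUN = sum)
totals[order(-totals$views), ]
```

Note that a post filed under several categories (for example id 1416) contributes its views to each of them.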
A heatmap of World Happiness
Set your working directory and read in happiness2019.csv from the class Google Drive.
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
happy19 <- read_csv("happiness2019.csv")
head(happy19)
# A tibble: 6 × 9
`Overall rank` `Country or region` Score `GDP per capita` `Social support`
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Finland 7.77 1.34 1.59
2 2 Denmark 7.6 1.38 1.57
3 3 Norway 7.55 1.49 1.58
4 4 Iceland 7.49 1.38 1.62
5 5 Netherlands 7.49 1.40 1.52
6 6 Switzerland 7.48 1.45 1.53
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
# `Freedom to make life choices` <dbl>, Generosity <dbl>,
# `Perceptions of corruption` <dbl>
We can see that there are 156 countries ranked by their “happiness score” based on other measurements.
Clean the happiness dataset to work with it
First remove the first column, `Overall rank`. Then clean the remaining headers: make them lowercase and replace spaces with underscores.
happy <- happy19 |>
  select(-`Overall rank`)
names(happy) <- tolower(names(happy))
names(happy) <- gsub(" ", "_", names(happy))
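The two cleaning steps can be checked on the header strings alone. A minimal sketch using four of the column names from the tibble above (the names headers and clean are only for illustration):

```r
# Four of the original headers from happy19
headers <- c("Country or region", "Score", "GDP per capita", "Social support")

clean <- tolower(headers)        # lowercase everything
clean <- gsub(" ", "_", clean)   # replace each space with an underscore
clean
```

Names without spaces can be typed in select(), arrange(), and aes() without backticks.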
Because there are 156 countries, narrow the data to the 20 highest-scoring countries.
happytop <- happy |>
  arrange(desc(score)) |>
  mutate(happy = "top") |>  # add a column for use later
  head(20)
happytop
# A tibble: 20 × 9
country_or_region score gdp_per_capita social_support healthy_life_expectancy
<chr> <dbl> <dbl> <dbl> <dbl>
1 Finland 7.77 1.34 1.59 0.986
2 Denmark 7.6 1.38 1.57 0.996
3 Norway 7.55 1.49 1.58 1.03
4 Iceland 7.49 1.38 1.62 1.03
5 Netherlands 7.49 1.40 1.52 0.999
6 Switzerland 7.48 1.45 1.53 1.05
7 Sweden 7.34 1.39 1.49 1.01
8 New Zealand 7.31 1.30 1.56 1.03
9 Canada 7.28 1.36 1.50 1.04
10 Austria 7.25 1.38 1.48 1.02
11 Australia 7.23 1.37 1.55 1.04
12 Costa Rica 7.17 1.03 1.44 0.963
13 Israel 7.14 1.28 1.46 1.03
14 Luxembourg 7.09 1.61 1.48 1.01
15 United Kingdom 7.05 1.33 1.54 0.996
16 Ireland 7.02 1.50 1.55 0.999
17 Germany 6.98 1.37 1.45 0.987
18 Belgium 6.92 1.36 1.50 0.986
19 United States 6.89 1.43 1.46 0.874
20 Czech Republic 6.85 1.27 1.49 0.92
# ℹ 4 more variables: freedom_to_make_life_choices <dbl>, generosity <dbl>,
# perceptions_of_corruption <dbl>, happy <chr>
Then do the same for the lowest 20.
happybottom <- happy |>
  arrange(score) |>
  mutate(happy = "bottom") |>
  head(20)
happybottom
# A tibble: 20 × 9
country_or_region score gdp_per_capita social_support healthy_life_expecta…¹
<chr> <dbl> <dbl> <dbl> <dbl>
1 South Sudan 2.85 0.306 0.575 0.295
2 Central African R… 3.08 0.026 0 0.105
3 Afghanistan 3.20 0.35 0.517 0.361
4 Tanzania 3.23 0.476 0.885 0.499
5 Rwanda 3.33 0.359 0.711 0.614
6 Yemen 3.38 0.287 1.16 0.463
7 Malawi 3.41 0.191 0.56 0.495
8 Syria 3.46 0.619 0.378 0.44
9 Botswana 3.49 1.04 1.14 0.538
10 Haiti 3.60 0.323 0.688 0.449
11 Zimbabwe 3.66 0.366 1.11 0.433
12 Burundi 3.78 0.046 0.447 0.38
13 Lesotho 3.80 0.489 1.17 0.168
14 Madagascar 3.93 0.274 0.916 0.555
15 Comoros 3.97 0.274 0.757 0.505
16 Liberia 3.98 0.073 0.922 0.443
17 India 4.01 0.755 0.765 0.588
18 Togo 4.08 0.275 0.572 0.41
19 Zambia 4.11 0.578 1.06 0.426
20 Egypt 4.17 0.913 1.04 0.644
# ℹ abbreviated name: ¹healthy_life_expectancy
# ℹ 4 more variables: freedom_to_make_life_choices <dbl>, generosity <dbl>,
# perceptions_of_corruption <dbl>, happy <chr>
Convert from wide to long format.
happy_longtop <- happytop |>
  pivot_longer(cols = 2:8,
               names_to = "measurements",
               values_to = "values")
happy_longtop
# A tibble: 140 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 Finland top score 7.77
2 Finland top gdp_per_capita 1.34
3 Finland top social_support 1.59
4 Finland top healthy_life_expectancy 0.986
5 Finland top freedom_to_make_life_choices 0.596
6 Finland top generosity 0.153
7 Finland top perceptions_of_corruption 0.393
8 Denmark top score 7.6
9 Denmark top gdp_per_capita 1.38
10 Denmark top social_support 1.57
# ℹ 130 more rows
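The reshaping is easier to follow on a tiny table. A minimal sketch, assuming the tidyr package is installed, with two countries and three measurements copied from the output above (the names toy_wide and toy_long are only for illustration). It selects the columns by name rather than by position, which is one way to avoid miscounting columns:

```r
library(tidyr)

# Two rows and three measurement columns from happytop
toy_wide <- data.frame(
  country_or_region = c("Finland", "Denmark"),
  score             = c(7.77, 7.6),
  gdp_per_capita    = c(1.34, 1.38),
  social_support    = c(1.59, 1.57),
  happy             = "top"
)

# Each measurement column becomes a (measurements, values) pair of rows
toy_long <- toy_wide |>
  pivot_longer(cols = score:social_support,
               names_to = "measurements",
               values_to = "values")
toy_long
```

Two countries with three measurements each give six rows, just as 20 countries with seven measurements give the 140 rows above.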
Convert from wide to long for the bottom 20
happy_longbottom <- happybottom |>
  pivot_longer(cols = 2:8,
               names_to = "measurements",
               values_to = "values")
happy_longbottom
# A tibble: 140 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 South Sudan bottom score 2.85
2 South Sudan bottom gdp_per_capita 0.306
3 South Sudan bottom social_support 0.575
4 South Sudan bottom healthy_life_expectancy 0.295
5 South Sudan bottom freedom_to_make_life_choices 0.01
6 South Sudan bottom generosity 0.202
7 South Sudan bottom perceptions_of_corruption 0.091
8 Central African Republic bottom score 3.08
9 Central African Republic bottom gdp_per_capita 0.026
10 Central African Republic bottom social_support 0
# ℹ 130 more rows
Plot the top 20 using geom_tile()
ggplot(data = happy_longtop, aes(x=country_or_region, y=measurements, fill = values)) +
geom_tile()+
scale_fill_distiller(palette="Spectral") +
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
Plot the bottom 20 using geom_tile()
ggplot(data = happy_longbottom, aes(x=country_or_region, y=measurements, fill = values)) +
geom_tile()+
scale_fill_distiller(palette="Spectral") +
theme_bw()+
theme(axis.text.x = element_text(angle = 90))
Put the top 20 and bottom 20 together to compare the two plots
newdf <- rbind(happytop, happybottom)
newdf_long <- newdf |>
  pivot_longer(cols = 2:8,
               names_to = "measurements",
               values_to = "values")
head(newdf_long)
# A tibble: 6 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 Finland top score 7.77
2 Finland top gdp_per_capita 1.34
3 Finland top social_support 1.59
4 Finland top healthy_life_expectancy 0.986
5 Finland top freedom_to_make_life_choices 0.596
6 Finland top generosity 0.153
Create a faceted plot with geom_tile()
ggplot(data = newdf_long, aes(x=country_or_region, y=measurements, fill = values)) +
geom_tile()+
scale_fill_distiller(palette="Spectral") +
facet_grid(~happy) +
theme_bw()+
theme(axis.text.x = element_blank()) # remove the countries to generally compare top and bottom ranked countries
What do you notice when you facet the two together?
Alluvials
Load the alluvial package
Refugees is a prebuilt dataset in the alluvial package
If you want to save the prebuilt dataset to your folder, use the write_csv function
library(alluvial)
library(ggalluvial)
data(Refugees)
Show UNHCR-recognised refugees
Top 10 countries of origin for refugees, 2003-2013
Alluvial plots need three variables: a time variable, a value, and a category.
ggalluv <- Refugees |>
  ggplot(aes(x = year, y = refugees, alluvium = country)) +
  theme_bw() +
  geom_alluvium(aes(fill = country),
                color = "white",
                width = .1,
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") +
  # Spectral has enough colors for all countries listed
  scale_x_continuous(limits = c(2002, 2013)) +
  labs(title = "UNHCR-Recognised Refugees Top 10 Countries\n (2003-2013)",
       # \n breaks the long title
       y = "Number of Refugees",
       fill = "Country",
       caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")
Plot the Alluvial
ggalluv
A final touch to fix the y-axis scale
Notice the y-values are in scientific notation. We can convert them to standard notation with options(scipen = 999), which sets a large penalty against scientific notation whenever R formats numbers.
options(scipen = 999)
ggalluv
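The effect of scipen can be seen without a plot. A minimal sketch using format() on an arbitrary large number chosen only for illustration (the names old, sci, and fixed are also illustrative):

```r
old <- options(scipen = 0)     # R's default penalty; remember the previous setting
sci <- format(2500000000)      # formatted with the default penalty

options(scipen = 999)          # strongly prefer fixed notation
fixed <- format(2500000000)

options(old)                   # restore the previous setting

sci
fixed
```

Because options() changes the setting for the whole session, saving and restoring the old value as above keeps the change from leaking into later output.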
Use the dataset nycflights13 to create a heatmap that explores Late Arrivals
Source: FAA Aircraft registry,
https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/
#install.packages("nycflights13")
library(nycflights13)
library(RColorBrewer)
data(flights)
Create an initial scatterplot with loess smoother for distance to delays
Use “group_by” together with summarise functions
Remove observations with NA values in the distance and arr_delay variables. Notice the number of rows changes from 336,776 to 327,346.
Never use the function “na.omit”!!!!
flights_nona <- flights |>
  filter(!is.na(distance) & !is.na(arr_delay))
# remove na's for distance and arr_delay
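One likely reason for the warning about na.omit is that it drops a row when any column is missing, not just the columns being analyzed. A minimal sketch in base R on a made-up four-row table (the values and the names toy, targeted, and blanket are invented for illustration):

```r
# Made-up rows: row 4 is complete for distance and arr_delay
# but is missing dep_time, a column this analysis does not use
toy <- data.frame(
  distance  = c(1400, NA, 733, 1089),
  arr_delay = c(11, 20, NA, -18),
  dep_time  = c(517, 533, 542, NA)
)

# Targeted removal: only check the two variables of interest
targeted <- toy[!is.na(toy$distance) & !is.na(toy$arr_delay), ]

# Blanket removal: drops any row with an NA anywhere
blanket <- na.omit(toy)

nrow(targeted)
nrow(blanket)
```

The targeted filter keeps row 4, while na.omit discards it along with its usable distance and delay values.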
Use group_by and summarise to create a summary table
The table includes counts for each tail number, mean distance traveled, and mean arrival delay.
by_tailnum <- flights_nona |>
  group_by(tailnum) |>  # group all tailnumbers together
  summarise(count = n(),   # counts totals for each tailnumber
            dist = mean(distance), # calculates the mean distance traveled
            delay = mean(arr_delay)
            # calculates the mean arrival delay
            )
head(by_tailnum)
# A tibble: 6 × 4
tailnum count dist delay
<chr> <int> <dbl> <dbl>
1 D942DN 4 854. 31.5
2 N0EGMQ 352 679. 9.98
3 N10156 145 756. 12.7
4 N102UW 48 536. 2.94
5 N103US 46 535. -6.93
6 N104UW 46 535. 1.80
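The summary can be verified by hand on a few rows. A minimal sketch, assuming the dplyr package is installed, on made-up flights for two hypothetical tail numbers (A1 and B2 are not real registrations; toy and toy_summary are illustrative names):

```r
library(dplyr)

# Made-up flights: two for plane A1, one for plane B2
toy <- data.frame(
  tailnum   = c("A1", "A1", "B2"),
  distance  = c(500, 700, 1000),
  arr_delay = c(10, -4, 30)
)

toy_summary <- toy |>
  group_by(tailnum) |>                    # one group per plane
  summarise(count = n(),                  # number of flights in the group
            dist  = mean(distance),       # mean distance for the plane
            delay = mean(arr_delay))      # mean arrival delay for the plane
toy_summary
```

Plane A1 gets a count of 2, a mean distance of 600, and a mean delay of 3, so each row of the result describes one plane rather than one flight.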
delay <- filter(by_tailnum, count > 20, dist < 2000)
# only include counts > 20 and distance < 2000 mi