A heatmap is a literal way of visualizing a table of numbers, where you substitute the numbers with colored cells. There are two fundamentally different categories of heat maps: the cluster heat map and the spatial heat map. In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size whose rows and columns are discrete categories, and the sorting of rows and columns is intentional. The size of the cell is arbitrary but large enough to be clearly visible. By contrast, the position of a magnitude in a spatial heat map is forced by the location of the magnitude in that space, and there is no notion of cells; the phenomenon is considered to vary continuously. (Wikipedia)
This dataset contains 2008 NBA player statistics.
nba <- read.csv("http://datasets.flowingdata.com/ppg2008.csv")
#apparently you have to use read.csv here instead of read_csv
head(nba)
Name G MIN PTS FGM FGA FGP FTM FTA FTP X3PM X3PA X3PP ORB
1 Dwyane Wade 79 38.6 30.2 10.8 22.0 0.491 7.5 9.8 0.765 1.1 3.5 0.317 1.1
2 LeBron James 81 37.7 28.4 9.7 19.9 0.489 7.3 9.4 0.780 1.6 4.7 0.344 1.3
3 Kobe Bryant 82 36.2 26.8 9.8 20.9 0.467 5.9 6.9 0.856 1.4 4.1 0.351 1.1
4 Dirk Nowitzki 81 37.7 25.9 9.6 20.0 0.479 6.0 6.7 0.890 0.8 2.1 0.359 1.1
5 Danny Granger 67 36.2 25.8 8.5 19.1 0.447 6.0 6.9 0.878 2.7 6.7 0.404 0.7
6 Kevin Durant 74 39.0 25.3 8.9 18.8 0.476 6.1 7.1 0.863 1.3 3.1 0.422 1.0
DRB TRB AST STL BLK TO PF
1 3.9 5.0 7.5 2.2 1.3 3.4 2.3
2 6.3 7.6 7.2 1.7 1.1 3.0 1.7
3 4.1 5.2 4.9 1.5 0.5 2.6 2.3
4 7.3 8.4 2.4 0.8 0.8 1.9 2.2
5 4.4 5.1 2.7 1.0 1.4 2.5 3.1
6 5.5 6.5 2.8 1.3 0.7 3.0 1.8
The older heatmap() function requires the data to be formatted as a matrix, which you can do with the data.matrix() function.
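As a minimal sketch of that conversion, the block below rebuilds a small stand-in for the NBA data (three of the rows and three of the columns printed above) so it runs without a download. The names nba_small and nba_matrix, and the choice of heat.colors(), are assumptions made for illustration.

```r
# Stand-in for the NBA data: three rows and three columns from the output above
nba_small <- data.frame(
  Name = c("Dwyane Wade", "LeBron James", "Kobe Bryant"),
  G    = c(79, 81, 82),
  MIN  = c(38.6, 37.7, 36.2),
  PTS  = c(30.2, 28.4, 26.8)
)

# Use player names as row labels, then drop the Name column
row.names(nba_small) <- nba_small$Name
nba_small <- nba_small[, -1]

# heatmap() needs a numeric matrix rather than a data frame
nba_matrix <- data.matrix(nba_small)

# Rowv = NA and Colv = NA turn off the dendrograms;
# scale = "column" standardizes each statistic separately
heatmap(nba_matrix, Rowv = NA, Colv = NA, scale = "column",
        col = heat.colors(256), margins = c(5, 10))
```

Leaving out Rowv = NA and Colv = NA lets heatmap() reorder the rows and columns and draw the dendrograms.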
The basic layout of the heatmap relies on the parameters rows, columns and values. You can think of them like aesthetics in ggplot2::ggplot(), similar to something like aes(x = columns, y = rows, fill = values).
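To make the ggplot2 analogy concrete, here is a rough sketch that draws the same rows, columns and values mapping with ggplot2's geom_tile(). This is ggplot2 itself, not the heatmap function described above, and the data frame cells with its column names is a hypothetical stand-in.

```r
library(ggplot2)

# Hypothetical long-format data: one row per (row, column) cell
cells <- data.frame(
  rows    = rep(c("Dwyane Wade", "LeBron James", "Kobe Bryant"), times = 2),
  columns = rep(c("PTS", "MIN"), each = 3),
  values  = c(30.2, 28.4, 26.8, 38.6, 37.7, 36.2)
)

# x = columns, y = rows, fill = values, exactly as in the analogy
tile_plot <- ggplot(cells, aes(x = columns, y = rows, fill = values)) +
  geom_tile(color = "white") +
  theme_bw()
tile_plot
```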
For some reason, the viridis colors from the viridisLite package default to showing dendrogram clustering (the branches).
Treemaps display hierarchical (tree-structured) data as a set of nested rectangles. Each branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches. A leaf node’s rectangle has an area proportional to a specified dimension of the data. Often the leaf nodes are colored to show a separate dimension of the data.
When the color and size dimensions are correlated in some way with the tree structure, one can often easily see patterns that would be difficult to spot in other ways, such as whether a certain color is particularly relevant. A second advantage of treemaps is that, by construction, they make efficient use of space. As a result, they can legibly display thousands of items on the screen simultaneously.
The downside of treemaps is that as the aspect ratio is optimized, the order of placement becomes less predictable. As the order becomes more stable, the aspect ratio is degraded. (Wikipedia)
Use Nathan Yau’s dataset from the FlowingData website: http://datasets.flowingdata.com/post-data.txt. You will need the package “treemap” and the package “RColorBrewer”.
Load the data for creating a treemap. The dataset, from Nathan Yau’s FlowingData, records the number of views and comments for different categories of posts on his website.
flowingdata <- read.csv("http://datasets.flowingdata.com/post-data.txt")
# again, here use read.csv instead of read_csv
head(flowingdata)
id views comments category
1 5019 148896 28 Artistic Visualization
2 1416 81374 26 Visualization
3 1416 81374 26 Featured
4 3485 80819 37 Featured
5 3485 80819 37 Mapping
6 3485 80819 37 Data Sources
The index is a categorical variable, in this case the “category” of the post.
The size of each box is determined by the number of views of the post.
The heatmap color is determined by the number of comments on the post.
Notice how the treemap includes a legend for the number of comments.
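The mapping described above could be sketched with the treemap() function as follows. The data frame posts is a stand-in built from the six rows printed earlier, and the Greens palette and the title are illustrative choices rather than the original settings.

```r
library(treemap)
library(RColorBrewer)

# Stand-in for the post data: the six rows shown in the head() output
posts <- data.frame(
  id       = c(5019, 1416, 1416, 3485, 3485, 3485),
  views    = c(148896, 81374, 81374, 80819, 80819, 80819),
  comments = c(28, 26, 26, 37, 37, 37),
  category = c("Artistic Visualization", "Visualization", "Featured",
               "Featured", "Mapping", "Data Sources")
)

treemap(posts,
        index   = "category",              # one box per category
        vSize   = "views",                 # box area follows views
        vColor  = "comments",              # box color follows comments
        type    = "value",                 # map vColor onto the palette
        palette = brewer.pal(9, "Greens"), # illustrative palette choice
        title   = "Post views and comments by category")
```

With type = "value", treemap() draws a color legend for the vColor variable, which is the legend for comments mentioned above.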
Set your working directory and read in happiness2019.csv from the class Google Drive.
setwd("C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets")
happy19 <- read_csv("happiness2019.csv")
head(happy19)
# A tibble: 6 × 9
`Overall rank` `Country or region` Score `GDP per capita` `Social support`
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Finland 7.77 1.34 1.59
2 2 Denmark 7.6 1.38 1.57
3 3 Norway 7.55 1.49 1.58
4 4 Iceland 7.49 1.38 1.62
5 5 Netherlands 7.49 1.40 1.52
6 6 Switzerland 7.48 1.45 1.53
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
# `Freedom to make life choices` <dbl>, Generosity <dbl>,
# `Perceptions of corruption` <dbl>
We can see that there are 156 countries ranked by their “happiness score” based on other measurements.
First, remove the first column, “Overall rank”. Then clean the remaining headers so they are lowercase with underscores instead of spaces.
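One way to do this with dplyr alone is sketched below. The original cleaning code is not shown, so this is an assumption about how the snake_case names were produced. The tibble happy19_small is a two-row stand-in for happy19.

```r
library(dplyr)

# Stand-in with two rows and four of the original nine columns
happy19_small <- tibble(
  `Overall rank`      = c(1, 2),
  `Country or region` = c("Finland", "Denmark"),
  Score               = c(7.77, 7.6),
  `GDP per capita`    = c(1.34, 1.38)
)

happy_small <- happy19_small |>
  select(-`Overall rank`) |>                    # drop the rank column
  rename_with(~ tolower(gsub(" ", "_", .x)))    # spaces to "_", lowercase

names(happy_small)
```

The same pipeline applied to the full happy19 tibble would produce the `happy` object used in the next step.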
Because there are 156 countries, narrow the data to the 20 highest-scoring countries. Then do the same for the 20 lowest-scoring countries.
happytop <- happy |>
arrange(desc(score)) |>
mutate(happy = "top") |> # add a column for use later
head(20)
happytop
# A tibble: 20 × 9
country_or_region score gdp_per_capita social_support healthy_life_expectancy
<chr> <dbl> <dbl> <dbl> <dbl>
1 Finland 7.77 1.34 1.59 0.986
2 Denmark 7.6 1.38 1.57 0.996
3 Norway 7.55 1.49 1.58 1.03
4 Iceland 7.49 1.38 1.62 1.03
5 Netherlands 7.49 1.40 1.52 0.999
6 Switzerland 7.48 1.45 1.53 1.05
7 Sweden 7.34 1.39 1.49 1.01
8 New Zealand 7.31 1.30 1.56 1.03
9 Canada 7.28 1.36 1.50 1.04
10 Austria 7.25 1.38 1.48 1.02
11 Australia 7.23 1.37 1.55 1.04
12 Costa Rica 7.17 1.03 1.44 0.963
13 Israel 7.14 1.28 1.46 1.03
14 Luxembourg 7.09 1.61 1.48 1.01
15 United Kingdom 7.05 1.33 1.54 0.996
16 Ireland 7.02 1.50 1.55 0.999
17 Germany 6.98 1.37 1.45 0.987
18 Belgium 6.92 1.36 1.50 0.986
19 United States 6.89 1.43 1.46 0.874
20 Czech Republic 6.85 1.27 1.49 0.92
# ℹ 4 more variables: freedom_to_make_life_choices <dbl>, generosity <dbl>,
# perceptions_of_corruption <dbl>, happy <chr>
# A tibble: 20 × 9
country_or_region score gdp_per_capita social_support healthy_life_expecta…¹
<chr> <dbl> <dbl> <dbl> <dbl>
1 South Sudan 2.85 0.306 0.575 0.295
2 Central African R… 3.08 0.026 0 0.105
3 Afghanistan 3.20 0.35 0.517 0.361
4 Tanzania 3.23 0.476 0.885 0.499
5 Rwanda 3.33 0.359 0.711 0.614
6 Yemen 3.38 0.287 1.16 0.463
7 Malawi 3.41 0.191 0.56 0.495
8 Syria 3.46 0.619 0.378 0.44
9 Botswana 3.49 1.04 1.14 0.538
10 Haiti 3.60 0.323 0.688 0.449
11 Zimbabwe 3.66 0.366 1.11 0.433
12 Burundi 3.78 0.046 0.447 0.38
13 Lesotho 3.80 0.489 1.17 0.168
14 Madagascar 3.93 0.274 0.916 0.555
15 Comoros 3.97 0.274 0.757 0.505
16 Liberia 3.98 0.073 0.922 0.443
17 India 4.01 0.755 0.765 0.588
18 Togo 4.08 0.275 0.572 0.41
19 Zambia 4.11 0.578 1.06 0.426
20 Egypt 4.17 0.913 1.04 0.644
# ℹ abbreviated name: ¹healthy_life_expectancy
# ℹ 4 more variables: freedom_to_make_life_choices <dbl>, generosity <dbl>,
# perceptions_of_corruption <dbl>, happy <chr>
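The code that produced the bottom-20 table is not shown. A sketch that mirrors the top-20 step, sorting in ascending order instead, might look like this. The tibble happy_demo is a four-row stand-in, and head(2) replaces head(20) only because the stand-in is so small.

```r
library(dplyr)

# Four-row stand-in for the cleaned `happy` tibble
happy_demo <- tibble(
  country_or_region = c("Finland", "Egypt", "South Sudan", "Togo"),
  score             = c(7.77, 4.17, 2.85, 4.08)
)

happybottom_demo <- happy_demo |>
  arrange(score) |>              # ascending: lowest scores first
  mutate(happy = "bottom") |>    # label column, matching happytop
  head(2)                        # the text keeps 20 rows

happybottom_demo
```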
Then convert from wide to long format.
happy_longtop <- happytop |>
pivot_longer(cols = 2:8,
names_to = "measurements",
values_to = "values")
happy_longtop
# A tibble: 140 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 Finland top score 7.77
2 Finland top gdp_per_capita 1.34
3 Finland top social_support 1.59
4 Finland top healthy_life_expectancy 0.986
5 Finland top freedom_to_make_life_choices 0.596
6 Finland top generosity 0.153
7 Finland top perceptions_of_corruption 0.393
8 Denmark top score 7.6
9 Denmark top gdp_per_capita 1.38
10 Denmark top social_support 1.57
# ℹ 130 more rows
happy_longbottom <- happybottom |>
pivot_longer(cols = 2:8,
names_to = "measurements",
values_to = "values")
happy_longbottom
# A tibble: 140 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 South Sudan bottom score 2.85
2 South Sudan bottom gdp_per_capita 0.306
3 South Sudan bottom social_support 0.575
4 South Sudan bottom healthy_life_expectancy 0.295
5 South Sudan bottom freedom_to_make_life_choices 0.01
6 South Sudan bottom generosity 0.202
7 South Sudan bottom perceptions_of_corruption 0.091
8 Central African Republic bottom score 3.08
9 Central African Republic bottom gdp_per_capita 0.026
10 Central African Republic bottom social_support 0
# ℹ 130 more rows
newdf <- rbind(happytop, happybottom)
newdf_long <- newdf |>
pivot_longer(cols = 2:8,
names_to = "measurements",
values_to = "values")
head(newdf_long)
# A tibble: 6 × 4
country_or_region happy measurements values
<chr> <chr> <chr> <dbl>
1 Finland top score 7.77
2 Finland top gdp_per_capita 1.34
3 Finland top social_support 1.59
4 Finland top healthy_life_expectancy 0.986
5 Finland top freedom_to_make_life_choices 0.596
6 Finland top generosity 0.153
Load the alluvial package
If you want to save the prebuilt dataset to your folder, use the write_csv function
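A sketch of those two steps, assuming the alluvial, ggalluvial and readr packages are installed. The Refugees dataset ships with alluvial, while geom_alluvium() used below comes from ggalluvial. The output path in tempdir() is a placeholder; substitute your own folder.

```r
library(alluvial)     # provides the Refugees dataset
library(ggalluvial)   # provides geom_alluvium()
library(readr)        # provides write_csv()

# Load the prebuilt dataset and take a first look
data(Refugees, package = "alluvial")
head(Refugees)

# Save a copy; tempdir() is a placeholder location for this sketch
out_file <- file.path(tempdir(), "refugees.csv")
write_csv(Refugees, out_file)
```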
The 10 countries that produced the most refugees from 2003 to 2013.
Alluvial plots need three variables: a time variable, a value, and a category.
ggalluv <- Refugees |>
ggplot(aes(x = year, y = refugees, alluvium = country)) +
theme_bw() +
geom_alluvium(aes(fill = country),
color = "white",
width = .1,
alpha = .8,
decreasing = FALSE) +
scale_fill_brewer(palette = "Spectral") +
# Spectral has enough colors for all countries listed
scale_x_continuous(limits = c(2002, 2013)) +
labs(title = "UNHCR-Recognised Refugees Top 10 Countries\n (2003-2013)",
# \n breaks the long title
y = "Number of Refugees",
fill = "Country",
caption = "Source: United Nations High Commissioner for Refugees (UNHCR)")
Notice the y-values are in scientific notation. We can convert them to standard notation by setting the scipen option with the options() function.
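As a small illustration of what scipen does, the sketch below formats the same number before and after raising the penalty. The value 999 is a common convention for "effectively never use scientific notation", not something the source specifies.

```r
old <- options(scipen = 0)     # start from R's default penalty
sci <- format(1e6)             # "1e+06"

options(scipen = 999)          # heavily penalize scientific notation
fixed <- format(1e6)           # "1000000"

print(c(sci, fixed))
options(old)                   # restore the previous setting
```

Run options(scipen = 999) before drawing the plot so the axis labels are affected as well.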
Source: FAA Aircraft registry, https://www.faa.gov/licenses_certificates/aircraft_certification/aircraft_registry/releasable_aircraft_download/
Use “group_by” together with summarise functions
Remove observations with NA values in the distance and arr_delay variables. Notice that the number of rows changes from 336,776 to 327,346.
Never use the function “na.omit”!!!!
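The filtering code is not shown, so the following is a sketch on a hypothetical four-row tibble, flights_demo, rather than the real flights data. It also suggests one likely reason for the warning: na.omit() drops a row if any column is NA, including columns you are not analyzing.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
flights_demo <- tibble(
  tailnum   = c("A1", "A1", "B2", "B2"),
  distance  = c(500, NA, 700, 900),
  arr_delay = c(10, 5, NA, -3),
  dep_time  = c(NA, 800, 900, 1000)   # NA in a column we do not analyze
)

# Drop rows only where distance or arr_delay is missing
flights_nona_demo <- flights_demo |>
  filter(!is.na(distance) & !is.na(arr_delay))

nrow(flights_nona_demo)       # 2 rows kept
nrow(na.omit(flights_demo))   # 1 row kept: the dep_time NA is dropped too
```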
The table includes the count of flights for each tail number, the mean distance traveled, and the mean arrival delay.
by_tailnum <- flights_nona |>
group_by(tailnum) |> # group all tailnumbers together
summarise(count = n(), # counts the flights for each tail number
dist = mean(distance), # calculates the mean distance traveled
delay = mean(arr_delay) # calculates the mean arrival delay
)
head(by_tailnum)
# A tibble: 6 × 4
tailnum count dist delay
<chr> <int> <dbl> <dbl>
1 190NV 29 597. -9.24
2 191NV 6 583 -6.67
3 193NV 18 626. -13.8
4 195NV 2 587 -0.5
5 196NV 3 605 -11.3
6 202NV 17 597. -14.1
ggplot(by_tailnum, aes(dist, delay)) + # plot the per-tail-number summary from above
geom_point(aes(size = count), alpha = .3) +
geom_smooth() +
scale_size_area() +
theme_bw() +
labs(x = "Average Flight Distance (miles)",
y = "Average Arrival Delay (minutes)",
caption = "FAA Aircraft registry",
title = "Flight Distance and Average Arrival Delays \n from Flights from NY")