Week 14 - introduction to heatmaps

This tutorial will help you learn about building heatmaps, which are z-score based visual representations of data. In class, we will discuss how to interpret heatmaps and hierarchical clustering.

This tutorial is somewhat simiplified (less text, more code).

The following packages are installed in the Rstudio.cloud workspace: install.packages(c(“dplyr”, “Rmisc”, “gplots”, “ggplot2”, “RColorBrewer”, “plotly”))

The Rstudio.cloud link is here

Soccer example

First, read the dataset, which Crossroads league soccer stats from the 2020 Fall soccer teams. We can use this to see which teams Marian compared to, based on statistics like goals, assists, etc. You can do similar things with any dataset full of stats.

soccer <- read.csv("soccer.csv")
str(soccer)
## 'data.frame':    10 obs. of  13 variables:
##  $ TEAM              : chr  "Bethel (Ind.)" "Goshen (Ind.)" "Grace (Ind.)" "Huntington (Ind.)" ...
##  $ goals             : int  3 1 23 11 22 64 26 42 11 41
##  $ goals.game        : num  0.27 0.2 1.92 0.85 1.83 4.27 2 3.5 0.85 3.15
##  $ assists           : int  5 1 22 8 18 49 19 32 9 29
##  $ assist.game       : num  0.45 0.2 1.83 0.62 1.5 3.27 1.46 2.67 0.69 2.23
##  $ shots             : int  40 25 207 122 202 323 176 221 93 278
##  $ shots.game        : num  3.64 5 17.25 9.38 16.83 ...
##  $ saves             : int  88 58 45 108 72 34 55 44 134 50
##  $ saves.game        : num  8 11.6 3.75 8.31 6 ...
##  $ goals.allowed     : int  55 16 12 40 10 7 14 5 43 20
##  $ goals.allowed.game: num  5 3.2 1 3.08 0.83 0.47 1.08 0.42 3.31 1.54
##  $ shutouts          : int  1 1 6 2 6 10 6 8 3 4
##  $ shutouts.game     : num  0.09 0.2 0.5 0.15 0.5 0.67 0.46 0.67 0.23 0.31

Like I mentioned, heatmaps use z-scores. But first, we have to remove column 1, because those are the names of the schools. Those can’t be converted to z scores.

soccer2 <- soccer[, -1] # remove first column of names

We are reviewing what z-scores are in class, so go to those slides. To do this easily in R, we can do it the long way (i had you do this before) by converting every column to a z score and then combining the columns. But, we can just used more advance features (using the dplyr package - which is very powerful).

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
soccer3 <- sapply(soccer2, function(soccer2) (soccer2-mean(soccer2, na.rm=TRUE))/sd(soccer2,na.rm=TRUE))

# note that this is the z score formula - soccer2-mean(soccer2, na.rm=TRUE))/sd(soccer2) which you are applying to every value in soccer2

Finally, we need convert the dataset into a specific type of file, called a data matrix. Compare the soc_matrix dataset to soccer3.

soc_matrix <- data.matrix(soccer3) # heatmaps like matrices
soc_matrix
##             goals  goals.game     assists  assist.game       shots  shots.game
##  [1,] -1.08080195 -1.15732580 -0.97230883 -1.021352678 -1.31472555 -1.46101614
##  [2,] -1.18181148 -1.20751961 -1.24619864 -1.266398906 -1.46795697 -1.25768192
##  [3,] -0.07070667  0.02581396  0.19172287  0.331302500  0.39125088  0.57382111
##  [4,] -0.67676384 -0.74143425 -0.76689147 -0.854721243 -0.47706048 -0.60282614
##  [5,] -0.12121143 -0.03872094 -0.08216694  0.007841479  0.34017374  0.51102673
##  [6,]  1.99998866  1.71089180  2.04047909  1.742768773  1.57624051  1.21372585
##  [7,]  0.08080762  0.08317831 -0.01369449 -0.031365917  0.07457262  0.01913734
##  [8,]  0.88888385  1.15875991  0.87644739  1.154657826  0.53426687  0.74874834
##  [9,] -0.67676384 -0.74143425 -0.69841902 -0.786108299 -0.77330788 -0.93623445
## [10,]  0.83837909  0.90779087  0.67103004  0.723376465  1.11654626  1.19129928
##             saves  saves.game goals.allowed goals.allowed.game   shutouts
##  [1,]  0.59988285  0.56627664     1.8882618          1.9416300 -1.2250432
##  [2,] -0.33743410  1.69820107    -0.3569275          0.7793640 -1.2250432
##  [3,] -0.74360478 -0.77002304    -0.5872034         -0.6411835  0.4304206
##  [4,]  1.22476081  0.66374791     1.0247274          0.7018796 -0.8939504
##  [5,]  0.09998047 -0.06257027    -0.7023413         -0.7509530  0.4304206
##  [6,] -1.08728766 -1.23536975    -0.8750481         -0.9834062  1.7547916
##  [7,] -0.43116580 -0.61909978    -0.4720654         -0.5895272  0.4304206
##  [8,] -0.77484868 -0.79517691    -0.9901860         -1.0156914  1.0926061
##  [9,]  2.03710217  1.29259482     1.1974343          0.8503913 -0.5628577
## [10,] -0.58738529 -0.73858069    -0.1266517         -0.2925036 -0.2317649
##       shutouts.game
##  [1,]    -1.3658308
##  [2,]    -0.8441593
##  [3,]     0.5785811
##  [4,]    -1.0812827
##  [5,]     0.5785811
##  [6,]     1.3848007
##  [7,]     0.3888824
##  [8,]     1.3848007
##  [9,]    -0.7018853
## [10,]    -0.3224878

We are going to create a object that just has everyone’s names, to be used by the next function to make a nice heatmap.

row.names(soc_matrix) <- soccer$TEAM # make an object for team names

Now, we are ready to make a heatmap.

heatmap(soc_matrix)

Very nice!!

Now. That was a lot of work (sort of, it’s actually pretty cool to me that you can generate such a elegant visual with a few lines of code!). But, the heatmap.2 function does the z scores for you! So really it’s even easier (if you know what you are doing).

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
soc_matrix3 <- data.matrix(soccer2)
row.names(soc_matrix3) <- soccer$TEAM
soc_heatmap <- heatmap.2(soc_matrix3, 
                        scale="column", 
                        margins=c(5,10),
                        labRow = row.names(soc_matrix3),
                         keep.dendro = TRUE,
                         density.info = "none", 
                         trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter

Now you can change the color scheme if you like.

library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.1.2
?RColorBrewer
col <- colorRampPalette(c("cyan", "deeppink3"))
soc_matrix4 <- data.matrix(soccer2) #i renamed this to show you you can work off the original datasets
row.names(soc_matrix4) <- soccer$TEAM
soc_heatmap <- heatmap.2(soc_matrix4, 
                        scale="column", 
                        margins=c(5,10),
                        labRow = row.names(soc_matrix4),
                         keep.dendro = TRUE,
                         col = col(256),
                         density.info = "none", 
                         trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter

Example using public datasets.

1. NFL statistics

First, let’s read the dataset. There are lots of Numbers.

nfl <- read.csv("NFLprospects.csv")
str(nfl)
## 'data.frame':    110 obs. of  18 variables:
##  $ Player   : chr  "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
##  $ School   : chr  "Western Kentucky" "Iowa State" "Nevada" "Miami (FL)" ...
##  $ Class    : int  2017 2018 2018 2016 2017 2017 2017 2018 2017 2017 ...
##  $ Birthdate: chr  "4/26/99" "12/27/99" "9/14/99" "8/24/97" ...
##  $ Years    : int  5 4 4 6 5 5 5 4 5 5 ...
##  $ Position : chr  "QB" "QB" "QB" "QB" ...
##  $ Final.Age: num  22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
##  $ Height   : num  72.5 72.6 75.4 68.8 75.4 73 75.3 72.8 75.3 72.5 ...
##  $ Weight   : int  215 212 226 196 211 210 218 208 217 219 ...
##  $ Hand     : num  9.8 9.3 9.1 9.3 10 9.4 9.5 9.3 8.5 9.5 ...
##  $ Arm      : num  31.4 29 32 28.9 32.8 31.8 31.1 30.6 30.9 31.8 ...
##  $ Wing     : num  74.8 70.1 76 71.1 79 77.4 74.9 75.8 73.8 77.4 ...
##  $ Forty    : num  4.88 4.84 NA 4.7 4.52 4.75 4.9 NA 4.73 NA ...
##  $ Bench    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Vert     : num  30 27 NA NA 36 NA 33 25.5 33.5 NA ...
##  $ Broad    : num  109 NA NA NA 127 NA 115 111 121 NA ...
##  $ Cone     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Shuttle  : num  NA NA NA NA NA NA NA NA NA NA ...

Now, let’s convert the dataset into a specific type of file, called a data matrix.

nfl_matrix <- data.matrix(nfl)

We are going to create a object that just has everyone’s names, to be used later.

row.names(nfl) <- nfl$Player # make an object for names

We only want to analze some of the columns. Let’s start small.

nfl2 <- nfl[, c(7:11)] # select some columns
nfl_matrix2 <- data.matrix(nfl2)
str(nfl_matrix2)
##  num [1:110, 1:5] 22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:110] "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
##   ..$ : chr [1:5] "Final.Age" "Height" "Weight" "Hand" ...

Now, we can make a simple heat map

heatmap(nfl_matrix2)

Need to convert to z scores!! Because the values are not the same.

library(dplyr)
# remove nas
# convent all rows to z scores

nfl3 <- sapply(nfl2, function(nfl2) (nfl2-mean(nfl2, na.rm=TRUE))/sd(nfl2,na.rm=TRUE))
row.names(nfl3) <- nfl$Player

nfl3_matrix <- data.matrix(nfl3)

Add some color to the graph.

library(RColorBrewer)
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(nfl3_matrix, col=col)

We can see how well we did.

nfl4 <- sapply(nfl2, function(nfl2) (nfl2-mean(nfl2, na.rm=TRUE))/sd(nfl2,na.rm=TRUE))
row.names(nfl4) <- nfl$Position
nfl4_matrix <- data.matrix(nfl4)
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(nfl4_matrix, col=col)

Again, this is a lot quicker with heatmap.2() function in the gplots package.

library(gplots)
nfl.2 <-nfl_matrix2
row.names(nfl.2) <- nfl$Player
col <- colorRampPalette(c("cyan", "deeppink3"))
nfl_heatmap <- heatmap.2(nfl.2, 
                        scale="column", 
                        margins=c(5,10),
                        labRow = row.names(nfl.2),
                         keep.dendro = TRUE,
                         col = col(256),
                         density.info = "none", 
                         trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter

Let’s look at other metrics.

library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
?plot_ly()
plot_ly(z= nfl3_matrix, type = "heatmap")

NOW, we can try with other stats

First, create a data frame that only has the following variables: forty bench vert broad cone shuttle

Then, turn that data into a z-score for every point (sapply function above - and remove the NAs) Then, make sure you name the row.names (see above) Turn that into a data matrix Make your heatmap. (note, the dendrogram doesn’t work because there are too many NAs. Add Colv=NA, Rowv=NA to your code.)

The code is below.

str(nfl)
## 'data.frame':    110 obs. of  18 variables:
##  $ Player   : chr  "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
##  $ School   : chr  "Western Kentucky" "Iowa State" "Nevada" "Miami (FL)" ...
##  $ Class    : int  2017 2018 2018 2016 2017 2017 2017 2018 2017 2017 ...
##  $ Birthdate: chr  "4/26/99" "12/27/99" "9/14/99" "8/24/97" ...
##  $ Years    : int  5 4 4 6 5 5 5 4 5 5 ...
##  $ Position : chr  "QB" "QB" "QB" "QB" ...
##  $ Final.Age: num  22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
##  $ Height   : num  72.5 72.6 75.4 68.8 75.4 73 75.3 72.8 75.3 72.5 ...
##  $ Weight   : int  215 212 226 196 211 210 218 208 217 219 ...
##  $ Hand     : num  9.8 9.3 9.1 9.3 10 9.4 9.5 9.3 8.5 9.5 ...
##  $ Arm      : num  31.4 29 32 28.9 32.8 31.8 31.1 30.6 30.9 31.8 ...
##  $ Wing     : num  74.8 70.1 76 71.1 79 77.4 74.9 75.8 73.8 77.4 ...
##  $ Forty    : num  4.88 4.84 NA 4.7 4.52 4.75 4.9 NA 4.73 NA ...
##  $ Bench    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Vert     : num  30 27 NA NA 36 NA 33 25.5 33.5 NA ...
##  $ Broad    : num  109 NA NA NA 127 NA 115 111 121 NA ...
##  $ Cone     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Shuttle  : num  NA NA NA NA NA NA NA NA NA NA ...
nfl_new <- nfl[, 13:18]
nfl_new2 <- sapply(nfl_new, function(nfl_new) (nfl_new-mean(nfl_new, na.rm=TRUE))/sd(nfl_new,na.rm=TRUE))
row.names(nfl_new2) <- nfl$Player

nflnew_matrix <- data.matrix(nfl_new2)
heatmap(nflnew_matrix, na.col = "grey", na.rm = TRUE, Colv=NA, Rowv=NA)
## Warning in plot.window(...): "na.col" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "na.col" is not a graphical parameter
## Warning in title(...): "na.col" is not a graphical parameter

other heat maps

install.packages(“usmap”) library(usmap) library(ggplot2) library(dplyr) library(descr) library(ggrepel) # if need to repel labels library(data.table)

## try with my data
statecovid <- read.csv("statecovid.csv")
str(statecovid)
## 'data.frame':    51 obs. of  7 variables:
##  $ state            : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ count            : int  1297649 234251 2016798 810160 8528764 1363352 744459 259796 134623 5872586 ...
##  $ deaths           : int  19475 1185 29823 11333 88440 12001 10809 2884 1319 73622 ...
##  $ unemployment.rate: num  6.6 7.2 6.7 7.3 11 6.4 7.8 8.2 8.7 7.6 ...
##  $ colleges         : int  129 35 155 108 1246 171 114 23 33 439 ...
##  $ McDonalds        : int  210 27 212 127 1165 181 143 34 NA 728 ...
##  $ X7dayavehosp     : int  119 5 198 90 675 72 49 47 49 341 ...
# by covid case count
library(usmap)
## Warning: package 'usmap' was built under R version 4.1.2
us_map <- usmap::us_map() # used to add map scale
?plot_usmap
base <- usmap::plot_usmap(data = statecovid, values = "count", 
                          labels = T, size=0.2)+
  labs(fill = 'Covid cases per 100,000') + 
  scale_fill_gradientn(colours=rev(heat.colors(100)), na.value="grey90") 

base

base + # put legend at the bottom, adjust legend title and text font sizes
  theme(legend.position = "bottom",
        legend.title=element_text(size=12), 
        legend.text=element_text(size=10))

You can make a map by covid death count

base <- usmap::plot_usmap(data = statecovid, values = "deaths", 
                          labels = T, size=0.2)+
  labs(fill = 'Covid total deaths since January 1, 2020') + 
  scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90") 

base

base + # put legend at the bottom, adjust legend title and text font sizes
  theme(legend.position = "bottom",
        legend.title=element_text(size=10), 
        legend.text=element_text(size=8))

Now you try to do one with the number of McDonalds per state (it’s in the same csv file).