This tutorial will help you learn about building heatmaps, which are z-score based visual representations of data. In class, we will discuss how to interpret heatmaps and hierarchical clustering.
This tutorial is somewhat simiplified (less text, more code).
The following packages are installed in the Rstudio.cloud workspace: install.packages(c(“dplyr”, “Rmisc”, “gplots”, “ggplot2”, “RColorBrewer”, “plotly”))
The Rstudio.cloud link is here
First, read the dataset, which Crossroads league soccer stats from the 2020 Fall soccer teams. We can use this to see which teams Marian compared to, based on statistics like goals, assists, etc. You can do similar things with any dataset full of stats.
soccer <- read.csv("soccer.csv")
str(soccer)
## 'data.frame': 10 obs. of 13 variables:
## $ TEAM : chr "Bethel (Ind.)" "Goshen (Ind.)" "Grace (Ind.)" "Huntington (Ind.)" ...
## $ goals : int 3 1 23 11 22 64 26 42 11 41
## $ goals.game : num 0.27 0.2 1.92 0.85 1.83 4.27 2 3.5 0.85 3.15
## $ assists : int 5 1 22 8 18 49 19 32 9 29
## $ assist.game : num 0.45 0.2 1.83 0.62 1.5 3.27 1.46 2.67 0.69 2.23
## $ shots : int 40 25 207 122 202 323 176 221 93 278
## $ shots.game : num 3.64 5 17.25 9.38 16.83 ...
## $ saves : int 88 58 45 108 72 34 55 44 134 50
## $ saves.game : num 8 11.6 3.75 8.31 6 ...
## $ goals.allowed : int 55 16 12 40 10 7 14 5 43 20
## $ goals.allowed.game: num 5 3.2 1 3.08 0.83 0.47 1.08 0.42 3.31 1.54
## $ shutouts : int 1 1 6 2 6 10 6 8 3 4
## $ shutouts.game : num 0.09 0.2 0.5 0.15 0.5 0.67 0.46 0.67 0.23 0.31
Like I mentioned, heatmaps use z-scores. But first, we have to remove column 1, because those are the names of the schools. Those can’t be converted to z scores.
soccer2 <- soccer[, -1] # remove first column of names
We are reviewing what z-scores are in class, so go to those slides. To do this easily in R, we can do it the long way (i had you do this before) by converting every column to a z score and then combining the columns. But, we can just used more advance features (using the dplyr package - which is very powerful).
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
soccer3 <- sapply(soccer2, function(soccer2) (soccer2-mean(soccer2, na.rm=TRUE))/sd(soccer2,na.rm=TRUE))
# note that this is the z score formula - soccer2-mean(soccer2, na.rm=TRUE))/sd(soccer2) which you are applying to every value in soccer2
Finally, we need convert the dataset into a specific type of file, called a data matrix. Compare the soc_matrix dataset to soccer3.
soc_matrix <- data.matrix(soccer3) # heatmaps like matrices
soc_matrix
## goals goals.game assists assist.game shots shots.game
## [1,] -1.08080195 -1.15732580 -0.97230883 -1.021352678 -1.31472555 -1.46101614
## [2,] -1.18181148 -1.20751961 -1.24619864 -1.266398906 -1.46795697 -1.25768192
## [3,] -0.07070667 0.02581396 0.19172287 0.331302500 0.39125088 0.57382111
## [4,] -0.67676384 -0.74143425 -0.76689147 -0.854721243 -0.47706048 -0.60282614
## [5,] -0.12121143 -0.03872094 -0.08216694 0.007841479 0.34017374 0.51102673
## [6,] 1.99998866 1.71089180 2.04047909 1.742768773 1.57624051 1.21372585
## [7,] 0.08080762 0.08317831 -0.01369449 -0.031365917 0.07457262 0.01913734
## [8,] 0.88888385 1.15875991 0.87644739 1.154657826 0.53426687 0.74874834
## [9,] -0.67676384 -0.74143425 -0.69841902 -0.786108299 -0.77330788 -0.93623445
## [10,] 0.83837909 0.90779087 0.67103004 0.723376465 1.11654626 1.19129928
## saves saves.game goals.allowed goals.allowed.game shutouts
## [1,] 0.59988285 0.56627664 1.8882618 1.9416300 -1.2250432
## [2,] -0.33743410 1.69820107 -0.3569275 0.7793640 -1.2250432
## [3,] -0.74360478 -0.77002304 -0.5872034 -0.6411835 0.4304206
## [4,] 1.22476081 0.66374791 1.0247274 0.7018796 -0.8939504
## [5,] 0.09998047 -0.06257027 -0.7023413 -0.7509530 0.4304206
## [6,] -1.08728766 -1.23536975 -0.8750481 -0.9834062 1.7547916
## [7,] -0.43116580 -0.61909978 -0.4720654 -0.5895272 0.4304206
## [8,] -0.77484868 -0.79517691 -0.9901860 -1.0156914 1.0926061
## [9,] 2.03710217 1.29259482 1.1974343 0.8503913 -0.5628577
## [10,] -0.58738529 -0.73858069 -0.1266517 -0.2925036 -0.2317649
## shutouts.game
## [1,] -1.3658308
## [2,] -0.8441593
## [3,] 0.5785811
## [4,] -1.0812827
## [5,] 0.5785811
## [6,] 1.3848007
## [7,] 0.3888824
## [8,] 1.3848007
## [9,] -0.7018853
## [10,] -0.3224878
We are going to create a object that just has everyone’s names, to be used by the next function to make a nice heatmap.
row.names(soc_matrix) <- soccer$TEAM # make an object for team names
Now, we are ready to make a heatmap.
heatmap(soc_matrix)
Very nice!!
Now. That was a lot of work (sort of, it’s actually pretty cool to me that you can generate such a elegant visual with a few lines of code!). But, the heatmap.2 function does the z scores for you! So really it’s even easier (if you know what you are doing).
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
soc_matrix3 <- data.matrix(soccer2)
row.names(soc_matrix3) <- soccer$TEAM
soc_heatmap <- heatmap.2(soc_matrix3,
scale="column",
margins=c(5,10),
labRow = row.names(soc_matrix3),
keep.dendro = TRUE,
density.info = "none",
trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter
Now you can change the color scheme if you like.
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 4.1.2
?RColorBrewer
col <- colorRampPalette(c("cyan", "deeppink3"))
soc_matrix4 <- data.matrix(soccer2) #i renamed this to show you you can work off the original datasets
row.names(soc_matrix4) <- soccer$TEAM
soc_heatmap <- heatmap.2(soc_matrix4,
scale="column",
margins=c(5,10),
labRow = row.names(soc_matrix4),
keep.dendro = TRUE,
col = col(256),
density.info = "none",
trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter
First, let’s read the dataset. There are lots of Numbers.
nfl <- read.csv("NFLprospects.csv")
str(nfl)
## 'data.frame': 110 obs. of 18 variables:
## $ Player : chr "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
## $ School : chr "Western Kentucky" "Iowa State" "Nevada" "Miami (FL)" ...
## $ Class : int 2017 2018 2018 2016 2017 2017 2017 2018 2017 2017 ...
## $ Birthdate: chr "4/26/99" "12/27/99" "9/14/99" "8/24/97" ...
## $ Years : int 5 4 4 6 5 5 5 4 5 5 ...
## $ Position : chr "QB" "QB" "QB" "QB" ...
## $ Final.Age: num 22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
## $ Height : num 72.5 72.6 75.4 68.8 75.4 73 75.3 72.8 75.3 72.5 ...
## $ Weight : int 215 212 226 196 211 210 218 208 217 219 ...
## $ Hand : num 9.8 9.3 9.1 9.3 10 9.4 9.5 9.3 8.5 9.5 ...
## $ Arm : num 31.4 29 32 28.9 32.8 31.8 31.1 30.6 30.9 31.8 ...
## $ Wing : num 74.8 70.1 76 71.1 79 77.4 74.9 75.8 73.8 77.4 ...
## $ Forty : num 4.88 4.84 NA 4.7 4.52 4.75 4.9 NA 4.73 NA ...
## $ Bench : num NA NA NA NA NA NA NA NA NA NA ...
## $ Vert : num 30 27 NA NA 36 NA 33 25.5 33.5 NA ...
## $ Broad : num 109 NA NA NA 127 NA 115 111 121 NA ...
## $ Cone : num NA NA NA NA NA NA NA NA NA NA ...
## $ Shuttle : num NA NA NA NA NA NA NA NA NA NA ...
Now, let’s convert the dataset into a specific type of file, called a data matrix.
nfl_matrix <- data.matrix(nfl)
We are going to create a object that just has everyone’s names, to be used later.
row.names(nfl) <- nfl$Player # make an object for names
We only want to analze some of the columns. Let’s start small.
nfl2 <- nfl[, c(7:11)] # select some columns
nfl_matrix2 <- data.matrix(nfl2)
str(nfl_matrix2)
## num [1:110, 1:5] 22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:110] "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
## ..$ : chr [1:5] "Final.Age" "Height" "Weight" "Hand" ...
Now, we can make a simple heat map
heatmap(nfl_matrix2)
Need to convert to z scores!! Because the values are not the same.
library(dplyr)
# remove nas
# convent all rows to z scores
nfl3 <- sapply(nfl2, function(nfl2) (nfl2-mean(nfl2, na.rm=TRUE))/sd(nfl2,na.rm=TRUE))
row.names(nfl3) <- nfl$Player
nfl3_matrix <- data.matrix(nfl3)
Add some color to the graph.
library(RColorBrewer)
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(nfl3_matrix, col=col)
We can see how well we did.
nfl4 <- sapply(nfl2, function(nfl2) (nfl2-mean(nfl2, na.rm=TRUE))/sd(nfl2,na.rm=TRUE))
row.names(nfl4) <- nfl$Position
nfl4_matrix <- data.matrix(nfl4)
col <- colorRampPalette(brewer.pal(10, "RdYlBu"))(256)
heatmap(nfl4_matrix, col=col)
Again, this is a lot quicker with heatmap.2() function in the gplots package.
library(gplots)
nfl.2 <-nfl_matrix2
row.names(nfl.2) <- nfl$Player
col <- colorRampPalette(c("cyan", "deeppink3"))
nfl_heatmap <- heatmap.2(nfl.2,
scale="column",
margins=c(5,10),
labRow = row.names(nfl.2),
keep.dendro = TRUE,
col = col(256),
density.info = "none",
trace = "none")
## Warning in plot.window(...): "keep.dendro" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "keep.dendro" is not a graphical parameter
## Warning in title(...): "keep.dendro" is not a graphical parameter
Let’s look at other metrics.
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
?plot_ly()
plot_ly(z= nfl3_matrix, type = "heatmap")
First, create a data frame that only has the following variables: forty bench vert broad cone shuttle
Then, turn that data into a z-score for every point (sapply function above - and remove the NAs) Then, make sure you name the row.names (see above) Turn that into a data matrix Make your heatmap. (note, the dendrogram doesn’t work because there are too many NAs. Add Colv=NA, Rowv=NA to your code.)
The code is below.
str(nfl)
## 'data.frame': 110 obs. of 18 variables:
## $ Player : chr "Bailey Zappe" "Brock Purdy" "Carson Strong" "D'Eriq King" ...
## $ School : chr "Western Kentucky" "Iowa State" "Nevada" "Miami (FL)" ...
## $ Class : int 2017 2018 2018 2016 2017 2017 2017 2018 2017 2017 ...
## $ Birthdate: chr "4/26/99" "12/27/99" "9/14/99" "8/24/97" ...
## $ Years : int 5 4 4 6 5 5 5 4 5 5 ...
## $ Position : chr "QB" "QB" "QB" "QB" ...
## $ Final.Age: num 22.7 22 22.3 24.4 22.4 23 23.1 21.8 23.6 22.6 ...
## $ Height : num 72.5 72.6 75.4 68.8 75.4 73 75.3 72.8 75.3 72.5 ...
## $ Weight : int 215 212 226 196 211 210 218 208 217 219 ...
## $ Hand : num 9.8 9.3 9.1 9.3 10 9.4 9.5 9.3 8.5 9.5 ...
## $ Arm : num 31.4 29 32 28.9 32.8 31.8 31.1 30.6 30.9 31.8 ...
## $ Wing : num 74.8 70.1 76 71.1 79 77.4 74.9 75.8 73.8 77.4 ...
## $ Forty : num 4.88 4.84 NA 4.7 4.52 4.75 4.9 NA 4.73 NA ...
## $ Bench : num NA NA NA NA NA NA NA NA NA NA ...
## $ Vert : num 30 27 NA NA 36 NA 33 25.5 33.5 NA ...
## $ Broad : num 109 NA NA NA 127 NA 115 111 121 NA ...
## $ Cone : num NA NA NA NA NA NA NA NA NA NA ...
## $ Shuttle : num NA NA NA NA NA NA NA NA NA NA ...
nfl_new <- nfl[, 13:18]
nfl_new2 <- sapply(nfl_new, function(nfl_new) (nfl_new-mean(nfl_new, na.rm=TRUE))/sd(nfl_new,na.rm=TRUE))
row.names(nfl_new2) <- nfl$Player
nflnew_matrix <- data.matrix(nfl_new2)
heatmap(nflnew_matrix, na.col = "grey", na.rm = TRUE, Colv=NA, Rowv=NA)
## Warning in plot.window(...): "na.col" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "na.col" is not a graphical parameter
## Warning in title(...): "na.col" is not a graphical parameter
install.packages(“usmap”) library(usmap) library(ggplot2) library(dplyr) library(descr) library(ggrepel) # if need to repel labels library(data.table)
## try with my data
statecovid <- read.csv("statecovid.csv")
str(statecovid)
## 'data.frame': 51 obs. of 7 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ count : int 1297649 234251 2016798 810160 8528764 1363352 744459 259796 134623 5872586 ...
## $ deaths : int 19475 1185 29823 11333 88440 12001 10809 2884 1319 73622 ...
## $ unemployment.rate: num 6.6 7.2 6.7 7.3 11 6.4 7.8 8.2 8.7 7.6 ...
## $ colleges : int 129 35 155 108 1246 171 114 23 33 439 ...
## $ McDonalds : int 210 27 212 127 1165 181 143 34 NA 728 ...
## $ X7dayavehosp : int 119 5 198 90 675 72 49 47 49 341 ...
# by covid case count
library(usmap)
## Warning: package 'usmap' was built under R version 4.1.2
us_map <- usmap::us_map() # used to add map scale
?plot_usmap
base <- usmap::plot_usmap(data = statecovid, values = "count",
labels = T, size=0.2)+
labs(fill = 'Covid cases per 100,000') +
scale_fill_gradientn(colours=rev(heat.colors(100)), na.value="grey90")
base
base + # put legend at the bottom, adjust legend title and text font sizes
theme(legend.position = "bottom",
legend.title=element_text(size=12),
legend.text=element_text(size=10))
You can make a map by covid death count
base <- usmap::plot_usmap(data = statecovid, values = "deaths",
labels = T, size=0.2)+
labs(fill = 'Covid total deaths since January 1, 2020') +
scale_fill_gradientn(colours=rev(heat.colors(10)),na.value="grey90")
base
base + # put legend at the bottom, adjust legend title and text font sizes
theme(legend.position = "bottom",
legend.title=element_text(size=10),
legend.text=element_text(size=8))
Now you try to do one with the number of McDonalds per state (it’s in the same csv file).