Heatmaps are a way to colorize, visualize, and organize a data set with the goal of intuiting relationships among observations and features.
We will use heatmaps in this course to find patterns in the gene expression data for the 1K breast cancer patients from The Cancer Genome Atlas. Here, we focus on what heatmaps are and how to create them by practicing with a small dataset.
R provides many data sets to work with, so we can learn new analysis skills before scaling up. mtcars is a classic go-to R data frame. It was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 design and performance features for 32 automobiles (1973–74 models).
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
To get a global view of the data use the View() function.
View(mtcars)
The function heatmap() is an easy way to convert the values in mtcars to colors which helps us understand the relative values of different features and to find relationships, either across cars or across features.
Check out the help page for heatmap().
# The second argument tells R which `heatmap()` function we want.
help(heatmap)
In the help file, we learn that heatmap() plots a numeric matrix of values. So our first step will be to ensure that the data are converted from a data frame to a matrix. A matrix contains a single type of data- number-valued data.
# We ensure the data set is a matrix of numbers
data <- as.matrix(mtcars)
Let’s build a simple heatmap with no rearranging of rows and columns to see the transformation of our matrix to color-coded cells. The help file tells us that the arguments Rowv=NA and Colv=NA suppress ordering of the features and observations based on similarity.
# A heat map is a false color image optionally with dendrograms
# NA for Rowv and Colv suppresses calculation of dendrograms
# cexRow scales font so we can see all car names
heatmap(data, Rowv=NA, Colv=NA,cexRow=0.6, cexCol=0.8)
The rows correspond to cars (observations) and the columns to the 10 performance features. In the default coloring scheme, the highest values have the darkest colors. Compare the colors in the heatmap with the values in the matrix. Some features disp and hp have higher values than others.
Before we delve into the heatmap, we note that the display, as is, isn’t very helpful in discerning relationships. It’s easy to see disp and hp have higher values but it’s harder to compare differences among the values in other features.
What do we mean that two features are on different scales?
Let’s compare wt and disp because wt has lower values.
# create arrays
wt<-data[,"wt"]
disp<-data[,"disp"]
# plot both arrays with boxplot function
distributions <- data.frame( wt, disp)
boxplot(distributions,
main="Boxplots of weight and displacement", boxwex=.3)
The boxplots shows the distributions of the values for wt and disp: The minimum and maximum values and the median. All of the value of wt are much smaller than the values of disp.
In data analysis, normalizing or scaling is a technique for comparing data that isn’t measured in the same way. One way to do this is to use the mean value and standard deviation.
For example, for wt, we would subtract the mean of wt from each value and then divide each by the standard deviation of wt. The standard deviation is a measure of how spread out the numbers are.
We can scale wt and disp so that there mean values and ranges are more similar.
# use the `scale()` function to normalize the data
wt_scaled<-scale(wt)
disp_scaled<-scale(disp)
scaled_distributions <- data.frame( wt_scaled, disp_scaled)
boxplot(scaled_distributions,
main="Boxplots of weight and displacement, scaled", boxwex=.3)
The values scaled in this way are called z-scores. The z-score tells you how many standard deviations from the mean your score is.
The important point is that each feature is scaled according to its own distribution.
We will use the scale() function to scale our data.
data_scaled <- scale(data)
Take a look at the scaled data data_scaled and compare it with the unscaled data data.
View(data)
View(round(data_scaled,2))
Create a simple heatmap for the scaled data.
heatmap(data_scaled, Rowv=NA, Colv=NA, cexRow=0.6, cexCol=0.8)
Now we can see if a car has both high wt and disp, e.g. the Chrysler Imperial.
But it’s tedious to find these relationship by eye…
A dendrogram is a tree diagram that shows the hierarchical relationships among objects. It is most commonly created as an output from hierarchical clustering and shows how objects are allocated to different clusters.
The dendrogram is a summary of the distance matrix that gives the pairwise distances between among all points.
There are many ways to calculate the distance between two points, for example, via roads or “as the crow flies.” By default, heatmap() uses the Euclidean distance or the length of a line between two points.
# by default, heatmap will calculate the dendrograms and reorder
# the arguments margins, cexRow and cexCol are for plotting
heatmap(data_scaled, margins = c(2,2),cexRow=0.6, cexCol=0.8)
The heatmap with dendrograms enables rapid visual inspection of our data. Each row can be considered as a profile of features. Colorized and ordered this way, we can spot relationships among the profiles more easily.
What relationships do you find? To remind yourself of the feature definitions, run the command help(mtcars). Are the values and groupings for wt and mpg surprising? Does the clustering of vehicles make sense?
We can also find features that are similar and perhaps redundant, e.g. disp and cyl in the first two columns of the heatmap.
You can use a color palette to change the color coding and style of your heatmap.
RColorBrewer is an R package that contains ready-to-use color palettes for creating nice graphics.
# Load the palette package
library(RColorBrewer)
# Parameters for plotting
par(cex = 0.5)
# Get a graphic for all color schemes
display.brewer.all()
We can use the “RdBu” palette to change the default color-coding of our data from oranges (“YlOrRd”) to a red-blue spectrum.
heatmap(data_scaled, margins = c(2,2),cexRow=0.6, cexCol=0.8, col=brewer.pal(8,"RdBu"))
Change the parameter col in the code chunk below to any of the palettes listed above.
heatmap(data_scaled, margins = c(2,2),cexRow=0.6, cexCol=0.8, col=brewer.pal(8,"YlOrRd"))
Great work! We will use heatmaps and dendrograms to organize the TCGA breast cancer gene expression data.
Nutrition data on 80 cereal products: Should I stick with Lucky Charms to move to Wheaties?
The columns of the dataset are
* name: Name of cereal
* calories: calories per serving
* protein: grams of protein
* fat: grams of fat
* sodium: milligrams of sodium
* fiber: grams of dietary fiber
* carbo: grams of complex carbohydrates
* sugars: grams of sugars
* potass: milligrams of potassium
* vitamins: vitamins and minerals - 0, 25, or 100
* shelf: display shelf (1, 2, or 3, counting from the floor)
* weight: weight in ounces of one serving
* cups: number of cups in one serving
* rating: a rating of the cereals, unknown source
# Read the file in your directory
cereal_data<-read.csv("cereal.csv")
# Takea look at what's there
View(cereal_data)
We have to do some manipulation of the data before we can make a heatmap. The first column of cereal_data is names. We want to use the cereal names as row names. The columns are already named for us. We also have to make sure the data is in the form of a matrix to make a heatmap.
# Create an object of names
rnames<-cereal_data[,1]
# Remove the names column
cereal<-cereal_data[,-1]
# Set the row names
row.names(cereal)<-rnames
# Make sure the data is in matrix form
cereal<-as.matrix(cereal)
# Is the unscaled data sensible?
heatmap(cereal,cexRow=0.6, cexCol=0.8)
# Scale the data
cereal_scaled<-scale(cereal)
# What does the heatmap look like with scaling?
heatmap(cereal_scaled,cexRow=0.6, cexCol=0.8)
Back to our original question. Should we switch from Lucky Charms to Wheaties? I guess it depends on whether we want to be “healthy” or enjoy our sugar fix!
Can you make a guess as to what the cereal ratings may have stressed?