1.0 Overview

Heatmaps visualise data through variations in colouring. When applied to a tabular format, heatmaps are useful for cross-examining multivariate data by placing variables in the columns and observations (or records) in the rows, and colouring the cells within the table. Heatmaps are good for showing variance across multiple variables, revealing patterns, displaying whether any variables are similar to each other, and detecting whether any correlations exist between them.

In this hands-on exercise, you will gain experience in using R to plot static and interactive heatmaps for visualising and analysing multivariate data.

2.0 Installing and Launching R Packages

Before you get started, you are required:

to start a new R project, and
to create a new R Markdown document.

Next, you will use the code chunk below to install and launch seriation, heatmaply, dendextend and tidyverse in RStudio.
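
A minimal sketch of such a setup chunk is given here. It assumes the four packages are available from CRAN and that the machine has network access; the loop and the variable name packages are illustrative and not taken from the original exercise.

```r
# Packages named in the text above
packages <- c("seriation", "heatmaply", "dendextend", "tidyverse")

for (p in packages) {
  # Install the package only if it is not already available
  if (!requireNamespace(p, quietly = TRUE)) {
    install.packages(p)
  }
  # Attach the package to the current session
  library(p, character.only = TRUE)
}
```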

3.0 Importing and Preparing The Data Set

In this hands-on exercise, Tesco grocery measures by borough will be used. The data set is saved in a csv file called year_borough_grocery.csv in the data folder.

3.1 Importing the data set

In the code chunk below, read_csv() of readr is used to import year_borough_grocery.csv into R and parse it into a tibble data frame.

tesco <- read_csv("data/year_borough_grocery.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   area_id = col_character()
## )

## See spec(...) for full column specifications.

The output tibble data frame is called tesco.

3.2 Preparing the data

Next, we need to label the rows by area ID instead of row number by using the code chunk below.

row.names(tesco) <- tesco$area_id

## Warning: Setting row names on a tibble is deprecated.

Next, select() of dplyr is used to keep only the eight measure columns, picked by column position.

tesco_measures <- select(tesco, c(18,26,34,42,50,58,66,74))

Notice that the row numbers have been replaced by the area IDs.

3.3 Transforming the data frame into a matrix

The data was loaded into a data frame, but it has to be a data matrix to make your heatmap.

The code chunk below will be used to transform the tesco_measures data frame into a data matrix.

tesco_matrix <- data.matrix(tesco_measures)

Notice that tesco_matrix is in R matrix format.
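
For the row labels to appear on a heatmap, the matrix needs row names. The small sketch below uses made-up data (the data frame toy and its values are hypothetical, not taken from the Tesco file) to show that a base R data frame keeps its row names when passed through data.matrix().

```r
# Hypothetical toy data standing in for the borough table
toy <- data.frame(
  area_id   = c("A01", "A02", "A03"),
  measure_1 = c(0.2, 0.5, 0.9),
  measure_2 = c(10, 30, 20),
  stringsAsFactors = FALSE
)

# Label the rows by area_id instead of row number
row.names(toy) <- toy$area_id

# Keep only the numeric measure columns and convert them to a matrix
toy_matrix <- data.matrix(toy[, c("measure_1", "measure_2")])

# The area IDs are carried over as the matrix row names
rownames(toy_matrix)
```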

5.0 Creating Interactive Heatmap

heatmaply is an R package for building interactive cluster heatmaps that can be shared online as stand-alone HTML files. It is designed and maintained by Tal Galili.

Before we get started, you should review the Introduction to Heatmaply to have an overall understanding of the features and functions of the heatmaply package. You are also required to have the user manual of the package handy for reference purposes.

In this section, you will gain hands-on experience in using heatmaply to design an interactive cluster heatmap. We will use tesco_matrix as the input data.

5.1 Working with heatmaply

The code chunk below shows the basic syntax needed to create an interactive heatmap by using the heatmaply package.

heatmaply(tesco_matrix)

5.2 Data transformation

When analysing a multivariate data set, it is very common that the variables include values that reflect different types of measurement. In general, these variables’ values have their own range. To ensure that all the variables have comparable values, data transformation is commonly applied before clustering.

Three main data transformation methods are supported by heatmaply(), namely: scale, normalise and percentise.

5.2.1 Scaling method

When all variables come from, or are assumed to come from, some normal distribution, then scaling (i.e. subtracting the mean and dividing by the standard deviation) would bring them all close to the standard normal distribution. In such a case, each value would reflect the distance from the mean in units of standard deviation. The scale argument in heatmaply() supports column and row scaling. The code chunk below is used to scale variable values column-wise.

heatmaply(tesco_matrix,
          scale = "column")
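
To see what column-wise scaling does to the values, the sketch below applies base R's scale() to a small made-up matrix (toy_scale_in is hypothetical). It mirrors the definition given above rather than heatmaply's internal code.

```r
# Hypothetical matrix with two variables on very different ranges
toy_scale_in <- cbind(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 200, 30, 400, 50)
)

# Subtract each column's mean and divide by its standard deviation
toy_scaled <- scale(toy_scale_in)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)
```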

5.2.2 Normalising method

When variables in the data come from possibly different (and non-normal) distributions, the normalize function can be used to bring the data to the 0 to 1 scale by subtracting the minimum and then dividing by the maximum of the shifted values (i.e. the range). This preserves the shape of each variable’s distribution while making them easily comparable on the same “scale”. Unlike scaling, the normalise method is performed on the input data set, i.e. tesco_matrix, as shown in the code chunk below.

heatmaply(normalize(tesco_matrix))
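
The min-max rescaling described above can be sketched in base R as follows. The helper min_max and the matrix toy_norm_in are hypothetical names used for illustration only; this is not the source code of normalize().

```r
# Hypothetical helper: rescale a numeric vector to the 0 to 1 range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical matrix with two variables on very different ranges
toy_norm_in <- cbind(
  x = c(1, 2, 3, 4, 5),
  y = c(10, 200, 30, 400, 50)
)

# Apply the helper to each column separately
toy_norm <- apply(toy_norm_in, 2, min_max)

# Every column now runs from 0 to 1
apply(toy_norm, 2, range)
```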

5.2.3 Percentising method

This is similar to ranking the variables, but instead of keeping the rank values, they are divided by the maximal rank. This is done by using the ecdf of the variables on their own values, bringing each value to its empirical percentile. The benefit of the percentize function is that each value has a relatively clear interpretation: it is the percent of observations that got that value or below it. Similar to the normalise method, the percentise method is also performed on the input data set, i.e. tesco_matrix, as shown in the code chunk below.

heatmaply(percentize(tesco_matrix))
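
The idea of applying the ecdf of a variable to its own values can be tried in base R on a short made-up vector (toy_values is hypothetical). This sketch illustrates the idea only and is not the source code of percentize().

```r
# Hypothetical values for a single variable, with one tie
toy_values <- c(10, 20, 20, 40)

# ecdf() returns a function; applying it to the same values gives,
# for each value, the share of observations at or below it
toy_pct <- ecdf(toy_values)(toy_values)

toy_pct  # 0.25 0.75 0.75 1.00
```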

5.3 Clustering algorithm

heatmaply supports a variety of hierarchical clustering algorithms. The main arguments provided are:

distfun: function used to compute the distance (dissimilarity) between both rows and columns. Defaults to dist. The options “pearson”, “spearman” and “kendall” can be used for correlation-based clustering, which uses as.dist(1 - cor(t(x))) as the distance metric (using the specified correlation method).
hclustfun: function used to compute the hierarchical clustering when Rowv or Colv are not dendrograms. Defaults to hclust.
dist_method: default is NULL, which results in “euclidean” being used. It can accept alternative character strings indicating the method to be passed to distfun. By default distfun is dist, hence this can be one of “euclidean”, “maximum”, “manhattan”, “canberra”, “binary” or “minkowski”.
hclust_method: default is NULL, which results in the “complete” method being used. It can accept alternative character strings indicating the method to be passed to hclustfun. By default hclustfun is hclust, hence this can be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC).

In general, a clustering model can be calibrated either manually or statistically.
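
To make the correlation-based distance as.dist(1 - cor(t(x))) concrete, the base R sketch below builds it for three made-up rows (toy_rows is hypothetical) and passes it to hclust(). Rows that move together end up at distance 0, while rows that move in opposite directions end up at distance 2.

```r
# Hypothetical rows: b is a multiple of a, c runs in the opposite direction
toy_rows <- rbind(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  c = c(4, 3, 2, 1)
)

# cor() correlates columns, so transpose to correlate the rows instead
toy_cor_dist <- as.dist(1 - cor(t(toy_rows), method = "pearson"))

# Hierarchical clustering on the correlation-based distance
toy_tree <- hclust(toy_cor_dist, method = "average")

# Cut the tree into two groups: a and b fall together, c stands alone
toy_groups <- cutree(toy_tree, k = 2)
toy_groups
```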

5.3.2 Statistical approach

To determine the best clustering method and number of clusters, the dend_expend() and find_k() functions of the dendextend package will be used.

First, dend_expend() will be used to determine the recommended clustering method.

tesco_d <- dist(normalize(tesco_matrix), method = "euclidean")
dend_expend(tesco_d)[[3]]

##   dist_methods hclust_methods     optim
## 1      unknown         ward.D 0.5074206
## 2      unknown        ward.D2 0.5837509
## 3      unknown         single 0.5979904
## 4      unknown       complete 0.6074036
## 5      unknown        average 0.6918454
## 6      unknown       mcquitty 0.6100607
## 7      unknown         median 0.6382037
## 8      unknown       centroid 0.5914103

The output table shows that the “average” method should be used because it gave the highest optimum value (0.6918).

Next, find_k() is used to determine the optimal number of clusters.
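
As a rough way to reproduce this kind of comparison in base R, the sketch below computes the cophenetic correlation for each linkage method. It assumes that the optim column reports the correlation between the input distances and the cophenetic distances of each tree, which is our understanding of the dend_expend() default and should be checked against the dendextend manual. The built-in USArrests data stands in for the Tesco matrix.

```r
# Distance matrix on a built-in data set, standing in for tesco_d
toy_dist <- dist(scale(USArrests), method = "euclidean")

# The linkage methods listed in the output table above
hclust_methods <- c("ward.D", "ward.D2", "single", "complete",
                    "average", "mcquitty", "median", "centroid")

# For each method, correlate the original distances with the
# cophenetic distances implied by the resulting tree
coph_cor <- sapply(hclust_methods, function(m) {
  tree <- hclust(toy_dist, method = m)
  cor(toy_dist, cophenetic(tree))
})

# A higher value means the tree preserves the original distances better
round(coph_cor, 4)
names(which.max(coph_cor))
```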

tesco_clust <- hclust(tesco_d, method = "average")
num_k <- find_k(tesco_clust)
plot(num_k)

The figure above shows that k=3 would be good.

With reference to the statistical analysis results, we can prepare the code chunk as shown below.

heatmaply(normalize(tesco_matrix),
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 2)

Correlation map

heatmaply_cor(
  cor(tesco_measures),
  xlab = "Features",
  ylab = "Features",
  k_col = 2,
  k_row = 2
)

5.5 The finishing touch

Besides providing a wide collection of arguments for meeting statistical analysis needs, heatmaply also provides many plotting features to ensure that cartographic-quality heatmaps can be produced.

In the code chunk below the following arguments are used:

k_row is used to produce 2 groups.
margins is used to change the top margin to 60 and row margin to 200.
fontsize_row and fontsize_col are used to change the font size for row and column labels to 4 and 5 respectively.
main is used to write the main title of the plot.
xlab and ylab are used to write the x-axis and y-axis labels respectively.

heatmaply(normalize(tesco_matrix),  # rescale each column to the 0 to 1 range
          Colv = NA,                # leave the columns unclustered
          seriate = "none",         # no extra seriation of the dendrogram
          colors = Blues,
          k_row = 2,
          margins = c(NA, 200, 60, NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main = "Tesco Measures by Borough, 2017 \nData Transformation using Normalize Method",
          xlab = "Tesco measures",
          ylab = "Borough"
          )

Tesco Borough Hierarchical Clustering

Jufri Ramli

3/29/2020