The following code reproduces the examples by Prof. Peng from the Exploratory Data Analysis course on Coursera. The goal is to experiment with different settings of the heatmap and image functions.
Let’s create a dataset with random data.
set.seed(12345)
par(mar = rep(0.2, 4))
dataMatrix <- matrix(rnorm(400), nrow = 40)
dim(dataMatrix)
## [1] 40 10
dataMatrix[1:5,]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 0.5855 1.1285 0.6454 1.5449 -0.4876 -1.4361 -0.7001 -1.5139
## [2,] 0.7095 -2.3804 1.0431 1.3215 0.3032 -0.6293 -0.5674 0.1643
## [3,] -0.1093 -1.0603 -0.3044 0.3222 -0.2420 0.2435 -0.2614 -0.8709
## [4,] -0.4535 0.9371 2.4771 1.5310 -0.4817 1.0584 -1.0639 1.5933
## [5,] 0.6059 0.8545 0.9712 -0.4212 -0.9918 0.8313 -0.1064 0.6466
## [,9] [,10]
## [1,] 0.3803 -0.37582
## [2,] 0.6051 -1.81283
## [3,] 1.0197 0.28860
## [4,] 0.4749 -0.18962
## [5,] -2.1859 0.01786
We have a dataset of 40x10 random normal variables. No pattern is present.
The image function draws a grid of colored cells: x and y give the cell positions, and the color of each cell encodes the corresponding value in z.
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
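The transpose and the reversed column index in this call are easy to misread, so here is a minimal sketch on a hypothetical 2x3 matrix `m` (not part of the original data). image puts the rows of z along the x axis and the columns along the y axis, starting at the bottom left, so passing `t(m)[, nrow(m):1]` makes the picture match the printed matrix layout: first row on top, first column on the left.

```r
# Hypothetical toy matrix, used only to illustrate the orientation.
m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns
z <- t(m)[, nrow(m):1]          # 3 x 2: x indexes columns of m, y indexes rows of m reversed

pdf(NULL)                       # null device, so no window or file is needed
image(1:ncol(m), 1:nrow(m), z)
invisible(dev.off())

# The cell drawn at x = 1, y = nrow(m) (top left) holds m[1, 1]
top_left <- z[1, nrow(m)]
```

Without the reversal, the first row of the matrix would be drawn at the bottom of the plot.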
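The heatmap function runs a hierarchical cluster analysis on both the rows and the columns and reorders them accordingly: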
par(mar = rep(0.2, 4))
heatmap(dataMatrix)
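As a rough sketch of what heatmap gives back besides the picture, the example below captures its invisible return value. It assumes the default arguments, under which rows and columns are clustered with hclust on dist and each row is scaled before coloring. The names `hm` and `hmRaw` are introduced here for illustration only.

```r
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)

pdf(NULL)                         # null device: nothing is written to disk
hm <- heatmap(dataMatrix)         # defaults: row and column dendrograms, scale = "row"
invisible(dev.off())

# rowInd and colInd are the permutations used to reorder the display
head(hm$rowInd)
hm$colInd

# scale = "none" colors the raw values instead of row-standardized ones
pdf(NULL)
hmRaw <- heatmap(dataMatrix, scale = "none")
invisible(dev.off())
```

On pure noise the resulting order carries no meaning; it becomes informative once the rows actually differ in a systematic way.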
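Now let's add a pattern. For each row we flip a fair coin, and on heads we add 3 to the last five columns of that row, so on average half of the rows receive the pattern.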
set.seed(678910)
for (i in 1:40) {
  # flip a coin
  coinFlip <- rbinom(1, size = 1, prob = 0.5)
  # if coin is heads add a common pattern to that row
  if (coinFlip) {
    dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 3), each = 5)
  }
}
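To see exactly what the loop changed, the following sketch reruns it on a copy of the data while recording the coin flips. The names `base`, `shifted`, `flips`, and `delta` are introduced here for illustration and do not appear in the original code.

```r
set.seed(12345)
base <- matrix(rnorm(400), nrow = 40)   # the original noise matrix
shifted <- base

set.seed(678910)
flips <- logical(40)                    # TRUE where the coin came up heads
for (i in 1:40) {
  flips[i] <- rbinom(1, size = 1, prob = 0.5) == 1
  if (flips[i]) {
    shifted[i, ] <- shifted[i, ] + rep(c(0, 3), each = 5)
  }
}

delta <- shifted - base                 # what was added to each cell
sum(flips)                              # number of rows that received the pattern
round(colMeans(delta), 2)               # zero for columns 1 to 5, positive for 6 to 10
```

Only columns 6 to 10 of the rows with `flips == TRUE` differ from the original noise.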
The pattern adds 3 to the last 5 columns of the selected rows, so in the plot we expect the right half (columns 6 to 10) of those rows to show higher values.
par(mar = rep(0.2, 4))
image(1:10, 1:40, t(dataMatrix)[, nrow(dataMatrix):1])
Let’s reorder the rows by similarity, using hierarchical clustering on the distances between rows.
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order, ]
The rows of dataMatrixOrdered now follow the leaf order of the dendrogram, so rows that are similar to each other end up next to each other.
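How `hclust(...)$order` behaves is easier to see on a tiny example. The sketch below uses a hypothetical 4-row matrix `toy` with two obvious groups and the default settings of dist (Euclidean distance) and hclust (complete linkage).

```r
# Hypothetical toy data: rows 1 and 3 are close, rows 2 and 4 are close.
toy <- rbind(c(0, 0), c(10, 10), c(0.1, 0), c(10, 10.2))

hc  <- hclust(dist(toy))   # Euclidean distance, complete linkage (the defaults)
ord <- hc$order            # left-to-right leaf order of the dendrogram
toyOrdered <- toy[ord, ]   # similar rows are now adjacent

# cutree assigns each original row to one of k groups
groups <- cutree(hc, k = 2)
```

The order is a permutation of the row indices, not a ranking: it only guarantees that the members of each cluster are contiguous.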
par(mfrow = c(1, 3))
image(t(dataMatrixOrdered)[, nrow(dataMatrixOrdered):1])
plot(rowMeans(dataMatrixOrdered), 40:1, xlab = "Row Mean", ylab = "Row", pch = 19)
plot(colMeans(dataMatrixOrdered), xlab = "Column", ylab = "Column Mean", pch = 19)
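Reordering the rows changes which row each row mean belongs to, but it cannot change the column means, so the third panel would look the same on the unordered matrix. A minimal check, rebuilding the data with the same seeds as above so that it runs on its own (`sameColMeans` and `sameRowMeans` are names introduced here):

```r
set.seed(12345)
dataMatrix <- matrix(rnorm(400), nrow = 40)
set.seed(678910)
for (i in 1:40) {
  if (rbinom(1, size = 1, prob = 0.5)) {
    dataMatrix[i, ] <- dataMatrix[i, ] + rep(c(0, 3), each = 5)
  }
}
hh <- hclust(dist(dataMatrix))
dataMatrixOrdered <- dataMatrix[hh$order, ]

# Column means are invariant under a row permutation
sameColMeans <- isTRUE(all.equal(colMeans(dataMatrix),
                                 colMeans(dataMatrixOrdered)))
# Row means are the same set of values, only in a different order
sameRowMeans <- isTRUE(all.equal(sort(rowMeans(dataMatrix)),
                                 sort(rowMeans(dataMatrixOrdered))))
```

The row-mean panel is therefore the one that benefits from the clustering, since it groups the shifted rows together.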