Let us explore some applications of the famous loop functions in R - apply(), sapply() and tapply() - and how they work. In this post I aim to explain some applications of these loop functions through a simple cluster analysis example using k-means clustering. To appreciate the advantage of using the loop functions, I will also take you through solving the same problem using the common for loop.
To demonstrate cluster analysis I will use the famous Fisher’s iris flower data. To get the documentation of the data, type
?iris
in your R console. The data contains four numerical variables and one categorical variable that labels each observation as a particular species of iris flower.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
For now, let us assume that the data is not labeled, i.e. we do not have the last variable, and all we are required to do is to find similar observations using some similarity measure and group them together.
#Removing the last column (Species) of the data
iris.n <- iris[, -5]
head(iris.n)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
An important objective of cluster analysis is to minimize the variance within the clusters. The idea is that the variance within each cluster should be as small as possible and the variance between the clusters should be as large as possible.
K-means clustering finds a cluster solution for the data once the required number of clusters is specified. In R, the kmeans() function achieves this. For example, the code below finds a 3-cluster solution for the iris.n data.
clust <- kmeans(iris.n, 3)
Here I would like to bring to your notice that the object clust does not contain the segmented data. Instead, the object clust is composed of a number of attributes.
#List of attributes contained in clust
attributes(clust)
## $names
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
##
## $class
## [1] "kmeans"
Each of these attributes can be extracted using the $ operator. For example, the following two lines of code extract the within-cluster sum of squares (WSS) and the total WSS respectively from the (kmeans) object clust.
#WSS
clust$withinss
## [1] 23.87947 15.15100 39.82097
#Total WSS
clust$tot.withinss
## [1] 78.85144
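As a side note, these attributes fit together exactly as the clustering objective above suggests: the Total SS of the data splits into a within-cluster part and a between-cluster part. A minimal sanity check on the clust object (not part of the original walkthrough, just an illustration):
#Total SS = Total WSS + between-cluster SS
clust$betweenss + clust$tot.withinss
clust$totss
all.equal(clust$totss, clust$betweenss + clust$tot.withinss) #should be TRUE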
To do a k-means clustering we are required to specify the number of clusters in advance. But before doing so we must figure out how many clusters would be appropriate. The validity of the number of clusters can be checked through a scree plot. A scree plot is a plot of the number of clusters versus the Total WSS. But before we draw the scree plot, let us answer one question: what is the Total SS of the data?
Total SS is given by the formula, Total SS = (var(Variable1) + var(Variable2) + … + var(Variable k))*(n-1), where n is the number of observations.
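This formula is simply the definition of the sample variance rearranged: var(x) is a column's sum of squared deviations from its mean divided by (n-1), so multiplying the sum of the column variances back by (n-1) recovers the total sum of squares. A quick check for a single column (Sepal.Length is used purely as an illustration):
x <- iris.n$Sepal.Length
sum((x - mean(x))^2) #sum of squared deviations of this column about its mean
var(x) * (nrow(iris.n) - 1) #the same quantity, recovered from the variance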
Let us compute this Total SS.
Computing Total SS Using for loop
SS <- numeric()
for(i in 1:ncol(iris.n))
{
SS[i] <- var(iris.n[,i])
}
TSS <- sum(SS)*(nrow(iris.n)-1)
TSS
## [1] 681.3706
Computing Total SS Using apply()
The above loop can easily be avoided and the program written in a single line using the apply() function. Understand that in the calculation of the Total SS we need to compute the variance of every (numeric) column of the data set. This can be achieved very simply with the apply() function as follows.
apply(iris.n, MARGIN = 2, FUN = var) #Note MARGIN=2 represents columns
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
#Or simply
apply(iris.n, 2, var)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.6856935 0.1899794 3.1162779 0.5810063
Next, the calculation of Total SS requires summing up all these variances and multiplying the result by (n-1). So to put the entire thing in a single line we may write,
#Calculation of Total SS using apply() function
SS <- sum(apply(iris.n, 2, var)) * (nrow(iris.n) - 1)
SS
## [1] 681.3706
And we have avoided the for loop entirely.
[Check out the appendix for some basic examples on apply]
Note that there is another way to do it. Total SS could also have been obtained from kmeans() by setting centers equal to 1 and extracting the totss (Total SS) attribute from the result.
kmeans(iris.n, centers = 1)$totss
## [1] 681.3706
Keep this method in mind. We are going to use this method to compute the WSS in the coming section.
As we said, a scree plot in cluster analysis is a plot of the number of clusters versus the Total WSS. Let us iterate to see how the value of Total WSS changes as the number of clusters changes. First let us try this using a for loop, and later discover how the sapply() function makes it even easier and more compact.
Computing WSS for 1 to 15 cluster solutions using for loop
WSS <- numeric(15)
for(i in 1:15)
{
#Extract total WSS for i-cluster solution
WSS[i] <- kmeans(iris.n, centers = i)$tot.withinss
}
cbind(No.of.Clusters=1:15, WSS)
## No.of.Clusters WSS
## [1,] 1 681.37060
## [2,] 2 152.34795
## [3,] 3 142.75352
## [4,] 4 57.22847
## [5,] 5 49.82228
## [6,] 6 47.61943
## [7,] 7 37.59074
## [8,] 8 32.55390
## [9,] 9 28.16728
## [10,] 10 29.93024
## [11,] 11 26.38821
## [12,] 12 25.63757
## [13,] 13 26.57428
## [14,] 14 23.83452
## [15,] 15 19.72982
Note that the above table gives the values of Total WSS for each of the 15 cluster solutions. Plotting this would generate our scree plot. But before that, as I mentioned, let us figure out how this can be done in a single step using the sapply() function.
Computing WSS for 1 to 15 cluster solutions using sapply()
WSS. <- sapply(1:15, function(i){return(kmeans(iris.n, centers = i)$tot.withinss)})
cbind(No.of.Clusters=1:15, WSS.)
## No.of.Clusters WSS.
## [1,] 1 681.37060
## [2,] 2 152.34795
## [3,] 3 78.85144
## [4,] 4 71.44525
## [5,] 5 49.85942
## [6,] 6 39.03999
## [7,] 7 37.59074
## [8,] 8 30.30321
## [9,] 9 31.46078
## [10,] 10 27.22953
## [11,] 11 30.40404
## [12,] 12 23.87618
## [13,] 13 24.40182
## [14,] 14 22.01979
## [15,] 15 23.49559
The sapply() function applies a specified function to each element of a vector. In the above code, the function is,
function(i)
{
return(kmeans(iris.n, centers = i)$tot.withinss)
}
So, for every value of the vector 1:15, sapply() will apply the above function and return the corresponding Total WSS value.
[Check out the appendix for some basic examples on sapply]
Note: The above two outputs of WSS are different. (Can you tell why?) Well, every time you run kmeans() the initial centers are randomly chosen, and therefore the output differs slightly from run to run.
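If you want the two runs to agree (and the results to be reproducible), a minimal sketch is to fix the random seed before the calls and, optionally, use the nstart argument of kmeans() so that several random starts are tried and the best one is kept (the particular values below are illustrative, not the settings used in this post):
set.seed(123) #any fixed seed makes the random initialization repeatable
WSS. <- sapply(1:15, function(i){kmeans(iris.n, centers = i, nstart = 25)$tot.withinss})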
Plotting the scree plot
plot(1:15, WSS., type="l", xlab = "No. of clusters", ylab = "Total WSS", main = "Scree Plot")
The scree plot is a bit confusing. A proper elbow is not formed; it is kind of a broken elbow. There are other methods for finding the optimal number of clusters, such as using the silhouette distance. However, note from the scree plot that the Total WSS falls sharply from 1 to 3 clusters and thereafter declines at a very slow rate. Without getting into much hassle, let us take the optimal number of clusters to be 3 and proceed to our next step. We will discuss more about optimal cluster selection in further posts.
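For the curious, the silhouette approach mentioned above could be sketched with the cluster package as follows (this is only a pointer, not the method used in this post; the range 2 to 10 and the single kmeans() run per k are illustrative):
library(cluster) #provides silhouette()
d <- dist(iris.n) #pairwise distances, computed once
avg.sil <- sapply(2:10, function(k){
  cl <- kmeans(iris.n, centers = k)
  mean(silhouette(cl$cluster, d)[, "sil_width"]) #average silhouette width for k clusters
})
avg.sil #the k with the largest average silhouette width is a reasonable choice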
First we will use the kmeans() function to identify the segments and then we will label each observation by its cluster number.
#Creating a 3 cluster solution
clust.n <- kmeans(iris.n, 3)
#Now let us create a variable named clusterNo. that labels each observation by the cluster it is grouped into.
iris.n$clusterNo. <- clust.n$cluster
head(iris.n, n=10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width clusterNo.
## 1 5.1 3.5 1.4 0.2 3
## 2 4.9 3.0 1.4 0.2 3
## 3 4.7 3.2 1.3 0.2 3
## 4 4.6 3.1 1.5 0.2 3
## 5 5.0 3.6 1.4 0.2 3
## 6 5.4 3.9 1.7 0.4 3
## 7 4.6 3.4 1.4 0.3 3
## 8 5.0 3.4 1.5 0.2 3
## 9 4.4 2.9 1.4 0.2 3
## 10 4.9 3.1 1.5 0.1 3
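Before summarizing the variables, it can be handy to check how many observations fall in each cluster. A quick sketch (the exact counts depend on the particular kmeans() run, so the output is omitted here):
#Number of observations per cluster
table(iris.n$clusterNo.)
#Equivalently, from the size attribute of the kmeans object
clust.n$size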
Now that we have segregated the observations into clusters, we would be interested in looking at summaries of the variables within each cluster. For example, we would like to observe a summary (like the mean, sd, etc.) of the variable sepal length for each of the clusters. This is where the tapply() function comes in. The general format of the tapply() function is as follows:
tapply(summarizing_variable, group_variable, function)
In our problem, the summarizing variable would be Sepal.Length, the group variable would be clusterNo. and the function… let us start with mean. So the code goes as follows:
tapply(iris.n$Sepal.Length, iris.n$clusterNo., mean)
## 1 2 3
## 5.901613 6.850000 5.006000
And here it is. Note that we get the mean value of the variable sepal length for each of the clusters. So tapply() takes a numerical variable, segregates it by the categories of another variable, applies the function to each segment and returns the results of the numerical variable by the categories of the group variable. In the same way, if we change Sepal.Length to Sepal.Width and change the function to min, we get the minimum value of the variable sepal width for each cluster.
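A quick sketch of that variation (output omitted since the cluster labels depend on the particular kmeans() run):
#Minimum sepal width in each cluster
tapply(iris.n$Sepal.Width, iris.n$clusterNo., min)
Now let us have a look at how we could have done the same computation using a for loop.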
Calculating the mean of sepal length for each cluster using for loop
avg <- numeric()
for(i in sort(unique(iris.n$clusterNo.)))
{
avg[i] <- mean(iris.n$Sepal.Length[iris.n$clusterNo. == i])
}
names(avg) <- sort(unique(iris.n$clusterNo.))
avg
## 1 2 3
## 5.901613 6.850000 5.006000
The above output is the same as the one the tapply() function gave. However, note that tapply() compressed the entire code written above into a single line.
This is how these loop functions make coding more efficient. Lazy coders keep looking for functions like these, and if they cannot find them, they simply write them.
The apply() function accepts a matrix, a data frame or an array and applies a specified function over its rows or columns. The general apply() syntax goes like this,
apply(X, MARGIN, FUN)
where,
X is the matrix, data frame or array,
MARGIN specifies the dimension over which the function is applied (1 for rows, 2 for columns), and
FUN is the function to be applied.
For example, consider the following matrix:
data <- matrix(c(3,2,4,5,8,1), 3, 2)
data
## [,1] [,2]
## [1,] 3 5
## [2,] 2 8
## [3,] 4 1
Suppose that you are interested in finding the maximum value of each column; then using apply() we write:
apply(data, 2, max)
## [1] 4 8
The above call takes the data and applies the function max to its columns (specified by MARGIN = 2). If we change the margin to 1, the same function will be applied to the rows of the data; for instance, the following returns the minimum value of each row.
apply(data, 1, min)
## [1] 3 2 1
The following code computes the variance of each column of the data:
apply(data, 2, var)
## [1] 1.00000 12.33333
If you recall, we did the same thing while computing the variance of each of the variables in the data iris.n. I hope you are able to connect the dots now.
sapply() takes a vector as an argument and returns another vector of the same length, obtained by applying a specified function to each element of the input vector. Its syntax is,
sapply(X, FUN)
where,
X is the vector and
FUN is the function to be applied to each element of X.
Example 1
p <- c(3,2,4,5,1)
sapply(p, sqrt) #will return the vector of square root values for each element of p
## [1] 1.732051 1.414214 2.000000 2.236068 1.000000
sapply(p, function(x){x^3}) #Will return the vector containing the cube of each element of p
## [1] 27 8 64 125 1
Example 2
#Conversion of Fahrenheit to Celsius
Fahrenheit <- c(100, -40, 98.4, -273)
#Conversion to Celsius
Celsius <- sapply(Fahrenheit, function(x){(5*(x-32)/9)})
Celsius
## [1] 37.77778 -40.00000 36.88889 -169.44444