Here I provide an example of how to analyze concept mapping data in R. This tutorial assumes that you have already collected and summed the data. The data set presented here is in its final form: for each pair of items (i.e., suggestions generated in response to the prompt), it contains the number of people who placed both items into the same pile (e.g., theme, category).

Much of this example is based on the University of Cincinnati’s R Programming Guide for Hierarchical Cluster Analysis: https://uc-r.github.io/hc_clustering.

Below are some packages that will be necessary for this tutorial.
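
If any of these packages are not yet installed, they can be installed once beforehand:

# One-time installation of the packages used in this tutorial
install.packages(c("MASS", "cluster", "factoextra", "ggpubr", "tidyverse"))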

library(MASS)
library(cluster)
library(factoextra)
library(ggpubr)
library(tidyverse)

The artificial data set includes fifteen items sorted by ten people in a concept mapping brainstorming session. The structure of the data set is similar to a correlation matrix: the values on either side of the diagonal mirror each other. The diagonal is 10 because each item is, by definition, always placed into the same pile as itself by every sorter; the diagonal therefore always equals the number of people who sorted the responses.
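
If you are starting from the raw sort data rather than the summed matrix, the matrix can be built by counting, for every pair of items, how many sorters placed them in the same pile. Below is a minimal sketch, assuming a hypothetical long-format data frame rawSorts with columns person, item, and pile (these names are illustrative and not part of the tutorial data):

# Hypothetical input: one row per person-item combination, recording the pile
# that person placed the item into. Column names are assumptions for illustration.
buildSimilarityMatrix = function(rawSorts) {
  items = sort(unique(as.character(rawSorts$item)))
  simMat = matrix(0, nrow = length(items), ncol = length(items),
                  dimnames = list(items, items))
  for (p in unique(rawSorts$person)) {
    personSort = rawSorts[rawSorts$person == p, ]
    idx = as.character(personSort$item)
    # 1 if this person put the two items into the same pile, 0 otherwise
    same = outer(personSort$pile, personSort$pile, "==") * 1
    simMat[idx, idx] = simMat[idx, idx] + same
  }
  simMat
}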

setwd("~/Desktop")
datTest = read.csv("datCorMD.csv", header = TRUE)
datTest
##    Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11
## 1     10     4     5     1     2     8     9     7     6      4      3
## 2      4    10     9     7     1     5     6     7     8      5      4
## 3      5     9    10     6     3     1     2     4     5      7      9
## 4      1     7     6    10     5     4     3     2     7      8      9
## 5      2     1     3     5    10     4     6     7     8      1      2
## 6      8     5     1     4     4    10     1     2     4      3      2
## 7      9     6     2     3     6     1    10     6     7      5      2
## 8      7     7     4     2     7     2     6    10     4      5      3
## 9      6     8     5     7     8     4     7     4    10      6      7
## 10     4     5     7     8     1     3     5     5     6     10      5
## 11     3     4     9     9     2     2     2     3     7      5     10
## 12     2     2     4     2     3     3     3     2     8      6      5
## 13     1     3     6     7     4     1     1     6     9      4      6
## 14     4     4     8     9     9     3     2     5     1      2      7
## 15     3     5     9     9     7     2     3     3     2      5      4
##    Item12 Item13 Item14 Item15
## 1       2      1      4      3
## 2       2      3      4      5
## 3       4      6      8      9
## 4       2      7      9      9
## 5       3      4      9      7
## 6       3      1      3      2
## 7       3      1      2      3
## 8       2      6      5      3
## 9       8      9      1      2
## 10      6      4      2      5
## 11      5      6      7      4
## 12     10      7      8      8
## 13      7     10      9      4
## 14      8      9     10      6
## 15      8      4      6     10
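
Before clustering, it is worth a quick sanity check that the matrix has the structure described above (unname() drops the mismatched row and column names so that only the values are compared):

m = as.matrix(datTest)
isSymmetric(unname(m))  # both sides of the diagonal should mirror each other
all(diag(m) == 10)      # the diagonal should equal the number of sorters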

Next, we use agglomerative hierarchical clustering via the agnes function in the cluster package. AGNES starts by placing each item into its own cluster and then repeatedly merges the two most similar clusters; this process continues until all items have been combined into a single cluster. Because in this example I am supplying the original data set and not a dissimilarity matrix, I set diss to FALSE. I also standardize the data (stand = TRUE), which converts the variables to z-scores so that they are all on a common scale before the dissimilarities are computed; the reduction to two dimensions that lets us plot the data on x and y coordinates happens later, in fviz_cluster. Then the method of partitioning into clusters is selected. Ward's method attempts to minimize the total within-cluster variance. For information about other methods see: https://uc-r.github.io/hc_clustering

hcWard = agnes(datTest, diss = FALSE, stand = TRUE, method = "ward")
hcC = agnes(datTest, diss = FALSE, stand = TRUE, method = "complete")
hcWeighted = agnes(datTest, diss = FALSE, stand = TRUE, method = "weighted")
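
For comparison, the same clustering can be run starting from a precomputed dissimilarity matrix, in which case diss is set to TRUE. A minimal sketch using daisy from the cluster package (Euclidean distances on the standardized data):

# Equivalent call starting from a dissimilarity object rather than the raw data
dissMat = daisy(datTest, metric = "euclidean", stand = TRUE)
hcWardDiss = agnes(dissMat, diss = TRUE, method = "ward")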

A bonus of hierarchical clustering with the agnes function is that it produces an agglomerative coefficient, which provides an indication of the strength of the clustering structure found, where a coefficient closer to one indicates a stronger structure. Below we show that Ward's method is the best fit relative to the complete and weighted methods because it has the highest agglomerative coefficient.

hcWard$ac
## [1] 0.6413872
hcC$ac
## [1] 0.4664918
hcWeighted$ac
## [1] 0.357097
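
To compare more linkage methods at once, one convenient pattern (a sketch using purrr from the tidyverse) is to map over the method names and pull out each agglomerative coefficient:

# Agglomerative coefficient for several linkage methods
methods = c("average", "single", "complete", "ward", "weighted")
getAC = function(m) agnes(datTest, diss = FALSE, stand = TRUE, method = m)$ac
purrr::map_dbl(purrr::set_names(methods), getAC)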

Now that we have the hierarchical clustering, we can cut the tree to group the items into clusters. I will start by selecting three clusters; however, we will evaluate this decision in the next step.

Then, so we can see which cluster each item is placed into, we use the mutate function to append the cluster assignment variable to the items.

hcTree = cutree(hcWard, k=3)
datTest %>%
  mutate(cluster = hcTree)
##    Item1 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9 Item10 Item11
## 1     10     4     5     1     2     8     9     7     6      4      3
## 2      4    10     9     7     1     5     6     7     8      5      4
## 3      5     9    10     6     3     1     2     4     5      7      9
## 4      1     7     6    10     5     4     3     2     7      8      9
## 5      2     1     3     5    10     4     6     7     8      1      2
## 6      8     5     1     4     4    10     1     2     4      3      2
## 7      9     6     2     3     6     1    10     6     7      5      2
## 8      7     7     4     2     7     2     6    10     4      5      3
## 9      6     8     5     7     8     4     7     4    10      6      7
## 10     4     5     7     8     1     3     5     5     6     10      5
## 11     3     4     9     9     2     2     2     3     7      5     10
## 12     2     2     4     2     3     3     3     2     8      6      5
## 13     1     3     6     7     4     1     1     6     9      4      6
## 14     4     4     8     9     9     3     2     5     1      2      7
## 15     3     5     9     9     7     2     3     3     2      5      4
##    Item12 Item13 Item14 Item15 cluster
## 1       2      1      4      3       1
## 2       2      3      4      5       2
## 3       4      6      8      9       3
## 4       2      7      9      9       3
## 5       3      4      9      7       1
## 6       3      1      3      2       1
## 7       3      1      2      3       1
## 8       2      6      5      3       1
## 9       8      9      1      2       2
## 10      6      4      2      5       2
## 11      5      6      7      4       3
## 12     10      7      8      8       3
## 13      7     10      9      4       3
## 14      8      9     10      6       3
## 15      8      4      6     10       3
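
It can also help to look at the dendrogram with the chosen cut marked. A minimal sketch using fviz_dend from factoextra, which accepts agnes objects directly:

# Dendrogram of the Ward solution with the three clusters boxed
fviz_dend(hcWard, k = 3, rect = TRUE)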

Now we want to figure out how many clusters best fit the data, given the specified method for partitioning. Three common approaches are:

Elbow method = visually inspect the plot of total within-cluster sum of squares and look for the "elbow," the point where adding more clusters stops producing large improvements

Average silhouette method = measures how well, on average, each item lies within its cluster, computed across different numbers of clusters

Gap statistic method = compares the total intracluster variation to what we would expect from a simulated reference data set with no inherent clustering structure, across different numbers of clusters

Unfortunately, the three approaches give different answers. This is to be expected, since I randomly generated the data. We will stick with three clusters for the sake of the tutorial and move on to plotting.

fviz_nbclust(datTest, FUN = hcut, method = "wss")

fviz_nbclust(datTest, FUN = hcut, method = "silhouette")

gap_stat = clusGap(datTest, FUN = hcut, nstart = 25, K.max = 10, B = 10)
fviz_gap_stat(gap_stat)
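
For the three-cluster solution we chose, the silhouette widths can also be inspected item by item. A minimal sketch using silhouette from the cluster package, which needs the cluster assignments plus a dissimilarity matrix (here, Euclidean distances on the standardized data):

# Per-item silhouette widths for the three-cluster solution
sil = silhouette(hcTree, dist(scale(datTest)))
summary(sil)
plot(sil)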

Finally, we can plot the items on a two-dimensional map where each cluster is outlined and highlighted in a different color.

fviz_cluster(list(data =datTest, cluster = hcTree))
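
Because the data have more than two variables, fviz_cluster plots the items on the first two principal components. If you want those coordinates yourself, for example to build a customized map, a minimal sketch is:

# First two principal component scores of the standardized data,
# with each item's cluster assignment attached
pcs = prcomp(datTest, scale. = TRUE)$x[, 1:2]
plotDat = data.frame(pcs, cluster = factor(hcTree))
head(plotDat)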