Clustering Methods With R

`dist` function

dist() calculates distance and returns a distance matrix. The method argument specifies the distance metric to be used and defaults to Euclidean distance. Examples follow:

Euclidean distance

Let us first read example data and look what is in it.

eg1 <- read.csv("Data/example1.csv")
eg1

##         L1 article
## 1   Korean     0.6
## 2   German     0.8
## 3 Japanese     0.5

We then calculate distance between the elements in the second column.

dist(eg1[ , 2])

##     1   2
## 2 0.2    
## 3 0.1 0.3

Alternatively, we can specify the column by the column name as well.

dist(eg1$article)

##     1   2
## 2 0.2    
## 3 0.1 0.3

Similarly, we can compute distance in higher-dimensional spaces.

# 2D
eg2 <- read.csv("Data/example2.csv")
eg2

##         L1 article  ed
## 1   Korean     0.4 0.8
## 2   German     0.8 0.6
## 3 Japanese     0.6 0.5

dist(eg2[ , 2:3])

##           1         2
## 2 0.4472136          
## 3 0.3605551 0.2236068

#3D
eg3 <- read.csv("Data/example3.csv")
eg3

##         L1 article  ed   s
## 1   Korean     0.6 0.9 0.6
## 2   German     0.3 0.6 0.9
## 3 Japanese     0.9 0.4 0.5

dist(eg3[ , 2:4])

##           1         2
## 2 0.5196152          
## 3 0.5916080 0.7483315

Manhattan distance

Manhattan distance can be computed by specifying “manhattan” to the method argument of dist().

dist(eg2[ , 2:3], method = "manhattan")

##     1   2
## 2 0.6    
## 3 0.5 0.3

Hierarchical clustering

The following code snippet performs an agglomerative hierarchical cluster analysis with squared Euclidean distance and the Ward’s method. Before running the analysis, however, we assign L1 labels to the row names of the data frame so that the resulting dendrogram uses L1 labels.

rownames(eg3) <- eg3$L1
# The row names represent L1.
eg3

##                L1 article  ed   s
## Korean     Korean     0.6 0.9 0.6
## German     German     0.3 0.6 0.9
## Japanese Japanese     0.9 0.4 0.5

We then obtain a distance matrix as before.

eg3.dist <- dist(eg3[ , 2:4])
eg3.dist

##             Korean    German
## German   0.5196152          
## Japanese 0.5916080 0.7483315

The function hclust performs a hierarhical cluster analysis. The method argument specifies the linkage criterion.

eg3.hclust <- hclust(eg3.dist, method = "ward.D2")

We can draw a dendrogram using plot(). By specifying hang = -1, all the leaves are aligned.

plot(eg3.hclust, hang = -1)

More authentic example

We now cluster more authentic data. The data we use are the relative frequency of eight error tags in 12 nationalities in EF-Cambridge Open Language Database (EFCAMDAT). We first read the data as usual and look at the data.

error.freq <- read.csv("Data/error_freq.csv")
error.freq

##    nationality        AR       PR       VT       PL       AG        WO
## 1           br  5.584696 6.052505 3.128397 1.904357 1.990424 1.3657120
## 2           cn  5.592782 4.125751 3.551567 2.442850 1.747515 0.6549510
## 3           de  3.233303 3.957666 3.074154 1.231066 1.156172 1.1702148
## 4           es  3.739341 5.472206 2.713302 1.254047 1.459255 1.0488394
## 5           fr  4.898914 5.462552 3.169369 2.179051 1.455627 1.3660770
## 6           it  4.921787 4.939960 2.706239 2.255199 1.361380 1.4423363
## 7           jp  8.659891 4.806812 3.044314 2.826863 1.365745 0.8888787
## 8           kr 10.185876 5.031577 3.738282 2.935873 1.991863 1.1611332
## 9           mx  4.660731 5.724839 3.476950 1.764889 1.919061 1.4425301
## 10          ru  9.093976 4.476776 2.670699 1.618846 1.478422 1.1154406
## 11          tr  8.705263 5.115789 3.257895 2.773684 1.563158 1.2421053
## 12          tw  7.290674 5.102923 4.592356 3.181435 2.314021 0.8015350
##           CO        SI
## 1  0.5449309 0.3771483
## 2  0.7863817 0.3458317
## 3  0.5125541 0.2995750
## 4  0.2280086 0.2508094
## 5  0.5197414 0.5074502
## 6  0.3436494 0.4279096
## 7  1.1101447 0.4844961
## 8  0.8401695 0.4814455
## 9  0.5864995 0.3902809
## 10 0.7091835 0.3382532
## 11 0.9052632 0.5105263
## 12 0.7768301 0.4556672

What each error tag represents is as follows:

AR = article
PR = preposition
VT = verb tense (and aspect?)
PL = plural
AG = verb agreement
WO = word order
CO = conjunction (?)
SI = singular (?)

Notice that the absolute frequency varies across error types (e.g., AR is much more frequent than SI). We therefore need to standardize variables. The scale functoin standardizes a variable. To standardize multiple columns, we can use apply() that applies a function row-wise (when MARGIN = 1) or column-wise (when MARGIN = 2).

error.freq2 <- apply(error.freq[ , 2:ncol(error.freq)], MARGIN = 2, scale)
error.freq2

##               AR          PR           VT          PL         AG
##  [1,] -0.3445771  1.65628001 -0.244719210 -0.44646068  0.9993837
##  [2,] -0.3410764 -1.44183999  0.540423186  0.37409985  0.2858129
##  [3,] -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161
##  [4,] -1.1434988  0.72318945 -1.014878242 -1.43740982 -0.5609801
##  [5,] -0.6414772  0.70766717 -0.168700120 -0.02787943 -0.5716371
##  [6,] -0.6315747 -0.13263309 -1.027982431  0.08815636 -0.8484962
##  [7,]  0.9867872 -0.34672898 -0.400724359  0.95926373 -0.8356748
##  [8,]  1.6474415  0.01468173  0.886851381  1.12537452  1.0036089
##  [9,] -0.7445955  1.12941050  0.401980447 -0.65898342  0.7897465
## [10,]  1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734
## [11,]  1.0064303  0.15009072 -0.004451063  0.87822901 -0.2557545
## [12,]  0.3940032  0.12940232  2.471484096  1.49956413  1.9499803
##                WO         CO         SI
##  [1,]  0.87668665 -0.4458702 -0.3342733
##  [2,] -1.90425741  0.5297247 -0.6998565
##  [3,]  0.11177852 -0.5766907 -1.2398486
##  [4,] -0.36311806 -1.7264120 -1.8091277
##  [5,]  0.87811493 -0.5476499  1.1868451
##  [6,]  1.17648906 -1.2591589  0.2583042
##  [7,] -0.98898499  1.8379062  0.9188837
##  [8,]  0.07624572  0.7470573  0.8832709
##  [9,]  1.17724720 -0.2779101 -0.1809656
## [10,] -0.10253245  0.2178012 -0.7883272
## [11,]  0.39305938  1.0100715  1.2227549
## [12,] -1.33072855  0.4911310  0.5823401

As before, we assign the nationality code as the rowname of the matrix.

rownames(error.freq2) <- sort(unique(error.freq$nationality))
error.freq2

##            AR          PR           VT          PL         AG          WO
## br -0.3445771  1.65628001 -0.244719210 -0.44646068  0.9993837  0.87668665
## cn -0.3410764 -1.44183999  0.540423186  0.37409985  0.2858129 -1.90425741
## de -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161  0.11177852
## es -1.1434988  0.72318945 -1.014878242 -1.43740982 -0.5609801 -0.36311806
## fr -0.6414772  0.70766717 -0.168700120 -0.02787943 -0.5716371  0.87811493
## it -0.6315747 -0.13263309 -1.027982431  0.08815636 -0.8484962  1.17648906
## jp  0.9867872 -0.34672898 -0.400724359  0.95926373 -0.8356748 -0.98898499
## kr  1.6474415  0.01468173  0.886851381  1.12537452  1.0036089  0.07624572
## mx -0.7445955  1.12941050  0.401980447 -0.65898342  0.7897465  1.17724720
## ru  1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734 -0.10253245
## tr  1.0064303  0.15009072 -0.004451063  0.87822901 -0.2557545  0.39305938
## tw  0.3940032  0.12940232  2.471484096  1.49956413  1.9499803 -1.33072855
##            CO         SI
## br -0.4458702 -0.3342733
## cn  0.5297247 -0.6998565
## de -0.5766907 -1.2398486
## es -1.7264120 -1.8091277
## fr -0.5476499  1.1868451
## it -1.2591589  0.2583042
## jp  1.8379062  0.9188837
## kr  0.7470573  0.8832709
## mx -0.2779101 -0.1809656
## ru  0.2178012 -0.7883272
## tr  1.0100715  1.2227549
## tw  0.4911310  0.5823401

We then compute a distance matrix, performs a hierarchical cluster analysis, and draws the dendrogram.

error.freq.dist <- dist(error.freq2)
error.freq.hclust <- hclust(error.freq.dist, method = "ward.D2")
plot(error.freq.hclust, hang = -1)

k = 3 looks good.

plot(error.freq.hclust, hang = -1)
rect.hclust(error.freq.hclust, 3)

To assign a cluster to each observation, we use cutree().

cutree(error.freq.hclust, 3)

## br cn de es fr it jp kr mx ru tr tw 
##  1  2  2  2  1  1  3  3  1  2  3  3

To add a column to error.freq2, we first convert it into a data.frame. We then add cluster as a separate column.

error.freq2 <- as.data.frame(error.freq2)
error.freq2$cluster <- cutree(error.freq.hclust, 3)

To investigate the error profile of each L1 group, we visualize the (standardized) error frequency of each error in each cluster. We first load necessary packages.

# Run this line if these packhages have not been installed yet.
# install.packages(c("tidyr", "ggplot2"))

# To install from source files in a local folder:
# install.packages("C://abc/xyz/Packages/tidyr_0.3.1.tar.gz", repos = NULL, type = "source")
# install.packages("C://abc/xyz/Packages/ggplot2_1.0.1.tar.gz", repos = NULL, type = "source")

library("tidyr") # to restructure data
library("ggplot2") # to draw figures efficiently

We first convert the data frame into the so-called long format.

error.freq.long <- gather(error.freq2, tag, freq, AR:SI)
# first few lines
head(error.freq.long)

##   cluster tag       freq
## 1       1  AR -0.3445771
## 2       2  AR -0.3410764
## 3       2  AR -1.3625807
## 4       2  AR -1.1434988
## 5       1  AR -0.6414772
## 6       1  AR -0.6315747

We then draw the error profile for each cluster. The code below says that

the x-axis should represent tag,
the y-axis should represent freq,
the boxplot should be drawn, and
separate figures should be drawn for separate clusters.

ggplot(error.freq.long, aes(tag, freq)) +
  geom_boxplot() +
  facet_wrap(~ cluster)

k-means Clustering

Below, I illustrate how to perform a k-means cluster analysis. Let us first read data and see what it looks like.

eg4 <- read.csv("Data/example4.csv")
eg4

##   data article  ed
## 1    1     0.4 0.2
## 2    2     0.2 0.4
## 3    3     0.4 0.8
## 4    4     0.8 0.8
## 5    5     0.9 0.4

kmeans() performs k-means clustering. The second argument specifies the number of clusters.

eg4.kmeans <- kmeans(eg4[ , 2:3], 2)

eg4.kmeans$cluster returns the cluster membership of each observation (similar to cutree() in hierarchical clustering), and eg4.kmeans$centers returns the location of the mean of each cluster.

eg4.kmeans$cluster

## [1] 2 2 1 1 1

eg4.kmeans$centers

##   article        ed
## 1     0.7 0.6666667
## 2     0.3 0.3000000

K-means clustering involves randomness:

kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster

## [1] 2 2 1 1 1
## [1] 2 2 1 1 1
## [1] 1 1 2 2 2
## [1] 2 2 1 1 1

To make the results reproducible, we can use set.seed().

set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster

## [1] 1 1 2 2 2
## [1] 1 1 2 2 2
## [1] 1 1 2 2 2
## [1] 1 1 2 2 2

Authentic data

We now cluster more authentic data with k-means clustering. Let us first read data and look at the first few lines.

efcorpus.msl <- read.csv("Data/efcorpus.msl.csv")
head(efcorpus.msl)

##   learnerid nationality mean.level  writing1  writing2 writing3  writing4
## 1  18426339          br   15.50000  7.750000  4.666667  2.56250  6.000000
## 2  18426396          br   16.50000  7.600000 27.000000  6.00000  6.750000
## 3  18426461          mx   39.56667 13.500000 43.000000 56.00000 25.500000
## 4  18427875          cn   55.80000 58.000000 14.666667 23.66667 12.625000
## 5  18431359          br   31.03333  9.428571 11.727273 11.33333  3.333333
## 6  18433178          fr   47.50000 14.500000 13.666667 10.66667  3.440000
##    writing5  writing6  writing7  writing8  writing9 writing10 writing11
## 1  5.666667 15.500000  9.200000 11.500000  4.052632  22.00000  8.400000
## 2  4.750000  4.600000  6.333333 11.714286 11.250000   6.12500  5.714286
## 3 11.800000  6.666667 16.000000  8.500000 12.166667  28.50000  6.833333
## 4 12.714286 36.333333 25.666667 11.285714 23.000000  15.66667 31.333333
## 5  4.875000  5.625000  3.800000  2.307692  7.333333  83.00000  9.166667
## 6 15.000000 39.000000 10.571429 87.000000  6.272727   4.56250 25.000000
##    writing12 writing13 writing14 writing15 writing16  writing17 writing18
## 1  24.500000 19.500000 44.000000 10.000000  17.00000   8.125000  46.00000
## 2  28.500000  7.250000  8.166667 50.500000  10.55556  27.000000  11.50000
## 3   4.000000  5.857143  9.166667  6.777778   7.70000   6.000000  14.00000
## 4 100.000000 37.000000 29.400000 19.000000   6.75000  17.666667   8.80000
## 5   7.750000 12.285714  6.875000  9.000000  11.55556   7.333333  62.00000
## 6   8.444444 74.000000  4.625000 17.000000   4.81250 100.000000  17.16667
##   writing19 writing20 writing21 writing22 writing23 writing24 writing25
## 1  3.857143  5.166667  8.750000 19.500000   7.60000 10.500000  5.727273
## 2 10.000000 10.333333  6.333333  6.666667  11.80000  9.833333 42.500000
## 3 22.333333 67.000000  3.333333 33.250000  11.16667  9.833333 15.428571
## 4 18.800000 15.625000 11.000000 22.000000  43.75000 58.000000 41.000000
## 5 27.000000 37.000000 52.000000 28.500000  19.66667 28.500000 58.000000
## 6 24.250000  5.315789 18.000000 10.700000  29.00000 10.923077  3.351351
##   writing26 writing27 writing28 writing29 writing30
## 1  6.800000  5.100000 14.833333  84.00000  16.80000
## 2 13.400000  9.666667 17.000000   5.40000  29.66667
## 3 14.600000 27.666667 13.333333  13.16667  15.40000
## 4 23.800000 55.000000 17.444444  32.20000  52.00000
## 5 29.500000 11.200000 12.500000  21.66667  12.60000
## 6  8.833333 12.555556  8.307692  17.00000  14.25000

The data include the longitudinal development of mean sentence length in 2,774 learners across 30 writings. What we do is we cluster learners according to their developmental patterns. We follow the same procedure as the above.

# set the seed
set.seed(1)
# cluster rows (learners) by k-means clustering, where k = 3
efcorpus.msl.kmeans <- kmeans(efcorpus.msl[ , 4:33], 3)
# assign cluster to the original data frame
efcorpus.msl$cluster <- efcorpus.msl.kmeans$cluster
head(efcorpus.msl)

##   learnerid nationality mean.level  writing1  writing2 writing3  writing4
## 1  18426339          br   15.50000  7.750000  4.666667  2.56250  6.000000
## 2  18426396          br   16.50000  7.600000 27.000000  6.00000  6.750000
## 3  18426461          mx   39.56667 13.500000 43.000000 56.00000 25.500000
## 4  18427875          cn   55.80000 58.000000 14.666667 23.66667 12.625000
## 5  18431359          br   31.03333  9.428571 11.727273 11.33333  3.333333
## 6  18433178          fr   47.50000 14.500000 13.666667 10.66667  3.440000
##    writing5  writing6  writing7  writing8  writing9 writing10 writing11
## 1  5.666667 15.500000  9.200000 11.500000  4.052632  22.00000  8.400000
## 2  4.750000  4.600000  6.333333 11.714286 11.250000   6.12500  5.714286
## 3 11.800000  6.666667 16.000000  8.500000 12.166667  28.50000  6.833333
## 4 12.714286 36.333333 25.666667 11.285714 23.000000  15.66667 31.333333
## 5  4.875000  5.625000  3.800000  2.307692  7.333333  83.00000  9.166667
## 6 15.000000 39.000000 10.571429 87.000000  6.272727   4.56250 25.000000
##    writing12 writing13 writing14 writing15 writing16  writing17 writing18
## 1  24.500000 19.500000 44.000000 10.000000  17.00000   8.125000  46.00000
## 2  28.500000  7.250000  8.166667 50.500000  10.55556  27.000000  11.50000
## 3   4.000000  5.857143  9.166667  6.777778   7.70000   6.000000  14.00000
## 4 100.000000 37.000000 29.400000 19.000000   6.75000  17.666667   8.80000
## 5   7.750000 12.285714  6.875000  9.000000  11.55556   7.333333  62.00000
## 6   8.444444 74.000000  4.625000 17.000000   4.81250 100.000000  17.16667
##   writing19 writing20 writing21 writing22 writing23 writing24 writing25
## 1  3.857143  5.166667  8.750000 19.500000   7.60000 10.500000  5.727273
## 2 10.000000 10.333333  6.333333  6.666667  11.80000  9.833333 42.500000
## 3 22.333333 67.000000  3.333333 33.250000  11.16667  9.833333 15.428571
## 4 18.800000 15.625000 11.000000 22.000000  43.75000 58.000000 41.000000
## 5 27.000000 37.000000 52.000000 28.500000  19.66667 28.500000 58.000000
## 6 24.250000  5.315789 18.000000 10.700000  29.00000 10.923077  3.351351
##   writing26 writing27 writing28 writing29 writing30 cluster
## 1  6.800000  5.100000 14.833333  84.00000  16.80000       3
## 2 13.400000  9.666667 17.000000   5.40000  29.66667       1
## 3 14.600000 27.666667 13.333333  13.16667  15.40000       3
## 4 23.800000 55.000000 17.444444  32.20000  52.00000       2
## 5 29.500000 11.200000 12.500000  21.66667  12.60000       3
## 6  8.833333 12.555556  8.307692  17.00000  14.25000       3

Let’s visualize the developmental profile of each cluster. We first convert the data into the long format.

efcorpus.msl.long <- gather(efcorpus.msl, writingnum, mean.sent.len, writing1:writing30)
head(efcorpus.msl.long)

##   learnerid nationality mean.level cluster writingnum mean.sent.len
## 1  18426339          br   15.50000       3   writing1      7.750000
## 2  18426396          br   16.50000       1   writing1      7.600000
## 3  18426461          mx   39.56667       3   writing1     13.500000
## 4  18427875          cn   55.80000       2   writing1     58.000000
## 5  18431359          br   31.03333       3   writing1      9.428571
## 6  18433178          fr   47.50000       3   writing1     14.500000

We want to have writingnum as the x-axis and for that purpose, it’s more convenient to have numeric values for writingnum. The tidyr package we loaded earlier has the function called extract_numeric that retrieves the numerical part from the given character (or factor). Run, for example, extract_numeric(c("writingnum1", "writingnum2")).

efcorpus.msl.long$writingnum <- extract_numeric(efcorpus.msl.long$writingnum)
head(efcorpus.msl.long)

##   learnerid nationality mean.level cluster writingnum mean.sent.len
## 1  18426339          br   15.50000       3          1      7.750000
## 2  18426396          br   16.50000       1          1      7.600000
## 3  18426461          mx   39.56667       3          1     13.500000
## 4  18427875          cn   55.80000       2          1     58.000000
## 5  18431359          br   31.03333       3          1      9.428571
## 6  18433178          fr   47.50000       3          1     14.500000

We now visualize the data using ggplot().

group = learnerid specifies that separate lines are drawn for separate learners.
alpha = 0.1 means that lines should be transparent (0-1, where smaller values indicate more transparency).
stat_smooth() draws a trend line.
- group = NULL means we draw only one trend line for each cluster (i.e., not separate trend lines for separate learners)
- se = FALSE means we do not draw standard errors around the trend line.
coord_cartesian(ylim = c(0, 100)) limits the y-axis from 0 to 100.

ggplot(efcorpus.msl.long, aes(writingnum, mean.sent.len, group = learnerid)) + 
  geom_line(alpha = 0.1) +
  facet_wrap(~ cluster) +
  stat_smooth(aes(group = NULL), se = FALSE) +
  coord_cartesian(ylim = c(0, 100))

Note that clustering was performed based on the absolute value of mean sentence length. Indeed, the proficiency of the learner (represented in the mean.level column) is higher in the clusters that mark higher mean sentence length.

tapply(efcorpus.msl$mean.level, efcorpus.msl$cluster, mean)

##        1        2        3 
## 17.96468 65.02373 32.65439

Centered Data

If we want to cluster learners according to the developmental shape rather than to the absolute value of mean sentence length, a strategy we can take is to apply within-learner centering to the concerned variable. The data after within-learner centering is available as efcorpus.msl.c.csv. Just in case, the code below illustrates how the data were transformed from efcorpus.msl.

# efcorpus.msl.c <- efcorpus.msl
# efcorpus.msl.c[ , 4:33] <- t(apply(efcorpus.msl.c[ , 4:33], 1, scale, scale = FALSE))
# efcorpus.msl.c$cluster <- NULL

We read the withi-learner centered data.

efcorpus.msl.c <- read.csv("Data/efcorpus.msl.c.csv")
head(efcorpus.msl.c)

##   learnerid nationality mean.level   writing1   writing2   writing3
## 1  18426339          br   15.50000  -7.418596 -10.501929 -12.606096
## 2  18426396          br   16.50000  -6.196971  13.203029  -7.796971
## 3  18426461          mx   39.56667  -4.116005  25.383995  38.383995
## 4  18427875          cn   55.80000  28.860185 -14.473148  -5.473148
## 5  18431359          br   31.03333 -10.800200  -8.501499  -8.895438
## 6  18433178          fr   47.50000  -6.107180  -6.940513  -9.940513
##     writing4   writing5   writing6   writing7   writing8   writing9
## 1  -9.168596  -9.501929   0.331404  -5.968596  -3.668596 -11.115964
## 2  -7.046971  -9.046971  -9.196971  -7.463638  -2.082685  -2.546971
## 3   7.883995  -5.816005 -10.949339  -1.616005  -9.116005  -5.449339
## 4 -16.514815 -16.425529   7.193519  -3.473148 -17.854101  -6.139815
## 5 -16.895438 -15.353771 -14.603771 -16.428771 -17.921079 -12.895438
## 6 -17.167180  -5.607180  18.392820 -10.035751  66.392820 -14.334453
##    writing10  writing11  writing12  writing13   writing14  writing15
## 1   6.831404  -6.768596   9.331404   4.331404  28.8314040  -5.168596
## 2  -7.671971  -8.082685  14.703029  -6.546971  -5.6303042  36.703029
## 3  10.883995 -10.782672 -13.616005 -11.758862  -8.4493386 -10.838228
## 4 -13.473148   2.193519  70.860185   7.860185   0.2601852 -10.139815
## 5  62.771229 -11.062105 -12.478771  -7.943057 -13.3537713 -11.228771
## 6 -16.044680   4.392820 -12.162736  53.392820 -15.9821800  -3.607180
##    writing16  writing17  writing18  writing19  writing20  writing21
## 1   1.831404  -7.043596  30.831404 -11.311453 -10.001929  -6.418596
## 2  -3.241415  13.203029  -2.296971  -3.796971  -3.463638  -7.463638
## 3  -9.916005 -11.616005  -3.616005   4.717328  49.383995 -14.282672
## 4 -22.389815 -11.473148 -20.339815 -10.339815 -13.514815 -18.139815
## 5  -8.673216 -12.895438  41.771229   6.771229  16.771229  31.771229
## 6 -15.794680  79.392820  -3.440513   3.642820 -15.291391  -2.607180
##   writing22  writing23 writing24  writing25   writing26  writing27
## 1  4.331404 -7.5685960 -4.668596  -9.441323  -8.3685960 -10.068596
## 2 -7.130304 -1.9969709 -3.963638  28.703029  -0.3969709  -4.130304
## 3 15.633995 -6.4493386 -7.782672  -2.187434  -3.0160053  10.050661
## 4 -7.139815 14.6101852 28.860185  11.860185  -5.3398148  25.860185
## 5  8.271229 -0.5621047  8.271229  37.771229   9.2712287  -9.028771
## 6 -9.907180  8.3928200 -9.684103 -17.255829 -11.7738466  -8.051624
##     writing28 writing29 writing30
## 1  -0.3352627 68.831404  1.631404
## 2   3.2030291 -8.396971 15.869696
## 3  -4.2826720 -4.449339 -2.216005
## 4 -11.6953704  3.060185 22.860185
## 5  -7.7287713  1.437895 -7.628771
## 6 -12.2994877 -3.607180 -6.357180

Practicals

K-means clustering on the within-learner-centered data:

set.seed(1)
efcorpus.msl.c.kmeans <- kmeans(efcorpus.msl.c[ , 4:33], 3)
efcorpus.msl.c$cluster <- efcorpus.msl.c.kmeans$cluster
efcorpus.msl.c.long <- gather(efcorpus.msl.c, writingnum, mean.sent.len.c, writing1:writing30)
efcorpus.msl.c.long$writingnum <- extract_numeric(efcorpus.msl.c.long$writingnum)
ggplot(efcorpus.msl.c.long, aes(writingnum, mean.sent.len.c, group = learnerid)) +
  geom_line(alpha = 0.1) +
  facet_wrap(~ cluster) +
  stat_smooth(aes(group = NULL), se = FALSE) +
  coord_cartesian(ylim = c(-50, 50))

Cluster Validation

Let us say that we want to see whether the clusters are “real”. We can calculate the silhouette value with the silhouette function in the cluster package. We first (install and) load the package.

# Run this line if the package has not been installed yet.
# install.packages("cluster")

# To install from source files in a local folder:
# install.packages("C://abc/xyz/Packages/cluster_2.0.3.tar.gz", repos = NULL, type = "source")

library("cluster")

Let’s test whether the clusters found in the (non-centered) mean sentence length data were random. We follow the same procedure as before and perform k-means clustering. Because the silhouette function requires the distance matrix, we make it with dist() as well.

efcorpus.msl <- read.csv("Data/efcorpus.msl.csv")
d <- efcorpus.msl[ , 4:33]
set.seed(1)
res.kmeans <- kmeans(d, 3)
ddist <- dist(d)

We then feed the cluster of each observation and the distance matrix to silhouette(). We look at the first few lines.

sil <- silhouette(res.kmeans$cluster, ddist)
sil[1:5, ]

##      cluster neighbor  sil_width
## [1,]       3        1 -0.2285915
## [2,]       1        3  0.3166222
## [3,]       3        1 -0.2172080
## [4,]       2        1 -0.3014261
## [5,]       3        1 -0.1345547

Since the silhouette value is given in the third column, we take the mean of the third column to calculate the mean silhouette value. The value (0.169) indicates how good the clustering in the present case is.

mean(sil[ , 3])

## [1] 0.1691578

We now compute the null distribution of the silhouette value. First, 1,000 sets of 2,774 (= number of learners) times 30 (= number of writings) values are randomly chosen from the original data. These sets of data should not include systematic patterns and thus form the basis of the null hypothesis. K-means clustering is performed on each of the 1,000 data sets, and the mean silhouette value is calculated. Finally, we inspect whether the 95% range of those silhouette values include the observed silhouette value in our original data set (0.169).

set.seed(1)
ef.m <- as.matrix(efcorpus.msl[ , 4:33])
boot.samp <- sample(ef.m, 2774 * 30 * 1000, replace = TRUE)
boot.samp.array <- array(boot.samp, dim = c(2774, 30, 1000))

boot.res <- numeric(1000)
for (i in 1:1000) {
  boot.d <- boot.samp.array[ , , i]
  boot.dist <- dist(boot.d)
  boot.kmeans <- kmeans(boot.d, 3)
  boot.res[i] <- mean(silhouette(boot.kmeans$cluster, boot.dist)[ , 3])
}
quantile(boot.res, 0.95)

##       95% 
## 0.1261744

rm(boot.samp, boot.samp.array)

Validation for centered data

efcorpus.msl.c <- read.csv("Data/efcorpus.msl.c.csv")
d <- efcorpus.msl.c[ , 4:33]
set.seed(1)
res.kmeans <- kmeans(d, 3)
ddist <- dist(d)
mean(silhouette(res.kmeans$cluster, ddist)[ , 3])

## [1] 0.1229812

set.seed(1)
ef.m <- as.matrix(efcorpus.msl.c[ , 4:33])
boot.samp <- sample(ef.m, 2774 * 30 * 1000, replace = TRUE)
boot.samp.array <- array(boot.samp, dim = c(2774, 30, 1000))

boot.res <- numeric(1000)
for (i in 1:1000) {
  boot.d <- boot.samp.array[ , , i]
  boot.d2 <- t(apply(boot.d, 1, scale, scale = FALSE))
  boot.dist <- dist(boot.d2)
  boot.kmeans <- kmeans(boot.d2, 3)
  boot.res[i] <- mean(silhouette(boot.kmeans$cluster, boot.dist)[ , 3])
}
quantile(boot.res, 0.95)

##       95% 
## 0.0863205

rm(boot.samp, boot.samp.array)

Clustering Methods With R

Akira Murakami

October 8, 2015

`dist` function

Euclidean distance

Manhattan distance

Hierarchical clustering

More authentic example

k-means Clustering

Authentic data

Centered Data

Practicals

Cluster Validation

Validation for centered data

Clustering Methods With R

Akira Murakami

October 8, 2015

dist function

Euclidean distance

Manhattan distance

Hierarchical clustering

More authentic example

k-means Clustering

Authentic data

Centered Data

Practicals

Cluster Validation

Validation for centered data

`dist` function