dist functiondist() calculates distance and returns a distance matrix. The method argument specifies the distance metric to be used and defaults to Euclidean distance. Examples follow:
Let us first read example data and look what is in it.
eg1 <- read.csv("Data/example1.csv")
eg1
## L1 article
## 1 Korean 0.6
## 2 German 0.8
## 3 Japanese 0.5
We then calculate distance between the elements in the second column.
dist(eg1[ , 2])
## 1 2
## 2 0.2
## 3 0.1 0.3
Alternatively, we can specify the column by the column name as well.
dist(eg1$article)
## 1 2
## 2 0.2
## 3 0.1 0.3
Similarly, we can compute distance in higher-dimensional spaces.
# 2D
eg2 <- read.csv("Data/example2.csv")
eg2
## L1 article ed
## 1 Korean 0.4 0.8
## 2 German 0.8 0.6
## 3 Japanese 0.6 0.5
dist(eg2[ , 2:3])
## 1 2
## 2 0.4472136
## 3 0.3605551 0.2236068
#3D
eg3 <- read.csv("Data/example3.csv")
eg3
## L1 article ed s
## 1 Korean 0.6 0.9 0.6
## 2 German 0.3 0.6 0.9
## 3 Japanese 0.9 0.4 0.5
dist(eg3[ , 2:4])
## 1 2
## 2 0.5196152
## 3 0.5916080 0.7483315
Manhattan distance can be computed by specifying “manhattan” to the method argument of dist().
dist(eg2[ , 2:3], method = "manhattan")
## 1 2
## 2 0.6
## 3 0.5 0.3
The following code snippet performs an agglomerative hierarchical cluster analysis with squared Euclidean distance and the Ward’s method. Before running the analysis, however, we assign L1 labels to the row names of the data frame so that the resulting dendrogram uses L1 labels.
rownames(eg3) <- eg3$L1
# The row names represent L1.
eg3
## L1 article ed s
## Korean Korean 0.6 0.9 0.6
## German German 0.3 0.6 0.9
## Japanese Japanese 0.9 0.4 0.5
We then obtain a distance matrix as before.
eg3.dist <- dist(eg3[ , 2:4])
eg3.dist
## Korean German
## German 0.5196152
## Japanese 0.5916080 0.7483315
The function hclust performs a hierarhical cluster analysis. The method argument specifies the linkage criterion.
eg3.hclust <- hclust(eg3.dist, method = "ward.D2")
We can draw a dendrogram using plot(). By specifying hang = -1, all the leaves are aligned.
plot(eg3.hclust, hang = -1)
We now cluster more authentic data. The data we use are the relative frequency of eight error tags in 12 nationalities in EF-Cambridge Open Language Database (EFCAMDAT). We first read the data as usual and look at the data.
error.freq <- read.csv("Data/error_freq.csv")
error.freq
## nationality AR PR VT PL AG WO
## 1 br 5.584696 6.052505 3.128397 1.904357 1.990424 1.3657120
## 2 cn 5.592782 4.125751 3.551567 2.442850 1.747515 0.6549510
## 3 de 3.233303 3.957666 3.074154 1.231066 1.156172 1.1702148
## 4 es 3.739341 5.472206 2.713302 1.254047 1.459255 1.0488394
## 5 fr 4.898914 5.462552 3.169369 2.179051 1.455627 1.3660770
## 6 it 4.921787 4.939960 2.706239 2.255199 1.361380 1.4423363
## 7 jp 8.659891 4.806812 3.044314 2.826863 1.365745 0.8888787
## 8 kr 10.185876 5.031577 3.738282 2.935873 1.991863 1.1611332
## 9 mx 4.660731 5.724839 3.476950 1.764889 1.919061 1.4425301
## 10 ru 9.093976 4.476776 2.670699 1.618846 1.478422 1.1154406
## 11 tr 8.705263 5.115789 3.257895 2.773684 1.563158 1.2421053
## 12 tw 7.290674 5.102923 4.592356 3.181435 2.314021 0.8015350
## CO SI
## 1 0.5449309 0.3771483
## 2 0.7863817 0.3458317
## 3 0.5125541 0.2995750
## 4 0.2280086 0.2508094
## 5 0.5197414 0.5074502
## 6 0.3436494 0.4279096
## 7 1.1101447 0.4844961
## 8 0.8401695 0.4814455
## 9 0.5864995 0.3902809
## 10 0.7091835 0.3382532
## 11 0.9052632 0.5105263
## 12 0.7768301 0.4556672
What each error tag represents is as follows:
Notice that the absolute frequency varies across error types (e.g., AR is much more frequent than SI). We therefore need to standardize variables. The scale functoin standardizes a variable. To standardize multiple columns, we can use apply() that applies a function row-wise (when MARGIN = 1) or column-wise (when MARGIN = 2).
error.freq2 <- apply(error.freq[ , 2:ncol(error.freq)], MARGIN = 2, scale)
error.freq2
## AR PR VT PL AG
## [1,] -0.3445771 1.65628001 -0.244719210 -0.44646068 0.9993837
## [2,] -0.3410764 -1.44183999 0.540423186 0.37409985 0.2858129
## [3,] -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161
## [4,] -1.1434988 0.72318945 -1.014878242 -1.43740982 -0.5609801
## [5,] -0.6414772 0.70766717 -0.168700120 -0.02787943 -0.5716371
## [6,] -0.6315747 -0.13263309 -1.027982431 0.08815636 -0.8484962
## [7,] 0.9867872 -0.34672898 -0.400724359 0.95926373 -0.8356748
## [8,] 1.6474415 0.01468173 0.886851381 1.12537452 1.0036089
## [9,] -0.7445955 1.12941050 0.401980447 -0.65898342 0.7897465
## [10,] 1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734
## [11,] 1.0064303 0.15009072 -0.004451063 0.87822901 -0.2557545
## [12,] 0.3940032 0.12940232 2.471484096 1.49956413 1.9499803
## WO CO SI
## [1,] 0.87668665 -0.4458702 -0.3342733
## [2,] -1.90425741 0.5297247 -0.6998565
## [3,] 0.11177852 -0.5766907 -1.2398486
## [4,] -0.36311806 -1.7264120 -1.8091277
## [5,] 0.87811493 -0.5476499 1.1868451
## [6,] 1.17648906 -1.2591589 0.2583042
## [7,] -0.98898499 1.8379062 0.9188837
## [8,] 0.07624572 0.7470573 0.8832709
## [9,] 1.17724720 -0.2779101 -0.1809656
## [10,] -0.10253245 0.2178012 -0.7883272
## [11,] 0.39305938 1.0100715 1.2227549
## [12,] -1.33072855 0.4911310 0.5823401
As before, we assign the nationality code as the rowname of the matrix.
rownames(error.freq2) <- sort(unique(error.freq$nationality))
error.freq2
## AR PR VT PL AG WO
## br -0.3445771 1.65628001 -0.244719210 -0.44646068 0.9993837 0.87668665
## cn -0.3410764 -1.44183999 0.540423186 0.37409985 0.2858129 -1.90425741
## de -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161 0.11177852
## es -1.1434988 0.72318945 -1.014878242 -1.43740982 -0.5609801 -0.36311806
## fr -0.6414772 0.70766717 -0.168700120 -0.02787943 -0.5716371 0.87811493
## it -0.6315747 -0.13263309 -1.027982431 0.08815636 -0.8484962 1.17648906
## jp 0.9867872 -0.34672898 -0.400724359 0.95926373 -0.8356748 -0.98898499
## kr 1.6474415 0.01468173 0.886851381 1.12537452 1.0036089 0.07624572
## mx -0.7445955 1.12941050 0.401980447 -0.65898342 0.7897465 1.17724720
## ru 1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734 -0.10253245
## tr 1.0064303 0.15009072 -0.004451063 0.87822901 -0.2557545 0.39305938
## tw 0.3940032 0.12940232 2.471484096 1.49956413 1.9499803 -1.33072855
## CO SI
## br -0.4458702 -0.3342733
## cn 0.5297247 -0.6998565
## de -0.5766907 -1.2398486
## es -1.7264120 -1.8091277
## fr -0.5476499 1.1868451
## it -1.2591589 0.2583042
## jp 1.8379062 0.9188837
## kr 0.7470573 0.8832709
## mx -0.2779101 -0.1809656
## ru 0.2178012 -0.7883272
## tr 1.0100715 1.2227549
## tw 0.4911310 0.5823401
We then compute a distance matrix, performs a hierarchical cluster analysis, and draws the dendrogram.
error.freq.dist <- dist(error.freq2)
error.freq.hclust <- hclust(error.freq.dist, method = "ward.D2")
plot(error.freq.hclust, hang = -1)
k = 3 looks good.
plot(error.freq.hclust, hang = -1)
rect.hclust(error.freq.hclust, 3)
To assign a cluster to each observation, we use cutree().
cutree(error.freq.hclust, 3)
## br cn de es fr it jp kr mx ru tr tw
## 1 2 2 2 1 1 3 3 1 2 3 3
To add a column to error.freq2, we first convert it into a data.frame. We then add cluster as a separate column.
error.freq2 <- as.data.frame(error.freq2)
error.freq2$cluster <- cutree(error.freq.hclust, 3)
To investigate the error profile of each L1 group, we visualize the (standardized) error frequency of each error in each cluster. We first load necessary packages.
# Run this line if these packhages have not been installed yet.
# install.packages(c("tidyr", "ggplot2"))
# To install from source files in a local folder:
# install.packages("C://abc/xyz/Packages/tidyr_0.3.1.tar.gz", repos = NULL, type = "source")
# install.packages("C://abc/xyz/Packages/ggplot2_1.0.1.tar.gz", repos = NULL, type = "source")
library("tidyr") # to restructure data
library("ggplot2") # to draw figures efficiently
We first convert the data frame into the so-called long format.
error.freq.long <- gather(error.freq2, tag, freq, AR:SI)
# first few lines
head(error.freq.long)
## cluster tag freq
## 1 1 AR -0.3445771
## 2 2 AR -0.3410764
## 3 2 AR -1.3625807
## 4 2 AR -1.1434988
## 5 1 AR -0.6414772
## 6 1 AR -0.6315747
We then draw the error profile for each cluster. The code below says that
ggplot(error.freq.long, aes(tag, freq)) +
geom_boxplot() +
facet_wrap(~ cluster)
Below, I illustrate how to perform a k-means cluster analysis. Let us first read data and see what it looks like.
eg4 <- read.csv("Data/example4.csv")
eg4
## data article ed
## 1 1 0.4 0.2
## 2 2 0.2 0.4
## 3 3 0.4 0.8
## 4 4 0.8 0.8
## 5 5 0.9 0.4
kmeans() performs k-means clustering. The second argument specifies the number of clusters.
eg4.kmeans <- kmeans(eg4[ , 2:3], 2)
eg4.kmeans$cluster returns the cluster membership of each observation (similar to cutree() in hierarchical clustering), and eg4.kmeans$centers returns the location of the mean of each cluster.
eg4.kmeans$cluster
## [1] 2 2 1 1 1
eg4.kmeans$centers
## article ed
## 1 0.7 0.6666667
## 2 0.3 0.3000000
K-means clustering involves randomness:
kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster
kmeans(eg4[ , 2:3], 2)$cluster
## [1] 2 2 1 1 1
## [1] 2 2 1 1 1
## [1] 1 1 2 2 2
## [1] 2 2 1 1 1
To make the results reproducible, we can use set.seed().
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
set.seed(1)
kmeans(eg4[ , 2:3], 2)$cluster
## [1] 1 1 2 2 2
## [1] 1 1 2 2 2
## [1] 1 1 2 2 2
## [1] 1 1 2 2 2
We now cluster more authentic data with k-means clustering. Let us first read data and look at the first few lines.
efcorpus.msl <- read.csv("Data/efcorpus.msl.csv")
head(efcorpus.msl)
## learnerid nationality mean.level writing1 writing2 writing3 writing4
## 1 18426339 br 15.50000 7.750000 4.666667 2.56250 6.000000
## 2 18426396 br 16.50000 7.600000 27.000000 6.00000 6.750000
## 3 18426461 mx 39.56667 13.500000 43.000000 56.00000 25.500000
## 4 18427875 cn 55.80000 58.000000 14.666667 23.66667 12.625000
## 5 18431359 br 31.03333 9.428571 11.727273 11.33333 3.333333
## 6 18433178 fr 47.50000 14.500000 13.666667 10.66667 3.440000
## writing5 writing6 writing7 writing8 writing9 writing10 writing11
## 1 5.666667 15.500000 9.200000 11.500000 4.052632 22.00000 8.400000
## 2 4.750000 4.600000 6.333333 11.714286 11.250000 6.12500 5.714286
## 3 11.800000 6.666667 16.000000 8.500000 12.166667 28.50000 6.833333
## 4 12.714286 36.333333 25.666667 11.285714 23.000000 15.66667 31.333333
## 5 4.875000 5.625000 3.800000 2.307692 7.333333 83.00000 9.166667
## 6 15.000000 39.000000 10.571429 87.000000 6.272727 4.56250 25.000000
## writing12 writing13 writing14 writing15 writing16 writing17 writing18
## 1 24.500000 19.500000 44.000000 10.000000 17.00000 8.125000 46.00000
## 2 28.500000 7.250000 8.166667 50.500000 10.55556 27.000000 11.50000
## 3 4.000000 5.857143 9.166667 6.777778 7.70000 6.000000 14.00000
## 4 100.000000 37.000000 29.400000 19.000000 6.75000 17.666667 8.80000
## 5 7.750000 12.285714 6.875000 9.000000 11.55556 7.333333 62.00000
## 6 8.444444 74.000000 4.625000 17.000000 4.81250 100.000000 17.16667
## writing19 writing20 writing21 writing22 writing23 writing24 writing25
## 1 3.857143 5.166667 8.750000 19.500000 7.60000 10.500000 5.727273
## 2 10.000000 10.333333 6.333333 6.666667 11.80000 9.833333 42.500000
## 3 22.333333 67.000000 3.333333 33.250000 11.16667 9.833333 15.428571
## 4 18.800000 15.625000 11.000000 22.000000 43.75000 58.000000 41.000000
## 5 27.000000 37.000000 52.000000 28.500000 19.66667 28.500000 58.000000
## 6 24.250000 5.315789 18.000000 10.700000 29.00000 10.923077 3.351351
## writing26 writing27 writing28 writing29 writing30
## 1 6.800000 5.100000 14.833333 84.00000 16.80000
## 2 13.400000 9.666667 17.000000 5.40000 29.66667
## 3 14.600000 27.666667 13.333333 13.16667 15.40000
## 4 23.800000 55.000000 17.444444 32.20000 52.00000
## 5 29.500000 11.200000 12.500000 21.66667 12.60000
## 6 8.833333 12.555556 8.307692 17.00000 14.25000
The data include the longitudinal development of mean sentence length in 2,774 learners across 30 writings. What we do is we cluster learners according to their developmental patterns. We follow the same procedure as the above.
# set the seed
set.seed(1)
# cluster rows (learners) by k-means clustering, where k = 3
efcorpus.msl.kmeans <- kmeans(efcorpus.msl[ , 4:33], 3)
# assign cluster to the original data frame
efcorpus.msl$cluster <- efcorpus.msl.kmeans$cluster
head(efcorpus.msl)
## learnerid nationality mean.level writing1 writing2 writing3 writing4
## 1 18426339 br 15.50000 7.750000 4.666667 2.56250 6.000000
## 2 18426396 br 16.50000 7.600000 27.000000 6.00000 6.750000
## 3 18426461 mx 39.56667 13.500000 43.000000 56.00000 25.500000
## 4 18427875 cn 55.80000 58.000000 14.666667 23.66667 12.625000
## 5 18431359 br 31.03333 9.428571 11.727273 11.33333 3.333333
## 6 18433178 fr 47.50000 14.500000 13.666667 10.66667 3.440000
## writing5 writing6 writing7 writing8 writing9 writing10 writing11
## 1 5.666667 15.500000 9.200000 11.500000 4.052632 22.00000 8.400000
## 2 4.750000 4.600000 6.333333 11.714286 11.250000 6.12500 5.714286
## 3 11.800000 6.666667 16.000000 8.500000 12.166667 28.50000 6.833333
## 4 12.714286 36.333333 25.666667 11.285714 23.000000 15.66667 31.333333
## 5 4.875000 5.625000 3.800000 2.307692 7.333333 83.00000 9.166667
## 6 15.000000 39.000000 10.571429 87.000000 6.272727 4.56250 25.000000
## writing12 writing13 writing14 writing15 writing16 writing17 writing18
## 1 24.500000 19.500000 44.000000 10.000000 17.00000 8.125000 46.00000
## 2 28.500000 7.250000 8.166667 50.500000 10.55556 27.000000 11.50000
## 3 4.000000 5.857143 9.166667 6.777778 7.70000 6.000000 14.00000
## 4 100.000000 37.000000 29.400000 19.000000 6.75000 17.666667 8.80000
## 5 7.750000 12.285714 6.875000 9.000000 11.55556 7.333333 62.00000
## 6 8.444444 74.000000 4.625000 17.000000 4.81250 100.000000 17.16667
## writing19 writing20 writing21 writing22 writing23 writing24 writing25
## 1 3.857143 5.166667 8.750000 19.500000 7.60000 10.500000 5.727273
## 2 10.000000 10.333333 6.333333 6.666667 11.80000 9.833333 42.500000
## 3 22.333333 67.000000 3.333333 33.250000 11.16667 9.833333 15.428571
## 4 18.800000 15.625000 11.000000 22.000000 43.75000 58.000000 41.000000
## 5 27.000000 37.000000 52.000000 28.500000 19.66667 28.500000 58.000000
## 6 24.250000 5.315789 18.000000 10.700000 29.00000 10.923077 3.351351
## writing26 writing27 writing28 writing29 writing30 cluster
## 1 6.800000 5.100000 14.833333 84.00000 16.80000 3
## 2 13.400000 9.666667 17.000000 5.40000 29.66667 1
## 3 14.600000 27.666667 13.333333 13.16667 15.40000 3
## 4 23.800000 55.000000 17.444444 32.20000 52.00000 2
## 5 29.500000 11.200000 12.500000 21.66667 12.60000 3
## 6 8.833333 12.555556 8.307692 17.00000 14.25000 3
Let’s visualize the developmental profile of each cluster. We first convert the data into the long format.
efcorpus.msl.long <- gather(efcorpus.msl, writingnum, mean.sent.len, writing1:writing30)
head(efcorpus.msl.long)
## learnerid nationality mean.level cluster writingnum mean.sent.len
## 1 18426339 br 15.50000 3 writing1 7.750000
## 2 18426396 br 16.50000 1 writing1 7.600000
## 3 18426461 mx 39.56667 3 writing1 13.500000
## 4 18427875 cn 55.80000 2 writing1 58.000000
## 5 18431359 br 31.03333 3 writing1 9.428571
## 6 18433178 fr 47.50000 3 writing1 14.500000
We want to have writingnum as the x-axis and for that purpose, it’s more convenient to have numeric values for writingnum. The tidyr package we loaded earlier has the function called extract_numeric that retrieves the numerical part from the given character (or factor). Run, for example, extract_numeric(c("writingnum1", "writingnum2")).
efcorpus.msl.long$writingnum <- extract_numeric(efcorpus.msl.long$writingnum)
head(efcorpus.msl.long)
## learnerid nationality mean.level cluster writingnum mean.sent.len
## 1 18426339 br 15.50000 3 1 7.750000
## 2 18426396 br 16.50000 1 1 7.600000
## 3 18426461 mx 39.56667 3 1 13.500000
## 4 18427875 cn 55.80000 2 1 58.000000
## 5 18431359 br 31.03333 3 1 9.428571
## 6 18433178 fr 47.50000 3 1 14.500000
We now visualize the data using ggplot().
group = learnerid specifies that separate lines are drawn for separate learners.alpha = 0.1 means that lines should be transparent (0-1, where smaller values indicate more transparency).stat_smooth() draws a trend line.
group = NULL means we draw only one trend line for each cluster (i.e., not separate trend lines for separate learners)se = FALSE means we do not draw standard errors around the trend line.coord_cartesian(ylim = c(0, 100)) limits the y-axis from 0 to 100.ggplot(efcorpus.msl.long, aes(writingnum, mean.sent.len, group = learnerid)) +
geom_line(alpha = 0.1) +
facet_wrap(~ cluster) +
stat_smooth(aes(group = NULL), se = FALSE) +
coord_cartesian(ylim = c(0, 100))
Note that clustering was performed based on the absolute value of mean sentence length. Indeed, the proficiency of the learner (represented in the mean.level column) is higher in the clusters that mark higher mean sentence length.
tapply(efcorpus.msl$mean.level, efcorpus.msl$cluster, mean)
## 1 2 3
## 17.96468 65.02373 32.65439
If we want to cluster learners according to the developmental shape rather than to the absolute value of mean sentence length, a strategy we can take is to apply within-learner centering to the concerned variable. The data after within-learner centering is available as efcorpus.msl.c.csv. Just in case, the code below illustrates how the data were transformed from efcorpus.msl.
# efcorpus.msl.c <- efcorpus.msl
# efcorpus.msl.c[ , 4:33] <- t(apply(efcorpus.msl.c[ , 4:33], 1, scale, scale = FALSE))
# efcorpus.msl.c$cluster <- NULL
We read the withi-learner centered data.
efcorpus.msl.c <- read.csv("Data/efcorpus.msl.c.csv")
head(efcorpus.msl.c)
## learnerid nationality mean.level writing1 writing2 writing3
## 1 18426339 br 15.50000 -7.418596 -10.501929 -12.606096
## 2 18426396 br 16.50000 -6.196971 13.203029 -7.796971
## 3 18426461 mx 39.56667 -4.116005 25.383995 38.383995
## 4 18427875 cn 55.80000 28.860185 -14.473148 -5.473148
## 5 18431359 br 31.03333 -10.800200 -8.501499 -8.895438
## 6 18433178 fr 47.50000 -6.107180 -6.940513 -9.940513
## writing4 writing5 writing6 writing7 writing8 writing9
## 1 -9.168596 -9.501929 0.331404 -5.968596 -3.668596 -11.115964
## 2 -7.046971 -9.046971 -9.196971 -7.463638 -2.082685 -2.546971
## 3 7.883995 -5.816005 -10.949339 -1.616005 -9.116005 -5.449339
## 4 -16.514815 -16.425529 7.193519 -3.473148 -17.854101 -6.139815
## 5 -16.895438 -15.353771 -14.603771 -16.428771 -17.921079 -12.895438
## 6 -17.167180 -5.607180 18.392820 -10.035751 66.392820 -14.334453
## writing10 writing11 writing12 writing13 writing14 writing15
## 1 6.831404 -6.768596 9.331404 4.331404 28.8314040 -5.168596
## 2 -7.671971 -8.082685 14.703029 -6.546971 -5.6303042 36.703029
## 3 10.883995 -10.782672 -13.616005 -11.758862 -8.4493386 -10.838228
## 4 -13.473148 2.193519 70.860185 7.860185 0.2601852 -10.139815
## 5 62.771229 -11.062105 -12.478771 -7.943057 -13.3537713 -11.228771
## 6 -16.044680 4.392820 -12.162736 53.392820 -15.9821800 -3.607180
## writing16 writing17 writing18 writing19 writing20 writing21
## 1 1.831404 -7.043596 30.831404 -11.311453 -10.001929 -6.418596
## 2 -3.241415 13.203029 -2.296971 -3.796971 -3.463638 -7.463638
## 3 -9.916005 -11.616005 -3.616005 4.717328 49.383995 -14.282672
## 4 -22.389815 -11.473148 -20.339815 -10.339815 -13.514815 -18.139815
## 5 -8.673216 -12.895438 41.771229 6.771229 16.771229 31.771229
## 6 -15.794680 79.392820 -3.440513 3.642820 -15.291391 -2.607180
## writing22 writing23 writing24 writing25 writing26 writing27
## 1 4.331404 -7.5685960 -4.668596 -9.441323 -8.3685960 -10.068596
## 2 -7.130304 -1.9969709 -3.963638 28.703029 -0.3969709 -4.130304
## 3 15.633995 -6.4493386 -7.782672 -2.187434 -3.0160053 10.050661
## 4 -7.139815 14.6101852 28.860185 11.860185 -5.3398148 25.860185
## 5 8.271229 -0.5621047 8.271229 37.771229 9.2712287 -9.028771
## 6 -9.907180 8.3928200 -9.684103 -17.255829 -11.7738466 -8.051624
## writing28 writing29 writing30
## 1 -0.3352627 68.831404 1.631404
## 2 3.2030291 -8.396971 15.869696
## 3 -4.2826720 -4.449339 -2.216005
## 4 -11.6953704 3.060185 22.860185
## 5 -7.7287713 1.437895 -7.628771
## 6 -12.2994877 -3.607180 -6.357180
K-means clustering on the within-learner-centered data:
set.seed(1)
efcorpus.msl.c.kmeans <- kmeans(efcorpus.msl.c[ , 4:33], 3)
efcorpus.msl.c$cluster <- efcorpus.msl.c.kmeans$cluster
efcorpus.msl.c.long <- gather(efcorpus.msl.c, writingnum, mean.sent.len.c, writing1:writing30)
efcorpus.msl.c.long$writingnum <- extract_numeric(efcorpus.msl.c.long$writingnum)
ggplot(efcorpus.msl.c.long, aes(writingnum, mean.sent.len.c, group = learnerid)) +
geom_line(alpha = 0.1) +
facet_wrap(~ cluster) +
stat_smooth(aes(group = NULL), se = FALSE) +
coord_cartesian(ylim = c(-50, 50))
Let us say that we want to see whether the clusters are “real”. We can calculate the silhouette value with the silhouette function in the cluster package. We first (install and) load the package.
# Run this line if the package has not been installed yet.
# install.packages("cluster")
# To install from source files in a local folder:
# install.packages("C://abc/xyz/Packages/cluster_2.0.3.tar.gz", repos = NULL, type = "source")
library("cluster")
Let’s test whether the clusters found in the (non-centered) mean sentence length data were random. We follow the same procedure as before and perform k-means clustering. Because the silhouette function requires the distance matrix, we make it with dist() as well.
efcorpus.msl <- read.csv("Data/efcorpus.msl.csv")
d <- efcorpus.msl[ , 4:33]
set.seed(1)
res.kmeans <- kmeans(d, 3)
ddist <- dist(d)
We then feed the cluster of each observation and the distance matrix to silhouette(). We look at the first few lines.
sil <- silhouette(res.kmeans$cluster, ddist)
sil[1:5, ]
## cluster neighbor sil_width
## [1,] 3 1 -0.2285915
## [2,] 1 3 0.3166222
## [3,] 3 1 -0.2172080
## [4,] 2 1 -0.3014261
## [5,] 3 1 -0.1345547
Since the silhouette value is given in the third column, we take the mean of the third column to calculate the mean silhouette value. The value (0.169) indicates how good the clustering in the present case is.
mean(sil[ , 3])
## [1] 0.1691578
We now compute the null distribution of the silhouette value. First, 1,000 sets of 2,774 (= number of learners) times 30 (= number of writings) values are randomly chosen from the original data. These sets of data should not include systematic patterns and thus form the basis of the null hypothesis. K-means clustering is performed on each of the 1,000 data sets, and the mean silhouette value is calculated. Finally, we inspect whether the 95% range of those silhouette values include the observed silhouette value in our original data set (0.169).
set.seed(1)
ef.m <- as.matrix(efcorpus.msl[ , 4:33])
boot.samp <- sample(ef.m, 2774 * 30 * 1000, replace = TRUE)
boot.samp.array <- array(boot.samp, dim = c(2774, 30, 1000))
boot.res <- numeric(1000)
for (i in 1:1000) {
boot.d <- boot.samp.array[ , , i]
boot.dist <- dist(boot.d)
boot.kmeans <- kmeans(boot.d, 3)
boot.res[i] <- mean(silhouette(boot.kmeans$cluster, boot.dist)[ , 3])
}
quantile(boot.res, 0.95)
## 95%
## 0.1261744
rm(boot.samp, boot.samp.array)
efcorpus.msl.c <- read.csv("Data/efcorpus.msl.c.csv")
d <- efcorpus.msl.c[ , 4:33]
set.seed(1)
res.kmeans <- kmeans(d, 3)
ddist <- dist(d)
mean(silhouette(res.kmeans$cluster, ddist)[ , 3])
## [1] 0.1229812
set.seed(1)
ef.m <- as.matrix(efcorpus.msl.c[ , 4:33])
boot.samp <- sample(ef.m, 2774 * 30 * 1000, replace = TRUE)
boot.samp.array <- array(boot.samp, dim = c(2774, 30, 1000))
boot.res <- numeric(1000)
for (i in 1:1000) {
boot.d <- boot.samp.array[ , , i]
boot.d2 <- t(apply(boot.d, 1, scale, scale = FALSE))
boot.dist <- dist(boot.d2)
boot.kmeans <- kmeans(boot.d2, 3)
boot.res[i] <- mean(silhouette(boot.kmeans$cluster, boot.dist)[ , 3])
}
quantile(boot.res, 0.95)
## 95%
## 0.0863205
rm(boot.samp, boot.samp.array)