dist function

dist() computes the pairwise distances between the rows of a data matrix and returns them as a distance matrix. The method argument specifies the distance metric and defaults to Euclidean distance. Examples follow:

Euclidean distance

Let us first read the example data and look at what is in it.

eg1 <- read.csv("Data/example1.csv")
eg1
##         L1 article
## 1   Korean     0.6
## 2   German     0.8
## 3 Japanese     0.5

We then calculate the distances between the elements in the second column.

dist(eg1[ , 2])
##     1   2
## 2 0.2    
## 3 0.1 0.3

Alternatively, we can specify the column by name.

dist(eg1$article)
##     1   2
## 2 0.2    
## 3 0.1 0.3
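
The printed output shows only the lower triangle because that is all a dist object stores. As a minimal sketch, the snippet below types in the article values by hand (so that it runs without Data/example1.csv) and expands the result into a full square matrix with as.matrix(). The names article, article.dist, and article.mat are introduced here only for illustration.

```r
# The values of the article column, typed in by hand
article <- c(0.6, 0.8, 0.5)

# dist() returns an object of class "dist" (lower triangle only)
article.dist <- dist(article)
class(article.dist)

# as.matrix() expands it into a symmetric matrix with zeros on the diagonal
article.mat <- as.matrix(article.dist)
article.mat
```

The square form is convenient when a single distance has to be looked up, for example article.mat[2, 1] for the first two rows.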

Similarly, we can compute distance in higher-dimensional spaces.

# 2D
eg2 <- read.csv("Data/example2.csv")
eg2
##         L1 article  ed
## 1   Korean     0.4 0.8
## 2   German     0.8 0.6
## 3 Japanese     0.6 0.5
dist(eg2[ , 2:3])
##           1         2
## 2 0.4472136          
## 3 0.3605551 0.2236068
# 3D
eg3 <- read.csv("Data/example3.csv")
eg3
##         L1 article  ed   s
## 1   Korean     0.6 0.9 0.6
## 2   German     0.3 0.6 0.9
## 3 Japanese     0.9 0.4 0.5
dist(eg3[ , 2:4])
##           1         2
## 2 0.5196152          
## 3 0.5916080 0.7483315
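
To see what dist() computes, the Euclidean distance between two rows can be reproduced by hand as the square root of the sum of squared differences. The sketch below does this for the Korean and German rows of eg3, with the values typed in directly. The vector names korean and german are introduced only for this illustration.

```r
korean <- c(article = 0.6, ed = 0.9, s = 0.6)
german <- c(article = 0.3, ed = 0.6, s = 0.9)

# Euclidean distance: square root of the sum of squared differences
manual <- sqrt(sum((korean - german)^2))
manual

# The same value from dist(), applied to the two rows stacked in a matrix
from.dist <- as.numeric(dist(rbind(korean, german)))
from.dist
```

Both values should agree with the 0.5196152 reported for rows 1 and 2 above.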

Manhattan distance

Manhattan distance can be computed by passing "manhattan" to the method argument of dist().

dist(eg2[ , 2:3], method = "manhattan")
##     1   2
## 2 0.6    
## 3 0.5 0.3
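
Manhattan distance is the sum of the absolute differences across dimensions. As a quick check under the same assumptions as before (values typed in by hand, illustrative vector names), the sketch below reproduces the distance between the Korean and German rows of eg2.

```r
korean <- c(article = 0.4, ed = 0.8)
german <- c(article = 0.8, ed = 0.6)

# Manhattan distance: sum of absolute differences
manual <- sum(abs(korean - german))
manual

# The same value from dist() with method = "manhattan"
from.dist <- as.numeric(dist(rbind(korean, german), method = "manhattan"))
from.dist
```

Both values should agree with the 0.6 reported for rows 1 and 2 above.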

Hierarchical clustering

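The following code snippet performs an agglomerative hierarchical cluster analysis with Ward's method. We pass ordinary Euclidean distances to hclust(); with method = "ward.D2", the distances are squared internally, so the analysis is based on squared Euclidean distance. Before running the analysis, however, we assign the L1 labels to the row names of the data frame so that the resulting dendrogram uses them as leaf labels.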

rownames(eg3) <- eg3$L1
# The row names represent L1.
eg3
##                L1 article  ed   s
## Korean     Korean     0.6 0.9 0.6
## German     German     0.3 0.6 0.9
## Japanese Japanese     0.9 0.4 0.5

We then obtain a distance matrix as before.

eg3.dist <- dist(eg3[ , 2:4])
eg3.dist
##             Korean    German
## German   0.5196152          
## Japanese 0.5916080 0.7483315

The function hclust() performs a hierarchical cluster analysis. The method argument specifies the linkage criterion.

eg3.hclust <- hclust(eg3.dist, method = "ward.D2")

We can draw a dendrogram using plot(). Specifying hang = -1 aligns all the leaves at the bottom of the plot.

plot(eg3.hclust, hang = -1)
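
The object returned by hclust() also records the order and the height of the merges, which helps when reading the dendrogram. The sketch below rebuilds the numeric part of eg3 by hand (assuming the values shown above; eg3.values and hc are names chosen for this illustration) and inspects the merge, height, and labels components.

```r
eg3.values <- data.frame(
  article = c(0.6, 0.3, 0.9),
  ed      = c(0.9, 0.6, 0.4),
  s       = c(0.6, 0.9, 0.5),
  row.names = c("Korean", "German", "Japanese")
)

hc <- hclust(dist(eg3.values), method = "ward.D2")

# Each row of merge is one step; negative numbers refer to single observations
hc$merge

# height gives the dissimilarity at which each merge takes place
hc$height

# labels holds the row names used for the leaves
hc$labels
```

The first merge joins Korean and German, the pair with the smallest distance (0.5196152) in the distance matrix.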

More authentic example

We now cluster more authentic data. The data are the relative frequencies of eight error tags across 12 nationalities in the EF-Cambridge Open Language Database (EFCAMDAT). We first read the data as usual and look at them.

error.freq <- read.csv("Data/error_freq.csv")
error.freq
##    nationality        AR       PR       VT       PL       AG        WO
## 1           br  5.584696 6.052505 3.128397 1.904357 1.990424 1.3657120
## 2           cn  5.592782 4.125751 3.551567 2.442850 1.747515 0.6549510
## 3           de  3.233303 3.957666 3.074154 1.231066 1.156172 1.1702148
## 4           es  3.739341 5.472206 2.713302 1.254047 1.459255 1.0488394
## 5           fr  4.898914 5.462552 3.169369 2.179051 1.455627 1.3660770
## 6           it  4.921787 4.939960 2.706239 2.255199 1.361380 1.4423363
## 7           jp  8.659891 4.806812 3.044314 2.826863 1.365745 0.8888787
## 8           kr 10.185876 5.031577 3.738282 2.935873 1.991863 1.1611332
## 9           mx  4.660731 5.724839 3.476950 1.764889 1.919061 1.4425301
## 10          ru  9.093976 4.476776 2.670699 1.618846 1.478422 1.1154406
## 11          tr  8.705263 5.115789 3.257895 2.773684 1.563158 1.2421053
## 12          tw  7.290674 5.102923 4.592356 3.181435 2.314021 0.8015350
##           CO        SI
## 1  0.5449309 0.3771483
## 2  0.7863817 0.3458317
## 3  0.5125541 0.2995750
## 4  0.2280086 0.2508094
## 5  0.5197414 0.5074502
## 6  0.3436494 0.4279096
## 7  1.1101447 0.4844961
## 8  0.8401695 0.4814455
## 9  0.5864995 0.3902809
## 10 0.7091835 0.3382532
## 11 0.9052632 0.5105263
## 12 0.7768301 0.4556672

What each error tag represents is as follows:

  • AR = article
  • PR = preposition
  • VT = verb tense (and aspect?)
  • PL = plural
  • AG = verb agreement
  • WO = word order
  • CO = conjunction (?)
  • SI = singular (?)

Notice that the absolute frequency varies across error types (e.g., AR is much more frequent than SI). We therefore need to standardize the variables. The scale() function standardizes a variable. To standardize multiple columns, we can use apply(), which applies a function row-wise (when MARGIN = 1) or column-wise (when MARGIN = 2).

error.freq2 <- apply(error.freq[ , 2:ncol(error.freq)], MARGIN = 2, scale)
error.freq2
##               AR          PR           VT          PL         AG
##  [1,] -0.3445771  1.65628001 -0.244719210 -0.44646068  0.9993837
##  [2,] -0.3410764 -1.44183999  0.540423186  0.37409985  0.2858129
##  [3,] -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161
##  [4,] -1.1434988  0.72318945 -1.014878242 -1.43740982 -0.5609801
##  [5,] -0.6414772  0.70766717 -0.168700120 -0.02787943 -0.5716371
##  [6,] -0.6315747 -0.13263309 -1.027982431  0.08815636 -0.8484962
##  [7,]  0.9867872 -0.34672898 -0.400724359  0.95926373 -0.8356748
##  [8,]  1.6474415  0.01468173  0.886851381  1.12537452  1.0036089
##  [9,] -0.7445955  1.12941050  0.401980447 -0.65898342  0.7897465
## [10,]  1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734
## [11,]  1.0064303  0.15009072 -0.004451063  0.87822901 -0.2557545
## [12,]  0.3940032  0.12940232  2.471484096  1.49956413  1.9499803
##                WO         CO         SI
##  [1,]  0.87668665 -0.4458702 -0.3342733
##  [2,] -1.90425741  0.5297247 -0.6998565
##  [3,]  0.11177852 -0.5766907 -1.2398486
##  [4,] -0.36311806 -1.7264120 -1.8091277
##  [5,]  0.87811493 -0.5476499  1.1868451
##  [6,]  1.17648906 -1.2591589  0.2583042
##  [7,] -0.98898499  1.8379062  0.9188837
##  [8,]  0.07624572  0.7470573  0.8832709
##  [9,]  1.17724720 -0.2779101 -0.1809656
## [10,] -0.10253245  0.2178012 -0.7883272
## [11,]  0.39305938  1.0100715  1.2227549
## [12,] -1.33072855  0.4911310  0.5823401
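
For a numeric vector, scale() subtracts the mean and divides by the standard deviation. The sketch below checks this by hand on a small made-up matrix (toy is hypothetical data, not the EFCAMDAT frequencies) and shows that calling scale() on the whole matrix gives the same values as the apply() call.

```r
# Hypothetical data with two columns on very different scales
toy <- cbind(AR = c(5.6, 3.2, 8.7, 10.2),
             SI = c(0.38, 0.30, 0.48, 0.51))

# Column-wise standardization with apply(), as in the text
toy.apply <- apply(toy, MARGIN = 2, scale)

# Manual z-scores for the first column
ar.manual <- (toy[, "AR"] - mean(toy[, "AR"])) / sd(toy[, "AR"])

# scale() also accepts a matrix and standardizes each column
toy.scale <- scale(toy)

all.equal(toy.apply[, "AR"], ar.manual)
all.equal(toy.apply, toy.scale, check.attributes = FALSE)
colMeans(toy.apply)      # approximately 0 for every column
apply(toy.apply, 2, sd)  # 1 for every column
```

After standardization every column has mean 0 and standard deviation 1, so no single error type dominates the distance computation.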

As before, we assign the nationality codes as the row names of the matrix.

# Use the nationality column itself so that the labels follow the row order
rownames(error.freq2) <- as.character(error.freq$nationality)
error.freq2
##            AR          PR           VT          PL         AG          WO
## br -0.3445771  1.65628001 -0.244719210 -0.44646068  0.9993837  0.87668665
## cn -0.3410764 -1.44183999  0.540423186  0.37409985  0.2858129 -1.90425741
## de -1.3625807 -1.71211070 -0.345359889 -1.47242885 -1.4513161  0.11177852
## es -1.1434988  0.72318945 -1.014878242 -1.43740982 -0.5609801 -0.36311806
## fr -0.6414772  0.70766717 -0.168700120 -0.02787943 -0.5716371  0.87811493
## it -0.6315747 -0.13263309 -1.027982431  0.08815636 -0.8484962  1.17648906
## jp  0.9867872 -0.34672898 -0.400724359  0.95926373 -0.8356748 -0.98898499
## kr  1.6474415  0.01468173  0.886851381  1.12537452  1.0036089  0.07624572
## mx -0.7445955  1.12941050  0.401980447 -0.65898342  0.7897465  1.17724720
## ru  1.1747183 -0.87740913 -1.093923795 -0.88152539 -0.5046734 -0.10253245
## tr  1.0064303  0.15009072 -0.004451063  0.87822901 -0.2557545  0.39305938
## tw  0.3940032  0.12940232  2.471484096  1.49956413  1.9499803 -1.33072855
##            CO         SI
## br -0.4458702 -0.3342733
## cn  0.5297247 -0.6998565
## de -0.5766907 -1.2398486
## es -1.7264120 -1.8091277
## fr -0.5476499  1.1868451
## it -1.2591589  0.2583042
## jp  1.8379062  0.9188837
## kr  0.7470573  0.8832709
## mx -0.2779101 -0.1809656
## ru  0.2178012 -0.7883272
## tr  1.0100715  1.2227549
## tw  0.4911310  0.5823401

We then compute a distance matrix, perform a hierarchical cluster analysis, and draw the dendrogram.

error.freq.dist <- dist(error.freq2)
error.freq.hclust <- hclust(error.freq.dist, method = "ward.D2")
plot(error.freq.hclust, hang = -1)

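Judging from the dendrogram, a three-cluster solution (k = 3) looks reasonable. We redraw the dendrogram and mark the three clusters with rect.hclust().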

plot(error.freq.hclust, hang = -1)
rect.hclust(error.freq.hclust, 3)

To assign a cluster to each observation, we use cutree().

cutree(error.freq.hclust, 3)
## br cn de es fr it jp kr mx ru tr tw 
##  1  2  2  2  1  1  3  3  1  2  3  3
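
The vector returned by cutree() is named by observation, so the members of each cluster can be listed with split() and counted with table(). The sketch below illustrates this with the three-language data from eg3 (typed in by hand) and k = 2, so that it runs without the EFCAMDAT file. The names eg3.values, hc, and cl are chosen for this illustration.

```r
eg3.values <- data.frame(
  article = c(0.6, 0.3, 0.9),
  ed      = c(0.9, 0.6, 0.4),
  s       = c(0.6, 0.9, 0.5),
  row.names = c("Korean", "German", "Japanese")
)
hc <- hclust(dist(eg3.values), method = "ward.D2")

# Cut the tree into two clusters; the result is a named integer vector
cl <- cutree(hc, k = 2)
cl

# Members of each cluster
split(names(cl), cl)

# Cluster sizes
table(cl)
```

Here Korean and German end up in the same cluster because they are merged first, and Japanese forms a cluster of its own.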

To add a column to error.freq2, we first convert it from a matrix into a data frame. We then add the cluster membership as a separate column.

error.freq2 <- as.data.frame(error.freq2)
error.freq2$cluster <- cutree(error.freq.hclust, 3)

To investigate the error profile of each L1 group, we visualize the (standardized) error frequency of each error in each cluster. We first load necessary packages.

# Run this line if these packages have not been installed yet.
# install.packages(c("tidyr", "ggplot2"))

# To install from source files in a local folder:
# install.packages("C://abc/xyz/Packages/tidyr_0.3.1.tar.gz", repos = NULL, type = "source")
# install.packages("C://abc/xyz/Packages/ggplot2_1.0.1.tar.gz", repos = NULL, type = "source")

library("tidyr") # to restructure data
library("ggplot2") # to draw figures efficiently

We then convert the data frame into the so-called long format, in which each row holds the standardized frequency of one error tag for one observation.

error.freq.long <- gather(error.freq2, tag, freq, AR:SI)
# first few lines
head(error.freq.long)
##   cluster tag       freq
## 1       1  AR -0.3445771
## 2       2  AR -0.3410764
## 3       2  AR -1.3625807
## 4       2  AR -1.1434988
## 5       1  AR -0.6414772
## 6       1  AR -0.6315747

We then draw the error profile for each cluster. The code below says that

  • the x-axis should represent tag,
  • the y-axis should represent freq,
  • a boxplot should be drawn, and
  • separate panels should be drawn for separate clusters.

ggplot(error.freq.long, aes(tag, freq)) +
  geom_boxplot() +
  facet_wrap(~ cluster)
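
The boxplots can be complemented by a numeric summary. As a minimal sketch, the snippet below computes the median of freq for each combination of cluster and tag with base R's aggregate(). Note that toy.long is a small made-up data frame that only mimics the structure of error.freq.long; its values are not taken from the EFCAMDAT data.

```r
# Hypothetical long-format data with the same columns as error.freq.long
toy.long <- data.frame(
  cluster = c(1, 1, 2, 2, 1, 1, 2, 2),
  tag     = rep(c("AR", "PR"), each = 4),
  freq    = c(-0.5, -0.3, 1.0, 1.4, 0.8, 1.2, -0.9, -0.7)
)

# Median standardized frequency for each cluster-tag combination
toy.median <- aggregate(freq ~ cluster + tag, data = toy.long, FUN = median)
toy.median
```

Replacing toy.long with error.freq.long should give one median per cluster and error tag, that is, the centre line of each box in the figure.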