Text mining with Elasticsearch database

Elasticsearch is a distributed, RESTful, free/open source search server based on Apache Lucene. It is a NoSQL database that lets you retrieve stored data in JSON format using GET requests.

R does not provide any tool for working directly with an Elasticsearch database (unlike some SQL databases), so this post only shows how to retrieve textual data from Elasticsearch and perform basic text mining using the package tm.

In the following example, we will use a database of github users. Every user can write a short bio about themselves, and we will analyze these. The database runs on localhost, so this example is not reproducible, but if you are interested, you can download the crawler which I used to fill the database from here.

Data preparation

First, we will see how many users we have in the database. The database is called github and we will retrieve the number of documents stored in a mapping called user.

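# RCurl provides getURL(); fromJSON() is assumed to come from rjson
# (the post does not say which JSON package was used)
library(RCurl)
library(rjson)

# get number of documents in a database
count <- fromJSON(getURL("http://localhost:9200/github/user/_count"))

n <- count$count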
n
## [1] 14354

Now we can retrieve all documents, limiting the query to the fields bio and login.

# get all users bio information and name
url <- paste("http://localhost:9200/github/_search?type=user&fields=bio,login&size=", 
    n, sep = "")
json <- fromJSON(getURL(url))
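
One detail worth watching when pasting a number into a URL: R may render a large numeric value in scientific notation (for example 1e+05), which would not be a valid size parameter. The following is only a sketch; build_search_url is a hypothetical helper, not part of the original code, and it simply reproduces the URL layout used above.

```r
# Hypothetical helper (not from the original post): assembles the search URL
# used above from its parts.
build_search_url <- function(host, index, type, fields, size) {
  paste0(
    host, "/", index, "/_search",
    "?type=", type,
    "&fields=", paste(fields, collapse = ","),
    # format() avoids scientific notation such as "1e+05" for large counts
    "&size=", format(size, scientific = FALSE, trim = TRUE)
  )
}

url_small <- build_search_url("http://localhost:9200", "github", "user",
                              c("bio", "login"), 14354)
url_large <- build_search_url("http://localhost:9200", "github", "user",
                              c("bio", "login"), 1e5)
print(url_small)
print(url_large)
```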

We are interested only in the fields bio and login, which we can easily extract. We will also keep only those strings which are not empty.

# list of individual documents
lst <- json$hits$hits

# extract description from documents
desc <- sapply(lst, FUN = function(x) {
    # some documents don't have a bio description
    if (is.null(x$fields$bio)) "" else x$fields$bio
})
logins <- unlist(sapply(lst, FUN = function(x) x$fields$login))
names(desc) <- logins
desc[12]  # example of user description
##                                                                 sunny 
## "Red-headed geek Web developer that loves Ruby, JavaScript and HTTP."

# select only those, which are not empty
desc <- subset(desc, nchar(desc) > 0)
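
Since the database itself is not available, here is a small self-contained sketch of the same extraction step on a mock list. The structure of lst is an assumption modelled on the fields accessed above (x$fields$bio and x$fields$login); the logins and bios are made up.

```r
# Mock of json$hits$hits; the structure is an assumption based on the
# fields accessed in the post, and the values are invented.
lst <- list(
  list(fields = list(login = "alice", bio = "Web developer who loves Ruby.")),
  list(fields = list(login = "bob")),             # no bio field at all
  list(fields = list(login = "carol", bio = ""))  # empty bio
)

# vapply() guarantees exactly one character value per document
desc <- vapply(lst, function(x) {
  bio <- x$fields$bio
  if (is.null(bio)) "" else bio
}, character(1))
names(desc) <- vapply(lst, function(x) x$fields$login, character(1))

# keep only non-empty bios
desc <- desc[nchar(desc) > 0]
print(desc)
```

Using vapply instead of sapply is a design choice: it fails loudly if a document yields anything other than a single string, instead of silently returning a list.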

Using the package tm, we will transform the data to class Corpus, strip extra whitespace, convert the strings to lower case and remove English stopwords.

# create corpus for tm package
library(tm)
corpus <- Corpus(VectorSource(desc))

# eliminate extra whitespace
text <- tm_map(corpus, stripWhitespace)

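# convert to lower case
# (recent versions of tm need the content_transformer() wrapper;
# older versions accepted tolower directly)
text <- tm_map(text, content_transformer(tolower))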

# remove english stopwords
text <- tm_map(text, removeWords, stopwords("english"))

Now we can create a document-term matrix.

# create document-term matrix
dtm <- DocumentTermMatrix(text)
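
To make clear what a document-term matrix contains, here is a simplified base R sketch on three made-up bios. It only splits on whitespace and counts; DocumentTermMatrix in tm applies its own tokenization and options, so this is an illustration of the idea, not a reimplementation.

```r
# three invented documents
docs <- c(d1 = "ruby and rails developer",
          d2 = "javascript developer",
          d3 = "ruby developer ruby fan")

# split each document on whitespace
tokens <- strsplit(tolower(docs), "\\s+")
terms <- sort(unique(unlist(tokens)))

# count how often each term occurs in each document:
# one row per document, one column per term
dtm_toy <- t(vapply(tokens, function(tok) {
  as.integer(table(factor(tok, levels = terms)))
}, integer(length(terms))))
colnames(dtm_toy) <- terms
print(dtm_toy)
```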

Data analysis

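Let's see which terms occur at least round(n/100) times across all bios, and which terms correlate with ruby and javascript, two of the most common languages used by github users. Note that findFreqTerms counts total occurrences rather than documents, and that n is the number of all users, including those without a bio, so this is only roughly a 1 % threshold.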

# which terms occur at least round(n/100) times in the corpus
findFreqTerms(dtm, round(n/100))
##  [1] "also"         "application"  "applications" "computer"    
##  [5] "design"       "developed"    "developer"    "development" 
##  [9] "engineer"     "experience"   "java"         "management"  
## [13] "new"          "programming"  "project"      "rails"       
## [17] "ruby"         "software"     "system"       "systems"     
## [21] "team"         "technical"    "using"        "web"         
## [25] "work"         "worked"       "working"      "years"

# which terms correlate with ruby?
findAssocs(dtm, "ruby", 0.4)
##     rails       new    custom  projects     using      2000 interface 
##      0.61      0.45      0.44      0.43      0.42      0.42      0.41 
##  software  november       web 
##      0.41      0.40      0.40

# which terms correlate with javascript?
findAssocs(dtm, "javascript", 0.4)
##                     brad                accessory            agricultural. 
##                     0.44                     0.43                     0.43 
##                  arrange        brad@braddavis.cc                   brad's 
##                     0.43                     0.43                     0.43 
##                    cater                  (client             competencies 
##                     0.43                     0.43                     0.43 
##                  conduct            deliverables.                depending 
##                     0.43                     0.43                     0.43 
##                discussed              enterprises                  jekyll, 
##                     0.43                     0.43                     0.43 
##                    maple               networking                      ntl 
##                     0.43                     0.43                     0.43 
##                    obama                    ohio.                      owl 
##                     0.43                     0.43                     0.43 
##               podcasting produced/directed/edited                  relying 
##                     0.43                     0.43                     0.43 
##                    saas)                  (server                    side) 
##                     0.43                     0.43                     0.43 
##                    sizes                   social                    stow, 
##                     0.43                     0.43                     0.43 
##                thinktank                 umbrella                 varrying 
##                     0.43                     0.43                     0.43 
##               volunteers                     1921                     2723 
##                     0.43                     0.43                     0.43 
##                    (330)                     333-                    44224 
##                     0.43                     0.43                     0.43 
##                     site                  website                    davis 
##                     0.42                     0.42                     0.41 
##                      web 
##                     0.41

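Clearly, ruby correlates strongly with rails (0.61); the remaining associations are much weaker. The javascript associations are mostly a long run of identical 0.43 values for rare tokens, which suggests they come from a single long bio rather than from a general pattern.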
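
As far as I understand, findAssocs reports the Pearson correlation between the column of the chosen term and every other term column of the document-term matrix. The sketch below reproduces that idea in base R on an invented toy matrix; dtm_toy and its counts are assumptions made only for illustration.

```r
# invented term counts: rows are documents, columns are terms
dtm_toy <- rbind(
  d1 = c(ruby = 1, rails = 1, javascript = 0, web = 1),
  d2 = c(ruby = 2, rails = 1, javascript = 0, web = 0),
  d3 = c(ruby = 0, rails = 0, javascript = 1, web = 1),
  d4 = c(ruby = 0, rails = 0, javascript = 2, web = 1),
  d5 = c(ruby = 1, rails = 1, javascript = 0, web = 0)
)

# Pearson correlation of the "ruby" column with every other term column
others <- dtm_toy[, colnames(dtm_toy) != "ruby"]
assoc <- cor(dtm_toy[, "ruby"], others)[1, ]
assoc <- sort(round(assoc, 2), decreasing = TRUE)

# keep only terms at or above the correlation limit, here 0.4
print(assoc[assoc >= 0.4])
```

In this toy example only rails passes the limit. It also shows why a single unusual document can dominate the result: with few documents containing a term, every rare word that co-occurs with it gets the same high correlation.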

Clustering

First, let's reduce the dimensionality of the matrix: we select only those terms which are present in at least 5 documents, and then only those documents which include at least 5 of the remaining terms.

# transform to 0-1 matrix
dtm$v[dtm$v > 0] <- 1

dtm <- as.matrix(dtm)

# select those terms which are present in at least 5 documents
n_documents <- colSums(dtm)
table(n_documents)  # most terms are included only once or twice
## n_documents
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 12776  1846   806   467   327   241   155   131    82    91    74    47 
##    13    14    15    16    17    18    19    20    21    22    23    24 
##    55    37    22    40    34    32    18    26    20    23    19    18 
##    25    26    27    28    29    30    31    32    33    34    35    36 
##    11     9    13    15     7    10     9    13     5     4     9     7 
##    37    38    39    40    41    42    43    44    45    46    47    48 
##     9     8     4    11     1     4     1     1     4     5     2     3 
##    49    50    51    53    54    55    56    57    58    60    61    62 
##     2     1     2     2     4     1     3     3     3     1     2     1 
##    63    64    65    66    68    69    71    72    73    74    75    76 
##     1     4     2     1     3     2     1     2     1     1     1     1 
##    77    79    80    83    84    88    89    91    92    93    94    96 
##     1     1     1     1     2     2     1     1     2     1     1     1 
##    98   100   101   103   104   106   107   111   112   113   122   129 
##     1     1     1     1     2     1     2     1     1     1     1     1 
##   131   139   161   163   178   190   244   251   333   337   346 
##     1     1     1     1     1     1     1     1     1     1     1
dtm_subset <- subset(dtm, select = n_documents >= 5)

# select those records which have at least 5 terms
n_terms <- rowSums(dtm_subset)
table(n_terms)
## n_terms
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 
## 121  92 135 113  84  52  58  50  36  29  27  21  26  22  16  15  13  14 
##  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35 
##  16  11  16  13   6   8   6   8   8   8  12  13  11   8   2   6   4   6 
##  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53 
##  10   4   7   3   3   4   6   4   5   3   3   3   7   5   6   5   3   5 
##  54  55  56  57  58  60  61  63  64  65  66  67  68  72  73  74  75  76 
##   5   3   1   2   3   1   1   2   1   1   2   1   2   2   3   1   3   2 
##  78  79  80  81  84  88  89  90  92  95  96  97  99 100 101 103 104 105 
##   1   1   1   2   1   1   1   1   1   1   1   2   1   2   2   1   1   1 
## 117 119 123 125 128 129 133 134 135 138 139 140 145 146 148 150 158 161 
##   2   3   1   1   1   1   1   1   1   1   1   1   1   1   2   2   1   1 
## 163 166 173 174 180 183 185 189 190 192 201 223 228 230 236 237 244 247 
##   1   1   1   1   1   1   2   1   1   1   1   1   1   1   2   1   1   2 
## 249 252 260 263 285 290 309 315 341 377 463 
##   1   2   1   1   1   1   1   1   1   1   1
dtm_subset <- subset(dtm_subset, subset = n_terms >= 5)

# reduction of dimension is substantial
dim(dtm)
## [1]  1286 17637
dim(dtm_subset)
## [1]  741 1742
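
The order of the two filtering steps matters, because a document can drop below the threshold once rare terms are removed. Here is a minimal sketch on an invented 0-1 matrix, using a threshold of 2 instead of 5 so that the example stays small.

```r
# invented 0-1 matrix: rows are documents, columns are terms
m <- rbind(
  d1 = c(a = 1, b = 1, c = 0, d = 0),
  d2 = c(a = 1, b = 1, c = 1, d = 0),
  d3 = c(a = 1, b = 0, c = 0, d = 0),
  d4 = c(a = 0, b = 1, c = 0, d = 1)
)

# step 1: keep terms present in at least 2 documents
keep_terms <- colSums(m) >= 2
m_sub <- m[, keep_terms, drop = FALSE]

# step 2: keep documents with at least 2 of the remaining terms
keep_docs <- rowSums(m_sub) >= 2
m_sub <- m_sub[keep_docs, , drop = FALSE]

print(m_sub)
```

Document d4 contains two terms in the full matrix, but only one of them survives step 1, so it is dropped in step 2.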

Now we can calculate distances between documents (using the Jaccard distance) and perform clustering.

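# calculate Jaccard distance between documents
# dist.binary() comes from package ade4; method = 1 selects the Jaccard index
# and the returned distance is sqrt(1 - similarity)
library(ade4)
dM_documents <- dist.binary(as.matrix(dtm_subset), method = 1)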

# perform hierarchical clustering of documents
fit_documents <- hclust(dM_documents, method = "single")

plot(fit_documents, labels = FALSE, main = "Dendrogram of biographies of github users")

[Figure: dendrogram of biographies of github users]
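
To see what the distance behind the dendrogram means, here is a base R sketch on three invented 0-1 vectors. It relies on dist(method = "binary") from base R, which gives 1 minus the Jaccard similarity; according to the ade4 documentation, dist.binary returns the square root of that quantity, so the sketch takes the square root before clustering. With that definition, a cut at height 0.8 would correspond to a Jaccard similarity of 1 - 0.8^2 = 0.36.

```r
# invented 0-1 matrix: rows are documents, columns are terms
x <- rbind(
  d1 = c(1, 1, 1, 0, 0),
  d2 = c(1, 1, 0, 1, 0),
  d3 = c(0, 0, 0, 1, 1)
)

# Jaccard similarity: shared terms divided by terms present in either document
jaccard_similarity <- function(a, b) sum(a & b) / sum(a | b)
s12 <- jaccard_similarity(x["d1", ], x["d2", ])

# base R "binary" distance equals 1 - Jaccard similarity
d_base <- dist(x, method = "binary")

# square root, mirroring the sqrt(1 - s) form documented for ade4::dist.binary
d_sqrt <- sqrt(d_base)

# single-linkage clustering, as in the post
fit <- hclust(d_sqrt, method = "single")
print(round(as.matrix(d_sqrt), 3))
print(fit$height)
```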

Unfortunately, there seems to be no clear structure, but let's cut the dendrogram at height 0.8 and find clusters with more than 10 users.

# cut dendrogram
ct <- cutree(fit_documents, h = 0.8)
m <- which(table(ct) > 10)  # clusters with more than 10 users

Now we print the names of users in those clusters.

# print names of users
# (a separate variable avoids overwriting n, the number of users)
for (i in m) {
    print(paste("=== Cluster", i, "==="))
    members <- names(which(ct == i))
    print(members)
}
## [1] "=== Cluster 31 ==="
##  [1] "cjse"           "janne"          "bradly"         "johndagostino" 
##  [5] "andyw8"         "mikehale"       "polaris"        "sbraford"      
##  [9] "piclez"         "apexsutherland" "ledermann"      "penso"         
## [13] "jongilbraith"   "kelyar"         "mudge"          "mbcharbonneau" 
## [17] "ariejan"        "alto"           "bauerpl"
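
The cut-and-select step can be tried on a tiny invented data set. This is only a sketch: the points, the Euclidean distance and the thresholds are assumptions chosen so that the result is easy to check by eye. It takes the cluster labels from the names of the size table instead of relying on their positions.

```r
# six invented points forming two tight groups and one outlier
x <- rbind(
  a1 = c(0.0, 0.0), a2 = c(0.1, 0.0), a3 = c(0.0, 0.1),
  b1 = c(5.0, 5.0), b2 = c(5.1, 5.0),
  c1 = c(10.0, 0.0)
)

fit <- hclust(dist(x), method = "single")

# cut the tree at a fixed height; each point gets a cluster label
ct <- cutree(fit, h = 1)

# keep clusters with at least 2 members
sizes <- table(ct)
big <- as.integer(names(sizes)[sizes >= 2])

# collect member names per selected cluster
members <- lapply(big, function(k) names(ct)[ct == k])
names(members) <- paste("cluster", big)
print(members)
```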

And we can do the same thing for terms.

# calculate Jaccard distance between terms (on the transposed matrix)
dM_terms <- dist.binary(t(as.matrix(dtm_subset)), method = 1)

# perform hierarchical clustering of terms
fit_terms <- hclust(dM_terms, method = "single")

plot(fit_terms, labels = FALSE, main = "Dendrogram of terms in biographies of github users")

[Figure: dendrogram of terms in biographies of github users]


# cut dendrogram
ct <- cutree(fit_terms, h = 0.8)
m <- which(table(ct) > 10)  # clusters with more than 10 terms

# print terms in clusters
# (a separate variable avoids overwriting n, the number of users)
for (i in m) {
    print(paste("=== Cluster", i, "==="))
    members <- names(which(ct == i))
    print(members)
}
## [1] "=== Cluster 84 ==="
##  [1] "april"     "august"    "december"  "january"   "july"     
##  [6] "june"      "march"     "may"       "november"  "october"  
## [11] "september" "2011"