Elasticsearch is a distributed, RESTful, free and open source search server based on Apache Lucene. It is also a NoSQL database, which lets you retrieve stored data in JSON format using GET requests.
R does not provide any tool for working with an Elasticsearch database directly (unlike some SQL databases), so this post only shows how to retrieve textual data from Elasticsearch and perform basic text mining using the tm package.
In the following example, we will use a database of GitHub users. Every user can write a short bio about themselves, and we will analyze these. The database runs on localhost, so this example is not reproducible, but if you are interested, you can download the crawler which I used to fill the database from here.
First, we will see how many users we have in the database. The database is called github, and we will retrieve the number of documents stored in a mapping called user.
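# getURL comes from RCurl; fromJSON is assumed here to come from rjson
library(RCurl)
library(rjson)
# get number of documents in a database
count <- fromJSON(getURL("http://localhost:9200/github/user/_count"))
n <- count$count
n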
## [1] 14354
Now we can retrieve all documents and limit our query to the fields bio and login.
# get the bio and login of every user
url <- paste("http://localhost:9200/github/_search?type=user&fields=bio,login&size=",
n, sep = "")
json <- fromJSON(getURL(url))
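One detail of building the URL with paste is worth a hedge: R prints some round numbers in scientific notation (100000 becomes "1e+05"), which would produce an invalid size parameter. A minimal sketch of a safer way to assemble the same URL follows; build_search_url is a hypothetical helper, not part of any package, and the base URL, index and type simply mirror the example above.

```r
# hypothetical helper; defaults mirror the localhost example in this post
build_search_url <- function(fields, size,
                             base = "http://localhost:9200",
                             index = "github", type = "user") {
  # %d forces plain integer formatting, so 100000 never becomes "1e+05"
  sprintf("%s/%s/_search?type=%s&fields=%s&size=%d",
          base, index, type, paste(fields, collapse = ","), as.integer(size))
}

url <- build_search_url(c("bio", "login"), size = 14354)
print(url)

# the case where paste() would have produced "size=1e+05"
print(build_search_url("bio", size = 1e5))
```

The first call reproduces the URL used above; the second shows the case where the two approaches differ.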
We are interested only in the fields bio and login, which we can easily extract. We will also keep only those bios that are not empty.
# list of individual documents
lst <- json$hits$hits
# extract description from documents
desc <- sapply(lst, FUN = function(x) {
  if (is.null(x$fields$bio)) "" else x$fields$bio
}) # some documents don't have bio description
# user logins, stored under a name that does not shadow the names() function
logins <- unlist(sapply(lst, FUN = function(x) x$fields$login))
names(desc) <- logins
desc[12] # example of user description
## sunny
## "Red-headed geek Web developer that loves Ruby, JavaScript and HTTP."
# keep only the descriptions that are not empty
desc <- subset(desc, nchar(desc) > 0)
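The handling of missing fields is easy to try without a running Elasticsearch instance. Here is a minimal sketch on a hand-made list that only imitates the shape of json$hits$hits; the three records and the helper get_field are made up for illustration.

```r
# toy stand-in for json$hits$hits (made-up records, not real data)
hits <- list(
  list(fields = list(login = "alice", bio = "Ruby developer")),
  list(fields = list(login = "bob")),               # no bio field at all
  list(fields = list(login = "carol", bio = ""))    # empty bio
)

# hypothetical helper: return a field as a string, or "" if it is missing
get_field <- function(hit, name) {
  value <- hit$fields[[name]]
  if (is.null(value)) "" else as.character(value)
}

# vapply guarantees a character vector with one element per document
bios <- vapply(hits, get_field, character(1), name = "bio")
names(bios) <- vapply(hits, get_field, character(1), name = "login")

# keep only the non-empty descriptions
bios <- bios[nchar(bios) > 0]
print(bios)
```

Using vapply instead of sapply is a design choice here: it fails loudly if a record ever yields something other than a single string, instead of silently returning a list.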
Using the tm package, we will convert the data to a Corpus, strip extra whitespace, lowercase the strings and remove English stopwords.
# create corpus for tm package
library(tm)
corpus <- Corpus(VectorSource(desc))
# eliminate extra whitespace
text <- tm_map(corpus, stripWhitespace)
# convert to lower case
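# content_transformer is the documented way to wrap tolower in tm 0.6 and later
text <- tm_map(text, content_transformer(tolower))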
# remove english stopwords
text <- tm_map(text, removeWords, stopwords("english"))
Now we can create a document-term matrix.
# create document-term matrix
dtm <- DocumentTermMatrix(text)
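Let's see which terms occur frequently, and which terms correlate with ruby and javascript, two of the most common languages used by GitHub users. Note that findFreqTerms counts the total occurrences of a term, and that n is the number of all users, including those with an empty bio, so the threshold round(n/100) is 144 occurrences.
# which terms occur at least round(n/100) times in total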
findFreqTerms(dtm, round(n/100))
## [1] "also" "application" "applications" "computer"
## [5] "design" "developed" "developer" "development"
## [9] "engineer" "experience" "java" "management"
## [13] "new" "programming" "project" "rails"
## [17] "ruby" "software" "system" "systems"
## [21] "team" "technical" "using" "web"
## [25] "work" "worked" "working" "years"
# which terms correlate with ruby?
findAssocs(dtm, "ruby", 0.4)
## rails new custom projects using 2000 interface
## 0.61 0.45 0.44 0.43 0.42 0.42 0.41
## software november web
## 0.41 0.40 0.40
# which terms correlate with javascript?
findAssocs(dtm, "javascript", 0.4)
## brad accessory agricultural.
## 0.44 0.43 0.43
## arrange brad@braddavis.cc brad's
## 0.43 0.43 0.43
## cater (client competencies
## 0.43 0.43 0.43
## conduct deliverables. depending
## 0.43 0.43 0.43
## discussed enterprises jekyll,
## 0.43 0.43 0.43
## maple networking ntl
## 0.43 0.43 0.43
## obama ohio. owl
## 0.43 0.43 0.43
## podcasting produced/directed/edited relying
## 0.43 0.43 0.43
## saas) (server side)
## 0.43 0.43 0.43
## sizes social stow,
## 0.43 0.43 0.43
## thinktank umbrella varrying
## 0.43 0.43 0.43
## volunteers 1921 2723
## 0.43 0.43 0.43
## (330) 333- 44224
## 0.43 0.43 0.43
## site website davis
## 0.42 0.42 0.41
## web
## 0.41
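Clearly, ruby correlates strongly with rails (0.61); the remaining associations are much weaker. The javascript associations sit almost uniformly at 0.43 and look like the vocabulary of a single bio, so they probably do not reflect a real pattern.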
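The association scores reported by findAssocs are correlations between term columns of the document-term matrix. A minimal sketch of that idea with base R follows, on a made-up matrix (the terms, the counts and the 0.3 limit are invented for illustration, and this is not the tm implementation itself).

```r
# made-up document-term matrix: 6 documents x 3 terms
toy <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    1, 0, 1,
    0, 0, 1,
    0, 0, 0,
    0, 1, 1),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("ruby", "rails", "web"))
)

# Pearson correlation of "ruby" with every other term column
assoc <- cor(toy)["ruby", c("rails", "web")]

# keep only correlations at or above the limit, strongest first
assoc <- sort(assoc[assoc >= 0.3], decreasing = TRUE)
print(round(assoc, 2))
```

A term that appears in only one document correlates equally with every other term that appears only in that document, which is one way a single long bio can flood the result with identical scores.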
Next, let's reduce the dimensionality of the matrix before clustering: we keep only those terms which are present in at least 5 documents, and then keep only those documents which include at least 5 of the remaining terms.
# transform to 0-1 matrix
dtm$v[dtm$v > 0] <- 1
dtm <- as.matrix(dtm)
# select those terms which are present in at least 5 documents
n_documents <- colSums(dtm) # number of documents containing each term
table(n_documents) # most terms are included only once or twice
## n_documents
## 1 2 3 4 5 6 7 8 9 10 11 12
## 12776 1846 806 467 327 241 155 131 82 91 74 47
## 13 14 15 16 17 18 19 20 21 22 23 24
## 55 37 22 40 34 32 18 26 20 23 19 18
## 25 26 27 28 29 30 31 32 33 34 35 36
## 11 9 13 15 7 10 9 13 5 4 9 7
## 37 38 39 40 41 42 43 44 45 46 47 48
## 9 8 4 11 1 4 1 1 4 5 2 3
## 49 50 51 53 54 55 56 57 58 60 61 62
## 2 1 2 2 4 1 3 3 3 1 2 1
## 63 64 65 66 68 69 71 72 73 74 75 76
## 1 4 2 1 3 2 1 2 1 1 1 1
## 77 79 80 83 84 88 89 91 92 93 94 96
## 1 1 1 1 2 2 1 1 2 1 1 1
## 98 100 101 103 104 106 107 111 112 113 122 129
## 1 1 1 1 2 1 2 1 1 1 1 1
## 131 139 161 163 178 190 244 251 333 337 346
## 1 1 1 1 1 1 1 1 1 1 1
dtm_subset <- subset(dtm, select = n_documents >= 5)
# select those records which have at least 5 terms
n_terms <- rowSums(dtm_subset) # number of retained terms in each document
table(n_terms)
## n_terms
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 121 92 135 113 84 52 58 50 36 29 27 21 26 22 16 15 13 14
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 16 11 16 13 6 8 6 8 8 8 12 13 11 8 2 6 4 6
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 10 4 7 3 3 4 6 4 5 3 3 3 7 5 6 5 3 5
## 54 55 56 57 58 60 61 63 64 65 66 67 68 72 73 74 75 76
## 5 3 1 2 3 1 1 2 1 1 2 1 2 2 3 1 3 2
## 78 79 80 81 84 88 89 90 92 95 96 97 99 100 101 103 104 105
## 1 1 1 2 1 1 1 1 1 1 1 2 1 2 2 1 1 1
## 117 119 123 125 128 129 133 134 135 138 139 140 145 146 148 150 158 161
## 2 3 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1
## 163 166 173 174 180 183 185 189 190 192 201 223 228 230 236 237 244 247
## 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2
## 249 252 260 263 285 290 309 315 341 377 463
## 1 2 1 1 1 1 1 1 1 1 1
dtm_subset <- subset(dtm_subset, subset = n_terms >= 5)
# reduction of dimension is substantial
dim(dtm)
## [1] 1286 17637
dim(dtm_subset)
## [1] 741 1742
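The two-step filter is easier to follow on a small matrix. This is a minimal sketch with made-up data and much lower thresholds (3 documents and 2 terms instead of 5 and 5), using plain matrix indexing rather than subset.

```r
# made-up 0/1 matrix: 4 documents (rows) x 4 terms (columns)
toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 1, 0,
    1, 0, 0, 0,
    1, 1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("web", "ruby", "java", "perl"))
)

min_docs  <- 3  # a term must occur in at least this many documents
min_terms <- 2  # a document must keep at least this many terms

# step 1: drop rare terms (column sums of a 0/1 matrix = document frequency)
keep_terms <- colSums(toy) >= min_docs
reduced <- toy[, keep_terms, drop = FALSE]

# step 2: drop documents that are left with too few terms
keep_docs <- rowSums(reduced) >= min_terms
reduced <- reduced[keep_docs, , drop = FALSE]

print(dim(toy))
print(dim(reduced))
```

The order matters: the document filter counts only the terms that survived the first step, which is why doc3 is dropped even though it originally contained a term.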
Now we can calculate distances between documents (based on the Jaccard index) and perform hierarchical clustering.
# dist.binary comes from the ade4 package; method = 1 selects the Jaccard index
library(ade4)
dM_documents <- dist.binary(as.matrix(dtm_subset), method = 1)
# perform hierarchical clustering of documents
fit_documents <- hclust(dM_documents, method = "single")
plot(fit_documents, labels = FALSE, main = "Dendrogram of biographies of github users")
Unfortunately, there seems to be no clear structure, but let's cut the dendrogram at height 0.8 and find clusters with more than 10 users.
# cut dendrogram
ct <- cutree(fit_documents, h = 0.8)
m <- which(table(ct) > 10) # clusters with more than 10 users
Print the names of users in those clusters:
# print names of users
for (i in m) {
  print(paste("=== Cluster", i, "==="))
  members <- names(which(ct == i)) # separate name, so n (the user count) is not overwritten
  print(members)
}
## [1] "=== Cluster 31 ==="
## [1] "cjse" "janne" "bradly" "johndagostino"
## [5] "andyw8" "mikehale" "polaris" "sbraford"
## [9] "piclez" "apexsutherland" "ledermann" "penso"
## [13] "jongilbraith" "kelyar" "mudge" "mbcharbonneau"
## [17] "ariejan" "alto" "bauerpl"
And we can do the same thing for terms.
# calculate Jaccard-based distance between terms, using the transposed matrix
dM_terms <- dist.binary(t(as.matrix(dtm_subset)), method = 1)
# perform hierarchical clustering of terms
fit_terms <- hclust(dM_terms, method = "single")
plot(fit_terms, labels = FALSE, main = "Dendrogram of terms in biographies of github users")
# cut dendrogram
ct <- cutree(fit_terms, h = 0.8)
m <- which(table(ct) > 10) # clusters with more than 10 terms
# print terms in clusters
for (i in m) {
  print(paste("=== Cluster", i, "==="))
  members <- names(which(ct == i)) # separate name, so n (the user count) is not overwritten
  print(members)
}
## [1] "=== Cluster 84 ==="
## [1] "april" "august" "december" "january" "july"
## [6] "june" "march" "may" "november" "october"
## [11] "september" "2011"
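The whole clustering pipeline (binary distance, single linkage, cutting the tree, keeping the larger clusters) can be tried on a tiny example. This is only a sketch under a few assumptions: the matrix is made up, base R's dist(method = "binary") stands in for ade4's dist.binary, and the cut height 0.6 and minimum size 3 are chosen for this toy data. Base R returns 1 - s for Jaccard similarity s, while the ade4 documentation describes its distances as sqrt(1 - s), so the heights are not directly comparable with the dendrograms above.

```r
# made-up 0/1 matrix: 6 users x 4 terms
toy <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 0, 1,
    0, 0, 1, 1),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("user", 1:6), NULL)
)

# Jaccard distance by hand for two rows: 1 - shared terms / terms in either
a <- toy["user3", ]
b <- toy["user4", ]
by_hand <- 1 - sum(a & b) / sum(a | b)

d <- dist(toy, method = "binary")      # same distance for all pairs of rows
fit <- hclust(d, method = "single")    # single linkage, as in the post
groups <- cutree(fit, h = 0.6)         # cut the tree at height 0.6

# keep clusters with at least 3 members and print who is in them
sizes <- table(groups)
big <- as.integer(names(sizes)[sizes >= 3])
for (g in big) {
  cat("cluster", g, ":", names(groups)[groups == g], "\n")
}
```

Clustering terms instead of users only requires passing t(toy) to dist, which mirrors the transposition used above.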