Overview
This is the second-week assignment of the Data Science Capstone course on Coursera, the final course of the Data Science Specialization by Johns Hopkins University.
The overall task is to implement a prediction algorithm that suggests words while the user is typing, delivered in the form of a Shiny app.
We have a text corpus in 4 languages:
- English
- German
- Russian
- Finnish
Each of them has three categories:
- Blog entries
- News entries
- Twitter posts
The first week’s assignment was to download the corpus and perform some basic exploration.
This is the second part of the task: the exploratory data analysis. The goal is to perform a deeper analysis of the corpus and to present the findings here.
Approach
Goal
I targeted the following goals:
- Get a deeper insight into the corpus.
- Explore more detailed statistics.
- Implement a preliminary algorithm which calculates the word and phrase frequencies.
Algorithm
The algorithm, in outline, is the following (a condensed code sketch follows the list):
- Read the corpus line by line.
- Split every line at punctuation such as periods, commas, exclamation marks, etc.
- Perform some cleanup: remove unnecessary white space, convert the text to lowercase, convert or remove special characters, etc.
- Create a map where the key is a word, word pair, word triplet, etc., and the value is its number of occurrences.
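A condensed sketch of this pipeline, using a plain named vector as the map (the real implementation in the Appendix uses a hash instead; the function name countPhrases is just illustrative):

# Count every n-word phrase of one input line in a named integer vector.
countPhrases <- function(line, n) {
  counts <- integer(0)
  for (linePart in strsplit(line, "[.!?,]")[[1]]) {
    # Cleanup: lowercase, trim, collapse repeated white space, split into words.
    words <- strsplit(gsub("\\s+", " ", trimws(tolower(linePart))), " ")[[1]]
    if (length(words) < n) next
    for (i in 1:(length(words) - n + 1)) {
      phrase <- paste(words[i:(i + n - 1)], collapse = " ")
      counts[phrase] <- if (is.na(counts[phrase])) 1 else counts[phrase] + 1
    }
  }
  counts
}
countPhrases("This is a test. This is only a test!", 2)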
Speed up
It turned out that speed is a big bottleneck, so I decided to start with a small fraction of the corpus. I started with the blog entries.
To speed up file reading I decided to read multiple lines at once instead of reading the file line by line. Reading 100 lines at a time turned out to be a good balance: it significantly increased the reading speed compared to line-by-line reading (roughly a threefold difference), while reading 10,000 lines at once did not change the speed considerably (about 10%).
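In code, switching from line-by-line to chunked reading only means changing the n argument of readLines(); the file name below is just an example:

con <- file("en_US.blogs.txt", "r")
repeat {
  lines <- readLines(con, n = 100, encoding = "UTF-8")  # n = 1 would read line by line
  if (length(lines) == 0) break
  # ... process the chunk of up to 100 lines here ...
}
close(con)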
R does not have a built-in dict (also known as a hash, map, or associative array), but since names can be attached to practically anything, the name can serve as the key and the element as the value. At first I used the built-in list for this purpose. It is suitable for building a dict, but it turned out to be pretty slow: with that implementation the initial approach took about one minute to read 1,000 lines (10 times 100), which is only about 0.1% of the whole file. Therefore I looked for another approach and found the hash library, which implements exactly the data structure I want. It has the following advantages:
- Simple to install: just install.packages('hash'), then library(hash).
- It uses exactly the same syntax as a list, so the change was literally only replacing list() with hash() (a minimal comparison is sketched after this list).
- Its speed is much better: processing 1,000 lines took only a few seconds. So I increased the number of read lines by one order of magnitude, up to 10,000 (100 * 100), which is 1% of the whole text; that takes about a minute.
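The swap is essentially a one-line change; a minimal comparison (the keys and values are illustrative):

library(hash)
# counts <- list()   # the original, slow variant for hundreds of thousands of keys
counts <- hash()     # the hash-based variant with the same [[ ]] syntax
counts[["of the"]] <- 1
counts[["of the"]] <- counts[["of the"]] + 1
counts[["of the"]]  # 2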
Unfortunately this is still not suitable for analyzing the whole file, because memory and time consumption grow faster than linearly as the number of lines increases. I did not perform any rigorous measurement, but the simple fact is that the number of distinct pairs, triplets, etc. keeps growing as more lines are read. Imagine how many different 5-word phrases there are. Maybe the analysis of 5-word phrases should be limited.
In order to get broader results, and not just deal with the first 10,000 rows, I implemented a randomization: since plain reading turned out to be quick, after reading each 100-line chunk a random boolean value with a 1% probability of being TRUE decides whether that chunk is considered or not. So in the end about 1% of the total text (i.e. around 10,000 lines) was considered, but drawn from various parts of the text.
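The per-chunk decision is a single Bernoulli draw; the probability below follows the 1% mentioned above (the appendix code uses its own constant):

keepProbability <- 0.01
# One draw per 100-line chunk; the chunk is processed only if this is TRUE.
keepChunk <- rbinom(1, 1, keepProbability) == 1
keepChunk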
Implementation details
Here are some more important details about the implementation:
- I decided to use standard R tools as much as possible.
- As mentioned, I used the hash library. That is the only external library I am using so far.
- The algorithm replaces some special characters with their more common counterparts, or simply removes them.
- The implementation is general enough to handle phrases of any length; it treats a single word as a 1-word phrase. I experimented with phrases up to 5 words long.
- The file en_US.news.txt contains the so-called SUB character (code 0x1A), which the R function readLines() treats as end of file. I found no better solution than removing them by hand (a possible programmatic workaround is sketched after this list).
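As a possible, untested alternative to removing the SUB characters by hand, they could probably be stripped at the byte level before readLines() ever sees them; the output file name is illustrative:

# Drop every 0x1A (SUB) byte from the raw file content.
raw <- readBin("en_US.news.txt", what = "raw", n = file.size("en_US.news.txt"))
writeBin(raw[raw != as.raw(0x1A)], "en_US.news.clean.txt")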
Problems
There were a lot of problems to solve besides those already mentioned, which made the process slow.
- The usage of RStudio is not self-explanatory. For example, it took quite some time until I realized how to run the program.
- R does not really help with errors. Unlike other programming languages, it does not report the exact place (the line number) of the error.
- It is very hard to debug (compared to other programming languages).
- Using R structures like lists is not easy. It is hard to work out when to index with [[ ]] and when with [ ].
- It took me quite a while to realize that global variables must be assigned with <<- instead of just <- (both pitfalls are illustrated in the snippet at the end of this section).
- The right R functions are hard to find, as their names are not self-explanatory either, e.g. gsub() for replacing characters, trimws() for trimming white space, etc.
- It took quite some time to solve the file-reading problems. For example, encoding="UTF-8" solved many of them (and brought new ones).
- Starting an R script from the command line is also not quite straightforward. There are several options, and the default parameters are not enough.
- Passing parameters to an R script is not straightforward either. It took me a while to realize that args <- commandArgs(trailingOnly=TRUE) is the right solution.
And many, many others. (I already used R extensively a few years ago. Now I realize that it is not a great language in terms of ease of use compared to, say, Python.)
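Two of the pitfalls above in miniature (illustrative snippets, not taken from the assignment code):

l <- list(a = 1, b = 2)
l["a"]     # single bracket: a one-element list
l[["a"]]   # double bracket: the stored value itself, i.e. 1

counter <- 0
increment <- function() {
  counter <<- counter + 1   # a plain '<-' here would only modify a local copy
}
increment()
counter    # 1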
Lessons learned
This exercise yielded a lot of experience. Some further notes worth mentioning are the following:
- In order to get better results, we cannot neglect the language itself. For example, in most languages the apostrophe can be treated as simple punctuation, but not in English: words like don't are pretty common. If we treated the apostrophe as punctuation, it would inflate the counts of don and t (see the example after this list).
- Handling special characters is not easy.
- We should consider not automating everything.
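A quick illustration of the apostrophe issue:

# Splitting on every punctuation mark breaks common English contractions apart.
strsplit("don't worry", "[[:punct:] ]+")[[1]]   # "don"   "t"     "worry"
strsplit("don't worry", " ")[[1]]               # "don't" "worry"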
Extension possibilities
A lot has already been implemented, and there are even more ideas. Some of these could be implemented in the upcoming weeks; others may not be.
- Find a solution for the SUB character problem.
- Find a more sophisticated character-replacement scheme than the current one.
- Check out other hash libraries; maybe there are even faster ones.
- Play around with various phrase lengths. Maybe 5 words is not optimal, as memory and CPU consumption increase steeply with the maximum phrase length. Maybe completeness could be achieved only for single words, while sampling could be used for longer phrases.
- Caching the results in a proper format is inevitable, and the suggestion step should use that cache (a possible format is sketched after this list).
- The cache implementation could be dynamic, i.e. extendable.
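One possible cache format (a hypothetical sketch, not a final decision): turn each hash into a data frame sorted by count and store it with saveRDS(); the suggestion step would read it back with readRDS(). The function name cacheCounts is illustrative.

library(hash)
# Persist one n-gram table as a data frame sorted by decreasing count.
cacheCounts <- function(counts, file) {
  df <- data.frame(phrase = keys(counts),
                   count  = as.numeric(values(counts)),
                   stringsAsFactors = FALSE)
  saveRDS(df[order(-df$count), ], file)
}
# The reverse direction would simply be readRDS(file).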
Results
Main statistics
The table below shows some statistics about the input data (English only):
| | News | Twitter | Blogs |
|---|---|---|---|
| Number of lines (thousand) | 1010.2 | 2360.1 | 899.3 |
| Number of line parts (thousand) | 7053.4 | 8780.7 | 6739.3 |
| Number of words (million) | 35.6 | 31.0 | 38.3 |
Let’s visualize it!


The sizes of the various corpora are similar, but there are interesting differences between them. Let’s check some derived statistics!
| | News | Twitter | Blogs |
|---|---|---|---|
| Average number of words per line | 35.2 | 13.3 | 43.1 |
| Average number of line parts per line | 6.98 | 3.72 | 7.49 |
| Average number of words per line part | 5.05 | 3.53 | 5.68 |
In the case of Twitter, not surprisingly, the average number of words per line is much lower than in the other two (about one third), the average number of line parts per line is about half, and the average number of words per line part is about two thirds. So news and blogs are similar in this sense, but Twitter is different.
Most frequent words and idioms
Words
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| the | 102,095 | the | 46,247 | the | 90,558 |
| to | 47,528 | to | 39,251 | and | 53,303 |
| and | 45,962 | I | 35,335 | to | 52,326 |
| a | 45,515 | a | 30,161 | a | 44,252 |
| of | 40,211 | you | 26,964 | of | 42,910 |
| in | 35,450 | and | 21,507 | I | 37,796 |
| for | 18,456 | for | 19,119 | in | 29,040 |
| that | 17,971 | in | 18,826 | that | 22,238 |
| is | 14,752 | is | 18,005 | is | 21,180 |
| on | 13,804 | of | 17,939 | it | 19,606 |
Let’s visualize the number of occurrences!

There is hardly any difference between the top lists.
- The word the is the first in every category, but in the case of Twitter it is spectacularly less frequent than in the other two.
- Some common words: to, and, a, of, in, for, etc.
- The word I is the third most frequent on Twitter, the sixth in the blogs, and it is not within the top 10 in the news (actually it is the 27th).
- It is worth mentioning that I occurs more frequently than any other personal pronoun.
2-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| of the | 9709 | in the | 3968 | of the | 9137 |
| in the | 9206 | for the | 3695 | in the | 7504 |
| to the | 4415 | of the | 2735 | to the | 4290 |
| on the | 3723 | on the | 2388 | on the | 3665 |
| for the | 3596 | to be | 2361 | to be | 3316 |
| at the | 2960 | thanks for | 2116 | and the | 2854 |
| in a | 2693 | to the | 2076 | for the | 2810 |
| and the | 2643 | at the | 1836 | I was | 2536 |
| to be | 2500 | I love | 1753 | and I | 2437 |
| with the | 2375 | going to | 1699 | I have | 2394 |
Here the similarities are still obvious, but the differences are bigger compared to the single-word frequencies. For example, Twitter and blogs are more personal (thanks for, I love; I was, and I, I have).
3-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| one of the | 721 | thanks for the | 1202 | one of the | 691 |
| a lot of | 565 | looking forward to | 444 | a lot of | 567 |
| part of the | 326 | thank you for | 436 | to be a | 356 |
| as well as | 323 | for the follow | 423 | out of the | 337 |
| according to the | 300 | going to be | 380 | a couple of | 332 |
| going to be | 285 | I love you | 371 | be able to | 330 |
| to be a | 277 | can’t wait to | 368 | it was a | 324 |
| some of the | 276 | I want to | 341 | as well as | 323 |
| the end of | 268 | a lot of | 309 | the end of | 317 |
| out of the | 265 | to be a | 304 | some of the | 298 |
4-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| the end of the | 133 | thanks for the follow | 347 | the end of the | 159 |
| the rest of the | 130 | thank you for the | 146 | at the end of | 154 |
| for the first time | 127 | can’t wait to see | 143 | the rest of the | 149 |
| at the end of | 122 | thank you so much | 140 | at the same time | 122 |
| one of the most | 100 | is going to be | 131 | when it comes to | 101 |
| is one of the | 94 | thanks for the RT | 128 | one of the most | 100 |
| said in a statement | 88 | going to be a | 97 | is one of the | 93 |
| in the United States | 86 | for the first time | 92 | to be able to | 90 |
| a member of the | 80 | hope to see you | 76 | for the first time | 87 |
| at the same time | 77 | the end of the | 71 | in the middle of | 82 |
5-word idioms
Let’s look at a few more entries here!
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| at the end of the | 62 | thank you so much for | 51 | at the end of the | 81 |
| for the first time in | 43 | no no no no no | 40 | in the middle of the | 42 |
| for the rest of the | 31 | can’t wait to see you | 34 | the end of the day | 28 |
| by the end of the | 29 | is going to be a | 34 | the other side of the | 27 |
| at the time of the | 26 | thanks for the shout out | 34 | for those of you who | 24 |
| in the middle of the | 25 | let me know if you | 30 | is one of the most | 24 |
| could not be reached for | 21 | at the end of the | 29 | on the other side of | 24 |
| the Dow Jones industrial average | 21 | it’s going to be a | 29 | by the end of the | 23 |
| is one of the most | 20 | keep up the good work | 29 | the North Dakota Township map | 23 |
| not be reached for comment | 20 | this is going to be | 28 | for the rest of the | 22 |
| the end of the day | 19 | hope you have a great | 27 | I thought it would be | 21 |
| at the top of the | 18 | I love you so much | 27 | a couple of weeks ago | 20 |
| rock and roll hall of | 17 | that awkward moment when you | 27 | all you have to do | 20 |
| the end of the year | 16 | I can’t wait to see | 26 | for the first time in | 20 |
| as a result of the | 15 | thank you for the RT | 26 | you have to do is | 20 |
| centers for disease control and | 15 | for the first time in | 24 | but at the same time | 18 |
| said in a statement that | 15 | thank you for the follow | 24 | at the bottom of the | 17 |
| declined to comment on the | 14 | look forward to seeing you | 23 | the rest of the world | 17 |
| for disease control and prevention | 14 | caps caps caps caps caps | 22 | and the rest of the | 16 |
| for the first time since | 14 | call call call call call | 21 | at the end of this | 15 |
| had nothing to do with | 14 | I’m not the only one | 21 | in the form of a | 15 |
| is going to be a | 14 | you so much for the | 21 | when it comes to the | 14 |
The longer the idiom, the bigger the differences between the texts.
- Twitter has the most garbage; for example, if somebody writes the same word many, many times, it becomes one of the most frequent idioms. Almost every run contained such an entry, with a different word each time.
- Internet slang appears. For example, in thank you for the RT, RT means retweet.
- Most of the Twitter idioms are personal.
Plans
This is my current plan for implementing the Shiny app (a minimal UI sketch follows the list):
- The most frequent words and idioms would be cached.
- It would contain a multi-line input field, where the user can write anything.
- Another, read-only field would show the suggestions.
- The user can accept the suggestion by a single click (I need to check how to do that).
- I need to check whether it is possible to show the suggestions in the input field itself (similar to Google search).
- The user is able to set the style of the text: News, Twitter or Blog. If none is selected, then the combined corpus is used.
- The user is able to select the language: English, German, Russian or Finnish. If none is selected, then all of them are considered.
- Addition: the app could detect the language. E.g. if the text entered so far contains words and idioms with a high occurrence rate in one language and practically zero in the other three, then that language can be assumed to be the input language.
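A minimal UI sketch of this plan (an assumed layout; the suggestion logic below is only a placeholder, not the prediction algorithm):

library(shiny)

ui <- fluidPage(
  textAreaInput("text", "Type your text:", rows = 4),
  selectInput("style", "Style:", c("All", "News", "Twitter", "Blog")),
  verbatimTextOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    # Placeholder: the real app would look up the cached n-gram tables here.
    lastWords <- tail(strsplit(tolower(input$text), "\\s+")[[1]], 2)
    paste("Suggestion for:", paste(lastWords, collapse = " "))
  })
}

shinyApp(ui, server)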
Appendix
This is the R source code I implemented.
library(hash)
# Maximum phrase length (number of consecutive words) to collect.
maxNumberOfWords <- 5
# wordCounts[[n]] will be a hash mapping an n-word phrase to its occurrence count.
wordCounts <- list()
# Create one hash per phrase length and initialize the random seed.
initialize <- function() {
  for (i in 1:maxNumberOfWords) {
    wordCounts[[i]] <<- hash()
  }
  seed <- ceiling(runif(1, 0, 65536))
  print(paste("Seed:", seed))
  set.seed(seed)
}
# Count every phrase of 'numberOfWords' consecutive words within 'words'.
handleWords <- function(words, numberOfWords) {
  if (length(words) < numberOfWords) {
    return()
  }
  actualWordCounts <- wordCounts[[numberOfWords]]
  for (i in 1:(length(words) - numberOfWords + 1)) {
    collapsed <- paste(words[i:(i+numberOfWords-1)], collapse = " ")
    if (is.null(actualWordCounts[[collapsed]])) {
      actualWordCounts[collapsed] <- 1
    } else {
      actualWordCounts[collapsed] <- actualWordCounts[[collapsed]] + 1
    }
  }
  wordCounts[[numberOfWords]] <<- actualWordCounts
}
# Replace or remove special characters before splitting and counting.
replaceSpecialChars <- function(line) {
  # Note: the '-' is listed last so it is not interpreted as a range
  # (a '+-/' range would also remove ',' and '.', breaking the later split).
  line <- gsub("[#$%&*+/:;<=>@^_|~-]", "", line)
  line <- gsub('["\\[\\]\\{\\}\\(\\)]', "", line)
  line <- gsub("`", "'", line)
  line <- gsub("\U2018", "'", line) # ‘
  line <- gsub("\U2019", "'", line) # ’
  line <- gsub("\U2026", ".", line) # …
  line <- gsub("\U2032", "'", line) # ′
  line <- gsub("\U215B", "'", line) # ⅛
  line <- gsub("\UFFFD", "", line)  # � (Unicode replacement character)
  line
}
# Read the file in 100-line chunks, randomly keep some chunks, clean and
# split the kept lines, and update the phrase counts for every length.
findNumberOfOccurrences <- function(filename) {
  print(filename)
  numberOfReads <- 0
  numberOfSkips <- 0
  numberOfTotalLines <- 0
  numberOfTotalLineParts <- 0
  numberOfTotalWords <- 0
  input <- file(filename, "r")
  while (TRUE) {
    lines <- readLines(input, 100, encoding="UTF-8")
    if (length(lines) == 0) {
      break
    }
    # Process the chunk only with the given probability, otherwise skip it.
    if (rbinom(1, 1, 0.05)[1] == 0) {
      numberOfSkips <- numberOfSkips + 1
      next
    }
    numberOfReads <- numberOfReads + 1
    for (line in lines) {
      numberOfTotalLines <- numberOfTotalLines + 1
      line <- replaceSpecialChars(line)
      # Split the line into parts at sentence-level punctuation.
      lineParts <- strsplit(line, '[.!?,]')
      for (linePart in lineParts[[1]]) {
        numberOfTotalLineParts <- numberOfTotalLineParts + 1
        # Lowercase, trim and normalize white space, then split into words.
        linePartClean <- gsub("\\s+", " ", trimws(tolower(linePart)))
        words <- strsplit(linePartClean, " ")
        numberOfTotalWords <- numberOfTotalWords + length(words[[1]])
        for (j in 1:maxNumberOfWords) {
          handleWords(words[[1]], j)
        }
      }
    }
  }
  close(input)
  print(paste("Number of reads:", numberOfReads))
  print(paste("Number of skips:", numberOfSkips))
  print(paste("Number of total lines:", numberOfTotalLines))
  print(paste("Number of total line parts:", numberOfTotalLineParts))
  print(paste("Number of total words:", numberOfTotalWords))
}
# Print all phrases of the given length whose count reaches 'limit',
# ordered by increasing count.
printStats <- function(numberOfWords, limit) {
  print(paste("Number of consecutive words:", numberOfWords))
  actualWordCounts <- wordCounts[[numberOfWords]]
  print(paste("Number of entries:", length(actualWordCounts)))
  wordCountLimited <- list()
  for (key in names(actualWordCounts)) {
    if (actualWordCounts[[key]] >= limit) {
      wordCountLimited[key] <- actualWordCounts[[key]]
    }
  }
  if (length(wordCountLimited) > 0) {
    print(wordCountLimited[order(unlist(wordCountLimited))])
  } else {
    print(paste("The list of minimum length", limit, "for number of words", numberOfWords, "is empty."))
  }
}
# Main script: process the file given as the first command-line argument,
# then print the most frequent phrases for each phrase length.
initialize()
args <- commandArgs(trailingOnly=TRUE)
startTime <- Sys.time()
findNumberOfOccurrences(args[1])
endTime <- Sys.time()
print(endTime - startTime)
printStats(1, 1000)
printStats(2, 100)
printStats(3, 50)
printStats(4, 10)
printStats(5, 5)
warnings()