Overview

This is the second week's assignment of the Data Science Capstone course on Coursera, which is the last course of the Data Science Specialization by Johns Hopkins University.

The overall task is to implement a prediction algorithm that provides hints while typing text, in the form of a Shiny App.

We have a text corpus in 4 languages.

Each of them has 3 categories: blogs, news and Twitter messages.

The first week’s assignment was to download the corpus and perform some basic exploration.

This is the second part of the task: the Exploratory Data Analysis. The goal is to perform some deeper analysis of the corpus and present the findings here.

Approach

Goal

I targeted the following goals here:

  • Get a deeper insight into the corpus.
  • Explore more detailed statistics.
  • Implement a preliminary algorithm that calculates the word and phrase frequencies.

Algorithm

The main outline of the algorithm is the following (a minimal code sketch follows the list):

  • Read the corpus line by line.
  • Split every line at punctuation marks, like periods, commas, exclamation marks, etc.
  • Perform some cleanup, like removing unnecessary white space, making the text lowercase, converting or removing special characters, etc.
  • Create a map where the key is a word, word duplet, word triplet, etc., and the value is its number of occurrences.
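
To make this concrete, here is a minimal, self-contained R sketch of the steps above for a single line of text (not the final implementation: it stores every phrase length in one plain list, while the real code in the Appendix uses one hash per phrase length):

# Count all 1..maxLen word long phrases of one line of text in a named list.
countPhrases <- function(line, counts = list(), maxLen = 3) {
  lineParts <- strsplit(line, "[.!?,]")[[1]]            # split at punctuation
  for (part in lineParts) {
    clean <- gsub("\\s+", " ", trimws(tolower(part)))   # cleanup
    words <- strsplit(clean, " ")[[1]]
    for (n in 1:maxLen) {
      if (length(words) < n) next
      for (i in 1:(length(words) - n + 1)) {
        phrase <- paste(words[i:(i + n - 1)], collapse = " ")
        count <- counts[[phrase]]
        counts[[phrase]] <- if (is.null(count)) 1 else count + 1
      }
    }
  }
  counts
}

counts <- countPhrases("This is a test. This is only a test!")
counts[["this is"]]   # 2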

Speed up

It turned out that speed is a big bottleneck, so I decided to start with a small fraction of the corpus. I started with the blog file.

To speed up the file read I decided to read multiple rows at once instead of reading the file line by line. Reading 100 lines turned out to be a good balance: it significantly increased the reading speed compared to line-by-line reading (cutting the time to about one third), while reading 10,000 lines at once did not considerably change the speed any further (about 10%).
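
The chunked reading looks roughly like this (a sketch; en_US.blogs.txt is assumed to be the blog file, and the processing of the chunk is omitted):

con <- file("en_US.blogs.txt", "r")
repeat {
  lines <- readLines(con, n = 100, encoding = "UTF-8")  # read a 100-line chunk
  if (length(lines) == 0) break                         # end of file reached
  # ... process the 100 lines ...
}
close(con)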

R does not have a direct dict (AKA hash, map, associative array) implementation, but as we can add names to practically anything, the name can be the key and the actual value the value. So at first I decided to use the built-in list for that purpose. It is really suitable for creating a dict; however, it turned out to be pretty slow. Using that implementation, the initial approach took about one minute to read 1000 entries (10 times 100), which is only 0.1% of the whole file. Therefore I looked for another approach, and found the hash library, which is an implementation of the data structure I want. It has the following advantages:

  • Simple to install, just install.packages('hash'), and then library(hash).
  • It uses exactly the same syntax as the list. So the change was literally only replacing list() with hash() (see the short example after this list).
  • Its speed is much better; processing 1000 lines took a few seconds. So I increased the number of read lines by one order of magnitude, up to 10,000 (100*100), which is 1% of the whole text; that takes about a minute.
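
For illustration, a counter kept in a hash is used exactly like one kept in a list:

library(hash)

wordCounts <- hash()                    # was: wordCounts <- list()
wordCounts[["of the"]] <- 1
wordCounts[["of the"]] <- wordCounts[["of the"]] + 1
wordCounts[["of the"]]                  # 2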

Unfortunately this is still not suitable for analyzing the whole file, because the memory and time consumption grow more than linearly as the number of lines increases. I did not perform any scientific calculation, but the simple fact is that the number of duplet, triplet, etc. combinations grows as more lines are read. Imagine how many different 5-word-long phrases there are. Maybe the analysis of the 5-word-long phrases should be limited.

In order to have broader results and not just deal with the first 10,000 rows, I implemented a randomization: as the reading itself turned out to be quick, after reading 100 lines a random boolean value, being TRUE with 1% probability, decides whether that chunk is considered or not. So finally about 1% of the total text (i.e. around 10,000 lines) was considered, but from various parts of the text.
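
The per-chunk decision is a simple Bernoulli draw; simulating it shows the expected sampling rate (note that the probability constant in the Appendix code is set to a different value):

# Simulate the keep-or-skip decision for 1000 chunks of 100 lines each.
keepChunk <- rbinom(1000, 1, 0.01) == 1
mean(keepChunk)   # close to 0.01, i.e. about 1% of the chunks are kept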

Implementation details

Here are some more important details about the implementation:

  • I decided to use standard R tools as much as possible.
  • As mentioned, I used the hash library. Actually, that is the only external library I am using so far.
  • The algorithm replaces some special characters either with their more common counterparts, or just removes them.
  • The implementation is general enough to handle phrases of any length. It treats a single word as a 1-word-long phrase. I experimented with phrases up to 5 words long.
  • The en_US.news.txt contains the so-called SUB character (code 0x1A), which the R function readLines() treats as end of file. I found no better solution than removing these characters by hand (but see the sketch after this list for a possible programmatic alternative).
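
As one possible alternative to the manual editing (not what I actually did), the SUB bytes could also be stripped programmatically before reading; the cleaned file name below is only illustrative:

# Remove SUB (0x1A) bytes so that readLines() no longer stops early.
raw <- readBin("en_US.news.txt", what = "raw", n = file.size("en_US.news.txt"))
writeBin(raw[raw != as.raw(0x1A)], "en_US.news.cleaned.txt")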

Problems

There were a lot of problems to be solved besides those already mentioned, which made the process slow.

  • The usage of RStudio is not self-explanatory. For example, it took quite some time until I realized how to start the program.
  • R does not really help with errors. Unlike other programming languages, it does not provide the exact place (the line number) of the error.
  • It is very hard to debug (compared to other programming languages).
  • Using R structures like lists is not easy. It is hard to find out when to use [[]] as an index, and when just [] (see the short example after this list).
  • It took pretty long until I realized that global variables should be assigned with <<- instead of just <-.
  • The right R functions are hard to find, as they are not self-explanatory either. E.g. gsub() (for replacing characters), trimws() (for trimming white space), etc.
  • It took quite some time to find the solution for the file reading problems. E.g. encoding="UTF-8" solved many of the problems (and brought many new ones).
  • Starting an R script from the command line is also not quite straightforward. There are several options, and the default parameters are not enough.
  • Using parameters in an R script is also not really straightforward. It took a while until I realized that args <- commandArgs(trailingOnly=TRUE) is the right solution.
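
A few of these gotchas illustrated in one place:

# [ ] returns a (sub-)list, [[ ]] returns the element itself:
l <- list(a = 1, b = 2)
class(l["a"])      # "list"
class(l[["a"]])    # "numeric"

# Modifying a global variable inside a function requires <<-:
counter <- 0
increment <- function() {
  counter <<- counter + 1   # '<-' would only change a local copy
}
increment()

# Reading the command line arguments of a script:
args <- commandArgs(trailingOnly = TRUE)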

And many, many others. (I already used R extensively a few years ago. Now I realized that this is not a good programming language in terms of ease of use, compared to, let's say, Python.)

Lessons learned

This exercise provided a lot of experience. Some further notes to mention are the following:

  • In order to have a better result, we cannot neglect the language. For example, in most languages the apostrophe can be treated as simple punctuation, but not in English: words like don't are pretty common. If we treated the apostrophe as punctuation, it would increase the number of don and t occurrences (see the small example after this list).
  • Handling special characters is not easy.
  • We should consider not automating everything.
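
A small example of what goes wrong when the apostrophe is treated as punctuation:

line <- "I don't know"
parts <- strsplit(tolower(line), "[.!?,']")[[1]]   # "i don"  "t know"
unlist(strsplit(trimws(parts), " "))               # "i" "don" "t" "know"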

Extension possibilities

A lot has already been implemented, and there are even more ideas. Some of these could be implemented in the upcoming weeks, and some may not be.

  • Find a solution for the SUB character problem.
  • Find a more sophisticated character-replacement solution than the current one.
  • Check out other hash libraries; maybe there are even faster ones.
  • Play around with various phrase lengths. Maybe 5 words is not optimal, as memory and CPU consumption increase exponentially with the maximum phrase length. Maybe full coverage could be achieved only for single words, and sampling could be used for longer phrases.
  • It is inevitable to cache the results in a proper format, and at suggestion time that cache should be used.
  • The cache implementation could be dynamic, i.e. extendable.

Results

Main statistics

The table below shows some statistics about the input data (English only):

News Twitter Blog
Number of lines (thousand) 1010.2 2360.1 899.3
Number of line parts (thousand) 7053.4 8780.7 6739.3
Number of words (million) 35.6 31.0 38.3

Let’s visualize it!

The sizes of the various corpora are similar, but there are interesting differences between them. Let’s check various calculated statistics!

News Twitter Blog
Average number of words per line 35.2 13.3 43.1
Average number of line parts per line 6.98 3.72 7.49
Average number of words per line part 5.05 3.53 5.68

In the case of Twitter - not surprisingly - the average number of words per line is much lower than in the other two (about one third), the average number of line parts per line is about half, and the average number of words per line part is about two thirds. So the news and the blogs are similar in this sense, but Twitter is different.

Most frequent words and idioms

Words

News Twitter Blog
the 102,095 the 46,247 the 90,558
to 47,528 to 39,251 and 53,303
and 45,962 I 35,335 to 52,326
a 45,515 a 30,161 a 44,252
of 40,211 you 26,964 of 42,910
in 35,450 and 21,507 I 37,796
for 18,456 for 19,119 in 29,040
that 17,971 in 18,826 that 22,238
is 14,752 is 18,005 is 21,180
on 13,804 of 17,939 it 19,606

Let’s visualize the number of occurrences!

There are hardly any differences between the top lists.

  • The word the is the first one in every category, but in the case of Twitter it is spectacularly less frequent than in the other two.
  • Some common words: to, and, a, of, in, for etc.
  • The word I is the third most frequent on Twitter, the sixth in the blogs, and it is not within the TOP10 in the news (actually it is the 27th).
  • It is worth mentioning that I occurs more frequently than any other personal pronoun.

2-word-long idioms

News Twitter Blog
of the 9709 in the 3968 of the 9137
in the 9206 for the 3695 in the 7504
to the 4415 of the 2735 to the 4290
on the 3723 on the 2388 on the 3665
for the 3596 to be 2361 to be 3316
at the 2960 thanks for 2116 and the 2854
in a 2693 to the 2076 for the 2810
and the 2643 at the 1836 I was 2536
to be 2500 I love 1753 and I 2437
with the 2375 going to 1699 I have 2394

Here the similarities are still obvious, but the differences are bigger compared to the single-word frequencies. For example, Twitter and the blogs are more personal (thanks for, I love; I was, and I, I have).

3-word-long idioms

News Twitter Blog
one of the 721 thanks for the 1202 one of the 691
a lot of 565 looking forward to 444 a lot of 567
part of the 326 thank you for 436 to be a 356
as well as 323 for the follow 423 out of the 337
according to the 300 going to be 380 a couple of 332
going to be 285 I love you 371 be able to 330
to be a 277 can’t wait to 368 it was a 324
some of the 276 I want to 341 as well as 323
the end of 268 a lot of 309 the end of 317
out of the 265 to be a 304 some of the 298

4-word-long idioms

News Twitter Blog
the end of the 133 thanks for the follow 347 the end of the 159
the rest of the 130 thank you for the 146 at the end of 154
for the first time 127 can’t wait to see 143 the rest of the 149
at the end of 122 thank you so much 140 at the same time 122
one of the most 100 is going to be 131 when it comes to 101
is one of the 94 thanks for the RT 128 one of the most 100
said in a statement 88 going to be a 97 is one of the 93
in the United States 86 for the first time 92 to be able to 90
a member of the 80 hope to see you 76 for the first time 87
at the same time 77 the end of the 71 in the middle of 82

5-word-long idioms

Let’s check a few more entries here!

News Twitter Blog
at the end of the 62 thank you so much for 51 at the end of the 81
for the first time in 43 no no no no no 40 in the middle of the 42
for the rest of the 31 can’t wait to see you 34 the end of the day 28
by the end of the 29 is going to be a 34 the other side of the 27
at the time of the 26 thanks for the shout out 34 for those of you who 24
in the middle of the 25 let me know if you 30 is one of the most 24
could not be reached for 21 at the end of the 29 on the other side of 24
the Dow Jones industrial average 21 it’s going to be a 29 by the end of the 23
is one of the most 20 keep up the good work 29 the North Dakota Township map 23
not be reached for comment 20 this is going to be 28 for the rest of the 22
the end of the day 19 hope you have a great 27 I thought it would be 21
at the top of the 18 I love you so much 27 a couple of weeks ago 20
rock and roll hall of 17 that awkward moment when you 27 all you have to do 20
the end of the year 16 I can’t wait to see 26 for the first time in 20
as a result of the 15 thank you for the RT 26 you have to do is 20
centers for disease control and 15 for the first time in 24 but at the same time 18
said in a statement that 15 thank you for the follow 24 at the bottom of the 17
declined to comment on the 14 look forward to seeing you 23 the rest of the world 17
for disease control and prevention 14 caps caps caps caps caps 22 and the rest of the 16
for the first time since 14 call call call call call 21 at the end of this 15
had nothing to do with 14 I’m not the only one 21 in the form of a 15
is going to be a 14 you so much for the 21 when it comes to the 14

The longer the idiom, the bigger the differences between the texts.

  • Twitter has the most garbage; for example, if somebody writes the same word many times, it becomes one of the most frequent idioms. Almost all of the runs contained such an entry, with different words.
  • Internet slang appears. For example, in thank you for the RT, RT means retweet.
  • Most of the idioms on Twitter are personal.

Plans

This is my current plan for implementing the Shiny App:

Appendix

This is the R source code I implemented.

library(hash)

maxNumberOfWords <- 5
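# wordCounts[[n]] will hold a hash mapping each n-word phrase to its occurrence count.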
wordCounts <- list()

initialize <- function() {
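  # Create one empty hash per phrase length; generate, print and set a random seed.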
  for (i in 1:maxNumberOfWords) {
    wordCounts[[i]] <<- hash()
  }
  seed <- ceiling(runif(1, 0, 65536))
  print(paste("Seed:", seed))
  set.seed(seed)
}

handleWords <- function(words, numberOfWords) {
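  # Count every numberOfWords-long phrase of the given word vector.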
  if (length(words) < numberOfWords) {
    return()
  }
  actualWordCounts <- wordCounts[[numberOfWords]]
  for (i in 1:(length(words) - numberOfWords + 1)) {
    collapsed <- paste(words[i:(i+numberOfWords-1)], collapse = " ")
    if (is.null(actualWordCounts[[collapsed]])) {
      actualWordCounts[collapsed] <- 1
    } else {
      actualWordCounts[collapsed] <- actualWordCounts[[collapsed]] + 1
    }
  }
  wordCounts[[numberOfWords]] <<- actualWordCounts
}

replaceSpecialChars <- function(line) {
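  # Replace special characters with more common counterparts or remove them.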
  line <- gsub("[#$%&*+-/:;<=>@^_|~]", "", line)
  line <- gsub('["\\[\\]\\{\\}\\(\\)]', "", line)
  line <- gsub("`", "'", line)
  line <- gsub("\U2018", "'", line) # ‘
  line <- gsub("\U2019", "'", line) # ’
  line <- gsub("\U2026", ".", line) # …
  line <- gsub("\U2032", "'", line) # ′
  line <- gsub("\U215B", "'", line) # ′
  line <- gsub("\UFFFD", "", line) # �
  line
}


findNumberOfOccurrences <- function(filename) {
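  # Read the file in 100-line chunks, keep a random subset of the chunks and count the phrases in them.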
  print(filename)
  numberOfReads <- 0
  numberOfSkips <- 0
  numberOfTotalLines <- 0
  numberOfTotalLineParts <- 0
  numberOfTotalWords <- 0
  input <- file(filename, "r")
  while (TRUE) {
    lines <- readLines(input, 100, encoding="UTF-8")
    if (length(lines) == 0) {
      break
    }
    if (rbinom(1, 1, 0.05)[1] == 0) {
      numberOfSkips <- numberOfSkips + 1
      next
    }
    numberOfReads <- numberOfReads + 1
    for (line in lines) {
      numberOfTotalLines <- numberOfTotalLines + 1
      line <- replaceSpecialChars(line)
      lineParts <- strsplit(line, '[.!?,]')
      for (linePart in lineParts[[1]]) {
        numberOfTotalLineParts <- numberOfTotalLineParts + 1
        linePartClean <- gsub("\\s+", " ", trimws(tolower(linePart)))
        words <- strsplit(linePartClean, " ")
        numberOfTotalWords <- numberOfTotalWords + length(words[[1]])
        for (j in 1:maxNumberOfWords) {
          handleWords(words[[1]], j)
        }
      }
    }
  }
  close(input)
  print(paste("Number of reads:", numberOfReads))
  print(paste("Number of skips:", numberOfSkips))
  print(paste("Number of total lines:", numberOfTotalLines))
  print(paste("Number of total line parts:", numberOfTotalLineParts))
  print(paste("Number of total words:", numberOfTotalWords))
}

printStats <- function(numberOfWords, limit) {
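  # Print every phrase of the given length occurring at least 'limit' times, ordered by count.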
  print(paste("Number of consecutive words:", numberOfWords))
  actualWordCounts <- wordCounts[[numberOfWords]]
  print(paste("Number of entries:", length(actualWordCounts)))
  wordCountLimited <- list()
  for (key in names(actualWordCounts)) {
    if (actualWordCounts[[key]] >= limit) {
      wordCountLimited[key] <- actualWordCounts[[key]]
    }
  }
  if (length(wordCountLimited) > 0) {
    print(wordCountLimited[order(unlist(wordCountLimited))])
  } else {
    print(paste("The list of minimum length", limit, "for number of words", numberOfWords, "is empty."))
  }
}

initialize()
args <- commandArgs(trailingOnly=TRUE)

startTime <- Sys.time()
findNumberOfOccurrences(args[1])
endTime <- Sys.time()
print(endTime - startTime)

printStats(1, 1000)
printStats(2, 100)
printStats(3, 50)
printStats(4, 10)
printStats(5, 5)

warnings()