Overview
This is the second-week assignment of the Data Science Capstone course on Coursera, the final course of the Data Science Specialization by Johns Hopkins University.
The overall task is to implement a prediction algorithm that suggests words while the user is typing, delivered in the form of a Shiny app.
We have a text corpus in 4 languages:
- English
- German
- Russian
- Finnish
Each of them has three categories:
- Blog entries
- News entries
- Twitter posts
The first week’s assignment was to download the corpus and perform some basic exploration.
This is the second part of the task: the exploratory data analysis. The goal is to perform a deeper analysis of the corpus and to present the findings here.
Approach
Goal
I targeted the following goals:
- Get a deeper insight into the corpus.
- Explore more detailed statistics.
- Implement a preliminary algorithm which calculates the word and phrase frequencies.
Algorithm
The algorithm, in outline, is the following (a condensed code sketch follows the list):
- Read the corpus line by line.
- Split every line at punctuation such as periods, commas, exclamation marks, etc.
- Perform some cleanup: remove unnecessary white space, convert the text to lowercase, convert or remove special characters, etc.
- Create a map where the key is a word, word pair, word triplet, etc., and the value is its number of occurrences.
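A condensed sketch of this pipeline, using a plain named vector as the map (the real implementation in the Appendix uses a hash instead; the function name countPhrases is just illustrative):

# Count every n-word phrase of one input line in a named integer vector.
countPhrases <- function(line, n) {
  counts <- integer(0)
  for (linePart in strsplit(line, "[.!?,]")[[1]]) {
    # Cleanup: lowercase, trim, collapse repeated white space, split into words.
    words <- strsplit(gsub("\\s+", " ", trimws(tolower(linePart))), " ")[[1]]
    if (length(words) < n) next
    for (i in 1:(length(words) - n + 1)) {
      phrase <- paste(words[i:(i + n - 1)], collapse = " ")
      counts[phrase] <- if (is.na(counts[phrase])) 1 else counts[phrase] + 1
    }
  }
  counts
}
countPhrases("This is a test. This is only a test!", 2)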
Speed up
It turned out that speed is a big bottleneck, so I decided to start with a small fraction of the corpus. I started with the blog entries.
To speed up file reading I decided to read multiple lines at once instead of reading the file line by line. Reading 100 lines at a time turned out to be a good balance: it significantly increased the reading speed compared to line-by-line reading (roughly a threefold difference), while reading 10,000 lines at once did not change the speed considerably (about 10%).
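In code, switching from line-by-line to chunked reading only means changing the n argument of readLines(); the file name below is just an example:

con <- file("en_US.blogs.txt", "r")
repeat {
  lines <- readLines(con, n = 100, encoding = "UTF-8")  # n = 1 would read line by line
  if (length(lines) == 0) break
  # ... process the chunk of up to 100 lines here ...
}
close(con)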
R does not have a built-in dict (also known as a hash, map, or associative array), but since names can be attached to practically anything, the name can serve as the key and the element as the value. At first I used the built-in list for this purpose. It is suitable for building a dict, but it turned out to be pretty slow: with that implementation the initial approach took about one minute to read 1,000 lines (10 times 100), which is only about 0.1% of the whole file. Therefore I looked for another approach and found the hash library, which implements exactly the data structure I want. It has the following advantages:
- Simple to install: just install.packages('hash'), then library(hash).
- It uses exactly the same syntax as a list, so the change was literally only replacing list() with hash() (a minimal comparison is sketched after this list).
- Its speed is much better: processing 1,000 lines took only a few seconds. So I increased the number of read lines by one order of magnitude, up to 10,000 (100 * 100), which is 1% of the whole text; that takes about a minute.
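The swap is essentially a one-line change; a minimal comparison (the keys and values are illustrative):

library(hash)
# counts <- list()   # the original, slow variant for hundreds of thousands of keys
counts <- hash()     # the hash-based variant with the same [[ ]] syntax
counts[["of the"]] <- 1
counts[["of the"]] <- counts[["of the"]] + 1
counts[["of the"]]  # 2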
Unfortunately this is still not suitable for analyzing the whole file, because memory and time consumption grow faster than linearly as the number of lines increases. I did not perform any rigorous measurement, but the simple fact is that the number of distinct pairs, triplets, etc. keeps growing as more lines are read. Imagine how many different 5-word phrases there are. Maybe the analysis of 5-word phrases should be limited.
In order to get broader results, and not just deal with the first 10,000 rows, I implemented a randomization: since plain reading turned out to be quick, after reading each 100-line chunk a random boolean value with a 1% probability of being TRUE decides whether that chunk is considered or not. So in the end about 1% of the total text (i.e. around 10,000 lines) was considered, but drawn from various parts of the text.
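The per-chunk decision is a single Bernoulli draw; the probability below follows the 1% mentioned above (the appendix code uses its own constant):

keepProbability <- 0.01
# One draw per 100-line chunk; the chunk is processed only if this is TRUE.
keepChunk <- rbinom(1, 1, keepProbability) == 1
keepChunk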
Implementation details
Here are some more important details about the implementation:
- I decided to use standard R tools as much as possible.
- As mentioned, I used the hash library. That is the only external library I am using so far.
- The algorithm replaces some special characters with their more common counterparts, or simply removes them.
- The implementation is general enough to handle phrases of any length; it treats a single word as a 1-word phrase. I experimented with phrases up to 5 words long.
- The file en_US.news.txt contains the so-called SUB character (code 0x1A), which the R function readLines() treats as end of file. I found no better solution than removing them by hand (a possible programmatic workaround is sketched after this list).
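As a possible, untested alternative to removing the SUB characters by hand, they could probably be stripped at the byte level before readLines() ever sees them; the output file name is illustrative:

# Drop every 0x1A (SUB) byte from the raw file content.
raw <- readBin("en_US.news.txt", what = "raw", n = file.size("en_US.news.txt"))
writeBin(raw[raw != as.raw(0x1A)], "en_US.news.clean.txt")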
Problems
There were a lot of problems to solve besides those already mentioned, which made the process slow.
- The usage of RStudio is not self-explanatory. For example, it took quite some time until I realized how to run the program.
- R does not really help with errors. Unlike other programming languages, it does not report the exact place (the line number) of the error.
- It is very hard to debug (compared to other programming languages).
- Using R structures like lists is not easy. It is hard to work out when to index with [[ ]] and when with [ ].
- It took me quite a while to realize that global variables must be assigned with <<- instead of just <- (both pitfalls are illustrated in the snippet at the end of this section).
- The right R functions are hard to find, as their names are not self-explanatory either, e.g. gsub() for replacing characters, trimws() for trimming white space, etc.
- It took quite some time to solve the file-reading problems. For example, encoding="UTF-8" solved many of them (and brought new ones).
- Starting an R script from the command line is also not quite straightforward. There are several options, and the default parameters are not enough.
- Passing parameters to an R script is not straightforward either. It took me a while to realize that args <- commandArgs(trailingOnly=TRUE) is the right solution.
And many, many others. (I already used R extensively a few years ago. Now I realize that it is not a great language in terms of ease of use compared to, say, Python.)
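Two of the pitfalls above in miniature (illustrative snippets, not taken from the assignment code):

l <- list(a = 1, b = 2)
l["a"]     # single bracket: a one-element list
l[["a"]]   # double bracket: the stored value itself, i.e. 1

counter <- 0
increment <- function() {
  counter <<- counter + 1   # a plain '<-' here would only modify a local copy
}
increment()
counter    # 1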
Lessons learned
This exercise yielded a lot of experience. Some further notes worth mentioning are the following:
- In order to get better results, we cannot neglect the language itself. For example, in most languages the apostrophe can be treated as simple punctuation, but not in English: words like don't are pretty common. If we treated the apostrophe as punctuation, it would inflate the counts of don and t (see the example after this list).
- Handling special characters is not easy.
- We should consider not automating everything.
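A quick illustration of the apostrophe issue:

# Splitting on every punctuation mark breaks common English contractions apart.
strsplit("don't worry", "[[:punct:] ]+")[[1]]   # "don"   "t"     "worry"
strsplit("don't worry", " ")[[1]]               # "don't" "worry"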
Extension possibilities
A lot has already been implemented, and there are even more ideas. Some of these could be implemented in the upcoming weeks; others may not be.
- Find a solution for the SUB character problem.
- Find a more sophisticated character-replacement scheme than the current one.
- Check out other hash libraries; maybe there are even faster ones.
- Play around with various phrase lengths. Maybe 5 words is not optimal, as memory and CPU consumption increase steeply with the maximum phrase length. Maybe completeness could be achieved only for single words, while sampling could be used for longer phrases.
- Caching the results in a proper format is inevitable, and the suggestion step should use that cache (a possible format is sketched after this list).
- The cache implementation could be dynamic, i.e. extendable.
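One possible cache format (a hypothetical sketch, not a final decision): turn each hash into a data frame sorted by count and store it with saveRDS(); the suggestion step would read it back with readRDS(). The function name cacheCounts is illustrative.

library(hash)
# Persist one n-gram table as a data frame sorted by decreasing count.
cacheCounts <- function(counts, file) {
  df <- data.frame(phrase = keys(counts),
                   count  = as.numeric(values(counts)),
                   stringsAsFactors = FALSE)
  saveRDS(df[order(-df$count), ], file)
}
# The reverse direction would simply be readRDS(file).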
Results
Main statistics
The table below shows some statistics about the input data (English only):
| | News | Twitter | Blogs |
|---|---|---|---|
| Number of lines (thousand) | 1010.2 | 2360.1 | 899.3 |
| Number of line parts (thousand) | 7053.4 | 8780.7 | 6739.3 |
| Number of words (million) | 35.6 | 31.0 | 38.3 |
Let’s visualize it!


The sizes of the various corpora are similar, but there are interesting differences between them. Let’s check some derived statistics!
| | News | Twitter | Blogs |
|---|---|---|---|
| Average number of words per line | 35.2 | 13.3 | 43.1 |
| Average number of line parts per line | 6.98 | 3.72 | 7.49 |
| Average number of words per line part | 5.05 | 3.53 | 5.68 |
In the case of Twitter, not surprisingly, the average number of words per line is much lower than in the other two (about one third), the average number of line parts per line is about half, and the average number of words per line part is about two thirds. So news and blogs are similar in this sense, but Twitter is different.
Most frequent words and idioms
Words
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| the | 102,095 | the | 46,247 | the | 90,558 |
| to | 47,528 | to | 39,251 | and | 53,303 |
| and | 45,962 | I | 35,335 | to | 52,326 |
| a | 45,515 | a | 30,161 | a | 44,252 |
| of | 40,211 | you | 26,964 | of | 42,910 |
| in | 35,450 | and | 21,507 | I | 37,796 |
| for | 18,456 | for | 19,119 | in | 29,040 |
| that | 17,971 | in | 18,826 | that | 22,238 |
| is | 14,752 | is | 18,005 | is | 21,180 |
| on | 13,804 | of | 17,939 | it | 19,606 |
Let’s visualize the number of occurrences!

There is hardly any difference between the top lists.
- The word the is the first in every category, but in the case of Twitter it is spectacularly less frequent than in the other two.
- Some common words: to, and, a, of, in, for, etc.
- The word I is the third most frequent on Twitter, the sixth in the blogs, and it is not within the top 10 in the news (actually it is the 27th).
- It is worth mentioning that I occurs more frequently than any other personal pronoun.
2-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| of the | 9709 | in the | 3968 | of the | 9137 |
| in the | 9206 | for the | 3695 | in the | 7504 |
| to the | 4415 | of the | 2735 | to the | 4290 |
| on the | 3723 | on the | 2388 | on the | 3665 |
| for the | 3596 | to be | 2361 | to be | 3316 |
| at the | 2960 | thanks for | 2116 | and the | 2854 |
| in a | 2693 | to the | 2076 | for the | 2810 |
| and the | 2643 | at the | 1836 | I was | 2536 |
| to be | 2500 | I love | 1753 | and I | 2437 |
| with the | 2375 | going to | 1699 | I have | 2394 |
Here the similarities are still obvious, but the differences are bigger compared to the single-word frequencies. For example, Twitter and blogs are more personal (thanks for, I love; I was, and I, I have).
3-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| one of the | 721 | thanks for the | 1202 | one of the | 691 |
| a lot of | 565 | looking forward to | 444 | a lot of | 567 |
| part of the | 326 | thank you for | 436 | to be a | 356 |
| as well as | 323 | for the follow | 423 | out of the | 337 |
| according to the | 300 | going to be | 380 | a couple of | 332 |
| going to be | 285 | I love you | 371 | be able to | 330 |
| to be a | 277 | can’t wait to | 368 | it was a | 324 |
| some of the | 276 | I want to | 341 | as well as | 323 |
| the end of | 268 | a lot of | 309 | the end of | 317 |
| out of the | 265 | to be a | 304 | some of the | 298 |
4-word idioms
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| the end of the | 133 | thanks for the follow | 347 | the end of the | 159 |
| the rest of the | 130 | thank you for the | 146 | at the end of | 154 |
| for the first time | 127 | can’t wait to see | 143 | the rest of the | 149 |
| at the end of | 122 | thank you so much | 140 | at the same time | 122 |
| one of the most | 100 | is going to be | 131 | when it comes to | 101 |
| is one of the | 94 | thanks for the RT | 128 | one of the most | 100 |
| said in a statement | 88 | going to be a | 97 | is one of the | 93 |
| in the United States | 86 | for the first time | 92 | to be able to | 90 |
| a member of the | 80 | hope to see you | 76 | for the first time | 87 |
| at the same time | 77 | the end of the | 71 | in the middle of | 82 |
5-word idioms
Let’s look at a few more entries here!
| News | Count | Twitter | Count | Blogs | Count |
|---|---|---|---|---|---|
| at the end of the | 62 | thank you so much for | 51 | at the end of the | 81 |
| for the first time in | 43 | no no no no no | 40 | in the middle of the | 42 |
| for the rest of the | 31 | can’t wait to see you | 34 | the end of the day | 28 |
| by the end of the | 29 | is going to be a | 34 | the other side of the | 27 |
| at the time of the | 26 | thanks for the shout out | 34 | for those of you who | 24 |
| in the middle of the | 25 | let me know if you | 30 | is one of the most | 24 |
| could not be reached for | 21 | at the end of the | 29 | on the other side of | 24 |
| the Dow Jones industrial average | 21 | it’s going to be a | 29 | by the end of the | 23 |
| is one of the most | 20 | keep up the good work | 29 | the North Dakota Township map | 23 |
| not be reached for comment | 20 | this is going to be | 28 | for the rest of the | 22 |
| the end of the day | 19 | hope you have a great | 27 | I thought it would be | 21 |
| at the top of the | 18 | I love you so much | 27 | a couple of weeks ago | 20 |
| rock and roll hall of | 17 | that awkward moment when you | 27 | all you have to do | 20 |
| the end of the year | 16 | I can’t wait to see | 26 | for the first time in | 20 |
| as a result of the | 15 | thank you for the RT | 26 | you have to do is | 20 |
| centers for disease control and | 15 | for the first time in | 24 | but at the same time | 18 |
| said in a statement that | 15 | thank you for the follow | 24 | at the bottom of the | 17 |
| declined to comment on the | 14 | look forward to seeing you | 23 | the rest of the world | 17 |
| for disease control and prevention | 14 | caps caps caps caps caps | 22 | and the rest of the | 16 |
| for the first time since | 14 | call call call call call | 21 | at the end of this | 15 |
| had nothing to do with | 14 | I’m not the only one | 21 | in the form of a | 15 |
| is going to be a | 14 | you so much for the | 21 | when it comes to the | 14 |
The longer the idiom, the bigger the differences between the texts.
- Twitter has the most garbage; for example, if somebody writes the same word many, many times, it becomes one of the most frequent idioms. Almost every run contained such an entry, with a different word each time.
- Internet slang appears. For example, in thank you for the RT, RT means retweet.
- Most of the Twitter idioms are personal.
Plans
This is my current plan for implementing the Shiny app (a minimal UI sketch follows the list):
- The most frequent words and idioms would be cached.
- It would contain a multi-line input field, where the user can write anything.
- Another, read-only field would show the suggestions.
- The user can accept the suggestion by a single click (I need to check how to do that).
- I need to check whether it is possible to show the suggestions in the input field itself (similar to Google search).
- The user is able to set the style of the text: News, Twitter or Blog. If none is selected, then the combined corpus is used.
- The user is able to select the language: English, German, Russian or Finnish. If none is selected, then all of them are considered.
- Addition: the app could detect the language. E.g. if the text entered so far contains words and idioms with a high occurrence rate in one language and practically zero in the other three, then that language can be assumed to be the input language.
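A minimal UI sketch of this plan (an assumed layout; the suggestion logic below is only a placeholder, not the prediction algorithm):

library(shiny)

ui <- fluidPage(
  textAreaInput("text", "Type your text:", rows = 4),
  selectInput("style", "Style:", c("All", "News", "Twitter", "Blog")),
  verbatimTextOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    # Placeholder: the real app would look up the cached n-gram tables here.
    lastWords <- tail(strsplit(tolower(input$text), "\\s+")[[1]], 2)
    paste("Suggestion for:", paste(lastWords, collapse = " "))
  })
}

shinyApp(ui, server)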
Appendix
This is the R source code I implemented.
library(hash)
# Maximum phrase length (number of consecutive words) to collect.
maxNumberOfWords <- 5
# wordCounts[[n]] will be a hash mapping an n-word phrase to its occurrence count.
wordCounts <- list()
# Create one hash per phrase length and initialize the random seed.
initialize <- function() {
  for (i in 1:maxNumberOfWords) {
    wordCounts[[i]] <<- hash()
  }
  seed <- ceiling(runif(1, 0, 65536))
  print(paste("Seed:", seed))
  set.seed(seed)
}
# Count every phrase of 'numberOfWords' consecutive words within 'words'.
handleWords <- function(words, numberOfWords) {
  if (length(words) < numberOfWords) {
    return()
  }
  actualWordCounts <- wordCounts[[numberOfWords]]
  for (i in 1:(length(words) - numberOfWords + 1)) {
    collapsed <- paste(words[i:(i+numberOfWords-1)], collapse = " ")
    if (is.null(actualWordCounts[[collapsed]])) {
      actualWordCounts[collapsed] <- 1
    } else {
      actualWordCounts[collapsed] <- actualWordCounts[[collapsed]] + 1
    }
  }
  wordCounts[[numberOfWords]] <<- actualWordCounts
}
# Replace or remove special characters before splitting and counting.
replaceSpecialChars <- function(line) {
  # Note: the '-' is listed last so it is not interpreted as a range
  # (a '+-/' range would also remove ',' and '.', breaking the later split).
  line <- gsub("[#$%&*+/:;<=>@^_|~-]", "", line)
  line <- gsub('["\\[\\]\\{\\}\\(\\)]', "", line)
  line <- gsub("`", "'", line)
  line <- gsub("\U2018", "'", line) # ‘
  line <- gsub("\U2019", "'", line) # ’
  line <- gsub("\U2026", ".", line) # …
  line <- gsub("\U2032", "'", line) # ′
  line <- gsub("\U215B", "'", line) # ⅛
  line <- gsub("\UFFFD", "", line)  # � (Unicode replacement character)
  line
}
# Read the file in 100-line chunks, randomly keep some chunks, clean and
# split the kept lines, and update the phrase counts for every length.
findNumberOfOccurrences <- function(filename) {
  print(filename)
  numberOfReads <- 0
  numberOfSkips <- 0
  numberOfTotalLines <- 0
  numberOfTotalLineParts <- 0
  numberOfTotalWords <- 0
  input <- file(filename, "r")
  while (TRUE) {
    lines <- readLines(input, 100, encoding="UTF-8")
    if (length(lines) == 0) {
      break
    }
    # Process the chunk only with the given probability, otherwise skip it.
    if (rbinom(1, 1, 0.05)[1] == 0) {
      numberOfSkips <- numberOfSkips + 1
      next
    }
    numberOfReads <- numberOfReads + 1
    for (line in lines) {
      numberOfTotalLines <- numberOfTotalLines + 1
      line <- replaceSpecialChars(line)
      # Split the line into parts at sentence-level punctuation.
      lineParts <- strsplit(line, '[.!?,]')
      for (linePart in lineParts[[1]]) {
        numberOfTotalLineParts <- numberOfTotalLineParts + 1
        # Lowercase, trim and normalize white space, then split into words.
        linePartClean <- gsub("\\s+", " ", trimws(tolower(linePart)))
        words <- strsplit(linePartClean, " ")
        numberOfTotalWords <- numberOfTotalWords + length(words[[1]])
        for (j in 1:maxNumberOfWords) {
          handleWords(words[[1]], j)
        }
      }
    }
  }
  close(input)
  print(paste("Number of reads:", numberOfReads))
  print(paste("Number of skips:", numberOfSkips))
  print(paste("Number of total lines:", numberOfTotalLines))
  print(paste("Number of total line parts:", numberOfTotalLineParts))
  print(paste("Number of total words:", numberOfTotalWords))
}
# Print all phrases of the given length whose count reaches 'limit',
# ordered by increasing count.
printStats <- function(numberOfWords, limit) {
  print(paste("Number of consecutive words:", numberOfWords))
  actualWordCounts <- wordCounts[[numberOfWords]]
  print(paste("Number of entries:", length(actualWordCounts)))
  wordCountLimited <- list()
  for (key in names(actualWordCounts)) {
    if (actualWordCounts[[key]] >= limit) {
      wordCountLimited[key] <- actualWordCounts[[key]]
    }
  }
  if (length(wordCountLimited) > 0) {
    print(wordCountLimited[order(unlist(wordCountLimited))])
  } else {
    print(paste("The list of minimum length", limit, "for number of words", numberOfWords, "is empty."))
  }
}
# Main script: process the file given as the first command-line argument,
# then print the most frequent phrases for each phrase length.
initialize()
args <- commandArgs(trailingOnly=TRUE)
startTime <- Sys.time()
findNumberOfOccurrences(args[1])
endTime <- Sys.time()
print(endTime - startTime)
printStats(1, 1000)
printStats(2, 100)
printStats(3, 50)
printStats(4, 10)
printStats(5, 5)
warnings()