Introduction

This milestone report serves as a background document for the final project of the Data Science Specialisation with Johns Hopkins University on Coursera. The final project lies in the area of natural language processing, where the task is to build an algorithm for SwiftKey that can predict words for the sake of faster typing. While Chomsky (1956) argued that simple finite-state models cannot fully capture natural language, we must here do our best to build an application we can market to those segments that like predictability in their choice of words.

With this milestone report the objectives are to (1) demonstrate the ability to load the text data; (2) make considerations about the text data based on simple summary statistics of dataset sizes, such as the trade-off between the amount of data we feed to the prediction model, the physical RAM used by the final application we will develop, and the runtime necessary to make predictions; (3) begin building the methodology behind the word prediction model based on n-grams, by investigating word frequencies and order in unigrams (single words), bigrams (two consecutive words) and trigrams (three consecutive words); and (4) outline plans for building a prediction algorithm based on the n-gram methodology (Cavnar and Trenkle, 1994) and for developing a Shiny application. The report is built up accordingly, following these four tasks.

Several R packages that we have not used before are useful for completing the above tasks. One such package, tm, is described in Meyer, Hornik and Feinerer (2008). However, a more recent package for n-grams, ngram (with a version published in November 2017), has since become available, and this milestone report has been written using that package. Such packages are very helpful when you are not exactly a computer programmer by training. So a big thank you to those who programmed the n-gram babbler!

Loading the data

The three text files (twitter, blogs and news) are read into R. It is assumed that the files have been downloaded and unpacked beforehand; the working directory is set to their location.

setwd("/Users/ravenclaw/Desktop/Capstone Project/final-2/en_US/")
# Read the raw files; skipNul = TRUE skips embedded NUL characters
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

Summarizing the data and building the corpus

Now we can summarize the raw text files, also to estimate how large a sample we need from each source to build the corpus for the prediction model. Taking the same percentage of lines from each source would bias the corpus towards the language used in the largest source, whereas taking an equal number of words from each source does not favour any particular source. Which choice is correct also depends on the audience we want to target with the app: we ought to choose the sample that best reflects the demographics of the target segment. An ideal app would sample from the natural language of the individual user, for example based on the user's input of some basic demographic traits that could help estimate the user's likely colloquial language. Since I have no such information at the moment, I choose an unbiased approach across the three sources, i.e. I sample roughly 10,000 words from each of them. As just noted, however, it is important that the words stay in their sentences, so in the last column below I calculate how many lines to extract from each file to obtain an unbiased corpus.

library(stringi)

# Object sizes in MB
size_blogs   <- object.size(blogs) / 1024^2
size_news    <- object.size(news) / 1024^2
size_twitter <- object.size(twitter) / 1024^2

# Number of lines in each source
lines_blogs   <- length(blogs)
lines_news    <- length(news)
lines_twitter <- length(twitter)
lines <- c(lines_blogs, lines_news, lines_twitter)

# Number of words in each source
words_blogs   <- sum(stri_count_words(blogs))
words_news    <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))
words <- c(words_blogs, words_news, words_twitter)

# Average words per line
words_per_line_blogs   <- words_blogs / lines_blogs
words_per_line_news    <- words_news / lines_news
words_per_line_twitter <- words_twitter / lines_twitter

# Lines needed to obtain roughly 10,000 words from each source
lines_nec_10000_blogs   <- 10000 / words_per_line_blogs
lines_nec_10000_news    <- 10000 / words_per_line_news
lines_nec_10000_twitter <- 10000 / words_per_line_twitter

df <- data.frame(media = c("blogs", "news", "twitter"),
                 size = c(size_blogs, size_news, size_twitter),
                 lines = c(lines_blogs, lines_news, lines_twitter),
                 words = c(words_blogs, words_news, words_twitter),
                 words_per_line = c(words_per_line_blogs, words_per_line_news, words_per_line_twitter),
                 lines_to_sample = c(lines_nec_10000_blogs, lines_nec_10000_news, lines_nec_10000_twitter))
print(df)
##     media     size   lines    words words_per_line lines_to_sample
## 1   blogs 248.4935  899288 37546246       41.75108        239.5148
## 2    news 249.6329 1010242 34762395       34.40997        290.6135
## 3 twitter 301.3969 2360148 30093410       12.75065        784.2740

So now I sample 240 lines from the blogs file, 291 lines from the news file and 784 lines from the twitter file (the size column above is in MB), and then remove the full objects from memory since they take up a lot of space:
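
Note that for reproducible samples one would fix the random seed before drawing the lines; a minimal sketch (the seed value 1234 is an arbitrary choice of mine):

# Fix the random seed so the sampled lines are the same on every run
set.seed(1234)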

# Draw the random line samples from each source
sample_twitter <- twitter[sample(length(twitter), 784)]
sample_news    <- news[sample(length(news), 291)]
sample_blogs   <- blogs[sample(length(blogs), 240)]

# Free the memory held by the full data sets
rm(twitter, news, blogs)

Then we join the sampled lines into a single corpus string for further analysis. A single large string is easily cleaned with the preprocess function from the ngram package.

library(ngram)
corpora <- c(sample_blogs, sample_news, sample_twitter)
# Collapse with a space so that words at line boundaries are not glued together
corpora <- concatenate(corpora, collapse = " ")
corpora <- preprocess(corpora, case = "lower", remove.punct = TRUE,
                      remove.numbers = TRUE, fix.spacing = TRUE)
string.summary(corpora)
## Chars:       160058
## Letters:     130861
## Whitespace:  28095
## Punctuation: 0
## Digits:      0
## Words:       28096
## Sentences:   0
## Lines:       1 
## Wordlens:    953 1210 1442 1629 2204 2393 3161 4582 5031 5491 
##              1 1 1 1 1 1 1 1 1 1 
## Senlens:     0 
##              10 
## Syllens:     4 5 9 20 69 288 1051 2816 7154 16418 
##              1 1 1 1 1 1 1 1 1 1

Cleaning up here involves changing all letters to lower case, removing punctuation, removing numbers and fixing the spacing between words.
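
To illustrate these steps on a small, made-up example string (the comment indicates the kind of result to expect):

toy <- "Hello, World!  I bought 2 apples."
preprocess(toy, case = "lower", remove.punct = TRUE,
           remove.numbers = TRUE, fix.spacing = TRUE)
# roughly: "hello world i bought apples"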

Applying n-grams to the data

Now we are ready to make an exploratory n-gram analysis of the corpus. For easy coding and fast processing the ngram package is the most convenient to use here:

library(ngram)
unigram <- ngram(corpora, n=1)
bigram <- ngram(corpora, n=2)
trigram <- ngram(corpora, n=3)

With the ngram package we can also easily summarise the n-grams using the get.phrasetable function, which gives a convenient tabulation and a first overview of the kind of predictions we can expect from the approach presented here:

head(get.phrasetable(unigram), n=15L)
##    ngrams freq        prop
## 1    the  1259 0.044810649
## 2     to   803 0.028580581
## 3      a   740 0.026338269
## 4    and   662 0.023562073
## 5     of   537 0.019113041
## 6     in   459 0.016336845
## 7      i   388 0.013809795
## 8     is   338 0.012030182
## 9    for   323 0.011496298
## 10  that   266 0.009467540
## 11    it   257 0.009147210
## 12    on   238 0.008470957
## 13   you   235 0.008364180
## 14  with   208 0.007403189
## 15   was   179 0.006371014
head(get.phrasetable(bigram), n=15L)
##       ngrams freq         prop
## 1    of the   113 0.0040220680
## 2    in the   102 0.0036305392
## 3   for the    60 0.0021356113
## 4    to the    58 0.0020644243
## 5     to be    53 0.0018864567
## 6    on the    47 0.0016728955
## 7      in a    45 0.0016017085
## 8   and the    39 0.0013881474
## 9    at the    37 0.0013169603
## 10    for a    36 0.0012813668
## 11     of a    33 0.0011745862
## 12     is a    32 0.0011389927
## 13   is the    29 0.0010322121
## 14 with the    29 0.0010322121
## 15   it was    27 0.0009610251
head(get.phrasetable(trigram), n=15L)
##                 ngrams freq         prop
## 1          the end of    10 0.0003559479
## 2            a lot of    10 0.0003559479
## 3         some of the     8 0.0002847583
## 4          one of the     7 0.0002491635
## 5           i need to     5 0.0001779739
## 6         why ought i     5 0.0001779739
## 7         going to be     4 0.0001423792
## 8         part of the     4 0.0001423792
## 9      the first time     4 0.0001423792
## 10          i will be     4 0.0001423792
## 11        i was going     4 0.0001423792
## 12 do something about     4 0.0001423792
## 13       be real with     4 0.0001423792
## 14           it was a     4 0.0001423792
## 15        want you to     4 0.0001423792
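
As a side note, the package also provides the "babbler" acknowledged in the introduction: the babble function generates random text from a fitted n-gram object, which gives an informal impression of what the trigram model has picked up (the generation length of 15 words is an arbitrary choice):

# Generate 15 words of random text from the trigram model
babble(trigram, genlen = 15)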

Expected work ahead

The next task for the final project report will be to integrate the n-grams into a proper prediction model and to validate it. Part of that task is to go from the bigrams or trigrams to a vectorized word matrix. At the moment I have little idea how to achieve that, but I hope there will be some useful classes before the final project that can help me.
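
One possible direction, shown purely as a sketch and not as the final model (predict_next is a hypothetical helper of mine, not part of the ngram package): split the bigram phrase table into first and second words and look up the most frequent continuations of a given word.

# Split the bigram phrase table into "current word" and "next word" columns
bigram_table <- get.phrasetable(bigram)
parts <- strsplit(trimws(bigram_table$ngrams), " ")   # trimws guards against trailing spaces
bigram_table$w1 <- vapply(parts, `[`, character(1), 1)
bigram_table$w2 <- vapply(parts, `[`, character(1), 2)

# Return the n most frequent continuations of a word, most frequent first
predict_next <- function(word, table = bigram_table, n = 3) {
  hits <- table[table$w1 == tolower(word), ]
  head(hits$w2[order(-hits$freq)], n)
}

predict_next("of")   # with the table above, "the" should come out on top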

I think it is too early to set data aside for cross validation purposes, because we need the vectorized dataset first.
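
When we do get to validation, a simple line-level hold-out could look like the following sketch (the 80/20 split is an arbitrary choice of mine):

# Hold out 20% of the sampled lines for later testing
all_lines  <- c(sample_blogs, sample_news, sample_twitter)
train_idx  <- sample(length(all_lines), floor(0.8 * length(all_lines)))
train_lines <- all_lines[train_idx]
test_lines  <- all_lines[-train_idx]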

Another thing I am in doubt about, having read around, is what to do about stop words. For now I think the stop words need to be included, as I have done here, but I also hope to learn more about this before the final report.
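
Should we later want to compare models with and without stop words, the tm package cited above (Meyer, Hornik and Feinerer, 2008) offers helpers for this; a minimal sketch, not applied in this report:

library(tm)
# Strip common English stop words from the corpus string, then tidy the spacing
corpus_nostop <- removeWords(corpora, stopwords("en"))
corpus_nostop <- stripWhitespace(corpus_nostop)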

Once the model has been built and validated it can be wrapped in a Shiny application. Before that we have to check the size and runtime of the prediction model carefully, to ensure that it can run on popular devices such as smartphones with relatively limited RAM. We also have to take into account that users nowadays are impatient, so for the application to be marketable it should be efficient and quick to use.
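
Some rough checks on footprint and speed can already be run during development with base R alone (predict_next is the hypothetical helper sketched above):

print(object.size(get.phrasetable(trigram)), units = "MB")  # memory footprint of the lookup table
system.time(predict_next("of"))                             # rough timing of a single prediction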

References

Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. Ann Arbor MI, 48113(2), 161-175.

Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.

Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1-54.