This milestone report serves as a background document for the final project in the Data Science Specialisation offered by Johns Hopkins University on Coursera. The final project is in the area of natural language processing: the task is to build an algorithm for SwiftKey that can predict the next word, for the sake of faster typing. While Chomsky (1956) held that natural language is largely unpredictable, we must do our best here to build an application we can market to those segments that like predictability in their choice of words.
The objectives of the milestone report are to (1) demonstrate the ability to load the text data; (2) assess the text data using simple summary statistics of dataset sizes, such as the trade-off between the amount of data we feed to the prediction model, the physical RAM used by the final application we will develop and, not least, the runtime necessary to make predictions; (3) begin building the methodology behind the word prediction model based on n-grams, by investigating word frequencies and word order in unigrams (single words), bigrams (two consecutive words) and trigrams (three consecutive words); and (4) start making plans for building a prediction algorithm on top of the n-gram methodology (Cavnar and Trenkle, 1994) and, finally, for developing a Shiny application. The report is built up accordingly, following these four tasks.
Several R packages that we have not used before are useful for completing the above tasks. One, described in Meyer, Hornik and Feinerer (2008), is called tm. However, a more recent package dedicated to n-grams, the ngram package (published in November 2017), has since become available, and this milestone report has been written using it. Such packages are very helpful when you are not exactly a computer programmer by training. So a big thank you to those who programmed the n-gram babbler!
The three text files (twitter, blogs and news) are read into R. It is assumed that the files have been downloaded to the working directory beforehand.
setwd("/Users/ravenclaw/Desktop/Capstone Project/final-2/en_US/")
# Read the three source files; skipNul avoids errors from embedded nul characters
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
Now we summarise the raw text files, also to estimate how large a sample we need from each source to build the corpus for the prediction model. Taking the same percentage from each source would bias the corpus towards the language used in the largest source, whereas taking equal numbers of words from each source does not favour any particular source. The right choice also depends on the audience we want to target with the app: the sample ought to reflect the demographics of the target segment. An ideal app would sample the natural language of the individual user, perhaps based on a few basic demographic traits supplied by the user that help estimate their likely colloquial language. Since I have no such information at the moment, I choose the unbiased approach across the three sources, i.e. I sample roughly 10,000 words at random from each. It is important, however, that the sampled words stay within their sentences, so I sample whole lines and calculate in the last column below how many lines to extract from each file to reach about 10,000 words.
library(stringi)

# Object sizes in MB
size_blogs   <- object.size(blogs)/1024^2
size_news    <- object.size(news)/1024^2
size_twitter <- object.size(twitter)/1024^2

# Number of lines per source
lines_blogs   <- length(blogs)
lines_news    <- length(news)
lines_twitter <- length(twitter)

# Number of words per source
words_blogs   <- sum(stri_count_words(blogs))
words_news    <- sum(stri_count_words(news))
words_twitter <- sum(stri_count_words(twitter))

# Average words per line
words_per_line_blogs   <- words_blogs/lines_blogs
words_per_line_news    <- words_news/lines_news
words_per_line_twitter <- words_twitter/lines_twitter

# Lines needed from each source to sample roughly 10,000 words
lines_nec_10000_blogs   <- 10000/words_per_line_blogs
lines_nec_10000_news    <- 10000/words_per_line_news
lines_nec_10000_twitter <- 10000/words_per_line_twitter

df <- data.frame(media = c("blogs", "news", "twitter"),
                 size = c(size_blogs, size_news, size_twitter),  # MB
                 lines = c(lines_blogs, lines_news, lines_twitter),
                 words = c(words_blogs, words_news, words_twitter),
                 words_per_line = c(words_per_line_blogs, words_per_line_news, words_per_line_twitter),
                 lines_to_sample = c(lines_nec_10000_blogs, lines_nec_10000_news, lines_nec_10000_twitter))
print(df)
##     media     size   lines    words words_per_line lines_to_sample
## 1   blogs 248.4935  899288 37546246       41.75108        239.5148
## 2    news 249.6329 1010242 34762395       34.40997        290.6135
## 3 twitter 301.3969 2360148 30093410       12.75065        784.2740
So I sample 240 lines from the blogs file, 291 lines from the news file and 784 lines from the twitter file, and then remove the original objects since they take up a lot of memory:
# Sample the calculated number of lines from each source
sample_twitter <- twitter[sample(1:length(twitter), 784)]
sample_news    <- news[sample(1:length(news), 291)]
sample_blogs   <- blogs[sample(1:length(blogs), 240)]

# Remove the full datasets to free memory
rm(twitter, news, blogs)
Then we join the sampled lines into a single corpus string for further analysis. A single large string is easier to clean using the preprocess function from the ngram package.
library(ngram)
corpora <- c(sample_blogs, sample_news, sample_twitter)
# Join the sampled lines with spaces so that words at line boundaries do not merge
corpora <- concatenate(corpora, collapse = " ")
# Lower-case everything, strip punctuation and numbers, and normalise spacing
corpora <- preprocess(corpora, case = "lower", remove.punct = TRUE,
                      remove.numbers = TRUE, fix.spacing = TRUE)
string.summary(corpora)
## Chars: 160058
## Letters: 130861
## Whitespace: 28095
## Punctuation: 0
## Digits: 0
## Words: 28096
## Sentences: 0
## Lines: 1
## Wordlens: 953 1210 1442 1629 2204 2393 3161 4582 5031 5491
## 1 1 1 1 1 1 1 1 1 1
## Senlens: 0
## 10
## Syllens: 4 5 9 20 69 288 1051 2816 7154 16418
## 1 1 1 1 1 1 1 1 1 1
Cleaning up here involves changing all letters to lower case, removing punctuation, removing numbers and fixing the spacing between words.
Now we are ready to make an exploratory n-gram analysis of the corpus. For easy coding and fast processing, the ngram package is the most convenient choice here:
library(ngram)
# Build unigram, bigram and trigram models from the preprocessed corpus
unigram <- ngram(corpora, n = 1)
bigram  <- ngram(corpora, n = 2)
trigram <- ngram(corpora, n = 3)
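As a quick, informal check that the trigram object behaves sensibly, the package's babble function (the "babbler" thanked above) can generate random text from it; the exact output depends on the random sample, and the seed below is chosen arbitrarily:

# Generate 15 words of random text from the trigram model (seed chosen arbitrarily)
babble(trigram, genlen = 15, seed = 1234)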
With the ngram package we can also easily summarise the n-grams using the get.phrasetable function, which gives a convenient tabulation and a first impression of the text predictions we will get with the approach presented here:
head(get.phrasetable(unigram), n=15L)
## ngrams freq prop
## 1 the 1259 0.044810649
## 2 to 803 0.028580581
## 3 a 740 0.026338269
## 4 and 662 0.023562073
## 5 of 537 0.019113041
## 6 in 459 0.016336845
## 7 i 388 0.013809795
## 8 is 338 0.012030182
## 9 for 323 0.011496298
## 10 that 266 0.009467540
## 11 it 257 0.009147210
## 12 on 238 0.008470957
## 13 you 235 0.008364180
## 14 with 208 0.007403189
## 15 was 179 0.006371014
head(get.phrasetable(bigram), n=15L)
## ngrams freq prop
## 1 of the 113 0.0040220680
## 2 in the 102 0.0036305392
## 3 for the 60 0.0021356113
## 4 to the 58 0.0020644243
## 5 to be 53 0.0018864567
## 6 on the 47 0.0016728955
## 7 in a 45 0.0016017085
## 8 and the 39 0.0013881474
## 9 at the 37 0.0013169603
## 10 for a 36 0.0012813668
## 11 of a 33 0.0011745862
## 12 is a 32 0.0011389927
## 13 is the 29 0.0010322121
## 14 with the 29 0.0010322121
## 15 it was 27 0.0009610251
head(get.phrasetable(trigram), n=15L)
## ngrams freq prop
## 1 the end of 10 0.0003559479
## 2 a lot of 10 0.0003559479
## 3 some of the 8 0.0002847583
## 4 one of the 7 0.0002491635
## 5 i need to 5 0.0001779739
## 6 why ought i 5 0.0001779739
## 7 going to be 4 0.0001423792
## 8 part of the 4 0.0001423792
## 9 the first time 4 0.0001423792
## 10 i will be 4 0.0001423792
## 11 i was going 4 0.0001423792
## 12 do something about 4 0.0001423792
## 13 be real with 4 0.0001423792
## 14 it was a 4 0.0001423792
## 15 want you to 4 0.0001423792
The next task for the final project will be to integrate the n-grams into a proper prediction model and to validate it. Part of that task is to go from the bigram and trigram tables to a structure that can be queried efficiently, such as a vectorised word matrix. At the moment I have little idea how to achieve that, but I hope there will be some useful classes before the final project that can help me.
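To make the plan a little more concrete, one possible (and deliberately naive) approach is a simple backoff lookup on the phrase tables themselves: take the last two typed words, return the most frequent trigram continuation, back off to the bigram table if nothing matches, and finally fall back to the most frequent unigram. The sketch below assumes the unigram, bigram and trigram objects built above; tri_tab, bi_tab, uni_tab and predict_next are names I introduce here and are not part of the ngram package.

# Minimal backoff sketch (not the final model)
tri_tab <- get.phrasetable(trigram)
bi_tab  <- get.phrasetable(bigram)
uni_tab <- get.phrasetable(unigram)

predict_next <- function(text) {
  # Clean the input the same way as the corpus and split it into words
  words <- unlist(strsplit(preprocess(text, case = "lower", remove.punct = TRUE), " "))
  words <- words[words != ""]
  n <- length(words)
  # Try the trigram table: match on the last two words
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n], "")
    hits <- tri_tab[startsWith(tri_tab$ngrams, prefix), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][3])
  }
  # Back off to the bigram table: match on the last word
  if (n >= 1) {
    prefix <- paste(words[n], "")
    hits <- bi_tab[startsWith(bi_tab$ngrams, prefix), ]
    if (nrow(hits) > 0) return(strsplit(trimws(hits$ngrams[1]), " ")[[1]][2])
  }
  # Final fallback: the most frequent unigram
  trimws(uni_tab$ngrams[1])
}

predict_next("thanks for")  # returns the most frequent continuation in this sample

A real model would need smoothing and a more memory-efficient lookup structure than a data frame, but the prediction logic would stay much the same.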
I think it is too early to set data aside for cross-validation purposes, because we need the vectorised dataset first.
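For when that point comes, a line-level holdout could be as simple as the following sketch; the 80/20 split and the object names are my own assumptions, not project requirements.

# Hypothetical 80/20 split of the sampled lines into training and validation sets
all_lines <- c(sample_blogs, sample_news, sample_twitter)
train_idx <- sample(seq_along(all_lines), size = floor(0.8 * length(all_lines)))
train_lines <- all_lines[train_idx]
test_lines  <- all_lines[-train_idx]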
Another thing I am in doubt about, from reading around, is what to do about stop words. I think the stop words need to be included, as I have done here, since common stop words are often exactly the next words a typing app should suggest. I also hope to learn more about this before the final report.
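Should the stop words need to go after all, the tm package cited above ships a standard English stop-word list; below is a minimal sketch of how the preprocessed corpus string could be filtered for comparison (this has not been applied to the tables above):

library(tm)
# Strip standard English stop words from the corpus string and tidy the spacing
corpora_nostop <- removeWords(corpora, stopwords("english"))
corpora_nostop <- stripWhitespace(corpora_nostop)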
Once the model has been built and validated, it can be wrapped in a Shiny application. Before that, we have to check the size and runtime of the prediction model carefully to ensure that it can run on popular devices such as smartphones with relatively limited RAM. We also have to take into account that users nowadays are impatient, so for the application to be marketable it must be efficient and quick to use.
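Two quick checks along those lines, assuming the phrase tables and the predict_next sketch from above, are the in-memory size of the lookup tables and the wall-clock time of a single prediction:

# Approximate memory footprint of the lookup tables
print(object.size(uni_tab), units = "MB")
print(object.size(bi_tab),  units = "MB")
print(object.size(tri_tab), units = "MB")

# Time a single prediction
system.time(predict_next("at the end of"))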
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161-175.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.
Meyer, D., Hornik, K., & Feinerer, I. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1-54.