The goal of the capstone project is to develop a text prediction algorithm. The algorithm needs to have some basic properties:

- it must be fast
- it must be memory efficient (RStudio's hosting servers are limited to 1 GB on the free plan)
- it must be reasonably accurate (speed is to be prioritised over accuracy according to the Coursera forums!)
The overall strategy for my app will follow the advice provided by Len Greski in his *Simplify, Simplify, Simplify* post on the discussion forum:
"A simple solution to the Capstone can be accomplished with three key tools:
data.table – due to its high performance, low memory usage, and ability to do an indexed search like a database table, this package is extremely useful not only to create the data needed for the prediction algorithm, but it is also very valuable in the shiny app.
quanteda::tokens_ngrams() – the workhorse that will generate the data needed for the easiest possible algorithm, a simple back off model based on last word frequencies / probabilities given a set of first words
SQL with the sqldf package – given a set of n-grams that are aggregated into three columns, a base consisting of n-1 words in the n-gram, and a prediction that is the last word, and a count variable for the frequency of occurrence of this n-gram, it’s easy to write an SQL statement to extract the most frequently occurring prediction and save these into an output data.table for your shiny app"
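As a rough illustration of the quoted approach (not the final app code; the table and column names `ngrams`, `base`, `prediction` and `count` are made up for the example), the lookup step might look something like this:

```r
# Sketch only: hypothetical n-gram table with base / prediction / count columns
library(data.table)
library(sqldf)

ngrams <- data.table(
  base       = c("thanks for", "thanks for", "looking forward"),
  prediction = c("the", "your", "to"),
  count      = c(120L, 85L, 60L)
)

# For each base, keep the most frequently occurring prediction
top_pred <- sqldf("
  SELECT base, prediction, MAX(count) AS count
  FROM ngrams
  GROUP BY base
")
setDT(top_pred)  # back to a data.table for use in the shiny app
top_pred
```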
The quanteda package will be key to the analysis, and I’ve referred to the quanteda quick start guide and cheat sheet quite a bit in the solution.
The proposed approach is outlined in the steps below.
We will work on the English language files as that language is more familiar to me. There are three English language files, scraped from a news website, Twitter and weblogs. The files are quite large. Sentences are incomplete and presented out of order to preserve anonymity.
First we load the data and then gather some basic information about it. Here the readLines function from base R and the stri_stats_general function from the stringi package are useful.
## Warning in readLines(path_news, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'C:/Users/rmulligan001/Documents/My Training/R/Data Science
## Specialization/Capstone/NLP/data/en_US.news.txt'
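A minimal sketch of this loading and summary step (the file path is a placeholder for the local data directory) might look like:

```r
library(stringi)

# Placeholder path; point this at the local copy of the dataset
path_blogs <- "data/en_US.blogs.txt"
blogs <- readLines(path_blogs, encoding = "UTF-8", skipNul = TRUE)

# Lines, non-empty lines, characters and non-whitespace characters
stri_stats_general(blogs)

# Approximate word count
sum(stri_count_words(blogs))
```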
Summary table of information about the text files.
| File name | File size (MB) | Lines | Non-empty lines | Characters | Non-whitespace characters | Word count |
|---|---|---|---|---|---|---|
| en_US.blogs | 200.4 | 899,288 | 899,288 | 206,824,382 | 170,389,539 | 37,570,839 |
| en_US.news | 196.3 | 77,259 | 77,259 | 15,639,408 | 13,072,698 | 2,651,432 |
| en_US.twitter | 159.4 | 2,360,148 | 2,360,148 | 162,096,241 | 134,082,806 | 30,451,170 |
The table shows that the file sizes are large, which may be a factor in later analysis as tokenisation is a memory-intensive process.
We convert the raw text files to a corpus so that we can more easily analyse the data using quanteda.
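A sketch of the corpus step, assuming the three character vectors `blogs`, `news` and `twitter` have already been read in as above:

```r
library(quanteda)

# Combine the three sources into a single quanteda corpus
corp <- corpus(c(blogs, news, twitter))
summary(corp, n = 3)
```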
We begin with tokenisation of the corpus. Tokenisation (Wikipedia provides a simple overview of lexical analysis: https://en.wikipedia.org/wiki/Lexical_analysis) converts the text in the corpus into useful units (in this case words) and allows for easier subsequent analysis, including statistical analysis. The quanteda default tokeniser is used, and a number of data cleaning steps are performed as part of tokenisation (see the sketch below).
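As a sketch of the tokenisation step, with the specific cleaning options shown here being assumptions rather than the definitive settings used:

```r
# Sketch: quanteda default tokeniser with some typical cleaning options
# (the exact options are assumptions, not the definitive settings)
toks <- tokens(
  corp,
  remove_punct   = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE,
  remove_url     = TRUE
)
toks <- tokens_tolower(toks)
```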
The quanteda package contains the so-called “Swiss Army Knife” function dfm(), which builds a document-feature matrix. This is used to identify the key features of the data.
The most frequently occurring words are those we might expect. Profanity has been removed. We did not remove stopwords: these are frequently occurring words, and since we are trying to build a prediction algorithm we should expect frequently occurring words to be important!
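As a sketch of this step (`profanity_words` is a placeholder for whichever profanity list is actually used):

```r
# Placeholder profanity list; substitute the real list used in the analysis
profanity_words <- c("badword1", "badword2")
toks_clean <- tokens_remove(toks, pattern = profanity_words)

# Document-feature matrix and the ten most frequent words
dfm_words <- dfm(toks_clean)
topfeatures(dfm_words, 10)
```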
Similarly, the bigrams are as we might expect. “The” will feature prominently in our algorithm!
Ditto for the trigrams; a sketch of the n-gram generation step is shown below.
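A sketch of how the bigram and trigram frequencies might be produced with quanteda::tokens_ngrams(), reusing the cleaned tokens from above:

```r
# Sketch: bigram and trigram frequencies from the cleaned tokens
bigrams  <- tokens_ngrams(toks_clean, n = 2, concatenator = " ")
trigrams <- tokens_ngrams(toks_clean, n = 3, concatenator = " ")

topfeatures(dfm(bigrams), 10)
topfeatures(dfm(trigrams), 10)
```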