Given a sequence of words, we would like to build an algorithm that predicts the next word. The training data consists of a few reasonably large text corpora that include data from Twitter, news and blogs. The Capstone Dataset was downloaded and unzipped with R (the script is given in the appendix).
There are three text files: en_US.twitter.txt (2360148 lines), en_US.news.txt (1010242 lines) and en_US.blogs.txt (899288 lines). As one may guess, a quick review of the data shows the corpora are very different. A line of Twitter text is on average much shorter than a line of news and somewhat shorter than a line of blog text. Besides, there are many more non-character symbols and all-upper-case words in the Twitter data than in the other two data sets. Blog data looks somewhere between Twitter and news data. Twitter data also makes heavy use of special characters like @ and #. These observations lead to the following cleansing procedure (an R sketch follows the list):
- substitute @ with "at" (this works well on Twitter data, but will fail on news)
- remove all #words
- substitute all web links (http, https, www) with "httplabel"
- substitute all times of the form HH:MM with "timelabel"
- substitute all numbers or sequences of numbers with "numlabel"
- trim all apostrophes (so that isn't -> isnt; this has drawbacks)
- substitute every !, ., :, ; or ? with "punctlabel"
- label every run of more than 2 letters written all in upper case as "abbrlabel"
- add a "startlabel" to every parsed line
- remove all remaining non-characters, convert all text to lower case and trim all extra spaces
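A minimal sketch of these steps in R, applied one line at a time; the regular expressions here are illustrative approximations, not the exact ones used:

# sketch of the cleansing steps; regexes are illustrative approximations
clean_line <- function(x) {
  x <- gsub("@", " at ", x, fixed = TRUE)                 # @ -> "at"
  x <- gsub("#\\S+", " ", x)                              # remove #words
  x <- gsub("(https?://|www\\.)\\S+", " httplabel ", x)   # web links
  x <- gsub("\\b\\d{1,2}:\\d{2}\\b", " timelabel ", x)    # times like 10:30
  x <- gsub("['’]", "", x)                                # isn't -> isnt
  x <- gsub("\\b[A-Z]{3,}\\b", " abbrlabel ", x)          # all-caps abbreviations
  x <- gsub("\\d+", " numlabel ", x)                      # numbers
  x <- gsub("[!.:;?]+", " punctlabel ", x)                # sentence punctuation
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)                            # drop remaining non-characters
  x <- gsub("\\s+", " ", trimws(x))                       # trim extra spaces
  paste("startlabel", x)                                  # mark line start
}
clean_line("Isn't it GREAT? Meet @ 10:30 http://t.co/x #nlp")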
There are a few drawbacks of this cleansing procedure that should be corrected before the final model delivery. No words are removed for now; we are keeping them until the basic model is done. Sentences are not distinguished within the corpora rows (some algorithm to identify where a sentence ends is still needed).
After applying the cleansing procedure, only words separated by spaces remain. Tokenization is performed to extract unigrams, bigrams and trigrams. Higher-order grams are not considered until the performance of the basic model is satisfactory. For each data set approximately 600000 rows were processed and 1-, 2-, 3-grams were extracted using the data.table package. Basic statistics are given below.
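A sketch of the n-gram extraction with data.table; `clean_line` is the hypothetical helper from the cleansing sketch above, and the one-table-per-n layout with a `count` column is an assumption, not the exact schema used:

library(data.table)

ngram_counts <- function(lines, n) {
  words <- strsplit(lines, " ", fixed = TRUE)
  grams <- unlist(lapply(words, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  data.table(gram = grams)[, .(count = .N), by = gram][order(-count)]
}

lines    <- clean_line(c("I like data", "I like cats"))
unigrams <- ngram_counts(lines, 1)
bigrams  <- ngram_counts(lines, 2)
trigrams <- ngram_counts(lines, 3)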
1-gram count: 136045 with median = 1 and mean = 88.9. The amount of 1's is more than 52%. Inspecting the data made it clear that words with a count of less than approximately 5 are in most cases either misspelled, written in some jargon, or even a number of meaningless characters strung together.
counts | labels
----------|-------------
3633 | httplabel
4497 | timelabel
170659 | numlabel
1695379 | punctlabel
273776 | abbrlabel
500000 | startlabel
2-gram count: 1884323; median = 1, mean = 6.1. The amount of 1's is about 70% of the data. 3-gram count: 5359861; median = 1, mean = 2.1. The amount of 1's is about 84% of the data.
1-gram count: 198170; median = 2, mean = 155, and the max count is 3316000. 40% of the words have only 1 occurrence and approximately 75% of the words occur less than 10 times.
counts | labels
---------|-------------
3109 | httplabel
17651 | timelabel
769786 | numlabel
3315664 | punctlabel
160650 | abbrlabel
519570 | startlabel
2-gram count: 4257165; median = 1 and mean = 7. The max count is 233198. The most frequent are pairs of punctuation and numbers. Among other very frequent results (more than 50000 occurrences) are pairs like "for the", "on the", "to the", etc. Among 2-grams with frequency higher than 1000 there are pairs like "over the", "about the", "but the", "would be", "have to", etc. Word pairs such as "you are", "you look", "will go", "were just", etc., have frequencies in the range from 500 to 1000.
3-gram count: 13822019 with median = 1 and mean = 2.14.
As a result, a few follow-up steps were identified while processing and inspecting the results (a filtering sketch follows the list):
- add a stop-word filter (this will reduce the table size dramatically and will remove all those very frequent pairs; the structure of the filter depends on what we want to be able to predict, e.g. we may only predict nouns and verbs)
- define a criterion to label words as unknown (this may also reduce the size of the table and increase performance, plus it adds the ability to handle cases with unknown words)
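One possible shape for the stop-word filter, using the `bigrams` table from the n-gram sketch above; the word list here is a placeholder, not a curated one:

# drop candidates whose predicted (second) word is a stop word
stopwords <- c("the", "a", "an", "to", "on", "for", "of", "and")   # placeholder list
bigrams[, second := tstrsplit(gram, " ", fixed = TRUE)[[2]]]
bigrams_filtered <- bigrams[!(second %in% stopwords)]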
As a first step it was decided to build a very simple model for predicting the next word. The idea is as follows. Assume we have the sentence "I like" and we want to predict what goes next.
First we need to identify the source of the sentence, whether it comes from Twitter, news or blogs (in progress). This tells us which database to use for prediction.
Then we need to apply the cleansing procedures described above.
The next step is to find candidates for the predicted word. To do that, we first look up in the 2-grams all entries that start with "like", take all the second words as candidates and compute their probabilities (naive Bayes). Then we look up in the 3-grams all entries starting with "I like", take all candidates and calculate their probabilities. The result of this step is a candidate table with 3 columns: (1) candidate word; (2) 2-gram probability; (3) 3-gram probability.
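A sketch of this lookup, assuming the n-gram tables have been split into word columns (`w1`, `w2`, `w3`, `count`); these column names are an assumed schema, and plain conditional relative frequencies stand in for the probability estimate:

library(data.table)

candidates <- function(w_prev2, w_prev1, bigrams, trigrams) {
  b <- bigrams[w1 == w_prev1]
  b[, prob2 := count / sum(count)]                 # P(w2 | w_prev1)
  t <- trigrams[w1 == w_prev2 & w2 == w_prev1]
  t[, prob3 := count / sum(count)]                 # P(w3 | w_prev2, w_prev1)
  out <- merge(b[, .(cand = w2, prob2)],
               t[, .(cand = w3, prob3)],
               by = "cand", all = TRUE)
  out[is.na(prob2), prob2 := 0]                    # candidate seen only in 3-grams
  out[is.na(prob3), prob3 := 0]                    # candidate seen only in 2-grams
  out
}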
In this step, taking the table of candidates with probabilities as input, we need some kind of "rank" for each candidate in order to select one. The rank is calculated as rank = lambda1 x (2-gram probability) + lambda2 x (3-gram probability), where lambda1 and lambda2 are weights assigned to the 2-grams and 3-grams respectively.
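Applying this rank to the running example, building on the lookup sketch above (the lambda values are the 0.5/0.5 defaults used later in the text):

lambda1 <- 0.5; lambda2 <- 0.5
cand_tab <- candidates("i", "like", bigrams, trigrams)   # from the lookup sketch
cand_tab[, rank := lambda1 * prob2 + lambda2 * prob3]
head(cand_tab[order(-rank)], 5)                          # top 5 candidates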
For each data set, train lambda1 and lambda2. (To-do task)
OK, but what if the last word is one we haven't seen before? For example, imagine that for the given sentence "My parents" we do not have "parents" in the training data set, or, even worse, we do not have "my" either. Well, in that case we can use words with low frequency to approximate the probability of occurrence of unseen words.
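One way to realize this idea, sketched under the assumptions that the count < 5 threshold from the 1-gram inspection is used and that "unklabel" is a label name chosen here for illustration:

# relabel rare words as "unklabel" before building the n-gram tables;
# at prediction time unseen input words are mapped to "unklabel" as well
rare <- unigrams[count < 5, gram]
relabel_unknown <- function(words) ifelse(words %in% rare, "unklabel", words)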
Now, assuming that the sentence "My parents" is more likely to appear in news-type text, and taking lambda1 = lambda2 = 0.5, using the news n-gram tables the model above outputs 51 candidates, and the top 5 of them are:
candidate | prob2 | prob3 | rank
----------|----------|-----------|------
divorced | 0.05 | 0 | 0.025
should | 0.0033 | 0 | 0.00166
need | 0.00264 | 0 | 0.00132
whose | 0.0026 | 0 | 0.0013
must | 0.0025 | 0 | 0.0012
The result was obtained in about 0.53 seconds, which means:
data.table rocks!
We can add more n-grams, especially if we develop a strategy to reduce the size of the n-gram tables.
Calculating the accuracy of the basic model, as well as training the parameters (the lambda weights of belief in 2- and 3-grams), is in progress.
## download data and save to a temp file
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
file <- tempfile()
download.file(url, destfile = file, method = "curl")

## inspect the downloaded zip and extract its content
file.info(file)
finfo <- unzip(file, list = TRUE)        # list archive contents without extracting
unzip(file, exdir = "Data")              # extract everything into Data/

## subset the EN_US related files
fx <- finfo[grepl("final/en_US.*txt", finfo$Name), ]
Since the data is too big to be processed in memory at once, it was decided to use the data.table library to manipulate the data and store the results. The data was read into R and processed in 12 chunks of 50000 records each. Not all chunks were processed successfully, since there are NUL strings embedded in the files (and I did not know how to remove them before reading the data). To increase performance, the parallel library was used when extracting n-grams.
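A sketch of the chunked reading loop: `skipNul = TRUE` in `readLines` silently drops the embedded NUL bytes that broke some chunks, and `mclapply` from the parallel package spreads the three n-gram extractions over cores. The file path follows the unzip layout above, the chunk size comes from the text, and `clean_line` and `ngram_counts` are the hypothetical helpers sketched earlier:

library(parallel)

con <- file("Data/final/en_US/en_US.twitter.txt", open = "r")
repeat {
  chunk <- readLines(con, n = 50000, skipNul = TRUE)   # embedded NULs dropped
  if (length(chunk) == 0) break
  cleaned <- clean_line(chunk)
  # extract 1-, 2- and 3-grams in parallel (one core per n)
  grams <- mclapply(1:3, function(n) ngram_counts(cleaned, n), mc.cores = 3)
  # ... merge the chunk counts into the accumulated data.table tables ...
}
close(con)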