Summary

The goal of this project is to create a text prediction algorithm using data supplied by SwiftKey. The data is text from online news sources, blogs and Twitter; I am using the English language data.

Loading the Data

The data is read in from the three text files.

# Read the data in
blog <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")

Profanity is filtered out, and the blog and Twitter data sets are then randomly subsampled to reduce them to a manageable size.
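
As a sketch of the profanity filter (assuming a plain-text word list, one term per line, at the hypothetical path badwords.txt), any line containing a listed term can be dropped:

# Sketch only: badwords.txt is a hypothetical profanity word list, one term per line
badwords <- readLines("badwords.txt", encoding="UTF-8")
pattern <- paste0("\\b(", paste(badwords, collapse="|"), ")\\b")

# Drop every line that contains a listed term
twitter <- twitter[!grepl(pattern, twitter, ignore.case=TRUE)]
blog <- blog[!grepl(pattern, blog, ignore.case=TRUE)]
news <- news[!grepl(pattern, news, ignore.case=TRUE)]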

set.seed(123455)

# Keep a random ~25% of the Twitter lines
rfilter <- rbinom(length(twitter), size=1, prob=0.25)
twitter <- twitter[rfilter == 1]

# Keep a random ~25% of the blog lines
rfilter <- rbinom(length(blog), size=1, prob=0.25)
blog <- blog[rfilter == 1]

Finally, the three data sets are combined into a single text vector and non-ASCII characters are removed.

text <- c(news, twitter, blog)

# Remove non-ASCII characters
text <- iconv(text, "latin1", "ASCII", sub="")

Data Summary

To give an overview of the data, it is converted into a TermDocumentMatrix using the tm library.

library(tm)

corpus <- VCorpus(VectorSource(text))
# Count word frequencies across the combined corpus
tdm <- as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf))))
frequencycount <- rowSums(tdm)
frequencycount <- sort(frequencycount, decreasing=TRUE)

We can use this to see the most common words.

head(frequencycount, 12)
##   the    to   and     a    of    in   for  that    is    on  with  said 
## 18723  8634  8510  8392  7121  6511  3380  3320  2722  2701  2387  2373

Or we can look at a word cloud using the wordcloud library:

library(wordcloud)
library(RColorBrewer)

pal2 <- brewer.pal(8, "Dark2")
wordcloud(corpus, scale=c(6,.5), min.freq=2, max.words=300, random.order=TRUE,
          rot.per=0.5, colors=pal2, use.r.layout=FALSE)

Tokenizing the Corpus

Next the data is tokenized into n-grams of length 2 to 5 using the tokenizers library. Each set of n-grams is converted into a table sorted by frequency, which shows us the most common n-grams; the two-gram and three-gram code is shown below.

library(tokenizers)

# Two-grams
twograms <- tokenize_ngrams(text, lowercase=TRUE, n=2L, n_min=2L, simplify=TRUE)
twograms <- unlist(twograms)
twogramfreq <- as.data.frame(table(twograms))
twogramfreq <- twogramfreq[order(-twogramfreq$Freq),]

# Three-grams
threegrams <- tokenize_ngrams(text, lowercase=TRUE, n=3L, n_min=3L, simplify=TRUE)
threegrams <- unlist(threegrams)
threegramfreq <- as.data.frame(table(threegrams))
threegramfreq <- threegramfreq[order(-threegramfreq$Freq),]

This allows us to view the most common two-grams and three-grams:

# Start at row 2 to skip the invalid "a a" two-gram (see note below)
barplot(twogramfreq[2:26,2], names.arg=twogramfreq[2:26,1], col = "blue",
        main="Twograms (Top 25)", las=2, ylab = "Frequency")

barplot(threegramfreq[1:25,2], names.arg=threegramfreq[1:25,1], col = "red", 
        main="Threegrams (Top 25)", las=2, ylab = "Frequency")

Note that the most common two-gram was “a a”, which has been excluded from the plot because it is not a valid phrase.

Plans for the Algorithm

I attempted to feed the tokenized text into a decision tree algorithm, but the amount of RAM required made this infeasible.

My plan is to create a feature matrix of n-grams, with the outcome (y) being the last word of each n-gram and the features being the preceding words. This should allow the set of possible matches to be narrowed by looping through the words of the input string and reducing the matching n-grams at each step. The top three remaining matches will then be returned.
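
As an illustration of the lookup step, here is a minimal sketch that uses the three-gram frequency table built above; the helper predict_next is hypothetical, and a simple prefix match stands in for the planned feature matrix.

# Sketch only: prefix lookup on the three-gram table as a stand-in for the feature matrix
predict_next <- function(input, top = 3) {
  words <- unlist(tokenize_words(input, lowercase = TRUE))
  prefix <- paste(tail(words, 2), collapse = " ")   # last two words of the input
  grams <- as.character(threegramfreq$threegrams)
  hits <- grams[startsWith(grams, paste0(prefix, " "))]
  # threegramfreq is already sorted by frequency, so the first hits are the best
  head(substring(hits, nchar(prefix) + 2), top)
}

predict_next("thanks for the")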

Some issues which will need to be addressed to accomplish this:

  1. The corpus contains many tokens that are not valid English words, as well as proper names. Ideally the misspellings and non-words would be filtered out; I am not yet sure how to handle names of people or places.
  2. Much of the text is not grammatically correct. Ideally such text would be filtered out, although time and processing power may make that difficult for this project.
  3. I am not yet sure how to handle input that does not match an existing n-gram; one option, sketched below, is to back off to shorter n-grams.
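
One common option (shown here only as a sketch reusing the frequency tables built above) is a simple back-off: try the three-gram table, fall back to the two-gram table, and finally to the most frequent single words.

# Sketch only: back off from three-grams to two-grams to single-word frequencies
predict_backoff <- function(input, top = 3) {
  words <- unlist(tokenize_words(input, lowercase = TRUE))
  for (k in 2:1) {
    if (length(words) < k) next
    tab <- if (k == 2) threegramfreq else twogramfreq
    prefix <- paste(tail(words, k), collapse = " ")
    grams <- as.character(tab[[1]])
    hits <- grams[startsWith(grams, paste0(prefix, " "))]
    if (length(hits) > 0) return(head(substring(hits, nchar(prefix) + 2), top))
  }
  # No n-gram match at all: fall back to the most frequent words overall
  names(head(frequencycount, top))
}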