The goal of this project is to create a text prediction algorithm using data supplied by SwiftKey. The data consists of text from online news sources, blogs, and Twitter; this report uses the English language data.
The data is read in from the three text files.
# Read the data in
blog <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8")
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8")
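A quick look at how many lines each file contains helps explain why the data is sampled below (the exact counts depend on the files as downloaded, so they are not reproduced here):
# Number of lines read from each source
length(blog)
length(twitter)
length(news)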
Profanity is filtered out, and the blog and twitter data sets are then sampled at random (keeping roughly 25% of the lines in each) to bring the data down to a manageable size.
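The profanity filter itself is not shown in this report; a minimal sketch, assuming a hypothetical badwords vector standing in for the actual list of terms, would be:
# Hypothetical profanity filter: drop any line containing a banned term
# (badwords is a placeholder; the real word list is not reproduced here)
badwords <- c("exampleword1", "exampleword2")
pattern <- paste0("\\b(", paste(badwords, collapse="|"), ")\\b")
blog <- blog[!grepl(pattern, blog, ignore.case=TRUE)]
twitter <- twitter[!grepl(pattern, twitter, ignore.case=TRUE)]
news <- news[!grepl(pattern, news, ignore.case=TRUE)]
The sampling itself keeps each line with probability 0.25: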
set.seed(123455)
# Sample the twitter data
rfilter <- rbinom(length(twitter), size=1, prob=0.25)
filter <- rfilter == 1
twitter <- twitter[filter]
# Sample the blog data
rfilter <- rbinom(length(blog), size=1, prob=0.25)
filter <- rfilter == 1
blog <- blog[filter]
Finally, the three data sets are combined into a single corpus and non-ASCII characters are removed.
text <- c(news, twitter, blog)
# Remove non-ASCII characters
text <- iconv(text, "latin1", "ASCII", sub="")
To give an overview of the data, it is converted into a TermDocumentMatrix using the tm library.
library(tm)
corpus <- VCorpus(VectorSource(text))
tdm <- as.matrix(TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf))))
frequencycount <- rowSums(tdm)
frequencycount <- sort(frequencycount, decreasing=TRUE)
We can use this to see the most common words.
head(frequencycount, 12)
## the to and a of in for that is on with said
## 18723 8634 8510 8392 7121 6511 3380 3320 2722 2701 2387 2373
Or we can look at a word cloud using the wordcloud library:
library(wordcloud)
library(RColorBrewer)
pal2 <- brewer.pal(8,"Dark2")
wordcloud(corpus, scale=c(6,.5), min.freq=2, max.words=300, random.order=TRUE,
rot.per=0.5, colors=pal2, use.r.layout=FALSE)
Next, the data is tokenized into n-grams of length 2 to 5 using the tokenizers library. The n-grams are converted into tables which are sorted by frequency to show us the most common n-grams.
library(tokenizers)
twograms <- tokenize_ngrams(text, lowercase=TRUE, n=2L, n_min=2L, simplify=TRUE)
twograms <- unlist(twograms)
twogramfreq <- as.data.frame(table(twograms))
twogramfreq <- twogramfreq[order(-twogramfreq$Freq),]
threegrams <- tokenize_ngrams(text, lowercase=TRUE, n=3L, n_min=3L, simplify=TRUE)
threegrams <- unlist(threegrams)
threegramfreq <- as.data.frame(table(threegrams))
threegramfreq <- threegramfreq[order(-threegramfreq$Freq),]
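The four-grams and five-grams are built the same way; only the n and n_min arguments change. For example (following the same naming pattern; these tables are not plotted below):
fourgrams <- tokenize_ngrams(text, lowercase=TRUE, n=4L, n_min=4L, simplify=TRUE)
fourgrams <- unlist(fourgrams)
fourgramfreq <- as.data.frame(table(fourgrams))
fourgramfreq <- fourgramfreq[order(-fourgramfreq$Freq),]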
These frequency tables let us plot the most common two-grams and three-grams:
barplot(twogramfreq[2:26,2], names.arg=twogramfreq[2:26,1], col = "blue",
main="Twograms (Top 25)", las=2, ylab = "Frequency")
barplot(threegramfreq[1:25,2], names.arg=threegramfreq[1:25,1], col = "red",
main="Threegrams (Top 25)", las=2, ylab = "Frequency")
Note that the most common two-gram was “a a”, which was excluded from the plot because it is not a valid phrase.
I attempted to feed the tokenized text into a decision tree algorithm, but was unable to do so because of the amount of RAM required.
My plan is to create a feature matrix of n-grams in which the response (y) is the last word and the features are the preceding words. This should make it possible to narrow down the candidate matches by looping through the input string and reducing the set of matches for each word. The top three matches will then be returned.
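As a rough sketch of that lookup (the layout here is an assumption: a data frame ngramfeatures with columns word1, word2, word3 for the preceding words and y for the word to predict), the matching loop might look like this:
# Sketch only: ngramfeatures is an assumed data frame with columns
# word1, word2, word3 (preceding words) and y (the word to predict)
predict_next <- function(input, ngramfeatures, top = 3) {
  words <- tail(unlist(strsplit(tolower(input), "\\s+")), 3)
  matches <- ngramfeatures
  # Reduce the candidate rows one preceding word at a time
  for (i in seq_along(words)) {
    keep <- which(matches[[paste0("word", i)]] == words[i])
    if (length(keep) > 0) matches <- matches[keep, , drop = FALSE]
  }
  # Return the most frequent final words among the remaining candidates
  head(names(sort(table(matches$y), decreasing = TRUE)), top)
}
This assumes a three-word context; handling shorter inputs and backing off when nothing matches are left for the real implementation.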
Some issues which will need to be addressed to accomplish this: