The task of this capstone project is to understand and build predictive text models that help users type sentences faster on keypads; a typical example is the SwiftKey keyboard on mobile devices. The project starts with the basics: analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It covers cleaning and analyzing text data, then building and sampling from a predictive text model. After evaluating candidate libraries and frameworks, tm was chosen for text mining and RWeka for tokenizing and creating n-grams. The tokenizer function in RWeka gave unexpected results with tm version 0.7, so tm was downgraded to version 0.6 to continue the project. For profanity filtering, the bad-word collection from Luis von Ahn's research group at Carnegie Mellon University was used.
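A minimal sketch of this package setup is shown below; the exact 0.6.x release pinned and the use of remotes::install_version are assumptions, the report only states that tm was downgraded to 0.6.

# Pin tm to a 0.6.x release because the RWeka tokenizer misbehaved with tm 0.7
# install.packages("remotes")
remotes::install_version("tm", version = "0.6-2")
install.packages("RWeka")
library(tm)
library(RWeka)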
The dataset I am using is provided by the Coursera course and was created by the folks at SwiftKey. It covers four languages, and each language contains text documents from three sources: Twitter, news articles, and blogs.
Let's try to quantify the raw English dataset.
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 2360148 | 30373545 | 47 | 12.86934 | 164744972 | 213 | 68.80281 |
| News | 899288 | 37334131 | 6630 | 41.51521 | 209260725 | 40835 | 231.69601 |
| Blogs | 77259 | 2643969 | 1031 | 34.22215 | 15761023 | 5760 | 203.00243 |
The data was downloaded from the Coursera course page using the URL https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset was then loaded into R using the file and readLines functions.
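For reproducibility, the download and extraction can be scripted roughly as follows; this is a sketch, and the local file and folder names are assumptions chosen to match the paths used below.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip", exdir = ".")
# The archive extracts to a folder named "final"; renaming it to "SwiftKey" (assumed) matches the read paths below
file.rename("final", "SwiftKey")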
# Open a connection and read the Twitter file line by line
con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
close(con1)
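The News and Blogs files were read the same way; this is a sketch, assuming the standard file names in the SwiftKey dataset.

con2 <- file("SwiftKey/en_US/en_US.news.txt", "r")
US_News <- readLines(con2)
close(con2)
con3 <- file("SwiftKey/en_US/en_US.blogs.txt", "r")
US_Blogs <- readLines(con3)
close(con3)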
The main concern in text mining is resources: it requires a lot of computational power and, in particular, memory. Even though the dataset I have is a couple of million records long, I could only load around 50000 records into memory in one go. So I decided to randomly subsample the data and continue the project with a fraction of it.
US_Twitter_Sample <- sample(US_Twitter, 50000)
US_News_Sample <- sample(US_News, 50000)
US_Blogs_Sample <- sample(US_Blogs, 50000)
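The line, word, and character statistics reported in the tables were computed along the following lines; this is a sketch, and summarise_text is a hypothetical helper, not code taken from the original analysis.

# Summarise a character vector of text lines: line, word, and character counts
summarise_text <- function(lines) {
  words <- sapply(strsplit(lines, "\\s+"), length)
  chars <- nchar(lines)
  data.frame(Lines = length(lines),
             Words = sum(words), MaxWords = max(words), AvgWords = mean(words),
             Characters = sum(chars), MaxChars = max(chars), AvgChars = mean(chars))
}
summarise_text(US_Twitter_Sample)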
Now the statistics of the sampled data are as follows:
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 50000 | 643612 | 36 | 12.87224 | 3493142 | 156 | 68.86286 |
| News | 50000 | 2088704 | 748 | 41.77408 | 11702829 | 4068 | 233.05660 |
| Blogs | 50000 | 1707139 | 363 | 34.14278 | 10183157 | 2000 | 202.66316 |
And finally, the sampled text documents were converted into corpus objects.
library(tm)
twitter <- Corpus(VectorSource(US_Twitter_Sample))
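The News and Blogs samples can be wrapped in corpora the same way; the variable names here are assumptions, since only the Twitter corpus is shown in the original code.

news  <- Corpus(VectorSource(US_News_Sample))
blogs <- Corpus(VectorSource(US_Blogs_Sample))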
There were a few concerns to be addressed in the gathered dataset. The following steps show how they were handled.
# Punctuation and numbers were removed, as there is no need to predict punctuation or numbers
twitter.cleaned <- tm_map(twitter, removePunctuation)
twitter.cleaned <- tm_map(twitter.cleaned, removeNumbers)
# Profanity filtering: remove words found in the CMU bad-word list
twitter.cleaned <- tm_map(twitter.cleaned, removeWords, readLines("bad-words.txt"))
After cleaning and profanity filtering, the dataset statistics are as follows:
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 50000 | 643612 | 36 | 12.76882 | 3301348 | 143 | 65.02698 |
| News | 50000 | 2088704 | 748 | 41.73222 | 11318324 | 4018 | 225.36650 |
| Blogs | 50000 | 1707139 | 363 | 34.07872 | 9731653 | 1934 | 193.63308 |
Next, I created a basic 1-gram model to inspect the most frequently used words in each dataset.
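A sketch of how such a unigram frequency table can be built, mirroring the TermDocumentMatrix approach used for the n-grams later in this report (the exact code behind the plots is not shown here):

# Term-document matrix with tm's default single-word tokenizer
tdm.unigram <- TermDocumentMatrix(twitter.cleaned)
# The 10 most frequent words across the sampled Twitter corpus
head(sort(slam::row_sums(tdm.unigram), decreasing = TRUE), 10)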
The 10 most used words in the Twitter dataset:
The 10 most used words in the Blogs dataset:
The 10 most used words in the News dataset:
As we can see, the top 10 words are largely the same across the three sources.
The common words between Twitter and Blogs are:
## [1] "the" "to" "a" "and" "for" "in" "is" "of"
The common words between Twitter and News are:
## [1] "the" "to" "i" "a" "and" "in" "is" "of"
The common words between Blogs and News are:
## [1] "the" "to" "and" "a" "of" "in" "that" "is"
And the words common to all three sets are:
## [1] "the" "to" "a" "and" "in" "is" "of"
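These overlaps are simple set intersections of the per-source top-10 word vectors; in this sketch, top10.twitter, top10.blogs, and top10.news are assumed to hold the words plotted above.

intersect(top10.twitter, top10.blogs)   # words shared by Twitter and Blogs
intersect(top10.twitter, top10.news)    # words shared by Twitter and News
intersect(top10.blogs, top10.news)      # words shared by Blogs and News
Reduce(intersect, list(top10.twitter, top10.blogs, top10.news))  # common to all three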
It is interesting to note that 7 out of the 10 most used words are the same in all three datasets.
Now, let's observe what the 2-grams and 3-grams in the Twitter dataset look like.
2-Gram
3-Gram
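The 2-gram and 3-gram plots above can be produced with RWeka tokenizers of the following form; this is a sketch, differing from the more general tokenizer shown next only in the fixed min/max n-gram lengths.

library(RWeka)
# Fixed-length tokenizers for 2-grams and 3-grams
TwoGramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))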
We can even build tokenizers for a whole range of n-grams, for example from 2-grams up to 10-grams:
library(ggplot2)

# Tokenizer producing every n-gram from 2 up to 10 words long
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 10))
tdm.bigram <- TermDocumentMatrix(twitter.cleaned,
                                 control = list(tokenize = BigramTokenizer))

# Take the 20 most frequent n-grams and plot their counts
t.twitter <- as.data.frame(head(sort(slam::row_sums(tdm.bigram), decreasing = TRUE), 20))
t2.twitter <- data.frame("y" = t.twitter[, 1], "x" = row.names(t.twitter))
p <- ggplot(data = t2.twitter, aes(x = x, y = y)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
From the n-grams created, we can get an idea of how to predict text: the 2-grams and 3-grams can be used to predict the next word given the preceding words, and the 1-grams can be used to suggest the most frequent words.
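As a very rough illustration of this idea (not the final model), a bigram frequency table can be used to look up the most frequent follower of a given word; predict_next is a hypothetical helper, and bigram.freq is assumed to be a named count vector like the one computed above.

# bigram.freq: named vector of counts, with names like "thanks for"
predict_next <- function(word, bigram.freq) {
  # keep only bigrams whose first word matches the input
  candidates <- bigram.freq[grepl(paste0("^", word, " "), names(bigram.freq))]
  if (length(candidates) == 0) return(NA_character_)
  # return the second word of the most frequent matching bigram
  strsplit(names(which.max(candidates)), " ")[[1]][2]
}
predict_next("thanks", bigram.freq)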
Beyond this, more advanced techniques such as stemming and tagging could also be utilized.
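For example, stemming could be applied with tm's stemDocument transformation; this is shown as a possible extension, not a step already applied in this report, and it requires the SnowballC package.

library(SnowballC)
twitter.stemmed <- tm_map(twitter.cleaned, stemDocument)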