Synopsis

The task of the capstone project is to understand and build predictive text models that help users type sentences faster on keyboards; a typical example is the SwiftKey keyboard on mobile devices. The project starts with the basics: analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It covers cleaning and analyzing text data, then building and sampling from a predictive text model. The candidate libraries and frameworks were analysed, and tm was chosen for text mining and RWeka for tokenizing and creating n-grams. The tokenizer function in RWeka gave unexpected results with tm version 0.7, so tm was downgraded to version 0.6 to continue the project. For profanity filtering, the bad word collection from Luis von Ahn's research group at Carnegie Mellon University was used.

Dataset

The dataset I am using is provided by the Coursera course and was created by the folks at SwiftKey. The dataset comes in 4 languages, and each language contains text documents from three different sources: blogs, news articles, and Twitter posts.

Let's try to quantify the raw English dataset.

Raw Data (word and character maxima and averages are per line)
TextDocument      Lines      Words   MaxWords   AvgWords   Characters   MaxChars    AvgChars
Twitter         2360148   30373545         47   12.86934    164744972        213    68.80281
News             899288   37334131       6630   41.51521    209260725      40835   231.69601
Blogs             77259    2643969       1031   34.22215     15761023       5760   203.00243
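
The counting code itself is not shown in this report; a minimal sketch, assuming each file has already been read into a character vector with readLines as shown in the next section, could look like this:

# sketch: per-line word and character statistics for one source
line_stats <- function(lines) {
  words <- sapply(strsplit(lines, "\\s+"), length)   # words per line
  chars <- nchar(lines)                              # characters per line
  c(Lines = length(lines), Words = sum(words),
    MaxWords = max(words), AvgWords = mean(words),
    Characters = sum(chars), MaxChars = max(chars), AvgChars = mean(chars))
}
line_stats(US_Twitter)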

Loading Data

Data was downloaded from the Coursera course page using the URL https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset was then loaded into R with the file and readLines functions.
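
The download and extraction step itself is not shown; a minimal sketch using base R's download.file and unzip might look like the following (the destination folder is an assumption, and the extracted files may need to be arranged to match the SwiftKey/en_US/ paths used below):

# download and unzip the dataset (only needed once)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip", exdir = "SwiftKey")

With the files in place, each source is read in line by line: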

# open a connection to the raw Twitter file and read it line by line
con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
close(con1)

The main concern in text mining is resources: it requires a lot of computational power and, in particular, memory. Even though the dataset I have is a couple of million records long, I could only process around 50,000 records in memory at a time. So I decided to take a random subsample of the data and continue the project with that fraction.

# take a random subsample of 50,000 lines from each source
US_Twitter_Sample <- sample(US_Twitter, 50000)
US_News_Sample <- sample(US_News, 50000)
US_Blogs_Sample <- sample(US_Blogs, 50000)

Now the statistics for the samples are as follows:

Sample Data
TextDocument    Lines      Words   MaxWords   AvgWords   Characters   MaxChars    AvgChars
Twitter         50000     643612         36   12.87224      3493142        156    68.86286
News            50000    2088704        748   41.77408     11702829       4068   233.05660
Blogs           50000    1707139        363   34.14278     10183157       2000   202.66316

Finally, the sampled text documents were converted to tm corpus objects.

library(tm)
twitter <- Corpus(VectorSource(US_Twitter_Sample))
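
The News and Blogs samples were presumably converted in the same way (this step is not shown in the original report):

news <- Corpus(VectorSource(US_News_Sample))
blogs <- Corpus(VectorSource(US_Blogs_Sample))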

Preprocessing

There were a few concerns to be addressed in the gathered dataset. The following steps show how they were handled.

# punctuation and numbers were removed, as there is no need to predict punctuation or numbers
twitter.cleaned <- tm_map(twitter, removePunctuation)
twitter.cleaned <- tm_map(twitter.cleaned, removeNumbers)

# profanity filtering: remove every word that appears in the bad word list
twitter.cleaned <- tm_map(twitter.cleaned, removeWords, readLines("bad-words.txt"))

After cleaning and profanity filtering, the dataset statistics are as follows:

Cleaned Data
TextDocument    Lines      Words   MaxWords   AvgWords   Characters   MaxChars    AvgChars
Twitter         50000     643612         36   12.76882      3301348        143    65.02698
News            50000    2088704        748   41.73222     11318324       4018   225.36650
Blogs           50000    1707139        363   34.07872      9731653       1934   193.63308

Tokenization and N-Gram Modelling

Next I created a basic 1-gram model to inspect the most frequently used words in the dataset.
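
The unigram counting code is not shown in this report; a minimal sketch, assuming the cleaned Twitter corpus built above, is:

# 1-gram (single word) frequencies; keep short words such as "a" and "to"
tdm.unigram <- TermDocumentMatrix(twitter.cleaned,
                                  control = list(wordLengths = c(1, Inf)))
unigram.freq <- sort(slam::row_sums(tdm.unigram), decreasing = TRUE)
head(unigram.freq, 10)   # the 10 most frequent words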

The 10 most used words in the Twitter dataset:

The 10 most used words in the Blogs dataset:

The 10 most used words in the News dataset:

As we can see, most of the top 10 words are pretty much the same across the three sources.

The words common to Twitter and Blogs are:

## [1] "the" "to"  "a"   "and" "for" "in"  "is"  "of"

The words common to Twitter and News are:

## [1] "the" "to"  "i"   "a"   "and" "in"  "is"  "of"

The words common to Blogs and News are:

## [1] "the"  "to"   "and"  "a"    "of"   "in"   "that" "is"

And the words common to all three sets are:

## [1] "the" "to"  "a"   "and" "in"  "is"  "of"

It is interesting to note that 7 of the 10 most used words are the same in all three datasets.
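
For reference, these overlaps can be computed with base R's intersect(); the vectors top10.twitter, top10.blogs and top10.news are hypothetical names for the top-10 word lists shown above:

# words common to all three (hypothetical) top-10 vectors
common.all <- Reduce(intersect, list(top10.twitter, top10.blogs, top10.news))
common.all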

Now, let's observe what the 2-grams and 3-grams in the Twitter dataset look like.

2-Gram

3-Gram
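
The 2-gram and 3-gram frequencies above were presumably built with fixed-width RWeka tokenizers along these lines (a sketch, not necessarily the exact code used):

library(RWeka)
# tokenizers that produce only 2-grams and only 3-grams
TwoGramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm.2gram <- TermDocumentMatrix(twitter.cleaned, control = list(tokenize = TwoGramTokenizer))
tdm.3gram <- TermDocumentMatrix(twitter.cleaned, control = list(tokenize = ThreeGramTokenizer))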

We can even compute frequencies for n-grams from 2 up to N words in one pass (here N = 10).

library(RWeka)
library(ggplot2)

# tokenizer that produces every n-gram from 2 up to 10 words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 10))
tdm.bigram <- TermDocumentMatrix(twitter.cleaned,
                                 control = list(tokenize = BigramTokenizer))

# the 20 most frequent n-grams in the Twitter sample
t.twitter <- as.data.frame(head(sort(slam::row_sums(tdm.bigram), decreasing = TRUE), 20))
t2.twitter <- data.frame("y" = t.twitter[, 1], "x" = row.names(t.twitter))

# bar chart of the n-gram frequencies
p <- ggplot(data = t2.twitter, aes(x = x, y = y)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 60, hjust = 1))
p

Future Plans

From the n-grams created, we can get an idea of how to predict text. The 2-grams and 3-grams can be used to predict the next word from the preceding words, and the 1-grams can be used as a fallback based on the most frequent words.
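
As a minimal sketch of this idea (not the final model), a next-word lookup over a named bigram frequency vector such as the one produced by slam::row_sums above, here assumed to be called bigram.freq, could work like this:

# sketch: predict the next word from the most frequent bigram
# that starts with the previous word
predict_next <- function(prev.word, bigram.freq) {
  candidates <- bigram.freq[grepl(paste0("^", prev.word, " "), names(bigram.freq))]
  if (length(candidates) == 0) return(NA)            # no match: fall back to the 1-gram model
  best <- names(candidates)[which.max(candidates)]   # most frequent matching bigram
  strsplit(best, " ")[[1]][2]                        # its second word is the prediction
}
# hypothetical usage: predict_next("thanks", bigram.freq)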

Other than this, more advanced techniques such as stemming and part-of-speech tagging can be utilized.
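
For example, stemming could be applied with tm's stemDocument; a sketch, assuming the SnowballC package is installed:

library(SnowballC)
# reduce words to their stems, e.g. "running" becomes "run"
twitter.stemmed <- tm_map(twitter.cleaned, stemDocument)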