The task of this capstone project is to understand and build predictive text models that help users type sentences faster on keypads; a typical example is the SwiftKey keyboard on mobile devices. The project starts with the basics: analyzing a large corpus of text documents to discover the structure in the data and how words are put together. It covers cleaning and analyzing text data, then building and sampling from a predictive text model. After evaluating candidate libraries and frameworks, tm was chosen for text mining and RWeka for tokenizing and creating n-grams. The tokenizer function in RWeka gave unexpected results with tm version 0.7, so tm was downgraded to version 0.6 to continue the project. For profanity filtering, the bad-word collection from Luis von Ahn's research group at Carnegie Mellon University was used.
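A minimal sketch of this package setup is shown below; the exact 0.6.x release pinned and the use of remotes::install_version are assumptions, the report only states that tm was downgraded to 0.6.

# Pin tm to a 0.6.x release because the RWeka tokenizer misbehaved with tm 0.7
# install.packages("remotes")
remotes::install_version("tm", version = "0.6-2")
install.packages("RWeka")
library(tm)
library(RWeka)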
The dataset I am using is provided by the Coursera course and was created by the folks at SwiftKey. It covers four languages, and each language contains text documents from three sources: Twitter, news articles, and blogs.
Let's try to quantify the raw English dataset.
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 2360148 | 30373545 | 47 | 12.86934 | 164744972 | 213 | 68.80281 |
| News | 899288 | 37334131 | 6630 | 41.51521 | 209260725 | 40835 | 231.69601 |
| Blogs | 77259 | 2643969 | 1031 | 34.22215 | 15761023 | 5760 | 203.00243 |
The data was downloaded from the Coursera course page using the URL https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset was then loaded into R using the file and readLines functions.
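For reproducibility, the download and extraction can be scripted roughly as follows; this is a sketch, and the local file and folder names are assumptions chosen to match the paths used below.

url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
unzip("Coursera-SwiftKey.zip", exdir = ".")
# The archive extracts to a folder named "final"; renaming it to "SwiftKey" (assumed) matches the read paths below
file.rename("final", "SwiftKey")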
# Open a connection and read the Twitter file line by line
con1 <- file("SwiftKey/en_US/en_US.twitter.txt", "r")
US_Twitter <- readLines(con1)
close(con1)
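The News and Blogs files were read the same way; this is a sketch, assuming the standard file names in the SwiftKey dataset.

con2 <- file("SwiftKey/en_US/en_US.news.txt", "r")
US_News <- readLines(con2)
close(con2)
con3 <- file("SwiftKey/en_US/en_US.blogs.txt", "r")
US_Blogs <- readLines(con3)
close(con3)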
The main concern in text mining is resources: it requires a lot of computational power and, in particular, memory. Even though the dataset I have is a couple of million records long, I could only load around 50000 records into memory in one go. So I decided to randomly subsample the data and continue the project with a fraction of it.
US_Twitter_Sample <- sample(US_Twitter, 50000)
US_News_Sample <- sample(US_News, 50000)
US_Blogs_Sample <- sample(US_Blogs, 50000)
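The line, word, and character statistics reported in the tables were computed along the following lines; this is a sketch, and summarise_text is a hypothetical helper, not code taken from the original analysis.

# Summarise a character vector of text lines: line, word, and character counts
summarise_text <- function(lines) {
  words <- sapply(strsplit(lines, "\\s+"), length)
  chars <- nchar(lines)
  data.frame(Lines = length(lines),
             Words = sum(words), MaxWords = max(words), AvgWords = mean(words),
             Characters = sum(chars), MaxChars = max(chars), AvgChars = mean(chars))
}
summarise_text(US_Twitter_Sample)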
Now the statistics of the sampled data are as follows:
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 50000 | 643612 | 36 | 12.87224 | 3493142 | 156 | 68.86286 |
| News | 50000 | 2088704 | 748 | 41.77408 | 11702829 | 4068 | 233.05660 |
| Blogs | 50000 | 1707139 | 363 | 34.14278 | 10183157 | 2000 | 202.66316 |
And finally, the sampled text documents were converted into corpus objects.
library(tm)
twitter <- Corpus(VectorSource(US_Twitter_Sample))
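The News and Blogs samples can be wrapped in corpora the same way; the variable names here are assumptions, since only the Twitter corpus is shown in the original code.

news  <- Corpus(VectorSource(US_News_Sample))
blogs <- Corpus(VectorSource(US_Blogs_Sample))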
There were a few concerns to be addressed in the gathered dataset. The following steps show how they were handled.
# Punctuation and numbers were removed, as there is no need to predict punctuation or numbers
twitter.cleaned <- tm_map(twitter, removePunctuation)
twitter.cleaned <- tm_map(twitter.cleaned, removeNumbers)
# Profanity filtering: remove words found in the CMU bad-word list
twitter.cleaned <- tm_map(twitter.cleaned, removeWords, readLines("bad-words.txt"))
After cleaning and profanity filtering, the dataset statistics are as follows:
| Text Document | Lines | Words | Max Words/Line | Avg Words/Line | Characters | Max Chars/Line | Avg Chars/Line |
|---|---|---|---|---|---|---|---|
| Twitter | 50000 | 643612 | 36 | 12.76882 | 3301348 | 143 | 65.02698 |
| News | 50000 | 2088704 | 748 | 41.73222 | 11318324 | 4018 | 225.36650 |
| Blogs | 50000 | 1707139 | 363 | 34.07872 | 9731653 | 1934 | 193.63308 |
Next, I created a basic 1-gram model to inspect the most frequently used words in each dataset.
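A sketch of how such a unigram frequency table can be built, mirroring the TermDocumentMatrix approach used for the n-grams later in this report (the exact code behind the plots is not shown here):

# Term-document matrix with tm's default single-word tokenizer
tdm.unigram <- TermDocumentMatrix(twitter.cleaned)
# The 10 most frequent words across the sampled Twitter corpus
head(sort(slam::row_sums(tdm.unigram), decreasing = TRUE), 10)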
The 10 most used words in the Twitter dataset:
The 10 most used words in the Blogs dataset:
The 10 most used words in the News dataset:
As we can see, the top 10 words are largely the same across the three sources.
The common words between Twitter and Blogs are:
## [1] "the" "to" "a" "and" "for" "in" "is" "of"
The common words between Twitter and News are:
## [1] "the" "to" "i" "a" "and" "in" "is" "of"
The common words between Blogs and News are:
## [1] "the" "to" "and" "a" "of" "in" "that" "is"
And the words common to all three sets are:
## [1] "the" "to" "a" "and" "in" "is" "of"
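These overlaps are simple set intersections of the per-source top-10 word vectors; in this sketch, top10.twitter, top10.blogs, and top10.news are assumed to hold the words plotted above.

intersect(top10.twitter, top10.blogs)   # words shared by Twitter and Blogs
intersect(top10.twitter, top10.news)    # words shared by Twitter and News
intersect(top10.blogs, top10.news)      # words shared by Blogs and News
Reduce(intersect, list(top10.twitter, top10.blogs, top10.news))  # common to all three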
It is interesting to note that 7 out of the 10 most used words are the same in all three datasets.
Now, let's observe what the 2-grams and 3-grams in the Twitter dataset look like.
2-Gram
3-Gram
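The 2-gram and 3-gram plots above can be produced with RWeka tokenizers of the following form; this is a sketch, differing from the more general tokenizer shown next only in the fixed min/max n-gram lengths.

library(RWeka)
# Fixed-length tokenizers for 2-grams and 3-grams
TwoGramTokenizer   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
ThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))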
We can even build tokenizers for a whole range of n-grams, for example from 2-grams up to 10-grams:
library(ggplot2)

# Tokenizer producing every n-gram from 2 up to 10 words long
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 10))
tdm.bigram <- TermDocumentMatrix(twitter.cleaned,
                                 control = list(tokenize = BigramTokenizer))

# Take the 20 most frequent n-grams and plot their counts
t.twitter <- as.data.frame(head(sort(slam::row_sums(tdm.bigram), decreasing = TRUE), 20))
t2.twitter <- data.frame("y" = t.twitter[, 1], "x" = row.names(t.twitter))
p <- ggplot(data = t2.twitter, aes(x = x, y = y)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
p
From the n-grams created, we can get an idea of how to predict text: the 2-grams and 3-grams can be used to predict the next word given the preceding words, and the 1-grams can be used to suggest the most frequent words.
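As a very rough illustration of this idea (not the final model), a bigram frequency table can be used to look up the most frequent follower of a given word; predict_next is a hypothetical helper, and bigram.freq is assumed to be a named count vector like the one computed above.

# bigram.freq: named vector of counts, with names like "thanks for"
predict_next <- function(word, bigram.freq) {
  # keep only bigrams whose first word matches the input
  candidates <- bigram.freq[grepl(paste0("^", word, " "), names(bigram.freq))]
  if (length(candidates) == 0) return(NA_character_)
  # return the second word of the most frequent matching bigram
  strsplit(names(which.max(candidates)), " ")[[1]][2]
}
predict_next("thanks", bigram.freq)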
Beyond this, more advanced techniques such as stemming and tagging could also be utilized.
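For example, stemming could be applied with tm's stemDocument transformation; this is shown as a possible extension, not a step already applied in this report, and it requires the SnowballC package.

library(SnowballC)
twitter.stemmed <- tm_map(twitter.cleaned, stemDocument)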