Introduction

It would be of great use to me if my computer assisted me in typing this assignment report. I will use a corpus of English text, produced in much the same way as I am typing now, to build a model that assists in typing English text. The purpose of the project is to build an n-gram model that predicts the next most likely word from the previously typed words. In this report, I briefly describe the data set used to build the model and some initial exploratory analysis.
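
As an illustration of the idea (not the final model), the short sketch below picks the most likely next word from a made-up table of bigram counts; the words, the counts and the helper name predictNext are hypothetical and purely for demonstration.

# Illustrative only: hypothetical bigram counts of the form "previous next".
bigramCounts <- c("of the" = 120, "of a" = 45, "of course" = 30,
                  "in the" = 150, "in a" = 60)

# Return the word most frequently seen after 'previous' in the counts above.
predictNext <- function(previous, counts) {
  candidates <- counts[startsWith(names(counts), paste0(previous, " "))]
  if (length(candidates) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(which.max(candidates)))
}

predictNext("of", bigramCounts)   # returns "the"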

Data Acquisition

Data is downloaded from the location - https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which contains data in four languages, viz. de_DE, en_US, fi_FI and ru_RU, among which en_US is chosen for the initial exploratory analysis.

Basic statistics such as the line count (l), word count (w), character count (m) and byte count (c) are found using system commands as below.

system("wc -wlmc en_US.blogs.txt")
system("wc -wlmc en_US.news.txt")
system("wc -wlmc en_US.twitter.txt")

The following results are obtained:

899288 37334690 210160014 en_US.blogs.txt
1010242 34372720 205811889 en_US.news.txt
2360148 30374206 167105338 en_US.twitter.txt
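
If the wc utility is not available (for example on Windows), comparable figures can be computed directly in R. The sketch below is an approximation, with word counts obtained by splitting each line on whitespace; the helper name countsFor is just for illustration.

# Approximate line, word and character counts computed in R.
countsFor <- function(path) {
  lines <- readLines(path, warn = FALSE)
  c(lines = length(lines),
    words = sum(lengths(strsplit(lines, "\\s+"))),
    chars = sum(nchar(lines)))
}
countsFor('en_US.blogs.txt')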

As the data size is quite large (approximately 200 MB per file), I sample 1% of the lines from each file for further analysis. The sampled data is stored in respective sampled files.

set.seed(10)
blogs <- readLines('en_US.blogs.txt')
news <- readLines('en_US.news.txt')
tweets <- readLines('en_US.twitter.txt')

dir.create('sample')
len <- length(blogs)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledBlogs <- blogs[samples]
write(sampledBlogs, file='sample/sampledBlogs.txt')

len <- length(news)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledNews <- news[samples]
write(sampledNews, file='sample/sampledNews.txt')


len <- length(tweets)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledTweets <- tweets[samples]
write(sampledTweets, file='sample/sampledTweets.txt')
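
The three sampling blocks above repeat the same steps, so they could be collapsed into a small helper; the function name sampleFile below is just an illustration of that refactoring.

# Read a file, keep a random fraction of its lines and write them to the sample directory.
sampleFile <- function(inFile, outFile, fraction = 0.01) {
  lines <- readLines(inFile, warn = FALSE)
  keep <- sample(seq_along(lines), length(lines) * fraction, replace = FALSE)
  write(lines[keep], file = file.path('sample', outFile))
}

sampleFile('en_US.blogs.txt',   'sampledBlogs.txt')
sampleFile('en_US.news.txt',    'sampledNews.txt')
sampleFile('en_US.twitter.txt', 'sampledTweets.txt')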

Using R’s text mining library ‘tm’, the sampled data is loaded into a corpus.

library('tm')
## Loading required package: NLP
corpus <- VCorpus(DirSource('/personal_workspace/coursera/final/en_US/sample/'))

A summary of the corpus is given below:

summary(corpus)
##                   Length Class             Mode
## sampledBlogs.txt  2      PlainTextDocument list
## sampledNews.txt   2      PlainTextDocument list
## sampledTweets.txt 2      PlainTextDocument list
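
To check that the documents were loaded as expected, individual documents can be inspected by name; for example, the first few lines of the sampled blogs (a quick sketch, assuming the document ids match the sampled file names):

# Peek at the first few lines of one document in the corpus.
head(content(corpus[["sampledBlogs.txt"]]), 3)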

Cleaning Data

After sampling the data, a couple of transformations are applied to the corpus, such as changing letters to lower case, stripping extra whitespace, and removing numbers and punctuation; stop words could also be removed, as sketched after the code below.

# Chain the transformations so that each step builds on the previous result.
cleanedCorpus <- tm_map(corpus, content_transformer(tolower))
cleanedCorpus <- tm_map(cleanedCorpus, stripWhitespace)
cleanedCorpus <- tm_map(cleanedCorpus, removeNumbers)
cleanedCorpus <- tm_map(cleanedCorpus, removePunctuation)
corpus <- cleanedCorpus
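
Stop words (very common words such as “the” and “and”) can additionally be removed using tm’s built-in English stop word list. Whether to do so is a modelling choice, since stop words are often exactly the words a next-word predictor needs; the line below is therefore shown as an option rather than applied to the corpus used in the rest of this report.

# Optional: remove common English stop words (not applied to the corpus used below).
corpusNoStopwords <- tm_map(corpus, removeWords, stopwords("english"))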

Tokenization

The English text is now broken into atomic terms, i.e. English words, which will later be used to build a prediction model using machine learning algorithms.

# Keep terms that are at least two characters long.
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf)))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(2, Inf)))

The number of unique terms extracted from the sampled text is found to be:

dim(dtm)[2]
## [1] 63953

Most Frequent Terms

freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
top10 <- as.numeric(freq[1:10])
plot(seq(1:10), top10, type = 'h', lwd = 4, xaxt = "n", ylab = 'Term Frequency', xlab = "Term")
axis(1, at=seq(1:10),labels=names(freq[1:10]), col.axis="red", las=2)
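
Converting the document-term matrix to a dense matrix, as above, can become memory-hungry for larger samples; tm’s findFreqTerms() lists frequent terms directly from the sparse representation. The threshold of 500 below is an arbitrary choice for illustration.

# Terms that appear at least 500 times across the sampled documents.
findFreqTerms(dtm, lowfreq = 500)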

Conclusion and Project Plan

The English text corpora were analysed to find the most frequently occurring terms. Tokenization was done for unigrams; for developing the predictive algorithm, n-grams will be used and the three text sources will be analysed separately. A sketch of bigram tokenization is given below.
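
As a sketch of that next step, a bigram term-document matrix can be built with a custom tokenizer following the pattern from the tm FAQ (using the NLP package, which is already loaded as a dependency of tm); changing 2 to 3 gives trigrams. The frequency threshold of 100 is arbitrary.

# Bigram tokenizer; change 2 to 3 for trigrams.
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

bigramTdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(bigramTdm, lowfreq = 100)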