It would have been of great use to me if my computer had assisted me in typing this assignment report. I will use a corpus of English text, produced in a manner similar to the way I am typing now, to build a model that assists in typing English text. The purpose of the project is to build an n-gram model that predicts the next most likely word from the previously typed words. In this report, I briefly describe the data set used to build the model and present some initial exploratory analysis.
The data is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which contains text in four languages, viz. German (DE), US English (US), Finnish (FI) and Russian (RU); the US English data is chosen for the initial exploratory analysis.
Basic statistics such as line count (-l), word count (-w), character count (-m) and byte count (-c) are obtained using system commands, as shown below.
system("wc -wlmc en_US.blogs.txt")
system("wc -wlmc en_US.news.txt")
system("wc -wlmc en_US.twitter.txt")
The following results were obtained.
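Independently of the shell output, the same statistics can also be computed inside R. The sketch below uses only base R; the counts are approximate (whitespace tokenization will not match wc exactly) and the object names are illustrative, not part of the original script.

files <- c('en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt')
stats <- t(sapply(files, function(f) {
  lines <- readLines(f, skipNul = TRUE)             # skipNul guards against embedded NULs
  c(lines = length(lines),
    words = sum(lengths(strsplit(lines, "\\s+"))),  # rough word count
    chars = sum(nchar(lines)),
    bytes = file.size(f))
}))
stats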
As the data set is quite large (each file is roughly 200 MB), I sample the data for further analysis. The sampled data is stored in corresponding sample files.
set.seed(10)  # for reproducible sampling

# Read the full data sets
blogs <- readLines('en_US.blogs.txt')
news <- readLines('en_US.news.txt')
tweets <- readLines('en_US.twitter.txt')

dir.create('sample')

# Sample 1% of the blog lines and write them out
len <- length(blogs)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledBlogs <- blogs[samples]
write(sampledBlogs, file='sample/sampledBlogs.txt')

# Sample 1% of the news lines and write them out
len <- length(news)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledNews <- news[samples]
write(sampledNews, file='sample/sampledNews.txt')

# Sample 1% of the tweets and write them out
len <- length(tweets)
samples <- sample(1:len, len*0.01, replace = FALSE)
sampledTweets <- tweets[samples]
write(sampledTweets, file='sample/sampledTweets.txt')
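For reference, the repeated sampling block above could equally be wrapped in a small helper; this is just a more compact sketch of the same step (the name sampleLines is mine, not part of the original script), not an additional step to run.

sampleLines <- function(lines, outFile, fraction = 0.01) {
  # Draw a random fraction of the lines and write them to the sample file
  idx <- sample(seq_along(lines), length(lines) * fraction, replace = FALSE)
  write(lines[idx], file = outFile)
  invisible(idx)
}
sampleLines(blogs,  'sample/sampledBlogs.txt')
sampleLines(news,   'sample/sampledNews.txt')
sampleLines(tweets, 'sample/sampledTweets.txt')

Wrapping the logic in a function avoids the copy-and-paste edits that can easily introduce mistakes if more files are added later.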
Using R's text-mining package 'tm', the sampled data is loaded into a corpus.
library('tm')
## Loading required package: NLP
corpus <- VCorpus(DirSource('/personal_workspace/coursera/final/en_US/sample/'))
A summary of the corpus is shown below:
summary(corpus)
## Length Class Mode
## sampledBlogs.txt 2 PlainTextDocument list
## sampledNews.txt 2 PlainTextDocument list
## sampledTweets.txt 2 PlainTextDocument list
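The summary only lists the documents; to sanity-check what was actually loaded, the corpus contents can be inspected as well (this step is not part of the original report and is shown for illustration only).

meta(corpus[[1]])                # metadata of the first document (sampled blogs)
head(content(corpus[[1]]), 3)    # first three sampled blog lines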
After sampling, a few transformations are applied to the corpus: converting letters to lower case, stripping extra whitespace, and removing numbers and punctuation. Stop-word removal is a further option, shown after the code below.
# Chain the transformations so each step is applied to the already-cleaned corpus
cleanedCorpus <- tm_map(corpus, content_transformer(tolower))
cleanedCorpus <- tm_map(cleanedCorpus, stripWhitespace)
cleanedCorpus <- tm_map(cleanedCorpus, removeNumbers)
cleanedCorpus <- tm_map(cleanedCorpus, removePunctuation)
corpus <- cleanedCorpus
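If stop-word removal were wanted, tm's built-in English stop-word list could be applied in the same way; note, however, that for next-word prediction stop words are usually kept, so this variant is not used below.

noStopCorpus <- tm_map(cleanedCorpus, removeWords, stopwords("english"))   # optional variant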
The English text is now tokenized into atomic terms, i.e. English words, which will later be used to build a prediction model with machine-learning algorithms.
dtm <- DocumentTermMatrix(corpus, control = list(minWordLength = 2))
tdm <- TermDocumentMatrix(corpus, control = list(minWordLength = 2))
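As a quick sanity check on the matrices, tm's findFreqTerms() can list terms above a frequency threshold (the threshold of 100 here is arbitrary and chosen only for illustration).

findFreqTerms(dtm, lowfreq = 100)   # terms occurring at least 100 times in the sample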
The number of unique terms extracted from the sampled text (the number of columns of the document-term matrix) is:
dim(dtm)[2]
## [1] 63953
# Term frequencies, sorted in decreasing order
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
top10 <- as.numeric(freq[1:10])

# Plot the ten most frequent terms
plot(1:10, top10, type = 'h', lwd = 4, xaxt = "n", ylab = 'Term Frequency', xlab = "Term")
axis(1, at=1:10, labels=names(freq[1:10]), col.axis="red", las=2)
The English text corpora were analysed to find the most frequently occurring terms. Tokenization was done here for unigrams only; for the predictive algorithm, higher-order n-grams will be used, and the three text sources will be analysed separately.
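As a preview of that step, higher-order n-grams can be extracted with the same tm machinery by supplying a custom tokenizer. The sketch below builds a bigram term-document matrix using the ngrams() and words() helpers from the NLP package (which tm loads); the tokenizer name and the frequency threshold are illustrative choices, not part of this report's code.

# Bigram tokenizer: pairs of consecutive words joined by a space
BigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

bigramTdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
findFreqTerms(bigramTdm, lowfreq = 50)   # most common word pairs in the sample

The same pattern with ngrams(words(x), 3) would give trigrams, letting the prediction model look several words back.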