In this report, an exploratory analysis of the three English-language datasets, namely en_US.blogs.txt, en_US.twitter.txt and en_US.news.txt, is performed. The analysis includes basic summaries of the three datasets and frequency counts in the form of a wordcloud and a histogram. Packages used in this report are tm, wordcloud, RColorBrewer, NLP, and SnowballC (required by tm).
For the purpose of reproducibility, all code used to produce this report is presented. The data have been downloaded to the paths below and are loaded using both the ‘scan’ and ‘readLines’ functions.
#store each dataset as a character vector of words (scan splits on whitespace)
blog.char <- scan("/Users/kevins/swiftkey/final/en_US/en_US.blogs.txt", what="character")
tweet.char <- scan("/Users/kevins/swiftkey/final/en_US/en_US.twitter.txt", what="character")
news.char <- scan("/Users/kevins/swiftkey/final/en_US/en_US.news.txt", what="character")
#store each dataset as a character vector of lines, closing each connection after reading
blog.con <- file("/Users/kevins/swiftkey/final/en_US/en_US.blogs.txt", "r")
blog.lines <- readLines(blog.con)
close(blog.con)
tweet.con <- file("/Users/kevins/swiftkey/final/en_US/en_US.twitter.txt", "r")
tweet.lines <- readLines(tweet.con)
close(tweet.con)
news.con <- file("/Users/kevins/swiftkey/final/en_US/en_US.news.txt", "r")
news.lines <- readLines(news.con)
close(news.con)
Basic summary statistics for the three datasets are combined into a summary table and reported below.
blog.summary <- c(length(blog.char), length(blog.lines))
tweet.summary <- c(length(tweet.char), length(tweet.lines))
news.summary <- c(length(news.char), length(news.lines))
summaries <- rbind(blog.summary, tweet.summary, news.summary)
colnames(summaries) <- c("nwords", "nlines")
summaries
## nwords nlines
## blog.summary 35314175 899288
## tweet.summary 9141409 2360148
## news.summary 29313276 1010242
Moving forward, because the datasets are large, only a random sample of 10,000 lines from each dataset will be used.
rm(blog.char, tweet.char, news.char)
set.seed(123) #fix the random seed so the sampling is reproducible (seed value chosen arbitrarily)
blog.lines <- sample(blog.lines, 10000)
tweet.lines <- sample(tweet.lines, 10000)
news.lines <- sample(news.lines, 10000)
combined.data <- c(blog.lines, tweet.lines, news.lines)
One way to illustrate the basic features of a text corpus is by constructing a wordcloud. The following preprocessing is performed before plotting the wordcloud:
library(NLP)
library(tm)
library(RColorBrewer)
library(wordcloud)
#processing to remove punctuation, lowercase the text, remove stopwords, and perform stemming
#(stopwords are removed before stemming so that they still match the stopword list)
combined.data <- removePunctuation(combined.data)
combined.data <- tolower(combined.data)
combined.data <- removeWords(combined.data, words=stopwords("english"))
combined.data <- stemDocument(combined.data)
combined.data <- stripWhitespace(combined.data)
wordcloud(combined.data, max.words=100, random.order=FALSE, rot.per=0.5, colors=brewer.pal(8, 'Accent'))
For the purpose of prediction, a histogram of bigram frequencies is more useful than a wordcloud, as it gives quantitative information on the frequencies. The ngrams function from the ‘NLP’ package is used to tokenize the corpus into bigrams.
#Further processing before tokenization: collapse the corpus into one string, split into words, drop empty strings
combined.data <- paste(combined.data, collapse=" ")
combined.data <- strsplit(combined.data, " ", fixed=TRUE)[[1L]]
combined.data <- combined.data[combined.data != ""]
#Tokenizing into bigrams
bigrams <- vapply(ngrams(combined.data, 2L), paste, "", collapse=" ")
#Get the 5 most frequent bigrams
top5 <- sort(table(bigrams), decreasing=TRUE)[1:5]
barplot(top5)
A basic text prediction model based on N-grams will be employed. An N-gram model predicts the last word of a sequence of N words given the previous (N-1) words. Given the limitations of hardware and time, a trigram model will likely be chosen, computing the relative frequency of each observed three-word sequence with respect to its two-word prefix. In other words, the plan is to build a table of trigram probabilities and predict the word with the highest probability given the preceding two words, as sketched below.
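To make this concrete, the sketch below builds a trigram count table from the sampled word vector prepared above and returns the most frequent completion of a given two-word prefix. This is only an illustration of the plan, not the final model: the helper name predict_next and the example prefix "one of" are hypothetical, and the real model would divide these counts by the prefix counts to obtain the relative frequencies described above.
#A minimal sketch of the planned trigram lookup (illustrative only, not the final model).
#Assumes 'combined.data' is the word vector produced above and the NLP package is loaded.
trigrams <- vapply(ngrams(combined.data, 3L), paste, "", collapse=" ")
trigram.counts <- table(trigrams)
#Split each trigram into its two-word prefix and its final word
parts  <- strsplit(names(trigram.counts), " ", fixed=TRUE)
prefix <- vapply(parts, function(x) paste(x[1:2], collapse=" "), "")
last   <- vapply(parts, function(x) x[3], "")
#Return the most frequent completion of a given two-word prefix (NA if the prefix was never seen)
predict_next <- function(two.words) {
  idx <- prefix == two.words
  if (!any(idx)) return(NA_character_)
  last[idx][which.max(trigram.counts[idx])]
}
predict_next("one of") #hypothetical example prefix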
THANK YOU FOR READING THIS FAR. ALL THE BEST FOR THE PROJECT!!