Introduction

As part of the Data Science Specialization on Coursera, we are tasked with building a predictive text product.

We begin by taking a quick look at the data. As the data is too large to go through line by line, we focus on a (relatively small) subset of the training data. We then apply standard text cleaning techniques - converting words to lower case and removing whitespace, punctuation and numbers. Once the text is cleaned, we rely on term frequencies for our predictive text model. That is, if a user has typed two or more words, we look up the most frequent trigram beginning with the last two words typed; if the user has typed only a single word, we fall back to the most frequent bigram beginning with that word; and if the user is starting fresh, we rely on the most frequent unigram to suggest the word they are looking for.
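To make the idea concrete, here is a minimal sketch of the back-off logic. The function name, its arguments and the frequency tables (freq_unigram, freq_bigram, freq_trigram, assumed to be named numeric vectors of n-gram counts such as those built later in this report) are placeholders, not the final implementation.

# Minimal sketch of the back-off idea; freq_* are assumed to be named numeric
# vectors of n-gram counts (e.g. colSums of a document-term matrix).
predict_next_word <- function(input, freq_trigram, freq_bigram, freq_unigram) {
        words <- unlist(strsplit(tolower(input), ' '))
        n <- length(words)
        if (n >= 2) {
                # Trigrams that start with the last two words typed
                prefix <- paste(words[n - 1], words[n])
                hits <- freq_trigram[grepl(paste0('^', prefix, ' '), names(freq_trigram))]
                if (length(hits) > 0)
                        return(tail(strsplit(names(which.max(hits)), ' ')[[1]], 1))
        }
        if (n >= 1) {
                # Back off to bigrams that start with the last word typed
                hits <- freq_bigram[grepl(paste0('^', words[n], ' '), names(freq_bigram))]
                if (length(hits) > 0)
                        return(tail(strsplit(names(which.max(hits)), ' ')[[1]], 1))
        }
        # Back off to the single most frequent unigram
        names(which.max(freq_unigram))
}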

Importing key libraries and reading the data into R

First, we import key libraries that will help us conduct exploratory data analysis on the data. Next, we read the data into R; later, we build a corpus from a subset of it.

setwd("~/R_Final_Project")
options(java.parameters = "-Xmx8000m")
library(RWeka); library(tm); library(NLP);
## Loading required package: NLP
library(magrittr); library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
con_blogs <- file("final/en_US/en_US.blogs.txt", 'r')
con_news <- file("final/en_US/en_US.news.txt", 'r')
con_twitter <- file("final/en_US/en_US.twitter.txt", 'r')

blogs <- readLines(con_blogs)
news <- readLines(con_news)
twitter <- readLines(con_twitter)
## Warning in readLines(con_twitter): line 167155 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 268547 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 1274086 appears to contain an
## embedded nul
## Warning in readLines(con_twitter): line 1759032 appears to contain an
## embedded nul
close(con_blogs)
close(con_news)
close(con_twitter)

Let’s take a quick look at the 3 datasets.

# Blogs Dataset
print(summary(blogs))
##    Length     Class      Mode 
##    899288 character character
# News Dataset
print(summary(news))
##    Length     Class      Mode 
##   1010242 character character
# Twitter Dataset
print(summary(twitter))
##    Length     Class      Mode 
##   2360148 character character

It appears that the blogs dataset has 899,288 lines, the news dataset 1,010,242 lines and the twitter dataset 2,360,148 lines; all three are character vectors. [Sidenote: Although the twitter dataset has the most lines of the three, tweets are usually pretty short. Hence, the twitter object takes up less memory than the other two.]
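As a quick, optional check on that claim, we could compare the in-memory sizes of the three character vectors with object.size() (output not reproduced here):

# Optional sanity check: compare the in-memory size of the three character vectors
format(object.size(blogs), units = "MB")
format(object.size(news), units = "MB")
format(object.size(twitter), units = "MB")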

As the full files are too large to work with comfortably, we extract 70,000 lines at random from each file.

set.seed(1212)
subset_blogs = sample(blogs, 70000, replace = F)
subset_news = sample(news, 70000, replace = F)
subset_twitter = sample(twitter, 70000, replace = F)

Next, we conduct basic exploratory data analysis on the subsetted data.

# Count the number of space-separated tokens in each line
word_count <- function(x) {
        lengths(strsplit(x, ' '))
}

blogs_word_count = sum(sapply(subset_blogs, word_count))
news_word_count = sum(sapply(subset_news, word_count))
twitter_word_count = sum(sapply(subset_twitter, word_count))
total_words = blogs_word_count + news_word_count + twitter_word_count

The blogs subset contains roughly 2,901,670 non-unique words, the news subset roughly 2,385,643 and the twitter subset roughly 902,392. It doesn't come as a surprise that the twitter subset has the fewest non-unique words, since users are restricted to only 140 characters!

In total, we are dealing with roughly 6,189,705 non-unique words.
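If we want these figures printed with thousands separators instead of scientific notation, something along these lines would do (output not shown):

# Print the word counts in a more readable form
format(c(blogs = blogs_word_count, news = news_word_count,
         twitter = twitter_word_count, total = total_words),
       big.mark = ",", scientific = FALSE)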

Let us save the new files in another folder, and remove the files from our working environment.

writeLines(subset_blogs, con = "~/R_Final_Project/final/subset_EN/subset_blogs.txt")
writeLines(subset_news, con = "~/R_Final_Project/final/subset_EN/subset_news.txt")
writeLines(subset_twitter, con = "~/R_Final_Project/final/subset_EN/subset_twitter.txt")

rm(list=ls())

After saving these new files, we read them back in as a corpus.

docs <- VCorpus(DirSource("~/R_Final_Project/final/subset_EN/"))

Data Cleaning

After loading the corpus, we proceed to clean the data. In this phase, we convert the words in the corpus to lower case and remove punctuation, numbers and extra whitespace.

cleaned <- docs %>% tm_map(content_transformer(tolower)) %>% # Converting words to lower case
        tm_map(stripWhitespace) %>% # Remove Whitespaces
        tm_map(removePunctuation) %>% # Remove Punctuations
        tm_map(removeNumbers) # Remove Numbers

rm(docs)
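As a quick spot check (not part of the original pipeline), we could peek at the first few lines of the first cleaned document to confirm that the transformations behaved as expected:

# Peek at the first few lines of the first document in the cleaned corpus
head(content(cleaned[[1]]), 3)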

Exploratory Data Analysis

Now that we have cleaned our corpus, we can conduct some preliminary analysis on the data. We proceed to create 5 different frequency tables - one each for unigrams, bigrams, trigrams, quadgrams and quintgrams, together with their counts. Using these tables, we pick out the 10 most frequent terms of each order and plot them using ggplot2.

UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

dtm_unigram <- DocumentTermMatrix(cleaned, control = list(tokenize = UnigramTokenizer))
dtm_bigram <- DocumentTermMatrix(cleaned, control = list(tokenize = BigramTokenizer))
dtm_trigram <- DocumentTermMatrix(cleaned, control = list(tokenize = TrigramTokenizer))

After creating the frequency tables, let’s take a look at the 10 most frequent terms in each model.

word_counts_unigram <- as.data.frame(sort(colSums(as.matrix(dtm_unigram)), 
                                          decreasing=TRUE)[1:10])
colnames(word_counts_unigram) <- 'Counts'
word_counts_unigram$Unigrams <- rownames(word_counts_unigram)
p1 <- ggplot(word_counts_unigram, aes(x = Unigrams, y = Counts)) + 
        geom_bar(position = "identity", stat = "identity", alpha = .8) + 
        ggtitle("Most Frequent Unigrams")
p1

word_counts_bigram <- as.data.frame(sort(colSums(as.matrix(dtm_bigram)), 
                                         decreasing=TRUE)[1:10])
colnames(word_counts_bigram) <- 'Counts'
word_counts_bigram$Bigrams <- rownames(word_counts_bigram)
p2 <- ggplot(word_counts_bigram, aes(x = Bigrams, y = Counts)) + 
        geom_bar(position = "identity", stat = "identity", alpha = .8) + 
        ggtitle("Most Frequent Bigrams")
p2

word_counts_trigram <- as.data.frame(sort(colSums(as.matrix(dtm_trigram)), 
                                         decreasing=TRUE)[1:10])
colnames(word_counts_trigram) <- 'Counts'
word_counts_trigram$Trigrams <- rownames(word_counts_trigram)
p3 <- ggplot(word_counts_trigram, aes(x = Trigrams, y = Counts)) + 
        geom_bar(position = "identity", stat = "identity", alpha = .8) + 
        ggtitle("Most Frequent Trigrams ")
p3

rm(dtm_unigram); rm(dtm_bigram); rm(dtm_trigram)

From our preliminary analysis, we note that ‘the’ is the most frequent term in the corpus (for the unigram model), followed by the words ‘and’ and ‘that’. For bigrams, the term ‘of the’ takes the top spot, followed by ‘in the’ and ‘to the’. The top bigrams all contain the word ‘the’, so it comes as no surprise that ‘the’ ranks first in the unigram model.

For trigrams, the terms ‘one of the’ and ‘a lot of’ come in at first and second place respectively. We also note that the term ‘one of the’ contains the bigram ‘of the’, which is one of the top bigrams. Other interesting trigrams are ‘to be a’ and ‘the end of’.

We proceed to build and plot the 4-gram and 5-gram frequency tables, after removing the 1-, 2- and 3-gram document-term matrices from our environment to save memory.

QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 4, max = 4))
dtm_quadgram <- DocumentTermMatrix(cleaned, control = list(tokenize = QuadgramTokenizer))

word_counts_quadgram <- as.data.frame(sort(colSums(as.matrix(dtm_quadgram)), 
                                           decreasing=TRUE)[1:10])
colnames(word_counts_quadgram) <- 'Counts'
word_counts_quadgram$Quadgrams <- rownames(word_counts_quadgram)
p4 <- ggplot(word_counts_quadgram, aes(x = Quadgrams, y = Counts)) + 
        geom_bar(position = "identity", stat = "identity", alpha = .8) + 
        ggtitle("Most Frequent Quadgrams ")
p4

rm(dtm_quadgram)

What can we gather from the quadgrams? It does appear that the quadgrams ‘the end of the’ and ‘at the end of’ occur very frequently in the corpora. Beyond that, nothing really stands out among the most frequent quadgrams. I use these terms pretty frequently myself…

QuintgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 5, max = 5))
dtm_quintgram <- DocumentTermMatrix(cleaned, control = list(tokenize = QuintgramTokenizer))

word_counts_quintgram <- as.data.frame(sort(colSums(as.matrix(dtm_quintgram)), 
                                            decreasing=TRUE)[1:10])
colnames(word_counts_quintgram) <- 'Counts'
word_counts_quintgram$Quintgrams <- rownames(word_counts_quintgram)
p5 <- ggplot(word_counts_quintgram, aes(x = Quintgrams, y = Counts)) + 
        geom_bar(position = "identity", stat = "identity", alpha = .8) + 
        ggtitle("Most Frequent Quintgrams")
p5

rm(dtm_quintgram)

What about the quintgrams? It does appear that the quintgrams ‘at the end of the’ and ‘by the end of the’ occur very frequently in the corpora. Given these prefixes, the most likely word to occur next would be ‘day’.
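One rough way to check this, assuming we had kept the full quintgram count vector rather than only the top 10 (call it quintgram_counts, i.e. colSums(as.matrix(dtm_quintgram)) before subsetting; the name is hypothetical), would be to look at the quintgrams that extend the four-word prefix ‘the end of the’:

# Hypothetical check: quintgram_counts holds all quintgram counts, not just the top 10.
# The fifth word of the top hits is the candidate next word.
hits <- quintgram_counts[grepl("^the end of the ", names(quintgram_counts))]
head(sort(hits, decreasing = TRUE), 3)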

For now, I think we have a good understanding of what we are dealing with. I think we can make something out of this.