Synopsis

This project is part of the Data Science Specialization capstone project in the field of Natural Language Processing (NLP). It addresses one piece of the NLP problem: predicting the next word given a sequence of one to three words. The process involves summarizing the three text files provided in the course by loading and cleansing them and running some exploratory analytics to gain insight into their contents. The analysis is done on a random sample drawn from the files, as together the source files amount to several hundred megabytes and more than three million lines of text.

Load the required libraries

As with any project, the first step is to load the required libraries whose functions will be used in the process of loading, cleaning and analysing the files.

library(stringi)
library(stringr)
library(RCurl)
library(plyr)
library(dplyr)
library(parallel)
library(tm)
library(NLP)
library(ngram)
library(ggplot2)
library(RWeka)
library(plotly)
library(wordcloud)
library(wordcloud2)
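
If any of these packages are not yet installed, they can be installed first (a one-off step; the parallel package ships with base R):

# Install any packages that are missing before loading them
install.packages(c("stringi", "stringr", "RCurl", "plyr", "dplyr", "tm", "NLP",
                   "ngram", "ggplot2", "RWeka", "plotly", "wordcloud", "wordcloud2"))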

Define a function which takes the file path and gets the stats on the file

The code below defines a function that takes the text files and computes some basic statistics on them, such as the file size and the number of lines, words and characters. It is written as a function because it will be called on both the raw data files and the sampled files.

file_operations <- function(b,t,n, sample_size) {

   # Read the lines in the files
   blogs <- readLines(b, encoding = "UTF-8", skipNul = TRUE)
   twitter <- readLines(t, encoding = "UTF-8", skipNul = TRUE)
   news <- readLines(n, encoding = "UTF-8", skipNul = TRUE)

   if (sample_size < 100) {
      blogs <- sample(blogs, length(blogs) * sample_size / 100)
      twitter <- sample(twitter, length(twitter)  * sample_size / 100)
      news <- sample(news, length(news)  * sample_size / 100)
   }

   # Get the in-memory size of each data set in MB
   file_size = sapply(list(blogs, twitter, news), function(x) {object.size(x)/1024^2})

   # Get statistics like number of line, characters in the files
   stats <- t(sapply(list(blogs, twitter, news), stri_stats_general))

   # Number of words (keep the same file order as above so the rows line up correctly)
   num_words <- sapply(list(blogs, twitter, news), stri_stats_latex)[4,]

   # Char count of the longest line
   max_char <- sapply(list(blogs, twitter, news), function(x){max(unlist(lapply(x, function(y) nchar(y))))})

   data_frame <- data.frame (
      FileName = c(b,t,n),
      FileSizeInMB = file_size,
      Stats = stats,
      Words = num_words,
      MaxChar = max_char
   )
   return(data_frame)
}

Read the raw data files (blogs, twitter, news) and get the stats on them

Call the function defined above with the file paths as parameters and a sample size of 100, i.e. the full files.

# File names and size
file_names <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
df_stats <- file_operations(file_names[1], file_names[2], file_names[3], 100)
df_stats
##            FileName FileSizeInMB Stats.Lines Stats.LinesNEmpty Stats.Chars
## 1   en_US.blogs.txt    255.35453      899288            899288   206824382
## 2 en_US.twitter.txt    318.98975     2360148           2360148   162096241
## 3    en_US.news.txt     19.76917       77259             77259    15639408
##   Stats.CharsNWhite    Words MaxChar
## 1         170389539 37570839   40833
## 2         134082806 30451170     140
## 3          13072698  2651432    5760

Sample the raw files due to their large size and get the stats on the sampled files

As seen from the data above, the three files together amount to roughly 600 MB and contain more than three million lines, which is impractical to process in full on a typical machine. We will therefore sample 20% of the raw data for the analysis. First, the same statistics are computed on the sampled files.
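
Note that sample() draws lines at random, so the exact figures below will vary from run to run. For reproducibility, one could fix the random seed before calling the function; this is only a suggestion and was not done in the original run:

# Optional: set a seed so the 20% sample (and hence its statistics) is reproducible
set.seed(12345)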

# Sample 20% of each file and get the stats on the sampled data
file_names <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
df_sample_stats <- file_operations(file_names[1], file_names[2], file_names[3], 20)
df_sample_stats
##            FileName FileSizeInMB Stats.Lines Stats.LinesNEmpty Stats.Chars
## 1   en_US.blogs.txt    51.017235      179857            179857    41295927
## 2 en_US.twitter.txt    64.316933      472029            472029    32465839
## 3    en_US.news.txt     3.951569       15451             15451     3126367
##   Stats.CharsNWhite   Words MaxChar
## 1          34019989 7500678   19795
## 2          26856515 6097762     140
## 3           2613048  530601    1598

Process the sample files for text mining

Analysis of the sampled files starts with cleaning and pre-processing the data: removing punctuation and numbers, converting all characters to lower case, and so on. This pre-processing is important because cleaner input leads to a better model, and it accounts for much of the analysis effort. Since building a corpus and term-document matrices is memory intensive, an even smaller sample (0.2% of the blogs and twitter lines, 0.1% of the news lines) is drawn here for the text-mining steps.

no_of_cores <- detectCores() - 1
#clusters <- makeCluster(no_of_cores)

blogs <- readLines(file_names[1], encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(file_names[2], encoding = "UTF-8", skipNul = TRUE)
news <- readLines(file_names[3], encoding = "UTF-8", skipNul = TRUE)

# Draw a small sub-sample for corpus building: 0.2% of blogs and twitter, 0.1% of news
blogs_sample <- sample(blogs, length(blogs) * 0.2 / 100)
twitter_sample <- sample(twitter, length(twitter) * 0.2 / 100)
news_sample <- sample(news, length(news) * 0.1 / 100)

# Merge the sampled files into one data set
df_sample <- c(blogs_sample, twitter_sample, news_sample)

# Build a tm corpus from the combined sample
vs <- VectorSource(df_sample)
corpus1 <- VCorpus(vs, readerControl = list(reader = readPlain, language = "en"))

corpus1 <- tm_map(corpus1, content_transformer(tolower))
corpus1 <- tm_map(corpus1, content_transformer(removePunctuation))
corpus1 <- tm_map(corpus1, content_transformer(removeNumbers))

profanityLink <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanityWords <- readLines(profanityLink)
#profanityWords <- profanityWords[!duplicated(profanityWords),]
corpus1 <- tm_map(corpus1, removeWords, profanityWords) 

profanityLink <- "http://www.bannedwordlist.com/lists/swearWords.txt"
profanityWords <- readLines(profanityLink)
#profanityWords <- profanityWords[!duplicated(profanityWords),]
corpus1 <- tm_map(corpus1, removeWords, profanityWords)     

corpus1 <- tm_map(corpus1,  content_transformer(stripWhitespace))

removeNonASCII <- function(x) {iconv(x, "latin1", "ASCII", sub="")}
corpus1 <- tm_map(corpus1, content_transformer(removeNonASCII))

# Collapse three or more repeated letters into two (e.g. "soooo" becomes "soo")
removeRepeatedWords <- function(x) {gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x)}
corpus1 <- tm_map(corpus1, content_transformer(removeRepeatedWords))

# Remove one- and two-letter words
removeSingleLetterWords <- function(x) { gsub(" *\\b[[:alpha:]]{1,2}\\b *", "", x) }
corpus1 <- tm_map(corpus1, content_transformer(removeSingleLetterWords))

# Remove runs of a character repeated three or more times (largely redundant after the step above)
removeSameLetterWords <- function(x) {gsub("(\\w)\\1{2,}", "", x)}
corpus1 <- tm_map(corpus1, content_transformer(removeSameLetterWords))

# Keep a copy that retains stop words (used below for the bigram frequencies),
# then remove English stop words from corpus1 for the other n-gram analyses
corpus2plus <- corpus1
corpus1 <- tm_map(corpus1, content_transformer(removeWords), stopwords("en"))

corpus1 <- tm_map(corpus1, PlainTextDocument)
corpus2plus <- tm_map(corpus2plus, PlainTextDocument)
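
To see what these transformations do in combination, they can be applied to a small made-up corpus (a sketch for illustration only; the sentence and the toy object are not part of the original analysis):

# Apply a few of the cleaning steps above to a single made-up sentence
toy <- VCorpus(VectorSource("Hello!! I have 2 CATS and soooo much love"))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, content_transformer(removePunctuation))
toy <- tm_map(toy, content_transformer(removeNumbers))
toy <- tm_map(toy, content_transformer(removeRepeatedWords))  # "soooo" becomes "soo"
as.character(toy[[1]])  # inspect the cleaned text
rm(toy)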

Tokenizer function

Now that our text files are preprocessed, we need to split the text into contiguous word sequences (n-grams) so that the frequency of each sequence can be analysed. For example, the phrase "thank you very much" contains the bigrams "thank you", "you very" and "very much". The code is written as a function because it will be used to analyse the frequencies of single words (unigrams), two-word sequences (bigrams), three-word sequences (trigrams) and four-word sequences (4-grams).

tokenizer <- function(cp, ws) {
   # Tokenize the corpus into n-grams of length ws and build a term-document matrix
   token <- function(x) NGramTokenizer(x, Weka_control(min = ws, max = ws))
   word_matrix <- as.data.frame(as.matrix(TermDocumentMatrix(cp, control = list(tokenize = token))))
   # Sum the counts across documents and sort by decreasing frequency
   frq <- sort(rowSums(word_matrix), decreasing = TRUE)
   freq_df <- data.frame(word = names(frq), freq = frq)
   return(freq_df)
}
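
As a quick check of what the Weka tokenizer produces, it can be called directly on the example phrase from the previous paragraph (illustrative only; this call is not part of the original analysis):

# NGramTokenizer splits a string into overlapping n-grams of the requested length
NGramTokenizer("thank you very much", Weka_control(min = 2, max = 2))
# expected bigrams: "thank you", "you very", "very much"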

Unigram (1-Gram) - Sequence of single word

Frequencies of the most commonly used single words. The 25 most frequent words (stop words were removed from this corpus) are shown with their usage counts.

df1 <- tokenizer(corpus1,1)
plot1 <- df1[1:25,]
rownames(plot1) <- NULL
plot1
##      word freq
## 1    just  327
## 2    like  266
## 3     one  239
## 4    will  216
## 5     can  189
## 6     now  187
## 7  people  169
## 8     new  167
## 9    love  166
## 10   dont  160
## 11    day  159
## 12   good  157
## 13   time  152
## 14    get  148
## 15  right  133
## 16  first  127
## 17 really  124
## 18   back  121
## 19   last  117
## 20 thanks  117
## 21   know  115
## 22   next  107
## 23   even  105
## 24    lol  102
## 25    ive  100

Bigram (2-Gram) - Sequence of two words

Frequencies of the most commonly used contiguous two-word sequences. The 25 most frequent bigrams (stop words were retained for this corpus) are shown with their usage counts.

df2 <- tokenizer(corpus2plus,2)
plot2 <- df2[1:25,]
rownames(plot2) <- NULL
plot2
##          word freq
## 1     for the  258
## 2     and the  138
## 3    from the  103
## 4    with the  101
## 5     you can   69
## 6   thank you   66
## 7  thanks for   64
## 8   about the   63
## 9    that the   61
## 10    all the   58
## 11    are you   58
## 12    you are   58
## 13   the only   49
## 14   over the   47
## 15   they are   47
## 16  the first   46
## 17   into the   44
## 18  have been   36
## 19   the same   36
## 20   that you   35
## 21   the best   35
## 22    was the   35
## 23   you will   35
## 24  right now   34
## 25  there are   34

Trigram (3-Gram) - Sequence of three words

Frequencies of the most commonly used contiguous three-word sequences. The 25 most frequent trigrams are shown with their usage counts.

df3 <- tokenizer(corpus1,3)
plot3 <- df3[1:25,]
rownames(plot3) <- NULL
plot3
##                                                          word freq
## 1                                         cricket new zealand    7
## 2                                         amazon services llc    6
## 3                                         north dakota indian    6
## 4                                         drive data recovery    4
## 5                                             hard drive data    4
## 6                advertising feesadvertising linkingamazoncom    3
## 7            amazonassociates programmes designedprovidemeans    3
## 8                                amazonca amazoncouk amazonde    3
## 9                                amazoncouk amazonde amazonfr    3
## 10                                 amazonde amazonfr amazonit    3
## 11                                   amazones certain content    3
## 12                                 amazonfr amazonit amazones    3
## 13                                  amazonit amazones certain    3
## 14  amazonthis contentprovidedandsubjectchangeremovalany time    3
## 15 andor amazonthis contentprovidedandsubjectchangeremovalany    3
## 16                                  appearsthis website comes    3
## 17                                certain content appearsthis    3
## 18                                      comes amazon services    3
## 19                                content appearsthis website    3
## 20                                       customize taco times    3
## 21                 designedprovidemeans sitesearn advertising    3
## 22                  feesadvertising linkingamazoncom amazonca    3
## 23                                      good morning everyone    3
## 24                                      health customize taco    3
## 25                                 impactour health customize    3

4-Gram - Sequence of four words

Frequencies of the most commonly used contiguous four-word sequences. The 25 most frequent 4-grams are shown with their usage counts.

df4 <- tokenizer(corpus1,4)
plot4 <- df4[1:25,]
rownames(plot4) <- NULL
plot4
##                                                               word freq
## 1                                         hard drive data recovery    4
## 2            advertising feesadvertising linkingamazoncom amazonca    3
## 3                             amazon services llc amazonassociates    3
## 4                                        amazon services llc andor    3
## 5       amazonassociates programmes designedprovidemeans sitesearn    3
## 6                            amazonca amazoncouk amazonde amazonfr    3
## 7                            amazoncouk amazonde amazonfr amazonit    3
## 8                              amazonde amazonfr amazonit amazones    3
## 9                             amazones certain content appearsthis    3
## 10                              amazonfr amazonit amazones certain    3
## 11                               amazonit amazones certain content    3
## 12 andor amazonthis contentprovidedandsubjectchangeremovalany time    3
## 13                                appearsthis website comes amazon    3
## 14                             certain content appearsthis website    3
## 15                                       comes amazon services llc    3
## 16                               content appearsthis website comes    3
## 17                                       customize taco times menu    3
## 18      designedprovidemeans sitesearn advertising feesadvertising    3
## 19            feesadvertising linkingamazoncom amazonca amazoncouk    3
## 20                                     health customize taco times    3
## 21                                 impactour health customize taco    3
## 22                                     items light healthy options    3
## 23                   linkingamazoncom amazonca amazoncouk amazonde    3
## 24            llc amazonassociates programmes designedprovidemeans    3
## 25  llc andor amazonthis contentprovidedandsubjectchangeremovalany    3

Plot the frequencies of words

Visualization of the data above (unigram, bigram and trigram frequencies) as bar charts and word clouds.

# Bar charts and word clouds
ggplot(plot1, aes(word, freq)) + geom_bar(stat = "identity", fill="blue") + xlab("Words") + ylab("Frequency") +    
         ggtitle("Unigram - Word Frequency") + theme(axis.text = element_text(angle = 90, hjust=1))

wordcloud(words = plot1$word, freq = plot1$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))

ggplot(plot2, aes(word, freq)) + geom_bar(stat = "identity", fill="green") + xlab("Words") + ylab("Frequency") + 
         ggtitle("Bigram - Word Frequency") + theme(axis.text = element_text(angle = 90, hjust=1))

wordcloud(words = plot2$word, freq = plot2$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))

ggplot(plot3, aes(word, freq)) + geom_bar(stat = "identity", fill="red") + xlab("Words") + ylab("Frequency") + 
         ggtitle("Trigram - Word Frequency") + theme(axis.text = element_text(angle = 30, hjust=1))

wordcloud(words = plot3$word, freq = plot3$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))

# Here is an interactive word cloud for the bigram data

wordcloud2(data=plot2, size=0.7, shape="pentagon")
#stopCluster(clusters)

Creating a dictionary out of the word sequences and the frequencies

Now the bigrams, trigrams and 4-grams are combined into a dictionary-style data frame: each word sequence minus its last word serves as the key, and the full sequence is stored as the value, so the predicted next word is the last word of the value. The frequency of each sequence is kept as well, so that the most common continuation can be suggested. The data is then written to a csv file that will be used for the prediction. Writing to a csv file is convenient when the corpus operations are rerun multiple times (on fresh samples) to capture more word sequences, but keep in mind that the file grows with every additional run.

# Stack the bigram, trigram and 4-gram frequency tables into one data frame
df_full <- do.call("rbind", list(df2, df3, df4))
# Key = the sequence without its last word; value = the full sequence
keys <- lapply(df_full$word, function(x) {word(x, 1, -2)})
values <- lapply(df_full$word, function(x) {word(x, 1, -1)})
freq <- as.list(df_full$freq)
# Bind keys, values and frequencies row-wise into a three-column data frame
df <- do.call(rbind.data.frame, Map('c', keys, values, freq))
colnames(df) <- c("keys", "values", "freq")
df$freq <- as.numeric(df$freq)
head(df)
##    keys    values freq
## 1   for   for the   22
## 2   and   and the    8
## 3  from  from the    4
## 4  with  with the    3
## 5   you   you can   46
## 6 thank thank you   45
# Append to the csv file; write the header row only if the file does not already exist
write.table(df, file="NLP Data.csv", append=TRUE, sep=",", col.names = !file.exists("NLP Data.csv"), row.names = FALSE)

Remove unused variables from the memory

With only limited memory at hand, it is wise to do some clean-up and remove objects that are no longer needed so their space can be reclaimed.

# rm() is used to remove the unwanted variables
rm(blogs, twitter, news, blogs_sample, twitter_sample, news_sample)
rm(df_stats, df_sample_stats)
rm(df1, df2, df3, df4, df_full)
rm(corpus1, corpus2plus, vs, profanityWords)
rm(plot1, plot2, plot3, plot4)
gc() # garbage collector
##           used (Mb) gc trigger   (Mb)   max used   (Mb)
## Ncells 1710278 91.4    7379446  394.2    9224307  492.7
## Vcells 6226694 47.6  823456620 6282.5 1286634223 9816.3

Model Input File Clean Up

If the corpus operations are run multiple times to capture more word sequences and their frequencies, the csv file accumulates duplicate keys. The clean-up below keeps only the highest-frequency entry for each key; the resulting table is what will be used to predict the next word following a sequence of words.

NLP_data <- read.csv("NLP Data.csv", header=TRUE)
# For each key, keep only the row with the highest frequency
NLP_data <- NLP_data %>% group_by(keys) %>% arrange(desc(freq)) %>% slice(1)
# Safety net: drop any remaining duplicate keys
NLP_data <- NLP_data[!duplicated(NLP_data$keys),]
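
To illustrate how this table feeds the prediction step, here is a minimal sketch of a lookup function. It assumes the suggested next word is the last word of the stored value; the helper predict_next_word and the example call are illustrative and not part of the original code.

# Hypothetical lookup: find the entry for a key and return the last word of its
# stored sequence as the suggested next word
predict_next_word <- function(phrase, dict = NLP_data) {
   key <- tolower(str_trim(phrase))
   hit <- dict[dict$keys == key, ]
   if (nrow(hit) == 0) return(NA_character_)
   word(hit$values[which.max(hit$freq)], -1)
}

predict_next_word("thank")   # should suggest "you" if "thank you" is in the table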