This project is part of the Data Science Specialization Capstone in the field of Natural Language Processing (NLP). It addresses one NLP task: predicting the next word given a sequence of one to three words. The process starts with an exploratory analysis of the three text files provided in the course: the files are read, cleaned and summarised to gain some insight into their contents. Because the source files are large, the analysis is performed on a random sample drawn from them.
As with any project, the first step is to load the required libraries whose functions will be used in the process of loading, cleaning and analysing the files.
library(stringi)
library(stringr)
library(RCurl)
library(plyr)
library(dplyr)
library(parallel)
library(tm)
library(NLP)
library(ngram)
library(ggplot2)
library(RWeka)
library(plotly)
library(wordcloud)
library(wordcloud2)
The code below defines a function that reads the text files and reports basic statistics about them, such as the file size and the number of lines, words and characters. It is written as a function so that it can be called on both the raw data files and the sampled files.
file_operations <- function(b, t, n, sample_size) {
  # Read the lines in the files
  blogs <- readLines(b, encoding = "UTF-8", skipNul = TRUE)
  twitter <- readLines(t, encoding = "UTF-8", skipNul = TRUE)
  news <- readLines(n, encoding = "UTF-8", skipNul = TRUE)
  # Keep only a random sample when sample_size (a percentage) is below 100
  if (sample_size < 100) {
    blogs <- sample(blogs, length(blogs) * sample_size / 100)
    twitter <- sample(twitter, length(twitter) * sample_size / 100)
    news <- sample(news, length(news) * sample_size / 100)
  }
  # Size of each object in MB
  file_size <- sapply(list(blogs, twitter, news), function(x) {object.size(x) / 1024^2})
  # General statistics: number of lines, non-empty lines and characters
  stats <- t(sapply(list(blogs, twitter, news), stri_stats_general))
  # Number of words (row 4 of stri_stats_latex), keeping the blogs/twitter/news order
  num_words <- sapply(list(blogs, twitter, news), stri_stats_latex)[4, ]
  # Character count of the longest line
  max_char <- sapply(list(blogs, twitter, news), function(x) {max(nchar(x))})
  data_frame <- data.frame(
    FileName = c(b, t, n),
    FileSizeInMB = file_size,
    Stats = stats,
    Words = num_words,
    MaxChar = max_char
  )
  return(data_frame)
}
The function is first called with the three file names as parameters and a sample size of 100, i.e. on the full files:
# File names and size
file_names <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
df_stats <- file_operations(file_names[1], file_names[2], file_names[3], 100)
df_stats
## FileName FileSizeInMB Stats.Lines Stats.LinesNEmpty Stats.Chars
## 1 en_US.blogs.txt 255.35453 899288 899288 206824382
## 2 en_US.twitter.txt 318.98975 2360148 2360148 162096241
## 3 en_US.news.txt 19.76917 77259 77259 15639408
## Stats.CharsNWhite Words MaxChar
## 1 170389539 37570839 40833
## 2 134082806 30451170 140
## 3 13072698 2651432 5760
As seen from the data above, the three files together occupy close to 600 MB in memory. Even at an 80% confidence level with a 5% margin of error a sizeable sample is needed, so we work with a 20% random sample of the raw data for the analysis. First, the same statistics are computed on the sampled files.
# Sample 20% of each file and get the stats on the samples
file_names <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")
df_sample_stats <- file_operations(file_names[1], file_names[2], file_names[3], 20)
df_sample_stats
## FileName FileSizeInMB Stats.Lines Stats.LinesNEmpty Stats.Chars
## 1 en_US.blogs.txt 51.017235 179857 179857 41295927
## 2 en_US.twitter.txt 64.316933 472029 472029 32465839
## 3 en_US.news.txt 3.951569 15451 15451 3126367
## Stats.CharsNWhite Words MaxChar
## 1 34019989 7500678 19795
## 2 26856515 6097762 140
## 3 2613048 530601 1598
Analysis of the sampled files starts with cleaning and pre-processing the data: removing punctuation and numbers, converting all characters to lower case, and so on. This pre-processing is important because cleaner input leads to a better prediction model, and it accounts for much of the analysis work.
no_of_cores <- detectCores() - 1
#clusters <- makeCluster(no_of_cores)
blogs <- readLines(file_names[1], encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(file_names[2], encoding = "UTF-8", skipNul = TRUE)
news <- readLines(file_names[3], encoding = "UTF-8", skipNul = TRUE)
# Draw a much smaller random sample for corpus building: 0.2% of blogs and twitter, 0.1% of news
blogs_sample <- sample(blogs, length(blogs) * 0.2 / 100)
twitter_sample <- sample(twitter, length(twitter) * 0.2 / 100)
news_sample <- sample(news, length(news) * 0.1 / 100)
# Merge the sampled files into one data set
df_sample <- c(blogs_sample, twitter_sample, news_sample)
#Perform Corpus functions on the data set
vs <- VectorSource(df_sample)
corpus1 <- VCorpus(vs, readerControl = list(reader = readPlain, language = "en"))
corpus1 <- tm_map(corpus1, content_transformer(tolower))
corpus1 <- tm_map(corpus1, content_transformer(removePunctuation))
corpus1 <- tm_map(corpus1, content_transformer(removeNumbers))
profanityLink <- "http://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
profanityWords <- readLines(profanityLink)
#profanityWords <- profanityWords[!duplicated(profanityWords),]
corpus1 <- tm_map(corpus1, removeWords, profanityWords)
profanityLink <- "http://www.bannedwordlist.com/lists/swearWords.txt"
profanityWords <- readLines(profanityLink)
#profanityWords <- profanityWords[!duplicated(profanityWords),]
corpus1 <- tm_map(corpus1, removeWords, profanityWords)
corpus1 <- tm_map(corpus1, content_transformer(stripWhitespace))
removeNonASCII <- function(x) {iconv(x, "latin1", "ASCII", sub="")}
corpus1 <- tm_map(corpus1, content_transformer(removeNonASCII))
# Collapse runs of three or more repeated letters down to two (e.g. "soooo" -> "soo")
removeRepeatedWords <- function(x) {gsub("([[:alpha:]])\\1{2,}", "\\1\\1", x)}
corpus1 <- tm_map(corpus1, content_transformer(removeRepeatedWords))
# Drop words of one or two letters
removeSingleLetterWords <- function(x) { gsub(" *\\b[[:alpha:]]{1,2}\\b *", "", x) }
corpus1 <- tm_map(corpus1, content_transformer(removeSingleLetterWords))
# Remove any remaining run of the same character repeated three or more times
removeSameLetterWords <- function(x) {gsub("(\\w)\\1{2,}", "", x)}
corpus1 <- tm_map(corpus1, content_transformer(removeSameLetterWords))
# Keep a copy that retains stop words (used later for the bigram analysis);
# remove English stop words from the main corpus
corpus2plus <- corpus1
corpus1 <- tm_map(corpus1, content_transformer(removeWords), stopwords("en"))
corpus1 <- tm_map(corpus1, PlainTextDocument)
corpus2plus <- tm_map(corpus2plus, PlainTextDocument)
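Before moving on, here is a quick toy illustration of what the custom helper functions above do. The example strings are made up and not taken from the data; the actual corpus content will vary with the random sample.
# Toy checks of the cleaning helpers defined above
removeRepeatedWords("soooo goooood")        # collapses letter runs: "soo good"
removeSingleLetterWords("it is a big day")  # drops one- and two-letter words: "big day"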
Now that our text files are pre-processed, we split the text into word sequences so that a frequency analysis can be performed on each one. The code below is written as a function because it is used to analyse the frequencies of single words (unigrams), two-word sequences (bigrams), three-word sequences (trigrams) and four-word sequences (4-grams).
tokenizer <- function(cp, ws) {
  # Tokenize into n-grams of length ws using the RWeka tokenizer
  token <- function(x) NGramTokenizer(x, Weka_control(min = ws, max = ws))
  # Build a term-document matrix of the n-grams
  word_matrix <- as.data.frame(as.matrix(TermDocumentMatrix(cp, control = list(tokenize = token))))
  # Sum the counts across documents and sort by decreasing frequency
  frq <- sort(rowSums(word_matrix), decreasing = TRUE)
  freq_df <- data.frame(word = names(frq), freq = frq)
  return(freq_df)
}
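To make the tokenization step concrete, here is a toy call to the RWeka tokenizer on a made-up sentence (not taken from the data), asking for bigrams; it should return the overlapping two-word sequences in order.
# Bigram tokenization of a toy sentence
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# expected: "thanks for" "for the" "the follow"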
Frequency of the most commonly used single words (unigrams). The 25 most frequent words are shown with their usage counts.
df1 <- tokenizer(corpus1,1)
plot1 <- df1[1:25,]
rownames(plot1) <- NULL
plot1
## word freq
## 1 just 327
## 2 like 266
## 3 one 239
## 4 will 216
## 5 can 189
## 6 now 187
## 7 people 169
## 8 new 167
## 9 love 166
## 10 dont 160
## 11 day 159
## 12 good 157
## 13 time 152
## 14 get 148
## 15 right 133
## 16 first 127
## 17 really 124
## 18 back 121
## 19 last 117
## 20 thanks 117
## 21 know 115
## 22 next 107
## 23 even 105
## 24 lol 102
## 25 ive 100
Frequency of the most commonly used two-word sequences (bigrams). The 25 most frequent bigrams are shown with their usage counts.
df2 <- tokenizer(corpus2plus,2)
plot2 <- df2[1:25,]
rownames(plot2) <- NULL
plot2
## word freq
## 1 for the 258
## 2 and the 138
## 3 from the 103
## 4 with the 101
## 5 you can 69
## 6 thank you 66
## 7 thanks for 64
## 8 about the 63
## 9 that the 61
## 10 all the 58
## 11 are you 58
## 12 you are 58
## 13 the only 49
## 14 over the 47
## 15 they are 47
## 16 the first 46
## 17 into the 44
## 18 have been 36
## 19 the same 36
## 20 that you 35
## 21 the best 35
## 22 was the 35
## 23 you will 35
## 24 right now 34
## 25 there are 34
Frequency of the most commonly used three-word sequences (trigrams). The 25 most frequent trigrams are shown with their usage counts.
df3 <- tokenizer(corpus1,3)
plot3 <- df3[1:25,]
rownames(plot3) <- NULL
plot3
## word freq
## 1 cricket new zealand 7
## 2 amazon services llc 6
## 3 north dakota indian 6
## 4 drive data recovery 4
## 5 hard drive data 4
## 6 advertising feesadvertising linkingamazoncom 3
## 7 amazonassociates programmes designedprovidemeans 3
## 8 amazonca amazoncouk amazonde 3
## 9 amazoncouk amazonde amazonfr 3
## 10 amazonde amazonfr amazonit 3
## 11 amazones certain content 3
## 12 amazonfr amazonit amazones 3
## 13 amazonit amazones certain 3
## 14 amazonthis contentprovidedandsubjectchangeremovalany time 3
## 15 andor amazonthis contentprovidedandsubjectchangeremovalany 3
## 16 appearsthis website comes 3
## 17 certain content appearsthis 3
## 18 comes amazon services 3
## 19 content appearsthis website 3
## 20 customize taco times 3
## 21 designedprovidemeans sitesearn advertising 3
## 22 feesadvertising linkingamazoncom amazonca 3
## 23 good morning everyone 3
## 24 health customize taco 3
## 25 impactour health customize 3
Frequency of the most commonly used four-word sequences (4-grams). The 25 most frequent 4-grams are shown with their usage counts.
df4 <- tokenizer(corpus1,4)
plot4 <- df4[1:25,]
rownames(plot4) <- NULL
plot4
## word freq
## 1 hard drive data recovery 4
## 2 advertising feesadvertising linkingamazoncom amazonca 3
## 3 amazon services llc amazonassociates 3
## 4 amazon services llc andor 3
## 5 amazonassociates programmes designedprovidemeans sitesearn 3
## 6 amazonca amazoncouk amazonde amazonfr 3
## 7 amazoncouk amazonde amazonfr amazonit 3
## 8 amazonde amazonfr amazonit amazones 3
## 9 amazones certain content appearsthis 3
## 10 amazonfr amazonit amazones certain 3
## 11 amazonit amazones certain content 3
## 12 andor amazonthis contentprovidedandsubjectchangeremovalany time 3
## 13 appearsthis website comes amazon 3
## 14 certain content appearsthis website 3
## 15 comes amazon services llc 3
## 16 content appearsthis website comes 3
## 17 customize taco times menu 3
## 18 designedprovidemeans sitesearn advertising feesadvertising 3
## 19 feesadvertising linkingamazoncom amazonca amazoncouk 3
## 20 health customize taco times 3
## 21 impactour health customize taco 3
## 22 items light healthy options 3
## 23 linkingamazoncom amazonca amazoncouk amazonde 3
## 24 llc amazonassociates programmes designedprovidemeans 3
## 25 llc andor amazonthis contentprovidedandsubjectchangeremovalany 3
Visualization of the above data (unigram, bigram and trigram frequencies):
# Bar charts and word clouds of the n-gram frequencies
ggplot(plot1, aes(word, freq)) + geom_bar(stat = "identity", fill="blue") + xlab("Words") + ylab("Frequency") +
ggtitle("Unigram - Word Frequency") + theme(axis.text = element_text(angle = 90, hjust=1))
wordcloud(words = plot1$word, freq = plot1$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))
ggplot(plot2, aes(word, freq)) + geom_bar(stat = "identity", fill="green") + xlab("Words") + ylab("Frequency") +
ggtitle("Bigram - Word Frequency") + theme(axis.text = element_text(angle = 90, hjust=1))
wordcloud(words = plot2$word, freq = plot2$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))
ggplot(plot3, aes(word, freq)) + geom_bar(stat = "identity", fill="red") + xlab("Words") + ylab("Frequency") +
ggtitle("Trigram - Word Frequency") + theme(axis.text = element_text(angle = 30, hjust=1))
wordcloud(words = plot3$word, freq = plot3$freq, min.freq = 1, max.words=200, random.order=FALSE, rot.per=0.35, colors=brewer.pal(7, "Dark2"))
# Here is an interactive wordcloud for the bigrams
wordcloud2(data=plot2, size=0.7, shape="pentagon")
#stopCluster(clusters)
Now the bigrams, trigrams and 4-grams are combined into a dictionary-like data frame: the word sequence minus its last word acts as the key, and the full sequence is stored as the value, so the last word of the value is the predicted next word. The frequency of each word sequence is kept as well, so the most common continuation can be suggested. The data is then appended to a CSV file that is used for the prediction. Writing to a CSV file is convenient when the corpus operations are rerun multiple times to capture more word sequences, but we have to keep in mind that the file keeps growing with every rerun.
# Stack the bigram, trigram and 4-gram frequency tables
df_full <- do.call("rbind", list(df2, df3, df4))
# Key = the n-gram without its last word; value = the full n-gram (its last word is the prediction)
keys <- lapply(df_full$word, function(x) {word(x, 1, -2)})
values <- lapply(df_full$word, function(x) {word(x, 1, -1)})
freq <- as.list(df_full$freq)
df <- do.call(rbind.data.frame, Map('c', keys, values, freq))
colnames(df) <- c("keys", "values", "freq")
df$freq <- as.numeric(df$freq)
head(df)
## keys values freq
## 1 for for the 22
## 2 and and the 8
## 3 from from the 4
## 4 with with the 3
## 5 you you can 46
## 6 thank thank you 45
write.table(df, file="NLP Data.csv", append=TRUE, sep=",", col.names = !file.exists("NLP Data.csv"), row.names = FALSE)
With only limited memory at hand, it is wise to do some clean-up and return the unused space to the OS.
# rm() is used to remove the unwanted variables
rm(blogs, twitter, news, blogs_sample, twitter_sample, news_sample)
rm(df_stats, df_sample_stats)
rm(df1, df2, df3, df4, df_full)
rm(corpus1, corpus2plus, vs, profanityWords)
rm(plot1, plot2, plot3, plot4)
gc() # garbage collector
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1710278 91.4 7379446 394.2 9224307 492.7
## Vcells 6226694 47.6 823456620 6282.5 1286634223 9816.3
If the corpus operations are run multiple times to capture more word sequences and their frequencies, the CSV file accumulates duplicate keys, so it is cleaned up here by keeping only the most frequent continuation for each key. The resulting table is what will be used to predict the next word following a sequence of words.
NLP_data <- read.csv("NLP Data.csv", header = TRUE)
# For each key keep only the row with the highest frequency
NLP_data <- NLP_data %>% group_by(keys) %>% arrange(desc(freq)) %>% slice(1)
NLP_data <- NLP_data[!duplicated(NLP_data$keys), ]
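To show how this table can drive the actual prediction, here is a minimal sketch of a next-word lookup; it is not part of the original analysis. The function name predict_next_word and the simple back-off from a three-word key down to a single word are assumptions, and the input is assumed to be cleaned roughly the same way as the corpus (lower case, no punctuation).
predict_next_word <- function(phrase, lookup = NLP_data) {
  # Hypothetical helper: normalise the input roughly like the corpus was cleaned
  phrase <- tolower(gsub("[[:punct:][:digit:]]", "", phrase))
  tokens <- unlist(strsplit(trimws(phrase), "\\s+"))
  if (length(tokens) == 0) return(NA_character_)
  # Back off from the longest available key (three words) down to a single word
  for (n in min(3, length(tokens)):1) {
    key <- paste(tail(tokens, n), collapse = " ")
    hits <- lookup[lookup$keys == key, ]
    if (nrow(hits) > 0) {
      # The stored value is the full n-gram; its last word is the suggestion
      best <- hits[which.max(hits$freq), ]
      return(word(as.character(best$values), -1))
    }
  }
  NA_character_  # no key matched
}
# Example call; the result depends on the sampled data
predict_next_word("thanks for")
When no key matches, the function returns NA; a fallback to the most frequent unigrams could be added in the same fashion.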