This milestone report is prepared in partial fulfillment of the Data Science Capstone and represents the Week 2 peer-graded assignment. The aim of the project is to apply the knowledge gained in the Data Science Specialization courses to a new data type and build a predictive text model. In this case, the project calls for the analysis of text data using natural language processing.
The initial steps in this project were to obtain the data and to load and manipulate it in R. Another integral part was becoming familiar with natural language processing and text mining, and relating them to the data science process taught in this specialization. Accordingly, the first part of this report covers ‘Understanding the Problem’ and ‘Getting and Cleaning the Data’. The next section covers the exploratory data analysis. Such analysis is necessary because, in building a predictive model, it is prudent to understand the basic relationships observed in the data, such as the distribution of, and relationships between, words, tokens and phrases in the text. The third task in this assignment is modeling, which involved building a simple model of the relationship between words. The report concludes with the plans for creating a prediction algorithm and Shiny app.
This project required the application of data science to natural language processing. As such, it was necessary to become familiar with concepts such as Natural Language Processing (NLP) and text mining infrastructure in R. According to Wikipedia.org, “Natural Language Processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.” Common NLP tasks relating to text and speech processing include optical character recognition, speech recognition, speech segmentation, text-to-speech and word segmentation (tokenization). In this assignment, tokenization tasks are applied.
Choosing a package for textual analysis with R
The main package contenders for this project were RWeka and tm. In deciding to move ahead with these two packages, consideration was given to their broader set of text mining features. The obvious trade-off is their slower performance and limited scalability on large corpora, which meant that more effort would have to be placed on optimization in order to build a functional predictive text model.
The libraries used in this project include:
library(ggplot2)
library(NLP)
library(tm)
library(RWeka)
library(data.table)
library(R.utils)
library(dplyr)
library(parallel)
library(knitr)
library(stringi)
library(kableExtra)
The data was accessed from the Course Dataset.
See Appendix 1 for the script to load the training data from the Course Dataset.
The course dataset is composed of corpora collected from publicly available sources by a web crawler. The corpora consist of blog, news and Twitter files in several languages:
English
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
German
## [1] "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
Finnish
## [1] "fi_FI.blogs.txt" "fi_FI.news.txt" "fi_FI.twitter.txt"
Russian
## [1] "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
The US English versions of the blogs, news and Twitter files were selected to build the corpora. A summary of these files is as follows:
| File | Lines | Characters | Words | File_size |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 208361438 | 37865888 | 200 MB |
| en_US.news.txt | 77259 | 15683765 | 2665742 | 196 MB |
| en_US.twitter.txt | 2360148 | 162385035 | 30578933 | 159 MB |
It was not necessary to load and use all of the data from the course dataset in order to build models. Instead, a subsample was created by reading in a random subset of the original data and writing it out to a separate file. Storing the sample on disk meant it did not have to be recreated every time it was needed throughout the assignment.
See Appendix 2 for the sample creation script.
## [1] "Number of lines in Sample Data file:1050"
Best practice suggests that profanity and other words that should not be predicted by the algorithm be removed. To achieve this, a list of bad words was sourced from here, and the listed words were then filtered out of the sample data created previously.
See Appendix 3 for the profanity filtering script.
## [1] "Number of lines in Sample Data:1050"
Tokenization involved identifying appropriate tokens such as words, punctuation, and numbers. This task involved writing a function that takes the sample dataset as input and returns a tokenized version of it. Wikipedia.org describes tokenization as the separation of a chunk of continuous text into separate words.
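A sketch of such a tokenizer function, using the RWeka tokenizer applied elsewhere in this report (the function name tokenize_words is purely illustrative):
library(RWeka)
# Illustrative tokenizer: returns the individual word tokens of a character vector.
tokenize_words <- function(text) {
  NGramTokenizer(text, Weka_control(min = 1, max = 1))
}
# Example: tokenize_words("a short example sentence")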
In their article ‘Text Mining Infrastructure in R’, Feinerer, I., Hornik, K. and Meyer, D. (2008) presented the tm package, which provides a framework for text mining applications within R, and explained how typical application tasks can be carried out using this framework.
In the first instance, the sample data was converted to lower case. All punctuation, special characters and excess white space were then removed, and the cleaned version was written to disk (data/en_US.sample.txt).
See Appendix 4 for the script used for cleaning the data
## [1] "Number of lines in Sample Data file:1050"
The exploratory analysis involved examining the data to understand the distribution of words and the relationships between words in the corpora.
Understanding Frequencies of Words and Word Pairs
The next piece of work involved building figures and tables to understand the variation in the frequencies of words and word pairs in the data.
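Before turning to the tm term-document matrices used in the appendices, the basic idea can be illustrated with base R on a made-up pair of lines (a sketch only; the actual analysis uses the tm and RWeka tokenizers):
# Minimal sketch: tabulate word and adjacent word-pair frequencies for toy text.
toy <- c("the cat sat on the mat", "the dog sat on the rug")
words <- unlist(strsplit(toy, "\\s+"))
head(sort(table(words), decreasing = TRUE), 5)
pairs <- unlist(lapply(strsplit(toy, "\\s+"), function(w) paste(head(w, -1), tail(w, -1))))
head(sort(table(pairs), decreasing = TRUE), 5)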
See Appendix 5 for the Summary Statistics of the Dataset
Plot and Word Cloud of Most Frequent Words
Frequencies in the 2-gram and 3-gram datasets
Plot of Most Frequent Unigrams
Plot of Most Frequent Bigrams
Plot of Most Frequent Trigrams
The modeling task involved building a simple model of the relationship between words. In order to build a predictive text mining application, it was necessary to build a basic n-gram model. The idea of this n-gram model is to predict the next word based on the previous one, two or three words. An extension of the n-gram model is to predict the next word even when a particular combination of words does not appear in the corpora.
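As a toy illustration of the idea (the counts below are made up, not taken from this corpus), such a model estimates the probability of the next word from how often it follows the preceding word:
# Toy sketch with made-up counts: the estimated probability of a word following "of"
# is the bigram count divided by the count of "of" itself.
bigram_counts <- c("of the" = 120, "of a" = 45, "of course" = 30)
count_of <- 300
round(bigram_counts / count_of, 3)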
In the first instance, the corpus was built with VCorpus, then a function was created to make the n-grams. Once created, the n-grams were extracted and their frequencies calculated. The results were then converted into table format, and files for the quadgram, trigram and bigram frequencies were written.
See Appendix 6 for the script used to build the Corpus & NGram Frequencies
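One simple way the generated tables could support that extension is a backoff-style lookup: try the longest matching context first and fall back to shorter ones. The sketch below is illustrative only, not the final algorithm, and assumes data frames named bigram, trigram and quadgram with the column layout produced in Appendix 6:
# Illustrative backoff lookup over the Appendix 6 n-gram tables
# (columns unigram/bigram/trigram/quadgram/freq); not the final prediction algorithm.
predict_next <- function(phrase, bigram, trigram, quadgram) {
  w <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(w)
  if (n >= 3) {
    hit <- quadgram[quadgram$unigram == w[n - 2] &
                    quadgram$bigram == w[n - 1] &
                    quadgram$trigram == w[n], ]
    if (nrow(hit) > 0) return(hit$quadgram[which.max(hit$freq)])
  }
  if (n >= 2) {
    hit <- trigram[trigram$unigram == w[n - 1] & trigram$bigram == w[n], ]
    if (nrow(hit) > 0) return(hit$trigram[which.max(hit$freq)])
  }
  hit <- bigram[bigram$unigram == w[n], ]
  if (nrow(hit) > 0) return(hit$bigram[which.max(hit$freq)])
  NA_character_  # no match in any table
}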
With the predictive text model built and tested, the next step in the project will be to integrate the prediction algorithm into a Shiny app. Using the unigram, bigram, trigram and quadgram files generated, an application will be developed to “predict the next word” from a word or phrase entered by the user. Going forward, one of the challenges in this project will be finding ways to increase the size of the corpora used while keeping the application efficient in terms of computing speed and memory.
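Within that app, the saved n-gram files could simply be re-loaded at start-up. A minimal sketch, assuming the file names written in Appendix 6 and that the files sit in the app's working directory:
# Reload the n-gram tables saved in Appendix 6 for use in the prediction step.
bigram <- readRDS("bigram2.RData")
trigram <- readRDS("trigram2.RData")
quadgram <- readRDS("quadgram2.RData")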
Load training data from Course Dataset
zip_URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_Data_File <- "data/Coursera-SwiftKey.zip"
if (!file.exists('data')) {
dir.create('data')
}
if (!file.exists("data/final/en_US")) {
tempFile <- tempfile()
download.file(zip_URL, tempFile)
unzip(tempFile, exdir = "data")
unlink(tempFile)
rm(tempFile)
}
Summary of Data
# Blogs
blog <- readLines("data/final/en_US/en_US.blogs.txt",skipNul = TRUE, warn = FALSE)
# News
news <- readLines("data/final/en_US/en_US.news.txt",skipNul = TRUE, warn = FALSE)
# Twitter
twitter <- readLines("data/final/en_US/en_US.twitter.txt",skipNul = TRUE, warn = FALSE)
# Number of lines per file
number_lines <- sapply(list(blog, news, twitter), length)
# Number of characters per file
number_characters <- sapply(list(nchar(blog), nchar(news), nchar(twitter)), sum)
# Number of words per file (stri_stats_latex is from the stringi package)
number_words <- sapply(list(blog, news, twitter), stri_stats_latex)[4,]
# File Size in MB
blogs_file <- "data/final/en_US/en_US.blogs.txt"
news_file <- "data/final/en_US/en_US.news.txt"
twitter_file <- "data/final/en_US/en_US.twitter.txt"
file_size <- round(file.info(c(blogs_file, news_file, twitter_file))$size / 1024 ^ 2)
summary <- data.frame(
File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
Lines = number_lines,
Characters = number_characters,
Words = number_words,
File_size = paste0(file_size, " MB"))
kable(summary,
row.names = FALSE,
align = c("l", rep("r", 7)),
caption = "") %>% kable_styling(position = "left")
# Remove variables to optimize computing memory usage
rm(zip_URL, zip_Data_File)
Sample creation script
# Set seed for reproducibility and Assign sample size
set.seed(2000)
sample_size = 350 #Limited due to physical computing memory
sample_blog <- blog[sample(1:length(blog),sample_size)]
sample_news <- news[sample(1:length(news),sample_size)]
sample_twitter <- twitter[sample(1:length(twitter),sample_size)]
sample_data<-rbind(sample_blog,sample_news,sample_twitter)
rm(blog,news,twitter) # remove unused variables for optimizing the environment
print(paste0("Number of lines in Sample Data file:", length(sample_data)))Profanity Filtering
# Load list of bad words file
bad_words_URL <- "https://www.phorum.org/phorum5/file.php/63/2330/badwords.txt.zip"
bad_words_file <- "data/badwords.txt"
if (!file.exists('data')) {
dir.create('data')
}
if (!file.exists(bad_words_file)) {
tempFile <- tempfile()
download.file(bad_words_URL, tempFile)
unzip(tempFile, exdir = "data")
unlink(tempFile)
rm(tempFile)
}
bad <- file(bad_words_file, open = "r")
bad_words <- readLines(bad, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
bad_words <- iconv(bad_words, "latin1", "ASCII", sub = "")
close(bad)
# Remove Bad Words (Profanity)
sample_data <- removeWords(sample_data, bad_words)
print(paste0("Number of lines in Sample Data:", length(sample_data)))
## [1] "Number of lines in Sample Data:1050"
Script used for cleaning the data
# Convert Text to Lowercase
sample_data <- tolower(sample_data)
# Remove Punctuation, Special Characters and Strip White Space
sample_data <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove URLs
sample_data <- gsub("\\S+[@]\\S+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove email addresses
sample_data <- gsub("@[^\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove Twitter handles
sample_data <- gsub("#[^\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove hashtags
sample_data <- gsub("[^\\p{L}'\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # keep only letters, apostrophes and whitespace
sample_data <- gsub("[^0-9A-Za-z///' ]", "'", sample_data, ignore.case = TRUE)  # replace any remaining special characters
sample_data <- gsub("''", "", sample_data, ignore.case = TRUE)  # drop doubled apostrophes
sample_data <- gsub("^\\s+|\\s+$", "", sample_data)  # trim leading/trailing whitespace
sample_data <- stripWhitespace(sample_data)  # collapse internal whitespace
# Write sample data set to disk and optimize computing memory usage
sample_data_file <- "data/en_US.sample.txt"
sdf <- file(sample_data_file, open = "w")
writeLines(sample_data, sdf)
close(sdf)
# Optimize computing memory usage
rm(bad_words_URL, bad_words_file, sdf, sample_data_file)
print(paste0("Number of lines in Sample Data file:", length(sample_data)))## [1] "Number of lines in Sample Data file:1050"
Summary Statistics of the Data Set
Plot and Word Cloud of Most Frequent Words
library(wordcloud)
library(RColorBrewer)
tdm <- TermDocumentMatrix(corpus)  # 'corpus' is the VCorpus built from the sample data (see Appendix 6)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)
# Plot the 10 most frequent words
g <- ggplot(wordFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)
# Build Word Cloud of most frequent words
suppressWarnings (
wordcloud(words = wordFreq$word,
freq = wordFreq$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2"))
)
Frequencies in the 2-gram and 3-gram datasets
library(RWeka)
# Create tokenizer functions for unigrams, bigrams and trigrams
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Plot of Most Frequent Unigrams
# Create Term Document Matrix for the Corpus
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)
# Create Plot of Most Frequent Unigrams
g <- ggplot(unigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Unigrams")
print(g)
Plot of Most Frequent Bigrams
# Create Term Document Matrix for the Corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)
# Create Plot of Most Frequent Bigrams
g <- ggplot(bigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Bigrams")
print(g)
Plot of Most Frequent Trigrams
# Create Term Document Matrix for the Corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)
# Create Plot of Most Frequent Trigrams
g <- ggplot(trigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Trigrams")
print(g)
Script used to build the Corpus & NGram Frequencies
corpus <- VCorpus(VectorSource(sample_data))
# Function to make NGrams
makeNgram <- function (corpdata, n) {
NgramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))}
makengram <- TermDocumentMatrix(corpdata, control = list(tokenize = NgramTokenizer))
makengram
}
# Function to extract Ngrams
ngramdataframe <- function (makengram) {
ngrammatrix <- as.matrix(makengram)
ngramdataframe2 <- as.data.frame(ngrammatrix)
colnames(ngramdataframe2) <- "Count"
ngramdataframe2 <- ngramdataframe2[order(-ngramdataframe2$Count), , drop = FALSE]
ngramdataframe2
}
# Calculate NGrams
makeunigram <- makeNgram(corpus, 1)
makebigram <- makeNgram(corpus, 2)
maketrigram <- makeNgram(corpus, 3)
makequadgram <- makeNgram(corpus, 4)
# Convert N-Grams into table format
unigramdataframe <- ngramdataframe(makeunigram)
bigramdataframe <- ngramdataframe(makebigram)
trigramdataframe <- ngramdataframe(maketrigram)
quadgramdataframe <- ngramdataframe(makequadgram)
# Create Quadgram File
quadgram <- data.frame(rows=rownames(quadgramdataframe),count=quadgramdataframe$Count)
quadgram$rows <- as.character(quadgram$rows)
quadgram_split <- strsplit(as.character(quadgram$rows),split=" ")
quadgram <- transform(quadgram,first = sapply(quadgram_split,"[[",1),second = sapply(quadgram_split,"[[",2),third = sapply(quadgram_split,"[[",3), fourth = sapply(quadgram_split,"[[",4))
quadgram <- data.frame(unigram = quadgram$first,bigram = quadgram$second, trigram = quadgram$third, quadgram = quadgram$fourth, freq = quadgram$count,stringsAsFactors=FALSE)
write.csv(quadgram[quadgram$freq > 1,],"quadgram2.csv",row.names=F)
quadgram <- read.csv("quadgram2.csv",stringsAsFactors = F)
saveRDS(quadgram,"quadgram2.RData")
# Create Trigram File
trigram <- data.frame(rows=rownames(trigramdataframe),count=trigramdataframe$Count)
trigram$rows <- as.character(trigram$rows)
trigram_split <- strsplit(as.character(trigram$rows),split=" ")
trigram <- transform(trigram,first = sapply(trigram_split,"[[",1),second = sapply(trigram_split,"[[",2),third = sapply(trigram_split,"[[",3))
trigram <- data.frame(unigram = trigram$first,bigram = trigram$second, trigram = trigram$third, freq = trigram$count,stringsAsFactors=FALSE)
write.csv(trigram[trigram$freq > 1,],"trigram2.csv",row.names=F)
trigram <- read.csv("trigram2.csv",stringsAsFactors = F)
saveRDS(trigram,"trigram2.RData")
# Create Bigram File
bigram <- data.frame(rows=rownames(bigramdataframe),count=bigramdataframe$Count)
bigram$rows <- as.character(bigram$rows)
bigram_split <- strsplit(as.character(bigram$rows),split=" ")
bigram <- transform(bigram,first = sapply(bigram_split,"[[",1),second = sapply(bigram_split,"[[",2))
bigram <- data.frame(unigram = bigram$first,bigram = bigram$second,freq = bigram$count,stringsAsFactors=FALSE)
write.csv(bigram[bigram$freq > 1,],"bigram2.csv",row.names=F)
bigram <- read.csv("bigram2.csv",stringsAsFactors = F)
saveRDS(bigram,"bigram2.RData")https://www.jstatsoft.org/article/view/v025i05 ‘Text Mining Infrastructure in R’. Feinerer I, Hornik, K and Meyer, D. (2008) Journal of Statistical Software. DOI 10.18637/jss.v025.i05 Date Accessed 09/21/2020
https://en.wikipedia.org/wiki/Natural_language_processing ‘Natural language processing’ Date Accessed 09/21/2020
https://www.phorum.org/phorum5/file.php/63/2330/badwords.txt.zip ‘Bad Words’ Date Accessed 09/21/2020