This milestone report is prepared in partial fulfillment of the Data Science Capstone and represents the Week 2 peer-graded assignment. The aim of the project is to apply the knowledge gained in the Data Science Specialization courses to a new data type and build a predictive text model. In this case, the project calls for the analysis of text data using natural language processing.
The initial steps in this project were to obtain the data and to load and manipulate it in R. Another integral part was becoming familiar with natural language processing and text mining, and relating them to the data science process taught in this specialization. Accordingly, the first part of this report covers ‘Understanding the Problem’ and ‘Getting and Cleaning the Data’. The next section covers the exploratory data analysis. Such analysis is necessary because, in building a predictive model, it is prudent to understand the basic relationships observed in the data, such as the distribution of, and relationships between, words, tokens and phrases in the text. The third task in this assignment is modeling, which involved building a simple model of the relationship between words. The report concludes with the plans for creating a prediction algorithm and Shiny app.
This project required the application of data science to natural language processing. As such, it was necessary to become familiar with concepts such as Natural Language Processing (NLP) and text mining infrastructure in R. According to Wikipedia.org, “Natural Language Processing is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.” Common NLP tasks relating to text and speech processing include optical character recognition, speech recognition, speech segmentation, text-to-speech and word segmentation (tokenization). In this assignment, tokenization tasks are applied.
Choosing a package for textual analysis with R
The main package contenders for this project were RWeka and tm. In deciding to move ahead with these two packages, consideration was given to their broader set of text mining features. The obvious trade-off is their slower performance and limited scalability on large corpora, which meant that more effort would have to be placed on optimization in order to build a functional predictive text model.
The libraries used in this project include:
library(ggplot2)
library(NLP)
library(tm)
library(RWeka)
library(data.table)
library(R.utils)
library(dplyr)
library(parallel)
library(knitr)
library(stringi)
library(kableExtra)
The data was accessed from the Course Dataset.
See Appendix 1 for the script to load the training data from the Course Dataset.
The course dataset is composed of corpora collected from publicly available sources by a web crawler. The corpora consist of blog, news and Twitter files in several languages:
English
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
German
## [1] "de_DE.blogs.txt" "de_DE.news.txt" "de_DE.twitter.txt"
Finnish
## [1] "fi_FI.blogs.txt" "fi_FI.news.txt" "fi_FI.twitter.txt"
Russian
## [1] "ru_RU.blogs.txt" "ru_RU.news.txt" "ru_RU.twitter.txt"
The US English versions of the blogs, news and Twitter files were selected to build the corpora. A summary of these files is as follows:
| File | Lines | Characters | Words | File_size |
|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 208361438 | 37865888 | 200 MB |
| en_US.news.txt | 77259 | 15683765 | 2665742 | 196 MB |
| en_US.twitter.txt | 2360148 | 162385035 | 30578933 | 159 MB |
It was not necessary to load and use all of the data from the course dataset in order to build models. Instead, a subsample was created by reading in a random subset of the original data and writing it out to a separate file. Storing the sample on disk meant it did not have to be recreated every time it was needed throughout the assignment.
See Appendix 2 for the sample creation script.
## [1] "Number of lines in Sample Data file:1050"
Best practice suggests that profanity and other words that should not be predicted by the algorithm be removed. To achieve this, a list of bad words was sourced from here, and the listed words were then filtered out of the sample data created previously.
See Appendix 3 for the profanity filtering script.
## [1] "Number of lines in Sample Data:1050"
Tokenization involved identifying appropriate tokens such as words, punctuation, and numbers. This task involved writing a function that takes the sample dataset as input and returns a tokenized version of it. Wikipedia.org describes tokenization as the separation of a chunk of continuous text into separate words.
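A sketch of such a tokenizer function, using the RWeka tokenizer applied elsewhere in this report (the function name tokenize_words is purely illustrative):
library(RWeka)
# Illustrative tokenizer: returns the individual word tokens of a character vector.
tokenize_words <- function(text) {
  NGramTokenizer(text, Weka_control(min = 1, max = 1))
}
# Example: tokenize_words("a short example sentence")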
In their article ‘Text Mining Infrastructure in R’, Feinerer, I., Hornik, K. and Meyer, D. (2008) presented the tm package, which provides a framework for text mining applications within R, and explained how typical application tasks can be carried out using this framework.
In the first instance, the sample data was converted to lower case. All punctuation, special characters and excess white space were then removed, and the cleaned version was written to disk (data/en_US.sample.txt).
See Appendix 4 for the script used for cleaning the data
## [1] "Number of lines in Sample Data file:1050"
The exploratory analysis involved examining the data to understand the distribution of words and the relationships between words in the corpora.
Understanding Frequencies of Words and Word Pairs
The next piece of work involved building figures and tables to understand the variation in the frequencies of words and word pairs in the data.
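Before turning to the tm term-document matrices used in the appendices, the basic idea can be illustrated with base R on a made-up pair of lines (a sketch only; the actual analysis uses the tm and RWeka tokenizers):
# Minimal sketch: tabulate word and adjacent word-pair frequencies for toy text.
toy <- c("the cat sat on the mat", "the dog sat on the rug")
words <- unlist(strsplit(toy, "\\s+"))
head(sort(table(words), decreasing = TRUE), 5)
pairs <- unlist(lapply(strsplit(toy, "\\s+"), function(w) paste(head(w, -1), tail(w, -1))))
head(sort(table(pairs), decreasing = TRUE), 5)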
See Appendix 5 for the Summary Statistics of the Dataset
Plot and Word Cloud of Most Frequent Words
Frequencies in the 2-gram and 3-gram datasets
Plot of Most Frequent Unigrams
Plot of Most Frequent Bigrams
Plot of Most Frequent Trigrams
The modeling task involved building a simple model of the relationship between words. In order to build a predictive text mining application, it was necessary to build a basic n-gram model. The idea of this n-gram model is to predict the next word based on the previous one, two or three words. An extension of the n-gram model is to predict the next word even when a particular combination of words does not appear in the corpora.
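As a toy illustration of the idea (the counts below are made up, not taken from this corpus), such a model estimates the probability of the next word from how often it follows the preceding word:
# Toy sketch with made-up counts: the estimated probability of a word following "of"
# is the bigram count divided by the count of "of" itself.
bigram_counts <- c("of the" = 120, "of a" = 45, "of course" = 30)
count_of <- 300
round(bigram_counts / count_of, 3)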
In the first instance, the corpus was built with VCorpus, then a function was created to make the n-grams. Once created, the n-grams were extracted and their frequencies calculated. The results were then converted into table format, and files for the quadgram, trigram and bigram frequencies were written.
See Appendix 6 for the script used to build the Corpus & NGram Frequencies
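One simple way the generated tables could support that extension is a backoff-style lookup: try the longest matching context first and fall back to shorter ones. The sketch below is illustrative only, not the final algorithm, and assumes data frames named bigram, trigram and quadgram with the column layout produced in Appendix 6:
# Illustrative backoff lookup over the Appendix 6 n-gram tables
# (columns unigram/bigram/trigram/quadgram/freq); not the final prediction algorithm.
predict_next <- function(phrase, bigram, trigram, quadgram) {
  w <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(w)
  if (n >= 3) {
    hit <- quadgram[quadgram$unigram == w[n - 2] &
                    quadgram$bigram == w[n - 1] &
                    quadgram$trigram == w[n], ]
    if (nrow(hit) > 0) return(hit$quadgram[which.max(hit$freq)])
  }
  if (n >= 2) {
    hit <- trigram[trigram$unigram == w[n - 1] & trigram$bigram == w[n], ]
    if (nrow(hit) > 0) return(hit$trigram[which.max(hit$freq)])
  }
  hit <- bigram[bigram$unigram == w[n], ]
  if (nrow(hit) > 0) return(hit$bigram[which.max(hit$freq)])
  NA_character_  # no match in any table
}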
With the predictive text model built and tested, the next step in the project will be to integrate the prediction algorithm into a Shiny app. Using the unigram, bigram, trigram and quadgram files generated, an application will be developed to “predict the next word” from a word or phrase entered by the user. Going forward, one of the challenges in this project will be finding ways to increase the size of the corpora used while keeping the application efficient in terms of computing speed and memory.
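Within that app, the saved n-gram files could simply be re-loaded at start-up. A minimal sketch, assuming the file names written in Appendix 6 and that the files sit in the app's working directory:
# Reload the n-gram tables saved in Appendix 6 for use in the prediction step.
bigram <- readRDS("bigram2.RData")
trigram <- readRDS("trigram2.RData")
quadgram <- readRDS("quadgram2.RData")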
Load training data from Course Dataset
zip_URL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_Data_File <- "data/Coursera-SwiftKey.zip"
if (!file.exists('data')) {
dir.create('data')
}
if (!file.exists("data/final/en_US")) {
tempFile <- tempfile()
download.file(zip_URL, tempFile)
unzip(tempFile, exdir = "data")
unlink(tempFile)
rm(tempFile)
}
Summary of Data
# Blogs
blog <- readLines("data/final/en_US/en_US.blogs.txt",skipNul = TRUE, warn = FALSE)
# News
news <- readLines("data/final/en_US/en_US.news.txt",skipNul = TRUE, warn = FALSE)
# Twitter
twitter <- readLines("data/final/en_US/en_US.twitter.txt",skipNul = TRUE, warn = FALSE)
# Number of lines per file
number_lines <- sapply(list(blog, news, twitter), length)
# Number of characters per file
number_characters <- sapply(list(nchar(blog), nchar(news), nchar(twitter)), sum)
# Number of words per file (stri_stats_latex is from the stringi package)
number_words <- sapply(list(blog, news, twitter), stri_stats_latex)[4,]
# File Size in MB
blogs_file <- "data/final/en_US/en_US.blogs.txt"
news_file <- "data/final/en_US/en_US.news.txt"
twitter_file <- "data/final/en_US/en_US.twitter.txt"
file_size <- round(file.info(c(blogs_file, news_file, twitter_file))$size / 1024 ^ 2)
summary <- data.frame(
File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
Lines = number_lines,
Characters = number_characters,
Words = number_words,
File_size = paste0(file_size, " MB"))
kable(summary,
row.names = FALSE,
align = c("l", rep("r", 7)),
caption = "") %>% kable_styling(position = "left")
# Remove variables to optimize computing memory usage
rm(zip_URL, zip_Data_File)
Sample creation script
# Set seed for reproducibility and Assign sample size
set.seed(2000)
sample_size = 350 #Limited due to physical computing memory
sample_blog <- blog[sample(1:length(blog),sample_size)]
sample_news <- news[sample(1:length(news),sample_size)]
sample_twitter <- twitter[sample(1:length(twitter),sample_size)]
sample_data<-rbind(sample_blog,sample_news,sample_twitter)
rm(blog,news,twitter) # remove unused variables for optimizing the environment
print(paste0("Number of lines in Sample Data file:", length(sample_data)))Profanity Filtering
# Load list of bad words file
bad_words_URL <- "https://www.phorum.org/phorum5/file.php/63/2330/badwords.txt.zip"
bad_words_file <- "data/badwords.txt"
if (!file.exists('data')) {
dir.create('data')
}
if (!file.exists(bad_words_file)) {
tempFile <- tempfile()
download.file(bad_words_URL, tempFile)
unzip(tempFile, exdir = "data")
unlink(tempFile)
rm(tempFile)
}
bad <- file(bad_words_file, open = "r")
bad_words <- readLines(bad, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
bad_words <- iconv(bad_words, "latin1", "ASCII", sub = "")
close(bad)
# Remove Bad Words (Profanity)
sample_data <- removeWords(sample_data, bad_words)
print(paste0("Number of lines in Sample Data:", length(sample_data)))
## [1] "Number of lines in Sample Data:1050"
Script used for cleaning the data
# Convert Text to Lowercase
sample_data <- tolower(sample_data)
# Remove Punctuation, Special Characters and Strip White Space
sample_data <- gsub("(f|ht)tp(s?)://(.*)[.][a-z]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove URLs
sample_data <- gsub("\\S+[@]\\S+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove email addresses
sample_data <- gsub("@[^\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove Twitter handles
sample_data <- gsub("#[^\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # remove hashtags
sample_data <- gsub("[^\\p{L}'\\s]+", "", sample_data, ignore.case = FALSE, perl = TRUE)  # keep only letters, apostrophes and whitespace
sample_data <- gsub("[^0-9A-Za-z///' ]", "'", sample_data, ignore.case = TRUE)  # replace any remaining special characters
sample_data <- gsub("''", "", sample_data, ignore.case = TRUE)  # drop doubled apostrophes
sample_data <- gsub("^\\s+|\\s+$", "", sample_data)  # trim leading/trailing whitespace
sample_data <- stripWhitespace(sample_data)  # collapse internal whitespace
# Write sample data set to disk and optimize computing memory usage
sample_data_file <- "data/en_US.sample.txt"
sdf <- file(sample_data_file, open = "w")
writeLines(sample_data, sdf)
close(sdf)
# Optimize computing memory usage
rm(bad_words_URL, bad_words_file, sdf, sample_data_file)
print(paste0("Number of lines in Sample Data file:", length(sample_data)))## [1] "Number of lines in Sample Data file:1050"
Summary Statistics of the Data Set
Plot and Word Cloud of Most Frequent Words
library(wordcloud)
library(RColorBrewer)
tdm <- TermDocumentMatrix(corpus)  # 'corpus' is the VCorpus built from the sample data (see Appendix 6)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)
# Plot the 10 most frequent words
g <- ggplot(wordFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)
# Build Word Cloud of most frequent words
suppressWarnings (
wordcloud(words = wordFreq$word,
freq = wordFreq$freq,
min.freq = 1,
max.words = 200,
random.order = FALSE,
rot.per = 0.35,
colors=brewer.pal(8, "Dark2"))
)
Frequencies in the 2-gram and 3-gram datasets
library(RWeka)
# Create tokenizer functions for unigrams, bigrams and trigrams
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
Plot of Most Frequent Unigrams
# Create Term Document Matrix for the Corpus
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)
# Create Plot of Most Frequent Unigrams
g <- ggplot(unigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Unigrams")
print(g)
Plot of Most Frequent Bigrams
# Create Term Document Matrix for the Corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.999))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)
# Create Plot of Most Frequent Bigrams
g <- ggplot(bigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Bigrams")
print(g)
Plot of Most Frequent Trigrams
# Create Term Document Matrix for the Corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))
# Remove sparse terms for N-gram and generate frequencies of most common N-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.9999))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)
# Create Plot of Most Frequent Trigrams
g <- ggplot(trigramMatrixFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("skyblue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 16, hjust = 0.5, vjust = 0.5),
axis.text.x = element_text(hjust = 1.0, angle = 45),
axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Trigrams")
print(g)
Script used to build the Corpus & NGram Frequencies
corpus <- VCorpus(VectorSource(sample_data))
# Function to make NGrams
makeNgram <- function (corpdata, n) {
NgramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = n, max = n))}
makengram <- TermDocumentMatrix(corpdata, control = list(tokenize = NgramTokenizer))
makengram
}
# Function to extract Ngrams
ngramdataframe <- function (makengram) {
ngrammatrix <- as.matrix(makengram)
ngramdataframe2 <- as.data.frame(ngrammatrix)
colnames(ngramdataframe2) <- "Count"
ngramdataframe2 <- ngramdataframe2[order(-ngramdataframe2$Count), , drop = FALSE]
ngramdataframe2
}
# Calculate NGrams
makeunigram <- makeNgram(corpus, 1)
makebigram <- makeNgram(corpus, 2)
maketrigram <- makeNgram(corpus, 3)
makequadgram <- makeNgram(corpus, 4)
# Convert N-Grams into table format
unigramdataframe <- ngramdataframe(makeunigram)
bigramdataframe <- ngramdataframe(makebigram)
trigramdataframe <- ngramdataframe(maketrigram)
quadgramdataframe <- ngramdataframe(makequadgram)
# Create Quadgram File
quadgram <- data.frame(rows=rownames(quadgramdataframe),count=quadgramdataframe$Count)
quadgram$rows <- as.character(quadgram$rows)
quadgram_split <- strsplit(as.character(quadgram$rows),split=" ")
quadgram <- transform(quadgram,first = sapply(quadgram_split,"[[",1),second = sapply(quadgram_split,"[[",2),third = sapply(quadgram_split,"[[",3), fourth = sapply(quadgram_split,"[[",4))
quadgram <- data.frame(unigram = quadgram$first,bigram = quadgram$second, trigram = quadgram$third, quadgram = quadgram$fourth, freq = quadgram$count,stringsAsFactors=FALSE)
write.csv(quadgram[quadgram$freq > 1,],"quadgram2.csv",row.names=F)
quadgram <- read.csv("quadgram2.csv",stringsAsFactors = F)
saveRDS(quadgram,"quadgram2.RData")
# Create Trigram File
trigram <- data.frame(rows=rownames(trigramdataframe),count=trigramdataframe$Count)
trigram$rows <- as.character(trigram$rows)
trigram_split <- strsplit(as.character(trigram$rows),split=" ")
trigram <- transform(trigram,first = sapply(trigram_split,"[[",1),second = sapply(trigram_split,"[[",2),third = sapply(trigram_split,"[[",3))
trigram <- data.frame(unigram = trigram$first,bigram = trigram$second, trigram = trigram$third, freq = trigram$count,stringsAsFactors=FALSE)
write.csv(trigram[trigram$freq > 1,],"trigram2.csv",row.names=F)
trigram <- read.csv("trigram2.csv",stringsAsFactors = F)
saveRDS(trigram,"trigram2.RData")
# Create Bigram File
bigram <- data.frame(rows=rownames(bigramdataframe),count=bigramdataframe$Count)
bigram$rows <- as.character(bigram$rows)
bigram_split <- strsplit(as.character(bigram$rows),split=" ")
bigram <- transform(bigram,first = sapply(bigram_split,"[[",1),second = sapply(bigram_split,"[[",2))
bigram <- data.frame(unigram = bigram$first,bigram = bigram$second,freq = bigram$count,stringsAsFactors=FALSE)
write.csv(bigram[bigram$freq > 1,],"bigram2.csv",row.names=F)
bigram <- read.csv("bigram2.csv",stringsAsFactors = F)
saveRDS(bigram,"bigram2.RData")https://www.jstatsoft.org/article/view/v025i05 ‘Text Mining Infrastructure in R’. Feinerer I, Hornik, K and Meyer, D. (2008) Journal of Statistical Software. DOI 10.18637/jss.v025.i05 Date Accessed 09/21/2020
https://en.wikipedia.org/wiki/Natural_language_processing ‘Natural language processing’ Date Accessed 09/21/2020
https://www.phorum.org/phorum5/file.php/63/2330/badwords.txt.zip ‘Bad Words’ Date Accessed 09/21/2020