Introduction

This report is part of the Data Science Specialization Capstone project. The corporate partner, SwiftKey, is known for its Natural Language Processing (NLP) technology, which predicts the next word based on what the user is currently typing.

In this capstone we will be applying data science in the area of natural language processing. As a first step toward working on this project, we should familiarize ourselves with Natural Language Processing, Text Mining, and the associated tools in R. Here are some resources that may be helpful:

https://en.wikipedia.org/wiki/Natural_language_processing

http://www.jstatsoft.org/v25/i05/

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

https://www.coursera.org/course/nlp

Tasks to accomplish

  1. Obtaining the data - Can you download the data and load/manipulate it in R?

  2. Familiarizing yourself with NLP and text mining - Learn about the basics of natural language processing and how it relates to the data science process you have learned in the Data Science Specialization.

  3. Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.

  4. Profanity filtering - removing profanity and other words you do not want to predict.

Load the required libraries: stringi, tm, RWeka and ggplot2.
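
For example, assuming the packages are already installed:

library(stringi) # fast string and word counting
library(tm)      # text mining framework and corpus handling
library(RWeka)   # NGramTokenizer for n-gram tokenization
library(ggplot2) # plotting word frequencies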

Load data

This is the training data downloaded from the Coursera site. Here we use only the US English language data files for analysis and model building.

options(warn = -1)
setwd("C:/Installer/R/course-material/Assignments/C10-Assignment")

con <- file("./final/en_US/en_US.news.txt", open = "rb")
us_news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("./final/en_US/en_US.blogs.txt", open = "rb")
us_blog <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

con <- file("./final/en_US/en_US.twitter.txt", open = "rb")
us_twit <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Bad-word list (attributed to Google), used for profanity filtering
profanity <- readLines("full-list-of-bad-words-banned-by-google.txt")

Let us check the content of the loaded data by counting lines and words.
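
These counts can be produced with a small helper along the following lines (the helper name count_lines_words is illustrative and not part of the original code):

# Count lines and words for a character vector of text
count_lines_words <- function(text_data) {
    c(Line_Count = length(text_data),
      Word_Count = sum(stri_count_words(text_data)))
}

data.frame(Files = c("US blogs", "US news", "US twitter"),
           rbind(count_lines_words(us_blog),
                 count_lines_words(us_news),
                 count_lines_words(us_twit)))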

##        Files Line_Count Word_Count
## 1   US blogs     899288   37546246
## 2    US news    1010242   34762395
## 3 US twitter    2360148   30093410

Data sampling

To build models we do not need to load in and use all of the data. Often relatively few randomly selected rows or chunks are enough to get an accurate approximation of the results that would be obtained using all the data. We therefore create a separate sub-sample dataset by taking a random subset of the original data and writing it out to separate files.

# Randomly keep each line with probability sample_pct
rand_sample <- function(input_data, sample_pct) {
    input_data[as.logical(rbinom(length(input_data), 1, sample_pct))]
}

sample_pct <- 0.005 # set the sample size to 0.5% (a larger sample needs more processing power)
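
# For reproducible sampling, a seed could be set here
# (illustrative addition; not part of the original analysis)
set.seed(1234)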

sample_us_news <- rand_sample(us_news, sample_pct)
sample_us_blog <- rand_sample(us_blog, sample_pct)
sample_us_twit <- rand_sample(us_twit, sample_pct)

# Save sampled data in files
saveRDS(sample_us_news, file='./sample_en_US.news.RDS')
saveRDS(sample_us_blog, file='./sample_en_US.blogs.RDS')
saveRDS(sample_us_twit, file='./sample_en_US.twitter.RDS')

Let us check the content of the sampled data by counting lines and words.
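
Reusing the illustrative count_lines_words helper sketched above on the sampled data:

data.frame(Files = c("US blogs [sample]", "US news [sample]", "US twitter [sample]"),
           rbind(count_lines_words(sample_us_blog),
                 count_lines_words(sample_us_news),
                 count_lines_words(sample_us_twit)))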

##                 Files Line_Count Word_Count
## 1   US blogs [sample]       4637     194947
## 2    US news [sample]       5084     175566
## 3 US twitter [sample]      11794     150117

Clean up the data

I have used the tm package to convert the text dataset into a corpus, which is a structured set of texts used for further statistical analysis. The three samples (news, blogs, twitter) are combined into one corpus.

# Combine the three samples; wrapping in list() yields a single-document corpus
sample_us_corpus <- c(sample_us_news, sample_us_blog, sample_us_twit)
sample_us_corpus <- Corpus(VectorSource(list(sample_us_corpus)))

Looking at the data, I have planned to clean up the following in the corpus:

  1. Convert to lower case, so that the same words are grouped together

  2. Remove punctuation

  3. Remove numbers

  4. Strip multiple white spaces

  5. Remove bad words (profanity)

I have chosen not to remove stopwords because I want them to be part of the prediction as well.

sample_us_corpus <- tm_map(sample_us_corpus, PlainTextDocument)
sample_us_corpus <- tm_map(sample_us_corpus, tolower) # convert to lower case
sample_us_corpus <- tm_map(sample_us_corpus, removePunctuation) # remove punctuation
sample_us_corpus <- tm_map(sample_us_corpus, removeNumbers) # remove numbers
sample_us_corpus <- tm_map(sample_us_corpus, removeWords, c('"','=','–','—','“')) # remove stray characters
sample_us_corpus <- tm_map(sample_us_corpus, removeWords, profanity) # remove profanity
sample_us_corpus <- tm_map(sample_us_corpus, stripWhitespace) # strip extra whitespace

Let us check the content of the corpus by counting lines and words.

##    Files Line_Count Word_Count
## 1 Corpus          1     505446

Exploratory analysis

Now we need to tokenize each word or combination of words to check its frequency. We need to perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationships between words in the corpora. The top 30 most frequent terms in the corpus are reported below in the form of bar charts. A document-term matrix or term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms that occur in a collection of documents.
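
For reference, such a term-document matrix could be built directly with tm, although the frequency tables below are produced with RWeka tokenizers instead (a sketch, assuming sample_us_corpus is still a valid tm corpus after the cleaning steps):

tdm <- TermDocumentMatrix(sample_us_corpus)
findFreqTerms(tdm, lowfreq = 1000) # terms that occur at least 1000 times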

n-gram : In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.

An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n in modern language, e.g., “four-gram”, “five-gram”, and so on.
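
As a toy illustration, RWeka's NGramTokenizer (also used below) turns a single sentence into bigrams like this:

NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
## expected to give something like: "thanks for" "for the" "the follow"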

We use the RWeka package to create the n-gram tokens, which will be used later for predicting the next word based on a phrase of 1, 2, or 3 previous words.

options(java.parameters = "-Xmx2048m")
options(mc.cores = 1)

# Bar chart of the 30 most frequent terms
UnigramPlot <- function(input_data, xlabel, plotcolor)
{
    ggplot(data = input_data[1:30,], aes(reorder(word, frequency), frequency)) +
    geom_bar(stat = "identity", fill = plotcolor) + coord_flip() +
    labs(x = xlabel, y = "Frequency", title = paste("Plot", xlabel, "by Frequency/Occurrence"))
}

# Building unigram or 1-gram tokenization
unigram_token <- NGramTokenizer(sample_us_corpus, Weka_control(min = 1, max = 1))
unigram_token <- data.frame(table(unigram_token))
names(unigram_token) <- c("word","frequency")
unigram_token <- unigram_token[order(unigram_token$frequency, decreasing = TRUE),]

# Building bigram or 2-gram tokenization
bigram_token <- NGramTokenizer(sample_us_corpus, Weka_control(min = 2, max = 2))
bigram_token <- data.frame(table(bigram_token))
names(bigram_token) <- c("word","frequency")
bigram_token <- bigram_token[order(bigram_token$frequency, decreasing = TRUE),]

# Building trigram or 3-gram tokenization
trigram_token <- NGramTokenizer(sample_us_corpus, Weka_control(min = 3, max = 3))
trigram_token <- data.frame(table(trigram_token))
names(trigram_token) <- c("word","frequency")
trigram_token <- trigram_token[order(trigram_token$frequency, decreasing = TRUE),]

UnigramPlot(unigram_token,"Unigram word", "yellow") #bar chart for unigram

UnigramPlot(bigram_token,"Bigram word", "orange")   #bar chart for bigram

UnigramPlot(trigram_token,"Trigram word", "red") #bar chart for trigram

#Save tokenized data into files
saveRDS(unigram_token, file='./sample_en_US.unigram.RDS')
saveRDS(bigram_token, file='./sample_en_US.bigram.RDS')
saveRDS(trigram_token, file='./sample_en_US.trigram.RDS')
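
As a preview of how the saved trigram table could eventually be used, here is a minimal sketch of next-word lookup (the predict_next_word function is illustrative only; the actual model, with smoothing and backoff, is part of the next phase):

# Given the two previous words, return the next words from the most frequent
# matching trigrams (trigram_token is already sorted by decreasing frequency)
predict_next_word <- function(prev_two_words, top_n = 3) {
    matches <- trigram_token[startsWith(as.character(trigram_token$word),
                                        paste0(prev_two_words, " ")), ]
    head(sapply(strsplit(as.character(matches$word), " "), function(x) x[3]), top_n)
}

predict_next_word("one of")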

Conclusion

I have a few observations from the above exploratory analysis:

The next plan: