Data Science Capstone

Milestone Report - Natural Language Processing

Prepared by: Bernard NK

July 25, 2015

Executive summary

In this Data Science capstone we apply data modeling and prediction to natural language processing. The project is done in association with SwiftKey, a company developing smart prediction technology for easier mobile typing. A first step in this project is to familiarize ourselves with Natural Language Processing, Text Mining, and the associated tools in R. The goal of this milestone report is to demonstrate that we are familiar with the data and on track to create a prediction algorithm.

Data acquisition and basic summary

The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The dataset can be downloaded from the location below; once unzipped, it provides a folder “en_US/” containing the English text that we will use in our analysis:

destination_file <- "Coursera-SwiftKey.zip"
source_file <- "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# execute the download
download.file(source_file, destination_file)
# extract the files from the zip file
unzip(destination_file)

Let’s observe the data to understand what it looks like and how much effort is needed to clean the data.

The line count for each file, obtained with wc -l *.txt, is:

  899288 en_US.blogs.txt
 1010242 en_US.news.txt
 2360148 en_US.twitter.txt
 4269678 total

The word count for each file, obtained with wc -w *.txt, is:

 37334690 en_US.blogs.txt
 34372720 en_US.news.txt
 30374206 en_US.twitter.txt
 102081616 total
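
For reference, similar counts can be reproduced directly in R. The sketch below is illustrative only: the count_file helper is hypothetical, reading files this large is slow, and the counts may differ slightly from wc on edge cases such as empty lines.

# Sketch only: approximate the wc line and word counts in R (hypothetical helper; slow on files this size)
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(vapply(strsplit(lines, "\\s+"), length, integer(1)))  # split each line on whitespace
  c(lines = length(lines), words = words)
}
# count_file("Project/final/en_US/en_US.blogs.txt")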

File sizes (in megabytes, MB):

file.info("Project/final/en_US/en_US.blogs.txt")$size   / 1024^2
## [1] 200.4242
file.info("Project/final/en_US/en_US.news.txt")$size    / 1024^2
## [1] 196.2775
file.info("Project/final/en_US/en_US.twitter.txt")$size / 1024^2
## [1] 159.3641

Given the file sizes above, caching will speed up the long computations and the plots that are expensive to generate in knitr.

library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
opts_chunk$set(cache=TRUE, cache.path = 'DocumentName_cache/', fig.path='figure/')

The corpora are contained in three separate plain-text files, which we import as follows. Note that for the final capstone project, the line selection will be randomized (see the sketch after the import code below).

iblogs <- file("Project/final/en_US/en_US.blogs.txt")
blogs <- readLines(iblogs, n = 5000, encoding = "UTF-8")
close(iblogs)
inews <- file("Project/final/en_US/en_US.news.txt")
news <- readLines(inews, n = 5000, encoding = "UTF-8")
close(inews)
itwitter <- file("Project/final/en_US/en_US.twitter.txt")
twitters <- readLines(itwitter, n = 5000, encoding = "UTF-8")
close(itwitter)

news <- paste(blogs, news, twitters) # combine the three sources (paste joins line i of each into one string)
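
As mentioned above, the final model will use a randomized selection of lines rather than the first 5,000. A minimal sketch of one way to do this is shown below; the sample_lines helper, the 1% rate, and the use of rbinom() are illustrative assumptions, not part of this report's processing.

# Illustrative only: randomized line sampling planned for the final model
set.seed(1234)  # make the sample reproducible
sample_lines <- function(path, rate = 0.01) {
  all_lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(all_lines), size = 1, prob = rate) == 1  # flip a coin for each line
  all_lines[keep]
}
# blogs_sample <- sample_lines("Project/final/en_US/en_US.blogs.txt")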

In the next section, basic data tables are created from this text, including sorted frequencies of 1-grams, 2-grams, 3-grams, and 4-grams.

Cleaning the data

Tokenization

We create a corpus and identify appropriate tokens such as words, punctuation, and numbers. A corpus is a collection of texts, usually stored electronically, on which we perform our analysis. The tm package is described at http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf:

library(NLP)
## Warning: package 'NLP' was built under R version 3.1.3
library(tm)
## Warning: package 'tm' was built under R version 3.1.3
txt <- VectorSource(news)
corpus <- VCorpus(txt)
# inspect(corpus)

We modify the text by converting everything to lower case, removing punctuation, removing numbers, and stripping extra whitespace; common English stop words are intentionally kept so that they can be predicted. The tm_map function allows us to apply transformation functions to a corpus.

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
# corpus <- tm_map(corpus, removeWords, stopwords("english")) # stopwords intentionally kept so that we can predict them
corpus <- tm_map(corpus, stripWhitespace)

Next we perform stemming, which truncates words to their stem (e.g. “truncate”, “truncates”, and “truncating” all map to “truncat”), using the SnowballC package.

# Stemmed words do not need to be removed; per the TA's advice, they are more valuable for prediction
require(SnowballC)
corpus <- tm_map(corpus, stemDocument)
detach("package:SnowballC")

Profanity filtering

We remove profanity and other words we do not want to predict.

badwords <- readLines("Project/badwords.txt")  # one profane word per line; removeWords expects a character vector
corpus <- tm_map(corpus, removeWords, badwords)

Exploratory analysis

The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships we observe in the data and to prepare to build our first linguistic models.

We perform an exploratory analysis of the data to understand the distribution of words and the relationships between words in the corpora.

Prepare the corpus and n-grams:

# Flatten the corpus back into a data frame of plain text
corpus <- data.frame(text = unlist(sapply(corpus, `[`, "content")), stringsAsFactors = FALSE)

library(RWeka) #For n-gram vector generation
# Generate uni-gram, bi-gram, tri-gram, and quad-gram token sets
onetoken <- NGramTokenizer(corpus, Weka_control(min = 1, max = 1))
bitoken <- NGramTokenizer(corpus, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
tritoken <- NGramTokenizer(corpus, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
quadtoken <- NGramTokenizer(corpus, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))

Understand the distribution of word frequencies for single-word, bi-word, tri-word, and quad-word combinations:

We transform the n-gram tokens into data frames and order them by frequency for charting.

one <- data.frame(table(onetoken))
two <- data.frame(table(bitoken))
tri <- data.frame(table(tritoken))
quad <- data.frame(table(quadtoken))
onesorted <- one[order(one$Freq,decreasing = TRUE),]
twosorted <- two[order(two$Freq,decreasing = TRUE),]
trisorted <- tri[order(tri$Freq,decreasing = TRUE),]
quadsorted <- quad[order(quad$Freq,decreasing = TRUE),]
# single Word combinations
one20 <- onesorted[1:20,]
colnames(one20) <- c("Word","Frequency")
# bi-word combinations
two20 <- twosorted[1:20,]
colnames(two20) <- c("Word","Frequency")
# tri-word combinations
tri20 <- trisorted[1:20,]
colnames(tri20) <- c("Word","Frequency")
# quad-word combinations
quad20 <- quadsorted[1:20,]
colnames(quad20) <- c("Word","Frequency")

Here is a summary of the highest-frequency n-grams of each size:

head(onesorted)
##       onetoken  Freq
## 31755      the 21838
## 32227       to 11878
## 1181       and 11259
## 35           a 10557
## 21995       of  9279
## 15590       in  7248
head(twosorted)
##        bitoken Freq
## 140551  of the 2006
## 100406  in the 1897
## 211327  to the  981
## 143132  on the  851
## 75628  for the  840
## 209147   to be  750
head(trisorted)
##           tritoken Freq
## 230595  one of the  154
## 4326      a lot of  150
## 125239 going to be   77
## 333483     to be a   77
## 170670    it was a   69
## 151112   i want to   65
head(quadsorted)
##                 quadtoken Freq
## 335192     the end of the   34
## 122948 for the first time   32
## 345048    the rest of the   28
## 44881       at the end of   27
## 402293   when it comes to   27
## 173913   in the middle of   24

To understand how the frequencies are distributed, we count the words and word pairs that occur only once in the data:

oneunique <- onesorted[onesorted$Freq == 1,]
nrow(oneunique) # number of unique words
## [1] 19247
sum(onesorted$Freq) # total number of words in the text
## [1] 430667
twounique <- twosorted[twosorted$Freq == 1,]
nrow(twounique) # number of unique bi-grams
## [1] 195502
sum(twosorted$Freq) # total number of bi-grams in the text
## [1] 430666

An interesting observation is that the total number of single-word tokens is nearly the same as the total number of bi-grams. This is expected: a sequence of N word tokens yields roughly N-1 adjacent word pairs, so the two totals should almost coincide (see the toy example below).
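
A toy example (illustrative only, not part of the corpus analysis) makes the relationship concrete:

# Toy illustration: n word tokens give n - 1 adjacent bi-grams
toks <- strsplit("the quick brown fox jumps", " ")[[1]]
length(toks)                                   # 5 single words
length(paste(head(toks, -1), tail(toks, -1)))  # 4 bi-grams: "the quick", "quick brown", ...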

Plots to illustrate n-grams in the data

Top 20 single words (sorted alphabetically):

library(ggplot2)
ggplot(one20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 20 bi-grams (sorted alphabetically):

ggplot(two20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 20 tri-grams (sorted alphabetically):

ggplot(tri20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Top 20 quad-grams (sorted alphabetically):

ggplot(quad20, aes(x = Word, y = Frequency)) + geom_bar(stat = "identity", fill = "blue") + geom_text(aes(label = Frequency), vjust = -0.2) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Modeling

The goal is to build a first simple model of the relationships between words. This is the first step in building a predictive text-mining application; we will start by exploring simple models.

Findings so far:

  • When comparing the highest-frequency results, we did not find 4-grams helpful for predicting the next word.

  • A major tradeoff is the amount of data analyzed (corpus size) versus analysis time. Note that stop words were intentionally kept so that we can predict them, even though they account for a large share of the tokens.

  • Adding more lines of text to the target corpus does not always improve model accuracy. The model will therefore be built on qualitative n-gram criteria rather than purely quantitative ones.

  • The Twitter text includes many non-printable characters and is therefore difficult to clean and use.

Next steps:

  1. Build an n-gram model, using the exploratory analysis above (http://en.wikipedia.org/wiki/N-gram), for predicting the next word based on the previous 1, 2, or 3 words.
  2. Assess the Katz back-off model for accuracy.
  3. Generate a two-column table of unique n-grams and their frequencies by summing frequency counts.
  4. Match an n-gram character string with the appropriate (n+1)-gram entries in the n-gram frequency table.
  5. If there is a match, propose the highest-frequency words to the user (a minimal sketch of steps 3-5 follows this list).
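
A minimal sketch of the lookup in steps 3-5, reusing the twosorted bi-gram table built above. The predict_next helper and its "most frequent continuation" rule are illustrative assumptions; no back-off or smoothing is applied yet.

# Illustrative sketch only: suggest next words from the bi-gram frequency table
predict_next <- function(word, bigram_table = twosorted, n = 3) {
  grams <- as.character(bigram_table$bitoken)
  # keep bi-grams whose first token is the input word (the table is already sorted by Freq)
  matches <- grams[grepl(paste0("^", word, " "), grams)]
  if (length(matches) == 0) return(character(0))   # no match: a back-off step would go here
  sapply(strsplit(head(matches, n), " "), `[`, 2)  # return the top continuations
}
# predict_next("of")  # expected to suggest "the" first, given the counts above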

References

HC Corpora (www.corpora.heliohost.org)
Johns Hopkins Data Science Capstone, https://www.coursera.org/course/dsscapstone