Natural Language Processing (NLP) is a field of study that focuses on enabling "computers to process and understand human languages." I am working on the capstone project for the Data Science Specialization, with the objective of building a web app that predicts the next word from the ones typed before it, much like the text prediction in cell phone SMS apps. The motivation behind this is SwiftKey, a company that writes texting software for smartphones.
When building a data product, the first step is to get the data and the second is to examine (explore) it. Exploring the data is necessary to gain a better understanding of its content and to prepare it for modeling and application building. As such, the text mining process follows a specific set of steps to get the data ready for further statistical analysis and for developing data mining applications.
The objective of this blog is to show the steps taken to explore unstructured text. Since the data is made available in zip format, we are not going to look at how to scrape data from the web here, nor will we see how to develop the final product. The raw text used here was randomly scraped from the web and includes news snippets, blog posts and Twitter messages. To explore the data we will be using several text mining packages in R.
This is a reproducible document that shows each of the steps: downloading the data from the source, importing it into R, sampling it, loading the sample into a corpus, tidying/cleaning the corpus, generating three n-gram models (unigram, bigram, trigram) with text mining packages, exploring the data by looking at the frequency distribution of words, visualizing the data with wordcloud and ggplot, and drawing conclusions from what is observed.
The first step is to import the data. A combination of R and UNIX command lines (for sampling) is used to import the data into R and create a corpus. A corpus is a container framework for several different pieces of data from various sources, similar to an SQL database that holds several tables, except that the data in a corpus is often unstructured: it does not necessarily fit neatly into a rectangular data frame as required by structured databases, and there is no specific definition of variables, rows and columns. In our case we are going to place the news snippets, the blog text and the Twitter messages into one software container, a corpus. This provides a common interface to all the documents that reside in the corpus, making it efficient to work with thousands or even millions of documents at once.
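As a minimal sketch of the idea (using the tm package that is loaded in the setup below, with two made-up example documents), a corpus can be built from any character vector and every document queried through the same interface:
library(tm)                                           # text mining framework
docs <- c("SwiftKey builds smart keyboards.",         # hypothetical example documents
          "We want to predict the next word you type.")
toyCorpus <- VCorpus(VectorSource(docs))              # wrap the character vector in a corpus
length(toyCorpus)                                     # number of documents: 2
as.character(toyCorpus[[1]])                          # inspect the content of the first document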
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone") #set working direcotry
library(tm) # Text Mining Package Version: 0.6-2
library(RWeka) # An R interface to Weka (Version 3.7.13) - RWeka version 0.4-2
library(stringi) # String Processing Package Version: stringi_1.0-1
library(stringr) # String manipulation Package Version: stringr_1.0.0
library(wordcloud) # Word Clouds Version: 2.5
library(ggplot2) # Grammar of graphics in R Version: 2.1.0
library(dplyr) # Grammar of data manipulation Version: 0.1
options(mc.cores=1) # Change the default core processor to use - for RWeka set to 1
# rm(list = ls())
Download and unzip the data.
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip", method = "curl")
unzip("./data/Coursera-SwiftKey.zip") #unizp the file; #creates a directory named final
dir("final") #confirm data is downloaded
Next, we read the raw files into R and sample 5% of each file.
#read the raw text files
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone")
news <- scan(file = './final/en_US/en_US.news.txt' , sep = '\n', what = '', skipNul = TRUE)
blog <- scan(file = './final/en_US/en_US.blogs.txt' , sep = '\n', what = '', skipNul = TRUE)
tweet <- scan(file = './final/en_US/en_US.twitter.txt', sep = '\n', what = '', skipNul = TRUE)
news.data  <- sample(news,  round(length(news)  * .05), replace = FALSE) # sample 5% of the news data without replacement
blog.data  <- sample(blog,  round(length(blog)  * .05), replace = FALSE) # sample 5% of the blog data without replacement
tweet.data <- sample(tweet, round(length(tweet) * .05), replace = FALSE) # sample 5% of the tweet data without replacement
Examine the raw data to list the number of bytes, characters and total number of words. We are starting with 102.4 million words.
library(stringi)
library(stringr)
# -c The number of bytes
# -l The number of lines
# -m The number of characters
# -w The number of words
# system("wc -clmw final/en_US/en_US.*.txt")
stri_stats_general(blog) # Blog stats
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news) # News stats
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general(tweet) # Tweet stats
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
stri_stats_general(c(blog, news, tweet)) # Combined stats for the raw text
## Lines LinesNEmpty Chars CharsNWhite
## 4269678 4269678 572143777 474333211
Total_word_count <- stri_count_words(c(blog, news, tweet)) # Word count
sum(Total_word_count) # 102,402,051 - there are 102.4 million words
## [1] 102402051
#summary(Total_word_count)
rm(news, blog, tweet) # Remove raw files
Using the stringi package, extract the total number of words in the sampled data.
stri_stats_general(c(blog.data, news.data, tweet.data))
## Lines LinesNEmpty Chars CharsNWhite
## 213483 213483 28471772 23603245
Total_word_count <- stri_count_words(c(blog.data, news.data, tweet.data))
totalWords <- sum(Total_word_count)
The sampled data contains totalWords words, i.e. about 5% of the 102.4 million raw words.
Before importing the three files into a corpus we split the lines into sentences. For that we use a regular expression that identifies sentences ending with ".", "!" or "?", while trying not to treat the period in "St. Louis" or "Lyndon B. Johnson" as the end of a sentence.
news.data1 <- stri_split_regex(c(news.data) ,"( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s") # split lines into sentences
blog.data1 <- stri_split_regex(c(blog.data) , "( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")
tweet.data1 <- stri_split_regex(c(tweet.data), "( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")
Next, combine the three sentence-split samples and create the corpus:
data.Corpus <- VCorpus(VectorSource(c(blog.data1, news.data1, tweet.data1))) #corpus
Display detailed information on a corpus with the following commands.
inspect(data.Corpus)
summary(data.Corpus)
We can view the content inside the corpus as follows; as an example:
head(as.character(data.Corpus[[3]]))
Now that we have our collection of text sitting neatly in the corpus, we need to do some preliminary cleaning (or tidying) of the data. This includes removing profane words, white space, numbers and punctuation. When cleaning the corpus, it is important to follow a specific order of steps to avoid losing certain key words.
The first function removes non-English letters and punctuation, including the characters used to create emoji in tweets. We also import a list of profane words saved in a file, so they can be matched and removed. We then create a function that chains together tm_map calls to transform the content.
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone")
rmvNonEngGsub <- function(x) { gsub(pattern="[^[:alpha:]]", " ", x) } # replace anything that is not a letter with a space
con1 <- file("./final/profanewords.txt", "r") # import a collection of profane words saved as txt
profanewords <- readLines(con1)
close(con1)
preProcessFunction <- function(myCorpus) { # main function that will clean the corpus
myCorpus <- tm_map(myCorpus, rmvNonEngGsub)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeWords, profanewords)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
return (myCorpus)
}
Next, we clean the data using the function created above.
system.time(data.corpus_clean <- preProcessFunction(data.Corpus)) #clean the data
## user system elapsed
## 6.564 0.024 6.586
# The following commands can be used to examine the corpus; for brevity they are commented out.
#inspect(data.corpus_clean)
#summary(data.corpus_clean)
#as.character(data.corpus_clean[[2002]])
We start tokenizing the data with the following commands to see the distribution of word frequencies in the corpus. We can also see the sparsity (how much of the term-document matrix is zero, as a percentage) and the maximum term length in the dataset.
unigramTDM <- TermDocumentMatrix(data.corpus_clean)
#inspect(unigramTDM)
#<<TermDocumentMatrix (terms: 49103, documents: 26976)>>
#Non-/sparse entries: 386812/1324215716
#Sparsity : 100%
#Maximal term length: 74
#Weighting : term frequency (tf)
#summary(unigramTDM)
#unigramTDM <- removeSparseTerms(unigramTDM , 0.95) #reduce sparsity
m <- as.matrix(unigramTDM)
v <- sort(rowSums(m),decreasing=TRUE)
df_unigram <- data.frame(words = names(v),freq=v)
row.names(df_unigram) <- NULL
head(df_unigram, 10)
## words freq
## 1 said 516
## 2 one 479
## 3 will 459
## 4 just 418
## 5 like 389
## 6 can 354
## 7 time 312
## 8 get 310
## 9 new 301
## 10 good 243
#tail(df_unigram, 10)
#View(df_unigram)
options(mc.cores=1)
biagramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))} # bigram tokenizer function
biagramTDM <- TermDocumentMatrix(data.corpus_clean, control=list(tokenize = biagramTokenizer))
biagramTDM # summary of the bigram term-document matrix
## <<TermDocumentMatrix (terms: 75327, documents: 5394)>>
## Non-/sparse entries: 79538/406234300
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
#inspect(biagramTDM)
#biagramTDM <- removeSparseTerms(biagramTDM, 0.99) #reduce sparsity
#head(inspect(biagramTDM),20)
#inspect(biagramTDM)
#as.character(biagramTDM[[1]])
m <- as.matrix(biagramTDM) # convert to matrix
v <- sort(rowSums(m),decreasing=TRUE) #sort high to low
df_biagram <- data.frame(words = names(v),freq=v) #convert to dataframe
row.names(df_biagram) <- NULL # remove row labels
head(df_biagram, 10)
## words freq
## 1 new york 32
## 2 last year 31
## 3 right now 29
## 4 feel like 27
## 5 last week 24
## 6 first time 17
## 7 last night 17
## 8 just like 16
## 9 make sure 16
## 10 new jersey 16
#tail(df_biagram, 10)
#View(df_biagram)
triagramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))} # trigram tokenizer function
triagramTDM <- TermDocumentMatrix(data.corpus_clean, control=list(tokenize = triagramTokenizer))
triagramTDM
## <<TermDocumentMatrix (terms: 74725, documents: 5394)>>
## Non-/sparse entries: 74833/402991817
## Sparsity : 100%
## Maximal term length: 50
## Weighting : term frequency (tf)
#triagramTDM <- removeSparseTerms(triagramTDM, 0.80) #reduce sparsity
#head(inspect(triagramTDM),10)
m <- as.matrix(triagramTDM) #convert to matrix
v <- sort(rowSums(m),decreasing=TRUE) #sort high to low
df_triagram <- data.frame(words = names(v),freq=v) #convert to dataframe
row.names(df_triagram) <- NULL # remove row labels
head(df_triagram, 10)
## words freq
## 1 bmw z i 3
## 2 city council members 3
## 3 green bay packers 3
## 4 happy mothers day 3
## 5 la la la 3
## 6 long time ago 3
## 7 martin luther king 3
## 8 new york city 3
## 9 next next next 3
## 10 one main things 3
#tail(df_triagram,10)
#View(df_triagram)
If we take a look at the histogram of word frequencies, we can see that the vast majority of the words used in the news, Twitter and blog texts appear only a handful of times, while a few words are repeated very often. In other words, a small set of words accounts for the vast majority of all word occurrences.
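The histogram itself is not reproduced here, but a rough sketch of it can be drawn from df_unigram with ggplot2 (loaded above); the log10 x-axis is an assumption made so the long right tail stays visible:
library(ggplot2)
ggplot(df_unigram, aes(x = freq)) +
  geom_histogram(bins = 30, fill = "steelblue") +   # distribution of unigram frequencies
  scale_x_log10() +                                 # most words occur only once or twice
  labs(x = "word frequency (log10 scale)", y = "number of words",
       title = "Unigram frequency distribution")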
To examine what share of the total word occurrences is represented in the frequency distribution, we extract the frequency counts, group them into bins (top 5, top 10, top 15, etc.) and calculate the relative frequencies. As the table shows, the top 5% of words account for 86% of the total frequency, and the next 5% account for an additional 7.1%; combined, the top 10% of words account for roughly 93% of the frequency.
options(scipen=2) # format digits to avoid scientific notation (e.g. 0e+00)
tokenFrequency <- df_unigram$freq # Unigram token frequency
range(df_unigram$freq) # check the spread of the frequencies
## [1] 1 516
breaks = seq(1, max(tokenFrequency) + 4, by = 4) # cover the full range of frequencies
token.counts = cut(tokenFrequency,
breaks,
right=FALSE) # bin token frequencies into intervals of width 4
tokenFrequency.freq = table(token.counts) # token frequency counts in each interval
tokenFreq <- as.data.frame(tokenFrequency.freq) #convert to data frame
tokenFreq <- tokenFreq %>%
mutate( rltvFreq = Freq / sum(Freq)) # add column relative frequency distribution
head(tokenFreq, 10)
## token.counts Freq rltvFreq
## 1 [1,5) 16155 0.831446217
## 2 [5,9) 1521 0.078281009
## 3 [9,13) 572 0.029439012
## 4 [13,17) 322 0.016572311
## 5 [17,21) 186 0.009572826
## 6 [21,25) 139 0.007153886
## 7 [25,29) 97 0.004992280
## 8 [29,33) 81 0.004168811
## 9 [33,37) 54 0.002779207
## 10 [37,41) 42 0.002161606
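A more direct way to check this coverage claim is to compute the cumulative share of all word occurrences captured as we add words from most to least frequent; a minimal sketch, assuming df_unigram is sorted by decreasing frequency as above:
cumCoverage <- cumsum(df_unigram$freq) / sum(df_unigram$freq)  # cumulative share of all occurrences
topShare    <- seq_along(cumCoverage) / nrow(df_unigram)       # share of unique words included so far
cumCoverage[which.min(abs(topShare - 0.05))]                   # coverage by the top 5% of words
cumCoverage[which.min(abs(topShare - 0.10))]                   # coverage by the top 10% of words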
Here are some exploratory statistics (mean, median, variance and standard deviation) for the sampled data.
mean(df_unigram$freq) # measure of center distribution
## [1] 4.280261
median(df_unigram$freq) # median is a better representation of center
## [1] 1
var(df_unigram$freq) # variance
## [1] 197.6429
sd(df_unigram$freq)
## [1] 14.05855
par(mfrow = c(1, 3)) # arrange the three word clouds side by side
library(wordcloud)
wordcloud(words = df_unigram$words, freq = df_unigram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
wordcloud(words = df_biagram$words, freq = df_biagram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
wordcloud(words = df_triagram$words, freq = df_triagram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
What is the frequency distribution of words? What are the top 20 most frequent words in the entire corpus?
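The introduction promised a ggplot view of the data, so here is a sketch of the top 20 unigrams as a bar chart (assuming df_unigram from above):
library(ggplot2)
top20 <- head(df_unigram, 20)                                   # 20 most frequent unigrams
ggplot(top20, aes(x = reorder(words, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +                                                # horizontal bars are easier to read
  labs(x = "word", y = "frequency", title = "Top 20 words in the sampled corpus")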
What are the frequencies of 2-grams and 3-grams in the data set?
head(df_biagram, 10) #top 10
## words freq
## 1 new york 32
## 2 last year 31
## 3 right now 29
## 4 feel like 27
## 5 last week 24
## 6 first time 17
## 7 last night 17
## 8 just like 16
## 9 make sure 16
## 10 new jersey 16
head(df_triagram, 10) #top 10
## words freq
## 1 bmw z i 3
## 2 city council members 3
## 3 green bay packers 3
## 4 happy mothers day 3
## 5 la la la 3
## 6 long time ago 3
## 7 martin luther king 3
## 8 new york city 3
## 9 next next next 3
## 10 one main things 3
quantile(df_unigram$freq, .5) # 50th percentile (median) of word frequencies
## 50%
## 1
quantile(df_unigram$freq, .9) # 90th percentile of word frequencies
## 90%
## 8
How do we evaluate how many of the words come from a foreign language? We can use a regular expression to either select only foreign letters or exclude English letters.
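One possible sketch of that idea, using the stringi package loaded above: treat any term in the cleaned vocabulary that still contains non-ASCII characters as a candidate foreign word (note that the earlier [^[:alpha:]] cleanup is locale dependent and may already have removed many of them):
terms    <- as.character(df_unigram$words)
nonAscii <- !stri_enc_isascii(terms)   # TRUE for terms containing non-ASCII characters
sum(nonAscii)                          # number of candidate foreign terms
head(terms[nonAscii])                  # inspect a few of them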
Can you think of a way to increase the coverage: identifying words that may not be in the corpora, or using a smaller number of words in the dictionary to cover the same number of phrases?
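One way to think about this, reusing the cumCoverage vector sketched above, is to count how many unique words are needed to cover 50% and 90% of all word instances; a smaller dictionary could then map rare or unseen words to a single catch-all token (often written <UNK>) instead of storing them all.
which(cumCoverage >= 0.5)[1]   # unique words needed to cover 50% of all word instances
which(cumCoverage >= 0.9)[1]   # unique words needed to cover 90% of all word instances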
The first observation from exploring the raw unstructured data is that the combined data is "big data" for home computers. The combined corpus has 4.3 million lines and 102.4 million words before tidying. Attempting to sample at 10% or even 5% maxed out R's capacity when generating bigrams and trigrams; interestingly, 10% can still be handled from the UNIX command line. For this exploration I could only use a small fraction of the total data. Second, the relative frequency distribution and the n-gram models show that a small share of the unique words (well under 10%) occurs very frequently and accounts for more than 90% of all word occurrences. I plan to use some of the more advanced NLP techniques, such as normalization, lemmatization and morphological analysis, to improve the frequency distributions.
I used excerpts from, and learned from, the following resources:
* Wikipedia
* Quora
* Stanford NLP online course
* Columbia NLP online course
* slideshare.net
* Bioconductor
* Stack Overflow
* GitHub
* Google
* Yahoo