Natural Language Processing (NLP) is a field of study that focuses on enabling "computers to process and understand human languages." I am working on the capstone project for the Data Science Specialization, with the objective of building a web app that predicts the next word from the ones typed before it, much like the text prediction in cell phone SMS apps. The motivation behind this is SwiftKey, a company that writes texting software for smartphones.
When building a data product, the first step is to get the data and the second is to examine (explore) it. Exploring the data is necessary to gain a better understanding of its content and to prepare it for modeling and application building. As such, the text mining process follows a specific set of steps to get the data ready for further statistical analysis and for developing data mining applications.
The objective of this blog is to show the steps taken to explore unstructured text. Since the data is made available in zip format, we are not going to look at how to scrape data from the web here, nor will we see how to develop the final product. The raw text used here was randomly scraped from the web and includes news snippets, blog posts and Twitter messages. To explore the data we will be using several text mining packages in R.
This is a reproducible document that shows each of the steps: downloading the data from the source, importing it into R, sampling it, loading the sample into a corpus, tidying/cleaning the corpus, generating three n-gram models (unigram, bigram, trigram) with text mining packages, exploring the data by looking at the frequency distribution of words, visualizing the data with wordcloud and ggplot, and drawing conclusions from what is observed.
The first step is to import the data. A combination of R and UNIX command lines (for sampling) is used to import the data into R and create a corpus. A corpus is a container framework for several different pieces of data from various sources, similar to an SQL database that holds several tables, except that the data in a corpus is often unstructured: it does not necessarily fit neatly into a rectangular data frame as required by structured databases, and there is no specific definition of variables, rows and columns. In our case we are going to place the news snippets, the blog text and the Twitter messages into one software container, a corpus. This provides a common interface to all the documents that reside in the corpus, making it efficient to work with thousands or even millions of documents at once.
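As a minimal sketch of the idea (using the tm package that is loaded in the setup below, with two made-up example documents), a corpus can be built from any character vector and every document queried through the same interface:
library(tm)                                           # text mining framework
docs <- c("SwiftKey builds smart keyboards.",         # hypothetical example documents
          "We want to predict the next word you type.")
toyCorpus <- VCorpus(VectorSource(docs))              # wrap the character vector in a corpus
length(toyCorpus)                                     # number of documents: 2
as.character(toyCorpus[[1]])                          # inspect the content of the first document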
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone") #set working direcotry
library(tm) # Text Mining Package Version: 0.6-2
library(RWeka) # An R interface to Weka (Version 3.7.13) - RWeka version 0.4-2
library(stringi) # String Processing Package Version: stringi_1.0-1
library(stringr) # String manipulation Package Version: stringr_1.0.0
library(wordcloud) # Word Clouds Version: 2.5
library(ggplot2) # Grammar of graphics in R Version: 2.1.0
library(dplyr) # Grammar of data manipulation Version: 0.1
options(mc.cores=1) # Change the default core processor to use - for RWeka set to 1
# rm(list = ls())
Download and unzip the data.
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
download.file(fileUrl, destfile = "./data/Coursera-SwiftKey.zip", method = "curl")
unzip("./data/Coursera-SwiftKey.zip") #unizp the file; #creates a directory named final
dir("final") #confirm data is downloaded
Next, we read the raw files into R and sample 5% of each file.
#read the raw text files
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone")
news <- scan(file = './final/en_US/en_US.news.txt' , sep = '\n', what = '', skipNul = TRUE)
blog <- scan(file = './final/en_US/en_US.blogs.txt' , sep = '\n', what = '', skipNul = TRUE)
tweet <- scan(file = './final/en_US/en_US.twitter.txt', sep = '\n', what = '', skipNul = TRUE)
news.data  <- sample(news,  round(length(news)  * .05), replace = FALSE) # sample 5% of the news data without replacement
blog.data  <- sample(blog,  round(length(blog)  * .05), replace = FALSE) # sample 5% of the blog data without replacement
tweet.data <- sample(tweet, round(length(tweet) * .05), replace = FALSE) # sample 5% of the tweet data without replacement
Examine the raw data to list the number of bytes, characters and total number of words. We are starting with 102.4 million words.
library(stringi)
library(stringr)
# -c The number of bytes
# -l The number of lines
# -m The number of characters
# -w The number of words
# system("wc -clmw final/en_US/en_US.*.txt")
stri_stats_general(blog) # Blog stats
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
stri_stats_general(news) # News stats
## Lines LinesNEmpty Chars CharsNWhite
## 1010242 1010242 203223154 169860866
stri_stats_general(tweet) # Tweet stats
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
stri_stats_general(c(blog, news, tweet)) # Combined stats for the raw text
## Lines LinesNEmpty Chars CharsNWhite
## 4269678 4269678 572143777 474333211
Total_word_count <- stri_count_words(c(blog, news, tweet)) # Word count
sum(Total_word_count) # 102,402,051 - there are 102.4 million words
## [1] 102402051
#summary(Total_word_count)
rm(news, blog, tweet) # Remove raw files
Using the stringi package, extract the total number of words in the sampled data.
stri_stats_general(c(blog.data, news.data, tweet.data))
## Lines LinesNEmpty Chars CharsNWhite
## 213483 213483 28471772 23603245
Total_word_count <- stri_count_words(c(blog.data, news.data, tweet.data))
totalWords <- sum(Total_word_count)
The sampled data contains totalWords words, i.e. about 5% of the 102.4 million raw words.
Before importing the three files into a corpus we split the lines into sentences. For that we use a regular expression that identifies sentences ending with ".", "!" or "?", while trying not to treat the period in "St. Louis" or "Lyndon B. Johnson" as the end of a sentence.
news.data1 <- stri_split_regex(c(news.data) ,"( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s") # split lines into sentences
blog.data1 <- stri_split_regex(c(blog.data) , "( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")
tweet.data1 <- stri_split_regex(c(tweet.data), "( [A-Z]\\. )(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?|\\!)\\s")
Next, combine the three sentence-split samples and create the corpus:
data.Corpus <- VCorpus(VectorSource(c(blog.data1, news.data1, tweet.data1))) #corpus
Display detailed information on a corpus with the following commands.
inspect(data.Corpus)
summary(data.Corpus)
We can view the content inside the corpus as follows; as an example:
head(as.character(data.Corpus[[3]]))
Now that we have our collection of text sitting neatly in the corpus, we need to do some preliminary cleaning (or tidying) of the data. This includes removing profane words, white space, numbers and punctuation. When cleaning the corpus, it is important to follow a specific order of steps to avoid losing certain key words.
The first function removes non-English letters and punctuation, including the characters used to create emoji in tweets. We also import a list of profane words saved in a file, so they can be matched and removed. We then create a function that chains together tm_map calls to transform the content.
setwd("~/Documents/Data-Science/DataScienceSpecialization/Capstone")
rmvNonEngGsub <- function(x) { gsub(pattern="[^[:alpha:]]", " ", x) } # replace anything that is not a letter with a space
con1 <- file("./final/profanewords.txt", "r") # import a collection of profane words saved as txt
profanewords <- readLines(con1)
close(con1)
preProcessFunction <- function(myCorpus) { # main function that will clean the corpus
myCorpus <- tm_map(myCorpus, rmvNonEngGsub)
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeWords, profanewords)
myCorpus <- tm_map(myCorpus, removeNumbers)
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stripWhitespace)
myCorpus <- tm_map(myCorpus, PlainTextDocument)
return (myCorpus)
}
Next, we clean the data using the function created above.
system.time(data.corpus_clean <- preProcessFunction(data.Corpus)) #clean the data
## user system elapsed
## 6.564 0.024 6.586
# The following commands can be used to examine the corpus; for brevity they are commented out.
#inspect(data.corpus_clean)
#summary(data.corpus_clean)
#as.character(data.corpus_clean[[2002]])
We start tokenizing the data with the following commands to see the distribution of word frequencies in the corpus. We can also see the sparsity (how much of the term-document matrix is zero, as a percentage) and the maximum term length in the dataset.
unigramTDM <- TermDocumentMatrix(data.corpus_clean)
#inspect(unigramTDM)
#<<TermDocumentMatrix (terms: 49103, documents: 26976)>>
#Non-/sparse entries: 386812/1324215716
#Sparsity : 100%
#Maximal term length: 74
#Weighting : term frequency (tf)
#summary(unigramTDM)
#unigramTDM <- removeSparseTerms(unigramTDM , 0.95) #reduce sparsity
m <- as.matrix(unigramTDM)
v <- sort(rowSums(m),decreasing=TRUE)
df_unigram <- data.frame(words = names(v),freq=v)
row.names(df_unigram) <- NULL
head(df_unigram, 10)
## words freq
## 1 said 516
## 2 one 479
## 3 will 459
## 4 just 418
## 5 like 389
## 6 can 354
## 7 time 312
## 8 get 310
## 9 new 301
## 10 good 243
#tail(df_unigram, 10)
#View(df_unigram)
options(mc.cores=1)
biagramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))} # bigram tokenizer function
biagramTDM <- TermDocumentMatrix(data.corpus_clean, control=list(tokenize = biagramTokenizer))
biagramTDM # summary of the bigram term-document matrix
## <<TermDocumentMatrix (terms: 75327, documents: 5394)>>
## Non-/sparse entries: 79538/406234300
## Sparsity : 100%
## Maximal term length: 38
## Weighting : term frequency (tf)
#inspect(biagramTDM)
#biagramTDM <- removeSparseTerms(biagramTDM, 0.99) #reduce sparsity
#head(inspect(biagramTDM),20)
#inspect(biagramTDM)
#as.character(biagramTDM[[1]])
m <- as.matrix(biagramTDM) # convert to matrix
v <- sort(rowSums(m),decreasing=TRUE) #sort high to low
df_biagram <- data.frame(words = names(v),freq=v) #convert to dataframe
row.names(df_biagram) <- NULL # remove row labels
head(df_biagram, 10)
## words freq
## 1 new york 32
## 2 last year 31
## 3 right now 29
## 4 feel like 27
## 5 last week 24
## 6 first time 17
## 7 last night 17
## 8 just like 16
## 9 make sure 16
## 10 new jersey 16
#tail(df_biagram, 10)
#View(df_biagram)
triagramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 3, max = 3))} # trigram tokenizer function
triagramTDM <- TermDocumentMatrix(data.corpus_clean, control=list(tokenize = triagramTokenizer))
triagramTDM
## <<TermDocumentMatrix (terms: 74725, documents: 5394)>>
## Non-/sparse entries: 74833/402991817
## Sparsity : 100%
## Maximal term length: 50
## Weighting : term frequency (tf)
#triagramTDM <- removeSparseTerms(triagramTDM, 0.80) #reduce sparsity
#head(inspect(triagramTDM),10)
m <- as.matrix(triagramTDM) #convert to matrix
v <- sort(rowSums(m),decreasing=TRUE) #sort high to low
df_triagram <- data.frame(words = names(v),freq=v) #convert to dataframe
row.names(df_triagram) <- NULL # remove row labels
head(df_triagram, 10)
## words freq
## 1 bmw z i 3
## 2 city council members 3
## 3 green bay packers 3
## 4 happy mothers day 3
## 5 la la la 3
## 6 long time ago 3
## 7 martin luther king 3
## 8 new york city 3
## 9 next next next 3
## 10 one main things 3
#tail(df_triagram,10)
#View(df_triagram)
If we take a look at the histogram of word frequencies, we can see that the vast majority of the words used in the news, Twitter and blog texts appear only a handful of times, while a few words are repeated very often. In other words, a small set of words accounts for the vast majority of all word occurrences.
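The histogram itself is not reproduced here, but a rough sketch of it can be drawn from df_unigram with ggplot2 (loaded above); the log10 x-axis is an assumption made so the long right tail stays visible:
library(ggplot2)
ggplot(df_unigram, aes(x = freq)) +
  geom_histogram(bins = 30, fill = "steelblue") +   # distribution of unigram frequencies
  scale_x_log10() +                                 # most words occur only once or twice
  labs(x = "word frequency (log10 scale)", y = "number of words",
       title = "Unigram frequency distribution")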
To examine what share of the total word occurrences is represented in the frequency distribution, we extract the frequency counts, group them into bins (top 5, top 10, top 15, etc.) and calculate the relative frequencies. As the table shows, the top 5% of words account for 86% of the total frequency, and the next 5% account for an additional 7.1%; combined, the top 10% of words account for roughly 93% of the frequency.
options(scipen=2) # format digits to avoid scientific notation (e.g. 0e+00)
tokenFrequency <- df_unigram$freq # Unigram token frequency
range(df_unigram$freq) # check the spread of the frequencies
## [1] 1 516
breaks = seq(1, max(tokenFrequency) + 4, by = 4) # cover the full range of frequencies
token.counts = cut(tokenFrequency,
breaks,
right=FALSE) # bin token frequencies into intervals of width 4
tokenFrequency.freq = table(token.counts) # token frequency counts in each interval
tokenFreq <- as.data.frame(tokenFrequency.freq) #convert to data frame
tokenFreq <- tokenFreq %>%
mutate( rltvFreq = Freq / sum(Freq)) # add column relative frequency distribution
head(tokenFreq, 10)
## token.counts Freq rltvFreq
## 1 [1,5) 16155 0.831446217
## 2 [5,9) 1521 0.078281009
## 3 [9,13) 572 0.029439012
## 4 [13,17) 322 0.016572311
## 5 [17,21) 186 0.009572826
## 6 [21,25) 139 0.007153886
## 7 [25,29) 97 0.004992280
## 8 [29,33) 81 0.004168811
## 9 [33,37) 54 0.002779207
## 10 [37,41) 42 0.002161606
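A more direct way to check this coverage claim is to compute the cumulative share of all word occurrences captured as we add words from most to least frequent; a minimal sketch, assuming df_unigram is sorted by decreasing frequency as above:
cumCoverage <- cumsum(df_unigram$freq) / sum(df_unigram$freq)  # cumulative share of all occurrences
topShare    <- seq_along(cumCoverage) / nrow(df_unigram)       # share of unique words included so far
cumCoverage[which.min(abs(topShare - 0.05))]                   # coverage by the top 5% of words
cumCoverage[which.min(abs(topShare - 0.10))]                   # coverage by the top 10% of words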
Here are some exploratory statistics (mean, median, variance and standard deviation) for the sampled data.
mean(df_unigram$freq) # measure of center distribution
## [1] 4.280261
median(df_unigram$freq) # median is a better representation of center
## [1] 1
var(df_unigram$freq) # variance
## [1] 197.6429
sd(df_unigram$freq)
## [1] 14.05855
par(mfrow = c(1, 3)) # arrange the three word clouds side by side
library(wordcloud)
wordcloud(words = df_unigram$words, freq = df_unigram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
wordcloud(words = df_biagram$words, freq = df_biagram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
wordcloud(words = df_triagram$words, freq = df_triagram$freq, min.freq = 3,
          max.words = 75, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
What is the frequency distribution of words? What are the top 20 most frequent words in the entire corpus?
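The introduction promised a ggplot view of the data, so here is a sketch of the top 20 unigrams as a bar chart (assuming df_unigram from above):
library(ggplot2)
top20 <- head(df_unigram, 20)                                   # 20 most frequent unigrams
ggplot(top20, aes(x = reorder(words, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +                                                # horizontal bars are easier to read
  labs(x = "word", y = "frequency", title = "Top 20 words in the sampled corpus")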
What are the frequencies of 2-grams and 3-grams in the data set?
head(df_biagram, 10) #top 10
## words freq
## 1 new york 32
## 2 last year 31
## 3 right now 29
## 4 feel like 27
## 5 last week 24
## 6 first time 17
## 7 last night 17
## 8 just like 16
## 9 make sure 16
## 10 new jersey 16
head(df_triagram, 10) #top 10
## words freq
## 1 bmw z i 3
## 2 city council members 3
## 3 green bay packers 3
## 4 happy mothers day 3
## 5 la la la 3
## 6 long time ago 3
## 7 martin luther king 3
## 8 new york city 3
## 9 next next next 3
## 10 one main things 3
quantile(df_unigram$freq, .5) # 50th percentile (median) of word frequencies
## 50%
## 1
quantile(df_unigram$freq, .9) # 90th percentile of word frequencies
## 90%
## 8
How do we evaluate how many of the words come from a foreign language? We can use a regular expression to either select only foreign letters or exclude English letters.
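One possible sketch of that idea, using the stringi package loaded above: treat any term in the cleaned vocabulary that still contains non-ASCII characters as a candidate foreign word (note that the earlier [^[:alpha:]] cleanup is locale dependent and may already have removed many of them):
terms    <- as.character(df_unigram$words)
nonAscii <- !stri_enc_isascii(terms)   # TRUE for terms containing non-ASCII characters
sum(nonAscii)                          # number of candidate foreign terms
head(terms[nonAscii])                  # inspect a few of them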
Can you think of a way to increase the coverage: identifying words that may not be in the corpora, or using a smaller number of words in the dictionary to cover the same number of phrases?
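One way to think about this, reusing the cumCoverage vector sketched above, is to count how many unique words are needed to cover 50% and 90% of all word instances; a smaller dictionary could then map rare or unseen words to a single catch-all token (often written <UNK>) instead of storing them all.
which(cumCoverage >= 0.5)[1]   # unique words needed to cover 50% of all word instances
which(cumCoverage >= 0.9)[1]   # unique words needed to cover 90% of all word instances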
The first observation from exploring the raw unstructured data is that the combined data is "big data" for home computers. The combined corpus has 4.3 million lines and 102.4 million words before tidying. Attempting to sample at 10% or even 5% maxed out R's capacity when generating bigrams and trigrams; interestingly, 10% can still be handled from the UNIX command line. For this exploration I could only use a small fraction of the total data. Second, the relative frequency distribution and the n-gram models show that a small share of the unique words (well under 10%) occurs very frequently and accounts for more than 90% of all word occurrences. I plan to use some of the more advanced NLP techniques, such as normalization, lemmatization and morphological analysis, to improve the frequency distributions.
I used excerpts from, and learned from, the following resources:
* Wikipedia
* Quora
* Stanford NLP online course
* Columbia NLP online course
* slideshare.net
* Bioconductor
* Stack Overflow
* GitHub
* Google
* Yahoo