Synopsis

The purpose of this document is to perform some natural language processing analysis on text samples from different sources such as blogs, news and Twitter. We take English as an easy first approach to learn the basic techniques that can be used on text samples; however, the algorithms and techniques presented here could be extended to other languages. Some R code is included in each chapter, but most of the source code can be found in the Appendix of this document.

Sample Data

Original datasets

The data was downloaded and extracted to the working directory from:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

It contains three main files:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt

The file.size() function gives us the size of each file on disk (1 MB = 1048576 bytes).

file.size("./en_US/en_US.blogs.txt")/ 1048576
## [1] 200.4242
file.size("./en_US/en_US.news.txt")/ 1048576
## [1] 196.2775
file.size("./en_US/en_US.twitter.txt")/ 1048576
## [1] 159.3641

Thus we are in the range of roughly 160 to 200 MB per file.

Some measures with the stringi package

Let's extract some basic statistics from the files, such as the number of lines, words and sentences, and the mean number of words per line. stri_read_lines() is a substitute for base R's readLines() function, with the ability to auto-detect the input encoding.

library(stringi,quietly = TRUE)
blogLines <- stri_read_lines("./en_US/en_US.blogs.txt")
newsLines <- stri_read_lines("./en_US/en_US.news.txt")
twitterLines <- stri_read_lines("./en_US/en_US.twitter.txt")
blogsWords <- stri_count_words(blogLines)
newsWords <- stri_count_words(newsLines)
twitterWords <- stri_count_words(twitterLines)
# build a table with all the information
data.frame(source = c("blogs", "news", "twitter"),
           nrLines = c(length(blogLines), length(newsLines), length(twitterLines)),
           nrWords = c(sum(blogsWords), 
                       sum(newsWords), 
                       sum(twitterWords)),
           nrSentences = c(sum(stri_count_boundaries(blogLines, type = "sentence")), 
                           sum(stri_count_boundaries(newsLines, type = "sentence")), 
                           sum(stri_count_boundaries(twitterLines, type = "sentence"))),
           meanWordsPerLine = c(round(mean(blogsWords)), round(mean(newsWords)), round(mean(twitterWords))))
##    source nrLines  nrWords nrSentences meanWordsPerLine
## 1   blogs  899288 37546246     2380481               42
## 2    news 1010242 34762395     2025776               34
## 3 twitter 2360148 30093410     3780376               13
rm(blogsWords)
rm(newsWords)
rm(twitterWords)

Conclusion: the Twitter source has the most lines but the fewest words per line, while the blogs source has the fewest lines but the most words per line.

Reduced datasets

Due to performance issues and size restrictions in the next steps of this project (running a Shiny app on a server), we need to reduce the sample data size. In order to keep the information heterogeneous, we decide to take samples from each of the three sources. However, since the goal is to build a standard text predictor and the language on Twitter is not very close to the standard way of writing sentences, we reduce the contribution of that source. Thus we take 4000 lines each from the news and blogs files, plus 2000 lines from Twitter: 10000 lines in total.

set.seed(6379)
textSampleData <- c(sample(blogLines, 4000),
                 sample(newsLines, 4000),
                 sample(twitterLines, 2000))
rm(blogLines)
rm(newsLines)
rm(twitterLines)

Cleaning Data

First we ensure that the encoding is ASCII, removing any non-ASCII characters from the sample data. Afterwards the corpus is created. Finally we do some cleaning before building any models.

# drop any non-ASCII characters from a character vector
convert2ASCII <- function(input){
    return(sapply(input, function(x) iconv(x, "latin1", "ASCII", sub = "")))
}
textSampleData <- convert2ASCII(textSampleData)
library(tm,quietly = TRUE)
corpusData <- VCorpus(VectorSource(textSampleData))
rm(textSampleData)

We choose to clean the following things from the sample data, defining a helper function where needed (a quick sanity check on these helpers follows the list):

  1. URLs, hashtags and Twitter references
removeURL <- function(x) {gsub("http\\S+", "", x)}
removeHashTags <- function(x) {gsub("#\\S+", "", x)}
removeTwitterRefs <- function(x) {gsub("@\\S+", "", x)}
  2. special characters such as *, %, $, etc. The "&" is replaced by "and".
removeSpecialChars <- function(x) {gsub("[*$%€~]", "", x)}
replaceAnd <- function(x) {gsub("\\&", " and ", x)}
  3. punctuation (we use the generic removePunctuation() function of tm) and the dots from abbreviations.
removeAbbv <- function(x) {gsub("\\.", "", x)}
  4. multiple consecutive white spaces (we use the generic stripWhitespace() from tm). This is done at the very last step.

  5. numbers (we use the generic removeNumbers() function of tm)
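
As a quick sanity check, here is a minimal sketch that applies these helpers to a made-up example sentence (the string itself is hypothetical and only serves to illustrate the transformations):

# hypothetical example string, only to illustrate the helpers defined above
example <- "Check http://example.com #nlp @someone: 50% off & more for $10!"
example <- removeURL(example)
example <- removeHashTags(example)
example <- removeTwitterRefs(example)
example <- removeSpecialChars(example)
example <- replaceAnd(example)
example
# roughly "Check 50 off and more for 10!" once the extra spaces are stripped later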

Let’s apply all these transformations to the corpus:

library(tm,quietly = TRUE)
cleanCorpus <- function(inputCorpus){
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeURL), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeHashTags), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeTwitterRefs), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeSpecialChars), lazy = TRUE)
    # replace "&" and drop abbreviation dots before the generic punctuation removal
    inputCorpus <- tm_map(inputCorpus, content_transformer(replaceAnd), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeAbbv), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removePunctuation), lazy = TRUE)
    inputCorpus <- tm_map(inputCorpus, content_transformer(removeNumbers), lazy = TRUE)
    # white space clean-up goes last, as stated above
    inputCorpus <- tm_map(inputCorpus, content_transformer(stripWhitespace), lazy = TRUE)
    return(inputCorpus)
}
cleanCorpusData <- cleanCorpus(corpusData)
rm(corpusData)

In addition, all text is transformed to lower case (again with tm_map) so that identical words are counted together. Profane words will be removed next with a profanity filter.

library(tm, quietly = TRUE)
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
cleanCorpusData <- tm_map(cleanCorpusData, content_transformer(tolower),lazy=TRUE)

Profanity filter

We found a GitHub repository with bad-word lists in many languages: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words. The English list is read with the extractSet() helper (see Appendix) and used as a profanity filter to remove bad words from our sample data, since we do not want to include them in the analysis.

library(tm, quietly = TRUE)
# read the file with the bad words
badwords <- extractSet("./badWordsEN.txt")
cleanCorpusData <- tm_map(cleanCorpusData, content_transformer(removeWords), badwords, lazy = TRUE)
cleanCorpusData <- tm_map(cleanCorpusData, content_transformer(stripWhitespace), lazy = TRUE)
#rm(badwords)

Stop words

We do not remove stop words, since we will need them for the prediction model.
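
For reference only (we do not apply this step), tm ships with a standard English stop word list; removing it would look like the commented-out line below:

library(tm, quietly = TRUE)
# the standard English stop word list shipped with tm (for reference only)
head(stopwords("en"))
# NOT applied to our corpus, since the predictor needs stop words:
# cleanCorpusData <- tm_map(cleanCorpusData, content_transformer(removeWords), stopwords("en"), lazy = TRUE)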

Type token ratio (TTR)

The type-token ratio (TTR) is a measure of vocabulary variation within a written text or a person's speech. In corpus linguistics, 'types' are the distinct words of a piece of text, while 'tokens' are all the words of that text, so the ratio is always <= 1. We will compute the TTR of our clean sample text data and interpret it, giving us a measure of lexical variety. A good explanation of this concept can be found here: https://www.sltinfo.com/wp-content/uploads/2014/01/type-token-ratio.pdf
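
As a toy example (the sentence below is made up purely for illustration), even a short text with repeated words has a TTR below 1:

library(stringi, quietly = TRUE)
# hypothetical toy sentence, only to illustrate the type/token distinction
toy <- "the cat sat on the mat and the dog sat too"
tokens <- unlist(stri_extract_all_words(tolower(toy)))
length(tokens)                          # 11 tokens
length(unique(tokens))                  # 8 types ("the" and "sat" repeat)
length(unique(tokens)) / length(tokens) # TTR = 8/11, roughly 0.73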

We first need to build the document-term matrix of the corpus. A document-term matrix represents the words of the text as a table (or matrix) of numbers: the rows correspond to the documents in the collection and the columns correspond to the terms (words) that are used in the analysis. Then we count the number of tokens and types in the corpus separately and take their ratio.

library(tm,quietly = TRUE)
dtmCorpus <- DocumentTermMatrix(cleanCorpusData)
nrTokens <- sum(as.matrix(dtmCorpus))
nrTypes <- length(dtmCorpus$dimnames$Terms)
nrTypes / nrTokens
## [1] 0.1140456

The sample does not seem very lexically diverse. A larger sample would cover more of the vocabulary, but the size limitations of the computer do not allow one; note also that TTR tends to decrease as text length grows, so TTR values are only comparable between texts of similar size.

Modelling Data

Let’s inspect the document term matrix.

inspect(dtmCorpus)
## <<DocumentTermMatrix (documents: 10000, terms: 29467)>>
## Non-/sparse entries: 219426/294450574
## Sparsity           : 100%
## Maximal term length: 85
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   and but for have that the this was with you
##   159   16   0   2    2    1  34    2   1    0   0
##   1904   5   3   4    1    8  21    2   0    1   0
##   2489   9   3   2    0    3   8    2   0    5   0
##   254    4   1   4    3    6  29    7   1    1   0
##   286   16   1   2    2    5  17    0   5    3   0
##   2908  22   4  11    6   16  28    1   0    1   0
##   3038  14   5   5    0    4  15    1   1    4   0
##   3410   9   3   4    4    5  15    0   1    2   3
##   560    7   1   1    1    5  17    2   0    5  14
##   827    7   3   5    1    5  10    0   0    3   8

So our matrix is very sparse, which is expected: each document contains only a small fraction of the roughly 29000 terms (and since we kept the stop words, the most frequent terms shown above are all stop words). Document-term matrices tend to get very large even for normal-sized data sets, so tm provides removeSparseTerms() to drop sparse terms, i.e., terms occurring in only very few documents. With a threshold of 0.99, terms absent from more than 99% of the documents are removed; normally this reduces the matrix dramatically without losing the significant relations inherent in it.

unigramDTM <- removeSparseTerms(dtmCorpus, 0.99)
inspect(unigramDTM)
## <<DocumentTermMatrix (documents: 10000, terms: 273)>>
## Non-/sparse entries: 93660/2636340
## Sparsity           : 97%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## Sample             :
##       Terms
## Docs   and but for have that the this was with you
##   1196   3   6   6    8    1  12    5   2    4   1
##   159   16   0   2    2    1  34    2   1    0   0
##   201   12   1   8    2    8  13    5   0    0   0
##   2723   4   3   1    3    5  10    1   1    2  14
##   286   16   1   2    2    5  17    0   5    3   0
##   2908  22   4  11    6   16  28    1   0    1   0
##   3038  14   5   5    0    4  15    1   1    4   0
##   3140   9   2   2    2    5  13    2   1    0   0
##   560    7   1   1    1    5  17    2   0    5  14
##   827    7   3   5    1    5  10    0   0    3   8

N-grams and coverage

Via the control argument we can use an n-gram tokenizer (NGramTokenizer from the Weka toolkit, through the RWeka package) to tokenize into phrases instead of single words and compute their frequencies. An n-gram token is just a compound of n consecutive tokens, e.g. a 2-gram (bigram) token is a pair of two adjacent words. These are the functions to build the document-term matrices for a given n-gram size.

library(RWeka,quietly = TRUE)
getNGramTokenizer <- function(corpus, ngram){
  NGramTokenizer(corpus, Weka_control(min = ngram, max = ngram))
}
createNGramDTM <- function(corpus,ngram){
  dtm <- DocumentTermMatrix(corpus, control = list(tokenize = function(x){
    getNGramTokenizer(x,ngram)
  }))
  return(dtm)
}
# the unigram model is just the one we have been working with
bigramDTM <- removeSparseTerms(createNGramDTM(cleanCorpusData,2), 0.99)
trigramDTM <- removeSparseTerms(createNGramDTM(cleanCorpusData,3), 0.99)
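
As a quick illustration (the sentence is made up and not taken from our corpus), the tokenizer splits a string into overlapping word n-grams:

library(RWeka, quietly = TRUE)
# hypothetical sentence, only to show what the n-gram tokenizer produces
NGramTokenizer("this is a short test sentence", Weka_control(min = 2, max = 2))
# should return the bigrams "this is", "is a", "a short", "short test", "test sentence"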

To get the term frequencies from each document-term matrix we use the auxiliary function getFreq() (provided in the Appendix).

unigramFreq <- getFreq(unigramDTM)
bigramFreq <- getFreq(bigramDTM)
trigramFreq <- getFreq(trigramDTM)

Finally we plot the word frequencies of each n-gram model.

library(ggplot2,quietly = TRUE)
# rank is the maximum number of top terms to show
makePlot <- function(data, rank) {
    xLabel <- paste(rank, "most common terms")
    # do not try to plot more terms than the data contains
    maxRow <- min(rank, nrow(data))
    ggplot(data[1:maxRow, ], aes(reorder(word, -freq), freq)) +
        labs(x = xLabel, y = "Frequency") +
        theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
        geom_bar(stat = "identity", fill = I("lightblue"))
}
unigramPlot <- makePlot(unigramFreq, 15)
bigramPlot <- makePlot(bigramFreq,15)
trigramPlot <- makePlot(trigramFreq,15)

Unigram

unigramPlot

Bigram

bigramPlot

Trigram

trigramPlot

Appendix

# read a plain-text file and return its lines as a character vector
extractSet <- function(filepath){
  con <- file(filepath, "r")
  lines <- readLines(con)
  close(con)
  return(lines)
}
# get the frequencies of words in a document term matrix
getFreq <- function(dtm) {
    freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
    result <- data.frame(word = names(freq), freq = freq)
    # remove possible NAs
    result <- result[complete.cases(result), ]
    return(result)
}