Abstract

This report analyzes three large corpora of English (US) text and explores the main features of the data. The three text collections, called corpora in linguistics, come from different sources: internet blog posts, news articles and Twitter messages. While the Blogs and News datasets are quite similar with respect to the length of their text lines, the Twitter corpus differs in that its lines are much shorter, owing to the 140-character limit of Twitter messages. We analyzed the main features of each corpus, such as file size, number of text lines and word count, and then preprocessed and tokenized the corpora to take a closer look at word distributions and the frequencies of 2-grams and 3-grams.

Introduction

This milestone report is part of the capstone project in the Data Science Specialization offered on Coursera by Johns Hopkins University. The goal of this intermediate report is to perform exploratory analysis of three corpora of English (US) text that will later be used to build the prediction algorithm for the project. The plan is to use Natural Language Processing (NLP) methods to analyze the corpora and develop a smart autocorrect keyboard application for mobile phones that can predict the user's input and suggest the next word to use in a chat. The corpora were provided by SwiftKey, a company that develops a smart keyboard application for mobile devices, used for text prediction in messaging applications. The dataset originally comes from a corpus called HC Corpora and can be downloaded from the URL below.

url = "http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zipfile = "Coursera-SwiftKey.zip"
zipfile_path = paste(cwd, "Data", zipfile, sep="/")  # cwd holds the project directory

# Download and extract the dataset only if it isn't already present
if (!file.exists(zipfile_path)) {
    download.file(url, destfile=zipfile_path, method="auto")
    unzip(zipfile_path, exdir="./Data", overwrite=FALSE)
}

Summary statistics

First we’ll report summary statistics for the three datasets to illustrate the main features of the data. In order to do that we collect the following types of information on each text corpus:

  1. File sizes
  2. Number of characters
  3. Number of non-white characters
  4. Number of lines
  5. Number of non-empty lines
  6. Total number of words
  7. Distribution of words (mean, median and quantiles) per line
# Load data
library(stringi)  # provides the string statistics, word counts and regex replacements used below
dir = file.path(cwd, "Data", "final", "en_US")
blogs_file = file(file.path(dir, "en_US.blogs.txt"), "rb")
news_file = file(file.path(dir, "en_US.news.txt"), "rb")
twitter_file = file(file.path(dir, "en_US.twitter.txt"), "rb")

# Read all 3 text files
blogs = readLines(blogs_file, encoding="UTF-8", skipNul=TRUE)
news = readLines(news_file, encoding="UTF-8", skipNul=TRUE)
twitter = readLines(twitter_file, encoding="UTF-8", skipNul=TRUE)

# Replace special Unicode characters in twitter file
twitter <- iconv(twitter, from = "latin1", to = "UTF-8", sub="")
twitter <- stri_replace_all_regex(twitter, "\u2019|`", "'")
twitter <- stri_replace_all_regex(twitter, "\u201c|\u201d|\u201f|``", '"')

# Calculate file sizes
blogs_size = file.info("./Data/final/en_US/en_US.blogs.txt")$size / 1024^2
news_size = file.info("./Data/final/en_US/en_US.news.txt")$size / 1024^2
twitter_size = file.info("./Data/final/en_US/en_US.twitter.txt")$size / 1024^2

# Using stringi library for characters/lines count
blogs_stats = stri_stats_general(blogs)
news_stats = stri_stats_general(news)
twitter_stats = stri_stats_general(twitter)

# Close connections
close(blogs_file)
close(news_file)
close(twitter_file)

Shown below are the summary statistics for the three complete text files, illustrating some basic features of each before we sample and preprocess the corpora in the following steps:

DATASET    SIZE      CHARACTERS    NON-WHITE CHARACTERS   LINES       NON-EMPTY LINES
Blogs      200.4MB   206,824,382   170,389,539            899,288     899,288
News       196.3MB   203,223,154   169,860,866            1,010,242   1,010,242
Twitter    159.4MB   162,385,035   134,370,242            2,360,148   2,360,148

Similarly, we count the number of words per line in each corpus using the very useful stringi library, and we summarize the distributions in a single table together with the total word count of each corpus, so that the corresponding values are easy to compare.

# Count words per line in each corpus
blogs_wordcount = stri_count_words(blogs)
news_wordcount = stri_count_words(news)
twitter_wordcount = stri_count_words(twitter)

# Show distributions and total word count
rbind(Blogs=c(summary(blogs_wordcount), "Total word count"=sum(blogs_wordcount)),
      News=c(summary(news_wordcount), "Total word count"=sum(news_wordcount)),
      Twitter=c(summary(twitter_wordcount), 
                "Total word count"=sum(twitter_wordcount)))
##         Min. 1st Qu. Median  Mean 3rd Qu. Max. Total word count
## Blogs      0       9     28 41.75      60 6726         37546246
## News       1      19     32 34.41      46 1796         34762395
## Twitter    1       7     12 12.79      18   61         30195133

Sampling the corpora

Since the complete corpora have quite a large footprint on disk, approximately 556MB and more than four million lines, which is very demanding for an analysis performed on a laptop, for the next steps we will use only a smaller part of them. We will randomly sample 5% of each corpus and then combine these subsets into a single sampled corpus.

set.seed(808)
sample_size = 0.05
blogs_sample = sample(blogs, size=round(sample_size * length(blogs)))
news_sample = sample(news, size=round(sample_size * length(news)))
twitter_sample = sample(twitter, size=round(sample_size * length(twitter)))
linesCount = length(c(blogs_sample, news_sample, twitter_sample))
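
The sampled corpora are read back from disk by DirSource in the cleaning step below, so the three samples need to be written out as plain text files first. The original code for this step is not shown; the sketch below is one way to do it, assuming the Data/Sampled_files directory used later and file names (blogs.txt, news.txt, twitter.txt) chosen only for illustration.

# Write the sampled lines to the directory that DirSource reads later
# (directory matches corpusSamplePath below; file names are illustrative)
sample_dir = file.path(cwd, "Data", "Sampled_files")
if (!dir.exists(sample_dir)) dir.create(sample_dir, recursive=TRUE)
writeLines(blogs_sample, file.path(sample_dir, "blogs.txt"))
writeLines(news_sample, file.path(sample_dir, "news.txt"))
writeLines(twitter_sample, file.path(sample_dir, "twitter.txt"))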

The acquired subsets contain 213483 lines of text, a sufficiently large representation of the original corpora to make valid inferences about them. For the final predictive model developed later in the project we will use the complete, preprocessed corpora containing all three datasets in full.

Cleaning the corpora and additional preprocessing

At this point we load the sampled text datasets with the tm library into a Corpus data type, which is essentially a collection of (natural language) text documents. In linguistics, a corpus (plural corpora) is defined as a large set of structured texts used for statistical analysis of language. The tm library is a framework for text mining applications which provides various methods for tasks such as text transformations, creating Document-Term Matrices, tokenization, stemming and computing frequencies of n-grams. In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n language items from a given sequence of text or speech, words in our case, such as a 1-gram (unigram, a single word), a 2-gram (bigram, a pair of words) or a 3-gram (trigram, three consecutive words).
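
To illustrate what these tokens look like, the short example below uses the RWeka tokenizer that also appears later in this report to split a made-up sentence into 2-grams and 3-grams (the sentence and this snippet are purely illustrative):

library(RWeka)
# Toy sentence, chosen only to show what bigram and trigram tokens look like
example_sentence = "thanks for the follow"
NGramTokenizer(example_sentence, Weka_control(min=2, max=2))  # "thanks for" "for the" "the follow"
NGramTokenizer(example_sentence, Weka_control(min=3, max=3))  # "thanks for the" "for the follow"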

To be able to use our texts for the analyses, we first need to clean the corpora and transform them into a form suitable for the natural language processing methods we're going to use. These procedures include converting all words to lowercase and removing numbers, punctuation and extra white space (we deliberately keep the common English stopwords, as explained in the next section). We also need to get rid of profanity ("dirty words") in our corpora, since we don't want our algorithm to predict them for the users. As the source of profane words I used a list from the Shutterstock GitHub page.

library(tm)  # text mining framework used for the corpus transformations below
corpusSamplePath = file.path(cwd, "Data", "Sampled_files")

#cleanPunctuation = content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Profane words removal
profaneWordlist = read.table("./Data/Profanity/profanity.txt", header=FALSE, colClasses="character", sep="\n")
profanity = as.vector(as.character(profaneWordlist$V1))

GetCorpusFromDir = function(dirpath)
{
    docs = Corpus(DirSource(dirpath))
    docs = tm_map(docs, content_transformer(tolower))
    docs = tm_map(docs, removeNumbers)
    docs = tm_map(docs, removePunctuation, preserve_intra_word_dashes=TRUE)
    docs = tm_map(docs, removeWords, profanity)
    #docs = tm_map(docs, stemDocument, "english") # Won't use stemming at this point
    docs = tm_map(docs, stripWhitespace)
    docs = tm_map(docs, PlainTextDocument)
    return(docs)
}

docs = GetCorpusFromDir(corpusSamplePath) 

# Save processed sample data to individual .txt files
writeCorpus(docs, path="./Data/Sampled_files/Processed",
            filenames=paste(c("blogs","news","twitter"), "_processed.txt", sep = ""))

Exploratory analysis: n-gram distributions and unique-word coverage of the corpora

In this phase of the analysis we create a sparse Document-Term Matrix from our corpora and use it to calculate the frequencies of word occurrences in our sample. We then sort these frequencies to find the 50 most used words in the corpora.

dtm = DocumentTermMatrix(docs)

# Remove sparse terms
dtms = removeSparseTerms(dtm, 0.1) 
freq = colSums(as.matrix(dtms))
ord = order(-freq) 
freq = freq[ord] 

# get top 50 most frequent unigrams
top50 = head(freq, 50) 
unigramTop50 = data.frame(Unigram=names(top50), Frequency=top50)
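
The barplot in Figure 1 is built from the unigramTop50 data frame above. The plotting code is not part of the original text; a minimal sketch, assuming the ggplot2 library, could look like this:

library(ggplot2)
# Barplot of the 50 most frequent unigrams, ordered by decreasing frequency
ggplot(unigramTop50, aes(x=reorder(Unigram, -Frequency), y=Frequency)) +
    geom_bar(stat="identity") +
    labs(x="Unigram", y="Frequency") +
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))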

As we could have expected, we conclude from the barplot in Figure 1 that the most common single words in everyday English are "the", "and", "for", "that" and "you", which are what linguists call "stopwords". This is not surprising, because we didn't remove these common words with the stopwords("english") transformation, which is usually recommended when performing text analysis since such words don't bring much analytical significance in typical text mining applications. But since we are building a word prediction application, we want to give users access to the words they use frequently in everyday language and provide them with the most accurate predictions while typing messages.

Figure 1: Frequencies of 50 most common 1-grams
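
For reference, if we did want to drop these common words, the standard tm transformation (deliberately not applied in this report) would be a single additional step in the cleaning function:

# Not used here: removing common English stopwords with tm
docs_nostop = tm_map(docs, removeWords, stopwords("english"))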

Also, from the skewness of the plot we notice that the frequency of word occurrences falls off very rapidly, suggesting that a small number of common words account for most of the corpora while other, less common words are used rarely. We would therefore like to find out more about word coverage in our corpora: specifically, how many unique words from our frequency-sorted dictionary are needed to reach given levels of coverage of all word instances in the corpora.

CumUniSums = cumsum(freq)
RCumUniSums = CumUniSums/sum(freq)
TotalWords = length(freq)
Coverage = data.frame(Quantiles = seq(0.1, 1, 0.1),
                      "Unique words" = rep(NA, 10),
                      "Percent of total words" = rep(NA, 10))

for (idx in seq(1, 10)) {
    Coverage$Unique.words[idx] = length(RCumUniSums[RCumUniSums < Coverage$Quantiles[idx]])
    Coverage$Percent.of.total.words[idx] = round(Coverage$Unique.words[idx]/TotalWords, 4)
}
print(Coverage)
##    Quantiles Unique.words Percent.of.total.words
## 1        0.1            2                 0.0001
## 2        0.2           12                 0.0006
## 3        0.3           40                 0.0022
## 4        0.4           96                 0.0052
## 5        0.5          224                 0.0120
## 6        0.6          493                 0.0265
## 7        0.7         1039                 0.0559
## 8        0.8         2175                 0.1170
## 9        0.9         4988                 0.2683
## 10       1.0        18589                 0.9999

There are 18590 unique words in our dictionary. We can see that only 224 unique words (or 1.2% of our dictionary) cover 50% of the analyzed sample of our texts, and exactly 4988 words (or 26.83% of the dictionary) are needed to cover 90% of it. Interestingly, only 2 words ("the" and "and") account for 10% of the whole corpora. To get a better perspective, Figure 2 shows how many unique words are needed to cover a given percentage of the corpora and what part of our dictionary that number of words represents.

Figure 2: Unique words needed to cover different percentage of the corpora
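
The code that produced Figure 2 is not reproduced in this report; a simple sketch based on the Coverage data frame above, again assuming ggplot2, might look like this:

library(ggplot2)
# Unique words needed per coverage level, annotated with the share of the dictionary they represent
ggplot(Coverage, aes(x=Quantiles, y=Unique.words)) +
    geom_line() +
    geom_point() +
    geom_text(aes(label=scales::percent(Percent.of.total.words)), vjust=-0.8, size=3) +
    labs(x="Corpora coverage", y="Unique words needed")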

2-grams and 3-grams distribution

Next, we would like to know the most common 2-grams and 3-grams in the corpora. We'll perform tokenization, the process of dividing text into meaningful lexical elements called tokens; in our case these are sequences of two and three words that are usually found together in a sentence.

library(RWeka)  # provides NGramTokenizer and Weka_control
BigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TDM2 = TermDocumentMatrix(docs, control=list(tokenize=BigramTokenizer))
TDM2 = removeSparseTerms(TDM2, 0.7)
FreqBigram = sort(rowSums(as.matrix((TDM2))), decreasing = TRUE)[1:50]
FreqBigramDF = data.frame(Bigram=names(FreqBigram), Frequency=FreqBigram)

Figure 3: Frequencies of 50 most common 2-grams

TrigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
TDM3 = TermDocumentMatrix(docs, control=list(tokenize=TrigramTokenizer))
TDM3 = removeSparseTerms(TDM3, 0.7)
FreqTrigram = sort(rowSums(as.matrix((TDM3))), decreasing = TRUE)[1:50]
FreqTrigramDF = data.frame(Trigram=names(FreqTrigram), Frequency=FreqTrigram)

Figure 4: Frequencies of 50 most common 3-grams

Finally, we show a wordcloud of the most prominent 3-grams in our corpora, where the size of a phrase in the plot is directly related to its frequency of occurrence in the corpora: the larger the phrase, the more often it is used.

library(wordcloud)     # provides the wordcloud() plotting function
library(RColorBrewer)  # provides the brewer.pal color palette

set.seed(909)
wordcloud(words=FreqTrigramDF$Trigram, freq=FreqTrigramDF$Frequency,
          max.words=50, scale=c(5, 0.2), colors=rev(brewer.pal(8, "RdYlBu")),
          random.order=FALSE, rot.per=0.25)

Figure 5: Wordcloud of 50 most common 3-grams

Next steps in developing predictive algorithm

After exploring the corpora and performing the initial analysis, there are a few things to consider before starting to develop the prediction algorithm. The first is to explore text cleaning techniques in greater depth and come up with a more sophisticated way to transform the documents, in order to remove foreign-language words, meaningless tokens and misspelled words before building the text-prediction algorithm. The second is to work out how to use the information about the prevalence of different n-grams and their occurrence probabilities to optimize the performance of the algorithm and make it faster. The idea is to first use 3-grams to predict the next word; if there is no adequate 3-gram, we'll back off to a matching 2-gram and lastly to a 1-gram. In the end, we'll use the final prediction algorithm to develop our Shiny text prediction app.
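
To make the backoff idea more concrete, here is a small sketch of such a lookup. It is not the final algorithm: trigram_freq and bigram_freq are assumed to be named frequency vectors keyed by space-separated n-grams (like FreqTrigram and FreqBigram above), and unigram_top is an assumed vector of the most frequent single words, e.g. names(top50).

# Illustrative backoff lookup: try trigrams, then bigrams, then the top unigram
PredictNextWord = function(prev_words, trigram_freq, bigram_freq, unigram_top) {
    words = tail(strsplit(tolower(prev_words), "\\s+")[[1]], 2)
    if (length(words) == 2) {
        # Trigrams whose first two words match the last two words typed
        hits = trigram_freq[grepl(paste0("^", words[1], " ", words[2], " "), names(trigram_freq))]
        if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
    }
    # Back off to bigrams that start with the last word typed
    hits = bigram_freq[grepl(paste0("^", tail(words, 1), " "), names(bigram_freq))]
    if (length(hits) > 0) return(sub(".* ", "", names(which.max(hits))))
    # Lastly fall back to the overall most frequent single word
    unigram_top[1]
}

# Example call (assumes the objects above exist):
# PredictNextWord("thanks for", FreqTrigram, FreqBigram, names(top50))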


This RMarkdown document was produced with RStudio v0.99.893 on R v3.2.5.