The goal of the project is to build a prediction algorithm that predicts the next word given an input text string. To accomplish this goal, we analyzed the datasets provided to understand the data and their characteristics. In this report, we present the findings from our exploratory analysis.
The main objectives of this report are as follows: (1) demonstrate that we have downloaded the data and successfully loaded it in R, (2) present a basic summary statistics report on the datasets, (3) present interesting characteristics of the datasets, and finally (4) outline the plan for building the prediction algorithm.
The datasets for this project come from a corpus called HC Corpora. We download the data from the Coursera site, extract the en_US files from the archive, and load them into R using the readLines function.
NOTE: We do not show all the code chunks in this report. Please refer to the Notes section at the end of the report for a link to the GitHub repository that contains the source files.
# load the en_US.blogs.txt dataset
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding="UTF-8", skipNul = TRUE)
# load the en_US.news.txt dataset
news <- readLines("final/en_US/en_US.news.txt", encoding="UTF-8", skipNul = TRUE)
# load the en_US.twitter.txt dataset
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding="UTF-8", skipNul = TRUE)
Here are the summary statistics of the datasets. We use the stringi::stri_stats_latex function to retrieve the number of words in each dataset.
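The table below is assembled along the following lines; the helper function dataset_stats shown here is an illustrative sketch, not the exact code used in the report.
library(stringi)
# summarize one dataset: in-memory size, longest line, word count and line count
dataset_stats <- function(x) {
    c(size = format(object.size(x), units = "Mb"),
      max.chars = max(nchar(x)),
      words = stri_stats_latex(x)[["Words"]],
      lines = length(x))
}
rbind(blogs = dataset_stats(blogs),
      news = dataset_stats(news),
      tweets = dataset_stats(tweets))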
| Dataset | Size (in MB) | Max characters in a line | Number of words | Number of lines |
|---|---|---|---|---|
| blogs | 248.5 | 40833 | 37570839 | 899288 |
| news | 249.6 | 11384 | 34494539 | 1010242 |
| tweets | 301.4 | 140 | 30451170 | 2360148 |
The summary statistics show that the datasets are very large. Hence, we sample each dataset before proceeding with further analysis, randomly selecting 10,000 lines from each.
# set a seed number for reproducibility of the results
set.seed(1234)
sample.size <- 10000
# get a sample of 10000 lines from the blogs dataset
sample.blogs <- blogs[sample(1:length(blogs), sample.size, replace=FALSE)]
# get a sample of 10000 lines from the news dataset
sample.news <- news[sample(1:length(news), sample.size, replace=FALSE)]
# get a sample of 10000 lines from the twitter dataset
sample.tweets <- tweets[sample(1:length(tweets), sample.size, replace=FALSE)]
rm(list=c("blogs", "news", "tweets"))
Next, we split each line of the dataset into sentences based on punctuation. If we removed punctuation without first splitting lines into sentences, we could merge unrelated words and misrepresent the combinations of words that actually appear together in sentences. We treat periods, commas, colons, and semicolons as sentence boundaries. The following code shows the split function.
form_sentences <- function(line) {
    sentences <- line
    # if periods, commas, colons, or semicolons are present, split the line on them
    if (grepl("[.]|[,]|[:]|[;]", line)) {
        sentences <- strsplit(line, "[.]|[,]|[:]|[;]")
    }
    # return the resulting character vector of sentences
    return(sentences[[1]])
}
convert_dataset <- function(data) {
    result <- vector(mode="character")
    # convert each line in the dataset into individual sentences
    for (lineno in seq_along(data)) {
        result <- c(result, form_sentences(data[lineno]))
    }
    # remove empty entries
    result <- result[result != ""]
    # return the sentences
    return(result)
}
An example of how this conversion works is shown here.
sample.news[1]
## [1] "Three weeks later, businessman Ferris Kleem took Dimora and former county Auditor Frank Russo to Las Vegas, showering them with airfare and gambling money in exchange for their help on projects, federal prosecutors said."
form_sentences(sample.news[1])
## [1] "Three weeks later"
## [2] " businessman Ferris Kleem took Dimora and former county Auditor Frank Russo to Las Vegas"
## [3] " showering them with airfare and gambling money in exchange for their help on projects"
## [4] " federal prosecutors said"
We split the sampled lines into sentences and write the resulting sentences to files, one per dataset.
sentences.blogs <- convert_dataset(sample.blogs)
sentences.news <- convert_dataset(sample.news)
sentences.tweets <- convert_dataset(sample.tweets)
rm(list=c("sample.blogs", "sample.news", "sample.tweets"))
# create the sample directory if it does not exist
if (!dir.exists("./sample")) dir.create("./sample")
writeLines(sentences.blogs, con="./sample/sample.blogs")
writeLines(sentences.news, con="./sample/sample.news")
writeLines(sentences.tweets, con="./sample/sample.tweets")
| Dataset | Size (in MB) | Max characters in a line | Number of words | Number of lines (= sentences) |
|---|---|---|---|---|
| sentences.blogs | 4.9 | 911 | 421074 | 47918 |
| sentences.news | 4.3 | 347 | 342591 | 45086 |
| sentences.tweets | 1.7 | 140 | 129352 | 20091 |
A corpus is a collection of text documents; all our exploration and modeling exercises are based on this corpus. We use the tm package to explore the sampled datasets and build the corpus with the tm::VCorpus function. Once the corpus is built, we apply transformations to: (i) convert all characters to lowercase, (ii) remove punctuation, (iii) remove numbers, and (iv) strip extra white space.
# load the tm package for text mining
library(tm)
# build the corpus
corpus <- VCorpus(DirSource("./sample/"))
# convert all characters to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# strip extra white space
corpus <- tm_map(corpus, stripWhitespace)
Now, we filter the sampled datasets for profanity. We download a profanity word list from the CMU website and apply the removeWords transformation to the corpus to remove the profane words.
profanity_file <- "profanity_list.txt"
# if the file does not exist, download it
if (!file.exists(profanity_file)) {
    # download the profanity word list from the CMU website
    download.file("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt", profanity_file)
}
# load the profane words from the file
profane_words <- readLines(profanity_file)
# remove the profane words from the corpus
corpus <- tm_map(corpus, removeWords, profane_words)
We now have a cleaned dataset that can be explored. We use RWeka::NGramTokenizer to tokenize the text into 1-gram, 2-gram and 3-gram tokens. An n-gram is a contiguous sequence of n words that appears in the corpus. This helps identify the frequencies of single words, pairs of words, and triples of words appearing together.
Here are our tokenizer functions.
# load the RWeka package for n-gram tokenization
library(RWeka)
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
We create a tm::TermDocumentMatrix using the tokenizer functions. A TermDocumentMatrix is a matrix whose rows are the tokens and whose columns are the documents (our datasets). Each cell contains the frequency of a token in a document.
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize=unigramTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize=bigramTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize=trigramTokenizer))
Now, we sort the tokens by frequency to identify the most common 1-gram, 2-gram and 3-gram tokens. The function tdm_freq creates a data frame that contains the tokens and their frequencies, sorted in decreasing order of frequency. The function freq_plot plots the top 15 tokens.
tdm_freq <- function(tdm) {
    m <- as.matrix(tdm)
    wordsums <- rowSums(m)
    frequency <- sort(wordsums, decreasing=TRUE)
    return(data.frame(words=names(frequency), frequency=frequency, row.names=NULL))
}
library(ggplot2)
freq_plot <- function(freq, label) {
    # keep the 15 most frequent tokens
    freq <- freq[1:15, ]
    ggplot(freq, aes(reorder(words, frequency), frequency, fill=words)) +
        geom_bar(stat = "identity") +
        geom_text(aes(label = frequency), vjust=-0.5, size=3) +
        theme(axis.text.x = element_text(angle = 45, hjust = 0.5, size = 10)) +
        labs(x=paste(label, "Tokens")) +
        labs(y="Frequency") +
        labs(title=paste("Frequency of", label, "tokens"))
}
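The frequency tables and the plots that follow are produced by applying these helpers to each TermDocumentMatrix, roughly as follows (the plot labels are illustrative):
# build the frequency tables for each n-gram model
uniFreq <- tdm_freq(uniTDM)
biFreq <- tdm_freq(biTDM)
triFreq <- tdm_freq(triTDM)
# plot the top 15 tokens of each model
freq_plot(uniFreq, "1-Gram")
freq_plot(biFreq, "2-Gram")
freq_plot(triFreq, "3-Gram")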
Here are the top 15 1-gram tokens.
Here are the top 15 2-gram tokens.
Here are the top 15 3-gram tokens.
Wordclouds: Top 15 Tokens from 1-Gram, 2-Gram, and 3-Gram Models
Another interesting representation of the tokens is a wordcloud: a visual representation of the words (tokens) where the size of each word is proportional to its frequency of occurrence in the dataset. Here we show the wordclouds of the top 15 tokens from the 1-gram, 2-gram and 3-gram models.
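We do not show the wordcloud code in this report; a minimal sketch using the wordcloud package (the package choice and settings here are assumptions, shown for the 1-gram tokens) would look like this:
# draw a wordcloud of the 15 most frequent 1-gram tokens
library(wordcloud)
library(RColorBrewer)
top15 <- uniFreq[1:15, ]
set.seed(1234)
wordcloud(words = top15$words, freq = top15$frequency,
          scale = c(4, 0.5), colors = brewer.pal(8, "Dark2"))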
Now, we consider how many unique 1-gram tokens are needed to cover 50% and 90% of all word instances in the language, assuming that the sampled dataset is representative of the entire language. (Beyond the 1-gram model, the number of possible word combinations keeps growing, so higher-order n-grams are less useful for computing coverage statistics.) First, we compute the cumulative frequencies of the word instances and plot the data.
getCumulativeFreq <- function(tokens) {
    return(cumsum(tokens$frequency))
}
getCoverage <- function(cumdist, total, target) {
    # index of the first token at which the target coverage is reached
    targetfreq <- total * target
    return(which(cumdist >= targetfreq)[1])
}
# Cumulative distribution of word frequencies
cumdist <- getCumulativeFreq(uniFreq)
# total number of word instances
totalfreq <- cumdist[dim(uniFreq)[1]]
# index of word that covers 50% of word instances
c50 <- getCoverage(cumdist, totalfreq, 0.5)
# index of word that covers 90% of word instances
c90 <- getCoverage(cumdist, totalfreq, 0.9)
Here is the cumulative frequency plot of the 1-gram tokens.
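A sketch of how such a plot can be drawn from the quantities computed above (the exact plotting choices are illustrative):
# plot the fraction of word instances covered as more unique tokens are included
plot(cumdist / totalfreq, type = "l",
     xlab = "Number of unique 1-gram tokens",
     ylab = "Fraction of word instances covered")
# mark the 50% and 90% coverage points
abline(h = c(0.5, 0.9), lty = 2)
abline(v = c(c50, c90), lty = 3)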
| Coverage | Number of tokens | Percentage of unique tokens |
|---|---|---|
| 50% | 309 | 0.62% |
| 90% | 9539 | 19.01% |
From the above table, we observe that 50% coverage of all word instances is achieved with only 0.62% of the unique words in the corpus, and 90% coverage requires only 19.01% of the unique words.
To increase coverage, we consider stemming. Stemming maps similar words to a common root and counts them together in the frequency data; for example, rain, rained, rains and raining all reduce to rain. In the prior analysis, we did not include stemming in our transformations because we wanted to capture tokens in all their forms. However, it may be reasonable to predict stemmed tokens in order to provide quicker word predictions for mobile users.
stemmed.corpus <- tm_map(corpus, stemDocument)
stemmed.TDM <- TermDocumentMatrix(stemmed.corpus, control=list(tokenize=unigramTokenizer))
stemmed.freq <- tdm_freq(stemmed.TDM)
stemmed.cumdist <- getCumulativeFreq(stemmed.freq)
# total number of word instances in the stemmed corpus
stemmed.total_freq <- stemmed.cumdist[dim(stemmed.freq)[1]]
stemmed.c50 <- getCoverage(stemmed.cumdist, stemmed.total_freq, 0.5)
stemmed.c90 <- getCoverage(stemmed.cumdist, stemmed.total_freq, 0.9)
| Coverage | Number of tokens | Percentage of unique tokens |
|---|---|---|
| 50% | 238 | 0.86% |
| 90% | 5011 | 26.50% |
Stemming increases coverage at the expense of prediction accuracy. Therefore, before incorporating stemming into the prediction algorithm, further investigation is required to understand the trade-off between prediction accuracy and the ability to cover a larger portion of the language.
We can also consider removing English stop words from the corpus to increase coverage; again, this would reduce prediction accuracy.
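A one-line sketch of how stop-word removal could be added to the tm pipeline (the variable name corpus.nostop is illustrative; this step is not part of our current transformations):
# remove common English stop words from the corpus
corpus.nostop <- tm_map(corpus, removeWords, stopwords("english"))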
In this section, we evaluate the amount of foreign (non-English) text in the corpus. The Compact Language Detector package for R (the cldr library, available on the CRAN website) provides a way to detect the language of text. The cldr::detectLanguage function returns the detected language along with a confidence score. We run the detection on the sampled sentences.
# load the cldr package for language detection
library(cldr)
# combine the sampled sentences from all three datasets
sentences <- c(sentences.blogs, sentences.news, sentences.tweets)
token.language <- detectLanguage(sentences)
# indices of sentences detected as English with more than 50% confidence
english.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage == "ENGLISH")
# indices of sentences detected as a non-English language with more than 50% confidence
foreign.words <- which(token.language$percentScore1 > 50 & token.language$detectedLanguage != "ENGLISH")
The ratio of foreign-language sentences to English sentences (each detected with over 50% confidence) is 5.4%. Since this ratio is small, the presence of foreign text in the corpus should not significantly impact our prediction algorithm. We can eliminate the reliably detected foreign-language text from our prediction model.
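For example, one could keep only the sentences that were reliably detected as English (the name english.sentences below is illustrative):
# keep only the sentences reliably detected as English
english.sentences <- sentences[english.words]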
The next step of the project is to design and implement the prediction algorithm. The algorithm will be based on the n-gram models described above: given an input text, we will use the highest-order n-gram model available to predict the next word(s). Finally, we will build a data product and deploy it as a Shiny app to showcase the prediction algorithm.
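As a rough illustration of the direction we plan to take (not the final design, which still needs back-off and smoothing), a next-word lookup against the 3-gram frequency table could look like this:
# predict likely next words by matching the last two words of the input
# against the 3-gram frequency table (illustrative sketch only)
predict_next <- function(input, ngramFreq, n = 3) {
    input <- tolower(input)
    # take the last two words of the input as the 3-gram prefix
    last_words <- tail(strsplit(input, "\\s+")[[1]], 2)
    prefix <- paste0("^", paste(last_words, collapse = " "), " ")
    matches <- ngramFreq[grepl(prefix, ngramFreq$words), ]
    # the table is already sorted by frequency, so return the top completions
    head(sub(prefix, "", matches$words), n)
}
# example usage: predict_next("thanks for the", triFreq)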