Synopsis

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking, and a whole range of other activities. But typing on mobile devices can be a serious pain. In this project, I will attempt to build a predictive text model for English text, which will make it easier for people to type on their mobile devices. As an example, when someone types

I went to the

the predictive model will present three options for what the next word might be (e.g. gym, store, restaurant).

In order to build a predictive model, I need to learn how words and word pairs are related in English text. By related, I mean how frequently different terms co-occur. I will use a large collection of text from blogs, Twitter messages, and news articles, and apply various techniques from natural language processing (NLP) and text mining to build the model.
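
To make the co-occurrence idea concrete, here is a toy illustration (illustration only; the real analysis below uses far larger data and the RWeka tokenizer) of counting how often consecutive word pairs appear in a few short example sentences:

toy <- c("i went to the gym", "i went to the store", "i went home")
toy_words <- strsplit(toy, " ")
## form the consecutive word pairs (bigrams) of each sentence and count them
toy_bigrams <- unlist(lapply(toy_words, function(w) paste(head(w, -1), tail(w, -1))))
sort(table(toy_bigrams), decreasing = TRUE)
## "i went" occurs three times, "went to" and "to the" twice, the rest once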

In this report, I will download the datasets and summarize their sizes, preprocess the text (tokenization, cleaning, and filtering), explore N-gram frequencies in the sampled documents, and outline next steps towards a prediction model.

Download of datasets

The main datasets for this project come from a corpus known as HC Corpora. The datasets have been language filtered, but may still contain some foreign text. Below, I use the datasets that contain English data. As part of the text cleaning/filtering (see the section below), I remove profanity words, retrieved from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt

suppressPackageStartupMessages(library(tm))
suppressPackageStartupMessages(library(textcat)) ## identify language
## give the JVM extra heap space before RWeka (rJava) is loaded
options(java.parameters = "-Xmx8g")
suppressPackageStartupMessages(library(RWeka))
suppressPackageStartupMessages(library(qdap))
suppressPackageStartupMessages(library(SnowballC))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
destfile <- 'Coursera-SwiftKey.zip'
if(!file.exists(destfile)){
  download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',destfile = 'Coursera-SwiftKey.zip')
  unzip('Coursera-SwiftKey.zip')
}
fsize_us_blogs <- file.info(paste0(getwd(),'/final/en_US/en_US.blogs.txt'))
fsize_us_news <- file.info(paste0(getwd(),'/final/en_US/en_US.news.txt'))
fsize_us_twitter <- file.info(paste0(getwd(),'/final/en_US/en_US.twitter.txt'))

en_blogs <- NULL
en_news <- NULL
en_twitter <- NULL

if(is.null(en_blogs)){
  en_blogs <- scan('final/en_US/en_US.blogs.txt', what = character(), sep='\n')
}
if(is.null(en_news)){
  en_news <- scan('final/en_US/en_US.news.txt', what = character(), sep='\n')
}
if(is.null(en_twitter)){
  en_twitter <- scan('final/en_US/en_US.twitter.txt', what = character(), sep='\n',skipNul = T)
}

if(!file.exists('profanity-words.txt')){
  download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt', destfile = 'profanity-words.txt')
}

##split profanity_words into two vectors to avoid problems with tm_map
profanity_words <- read.table('profanity-words.txt', stringsAsFactors = F)
profanity_list1 <- as.vector(head(profanity_words$V1, 700))
profanity_list2 <- as.vector(tail(profanity_words$V1, 683))

Below is a summary of the file sizes of the datasets, the document category each comes from, and how many records (entries) they contain:

datasets_stats <- data.frame('DocumentCategory' = character(), 'Size_Mb' = numeric(), 'NumberOfEntries' = integer(), stringsAsFactors = F)
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='Twitter','Size_Mb' = fsize_us_twitter$size / (1024^2), 'NumberOfEntries' = length(en_twitter)))
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='News','Size_Mb' = fsize_us_news$size / (1024^2), 'NumberOfEntries' = length(en_news)))
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='Blogs','Size_Mb' = fsize_us_blogs$size / (1024^2), 'NumberOfEntries' = length(en_blogs)))
knitr::kable(datasets_stats, digits = 2)
DocumentCategory    Size_Mb   NumberOfEntries
Twitter              159.36           2360148
News                 196.28           1010242
Blogs                200.42            899288

Data preprocessing - tokenization, cleaning, and filtering

In the section below, I demonstrate, using the tm package, how sample text from news, blogs, and Twitter messages can be cleaned and filtered so that the words and phrases of most predictive value remain. Since the datasets are fairly large, I have sampled 10% from each document category for speed and simplicity. I have employed the following steps to clean the text so that frequencies of words and word phrases can be calculated: conversion to lower case, removal of punctuation, removal of English stop words, removal of the stray term "u", removal of profanity, removal of numbers, and stripping of extra whitespace.

Currently, I have not employed any tools to detect and correct spelling errors, nor have I removed text of non-English origin. Nor have I experimented with stemming (e.g. the Porter stemmer). I plan to implement most of this in the next iteration of the data pre-processing presented here. I also plan to perform the pre-processing on a per-sentence level rather than on a document level, as is done here.
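
As a rough illustration of what the planned stemming and per-sentence processing could look like (a sketch only, assuming the SnowballC package loaded above is used for Porter stemming; the helper split_sentences is hypothetical and not part of the current pipeline):

## Porter stemming of individual words with SnowballC
SnowballC::wordStem(c("running", "runs", "easily"), language = "english")

## approximate sentence splitting, so that N-grams would not span sentence boundaries
split_sentences <- function(x) {
  unlist(strsplit(x, "(?<=[.!?])\\s+", perl = TRUE))
}
split_sentences("I went to the store. It was closed!")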

I use the RWeka package to retrieve N-grams from the cleaned text documents.

clean_text <- function(x){
  
  corpus <- tm::Corpus(VectorSource(x))
  ## lower-case the text (on newer tm versions this step may need to be
  ## wrapped in content_transformer(tolower))
  corpus <- tm::tm_map(corpus, tolower)
  corpus <- tm::tm_map(corpus, removePunctuation)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  ## strip punctuation from the stop word list so it matches the
  ## already de-punctuated text (e.g. "dont" rather than "don't")
  stopwords_no_punctuation <- removePunctuation(stopwords("english"))
  corpus <- tm::tm_map(corpus, removeWords, stopwords_no_punctuation)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  corpus <- tm::tm_map(corpus, removeWords, c("u"))
  corpus <- tm::tm_map(corpus, stripWhitespace)
  ## profanity list is split in two (see above) to avoid problems with removeWords
  corpus <- tm::tm_map(corpus, removeWords, profanity_list1)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  corpus <- tm::tm_map(corpus, removeWords, profanity_list2)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  corpus <- tm::tm_map(corpus, removeNumbers)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  
  ## convert back to plain text documents and extract the cleaned text
  corpus <- tm::tm_map(corpus, PlainTextDocument)
  doc_content <- data.frame(text=unlist(sapply(corpus, `[`, "content")), stringsAsFactors=F)
  
  return(doc_content$text)
}

index_blogs <- 1
index_twitter <- 1
index_news <- 1
sample_docs <- NULL
sample_length_blogs <- as.integer(length(en_blogs)/1000)
sample_length_news <- as.integer(length(en_news)/1000)
sample_length_twitter <- as.integer(length(en_twitter)/1000)
## 100 iterations over consecutive 0.1% chunks gives the ~10% sample described above
num_iterations <- 1
max_iterations <- 100
all_docs <- NULL
all_onegrams <- NULL
all_bigrams <- NULL
all_trigrams <- NULL
if(!file.exists('onegrams.rda')){
  while(num_iterations <= max_iterations){
    sample_blogs <- en_blogs[c(index_blogs:min(index_blogs + sample_length_blogs,length(en_blogs)))]
    sample_twitter <- en_twitter[c(index_twitter:min(index_twitter + sample_length_twitter,length(en_twitter)))]
    sample_news <- en_news[c(index_news:min(index_news + sample_length_news,length(en_news)))]
    sample_docs <- c(sample_blogs, sample_twitter, sample_news)
    cleaned_docs <- clean_text(sample_docs)
    
    index_blogs <- index_blogs + sample_length_blogs
    index_news <- index_news + sample_length_news
    index_twitter <- index_twitter + sample_length_twitter
    
    ## retrieve onegrams, bi-grams, and tri-grams using RWeka
    onegrams <- NULL
    bigrams <- NULL
    trigrams <- NULL
    try({
      onegrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 1, max = 1, delimiters = " \\r\\t\\n.,;:\"()?!"))
      bigrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 2, max = 2, delimiters = " \\r\\t\\n.,;:\"()?!"))
      trigrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 3, max = 3, delimiters = " \\r\\t\\n.,;:\"()?!"))
    })
    if(!is.null(onegrams)){
      all_onegrams <- c(all_onegrams, onegrams)
    }
    if(!is.null(bigrams)){
      all_bigrams <- c(all_bigrams, bigrams)
    }
    if(!is.null(trigrams)){
      all_trigrams <- c(all_trigrams, trigrams)
    }
    cat(num_iterations, '\n')
    num_iterations <- num_iterations + 1
  }
  
  save(all_onegrams,file="onegrams.rda")
  save(all_bigrams, file="bigrams.rda")
  save(all_trigrams, file="trigrams.rda")
}

Summary of exploratory analysis - N-gram frequencies

Below, I calculate N-gram (n = 1-3) frequencies in the sampled documents using the base R table function and plot the 30 most frequent N-grams of each order:

if(is.null(all_onegrams)){
  load(file='onegrams.rda')
}
if(is.null(all_bigrams)){
  load(file='bigrams.rda')
}
if(is.null(all_trigrams)){
  load(file='trigrams.rda')
}
onegram_count <- data.frame(table(all_onegrams), stringsAsFactors = F)
onegram_count <- rename(onegram_count, frequency = Freq, term = all_onegrams)
onegram_count$term <- as.character(onegram_count$term)
a <- dplyr::arrange(onegram_count, desc(frequency)) %>% head(30)
## lock the factor level order so ggplot shows terms in descending frequency
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") + 
  theme_classic()  + 
  ggtitle('Top 30 one-gram terms') + 
  theme(
    axis.text.x = element_text(angle = 45, size=12, family="Helvetica", vjust = 0.3), 
    axis.text.y = element_text(size=12, family="Helvetica"),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 12, family = "Helvetica"),
    plot.title = element_text(size=12,face="bold")
  )

bigram_count <- data.frame(table(all_bigrams), stringsAsFactors = F)
bigram_count <- rename(bigram_count, frequency = Freq, term = all_bigrams)
bigram_count$term <- as.character(bigram_count$term)
a <- dplyr::arrange(bigram_count, desc(frequency)) %>% head(30)
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") + 
  theme_classic()  + 
  ggtitle('Top 30 bi-gram terms') + 
  theme(
    axis.text.x = element_text(angle = 45, size=12, family="Helvetica", vjust = 0.6), 
    axis.text.y = element_text(size=12, family="Helvetica"),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 12, family = "Helvetica"),
    plot.title = element_text(size=12,face="bold")
  )

trigram_count <- data.frame(table(all_trigrams), stringsAsFactors = F)
trigram_count <- rename(trigram_count, frequency = Freq, term = all_trigrams)
trigram_count$term <- as.character(trigram_count$term)
a <- dplyr::arrange(trigram_count, desc(frequency)) %>% head(30)
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") + 
  theme_classic()  + 
  ggtitle('Top 30 tri-gram terms') + 
  theme(
    axis.text.x = element_text(angle = 90, size=12, family="Helvetica", vjust = 0.6), 
    axis.text.y = element_text(size=12, family="Helvetica"),
    axis.title.x = element_blank(),
    axis.title.y = element_text(size = 12, family = "Helvetica"),
    plot.title = element_text(size=12,face="bold")
  )

Summary and next steps

Above, I have shown the sizes of the document datasets and, through sampling of the documents, demonstrated various techniques for cleaning the text into meaningful words and word phrases. In the next stage of the project, I will improve upon the current pre-processing steps (some limitations were outlined above, such as misspellings, non-English text, and non-stemmed words), and work towards building an N-gram model for text prediction. During that stage, I also plan to get an overview of Markov chain models in R and how they can be used efficiently for prediction.
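
As a rough sketch of the kind of lookup I have in mind (a sketch only, not the final model: it assumes the trigram_count and bigram_count data frames built above and uses a naive frequency-based back-off; the function name suggest_next is hypothetical):

## look up the last two typed words in the trigram counts; if that context was
## never seen, back off to the bigram counts keyed on the last word only
suggest_next <- function(context, trigram_count, bigram_count, n = 3) {
  words <- unlist(strsplit(tolower(context), "\\s+"))
  last_two <- paste(tail(words, 2), collapse = " ")
  last_one <- tail(words, 1)
  hits <- trigram_count[startsWith(trigram_count$term, paste0(last_two, " ")), ]
  if (nrow(hits) == 0) {
    hits <- bigram_count[startsWith(bigram_count$term, paste0(last_one, " ")), ]
  }
  hits <- hits[order(-hits$frequency), ]
  ## the suggestion is the last word of each of the top n matching N-grams
  vapply(strsplit(head(hits$term, n), " "), function(w) tail(w, 1), character(1))
}

## usage: suggest_next("some typed text", trigram_count, bigram_count)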