Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. In this project, I will attempt to build a predictive text model for English text, which will make it easier for people to type on their mobile devices. As an example, when someone types
I went to the
the predictive model will present three options for what the next word might be (e.g. gym, store, restaurant).
In order to build a predictive model, I need to learn how words and word pairs are related in English text. By related, I mean how frequently different terms co-occur. I will use a large collection of text from blogs, Twitter messages, and news articles, and apply various techniques from natural language processing (NLP) and text mining to build my model.
In this report, I will:

- download and summarize the English datasets (blogs, news articles, and Twitter messages)
- clean and filter a sample of the text, including profanity removal
- calculate and plot the frequencies of the most common one-grams, bi-grams, and tri-grams
- outline my plans for building the prediction model
The main datasets for this project come from a corpus called HC Corpora. The datasets have been language filtered, but may still contain some foreign text. Below, I use the datasets that contain English text. As part of the text cleaning/filtering (see the section below), I remove profanity words, which were retrieved from http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
suppressPackageStartupMessages(library(tm))        ## text mining framework
suppressPackageStartupMessages(library(textcat))   ## language identification
options(java.parameters = "-Xmx8g")                ## give the JVM enough heap before RWeka/rJava load
suppressPackageStartupMessages(library(RWeka))     ## N-gram tokenization
suppressPackageStartupMessages(library(qdap))      ## text utilities
suppressPackageStartupMessages(library(SnowballC)) ## stemming
suppressPackageStartupMessages(library(knitr))     ## report tables
suppressPackageStartupMessages(library(ggplot2))   ## plotting
suppressPackageStartupMessages(library(dplyr))     ## data manipulation
destfile <- 'Coursera-SwiftKey.zip'
if(!file.exists(destfile)){
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip',destfile = 'Coursera-SwiftKey.zip')
unzip('Coursera-SwiftKey.zip')
}
fsize_us_blogs <- file.info(paste0(getwd(),'/final/en_US/en_US.blogs.txt'))
fsize_us_news <- file.info(paste0(getwd(),'/final/en_US/en_US.news.txt'))
fsize_us_twitter <- file.info(paste0(getwd(),'/final/en_US/en_US.twitter.txt'))
en_blogs <- NULL
en_news <- NULL
en_twitter <- NULL
if(is.null(en_blogs)){
en_blogs <- scan('final/en_US/en_US.blogs.txt', what = character(), sep='\n')
}
if(is.null(en_news)){
en_news <- scan('final/en_US/en_US.news.txt', what = character(), sep='\n')
}
if(is.null(en_twitter)){
en_twitter <- scan('final/en_US/en_US.twitter.txt', what = character(), sep='\n',skipNul = T)
}
if(!file.exists('profanity-words.txt')){
download.file('http://www.cs.cmu.edu/~biglou/resources/bad-words.txt', destfile = 'profanity-words.txt')
}
##split profanity_words into two vectors to avoid problems with tm_map
profanity_words <- read.table('profanity-words.txt', stringsAsFactors = F)
profanity_list1 <- as.vector(head(profanity_words$V1, 700))
profanity_list2 <- as.vector(tail(profanity_words$V1, 683))
Below is a summary of the file sizes of the datasets, where they come from, and how many records (entries) each contains:
datasets_stats <- data.frame('DocumentCategory' = character(), 'Size_Mb' = numeric(), 'NumberOfEntries' = integer(), stringsAsFactors = F)
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='Twitter','Size_Mb' = fsize_us_twitter$size / (1024^2), 'NumberOfEntries' = length(en_twitter)))
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='News','Size_Mb' = fsize_us_news$size / (1024^2), 'NumberOfEntries' = length(en_news)))
datasets_stats <- rbind(datasets_stats, data.frame('DocumentCategory'='Blogs','Size_Mb' = fsize_us_blogs$size / (1024^2), 'NumberOfEntries' = length(en_blogs)))
knitr::kable(datasets_stats, digits = 2)
| DocumentCategory | Size_Mb | NumberOfEntries |
|---|---|---|
| Twitter | 159.36 | 2360148 |
| News | 196.28 | 1010242 |
| Blogs | 200.42 | 899288 |
In the section below, I demonstrate, using the tm package, how sample texts from news, blogs, and Twitter messages can be cleaned and filtered so that the words and phrases of most predictive value remain. Since the datasets are fairly large, I have sampled 10% from each document category for speed and simplicity. I have employed the following steps to clean the text so that frequencies of words and word phrases can be calculated:

- convert all text to lower case
- remove punctuation
- remove English stop words
- remove leftover single characters (e.g. "u")
- remove profanity words
- remove numbers
- strip extra whitespace
Currently, I have not employed any tools to detect and correct spelling errors, nor have I removed text of non-English origin. Nor have I experimented with stemming (e.g. the Porter stemmer). I plan to implement most of this in the next iteration of the pre-processing analysis presented here. I also plan to perform the pre-processing on a per-sentence level rather than on a document level, as is done here; a rough sketch of these possible additions is shown below.
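To make these plans more concrete, below is a minimal sketch of what language filtering (via textcat, which is loaded above but not yet used), sentence splitting, and Porter stemming (via SnowballC) might look like. This is only an illustration under my own assumptions: the helper names and the simple regex-based sentence splitter are not part of the current pipeline, and none of this is applied in the analysis that follows.

## Sketch only: possible future pre-processing additions (not part of the current analysis)
## keep only the lines that textcat classifies as English
filter_english <- function(lines){
  langs <- textcat::textcat(lines)
  lines[!is.na(langs) & langs == "english"]
}
## split each line into sentences with a simple regular expression
## (a dedicated sentence tokenizer would be more robust)
split_sentences <- function(lines){
  unlist(strsplit(lines, split = "(?<=[.!?])\\s+", perl = TRUE))
}
## stem each word in a line with the Porter stemmer from SnowballC
stem_line <- function(line){
  words <- unlist(strsplit(line, "\\s+"))
  paste(SnowballC::wordStem(words, language = "porter"), collapse = " ")
}
## possible usage, applied before clean_text():
## sentences <- split_sentences(filter_english(sample_docs))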
I use the RWeka package to retrieve N-grams from the cleaned text documents.
clean_text <- function(x){
  corpus <- tm::Corpus(VectorSource(x))
  ## lower-case all text (wrapped in content_transformer so the corpus structure is preserved)
  corpus <- tm::tm_map(corpus, content_transformer(tolower))
  corpus <- tm::tm_map(corpus, removePunctuation)
  ## remove English stop words (with their punctuation stripped, since punctuation has already been removed)
  stopwords_no_punctuation <- removePunctuation(stopwords("english"))
  corpus <- tm::tm_map(corpus, removeWords, stopwords_no_punctuation)
  ## remove leftover single characters
  corpus <- tm::tm_map(corpus, removeWords, c("u"))
  ## remove profanity (the list is split in two to avoid problems with tm_map)
  corpus <- tm::tm_map(corpus, removeWords, profanity_list1)
  corpus <- tm::tm_map(corpus, removeWords, profanity_list2)
  corpus <- tm::tm_map(corpus, removeNumbers)
  corpus <- tm::tm_map(corpus, stripWhitespace)
  ## return the cleaned text as a plain character vector
  return(unname(sapply(corpus, content)))
}
index_blogs <- 1
index_twitter <- 1
index_news <- 1
sample_docs <- NULL
sample_length_blogs <- as.integer(length(en_blogs)/1000)
sample_length_news <- as.integer(length(en_news)/1000)
sample_length_twitter <- as.integer(length(en_twitter)/1000)
num_iterations <- 1
max_iterations <- 100
all_docs <- NULL
all_onegrams <- NULL
all_bigrams <- NULL
all_trigrams <- NULL
if(!file.exists('onegrams.rda')){
while(num_iterations <= max_iterations){
sample_blogs <- en_blogs[c(index_blogs:min(index_blogs + sample_length_blogs,length(en_blogs)))]
sample_twitter <- en_twitter[c(index_twitter:min(index_twitter + sample_length_twitter,length(en_twitter)))]
sample_news <- en_news[c(index_news:min(index_news + sample_length_news,length(en_news)))]
sample_docs <- c(sample_blogs, sample_twitter, sample_news)
cleaned_docs <- clean_text(sample_docs)
index_blogs <- index_blogs + sample_length_blogs
index_news <- index_news + sample_length_news
index_twitter <- index_twitter + sample_length_twitter
## retrieve onegrams, bi-grams, and tri-grams using RWeka
onegrams <- NULL
bigrams <- NULL
trigrams <- NULL
try({
onegrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 1, max = 1, delimiters = " \\r\\t\\n.,;:\"()?!"))
bigrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 2, max = 2, delimiters = " \\r\\t\\n.,;:\"()?!"))
trigrams <- RWeka::NGramTokenizer(cleaned_docs, RWeka::Weka_control(min = 3, max = 3, delimiters = " \\r\\t\\n.,;:\"()?!"))
})
if(!is.null(onegrams)){
all_onegrams <- c(all_onegrams, onegrams)
}
if(!is.null(bigrams)){
all_bigrams <- c(all_bigrams, bigrams)
}
if(!is.null(trigrams)){
all_trigrams <- c(all_trigrams, trigrams)
}
cat(num_iterations, '\n')
num_iterations <- num_iterations + 1
}
save(all_onegrams,file="onegrams.rda")
save(all_bigrams, file="bigrams.rda")
save(all_trigrams, file="trigrams.rda")
}
Below, I calculate N-gram (n = 1 to 3) frequencies in the sampled documents using the base R table() function and plot the top 30 terms for each N:
if(is.null(all_onegrams)){
load(file='onegrams.rda')
}
if(is.null(all_bigrams)){
load(file='bigrams.rda')
}
if(is.null(all_trigrams)){
load(file='trigrams.rda')
}
onegram_count <- data.frame(table(all_onegrams), stringsAsFactors = F)
onegram_count <- rename(onegram_count, frequency = Freq, term = all_onegrams)
onegram_count$term <- as.character(onegram_count$term)
a <- dplyr::arrange(onegram_count, desc(frequency)) %>% head(30)
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") +
theme_classic() +
ggtitle('Top 30 one-gram terms') +
theme(
axis.text.x = element_text(angle = 45, size=12, family="Helvetica", vjust = 0.3),
axis.text.y = element_text(size=12, family="Helvetica"),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 12, family = "Helvetica"),
plot.title = element_text(size=12,face="bold")
)
bigram_count <- data.frame(table(all_bigrams), stringsAsFactors = F)
bigram_count <- rename(bigram_count, frequency = Freq, term = all_bigrams)
bigram_count$term <- as.character(bigram_count$term)
a <- dplyr::arrange(bigram_count, desc(frequency)) %>% head(30)
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") +
theme_classic() +
ggtitle('Top 30 bi-gram terms') +
theme(
axis.text.x = element_text(angle = 45, size=12, family="Helvetica", vjust = 0.6),
axis.text.y = element_text(size=12, family="Helvetica"),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 12, family = "Helvetica"),
plot.title = element_text(size=12,face="bold")
)
trigram_count <- data.frame(table(all_trigrams), stringsAsFactors = F)
trigram_count <- rename(trigram_count, frequency = Freq, term = all_trigrams)
trigram_count$term <- as.character(trigram_count$term)
a <- dplyr::arrange(trigram_count, desc(frequency)) %>% head(30)
a <- within(a, term <- factor(term, levels=a$term))
ggplot(a, aes(x = term, y = frequency)) + geom_bar(stat = "identity") +
theme_classic() +
ggtitle('Top 30 tri-gram terms') +
theme(
axis.text.x = element_text(angle = 90, size=12, family="Helvetica", vjust = 0.6),
axis.text.y = element_text(size=12, family="Helvetica"),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 12, family = "Helvetica"),
plot.title = element_text(size=12,face="bold")
)
Above, I have shown the sizes of the document datasets and, through sampling, demonstrated various techniques for cleaning the text into meaningful words and word phrases. In the next stage of the project, I will improve upon the current pre-processing steps (some limitations were outlined above, such as misspellings, non-English text, and non-stemmed words) and work towards building an N-gram model for text prediction. I also plan to review Markov chain models in R and how they can be used efficiently for prediction. A rough sketch of how the N-gram counts computed above could drive a simple next-word lookup follows below.
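As an illustration of the general approach (not the final model), here is a minimal sketch of how the bigram_count and trigram_count data frames built above could back a frequency-based next-word lookup with a simple tri-gram to bi-gram backoff. The function name and the backoff rule are my own assumptions, and a real model would add smoothing and better handling of unseen contexts. Note also that, because stop words are removed during cleaning, the lookup effectively works on content words only.

## Sketch only: frequency-based next-word lookup with a simple tri-gram -> bi-gram backoff.
## Assumes the bigram_count and trigram_count data frames computed above.
predict_next_word <- function(phrase, n_suggestions = 3){
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  suggestions <- character(0)
  ## try tri-grams first: match on the last two words of the phrase
  if(length(words) >= 2){
    context <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_count[startsWith(trigram_count$term, paste0(context, " ")), ]
    hits <- hits[order(-hits$frequency), ]
    suggestions <- sub(".*\\s", "", head(hits$term, n_suggestions))
  }
  ## back off to bi-grams if the tri-gram table gave too few suggestions
  if(length(suggestions) < n_suggestions && length(words) >= 1){
    context <- tail(words, 1)
    hits <- bigram_count[startsWith(bigram_count$term, paste0(context, " ")), ]
    hits <- hits[order(-hits$frequency), ]
    suggestions <- unique(c(suggestions, sub(".*\\s", "", head(hits$term, n_suggestions))))
  }
  head(suggestions, n_suggestions)
}
## example usage (the suggestions depend on the sampled data):
## predict_next_word("going home tonight")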