Our objective is to create an English-language text prediction model and interactive product using R and Shiny. Our training data set consists of three large files of English text collected from the web:
| Source | Lines | Words |
|---|---|---|
| Blogs | 899,288 | 37,334,131 |
| News | 1,010,242 | 34,372,530 |
| Twitter | 2,360,148 | 30,373,584 |
Although word counting is a built-in feature of many NLP packages (such as quanteda, which I use later), I decided to do this in base R to gain a better sense of the data (see the Appendix for the code).
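For example, each line can be split on runs of whitespace and the resulting pieces counted. A minimal illustration (the full per-file code is in the Appendix):

length(unlist(strsplit(c("It was the best of times,",
                         "it was the worst of times."), "\\s+")))
## [1] 12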
Notes:
From the blogs file
## [1] "Its no costume, Pricklewood, Im the real McCoy. I then got down onto the carpet, grasped the feet of the armchair with my toes and lifted it off the ground. How many humans do you know who can do that? I asked."
From the twitter file
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
From the news file. Note the character encoding that does not render as standard UTF-8. This is a challenge when working with any large set of unfamiliar text documents.
## [1] "<U+0093>I was just trying to hit it hard someplace,<U+0094> said Rizzo, who hit the pitch to the opposite field in left-center. <U+0093>I<U+0092>m just up there trying to make good contact.<U+0094>"
1. Sampling: The provided data is too large to build a predictive model on a personal desktop, so the first step was to create a sample. I took a random 10 percent sample of each document set and combined them into one.
2. Profanity Filtering: Next, I filtered out all lines containing certain profanity, as we don’t want to suggest such words to users in the final app. I decided not to be overly aggressive, so I used George Carlin’s seven words plus a few more. Much longer word blacklists are available, but that seemed like overkill for this project.
3. Fixing Apostrophes & Removing Non-ASCII Characters: After some initial trial and error, I decided to strip most non-ASCII characters (e.g. ♥) to simplify future work. My assumption is that this will have no material impact on the accuracy of the final deliverable. First, however, I ensured that the right apostrophe, which is not always encoded consistently (see the deliberately selected examples displayed above), is preserved by converting it to a plain single quote. It is important, for example, that the final product suggests “don’t” rather than “dont”. A short illustration follows this list.
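As a quick sketch of the intended cleaning behaviour (the full cleanse_line() function is in the Appendix), the right apostrophe is normalised before any non-ASCII stripping:

x <- "I don\u2019t \u2665 Mondays"      ## curly apostrophe plus a heart symbol
x <- gsub("[\u2019\u0092]", "'", x)     ## normalise curly apostrophes to '
iconv(x, "UTF-8", "ASCII", sub = "")    ## strip remaining non-ASCII characters
## the heart is removed while the apostrophe in "don't" is preserved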
1. Tokenization: Tokenization is the process of segmenting a text into words and is a critical task for our problem. After trialing several options, I decided to use the quanteda package for tokenization. It is fast and relatively easy to follow. Its tokenization function includes a built-in option to return n-grams, so I could easily create unigram, bigram, trigram & quadrigram frequency tables using a single package, combined with some processing in data.table (not really required here, but I’ve grown fond of it). A minimal tokenization example follows this list.
2. N-Gram Frequencies: I decided to convert to lowercase and remove punctuation between words. Punctuation matters in any specific case, but my assumption is that, in the aggregate, removing punctuation before creating n-grams will not materially impact the accuracy of the model while simplifying the task. I did not remove any “stopwords” (common words such as “the” and “a”) because these are important for the text prediction problem. I created four frequency tables, from unigram to quadrigram.
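To make the tokenization step concrete, here is a minimal sketch using the same (older) quanteda interface as the Appendix; newer quanteda releases expose this through tokens() instead, so treat it as illustrative:

library(quanteda)
## Lowercase, drop punctuation, and return bigrams joined by a space
toks <- quanteda::tokenize(toLower("The quick brown fox jumps over the lazy dog."),
                           removePunct = TRUE, ngrams = 2, concatenator = " ")
unlist(toks)
## e.g. "the quick" "quick brown" "brown fox" ... "lazy dog"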
A few observations:
Another good way to visualize text analysis results is through word clouds, using the wordcloud package:
wordcloud(words = unigrams$words, freq = unigrams$frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
#Step 1 - Initial Loading & Summary
setwd("./data/Coursera-SwiftKey/final/en_US/")
library(readr)
#Blogs
con <- file("en_US.blogs.txt","rb",encoding="UTF-8")
blogs <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
blog_words <- unlist(strsplit(blogs,"\\s+"))
format(length(blogs),big.mark=",")
## 899,288 lines
format(length(blog_words),big.mark=",")
## 37,334,131 words
#News
con <- file("en_US.news.txt","rb",encoding="UTF-8")
news <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
news_words <- unlist(strsplit(news,"\\s+"))
format(length(news),big.mark=",")
## 1,010,242 lines
format(length(news_words),big.mark=",")
## 34,372,530 words
#Twitter
con <- file("en_US.twitter.txt","rb",encoding="UTF-8")
twitter <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
twitter_words <- unlist(strsplit(twitter,"\\s+"))
format(length(twitter),big.mark=",")
## 2,360,148 lines
format(length(twitter_words),big.mark=",")
## 30,373,584 words
#Step 2 - Sampling
sample_p <- function(x,p) {
sample(x,size=round(length(x)*p,0))
}
set.seed(123)
docs_10p <- c(sample_p(blogs,.1),sample_p(news,.1),sample_p(twitter,.1))
#Step 3 - Profanity & Non-ASCII Cleaning Handling
##From George Carlin's 7 words plus more, case insensitive
profanity <- "(?i)fuck|cocksuck|piss|cunt|tits|bitch|faggot|nigger|asshole|nigga"
cleanse_line <- function(line) {
  ## Normalise curly right apostrophes to a plain apostrophe, then strip remaining non-ASCII
  line <- gsub("[\u2019\u0092]", "'", line)
  line <- iconv(line, "UTF-8", "ASCII", sub = "")
  ## Drop any line containing profanity; perl = TRUE is required for the inline (?i) flag
  return(grep(profanity, line, value = TRUE, invert = TRUE, perl = TRUE))
}
docs_10p_clean <- unlist(lapply(docs_10p,cleanse_line))
#Step 4 - Tokenization using Quanteda
library(ggplot2)
library(wordcloud)
library(quanteda)
library(readr)
docs_10p_clean_path <- "./data/samples/docs_10p_clean.txt"
write_lines(docs_10p_clean, docs_10p_clean_path)  ## persist the cleaned sample to disk
docs_10p_clean <- textfile(docs_10p_clean_path, cache = FALSE)
summary(corpus(docs_10p_clean))
ngram_frequency_table <- function(x,n = 1L) {
require(quanteda)
require(data.table)
ngrams <- quanteda::tokenize(toLower(corpus(x)),removePunct=TRUE,ngrams=n,concatenator=" ")
ngramsDT <- data.table(words = ngrams$text1)
ngramsDT <- ngramsDT[,.(frequency=.N),by=.(words)]
setorder(ngramsDT,-frequency)
return(ngramsDT)
}
unigrams <- ngram_frequency_table(docs_10p_clean)
bigrams <- ngram_frequency_table(docs_10p_clean,n=2)
trigrams <- ngram_frequency_table(docs_10p_clean,n=3)
quadrigrams <- ngram_frequency_table(docs_10p_clean,n=4)
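## Optional sanity check: peek at the most frequent n-grams in each table
head(unigrams, 10)     ## most frequent single words
head(quadrigrams, 10)  ## most frequent four-word sequences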
#Step 5 - Frequency Plotting
plot_ngrams <- function(ngrams,top=30,title ="") {
require(ggplot2)
require(scales)
g <- ggplot(data = ngrams[1:top], aes(x = reorder(words, frequency), y = frequency))
g <- g + geom_bar(stat = "identity") + coord_flip() +
scale_y_continuous(name="Frequency", labels = comma) + xlab("n-gram") +
ggtitle(title)
return (g)
}
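## Example usage of the plotting helper above (illustrative; titles are arbitrary)
plot_ngrams(unigrams, top = 30, title = "Top 30 Unigrams")
plot_ngrams(bigrams, top = 30, title = "Top 30 Bigrams")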