Our objective is to create an English-language text prediction model and interactive product using R and Shiny. Our training data set consists of three large files of English text collected from the web:
| Source | Lines | Words |
|---|---|---|
| Blogs | 899,288 | 37,334,131 |
| News | 1,010,242 | 34,372,530 |
| Twitter | 2,360,148 | 30,373,584 |
Although word counting is a built-in feature of many NLP packages (such as quanteda, which I use later), I decided to do this in base R to gain a better sense of the data (see the Appendix for the code).
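For example, each line can be split on runs of whitespace and the resulting pieces counted. A minimal illustration (the full per-file code is in the Appendix):

length(unlist(strsplit(c("It was the best of times,",
                         "it was the worst of times."), "\\s+")))
## [1] 12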
Notes:
From the blogs file
## [1] "Its no costume, Pricklewood, Im the real McCoy. I then got down onto the carpet, grasped the feet of the armchair with my toes and lifted it off the ground. How many humans do you know who can do that? I asked."
From the twitter file
## [1] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
From the news file. Note the character encoding that does not render as standard UTF-8. This is a challenge when working with any large set of unfamiliar text documents.
## [1] "<U+0093>I was just trying to hit it hard someplace,<U+0094> said Rizzo, who hit the pitch to the opposite field in left-center. <U+0093>I<U+0092>m just up there trying to make good contact.<U+0094>"
1. Sampling: The provided data is too large to build a predictive model on a personal desktop, so the first step was to create a sample. I took a random 10 percent sample of each document set and combined them into one.
2. Profanity Filtering: Next, I filtered out all lines containing certain profanity, as we don’t want to suggest such words to users in the final app. I decided not to be overly aggressive, so I used George Carlin’s seven words plus a few more. Much longer word blacklists are available, but that seemed like overkill for this project.
3. Fixing Apostrophes & Removing Non-ASCII Characters: After some initial trial and error, I decided to strip most non-ASCII characters (e.g. ♥) to simplify future work. My assumption is that this will have no material impact on the accuracy of the final deliverable. First, however, I ensured that the right apostrophe, which is not always encoded consistently (see the deliberately selected examples displayed above), is preserved by converting it to a plain single quote. It is important, for example, that the final product suggests “don’t” rather than “dont”. A short illustration follows this list.
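As a quick sketch of the intended cleaning behaviour (the full cleanse_line() function is in the Appendix), the right apostrophe is normalised before any non-ASCII stripping:

x <- "I don\u2019t \u2665 Mondays"      ## curly apostrophe plus a heart symbol
x <- gsub("[\u2019\u0092]", "'", x)     ## normalise curly apostrophes to '
iconv(x, "UTF-8", "ASCII", sub = "")    ## strip remaining non-ASCII characters
## the heart is removed while the apostrophe in "don't" is preserved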
1. Tokenization: Tokenization is the process of segmenting a text into words and is a critical task for our problem. After trialing several options, I decided to use the quanteda package for tokenization. It is fast and relatively easy to follow. Its tokenization function includes a built-in option to return n-grams, so I could easily create unigram, bigram, trigram & quadrigram frequency tables using a single package, combined with some processing in data.table (not really required here, but I’ve grown fond of it). A minimal tokenization example follows this list.
2. N-Gram Frequencies: I decided to convert to lowercase and remove punctuation between words. Punctuation matters in any specific case, but my assumption is that, in the aggregate, removing punctuation before creating n-grams will not materially impact the accuracy of the model while simplifying the task. I did not remove any “stopwords” (common words such as “the” and “a”) because these are important for the text prediction problem. I created four frequency tables, from unigram to quadrigram.
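To make the tokenization step concrete, here is a minimal sketch using the same (older) quanteda interface as the Appendix; newer quanteda releases expose this through tokens() instead, so treat it as illustrative:

library(quanteda)
## Lowercase, drop punctuation, and return bigrams joined by a space
toks <- quanteda::tokenize(toLower("The quick brown fox jumps over the lazy dog."),
                           removePunct = TRUE, ngrams = 2, concatenator = " ")
unlist(toks)
## e.g. "the quick" "quick brown" "brown fox" ... "lazy dog"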
A few observations:
Another good way to visualize text analysis results is through word clouds, using the wordcloud package:
wordcloud(words = unigrams$words, freq = unigrams$frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
#Step 1 - Initial Loading & Summary
setwd("./data/Coursera-SwiftKey/final/en_US/")
library(readr)
#Blogs
con <- file("en_US.blogs.txt","rb",encoding="UTF-8")
blogs <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
blog_words <- unlist(strsplit(blogs,"\\s+"))
format(length(blogs),big.mark=",")
## 899,288 lines
format(length(blog_words),big.mark=",")
## 37,334,131 words
#News
con <- file("en_US.news.txt","rb",encoding="UTF-8")
news <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
news_words <- unlist(strsplit(news,"\\s+"))
format(length(news),big.mark=",")
## 1,010,242 lines
format(length(news_words),big.mark=",")
## 34,372,530 words
#Twitter
con <- file("en_US.twitter.txt","rb",encoding="UTF-8")
twitter <- read_lines(con)
close(con)
## In this case counting any series of non-space characters as a single word
twitter_words <- unlist(strsplit(twitter,"\\s+"))
format(length(twitter),big.mark=",")
## 2,360,148 lines
format(length(twitter_words),big.mark=",")
## 30,373,584 words
#Step 2 - Sampling
sample_p <- function(x,p) {
sample(x,size=round(length(x)*p,0))
}
set.seed(123)
docs_10p <- c(sample_p(blogs,.1),sample_p(news,.1),sample_p(twitter,.1))
#Step 3 - Profanity & Non-ASCII Cleaning Handling
##From George Carlin's 7 words plus more, case insensitive
profanity <- "(?i)fuck|cocksuck|piss|cunt|tits|bitch|faggot|nigger|asshole|nigga"
cleanse_line <- function(line) {
  ## Normalise curly right apostrophes to a plain apostrophe, then strip remaining non-ASCII
  line <- gsub("[\u2019\u0092]", "'", line)
  line <- iconv(line, "UTF-8", "ASCII", sub = "")
  ## Drop any line containing profanity; perl = TRUE is required for the inline (?i) flag
  return(grep(profanity, line, value = TRUE, invert = TRUE, perl = TRUE))
}
docs_10p_clean <- unlist(lapply(docs_10p,cleanse_line))
#Step 4 - Tokenization using Quanteda
library(ggplot2)
library(wordcloud)
library(quanteda)
library(readr)
docs_10p_clean_path <- "./data/samples/docs_10p_clean.txt"
write_lines(docs_10p_clean, docs_10p_clean_path)  ## persist the cleaned sample to disk
docs_10p_clean <- textfile(docs_10p_clean_path, cache = FALSE)
summary(corpus(docs_10p_clean))
ngram_frequency_table <- function(x,n = 1L) {
require(quanteda)
require(data.table)
ngrams <- quanteda::tokenize(toLower(corpus(x)),removePunct=TRUE,ngrams=n,concatenator=" ")
ngramsDT <- data.table(words = ngrams$text1)
ngramsDT <- ngramsDT[,.(frequency=.N),by=.(words)]
setorder(ngramsDT,-frequency)
return(ngramsDT)
}
unigrams <- ngram_frequency_table(docs_10p_clean)
bigrams <- ngram_frequency_table(docs_10p_clean,n=2)
trigrams <- ngram_frequency_table(docs_10p_clean,n=3)
quadrigrams <- ngram_frequency_table(docs_10p_clean,n=4)
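## Optional sanity check: peek at the most frequent n-grams in each table
head(unigrams, 10)     ## most frequent single words
head(quadrigrams, 10)  ## most frequent four-word sequences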
#Step 5 - Frequency Plotting
plot_ngrams <- function(ngrams,top=30,title ="") {
require(ggplot2)
require(scales)
g <- ggplot(data = ngrams[1:top], aes(x = reorder(words, frequency), y = frequency))
g <- g + geom_bar(stat = "identity") + coord_flip() +
scale_y_continuous(name="Frequency", labels = comma) + xlab("n-gram") +
ggtitle(title)
return (g)
}
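## Example usage of the plotting helper above (illustrative; titles are arbitrary)
plot_ngrams(unigrams, top = 30, title = "Top 30 Unigrams")
plot_ngrams(bigrams, top = 30, title = "Top 30 Bigrams")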