Creating a Predictive Text Application

Objective

The objective of this capstone project is to build a predictive text app similar to the technology behind SwiftKey. Based on previous word combinations, new words are proposed to help make typing more efficient.

This R Markdown document will show the entire project from beginning to end.

1. Getting and Cleaning the Data

The data provided are texts written in English, German, Russian, and Finnish. Only the English data will be used here. The texts come from a variety of sources, including blogs, news articles, and Twitter, and they will be used to develop the algorithm. However, quite a bit of preprocessing will need to occur to get the data into an acceptable format.

The main objective for this week is to read in a subset of the data, tokenize the text (raw text -> tokenizer function -> tokenized file), and filter profanity from the text.

Loading the Data

My smooth.operatoR function loads the packages named in a character vector and installs any that are not already in my package library.

# load packages ----
smooth.operatoR(c("dplyr", "tm", "ggplot2", "tidytext"))
   dplyr       tm  ggplot2 tidytext 
    TRUE     TRUE     TRUE     TRUE 
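
The function itself is not reproduced here; a minimal sketch of what such a helper might look like is shown below (a hypothetical implementation, not necessarily the actual smooth.operatoR).

# hypothetical sketch of a package loader/installer ----
smooth.operatoR <- function(pkgs) {
        ## install any packages missing from the library, then load everything
        missing <- pkgs[!pkgs %in% rownames(installed.packages())]
        if (length(missing) > 0) install.packages(missing)
        sapply(pkgs, require, character.only = TRUE)
}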

A 10% random sample will be taken from each file. Given how large the original files are, a sample of this size should still generalize reasonably well to each of the three datasets as a whole.

Using the wc command in Terminal, the following stats were obtained from the pre-sampled data (i.e., the original data).

  • en_US.blogs.txt - 37334690 words, 899288 lines
  • en_US.news.txt - 34372720 words, 1010242 lines
  • en_US.twitter.txt - 30374206 words, 2360148 lines
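
For reference, similar counts can be reproduced from within R. The sketch below is an optional check (not part of the original workflow) that assumes the files live in dropbox_path; it simply counts lines and whitespace-separated words, which is slower than wc but equivalent in spirit.

# optional sanity check of the raw file sizes from R ----
for (f in list.files(dropbox_path, pattern = ".txt", full.names = TRUE)) {
        raw_lines <- readLines(f, skipNul = TRUE)
        raw_words <- sum(lengths(strsplit(raw_lines, "\\s+")))
        show(paste0(basename(f), ": ", raw_words, " words, ", length(raw_lines), " lines"))
}
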
# list the files in the dropbox folder that hold the data ----
dropbox_files <- list.files(dropbox_path, pattern = ".txt")

# readLines through the text files and sample 10% from each ----
for(i in seq_along(dropbox_files)) {
        ## open connection
        open_con <- file(paste(dropbox_path, dropbox_files[i], sep = "/"), 
                         open = "r")
        ## load data from dropbox_files loop
        show(paste0("loading ", dropbox_files[i]))
        text <- readLines(open_con, skipNul = TRUE)
        show(paste0("finished loading ", dropbox_files[i]))
        ## get the document length
        doc_length <- length(text)
        ## set.seed for reproducibility and sample
        set.seed(20)
        show(paste0("sampling ", dropbox_files[i]))
        text_sample <- text[sample(1:doc_length, doc_length * .10, 
                                   replace = FALSE)]
        ## assign a name to text
        assign(x = tolower(gsub("[[:punct:]]", "_", dropbox_files[i])),
               value = text_sample, envir = .GlobalEnv)
        ## close connection
        close(open_con)
        ## garbage collection
        text <- NULL
        text_sample <- NULL
        gc()
}
[1] "loading en_US.blogs.txt"
[1] "finished loading en_US.blogs.txt"
[1] "sampling en_US.blogs.txt"
[1] "loading en_US.news.txt"
[1] "finished loading en_US.news.txt"
[1] "sampling en_US.news.txt"
[1] "loading en_US.twitter.txt"
[1] "finished loading en_US.twitter.txt"
[1] "sampling en_US.twitter.txt"

Preprocessing

There are a few clean-up tasks to take care of:

  • Combine the data subsets into a single corpus.
  • Clean the corpus as much as possible.
    • Check for spelling errors.
    • Remove profanity.
  • Create a tidy dataset with the tidytext package.
  • Look at some summary statistics of the words.
    • N-gram frequency.

Create a Corpus

Combining the samples from all three text files makes sense in order to give the model the most robust predictive ability. Below, I combine the three texts into a single corpus. The clean.corpus function performs the standard text preprocessing, including the removal of the so-called “Seven Dirty Words”.

# combine text vectors together, create corpus ----
single_vector <- c(en_us_blogs_txt, en_us_news_txt, en_us_twitter_txt)
corpus <- VCorpus(VectorSource(single_vector))

# clean corpus function ----
clean.corpus <- function(corpus) {
        require(tm)
        ## standard preprocessing: whitespace, case, punctuation, numbers
        corpus <- tm_map(corpus, stripWhitespace)
        corpus <- tm_map(corpus, content_transformer(tolower))
        corpus <- tm_map(corpus, removePunctuation)
        corpus <- tm_map(corpus, removeNumbers)
        ## profanity filter: strip the "Seven Dirty Words"
        corpus <- tm_map(corpus, content_transformer(gsub),
                         pattern = "shit|piss|fuck|cunt|cocksucker|motherfucker|tits",
                         replacement = "")
        ## store as plain text documents and return the cleaned corpus
        corpus <- tm_map(corpus, PlainTextDocument)
        corpus
}

# process corpus ----
corpus <- clean.corpus(corpus)
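
As a quick spot check (illustrative, not part of the original write-up), the content of the first cleaned document can be inspected to confirm that the lowercasing, punctuation removal, and profanity filter were applied.

# inspect the first cleaned document ----
as.character(corpus[[1]])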

Create a Tidy Dataset

The tidytext package’s tidy function will be used to convert the corpus into a tibble.

# create a data frame from the corpus ----
text_df <- tidy(corpus)
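
Each row of the resulting tibble holds one document along with its metadata. A quick look at the structure (an illustrative check, not part of the original write-up):

# peek at the tidy data frame ----
glimpse(text_df)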

2. Exploratory Analysis

At this point, the data are fairly clean and are ready to explore.

Unigram Frequency

This is a look at the most commonly occurring unigrams. In the Create a Corpus section, I decided against removing stopwords because they will be needed for the predictive functionality of the future app. As a result, the most frequent unigrams are articles and prepositions.

# tokenize and get unigrams ----
text_unigrams <- text_df %>% 
        select(text) %>% 
        unnest_tokens(unigram, text) %>% 
        count(unigram, sort = TRUE)

# plot unigrams ----
ggplot(text_unigrams[1:10, ], aes(x = reorder(unigram, -n), y = n)) +
        geom_col(fill = "purple") +
        labs(x = "unigram", y = "frequency", title = "top 10 unigrams")

Bigram Frequency

The plot below shows the most frequent two-word combinations in the corpus.

# tokenize by bigram ----
text_bigrams <- text_df %>% 
        select(text) %>% 
        unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
        count(bigram, sort = TRUE)

# plot bigrams ----
ggplot(text_bigrams[1:10, ], aes(x = reorder(bigram, -n), y = n)) +
        geom_col(fill = "blue") +
        labs(x = "bigram", y = "frequency", title = "top 10 bigrams")

Trigram Frequency

These are the most frequent three-word combinations.

# tokenize by trigram ----
text_trigrams <- text_df %>% 
        select(text) %>% 
        unnest_tokens(trigram, text, token = "ngrams", n = 3) %>% 
        count(trigram, sort = TRUE)

# plot trigram ----
ggplot(text_trigrams[1:10, ], aes(x = reorder(trigram, -n), y = n)) +
        geom_col(fill = "orange") +
        labs(x = "trigram", y = "frequency", title = "top 10 trigrams")

Quadgram Frequency

Finally, these are the most frequent four-word combinations.

# tokenize by quadgram ----
text_quadgrams <- text_df %>% 
        select(text) %>% 
        unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>% 
        count(quadgram, sort = TRUE)

# plot quadgram ----
ggplot(text_quadgrams[1:10, ], aes(x = reorder(quadgram, -n), y = n)) +
        geom_col(fill = "red") +
        labs(x = "quadgram", y = "frequency", title = "top 10 quadgrams") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
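
Although the prediction model itself is built later in the project, these n-gram tables are its raw material: given the last word(s) a user typed, the most frequent continuation in the table is a reasonable candidate. As a rough illustration only (not the final algorithm), the hypothetical helper below looks up the most common next words from the bigram counts; it assumes the tidyr package is available for separate().

# illustrative next-word lookup from the bigram counts ----
library(tidyr)
predict.next <- function(word, bigram_counts, top = 3) {
        bigram_counts %>% 
                separate(bigram, into = c("word1", "word2"), sep = " ") %>% 
                filter(word1 == tolower(word)) %>% 
                arrange(desc(n)) %>% 
                head(top) %>% 
                pull(word2)
}

# e.g., the three most frequent words that follow "of"
predict.next("of", text_bigrams)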

Matt Milunski

2017-08-23