Creating a Predictive Text Application
Exploratory Data Analysis
Objective
The objective of this capstone project is to build a predictive text app similar to those found in SwiftKey technology. Based upon previous word combinations, new words are proposed to make typing more efficient.
This R Markdown document will show the entire project from beginning to end.
1. Getting and Cleaning the Data
The data provided are texts written in English, German, Russian, and Finnish; only the English files will be used here. The texts come from a variety of media: blogs, news, and Twitter. These data will be used to develop the prediction algorithm, but a fair amount of preprocessing is required to get them into an acceptable format.
The main objective for this week is to read in a subset of the data, tokenize the text (raw text -> tokenizer function -> tokenized file), and filter profanity from the text.
Loading the Data
My smooth.operatoR function loads the packages named in a character vector and installs any that are not already in my package library.
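The helper itself is defined outside this document; a minimal sketch of the idea (my reconstruction, not the exact code) is: install any packages from the vector that are missing, then load them all, returning a named logical vector.
smooth.operatoR <- function(pkgs) {
  ## install anything not already in the library
  missing_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
  ## load everything; returns a named logical vector of successes
  sapply(pkgs, require, character.only = TRUE)
}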
# load packages ----
smooth.operatoR(c("dplyr", "tm", "ggplot2", "tidytext"))
   dplyr       tm  ggplot2 tidytext
    TRUE     TRUE     TRUE     TRUE
A 10% random sample will be taken from each file. A random sample of this size should remain reasonably representative of each of the three datasets while keeping memory use and processing time manageable.
Using the wc command in Terminal, the following stats were obtained from the pre-sampled data (i.e., the original data).
- en_US.blogs.txt - 37334690 words, 899288 lines
- en_US.news.txt - 34372720 words, 1010242 lines
- en_US.twitter.txt - 30374206 words, 2360148 lines
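These counts can also be cross-checked from inside R if desired (a rough check only, since it reads a full file into memory; dropbox_path is the data folder used in the loop below):
# optional cross-check of the wc counts for one file
blogs_full <- readLines(file.path(dropbox_path, "en_US.blogs.txt"), skipNul = TRUE)
length(blogs_full)                           # line count
sum(lengths(strsplit(blogs_full, "\\s+")))   # approximate word count
rm(blogs_full)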
# list the files in the dropbox folder that hold the data ----
dropbox_files <- list.files(dropbox_path, pattern = ".txt")
# readLines through the text files and sample 10% from each ----
for(i in seq_along(dropbox_files)) {
## open connection
open_con <- file(paste(dropbox_path, dropbox_files[i], sep = "/"),
open = "r")
## load data from dropbox_files loop
show(paste0("loading ", dropbox_files[i]))
text <- readLines(open_con, skipNul = TRUE)
show(paste0("finished loading ", dropbox_files[i]))
## get the document length
doc_length <- length(text)
## set.seed for reproducibility and sample
set.seed(20)
show(paste0("sampling ", dropbox_files[i]))
text_sample <- text[sample(1:doc_length, doc_length * .10,
replace = FALSE)]
## assign a name to text
assign(x = tolower(gsub("[[:punct:]]", "_", dropbox_files[i])),
value = text_sample, envir = .GlobalEnv)
## close connection
close(open_con)
## garbage collection
text <- NULL
text_sample <- NULL
gc()
}
[1] "loading en_US.blogs.txt"
[1] "finished loading en_US.blogs.txt"
[1] "sampling en_US.blogs.txt"
[1] "loading en_US.news.txt"
[1] "finished loading en_US.news.txt"
[1] "sampling en_US.news.txt"
[1] "loading en_US.twitter.txt"
[1] "finished loading en_US.twitter.txt"
[1] "sampling en_US.twitter.txt"
Preprocessing
There are a few clean-up tasks to take care of:
- Combine the data subsets into a single corpus.
- Clean the corpus as much as possible.
- Check for spelling errors (one possible approach is sketched after this list).
- Remove profanity.
- Create a tidy dataset with the tidytext package.
- Look at some summary statistics of the words.
- Examine n-gram frequencies.
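The spelling check is not carried through in the preprocessing below, but one possible approach, assuming the hunspell package were installed, would be to flag unknown words in a slice of the sample:
# possible spelling check (not applied to the corpus below)
library(hunspell)
misspelled <- hunspell(en_us_blogs_txt[1:1000])   # unknown words per sampled line
head(sort(unique(unlist(misspelled))))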
Create a Corpus
Combining the samples from all three text files should give the most robust predictive ability, so below I merge the three texts into a single corpus. The clean.corpus function performs the standard text preprocessing, including the removal of the so-called “Seven Dirty Words”.
# combine text vectors together, create corpus ----
single_vector <- c(en_us_blogs_txt, en_us_news_txt, en_us_twitter_txt)
corpus <- VCorpus(VectorSource(single_vector))
# clean corpus function ----
clean.corpus <- function(corpus) {
require(tm)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(gsub),
pattern = "shit|piss|fuck|cunt|cocksucker|motherfucker|tits",
replacement = "")
corpus <- tm_map(corpus, PlainTextDocument)
## return the cleaned corpus explicitly
corpus
}
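## NOTE (possible refinement, not applied here): the profanity pattern above is
## matched as a plain substring, so words that merely contain one of those
## strings would also be clipped. Anchoring the pattern with word boundaries,
## e.g. pattern = "\\b(shit|piss|fuck|cunt|cocksucker|motherfucker|tits)\\b",
## would restrict removal to whole words.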
# process corpus ----
corpus <- clean.corpus(corpus)
Create a Tidy Dataset
The tidytext package’s tidy function will be used to convert the corpus into a tibble.
# create a data frame from the corpus ----
text_df <- tidy(corpus)
2. Exploratory Analysis
At this point, the data are fairly clean and are ready to explore.
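One of the clean-up tasks listed earlier was to look at some summary statistics of the words. A quick way to get a sense of scale, using the same tidytext tokenizer as the n-gram counts below, is:
# total and distinct word tokens in the sampled corpus
text_df %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  summarise(total_tokens = n(), unique_words = n_distinct(word))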
Unigram Frequency
This is a look at the most commonly occurring unigrams. In the Create a Corpus section, I decided against removing stop words because they will be needed for the predictive functionality of the future app. As a result, the most frequent unigrams are articles and prepositions.
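As a side check (not used further in the analysis), the stop_words lexicon that ships with tidytext shows which content words would rise to the top if stop words were removed:
# for comparison only: unigram counts with stop words removed
text_df %>%
  select(text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)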
# tokenize and get unigrams ----
text_unigrams <- text_df %>%
select(text) %>%
unnest_tokens(unigram, text) %>%
count(unigram, sort = TRUE)
# plot unigrams ----
ggplot(text_unigrams[1:10, ], aes(x = reorder(unigram, -n), y = n)) +
  geom_col(fill = "purple") +
  labs(x = "unigram", y = "frequency", title = "top 10 unigrams")
Bigram Frequency
The plot below shows the most frequent two-word combinations in the corpus.
# tokenize by bigram ----
text_bigrams <- text_df %>%
select(text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)
# plot bigrams ----
ggplot(text_bigrams[1:10, ], aes(x = reorder(bigram, -n), y = n)) +
  geom_col(fill = "blue") +
  labs(x = "bigram", y = "frequency", title = "top 10 bigrams")
Trigram Frequency
These are the most frequent three-word combinations.
# tokenize by trigram ----
text_trigrams <- text_df %>%
select(text) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE)
# plot trigram ----
ggplot(text_trigrams[1:10, ], aes(x = reorder(trigram, -n), y = n)) +
  geom_col(fill = "orange") +
  labs(x = "trigram", y = "frequency", title = "top 10 trigrams")
Quadgram Frequency
Finally, these are the most frequent four-word combinations.
# tokenize by quadgram ----
text_quadgrams <- text_df %>%
select(text) %>%
unnest_tokens(quadgram, text, token = "ngrams", n = 4) %>%
count(quadgram, sort = TRUE)
# plot quadgrams ----
ggplot(text_quadgrams[1:10, ], aes(x = reorder(quadgram, -n), y = n)) +
  geom_col(fill = "red") +
  labs(x = "quadgram", y = "frequency", title = "top 10 quadgrams") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
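Since the end goal is prediction rather than description, it is worth sketching how these frequency tables could eventually be used. The snippet below is a rough illustration only, not the final model: it splits each bigram into its two words and returns the most frequent continuations of a given first word (the next.word helper and the top_n default are mine, for illustration).
# rough sketch of a bigram-based next-word lookup (illustration only)
library(tidyr)
bigram_lookup <- text_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")
next.word <- function(prev_word, top_n = 3) {
  bigram_lookup %>%
    filter(word1 == tolower(prev_word)) %>%
    slice_max(n, n = top_n) %>%
    pull(word2)
}
next.word("the")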