Alan Turing (1950) opening statement “I propose to consider the question, ‘Can machines think?’” (p. 433) in his article “Computing Machinery and Intelligence”. The Turing Test has become a basis of natural language processing.

The essence of the Capstone project is to create an application that uses NLP techniques and predictive analytics, and like SwiftKey’s applications, takes in a word phrase and returns next-predicted word.

The project is developed in partnership with SwiftKey as the company well known for their predictive text analytics. As of March 1, 2016, SwiftKey became part of the Microsoft family of products. SwiftKey applications are used on Android and iOS anticipating and providing next-word choices while keyboard typing through Natural Language Processing (NLP) techniques. Microsoft’s Word Flow Technology is another example of NLP in action.

The milestone reports purpose is to identify the initial steps taken to produce the overall capstone project, a Natural Language Processing word prediction alogrithm application.

Additionally, the work presented in this project follows the tenets of reproducible research and all code is available in an open-source repository to enable readers to review the approach, reproduce the results, and collaborate to enhance the model.


The data for analysis to be used is from


This analysis is based on three text files containing internet blogs, news feeds, and Twitter tweets from HCCorpora.

This report will outline the process for:

  1. Retrieving the data
  2. Cleaning up the data
  3. The tokenizing functions used to find the most frequently used word (unigrams), word expressions (bigrams,trigrams,n-grams)
  4. Exploratory analyse of the data
  5. Future plans for a prediction application

Retrieve the data

Having downloaded and extracted the zip file, we then read the files and adjust for encoding.
A sample of each text is taken to optimise performance of the analyse.

data.blogs <- read_file("final/en_US/en_US.blogs.txt")
data.news <- read_file("final/en_US/en_US.news.txt")
data.twitter <- read_file("final/en_US/en_US.twitter.txt")
data.blogs = iconv(data.blogs, "latin1", "ASCII", sub = "")
data.news = iconv(data.news, "latin1", "ASCII", sub = "")
data.twitter = iconv(data.twitter, "latin1", "ASCII", sub = "")

sample.blogs <- smaller(data.blogs)
sample.news <- smaller(data.news)
sample.twitter <- smaller(data.twitter)
sample.all <- c(sample.blogs, sample.news, sample.twitter)

Overview of the original data

The texts inside the zip file we will analyse are under the ‘en_US’ folder



Characters in each file

    data_summary <- data.frame(
    Blogs=c((length(nchar(data.blogs)))),News =c((length(nchar(data.news)))),
    Twitter =c((length(nchar(data.twitter)))))
Blogs News Twitter
899288 1010242 2360148

Clean the text to create a corpus we can use to build a model

Remove numbers, urls, whitespace, punctuation, and non-useful text (eg lines ‘—–’)

    group_texts <- list(sample.blogs, sample.news, sample.twitter)
    group_texts <- tolower(group_texts)
    group_texts <- removeNumbers(group_texts)
    group_texts <- removePunctuation(group_texts, preserve_intra_word_dashes = TRUE)
    group_texts <- gsub("http[[:alnum:]]*", "", group_texts)
    group_texts <- stripWhitespace(group_texts)
    group_texts <- gsub("\u0092", "'", group_texts)
    group_texts <- gsub("\u0093|\u0094", "", group_texts)

    corpus.data <- VCorpus(VectorSource(group_texts))
    toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x, fixed = TRUE))
    toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, fixed = TRUE))
    corpus.data <- tm_map(corpus.data, toEmpty, "#\\w+")
    corpus.data <- tm_map(corpus.data, toEmpty, "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)")
    corpus.data <- tm_map(corpus.data, toEmpty, "@\\w+")
    corpus.data <- tm_map(corpus.data, toEmpty, "http[^[:space:]]*")
    corpus.data <- tm_map(corpus.data, toSpace, "/|@|\\|")

Remove derogatory words from a known bad word list

    save_file("https://goo.gl/To9w5B", "bad_word_list.txt")
    bad_words <- readLines("./bad_word_list.txt")
    corpus.data <- tm_map(corpus.data, removeWords, bad_words)

Stem words in a text document using Porter’s stemming algorithm

So the word ‘mining’ was stemmed to ‘mine’
    corpus.data <- tm_map(corpus.data, stemDocument)

Explore the data through N-Grams

Jurafsky and Martin (2000) provide a seminal work within the domain of NLP. The authors present a key approach for building prediction models called the N-Gram, which relies on knowledge of word sequences from (N-1) prior words. It is a type of language model based on counting words in the corpora to establish probabilities about next words.

Here we will tokenize the sample texts to generate uni/bi/tri grams.
This will allow us to see the most used words and expressions.

Uni-grams - frequency of single words

    g1 <- ggplot(gram1freq, aes(x = word, y = freq)) +
        geom_bar(stat = "identity", fill = "red") +
        ggtitle("1-gram") +
        xlab("1-grams") + ylab("Frequency")

Bi-gram - frequency of two words phrases

    g2 <- ggplot(gram2freq, aes(x = word, y = freq)) +
        geom_bar(stat = "identity", fill = "yellow") +
        ggtitle("2-gram") +
        xlab("2-grams") + ylab("Frequency")

Tri-gram - frequency of three words phrases

    g3 <- ggplot(gram3freq, aes(x = word, y = freq)) +
        geom_bar(stat = "identity", fill = "blue") +
        ggtitle("3-gram") +
        xlab("3-grams") + ylab("Frequency")


Following is a wordcloud of unigrams and distributions of 2- and 3-gram phrases. These are calculated from taking 100 samples from each of the three N-Grams analysis. The more pronounced the word / phrase, the more frequent its appearance in the texts.

create_wordcloud <- function(tdm, palette = "Dark2") {
    mtx = as.matrix(tdm)
    # get word counts in decreasing order
    word_freqs = sort(rowSums(mtx), decreasing = TRUE)
    # create a data frame with words and their frequencies
    dm = data.frame(word = names(word_freqs), freq = word_freqs)
      dm <- sqldf("select * from dm limit 100")
    # plot wordcloud
    wordcloud(dm$word, dm$freq, random.order = FALSE, colors = brewer.pal(8, palette))
create_wordcloud(tdm1, "Set1")

create_wordcloud(tdm2, "Set2")

create_wordcloud(tdm3, "Set3")

Observations from the analysis

  1. Case: corpora words will not be case-sensitive. Although important for spelling correction and part of text analysis, the words themselves - not their case - are important for prediction.
  2. Stopwords: similarly, unlike classification and clustering applications, all words will be included in the model as they represent more than just the primary carriers of the message.
  3. Wordform: stemming will be used as N-Grams are typically based on wordforms Stemming appears useful as bi/tri grams are made of short words that have a better frequence for matching when stemmed.
  4. Punctuation: Is removed because tweets are rarely correctly punctuated
  5. Bias may exist in twitter text as it is a shorter form than news/blog text. Need to account for this and nullify it.
  6. Numbers: there is no intuition based on the research that numbers will have a great impact on a predication model and they will be removed
  7. Sparse Words: all words will be retained. A key concept from Jurafsky and Martin is the idea that even bigram models are quite sparse; however, rather than eliminating those wordforms, they become clues to the “probability of unseen N-Grams” (p. 209, 2000). They include as their fourth of eight key concepts Things Seen Once and recommend using the count of wordforms seen a single time to provide a basis to estimate those things not seen in the training set and will likely appear in a test set MODSIM World 2015
  8. Whitespace: this was not discussed directly by Jurafsky and Martin. The intuition is that whitespace has little to do with context and excess whitespace will be removed

Future plans for creating a prediction algorithm and Shiny app

Using the data in this analyse a Katz back off n-gram model will used to build a predictive text input application.

The Katz back-off is a generative n-gram language model that estimates the conditional probability of a word given its history in the n-gram. It accomplishes this estimation by “backing-off” to models with smaller histories under certain conditions. By doing so, the model with the most reliable information about a given history is used to provide the better results.

This will be hosted in a ShinyApp web application that accepts user input and outputs the next predicted words.