Introduction

This is the milestone report for the Coursera Data Science Capstone Project offered by Johns Hopkins University. The goal of the project is to build a predictive text application that predicts the next word as the user types a sentence, much like today's smartphone keyboards built on SwiftKey's technology. The application comprises two parts: the prediction algorithm and a Shiny app UI.

The data provided for this project comes from 3 sources, 1) blogs, 2) news and 3) Twitter feeds, in 4 different languages: 1) German, 2) English, 3) Finnish and 4) Russian. For this project, we will be using the English database.

In this report we will highlight the results of some exploratory data analysis and detail the next steps to complete the application.

The following R packages were used for the data cleaning and exploratory data analysis:
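
The specific library() calls below are inferred from the functions used later in this report: tm for corpus handling and cleaning, RWeka for the n-gram tokenisers, and ggplot2 for the plots.

library(tm)      # VCorpus, tm_map, TermDocumentMatrix, removeSparseTerms
library(RWeka)   # NGramTokenizer, Weka_control
library(ggplot2) # frequency plots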

Data Description

First we read in the 3 files and tabulate some summary information about the data.

File                Size (MB)   Word Count   Line Count   Shortest Line (words)   Longest Line (words)
en_US.blogs.txt        200.42   37,334,131      899,288                       1                 40,835
en_US.news.txt         196.28   34,372,530    1,010,242                       1                 11,384
en_US.twitter.txt      159.36   30,373,543    2,360,148                       2                    213

Sampling the data and constructing a corpus

A corpus, a large and structured set of texts, is created by combining a 5% sample from each text source. Sampling reduces the time needed for pre-processing and cleaning as well as for tokenising the data.
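
A minimal sketch of that sampling step follows; the exact implementation (readLines() plus sample.int(), the seed, and the en_US file paths) is an assumption, but it follows the 5% rate described above and produces the full_sample object used when building the corpus below.

# Sketch of the 5% sampling step (implementation details assumed)
set.seed(1234)  # assumed seed, for reproducibility

sample_lines <- function(path, rate = 0.05) {
  con <- file(path, open = "rb")
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  lines[sample.int(length(lines), size = floor(rate * length(lines)))]
}

full_sample <- c(sample_lines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt"),
                 sample_lines("Coursera-SwiftKey/final/en_US/en_US.news.txt"),
                 sample_lines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt"))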

Cleaning the data

The corpus was then cleaned to remove punctuation and numbers, strip extra white space, and convert all text to lowercase. Stopwords were left in, as these occur in normal language and could well be the next word a user expects. Profanities and hashtags were also removed, along with email addresses, Twitter handles and URLs.

# Create corpus object for using tm_map functions
doc_vector <- VectorSource(full_sample)
corpus <- VCorpus(doc_vector)

# Profanity list obtained from http://www.bannedwordlist.com/
profanityfile <- file("Coursera-SwiftKey/final/swearWords.txt", open = "rb")
profanity <- readLines(profanityfile, encoding = "UTF-8", warn = TRUE, skipNul = TRUE)
close(profanityfile)
rm(profanityfile)

# Custom content transformers; the patterns passed in are regular expressions,
# so fixed = TRUE must not be used here (it would treat them as literal strings)
toEmpty <- content_transformer(function(x, pattern) gsub(pattern, "", x))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

corpus <- tm_map(corpus, toEmpty, "#\\w+")                              # Hashtags (e.g. #justsaying)
corpus <- tm_map(corpus, toEmpty, "(\\b\\S+\\@\\S+\\..{1,3}(\\s)?\\b)") # Email addresses (e.g. foo@demo.net)
corpus <- tm_map(corpus, toEmpty, "@\\w+")                              # Shout outs (e.g. @yoyo)
corpus <- tm_map(corpus, toEmpty, "http[^[:space:]]*")                  # URLs starting with http
corpus <- tm_map(corpus, toSpace, "/|@|\\|")                            # Slashes, at-signs and pipes
corpus <- tm_map(corpus, removePunctuation)                             # Any punctuation (.,!)
corpus <- tm_map(corpus, removeNumbers)                                 # Remove numbers
corpus <- tm_map(corpus, content_transformer(tolower))                  # Lower case all the text
corpus <- tm_map(corpus, removeWords, profanity)                        # Remove profanity
corpus <- tm_map(corpus, stripWhitespace)                               # Remove extra white space

Checking the cleaned corpus

corpus[[77]]$content

[1] "another book by margaret wise brown the runaway bunny written in and never out of print since then is the sixth book ive bought for henri it is a book about unconditional love the little bunny in the story wants to run away from home but realizes there is no where he can go where his loving mother wont find him and bring him home i want henri to know he can be assured of that kind of unconditional love from his family"

Data Analysis

Now that we’ve created and cleaned our corpus, we can analyze the frequency of terms. Using the n-gram tokenizer functions from the RWeka library, we create different n-grams from the corpus and then construct a term-document matrix for each n-gram token length (https://en.wikipedia.org/wiki/Document-term_matrix). Initial exploration of the unigram matrix showed a large number of sparse terms, hence we remove them.

#Tokenizer functions
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
quadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))

#Word/phrase count function
freq_df <- function(tdm){
  # Helper function to tabulate frequency
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}

#Creating the n-grams
corpus.unigram <- TermDocumentMatrix(corpus)
corpus.unigram <- removeSparseTerms(corpus.unigram, 0.99)
corpus.unigram.freq <- freq_df(corpus.unigram)

corpus.bigram <- TermDocumentMatrix(corpus, control=list(tokenize=bigramTokenizer))
corpus.bigram <- removeSparseTerms(corpus.bigram, 0.999)
corpus.bigram.freq <- freq_df(corpus.bigram)

corpus.trigram <- TermDocumentMatrix(corpus, control=list(tokenize=trigramTokenizer))
corpus.trigram <- removeSparseTerms(corpus.trigram, 0.999)
corpus.trigram.freq <- freq_df(corpus.trigram)

corpus.quadgram <- TermDocumentMatrix(corpus, control=list(tokenize=quadgramTokenizer))
corpus.quadgram <- removeSparseTerms(corpus.quadgram, 0.9999)
corpus.quadgram.freq <- freq_df(corpus.quadgram)
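
Before plotting, we can sanity-check the tokenisation by peeking at the top of each frequency table:

# Inspect the most frequent entries in each frequency table
head(corpus.unigram.freq)
head(corpus.bigram.freq)
head(corpus.trigram.freq)
head(corpus.quadgram.freq)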

Visualization

Here we plot the top 50 phrases for each of the n-grams.

library(ggplot2)
# Plotting function: bar chart of the 50 most frequent terms
top_50_plot <- function(df, title, color) {
  ggplot(df[1:50,], aes(x = seq_len(50), y = freq)) +
    geom_bar(stat = "identity", fill = color, colour = "black", width = 0.80) +
    coord_cartesian(xlim = c(0, 51)) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    # x is numeric, so a continuous scale is needed for the word labels to appear
    scale_x_continuous(breaks = seq(1, 50, by = 1), labels = df$word[1:50]) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
# Top 50 words - unigram
top_50_plot(corpus.unigram.freq, "Top 50 words", "steelblue")

# Top 50 2-word phrases - bigram
top_50_plot(corpus.bigram.freq, "Top 50 2 word phrases", "steelblue")

# Top 50 3-word phrases - trigram
top_50_plot(corpus.trigram.freq, "Top 50 3 word phrases", "steelblue")

# Top 50 4-word phrases - quadgram
top_50_plot(corpus.quadgram.freq, "Top 50 4 word phrases", "steelblue")

From the unigram plot we can see that “the” and “and” are by far the two most frequently occurring words. As these are commonly used in normal speech, we did not remove stopwords from the corpus; it was felt that doing so would negatively impact the prediction.

Next Steps

Now that we have performed some exploratory analysis and built some preliminary n-gram models, a potential strategy for the final product is an n-gram model with a frequency look-up table combined with a back-off technique: if no match is found for the highest-order n-gram, the model falls back to progressively shorter n-grams. Depending on the available time, stemming might also be considered in the data preprocessing.
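
As a rough illustration of the idea (not the final implementation), a simple back-off lookup over the frequency tables built above might look like the sketch below. The predict_next_word() helper and all of its details are assumptions: it searches the quadgram table for the last three words typed, backs off to the trigram and then the bigram tables, and finally falls back to the most frequent unigrams.

# Illustrative back-off lookup over the n-gram frequency tables (not the final model)
ngram_tables <- list(corpus.quadgram.freq, corpus.trigram.freq, corpus.bigram.freq)

predict_next_word <- function(input, tables = ngram_tables, n_suggestions = 3) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  for (i in seq_along(tables)) {
    n <- length(tables) - i + 2                       # n-gram order: 4, 3, 2
    context <- tail(words, n - 1)
    if (length(context) < n - 1) next                 # not enough context, back off
    prefix  <- paste0("^", paste(context, collapse = " "), " ")
    matches <- tables[[i]][grepl(prefix, tables[[i]]$word), ]
    if (nrow(matches) > 0) {
      # tables are already sorted by frequency, so the first matches are the most likely;
      # return the final word of each matching phrase
      last_words <- sapply(strsplit(as.character(matches$word), " "), tail, 1)
      return(head(unique(last_words), n_suggestions))
    }
  }
  head(as.character(corpus.unigram.freq$word), n_suggestions)  # fall back to top unigrams
}

predict_next_word("thanks for the")  # returns up to 3 candidate next words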

For the user interface, the current plan is to create a Shiny app with a simple interface for text input that displays a list of suggested “next” words based on our prediction model.
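
A minimal Shiny sketch of that interface, assuming a prediction function like the predict_next_word() example above, might look like this; the layout and widget names are placeholders rather than the final design.

library(shiny)

# Minimal Shiny sketch (layout and widget names are assumptions, not the final design)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Type a sentence:", value = ""),
  h4("Suggested next words:"),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    if (nchar(trimws(input$user_text)) == 0) return("")
    paste(predict_next_word(input$user_text), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)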