The goal of this project is to predict the next word based on previously written words. An example of an application that uses word prediction is SwiftKey, which can be seen in the figure below.
SwiftKey word prediction
The programming language used is R, and the hardware used is an Intel Core i5 6600K with 32 GB of RAM.
The data source is http://www.corpora.heliohost.org/, a freely available collection of corpora for various languages. The dataset is divided into 4 languages: German (de_DE), American English (en_US), Finnish (fi_FI) and Russian (ru_RU). Each language has 3 files which indicate the source of the text: blogs, news and twitter. Each file contains one sentence or paragraph per line.
Other helpful data sets could be obtained from http://corpus.byu.edu/full-text/, which includes full-text data from spoken language, academic papers, newspapers, magazines and fiction books collected from 1990 to 2012. Older texts containing non-fiction books are also available. Other sources could be the https://en.wikipedia.org/wiki/Wikipedia:Database_download or the complete http://www.chrisharrison.net/index.php/Visualizations/WebTrigrams.
According to https://en.wikipedia.org/wiki/Natural_language_processing, Natural Language Processing (NLP) is “a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages”.
The text mining analysis process consists of the following steps:

1. Import texts
2. Preprocessing
3. Transformation into structured formats
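As a rough illustration of these steps in R (a sketch only; the file name is a placeholder and the quanteda calls mirror the ones used later in this report rather than a final pipeline):

library(quanteda)
# 1. import texts (hypothetical file name)
rawText <- readLines("some_file.txt", encoding = "UTF-8", skipNul = TRUE)
# 2. preprocessing and 3. transformation into a structured format (a document-feature matrix)
exampleDfm <- dfm(corpus(rawText), toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE)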
The dataset is downloaded and unzipped. It can be found at the following URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
# only download and extract the file if the data file does not exist already
if (!file.exists(file.path("data", "Coursera-SwiftKey.zip"))) {
  # create the data directory if it does not exist yet
  if (!dir.exists("data")) {
    dir.create("data")
  }
  # download the file
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = file.path("data", "Coursera-SwiftKey.zip"))
  # and extract the file
  unzip(file.path("data", "Coursera-SwiftKey.zip"), exdir = "data")
}
fileBlogs <- file.path(getwd(), "data", "final", "en_US", "en_US.blogs.txt")
fileNews <- file.path(getwd(), "data", "final", "en_US", "en_US.news.txt")
fileTwitter <- file.path(getwd(), "data", "final", "en_US", "en_US.twitter.txt")
A summary of the data files can be seen in the table below:
| Source | # of lines | # of words | File size (bytes) |
|---|---|---|---|
| blogs | 899288 | 37334131 | 2.1016001 × 10^8 |
| news | 1010242 | 34372530 | 2.0581189 × 10^8 |
| twitter | 2360148 | 30373543 | 1.6710534 × 10^8 |
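The figures above could, for example, be computed as follows (a sketch; the stringi word counter is an assumption, not necessarily how the table was generated):

library(stringi)
# number of lines, number of words and file size in bytes for a single file
summarizeFile <- function(filename) {
  lines <- readLines(filename, encoding = "UTF-8", skipNul = TRUE)
  data.frame(lines = length(lines),
             words = sum(stri_count_words(lines)),
             bytes = file.size(filename))
}
summarizeFile(fileBlogs)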
The data is read line by line and each data source is sampled. For the blogs and Twitter sources, 100,000 entries are used. For the news source, 50,000 entries are used because less data is available.
tokenizeText <- function (filename, limit) {
  # read all lines of the file as UTF-8 encoded
  tokens <- readLines(filename, encoding = "UTF-8", skipNul = TRUE)
  # take a sample
  tokens <- tokens[sample(1:length(tokens), limit)]
  tokens
}
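Note that the sample is random; a fixed seed could be set before sampling to make the results reproducible (not done in the original analysis):

# setting a seed would make the samples below reproducible
set.seed(1234)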
# read tokens for blogs, news and twitter
tokensBlogs <- tokenizeText(fileBlogs, 100000)
head(tokensBlogs, 1)
## [1] "They have an amazing amount of gorgeous designs available as Digi Stamps. You must check them out as they do have some of the best designs i have ever seen! They are very kindly giving the lucky challenge winner a $12 Gift Voucher to spend in there store, how great is that?!"
tokensNews <- tokenizeText(fileNews, 50000)
head(tokensNews, 1)
## [1] "Last year Central’s varsity squad took home a then-school-best second-place finish, just 2.5 points behind Dunbar of Kentucky, a team that won seven consecutive national championships."
tokensTwitter <- tokenizeText(fileTwitter, 100000)
head(tokensTwitter, 1)
## [1] "#DYK \"By age 4 the avg child of a welfare fam has heard 13 million less #words than the avg child in a working class fam.\" -Betty Hart, 1995"
# concatenate all data sources and remove the separate variables
lines <- c(tokensBlogs, tokensNews, tokensTwitter)
rm(tokensBlogs, tokensNews, tokensTwitter)
To evaluate which words are foreign, words can be compared against an English dictionary such as the lexical database http://wordnet.princeton.edu/ by Princeton University.
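A possible check using the wordnet R package could look like this (a sketch, assuming the WordNet dictionary is installed and the WNHOME environment variable points to its parent directory):

library(wordnet)
# a word is considered English if WordNet knows it as any part of speech
isEnglishWord <- function(word) {
  filter <- getTermFilter("ExactMatchFilter", word, TRUE)
  any(sapply(c("NOUN", "VERB", "ADJECTIVE", "ADVERB"),
             function(pos) length(getIndexTerms(pos, 1, filter)) > 0))
}
isEnglishWord("house")  # expected TRUE
isEnglishWord("xyzzy")  # expected FALSE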
First, a corpus is created using all the lines sampled from blogs, news and twitter.
corpus <- corpus(lines)
A unigram frequency table is created, and a histogram and wordcloud are displayed.
unigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 1, what = "fastestword", toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE)
unigram <- data.table(ngram=colnames(unigramDfm), freq=colSums(unigramDfm))
unigram <- unigram[order(-freq),]
ggplot(data = unigram[1:10], aes(x = ngram, y = freq)) + geom_bar(stat = "identity") + ggtitle("Histogram unigram")
wordcloud(unigram$ngram, unigram$freq,
          scale = c(5, .3),
          max.words = 50,
          random.order = FALSE)
A bigram frequency table is created, and a histogram and wordcloud are displayed.
bigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 2, what = "fastestword", toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE)
bigram <- data.table(ngram=colnames(bigramDfm), freq=colSums(bigramDfm))
bigram <- bigram[order(-freq),]
ggplot(data = bigram[1:10], aes(x = ngram, y = freq)) + geom_bar(stat = "identity") + ggtitle("Histogram bigram")
wordcloud(bigram$ngram, bigram$freq,
          scale = c(5, .3),
          max.words = 50,
          random.order = FALSE)
A trigram frequency table is created, and a histogram and wordcloud are displayed.
trigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 3, what = "fastestword", toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE, removeTwitter = TRUE)
trigram <- data.table(ngram=colnames(trigramDfm), freq=colSums(trigramDfm))
trigram <- trigram[order(-freq),]
ggplot(data = trigram[1:10], aes(x = ngram, y = freq)) + geom_bar(stat = "identity") + ggtitle("Histogram trigram")
wordcloud(trigram$ngram, trigram$freq,
          scale = c(5, .3),
          max.words = 50,
          random.order = FALSE)
Note: the wordcloud function warned that several frequent trigrams (such as a_lot_of, some_of_the, thanks_for_the, looking_forward_to and thank_you_for) could not be fit on the page and were therefore not plotted.
The next step is to create a predictive model; this could be Katz's backoff model. As can be seen in the unigram wordcloud, the word “the” is the most frequent word. If the next word were predicted using only the unigram model and the most probable word were always chosen, the prediction would therefore constantly be “the”. Also, a profanity filter will be needed so that no offensive words are predicted. Stemming, possibly combined with the removal of non-English words, could be used to reduce the number of terms. Finally, the data product will be created in Shiny, and a presentation will be made to demonstrate the features of the product.
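As an illustration of the backoff idea using the n-gram tables built above (a simplified sketch without the discounting of a full Katz's backoff model; it simply falls back to shorter n-grams):

# predict the next word given the two previous words w1 and w2
predictNextWord <- function(w1, w2) {
  # try the most frequent trigram that starts with "w1_w2_"
  hits <- trigram[startsWith(ngram, paste(w1, w2, "", sep = "_"))]
  if (nrow(hits) > 0) return(sub(".*_", "", hits$ngram[1]))
  # back off to the most frequent bigram that starts with "w2_"
  hits <- bigram[startsWith(ngram, paste0(w2, "_"))]
  if (nrow(hits) > 0) return(sub(".*_", "", hits$ngram[1]))
  # back off to the most frequent unigram, which is "the"
  unigram$ngram[1]
}
predictNextWord("thanks", "for")

Because the n-gram tables are already sorted by frequency, the first matching row is the most probable continuation.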