Introduction

The goal of this project is to predict the next word based on the words written before it. An example of an application that uses word prediction is SwiftKey, which can be seen in the figure below.

SwiftKey word prediction

The analysis is written in R and was run on an Intel Core i5 6600K with 32 GB of RAM.

About the data set

The data source is http://www.corpora.heliohost.org/, a freely available collection of corpora for various languages. The dataset is divided into four languages: German (de_DE), American English (en_US), Finnish (fi_FI) and Russian (ru_RU). Each language has three files, one per text source: blogs, news and Twitter. Each line of a file contains one sentence or paragraph.

Other helpful data sets could be obtained from http://corpus.byu.edu/full-text/, which includes full-text data from spoken language, academic papers, newspapers, magazines and fiction books collected from 1990 to 2012. Older texts, consisting of non-fiction books, are also available. Other possible sources are the Wikipedia database download (https://en.wikipedia.org/wiki/Wikipedia:Database_download) and the complete data at http://www.chrisharrison.net/index.php/Visualizations/WebTrigrams.

Natural Language Processing

According to Wikipedia (https://en.wikipedia.org/wiki/Natural_language_processing), Natural Language Processing (NLP) is “a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages”.

The text mining analysis process consists of the following steps:

1. Import texts
2. Preprocessing
3. Transformation into structured formats
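The three steps above can be illustrated with a minimal base R sketch. It is not the pipeline used in this report (which relies on quanteda); the example sentences, the variable names and the choice of a term-frequency table as the "structured format" are assumptions made for illustration only.

```r
# 1. Import texts: a tiny in-memory stand-in for lines read from a file
texts <- c("The quick brown fox.", "The lazy dog, the quick cat!")

# 2. Preprocessing: lower-case, replace everything except a-z and spaces,
#    then split each text on runs of spaces
clean <- tolower(texts)
clean <- gsub("[^a-z ]", " ", clean)
tokens <- strsplit(trimws(clean), " +")

# 3. Transformation into a structured format: a term-frequency table,
#    sorted by decreasing frequency and then alphabetically
counts <- table(unlist(tokens))
termFreq <- data.frame(term = names(counts),
                       freq = as.vector(counts),
                       stringsAsFactors = FALSE)
termFreq <- termFreq[order(-termFreq$freq, termFreq$term), ]
print(termFreq)
```

The same idea, applied to n-grams instead of single words, is what the document-feature matrices later in this report provide.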

Data acquisition and cleaning

Loading the packages produces the usual startup messages: NLP, ggplot2, RColorBrewer, data.table, dplyr, tm and quanteda (version 0.9.8.5) are attached, and several functions are masked (for example, quanteda masks ngrams from NLP and sample from base). initDict() also warns that the WordNet 'dict' directory cannot be found and asks for the environment variable WNHOME to be set to its parent directory.

The dataset is downloaded and unzipped. It can be found at the following URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.

# only download and extract the file if the data file does not exist already
zipFile <- file.path("data", "Coursera-SwiftKey.zip")
if (!file.exists(zipFile)) {
  # create the data directory if needed; download.file does not create it
  dir.create("data", showWarnings = FALSE)

  # download the file in binary mode so the archive is not corrupted on Windows
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = zipFile, mode = "wb")

  # and extract the file
  unzip(zipFile, exdir = "data")
}

fileBlogs <- file.path(getwd(), "data", "final", "en_US", "en_US.blogs.txt")
fileNews <- file.path(getwd(), "data", "final", "en_US", "en_US.news.txt")
fileTwitter <- file.path(getwd(), "data", "final", "en_US", "en_US.twitter.txt")

A summary of the data files can be seen in the table below:
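A summary of this kind could be computed along the following lines. This is a sketch only: summarizeFile is a hypothetical helper that does not appear in the original analysis, the choice of columns is an assumption, and the demonstration runs on a small temporary file rather than on the real data.

```r
# Hypothetical helper: summarise one text file (size, line count, word count)
summarizeFile <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # words per line: split on whitespace; an empty line yields zero words
  wordsPerLine <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(file = basename(path),
             sizeMB = file.info(path)$size / 1024^2,
             lines = length(lines),
             words = sum(wordsPerLine),
             stringsAsFactors = FALSE)
}

# Demonstration on a small temporary file
demoFile <- tempfile(fileext = ".txt")
writeLines(c("hello world", "a slightly longer line of text"), demoFile)
fileSummary <- summarizeFile(demoFile)
print(fileSummary)
```

For the real data, the helper could be applied to fileBlogs, fileNews and fileTwitter and the three rows combined with rbind.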

Extract data

The data is read line by line and each data source is sampled. For the blogs and Twitter sources, 100,000 lines are used. For the news source, 50,000 lines are used because less data is available.

tokenizeText <- function (filename, limit) {
  # read all lines of the file as UTF-8 encoded text, skipping embedded nuls
  tokens <- readLines(filename, 
                     encoding = "UTF-8", skipNul=TRUE)
  
  # take a random sample of at most `limit` lines (without replacement)
  tokens <- tokens[sample(seq_along(tokens), min(limit, length(tokens)))]
  
  tokens
}

# read tokens for blogs, news and twitter
tokensBlogs <- tokenizeText(fileBlogs, 100000)
head(tokensBlogs, 1)

## [1] "A relationship. With a lover, with an ill relative. And if that relationship fails, or the relative dies, we need to remember that we were not defined purely by that relationship."

tokensNews <- tokenizeText(fileNews, 50000)
head(tokensNews, 1)

## [1] "Oregon kicks off its 2011 cross country season Saturday with a low-key dual meet against Gonzaga in Sunriver. The Ducks take part in five regular-season meets, including the Wisconsin adidas Invitational in Madison, Wisc. on Oct. 14. The NCAA Championships are Nov. 21 in Terre Haute, Ind."

tokensTwitter <- tokenizeText(fileTwitter, 100000)
head(tokensTwitter, 1)

## [1] "at the end of the day what is supposed to happen will do regardless. hold on to your dream is so worth it"

# concatenate all data sources and remove the separate variables
lines <- c(tokensBlogs, tokensNews, tokensTwitter)
rm(tokensBlogs, tokensNews, tokensTwitter)

Exploratory analysis

To evaluate which words are foreign, words can be compared against an English dictionary such as WordNet (http://wordnet.princeton.edu/), the lexical database by Princeton University.
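A dictionary comparison of this kind could look like the sketch below. The dictionary here is a tiny hypothetical word list, not WordNet, and the example words are made up. A word missing from the dictionary is not necessarily foreign: names, slang and typos would be flagged as well.

```r
# Hypothetical stand-in for a real English word list
dictionary <- c("the", "house", "is", "very", "beautiful")

# Words observed in the text (made-up example)
words <- c("the", "haus", "is", "very", "maison", "beautiful")

# Flag each word that appears in the dictionary
inDictionary <- words %in% dictionary

# Candidate foreign (or otherwise unknown) words
unknownWords <- words[!inDictionary]
print(unknownWords)

# Share of words covered by the dictionary
shareKnown <- mean(inDictionary)
print(shareKnown)
```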

First, a corpus is created using all lines sampled from blogs, news and Twitter.

corpus <- corpus(lines)

Unigram

The unigram counts are computed and sorted by decreasing frequency, then shown as a frequency plot and as a word cloud of the 50 most frequent unigrams.

# document-feature matrix of single words (argument names as in quanteda 0.9.8.5)
unigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 1, what = "fastestword",
                  toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE,
                  removeTwitter = TRUE)


unigram <- data.table(ngram=colnames(unigramDfm), freq=colSums(unigramDfm))
unigram <- unigram[order(-freq),]

ggplot(data=unigram, aes(y=freq,x=seq(1, length(unigram$freq))))+geom_line()+ggtitle("Histogram unigram")

wordcloud(unigram$ngram,unigram$freq,
          c(5,.3),
          max.words=50,
          random.order=FALSE)
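To make the n-gram terminology of this and the following sections concrete, the sketch below builds unigrams, bigrams and trigrams from a short token vector in base R. makeNgrams is a hypothetical helper, not part of quanteda or of the original analysis. Joining the words with an underscore mirrors the feature names seen in this report (for example at_the_end).

```r
# Hypothetical helper: build all n-grams of length n from a token vector
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("at", "the", "end", "of", "the", "day")

unigrams <- makeNgrams(tokens, 1)
bigrams <- makeNgrams(tokens, 2)
trigrams <- makeNgrams(tokens, 3)

# Frequency table sorted by decreasing count
unigramFreq <- sort(table(unigrams), decreasing = TRUE)

print(bigrams)
print(trigrams)
print(unigramFreq)
```

A sequence of k tokens yields k - n + 1 n-grams, so the six tokens above give five bigrams and four trigrams.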

Bigram

The bigram counts are computed and sorted by decreasing frequency, then shown as a frequency plot and as a word cloud of the 50 most frequent bigrams.

# document-feature matrix of two-word sequences
bigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 2, what = "fastestword",
                 toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE,
                 removeTwitter = TRUE)


bigram <- data.table(ngram=colnames(bigramDfm), freq=colSums(bigramDfm))
bigram <- bigram[order(-freq),]

ggplot(data=bigram, aes(y=freq,x=seq(1, length(bigram$freq))))+geom_line()+ggtitle("Histogram bigram")

wordcloud(bigram$ngram,bigram$freq,
          c(5,.3),
          max.words=50,
          random.order=FALSE)

Trigram

The trigram counts are computed and sorted by decreasing frequency, then shown as a frequency plot and as a word cloud of the 50 most frequent trigrams.

# document-feature matrix of three-word sequences
trigramDfm <- dfm(corpus, verbose = FALSE, ngrams = 3, what = "fastestword",
                  toLower = TRUE, removePunct = TRUE, removeNumbers = TRUE,
                  removeTwitter = TRUE)


trigram <- data.table(ngram=colnames(trigramDfm), freq=colSums(trigramDfm))
trigram <- trigram[order(-freq),]

ggplot(data=trigram, aes(y=freq,x=seq(1, length(trigram$freq))))+geom_line()+ggtitle("Histogram trigram")

wordcloud(trigram$ngram,trigram$freq,
          c(5,.3),
          max.words=50,
          random.order=FALSE)

wordcloud() issued 21 warnings of the form "<trigram> could not be fit on page. It will not be plotted." for the following trigrams: there_is_a, one_of_my, im_going_to, you_want_to, i_have_been, this_is_the, it_would_be, at_the_end, is_going_to, in_the_world, if_you_are, most_of_the, there_is_no, you_have_to, all_of_the, i_wanted_to, looking_forward_to, to_have_a, for_the_first, thank_you_for and when_i_was.

Intermediate

Rutger Prins

November 27, 2016