Introduction

This is the Milestone Report for the Capstone project of the Data Science Specialization.

The goal of the Capstone project is to build an application that predicts the next word as the user types a sentence.

About Data

We use the following libraries to process the data and build the figures in this report:

library(stringi)
## Warning: package 'stringi' was built under R version 3.2.5
library(tm)
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate

The Capstone project uses a large text corpus of documents as training data. The data provided for the project contains three text sources (blogs, news and Twitter) and comes in four languages: English, German, Finnish and Russian. We use only the English data for this exploratory analysis.

We downloaded the zip file with the text data and extracted the files into the current directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
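For reproducibility, here is a minimal sketch of the download step (the destination file name is an assumption; the archive unpacks into the final/ directory used in the code below):

# Sketch: download and unzip the dataset (run once)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("final")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}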

# Reading the raw files (commented out: the result was saved to e1.RData once)
#blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
#news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
#twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Load the previously read data from the saved environment
load("~/CourseraR/Capstone/e1.RData")

Here are some statistics for these files and the data they contain:

# Disk size (in MB)
blogs_dsize <- file.info("final/en_US/en_US.blogs.txt")$size / 1024 / 1024
news_dsize <- file.info("final/en_US/en_US.news.txt")$size / 1024 / 1024
twitter_dsize <- file.info("final/en_US/en_US.twitter.txt")$size / 1024 / 1024

# In-memory size (in MB)
blogs_msize <- object.size(blogs) / 1024 / 1024
news_msize <- object.size(news) / 1024 / 1024
twitter_msize <- object.size(twitter) / 1024 / 1024

# Words in lines
blogs_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)

# Summary
data.frame(source = c("blogs", "news", "twitter"),
           files_MB = c(blogs_dsize, news_dsize, twitter_dsize),
           in_memory_MB = c(blogs_msize, news_msize, twitter_msize),
           lines = c(length(blogs), length(news), length(twitter)),
           words_num = c(sum(blogs_words), sum(news_words), sum(twitter_words)),
           mean_words_num = c(mean(blogs_words), mean(news_words), mean(twitter_words)))
##    source files_MB in_memory_MB   lines words_num mean_words_num
## 1   blogs 200.4242     248.4935  899288  37546246       41.75108
## 2    news 196.2775     249.6329 1010242  34762395       34.40997
## 3 twitter 159.3641     301.3969 2360148  30093410       12.75065

Creating Corpus and Cleaning Data

We clean the data by removing special characters, numbers, punctuation and excess whitespace, then convert the text to lower case and remove English stopwords.

Creating a corpus is a very resource-intensive operation; in our case it practically halts the computer when trying to process nearly 1 GB of in-memory text, so we load only the first 5,000 lines from each text source.
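For reference, a sketch of how the reduced e2.RData file could have been produced from the full vectors loaded above (the exact save call is an assumption):

# Sketch: keep only the first 5000 lines of each source and save them
blogs   <- head(blogs, 5000)
news    <- head(news, 5000)
twitter <- head(twitter, 5000)
save(blogs, news, twitter, file = "~/CourseraR/Capstone/e2.RData")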

# e2.RData contains the reduced data - head(..., 5000) of each source
load("~/CourseraR/Capstone/e2.RData")
data <- c(blogs, news, twitter)

# Corpus object for the tm_map functions
vector_doc <- VectorSource(data)
corpus <- VCorpus(vector_doc)

corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')), mc.cores = 1)

# Helper that replaces every match of a regular expression with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Removing special characters such as URLs and Twitter handles (currently disabled)
#corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
#corpus <- tm_map(corpus, toSpace, "@[^\\s]+")

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

# Remove punctuation, numbers and English stopwords
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
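
To spot-check the cleaning, one of the transformed documents can be inspected (the output depends on the sampled data):

# Quick look at the first cleaned document
content(corpus[[1]])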

Exploratory Analysis

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus (source: https://en.wikipedia.org/wiki/N-gram).

The goal of our analysis is to build different n-grams (unigrams, bigrams and trigrams) from this corpus and analyze the frequency of words and phrases. For example, the sentence "thanks for the follow" yields the bigrams "thanks for", "for the" and "the follow".

#bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
#trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# The RWeka tokenizers above failed for me with a strange immediate error,
# so I use tokenizers built on ngrams() and words() from the NLP package instead:
BigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <-
  function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
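
As a quick sanity check, the tokenizers can be applied directly to the first document of the corpus (the output depends on the sampled data):

# First few bigrams and trigrams of the first cleaned document
head(BigramTokenizer(corpus[[1]]), 5)
head(TrigramTokenizer(corpus[[1]]), 5)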


# Build a frequency table: sum term counts over all documents and sort in decreasing order
freq_df <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_df <- data.frame(word=names(freq), freq=freq)
  return(freq_df)
}

unigram <- removeSparseTerms(TermDocumentMatrix(corpus), 0.9999)
unigram_freq <- freq_df(unigram)

bigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)), 0.9999)
bigram_freq <- freq_df(bigram)

trigram <- removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer)), 0.9999)
trigram_freq <- freq_df(trigram)
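
Before plotting, we can peek at the top of each frequency table:

# Most frequent terms in each table
head(unigram_freq, 5)
head(bigram_freq, 5)
head(trigram_freq, 5)
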
freq_plot <- function(data, title) {
  ggplot(data[1:25,], aes(reorder(word, -freq), freq)) +
         labs(x = "Words/Phrases", y = "Frequency") +
         ggtitle(title) +
         theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
         geom_bar(stat = "identity")
}

freq_plot(unigram_freq, "Top-25 Unigrams")

freq_plot(bigram_freq, "Top-25 Bigrams")

freq_plot(trigram_freq, "Top-25 Trigrams")

Interesting Findings

The three sources differ noticeably: Twitter has by far the most lines (about 2.4 million) but the shortest ones (about 13 words per line on average), while blogs have fewer lines, the longest ones (about 42 words per line) and the most words overall. Each source alone takes roughly 250-300 MB in memory, which is why we work with a small sample for this exploratory analysis.

What to do next?

The next step of this project is to build a predictive algorithm, test it, wrap it in a Shiny application and deploy it to shinyapps.io.

We still have to find and test strategies for predicting the next word. We are currently considering a trigram model, but this is not the final decision and needs testing. If the trigram model fails to find a prediction, we will back off to the bigram model.
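
To illustrate the backoff idea, here is a minimal sketch built on the frequency tables from this report (illustrative only, with no smoothing; the function name and the sample call are made up for this example):

# Minimal sketch of trigram -> bigram -> unigram backoff (no smoothing)
predict_next <- function(w1, w2, n = 3) {
  prefix <- paste(w1, w2)
  # Trigrams that start with "w1 w2"
  hits <- trigram_freq[grepl(paste0("^", prefix, " "), trigram_freq$word), ]
  if (nrow(hits) == 0) {
    # Back off to bigrams that start with "w2"
    hits <- bigram_freq[grepl(paste0("^", w2, " "), bigram_freq$word), ]
  }
  if (nrow(hits) == 0) {
    # Last resort: the overall most frequent words
    return(head(as.character(unigram_freq$word), n))
  }
  # Return the last word of the top-n matching n-grams
  phrases <- head(as.character(hits$word), n)
  sapply(strsplit(phrases, " "), tail, 1)
}

predict_next("happy", "new")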