Synopsis

This is a milestone report for the last part of the Coursera Data Science Specialization: the Capstone Project. The goal of this project is to build predictive text models like those used by the SwiftKey mobile application.

This report outlines the exploratory analysis of the training data set, and the plans for creating the prediction algorithm and Shiny app.

Capstone dataset

The training dataset for the capstone project uses text files from a corpus called HC Corpora (www.corpora.heliohost.org). Details about the corpus are available at http://www.corpora.heliohost.org/aboutcorpus.html (unavailable as of 20/07/2015). The training dataset is available from the Coursera site for this project.

It was necessary to perform the following steps to access the dataset:

  1. Setting the work directory
# specifying the work directory
setwd("C:/Courseradata/Capstone")
  2. Downloading the Coursera-SwiftKey.zip file
# Checking and downloading the zip file
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "Coursera-SwiftKey.zip")
}
  3. Unzipping the downloaded file, extracting only the English text files
# extracting en_US.blogs.txt
if (!file.exists("en_US.blogs.txt")) {
    unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
}
# extracting en_US.news.txt
if (!file.exists("en_US.news.txt")) {
    unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt")
}
# extracting en_US.twitter.txt
if (!file.exists("en_US.twitter.txt")) {
    unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
}

This project will use only the English text files, although the dataset also contains text files in other languages, mainly Russian, Finnish, and German. Only English words will be predicted in this project.
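As an optional check (not part of the original steps), the archive contents can be listed without extracting anything, which makes it easy to confirm which language folders are present:

# optional: list the archive contents without extracting
zipContents <- unzip("Coursera-SwiftKey.zip", list = TRUE)
head(zipContents$Name)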

Summaries of the training set files

Three text files will be used: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

These files have the following features (lines, words, and size):

# counting lines, words, and computing file sizes 
options(warn=-1)
files <- c("en_US.blogs.txt","en_US.news.txt", "en_US.twitter.txt")
# specifying new work directory
setwd("C:/Courseradata/Capstone/final/en_US/")
lines <- c(length(count.fields("en_US.blogs.txt")),
           length(count.fields("en_US.news.txt")),
           length(count.fields("en_US.twitter.txt")))
sizes_bytes <- c(file.info("en_US.blogs.txt")$size,
                 file.info("en_US.news.txt")$size,
                 file.info("en_US.twitter.txt")$size)
words <- c(length(scan("en_US.blogs.txt", "", quiet = TRUE)),
           length(scan("en_US.news.txt", "", quiet = TRUE)),
           length(scan("en_US.twitter.txt", "", quiet = TRUE)))
options(warn = 0)
info <- data.frame(files, lines, words, sizes_bytes)
info
##               files   lines    words sizes_bytes
## 1   en_US.blogs.txt  898436 35314678   210160014
## 2    en_US.news.txt   77258  2263785   205811889
## 3 en_US.twitter.txt 2304374  9141571   167105338
remove(files, lines, sizes_bytes)
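A simple follow-up on the info data frame above is the average number of words per line, which highlights how different the three sources are (tweets are far shorter than blog posts, for example). The column name words_per_line is illustrative, not from the original report:

# derived from the table above: average words per line for each source
info$words_per_line <- round(info$words / info$lines, 1)
info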

Creating a training corpus

In this step, the project uses the tm package, a framework for text mining applications within R.

It was necessary to perform the following tasks to build the training corpus (a sample collection of texts):

  1. Creating a sample of each file (2% of lines), because the original files are very large. These samples will form the basis of the project's training corpus.
# loading files and creating samples
setwd("C:/Courseradata/Capstone/final/en_US/")
options(warn=-1)
tBlogs <- readLines("en_US.blogs.txt",skipNul = TRUE)
tNews <- readLines("en_US.news.txt",skipNul = TRUE)
tTwitter <- readLines("en_US.twitter.txt",skipNul = TRUE)
options(warn=0)
sBlogs <- sample(tBlogs, length(tBlogs)*0.02, replace = FALSE)
sNews <- sample(tNews, length(tNews)*0.02, replace = FALSE)
sTwitter <- sample(tTwitter , length(tTwitter)*0.02, replace = FALSE)
remove(tBlogs, tNews, tTwitter)
  2. Creating the training corpus by joining the samples:
# creating corpus
sampleText <- c(sBlogs, sNews, sTwitter)
remove(sBlogs, sNews, sTwitter)
# Create the Corpus
library(tm)
## Loading required package: NLP
preFinalText <- Corpus(VectorSource(sampleText))
remove(sampleText)

  3. Cleaning the text with some preprocessing operations, such as removing invalid characters, numbers, and punctuation (a quick check of the result follows the code below):

library(stringi)
# remove invalid characters (keep only letters, whitespace, and apostrophes)
finalText <- tm_map(preFinalText,content_transformer(function(x) stri_replace_all_regex(x,"[^\\p{L}\\s[']]+","")))
remove(preFinalText)
# collapse repeated whitespace
finalText <- tm_map(finalText, stripWhitespace)
# remove punctuation
finalText <- tm_map(finalText, removePunctuation)
# transform to lower case
finalText <- tm_map(finalText, content_transformer(tolower))
# transform to plain text document
finalText <- tm_map(finalText, PlainTextDocument)
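As a quick, optional sanity check (not part of the original report), the first couple of cleaned documents can be printed to verify the transformations:

# optional: print a few cleaned documents to verify the transformations
inspect(finalText[1:2])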

Tokenization

The next step is to perform tokenization. This operation breaks a stream of text into words or groups of words (tokens). The list of tokens will be the input for the next steps. At this stage, the project uses the N-gram tokenizer from the RWeka package to generate one-word tokens from the text.

library(RWeka)
# creating tokenizer controls
oneTk <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# creating term-document matrices 
uniDoc <- TermDocumentMatrix(finalText, control = list(tokenize = oneTk))
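A word-prediction model typically also needs two- and three-word tokens. A sketch of how the same RWeka approach extends to bigrams and trigrams is shown below; the names biTk, triTk, biDoc, and triDoc are illustrative, not from the original report:

# illustrative: bigram and trigram tokenizers built the same way as oneTk
biTk  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTk <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# corresponding term-document matrices
biDoc  <- TermDocumentMatrix(finalText, control = list(tokenize = biTk))
triDoc <- TermDocumentMatrix(finalText, control = list(tokenize = triTk))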

Basic exploratory analysis

Here, it is possible to analyze the data with a bar plot illustrating the most common words.

library(slam)
uniDocx <- as.matrix(rollup(uniDoc, 2, na.rm=TRUE, FUN=sum))
barplot(head(sort(rowSums(uniDocx), decreasing=TRUE), 15), las=3)

Plans for prediction algorithm and Shiny app

The following features are included in the plans for the prediction algorithm:

The Shiny application will have the following features:
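As a rough, non-authoritative sketch related to the prediction algorithm (not the Shiny feature list, and not the project's final design), next-word suggestions could be read off a bigram frequency table such as the illustrative biDoc matrix above; the helper predictNext and its simple frequency lookup are assumptions for illustration only:

# illustrative sketch only: suggest next words from bigram frequencies
library(slam)
biFreq <- sort(rowSums(as.matrix(rollup(biDoc, 2, na.rm = TRUE, FUN = sum))), decreasing = TRUE)
predictNext <- function(word, freq = biFreq, n = 3) {
    # keep bigrams whose first word matches the input word
    candidates <- freq[grep(paste0("^", word, " "), names(freq))]
    # return the second word of the most frequent matching bigrams
    sapply(strsplit(names(head(candidates, n)), " "), `[`, 2)
}
predictNext("in")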