This is a milestone report for the final part of the Coursera Data Science Specialization: the Capstone Project. The goal of this project is to build predictive text models like those used by the SwiftKey mobile application.
This report outlines the exploratory analysis of the training data set, and the plans for creating the prediction algorithm and Shiny app.
The training dataset for the capstone project uses text files from a corpus called HC Corpora (www.corpora.heliohost.org). Details about the corpus are available at http://www.corpora.heliohost.org/aboutcorpus.html (unavailable as of 20/07/2015). The training dataset is available from the Coursera site for this project.
It was necessary to perform the following steps to access the dataset:
# specifying the work directory
setwd("C:/Courseradata/Capstone")
# Checking and downloading the zip file
if (!file.exists("Coursera-SwiftKey.zip")) {
download.file("http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "Coursera-SwiftKey.zip")
}
# extracting en_US.blogs.txt
if (!file.exists("en_US.blogs.txt")) {
unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.blogs.txt")
}
# extracting en_US.news.txt
if (!file.exists("en_US.news.txt")) {
unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.news.txt")
}
# extracting en_US.twitter.txt
if (!file.exists("en_US.twitter.txt")) {
unzip("Coursera-SwiftKey.zip", "final/en_US/en_US.twitter.txt")
}
This project will use only the English text files, although the dataset also contains text files in other languages, mainly Russian, Finnish, and German. Only English words will be predicted in this project.
Three English text files will be used: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
These files have the following features (lines, words, and size):
# counting lines, words, and computing file sizes
options(warn=-1)
files <- c("en_US.blogs.txt","en_US.news.txt", "en_US.twitter.txt")
# specifying new work directory
setwd("C:/Courseradata/Capstone/final/en_US/")
lines <- c(length(count.fields("en_US.blogs.txt")),
           length(count.fields("en_US.news.txt")),
           length(count.fields("en_US.twitter.txt")))
sizes_bytes <- c(file.info("en_US.blogs.txt")$size,
                 file.info("en_US.news.txt")$size,
                 file.info("en_US.twitter.txt")$size)
words <- c(length(scan("en_US.blogs.txt", "", quiet=TRUE)),
           length(scan("en_US.news.txt", "", quiet=TRUE)),
           length(scan("en_US.twitter.txt", "", quiet=TRUE)))
options(warn=0)
info <- data.frame(files, lines, words, sizes_bytes)
info
## files lines words sizes_bytes
## 1 en_US.blogs.txt 898436 35314678 210160014
## 2 en_US.news.txt 77258 2263785 205811889
## 3 en_US.twitter.txt 2304374 9141571 167105338
remove(files, lines, sizes_bytes)
In this step, the project uses the tm package: a framework for text mining applications within R.
It was necessary to perform the following tasks to build a training corpus (a sample collection of texts):
# loading files and creating samples
setwd("C:/Courseradata/Capstone/final/en_US/")
options(warn=-1)
tBlogs <- readLines("en_US.blogs.txt",skipNul = TRUE)
tNews <- readLines("en_US.news.txt",skipNul = TRUE)
tTwitter <- readLines("en_US.twitter.txt",skipNul = TRUE)
options(warn=0)
sBlogs <- sample(tBlogs, length(tBlogs)*0.02, replace = FALSE)
sNews <- sample(tNews, length(tNews)*0.02, replace = FALSE)
sTwitter <- sample(tTwitter , length(tTwitter)*0.02, replace = FALSE)
remove(tBlogs, tNews, tTwitter)
# creating corpus
sampleText <- c(sBlogs, sNews, sTwitter)
remove(sBlogs, sNews, sTwitter)
# Create the Corpus
library(tm)
## Loading required package: NLP
preFinalText <- Corpus(VectorSource(sampleText))
remove(sampleText)
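Because only about 2% of each file is kept, the sample changes on every run. A minimal sketch of how the sampling step above could be made reproducible by setting a seed first (the seed value 1234 is arbitrary and not part of the original code):
# setting a seed (arbitrary value) before sampling makes the 2% samples reproducible
set.seed(1234)
sBlogs <- sample(tBlogs, length(tBlogs) * 0.02, replace = FALSE)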
3. Cleaning the text with some preprocessing operations, such as removing invalid characters (including numbers), extra whitespace, and punctuation, and converting to lower case:
library(stringi)
# remove invalid characters (keep only letters, whitespace, and apostrophes)
finalText <- tm_map(preFinalText, content_transformer(function(x) stri_replace_all_regex(x,"[^\\p{L}\\s[']]+","")))
remove(preFinalText)
# strip extra whitespace
finalText <- tm_map(finalText, stripWhitespace)
# remove remaining punctuation
finalText <- tm_map(finalText, removePunctuation)
# transform to lower case
finalText <- tm_map(finalText, content_transformer(tolower))
# transform to plain text document
finalText <- tm_map(finalText, PlainTextDocument)
The next step is to perform a tokenization process. This operation breaks a stream of text up into words or groups of words (tokens). The list of tokens will be the input for the next steps. This project uses the N-gram tokenizer from the RWeka package to generate one-word tokens from the text at this stage.
library(RWeka)
# creating tokenizer controls
oneTk <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
# creating term-document matrices
uniDoc <- TermDocumentMatrix(finalText, control = list(tokenize = oneTk))
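The same approach can be extended to groups of two and three words in later stages. A minimal sketch of bigram and trigram tokenizers built with the same RWeka function (the names twoTk, threeTk, biDoc, and triDoc are illustrative and not part of the code above):
# bigram and trigram tokenizer controls (groups of 2 and 3 words)
twoTk   <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
threeTk <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# corresponding term-document matrices
biDoc  <- TermDocumentMatrix(finalText, control = list(tokenize = twoTk))
triDoc <- TermDocumentMatrix(finalText, control = list(tokenize = threeTk))
These matrices could then be summarized and plotted with the same rollup and barplot steps shown next for the unigram matrix.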
Here, it is possible to analyze the data with a bar plot that illustrates the most common words.
library(slam)
uniDocx <- as.matrix(rollup(uniDoc, 2, na.rm=TRUE, FUN=sum))
barplot(head(sort(rowSums(uniDocx), decreasing=TRUE), 15), las=3)
The following features are included in the plans for the prediction algorithm:
The Shiny application will have the following features: