Introduction

The objective of this report is to explore the datasets that will be used to build the prediction model. It covers basic exploratory analysis of the files along with some data preparation, such as cleaning and sampling.

The idea is to build a corpus from three files containing text extracted from blogs, news articles and Twitter.

Load the data

Download and load the data.

zipFile <- "files/Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Create the target directory if it does not exist
if (!file.exists('files')) {
    dir.create('files')
}
# Download the zip only once; mode = "wb" keeps the binary file intact on Windows
if (!file.exists(zipFile)) {
    download.file(url, zipFile, mode = "wb")
}
# Unzip only if the English files are not already extracted
if (!file.exists("files/final/en_US")) {
    unzip(zipFile, exdir = "files")
}

# Read each English file line by line; skipNul drops embedded NUL characters
blogsFile <- "files/final/en_US/en_US.blogs.txt"
conn <- file(blogsFile, open = "r")
blogs <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

newsFile <- "files/final/en_US/en_US.news.txt"
conn <- file(newsFile, open = "r")
news <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

twitterFile <- "files/final/en_US/en_US.twitter.txt"
conn <- file(twitterFile, open = "r")
twitter <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

rm(conn)

Files Summary

File                Size     Lines     Words      Characters   Words/line (mean)   Words/line (max)
en_US.blogs.txt     200 MB    899288   37570839   206824505    42                  6780
en_US.news.txt      196 MB   1010242   34494539   203223159    35                  1803
en_US.twitter.txt   159 MB   2360148   30451170   162096241    13                  47

The first obvious conclusion is that, as expected, Twitter has fewer words per line than News, and News fewer than Blogs.

Also, the files are large, so sampling will be needed.
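
The code that produced the table above is not shown; as a reference, a summary of this kind can be computed along the following lines. This is a minimal sketch assuming the stringi package is available; the helper name summarizeFile is illustrative, not part of the original analysis.

library(stringi)

# Illustrative helper: file size, line, word and character counts plus words-per-line stats
summarizeFile <- function(path, lines) {
    words <- stri_count_words(lines)
    data.frame(File = basename(path),
               FileSize = paste(round(file.info(path)$size / 1024^2), "MB"),
               Lines = length(lines),
               Words = sum(words),
               Characters = sum(nchar(lines)),
               WordsPerLine.Mean = round(mean(words)),
               WordsPerLine.Max = max(words))
}

rbind(summarizeFile(blogsFile, blogs),
      summarizeFile(newsFile, news),
      summarizeFile(twitterFile, twitter))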

Sampling and cleaning

Each of the three data sets will be sampled at 5%, and non-ASCII characters will be removed. A single file containing the three samples will be written to disk.

sampleDataFile <- "files/final/en_US/sample.txt"

# Take a reproducible 5% random sample of each data set
percentage <- 0.05
set.seed(450134)
sampleBlogs   <- sample(blogs,   round(length(blogs)   * percentage), replace = FALSE)
sampleNews    <- sample(news,    round(length(news)    * percentage), replace = FALSE)
sampleTwitter <- sample(twitter, round(length(twitter) * percentage), replace = FALSE)

# Strip non-ASCII characters (the text was read as UTF-8)
sampleBlogs   <- iconv(sampleBlogs,   "UTF-8", "ASCII", sub = "")
sampleNews    <- iconv(sampleNews,    "UTF-8", "ASCII", sub = "")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub = "")

# Write the combined sample to a single file
allSamples <- c(sampleBlogs, sampleNews, sampleTwitter)
conn <- file(sampleDataFile, open = "w")
writeLines(allSamples, conn)
close(conn)

Corpus

To build the corpus, I’m going to apply a series of cleaning transformations to the sampled text.

The result is going to be saved to disk.
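
The list of transformations is not reproduced in this section. As an illustration, a typical cleaning pipeline with the tm package might look like the sketch below; the use of tm and the corpus.rds output path are assumptions, not necessarily the exact steps applied here.

library(tm)

# Illustrative pipeline: common cleaning transformations on the sampled text
corpus <- VCorpus(VectorSource(readLines(sampleDataFile, skipNul = TRUE)))
corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case everything
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, stripWhitespace)                # collapse repeated whitespace

# Save the cleaned corpus to disk (assumed path)
saveRDS(corpus, "files/final/en_US/corpus.rds")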

N-Gram

Unigrams, bigrams, and trigrams are going to be used in the prediction model. The following plots are an exploration of these n-grams.
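
The tokenization and plotting code is not included here. As one common approach, n-gram frequencies can be computed with RWeka on top of the tm corpus and plotted with ggplot2; the sketch below (for bigrams) is illustrative and assumes the corpus object built in the previous section. Unigrams and trigrams follow by changing min and max in Weka_control.

library(tm)
library(RWeka)
library(ggplot2)

# Illustrative only: bigram frequencies from the cleaned corpus
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Plot the 20 most frequent bigrams
top20 <- data.frame(ngram = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top20, aes(x = reorder(ngram, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams in the 5% sample")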