Introduction

The objective of this report is to explore the datasets that will be used to build the prediction model. It covers basic exploratory analysis of the files along with some data preparation, such as cleaning and sampling.

The idea is to build a corpus from three files containing text extracted from blogs, news articles and Twitter.

Load the data

Download and load the data.

zipFile <- "files/Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# Create the target directory if it does not exist
if (!file.exists('files')) {
    dir.create('files')
}
# Download the zip only once; mode = "wb" keeps the binary file intact on Windows
if (!file.exists(zipFile)) {
    download.file(url, zipFile, mode = "wb")
}
# Unzip only if the English files are not already extracted
if (!file.exists("files/final/en_US")) {
    unzip(zipFile, exdir = "files")
}

# Read each English file line by line; skipNul drops embedded NUL characters
blogsFile <- "files/final/en_US/en_US.blogs.txt"
conn <- file(blogsFile, open = "r")
blogs <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

newsFile <- "files/final/en_US/en_US.news.txt"
conn <- file(newsFile, open = "r")
news <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

twitterFile <- "files/final/en_US/en_US.twitter.txt"
conn <- file(twitterFile, open = "r")
twitter <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

rm(conn)

Files Summary

File                Size     Lines     Words      Characters   Words/line (mean)   Words/line (max)
en_US.blogs.txt     200 MB    899288   37570839   206824505    42                  6780
en_US.news.txt      196 MB   1010242   34494539   203223159    35                  1803
en_US.twitter.txt   159 MB   2360148   30451170   162096241    13                  47

The first obvious conclusion is that, as expected, Twitter has fewer words per line than News, and News fewer than Blogs.

Also, the files are large, so sampling will be needed.
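
The code that produced the table above is not shown; as a reference, a summary of this kind can be computed along the following lines. This is a minimal sketch assuming the stringi package is available; the helper name summarizeFile is illustrative, not part of the original analysis.

library(stringi)

# Illustrative helper: file size, line, word and character counts plus words-per-line stats
summarizeFile <- function(path, lines) {
    words <- stri_count_words(lines)
    data.frame(File = basename(path),
               FileSize = paste(round(file.info(path)$size / 1024^2), "MB"),
               Lines = length(lines),
               Words = sum(words),
               Characters = sum(nchar(lines)),
               WordsPerLine.Mean = round(mean(words)),
               WordsPerLine.Max = max(words))
}

rbind(summarizeFile(blogsFile, blogs),
      summarizeFile(newsFile, news),
      summarizeFile(twitterFile, twitter))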

Sampling and cleaning

Each of the three data sets will be sampled at 5%, and non-ASCII characters will be removed. A single file containing the three samples will be written to disk.

sampleDataFile <- "files/final/en_US/sample.txt"

# Take a reproducible 5% random sample of each data set
percentage <- 0.05
set.seed(450134)
sampleBlogs   <- sample(blogs,   round(length(blogs)   * percentage), replace = FALSE)
sampleNews    <- sample(news,    round(length(news)    * percentage), replace = FALSE)
sampleTwitter <- sample(twitter, round(length(twitter) * percentage), replace = FALSE)

# Strip non-ASCII characters (the text was read as UTF-8)
sampleBlogs   <- iconv(sampleBlogs,   "UTF-8", "ASCII", sub = "")
sampleNews    <- iconv(sampleNews,    "UTF-8", "ASCII", sub = "")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub = "")

# Write the combined sample to a single file
allSamples <- c(sampleBlogs, sampleNews, sampleTwitter)
conn <- file(sampleDataFile, open = "w")
writeLines(allSamples, conn)
close(conn)

Corpus

To build the corpus, I’m going to apply a series of cleaning transformations to the sampled text.

The result is going to be saved to disk.
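
The list of transformations is not reproduced in this section. As an illustration, a typical cleaning pipeline with the tm package might look like the sketch below; the use of tm and the corpus.rds output path are assumptions, not necessarily the exact steps applied here.

library(tm)

# Illustrative pipeline: common cleaning transformations on the sampled text
corpus <- VCorpus(VectorSource(readLines(sampleDataFile, skipNul = TRUE)))
corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case everything
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, stripWhitespace)                # collapse repeated whitespace

# Save the cleaned corpus to disk (assumed path)
saveRDS(corpus, "files/final/en_US/corpus.rds")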

N-Gram

Unigrams, bigrams, and trigrams are going to be used in the prediction model. The following plots are an exploration of these n-grams.
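
The tokenization and plotting code is not included here. As one common approach, n-gram frequencies can be computed with RWeka on top of the tm corpus and plotted with ggplot2; the sketch below (for bigrams) is illustrative and assumes the corpus object built in the previous section. Unigrams and trigrams follow by changing min and max in Weka_control.

library(tm)
library(RWeka)
library(ggplot2)

# Illustrative only: bigram frequencies from the cleaned corpus
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)

# Plot the 20 most frequent bigrams
top20 <- data.frame(ngram = names(freq)[1:20], count = as.numeric(freq[1:20]))
ggplot(top20, aes(x = reorder(ngram, count), y = count)) +
    geom_col() +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams in the 5% sample")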