The objective is to explore the datasets that will be used to build the prediction model. Basic exploratory analysis of the files and some data preparation, such as cleaning and sampling, will be performed.
The idea is to build a corpus from three files containing text extracted from blogs, news articles, and Twitter.
Download and load the data.
# Download and unzip the dataset if it is not already on disk
zipFile <- "files/Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if (!file.exists("files")) {
  dir.create("files")
}
if (!file.exists(zipFile)) {
  download.file(url, zipFile, mode = "wb")
}
if (!file.exists("files/final/en_US")) {
  unzip(zipFile, exdir = "files")
}
# Read the three English source files
blogsFile <- "files/final/en_US/en_US.blogs.txt"
conn <- file(blogsFile, open = "r")
blogs <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

newsFile <- "files/final/en_US/en_US.news.txt"
conn <- file(newsFile, open = "r")
news <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)

twitterFile <- "files/final/en_US/en_US.twitter.txt"
conn <- file(twitterFile, open = "r")
twitter <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)
rm(conn)
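For reference, this is a minimal sketch of how the summary table below could be computed; the stringi package and the summarizeFile helper are illustrative assumptions, not necessarily how the figures were produced.

library(stringi)

# Hypothetical helper: summarize one file already loaded into memory
summarizeFile <- function(path, lines) {
  words <- stri_count_words(lines)
  data.frame(
    File = basename(path),
    Size = paste(round(file.info(path)$size / 1024^2), "MB"),
    Lines = length(lines),
    Words = sum(words),
    Characters = sum(nchar(lines)),
    WordsPerLine.Mean = round(mean(words)),
    WordsPerLine.Max = max(words)
  )
}

rbind(
  summarizeFile(blogsFile, blogs),
  summarizeFile(newsFile, news),
  summarizeFile(twitterFile, twitter)
)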
| File | Size | Lines | Words | Characters | Words/Line (Mean) | Words/Line (Max) |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200 MB | 899288 | 37570839 | 206824505 | 42 | 6780 |
| en_US.news.txt | 196 MB | 1010242 | 34494539 | 203223159 | 35 | 1803 |
| en_US.twitter.txt | 159 MB | 2360148 | 30451170 | 162096241 | 13 | 47 |
The first obvious conclusion is that, as expected, Twitter has fewer words per line than News, and News fewer than Blogs.
The files are also large, so sampling will be needed.
The three data sets will be sampled at 5%, and non-ASCII characters will be removed. A new file combining the three samples will be created.
# Sample 5% of each data set, reproducibly
sampleDataFile <- "files/final/en_US/sample.txt"
percentage <- 0.05
set.seed(450134)
sampleBlogs <- sample(blogs, floor(length(blogs) * percentage), replace = FALSE)
sampleNews <- sample(news, floor(length(news) * percentage), replace = FALSE)
sampleTwitter <- sample(twitter, floor(length(twitter) * percentage), replace = FALSE)

# Drop non-ASCII characters (the files were read as UTF-8)
sampleBlogs <- iconv(sampleBlogs, "UTF-8", "ASCII", sub = "")
sampleNews <- iconv(sampleNews, "UTF-8", "ASCII", sub = "")
sampleTwitter <- iconv(sampleTwitter, "UTF-8", "ASCII", sub = "")

# Write the combined sample to disk
allSamples <- c(sampleBlogs, sampleNews, sampleTwitter)
conn <- file(sampleDataFile, open = "w")
writeLines(allSamples, conn)
close(conn)
To build the corpus, a set of cleaning transformations is going to be applied to the sampled text, and the result will be saved to disk.
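The exact list of transformations is not reproduced here, but a minimal sketch using the tm package with typical cleaning steps (lowercasing, removing punctuation, numbers, and extra whitespace; these steps and the output path are assumptions for illustration) could look like this:

library(tm)

# Build a corpus from the combined sample (typical cleaning steps; assumed)
corpus <- VCorpus(VectorSource(allSamples))
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase
corpus <- tm_map(corpus, removePunctuation)             # strip punctuation
corpus <- tm_map(corpus, removeNumbers)                 # strip digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse whitespace

# Save the cleaned corpus to disk (path is an assumption)
saveRDS(corpus, "files/final/en_US/corpus.rds")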
Unigrams, bigrams, and trigrams are going to be used in the prediction model. The following plots are an exploration of these n-grams.
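The plots themselves are not shown here; as one possible way to derive the underlying n-gram frequencies, a sketch using tidytext (an assumption, not necessarily the tooling used for the plots) is:

library(dplyr)
library(tidytext)

# Tokenize the sampled text into n-grams and count their frequencies
# (set n = 1 or n = 3 for unigrams / trigrams)
sampleDF <- data.frame(text = allSamples, stringsAsFactors = FALSE)

bigramFreq <- sampleDF %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%  # split into bigrams
  filter(!is.na(ngram)) %>%                                # drop lines too short to form a bigram
  count(ngram, sort = TRUE)                                # frequency table

head(bigramFreq, 20)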