This assignment is part of the Coursera Data Science Capstone project. Its goal is to:
* Demonstrate that we were able to download and read the SwiftKey data
* Create a basic report of summary statistics about the data sets
* Report on findings and provide some information on the next steps.
The following steps were performed:
- Download and unzip the SwiftKey archive and read the three text files (in binary mode), skipping embedded nul characters
- Collect basic data statistics (memory allocation, number of lines/words, maximum number of characters per line) on the blogs, news and twitter text files (scope: English files)
- Create samples of the original datasets
- Organize the text in a repository (corpus)
- Tidy up the text, removing for example elements with no information value (e.g. stopwords, extra whitespace)
- Create a term-document matrix holding the frequencies of distinct terms in the sample corpus
- Plot the results
The first step in this assessment is to download the SwiftKey dataset, unzip it and load the required R packages.
To get familiar with the data, we gather key statistics for each of the text files: the allocated memory size, the number of lines, the maximum number of characters per line, the total number of characters and the number of words.
File | Memory_Allocation | Nr_Lines | Max_Chars_Per_Line | Nr_Chars_Total | Nr_Words |
---|---|---|---|---|---|
en_US.twitter.txt | 318.99 Mb | 2360148 | 213 | 162385035 | 30218166 |
en_US.blogs.txt | 255.35 Mb | 899288 | 40835 | 208361438 | 38154238 |
en_US.news.txt | 257.34 Mb | 1010242 | 11384 | 203791405 | 35010782 |
Sample | Memory_Allocation | Nr_Lines | Max_Chars_Per_Line | Nr_Chars_Total | Nr_Words |
---|---|---|---|---|---|
Twitter | 0.51 Mb | 11800 | 147 | 811708 | 151114 |
Blogs | 0.51 Mb | 4496 | 2450 | 1071504 | 195259 |
News | 0.5 Mb | 5051 | 920 | 1018763 | 175045 |
The three text samples are loaded into a corpus and the text is pre-processed: the whole text is transformed to lower case, and English stopwords (e.g. the, not, you), numbers, punctuation and extra white space are removed.
Finally, we look at the frequency of single words (1-grams) in the sample corpus. The top 200 words are easily identified in the word cloud and the histogram illustrates the frequency of the top 20. The word cloud also shows that additional data cleanup is needed to remove insignificant tokens (e.g. the Euro sign).
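One possible way to handle this (a minimal sketch, not part of the current analysis; the helper name RemoveNonASCII is illustrative) would be to strip non-ASCII symbols with a custom transformer before building the term-document matrix, using only base R's iconv and tm's content_transformer:
#### Sketch: strip non-ASCII symbols such as the Euro sign (assumed cleanup step)
RemoveNonASCII <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = " "))
#### corpus <- tm_map(corpus, RemoveNonASCII)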
This analysis is only a first step and needs to be extended. Ideas for the next steps are:
* Create a larger corpus looking at splitting options to overcome memory/performance issues
* Implement additional pre-processing to remove words we do not want to predict (like web URLs, repeating/special characters and profanity) and possibly streamline the text using stemming (which reduces words to their root form)
* Tokenize the corpus to explore combinations of two words (bigrams) or more (n-grams); a prerequisite is to solve the Java issue which prevented me from using the RWeka package (a possible workaround is sketched after this list)
* Optimize the code for faster processing
* Create a text prediction model
* Build the Shiny application.
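As a first idea for the bigram step, the sketch below (an assumption, not code that was run for this report; the helper name BigramTokenizer is illustrative) avoids the RWeka/Java dependency by using the NLP helpers that ship with tm and passing a custom tokenizer to TermDocumentMatrix:
#### Sketch: bigram tokenizer without RWeka, using the NLP package loaded with tm
BigramTokenizer <- function(x) unlist(lapply(NLP::ngrams(NLP::words(x), 2L), paste, collapse = " "))
#### tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
#### head(sort(rowSums(as.matrix(tdm2)), decreasing = TRUE), 20)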
##### Setup the environment and load the required packages
rm(list = ls())
setwd("C:/RAGNIMY1/datasciencecoursera/FinalProject") ###to remove before submitting
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
suppressWarnings(library(kableExtra))
suppressWarnings(library(dplyr))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringi)) #### file statistics
suppressWarnings(library(tm)) #### text mining
#### suppressWarnings(library(SnowballC)) #### text stemming
suppressWarnings(library(wordcloud)) #### word-cloud generator
suppressWarnings(library(RColorBrewer)) #### color palettes
##### Downloading / unzip the zipped text files
SrcFileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
DataFileZip <- "Coursera-SwiftKey.zip"
SrcDir <- "./final/en_US/"
SampleDir <- "./sample/en_US/"
Fblogs <- "en_US.blogs.txt"
Fnews <- "en_US.news.txt"
Ftwitter <- "en_US.twitter.txt"
#### Check if zipped data file was already downloaded, if not, download
#### the file
if (!file.exists(DataFileZip)) {
download.file(SrcFileURL, destfile = DataFileZip)
}
##### If any of the unzipped files (blogs, twitter, news) do not exist,
##### unzip the source file
if (!file.exists(paste(SrcDir, Fblogs, sep = "")) || !file.exists(paste(SrcDir,
    Fnews, sep = "")) || !file.exists(paste(SrcDir, Ftwitter, sep = ""))) {
    unzip(DataFileZip)
}
##### Function to read a text file. The file is read in binary mode and embedded
##### nul characters are skipped
ReadTextFile <- function(path, filename) {
con <- file(paste(path, filename, sep = ""), open = "rb")
TextFile <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(con)
return(TextFile)
}
TwitterText <- ReadTextFile(SrcDir, Ftwitter)
BlogsText <- ReadTextFile(SrcDir, Fblogs)
NewsText <- ReadTextFile(SrcDir, Fnews)
##### Report of Basic Data Statistics
TxtFileList <- list(TwitterText, BlogsText, NewsText)
##### Function returns Memory Allocated by a file
MemAlloc <- function(txtfile) {
format(object.size(txtfile), units = "Mb", standard = "auto", digits = 2)
}
#### Function returns nr. of lines in a file
NrLines <- function(txtfile) {
length(txtfile)
}
#### Function returns the nr. of characters of the longest line in a file
NrCharLongLine <- function(txtfile) {
ncpl <- nchar(txtfile)
IdxLongLine <- which.max(ncpl)
ncpl[IdxLongLine]
}
#### Function returns the total nr. of characters in a file
NrChars <- function(txtfile) {
sum(nchar(txtfile))
}
#### Function returns the total nr. of words in a file
NrWords <- function(txtfile) {
sum(stri_count_words(txtfile))
}
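#### Assemble the file statistics into a data frame and render the table with kable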
FileStats <- data.frame(File = c(Ftwitter, Fblogs, Fnews), Memory_Allocation = sapply(TxtFileList,
MemAlloc), Nr_Lines = sapply(TxtFileList, NrLines), Max_Chars_Per_Line = sapply(TxtFileList,
NrCharLongLine), Nr_Chars_Total = sapply(TxtFileList, NrChars), Nr_Words = sapply(TxtFileList,
NrWords))
FileStats %>% kable() %>% kable_styling(bootstrap_options = "striped",
full_width = F, position = "left")
##### Function selects a sample of about 0.5% of the observations from the
##### original files
CreateSample <- function(txtfile) {
set.seed(345)
sample <- txtfile[rbinom(NrLines(txtfile) * 0.005, NrLines(txtfile),
0.5)]
return(sample)
}
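#### Note: a simpler alternative (not used here, so it would not reproduce the
#### sample tables above) is a uniform random sample of lines, e.g.
#### sample(txtfile, ceiling(length(txtfile) * 0.005))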
#### Function saves the blogs, news and twitter samples in txt files for
#### later usage
WriteTextFile <- function(path, samplename, newfile) {
con <- file(paste(path, newfile, sep = ""), open = "wt", encoding = "UTF-8")
TextFile <- writeLines(samplename, con)
close(con)
}
TwitterSample <- CreateSample(TwitterText)
BlogsSample <- CreateSample(BlogsText)
NewsSample <- CreateSample(NewsText)
TxtFileList <- list(TwitterSample, BlogsSample, NewsSample)
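#### Recompute the basic statistics on the three samples and render the table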
FileStats <- data.frame(Sample = c("Twitter", "Blogs", "News"), Memory_Allocation = sapply(TxtFileList,
MemAlloc), Nr_Lines = sapply(TxtFileList, NrLines), Max_Chars_Per_Line = sapply(TxtFileList,
NrCharLongLine), Nr_Chars_Total = sapply(TxtFileList, NrChars), Nr_Words = sapply(TxtFileList,
NrWords))
FileStats %>% kable() %>% kable_styling(bootstrap_options = "striped",
full_width = F, position = "left")
rm(TwitterText, BlogsText, NewsText, TxtFileList) #### free up memory
if (!dir.exists(SampleDir)) dir.create(SampleDir, recursive = TRUE) #### make sure the target folder exists
WriteTextFile(SampleDir, TwitterSample, "sample.twitter.txt")
WriteTextFile(SampleDir, BlogsSample, "sample.blogs.txt")
WriteTextFile(SampleDir, NewsSample, "sample.news.txt")
#### Combine blogs, twitter and news and load the data as a Corpus
FullSample <- c(TwitterSample, BlogsSample, NewsSample)
corpus <- VCorpus(VectorSource(FullSample))
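#### Pre-process the corpus: lower case, then remove numbers, English stopwords,
#### punctuation and extra whitespace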
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
#### lapply(corpus[1], as.character)
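#### Build the term-document matrix and derive a 1-gram frequency table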
terms <- as.matrix(TermDocumentMatrix(corpus))
terms_sorted <- sort(rowSums(terms), decreasing = TRUE)
terms <- data.frame(word = names(terms_sorted), freq = terms_sorted)
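#### Word cloud of the top 200 words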
set.seed(345)
wordcloud(words = terms$word, freq = terms$freq, min.freq = 1, max.words = 200,
random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
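#### Histogram of the 20 most frequent words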
Plot <- function(data) {
ggplot(head(data, 20), aes(x = reorder(word, -freq), freq)) + geom_bar(stat = "identity",
fill = "purple") + theme(axis.text.x = element_text(angle = 45,
hjust = 1), plot.title = element_text(hjust = 0.5)) + labs(title = "Top 20 Unigram words in sample corpus",
x = "1-gram Words", y = "Frequency")
}
Plot(terms)