This assignment is part of the Coursera Data Science Capstone project. Its goal is to:
* Demonstrate that we were able to download and read the SwiftKey data
* Create a basic report of summary statistics about the data sets
* Report on findings and provide some information on the next steps.
The following steps were performed:
- Download and unzip the SwiftKey archive and read the three text files (in binary mode), skipping embedded nul characters
- Collect basic data statistics (memory allocation, number of lines/words, maximum number of characters per line) on the blogs, news and twitter text files (scope: English files)
- Create samples of the original datasets
- Organize the text in a repository (corpus)
- Tidy up the text, removing for example elements with no information value (e.g. stopwords, extra whitespace)
- Create a term-document matrix holding the frequencies of distinct terms in the sample corpus
- Plot the results
The first step in this assessment is to download the SwiftKey dataset, unzip it and load the required R packages.
To get familiar with the data, we gather key statistics for each of the text files: the allocated memory size, the number of lines, the maximum number of characters per line, the total number of characters and the number of words.
File | Memory_Allocation | Nr_Lines | Max_Chars_Per_Line | Nr_Chars_Total | Nr_Words |
---|---|---|---|---|---|
en_US.twitter.txt | 318.99 Mb | 2360148 | 213 | 162385035 | 30218166 |
en_US.blogs.txt | 255.35 Mb | 899288 | 40835 | 208361438 | 38154238 |
en_US.news.txt | 257.34 Mb | 1010242 | 11384 | 203791405 | 35010782 |
Sample | Memory_Allocation | Nr_Lines | Max_Chars_Per_Line | Nr_Chars_Total | Nr_Words |
---|---|---|---|---|---|
Twitter | 0.51 Mb | 11800 | 147 | 811708 | 151114 |
Blogs | 0.51 Mb | 4496 | 2450 | 1071504 | 195259 |
News | 0.5 Mb | 5051 | 920 | 1018763 | 175045 |
The three text samples are loaded into a corpus and the text is pre-processed: the whole text is transformed to lower case, and English stopwords (e.g. the, not, you), numbers, punctuation and extra white space are removed.
Finally, we look at the frequency of single words (1-grams) in the sample corpus. The top 200 words are easily identified in the word cloud and the histogram illustrates the frequency of the top 20. The word cloud also shows that additional data cleanup is needed to remove insignificant tokens (e.g. the Euro sign).
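One possible way to handle this (a minimal sketch, not part of the current analysis; the helper name RemoveNonASCII is illustrative) would be to strip non-ASCII symbols with a custom transformer before building the term-document matrix, using only base R's iconv and tm's content_transformer:
#### Sketch: strip non-ASCII symbols such as the Euro sign (assumed cleanup step)
RemoveNonASCII <- content_transformer(function(x) iconv(x, from = "UTF-8", to = "ASCII", sub = " "))
#### corpus <- tm_map(corpus, RemoveNonASCII)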
This analysis is only a first step and needs to be extended. Ideas for the next steps are:
* Create a larger corpus looking at splitting options to overcome memory/performance issues
* Implement additional pre-processing to remove words we do not want to predict (like web URLs, repeating/special characters and profanity) and possibly streamline the text using stemming (which reduces words to their root form)
* Tokenize the corpus to explore combinations of two words (bigrams) or more (n-grams); a prerequisite is to solve the Java issue which prevented me from using the RWeka package (a possible workaround is sketched after this list)
* Optimize the code for faster processing
* Create a text prediction model
* Build the Shiny application.
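As a first idea for the bigram step, the sketch below (an assumption, not code that was run for this report; the helper name BigramTokenizer is illustrative) avoids the RWeka/Java dependency by using the NLP helpers that ship with tm and passing a custom tokenizer to TermDocumentMatrix:
#### Sketch: bigram tokenizer without RWeka, using the NLP package loaded with tm
BigramTokenizer <- function(x) unlist(lapply(NLP::ngrams(NLP::words(x), 2L), paste, collapse = " "))
#### tdm2 <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
#### head(sort(rowSums(as.matrix(tdm2)), decreasing = TRUE), 20)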
##### Setup the environment and load the required packages
rm(list = ls())
setwd("C:/RAGNIMY1/datasciencecoursera/FinalProject") ###to remove before submitting
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
suppressWarnings(library(kableExtra))
suppressWarnings(library(dplyr))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringi)) #### file statistics
suppressWarnings(library(tm)) #### text mining
#### suppressWarnings(library(SnowballC)) #### text stemming
suppressWarnings(library(wordcloud)) #### word-cloud generator
suppressWarnings(library(RColorBrewer)) #### color palettes
##### Downloading / unzip the zipped text files
SrcFileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
DataFileZip <- "Coursera-SwiftKey.zip"
SrcDir <- "./final/en_US/"
SampleDir <- "./sample/en_US/"
Fblogs <- "en_US.blogs.txt"
Fnews <- "en_US.news.txt"
Ftwitter <- "en_US.twitter.txt"
#### Check if zipped data file was already downloaded, if not, download
#### the file
if (!file.exists(DataFileZip)) {
download.file(SrcFileURL, destfile = DataFileZip)
}
##### If any of the unzipped files (blogs, twitter, news) do not exist,
##### unzip the source file
if (!file.exists(paste(SrcDir, Fblogs, sep = "")) || !file.exists(paste(SrcDir,
    Fnews, sep = "")) || !file.exists(paste(SrcDir, Ftwitter, sep = ""))) {
    unzip(DataFileZip)
}
##### Function to read a text file. The file is read in binary mode and embedded
##### nul characters are skipped
ReadTextFile <- function(path, filename) {
con <- file(paste(path, filename, sep = ""), open = "rb")
TextFile <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
close(con)
return(TextFile)
}
TwitterText <- ReadTextFile(SrcDir, Ftwitter)
BlogsText <- ReadTextFile(SrcDir, Fblogs)
NewsText <- ReadTextFile(SrcDir, Fnews)
##### Report of Basic Data Statistics
TxtFileList <- list(TwitterText, BlogsText, NewsText)
##### Function returns Memory Allocated by a file
MemAlloc <- function(txtfile) {
format(object.size(txtfile), units = "Mb", standard = "auto", digits = 2)
}
#### Function returns nr. of lines in a file
NrLines <- function(txtfile) {
length(txtfile)
}
#### Function returns the nr. of characters of the longest line in a file
NrCharLongLine <- function(txtfile) {
ncpl <- nchar(txtfile)
IdxLongLine <- which.max(ncpl)
ncpl[IdxLongLine]
}
#### Function returns the total nr. of characters in a file
NrChars <- function(txtfile) {
sum(nchar(txtfile))
}
#### Function returns the total nr. of words in a file
NrWords <- function(txtfile) {
sum(stri_count_words(txtfile))
}
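#### Assemble the file statistics into a data frame and render the table with kable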
FileStats <- data.frame(File = c(Ftwitter, Fblogs, Fnews), Memory_Allocation = sapply(TxtFileList,
MemAlloc), Nr_Lines = sapply(TxtFileList, NrLines), Max_Chars_Per_Line = sapply(TxtFileList,
NrCharLongLine), Nr_Chars_Total = sapply(TxtFileList, NrChars), Nr_Words = sapply(TxtFileList,
NrWords))
FileStats %>% kable() %>% kable_styling(bootstrap_options = "striped",
full_width = F, position = "left")
##### Function selects a sample of about 0.5% of the observations from the
##### original files
CreateSample <- function(txtfile) {
set.seed(345)
sample <- txtfile[rbinom(NrLines(txtfile) * 0.005, NrLines(txtfile),
0.5)]
return(sample)
}
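#### Note: a simpler alternative (not used here, so it would not reproduce the
#### sample tables above) is a uniform random sample of lines, e.g.
#### sample(txtfile, ceiling(length(txtfile) * 0.005))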
#### Function saves the blogs, news and twitter samples in txt files for
#### later usage
WriteTextFile <- function(path, samplename, newfile) {
con <- file(paste(path, newfile, sep = ""), open = "wt", encoding = "UTF-8")
TextFile <- writeLines(samplename, con)
close(con)
}
TwitterSample <- CreateSample(TwitterText)
BlogsSample <- CreateSample(BlogsText)
NewsSample <- CreateSample(NewsText)
TxtFileList <- list(TwitterSample, BlogsSample, NewsSample)
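#### Recompute the basic statistics on the three samples and render the table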
FileStats <- data.frame(Sample = c("Twitter", "Blogs", "News"), Memory_Allocation = sapply(TxtFileList,
MemAlloc), Nr_Lines = sapply(TxtFileList, NrLines), Max_Chars_Per_Line = sapply(TxtFileList,
NrCharLongLine), Nr_Chars_Total = sapply(TxtFileList, NrChars), Nr_Words = sapply(TxtFileList,
NrWords))
FileStats %>% kable() %>% kable_styling(bootstrap_options = "striped",
full_width = F, position = "left")
rm(TwitterText, BlogsText, NewsText, TxtFileList) #### free up memory
if (!dir.exists(SampleDir)) dir.create(SampleDir, recursive = TRUE) #### make sure the target folder exists
WriteTextFile(SampleDir, TwitterSample, "sample.twitter.txt")
WriteTextFile(SampleDir, BlogsSample, "sample.blogs.txt")
WriteTextFile(SampleDir, NewsSample, "sample.news.txt")
#### Combine blogs, twitter and news and load the data as a Corpus
FullSample <- c(TwitterSample, BlogsSample, NewsSample)
corpus <- VCorpus(VectorSource(FullSample))
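#### Pre-process the corpus: lower case, then remove numbers, English stopwords,
#### punctuation and extra whitespace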
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
#### lapply(corpus[1], as.character)
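#### Build the term-document matrix and derive a 1-gram frequency table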
terms <- as.matrix(TermDocumentMatrix(corpus))
terms_sorted <- sort(rowSums(terms), decreasing = TRUE)
terms <- data.frame(word = names(terms_sorted), freq = terms_sorted)
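#### Word cloud of the top 200 words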
set.seed(345)
wordcloud(words = terms$word, freq = terms$freq, min.freq = 1, max.words = 200,
random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
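#### Histogram of the 20 most frequent words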
Plot <- function(data) {
ggplot(head(data, 20), aes(x = reorder(word, -freq), freq)) + geom_bar(stat = "identity",
fill = "purple") + theme(axis.text.x = element_text(angle = 45,
hjust = 1), plot.title = element_text(hjust = 0.5)) + labs(title = "Top 20 Unigram words in sample corpus",
x = "1-gram Words", y = "Frequency")
}
Plot(terms)