Executive Summary

The Data Science Capstone project aims to build a Shiny application that predicts the next word when the user enters a phrase. The project will use the English blogs, news and Twitter data files from HC Corpora (http://www.corpora.heliohost.org) to build the predictive model required by the Shiny application. The current milestone is an exploratory analysis of these files to determine how the data can be used to build the predictive app.

1. Data Acquisition

This section downloads the HC Corpora data from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Code Reference - Appendix A: Download Files

Within the “en_US” folder of the zip file, we import the three data files for blogs, news and Twitter respectively. Code Reference - Appendix B: Import Files

## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

2. Data Summary

This section performs some basic preprocessing to understand the nature of these 3 files in terms of file size, number of lines and number of words. The files are large (roughly 160 to 200 MB each): the blogs and news files have about 0.9 million and 1.0 million lines respectively, while the Twitter file has the most lines (over 2 million). Code Reference - Appendix C: Data Summary

Because the data set is large, subsequent analysis is performed on a random sample of roughly 1% of each file (10,000 lines each from blogs and news, and 20,000 lines from Twitter). Code Reference - Appendix D: Data Sampling
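The sampling step can also be expressed as a fraction of each file rather than a fixed line count. The following is a minimal sketch in base R, not the code used for this report: `sample_fraction`, its `fraction` and `seed` parameters, and the synthetic `toy_lines` vector are all hypothetical names introduced for illustration.

```r
# Hypothetical helper (not from the report): draw a fixed fraction of lines.
sample_fraction <- function(lines, fraction = 0.01, seed = 123) {
  set.seed(seed)                                  # make the draw repeatable
  n <- max(1, ceiling(fraction * length(lines)))  # always keep at least one line
  sample(lines, n)                                # sample without replacement
}

# Toy stand-in for a corpus file: 1,000 synthetic lines.
toy_lines  <- paste("line", seq_len(1000))
toy_sample <- sample_fraction(toy_lines, fraction = 0.01)
length(toy_sample)
```

Fixing the seed inside the helper means that repeated runs return the same lines, which keeps the downstream frequency tables comparable between runs.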

## [1] "Table 1"

##          File Size(MB)   Lines    Words
## 1   blogs.txt      200  899288 37541795
## 2    news.txt      196 1010242 34762303
## 3 twitter.txt      159 2360148 30092866

## [1] "Figure 1"

3. NLP Exploratory analysis

The first step in building a predictive model for text is to understand the distribution of, and relationships between, the words, tokens and phrases in the text. A second purpose is to ascertain whether these 3 files have similar content.

To find the key words and phrases in each file, I first cleaned each file to remove profanity and stopwords that carry little meaning. The file is then tokenized to find the terms with a frequency of at least 10. Code Reference - Appendix E: FUNC_createCorpus, Appendix G: nlpAnalysis

Finally, I performed n-gram analysis to understand the word relationships across up to 3 consecutive words in each file. Code Reference - Appendix F: FUNC_createNgram, Appendix G: nlpAnalysis
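The cleaning and frequency step described above can be illustrated on a toy example without the tm package. This is a base-R sketch under stated assumptions: the `docs` vector, the short `stops` list, the `clean_tokens` helper and the threshold of 2 are hypothetical stand-ins for the sampled files, the full stopword lists in Appendix E and the threshold of 10.

```r
# Toy documents standing in for sampled lines (illustrative only).
docs <- c("The cat sat on the mat", "The dog sat on the log", "A cat and a dog")

# Assumed stopword list for this sketch; the report uses a much longer one.
stops <- c("the", "on", "a", "and")

# Hypothetical helper: lowercase, strip non-letters, split, drop stopwords.
clean_tokens <- function(x, stopwords) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", "", x)          # keep letters, apostrophes and spaces
  tokens <- unlist(strsplit(x, " +"))   # split on runs of spaces
  tokens <- tokens[nzchar(tokens)]      # drop empty strings
  tokens[!tokens %in% stopwords]        # remove stopwords
}

tokens <- clean_tokens(docs, stops)
freq   <- sort(table(tokens), decreasing = TRUE)
freq[freq >= 2]   # terms at or above the minimum frequency
```

The same idea scales to the real samples, where the tm pipeline in Appendix E takes the place of `clean_tokens`.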
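To make the n-gram idea concrete, the sketch below builds bi-grams and tri-grams from a single sentence using base R only. It is an illustration, not the RWeka-based code in Appendix F; `make_ngrams` is a hypothetical helper and the sample sentence is invented.

```r
# Hypothetical helper: build word n-grams from one string with base R only.
make_ngrams <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))  # split on non-letters
  words <- words[nzchar(words)]                         # drop empty strings
  if (length(words) < n) return(character(0))           # too short for an n-gram
  starts <- seq_len(length(words) - n + 1)              # start index of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

text <- "to be or not to be"
bigrams  <- make_ngrams(text, 2)
trigrams <- make_ngrams(text, 3)
sort(table(bigrams), decreasing = TRUE)   # bi-gram frequencies, most common first
```

Counting the resulting strings with `table` gives the same kind of frequency table that is plotted for each file in Figures 2b, 3b and 4b.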

## [1] "Figure 2a - Blogs: Freq and Wordclouds"

## [1] "Figure 2b - Blogs: Ngram Analysis"

## [1] "Figure 3a - News: Freq and Wordclouds"

## [1] "Figure 3b - News: Ngram Analysis"

## [1] "Figure 4a - Twitter: Freq and Wordclouds"

## [1] "Figure 4b - Twitter: Ngram Analysis"

4. Next Steps: Building Predictive Model

I will study the term frequencies and the uni-gram, bi-gram and tri-gram results further to determine the design of the text prediction model. To achieve better accuracy, I will also refer to recommendations shared by the course mentor of the IDA MOOC program, such as the use of an improved Naive Model and other advanced methods (https://docs.google.com/presentation/d/1UYdtSLRv-qmRFJub2A5GtXIxZL1ZwVabt9ermjYIVvU/edit?usp=sharing).
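As one possible starting point, and not the design chosen for this project, a bi-gram frequency table can already drive a simple next-word lookup. The sketch below uses base R and assumes a tiny invented training set; `train`, `tokenize` and `predict_next` are hypothetical names, and the predictor returns `NA` for words it has never seen instead of backing off to a shorter context.

```r
# Toy training text (illustrative only).
train <- c("i love new york", "i love new shoes", "i love new york city",
           "we love pizza")

# Split a string into lowercase words.
tokenize <- function(x) {
  w <- unlist(strsplit(tolower(x), "[^a-z']+"))
  w[nzchar(w)]
}

# Collect adjacent word pairs line by line so pairs never span two lines.
pairs <- do.call(rbind, lapply(train, function(line) {
  w <- tokenize(line)
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Bi-gram counts as a data frame with columns first, second and Freq.
counts <- as.data.frame(table(first = pairs$first, second = pairs$second),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Hypothetical predictor: most frequent follower of the last typed word.
predict_next <- function(phrase, counts) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(NA_character_)
  cand <- counts[counts$first == w[length(w)], ]
  if (nrow(cand) == 0) return(NA_character_)   # unseen word: no prediction
  cand$second[which.max(cand$Freq)]
}

predict_next("I love new", counts)
```

A fuller model would likely consult tri-grams first and fall back to bi-grams and uni-grams when the longer context is unseen, which is where the n-gram tables from this milestone would be reused.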

APPENDICES

Appendix A: Download Files

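# Set source and destination for file download.
sFile <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dFile <- "./data/Coursera-SwiftKey.zip"

# Download file from source to destination, then unzip into ./data so that
# the contents land in ./data/final/en_US, as listed below.
if (!file.exists(dFile)) {
  dir.create("./data", showWarnings = FALSE)  # make sure the folder exists
  download.file(sFile, dFile, mode = "wb")    # binary mode keeps the zip intact
  unzip(dFile, exdir = "./data")
}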

# Check unzipped contents.
list.files("./data/final/en_US")

Appendix B: Import Files

# Import the 3 English files from the "en_US" folder: blogs, news and twitter.
# Then, save each data set to an .RData file; on later runs, load the cached copy.

if (!file.exists("./data/blogs.RData")) {
  con <- file("./data/final/en_US/en_US.blogs.txt","rb")
  blogs <- readLines(con, encoding="UTF-8")
  close(con)
  save(blogs, file="./data/blogs.RData")
} else {
  load("./data/blogs.RData")
}

if (!file.exists("./data/news.RData")) {
  con <- file("./data/final/en_US/en_US.news.txt","rb")
  news <- readLines(con, encoding="UTF-8")
  close(con)
  save(news, file="./data/news.RData")
} else {
  load("./data/news.RData")
}

if (!file.exists("./data/twitter.RData")) {
  con <- file("./data/final/en_US/en_US.twitter.txt","rb")
  # skipNul drops any embedded NUL bytes instead of raising warnings
  twitter <- readLines(con, encoding="UTF-8", skipNul=TRUE)
  close(con)
  save(twitter, file="./data/twitter.RData")
} else {
  load("./data/twitter.RData")
}

Appendix C: Data Summary

# Character analysis via "stringi" library
library(stringi)

# (a) File size (in MB) of each document
fsize_blogs <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024^2
fsize_news <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024^2
fsize_twitter <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024^2

# (b) Total number of lines and chars in each document
linechar_blogs   <- stri_stats_general(blogs)
linechar_news    <- stri_stats_general(news)
linechar_twitter <- stri_stats_general(twitter)

# (c) Total words in each document
wordcnt_blogs       <- sum(stri_count_words(blogs))
wordcnt_news        <- sum(stri_count_words(news))
wordcnt_twitter     <- sum(stri_count_words(twitter))
wordcnt_blogs_Mil   <- wordcnt_blogs / 1000000
wordcnt_news_Mil    <- wordcnt_news / 1000000
wordcnt_twitter_Mil <- wordcnt_twitter / 1000000

print("Table 1")
df <- data.frame(file=c("blogs.txt","news.txt","twitter.txt"),
                 size=c(fsize_blogs,fsize_news,fsize_twitter),
                 lines=c(length(blogs),length(news),length(twitter)),
                 words=c(wordcnt_blogs,wordcnt_news,wordcnt_twitter)
                 )
df$size <- round(df$size, digits=0)
colnames(df) <- c("File","Size(MB)","Lines","Words")
df

# plots
library(ggplot2)
library(gridExtra)

print("Figure 1")

fsize <- c(fsize_blogs,fsize_news,fsize_twitter)
fsize <- data.frame(fsize)
fsize$names <- c("blogs","news","twitter")
plot1 <- ggplot(fsize,aes(x=names,y=fsize)) + geom_bar(stat='identity', fill="yellow", colour='blue') + xlab('File source') + ylab('File Size (MB)') + ggtitle('File Size')

linecnt <- c(length(blogs),length(news),length(twitter))
linecnt <- data.frame(linecnt)
linecnt$names <- c("blogs","news","twitter")
plot2 <- ggplot(linecnt,aes(x=names,y=linecnt)) + geom_bar(stat='identity', fill="green", colour='blue') + xlab('File source') + ylab('Total No. of Lines') + ggtitle('Line Count')

wordcnt <- c(wordcnt_blogs_Mil,wordcnt_news_Mil,wordcnt_twitter_Mil)
wordcnt <- data.frame(wordcnt)
wordcnt$names <- c("blogs","news","twitter")
plot3 <- ggplot(wordcnt,aes(x=names,y=wordcnt)) + geom_bar(stat='identity', fill="cyan", colour='blue') + xlab('File source') + ylab('Total No. of Words (Millions)') + ggtitle('Word Count')

grid.arrange(plot1, plot2, plot3, ncol=3)

Appendix D: Data Sampling

# random sample of data (roughly 1% of each file); fix the seed so the
# same lines are drawn on every run
set.seed(123)
sample_blogs   <- sample(blogs, 10000)
sample_news    <- sample(news, 10000)
sample_twitter <- sample(twitter, 20000)

# save samples
save(sample_blogs, sample_news, sample_twitter, file= "./data/samples.RData")

Appendix E: Function to clean and create corpus

createCorpus <- function(sample) {
  clean <- gsub(" #\\S*","",sample) # remove hashtags
  clean <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", clean) # remove URLs (http, https, ftp)
  clean <- gsub("[^0-9A-Za-z/' ]", "", clean) # keep only letters, digits, slashes, apostrophes and spaces
  
  # download the profanity list
  library(RCurl)
  profanity <- c(t(read.csv(text = getURL("http://www.bannedwordlist.com/lists/swearWords.csv"),header=FALSE)))
 
  # customized stopwords
  stopwords1 <- c("haha","hey","lol","yeah","please","thank","thanks","let","word","words","say","says","said","lot","get","got","this","that","them","yes", "yet", "may", "just", "can", "the", "will", "also", "and", "for", "in", "it", "to")
  stopwords2 <- c("go","going","went","is","are","was","were","has","have","had","never","ever","even","didnt","doesnt","dont", "really","around","make","made","making","include","included","including","next","last","now","day","week","month","year")
  stopwords3 <- c("good","better","nice","sure","though","want","use","used","ill","ive","well","weve","theyll","theyve","youre","youve","youll","youd","thats","cant","could","couldnt","would","wouldnt")
  stopwords4 <- c("since","any","another","still","few","much","many","time","feel","feeling","think","thought","know","take","always","way","things","thing","something","back","now","see","saw","seeing","look","looking","show","come","came")
  
  library(tm)
  corpus <- Corpus(VectorSource(clean))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, profanity)
  corpus <- tm_map(corpus, removeWords, stopwords("english")) # common stop words such as for, very, and, of, are, etc.
  corpus <- tm_map(corpus, removeWords, stopwords1)
  corpus <- tm_map(corpus, removeWords, stopwords2)
  corpus <- tm_map(corpus, removeWords, stopwords3)
  corpus <- tm_map(corpus, removeWords, stopwords4)
  corpus <- tm_map(corpus, stripWhitespace) # run last: removing words leaves extra spaces
  
  # Compute a term-document matrix that contains the occurrence of terms in each doc
  tdm <- TermDocumentMatrix(corpus)
  
  # Print terms with a frequency of at least 10 (the result was previously discarded)
  print(findFreqTerms(tdm, lowfreq=10))
  
  #wordcloud
  library(wordcloud)
  
  set.seed(123)
  wordcloud(corpus, scale=c(5,0.5), min.freq=10, max.words=50, random.order=FALSE, rot.per=0.35, 
            use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
  
  return(corpus)  
}

Appendix F: Function to create ngrams

# N-gram analysis was carried out by Tokenizing the data to Uni-gram, Bi-gram and Tri-gram, and then
# plotting the results to gain more understanding of the dataset. 

createNgram <- function(corpus) {
  library(RWeka)
  sample_df <- data.frame(text=unlist(sapply(corpus, '[',"content")),stringsAsFactors=FALSE)
  token_delim <- " \\t\\r\\n.!?,;\"()"
  # use the same delimiters for all three n-gram sizes
  UnigramTokenizer <- NGramTokenizer(sample_df, Weka_control(min=1,max=1, delimiters = token_delim))
  BigramTokenizer <- NGramTokenizer(sample_df, Weka_control(min=2,max=2, delimiters = token_delim))
  TrigramTokenizer <- NGramTokenizer(sample_df, Weka_control(min=3,max=3, delimiters = token_delim))

  unigramTable <- data.frame(table(UnigramTokenizer))
  bigramTable <- data.frame(table(BigramTokenizer))
  trigramTable <- data.frame(table(TrigramTokenizer))

  unigramTable <- unigramTable[order(unigramTable$Freq,decreasing = TRUE),]
  bigramTable <- bigramTable[order(bigramTable$Freq,decreasing = TRUE),]
  trigramTable <- trigramTable[order(trigramTable$Freq,decreasing = TRUE),]

  library(ggplot2)
  library(gridExtra)
  # Bar charts of the 10 most frequent n-grams, ordered by descending frequency
  plot1 <- ggplot(unigramTable[1:10,], aes(x=reorder(UnigramTokenizer,-Freq,sum),y=Freq)) +
    geom_bar(stat="identity",fill="orange", colour='blue') + geom_text(aes(label=Freq))
  plot2 <- ggplot(bigramTable[1:10,], aes(x=reorder(BigramTokenizer,-Freq,sum),y=Freq)) +
    geom_bar(stat="identity",fill="orange", colour='blue') + geom_text(aes(label=Freq))
  plot3 <- ggplot(trigramTable[1:10,], aes(x=reorder(TrigramTokenizer,-Freq,sum),y=Freq)) +
    geom_bar(stat="identity",fill="orange", colour='blue') + geom_text(aes(label=Freq))

  grid.arrange(plot1, plot2, plot3, nrow=3, widths=20)
}

Appendix G: NLP Analysis

# Load sample_blogs, sample_news, sample_twitter.
load("./data/samples.RData")

print("Figure 2a - Blogs: Freq and Wordclouds")
mycorpus <- createCorpus(sample_blogs)
print("Figure 2b - Blogs: Ngram Analysis")
createNgram(mycorpus)

print("Figure 3a - News: Freq and Wordclouds")
mycorpus <- createCorpus(sample_news)
print("Figure 3b - News: Ngram Analysis")
createNgram(mycorpus)

print("Figure 4a - Twitter: Freq and Wordclouds")
mycorpus <- createCorpus(sample_twitter)
print("Figure 4b - Twitter: Ngram Analysis")
createNgram(mycorpus)

Data Science Capstone - Milestone Report

Charmaine Ang

July 2015