Executive Summary

This report presents a data science application in Natural Language Processing (NLP). Its goal is to demonstrate data acquisition, summary statistics on the data sets and exploratory analysis, and to outline plans for the prediction algorithm and Shiny app for the final project.

Getting the Data

The data come from a corpus called HC Corpora. See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available. The files have been language-filtered but may still contain some foreign text that needs to be cleaned up. The data can be downloaded from the Coursera Data Science Capstone class website.

# set working directory
setwd("~/Online Classes/Data Science Capstone")

# get data
if (!file.exists("Coursera-SwiftKey.zip")){
        print("Downloading file ...")
        download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
                      destfile = "Coursera-SwiftKey.zip", quiet = T)
}
if (!file.exists("./final/en_US/en_US.news.txt") ||
            !file.exists("./final/en_US/en_US.blogs.txt") ||
            !file.exists("./final/en_US/en_US.twitter.txt") ){
        print("Extracting archive ...")
        unzip(zipfile = "Coursera-SwiftKey.zip")
}

All the R libraries needed for the exploratory analysis are loaded.

# set java home to avoid error loading RWeka library
Sys.setenv(JAVA_HOME="C:\\Program Files\\Java\\jre7\\")

# load libraries
library(data.table)
library(ggplot2)
library(gridExtra)
library(RWeka)
library(stringi)
library(SnowballC)
library(tm)
library(wordcloud)

Before data cleaning, we create a corpus from the three text files (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt) and perform some basic statistical analysis on the full raw data set. A corpus (plural corpora), or text corpus, is a large and structured set of texts.

# create a corpus from the three files
corpus <- VCorpus(DirSource(directory="final/en_US", encoding = "UTF-8"), readerControl = list(language = "en")) 

# blogs
blogs <- as.character(corpus[[1]])
format(object.size(blogs), units = "Mb") 
length(blogs) 
blogs <- stri_flatten(blogs, collapse =" ")
blogs.words <- unlist(stri_extract_words(blogs, locale = "en"))
length(blogs.words)
length(unique(blogs.words)) 

# news
news <- as.character(corpus[[2]])
format(object.size(as.character(news)), units = "Mb") 
length(news) 
news <- stri_flatten(news, collapse =" ")
news.words <- unlist(stri_extract_words(news, locale = "en"))
length(news.words) 
length(unique(news.words)) 

# tweets
tweets <- as.character(corpus[[3]])
format(object.size(as.character(tweets)), units = "Mb") 
length(tweets) 
tweets <- stri_flatten(tweets, collapse =" ")
tweets.words <- unlist(stri_extract_words(tweets, locale = "en"))
length(tweets.words) 
length(unique(tweets.words)) 

The table below summarizes the raw data:

File Name           Size (MB)   Line Count   Word Count   Unique Words
en_US.blogs.txt         200.4      899,288   37,541,795        395,147
en_US.news.txt          196.3       77,259    2,693,898        103,969
en_US.twitter.txt       159.4    2,360,148   38,154,238        445,884
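
The same per-file statistics can also be gathered in a single pass over the raw corpus. The chunk below is a compact sketch of that alternative; it assumes the full raw corpus object created above is still in memory and that the documents are ordered blogs, news, twitter (the order in which DirSource reads the files).

# a sketch: per-file line/word statistics computed in one loop over the raw corpus
raw.stats <- sapply(1:3, function(i) {
        lines <- as.character(corpus[[i]])
        words <- unlist(stri_extract_all_words(stri_flatten(lines, collapse = " "),
                                               locale = "en"))
        c(lines = length(lines), words = length(words),
          unique.words = length(unique(words)))
})
colnames(raw.stats) <- c("blogs", "news", "twitter")
t(raw.stats)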

Data Processing and Cleaning

Due to the size of the data and the memory available on the computer, we randomly select a percentage of the lines from each data file to create a sample data set for further analysis.

set.seed(1234) 

if (file.exists("./corpus.rdata")) {
        load("corpus.rdata")
} else {
        corpus <- VCorpus(DirSource(directory="final/en_US", encoding = "UTF-8"),          
                          readerControl = list(language = "en")) 
        save(corpus, file="corpus.rdata")        
}
# get sample data for analysis
corpus[[1]]$content<-sample(corpus[[1]]$content, length(corpus[[1]]$content)*0.05)
corpus[[2]]$content<-sample(corpus[[2]]$content, length(corpus[[2]]$content)*0.75)
corpus[[3]]$content<-sample(corpus[[3]]$content, length(corpus[[3]]$content)*0.02)

The sample distributions of per-line word counts for blogs and news are heavily skewed: many lines contain zero or only a few words, while some blog entries are very long, running to several thousand words.
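
To check this, one can plot histograms of the per-line word counts in the sample. The sketch below uses stri_count_words from stringi together with ggplot2 and gridExtra, which are already loaded; lines longer than 500 words are cut off from the plot for readability.

# a sketch: histograms of per-line word counts for the sampled blogs and news
blogs.wc <- stri_count_words(as.character(corpus[[1]]))
news.wc  <- stri_count_words(as.character(corpus[[2]]))
p1 <- ggplot(data.frame(words = blogs.wc), aes(x = words)) +
        geom_histogram(binwidth = 10) + xlim(0, 500) +
        ggtitle("Blogs: words per line")
p2 <- ggplot(data.frame(words = news.wc), aes(x = words)) +
        geom_histogram(binwidth = 10) + xlim(0, 500) +
        ggtitle("News: words per line")
grid.arrange(p1, p2, ncol = 2)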

Prior to text analysis, the sample corpus needs a few transformations, including removing numbers and punctuation, stripping white space and converting letters to lower case.

# remove numbers
corpus <- tm_map(corpus, removeNumbers)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_dashes = TRUE)
# strip white space
corpus <- tm_map(corpus , stripWhitespace)
# lower case
corpus <- tm_map(corpus, content_transformer(tolower))

if (file.exists("./dtm-analysis.rdata")) {
        load("dtm-analysis.rdata")
} else {
        dtm <- DocumentTermMatrix(corpus,
                                  control = list(wordLengths = c(1, Inf),
                                                 stopwords = FALSE,
                                                 language = "english"))
        save(dtm, file="dtm-analysis.rdata")
        
}
wordMatrix <- as.matrix(dtm)

Before performing the analysis, one more cleaning pass is needed to keep only the valid words for future modeling. After applying a regular-expression filter, a comparison of the first 20 terms shows that the text is much tidier than before; the total number of terms decreases from 141,483 to 121,771.

words <- colnames(wordMatrix)

valid.words <- words[regexpr(pattern = '^([a-zA-Z])(?!(\\1{1,}))[a-zA-Z]*([a-zA-Z]+-([a-zA-Z]){2,})?(\'(s|t)?)?$', words, perl=T )>0]

before.regex <- head(words, 20)
after.regex <- head(valid.words, 20)
compare.list <- cbind(before.regex, after.regex)
compare.list
##       before.regex                 after.regex  
##  [1,] "\b\bêì‚홈니ë"           "a"          
##  [2,] "-”"                       "ab"         
##  [3,] "-t"                         "abacelas"   
##  [4,] "–›•"                        "aback"      
##  [5,] "ˆì"                        "abacus"     
##  [6,] "‘•š"                        "abandon"    
##  [7,] "â"                          "aba"        
##  [8,] "ã"                          "abab"       
##  [9,] "a"                          "abaft"      
## [10,] "a-a"                        "abagnale"   
## [11,] "a-b"                        "abalone"    
## [12,] "a-being"                    "abandonded" 
## [13,] "a-bs"                       "abandoned"  
## [14,] "a-changin"                  "abandonees" 
## [15,] "a-child"                    "abandoning" 
## [16,] "a-comin"                    "abandonment"
## [17,] "a-coming”"                "abandons"   
## [18,] "a-coming"                   "abart"      
## [19,] "a-day"                      "abasement"  
## [20,] "a-different"                "abated"
length(words) ; length(valid.words)
## [1] 141483
## [1] 121771

Word Cloud

The word cloud below clearly shows that “the”, “and”, “to” and “of” are the most prominent words. Conjunctions, forms of the verb “to be” and pronouns are the most frequently used unigrams.

set.seed(567)

validWordMatrix <- wordMatrix[, valid.words]
validWordCount <- colSums(validWordMatrix)
wordcloud(names(validWordCount), validWordCount, max.words = 200,
          colors = brewer.pal(6, "Dark2"), rot.per = 0.2)
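
As a quick numeric check of the word cloud, the most frequent unigrams can be listed directly from the same counts used to draw it:

# top 10 most frequent unigrams in the sample
head(sort(validWordCount, decreasing = TRUE), 10)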

Next Steps

The next steps are to build n-gram frequency matrices, use them to create a predictive text algorithm, and then build a Shiny app for word prediction, i.e. entering two words to predict the third word, or three words to predict the fourth.
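
As a rough illustration of the planned approach (a sketch rather than the final implementation), bigram and trigram frequency tables could be built from the cleaned sample corpus with RWeka's NGramTokenizer; the phrase "one of" in the last lookup is just a hypothetical example.

# a sketch of the planned n-gram frequency tables using RWeka's NGramTokenizer
bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram.dtm  <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))
trigram.dtm <- DocumentTermMatrix(corpus, control = list(tokenize = trigramTokenizer))

# most frequent bigrams and trigrams in the sample
bigram.freq  <- sort(colSums(as.matrix(bigram.dtm)),  decreasing = TRUE)
trigram.freq <- sort(colSums(as.matrix(trigram.dtm)), decreasing = TRUE)
head(bigram.freq, 10); head(trigram.freq, 10)

# example lookup: the most frequent trigrams starting with "one of"
# suggest candidates for the third word
head(trigram.freq[grepl("^one of ", names(trigram.freq))], 5)

The Shiny app would then look up the last one or two words the user types in these tables and return the most frequent continuation.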