This report performs preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used for preliminary cleaning and text mining, such as creating n-gram lists. The report closes with plans for building a language model that predicts the next word of an incomplete phrase, and for wrapping it in a front-end Shiny application.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under “final” there are four folders, one per language (English, Russian, French and German), each containing three documents. Only the “en_US” subdirectory is used here; it contains three English documents drawn from blogs, news and Twitter. The directory of interest is therefore ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
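The download and extraction can be scripted in R. The exact commands used for this step are not shown in the report, so the lines below are only a sketch; the object names zip_url and zip_file are illustrative.

zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- file.path(path.expand("~"), "Coursera-SwiftKey.zip")
download.file(zip_url, destfile = zip_file, mode = "wb")   # "wb" keeps the zip intact on Windows
unzip(zip_file, exdir = path.expand("~"))                  # extracts the "final" directory tree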
R packages used: “stringi” for the word counts of the three files; “quanteda” for text cleaning and text mining; “ggplot2” for plotting graphics; “wordcloud” for plotting a word cloud.
Read the three files into R, opening each file in binary mode.
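The reading code itself is not shown here. One possible way to do it, matching the object names us_b, us_n and us_t used later, is to open each file as a binary connection and call readLines(); the read_binary helper and the skipNul flag (used to get past embedded nulls in en_US.news.txt) are assumptions made for this sketch, not necessarily the code used in the original analysis.

read_binary <- function(path) {
  con <- file(path, open = "rb")    # open the file in binary mode
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
us_b <- read_binary("~/final/en_US/en_US.blogs.txt")
us_n <- read_binary("~/final/en_US/en_US.news.txt")
us_t <- read_binary("~/final/en_US/en_US.twitter.txt")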
Summarize the file information: file size, number of lines per file, and total word count per file.
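The quantities referenced in the data frame below (file_size_in_MB, line_counts, word_counts) are computed before this step; the sketch below shows one way to obtain them with base R and stringi, which the report lists as its word-counting package. The file_dir and files objects are illustrative.

library(stringi)
file_dir  <- path.expand("~/final/en_US")
files     <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file_name <- c("us_b", "us_n", "us_t")
file_size_in_MB <- file.size(file.path(file_dir, files)) / 1024^2   # bytes -> MB
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))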
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a combined sample of the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the process took far too long and caused the computer to hang, so I finally reduced the sampling size to 2% of each file.
us_b_sub <- sample(us_b, size=length(us_b)*.02, replace=FALSE)
us_n_sub <- sample(us_n, size=length(us_n)*.02, replace=FALSE)
us_t_sub <- sample(us_t, size=length(us_t)*.02, replace=FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
# Strip characters that cannot be converted to ASCII
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
# Load a profanity word list to filter out of the ngram lists
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity <- as.character(profanity[,1])
# Clean up the text before ngram
tokensAll <- tokens(char_tolower(us_sub), remove_punct =TRUE,remove_numbers=TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams<-dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n=2,concatenator = " " )
twograms<-dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n=3,concatenator = " " )
threegrams<-dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
Given the Capstone project’s objective of predicting the next word of a phrase, an n-gram language model is a suitable choice. Unigram, bigram and trigram word lists are created here for training the n-gram language model later.
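The code that produced the frequency tables below is not shown in the report; one way to build them is to apply quanteda’s topfeatures() to the three dfm objects created above, as sketched here. The top_words helper is an illustration, not the original code. The top-20 unigram, bigram and trigram lists from the 2% sample then follow in that order.

top_words <- function(x, n = 20) {   # top-n features of a dfm as a word/freq data frame
  tf <- topfeatures(x, n)
  data.frame(word = names(tf), freq = unname(tf), row.names = NULL)
}
top_words(onegrams)     # unigrams
top_words(twograms)     # bigrams
top_words(threegrams)   # trigrams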
## word freq
## 1 will 6277
## 2 said 6151
## 3 just 6125
## 4 one 5806
## 5 like 5420
## 6 can 5170
## 7 get 4652
## 8 time 4299
## 9 new 3845
## 10 now 3631
## 11 good 3537
## 12 day 3267
## 13 love 3235
## 14 know 3203
## 15 people 3149
## 16 go 2915
## 17 see 2852
## 18 back 2774
## 19 first 2719
## 20 also 2671
## word freq
## 1 right now 540
## 2 new york 392
## 3 last year 374
## 4 last night 338
## 5 high school 277
## 6 years ago 264
## 7 feel like 250
## 8 last week 233
## 9 first time 229
## 10 looking forward 221
## 11 can get 212
## 12 even though 201
## 13 looks like 198
## 14 make sure 186
## 15 happy birthday 183
## 16 st louis 173
## 17 just got 163
## 18 good morning 160
## 19 one day 158
## 20 united states 157
## word freq
## 1 new york city 62
## 2 let us know 42
## 3 happy new year 40
## 4 happy mothers day 36
## 5 ass ass ass 33
## 6 president barack obama 32
## 7 two years ago 32
## 8 new york times 30
## 9 happy mother's day 29
## 10 world war ii 25
## 11 st louis county 20
## 12 gov chris christie 20
## 13 duke duke duke 20
## 14 cinco de mayo 18
## 15 josh josh josh 18
## 16 cents per share 17
## 17 four years ago 17
## 18 will take place 17
## 19 osama bin laden 16
## 20 five years ago 15
This work is a preliminary data exploration and analysis, and the data cleaning may need further fine tuning. I did considerable performance comparison and tuning of the n-gram word list generation, and concluded that tokenizing the texts before running the n-gram tasks speeds up the process by at least 100-fold.
In the future, I will 1) use a larger set of data to create the n-gram word lists that train the model; 2) use a trigram model as a starting point and tweak it against a test dataset to obtain the optimal model, with fallback plans if the optimal model does not work; and 3) pay special attention to the execution time of the prediction model in the final project. More accurate and thorough trigram word lists might need to be acquired from sources outside the lists generated by this corpus.