This project performs some preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used to perform preliminary cleaning and text mining tasks such as creating n-gram lists. At the end of the report, future plans are discussed for building a language model that predicts the next word of an incomplete phrase, as well as a front-end Shiny application.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under the “final” directory are four folders, one per language, each containing three documents. The four languages are English, Russian, Finnish and German. Only the subdirectory “en_US”, which contains the three English documents, is used here. The three documents are files of blogs, news articles and Twitter posts respectively. So the directory of interest is ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
R packages used: stringi, for the word counts of the three files; quanteda, for text cleaning and text mining; ggplot2, for plotting graphics; and wordcloud, for plotting a word cloud.
Read the three files into R, opening each file in binary mode.
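This step can be sketched as follows, assuming readLines() over binary connections and the object names us_b, us_n and us_t used later in the report:

# Open each file in binary mode and read every line; skipNul = TRUE guards
# against embedded NUL characters.
con <- file("~/final/en_US/en_US.blogs.txt", open = "rb")
us_b <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.news.txt", open = "rb")
us_n <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.twitter.txt", open = "rb")
us_t <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)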
Summarize the file information: file size, number of lines per file, and total word count per file.
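A minimal sketch of how these quantities can be computed, using stringi for the word counts and assuming the us_b, us_n and us_t vectors from the previous step, is:

library(stringi)

file_name <- c("us_b", "us_n", "us_t")
file_paths <- file.path("~/final/en_US",
                        c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))

# File size in megabytes, number of lines, and total word count per file
file_size_in_MB <- file.info(file_paths)$size / 1024^2
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))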
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a sample corpus from the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the steps took forever and caused the computer to hang. I finally decided to reduce the sampling size to 2% for each file.
# Randomly sample a fixed fraction of the lines of each file, then combine the samples
us_b_sub <- sample(us_b, size=length(us_b)*.07, replace=FALSE)
us_n_sub <- sample(us_n, size=length(us_n)*.07, replace=FALSE)
us_t_sub <- sample(us_t, size=length(us_t)*.07, replace=FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
# Strip non-ASCII characters, then load a profanity word list for filtering
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity <- as.character(profanity[,1])
# Clean up the text before building n-grams: lowercase, drop punctuation and
# numbers, then remove English stop words
tokensAll <- tokens(char_tolower(us_sub), remove_punct = TRUE, remove_numbers = TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams<-dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n=2,concatenator = " " )
twograms<-dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n=3,concatenator = " " )
threegrams<-dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#4-grams
tokensNgramsNoStopwords_4 <- tokens_ngrams(tokensNoStopwords, n=4,concatenator = " " )
fourgrams<-dfm(tokensNgramsNoStopwords_4)
fourgrams <- dfm_select(fourgrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#5-grams
tokensNgramsNoStopwords_5 <- tokens_ngrams(tokensNoStopwords, n=5,concatenator = " " )
fivegrams<-dfm(tokensNgramsNoStopwords_5)
fivegrams <- dfm_select(fivegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
Based on the Capstone project’s objective of predicting the next word of a phrase, an n-gram language model is a suitable model for the prediction. Word lists for unigrams through 5-grams are created above for future training of the n-gram language model. The 20 most frequent entries of each list are shown below, in order from unigrams to 5-grams.
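A minimal sketch of how each table can be produced from the document-feature matrices above, using quanteda’s topfeatures() (the unigram_freq name is only illustrative), is:

# Top 20 most frequent unigrams as a word/freq data frame; the same pattern
# applies to twograms, threegrams, fourgrams and fivegrams.
top_unigrams <- topfeatures(onegrams, n = 20)
unigram_freq <- data.frame(word = names(top_unigrams),
                           freq = as.integer(top_unigrams),
                           row.names = NULL)
unigram_freq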
## word freq
## 1 will 21951
## 2 said 21415
## 3 just 21334
## 4 one 20637
## 5 like 18561
## 6 can 17080
## 7 get 15822
## 8 time 15051
## 9 new 13617
## 10 now 12599
## 11 good 12521
## 12 day 11930
## 13 love 11303
## 14 know 11103
## 15 people 11100
## 16 go 9950
## 17 see 9896
## 18 back 9852
## 19 first 9484
## 20 also 9083
## word freq
## 1 right now 1719
## 2 new york 1399
## 3 last year 1361
## 4 last night 1074
## 5 high school 1044
## 6 years ago 958
## 7 feel like 897
## 8 last week 840
## 9 first time 818
## 10 looking forward 785
## 11 can get 766
## 12 make sure 746
## 13 even though 694
## 14 looks like 678
## 15 st louis 659
## 16 good morning 644
## 17 happy birthday 635
## 18 just got 632
## 19 new jersey 571
## 20 let know 566
## word freq
## 1 let us know 211
## 2 new york city 182
## 3 happy new year 138
## 4 two years ago 117
## 5 happy mother's day 115
## 6 happy mothers day 111
## 7 cinco de mayo 99
## 8 president barack obama 89
## 9 new york times 83
## 10 st louis county 79
## 11 world war ii 78
## 12 gov chris christie 76
## 13 looking forward seeing 75
## 14 will take place 65
## 15 two weeks ago 62
## 16 first time since 60
## 17 st patrick's day 60
## 18 just got back 51
## 19 rock n roll 50
## 20 three years ago 46
## word freq
## 1 martin luther king jr 28
## 2 g fat g saturated 26
## 3 calories g protein g 24
## 4 let us know think 23
## 5 let us know can 23
## 6 dow jones industrial average 23
## 7 g protein g carbohydrate 23
## 8 happy cinco de mayo 22
## 9 happy new year everyone 21
## 10 please let us know 21
## 11 protein g carbohydrate g 21
## 12 g carbohydrate g fat 21
## 13 get real rewards just 20
## 14 real rewards just watching 20
## 15 rewards just watching tv 20
## 16 follow follow follow follow 20
## 17 per serving calories g 20
## 18 carbohydrate g fat g 20
## 19 mg cholesterol mg sodium 19
## 20 cholesterol mg sodium g 19
## word freq
## 1 calories g protein g carbohydrate 22
## 2 protein g carbohydrate g fat 21
## 3 get real rewards just watching 20
## 4 real rewards just watching tv 20
## 5 g protein g carbohydrate g 20
## 6 g carbohydrate g fat g 20
## 7 carbohydrate g fat g saturated 20
## 8 amazon services llc amazon eu 18
## 9 serving calories g protein g 18
## 10 fat g saturated mg cholesterol 18
## 11 cholesterol mg sodium g fiber 18
## 12 per serving calories g protein 17
## 13 g fat g saturated mg 17
## 14 g saturated mg cholesterol mg 17
## 15 saturated mg cholesterol mg sodium 17
## 16 moist moist moist moist moist 16
## 17 follow follow follow follow follow 15
## 18 mg cholesterol mg sodium g 14
## 19 food food food food food 14
## 20 chicago illinois incorporated item pp 12
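As noted in the package list, ggplot2 and wordcloud are used to visualize the word frequencies. A minimal sketch of two such plots, based on the onegrams dfm above and the illustrative unigram_freq data frame, is:

library(ggplot2)
library(wordcloud)

# Bar chart of the 20 most frequent unigrams
ggplot(unigram_freq, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 20 unigrams in the sample")

# Word cloud of the 100 most frequent unigrams
top100 <- topfeatures(onegrams, n = 100)
wordcloud(words = names(top100), freq = top100, max.words = 100,
          random.order = FALSE)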
This work is a preliminary data exploration and analysis, and the data cleaning might need to be fine-tuned further. I did extensive performance comparison and tuning on the n-gram word list generation, and concluded that tokenizing the texts once before performing the n-gram tasks speeds up the process at least 100-fold.
In the future, I will: 1) use a larger set of data to create the n-gram word lists used to train the model; 2) use a trigram model as a starting point and tweak the model against a test dataset to obtain the optimal model, with fallback plans in case the optimal model does not work; 3) pay special attention in the final project to the execution time of the prediction model. More accurate and thorough trigram word lists might need to be acquired beyond the lists generated from this corpus.