This project performs some preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used to perform some preliminary cleaning and text mining, such as creating n-gram lists. At the end of the report, future plans for building a language model to predict the next word of an incomplete phrase, as well as for building a front-end Shiny application, are discussed.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under the “final” directory there are four folders, one per language, each containing three documents. The four languages are English, Russian, French and German. Only the subdirectory “en_US”, which contains the three English documents, is used here. The three documents are collections of blogs, news articles and tweets respectively, so the directory of interest is ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
R packages used: stringi, for word counts of the three files; quanteda, for text cleaning and text mining; ggplot2, for plotting graphics; wordcloud, for plotting a cloud of words.
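For reference, the packages can be installed (once) and loaded roughly as below; this is a minimal sketch, not the exact setup code used for the report.

pkgs <- c("stringi", "quanteda", "ggplot2", "wordcloud")
# Install any packages that are not yet present, then load them all
new_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
invisible(lapply(pkgs, library, character.only = TRUE))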
Read the three files into R, specifying a binary connection for reading.
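A sketch of the read step is shown below. The object names us_b, us_n and us_t are the ones used in the rest of the report; opening the files as binary connections and the skipNul option are assumptions made so that embedded null characters do not interrupt readLines().

# Read each file over a binary connection
con <- file("~/final/en_US/en_US.blogs.txt", open = "rb")
us_b <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.news.txt", open = "rb")
us_n <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.twitter.txt", open = "rb")
us_t <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)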
Summarize the file information: file size, number of lines per file, and total word count per file.
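The summary columns can be assembled roughly as follows (a sketch: file.info() for the size in MB, length() for line counts, and stringi::stri_count_words() for word counts; the variable names match the data frame built below).

library(stringi)
file_name <- c("us_b", "us_n", "us_t")
file_paths <- c("~/final/en_US/en_US.blogs.txt",
                "~/final/en_US/en_US.news.txt",
                "~/final/en_US/en_US.twitter.txt")
file_size_in_MB <- file.info(file_paths)$size / 1024^2
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))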
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a sample corpus of the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the steps took forever and caused the computer to hang. I finally decided to reduce the sampling size to 2% of each file.
# Sample 2% of the lines from each file (without replacement), then combine
us_b_sub <- sample(us_b, size = length(us_b) * 0.02, replace = FALSE)
us_n_sub <- sample(us_n, size = length(us_n) * 0.02, replace = FALSE)
us_t_sub <- sample(us_t, size = length(us_t) * 0.02, replace = FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity<-as.character(profanity[,1])
# Clean up the text before building n-grams: lower-case, tokenize without
# punctuation or numbers, then drop English stop words
tokensAll <- tokens(char_tolower(us_sub), remove_punct = TRUE, remove_numbers = TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams <- dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
                       valuetype = "fixed", verbose = FALSE)
# bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n = 2, concatenator = " ")
twograms <- dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
                       valuetype = "fixed", verbose = FALSE)
# trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n = 3, concatenator = " ")
threegrams <- dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
                         valuetype = "fixed", verbose = FALSE)
Given the Capstone project's objective of predicting the next word of a phrase, an n-gram language model is a suitable model for the prediction. Word lists for unigrams, bigrams and trigrams are created here for training the n-gram language model later.
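The three tables below list the top 20 unigrams, bigrams and trigrams by frequency in the sample. They can be extracted from the dfm objects with quanteda's topfeatures(); the sketch below shows one way to build data frames with the word and freq columns shown in the output (printing the three data frames gives the tables that follow).

top_one <- topfeatures(onegrams, 20)
top_two <- topfeatures(twograms, 20)
top_three <- topfeatures(threegrams, 20)
onegrams_df <- data.frame(word = names(top_one), freq = as.numeric(top_one),
                          stringsAsFactors = FALSE)
twograms_df <- data.frame(word = names(top_two), freq = as.numeric(top_two),
                          stringsAsFactors = FALSE)
threegrams_df <- data.frame(word = names(top_three), freq = as.numeric(top_three),
                            stringsAsFactors = FALSE)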
## word freq
## 1 will 31575
## 2 just 30552
## 3 said 30487
## 4 one 29243
## 5 like 27005
## 6 can 24615
## 7 get 22839
## 8 time 21649
## 9 new 19168
## 10 now 18112
## 11 good 18004
## 12 day 16880
## 13 know 16287
## 14 love 16240
## 15 people 15771
## 16 back 14203
## 17 go 13993
## 18 see 13790
## 19 first 13456
## 20 make 13332
## word freq
## 1 right now 2573
## 2 new york 1838
## 3 last year 1819
## 4 last night 1628
## 5 years ago 1402
## 6 high school 1363
## 7 feel like 1275
## 8 last week 1218
## 9 first time 1205
## 10 looking forward 1164
## 11 can get 1070
## 12 make sure 1034
## 13 looks like 994
## 14 st louis 921
## 15 even though 913
## 16 good morning 860
## 17 just got 844
## 18 new jersey 843
## 19 happy birthday 840
## 20 let know 831
## word freq
## 1 new york city 251
## 2 let us know 237
## 3 happy new year 204
## 4 two years ago 189
## 5 happy mothers day 182
## 6 happy mother's day 172
## 7 president barack obama 153
## 8 cinco de mayo 137
## 9 world war ii 127
## 10 new york times 123
## 11 looking forward seeing 102
## 12 st louis county 93
## 13 first time since 93
## 14 gov chris christie 89
## 15 will take place 87
## 16 rock n roll 73
## 17 three years ago 71
## 18 two weeks ago 69
## 19 four years ago 68
## 20 couple weeks ago 67
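The wordcloud package listed earlier can be used to visualize the most frequent unigrams. A minimal sketch is shown below; the top-100 cutoff and the color palette are arbitrary choices.

library(wordcloud)
library(RColorBrewer)
top_100 <- topfeatures(onegrams, 100)
set.seed(1234)  # fix the random layout so the cloud is reproducible
wordcloud(words = names(top_100), freq = top_100,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)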
This work is only a preliminary data exploration and analysis; the data cleaning may need further fine tuning. I have done a fair amount of performance comparison and tuning on the n-gram word list generation, and concluded that tokenizing the texts once before performing the n-gram tasks speeds up the process by at least 100-fold.
In the future, I will 1) use a larger set of data to create the n-gram word lists for training the model; 2) start from a trigram model and tweak it against a test dataset to obtain the optimal model, with fallback plans in case the optimal model does not work; and 3) pay special attention to the execution time of the prediction model in the final project. More accurate and thorough trigram word lists might need to be acquired outside of the lists generated from this corpus.
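As an illustration of the planned trigram-with-backoff approach, a minimal sketch of next-word prediction is shown below. It assumes n-gram frequency data frames with word and freq columns (as built earlier with topfeatures(), though for a real model they would be created without the top-20 cutoff); the function name predict_next_word and the simple "longest matching n-gram first" backoff rule are illustrative choices, not the final model.

# Hypothetical helper: look for trigrams whose first two words match the end
# of the phrase; if none are found, back off to bigrams.
predict_next_word <- function(phrase, twograms_df, threegrams_df) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- threegrams_df[startsWith(threegrams_df$word, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$word[which.max(hits$freq)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  hits <- twograms_df[startsWith(twograms_df$word, paste0(words[n], " ")), ]
  if (nrow(hits) > 0) {
    best <- hits$word[which.max(hits$freq)]
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no matching n-gram in the sampled corpus
}

predict_next_word("happy new", twograms_df, threegrams_df)  # e.g. "year"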