This report performs preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used for preliminary cleaning and text mining, such as creating n-gram lists. The report closes with plans for building a language model that predicts the next word of an incomplete phrase, and for wrapping it in a front-end Shiny application.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under “final” there are four folders, one per language (English, Russian, French and German), each containing three documents. Only the “en_US” subdirectory is used here; it contains three English documents drawn from blogs, news and Twitter. The directory of interest is therefore ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
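The download and extraction can be scripted in R. The exact commands used for this step are not shown in the report, so the lines below are only a sketch; the object names zip_url and zip_file are illustrative.

zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- file.path(path.expand("~"), "Coursera-SwiftKey.zip")
download.file(zip_url, destfile = zip_file, mode = "wb")   # "wb" keeps the zip intact on Windows
unzip(zip_file, exdir = path.expand("~"))                  # extracts the "final" directory tree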
R packages used: “stringi” for the word counts of the three files; “quanteda” for text cleaning and text mining; “ggplot2” for plotting graphics; “wordcloud” for plotting a word cloud.
Read the three files into R, opening each file in binary mode.
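The reading code itself is not shown here. One possible way to do it, matching the object names us_b, us_n and us_t used later, is to open each file as a binary connection and call readLines(); the read_binary helper and the skipNul flag (used to get past embedded nulls in en_US.news.txt) are assumptions made for this sketch, not necessarily the code used in the original analysis.

read_binary <- function(path) {
  con <- file(path, open = "rb")    # open the file in binary mode
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
us_b <- read_binary("~/final/en_US/en_US.blogs.txt")
us_n <- read_binary("~/final/en_US/en_US.news.txt")
us_t <- read_binary("~/final/en_US/en_US.twitter.txt")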
Summarize the file information: file size, number of lines per file, and total word count per file.
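The quantities referenced in the data frame below (file_size_in_MB, line_counts, word_counts) are computed before this step; the sketch below shows one way to obtain them with base R and stringi, which the report lists as its word-counting package. The file_dir and files objects are illustrative.

library(stringi)
file_dir  <- path.expand("~/final/en_US")
files     <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
file_name <- c("us_b", "us_n", "us_t")
file_size_in_MB <- file.size(file.path(file_dir, files)) / 1024^2   # bytes -> MB
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))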
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a combined sample of the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the process took far too long and caused the computer to hang, so I finally reduced the sampling size to 2% of each file.
us_b_sub <- sample(us_b, size=length(us_b)*.02, replace=FALSE)
us_n_sub <- sample(us_n, size=length(us_n)*.02, replace=FALSE)
us_t_sub <- sample(us_t, size=length(us_t)*.02, replace=FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
# Strip characters that cannot be converted to ASCII
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
# Load a profanity word list to filter out of the ngram lists
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity <- as.character(profanity[,1])
# Clean up the text before ngram
tokensAll <- tokens(char_tolower(us_sub), remove_punct =TRUE,remove_numbers=TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams<-dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n=2,concatenator = " " )
twograms<-dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n=3,concatenator = " " )
threegrams<-dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
Given the Capstone project’s objective of predicting the next word of a phrase, an n-gram language model is a suitable choice. Unigram, bigram and trigram word lists are created here for training the n-gram language model later.
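The code that produced the frequency tables below is not shown in the report; one way to build them is to apply quanteda’s topfeatures() to the three dfm objects created above, as sketched here. The top_words helper is an illustration, not the original code. The top-20 unigram, bigram and trigram lists from the 2% sample then follow in that order.

top_words <- function(x, n = 20) {   # top-n features of a dfm as a word/freq data frame
  tf <- topfeatures(x, n)
  data.frame(word = names(tf), freq = unname(tf), row.names = NULL)
}
top_words(onegrams)     # unigrams
top_words(twograms)     # bigrams
top_words(threegrams)   # trigrams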
## word freq
## 1 will 6277
## 2 said 6151
## 3 just 6125
## 4 one 5806
## 5 like 5420
## 6 can 5170
## 7 get 4652
## 8 time 4299
## 9 new 3845
## 10 now 3631
## 11 good 3537
## 12 day 3267
## 13 love 3235
## 14 know 3203
## 15 people 3149
## 16 go 2915
## 17 see 2852
## 18 back 2774
## 19 first 2719
## 20 also 2671
## word freq
## 1 right now 540
## 2 new york 392
## 3 last year 374
## 4 last night 338
## 5 high school 277
## 6 years ago 264
## 7 feel like 250
## 8 last week 233
## 9 first time 229
## 10 looking forward 221
## 11 can get 212
## 12 even though 201
## 13 looks like 198
## 14 make sure 186
## 15 happy birthday 183
## 16 st louis 173
## 17 just got 163
## 18 good morning 160
## 19 one day 158
## 20 united states 157
## word freq
## 1 new york city 62
## 2 let us know 42
## 3 happy new year 40
## 4 happy mothers day 36
## 5 ass ass ass 33
## 6 president barack obama 32
## 7 two years ago 32
## 8 new york times 30
## 9 happy mother's day 29
## 10 world war ii 25
## 11 st louis county 20
## 12 gov chris christie 20
## 13 duke duke duke 20
## 14 cinco de mayo 18
## 15 josh josh josh 18
## 16 cents per share 17
## 17 four years ago 17
## 18 will take place 17
## 19 osama bin laden 16
## 20 five years ago 15
This work is a preliminary data exploration and analysis, and the data cleaning may need further fine tuning. I did considerable performance comparison and tuning of the n-gram word list generation, and concluded that tokenizing the texts before running the n-gram tasks speeds up the process by at least 100-fold.
In the future, I will 1) use a larger set of data to create the n-gram word lists that train the model; 2) use a trigram model as a starting point and tweak it against a test dataset to obtain the optimal model, with fallback plans if the optimal model does not work; and 3) pay special attention to the execution time of the prediction model in the final project. More accurate and thorough trigram word lists might need to be acquired from sources outside the lists generated by this corpus.