This project performs some preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used to perform some preliminary cleaning and text mining, such as creating n-gram lists. At the end of the report, future plans for building a language model to predict the next word of an incomplete phrase, as well as for building a front-end Shiny application, are discussed.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under the “final” directory there are four folders, one per language, each containing three documents. The four languages are English, Russian, French and German. Only the subdirectory “en_US”, which contains the three English documents, is used here. The three documents are collections of blogs, news articles and tweets respectively, so the directory of interest is ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
R packages used: stringi, for word counts of the three files; quanteda, for text cleaning and text mining; ggplot2, for plotting graphics; wordcloud, for plotting a cloud of words.
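For reference, the packages can be installed (once) and loaded roughly as below; this is a minimal sketch, not the exact setup code used for the report.

pkgs <- c("stringi", "quanteda", "ggplot2", "wordcloud")
# Install any packages that are not yet present, then load them all
new_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(new_pkgs) > 0) install.packages(new_pkgs)
invisible(lapply(pkgs, library, character.only = TRUE))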
Read the three files into R, specifying a binary connection for reading.
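A sketch of the read step is shown below. The object names us_b, us_n and us_t are the ones used in the rest of the report; opening the files as binary connections and the skipNul option are assumptions made so that embedded null characters do not interrupt readLines().

# Read each file over a binary connection
con <- file("~/final/en_US/en_US.blogs.txt", open = "rb")
us_b <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.news.txt", open = "rb")
us_n <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.twitter.txt", open = "rb")
us_t <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)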
Summarize the file information: file size, number of lines per file, and total word count per file.
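The summary columns can be assembled roughly as follows (a sketch: file.info() for the size in MB, length() for line counts, and stringi::stri_count_words() for word counts; the variable names match the data frame built below).

library(stringi)
file_name <- c("us_b", "us_n", "us_t")
file_paths <- c("~/final/en_US/en_US.blogs.txt",
                "~/final/en_US/en_US.news.txt",
                "~/final/en_US/en_US.twitter.txt")
file_size_in_MB <- file.info(file_paths)$size / 1024^2
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))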
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a sample corpus of the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the steps took forever and caused the computer to hang. I finally decided to reduce the sampling size to 2% of each file.
# Sample 2% of the lines from each file (without replacement), then combine
us_b_sub <- sample(us_b, size = length(us_b) * 0.02, replace = FALSE)
us_n_sub <- sample(us_n, size = length(us_n) * 0.02, replace = FALSE)
us_t_sub <- sample(us_t, size = length(us_t) * 0.02, replace = FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity<-as.character(profanity[,1])
# Clean up the text before building n-grams: lower-case, tokenize without
# punctuation or numbers, then drop English stop words
tokensAll <- tokens(char_tolower(us_sub), remove_punct = TRUE, remove_numbers = TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams <- dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
                       valuetype = "fixed", verbose = FALSE)
# bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n = 2, concatenator = " ")
twograms <- dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
                       valuetype = "fixed", verbose = FALSE)
# trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n = 3, concatenator = " ")
threegrams <- dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
                         valuetype = "fixed", verbose = FALSE)
Given the Capstone project's objective of predicting the next word of a phrase, an n-gram language model is a suitable model for the prediction. Word lists for unigrams, bigrams and trigrams are created here for training the n-gram language model later.
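The three tables below list the top 20 unigrams, bigrams and trigrams by frequency in the sample. They can be extracted from the dfm objects with quanteda's topfeatures(); the sketch below shows one way to build data frames with the word and freq columns shown in the output (printing the three data frames gives the tables that follow).

top_one <- topfeatures(onegrams, 20)
top_two <- topfeatures(twograms, 20)
top_three <- topfeatures(threegrams, 20)
onegrams_df <- data.frame(word = names(top_one), freq = as.numeric(top_one),
                          stringsAsFactors = FALSE)
twograms_df <- data.frame(word = names(top_two), freq = as.numeric(top_two),
                          stringsAsFactors = FALSE)
threegrams_df <- data.frame(word = names(top_three), freq = as.numeric(top_three),
                            stringsAsFactors = FALSE)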
## word freq
## 1 will 31575
## 2 just 30552
## 3 said 30487
## 4 one 29243
## 5 like 27005
## 6 can 24615
## 7 get 22839
## 8 time 21649
## 9 new 19168
## 10 now 18112
## 11 good 18004
## 12 day 16880
## 13 know 16287
## 14 love 16240
## 15 people 15771
## 16 back 14203
## 17 go 13993
## 18 see 13790
## 19 first 13456
## 20 make 13332
## word freq
## 1 right now 2573
## 2 new york 1838
## 3 last year 1819
## 4 last night 1628
## 5 years ago 1402
## 6 high school 1363
## 7 feel like 1275
## 8 last week 1218
## 9 first time 1205
## 10 looking forward 1164
## 11 can get 1070
## 12 make sure 1034
## 13 looks like 994
## 14 st louis 921
## 15 even though 913
## 16 good morning 860
## 17 just got 844
## 18 new jersey 843
## 19 happy birthday 840
## 20 let know 831
## word freq
## 1 new york city 251
## 2 let us know 237
## 3 happy new year 204
## 4 two years ago 189
## 5 happy mothers day 182
## 6 happy mother's day 172
## 7 president barack obama 153
## 8 cinco de mayo 137
## 9 world war ii 127
## 10 new york times 123
## 11 looking forward seeing 102
## 12 st louis county 93
## 13 first time since 93
## 14 gov chris christie 89
## 15 will take place 87
## 16 rock n roll 73
## 17 three years ago 71
## 18 two weeks ago 69
## 19 four years ago 68
## 20 couple weeks ago 67
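The wordcloud package listed earlier can be used to visualize the most frequent unigrams. A minimal sketch is shown below; the top-100 cutoff and the color palette are arbitrary choices.

library(wordcloud)
library(RColorBrewer)
top_100 <- topfeatures(onegrams, 100)
set.seed(1234)  # fix the random layout so the cloud is reproducible
wordcloud(words = names(top_100), freq = top_100,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)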
This work is only a preliminary data exploration and analysis; the data cleaning may need further fine tuning. I have done a fair amount of performance comparison and tuning on the n-gram word list generation, and concluded that tokenizing the texts once before performing the n-gram tasks speeds up the process by at least 100-fold.
In the future, I will 1) use a larger set of data to create the n-gram word lists for training the model; 2) start from a trigram model and tweak it against a test dataset to obtain the optimal model, with fallback plans in case the optimal model does not work; and 3) pay special attention to the execution time of the prediction model in the final project. More accurate and thorough trigram word lists might need to be acquired outside of the lists generated from this corpus.
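As an illustration of the planned trigram-with-backoff approach, a minimal sketch of next-word prediction is shown below. It assumes n-gram frequency data frames with word and freq columns (as built earlier with topfeatures(), though for a real model they would be created without the top-20 cutoff); the function name predict_next_word and the simple "longest matching n-gram first" backoff rule are illustrative choices, not the final model.

# Hypothetical helper: look for trigrams whose first two words match the end
# of the phrase; if none are found, back off to bigrams.
predict_next_word <- function(phrase, twograms_df, threegrams_df) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- threegrams_df[startsWith(threegrams_df$word, paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$word[which.max(hits$freq)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  hits <- twograms_df[startsWith(twograms_df$word, paste0(words[n], " ")), ]
  if (nrow(hits) > 0) {
    best <- hits$word[which.max(hits$freq)]
    return(tail(strsplit(best, " ")[[1]], 1))
  }
  NA_character_  # no matching n-gram in the sampled corpus
}

predict_next_word("happy new", twograms_df, threegrams_df)  # e.g. "year"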