This project performs some preliminary data exploration and analysis on the dataset provided by the Capstone Project course, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, which is a corpus of unstructured text documents. The text mining package quanteda is used to perform preliminary cleaning and text mining tasks such as creating n-gram lists. At the end of the report, future plans are discussed for building a language model that predicts the next word of an incomplete phrase, as well as a front-end Shiny application.
The data is downloaded from the link above and unzipped into a subdirectory of the home directory named “final”. Under the “final” directory are four folders, one per language, each containing three documents. The four languages are English, Russian, Finnish and German. Only the subdirectory “en_US”, which contains the three English documents, is used here. The three documents are files of blogs, news articles and Twitter posts respectively. So the directory of interest is ~/final/en_US/, and the three files under it are en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
R packages used: stringi, for the word counts of the three files; quanteda, for text cleaning and text mining; ggplot2, for plotting graphics; and wordcloud, for plotting a word cloud.
Read the three files into R, opening each file in binary mode.
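This step can be sketched as follows, assuming readLines() over binary connections and the object names us_b, us_n and us_t used later in the report:

# Open each file in binary mode and read every line; skipNul = TRUE guards
# against embedded NUL characters.
con <- file("~/final/en_US/en_US.blogs.txt", open = "rb")
us_b <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.news.txt", open = "rb")
us_n <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
con <- file("~/final/en_US/en_US.twitter.txt", open = "rb")
us_t <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)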
Summarize the file information: file size, number of lines per file, and total word count per file.
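A minimal sketch of how these quantities can be computed, using stringi for the word counts and assuming the us_b, us_n and us_t vectors from the previous step, is:

library(stringi)

file_name <- c("us_b", "us_n", "us_t")
file_paths <- file.path("~/final/en_US",
                        c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))

# File size in megabytes, number of lines, and total word count per file
file_size_in_MB <- file.info(file_paths)$size / 1024^2
line_counts <- c(length(us_b), length(us_n), length(us_t))
word_counts <- c(sum(stri_count_words(us_b)),
                 sum(stri_count_words(us_n)),
                 sum(stri_count_words(us_t)))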
summary_file_info <- data.frame(file_name,file_size_in_MB,line_counts, word_counts)
summary_file_info
## file_name file_size_in_MB line_counts word_counts
## 1 us_b 200.4242 899288 37546246
## 2 us_n 196.2775 1010242 34762395
## 3 us_t 159.3641 2360148 30093410
I tried sampling 20%, 10% and 5% of the three files to create a sample corpus from the text documents. When I continued with the steps of creating bigrams and trigrams from the corpus, the steps took forever and caused the computer to hang. I finally decided to reduce the sampling size to 2% for each file.
# Randomly sample a fixed fraction of the lines of each file, then combine the samples
us_b_sub <- sample(us_b, size=length(us_b)*.07, replace=FALSE)
us_n_sub <- sample(us_n, size=length(us_n)*.07, replace=FALSE)
us_t_sub <- sample(us_t, size=length(us_t)*.07, replace=FALSE)
us_sub <- c(us_b_sub, us_n_sub, us_t_sub)
setwd("~")
suppressMessages(library(quanteda))
# Strip non-ASCII characters, then load a profanity word list for filtering
us_sub <- iconv(us_sub, from="UTF-8", to="ASCII", sub="")
profanity <- read.table("http://www.bannedwordlist.com/lists/swearWords.txt")
profanity <- as.character(profanity[,1])
# Clean up the text before building n-grams: lowercase, drop punctuation and
# numbers, then remove English stop words
tokensAll <- tokens(char_tolower(us_sub), remove_punct = TRUE, remove_numbers = TRUE)
tokensNoStopwords <- tokens_remove(tokensAll, stopwords("english"))
# unigrams
onegrams<-dfm(tokensNoStopwords)
onegrams <- dfm_select(onegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#bigrams
tokensNgramsNoStopwords_2 <- tokens_ngrams(tokensNoStopwords, n=2,concatenator = " " )
twograms<-dfm(tokensNgramsNoStopwords_2)
twograms <- dfm_select(twograms, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#trigrams
tokensNgramsNoStopwords_3 <- tokens_ngrams(tokensNoStopwords, n=3,concatenator = " " )
threegrams<-dfm(tokensNgramsNoStopwords_3)
threegrams <- dfm_select(threegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#4-grams
tokensNgramsNoStopwords_4 <- tokens_ngrams(tokensNoStopwords, n=4,concatenator = " " )
fourgrams<-dfm(tokensNgramsNoStopwords_4)
fourgrams <- dfm_select(fourgrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
#5-grams
tokensNgramsNoStopwords_5 <- tokens_ngrams(tokensNoStopwords, n=5,concatenator = " " )
fivegrams<-dfm(tokensNgramsNoStopwords_5)
fivegrams <- dfm_select(fivegrams, profanity, selection = "remove",
valuetype = "fixed", verbose= FALSE)
Based on the Capstone project’s objective of predicting the next word of a phrase, an n-gram language model is a suitable model for the prediction. Word lists for unigrams through 5-grams are created above for future training of the n-gram language model. The 20 most frequent entries of each list are shown below, in order from unigrams to 5-grams.
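A minimal sketch of how each table can be produced from the document-feature matrices above, using quanteda’s topfeatures() (the unigram_freq name is only illustrative), is:

# Top 20 most frequent unigrams as a word/freq data frame; the same pattern
# applies to twograms, threegrams, fourgrams and fivegrams.
top_unigrams <- topfeatures(onegrams, n = 20)
unigram_freq <- data.frame(word = names(top_unigrams),
                           freq = as.integer(top_unigrams),
                           row.names = NULL)
unigram_freq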
## word freq
## 1 will 21951
## 2 said 21415
## 3 just 21334
## 4 one 20637
## 5 like 18561
## 6 can 17080
## 7 get 15822
## 8 time 15051
## 9 new 13617
## 10 now 12599
## 11 good 12521
## 12 day 11930
## 13 love 11303
## 14 know 11103
## 15 people 11100
## 16 go 9950
## 17 see 9896
## 18 back 9852
## 19 first 9484
## 20 also 9083
## word freq
## 1 right now 1719
## 2 new york 1399
## 3 last year 1361
## 4 last night 1074
## 5 high school 1044
## 6 years ago 958
## 7 feel like 897
## 8 last week 840
## 9 first time 818
## 10 looking forward 785
## 11 can get 766
## 12 make sure 746
## 13 even though 694
## 14 looks like 678
## 15 st louis 659
## 16 good morning 644
## 17 happy birthday 635
## 18 just got 632
## 19 new jersey 571
## 20 let know 566
## word freq
## 1 let us know 211
## 2 new york city 182
## 3 happy new year 138
## 4 two years ago 117
## 5 happy mother's day 115
## 6 happy mothers day 111
## 7 cinco de mayo 99
## 8 president barack obama 89
## 9 new york times 83
## 10 st louis county 79
## 11 world war ii 78
## 12 gov chris christie 76
## 13 looking forward seeing 75
## 14 will take place 65
## 15 two weeks ago 62
## 16 first time since 60
## 17 st patrick's day 60
## 18 just got back 51
## 19 rock n roll 50
## 20 three years ago 46
## word freq
## 1 martin luther king jr 28
## 2 g fat g saturated 26
## 3 calories g protein g 24
## 4 let us know think 23
## 5 let us know can 23
## 6 dow jones industrial average 23
## 7 g protein g carbohydrate 23
## 8 happy cinco de mayo 22
## 9 happy new year everyone 21
## 10 please let us know 21
## 11 protein g carbohydrate g 21
## 12 g carbohydrate g fat 21
## 13 get real rewards just 20
## 14 real rewards just watching 20
## 15 rewards just watching tv 20
## 16 follow follow follow follow 20
## 17 per serving calories g 20
## 18 carbohydrate g fat g 20
## 19 mg cholesterol mg sodium 19
## 20 cholesterol mg sodium g 19
## word freq
## 1 calories g protein g carbohydrate 22
## 2 protein g carbohydrate g fat 21
## 3 get real rewards just watching 20
## 4 real rewards just watching tv 20
## 5 g protein g carbohydrate g 20
## 6 g carbohydrate g fat g 20
## 7 carbohydrate g fat g saturated 20
## 8 amazon services llc amazon eu 18
## 9 serving calories g protein g 18
## 10 fat g saturated mg cholesterol 18
## 11 cholesterol mg sodium g fiber 18
## 12 per serving calories g protein 17
## 13 g fat g saturated mg 17
## 14 g saturated mg cholesterol mg 17
## 15 saturated mg cholesterol mg sodium 17
## 16 moist moist moist moist moist 16
## 17 follow follow follow follow follow 15
## 18 mg cholesterol mg sodium g 14
## 19 food food food food food 14
## 20 chicago illinois incorporated item pp 12
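As noted in the package list, ggplot2 and wordcloud are used to visualize the word frequencies. A minimal sketch of two such plots, based on the onegrams dfm above and the illustrative unigram_freq data frame, is:

library(ggplot2)
library(wordcloud)

# Bar chart of the 20 most frequent unigrams
ggplot(unigram_freq, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Frequency", title = "Top 20 unigrams in the sample")

# Word cloud of the 100 most frequent unigrams
top100 <- topfeatures(onegrams, n = 100)
wordcloud(words = names(top100), freq = top100, max.words = 100,
          random.order = FALSE)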
This work is a preliminary data exploration and analysis, and the data cleaning might need to be fine-tuned further. I did extensive performance comparison and tuning on the n-gram word list generation, and concluded that tokenizing the texts once before performing the n-gram tasks speeds up the process at least 100-fold.
In the future, I will: 1) use a larger set of data to create the n-gram word lists used to train the model; 2) use a trigram model as a starting point and tweak the model against a test dataset to obtain the optimal model, with fallback plans in case the optimal model does not work; 3) pay special attention in the final project to the execution time of the prediction model. More accurate and thorough trigram word lists might need to be acquired beyond the lists generated from this corpus.