This milestone report is part of the Data Science Capstone project offered by Coursera and SwiftKey. The main objective of the capstone is to turn corpora of text into a next-word prediction system based on word frequencies and context, applying data science to natural language processing. This R Markdown report describes an exploratory analysis of the sample training data set and summarizes plans for building the prediction model. The text-mining R packages tm[1] and quanteda[2] are used for cleaning, preprocessing, managing, and analyzing the text. This report meets the following requirements:
Downloads and loads the data, creates a sample training data set, and preprocesses it.
Generates summary statistics about the data sets and makes basic plots such as histograms to illustrate features of the data.
Describes some interesting findings.
Reports plans for creating a prediction algorithm and Shiny app.
The R packages used here include: quanteda, tm, stringi, downloader, readr, stringr, dplyr, tibble, ggplot2, rmarkdown, knitr, and ggthemes.
Download the data and save to local disk:
[1] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.blogs.txt"
[2] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.news.txt"
[3] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.twitter.txt"
| File name | File size (MB) | Word count | Line count | Max words/line | Avg words/line |
|---|---|---|---|---|---|
| en_US.blogs.txt | 200 | 37272578 | 899288 | 40832 | 42 |
| en_US.news.txt | 196 | 34309642 | 1010242 | 11384 | 34 |
| en_US.twitter.txt | 159 | 30341028 | 2360148 | 140 | 13 |
| Total | 555 | 101923248 | 4269678 | 40832 | 24 |
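The per-file statistics above can be reproduced along the following lines. This is a minimal base R sketch, not the code used for the report: the helper name `summarise_lines` is hypothetical, and it counts whitespace-separated tokens as words, which may differ slightly from the word counts a package such as stringi would give.

```r
# Summarise a character vector of lines: line count, word count,
# and the longest and average line length in words.
# Assumption: a "word" is any whitespace-separated token.
summarise_lines <- function(lines) {
  # Split each trimmed line on runs of whitespace; empty lines give zero words
  words_per_line <- vapply(
    strsplit(trimws(lines), "\\s+"),
    function(w) sum(nzchar(w)),
    integer(1)
  )
  data.frame(
    line_count = length(lines),
    word_count = sum(words_per_line),
    max_words_per_line = max(words_per_line),
    avg_words_per_line = round(mean(words_per_line))
  )
}

# Toy input standing in for the lines of one source file
demo_lines <- c("just another day", "feeling good about the new blog", "")
demo_summary <- summarise_lines(demo_lines)
print(demo_summary)
```

For a real file, the lines could be read first with `readLines()` (for example with `encoding = "UTF-8"` and `skipNul = TRUE`) and then passed to the helper.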
To speed up processing, a random sample was drawn from each of the three sources: each line was kept with probability 0.01 using the rbinom() function, and the sampled lines were stored.
# A tibble: 3 x 2
`sample text` length
<chr> <int>
1 sblog 6278
2 snews 8721
3 stwit 22982
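The sampling step can be sketched as follows. The helper name `sample_lines`, the seed value, and the toy input are assumptions for illustration; only the use of `rbinom()` with a probability of 0.01 comes from the report. Because each line is kept or dropped independently, the sample size is itself random and only approximately 1% of the input.

```r
# Keep each line independently with probability `prob`.
# rbinom(n, size = 1, prob) returns n Bernoulli draws (0 or 1), used as a mask.
sample_lines <- function(lines, prob = 0.01, seed = 1234) {
  set.seed(seed)  # fixed seed for reproducibility (the value is an assumption)
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

# Toy input standing in for one of the three source files
toy_lines <- paste("line", seq_len(10000))
toy_sample <- sample_lines(toy_lines, prob = 0.01)
length(toy_sample)  # close to 100; the exact value depends on the seed
```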
The cleaning and preprocessing include:
Sample quanteda corpus:
Corpus consisting of 37981 documents, showing 5 documents:
Text Types Tokens Sentences author datetimestamp description heading id language origin
text1 3 3 1 NA 2020-07-06 17:35:16 NA NA 1 en NA
text2 10 10 1 NA 2020-07-06 17:35:16 NA NA 2 en NA
text3 9 10 1 NA 2020-07-06 17:35:16 NA NA 3 en NA
text4 21 26 1 NA 2020-07-06 17:35:16 NA NA 4 en NA
text5 17 27 1 NA 2020-07-06 17:35:16 NA NA 5 en NA
In statistical natural language processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Bigrams and trigrams are sequences of two and three words, respectively. We will build and use an n-gram model, a type of probabilistic language model that predicts the next item in a sequence in the form of an (n − 1)-order Markov model.
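To make the definition concrete, here is a minimal base R sketch that builds n-grams with a sliding window and estimates next-word probabilities from raw counts. The helper names `make_ngrams` and `next_word_probs` are hypothetical, and the estimate is a plain maximum-likelihood ratio with no smoothing, so it only illustrates the Markov assumption rather than the model planned for the final app.

```r
# Build n-grams from a vector of tokens by sliding a window of width n.
# Tokens are joined with "_", matching labels such as "love_love_love".
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Maximum-likelihood estimate of P(word | history) from n-gram counts.
# `history` must contain exactly n - 1 words for n-grams of order n.
next_word_probs <- function(ngrams, history) {
  prefix <- paste0(paste(history, collapse = "_"), "_")
  hits <- ngrams[startsWith(ngrams, prefix)]
  if (length(hits) == 0) return(numeric(0))
  counts <- table(substring(hits, nchar(prefix) + 1))
  probs <- setNames(as.vector(counts) / sum(counts), names(counts))
  sort(probs, decreasing = TRUE)
}

tokens <- c("i", "love", "you", "i", "love", "it", "i", "love", "you")
bigrams <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)
probs <- next_word_probs(bigrams, "love")
print(probs)
```

With bigrams, the prediction depends only on the single previous word; with trigrams, `history` would hold the two previous words.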
Document-feature matrix of: 37,981 documents, 34,525 features (100.0% sparse) and 7 docvars.
features
docs just anoth feel tell later blog get wors now say
text1 1 1 1 0 0 0 0 0 0 0
text2 0 0 0 1 1 1 1 1 1 1
text3 0 0 0 0 0 0 0 0 0 0
text4 0 0 0 0 0 0 0 0 0 0
text5 0 0 0 0 0 0 1 0 1 0
text6 1 0 0 0 0 0 2 0 0 0
[ reached max_ndoc ... 37,975 more documents, reached max_nfeat ... 34,515 more features ]
[1] "get" "just" "said" "like" "one" "go" "time" "can" "day" "love" "year" "make"
[13] "new" "good" "thank" "work" "now" "know" "want" "peopl"
The three corpora of US English text (blogs, news, and Twitter) are about 200, 196, and 159 MB, respectively. The Twitter corpus has the shortest lines, none exceeding 140 “words”, while the blogs corpus contains the longest line (40,832 words).
Bigrams and trigrams should be formed within a sentence and should not cross sentence boundaries.
Cleaning and other preprocessing may blur or destroy sentence boundaries. Special tokens could be used to mark the beginning and end of each sentence before the text is converted to lower case.
Trigrams such as “follow_follow_back” and “love_love_love” should not be produced by the n-gram functions; they need to be avoided or filtered out.
Word stemming is necessary, but it produces forms such as “peopl”, “citi”, “happi”, “good_morn”, “st_loui_counti”, and “cinco_de_mayo”. Restoring stemmed words to their original form might take a lot of work. Are there better ways?
Removing stopwords is necessary for memory size and speed, but stopwords may be needed to produce real-world phrases in the final next-word prediction.
Data size, memory, speed, and accuracy are the main challenges, especially with limited resources (such as an x86-64 machine running Windows 7 with 8 GB RAM).
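Two of the findings above, keeping n-grams inside sentence boundaries and filtering repeated-word n-grams, can be sketched in base R as follows. All helper names and the `<s>` and `</s>` marker strings are hypothetical, the sentence splitter is a crude punctuation rule rather than a proper tokenizer, and the repeat filter is blunt: it would also drop legitimate repeats such as “had had”.

```r
# Split text into sentences on ., ! or ? followed by whitespace.
# This is a crude rule; abbreviations such as "St." will be split wrongly.
split_sentences <- function(text) {
  parts <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(parts)]
}

# Lower-case one sentence, strip punctuation (keeping apostrophes),
# and wrap the tokens in boundary markers.
tokenize_sentence <- function(sentence) {
  clean <- tolower(gsub("[^[:alnum:][:space:]']", " ", sentence))
  words <- strsplit(trimws(clean), "\\s+")[[1]]
  c("<s>", words[nzchar(words)], "</s>")
}

# Build n-grams one sentence at a time, so none spans two sentences.
sentence_ngrams <- function(text, n = 3) {
  unlist(lapply(split_sentences(text), function(s) {
    tok <- tokenize_sentence(s)
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = "_"),
           character(1))
  }))
}

# Drop n-grams in which a word is immediately repeated,
# e.g. "love_love_love" or "follow_follow_back".
drop_repeats <- function(ngrams) {
  has_repeat <- vapply(strsplit(ngrams, "_", fixed = TRUE),
                       function(w) any(w[-1] == w[-length(w)]),
                       logical(1))
  ngrams[!has_repeat]
}

text <- "Love love love it. Follow back please!"
tri <- sentence_ngrams(text, 3)
kept <- drop_repeats(tri)
print(kept)
```

Splitting into sentences happens before lower-casing and punctuation removal, because both steps erase the cues the splitter relies on.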
Split the original data randomly into training, held-out, and test sets in a 60/20/20 ratio.
Rewrite the cleaning and preprocessing functions. Tokenize into sentences first, before converting to lower case and removing punctuation. Find better ways to handle the stemming and stopword issues.
Clean and preprocess the training, held-out, and test sets in exactly the same way. The test data should not be touched during model building, but it should have the same feature variables as the training data. In reality, however, the test data may contain words that are not in the training set. (Please correct me if my understanding is incorrect.)
Create unigrams, bigrams and trigrams from the training data. Remove singletons and sparse terms.
Build a next-word prediction model with interpolated modified Kneser-Ney smoothing. I will try to compile the KenLM package (written in C++) on Windows 7, since it appears superior in memory demand, accuracy, and speed. However, KenLM is not available on CRAN. Any suggestions?
Apply the model to the held-out data set to evaluate and tune the model.
Apply the word prediction model to the test data sets to predict the next word.
Create a Shiny app and publish it on the shinyapps.io server.
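The 60/20/20 split from the plan, together with one common way of handling test words that never occur in the training data, might look like the sketch below. The helper names, the seed, and the `<unk>` placeholder token are assumptions, not part of the original plan; mapping unseen words to a single unknown-word token is a widely used convention, offered here only as one possible answer to the question raised above.

```r
# Randomly partition lines into training, held-out and test sets.
split_data <- function(lines,
                       ratios = c(train = 0.6, heldout = 0.2, test = 0.2),
                       seed = 2020) {
  stopifnot(abs(sum(ratios) - 1) < 1e-8)
  set.seed(seed)  # the seed value is an assumption, used for reproducibility
  n <- length(lines)
  # Shuffle the indices, then cut the shuffled order at the cumulative ratios
  shuffled <- sample.int(n)
  cuts <- c(0, round(cumsum(ratios) * n))
  parts <- lapply(seq_along(ratios), function(k) {
    lines[shuffled[seq_len(cuts[k + 1] - cuts[k]) + cuts[k]]]
  })
  names(parts) <- names(ratios)
  parts
}

# Replace tokens that are not in the training vocabulary with a placeholder.
replace_oov <- function(tokens, vocab, unk = "<unk>") {
  tokens[!tokens %in% vocab] <- unk
  tokens
}

toy <- paste("doc", 1:100)
parts <- split_data(toy)
print(sapply(parts, length))

vocab <- c("i", "love", "you")
mapped <- replace_oov(c("i", "love", "kale"), vocab)
print(mapped)
```

The vocabulary should be built from the training set only, so that the held-out and test sets stay untouched during model building.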
Any corrections and suggestions would be deeply appreciated.
The R Markdown source, index.Rmd, can be found in my GitHub repository.
Session Info
R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_2.1.0 tm_0.7-7 NLP_0.2-0 ggthemes_4.2.0 ggplot2_3.3.2 tibble_3.0.1
[7] dplyr_1.0.0 stringr_1.4.0 readr_1.3.1 stringi_1.4.6 downloader_0.4 knitr_1.28
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 pillar_1.4.4 compiler_4.0.1 stopwords_2.0
[5] tools_4.0.1 digest_0.6.25 lattice_0.20-41 evaluate_0.14
[9] lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.6
[13] fastmatch_1.1-0 Matrix_1.2-18 cli_2.0.2 yaml_2.2.1
[17] parallel_4.0.1 xfun_0.14 withr_2.2.0 xml2_1.3.2
[21] fs_1.4.2 generics_0.0.2 vctrs_0.3.1 hms_0.5.3
[25] grid_4.0.1 tidyselect_1.1.0 data.table_1.12.8 glue_1.4.1
[29] R6_2.4.1 fansi_0.4.1 rmarkdown_2.3 farver_2.0.3
[33] purrr_0.3.4 magrittr_1.5 SnowballC_0.7.0 ISOcodes_2020.03.16
[37] codetools_0.2-16 usethis_1.6.1 scales_1.1.1 ellipsis_0.3.1
[41] htmltools_0.5.0 assertthat_0.2.1 colorspace_1.4-1 labeling_0.3
[45] utf8_1.1.4 RcppParallel_5.0.2 munsell_0.5.0 slam_0.1-47
[49] crayon_1.3.4