Data Science Capstone Project

Introduction
Data Acquisition and Summary Statistics
Data Cleaning and Preprocessing
N-grams and dfm (sparse Document-Feature Matrix)
- Creating dfm for n-grams
- Most common n-grams
Some Observations and Issues in the Exploratory Analysis
Plans for creating a prediction algorithm and Shiny app
Appendix

Introduction

This milestone report is part of the Data Science Capstone project offered by Coursera and SwiftKey. The main objective of the capstone is to turn corpora of text into a next-word prediction system, based on word frequencies and context, by applying data science to natural language processing. This R Markdown report describes an exploratory analysis of the sample training data set and summarizes plans for building the prediction model. The text-mining R packages tm[1] and quanteda[2] are used for cleaning, preprocessing, managing and analyzing the text. This report meets the following requirements:

Downloads and loads the data, creates a sample training data set, and preprocesses it.
Generates summary statistics about the data sets and makes basic plots such as histograms to illustrate features of the data.
Describes some interesting findings.
Reports plans for creating a prediction algorithm and Shiny app.

Data Acquisition and Summary Statistics

Data Source

Load the libraries

The R packages used here include: quanteda, tm, stringi, downloader, readr, stringr, dplyr, tibble, ggplot2, rmarkdown, knitr, and ggthemes.

Download and Load the Course Data Sets

Download the data and save to local disk:

[1] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.blogs.txt"  
[2] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.news.txt"   
[3] "C:/Users/Lenovo/Desktop/peer4/DataScience-Milestone-Report/final/en_US/en_US.twitter.txt"

Summary Statistics about the Data Sets

file_name	file_size (Mb)	word_count	line_count	Max words/line	Avg words/line
en_US.blogs.txt	200	37272578	899288	40832	42
en_US.news.txt	196	34309642	1010242	11384	34
en_US.twitter.txt	159	30341028	2360148	140	13
Total	555	101923248	4269678	40832	24
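
As a rough illustration of how per-file statistics like those in the table can be computed, the sketch below summarises a character vector holding one element per line. It assumes that "words" are whitespace-separated tokens and that line length is measured in characters; the function name and the toy input are hypothetical, and the counting method actually used for the table is not shown in this report.

```r
# Summarise a character vector of text lines (one element per line).
# Assumption: words are whitespace-separated tokens; the report's own
# counting method may differ.
summarise_lines <- function(lines, name = "sample") {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file_name  = name,
    line_count = length(lines),
    word_count = sum(words_per_line),
    max_chars  = max(nchar(lines)),          # longest line, in characters
    avg_words  = round(mean(words_per_line)), # average words per line
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for the result of readLines() on a real file.
toy <- c("the quick brown fox", "jumps over", "the lazy dog today")
stats <- summarise_lines(toy, "toy")
print(stats)
```

For a real file, the vector would typically come from a call such as readLines() on one of the paths listed above.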

Load the text Data in R and remove non-ASCII characters
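
One common way to drop non-ASCII characters in base R is to convert the text with iconv() and substitute nothing for anything that cannot be represented. The sketch below assumes the input is UTF-8 encoded; the helper name is hypothetical and the report's own code may use a different approach (for example, stringi).

```r
# Drop every character that cannot be represented in ASCII.
# Assumption: the input is UTF-8 encoded.
remove_non_ascii <- function(x) {
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")
}

raw <- c("caf\u00e9 latte", "plain text", "\u201cquoted\u201d")
clean <- remove_non_ascii(raw)
print(clean)
```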

Sampling the Data for exploratory analysis

To speed up processing, a random sample of about 1% of the lines was drawn from each of the three sources using the rbinom() function and stored.

# A tibble: 3 x 2
  `sample text` length
  <chr>          <int>
1 sblog           6278
2 snews           8721
3 stwit          22982
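
A minimal sketch of this kind of sampling is given below: each line is kept independently with probability 0.01 by drawing one Bernoulli trial per line with rbinom(). The function name, the seed and the synthetic input are assumptions made for illustration.

```r
# Keep each line independently with probability `rate` (Bernoulli sampling).
# Assumption: the seed and the helper name are illustrative, not from the report.
sample_lines <- function(lines, rate = 0.01, seed = 1234) {
  set.seed(seed)
  keep <- rbinom(n = length(lines), size = 1, prob = rate) == 1
  lines[keep]
}

all_lines <- sprintf("line %d", 1:10000)
sampled <- sample_lines(all_lines, rate = 0.01)
length(sampled)  # close to 100; the exact size is itself random
```

Because every line is an independent draw, the sample size is only approximately 1% of the input.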

Data Cleaning and Preprocessing

Loading bad-word list from here

Create a tm corpus from three kinds of samples

Clean and transform the corpus using stringi functions and tm_map()

The cleaning and preprocessing include:

convert to lowercase
remove stopwords: c("will", quanteda::stopwords("english"))
remove profanity and other bad words
remove URLs (http, https, ftp, www and what follows)
remove Twitter hashtags and email addresses
remove symbols
remove punctuation, including hyphens, using tm::removePunctuation
remove numbers
stem words using tm::stemDocument (Porter’s stemming algorithm)
remove extra white space
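
The steps above can be approximated in base R with a chain of regular-expression substitutions, as sketched below. This is not the report's tm pipeline: the patterns and the tiny word list are placeholders, punctuation is replaced by a space rather than deleted, and stemming is left out because base R has no Porter stemmer.

```r
# Base-R approximation of the cleaning steps listed above.
# Assumptions: the regular expressions and `drop_words` are placeholders for
# the real stopword and profanity lists; stemming is not performed here.
clean_text <- function(x, drop_words = c("will", "the", "a", "is")) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)               # email addresses
  x <- gsub("#\\S+", " ", x, perl = TRUE)                   # Twitter hashtags
  x <- gsub("[[:punct:]]+", " ", x, perl = TRUE)            # punctuation, symbols
  x <- gsub("[[:digit:]]+", " ", x, perl = TRUE)            # numbers
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)                   # stop/bad words
  trimws(gsub("\\s+", " ", x, perl = TRUE))                 # extra white space
}

cleaned <- clean_text(
  "Visit https://example.com NOW, it is the #best deal of 2020! mail me@site.org"
)
print(cleaned)
```

The order matters: URLs, email addresses and hashtags are removed before punctuation, since stripping punctuation first would break the patterns that identify them.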

Converting tm corpus to quanteda corpus

sample quanteda corpus:

Corpus consisting of 37981 documents, showing 5 documents:

  Text Types Tokens Sentences author       datetimestamp description heading id language origin
 text1     3      3         1     NA 2020-07-06 17:35:16          NA      NA  1       en     NA
 text2    10     10         1     NA 2020-07-06 17:35:16          NA      NA  2       en     NA
 text3     9     10         1     NA 2020-07-06 17:35:16          NA      NA  3       en     NA
 text4    21     26         1     NA 2020-07-06 17:35:16          NA      NA  4       en     NA
 text5    17     27         1     NA 2020-07-06 17:35:16          NA      NA  5       en     NA

N-grams and dfm (sparse Document-Feature Matrix)

Creating dfm for n-grams

In statistical Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Bigrams and trigrams are sequences of two and three words, respectively. We will build and use an n-gram model, a type of probabilistic language model, to predict the next item in such a sequence in the form of an (n − 1)-order Markov model.
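
To make the Markov idea concrete, the sketch below implements the simplest case, a bigram (first-order) model: the next word is predicted from the previous word alone, using maximum-likelihood counts. No smoothing or back-off is applied, and all function names and the toy training sentences are assumptions for illustration only.

```r
# First-order Markov (bigram) next-word sketch using maximum-likelihood counts.
# Assumption: no smoothing or back-off; helper names are illustrative.
bigram_table <- function(sentences) {
  pairs <- lapply(strsplit(sentences, "\\s+"), function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  })
  do.call(rbind, pairs)
}

predict_next <- function(pairs, word) {
  candidates <- pairs$nxt[pairs$prev == word]
  if (length(candidates) == 0) return(NA_character_)  # unseen context
  counts <- table(candidates)
  probs <- counts / sum(counts)  # P(next | word) = count(word, next) / count(word)
  names(probs)[which.max(probs)]
}

train <- c("i love data", "i love r", "i like data science")
pairs <- bigram_table(train)
predict_next(pairs, "i")     # "love": 2 of the 3 continuations of "i"
predict_next(pairs, "data")  # "science"
```

An unseen context returns NA here, which is exactly the gap that smoothing and back-off methods are meant to fill.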

Unigram

Document-feature matrix of: 37,981 documents, 34,525 features (100.0% sparse) and 7 docvars.
       features
docs    just anoth feel tell later blog get wors now say
  text1    1     1    1    0     0    0   0    0   0   0
  text2    0     0    0    1     1    1   1    1   1   1
  text3    0     0    0    0     0    0   0    0   0   0
  text4    0     0    0    0     0    0   0    0   0   0
  text5    0     0    0    0     0    0   1    0   1   0
  text6    1     0    0    0     0    0   2    0   0   0
[ reached max_ndoc ... 37,975 more documents, reached max_nfeat ... 34,515 more features ]

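Bigram

Note: the matrix printed here has the same single-word features as the unigram matrix, so it does not actually show bigram features; genuine bigram features would be underscore-joined word pairs such as “good_morn”. The trigram matrix below has the same problem.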

Document-feature matrix of: 37,981 documents, 34,525 features (100.0% sparse) and 7 docvars.
       features
docs    just anoth feel tell later blog get wors now say
  text1    1     1    1    0     0    0   0    0   0   0
  text2    0     0    0    1     1    1   1    1   1   1
  text3    0     0    0    0     0    0   0    0   0   0
  text4    0     0    0    0     0    0   0    0   0   0
  text5    0     0    0    0     0    0   1    0   1   0
  text6    1     0    0    0     0    0   2    0   0   0
[ reached max_ndoc ... 37,975 more documents, reached max_nfeat ... 34,515 more features ]

Trigram

Document-feature matrix of: 37,981 documents, 34,525 features (100.0% sparse) and 7 docvars.
       features
docs    just anoth feel tell later blog get wors now say
  text1    1     1    1    0     0    0   0    0   0   0
  text2    0     0    0    1     1    1   1    1   1   1
  text3    0     0    0    0     0    0   0    0   0   0
  text4    0     0    0    0     0    0   0    0   0   0
  text5    0     0    0    0     0    0   1    0   1   0
  text6    1     0    0    0     0    0   2    0   0   0
[ reached max_ndoc ... 37,975 more documents, reached max_nfeat ... 34,515 more features ]

Most common n-grams

The most common unigrams

 [1] "get"   "just"  "said"  "like"  "one"   "go"    "time"  "can"   "day"   "love"  "year"  "make" 
[13] "new"   "good"  "thank" "work"  "now"   "know"  "want"  "peopl"

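The most common bigrams

Note: this list and the trigram list below repeat the unigram list verbatim; the actual top bigrams and trigrams cannot be recovered from this output.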

 [1] "get"   "just"  "said"  "like"  "one"   "go"    "time"  "can"   "day"   "love"  "year"  "make" 
[13] "new"   "good"  "thank" "work"  "now"   "know"  "want"  "peopl"

The most common trigrams

 [1] "get"   "just"  "said"  "like"  "one"   "go"    "time"  "can"   "day"   "love"  "year"  "make" 
[13] "new"   "good"  "thank" "work"  "now"   "know"  "want"  "peopl"
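
For reference, the sketch below shows one way to build and rank n-grams in base R, joining the words with an underscore in the same way as the “good_morn” style features mentioned in this report. The function names and the toy documents are assumptions; the report itself relies on quanteda for this step.

```r
# Build n-grams from a vector of words, joined with "_".
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Count n-grams per document (never across documents) and return the top k.
top_ngrams <- function(docs, n, k = 3) {
  grams <- unlist(lapply(strsplit(docs, "\\s+"), make_ngrams, n = n))
  head(sort(table(grams), decreasing = TRUE), k)
}

# Toy, already-stemmed documents (hypothetical).
docs <- c("good morn everyon", "good morn sunshin", "happi new year", "good night")
top1 <- top_ngrams(docs, 1)
top2 <- top_ngrams(docs, 2)
print(top1)
print(top2)
```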

Some Observations and Issues in the Exploratory Analysis

The three corpora of US English text are about 200, 196, and 159 MB, respectively. The Twitter corpus has the shortest lines, none exceeding 140 “words” per line according to the summary table (this matches Twitter's 140-character limit, so that column most likely counts characters rather than words), while the blogs corpus has the longest line.
Bigrams and trigrams should be formed within a sentence, without crossing sentence boundaries.
Cleaning and other preprocessing may blur or destroy sentence boundaries. We may use special tokens to mark the beginning and end of each sentence before converting to lowercase.
Trigrams such as “follow_follow_back” and “love_love_love” should not be produced by the n-gram functions. They need to be avoided or filtered out.
Word stemming is necessary, but it produces truncated forms such as “peopl”, “citi”, “happi”, “good_morn”, “st_loui_counti” and “cinco_de_mayo”. Restoring stemmed words to readable forms might take a lot of work. Are there better ways?
Removing stopwords is necessary to limit memory use and improve speed, but stopwords may be needed to produce real-world phrases in the final next-word prediction.
Data size, memory, speed and accuracy are the main challenges, especially with very limited resources (such as an x86-64 machine running Windows 7 with 8 GB of RAM).
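
Two of the issues above, n-grams crossing sentence boundaries and repeated-word n-grams, can be sketched in base R as follows. The text is split into sentences first, each sentence is wrapped in start and end markers, and trigrams are formed inside each sentence only; a separate filter then drops n-grams containing the same word twice in a row. The markers "<s>" and "</s>", the sentence-splitting rule (a break after ".", "!" or "?") and all function names are assumptions, not part of the report's code.

```r
# Split text into sentences at whitespace that follows ., ! or ?.
split_sentences <- function(text) {
  parts <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  parts[nzchar(parts)]
}

# Form trigrams within each sentence, with boundary markers added
# after punctuation removal so the markers themselves survive.
sentence_trigrams <- function(text) {
  unlist(lapply(split_sentences(text), function(s) {
    words <- strsplit(tolower(gsub("[[:punct:]]+", "", s)), "\\s+")[[1]]
    words <- c("<s>", words[nzchar(words)], "</s>")
    if (length(words) < 3) return(character(0))
    idx <- seq_len(length(words) - 2)
    paste(words[idx], words[idx + 1], words[idx + 2], sep = "_")
  }))
}

# Drop n-grams in which any word is immediately repeated,
# e.g. "love_love_love" or "follow_follow_back".
drop_adjacent_repeats <- function(grams) {
  has_repeat <- vapply(strsplit(grams, "_", fixed = TRUE),
                       function(p) any(p[-1] == p[-length(p)]),
                       logical(1))
  grams[!has_repeat]
}

grams <- sentence_trigrams("I love R. Love love love it!")
filtered <- drop_adjacent_repeats(grams)
print(grams)
print(filtered)
```

Whether repeated-word n-grams should be filtered at all is a modelling choice, since they do occur in real tweets; the filter is shown only because the observation above asks for one.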

Plans for creating a prediction algorithm and Shiny app

Split the original data randomly into training, held-out and test sets in a 60%/20%/20% ratio.
Rewrite the cleaning and preprocessing functions. Tokenize into sentences first, before converting to lowercase and removing punctuation. Find better ways to handle the stemming and stopword issues.
Clean and preprocess the training, held-out and test sets in exactly the same way. Test data should not be touched during model building, but should have the same feature variables as the training data. In reality, however, the test data may contain words that are not in the training set. (Please correct me if my understanding is incorrect.)
Create unigrams, bigrams and trigrams from the training data. Remove singletons and sparse terms.
The aim is to build a next-word prediction model with interpolated modified Kneser-Ney smoothing. I will try to compile the KenLM package (written in C++) on Windows 7; it appears to be superior in memory demand, performance, accuracy and speed. However, KenLM is not available on CRAN. Any suggestions?
Apply the model to the held-out data set to evaluate and tune the model.
Apply the word prediction model to the test data set to predict the next word.
Create a Shiny app and publish it on the “shinyapps.io” server.
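
The first step of the plan, a random 60%/20%/20% split, might look like the sketch below. Lines are shuffled once and then cut into three disjoint parts; the function name, the seed and the synthetic input are assumptions for illustration.

```r
# Randomly split lines into training, held-out and test sets.
# Assumption: the seed and the function name are illustrative.
split_data <- function(lines,
                       ratios = c(train = 0.6, heldout = 0.2, test = 0.2),
                       seed = 42) {
  stopifnot(abs(sum(ratios) - 1) < 1e-8)
  set.seed(seed)
  n <- length(lines)
  shuffled <- sample(n)                       # random permutation of 1..n
  n_train <- round(ratios[["train"]] * n)
  n_held  <- round(ratios[["heldout"]] * n)
  list(
    train   = lines[shuffled[seq_len(n_train)]],
    heldout = lines[shuffled[n_train + seq_len(n_held)]],
    test    = lines[shuffled[-seq_len(n_train + n_held)]]  # everything left
  )
}

parts <- split_data(sprintf("line %d", 1:1000))
sapply(parts, length)
```

Splitting before any n-gram counting keeps the held-out and test lines out of the model's counts.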

Any corrections and suggestions would be deeply appreciated.

Appendix

The R Markdown source, index.Rmd, can be found in my GitHub repository.
Session Info

R version 4.0.1 (2020-06-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] quanteda_2.1.0 tm_0.7-7       NLP_0.2-0      ggthemes_4.2.0 ggplot2_3.3.2  tibble_3.0.1  
 [7] dplyr_1.0.0    stringr_1.4.0  readr_1.3.1    stringi_1.4.6  downloader_0.4 knitr_1.28    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6        pillar_1.4.4        compiler_4.0.1      stopwords_2.0      
 [5] tools_4.0.1         digest_0.6.25       lattice_0.20-41     evaluate_0.14      
 [9] lifecycle_0.2.0     gtable_0.3.0        pkgconfig_2.0.3     rlang_0.4.6        
[13] fastmatch_1.1-0     Matrix_1.2-18       cli_2.0.2           yaml_2.2.1         
[17] parallel_4.0.1      xfun_0.14           withr_2.2.0         xml2_1.3.2         
[21] fs_1.4.2            generics_0.0.2      vctrs_0.3.1         hms_0.5.3          
[25] grid_4.0.1          tidyselect_1.1.0    data.table_1.12.8   glue_1.4.1         
[29] R6_2.4.1            fansi_0.4.1         rmarkdown_2.3       farver_2.0.3       
[33] purrr_0.3.4         magrittr_1.5        SnowballC_0.7.0     ISOcodes_2020.03.16
[37] codetools_0.2-16    usethis_1.6.1       scales_1.1.1        ellipsis_0.3.1     
[41] htmltools_0.5.0     assertthat_0.2.1    colorspace_1.4-1    labeling_0.3       
[45] utf8_1.1.4          RcppParallel_5.0.2  munsell_0.5.0       slam_0.1-47        
[49] crayon_1.3.4