This report was prepared as the final report for the Johns Hopkins Data Science Capstone online class.
The goal is to use publicly available web text collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to apply them in the area of natural language processing (NLP).
A large amount of text-based information is in use in current social media and comes from sources such as e-mail, personal blogs, newspaper articles, Twitter, web pages, and scanned or handwritten notes.
Understanding the problem
The majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work with it efficiently and to treat the data as a valuable input to decision management in areas such as marketing, sales advertising, business decisions, child and youth education, and healthcare.
My primary focus is to use the free-form text of the English (United States) data files for my exploratory analysis, and then to build the best algorithm I can to predict the next word a user is likely to type. Furthermore, if prediction is fast enough, it could ease the typing difficulties most of us currently struggle with on phones and tablets.
1. Data Collection: Access data
This task was to download the data source from the coursera.com class location (Capstone Dataset). The data used for analysis originally came from a corpus called HC Corpora (www.corpora.heliohost.org).
2. Exploratory Analysis: Explore Data and Basic Statistics
This task was performed by examining the file contents, tables, and figures of the observed data. The raw data was transformed on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables described in Natural Language Processing.
Exploratory analysis tasks involved:
- Created a sample data set, since the source files are extremely large (over 580 MB). Ideally, 70% would be used for training and 30% for testing. Note: in practice, my PC could only handle between 10% and 40% of the data, and 10% of the data was used to complete every step through to the end.
- Preprocessed and cleaned up the data, then summarized it and cached the results in local RData files. The resulting data sets measure word frequency (occurrences) and word association probability.
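The sampling, cleanup, and word-frequency steps can be sketched roughly as follows. This is a minimal base R illustration, not the scripts used for this report: the function names (`sample_lines`, `clean_line`, `word_freq`), the seed, and the exact regular expressions are assumptions, and base `gsub` stands in for the tm and stringi calls.

```r
# Keep each line with probability `rate` (hypothetical helper; seed is arbitrary).
sample_lines <- function(lines, rate, seed = 1234) {
  set.seed(seed)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

# Lower-case a line and strip URLs, digits, punctuation, and symbols.
# Assumption: apostrophes are kept so that words like "it's" survive.
clean_line <- function(x) {
  x <- tolower(x)
  x <- gsub("http\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[^a-z' ]", " ", x)               # drop everything but letters, ', space
  x <- gsub("\\s+", " ", x, perl = TRUE)      # collapse repeated whitespace
  trimws(x)
}

# Count words and convert the counts to relative frequencies.
word_freq <- function(lines) {
  words <- unlist(strsplit(clean_line(lines), " ", fixed = TRUE))
  words <- words[nzchar(words)]
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts),
             wordcnt = as.integer(counts),
             wordprob = as.numeric(counts) / sum(counts),
             stringsAsFactors = FALSE)
}

# Example: sample roughly 10% of some toy lines, then summarize them.
toy <- rep(c("The cat sat.", "The dog ran!", "See http://example.com now"), 50)
freq <- word_freq(sample_lines(toy, rate = 0.10))
head(freq)
```

Keeping the sampling rate as a parameter makes it easy to retry with 10% up to 40% of the data, depending on what the machine can handle.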
3. Statistical Modeling
- Used an N-Gram API to generate unigram through 5-gram results, including word count, probability, and new variables for the observations.
- Normalized the word count per TM corpus. Expected values are in the range [0, 1].
  Algorithm: (wordcnt - min(wordcnt)) / (max(wordcnt) - min(wordcnt))
- Ranked each word with a Word Preferred score from 1 to 9 based on the normalized value. Criteria for the Word Preferred Rank: 1 if the normalized value is in 1-0.9, 2 if it is in 0.9-0.8, and so on until 9 for 0.1-0.
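As a rough illustration of the n-gram counting, min-max normalization, and ranking steps, here is a minimal base R sketch. The helper names (`make_ngrams`, `normalize_count`, `preferred_rank`) are hypothetical, and two details are my assumptions rather than something stated above: a value on a bin boundary falls into the lower-numbered rank's neighbor as computed by `floor`, and every normalized value below 0.2 is assigned rank 9 so that the ranks stay within 1 to 9.

```r
# Build all n-grams of size n from a vector of tokens
# (a base R stand-in for an n-gram tokenizer API).
make_ngrams <- function(tokens, n) {
  stopifnot(length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Min-max normalization: (wordcnt - min) / (max - min), giving values in [0, 1].
# Assumption: if all counts are equal, every value is set to 1.
normalize_count <- function(wordcnt) {
  rng <- range(wordcnt)
  if (rng[1] == rng[2]) return(rep(1, length(wordcnt)))
  (wordcnt - rng[1]) / (rng[2] - rng[1])
}

# Map a normalized value to a rank from 1 (most preferred) to 9.
# Assumption: values below 0.2 all collapse into rank 9.
preferred_rank <- function(norm) {
  pmin(9, floor((1 - norm) * 10) + 1)
}

# Example: count bigrams in a toy sentence, then normalize and rank them.
tokens <- c("one", "of", "the", "best", "of", "the", "best", "of", "all")
counts <- table(make_ngrams(tokens, 2))
bigrams <- data.frame(word = names(counts),
                      wordcnt = as.integer(counts),
                      stringsAsFactors = FALSE)
bigrams$wordprob <- bigrams$wordcnt / sum(bigrams$wordcnt)
bigrams$norm <- normalize_count(bigrams$wordcnt)
bigrams$rank <- preferred_rank(bigrams$norm)
bigrams[order(-bigrams$wordcnt), ]
```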
4. Fitting Modeling
- Used the rbinom API to split the data into a 60% training and 40% test set. Tested each model against the test data for its prediction results, and compared the highest accuracy (i.e. close to 0.99+), the 95% CI, the lowest RMSE, and the error rate to select the best prediction model.
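A minimal sketch of an rbinom-based 60/40 split and the accuracy, error rate, and RMSE calculations is shown below. It is illustrative only: the synthetic data, the seed, the choice of the preferred rank as the response, and the use of plain `lm` are all assumptions made for this example, not the models or data behind the comparison table.

```r
set.seed(2015)  # arbitrary seed, for a repeatable split

# Synthetic stand-in for an n-gram table (assumption: rank is the response).
n <- 500
wordcnt <- sample(1:1000, n, replace = TRUE)
norm <- (wordcnt - min(wordcnt)) / (max(wordcnt) - min(wordcnt))
ngram_df <- data.frame(wordcnt = wordcnt,
                       norm = norm,
                       rank = pmin(9, floor((1 - norm) * 10) + 1))

# Each row goes to the training set with probability 0.6.
in_train <- rbinom(n, size = 1, prob = 0.6) == 1
train <- ngram_df[in_train, ]
test <- ngram_df[!in_train, ]

# Fit on the training rows, predict on the held-out rows.
fit <- lm(rank ~ norm, data = train)
pred <- predict(fit, newdata = test)
pred_rank <- pmin(9, pmax(1, round(pred)))  # snap predictions to valid ranks

accuracy <- mean(pred_rank == test$rank)
error_rate <- 1 - accuracy
rmse <- sqrt(mean((pred - test$rank)^2))
c(accuracy = accuracy, error_rate = error_rate, rmse = rmse)
```

Because `rbinom` draws each row independently, the split is only approximately 60/40; the exact proportions vary with the seed.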
Fitting Model Comparison
| Model | Average Accuracy Rate | Average Error Rate |
|---|---|---|
| rf | 0.9942122 | 0.0057878 |
| knn | 0.9887485 | 0.0112515 |
| lm | 0.9874978 | 0.0125022 |
| gbm | 0.9554841 | 0.0445159 |
The prediction was determined by the Preferred Rank, which was based on the N-Gram results of word counts and normalized probabilities.
5. Reproducibility
All analyses performed for the app are reproducible, and the files are located in the GitHub project repository:
Text Analyzer App Source Repository
When the data source or the sample data probability changes, the preprocessing R scripts must be re-executed so that the cached preprocessed data sets reflect the latest version.
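One way to manage such a cache is sketched below. This is a minimal example under stated assumptions: `load_or_build` and its `refresh` flag are hypothetical names, and the cached object is stored under the fixed name `cached` inside the RData file; the actual scripts in the repository may be organized differently.

```r
# Return the cached result if the RData file exists; otherwise (or when
# refresh = TRUE) rebuild it with `build()` and overwrite the cache.
load_or_build <- function(path, build, refresh = FALSE) {
  if (!refresh && file.exists(path)) {
    env <- new.env()
    load(path, envir = env)  # restores the object saved as `cached`
    return(env$cached)
  }
  cached <- build()
  save(cached, file = path)
  cached
}

# Example: the first call builds and saves, the second call reads the file.
cache_file <- tempfile(fileext = ".RData")
build_counts <- function() table(c("the", "cat", "the"))
first <- load_or_build(cache_file, build_counts)
second <- load_or_build(cache_file, build_counts)
# After the data source or sampling probability changes, force a rebuild:
latest <- load_or_build(cache_file, build_counts, refresh = TRUE)
```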
6. Data Product: US English Text Analyzer
The Shiny application is the web-based data product; it implements the prediction model to predict which words you want to type next.
The Shiny app prediction algorithm uses the N-Gram dictionary:
- Used unigrams to predict the possible current word.
- Used bigrams through 5-grams to predict the possible next words.
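The dictionary lookup can be sketched as follows. This is a simplified illustration, not the app's source: the toy tables, the function names (`complete_word`, `predict_next`), and in particular the strategy of trying the longest n-gram context first and backing off to shorter ones are assumptions made for the example.

```r
# Hypothetical toy dictionaries; the app would load cached n-gram tables instead.
unigrams <- data.frame(word = c("the", "that", "this", "to"),
                       wordcnt = c(100, 40, 30, 80),
                       stringsAsFactors = FALSE)
ngrams <- list(
  `2` = data.frame(word = c("of the", "of a", "in the"),
                   wordcnt = c(50, 20, 40), stringsAsFactors = FALSE),
  `3` = data.frame(word = c("one of the", "one of a"),
                   wordcnt = c(9, 2), stringsAsFactors = FALSE)
)

# Current word: unigrams that start with what has been typed, most frequent first.
complete_word <- function(prefix, n = 3) {
  hits <- unigrams[startsWith(unigrams$word, prefix), ]
  head(hits$word[order(-hits$wordcnt)], n)
}

# Next word: match the last (k - 1) typed words against the k-gram table,
# starting with the largest k and backing off to smaller k when nothing matches.
predict_next <- function(phrase, n = 3) {
  tokens <- strsplit(tolower(trimws(phrase)), "\\s+", perl = TRUE)[[1]]
  for (k in rev(sort(as.integer(names(ngrams))))) {
    if (length(tokens) < k - 1) next
    context <- paste(tail(tokens, k - 1), collapse = " ")
    tbl <- ngrams[[as.character(k)]]
    hits <- tbl[startsWith(tbl$word, paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$wordcnt), ]
      return(head(sub(".* ", "", hits$word), n))  # keep only the final word
    }
  }
  character(0)
}

complete_word("th")      # candidates for the word being typed
predict_next("one of")   # matched in the 3-gram table
predict_next("made of")  # no 3-gram match, so it backs off to bigrams
```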
Users can access the app (https://jtmoogle.shinyapps.io/textAnalyzer) through any browser (e.g. IE, Firefox, Chrome) with internet access.
Input:
1. The user types any words in the phrase/sentence. *No need to press the ENTER key*.
The app considers only "Valid Words", which have NO special characters/symbols.
2. Select the number of next-word predictions to display.
Output: The app predicts and shows to users:
1. The possible current word you are typing
2. The possible next one or few words you might type next
Summary Reports:
Users can view a summary of the N-Gram content in various ways:
1. Data table showing the summarized results
2. Plots displaying the summary results by range or for all
3. Word cloud using color and font size to represent word frequency
Toolset
Observation from my analysis
Limitations Identified
- Found that some API packages were not well tuned for the latest R version (e.g. rJava and RWeka on Ubuntu).
- Using a 90% sparse rate did not retain enough words to measure, so I had to try various ratios to find the maximum ratio my PC capacity could support. I consistently hit PC limitations when the data set was too large.
- Encountered internal defects in R APIs (e.g. TM, matrix) while processing larger sample data sets. Task 2 (tm APIs: TermDocumentMatrix, Corpus, n-gram, etc.) ran for over 10 hours on a Windows 10 PC with 12 GB RAM and 8 cores, and R crashed when processing the 70% sample data due to a vector size overflow.
- Cleaning up documents/terms with the TM APIs was slower than using the stringi API with regular expressions to remove special characters, punctuation, numbers, etc.
- The smoothing API seemed slow and did not cope well with larger word sets.
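To show why the choice of sparse ratio matters, here is a minimal base R sketch of sparsity-based term filtering. It mirrors my understanding of how tm's `removeSparseTerms` behaves (a term is kept only if the share of documents that do not contain it is below the threshold); the helper names and the toy documents are assumptions for illustration.

```r
# Toy tokenized documents (hypothetical).
docs <- list(c("the", "cat"),
             c("the", "dog"),
             c("the", "cat", "sat"),
             c("a", "bird"))

# Sparsity of a term = fraction of documents in which it does NOT appear.
term_sparsity <- function(docs) {
  terms <- sort(unique(unlist(docs)))
  present <- vapply(terms,
                    function(t) sum(vapply(docs, function(d) t %in% d, logical(1))),
                    numeric(1))
  1 - present / length(docs)
}

# Keep only terms whose sparsity is below max_sparse.
keep_terms <- function(docs, max_sparse) {
  s <- term_sparsity(docs)
  names(s)[s < max_sparse]
}

keep_terms(docs, 0.6)  # a strict threshold keeps only widely used terms
keep_terms(docs, 0.9)  # a looser threshold keeps rarer terms as well
```

In a large corpus most words appear in only a tiny fraction of documents, which is why thresholds very close to 1 are needed to retain a useful vocabulary, at the cost of more memory.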
Future
In the future, this work might be extended to text topic modeling, spell checking, and sentiment features. It could be applied in areas such as customer surveys, trouble ticketing, and voice response systems.
| Filename | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation | Encoding1 | Encoding2 |
|---|---|---|---|---|---|---|
| de_DE.blogs.txt | 85459666 | 81.5 Mb | 371440 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.news.txt | 95591959 | 91.2 Mb | 244743 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.twitter.txt | 75578341 | 72.1 Mb | 947774 | German (Germany) | UTF-8 | ISO-8859-1 |
| en_US.blogs.txt | 210160014 | 200.4 Mb | 899288 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.news.txt | 205811889 | 196.3 Mb | 1010242 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.twitter.txt | 167105338 | 159.4 Mb | 2360148 | English (United States) | UTF-8 | ISO-8859-1 |
| fi_FI.blogs.txt | 108503595 | 103.5 Mb | 439785 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.news.txt | 94234350 | 89.9 Mb | 485758 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142 | 24.2 Mb | 285214 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| ru_RU.blogs.txt | 116855835 | 111.4 Mb | 337100 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.news.txt | 118996424 | 113.5 Mb | 196360 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346 | 100.3 Mb | 881414 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| FieldSummaryBy | Filetype | Min. | X1st.Qu. | Median | Mean | X3rd.Qu. | Max. |
|---|---|---|---|---|---|---|---|
| LineCharCnt | en_US.blogs | 0 | 44 | 149 | 221.5 | 317 | 39240 |
| LineCharCnt | en_US.news | 0 | 104 | 177 | 193.1 | 258 | 3764 |
| LineCharCnt | en_US.twitter | 1 | 34 | 60 | 64.82 | 94 | 140 |
| rowId | en_US.blogs | 1 | 224800 | 449600 | 449600 | 674500 | 899300 |
| rowId | en_US.news | 1 | 19320 | 38630 | 38630 | 57940 | 77260 |
| rowId | en_US.twitter | 1 | 590000 | 1180000 | 1180000 | 1770000 | 2360000 |
| LineWordCnt | en_US.blogs | 0 | 8 | 28 | 40.94 | 59 | 6327 |
| LineWordCnt | en_US.news | 0 | 18 | 30 | 33.31 | 44 | 544 |
| LineWordCnt | en_US.twitter | 1 | 7 | 12 | 12.44 | 18 | 47 |
| LineAvgWordLen | en_US.blogs | 1 | 4 | 4.387 | 4.556 | 4.879 | 74 |
| LineAvgWordLen | en_US.news | 1 | 4.415 | 4.812 | 4.873 | 5.231 | 31 |
| LineAvgWordLen | en_US.twitter | 1 | 3.733 | 4.188 | 4.305 | 4.733 | 126 |
| LineUniqWordCnt | en_US.blogs | 0 | 8 | 24 | 31.2 | 46 | 1685 |
| LineUniqWordCnt | en_US.news | 0 | 17 | 27 | 28.06 | 37 | 271 |
| LineUniqWordCnt | en_US.twitter | 1 | 7 | 11 | 11.72 | 17 | 36 |
| LineHashtagCnt | en_US.blogs | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.news | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.twitter | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHttpCnt | en_US.blogs | 0 | 0 | 0 | 0.001702 | 0 | 8 |
| LineHttpCnt | en_US.news | 0 | 0 | 0 | 0.0009449 | 0 | 4 |
| LineHttpCnt | en_US.twitter | 0 | 0 | 0 | 0.000222 | 0 | 2 |
| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|---|---|---|---|---|---|---|---|---|
| the | 1855771 | 0.0528795 | the | 151524 | 0.0608450 | the | 934172 | 0.0335948 |
| and | 1086110 | 0.0309483 | to | 69348 | 0.0278470 | to | 786629 | 0.0282888 |
| to | 1065698 | 0.0303667 | and | 68216 | 0.0273924 | you | 543700 | 0.0195526 |
| of | 875028 | 0.0249336 | of | 59089 | 0.0237274 | and | 433686 | 0.0155963 |
| in | 593633 | 0.0169154 | in | 51464 | 0.0206656 | for | 384535 | 0.0138287 |
| that | 459500 | 0.0130933 | for | 27112 | 0.0108869 | in | 377036 | 0.0135590 |
| is | 431834 | 0.0123050 | that | 26358 | 0.0105842 | of | 358981 | 0.0129097 |
| it | 400905 | 0.0114236 | is | 21961 | 0.0088185 | is | 357544 | 0.0128580 |
| for | 362867 | 0.0103398 | on | 20578 | 0.0082632 | it | 291398 | 0.0104793 |
| you | 296855 | 0.0084588 | with | 19754 | 0.0079323 | my | 290517 | 0.0104476 |
| with | 286177 | 0.0081545 | said | 19167 | 0.0076966 | on | 276264 | 0.0099350 |
| was | 278002 | 0.0079216 | was | 17625 | 0.0070774 | that | 232907 | 0.0083758 |
| on | 274047 | 0.0078089 | he | 17556 | 0.0070497 | me | 200067 | 0.0071948 |
| my | 270181 | 0.0076987 | it | 16693 | 0.0067031 | be | 187176 | 0.0067312 |
| this | 257977 | 0.0073510 | at | 16413 | 0.0065907 | at | 185524 | 0.0066718 |
| as | 223359 | 0.0063645 | as | 14662 | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541 | 0.0062272 | his | 12107 | 0.0048616 | your | 170771 | 0.0061413 |
| be | 208303 | 0.0059355 | but | 11658 | 0.0046813 | have | 168051 | 0.0060435 |
| but | 203446 | 0.0057971 | from | 11648 | 0.0046773 | so | 163273 | 0.0058716 |
| are | 193634 | 0.0055175 | be | 11579 | 0.0046496 | this | 162736 | 0.0058523 |
##### N-Gram Summary Result
    library(wordcloud)     # wordcloud()
    library(RColorBrewer)  # brewer.pal()
    # df1g, df2g, df3g: unigram, bigram, and 3-gram tables (columns word, wordcnt),
    # assumed to hold one row per n-gram, sorted by decreasing count.
    mfw.maxnbr <- 25  # number of most frequent n-grams to show
    # Unigrams: bar plot of the top counts, then a word cloud.
    barplot(head(df1g, mfw.maxnbr)$wordcnt, las = 2, names.arg = head(df1g, mfw.maxnbr)$word,
            col = "lightgreen", main = paste0("English Most ", mfw.maxnbr, " Unigram Frequent Words"),
            ylab = "Word Count")
    wordcloud(df1g$word, df1g$wordcnt, scale = c(4, 0.5),
              min.freq = 1, max.words = mfw.maxnbr,
              colors = brewer.pal(8, "Dark2"))
    title(main = "Unigram WordCloud")  # wordcloud() has no title argument
    # Bigrams.
    barplot(head(df2g, mfw.maxnbr)$wordcnt, las = 2, names.arg = head(df2g, mfw.maxnbr)$word,
            col = "lightblue", main = paste0("English Most ", mfw.maxnbr, " Bigram Frequent Words"),
            ylab = "Word Count")
    wordcloud(df2g$word, df2g$wordcnt, scale = c(4, 0.5),
              min.freq = 1, max.words = mfw.maxnbr,
              colors = brewer.pal(8, "Dark2"))
    title(main = "Bigram WordCloud")
    # 3-grams.
    barplot(head(df3g, mfw.maxnbr)$wordcnt, las = 2, names.arg = head(df3g, mfw.maxnbr)$word,
            col = "lightblue", main = paste0("English Most ", mfw.maxnbr, " N3-Gram Frequent Words"),
            ylab = "Word Count")
    wordcloud(df3g$word, df3g$wordcnt, scale = c(4, 0.5),
              min.freq = 1, max.words = mfw.maxnbr,
              colors = brewer.pal(8, "Dark2"))
    title(main = "N3-Gram WordCloud")
| | filetype | ngram | sparseRatio | WordCnt | DocCnt | CntMin | CntMedian | CntMean | CntMax |
|---|---|---|---|---|---|---|---|---|---|
| 5 | twitter-train | 1 | 0.999 | 1083 | 235955 | 239 | 516 | 1038 | 14740 |
| 6 | blogs-train | 1 | 0.9995 | 4236 | 89811 | 45 | 147 | 380.8 | 13490 |
| 51 | news-train | 1 | 0.999 | 2773 | 100942 | 102 | 288 | 544.2 | 24830 |
| 61 | blogs-train | 2 | 0.9995 | 936 | 89811 | 45 | 64 | 88.55 | 740 |
| 62 | twitter-train | 2 | 0.9995 | 282 | 235955 | 118 | 176 | 245.8 | 1739 |
| 63 | news-train | 2 | 0.9995 | 736 | 100942 | 51 | 77 | 105.8 | 1532 |
| 7 | twitter-train | 3 | 0.9999 | 119 | 235955 | 24 | 32 | 46.97 | 359 |
| 71 | news-train | 3 | 0.9999 | 532 | 100942 | 11 | 15 | 19.41 | 177 |
| 72 | blogs-train | 3 | 0.9999 | 382 | 89811 | 9 | 12 | 14.49 | 74 |
| 8 | twitter-train | 4 | 0.99999 | 1678 | 235955 | 3 | 3 | 4.232 | 46 |
| 73 | blogs-train | 4 | 0.9999 | 50 | 89811 | 9 | 17 | 16.02 | 21 |
| 74 | news-train | 4 | 0.9999 | 71 | 100942 | 11 | 15 | 17.17 | 40 |
| 75 | blogs-train | 5 | 0.9999 | 42 | 89811 | 10 | 17 | 16.36 | 17 |
| 81 | twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
| 82 | twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
| | filetype | ngram | maxsparse | lm.accuracy | lm.outofsample.accuracy | lm.outofSample.error |
|---|---|---|---|---|---|---|
| 3 | news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 6 | news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 11 | blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 13 | blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 14 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 15 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 10 | twitter-train | 4 | 0.99999 | 0.9965870 | 0.9965870 | 0.0034130 |
| 9 | blogs-train | 3 | 0.9999 | 0.9913793 | 0.9913793 | 0.0086207 |
| 1 | blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| 2 | blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| 7 | twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| 8 | news-train | 3 | 0.9999 | 0.9757576 | 0.9757576 | 0.0242424 |
| 5 | twitter-train | 2 | 0.9995 | 0.9727273 | 0.9727273 | 0.0272727 |
| 4 | blogs-train | 2 | 0.9995 | 0.9672131 | 0.9672131 | 0.0327869 |
| 12 | news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
| | filetype | ngram | maxsparse | gbm.accuracy | gbm.outofsample.accuracy | gbm.outofSample.error |
|---|---|---|---|---|---|---|
| 1 | blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| 2 | blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| 4 | blogs-train | 2 | 0.9995 | 0.9781421 | 0.9781421 | 0.0218579 |
| 10 | twitter-train | 4 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| 14 | twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| 15 | twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| 8 | news-train | 3 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| 12 | news-train | 4 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| 3 | news-train | 1 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| 6 | news-train | 2 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| 9 | blogs-train | 3 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| 11 | blogs-train | 4 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| 13 | blogs-train | 5 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| 5 | twitter-train | 2 | 0.9995 | 0.9181818 | 0.9181818 | 0.0818182 |
| 7 | twitter-train | 3 | 0.9999 | 0.8139535 | 0.8139535 | 0.1860465 |
| | filetype | ngram | maxsparse | rf.accuracy | rf.outofsample.accuracy | rf.outofSample.error |
|---|---|---|---|---|---|---|
| 1 | blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 2 | blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 3 | news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 5 | twitter-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 6 | news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 8 | news-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 9 | blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 10 | twitter-train | 4 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 12 | news-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 13 | blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 14 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 15 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 4 | blogs-train | 2 | 0.9995 | 0.9890710 | 0.9890710 | 0.0109290 |
| 7 | twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| 11 | blogs-train | 4 | 0.9999 | 0.9473684 | 0.9473684 | 0.0526316 |
| | filetype | ngram | maxsparse | knn.accuracy | knn.outofsample.accuracy | knn.outofSample.error |
|---|---|---|---|---|---|---|
| 1 | blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 2 | blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 3 | news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 6 | news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| 9 | blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 11 | blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 13 | blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 14 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 15 | twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| 10 | twitter-train | 4 | 0.99999 | 0.9931741 | 0.9931741 | 0.0068259 |
| 4 | blogs-train | 2 | 0.9995 | 0.9918033 | 0.9918033 | 0.0081967 |
| 5 | twitter-train | 2 | 0.9995 | 0.9818182 | 0.9818182 | 0.0181818 |
| 8 | news-train | 3 | 0.9999 | 0.9818182 | 0.9818182 | 0.0181818 |
| 12 | news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
| 7 | twitter-train | 3 | 0.9999 | 0.9302326 | 0.9302326 | 0.0697674 |