This report was prepared as the Final Report for the Johns Hopkins Data Science Capstone online class.

The goal is to use publicly available web text collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to take the opportunity to apply them in the area of natural language processing (NLP).

Executive Summary

A large amount of text-based information is in use in current social media, from sources such as e-mail, personal blogs, newspaper articles, Twitter, web pages, and scanned/handwritten notes.

Understanding the problem: the majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work efficiently and to treat the data as a valuable input to decision making in areas such as marketing, sales advertisement, business decisions, kid/youth education, and healthcare.

My primary focus is to use the freeform text of the English (United States) language data files for exploratory analysis, and then to build the best algorithm to predict the possible next word from the user's typing. Furthermore, if the prediction is fast enough, this could ease the typing problems most of us currently struggle with on phone/tablet devices.

  1. The textAnalyzer learned terms/words from all documents in the English data files.
  2. Each document was modeled by counting the number of times each word/term appears. If the collected set of words/terms is extremely large, I would limit the size of the result by defining a maximum number of Most Frequent Words, and also remove overly common and rarely used words (see the sketch below).
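
As a rough illustration of this counting step, here is a minimal sketch using the tm package; the file path and object names are assumptions, not the original code.

```r
# Build a term-by-document count matrix for the English files with tm,
# then trim rarely used terms (path and names are assumptions).
library(tm)
docs <- VCorpus(DirSource("final/en_US"))           # load the data files
docs <- tm_map(docs, content_transformer(tolower))  # normalize case
tdm  <- TermDocumentMatrix(docs)                    # per-document word counts
tdm  <- removeSparseTerms(tdm, sparse = 0.999)      # drop very rare terms
```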

Methods: Data -> Analyze -> Modeling -> Data Product

1. Data Collection: Access Data
This task was to download the data from the Coursera class's Capstone Dataset location. The data used for analysis originally came from a corpus called HC Corpora (www.corpora.heliohost.org).

2. Exploratory Analysis: Explore Data and Basic Statistics

This task was performed by examining the file content and producing tables and figures of the observed data. Data transformation was performed on the raw data on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables, as described in Natural Language Processing.

Exploratory analysis tasks involved counting lines, characters, words, and unique words per line, and summarizing word frequencies for each file type; a basic-statistics sketch follows.
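
A minimal sketch of those basic statistics, assuming the en_US files sit in a local final/en_US folder (the path is an assumption):

```r
# Per-line character and word counts with stringi.
library(stringi)
blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
lineCharCnt <- stri_length(blogs)        # characters per line
lineWordCnt <- stri_count_words(blogs)   # words per line
summary(lineWordCnt)                     # min / quartiles / mean / max
```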

3. Statistical Modeling
- Used an N-gram API to generate unigram through 5-gram (N5) results, including word count, probability, and new variables for the observations:

  1. Normalized the word count per tm corpus; expected values are in the range [0, 1].
    Algorithm: (wordcnt - min(wordcnt)) / (max(wordcnt) - min(wordcnt))

  2. Ranked each word with a Preferred score from 1 to 9 based on the normalized value. Criteria to determine the Word Preferred Rank: 1 if the normalized value falls in 1 to 0.9, 2 for 0.9 to 0.8, and so on down to 9 for 0.1 to 0 (see the sketch below).
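
A hedged sketch of the normalization and ranking step, assuming a data frame df1g with a wordcnt column (names are illustrative, not the report's code):

```r
# Normalize word counts to [0, 1], then bucket into Preferred Ranks 1..9.
df1g$normcnt <- (df1g$wordcnt - min(df1g$wordcnt)) /
                (max(df1g$wordcnt) - min(df1g$wordcnt))
# Rank 1 for values near 1 down to rank 9 near 0, in 0.1-wide bands.
df1g$prefrank <- pmin(9, pmax(1, ceiling((1 - df1g$normcnt) * 10)))
```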

4. Fitting Modeling - Used the rbinom API to split the data into a 60% train / 40% test set (sketch below). Tested against the test data for prediction results, and compared the highest accuracy (i.e., close to 0.99+), the 95% CI, and the lowest RMSE/error rate to pick the best prediction model; a fitting sketch follows the comparison table.
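
A minimal sketch of the rbinom-based 60/40 split, with the seed and data frame names as assumptions:

```r
# Split observations into a 60% training set and a 40% test set
# using a Bernoulli draw per row (df1g and the seed are assumptions).
set.seed(1234)
inTrain  <- rbinom(nrow(df1g), size = 1, prob = 0.6) == 1
training <- df1g[inTrain, ]
testing  <- df1g[!inTrain, ]
```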

Fitting Model Comparison

| Modeling | Average Accuracy Rate | Average Error Rate |
|----------|-----------------------|--------------------|
| rf       | 0.9942122             | 0.0057878          |
| knn      | 0.9887485             | 0.0112515          |
| lm       | 0.9874978             | 0.0125022          |
| gbm      | 0.9554841             | 0.0445159          |
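
As one way the four models above could have been fitted and scored, here is a hedged sketch using caret; the report does not name its training package, so the calls and formula below are assumptions, not the original code.

```r
# Fit a random forest on the engineered features and score the held-out set.
library(caret)
training$prefrank <- factor(training$prefrank)
testing$prefrank  <- factor(testing$prefrank, levels = levels(training$prefrank))
fit.rf <- train(prefrank ~ normcnt + wordcnt, data = training, method = "rf")
pred   <- predict(fit.rf, newdata = testing)
confusionMatrix(pred, testing$prefrank)  # accuracy, 95% CI, error rate
```

The other models in the comparison can be fitted the same way by changing the method argument (e.g., "knn" or "gbm").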

The prediction was determined by the Preferred Rank, which was based on the N-gram word counts and normalized probabilities.

5. Reproducibility

  1. The Rmd file is needed to regenerate the HTML report and publish it to the RPubs site.
  2. The N-gram dictionary needs to be recompiled and the application redeployed to the Shiny site.

6. Data Product: US English Text Analyzer

The Shiny application is a web-based data product that implements the prediction model to predict the next words you want to type.

The Shiny App prediction algorithm utilized the N-gram dictionary: unigrams were used to predict the possible current word, and bigrams through 5-grams were used to predict the possible next words (see the lookup sketch below).
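
A hedged sketch of the dictionary lookup idea, assuming a bigram data frame df2g whose word column holds space-separated bigrams and whose wordcnt column holds counts (the names are illustrative):

```r
# Return the top-n candidate words that follow `lastWord` in an n-gram table.
predictNext <- function(lastWord, ngrams, n = 3) {
  pattern <- paste0("^", lastWord, " ")
  hits <- ngrams[grepl(pattern, ngrams$word), ]   # n-grams starting with it
  hits <- hits[order(-hits$wordcnt), ]            # most frequent first
  sub(pattern, "", head(hits$word, n))            # strip the prefix
}
predictNext("thanks", df2g)   # candidate next words after "thanks"
```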


How to Use the Data Product

Users can access the App (https://jtmoogle.shinyapps.io/textAnalyzer) through any browser (e.g., IE, Firefox, Chrome) with internet access.

Input:

  1. The user types any words in a phrase/sentence. *No need to press the ENTER key*.
  The App considers "valid words" to be those with no special characters/symbols.
  2. Select the number of next-word predictions to display.
  

Output: The App predicts and shows the user

  1. The possible current word being typed
  2. The possible next word or words that might be typed next
  

Summary Reports:

  Users can view summaries of the N-gram content in various ways:
    1. A data table showing the summarized results
    2. Plots displaying summary results by range or for all words
    3. A word cloud with colors and fonts representing word frequency
    

Toolset

  • R to preprocess the data, generate N-gram results, build prediction models, and run cross-validation and tests
  • knitr to generate the Milestone Report in HTML format
  • RStudio Presenter to introduce the data product's features
  • Shiny to build the data product and deploy it to the Shiny server site, making it publicly accessible via the internet

Conclusions:

Observations from my analysis and limitations identified:

  • Experienced API packages whose quality was not well tuned for the latest R version (i.e., rJava and RWeka on Ubuntu).

  • Using a 90% sparsity rate did not preserve enough words to be measured, so various ratios had to be explored to find the maximum ratio the PC capacity could support; a sparsity sketch follows this list. PC limitations were consistently encountered when the dataset was too large.

  • Encountered internal defects in R APIs (i.e., tm, Matrix) while processing larger sample data sets. Task 2 (tm APIs: TermDocumentMatrix, Corpus, n-gram, etc.) ran for over 10 hours on a Windows 10 platform with 12 GB RAM and an 8-core PC, and R crashed when processing a 70% sample of the data due to overflowing the vector size.

  • Using the tm APIs to clean up documents/terms proved slower than using the stringi API with regular expressions to remove special characters, punctuation, numbers, etc.; a cleanup sketch follows this list.

  • The performance of the smoothing API seemed slow, and it could not handle larger word sets well.
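
A minimal sketch of that sparsity exploration, assuming tdm is a TermDocumentMatrix built earlier:

```r
# Try several sparsity cutoffs and report how many terms each one keeps.
library(tm)
for (s in c(0.90, 0.999, 0.9999)) {
  kept <- removeSparseTerms(tdm, sparse = s)
  cat(sprintf("sparse = %.4f keeps %d terms\n", s, nrow(kept)))
}
```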

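And a minimal sketch of the faster stringi-based cleanup, reusing the blogs vector from the earlier sketch (the exact regular expressions are assumptions):

```r
# Lowercase, strip non-letter characters, and collapse repeated whitespace.
library(stringi)
clean <- stri_trans_tolower(blogs)
clean <- stri_replace_all_regex(clean, "[^a-z' ]", " ")   # drop digits/symbols
clean <- stri_trim_both(stri_replace_all_regex(clean, "\\s+", " "))
```
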
Future

Furthermore, this work might be extended to text topic modeling, spell checking, and sentiment features, and could be applied in areas such as customer surveys, trouble ticketing, and voice response systems.


English (US) Summary Result

  • Blogs contain multiple sentences; the average was 40.94 words per line. The longest line has 39,240 characters, 6,327 words, and 1,685 unique words. In total, we saw 899,288 lines.
  • News contains small sentences; the average was 33.31 words per line. The longest line has 3,764 characters, 544 words, and 271 unique words. In total, we saw 1,010,242 lines.
  • Twitter contains smaller sentences; the average was 12.44 words per line. The longest line has 140 characters, 47 words, and 36 unique words. In total, we saw 2,360,148 lines.
Summary of All Data Files
| Filename | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation | Encoding1 | Encoding2 |
|---|---|---|---|---|---|---|
| de_DE.blogs.txt | 85459666 | 81.5 Mb | 371440 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.news.txt | 95591959 | 91.2 Mb | 244743 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.twitter.txt | 75578341 | 72.1 Mb | 947774 | German (Germany) | UTF-8 | ISO-8859-1 |
| en_US.blogs.txt | 210160014 | 200.4 Mb | 899288 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.news.txt | 205811889 | 196.3 Mb | 1010242 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.twitter.txt | 167105338 | 159.4 Mb | 2360148 | English (United States) | UTF-8 | ISO-8859-1 |
| fi_FI.blogs.txt | 108503595 | 103.5 Mb | 439785 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.news.txt | 94234350 | 89.9 Mb | 485758 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142 | 24.2 Mb | 285214 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| ru_RU.blogs.txt | 116855835 | 111.4 Mb | 337100 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.news.txt | 118996424 | 113.5 Mb | 196360 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346 | 100.3 Mb | 881414 | Russian (Russia) | UTF-8 | ISO-8859-5 |
Summary of English (United States) Data Files
| FieldSummaryBy | Filetype | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| LineCharCnt | en_US.blogs | 0 | 44 | 149 | 221.5 | 317 | 39240 |
| LineCharCnt | en_US.news | 0 | 104 | 177 | 193.1 | 258 | 3764 |
| LineCharCnt | en_US.twitter | 1 | 34 | 60 | 64.82 | 94 | 140 |
| rowId | en_US.blogs | 1 | 224800 | 449600 | 449600 | 674500 | 899300 |
| rowId | en_US.news | 1 | 19320 | 38630 | 38630 | 57940 | 77260 |
| rowId | en_US.twitter | 1 | 590000 | 1180000 | 1180000 | 1770000 | 2360000 |
| LineWordCnt | en_US.blogs | 0 | 8 | 28 | 40.94 | 59 | 6327 |
| LineWordCnt | en_US.news | 0 | 18 | 30 | 33.31 | 44 | 544 |
| LineWordCnt | en_US.twitter | 1 | 7 | 12 | 12.44 | 18 | 47 |
| LineAvgWordLen | en_US.blogs | 1 | 4 | 4.387 | 4.556 | 4.879 | 74 |
| LineAvgWordLen | en_US.news | 1 | 4.415 | 4.812 | 4.873 | 5.231 | 31 |
| LineAvgWordLen | en_US.twitter | 1 | 3.733 | 4.188 | 4.305 | 4.733 | 126 |
| LineUniqWordCnt | en_US.blogs | 0 | 8 | 24 | 31.2 | 46 | 1685 |
| LineUniqWordCnt | en_US.news | 0 | 17 | 27 | 28.06 | 37 | 271 |
| LineUniqWordCnt | en_US.twitter | 1 | 7 | 11 | 11.72 | 17 | 36 |
| LineHashtagCnt | en_US.blogs | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.news | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.twitter | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHttpCnt | en_US.blogs | 0 | 0 | 0 | 0.001702 | 0 | 8 |
| LineHttpCnt | en_US.news | 0 | 0 | 0 | 0.0009449 | 0 | 4 |
| LineHttpCnt | en_US.twitter | 0 | 0 | 0 | 0.000222 | 0 | 2 |
20 Most Frequent Words by File Type
| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|---|---|---|---|---|---|---|---|---|
| the | 1855771 | 0.0528795 | the | 151524 | 0.0608450 | the | 934172 | 0.0335948 |
| and | 1086110 | 0.0309483 | to | 69348 | 0.0278470 | to | 786629 | 0.0282888 |
| to | 1065698 | 0.0303667 | and | 68216 | 0.0273924 | you | 543700 | 0.0195526 |
| of | 875028 | 0.0249336 | of | 59089 | 0.0237274 | and | 433686 | 0.0155963 |
| in | 593633 | 0.0169154 | in | 51464 | 0.0206656 | for | 384535 | 0.0138287 |
| that | 459500 | 0.0130933 | for | 27112 | 0.0108869 | in | 377036 | 0.0135590 |
| is | 431834 | 0.0123050 | that | 26358 | 0.0105842 | of | 358981 | 0.0129097 |
| it | 400905 | 0.0114236 | is | 21961 | 0.0088185 | is | 357544 | 0.0128580 |
| for | 362867 | 0.0103398 | on | 20578 | 0.0082632 | it | 291398 | 0.0104793 |
| you | 296855 | 0.0084588 | with | 19754 | 0.0079323 | my | 290517 | 0.0104476 |
| with | 286177 | 0.0081545 | said | 19167 | 0.0076966 | on | 276264 | 0.0099350 |
| was | 278002 | 0.0079216 | was | 17625 | 0.0070774 | that | 232907 | 0.0083758 |
| on | 274047 | 0.0078089 | he | 17556 | 0.0070497 | me | 200067 | 0.0071948 |
| my | 270181 | 0.0076987 | it | 16693 | 0.0067031 | be | 187176 | 0.0067312 |
| this | 257977 | 0.0073510 | at | 16413 | 0.0065907 | at | 185524 | 0.0066718 |
| as | 223359 | 0.0063645 | as | 14662 | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541 | 0.0062272 | his | 12107 | 0.0048616 | your | 170771 | 0.0061413 |
| be | 208303 | 0.0059355 | but | 11658 | 0.0046813 | have | 168051 | 0.0060435 |
| but | 203446 | 0.0057971 | from | 11648 | 0.0046773 | so | 163273 | 0.0058716 |
| are | 193634 | 0.0055175 | be | 11579 | 0.0046496 | this | 162736 | 0.0058523 |

##### N-Gram Summary Result

```r
mfw.maxnbr <- 25   # maximum number of most-frequent words to display

# Unigram frequency barplot
barplot(head(df1g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df1g, mfw.maxnbr)$word, col = "lightgreen",
        main = paste0("English Most ", mfw.maxnbr, " Unigram Frequent Words"),
        ylab = "Word Count")

# Unigram word cloud (wordcloud() has no title argument, so title() is used)
library(wordcloud)
wordcloud(unique(df1g$word), df1g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("Unigram WordCloud")

# Bigram frequency barplot
barplot(head(df2g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df2g, mfw.maxnbr)$word, col = "lightblue",
        main = paste0("English Most ", mfw.maxnbr, " Bigram Frequent Words"),
        ylab = "Word Count")

# Bigram word cloud
wordcloud(unique(df2g$word), df2g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("Bigram WordCloud")

# Trigram (N3-gram) frequency barplot
barplot(head(df3g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df3g, mfw.maxnbr)$word, col = "lightblue",
        main = paste0("English Most ", mfw.maxnbr, " N3-Gram Frequent Words"),
        ylab = "Word Count")

# Trigram word cloud
wordcloud(unique(df3g$word), df3g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("N3-Gram WordCloud")
```

Max Sparse Ratios by N-Gram Type and File Type
| filetype | ngram | sparseRatio | WordCnt | DocCnt | CntMin | CntMedian | CntMean | CntMax |
|---|---|---|---|---|---|---|---|---|
| twitter-train | 1 | 0.999 | 1083 | 235955 | 239 | 516 | 1038 | 14740 |
| blogs-train | 1 | 0.9995 | 4236 | 89811 | 45 | 147 | 380.8 | 13490 |
| news-train | 1 | 0.999 | 2773 | 100942 | 102 | 288 | 544.2 | 24830 |
| blogs-train | 2 | 0.9995 | 936 | 89811 | 45 | 64 | 88.55 | 740 |
| twitter-train | 2 | 0.9995 | 282 | 235955 | 118 | 176 | 245.8 | 1739 |
| news-train | 2 | 0.9995 | 736 | 100942 | 51 | 77 | 105.8 | 1532 |
| twitter-train | 3 | 0.9999 | 119 | 235955 | 24 | 32 | 46.97 | 359 |
| news-train | 3 | 0.9999 | 532 | 100942 | 11 | 15 | 19.41 | 177 |
| blogs-train | 3 | 0.9999 | 382 | 89811 | 9 | 12 | 14.49 | 74 |
| twitter-train | 4 | 0.99999 | 1678 | 235955 | 3 | 3 | 4.232 | 46 |
| blogs-train | 4 | 0.9999 | 50 | 89811 | 9 | 17 | 16.02 | 21 |
| news-train | 4 | 0.9999 | 71 | 100942 | 11 | 15 | 17.17 | 40 |
| blogs-train | 5 | 0.9999 | 42 | 89811 | 10 | 17 | 16.36 | 17 |
| twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
| twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
Highest lm Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | lm.accuracy | lm.outofsample.accuracy | lm.outofsample.error |
|---|---|---|---|---|---|
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 0.9965870 | 0.9965870 | 0.0034130 |
| blogs-train | 3 | 0.9999 | 0.9913793 | 0.9913793 | 0.0086207 |
| blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| news-train | 3 | 0.9999 | 0.9757576 | 0.9757576 | 0.0242424 |
| twitter-train | 2 | 0.9995 | 0.9727273 | 0.9727273 | 0.0272727 |
| blogs-train | 2 | 0.9995 | 0.9672131 | 0.9672131 | 0.0327869 |
| news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
Highest gbm Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | gbm.accuracy | gbm.outofsample.accuracy | gbm.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| blogs-train | 2 | 0.9995 | 0.9781421 | 0.9781421 | 0.0218579 |
| twitter-train | 4 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| news-train | 3 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| news-train | 4 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| news-train | 1 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| news-train | 2 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 3 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 4 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 5 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| twitter-train | 2 | 0.9995 | 0.9181818 | 0.9181818 | 0.0818182 |
| twitter-train | 3 | 0.9999 | 0.8139535 | 0.8139535 | 0.1860465 |
Highest rf Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | rf.accuracy | rf.outofsample.accuracy | rf.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 2 | 0.9995 | 0.9890710 | 0.9890710 | 0.0109290 |
| twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| blogs-train | 4 | 0.9999 | 0.9473684 | 0.9473684 | 0.0526316 |
Highest knn Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | knn.accuracy | knn.outofsample.accuracy | knn.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 0.9931741 | 0.9931741 | 0.0068259 |
| blogs-train | 2 | 0.9995 | 0.9918033 | 0.9918033 | 0.0081967 |
| twitter-train | 2 | 0.9995 | 0.9818182 | 0.9818182 | 0.0181818 |
| news-train | 3 | 0.9999 | 0.9818182 | 0.9818182 | 0.0181818 |
| news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
| twitter-train | 3 | 0.9999 | 0.9302326 | 0.9302326 | 0.0697674 |