This report was prepared as the Final Report for the Johns Hopkins Data Science Capstone online class.

The goal is to use publicly available web text collected by a web crawler, HC Corpora (www.corpora.heliohost.org), to exercise the data science analysis skills, algorithms, and modeling methods I learned in the 2015 JH Data Science Track, and to take the opportunity to apply them in the area of natural language processing (NLP).

Executive Summary

A large amount of text-based information is in use in current social media, from sources such as e-mail, personal blogs, newspaper articles, Twitter, web pages, and scanned/handwritten notes.

Understanding the problem: the majority of this data is in an unstructured format, which is harder to search, query, retrieve, and analyze.
Natural language processing (NLP) techniques can add structure and semantic information to unstructured text content, allowing us to work efficiently and to treat the data as a valuable input to decision making in areas such as marketing, sales advertisement, business decisions, kid/youth education, and healthcare.

My primary focus is to use the freeform text of the English (United States) language data files for exploratory analysis, and then to build the best algorithm to predict the possible next word from the user's typing. Furthermore, if the prediction is fast enough, this could ease the typing problems most of us currently struggle with on phone/tablet devices.

  1. The textAnalyzer learned terms/words from all documents in the English data files.
  2. Each document was modeled by counting the number of times each word/term appears. If the collected set of words/terms is extremely large, I would limit the size of the result by defining a maximum number of Most Frequent Words, and also remove overly common and rarely used words (see the sketch below).
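
As a rough illustration of this counting step, here is a minimal sketch using the tm package; the file path and object names are assumptions, not the original code.

```r
# Build a term-by-document count matrix for the English files with tm,
# then trim rarely used terms (path and names are assumptions).
library(tm)
docs <- VCorpus(DirSource("final/en_US"))           # load the data files
docs <- tm_map(docs, content_transformer(tolower))  # normalize case
tdm  <- TermDocumentMatrix(docs)                    # per-document word counts
tdm  <- removeSparseTerms(tdm, sparse = 0.999)      # drop very rare terms
```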

Methods: Data -> Analyze -> Modeling -> Data Product

1. Data Collection: Access Data
This task was to download the data from the Coursera class's Capstone Dataset location. The data used for analysis originally came from a corpus called HC Corpora (www.corpora.heliohost.org).

2. Exploratory Analysis: Explore Data and Basic Statistics

This task was performed by examining the file content and producing tables and figures of the observed data. Data transformation was performed on the raw data on the basis of plots, results summarized using NLP APIs, and knowledge of the scale of the measured variables, as described in Natural Language Processing.

Exploratory analysis tasks involved counting lines, characters, words, and unique words per line, and summarizing word frequencies for each file type; a basic-statistics sketch follows.
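
A minimal sketch of those basic statistics, assuming the en_US files sit in a local final/en_US folder (the path is an assumption):

```r
# Per-line character and word counts with stringi.
library(stringi)
blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
lineCharCnt <- stri_length(blogs)        # characters per line
lineWordCnt <- stri_count_words(blogs)   # words per line
summary(lineWordCnt)                     # min / quartiles / mean / max
```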

3. Statistical Modeling
- Used an N-gram API to generate unigram through 5-gram (N5) results, including word count, probability, and new variables for the observations:

  1. Normalized the word count per tm corpus; expected values are in the range [0, 1].
    Algorithm: (wordcnt - min(wordcnt)) / (max(wordcnt) - min(wordcnt))

  2. Ranked each word with a Preferred score from 1 to 9 based on the normalized value. Criteria to determine the Word Preferred Rank: 1 if the normalized value falls in 1 to 0.9, 2 for 0.9 to 0.8, and so on down to 9 for 0.1 to 0 (see the sketch below).
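
A hedged sketch of the normalization and ranking step, assuming a data frame df1g with a wordcnt column (names are illustrative, not the report's code):

```r
# Normalize word counts to [0, 1], then bucket into Preferred Ranks 1..9.
df1g$normcnt <- (df1g$wordcnt - min(df1g$wordcnt)) /
                (max(df1g$wordcnt) - min(df1g$wordcnt))
# Rank 1 for values near 1 down to rank 9 near 0, in 0.1-wide bands.
df1g$prefrank <- pmin(9, pmax(1, ceiling((1 - df1g$normcnt) * 10)))
```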

4. Fitting Modeling - Used the rbinom API to split the data into a 60% train / 40% test set (sketch below). Tested against the test data for prediction results, and compared the highest accuracy (i.e., close to 0.99+), the 95% CI, and the lowest RMSE/error rate to pick the best prediction model; a fitting sketch follows the comparison table.
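
A minimal sketch of the rbinom-based 60/40 split, with the seed and data frame names as assumptions:

```r
# Split observations into a 60% training set and a 40% test set
# using a Bernoulli draw per row (df1g and the seed are assumptions).
set.seed(1234)
inTrain  <- rbinom(nrow(df1g), size = 1, prob = 0.6) == 1
training <- df1g[inTrain, ]
testing  <- df1g[!inTrain, ]
```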

Fitting Model Comparison

| Modeling | Average Accuracy Rate | Average Error Rate |
|----------|-----------------------|--------------------|
| rf       | 0.9942122             | 0.0057878          |
| knn      | 0.9887485             | 0.0112515          |
| lm       | 0.9874978             | 0.0125022          |
| gbm      | 0.9554841             | 0.0445159          |
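
As one way the four models above could have been fitted and scored, here is a hedged sketch using caret; the report does not name its training package, so the calls and formula below are assumptions, not the original code.

```r
# Fit a random forest on the engineered features and score the held-out set.
library(caret)
training$prefrank <- factor(training$prefrank)
testing$prefrank  <- factor(testing$prefrank, levels = levels(training$prefrank))
fit.rf <- train(prefrank ~ normcnt + wordcnt, data = training, method = "rf")
pred   <- predict(fit.rf, newdata = testing)
confusionMatrix(pred, testing$prefrank)  # accuracy, 95% CI, error rate
```

The other models in the comparison can be fitted the same way by changing the method argument (e.g., "knn" or "gbm").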

The prediction was determined by the Preferred Rank, which was based on the N-gram word counts and normalized probabilities.

5. Reproducibility

  1. The Rmd file is needed to regenerate the HTML report and publish it to the RPubs site.
  2. The N-gram dictionary needs to be recompiled and the application redeployed to the Shiny site.

6. Data Product: US English Text Analyzer

The Shiny application is a web-based data product that implements the prediction model to predict the next words you want to type.

The Shiny App prediction algorithm utilized the N-gram dictionary: unigrams were used to predict the possible current word, and bigrams through 5-grams were used to predict the possible next words (see the lookup sketch below).
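
A hedged sketch of the dictionary lookup idea, assuming a bigram data frame df2g whose word column holds space-separated bigrams and whose wordcnt column holds counts (the names are illustrative):

```r
# Return the top-n candidate words that follow `lastWord` in an n-gram table.
predictNext <- function(lastWord, ngrams, n = 3) {
  pattern <- paste0("^", lastWord, " ")
  hits <- ngrams[grepl(pattern, ngrams$word), ]   # n-grams starting with it
  hits <- hits[order(-hits$wordcnt), ]            # most frequent first
  sub(pattern, "", head(hits$word, n))            # strip the prefix
}
predictNext("thanks", df2g)   # candidate next words after "thanks"
```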


How to Use the Data Product

Users can access the App (https://jtmoogle.shinyapps.io/textAnalyzer) through any browser (e.g., IE, Firefox, Chrome) with internet access.

Input:

  1. The user types any words in a phrase/sentence. *No need to press the ENTER key*.
  The App considers "valid words" to be those with no special characters/symbols.
  2. Select the number of next-word predictions to display.
  

Output: The App predicts and shows the user

  1. The possible current word being typed
  2. The possible next word or words that might be typed next
  

Summary Reports:

  Users can view summaries of the N-gram content in various ways:
    1. A data table showing the summarized results
    2. Plots displaying summary results by range or for all words
    3. A word cloud with colors and fonts representing word frequency
    

Toolset

  • R to preprocess the data, generate N-gram results, build prediction models, and run cross-validation and tests
  • knitr to generate the Milestone Report in HTML format
  • RStudio Presenter to introduce the data product's features
  • Shiny to build the data product and deploy it to the Shiny server site, making it publicly accessible via the internet

Conclusions:

Observations from my analysis and limitations identified:

  • Experienced API packages whose quality was not well tuned for the latest R version (i.e., rJava and RWeka on Ubuntu).

  • Using a 90% sparsity rate did not preserve enough words to be measured, so various ratios had to be explored to find the maximum ratio the PC capacity could support; a sparsity sketch follows this list. PC limitations were consistently encountered when the dataset was too large.

  • Encountered internal defects in R APIs (i.e., tm, Matrix) while processing larger sample data sets. Task 2 (tm APIs: TermDocumentMatrix, Corpus, n-gram, etc.) ran for over 10 hours on a Windows 10 platform with 12 GB RAM and an 8-core PC, and R crashed when processing a 70% sample of the data due to overflowing the vector size.

  • Using the tm APIs to clean up documents/terms proved slower than using the stringi API with regular expressions to remove special characters, punctuation, numbers, etc.; a cleanup sketch follows this list.

  • The performance of the smoothing API seemed slow, and it could not handle larger word sets well.
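
A minimal sketch of that sparsity exploration, assuming tdm is a TermDocumentMatrix built earlier:

```r
# Try several sparsity cutoffs and report how many terms each one keeps.
library(tm)
for (s in c(0.90, 0.999, 0.9999)) {
  kept <- removeSparseTerms(tdm, sparse = s)
  cat(sprintf("sparse = %.4f keeps %d terms\n", s, nrow(kept)))
}
```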

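And a minimal sketch of the faster stringi-based cleanup, reusing the blogs vector from the earlier sketch (the exact regular expressions are assumptions):

```r
# Lowercase, strip non-letter characters, and collapse repeated whitespace.
library(stringi)
clean <- stri_trans_tolower(blogs)
clean <- stri_replace_all_regex(clean, "[^a-z' ]", " ")   # drop digits/symbols
clean <- stri_trim_both(stri_replace_all_regex(clean, "\\s+", " "))
```
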
Future

Furthermore, this work might be extended to text topic modeling, spell checking, and sentiment features, and could be applied in areas such as customer surveys, trouble ticketing, and voice response systems.


English (US) Summary Result

  • Blogs contain multiple sentences; the average was 40.94 words per line. The longest line has 39,240 characters, 6,327 words, and 1,685 unique words. In total, we saw 899,288 lines.
  • News contains small sentences; the average was 33.31 words per line. The longest line has 3,764 characters, 544 words, and 271 unique words. In total, we saw 1,010,242 lines.
  • Twitter contains smaller sentences; the average was 12.44 words per line. The longest line has 140 characters, 47 words, and 36 unique words. In total, we saw 2,360,148 lines.
Summary of All Data Files
| Filename | FileSizeInByte | FileSizeInMByte | FileLineCnt | LanguageLocation | Encoding1 | Encoding2 |
|---|---|---|---|---|---|---|
| de_DE.blogs.txt | 85459666 | 81.5 Mb | 371440 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.news.txt | 95591959 | 91.2 Mb | 244743 | German (Germany) | UTF-8 | ISO-8859-1 |
| de_DE.twitter.txt | 75578341 | 72.1 Mb | 947774 | German (Germany) | UTF-8 | ISO-8859-1 |
| en_US.blogs.txt | 210160014 | 200.4 Mb | 899288 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.news.txt | 205811889 | 196.3 Mb | 1010242 | English (United States) | UTF-8 | ISO-8859-1 |
| en_US.twitter.txt | 167105338 | 159.4 Mb | 2360148 | English (United States) | UTF-8 | ISO-8859-1 |
| fi_FI.blogs.txt | 108503595 | 103.5 Mb | 439785 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.news.txt | 94234350 | 89.9 Mb | 485758 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| fi_FI.twitter.txt | 25331142 | 24.2 Mb | 285214 | Finnish (Finland) | UTF-8 | ISO-8859-1 |
| ru_RU.blogs.txt | 116855835 | 111.4 Mb | 337100 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.news.txt | 118996424 | 113.5 Mb | 196360 | Russian (Russia) | UTF-8 | ISO-8859-5 |
| ru_RU.twitter.txt | 105182346 | 100.3 Mb | 881414 | Russian (Russia) | UTF-8 | ISO-8859-5 |
Summary of English (United States) Data Files
| FieldSummaryBy | Filetype | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| LineCharCnt | en_US.blogs | 0 | 44 | 149 | 221.5 | 317 | 39240 |
| LineCharCnt | en_US.news | 0 | 104 | 177 | 193.1 | 258 | 3764 |
| LineCharCnt | en_US.twitter | 1 | 34 | 60 | 64.82 | 94 | 140 |
| rowId | en_US.blogs | 1 | 224800 | 449600 | 449600 | 674500 | 899300 |
| rowId | en_US.news | 1 | 19320 | 38630 | 38630 | 57940 | 77260 |
| rowId | en_US.twitter | 1 | 590000 | 1180000 | 1180000 | 1770000 | 2360000 |
| LineWordCnt | en_US.blogs | 0 | 8 | 28 | 40.94 | 59 | 6327 |
| LineWordCnt | en_US.news | 0 | 18 | 30 | 33.31 | 44 | 544 |
| LineWordCnt | en_US.twitter | 1 | 7 | 12 | 12.44 | 18 | 47 |
| LineAvgWordLen | en_US.blogs | 1 | 4 | 4.387 | 4.556 | 4.879 | 74 |
| LineAvgWordLen | en_US.news | 1 | 4.415 | 4.812 | 4.873 | 5.231 | 31 |
| LineAvgWordLen | en_US.twitter | 1 | 3.733 | 4.188 | 4.305 | 4.733 | 126 |
| LineUniqWordCnt | en_US.blogs | 0 | 8 | 24 | 31.2 | 46 | 1685 |
| LineUniqWordCnt | en_US.news | 0 | 17 | 27 | 28.06 | 37 | 271 |
| LineUniqWordCnt | en_US.twitter | 1 | 7 | 11 | 11.72 | 17 | 36 |
| LineHashtagCnt | en_US.blogs | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.news | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHashtagCnt | en_US.twitter | 0 | 0 | 0 | 0 | 0 | 0 |
| LineHttpCnt | en_US.blogs | 0 | 0 | 0 | 0.001702 | 0 | 8 |
| LineHttpCnt | en_US.news | 0 | 0 | 0 | 0.0009449 | 0 | 4 |
| LineHttpCnt | en_US.twitter | 0 | 0 | 0 | 0.000222 | 0 | 2 |
20 Most Frequent Words by File Type
| blogs.word | blogs.wordcnt | blogs.wordprob | news.word | news.wordcnt | news.wordprob | twt.word | twt.wordcnt | twt.wordprob |
|---|---|---|---|---|---|---|---|---|
| the | 1855771 | 0.0528795 | the | 151524 | 0.0608450 | the | 934172 | 0.0335948 |
| and | 1086110 | 0.0309483 | to | 69348 | 0.0278470 | to | 786629 | 0.0282888 |
| to | 1065698 | 0.0303667 | and | 68216 | 0.0273924 | you | 543700 | 0.0195526 |
| of | 875028 | 0.0249336 | of | 59089 | 0.0237274 | and | 433686 | 0.0155963 |
| in | 593633 | 0.0169154 | in | 51464 | 0.0206656 | for | 384535 | 0.0138287 |
| that | 459500 | 0.0130933 | for | 27112 | 0.0108869 | in | 377036 | 0.0135590 |
| is | 431834 | 0.0123050 | that | 26358 | 0.0105842 | of | 358981 | 0.0129097 |
| it | 400905 | 0.0114236 | is | 21961 | 0.0088185 | is | 357544 | 0.0128580 |
| for | 362867 | 0.0103398 | on | 20578 | 0.0082632 | it | 291398 | 0.0104793 |
| you | 296855 | 0.0084588 | with | 19754 | 0.0079323 | my | 290517 | 0.0104476 |
| with | 286177 | 0.0081545 | said | 19167 | 0.0076966 | on | 276264 | 0.0099350 |
| was | 278002 | 0.0079216 | was | 17625 | 0.0070774 | that | 232907 | 0.0083758 |
| on | 274047 | 0.0078089 | he | 17556 | 0.0070497 | me | 200067 | 0.0071948 |
| my | 270181 | 0.0076987 | it | 16693 | 0.0067031 | be | 187176 | 0.0067312 |
| this | 257977 | 0.0073510 | at | 16413 | 0.0065907 | at | 185524 | 0.0066718 |
| as | 223359 | 0.0063645 | as | 14662 | 0.0058876 | with | 172995 | 0.0062213 |
| have | 218541 | 0.0062272 | his | 12107 | 0.0048616 | your | 170771 | 0.0061413 |
| be | 208303 | 0.0059355 | but | 11658 | 0.0046813 | have | 168051 | 0.0060435 |
| but | 203446 | 0.0057971 | from | 11648 | 0.0046773 | so | 163273 | 0.0058716 |
| are | 193634 | 0.0055175 | be | 11579 | 0.0046496 | this | 162736 | 0.0058523 |

##### N-Gram Summary Result

```r
mfw.maxnbr <- 25   # maximum number of most-frequent words to display

# Unigram frequency barplot
barplot(head(df1g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df1g, mfw.maxnbr)$word, col = "lightgreen",
        main = paste0("English Most ", mfw.maxnbr, " Unigram Frequent Words"),
        ylab = "Word Count")

# Unigram word cloud (wordcloud() has no title argument, so title() is used)
library(wordcloud)
wordcloud(unique(df1g$word), df1g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("Unigram WordCloud")

# Bigram frequency barplot
barplot(head(df2g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df2g, mfw.maxnbr)$word, col = "lightblue",
        main = paste0("English Most ", mfw.maxnbr, " Bigram Frequent Words"),
        ylab = "Word Count")

# Bigram word cloud
wordcloud(unique(df2g$word), df2g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("Bigram WordCloud")

# Trigram (N3-gram) frequency barplot
barplot(head(df3g, mfw.maxnbr)$wordcnt, las = 2,
        names.arg = head(df3g, mfw.maxnbr)$word, col = "lightblue",
        main = paste0("English Most ", mfw.maxnbr, " N3-Gram Frequent Words"),
        ylab = "Word Count")

# Trigram word cloud
wordcloud(unique(df3g$word), df3g$wordcnt, scale = c(4, 0.5),
          min.freq = 1, max.words = mfw.maxnbr,
          colors = brewer.pal(8, "Dark2"))
title("N3-Gram WordCloud")
```

Max Sparse Ratios by N-Gram Type and File Type
| filetype | ngram | sparseRatio | WordCnt | DocCnt | CntMin | CntMedian | CntMean | CntMax |
|---|---|---|---|---|---|---|---|---|
| twitter-train | 1 | 0.999 | 1083 | 235955 | 239 | 516 | 1038 | 14740 |
| blogs-train | 1 | 0.9995 | 4236 | 89811 | 45 | 147 | 380.8 | 13490 |
| news-train | 1 | 0.999 | 2773 | 100942 | 102 | 288 | 544.2 | 24830 |
| blogs-train | 2 | 0.9995 | 936 | 89811 | 45 | 64 | 88.55 | 740 |
| twitter-train | 2 | 0.9995 | 282 | 235955 | 118 | 176 | 245.8 | 1739 |
| news-train | 2 | 0.9995 | 736 | 100942 | 51 | 77 | 105.8 | 1532 |
| twitter-train | 3 | 0.9999 | 119 | 235955 | 24 | 32 | 46.97 | 359 |
| news-train | 3 | 0.9999 | 532 | 100942 | 11 | 15 | 19.41 | 177 |
| blogs-train | 3 | 0.9999 | 382 | 89811 | 9 | 12 | 14.49 | 74 |
| twitter-train | 4 | 0.99999 | 1678 | 235955 | 3 | 3 | 4.232 | 46 |
| blogs-train | 4 | 0.9999 | 50 | 89811 | 9 | 17 | 16.02 | 21 |
| news-train | 4 | 0.9999 | 71 | 100942 | 11 | 15 | 17.17 | 40 |
| blogs-train | 5 | 0.9999 | 42 | 89811 | 10 | 17 | 16.36 | 17 |
| twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
| twitter-train | 5 | 0.99999 | 723 | 235955 | 3 | 3 | 3.913 | 34 |
Highest lm Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | lm.accuracy | lm.outofsample.accuracy | lm.outofsample.error |
|---|---|---|---|---|---|
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 0.9965870 | 0.9965870 | 0.0034130 |
| blogs-train | 3 | 0.9999 | 0.9913793 | 0.9913793 | 0.0086207 |
| blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| blogs-train | 1 | 0.9995 | 0.9898386 | 0.9898386 | 0.0101614 |
| twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| news-train | 3 | 0.9999 | 0.9757576 | 0.9757576 | 0.0242424 |
| twitter-train | 2 | 0.9995 | 0.9727273 | 0.9727273 | 0.0272727 |
| blogs-train | 2 | 0.9995 | 0.9672131 | 0.9672131 | 0.0327869 |
| news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
Highest gbm Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | gbm.accuracy | gbm.outofsample.accuracy | gbm.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| blogs-train | 1 | 0.9995 | 0.9970114 | 0.9970114 | 0.0029886 |
| blogs-train | 2 | 0.9995 | 0.9781421 | 0.9781421 | 0.0218579 |
| twitter-train | 4 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| twitter-train | 5 | 0.99999 | 0.9761092 | 0.9761092 | 0.0238908 |
| news-train | 3 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| news-train | 4 | 0.9999 | 0.9575758 | 0.9575758 | 0.0424242 |
| news-train | 1 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| news-train | 2 | 0.9995 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 3 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 4 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| blogs-train | 5 | 0.9999 | 0.9568966 | 0.9568966 | 0.0431034 |
| twitter-train | 2 | 0.9995 | 0.9181818 | 0.9181818 | 0.0818182 |
| twitter-train | 3 | 0.9999 | 0.8139535 | 0.8139535 | 0.1860465 |
Highest rf Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | rf.accuracy | rf.outofsample.accuracy | rf.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 2 | 0.9995 | 0.9890710 | 0.9890710 | 0.0109290 |
| twitter-train | 3 | 0.9999 | 0.9767442 | 0.9767442 | 0.0232558 |
| blogs-train | 4 | 0.9999 | 0.9473684 | 0.9473684 | 0.0526316 |
Highest knn Accuracy by N-Gram Type and File Type
| filetype | ngram | maxsparse | knn.accuracy | knn.outofsample.accuracy | knn.outofsample.error |
|---|---|---|---|---|---|
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 1 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| news-train | 2 | 0.9995 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 3 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 4 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| blogs-train | 5 | 0.9999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 5 | 0.99999 | 1.0000000 | 1.0000000 | 0.0000000 |
| twitter-train | 4 | 0.99999 | 0.9931741 | 0.9931741 | 0.0068259 |
| blogs-train | 2 | 0.9995 | 0.9918033 | 0.9918033 | 0.0081967 |
| twitter-train | 2 | 0.9995 | 0.9818182 | 0.9818182 | 0.0181818 |
| news-train | 3 | 0.9999 | 0.9818182 | 0.9818182 | 0.0181818 |
| news-train | 4 | 0.9999 | 0.9523810 | 0.9523810 | 0.0476190 |
| twitter-train | 3 | 0.9999 | 0.9302326 | 0.9302326 | 0.0697674 |