The purpose of this project is to develop an accurate, fast, and lightweight (resource-efficient) predictive typing model and application that can be used on both mobile devices and standard computer interfaces to improve the user experience (UX). The project is intended both to demonstrate mastery of the Johns Hopkins University (JHU) Data Science Specialization body of knowledge and to meet or exceed the current performance of commercial predictive typing applications, such as the tool from SwiftKey, the corporate partner of the class.
This report is an interim update. Progress so far includes loading the English-language data sources, analysis of the raw data in those files, plots and tables illustrating word frequency within the data sets, a demonstration of Term Document Matrix (TDM) development, and preliminary work on sparse TDM and n-gram representations for use in future predictive model development.
There have been some time-consuming technical challenges with standard R packages for text mining and analysis (e.g., RWeka and Java version issues), including incompatibilities with the machine configuration being used, along with other issues around large-scale matrix structures and memory management. However, steady progress has been made and modeling work has begun.
The English language sources have been the focus for this interim stage.
The project relies on data from this source to build the models: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
It is assumed that the initial models will be developed from this data set, but other sources may be used to form the text corpus for the final product (including, potentially, domain-specific text from healthcare, finance, science, etc.). Model testing will include a broad array of inputs.
There is a growing body of existing R packages for text mining and analysis. This project relies heavily on those packages (e.g., tm, RWeka, SnowballC, wordcloud) and on the wide testing and application they have received. As these utilities improve and new capabilities emerge, it may make sense to revisit the decisions made. The results depend on the quality of those tools, and a key assumption is that they are both accurate and complete. This assumption should be validated by others who may build on this work and R code.
The following example results provide an overview of the size and sources of input data:
## Loading required package: plyr
##            filename characters     words   lines
## 1   en_US.blogs.txt  210160014  37334690  899288
## 2    en_US.news.txt  205811889  34372720 1010242
## 3 en_US.twitter.txt  167105338  30374206 2360148
## 4             total  583077241 102081616 4269678
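A summary table of this kind can be produced with base R alone. The following is a minimal sketch, not the code used for the table above: `file_stats()` is a hypothetical helper, and it assumes that words are whitespace-separated tokens and that characters are counted per line without line terminators, so its counts may differ from those reported.

```r
# Sketch: summarise text files by characters, words, and lines (base R only).
# file_stats() is a hypothetical helper, not the report's original code.
file_stats <- function(paths) {
  rows <- lapply(paths, function(p) {
    txt <- readLines(p, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
    data.frame(
      filename   = basename(p),
      characters = sum(nchar(txt, type = "chars", allowNA = TRUE), na.rm = TRUE),
      # Assumption: a word is any whitespace-separated token.
      words      = sum(lengths(strsplit(trimws(txt), "\\s+"))),
      lines      = length(txt),
      stringsAsFactors = FALSE
    )
  })
  stats <- do.call(rbind, rows)
  total <- data.frame(
    filename   = "total",
    characters = sum(stats$characters),
    words      = sum(stats$words),
    lines      = sum(stats$lines),
    stringsAsFactors = FALSE
  )
  rbind(stats, total)
}

# Usage on a small temporary file (the real inputs would be the en_US files).
demo_file <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", ""), demo_file)
demo_stats <- file_stats(demo_file)
print(demo_stats)
```

Reading each file fully into memory is acceptable for a one-off summary, but for files of this size a chunked read may be preferable.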
The input sources are very large (see above), so I have chosen to sample a small portion of the data to keep the analysis within reasonable processing times and to confirm that the subsequent steps are working properly. The sample is drawn at random, with an adjustable number of documents. Ideally, the project and application would not only use these sources but also continue to scan additional current sources for new word patterns and usage (for example, emerging slang and acronyms in tweets and blogs) to keep the predictive capabilities valuable to users. This is a recommended future course of action.
# Number of documents to sample from the corpus (TBD: test findings vs. sample size)
sampleSIZE <- 10000
# Draw a random sample of documents, without replacement
aSample <- sample(aCorpusVector, sampleSIZE)
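One way to approach the open question of findings versus sample size is to track how the number of distinct words grows as the sample grows. The sketch below uses a toy corpus in place of `aCorpusVector`; the seed, the toy vocabulary, the sample sizes, and the helper `vocab_size()` are all assumptions made for illustration.

```r
# Sketch: vocabulary size as a function of sample size (base R only).
# toy_corpus stands in for aCorpusVector; all values here are illustrative.
set.seed(42)  # fixed seed (an assumption) so the samples are reproducible
toy_words  <- c("the", "a", "of", "to", "and",
                "data", "model", "text", "word", "typing")
toy_corpus <- replicate(
  200,
  paste(sample(toy_words, 5, replace = TRUE), collapse = " ")
)

# Count distinct lower-cased, whitespace-separated tokens in a set of documents.
vocab_size <- function(docs) {
  tokens <- unlist(strsplit(tolower(docs), "\\s+"))
  length(unique(tokens[nzchar(tokens)]))
}

sample_sizes <- c(10, 50, 100, 200)
coverage <- data.frame(
  sampleSIZE = sample_sizes,
  vocabulary = sapply(sample_sizes, function(n) {
    # Never request more documents than the corpus holds.
    vocab_size(sample(toy_corpus, min(n, length(toy_corpus))))
  })
)
print(coverage)
```

On the real corpus, a curve that flattens as the sample size increases would suggest that a larger sample adds few new words.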
The source documents include a lot of noise and non-useful characters. Foreign-language characters, spelling and grammar errors, and many other issues have been discovered so far. I have cleaned up the first-order items, but there are many more to consider in refining the quality of the input data. What is not clear to me at this point is how valuable further clean-up is relative to the practical application of the tool. The tradeoffs at each stage (broadening the data sources, eliminating noisy data, using compression techniques, the importance of phrase construction in context, the use of capitalization in prediction, stemming, etc.) need to be explored in subsequent iterations of this project. It may be possible to develop a highly accurate and reliable prediction engine with a relatively small set of noisy data, but I do not know that at this point. I have chosen to limit my clean-up work at this stage until I have a fully working prototype to test the pipeline from ingest to prediction.
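As an illustration of first-order clean-up, the sketch below drops non-ASCII characters, lower-cases the text, and strips digits and punctuation. It is a minimal base R example under stated assumptions, not the report's actual cleaning code: `clean_text()` is a hypothetical helper, and the choice to keep apostrophes is an assumption.

```r
# Sketch: first-order text clean-up (base R only).
# clean_text() is a hypothetical helper; the rules below are assumptions.
clean_text <- function(x) {
  # Drop characters that cannot be represented in ASCII.
  # Side effect: accented words are truncated (e.g. "cafe" with an accent
  # becomes "caf"), which is one of the tradeoffs still to be evaluated.
  x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  x <- tolower(x)
  # Keep only letters, apostrophes, and spaces; everything else becomes a space.
  x <- gsub("[^a-z' ]", " ", x)
  # Collapse runs of whitespace and trim the ends.
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

raw     <- "Caf\u00e9 is GREAT!!! It's 100% true."
cleaned <- clean_text(raw)
print(cleaned)
```

Each rule here removes information (capitalization, numbers, accented characters), so each is a candidate for the tradeoff testing described above.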
I have included a processing flag to remove stopwords or keep them in. The n-gram analysis seems to need the stopwords to form meaningful phrases for answering the class quizzes (this needs to be tested); however, this may not be true for the actual predictive text application, where earlier non-stopwords may be better predictors.
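The effect of the stopword flag on n-grams can be seen in a small base R sketch. This is not the RWeka-based code used in the project: `make_ngrams()` and the short `toy_stopwords` list are hypothetical stand-ins for illustration.

```r
# Sketch: n-gram extraction with an optional stopword filter (base R only).
# make_ngrams() and toy_stopwords are hypothetical, for illustration only.
toy_stopwords <- c("the", "a", "of", "to", "and", "in", "is")

make_ngrams <- function(text, n = 2, remove_stopwords = FALSE,
                        stopwords = toy_stopwords) {
  # Tokenise a single string on anything that is not a letter or apostrophe.
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]
  if (remove_stopwords) tokens <- tokens[!tokens %in% stopwords]
  if (length(tokens) < n) return(character(0))
  # Slide a window of n tokens across the sequence.
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

phrase       <- "The end of the day is in sight"
with_stop    <- make_ngrams(phrase, n = 3)
without_stop <- make_ngrams(phrase, n = 3, remove_stopwords = TRUE)
print(with_stop)
print(without_stop)
```

With stopwords kept, the trigrams read as natural phrases such as "of the day"; with them removed, the only trigram left is "end day sight", which joins words that were never adjacent in the source.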
## Loading required package: RWeka
## Loading required package: wordcloud
## Loading required package: RColorBrewer
This section is under development and is left as a subsequent activity.
The current plan is to use a subset of the machine learning tools from our earlier classes and test their effectiveness (accuracy and breadth of word coverage), efficiency (processing time, since this will be a mobile app deployment and needs to provide near-instantaneous responses to be usable), and memory footprint.
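The three criteria can be measured on even a very simple candidate model. The sketch below is an assumption for illustration, not the model chosen for this project: it builds a most-frequent-next-word lookup from bigram counts on toy data, then measures lookup time with `system.time()` and memory with `object.size()`. The helper names and the training sentences are hypothetical.

```r
# Sketch: a bigram "most frequent next word" lookup, with rough timing and
# memory measurements (base R only; toy data and hypothetical helper names).
train_text <- c("i want to go home", "i want to eat",
                "i want a coffee", "we want to go")

build_bigram_model <- function(docs) {
  # Collect (first, second) word pairs from every document.
  pairs <- do.call(rbind, lapply(strsplit(tolower(docs), "\\s+"), function(tok) {
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  }))
  # Count each pair, then keep the most frequent follower of each first word.
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts <- counts[order(counts$first, -counts$n), ]
  best   <- counts[!duplicated(counts$first), ]
  setNames(best$second, best$first)
}

predict_next <- function(model, word) {
  w <- tolower(word)
  # Linear name lookup; a hashed environment may scale better for large models.
  if (w %in% names(model)) unname(model[[w]]) else NA_character_
}

model      <- build_bigram_model(train_text)
elapsed    <- system.time(for (i in 1:1000) predict_next(model, "want"))[["elapsed"]]
size_bytes <- as.numeric(object.size(model))

print(predict_next(model, "want"))
cat("1000 lookups (s):", elapsed, " model size (bytes):", size_bytes, "\n")
```

Accuracy would be measured separately, for example as the share of held-out bigrams whose second word matches the prediction.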
The current plan is to demonstrate the model using the Shiny application platform, as we did in the Developing Data Products class. Model size and processing speed are critical considerations. API-based interaction will also be explored (that is, whether the model can be provided as a cloud service and still be responsive enough for mobile applications).
I have successfully demonstrated the ability to ingest the source files, perform a subset of potential “clean-up” activities, explore the data to look for patterns, and use n-gram processing tools. The project is ready for model development and testing.