Synopsis

The Data Science Capstone is the final project of the JHU Data Science Specialization. The goal of the project is to build a context-based text input prediction system similar to those created by SwiftKey (a text prediction software company partnering with JHU).

This report is devoted to the initial stages of the project: strategy development and data exploration. Strategy development outlines the general approach to data collection, the prediction algorithm and UI creation. Data exploration is the first step of strategy execution; it includes the ETL and EDA.

Strategy Development

The goal of the project is to build a data product: a context-based text input recommendation system, i.e. a program with a UI that takes the user’s text input and recommends the next word (next-word prediction). The steps required to achieve this goal are defined as follows:

  1. Define the scope of the project, i.e. languages, target devices, etc.
  2. Obtain the data (text corpus) representative of the language (or languages) chosen.
  3. Create and train the predictive algorithm (model) that will power the tool’s engine.
  4. Create the UI for the target devices identified in scope.

Scope of the Project

The project is limited to English-language text input prediction. However, it should be designed in a way that allows a fast switch to any other language, i.e. no major coding effort should be needed to plug in a corpus for a different language.

The text prediction algorithm should give accurate predictions without requiring vast computing resources; a compromise between speed and accuracy has to be found. Note: the algorithm should produce only next-word recommendations; partial-word auto-complete recommendations are not part of this project.

The UI is limited to a demonstrational web browser interface. There is no plan to implement compatibility libraries for third-party programs, such as an iPhone keyboard or similar.

The Data

The data source chosen for this project is the HC Corpora language database. It contains raw text data for multiple languages obtained by a web crawler. The main sources are:

  • News - news sites/aggregators;
  • Blogs - blogging resources;
  • Twitter.

As per the scope of this project, we will be using the English part of the corpus to train our prediction algorithm. We will differentiate between the types of text sources only during data exploration; afterwards the whole corpus will be treated as one.

The Algorithm

In order to predict the next word, we will assume that every next word depends only on a short sequence of preceding words, i.e. we will be using the Markov assumption.

Basing the model on the assumption defined above, we will produce a set of N-grams (sequences of N words) from the text corpus and assign a probability to every word based on the N-1 preceding words. A more detailed description of the methodology can be found here: Stanford NLP Course by E. Roberts. As we have limited computing resources, we will limit N to 3 (trigrams) or even 2 (bigrams).
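
To make this concrete, below is a toy sketch in base R of how trigram counts and the corresponding maximum-likelihood conditional probabilities could be computed. The tiny example corpus and all object names are illustrative only, not the actual project code.

```r
# Toy sketch: maximum-likelihood trigram probabilities, base R only.
toy    <- c("i love data science", "i love text mining", "you love data science")
tokens <- strsplit(toy, " ")

# Collect all trigrams as "w1 w2 w3" strings
trigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 3) return(character(0))
  sapply(seq_len(length(w) - 2), function(i) paste(w[i], w[i + 1], w[i + 2]))
}))

trigram_counts <- table(trigrams)

# P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
prefix        <- sub(" [^ ]+$", "", names(trigram_counts))     # "w1 w2" part
prefix_counts <- tapply(as.numeric(trigram_counts), prefix, sum)
trigram_prob  <- as.numeric(trigram_counts) / prefix_counts[prefix]
names(trigram_prob) <- names(trigram_counts)

trigram_prob
```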

Overall, the algorithm will be as follows (a minimal R sketch is given after the list):

  1. Take the text input;
  2. Extract the last two words (in case we choose trigrams);
  3. Find the trigrams whose first two words match the extracted bigram and compare their probabilities;
  4. Output the last word of the matching trigram with the highest probability.
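
A minimal sketch of steps 2-4, assuming a named vector of trigram probabilities like the one built in the earlier sketch; a tiny hard-coded table keeps the snippet self-contained and is purely illustrative.

```r
# Sketch of steps 2-4: predict the next word from a trigram probability table.
trigram_prob <- c("i love data"       = 0.5,
                  "i love text"       = 0.5,
                  "love data science" = 1.0)

predict_next_word <- function(input, probs = trigram_prob) {
  words  <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  prefix <- paste(tail(words, 2), collapse = " ")                 # step 2: last two words
  hits   <- probs[startsWith(names(probs), paste0(prefix, " "))]  # step 3: matching trigrams
  if (length(hits) == 0) return(NA_character_)                    # no match found
  best <- names(hits)[which.max(hits)]                            # step 4: most probable trigram
  tail(strsplit(best, " ")[[1]], 1)                               # its last word is the prediction
}

predict_next_word("I love")        # "data" (tie with "text" broken by order)
predict_next_word("we love data")  # "science"
```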

The UI

The User Interface will be a simple browser-based Shiny app, with R code on the back end and a Shiny front end.
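
As an illustration, a minimal app of this kind might look like the sketch below; predict_next_word() stands in for the prediction function outlined in the algorithm section and is an assumption, not the final engine.

```r
# Minimal Shiny sketch: text input in, suggested next word out.
# predict_next_word() is assumed to be defined elsewhere (see the algorithm sketch).
library(shiny)

ui <- fluidPage(
  titlePanel("Next word prediction"),
  textInput("user_text", "Type your text:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$user_text)) == 0) return("")
    paste("Suggested next word:", predict_next_word(input$user_text))
  })
}

shinyApp(ui = ui, server = server)
```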

ETL and EDA

As the first step of the strategy implementation, we load the data (ETL: Extract, Transform, Load) and explore it (EDA: Exploratory Data Analysis). The data is downloaded directly from the source as an archive, unpacked and loaded into the programming environment. This is what the loaded data looks like:

|                              | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | Corpus    |
|------------------------------|-----------------|----------------|-------------------|-----------|
| File size (MB)               | 210             | 205            | 167               | 583       |
| Total lines                  | 899288          | 77259          | 2360148           | 3336695   |
| Total characters             | 206824505       | 15639408       | 162096031         | 384559944 |
| Min characters per line      | 1               | 2              | 2                 | 1         |
| Average characters per line  | 230             | 202            | 69                | 115       |
| Max characters per line      | 40833           | 5760           | 140               | 40833     |
| Total words                  | 37334131        | 2643969        | 30373543          | 70351643  |
| Min words per line           | 1               | 1              | 1                 | 1         |
| Average words per line       | 42              | 34             | 13                | 21        |
| Max words per line           | 6630            | 1031           | 47                | 6630      |
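
For illustration, the loading step could look roughly like the sketch below; the file paths are assumptions based on how the HC Corpora archive usually unpacks, and the actual report code may differ.

```r
# Sketch of the loading step; paths assume the archive was unpacked into final/en_US/.
files <- c("final/en_US/en_US.blogs.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")

# skipNul = TRUE avoids problems with embedded NUL bytes in the raw files
corpus_list <- lapply(files, readLines, encoding = "UTF-8",
                      skipNul = TRUE, warn = FALSE)
names(corpus_list) <- basename(files)

# Per-file summaries similar to the table above
sapply(corpus_list, length)                     # total lines
sapply(corpus_list, function(x) sum(nchar(x)))  # total characters
sapply(corpus_list, function(x) max(nchar(x)))  # max characters per line
```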

Once we load the text corpus, we prepare it for further use in the project by cleaning it; among other transformations, we remove common English stopwords.

Note: We do not perform stemming, as we do not want to reduce word forms - this might harm our text prediction goals.
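
A hedged sketch of what the cleaning could look like in base R; the exact transformations used in the report may differ, the stopword list here is a tiny illustrative subset, and, per the note above, no stemming is applied.

```r
# Illustrative cleaning: lower-case, strip non-letters, drop a few stopwords.
clean_lines <- function(lines) {
  x <- tolower(lines)
  x <- gsub("[^a-z' ]", " ", x)   # remove punctuation, digits and other symbols
  x <- gsub("\\s+", " ", x)       # collapse repeated whitespace
  trimws(x)
}

stop_words <- c("the", "a", "an", "and", "of", "to", "in", "is")   # illustrative subset

drop_stopwords <- function(line, stopwords = stop_words) {
  words <- strsplit(line, " ")[[1]]
  paste(words[!words %in% stopwords], collapse = " ")
}

sapply(clean_lines(c("The 2 cats, and a dog!")), drop_stopwords, USE.NAMES = FALSE)
# [1] "cats dog"
```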

As the entire database is quite large, from here on we will use a sample to avoid long running times and machine overload. We will randomly take 20000 lines from every document, which should still be quite representative.
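
Continuing from the loading sketch above, the sampling step could be as simple as the following (the seed and object names are illustrative):

```r
# Take 20,000 random lines from each document (corpus_list from the loading sketch).
set.seed(20000)   # illustrative seed for reproducibility
sample_list   <- lapply(corpus_list, function(lines) sample(lines, min(20000, length(lines))))
sample_corpus <- unlist(sample_list, use.names = FALSE)
length(sample_corpus)   # 60,000 lines in total
```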

After the cleaning stage, this is what our data sample looks like:

|                                         | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt | Corpus |
|-----------------------------------------|-----------------|----------------|-------------------|--------|
| Total words                             | 411754          | 374273         | 132024            | 918051 |
| Total unique words                      | 43500           | 43085          | 22566             | 73116  |
| Total unique words covering 90% of text | 13075           | 14012          | 9363              | 15766  |
| Sparsity, %                             | 40.5%           | 41.1%          | 69.1%             | 50.2%  |
| Most frequent word                      | one             | said           | just              | said   |

We can notice that this unigram bag-of-words matrix (document-term matrix) is quite sparse. The numbers in the table above also depend on the total document size: the blogs have the largest word set, while Twitter has the smallest. Another interesting aspect is the distribution of words within each document. If we look at the number of unique words covering 90% of the text, we can easily see that there is a long tail in every source, as well as in the whole corpus: the number of ‘90% coverage’ words is much lower than 90% of the total unique words, i.e. most of the text is covered by a set of frequent terms, while the tail consists of many infrequent ones.
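
The ‘90% coverage’ figures above can be obtained with a simple cumulative-frequency calculation; here is a small self-contained sketch (the toy word vector is illustrative only):

```r
# How many of the most frequent unique words cover 90% of all word occurrences?
coverage_90 <- function(words) {
  freqs     <- sort(table(words), decreasing = TRUE)
  cum_share <- cumsum(freqs) / sum(freqs)
  unname(which(cum_share >= 0.9)[1])
}

toy_words <- c(rep("the", 6), "cat", "cat", "sat", "mat")
coverage_90(toy_words)   # 3: three of the four unique words cover 90% of the tokens
```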

However, for our project we do not need to compare the different sources of text data. What we are interested in is the aggregate: the whole corpus and the N-grams within it. As noted before, we will use N-grams up to order 3 in our algorithm, so below are the frequency distribution charts for unigrams (one word), bigrams (two words) and trigrams (three words).

As we can see, unigrams repeat far more often than bigrams, and bigrams, in turn, more often than trigrams. This is quite intuitive: the more words a unit contains, the lower the probability of seeing the same combination repeated.

Speaking of the most frequent N-grams in our corpus sample, below are the charts for the top 10:
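
For reference, a top-10 chart like these can be produced with base R graphics from any named frequency vector, for example the toy trigram_counts from the earlier sketch (the actual report may have used a different plotting approach):

```r
# Bar chart of the most frequent N-grams, given a named count vector.
top <- head(sort(trigram_counts, decreasing = TRUE), 10)
barplot(top, las = 2, cex.names = 0.7,
        main = "Top N-grams in the sample", ylab = "Frequency")
```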

The most frequent words are not very surprising. The trigrams show some of the most commonly used idioms. However, as we eliminated many stopwords, we might not have the full picture. Further in the process we will have to consider whether keeping the stopwords would be beneficial (as many trigrams actually use prepositions).

Finally, here is a beautiful wordcloud built from the one- to three-word N-grams of the combined corpus sample.
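
One possible way to generate such a wordcloud is via the wordcloud package, fed with a named frequency vector; this is an assumption about the tooling, not necessarily what was used for this report.

```r
# Illustrative wordcloud from word frequencies (sample_corpus from the sampling sketch).
library(wordcloud)

word_freqs <- sort(table(unlist(strsplit(sample_corpus, "\\s+"))), decreasing = TRUE)
wordcloud(words = names(word_freqs), freq = as.numeric(word_freqs),
          max.words = 100, random.order = FALSE)
```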

Conclusion

As a result of the initial project stage, the following was accomplished:

  1. The text input prediction tool development strategy was defined.
  2. The text corpus was loaded.
  3. The text corpus was explored and described.

The following step will be the development of the predictive algorithm based on the data described above.

The code for this report can be found on my GitHub in the project “DS_Capstone”.