Data Science Capstone Project Coursera

Ramon Schildknecht
September 17, 2017

“The goal of data scientists is to provide data products that people love, so that as many people as possible benefit from the added value of the established product.”


[Image: happy end user]

Problem Definition

We need to develop a model that predicts the next most probable word given one to several input words. This problem is part of the services that SwiftKey offers.

These days people are busier than ever. The goal is to save end users of mobile devices precious time, so that they can either get more done or have more time to relax.

Example: “I went to the” and the next probable words are “gym”, “store” or “restaurant”.

Let me attempt a rough estimate of a possible business value:
Assume your company has 1'000 employees and they use five mobile company apps. If we increase their text input speed by 1%, we save 220 (yearly working days) x 1'000 (employees) x 1 (daily app usage hours) x 0.01 (improvement) = 2'200 hours per year. Multiplying this by a $35 hourly salary gives the result: $77'000 of added value.
Every 1% of time saved results in approximately $77'000 a year!
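
The same back-of-the-envelope calculation in R, using the assumed figures from above:

```r
working_days  <- 220    # yearly working days per employee
employees     <- 1000
app_hours     <- 1      # daily app usage hours per employee
speedup       <- 0.01   # 1% faster text input
hourly_salary <- 35     # salary in USD per hour

hours_saved <- working_days * employees * app_hours * speedup  # 2'200 hours
added_value <- hours_saved * hourly_salary
added_value
#> [1] 77000
```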


[Image: time is money]

Approach

  1. Load data
    • Sources: Website entries from blogs and news sites as well as from Twitter
    • about 4 million documents (a document is a text unit containing one to n words)
  2. Explore data & sample data
    • Calculate KPIs such as documents¹, vocabulary¹, word types¹, type-token ratio¹ (TTR) and diversity¹
    • Sample just the blogs data after taking TTR and diversity into account, because it offers the most value for prediction accuracy
  3. Clean data
    • Steps such as converting all characters to lowercase and removing unnecessary characters such as extra whitespace, URLs, punctuation and numbers (see the code sketch after this list, which covers steps 3 and 4)
  4. Develop and improve model
    • Create uni-, bi- and trigrams (unigram = one word, e.g. “new”; bigram = two words, e.g. “new york”; trigram = three words, e.g. “new york city”)
    • Simplified: the model takes the last two words of the input sentence and predicts the 4 most probable next words. Additionally, the user can see the probabilities of the single words.
  5. Deploy the model as a Shiny app, [access it here]
  6. Create an R presentation for my boss or an investor


¹ Vocabulary = total number of words, word types = unique words, TTR = word types divided by vocabulary, diversity = measure of vocabulary diversity = word types divided by sqrt(2 × vocabulary)
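
A minimal sketch of the cleaning and n-gram steps (3 and 4), here using the quanteda package; the actual project may rely on different packages, and the sample texts are purely illustrative:

```r
library(quanteda)

docs <- c("I went to New York.",
          "New York City is huge!",
          "Check http://example.com for 100 more examples.")

# Step 3: clean - drop punctuation, numbers and URLs, then lowercase
toks <- tokens(docs,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)
toks <- tokens_tolower(toks)

# Step 4: build uni-, bi- and trigram frequency tables
for (n in 1:3) {
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  print(topfeatures(dfm(ngrams)))  # most frequent n-grams of each order
}
```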

Results

Fundamental functionality

  1. The model takes one to n words as input and cuts out the last two words (l2w).
  2. Check whether the l2w match the first two words of a trigram.
    • If yes: filter the matching trigram entries and take the 4 values with the highest frequency
    • If no: use the same approach as before but with bigrams (the input is just the last word)
  3. Return the third words of the trigrams (or the second words of the bigrams) with the four highest frequencies, together with a visualization (see the sketch below)

    Hints: (I) If there are no matching values, the user gets a suitable error message. (II) If a bi- or trigram frequency is <= 5, Good-Turing smoothing (page 13) is applied. (III) We use Katz backoff (explanation part 1) (explanation part 2) to “back off” to a lower-order n-gram if there is zero evidence for a higher-order n-gram.
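
A simplified sketch of this lookup in R; the table and function names are illustrative assumptions, not the app's actual code:

```r
# trigrams: data frame with columns w1, w2, w3, freq
# bigrams:  data frame with columns w1, w2, freq

predict_next <- function(input, trigrams, bigrams, k = 4) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  n <- length(words)

  # match the last two words (l2w) against the trigram table
  if (n >= 2) {
    hits <- trigrams[trigrams$w1 == words[n - 1] & trigrams$w2 == words[n], ]
    if (nrow(hits) > 0) {
      hits <- hits[order(-hits$freq), ]
      return(head(hits$w3, k))  # the 4 most frequent continuations
    }
  }

  # back off to the bigram table (input is just the last word)
  hits <- bigrams[bigrams$w1 == words[n], ]
  if (nrow(hits) > 0) {
    hits <- hits[order(-hits$freq), ]
    return(head(hits$w2, k))
  }

  "No matching prediction found."  # suitable error message
}

# Good-Turing adjusted count c* = (c + 1) * N[c + 1] / N[c] for small c,
# where N[c] is the number of n-grams that occur exactly c times
good_turing <- function(c, N) (c + 1) * N[c + 1] / N[c]
```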

The final accuracy is about 30%, and you have to type at least one word. SwiftKey's accuracy is 33%, but there you usually have to type just two letters.


Access to Shiny data product

Click here to try your prediction.

Result impression & Sources

Result impression

Example: “I am going to New York”
[Image: result example]

Special thanks


Further resources