Capstone Week 2 progress report

1. Introduction

The purpose of the Capstone project is to build a “smart” keyboard that makes it easier for people to type on their mobile devices. One cornerstone of a “smart” keyboard is a predictive text model. When someone types “I went to the”, the keyboard presents three options for what the next word might be, for example gym, store or restaurant. In this capstone we work on understanding and building predictive text models like those used by our corporate partner SwiftKey.

The code has been suppressed and moved to the appendix to keep this document readable.

2. Requirements

For the Capstone we need to consider the following requirements.

  1. Building the models must fit in the RAM available on the computer used. In my case that is a MacBook Pro with 8 GB of memory.

  2. The Shiny app must run within the 1 GB of memory offered by the free version of Shiny.

  3. The load time of the application must stay short, so that users are not kept waiting for the app to start.

These requirements are realistic in practice, since currently available predictive text models run on mobile phones, which typically have limited memory and processing power compared to desktop computers.

3. Loading the R packages

  • “dtplyr” implements the ‘data.table’ back-end for ‘dplyr’, so that ‘data.table’ and ‘dplyr’ can be used together seamlessly.
  • “stringi” allows for fast, correct, consistent, portable and convenient character string/text processing in every locale and any native encoding.
  • “kableExtra” constructs complex tables with ‘kable’ and pipe syntax.
  • “quanteda” is a fast, flexible and comprehensive framework for quantitative text analysis in R.
  • “readtext” imports and handles plain and formatted text files.
  • “ggplot2” is a plotting system for R.

4. Loading and cleaning the data

We load the data from the Coursera website (Capstone Dataset). The goal of this task is to get familiar with the data sets and do the necessary cleaning.

  • First we get an idea of the size of the corpora. We determine the file size of each text file (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).

  • Then we load each file and count the number of lines in it.

  • Finally we determine the number of words in each file.
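The three measurements in the list above can be sketched in base R as follows. This is a minimal illustration, not the code from the appendix: the helper name `file_stats` is hypothetical, and words are counted simply as runs of non-whitespace characters. With the “stringi” package loaded, `stri_count_words()` would be a natural replacement for the word count.

```r
# Summarise one text file: size in MB, number of lines, number of words.
# `file_stats` is a hypothetical helper written for this illustration.
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Count words as runs of non-whitespace characters on each line
  words <- sum(lengths(regmatches(lines, gregexpr("\\S+", lines))))
  data.frame(
    file    = basename(path),
    size_mb = file.size(path) / 1024^2,
    lines   = length(lines),
    words   = words
  )
}

# Toy usage on a temporary file; the real corpus files would be passed instead
tmp <- tempfile(fileext = ".txt")
writeLines(c("I went to the gym", "See you at the store"), tmp)
file_stats(tmp)
```

Calling the helper once per file and binding the rows together with `rbind()` would give a summary table of the same shape as the one in this section.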

The table below shows, for each file, the size in megabytes and the number of lines and words in millions.

file          size (MB)   lines (millions)   words (millions)
US blogs      200.4242    0.8576279          35.80689
US news       196.2775    0.9634418          33.15200
US twitter    159.3641    2.2508125          28.69931

Each file is very large. We cannot use all the data for this project, as that would require more computing power and memory than we have. We therefore take an initial subset of 5% of the data.
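One way to draw such a subset is to sample line indices at random with a fixed seed, so that the same 5% is selected on every run. The sketch below assumes the lines of a file are already in a character vector; the function name `sample_lines` and the seed value are illustrative choices, not taken from the appendix.

```r
# Draw a reproducible random subset of lines from a character vector.
# `sample_lines` and the default seed are hypothetical choices for this sketch.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  set.seed(seed)                                  # fixed seed makes the subset repeatable
  n_keep <- round(length(lines) * fraction)       # number of lines to keep
  lines[sort(sample.int(length(lines), n_keep))]  # keep the original line order
}

# Toy usage: 1000 fake lines stand in for one corpus file
toy <- paste("line", seq_len(1000))
subset_5pct <- sample_lines(toy)
length(subset_5pct)
```

Keeping `fraction` as a parameter makes it easy to adjust the sample size later if the model needs more data or less memory.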

5. Exploratory Analysis

We load a 5% subset of the data as our training data set. Depending on how our prediction model performs, we may decide to adjust this percentage up or down.

To build a predictive model for text we need to understand the distribution and relationship between the words, tokens, and phrases in the text.

  • We filter profanity: we remove profanity and other words we do not want to predict. For this purpose we downloaded a “Terms to Block” file from Frontgate Media.

  • We use “quanteda”, a text analytics package that provides a rich set of text analysis features coupled with excellent performance relative to Java-based R packages for text analysis.

  • We remove punctuation, numbers, separators and English stopwords.

  • We create unigrams, bigrams and trigrams.
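To make the cleaning and n-gram steps above concrete, here is a small base R sketch. It is not the quanteda pipeline used for the report (there the same steps would typically go through `tokens()`, `tokens_remove()` and `tokens_ngrams()`). The helper names `tokenize` and `make_ngrams` and the tiny `drop` list, which stands in for both the “Terms to Block” file and the stopword list, are assumptions made for the illustration.

```r
# Lower-case the text, replace everything except letters, apostrophes and
# spaces with a space, split on spaces, and drop unwanted words.
tokenize <- function(text, drop = character(0)) {
  text  <- tolower(text)
  text  <- gsub("[^a-z' ]+", " ", text)   # punctuation and numbers become spaces
  words <- strsplit(trimws(text), " +")[[1]]
  words[!(words %in% drop) & nzchar(words)]
}

# Build n-grams by sliding a window of n tokens along the token vector.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy usage: "the" and "to" stand in for stopwords, "darn" for a blocked term
toks <- tokenize("I went to the gym, then to the darn store!",
                 drop = c("the", "to", "darn"))
make_ngrams(toks, 1)   # unigrams
make_ngrams(toks, 2)   # bigrams
make_ngrams(toks, 3)   # trigrams
```

Note that this sketch forms n-grams after the unwanted words are removed, so words that were not adjacent in the original text can end up in the same n-gram.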

5.1 ngram frequency graphs

The following graphs show the top 20 unigrams, bigrams and trigrams.

5.2 Removing singletons

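The table below compares the number of distinct n-grams before (ngramTotal) and after (ngramNoSingleton) removing singletons, that is, n-grams that occur only once in the sample.

ngram      ngramTotal   ngramNoSingleton
Unigram    144160       64352
Bigram     1570436      369635
Trigram    3350828      346532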
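Removing singletons means counting how often each n-gram occurs and keeping only those seen more than once. A minimal base R sketch, assuming the n-grams are held in a character vector (the function name `drop_singletons` is hypothetical):

```r
# Count n-grams and keep only those that occur more than once,
# sorted from most to least frequent.
drop_singletons <- function(ngrams) {
  counts <- table(ngrams)
  freq   <- setNames(as.integer(counts), names(counts))
  freq   <- freq[freq > 1]            # a singleton has a count of exactly 1
  sort(freq, decreasing = TRUE)
}

# Toy usage: "b c" occurs once and is removed
bigrams <- c("a b", "a b", "b c", "c d", "c d", "c d")
drop_singletons(bigrams)
```

In quanteda, a comparable filter on a document-feature matrix could be applied with `dfm_trim()` and a minimum term frequency of 2.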

5.3 Coverage

One of the questions in the assignment was how many unique words, taken in order of decreasing frequency, are needed to cover 50%, 90% and 99% of all word instances in the data. In our sample, 967 unique words cover 50%, 13103 cover 90% and 51022 cover 99%.

PercentageCoverage   Uniquewords
50%                  967
90%                  13103
99%                  51022
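The coverage figures follow from sorting the unigram counts from most to least frequent and finding where the cumulative share first reaches the target. A minimal sketch, assuming the unigram frequencies are available as a named numeric vector (the function name `words_for_coverage` and the toy counts are illustrative only):

```r
# Smallest number of most-frequent words whose counts together reach
# `target` (a share between 0 and 1) of all word instances.
words_for_coverage <- function(freq, target) {
  freq      <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)   # cumulative share of word instances
  unname(which(cum_share >= target)[1])
}

# Toy usage with made-up counts for four words
toy_freq <- c(the = 50, to = 30, gym = 15, store = 5)
sapply(c(0.5, 0.9, 0.99), function(p) words_for_coverage(toy_freq, p))
```

Because word frequencies are heavily skewed, the number of words needed grows much faster than the coverage target, which matches the jump from 967 to 51022 words reported above.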