Capstone Project - Final Project Report

B. McCracken - Tech Entrepreneur

Capstone Final Report

This is the final project report for the Johns Hopkins University Data Science Specialization Capstone course. The project demonstrates the use of Natural Language Processing tools to build a model that predicts the next word typed in a sentence. The project uses several language processing packages:

  • tm: used to read the corpus of documents in a folder and create a VCorpus
  • quanteda: used in this exercise to show summary statistics of the corpus
  • RWeka: used to create a tokenizer and n-grams from a TermDocumentMatrix
  • dplyr and gridExtra: used to identify and plot the most used terms and n-grams

The tm package was the primary package utilized in this project.

The final application is deployed on the shiny server at: https://mccracmiler.shinyapps.io/CAPAPP/

Prepare Environment, Download Data and Create Sample files

The first step in the project is to prepare the environment and create a sample corpus and n-gram files. An “n-gram” is a combination of “n” words extracted from text. The data set from SwiftKey allows for analysis of multiple languages (English, German, Russian, and Finnish); this project focuses on the English version. The entire corpus of three documents is 510 MB, which is too large to manipulate in memory. A sample of each document was selected and then written out to a sample directory.

Steps taken to prepare the data for the application are as follows (a sketch of the sampling step appears after the list):

  • Load the needed packages
  • Download the data
  • Create sample directories to hold the sample corpus and n-gram files
  • Extract a sample of each file to create a corpus of documents from which to build n-gram combinations
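
Below is a minimal sketch of this preparation step, assuming the course's SwiftKey download URL and a 15% sampling rate; file paths and the helper function name are illustrative, not the project's exact code.

```r
# Sketch of environment prep and sampling; the URL, paths, and the 15%
# rate are assumptions for illustration.
library(tm)

set.seed(1234)  # make the sample reproducible
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")
}
dir.create("sample", showWarnings = FALSE)

# Keep roughly `rate` of the lines in each source file
sample_file <- function(infile, outfile, rate = 0.15) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(lines[rbinom(length(lines), 1, rate) == 1], outfile)
}

sample_file("final/en_US/en_US.news.txt",    "sample/news.txt")
sample_file("final/en_US/en_US.blogs.txt",   "sample/blogs.txt")
sample_file("final/en_US/en_US.twitter.txt", "sample/twitter.txt")

corpus <- VCorpus(DirSource("sample"))  # read the samples into a tm VCorpus
```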

Create the N-gram Files for Staging and Use in the Application

Extract n-grams from the corpus: The corpus has three files: a US news file, blog extracts, and a file of tweets. A sample of all the files was chosen for analysis to build a broader range of phrases and terms upon which to build an application dictionary. In cleaning the data, stemming and removing stop words produced somewhat meaningless words and phrases. Several combinations of samples were attempted to create the largest number of n-grams for use in the app. My 64-bit Windows machine with 8 GB of RAM was not capable of processing a sample of more than 15% of the Twitter file. My application relies heavily on the n-grams created from tweets, as this is most likely what testers of the application will use. I utilized RWeka as a tokenizer and tm to create n-grams from the TermDocumentMatrix (a sketch of this step follows the list below). There may be more memory-efficient n-gram creators, but I was not able to locate them. I proceeded as follows:

  • Create 2-word, 3-word, and 4-word n-gram combination files for use in the application
  • The resulting files included 110K 2-word, 69K 3-word, and 16K 4-word n-grams
  • The final step was to split and sort the n-grams to take work out of the application: each n-gram was split into a shorter n-gram and a last-word recommendation (see the second sketch below)
    • First, the last word of each n-gram was extracted and added as an extra field
    • Second, the remaining words of each n-gram were added as another field
  • The resulting files contain:
    • the original n-gram
    • the number of occurrences of the n-gram in the file
    • the (n-1)-word n-gram (the first three words of a 4-gram, the first two words of a 3-gram, etc.)
    • the last word of each n-gram
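
The sketch below shows how RWeka's NGramTokenizer can feed tm's TermDocumentMatrix, as described above. The cleaning steps mirror the ones the application uses (numbers, punctuation, and case are dropped); object names are illustrative.

```r
# Sketch of n-gram extraction with RWeka + tm; variable names are
# illustrative, not the project's exact code.
library(tm)
library(RWeka)

corpus <- VCorpus(DirSource("sample"))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# One tokenizer per n-gram length; shown here for 2-grams
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm2  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
freq2 <- sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)
head(freq2)  # most frequent 2-grams and their counts
```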
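
And a second sketch of the split-and-sort staging step, shown for the 4-gram file; the data frame and column names (stem, lastword) are assumptions about the staged files, not their exact layout.

```r
# Sketch of splitting each n-gram into an (n-1)-word stem and its last
# word, then sorting by frequency; column names are assumptions.
library(dplyr)

# freq4: named numeric vector of 4-gram counts from the TermDocumentMatrix
ngrams4 <- data.frame(ngram = names(freq4), count = as.integer(freq4),
                      stringsAsFactors = FALSE) %>%
  mutate(lastword = sub("^.* ", "", ngram),        # last word of the n-gram
         stem     = sub(" [^ ]+$", "", ngram)) %>% # first n-1 words
  arrange(desc(count))

saveRDS(ngrams4, "sample/quadgrams.rds")  # staged for use by the app
```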

How Does the Application Work?

The application itself is fairly simple:

  • Once the user starts typing, a function is called to identify matching n-grams.
  • Numbers, spaces and punctuation are ignored. Capital letters are converted to lower case.
  • The function returns the three most likely next words, prioritizing matches found first in the 4-word file, then the 3-word file, and finally the 2-word file (see the sketch after this list).
  • Ideally, the proposed words would dynamically update the radio button selections and the user would select the next word.
    • The selected word would be appended to the input phrase and the user could keep typing.
    • I could not get this dynamic step to work :(
    • I was able to dynamically update the radio buttons, but they were being populated with a random number. I will continue to work on this after the course is completed.
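
Here is a hedged sketch of the lookup described above: a simple backoff from the 4-gram file down to the 2-gram file. The function name and the table/column names (stem, lastword, count) follow the staging sketch earlier and are assumptions about the app's internals, not its exact code.

```r
# Backoff lookup sketch; predict_next() and its inputs are illustrative.
library(dplyr)

predict_next <- function(phrase, quadgrams, trigrams, bigrams, n = 3) {
  # Drop numbers and punctuation, lower-case, and split into words
  clean <- tolower(gsub("[[:digit:][:punct:]]", "", phrase))
  words <- unlist(strsplit(trimws(clean), "\\s+"))

  # Match the last k typed words against the (n-1)-word stems of a table
  lookup <- function(tbl, k) {
    if (length(words) < k) return(character(0))
    key <- paste(tail(words, k), collapse = " ")
    tbl %>% filter(stem == key) %>% arrange(desc(count)) %>% pull(lastword)
  }

  # Prefer 4-gram matches, then 3-gram, then 2-gram
  hits <- c(lookup(quadgrams, 3), lookup(trigrams, 2), lookup(bigrams, 1))
  head(unique(hits), n)
}
```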

Conclusion and Final Thoughts

Overall, memory limitations in R prevented using the entire corpus for analysis. As a result, I have tried to be very efficient in the use of data to achieve the objective. It has been very useful to test the capabilities of the tm, quanteda, and RWeka packages.

I had great difficulty with shiny, as I could not dynamically update the radio buttons as I would have liked. Ideally, the next predicted words would be presented to the user, the user would select one, the selected word would be appended to the phrase the user was typing, and the application would continue searching for the next word. I decided to just submit and hope for the best.
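
For reference, here is a minimal sketch of the dynamic update that was attempted, assuming the predict_next() helper and staged tables from the earlier sketches; the input IDs and layout are illustrative, not the deployed app's code.

```r
# Minimal Shiny sketch of dynamically updating radio buttons; IDs and
# the predict_next() helper are assumptions from the earlier sketches.
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Type a phrase:"),
  radioButtons("nextword", "Predicted next word:", choices = "(start typing)")
)

server <- function(input, output, session) {
  observeEvent(input$phrase, {
    preds <- predict_next(input$phrase, quadgrams, trigrams, bigrams)
    if (length(preds) > 0)
      updateRadioButtons(session, "nextword", choices = preds)
  })
}

shinyApp(ui, server)
```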

Try out the application: https://mccracmiler.shinyapps.io/CAPAPP/

Go Texas Aggies!!!