B. McCracken - Tech Entrepreneur
This is the final project report for the Johns Hopkins University Data Science Specialization Capstone course. The project demonstrates the use of Natural Language Processing tools to build a model that predicts the next word typed in a sentence. The project uses several language processing packages:
The tm package was the primary package used in this project.
The final application is deployed on the shiny server at: https://mccracmiler.shinyapps.io/CAPAPP/
The first step in the project is to prepare the environment, create a sample corpus, and build the ngram files. An “n-gram” is a sequence of “n” consecutive words extracted from text. The data set from SwiftKey supports analysis of multiple languages: English, German, Russian and Finnish. This project focuses on the English version. The entire corpus of three documents is 510 MB, which is too large to manipulate. A sample of each document was selected and then written out to a sample directory.
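As a rough illustration of the sampling step, the sketch below keeps a random fraction of lines while reading the source one line at a time, so the full file never has to sit in memory. It is written in plain Python rather than the R packages used in this project, and the function names, the seed, and the idea of sampling line by line are all assumptions made for the example.

```python
import random
from typing import Iterable, Iterator


def sample_lines(lines: Iterable[str], fraction: float, seed: int = 1234) -> Iterator[str]:
    """Yield each line with probability `fraction`, one line at a time."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    for line in lines:
        if rng.random() < fraction:
            yield line


def write_sample(src_path: str, dst_path: str, fraction: float, seed: int = 1234) -> int:
    """Stream src_path into dst_path, keeping roughly `fraction` of the lines.

    Returns the number of lines written. Both paths are supplied by the
    caller; nothing here assumes a particular corpus layout.
    """
    kept = 0
    with open(src_path, encoding="utf-8", errors="ignore") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in sample_lines(src, fraction, seed):
            dst.write(line)
            kept += 1
    return kept
```

Calling `write_sample` once per corpus file, with a smaller fraction for the largest file, would produce a sample directory of the kind described above. The sample size is only approximately `fraction` of the input, because each line is kept or dropped independently.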
Steps taken to prepare the data for the application are as follows:
Extract ngrams from Corpus: The corpus has three files: a US news file, a file of blog extracts, and a file of tweets. A sample of all the files was chosen for analysis to build a broader range of phrases and terms upon which to build an application dictionary. In cleaning the data, stemming and removing stop words produced somewhat meaningless words and phrases. Several combinations of samples were attempted to create the largest number of ngrams for use in the app. My 64-bit Windows machine with 8 GB of RAM was not capable of processing a sample of more than 15% of the Twitter file. My application relies heavily on the ngrams created from tweets, as this is most likely what testers of the application will use. I used RWeka as a tokenizer and tm to create ngrams from the TermDocumentMatrix. There may be more memory-efficient ngram creators, but I was not able to locate them. I proceeded as follows:
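To show what extracting ngrams from sampled text involves, here is a minimal Python sketch that counts ngrams with a dictionary-style counter. It is not the RWeka and tm code used in this project. The tokenizer pattern (lowercase letters and apostrophes only) and the function names are assumptions chosen for the example. Stop words are kept and no stemming is applied, in line with the observation above that removing them produced less meaningful phrases.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumed tokenizer: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def count_ngrams(lines: Iterable[str], n: int) -> Counter:
    """Count every run of n consecutive tokens, line by line."""
    if n < 1:
        raise ValueError("n must be at least 1")
    counts: Counter = Counter()
    for line in lines:
        tokens = tokenize(line)
        # A line with fewer than n tokens contributes nothing.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

Because ngrams are counted within each line, phrases never span two tweets or two blog entries. Counting one line at a time also avoids building a full term-document matrix, which may help on a memory-constrained machine, although that was not measured here.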
THE APPLICATION: The application is fairly simple.
Overall, memory limitations in R prevented me from using the entire corpus for analysis. As a result, I have tried to be very efficient in the use of data to achieve the objective. It has been very useful to test the capabilities of the tm, quanteda, and RWeka packages.
I had great difficulty with Shiny, as I could not dynamically update the radio buttons as I would have liked. Ideally, the next predicted words would be presented to the user, and the user would select one. The selected word would be appended to the phrase the user was typing, and the application would then continue to search for the next word. I decided to just submit and hope for the best.
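The ideal interaction described above (suggest words, let the user pick one, append it, search again) can be sketched independently of any user interface. The Python below is only an illustration. The lookup strategy (most frequent continuation, falling back to a shorter context when the longer one is unseen) is an assumption and is not necessarily how the deployed app works, and `build_model`, `predict`, and `accept` are hypothetical names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Model = Dict[Tuple[str, ...], Counter]


def build_model(sentences: List[str], max_n: int = 3) -> Model:
    """Map each context of 1 to max_n - 1 words to counts of the next word."""
    model: Model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                model[context][tokens[i + n - 1]] += 1
    return model


def predict(model: Model, phrase: str, k: int = 3, max_n: int = 3) -> List[str]:
    """Return up to k candidate next words, trying the longest context first."""
    tokens = phrase.lower().split()
    for size in range(min(max_n - 1, len(tokens)), 0, -1):
        context = tuple(tokens[-size:])
        if context in model:  # membership test does not create new keys
            return [word for word, _ in model[context].most_common(k)]
    return []  # nothing seen for any suffix of the phrase


def accept(phrase: str, word: str) -> str:
    """Append the word the user selected to the phrase typed so far."""
    return f"{phrase.strip()} {word}"
```

In a user interface, the list returned by `predict` would populate the choices, and the string returned by `accept` would become the new input for the next call to `predict`.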
Try out the application: https://mccracmiler.shinyapps.io/CAPAPP/