Coursera DSS Capstone Project
Smita Desai
January 24, 2016
Exploratory Analysis
- This prediction algorithm uses the SwiftKey data set, which consists of text from blogs, news and Twitter.
- The initial exploratory analysis looked at what the data contained, including the most frequently used words.
- It quickly became clear that, given the available hardware resources, the data would have to be processed in chunks; this final analysis uses only 7% of the data.
- This sample data was loaded and cleaned up in the following steps:
  - Remove punctuation.
  - Remove numbers.
  - Remove stopwords. These are words that add no value to the final analysis, such as a, an and the.
  - Remove profanities.
  - Strip whitespace.
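The cleaning steps above can be sketched as follows. This is a minimal Python illustration, not the code used in the project: the function name `clean_text`, the tiny `STOPWORDS` and `PROFANITIES` sets, and the lowercasing step are all assumptions made for the example.

```python
import re
import string

# Hypothetical, tiny word lists for illustration only; a real run would use
# a full stopword list and a full profanity list.
STOPWORDS = {"a", "an", "the"}
PROFANITIES = {"darn"}


def clean_text(text):
    """Apply the cleaning steps listed above to one line of text."""
    # Assumption: lowercase first so that word-list matching is case-insensitive.
    text = text.lower()
    # Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove numbers.
    text = re.sub(r"\d+", "", text)
    # Remove stopwords and profanities.
    words = [w for w in text.split()
             if w not in STOPWORDS and w not in PROFANITIES]
    # Strip extra whitespace by rejoining the words with single spaces.
    return " ".join(words)
```

For example, `clean_text("The  cat, an 3 darn dogs!")` returns `"cat dogs"`.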
Final Analysis
- The three components (blogs, news and Twitter) were then combined into a corpus.
- A TermDocumentMatrix was created from the combined corpus and an n-gram model was applied, where n = 2 through 4.
- The TermDocumentMatrix gives the frequency of each phrase in each of the n-gram models. Prediction begins at n = 4 and backs off to a lower n if no match is found.
- These frequency tables are then used by the Word Predictor App.
- The challenge was to find an output format that is scalable and fast.
- Various output formats were considered, including RData and RDS; SQLite was ultimately chosen as the best option. This format also allows SQL queries to be used both to write data to and to read data from the above-mentioned frequency tables.
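As a rough illustration of the steps above, the sketch below counts n-grams for n = 2 through 4 and stores them in a SQLite table. It is a Python stand-in rather than the project's own code, and the table name `ngrams`, its columns (`n`, `prefix`, `word`, `freq`) and the function names are hypothetical choices made for the example.

```python
import sqlite3
from collections import Counter


def ngram_counts(lines, n):
    """Count every run of n consecutive words across the cleaned lines."""
    counts = Counter()
    for line in lines:
        words = line.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


def build_db(lines, path=":memory:"):
    """Store n-gram frequencies for n = 2..4 in a SQLite table.

    Each n-gram is split into a prefix (the first n - 1 words) and the
    final word, so that a prediction is a simple lookup by prefix.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams "
        "(n INTEGER, prefix TEXT, word TEXT, freq INTEGER)"
    )
    for n in (2, 3, 4):
        rows = []
        for phrase, freq in ngram_counts(lines, n).items():
            prefix, word = phrase.rsplit(" ", 1)
            rows.append((n, prefix, word, freq))
        conn.executemany("INSERT INTO ngrams VALUES (?, ?, ?, ?)", rows)
    # An index on (n, prefix) keeps lookups fast as the table grows.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_prefix ON ngrams (n, prefix)")
    conn.commit()
    return conn
```

Splitting each phrase into a prefix and a final word is one possible design; it lets the app retrieve candidate next words with a single indexed query.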
The Word Predictor App
- The SQLite DB file has been loaded onto the Shiny server.
- The app retrieves the partial sentence entered by the user and cleans it up using the steps listed under Exploratory Analysis.
- Then the n-gram model is applied to the input sentence, where n = 5.
- If an appropriate prediction is not returned, Stupid Backoff is used: the n-gram model is downgraded to a lower n. This is applied independently.
- Up to five predictions are returned to the user.
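The lookup described above can be sketched as below. This Python example is a simplified back-off by rank: it does not apply the discounted scores of full Stupid Backoff, and it is not the app's actual code. The `ngrams` table layout (`n`, `prefix`, `word`, `freq`), the function name `predict`, and the default `max_n=4` (chosen to match frequency tables built for n = 2 through 4) are assumptions.

```python
import sqlite3


def predict(conn, text, max_results=5, max_n=4):
    """Return up to max_results next-word predictions for the input text."""
    words = text.lower().split()
    predictions = []
    # Start with the longest context and back off to shorter ones.
    for n in range(max_n, 1, -1):
        if len(words) < n - 1:
            continue  # not enough input words for this order
        prefix = " ".join(words[-(n - 1):])
        rows = conn.execute(
            "SELECT word FROM ngrams WHERE n = ? AND prefix = ? "
            "ORDER BY freq DESC, word",
            (n, prefix),
        ).fetchall()
        for (word,) in rows:
            if word not in predictions:
                predictions.append(word)
            if len(predictions) == max_results:
                return predictions
    return predictions
```

Higher-order matches are listed first, and lower-order matches only fill the remaining slots, so the result never exceeds five words by default.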