Data Science Capstone Project Presentation

Lian Rui
September 27, 2018.

This is the summary report on the capstone project of the “Data Science Specialization” offered by Johns Hopkins University.

Language modeling is fundamental to natural language processing (NLP);
It is the foundation of many advanced NLP applications, such as machine translation, meaning abstraction, etc.
In this project, the goal is to build a word prediction model;
The raw language corpus was provided by SwiftKey at: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The n-gram approach was used to build the language model:

25% of the total language corpus was randomly selected as the training dataset;
Raw text data was processed with the quanteda package (https://quanteda.io/).
Specifically:
1. The data was tokenized into unigrams, bigrams and trigrams;
2. The frequency of each n-gram was calculated with the dfm function in the quanteda package;
3. A search function was used to look up the matching grams in each n-gram data frame;
4. The stupid backoff approach of Brants et al. was applied as the smoothing scheme (http://www.aclweb.org/anthology/D07-1090.pdf)
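
The steps above can be sketched roughly as follows. This is only an illustration in plain Python, not the project code (which uses quanteda in R and is linked below). The function names count_ngrams, score and predict, the toy sentences, and the whitespace tokenizer are all assumptions made for the example; the backoff factor 0.4 is the value suggested by Brants et al.

```python
from collections import Counter

ALPHA = 0.4  # backoff factor suggested by Brants et al.


def count_ngrams(sentences, max_n=3):
    """Count all n-grams of order 1..max_n (hypothetical helper)."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenizer, for illustration only
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def score(word, context, counts):
    """Stupid backoff score of `word` following `context` (a list of words)."""
    order = max(counts)
    # Keep at most (order - 1) words of context.
    context = tuple(context)[-(order - 1):] if order > 1 else ()
    penalty = 1.0
    while context:
        n = len(context) + 1
        seen = counts[n][context + (word,)]
        if seen > 0:
            # Relative frequency of the full n-gram given its context.
            return penalty * seen / counts[n - 1][context]
        # Not seen: drop the oldest context word and apply the penalty.
        context = context[1:]
        penalty *= ALPHA
    total = sum(counts[1].values())
    return penalty * counts[1][(word,)] / total


def predict(context, counts, k=3):
    """Return the k highest-scoring next words (ties broken alphabetically)."""
    vocab = [w for (w,) in counts[1]]
    ranked = sorted(vocab, key=lambda w: (-score(w, context, counts), w))
    return ranked[:k]


# Toy usage with made-up sentences.
sentences = ["i love data science", "i love new york", "i like data"]
counts = count_ngrams(sentences)
print(predict(["i", "love"], counts, k=2))
```

Note that the scores are not probabilities (they do not sum to one), which is what makes the scheme cheap to compute; scoring every vocabulary word, as predict does here, is fine for a toy example but would be slow on a real corpus.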
A more detailed report and the code can be found at:
https://github.com/Rui-Lian/datascience-capstone/blob/master/Datascinece%20final%20report.Rmd
https://github.com/Rui-Lian/datascience-capstone/blob/master/app.R

There are several limitations in this simple case:

1. I applied the stupid backoff method of Brants et al., but its key assumption is a very large corpus. So far I don't know whether the current training set is big enough to meet that assumption;
2. I didn't compare the stupid backoff approach with more sophisticated smoothing methods, for example Kneser-Ney smoothing;
3. I didn't set aside a test data set to quantify the accuracy of the model;
4. In terms of programming, the current code is somewhat cumbersome; further work could certainly improve its efficiency.
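
For limitation 3, one possible way to quantify accuracy is to hold out part of the corpus and measure how often the true next word appears among the top k predictions. The sketch below is only an assumption about how this could be done, not something carried out in the project. The names split_corpus, train_bigram_predictor and top_k_accuracy are hypothetical, and the simple bigram predictor is a stand-in for the real model.

```python
import random
from collections import Counter, defaultdict


def split_corpus(sentences, test_fraction=0.2, seed=42):
    """Shuffle and split sentences into (train, test) lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_bigram_predictor(sentences):
    """Stand-in model: predict the words most often seen after the last word."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1

    def predict(context, k=3):
        if not context or context[-1] not in following:
            return []
        return [w for w, _ in following[context[-1]].most_common(k)]

    return predict


def top_k_accuracy(predict, sentences, k=3):
    """Fraction of positions where the true next word is in the top k."""
    hits = total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            total += 1
            if tokens[i] in predict(tokens[:i], k):
                hits += 1
    return hits / total if total else 0.0


# Toy usage with made-up sentences.
train = ["a b c", "a b d", "a b c"]
test = ["a b c", "a x"]
model = train_bigram_predictor(train)
print(top_k_accuracy(model, test, k=1))
```

The same top_k_accuracy function could be used to compare stupid backoff with another smoothing method on identical held-out data, which would also address limitation 2.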

I've learnt a great deal from this capstone project and the JHU Data Science Specialization.

Many thanks to the great professors of the Data Science Specialization: Jeff Leek, Roger D. Peng and Brian Caffo;
Many thanks to the mentors in the forum of the Data Science Specialization;
Many thanks to the students who reviewed and commented on every one of my assignments.