Data Science - Capstone Project Presentation

Pierre Baudin
31 August 2017

Text processing, modeling and prediction for text typing.

About the author:

Pierre Baudin

Data analyst for product quality assurance in China.

Project Background

Typing on mobile devices is becoming mainstream for all sorts of activities. To improve efficiency and reduce the time spent typing, an algorithm can predict the next word or group of words from the existing text.

The application developed includes the following elements:

  • Input text from user
  • Live processing of the input and live prediction of the next word(s)
  • Presentation to the user of four different options, one for each n-gram model used.

Data Processing and Prediction Model

  • Data Processing for the model construction

The raw data consists of lines of text sampled from news articles, blogs and Twitter. The data is cleaned to remove numbers, punctuation, white space… Following this step, an n-gram model is applied to the clean dataset and outputs each n-gram string with its frequency. For this application, 2-gram, 3-gram, 4-gram and 5-gram models are computed.
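The cleaning and counting steps can be sketched as follows. This is a minimal illustration in Python, not the code behind the app (which ran on an R server); the function names `clean_line` and `count_ngrams`, the exact cleaning rules and the sample lines are assumptions made for the example.

```python
import re
from collections import Counter

def clean_line(line):
    """Lowercase a line and keep only letters, apostrophes and single spaces."""
    line = line.lower()
    line = re.sub(r"[^a-z' ]+", " ", line)    # drop digits, punctuation, symbols
    return re.sub(r"\s+", " ", line).strip()  # collapse repeated white space

def count_ngrams(lines, n):
    """Return a Counter mapping each n-word string to its frequency."""
    counts = Counter()
    for line in lines:
        words = clean_line(line).split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical sample standing in for the news, blog and Twitter lines.
sample = ["See you at 8, see you soon!", "See you later."]

# One frequency table per model order, 2-gram to 5-gram.
tables = {n: count_ngrams(sample, n) for n in range(2, 6)}
```

With this sample, the 2-gram table counts "see you" three times, while the 5-gram table only contains entries from the first line, since the second line is shorter than five words.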

  • Prediction Model operation

The prediction model compares the user input with the strings available in the n-gram dataset. The matching n-gram with the highest frequency is returned. For an input of more than one word, the prediction model looks for every instance of the input combination in the n-gram dataset and returns the best match based on the highest frequency.
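A highest-frequency lookup of this kind could look like the sketch below. It assumes the n-gram dataset is a mapping from n-gram string to frequency; `predict_next` and the toy 3-gram table are hypothetical names and data, not taken from the app.

```python
def predict_next(ngram_counts, text, n):
    """Return the most frequent word following the last n-1 words of text, or None."""
    words = text.lower().split()
    if len(words) < n - 1:
        return None  # not enough input words for this model order
    # The trailing space makes the match stop on a word boundary.
    prefix = " ".join(words[-(n - 1):]) + " "
    best_word, best_count = None, 0
    for ngram, count in ngram_counts.items():
        if ngram.startswith(prefix) and count > best_count:
            best_word, best_count = ngram[len(prefix):], count
    return best_word

# Toy 3-gram table with made-up frequencies.
trigram_counts = {"see you soon": 4, "see you later": 7, "thank you very": 5}

prediction = predict_next(trigram_counts, "Hope to see you", 3)  # "later"
```

Scanning every entry keeps the example short; a table indexed by prefix would avoid the full scan on each keystroke.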

  • Live User Data Processing operation

The user text input is regularly scanned and processed by the R server to feed clean characters and words to the prediction algorithm.

  • Live prediction operation

The clean user text is fed to the prediction model and four options are returned, one from each of the four n-gram datasets available.
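One way to produce the four options is to index each n-gram table by its prefix and query every model order with the matching number of trailing input words. The sketch below is an assumption about how this could be organised; `build_index`, `four_options` and the toy tables are illustrative only.

```python
def build_index(ngram_counts):
    """Map each (n-1)-word prefix to its most frequent next word."""
    best = {}
    for ngram, count in ngram_counts.items():
        prefix, _, last = ngram.rpartition(" ")
        if prefix not in best or count > best[prefix][1]:
            best[prefix] = (last, count)
    return {prefix: word for prefix, (word, _) in best.items()}

def four_options(indexes, text):
    """Return one prediction per model order n (2 to 5); None where nothing matches."""
    words = text.lower().split()
    options = {}
    for n in (2, 3, 4, 5):
        if len(words) >= n - 1:
            options[n] = indexes[n].get(" ".join(words[-(n - 1):]))
        else:
            options[n] = None  # input shorter than the prefix this model needs
    return options

# Toy tables with made-up frequencies, one per model order.
tables = {
    2: {"you soon": 3, "you later": 5},
    3: {"see you soon": 4, "see you later": 2},
    4: {"to see you again": 1},
    5: {},
}
indexes = {n: build_index(table) for n, table in tables.items()}

options = four_options(indexes, "Hope to see you")
```

Here the 2-gram model proposes "later", the 3-gram model "soon" and the 4-gram model "again", while the 5-gram model has no match and returns nothing.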

Application Presentation

Screenshot of the app interface

To use the app, simply enter your text in the input box.

Predictions will automatically appear in the table below.

Discussion

  • Architecture:

The app operation and prediction algorithm rely on a prepared dataset to produce the prediction. One improvement would be to keep only the most used expressions, selected with a cut-off ratio, and to use the actual user input to learn the user's own language and structure. The app could then propose tailored predictions based on previous input.
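Both ideas can be sketched in a few lines. The example assumes the cut-off ratio means the share of distinct n-grams to keep, ranked by frequency, which is only one possible reading; `prune`, `learn` and the `weight` parameter are hypothetical.

```python
from collections import Counter

def prune(ngram_counts, keep_ratio):
    """Keep the most frequent n-grams, up to keep_ratio of the distinct entries."""
    keep = max(1, int(len(ngram_counts) * keep_ratio))
    return Counter(dict(Counter(ngram_counts).most_common(keep)))

def learn(ngram_counts, user_text, n, weight=1):
    """Add the n-grams typed by the user to the table, in place."""
    words = user_text.lower().split()
    for i in range(len(words) - n + 1):
        ngram_counts[" ".join(words[i:i + n])] += weight

# Toy 2-gram table with made-up frequencies.
base = Counter({"see you": 5, "you soon": 3, "at noon": 1, "so far": 1})

pruned = prune(base, 0.5)              # keeps "see you" and "you soon"
learn(pruned, "See you tomorrow", 2)   # adds the user's own 2-grams
```

A `weight` above 1 would let the user's own expressions overtake the prepared dataset faster, at the cost of reacting more strongly to one-off inputs.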

  • Limitations:

The predictions presented in this app can seem limited. This is due to the limited computing power available for processing the n-gram model. This could be improved by determining the optimum raw dataset size to ensure significant word coverage.
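Word coverage can be estimated from word frequencies alone, which gives a cheap way to compare candidate sample sizes before building the n-gram tables. The sketch below is one possible measure, not the method used for the app; `words_for_coverage` and the counts are made up for illustration.

```python
from collections import Counter

def words_for_coverage(word_counts, target):
    """Return how many of the most frequent words cover `target` of all word instances."""
    total = sum(word_counts.values())
    covered = 0
    for rank, (_, count) in enumerate(word_counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(word_counts)

# Toy word frequencies: 100 word instances over 4 distinct words.
counts = Counter({"the": 50, "to": 30, "cat": 15, "zebra": 5})

half = words_for_coverage(counts, 0.5)   # 1 word covers 50% of instances
most = words_for_coverage(counts, 0.9)   # 3 words are needed for 90%
```

Running the same measure on samples of increasing size shows where the number of words needed for a given coverage stops changing much, which is one indication that a larger sample adds little.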