Executive summary

This report describes the initial work performed to produce an English text prediction application. It includes a basic description of the data used to develop the prediction algorithm and outlines plans for the application implementation, in order to obtain feedback from the reviewer. Briefly, the project aims to produce an application which predicts the next word in a text input provided by the user. The application is expected to be accurate enough to meet user expectations and responsive enough to provide an adequate user experience. Some technical terms are briefly explained for clarity.


Details about the project

Text prediction of user input can be valuable to improve the user experience on platforms with limited or cumbersome text input facilities, such as mobile devices. An application can be programmed to predict text input using techniques known as Natural Language Processing (NLP). NLP uses large samples of actual language use to develop models which can predict different features of the language with varying degrees of accuracy and complexity.

This project uses English language samples from digital media (blogs, news and Twitter) in order to predict simple English text input. The following sections describe the datasets used for the project, highlight some interesting features and present plans for the final model and application.

Descriptive analysis of the dataset

The English data (or corpus) come from HC Corpora and are available for download. Details about the corpus are available through the provider. The corpus comprises three files, containing texts obtained from news, blogs and Twitter feeds. The following table describes the files:

File description: Some characteristics of the data files

Source    File name            Size (MB)    Lines        Words         Characters
Blogs     en_US.blogs.txt      200.42         899,288    37,334,114    208,623,085
News      en_US.news.txt       196.28       1,010,242    34,365,936    205,243,643
Twitter   en_US.twitter.txt    159.36       2,360,148    30,359,852    166,843,164

The data were imported into the R statistical environment and processed to remove unwanted information (non-English characters, punctuation signs and extra whitespace), to reduce unnecessary details which could hurt prediction accuracy (specific dates, addresses and groups of digits were replaced by generic markers), and to make further analysis and modelling easier (only lower-case characters were kept and each sentence was stored separately). The following example shows how some typical English text changes when prepared for analysis:

Sample:
“We shall meet in the place where there is no darkness” - great book! read it 3 times last February at NYPL on 5th and 42nd Street NY, 10018.

Processed:
[1] we shall meet in the place where there is no darkness
[2] great book
[3] read it {digits} times last {date} at {address}
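
As an illustration, a minimal sketch of this kind of cleaning in R follows (assuming the stringi package; the actual pipeline may use different tools, and the {date} and {address} replacements, which require more elaborate patterns, are omitted here):

  library(stringi)

  clean_sentences <- function(text) {
    # split the raw text into sentences before further cleaning
    sentences <- unlist(stri_split_boundaries(text, type = "sentence"))
    # replace groups of digits with a generic marker
    sentences <- gsub("[0-9]+", " {digits} ", sentences)
    # keep only lower-case letters, apostrophes and the generic markers
    sentences <- stri_trans_tolower(sentences)
    sentences <- gsub("[^a-z'{} ]", " ", sentences)
    # collapse extra whitespace and trim
    sentences <- gsub("\\s+", " ", sentences)
    trimws(sentences)
  }

  clean_sentences('"We shall meet in the place where there is no darkness" - great book!')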


Basic summaries

In NLP, each word or informative group of characters is regarded as a token. Tokens are grouped into n-grams, i.e. sequences of n contiguous tokens, with n usually ranging from 1 to 4. N-grams are then used to estimate the probability of a given phrase, which makes it possible to use the last n-1 tokens of the input to predict the next word (the last token of an n-gram). The clean texts were further processed to produce tokens and n-grams of length 1 to 4. For example, the 3-grams from the first sample sentence above would be:

Sample: we shall meet in the place where there is no darkness
3-grams: we shall meet, shall meet in, meet in the, in the place, the place where, place where there, where there is, there is no, is no darkness
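
As an illustration, a minimal sketch of how such n-grams could be generated from a cleaned sentence (base R only; the actual processing may rely on dedicated packages such as tokenizers or quanteda):

  make_ngrams <- function(sentence, n = 3) {
    tokens <- strsplit(sentence, "\\s+")[[1]]
    if (length(tokens) < n) return(character(0))
    # slide a window of n tokens across the sentence
    sapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }

  make_ngrams("we shall meet in the place where there is no darkness", n = 3)
  # [1] "we shall meet" "shall meet in" "meet in the" ...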

The following table shows some basic summaries for the generated tokens and n-grams:

Basic summaries: Some general information about the English text data

Source    Total tokens    Median tokens/sentence    Median chars/token    2-grams      3-grams       4-grams
Blogs     38,273,316      11                        4                     5,362,812    16,087,952    23,724,661
News      35,420,084      12                        4                     5,325,622    15,676,714    22,380,268
Twitter   30,771,302       6                        4                     4,030,805    10,605,972    14,170,541
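
As an illustration, a minimal sketch of how these summaries could be computed from a vector of cleaned sentences (the object name sentences is an assumption):

  token_lists         <- strsplit(sentences, "\\s+")
  tokens_per_sentence <- lengths(token_lists)
  total_tokens        <- sum(tokens_per_sentence)
  median_tokens       <- median(tokens_per_sentence)
  median_chars        <- median(nchar(unlist(token_lists)))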

As the numbers in the previous table show, the dataset used to develop the application is large (taking around 5 GB of memory when loaded completely) and the amount of information obtained from each source varies considerably (using the number of n-grams as a proxy). This creates some drawbacks for the application implementation, which are discussed in the plans section at the end of this document.


Features in the data

The exploratory analysis of the generated tokens and n-grams shows some interesting information about the text. Typical English phrases consist mostly of words which serve a grammatical function (e.g. connecting elements, like articles), commonly termed stop words in NLP. Stop words are usually removed before language modelling because, being very common, they would otherwise dominate the predictions; they were preserved for this application, however, because additional NLP techniques in the implementation put them on a more even footing with other words which convey more meaning and are also common in the dataset (such as nouns or verbs). Figure 1 shows the counts of the most common 3-grams found in the data:

Figure 1: Counts of the 10 most common 3-grams for each data source. Counts are presented in logarithmic scale to show detail at lower counts. 3-grams are arranged in descending order (most common 3-grams at the top of each panel).
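
As an illustration, a minimal sketch of how such counts could be produced, assuming all_3grams is a character vector holding every generated 3-gram:

  # count each distinct 3-gram and inspect the 10 most frequent ones
  ngram_counts <- sort(table(all_3grams), decreasing = TRUE)
  head(ngram_counts, 10)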



Plans for prediction algorithm and shiny app

Based on the prepared tokens and n-grams, a shiny application will be developed and published on the shinyapps.io platform. The application will receive plain text input from the user (in a text box) and output the predicted next word. The prediction will be based on a 4-gram model, with the following additional features (a minimal sketch of the back-off lookup is shown after the list):

  • Back-off: use lower-order n-grams when no data are available for the full prefix (a smoothing technique)
  • Discounting: reduce the estimated probability of n-grams with very low counts
  • Bag of words: increase the coverage by using pairs of words co-occurring in a sentence (regardless of order or distance), in addition to n-grams (n contiguous words)
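
A minimal sketch of the intended back-off lookup follows, assuming the n-gram counts are stored in a data frame ngrams with columns prefix, word and count (the column names and structure are assumptions; discounting and the bag-of-words extension are omitted):

  # Look up the next word for a cleaned input, backing off from the 4-gram
  # prefix to shorter prefixes when the longer one has not been observed.
  predict_next <- function(input, ngrams, max_n = 4) {
    tokens <- strsplit(tolower(input), "\\s+")[[1]]
    for (n in max_n:2) {
      if (length(tokens) < n - 1) next
      prefix <- paste(tail(tokens, n - 1), collapse = " ")
      hits <- ngrams[ngrams$prefix == prefix, ]
      if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
    }
    NA_character_  # no prediction available
  }

  # predict_next("there is no", ngrams)   # e.g. "darkness" if that 4-gram was observed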

So far a complete dictionary has been used for testing and initial accuracy assessment, but due to the limited resources available for the hosted app, a reduced dictionary will be used in the published version. Figure 2 depicts the dictionary size (MB of RAM used when loaded) as a function of the minimum count required to include a particular n-gram:

Figure 2: Dictionary size given minimum n-gram count
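
As an illustration, a minimal sketch of this kind of pruning, assuming the same ngrams data frame as in the prediction sketch above:

  # keep only n-grams observed at least min_count times and report the
  # memory footprint of the reduced dictionary
  prune_dictionary <- function(ngrams, min_count = 2) {
    reduced <- ngrams[ngrams$count >= min_count, ]
    size_mb <- as.numeric(object.size(reduced)) / 1024^2
    message(sprintf("kept %d n-grams, %.1f MB", nrow(reduced), size_mb))
    reduced
  }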

The coverage and accuracy of dictionaries reduced to different sizes have not yet been evaluated, but this is planned as one of the next steps. Since a long load time can discourage users from adopting the app, an index-based dictionary will also be considered, so that instead of loading all the data at start-up, only the entries of interest are read when needed.
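
One possible way to achieve this (an assumption, not a committed design) is to store the n-grams in an SQLite file and query only the rows matching the current prefix, for example:

  library(DBI)
  library(RSQLite)

  con <- dbConnect(RSQLite::SQLite(), "ngrams.sqlite")  # assumes a table ngrams(prefix, word, count)
  # an index on the prefix column lets queries read only the matching rows
  dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_prefix ON ngrams (prefix)")
  hits <- dbGetQuery(con,
    "SELECT word, count FROM ngrams WHERE prefix = ? ORDER BY count DESC LIMIT 5",
    params = list("is no"))
  dbDisconnect(con)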

Besides the language model features described above, the application will provide the following features to improve the user experience (a minimal sketch of the interface follows the list):

  • Auto-complete for the input text box, to ease typing when no prediction is requested (shows a list of common words which start with the letters already typed)
  • List of other possible predictions, to provide alternatives in case the main prediction is not satisfactory
  • Option to download the log file of the user’s session, to examine closely the app in action.
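
As an illustration, a minimal sketch of the intended interface (widget names and layout are assumptions; the auto-complete, alternatives list and log download are omitted):

  library(shiny)

  ui <- fluidPage(
    textInput("user_text", "Type your text:"),
    textOutput("prediction")
  )

  server <- function(input, output) {
    output$prediction <- renderText({
      if (nchar(input$user_text) == 0) return("")
      # predict_next and ngrams come from the prediction sketch above
      predict_next(input$user_text, ngrams)
    })
  }

  shinyApp(ui, server)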

Suggestions and collaboration

Suggestions and feedback are welcome, and should be provided in the evaluation form. Specific ideas regarding app implementation or technical concerns can be recorded as issues in the app’s repository. Thanks for your time and consideration.