This report describes initial work towards an English text prediction application. It includes a basic description of the data used to develop the prediction algorithm and outlines plans for the application's implementation, in order to obtain feedback from the reviewer. Briefly, the project aims to produce an application that predicts the next word in a text input provided by the user. The application is expected to be accurate enough to meet user expectations and to respond quickly enough to provide an adequate user experience. Some technical terms are briefly explained for clarity.
Predicting user text input can improve the user experience on platforms where text entry is limited or difficult, such as mobile systems. An application can be programmed to predict text input using techniques known as Natural Language Processing (NLP). NLP uses large samples of actual language use to build models that can predict different features of the language with varying degrees of accuracy and complexity.
This project uses English language samples from digital media (blogs, news and Twitter) in order to predict simple English text input. The following sections describe the datasets used for the project, highlight some interesting features and present plans for the final model and application.
The English data (or corpus) come from HC Corpora and are available for download. Details about the corpus are available through the provider. The corpus comprises three files containing texts obtained from news, blogs and Twitter feeds. The following table describes the files:
| Source | File name | Size (MB) | Lines | Words | Characters |
|---|---|---|---|---|---|
| Blogs | en_US.blogs.txt | 200.42 | 899 288 | 37 334 114 | 208 623 085 |
| News | en_US.news.txt | 196.28 | 1 010 242 | 34 365 936 | 205 243 643 |
| Twitter | en_US.twitter.txt | 159.36 | 2 360 148 | 30 359 852 | 166 843 164 |
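For reference, summaries like those in the table could be produced with a short R sketch such as the following, assuming the three files are stored in a local data/ directory (the path is hypothetical):

```r
# Compute basic summaries (size, lines, words, characters) for one corpus file.
# The data/ directory is a hypothetical location for the downloaded files.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file    = basename(path),
    size_mb = round(file.size(path) / 1024^2, 2),
    lines   = length(lines),
    words   = sum(lengths(strsplit(lines, "\\s+"))),
    chars   = sum(nchar(lines))
  )
}

do.call(rbind, lapply(list.files("data", full.names = TRUE), summarise_file))
```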
The data were imported into the R statistical environment and processed to remove unwanted information (non-English characters, punctuation signs and extra spaces), to reduce unnecessary detail that could lower prediction accuracy (specific dates, addresses and groups of numbers were replaced by generic markers) and to ease further analysis and modelling (only lower-case characters are used and each sentence is stored separately). The following example shows how some typical English text changes when prepared for analysis:
Sample: “We shall meet in the place where there is no darkness” - great book! read it 3 times last February at NYPL on 5th and 42nd Street NY, 10018.

Processed: [1] we shall meet in the place where there is no darkness [2] great book [3] read it {digits} times last {date} at {address}
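A minimal sketch of this kind of cleaning in base R follows. The {digits} and {date} markers mirror the example above; the regular expressions actually used in the project (for dates, addresses, etc.) may be more elaborate:

```r
# Minimal cleaning sketch in base R, following the steps described above.
clean_text <- function(lines) {
  x <- tolower(lines)                             # use only lower-case characters
  x <- iconv(x, to = "ASCII", sub = " ")          # drop non-English characters
  x <- unlist(strsplit(x, "[.!?]+"))              # store each sentence separately
  x <- gsub("\\b(january|february|march|april|may|june|july|august|september|october|november|december)\\b",
            " {date} ", x)                        # crude date marker (month names only)
  x <- gsub("\\b[0-9]+\\b", " {digits} ", x)      # replace groups of numbers
  x <- gsub("[^a-z{} ]", " ", x)                  # remove punctuation signs
  x <- gsub("\\s+", " ", trimws(x))               # collapse extra whitespace
  x[nzchar(x)]                                    # drop empty sentences
}
```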
In NLP, each word or informative group of characters is regarded as a token. Tokens are grouped into n-grams, which are sequences of n contiguous tokens, with n usually varying between 1 and 4. N-grams are then used to estimate the probability of a given phrase, which makes it possible to use the last n-1 tokens of the input to predict the next word (the last token of an n-gram). The clean texts were further processed to produce tokens and n-grams of length 1 to 4. For example, the 3-grams from the first sample sentence above would be:
Sample: we shall meet in the place where there is no darkness
3-grams: we shall meet, shall meet in, meet in the, in the place, the place where, place where there, where there is, there is no, is no darkness
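A simple base R sketch of this n-gram generation (packages such as quanteda or tokenizers offer equivalent functionality) could look as follows:

```r
# Build all n-grams of length n from a cleaned sentence.
ngrams <- function(sentence, n) {
  tokens <- unlist(strsplit(sentence, " ", fixed = TRUE))
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

ngrams("we shall meet in the place where there is no darkness", 3)
#> [1] "we shall meet"     "shall meet in"     "meet in the"       "in the place"
#> [5] "the place where"   "place where there" "where there is"    "there is no"
#> [9] "is no darkness"
```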
The following table shows some basic summaries of the generated tokens and n-grams:
| Source | Total tokens | Median tokens per sentence | Median chars per token | Number of 2-grams | Number of 3-grams | Number of 4-grams |
|---|---|---|---|---|---|---|
| Blogs | 38 273 316 | 11 | 4 | 5 362 812 | 16 087 952 | 23 724 661 |
| News | 35 420 084 | 12 | 4 | 5 325 622 | 15 676 714 | 22 380 268 |
| Twitter | 30 771 302 | 6 | 4 | 4 030 805 | 10 605 972 | 14 170 541 |
As the numbers in the previous table reflect, the dataset used to develop the application is large (taking around 5 GB of memory when loaded completely) and the amount of information obtained from each source varies considerably (using the number of n-grams as a proxy). This has drawbacks for the application's implementation, some of which are discussed in the plans section at the end of this document.
The exploratory analysis of the generated tokens and n-grams reveals some interesting information about the text. Typical English phrases consist mostly of words that serve a grammatical function (e.g. connecting elements, such as articles), commonly termed stop words in NLP. Stop words are usually removed for language modelling because, being very common, they would otherwise dominate the predictions; they were preserved for this application, however, because additional NLP techniques are implemented that put them on a level with other words that convey more meaning and are also common in the dataset (such as nouns or verbs). Figure 1 shows the counts of the most common 3-grams found in the data:
Figure 1: Counts of the 10 most common 3-grams for each data source. Counts are presented in logarithmic scale to show detail at lower counts. 3-grams are arranged in descending order (most common 3-grams at the top of each panel).
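As an illustration, the counts behind a figure like Figure 1 can be obtained with a few lines of base R, assuming the 3-grams for one source are held in a character vector called trigrams (a hypothetical name):

```r
# Count and rank the 3-grams for one source; `trigrams` is a hypothetical
# character vector holding all 3-grams generated for that source.
top_trigrams <- sort(table(trigrams), decreasing = TRUE)
head(top_trigrams, 10)   # the 10 most common 3-grams, as shown in Figure 1
```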
Based on the prepared tokens and n-grams, a Shiny application will be developed and published on the shinyapps.io platform. The application will receive plain text input from the user (in a text box) and output the predicted next word. The prediction will be based on a 4-gram model, with the additional features described below.
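A minimal sketch of how such a 4-gram lookup could work is shown below. It assumes the n-grams are stored in data frames (ng4, ng3, ng2, hypothetical names) with columns prefix (the first n-1 tokens), word (the last token) and count, and it backs off to shorter n-grams when no match is found; the actual model may use different data structures, smoothing and backoff weights:

```r
# Predict the next word from the last tokens of the input, backing off from
# 4-grams to 3-grams to 2-grams. `ng4`, `ng3` and `ng2` are hypothetical data
# frames with columns `prefix`, `word` and `count`.
predict_next <- function(input, ng4, ng3, ng2) {
  tokens <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  tables <- list(ng4, ng3, ng2)
  orders <- c(4, 3, 2)
  for (i in seq_along(tables)) {
    n <- orders[i]
    if (length(tokens) < n - 1) next
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    hits <- tables[[i]][tables[[i]]$prefix == prefix, ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$count)])
  }
  "the"   # fall back to a very common unigram when nothing matches
}
```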
A complete dictionary has been used so far for testing and an initial accuracy assessment, but, given the limited resources available to the hosted app, a reduced dictionary will be used. Figure 2 depicts the dictionary size (MB of RAM used when loaded) as a function of the minimum count required to include a particular n-gram:
Figure 2: Dictionary size given minimum n-gram count
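The kind of measurement behind Figure 2 can be sketched as follows, assuming an n-gram table such as ng4 (hypothetical name) stored as a data frame with a count column:

```r
# Size (in MB) of an n-gram table after removing entries below a minimum count.
# `ng4` is a hypothetical data frame with one row per 4-gram and a `count` column.
prune_size_mb <- function(ng, min_count) {
  pruned <- ng[ng$count >= min_count, ]
  as.numeric(object.size(pruned)) / 1024^2
}

sapply(1:10, function(k) prune_size_mb(ng4, k))   # sizes for minimum counts 1 to 10
```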
The coverage and accuracy of dictionaries compressed to different sizes have not yet been evaluated, but this is considered in the next steps. Since a long load time can discourage users from adopting the application, the use of an index-based dictionary will be considered, so that only the entries of interest are read when needed instead of loading all the data.
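One way such an index-based dictionary could work is sketched below with RSQLite: the n-gram tables are stored in an indexed SQLite file and only the rows matching the input prefix are read. The file and table names are hypothetical, and this is only one of the options under consideration:

```r
# Query only the 4-grams whose prefix matches the user input, instead of loading
# the whole dictionary into memory. File and table names are hypothetical.
library(DBI)
con  <- DBI::dbConnect(RSQLite::SQLite(), "ngrams.sqlite")
hits <- DBI::dbGetQuery(
  con,
  "SELECT word, count FROM ng4 WHERE prefix = ? ORDER BY count DESC LIMIT 3",
  params = list("there is no")
)
DBI::dbDisconnect(con)
hits   # candidate next words for the prefix "there is no"
```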
Besides the language model features described above, the application will provide the following features to improve user experience:
Suggestions and feedback are welcome, and should be provided in the evaluation form. Specific ideas regarding app implementation or technical concerns can be recorded as issues in the app’s repository. Thanks for your time and consideration.