This report provides a brief overview of the exploratory analyses conducted in preparation for building a text prediction model and a text prediction app. The model and app are being built to fulfill the requirements of the JHDSS Capstone Project course. A text prediction app assists a user in typing by suggesting meaningful completions; for example, if a user types “baa baa black”, then “sheep” would be the suggestion (or one of the suggestions) for the next word.
Building a prediction model and an app for a given language requires data in that language, from which features of the language can be discovered and learned. The data for the project is provided as a zipped download of approximately 562 MB. It contains text corpora gathered from 3 types of sources {news, blogs, twitter} in 4 different languages {English, German, Finnish, Russian}. Only the English sections of the corpora are explored in this report.
The English section of the data comes in 3 files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (in the en_US subfolder). The filenames indicate the source of the text data.
The line counts and word counts for the English data are summarized in the following table. Note that the word counts are approximate, since they depend on the tokenization scheme employed.
| Data Source | Line Count | Total Word Count | Unique Word Count |
|---|---|---|---|
| Blogs | 899,288 | 36,636,565 | 471,357 |
| News | 1,010,242 | 33,256,428 | 343,319 |
| Twitter | 2,360,148 | 28,821,930 | 490,359 |
These figures include stopwords (words such as a, an, the, at, and be) and profanities.
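To make the dependence on tokenization concrete, the following is a minimal Python sketch of how such counts could be computed. It is not the code used to produce the table above: the regular expression, the lowercasing step, and the function names `tokenize`, `summarize_lines`, and `summarize_file` are all assumptions made for illustration, and a different tokenization scheme would yield different word counts.

```python
import re
from collections import Counter

# Assumption: a word is a run of ASCII letters, optionally containing
# internal apostrophes (e.g. "don't"). Other schemes give other counts.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(line):
    """Lowercase a line and return its word tokens."""
    return TOKEN_PATTERN.findall(line.lower())


def summarize_lines(lines):
    """Return (line count, total word count, unique word count)."""
    line_count = 0
    vocabulary = Counter()
    for line in lines:
        line_count += 1
        vocabulary.update(tokenize(line))
    return line_count, sum(vocabulary.values()), len(vocabulary)


def summarize_file(path):
    """Summarize a text file, reading it line by line to limit memory use."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return summarize_lines(handle)
```

Under these assumptions, calling `summarize_file` on the path to one of the three English files would return that file's line count, total word count, and unique word count. Stopwords and profanities are counted, since no filtering is applied.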
The following figure illustrates the line and word counts graphically.
The following graphs show the frequencies of the 30 most frequently occurring words in the data.
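A ranking of this kind could be obtained along the following lines. This is a hedged sketch rather than the code behind the graphs: the tokenization rule, the function name `top_words`, and the optional `stopwords` parameter are assumptions introduced here for illustration.

```python
import re
from collections import Counter

# Assumption: same illustrative tokenization as a simple letters-and-apostrophes rule.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def top_words(lines, n=30, stopwords=frozenset()):
    """Return the n most frequent words as (word, count) pairs.

    Words listed in `stopwords` are skipped; by default nothing is
    filtered, so stopwords are included in the ranking.
    """
    counts = Counter()
    for line in lines:
        tokens = TOKEN_PATTERN.findall(line.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts.most_common(n)
```

For example, `top_words(["the cat and the hat", "the cat sat"], n=2)` returns `[("the", 3), ("cat", 2)]`. Passing a set of stopwords shows how the ranking changes once very common function words are excluded.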
The next steps will involve: