Below I describe what I have done to date in building an app that predicts the next word, up to three words. First, I describe the data and discuss how I tried to solve the problem of data size. Second, I describe my choices in processing the data to make it suitable for building a predictive model. Third, I describe which measures may be useful for building the app, presenting tables and graphs along the way. Any feedback and suggestions are much appreciated. Two resources I used are listed at the end; they may be valuable if you wish to look into the topic yourself.
The data are text files from distinct sources. There are datasets in English, Finnish, German, and Russian. They are independent samples rather than translations. Each language has its own dataset drawn from three source media: Twitter, blogs, and news. The variation among source texts is important for creating a more representative n-gram dictionary and count, free from the distorting effects of any one kind of written medium; for example, the heavy use of abbreviations on Twitter that arises from the limit on the number of characters.
The most frequent n-grams across the source texts will be crucial for training a predictive model. I decided to use the whole dataset to train the model, since text data is cheap and plentiful and an independent test set can be obtained later in the modeling process. However, as I describe below, training on the whole dataset has the virtue of broadening the n-gram dictionary, but it also makes the size of the data a problem, given that R holds its data in physical memory.
The table below contains some of the relevant summary statistics of the data files.
| Media Source | Size of File (MB) | Number of Lines (pre-processing) | Number of 1-grams (post-processing) | Number of 2-grams (post-processing) | Number of 3-grams (post-processing) | Number of 4-grams (post-processing) |
|---|---|---|---|---|---|---|
| twitter | 167.11 | 2360148 | 11765760 | 4519455 | 1856985 | 806505 |
| blogs | 210.16 | 899288 | 13959855 | 4670025 | 1658490 | 662865 |
| news | 205.81 | 1010242 | 15518250 | 6469110 | 2733140 | 1217160 |
From the summary statistics, one can see that both the source files and the intermediate products are large. The three files total about 583 MB, a little more than half a gigabyte. Moreover, the n-gram counts shrink roughly geometrically as n grows: each count is between about one third and one half of the count for the next smaller n.
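The arithmetic behind these two observations can be checked with a short sketch. It is written in Python purely for illustration (the report's own pipeline is in R), the numbers are copied from the summary table, and the 167.11 MB row is taken to be the Twitter file.

```python
# Counts copied from the summary table: 1-grams to 4-grams for each source.
# Assumption: the 167.11 MB row is the Twitter file.
ngram_counts = {
    "twitter": [11765760, 4519455, 1856985, 806505],
    "blogs": [13959855, 4670025, 1658490, 662865],
    "news": [15518250, 6469110, 2733140, 1217160],
}
file_sizes_mb = {"twitter": 167.11, "blogs": 210.16, "news": 205.81}

# Combined size of the three source files in MB.
total_mb = round(sum(file_sizes_mb.values()), 2)


def shrink_ratios(counts):
    """Ratio of each n-gram count to the count for the next smaller n."""
    return [round(larger / smaller, 3) for smaller, larger in zip(counts, counts[1:])]


ratios = {source: shrink_ratios(counts) for source, counts in ngram_counts.items()}

print(total_mb)
for source, source_ratios in ratios.items():
    print(source, source_ratios)
```

A roughly constant ratio from one n to the next is what "exponential" decay means here; the ratios are not identical, so the description is only approximate.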
There are several strategies for getting around the problem of downloading and processing these large text files, and I experimented with some of the existing approaches. These include cloud computing and the bigmemory package, which stores large datasets as matrices rather than data frames in order to reduce their size and allow faster processing.
Unfortunately, I found both of these approaches costly. Cloud computing would cost money for the space, in addition to the time needed to learn SQL or Hadoop. Although they do not fit the timeline of this project, these are tools I need to add to my data science toolbox so that I can handle even larger files more efficiently. The bigmemory package also has costs in terms of time: I would have to write a preprocessing script, which would take some time to learn and perfect.
I chose to pursue a strategy of downloading and processing the data in chunks. Moreover, I used the packages doParallel and foreach to run the code in parallel. Each chunk could be processed independently, and therefore more quickly, using a foreach loop. This strategy paid off when executing the most computationally intensive code, such as tokenizing the data. Instead of having my computer reach its memory limit or run for hours, I was able to process the data relatively efficiently, within tens of minutes for each file when extracting n-grams. As a result of parallel processing, I was able to extract all the n-grams, and to do so at a lower cost in money and time than the alternatives.
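The shape of this chunk-and-combine strategy can be sketched as follows. This is not the R code used for the report: it is a Python illustration in which a thread pool stands in for the doParallel/foreach workers, and the function names, chunk size, and sample lines are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def count_words(chunk):
    """Per-chunk work: a cheap stand-in for the expensive tokenizing step."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def count_in_chunks(lines, chunk_size=2, workers=2):
    """Process the chunks independently, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_words, read_chunks(lines, chunk_size)):
            total.update(partial)
    return total


# Toy stand-in for the lines of one source file.
sample_lines = ["the cat sat", "the dog sat", "a cat ran", "the end"]
word_counts = count_in_chunks(sample_lines)
print(word_counts.most_common(3))
```

The sketch shows the structure only: independent chunks and a merge step. Python threads do not speed up CPU-bound work the way separate R worker processes can, and `pool.map` submits every chunk up front, so a real pipeline over a large file would also need to limit how many chunks are held in memory at once.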
To process the data I used the standard framework for natural language processing. This means that I tokenized the data: I split it into words or word combinations after stripping it of punctuation, special characters, numbers, profanity, and extra white space, and after converting capital letters to lowercase. Tokenization gives a count of how frequently different word combinations occur. Combinations of two, three, or four consecutive words are called 2-grams, 3-grams, and 4-grams, or n-grams in general, and their counts are crucial for predicting the next word. These counts in turn give probability estimates that one word will be followed by another, thus facilitating the development of a predictive model.
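A minimal sketch of this step, from cleaning through n-gram counts to a probability estimate, is given below in Python for illustration. It is a simplification under stated assumptions: the cleaning rule keeps only the letters a to z and apostrophes, there is no profanity filter, and the function names and sample sentences are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text, drop digits and punctuation, split on white space."""
    text = text.lower()
    text = re.sub(r"[^a-z\s']", " ", text)  # keep only a-z, apostrophes, spaces
    return text.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, as tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def next_word_probabilities(lines, context):
    """Estimate P(next word | context) from relative n-gram counts."""
    n = len(context) + 1
    counts = Counter()
    for line in lines:
        for gram in ngrams(tokenize(line), n):
            if gram[:-1] == tuple(context):
                counts[gram[-1]] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


# Hypothetical sample lines standing in for a source file.
sample = ["I love New York!", "I love new shoes.", "We love New York 2 much"]
probs = next_word_probabilities(sample, ["love", "new"])
print(probs)
```

Here the 3-grams beginning with "love new" are followed by "york" twice and "shoes" once, so the estimate is simply each count divided by the total for that context.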
R provides many options to tokenize data efficiently without having to write original code, such as the packages tm and ngram. I experimented with these, but encountered too many obstacles along the way.
I chose to use tidytext for the simplicity of both the process and the product. A simple function like unnest_tokens() can tokenize the data into n-grams of any length I want with very little difficulty. The package also includes a dictionary of common words, called stop words, that add little to a predictive model because they are associated with so many other words. By stripping the text of stop words, I was able to further reduce the size of the intermediate files, speed up later processing, and obtain a more targeted training set for the prospective n-gram predictive model. I could also have kept the stop words and computed the adjusted term frequency, as I describe later. However, I have not had time to experiment with this, and the data size may get too large.
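The effect of stop-word removal on the size of the data can be illustrated with a small Python sketch. The stop-word list here is a hypothetical handful of words, not the tidytext dictionary, and the sketch filters tokens before forming n-grams, which is only one possible order of operations.

```python
# Hypothetical mini stop-word list; the tidytext dictionary is far larger.
STOP_WORDS = {"the", "a", "of", "to", "and", "in", "is", "it"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in stop_words]


def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


tokens = "the king of the hill is in the castle".split()
kept = remove_stop_words(tokens)

print(len(tokens), len(kept))
print(bigrams(kept))
```

In this toy sentence, nine tokens shrink to three. Note that filtering first yields 2-grams such as "king hill" whose words were not adjacent in the original text, which is one trade-off of removing stop words before extracting n-grams.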
The output of tidytext is also a benefit. The output is a dataframe organized according to tidy principles. Hence, each row is one n-gram. This format allows me to convert the n-grams into a graph and then a matrix for a prediction algorithm.
The following tables and plots explore the text data. The significant choice to be made here is which measures to focus on. I chose to give priority to an adjusted term frequency that takes into account how rarely a token is used. The result is a list of 4-grams that may be rarely used, but whose words are more likely to be used together. This information will be valuable for the predictive model.
Below is a table that lists the top 10 4-grams used across the source texts, taking into account how rarely each combination of words is used. It gives how often a 4-gram is used (term frequency), how rare the combination is (inverse document frequency, or idf), and how often the combination is used given its rarity (adjusted term frequency).
| Text Source | 4-gram | Count | Term Frequency | Inverse Document Frequency | Adjusted Term Frequency |
|---|---|---|---|---|---|
| blogs | meep meep meep meep | 240 | 0.0003621 | 1.098612 | 0.0003978 |
| blogs | amazon.ca amazon.co.uk amazon.de amazon.fr | 225 | 0.0003394 | 1.098612 | 0.0003729 |
| blogs | amazon.co.uk amazon.de amazon.fr amazon.it | 225 | 0.0003394 | 1.098612 | 0.0003729 |
| blogs | amazon.com amazon.ca amazon.co.uk amazon.de | 225 | 0.0003394 | 1.098612 | 0.0003729 |
| twitter | love love love love | 270 | 0.0003348 | 1.098612 | 0.0003678 |
| blogs | john smith’s grand national | 180 | 0.0002715 | 1.098612 | 0.0002983 |
| blogs | free game online paintball | 165 | 0.0002489 | 1.098612 | 0.0002735 |
| blogs | game online paintball play | 165 | 0.0002489 | 1.098612 | 0.0002735 |
| twitter | cake cake cake cake | 195 | 0.0002418 | 1.098612 | 0.0002656 |
| twitter | add boston add boston | 180 | 0.0002232 | 1.098612 | 0.0002452 |
From the table, and from the bar charts below showing the top ten 4-grams within each source text, we can see that some Twitter and commercial expressions still need to be filtered out. We can also see the value of the adjusted term frequency measure: it allows me to target the likely 2-grams, 3-grams, and 4-grams.
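The adjusted term frequency in the table can be reproduced as a standard tf-idf calculation, sketched here in Python for illustration. The sketch makes two assumptions: the three source files are treated as the three documents, and the denominator of the term frequency is the blogs 4-gram total from the summary table (662865). Under those assumptions the values for the first row, "meep meep meep meep", come out as reported.

```python
import math


def tf_idf(count, total_in_document, n_documents, n_documents_with_term):
    """Return term frequency, inverse document frequency, and their product."""
    tf = count / total_in_document
    idf = math.log(n_documents / n_documents_with_term)  # natural logarithm
    return tf, idf, tf * idf


# First table row: count 240 in blogs, appearing in 1 of the 3 source files.
# Assumption: the denominator is the blogs 4-gram total from the summary table.
tf, idf, adjusted = tf_idf(
    count=240,
    total_in_document=662865,
    n_documents=3,
    n_documents_with_term=1,
)

print(round(tf, 7), round(idf, 6), round(adjusted, 7))
# prints: 0.0003621 1.098612 0.0003978
```

Every row in the table has the same idf, ln(3) ≈ 1.098612, because each of these 4-grams occurs in only one of the three sources. A 4-gram found in all three sources would have an idf of ln(1) = 0 and drop out of the ranking entirely.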
I will construct the predictive model on the basis of a graph. I will take the 4-gram data and convert it to a network of words (a graph). This can then be represented as a matrix with which to make calculations akin to moving from one word to another. I have not yet thought this through in detail, since I have spent most of my time on the problems of data size and data dirtiness. Any input or suggestions would be greatly appreciated.
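One possible reading of this plan is a word-to-word transition matrix, sketched below in Python for illustration. It is an assumption about where the design could go, not a settled method. For simplicity the edges come from hypothetical 2-gram counts; starting from 4-grams, as proposed above, would need care so that the same adjacent pair is not counted several times in overlapping 4-grams.

```python
from collections import defaultdict


def build_transition_matrix(bigram_counts):
    """Turn counted word pairs into rows of next-word probabilities."""
    edges = defaultdict(dict)
    for (current, following), count in bigram_counts.items():
        edges[current][following] = edges[current].get(following, 0) + count
    matrix = {}
    for current, followers in edges.items():
        row_total = sum(followers.values())
        # Each row sums to 1: the probabilities of moving to each next word.
        matrix[current] = {word: c / row_total for word, c in followers.items()}
    return matrix


def predict_next(matrix, word, k=3):
    """Return up to k most probable next words for the given word."""
    row = matrix.get(word, {})
    return sorted(row, key=row.get, reverse=True)[:k]


# Hypothetical toy counts, not taken from the report's data.
toy_bigram_counts = {
    ("game", "online"): 4,
    ("online", "paintball"): 3,
    ("online", "chess"): 1,
    ("paintball", "play"): 2,
}
matrix = build_transition_matrix(toy_bigram_counts)
print(matrix["online"])
print(predict_next(matrix, "online"))
```

The matrix is stored as a dictionary of rows rather than a dense array, since most word pairs never occur together and a dense matrix over the full vocabulary would be very large.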
Nolan, D. and Lang, D. T. Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press. 2015.
Silge, J. and Robinson, D. Text Mining with R: A Tidy Approach. O’Reilly. 2017.