The data was downloaded from the Coursera page. It contains text files from three sources (blogs, news and Twitter) in four different languages: English, German, French and Russian. For this analysis, only the English files were used to train the model.
The data was downloaded as a zip file from this link. The text files of interest were extracted from this dataset, and a random subset of 0.1% of the data was used for the exploratory analysis. The ‘rbinom’ function was used to select this random subset, with a probability of 0.001 and a seed of 123 set before any subsetting. The size of the resulting data is described in Table 1.
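A minimal sketch of this sampling step is shown below; the file paths and helper name are assumptions and may differ from the actual code used in the report.

```r
set.seed(123)

subset_lines <- function(path, prob = 0.001) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # rbinom() draws a 0/1 flag for every line; keep the lines flagged with 1
  keep <- rbinom(n = length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

twitter <- subset_lines("final/en_US/en_US.twitter.txt")
news    <- subset_lines("final/en_US/en_US.news.txt")
blogs   <- subset_lines("final/en_US/en_US.blogs.txt")
```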
When preprocessing the datasets, punctuation and numbers are removed, the lines are tokenized into single words and, finally, profane words are removed.
Punctuation: Care is taken to preserve contractions such as can’t, should’ve, etc. Furthermore, a symbol (<n>) is used to mark the end of each line. This allows the final model to suggest the end of a sentence and reduces the confusion caused by all lines otherwise appearing as one continuous line.
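A possible implementation of this step is sketched below; the exact regular expression used in the report may differ.

```r
clean_punctuation <- function(lines) {
  # Strip punctuation but keep apostrophes so contractions like can't survive
  cleaned <- gsub("[^[:alnum:][:space:]']", " ", lines)
  # Append the end-of-line marker so sentence endings can later be predicted
  paste(cleaned, "<n>")
}
```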
Numbers: All numbers were removed, including numbers that were part of a word, like 20dollars or 2day. Unfortunately, this has the consequence of misinterpreting contracted words such as 2day as day, but a quick and simple solution to this problem wasn’t found.
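A sketch of the number-removal step, which also illustrates the limitation described above:

```r
remove_numbers <- function(lines) {
  # Removes all digits, including those inside words such as 2day or 20dollars
  gsub("[[:digit:]]+", "", lines)
}
```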
Tokenizer (unigram): All words in the dataset were converted to lower case and the individual words were then separated using whitespace as the division between words.
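A minimal sketch of this tokenizer, assuming a plain whitespace split:

```r
tokenize <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "\\s+"))
  tokens[tokens != ""]  # drop empty tokens left by repeated spaces
}
```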
Profanity filter: A list of profane words from this GitHub repository was used to identify profane words in the datasets. It is not a complete list, but it covers a good fraction of common profanity. The list was cleaned to match the format of the tokens (from the previous step) and the profane words were indexed out of the datasets. This was found to be the fastest and simplest way to include a profanity filter in the analysis.
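A sketch of the filtering step; the file name of the profanity list is an assumption.

```r
filter_profanity <- function(tokens, profanity_file = "profanity_list.txt") {
  profane <- tolower(readLines(profanity_file, skipNul = TRUE))
  tokens[!(tokens %in% profane)]  # index the profane tokens out
}
```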
Preprocessing function: Finally, a function was created to perform these tasks in the appropriate order, to avoid conflicts and errors.
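A sketch of how such a function could chain the steps above in the order described (punctuation, numbers, tokenization, profanity); the function and object names are assumptions.

```r
preprocess <- function(lines, profanity_file = "profanity_list.txt") {
  lines  <- clean_punctuation(lines)
  lines  <- remove_numbers(lines)
  tokens <- tokenize(lines)
  filter_profanity(tokens, profanity_file)
}

twitter_tokens <- preprocess(twitter)
news_tokens    <- preprocess(news)
blogs_tokens   <- preprocess(blogs)
```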
The size of the datasets is described in the table below, along with the number of words and the number of characters, which will go on to form the dictionary for the text prediction.
| Source_Name | Lines_Raw | Lines_Subset | Words_Subset | Characters_Subset |
|---|---|---|---|---|
| Twitter | 2360148 | 2415 | 36718 | 148360 |
| News | 77259 | 97 | 3494 | 16069 |
| Blogs | 899288 | 957 | 44222 | 192317 |
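The figures above could be computed along the lines of the sketch below; the exact counting rules (for example, whether characters are counted before or after cleaning) are assumptions.

```r
summarise_subset <- function(lines, tokens) {
  c(Lines_Subset      = length(lines),
    Words_Subset      = length(tokens),
    Characters_Subset = sum(nchar(lines)))
}

summarise_subset(twitter, twitter_tokens)
```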
The most common words and their frequencies are analysed to understand the distribution of words in the datasets. A histogram of the 20 most common words in each of the three datasets is included in Figure 1. Most of the common words are shared across the three datasets and are often referred to as stop words. They are not very informative, but because they play a vital role in text prediction, they are not removed from the dataset.
Figure 1: Histogram of the 20 most common words in the three datasets.
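A sketch of how the counts behind Figure 1 can be produced for one dataset, assuming the token vectors from the preprocessing sketch above:

```r
twitter_freq <- sort(table(twitter_tokens), decreasing = TRUE)
barplot(head(twitter_freq, 20), las = 2,
        main = "20 most common words (Twitter subset)")
```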
The variety of words can be described by the number of words that make up a given percentile of the dataset. Looking at the histograms above, it is easy to assume that the Twitter dataset has the largest variety of words, as its three most common words make up barely more than 2.5% of the dataset, compared to almost 3% in the other two. However, a plot of the cumulative frequency shows that the News dataset contains the largest variety of words.
Three-fourths of the Twitter and Blogs datasets are made up of about 12% of the most common words. In contrast, it takes about 50% of the most common words in the News dataset to make up three-fourths of that dataset (Figure 2). This makes the News dataset more useful and can help increase the size of the dictionary for the text prediction algorithm.
Figure 2: Cumulative frequency of the words used in each dataset, sorted by percentile rank.
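A sketch of the cumulative-frequency curve in Figure 2, again assuming the token vectors from the preprocessing sketch:

```r
twitter_freq <- sort(table(twitter_tokens), decreasing = TRUE)
cum_share <- cumsum(as.numeric(twitter_freq)) / sum(twitter_freq)  # share of corpus covered
pct_rank  <- seq_along(twitter_freq) / length(twitter_freq)        # percentile rank of words
plot(pct_rank, cum_share, type = "l",
     xlab = "Percentile rank of words", ylab = "Cumulative frequency")
```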
- Improve the function that removes numbers from the dataset.
- Correct for spelling mistakes and shortened words (shrtr wrds).
- Increase the size of the News subset, which could provide a bigger dictionary to refer to.
- Use sample sentences from a dictionary to further increase the set of reference words.