The aim of this Milestone Report is to show basic features of the text data and how I processed it for training a predictive NLP model. I will show the results of my exploratory analysis, highlighting features of the data sets provided, the distributions of words per line from each data source, and the most common words and word combinations. Afterwards I will talk about my ideas for a predictive model.
First I will show some features of the three data sets provided. The data sets are lines mined from blogs, news and tweets off the internet. Here is a summary table showing the features of the files from each source:
| File | File size | Lines | Characters | Words | Min. words per line | Median words per line | Max. words per line |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200 MB | 899288 | 206824505 | 37570839 | 0 | 28 | 6726 |
| en_US.news.txt | 196 MB | 1010242 | 203223159 | 34494539 | 1 | 32 | 1796 |
| en_US.twitter.txt | 159 MB | 2360148 | 162096031 | 30451128 | 1 | 12 | 47 |
Judging by file size, number of characters and number of words, the samples from the three sources are roughly equal in size, with Twitter contributing somewhat less. The Twitter data stands out for its high number of lines and correspondingly low number of words per line (a median of 12, compared with 28 for blogs and 32 for news). This indicates that the tweets are usually shorter than the text passages from blogs and news, as one would expect. To look further at the distribution of words per line in each source, I will now provide histograms for all three sources:
When interpreting these histograms, one has to take the differing x-axis ranges into account. There are some extremely long lines in the news data and especially in the blogs data. Overall, though, the words per line are concentrated at the low end, as the medians in the table above already showed.
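The per-file figures in the summary table can be computed along the following lines. This is a minimal sketch in Python rather than the R code behind the report; the function name `line_stats` is hypothetical, and it assumes that a "word" is any whitespace-separated token, which may differ from the counting rule used for the table.

```python
from statistics import median

def line_stats(lines):
    """Summarise a non-empty list of text lines (hypothetical helper)."""
    # Assumption: words are whitespace-separated tokens.
    words_per_line = [len(line.split()) for line in lines]
    return {
        "lines": len(lines),
        "characters": sum(len(line) for line in lines),
        "words": sum(words_per_line),
        "min_wpl": min(words_per_line),
        "median_wpl": median(words_per_line),
        "max_wpl": max(words_per_line),
    }

# Tiny made-up sample; an empty line yields a minimum of 0 words,
# as seen for the blogs file in the table.
sample = ["a short tweet", "", "a somewhat longer line taken from a blog post"]
stats = line_stats(sample)
print(stats)
```

The list `words_per_line` is also what a histogram of words per line would be drawn from.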
To process the data into a useful training set I used the ‘tm’ text mining package for R. First I concatenated the three data sets. Then I removed all URLs, Twitter handles and email addresses using regular expressions. I filtered out profanity using a word list provided by Luis von Ahn’s Research Group at Carnegie Mellon University. Using the tm_map function from ‘tm’, I further cleaned the data by removing punctuation, numbers and unnecessary whitespace, converting everything to lower case and converting to plain text. To avoid the runtime of cleaning the data multiple times while still using the entire data set, I ran the cleaning on a computing cluster and saved the cleaned data set as a separate file, which can now be read in when needed.
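The cleaning steps described above can be illustrated with a short Python sketch. It is not the ‘tm’ pipeline used for the report: the regular expressions, the function name `clean_line` and the `banned_words` argument are assumptions made for illustration, and the sketch drops individual banned words, whereas the report does not say whether words or whole lines were removed. It also keeps only the letters a to z, which is a simplification for English text.

```python
import re

# Illustrative patterns (assumptions, not the expressions used in the report).
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
HANDLE_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # punctuation, digits, other symbols

def clean_line(line, banned_words=frozenset()):
    """Lower-case a line and strip URLs, emails, handles, punctuation and numbers."""
    text = line.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)    # emails before handles, so "@domain" is not left behind
    text = HANDLE_RE.sub(" ", text)
    text = NON_LETTER_RE.sub("", text)
    # split() collapses repeated whitespace; banned words are dropped one by one.
    words = [w for w in text.split() if w not in banned_words]
    return " ".join(words)

cleaned = clean_line(
    "Check https://example.com NOW, @user! Mail me@example.org 24/7 darn",
    banned_words={"darn"},  # stand-in for an entry of the profanity list
)
print(cleaned)  # check now mail
```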
To find the most common words and word combinations I used the tokenizer of the ‘RWeka’ R package. This makes it possible to look at the frequencies of n-grams, that is, combinations of n words occurring together. To keep the runtime reasonable, I used only the first 50 000 lines of the cleaned data set.
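To make the idea of n-gram frequencies concrete, here is a minimal Python sketch of the counting step. It does not use ‘RWeka’; the helper names `ngrams` and `top_ngrams` are hypothetical, tokens are assumed to be whitespace-separated, and n-grams are assumed not to cross line boundaries.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(lines, n, k=20):
    """Count n-grams over all lines and return the k most frequent ones."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)

# Made-up example lines, already cleaned.
lines = ["i want to go", "i want to stay", "we want to go home"]
top_bigrams = top_ngrams(lines, 2, k=3)
print(top_bigrams[0])  # ('want to', 3)
```

Calling `top_ngrams` with n set to 1, 2 or 3 and k set to 20 gives the kind of frequency lists summarised in the following paragraphs.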
First, here are the frequencies of the 20 most frequent 1-grams, that is, single words:
I don’t think these words come as a surprise, since we expected the most common words in the English language to appear here. This suggests that our sample is representative of the English language.
Now the frequencies of the 20 most frequent 2-grams, that is, 2-word combinations:
The result looks similar to the one above.
Lastly, the frequencies of the 20 most frequent 3-grams:
Again, no surprises, which is good in terms of the representativeness of our sampled data. I would conclude that the quality of the data should be good enough for a decent predictive model.
Since I have already investigated the frequencies of n-grams, a natural choice would be a Katz Back-off model, which should have good space and time complexity and therefore not be too demanding on computing power and memory. After implementing it, testing would be needed to find the best n for the n-gram language model. Another, perhaps more up-to-date, approach would be to use transformers or convolutional neural networks. I imagine that with R’s ‘reticulate’ package it should be possible to use the most popular NLP models, like BERT or GPT, from Python. I would consider taking a well-performing pretrained NLP model, training it further on my data and maybe adjusting the model architecture. I expect that, if done correctly, this would produce the best results, since these models have proven to be very good at natural language generation.
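The back-off idea can be sketched in a few lines of Python. Note that this is a simplified back-off lookup, not a full Katz Back-off model: it returns the most frequent continuation of the longest known context and omits the discounting and probability redistribution that Katz's method uses. The names `build_tables` and `predict_next` and the default `max_n=3` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_tables(lines, max_n=3):
    """Map each context (tuple of 0 to max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                tables[context][tokens[i + n - 1]] += 1
    return tables

def predict_next(tables, text, max_n=3):
    """Try the longest context first, then back off to shorter ones."""
    tokens = text.split()
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:]) if k else ()
        if context in tables:
            return tables[context].most_common(1)[0][0]
    return None  # nothing was seen during training

# Made-up training lines, already cleaned.
tables = build_tables(["i want to go", "i want to stay", "we want to go home"])
print(predict_next(tables, "they want to"))  # go   (trigram context "want to")
print(predict_next(tables, "you go"))        # home (backs off to context "go")
```

Trying different values of `max_n` on held-out text is one way to carry out the testing for the best n mentioned above.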