Main steps of the data study:
- Download the data and save it in local files
The data was downloaded from the URL given in the documentation of the project,
unpacked, and saved in the directory “./Datos” of the working environment. There are three files there: one contains blogs, the second news, and the third tweets.
The files were read line by line while basic information was extracted from them. It is important to use the appropriate parameters on the reading instructions: so as not to get interrupted by alien characters (present in three places of the news file), to skip nulls, and to format strange characters adequately for future processing (making the Unicode codes of non-textual characters explicit).
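A minimal sketch of this reading step, assuming the three raw files sit in “./Datos” under their usual names (the file names and the helper readExplicit are illustrative), could look like this:

readExplicit <- function(path) {
  # Open in binary mode so reading is not interrupted by alien characters,
  # and skip embedded nulls.
  con <- file(path, open = "rb")
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  # Make the Unicode codes of non-textual characters explicit, e.g. <U+0094>.
  iconv(lines, from = "UTF-8", to = "ASCII", sub = "Unicode")
}

blogs   <- readExplicit("./Datos/en_US.blogs.txt")
news    <- readExplicit("./Datos/en_US.news.txt")
twitter <- readExplicit("./Datos/en_US.twitter.txt")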
- Extract a random sample of each one of the three files
The data, as read, has the following characteristics:
- Original Data:
            Size (bytes)   # of lines   Longest line (bytes)   # of words   Mean words/line
Blogs          210160014       899288                  40835     38171210          42.44604
News           205811889      1010242                  11384     34797994          34.44521
Twitter        167105338      2360148                    213     30657971          12.98985
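The statistics in the table above (and in the sampled-data table below) were gathered while reading the files. A rough sketch of how such per-file figures can be computed (the helper name is illustrative, and words are approximated as whitespace-separated tokens) is:

summarizeLines <- function(path, lines) {
  # Approximate words as whitespace-separated tokens on each line.
  wordsPerLine <- sapply(strsplit(lines, "\\s+"), length)
  c(Size        = file.size(path),
    Lines       = length(lines),
    LongestLine = max(nchar(lines)),
    Words       = sum(wordsPerLine),
    MeanWords   = mean(wordsPerLine))
}

summarizeLines("./Datos/en_US.blogs.txt", blogs)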
Since the total size of the original files (583077241 bytes) is quite large for a PC’s memory and speed, and since, in accordance with statistical inference, a reduced random sample is enough to obtain quite acceptable results, a process was implemented to keep approximately 10% of the lines of the given data, selected at random (a code sketch of this sampling step is shown after the discussion below). These sampled files - we will refer to each of them by the name of the original from which it was extracted - are described in the following table:
- Sampled Data:
            Size (bytes)   # of lines   Longest line (bytes)   # of words   Mean words/line
Blogs           20759205        90210                  10785      3786070          41.96952
News            20323363       100633                   1914      3444766          34.23098
Twitter         16499901       235626                    764      3068024          13.02074
It can be seen that the size in bytes, the number of lines, and the number of words of each file have been reduced to approximately 10% of the original. On the other hand, the mean number of words per line in the sampled News and Twitter files is quite close to the original value, as it should be. However, it is not so close in the Blogs file.
Furthermore, what is rather anomalous is the maximum line length in the Twitter file: while in the original file it is 213 bytes (remember that tweets in general could not exceed 140 characters until last year, although any URL counts as only 22 characters and there are other technical considerations in that count), in the sampled file it has grown to 764 bytes. How is this possible?
The reason for these anomalies is that the sampled files were read making the Unicode codes of non-textual characters explicit. For example, as we will see in some examples below, three non-English characters may be replaced by a string of 28 bytes. These characters are absent from the News file (more formal), while there are some in the Blogs file and many more in the Twitter file. The replacement of some characters by their much longer codes explains the observations noted above.
Nevertheless, this code-explicit form of reading the samples gives us better control over the content, and an easier way to eliminate those characters from the future list of predictors.
The blogs are the documents containing the most words, while they also have the fewest lines. But the meaning of a line in this context should be understood as a paragraph for the blogs and news, and as a tweet in the Twitter case. So the blogs have fewer paragraphs than the news, but they are longer. And the tweets are far more numerous than the paragraphs in the other two types of documents, but the mean number of words per tweet is quite small compared with the same statistic per paragraph of blogs and news.
This analysis seems to be in accordance with what we know about these kinds of documents, and it does make sense, thus verifying the quality of the data and of the reading and sampling performed.
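A sketch of the sampling step mentioned above, assuming the full files were read into the vectors blogs, news, and twitter (the helper name, the seed, and the output file names are assumptions), could be:

set.seed(1234)

sampleLines <- function(lines, p = 0.10) {
  # Keep each line independently with probability p (about 10% of the lines).
  lines[rbinom(length(lines), size = 1, prob = p) == 1]
}

blogsSample   <- sampleLines(blogs)
newsSample    <- sampleLines(news)
twitterSample <- sampleLines(twitter)

# Save the samples next to the originals so they can later be loaded into the corpus.
writeLines(blogsSample,   "./Datos/sample.blogs.txt")
writeLines(newsSample,    "./Datos/sample.news.txt")
writeLines(twitterSample, "./Datos/sample.twitter.txt")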
- Load the samples into a working corpus
Using a text mining package (tm), the three sampled files were built into a corpus, that is, a collection of documents to be preprocessed, in general, with a common criterion. Since the predictive application we are planning is not oriented to a specialized audience or form of expression, but to everyday English, we have chosen to merge the three sampled files into a single collection. However, if necessary, we can still access lines at specific locations in any of the three components of the corpus.
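A minimal sketch of this step, assuming the sampled files were saved in “./Datos” with “sample” in their names (the pattern and language setting are assumptions), could be:

library(tm)

# Each sampled file becomes one document of the corpus; individual lines
# remain accessible, e.g. content(corpus[[3]])[4072] for a Twitter line.
corpus <- VCorpus(DirSource("./Datos", pattern = "sample"),
                  readerControl = list(language = "en"))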
- Perform transformations on the corpus oriented to the purpose of the work
Once the corpus is built, it is transformed to eliminate profanity, suppress unnecessary white space, remove punctuation and numbers, and remove the Unicode codes of non-textual characters - made explicit during the initial reading - since none of these are intended to be predicted by the working application. Also, all letters are changed to lower case. On the other hand, we decided not to remove stopwords, since these may be important in a predictive system (consider, for example, articles and prepositions).
All of these facilities, provided by the tm package, act on the documents of the corpus as a whole.
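A sketch of this transformation chain (the regular expression used to drop the explicit Unicode codes is an assumption about their format; the profanity filtering with removeWords is discussed further below) could be:

# Remove the explicit Unicode codes first, before punctuation is stripped.
corpus <- tm_map(corpus, content_transformer(
  function(x) gsub("<U\\+[0-9A-Fa-f]{4}>|<[0-9a-f]{2}>", " ", x)))
corpus <- tm_map(corpus, content_transformer(tolower))  # lower case
corpus <- tm_map(corpus, removePunctuation)             # punctuation
corpus <- tm_map(corpus, removeNumbers)                 # numbers
corpus <- tm_map(corpus, stripWhitespace)               # extra white space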
Let’s see the effects of this transformation on some lines of the sampled set, as they were initially, and in their post-transformed state:
Twitter line 4072 - Before tm_map transform:
I have the best sistaa <f0><U+009F><U+0098><U+0098><U+0093>: Making home made cookies for my sissy! <U+0094>
After tm_map transform:
i have the best sistaa making home made cookies for my sissy
Longest Twitter line - Before tm_map transform:
<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>CAKE<f0><U+009F><U+008E><U+0082>CAKE<f0><U+009F><U+008D><U+00B0>
After tm_map transform:
cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake cake
The profanity elimination part is provided by tm through a call to tm_map with the built-in function removeWords. First, it is necessary to build or download a list of undesirable words, which we call badwords. In our particular case, these are words we do not want to see appear as the next predicted word in the project’s predictive application.
The removeWords function removes words from a text document, and for our corpus we should call it as:
corpus <- tm_map(corpus, removeWords, badwords[,1])
However, its performance was not satisfactory for this project, since it did not remove many instances of the banned words from the list. As an example, using two four-letter words as our list, we found 4555 appearances in the sampled corpus. Of these, removeWords removed only 1842 (40.4%).
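Counts like these can be obtained by scanning every line of the corpus for each word; one illustrative way to do it (the helper countMatches is not part of tm) is:

countMatches <- function(corpus, word) {
  # Count case-insensitive occurrences of 'word' in every line of every document.
  total <- 0
  for (i in seq_along(corpus)) {
    hits  <- gregexpr(word, content(corpus[[i]]), ignore.case = TRUE)
    total <- total + sum(sapply(hits, function(p) sum(p > 0)))
  }
  total
}

sapply(badwords[, 1], function(w) countMatches(corpus, w))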
Let’s look at this in some detail, since it took a good deal of our time trying to fix this problem.
I wrote a very simple substitute for removeWords, called noBadwords, and used it in this way:
# Transformer that replaces every occurrence of a pattern by a space.
noBadwords <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Apply it to the corpus once per word of the badwords list.
for (bw in badwords[, 1]) {
  corpus <- tm_map(corpus, noBadwords, bw)
}
This worked better than removeWords, but it wasn’t perfect either: it removed 3334 (73.2%) of the bad word instances, but left 1221 in place.
Any attempt to improve this by using more complex patterns than the word itself was unsuccessful.
Let’s see, as an example, what happened through the whole transformation process with repeated instances of a rather innocuous word from the list: damn.
Blogs line 5169:
Damnit I heard her whisper. Damnitdamnitdamnitdamnitdamnit. I can't deal with this. Not right now.
After tm_map transform:
damnit i heard her whisper damnit it it it it i can't deal with this not right now