Objective

This report presents some statistical features of textual data, such as the frequency of N-grams (N = 1, 2, 3), the total number of words, and the number of sentences. The final aim is to build an algorithm that predicts the next word while a sentence is being written.

Data set

The data set has been provided by the company Swiftkey for the Capstone Project of the Coursera Data Science Specialization. It consists of a set of text files in four different languages: English, Finnish, Russian, and German. For each language there are three files taken from different sources: blogs, news, and twitter. Each file consists of a list of text strings.

A statistical analysis of N-grams in the text is shown for the English language. N-grams are sequences of N words, where usually N equals 1, 2, or 3; they can be considered the basic units of the text.

Pre-processing

Pre-processing of the data consists of the following steps:

  1. special characters, such as ().,:<>;$^&!“£% etc., have been discarded; the only exceptions are cases like “don’t”, “i’m”, and similar
  2. all words have been converted to lowercase (the analysis is NOT case-sensitive)
  3. profanities have been removed

Note that no spell-correction has been performed and that “don’t”, “i’m”, etc. are considered 1-grams.
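A minimal sketch of the cleaning described above, assuming the raw text lines are in a character vector called lines and the profanity list in a character vector called badwords (both names are illustrative, and this is not necessarily the code used for this report):

clean_text <- function(lines, badwords) {
  x <- tolower(lines)                              # step 2: lowercase everything
  x <- gsub("\u2019", "'", x)                      # normalise curly apostrophes (“don’t” -> "don't")
  x <- gsub("[^a-z' ]", " ", x)                    # step 1: drop special characters, keep apostrophes
  x <- gsub("\\s+", " ", trimws(x))                # collapse the whitespace left behind
  bad <- paste0("\\b(", paste(badwords, collapse = "|"), ")\\b")
  gsub(bad, "", x)                                 # step 3: remove profanities
}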

Source-type statistics

Statistical differences between the text types have been assessed in terms of the average number of sentences, total number of words, unique-word rate (number of distinct words divided by total words), and number of distinct 1-grams, 2-grams, and 3-grams. The averages have been computed over chunks of 100,000 strings.
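A minimal sketch (not necessarily the code used here) of how a few of these per-chunk quantities could be computed, assuming the stringi package and a character vector chunk holding the 100,000 strings of one chunk:

library(stringi)

chunk_stats <- function(chunk) {
  words <- unlist(stri_extract_all_words(tolower(chunk)))   # every word in the chunk
  data.frame(
    Nsentences      = sum(stri_count_boundaries(chunk, type = "sentence")),  # sentences per chunk
    NwordsTot       = length(words),                                         # total words per chunk
    UniqueWordsRate = length(unique(words)) / length(words)                  # distinct / total words
  )
}

The resulting averages over the chunks are reported in the table below.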

# A tibble: 3 x 7
  dataType Nsentences NwordsTot UniqueWordsRate N1grams  N2grams  N3grams
  <chr>         <dbl>     <dbl>           <dbl>   <dbl>    <dbl>    <dbl>
1 blogs       260701.  4117854.          0.0247 101515. 1266416. 2830689 
2 news        176693.  3038044.          0.0324  89758. 1081966  2207208.
3 twitter     159715.  1274870.          0.0462  58943.  483475.  852248.

Not surprisingly, the “richest” textual data come from blogs, followed by news and finally twitter. In fact, the former has about 1.5 times as many sentences as the second, which in turn has about 1.1 times as many sentences as twitter. This fact influences the total number of words and consequently all the other features. Notice also that when the total number of words increases by around \(10^6\), the unique-word rate decreases by roughly 1%.

The corresponding statistics evaluated on a single string (averaged over all strings) are:

  dataType Nsentences NwordsTot uniqueWordsRate  N1grams  N2grams  N3grams
1    blogs   2.646274  41.79866       0.8689292 31.90147 39.57357 39.57380
2  twitter   1.597153  12.74870       0.9574179 12.00450 11.66902 10.75453
3     news   2.003846  34.45399       0.8871688 29.12916 32.92242 32.35963

A more detailed analysis shows that, regardless of the data source type, and even though the number of sentences and total words per string varies, the fraction of unique words in a single string is at least ~87%. This means that in a string of 12 to 41 words, it is unlikely that the same word appears more than once.

Common N-grams

Now let’s focus on the N-grams. All three data sets (blogs, news, and twitter) have been divided into chunks of 100,000 strings.

For each chunk, the 1000 most common 1-grams, 2-grams, and 3-grams have been calculated (a sketch of this computation is given below). The figure below shows the percentage of N-grams that are shared among chunks.
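As an illustration, the per-chunk top-1000 tables could be obtained along these lines (a sketch using the tidytext and dplyr packages, not necessarily the code used for this report; chunk is a character vector of 100,000 strings):

library(dplyr)
library(tidytext)

top_ngrams <- function(chunk, n = 2, top = 1000) {
  tibble(text = chunk) %>%
    unnest_tokens(ngram, text, token = "ngrams", n = n) %>%   # split each string into N-grams
    count(ngram, sort = TRUE) %>%                             # frequency of each distinct N-gram
    slice_head(n = top)                                       # keep only the most common ones
}

# e.g. the most common 2-grams of one chunk:
# head(top_ngrams(chunk, n = 2))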

50% of 1-grams appear in all chunks, while the percentage strongly decreases for 2-grams and 3-grams: 33% and 14%, respectively.

Conversely, only 6% of 1-grams appear in only one chunk, while the percentage strongly increases for 2-grams and 3-grams: 14% and 50%, respectively.

This means that in a text of ~200,000 sentences (on average there are 2 sentences per string), at least 506 common 1-grams are present and the number of “rarer” ones is relatively small (68).

On the other hand, the situation for 3-grams is reversed: ~147 common against ~500 rare.

The case of 2-grams lies somewhere in the middle: ~339 common against ~147 rare.
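A sketch (with illustrative names) of how such overlaps could be counted, given a list top_by_chunk whose elements are the character vectors of the 1000 most common N-grams of each chunk:

overlap_stats <- function(top_by_chunk) {
  counts <- table(unlist(top_by_chunk))                      # in how many chunks each N-gram appears
  c(in_all_chunks = sum(counts == length(top_by_chunk)),     # e.g. ~506 for 1-grams
    in_one_chunk  = sum(counts == 1))                        # e.g. ~68 for 1-grams
}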

The table below shows some examples of common N-grams, with the mean frequency calculated over the 23 chunks of data.

   1-grams      Freq  2-grams     Freq       3-grams     Freq
1      the 110165.13   of the 846.0870    one of the 846.0870
2      and  64154.65   in the 716.6957      a lot of 716.6957
3       to  61803.04   to the 456.8696    the end of 456.8696
4        a  54962.83   on the 430.6522      it was a 430.6522
5       of  50656.91    to be 420.0435       to be a 420.0435
6        i  43672.17  and the 411.4783    as well as 411.4783
7       in  30869.57  for the 401.7391    out of the 401.7391
8     that  27269.70    i was 386.0000   some of the 386.0000
9       is  24762.70    and i 346.8696    be able to 346.8696
10      it  23678.13   i have 323.6087   a couple of 323.6087
11     for  21013.13   at the 311.7391     i want to 311.7391
12     you  18805.13   it was 296.2609     i have to 296.2609
13    with  18191.04    it is 279.1304     this is a 279.1304
14     was  16844.83     is a 269.0000 the fact that 269.0000
15      on  15966.78     in a 259.9130      i have a 259.9130
16      my  14911.17 with the 254.1739   the rest of 254.1739
17    this  13882.65     i am 247.7391   part of the 247.7391

Conclusions and applications

This report gives a general idea of the complexity of textual data. In fact, even in a relatively large amount of data (~200,000 sentences), the diversity of N-grams, especially when N > 2, is large. If one wants to use N-grams to build a word-prediction algorithm, strategies that take into account combinations of N-grams should be used to improve performance and accuracy. For example, a possible strategy consists of:

  1. select a set of possible words based on the frequency (probability) tables of 2-/3-grams (i.e. considering one or two words before the one to be predicted);

  2. choose the one that maximizes the probability of the already-written sentence (or of its last k words) by combining the probabilities of the known 2-/3-grams.

Step 2 is the most important and trickiest one because it must handle new (unseen) combinations of words, which strongly influences the algorithm’s accuracy. The number of last words considered (k) and the way the probabilities of the known N-grams are combined are the crucial elements for building a well-performing algorithm.
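As an illustration of this strategy, below is a minimal sketch of one possible way to combine 2-/3-gram tables into a prediction (all object names and the interpolation weight lambda are illustrative assumptions, not the final algorithm):

library(dplyr)

# freq3: 3-gram counts with columns w1, w2, w3, n
# freq2: 2-gram counts with columns w1, w2, n
predict_next <- function(prev1, prev2, freq3, freq2, lambda = 0.6, top = 3) {
  # step 1: candidate words from the 3-grams whose first two words match
  cand3 <- freq3 %>%
    filter(w1 == prev1, w2 == prev2) %>%
    transmute(word = w3, score = lambda * n / sum(n))

  # back-off: candidates from the 2-grams starting with the last written word
  cand2 <- freq2 %>%
    filter(w1 == prev2) %>%
    transmute(word = w2, score = (1 - lambda) * n / sum(n))

  # step 2: combine the two probability estimates and keep the best candidates
  bind_rows(cand3, cand2) %>%
    group_by(word) %>%
    summarise(score = sum(score), .groups = "drop") %>%
    arrange(desc(score)) %>%
    slice_head(n = top)
}

# e.g. predict_next("a", "lot", freq3, freq2) would likely rank "of" first.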

All the analyses have been performed with the software RStudio.