This report presents some statistical features of textual data, such as the frequency of N-grams (N = 1, 2, 3), the total number of words, and the number of sentences. The final aim is to build an algorithm that predicts the next word while a sentence is being written.
The data set has been provided by the company Swiftkey for the Capstone Project in the Coursera Data Science Specialization course. It consists of a set of text files in four different languages: English, Finnish, Russian, and German. For each language there are 3 files taken from different sources: blogs, news, and twitter. Each file consists of a list of text strings.
A statistical analysis of N-grams in the text is shown for the English language. An N-gram is a sequence of N words, with N usually equal to 1, 2, or 3, and can be considered the basic unit of the text.
Pre-processing of the data consists of:
Note that no spell-correction has been performed and that “don’t”, “i’m”, etc. are treated as single 1-grams.
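As a minimal sketch of the kind of cleaning implied here (the exact steps used in this report are not listed above, so the choices below are illustrative assumptions): lowercase the text, drop numbers and punctuation, but keep apostrophes so that contractions like “don’t” and “i’m” survive as single tokens.

library(stringr)

# Hypothetical cleaning helper (illustrative only): lowercase, drop numbers
# and punctuation, keep apostrophes so contractions stay single 1-grams.
clean_text <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace_all(x, "[0-9]+", " ")    # remove numbers
  x <- str_replace_all(x, "[^a-z' ]", " ")  # keep letters, apostrophes, spaces
  str_squish(x)                             # collapse repeated whitespace
}

clean_text("Don't worry, I'm reading 2 blogs!")
# [1] "don't worry i'm reading blogs"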
Statistical differences between text types have been assessed for the average number of sentences, total words, rate of unique words, 1-grams, 2-grams, and 3-grams. Averages were computed over chunks of 100,000 strings.
# A tibble: 3 x 7
  dataType Nsentences NwordsTot UniqueWordsRate  N1grams  N2grams  N3grams
  <chr>         <dbl>     <dbl>           <dbl>    <dbl>    <dbl>    <dbl>
1 blogs       260701.  4117854.          0.0247  101515. 1266416. 2830689
2 news        176693.  3038044.          0.0324   89758. 1081966  2207208.
3 twitter     159715.  1274870.          0.0462   58943.  483475.  852248.
Not surprisingly, the “richest” textual data come from blogs, followed by news and finally twitter. In fact, blogs contain about 1.5 times as many sentences as news, which in turn contain about 1.1 times as many as twitter. This influences the total number of words and, consequently, all the other features. Notice that as the total number of words grows by roughly \(10^6\), the unique-word rate drops by about one percentage point.
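As a sketch of how per-chunk summaries like those in the table above could be computed, assuming the tidytext/dplyr stack; the helper and column names below are illustrative, not the report’s actual code:

library(dplyr)
library(tidytext)
library(tibble)

# Hypothetical helper (illustrative only): summarise one chunk of raw
# lines ("strings") with simple tidytext tokenizers.
summarise_chunk <- function(lines, data_type) {
  txt       <- tibble(text = lines)
  sentences <- unnest_tokens(txt, sentence, text, token = "sentences")
  words     <- unnest_tokens(txt, word, text)
  bigrams   <- unnest_tokens(txt, bigram, text, token = "ngrams", n = 2)
  tibble(
    dataType        = data_type,
    Nsentences      = nrow(sentences),
    NwordsTot       = nrow(words),
    UniqueWordsRate = n_distinct(words$word) / nrow(words),
    N1grams         = n_distinct(words$word),
    N2grams         = n_distinct(bigrams$bigram)
  )
}

# Usage on one assumed 100,000-line chunk of the blogs file:
# summarise_chunk(blog_lines[1:100000], "blogs")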
Average values per string:

  dataType Nsentences NwordsTot uniqueWordsRate  N1grams  N2grams  N3grams
1 blogs      2.646274  41.79866       0.8689292 31.90147 39.57357 39.57380
2 twitter    1.597153  12.74870       0.9574179 12.00450 11.66902 10.75453
3 news       2.003846  34.45399       0.8871688 29.12916 32.92242 32.35963
A more detailed analysis shows that, regardless of the source type, the fraction of unique words within a single string is at least ~87%, even though the number of sentences and total words per string differs. This means that in a string of 12 to 41 words it is unlikely that the same word appears more than once.
Now let’s focus on the N-grams. All three data sets (blogs, news, and twitter) have been divided into chunks of 100000 strings.
For each chunk, the 1000 most common 1-grams, 2-grams, and 3-grams have been calculated. The figure below shows the percentage of N-grams that are common across chunks.
About 50% of the 1-grams appear in all chunks, while the percentage decreases sharply for 2-grams and 3-grams, to 33% and 14%, respectively.
Conversely, only 6% of the 1-grams appear in just one chunk, while the percentage increases sharply for 2-grams and 3-grams, to 14% and 50%, respectively.
This means that in a text of ~200,000 sentences (on average there are 2 sentences per string), at least ~506 of the 1000 most common 1-grams are present in every chunk, while the number of “rarer” 1-grams (those appearing in only one chunk) is relatively small (~68).
For 3-grams the situation is reversed: ~147 common against ~500 rare.
The case of 2-grams lies somewhere in between: ~339 common against ~147 rare.
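One way this across-chunk overlap could be computed is sketched below, again using tidytext/dplyr; the chunking and the top_ngrams helper are assumptions for illustration, not the report’s code.

library(dplyr)
library(tidytext)
library(tibble)

# Hypothetical helper (illustrative only): the 1000 most frequent
# n-grams of one chunk of text lines.
top_ngrams <- function(lines, n = 1, keep = 1000) {
  tibble(text = lines) |>
    unnest_tokens(ngram, text, token = "ngrams", n = n) |>
    count(ngram, sort = TRUE) |>
    slice_head(n = keep) |>
    pull(ngram)
}

# chunks: assumed to be a list of character vectors, 100,000 strings each.
# For each top n-gram, count in how many chunks it appears, then report
# the share present in every chunk vs. the share present in a single chunk.
overlap_share <- function(chunks, n = 1) {
  tops   <- lapply(chunks, top_ngrams, n = n)
  counts <- table(unlist(tops))
  c(in_all_chunks = mean(counts == length(chunks)),
    in_one_chunk  = mean(counts == 1))
}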
Below are some examples of common N-grams, with the mean frequency calculated across the 23 chunks of data.
   1-grams      Freq   2-grams        Freq   3-grams            Freq
 1 the     110165.13   of the     846.0870   one of the     846.0870
 2 and      64154.65   in the     716.6957   a lot of       716.6957
 3 to       61803.04   to the     456.8696   the end of     456.8696
 4 a        54962.83   on the     430.6522   it was a       430.6522
 5 of       50656.91   to be      420.0435   to be a        420.0435
 6 i        43672.17   and the    411.4783   as well as     411.4783
 7 in       30869.57   for the    401.7391   out of the     401.7391
 8 that     27269.70   i was      386.0000   some of the    386.0000
 9 is       24762.70   and i      346.8696   be able to     346.8696
10 it       23678.13   i have     323.6087   a couple of    323.6087
11 for      21013.13   at the     311.7391   i want to      311.7391
12 you      18805.13   it was     296.2609   i have to      296.2609
13 with     18191.04   it is      279.1304   this is a      279.1304
14 was      16844.83   is a       269.0000   the fact that  269.0000
15 on       15966.78   in a       259.9130   i have a       259.9130
16 my       14911.17   with the   254.1739   the rest of    254.1739
17 this     13882.65   i am       247.7391   part of the    247.7391
This report gives a general idea of the complexity of textual data. In fact, even in a relatively large amount of data (~200,000 sentences), the diversity of N-grams, especially when N > 2, is large. If one wants to use N-grams to build a word prediction algorithm, strategies that take combinations of N-grams into account should be used to improve performance and accuracy. For example, a possible strategy consists of:
1. Select a set of candidate words based on the frequency (probability) tables of 2/3-grams (i.e., considering the 1 or 2 words before the one to be predicted).
2. Choose the candidate that maximizes the probability of the already written sentence (or of its last k words) by combining the probabilities of the known 2/3-grams.
Step 2 is the most important and the trickiest, because it must handle new (unseen) combinations of words, which strongly influences the algorithm’s accuracy. The number of last words considered (k) and the way the probabilities of known N-grams are combined are the crucial elements of a well-performing algorithm; a minimal back-off sketch is given below.
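As a rough illustration of the candidate-selection part (step 1) with a naive back-off, here is a minimal sketch in R. The tables freq2 and freq3 are assumed to be data frames with columns prefix, word, and n (counts); these names and the simple fall-back logic are assumptions for illustration, not the final algorithm, which would also need to combine probabilities as described in step 2.

library(dplyr)

# Hypothetical next-word predictor (illustrative only): look up the last
# two words in a 3-gram table first, back off to the 2-gram table on the
# last word, and return the most frequent candidates.
predict_next <- function(text, freq3, freq2, top = 3) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  k <- length(words)

  if (k >= 2) {
    prefix3 <- paste(words[k - 1], words[k])
    hits3 <- freq3 |> filter(prefix == prefix3) |> arrange(desc(n))
    if (nrow(hits3) > 0) return(head(hits3$word, top))
  }

  hits2 <- freq2 |> filter(prefix == words[k]) |> arrange(desc(n))
  head(hits2$word, top)
}

# Example usage, assuming freq3 has rows like ("one of", "the", 846):
# predict_next("this is one of", freq3, freq2)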
All analyses have been performed with the software RStudio.