The aim of this milestone report is to describe the data stored in the files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, and to apply cleaning operations to these data, such as removing punctuation symbols, discarding unnecessary words, and filtering out profanity. The results will help to develop a predictive text model.
We analyze the files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt, which contain text collected from blogs, news and Twitter respectively. The data come from the web site http://www.corpora.heliohost.org/. The following table shows basic information about these files.
| File | Number of lines | Size in MB |
|---|---|---|
| en_US.blogs.txt | 899288 | 201 |
| en_US.news.txt | 1010242 | 197 |
| en_US.twitter.txt | 2360148 | 160 |
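Figures like those in the table above can be computed with a few lines of code. The following is a minimal Python sketch, not the R code used for this report; the function name `file_summary` is hypothetical, and the use of 1024 × 1024 bytes per MB is an assumption, since the report does not say which convention the table follows.

```python
import os

def file_summary(path):
    """Return (number of lines, size in MB) for a text file."""
    # Read in binary mode so that encoding problems cannot interrupt the count.
    with open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    # Assumption: 1 MB = 1024 * 1024 bytes.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return n_lines, size_mb
```

Calling `file_summary("en_US.blogs.txt")` on a local copy of the file would return the two numbers for the first row of the table.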
Because the files are large, we took a sample of 0.5 percent of the lines in each file and wrote these lines to the files sample.en_US.blogs.txt, sample.en_US.news.txt and sample.en_US.twitter.txt. These sample files were then split into 48, 52, and 118 files respectively, so that they could be handled more easily by the statistical software R.
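The sampling and splitting step can be sketched as follows. This is an illustrative Python sketch rather than the R code used for this report. The report does not state how the sample was drawn or how many lines each split file holds, so the independent per-line draw, the `seed` value and the `chunk_size` value are all assumptions.

```python
import random

def sample_lines(lines, fraction=0.005, seed=1234):
    """Keep each line independently with probability `fraction`."""
    # A fixed seed (an arbitrary, assumed value) makes the sample reproducible.
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def split_into_chunks(lines, chunk_size=100):
    """Split a list of lines into consecutive chunks of at most `chunk_size` lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
```

Each chunk returned by `split_into_chunks` could then be written to its own file and loaded separately.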
The first step in textual data analysis is tokenization. Using the R package tm, the following tasks were applied to the sample data:
Because the data contain a large number of swear words and other profanity, they were cleaned to prepare them for exploratory analysis.
In the graph below, the x-axis is the frequency of contiguous sequences of 1, 2 and 3 words, and the y-axis is the number of sequences that occur with that frequency. The graph clearly shows that the majority of contiguous word sequences have a low frequency: the count falls steeply as the frequency increases.
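The cleaning operations named in this report (removing punctuation, discarding unnecessary words, and filtering profanity) can be illustrated with a small Python sketch. It is not the tm-based R code used for the report. The word lists `STOPWORDS` and `PROFANITY` are short hypothetical placeholders, and the lowercasing step is an assumption.

```python
import string

# Hypothetical placeholder lists; a real run would load much fuller lists.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}
PROFANITY = {"darn", "heck"}

# Translation table that deletes every ASCII punctuation character.
_PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)

def clean_line(text, stopwords=STOPWORDS, profanity=PROFANITY):
    """Lowercase a line, strip punctuation, and drop stopwords and profanity."""
    text = text.lower()                       # assumption: case is not kept
    text = text.translate(_PUNCTUATION_TABLE)
    tokens = text.split()                     # split on whitespace
    return [t for t in tokens if t not in stopwords and t not in profanity]
```

For example, `clean_line("The cat, and the darn DOG!")` returns `["cat", "dog"]`.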
Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the blogs data
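The quantities plotted in such a graph can be computed in two steps: first count how often each contiguous sequence of n words occurs, then count how many sequences share each frequency. The following Python sketch shows the idea; it is not the R code used for this report, and all function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(token_lists, n):
    """Count every n-gram over a collection of tokenized lines."""
    counts = Counter()
    for tokens in token_lists:
        # n-grams never span two lines, because each line is processed alone.
        counts.update(ngrams(tokens, n))
    return counts

def frequency_of_frequencies(counts):
    """Map each frequency to the number of distinct n-grams that have it."""
    return Counter(counts.values())
```

Plotting the keys of `frequency_of_frequencies(...)` on the x-axis against its values on the y-axis gives the kind of steeply falling curve described above.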
The frequency distributions of contiguous sequences of 1, 2 and 3 words in the blogs data are shown in Tables 1, 2 and 3 of Appendix A, respectively. Words with a frequency between 31 and 678 cover 54% of all word instances. We will take this fact into account and focus on the words that appear most frequently in the blogs data in order to improve the predictive model.
Figure 2 presents the correlations for words with a frequency of at least 31 and a correlation of at least 0.40. The terms “alon” and “along” appear to be similar; however, “alon” is a typo and will be replaced by the correct term “along”.
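A frequency distribution with the columns Interval, Freq, Rel.Freq and Rel.Cum.Freq, and a coverage figure, can be derived from a list of n-gram frequencies. The Python sketch below is illustrative only and is not the R code used for this report. It assumes that coverage means the share of all word instances contributed by words whose frequency lies in a given range; the report does not define the term. The function names are hypothetical.

```python
def frequency_table(frequencies, width, start=1):
    """Bin frequencies into [start, start + width), ... and return rows of
    (interval, count, relative frequency, cumulative relative frequency)."""
    total = len(frequencies)
    n_bins = (max(frequencies) - start) // width + 1
    bins = [0] * n_bins
    for f in frequencies:
        bins[(f - start) // width] += 1
    rows, cumulative = [], 0
    for k, count in enumerate(bins):
        lo = start + k * width
        cumulative += count
        rows.append((f"[{lo},{lo + width})", count,
                     count / total, cumulative / total))
    return rows

def coverage(frequencies, low, high):
    """Share of all instances that come from items with low <= frequency <= high."""
    total = sum(frequencies)
    return sum(f for f in frequencies if low <= f <= high) / total
```

With `width=30`, `width=5` and `width=1` the interval labels follow the same pattern as the 1-word, 2-word and 3-word tables in the appendix.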
Figure 2: Visualization of correlations within the blogs data
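One way to obtain correlations between words is to compare how often two words occur in each document. The Python sketch below assumes that "correlation" means the Pearson correlation of per-document word counts; the report does not state the exact measure, so this is an assumption, and the code is not the R code used for the report. The function names and the default threshold argument `min_corr` are hypothetical, with the threshold set to the 0.40 value mentioned in the text.

```python
import math

def term_vector(term, documents):
    """Number of occurrences of `term` in each tokenized document."""
    return [doc.count(term) for doc in documents]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlated_terms(term, documents, vocabulary, min_corr=0.40):
    """Return {other term: correlation} for terms correlated with `term`."""
    base = term_vector(term, documents)
    result = {}
    for other in vocabulary:
        if other == term:
            continue
        r = pearson(base, term_vector(other, documents))
        if r >= min_corr:
            result[other] = r
    return result
```

Restricting `vocabulary` to words with a frequency of at least 31 would mirror the filter described for Figure 2.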
Appendices B and C show the corresponding information for the news and twitter data. The news and twitter data have features similar to those of the blogs data. Moreover, the typo “alon” also appears in them, so we will fix it by replacing “alon” with “along”. Finally, the modelling phase of the data science capstone project is ready to begin.
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,31) | 13367 | 0.9519 | 0.9519 |
| [31,61) | 383 | 0.0273 | 0.9792 |
| [61,91) | 150 | 0.0107 | 0.9899 |
| [91,121) | 49 | 0.0035 | 0.9934 |
| [121,151) | 25 | 0.0018 | 0.9952 |
| [151,181) | 16 | 0.0011 | 0.9963 |
| [181,211) | 11 | 0.0008 | 0.9971 |
| [211,241) | 10 | 0.0007 | 0.9978 |
| [241,271) | 5 | 0.0004 | 0.9981 |
| [271,301) | 6 | 0.0004 | 0.9986 |
| [301,331) | 8 | 0.0006 | 0.9991 |
| [331,361) | 3 | 0.0002 | 0.9994 |
| [361,391) | 0 | 0.0000 | 0.9994 |
| [391,421) | 2 | 0.0001 | 0.9995 |
| [421,451) | 0 | 0.0000 | 0.9995 |
| [451,481) | 1 | 0.0001 | 0.9996 |
| [481,511) | 0 | 0.0000 | 0.9996 |
| [511,541) | 2 | 0.0001 | 0.9997 |
| [541,571) | 2 | 0.0001 | 0.9999 |
| [571,601) | 0 | 0.0000 | 0.9999 |
| [601,631) | 1 | 0.0001 | 0.9999 |
| [631,661) | 0 | 0.0000 | 0.9999 |
| [661,691) | 1 | 0.0001 | 1.0000 |
| Sum | 14042 | 1.0000 | NA |
Table 1: Frequency distribution of contiguous sequences of 1 word in the blogs data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,6) | 91092 | 0.9970 | 0.9970 |
| [6,11) | 215 | 0.0024 | 0.9993 |
| [11,16) | 38 | 0.0004 | 0.9998 |
| [16,21) | 13 | 0.0001 | 0.9999 |
| [21,26) | 3 | 0.0000 | 0.9999 |
| [26,31) | 5 | 0.0000 | 1.0000 |
| [31,36) | 1 | 0.0000 | 1.0000 |
| [36,41) | 0 | 0.0000 | 1.0000 |
| [41,46) | 1 | 0.0000 | 1.0000 |
| Sum | 91368 | 1.0000 | NA |
Table 2: Frequency distribution of contiguous sequences of 2 words in the blogs data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,2) | 100545 | 0.9957 | 0.9957 |
| [2,3) | 394 | 0.0039 | 0.9996 |
| [3,4) | 26 | 0.0003 | 0.9999 |
| [4,5) | 12 | 0.0001 | 1.0000 |
| [5,6) | 1 | 0.0000 | 1.0000 |
| Sum | 100978 | 1.0000 | NA |
Table 3: Frequency distribution of contiguous sequences of 3 words in the blogs data
Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the news data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,31) | 13943 | 0.9536 | 0.9536 |
| [31,61) | 423 | 0.0289 | 0.9826 |
| [61,91) | 127 | 0.0087 | 0.9912 |
| [91,121) | 54 | 0.0037 | 0.9949 |
| [121,151) | 19 | 0.0013 | 0.9962 |
| [151,181) | 23 | 0.0016 | 0.9978 |
| [181,211) | 7 | 0.0005 | 0.9983 |
| [211,241) | 5 | 0.0003 | 0.9986 |
| [241,271) | 5 | 0.0003 | 0.9990 |
| [271,301) | 7 | 0.0005 | 0.9995 |
| [301,331) | 1 | 0.0001 | 0.9995 |
| [331,361) | 2 | 0.0001 | 0.9997 |
| [361,391) | 1 | 0.0001 | 0.9997 |
| [391,421) | 1 | 0.0001 | 0.9998 |
| [421,451) | 0 | 0.0000 | 0.9998 |
| [451,481) | 0 | 0.0000 | 0.9998 |
| [481,511) | 0 | 0.0000 | 0.9998 |
| [511,541) | 1 | 0.0001 | 1.0000 |
| Sum | 14619 | 1.0000 | NA |
Table 1: Frequency distribution of contiguous sequences of 1 word in the news data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,6) | 88473 | 0.9970 | 0.9970 |
| [6,11) | 207 | 0.0023 | 0.9993 |
| [11,16) | 32 | 0.0004 | 0.9997 |
| [16,21) | 13 | 0.0001 | 0.9998 |
| [21,26) | 5 | 0.0001 | 0.9999 |
| [26,31) | 1 | 0.0000 | 0.9999 |
| [31,36) | 2 | 0.0000 | 0.9999 |
| [36,41) | 1 | 0.0000 | 0.9999 |
| [41,46) | 1 | 0.0000 | 0.9999 |
| [46,51) | 2 | 0.0000 | 1.0000 |
| [51,56) | 1 | 0.0000 | 1.0000 |
| Sum | 88738 | 1.0000 | NA |
Table 2: Frequency distribution of contiguous sequences of 2 words in the news data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,2) | 97008 | 0.9952 | 0.9952 |
| [2,3) | 410 | 0.0042 | 0.9994 |
| [3,4) | 37 | 0.0004 | 0.9998 |
| [4,5) | 14 | 0.0001 | 0.9999 |
| [5,6) | 3 | 0.0000 | 1.0000 |
| Sum | 97472 | 1.0000 | NA |
Table 3: Frequency distribution of contiguous sequences of 3 words in the news data
Figure 2: Visualization of correlations within the news data
Figure 1: Frequency of contiguous sequences of 1, 2 and 3 words in the twitter data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,31) | 12334 | 0.9637 | 0.9637 |
| [31,61) | 251 | 0.0196 | 0.9834 |
| [61,91) | 77 | 0.0060 | 0.9894 |
| [91,121) | 45 | 0.0035 | 0.9929 |
| [121,151) | 20 | 0.0016 | 0.9945 |
| [151,181) | 17 | 0.0013 | 0.9958 |
| [181,211) | 11 | 0.0009 | 0.9966 |
| [211,241) | 8 | 0.0006 | 0.9973 |
| [241,271) | 5 | 0.0004 | 0.9977 |
| [271,301) | 3 | 0.0002 | 0.9979 |
| [301,331) | 6 | 0.0005 | 0.9984 |
| [331,361) | 2 | 0.0002 | 0.9985 |
| [361,391) | 5 | 0.0004 | 0.9989 |
| [391,421) | 3 | 0.0002 | 0.9991 |
| [421,451) | 3 | 0.0002 | 0.9994 |
| [451,481) | 1 | 0.0001 | 0.9995 |
| [481,511) | 0 | 0.0000 | 0.9995 |
| [511,541) | 1 | 0.0001 | 0.9995 |
| [541,571) | 1 | 0.0001 | 0.9996 |
| [571,601) | 1 | 0.0001 | 0.9997 |
| [601,631) | 1 | 0.0001 | 0.9998 |
| [631,661) | 0 | 0.0000 | 0.9998 |
| [661,691) | 0 | 0.0000 | 0.9998 |
| [691,721) | 3 | 0.0002 | 1.0000 |
| Sum | 12798 | 1.0000 | NA |
Table 1: Frequency distribution of contiguous sequences of 1 word in the twitter data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,6) | 73426 | 0.9940 | 0.9940 |
| [6,11) | 330 | 0.0045 | 0.9984 |
| [11,16) | 60 | 0.0008 | 0.9992 |
| [16,21) | 30 | 0.0004 | 0.9996 |
| [21,26) | 5 | 0.0001 | 0.9997 |
| [26,31) | 5 | 0.0001 | 0.9998 |
| [31,36) | 3 | 0.0000 | 0.9998 |
| [36,41) | 4 | 0.0001 | 0.9999 |
| [41,46) | 3 | 0.0000 | 0.9999 |
| [46,51) | 2 | 0.0000 | 0.9999 |
| [51,56) | 1 | 0.0000 | 1.0000 |
| [56,61) | 0 | 0.0000 | 1.0000 |
| [61,66) | 1 | 0.0000 | 1.0000 |
| Sum | 73870 | 1.0000 | NA |
Table 2: Frequency distribution of contiguous sequences of 2 words in the twitter data
| Interval | Freq | Rel.Freq | Rel.Cum.Freq |
|---|---|---|---|
| [1,2) | 84686 | 0.9944 | 0.9944 |
| [2,3) | 395 | 0.0046 | 0.9990 |
| [3,4) | 53 | 0.0006 | 0.9996 |
| [4,5) | 14 | 0.0002 | 0.9998 |
| [5,6) | 6 | 0.0001 | 0.9999 |
| [6,7) | 2 | 0.0000 | 0.9999 |
| [7,8) | 1 | 0.0000 | 0.9999 |
| [8,9) | 1 | 0.0000 | 0.9999 |
| [9,10) | 2 | 0.0000 | 1.0000 |
| Sum | 85160 | 1.0000 | NA |
Table 3: Frequency distribution of contiguous sequences of 3 words in the twitter data
Figure 2: Visualization of correlations within the twitter data