1 Synopsis

The aim of this report is to present an exploratory data analysis of the Coursera SwiftKey corpus (composed of three files: news, blogs and Twitter samples, available here) and to state the goals for the next steps regarding the prediction algorithm(s) to use and the resulting Shiny application.

2 Data Processing

2.1 Loading and summarizing the data

Here I present a basic summary of the three source files (downloaded from the Coursera website, capstone project) with the following fields (a sketch of how these statistics can be computed follows the list):

  • file name (text file),
  • file size (in bytes),
  • the number of lines,
  • the number of words,
  • the number of characters,
  • the maximum line length (in characters),
  • the time taken to load and compute these basic statistics (expressed in seconds; only relevant to my current OS and resources).
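
A hedged sketch of how such a summary can be gathered with base R (a minimal sketch, not necessarily the exact code used to produce the table below):

  # Summarize one corpus file: size, lines, words, characters, longest line, timing.
  summarize_file <- function(path) {
    t0    <- Sys.time()
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    data.frame(
      filename      = basename(path),
      file_size     = file.size(path),                       # in bytes
      num_lines     = length(lines),
      num_words     = sum(lengths(strsplit(lines, "\\s+"))),
      num_chars     = sum(nchar(lines)),
      max_len_chars = max(nchar(lines)),
      time_elapsed  = round(as.numeric(difftime(Sys.time(), t0, units = "secs")), 3)
    )
  }
  do.call(rbind, lapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
                        summarize_file))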
Summary of the original files (sizes in bytes, times in seconds)
filename file_size num_lines num_words num_chars max_len_chars time_elapsed
en_US.blogs.txt 210,160,014 899,288 37,334,131 206,824,505 40,833 44.753
en_US.news.txt 205,811,889 1,010,242 34,372,530 203,223,159 11,384 41.189
en_US.twitter.txt 167,105,338 2,360,148 30,373,583 162,096,241 140 77.038

2.2 Data Cleaning

For this step I applied the following ordered sequence of operations, written with core R functions (a hedged sketch of such a pipeline is shown after the list):

  1. normalize single-quotes, double-quotes, hyphens, underscores, dots,
  2. convert to ASCII (from utf-8),
  3. convert to lower case,
  4. remove hash-tags and URLs,
  5. normalize dots (to a single dot) and split into sentences using the dot as separator,
  6. remove non-words,
  7. filter out sentences containing offending words according to the profanity list,
  8. remove empty words and single-letter non-words,
  9. normalize spaces to a single space,
  10. remove leading and trailing spaces.
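
A minimal sketch of such a pipeline in base R (the regular expressions and file names are illustrative assumptions, not the exact code used for this report):

  # Illustrative cleaning pipeline; comments refer to the numbered steps above.
  clean_lines <- function(x, profanity) {
    x <- gsub("[\u2018\u2019`]", "'", x)                                # 1. normalize single quotes
    x <- gsub("[\u201C\u201D]", "\"", x)                                #    and double quotes
    x <- gsub("[-_]+", " ", x)                                          #    hyphens and underscores
    x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ")              # 2. convert to ASCII
    x <- tolower(x)                                                     # 3. convert to lower case
    x <- gsub("#\\S+|https?://\\S+|www\\.\\S+", " ", x)                 # 4. remove hash-tags and URLs
    x <- gsub("\\.+", ".", x)                                           # 5. normalize dots ...
    x <- unlist(strsplit(x, ".", fixed = TRUE))                         #    ... and split into sentences
    x <- gsub("[^a-z' ]", " ", x)                                       # 6. remove non-words
    x <- x[!grepl(paste(profanity, collapse = "|"), x)]                 # 7. drop sentences with profanity
    x <- gsub("(?<![a-z'])[b-hj-z](?![a-z'])", " ", x, perl = TRUE)     # 8. remove single-letter non-words
    x <- gsub(" +", " ", x)                                             # 9. normalize spaces
    x <- trimws(x)                                                      # 10. trim leading/trailing spaces
    x[nzchar(x)]                                                        #     drop empty sentences
  }
  # "profanity.txt" is an assumed local copy of the profanity list described below.
  profanity <- readLines("profanity.txt")
  writeLines(clean_lines(readLines("en_US.blogs.txt", encoding = "UTF-8"), profanity),
             "en_US.blogs.clean_100p.txt")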

Remarks:

  • I did not remove stop words (in an attempt to produce fewer meaningless n-grams).
  • I decided to remove whole sentences containing offending words, instead of just the offending words, once again in an attempt to limit meaningless n-grams.
  • The profanity list was downloaded from Google blacklisted Words, Bad Words List, List of Swear Words (July 2018) and contains, after review, 1,265 entries (note: some lines contain multiple words, and some entries are not ‘bad’ words per se but depend on context; not being a native English speaker nor an expert on profanity, I kept almost all the entries).
  • At this stage, I did not find an efficient way to detect (and remove) foreign words in the corpus (en_US), although their (expected) low frequency relative to the whole corpus may make this a non-issue.
Summary of the cleaned-up files (sizes in bytes, times in seconds)
filename file_size num_lines num_words num_chars max_len_chars time_elapsed
en_US.blogs.clean_100p.txt 183,208,188 2,341,953 33,709,011 180,866,235 2,941 80.285
en_US.news.clean_100p.txt 180,285,309 2,056,668 30,976,197 178,228,641 3,213 71.603
en_US.twitter.clean_100p.txt 141,623,805 3,794,753 27,022,842 137,829,052 140 126.605

Comparison with the original data, expressed as a percentage reduction, in other words the share removed from the original file, reported only for the file size, the number of words and the number of characters (a quick check of these numbers is sketched after the list). For example, the first line tells us that for the file en_US.blogs.txt:

  • file size was reduced by 12.82%,
  • number of words reduced by 9.71%, and
  • number of characters reduced by 12.55%.
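
As a sanity check, these percentages can be recomputed directly from the two summary tables above (a minimal sketch; the figures are copied from the tables):

  # percentage reduction between an original and a cleaned value
  reduction <- function(original, cleaned) round(100 * (original - cleaned) / original, 2)
  reduction(210160014, 183208188)   # en_US.blogs.txt file size       -> 12.82
  reduction(37334131, 33709011)     # en_US.blogs.txt number of words -> 9.71
  reduction(206824505, 180866235)   # en_US.blogs.txt number of chars -> 12.55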
Summary of the main differences
filename diff_size diff_words diff_chars
en_US.blogs.clean_100p.txt 12.82% 9.71% 12.55%
en_US.news.clean_100p.txt 12.4% 9.88% 12.3%
en_US.twitter.clean_100p.txt 15.25% 11.03% 14.97%

2.3 Exploratory Analysis

I created a corpus based on the three cleaned-up files obtained in the previous stage, using the R quanteda package (quanteda: Quantitative Analysis of Textual Data). Then I defined a document-feature matrix from which I built the 5-gram model (wikipedia n-gram) all the way down to uni-grams.

I then computed the frequencies for my 5-gram model (down to uni-grams) using the textstat_frequency() function (from the R quanteda package).
In the following sections, I present results obtained on a 20% sample of the three cleaned-up files (corpus), which is itself split into three subsets: training (80%), development (10%) and testing (10%). The results shown come from the training set.
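
A hedged sketch of this stage with quanteda (the file handling, the seed and the split are assumptions; from quanteda version 3 onwards textstat_frequency() lives in the quanteda.textstats package):

  library(quanteda)

  files <- c("en_US.blogs.clean_100p.txt", "en_US.news.clean_100p.txt",
             "en_US.twitter.clean_100p.txt")
  txt   <- unlist(lapply(files, readLines, encoding = "UTF-8"))

  set.seed(2018)                                    # assumed seed
  samp  <- sample(txt, round(0.2 * length(txt)))    # 20% sample
  cuts  <- floor(c(0.8, 0.9) * length(samp))
  train <- samp[seq_len(cuts[1])]                   # 80% training
  devel <- samp[(cuts[1] + 1):cuts[2]]              # 10% development
  tests <- samp[(cuts[2] + 1):length(samp)]         # 10% testing

  toks  <- tokens(corpus(train), what = "word")
  ngr5  <- tokens_ngrams(toks, n = 5, concatenator = " ")
  freq5 <- textstat_frequency(dfm(ngr5))            # columns: feature, frequency, rank, ...
  nrow(freq5)                                       # number of distinct 5-grams
  head(freq5, 50)                                   # top 50 5-grams, as listed below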

5-gram model (on the 20% sample)
n-grams number of distinct n-grams (features)
1-grams 249,801
2-grams 3,188,407
3-grams 7,619,595
4-grams 9,666,425
5-grams 9,608,035

2.3.1 Frequency, histogram and word cloud for 5-grams

Top 50 5-grams for our model.

                    feature frequency rank
1         at the end of the       590    1
2     for the first time in       256    2
3      in the middle of the       255    3
4        the end of the day       194    4
5         by the end of the       185    5
6     thank you so much for       175    6
7       for the rest of the       171    7
8          is going to be a       154    8
9        there are a lot of       144    9
10       it's going to be a       142   10
11          to be a part of       133   11
12    the other side of the       129   12
13 thanks for the shout out       126   13
14      i can't wait to see       121   14
15       is one of the most       120   15
16      the end of the year       118   16
17        at the top of the       116   17
18    can't wait to see you       116   18
19      this is going to be       113   19
20     on the other side of       110   20
21       let me know if you       107   21
22 for the first time since       103   22
23       i love you so much        99   23
24      and the rest of the        94   24
25     for those of you who        93   25
                     feature frequency rank
26      the end of the month        93   26
27        in the middle of a        92   27
28     keep up the good work        92   28
29    thanks so much for the        90   29
30      at the bottom of the        89   30
31  thank you for the follow        88   31
32     hope you have a great        83   32
33      but at the same time        82   33
34     i thought it would be        82   34
35      to figure out how to        80   35
36       the rest of the day        80   36
37      let me know what you        79   37
38          to be one of the        78   38
39      if you would like to        78   39
40      in the bottom of the        77   40
41 happy mother's day to all        75   41
42    this is the first time        73   42
43        let us know if you        73   43
44       for a chance to win        72   44
45   at the beginning of the        70   45
46          to find a way to        69   46
47         there is a lot of        68   47
48       going to be a great        68   48
49        going to be a good        68   49
50        it was going to be        67   50
Most frequent 5-grams (histogram)

Word cloud of the most frequent 5-grams
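
The histogram and the word cloud above could be produced along the following lines (a sketch only; freq5 and ngr5 are the objects from the previous sketch, and textplot_wordcloud() moves to the quanteda.textplots package from quanteda version 3 onwards):

  top5 <- head(freq5, 50)
  barplot(rev(top5$frequency), names.arg = rev(top5$feature),
          horiz = TRUE, las = 1, cex.names = 0.5,
          main = "Most frequent 5-grams")           # histogram of the top 50 5-grams
  textplot_wordcloud(dfm(ngr5), max_words = 100)    # word cloud of the most frequent 5-grams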

2.3.2 Frequency, histogram and word cloud for 4-grams

Top 50 4-grams for our model.

                 feature frequency rank
1         the end of the      1215    1
2          at the end of      1042    2
3        the rest of the       987    3
4     for the first time       948    4
5  thanks for the follow       940    5
6       at the same time       746    6
7         is going to be       654    7
8          is one of the       610    8
9        one of the most       582    9
10      when it comes to       572   10
11         going to be a       560   11
12      in the middle of       557   12
13         to be able to       538   13
14     thanks for the rt       534   14
15        if you want to       509   15
16     thank you for the       483   16
17     can't wait to see       478   17
18       one of the best       461   18
19     thank you so much       445   19
20       i don't want to       441   20
21         i am going to       388   21
22  in the united states       377   22
23       i would like to       376   23
24       i can't wait to       356   24
25        the top of the       354   25
               feature frequency rank
26    it's going to be       351   26
27       by the end of       346   27
28      i wish i could       336   28
29  one of my favorite       332   29
30   the middle of the       328   30
31      i was going to       328   31
32     for the rest of       328   32
33     a lot of people       320   33
34     in front of the       304   34
35   on the other hand       299   35
36          a bit of a       298   36
37   what do you think       297   37
38      was one of the       291   38
39   the first time in       289   39
40   the bottom of the       286   40
41      as well as the       280   41
42      i just want to       273   42
43       i was able to       271   43
44 said in a statement       269   44
45   you don't have to       267   45
46       have a lot of       267   46
47   i don't know what       267   47
48        to go to the       264   48
49    have a great day       261   49
50     hope to see you       261   50
Most frequent 4-grams (histogram)

Word cloud of the most frequent 4-grams

2.3.3 Frequency, histogram and word cloud for 3-grams

Top 50 3-grams for our model.

              feature frequency rank
1          one of the      5052    1
2            a lot of      4620    2
3      thanks for the      3788    3
4         going to be      2633    4
5             to be a      2615    5
6          the end of      2307    6
7           i want to      2243    7
8            it was a      2223    8
9          out of the      2213    9
10        some of the      2037   10
11         as well as      2029   11
12         be able to      1956   12
13        part of the      1883   13
14           i have a      1801   14
15          i have to      1700   15
16 looking forward to      1678   16
17        the rest of      1667   17
18       i don't know      1657   18
19      thank you for      1600   19
20     the first time      1554   20
21        is going to      1514   21
22        a couple of      1499   22
23         i love you      1480   23
24          this is a      1474   24
25         end of the      1441   25
          feature frequency rank
26      i need to      1437   26
27    you have to      1415   27
28    you want to      1407   28
29   i'm going to      1361   29
30     there is a      1351   30
31  can't wait to      1315   31
32   in the world      1310   32
33  the fact that      1305   33
34    this is the      1291   34
35     at the end      1290   35
36    it would be      1279   36
37       to go to      1279   37
38      one of my      1277   38
39    there is no      1224   39
40  for the first      1222   40
41        it is a      1216   41
42  i don't think      1195   42
43      is one of      1191   43
44 for the follow      1177   44
45   in the first      1156   45
46      to have a      1151   46
47    most of the      1123   47
48     all of the      1109   48
49    in front of      1093   49
50    of the year      1079   50
Most frequent 3-grams (histogram)

Word cloud of the most frequent 3-grams

2.3.4 Frequency, histogram and word cloud for bi-grams

Top 50 bi-grams for our model.

    feature frequency rank
1    of the     63376    1
2    in the     59531    2
3    to the     31422    3
4   for the     30487    4
5    on the     28911    5
6     to be     23992    6
7    at the     21017    7
8   and the     18175    8
9      in a     17112    9
10 with the     15784   10
11     is a     14763   11
12   it was     14557   12
13    for a     14070   13
14   i have     13221   14
15    i was     12793   15
16 from the     12774   16
17  will be     12398   17
18    and i     12129   18
19    it is     12053   19
20   with a     11854   20
21 going to     11830   21
22     i am     11732   22
23     of a     11365   23
24   have a     11150   24
25   if you     10941   25
     feature frequency rank
26    is the     10935   26
27    one of     10643   27
28    to get     10398   28
29      as a      9862   29
30   want to      9624   30
31   have to      9311   31
32   this is      8994   32
33    by the      8862   33
34   i think      8815   34
35     to do      8714   35
36  that the      8655   36
37 the first      8473   37
38     and a      8435   38
39   i don't      8408   39
40    to see      8205   40
41      to a      8156   41
42    out of      8145   42
43     was a      8144   43
44      on a      7917   44
45    that i      7802   45
46     but i      7800   46
47    i love      7754   47
48   all the      7484   48
49   you can      7376   49
50   to make      7371   50
Most frequent bi-grams (histogram)

Word cloud of the most frequent bi-grams

2.3.5 Frequency, histogram and word cloud for uni-grams

Top 50 uni-grams for our model.

   feature frequency rank
1      the    701491    1
2       to    409185    2
3      and    353071    3
4        a    350075    4
5       of    293013    5
6        i    248945    6
7       in    241229    7
8      for    165302    8
9       is    159340    9
10    that    152626   10
11     you    141593   11
12      it    136952   12
13      on    121136   13
14    with    105561   14
15     was     91207   15
16      my     89682   16
17      at     84793   17
18      be     81832   18
19    this     81189   19
20    have     80262   20
21     are     73441   21
22     but     72151   22
23      as     70751   23
24      we     64082   24
25      he     63103   25
   feature frequency rank
26     not     60751   26
27      so     57590   27
28    from     56703   28
29      me     54238   29
30     all     49032   30
31    will     47779   31
32    they     47378   32
33      by     45835   33
34    just     45508   34
35      or     45262   35
36    your     44972   36
37    said     44948   37
38      an     43901   38
39     out     43876   39
40   about     43796   40
41     his     42975   41
42      up     42880   42
43     one     42763   43
44    what     41936   44
45      if     41199   45
46    like     39378   46
47    when     38221   47
48     has     38216   48
49     can     37149   49
50    more     36578   50
Most frequent uni-grams (histogram)

Word cloud of the most frequent uni-grams

3 Next steps and goals

At this stage, I still need to spend some time on NLP resources to gain a better understanding of this topic.

For the word-prediction application, I am planning to apply (depending on the time I have) one or more of the smoothing and back-off techniques covered in the References section, such as Kneser-Ney smoothing, to the 5-gram model I built.
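
As an illustration only, and not the algorithm finally retained, a naive back-off lookup over the frequency tables computed above could look like the sketch below (freq5 down to freq2 are assumed names for the 5-gram to bi-gram tables):

  # Return the most frequent continuation of the prefix, backing off to lower orders.
  predict_next <- function(prefix, tables) {
    words <- strsplit(tolower(trimws(prefix)), "\\s+")[[1]]
    for (tab in tables) {                                    # highest order first
      n <- lengths(strsplit(tab$feature[1], " "))            # order of this table
      if (length(words) < n - 1) next
      ctx  <- paste(tail(words, n - 1), collapse = " ")
      hits <- tab[startsWith(tab$feature, paste0(ctx, " ")), ]
      if (nrow(hits) > 0)
        return(sub(".* ", "", hits$feature[1]))              # last word of the best match
    }
    NA_character_
  }
  predict_next("at the end of", list(freq5, freq4, freq3, freq2))

With the frequencies reported in this document, this sketch would return "the" for the prefix "at the end of", since "at the end of the" is the most frequent 5-gram starting with that context.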

There is also some work to do on making the application as resource-efficient as possible without compromising accuracy too much (a trade-off). I am not yet ready to tackle this, but I am considering options.

4 References

These are the resources I am using for NLP and this project:

  1. Speech and Language Processing (3rd ed. draft)
  2. An Empirical Study of Smoothing Techniques for Language Modeling
  3. Large Language Models in Machine Translation
  4. Implementation of Modified Kneser-Ney Smoothing on Top of Generalized Language Models for Next Word Prediction
  5. NLP Lunch Tutorial: Smoothing
  6. Kneser-Ney Smoothing
  7. Modified Kneser-Ney Smoothing of n-gram Models
  8. Kneser-Ney smoothing explained