Description of the process:

The process starts by loading the data from the file; then a step that removes profanity words is executed. The cleaned data is used to identify the most frequent words, which are shown in two plots.

Additionally, there is a process that identifies the pairs of words that most frequently occur together (bigrams). This is the main idea behind the algorithm used to achieve the goal, and it is illustrated in the sketch below.
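As an illustration of the bigram idea (a minimal sketch, not the published implementation; the ‘texts’ data frame and its contents are placeholders), consecutive word pairs can be counted with tidytext:

library(dplyr)
library(tidytext)

# Placeholder input standing in for the loaded file
texts <- data.frame(text = c("happy birthday to you",
                             "happy birthday dear friend"))

# Split each line into consecutive two-word sequences and count
# how often each pair occurs
bigrams <- texts %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)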

Tables and plots are included to present the results.

Functions

For the exploratory analysis, several functions were built, each with a specific responsibility. The main functions are described below:

  1. getTokensFromText : the main function, which orchestrates the whole process. It receives the path of the file to load.

  2. removeProfanityWords : its responsibility is to delete profanity words from the data (see the sketch after this list).

  3. identifyProfanityWords : its goal is to identify the profanity words in a sentence and add them to the ‘uniqueProfanityWords’ variable.

  4. existsOnUniqueProfanityWords : its responsibility is to avoid adding a profanity word that already exists in the ‘uniqueProfanityWords’ variable.

  5. visualizeMoreFrequentTokens : plots the principal tokens. It receives the tokens to display.

  6. create_n_gram : used to identify the most frequent sequences of words that occur one after another.

More information about the functions is available at: https://rpubs.com/ricoherrera/autocomplete-functions
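The exact implementation is in the link above. As a rough sketch, and judging from the split tokens in the outputs below (e.g. “sa ay”), the profanity filter appears to remove matching substrings even inside longer words, which gsub() reproduces; the function body and the ‘profanityList’ argument here are assumptions, not the published code:

# Hypothetical sketch of removeProfanityWords: blank out every
# occurrence of each profanity, even when it appears inside a word
removeProfanityWords <- function(text, profanityList) {
  for (badWord in profanityList) {
    text <- gsub(badWord, " ", text, ignore.case = TRUE)
  }
  text
}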

Executions - twitter data

Below is the execution of the process for this file, followed by the visualization by word:

getTokensFromText("final/en_US/en_US.twitter.txt")

visualizeTwentyMoreFrequentTokens()

The 100 most frequent tokens in a word cloud:

wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)
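Here wordcloud() comes from the wordcloud package, and tokenizerData is assumed to be a data frame with a ‘word’ column and a ‘total’ frequency column produced by the tokenization step; for example (illustrative values only):

library(wordcloud)

# Assumed shape of tokenizerData (not the real counts)
tokenizerData <- data.frame(word  = c("love", "day", "time"),
                            total = c(1000, 900, 800))
wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)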

n-grams - bigram:

Note: split tokens such as “sa ay” (likely “Saturday”) and “cl es” (likely “classes”) in the tables below are artifacts of the profanity filter removing matching substrings inside longer words.

create_bigrams(dataWithoutProfanityWords,700)
##        word1     word2    n
## 1         sa        ay 5941
## 2      happy  birthday 5800
## 3     social     media 2707
## 4   mother's       day 2020
## 5       stay     tuned 1834
## 6    mothers       day 1793
## 7         cl        es 1679
## 8        san     diego 1540
## 9         rt        rt 1418
## 10     happy    friday 1358
## 11        cl        ic 1344
## 12       ice     cream 1336
## 13     happy      hour 1309
## 14 beautiful       day 1252
## 15     happy   mothers 1225
## 16  tomorrow     night 1134
## 17     happy  mother's 1130
## 18        ha        ha 1076
## 19   awkward    moment 1071
## 20     merry christmas 1027
create_bigram_plot(bi_grama)
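create_bigram_plot is not reproduced here; a minimal sketch of what it could look like with ggplot2, assuming bi_grama is the data frame printed above (columns word1, word2, n):

library(ggplot2)

# Hypothetical sketch: bar chart of the top bigrams by count
create_bigram_plot <- function(bi_grama, top = 20) {
  topPairs <- head(bi_grama[order(-bi_grama$n), ], top)
  topPairs$bigram <- paste(topPairs$word1, topPairs$word2)
  ggplot(topPairs, aes(x = reorder(bigram, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Bigram", y = "Count")
}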

n-grams - trigram:

(The “NA NA NA” row below most likely comes from texts with fewer than three words, for which the tokenizer returns NA trigrams.)

create_trigrams(dataWithoutProfanityWords)
##                  trigram Total reps
## 1               NA NA NA      24115
## 2      happy mothers day       1203
## 3     happy mother's day       1123
## 4          cinco de mayo        693
## 5            sa ay night        641
## 6       st patrick's day        289
## 7         love love love        267
## 8               ha ha ha        262
## 9         cake cake cake        235
## 10 happy valentine's day        235
## 11         sa ay morning        215
## 12   ralph waldo emerson        214
## 13  happy valentines day        201
## 14        happy hump day        194
## 15        happy cinco de        183
## 16    martin luther king        179
## 17           omg omg omg        172
## 18           happy sa ay        166
## 19        blah blah blah        153
## 20              la la la        153

Executions - blogs data

getTokensFromText("final/en_US/en_US.blogs.txt")

visualizeTwentyMoreFrequentTokens()

The 100 most frequent tokens in a word cloud:

wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)

n-grams - bigram:

create_bigrams(dataWithoutProfanityWords,700)
##     word1   word2    n
## 1    <NA>    <NA> 3612
## 2      sa      ay 3391
## 3      cl      ic 1771
## 4      cl      es 1559
## 5      po    tion 1450
## 6   compe     ion 1284
## 7     ice   cream 1177
## 8   weeks     ago 1114
## 9     cir stances 1072
## 10  jesus  christ  910
## 11 social   media  906
## 12  parti      te  825
## 13  south  africa  810
## 14   real    life  789
## 15  olive     oil  726
## 16     gl      es  714
## 17   cons   ution  703
## 18   feel    free  703
## 19 months     ago  694
## 20   blog    post  691
create_bigram_plot(bi_grama)

Executions - news data

getTokensFromText("final/en_US/en_US.news.txt")

visualizeTwentyMoreFrequentTokens()

The 100 most frequent tokens from the news data in a word cloud:

wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)

n-grams - bigram:

create_bigrams(dataWithoutProfanityWords,100)
##        word1     word2   n
## 1         sa        ay 605
## 2         st     louis 449
## 3        los   angeles 300
## 4        san francisco 237
## 5     health      care 217
## 6      compe       ion 187
## 7        san     diego 158
## 8       vice president 148
## 9         po      tion 136
## 10        cl        ic 131
## 11       ins       ute 126
## 12     white     house 120
## 13        cl        es 119
## 14   supreme     court 118
## 15      real    estate 113
## 16 executive  director 108
## 17      city   council 106
## 18    school  district 106
## 19 president    barack 100
## 20     compe       ive  99
create_bigram_plot(bi_grama)