The process starts by loading the data from the file; then a step to remove profanity words is executed. The cleaned data is used to identify the most frequent words, and two plots display them.
Additionally, there is a step to identify the most frequent pairs of words that occur together. This is the main idea behind the algorithm used to achieve the goal.
Some tables and plots are included to present the results.
For the exploratory analysis, several functions were built, each with a specific responsibility. A description of the main functions follows:
getTokensFromText : main function that orchestrates the process. It receives the path of the file to load.
removeProfanityWords : its responsibility is to delete profanity words from the data.
identifyProfanityWords : its goal is to identify the profanity words in a sentence and add them to the ‘uniqueProfanityWords’ variable.
existsOnUniqueProfanityWords : its responsibility is to avoid adding an already existing profanity word to the ‘uniqueProfanityWords’ variable.
visualizeMoreFrequentTokens : plots the most frequent tokens. It receives the tokens.
create_n_gram : used to identify the most frequent sequences of words used one after another.
More information about the functions is available at: https://rpubs.com/ricoherrera/autocomplete-functions. A simplified sketch of how these helpers fit together is shown below.
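The snippet below is not the actual implementation from the link above; it is a minimal, hedged approximation of the tokenization and profanity-removal steps, assuming the dplyr and tidytext packages and a hypothetical profanityWords character vector loaded elsewhere.

library(dplyr)
library(tidytext)

# Hypothetical sketch (not the project's actual code): read a file,
# tokenize it into words, drop profanity words, and return word frequencies.
# 'profanityWords' is assumed to be a character vector loaded elsewhere.
getTokensFromTextSketch <- function(path, profanityWords) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  tibble(text = lines) %>%
    unnest_tokens(word, text) %>%            # one token per row
    filter(!word %in% profanityWords) %>%    # remove profanity words
    count(word, name = "total", sort = TRUE) # most frequent words first
}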
getTokensFromText("final/en_US/en_US.twitter.txt")
visualizeTwentyMoreFrequentTokens()
wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)
create_bigrams(dataWithoutProfanityWords,700)
## word1 word2 n
## 1 sa ay 5941
## 2 happy birthday 5800
## 3 social media 2707
## 4 mother's day 2020
## 5 stay tuned 1834
## 6 mothers day 1793
## 7 cl es 1679
## 8 san diego 1540
## 9 rt rt 1418
## 10 happy friday 1358
## 11 cl ic 1344
## 12 ice cream 1336
## 13 happy hour 1309
## 14 beautiful day 1252
## 15 happy mothers 1225
## 16 tomorrow night 1134
## 17 happy mother's 1130
## 18 ha ha 1076
## 19 awkward moment 1071
## 20 merry christmas 1027
create_bigram_plot(bi_grama)
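A bigram table like the one above can be produced with tidytext's n-gram tokenizer. The snippet below is only an assumed sketch of the idea behind create_bigrams, not the project's actual code; dataWithoutProfanityWords is assumed to be a data frame with a text column, and minimum_count mirrors the numeric threshold passed in the call above.

library(dplyr)
library(tidyr)
library(tidytext)

# Assumed sketch of the bigram step: count pairs of consecutive words.
create_bigrams_sketch <- function(dataWithoutProfanityWords, minimum_count) {
  dataWithoutProfanityWords %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    count(word1, word2, sort = TRUE) %>%
    filter(n >= minimum_count)   # keep only frequent pairs, e.g. n >= 700
}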
create_trigrams(dataWithoutProfanityWords)
## trigram Total reps
## 1 NA NA NA 24115
## 2 happy mothers day 1203
## 3 happy mother's day 1123
## 4 cinco de mayo 693
## 5 sa ay night 641
## 6 st patrick's day 289
## 7 love love love 267
## 8 ha ha ha 262
## 9 cake cake cake 235
## 10 happy valentine's day 235
## 11 sa ay morning 215
## 12 ralph waldo emerson 214
## 13 happy valentines day 201
## 14 happy hump day 194
## 15 happy cinco de 183
## 16 martin luther king 179
## 17 omg omg omg 172
## 18 happy sa ay 166
## 19 blah blah blah 153
## 20 la la la 153
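The trigram table follows the same pattern. A hedged sketch, again assuming a data frame with a text column rather than the project's actual create_trigrams code, could look like this:

library(dplyr)
library(tidytext)

# Assumed sketch: count three-word sequences and keep the twenty most frequent.
create_trigrams_sketch <- function(dataWithoutProfanityWords) {
  dataWithoutProfanityWords %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
    count(trigram, name = "Total_reps", sort = TRUE) %>%
    slice_head(n = 20)
}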
getTokensFromText("final/en_US/en_US.blogs.txt")
visualizeTwentyMoreFrequentTokens()
wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)
create_bigrams(dataWithoutProfanityWords,700)
## word1 word2 n
## 1 <NA> <NA> 3612
## 2 sa ay 3391
## 3 cl ic 1771
## 4 cl es 1559
## 5 po tion 1450
## 6 compe ion 1284
## 7 ice cream 1177
## 8 weeks ago 1114
## 9 cir stances 1072
## 10 jesus christ 910
## 11 social media 906
## 12 parti te 825
## 13 south africa 810
## 14 real life 789
## 15 olive oil 726
## 16 gl es 714
## 17 cons ution 703
## 18 feel free 703
## 19 months ago 694
## 20 blog post 691
create_bigram_plot(bi_grama)
getTokensFromText("final/en_US/en_US.news.txt")
visualizeTwentyMoreFrequentTokens()
wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)
create_bigrams(dataWithoutProfanityWords,100)
## word1 word2 n
## 1 sa ay 605
## 2 st louis 449
## 3 los angeles 300
## 4 san francisco 237
## 5 health care 217
## 6 compe ion 187
## 7 san diego 158
## 8 vice president 148
## 9 po tion 136
## 10 cl ic 131
## 11 ins ute 126
## 12 white house 120
## 13 cl es 119
## 14 supreme court 118
## 15 real estate 113
## 16 executive director 108
## 17 city council 106
## 18 school district 106
## 19 president barack 100
## 20 compe ive 99
create_bigram_plot(bi_grama)
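The bigram plots referenced above could be drawn with ggplot2. The following is only an assumed sketch of what create_bigram_plot might do, given a bi_grama data frame with word1, word2 and n columns; it is not the implementation linked earlier.

library(dplyr)
library(ggplot2)

# Assumed sketch: horizontal bar chart of the twenty most frequent bigrams.
create_bigram_plot_sketch <- function(bi_grama) {
  bi_grama %>%
    slice_max(n, n = 20) %>%
    mutate(bigram = paste(word1, word2)) %>%
    ggplot(aes(x = reorder(bigram, n), y = n)) +
    geom_col() +
    coord_flip() +
    labs(x = "Bigram", y = "Frequency")
}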