Description of the process:

The process starts by loading the data from the three files (Twitter, blogs, news). A sample is drawn from each file, the three samples are joined into one data set, and a step that identifies profanity words is then run on it.
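The sampling and joining step can be sketched in base R as follows. This is only an illustration: the helper `sample_lines`, the 25% fraction, the seed, and the small temporary files are assumptions made so the example runs on its own, and they are not taken from the functions behind this report.

```r
# Hypothetical helper: read a text file and keep a random fraction of its lines.
sample_lines <- function(path, fraction = 0.1) {
  lines  <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  n_keep <- ceiling(length(lines) * fraction)
  lines[sample.int(length(lines), n_keep)]
}

# Small stand-in files so the sketch runs without the real corpus.
paths <- c(twitter = tempfile(), news = tempfile(), blogs = tempfile())
for (source_name in names(paths)) {
  writeLines(paste(source_name, "line", 1:20), paths[[source_name]])
}

set.seed(1234)  # assumed seed, only to make the sample reproducible

# Sample each source, then join the three samples into one data frame.
samples  <- lapply(paths, sample_lines, fraction = 0.25)
combined <- data.frame(
  source = rep(names(samples), times = lengths(samples)),
  text   = unlist(samples, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Keeping the `source` column next to the text makes it possible to check later that each file contributes its share to the joined sample.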

The second step tokenizes the joined data into words. During this step profanity words, stop words, and numbers are removed. Two plots then show the most frequent tokens.
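A minimal base R sketch of this kind of tokenization is shown below. The word lists, the helper `tokenize_words`, and the two example sentences are illustrative assumptions; the report's own `tokenize_data()` may work differently.

```r
# Illustrative word lists; the real lists used by the report are not shown here.
stop_words_example <- c("the", "a", "is", "of", "and", "to", "in", "at")
profanity_example  <- c("darn")  # placeholder standing in for a real profanity list

# Hypothetical helper: lower-case, strip digits and punctuation, split on spaces,
# then drop every token found in `drop`.
tokenize_words <- function(text, drop = character(0)) {
  text   <- tolower(text)
  text   <- gsub("[^a-z' ]+", " ", text)  # digits and punctuation become spaces
  tokens <- unlist(strsplit(text, " +"))
  tokens <- tokens[nzchar(tokens)]
  tokens[!tokens %in% drop]
}

text   <- c("The time of day is 10 at home", "Darn, love the time and the day")
tokens <- tokenize_words(text, drop = c(stop_words_example, profanity_example))

# Frequency table, most frequent first, in a word/total layout.
counts       <- sort(table(tokens), decreasing = TRUE)
token_totals <- data.frame(word = names(counts), total = as.integer(counts),
                           stringsAsFactors = FALSE)
```

Removing numbers inside the same regular expression as punctuation keeps the cleaning to a single pass over the text.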

In addition, a process identifies the pairs of words that most frequently occur together (bigrams). This is the main idea behind the algorithm used to achieve the goal.
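The idea of counting adjacent word pairs can be sketched in base R as below. The helpers `line_bigrams` and `most_likely_next`, and the three example lines, are hypothetical. The lookup at the end assumes the counts are meant to suggest a likely next word, which this report does not state explicitly.

```r
# Hypothetical helper: build adjacent word pairs within a single line of text.
line_bigrams <- function(line) {
  words <- unlist(strsplit(tolower(line), "[^a-z']+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

lines   <- c("happy new year", "a happy new day", "new year party")
bigrams <- unlist(lapply(lines, line_bigrams))

# Count each pair and list the most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical lookup: the word that most often follows `word` in the counts.
most_likely_next <- function(word, counts) {
  hits <- counts[startsWith(names(counts), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[which.max(hits)])
}

most_likely_next("new", bigram_counts)
```

Pairs are built per line, so the last word of one line is never paired with the first word of the next.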

More information about the functions is available at: https://rpubs.com/ricoherrera/autocomplete-fns

Executions

Load the data and identify profanity words:

getTokensFromText('final/en_US/en_US.twitter.txt','final/en_US/en_US.news.txt','final/en_US/en_US.blogs.txt')

tokenize_data()
## [1] " starting tokenization"
## [1] "Time tokenization->0.247117968400319"
## # A tibble: 30 × 2
##    word   total
##    <chr>  <int>
##  1 time   29684
##  2 day    21638
##  3 love   20488
##  4 people 19852
##  5 life   12665
##  6 home    9911
##  7 week    9663
##  8 world   9086
##  9 night   8982
## 10 rt      8935
## # ℹ 20 more rows
visualizeFrequentTokens()

Word cloud of the 100 most frequent tokens:

wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)

Bigram data:

create_bigram_data()
## [1] " starting bigram-data"
## [1] "Time buildin biGram data->4.74183368682861"

Bigram plot:

create_bigram_plot()

Bigram data, second technique:

create_bigram_data_2(2,2)
## [[1]]
## [1] "dammnnnnn catch"
## 
## [[2]]
## [1] "bum squad"     "squad rt"      "rt shout"      "shout ninja"  
## [5] "ninja winning"
## 
## [[3]]
## [1] "ya ik"            "ik follow"        "follow mentioned" "mentioned tweets"
## [5] "tweets didnt"    
## 
## [[4]]
## [1] "tonight tomorrow"
## 
## [[5]]
## [1] "stress balls" "balls keg"    "keg crew"     "crew guys"    "guys party"  
## 
## [[6]]
##  [1] "music holiday"      "holiday 23"         "23 santa"          
##  [4] "santa barbara"      "barbara bar"        "bar tim"           
##  [7] "tim palin"          "palin white"        "white jr"          
## [10] "jr richards"        "richards dishwalla"