The process starts by loading the data from the three files (Twitter, blogs, news). A sample is taken from each file, the three samples are combined into a single data set, and then a step to identify profanity words is run.
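The report's getTokensFromText() performs this step; the sketch below is only an illustration, with an assumed 10% sampling rate and a placeholder profanity list (neither value comes from the report).

# Illustrative sketch of loading, sampling and combining the three files.
# The 10% rate and the profanity list are assumptions, not the report's values.
set.seed(123)
sample_file <- function(path, rate = 0.1) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, size = floor(length(lines) * rate))
}
twitter <- sample_file("final/en_US/en_US.twitter.txt")
news    <- sample_file("final/en_US/en_US.news.txt")
blogs   <- sample_file("final/en_US/en_US.blogs.txt")
corpus  <- c(twitter, news, blogs)
# Placeholder profanity list; the report only states that profanity is identified.
profanity_words <- c("badword1", "badword2")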
The second step tokenizes the combined data into words. During this step, profanity words, stop words, and numbers are removed. Two plots show the most frequent tokens.
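A possible shape for this step, assuming a tidytext-based tokenizer (the actual tokenize_data() is documented at the link below), which would produce the word/total columns shown in the output further down:

# Sketch only: tokenize, drop stop words, profanity and purely numeric tokens,
# then count word frequencies. Uses corpus and profanity_words from the sketch above.
library(dplyr)
library(tidytext)
tokenizerData <- tibble(text = corpus) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!word %in% profanity_words,
         !grepl("^[0-9]+$", word)) %>%
  count(word, name = "total", sort = TRUE)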
Additionally, there is a process that identifies the most frequent pairs of words that occur together (bigrams). This is the main idea behind the algorithm used to achieve the goal.
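As an illustration of the idea, consecutive word pairs could be counted with an n-gram tokenizer along these lines; the report's create_bigram_data() may be implemented differently:

# Sketch only: build and count bigrams from the sampled corpus.
library(tidyr)
bigramData <- tibble(text = corpus) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% c(stop_words$word, profanity_words),
         !word2 %in% c(stop_words$word, profanity_words)) %>%
  count(word1, word2, sort = TRUE)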
More information about the functions is available at: https://rpubs.com/ricoherrera/autocomplete-fns
getTokensFromText('final/en_US/en_US.twitter.txt', 'final/en_US/en_US.news.txt', 'final/en_US/en_US.blogs.txt')
tokenize_data()
## [1] " starting tokenization"
## [1] "Time tokenization->0.247117968400319"
## # A tibble: 30 × 2
## word total
## <chr> <int>
## 1 time 29684
## 2 day 21638
## 3 love 20488
## 4 people 19852
## 5 life 12665
## 6 home 9911
## 7 week 9663
## 8 world 9086
## 9 night 8982
## 10 rt 8935
## # ℹ 20 more rows
visualizeFrequentTokens()
wordcloud(tokenizerData$word, tokenizerData$total, max.words = 100)
create_bigram_data()
## [1] " starting bigram-data"
## [1] "Time buildin biGram data->4.74183368682861"
create_bigram_plot()
create_bigram_data_2(2,2)
## [[1]]
## [1] "dammnnnnn catch"
##
## [[2]]
## [1] "bum squad" "squad rt" "rt shout" "shout ninja"
## [5] "ninja winning"
##
## [[3]]
## [1] "ya ik" "ik follow" "follow mentioned" "mentioned tweets"
## [5] "tweets didnt"
##
## [[4]]
## [1] "tonight tomorrow"
##
## [[5]]
## [1] "stress balls" "balls keg" "keg crew" "crew guys" "guys party"
##
## [[6]]
## [1] "music holiday" "holiday 23" "23 santa"
## [4] "santa barbara" "barbara bar" "bar tim"
## [7] "tim palin" "palin white" "white jr"
## [10] "jr richards" "richards dishwalla"