This report explores text data from three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The goal is to understand the distribution of words and the relationships among words in each document, to explore the frequency of words and word pairs and their distributions, and, with this information, to build an n-gram word model. The original intent was to load all three documents as a single corpus and analyze them simultaneously. However, that attempt was unsuccessful because the program took too long to run, so each document was analyzed separately to complete this exercise.
For data pre-processing, unnecessary whitespace within the documents was removed, all words were reduced to lowercase, and English stop words were discarded.
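A minimal sketch of this pre-processing step is shown below, assuming the `tm` package and a single file read with `readLines()`; the file path and object names are illustrative, not the exact code used in the analysis.

```r
library(tm)

# Read one of the source files (path is illustrative)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Build a corpus and apply the pre-processing steps described above
corpus <- VCorpus(VectorSource(blogs))
corpus <- tm_map(corpus, stripWhitespace)                     # remove unnecessary whitespace
corpus <- tm_map(corpus, content_transformer(tolower))        # reduce all words to lowercase
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop English stop words
```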
Here, we extract consecutive word pairs using token = "ngrams" and specifying a value for n. With n = 2, we extract bi-grams and count the most frequent word pairs. By changing the value of n, this method can be extended to longer word sequences such as tri-grams (n = 3), as sketched below.
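The sketch below shows how such a bi-gram count can be produced with `tidytext::unnest_tokens()` and `dplyr::count()`, assuming the text is read with `readLines()`; the file name and object names are illustrative. Setting `n = 3` in the same call would yield tri-grams instead.

```r
library(dplyr)
library(tidytext)

# Read the news file into a one-column tibble (file name is illustrative)
news <- tibble(text = readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))

# Split each line into consecutive word pairs (bi-grams) and count them;
# change n = 2 to n = 3 to extract tri-grams instead
news_bigrams <- news %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

news_bigrams
```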
```
## # A tibble: 1,002,638 x 2
##    bigram       n
##    <chr>    <int>
##  1 of the   14094
##  2 in the   13507
##  3 to the    6408
##  4 on the    5554
##  5 for the   5378
##  6 at the    4494
##  7 and the   4028
##  8 in a      4014
##  9 to be     3572
## 10 with the  3318
## # ... with 1,002,628 more rows
```
This report explored the three text files separately, as treating them collectively as a corpus took too long to run and was frustrating my efforts. Word clouds and summary statistics of the most frequently occurring words and word pairs were produced for each of the three text files, and relationships between word pairs in the news text were also explored.