This report explores text data from three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt. The goal is to understand the distribution of words and the relationships among words in each document, to explore the frequency of words and word pairs and their distributions, and, with this information, to build an n-gram word model. The original intent was to load all three documents as a single corpus and analyze them simultaneously. However, that attempt was unsuccessful because the program took too long to run, so each document was analyzed separately to complete this exercise.
For data pre-processing, unnecessary whitespace within the documents was removed, all words were reduced to lowercase, and English stop words were discarded.
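A minimal sketch of this pre-processing step is shown below, assuming the `tm` package and a single file read with `readLines()`; the file path and object names are illustrative, not the exact code used in the analysis.

```r
library(tm)

# Read one of the source files (path is illustrative)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

# Build a corpus and apply the pre-processing steps described above
corpus <- VCorpus(VectorSource(blogs))
corpus <- tm_map(corpus, stripWhitespace)                     # remove unnecessary whitespace
corpus <- tm_map(corpus, content_transformer(tolower))        # reduce all words to lowercase
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # drop English stop words
```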
Here, we extract consecutive word pairs using token = "ngrams" and specifying a value for n. With n = 2, we extract bi-grams and count the most frequent word pairs. By changing the value of n, this method can be extended to longer word sequences such as tri-grams (n = 3), as sketched below.
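The sketch below shows how such a bi-gram count can be produced with `tidytext::unnest_tokens()` and `dplyr::count()`, assuming the text is read with `readLines()`; the file name and object names are illustrative. Setting `n = 3` in the same call would yield tri-grams instead.

```r
library(dplyr)
library(tidytext)

# Read the news file into a one-column tibble (file name is illustrative)
news <- tibble(text = readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE))

# Split each line into consecutive word pairs (bi-grams) and count them;
# change n = 2 to n = 3 to extract tri-grams instead
news_bigrams <- news %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

news_bigrams
```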
```
## # A tibble: 1,002,638 x 2
##    bigram       n
##    <chr>    <int>
##  1 of the   14094
##  2 in the   13507
##  3 to the    6408
##  4 on the    5554
##  5 for the   5378
##  6 at the    4494
##  7 and the   4028
##  8 in a      4014
##  9 to be     3572
## 10 with the  3318
## # ... with 1,002,628 more rows
```
This report explored the three text files separately, as treating them collectively as a corpus took too long to run and was frustrating my efforts. Word clouds and summary statistics of the most frequently occurring words and word pairs were produced for each of the three text files, and relationships between word pairs in the news text were also explored.