In this assignment we explore the 3 SwiftKey data files provided, in an attempt to draw insights that may help us develop a model that predicts the next word. The process we follow to explore the data is as follows:
1. Explore the high-level row-counts and word-counts of the files.
2. Explore the term frequency of the 3 files.
3. Explore the unigram, bigram and trigram frequency of the 3 files.
4. Explore the word correlation of the top terms for each of the 3 files.
5. Explore some of the entities in each of the 3 files.
Below we explore the high-level row-counts and word-counts of the 3 files. The Twitter file has the highest row-count, whereas the blogs file has the highest word-count.
| File | Line Count | Word Count |
|---|---|---|
| en_US.blogs.txt | 899,288 | 37,334,690 |
| en_US.twitter.txt | 2,360,148 | 30,374,206 |
| en_US.news.txt | 1,010,242 | 34,372,720 |
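Counts like these can be reproduced with a short script. The following is a minimal sketch in Python, assuming that a "word" is any whitespace-delimited token; the original analysis may have tokenized differently, so the exact figures could differ. The function name `count_lines_and_words` and the sample lines are made up for illustration.

```python
def count_lines_and_words(lines):
    """Return (line_count, word_count) for any iterable of text lines.

    Assumption: a word is a whitespace-delimited token.
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count


# Made-up sample lines, not taken from the SwiftKey files.
sample = ["thanks for the follow\n", "\n", "so tired today\n"]
print(count_lines_and_words(sample))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly (for example `open("en_US.blogs.txt", encoding="utf-8")`), which streams the file rather than loading it into memory.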
Before we move on with exploratory analysis we need to clean the corpora by performing the following data pre-processing steps:
- remove white-space
- remove punctuation
- convert each document to lowercase
- remove stop-words & profanity
- stem or lemmatize each term in the corpus
- skip irrelevant stop-words based on a dictionary
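The steps above can be sketched roughly as follows. This is an illustration only, not the pipeline used for this report: the stop-word and profanity sets are tiny hypothetical placeholders, and `naive_stem` is a crude suffix stripper standing in for a real stemmer or lemmatizer.

```python
import string

# Hypothetical placeholder lists; a real analysis would load full dictionaries.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}
PROFANITY = {"darn"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def naive_stem(token):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean_document(text):
    """Lowercase, strip punctuation and extra white-space, filter, then stem."""
    text = text.lower().translate(_PUNCT_TABLE)
    tokens = text.split()  # splitting on white-space also collapses runs of it
    tokens = [t for t in tokens if t not in STOP_WORDS and t not in PROFANITY]
    return [naive_stem(t) for t in tokens]


print(clean_document("The  cats are RUNNING, and jumped!"))
```

Note that stop-word removal helps when looking at term frequencies, but a next-word model may want to keep stop-words, since they are often the word being predicted.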
Below are summary tables and word-clouds of the most frequent terms found in the corpora.
Word-clouds convey the same information as the summary tables, but are a convenient way to visualize the results.
In addition to the summary tables and word-clouds, we should look at n-grams to understand which words frequently appear together, that is, which groups of words are most highly correlated in each of the files.
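As a rough illustration of how n-gram frequencies can be counted, here is a minimal Python sketch. It assumes each document has already been tokenized into a list of words; the helper names `ngrams` and `ngram_counts` and the sample documents are made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from one tokenized document."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_counts(documents, n):
    """Count n-grams per document so that no n-gram spans two documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return counts


# Made-up tokenized documents, for illustration only.
docs = [["i", "love", "new", "york"], ["i", "love", "pizza"]]
bigrams = ngram_counts(docs, 2)
print(bigrams.most_common(1))
```

Setting `n` to 1, 2 or 3 gives the unigram, bigram and trigram counts respectively. Counting within each document, rather than over one concatenated stream, avoids creating n-grams that straddle the boundary between two unrelated lines.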
Word associations are presented below for each of the top bigrams highlighted in the previous section. The word-association charts show which words are highly correlated with the anchor word, and therefore which words our next-word model is likely to predict.
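One common way to quantify such associations is the Pearson correlation between per-document counts of the anchor word and of every other term. Whether the charts in this report used exactly this measure is an assumption; the sketch below, with made-up helper names and documents, only illustrates the idea.

```python
import math


def term_vector(documents, term):
    """Per-document occurrence counts of one term."""
    return [tokens.count(term) for tokens in documents]


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def associations(documents, anchor, min_corr=0.5):
    """Terms whose correlation with the anchor is at least min_corr, strongest first."""
    vocabulary = {t for tokens in documents for t in tokens} - {anchor}
    anchor_vec = term_vector(documents, anchor)
    scored = ((t, pearson(anchor_vec, term_vector(documents, t))) for t in vocabulary)
    return sorted((p for p in scored if p[1] >= min_corr), key=lambda p: (-p[1], p[0]))


# Made-up tokenized documents, for illustration only.
docs = [["new", "york", "city"], ["new", "york"], ["old", "town"], ["big", "city"]]
print(associations(docs, "new"))
```

This measure captures co-occurrence within a document but ignores word order, so it complements the n-gram counts rather than replacing them when it comes to predicting the next word.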
Lastly, we look at the entities present in each corpus. Because entity extraction is compute-intensive and slow, the entities were extracted from a 1,000-document sample of each file. Entities may or may not help significantly with our next-word prediction model, but there is no harm in exploring them.