The purpose of this project is to explore a set of text documents (a corpus) using text mining and natural language processing, in order to build a predictive word model for English text.
First, I generated a sample of each of the three text documents (the sample size was set to 1 percent of the total number of lines in each document) by randomly selecting and reading lines from each text and then converting the sampled lines into a data frame (a tabular data structure in R).
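A minimal sketch of this sampling step is shown below; the file names and the `sample_lines()` helper are assumptions for illustration, not the exact code used.

```r
library(dplyr)

# Read a text file and keep a random 1% of its lines as a data frame
sample_lines <- function(path, rate = 0.01, seed = 1234) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep  <- sample(length(lines), size = ceiling(rate * length(lines)))
  tibble(source = basename(path), text = lines[keep])
}

corpus_sample <- bind_rows(
  sample_lines("en_US.blogs.txt"),
  sample_lines("en_US.news.txt"),
  sample_lines("en_US.twitter.txt")
)
```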
Next, I tokenized each data frame by unnesting its text column into i) single words (unigrams), ii) two-word pairs (bigrams) and iii) three-word sequences (trigrams).
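A sketch of the tokenization using the tidytext package, assuming the `corpus_sample` data frame from the previous sketch:

```r
library(dplyr)
library(tidytext)

# One token per row; unnest_tokens() also lowercases the text by default
unigrams <- corpus_sample %>% unnest_tokens(word, text)
bigrams  <- corpus_sample %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
trigrams <- corpus_sample %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```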
Then, I cleaned each tokenized data frame by removing digits, punctuation, foreign words (i.e. words of non-English origin) and profanity.
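A rough sketch of this cleaning step for the unigrams; the profanity list is a placeholder and the ASCII filter is only an approximation of removing non-English words:

```r
library(dplyr)
library(stringr)

profanity <- c("badword1", "badword2")   # placeholder profanity list

clean_unigrams <- unigrams %>%
  mutate(word = str_remove_all(word, "[[:punct:][:digit:]]")) %>%  # strip digits and punctuation
  filter(word != "",
         str_detect(word, "^[a-z]+$"),   # keep ASCII-only tokens (rough English filter)
         !word %in% profanity)           # drop profane words
```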
Then, I calculated the term frequency (tf) and the inverse document frequency (idf) of all words, bigrams and trigrams in order to distinguish terms common to all three text documents from important terms unique to each document.
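A minimal sketch of the tf-idf calculation for the unigrams using `tidytext::bind_tf_idf()`, assuming the `clean_unigrams` data frame from the previous sketch:

```r
library(dplyr)
library(tidytext)

word_tf_idf <- clean_unigrams %>%
  count(source, word, sort = TRUE) %>%   # term counts per source document
  bind_tf_idf(word, source, n)           # adds tf, idf and tf_idf columns
```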
Next, I created a dictionary of common (stop) words in the English language (for example: and, in, of, then) using the tf-idf values as well as the words from an R dataset called "stop_words". I used this dictionary to differentiate between common words and important words in the corpus.
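A sketch of building this dictionary; the `idf == 0` cut-off (a word with an idf of zero appears in all three documents) is an assumption about how the tf-idf values were used:

```r
library(dplyr)
library(tidytext)

data("stop_words")   # stop-word lexicons shipped with the tidytext package

common_words <- word_tf_idf %>%
  filter(idf == 0) %>%                     # idf of 0: the word occurs in every document
  distinct(word) %>%
  bind_rows(stop_words %>% select(word)) %>%
  distinct(word)
```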
Finally, for exploratory purposes, I calculated the proportion of single words, bigrams and trigrams in the corpus; identified the top twenty words, bigrams and trigrams in the corpus; and plotted the top twenty words that make up 50%, 90% and 100% coverage of all word instances in the corpus (a sketch of the coverage calculation follows the tables below).
| word type | n | n / total (%) | n_common | n_common / n (%) |
|---|---|---|---|---|
| words | 58072 | 6 | 13034 | 22 |
| bigrams | 344308 | 36 | 176864 | 51 |
| trigrams | 534108 | 57 | 295854 | 55 |
| word type | word coverage | n | n_common | n_common / n (%) |
|---|---|---|---|---|
| words | 50% | 29035 | 6496 | 22 |
| words | 90% | 52264 | 11630 | 22 |
| words | 100% | 58072 | 13034 | 22 |
| bigrams | 50% | 172154 | 88539 | 51 |
| bigrams | 90% | 309877 | 157903 | 50 |
| bigrams | 100% | 344308 | 176864 | 51 |
| trigrams | 50% | 267054 | 146543 | 54 |
| trigrams | 90% | 480697 | 264506 | 55 |
| trigrams | 100% | 534108 | 295854 | 55 |
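One way the coverage in the tables above could be computed from the unigram counts, assuming the `word_tf_idf` data frame from earlier (the exact definition used for the tables may differ):

```r
library(dplyr)

word_coverage <- word_tf_idf %>%
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n)) %>%
  mutate(coverage = cumsum(n) / sum(n))   # cumulative share of all word instances

# distinct words needed to cover 50% and 90% of all word instances
sum(word_coverage$coverage <= 0.5)
sum(word_coverage$coverage <= 0.9)
```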
The figures below show the top 20 words, bigrams and trigrams in the corpus.
I plan to use term frequencies (tf) and inverse document frequencies (idf) to identify words, bigrams and trigrams, as well as to differentiate common words from important words in the corpus.
The predictive model will calculate the probability of each word, bigram and trigram based on the tf-idf value, the word group (i.e. common versus important) and the correlation with other words. I am also considering the stemming technique (i.e. calculating the frequencies of root words) as another metric for calculating probabilities and predicting words, bigrams and trigrams. My goal is to categorize words, bigrams and trigrams into as many unique groups as possible so as to reduce the size of the model, predict words, bigrams and trigrams more efficiently and reduce run time.
Altogether, I have conceptualized my predictive model as a large tree made up of leaves and branches, with each leaf assigned to a word, bigram or trigram and each branch representing a step (probability) leading to the prediction of a word, bigram or trigram.
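One simple approximation of this tree-shaped lookup is a back-off scheme: try the trigram table first, then the bigram table, then fall back to the most frequent unigram. The sketch below is only an illustration of that idea, not the planned model; the `trigram_freq`, `bigram_freq` and `unigram_freq` tables (with `prefix`, `next_word` and `n` columns) are hypothetical.

```r
library(dplyr)
library(stringr)

predict_next <- function(input, trigram_freq, bigram_freq, unigram_freq) {
  words <- str_split(str_to_lower(input), "\\s+")[[1]]
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)

  # walk down the "tree": trigram branch, then bigram branch, then unigram leaf
  hit <- trigram_freq %>% filter(prefix == last2) %>% arrange(desc(n)) %>% slice(1)
  if (nrow(hit) > 0) return(hit$next_word)

  hit <- bigram_freq %>% filter(prefix == last1) %>% arrange(desc(n)) %>% slice(1)
  if (nrow(hit) > 0) return(hit$next_word)

  unigram_freq %>% arrange(desc(n)) %>% slice(1) %>% pull(word)   # most frequent word overall
}
```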
I plan to expand coverage of the corpus by adding at most 5 unobserved words associated with each observed word (a set of synonyms) using online dictionaries such as the words association network. Each synonym will be assigned a probability based on its part of speech, e.g. noun, verb, adjective or adverb.
In addition, for cases where a particular n-gram is not observed, I plan on using the word embeddings approach. This approach will add unobserved n-grams to the corpus by labelling (randomly or using some criterion) a selection of low term frequency (non-stop) n-grams as unobserved.
Finally, to evaluate the model, I plan to calculate prediction accuracy, i.e. the number of times the predicted word matches the observed word. I also intend to explore other evaluation metrics, such as the average number of keystrokes per character typed by the user and the amount of time the algorithm takes to make a prediction given an acceptable input.
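A minimal sketch of the accuracy calculation on a held-out test set; the `test_set` data frame (with `input` and `observed_word` columns) and the `predict_next()` function from the earlier sketch are assumptions:

```r
library(dplyr)

accuracy <- test_set %>%
  mutate(predicted = vapply(input, predict_next, character(1),
                            trigram_freq, bigram_freq, unigram_freq)) %>%
  summarise(accuracy = mean(predicted == observed_word))
```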