The purpose of this project is to explore a set of text documents (a corpus) using text mining and natural language processing, in order to build a predictive word model for English text.
First, I generated a sample of each of the three text documents (the sample size was set to 1 percent of the total number of lines in each document) by randomly selecting and reading lines from each text and then converting the sampled lines into a data frame (a tabular data structure in R).
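A minimal sketch of this sampling step is shown below; the file names and the `sample_lines()` helper are assumptions for illustration, not the exact code used.

```r
library(dplyr)

# Read a text file and keep a random 1% of its lines as a data frame
sample_lines <- function(path, rate = 0.01, seed = 1234) {
  set.seed(seed)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep  <- sample(length(lines), size = ceiling(rate * length(lines)))
  tibble(source = basename(path), text = lines[keep])
}

corpus_sample <- bind_rows(
  sample_lines("en_US.blogs.txt"),
  sample_lines("en_US.news.txt"),
  sample_lines("en_US.twitter.txt")
)
```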
Next, I tokenized each data frame by unnesting its text column into i) single words (unigrams), ii) two-word pairs (bigrams) and iii) three-word sequences (trigrams).
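A sketch of the tokenization using the tidytext package, assuming the `corpus_sample` data frame from the previous sketch:

```r
library(dplyr)
library(tidytext)

# One token per row; unnest_tokens() also lowercases the text by default
unigrams <- corpus_sample %>% unnest_tokens(word, text)
bigrams  <- corpus_sample %>% unnest_tokens(bigram, text, token = "ngrams", n = 2)
trigrams <- corpus_sample %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)
```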
Then, I cleaned each tokenized data frame by removing digits, punctuation, foreign words (i.e. words of non-English origin) and profanity.
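A rough sketch of this cleaning step for the unigrams; the profanity list is a placeholder and the ASCII filter is only an approximation of removing non-English words:

```r
library(dplyr)
library(stringr)

profanity <- c("badword1", "badword2")   # placeholder profanity list

clean_unigrams <- unigrams %>%
  mutate(word = str_remove_all(word, "[[:punct:][:digit:]]")) %>%  # strip digits and punctuation
  filter(word != "",
         str_detect(word, "^[a-z]+$"),   # keep ASCII-only tokens (rough English filter)
         !word %in% profanity)           # drop profane words
```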
Then, I calculated the term frequency (tf) and the inverse document frequency (idf) of all words, bigrams and trigrams in order to distinguish terms common to all three text documents from important terms unique to each document.
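A minimal sketch of the tf-idf calculation for the unigrams using `tidytext::bind_tf_idf()`, assuming the `clean_unigrams` data frame from the previous sketch:

```r
library(dplyr)
library(tidytext)

word_tf_idf <- clean_unigrams %>%
  count(source, word, sort = TRUE) %>%   # term counts per source document
  bind_tf_idf(word, source, n)           # adds tf, idf and tf_idf columns
```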
Next, I created a dictionary of common (stop) words in the English language (for example: and, in, of, then) using the tf-idf values as well as the words from an R dataset called "stop_words". I used this dictionary to differentiate between common words and important words in the corpus.
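A sketch of building this dictionary; the `idf == 0` cut-off (a word with an idf of zero appears in all three documents) is an assumption about how the tf-idf values were used:

```r
library(dplyr)
library(tidytext)

data("stop_words")   # stop-word lexicons shipped with the tidytext package

common_words <- word_tf_idf %>%
  filter(idf == 0) %>%                     # idf of 0: the word occurs in every document
  distinct(word) %>%
  bind_rows(stop_words %>% select(word)) %>%
  distinct(word)
```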
Finally, for exploratory purposes, I calculated the proportion of single words, bigrams and trigrams in the corpus; identified the top twenty words, bigrams and trigrams in the corpus; and plotted the top twenty words that make up 50%, 90% and 100% coverage of all word instances in the corpus (a sketch of the coverage calculation follows the tables below).
| word type | n | n / total (%) | n_common | n_common / n (%) |
|---|---|---|---|---|
| words | 58072 | 6 | 13034 | 22 |
| bigrams | 344308 | 36 | 176864 | 51 |
| trigrams | 534108 | 57 | 295854 | 55 |
| word type | word coverage | n | n_common | n_common / n (%) |
|---|---|---|---|---|
| words | 50% | 29035 | 6496 | 22 |
| words | 90% | 52264 | 11630 | 22 |
| words | 100% | 58072 | 13034 | 22 |
| bigrams | 50% | 172154 | 88539 | 51 |
| bigrams | 90% | 309877 | 157903 | 50 |
| bigrams | 100% | 344308 | 176864 | 51 |
| trigrams | 50% | 267054 | 146543 | 54 |
| trigrams | 90% | 480697 | 264506 | 55 |
| trigrams | 100% | 534108 | 295854 | 55 |
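One way the coverage in the tables above could be computed from the unigram counts, assuming the `word_tf_idf` data frame from earlier (the exact definition used for the tables may differ):

```r
library(dplyr)

word_coverage <- word_tf_idf %>%
  group_by(word) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  arrange(desc(n)) %>%
  mutate(coverage = cumsum(n) / sum(n))   # cumulative share of all word instances

# distinct words needed to cover 50% and 90% of all word instances
sum(word_coverage$coverage <= 0.5)
sum(word_coverage$coverage <= 0.9)
```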
The figures below show the top 20 words, bigrams and trigrams in the corpus.
I plan to use term frequencies (tf) and inverse document frequencies (idf) to identify words, bigrams and trigrams, as well as to differentiate common words from important words in the corpus.
The predictive model will calculate the probability of each word, bigram and trigram based on the tf-idf value, the word group (i.e. common versus important) and the correlation with other words. I am also considering the stemming technique (i.e. calculating the frequencies of root words) as another metric for calculating probabilities and predicting words, bigrams and trigrams. My goal is to categorize words, bigrams and trigrams into as many unique groups as possible so as to reduce the size of the model, predict words, bigrams and trigrams more efficiently and reduce run time.
Altogether, I have conceptualized my predictive model as a large tree made up of leaves and branches, with each leaf assigned to a word, bigram or trigram and each branch representing a step (probability) leading to the prediction of a word, bigram or trigram.
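One simple approximation of this tree-shaped lookup is a back-off scheme: try the trigram table first, then the bigram table, then fall back to the most frequent unigram. The sketch below is only an illustration of that idea, not the planned model; the `trigram_freq`, `bigram_freq` and `unigram_freq` tables (with `prefix`, `next_word` and `n` columns) are hypothetical.

```r
library(dplyr)
library(stringr)

predict_next <- function(input, trigram_freq, bigram_freq, unigram_freq) {
  words <- str_split(str_to_lower(input), "\\s+")[[1]]
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)

  # walk down the "tree": trigram branch, then bigram branch, then unigram leaf
  hit <- trigram_freq %>% filter(prefix == last2) %>% arrange(desc(n)) %>% slice(1)
  if (nrow(hit) > 0) return(hit$next_word)

  hit <- bigram_freq %>% filter(prefix == last1) %>% arrange(desc(n)) %>% slice(1)
  if (nrow(hit) > 0) return(hit$next_word)

  unigram_freq %>% arrange(desc(n)) %>% slice(1) %>% pull(word)   # most frequent word overall
}
```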
I plan to expand coverage of the corpus by adding at most 5 unobserved words associated with each observed word (a set of synonyms) using online dictionaries such as the words association network. Each synonym will be assigned a probability based on its part of speech, e.g. noun, verb, adjective or adverb.
In addition, for cases where a particular n-gram is not observed, I plan on using the word embeddings approach. This approach will add unobserved n-grams to the corpus by labelling (randomly or using some criterion) a selection of low term frequency (non-stop) n-grams as unobserved.
Finally, to evaluate the model, I plan to calculate prediction accuracy, i.e. the number of times the predicted word matches the observed word. I also intend to explore other evaluation metrics, such as the average number of keystrokes per character typed by the user and the amount of time the algorithm takes to make a prediction given an acceptable input.
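A minimal sketch of the accuracy calculation on a held-out test set; the `test_set` data frame (with `input` and `observed_word` columns) and the `predict_next()` function from the earlier sketch are assumptions:

```r
library(dplyr)

accuracy <- test_set %>%
  mutate(predicted = vapply(input, predict_next, character(1),
                            trigram_freq, bigram_freq, unigram_freq)) %>%
  summarise(accuracy = mean(predicted == observed_word))
```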