Text Mining: Predicting the Next Word

Milestone Report Project

Marcos Medeiros

02/19/2022

Introduction

This is the milestone report assignment for week 2 of the Coursera Data Science Capstone Project.

The purpose of this report is to demonstrate text mining with the bag-of-words method: loading and cleaning the dataset, performing some exploratory analysis, and describing a strategy for building a predictive model based on natural language processing, to be deployed as a Shiny application.

The data set contains corpora in four languages: German, English, Finnish, and Russian, each with three text sources: blogs, news, and Twitter.

We will use only the English corpora, combining the three text sources into a single training corpus.

Purpose of Text Mining

Here we find the first challenge: choosing the best package for exploratory analysis, text filtering, and conversion of the data sample into a matrix.

This programming aspect is at the heart of the project, as there are differences in the criteria used in each package to perform clustering and tokenization.

A clear example of this problem is the history of search engines since 1995. Search engines such as Excite, SAPO, AltaVista, and Yahoo, among others, competed to be users' preferred choice. Predicting what you are thinking or looking for from the two or three words you type is the work of artificial intelligence applied to natural language. Tech giants such as Google and Netflix have powerful predictive algorithms that anticipate what people are thinking or wanting and can offer them targeted products and services.

There is no perfect predictive algorithm based on natural language, but a relevant aspect of our research comes from Franz Brentano as applied to artificial intelligence: the psychic phenomena of representation, judgment, approval (love), and disapproval (hatred) are what provoke engagement in social media. His posthumous work from 1928 on sentient and noetic consciousness, which deals with the spiritual dimension of consciousness, raises what is considered the main dilemma shared by human and artificial intelligence: what is the difference between a virtual mind and a real mind? What is the difference between thinking and imitating thought? Predictive AI algorithms draw on Brentano's philosophical work.

Thus, a predictive algorithm can learn from human behavior and imitate it, creating a cycle in which it is influenced by people's behavior and, at the same time, influences how people will act.

Choice of R Package and Strategies

We tested the approach with three different text mining packages: qdap, RWeka, and quanteda. Each of these packages has strengths and weaknesses that can influence the final choice. In any case, given the obvious limitations of a home computer, the results would be very similar with any of them.

A major issue is having enough RAM to generate a dense matrix from the TDM (TermDocumentMatrix) or DTM (DocumentTermMatrix) objects. There are two main strategies: 1) reduce the sample size without compromising the reliability of the training set, so that it still reflects the corpus; 2) remove sparse terms when constructing the matrix, as sketched below.
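For illustration, a minimal sketch of both strategies with the tm package (the object blogs here is an assumption, standing for lines already read with readLines(), and the sampling fraction and sparsity threshold are illustrative values):

```r
library(tm)

set.seed(1234)

# Strategy 1: keep only a small random fraction of the lines
# ('blogs' is assumed to hold raw lines read with readLines()).
sample_frac  <- 0.01
blogs_sample <- sample(blogs, round(length(blogs) * sample_frac))

# Strategy 2: drop terms absent from almost all documents
# before converting the DTM to a dense matrix.
corpus_small <- VCorpus(VectorSource(blogs_sample))
dtm_small    <- DocumentTermMatrix(corpus_small)
dtm_small    <- removeSparseTerms(dtm_small, 0.999)  # keep terms in >= 0.1% of documents
m            <- as.matrix(dtm_small)
```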

Data Summary

Below is a data summary of the three text corpora, showing the number of lines, number of words, and file size for each source.

Basic Corpora Analysis
Source      Lines        Words   Size (MB)
Blogs     899,288   37,570,839    200.4242
News    1,010,242   34,494,539    196.2775
Twitter 2,360,148   30,451,170    159.3641
Total   4,269,678  102,516,548    556.0658

The dataset contains 4,269,678 lines and 102,516,548 words. The code for the data summary is shown in Appendix 1.
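As an illustration of how such a summary can be computed, here is a minimal sketch assuming the en_US files sit in the working directory (the file names and the stringi word counter are assumptions; Appendix 1 may use different functions):

```r
library(stringi)

files <- c(Blogs   = "en_US.blogs.txt",
           News    = "en_US.news.txt",
           Twitter = "en_US.twitter.txt")

corpora_summary <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(Lines   = length(lines),
    Words   = sum(stri_count_words(lines)),
    Size_Mb = file.size(f) / 1024^2)
}))

# Append the totals row shown in the table above.
corpora_summary <- rbind(corpora_summary, Total = colSums(corpora_summary))
corpora_summary
```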

Tokenization and stemming the training data

We initially tried to work with a 5% sample, then a 3% sample, but with these sample sizes we could not generate a dense matrix from the DTM due to the memory limitations of a home PC.

There are a couple of configuration tricks that free up memory, such as removing sparse terms and raising the limit with memory.limit(), but they cost a lot of processing time and increase the risk of a crash, which is frustrating.

After a few days and a lot of testing, the final choice was to use a 1% sample and a combination of functions from different packages.

We will combine the files into a single corpus, removing punctuation, numbers, symbols, separators, and stopwords. We will then create DTM and TDM objects in order to obtain a frequency matrix.

The code for building the matrix and tokenization is shown in Appendix 2.
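For illustration, a minimal sketch of this cleaning and matrix-building step with the tm package (sample_lines is an assumption, standing for the combined 1% sample of blog, news, and Twitter lines; the actual code in Appendix 2 may differ):

```r
library(tm)
library(SnowballC)   # stemming support for tm

# 'sample_lines' is assumed to hold the combined 1% sample.
corpus <- VCorpus(VectorSource(sample_lines))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)   # documents in rows, terms in columns
tdm <- TermDocumentMatrix(corpus)   # terms in rows, documents in columns

# Unigram frequency vector reused for the plots and wordclouds below.
freq <- sort(colSums(as.matrix(removeSparseTerms(dtm, 0.999))),
             decreasing = TRUE)
head(freq, 10)
```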

Frequency Plots

Now we can build the frequency plots with the most used words in the corpus, for unigrams, bigrams, and trigrams.

We will plot the top 10 and top 20 unigrams in the sample data set with the two different methods to compare the results. We will then plot the top 20 bigrams and the top 20 trigrams.

The code for the frequency barplots is shown in Appendix 3.
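As one possible implementation, a sketch that tokenizes the cleaned corpus into bigrams with RWeka and plots the top 20 (it reuses the corpus object from the sketch above; the same pattern yields unigrams or trigrams by changing min and max, and Appendix 3 may use a different method):

```r
library(tm)
library(RWeka)
library(ggplot2)

# Bigram tokenizer; use min = max = 1 or 3 for unigrams or trigrams.
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = bigram_tokenizer))
bigram_freq <- sort(rowSums(as.matrix(removeSparseTerms(bigram_tdm, 0.9999))),
                    decreasing = TRUE)

top20 <- data.frame(ngram = names(bigram_freq)[1:20],
                    freq  = bigram_freq[1:20])

ggplot(top20, aes(x = reorder(ngram, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "Bigram", y = "Frequency", title = "Top 20 bigrams in the 1% sample")
```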

Wordclouds

Wordclouds can help decision makers visualize the most relevant terms. We will show wordclouds for unigrams, bigrams, and trigrams.

The code for the wordclouds is shown in Appendix 4.
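For the unigram cloud, a minimal sketch with the wordcloud package, assuming freq is the frequency vector built earlier (Appendix 4 may tune colors and limits differently):

```r
library(wordcloud)
library(RColorBrewer)

set.seed(1234)
# Unigram wordcloud; bigram and trigram clouds follow the same
# pattern with their respective frequency vectors.
wordcloud(words = names(freq), freq = freq,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```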

Next Steps and Conclusions

To build the predictive algorithm we will need to consider the processing and memory limitations of a home computer, but the training dataset makes it possible to demonstrate how the next typed word can be predicted with an n-gram model and a word frequency search similar to the one performed in the exploratory analysis. On the other hand, the strategy of eliminating stopwords can decrease the accuracy of the model, since next-word prediction is based on natural language as people actually write it. The accuracy of a good predictive model will depend on its ability to learn from people's behavior and thus increase its effectiveness.
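To make the idea concrete, here is a hedged sketch of the simplest version of such a model: look up the last typed word in the bigram frequency table and return the most frequent continuations (bigram_freq is the vector built earlier; predict_next_word is a hypothetical helper, and a real model would add trigrams and a back-off scheme for unseen words):

```r
# 'bigram_freq': named numeric vector with names such as "of the"
# and values equal to the observed counts.
predict_next_word <- function(last_word, bigram_freq, n = 3) {
  pattern    <- paste0("^", tolower(last_word), " ")
  candidates <- bigram_freq[grepl(pattern, names(bigram_freq))]
  if (length(candidates) == 0) return(character(0))
  candidates <- sort(candidates, decreasing = TRUE)
  # Strip the first word, leaving the predicted continuations.
  sub(pattern, "", names(head(candidates, n)))
}

predict_next_word("in", bigram_freq)
```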

For the purpose of this project, we will deliver a demo as a Shiny application for R.
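A minimal sketch of what such a Shiny demo might look like, reusing the hypothetical predict_next_word() helper above (the layout and inputs are illustrative only):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    if (length(words) == 0) return("")
    paste(predict_next_word(tail(words, 1), bigram_freq), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)
```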

Appendixes