NLP presentation

Andrew Kireru
27/11/2014

Introduction

The beginning

The purpose of this presentation is to describe the efforts taken by the author for the Capstone Project of the Data Science Specialization on Coursera.

The goal of the Capstone Project was to build an algorithm that predicts the next word in a sequence of words, based on a data set of newspaper articles, blog posts and Twitter tweets.

Raw data processing

The starting point for the project was a large set of text data. To build a prediction algorithm, the data first had to be processed; the individual steps are described under "The process".

A "portable office" means work done on cellphones or tablets, where a smart input system saves time spent typing, so an efficient keyboard is required. The core of such an input system is a natural language processing model. This model filters out bad words so that they do not appear in the final prediction; other words are displayed instead.

It is also fast, especially when you choose the Stupid Kick-off process. To see how it works, try the app at http://kireru1.shinyapps.io/predictive-model/.

process snapshot


NL Processing

How the model predicts

The keyboard


The process

Tokenization aims to remove meaningless characters and low-frequency words, to avoid overfitting to the corpus. The final corpus contains the words or terms with a high frequency, which helps in exploring the relationships between words and in building a meaningful statistical model. To get the corpus, I 1) kept only the ASCII characters, 2) converted capital letters to lower case, 3) removed punctuation, 4) removed numbers, 5) removed stop words and 6) stemmed the remaining words. Dirty words were not removed from the documents at this stage.
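The six cleaning steps can be sketched in Python roughly as follows. This is a minimal illustration, not the code used for the project: the stop-word list is a tiny hypothetical sample, and toy_stem is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it"}

def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Apply the six cleaning steps and return a list of stemmed tokens."""
    text = text.encode("ascii", "ignore").decode("ascii")  # 1) keep ASCII only
    text = text.lower()                                    # 2) lower case
    text = re.sub(r"[^a-z0-9\s]", " ", text)               # 3) remove punctuation
    text = re.sub(r"[0-9]+", " ", text)                    # 4) remove numbers
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # 5) stop words
    return [toy_stem(t) for t in tokens]                   # 6) stem the rest

print(preprocess("The 3 cats were Jumping in the café!"))
```

An input made up only of punctuation, numbers and stop words comes back as an empty list, which is why the app can return nothing for such input.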

How the app works

An n-gram model works well if the set of terms is large enough to cover every case. However, building such a model takes a lot of time. An alternative is a back-off model, which falls back from the n-gram model to the (n-1)-gram model for word sequences that were not seen at order n. The simplest back-off model first gets the probability of every term that follows the last (n-1) words, orders the terms, and shows the first few as predictions. When no words are found, the next lower-order model is used, down to the unigram model, which shows the most common words in the corpus.
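A simple back-off lookup of this kind can be sketched in Python as below. The function names, the choice of max_n=3 and the toy sentence are assumptions made for illustration; raw counts are used for ranking, which within a single context gives the same order as conditional probabilities. No smoothing is applied.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=3):
    """Map each context tuple (0 to max_n-1 words) to a Counter of next words."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, word = tokens[i : i + n]
            tables[tuple(context)][word] += 1
    return tables

def predict_next(tables, history, k=3, max_n=3):
    """Try the longest context first, then back off to shorter contexts."""
    for n in range(max_n, 0, -1):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough history for this order
        candidates = tables.get(context)
        if candidates:
            return [w for w, _ in candidates.most_common(k)]
    return []  # only reached when the tables are empty

tokens = "i like green tea i like green apples i like black tea".split()
tables = build_ngram_tables(tokens)
print(predict_next(tables, ["i", "like"], k=1))   # found at trigram level
print(predict_next(tables, ["we", "like"], k=2))  # backs off to the bigram level
```

The second call shows the back-off step: the context ("we", "like") was never seen, so the lookup drops the first word and ranks the words that followed "like".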

Using the app

Input a sentence in the top-left panel, select the number of words you'd like to see (3 by default) and choose a smoothing method for the n-gram model. Then press the SUBMIT button. The predicted words will be displayed.

Advantages & shortcomings

Nothing returned when you input something?

This occurs when you input only punctuation, numbers and common words. The model removes them from the input, so nothing is returned.

Important

To make the model faster, I kept only the terms that occurred more than 5 times across all sources.
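The frequency cut-off can be illustrated with a short Python sketch. The function name and the sample counts are hypothetical; only the threshold of 5 comes from the text.

```python
from collections import Counter

def prune_rare_terms(term_counts, min_count=5):
    """Keep only the terms seen strictly more than min_count times."""
    return Counter({t: c for t, c in term_counts.items() if c > min_count})

# Made-up counts, for illustration only.
counts = Counter({"green tea": 6, "black tea": 5, "purple tea": 1})
print(prune_rare_terms(counts))
```

A smaller table makes lookups faster and uses less memory, at the cost of never predicting the rare terms that were dropped.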

words!

Dirty words are deleted from the final result.
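One way to apply such a filter at prediction time is sketched below in Python. The function name and the placeholder block list are assumptions; the sketch filters first and truncates afterwards, so that other words take the place of the removed ones.

```python
def filter_predictions(candidates, blocked, k=3):
    """Drop blocked words, then return the first k remaining candidates."""
    return [w for w in candidates if w not in blocked][:k]

# Hypothetical ranked candidates and a placeholder block list.
ranked = ["tea", "darn", "coffee", "water", "juice"]
print(filter_predictions(ranked, blocked={"darn"}, k=3))
```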

Disambiguation failure and misspelling

Textonyms, in which a disambiguation system gives more than one dictionary word for a single sequence of keystrokes, are not the only issue, or even the most important issue, limiting the effectiveness of predictive text implementations.

Applications

Software applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results.

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site “stickiness” and revenue.

Finally

Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management.

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.

Text has been used to detect emotions in the related area of affective computing. Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.

Academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text.
