Introduction

The role of text-based information has grown steadily with the rise of social media. “Can machines think?” Alan Turing's question laid the foundation for NLP and its techniques. The goal of this project is to take a large corpus of text, provided by the Coursera SwiftKey dataset, and build a predictive model that suggests the next word for the user's input. Building such a model requires knowledge of the data in the corpus, so we must explore the characteristics of the text corpus in detail and determine which part of it to use in the prediction process. Among the four language datasets, we will use the “en_US” dataset as the corpus.

This report will cover:

- Understanding the problem and the theoretical background of NLP
- Acquiring, exploring, and cleaning the data
- Sampling the corpus and building a word frequency table
- Planning the text prediction model and the Shiny app

Understanding the Problem

Understanding customer behaviour and the related needs is so important in today's economy that the tools and techniques of business and customer analytics have become essential for supporting management decisions. However, as online data grow tremendously through web-based technologies such as web sites, Facebook, and Twitter, analysing and predicting from massive datasets is a challenge for researchers and data scientists. The objective of this project is to explore the data structure, clean and analyse the data, and build a model for a web-based text prediction app.

Theoretical Background

“Can machines think?” This question led Alan Turing to develop the Turing Test in 1950, a foundation of Natural Language Processing (NLP). NLP covers spelling correction, speech recognition, and the prediction of words based on preceding words. Modern NLP algorithms are based on statistical machine learning: grounded in statistical inference, they learn automatically from a large corpus of real-world text. For prediction, NLP algorithms focus on decision trees and statistical models.

Logical Steps of NLP

A simplified view of an NLP pipeline includes:

- Acquiring raw text data
- Cleaning and normalising the text
- Tokenising the text into words
- Counting word and N-gram frequencies
- Predicting the next word from those frequencies

Determining R packages for Data Science

We will use several R packages in this project; the main ones are loaded below.
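The exact package list was not preserved in this report. A minimal set consistent with the processing described below might look like the following; ggplot2 and knitr are assumptions for plotting and report generation:

```r
# Minimal package set for this analysis. tm loads the NLP package as a
# dependency, which is consistent with the output produced in this report.
library(tm)        # text mining: corpus handling and cleaning
library(ggplot2)   # plotting word frequencies (assumed)
library(knitr)     # report generation (assumed)
```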

Data Processing

This stage includes data acquisition, data exploration, data cleaning, and sampling.

Data Acquisition

SwiftKey, a corporate partner of this project, provided the corpus text dataset. The data are contained in a folder named “final” and amount to roughly 1.4 GB when downloaded directly. The dataset folders are listed below.
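A sketch of the acquisition step (the URL is the one distributed by the Coursera Data Science Capstone; treat it as an assumption):

```r
# Download and unzip the SwiftKey dataset, then list the language folders.
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("final")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")   # creates the "final" folder
}
list.dirs("final", full.names = FALSE)
```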

## [1] ""      "de_DE" "en_US" "fi_FI" "ru_RU"

The dataset includes blog posts, news articles, and Twitter tweets in four languages (German, English, Finnish, and Russian). The English data will be used as the basis for building the text prediction model.

Data Exploration

We will explore the number of lines, the word count, the longest line, and the file size of each text file in the “en_US” data folder.
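A base R sketch of how these summaries can be computed (file names assume the en_US.<source>.txt naming of the SwiftKey dataset):

```r
# Summarise each en_US file: line count, word count, longest line, size.
files <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")

summarise_file <- function(f) {
  path  <- file.path("final", "en_US", f)
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Source      = f,
    Lines       = length(lines),
    Words       = sum(lengths(strsplit(lines, "\\s+"))),
    LongestLine = max(nchar(lines)),
    FileSizeMB  = file.size(path) / 1024^2
  )
}

do.call(rbind, lapply(files, summarise_file))
```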

Table 1: Summary of Dataset

##        Source   Lines    Words LongestLine FileSizeMB
## 1   blogs.txt  899288 37546246       40833   200.4242
## 2 twitter.txt 2360148 30093369         140   159.3641
## 3    news.txt 1010242 34762395       11384   196.2775

Data Cleaning

Such a large dataset needs to be cleaned before model building. CPU and RAM capacity and the available R resources are limiting factors for processing, so working with a relatively small sample is sensible in this case. Special characters, punctuation, nulls, extra spaces, stray symbols, and numeric values need to be removed.
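A sketch of such a cleaning pipeline using the tm package; the exact transformations used in the original analysis are assumptions based on the description above:

```r
# Clean a tm corpus: lower-case, strip punctuation, numbers,
# non-ASCII symbols, and extra whitespace.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)   # strip punctuation
  corpus <- tm_map(corpus, removeNumbers)       # drop numeric values
  corpus <- tm_map(corpus, content_transformer(function(x)
    iconv(x, from = "UTF-8", to = "ASCII", sub = "")))  # drop special characters
  corpus <- tm_map(corpus, stripWhitespace)     # collapse extra spaces
  corpus
}
```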

Sample data

A 3% random sample of the corpus text is used; the number of sampled lines is shown below.
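A sketch of the sampling step (the seed is an assumption added for reproducibility; `files` is the vector defined in the exploration sketch above):

```r
# Draw ~3% of the lines from each en_US file into one sample vector.
set.seed(1234)
sample_text <- unlist(lapply(files, function(f) {
  lines <- readLines(file.path("final", "en_US", f),
                     encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = 0.03) == 1]
}))
length(sample_text)
```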

## [1] 128089

The sample dataset was tokenised into individual words.

Word Frequency Table in the Sample Dataset
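A base R sketch of the tokenisation and counting (the original analysis may have used tm's tooling instead):

```r
# Split the sample into lower-case word tokens and count frequencies.
tokens <- unlist(strsplit(tolower(sample_text), "[^a-z']+"))
tokens <- tokens[nchar(tokens) > 0]   # drop empty tokens

word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 10)   # the most frequent words in the sample
```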

The word frequency counts above are an important feature for building N-grams; these counts enable the prediction of subsequent words.

Review

The accuracy of the text prediction may be limited because the model uses N-gram tokenisation of only a 3% sample of the corpus. Running short of memory is a further limitation that prevents using a larger sample to improve accuracy.

Preparing for the next step

Build a prediction model

Text Prediction Modeling

We will try to build a model that balances accuracy and speed for the user. We will use 2-gram, 3-gram, 4-gram, and 5-gram models to predict the next word, using Katz's back-off model: if the highest-order N-gram has no match for the input, the model backs off to the next lower-order N-gram.
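A simplified sketch of the back-off idea: look up the longest available N-gram first and fall back to shorter ones when there is no match. Full Katz back-off additionally redistributes discounted probability mass, which is omitted here; `ngram_tables` is a hypothetical list of prefix-to-next-word lookup tables:

```r
# Simplified back-off (not full Katz discounting): try the 5-gram table
# first, then back off to shorter N-grams. "ngram_tables" is assumed to be
# a list keyed "5" down to "2", each element a named list mapping an
# (n-1)-word prefix to its most frequent next word.
predict_next <- function(input, ngram_tables) {
  words <- unlist(strsplit(tolower(input), "\\s+"))
  for (n in 5:2) {
    if (length(words) >= n - 1) {
      prefix <- paste(tail(words, n - 1), collapse = " ")
      hit <- ngram_tables[[as.character(n)]][[prefix]]
      if (!is.null(hit)) return(hit)   # longest matching N-gram wins
    }
  }
  "the"   # final fallback: the most common English unigram
}
```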

Create a Shiny app for presentation

Consider the size in memory and the runtime performance of the model. Develop the app, publish it on shinyapps.io, and check its resource consumption.
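A minimal Shiny skeleton for such an app, assuming the hypothetical predict_next() and ngram_tables from the sketch above:

```r
library(shiny)

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("next_word")
)

server <- function(input, output) {
  output$next_word <- renderText({
    if (nchar(input$phrase) == 0) return("")
    predict_next(input$phrase, ngram_tables)  # hypothetical model objects
  })
}

shinyApp(ui, server)
```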