Project Background

This Capstone Project is completed as part of the requirements for the Coursera Data Science Specialization. To find out more about this specialization, follow this link: https://www.coursera.org/specializations/jhu-data-science

Introduction

In this project, Natural Language Processing (NLP) is used to analyse text and provide a next-word prediction for a series of words a user keys in. The algorithm will be constructed from a large collection of text samples extracted from 3 main sources: blogs, news and Twitter. This Milestone Report focuses mainly on exploring the data and outlining the next steps for building the prediction algorithm.

Data Exploration

Download Data

The dataset used is downloaded from the following location: Coursera Download Link

Downloading and extracting this file yields the folder /final, which contains sample data in 4 different languages. For the purposes of this project, only the en_US data is used.

The 3 data files are:

  • en_US.blogs.txt - text from blog posts
  • en_US.news.txt - text from news articles posted online
  • en_US.twitter.txt - tweets on Twitter

Data Statistics

First, some basic statistics on the source data files are explored (a code sketch for computing them follows the table).

File    | File Size (bytes) | No. of Rows | Word Count | Avg. Word Count (per row) | Longest Row (chars) | Shortest Row (chars)
Blogs   | 210,160,014       | 899,288     | 37,334,131 | 42                        | 40,833              | 1
News    | 205,811,889       | 77,259      | 2,643,969  | 34                        | 5,760               | 2
Twitter | 167,105,338       | 2,360,148   | 30,373,583 | 13                        | 140                 | 2
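
For reference, below is a minimal sketch of one way such statistics could be computed. The file paths follow the layout used in the Appendix (../myCapCode/final/en_US/); each file is read fully into memory, which may be slow on a regular laptop, and the exact word counts depend on the tokenization rule used, so the figures may differ slightly from the table above.

## Sketch: basic statistics for one source file (reads the whole file into memory)
file_stats <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
    words <- sum(lengths(strsplit(lines, "\\s+")))
    data.frame(file              = basename(path),
               size_bytes        = file.size(path),
               rows              = length(lines),
               word_count        = words,
               avg_words_per_row = round(words / length(lines)),
               longest_row       = max(nchar(lines)),
               shortest_row      = min(nchar(lines)))
}

files <- file.path('../myCapCode/final/en_US',
                   c('en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt'))
do.call(rbind, lapply(files, file_stats))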

Cleaning Data & Generating Corpus

The whole dataset is huge and therefore difficult to process in its entirety given the limited resources of a regular laptop, so a subset of the data is extracted for processing. About 30,000 rows are taken at random from each dataset.

Next, the data is cleaned by converting all text to lower case, removing punctuation and numbers, and stripping extra whitespace.

Code for sampling & data cleanup is available in the Appendix.

Term Frequency

The cleaned-up data is converted into a corpus for further processing. The corpus (which consists of all 3 datasets) is tokenized into n-grams of length 1 to 4 (unigrams, bigrams, trigrams and quadgrams).

The top terms for these n-grams are listed below.

A word cloud is also generated to better visualize the top terms.
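
For reference, here is a minimal sketch of how the term frequencies and the word cloud could be produced from the unigram term-document matrix dtm1 built in the Appendix. The slam package is a dependency of tm; the wordcloud package and the cut-offs (top 15 terms, 100 words in the cloud) are choices made for this sketch, not part of the original analysis.

library(wordcloud)

## unigram frequencies from the term-document matrix built in the Appendix
freq1 <- sort(slam::row_sums(dtm1), decreasing = TRUE)
head(freq1, 15)                                    # top 15 unigrams

## word cloud of the most frequent unigrams
set.seed(123)
wordcloud(names(freq1), freq1, max.words = 100, random.order = FALSE)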

Next Steps

Further Work

Additional work will need to be done to further clean up the dataset:

  • remove additional symbols and special characters
  • retain the apostrophe ’ (for example, in ’s or ’ve) to increase prediction accuracy

Algorithm Plan

It is planned to use trigrams, bigrams and unigrams to predict the next word the user might type. Quadgrams may be included in the final algorithm, depending on processing capability and performance optimization. Generally (a minimal code sketch follows the list below):

  • the unigram model will be used to suggest the first word a user might type
  • the bigram model will be used once the user has keyed in the first word
  • when more than 2 words have been typed, the trigram model will be used for prediction
  • however, if the trigram prediction does not provide a satisfactory result, the algorithm will fall back to the bigram model and then to the unigram model
  • this approach is an implementation of the ‘back-off’ method
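
To make the plan concrete, below is a minimal sketch of the back-off logic. It assumes n-gram frequency tables tri_freq, bi_freq and uni_freq (with columns prefix, word and freq; uni_freq needs only word and freq) have already been built from the corpus; the function name predict_next and these table names are hypothetical, not part of the final algorithm.

## Back-off sketch: suggest up to `n` candidate next words for the text typed so far.
## Assumes hypothetical frequency tables tri_freq / bi_freq (columns: prefix, word, freq)
## and uni_freq (columns: word, freq).
predict_next <- function(input, tri_freq, bi_freq, uni_freq, n = 3) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    words <- words[words != ""]

    ## trigram model: condition on the last two words typed
    if (length(words) >= 2) {
        prefix <- paste(tail(words, 2), collapse = " ")
        hits <- tri_freq[tri_freq$prefix == prefix, ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
    }
    ## back off to the bigram model: condition on the last word typed
    if (length(words) >= 1) {
        prefix <- tail(words, 1)
        hits <- bi_freq[bi_freq$prefix == prefix, ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
    }
    ## back off to the unigram model: most frequent words overall
    head(uni_freq[order(-uni_freq$freq), "word"], n)
}

## example usage: predict_next("one of", tri_freq, bi_freq, uni_freq)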

Appendix

Code: Generating Sample (Function)

## Generate Sample
gen_sample <- function(file, row = 1000, seed = 123) {

    ## set file paths for reading and writing
    infile <- file(paste0('../myCapCode/final/en_US/',
                switch(file,
                       blog    = 'en_US.blogs.txt',
                       news    = 'en_US.news.txt',
                       twitter = 'en_US.twitter.txt')), 'rb')
    outfile <- file(paste0('../myCapCode/final/sample/',
                switch(file,
                       blog    = 'blogs.txt',
                       news    = 'news.txt',
                       twitter = 'twitter.txt')), 'w')
    ## ensure both connections are closed on exit (including early returns)
    on.exit({ close(infile); close(outfile) })

    ## set seed, determine which of the first row*2 lines to take, by rbinom
    set.seed(seed)
    samplerows <- rbinom(n = row * 2, size = 1, prob = 0.6)

    ## read in one line at a time and write out chosen lines,
    ## until the required number of rows is attained or end of file is reached
    written <- 0
    for (i in 1:(row * 2)) {
        currLine <- readLines(infile, n = 1, encoding = "UTF-8", skipNul = TRUE)
        if (length(currLine) == 0) break        # end of file
        if (samplerows[i] == 1) {               # this line was chosen
            writeLines(currLine, outfile)
            written <- written + 1
        }
        if (written >= row) break               # enough rows written out
    }
    return(written)
}

Code: Generate Sample (30,000 from each dataset, en_US only)

gen_sample('blog',30000,seed=88)
gen_sample('news',30000,seed=88)
gen_sample('twitter',30000,seed=88)

Code: Corpus Generation & Cleanup

Further work will be done here to improve corpus quality.

library(tm)
library(RWeka)
library(ggplot2)

sample_path <- '../myCapCode/final/sample/'

## create corpus using all 3 sample files
doc <- Corpus(DirSource(sample_path),
              readerControl = list(reader = readPlain, language = "en_US", load = TRUE))


## corpus cleanup
d <- doc
d <- tm_map(d, content_transformer(tolower))   # convert to lower case
d <- tm_map(d, removePunctuation)              # remove punctuation
d <- tm_map(d, removeNumbers)                  # remove numbers
d <- tm_map(d, stripWhitespace)                # collapse extra whitespace

n1gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
n2gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
n3gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
n4gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))

## term-document matrices for 1- to 4-grams
dtm1 <- TermDocumentMatrix(d, control = list(tokenize = n1gramtoken))
dtm2 <- TermDocumentMatrix(d, control = list(tokenize = n2gramtoken))
dtm3 <- TermDocumentMatrix(d, control = list(tokenize = n3gramtoken))
dtm4 <- TermDocumentMatrix(d, control = list(tokenize = n4gramtoken))
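
Finally, a sketch of how these term-document matrices could be converted into the prefix/word frequency tables assumed by the back-off sketch in the Algorithm Plan section. The helper name ngram_freq is hypothetical; slam is a dependency of tm.

## Hypothetical helper: convert an n-gram TermDocumentMatrix (n >= 2) into a
## frequency table, splitting each term into its prefix and its final word
ngram_freq <- function(tdm) {
    freq  <- sort(slam::row_sums(tdm), decreasing = TRUE)
    terms <- names(freq)
    data.frame(prefix = sub("\\s+\\S+$", "", terms),   # everything but the last word
               word   = sub("^.*\\s", "", terms),      # the last word only
               freq   = as.integer(freq),
               stringsAsFactors = FALSE)
}

bi_freq  <- ngram_freq(dtm2)
tri_freq <- ngram_freq(dtm3)

## unigram frequencies need no prefix
freq1    <- sort(slam::row_sums(dtm1), decreasing = TRUE)
uni_freq <- data.frame(word = names(freq1), freq = as.integer(freq1),
                       stringsAsFactors = FALSE)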