Project Background

This Capstone Project is completed as part of the requirements for the Coursera Data Science Specialization. To find out more about this specialization, follow this link: https://www.coursera.org/specializations/jhu-data-science

Introduction

In this project, Natural Language Processing (NLP) is used to analyse text and provide a next-word prediction for a series of words a user keys in. The algorithm will be constructed from a large collection of text samples extracted from 3 main sources: blogs, news and Twitter. This Milestone Report focuses mainly on exploring the data and outlining the next steps for building the prediction algorithm.

Data Exploration

Download Data

The dataset used is downloaded from the following location: Coursera Download Link

Downloading and extracting this file yields the folder /final, which contains sample data in 4 different languages. For the purposes of this project, only the en_US data is used.

The 3 data files are:

  • en_US.blogs.txt - text from blog posts
  • en_US.news.txt - text from news articles posted online
  • en_US.twitter.txt - tweets on Twitter

Data Statistics

First, some basic statistics on the source data files are explored (a code sketch for computing them follows the table).

File    | File Size (bytes) | No. of Rows | Word Count | Avg. Word Count (per row) | Longest Row (chars) | Shortest Row (chars)
Blogs   | 210,160,014       | 899,288     | 37,334,131 | 42                        | 40,833              | 1
News    | 205,811,889       | 77,259      | 2,643,969  | 34                        | 5,760               | 2
Twitter | 167,105,338       | 2,360,148   | 30,373,583 | 13                        | 140                 | 2
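
For reference, below is a minimal sketch of one way such statistics could be computed. The file paths follow the layout used in the Appendix (../myCapCode/final/en_US/); each file is read fully into memory, which may be slow on a regular laptop, and the exact word counts depend on the tokenization rule used, so the figures may differ slightly from the table above.

## Sketch: basic statistics for one source file (reads the whole file into memory)
file_stats <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
    words <- sum(lengths(strsplit(lines, "\\s+")))
    data.frame(file              = basename(path),
               size_bytes        = file.size(path),
               rows              = length(lines),
               word_count        = words,
               avg_words_per_row = round(words / length(lines)),
               longest_row       = max(nchar(lines)),
               shortest_row      = min(nchar(lines)))
}

files <- file.path('../myCapCode/final/en_US',
                   c('en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt'))
do.call(rbind, lapply(files, file_stats))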

Cleaning Data & Generating Corpus

The whole dataset is huge and therefore difficult to process in its entirety given the limited resources of a regular laptop, so a subset of the data is extracted for processing. About 30,000 rows are taken at random from each dataset.

Next, the data is cleaned by converting all text to lower case, removing punctuation and numbers, and stripping extra whitespace.

Code for sampling & data cleanup is available in the Appendix.

Term Frequency

The cleaned-up data is converted into a corpus for further processing. The corpus (which consists of all 3 datasets) is tokenized into n-grams of length 1 to 4 (unigrams, bigrams, trigrams and quadgrams).

The top terms for these n-grams are listed below.

A word cloud is also generated to better visualize the top terms.
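
For reference, here is a minimal sketch of how the term frequencies and the word cloud could be produced from the unigram term-document matrix dtm1 built in the Appendix. The slam package is a dependency of tm; the wordcloud package and the cut-offs (top 15 terms, 100 words in the cloud) are choices made for this sketch, not part of the original analysis.

library(wordcloud)

## unigram frequencies from the term-document matrix built in the Appendix
freq1 <- sort(slam::row_sums(dtm1), decreasing = TRUE)
head(freq1, 15)                                    # top 15 unigrams

## word cloud of the most frequent unigrams
set.seed(123)
wordcloud(names(freq1), freq1, max.words = 100, random.order = FALSE)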

Next Steps

Further Work

Additional work will need to be done to further clean up the dataset:

  • remove additional symbols and special characters
  • retain the apostrophe ’ (for example, in ’s or ’ve) to increase prediction accuracy

Algorithm Plan

It is planned to use trigrams, bigrams and unigrams to predict the next word the user might type. Quadgrams may be included in the final algorithm, depending on processing capability and performance optimization. Generally (a minimal code sketch follows the list below):

  • the unigram model will be used to suggest the first word a user might type
  • the bigram model will be used once the user has keyed in the first word
  • when more than 2 words have been typed, the trigram model will be used for prediction
  • however, if the trigram prediction does not provide a satisfactory result, the algorithm will fall back to the bigram model and then to the unigram model
  • this approach is an implementation of the ‘back-off’ method
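
To make the plan concrete, below is a minimal sketch of the back-off logic. It assumes n-gram frequency tables tri_freq, bi_freq and uni_freq (with columns prefix, word and freq; uni_freq needs only word and freq) have already been built from the corpus; the function name predict_next and these table names are hypothetical, not part of the final algorithm.

## Back-off sketch: suggest up to `n` candidate next words for the text typed so far.
## Assumes hypothetical frequency tables tri_freq / bi_freq (columns: prefix, word, freq)
## and uni_freq (columns: word, freq).
predict_next <- function(input, tri_freq, bi_freq, uni_freq, n = 3) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    words <- words[words != ""]

    ## trigram model: condition on the last two words typed
    if (length(words) >= 2) {
        prefix <- paste(tail(words, 2), collapse = " ")
        hits <- tri_freq[tri_freq$prefix == prefix, ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
    }
    ## back off to the bigram model: condition on the last word typed
    if (length(words) >= 1) {
        prefix <- tail(words, 1)
        hits <- bi_freq[bi_freq$prefix == prefix, ]
        if (nrow(hits) > 0) return(head(hits[order(-hits$freq), "word"], n))
    }
    ## back off to the unigram model: most frequent words overall
    head(uni_freq[order(-uni_freq$freq), "word"], n)
}

## example usage: predict_next("one of", tri_freq, bi_freq, uni_freq)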

Appendix

Code: Generating Sample (Function)

## Generate Sample
gen_sample <- function(file, row = 1000, seed = 123) {

    ## set file paths for reading and writing
    infile <- file(paste0('../myCapCode/final/en_US/',
                switch(file,
                       blog    = 'en_US.blogs.txt',
                       news    = 'en_US.news.txt',
                       twitter = 'en_US.twitter.txt')), 'rb')
    outfile <- file(paste0('../myCapCode/final/sample/',
                switch(file,
                       blog    = 'blogs.txt',
                       news    = 'news.txt',
                       twitter = 'twitter.txt')), 'w')
    ## ensure both connections are closed on exit (including early returns)
    on.exit({ close(infile); close(outfile) })

    ## set seed, determine which of the first row*2 lines to take, by rbinom
    set.seed(seed)
    samplerows <- rbinom(n = row * 2, size = 1, prob = 0.6)

    ## read in one line at a time and write out chosen lines,
    ## until the required number of rows is attained or end of file is reached
    written <- 0
    for (i in 1:(row * 2)) {
        currLine <- readLines(infile, n = 1, encoding = "UTF-8", skipNul = TRUE)
        if (length(currLine) == 0) break        # end of file
        if (samplerows[i] == 1) {               # this line was chosen
            writeLines(currLine, outfile)
            written <- written + 1
        }
        if (written >= row) break               # enough rows written out
    }
    return(written)
}

Code: Generate Sample (30,000 from each dataset, en_US only)

gen_sample('blog',30000,seed=88)
gen_sample('news',30000,seed=88)
gen_sample('twitter',30000,seed=88)

Code: Corpus Generation & Cleanup

Further work will be done here to improve corpus quality.

library(tm)
library(RWeka)
library(ggplot2)

sample_path <- '../myCapCode/final/sample/'

## create corpus using all 3 sample files
doc <- Corpus(DirSource(sample_path),
              readerControl = list(reader = readPlain, language = "en_US", load = TRUE))


## corpus cleanup
d <- doc
d <- tm_map(d, content_transformer(tolower))   # convert to lower case
d <- tm_map(d, removePunctuation)              # remove punctuation
d <- tm_map(d, removeNumbers)                  # remove numbers
d <- tm_map(d, stripWhitespace)                # collapse extra whitespace

n1gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 1, max = 1, delimiters = " \\r\\n\\t.,;:\"()?!"))
n2gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 2, max = 2, delimiters = " \\r\\n\\t.,;:\"()?!"))
n3gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 3, max = 3, delimiters = " \\r\\n\\t.,;:\"()?!"))
n4gramtoken <- function (x) NGramTokenizer(x, Weka_control(min = 4, max = 4, delimiters = " \\r\\n\\t.,;:\"()?!"))

## term-document matrices for 1- to 4-grams
dtm1 <- TermDocumentMatrix(d, control = list(tokenize = n1gramtoken))
dtm2 <- TermDocumentMatrix(d, control = list(tokenize = n2gramtoken))
dtm3 <- TermDocumentMatrix(d, control = list(tokenize = n3gramtoken))
dtm4 <- TermDocumentMatrix(d, control = list(tokenize = n4gramtoken))
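
Finally, a sketch of how these term-document matrices could be converted into the prefix/word frequency tables assumed by the back-off sketch in the Algorithm Plan section. The helper name ngram_freq is hypothetical; slam is a dependency of tm.

## Hypothetical helper: convert an n-gram TermDocumentMatrix (n >= 2) into a
## frequency table, splitting each term into its prefix and its final word
ngram_freq <- function(tdm) {
    freq  <- sort(slam::row_sums(tdm), decreasing = TRUE)
    terms <- names(freq)
    data.frame(prefix = sub("\\s+\\S+$", "", terms),   # everything but the last word
               word   = sub("^.*\\s", "", terms),      # the last word only
               freq   = as.integer(freq),
               stringsAsFactors = FALSE)
}

bi_freq  <- ngram_freq(dtm2)
tri_freq <- ngram_freq(dtm3)

## unigram frequencies need no prefix
freq1    <- sort(slam::row_sums(dtm1), decreasing = TRUE)
uni_freq <- data.frame(word = names(freq1), freq = as.integer(freq1),
                       stringsAsFactors = FALSE)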