Introduction

The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the most probable next word. The application may be used, for example, on mobile devices to provide suggestions as the user types.

In this report we provide an initial analysis of the data and discuss our approach to building the application.

Obtaining the Data

We download the data from the URL provided in the course description and unzip it.

if (!file.exists("Coursera-SwiftKey.zip")) {
        download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", destfile = "Coursera-SwiftKey.zip", method = "curl")
        unzip("Coursera-SwiftKey.zip")
}

Loading (Blogs, News, Twitter) Data

Load the training data

# blogs
blogsFileName <- "final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# news
newsFileName <- "final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'final/en_US/en_US.news.txt'
close(con)

# twitter
twitterFileName <- "final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

rm(con)

Basic Data Summary

Prior to building the unified document corpus and cleaning the data, we provide a basic summary of the three text files, including the file size, number of lines, number of characters, and number of words for each source file, along with basic statistics on the number of words per line (min, mean, and max).

Initial Data Summary

library(stringi)    # string statistics and word counts
library(dplyr)      # data manipulation
library(knitr)      # kable() tables
library(kableExtra) # table styling
# assign sample size
sampleSize = 0.01

# Size of Files in Megabytes
fileSizeMB <- (file.size(c(blogsFileName,
                         newsFileName,
                         twitterFileName)) / 1024^2)

# Number of Lines per file
numLines <- sapply(list(blogs, news, twitter), length)

# Number of characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)

# Counting the Words (number of words per file)
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]

# Number of words per line

wordsPerLine <- lapply(list(blogs, news, twitter), function(x) stri_count_words(x)) 

# words per line summary
wordsPerLineSummary = sapply(list(blogs, news, twitter),
                    function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])

rownames(wordsPerLineSummary) = c('wordsPerLineMin', 'wordsPerLineMean', 'wordsPerLineMax')

summary <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = paste(round(fileSizeMB, 1), "MB"),
    Lines = numLines,
    Characters = numChars,
    Words = numWords,
    t(round(wordsPerLineSummary))
)

kable(summary,
      row.names = FALSE,
      align = c("l", rep("r", 7)),
      caption = "") %>% kable_styling(position = "left")
File                FileSize  Lines    Characters  Words     wordsPerLineMin  wordsPerLineMean  wordsPerLineMax
en_US.blogs.txt     200.4 MB   899288  206824505   37570839  0                42                6726
en_US.news.txt      196.3 MB    77259   15639408    2651432  1                35                1123
en_US.twitter.txt   159.4 MB  2360148  162096241   30451170  1                13                  47

An initial investigation of the data shows that, on average, each corpus has a relatively low number of words per line. The lower word count per line for the Twitter data is expected, given that a tweet is limited to a fixed number of characters.

Another important observation is that the text files are fairly large. To improve processing time, a 1% sample will be drawn from each of the three data sets and combined into a unified document corpus for the subsequent analyses in this report.

Histogram of Words per Line

(Histograms of the number of words per line for the blogs, news, and twitter data sets.)

The relatively low number of words per line in the three source files, summarized earlier in this section, is also visible in the histograms above. This observation supports a general trend towards short, concise communications that may be useful later in the project.
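The plotting code is not included in the report output. As a minimal sketch only, assuming ggplot2 and gridExtra and reusing the wordsPerLine list computed above, histograms like these could be produced as follows:

library(ggplot2)   # plotting
library(gridExtra) # arranging multiple plots side by side

# Illustrative sketch: histogram of words per line for a single source
plotWordsPerLine <- function(counts, label) {
    ggplot(data.frame(words = counts), aes(x = words)) +
        geom_histogram(binwidth = 5, fill = "steelblue", colour = "white") +
        labs(title = label, x = "Words per line", y = "Number of lines")
}

grid.arrange(plotWordsPerLine(wordsPerLine[[1]], "en_US.blogs.txt"),
             plotWordsPerLine(wordsPerLine[[2]], "en_US.news.txt"),
             plotWordsPerLine(wordsPerLine[[3]], "en_US.twitter.txt"),
             ncol = 3)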

Prepare the Data

Remove all non-English characters and then compile a sample data set composed of 1% of each of the three original data sets.

set.seed(12345)

# remove all non-English characters from the data
blogs1 <- iconv(blogs, "latin1", "ASCII", sub="")
news1 <- iconv(news, "latin1", "ASCII", sub="")
twitter1 <- iconv(twitter, "latin1", "ASCII", sub="")

# The full data sets are too large to process quickly, so use sample() to draw a sample from each
sampleData <- c(sample(blogs1,length(blogs1)*sampleSize),
               sample(news1,length(news1)*sampleSize),
               sample(twitter1,length(twitter1)*sampleSize))

# get number of lines and words from the sample data set
sampleDataLines <- length(sampleData)
sampleDataWords <- sum(stri_count_words(sampleData))

Clean and Build Corpus

With the non-English characters removed and the data sampled at 1%, the next step is to create a corpus from the sampled data set and apply the following transformations:

1. Remove punctuation marks
2. Strip extra whitespace
3. Convert all words to lowercase
4. Remove numbers
5. Convert to plain text documents
6. Remove common English stop words

library(tm) # Text mining
library(NLP) # Natural Language Processing

corpus <- VCorpus(VectorSource(sampleData))

corpus1 <- tm_map(corpus, removePunctuation)   # remove punctuation marks
corpus2 <- tm_map(corpus1, stripWhitespace)    # strip extra whitespace
corpus3 <- tm_map(corpus2, tolower)            # convert to lowercase
corpus4 <- tm_map(corpus3, removeNumbers)      # remove numbers
corpus5 <- tm_map(corpus4, PlainTextDocument)  # convert back to plain text documents

# remove common English stop words (a, as, at, so, etc.)
corpus6 <- tm_map(corpus5, removeWords, stopwords("english"))

Build N-Grams

In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations.

The following functions are used to extract unigrams, bigrams, and trigrams from the text corpus using RWeka.

library(RWeka) # tokenizer - create unigrams, bigrams, trigrams
# Use the RWeka package to construct tokenizer functions and build term-document matrices of unigrams, bigrams, and trigrams
one <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
two <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
three <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
oneTable <- TermDocumentMatrix(corpus6, control = list(tokenize = one))
twoTable <- TermDocumentMatrix(corpus6, control = list(tokenize = two))
threeTable <- TermDocumentMatrix(corpus6, control = list(tokenize = three))

Next, find the terms that occur above a minimum frequency in each of these three matrices and construct data frames of their frequencies.

oneCorpus <- findFreqTerms(oneTable, lowfreq = 1000)
twoCorpus <- findFreqTerms(twoTable, lowfreq = 80)
threeCorpus <- findFreqTerms(threeTable, lowfreq = 10)

oneCorpusNum <- rowSums(as.matrix(oneTable[oneCorpus,]))
oneCorpusTable <- data.frame(Word = names(oneCorpusNum), frequency = oneCorpusNum)
oneCorpusSort <- oneCorpusTable[order(-oneCorpusTable$frequency),]
head(oneCorpusSort)
##      Word frequency
## just just      2576
## like like      2218
## will will      2211
## one   one      2049
## get   get      1869
## can   can      1866
twoCorpusNum <- rowSums(as.matrix(twoTable[twoCorpus,]))
twoCorpusTable <- data.frame(Word = names(twoCorpusNum), frequency = twoCorpusNum)
twoCorpusSort <- twoCorpusTable[order(-twoCorpusTable$frequency),]
head(twoCorpusSort)
##                  Word frequency
## cant wait   cant wait       208
## right now   right now       206
## dont know   dont know       164
## last night last night       148
## im going     im going       130
## feel like   feel like       125
threeCorpusNum <- rowSums(as.matrix(threeTable[threeCorpus,]))
threeCorpusTable <- data.frame(Word = names(threeCorpusNum), frequency = threeCorpusNum)
threeCorpusSort <- threeCorpusTable[order(-threeCorpusTable$frequency),]
head(threeCorpusSort)
##                                      Word frequency
## cant wait see               cant wait see        45
## happy mothers day       happy mothers day        36
## happy new year             happy new year        24
## im pretty sure             im pretty sure        18
## italy lakes holidays italy lakes holidays        18
## little italy boston   little italy boston        17

Exploratory Analysis (Graphs & Visualizations)

The frequency distributions of the most common terms in each n-gram category are visualized in three bar plots.

library(ggplot2) #visualization

one_g <- ggplot(oneCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
one_g <- one_g + geom_bar(stat = "identity")
one_g <- one_g + labs(title = "Unigrams", x = "Words", y = "Frequency")
one_g <- one_g + theme(axis.text.x = element_text(angle = 90))
one_g

two_g <- ggplot(twoCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
two_g <- two_g + geom_bar(stat = "identity")
two_g <- two_g + labs(title = "Bigrams", x = "Words", y = "Frequency")
two_g <- two_g + theme(axis.text.x = element_text(angle = 90))
two_g

thr_g <- ggplot(threeCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
thr_g <- thr_g + geom_bar(stat = "identity")
thr_g <- thr_g + labs(title = "Trigrams", x = "Words", y = "Frequency")
thr_g <- thr_g + theme(axis.text.x = element_text(angle = 90))
thr_g

Conclusion & Next Steps

The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app for the user interface. The Shiny app should take a phrase (multiple words) as input in a text box and output a prediction of the next word.
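
As an illustration of the planned interface only (not the final implementation), a minimal Shiny sketch might look like the following; predictNextWord is a hypothetical prediction function, sketched further below:

library(shiny)

# Minimal, illustrative user interface: a text box for the phrase and a text output for the prediction
ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("phrase", "Enter a phrase:"),
    textOutput("prediction")
)

# Server: call the (hypothetical) prediction function whenever the input changes
server <- function(input, output) {
    output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        predictNextWord(input$phrase)
    })
}

shinyApp(ui = ui, server = server)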

The predictive algorithm will be developed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory data analysis section of this report. A strategy will be built based on the knowledge gathered during the exploratory analysis. For example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to first look up the unigram that would follow from the entered text and, once a full word has been entered followed by a space, to look up the most common matching bigram, and so on.

Another possible strategy is to predict the next word using the trigram model first. If no matching trigram is found, the algorithm backs off to the bigram model, and if there is still no match, it falls back to the unigram model.
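
To make the back-off idea concrete, the following is a minimal, illustrative sketch only (not the final algorithm) of such a lookup against the n-gram frequency tables built earlier in this report. It assumes the input phrase has been cleaned the same way as the corpus (lowercased, with punctuation and stop words removed), and predictNextWord is a hypothetical helper name:

# Naive back-off lookup: trigram table first, then bigram table, then the most frequent unigram
predictNextWord <- function(phrase) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

    # try the trigram table: match on the last two words of the phrase
    if (length(words) == 2) {
        prefix <- paste(words[1], words[2])
        hits <- threeCorpusSort[startsWith(as.character(threeCorpusSort$Word),
                                           paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            return(tail(strsplit(as.character(hits$Word[1]), " ")[[1]], 1))
        }
    }

    # back off to the bigram table: match on the last word only
    prefix <- tail(words, 1)
    hits <- twoCorpusSort[startsWith(as.character(twoCorpusSort$Word),
                                     paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
        return(tail(strsplit(as.character(hits$Word[1]), " ")[[1]], 1))
    }

    # fall back to the most frequent unigram overall
    as.character(oneCorpusSort$Word[1])
}

predictNextWord("cant wait")  # given the trigram table above, this would return "see"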

The final strategy will be the one that offers the best balance of efficiency and prediction accuracy.