The purpose of the capstone project is to build a Natural Language Processing (NLP) application that, given a chunk of text, predicts the next most probable word. The application could be used, for example, on mobile devices to suggest the next word as the user types.
In this report we provide an initial analysis of the data and discuss an approach to building the application.
We download the data from the URL provided in the course description, and unzip it.
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = "Coursera-SwiftKey.zip",
                  method = "curl")
    unzip("Coursera-SwiftKey.zip")
}
Load the training data
# blogs
blogsFileName <- "final/en_US/en_US.blogs.txt"
con <- file(blogsFileName, open = "r")
blogs <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
# news
newsFileName <- "final/en_US/en_US.news.txt"
con <- file(newsFileName, open = "r")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
## Warning in readLines(con, encoding = "UTF-8", skipNul = TRUE): incomplete final
## line found on 'final/en_US/en_US.news.txt'
close(con)
# twitter
twitterFileName <- "final/en_US/en_US.twitter.txt"
con <- file(twitterFileName, open = "r")
twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
rm(con)
Before building the unified document corpus and cleaning the data, we provide a basic summary of the three source files, including file size, number of lines, number of characters, and number of words, along with basic statistics on the number of words per line (minimum, mean, and maximum).
library(stringi)    # string statistics and word counts
library(dplyr)      # data manipulation and the pipe operator
library(knitr)      # kable() for tables
library(kableExtra) # kable_styling()
# assign sample size
sampleSize = 0.01
# Size of Files in Megabytes
fileSizeMB <- (file.size(c(blogsFileName,
newsFileName,
twitterFileName)) / 1024^2)
# Number of Lines per file
numLines <- sapply(list(blogs, news, twitter), length)
# Number of characters per file
numChars <- sapply(list(nchar(blogs), nchar(news), nchar(twitter)), sum)
# Counting the Words (number of words per file)
numWords <- sapply(list(blogs, news, twitter), stri_stats_latex)[4,]
# Number of words per line
wordsPerLine <- lapply(list(blogs, news, twitter), function(x) stri_count_words(x))
# words per line summary
wordsPerLineSummary = sapply(list(blogs, news, twitter),
function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wordsPerLineSummary) = c('wordsPerLineMin', 'wordsPerLineMean', 'wordsPerLineMax')
summary <- data.frame(
    File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
    FileSize = paste0(round(fileSizeMB, 1), " MB"),
    Lines = numLines,
    Characters = numChars,
    Words = numWords,
    t(round(wordsPerLineSummary))
)
kable(summary,
row.names = FALSE,
align = c("l", rep("r", 7)),
caption = "") %>% kable_styling(position = "left")
| File | FileSize | Lines | Characters | Words | wordsPerLineMin | wordsPerLineMean | wordsPerLineMax |
|---|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.4 MB | 899288 | 206824505 | 37570839 | 0 | 42 | 6726 |
| en_US.news.txt | 196.3 MB | 77259 | 15639408 | 2651432 | 1 | 35 | 1123 |
| en_US.twitter.txt | 159.4 MB | 2360148 | 162096241 | 30451170 | 1 | 13 | 47 |
An initial investigation of the data shows that, on average, each of the three corpora has a relatively low number of words per line. The lower word count per line for the Twitter data is expected, since tweets are limited in length.
Another important observation is that the text files are fairly large. To improve processing time, a 1% sample will be drawn from each of the three data sets and combined into a unified document corpus for the analyses that follow.
[Figure: histograms of the number of words per line for the blogs, news, and Twitter data sets.]
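The code that generated these histograms is not reproduced above. Below is a minimal sketch, assuming the wordsPerLine list computed earlier, of how comparable plots could be produced with ggplot2 and gridExtra; the bin width and styling are arbitrary choices, not those of the original plots.
library(ggplot2)   # plotting
library(gridExtra) # arranging plots side by side
# helper: histogram of words per line for one source (sketch only)
plotWordsPerLine <- function(counts, label) {
    ggplot(data.frame(words = counts), aes(x = words)) +
        geom_histogram(binwidth = 5, fill = "steelblue", colour = "black") +
        labs(title = label, x = "Words per line", y = "Number of lines")
}
grid.arrange(plotWordsPerLine(wordsPerLine[[1]], "Blogs"),
             plotWordsPerLine(wordsPerLine[[2]], "News"),
             plotWordsPerLine(wordsPerLine[[3]], "Twitter"),
             ncol = 3)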
The relatively low number of words per line observed in the summary table is also visible in the histograms above. This general tendency towards short, concise text may be useful later in the project.
Remove all non-English characters and then compile a sample data set composed of 1% of each of the three original data sets.
set.seed(12345)
# remove all non-English characters from the data
blogs1 <- iconv(blogs, "latin1", "ASCII", sub="")
news1 <- iconv(news, "latin1", "ASCII", sub="")
twitter1 <- iconv(twitter, "latin1", "ASCII", sub="")
# the full data sets are too large to process, so draw a 1% sample from each
sampleData <- c(sample(blogs1, length(blogs1) * sampleSize),
                sample(news1, length(news1) * sampleSize),
                sample(twitter1, length(twitter1) * sampleSize))
# get number of lines and words from the sample data set
sampleDataLines <- length(sampleData)
sampleDataWords <- sum(stri_count_words(sampleData))
With the non-English characters removed and the three data sets sampled at 1%, the data are now small enough for exploratory analysis.
The next step is to create a corpus from the sampled data set and apply the following transformations:

1. Remove punctuation marks
2. Strip extra whitespace
3. Convert all words to lowercase
4. Remove numbers
5. Convert to plain text documents
6. Remove common English stop words (a, as, at, so, etc.)
library(tm) # Text mining
library(NLP) # Natural Language Processing
corpus <- VCorpus(VectorSource(sampleData))               # build the corpus from the sample
corpus1 <- tm_map(corpus, removePunctuation)              # remove punctuation marks
corpus2 <- tm_map(corpus1, stripWhitespace)               # strip extra whitespace
corpus3 <- tm_map(corpus2, content_transformer(tolower))  # convert to lowercase
corpus4 <- tm_map(corpus3, removeNumbers)                 # remove numbers
corpus5 <- tm_map(corpus4, PlainTextDocument)             # convert to plain text documents
# remove common English stop words (a, as, at, so, etc.)
corpus6 <- tm_map(corpus5, removeWords, stopwords("english"))
In Natural Language Processing (NLP), an n-gram is a contiguous sequence of n items from a given sequence of text or speech. Unigrams are single words, bigrams are two-word combinations, and trigrams are three-word combinations. For example, the phrase "thanks for the follow" contains the bigrams "thanks for", "for the", and "the follow".
The following functions are used to extract unigrams, bigrams, and trigrams from the text corpus using RWeka.
library(RWeka) # tokenizer - create unigrams, bigrams, trigrams
# Use the RWeka package to construct functions that tokenize the sample and build term-document matrices of unigrams, bigrams, and trigrams.
one <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
two <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
three <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
oneTable <- TermDocumentMatrix(corpus6, control = list(tokenize = one))
twoTable <- TermDocumentMatrix(corpus6, control = list(tokenize = two))
threeTable <- TermDocumentMatrix(corpus6, control = list(tokenize = three))
Then find the frequency of terms in each of these three matrices and construct data frames of these frequencies.
oneCorpus <- findFreqTerms(oneTable, lowfreq = 1000)
twoCorpus <- findFreqTerms(twoTable, lowfreq = 80)
threeCorpus <- findFreqTerms(threeTable, lowfreq = 10)
oneCorpusNum <- rowSums(as.matrix(oneTable[oneCorpus,]))
oneCorpusTable <- data.frame(Word = names(oneCorpusNum), frequency = oneCorpusNum)
oneCorpusSort <- oneCorpusTable[order(-oneCorpusTable$frequency),]
head(oneCorpusSort)
## Word frequency
## just just 2576
## like like 2218
## will will 2211
## one one 2049
## get get 1869
## can can 1866
twoCorpusNum <- rowSums(as.matrix(twoTable[twoCorpus,]))
twoCorpusTable <- data.frame(Word = names(twoCorpusNum), frequency = twoCorpusNum)
twoCorpusSort <- twoCorpusTable[order(-twoCorpusTable$frequency),]
head(twoCorpusSort)
## Word frequency
## cant wait cant wait 208
## right now right now 206
## dont know dont know 164
## last night last night 148
## im going im going 130
## feel like feel like 125
threeCorpusNum <- rowSums(as.matrix(threeTable[threeCorpus,]))
threeCorpusTable <- data.frame(Word = names(threeCorpusNum), frequency = threeCorpusNum)
threeCorpusSort <- threeCorpusTable[order(-threeCorpusTable$frequency),]
head(threeCorpusSort)
## Word frequency
## cant wait see cant wait see 45
## happy mothers day happy mothers day 36
## happy new year happy new year 24
## im pretty sure im pretty sure 18
## italy lakes holidays italy lakes holidays 18
## little italy boston little italy boston 17
The frequency distribution of each n-gram category is visualized in three bar plots below.
library(ggplot2) #visualization
one_g <- ggplot(oneCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
one_g <- one_g + geom_bar(stat = "identity")
one_g <- one_g + labs(title = "Unigrams", x = "Words", y = "Frequency")
one_g <- one_g + theme(axis.text.x = element_text(angle = 90))
one_g
two_g <- ggplot(twoCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
two_g <- two_g + geom_bar(stat = "identity")
two_g <- two_g + labs(title = "Bigrams", x = "Words", y = "Frequency")
two_g <- two_g + theme(axis.text.x = element_text(angle = 90))
two_g
thr_g <- ggplot(threeCorpusSort[1:10,], aes(x = reorder(Word, -frequency), y = frequency, fill = frequency))
thr_g <- thr_g + geom_bar(stat = "identity")
thr_g <- thr_g + labs(title = "Trigrams", x = "Words", y = "Frequency")
thr_g <- thr_g + theme(axis.text.x = element_text(angle = 90))
thr_g
The final deliverable of the capstone project is a predictive algorithm deployed as a Shiny app that serves as the user interface. The app will take a phrase (one or more words) entered in a text box and output a prediction of the next word; a minimal sketch of such an interface follows.
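The sketch below shows one possible shape of the planned app, not a final implementation. The predictNextWord() function is an assumption here; one possible way to implement it is sketched after the strategy discussion below.
library(shiny)
# minimal sketch of the planned user interface
ui <- fluidPage(
    titlePanel("Next Word Prediction"),
    textInput("phrase", "Enter a phrase:", value = ""),
    h4("Predicted next word:"),
    textOutput("prediction")
)
server <- function(input, output) {
    output$prediction <- renderText({
        # return nothing until the user has typed something
        if (nchar(trimws(input$phrase)) == 0) return("")
        predictNextWord(input$phrase)   # assumed prediction function
    })
}
shinyApp(ui = ui, server = server)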
The predictive algorithm will be developed using an n-gram model with a word-frequency lookup similar to the one performed in the exploratory data analysis above, and the strategy will build on what was learned there. For example, as n increases, the frequency of each individual n-gram decreases. One possible strategy is therefore to first look up the most likely unigram completion of the word being typed; once a full word has been entered followed by a space, look up the most frequent bigram beginning with that word, and so on.
Another possible strategy is to predict the next word with the trigram model first; if no matching trigram is found, the algorithm backs off to the bigram model, and if there is still no match, to the unigram model. A sketch of this backoff lookup is shown below.
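As an illustration of this backoff strategy, the following sketch builds a simple lookup on top of the frequency tables constructed earlier (oneCorpusSort, twoCorpusSort, threeCorpusSort). The cleanInput() helper and the exact matching rules are assumptions for illustration only; the final model would need larger n-gram tables and proper smoothing.
# normalise user input roughly the way the corpus was cleaned (assumption)
cleanInput <- function(text) {
    text <- tolower(text)
    text <- gsub("[^a-z ]", "", text)   # keep letters and spaces only
    unlist(strsplit(trimws(text), "\\s+"))
}
predictNextWord <- function(text) {
    words <- cleanInput(text)
    n <- length(words)
    # try the trigram table: entries starting with the last two words
    if (n >= 2) {
        prefix <- paste(words[n - 1], words[n])
        hits <- threeCorpusSort[startsWith(as.character(threeCorpusSort$Word), paste0(prefix, " ")), ]
        if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$Word[1]), " ")[[1]], 1))
    }
    # back off to the bigram table: entries starting with the last word
    if (n >= 1) {
        hits <- twoCorpusSort[startsWith(as.character(twoCorpusSort$Word), paste0(words[n], " ")), ]
        if (nrow(hits) > 0) return(tail(strsplit(as.character(hits$Word[1]), " ")[[1]], 1))
    }
    # final fallback: the most frequent unigram
    as.character(oneCorpusSort$Word[1])
}
predictNextWord("I cant wait")   # with this sample, likely returns "see"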
The final strategy will be whichever approach provides the best balance of efficiency and prediction accuracy.