For the capstone project I have to create an application that predicts the next word in a sentence. To build this application I will use natural language processing, specifically N-grams: the statistical probability of the next word is estimated from the preceding word (or several preceding words). The data consists of several large corpora of text documents: blog posts, news articles and Twitter messages. In this report I describe the corpora and present some descriptive statistics to explore options for modelling the data.
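To make the N-gram idea concrete, below is a minimal sketch in base R of how a bigram table can be used to predict the most likely next word. It uses a small, made-up toy corpus (not the project data) and is only an illustration of the principle, not the model built later in the project.

# Toy illustration of bigram-based next-word prediction (not the final model)
toy <- c("the cat sat on the mat", "the cat ran", "the dog sat on the rug")

# split each sentence into words and build all adjacent word pairs (bigrams)
toy.words   <- strsplit(tolower(toy), "\\s+")
toy.bigrams <- unlist(lapply(toy.words, function(w) paste(head(w, -1), tail(w, -1))))
bi.count    <- table(toy.bigrams)

predict.next <- function(prev) {
        # bigrams that start with the previous word
        cand <- bi.count[startsWith(names(bi.count), paste0(prev, " "))]
        if (length(cand) == 0) return(NA_character_)
        # conditional relative frequency of each candidate next word
        probs <- cand / sum(cand)
        sub(paste0("^", prev, " "), "", names(which.max(probs)))
}

predict.next("the")   # returns "cat", the most frequent word after "the" in the toy corpus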

Given Data

For this project, several files were provided. These files contain Twitter messages, news articles and blog entries in four different languages. I will only use the US English data to create the application described above.

## Reading data

setwd("./final/en_US")
US.blog <- readLines("en_US.blogs.txt",skipNul = T, warn = T)
US.news <- readLines("en_US.news.txt",skipNul = T, warn = T)
US.twitter <- readLines("en_US.twitter.txt",skipNul = T, warn = T)
## Analyse the data by counting lines, words and maximum line length per dataset: blogs, news and Twitter.
#Get line counts
bloglines <- length(US.blog)
newslines <- length(US.news)
twitterlines <- length(US.twitter)

#Get max line length
blog.char.cnt <- nchar(US.blog)
blog.max.chars <- max(blog.char.cnt)

news.char.cnt <- nchar(US.news)
news.max.chars <- max(news.char.cnt)

twitter.char.cnt <- nchar(US.twitter)
twitter.max.chars <- max(twitter.char.cnt)

#Get word counts (based on spaces)
blog.words <- sum( sapply(gregexpr("\\S+", US.blog), length ) )
news.words <- sum( sapply(gregexpr("\\S+", US.news), length ) )
twitter.words <- sum( sapply(gregexpr("\\S+", US.twitter), length ) )

Summary of corpus statistics

stats <- data.frame( "Files" = c("Blogs", "News", "Twitter"),
                            "Line count" = c(bloglines, newslines, twitterlines),
                            "Longest Line" = c(blog.max.chars, news.max.chars, twitter.max.chars),
                            "Word count" = c(blog.words, news.words, twitter.words))
print(stats)
##     Files Line.count Longest.Line Word.count
## 1   Blogs     899288        40835   37334441
## 2    News      77259         5760    2643972
## 3 Twitter    2360148          213   30373832

The blog data contains the most words and by far the longest single line, which suggests that blog entries tend to be longer, more elaborate pieces of text. The news data (as read here) has the fewest lines and the fewest words. The Twitter data consists of many short messages: it has by far the most lines, yet fewer words than the blogs, and its longest line is only 213 characters.

Sampling

Because of the large size of these files, processing the data on a local computer takes a lot of time. To reduce the computation time, the data is sampled: from each corpus I randomly sample 5% of the lines and combine the three samples into a single dataset.

set.seed(1234)
samplesize.blogs <- round(0.05*length(US.blog))
samplesize.news <- round(0.05*length(US.news))
samplesize.twitter <- round(0.05*length(US.twitter))


sample.blogs   <- sample(US.blog, samplesize.blogs, replace = FALSE)
sample.news    <- sample(US.news, samplesize.news, replace = FALSE)
sample.twitter <- sample(US.twitter, samplesize.twitter, replace = FALSE)

sample <- c(sample.blogs, sample.news, sample.twitter)
## Load the packages used below: tm for text mining (this also loads NLP, which
## provides ngrams() and words()), ggplot2 and ggpubr for plotting, scales for
## the percentage axis labels, reshape2 for melt() and dplyr for data handling.
library(dplyr)
library(ggplot2)
library(tm)
library(ggpubr)
library(reshape2)
library(scales)
## The text is cleaned first as a plain character vector; the corpus itself is created after cleaning (see below).

## Remove URLs
removeURL <- gsub("http[^[:space:]]*", "", sample)


# Remove anything other than English letters or space

removeNumPunct <- gsub("[^[:alpha:][:space:]]*", "", removeURL)
removePunct <- removePunctuation(removeNumPunct)

## All lower case

cleansample <- tolower(removePunct)


## remove profanity
## profanity list
setwd("./final/en_US")
Profanitylist <- read.csv("./Profanitylist.csv", header=FALSE)

Profanitylist <- as.character(Profanitylist$V1)

## Create the corpus from the cleaned text and remove the profanity words (and a few custom words).

cleansample <- VectorSource(cleansample)
cleansample <- VCorpus(cleansample)
cleansample <- tm_map(cleansample, content_transformer(tolower))
cleansample <- tm_map(cleansample, removeWords, Profanitylist)
cleansample <- tm_map(cleansample, removeWords, c("vested", "interests"))
cleansample <- tm_map(cleansample, stripWhitespace)

cleansample.ns <- tm_map(cleansample, removeWords, stopwords('english'))

## create doc term matrix
dtm <- DocumentTermMatrix(cleansample.ns)


## Convert the text to a matrix; remove sparse terms first, otherwise the memory requirements become too demanding
dtm.ST <- removeSparseTerms(dtm, 0.9999)
dtm.matrix <- as.matrix(dtm.ST)


wordcount <- colSums(dtm.matrix)
toptwenty <- head(sort(wordcount, decreasing=TRUE), 20)


dfplot <- as.data.frame(melt(toptwenty))
dfplot$word <- dimnames(dfplot)[[1]]
dfplot$word <- factor(dfplot$word,
                      levels=dfplot$word[order(dfplot$value,
                                               decreasing=TRUE)])

fig <- ggplot(dfplot, aes(x=word, y=value)) + geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Word") + ylab("Count") + 
        theme(axis.text.x = element_text(angle = 90, size = 10, hjust = 1, vjust = 0.4))

## proportion
count <- sum(sort(wordcount, decreasing = T))
prop <- toptwenty/count
fig.prop <- ggplot(dfplot, aes(x = word, y = prop)) +  
        geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Word")+ ylab("Proportion") +
        theme(axis.text.x = element_text(angle = 90, size = 10, hjust = 1, vjust = 0.4)) +      
        scale_y_continuous(labels = percent)

## plot of both bar charts.
ggarrange(fig, fig.prop,
          ncol = 2)

## Trying to get a picture of the pair of words that are most common.
twogram <-
        function(x)
                unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)


dtm2 <- DocumentTermMatrix(cleansample.ns,
                           control=list(tokenize=twogram))

# Remove sparse terms, otherwise the memory requirements become too demanding
dtm2.ST <- removeSparseTerms(dtm2, 0.9999)
dtm2.matrix <- as.matrix(dtm2.ST)

wordcount2 <- colSums(dtm2.matrix)
toptwenty2 <- head(sort(wordcount2, decreasing=TRUE), 20)

dfplot2 <- as.data.frame(melt(toptwenty2))
dfplot2$word <- dimnames(dfplot2)[[1]]
dfplot2$word <- factor(dfplot2$word,
                      levels=dfplot2$word[order(dfplot2$value,
                                               decreasing=TRUE)])

fig2 <- ggplot(dfplot2, aes(x=word, y=value)) + geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Pair of words")+ ylab("Count") +
        theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1))


## proportion of words

count2 <- sum(sort(wordcount2, decreasing = T))
prop2 <- toptwenty2/count2
fig.prop2 <- ggplot(dfplot2, aes(x = word, y = prop2)) +  
        geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Pair of words")+ ylab("Proportion") +
        theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +      
        scale_y_continuous(labels = percent)

## plot of both bar charts.
ggarrange(fig2, fig.prop2,
          ncol = 2)

## Trying to get a picture of the triplets of words that are most common.
threegram <-
        function(x)
                unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

dtm3 <- DocumentTermMatrix(cleansample.ns,
                           control=list(tokenize=threegram))

dtm3.matrix <- as.matrix(dtm3)

wordcount3 <- colSums(dtm3.matrix)
toptwenty3 <- head(sort(wordcount3, decreasing=TRUE), 20)

dfplot3 <- as.data.frame(melt(toptwenty3))
dfplot3$word <- dimnames(dfplot3)[[1]]
dfplot3$word <- factor(dfplot3$word,
                      levels=dfplot3$word[order(dfplot3$value,
                                               decreasing=TRUE)])

fig3 <- ggplot(dfplot3, aes(x=word, y=value)) + geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Triplets of words")+ ylab("Count") +
        theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1))



## plot with proportion on y-axis

count3 <- sum(sort(wordcount3, decreasing = T))
prop3 <- toptwenty3/count3
fig.prop3 <- ggplot(dfplot3, aes(x = word, y = prop3)) +  
        geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Triplets of words")+ ylab("Proportion") +
                theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +      
                        scale_y_continuous(labels = percent)
## plot of both bar charts.
ggarrange(fig3, fig.prop3,
          ncol = 2)

## Trying to get a picture of the quadruples of words that are most common.
fourgram <-
        function(x)
                unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)

dtm4 <- DocumentTermMatrix(cleansample,
                           control=list(tokenize=fourgram))

dtm4.matrix <- as.matrix(dtm4)

wordcount4 <- colSums(dtm4.matrix)
toptwenty4 <- head(sort(wordcount4, decreasing=TRUE), 20)

dfplot4 <- as.data.frame(melt(toptwenty4))
dfplot4$word <- dimnames(dfplot4)[[1]]
dfplot4$word <- factor(dfplot4$word,
                       levels=dfplot4$word[order(dfplot4$value,
                                                 decreasing=TRUE)])

fig4 <- ggplot(dfplot4, aes(x=word, y=value)) + geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Quadruplets of words")+ ylab("Count") +
        theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1))



## plot with proportion on y-axis

count4 <- sum(sort(wordcount4, decreasing = T))
prop4 <- toptwenty4/count4
fig.prop4 <- ggplot(dfplot4, aes(x = word, y = prop4)) +  
        geom_bar(stat="identity", color = "blue", fill = "white") + xlab("Quadruplets of words")+ ylab("Proportion") +
        theme(axis.text.x = element_text(angle = 45, size = 10, hjust = 1)) +      
        scale_y_continuous(labels = percent)

## plot of both bar charts.
ggarrange(fig4, fig.prop4,
          ncol = 2)

And now …

Based on the sampled data analysed above, I will create a statistical model (based on N-grams / Markov models) to predict the next word. The model will be evaluated to make sure it strikes a good balance between accuracy and speed. This iterative process may require other techniques and new data sources.
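As a rough sketch of the direction I have in mind, a simple back-off lookup over the n-gram counts could work as shown below. This assumes the wordcount, wordcount2 and wordcount3 tables computed above are still in memory; the predict.word helper is a hypothetical illustration, not the final tuned model, and for the actual predictor the n-gram tables would probably be rebuilt without removing stop words.

# Hypothetical sketch of a back-off next-word lookup using the n-gram counts above
predict.word <- function(phrase) {
        w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

        # 1. try trigrams that start with the last two words of the phrase
        if (length(w) == 2) {
                pre  <- paste(w[1], w[2])
                hits <- wordcount3[startsWith(names(wordcount3), paste0(pre, " "))]
                if (length(hits) > 0)
                        return(sub(paste0("^", pre, " "), "", names(which.max(hits))))
        }

        # 2. back off to bigrams that start with the last word
        pre  <- tail(w, 1)
        hits <- wordcount2[startsWith(names(wordcount2), paste0(pre, " "))]
        if (length(hits) > 0)
                return(sub(paste0("^", pre, " "), "", names(which.max(hits))))

        # 3. final fallback: the single most frequent word overall
        names(which.max(wordcount))
}

predict.word("thanks for the")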