Overview

This document serves as a milestone report for the Data Science Specialization Capstone project, showing how the data is read, presenting some exploratory analysis, and outlining the avenues to pursue next.

Reading and quick summary of the data

The first step is to read the data. We will focus on the English language files, which are the ones we can understand. The data comes from the URL provided in the course but, once downloaded, is stored outside the repository to avoid crowding it on GitHub. In terms of words, all three files provide roughly the same amount; however, Twitter has many more lines, so the words-per-line figures reflect that as well.
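
The download step itself is not shown in this report. A minimal sketch, assuming the Coursera-SwiftKey archive URL and a ../final_first folder outside the repository (both of which are assumptions to adjust to the actual setup):

# sketch only: URL and destination folder are assumptions, not taken from this report
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "../Coursera-SwiftKey.zip"

if (!file.exists(zip_file)) {
  download.file(data_url, destfile = zip_file, mode = "wb")
}

# extract outside the repository so the large text files are never committed;
# the en_US files are then expected under ../final_first/en_US/ (move or rename if needed)
unzip(zip_file, exdir = "../final_first")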

library(stringi)
library(kableExtra)

# read a text file line by line (UTF-8, skipping embedded nuls)
read_file <- function(file_name) {
  con <- file(file_name, open = "r")
  data <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  return(data)
}

# convert a size in bytes to megabytes
calculate_mbytes <- function(bytes) {
  return(bytes / 1024 ^ 2)
}

file_paths <- c("../final_first/en_US/en_US.blogs.txt", "../final_first/en_US/en_US.twitter.txt", "../final_first/en_US/en_US.news.txt")
file_contents <- sapply(file_paths, read_file)

# per-file statistics: size in MB, line count, word count and character count
file_size <- sapply(sapply(file_paths, file.info)['size',], calculate_mbytes)
file_lines <- sapply(file_contents, length)
file_words <- sapply(file_contents, stri_stats_latex)[4,]
file_chars <- sapply(sapply(file_contents, nchar), sum)
file_length <- sapply(file_contents, length)  # same as file_lines, reused as the sampling population size below

# words per line, reused for the histograms and the sample comparison below
file_wpl <- lapply(file_contents, function(x) stri_count_words(x))
file_wpl_mean <- sapply(file_wpl, mean)

summary_stats <- data.frame(size = file_size, lines = file_lines,
                            words = file_words, chars = file_chars, words_per_line = file_wpl_mean)

kable(summary_stats,
      row.names = TRUE,
      align = c("l", rep("r", 5)),
      caption = "") %>% kable_styling(position = "left")
file                                     size (MB)     lines      words       chars  words_per_line
../final_first/en_US/en_US.blogs.txt      200.4242    899288   37570839   206824505        41.75170
../final_first/en_US/en_US.twitter.txt    159.3641   2360148   30451170   162096241        12.75203
../final_first/en_US/en_US.news.txt       196.2775   1010206   34493122   203214543        34.40996

Words per line distribution

As the summary in the last section suggested, the histograms confirm that the Twitter data has a clearer and more predictable distribution, mostly within 40 words per line, while the blogs and news data, although also skewed towards lower values, have lines with a very large number of words.

par(mfrow = c(length(file_paths),1))

# Loop through file paths and create plots
for (file_path in file_paths) {
  hist(file_wpl[[file_path]], breaks = 50, main = file_path, xlab = 'Words per line', ylab = 'Occurrences')
}

# Reset the plot layout to default after plotting (optional)
par(mfrow = c(1, 1))

Sampling the data

To make the process more efficient, we will work with 1% of the data, sampling lines from each file at that rate. As the table below shows, the words-per-line figures very closely follow the distribution we had in the original data, which is encouraging.

sample_rate <- 0.01
set.seed(2222)

# draw a random sample of lines at the chosen rate and strip non-ASCII characters
sample_text <- function(data, size) {
  sampled_data <- sample(data, floor(size * sample_rate), replace = FALSE)
  sampled_data <- iconv(sampled_data, "latin1", "ASCII", sub = "")
  return(sampled_data)
}

sampled_data <- mapply(sample_text, file_contents, file_length)

sampleDataFileName <- "../final_first/en_US/en_US.sample_data.txt"
con <- file(sampleDataFileName, open = "w")
for(e in sampled_data) {
  writeLines(e, con)
}
close(con)

sample_lines <- sapply(sampled_data, length)
sample_words <- sapply(sapply(sampled_data, stri_count_words), sum)

sample_summary <- data.frame(lines=sample_lines, words=sample_words)
sample_summary$words_per_line <- sample_summary$words / sample_summary$lines

kable(sample_summary,
      row.names = TRUE,
      align = c("l", rep("r", 3)),
      caption = "") %>% kable_styling(position = "left")
file                                     lines    words  words_per_line
../final_first/en_US/en_US.blogs.txt      8992   374922        41.69506
../final_first/en_US/en_US.twitter.txt   23601   300092        12.71522
../final_first/en_US/en_US.news.txt      10102   345991        34.24975
# get number of lines and words from the sample data set
sampleDataLines <- sum(sapply(sampled_data,length))
print(paste('Total sampled lines: ',sampleDataLines))
## [1] "Total sampled lines:  42695"
sampleDataWords <- sum(stri_count_words(sampled_data))
print(paste('Total Sampled words: ', sampleDataWords))
## [1] "Total Sampled words:  1021008"

Offensive language

To use the data set for training, we first have to remove offensive words from the sampled text. For that we use the list from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words, which was downloaded into a txt file. That repository contains lists for different languages, so the en.txt file is used for English; it contains a total of 556 words.

badWordsFileName <- "../final_first/en_US/en.txt"
con <- file(badWordsFileName, open = "r")
badWords <- readLines(con)
badWords <- iconv(badWords, "latin1", "ASCII", sub = "")
close(con)
print("Total bad words listed ... ")
## [1] "Total bad words listed ... "
print(sum(stri_count_words(badWords)))
## [1] 556

Building the corpus

We need to group the documents into a corpus, which R can do via the tm package. We first create the corpus with VCorpus, then apply a series of transformations: removing URLs, Twitter handles and e-mail addresses, removing the offensive words collected above, converting the text to lowercase, and removing English stop words, punctuation, numbers and extra whitespace.

Finally, we convert everything back to plain text documents, to be stored in RDS and txt format for later use in the apps, but not before showing a few lines of the consolidated data.

library(tm)
## Loading required package: NLP
dataSet <- sampled_data

docs <- VCorpus(VectorSource(dataSet))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# remove internet formats: URLs, Twitter handles and e-mail addresses
docs <- tm_map(docs, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
docs <- tm_map(docs, toSpace, "@\\S+")
docs <- tm_map(docs, toSpace, "\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b")

# remove bad words from the sample data set
docs <- tm_map(docs, removeWords, badWords)

# convert to lowercase and remove stop words, punctuation and numbers.
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, PlainTextDocument)

corpus <- docs

# save the corpus file for later use
saveRDS(corpus, file = "../final_first/en_US/en_US.corpus.rds")

# save the corpus as a plain text file
corpusText <- data.frame(text = unlist(sapply(corpus, '[', "content")), stringsAsFactors = FALSE)
con <- file("../final_first/en_US/en_US.corpus.txt", open = "w")
writeLines(corpusText$text, con)
close(con)

# show a few lines of the sample
kable(head(corpusText$text, 10),
      row.names = FALSE,
      col.names = NULL,
      align = c("l"),
      caption = "First 10 Documents") %>% kable_styling(position = "left")
First 10 Documents
discomfort grows many levels across country bitching moaning started yet still blind direction real threat coming
step use pins necessary get desired effect bow can add pins side bows make feel secure place use hairspray smooth bow
r reflux one meal turns two
necklace thrifted
ready ive done work great story dialogue crackling cant launch world now
different aspects including quotes table rep one rep two
will able judge taste education enthusiasms background person whose home enter day ever comes browse bookshelves will happen judging person books keep well ask look kindle index can
according sting tail article linda stewart todays belfast telegraph doe spokesman said rps undertook survey found evidence bee location survey may accurate considered independent
oh happy day
filled heart within

Exploratory Data Analysis

On the sample data we can now look at specific word frequencies. For that we will use the TermDocumentMatrix functionality in tm, which gives us the frequency of every word found in each document. Besides an ordinary bar plot, which shows the exact frequencies, we can also use a word cloud, which highlights the N most frequent words.

library(wordcloud)
## Loading required package: RColorBrewer
library(RColorBrewer)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordFreq <- data.frame(word = names(freq), freq = freq)

# plot the top 10 most frequent words
g <- ggplot(wordFreq[1:10,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("blue"))
g <- g + geom_text(aes(label = freq), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Word Frequencies")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 0.5, vjust = 0.5, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("10 Most Frequent Words")
print(g)

# construct word cloud
suppressWarnings (
    wordcloud(words = wordFreq$word,
              freq = wordFreq$freq,
              min.freq = 1,
              max.words = 100,
              random.order = FALSE,
              rot.per = 0.35, 
              colors=brewer.pal(8, "Dark2"))
)

Tokenization

The RWeka package gives us the ability to tokenize the data, so we set up three separate tokenizer functions, one each for unigrams, bigrams and trigrams, to build and plot the corresponding n-grams.
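
The tokenizer definitions themselves do not appear in the text above; a sketch of what the later chunks assume, using RWeka's NGramTokenizer:

library(RWeka)

# n-gram tokenizers, passed to TermDocumentMatrix through its control argument
unigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))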

Unigrams

For unigrams, we pass the tokenizer to the TermDocumentMatrix function through its control argument and then add up the occurrences of each unigram across all documents. From there we can plot the most frequent unigrams. It should be no surprise that the most frequent unigrams are exactly the most frequent words shown above; we simply plot a few more of them.

# create term document matrix for the corpus
unigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = unigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
unigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(unigramMatrix, 0.99))), decreasing = TRUE)
unigramMatrixFreq <- data.frame(word = names(unigramMatrixFreq), freq = unigramMatrixFreq)

# generate plot
g <- ggplot(unigramMatrixFreq[1:25,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("blue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("25 Most Common Unigrams")
print(g)

Bigrams

For bigrams we look at two-word combinations and how frequent they are, so we follow the same procedure described in the previous section and plot the 25 most common bigrams.

# create term document matrix for the corpus
bigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = bigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
bigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(bigramMatrix, 0.99))), decreasing = TRUE)
bigramMatrixFreq <- data.frame(word = names(bigramMatrixFreq), freq = bigramMatrixFreq)

# generate plot
g <- ggplot(bigramMatrixFreq[1:25,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("blue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("25 Most Common Bigrams")
print(g)

Trigrams

For trigrams we follow the same procedure as for bigrams, except that we look at three-word combinations.

# create term document matrix for the corpus
trigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = trigramTokenizer))

# eliminate sparse terms for each n-gram and get frequencies of most common n-grams
trigramMatrixFreq <- sort(rowSums(as.matrix(removeSparseTerms(trigramMatrix, 0.99))), decreasing = TRUE)
trigramMatrixFreq <- data.frame(word = names(trigramMatrixFreq), freq = trigramMatrixFreq)

# generate plot
g <- ggplot(trigramMatrixFreq[1:20,], aes(x = reorder(word, -freq), y = freq))
g <- g + geom_bar(stat = "identity", fill = I("blue"))
g <- g + geom_text(aes(label = freq ), vjust = -0.20, size = 3)
g <- g + xlab("")
g <- g + ylab("Frequency")
g <- g + theme(plot.title = element_text(size = 14, hjust = 0.5, vjust = 0.5),
               axis.text.x = element_text(hjust = 1.0, angle = 45),
               axis.text.y = element_text(hjust = 0.5, vjust = 0.5))
g <- g + ggtitle("20 Most Common Trigrams")
print(g)

Next steps

The reason we are looking at n-grams, and at how the data is stored in the corpus, is to arrive at a matrix that combines unigrams, bigrams and trigrams (or a similar structure), so that a model based on probability distributions can estimate at any given time which word is likely to come next, via the identified n-gram combinations.

The main idea for the model is that once the user types a word, we look it up in that matrix and return the continuation with the highest probability as the suggested next word. We can use a greedy approach that always picks the highest probability, or add a parameter that introduces some randomness so that suggestions vary a little while still making sense.
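
As a minimal sketch of that lookup, here is a version that uses the bigram frequencies computed above as a stand-in for the probability matrix (the predict_next function and its randomness option are illustrative assumptions, not the final model):

# sketch only: greedy next-word lookup over the bigram frequency table built above
predict_next <- function(word, bigram_freq, top_n = 3, random = FALSE) {
  parts <- strsplit(as.character(bigram_freq$word), " ")
  first <- sapply(parts, `[`, 1)
  second <- sapply(parts, `[`, 2)
  candidates <- data.frame(next_word = second[first == word],
                           freq = bigram_freq$freq[first == word],
                           stringsAsFactors = FALSE)
  if (nrow(candidates) == 0) return(NA_character_)
  candidates <- candidates[order(-candidates$freq), ]
  if (random) {
    # pick among the top candidates, weighted by frequency, instead of always
    # returning the single most frequent continuation
    top <- head(candidates, top_n)
    return(sample(top$next_word, 1, prob = top$freq))
  }
  candidates$next_word[1]
}

# example: most likely continuation of a given word, if present in the table
predict_next("happy", bigramMatrixFreq)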

The other interesting point from this work is that we can use some of the 25 most common n-grams obtained in the unigram, bigram and trigram sections to test the model once it has been built.