This document is the milestone report for Week 2 of the Coursera Data Science Capstone project. It presents the results of an exploratory data analysis of natural language data from three sources: news, blogs, and tweets.

DATA PROCESSING

The data to be used is downloaded from the provided URL and stored locally.
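As a minimal sketch of this step (the dataset URL and file names below are assumptions based on the standard Coursera-SwiftKey archive, not taken from this report):

# Download and unpack the capstone dataset (assumed URL; adjust paths as needed)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) {
  download.file(url, destfile = zip_file, mode = "wb")
}
if (!dir.exists("final")) {
  unzip(zip_file)  # creates final/en_US/ with the blogs, news and twitter files
}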

Exploratory analysis

Read data

suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(RColorBrewer)))
suppressWarnings(suppressMessages(library(wordcloud)))
suppressWarnings(suppressMessages(library(textcat)))
suppressWarnings(suppressMessages(library(tidytext)))
suppressWarnings(suppressMessages(library(janeaustenr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(quanteda)))

file1<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.blogs.txt"
file2<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.news.txt"
file3<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.twitter.txt"

blogs<-readLines(file1,warn=FALSE,encoding="UTF-8")
news<-readLines(file2,warn=FALSE,encoding="UTF-8")
twitter<-readLines(file3,warn=FALSE,encoding="UTF-8")

Summary statistics

Next we generate an overview table for metrics that describe the loaded files.

summaries_data<-data.frame(file_name=c("blogs","news","twitter"))
summaries_data$size<-sapply(list(blogs,news,twitter),function(x) format(object.size(x),"MB"))
summaries_data$lines<-sapply(list(blogs,news,twitter),function(x) length(x))
summaries_data$max_line<-sapply(list(blogs,news,twitter),function(x) max(nchar(x)))
summaries_data
##   file_name     size   lines max_line
## 1     blogs 255.4 Mb  899288    40833
## 2      news  19.8 Mb   77259     5760
## 3   twitter   319 Mb 2360148      140

Data Sampling

For sampling, we draw roughly 1% of the lines of each file; reproducibility is ensured by setting a seed.

set.seed(123)
n=length(blogs)
lines_read<-rbinom(n,1,prob=0.01)
text_11<-blogs[(lines_read==TRUE)]
n=length(news)
lines_read<-rbinom(n,1,prob=0.01)
text_12<-news[(lines_read==TRUE)]
n=length(twitter)
lines_read<-rbinom(n,1,prob=0.01)
text_13<-twitter[(lines_read==TRUE)]

Combining data and removing unneeded variables

text_1<-c(text_11,text_12,text_13) # concatenate the three samples into one character vector
rm(twitter,news,blogs,text_11,text_12,text_13)

Text-Mining

Create corpus & prepare data

After creating the corpus with the tm package, we apply several of its transformations to clean and streamline the data (lower-casing, removing numbers and extra whitespace, and filtering profanity).

dataset_1<-iconv(text_1,"latin1","ASCII","")
corpus <- VCorpus(VectorSource(dataset_1)) # create corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # transform to lower case
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove white spaces
curse_words<-readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
corpus <- tm_map(corpus, removeWords, c(curse_words)) #remove curse_words

dataset_final<-data.frame(text_column=sapply(corpus, as.character), stringsAsFactors = FALSE)

rm(corpus,curse_words,dataset_1,text_1)  # remove variables that are no longer needed

Tokenization

Next, we use the tidytext package to generate the n-grams: unigrams (one_grams), bigrams (two_grams), and trigrams (three_grams).

one_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=1)
one_grams_2<-subset(one_grams_1,nchar(one_grams_1$ngram)>1)
frequency<-table(one_grams_2$ngram)
one_grams<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
one_grams$history<-""
one_grams$word<-one_grams$ngram
one_grams$ngram_length<-1

rm(one_grams_1,one_grams_2)

two_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=2)
frequency<-table(two_grams_1$ngram)
two_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
two_grams<-two_grams_2 %>% separate(ngram,c("history","word"),sep=" ")
two_grams$ngram<-two_grams_2$ngram
two_grams$ngram_length<-2

rm(two_grams_1,frequency,two_grams_2)

three_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=3)
frequency<-table(three_grams_1$ngram)
three_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
three_grams_3<-three_grams_2 %>% separate(ngram,c("history1","history2","word"),sep=" ")
three_grams_3$history<-paste(three_grams_3$history1,three_grams_3$history2,sep=" ")
three_grams<-subset(three_grams_3,select=c("history","word"))
three_grams$ngram<-three_grams_2$ngram
three_grams$frequency<-three_grams_2$frequency
three_grams$ngram_length<-3

rm(three_grams_1,three_grams_2,three_grams_3,frequency)

ORDER BY FREQUENCY DESCENDING

Now we order the one_grams, two_grams and three_grams tables by frequency in descending order.

one_grams<-one_grams[order(-one_grams$frequency),]
two_grams<-two_grams[order(-two_grams$frequency),]
three_grams<-three_grams[order(-three_grams$frequency),]

Plots for unigrams

unigrams <- head(one_grams, 20)  # subset of the most frequent unigrams (20 chosen here for a readable plot)
unigramsplot <- ggplot(data=unigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
  geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
unigramsplot <- unigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Unigrams - Most Frequently Used Words")
unigramsplot

A word cloud is also a good way to visualize unigram frequencies:

unigrams_50<-head(one_grams, 50)
wordcloud(unigrams_50$ngram,unigrams_50$frequency,scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))

Plots for bigrams

bigrams <- head(two_grams, 20)  # subset of the most frequent bigrams (20 chosen here for a readable plot)
bigramsplot <- ggplot(data=bigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
  geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
bigramsplot <- bigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Bigrams - Most Frequently Used Words")
bigramsplot

A word cloud is also a good way to visualize bigram frequencies:

bigrams_50<-head(two_grams, 50)
wordcloud(bigrams_50$ngram,bigrams_50$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))

Plots for trigrams

trigrams <- head(three_grams, 20)  # subset of the most frequent trigrams (20 chosen here for a readable plot)
trigramsplot <- ggplot(data=trigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
  geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
trigramsplot <- trigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
  labs(title = "Trigrams - Most Frequently Used Words")
trigramsplot

A word cloud is also a good way to visualize trigram frequencies:

trigrams_20<-head(three_grams, 20)
wordcloud(trigrams_20$ngram,trigrams_20$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, :
## some of the could not be fit on page. It will not be plotted.
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, : it
## was a could not be fit on page. It will not be plotted.

Word Coverage

We now examine how word coverage of the data set increases as we add unigrams from the most frequent to the least frequent.

Percentile

one_grams$percentil<-cumsum(one_grams$frequency)/sum(one_grams$frequency)
plot(1:nrow(one_grams),one_grams$percentil,xlab="Number of Words",ylab="Percentage of Coverage",main="Word Coverage vs. Top-Occurring Unigrams Added")

Now we determine the minimum number of top words needed to achieve 50%, 90% and 100% coverage:

one_grams_50_percent_coverage<-nrow(one_grams[one_grams$percentil<0.5,])
one_grams_50_percent_coverage
## [1] 167
one_grams_90_percent_coverage<-nrow(one_grams[one_grams$percentil<0.9,])
one_grams_90_percent_coverage
## [1] 6122
one_grams_100_percent_coverage<-nrow(one_grams)
one_grams_100_percent_coverage
## [1] 41716

According to these computations, we need 167 words to achieve 50% coverage, 6122 words for 90% coverage, and 41716 words for 100% coverage.

Language Detection

We now run language detection on the unigrams using the textcat package and show the top 10 detected languages by unigram count.

one_grams$language<-textcat(one_grams$ngram)
languages<-table(one_grams$language)
languages_df<-data.frame(language=names(languages),frequency=as.integer(languages))
languages_df<-languages_df[order(-languages_df$frequency),]
languages_df$percentage<-100*languages_df$frequency/sum(languages_df$frequency)
head(languages_df,10)
##     language frequency percentage
## 11   english      5542  13.316354
## 15    french      2435   5.850834
## 36     scots      2197   5.278966
## 9     danish      2139   5.139603
## 23     latin      2006   4.820030
## 34 rumantsch      1689   4.058340
## 27      manx      1672   4.017492
## 33  romanian      1600   3.844490
## 46   tagalog      1547   3.717142
## 6    catalan      1243   2.986688

TASK 3

Question 1) How can you efficiently store an n-gram model (think Markov Chains)?

A Markov chain is a stochastic process that fulfils the so-called Markov property (sometimes also referred to as memorylessness): the future state of the process, conditional on both past and present states, depends only on the present state, not on the sequence of events that preceded it. Speech and written language are natural application cases, since an n-gram model makes exactly this assumption by approximating P(wi | w1, …, wi-1) with, for example, P(wi | wi-2, wi-1) in the trigram case. Besides this simplification, the assumption also makes the model smaller and more efficient to apply.

I will store the n-grams in a data frame with the following columns:

  • history : the preceding words wi-1, wi-2, …
  • word : the word to predict, wi
  • ngram_length : 1 = unigram, 2 = bigram, 3 = trigram
  • ngram : the whole string (history + word)
  • frequency : number of times this n-gram appears in the training set

If the resulting data frame becomes very large, we can shrink it by keeping only the most frequent n-grams needed for 50% or 90% word coverage.
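As an illustration of how such a table could be queried for prediction, here is a minimal sketch (the combined ngram_table and the predict_from_table helper are assumptions, not code from this report):

# Stack the three n-gram tables into one lookup table with a common column order
ngram_table <- rbind(
  one_grams[, c("history", "word", "ngram", "frequency", "ngram_length")],
  two_grams[, c("history", "word", "ngram", "frequency", "ngram_length")],
  three_grams[, c("history", "word", "ngram", "frequency", "ngram_length")]
)

# Return the most frequent observed continuation of a given history
predict_from_table <- function(history, table = ngram_table) {
  candidates <- table[table$history == history, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word[which.max(candidates$frequency)]
}

predict_from_table("thanks for")  # e.g. looks up the trigram table via its two-word history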

Question 2) How can you use the knowledge about word frequencies to make your model smaller and more efficient?

I will use word frequencies to keep only the n-grams needed for 50% or 90% word coverage, which makes the model smaller and more efficient.
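A possible sketch of such frequency-based pruning, reusing the cumulative-coverage idea from the unigram analysis above (the prune_by_coverage helper and the 90% threshold are assumptions):

# Keep only the n-grams needed to reach a given share of the observed occurrences
prune_by_coverage <- function(grams, coverage = 0.9) {
  grams <- grams[order(-grams$frequency), ]
  cum_cover <- cumsum(grams$frequency) / sum(grams$frequency)
  grams[cum_cover <= coverage, ]
}

one_grams_small   <- prune_by_coverage(one_grams)
two_grams_small   <- prune_by_coverage(two_grams)
three_grams_small <- prune_by_coverage(three_grams)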

Question 3) How many parameters do you need (i.e. how big is n in your n-gram model)?

I will use three types of n-grams (unigrams, bigrams and trigrams), with the structure explained in Question 1, so n = 3.

Question 4) Can you think of simple ways to “smooth” the probabilities (think about giving all n-grams a non-zero probability even if they aren’t observed in the data)?

This question seems to point to the so-called hidden Markov model (HMM). An HMM is one where the rules producing the chain are not known, i.e. “hidden”. These rules involve two probabilities:

  • the probability of a certain observation
  • the probability of a certain state transition, given the state of the model at a certain time

Typical application cases of HMMs are reinforcement learning and temporal pattern recognition (speech recognition, part-of-speech tagging, etc.).
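As a concrete illustration of the simple smoothing idea the question hints at, here is a minimal add-one (Laplace) sketch built on the bigram table above; the laplace_bigram_prob helper is an assumption, not part of this report:

# Add-one (Laplace) smoothed probability of a word following a given one-word history,
# using the bigram table; V is the vocabulary size (number of distinct unigrams)
laplace_bigram_prob <- function(history, word, bigrams = two_grams, V = nrow(one_grams)) {
  count_bigram  <- sum(bigrams$frequency[bigrams$history == history & bigrams$word == word])
  count_history <- sum(bigrams$frequency[bigrams$history == history])
  (count_bigram + 1) / (count_history + V)
}

laplace_bigram_prob("of", "the")    # observed bigram: relatively high probability
laplace_bigram_prob("of", "zzzzz")  # unseen bigram: small but non-zero probability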

Question 5) How do you evaluate whether your model is any good?

I will split the data set into a training set and a test set, build the model on the training data, and then evaluate it on the test set to estimate its prediction accuracy.
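A possible sketch of such an evaluation, assuming an 80/20 split of the cleaned sample and reusing the predict_from_table lookup sketched in Question 1 (both assumptions):

# Split the cleaned sample into training and test lines (80/20 split; an assumed choice)
set.seed(123)
train_idx  <- sample(seq_len(nrow(dataset_final)), size = floor(0.8 * nrow(dataset_final)))
train_data <- dataset_final[train_idx, , drop = FALSE]
test_data  <- dataset_final[-train_idx, , drop = FALSE]

# Build trigrams from the held-out lines; the n-gram tables themselves would be rebuilt from train_data only
test_trigrams <- unnest_tokens(test_data, ngram, text_column, token = "ngrams", n = 3)
test_trigrams <- separate(test_trigrams, ngram, c("h1", "h2", "target"), sep = " ")
test_trigrams$history <- paste(test_trigrams$h1, test_trigrams$h2)

# Top-1 accuracy of the table lookup on the held-out trigrams
predictions <- sapply(test_trigrams$history, predict_from_table)
accuracy <- mean(predictions == test_trigrams$target, na.rm = TRUE)
accuracy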

Question 6) How can you use backoff models to estimate the probability of unobserved n-grams?

Yes; I think Katz back-off modelling is a suitable approach for estimating the probability of unobserved n-grams.
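A minimal sketch of a simple back-off lookup over the three tables built above (a “stupid back-off”-style fallback rather than a full Katz implementation; the backoff_predict helper is an assumption):

# Back off from the trigram history to the bigram history to the most frequent unigram
backoff_predict <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "\\s+"))
  n <- length(words)

  # 1. Try the trigram table with the last two words as history
  if (n >= 2) {
    hits <- three_grams[three_grams$history == paste(words[n - 1], words[n]), ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  }
  # 2. Back off to the bigram table with the last word as history
  if (n >= 1) {
    hits <- two_grams[two_grams$history == words[n], ]
    if (nrow(hits) > 0) return(hits$word[which.max(hits$frequency)])
  }
  # 3. Back off to the single most frequent unigram overall
  one_grams$word[which.max(one_grams$frequency)]
}

backoff_predict("thanks for the")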