This document is the milestone report for Week 2 of the Coursera Data Science Capstone project. It presents the results of an exploratory data analysis of natural language data from three sources, i.e. news, blogs, and tweets.
The data to be used is downloaded from the provided URL and stored locally.
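A minimal sketch of this step (the URL is the SwiftKey dataset link distributed with the course; the destination file name is an illustrative choice and should be adjusted to your setup):
# Download and unzip the course dataset once; skip if already present.
url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "Coursera-SwiftKey.zip"
if (!file.exists(dest)) {
  download.file(url, destfile = dest, mode = "wb")
  unzip(dest)  # extracts the final/en_US/*.txt files
}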
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(RColorBrewer)))
suppressWarnings(suppressMessages(library(wordcloud)))
suppressWarnings(suppressMessages(library(textcat)))
suppressWarnings(suppressMessages(library(tidytext)))
suppressWarnings(suppressMessages(library(janeaustenr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(quanteda)))
file1<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.blogs.txt"
file2<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.news.txt"
file3<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.twitter.txt"
blogs<-readLines(file1,warn=FALSE,encoding="UTF-8")
news<-readLines(file2,warn=FALSE,encoding="UTF-8")
twitter<-readLines(file3,warn=FALSE,encoding="UTF-8")
Next we generate an overview table for metrics that describe the loaded files.
summaries_data<-data.frame(file_name=c("blogs","news","twitter"))
summaries_data$size<-sapply(list(blogs,news,twitter),function(x) format(object.size(x),"MB"))
summaries_data$lines<-sapply(list(blogs,news,twitter),length)
summaries_data$max_line<-sapply(list(blogs,news,twitter),function(x) max(nchar(x)))
summaries_data
## file_name size lines max_line
## 1 blogs 255.4 Mb 899288 40833
## 2 news 19.8 Mb 77259 5760
## 3 twitter 319 Mb 2360148 140
For the sampling we use 1% of the lines in each file; reproducibility is ensured by setting a seed.
set.seed(123)
n=length(blogs)
lines_read<-rbinom(n,1,prob=0.01)
text_11<-blogs[(lines_read==TRUE)]
n=length(news)
lines_read<-rbinom(n,1,prob=0.01)
text_12<-news[(lines_read==TRUE)]
n=length(twitter)
lines_read<-rbinom(n,1,prob=0.01)
text_13<-twitter[(lines_read==TRUE)]
text_1<-c(text_11,text_12,text_13) # concatenate the three samples into one character vector
rm(twitter,news,blogs,text_11,text_12,text_13)
After creating a corpus with the tm package, we apply several of its transformations to streamline and clean the data.
dataset_1<-iconv(text_1,"latin1","ASCII","")
corpus <- VCorpus(VectorSource(dataset_1)) # create corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # transform to lower case
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove white spaces
curse_words<-readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
corpus <- tm_map(corpus, removeWords, c(curse_words)) #remove curse_words
dataset_final<-data.frame(text_column=sapply(corpus, as.character), stringsAsFactors = FALSE)
rm(corpus,curse_words,dataset_1,text_1) # remove objects no longer needed
Next we apply the tidytext package to generate the n-grams, i.e. unigrams, bigrams, and trigrams (stored as one_grams, two_grams, and three_grams).
one_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=1)
one_grams_2<-subset(one_grams_1,nchar(one_grams_1$ngram)>1)
frequency<-table(one_grams_2$ngram)
one_grams<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
one_grams$history<-""
one_grams$word<-one_grams$ngram
one_grams$ngram_length<-1
rm(one_grams_1,one_grams_2)
two_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=2)
frequency<-table(two_grams_1$ngram)
two_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
two_grams<-two_grams_2 %>% separate(ngram,c("history","word"),sep=" ")
two_grams$ngram<-two_grams_2$ngram
two_grams$ngram_length<-2
rm(two_grams_1,frequency,two_grams_2)
three_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=3)
frequency<-table(three_grams_1$ngram)
three_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
three_grams_3<-three_grams_2 %>% separate(ngram,c("history1","history2","word"),sep=" ")
three_grams_3$history<-paste(three_grams_3$history1,three_grams_3$history2,sep=" ")
three_grams<-subset(three_grams_3,select=c("history","word"))
three_grams$ngram<-three_grams_2$ngram
three_grams$frequency<-three_grams_2$frequency
three_grams$ngram_length<-3
rm(three_grams_1,three_grams_2,three_grams_3,frequency)
Now we order the unigrams, bigrams, and trigrams by descending frequency.
one_grams<-one_grams[order(-one_grams$frequency),]
two_grams<-two_grams[order(-two_grams$frequency),]
three_grams<-three_grams[order(-three_grams$frequency),]
unigrams<-head(one_grams, 20)
unigrams
## ngram frequency history word ngram_length
## 36724 the 104337 the 1
## 37271 to 56889 to 1
## 1295 and 53982 and 1
## 25603 of 44386 of 1
## 18019 in 34570 in 1
## 36700 that 23222 that 1
## 13894 for 22392 for 1
## 18849 is 21429 is 1
## 18924 it 18443 it 1
## 25770 on 16225 on 1
## 40691 with 15288 with 1
## 41463 you 14849 you 1
## 39923 was 13342 was 1
## 36908 this 11653 this 1
## 2146 at 11268 at 1
## 1967 as 11159 as 1
## 24367 my 11090 my 1
## 3080 be 10921 be 1
## 5086 but 10537 but 1
## 1788 are 10488 are 1
unigramsplot <-ggplot(data=unigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
unigramsplot <- unigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Unigrams - Most Frequently Used Words")
unigramsplot
unigrams_50<-head(one_grams, 50)
wordcloud(unigrams_50$ngram,unigrams_50$frequency,scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
bigrams<-head(two_grams, 20)
bigrams
## history word frequency ngram ngram_length
## 215235 of the 9955 of the 2
## 152026 in the 8999 in the 2
## 320904 to the 4598 to the 2
## 219044 on the 4175 on the 2
## 113147 for the 4020 for the 2
## 318134 to be 3214 to be 2
## 30101 at the 3012 at the 2
## 20627 and the 2758 and the 2
## 149970 in a 2569 in a 2
## 161095 it was 2245 it was 2
## 156938 is a 2164 is a 2
## 117285 from the 2118 from the 2
## 352241 with the 2062 with the 2
## 111426 for a 1985 for a 2
## 160386 it is 1800 it is 2
## 349245 will be 1777 will be 2
## 147557 i was 1726 i was 2
## 211800 of a 1713 of a 2
## 146402 i am 1706 i am 2
## 53050 by the 1686 by the 2
bigramsplot <-ggplot(data=bigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
bigramsplot <- bigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Bigrams - Most Frequently Used Words")
bigramsplot
bigrams_50<-head(two_grams, 50)
wordcloud(bigrams_50$ngram,bigrams_50$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
trigrams<-head(three_grams, 20)
trigrams
## history word ngram frequency ngram_length
## 6513 a lot of a lot of 745 3
## 398098 one of the one of the 726 3
## 569781 to be a to be a 442 3
## 291750 it was a it was a 430 3
## 487494 some of the some of the 355 3
## 211379 going to be going to be 328 3
## 60086 as well as as well as 321 3
## 528844 the end of the end of 297 3
## 530143 the first time the first time 272 3
## 281118 is going to is going to 271 3
## 408190 out of the out of the 270 3
## 414147 part of the part of the 260 3
## 259098 i want to i want to 258 3
## 266807 in front of in front of 245 3
## 350446 most of the most of the 245 3
## 72665 be able to be able to 241 3
## 515273 thanks for the thanks for the 233 3
## 271356 in the world in the world 228 3
## 659189 you want to you want to 223 3
## 97003 but it was but it was 220 3
trigramsplot <-ggplot(data=trigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
trigramsplot <- trigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Trigrams - Most Frequently Used Words")
trigramsplot
trigrams_20<-head(three_grams, 20)
wordcloud(trigrams_20$ngram,trigrams_20$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, :
## some of the could not be fit on page. It will not be plotted.
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, : it
## was a could not be fit on page. It will not be plotted.
Next we examine how coverage of the dataset increases as unigrams are added from the most frequent to the least frequent.
one_grams$percentil<-cumsum(one_grams$frequency)/sum(one_grams$frequency)
plot(1:nrow(one_grams),one_grams$percentil,xlab="Number of Words",ylab="Percentage of Coverage",main="Unigram Coverage vs. Number of Top Words Added")
Now we determine the minimum number of top words needed to achieve 50%, 90%, and 100% coverage:
one_grams_50_percent_coverage<-nrow(one_grams[one_grams$percentil<0.5,])
one_grams_50_percent_coverage
## [1] 167
one_grams_90_percent_coverage<-nrow(one_grams[one_grams$percentil<0.9,])
one_grams_90_percent_coverage
## [1] 6122
one_grams_100_percent_coverage<-nrow(one_grams)
one_grams_100_percent_coverage
## [1] 41716
According to these computations, we need 167 words to achieve 50% coverage, 6122 words for 90%, and 41716 for 100% coverage.
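As a side note, a small illustrative helper (using the percentil column computed above) generalizes this lookup to any coverage level; which(... >= p)[1] returns the first word at which the threshold is actually reached:
# Minimum number of top words needed to reach cumulative coverage p (illustrative helper).
words_for_coverage <- function(p) which(one_grams$percentil >= p)[1]
words_for_coverage(0.5)   # minimum words for 50% coverage
words_for_coverage(0.9)   # minimum words for 90% coverage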
Next we run language detection on the unigrams using the textcat package and show the top 10 detected languages by unigram count. Note that textcat is unreliable on isolated single words, which is why only a minority of these (mostly English) unigrams are classified as English.
one_grams$language<-textcat(one_grams$ngram)
languages<-table(one_grams$language)
languages_df<-data.frame(language=names(languages),frequency=as.integer(languages))
languages_df<-languages_df[order(-languages_df$frequency),]
languages_df$percentage<-100*languages_df$frequency/sum(languages_df$frequency)
head(languages_df,10)
## language frequency percentage
## 11 english 5542 13.316354
## 15 french 2435 5.850834
## 36 scots 2197 5.278966
## 9 danish 2139 5.139603
## 23 latin 2006 4.820030
## 34 rumantsch 1689 4.058340
## 27 manx 1672 4.017492
## 33 romanian 1600 3.844490
## 46 tagalog 1547 3.717142
## 6 catalan 1243 2.986688
A Markov chain is a stochastic process that fulfils the so-called Markov property (sometimes also referred to as memorylessness): conditional on both past and present states, the future state of the process depends only on the present state, not on the sequence of events that preceded it. Speech and written language are suitable application cases for Markov chains; in addition, this assumption reduces the size of the model and increases its efficiency.
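Under a first-order Markov assumption, next-word prediction reduces to a lookup in the bigram table built above: treat the current word as the history and return its most frequent continuation. A minimal sketch (the function name is illustrative):
# Predict the next word from the current one using the bigram table
# (first-order Markov assumption: only the present word matters).
predict_next <- function(current_word) {
  candidates <- two_grams[two_grams$history == current_word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word[which.max(candidates$frequency)]
}
predict_next("in")   # expected to return "the" given the bigram counts above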
I will store the n-grams in a data frame with the columns defined above: history, word, ngram, frequency, and ngram_length.
If the resulting table is very large, I can reduce it by keeping only the most frequent words needed for 50% or 90% coverage, using the word frequencies; this makes the model smaller and more efficient.
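An illustrative sketch of this pruning step (the helper name is mine; it re-sorts by frequency internally and keeps only the rows needed to reach the target coverage):
# Keep only the most frequent n-grams needed for the target coverage (default 90%).
prune_to_coverage <- function(grams, p = 0.9) {
  grams <- grams[order(-grams$frequency), ]
  keep  <- cumsum(grams$frequency) / sum(grams$frequency) <= p
  grams[keep, ]
}
one_grams_small   <- prune_to_coverage(one_grams)
two_grams_small   <- prune_to_coverage(two_grams)
three_grams_small <- prune_to_coverage(three_grams)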
I will use three types of n-grams (unigrams, bigrams, and trigrams) with the structure explained in question 1.
This question seems to point to a so-called Hidden Markov Model (HMM). An HMM is one where the rules producing the chain are not known, i.e. "hidden". These rules involve two probabilities: the probability of a certain observation, and the probability of a certain state transition given the state of the model at a certain time. Typical applications of HMMs are in reinforcement learning and temporal pattern recognition (speech recognition, part-of-speech tagging, etc.).
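As a toy illustration only (the states, words, and probabilities below are invented for the example and are not part of the capstone pipeline), the two probability tables and the forward computation of an observation sequence's likelihood can be written in base R:
# Toy HMM: two hidden states and two observations, illustrating the two
# probability tables described above (all numbers assumed for illustration).
states     <- c("Noun", "Verb")
transition <- matrix(c(0.3, 0.7,    # P(next state | current state)
                       0.6, 0.4),
                     nrow = 2, byrow = TRUE, dimnames = list(states, states))
emission   <- matrix(c(0.8, 0.2,    # P(observation | state)
                       0.1, 0.9),
                     nrow = 2, byrow = TRUE, dimnames = list(states, c("book", "run")))
initial    <- c(Noun = 0.5, Verb = 0.5)
# Forward algorithm: probability of observing the sequence c("book", "run").
obs   <- c("book", "run")
alpha <- initial * emission[, obs[1]]
for (t in 2:length(obs)) alpha <- as.vector(alpha %*% transition) * emission[, obs[t]]
sum(alpha)   # total probability of the observation sequence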
I will split the dataset into a training set and a test set, build the model from the training data, and evaluate its accuracy on the test set.
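A sketch of this split, assuming an 80/20 partition of the cleaned sample:
# Split the cleaned lines into 80% training and 20% test (illustrative split).
set.seed(123)
n_lines   <- nrow(dataset_final)
train_idx <- sample(seq_len(n_lines), size = floor(0.8 * n_lines))
train_set <- dataset_final[train_idx, , drop = FALSE]
test_set  <- dataset_final[-train_idx, , drop = FALSE]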
Yes; I think Katz back-off is a suitable model for estimating the probability of unobserved n-grams.
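Full Katz back-off also requires a discounting scheme (e.g. Good-Turing) to redistribute probability mass to unseen n-grams. As a simplified sketch of the back-off cascade itself (stupid-backoff style, not full Katz discounting; the helper name is mine), we can fall back from trigrams to bigrams to the top unigram:
# Simplified back-off lookup: try the trigram table first, then bigrams,
# then fall back to the most frequent unigram.
predict_backoff <- function(w1, w2) {
  hit <- three_grams[three_grams$history == paste(w1, w2), ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$frequency)])
  hit <- two_grams[two_grams$history == w2, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$frequency)])
  one_grams$word[which.max(one_grams$frequency)]
}
predict_backoff("one", "of")   # expected "the" given the trigram counts above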