This document is the milestone report for Week 2 of the Coursera Data Science Capstone project. It presents the results of an exploratory data analysis of natural language data from three sources, i.e. news, blogs, and tweets.
The data to be used is downloaded from the provided URL and stored locally.
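A minimal sketch of this step (the URL is the SwiftKey dataset link distributed with the course; the destination file name is an illustrative choice and should be adjusted to your setup):
# Download and unzip the course dataset once; skip if already present.
url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "Coursera-SwiftKey.zip"
if (!file.exists(dest)) {
  download.file(url, destfile = dest, mode = "wb")
  unzip(dest)  # extracts the final/en_US/*.txt files
}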
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(RColorBrewer)))
suppressWarnings(suppressMessages(library(wordcloud)))
suppressWarnings(suppressMessages(library(textcat)))
suppressWarnings(suppressMessages(library(tidytext)))
suppressWarnings(suppressMessages(library(janeaustenr)))
suppressWarnings(suppressMessages(library(dplyr)))
suppressWarnings(suppressMessages(library(tidyr)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(stringr)))
suppressWarnings(suppressMessages(library(quanteda)))
file1<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.blogs.txt"
file2<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.news.txt"
file3<-"C:/Users/eoscrea/Desktop/deeplearning/Data_Science/module_10/final/en_us/en_US.twitter.txt"
blogs<-readLines(file1,warn=FALSE,encoding="UTF-8")
news<-readLines(file2,warn=FALSE,encoding="UTF-8")
twitter<-readLines(file3,warn=FALSE,encoding="UTF-8")
Next we generate an overview table for metrics that describe the loaded files.
summaries_data<-data.frame(file_name=c("blogs","news","twitter"))
summaries_data$size<-sapply(list(blogs,news,twitter),function(x) format(object.size(x),"MB"))
summaries_data$lines<-sapply(list(blogs,news,twitter),length)
summaries_data$max_line<-sapply(list(blogs,news,twitter),function(x) max(nchar(x)))
summaries_data
## file_name size lines max_line
## 1 blogs 255.4 Mb 899288 40833
## 2 news 19.8 Mb 77259 5760
## 3 twitter 319 Mb 2360148 140
For the sampling we use 1% of the lines in each file; reproducibility is ensured by setting a seed.
set.seed(123)
n=length(blogs)
lines_read<-rbinom(n,1,prob=0.01)
text_11<-blogs[(lines_read==TRUE)]
n=length(news)
lines_read<-rbinom(n,1,prob=0.01)
text_12<-news[(lines_read==TRUE)]
n=length(twitter)
lines_read<-rbinom(n,1,prob=0.01)
text_13<-twitter[(lines_read==TRUE)]
text_1<-c(text_11,text_12,text_13) # concatenate the three samples into one character vector
rm(twitter,news,blogs,text_11,text_12,text_13)
After creating a corpus with the tm package, we apply several of its transformations to streamline and clean the data.
dataset_1<-iconv(text_1,"latin1","ASCII","")
corpus <- VCorpus(VectorSource(dataset_1)) # create corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # transform to lower case
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, stripWhitespace) # remove white spaces
curse_words<-readLines("https://raw.githubusercontent.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en")
corpus <- tm_map(corpus, removeWords, c(curse_words)) #remove curse_words
dataset_final<-data.frame(text_column=sapply(corpus, as.character), stringsAsFactors = FALSE)
rm(corpus,curse_words,dataset_1,text_1) # remove objects no longer needed
Next we apply the tidytext package to generate the n-grams, i.e. unigrams, bigrams, and trigrams (stored as one_grams, two_grams, and three_grams).
one_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=1)
one_grams_2<-subset(one_grams_1,nchar(one_grams_1$ngram)>1)
frequency<-table(one_grams_2$ngram)
one_grams<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
one_grams$history<-""
one_grams$word<-one_grams$ngram
one_grams$ngram_length<-1
rm(one_grams_1,one_grams_2)
two_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=2)
frequency<-table(two_grams_1$ngram)
two_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
two_grams<-two_grams_2 %>% separate(ngram,c("history","word"),sep=" ")
two_grams$ngram<-two_grams_2$ngram
two_grams$ngram_length<-2
rm(two_grams_1,frequency,two_grams_2)
three_grams_1<-unnest_tokens(dataset_final,ngram,text_column,token="ngrams",n=3)
frequency<-table(three_grams_1$ngram)
three_grams_2<-data.frame(ngram=names(frequency),frequency=as.integer(frequency))
three_grams_3<-three_grams_2 %>% separate(ngram,c("history1","history2","word"),sep=" ")
three_grams_3$history<-paste(three_grams_3$history1,three_grams_3$history2,sep=" ")
three_grams<-subset(three_grams_3,select=c("history","word"))
three_grams$ngram<-three_grams_2$ngram
three_grams$frequency<-three_grams_2$frequency
three_grams$ngram_length<-3
rm(three_grams_1,three_grams_2,three_grams_3,frequency)
Now we order the unigrams, bigrams, and trigrams by descending frequency.
one_grams<-one_grams[order(-one_grams$frequency),]
two_grams<-two_grams[order(-two_grams$frequency),]
three_grams<-three_grams[order(-three_grams$frequency),]
unigrams<-head(one_grams, 20)
unigrams
## ngram frequency history word ngram_length
## 36724 the 104337 the 1
## 37271 to 56889 to 1
## 1295 and 53982 and 1
## 25603 of 44386 of 1
## 18019 in 34570 in 1
## 36700 that 23222 that 1
## 13894 for 22392 for 1
## 18849 is 21429 is 1
## 18924 it 18443 it 1
## 25770 on 16225 on 1
## 40691 with 15288 with 1
## 41463 you 14849 you 1
## 39923 was 13342 was 1
## 36908 this 11653 this 1
## 2146 at 11268 at 1
## 1967 as 11159 as 1
## 24367 my 11090 my 1
## 3080 be 10921 be 1
## 5086 but 10537 but 1
## 1788 are 10488 are 1
unigramsplot <-ggplot(data=unigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
unigramsplot <- unigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Unigrams - Most Frequently Used Words")
unigramsplot
unigrams_50<-head(one_grams, 50)
wordcloud(unigrams_50$ngram,unigrams_50$frequency,scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
bigrams<-head(two_grams, 20)
bigrams
## history word frequency ngram ngram_length
## 215235 of the 9955 of the 2
## 152026 in the 8999 in the 2
## 320904 to the 4598 to the 2
## 219044 on the 4175 on the 2
## 113147 for the 4020 for the 2
## 318134 to be 3214 to be 2
## 30101 at the 3012 at the 2
## 20627 and the 2758 and the 2
## 149970 in a 2569 in a 2
## 161095 it was 2245 it was 2
## 156938 is a 2164 is a 2
## 117285 from the 2118 from the 2
## 352241 with the 2062 with the 2
## 111426 for a 1985 for a 2
## 160386 it is 1800 it is 2
## 349245 will be 1777 will be 2
## 147557 i was 1726 i was 2
## 211800 of a 1713 of a 2
## 146402 i am 1706 i am 2
## 53050 by the 1686 by the 2
bigramsplot <-ggplot(data=bigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
bigramsplot <- bigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Bigrams - Most Frequently Used Words")
bigramsplot
bigrams_50<-head(two_grams, 50)
wordcloud(bigrams_50$ngram,bigrams_50$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
trigrams<-head(three_grams, 20)
trigrams
## history word ngram frequency ngram_length
## 6513 a lot of a lot of 745 3
## 398098 one of the one of the 726 3
## 569781 to be a to be a 442 3
## 291750 it was a it was a 430 3
## 487494 some of the some of the 355 3
## 211379 going to be going to be 328 3
## 60086 as well as as well as 321 3
## 528844 the end of the end of 297 3
## 530143 the first time the first time 272 3
## 281118 is going to is going to 271 3
## 408190 out of the out of the 270 3
## 414147 part of the part of the 260 3
## 259098 i want to i want to 258 3
## 266807 in front of in front of 245 3
## 350446 most of the most of the 245 3
## 72665 be able to be able to 241 3
## 515273 thanks for the thanks for the 233 3
## 271356 in the world in the world 228 3
## 659189 you want to you want to 223 3
## 97003 but it was but it was 220 3
trigramsplot <-ggplot(data=trigrams, aes(x=reorder(ngram, frequency), y=frequency)) +
geom_bar(stat="identity", color="blue", fill="white") + theme_minimal()
trigramsplot <- trigramsplot + coord_flip() + xlab("Words or Terms") + ylab("Frequency") +
labs(title = "Trigrams - Most Frequently Used Words")
trigramsplot
trigrams_20<-head(three_grams, 20)
wordcloud(trigrams_20$ngram,trigrams_20$frequency, scale=c(5,0.5), colors = brewer.pal(8, "Dark2"))
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, :
## some of the could not be fit on page. It will not be plotted.
## Warning in wordcloud(trigrams_20$ngram, trigrams_20$frequency, scale = c(5, : it
## was a could not be fit on page. It will not be plotted.
Next we examine how coverage of the dataset increases as unigrams are added from the most frequent to the least frequent.
one_grams$percentil<-cumsum(one_grams$frequency)/sum(one_grams$frequency)
plot(1:nrow(one_grams),one_grams$percentil,xlab="Number of Words",ylab="Percentage of Coverage",main="Unigram Coverage vs. Number of Top Words Added")
Now we determine the minimum number of top words needed to achieve 50%, 90%, and 100% coverage:
one_grams_50_percent_coverage<-nrow(one_grams[one_grams$percentil<0.5,])
one_grams_50_percent_coverage
## [1] 167
one_grams_90_percent_coverage<-nrow(one_grams[one_grams$percentil<0.9,])
one_grams_90_percent_coverage
## [1] 6122
one_grams_100_percent_coverage<-nrow(one_grams)
one_grams_100_percent_coverage
## [1] 41716
According to these computations, we need 167 words to achieve 50% coverage, 6122 words for 90%, and 41716 for 100% coverage.
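As a side note, a small illustrative helper (using the percentil column computed above) generalizes this lookup to any coverage level; which(... >= p)[1] returns the first word at which the threshold is actually reached:
# Minimum number of top words needed to reach cumulative coverage p (illustrative helper).
words_for_coverage <- function(p) which(one_grams$percentil >= p)[1]
words_for_coverage(0.5)   # minimum words for 50% coverage
words_for_coverage(0.9)   # minimum words for 90% coverage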
Next we run language detection on the unigrams using the textcat package and show the top 10 detected languages by unigram count. Note that textcat is unreliable on isolated single words, which is why only a minority of these (mostly English) unigrams are classified as English.
one_grams$language<-textcat(one_grams$ngram)
languages<-table(one_grams$language)
languages_df<-data.frame(language=names(languages),frequency=as.integer(languages))
languages_df<-languages_df[order(-languages_df$frequency),]
languages_df$percentage<-100*languages_df$frequency/sum(languages_df$frequency)
head(languages_df,10)
## language frequency percentage
## 11 english 5542 13.316354
## 15 french 2435 5.850834
## 36 scots 2197 5.278966
## 9 danish 2139 5.139603
## 23 latin 2006 4.820030
## 34 rumantsch 1689 4.058340
## 27 manx 1672 4.017492
## 33 romanian 1600 3.844490
## 46 tagalog 1547 3.717142
## 6 catalan 1243 2.986688
A Markov chain is a stochastic process that fulfils the so-called Markov property (sometimes also referred to as memorylessness): conditional on both past and present states, the future state of the process depends only on the present state, not on the sequence of events that preceded it. Speech and written language are suitable application cases for Markov chains; in addition, this assumption reduces the size of the model and increases its efficiency.
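Under a first-order Markov assumption, next-word prediction reduces to a lookup in the bigram table built above: treat the current word as the history and return its most frequent continuation. A minimal sketch (the function name is illustrative):
# Predict the next word from the current one using the bigram table
# (first-order Markov assumption: only the present word matters).
predict_next <- function(current_word) {
  candidates <- two_grams[two_grams$history == current_word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word[which.max(candidates$frequency)]
}
predict_next("in")   # expected to return "the" given the bigram counts above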
I will store the n-grams in a data frame with the columns defined above: history, word, ngram, frequency, and ngram_length.
If the resulting table is very large, I can reduce it by keeping only the most frequent words needed for 50% or 90% coverage, using the word frequencies; this makes the model smaller and more efficient.
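An illustrative sketch of this pruning step (the helper name is mine; it re-sorts by frequency internally and keeps only the rows needed to reach the target coverage):
# Keep only the most frequent n-grams needed for the target coverage (default 90%).
prune_to_coverage <- function(grams, p = 0.9) {
  grams <- grams[order(-grams$frequency), ]
  keep  <- cumsum(grams$frequency) / sum(grams$frequency) <= p
  grams[keep, ]
}
one_grams_small   <- prune_to_coverage(one_grams)
two_grams_small   <- prune_to_coverage(two_grams)
three_grams_small <- prune_to_coverage(three_grams)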
I will use three types of n-grams (unigrams, bigrams, and trigrams) with the structure explained in question 1.
This question seems to point to a so-called Hidden Markov Model (HMM). An HMM is one where the rules producing the chain are not known, i.e. "hidden". These rules involve two probabilities: the probability of a certain observation, and the probability of a certain state transition given the state of the model at a certain time. Typical applications of HMMs are in reinforcement learning and temporal pattern recognition (speech recognition, part-of-speech tagging, etc.).
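As a toy illustration only (the states, words, and probabilities below are invented for the example and are not part of the capstone pipeline), the two probability tables and the forward computation of an observation sequence's likelihood can be written in base R:
# Toy HMM: two hidden states and two observations, illustrating the two
# probability tables described above (all numbers assumed for illustration).
states     <- c("Noun", "Verb")
transition <- matrix(c(0.3, 0.7,    # P(next state | current state)
                       0.6, 0.4),
                     nrow = 2, byrow = TRUE, dimnames = list(states, states))
emission   <- matrix(c(0.8, 0.2,    # P(observation | state)
                       0.1, 0.9),
                     nrow = 2, byrow = TRUE, dimnames = list(states, c("book", "run")))
initial    <- c(Noun = 0.5, Verb = 0.5)
# Forward algorithm: probability of observing the sequence c("book", "run").
obs   <- c("book", "run")
alpha <- initial * emission[, obs[1]]
for (t in 2:length(obs)) alpha <- as.vector(alpha %*% transition) * emission[, obs[t]]
sum(alpha)   # total probability of the observation sequence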
I will split the dataset into a training set and a test set, build the model from the training data, and evaluate its accuracy on the test set.
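A sketch of this split, assuming an 80/20 partition of the cleaned sample:
# Split the cleaned lines into 80% training and 20% test (illustrative split).
set.seed(123)
n_lines   <- nrow(dataset_final)
train_idx <- sample(seq_len(n_lines), size = floor(0.8 * n_lines))
train_set <- dataset_final[train_idx, , drop = FALSE]
test_set  <- dataset_final[-train_idx, , drop = FALSE]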
Yes; I think Katz back-off is a suitable model for estimating the probability of unobserved n-grams.
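Full Katz back-off also requires a discounting scheme (e.g. Good-Turing) to redistribute probability mass to unseen n-grams. As a simplified sketch of the back-off cascade itself (stupid-backoff style, not full Katz discounting; the helper name is mine), we can fall back from trigrams to bigrams to the top unigram:
# Simplified back-off lookup: try the trigram table first, then bigrams,
# then fall back to the most frequent unigram.
predict_backoff <- function(w1, w2) {
  hit <- three_grams[three_grams$history == paste(w1, w2), ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$frequency)])
  hit <- two_grams[two_grams$history == w2, ]
  if (nrow(hit) > 0) return(hit$word[which.max(hit$frequency)])
  one_grams$word[which.max(one_grams$frequency)]
}
predict_backoff("one", "of")   # expected "the" given the trigram counts above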