This project is an assignment in the Coursera Data Science Specialization Capstone course. The aim of this document is to explore the data and answer three predefined questions.
Additionally, this project will explain the proposed methodology for solving the final task of the specialization, which is predicting the next word based on the words entered so far.
First, I analyzed basic statistics of the three datasets: the total number of characters and the number of lines in each.
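The statistics below were computed on the full files; a minimal sketch of that loading step, which is assumed here but not shown in this report:
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
tweets <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")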
sum(nchar(blogs))
## [1] 206824506
length(blogs)
## [1] 899288
sum(nchar(news))
## [1] 15639408
length(news)
## [1] 77259
sum(nchar(tweets))
## [1] 162096031
length(tweets)
## [1] 2360148
Next I read 15,000 lines from each of the datasets. I tried performing the analysis on the full data but was not able to compute the document-term matrices for 2- and 3-grams. I will have to solve this issue in the future; I will try using a bigger machine in the Azure environment.
blogs <- readLines("final/en_US/en_US.blogs.txt", 15000, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", 15000, encoding = "UTF-8")
tweets <- readLines("final/en_US/en_US.twitter.txt", 15000, encoding = "UTF-8")
sample_text <- c(blogs, news, tweets)
sample_text <- iconv(sample_text, 'UTF-8', 'ASCII')
sample_text<-sample_text[!is.na(sample_text)]
In this step I created a corpus containing all the sampled lines, treating each line as a separate document. Then I performed some basic transformations on the corpus:
- removing punctuation
- removing numbers
- changing all letters to lowercase
- removing stopwords such as "a" and "the"
- stripping whitespace
I also considered stemming the words, but this technique fits document clustering better than next-word prediction, as it leads to situations where, for example, "police" is stemmed to "polic".
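For illustration, a minimal sketch of that effect using the SnowballC package (which tm's stemDocument relies on):
library(SnowballC)
wordStem(c("police", "running"), language = "english")
## [1] "polic" "run"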
library(tm)
## Loading required package: NLP
sample_corpus <- VCorpus(VectorSource(sample_text))
sample_corpus[[1]]$content
## [1] "We love you Mr. Brown."
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus[[1]]$content
## [1] "We love you Mr Brown"
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus[[1]]$content
## [1] "We love you Mr Brown"
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus[[1]]$content
## [1] "we love you mr brown"
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("english"))
sample_corpus[[1]]$content
## [1] " love mr brown"
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus[[1]]$content
## [1] " love mr brown"
#sample_corpus <- tm_map(sample_corpus, stemDocument)
#sample_corpus[[1]]$content
library(RWeka)
In this step I created document-term matrices for 1-, 2-, and 3-grams and removed very sparse terms to save RAM.
sample_corpus.dtm <- DocumentTermMatrix(sample_corpus)
#as.matrix(sample_corpus.dtm)
TwoGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
sample_corpus2.dtm <- DocumentTermMatrix(sample_corpus, control = list(tokenize = TwoGramTokenizer))
#as.matrix(sample_corpus2.dtm)
ThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
sample_corpus3.dtm <- DocumentTermMatrix(sample_corpus, control = list(tokenize = ThreeGramTokenizer))
#as.matrix(sample_corpus3.dtm)
#rm(list=ls()[-which(ls() %in% c("sample_corpus.dtm"))])
#gc()
cat("Making the matrices sparse")
## Making the matrices sparse
sample_corpus.dtms <- removeSparseTerms(sample_corpus.dtm, 0.999)
sample_corpus2.dtms <- removeSparseTerms(sample_corpus2.dtm, 0.9999)
sample_corpus3.dtms <- removeSparseTerms(sample_corpus3.dtm, 0.9999)
#########
freq_gram <- colSums(as.matrix(sample_corpus.dtms))
ord_gram <- order(freq_gram, decreasing = T)
freq_2gram <- colSums(as.matrix(sample_corpus2.dtms))
ord_2gram <- order(freq_2gram, decreasing = T)
freq_3gram <- colSums(as.matrix(sample_corpus3.dtms))
ord_3gram <- order(freq_3gram, decreasing = T)
Next I perform some initial data exploration. Below are examples of tasks such as analyzing the frequency of frequencies, word associations, etc.
freq_gram[head(ord_gram)]
## said will one just can like
## 3430 2996 2658 2548 2292 2204
head(table(freq_gram), 20)
## freq_gram
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
## 13 19 28 31 39 42 39 40 35 46 28 38 45 35 22 42 30 34 34 27
findAssocs(sample_corpus.dtm, "said", 0.07)
## $said
## police spokesman officials president spokeswoman statement
## 0.11 0.10 0.09 0.09 0.08 0.08
## director
## 0.07
#spokesman
#findAssocs(sample_corpus.dtm, "spokesman", 0.07)
#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer
wordcloud(names(freq_gram), freq_gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
data<-data.frame("names"=names(freq_gram), "freq"=freq_gram)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data<-data %>%
arrange(desc(freq)) %>%
top_n(20)
## Selecting by freq
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
names(data)
## [1] "names" "freq"
ggplot(data = data)+
geom_bar(aes(x = names, y=freq), stat = "identity")
wordcloud(names(freq_2gram), freq_2gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
wordcloud(names(freq_3gram), freq_3gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
##################
sort_one_freq<-data.frame("Word"=names(freq_gram[ord_gram]), "Freq"=freq_gram[ord_gram], "Cum_Freq"=freq_gram[ord_gram]/sum(freq_gram[ord_gram]))
cum_freq = 0
words_counted = 0
while(cum_freq<0.50){
cum_freq <- cum_freq + sort_one_freq[(words_counted+1),3]
words_counted <- words_counted + 1
}
print(paste("In order to represent 50% of the words we need: ", words_counted, " unique words.", sep = "" ))
## [1] "In order to represent 50% of the words we need: 337 unique words."
cum_freq = 0
words_counted = 0
while(cum_freq<0.90){
cum_freq <- cum_freq + sort_one_freq[(words_counted+1),3]
words_counted <- words_counted + 1
}
print(paste("In order to represent 90% of the words we need: ", words_counted, " unique words.", sep = "" ))
## [1] "In order to represent 90% of the words we need: 1566 unique words."
The initial idea is to use the n-gram tables to propose a next word to the user. For example, if the user enters a word, I will find the matching rows in the 2-gram table and propose next words, starting from the one with the highest frequency.
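A minimal sketch of that lookup, using the freq_2gram table built above (the helper name predict_next is my own, illustrative choice):
predict_next <- function(word, n = 3) {
  # keep bigrams whose first token is the entered word
  matches <- freq_2gram[startsWith(names(freq_2gram), paste0(tolower(word), " "))]
  if (length(matches) == 0) return(character(0))
  # order by frequency and return the second token of the top bigrams
  top <- head(sort(matches, decreasing = TRUE), n)
  sapply(strsplit(names(top), " "), `[`, 2)
}
predict_next("love")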