Project Capstone

This project is an assignment in the Coursera Data Science Specialization Capstone course. The aim of this document is to explore the data and answer three predefined questions:

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Additionally, this document explains the proposed methodology for solving the final task of the specialization, which is predicting the next word based on the words the user has entered.

First, I analyzed basic statistics of the three full datasets.

Basic stats of the blogs dataset

sum(nchar(blogs))
## [1] 206824506
length(blogs)
## [1] 899288

Basic stats of the news dataset

sum(nchar(news))
## [1] 15639408
length(news)
## [1] 77259

Basic stats of the tweets dataset

sum(nchar(tweets))
## [1] 162096031
length(tweets)
## [1] 2360148

Reading a sample of data from each file for exploratory data analysis

Next, I read 15,000 lines from each of the datasets. I tried performing the analysis with the full datasets, but I was not able to calculate the document-term matrices for 2- and 3-grams. I will have to solve this issue in the future, for example by using a bigger machine in the Azure environment.

blogs <- readLines("final/en_US/en_US.blogs.txt", 15000, encoding = "UTF-8")
news <- readLines("final/en_US/en_US.news.txt", 15000, encoding = "UTF-8")
tweets <- readLines("final/en_US/en_US.twitter.txt", 15000, encoding = "UTF-8")

sample_text <- c(blogs, news, tweets)
# Convert to ASCII; lines containing non-convertible characters become NA
sample_text <- iconv(sample_text, 'UTF-8', 'ASCII')

# Drop the lines that could not be converted
sample_text <- sample_text[!is.na(sample_text)]
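
Since only the first 15,000 lines of each file are used above, a random sample of lines would be less biased. A minimal sketch of that alternative (not used in this report):

# Possible alternative: read each full file and keep 15,000 random lines per source
set.seed(42)
blogs_all  <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news_all   <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
tweets_all <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
blogs  <- sample(blogs_all, 15000)
news   <- sample(news_all, 15000)
tweets <- sample(tweets_all, 15000)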

Creating the corpus and initial data clean-up

In this step I created a corpus containing all the sampled lines, with each line treated as a separate document. Then I performed some basic transformations of the corpus:

  - removing punctuation
  - removing numbers
  - changing all letters to lowercase
  - removing stopwords like “a” and “the”
  - stripping whitespace

I also considered stemming the words, but this technique fits document clustering better than next-word prediction, as it leads to situations where, for example, “police” is stemmed to “polic”.

library(tm)
## Loading required package: NLP
sample_corpus <- VCorpus(VectorSource(sample_text))
sample_corpus[[1]]$content
## [1] "We love you Mr. Brown."
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus[[1]]$content
## [1] "We love you Mr Brown"
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus[[1]]$content
## [1] "We love you Mr Brown"
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus[[1]]$content
## [1] "we love you mr brown"
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("english"))
sample_corpus[[1]]$content
## [1] " love  mr brown"
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
sample_corpus[[1]]$content
## [1] " love mr brown"
#sample_corpus <- tm_map(sample_corpus, stemDocument)
#sample_corpus[[1]]$content
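
To illustrate the stemming issue mentioned above, a quick check with tm's stemDocument (the function commented out in the chunk above) shows how the word gets truncated:

# Quick check of the stemmer on a single word: "police" becomes "polic",
# which is not useful as a prediction target.
stemDocument("police")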

library(RWeka)  # provides NGramTokenizer, used below for 2- and 3-gram tokenization

Creating the document-term matrices

In this step I created document-term matrices for 1-, 2- and 3-grams and removed rarely occurring terms from the matrices to save RAM.

sample_corpus.dtm <- DocumentTermMatrix(sample_corpus)

#as.matrix(sample_corpus.dtm)
TwoGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
sample_corpus2.dtm <- DocumentTermMatrix(sample_corpus, control = list(tokenize = TwoGramTokenizer))

#as.matrix(sample_corpus2.dtm)

ThreeGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
sample_corpus3.dtm <- DocumentTermMatrix(sample_corpus, control = list(tokenize = ThreeGramTokenizer))

#as.matrix(sample_corpus3.dtm)

#rm(list=ls()[-which(ls() %in% c("sample_corpus.dtm"))])
#gc()
cat("Making the matrices sparse")
## Making the matrices sparse
sample_corpus.dtms <- removeSparseTerms(sample_corpus.dtm, 0.999)
sample_corpus2.dtms <- removeSparseTerms(sample_corpus2.dtm, 0.9999)
sample_corpus3.dtms <- removeSparseTerms(sample_corpus3.dtm, 0.9999)
#########
freq_gram <- colSums(as.matrix(sample_corpus.dtms))   
ord_gram <- order(freq_gram, decreasing = T) 

freq_2gram <- colSums(as.matrix(sample_corpus2.dtms))   
ord_2gram <- order(freq_2gram, decreasing = T) 

freq_3gram <- colSums(as.matrix(sample_corpus3.dtms))   
ord_3gram <- order(freq_3gram, decreasing = T) 

Next, I performed some initial data exploration. Below are examples of tasks like analysing word frequencies, the frequency of frequencies, word associations, etc.

Most frequently occurring words

freq_gram[head(ord_gram)]
## said will  one just  can like 
## 3430 2996 2658 2548 2292 2204

Frequency of frequencies

head(table(freq_gram), 20) 
## freq_gram
## 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 
## 13 19 28 31 39 42 39 40 35 46 28 38 45 35 22 42 30 34 34 27

Associations with the most frequent term

findAssocs(sample_corpus.dtm, "said", 0.07)
## $said
##      police   spokesman   officials   president spokeswoman   statement 
##        0.11        0.10        0.09        0.09        0.08        0.08 
##    director 
##        0.07
#spokesman
#findAssocs(sample_corpus.dtm, "spokesman", 0.07)

#install.packages("wordcloud")
library(wordcloud)
## Loading required package: RColorBrewer

Wordcloud - Most frequent words

wordcloud(names(freq_gram), freq_gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

data<-data.frame("names"=names(freq_gram), "freq"=freq_gram)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data<-data %>% 
    arrange(desc(freq)) %>% 
    top_n(20)
## Selecting by freq
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
names(data)
## [1] "names" "freq"
ggplot(data = data)+
    geom_bar(aes(x = names, y=freq), stat = "identity")
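
As a possible refinement (not applied above), the bars could be ordered by decreasing frequency:

# Possible refinement: sort the bars by decreasing frequency
ggplot(data = data) +
    geom_col(aes(x = reorder(names, -freq), y = freq)) +
    labs(x = "word", y = "frequency")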

Wordcloud - Most frequent word pairs

wordcloud(names(freq_2gram), freq_2gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))

Wordcloud - Most frequent groups of 3 words

wordcloud(names(freq_3gram), freq_3gram, max.words = 50, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))


How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

# Column 3 holds each word's relative frequency; the loops below accumulate it
sort_one_freq <- data.frame("Word" = names(freq_gram[ord_gram]),
                            "Freq" = freq_gram[ord_gram],
                            "Rel_Freq" = freq_gram[ord_gram] / sum(freq_gram[ord_gram]))

cum_freq = 0
words_counted = 0
while(cum_freq<0.50){
    cum_freq <- cum_freq + sort_one_freq[(words_counted+1),3]
    words_counted <- words_counted + 1
}
print(paste("In order to represent 50% of the words we need: ", words_counted, " unique words.", sep = "" ))
## [1] "In order to represent 50% of the words we need: 337 unique words."
cum_freq = 0
words_counted = 0
while(cum_freq<0.90){
    cum_freq <- cum_freq + sort_one_freq[(words_counted+1),3]
    words_counted <- words_counted + 1
}
print(paste("In order to represent 90% of the words we need: ", words_counted, " unique words.", sep = "" ))
## [1] "In order to represent 90% of the words we need: 1566 unique words."

Design of the app

The initial idea is to use the n-gram tables to suggest the next word to the user. For example, if the user enters a single word, I will find the matching rows in the 2-gram table and propose the following words, starting with the one that has the highest frequency.
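
A minimal sketch of this lookup, based on the freq_2gram vector computed above (predict_next is a hypothetical helper name, not part of the final app):

# Hypothetical helper: given one word, return the most frequent following words
# according to the 2-gram frequencies computed above.
predict_next <- function(word, n = 3) {
    # 2-gram names have the form "word1 word2"; keep the ones starting with the input word
    matches <- freq_2gram[grepl(paste0("^", tolower(word), " "), names(freq_2gram))]
    if (length(matches) == 0) return(character(0))
    matches <- sort(matches, decreasing = TRUE)
    # the second token of each of the top n matching 2-grams is the proposed next word
    sapply(strsplit(names(head(matches, n)), " "), `[`, 2)
}

predict_next("love")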