Synopsis

The goal of the Capstone Project is to build a prediction model that can help us input the next word on mobile devices. This goal is split into several tasks. This is the Capstone project of the Johns Hopkins University Data Science course.

Goal of this document

The current task in the Capstone Project is to understand the distribution of and relationships between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and to prepare to build your first linguistic models.

Tasks to accomplish

  1. Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and the relationship between words in the corpora.
  2. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.

Questions to consider

  1. Some words are more frequent than others - what are the distributions of word frequencies?
  2. What are the frequencies of 2-grams and 3-grams in the dataset?
  3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
  4. How do you evaluate how many of the words come from foreign languages?
  5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Data source

The data can be downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. This document uses the three files under the ru_RU folder, which contain raw text from blog, news and Twitter sources.
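The download step itself is not shown in this report; a minimal sketch, assuming the archive is fetched into the current working directory, could be:

# Sketch (assumption): download and unpack the Coursera-SwiftKey archive
# if it is not already present in the working directory.
zip.url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip.file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip.file)) download.file(zip.url, zip.file, mode = "wb")
if (!dir.exists("final")) unzip(zip.file) # extracts final/ru_RU, final/en_US, ...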

Getting and Cleaning the Data

Let's start with the initial data.
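The loading code is likewise not shown in the original report. Below is a minimal sketch of reading the three ru_RU files and building the combined working set income.text; the file paths and the 150,000-line sample size are assumptions (the latter is inferred from the summary table further down).

library(stringi) # stri_count() is used in the word counts below

# Sketch (assumption): read the three Russian source files
news    <- readLines("final/ru_RU/ru_RU.news.txt",    encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
blogs   <- readLines("final/ru_RU/ru_RU.blogs.txt",   encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines("final/ru_RU/ru_RU.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

# Sketch (assumption): draw a 150,000-line working sample from all three sources
set.seed(20160320)
income.text <- sample(c(news, blogs, twitter), 150000)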

Compute some basic statistics for the loaded files:

news.length <- length(news) # number of lines in news
news.words <- sum(stri_count(news, regex="\\S+")) # number of words in news
blogs.length <- length(blogs) # number of lines in blogs
blogs.words <- sum(stri_count(blogs, regex="\\S+")) # number of words in blogs
twitter.length <- length(twitter) # number of lines in twitter
twitter.words <- sum(stri_count(twitter, regex="\\S+")) # number of words in twitter

total.length <- length(income.text) # number of lines in the combined working set
total.words <- sum(stri_count(income.text, regex="\\S+")) # number of words in the combined working set

initial.data <- matrix(c(news.length, news.words), nrow = 2, ncol = 1, byrow = F)
initial.data <- as.data.frame(initial.data) 
row.names(initial.data) <- c("Lines #", "Words #") # row names
initial.data <- cbind(initial.data, 
                      c(blogs.length, blogs.words), 
                      c(twitter.length, twitter.words), 
                      c(total.length, total.words)) # collect into one place
names(initial.data) <- c("News", "Blogs", "Twitter", "Total") # column names

pandoc.table(format(initial.data, decimal.mark=",", big.mark=".", small.mark="."), 
             keep.line.breaks = TRUE, style = 'simple') # show the counts in a neat table
## 
## 
##    &nbsp;       News      Blogs    Twitter    Total  
## ------------- --------- --------- --------- ---------
##  **Lines #**   196.360   337.100   881.414   150.000 
##  **Words #**  9.115.829 9.405.378 9.223.835 6.966.428

So in the total working set we have about 7.0 million words across 150 thousand lines of Russian text.

 rm(news, blogs, twitter) # free the memory used by the separate sources

Sentence detection and data set preparation

All the initial data are loaded and we are ready to split the text into sentences. This process takes about 30 minutes [on my computer].

library(doParallel) # parallel back end for foreach (also attaches foreach and parallel)

clusters <- makeCluster(detectCores())
registerDoParallel(clusters)
k <- 6 # lines of text processed per foreach task
income.text.length <- length(income.text)
j <- trunc(income.text.length/k) # number of chunks
if (k*j < income.text.length) {j <- j+1}
income.text.sent <- foreach(i = 1:j, .combine = c) %dopar% {
  i.start <- (i-1)*k+1
  i.end <- (i*k)
  if (i.end > income.text.length) {i.end <- income.text.length}
  qdap::sent_detect(income.text[i.start:i.end], language = "ru", model = NULL)
}
stopCluster(clusters)
rm(income.text)

Now we have split the text into sentences.
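The summary below was presumably produced with something like the following sketch (the original chunk is hidden in the report):

# Sketch (assumption): count sentences and words in the sentence-split text
paste0(length(income.text.sent), " lines and ",
       sum(stri_count(income.text.sent, regex = "\\S+")), " words.")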

## [1] "490313 lines and 6869681 words."

Let’s remove punctuation and transform the text to lower case. The estimated duration of this process is about 12 minutes [on my computer].

library(tm) # text-mining framework: VCorpus, tm_map and the transformations below

badwords <- readLines("bad.words.rus.txt", warn = F) # profanity list compiled from
                                                     # http://www.russki-mat.net/e/mat_slovar.htm
                                                     # (no ready-made text file was available)
income.text_corpus <- VCorpus(VectorSource(income.text.sent)) # main corpus built from the sentences
income.text_corpus <- tm_map(income.text_corpus, removeNumbers) # remove numbers
income.text_corpus <- tm_map(income.text_corpus, stripWhitespace) # collapse extra whitespace
income.text_corpus <- tm_map(income.text_corpus, tolower) # transform to lower case
                                                          # (newer tm versions require content_transformer(tolower))
income.text_corpus <- tm_map(income.text_corpus, removeWords, badwords) # remove profanities
income.text_corpus <- tm_map(income.text_corpus, removePunctuation) # remove punctuation
rm(income.text.sent, badwords)

Now we are ready to perform tokenization and compute one-word frequencies. I expect this procedure to take about 20 minutes [on my computer].

# one word part ---------------------------------------------
library(RWeka) # NGramTokenizer / Weka_control

k <- 500 # corpus documents processed per chunk
income.text_corpus.length <- length(income.text_corpus)
j <- trunc(income.text_corpus.length/k) # number of chunks
if (k*j < income.text_corpus.length) {j <- j+1}
Flag <- 0
for (m in 1:j) {
    m.start <- (m-1)*k+1
    m.end <- (m*k)
    if (m.end > income.text_corpus.length) {m.end <- income.text_corpus.length}
    One.word.m <- NGramTokenizer(income.text_corpus[m.start:m.end], 
                                 Weka_control(min = 1, max = 1, 
                                 # delimiters: whitespace, punctuation, Latin letters and digits
                                 # (some non-ASCII delimiter characters were garbled in the original)
                                 delimiters = " \\r\\n\\t.,;:?!\"()<>[A-Za-z0-9]"))
    if (Flag==0) {One.word <- One.word.m; Flag <- 1} else {One.word <- c(One.word, One.word.m)} # accumulate chunks
}
One.word <- data.frame(table(One.word)) # token frequency table (columns One.word, Freq)
One.word[ ,1] <- as.character(One.word[ ,1])
One.word <- One.word[order(One.word$One.word),] # sort alphabetically

… two-word frequencies (20 minutes more)

# two word part ---------------------------------------------
k <- 500 # corpus documents processed per chunk
income.text_corpus.length <- length(income.text_corpus)
j <- trunc(income.text_corpus.length/k) # number of chunks
if (k*j < income.text_corpus.length) {j <- j+1}
Flag <- 0
for (m in 1:j) {
    m.start <- (m-1)*k+1
    m.end <- (m*k)
    if (m.end > income.text_corpus.length) {m.end <- income.text_corpus.length}
    Two.word.m <- NGramTokenizer(income.text_corpus[m.start:m.end], 
                                 Weka_control(min = 2, max = 2, 
                                 # same delimiter set as for the one-word case
                                 delimiters = " \\r\\n\\t.,;:?!\"()<>[A-Za-z0-9]"))
    if (Flag==0) {Two.word <- Two.word.m; Flag <- 1} else {Two.word <- c(Two.word, Two.word.m)} # accumulate chunks
}
Two.word <- data.frame(table(Two.word)) # bigram frequency table (columns Two.word, Freq)
Two.word[ ,1] <- as.character(Two.word[ ,1])
Two.word <- Two.word[order(Two.word$Two.word),] # sort alphabetically

and three-word frequencies (and the next 20 minutes)

# three word part ---------------------------------------------
k <- 500 # corpus documents processed per chunk
income.text_corpus.length <- length(income.text_corpus)
j <- trunc(income.text_corpus.length/k) # number of chunks
if (k*j < income.text_corpus.length) {j <- j+1}
Flag <- 0
for (m in 1:j) {
    m.start <- (m-1)*k+1
    m.end <- (m*k)
    if (m.end > income.text_corpus.length) {m.end <- income.text_corpus.length}
    Three.word.m <- NGramTokenizer(income.text_corpus[m.start:m.end], 
                                 Weka_control(min = 3, max = 3, 
                                 # same delimiter set as for the one-word case
                                 delimiters = " \\r\\n\\t.,;:?!\"()<>[A-Za-z0-9]"))
    if (Flag==0) {Three.word <- Three.word.m; Flag <- 1} else {Three.word <- c(Three.word, Three.word.m)} # accumulate chunks
}
Three.word <- data.frame(table(Three.word)) # trigram frequency table (columns Three.word, Freq)
Three.word[ ,1] <- as.character(Three.word[ ,1])
Three.word <- Three.word[order(Three.word$Three.word),] # sort alphabetically

Data saving

To reuse the final version of the data set later, let's save it to disk:

saveRDS(income.text_corpus, file="Task2.Full.edited.text.rds")
saveRDS(One.word, file="Task2.One.word.Data.Set.Rev.1.rds")
saveRDS(Two.word, file="Task2.Two.word.Data.Set.Rev.1.rds")
saveRDS(Three.word, file="Task2.Three.word.Data.Set.Rev.1.rds")
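These objects can later be restored without repeating the tokenization, for example:

# restore a previously saved frequency table
One.word <- readRDS("Task2.One.word.Data.Set.Rev.1.rds")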

Exploratory Analysis

So our data set is ready and we can explore it. I will analyze the text in three forms: one-word, two-word, and three-word n-grams, extracting from each set the 45 records with the highest frequencies.

Question 1: One word

library(ggplot2)

One.word <- One.word[order(-One.word$Freq), ] # sort by frequency, descending
names(One.word) <- c('One.Gram', 'Frequency')
ggplot(One.word[1:45, ], aes(x=reorder(One.Gram, -Frequency), y=Frequency)) + # bars ordered by frequency
    geom_bar(stat="identity", fill="blue") +
    geom_text(aes(label=Frequency), vjust=-0.20) + 
    labs(x="One.Gram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question 2: Two word

Two.word <- Two.word[order(-Two.word$Freq), ] # sort by frequency, descending
names(Two.word) <- c('Two.Gram', 'Frequency')

ggplot(Two.word[1:45, ], aes(x=reorder(Two.Gram, -Frequency), y=Frequency)) + # bars ordered by frequency
    geom_bar(stat="identity", fill="blue") +
    geom_text(aes(label=Frequency), vjust=-0.20) + 
    labs(x="Two.Gram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question 2: Three word

Three.word <- Three.word[order(-Three.word$Freq), ] # sort by frequency, descending
names(Three.word) <- c('Three.Gram', 'Frequency')

ggplot(Three.word[1:45, ], aes(x=reorder(Three.Gram, -Frequency), y=Frequency)) + # bars ordered by frequency
    geom_bar(stat="identity", fill="blue") +
    geom_text(aes(label=Frequency), vjust=-0.20) + 
    labs(x="Three.Gram") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question 3

  1. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
One.word <- One.word[order(-One.word$Frequency), ] # sort by frequency (the column was renamed above)
words.sum <- sum(One.word$Frequency)
A50 <- 0; FLAG <- FALSE; i <- 1 # loop preparation
while (!FLAG) {
  A50 <- A50 + One.word$Frequency[i]/words.sum # cumulative share of word instances covered
  i <- i+1; FLAG <- A50>0.5
}
paste0("We need ", i, " words to cover 50% of all word instances in the existing text.")
## [1] "We need 418 words to cover 50% of all word instances in the existing text."
A50 <- 0; FLAG <- FALSE; i <- 1 # loop preparation
while (!FLAG) {
  A50 <- A50 + One.word$Frequency[i]/words.sum
  i <- i+1; FLAG <- A50>0.9
}
paste0("We need ", i, " words to cover 90% of all word instances in the existing text.")
## [1] "We need 2082 words to cover 90% of all word instances in the existing text."

Question 4

  1. How do you evaluate how many of the words come from foreign languages? I am not sure this can be evaluated reliably from this data set alone. The tokenizer above already uses Latin letters as delimiters, so Latin-script foreign words are largely excluded; for the remaining tokens one could check them against a Russian dictionary, as sketched below.
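A minimal sketch of this check, assuming the hunspell package and an installed ru_RU dictionary (neither is part of the original analysis):

# Sketch (assumption): flag unique tokens that a Russian spell-checker does not
# recognize; these are candidate foreign (or misspelled) words.
library(hunspell)
ru.dict <- dictionary("ru_RU") # requires an installed ru_RU hunspell dictionary
unknown <- !hunspell_check(One.word$One.Gram, dict = ru.dict)
paste0(round(100 * mean(unknown), 1), "% of unique tokens are not in the Russian dictionary.")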

Question 5

  1. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases? I think coverage could be increased by collapsing inflected word forms (case, gender and number variants, which are frequent in Russian) into a common base form via stemming or lemmatization, so that one dictionary entry covers many surface forms; see the sketch below.
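A minimal sketch of this idea, assuming the SnowballC stemmer (the choice of stemmer is an assumption, not part of the original analysis):

# Sketch (assumption): collapse inflected Russian forms with a Snowball stemmer
# and recompute how many dictionary entries cover 50% of all word instances.
library(SnowballC)
One.word$Stem <- wordStem(One.word$One.Gram, language = "russian")
stem.freq <- aggregate(Frequency ~ Stem, data = One.word, FUN = sum)
stem.freq <- stem.freq[order(-stem.freq$Frequency), ]
coverage <- cumsum(stem.freq$Frequency) / sum(stem.freq$Frequency)
paste0("We need ", which(coverage > 0.5)[1], " stems to cover 50% of all word instances.")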

Conclusions

We are going to construct a prediction model for the next typed word, delivered as a Shiny application and based on the 1-/2-/3-gram data sets. The n-gram sets are basically ready, but I am not yet sure this will be an efficient solution.

Next Steps (based on the calculations performed)

For the next steps of analysis and modelling, it would be good to do the following:

  • Optimize the 1-/2-/3-gram data sets for lower memory utilization and better speed
  • Adjust the processing code to reduce memory footprint and runtime