Synopsis

This is a report for the Coursera Data Science Capstone project assignment. Around the world, people spend a lot of time typing words, phrases and sentences on their mobile devices. Making typing easier is a worthwhile task for mobile developers, and the cornerstone of this task is predictive text models. In the Capstone Project we work on understanding and developing such models.

The report presents an exploratory analysis of the data from a corpus called HC Corpora (www.corpora.heliohost.org).

All the code is available in my GitHub repository

Dataset

The dataset is available at the following URL. It consists of texts downloaded from various Internet websites, divided into four languages (English, Russian, German, Finnish) and split into three files per language according to the text's source: blogs, news and Twitter.

The readme of the dataset is also available here

Although the texts are language-filtered, they may still contain foreign text. They may also contain offensive words or phrases that should not be used in predictive text modeling.

Basic statistics

Before cleaning the data, we check basic statistics for the full dataset. The table below shows the summary:

Filename           Size    Number of lines    Number of words
en_US.blogs.txt    201M             899288           38222304
en_US.news.txt     197M            1010242           35710849
en_US.twitter.txt  160M            2360148           30433509
ru_RU.blogs.txt    112M             337100            9434050
ru_RU.news.txt     114M             196360            9125006
ru_RU.twitter.txt  101M             881414            9084961

Data preprocessing

The data should be cleaned before modeling. Since we are going to predict words, we should remove numbers, punctuation, URLs, profanity and grammatical stopwords from the texts. We took the English profanity word list from this link and prepared the Russian profanity list by hand.

In addition, because the dataset is large, we should think about the quality of sampling. We sample the input lines for the analysis at random, using a uniform distribution.

library(tm)

getCorpus <- function(filename, sampleSize, profanityWords) {
  conn <- file(filename, open = "r")
  lines <- iconv(readLines(conn), to = "UTF-8")
  close(conn)

  if (sampleSize != 0) {
    # Sample sampleSize line numbers with a uniform distribution
    rowNums <- round(runif(sampleSize, min = 1, max = length(lines)), 0)
    raw <- lines[rowNums]
    rm(lines)
  } else {
    raw <- lines
  }

  # Remove punctuation and numbers, convert to lower case
  raw <- gsub("[^[:alnum:][:space:]']", ' ', raw)
  raw <- gsub('[[:digit:]]+', ' ', raw)
  raw <- gsub('[[:punct:]]+', '', raw)
  raw <- tolower(raw)

  # Build a tm corpus from the cleaned character vector
  txt <- VectorSource(raw)
  rm(raw)
  txt.corpus <- Corpus(txt)
  rm(txt)

  # Clean the corpus: stopwords, profanity and extra whitespace
  txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
  txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("russian"))
  txt.corpus <- tm_map(txt.corpus, removeWords, profanityWords)
  txt.corpus <- tm_map(txt.corpus, stripWhitespace)

  return(txt.corpus)
}

setwd("~/Coursera/DS_Capstone/")
blog.en="data/final/en_US/en_US.blogs.txt"
news.en="data/final/en_US/en_US.news.txt"
twitter.en="data/final/en_US/en_US.twitter.txt"
blog.ru="data/final/ru_RU/ru_RU.blogs.txt"
news.ru="data/final/ru_RU/ru_RU.news.txt"
twitter.ru="data/final/ru_RU/ru_RU.twitter.txt"

profanityWords.en <- names(read.csv(url("http://www.bannedwordlist.com/lists/swearWords.csv")))
profanityWords.ru <- names(read.csv("data/profanity_russian.txt"))

txt.en.blog <- getCorpus(blog.en, 10000, profanityWords.en)
txt.en.news <- getCorpus(news.en, 10000, profanityWords.en)
txt.en.twit <- getCorpus(twitter.en, 10000, profanityWords.en)
txt.ru.blog <- getCorpus(blog.ru, 10000, profanityWords.ru)
txt.ru.news <- getCorpus(news.ru, 10000, profanityWords.ru)
txt.ru.twit <- getCorpus(twitter.ru, 10000, profanityWords.ru)

txt.en <- c(txt.en.blog, txt.en.news, txt.en.twit)
txt.ru <- c(txt.ru.blog, txt.ru.news, txt.ru.twit)

# free memory
rm(txt.en.blog, txt.en.news, txt.en.twit, txt.ru.blog, txt.ru.news, txt.ru.twit)

Q1. Some words are more frequent than others - what are the distributions of word frequencies?

The distributions of word frequencies in English and Russian are quite similar. An interesting finding of the Russian word frequency analysis is that the word ‘это’ should be included in the Russian stopword list in R: it means ‘this’ in English, but it is not included in stopwords("russian").

Here we plot the word frequency distribution in two languages.

library(slam)

getFrequency <- function(x) {
  # Term-document matrix with very rare terms dropped to keep it manageable
  tdm <- TermDocumentMatrix(x)
  tdm.999 <- removeSparseTerms(tdm, sparse = 0.999)
  rm(tdm)

  # Total frequency of each term across all documents, sorted in decreasing order
  freq <- sort(row_sums(tdm.999), decreasing = TRUE)
  return(freq)
}

freq.en <- getFrequency(txt.en)
freq.ru <- getFrequency(txt.ru)
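
A minimal plotting sketch using ggplot2, with a hypothetical helper plotTopWords() built on the frequency vectors above:

library(ggplot2)

# Hypothetical helper: bar chart of the 30 most frequent terms in a named frequency vector
plotTopWords <- function(freq, title) {
  top <- head(freq, 30)
  df <- data.frame(word = names(top), freq = as.numeric(top))
  ggplot(df, aes(reorder(word, -freq), freq, fill = freq)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12)) +
    ggtitle(title) + xlab("") + ylab("Frequency")
}

plotTopWords(freq.en, "Most common EN words")
plotTopWords(freq.ru, "Most common RU words")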

As we can see, the Russian word ‘это’ has a much higher frequency than any other Russian word in the corpus. Since ‘это’ is an excessive word, we remove it from the Russian text and plot the most common Russian words again.
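
A minimal sketch of that step, assuming txt.ru is still a tm corpus after the concatenation above:

# Remove the excessive word 'это' from the Russian corpus and recompute word frequencies
txt.ru <- tm_map(txt.ru, removeWords, c("это"))
freq.ru <- getFrequency(txt.ru)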

Q2. What are the frequencies of 2-grams and 3-grams in the dataset?

Summary

As expected, the frequencies of individual 2-grams and 3-grams are much lower than the frequencies of 1-grams, but the total number of distinct 2-grams or 3-grams is much higher. An interesting finding is that the number of unique 2-grams does not differ much from the number of unique 3-grams.

Below we plot the most common 2-grams and 3-grams.

library(RWeka)
library(ggplot2)

getNGramFrequency <- function(corpus, ngram) {
  # Flatten the corpus into a plain character data frame for RWeka
  df <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
  delim <- " \\r\\n\\t.,;:\"()?!&+“”‘’'/"

  # Tokenize the text into n-grams of the requested order
  df_tmp <- NGramTokenizer(df$text, Weka_control(min = ngram, max = ngram, delimiters = delim))
  rm(df)

  # Count each distinct n-gram and sort by frequency
  df_ngram <- data.frame(table(df_tmp))
  rm(df_tmp)

  names(df_ngram) <- c("Ngram", "freq")
  df_ngram <- df_ngram[order(df_ngram$freq, decreasing = TRUE), ]

  return(df_ngram)
}

freq.bi.en <- getNGramFrequency(txt.en, 2); freq.tri.en <- getNGramFrequency(txt.en, 3)
freq.bi.ru <- getNGramFrequency(txt.ru, 2); freq.tri.ru <- getNGramFrequency(txt.ru, 3)

ggplot(head(freq.bi.en, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
  ggtitle("Most common EN 2-grams") + xlab("") + ylab("Frequency")

ggplot(head(freq.tri.en, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
  ggtitle("Most common EN 3-grams") + xlab("") + ylab("Frequency")

ggplot(head(freq.bi.ru, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
  ggtitle("Most common RU 2-grams") + xlab("") + ylab("Frequency")

ggplot(head(freq.tri.ru, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
  ggtitle("Most common RU 2-grams") + xlab("") + ylab("Frequency")

As expected, the frequency of individual 2-grams and 3-grams is much lower than the frequency of 1-grams (single words), but the number of distinct 2-grams and 3-grams is much higher. Let's plot these values.
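
A small sketch of how these counts could be compared for English (the same applies to Russian), assuming the freq.en, freq.bi.en and freq.tri.en objects from above:

# Number of unique 1-grams (terms kept in the sparse TDM), 2-grams and 3-grams in English
ngram.counts <- data.frame(
  ngram  = c("1-grams", "2-grams", "3-grams"),
  unique = c(length(freq.en), nrow(freq.bi.en), nrow(freq.tri.en))
)

ggplot(ngram.counts, aes(ngram, unique)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of unique n-grams (EN)") + xlab("") + ylab("Count")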

Q3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

Language   Unique words for 50% coverage   Unique words for 90% coverage
English    366 (12% of dictionary)         1743 (56% of dictionary)
Russian    536 (23% of dictionary)         2014 (87% of dictionary)

As we can see, noticeably fewer unique words are needed in English (366) than in Russian (536) to cover 50% of all word instances in the corpus.
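
One possible sketch of how such coverage figures can be computed from the sorted frequency vectors above (note that freq.en and freq.ru contain only the terms kept after removeSparseTerms, so the numbers are approximate):

# How many of the most frequent words are needed to cover a given share of all word instances
coverage <- function(freq, threshold) {
  covered <- cumsum(freq) / sum(freq)
  min(which(covered >= threshold))
}

coverage(freq.en, 0.5)   # unique EN words for 50% coverage
coverage(freq.en, 0.9)   # unique EN words for 90% coverage
coverage(freq.ru, 0.5)   # unique RU words for 50% coverage
coverage(freq.ru, 0.9)   # unique RU words for 90% coverage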

Let’s plot the words that cover 50% of the corpus in the two languages.


Plot the words that cover 50% of all word instances in the blogs, news and Twitter texts (HC Corpora dataset).
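
A small sketch that reuses the hypothetical coverage() helper above to list the words up to the 50% mark:

# Words that together cover 50% of all word instances (EN and RU)
head(names(freq.en), coverage(freq.en, 0.5))
head(names(freq.ru), coverage(freq.ru, 0.5))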

Q4. How do you evaluate how many of the words come from foreign languages?

I think that an approach to identifying words from other languages may look like this:

  1. Transliterate words into a single alphabet, e.g. Russian, which uses the Cyrillic alphabet, should be transliterated into the Latin alphabet.
  2. Remove prefixes and suffixes of words, finding the stem of each word in a text/dictionary. This transformation is called stemming, and there is a variety of algorithms for it; the most popular is Porter stemming. A fast stemming algorithm for Russian, used in the Yandex search engine, is described by Ilya Segalovich in the article "A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine" (see the sketch after this list).
  3. Find a correlation between stems of words in different languages based on their letters, lengths, n-grams, etc. A threshold should then help decide which words come from foreign languages.
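
A small sketch of steps 1 and 2, assuming the SnowballC package for stemming (transliteration output depends on the iconv implementation on the system):

library(SnowballC)

# Step 1 (sketch): transliterate Cyrillic words into a Latin approximation
iconv(c("привет", "москва"), from = "UTF-8", to = "ASCII//TRANSLIT")

# Step 2 (sketch): find word stems with Snowball (Porter-style) stemmers
wordStem(c("running", "runs", "runner"), language = "english")
wordStem(c("бегает", "бегать"), language = "russian")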

Q5. Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

I think that a successful strategy for increasing the coverage with a smaller number of words in the dictionary is to use stemming for word prediction. We can use the stem of a word to predict all words with the same stem, even if some of them are not present in the corpus.
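
A minimal sketch of this idea, assuming the freq.en vector from above and the SnowballC stemmer (the word-to-stem grouping below is illustrative, not the final prediction model):

library(SnowballC)

# Collapse the word-frequency vector onto stems: one dictionary entry per stem
stems <- wordStem(names(freq.en), language = "english")
stem.freq <- sort(tapply(freq.en, stems, sum), decreasing = TRUE)

# Fewer unique dictionary entries are needed when words sharing a stem are grouped
c(unique.words = length(freq.en), unique.stems = length(stem.freq))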