This is a report for the Coursera Data Science Capstone project assignment. Around the world, people spend a lot of time typing words, phrases, and sentences on their mobile devices. Making typing easier is an important task for mobile developers, and the cornerstone of this task is predictive text models. In the Capstone Project we work on understanding and developing such models.
The report shows an exploratory analysis of the data from a corpus called HC Corpora (www.corpora.heliohost.org).
All the code is available in my Github repository.
The dataset is available at the following URL. It consists of texts downloaded from various Internet websites, divided into four languages (English, Russian, German, Finnish), with each language split into three files according to the source of the text:
The readme for the dataset is also available here.
Although the texts are language-filtered, they may still contain foreign text. They may also contain offensive words or phrases that should not be used in predictive text modeling.
Before cleaning the data, let's check some basic statistics about the full dataset. The table below shows the summary:
| Filename | Size | Number of lines | Number of words |
|---|---|---|---|
| en_US.blogs.txt | 201M | 899288 | 38222304 |
| en_US.news.txt | 197M | 1010242 | 35710849 |
| en_US.twitter.txt | 160M | 2360148 | 30433509 |
| ru_RU.blogs.txt | 112M | 337100 | 9434050 |
| ru_RU.news.txt | 114M | 196360 | 9125006 |
| ru_RU.twitter.txt | 101M | 881414 | 9084961 |
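These figures can be gathered with a short script before any cleaning. Below is a minimal sketch in R; the file paths are assumptions matching the layout used later in this report, and word counts are approximated by splitting on whitespace.

```r
# A sketch of how the summary table above might be produced (paths are assumptions)
summarizeFile <- function(f) {
  lines <- readLines(f, skipNul = TRUE)
  data.frame(Filename = basename(f),
             SizeMB   = round(file.info(f)$size / 2^20),
             Lines    = length(lines),
             Words    = sum(lengths(strsplit(lines, "\\s+"))))
}
files <- c("data/final/en_US/en_US.blogs.txt",
           "data/final/en_US/en_US.news.txt",
           "data/final/en_US/en_US.twitter.txt",
           "data/final/ru_RU/ru_RU.blogs.txt",
           "data/final/ru_RU/ru_RU.news.txt",
           "data/final/ru_RU/ru_RU.twitter.txt")
do.call(rbind, lapply(files, summarizeFile))
```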
The data should be cleaned before modeling. Since we are going to predict words, we should remove numbers, punctuation, URLs, profanity, and grammatical stopwords from the texts. We took the English profanity word list from this link and prepared the Russian profanity word list by hand.
In addition, because the dataset is quite large, we need to think about the quality of sampling, so we draw a random sample of the input lines from a uniform distribution for the analysis.
library(tm)

getCorpus <- function(filename, sampleSize, profanityWords) {
  conn <- file(filename, open = "r")
  lines <- iconv(readLines(conn), to = "UTF-8")
  close(conn)
  if (sampleSize != 0) {
    # Sample lines uniformly at random
    rowNums <- round(runif(sampleSize, min = 1, max = length(lines)))
    raw <- lines[rowNums]
  } else {
    raw <- lines
  }
  rm(lines)
  # Remove punctuation and numbers, convert to lower case
  raw <- gsub("[^[:alnum:][:space:]']", " ", raw)
  raw <- gsub("[[:digit:]]+", " ", raw)
  raw <- gsub("[[:punct:]]+", "", raw)
  raw <- tolower(raw)
  # Build a corpus
  txt <- VectorSource(raw)
  rm(raw)
  txt.corpus <- Corpus(txt)
  rm(txt)
  # Clean the corpus: drop stopwords and profanity, collapse extra whitespace
  txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("english"))
  txt.corpus <- tm_map(txt.corpus, removeWords, stopwords("russian"))
  txt.corpus <- tm_map(txt.corpus, removeWords, profanityWords)
  txt.corpus <- tm_map(txt.corpus, stripWhitespace)
  return(txt.corpus)
}
setwd("~/Coursera/DS_Capstone/")
blog.en="data/final/en_US/en_US.blogs.txt"
news.en="data/final/en_US/en_US.news.txt"
twitter.en="data/final/en_US/en_US.twitter.txt"
blog.ru="data/final/ru_RU/ru_RU.blogs.txt"
news.ru="data/final/ru_RU/ru_RU.news.txt"
twitter.ru="data/final/ru_RU/ru_RU.twitter.txt"
profanityWords.en <- names(read.csv(url("http://www.bannedwordlist.com/lists/swearWords.csv")))
profanityWords.ru <- names(read.csv("data/profanity_russian.txt"))
txt.en.blog <- getCorpus(blog.en, 10000, profanityWords.en)
txt.en.news <- getCorpus(news.en, 10000, profanityWords.en)
txt.en.twit <- getCorpus(twitter.en, 10000, profanityWords.en)
txt.ru.blog <- getCorpus(blog.ru, 10000, profanityWords.ru)
txt.ru.news <- getCorpus(news.ru, 10000, profanityWords.ru)
txt.ru.twit <- getCorpus(twitter.ru, 10000, profanityWords.ru)
txt.en <- c(txt.en.blog, txt.en.news, txt.en.twit)
txt.ru <- c(txt.ru.blog, txt.ru.news, txt.ru.twit)
# free memory
rm(txt.en.blog, txt.en.news, txt.en.twit, txt.ru.blog, txt.ru.news, txt.ru.twit)
The distributions of word frequencies in English and Russian are quite similar. An interesting finding of the Russian word frequency analysis is that the word ‘это’ should be added to the Russian stopword list in R: it means ‘this’ in English, but it is not included in the list returned by stopwords("russian").
Here we plot the word frequency distributions in the two languages.
library(slam)

getFrequency <- function(x) {
  tdm <- TermDocumentMatrix(x)
  # Drop very sparse terms to keep the matrix manageable
  tdm.999 <- removeSparseTerms(tdm, sparse = 0.999)
  rm(tdm)
  # Sum term occurrences across documents and sort by frequency
  freq <- sort(row_sums(tdm.999), decreasing = TRUE)
  return(freq)
}
freq.en <- getFrequency(txt.en)
freq.ru <- getFrequency(txt.ru)
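The bar charts of the most frequent words can be produced from these frequency vectors. A minimal sketch follows; `plotTopWords` is a helper introduced here, not part of the original code.

```r
library(ggplot2)

# Helper (introduced here) to plot the n most frequent terms of a named frequency vector
plotTopWords <- function(freq, n = 30, title = "") {
  df <- data.frame(word = names(freq)[1:n], freq = as.numeric(freq[1:n]))
  ggplot(df, aes(reorder(word, -freq), freq, fill = freq)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 12)) +
    ggtitle(title) + xlab("") + ylab("Frequency")
}

plotTopWords(freq.en, title = "Most common EN words")
plotTopWords(freq.ru, title = "Most common RU words")
```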
As we can see, the Russian word ‘это’ has a much higher frequency than any other Russian word in the corpus. This word means ‘this’ in English and should be included in the Russian stopword list in R. Since ‘это’ carries no predictive value here, we remove it from the Russian texts and plot the most common Russian words again.
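A minimal sketch of this step, reusing `getFrequency()` and the `plotTopWords()` helper sketched above:

```r
# Remove 'это' from the Russian corpus and recompute word frequencies
txt.ru <- tm_map(txt.ru, removeWords, c("это"))
freq.ru <- getFrequency(txt.ru)
plotTopWords(freq.ru, title = "Most common RU words (without 'это')")
```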
As expected, the frequencies of individual 2-grams and 3-grams are much lower than the frequencies of 1-grams, but the total number of distinct 2-grams and 3-grams is much higher. An interesting finding is that the number of unique 2-grams does not differ much from the number of unique 3-grams.
Let's plot the most common 2-grams and 3-grams.
library(RWeka)
library(ggplot2)

getNGramFrequency <- function(corpus, ngram) {
  df <- data.frame(text = sapply(corpus, as.character), stringsAsFactors = FALSE)
  delim <- " \\r\\n\\t.,;:\"()?!&+“”‘’'/"
  # Tokenize the text into n-grams of the requested order
  tokens <- NGramTokenizer(df$text, Weka_control(min = ngram, max = ngram, delimiters = delim))
  rm(df)
  # Count n-gram occurrences and sort by frequency
  df_ngram <- data.frame(table(tokens))
  rm(tokens)
  names(df_ngram) <- c("Ngram", "freq")
  df_ngram <- df_ngram[order(df_ngram$freq, decreasing = TRUE), ]
  return(df_ngram)
}
freq.bi.en <- getNGramFrequency(txt.en, 2); freq.tri.en <- getNGramFrequency(txt.en, 3)
freq.bi.ru <- getNGramFrequency(txt.ru, 2); freq.tri.ru <- getNGramFrequency(txt.ru, 3)
ggplot(head(freq.bi.en, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
ggtitle("Most common EN 2-grams") + xlab("") + ylab("Frequency")
ggplot(head(freq.tri.en, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
ggtitle("Most common EN 3-grams") + xlab("") + ylab("Frequency")
ggplot(head(freq.bi.ru, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
ggtitle("Most common RU 2-grams") + xlab("") + ylab("Frequency")
ggplot(head(freq.tri.ru, 30), aes(reorder(Ngram, -freq), freq, fill=freq)) +
geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size=12)) +
ggtitle("Most common RU 2-grams") + xlab("") + ylab("Frequency")
As expected, the frequency of any individual 2-gram or 3-gram is much lower than the frequency of 1-grams (single words), but the number of distinct 2-grams and 3-grams is much higher. Let's plot these counts.
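One way to plot these counts is sketched below; the data frame construction is an assumption based on the `freq.*` tables built above.

```r
# Compare the numbers of distinct 2-grams and 3-grams per language (a sketch)
counts <- data.frame(
  language = rep(c("EN", "RU"), each = 2),
  ngram    = rep(c("2-grams", "3-grams"), times = 2),
  unique   = c(nrow(freq.bi.en), nrow(freq.tri.en),
               nrow(freq.bi.ru), nrow(freq.tri.ru)),
  total    = c(sum(freq.bi.en$freq), sum(freq.tri.en$freq),
               sum(freq.bi.ru$freq), sum(freq.tri.ru$freq))
)
ggplot(counts, aes(ngram, unique, fill = language)) +
  geom_bar(stat = "identity", position = "dodge") +
  ggtitle("Number of unique n-grams") + xlab("") + ylab("Count")
```

The table below shows how many unique words are needed to cover a given fraction of all word instances in the sampled corpus.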
| Language | Unique words for 50% coverage | Unique words for 90% coverage |
|---|---|---|
| English | 366 (12%) | 1743 (56%) |
| Russian | 536 (23%) | 2014 (87%) |
As we can see, to cover 50% of all word instances in the corpus we need roughly half the share of unique words in English (12%) compared with Russian (23%).
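The counts in the table can be derived from the cumulative word frequencies. A minimal sketch follows; `wordsForCoverage` is a helper introduced here.

```r
# Number of most frequent words needed to cover a given share of all word instances
wordsForCoverage <- function(freq, coverage) {
  cum <- cumsum(freq) / sum(freq)
  which(cum >= coverage)[1]
}
n50 <- wordsForCoverage(freq.en, 0.50)   # 366 for the English sample
n90 <- wordsForCoverage(freq.en, 0.90)
c(n50, round(n50 / length(freq.en), 2), n90, round(n90 / length(freq.en), 2))
```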
Let's plot the words that cover 50% of the corpus in the two languages.
Plot of the words that cover 50% of all word instances in the blogs, news, and Twitter texts (HC Corpora dataset).
I think that an approach to identifying words from other languages might look like this:
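As one possible illustration, a simple character-based filter can flag words whose letters fall outside the expected alphabet of the target language; the helper and alphabet patterns below are assumptions.

```r
# A minimal sketch of a character-based foreign-word filter (patterns are assumptions)
isForeign <- function(words, alphabetPattern) {
  # A word is flagged as foreign if it contains characters outside the expected alphabet
  grepl(paste0("[^", alphabetPattern, "']"), words)
}
words <- c("hello", "привет", "straße")
isForeign(words, "a-z")    # English alphabet: flags "привет" and "straße"
isForeign(words, "а-яё")   # Russian alphabet: flags "hello" and "straße"
```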
I think that a successful strategy for increasing coverage with a smaller dictionary is to use stemming for word prediction. We can use the stem of a word to predict all words that share that stem, even if some of them are not present in the corpus.
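A minimal sketch of the idea, using the SnowballC stemmer (the choice of SnowballC is an assumption; any stemmer would do):

```r
# A minimal sketch of stemming-based dictionary compression (SnowballC is an assumption)
library(SnowballC)
words <- c("predict", "predicts", "predicted", "predicting", "prediction")
wordStem(words, language = "english")
# All of these forms reduce to the same stem, so a stem-level dictionary
# can also cover word forms that never appear in the corpus.
```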