Summary

The goal of this report is to present the properties of the training data set and some basic exploration of it, laying the groundwork for the prediction algorithm.

The data

The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.

The data was collected from blogs, news and Twitter; the corresponding files are en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt. The files were downloaded from the Coursera site and unzipped into the data folder of the project.

Getting and sampling the data

For the purpose of later testing and assessing the prediction accuracy, the raw data set was split into two parts: a training set (2/3) and a test set (1/3).

# Create train and test separate text files from original data

FILE.NAMES <- c("data/en_US/en_US.twitter.txt", "data/en_US/en_US.blogs.txt", "data/en_US/en_US.news.txt")

for (file in FILE.NAMES){
    
    con.read <- file(file, open="rb")
    
    # read the whole file and sample 2/3 of the line indices for the training set
    # (call set.seed() beforehand if a reproducible split is needed)
    text <- readLines(con.read, encoding="UTF-8")
    lines.idx <- sample(1:length(text), round(length(text)*2/3,0))
    
    con.write.train <- file(paste(file,"_train.txt",sep=""), "w")
    writeLines(text[lines.idx], con = con.write.train, sep = "\n", useBytes = FALSE)
    close(con.write.train)
    
    con.write.test <- file(paste(file, "_test.txt", sep=""), "w")
    writeLines(text[-lines.idx], con = con.write.test, sep = "\n", useBytes = FALSE)
    close(con.write.test)
    
    close(con.read)
    
}

rm(text, file, lines.idx, FILE.NAMES, con.read, con.write.train, con.write.test)

tweets <- readLines("C:/cproject/data/en_US/en_US.twitter.txt_train.txt")
news <- readLines("C:/cproject/data/en_US/en_US.news.txt_train.txt")
blogs <- readLines("C:/cproject/data/en_US/en_US.blogs.txt_train.txt")

save(tweets,
     news,
     blogs, file = "data/train_data.RData")

The result is six additional text files: en_US.twitter.txt_train.txt, en_US.twitter.txt_test.txt, en_US.blogs.txt_train.txt, en_US.blogs.txt_test.txt, en_US.news.txt_train.txt and en_US.news.txt_test.txt. The training files were read back in and saved in a separate RData file. All further processing was done with the training files only.

Basic summary and exploration

Raw data

Line and word counts of the raw training files:

Object     Lines       Words
tweets   1573432    20212567
news       75137     2590305
blogs     599525    25052049
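
The code that produced these counts is not shown in the report; below is a minimal sketch of how they could be reproduced with stringi. The object name raw.stats is illustrative, and the exact word counts depend on the counting method used.

library(stringi)

# line and word counts for the three loaded training objects
raw.stats <- data.frame(object = c("tweets", "news", "blogs"),
                        lines  = c(length(tweets), length(news), length(blogs)),
                        words  = c(sum(stri_count_words(tweets)),
                                   sum(stri_count_words(news)),
                                   sum(stri_count_words(blogs))))
raw.stats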

Tokens

For exploration purposes, tokenization was done in the following steps (a minimal sketch follows the list):
* split the raw text into sentences using stri_split_boundaries
* split each sentence into words using stri_extract_all_words
* remove all numbers and special characters, and remove or replace Unicode symbols, using regex patterns and the stri_replace_all_regex function
* add the special token <s> at the beginning of each sentence
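
Here is a minimal sketch of this pipeline, using only stringi and data.table; the helper name tokenize.corpus and the simplified cleanup regex are illustrative assumptions, not the exact code used.

library(stringi)
library(data.table)

tokenize.corpus <- function(text) {
    # split the raw lines into sentences
    sents <- unlist(stri_split_boundaries(text, type = "sentence"))
    # lower-case and replace numbers/special characters (simplified pattern)
    sents <- stri_trans_tolower(sents)
    sents <- stri_replace_all_regex(sents, "[^a-z' ]", " ")
    # split each sentence into words and prepend the special <s> token
    words <- unlist(lapply(stri_extract_all_words(sents), function(w) c("<s>", w)))
    words <- words[!is.na(words)]
    # frequency table sorted by count
    data.table(words = words)[, .(freq = .N), by = words][order(-freq)]
}

# e.g. t.toks.data <- tokenize.corpus(tweets)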

The result is four data tables of tokens, one per corpus plus a combined one, each sorted by frequency:

# Tweets tokens
head(t.toks.data)
##    words    freq
## 1:   <s> 2518894
## 2:   the  625432
## 3:    to  526036
## 4:     i  483043
## 5:     a  407613
## 6:   you  365846
# Blogs tokens
head(b.toks.data)
##    words    freq
## 1:   <s> 1585893
## 2:   the 1239715
## 3:   and  729698
## 4:    to  713743
## 5:     a  599859
## 6:    of  584357
# News tokens
head(n.toks.data)
##    words   freq
## 1:   <s> 150758
## 2:   the 147623
## 3:    to  67447
## 4:   and  65928
## 5:     a  64867
## 6:    of  57663
# All tokens
head(all.toks.data)
##    words    freq
## 1:   <s> 4255545
## 2:   the 2012770
## 3:    to 1307226
## 4:   and 1088018
## 5:     a 1072339
## 6:     i 1011763

Summary of total/unique tokens for each corpus:

toks.stats <- data.frame(type = c("tweets", "blogs", "news", "all"),
                         total = c( sum(t.toks.data$freq), sum(b.toks.data$freq), sum(n.toks.data$freq), sum(all.toks.data$freq) ),
                         unique = c( nrow(t.toks.data), nrow(b.toks.data), nrow(n.toks.data), nrow(all.toks.data) )
                         )
d.m <- melt(toks.stats, id.vars="type")
# rename: "count" is the count type (total/unique), "tokens" is the number of tokens
names(d.m) <- c("corpus", "count", "tokens")
g1 <- ggplot(d.m, aes(corpus, tokens)) + geom_bar(aes(fill = count), position = "dodge", stat="identity")
g1

[Figure (chunk toks.summary): total and unique token counts per corpus]

Note that the special token <s> is included in these counts.

The 15 most frequent tokens and their fraction of the total number of tokens:

most.freq.toks <- data.frame( corpus = c( rep("tweets",15), rep("blogs",15), rep("news",15), rep("all",15) ),
                              token = c( t.toks.data$words[1:15], b.toks.data$words[1:15], n.toks.data$words[1:15], all.toks.data$words[1:15] ),
                              frequency = c( t.toks.data$freq[1:15]/sum(t.toks.data$freq), b.toks.data$freq[1:15]/sum(b.toks.data$freq), n.toks.data$freq[1:15]/sum(n.toks.data$freq), all.toks.data$freq[1:15]/sum(all.toks.data$freq) )
                              )
g2 <- ggplot(most.freq.toks, aes(token, frequency)) + geom_bar(aes(fill = corpus), position = "dodge", stat="identity")
g2

[Figure (chunk most.tokens): 15 most frequent tokens and their fractions, by corpus]

Frequency distribution of tokens with count > 1000, excluding the special token <s>:

# keep tokens with count > 1000 and drop the first row (the <s> token)
d.m <- all.toks.data[freq > 1000][-1,]
ggplot(data=d.m, aes(x=1:nrow(d.m), y=freq)) + geom_line() + labs(title = "Frequency distribution of tokens with count > 1000", x = "Tokens", y = "Frequency")

[Figure (chunk freq.distr): frequency distribution of tokens with count > 1000]

The total number of unique tokens (excluding the special token <s>) across all three corpora combined is 370605. The number of unique tokens with count > 1000 (again excluding <s>) is 3609, i.e. about 1%. So clearly only a small fraction of the unique tokens makes up the most frequent part of the whole corpus.
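
For reference, a short snippet that checks this from all.toks.data built above (the names all.nosent and frequent are illustrative):

# unique tokens excluding <s>, and the frequent subset (count > 1000)
all.nosent <- all.toks.data[words != "<s>"]
frequent   <- all.nosent[freq > 1000]
nrow(frequent) / nrow(all.nosent)          # fraction of unique tokens, about 1%
sum(frequent$freq) / sum(all.nosent$freq)  # share of all token occurrences they cover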

Next steps

An obvious and common approach to building a next-word prediction model is the N-gram model. So the next step, already in progress, is to build 2-grams, 3-grams and 4-grams and count their frequencies. As the token exploration has shown, data sparsity is to be expected with N-grams as well: there will be a few common high-frequency N-grams and a great many rare ones. I will therefore have to implement a good smoothing method, such as Kneser-Ney, because a simple probability estimate based on raw N-gram counts will not be accurate enough. A rough sketch of the 2-gram counting step is shown below.
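
As an illustration only, here is a minimal sketch of how the 2-gram counts could be built from a cleaned token vector using data.table; the function count.bigrams and the shift-by-one pairing are assumptions for illustration, not the final implementation.

library(data.table)

# count 2-grams from a token vector produced during tokenization,
# where <s> marks the start of each sentence
count.bigrams <- function(words) {
    # pair each token with the one that follows it
    bg <- data.table(w1 = words[-length(words)], w2 = words[-1])
    # drop pairs that span a sentence boundary (the second token is <s>)
    bg <- bg[w2 != "<s>"]
    # count each pair and sort by frequency
    bg[, .(freq = .N), by = .(w1, w2)][order(-freq)]
}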

Notes after exploration

Unfortunately, R is not very efficient when working with large amounts of data. Some packages, such as tm or RWeka, that were supposed to be useful for this task require too much memory for computations on a slow PC; that is why the only non-base packages I used for building the tokens and N-grams were stringi and data.table.