The goal of this report is to describe the properties of the training data set and present a basic exploration of it, establishing a baseline for building the prediction model.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). See the readme file at http://www.corpora.heliohost.org/aboutcorpus.html for details on the corpora available.
The data was collected from blogs, news articles and Twitter; the corresponding files are en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt. The files were downloaded from the Coursera site and unzipped into the data folder of the project.
For the purpose of later testing and assessing prediction accuracy, the raw data set was divided into two parts: a training set (2/3) and a test set (1/3).
# Create separate train and test text files from the original data
FILE.NAMES <- c("data/en_US/en_US.twitter.txt", "data/en_US/en_US.blogs.txt", "data/en_US/en_US.news.txt")
for (file in FILE.NAMES){
    # read the raw file in binary mode to avoid problems with embedded nulls
    con.read <- file(file, open="rb")
    text <- readLines(con.read, encoding="UTF-8")
    # randomly pick 2/3 of the lines for the training set
    lines.idx <- sample(1:length(text), round(length(text)*2/3, 0))
    con.write.train <- file(paste(file, "_train.txt", sep=""), "w")
    writeLines(text[lines.idx], con = con.write.train, sep = "\n", useBytes = FALSE)
    close(con.write.train)
    # the remaining 1/3 of the lines goes to the test set
    con.write.test <- file(paste(file, "_test.txt", sep=""), "w")
    writeLines(text[-lines.idx], con = con.write.test, sep = "\n", useBytes = FALSE)
    close(con.write.test)
    close(con.read)
}
rm(text, file, lines.idx, FILE.NAMES, con.read, con.write.train, con.write.test)
# Load the training files (absolute paths on the author's machine) and save them into a single RData file
tweets <- readLines("C:/cproject/data/en_US/en_US.twitter.txt_train.txt")
news <- readLines("C:/cproject/data/en_US/en_US.news.txt_train.txt")
blogs <- readLines("C:/cproject/data/en_US/en_US.blogs.txt_train.txt")
save(tweets, news, blogs, file = "data/train_data.RData")
The result is six additional text files: en_US.twitter.txt_train.txt, en_US.twitter.txt_test.txt, en_US.blogs.txt_train.txt, en_US.blogs.txt_test.txt, en_US.news.txt_train.txt and en_US.news.txt_test.txt. The training files were read and saved into a separate RData file. All further processing was done with the training files only.
Line and word counts of the training data:
| Object | Line count | Word count |
|---|---|---|
| tweets | 1573432 | 20212567 |
| news | 75137 | 2590305 |
| blogs | 599525 | 25052049 |
For the purpose of exploration, tokenization was done in the following steps (a minimal sketch of the pipeline is shown after this list):
* split the raw text into sentences using stri_split_boundaries
* split each sentence into words using stri_extract_all_words
* remove all numbers and special characters, and remove or replace Unicode symbols using regex patterns and the stri_replace_all_regex function
* add the special token <s> at the beginning of each sentence
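The sketch below illustrates one way these steps can be combined with stringi and data.table; it is not the exact code used in this report, and the helper name tokenize.corpus as well as the cleaning regex are illustrative assumptions.
# Illustrative tokenization sketch (assumed implementation, not the exact code used)
library(stringi)
library(data.table)
tokenize.corpus <- function(text){
    # split raw lines into sentences
    sentences <- unlist(stri_split_boundaries(text, type = "sentence"))
    # lower-case and replace numbers / special characters (example pattern only)
    sentences <- stri_trans_tolower(sentences)
    sentences <- stri_replace_all_regex(sentences, "[0-9]+|[^\\p{L}' ]", " ")
    # split each sentence into words and prepend the sentence-start token <s>
    words <- stri_extract_all_words(sentences)
    tokens <- unlist(lapply(words, function(w) c("<s>", w)))
    # count token frequencies with data.table and sort by decreasing frequency
    toks <- data.table(words = tokens)[!is.na(words), .(freq = .N), by = words]
    setorder(toks, -freq)
    toks
}
# e.g. t.toks.data <- tokenize.corpus(tweets)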
The result is four data tables of tokens, one for each corpus and one for all corpora combined, sorted by frequency:
# Tweets tokens
head(t.toks.data)
## words freq
## 1: <s> 2518894
## 2: the 625432
## 3: to 526036
## 4: i 483043
## 5: a 407613
## 6: you 365846
# Blogs tokens
head(b.toks.data)
## words freq
## 1: <s> 1585893
## 2: the 1239715
## 3: and 729698
## 4: to 713743
## 5: a 599859
## 6: of 584357
# News tokens
head(n.toks.data)
## words freq
## 1: <s> 150758
## 2: the 147623
## 3: to 67447
## 4: and 65928
## 5: a 64867
## 6: of 57663
# All tokens
head(all.toks.data)
## words freq
## 1: <s> 4255545
## 2: the 2012770
## 3: to 1307226
## 4: and 1088018
## 5: a 1072339
## 6: i 1011763
Summary of total and unique token counts for each corpus:
library(ggplot2)
toks.stats <- data.frame(type = c("tweets", "blogs", "news", "all"),
                         total = c(sum(t.toks.data$freq), sum(b.toks.data$freq), sum(n.toks.data$freq), sum(all.toks.data$freq)),
                         unique = c(nrow(t.toks.data), nrow(b.toks.data), nrow(n.toks.data), nrow(all.toks.data)))
# reshape to long format (melt() from data.table or reshape2): one row per corpus and count type
d.m <- melt(toks.stats, id.vars="type")
names(d.m) <- c("corpus", "count", "tokens")
g1 <- ggplot(d.m, aes(corpus, tokens)) + geom_bar(aes(fill = count), position = "dodge", stat="identity")
g1
Note that the special token <s> is included in these counts.
The 15 most frequent tokens and their fraction of the total number of tokens:
most.freq.toks <- data.frame( corpus = c( rep("tweets",15), rep("blogs",15), rep("news",15), rep("all",15) ),
token = c( t.toks.data$words[1:15], b.toks.data$words[1:15], n.toks.data$words[1:15], all.toks.data$words[1:15]),
frequency = c( t.toks.data$freq[1:15]/sum(t.toks.data$freq), b.toks.data$freq[1:15]/sum(b.toks.data$freq), n.toks.data$freq[1:15]/sum(n.toks.data$freq), all.toks.data$freq[1:15]/sum(all.toks.data$freq) )
)
g2 <- ggplot(most.freq.toks, aes(token, frequency)) + geom_bar(aes(fill = corpus), position = "dodge", stat="identity")
g2
Frequency distribution of tokens with count > 1000, excluding the special token <s>:
# keep tokens with count > 1000 and drop the first row, which is the <s> token
d.m <- all.toks.data[freq > 1000][-1,]
ggplot(data=d.m, aes(x=1:nrow(d.m), y=freq)) + geom_line() +
    labs(title = "Frequency distribution of tokens with count > 1000", x = "Tokens", y = "Frequency")
The total number of unique tokens (excluding the special token <s>) in the combined corpus is 370605, while the number of tokens with count > 1000 (again excluding <s>) is only 3609. So clearly, a small fraction of the unique tokens accounts for most of the token occurrences in the corpus. A quick way to quantify this coverage is sketched below.
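As an illustration of this skew, the following sketch (an assumed check, not part of the original analysis) computes how many of the most frequent tokens are needed to cover a given share of all token occurrences:
# Sketch: cumulative coverage of token occurrences by the most frequent tokens
# (illustrative only; assumes all.toks.data is sorted by decreasing freq, as above)
toks <- all.toks.data[words != "<s>"]
coverage <- cumsum(toks$freq) / sum(toks$freq)
# number of top tokens needed to cover 50% and 90% of all token occurrences
c(cover50 = which(coverage >= 0.5)[1], cover90 = which(coverage >= 0.9)[1])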
An obvious and common approach to building a next-word prediction model is an N-gram model. So the next step (already in progress) is building 2-grams, 3-grams and 4-grams and counting their frequencies. As the exploration of tokens has shown, data sparsity is to be expected with N-grams as well: there will be some common, high-frequency N-grams but a lot of rare ones. Therefore a good smoothing method, such as Kneser-Ney, will have to be implemented, because a simple probability estimate based on raw N-gram counts will not be accurate enough. A sketch of the N-gram counting step is shown below.
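The sketch below shows one simple way to build and count N-grams from the token stream using data.table; the function make.ngrams is an illustrative assumption rather than the final implementation, and this naive version also forms N-grams across sentence boundaries.
# Illustrative N-gram counting sketch (assumed approach, not the final implementation)
library(data.table)
make.ngrams <- function(tokens, n = 2){
    # tokens: character vector of tokens in sentence order, including <s>
    m <- length(tokens) - n + 1
    # build n shifted views of the token vector and paste them element-wise
    parts <- lapply(0:(n - 1), function(k) tokens[(1 + k):(m + k)])
    grams <- do.call(paste, parts)
    # count n-gram frequencies and sort by decreasing frequency
    dt <- data.table(ngram = grams)[, .(freq = .N), by = ngram]
    setorder(dt, -freq)
    dt
}
# e.g. bigrams <- make.ngrams(all.tokens, n = 2)   # all.tokens is a hypothetical token vector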
Unfortunately, R is not very efficient when working with large amounts of data. Some packages, like tm or RWeka, that were supposed to be useful for this task require a lot of memory for computations on a slow PC; that is why the only non-base packages I used for making tokens and N-grams were stringi and data.table.