This is a brief exploratory report on the Capstone Corpus. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The files contain three pieces of text, from blogs, twitter, and news; they have been filtered but may still contain some unrecognized words or foreign (non-English) words. Below is a brief exploratory analysis of these data.
Here are the procedures we used to read in all three pieces of data.
# Read in the blog data
blog <- readLines("./en_US.blogs.txt")
# Read the twitter data through a binary connection so readLines() handles problematic characters
myfile <- file("en_US.twitter.txt", open = "rb")
twitter <- readLines(myfile, encoding = "UTF-8")
close(myfile)
# Read the news data the same way
myfile <- file("en_US.news.txt", open = "rb")
news <- readLines(myfile, encoding = "UTF-8")
close(myfile)
First, we define several key functions to be used later in the analysis. To count how much foreign or unrecognized text is present in the data, we substitute every recognized letter and punctuation mark (comma, colon, quote, etc.) with the empty string "". The characters left over are usually unrecognized or foreign. Since it is hard to define foreign words precisely, we simply count anything not recognized by this method as foreign.
# Remove recognized characters (ASCII letters, digits, whitespace, punctuation),
# leaving only unrecognized or foreign characters
keepforeigntxt <- function(x) gsub("[A-Za-z[:digit:][:space:][:punct:]]", "", x)
# Filter the data and keep only unrecognized letters or symbols
foreigntxt <- function(passage){
  sapply(passage, function(x){
    txt <- tolower(keepforeigntxt(unlist(strsplit(x, split = " "))))
    txt[txt != ""]
  })
}
In contrast, to tokenize the text we keep only recognized letters as words, excluding numbers and punctuation.
# Remove everything that is not an ASCII letter (digits, punctuation, symbols)
keepcleartxt <- function(x) gsub("[^A-Za-z]", "", x)
# Filter the data and keep only words
tidytxt <- function(passage){
  sapply(passage, function(x){
    txt <- tolower(keepcleartxt(unlist(strsplit(x, split = " "))))
    txt[txt != ""]
  })
}
Here we define a function called wordcount() to count the total number of words in a list of lines. We simply define a word as a space-delimited token.
wordcount <- function(passage){
  num <- 0
  for(i in 1:length(passage)){
    num <- num + length(unlist(strsplit(passage[[i]], split = " ")))
  }
  return(num)
}
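As a quick sanity check, here is a hypothetical call to wordcount(); the input lines are made up for illustration and the expected result is given as a comment rather than actual output.
# Hypothetical example: two lines containing 2 + 3 = 5 space-delimited tokens
wordcount(list("hello world", "one two three"))
# expected result: 5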
Here we define a function called word2pair() to take a vector of words and generate all consecutive two-word pairs (bigrams).
word2pair <- function(words){
  # Guard against inputs too short to form a pair
  if(length(words) < 2) return(character(0))
  sapply(1:(length(words)-1), function(i){paste(words[i], words[i+1])})
}
Here we define a function called word3pair() to take a vector of words and generate all consecutive three-word sequences (trigrams).
word3pair <- function(words){
  # Guard against inputs too short to form a triple
  if(length(words) < 3) return(character(0))
  sapply(1:(length(words)-2), function(i){paste(words[i], words[i+1], words[i+2])})
}
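For illustration, here is a hypothetical call on a short, made-up word vector showing the bigrams and trigrams these two functions produce.
# Hypothetical example input
words <- c("the", "quick", "brown", "fox")
word2pair(words)
# expected: "the quick" "quick brown" "brown fox"
word3pair(words)
# expected: "the quick brown" "quick brown fox"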
Here we define a function called gram_coverage() to calculate how many of the most frequent n-grams we need to keep in our repertoire to cover a certain percentage of all n-gram occurrences. It assumes the input frequency table is normalized (sums to 1) and sorted in decreasing order.
gram_coverage <- function(freq_table, coverage = .5){
  cc <- 0
  count <- 0
  while(cc < coverage){
    count <- count + 1
    cc <- cc + freq_table[count]
  }
  return(count)
}
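A minimal, made-up example of how gram_coverage() behaves on a small normalized, sorted frequency table:
# Hypothetical normalized, sorted frequency table
freqs <- c(a = 0.4, b = 0.3, c = 0.2, d = 0.1)
gram_coverage(freqs, coverage = .5)  # expected: 2 (0.4 + 0.3 >= 0.5)
gram_coverage(freqs, coverage = .9)  # expected: 3 (0.4 + 0.3 + 0.2 >= 0.9)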
Here we define a function called FilterFreqTable() to filter and renormalize the frequency table used to calculate coverage.
FilterFreqTable <- function(table, filter = 5){
  kept <- table[table > filter]
  return(kept / sum(kept))
}
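As a quick, hypothetical illustration of the filtering step (the counts below are made up):
# Hypothetical raw count table; only counts above the filter threshold are kept
counts <- c(the = 100, and = 40, of = 30, rare = 3, typo = 1)
FilterFreqTable(counts, filter = 5)
# expected: "the", "and", "of" renormalized to 100/170, 40/170, 30/170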
For a summary exploratory analysis we only need part of the total data to get an idea of its properties and features, which also improves performance. Therefore, we randomly sample 9,000 lines from each of the three sources.
set.seed(1000)
blog_sample <- sample(blog,9000)
twitter_sample <- sample(twitter,9000)
news_sample <- sample(news,9000)
Using the two functions defined above, foreigntxt() and tidytxt(), we tokenize and filter the data. All recognized words are converted to lower case, so this approach cannot distinguish, for example, "U.S." from "us".
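A small, hypothetical illustration of the tokenization on a made-up sentence, showing the lowercasing and the U.S./us collision:
unlist(tidytxt(c("The U.S. is great, isn't it?")))
# expected tokens: "the" "us" "is" "great" "isnt" "it"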
Using these functions, we filter the raw data and keep only the foreign or unrecognized words (due to a computer locale issue, only the news sample is processed here; see below).
# blog_foreign <- foreigntxt(blog_sample)
# twitter_foreign <- foreigntxt(twitter_sample)
news_foreign <- foreigntxt(news_sample)
We also obtain the cleaned, tokenized data for each source.
blog_clear <- tidytxt(blog_sample)
twitter_clear <- tidytxt(twitter_sample)
news_clear <- tidytxt(news_sample)
Now we run several basic summary analyses on the data we just imported and filtered.
First, we summarize the total number of lines and words in each dataset.
countsum <- data.frame(line = sapply(list(blog, twitter, news), length),
                       word = sapply(list(blog, twitter, news), wordcount),
                       row.names = c("blog", "twitter", "news"))
countsum
## line word
## blog 899288 37334131
## twitter 2360148 30373543
## news 1010242 34372530
Next, we analyze the tokenized data and summarize n-gram frequencies for single words, two-word pairs, and three-word sequences.
In this section, we summarize the frequency of single words. The most frequent single words in the blog data are:
blog_word_table <- sort(table(unlist(blog_clear)), decreasing = T)
as.data.frame(t(head(blog_word_table/sum(blog_word_table))), row.names = "Freq")
## the and to a of i
## Freq 0.05077613 0.02941161 0.02875382 0.0248772 0.02422751 0.02080918
Most words appear very infrequently in the blog data; many are present only once.
hist(log2(blog_word_table), breaks = 10,
xlab="Log2 Times of Appearance",
main= "Histogram of Frequency of Words Blog")
The most frequent single words in the twitter data are:
twitter_word_table <- sort(table(unlist(twitter_clear)), decreasing = T)
as.data.frame(t(head(twitter_word_table/sum(twitter_word_table))), row.names = "Freq")
## the to i a you and
## Freq 0.03151476 0.02669401 0.02404979 0.02059612 0.01850053 0.01475006
The frequency distribution is similar to that of the blog data.
hist(log2(twitter_word_table), breaks = 10,
xlab="Log2 Times of Appearance",
main= "Histogram of Frequency of Words Twitter")
The most frequent single words in the news data are:
news_word_table <- sort(table(unlist(news_clear)), decreasing = T)
as.data.frame(t(head(news_word_table/sum(news_word_table))), row.names = "Freq")
## the to a and of in
## Freq 0.05821402 0.02696044 0.02648117 0.02609303 0.02355157 0.01990644
The frequency distribution is again similar to that of the blog data.
hist(log2(news_word_table), breaks = 10,
xlab="Log2 Times of Appearance",
main= "Histogram of Frequency of Words News")
In this section, we summarize the frequency of two consecutive words (bigrams). The most frequent two-word pairs in the blog data are:
blog_2word <- sapply(blog_clear,function(x)word2pair(x))
blog_2word_table <- sort(table(unlist(blog_2word)), decreasing = T)
as.data.frame(t(head(blog_2word_table/sum(blog_2word_table))), row.names = "Freq")
## of the in the to the on the to be
## Freq 0.005308641 0.004260157 0.002381163 0.002171466 0.001785182
## for the
## Freq 0.001614114
The most frequent two-word pairs in the twitter data are:
twitter_2word <- sapply(twitter_clear,function(x)word2pair(x))
twitter_2word_table <- sort(table(unlist(twitter_2word)), decreasing = T)
as.data.frame(t(head(twitter_2word_table/sum(twitter_2word_table))), row.names = "Freq")
## in the for the of the on the to the
## Freq 0.002778811 0.002720103 0.002054754 0.001917771 0.001692726
## to be
## Freq 0.001624234
The most frequent two-word pairs in the news data are:
news_2word <- sapply(news_clear,function(x)word2pair(x))
news_2word_table <- sort(table(unlist(news_2word)), decreasing = T)
as.data.frame(t(head(news_2word_table/sum(news_2word_table))), row.names = "Freq")
## of the in the to the on the for the at the
## Freq 0.005935007 0.005347073 0.00272398 0.002431753 0.001976016 0.00177424
In this section, we summarize the frequency of three consecutive words (trigrams). The most frequent three-word sequences in the blog data are:
blog_3word <- sapply(blog_clear,function(x)word3pair(x))
blog_3word_table <- sort(table(unlist(blog_3word)), decreasing = T)
as.data.frame(t(head(blog_3word_table/sum(blog_3word_table))), row.names = "Freq")
## one of the a lot of it was a some of the as well as
## Freq 0.0004344809 0.0003554843 0.0002341683 0.0001862061 0.0001833848
## to be a
## Freq 0.0001833848
The most frequent three-word sequences in the twitter data are:
twitter_3word <- sapply(twitter_clear,function(x)word3pair(x))
twitter_3word_table <- sort(table(unlist(twitter_3word)), decreasing = T)
as.data.frame(t(head(twitter_3word_table/sum(twitter_3word_table))), row.names = "Freq")
## thanks for the thank you for i love you a lot of cant wait to
## Freq 0.0008210531 0.0004265211 0.0004158581 0.0003305538 0.0003305538
## going to be
## Freq 0.0002879017
The most frequent three-word sequences in the news data are:
news_3word <- sapply(news_clear,function(x)word3pair(x))
news_3word_table <- sort(table(unlist(news_3word)), decreasing = T)
as.data.frame(t(head(news_3word_table/sum(news_3word_table))), row.names = "Freq")
## one of the a lot of some of the to be a according to the
## Freq 0.0005126973 0.0003549443 0.0002187031 0.000193606 0.0001900207
## as well as
## Freq 0.0001864354
As we can see, different source types show different sentence patterns. For example, "i" is used much more frequently on twitter than in the news, because news mostly reports objective events while twitter is more subjective and personal.
In order to build better prediction models, it is important to know how much of the text our stored n-grams cover. We do not want to include every n-gram, otherwise lookups will be slow and the data will be too large.
We first calculate coverage on the full frequency tables, without excluding less frequent n-grams.
cov_sum <- data.frame(
  Single50 = c(gram_coverage(blog_word_table/sum(blog_word_table), coverage = .5),
               gram_coverage(twitter_word_table/sum(twitter_word_table), coverage = .5),
               gram_coverage(news_word_table/sum(news_word_table), coverage = .5)),
  Single90 = c(gram_coverage(blog_word_table/sum(blog_word_table), coverage = .9),
               gram_coverage(twitter_word_table/sum(twitter_word_table), coverage = .9),
               gram_coverage(news_word_table/sum(news_word_table), coverage = .9)),
  TwoWord50 = c(gram_coverage(blog_2word_table/sum(blog_2word_table), coverage = .5),
                gram_coverage(twitter_2word_table/sum(twitter_2word_table), coverage = .5),
                gram_coverage(news_2word_table/sum(news_2word_table), coverage = .5)),
  TwoWord90 = c(gram_coverage(blog_2word_table/sum(blog_2word_table), coverage = .9),
                gram_coverage(twitter_2word_table/sum(twitter_2word_table), coverage = .9),
                gram_coverage(news_2word_table/sum(news_2word_table), coverage = .9)),
  ThreeWord50 = c(gram_coverage(blog_3word_table/sum(blog_3word_table), coverage = .5),
                  gram_coverage(twitter_3word_table/sum(twitter_3word_table), coverage = .5),
                  gram_coverage(news_3word_table/sum(news_3word_table), coverage = .5)),
  ThreeWord90 = c(gram_coverage(blog_3word_table/sum(blog_3word_table), coverage = .9),
                  gram_coverage(twitter_3word_table/sum(twitter_3word_table), coverage = .9),
                  gram_coverage(news_3word_table/sum(news_3word_table), coverage = .9)),
  row.names = c("Blog", "Twitter", "News"))
cov_sum
## Single50 Single90 TwoWord50 TwoWord90 ThreeWord50 ThreeWord90
## Blog 105 6037 20744 150888 132909 274687
## Twitter 121 4546 12986 53866 39484 76997
## News 193 7513 27373 140570 114257 225824
To cover more text we have to sacrifice a lot of memory keeping the less frequent n-grams. For example, to cover 50% of all two-word pair occurrences in the blog data we need the 20,744 most frequent pairs, but for 90% coverage we have to increase this to 150,888 pairs, most of which appear only once in the whole dataset.
One way to reduce the number of entries needed for a given coverage is to exclude the less frequent n-grams first and then calculate coverage on the filtered, renormalized set.
We now calculate coverage on the same frequency tables after excluding less frequent n-grams.
cov_sum_filter <- data.frame(
  Single50 = c(gram_coverage(FilterFreqTable(blog_word_table), coverage = .5),
               gram_coverage(FilterFreqTable(twitter_word_table), coverage = .5),
               gram_coverage(FilterFreqTable(news_word_table), coverage = .5)),
  Single90 = c(gram_coverage(FilterFreqTable(blog_word_table), coverage = .9),
               gram_coverage(FilterFreqTable(twitter_word_table), coverage = .9),
               gram_coverage(FilterFreqTable(news_word_table), coverage = .9)),
  TwoWord50 = c(gram_coverage(FilterFreqTable(blog_2word_table), coverage = .5),
                gram_coverage(FilterFreqTable(twitter_2word_table), coverage = .5),
                gram_coverage(FilterFreqTable(news_2word_table), coverage = .5)),
  TwoWord90 = c(gram_coverage(FilterFreqTable(blog_2word_table), coverage = .9),
                gram_coverage(FilterFreqTable(twitter_2word_table), coverage = .9),
                gram_coverage(FilterFreqTable(news_2word_table), coverage = .9)),
  ThreeWord50 = c(gram_coverage(FilterFreqTable(blog_3word_table), coverage = .5),
                  gram_coverage(FilterFreqTable(twitter_3word_table), coverage = .5),
                  gram_coverage(FilterFreqTable(news_3word_table), coverage = .5)),
  ThreeWord90 = c(gram_coverage(FilterFreqTable(blog_3word_table), coverage = .9),
                  gram_coverage(FilterFreqTable(twitter_3word_table), coverage = .9),
                  gram_coverage(FilterFreqTable(news_3word_table), coverage = .9)),
  row.names = c("Blog", "Twitter", "News"))
cov_sum_filter
## Single50 Single90 TwoWord50 TwoWord90 ThreeWord50 ThreeWord90
## Blog 80 3020 1705 12496 1892 6184
## Twitter 81 1473 688 3575 357 1063
## News 128 3610 1662 10268 1149 3527
By default, FilterFreqTable() keeps only the n-grams that appear more than 5 times. By doing this, we significantly reduce the number of entries that must be stored to reach good coverage.
Another important question is the quality of the raw data. Due to a computer locale issue, we use only the sampled news data as an example here. We calculate the percentage of foreign or unrecognized words in the sampled data.
wordcount(news_foreign)/wordcount(news_sample)
## [1] 0.007894088
This suggests that about 0.79% of the words in the news data are foreign words or errors. We expect this ratio to be higher in the twitter data, since it likely contains more misspellings and foreign-language text.
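Once the locale issue is resolved, the same check could in principle be run on the other samples, for example (a sketch using the functions defined above; not executed here):
# Hypothetical check on the twitter sample, assuming foreigntxt() runs without locale errors
twitter_foreign <- foreigntxt(twitter_sample)
wordcount(twitter_foreign)/wordcount(twitter_sample)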
In summary, we imported three datasets: blog, twitter, and news. We tokenized them and analyzed the frequencies of single words, two-word pairs, and three-word sequences. This information is valuable for building natural language prediction models.