Introduction

This is a brief exploratory report on the Capstone corpus. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The files contain three pieces of text, from blogs, twitter, and news; they have been filtered but may still contain some unrecognized words or foreign (non-English) words. Below is a brief exploratory analysis of these data.

Data Pre-Processing

Read in

Here are the procedures we used to read in all three pieces of data.

# The blog file can be read in directly
blog <- readLines("./en_US.blogs.txt")

# The twitter and news files are opened in binary mode and read as UTF-8
# to avoid encoding problems with special characters
myfile <- file("en_US.twitter.txt", open = "rb")
twitter <- readLines(myfile, encoding = "UTF-8")
close(myfile)

myfile <- file("en_US.news.txt", open = "rb")
news <- readLines(myfile, encoding = "UTF-8")
close(myfile)

Define Key Functions

First, we define several key functions to be used later in the analysis. To count how much foreign or unrecognized text is present in the data, we substitute every recognized letter and punctuation mark (commas, colons, quotes, and so on) with the empty string “”. The characters that remain are usually unrecognized symbols or foreign words. Since it is hard to define a foreign word precisely, we simply count anything left over by this substitution as foreign text.

# Substitute recognized letters, digits, and symbols with the empty string
keepforeigntxt <- function(x) gsub("[A-Za-z[:digit:] \t\\.,`'-_=\\+;:<>~\\(\\)\\!\"\\.,\\?'!@#\"\\$%\\^&\\*\\/\\|\\{\\}]|[[:space:]]","",x)

# Filter the data and only keep unrecognized letters or symbols
foreigntxt <- function(passage){
  sapply(passage, function(x){
    txt <- tolower(keepforeigntxt(unlist(strsplit(x,split = " "))))
    txt <- txt[!txt==""]
  })  
}

In contrast, we keep any recognized letters as words, excluding numbers.

# Substitute symbols and numbers, keeping only letters
keepcleartxt <- function(x) gsub("[\\.0-9 \\.,\\?'!@#\"\\$%\\^&\\*\\(\\)-_=\\+;:<>\\/\\\\|\\}\\{\\[\\]`~]|[^A-Za-z]","",x)

# Filter the data and keep only words
tidytxt <- function(passage){
  sapply(passage, function(x){
    txt <- tolower(keepcleartxt(unlist(strsplit(x, split = " "))))
    txt <- txt[!txt==""]
  })
}

Here we define a function called wordcount() to count the total number of words in a vector of lines. We simply define a word as a space-separated token.

wordcount <- function(passage){
  num = 0
  for(i in 1:length(passage)){
    num <- num + length(unlist(strsplit(passage[[i]], split = " ")))
  }
  return(num)
}
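
For example, a quick sketch with two toy lines (two and three space-separated tokens) should give a total of five words:

# Toy check of wordcount() on made-up input lines
wordcount(c("hello world", "one two three"))  # returns 5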

Here we define a function called word2pair() that takes a vector of words and generates all consecutive two-word pairs.

word2pair <- function(words){
  if(length(words) < 2) return(character(0))  # skip lines with fewer than two words
  sapply(1:(length(words)-1), function(i){paste(words[i], words[i+1])})
}

Here we define a function called word3pair() that takes a vector of words and generates all consecutive three-word sequences.

word3pair <- function(words){
  if(length(words) < 3) return(character(0))  # skip lines with fewer than three words
  sapply(1:(length(words)-2), function(i){paste(words[i], words[i+1], words[i+2])})
}
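
To illustrate both functions, here is a quick sketch on a small made-up vector of words:

# Toy example of the pair-generating functions
words <- c("to", "be", "or", "not", "to", "be")
word2pair(words)  # "to be"  "be or"  "or not"  "not to"  "to be"
word3pair(words)  # "to be or"  "be or not"  "or not to"  "not to be"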

Here we define a function called gram_coverage() to calculate how many of the most frequent n-grams we need to keep in the repertoire to cover a given percentage of all n-gram occurrences.

gram_coverage <- function(freq_table, coverage=.5){
  # freq_table should hold proportions sorted in decreasing order
  cc = 0
  count = 0
  while(cc < coverage){
    count = count + 1
    cc = cc + freq_table[count]
  }
  return(count)
}

Here we define a function called FilterFreqTable() to filter the frequency table used to calculate coverage, keeping only the n-grams that appear more than a given number of times and renormalizing.

FilterFreqTable <- function(table, filter=5){
  kept <- table[table > filter]  # drop n-grams appearing `filter` times or fewer
  return(kept/sum(kept))         # renormalize to proportions
}
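
To illustrate how gram_coverage() and FilterFreqTable() work together, here is a quick sketch on a made-up toy frequency table:

# Toy counts, already sorted in decreasing order as gram_coverage() expects
toy <- c(a = 40, b = 30, c = 20, d = 7, e = 2, f = 1)
gram_coverage(toy/sum(toy), coverage = .5)          # 2, since "a" and "b" already cover 70%
gram_coverage(FilterFreqTable(toy), coverage = .5)  # also 2 on the filtered, renormalized table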

Sampling

For a summary exploratory analysis, we only need part of the total data to get an idea of its properties and features, which also improves performance. Therefore, we randomly sample 9,000 lines from each of the three pieces as the data for the exploratory analysis.

set.seed(1000)
blog_sample <- sample(blog,9000)
twitter_sample <- sample(twitter,9000)
news_sample <- sample(news,9000)

Word Filtering

Using the two functions defined above, foreigntxt() and tidytxt(), we tokenize and filter the data. All recognized words are converted to lower case, so this model cannot distinguish “U.S.” from “us”.
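
As a small sketch with made-up input lines, the effect of the cleaning looks like this:

# "U.S." and "us" both end up as the lower-case token "us"
tidytxt("I love the U.S. as much as you love us")
# only characters outside the recognized set survive, e.g. "日本語" here
foreigntxt("some posts contain 日本語 words")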

Foreign Words

Using the functions defined above, we can filter the raw data and keep only the foreign words for each of the three pieces of data. (Due to the locale issue described below, only the news sample is processed here.)

# blog_foreign <- foreigntxt(blog_sample)
# twitter_foreign <- foreigntxt(twitter_sample)
news_foreign <- foreigntxt(news_sample)

Tidy Words

And we can also get the clean and tidy data for each one.

blog_clear <- tidytxt(blog_sample)
twitter_clear <- tidytxt(twitter_sample)
news_clear <- tidytxt(news_sample)

Data Analysis

Now we run several basic summary analyses on the data we just imported and filtered.

Words and Lines

First, we summarize the total number of lines and words in each dataset.

countsum <- data.frame(line = sapply(list(blog,twitter,news),length),
                       word = sapply(list(blog,twitter,news),wordcount),
                       row.names= c("blog","twitter","news"))
countsum
##            line     word
## blog     899288 37334131
## twitter 2360148 30373543
## news    1010242 34372530

Summarize Word Frequency

Next, we analyze the tokenized data and summarize word frequency for single words, two-word pairs, and three-word sequences.

Single Word

In this section, we summarize the frequency of single words. The most frequent single words in the blog data are:

blog_word_table <- sort(table(unlist(blog_clear)), decreasing = T)
as.data.frame(t(head(blog_word_table/sum(blog_word_table))), row.names = "Freq")
##             the        and         to         a         of          i
## Freq 0.05077613 0.02941161 0.02875382 0.0248772 0.02422751 0.02080918

Most words appear only rarely in the blog data; many are present just once.

hist(log2(blog_word_table), breaks = 10, 
     xlab="Log2 Times of Appearance",
     main= "Histogram of Frequency of Words Blog")

The most frequent single words in the twitter data are:

twitter_word_table <- sort(table(unlist(twitter_clear)), decreasing = T)
as.data.frame(t(head(twitter_word_table/sum(twitter_word_table))), row.names = "Freq")
##             the         to          i          a        you        and
## Freq 0.03151476 0.02669401 0.02404979 0.02059612 0.01850053 0.01475006

And the frequency distribution is similar to that of the blog data.

hist(log2(twitter_word_table), breaks = 10, 
     xlab="Log2 Times of Appearance",
     main= "Histogram of Frequency of Words Twitter")

The most frequent single words in the news data are:

news_word_table <- sort(table(unlist(news_clear)), decreasing = T)
as.data.frame(t(head(news_word_table/sum(news_word_table))), row.names = "Freq")
##             the         to          a        and         of         in
## Freq 0.05821402 0.02696044 0.02648117 0.02609303 0.02355157 0.01990644

And the frequency distribution is again similar to that of the blog data.

hist(log2(news_word_table), breaks = 10, 
     xlab="Log2 Times of Appearance",
     main= "Histogram of Frequency of Words News")

Two-Word Pairs

In this section, we summarize the frequency of consecutive two-word pairs. The most frequent two-word pairs in the blog data are:

blog_2word <- sapply(blog_clear,function(x)word2pair(x))
blog_2word_table <- sort(table(unlist(blog_2word)), decreasing = T)
as.data.frame(t(head(blog_2word_table/sum(blog_2word_table))), row.names = "Freq")
##           of the      in the      to the      on the       to be
## Freq 0.005308641 0.004260157 0.002381163 0.002171466 0.001785182
##          for the
## Freq 0.001614114

The most frequent two-word pairs in the twitter data are:

twitter_2word <- sapply(twitter_clear,function(x)word2pair(x))
twitter_2word_table <- sort(table(unlist(twitter_2word)), decreasing = T)
as.data.frame(t(head(twitter_2word_table/sum(twitter_2word_table))), row.names = "Freq")
##           in the     for the      of the      on the      to the
## Freq 0.002778811 0.002720103 0.002054754 0.001917771 0.001692726
##            to be
## Freq 0.001624234

The most frequent two-word pairs in the news data are:

news_2word <- sapply(news_clear,function(x)word2pair(x))
news_2word_table <- sort(table(unlist(news_2word)), decreasing = T)
as.data.frame(t(head(news_2word_table/sum(news_2word_table))), row.names = "Freq")
##           of the      in the     to the      on the     for the     at the
## Freq 0.005935007 0.005347073 0.00272398 0.002431753 0.001976016 0.00177424

Three-Word Sequences

In this section, we summarize the frequency of consecutive three-word sequences. The most frequent three-word sequences in the blog data are:

blog_3word <- sapply(blog_clear,function(x)word3pair(x))
blog_3word_table <- sort(table(unlist(blog_3word)), decreasing = T)
as.data.frame(t(head(blog_3word_table/sum(blog_3word_table))), row.names = "Freq")
##        one of the     a lot of     it was a  some of the   as well as
## Freq 0.0004344809 0.0003554843 0.0002341683 0.0001862061 0.0001833848
##           to be a
## Freq 0.0001833848

The most frequent three-word sequences in the twitter data are:

twitter_3word <- sapply(twitter_clear,function(x)word3pair(x))
twitter_3word_table <- sort(table(unlist(twitter_3word)), decreasing = T)
as.data.frame(t(head(twitter_3word_table/sum(twitter_3word_table))), row.names = "Freq")
##      thanks for the thank you for   i love you     a lot of cant wait to
## Freq   0.0008210531  0.0004265211 0.0004158581 0.0003305538 0.0003305538
##       going to be
## Freq 0.0002879017

The most frequent three-word sequences in the news data are:

news_3word <- sapply(news_clear,function(x)word3pair(x))
news_3word_table <- sort(table(unlist(news_3word)), decreasing = T)
as.data.frame(t(head(news_3word_table/sum(news_3word_table))), row.names = "Freq")
##        one of the     a lot of  some of the     to be a according to the
## Freq 0.0005126973 0.0003549443 0.0002187031 0.000193606     0.0001900207
##        as well as
## Freq 0.0001864354

Summary of n-grams

As we can see, the different types of source data show different sentence patterns. For example, “i” is used much more frequently on twitter than in the news; this is because news mostly reports objective happenings, while tweets tend to be subjective accounts of personal life.

Coverage of n-grams

In order to build better prediction models, it is important to know how much of the text is covered by the frequent words we keep in our model or database. We do not want to include every word, otherwise lookups will be slow and the data will take up too much space.

Using Total Dataset

We first calculate coverage on the full dataset, without excluding less frequent n-grams.

cov_sum <- data.frame(
  Single50 = c(gram_coverage(blog_word_table/sum(blog_word_table),coverage = .5),
                gram_coverage(twitter_word_table/sum(twitter_word_table),coverage = .5),
                gram_coverage(news_word_table/sum(news_word_table),coverage = .5)),
  Single90 = c(gram_coverage(blog_word_table/sum(blog_word_table),coverage = .9),
                gram_coverage(twitter_word_table/sum(twitter_word_table),coverage = .9),
                gram_coverage(news_word_table/sum(news_word_table),coverage = .9)),
  TwoWord50 = c(gram_coverage(blog_2word_table/sum(blog_2word_table),coverage = .5),
                gram_coverage(twitter_2word_table/sum(twitter_2word_table),coverage = .5),
                gram_coverage(news_2word_table/sum(news_2word_table),coverage = .5)),
  TwoWord90 = c(gram_coverage(blog_2word_table/sum(blog_2word_table),coverage = .9),
                gram_coverage(twitter_2word_table/sum(twitter_2word_table),coverage = .9),
                gram_coverage(news_2word_table/sum(news_2word_table),coverage = .9)),
  ThreeWord50 = c(gram_coverage(blog_3word_table/sum(blog_3word_table),coverage = .5),
                gram_coverage(twitter_3word_table/sum(twitter_3word_table),coverage = .5),
                gram_coverage(news_3word_table/sum(news_3word_table),coverage = .5)),
  ThreeWord90 = c(gram_coverage(blog_3word_table/sum(blog_3word_table),coverage = .9),
                gram_coverage(twitter_3word_table/sum(twitter_3word_table),coverage = .9),
                gram_coverage(news_3word_table/sum(news_3word_table),coverage = .9)),
  row.names=c("Blog","Twitter","News"))
cov_sum
##         Single50 Single90 TwoWord50 TwoWord90 ThreeWord50 ThreeWord90
## Blog         105     6037     20744    150888      132909      274687
## Twitter      121     4546     12986     53866       39484       76997
## News         193     7513     27373    140570      114257      225824

To cover more of the text we have to sacrifice a lot of memory to store the less frequent n-grams. For example, to cover 50% of all two-word pairs in the blog data we need the 20,744 most frequent pairs, but to reach 90% coverage we need 150,888 pairs, most of which appear only once in the whole sample.
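
This trade-off can be visualized directly. As a sketch using the blog word table computed above, the cumulative coverage curve flattens quickly, so each additional word buys less and less coverage:

# Cumulative coverage as a function of the number of most frequent words kept
blog_freq <- blog_word_table/sum(blog_word_table)  # proportions, already sorted decreasing
plot(cumsum(blog_freq), type = "l",
     xlab = "Number of Most Frequent Words Kept",
     ylab = "Cumulative Coverage",
     main = "Word Coverage Curve (Blog Sample)")
abline(h = c(0.5, 0.9), lty = 2)  # the 50% and 90% coverage levels used above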

One way to reduce this cost is to exclude the less frequent n-grams first, and then calculate coverage on the filtered set.

Using Filtered Frequent Dataset

Here we calculate coverage on the same dataset after excluding the less frequent n-grams.

cov_sum_filter <- data.frame(
  Single50 = c(gram_coverage(FilterFreqTable(blog_word_table),coverage = .5),
                gram_coverage(FilterFreqTable(twitter_word_table),coverage = .5),
                gram_coverage(FilterFreqTable(news_word_table),coverage = .5)),
  Single90 = c(gram_coverage(FilterFreqTable(blog_word_table),coverage = .9),
                gram_coverage(FilterFreqTable(twitter_word_table),coverage = .9),
                gram_coverage(FilterFreqTable(news_word_table),coverage = .9)),
  TwoWord50 = c(gram_coverage(FilterFreqTable(blog_2word_table),coverage = .5),
                gram_coverage(FilterFreqTable(twitter_2word_table),coverage = .5),
                gram_coverage(FilterFreqTable(news_2word_table),coverage = .5)),
  TwoWord90 = c(gram_coverage(FilterFreqTable(blog_2word_table),coverage = .9),
                gram_coverage(FilterFreqTable(twitter_2word_table),coverage = .9),
                gram_coverage(FilterFreqTable(news_2word_table),coverage = .9)),
  ThreeWord50 = c(gram_coverage(FilterFreqTable(blog_3word_table),coverage = .5),
                gram_coverage(FilterFreqTable(twitter_3word_table),coverage = .5),
                gram_coverage(FilterFreqTable(news_3word_table),coverage = .5)),
  ThreeWord90 = c(gram_coverage(FilterFreqTable(blog_3word_table),coverage = .9),
                gram_coverage(FilterFreqTable(twitter_3word_table),coverage = .9),
                gram_coverage(FilterFreqTable(news_3word_table),coverage = .9)),
  row.names=c("Blog","Twitter","News"))
cov_sum_filter
##         Single50 Single90 TwoWord50 TwoWord90 ThreeWord50 ThreeWord90
## Blog          80     3020      1705     12496        1892        6184
## Twitter       81     1473       688      3575         357        1063
## News         128     3610      1662     10268        1149        3527

By default, the function FilterFreqTable() keeps only the n-grams that appear more than 5 times. By doing this, we significantly reduce the number of entries that need to be stored to reach good coverage.

Frequency of Contamination

It is also important to assess the quality of the raw data. Due to a computer locale issue, I use the sampled news data as an example here. We calculate the percentage of foreign words in the sampled data.

wordcount(news_foreign)/wordcount(news_sample)
## [1] 0.007894088

This suggests that about 0.789% of the words in the news data are foreign-language text or errors. We expect this ratio to be higher in the twitter data, since misspellings and foreign languages are more likely there.
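
To see what this contamination actually looks like, a quick sketch lists the most common unrecognized tokens in the news sample:

# most frequent "foreign"/unrecognized tokens in the sampled news data
head(sort(table(unlist(news_foreign)), decreasing = TRUE))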

Summary

We imported three pieces of data: blog, twitter, and news. We tokenized the text and analyzed the frequencies of single words, two-word pairs, and three-word sequences. This information will be valuable for building natural language prediction models.