Overview/Introduction

This first part of the development process covers getting the raw data, splitting it into training and test sets, cleaning it, and performing a first exploratory analysis.

Getting Data

The data basis of this Capstone project is the Capstone Dataset provided via CloudFront.

It contains the 3 major files for this analysis:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

I put these files into my local development directory and start to get a feeling for the data.

Content of Raw Data

My first investigation is to determine the size of each file, its number of lines, and its number of words.

For this I use the standard Linux commands ls and wc.

# file size 
>ls -l en_US.blogs.txt
# count lines
>wc -l en_US.blogs.txt
# count words
>wc -w en_US.blogs.txt
##                   size_in_MB   lines    words
## en_US.blogs.txt          210  899288 37334114
## en_US.news.txt           205 1010242 34365936
## en_US.twitter.txt        167 2360148 30359804
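
The same figures can also be gathered from within R. Here is a minimal sketch (file_stats is a helper introduced here, not part of the original workflow; the regex-based word count may differ slightly from wc -w):

# rough R equivalent of the shell commands above
file_stats <- function(filename){
  lines <- readLines(filename, skipNul = TRUE, warn = FALSE)
  data.frame(file       = basename(filename),
             size_in_MB = round(file.size(filename) / 1024^2),
             lines      = length(lines),
             words      = sum(lengths(strsplit(lines, "\\s+"))))
}
do.call(rbind, lapply(c("../DataSet/en_US.blogs.txt",
                        "../DataSet/en_US.news.txt",
                        "../DataSet/en_US.twitter.txt"), file_stats))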

Split data into training dataset and test dataset

The available data needs to be split into training data and test data. Since the source files are quite large, I decided to implement the split chunk-wise using file streaming, so that it can also run on less powerful computers.

split_data <- function(path_filename_extention,p){
  library("tools")
  # set seed
  set.seed(100)
  # set chunk_size for read/write 
  chunk_size = 10000
  # randomly assigns each line to the training set with probability p
  # stores training data in <filename>_train.<extension>
  # stores test data in <filename>_test.<extension>
  
  # set file names for the training data set (file) and the test data set (file)
  filename = file_path_sans_ext(basename(path_filename_extention))
  path = dirname(path_filename_extention)
  extention = file_ext(path_filename_extention) 
  
  training_file_name = paste(path,"/",filename,"_","train",".",extention,sep="")
  test_file_name = paste(path,"/",filename,"_","test",".",extention,sep="")
  
  # con stream for reading and writing data
  con_read <- file(description = path_filename_extention, open= "r")
  con_write_training = file(description = training_file_name, open="w")
  con_write_test = file(description = test_file_name, open= "w")

  # read first source data chunk
  source_data = readLines(con_read, chunk_size,skipNul = TRUE, warn = FALSE) 
  # loop over source data
  while(length(source_data)!=0)
  {
      # randomly assign each line of the chunk to training data or test data
      is_training = rbinom(n = length(source_data), size = 1, prob = p) 
      training_data = source_data[is_training == 1]  
      test_data = source_data[is_training == 0] 
    
      # write data
      writeLines(training_data, con_write_training)
      writeLines(test_data, con_write_test)
      
      # read next source data chunk
      source_data = readLines(con_read, chunk_size,skipNul = TRUE, warn = FALSE) 
  }
  
  # close file streams
  close(con_read)
  close(con_write_training)
  close(con_write_test)
  
}
# split source data into training data (10%) and test data (90%) 
split_data("../DataSet/en_US.blogs.txt",0.1)
split_data("../DataSet/en_US.news.txt",0.1)
split_data("../DataSet/en_US.twitter.txt",0.1)

Only about 10% of the raw data will be used for training. That is enough for the exploratory analysis of the data.
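
A quick sanity check of the split ratio (count_lines is only a hypothetical helper for this check, not part of the pipeline):

# compare line counts of the generated training file and the original file;
# the ratio should be close to 0.1
count_lines <- function(path) length(readLines(path, skipNul = TRUE, warn = FALSE))
count_lines("../DataSet/en_US.blogs_train.txt") / count_lines("../DataSet/en_US.blogs.txt")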

Cleaning data

In the following steps I only operate on the training data. The remaining test data will come into focus in later phases of this project.

Now the data will be preprocessed to provide a reasonably clean data source for the further steps. Ultimately we are interested in sequences of words, so I remove all characters that are not letters or line breaks.

In addition, the following preprocessing steps are performed:

  • convert all text to lower case
  • collapse multiple spaces into a single space

# writes a cleaned version of the file to disk
clean_file <- function(source_file,dest_file){
  chunk_size = 10000
  # con stream for reading and writing data
  con_read <- file(description = source_file, open= "r")
  con_write = file(description = dest_file, open= "w")
  
  # read first source data chunk
  source_data = readLines(con_read, chunk_size,skipNul = TRUE, warn = FALSE) 
  # loop over source data
  while(length(source_data)!=0)
  {
    
    # keep only alphabetic characters and line breaks
    clean_data = gsub("[^A-Za-z\n\r]"," ",source_data)
    # convert to lower case
    clean_data = tolower(clean_data)
    # collapse multiple spaces into one
    clean_data = gsub("\\s+"," ",clean_data)
    # write data
    writeLines(clean_data, con_write)
    # read next source data chunk
    source_data = readLines(con_read, chunk_size,skipNul = TRUE, warn = FALSE) 
  }
  
  # close file streams
  close(con_read)
  close(con_write)

}

  # training data
  clean_file(source_file = "../DataSet/en_US.blogs_train.txt", dest_file = "../DataSet/clean_data/en_US.blogs_train.txt")
  clean_file(source_file = "../DataSet/en_US.news_train.txt", dest_file = "../DataSet/clean_data/en_US.news_train.txt")
  clean_file(source_file = "../DataSet/en_US.twitter_train.txt", dest_file = "../DataSet/clean_data/en_US.twitter_train.txt") 

At this step I did not remove stopwords, because they should be predicted as well, so I keep them in the training data.

Exploratory analysis

Now we are interested in the frequency of single words and the frequency of word sequences. All 3 source types will be examined separately, so the differences between the 3 sources become visible and comparable.

Frequency of words

Calculate the frequency of words. First compute the absolute counts and then the probability of occurrence of each word.

# get data.frame with word frequencies per document
freqWord1 <- function(){
  library(tm)
  library(RWeka)
  # load all 3 cleaned up training files
  docs <- VCorpus(DirSource("../DataSet/clean_data/"))
  # setup tokenizer for single words (unigrams)
  UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
  # generate TermDocumentMatrix
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = UnigramTokenizer)) #531MB
  # convert into a frequency matrix with one column per document
  data.frame(as.matrix(tdm))
}

freq_tdm_M = freqWord1()
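
The probability of occurrence mentioned above can be derived directly from freq_tdm_M. A minimal sketch, assuming freq_tdm_M holds the raw counts with one column per document (freq_M, total_words and freq_prob are names introduced here for the sketch):

# total number of words and probability of occurrence per document
freq_M      <- as.matrix(freq_tdm_M)
total_words <- colSums(freq_M)                      # total number of words per document
freq_prob   <- sweep(freq_M, 2, total_words, "/")   # per-document probability of each word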

Now take a look at some characteristics:

  • total number of words
  • probability of occurrence per document

Here are the top 20 words per document.

##    en_US.blogs_train.txt en_US.news_train.txt en_US.twitter_train.txt
## 1                    the                  the                     the
## 2                    and                  and                     you
## 3                   that                 that                     and
## 4                    for                  for                     for
## 5                    you                 with                    that
## 6                   with                 said                    with
## 7                    was                  was                    your
## 8                   this                  his                    have
## 9                   have                 from                    this
## 10                   but                  but                     are
## 11                   are                 have                    just
## 12                   not                  are                     can
## 13                  they                 they                     but
## 14                  from                 this                    what
## 15                   all                  has                     all
## 16                   one                  you                     not
## 17                   can                  not                    like
## 18                 about                  who                     was
## 19                  what                 will                     out
## 20                  will                about                     get
##    sumfreq
## 1      the
## 2      and
## 3     that
## 4      for
## 5      you
## 6     with
## 7      was
## 8     this
## 9     have
## 10     are
## 11     but
## 12     not
## 13    from
## 14    they
## 15     all
## 16     can
## 17    will
## 18    just
## 19    said
## 20    your

You can see that the ranking differs slightly between the documents.

A plot of the probability of occurrence of words across the different documents shows:

  • there is a high variance in the probability of occurrence of a word
  • the documents are dominated by only a few hundred different words
  • the variance of the word probabilities in en_US.twitter.txt is smaller than in en_US.news.txt

Note that the plot has a logarithmic scale on the y-axis.
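
Such a plot could be reproduced roughly as follows. This is only a sketch using base graphics and the freq_prob matrix from the sketch above, not the original plotting code:

# plot the sorted per-document word probabilities on a logarithmic y-axis;
# only the 5000 most frequent words per document are shown to avoid zero
# probabilities on the log scale
ranks    <- 1:5000
sorted_p <- apply(freq_prob, 2, function(p) sort(p, decreasing = TRUE)[ranks])
matplot(ranks, sorted_p, type = "l", lty = 1, log = "y",
        xlab = "word rank", ylab = "probability of occurrence")
legend("topright", legend = colnames(freq_prob), lty = 1, col = 1:ncol(sorted_p))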

Frequency of word pairs

Calculate the frequency of word pairs. As above, first compute the absolute counts and then the probability of occurrence of each word pair.
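
The counts can be obtained with the same approach as for single words; only the tokenizer changes to min = 2, max = 2. A sketch (freqWord2 and freq2_tdm_M are names introduced here, not from the original code):

# get data.frame with word pair (bigram) frequencies per document
freqWord2 <- function(){
  library(tm)
  library(RWeka)
  # load all 3 cleaned up training files
  docs <- VCorpus(DirSource("../DataSet/clean_data/"))
  # setup tokenizer for word pairs (bigrams)
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  # generate TermDocumentMatrix and convert into a frequency matrix
  tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))
  data.frame(as.matrix(tdm))
}

freq2_tdm_M = freqWord2()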

And take a look at some characteristics:

  • total number of word pairs
  • probability of occurrence per document

Here are the top 20 word pairs per document.

##    en_US.blogs_train.txt en_US.news_train.txt en_US.twitter_train.txt
## 1                 of the               of the                     i m
## 2                 in the               in the                    it s
## 3                 to the               to the                   don t
## 4                   it s               on the                  in the
## 5                 on the              for the                 for the
## 6                  to be                 it s                  of the
## 7                    i m               at the                  on the
## 8                and the              and the                   to be
## 9                  and i                 in a                   can t
## 10               for the                to be                  to the
## 11                 don t             with the              thanks for
## 12                 i was             from the                  you re
## 13                i have              he said                  if you
## 14                at the               with a                  at the
## 15                it was                 of a                  i love
## 16                 it is                 as a                  that s
## 17              with the                don t                   i can
## 18                that i                for a               thank you
## 19                  is a                 is a                going to
## 20                  in a               it was                  have a
##     sumfreq
## 1    of the
## 2    in the
## 3      it s
## 4       i m
## 5    to the
## 6   for the
## 7    on the
## 8     don t
## 9     to be
## 10   at the
## 11  and the
## 12     in a
## 13 with the
## 14    and i
## 15     is a
## 16   it was
## 17    for a
## 18 from the
## 19   i have
## 20   if you

You can see that the ranking differs slightly between the documents.

A plot of the probability of occurrence of word pairs across the different documents shows a different situation than for single-word frequencies:

  • there is a high variance in the probability of occurrence of word pairs; it is even higher than for single words
  • the documents are dominated by only a few hundred different word pairs, though more than for single words
  • the variance of the word pair probabilities in en_US.news.txt is smaller than in en_US.blogs.txt, and the difference between the documents is larger

Note that the plot has a logarithmic scale on the y-axis.

Awareness

The size of the model is too large; more restrictions are needed:

  • Build another model without stopwords.
  • Identify words from other languages, e.g. evaluate the data files in other languages provided by the Capstone Dataset to detect foreign-language words.
  • Take care of line breaks when extracting word pairs.
  • Remove the remaining junk words like "aaaa" or "uuuhh".
  • Compress the model.
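
As a first step toward the stopword-free model mentioned above, stopword removal could be added to the cleaning step. A minimal sketch using the tm package (clean_line is only a hypothetical example string, not part of the current pipeline):

library(tm)
# remove English stopwords from an already-cleaned line, then collapse spaces
clean_line <- "this is a sample line from the training data"
no_stop    <- removeWords(clean_line, stopwords("en"))
gsub("\\s+", " ", no_stop)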

Conclusion

We found some interesting results, but the model still needs to be optimized and compressed.