Introduction

For this milestone report, I will discuss what I have accomplished so far for the capstone project, including procedures for cleaning and exploring the data, Markov models, and planned features to be implemented. Though I have been stumped by various issues many times during the project, I've learned an enormous amount about Natural Language Processing and have enjoyed the process tremendously.

Cleaning Data

While most of the news data is fairly regular, the blog data has more variability in the language and symbols used, and the twitter data has by far the most variation and complexity that has to be addressed before the data can be used. A series of regular expressions is used here to clean the data.

Note: I took advantage of the chaining operator %>% in the dplyr package to chain together the regular expression operations, which are all executed through various functions from the stringi package.

I took the following procedures to clean the data:

  1. Removed all characters that are not letters, numbers, or common symbols
  2. Captured date data in the Gregorian calendar format (variations of MM/DD/YYYY) and replaced it with a <DATE> tag
  3. Captured time data (variations of HH:MM:SSAM/PM) and replaced it with a <TIME> tag
  4. Captured number data (variations of $XX,XXX,XXX.XXX(%)) and replaced it with a <NUM> tag
  5. Captured phone numbers (variations of 1(XXX)XXX-XXXX) and replaced them with a <PHONE> tag
  6. Captured emoticons composed of various symbol, letter, and number combinations, as defined here, and replaced them with an <EMOJI> tag
  7. Broke up sentences in each observation, defined as any text ending in !, ?, or .
  8. Removed all extraneous symbols left over after the above cleaning steps
  9. Split each phrase into individual words

The following code is what I used to parse the data:

# required packages: dplyr provides the %>% pipe, stringi the string functions
library(dplyr)
library(stringi)

# start function
parse <- function (line, n=1){
   line <- line %>%
      # removing all irregular symbols
      stri_replace_all(regex = "[^ a-zA-Z0-9!\"#$%&'\\()*+,-./:;<=>?@^_`{}|~\\[\\]]|\"", replacement = "") %>% 
      # captured Dates
      stri_replace_all(regex = " ([0-1][1-2]|[1-9])[-/]([0-3][0-9]|[1-9])([-/]([0-9]{4}|[0-9]{2}))? ", replacement = " <DATE> ") %>% 
      # captured Time
      stri_replace_all(regex =" [0-2]?[0-9][:-][0-6]?[0-9]([AaPpMm.]*)? ", replacement =" <TIME> ") %>% 
      # captured numbers
      stri_replace_all(regex =" [$]?([0-9,]+)?([0-9]+|[0-9]+[.][0-9]+|[.][0-9]+)(%|th|st|nd)? ",replacement= " <NUM> ") %>% 
      # captured phone numbers
      stri_replace_all(regex ="1?[-(]?[0123456789]{3}[-.)]?[0-9]{3}[-.]?[0-9]{4}|[0-9]{10}", replacement= " <PHONE> ") %>% 
      # captured emoticon
      stri_replace_all(regex =" [<>0O%]?[:;=8]([-o*']+)?([()dDpP/}{#@|oOcC]|\\[|\\])+|([()dDpP/}{#@|cC]|\\[|\\])+([-o*']+)?[:;=8][<>]?|<3+|</+3+|[-oO0><^][_.]+[-oO0><^]", replacement = " <EMOJI><BREAK>") %>% 
      # break up sentences
      stri_replace_all(regex ="[!?]+ |([ a-zA-Z0-9]{3})[.] ", replacement = "$1<BREAK>") %>%
      # remove extraneous symbols left over
      stri_replace_all(regex ="[^ a-zA-Z0-9#@<>]+", replacement = " ") %>% stri_trim_both() %>%
      # split sentences into different string vectors
      stri_split(fixed = "<BREAK>", omit_empty = T) %>%
      # split up by word
      lapply(function (i) stri_split_boundaries(i, type="word", skip_word_none=TRUE)) %>% 
      # remove unnecessary list format
      unlist(recursive = F)
  return(line)
}
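
For illustration, here is a minimal usage sketch of the function above (the sample line below is made up for this example):

# a made-up raw line for illustration
sample_line <- "I paid $25.99 on 12/25/2016 and called 1(555)123-4567 yesterday. See you soon!"
parse(sample_line)
# returns a list with one character vector of word tokens per sentence,
# with the tag substitutions applied before the text is split into words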

Summaries of and Observations From Data

Using the processed data from above, we have the summaries for the three sets of data below:

            Line Count   Word Count
Twitter        2360148       579524
Blog            899288       545103
News           1010242       451206
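
For reference, here is a rough sketch of how such a summary might be computed, assuming both the raw lines and the tokenized output of parse() are available; the word counts in the table above appear to be unique-word (vocabulary) counts, so that is what this hypothetical helper reports:

# hypothetical helper: line count and unique-word count for one data set,
# given the raw lines and the tokenized sentences returned by parse()
corpus_summary <- function(lines, tokens){
  data.frame(line_count = length(lines),
             word_count = length(unique(tolower(unlist(tokens)))))
}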

Comparisons for the top 10 words from all three sets of data:

Comparisons for average sentence and word lengths from all three sets of data:

Observations:

Planned Explorations and Modeling

I am currently experimenting with different data structures for building the ngram models. Because of the substantial amount of data, I am finding it difficult to pinpoint an efficient way to build the prediction models.

For the ngram models, otherwise known as Markov chains, we are effectively taking every consecutive group of words (i.e. groups of 2, 3, 4, etc.), counting how many times each specific combination of words appears, and then using the probability of occurrence to predict the next word.
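
As a concrete sketch of that idea, assuming a trigram count table like the one built below (columns n1, n2, n3 and a count column value), prediction reduces to filtering on the last two observed words and picking the continuation with the highest estimated probability. The predict_next function here is hypothetical, not part of my current code:

library(data.table)

# hypothetical predictor: given the two preceding words, return the most
# likely third word from a trigram count table with columns n1, n2, n3, value
predict_next <- function(trigram, w1, w2){
  # keep only trigrams whose first two words match the observed context
  candidates <- trigram[n1 == tolower(w1) & n2 == tolower(w2)]
  if(nrow(candidates) == 0) return(NA_character_)
  # estimated probability of each continuation given (w1, w2)
  probs <- candidates$value / sum(candidates$value)
  candidates$n3[which.max(probs)]
}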

I used the data.table package to store the ngrams constructed from all three data sources. Below is one of the functions I drafted for creating a trigram (three words, e.g. “Let me know”) count table.

# required package for the count table
library(data.table)

# create empty trigram data table
trigram <- data.table(n1 = character(0), n2 = character(0), n3 = character(0), value = numeric(0))
# function to store trigram counts from a list of tokenized sentences
tri <- function(list){
  lapply(list, function(phrase){
      # skip phrases that are too short to contain a trigram
      if(length(phrase) < 3) return(NULL)
      # loop through each consecutive combination of 3 words
      for(i in 1:(length(phrase) - 2)){
        # convert to lower case
        lower <- tolower(phrase[i:(i+2)])
        # evaluate whether the combination already exists
        if(nrow(trigram[n1 == lower[1] & n2 == lower[2] & n3 == lower[3]]) == 0){
            # if not, add it to the dictionary with a count of 1
            trigram <<- rbindlist(list(trigram, list(lower[1], lower[2], lower[3], 1)))
        } else {
            # if it already exists, increment the count by 1 (data.table updates
            # by reference, so the global table is modified without <<-)
            trigram[n1 == lower[1] & n2 == lower[2] & n3 == lower[3], value := value + 1]
        }
      }
    })
  invisible(trigram)
}
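
A minimal sketch of how these pieces might fit together, assuming the parse() function and the empty trigram table above have already been created (the sample sentence is made up):

# tokenize a sample line and accumulate its trigram counts
tokens <- parse("thank you for the follow, let me know if you need anything.")
tri(tokens)
# inspect the most frequent trigrams collected so far
head(trigram[order(-value)])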

Planned Explorations: