For this milestone report, I will discuss what I have accomplished so far for the capstone project, including procedures for cleaning and exploring the data, Markov models, and the features I plan to implement. Though I have been stumped by various issues many times during the project, I have learned an enormous amount about Natural Language Processing and have enjoyed the process tremendously.
While most of the news data is fairly regular, the blog data has more variability in language and symbols, and the twitter data has by far the most variation and complexity that must be addressed before the data can be used. A series of regular expressions is used here to clean the data.
Note: I took advantage of the pipe operator %>% (from the magrittr package, re-exported by dplyr) to chain together the regular expression operations, which are all executed through functions from the stringi package.
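For example, the pipe passes its left-hand side in as the first argument of the next call, so `x %>% f(a)` is equivalent to `f(x, a)`. A toy illustration (the sample string is my own):

library(dplyr)   # re-exports %>% from magrittr
library(stringi)

# collapse runs of spaces into a single space, written as a pipe
"too   many   spaces" %>%
  stri_replace_all(regex = " +", replacement = " ")
#> [1] "too many spaces"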
I took the following steps to clean the data:
- `<DATE>` tag
- `<TIME>` tag
- `<NUM>` tag
- `<PHONE>` tag

The following code is what I used to parse the data:
# function to clean a line of raw text and split it into tokenized sentences
parse <- function(line) {
  line <- line %>%
    # remove all irregular symbols
    stri_replace_all(regex = "[^ a-zA-Z0-9!\"#$%&'\\()*+,-./:;<=>?@^_`{}|~\\[\\]]|\"", replacement = "") %>%
    # capture dates (month 1-12, day, optional 2- or 4-digit year)
    stri_replace_all(regex = " (0?[1-9]|1[0-2])[-/]([0-3][0-9]|[1-9])([-/]([0-9]{4}|[0-9]{2}))? ", replacement = " <DATE> ") %>%
    # capture times
    stri_replace_all(regex = " [0-2]?[0-9][:-][0-6]?[0-9]([AaPpMm.]*)? ", replacement = " <TIME> ") %>%
    # capture numbers, currency amounts, percentages, and ordinals
    stri_replace_all(regex = " [$]?([0-9,]+)?([0-9]+|[0-9]+[.][0-9]+|[.][0-9]+)(%|th|st|nd|rd)? ", replacement = " <NUM> ") %>%
    # capture phone numbers
    stri_replace_all(regex = "1?[-(]?[0-9]{3}[-.)]?[0-9]{3}[-.]?[0-9]{4}|[0-9]{10}", replacement = " <PHONE> ") %>%
    # capture emoticons
    stri_replace_all(regex = " [<>0O%]?[:;=8]([-o*']+)?([()dDpP/}{#@|oOcC]|\\[|\\])+|([()dDpP/}{#@|cC]|\\[|\\])+([-o*']+)?[:;=8][<>]?|<3+|</+3+|[-oO0><^][_.]+[-oO0><^]", replacement = " <EMOJI><BREAK>") %>%
    # break up sentences at terminal punctuation
    stri_replace_all(regex = "[!?]+ |([ a-zA-Z0-9]{3})[.] ", replacement = "$1<BREAK>") %>%
    # remove extraneous symbols left over
    stri_replace_all(regex = "[^ a-zA-Z0-9#@<>]+", replacement = " ") %>%
    stri_trim_both() %>%
    # split the line into separate sentence strings
    stri_split(fixed = "<BREAK>", omit_empty = TRUE) %>%
    # split each sentence into words
    lapply(function(i) stri_split_boundaries(i, type = "word", skip_word_none = TRUE)) %>%
    # flatten one level so the result is a list of word vectors
    unlist(recursive = FALSE)
  return(line)
}
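As a quick sanity check, here is a minimal usage sketch (the sample sentence is my own invention):

# the phone number, date, time, and price are tagged before tokenization
parse("Call 555-123-4567 on 5/12/14 at 10:30AM. I paid $5.99 for it!")
# the result is a list of word vectors, one per sentence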
Using the processed data from above, we have the summary of the three data sets below:
| Source | Line Count | Word Count |
|---|---|---|
| Twitter | 2360148 | 579524 |
| Blog | 899288 | 545103 |
| News | 1010242 | 451206 |
Comparison of the top 10 words across all three data sets:
Comparison of average sentence and word lengths across all three data sets:
Observations:
I am currently experimenting with different data structures to construct the n-gram models. Because of the substantial amount of data, I am finding it difficult to pinpoint an efficient way of building the prediction models.
For n-gram models, also known as Markov chains, we effectively take every consecutive group of words (i.e., groups of 2, 3, 4, etc.), count how many times each specific combination of words appears, and then use the probability of occurrence to predict the next word.
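A minimal sketch of the idea, using toy counts I made up rather than the project data:

# toy trigram counts for continuations of "let me"
counts <- c(know = 7, see = 2, go = 1)

# P(next word | "let me") is each count divided by the total
probs <- counts / sum(counts)
probs
#> know  see   go
#>  0.7  0.2  0.1

# predict the most probable next word
names(which.max(probs))
#> [1] "know"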
I used the data.table package to store the n-grams constructed from all three data sources. Below is one of the functions I drafted for creating a trigram (three words, e.g. “Let me know”) count table.
library(data.table)

# create an empty trigram count table
trigram <- data.table(n1 = character(0), n2 = character(0),
                      n3 = character(0), value = numeric(0))

# function to count every trigram in a list of tokenized sentences
tri <- function(sentences) {
  lapply(sentences, function(phrase) {
    # skip sentences too short to contain a trigram
    if (length(phrase) < 3) return(NULL)
    # loop through each group of 3 consecutive words
    for (i in 1:(length(phrase) - 2)) {
      # convert to lower case
      lower <- tolower(phrase[i:(i + 2)])
      # evaluate whether the combination already exists
      if (nrow(trigram[n1 == lower[1] & n2 == lower[2] & n3 == lower[3]]) == 0) {
        # if not, add it to the dictionary with a count of 1
        trigram <<- rbindlist(list(trigram, list(lower[1], lower[2], lower[3], 1)))
      } else {
        # if it already exists, increment the count by 1 (by reference)
        trigram[n1 == lower[1] & n2 == lower[2] & n3 == lower[3], value := value + 1]
      }
    }
  })
  invisible(NULL)
}
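A quick usage sketch with hand-made sentences (my own, not project data) to show how the table would drive prediction:

# count trigrams from two toy tokenized sentences
tri(list(c("Let", "me", "know", "if", "you", "can"),
         c("let", "me", "know", "when")))

# look up the most frequent word following "let me"
trigram[n1 == "let" & n2 == "me"][order(-value)][1, n3]
#> [1] "know"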
Planned Explorations: