The goal of this report is to explore messages from Twitter, news and blogs through text mining in R. We use the tm package for text mining, SnowballC for stemming documents and Rgraphviz for plotting connections between terms. The final objective is to build n-gram probability models for text prediction.

#Load Packages
library(tm)         #text mining (loads NLP)
library(SnowballC)  #stemming

Reading Files

#Read Files (first 20,000 lines of each)
blogs   <- readLines("./en_US.blogs.txt",   n = 20000, skipNul = TRUE)
news    <- readLines("./en_US.news.txt",    n = 20000, skipNul = TRUE)
twitter <- readLines("./en_US.twitter.txt", n = 20000, skipNul = TRUE)
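
Reading only the first 20,000 lines keeps memory use down but biases the sample toward the start of each file. A minimal alternative sketch, assuming the full files fit in memory (blogs_all is just an illustrative name; the seed and sample size are arbitrary):

#Alternative: draw a random sample of 20,000 lines instead of the first 20,000
set.seed(1234)   #for reproducibility
blogs_all <- readLines("./en_US.blogs.txt", skipNul = TRUE)
blogs     <- sample(blogs_all, 20000)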

Corpus

A corpus is a collection of writings, conversations, speeches, etc., that people use to study and describe a language.

Our corpus is all the text in the news, blogs and Twitter files.

It is good practice to clean up the corpus, so we remove all punctuation, numbers, profanity and extra white space: characters and words we do not want to predict. We keep the stop words, because these are words we do want to predict while someone is writing.

Next we stem the corpus. In many cases, words need to be stemmed to retrieve their radicals. For instance, “example” and “examples” are both stemmed to “exampl”. Afterwards, one may want to complete the stems back to their original forms, so that the words look “normal”.
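
tm provides stemCompletion() for that last step. A minimal sketch, where docs_unstemmed is a hypothetical copy of the corpus saved before stemming:

#docs_unstemmed would be a copy of docs taken before the stemDocument step below
stemCompletion("exampl", dictionary = docs_unstemmed,
               type = "prevalent")   #-> "example" if it is the most common completion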

docs<-VCorpus(VectorSource(c(blogs,news,twitter)))

#Removing Punctuation
docs<- tm_map(docs, removePunctuation)

#Removing Numbers
docs<- tm_map(docs, removeNumbers)

#Transform to lower case
docs<- tm_map(docs, content_transformer(tolower))

#Removing Profanity Words (Most popular profanity words in Facebook - http://www.slate.com/blogs/lexicon_valley/2013/09/11/top_swear_words_most_popular_curse_words_on_facebook.html)
docs<- tm_map(docs, removeWords, c('asshole','bastard','bitch','cock','crap','damn','darn','dick','douche','fag','fuck','piss','pussy','shit','slut'))   

#Remove white spaces
docs<- tm_map(docs, stripWhitespace)

#Stem Document
docs<- tm_map(docs, stemDocument)

#Ensure tm treats each document as plain text
docs <- tm_map(docs, PlainTextDocument)   
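
To verify the pipeline, one can print a cleaned document, e.g.:

#Inspect the first cleaned document
writeLines(as.character(docs[[1]]))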

N-grams

An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. An n-gram of size 1 is referred to as a “unigram”, size 2 a “bigram” and size 3 a “trigram”.
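
As a quick word-level illustration, using ngrams() from the NLP package (which tm loads for us) on a toy sentence of our own:

s <- c("to", "be", "or", "not", "to", "be")
vapply(ngrams(s, 2L), paste, "", collapse = " ")
## [1] "to be"  "be or"  "or not" "not to" "to be"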

#N-grams
#Unigram
UnigramTokenizer <-  function(x)
                     unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)

tdm_unigram <- TermDocumentMatrix(docs, control = list(tokenize = UnigramTokenizer))

#Bigram
BigramTokenizer <-  function(x)
                    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

tdm_bigram <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer))

#Trigram
TrigramTokenizer <-  function(x)
                     unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

tdm_trigram <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer))


#Summary (cuts by the top 20 words)
findFreqTerms(tdm_unigram,lowfreq=5000)
##  [1] "about" "all"   "and"   "are"   "but"   "for"   "from"  "have" 
##  [9] "his"   "not"   "one"   "said"  "that"  "the"   "they"  "this" 
## [17] "was"   "will"  "with"  "you"
#plot(tdm_unigram, terms = findFreqTerms(tdm_unigram, lowfreq = 5000),corThreshold = 0.2)

findFreqTerms(tdm_bigram,lowfreq=1500)
##  [1] "and i"    "and the"  "at the"   "for a"    "for the"  "from the"
##  [7] "go to"    "in a"     "in the"   "is a"     "it is"    "it was"  
## [13] "of a"     "of the"   "on the"   "to be"    "to the"   "want to" 
## [19] "with a"   "with the"
#plot(tdm_bigram, terms = findFreqTerms(tdm_bigram, lowfreq = 1500),corThreshold = 0.07)

findFreqTerms(tdm_trigram,lowfreq=180)
##  [1] "a coupl of"      "a lot of"        "as well as"     
##  [4] "be abl to"       "end of the"      "go to be"       
##  [7] "i have to"       "i want to"       "it was a"       
## [10] "look forward to" "one of the"      "out of the"     
## [13] "part of the"     "some of the"     "thank for the"  
## [16] "the end of"      "the first time"  "the rest of"    
## [19] "there is a"      "to be a"
#plot(tdm_trigram, terms = findFreqTerms(tdm_trigram, lowfreq = 180),corThreshold = 0.01)
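
The term-document matrices above contain the raw counts needed for the n-gram probability models this report aims at. A minimal sketch of a maximum-likelihood bigram model, assuming the matrices fit in memory and using row_sums() from the slam package (the sparse-matrix package tm builds on):

#Maximum-likelihood bigram model: P(w2|w1) = count(w1 w2) / count(w1)
library(slam)
uni_counts <- row_sums(tdm_unigram)
bi_counts  <- row_sums(tdm_bigram)
first_word <- vapply(strsplit(names(bi_counts), " "), `[`, "", 1)
bigram_prob <- bi_counts / uni_counts[first_word]
#Bigrams whose first word is absent from the unigram matrix (e.g. words
#shorter than the default wordLengths of 3) come out NA; drop them
bigram_prob <- bigram_prob[!is.na(bigram_prob)]
head(sort(bigram_prob, decreasing = TRUE), 10)

A real predictor would also need smoothing or backoff to handle bigrams never seen in training; the sketch above assigns them zero probability.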