Synopsis

Environment config

The first step is to set up the environment.

Load the libraries that support the analysis:

library(tm)
library(ggplot2)
library(dplyr)
library(RWeka)
library(stringi)
library(formattable)
library(SnowballC)
library(parallel)
library(wordcloud)

# Add parallel processing to minimize runtime
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(RWeka)))

Datasets

In this step, we load all three datasets (blogs, Twitter and news). To keep processing time manageable, we sample 1% of the lines from the blog and news files and 0.1% of the lines from the Twitter file.

Loading

set.seed(20170517)

# Loading, Sampling and Summarizing Blog Dataset
blogs.ds <- readLines("en_US.blogs.txt")
blogs.ds.summary <- c(stri_stats_general(blogs.ds), stri_stats_latex(blogs.ds)[4])
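# Note: rbinom() draws the indices from Binomial(n, 0.5), so they cluster around the
# middle of the file and can repeat; the same applies to the news and Twitter samples
blogs.sample <- blogs.ds[rbinom(length(blogs.ds)*0.01, length(blogs.ds), 0.50)]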
blogs.sample.summary <- c(stri_stats_general(blogs.sample), stri_stats_latex(blogs.sample)[4])
# release memory
rm(blogs.ds)

# Loading, Sampling and Summarizing News Dataset
news.ds <- readLines("en_US.news.txt")
news.ds.summary <- c(stri_stats_general(news.ds), stri_stats_latex(news.ds)[4])
news.sample <- news.ds[rbinom(length(news.ds)*0.01, length(news.ds), 0.50)]
news.sample.summary <- c(stri_stats_general(news.sample), stri_stats_latex(news.sample)[4])
# release memory
rm(news.ds)

# Loading, Sampling and Summarizing Twitter Dataset
twitter.ds <- readLines("en_US.twitter.txt")
twitter.ds.summary <- c(stri_stats_general(twitter.ds), stri_stats_latex(twitter.ds)[4])
twitter.sample <- twitter.ds[rbinom(length(twitter.ds)*0.001, length(twitter.ds), 0.50)]
twitter.sample.summary <- c(stri_stats_general(twitter.sample), stri_stats_latex(twitter.sample)[4])
# release memory
rm(twitter.ds)
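
The sampling above picks line indices with rbinom(k, n, 0.5), which returns k draws centred on n/2 with a standard deviation of sqrt(n)/2, so the selected lines come from a narrow band in the middle of each file and can repeat. If a uniform sample without repeats is wanted, base R's sample() is one alternative. The sketch below is only an illustration: the toy.lines vector is a hypothetical stand-in for the output of readLines(), and this is not the code that produced the figures in this report.

```r
set.seed(20170517)

# Hypothetical stand-in for a character vector returned by readLines()
toy.lines <- paste("line", 1:1000)

# Uniform 1% sample of line indices, without replacement
idx.uniform <- sample(length(toy.lines), size = floor(length(toy.lines) * 0.01))
uniform.sample <- toy.lines[idx.uniform]

# For comparison: the rbinom() approach keeps indices close to n/2
idx.binom <- rbinom(floor(length(toy.lines) * 0.01), length(toy.lines), 0.50)

range(idx.uniform)  # can fall anywhere between 1 and 1000
range(idx.binom)    # stays near 500
```

Switching to sample() would change which lines are selected, so every count reported in the tables of this report would change as well.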

Summary

Dataset  Type    Lines    Non-empty lines  Chars      Non-whitespace chars  Words
Blog     Full    899288   899288           208361438  171926076             37865888
Blog     Sample  8992     8992             2088794    1724016               377286
News     Full    77259    77259            15683765   13117038              2665742
News     Sample  772      772              150140     125470                25269
Twitter  Full    2360148  2360148          162384825  134370864             30578891
Twitter  Sample  2360     2360             162182     134143                30485
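
The *.summary vectors built in the loading step can be bound into a table like the one above. The following is a minimal sketch, assuming the vectors carry the element names Lines, LinesNEmpty, Chars, CharsNWhite and Words; the values are copied from the Blog rows of the table so that the block runs without the data files, and the name summary.df is hypothetical.

```r
# Values copied from the Blog rows of the summary table
blogs.ds.summary <- c(Lines = 899288, LinesNEmpty = 899288, Chars = 208361438,
                      CharsNWhite = 171926076, Words = 37865888)
blogs.sample.summary <- c(Lines = 8992, LinesNEmpty = 8992, Chars = 2088794,
                          CharsNWhite = 1724016, Words = 377286)

# Stack the two named vectors into a matrix, one row per summary
stats <- rbind(blogs.ds.summary, blogs.sample.summary)
rownames(stats) <- NULL

# Add the label columns to obtain one table row per dataset/type pair
summary.df <- data.frame(Dataset = "Blog", Type = c("Full", "Sample"), stats)

# Share of lines kept by the sample (close to the intended 1%)
sample.share <- summary.df$Lines[2] / summary.df$Lines[1]
sample.share
```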

Dataset pre-processing (cleaning and preparing data)

In this section, we apply a series of transformations to the data to reduce complexity and remove details that are not relevant to the analysis.

Removing

We start by removing the following from the data:

  • Special characters
  • Numbers
  • Punctuation
  • Profanity (word list from Luis von Ahn’s Research Group / CMU)
  • Stopwords
  • Additional white spaces

# Consolidate samples into one object
corpus.samples <- Corpus(VectorSource(c(blogs.sample, news.sample, twitter.sample)))

# Release memory
rm(blogs.sample, news.sample, twitter.sample)

# Function to replace patterns with whitespace
funPatternToSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Remove special characters
corpus.samples <- tm_map(corpus.samples, funPatternToSpace,"\"|/|@|\\|")

# Remove Numbers
corpus.samples <- tm_map(corpus.samples, removeNumbers)

# Remove Punctuation
corpus.samples <- tm_map(corpus.samples, removePunctuation)

# Remove Profanity words (from Luis von Ahn's Research Group)
profanity.ds <- read.csv(url("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"), header = FALSE, col.names = c("word"))
corpus.samples <- tm_map(corpus.samples, removeWords, profanity.ds$word)

# Release memory
rm(profanity.ds, funPatternToSpace)

# Remove Stopwords
corpus.samples <- tm_map(corpus.samples, removeWords, stopwords("english"))

# Remove Additional white spaces
corpus.samples <- tm_map(corpus.samples, stripWhitespace)
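
To make the special-character step concrete, the pattern passed to funPatternToSpace matches a double quote, a slash, an at sign or a pipe. The sketch below applies the same pattern to a single string with base R only; the input string is a hypothetical example, and the final trimws() call is an addition for display that the pipeline above does not perform.

```r
# Same pattern as in the report: double quote, slash, at sign or pipe
pattern <- "\"|/|@|\\|"

# Hypothetical input line
raw <- "a/b@c|d \"e\""

# Replace every match with a space, as funPatternToSpace does
spaced <- gsub(pattern, " ", raw)

# Collapse runs of whitespace, then trim the ends for display
cleaned <- trimws(gsub("\\s+", " ", spaced))
cleaned
```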

Transforming

To finish pre-processing, we transform the remaining text before looking for patterns. The corpus is changed as follows:

  • Lower case
  • Stemming (reduce words to their stems)
  • Plain text

# Transform to lower case (content_transformer keeps the results as tm documents)
corpus.samples <- tm_map(corpus.samples, content_transformer(tolower))

# Stemming (reduce words to their stems)
corpus.samples <- tm_map(corpus.samples, stemDocument, language="english")

# Transform again to plain text
corpus.samples <- tm_map(corpus.samples, PlainTextDocument)
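
One ordering detail is worth checking. The English stopword list in tm is lower case and removeWords() matches case-sensitively, so removing stopwords before lowercasing leaves capitalised forms such as "The" or "I" in the text; the 1-gram table in the exploratory analysis lists i, the and it, all of which are on that list. The sketch below illustrates the effect with base R only. The helper removeWordsSketch, the squish function and the three-word stop list are hypothetical stand-ins, not tm functions.

```r
# Hypothetical stand-in for tm::removeWords(): delete whole-word matches
removeWordsSketch <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# Collapse repeated whitespace and trim the ends
squish <- function(x) trimws(gsub("\\s+", " ", x))

# Hypothetical lower-case stop list and input line
stop.words <- c("the", "it", "i")
line <- "The cat and the dog"

# Order used in this report: remove stopwords first, lowercase afterwards
lower.last <- squish(tolower(removeWordsSketch(line, stop.words)))

# Alternative order: lowercase first, then remove stopwords
lower.first <- squish(removeWordsSketch(tolower(line), stop.words))

lower.last
lower.first
```

In the first ordering the capitalised "The" survives and is only lowercased afterwards; in the second it is removed.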

Exploratory Analysis

n-gram

corpus.df <- data.frame(text=unlist(sapply(corpus.samples, identity)),stringsAsFactors=FALSE)

# Release memory
rm(corpus.samples)

#wordcloud(uniGram$Words, uniGram$Count, min.freq=100, colors=brewer.pal(6, "Dark2"))
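
The findNGrams() helper called in the sections below is not defined in this report. The following base-R sketch shows what such a helper might look like, assuming it takes the data frame with a text column, the n-gram size and the number of top entries to return. The name findNGramsSketch is hypothetical; since the report loads RWeka, the original may well have used its NGramTokenizer() for the tokenising step instead.

```r
# Hypothetical stand-in for findNGrams(); not the report's own definition
findNGramsSketch <- function(df, n, top) {
  # Build every run of n consecutive whitespace-separated tokens, line by line
  grams <- unlist(lapply(df$text, function(line) {
    tokens <- strsplit(trimws(line), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))

  # Nothing to count: return an empty table with the expected columns
  if (length(grams) == 0) {
    return(data.frame(Words = character(0), Count = integer(0)))
  }

  # Count each distinct n-gram and keep the most frequent ones
  counts <- as.data.frame(table(grams), stringsAsFactors = FALSE)
  names(counts) <- c("Words", "Count")
  counts <- counts[order(-counts$Count, counts$Words), ]
  head(counts, top)
}

# Usage on a hypothetical three-line input
demo.df <- data.frame(text = c("i love r", "i love tea", "tea time"),
                      stringsAsFactors = FALSE)
demo.top <- findNGramsSketch(demo.df, 2, 2)
demo.top
```

As in the tables below, the row names of the result are the positions of the n-grams in the alphabetically sorted frequency table.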

1-gram

uniGram <- findNGrams(corpus.df, 1, 20)
p <- ggplot(uniGram, aes(Words, Count)) + geom_col(fill="lightblue", color="darkblue") + labs(title="1-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(uniGram)

Row    Words  Count
5265   i      8234
10689  the    2103
7629   one    1449
11834  will   1309
4463   get    1285
6282   like   1213
1770   can    1151
10837  time   1140
5838   just   1055
4536   go     944
6541   make   862
2807   day    861
6430   love   856
7339   new    855
5632   it     840
12040  year   788
11914  work   780
11386  use    764
7493   now    757
5983   know   731

# Release memory
rm(uniGram, p)

2-gram

biGrams <- findNGrams(corpus.df, 2, 20)
p <- ggplot(biGrams, aes(Words, Count)) + geom_col(fill="red", color="darkred") + labs(title="2-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(biGrams)

Row    Words      Count
28282  i love     250
28550  i think    227
28242  i just     173
28600  i will     170
28588  i want     165
28251  i know     158
28082  i don’t    157
28029  i can      144
28084  i dont     142
28553  i thought  132
28317  i need     127
28577  i use      119
61381  time i     117
28158  i get      114
32513  know i     114
28129  i find     111
28268  i like     100
28121  i feel     98
33057  last year  95
28413  i realli   84

# Release memory
rm(biGrams, p)

3-gram

triGrams  <- findNGrams(corpus.df, 3, 20)
p <- ggplot(triGrams, aes(Words, Count)) + geom_col(fill="purple", color="darkblue") + labs(title="3-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(triGrams)

Row    Words                   Count
35210  i think i               48
34347  i know i                47
9049   boy big sword           36
43054  littl boy big           36
33794  i dont know             35
34643  i must say              35
33780  i don’t think           34
33802  i dont think            31
27119  gaston south carolina   30
68089  south carolina attract  30
40914  last night i            29
83576  work incred pleas       28
33620  i can get               27
33773  i don’t know            27
58462  pu bef th               27
35252  i thought i             26
34524  i love toast            24
44386  love toast mom          24
34290  i just love             23
41899  let just say            23

# Release memory
rm(triGrams, p)

4-gram

quadriGrams <- findNGrams(corpus.df, 4, 20)
p <- ggplot(quadriGrams, aes(Words, Count)) + geom_col(fill="green", color="black") + labs(title="4-gram") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
p

formattable(quadriGrams)

Row    Words                          Count
46968  littl boy big sword            36
29414  gaston south carolina attract  30
37607  i love toast mom               24
48413  love toast mom i               19
10981  buy time fell th               18
11862  can’t buy time fell            18
79098  th king john castl             18
66701  respond email data entri       16
1343   across page can find           15
1345   across photo entitl typhoon    15
6054   awesom pictur i ever           15
8922   blog regular near often        15
11343  came across photo entitl       15
11588  can find support tip           15
15728  complet unrel search pictur    15
17305  creativ kut scrap bug          15
20905  dont blog regular near         15
21959  easier life laughter hope      15
23032  enough i hope stumbl           15
23172  entitl typhoon parti okinawa   15

# Release memory
rm(quadriGrams, p)

Conclusion

Next Steps

  • Creating the prediction algorithm
      • Segment the analysis by source type (blog, news or social)
      • Improve data cleaning (for example, removing foreign-language text)
      • Find patterns in tokens
      • Take advantage of more advanced models, such as hidden Markov models

  • Developing a Shiny app
      • Create a simple user interface (based on messaging apps)
      • As the user types text, the Shiny app suggests a way to complete the sentence
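
As a rough illustration of how the n-gram counts could feed the prediction step listed above, the sketch below looks up the most frequent second word for a given first word in a bigram table. It is only one possible starting point, not the planned algorithm. The counts are copied from four rows of the 2-gram table in this report, and the function name suggestNext is hypothetical.

```r
# Four rows copied from the 2-gram table of this report
bigrams <- data.frame(
  Words = c("i love", "i think", "i just", "last year"),
  Count = c(250, 227, 173, 95),
  stringsAsFactors = FALSE
)

# Suggest the most frequent word that followed `previous` in the counts
suggestNext <- function(previous, counts) {
  parts <- strsplit(counts$Words, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  matches <- first == previous
  if (!any(matches)) return(NA_character_)  # unseen word: no suggestion
  second[matches][which.max(counts$Count[matches])]
}

suggestNext("i", bigrams)
suggestNext("last", bigrams)
suggestNext("zebra", bigrams)
```

A word that never appears as the first half of a bigram returns NA here; a fuller model would need some fallback, for example the most frequent 1-grams.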