Hint: the Java JDK must be installed on your computer to use the RWeka package.
library(knitr)
library(stringi)   # for statistical exploration
library(tm)        # for text mining
library(SnowballC) # for stemming
library(ggplot2)   # show plots
library(gridExtra) # show multiple plots with ggplot2
# run this command in macOS first so rJava can find the JVM:
# "sudo ln -s $(/usr/libexec/java_home)/jre/lib/server/libjvm.dylib /usr/local/lib"
library(xlsx)      # workaround to load rJava, which is needed for RWeka
library(RWeka)     # generate ngrams
library(dplyr)     # text wrangling
library(tidyr)     # text wrangling
library(wordcloud)
library(tidytext)  # text wrangling
library(ggraph)    # plot network graph
library(igraph)    # plot network graph
The objectives are:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in
2. Create a basic report of summary statistics about the data sets
3. Report any interesting findings that you amassed so far
4. Get feedback on your plans for creating a prediction algorithm and Shiny app
The solution design is focused on reaching the objectives above.
First we change the working directory to the target directory that contains the data.
opts_chunk$set(root.dir = '/Users/Ramon/Documents/R/DS Capstone Project/final/en_US/')
knitr::opts_knit$set(root.dir = '/Users/Ramon/Documents/R/DS Capstone Project/final/en_US/')
As a requirement, the data must be downloaded and unzipped. You can find my approach in this file on github.
Afterwards we can load the data in. We read the data as UTF-8. This encoding has established itself as a standard, is user-friendly and well supported. For details click here.
us.blogs <- readLines("en_US.blogs.txt", skipNul = TRUE, encoding = "UTF-8")
us.news <- readLines("en_US.news.txt", skipNul = TRUE, encoding = "UTF-8")
us.twitter <- readLines("en_US.twitter.txt", skipNul = TRUE, encoding = "UTF-8")
We perform simple commands like head() and summary(). Additionally we generate three histograms that show the frequency of text entries by number of words.
# Set number display to decimal format
options(scipen = 999)
format(1e6, big.mark=",", scientific=FALSE)
## [1] "1,000,000"
# use head and summary regarding the US blogs example
head(us.blogs, 3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
# more character analysis
(statistics.us.blogs <- stri_stats_general(us.blogs))
##       Lines LinesNEmpty       Chars CharsNWhite
##      899288      899288   206824382   170389539
(statistics.us.news <- stri_stats_general(us.news))
##       Lines LinesNEmpty       Chars CharsNWhite
##     1010242     1010242   203223154   169860866
(statistics.us.twitter <- stri_stats_general(us.twitter))
##       Lines LinesNEmpty       Chars CharsNWhite
##     2360148     2360148   162096241   134082806
# textual analysis
words.us.blogs <- stri_count_words(us.blogs)
words.us.news <- stri_count_words(us.news)
words.us.twitter <- stri_count_words(us.twitter)
# summaries
summary(words.us.blogs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    9.00   28.00   41.75   60.00 6726.00
summary(words.us.news)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   19.00   32.00   34.41   46.00 1796.00
summary(words.us.twitter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   12.00   12.75   18.00   47.00
a <- sd(words.us.blogs)
b <- sd(words.us.news)
c <- sd(words.us.twitter)
# standard deviations
c(a,b,c)
## [1] 46.58893 22.82617  6.91223
q1 <- qplot(words.us.blogs,
            geom="histogram",
            binwidth = 20,
            main = "Frequency Blog messages by number of words",
            fill=I("blue"),
            alpha=I(.5),
            xlim=c(0,300))
q2 <- qplot(words.us.news,
            geom="histogram",
            binwidth = 10,
            main = "Frequency News messages by number of words",
            fill=I("blue"),
            alpha=I(.5),
            xlim=c(0,150))
q3 <- qplot(words.us.twitter,
            geom="histogram",
            binwidth = 5,
            main = "Frequency Twitter messages by number of words",
            fill=I("blue"),
            alpha=I(.5),
            xlim=c(0,40))
grid.arrange(q1, q2, q3, nrow = 3)

# Check length of objects
(us.blogs.length <- length(us.blogs))
## [1] 899288
(us.news.length <- length(us.news))
## [1] 1010242
(us.twitter.length <- length(us.twitter))
## [1] 2360148
We see that all three objects contain more than 800’000 lines. For performance reasons (running time and memory usage) we sample the objects. A sample size of 5’000 lines per object is sufficient for this exploratory analysis.
set.seed(22)
us.blogs.sample <- us.blogs[sample(1:length(us.blogs), 5000)]
us.news.sample <- us.news[sample(1:length(us.news), 5000)]
us.twitter.sample <- us.twitter[sample(1:length(us.twitter), 5000)]
sample <- c(us.blogs.sample, us.news.sample, us.twitter.sample)
writeLines(sample, "./sample/sampleAll.text")
In the next step we remove objects that are no longer needed to free up RAM.
rm(us.blogs, us.news, us.twitter, us.blogs.sample, us.news.sample, us.twitter.sample)
We create our corpus from the sample data generated above.
temporary <- file.path(".", "sample")
# important note: use VCorpus instead of Corpus to avoid problems generating ngrams
us.corpus <- VCorpus(DirSource(temporary))
rm(temporary)
Our goal here is to achieve tidy data, which is the foundation for the analysis. We perform the following cleaning steps:
- convert to lower case
- remove URLs
- remove punctuation and numbers
- remove stopwords
- remove extra whitespace
Special thanks to Yanchang Zhao, whose presentation “Text Mining with R - Twitter Data Analysis” provided these cleaning steps.
us.corpus.copy <- us.corpus
us.corpus <- tm_map(us.corpus, content_transformer(tolower))
remove.URL <- function(x) gsub("http[^[:space:]]*", "", x)
us.corpus <- tm_map(us.corpus, content_transformer(remove.URL))
#remove punctuation and numbers
remove.num.punctuation <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
us.corpus <- tm_map(us.corpus, content_transformer(remove.num.punctuation))
us.corpus <- tm_map(us.corpus, removeWords, stopwords("english"))
Now we perform stemming to reduce complexity without severe loss of information.
us.corpus <- tm_map(us.corpus, stemDocument)
#writeLines(as.character(us.corpus), con="cleanCorpus.txt")
Next we perform tokenization, i.e. we split the text into individual tokens. We then create n-grams of one to three words (unigrams, bigrams and trigrams) and transform the corpus into document-term matrices on which we perform the further analysis.
Tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unigram.dtm <- DocumentTermMatrix(us.corpus,
                                  control = list(tokenize = Tokenizer))
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram.dtm <- DocumentTermMatrix(us.corpus,
                                 control = list(tokenize = BigramTokenizer))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
trigram.dtm <- DocumentTermMatrix(us.corpus,
                                  control = list(tokenize = TrigramTokenizer))
First we are interested in the terms above the 99.95th or 99.99th percentile of the frequencies. These are the terms with the most occurrences.
unigram.frequency <- sort(colSums(as.matrix(unigram.dtm)), decreasing=TRUE)
unigram.word.frequency <- data.frame(word=names(unigram.frequency), freq=unigram.frequency)
unigram.word.frequency %>% filter(freq > quantile(freq,0.9995))
##    word freq
## 1 said 1525
## 2 will 1352
## 3 one 1274
## 4 like 1173
## 5 just 1162
## 6 get 1145
## 7 can 1052
## 8 year 1014
## 9 time 1012
## 10 day 877
## 11 new 818
## 12 make 814
## 13 know 725
bigram.frequency <- sort(colSums(as.matrix(bigram.dtm)), decreasing=TRUE)
bigram.word.frequency <- data.frame(word=names(bigram.frequency), freq=bigram.frequency)
bigram.word.frequency %>% filter(freq > quantile(freq,0.9999))
##            word freq
## 1 new york 111
## 2 last year 101
## 3 dont know 94
## 4 right now 86
## 5 look like 78
## 6 year ago 75
## 7 high school 73
## 8 feel like 69
## 9 last week 69
## 10 cant wait 59
## 11 im go 59
## 12 last night 59
## 13 make sure 58
## 14 look forward 51
## 15 even though 48
## 16 come back 47
## 17 new jersey 47
## 18 can get 46
## 19 next year 46
trigram.frequency <- sort(colSums(as.matrix(trigram.dtm)), decreasing=TRUE)
trigram.word.frequency <- data.frame(word=names(trigram.frequency), freq=trigram.frequency)
trigram.word.frequency %>% filter(freq > quantile(freq,0.9999))
##                   word freq
## 1 ass ass ass 28
## 2 new york citi 20
## 3 happi happi happi 17
## 4 cant wait see 10
## 5 im pretti sure 10
## 6 jeep sioux citi 10
## 7 sioux citi iowa 10
## 8 happi mother day 8
## 9 happi new year 8
## 10 st loui counti 8
## 11 presid barack obama 7
## 12 two year ago 7
## 13 want make sure 7
## 14 cant wait get 6
## 15 dont even know 6
## 16 dont get wrong 6
## 17 feel like im 6
## 18 martin luther king 6
## 19 new york time 6
## 20 superior court judg 6
## 21 world war ii 6
Second, we create a few network graphs for the bigrams to get a better overview of the word relationships. This is a visualization of a Markov chain, a common model in text processing: each choice of word depends only on the previous word. In our case, a random generator following this model would predict “now” after the word “right”. We show only the top word pairs, but the third graph shows that the visual limit is reached quickly. You could imagine an enormous graph representing all connections present in the text.
bigram.counts <- bigram.word.frequency %>%
  separate(word, c("word1", "word2"), sep = " ")
bigram.graph <- bigram.counts %>%
  filter(freq > quantile(freq,0.9999)) %>%
  graph_from_data_frame()
set.seed(22)
a <- grid::arrow(type = "open", length = unit(.15, "inches"))
ggraph(bigram.graph, layout = "fr") +
  geom_edge_link(arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "green", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

bigram.counts <- bigram.word.frequency %>%
  separate(word, c("word1", "word2"), sep = " ")
bigram.graph <- bigram.counts %>%
  filter(freq > quantile(freq,0.999)) %>%
  graph_from_data_frame()
set.seed(22)
a <- grid::arrow(type = "open", length = unit(.15, "inches"))
ggraph(bigram.graph, layout = "fr") +
  geom_edge_link(arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "green", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

bigram.counts <- bigram.word.frequency %>%
  separate(word, c("word1", "word2"), sep = " ")
bigram.graph <- bigram.counts %>%
  filter(freq > quantile(freq,0.99)) %>%
  graph_from_data_frame()
set.seed(22)
a <- grid::arrow(type = "open", length = unit(.15, "inches"))
ggraph(bigram.graph, layout = "fr") +
  geom_edge_link(arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "green", size = 2) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()
Finally we generate word clouds to reveal the essentials. They are quick to generate and these visualizations are engaging.
Interpretation hint: blue indicates higher frequency, yellow lower.
set.seed(22)
wordcloud(names(unigram.frequency), unigram.word.frequency$freq, max.words=25, scale=c(4, 0.5), colors=brewer.pal(6, "RdYlBu"))
qplot(unigram.word.frequency$freq,
      geom="histogram",
      binwidth = 1,
      main = "Word frequencies unigrams",
      fill=I("blue"),
      alpha=I(.5),
      xlim=c(0,50),
      ylim=c(0,3500))

set.seed(22)
wordcloud(names(bigram.frequency), bigram.word.frequency$freq, max.words=25, scale=c(2, 1), colors=brewer.pal(6, "RdYlBu"))
qplot(bigram.word.frequency$freq,
      geom="histogram",
      binwidth = 1,
      main = "Word frequencies bigrams",
      fill=I("blue"),
      alpha=I(.5),
      xlim=c(0,50),
      ylim=c(0,230))

set.seed(22)
wordcloud(names(trigram.frequency), trigram.word.frequency$freq, max.words=25, scale=c(3, 0.3), colors=brewer.pal(6, "RdYlBu"))
qplot(trigram.word.frequency$freq,
      geom="histogram",
      binwidth = 1,
      main = "Word frequencies trigrams",
      fill=I("blue"),
      alpha=I(.5),
      xlim=c(0,30),
      ylim=c(0,125))
Regarding the number of words, we see that the blogs and news files have similar medians (28 vs 32) as well as similar means (41.75 vs 34.41). The ranges differ a lot, though (6726 vs. 1796). The twitter object has a much smaller median (12) and mean (12.75), and its range (47) is small as well. The standard deviations vary considerably (blogs = 46.59, news = 22.83, twitter = 6.91).
For the blog object, most entries have fewer than 25 words; the frequencies decrease quickly up to about 100 words per entry, with a few much longer entries after that. For the news object, most entries have between 25 and 35 words; the frequencies decrease quickly up to about 75 words per entry, again followed by a few much longer entries. For the twitter object, most entries have fewer than 7.5 words; the frequencies decrease quickly up to about 27.5 words, with a few entries between 27.5 and 32.5 words.
Comparing the top unigram, bigram and trigram frequencies, we see that there is almost no overlap. Only the word “like” appears in both the unigram and bigram lists.
Most of the top single words are verbs. The most frequent bigrams contain many time-related phrases like “last year”, “right now” or “year ago”. The most frequent trigrams contain either places, congratulations or famous personalities. The top entry contains three words from a song by Big Sean.
The token frequency histograms are very useful. They all show that by far most tokens (unigrams, bigrams and trigrams) occur between 0 and 10 times. We need to pay attention to this insight for model development (see next chapter).
Our Markov chain network graphs confirm the insights about the content of our tokens (bigrams and trigrams). The bigrams mainly contain time and location information. The 99.99th percentile graph expands these insights and tells us that a lot of verbs are used.
The model we plan to use is an n-gram model, which is based on the principle of a Markov chain. The idea is to predict the next word from one or several input words. The probability of the prediction is based on the n-1 previous words. If we choose trigrams (n = 3) and the user wants to predict the next word of the sentence “I am going to New”, our model should predict “York” based on the two previous words “to New”.
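To illustrate the idea, here is a minimal sketch of such a lookup. It assumes a data frame trigram.counts with columns word1, word2, word3 and freq; this object is hypothetical and is not built in the analysis above, but could be derived from trigram.word.frequency with separate().
# Minimal lookup sketch (assumption: trigram.counts with columns word1, word2,
# word3 and freq -- not created above). Uses dplyr, which is already loaded.
predict.next.word <- function(w1, w2, trigram.counts) {
  candidates <- trigram.counts %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(freq))
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$word3[1])  # most frequent continuation
}
# Example (the corpus is lower-cased and stemmed):
# predict.next.word("to", "new", trigram.counts) should return "york"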
Some n-gram histories will be sparse or entirely unseen. For better accuracy we will therefore use Katz’ back-off model, which falls back to lower-order n-grams when a higher-order n-gram has not been observed.
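As a rough illustration of the back-off idea (a simplified sketch only, without the discounting and redistribution of probability mass that the full Katz model requires), the prediction could fall back from trigrams to bigrams to unigrams. The trigram.counts data frame is the hypothetical object from the sketch above; bigram.counts and unigram.word.frequency exist in the analysis above.
# Simplified back-off sketch (illustration only, not the full Katz model).
predict.backoff <- function(w1, w2, trigram.counts, bigram.counts, unigram.word.frequency) {
  tri <- trigram.counts %>% filter(word1 == w1, word2 == w2) %>% arrange(desc(freq))
  if (nrow(tri) > 0) return(as.character(tri$word3[1]))   # trigram match found
  bi <- bigram.counts %>% filter(word1 == w2) %>% arrange(desc(freq))
  if (nrow(bi) > 0) return(as.character(bi$word2[1]))     # back off to bigram
  as.character(unigram.word.frequency$word[which.max(unigram.word.frequency$freq)])  # back off to unigram
}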
We will use Good-Turing smoothing for n-gram frequencies smaller than or equal to 5. As the histograms above show, a large share of the n-grams fall into this low-frequency range.
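A minimal sketch of the Good-Turing adjustment, applied to the unigram counts computed above: the adjusted count is c* = (c + 1) * N(c+1) / N(c), where N(c) is the number of terms observed exactly c times. A real implementation would additionally smooth the N(c) values.
# Sketch of Good-Turing adjusted counts for low raw counts (c <= 5), using the
# unigram.frequency vector from above. Illustration only.
freq.of.freqs <- table(unigram.frequency)   # N(c): how many terms occur exactly c times
gt.adjust <- function(count) {
  Nc  <- freq.of.freqs[as.character(count)]
  Nc1 <- freq.of.freqs[as.character(count + 1)]
  if (is.na(Nc) || is.na(Nc1)) return(NA_real_)
  (count + 1) * as.numeric(Nc1) / as.numeric(Nc)
}
sapply(1:5, gt.adjust)   # adjusted counts for raw counts 1 to 5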