Library Loading

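The library-loading chunk is not shown in this report, so below is a minimal sketch of the packages the rest of the code appears to rely on, inferred from the functions used later; the exact list may differ from the original setup.

library(tm)            # Corpus, DirSource, tm_map, content_transformer, DocumentTermMatrix
library(stylo)         # txt.to.words, make.ngrams
library(caTools)       # sample.split
library(dplyr)         # tbl_df and the %>% pipe
library(ggplot2)       # bar plots
library(wordcloud)     # wordcloud
library(RColorBrewer)  # brewer.pal color palettes
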
Corpus Loading

cname <- file.path("E:","CourseraR", "Coursera-Swiftkey", "final", "en_US")
docs <- Corpus(DirSource(cname, encoding = "UTF-8"))
blogs_pre <- docs[[1]]$content    # Pre-Cleaning
news_pre <- docs[[2]]$content     # Pre-Cleaning
twitter_pre <- docs[[3]]$content  # Pre-Cleaning

Corpus Basic Summary Pre-Cleaning

summary(docs)
##                   Length Class             Mode
## en_US.blogs.txt   2      PlainTextDocument list
## en_US.news.txt    2      PlainTextDocument list
## en_US.twitter.txt 2      PlainTextDocument list

Line Count

length(blogs_pre); length(news_pre); length(twitter_pre)
## [1] 899288
## [1] 77259
## [1] 2360148

Longest Line (in Characters)

max(nchar(blogs_pre)); max(nchar(news_pre)); max(nchar(twitter_pre))
## [1] 40833
## [1] 5760
## [1] 140

Corpus Preparation

At this point I will clean up the Corpus to:

  • Remove foreign characters left over from other languages
  • Strip extra white space beyond single-space separation (see the note after the code block below)
  • Remove special characters, including punctuation marks
  • Remove numbers
  • Transform the whole corpus to lowercase for easier analysis

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- tm_map(docs, content_transformer(function(x) iconv(x, "latin1", "ASCII", sub = ""))) # previously: iconv(enc2utf8(x), sub = "byte")
docs <- tm_map(docs, toSpace, "/|@|\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
blogs <- docs[[1]]$content  # Post-Cleaning
news <- docs[[2]]$content   # Post-Cleaning
twitter <- docs[[3]]$content  # Post-Cleaning
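
One note: the extra-white-space bullet above is not handled explicitly in this pipeline. If it matters, tm's stripWhitespace transformer covers it; the line below is an optional addition rather than part of the original code:

docs <- tm_map(docs, stripWhitespace)  # collapse runs of whitespace into single spaces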

Sampling

The Corpus is huge. That's both a good thing and a bad thing: since we (or at least I) have limited resources in time and processing power, I will take a sample of the Corpus. At this point I chose 4% as my training sample.
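
Since sample.split draws a random subset, setting a seed first makes the sample reproducible; the value below is arbitrary and was not part of the original run:

set.seed(1234)  # arbitrary seed, only to make the sample reproducible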

train_porcentage <- 4/100

blogs_index <- sample.split(blogs, train_porcentage)
news_index <- sample.split(news, train_porcentage)
twitter_index <- sample.split(twitter, train_porcentage)

blogs_train <- blogs[blogs_index]
news_train <- news[news_index]
twitter_train <- twitter[twitter_index]

train <- c(blogs_train, news_train, twitter_train)  # concatenate the three samples; paste() would recycle and glue unrelated lines together
write(train, file = "E:/CourseraR/Coursera-Swiftkey/final/subset/train.txt")

I then reload the sampled corpus from disk. This is just a way to save time: it is faster to keep the sample saved as a txt file and reload it than to load the whole Corpus and resample it every time.

cname2 <- file.path("E:", "CourseraR", "Coursera-Swiftkey", "final", "subset")  # folder holding train.txt
train_corpus <- Corpus(DirSource(cname2, encoding = "UTF-8"))
train <- train_corpus[[1]]$content
train <- txt.to.words(train)

N-Grams

I used the make.ngrams function from the stylo package. This is useful because I doubt that RWeka will work well on Shiny, given its Java dependency. Also, make.ngrams was fast enough.
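
As a quick illustration of what make.ngrams produces, here is a toy sentence (not taken from the corpus), which should yield something like the output shown:

words <- txt.to.words("The quick brown fox jumps")  # lowercases and splits into single words
make.ngrams(words, ngram.size = 2)
## [1] "the quick"   "quick brown" "brown fox"   "fox jumps"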

# Tokenizer helpers: they ignore the x argument passed by DocumentTermMatrix and always
# build n-grams from the pre-tokenized train vector (fine here: the corpus has one document).
n2 <- function(x) make.ngrams(train, ngram.size = 2)
n3 <- function(x) make.ngrams(train, ngram.size = 3)
n4 <- function(x) make.ngrams(train, ngram.size = 4)

dtm <-  DocumentTermMatrix(train_corpus)
dtm2 <- DocumentTermMatrix(train_corpus, control = list(tokenize = n2))
dtm3 <- DocumentTermMatrix(train_corpus, control = list(tokenize = n3))
dtm4 <- DocumentTermMatrix(train_corpus, control = list(tokenize = n4))

Word Frequency

After building the n-gram Document Term Matrices, it is very useful to sort the terms by frequency. I stored the results as tbl_df data frames for performance reasons.

freq1 <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq2 <- sort(colSums(as.matrix(dtm2)), decreasing = TRUE)
freq3 <- sort(colSums(as.matrix(dtm3)), decreasing = TRUE)
freq4 <- sort(colSums(as.matrix(dtm4)), decreasing = TRUE)
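
A side note: as.matrix() densifies the whole Document Term Matrix, which can be memory-hungry on larger samples. If that becomes a problem, slam's col_sums() works directly on the sparse representation; this is an optional alternative, not what was used here:

freq1 <- sort(slam::col_sums(dtm), decreasing = TRUE)  # same frequencies without the dense conversion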

wf1 <- tbl_df(data.frame(word = names(freq1), freq = freq1))
wf2 <- tbl_df(data.frame(word = names(freq2), freq = freq2))
wf3 <- tbl_df(data.frame(word = names(freq3), freq = freq3))
wf4 <- tbl_df(data.frame(word = names(freq4), freq = freq4))

Bar Plots

Here are some n-gram histograms showing the 15 most frequent phrases for each n-gram size.


# Frequency of the 15th most frequent term in each table, used below as the plotting cutoff
freq1_top <- as.integer(wf1[15, 2])
freq2_top <- as.integer(wf2[15, 2])
freq3_top <- as.integer(wf3[15, 2])
freq4_top <- as.integer(wf4[15, 2])

g1 <- subset(wf1, freq > freq1_top) %>% ggplot(aes(x = reorder(word, -freq), y = freq)) +
        geom_bar(stat = "identity") + labs(x = "Word", y= "Frequency", title = "Single Word Histogram") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
g1

g2 <- subset(wf2, freq > freq2_top) %>% ggplot(aes(x = reorder(word, -freq), y = freq)) +
        geom_bar(stat = "identity") + labs(x = "Bi-Gram", y= "Frequency", title = "Bi-Gram Histogram") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))  
g2

g3 <- subset(wf3, freq > freq3_top) %>% ggplot(aes(x = reorder(word, -freq), y = freq)) +
        geom_bar(stat = "identity") + labs(x = "Tri-Gram", y= "Frequency", title = "Tri-Gram Histogram") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))  
g3

g4 <- subset(wf4, freq > freq4_top) %>% ggplot(aes(x = reorder(word, -freq), y = freq)) +
        geom_bar(stat = "identity") + labs(x = "Four-Gram", y= "Frequency", title = "Four-Gram Histogram") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))  
g4

Also, it is possible to make word clouds. They are fun and nicer looking, but they basically show the same information as the histograms.

Wordcloud

wordcloud(names(freq1), freq1, min.freq = 1000, max.words = 140, colors = brewer.pal(6, "Dark2"))

wordcloud(names(freq2), freq2, min.freq = 500, max.words = 80, colors = brewer.pal(6, "Dark2"))

wordcloud(names(freq3), freq3, min.freq = 100, max.words = 60, colors = brewer.pal(6, "Dark2"))

wordcloud(names(freq4), freq4, min.freq = 10, max.words= 30, colors = brewer.pal(6, "Dark2"))

Next Steps

  • Clean the N-Gram tables to remove the lower-frequency N-Grams
  • Automate the text input and search through the N-Grams (see the sketch after this list)
  • Speed up the models
  • Handle apostrophes and contractions
  • Try to generate prediction models; so far my rough model only got 2 out of 10 Quiz 2 questions right!
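
As a rough sketch of the N-Gram lookup mentioned in the second bullet (predict_next is a made-up name; it assumes the bi-gram table wf2 built earlier, with its word and freq columns):

predict_next <- function(prefix, k = 3) {
        # keep bi-grams whose first word is the prefix, then return the k most frequent
        wf2 %>%
                filter(grepl(paste0("^", prefix, " "), word)) %>%
                arrange(desc(freq)) %>%
                head(k)
}

predict_next("of")  # the k most frequent bi-grams starting with "of"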

Interesting Findings

So far I haven't had any "important findings". I've been fighting with the coding and the handling of text mining and n-grams. I've learned a lot about NLP and the models, but not so much about the information processed for this report.