The goal of the Data Science Capstone Project is to build a predictive model (Natural Language Processing) for next-word prediction. Given a word or phrase as input, the product/application shall try to predict the next word.
This milestone report presents an exploratory analysis of the training data, carried out to understand the distribution of words and the relationships between words, tokens, and phrases in the corpora.
Understand the frequencies of words and word pairs - build figures and tables that show how the frequencies of words and word pairs vary across the data.
Capstone Dataset is available at : https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The dataset contains data obtained from blog posts, news feeds, and tweets from Twitter. The files are saved in txt format with \n newline separators.
Set the working directory and remove all old objects
setwd("/home/alok/capstone_swk")
rm(list = ls())
Load the libraries (note: not all of these libraries are used here, but they will be used in the final application)
suppressWarnings(suppressMessages(library(igraph)))
suppressWarnings(suppressMessages(library(biclust)))
suppressWarnings(suppressMessages(library(RColorBrewer)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(SnowballC)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(wordcloud)))
suppressWarnings(suppressMessages(library(cluster)))
suppressWarnings(suppressMessages(library(RWeka)))
suppressWarnings(suppressMessages(library(caTools)))
suppressWarnings(suppressMessages(library(rpart)))
suppressWarnings(suppressMessages(library(rpart.plot)))
suppressWarnings(suppressMessages(library(randomForest)))
suppressWarnings(suppressMessages(library(qdap)))
Download the data
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",destfile="Coursera-SwiftKey.zip",method="curl")
unzip("Coursera-SwiftKey.zip")
blogsfile <- "final/en_US/en_US.blogs.txt"
newsfile <- "final/en_US/en_US.news.txt"
twitterfile <- "final/en_US/en_US.twitter.txt"
combine_files <- "final/en_US/en_US.all.txt"
combine_clean_files <- "final/en_US/en_US.all_3.txt"
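As a small optional guard (an assumption, not part of the original script), the download and unzip steps can be skipped when the archive is already on disk:
# Hypothetical guard: fetch and extract only when the files are missing
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", method = "curl")
}
if (!file.exists(blogsfile)) {
  unzip("Coursera-SwiftKey.zip")
}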
Raw data summary (the all_file_cat* rows come from the Unix script below)
| file_name | Unique words (types) | File size (MB) | Lines | Words (tokens) | Characters |
|---|---|---|---|---|---|
| blogs | 1214516 | 200.4297 | 899288 | 37334114 | 210160014 |
| news | 945730 | 196.2812 | 1010242 | 34365936 | 205811889 |
| twitter | 1443911 | 159.3672 | 2360148 | 30359804 | 167105338 |
| Sum (blogs + news + twitter) | 2825934 | 556.0781 | 4269678 | 102059854 | 583077241 |
| all_file_cat | 2825934 | 556.0703 | 4269678 | 103041866 | 583077241 |
| all_file_cat_c | 541029 | 525.8711 | 4269678 | 103041866 | 551410913 |
**Unix Code**
combine_files <- (system ("cat final/en_US/en_US.blogs.txt final/en_US/en_US.news.txt final/en_US/en_US.twitter.txt > final/en_US/en_US.all.txt", intern=TRUE))
system ("bash /home/alok/capstone_swk/final/en_US/data_clean.bash")
cat data_clean.bash
#!/bin/bash
cd /home/alok/capstone_swk/final/en_US
# Convert upper case to lower case
sed 's/\([A-Z]\)/\L\1/g' en_US.all.txt > en_US.all_1.txt
# Replace all numbers and special characters with a space
sed 's/[^a-z]/ /g;' en_US.all_1.txt > en_US.all_2.txt
# Squeeze repeated whitespace and trim leading/trailing spaces
awk '{$1=$1};1' en_US.all_2.txt > en_US.all_3.txt
rm en_US.all_1.txt
rm en_US.all_2.txt
## Identify each unique word and its frequency, sorted numerically by count
## (inspect the counts with e.g. "tail -n 50 en_US.all_3_uniq.txt")
cat en_US.all_3.txt|tr " " "\n"|sort |uniq -c|sort -k 1n -r > en_US.all_3_uniq.txt
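For reference, a rough R equivalent of the Unix frequency pipeline above (a sketch only; it assumes the cleaned file en_US.all_3.txt exists and fits in memory):
# Sketch: word frequencies of the cleaned corpus, analogous to tr | sort | uniq -c
all_lines <- readLines("final/en_US/en_US.all_3.txt", warn = FALSE, skipNul = TRUE)
all_words <- unlist(strsplit(all_lines, " ", fixed = TRUE))
word_freq <- sort(table(all_words[all_words != ""]), decreasing = TRUE)
head(word_freq, 50)   # 50 most frequent words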
Read and sample the data
blogs <- readLines(blogsfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
news <- readLines(newsfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
twitter <- readLines(twitterfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
class(blogs)
## [1] "character"
class(news)
## [1] "character"
class(twitter)
## [1] "character"
**Random sample of documents.** Approximately 1% of 899,288 lines is about 8,992; since the number of unique words is high, a sample of roughly 1% to 1.5% would be appropriate. (A much smaller sample is drawn below to keep this report manageable.)
Blogs_sample <- sample(blogs, 10)
Approx 1% of 1010242 is 10102
News_sample <- sample(news, 11)
Approx 1% of 2360148 is 23601
Twitter_sample <- sample(twitter, 23)
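For a larger, reproducible sample (a sketch with hypothetical object names; this report keeps the tiny samples above so the output stays readable), the seed can be fixed and roughly 1% drawn from each source:
# Sketch: seeded ~1% samples for repeatable results
set.seed(1234)
blogs_sample_1pct   <- sample(blogs,   round(length(blogs)   * 0.01))
news_sample_1pct    <- sample(news,    round(length(news)    * 0.01))
twitter_sample_1pct <- sample(twitter, round(length(twitter) * 0.01))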
class(Blogs_sample )
## [1] "character"
class(News_sample)
## [1] "character"
class(Twitter_sample)
## [1] "character"
Remove the blogs, news, and twitter objects to free memory
rm(blogs,news,twitter)
Clean Sample data
sdata <- c(Blogs_sample, News_sample, Twitter_sample)
class(sdata)
## [1] "character"
summary(sdata[[1]])
## Length Class Mode
## 1 character character
Sample cleaning code (demonstrated on a test string)
tdata <- 'Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:'
class(tdata)
## [1] "character"
str(tdata)
## chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
Remove Hash tags
tdata <- gsub(" #\\S*","", tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy @ruby http://www.global.com 1234355 7$%^%&%^*^,:"
Remove URLs
tdata <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy @ruby 1234355 7$%^%&%^*^,:"
Remove twitter accounts
tdata <- gsub(" @[^\\s]+","",tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy"
Remove special characters
tdata <- gsub("[^0-9A-Za-z///' ]", "", tdata)
tdata
## [1] "Hello 7 tammy"
unlist example
abc <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
abc
## $a
## $a[[1]]
## [1] 1 2 3 4 5
##
## $a[[2]]
## [1] "A" "B" "C" "D" "E"
##
##
## $b
## [1] "Z"
##
## $c
## [1] NA
abc <- unlist(abc, recursive = FALSE)
abc
## $a1
## [1] 1 2 3 4 5
##
## $a2
## [1] "A" "B" "C" "D" "E"
##
## $b
## [1] "Z"
##
## $c
## [1] NA
abc <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
abc <- unlist(abc, recursive = TRUE)
abc
## a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 b c
## "1" "2" "3" "4" "5" "A" "B" "C" "D" "E" "Z" NA
Corpus Example
docs1 <- c("This is a text.", "This is another one.", "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:", "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'")
class(docs1)
## [1] "character"
typeof(docs1)
## [1] "character"
str(docs1)
## chr [1:4] "This is a text." "This is another one." ...
docs1
## [1] "This is a text."
## [2] "This is another one."
## [3] "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## [4] "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
docs1 <- data.frame(docs1)
class(docs1)
## [1] "data.frame"
typeof(docs1)
## [1] "list"
str(docs1)
## 'data.frame': 4 obs. of 1 variable:
## $ docs1: Factor w/ 4 levels "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'",..: 3 4 2 1
docs1
## docs1
## 1 This is a text.
## 2 This is another one.
## 3 Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:
## 4 Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'
ds <- DataframeSource(docs1)
class(ds)
## [1] "DataframeSource" "SimpleSource" "Source"
typeof(ds)
## [1] "list"
str(ds)
## List of 4
## $ encoding: chr ""
## $ length : int 4
## $ position: num 0
## $ reader :function (elem, language, id)
## - attr(*, "class")= chr [1:3] "DataframeSource" "SimpleSource" "Source"
ds
## $encoding
## [1] ""
##
## $length
## [1] 4
##
## $position
## [1] 0
##
## $reader
## function (elem, language, id)
## {
## if (!is.null(elem$uri))
## id <- basename(elem$uri)
## PlainTextDocument(elem$content, id = id, language = language)
## }
## <environment: namespace:tm>
##
## $content
## docs1
## 1 This is a text.
## 2 This is another one.
## 3 Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:
## 4 Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'
##
## attr(,"class")
## [1] "DataframeSource" "SimpleSource" "Source"
sds <- Corpus(ds)
class(sds)
## [1] "VCorpus" "Corpus"
typeof(sds)
## [1] "list"
str(sds)
## List of 4
## $ 1:List of 2
## ..$ content: chr "This is a text."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr "This is another one."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
sds
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
inspect(sds)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 80
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 77
vds <- VCorpus(ds)
class(vds)
## [1] "VCorpus" "Corpus"
typeof(vds)
## [1] "list"
str(vds)
## List of 4
## $ 1:List of 2
## ..$ content: chr "This is a text."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr "This is another one."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
vds
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
inspect(vds)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 80
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 77
Remove Hash tags
sdata <- gsub(" #\\S*","", sdata)
Remove URLs
sdata <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", sdata)
Remove twitter accounts
sdata <- gsub(" @[^\\s]+","",sdata)
Remove special characters
sdata <- gsub("[^0-9A-Za-z///' ]", "", sdata)
Build the corpus from sample data
t <- VectorSource(sdata)
class(t)
## [1] "VectorSource" "SimpleSource" "Source"
dim(t)
## NULL
summary(t[[1]])
## Length Class Mode
## 1 character character
#t$VectorSource[[1]
#$SimpleSource[[1]]
#t$Source[[1]]
str(t[[1]])
## chr ""
str(t[[3]])
## num 0
str(t[[2]])
## int 44
##head(t) ## Working lot of data
##tail(t) ## Working lot of data
corpus <- Corpus(VectorSource(sdata))
class(corpus)
## [1] "VCorpus" "Corpus"
dim(corpus)
## NULL
str(corpus)
## List of 44
## $ 1 :List of 2
## ..$ content: chr "homesomebody never had grandchildren"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2 :List of 2
## ..$ content: chr "January Jones wouldnt have been my first choice for Emma Frost but I think she did a pretty good job I wasnt hugely impressed w"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3 :List of 2
## ..$ content: chr "eatcha up inside"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4 :List of 2
## ..$ content: chr "5 This freedom of planting will help to prevent robbing stealing and murdering"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 5 :List of 2
## ..$ content: chr "When I first started out job hunting as a teenager there were no recruitment agencies in my town Every job going was advertised"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "5"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 6 :List of 2
## ..$ content: chr "BF Yes its very well documented"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "6"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 7 :List of 2
## ..$ content: chr "Related Bible verses Genesis 22912 Ephesians 320 Exodus 3314"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "7"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 8 :List of 2
## ..$ content: chr "Dear TigerTime Supporter"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "8"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 9 :List of 2
## ..$ content: chr "Assign a Social Media Monitor is the 49th in a series of excerpts from our book Be a Person the Social Operating Manual for Ent"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "9"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 10:List of 2
## ..$ content: chr "Penwizard have kindly offered FREE postage and packaging to any of my readers who wish to purchase a book from them just go to "| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "10"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 11:List of 2
## ..$ content: chr "It's been pretty quiet today she said I haven't heard of any serious crashes"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "11"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 12:List of 2
## ..$ content: chr "Other church members who attended the hearing said they don't believe their church is going to be able to raise 50 million in 1"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "12"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 13:List of 2
## ..$ content: chr "Another option is small claims court but given the size of your refund just 70 it is probably impractical I think a final str"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "13"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 14:List of 2
## ..$ content: chr "Class of 2012 He was a PostDispatch firstteam AllMetro selection last fall as a junior The 6foot1 195pounder accounted for 1914"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "14"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 15:List of 2
## ..$ content: chr "BOSTON Cupcakes brownies and other baked goodies will be spared the chopping block at Massachusetts schools after Gov Deval Pa"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "15"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 16:List of 2
## ..$ content: chr "Game 37"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "16"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 17:List of 2
## ..$ content: chr "6820 St Olaf Dr 36667"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "17"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 18:List of 2
## ..$ content: chr "The image showed a significant amount of white stuff"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "18"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 19:List of 2
## ..$ content: chr "Apocalyptic rhetoric aside Rubio's proposal is simply a compromise a stopgap one of those virtuous but messy and elusive const"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "19"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 20:List of 2
## ..$ content: chr "Pacific Palisades CA 90272"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "20"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 21:List of 2
## ..$ content: chr "Producer Joel Silver whose blockbuster resume includes The Matrix and Die Hard teams with After Dark Films to launch a new slat"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "21"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 22:List of 2
## ..$ content: chr "Going live with this week"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "22"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 23:List of 2
## ..$ content: chr "Public comment is now closed QA time for Budget Committee members"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "23"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 24:List of 2
## ..$ content: chr "Limbering up for my Park debut Find me at Green Week when I bring energy efficiency to the masses wwwsmartenergypayscom"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "24"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 25:List of 2
## ..$ content: chr "What makes the engine go Desire desire desire"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "25"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 26:List of 2
## ..$ content: chr "Today could be Busy 911 Remember at Shaw's in NE MPLSconcert fundraiser charity Gopher tailgate at Stub and Herb's"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "26"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 27:List of 2
## ..$ content: chr "lol forgot to mention ass too what you a bigger fag"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "27"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 28:List of 2
## ..$ content: chr "what's the domain we'll check it out"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "28"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 29:List of 2
## ..$ content: chr "For the life of me I don't understand button down collars Are you worried about the collar getting away from you"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "29"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 30:List of 2
## ..$ content: chr "I laughed at it my dumb ass was looking back to see when I said that"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "30"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 31:List of 2
## ..$ content: chr "P2 needs surgery If you care about him don't vote for him"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "31"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 32:List of 2
## ..$ content: chr "And for the record I don't agree with most of what Frank said He's venting Using war to justify antOWS is dumb"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "32"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 33:List of 2
## ..$ content: chr "Thanks for the follow Nick"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "33"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 34:List of 2
## ..$ content: chr "Derby time"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "34"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 35:List of 2
## ..$ content: chr "Congrats to the cubs fans celebrate and kick off your weekend with a healthy startfruit smoothies sodas and sandwiches"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "35"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 36:List of 2
## ..$ content: chr "1st off get it right name's CynicalripaMic sicka than Vlad the Impaler stickin ya to the tip of a spike"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "36"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 37:List of 2
## ..$ content: chr "Now that is a candle AUDJPY"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "37"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 38:List of 2
## ..$ content: chr "oh and you popped out to the second baseman"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "38"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 39:List of 2
## ..$ content: chr "FF thanks for help to"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "39"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 40:List of 2
## ..$ content: chr "Thanks so much for the follow If you ever need a singer let me know"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "40"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 41:List of 2
## ..$ content: chr "I though Cena might be sad bc of what's goin on with him but he definitely didn't seem sad "
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "41"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 42:List of 2
## ..$ content: chr "The day is going by too quickly trying to get everything finished up before another amazing weekend"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "42"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 43:List of 2
## ..$ content: chr "One of the surest bets in Super Bowl history is siding with the field goal/safety when it comes to what the first score of the "| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "43"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 44:List of 2
## ..$ content: chr "nope just the one"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "44"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
#summary(corpus)
corpus[[1]]$content
## [1] "homesomebody never had grandchildren"
summary(corpus[[1]])
## Length Class Mode
## content 1 -none- character
## meta 7 TextDocumentMeta list
Clean the corpus
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
#corpus <- tm_map(corpus, removeWords,stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
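A note on the first two steps: applying tolower() directly strips the document metadata (which is likely why the document IDs appear as character(0) in the matrices below). In newer versions of tm the usual idiom is to wrap base functions in content_transformer(), which keeps the documents as PlainTextDocument objects. A sketch of that alternative (corpus2 is a hypothetical name, not the object used below):
# Sketch: same cleaning with content_transformer(), preserving document classes
corpus2 <- Corpus(VectorSource(sdata))
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, stripWhitespace)
corpus2 <- tm_map(corpus2, stemDocument)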
Construct the document matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)
Class of the document matrices
class(tdm)
## [1] "TermDocumentMatrix" "simple_triplet_matrix"
class(dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
str(tdm)
## List of 6
## $ i : int [1:670] 180 185 200 270 1 7 13 17 22 28 ...
## $ j : int [1:670] 1 1 1 1 2 2 2 2 2 2 ...
## $ v : num [1:670] 1 1 1 1 1 1 1 4 1 1 ...
## $ nrow : int 482
## $ ncol : int 44
## $ dimnames:List of 2
## ..$ Terms: chr [1:482] "abl" "about" "account" "action" ...
## ..$ Docs : chr [1:44] "character(0)" "character(0)" "character(0)" "character(0)" ...
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(dtm)
## List of 6
## $ i : int [1:670] 1 1 1 1 2 2 2 2 2 2 ...
## $ j : int [1:670] 180 185 200 270 1 7 13 17 22 28 ...
## $ v : num [1:670] 1 1 1 1 1 1 1 4 1 1 ...
## $ nrow : int 44
## $ ncol : int 482
## $ dimnames:List of 2
## ..$ Docs : chr [1:44] "character(0)" "character(0)" "character(0)" "character(0)" ...
## ..$ Terms: chr [1:482] "abl" "about" "account" "action" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
Inspect the document matrices
inspect(dtm[1:5, 1:20])
## <<DocumentTermMatrix (documents: 5, terms: 20)>>
## Non-/sparse entries: 12/88
## Sparsity : 88%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Terms
## Docs abl about account action advertis advis after agenc agre
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 1 0 0 0 0 0 1 0 0
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 1 0 0 0 2 1 0 1 0
## Terms
## Docs all allmetro also altern amaz amount anchor and ani anoth
## character(0) 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 1 0 0 0 4 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 1 0 0
## character(0) 0 0 2 0 0 0 0 1 1 0
## Terms
## Docs antow
## character(0) 0
## character(0) 0
## character(0) 0
## character(0) 0
## character(0) 0
inspect(tdm[1:5, 1:20])
## <<TermDocumentMatrix (terms: 5, documents: 20)>>
## Non-/sparse entries: 5/95
## Sparsity : 95%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 1 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 1 0 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 2 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 1
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 0
## about 0 0 0 0
## account 0 1 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
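The term-document matrix can also be summarized directly; a small sketch (converting to a dense matrix is fine at this sample size) listing the most frequent stemmed terms:
# Sketch: overall term frequencies from the term-document matrix
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 10)   # ten most frequent terms in the sample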
Generate the data frame from the corpus and remove the corpus object
corpus[[1]]$content
unlist(): takes a list as input and creates an element for each item of the list.
sapply(object, function, ...)
object can be a list, data frame, or vector.
function, ...: the function to apply, plus any function-specific arguments.
output: a vector, matrix, or list.
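For instance, a toy illustration of the simplification sapply() performs:
# sapply() returns a named vector here because each element's result has length 1
sapply(list(a = 1:3, b = 4:6), sum)
## a  b
## 6 15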
colnames(corpus)
## NULL
corpus[1]
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
corpus[[1]]$meta
## author : character(0)
## datetimestamp: 2016-01-04 16:16:37
## description : character(0)
## heading : character(0)
## id : character(0)
## language : character(0)
## origin : character(0)
corpus[[1]]$meta$datetimestamp
## [1] "2016-01-04 16:16:37 GMT"
corpus[[1]]$content
## [1] "homesomebodi never had grandchildren"
corpus[[2]]$content
## [1] "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with the effect of her in her diamond state but as mani a twilight fan can attestget those damn diamond effect can be challeng i rememb riptid releas littl projectil stake while spin realli fastnot so much for creat tornado and crap im not sure whi they felt the need to chang that other than to bring down the stealth jet dure the climaxwhich i guess is an ok reason that devillook guy that look an aw lot like nightcrawl and use the same teleport effect that would be azazel and he look like nightcrawl becaus hes nightcrawl daddi are you wonder whi nightcrawl is blue that would be becaus mystiqu is his mamaand threw him down a well after he was born azazel is biblic and should technic be trap in an altern dimens thank to his teleport skillz hes abl to come here everi onc in a while for just long enough to knock up a random woman which he doe often"
corpus[[3]]$content
## [1] "eatcha up insid"
testsapply <- sapply(corpus, '[',"content")
class(testsapply)
## [1] "list"
typeof(testsapply)
## [1] "list"
#testsapply[[1]]$content
head(testsapply)
## $`character(0).content`
## [1] "homesomebodi never had grandchildren"
##
## $`character(0).content`
## [1] "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with the effect of her in her diamond state but as mani a twilight fan can attestget those damn diamond effect can be challeng i rememb riptid releas littl projectil stake while spin realli fastnot so much for creat tornado and crap im not sure whi they felt the need to chang that other than to bring down the stealth jet dure the climaxwhich i guess is an ok reason that devillook guy that look an aw lot like nightcrawl and use the same teleport effect that would be azazel and he look like nightcrawl becaus hes nightcrawl daddi are you wonder whi nightcrawl is blue that would be becaus mystiqu is his mamaand threw him down a well after he was born azazel is biblic and should technic be trap in an altern dimens thank to his teleport skillz hes abl to come here everi onc in a while for just long enough to knock up a random woman which he doe often"
##
## $`character(0).content`
## [1] "eatcha up insid"
##
## $`character(0).content`
## [1] " this freedom of plant will help to prevent rob steal and murder"
##
## $`character(0).content`
## [1] "when i first start out job hunt as a teenag there were no recruit agenc in my town everi job go was advertis in the jobcentr a one stop shop as it were it advertis everi job avail local and also a few nation vacanc advis were also abl to look for a particular job titl in ani other area"
##
## $`character(0).content`
## [1] "bf yes it veri well document"
tail(testsapply)
## $`character(0).content`
## [1] "ff thank for help to"
##
## $`character(0).content`
## [1] "thank so much for the follow if you ever need a singer let me know"
##
## $`character(0).content`
## [1] "i though cena might be sad bc of what goin on with him but he definit didnt seem sad"
##
## $`character(0).content`
## [1] "the day is go by too quick tri to get everyth finish up befor anoth amaz weekend"
##
## $`character(0).content`
## [1] "one of the surest bet in super bowl histori is side with the field goalsafeti when it come to what the first score of the game will be"
##
## $`character(0).content`
## [1] "nope just the one"
corpus[[1]]$content
## [1] "homesomebodi never had grandchildren"
sample_df = data.frame(text=unlist(sapply(corpus, '[',"content")),stringsAsFactors=F)
token_delim =" \\t\\r\\n.!?,;\"()"
length(corpus)
## [1] 44
sapply(ls(), class) # note: ls() returns object names as character strings, so every class reported below is "character"
## abc Blogs_sample blogsfile
## "character" "character" "character"
## combine_clean_files combine_files corpus
## "character" "character" "character"
## docs1 ds dtm
## "character" "character" "character"
## News_sample newsfile sample_df
## "character" "character" "character"
## sdata sds t
## "character" "character" "character"
## tdata tdm testsapply
## "character" "character" "character"
## token_delim Twitter_sample twitterfile
## "character" "character" "character"
## vds
## "character"
rm(corpus )
class(sample_df)
## [1] "data.frame"
dim(sample_df)
## [1] 44 1
str(sample_df)
## 'data.frame': 44 obs. of 1 variable:
## $ text: chr "homesomebodi never had grandchildren" "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with th"| __truncated__ "eatcha up insid" " this freedom of plant will help to prevent rob steal and murder" ...
length (sample_df)
## [1] 1
#inspect(sample_df[1:5,])
saveRDS(sample_df, file="Sample_data.txt")
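The saved object can be restored in a later session with readRDS() (assuming the file written above):
# Reload the saved sample data frame
sample_df <- readRDS("Sample_data.txt")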
Tokenize the sample into n-grams (unigrams to quadgrams)
UnigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=1,max=1))
BigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=2,max=2, delimiters = token_delim))
TrigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=3,max=3, delimiters = token_delim))
QuadgramTokenizer = NGramTokenizer(sample_df, Weka_control(min=4,max=4, delimiters = token_delim))
class(UnigramTokenizer)
## [1] "character"
class(BigramTokenizer)
## [1] "character"
class(TrigramTokenizer)
## [1] "character"
class(QuadgramTokenizer)
## [1] "character"
head(UnigramTokenizer)
## [1] "c" "homesomebodi" "never" "had"
## [5] "grandchildren" "januari"
head(BigramTokenizer)
## [1] "c homesomebodi" "homesomebodi never" "never had"
## [4] "had grandchildren" "grandchildren januari" "januari jone"
head(TrigramTokenizer)
## [1] "c homesomebodi never" "homesomebodi never had"
## [3] "never had grandchildren" "had grandchildren januari"
## [5] "grandchildren januari jone" "januari jone wouldnt"
head(QuadgramTokenizer)
## [1] "c homesomebodi never had"
## [2] "homesomebodi never had grandchildren"
## [3] "never had grandchildren januari"
## [4] "had grandchildren januari jone"
## [5] "grandchildren januari jone wouldnt"
## [6] "januari jone wouldnt have"
dim(UnigramTokenizer)
## NULL
dim(BigramTokenizer)
## NULL
dim(TrigramTokenizer)
## NULL
dim(QuadgramTokenizer)
## NULL
Convert the n-gram frequency tables to data frames
unigramTable=data.frame(table(UnigramTokenizer))
bigramTable=data.frame(table(BigramTokenizer))
trigramTable=data.frame(table(TrigramTokenizer))
QrigramTable=data.frame(table(QuadgramTokenizer))
class(unigramTable)
## [1] "data.frame"
class(bigramTable)
## [1] "data.frame"
class(trigramTable)
## [1] "data.frame"
class(QrigramTable)
## [1] "data.frame"
head(unigramTable)
## UnigramTokenizer Freq
## 1 a 29
## 2 abl 3
## 3 about 2
## 4 account 1
## 5 action 2
## 6 advertis 2
head(bigramTable)
## BigramTokenizer Freq
## 1 a bigger 1
## 2 a book 1
## 3 a candl 1
## 4 a compromis 1
## 5 a few 1
## 6 a final 1
head(trigramTable)
## TrigramTokenizer Freq
## 1 a bigger fag 1
## 2 a book from 1
## 3 a candl audjpi 1
## 4 a compromis a 1
## 5 a few nation 1
## 6 a final strong 1
head(QrigramTable)
## QuadgramTokenizer Freq
## 1 a bigger fag what 1
## 2 a book from them 1
## 3 a candl audjpi oh 1
## 4 a compromis a stopgap 1
## 5 a few nation vacanc 1
## 6 a final strong word 1
dim(unigramTable)
## [1] 523 2
dim(bigramTable)
## [1] 916 2
dim(trigramTable)
## [1] 956 2
dim(QrigramTable)
## [1] 959 2
Sort the n-gram tables by frequency (descending)
unigramTable=unigramTable[order(unigramTable$Freq,decreasing = TRUE),]
bigramTable=bigramTable[order(bigramTable$Freq,decreasing = TRUE),]
trigramTable=trigramTable[order(trigramTable$Freq,decreasing = TRUE),]
QrigramTable=QrigramTable[order(QrigramTable$Freq,decreasing = TRUE),]
class(unigramTable)
## [1] "data.frame"
class(bigramTable)
## [1] "data.frame"
class(trigramTable)
## [1] "data.frame"
class(QrigramTable)
## [1] "data.frame"
head(unigramTable)
## UnigramTokenizer Freq
## 449 the 41
## 1 a 29
## 466 to 25
## 19 and 23
## 306 of 19
## 173 for 18
head(bigramTable)
## BigramTokenizer Freq
## 817 to the 5
## 312 for the 4
## 29 abl to 3
## 552 of the 3
## 872 when i 3
## 19 a seri 2
head(trigramTable)
## TrigramTokenizer Freq
## 78 and use the 2
## 322 for the follow 2
## 715 so much for 2
## 762 that would be 2
## 1 a bigger fag 1
## 2 a book from 1
head(QrigramTable)
## QuadgramTokenizer Freq
## 1 a bigger fag what 1
## 2 a book from them 1
## 3 a candl audjpi oh 1
## 4 a compromis a stopgap 1
## 5 a few nation vacanc 1
## 6 a final strong word 1
length(unigramTable)
## [1] 2
length(bigramTable)
## [1] 2
length(trigramTable)
## [1] 2
length(QrigramTable)
## [1] 2
Unigram bar plot and word cloud
ggplot(unigramTable[1:25,], aes(x=reorder(UnigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="red") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("1-Gram: Top 25 Occurrence Words")
wordcloud(unigramTable$UnigramTokenizer, unigramTable$Freq, scale=c(3,0.1),colors=brewer.pal(6, "Dark2"),rot.per=0.35, max.words=40)
Bigram bar plot and word cloud
ggplot(bigramTable[1:25,], aes(x=reorder(BigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="green") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("2-Gram: Top 25 Occurrence Words")
wordcloud(bigramTable$BigramTokenizer, bigramTable$Freq, scale=c(3,0.1), colors=brewer.pal(6, "Dark2"),rot.per=0.35, max.words=40)
Trigram bar plot
ggplot(trigramTable[1:25,], aes(x=reorder(TrigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="red") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("3-Gram: Top 25 Occurrence Words")
Quadgram bar plot
ggplot(QrigramTable[1:25,], aes(x=reorder(QuadgramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="green") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("4-Gram: Top 25 Occurrence Words")
head(findFreqTerms(x = tdm, lowfreq = 8, highfreq = Inf ))
## [1] "and" "for" "that" "the" "you"
head(findFreqTerms(x = dtm, lowfreq = 8, highfreq = Inf ))
## [1] "and" "for" "that" "the" "you"
head(findAssocs(x=tdm, term = "bubble", corlimit =0.2))
## $bubble
## numeric(0)
head(findAssocs(x=dtm, term = "bubble", corlimit =0.2))
## $bubble
## numeric(0)
#inspect(tdm[1:2,])
#inspect(dtm[1:2,])
system.time(predict <- trigramTable[trigramTable$TrigramTokenizer == 'a lot of', ])
## user system elapsed
## 0.002 0.000 0.001
system.time(predict <- match('a lot of', trigramTable$TrigramTokenizer))
## user system elapsed
## 0 0 0
predict <- trigramTable[trigramTable$TrigramTokenizer == 'a lot of', ]
predict <- match('a lot of', trigramTable$TrigramTokenizer)
str(predict)
## int NA
class(predict)
## [1] "integer"
summary(predict)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1
table(predict)
## < table of extent 0 >
predict
## [1] NA
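The NA above simply means "a lot of" does not occur in this tiny stemmed sample. Looking ahead to the prediction model, the sorted n-gram tables already support a very naive lookup: take the highest-frequency trigram whose first two words match the input, and back off to the bigram table when nothing matches. A rough sketch (predict_next is a hypothetical helper; smoothing and a much larger sample are left for the final application):
# Sketch: naive next-word lookup with trigram -> bigram backoff
predict_next <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  tri_prefix <- paste(words, collapse = " ")
  hits <- trigramTable[grepl(paste0("^", tri_prefix, " "),
                             trigramTable$TrigramTokenizer), ]
  if (nrow(hits) == 0) {
    bi_prefix <- tail(words, 1)
    hits <- bigramTable[grepl(paste0("^", bi_prefix, " "),
                              bigramTable$BigramTokenizer), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # tables are already sorted by Freq, so the first match is the best candidate
  tail(unlist(strsplit(as.character(hits[1, 1]), " ")), 1)
}
predict_next("for the")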
ls()
## [1] "abc" "bigramTable" "BigramTokenizer"
## [4] "Blogs_sample" "blogsfile" "combine_clean_files"
## [7] "combine_files" "docs1" "ds"
## [10] "dtm" "News_sample" "newsfile"
## [13] "predict" "QrigramTable" "QuadgramTokenizer"
## [16] "sample_df" "sdata" "sds"
## [19] "t" "tdata" "tdm"
## [22] "testsapply" "token_delim" "trigramTable"
## [25] "TrigramTokenizer" "Twitter_sample" "twitterfile"
## [28] "unigramTable" "UnigramTokenizer" "vds"