The goal of the Data Science Capstone Project is to build a predictive model (Natural Language Processing) for next-word prediction. Given a word or phrase as input, the product/application shall try to predict the next word.
This milestone report presents an exploratory analysis of the training data, carried out to understand the distribution of words and the relationships between words, tokens, and phrases in the corpora.
Understand the frequencies of words and word pairs - build figures and tables that show how the frequencies of words and word pairs vary across the data.
Capstone Dataset is available at : https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The dataset contains data obtained from blog posts, news feeds, and tweets from Twitter. The files are saved in txt format with \n newline separators.
Set the working directory and remove all old objects
setwd("/home/alok/capstone_swk")
rm(list = ls())
Load the libraries (note: not all of these libraries are used here, but they will be used in the final application)
suppressWarnings(suppressMessages(library(igraph)))
suppressWarnings(suppressMessages(library(biclust)))
suppressWarnings(suppressMessages(library(RColorBrewer)))
suppressWarnings(suppressMessages(library(tm)))
suppressWarnings(suppressMessages(library(SnowballC)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(wordcloud)))
suppressWarnings(suppressMessages(library(cluster)))
suppressWarnings(suppressMessages(library(RWeka)))
suppressWarnings(suppressMessages(library(caTools)))
suppressWarnings(suppressMessages(library(rpart)))
suppressWarnings(suppressMessages(library(rpart.plot)))
suppressWarnings(suppressMessages(library(randomForest)))
suppressWarnings(suppressMessages(library(qdap)))
Download the data
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",destfile="Coursera-SwiftKey.zip",method="curl")
unzip("Coursera-SwiftKey.zip")
blogsfile <- "final/en_US/en_US.blogs.txt"
newsfile <- "final/en_US/en_US.news.txt"
twitterfile <- "final/en_US/en_US.twitter.txt"
combine_files <- "final/en_US/en_US.all.txt"
combine_clean_files <- "final/en_US/en_US.all_3.txt"
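As a small optional guard (an assumption, not part of the original script), the download and unzip steps can be skipped when the archive is already on disk:
# Hypothetical guard: fetch and extract only when the files are missing
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip", method = "curl")
}
if (!file.exists(blogsfile)) {
  unzip("Coursera-SwiftKey.zip")
}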
Raw data summary (the all_file_cat* rows come from the Unix script below)
| file_name | Unique words (types) | File size (MB) | Lines | Words (tokens) | Characters |
|---|---|---|---|---|---|
| blogs | 1214516 | 200.4297 | 899288 | 37334114 | 210160014 |
| news | 945730 | 196.2812 | 1010242 | 34365936 | 205811889 |
| twitter | 1443911 | 159.3672 | 2360148 | 30359804 | 167105338 |
| Sum (blogs + news + twitter) | 2825934 | 556.0781 | 4269678 | 102059854 | 583077241 |
| all_file_cat | 2825934 | 556.0703 | 4269678 | 103041866 | 583077241 |
| all_file_cat_c | 541029 | 525.8711 | 4269678 | 103041866 | 551410913 |
**Unix Code**
combine_files <- (system ("cat final/en_US/en_US.blogs.txt final/en_US/en_US.news.txt final/en_US/en_US.twitter.txt > final/en_US/en_US.all.txt", intern=TRUE))
system ("bash /home/alok/capstone_swk/final/en_US/data_clean.bash")
cat data_clean.bash
#!/bin/bash
cd /home/alok/capstone_swk/final/en_US
# Convert upper case to lower case
sed 's/\([A-Z]\)/\L\1/g' en_US.all.txt > en_US.all_1.txt
# Replace all numbers and special characters with a space
sed 's/[^a-z]/ /g;' en_US.all_1.txt > en_US.all_2.txt
# Squeeze repeated whitespace and trim leading/trailing spaces
awk '{$1=$1};1' en_US.all_2.txt > en_US.all_3.txt
rm en_US.all_1.txt
rm en_US.all_2.txt
## Identify each unique word and its frequency, sorted numerically by count
## (inspect the counts with e.g. "tail -n 50 en_US.all_3_uniq.txt")
cat en_US.all_3.txt|tr " " "\n"|sort |uniq -c|sort -k 1n -r > en_US.all_3_uniq.txt
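For reference, a rough R equivalent of the Unix frequency pipeline above (a sketch only; it assumes the cleaned file en_US.all_3.txt exists and fits in memory):
# Sketch: word frequencies of the cleaned corpus, analogous to tr | sort | uniq -c
all_lines <- readLines("final/en_US/en_US.all_3.txt", warn = FALSE, skipNul = TRUE)
all_words <- unlist(strsplit(all_lines, " ", fixed = TRUE))
word_freq <- sort(table(all_words[all_words != ""]), decreasing = TRUE)
head(word_freq, 50)   # 50 most frequent words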
Read and sample the data
blogs <- readLines(blogsfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
news <- readLines(newsfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
twitter <- readLines(twitterfile, encoding="UTF-8", warn=FALSE, skipNul=TRUE)
class(blogs)
## [1] "character"
class(news)
## [1] "character"
class(twitter)
## [1] "character"
**Random sample of documents.** Approximately 1% of 899,288 lines is about 8,992; since the number of unique words is high, a sample of roughly 1% to 1.5% would be appropriate. (A much smaller sample is drawn below to keep this report manageable.)
Blogs_sample <- sample(blogs, 10)
Approx 1% of 1010242 is 10102
News_sample <- sample(news, 11)
Approx 1% of 2360148 is 23601
Twitter_sample <- sample(twitter, 23)
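For a larger, reproducible sample (a sketch with hypothetical object names; this report keeps the tiny samples above so the output stays readable), the seed can be fixed and roughly 1% drawn from each source:
# Sketch: seeded ~1% samples for repeatable results
set.seed(1234)
blogs_sample_1pct   <- sample(blogs,   round(length(blogs)   * 0.01))
news_sample_1pct    <- sample(news,    round(length(news)    * 0.01))
twitter_sample_1pct <- sample(twitter, round(length(twitter) * 0.01))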
class(Blogs_sample )
## [1] "character"
class(News_sample)
## [1] "character"
class(Twitter_sample)
## [1] "character"
Remove the blogs, news, and twitter objects to free memory
rm(blogs,news,twitter)
Clean Sample data
sdata <- c(Blogs_sample, News_sample, Twitter_sample)
class(sdata)
## [1] "character"
summary(sdata[[1]])
## Length Class Mode
## 1 character character
Sample cleaning code (demonstrated on a test string)
tdata <- 'Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:'
class(tdata)
## [1] "character"
str(tdata)
## chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
Remove Hash tags
tdata <- gsub(" #\\S*","", tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy @ruby http://www.global.com 1234355 7$%^%&%^*^,:"
Remove URLs
tdata <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy @ruby 1234355 7$%^%&%^*^,:"
Remove twitter accounts
tdata <- gsub(" @[^\\s]+","",tdata)
tdata
## [1] "Hello 7$%^%&%^*^,: tammy"
Remove special characters
tdata <- gsub("[^0-9A-Za-z///' ]", "", tdata)
tdata
## [1] "Hello 7 tammy"
unlist example
abc <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
abc
## $a
## $a[[1]]
## [1] 1 2 3 4 5
##
## $a[[2]]
## [1] "A" "B" "C" "D" "E"
##
##
## $b
## [1] "Z"
##
## $c
## [1] NA
abc <- unlist(abc, recursive = FALSE)
abc
## $a1
## [1] 1 2 3 4 5
##
## $a2
## [1] "A" "B" "C" "D" "E"
##
## $b
## [1] "Z"
##
## $c
## [1] NA
abc <- list(a = list(1:5, LETTERS[1:5]), b = "Z", c = NA)
abc <- unlist(abc, recursive = TRUE)
abc
## a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 b c
## "1" "2" "3" "4" "5" "A" "B" "C" "D" "E" "Z" NA
Corpus Example
docs1 <- c("This is a text.", "This is another one.", "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:", "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'")
class(docs1)
## [1] "character"
typeof(docs1)
## [1] "character"
str(docs1)
## chr [1:4] "This is a text." "This is another one." ...
docs1
## [1] "This is a text."
## [2] "This is another one."
## [3] "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## [4] "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
docs1 <- data.frame(docs1)
class(docs1)
## [1] "data.frame"
typeof(docs1)
## [1] "list"
str(docs1)
## 'data.frame': 4 obs. of 1 variable:
## $ docs1: Factor w/ 4 levels "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'",..: 3 4 2 1
docs1
## docs1
## 1 This is a text.
## 2 This is another one.
## 3 Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:
## 4 Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'
ds <- DataframeSource(docs1)
class(ds)
## [1] "DataframeSource" "SimpleSource" "Source"
typeof(ds)
## [1] "list"
str(ds)
## List of 4
## $ encoding: chr ""
## $ length : int 4
## $ position: num 0
## $ reader :function (elem, language, id)
## - attr(*, "class")= chr [1:3] "DataframeSource" "SimpleSource" "Source"
ds
## $encoding
## [1] ""
##
## $length
## [1] 4
##
## $position
## [1] 0
##
## $reader
## function (elem, language, id)
## {
## if (!is.null(elem$uri))
## id <- basename(elem$uri)
## PlainTextDocument(elem$content, id = id, language = language)
## }
## <environment: namespace:tm>
##
## $content
## docs1
## 1 This is a text.
## 2 This is another one.
## 3 Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:
## 4 Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'
##
## attr(,"class")
## [1] "DataframeSource" "SimpleSource" "Source"
sds <- Corpus(ds)
class(sds)
## [1] "VCorpus" "Corpus"
typeof(sds)
## [1] "list"
str(sds)
## List of 4
## $ 1:List of 2
## ..$ content: chr "This is a text."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr "This is another one."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
sds
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
inspect(sds)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 80
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 77
vds <- VCorpus(ds)
class(vds)
## [1] "VCorpus" "Corpus"
typeof(vds)
## [1] "list"
str(vds)
## List of 4
## $ 1:List of 2
## ..$ content: chr "This is a text."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2:List of 2
## ..$ content: chr "This is another one."
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: tammy @ruby #world http://www.global.com 1234355 7$%^%&%^*^,:"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4:List of 2
## ..$ content: chr "Hello 7$%^%&%^*^,: ammy @ruby #world http://www.ttt.com 1234355 7$%^%&%^*^,:'"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
vds
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
inspect(vds)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 4
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 20
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 80
##
## [[4]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 77
Remove Hash tags
sdata <- gsub(" #\\S*","", sdata)
Remove URLs
sdata <- gsub("(f|ht)(tp)(s?)(://)(\\S*)", "", sdata)
Remove twitter accounts
sdata <- gsub(" @[^\\s]+","",sdata)
Remove special characters
sdata <- gsub("[^0-9A-Za-z///' ]", "", sdata)
Build the corpus from sample data
t <- VectorSource(sdata)
class(t)
## [1] "VectorSource" "SimpleSource" "Source"
dim(t)
## NULL
summary(t[[1]])
## Length Class Mode
## 1 character character
#t$VectorSource[[1]
#$SimpleSource[[1]]
#t$Source[[1]]
str(t[[1]])
## chr ""
str(t[[3]])
## num 0
str(t[[2]])
## int 44
##head(t) ## Working lot of data
##tail(t) ## Working lot of data
corpus <- Corpus(VectorSource(sdata))
class(corpus)
## [1] "VCorpus" "Corpus"
dim(corpus)
## NULL
str(corpus)
## List of 44
## $ 1 :List of 2
## ..$ content: chr "homesomebody never had grandchildren"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "1"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 2 :List of 2
## ..$ content: chr "January Jones wouldnt have been my first choice for Emma Frost but I think she did a pretty good job I wasnt hugely impressed w"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "2"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 3 :List of 2
## ..$ content: chr "eatcha up inside"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "3"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 4 :List of 2
## ..$ content: chr "5 This freedom of planting will help to prevent robbing stealing and murdering"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "4"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 5 :List of 2
## ..$ content: chr "When I first started out job hunting as a teenager there were no recruitment agencies in my town Every job going was advertised"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "5"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 6 :List of 2
## ..$ content: chr "BF Yes its very well documented"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "6"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 7 :List of 2
## ..$ content: chr "Related Bible verses Genesis 22912 Ephesians 320 Exodus 3314"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "7"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 8 :List of 2
## ..$ content: chr "Dear TigerTime Supporter"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "8"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 9 :List of 2
## ..$ content: chr "Assign a Social Media Monitor is the 49th in a series of excerpts from our book Be a Person the Social Operating Manual for Ent"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "9"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 10:List of 2
## ..$ content: chr "Penwizard have kindly offered FREE postage and packaging to any of my readers who wish to purchase a book from them just go to "| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "10"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 11:List of 2
## ..$ content: chr "It's been pretty quiet today she said I haven't heard of any serious crashes"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "11"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 12:List of 2
## ..$ content: chr "Other church members who attended the hearing said they don't believe their church is going to be able to raise 50 million in 1"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "12"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 13:List of 2
## ..$ content: chr "Another option is small claims court but given the size of your refund just 70 it is probably impractical I think a final str"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "13"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 14:List of 2
## ..$ content: chr "Class of 2012 He was a PostDispatch firstteam AllMetro selection last fall as a junior The 6foot1 195pounder accounted for 1914"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "14"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 15:List of 2
## ..$ content: chr "BOSTON Cupcakes brownies and other baked goodies will be spared the chopping block at Massachusetts schools after Gov Deval Pa"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "15"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 16:List of 2
## ..$ content: chr "Game 37"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "16"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 17:List of 2
## ..$ content: chr "6820 St Olaf Dr 36667"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "17"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 18:List of 2
## ..$ content: chr "The image showed a significant amount of white stuff"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "18"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 19:List of 2
## ..$ content: chr "Apocalyptic rhetoric aside Rubio's proposal is simply a compromise a stopgap one of those virtuous but messy and elusive const"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "19"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 20:List of 2
## ..$ content: chr "Pacific Palisades CA 90272"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "20"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 21:List of 2
## ..$ content: chr "Producer Joel Silver whose blockbuster resume includes The Matrix and Die Hard teams with After Dark Films to launch a new slat"| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "21"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 22:List of 2
## ..$ content: chr "Going live with this week"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "22"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 23:List of 2
## ..$ content: chr "Public comment is now closed QA time for Budget Committee members"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "23"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 24:List of 2
## ..$ content: chr "Limbering up for my Park debut Find me at Green Week when I bring energy efficiency to the masses wwwsmartenergypayscom"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "24"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 25:List of 2
## ..$ content: chr "What makes the engine go Desire desire desire"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "25"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 26:List of 2
## ..$ content: chr "Today could be Busy 911 Remember at Shaw's in NE MPLSconcert fundraiser charity Gopher tailgate at Stub and Herb's"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "26"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 27:List of 2
## ..$ content: chr "lol forgot to mention ass too what you a bigger fag"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "27"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 28:List of 2
## ..$ content: chr "what's the domain we'll check it out"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "28"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 29:List of 2
## ..$ content: chr "For the life of me I don't understand button down collars Are you worried about the collar getting away from you"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "29"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 30:List of 2
## ..$ content: chr "I laughed at it my dumb ass was looking back to see when I said that"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "30"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 31:List of 2
## ..$ content: chr "P2 needs surgery If you care about him don't vote for him"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "31"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 32:List of 2
## ..$ content: chr "And for the record I don't agree with most of what Frank said He's venting Using war to justify antOWS is dumb"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "32"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 33:List of 2
## ..$ content: chr "Thanks for the follow Nick"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "33"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 34:List of 2
## ..$ content: chr "Derby time"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "34"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 35:List of 2
## ..$ content: chr "Congrats to the cubs fans celebrate and kick off your weekend with a healthy startfruit smoothies sodas and sandwiches"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "35"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 36:List of 2
## ..$ content: chr "1st off get it right name's CynicalripaMic sicka than Vlad the Impaler stickin ya to the tip of a spike"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "36"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 37:List of 2
## ..$ content: chr "Now that is a candle AUDJPY"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "37"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 38:List of 2
## ..$ content: chr "oh and you popped out to the second baseman"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "38"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 39:List of 2
## ..$ content: chr "FF thanks for help to"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "39"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 40:List of 2
## ..$ content: chr "Thanks so much for the follow If you ever need a singer let me know"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "40"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 41:List of 2
## ..$ content: chr "I though Cena might be sad bc of what's goin on with him but he definitely didn't seem sad "
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "41"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 42:List of 2
## ..$ content: chr "The day is going by too quickly trying to get everything finished up before another amazing weekend"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "42"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 43:List of 2
## ..$ content: chr "One of the surest bets in Super Bowl history is siding with the field goal/safety when it comes to what the first score of the "| __truncated__
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "43"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## $ 44:List of 2
## ..$ content: chr "nope just the one"
## ..$ meta :List of 7
## .. ..$ author : chr(0)
## .. ..$ datetimestamp: POSIXlt[1:1], format: "2016-01-04 16:16:36"
## .. ..$ description : chr(0)
## .. ..$ heading : chr(0)
## .. ..$ id : chr "44"
## .. ..$ language : chr "en"
## .. ..$ origin : chr(0)
## .. ..- attr(*, "class")= chr "TextDocumentMeta"
## ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
## - attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
#summary(corpus)
corpus[[1]]$content
## [1] "homesomebody never had grandchildren"
summary(corpus[[1]])
## Length Class Mode
## content 1 -none- character
## meta 7 TextDocumentMeta list
Clean the corpus
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
#corpus <- tm_map(corpus, removeWords,stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
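A note on the first two steps: applying tolower() directly strips the document metadata (which is likely why the document IDs appear as character(0) in the matrices below). In newer versions of tm the usual idiom is to wrap base functions in content_transformer(), which keeps the documents as PlainTextDocument objects. A sketch of that alternative (corpus2 is a hypothetical name, not the object used below):
# Sketch: same cleaning with content_transformer(), preserving document classes
corpus2 <- Corpus(VectorSource(sdata))
corpus2 <- tm_map(corpus2, content_transformer(tolower))
corpus2 <- tm_map(corpus2, removePunctuation)
corpus2 <- tm_map(corpus2, removeNumbers)
corpus2 <- tm_map(corpus2, stripWhitespace)
corpus2 <- tm_map(corpus2, stemDocument)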
Construct the document matrices
tdm <- TermDocumentMatrix(corpus)
dtm <- DocumentTermMatrix(corpus)
Class of the document matrices
class(tdm)
## [1] "TermDocumentMatrix" "simple_triplet_matrix"
class(dtm)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
str(tdm)
## List of 6
## $ i : int [1:670] 180 185 200 270 1 7 13 17 22 28 ...
## $ j : int [1:670] 1 1 1 1 2 2 2 2 2 2 ...
## $ v : num [1:670] 1 1 1 1 1 1 1 4 1 1 ...
## $ nrow : int 482
## $ ncol : int 44
## $ dimnames:List of 2
## ..$ Terms: chr [1:482] "abl" "about" "account" "action" ...
## ..$ Docs : chr [1:44] "character(0)" "character(0)" "character(0)" "character(0)" ...
## - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
str(dtm)
## List of 6
## $ i : int [1:670] 1 1 1 1 2 2 2 2 2 2 ...
## $ j : int [1:670] 180 185 200 270 1 7 13 17 22 28 ...
## $ v : num [1:670] 1 1 1 1 1 1 1 4 1 1 ...
## $ nrow : int 44
## $ ncol : int 482
## $ dimnames:List of 2
## ..$ Docs : chr [1:44] "character(0)" "character(0)" "character(0)" "character(0)" ...
## ..$ Terms: chr [1:482] "abl" "about" "account" "action" ...
## - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
## - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
Inspect the document matrices
inspect(dtm[1:5, 1:20])
## <<DocumentTermMatrix (documents: 5, terms: 20)>>
## Non-/sparse entries: 12/88
## Sparsity : 88%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Terms
## Docs abl about account action advertis advis after agenc agre
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 1 0 0 0 0 0 1 0 0
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 0 0
## character(0) 1 0 0 0 2 1 0 1 0
## Terms
## Docs all allmetro also altern amaz amount anchor and ani anoth
## character(0) 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 1 0 0 0 4 0 0
## character(0) 0 0 0 0 0 0 0 0 0 0
## character(0) 0 0 0 0 0 0 0 1 0 0
## character(0) 0 0 2 0 0 0 0 1 1 0
## Terms
## Docs antow
## character(0) 0
## character(0) 0
## character(0) 0
## character(0) 0
## character(0) 0
inspect(tdm[1:5, 1:20])
## <<TermDocumentMatrix (terms: 5, documents: 20)>>
## Non-/sparse entries: 5/95
## Sparsity : 95%
## Maximal term length: 8
## Weighting : term frequency (tf)
##
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 1 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 1 0 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 2 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 1
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 0
## about 0 0 0 0
## account 0 1 0 0
## action 0 0 0 0
## advertis 0 0 0 0
## Docs
## Terms character(0) character(0) character(0) character(0)
## abl 0 0 0 0
## about 0 0 0 0
## account 0 0 0 0
## action 0 0 0 0
## advertis 0 0 0 0
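The term-document matrix can also be summarized directly; a small sketch (converting to a dense matrix is fine at this sample size) listing the most frequent stemmed terms:
# Sketch: overall term frequencies from the term-document matrix
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
head(term_freq, 10)   # ten most frequent terms in the sample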
Generate the data frame from the corpus and remove the corpus object
corpus[[1]]$content
unlist(): takes a list as input and creates an element for each item of the list.
sapply(object, function, ...)
object can be a list, data frame, or vector.
function, ...: the function to apply, plus any function-specific arguments.
output: a vector, matrix, or list.
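For instance, a toy illustration of the simplification sapply() performs:
# sapply() returns a named vector here because each element's result has length 1
sapply(list(a = 1:3, b = 4:6), sum)
## a  b
## 6 15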
colnames(corpus)
## NULL
corpus[1]
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
corpus[[1]]$meta
## author : character(0)
## datetimestamp: 2016-01-04 16:16:37
## description : character(0)
## heading : character(0)
## id : character(0)
## language : character(0)
## origin : character(0)
corpus[[1]]$meta$datetimestamp
## [1] "2016-01-04 16:16:37 GMT"
corpus[[1]]$content
## [1] "homesomebodi never had grandchildren"
corpus[[2]]$content
## [1] "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with the effect of her in her diamond state but as mani a twilight fan can attestget those damn diamond effect can be challeng i rememb riptid releas littl projectil stake while spin realli fastnot so much for creat tornado and crap im not sure whi they felt the need to chang that other than to bring down the stealth jet dure the climaxwhich i guess is an ok reason that devillook guy that look an aw lot like nightcrawl and use the same teleport effect that would be azazel and he look like nightcrawl becaus hes nightcrawl daddi are you wonder whi nightcrawl is blue that would be becaus mystiqu is his mamaand threw him down a well after he was born azazel is biblic and should technic be trap in an altern dimens thank to his teleport skillz hes abl to come here everi onc in a while for just long enough to knock up a random woman which he doe often"
corpus[[3]]$content
## [1] "eatcha up insid"
testsapply <- sapply(corpus, '[',"content")
class(testsapply)
## [1] "list"
typeof(testsapply)
## [1] "list"
#testsapply[[1]]$content
head(testsapply)
## $`character(0).content`
## [1] "homesomebodi never had grandchildren"
##
## $`character(0).content`
## [1] "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with the effect of her in her diamond state but as mani a twilight fan can attestget those damn diamond effect can be challeng i rememb riptid releas littl projectil stake while spin realli fastnot so much for creat tornado and crap im not sure whi they felt the need to chang that other than to bring down the stealth jet dure the climaxwhich i guess is an ok reason that devillook guy that look an aw lot like nightcrawl and use the same teleport effect that would be azazel and he look like nightcrawl becaus hes nightcrawl daddi are you wonder whi nightcrawl is blue that would be becaus mystiqu is his mamaand threw him down a well after he was born azazel is biblic and should technic be trap in an altern dimens thank to his teleport skillz hes abl to come here everi onc in a while for just long enough to knock up a random woman which he doe often"
##
## $`character(0).content`
## [1] "eatcha up insid"
##
## $`character(0).content`
## [1] " this freedom of plant will help to prevent rob steal and murder"
##
## $`character(0).content`
## [1] "when i first start out job hunt as a teenag there were no recruit agenc in my town everi job go was advertis in the jobcentr a one stop shop as it were it advertis everi job avail local and also a few nation vacanc advis were also abl to look for a particular job titl in ani other area"
##
## $`character(0).content`
## [1] "bf yes it veri well document"
tail(testsapply)
## $`character(0).content`
## [1] "ff thank for help to"
##
## $`character(0).content`
## [1] "thank so much for the follow if you ever need a singer let me know"
##
## $`character(0).content`
## [1] "i though cena might be sad bc of what goin on with him but he definit didnt seem sad"
##
## $`character(0).content`
## [1] "the day is go by too quick tri to get everyth finish up befor anoth amaz weekend"
##
## $`character(0).content`
## [1] "one of the surest bet in super bowl histori is side with the field goalsafeti when it come to what the first score of the game will be"
##
## $`character(0).content`
## [1] "nope just the one"
corpus[[1]]$content
## [1] "homesomebodi never had grandchildren"
sample_df = data.frame(text=unlist(sapply(corpus, '[',"content")),stringsAsFactors=F)
token_delim =" \\t\\r\\n.!?,;\"()"
length(corpus)
## [1] 44
sapply(ls(), class) # note: ls() returns object names as character strings, so every class reported below is "character"
## abc Blogs_sample blogsfile
## "character" "character" "character"
## combine_clean_files combine_files corpus
## "character" "character" "character"
## docs1 ds dtm
## "character" "character" "character"
## News_sample newsfile sample_df
## "character" "character" "character"
## sdata sds t
## "character" "character" "character"
## tdata tdm testsapply
## "character" "character" "character"
## token_delim Twitter_sample twitterfile
## "character" "character" "character"
## vds
## "character"
rm(corpus )
class(sample_df)
## [1] "data.frame"
dim(sample_df)
## [1] 44 1
str(sample_df)
## 'data.frame': 44 obs. of 1 variable:
## $ text: chr "homesomebodi never had grandchildren" "januari jone wouldnt have been my first choic for emma frost but i think she did a pretti good job i wasnt huge impress with th"| __truncated__ "eatcha up insid" " this freedom of plant will help to prevent rob steal and murder" ...
length (sample_df)
## [1] 1
#inspect(sample_df[1:5,])
saveRDS(sample_df, file="Sample_data.txt")
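The saved object can be restored in a later session with readRDS() (assuming the file written above):
# Reload the saved sample data frame
sample_df <- readRDS("Sample_data.txt")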
Tokenize the sample into n-grams (unigrams to quadgrams)
UnigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=1,max=1))
BigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=2,max=2, delimiters = token_delim))
TrigramTokenizer = NGramTokenizer(sample_df, Weka_control(min=3,max=3, delimiters = token_delim))
QuadgramTokenizer = NGramTokenizer(sample_df, Weka_control(min=4,max=4, delimiters = token_delim))
class(UnigramTokenizer)
## [1] "character"
class(BigramTokenizer)
## [1] "character"
class(TrigramTokenizer)
## [1] "character"
class(QuadgramTokenizer)
## [1] "character"
head(UnigramTokenizer)
## [1] "c" "homesomebodi" "never" "had"
## [5] "grandchildren" "januari"
head(BigramTokenizer)
## [1] "c homesomebodi" "homesomebodi never" "never had"
## [4] "had grandchildren" "grandchildren januari" "januari jone"
head(TrigramTokenizer)
## [1] "c homesomebodi never" "homesomebodi never had"
## [3] "never had grandchildren" "had grandchildren januari"
## [5] "grandchildren januari jone" "januari jone wouldnt"
head(QuadgramTokenizer)
## [1] "c homesomebodi never had"
## [2] "homesomebodi never had grandchildren"
## [3] "never had grandchildren januari"
## [4] "had grandchildren januari jone"
## [5] "grandchildren januari jone wouldnt"
## [6] "januari jone wouldnt have"
dim(UnigramTokenizer)
## NULL
dim(BigramTokenizer)
## NULL
dim(TrigramTokenizer)
## NULL
dim(QuadgramTokenizer)
## NULL
Convert the n-gram frequency tables to data frames
unigramTable=data.frame(table(UnigramTokenizer))
bigramTable=data.frame(table(BigramTokenizer))
trigramTable=data.frame(table(TrigramTokenizer))
QrigramTable=data.frame(table(QuadgramTokenizer))
class(unigramTable)
## [1] "data.frame"
class(bigramTable)
## [1] "data.frame"
class(trigramTable)
## [1] "data.frame"
class(QrigramTable)
## [1] "data.frame"
head(unigramTable)
## UnigramTokenizer Freq
## 1 a 29
## 2 abl 3
## 3 about 2
## 4 account 1
## 5 action 2
## 6 advertis 2
head(bigramTable)
## BigramTokenizer Freq
## 1 a bigger 1
## 2 a book 1
## 3 a candl 1
## 4 a compromis 1
## 5 a few 1
## 6 a final 1
head(trigramTable)
## TrigramTokenizer Freq
## 1 a bigger fag 1
## 2 a book from 1
## 3 a candl audjpi 1
## 4 a compromis a 1
## 5 a few nation 1
## 6 a final strong 1
head(QrigramTable)
## QuadgramTokenizer Freq
## 1 a bigger fag what 1
## 2 a book from them 1
## 3 a candl audjpi oh 1
## 4 a compromis a stopgap 1
## 5 a few nation vacanc 1
## 6 a final strong word 1
dim(unigramTable)
## [1] 523 2
dim(bigramTable)
## [1] 916 2
dim(trigramTable)
## [1] 956 2
dim(QrigramTable)
## [1] 959 2
Sort the n-gram tables by frequency (descending)
unigramTable=unigramTable[order(unigramTable$Freq,decreasing = TRUE),]
bigramTable=bigramTable[order(bigramTable$Freq,decreasing = TRUE),]
trigramTable=trigramTable[order(trigramTable$Freq,decreasing = TRUE),]
QrigramTable=QrigramTable[order(QrigramTable$Freq,decreasing = TRUE),]
class(unigramTable)
## [1] "data.frame"
class(bigramTable)
## [1] "data.frame"
class(trigramTable)
## [1] "data.frame"
class(QrigramTable)
## [1] "data.frame"
head(unigramTable)
## UnigramTokenizer Freq
## 449 the 41
## 1 a 29
## 466 to 25
## 19 and 23
## 306 of 19
## 173 for 18
head(bigramTable)
## BigramTokenizer Freq
## 817 to the 5
## 312 for the 4
## 29 abl to 3
## 552 of the 3
## 872 when i 3
## 19 a seri 2
head(trigramTable)
## TrigramTokenizer Freq
## 78 and use the 2
## 322 for the follow 2
## 715 so much for 2
## 762 that would be 2
## 1 a bigger fag 1
## 2 a book from 1
head(QrigramTable)
## QuadgramTokenizer Freq
## 1 a bigger fag what 1
## 2 a book from them 1
## 3 a candl audjpi oh 1
## 4 a compromis a stopgap 1
## 5 a few nation vacanc 1
## 6 a final strong word 1
length(unigramTable)
## [1] 2
length(bigramTable)
## [1] 2
length(trigramTable)
## [1] 2
length(QrigramTable)
## [1] 2
Unigram bar plot and word cloud
ggplot(unigramTable[1:25,], aes(x=reorder(UnigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="red") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("1-Gram: Top 25 Occurrence Words")
wordcloud(unigramTable$UnigramTokenizer, unigramTable$Freq, scale=c(3,0.1),colors=brewer.pal(6, "Dark2"),rot.per=0.35, max.words=40)
Bigram bar plot and word cloud
ggplot(bigramTable[1:25,], aes(x=reorder(BigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="green") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("2-Gram: Top 25 Occurrence Words")
wordcloud(bigramTable$BigramTokenizer, bigramTable$Freq, scale=c(3,0.1), colors=brewer.pal(6, "Dark2"),rot.per=0.35, max.words=40)
Trigram bar plot
ggplot(trigramTable[1:25,], aes(x=reorder(TrigramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="red") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("3-Gram: Top 25 Occurrence Words")
Quadgram bar plot
ggplot(QrigramTable[1:25,], aes(x=reorder(QuadgramTokenizer, -Freq, sum), y=Freq)) + geom_bar(stat="identity", fill="green") + geom_text(aes(label=Freq), vjust=-0.4) + labs(x="Words") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("4-Gram: Top 25 Occurrence Words")
head(findFreqTerms(x = tdm, lowfreq = 8, highfreq = Inf ))
## [1] "and" "for" "that" "the" "you"
head(findFreqTerms(x = dtm, lowfreq = 8, highfreq = Inf ))
## [1] "and" "for" "that" "the" "you"
head(findAssocs(x=tdm, term = "bubble", corlimit =0.2))
## $bubble
## numeric(0)
head(findAssocs(x=dtm, term = "bubble", corlimit =0.2))
## $bubble
## numeric(0)
#inspect(tdm[1:2,])
#inspect(dtm[1:2,])
system.time(predict <- trigramTable[trigramTable$TrigramTokenizer == 'a lot of', ])
## user system elapsed
## 0.002 0.000 0.001
system.time(predict <- match('a lot of', trigramTable$TrigramTokenizer))
## user system elapsed
## 0 0 0
predict <- trigramTable[trigramTable$TrigramTokenizer == 'a lot of', ]
predict <- match('a lot of', trigramTable$TrigramTokenizer)
str(predict)
## int NA
class(predict)
## [1] "integer"
summary(predict)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NA NA NA NaN NA NA 1
table(predict)
## < table of extent 0 >
predict
## [1] NA
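The NA above simply means "a lot of" does not occur in this tiny stemmed sample. Looking ahead to the prediction model, the sorted n-gram tables already support a very naive lookup: take the highest-frequency trigram whose first two words match the input, and back off to the bigram table when nothing matches. A rough sketch (predict_next is a hypothetical helper; smoothing and a much larger sample are left for the final application):
# Sketch: naive next-word lookup with trigram -> bigram backoff
predict_next <- function(phrase) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  tri_prefix <- paste(words, collapse = " ")
  hits <- trigramTable[grepl(paste0("^", tri_prefix, " "),
                             trigramTable$TrigramTokenizer), ]
  if (nrow(hits) == 0) {
    bi_prefix <- tail(words, 1)
    hits <- bigramTable[grepl(paste0("^", bi_prefix, " "),
                              bigramTable$BigramTokenizer), ]
  }
  if (nrow(hits) == 0) return(NA_character_)
  # tables are already sorted by Freq, so the first match is the best candidate
  tail(unlist(strsplit(as.character(hits[1, 1]), " ")), 1)
}
predict_next("for the")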
ls()
## [1] "abc" "bigramTable" "BigramTokenizer"
## [4] "Blogs_sample" "blogsfile" "combine_clean_files"
## [7] "combine_files" "docs1" "ds"
## [10] "dtm" "News_sample" "newsfile"
## [13] "predict" "QrigramTable" "QuadgramTokenizer"
## [16] "sample_df" "sdata" "sds"
## [19] "t" "tdata" "tdm"
## [22] "testsapply" "token_delim" "trigramTable"
## [25] "TrigramTokenizer" "Twitter_sample" "twitterfile"
## [28] "unigramTable" "UnigramTokenizer" "vds"