In this report, you will see a detailed visualization and analysis of text from three unique sources:
Blogs
News
Twitter
library(SnowballC)
library("clue")
library("tm")
## Loading required package: NLP
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
The three files are located in the en_US folder in my working directory.
con <- file("en_US/en_US.twitter.txt", "r")
twitter <- readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
con1 <- file("en_US/en_US.blogs.txt", "r")
blogs <- readLines(con1)
con2 <- file("en_US/en_US.news.txt", "r")
news <- readLines(con2)
close(con)
close(con1)
close(con2)
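The embedded-nul warnings above come from stray nul bytes in the Twitter file. They are harmless for this analysis, but as a minimal sketch, readLines's skipNul argument drops the nul bytes and silences the warnings:
con <- file("en_US/en_US.twitter.txt", "r")
twitter <- readLines(con, skipNul = TRUE) # skip embedded nul bytes instead of warning
close(con)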
The following inspects the corpus built from all three files; we can see that the data is quite large.
cname <- file.path("~", "Desktop", "final/en_US")
docs <- Corpus(DirSource(cname))
inspect(docs)
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
##
## [[1]]
## <<PlainTextDocument>>
##
## [[2]]
## <<PlainTextDocument>>
##
## [[3]]
## <<PlainTextDocument>>
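To see the raw line counts behind that claim about size, each vector can be measured directly; a quick sketch (output omitted):
length(blogs) # number of lines read from the blogs file
length(news) # number of lines read from the news file
length(twitter) # number of lines read from the twitter file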
As the data in each text document is very large, we'll work with just a sample that will represent the whole text population.
set.seed(1500)
sample.data <- c(sample(blogs, length(blogs) * 0.005),
sample(news, length(news) * 0.005),
sample(twitter, length(twitter) * 0.005))
head(sample.data)
## [1] "More pledges"
## [2] "So, we will wait & see what the baby is in August when he or she is born."
## [3] "Right after they went inside I knew that we needed to go in or I would have the baby outside. So we made our way inside and Kaye checked me. 7 cm. She started the tub and I got in. I just laid there and relaxed. I think I even went to sleep for a brief moment in between contractions. Jeremy rubbed my hand. I wanted to respond but I couldn’t, but having his support and knowing he was there was such a comfort."
## [4] "It's hard to know what I get more excited about these days. The actual event of sitting down and eating at a restaurant or the fact that I am out at all."
## [5] "Lear, of course, does a far better and more thorough job of exploring this theme. But Cymbeline does a pretty creditable job. When the exiled Posthumous sends a letter ordering his servant Pisanio to murder his wife Imogen (Posthumous has been tricked into thinking she has cuckolded him), Pisanio, upon reading the letter, soliloquizes:"
## [6] "Card Maker"
length(sample.data)
## [1] 21347
Here we'll clean the data to make it ready for analysis. We'll convert the whole text to lower case; remove punctuation, numbers, stop words, and unnecessary white space; and stem each word.
doc.vec <- VectorSource(sample.data)
doc.corpus <- Corpus(doc.vec)
doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
doc.corpus <- tm_map(doc.corpus, removePunctuation)
doc.corpus <- tm_map(doc.corpus, removeNumbers)
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
doc.corpus <- tm_map(doc.corpus, stemDocument)
doc.corpus <- tm_map(doc.corpus, stripWhitespace)
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)
inspect(doc.corpus[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 6
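The same mechanism extends to custom cleaning steps: content_transformer wraps any string function so it can run inside tm_map. A sketch with a hypothetical URL-stripping helper (not part of the pipeline above):
removeURL <- function(x) gsub("http\\S+", "", x, perl = TRUE) # hypothetical helper: drop http/https tokens
doc.corpus <- tm_map(doc.corpus, content_transformer(removeURL))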
Here we'll convert the corpus into term-document and document-term matrices, the forms we can work with for counting and correlation.
TDM <- TermDocumentMatrix(doc.corpus)
DTM <- DocumentTermMatrix(doc.corpus)
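Before plotting, tm can probe these matrices directly. A minimal sketch listing the terms that occur at least 500 times and trimming very rare terms:
findFreqTerms(DTM, lowfreq = 500) # terms occurring at least 500 times overall
TDM.common <- removeSparseTerms(TDM, 0.99) # drop terms absent from more than 99% of documents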
Let’s visualize the data
freq <- colSums(as.matrix(DTM))
length(freq)
## [1] 41034
ord <- order(freq)
wf <- data.frame(word=names(freq), freq=freq)
head(wf)
## word freq
## <c2><92> <c2><92> 1
## <c2><92>heck <c2><92>heck 1
## <c2><93> <c2><93> 15
## <c2><93>certainly <c2><93>certainly 1
## <c2><93>it<c2><92>s <c2><93>it<c2><92>s 2
## <c2><93>one <c2><93>one 1
tail(wf)
## word freq
## zurich zurich 3
## zuzu zuzu 1
## zweifel zweifel 1
## zwick zwick 1
## zyl zyl 1
## zynga zynga 1
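The head of the table shows byte debris such as <c2><92> and <c2><93>, curly quotes that survived cleaning. A sketch of one common fix, assuming the pipeline is re-run after converting the sample to plain ASCII:
sample.data <- iconv(sample.data, "latin1", "ASCII", sub = "") # strip non-ASCII bytes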
Plot words that appear more than 500 times
p <- ggplot(subset(wf, freq>500), aes(word, freq))
p <- p + geom_bar(stat="identity", colour="#CC79A7", fill="#CC79A7")
p <- p + theme(axis.text.x=element_text(angle=45, hjust=1) )
p
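For readability, the bars can also be sorted by frequency; a sketch using ggplot2's reorder:
p <- ggplot(subset(wf, freq > 500), aes(reorder(word, -freq), freq))
p <- p + geom_bar(stat = "identity", colour = "#CC79A7", fill = "#CC79A7")
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("word")
p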
Let's pick a word that is meaningful to this analysis and identify the terms that correlate most highly with it.
If two words always appear together, their correlation is 1.0. Here we specify a correlation limit of 0.1.
findAssocs(TDM, "great", corlimit=0.1)
## $great
## befriends demagogue psyched solicit squaddies standby
## 0.11 0.11 0.11 0.11 0.11 0.11
## trx understudy pours
## 0.11 0.11 0.10
We will need to load the package that makes word clouds in R.
library(wordcloud)
## Loading required package: RColorBrewer
Plot words that occur at least 300 times.
set.seed(1500)
wordcloud(names(freq), freq, min.freq=300)
Plot words that occur at least 500 times.
set.seed(1500)
wordcloud(names(freq), freq, min.freq=500, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
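If a cloud gets crowded, wordcloud's max.words argument caps the number of terms drawn; a sketch:
set.seed(1500)
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(6, "Dark2"))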
This is all for this scope, as this report covers only exploratory analysis and visualization of the text data. Our next report will be a Shiny application that comprehensively showcases the processing involved in building a prediction model for forecasting user text input.
Thank you.