Capstone Course Assignment: Exploratory Analysis of a New Form of Data: Text (Text Mining)

Author: Ruth Okoilu

Date: Sunday, June 12, 2016

========================================================

Text Mining in R

This report presents a detailed visualization and analysis of text from three distinct sources:

  1. Twitter

  2. Blogs

  3. News

To perform this task, we'll require the following packages:

library(SnowballC)  # word stemming
library(clue)       # cluster analysis utilities
library(tm)         # text mining framework
## Loading required package: NLP
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
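
If any of these packages are not yet installed, a one-time install (a sketch) is:

# One-time installation of every package used in this report
install.packages(c("SnowballC", "clue", "tm", "ggplot2", "wordcloud"))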

Getting the data

The following code loads the three text documents (.txt files) from their location on the computer into R.

The three files are located in the en_US folder in my working directory.

con <- file("en_US/en_US.twitter.txt", "r")  # open a read connection
twitter <- readLines(con)                    # read every line into a character vector
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
con1 <- file("en_US/en_US.blogs.txt", "r") 
blogs <- readLines(con1)
con2 <- file("en_US/en_US.news.txt", "r") 
news <- readLines(con2)
close(con)
close(con1)
close(con2)
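
The embedded-nul warnings come from stray nul bytes in the Twitter file. If we wanted to silence them, readLines has a skipNul argument; a sketch (reading the file as UTF-8 in the same pass):

# Skip embedded nul bytes while reading the file as UTF-8
twitter <- readLines("en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)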

Content of the Data

The following shows the actual size of each text document in the corpus; we can see that the data is quite large.

cname <- file.path("~", "Desktop", "final/en_US")
docs <- Corpus(DirSource(cname))
inspect(docs)
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 133801343
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 140234320
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 107985724
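
For a quicker numeric summary that avoids building a corpus, we could also inspect the vectors read earlier directly; a sketch (nchar can complain about invalid multibyte strings in raw web text):

# Lines and total characters per source
length(twitter); sum(nchar(twitter))
length(blogs);   sum(nchar(blogs))
length(news);    sum(nchar(news))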

Getting a random sample for our analysis

Because the data in each text document is so large, we'll work with a 0.5% random sample from each source to represent the whole text population.

set.seed(1500)  # for reproducibility
# Draw a 0.5% random sample of lines from each source
sample.data <- c(sample(blogs, length(blogs) * 0.005),
                 sample(news, length(news) * 0.005),
                 sample(twitter, length(twitter) * 0.005))

The following shows the first few entries of the random sample, followed by its total length:

head(sample.data)
## [1] "More pledges"                                                                                                                                                                                                                                                                                                                                                                                                                          
## [2] "So, we will wait & see what the baby is in August when he or she is born."                                                                                                                                                                                                                                                                                                                                                             
## [3] "Right after they went inside I knew that we needed to go in or I would have the baby outside. So we made our way inside and Kaye checked me. 7 cm. She started the tub and I got in. I just laid there and relaxed. I think I even went to sleep for a brief moment in between contractions. Jeremy rubbed my hand. I wanted to respond but I couldn\342\200\231t, but having his support and knowing he was there was such a comfort."
## [4] "It's hard to know what I get more excited about these days. The actual event of sitting down and eating at a restaurant or the fact that I am out at all."                                                                                                                                                                                                                                                                             
## [5] "Lear, of course, does a far better and more thorough job of exploring this theme. But Cymbeline does a pretty creditable job. When the exiled Posthumous sends a letter ordering his servant Pisanio to murder his wife Imogen (Posthumous has been tricked into thinking she has cuckolded him), Pisanio, upon reading the letter, soliloquizes:"                                                                                     
## [6] "Card Maker"
length(sample.data)
## [1] 21347

Preprocessing and Cleaning the Data

Here we'll clean the data to make it ready for analysis. We'll convert all text to lower case; remove punctuation, numbers, stop words, and unnecessary white space; and stem each word.

doc.vec <- VectorSource(sample.data)  # treat each sampled line as a document
doc.corpus <- Corpus(doc.vec)         # build the corpus

doc.corpus <- tm_map(doc.corpus, tolower)                            # lower-case all text (content_transformer(tolower) in newer tm)
doc.corpus <- tm_map(doc.corpus, removePunctuation)                  # strip punctuation
doc.corpus <- tm_map(doc.corpus, removeNumbers)                      # strip digits
doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))  # drop common English stop words
doc.corpus <- tm_map(doc.corpus, stemDocument)                       # reduce words to their stems
doc.corpus <- tm_map(doc.corpus, stripWhitespace)                    # collapse repeated white space
doc.corpus <- tm_map(doc.corpus, PlainTextDocument)                  # restore the document class after the plain tolower
inspect(doc.corpus[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 6
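
Twitter text also contains URLs and @handles that the standard cleaners above leave intact. A sketch of a custom transformer for these; the toSpace helper and both patterns are my own illustration, and these steps would belong before removePunctuation in the pipeline above:

# Hypothetical helper: replace anything matching a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
doc.corpus <- tm_map(doc.corpus, toSpace, "http[[:alnum:][:punct:]]*")  # crude URL pattern
doc.corpus <- tm_map(doc.corpus, toSpace, "@[[:alnum:]_]*")             # Twitter handles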

Staging the data

Here we convert the corpus into a term-document matrix and a document-term matrix, the forms we'll work with from here on.

TDM <- TermDocumentMatrix(doc.corpus)  # terms as rows, documents as columns
DTM <- DocumentTermMatrix(doc.corpus)  # documents as rows, terms as columns
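
The vocabulary of a corpus like this runs to tens of thousands of distinct terms, most of which occur in only a handful of documents. If needed, very sparse terms can be dropped before any dense computation; a sketch, with a 99.9% sparsity threshold chosen purely for illustration:

# Keep only terms appearing in at least ~0.1% of documents (DTM.small is a hypothetical name)
DTM.small <- removeSparseTerms(DTM, 0.999)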

Explore the Data

Let's explore and visualize the term frequencies.

freq <- colSums(as.matrix(DTM))  # total frequency of each term across all documents
length(freq)                     # number of distinct terms
## [1] 41034
ord <- order(freq)               # indices that would sort terms by frequency
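
Note that as.matrix() expands the sparse matrix into a dense one, which can exhaust memory on larger samples. Since tm stores the matrix in slam's sparse format, a memory-friendlier sketch is:

# Column sums computed directly on the sparse representation
freq <- slam::col_sums(DTM)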

Inspect the Frequency of Words

wf <- data.frame(word = names(freq), freq = freq)  # word-frequency table
head(wf)
##                                    word freq
## <c2><92>                       <c2><92>    1
## <c2><92>heck               <c2><92>heck    1
## <c2><93>                       <c2><93>   15
## <c2><93>certainly     <c2><93>certainly    1
## <c2><93>it<c2><92>s <c2><93>it<c2><92>s    2
## <c2><93>one                 <c2><93>one    1
tail(wf)
##            word freq
## zurich   zurich    3
## zuzu       zuzu    1
## zweifel zweifel    1
## zwick     zwick    1
## zyl         zyl    1
## zynga     zynga    1
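
The head of the table reveals leftover mis-encoded punctuation: the <c2><92> and <c2><93> sequences appear to be curly quotes that survived removePunctuation. One possible fix, sketched here, is to strip non-ASCII bytes from sample.data before building the corpus:

# Convert to ASCII, deleting any byte sequence with no ASCII equivalent
sample.data <- iconv(sample.data, from = "UTF-8", to = "ASCII", sub = "")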

Plot Word Frequencies

Plot the words that appear more than 500 times.

p <- ggplot(subset(wf, freq > 500), aes(word, freq))                        # keep only frequent words
p <- p + geom_bar(stat = "identity", colour = "#CC79A7", fill = "#CC79A7")  # bar height = raw frequency
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))           # tilt labels so they don't overlap
p
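
By default, ggplot2 orders the bars alphabetically. A sketch of ordering them by descending frequency instead, using reorder():

# Order the x axis by descending frequency rather than alphabetically
p <- ggplot(subset(wf, freq > 500), aes(reorder(word, -freq), freq)) +
  geom_bar(stat = "identity", colour = "#CC79A7", fill = "#CC79A7") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("word")
p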

Relationships Between Terms

Let's pick a word that is meaningful to this analysis and identify the terms that correlate most highly with it.

If two words always appear together, their correlation is 1.0. Here we specify a correlation limit of 0.1.

findAssocs(TDM, "great", corlimit = 0.1)
## $great
##  befriends  demagogue    psyched    solicit  squaddies    standby 
##       0.11       0.11       0.11       0.11       0.11       0.11 
##        trx understudy      pours 
##       0.11       0.11       0.10
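
tm can also list the high-frequency terms directly; a sketch using findFreqTerms, with the 500 threshold mirroring the frequency plot above:

# All terms occurring at least 500 times across the corpus
findFreqTerms(TDM, lowfreq = 500)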

Word Clouds

Visual analytics appeal more to the lay reader; that's why word clouds are popular.

We will need to load the package that makes word clouds in R.

library(wordcloud)
## Loading required package: RColorBrewer

Plot words that occur at least 300 times.

set.seed(1500)  # fix the random layout so the cloud is reproducible
wordcloud(names(freq), freq, min.freq = 300)

Plot words that occur at least 500 times.

set.seed(1500)
wordcloud(names(freq), freq, min.freq = 500, scale = c(5, .1), colors = brewer.pal(6, "Dark2"))  # word-size range and a Dark2 palette
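
If the cloud gets too crowded, or wordcloud warns that some words could not be fit on the page, its max.words argument caps how many words are drawn; a sketch with an arbitrary cap of 100:

set.seed(1500)
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(6, "Dark2"))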

Conclusion

That is all for this report, which covers only exploratory analysis and visualization of the text data. Our next report will be a Shiny application that comprehensively showcases the processing involved in building a predictive model for forecasting user text input.

Thank you.