Review criteria for exploratory ‘text-data’ analysis:

We are given three large text files as part of the ‘Coursera Capstone Project’ and have to create a ‘Milestone Report’ on the analytical findings from them. I have decided to analyze each text file on its own, reducing each one to a smaller random sample to keep processing time manageable. The general workflow for the analysis is as follows.

Processes: Read File > Basic Summary Analysis > Corpus creation > Display most frequent words > N-Grams output > Graphical presentation.

Independent analysis: Each file is distinct in the sense that its messages or writings are created to cater to a certain user group. So I think each file should be analyzed separately to understand the perspective of its audience. The goal of this report is to analyze the major features of the ‘text-data’ and devise a plan for creating a prediction algorithm. The project review criteria are as follows.

Review Criteria: Provide a link to an HTML page with exploratory ‘text-data’ analysis of the three files > Offer a basic summary analysis of those files > Display some ‘word-frequency’ or N-gram plots > Report the analysis in a concise style for non-data scientists.

1. Twitter File Analysis: Word frequency, NGrams and bar graph

A tweet is a short text message, limited to 140 characters (280 since late 2017). Typically each tweet has a very targeted audience that is familiar with its context. Tweets in general do not follow a high level of grammatical correctness; they are more about contextual expression on a specific subject matter, directed at a specific audience. We can call a tweet a concise, personalized and targeted expression in present terms.

# all needed library
suppressMessages(library(doParallel))
## Warning: package 'doParallel' was built under R version 3.3.3
## Warning: package 'foreach' was built under R version 3.3.3
## Warning: package 'iterators' was built under R version 3.3.3
suppressPackageStartupMessages(library(wordcloud))
## Warning: package 'wordcloud' was built under R version 3.3.3
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(RWeka))
## Warning: package 'RWeka' was built under R version 3.3.3
suppressPackageStartupMessages(library(ggplot2))
## Warning: package 'ggplot2' was built under R version 3.3.3
suppressMessages(library(tm))
## Warning: package 'tm' was built under R version 3.3.3
## Warning: package 'NLP' was built under R version 3.3.3
suppressPackageStartupMessages(library(stringi))
## Warning: package 'stringi' was built under R version 3.3.3
suppressPackageStartupMessages(library(dplyr))
## Warning: package 'dplyr' was built under R version 3.3.3
suppressPackageStartupMessages(library(plotly))
## Warning: package 'plotly' was built under R version 3.3.3
#setup parallel backend processors
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)

#--------------------------------------------------------------------------------------------
# setting file path from my Desktop( All three files )
setwd("C:/Users/paralax11/Desktop/Data_Science_Capstone_Project/Week_02/Peer_Graded_Assignment")

# reading the 'twitter.txt' file directly from desktop
twitter_text <- readLines("en_US.twitter.txt",encoding = 'utf-8', skipNul = TRUE, warn = FALSE)

# calculating the twitter file size/ total line number and word counts
twitter_size <- file.info("en_US.twitter.txt")$size / 1024^2
twitter_lines <- length(twitter_text)
twitter_words <- sum(stri_count_words(twitter_text))

# displaying the calculated summary detail about the file
twitter_Summary <- data.frame(twitter_size, twitter_lines, twitter_words)
colnames(twitter_Summary) <- c("File size >", "Total Lines >", "Words Total")
print(twitter_Summary)
##   File size > Total Lines > Words Total
## 1    159.3641       2360148    30218166
# reducing the twitter file to a 5 percent random sample of the total lines and displaying its length
sampled_twitter <- sample(twitter_text, twitter_lines * 0.05)
print(length(sampled_twitter))
## [1] 118007
# wrapping the 'sampled_twitter' vector in a 'VectorSource' so that each element becomes one document
corpus1 <- Corpus(VectorSource(sampled_twitter))

# creating a corpus1 with data trimming
corpus1 <- corpus1 %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(PlainTextDocument)

# corpus1 realignment
corpus1 <- Corpus(VectorSource(corpus1))

# processing the 'corpus1' as a term-document-matrix and display the word distribution
Term.doc <- TermDocumentMatrix(corpus1)
Term.doc <- as.matrix(Term.doc)
Word.frequency <- sort(rowSums(Term.doc), decreasing = TRUE)
head(Word.frequency, 10)
##   just   like    get   love   good   will    day    can thanks   dont 
##   7508   6187   5590   5215   5012   4760   4615   4513   4490   4378
#-------------UNI-GRAM-----------------------------------------
# creating Unigram with 'twitter' file and display first 10 of them
# the 'tokenize' control option expects a function, so the tokenizer is wrapped in one
UniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
UniGramMatrix <- TermDocumentMatrix(corpus1, control = list(tokenize = UniGramTokenizer))

FrequenTerm <- findFreqTerms(UniGramMatrix, lowfreq = 1000)
TermFrequency <- rowSums(as.matrix(UniGramMatrix[FrequenTerm,]))

# storing the 'Unigram' frequencies in a dataframe (terms are listed alphabetically, not sorted by frequency)
TermFrequency <- data.frame(Wordfrequency = TermFrequency)
head(TermFrequency, 10)
##         Wordfrequency
## always           1485
## awesome          1290
## back             2934
## bad              1016
## best             1809
## better           1516
## big              1177
## can              4513
## cant             2653
## come             2073
# Converting the named frequency vector to a data frame for plotly presentation
WordFrequency <- data.frame(words = names(Word.frequency), frequency = Word.frequency)

# designing the barplot with ggplot2, rendered interactively with plotly
g <- ggplot(WordFrequency[1:10,], aes(x = reorder(words, frequency), y = frequency)) +
  geom_bar(stat = "identity", fill = "darkolivegreen") +
  labs(y = "Frequency", x = "Words", title = "Top 10 frequently used words on twitter")
ggplotly(g, width = 700, height = 350)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`

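Findings: Among the most frequent words in the Twitter sample, many are common present-tense verbs (‘get’, ‘love’, ‘like’, ‘can’) alongside short everyday words (‘just’, ‘good’, ‘day’, ‘thanks’), which suits the sharing of general contextual expression. The ‘unigram’ output reflects the same pattern: its counts match the word-frequency counts (for example, ‘can’ occurs 4513 times in both).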
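
The word-frequency step can also be illustrated without any add-on packages. The following is a minimal sketch in base R, not the code used for this report: the three sample lines, the tiny stop-word list and the helper name `count_words` are all made up for illustration, whereas the report itself relies on `tm` and `stopwords("english")`.

```r
# Hypothetical sample lines standing in for tweets (not taken from the data set)
tweets <- c("Thanks for the love!", "Love this day", "Good day, thanks")

# A tiny illustrative stop-word list (assumption); the report uses tm's English list
stop_words <- c("for", "the", "this")

# Hypothetical helper: lower-case, strip non-letters, split into words, count
count_words <- function(lines, stop_words = character(0)) {
  cleaned <- tolower(lines)
  cleaned <- gsub("[^a-z ]", " ", cleaned)   # replace punctuation and digits with spaces
  tokens <- unlist(strsplit(cleaned, " +"))  # split on runs of spaces
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stop_words)]
  sort(table(tokens), decreasing = TRUE)     # most frequent words first
}

word_counts <- count_words(tweets, stop_words)
print(word_counts)
```

On real data the same idea applies to the sampled lines, although the `tm` pipeline above handles many more edge cases than this sketch does.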

2. Blog File: Word frequency, Bi-Grams and ‘Word-Cloud’ graph

Blogs are online pages that describe and analyze particular topics for a specific group of readers, written mostly by one or more writers. Blogs may take a categorical perspective on topics to cater to specific readers.

# reading the 'blogs_text' files/computing file size/counting lines and total number of words
blogs_txt    <- readLines("en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
blogs_size   <- file.info("en_US.blogs.txt")$size / 1024^2
blogs_lines  <- length(blogs_txt)
blogs_words  <- sum(stri_count_words(blogs_txt))

# displaying size of the blogs file/ total number of lines / total words
blogs_Summary <- data.frame(blogs_size, blogs_lines, blogs_words)
colnames(blogs_Summary) <- c("  File size >",  " Total Lines >", "  Words Total")
print(blogs_Summary)
##     File size >  Total Lines >   Words Total
## 1      200.4242         899288      38154238
# reducing the 'blogs.txt' file to 10 percent of the total size
sampled_blogs <- sample(blogs_txt, blogs_lines * 0.10)

# re-encoding the sampled character vector from 'utf-8' to the native encoding
sampled_blogs <- iconv(sampled_blogs, 'utf-8')

# corpus creation and file trimming
corpus2 <- Corpus(VectorSource(as.data.frame(sampled_blogs, stringsAsFactors = FALSE)))

corpus2 <- corpus2 %>%
  tm_map(tolower) %>%  tm_map(PlainTextDocument) %>% tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>% tm_map(removeWords, stopwords("english")) %>% tm_map(stripWhitespace)

# coercing the trimmed corpus into a term-document matrix
# as.matrix turns the 'TermDocumentMatrix' object into a plain matrix.
term.doc <- TermDocumentMatrix(corpus2)
term.doc <- as.matrix(term.doc)

# sorting the word counts and displaying the 10 most frequent words
word_frequency <- sort(rowSums(term.doc), decreasing = TRUE)
head(word_frequency, 10)
##    one   will    can   like   just   time    get   know people    now 
##  12341  11197  10740   9955   9945   8754   7077   6030   5983   5980
# creating a data frame to display a 'word-cloud' with up to 75 of the most frequent words
blog.dfram <- data.frame(words=names(word_frequency), frequency=word_frequency)
wordcloud(blog.dfram$words, blog.dfram$frequency, scale = c(4, 0.5), min.freq = 4,
          max.words = 75, random.order = TRUE, rot.per = .15, use.r.layout = FALSE,
          colors = brewer.pal(6, "Dark2"), ordered.colors = FALSE)

#------------------------BI-Gram----------------------------------------------
# Creating a 'BiGram' with a frequency visual to compare with most widely used words
bigram <- NGramTokenizer(corpus2, Weka_control(min=2, max=2))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
head(bigram, 10)
##              bigram Freq
## 1428570   years ago  497
## 850771     new york  475
## 1062645   right now  470
## 417485  even though  450
## 456377    feel like  430
## 197000      can see  424
## 473615   first time  418
## 763925    make sure  407
## 695297    last year  398
## 363399   don<U+0092>t know  353

Findings: Many of the most frequent words are present-tense verbs and nouns, which portrays an informal, conversational style of discussion. In addition, the ‘bi-gram’ output shows a pattern of contiguous, interconnected words, where each word (token) relates to the preceding one in a simple logical way. Each ‘bi-gram’ portrays a conditional dependency between its two tokens.
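
The conditional dependency between the two tokens of a bi-gram can be made concrete with a small base R sketch. The token sequence and the helper name `make_bigrams` are hypothetical; the report itself uses `NGramTokenizer` from `RWeka` for this step.

```r
# Hypothetical token sequence, already lower-cased and cleaned (assumption)
tokens <- c("right", "now", "i", "feel", "like", "right", "now",
            "feel", "like", "home")

# Hypothetical helper: pair each token with its successor to form contiguous bigrams
make_bigrams <- function(tokens) {
  n <- length(tokens)
  if (n < 2) return(character(0))
  paste(tokens[-n], tokens[-1])
}

bigrams <- make_bigrams(tokens)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(head(bigram_counts, 3))

# Relative frequency of each second word, given that the first word is "feel"
first_word <- sub(" .*$", "", names(bigram_counts))
after_feel <- bigram_counts[first_word == "feel"]
print(after_feel / sum(after_feel))
```

The last table is an estimate of how likely each word is to follow ‘feel’ in this toy sequence, which is the kind of quantity a next-word model would use.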

3. News files: Word frequency, Trigram and Bar graph

News articles are essentially written for a broad range of readers, with in-depth analysis infused into them. It is assumed that news readers prefer a detailed treatment of any news topic or subject matter.

News file summary:

Here I’ve decided not to display the code for computing the most frequent ‘words’ and ‘word-trigrams’, since it is similar to the code shown above.

##     File size >  Total Lines >   Words Total
## 1      196.2775          77259       2693898

Most frequently used 10 words:

## said will  one  new also  can  two year just last 
## 4748 2048 1588 1319 1157 1110 1082 1077 1023 1016

Tri-gram tokenized words with frequency:

##                        trigram Freq
## 341867           two years ago   34
## 213736           new york city   31
## 248310  president barack obama   21
## 117060        first time since   18
## 307120         st louis county   17
## 122472          four years ago   16
## 54974  chief financial officer   15
## 134734      gov chris christie   14
## 329915         three years ago   14
## 173416          last two years   13

Barplot with Tri-gram:

# separating 10-trigram combination for 'bar' display
trigram.Small <- head(trigram, 10)

# plotting trigram with frequency on top bar
newsTrigram <- ggplot(trigram.Small, aes(x = reorder(trigram, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "#FF6666") +
  geom_text(aes(label = Freq), hjust = -0.2) +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y = "Trigram Frequency", title = "Top 10 trigram words from news file")

print(newsTrigram)

Stopping parallel processing

Findings: For the news file I have decided to analyze ‘tri-grams’, contiguous three-word sequences that a probabilistic language model can use to predict the ‘next-word-item’. This tri-gram (head) output from the news file will help us predict the likely continuation of a user’s word choice, and thereby help us write the Shiny app.
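
To show how a tri-gram table could drive a next-word prediction, here is a minimal base R sketch. It is an illustration under stated assumptions, not the final model: the token sequence is invented, and the names `trigrams` and `predict_next` are hypothetical.

```r
# Hypothetical cleaned token stream (assumption, not taken from the news file)
tokens <- c("two", "years", "ago", "we", "met", "in", "new", "york", "city",
            "two", "years", "ago", "today", "two", "years", "later")

# One row per trigram: the two-word history and the word that follows it
n <- length(tokens)
trigrams <- data.frame(
  history  = paste(tokens[1:(n - 2)], tokens[2:(n - 1)]),
  nextword = tokens[3:n],
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the word seen most often after the history "w1 w2"
predict_next <- function(w1, w2, trigrams) {
  candidates <- trigrams$nextword[trigrams$history == paste(w1, w2)]
  if (length(candidates) == 0) return(NA_character_)  # history never observed
  names(which.max(table(candidates)))
}

print(predict_next("two", "years", trigrams))
print(predict_next("new", "york", trigrams))
```

A history that never occurs returns `NA`, which is one reason a practical model would also need shorter histories to fall back on.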

Next steps towards a prediction model:

  1. Building a simple model based on the most frequent word selections and N-gram output.
  2. As the model building progresses, we might need to explore a few more details:
    1. Possible model-efficient data cleaning
    2. Data sampling for reduced runtime
    3. Exploring prebuilt R algorithms, such as one based on a ‘Hidden Markov Chains’ model
  3. Creating a Shiny app, where users would be allowed to input ‘choice-text’ in a text-box.
  4. In an adjacent text-box, possible suggested matching text would be displayed to the user.
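
As a rough sketch of how steps 1 and 4 might fit together, the function below looks up the typed text in N-gram tables and falls back to a shorter history when the longer one has never been seen. This is a simple back-off lookup written in base R for illustration only, not a finished algorithm; the token stream and the names `history_table` and `suggest_word` are hypothetical, and a real app would build the tables from the sampled corpora.

```r
# Hypothetical cleaned token stream (assumption)
tokens <- c("i", "feel", "like", "home", "right", "now", "i", "feel", "like",
            "pizza", "right", "now", "you", "feel", "like", "home")

# Hypothetical helper: every k-word history together with the word that follows it
history_table <- function(tokens, k) {
  starts <- seq_len(length(tokens) - k)
  history <- vapply(starts,
                    function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
                    character(1))
  data.frame(history = history, nextword = tokens[starts + k],
             stringsAsFactors = FALSE)
}

# tables[[1]] uses one-word histories, tables[[2]] uses two-word histories
tables <- list(history_table(tokens, 1), history_table(tokens, 2))

# Try the longest history first, then shorter ones, then the most frequent word
suggest_word <- function(input, tables, tokens) {
  words <- unlist(strsplit(tolower(trimws(input)), " +"))
  for (k in rev(seq_along(tables))) {
    if (length(words) < k) next
    key <- paste(tail(words, k), collapse = " ")
    hits <- tables[[k]]$nextword[tables[[k]]$history == key]
    if (length(hits) > 0) return(names(which.max(table(hits))))
  }
  names(which.max(table(tokens)))
}

print(suggest_word("I feel like", tables, tokens))
print(suggest_word("we like", tables, tokens))
```

In a Shiny app, the string typed into the input text-box would be passed as `input` and the returned word shown in the suggestion box.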