We are given three large text files as part of the ‘Coursera Capstone Project’, and we have to create a ‘Milestone Report’ on the analytical findings from these files. I have decided to analyze each text file on its own, reducing each one to a smaller random sample to keep processing time manageable. The general outline of the analysis is as follows.
Processes: Read File > Basic Summary Analysis > Corpus creation > Display most frequent words > N-Grams output > Graphical presentation.
Independent analysis: Each file is distinct in the sense that its messages or writings are created to cater to a certain user group. I therefore think each file should be analyzed separately, to understand the perspective of its audience. The goal of this report is to analyze the major features of the ‘text-data’ and devise a plan for creating a prediction algorithm. The project review criteria are as follows.
Review Criteria: Provide a link to an HTML page with exploratory ‘text-data’ analysis of the three files > Offer a basic Summary Analysis of those files > Display some ‘word-frequency’ or N-gram plots > Report the analysis in a concise style for non-data scientists.
The Twitter file consists of short messages, each capped in length (the limit was 140 characters, later raised to 280). Typically each ‘tweet’ has a very targeted audience that is familiar with its context. Tweets in general do not follow a high level of grammatical correctness; they are more about contextual expression on a specific subject matter, directed at a specific audience. We can call a ‘tweet’ a concise, personalized and targeted expression in present terms.
# all needed library
suppressMessages(library(doParallel))
suppressPackageStartupMessages(library(wordcloud))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(RWeka))
suppressPackageStartupMessages(library(ggplot2))
suppressMessages(library(tm))
suppressPackageStartupMessages(library(stringi))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(plotly))
#setup parallel backend processors
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
#--------------------------------------------------------------------------------------------
# setting file path from my Desktop( All three files )
setwd("C:/Users/paralax11/Desktop/Data_Science_Capstone_Project/Week_02/Peer_Graded_Assignment")
# reading the 'twitter.txt' file directly from desktop
twitter_text <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
# calculating the twitter file size/ total line number and word counts
twitter_size <- file.info("en_US.twitter.txt")$size / 1024^2
twitter_lines <- length(twitter_text)
twitter_words <- sum(stri_count_words(twitter_text))
# displaying the calculated summary detail about the file
twitter_Summary <- data.frame(twitter_size, twitter_lines, twitter_words)
colnames(twitter_Summary) <- c("File size >", "Total Lines >", "Words Total")
print(twitter_Summary)
## File size > Total Lines > Words Total
## 1 159.3641 2360148 30218166
# reducing the twitter file to a random 5 percent of its lines and displaying the sample size
sampled_twitter <- sample(twitter_text, twitter_lines * 0.05)
print(length(sampled_twitter))
## [1] 118007
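Because `sample()` draws at random, the sampled lines (and every count derived from them) change on each run. A minimal sketch of a repeatable draw follows, assuming a fixed seed is acceptable for this report; `toy_lines`, `sample_fraction` and the seed value are hypothetical names and values, not part of the original code.

```r
# Hypothetical stand-in for twitter_text (the real vector has about 2.36 million lines)
toy_lines <- paste("line", 1:1000)

# Draw a fixed fraction of the lines; the seed makes the draw repeatable
sample_fraction <- function(lines, fraction, seed = 1234) {
  set.seed(seed)                          # assumed seed value, any integer works
  n <- round(length(lines) * fraction)    # whole-number sample size
  sample(lines, n)
}

first_draw  <- sample_fraction(toy_lines, 0.05)
second_draw <- sample_fraction(toy_lines, 0.05)

length(first_draw)                   # size of the 5 percent sample
identical(first_draw, second_draw)   # same seed, same lines
```

With the seed fixed, the word frequencies reported further down would stay the same between runs of the report.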
# each elements of the 'sampled-twitter' file is placed in a 'vector-source' function
corpus1 <- Corpus(VectorSource(sampled_twitter))
# cleaning corpus1: lower case, no punctuation, no numbers, no stopwords, single spaces
corpus1 <- corpus1 %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace) %>%
  tm_map(PlainTextDocument)
# corpus1 realignment
corpus1 <- Corpus(VectorSource(corpus1))
# processing the 'corpus1' as a term-document-matrix and display the word distribution
Term.doc <- TermDocumentMatrix(corpus1)
Term.doc <- as.matrix(Term.doc)
Word.frequency <- sort(rowSums(Term.doc), decreasing = TRUE)
head(Word.frequency, 10)
## just like get love good will day can thanks dont
## 7508 6187 5590 5215 5012 4760 4615 4513 4490 4378
#-------------UNI-GRAM-----------------------------------------
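# creating unigrams from the 'twitter' file and displaying the first 10 of them
# (the 'tokenize' control option expects a function, so the tokenizer is wrapped in one)
UniGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))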
UniGramMatrix <- TermDocumentMatrix(corpus1, control = list(tokenize = UniGramTokenizer))
FrequenTerm <- findFreqTerms(UniGramMatrix, lowfreq = 1000)
TermFrequency <- rowSums(as.matrix(UniGramMatrix[FrequenTerm,]))
# storing the unigram frequencies in a data frame (rows stay in alphabetical order)
TermFrequency <- data.frame(Wordfrequency = TermFrequency)
head(TermFrequency, 10)
## Wordfrequency
## always 1485
## awesome 1290
## back 2934
## bad 1016
## best 1809
## better 1516
## big 1177
## can 4513
## cant 2653
## come 2073
# Converting matrix to a data frame for plotly presentation
WordFrequency <- data.frame(words = names(Word.frequency), frequency = Word.frequency)
# designing the barplot with plotly
g <- ggplot(WordFrequency[1:10,], aes(x=reorder(words, frequency), y=frequency)) +
  geom_bar(stat="identity", fill="darkolivegreen") +
  labs(y="Frequency", x="Words", title="Top 10 frequently used words on twitter")
ggplotly(g, width = 700, height = 350)
Findings: Looking at the most frequent words in the tweets, most of them are short everyday words, and many are present-tense verbs (‘get’, ‘love’, ‘can’), which suits the sharing of general, in-context expression. The ‘unigram’ output reflects a similar pattern of frequent words.
Blogs are online pages that describe and analyze particular topics for a specific group of readers, and they are mostly written by one or a few writers. Blogs may take a categorical perspective on topics that caters to those specific readers.
# reading the 'blogs_text' files/computing file size/counting lines and total number of words
blogs_txt <- readLines("en_US.blogs.txt", skipNul = TRUE, warn = FALSE)
blogs_size <- file.info("en_US.blogs.txt")$size / 1024^2
blogs_lines <- length(blogs_txt)
blogs_words <- sum(stri_count_words(blogs_txt))
# displaying size of the blogs file/ total number of lines / total words
blogs_Summary <- data.frame(blogs_size, blogs_lines, blogs_words)
colnames(blogs_Summary) <- c(" File size >", " Total Lines >", " Words Total")
print(blogs_Summary)
## File size > Total Lines > Words Total
## 1 200.4242 899288 38154238
# reducing the 'blogs.txt' file to 10 percent of the total size
sampled_blogs <- sample(blogs_txt, blogs_lines * 0.10)
# converting the character vector from 'UTF-8' to the native encoding
sampled_blogs <- iconv(sampled_blogs, from = "UTF-8")
# corpus creation and file trimming
corpus2 <- Corpus(VectorSource(as.data.frame(sampled_blogs, stringsAsFactors = FALSE)))
corpus2 <- corpus2 %>%
  tm_map(tolower) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)
# coercing the trimmed corpus into a term-document matrix
# as.matrix turns 'termDocumentMatrix' argument into a matrix.
term.doc <- TermDocumentMatrix(corpus2)
term.doc <- as.matrix(term.doc)
# sorting the words by frequency and displaying the 10 most frequent
word_frequency <- sort(rowSums(term.doc), decreasing = TRUE)
head(word_frequency, 10)
## one will can like just time get know people now
## 12341 11197 10740 9955 9945 8754 7077 6030 5983 5980
# creating a data frame to display a 'word-cloud' of up to 75 of the most frequent words
blog.dfram <- data.frame(words=names(word_frequency), frequency=word_frequency)
wordcloud(blog.dfram$words, blog.dfram$frequency, scale=c(4,0.5), min.freq = 4, max.words = 75,
          random.order=TRUE, rot.per=.15, use.r.layout=FALSE,
          colors=brewer.pal(6, "Dark2"), ordered.colors = FALSE)
#------------------------BI-Gram----------------------------------------------
# Creating a 'BiGram' with a frequency visual to compare with most widely used words
bigram <- NGramTokenizer(corpus2, Weka_control(min=2, max=2))
bigram <- data.frame(table(bigram))
bigram <- bigram[order(bigram$Freq, decreasing = TRUE),]
head(bigram, 10)
## bigram Freq
## 1428570 years ago 497
## 850771 new york 475
## 1062645 right now 470
## 417485 even though 450
## 456377 feel like 430
## 197000 can see 424
## 473615 first time 418
## 763925 make sure 407
## 695297 last year 398
## 363399 don’t know 353
Findings: The most frequent words are again mostly present-tense verbs and nouns, which points to an informal, conversational style of discussion. In addition, the ‘bi-gram’ output shows a pattern of contiguous, connected words, where each word (token) relates to the preceding one in a simple logical way. Each ‘bi-gram’ thus portrays a conditional dependency between its two tokens.
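The conditional dependency between the two tokens of a bi-gram can be made concrete as a relative frequency: the count of a word pair divided by the count of its first word. The following is a minimal base-R sketch on invented sentences (`toy_text` is hypothetical and not taken from the blogs file); it is an illustration of the idea, not the code used for this report.

```r
# Toy sentences (hypothetical data, not taken from the blogs file)
toy_text <- c("i feel like a walk", "i feel like a nap", "i feel good right now")

# Split each sentence into words, then pair every word with its successor
tokens <- strsplit(toy_text, " ", fixed = TRUE)
pairs <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(first = head(w, -1), second = tail(w, -1), stringsAsFactors = FALSE)
}))

# Conditional relative frequency: count(first, second) / count(first)
counts <- table(pairs$first, pairs$second)
cond_prob <- prop.table(counts, margin = 1)

# Share of each word that follows "feel": "like" 2/3, "good" 1/3
cond_prob["feel", ]
```

In this toy data ‘feel’ is followed by ‘like’ twice and by ‘good’ once, so ‘like’ would be the more likely continuation.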
News files are essentially written for a broad audience, with in-depth analysis infused into them. It is assumed that news readers prefer a detailed treatment of any news topic or subject matter.
News file summary:
Here I’ve decided not to display the code for computing the most frequent ‘words’ and ‘word-trigrams’, since it is similar to the code shown above.
## File size > Total Lines > Words Total
## 1 196.2775 77259 2693898
10 most frequently used words:
## said will one new also can two year just last
## 4748 2048 1588 1319 1157 1110 1082 1077 1023 1016
Tri-gram tokenized words with frequency:
## trigram Freq
## 341867 two years ago 34
## 213736 new york city 31
## 248310 president barack obama 21
## 117060 first time since 18
## 307120 st louis county 17
## 122472 four years ago 16
## 54974 chief financial officer 15
## 134734 gov chris christie 14
## 329915 three years ago 14
## 173416 last two years 13
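Since the code for this table is not displayed, here is a minimal sketch of how a table with the columns `trigram` and `Freq` can be built. It uses base R only and invented input lines (`toy_news` and `make_ngrams` are hypothetical names); the report itself presumably used `NGramTokenizer` with `min=3, max=3`, as in the bi-gram section, so treat this as an illustration rather than the original code.

```r
# Hypothetical cleaned lines standing in for the sampled, lower-cased news text
toy_news <- c("two years ago the city",
              "new york city two years ago",
              "it was two years ago")

# Build every run of n consecutive words within one line
make_ngrams <- function(line, n = 3) {
  w <- strsplit(line, " +")[[1]]
  if (length(w) < n) return(character(0))
  starts <- seq_len(length(w) - n + 1)
  vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
}

# Count the trigrams and sort them by decreasing frequency
trigram <- unlist(lapply(toy_news, make_ngrams))
trigram <- data.frame(table(trigram))      # columns: trigram, Freq
trigram <- trigram[order(trigram$Freq, decreasing = TRUE), ]
head(trigram, 3)
```

In the toy input ‘two years ago’ occurs once in each line, so it ends up at the top of the sorted table.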
Barplot with Tri-gram:
# separating 10-trigram combination for 'bar' display
trigram.Small <- head(trigram, 10)
# plotting trigram with frequency on top bar
newsTrigram <- ggplot(trigram.Small, aes(x=reorder(trigram, Freq), y=Freq)) +
  geom_bar(stat="identity", fill="#FF6666") +
  geom_text(aes(label=Freq), hjust = -0.2) +
  theme_bw() +
  coord_flip() +
  theme(axis.title.y = element_blank()) +
  labs(y="Trigram Frequency", title="Top 10 trigram words from news file")
print(newsTrigram)
Stopping parallel processing
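The code for this step is not displayed. A minimal sketch, assuming the workers started earlier with `makeCluster()` are simply released with `stopCluster()` from the `parallel` package: the object `demo_cluster` below is a hypothetical stand-in, and in this report the call would presumably be `stopCluster(cluster)`.

```r
library(parallel)

# A small stand-in cluster; the report's own object is named `cluster`
demo_cluster <- makeCluster(1)

# Some parallel work, here just squaring three numbers on the worker
squares <- parLapply(demo_cluster, 1:3, function(i) i^2)

# Release the worker processes once the parallel work is finished
stopCluster(demo_cluster)
```

Stopping the cluster frees the background R processes, which otherwise keep running until the session ends.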
Findings: For the news file I decided to analyze ‘tri-grams’, which are contiguous three-word sequences that a probabilistic language model can use to predict the ‘next-word-item’. The tri-gram output (head) from the news file will help us predict the most likely continuation of a user’s words, and thereby help us write the Shiny app.
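As a rough illustration of that plan, the sketch below looks up the most frequent trigram that starts with the two words typed so far and returns its third word. The table `toy_trigrams` and the function `predict_next` are hypothetical, and the counts are toy values chosen for illustration; a real predictor would also need a fallback (for example to bi-grams) when no trigram matches.

```r
# Toy trigram table (hypothetical counts, for illustration only)
toy_trigrams <- data.frame(
  trigram = c("new york city", "new york times", "two years ago", "two years later"),
  Freq    = c(31, 12, 34, 5),
  stringsAsFactors = FALSE
)

# Return the third word of the most frequent trigram starting with `prefix`
predict_next <- function(prefix, counts) {
  hits <- counts[startsWith(counts$trigram, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)   # no known continuation
  best <- hits$trigram[which.max(hits$Freq)]
  tail(strsplit(best, " ", fixed = TRUE)[[1]], 1)
}

predict_next("new york", toy_trigrams)    # "city"
predict_next("two years", toy_trigrams)   # "ago"
```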