knitr::opts_chunk$set(echo = TRUE)
Along the Capstone Project we are supposed to get familiar with a new subject, text mining, a field of Natural Language Processing. Text mining allows us to identify and highlight the most frequent keywords within a text, a paragraph, or a list of texts. The visual presentation of the text data is done with a word cloud. The text mining package (tm) and the wordcloud package (wordcloud) both support analyzing the texts and visualizing the keywords as a word cloud. A few advantages of using word clouds:
* simplicity and clarity: the most outstanding / frequent words stand out immediately in the word cloud
* easy to understand and easily shared; as a communication tool, a word cloud conveys the insights more powerfully than any data table
Step 1: create a text file
Step 2: install and load required packages
Step 3: text mining (includes text transformation)
Step 4: building a term-document matrix
Step 5: generate the word cloud
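A compact sketch of steps 2 through 5 on a toy text, assuming only the tm and wordcloud packages (the full pipeline on the real data follows in this report; the toy sentences are illustrative):
library(tm)
library(wordcloud)
# step 2/3: load packages and build a small cleaned corpus
docs <- VCorpus(VectorSource(c("text mining is fun", "mining text data")))
docs <- tm_map(docs, content_transformer(tolower))
# step 4: build the term-document matrix and derive word frequencies
tdm <- TermDocumentMatrix(docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
# step 5: generate the word cloud from the frequencies
wordcloud(names(freq), freq, min.freq = 1)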
* explore the data: word length, distribution, frequency
* create a prediction algorithm: a __predictive text model__ for next word prediction (a minimal sketch follows below)
* integrate the prediction algorithm into a Shiny app for interactive use
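As a preview of how the n-gram tables built later in this report could feed such a predictor, here is a minimal sketch. The function name predictNext and its lookup logic are illustrative assumptions, not the final model; it assumes the `bigram` data frame built below, with a String column ("word1 word2") and a Count column:
predictNext <- function(word, bigram) {
  # keep only bigrams whose first token matches the given word
  # (plain regex prefix match; fine for a sketch, not for words with metacharacters)
  hits <- bigram[grepl(paste0("^", word, " "), bigram$String), ]
  if (nrow(hits) == 0) return(NA_character_)
  # return the second token of the most frequent matching bigram
  strsplit(as.character(hits$String[which.max(hits$Count)]), " ")[[1]][2]
}
predictNext("new", bigram)  # e.g. might return "york"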
[Files saved on my GitHub repository](https://github.com/natalijak/Capstone_Project)
* the data is made available by SwiftKey and downloaded from the Coursera website as a zip file (a download sketch follows below)
* the zip file contains 4 subfolders which store data in 4 languages (English, German, Russian, Finnish)
* I have used the English data set __en_US__
* the data stored in the en_US subfolder comes from three different sources: Twitter, blogs, and news
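For reproducibility, a minimal sketch of the download step; the URL below is the one distributed in the course materials and is an assumption here (adjust if it has changed):
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                destfile = "Coursera-SwiftKey.zip")
  unzip("Coursera-SwiftKey.zip")  # extracts the final/en_US subfolder
}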
Get the required packages
library(RWekajars)        # Java jars required by RWeka
library(qdapDictionaries)
library(qdapRegex)
library(qdapTools)
library(RColorBrewer)     # color palettes
library(qdap)             # quantitative discourse analysis
library(NLP)
library(tm)               # text mining
library(SnowballC)        # text stemming
library(devtools)
library(RWeka)            # n-gram tokenization
library(rJava)            # Java interface - this one was a trouble maker!
library(wordcloud)        # word cloud generator
library(tmap)
Load the data, which was previously downloaded as a zip file from the official Coursera source.
blogs <- readLines("~/Capstone_Project/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul=TRUE)
news <- readLines("~/Capstone_Project/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul=TRUE)
twitter <- readLines("~/Capstone_Project/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
Take a data sample in order to improve performance, and combine the samples from the different sources into one sample file, so that the analysis is then performed over that single file.
set.seed(12345)
sampleTwitter <- twitter[sample(1:length(twitter),10000)]
sampleNews <- news[sample(1:length(news),10000)]
sampleBlogs <- blogs[sample(1:length(blogs),10000)]
textSample <- c(sampleTwitter,sampleNews,sampleBlogs)
writeLines(textSample, "~/Capstone_Project/Milestone_Report/textSample.txt")
Get the size of the files in MB
SizeMBblogs <- file.info("~/Capstone_Project/final/en_US/en_US.blogs.txt")$size / 1024.0 / 1024.0
SizeMBnews <- file.info("~/Capstone_Project/final/en_US/en_US.news.txt")$size / 1024.0 / 1024.0
SizeMBtwitter <- file.info("~/Capstone_Project/final/en_US/en_US.twitter.txt")$size / 1024.0 / 1024.0
Sizesample <- file.info("C:~/Capstone_Project/Milestone_Report/textSample.txt")$size / 1024.0 / 1024.0
Get the line counts and the word counts
Lengthblogs <- length(blogs)
Lengthnews <- length(news)
Lengthtwitter <- length(twitter)
Lengthsample <- length(textSample)
Wordingblogs <- sum(sapply(gregexpr("\\S+", blogs), length))
Wordingnews <- sum(sapply(gregexpr("\\S+", news), length))
Wordingtwitter <- sum(sapply(gregexpr("\\S+", twitter), length))
Wordingsample <- sum(sapply(gregexpr("\\S+", textSample), length))
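As a quick sanity check of the gregexpr word-counting idiom: "\\S+" matches each run of non-whitespace characters, so the match count per line equals the word count. (One caveat: for an empty line gregexpr returns -1, which still has length 1, so blank lines are counted as one word each.)
sapply(gregexpr("\\S+", c("one two three", "hello")), length)
## [1] 3 1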
The table summary returns a first holistic overview of the basic data summaries.
TableSummary <- data.frame(
  FileName = c("Blogs File-US", "News File-US", "Twitter File-US", "Sample File-US"),
  FileSizeMB = c(round(SizeMBblogs, digits = 2),
                 round(SizeMBnews, digits = 2),
                 round(SizeMBtwitter, digits = 2),
                 round(Sizesample, digits = 2)),
  LineCount = c(Lengthblogs, Lengthnews, Lengthtwitter, Lengthsample),
  WordCount = c(Wordingblogs, Wordingnews, Wordingtwitter, Wordingsample)
)
colnames(TableSummary) <- c("File Name", "File SizeMB", "Line Count", "Word Count")
saveRDS(TableSummary, file = "~/Capstone_Project/Milestone_Report/fileSummary.Rda")
#knit the table - it returns the tabular overview of the previously calculated file dimensions
FileSummaryDF <- readRDS("~/Capstone_Project/Milestone_Report/fileSummary.Rda")
knitr::kable(head(FileSummaryDF, 10))
| File Name | File SizeMB | Line Count | Word Count |
|---|---|---|---|
| Blogs File-US | 200.42 | 899288 | 37334131 |
| News File-US | 196.28 | 77259 | 2643969 |
| Twitter File-US | 159.36 | 2360148 | 30373583 |
| Sample File-US | 4.80 | 30000 | 878839 |
SampleConnection <- file("~/Capstone_Project/Milestone_Report/textSample.txt")
Sample <- readLines(SampleConnection)
close(SampleConnection)
Load the data as a corpus. This corpus is built from the combined sample drawn from the three different files.
## Loading required package: NLP
Text transformation and cleaning. For this task the function tm_map is available. The function can replace special characters in the text, for example replace “?”, “@”, “}” with a space. The tm_map() function is also used to remove unreasonable space between words, convert the text to lower case, and remove stopwords. Stopwords have a value near zero: they are very common in all languages but carry no real predictive value. Numbers and punctuation are removed as well. The stemming process reduces the words “talked”, “talk”, “talking” to the origin verb “talk”. This could be extended by multiple other arguments in order to clean the text and to have reasonable data incorporated into a cleaned corpus for the further prediction steps.
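For instance, the stemmer from the SnowballC package loaded above behaves exactly as described:
library(SnowballC)
wordStem(c("talked", "talk", "talking"))
## [1] "talk" "talk" "talk"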
library(tm)
# build the corpus from the sampled text (the construction step implied above)
cleanSample <- VCorpus(VectorSource(Sample))
# transform and clean the corpus
cleanSample <- tm_map(cleanSample, content_transformer(tolower), lazy = TRUE)
cleanSample <- tm_map(cleanSample, stripWhitespace)
cleanSample <- tm_map(cleanSample, content_transformer(removePunctuation))
cleanSample <- tm_map(cleanSample, content_transformer(removeNumbers))
cleanSample <- tm_map(cleanSample, removeWords, stopwords("english"))
cleanSample <- tm_map(cleanSample, stemDocument)
cleanSample <- tm_map(cleanSample, stripWhitespace)
saveRDS(cleanSample, file = "~/Capstone_Project/Milestone_Report/finalCorpus.RDS")
Save the final text corpus.
#read final corpus
finalCorpus <- readRDS("~/Capstone_Project/Milestone_Report/finalCorpus.RDS")
# flatten the corpus back into a plain character data frame for tokenization
finalCorpusDF <- data.frame(text = unlist(sapply(finalCorpus, `[`, "content")),
                            stringsAsFactors = FALSE)
The tokenization function for the n-grams
library(rJava)
library(RWeka)
nGrams <- function(finalCorpus, grams) {
  # tokenize the text into n-grams of the requested order using RWeka
  ngram <- NGramTokenizer(finalCorpus, Weka_control(min = grams, max = grams,
                                                    delimiters = " \\r\\n\\t.,;:\"()?!"))
  # count each distinct n-gram and sort by descending frequency
  ngram <- data.frame(table(ngram))
  ngram <- ngram[order(ngram$Freq, decreasing = TRUE), ]
  colnames(ngram) <- c("String", "Count")
  return(ngram)
}
Create the Unigram from the final corpus data frame and count the distinct unigrams.
unigram <- nGrams(finalCorpusDF, 1)
nrow(unigram)
## [1] 40318
#save Unigram RDS File
saveRDS(unigram, file = "~/Capstone_Project/Milestone_Report/unigram.RDS")
Plot the data for the Unigram, top 20
#Plot word frequency Unigram
library(ggplot2)
plot1 <- ggplot(unigram[1:20,], aes(reorder(String, -Count), Count))  # sort bars by frequency
plot1 <- plot1 + geom_bar(stat = "identity")
plot1 <- plot1 + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot1 <- plot1 + ggtitle("Top 20 Word Frequency of a Unigram")
plot1 <- plot1 + xlab("String")
plot1
Save the Unigram plot as a PNG file (Histogram_Unigram.png)
#save Unigram histogram
png(filename="~/Capstone_Project/Milestone_Report/Histogram_Unigram.png")
plot(plot1)
dev.off()
## png
## 2
Create a Unigram word cloud in order to illustrate the importance of the words.
library(wordcloud)
## Loading required package: RColorBrewer
set.seed(12345)
cloud_unigram <- wordcloud(unigram$String, unigram$Count, random.order = FALSE,
                           min.freq = 700,
                           colors = brewer.pal(8, "Dark2"))
Create the Bigram from the final corpus data frame and count the distinct bigrams.
bigram <- nGrams(finalCorpusDF, 2)
nrow(bigram)
## [1] 392296
#save Bigram RDS File
saveRDS(bigram, file = "~/Capstone_Project/Milestone_Report/bigram.RDS")
Plot the data for the Bigram, top 20
#Plot word frequency Bigram
library(ggplot2)
plot2 <- ggplot(bigram[1:20,], aes(reorder(String, -Count), Count))  # sort bars by frequency
plot2 <- plot2 + geom_bar(stat = "identity")
plot2 <- plot2 + theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot2 <- plot2 + ggtitle("Top 20 Word Frequency of a Bigram")
plot2 <- plot2 + xlab("String")
plot2
Save the Bigram plot as a PNG file (Histogram_Bigram.png)
#save Bigram histogram
png(filename="~/Capstone_Project/Milestone_Report/Histogram_Bigram.png")
plot(plot2)
dev.off()
## png
## 2
Some functions have been re-used, and also modified, from sites like [Stackoverflow](http://stackoverflow.com) and another extremely valuable source, [r-bloggers](http://r-bloggers.com).