We have a text corpus of English blogs, news, and Twitter data. In this report we gain some insights into these datasets: we clean the text, summarise it, and make some plots to communicate the insights so they are easy for others to understand.
We load some helpful libraries for the text mining tasks:
library(tm)         # text mining framework (corpus, tm_map, TermDocumentMatrix)
library(quanteda)   # quantitative text analysis
library(dplyr)      # data manipulation
library(ggplot2)    # plotting
library(stringr)    # string helpers
library(pander)     # nicer table output
library(stringi)    # fast string operations (word counts)
library(RWeka)      # n-gram tokenizers
library(wordcloud)  # word clouds
We can load our dataset from the text files; you can get the text data from this link. The data are provided in several languages, but we are using only the English text here.
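The download step itself is not shown here; a minimal sketch, assuming the files come from the Coursera SwiftKey archive behind the link above (the URL and file names below are assumptions):
url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"  # assumed URL
download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")  # binary mode for the zip archive
unzip("Coursera-SwiftKey.zip")  # extracts the en_US .txt files read below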
# read each source line by line, skipping embedded nulls
blog <- readLines(con = "en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con = "en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twit <- readLines(con = "en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blog.size <- file.info("en_US.blogs.txt")$size / 1024 ^ 2    # file size in MB
news.size <- file.info("en_US.news.txt")$size / 1024 ^ 2
twit.size <- file.info("en_US.twitter.txt")$size / 1024 ^ 2
blog.words <- stri_count_words(blog)                         # words per line
news.words <- stri_count_words(news)
twit.words <- stri_count_words(twit)
# summary table: file size, line count, word count, and mean words per line
data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blog.size, news.size, twit.size),
           num.lines = c(length(blog), length(news), length(twit)),
           num.words = c(sum(blog.words), sum(news.words), sum(twit.words)),
           mean.num.words = c(mean(blog.words), mean(news.words), mean(twit.words)))
## source file.size.MB num.lines num.words mean.num.words
## 1 blogs 200.4242 899288 37546246 41.75108
## 2 news 196.2775 1010242 34762395 34.40997
## 3 twitter 159.3641 2360148 30093410 12.75065
Next we preprocess the dataset. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lower case.
The full corpus is very large, so we take a small sample for this analysis: processing the complete corpus consumes too much RAM and is not feasible once we start building n-grams.
set.seed(6)  # for reproducible sampling
dsample <- c(sample(blog, length(blog) * 0.01),  # take a 1% sample of each source
             sample(news, length(news) * 0.01),
             sample(twit, length(twit) * 0.01))
corpus <- VCorpus(VectorSource(dsample))  # build a tm corpus from the sample
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))  # helper: replace a pattern with a space
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")   # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                       # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))              # lower-case everything
corpus <- tm_map(corpus, removeWords, stopwords(kind = "english"))  # drop English stopwords
corpus <- tm_map(corpus, removePunctuation)                         # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                             # drop numbers
corpus <- tm_map(corpus, stripWhitespace)                           # collapse excess whitespace
We should also remove profanity. We can download a list of common profanity words and remove them from the corpus. You can download the list from here: Profanity words.
profanity <- readLines("swearWords.txt", encoding = "UTF-8", warn = TRUE, skipNul = TRUE)  # profanity word list
corpus <- tm_map(corpus, removeWords, profanity)  # strip profanity from the corpus
Here we define some helper functions to extract term frequencies and construct n-grams so that we can visualise the most popular n-grams.
options(mc.cores = 1)  # keep tm/RWeka tokenization single-threaded
getFreq <- function(tdm) {
  # sum term counts across documents and return a word/frequency data frame
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))   # two-word tokens
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))  # three-word tokens
makePlot <- function(data, label) {
  # bar chart of the 35 most frequent n-grams, coloured by frequency
  ggplot(data[1:35, ], aes(reorder(word, -freq), freq, fill = freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity") +
    scale_fill_continuous(low = "orange", high = "red")
}
makeWC <- function(d) {
  # word cloud sized by frequency, most frequent words placed first
  wordcloud(d$word, d$freq, col = terrain.colors(length(d$word), alpha = 0.9),
            random.order = FALSE, rot.per = 0.3)
}
In this section we first convert our corpus into a term-document matrix. The result is very sparse, so we remove the sparsest terms. We do this for unigrams, and we also build bigram and trigram term-document matrices using the tokenizers defined above.
freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.99))  # unigram frequencies
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.999))    # bigram frequencies
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))  # trigram frequencies
Here we plot our 35 most frequent words, and we also plot the most frequent bigrams and trigrams.
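The plotting calls themselves are not shown above; a minimal sketch using the helper function defined earlier (the axis labels are illustrative) would be:
makePlot(freq1, "35 Most Common Unigrams")
makePlot(freq2, "35 Most Common Bigrams")
makePlot(freq3, "35 Most Common Trigrams")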
Here we create word clouds for the unigrams, bigrams, and trigrams.
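Again the calls are not shown; assuming the makeWC helper defined above, they would look something like:
makeWC(freq1)  # unigram word cloud
makeWC(freq2)  # bigram word cloud
makeWC(freq3)  # trigram word cloud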