INTRODUCTION

We have a text corpus of English news, blogs, and Twitter datasets. We want to gain some insights from these datasets: we have to clean the text, after which we can summarise it. We also make some plots to communicate the insights and make them easier for other people to understand.


Load Libraries

We load some helpful libraries for the text mining task.

library(tm)
library(quanteda)
library(dplyr)
library(ggplot2)
library(stringr)
library(pander)
library(stringi)
library(RWeka)
library(wordcloud)

Load Datasets

We can load our datasets as text files. You can get the text data from this link. We have text from the following:

  • Three Sources
    • Blogs
    • News
    • Twitter
  • Four Languages
    • English
    • German
    • Finnish
    • Russian

But we are using only the English text here.

blog <- readLines(con = "en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines(con = "en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twit <- readLines(con = "en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Some corpus stats

blog.size <- file.info("en_US.blogs.txt")$size / 1024 ^ 2
news.size <- file.info("en_US.news.txt")$size / 1024 ^ 2
twit.size <- file.info("en_US.twitter.txt")$size / 1024 ^ 2
blog.words <- stri_count_words(blog)
news.words <- stri_count_words(news)
twit.words <- stri_count_words(twit)

data.frame(source = c("blogs", "news", "twitter"),
           file.size.MB = c(blog.size, news.size, twit.size),
           num.lines = c(length(blog), length(news), length(twit)),
           num.words = c(sum(blog.words), sum(news.words), sum(twit.words)),
           mean.num.words = c(mean(blog.words), mean(news.words), mean(twit.words)))
##    source file.size.MB num.lines num.words mean.num.words
## 1   blogs     200.4242    899288  37546246       41.75108
## 2    news     196.2775   1010242  34762395       34.40997
## 3 twitter     159.3641   2360148  30093410       12.75065

Data Cleaning

We have to preprocess our dataset. This involves removing URLs, special characters, punctuation, numbers, excess whitespace, and stopwords, and converting the text to lowercase.


Sample the Data

We have a very large text corpus to process, so we take a small sample to analyze for insights, because the full corpus consumes a lot of RAM and is not feasible to work with when it comes to n-grams.


set.seed(6)
# Sample 1% of each source to keep memory use manageable
dsample <- c(sample(blog, length(blog) * 0.01),
             sample(news, length(news) * 0.01),
             sample(twit, length(twit) * 0.01))

corpus <- VCorpus(VectorSource(dsample))
# Transformer that replaces a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")  # remove URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                      # remove Twitter handles
corpus <- tm_map(corpus, content_transformer(tolower))             # convert to lowercase
corpus <- tm_map(corpus, removeWords, stopwords(kind = "english")) # drop English stopwords
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

Removing profanity words

We should also remove profanity: we can download a list of common profanity words and remove them from the corpus. You can download the list here: Profanity words

profanity<-readLines("swearWords.txt", encoding = "UTF-8", warn=TRUE, skipNul=TRUE)
corpus<-tm_map(corpus, removeWords, profanity)

Functions for EDA Plots

Here we define some helper functions to extract term frequencies, construct n-grams, and visualize the most popular n-grams.

options(mc.cores = 1)

# Extract sorted term frequencies from a term-document matrix
getFreq <- function(tdm) {
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
# RWeka tokenizers for bigrams and trigrams
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Bar plot of the 35 most frequent terms in a frequency data frame
makePlot <- function(data, label) {
  ggplot(data[1:35, ], aes(reorder(word, -freq), freq, fill = freq)) +
    labs(x = label, y = "Frequency") +
    theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
    geom_bar(stat = "identity") +
    scale_fill_continuous(low = "orange", high = "red")
}
# Word cloud of a frequency data frame
makeWC <- function(d) {
  wordcloud(d$word, d$freq, col = terrain.colors(length(d$word), alpha = 0.9),
            random.order = FALSE, rot.per = 0.3)
}

Make a Word Frequency Data Frame

In this section we first convert our corpus to a term-document matrix. The resulting matrix is very sparse, so we remove the sparse terms. We do the same for bigram and trigram term-document matrices, using the tokenizers defined above.

freq1 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus), 0.99))
freq2 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.999))
freq3 <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.9999))

Frequency Plots

Here we plot our 35 most frequent words. We also plot the most frequent bigrams and trigrams, as sketched below.
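A minimal sketch of the plotting calls, reusing the makePlot() helper and the freq1, freq2, and freq3 data frames built above; the axis labels are illustrative.

# Bar plots of the 35 most frequent unigrams, bigrams, and trigrams
makePlot(freq1, "35 Most Frequent Unigrams")
makePlot(freq2, "35 Most Frequent Bigrams")
makePlot(freq3, "35 Most Frequent Trigrams")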

Plot some Word Clouds

Here we create some word clouds for the unigrams, bigrams, and trigrams.
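As a sketch, the word clouds can be drawn with the makeWC() helper defined above, reusing the same frequency data frames.

# Word clouds for unigrams, bigrams, and trigrams
makeWC(freq1)
makeWC(freq2)
makeWC(freq3)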