The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships you observe in the data and prepare to build your first linguistic models.
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora. Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
First, we need to be clear that three fundamental objects are required to start:
+ The corpus data in our working directory
+ Java installed (required by the rJava package)
+ The required R libraries
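A minimal sketch to verify the first two prerequisites before continuing (the file names below are the ones used later in this report):
# Check that the corpus and bad-words files are present in the working directory
file.exists(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt", "badwords.txt"))
# Optionally confirm that Java is available for rJava/RWeka
# system("java -version")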
So, loading the required packages:
library(tm)
## Loading required package: NLP
library(SnowballC)
suppressMessages(library(gdata))
library(kableExtra)
suppressMessages(library(ggplot2))
library(gridExtra)
library(rJava)
library(XML)
library(wordcloud)
library(RColorBrewer)
library(caret)
library(NLP)
library(openNLP)
library(RWeka)
library(qdap)
Now we need to load the data and then take from each corpus a sample of the same length as another file, the “bad words” list, which will be used in the next task.
badEn<-readLines("badwords.txt")
## Warning in readLines("badwords.txt"): incomplete final line found on
## 'badwords.txt'
blog<-sample(suppressWarnings(readLines("en_US.blogs.txt")),length(badEn))
twitter<-sample(suppressWarnings(readLines("en_US.twitter.txt")),length(badEn))
news<-sample(suppressWarnings(readLines("en_US.news.txt")),length(badEn))
class(blog)
## [1] "character"
class(news)
## [1] "character"
class(twitter)
## [1] "character"
The combined sample data set is:
sample_d<- c(blog,news,twitter)
Now we need to clean the data by removing punctuation and numbers, and then reorganize it:
txt <- sent_detect(sample_d)
txt<- removeNumbers(txt)
txt <- removePunctuation(txt)
txt<- stripWhitespace(txt)
txt <- tolower(txt)
txt <- txt[which(txt!="")]
txt <- data.frame(txt, stringsAsFactors = FALSE)
The next step is to create the n-grams (1, 2 and 3); for this purpose we need to organize the data a bit more.
words <- WordTokenizer(txt)
grams <- NGramTokenizer(txt)
# This relies on NGramTokenizer returning the 3-grams first, then the 2-grams, then the 1-grams,
# so i marks where the 2-grams start and j marks where the 1-grams start
for(i in 1:length(grams))
{if(length(WordTokenizer(grams[i]))==2) break}
for(j in 1:length(grams))
{if(length(WordTokenizer(grams[j]))==1) break}
# Creating the grams(N)
onegram<- data.frame(table(words))
onegram <- onegram[order(onegram$Freq, decreasing = TRUE),]
bigrams<- data.frame(table(grams[i:(j-1)]))
bigrams <- bigrams[order(bigrams$Freq, decreasing = TRUE),]
trigrams <- data.frame(table(grams[1:(i-1)]))
trigrams<- trigrams[order(trigrams$Freq,decreasing = TRUE),]
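An alternative, arguably clearer, way to build the same tables is to ask RWeka for each n directly via Weka_control; this is just a sketch of that option (the _alt names are illustrative):
# Build each n-gram table directly instead of slicing the combined vector
bigrams_alt <- data.frame(table(NGramTokenizer(txt, Weka_control(min = 2, max = 2))))
trigrams_alt <- data.frame(table(NGramTokenizer(txt, Weka_control(min = 3, max = 3))))
bigrams_alt <- bigrams_alt[order(bigrams_alt$Freq, decreasing = TRUE),]
trigrams_alt <- trigrams_alt[order(trigrams_alt$Freq, decreasing = TRUE),]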
We could take this in a couple of directions, but here I focus on the wordcloud package to make the representation and visualization of the data clearer.
wordcloud(words, scale = c(3,0.1),max.words = 100, random.order = FALSE,
rot.per = 0.5, use.r.layout = FALSE, colors = brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, function(x) tm::removeWords(x,
## tm::stopwords())): transformation drops documents
wordcloud(words, scale=c(5,0.1), max.words=100, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus, tm::removePunctuation): transformation
## drops documents
wordcloud(onegram$words, onegram$Freq, scale=c(5,0.5), max.words=300, random.order=FALSE,
rot.per=0.5, use.r.layout=FALSE, colors=brewer.pal(8,"Accent"))
In the first graph we can see that the word “said” is the most common word in this sample; other frequent words are “one”, “just” and “will”. The unigram frequencies are:
onegram[1:10,]
## words Freq
## 14166 the 5613
## 14375 to 2862
## 753 and 2807
## 1 a 2617
## 9783 of 2413
## 7089 in 1792
## 6966 i 1443
## 14159 that 1179
## 7473 is 1079
## 5594 for 1012
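For consistency with the 2-gram and 3-gram plots below, the same kind of barplot can be drawn for the top unigrams (a small sketch reusing the onegram table above):
barplot(onegram[1:20,2],col="blue",
names.arg = onegram$words[1:20],
space=0.1, xlim=c(0,20),las=2)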
For the 2-grams, the visualization and the frequencies are:
barplot(bigrams[1:20,2],col="blue",
names.arg = bigrams$Var1[1:20],srt = 45,
space=0.1, xlim=c(0,20),las=2)
bigrams[1:10,]
## Var1 Freq
## 41558 of the 581
## 29916 in the 454
## 62961 to the 234
## 42312 on the 209
## 22595 for the 183
## 62108 to be 163
## 29406 in a 158
## 7577 at the 140
## 5620 and the 132
## 30919 is a 113
For the 3-grams, the visualization and the frequencies are:
barplot(trigrams[1:20,2],col="blue",
names.arg = trigrams$Var1[1:20],srt = 45,
space=0.1, xlim=c(0,20),las=2)
trigrams[1:10,]
## Var1 Freq
## 58576 one of the 42
## 1157 a lot of 34
## 81373 the u s 26
## 32007 going to be 22
## 71531 some of the 21
## 84969 to be a 19
## 10640 as well as 17
## 12426 be able to 17
## 97207 you have to 16
## 60792 part of the 15
Next, we compute how many of the most frequent unique words are needed to cover 50% (and then 90%) of all word instances in the sample:
sumCover <- 0
for(i in 1:length(onegram$Freq)) {
  sumCover <- sumCover + onegram$Freq[i]
  if(sumCover >= 0.5*sum(onegram$Freq)){break}
}
print(i)
## [1] 142
sumCover <- 0
for(i in 1:length(onegram$Freq)) {
  sumCover <- sumCover + onegram$Freq[i]
  if(sumCover >= 0.9*sum(onegram$Freq)){break}
}
print(i)
## [1] 6136
So, as we can see, we need 142 words to cover 50% of all word instances in the sample and 6136 words to cover 90% of all word instances.
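A quick way to visualize this coverage is a cumulative curve over the sorted unigram frequencies; this is only a sketch based on the onegram table above:
# Cumulative fraction of word instances covered by the k most frequent words
coverage <- cumsum(onegram$Freq)/sum(onegram$Freq)
plot(coverage, type = "l", xlab = "Number of unique words",
     ylab = "Fraction of word instances covered")
abline(h = c(0.5, 0.9), lty = 2)  # mark the 50% and 90% thresholds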
In my view we need to compare the text with some well-known dictionary; many exist on the web. This is also the way to remove the “rude” or bad words that we previously loaded. Nevertheless, such words are so few and their impact so small that we do not need to take them into account; it is more important to continue cleaning the data.
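If we did want to filter them out, a minimal sketch using tm::removeWords and the badEn vector loaded at the start could look like this (assuming the entries in badEn are plain words rather than regular expressions):
# Drop the listed bad words from the cleaned sentences, then tidy the spacing
txt_clean <- removeWords(txt$txt, badEn)
txt_clean <- stripWhitespace(txt_clean)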
Yes, for that we need to take into account:
+ Prediction based on location (customs, traditions, holidays, names, places, etc.)
+ Learning the writing style of the author and the non-colloquial words
+ Using an additional dictionary together with the n-grams: first remove the low-frequency words from the dictionary, then use the remaining ones for better prediction of n-grams (see the sketch below)
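As an illustration of that last point, low-frequency entries could be pruned from the n-gram tables before using them for prediction; the cutoff of 1 below is arbitrary and only for the sketch:
# Keep only n-grams observed more than once, to shrink the prediction tables
bigrams_pruned <- bigrams[bigrams$Freq > 1,]
trigrams_pruned <- trigrams[trigrams$Freq > 1,]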