suppressPackageStartupMessages(library("NLP"))
suppressPackageStartupMessages(library("tm"))
suppressPackageStartupMessages(library("wordcloud"))
suppressPackageStartupMessages(library("RColorBrewer"))
suppressPackageStartupMessages(source("functions.R"))
suppressPackageStartupMessages(library("RWeka"))
suppressPackageStartupMessages(library("qdap"))
suppressPackageStartupMessages(library("ggplot2"))
suppressPackageStartupMessages(library("ngram"))
Exploratory analysis - perform a thorough exploratory analysis of the data, understanding the distribution of words and relationship between the words in the corpora.
Understand frequencies of words and word pairs - build figures and tables to understand variation in the frequencies of words and word pairs in the data.
1. Some words are more frequent than others - what are the distributions of word frequencies?
2. What are the frequencies of 2-grams and 3-grams in the dataset?
3. How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
4. How do you evaluate how many of the words come from foreign languages?
5. Can you think of a way to increase the coverage -- identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?
In this work, I start by showing one way to clean and sample the text data from blogs, news, and Twitter. Next, I construct a Corpus object, a collection of documents that supports text analysis. Finally, I define two helper functions for building n-gram scenarios: my_ngram and coverage.
The conclusions of this report are: sampling the data is necessary to work on a personal computer; stopwords are among the most frequent terms in every scenario; and a small set of words accounts for a large share of all word instances.
# read the raw English corpora; skipNul avoids warnings from embedded nul
# characters that occur in the news file
blogs_en <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news_en <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter_en <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
summary(blogs_en)
## Length Class Mode
## 899288 character character
summary(news_en)
## Length Class Mode
## 1010242 character character
summary(twitter_en)
## Length Class Mode
## 2360148 character character
# total word counts per source (wordcount() is from the ngram package)
b <- wordcount(blogs_en)
n <- wordcount(news_en)
tw <- wordcount(twitter_en)  # named tw to avoid masking base::t()
barplot(c(b, n, tw), names.arg = c("blog", "news", "twitter"), horiz = TRUE, main = "Word Count")
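The raw vectors are also heavy in memory, which motivates the sampling step below; a quick size check (a sketch, not in the original run):
format(object.size(blogs_en), units = "MB")  # in-memory size of the blogs vector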
# fix the random seed so the sample is reproducible
set.seed(3010)
# sample 500 lines from each of the three sources
qtd_sample <- 500
t_blogs_en <- sample(blogs_en, qtd_sample)
t_news_en <- sample(news_en, qtd_sample)
t_twitter_en <- sample(twitter_en, qtd_sample)
t_all <- c(t_blogs_en, t_news_en, t_twitter_en)
# cleaning: strip punctuation and numbers, normalize whitespace and case,
# then drop any lines left empty
t_all <- removePunctuation(t_all)
t_all <- stripWhitespace(t_all)
t_all <- removeNumbers(t_all)
t_all <- tolower(t_all)
t_all <- t_all[which(t_all != "")]
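As an illustration of what these steps do (a sketch with a made-up string, not part of the analysis):
example <- "Hello, World!!  It's 2014."
tolower(removeNumbers(stripWhitespace(removePunctuation(example))))
# yields roughly "hello world its"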
summary(t_all)
## Length Class Mode
## 1500 character character
# build a tm Corpus from the cleaned sample
modi_txt <- t_all
modi <- Corpus(VectorSource(modi_txt))
# a second corpus with English stopwords removed, for comparison
modi_no_stopwords <- tm_map(modi, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(modi, removeWords, stopwords("english")):
## transformation drops documents
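To peek at the documents inside the corpus, tm's inspect() can be used (a sketch, not run here):
inspect(modi[1:2])  # print the first two documents of the corpus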
# term-document matrix and a term-frequency vector sorted in decreasing order
tdm_modi <- TermDocumentMatrix(modi)
TDM1 <- as.matrix(tdm_modi)
v <- sort(rowSums(TDM1), decreasing = TRUE)
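The vector v now holds every term's total count in decreasing order; a quick look at the top of it (a sketch, output omitted):
head(v, 10)  # the ten most frequent terms and their counts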
print("With stopwords")
## [1] "With stopwords"
wordcloud(modi, max.words = 30, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
print("Without stopwords")
## [1] "Without stopwords"
wordcloud(modi_no_stopwords, max.words = 30, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
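The same cloud can also be drawn directly from the frequency vector v computed above, avoiding a second pass over the corpus (a sketch):
wordcloud(names(v), v, max.words = 30, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))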
# tokenize x into n-grams and return a data frame of n-gram frequencies,
# sorted from most to least frequent
my_ngram <- function(x, n){
  gramN <- NGramTokenizer(x, Weka_control(min = n, max = n))
  gramN <- data.frame(table(gramN))
  gramN <- gramN[order(gramN$Freq, decreasing = TRUE), ]
  return(gramN)
}
gramOne <- my_ngram(t_all, 1)
gramOne[1:10,]
## gramN Freq
## 8127 the 12177
## 8256 to 6711
## 417 and 6694
## 108 a 5645
## 5606 of 5302
## 4093 in 4054
## 4012 i 3378
## 8123 that 2711
## 4272 is 2612
## 3244 for 2402
par(mfrow=c(1,1))
barplot(gramOne$Freq[1:10], names.arg = as.character(gramOne$gramN[1:10]))
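ggplot2 is loaded above but not otherwise used; the same top-10 chart could be drawn with it (a sketch):
df_top <- head(gramOne, 10)
ggplot(df_top, aes(x = reorder(gramN, Freq), y = Freq)) +
  geom_col() +
  coord_flip() +
  labs(x = "unigram", y = "frequency")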
gramTwo <- my_ngram(t_all, 2)
gramThree <- my_ngram(t_all, 3)
gramTwo[1:10,]
## gramN Freq
## 17944 of the 584
## 12862 in the 530
## 27164 to the 286
## 18304 on the 242
## 9546 for the 232
## 3365 at the 198
## 29749 with the 183
## 2495 and the 182
## 26732 to be 174
## 13690 it was 146
barplot(gramTwo$Freq[1:10], names.arg = as.character(gramTwo$gramN[1:10]))
gramThree[1:10,]
## gramN Freq
## 22538 one of the 20
## 607 a lot of 18
## 4040 as well as 10
## 12131 going to be 9
## 15578 in the middle 9
## 29731 the end of 9
## 16646 it is a 8
## 16791 it was a 8
## 30981 the united states 8
## 9590 end of the 7
barplot(gramThree$Freq[1:10], names.arg = as.character(gramThree$gramN[1:10]))
# coverage: number of top-frequency terms in x needed to cover the
# fraction `freq` of all term instances
coverage <- function(x, freq){
  target <- sum(x$Freq) * freq
  acc <- 0
  for(i in 1:nrow(x)){
    acc <- acc + x$Freq[i]
    if(acc >= target){
      return(i)
    }
  }
  nrow(x)  # the whole dictionary is needed if the loop completes
}
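A vectorized equivalent using cumsum (a sketch of the same idea):
coverage2 <- function(x, freq) which(cumsum(x$Freq) >= sum(x$Freq) * freq)[1]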
coverage(gramOne, 0.5)
## [1] 137
coverage(gramOne, 0.9)
## [1] 4795
If a word contains accented characters (á, ã, ü, ç) or ideograms (男, 男人), it is very likely not an English word.
Alternatively, each word can be checked against a digital English dictionary to verify its language.
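As a sketch of the first idea, tokens containing any non-ASCII character can be flagged with a regular expression (the function name and example tokens are illustrative):
is_foreign <- function(tokens) grepl("[^[:ascii:]]", tokens, perl = TRUE)
is_foreign(c("house", "coração", "男人"))  # FALSE TRUE TRUE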
Remove rare words from the dictionary.
Use a thesaurus to map similar words onto a single common term (see the stemming sketch below).
Cluster similar words together.
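One concrete way to collapse similar word forms is stemming; a minimal sketch using the SnowballC package (assumed installed; it is the stemmer behind tm's stemDocument):
library(SnowballC)
wordStem(c("walks", "walking", "walked"))  # all three reduce to "walk"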