The goal of this project is to present an exploratory analysis of the data, mainly the number of lines and the most frequent words, together with some graphics.
I downloaded all the files and selected the English files for evaluation. From the three files I first made an analysis of the total number of lines and the total number of words.
# Read the three English data sets line by line
twitter <- readLines("en_US.twitter.txt")
news <- readLines("en_US.news.txt")
blogs <- readLines("en_US.blogs.txt")
# Combine all lines into one character vector
all <- c(twitter, news, blogs)
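On some systems readLines warns about embedded nul characters in the Twitter file. If that happens, a possible fix (not needed on every machine) is the skipNul argument:
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)  # skip embedded nuls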
The first statistics are the number of lines in each one of the files.
Twitter:
length(twitter)
## [1] 2360148
News:
length(news)
## [1] 1010242
Blogs:
length(blogs)
## [1] 899288
All the files combined:
length(all)
## [1] 4269678
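The total word counts mentioned above can be obtained with a simple whitespace tokenization. This is a sketch; the exact count depends on the tokenization rule chosen:
# Count words by splitting each line on runs of whitespace
words_per_line <- lengths(strsplit(all, "\\s+"))
sum(words_per_line)  # total words across the three files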
I selected a 1% sample to evaluate and compute descriptive statistics.
# Draw a 1% random sample (without replacement) from each source
ttwit <- twitter[sample(length(twitter), length(twitter) * 0.01)]
tnews <- news[sample(length(news), length(news) * 0.01)]
tblogs <- blogs[sample(length(blogs), length(blogs) * 0.01)]
# Combine the three samples
sampleall <- c(ttwit, tnews, tblogs)
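To make the sample reproducible, one can fix the random seed before sampling (the seed value below is arbitrary):
set.seed(1234)  # any fixed value makes sample() reproducible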
library(tm)
## Loading required package: NLP
docs <- Corpus(VectorSource(sampleall))
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeNumbers)                      # drop digits
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop English stop words
docs <- tm_map(docs, stemDocument)                       # stem words (needs SnowballC)
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
m <- as.matrix(dtm)                       # dense term matrix (fine for a 1% sample)
v <- sort(colSums(m), decreasing = TRUE)  # term frequencies, most frequent first
d <- data.frame(word = names(v), freq = v)
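For larger samples, converting the whole document-term matrix to a dense matrix can exhaust memory. A sketch of an alternative that sums columns directly on the sparse representation, using the slam package that tm uses internally:
library(slam)
v <- sort(col_sums(dtm), decreasing = TRUE)  # column sums without densifying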
The 10 most frequent words are shown in the next table and the next figure.
table1 <- head(d, 10)
barplot(table1[, 2], names.arg = table1[, 1], col = rainbow(10))
A word cloud with the 100 most frequent words:
library(wordcloud)
## Loading required package: RColorBrewer
table2 <- head(d, 100)
wordcloud(table2[, 1], table2[, 2], scale = c(5, .1), colors = brewer.pal(5, "Dark2"))
I found functions in the RWeka library to build bigrams and trigrams, which I will use in the next report. These are the functions:
library(RWeka)
BigramToken  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
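As a sketch of how I plan to use them (the object names below are mine, not final), the tokenizers plug into tm's control list; depending on the tm version, the corpus may need to be a VCorpus for a custom tokenizer to take effect:
# Term-document matrices of bigrams and trigrams from the same corpus
bigram_tdm  <- TermDocumentMatrix(docs, control = list(tokenize = BigramToken))
trigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramToken))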
I'm learning many tools from the tm package and the RWeka package.