The goal of this project is to present an exploratory analysis of the text data: mainly the number of lines and the most frequent words, illustrated with some graphics.

I downloaded all the files and selected the English ones for evaluation. From these 3 files I first analyzed the total number of lines and the total number of words.

twitter <- readLines("en_US.twitter.txt")
news <- readLines("en_US.news.txt")
blogs <- readLines("en_US.blogs.txt")
all <- c(twitter,news,blogs)
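
On some systems these reads emit warnings about embedded nul characters and unclear encodings. A hedged variant of the same reads (a sketch, assuming the files are UTF-8) would be:

# Sketch: same reads, tolerant of embedded nuls and explicit about encoding
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)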

Basic summary

The first statistic is how many lines each one of the files contains.

Twitter:

length(twitter)
## [1] 2360148

News:

length(news)
## [1] 1010242

Blogs:

length(blogs)
## [1] 899288

All the files combined:

length(all)
## [1] 4269678
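
The introduction also mentions total word counts; those are not shown above, but a minimal sketch for computing them (assuming the stringi package is installed) would be:

library(stringi)
# Approximate total number of words in each file and in the combined corpus
sum(stri_count_words(twitter))
sum(stri_count_words(news))
sum(stri_count_words(blogs))
sum(stri_count_words(all))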

I selected a 1% sample from each file to evaluate and compute descriptive statistics.

ttwit <- twitter[sample(1:length(twitter),length(twitter)*.01, replace=F)]
tnews <-  news[sample(1:length(news),length(news)*.01, replace=F)]
tblogs <- blogs[sample(1:length(blogs),length(blogs)*.01, replace=F)]
sampleall <- c(ttwit,tnews,tblogs)
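
Because the sample is random, the results below change on every run; a seed could be set before sampling to make the report reproducible. A sketch (the seed value is arbitrary):

set.seed(1234)   # arbitrary seed, chosen only for reproducibility
ttwit  <- sample(twitter, round(length(twitter) * 0.01))
tnews  <- sample(news,    round(length(news)    * 0.01))
tblogs <- sample(blogs,   round(length(blogs)   * 0.01))
sampleall <- c(ttwit, tnews, tblogs)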

Word counts, with basic data tables and figures.

library(tm)
## Loading required package: NLP
docs <- Corpus(VectorSource(sampleall))
docs <- tm_map(docs, content_transformer(tolower))       # lower-case everything
docs <- tm_map(docs, removeNumbers)                       # remove digits
docs <- tm_map(docs, removePunctuation)                   # remove punctuation
docs <- tm_map(docs, removeWords, stopwords("english"))   # remove English stop words
docs <- tm_map(docs, stemDocument)                        # reduce words to their stems
docs <- tm_map(docs, stripWhitespace)                     # collapse repeated spaces
docs <- tm_map(docs, PlainTextDocument)
dtm <- DocumentTermMatrix(docs)
m <- as.matrix(dtm)
v <- sort(colSums(m), decreasing=TRUE)                    # term frequencies, highest first
myNames <- names(v)
d <- data.frame(word=myNames, freq=v)
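
Converting the document-term matrix to a dense matrix with as.matrix can be memory-hungry for larger samples. As an alternative (a sketch, assuming the slam package that tm depends on is available), the term frequencies can be computed directly on the sparse matrix:

library(slam)
v <- sort(col_sums(dtm), decreasing = TRUE)   # same frequencies, without a dense matrix
d <- data.frame(word = names(v), freq = v)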

The 10 most frequent words are shown in the next table and the next figure.

table1 <- head(d,10)
barplot(table1[,2],names.arg=table1[,1], col= rainbow(10))
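
A possible refinement of the bar plot, with rotated word labels and a frequency axis label (a sketch):

barplot(table1$freq, names.arg = table1$word, col = rainbow(10),
        las = 2, ylab = "Frequency", main = "10 most frequent words")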

A word cloud of the 100 most frequent words.

library(wordcloud)
## Loading required package: RColorBrewer
table2 <- head(d,100)
wordcloud(table2[,1], table2[,2],scale=c(5, .1), colors=brewer.pal(5, "Dark2"))

The next project…

I found functions in the RWeka library for extracting bigrams and trigrams, which I will use in the next report. These are the functions:

library(RWeka)
BigramToken  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramToken <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
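
These tokenizers are intended to be passed to tm's matrix builders. A sketch of how bigram and trigram term-document matrices could be built from the sample corpus (this assumes docs is a corpus that accepts custom tokenizers, e.g. a VCorpus):

bi_tdm  <- TermDocumentMatrix(docs, control = list(tokenize = BigramToken))
tri_tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramToken))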

I'm learning many tools from the tm package and the RWeka package.