Data Science Capstone: Milestone Report

Synopsis

This report presents some findings from the exploratory data analysis, carried out with a view to implementing a mixed model of word association. It uses a few tables and a plot to point the way forward.

Loading the training data

This project uses files named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU and fi_FI. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). This report works with the en_US files for blogs, news and Twitter.

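# Set the working directory to the folder holding the en_US files (the path is machine-specific).
setwd("~/Documents/Cursos/Coursera/Data Science Specialization/Capstone project/final/en_US")
# Read each source as a character vector with one element per line.
Blogs <- readLines("en_US.blogs.txt")
News <- readLines("en_US.news.txt")
Twitter <- readLines("en_US.twitter.txt")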
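
Large text files of this kind sometimes contain embedded NUL or control characters that make readLines() warn or stop early. The following is a minimal sketch of a more defensive reader, not the method used in this report. The name read_corpus is hypothetical, and the UTF-8 encoding is an assumption about the files. It is demonstrated on a temporary file so that it runs without the corpus.

```r
# Hypothetical helper: read a text file in binary mode, marking strings as
# UTF-8 (an assumption about the corpus) and skipping embedded NUL characters.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file instead of the real corpus.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo <- read_corpus(tmp)
length(demo)
```

With the real data, the call would look like read_corpus("en_US.twitter.txt").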

Summary of the training data

library(stringi)

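# Collect the three sources so the same statistics can be applied to each.
lis <- list(Blogs, News, Twitter)
# stri_stats_general() returns line and character counts for each source.
SummData <- data.frame(t(sapply(X = lis, FUN = stri_stats_general)), row.names = c("Blogs", "News", "Twitter"))
# Add the total word count of each source.
SummData$Words <- sapply(X = lis, FUN = function(x) sum(stri_count_words(x)))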

SummData

##           Lines LinesNEmpty     Chars CharsNWhite    Words
## Blogs    899288      899288 206824382   170389539 37546246
## News    1010242     1010242 203223154   169860866 34762395
## Twitter 2360148     2360148 162096031   134082634 30093369

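It is noteworthy that the blog and news files are very similar in structure: they have comparable numbers of lines, characters and words. Twitter, given its nature as a social network, differs from the other two: it has more than twice as many lines but fewer characters and words overall, so its entries are much shorter.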
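
The difference can be made concrete by dividing the totals by the number of lines. The sketch below copies the figures from the summary table into a new data frame (SummCheck is a name introduced here for illustration) so that it runs without the corpus.

```r
# Figures copied from the summary table above.
SummCheck <- data.frame(
  Lines = c(899288, 1010242, 2360148),
  Chars = c(206824382, 203223154, 162096031),
  Words = c(37546246, 34762395, 30093369),
  row.names = c("Blogs", "News", "Twitter")
)
# Average length of one line in each source.
SummCheck$WordsPerLine <- SummCheck$Words / SummCheck$Lines
SummCheck$CharsPerLine <- SummCheck$Chars / SummCheck$Lines
round(SummCheck[, c("WordsPerLine", "CharsPerLine")], 1)
```

This gives roughly 42 words per line for Blogs, 34 for News and 13 for Twitter.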

Patterns from text data

Following the idea of using a statistical language model to quantify the uncertainty in natural language, we look at the distribution of words in a sample of the data to find possible patterns in how the texts are composed.

library(tm)
library(ggplot2)
set.seed(999)

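# Draw a 0.5% random sample of the blog lines to keep processing manageable.
Sam <- floor(length(Blogs) * 0.005)
BlogSample <- sample(Blogs, Sam)
MyCorpus <- Corpus(VectorSource(BlogSample))
# Helpers: drop URLs, and drop anything that is not a letter or whitespace.
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)

# Drop non-ASCII characters and lower-case the text.
MyCorpus <- tm_map(MyCorpus, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
# Remove URLs before punctuation, while they are still recognisable as URLs.
MyCorpus <- tm_map(MyCorpus, content_transformer(removeURL))
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, removeNumbers)
MyCorpus <- tm_map(MyCorpus, content_transformer(removeNumPunct))
# Collapse the runs of whitespace left behind by the previous steps.
MyCorpus <- tm_map(MyCorpus, stripWhitespace)
# Term-document matrix (terms in rows); by default tm keeps only terms of 3 or more characters.
MyDtmB <- TermDocumentMatrix(MyCorpus, control = list(removePunctuation = TRUE,
                                                     removeNumbers = TRUE))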


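# Terms that occur at least 1000 times in the sample.
freq.terms <- findFreqTerms(MyDtmB, lowfreq = 1000)
# Total count per term; slam::row_sums() works on the sparse matrix directly,
# avoiding the memory cost of as.matrix().
term.freq <- slam::row_sums(MyDtmB)
term.freq <- subset(term.freq, term.freq >= 1000)
df <- data.frame(term = names(term.freq), freq = term.freq)

# Horizontal bar chart with terms ordered by frequency.
ggplot(df, aes(x = reorder(term, freq), y = freq)) + geom_bar(stat = "identity") +
    xlab("Terms") + ylab("Count") + coord_flip() + labs(title = "Most frequent words in the Blogs sample")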

The frequency distribution shows that the sample consists mostly of very common words, which suggests that the predictive model will need to take the context of use into account. Finally, we show how the most common of these words, "the", is associated with other words in the blog sample.
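
One way to quantify how much of a text is made up of common words is cumulative coverage: the share of all word occurrences accounted for by the n most frequent words. The sketch below uses made-up counts (toy.freq is illustrative, not taken from the corpus), so only the technique carries over.

```r
# Toy word counts, sorted from most to least frequent (illustrative values only).
toy.freq <- sort(c(the = 50, and = 30, that = 10, garden = 5, recipe = 3, quilt = 2),
                 decreasing = TRUE)
# Share of all word occurrences covered by the top 1, 2, 3, ... words.
coverage <- cumsum(toy.freq) / sum(toy.freq)
# Number of distinct words needed to cover at least 75% of all occurrences.
n75 <- unname(which(coverage >= 0.75)[1])
n75
```

On the real sample, the same calculation could start from the per-term totals of MyDtmB in place of toy.freq.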

findAssocs(MyDtmB, "the", 0.2)

## $the
##   and  that   for   was  with   but  from   had   not   one  they  this 
##  0.62  0.45  0.41  0.41  0.39  0.36  0.36  0.31  0.31  0.31  0.31  0.31 
##   are  were   all there   has which   who about   its  more  have   out 
##  0.30  0.30  0.28  0.27  0.26  0.26  0.26  0.25  0.25  0.25  0.24  0.24 
##   two  when would first  only  some their  even   his  than  time  also 
##  0.24  0.24  0.24  0.23  0.23  0.23  0.23  0.22  0.22  0.22  0.22  0.21 
##  been   get  into  just  like  most other  them could 
##  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.20
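
The scores returned by findAssocs() are correlations between the per-document counts of the chosen term and those of every other term, filtered by the lower limit (0.2 here). The sketch below reproduces that idea in base R on a made-up term-document matrix. toy.tdm and its counts are illustrative assumptions, not data from the corpus.

```r
# Toy term-document matrix: rows are terms, columns are documents (made-up counts).
toy.tdm <- matrix(c(4, 2, 0, 6,
                    3, 1, 0, 5,
                    0, 1, 2, 0),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("the", "and", "cat"), paste0("doc", 1:4)))
# Pearson correlation between the counts of "the" and those of every other term.
assoc <- cor(t(toy.tdm))["the", ]
assoc <- assoc[names(assoc) != "the"]
# Keep only associations at or above the limit, as findAssocs() does.
round(assoc[assoc >= 0.2], 2)
```

Here "and" is strongly associated with "the", while "cat" has a negative correlation and is filtered out. Because the counts are raw, longer documents tend to contain more of every common word, which may partly explain why so many function words appear in the list above.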

Feedback for the predictive model

The brief analysis above suggests that a predictive model can be built from the observed patterns of language use to predict word associations. Because common words account for most of the frequency, the model should be applied in context, as a mixed model: the surrounding words are what identify the topic the user wants to write about in a text message.
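
The report does not specify the form of the mixed model. One possible reading, which is an assumption here, is a predictor that uses the previous word when it has been seen before and otherwise falls back to the most common word overall. The sketch below illustrates that idea on a hypothetical toy corpus. The names toy.text and predict_next are introduced only for this example.

```r
# Hypothetical toy corpus; a real model would be trained on the cleaned samples.
toy.text <- c("i want to go home", "i want to eat", "you have to go out")
tokens <- strsplit(toy.text, " ", fixed = TRUE)

# Build all (previous word, next word) pairs within each line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))
# Overall word counts, used when the previous word gives no information.
unigram <- sort(table(unlist(tokens)), decreasing = TRUE)

# Predict the most frequent follower of `word`; fall back to the most common
# word overall when `word` was never seen as a predecessor.
predict_next <- function(word) {
  followers <- pairs$nxt[pairs$prev == word]
  if (length(followers) == 0) return(names(unigram)[1])
  names(sort(table(followers), decreasing = TRUE))[1]
}

predict_next("to")
predict_next("zebra")
```

In this toy corpus the first call returns "go", the most frequent follower of "to", and the second returns "to", the most common word overall, because "zebra" never appears.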