This report presents some findings from the exploratory data analysis, carried out with a view to implementing a mixed model of word association. Several tables and plots are used to indicate the way forward.
This project uses the files named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU and fi_FI; the analysis below reads the en_US blogs, news and Twitter files. The data come from a corpus called HC Corpora (www.corpora.heliohost.org).
setwd("~/Documents/Cursos/Coursera/Data Science Specialization/Capstone project/final/en_US")
Blogs <- readLines("en_US.blogs.txt")
News <- readLines("en_US.news.txt")
Twitter <- readLines("en_US.twitter.txt")
library(stringi)
## Warning: package 'stringi' was built under R version 3.2.5
lis <- list(Blogs,News,Twitter)
SummData <- data.frame(t(sapply(X = lis, FUN = stri_stats_general)), row.names = c("Blogs", "News","Twitter"))
SummData$Words<-sapply(X = lis, FUN = function(x) sum(stri_count_words(x)))
SummData
##           Lines LinesNEmpty     Chars CharsNWhite    Words
## Blogs    899288      899288 206824382   170389539 37546246
## News    1010242     1010242 203223154   169860866 34762395
## Twitter 2360148     2360148 162096031   134082634 30093369
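It is noteworthy that the blogs and news files are very similar in structure: each holds a little over 200 million characters and roughly 35 to 38 million words. Twitter, by contrast, reflecting its nature as a social network, differs somewhat from the other two: it has more than twice as many lines but fewer characters and words in total.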
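The contrast becomes clearer when the totals are divided by the number of lines. The sketch below re-enters the figures from the summary table by hand so that it runs on its own; the object `AvgData` and the columns `CharsPerLine` and `WordsPerLine` are illustrative names, not part of the original analysis.

```r
# Totals copied from the summary table above; AvgData is a hypothetical helper object.
AvgData <- data.frame(
  Lines = c(899288, 1010242, 2360148),
  Chars = c(206824382, 203223154, 162096031),
  Words = c(37546246, 34762395, 30093369),
  row.names = c("Blogs", "News", "Twitter")
)

# Average length of a line, in characters and in words.
AvgData$CharsPerLine <- AvgData$Chars / AvgData$Lines
AvgData$WordsPerLine <- AvgData$Words / AvgData$Lines

round(AvgData[, c("CharsPerLine", "WordsPerLine")], 1)
```

In the full session the same ratios could presumably be taken straight from `SummData`, for example `SummData$Words / SummData$Lines`.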
With a view to applying a statistical language model to quantify the uncertainty in natural language, we next examine the distribution of words in the data, looking for possible patterns in how the texts are composed.
library(tm)
library(ggplot2)
set.seed(999)
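# Draw a 0.5% random sample of the blog lines; floor() makes the sample size a whole number
Sam <- floor(length(Blogs) * 0.005)
# Use a name that does not shadow tm's Corpus() function
BlogSample <- sample(Blogs, Sam)
MyCorpus <- Corpus(VectorSource(BlogSample))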
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
MyCorpus <- tm_map(MyCorpus, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, stripWhitespace)
MyCorpus <- tm_map(MyCorpus, removeNumbers)
MyCorpus <- tm_map(MyCorpus, content_transformer(removeURL))
MyCorpus <- tm_map(MyCorpus, content_transformer(removeNumPunct))
MyDtmB <- TermDocumentMatrix(MyCorpus, control = list(removePunctuation = TRUE,
                                                      removeNumbers = TRUE))
freq.terms <- findFreqTerms(MyDtmB, lowfreq = 1000)
term.freq <- rowSums(as.matrix(MyDtmB))
term.freq <- subset(term.freq, term.freq >= 1000)
df <- data.frame(term = names(term.freq), freq = term.freq)
ggplot(df, aes(x = term, y = freq)) + geom_bar(stat = "identity") +
  xlab("Terms") + ylab("Count") + coord_flip() +
  labs(title = "Most frequent words in sample Blogs")
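To gauge how heavily a few frequent terms dominate a sample, one option is to accumulate each term's share of all word occurrences. This is a minimal sketch on an invented frequency vector (`toy.freq` stands in for `term.freq`, and the 75% threshold is an arbitrary choice for illustration), not a result from the blog sample.

```r
# Toy frequency vector standing in for term.freq; the counts are invented.
toy.freq <- c(the = 50, and = 30, that = 10, model = 5, corpus = 3, entropy = 2)

# Sort from most to least frequent and accumulate each term's share of all occurrences.
sorted.freq <- sort(toy.freq, decreasing = TRUE)
coverage <- cumsum(sorted.freq) / sum(sorted.freq)

# Number of distinct terms needed to cover at least 75% of all word occurrences.
n.terms.75 <- which(coverage >= 0.75)[1]

coverage
n.terms.75
```

Applied to the real counts, a small value of `n.terms.75` relative to the vocabulary size would indicate that a few common words account for most of the text.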
The frequency distribution shows that the sample consists mostly of very common words, which suggests that the predictive model will need to take the context of use into account. Finally, we show how some of these words are associated with one another in one of the data sets of interest.
library(tm)
## Loading required package: NLP
findAssocs(MyDtmB, "the", 0.2)
## $the
##   and  that   for   was  with   but  from   had   not   one  they  this
##  0.62  0.45  0.41  0.41  0.39  0.36  0.36  0.31  0.31  0.31  0.31  0.31
##   are  were   all there   has which   who about   its  more  have   out
##  0.30  0.30  0.28  0.27  0.26  0.26  0.26  0.25  0.25  0.25  0.24  0.24
##   two  when would first  only  some their  even   his  than  time  also
##  0.24  0.24  0.24  0.23  0.23  0.23  0.23  0.22  0.22  0.22  0.22  0.21
##  been   get  into  just  like  most other  them could
##  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.20
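The values returned by `findAssocs` are correlations between the per-document counts of the chosen term and those of every other term, filtered by the lower limit (0.2 here). The following base-R sketch reproduces that idea on a tiny invented term-document matrix; `toy.tdm` and its counts are assumptions made for illustration and do not come from the corpus.

```r
# Toy term-document matrix: rows are terms, columns are documents (counts invented).
toy.tdm <- matrix(
  c(3, 1, 0, 2,   # the
    2, 1, 0, 2,   # and
    0, 1, 3, 0),  # cat
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "and", "cat"), paste0("doc", 1:4))
)

# Correlate the counts of "the" with those of every other term across documents.
target <- toy.tdm["the", ]
others <- toy.tdm[rownames(toy.tdm) != "the", , drop = FALSE]
assoc <- apply(others, 1, function(counts) cor(target, counts))

# Keep terms at or above the limit, strongest first, rounded to two digits.
toy.assocs <- round(sort(assoc[assoc >= 0.2], decreasing = TRUE), 2)
toy.assocs
```

Read this way, the 0.62 reported for "and" means that documents using "the" more often also tend to use "and" more often, which is expected of function words and says little about topic.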
From the brief exploration above, it appears feasible to build a predictive model of word association based on how the language is actually used. Such a model can be applied in context as a mixed model: common words account for most of the frequency, so the model also needs to capture the topic the writer has in mind when composing a text message.
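The report does not specify what form the mixed model would take. One possible reading, offered here purely as an assumption, is a linear interpolation of bigram and unigram probabilities, so that the previous word supplies context while overall word frequency acts as a fallback. The sketch below uses an invented toy sentence; `predict_next`, `vocab` and the weight `lambda = 0.7` are hypothetical choices, not tuned or taken from the analysis.

```r
# Toy training text (invented); a real model would be trained on the cleaned corpus.
tokens <- strsplit("the cat sat on the mat and the cat ate the fish", " ")[[1]]
vocab <- sort(unique(tokens))

# Unigram counts and bigram counts over a shared vocabulary.
unigram <- table(factor(tokens, levels = vocab))
bigram <- table(
  prev = factor(head(tokens, -1), levels = vocab),  # previous word
  nxt  = factor(tail(tokens, -1), levels = vocab)   # word that follows it
)

# Score each candidate as lambda * P(next | previous) + (1 - lambda) * P(next).
# lambda = 0.7 is an arbitrary illustrative weight.
predict_next <- function(previous, lambda = 0.7) {
  p.uni <- as.numeric(unigram) / sum(unigram)
  seen <- previous %in% vocab && sum(bigram[previous, ]) > 0
  if (seen) {
    p.bi <- as.numeric(bigram[previous, ]) / sum(bigram[previous, ])
    score <- lambda * p.bi + (1 - lambda) * p.uni
  } else {
    score <- p.uni  # unseen context: fall back to overall word frequency
  }
  vocab[which.max(score)]
}

predict_next("the")
predict_next("zebra")
```

With a known context the bigram term dominates the score, while an unseen context falls back to the most frequent word overall, which is where the dominance of common words noted above would show up.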