This report presents findings from an exploratory data analysis carried out with a view to building a mixed word-association model. Tables and plots summarize the data and indicate the way forward.
The project uses files named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU and fi_FI. The analysis below reads the en_US blogs, news and Twitter files. The data come from a corpus called HC Corpora (www.corpora.heliohost.org).
# Download the data set once; skip the download if the zip file already exists
ZipFile <- "/home/kevin/Documentos/final/Coursera-SwiftKey.zip"
if (!file.exists(ZipFile)) {
  Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(Url, destfile = ZipFile, method = "curl")
}
After downloading the files, we read them into memory for processing.
setwd("/home/kevin/Documentos/final/final")
Blogs <- readLines("./en_US/en_US.blogs.txt")
News <- readLines("./en_US/en_US.news.txt")
Twitter <- readLines("./en_US/en_US.twitter.txt")
library(stringi)
# Line and character counts per file (stri_stats_general), plus total word counts
lis <- list(Blogs, News, Twitter)
SummData <- data.frame(t(sapply(X = lis, FUN = stri_stats_general)), row.names = c("Blogs", "News", "Twitter"))
SummData$Words <- sapply(X = lis, FUN = function(x) sum(stri_count_words(x)))
SummData
##           Lines LinesNEmpty     Chars CharsNWhite    Words
## Blogs    899288      899288 206824382   170389539 37541795
## News    1010242     1010242 203223154   169860866 34762303
## Twitter 2360148     2360148 162096031   134082634 30092866
The blogs and news files have a very similar structure. Twitter, given its nature as a social network, differs from both: it has more than twice as many lines but fewer characters overall, so its entries are much shorter (about 69 characters per line, against about 230 for blogs and 201 for news).
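The difference between the three sources is easier to see per line. The following is a minimal sketch that recomputes average line lengths from the figures in the summary table above; the data frame name `summ` and the two derived column names are chosen here for illustration and are not part of the original analysis.

```r
# Figures copied from the summary table above (SummData).
summ <- data.frame(
  Lines = c(899288, 1010242, 2360148),
  Chars = c(206824382, 203223154, 162096031),
  Words = c(37541795, 34762303, 30092866),
  row.names = c("Blogs", "News", "Twitter")
)

# Average length of one line (one blog post, news item or tweet).
summ$CharsPerLine <- round(summ$Chars / summ$Lines, 1)
summ$WordsPerLine <- round(summ$Words / summ$Lines, 1)

summ[, c("CharsPerLine", "WordsPerLine")]
```

The same two columns could be added directly to `SummData` once it has been computed.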
With a view to applying a statistical language model that quantifies the uncertainty in natural language, we examine the distribution of words in the data to look for possible patterns in how the texts are composed.
library(tm)
set.seed(999)
# Draw a random 0.5% sample of the blog lines
SampleSize <- floor(length(Blogs) * 0.005)
BlogSample <- sample(Blogs, SampleSize)
MyCorpus <- Corpus(VectorSource(BlogSample))
# Clean the text: drop non-ASCII characters, lowercase, remove punctuation and numbers, collapse whitespace
MyCorpus <- tm_map(MyCorpus, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, removeNumbers)
MyCorpus <- tm_map(MyCorpus, stripWhitespace)
# Term-document matrix: terms in rows, documents in columns
MyTdm <- TermDocumentMatrix(MyCorpus, control = list(removePunctuation = TRUE,
                                                     removeNumbers = TRUE))
# Total frequency of each term across the sample, most frequent first
m <- as.matrix(MyTdm)
v <- sort(rowSums(m), decreasing = TRUE)
barplot(head(v, 10), main = "Most frequent words in sample Blogs")
The frequency distribution shows that the sample is dominated by very common words (stop words were not removed during cleaning). This suggests that the predictive model will need to take the context in which words are used into account.
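One way to quantify how strongly common words dominate is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below assumes a named vector of word counts such as `v`; the helper `words_for_coverage` and the toy counts are hypothetical and only illustrate the calculation.

```r
# Hypothetical helper: number of distinct words needed to cover a given
# share of all word occurrences, given a named vector of word counts.
words_for_coverage <- function(freq, share) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  unname(which(cum_share >= share)[1])
}

# Toy counts, invented for illustration only (not taken from the corpus).
toy <- c(the = 50, and = 25, to = 15, model = 5, corpus = 3, twitter = 2)

words_for_coverage(toy, 0.6)   # distinct words needed for 60% of occurrences
words_for_coverage(toy, 0.8)   # distinct words needed for 80% of occurrences
```

On the sampled blogs, the call would be `words_for_coverage(v, 0.5)`. A small result relative to `length(v)` would confirm that a few words account for most of the text.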
From this brief description, it appears feasible to build a predictive model that combines language recognition with the prediction of word associations. Such a model can be applied in context as a mixed model: common words account for most of the observed frequency, so the model also needs to capture the topic the user wants to talk about when writing a text message.
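As a rough illustration of word association, the sketch below counts adjacent word pairs (bigrams) in base R and suggests the most frequent follower of a given word. It is not the model proposed in this report. The functions `tokenize`, `count_bigrams` and `predict_next` and the three toy sentences are assumptions made for this example.

```r
# Split lowercase text into words (letters and apostrophes only).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[words != ""]
}

# Count adjacent word pairs within each line; returns a data frame
# with columns first, second and n, most frequent pairs first.
count_bigrams <- function(lines) {
  pairs <- do.call(rbind, lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  }))
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
  counts[order(-counts$n), ]
}

# Most frequent follower of `word`, or NA if the word was never seen.
predict_next <- function(bigrams, word) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$n)]
}

# Toy input, invented for illustration only.
toy_lines <- c("I want to go home", "I want to sleep", "I want coffee")
bigrams <- count_bigrams(toy_lines)
predict_next(bigrams, "want")
```

A mixed model along the lines described above would presumably combine counts like these with longer contexts and fall back to single-word frequencies for unseen words. That design is left open here.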