Data Science Capstone: Milestone Report

Synopsis

This report presents findings from the exploratory data analysis, carried out with a view to implementing a mixed word-association model. Tables and plots summarize the data and indicate the way forward.

Loading the training data

This project uses files named LOCALE.blogs.txt, where LOCALE is one of the four locales en_US, de_DE, ru_RU, and fi_FI. The analysis below works with the en_US files. The data comes from a corpus called HC Corpora (www.corpora.heliohost.org).

# Download the dataset only if the archive is not already present.
ZipFile <- "/home/kevin/Documentos/final/Coursera-SwiftKey.zip"
if (!file.exists(ZipFile)) {
        Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
        download.file(Url, destfile = ZipFile, method = "curl")
}

After downloading the files, we read them into memory for processing.

setwd("/home/kevin/Documentos/final/final")
Blogs <- readLines("./en_US/en_US.blogs.txt")
News <- readLines("./en_US/en_US.news.txt")
Twitter <- readLines("./en_US/en_US.twitter.txt")

Summary of the training data

library(stringi)
lis <- list(Blogs,News,Twitter)
SummData <- data.frame(t(sapply(X = lis, FUN = stri_stats_general)), row.names = c("Blogs", "News","Twitter"))
SummData$Words<-sapply(X = lis, FUN = function(x) sum(stri_count_words(x)))

SummData

##           Lines LinesNEmpty     Chars CharsNWhite    Words
## Blogs    899288      899288 206824382   170389539 37541795
## News    1010242     1010242 203223154   169860866 34762303
## Twitter 2360148     2360148 162096031   134082634 30092866

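The blogs and news files have a very similar structure: between roughly 900,000 and 1,000,000 lines and about 35 million words each. Twitter, given its nature as a social network, differs from the other two: it has more than twice as many lines but fewer characters and words in total, so its entries are much shorter.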
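The difference in entry length can be made explicit by dividing the totals in the table above by the number of lines. The following is a minimal sketch that uses the values copied from that output; the names Totals and PerLine are illustrative and not part of the original analysis.

```r
# Totals copied from the SummData output above.
Totals <- data.frame(
  Lines = c(899288, 1010242, 2360148),
  Chars = c(206824382, 203223154, 162096031),
  Words = c(37541795, 34762303, 30092866),
  row.names = c("Blogs", "News", "Twitter")
)

# Average characters and words per line for each source.
PerLine <- data.frame(
  CharsPerLine = Totals$Chars / Totals$Lines,
  WordsPerLine = Totals$Words / Totals$Lines,
  row.names = rownames(Totals)
)

print(round(PerLine, 1))
```

With these figures, blog and news entries average roughly 42 and 34 words per line, against about 13 for Twitter.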

Patterns from text data

Based on the idea of applying a statistical language model to quantify the uncertainty in natural language, we examine the distribution of words in the data to look for possible patterns in how the texts are composed.

library(tm)

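set.seed(999)

# Draw a random 0.5% sample of the blog lines.
# The sample gets its own name so that it does not mask tm's Corpus() function.
SampleSize <- floor(length(Blogs) * 0.005)
BlogSample <- sample(Blogs, SampleSize)
MyCorpus <- Corpus(VectorSource(BlogSample))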

MyCorpus <- tm_map(MyCorpus, content_transformer(iconv), from = "latin1", to = "ASCII", sub = "")
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
MyCorpus <- tm_map(MyCorpus, removePunctuation)
MyCorpus <- tm_map(MyCorpus, stripWhitespace)
MyCorpus <- tm_map(MyCorpus, removeNumbers)


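# Build a term-document matrix (terms in rows, documents in columns).
MyTdm <- TermDocumentMatrix(MyCorpus, control = list(removePunctuation = TRUE,
                                                     removeNumbers = TRUE))
# Total frequency of each term across the sample, most frequent first.
m <- as.matrix(MyTdm)
v <- sort(rowSums(m), decreasing = TRUE)
barplot(head(v, 10), main = "Most frequent words in sample Blogs")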

The frequency distribution shows that the sample is dominated by very common words, which suggests that the predictive model must take the context of use into account.
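One simple way to capture context is to count pairs of consecutive words (bigrams) instead of single words. The sketch below uses base R only and a few toy sentences that stand in for the sampled blog lines; the data and the helper MakeBigrams are hypothetical and not part of the original analysis.

```r
# Toy sentences standing in for the sampled blog lines (hypothetical data).
Lines <- c("i want to go to the park",
           "i want to eat",
           "we want to go home")

# Split each line into lower-case words.
Tokens <- strsplit(tolower(Lines), "\\s+")

# Build bigrams within each line (pairs of consecutive words).
MakeBigrams <- function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
Bigrams <- unlist(lapply(Tokens, MakeBigrams))

# Frequency table of bigrams, most frequent first.
BigramFreq <- sort(table(Bigrams), decreasing = TRUE)
print(head(BigramFreq, 3))

# Observed continuations of the word "to", with their counts.
AfterTo <- BigramFreq[startsWith(names(BigramFreq), "to ")]
print(AfterTo)
```

A common word such as "to" says little on its own, but the counts of what follows it give a first hint of which word to suggest next.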

Feedback for the predictive model

Based on the brief description above, it is possible to build a predictive model that draws on observed language use to predict word associations. The model can be applied in context and follow a mixed approach: common words account for most of the frequency, so it is also important to capture the topic the user wants to talk about when writing a text message.
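The report does not specify what form the mixed model will take. One possible reading, offered here only as an assumption, is a linear interpolation of a unigram estimate (overall word frequency) and a bigram estimate (frequency given the previous word). The sketch below illustrates that idea on a toy word sequence; the function InterpProb, the weight Lambda, and the data are all hypothetical.

```r
# Toy word sequence (hypothetical data, not the report's sample).
Words <- c("i", "want", "to", "go", "to", "the", "park")

# Unigram relative frequencies.
Unigram <- table(Words) / length(Words)

# Bigram counts: rows are the previous word, columns the next word.
Prev <- head(Words, -1)
Next <- tail(Words, -1)
BigramCount <- table(Prev, Next)

# Interpolated probability that `nxt` follows `prv`.
# Lambda weights the bigram estimate; (1 - Lambda) weights the unigram one.
InterpProb <- function(prv, nxt, Lambda = 0.7) {
  PUni <- if (nxt %in% names(Unigram)) as.numeric(Unigram[nxt]) else 0
  PBi <- 0
  if (prv %in% rownames(BigramCount) && nxt %in% colnames(BigramCount)) {
    RowTotal <- sum(BigramCount[prv, ])
    if (RowTotal > 0) PBi <- BigramCount[prv, nxt] / RowTotal
  }
  Lambda * PBi + (1 - Lambda) * PUni
}

print(InterpProb("to", "go"))
```

The unigram term keeps the estimate above zero for word pairs never seen together, while the bigram term lets the previous word shift the prediction away from the globally most common words.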