Introduction

This report performs an exploratory scan and text analysis of the contents of the following files:
• en_US.blogs.txt
• en_US.news.txt
• en_US.twitter.txt

To do this work, we will use the following workflow:
DATA COLLECTED -> CLEAN DATASET -> EXPLORATORY ANALYSIS -> MODEL & ALGORITHMS -> DATA PRODUCT

  1. We configure our working environment, download the data, and load the required libraries. We then read the files with readLines and will later build the corpus with the “VectorSource” method of the “tm” package.
# Set the working directory
setwd("/home/pc/tmp")

# Download and unpack the dataset only if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
  url.file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  file.zip <- "Coursera-SwiftKey.zip"
  download.file(url.file, file.zip)
  unzip(file.zip)
}

library("tm")
## Loading required package: NLP
library("SnowballC")
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")

# Read all lines of a text file, skipping embedded nulls
load.data.file <- function(arg) {
  con <- file(arg, open = "r")
  line <- readLines(con, skipNul = TRUE)
  close(con)
  line
}

# Return a random sample containing the given fraction (valor) of the lines
porcentaje.data.file <- function(arg, valor) {
  sample(arg, size = length(arg) * valor)
}
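Because sample() draws at random, each run produces a different subset. A minimal sketch, assuming we want the 1% samples below to be reproducible: fix a seed first.

set.seed(1234)  # any fixed seed makes the samples below reproducible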

file1 <- load.data.file("final/en_US/en_US.blogs.txt")
file2 <- load.data.file("final/en_US/en_US.news.txt")
file3 <- load.data.file("final/en_US/en_US.twitter.txt")
# Work with a 1% sample of each file to keep processing manageable
file1.percent <- porcentaje.data.file(file1, 0.01)
file2.percent <- porcentaje.data.file(file2, 0.01)
file3.percent <- porcentaje.data.file(file3, 0.01)
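As a sanity check on the load, we can compute basic statistics over the full files. A minimal sketch (the exact counts depend on the dataset version):

data.frame(file  = c("blogs", "news", "twitter"),
           lines = c(length(file1), length(file2), length(file3)),
           words = c(sum(lengths(strsplit(file1, "\\s+"))),
                     sum(lengths(strsplit(file2, "\\s+"))),
                     sum(lengths(strsplit(file3, "\\s+")))))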
  2. We clean the data: converting to lower case and removing punctuation, numbers, and common English stopwords, then stemming and stripping extra whitespace.
files.percent <- VectorSource(c(file1.percent, file2.percent, file3.percent))
corpus.tmp <- Corpus(files.percent)
# Note: recent versions of tm require plain functions such as tolower to be
# wrapped in content_transformer() so the corpus structure is preserved
corpus <- tm_map(corpus.tmp, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)

Note: a corpus is an abstract representation of a collection of text documents; here it holds the sampled lines from the three files.
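To confirm the transformations took effect, we can compare the start of a raw document with its cleaned counterpart (a quick sketch):

substr(as.character(corpus.tmp[[1]]), 1, 80)  # raw text
substr(as.character(corpus[[1]]), 1, 80)      # cleaned text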

  3. We perform data exploration.
doc.tmp1 <- DocumentTermMatrix(corpus)

We identify the most frequent words: terms that appear at least 20 times in the sample. Note that findFreqTerms returns the terms themselves, in alphabetical order.

freq <- findFreqTerms(doc.tmp1, lowfreq=20)
order.tmp <- order(freq)

freq[head(order.tmp)]
## [1] "aaron"   "abandon" "abc"     "abil"    "abl"     "absenc"
freq[tail(order.tmp)]
## [1] "yummi"     "yup"       "zimmerman" "zombi"     "zone"      "zoo"
head(table(freq), 15)
## freq
##    aaron  abandon      abc     abil      abl   absenc  absolut     abus 
##        1        1        1        1        1        1        1        1 
##   academ  academi   accent   accept   access    accid accident 
##        1        1        1        1        1        1        1
tail(table(freq), 15)
## freq
##     youll     young   younger      your     youth    youtub      youv 
##         1         1         1         1         1         1         1 
##       yrs       yum     yummi       yup zimmerman     zombi      zone 
##         1         1         1         1         1         1         1 
##       zoo 
##         1
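Because table(freq) tabulates the list of term names, every count above is 1; the actual occurrence counts live in the term-document matrix. A minimal sketch that ranks the stems by true frequency and draws a word cloud with the already-loaded wordcloud and RColorBrewer packages:

term.freq <- sort(colSums(as.matrix(doc.tmp1)), decreasing = TRUE)
head(term.freq, 10)  # most frequent stems in the sample
set.seed(1234)
wordcloud(names(term.freq), term.freq, max.words = 100,
          colors = brewer.pal(8, "Dark2"))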
  4. We develop a model that analyzes which words are related to others: from the most frequent words we obtain sequences of words (n-grams) that allow us to “predict” the next word. Three packages can help us solve this problem (see the sketch after this list):
    • RWeka
    • tidytext
    • tm
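As an illustration of the n-gram approach, a minimal sketch with RWeka (assuming RWeka and its Java dependency are installed): a bigram tokenizer plugged into tm's TermDocumentMatrix.

library("RWeka")
bigram.tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = bigram.tokenizer))
findFreqTerms(tdm.bigram, lowfreq = 10)  # frequent two-word sequences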

  5. We develop a web app that uses the predictive model: as the user types a word, the system processes it and returns the most probable next word given its antecedents.
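A minimal sketch of how such a prediction could work, assuming a hypothetical named vector bigram.freq of bigram counts (e.g. built from tdm.bigram above): look up the bigrams that start with the typed word and return the most frequent continuations.

predict.next.word <- function(word, bigram.freq, n = 3) {
  # keep only bigrams whose first token is the typed word
  prefix <- paste0(word, " ")
  candidates <- bigram.freq[startsWith(names(bigram.freq), prefix)]
  if (length(candidates) == 0) return(character(0))
  # return the second token of the n most frequent matches
  top <- head(sort(candidates, decreasing = TRUE), n)
  substring(names(top), nchar(prefix) + 1)
}

predict.next.word("happi", bigram.freq)  # stems, since the corpus was stemmed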