This report performs an exploratory text analysis of the contents of the following files:
• en_US.blogs.txt
• en_US.news.txt
• en_US.twitter.txt
To do this work we will use the following workflow:
DATA COLLECTION -> CLEAN DATASET -> EXPLORATORY ANALYSIS -> MODEL & ALGORITHMS -> DATA PRODUCT
# Set the working directory and download/unzip the dataset if it is not already present
setwd("/home/pc/tmp")
if (!file.exists("Coursera-SwiftKey.zip")) {
  url.file <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  file.zip <- "Coursera-SwiftKey.zip"
  download.file(url.file, file.zip)
  unzip(file.zip)
}
library("tm")
## Loading required package: NLP
library("SnowballC")
library("wordcloud")
## Loading required package: RColorBrewer
library("RColorBrewer")
# Read every line of a text file, skipping embedded nulls
load.data.file <- function(arg) {
  con <- file(arg, open = "r")
  line <- readLines(con, skipNul = TRUE)
  close(con)
  line
}
# Return a random sample containing the given fraction of the lines
porcentaje.data.file <- function(arg, valor) {
  tmp <- sample(arg, size = floor(length(arg) * valor))
  tmp
}
# Load the full contents of each file
file1 <- load.data.file("final/en_US/en_US.blogs.txt")
file2 <- load.data.file("final/en_US/en_US.news.txt")
file3 <- load.data.file("final/en_US/en_US.twitter.txt")
# Keep a 1% random sample of each file to make processing tractable
file1.percent <- porcentaje.data.file(file1, 0.01)
file2.percent <- porcentaje.data.file(file2, 0.01)
file3.percent <- porcentaje.data.file(file3, 0.01)
# Combine the three samples into a single corpus
files.percent <- VectorSource(c(file1.percent, file2.percent, file3.percent))
corpus.tmp <- Corpus(files.percent)
# Clean the corpus: lower-case, remove numbers, punctuation and stop words, stem, trim whitespace
corpus <- tm_map(corpus.tmp, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
Note: a corpus is an abstract representation of a collection of text documents.
# Build the document-term matrix from the cleaned corpus
doc.tmp1 <- DocumentTermMatrix(corpus)
We now look for frequent terms. Note that findFreqTerms returns the terms that appear at least lowfreq times (here, 20), listed alphabetically; it reports the terms themselves, not their counts.
freq <- findFreqTerms(doc.tmp1, lowfreq=20)
order.tmp <- order(freq)
freq[head(order.tmp)]
## [1] "aaron" "abandon" "abc" "abil" "abl" "absenc"
freq[tail(order.tmp)]
## [1] "yummi" "yup" "zimmerman" "zombi" "zone" "zoo"
head(table(freq), 15)
## freq
## aaron abandon abc abil abl absenc absolut abus
## 1 1 1 1 1 1 1 1
## academ academi accent accept access accid accident
## 1 1 1 1 1 1 1
tail(table(freq), 15)
## freq
## youll young younger your youth youtub youv
## 1 1 1 1 1 1 1
## yrs yum yummi yup zimmerman zombi zone
## 1 1 1 1 1 1 1
## zoo
## 1
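Because freq is a character vector of distinct term names, table(freq) counts each name exactly once, which is why every value above is 1. The actual counts live in the document-term matrix; below is a minimal sketch of extracting them with colSums (assuming the dense matrix for this 1% sample fits in memory) and drawing a word cloud with the wordcloud and RColorBrewer packages loaded earlier.
# Sum each column of the document-term matrix to get per-term counts
freq.counts <- sort(colSums(as.matrix(doc.tmp1)), decreasing = TRUE)
head(freq.counts, 10)
# Word cloud of the most frequent stems
wordcloud(names(freq.counts), freq.counts, max.words = 100,
          colors = brewer.pal(8, "Dark2"))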
Next we will develop a model to analyze which words are related to others: using the most frequent words, we build word sequences (n-grams) that allow us to “predict” the next word (see the sketch after this list). There are three packages that can help us solve the problem:
• RWeka
• tidytext
• tm
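As an illustration, here is a minimal sketch of bigram extraction with RWeka's NGramTokenizer, applied to the blog sample; the min/max values and the choice of file1.percent are assumptions for this example, not part of the analysis above.
library("RWeka")
# Extract all two-word sequences (bigrams) from the sampled blog text
bigrams <- NGramTokenizer(file1.percent, Weka_control(min = 2, max = 2))
# The most common bigrams hint at which word tends to follow which
head(sort(table(bigrams), decreasing = TRUE), 10)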
Finally, we will develop a web app that uses the predictive model: as the user types a word, the system processes the input and returns the word most likely to follow, based on its antecedents.
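To make the idea concrete, here is a hedged sketch of the lookup step, built on the bigram table from the previous example; the predict.next helper is hypothetical and stands in for the real model.
# Hypothetical helper: given the last typed word, return the most frequent follower
predict.next <- function(word, bigram.table) {
  candidates <- bigram.table[grep(paste0("^", word, " "), names(bigram.table))]
  if (length(candidates) == 0) return(NA)
  # Take the second token of the most frequent matching bigram
  strsplit(names(which.max(candidates)), " ")[[1]][2]
}
bigram.table <- table(bigrams)
predict.next("look", bigram.table)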