The objective of this project is to present the work done with the data on the way to building a word-prediction algorithm.
The first step in any task of this type is to analyze what the raw data (or corpus of documents) contains and how large it is, so that the available computational capacity can be taken into account when choosing a sampling fraction. The next step is to clean the data, removing useless information. Finally, an Exploratory Data Analysis is performed, looking for patterns between words, which will be the basis for the prediction algorithm.
suppressWarnings(library(stringi))
suppressWarnings(library(knitr))
suppressWarnings(library(data.table))
suppressWarnings(library(dplyr))
suppressWarnings(library(tm))
suppressWarnings(library(NLP))
suppressWarnings(library(RWeka))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringr))
suppressWarnings(library(wordcloud))
suppressWarnings(library(kableExtra))
Please set your own working directory first. The data sources to analyze are the files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt,
downloaded from the site: “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”
Additionally, a list of bad words is obtained, to filter profanity out of the data: “https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en”.
Finally, dictionaries were downloaded for Spanish, German, Danish, Dutch, French, Italian, Norwegian and Swiss, in order to remove foreign-language words. The dictionaries were downloaded from: “http://www.gwicks.net/dictionaries.htm”
# data will be downloaded into the "data" subdirectory
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destifile <- "./data/Coursera-SwiftKey.zip"
subdir <- "data"
# create the "data" subdirectory if it does not exist
if (!dir.exists(subdir)) {
  dir.create(file.path(getwd(), "data"))
}
# download the data file if it is not already present
if (!file.exists(destifile)) {
  download.file(fileURL, destfile = destifile, method = "auto")
}
# unzip the data file if the text files are not already extracted
unzipped.txt <- c("./data/final/en_US/en_US.blogs.txt",
                  "./data/final/en_US/en_US.news.txt",
                  "./data/final/en_US/en_US.twitter.txt")
zipF <- "./data/Coursera-SwiftKey.zip"
if (any(!file.exists(unzipped.txt))) {
  unzip(zipF, exdir = "./data")
}
if(!file.exists("profanity.txt")) {
download.file("https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",destfile="profanity.txt",method="curl")
}
profanity <-readLines("profanity.txt")
urlspanish <- "http://www.gwicks.net/textlists/espanol.zip"
urlgerman <- "http://www.gwicks.net/textlists/deutsch.zip"
urldanish <- "http://www.gwicks.net/textlists/dansk.zip"
urldutch <- "http://www.gwicks.net/textlists/nederlands2.zip"
urlfrench <- "http://www.gwicks.net/textlists/francais.zip"
urlitalian <- "http://www.gwicks.net/textlists/italiano.zip"
urlnorwegian <- "http://www.gwicks.net/textlists/norsk.zip"
urlswiss <-"http://www.gwicks.net/textlists/swiss.zip"
urllanguages <- c(urlspanish, urlgerman, urldanish, urldutch, urlfrench, urlitalian,
                  urlnorwegian, urlswiss)
destifile1 <- c("./language/spanish.zip", "./language/german.zip", "./language/danish.zip",
"./language/dutch.zip" , "./language/french.zip" , "./language/italian.zip",
"./language/norwegian.zip", "./language/swiss.zip")
subdir1 <- "./language"
# create the "language" subdirectory if it does not exist
if (!dir.exists(subdir1)) {
  dir.create(file.path(getwd(), "language"))
}
# download each language dictionary if it is not already present
for (i in 1:8) {
  if (!file.exists(destifile1[i])) {
    download.file(urllanguages[i], destfile = destifile1[i], method = "auto")
  }
}
# unzip each language dictionary if it is not already extracted
lang.dict <- c("./language/dansk.txt", "./language/nederlands2.txt", "./language/francais.txt",
               "./language/deutsch.txt", "./language/italiano.txt", "./language/norsk.txt",
               "./language/espanol.txt", "./language/swiss.txt")
for (i in 1:8) {
  if (!file.exists(lang.dict[i])) {
    unzip(destifile1[i], exdir = "./language")
  }
}
twitter<-readLines("./data/final/en_US/en_US.twitter.txt",warn=FALSE,encoding="UTF-8")
blogs<-readLines("./data/final/en_US/en_US.blogs.txt",warn=FALSE,encoding="UTF-8")
news<-readLines("./data/final/en_US/en_US.news.txt",warn=FALSE,encoding="UTF-8")
words.in.twitter <-stri_stats_latex(twitter)[4]
words.in.blogs <-stri_stats_latex(blogs)[4]
words.in.news <-stri_stats_latex(news)[4]
chars.in.twitter<-sum(nchar(twitter))
chars.in.blogs<-sum(nchar(blogs))
chars.in.news<-sum(nchar(news))
counts <- data.table("File" = c("twitter", "blogs", "news"),
"Lines" = c(length(twitter),length(blogs), length(news)),
"Words" = c(words.in.twitter, words.in.blogs, words.in.news),
"Characters"=c(chars.in.twitter,chars.in.blogs,chars.in.news))
kable(counts, format = "html", caption = "Counts in data set") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| File | Lines | Words | Characters |
|---|---|---|---|
| twitter | 2360148 | 30451128 | 162096031 |
| blogs | 899288 | 37570839 | 206824505 |
| news | 1010242 | 34494539 | 203223159 |
# Plot of lines
barplot(counts$Lines, main="Number of Lines",
xlab="data",
col=c("chocolate","chartreuse4", "cyan4"),
names.arg = c("twitter", "blogs", "news"),
log="y", beside=TRUE)
# Plot of words
barplot(counts$Words, main="Number of Words",
xlab="data",
col=c("chocolate","chartreuse4", "cyan4"),
names.arg = c("twitter", "blogs", "news"),
log="y", beside=TRUE)
# Plot of characters
barplot(counts$Characters, main="Number of Characters",
xlab="data",
col=c("chocolate","chartreuse4", "cyan4"),
names.arg = c("twitter", "blogs", "news"),
log="y", beside=TRUE)
A data sample is taken and the data are unified in a volatile corpus. First, we’ll take a sample of 2% of twitter, blogs and news to reduce computational time (from 18.5 GB to 414.5 MB in the corpora); the same operation also unifies the data. Corpora are collections of documents containing (natural language) text. A volatile corpus is kept fully in memory, so all changes only affect the corresponding R object, not the original data vector. A vector source interprets each element of the vector x as a document.
# taking a 2 % of data files and unifying:
data.sample <-c(sample(twitter,length(twitter)*0.02),
sample(blogs,length(blogs)*0.02),
sample(news,length(news)*0.02))
# creating volatil corpora
corpora <- VCorpus(VectorSource(data.sample))
We define a function to.space to replace a useless pattern with a space. The following functions are used: content_transformer creates functions that modify the content of an R object; gsub searches for the pattern argument in each element of the character vector x and replaces it with the replacement argument; tm_map applies a transformation function (here to.space) to the corpora.
to.space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpora <- tm_map(corpora, to.space, "(f|ht)tp(s?)://(.*)[.][a-z]+") # remove URLs
corpora <- tm_map(corpora, to.space, "[@][a-zA-Z0-9_]{1,15}") # remove Twitter usernames
corpora <- tm_map(corpora, to.space, "RT |via") # remove Twitter tags
corpora <- tm_map(corpora, to.space, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.+[a-zA-Z0-9_.+-]+") # remove email addresses
corpora <- tm_map(corpora, to.space, "[^[:alpha:]]") # remove all non-alphabetic characters
# removing regex meta-characters
metachar <- c("\\$", "\\*", "\\+", "\\.", "\\?", "\\[", "\\]", "\\^", "\\{", "\\}", "\\|", "\\(", "\\)", "\\\\")
for (j in 1:length(metachar)) {
  corpora <- tm_map(corpora, to.space, metachar[j])
}
# tm_map: apply a transformation function (2nd argument) to the corpora
corpora <- tm_map(corpora, tolower) # convert uppercase to lowercase
corpora <- tm_map(corpora, removeNumbers) # remove numbers
corpora <- tm_map(corpora, removeWords, stopwords("en")) # remove English stop words
corpora <- tm_map(corpora, removePunctuation) # remove punctuation marks
corpora <- tm_map(corpora, stripWhitespace) # collapse multiple whitespace characters into a single blank
corpora <- tm_map(corpora, PlainTextDocument) # create plain text documents (unformatted text)
corpora <- tm_map(corpora, removeWords, profanity) # remove profanity
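Note that the whole profanity list is passed to removeWords in a single call. If that list ever turned out to be too long for one call, it could be removed in chunks of 500 words, in the same way as the foreign dictionaries below; a minimal sketch, assuming the profanity vector loaded above:
# remove the profanity list in chunks of 500 words to keep each removeWords pattern small
for (piece in split(profanity, ceiling(seq_along(profanity) / 500))) {
  corpora <- tm_map(corpora, removeWords, piece)
}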
We’ll take a sample of 0.1% of the foreign words to reduce computational time during testing.
for (i in 1:8) {
  foreign <- readLines(lang.dict[i])
  foreign.sample <- sample(foreign, length(foreign) * 0.001)
  Encoding(foreign.sample) <- "latin1"
  # some dictionary entries contain alternatives separated by "\\" (meaning "or"); split them apart
  foreign.sample <- unlist(strsplit(foreign.sample, '\\\\', fixed = TRUE))
  # remove the foreign words in chunks of 500 to keep each removeWords pattern small
  chunk <- 500
  n <- length(foreign.sample)
  r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]
  d <- split(foreign.sample, r)
  for (j in 1:length(d)) {
    corpora <- tm_map(corpora, removeWords, c(paste(d[[j]])))
  }
}
final.corpora<-data.frame(text=unlist(sapply(corpora,'[',"content")),stringsAsFactors = FALSE)
kable(head(final.corpora), format = "html", caption = "Head of clean corpora") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
| text |
|---|
| fruitcake hot cocoa live band holiday karaoke tons holiday fun s merriment kick party come wisc m st pm |
| can can |
| every disciplined effort multiple reward jim rohn |
| blast movies daughter |
| old one used talk |
| chillin like |
set.seed(123)
unigram<-function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
unigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=unigram))
unigramcorpus<-findFreqTerms(unigramtdm,lowfreq=1000)
unigramcorpusnum<-rowSums(as.matrix(unigramtdm[unigramcorpus,]))
unigramcorpustab<-data.frame(Word=names(unigramcorpusnum),frequency=unigramcorpusnum)
# words and frequency
unigramcorpussort<-unigramcorpustab[order(-unigramcorpustab$frequency),]
ggplot(unigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
geom_bar(stat="identity",fill = I("blue"))+
labs(title="Unigrams",x="Top ten words",y="Frequency")+
theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )
wordcloud(words = unigramcorpussort$Word,
freq = unigramcorpussort$frequency,
max.words = 100,
colors = brewer.pal(6, 'Dark2'),
random.order = FALSE)
bigram<-function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
bigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=bigram))
bigramcorpus<-findFreqTerms(bigramtdm,lowfreq=80)
bigramcorpusnum<-rowSums(as.matrix(bigramtdm[bigramcorpus,]))
bigramcorpustab<-data.frame(Word=names(bigramcorpusnum),frequency=bigramcorpusnum)
# words and frequency
bigramcorpussort<-bigramcorpustab[order(-bigramcorpustab$frequency),]
ggplot(bigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
geom_bar(stat="identity",fill = I("blue"))+
labs(title="2-grams",x="Top ten 2-grams",y="Frequency")+
theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )
wordcloud(words = bigramcorpussort$Word,
freq = bigramcorpussort$frequency,
max.words = 100,
colors = brewer.pal(6, 'Dark2'),
random.order = FALSE)
trigram<-function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
trigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=trigram))
trigramcorpus<-findFreqTerms(trigramtdm,lowfreq=10)
trigramcorpusnum<-rowSums(as.matrix(trigramtdm[trigramcorpus,]))
trigramcorpustab<-data.frame(Word=names(trigramcorpusnum),frequency=trigramcorpusnum)
# words and frequency
trigramcorpussort<-trigramcorpustab[order(-trigramcorpustab$frequency),]
ggplot(trigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
geom_bar(stat="identity",fill = I("blue"))+
labs(title="3-grams",x="Top ten 3-grams",y="Frequency")+
theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )
wordcloud(words = trigramcorpussort$Word,
freq = trigramcorpussort$frequency,
max.words = 100,
colors = brewer.pal(6, 'Dark2'),
random.order = FALSE)
To determine how many unique words from a frequency-sorted dictionary are needed to cover a given percentage of all word instances, we’ll create a function that runs a loop until the desired word coverage is reached.
word.coverage <- function(x, coverage) {
  # x is the unigram output sorted by frequency; coverage is the desired fraction of word instances
  nwords <- 0                             # running count of word instances
  coverage <- coverage * sum(x$frequency) # number of word instances needed to reach the coverage
  for (i in 1:nrow(x)) {
    if (nwords >= coverage) { return(i) }
    nwords <- nwords + x$frequency[i]
  }
}
cover50 <- word.coverage(unigramcorpussort,0.5)
cover90 <- word.coverage(unigramcorpussort,0.9)
Words needed to cover 50%: 30. Words needed to cover 90%: 84.
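As a cross-check, essentially the same coverage figures can be obtained without a loop from the cumulative sum of the sorted frequencies; this is a minimal sketch, assuming the unigramcorpussort table built above:
# cumulative share of word instances covered by the top-ranked words
cum.coverage <- cumsum(unigramcorpussort$frequency) / sum(unigramcorpussort$frequency)
cover50.alt <- which(cum.coverage >= 0.5)[1] # first rank reaching 50% coverage
cover90.alt <- which(cum.coverage >= 0.9)[1] # first rank reaching 90% coverage
cat("Words needed to cover 50% :", cover50.alt, "\n")
cat("Words needed to cover 90% :", cover90.alt, "\n")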
This has already been done, using a 0.1% sample of the Spanish, German, Danish, Dutch, French, Italian, Norwegian and Swiss dictionaries and the tm_map function with removeWords applied to each language dictionary.
The number of unique low-frequency words could be reduced by substituting synonyms, including jargon.
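A minimal sketch of how such a substitution could be applied to the corpus, assuming a hypothetical synonym table (the word pairs below are purely illustrative, not taken from the data):
# hypothetical synonym map: low-frequency word or jargon -> more common equivalent
synonyms <- c("automobile" = "car", "cinema" = "movie", "lol" = "funny")
to.synonym <- content_transformer(function(x, from, to) gsub(paste0("\\b", from, "\\b"), to, x))
for (k in seq_along(synonyms)) {
  corpora <- tm_map(corpora, to.synonym, names(synonyms)[k], synonyms[k])
}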
Due to the need for computational capacity, we have taken a 2% sample of each file (Twitter, Blogs and News), which is probably not very representative of the whole population.
Adding the functionality to remove words from other languages significantly increased the computation time, but it is necessary: at first I found three Spanish words within the top ten 3-grams, which led me to include the removal of foreign words.
We have done a reasonable cleaning, but as the sampling fraction increases, other types of word removal may become necessary.
Carry out a sampling test, using an increasing sampling percentage and comparing the variations in the results. In this task we used 2% of the data; as the percentage increases, it may happen that at some point there is little further variation in the results, for example in the top ten most frequent words. A sketch of such a test is shown after this paragraph.
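A minimal sketch of such a test, assuming the raw twitter, blogs and news vectors are still in memory; the helper function below is illustrative and uses a simplified cleaning step, not the full pipeline above:
# hypothetical helper: top ten words for a given sample fraction
top10.words <- function(fraction) {
  smp <- c(sample(twitter, length(twitter) * fraction),
           sample(blogs, length(blogs) * fraction),
           sample(news, length(news) * fraction))
  smp <- tolower(gsub("[^[:alpha:] ]", " ", smp)) # simplified cleaning for the test
  words <- unlist(strsplit(smp, "\\s+"))
  head(names(sort(table(words[words != ""]), decreasing = TRUE)), 10)
}
# number of top ten words shared between a 2% and a 4% sample
length(intersect(top10.words(0.02), top10.words(0.04)))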
When a larger sampling fraction is used, additional data cleaning may become necessary.
Create a Shiny app with a simple user interface that reliably and accurately predicts the next word based on a word or phrase entered by the user.
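As a first idea for the prediction step behind the app, a minimal sketch of a trigram lookup is shown below, assuming the trigramcorpussort table built above; a real implementation would need a back-off to 2-grams and unigrams:
# hypothetical next-word lookup: most frequent 3-gram starting with the last two words typed
predict.next.word <- function(phrase, trigrams = trigramcorpussort) {
  last2 <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2) # last two words of the input
  prefix <- paste0("^", paste(last2, collapse = " "), " ")
  hits <- trigrams[grepl(prefix, trigrams$Word), ]
  if (nrow(hits) == 0) return(NA_character_) # a back-off to shorter n-grams would go here
  tail(strsplit(as.character(hits$Word[1]), " ")[[1]], 1) # last word of the best matching 3-gram
}
predict.next.word("one of the")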
Thank you very much