1. Goal of this project

The objective of this report is to show the work done on the data as a first step towards building a word prediction algorithm.

The first step in any task of this kind is to analyze what the raw data (the corpus of documents) contains and how large it is, so that the available computational capacity can be taken into account and a suitable sampling rate determined. The next step is to clean the data, removing useless information. Finally, an Exploratory Data Analysis is performed, looking for patterns among the words, which will be the basis for the prediction algorithm.

2. Necessary libraries

suppressWarnings(library(stringi))
suppressWarnings(library(knitr))
suppressWarnings(library(data.table))
suppressWarnings(library(dplyr))
suppressWarnings(library(tm))
suppressWarnings(library(NLP))
suppressWarnings(library(RWeka))
suppressWarnings(library(ggplot2))
suppressWarnings(library(stringr))
suppressWarnings(library(wordcloud))
suppressWarnings(library(kableExtra))

3. Getting Data

Please set your own working directory first. The data sources to analyze are the English files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt, contained in the archive downloaded from: “http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”

Additionally, a list of bad words is obtained, to filter profanity out of the data: “https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en”.

Finally, dictionaries for Spanish, German, Danish, Dutch, French, Italian, Norwegian and Swiss were downloaded in order to remove foreign-language words. The dictionaries come from: “http://www.gwicks.net/dictionaries.htm”

Download the data to analyze:

# data will be downloaded to the "data" subdirectory
fileURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
destifile <- "./data/Coursera-SwiftKey.zip"
subdir <- "data"

# create the "data" subdirectory if it does not exist yet
if (!dir.exists(subdir)) {
        dir.create(file.path(getwd(), "data"))
}

# check if the data file is already downloaded; if not, download it
if(!file.exists(destifile)){
        download.file(fileURL,destfile=destifile, method="auto")
}

# check if the data files are already unzipped; if not, unzip the archive

unzipped.txt <- c("./data/final/en_US/en_US.blogs.txt",
                  "./data/final/en_US/en_US.news.txt",
                  "./data/final/en_US/en_US.twitter.txt")

zipF <- "./data/Coursera-SwiftKey.zip"
if (any(!file.exists(unzipped.txt))) {
        unzip(zipF, exdir = "./data")
}

Download the bad words dictionary and load it into an object:

if(!file.exists("profanity.txt")) {
        download.file("https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en",destfile="profanity.txt",method="curl")
}
profanity <-readLines("profanity.txt")

Download the foreign-language dictionaries:

urlspanish <- "http://www.gwicks.net/textlists/espanol.zip"
urlgerman <- "http://www.gwicks.net/textlists/deutsch.zip"
urldanish <- "http://www.gwicks.net/textlists/dansk.zip"
urldutch <- "http://www.gwicks.net/textlists/nederlands2.zip"
urlfrench <- "http://www.gwicks.net/textlists/francais.zip"
urlitalian <- "http://www.gwicks.net/textlists/italiano.zip"
urlnorwegian <- "http://www.gwicks.net/textlists/norsk.zip"
urlswiss <- "http://www.gwicks.net/textlists/swiss.zip"

urllanguages <- c(urlspanish, urlgerman, urldanish, urldutch, urlfrench, urlitalian,
                  urlnorwegian, urlswiss)
destifile1 <- c("./language/spanish.zip", "./language/german.zip", "./language/danish.zip",
                "./language/dutch.zip" , "./language/french.zip" , "./language/italian.zip",
                "./language/norwegian.zip", "./language/swiss.zip")

subdir1 <- "./language"

# create the "language" subdirectory if it does not exist yet
if (!dir.exists(subdir1)) {
        dir.create(file.path(getwd(), "language"))
}

# check if the dictionary files are already downloaded; if not, download them
for (i in 1:8) {
        if(!file.exists(destifile1[i])){
                download.file(urllanguages[i],destfile=destifile1[i], method="auto")
        }
}

# check if the dictionary files are already unzipped; if not, unzip them

lang.dict <- c("./language/dansk.txt", "./language/nederlands2.txt", "./language/francais.txt" ,
              "./language/deutsch.txt", "./language/italiano.txt", "./language/norsk.txt",
              "./language/espanol.txt", "./language/swiss.txt")

for (i in 1:8) {
        if(!file.exists(lang.dict[i])){
                unzip(destifile1[i],exdir="./language")
        }
}

4. Preparing Data

Read the lines of the English twitter, blogs and news files:

twitter<-readLines("./data/final/en_US/en_US.twitter.txt",warn=FALSE,encoding="UTF-8")
blogs<-readLines("./data/final/en_US/en_US.blogs.txt",warn=FALSE,encoding="UTF-8")
news<-readLines("./data/final/en_US/en_US.news.txt",warn=FALSE,encoding="UTF-8")

Count the lines, words and characters in twitter, blogs and news:

words.in.twitter <-stri_stats_latex(twitter)[4]
words.in.blogs <-stri_stats_latex(blogs)[4]
words.in.news <-stri_stats_latex(news)[4]

chars.in.twitter<-sum(nchar(twitter))
chars.in.blogs<-sum(nchar(blogs))
chars.in.news<-sum(nchar(news))


counts <- data.table("File" = c("twitter", "blogs", "news"),
           "Lines" = c(length(twitter),length(blogs), length(news)),
           "Words" = c(words.in.twitter, words.in.blogs, words.in.news),
           "Characters"=c(chars.in.twitter,chars.in.blogs,chars.in.news))

kable(counts, format = "html", caption = "Counts in data set") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Counts in data set

File      Lines      Words       Characters
twitter   2360148    30451128    162096031
blogs     899288     37570839    206824505
news      1010242    34494539    203223159
# Plot of lines
barplot(counts$Lines, main="Number of Lines",
        xlab="data",
        col=c("chocolate","chartreuse4", "cyan4"),
        names.arg = c("twitter", "blogs", "news"),
        log="y", beside=TRUE)

# Plot of words
barplot(counts$Words, main="Number of Words",
        xlab="data",
        col=c("chocolate","chartreuse4", "cyan4"),
        names.arg = c("twitter", "blogs", "news"),
        log="y", beside=TRUE)

# Plot of characters
barplot(counts$Characters, main="Number of Characters",
        xlab="data",
        col=c("chocolate","chartreuse4", "cyan4"),
        names.arg = c("twitter", "blogs", "news"),
        log="y", beside=TRUE)

5. Data Sampling

A sample of the data is taken and unified in a volatile corpus. First, we take a 2% sample of twitter, blogs and news to reduce computational time (from 18.5 GB to 414.5 MB in the corpora), unifying the three sources in the same operation. Corpora are collections of documents containing (natural language) text. A volatile corpus is fully kept in memory, so all changes only affect the corresponding R object, not the original data vector. A vector source interprets each element of the vector x as a document.

# take a 2% sample of each data file and unify them:
data.sample <- c(sample(twitter, length(twitter) * 0.02),
                 sample(blogs, length(blogs) * 0.02),
                 sample(news, length(news) * 0.02))

# create a volatile corpus
corpora <- VCorpus(VectorSource(data.sample))
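As a quick optional check of the memory figures mentioned above, the in-memory sizes of the sample and of the volatile corpus can be inspected with object.size(); the exact values will vary with the random sample.

# optional check: approximate in-memory size of the sample and of the corpus
print(object.size(data.sample), units = "MB")
print(object.size(corpora), units = "MB")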

6. Data Cleansing

We define a function, to.space, to replace a useless pattern with a space. The following functions are used: content_transformer creates transformation functions that modify the content of an R object; gsub searches for the pattern argument in each element of the character vector x and replaces it with the replacement argument; tm_map applies a transformation function (here to.space) to the corpus.

Remove useless patterns and create a plain text document:

to.space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpora <- tm_map(corpora, to.space, "(f|ht)tp(s?)://(.*)[.][a-z]+") # remove URLs
corpora <- tm_map(corpora, to.space, "[@][a-zA-Z0-9_]{1,15}") # remove Twitter usernames
corpora <- tm_map(corpora, to.space, "RT |via") # remove Twitter tags
corpora <- tm_map(corpora, to.space, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.+[a-zA-Z0-9_.+-]+") # remove email addresses
corpora <- tm_map(corpora, to.space, "[^[:alpha:]]") # remove all non-alphabetic characters

# remove meta-characters
metachar <- c("\\$", "\\*", "\\+", "\\.", "\\?", "\\[", "\\]", "\\^", "\\{", "\\}", "\\|", "\\(", "\\)", "\\\\")
for (j in 1:length(metachar)) {
  corpora <- tm_map(corpora, to.space, metachar[j])
}

# tm_map: apply a transformation function (2nd argument) to the corpus
corpora <- tm_map(corpora, tolower) # convert uppercase to lowercase
corpora <- tm_map(corpora, removeNumbers) # remove numbers
corpora <- tm_map(corpora, removeWords, stopwords("en")) # remove English stop words
corpora <- tm_map(corpora, removePunctuation) # remove punctuation marks
corpora <- tm_map(corpora, stripWhitespace) # collapse multiple whitespace characters to a single blank
corpora <- tm_map(corpora, PlainTextDocument) # create plain text documents (unformatted text)

Removing bad words

corpora <- tm_map(corpora, removeWords, profanity)

Removing foreign words

We’ll take a 0.1% sample of each foreign-language dictionary to reduce computational time during testing.

for (i in 1:8) {
        foreign <- readLines(lang.dict[i])
        foreign.sample <- sample(foreign, length(foreign) * 0.001)
        Encoding(foreign.sample) <- "latin1"
        # some dictionary entries list alternative forms; split them apart
        foreign.sample <- as.character(strsplit(foreign.sample, '\\\\', fixed = TRUE))
        # remove the words in chunks of 500 to keep the removeWords pattern manageable
        chunk <- 500
        n <- length(foreign.sample)
        r <- rep(1:ceiling(n/chunk), each = chunk)[1:n]
        d <- split(foreign.sample, r)

        for (j in 1:length(d)) {
                corpora <- tm_map(corpora, removeWords, c(paste(d[[j]])))
        }
}

Showing the head of the clean corpus:

final.corpora<-data.frame(text=unlist(sapply(corpora,'[',"content")),stringsAsFactors = FALSE)

kable(head(final.corpora), format = "html", caption = "Head of clean corpora") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Head of clean corpora
text
fruitcake hot cocoa live band holiday karaoke tons holiday fun s merriment kick party come wisc m st pm
can can
every disciplined effort multiple reward jim rohn
blast movies daughter
old one used talk
chillin like

7. Exploratory Data Analysis

Questions to consider

Some words are more frequent than others - what are the distributions of word frequencies?

set.seed(123)

unigram<-function(x) NGramTokenizer(x,Weka_control(min=1,max=1))
unigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=unigram))
unigramcorpus<-findFreqTerms(unigramtdm,lowfreq=1000)
unigramcorpusnum<-rowSums(as.matrix(unigramtdm[unigramcorpus,]))
unigramcorpustab<-data.frame(Word=names(unigramcorpusnum),frequency=unigramcorpusnum)
# words and frequency
unigramcorpussort<-unigramcorpustab[order(-unigramcorpustab$frequency),]

ggplot(unigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
        geom_bar(stat="identity",fill = I("blue"))+
        labs(title="Unigrams",x="Top ten words",y="Frequency")+
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )

wordcloud(words = unigramcorpussort$Word,
          freq = unigramcorpussort$frequency,
          max.words = 100,
          colors = brewer.pal(6, 'Dark2'),
          random.order = FALSE)

What are the frequencies of 2-grams in the dataset?

bigram<-function(x) NGramTokenizer(x,Weka_control(min=2,max=2))
bigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=bigram))
bigramcorpus<-findFreqTerms(bigramtdm,lowfreq=80)
bigramcorpusnum<-rowSums(as.matrix(bigramtdm[bigramcorpus,]))
bigramcorpustab<-data.frame(Word=names(bigramcorpusnum),frequency=bigramcorpusnum)
# words and frequency
bigramcorpussort<-bigramcorpustab[order(-bigramcorpustab$frequency),]

ggplot(bigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
        geom_bar(stat="identity",fill = I("blue"))+
        labs(title="2-grams",x="Top ten 2-grams",y="Frequency")+
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )

wordcloud(words = bigramcorpussort$Word,
          freq = bigramcorpussort$frequency,
          max.words = 100,
          colors = brewer.pal(6, 'Dark2'),
          random.order = FALSE)

What are the frequencies of 3-grams in the dataset?

trigram<-function(x) NGramTokenizer(x,Weka_control(min=3,max=3))
trigramtdm<-TermDocumentMatrix(corpora,control=list(tokenize=trigram))
trigramcorpus<-findFreqTerms(trigramtdm,lowfreq=10)
trigramcorpusnum<-rowSums(as.matrix(trigramtdm[trigramcorpus,]))
trigramcorpustab<-data.frame(Word=names(trigramcorpusnum),frequency=trigramcorpusnum)
# words and frequency
trigramcorpussort<-trigramcorpustab[order(-trigramcorpustab$frequency),]


ggplot(trigramcorpussort[1:10,],aes(x=reorder(Word,-frequency),y=frequency))+
        geom_bar(stat="identity",fill = I("blue"))+
        labs(title="3-grams",x="Top ten 3-grams",y="Frequency")+
        theme ( axis.text.x = element_text ( angle = 45 , hjust = 1 ) )

wordcloud(words = trigramcorpussort$Word,
          freq = trigramcorpussort$frequency,
          max.words = 100,
          colors = brewer.pal(6, 'Dark2'),
          random.order = FALSE)

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

To address this question, we’ll create a function to run a loop until the desired word coverage is reached.

word.coverage <- function(x, coverage) { # x is the unigram table sorted by frequency, coverage is the desired fraction of word instances
        nwords <- 0 # running count of word instances
        coverage <- coverage * sum(x$frequency) # number of word instances needed to reach the coverage
        for (i in 1:nrow(x)) {
                if (nwords >= coverage) {return(i)}
                nwords <- nwords + x$frequency[i]
        }
}

cover50 <- word.coverage(unigramcorpussort,0.5)
cover90 <- word.coverage(unigramcorpussort,0.9)
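These counts can be reported with, for example:

cat("Words needed to cover 50% :", cover50, "\n")
cat("Words needed to cover 90% :", cover90, "\n")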

Words needed to cover 50% : 30
Words needed to cover 90% : 84

How do you evaluate how many of the words come from foreign languages?

This was already done with a 0.1% sample of the Spanish, German, Danish, Dutch, French, Italian, Norwegian and Swiss dictionaries, applying the tm_map function with removeWords based on each language dictionary.
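A rough sketch of how such an evaluation could look, reusing the Spanish dictionary downloaded above; to be meaningful it would have to be run on a unigram table built before the removal step.

# sketch: estimate foreign-word presence by intersecting the unigram vocabulary
# with one language dictionary (here Spanish); run this before removeWords is applied
spanish.words <- readLines("./language/espanol.txt")
n.spanish <- sum(unigramcorpussort$Word %in% spanish.words)
n.spanish / nrow(unigramcorpussort) # share of the unigram vocabulary found in the dictionary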

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

Coverage could be increased by reducing the number of unique low-frequency words, substituting them with synonyms (including jargon terms).
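A minimal sketch of this idea, using a small, purely illustrative synonym map (a real mapping would come from a thesaurus or a jargon list):

# sketch: collapse rare variants onto a canonical word with a hand-made synonym map
synonym.map <- c("automobile" = "car", "kids" = "children", "cellphone" = "phone")
replace.synonyms <- function(words, map) {
        hits <- words %in% names(map)
        words[hits] <- map[words[hits]]
        words
}
replace.synonyms(c("automobile", "dog", "kids"), synonym.map)
# [1] "car"      "dog"      "children"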

8. Conclusions

Due to limited computational capacity, we have taken a 2% sample of each file (twitter, blogs and news); this sample is probably not fully representative of the population.

Adding the functionality to remove words from other languages significantly increased the computation time, but it is necessary: at first I found that three of the top 10 3-grams contained Spanish words, which led me to include the removal of foreign words.

We have done a reasonable cleaning, but as the sampling rate increases, other types of word removal may become necessary.

9. Future plans

Carry out a sampling test, increasing the sampling percentage and comparing the variation in the results. In this report we used 2% of the data; as the percentage increases, at some point there may be little further variation in the results, for example in the top 10 most frequent words.
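A sketch of how such a test could be organized: the helper below is a simplified rerun of the sampling and counting steps above (with only part of the cleaning, for brevity) and compares the top 10 unigrams obtained at increasing sampling rates with the 2% baseline.

# sketch: rerun a simplified version of the pipeline for a sampling rate p
# and return the ten most frequent words (full cleaning omitted for brevity)
top10.words <- function(p) {
        s <- c(sample(twitter, length(twitter) * p),
               sample(blogs, length(blogs) * p),
               sample(news, length(news) * p))
        corp <- VCorpus(VectorSource(s))
        corp <- tm_map(corp, content_transformer(tolower))
        corp <- tm_map(corp, removeNumbers)
        corp <- tm_map(corp, removePunctuation)
        corp <- tm_map(corp, removeWords, stopwords("en"))
        tdm <- TermDocumentMatrix(corp)
        freq.terms <- findFreqTerms(tdm, lowfreq = 50)
        freq <- sort(rowSums(as.matrix(tdm[freq.terms, ])), decreasing = TRUE)
        names(freq)[1:10]
}

rates <- c(0.02, 0.05, 0.10)
tops <- lapply(rates, top10.words)
# fraction of the 2% top-10 words that is preserved at each rate
sapply(tops, function(t) length(intersect(t, tops[[1]])) / 10)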

When a larger sampling rate is used, additional data cleaning may be necessary.

Create a Shiny app with a simple user interface that reliably and accurately predicts the next word based on a word or phrase entered by the user.
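A minimal sketch of such an app, assuming a hypothetical predict.next.word() function that will later wrap the n-gram model (here just a placeholder):

library(shiny)

# placeholder for the future n-gram model; predict.next.word() is hypothetical
predict.next.word <- function(phrase) {
        if (nchar(trimws(phrase)) == 0) return("")
        "the" # dummy prediction until the real model is plugged in
}

ui <- fluidPage(
        titlePanel("Next word prediction (sketch)"),
        textInput("phrase", "Type a word or phrase:"),
        textOutput("prediction")
)

server <- function(input, output) {
        output$prediction <- renderText(predict.next.word(input$phrase))
}

shinyApp(ui = ui, server = server)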

Thank you very much