Introduction

The Capstone Project is about predicting the next word in a sentence based on the previous words of that sentence.

The dataset used for exploration and initial modeling is the corpora collection available at the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

This dataset was collected from publicly available sources by a web crawler. The data is classified into three main sources: blogs, news, and Twitter.

Exploratory analysis of the corpora dataset

This report presents the first four tasks of the Capstone Project of the Data Science Specialization.

The following tasks will be evaluated: Task 0 (understanding the problem), Task 1 (getting and cleaning the data), Task 2 (exploratory data analysis), and Task 3 (basic insights for the prediction model).

Task 0 - Understanding the problem

In this section, the data will be downloaded and a very basic examination of the data will be carried out.
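The download itself is not repeated when the report is run; a minimal sketch of how the dataset could be obtained is shown below (the local file name is illustrative).

## Not run when the report is knitted: download and unpack the Coursera-SwiftKey dataset
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")  ## unpacks the final/en_US files used below
}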

## Setting the path where data is stored
setwd("C:/Users/andgo/Coursera - Data Science/Course 10 - Capstone Project/final/en_US")

## Loading texts from blog data
fileblog <- file("./en_US.blogs.txt", "r") 
blogdata <-readLines(fileblog, encoding="latin1")
close(fileblog)

## Number of samples of text obtained from blogs
length(blogdata)
## [1] 899288
## View of the content of a random entry
blogdata[sample(length(blogdata), 1)]
## [1] "In between races we high-five and giggle. This is sweet. We do it every morning. And by the second race (thanks to my coffee), I am totally into it - we have a blast!"
## Loading texts from Twitter data
filetwitter <- file("./en_US.twitter.txt", "r") 
twitterdata <-readLines(filetwitter, encoding="latin1")
close(filetwitter)
## Number of samples of text obtained from Twitter
length(twitterdata)
## [1] 2360148
## View of the content of a random entry
twitterdata[sample(length(twitterdata), 1)]
## [1] "Hhahaha I am so disappointed I'll never be able to try any of it. Cheesey for days"
## Loading texts from news data
filenews <- file("./en_US.news.txt", "r") 
newsdata <-readLines(filenews, encoding="latin1")
close(filenews)

## Number of samples of text obtained from news
length(newsdata)
## [1] 77259
## View of the content of a random entry
newsdata[sample(length(newsdata), 1)]
## [1] "The dominant force at Melbourne Park this century, Williams had lost only two matches at the Australian Open since winning the first of her five titles here in 2003. She was on a 17-match winning streak after capturing titles in 2009 and 2010 and missing last year due to injury."

The total numbers of entries for blogs, Twitter, and news are 899288, 2360148, and 77259, respectively.

Task 1 - Getting and Cleaning the Data

In this section, the most common steps for cleaning raw text will be applied in preparation for further evaluation.

All exploratory analysis will be carried out on a sample of the data. The sample size is set to 1500 entries from each source: blogs, Twitter, and news. One sample entry from each source (blogs, Twitter, and news, respectively) is presented below.
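The sampling code is not shown in the original report; a minimal sketch of how such samples could be drawn is given below (the sample object names and the seed are illustrative).

## Drawing 1500 random entries from each source (illustrative object names)
set.seed(1234)
samplesize    <- 1500
blogsample    <- sample(blogdata, samplesize)
twittersample <- sample(twitterdata, samplesize)
newssample    <- sample(newsdata, samplesize)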

## [1] "For three consecutive days beginning Saturday July 2nd, and every Sunday thereafter, The Altered Page will be presenting the results of this year's artist survey as a series of mini projects."
## [1] "He should be... he was in the damn movie!"
## [1] "\"Oh that we could outlaw all behavior that offends us!"

Next, the most common cleaning steps are performed: conversion to lower case and removal of punctuation, numbers, special characters, and extra whitespace.

Important: the function to remove special characters was obtained from: https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/
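The cleaning code is not reproduced in full here; a minimal sketch of the pipeline, assuming the tm package and the illustrative sample objects above (the toSpace helper follows the pattern described in the reference), is:

library(tm)

## Transformer that replaces a matched pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

cleanCorpus <- function(textsample) {
    corpus <- VCorpus(VectorSource(textsample))
    corpus <- tm_map(corpus, toSpace, "[^[:alnum:][:space:]']")  ## drop special characters
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    tm_map(corpus, stripWhitespace)
}

corpusblog    <- cleanCorpus(blogsample)
corpustwitter <- cleanCorpus(twittersample)
corpusnews    <- cleanCorpus(newssample)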

The same samples are presented below to illustrate the cleaning process.

## [1] "for three consecutive days beginning saturday july nd and every sunday thereafter the altered page will be presenting the results of this years artist survey as a series of mini projects"
## [1] "he should be he was in the damn movie"
## [1] "oh that we could outlaw all behavior that offends us"

The next step removes inappropriate words that may offend the user. The list of profanity words was obtained from:

http://www.cs.cmu.edu/~biglou/resources/bad-words.txt

In my opinion, not all words presented in this list are inappropriate for users. A list of words to be removed from the initial filter was obtained from: https://rpubs.com/Nikotino/58395

profanity <- scan("./profanity.txt", character(0), sep = "\n", encoding="UTF-8")
profanity <- profanity[-(which(profanity%in%c("refugee","reject","remains","screw","welfare", "sweetness","shoot","sick","shooting","servant","sex","radical","racial","racist","republican","public","molestation","mexican","looser","lesbian","liberal","kill","killing","killer","heroin","fraud","fire","fight","fairy","^die","death","desire","deposit","crash","^crim","crack","^color","cigarette","church","^christ","canadian", "cancer","^catholic","cemetery","buried","burn","breast","^bomb","^beast","attack","australian","balls","baptist","^addict","abuse","abortion","amateur","asian","aroused","angry","arab","bible")==TRUE))]

library(tm)   ## tm_map() and removeWords() come from the tm package

## removeWords() expects a character vector of words, so the profanity list is passed directly
corpusblog    <- tm_map(corpusblog, removeWords, profanity)
corpustwitter <- tm_map(corpustwitter, removeWords, profanity)
corpusnews    <- tm_map(corpusnews, removeWords, profanity)

One common task in NLP is removing stopwords, which are very common words such as “the”, “a”, etc. This is important for computational optimization, as a fair number of words does not need to be processed.

This may not be appropriate in this situation, as we want to predict the next word in a sentence, and stopwords are themselves valid predictions. In this exploratory analysis, two sets of data will be evaluated: one with stopwords removed and one with stopwords kept.
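A minimal sketch of how the second set could be created with tm's English stopword list (the *_nostop object names are illustrative):

## Second set of corpora with English stopwords removed
corpusblog_nostop    <- tm_map(corpusblog, removeWords, stopwords("english"))
corpustwitter_nostop <- tm_map(corpustwitter, removeWords, stopwords("english"))
corpusnews_nostop    <- tm_map(corpusnews, removeWords, stopwords("english"))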

Task 2 - Exploratory Data Analysis

Next, the n-grams will be created. In this report, 1-gram (single word), 2-gram, and 3-gram tokenizations will be evaluated.
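The tokenization code is not shown in the original report; one common approach, sketched below, uses the RWeka n-gram tokenizer together with a term-document matrix (the helper function name is illustrative):

library(RWeka)

## Top "top" n-grams of size n in a corpus, counted via a term-document matrix
ngramFreq <- function(corpus, n, top = 15) {
    tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
    tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = tokenizer))
    freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
    head(data.frame(term = names(freq), freq = freq, row.names = NULL), top)
}

ngramFreq(corpusblog, 1)  ## top 15 single words in the blog sample
ngramFreq(corpusblog, 2)  ## top 15 2-grams
ngramFreq(corpusblog, 3)  ## top 15 3-grams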

Now, the frequency of each word will be evaluated. The top 15 most frequent words for each source are shown below.

##    Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1         the   3117           the    583        the   2922
## 2         and   1850           you    365        and   1322
## 3        that    778           and    287        for    504
## 4         for    661           for    224       that    478
## 5         you    549          that    142       with    409
## 6         was    476          with    127       said    382
## 7        with    464          this    116        was    335
## 8        this    452          your    114        are    234
## 9         but    356          have    111        his    232
## 10       have    353          just    109       have    222
## 11        are    335           are    106       from    218
## 12        not    290           but     87        but    215
## 13       from    263           all     86       this    204
## 14       they    237          love     72        has    180
## 15      about    216           not     69        not    177

Now, we will repeat the previous task with the second dataset, without the stopwords.

##    Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1        will    205          just    109       said    382
## 2        just    195          love     72       will    167
## 3        like    173           get     67        one    121
## 4         one    170          good     67       year    120
## 5         can    168          like     63      first    103
## 6        time    164        thanks     62        new    103
## 7         get    137        follow     54        two    102
## 8        know    113           day     53       time    101
## 9      people    113          know     53       just     88
## 10        now    107           one     52        can     86
## 11       make     98          time     52      years     83
## 12        new     97          will     51       also     82
## 13      first     91           can     50       like     82
## 14     little     91         great     48       last     77
## 15       also     90          dont     47     people     76

We can see that the most common words are in the stopword category. The most frequent words after stopword removal appear much lower in the ranking of the complete data set, if at all.

Further analysis will be carried out keeping the stopwords. This decision can be revisited in the future if the runtime of the prediction tool is too long.

Let’s check the most common 2-grams and 3-grams:

2-Grams

##    Blog.Terms B.Freq Twitter.Terms T.Freq News.Terms N.Freq
## 1      of the    289       for the     48     in the    283
## 2      in the    265        on the     45     of the    265
## 3      on the    138        in the     44     to the    131
## 4       to be    120        of the     35     on the    111
## 5      to the    114    thanks for     30    for the     96
## 6     for the    108         to be     29     at the     89
## 7     and the    105        to get     27    and the     78
## 8       i was     90        at the     26      to be     70
## 9      it was     84        to the     25       in a     67
## 10       in a     80      going to     24       of a     59
## 11      and i     78        do you     23   with the     56
## 12     at the     78        have a     23     with a     53
## 13     i have     78        if you     22    he said     52
## 14       i am     76      the best     22      and a     50
## 15       is a     74        i have     21   from the     50

3-Grams

##       Blog.Terms B.Freq      Twitter.Terms T.Freq        News.Terms N.Freq
## 1     one of the     22     thanks for the     13          a lot of     18
## 2       a lot of     18     for the follow      8      in the first     13
## 3    you want to     16        do you know      7       going to be     12
## 4     as well as     13          i have to      6         said in a     12
## 5       i have a     12         i love you      6        one of the     11
## 6        it is a     12           a lot of      5       some of the     10
## 7        i had a     11       have a great      5 the united states     10
## 8      i want to     11         have to be      5  according to the      9
## 9    this is the     11          i want to      5          it was a      9
## 10    be able to     10 looking forward to      5        the end of      9
## 11    there is a     10       cant wait to      4           to be a      9
## 12   you have to     10        going to be      4        out of the      8
## 13   i wanted to      9        i dont know      4        be able to      7
## 14   some of the      9           i have a      4        end of the      7
## 15 the fact that      9          i need to      4        from to pm      7

Usage of words

In this section, we will evaluate how many unique words are necessary to cover 90% of all word occurrences in the dataset.

The figure above shows that 90% of all word occurrences come from 52% of the unique words in the blog data. For the Twitter data, 90% of all word occurrences come from 67% of the unique words. Finally, for the news data, 90% of all word occurrences come from 59% of the unique words.

This means that a relatively small set of words accounts for almost all word occurrences in the data set. This information will be very useful for optimization purposes: the number of words the predictor has to deal with can be reduced without a comparable reduction in accuracy.
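As a reference, a minimal sketch of how such a coverage figure can be computed from a word-frequency vector (for example, the 1-gram frequencies produced above) is:

## Fraction of unique words needed to cover a target share of all word occurrences
coverage <- function(freq, target = 0.90) {
    freq <- sort(freq, decreasing = TRUE)   ## most frequent words first
    cum  <- cumsum(freq) / sum(freq)        ## cumulative share of occurrences
    which(cum >= target)[1] / length(freq)  ## fraction of unique words required
}
## e.g. applying coverage() to the blog word frequencies gives roughly 0.52, as reported above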

Task 3 - Basic insights for the prediction model

The initial approach is to find the most common 2-grams with the analyzed word in first place. Then, the most common 3-grams with the analyzed word in second place will be found.

As an example, let’s take the word “you” and perform the tasks described above.

##   term.1 term.2 Freq
## 1    you    can   43
## 2    you    are   36
## 3    you   have   30
## 4    you   will   23
## 5    you   want   21
## 6    you   know   20
##   term.1 term.2 term.3 Freq
## 1     if    you    are    8
## 2     if    you   want    5
## 3     so    you    can    5
## 4     do    you  think    4
## 5     do    you   want    4
## 6  where    you    can    4
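The lookup code is not shown in the original report; assuming the 2-gram and 3-gram frequency tables are stored as data frames (here called bigrams and trigrams, with the columns shown above and sorted by decreasing Freq), the tables can be produced along these lines:

## 2-grams starting with "you"
head(bigrams[bigrams$term.1 == "you", ], 6)
## 3-grams with "you" in second place
head(trigrams[trigrams$term.2 == "you", ], 6)
## 3-grams starting with "if you" (used further below)
head(trigrams[trigrams$term.1 == "if" & trigrams$term.2 == "you", ], 6)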

Let’s evaluate the chance of a good prediction.

For a prediction based only on the 2-gram model, the total number of events is 529. The most common 2-gram starting with the word “you” has 43 events. So, the chance of a good prediction with only one suggestion is 8.1%. If the number of suggestions were increased to 3, the chance of a good prediction would increase to 20.6%.

Now let’s evaluate the chance of a good prediction with a 3-gram model. As an example, we will examine the sequence “if you”.

##   term.1 term.2 term.3 Freq
## 1     if    you    are    8
## 2     if    you   want    5
## 3     if    you   dont    3
## 4     if    you   have    3
## 5     if    you  enjoy    2
## 6     if    you  watch    2

In this example, the total number of events is 41. The most common 3-gram with the word “if” followed by “you” has 8 events. So, the chance of a good prediction with only one suggestion is 19.5%. If the number of suggestions were increased to 3, the chance of a good prediction would increase to 39%.
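These percentages follow directly from the frequency tables above:

round(43 / 529, 3)              ## 0.081 - one suggestion, 2-gram model
round((43 + 36 + 30) / 529, 3)  ## 0.206 - three suggestions, 2-gram model
round(8 / 41, 3)                ## 0.195 - one suggestion, 3-gram model for "if you"
round((8 + 5 + 3) / 41, 3)      ## 0.39  - three suggestions, 3-gram model for "if you"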

Next steps