The goal of this report is simply to show that I have become comfortable working with the data and that I am on track to create my prediction algorithm.
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text.
The goal of this task is to understand the basic relationships I observe in the data and to prepare to build my first linguistic models.
For this reason I perform an exploratory analysis of the supplied dataset:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this document I try to perform a thorough exploratory analysis of the data:
Understanding the distribution of words and the relationships between the words in the corpora.
Understanding the frequencies of words and word pairs, and building figures and tables to show how those frequencies vary across the data.
In this analysis I try to answer the following questions:
Are some words more frequent than others?
What is the distribution of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage?
Can you identify words that may not be in the corpora, or use a smaller number of words in the dictionary to cover the same number of phrases?
In this first milestone document, I will report on the following topics:
My first reflections on what problems to deal with and which to ignore.
Plans for a machine learning algorithm to predict the next word given a phrase.
## Loading the packages needed for NLP processing.
# install.packages("NLP");
# install.packages("tm");
# install.packages("RWeka");
# install.packages("rJava");
# install.packages("SnowballC");
## Loading the required libraries
library(NLP);
library(tm);
library(rJava);
library(RWeka);
library(SnowballC);
library(ggplot2);
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Configuring path variables of raw data:
work.Directory <- "I:/proyectos/DataScience Capstone - SwiftKey";
setwd(work.Directory);
fileName.dir <- paste0(getwd(),"/dataRaw/en_US/");
fileName.us.blog <- paste0(getwd(),"/dataRaw/en_US/en_US.blogs.txt");
fileName.us.news <- paste0(getwd(),"/dataRaw/en_US/en_US.news.txt");
fileName.us.twitter <- paste0(getwd(),"/dataRaw/en_US/en_US.twitter.txt");
## Configuring path variables of processed data:
fileName.dirP <- paste0(getwd(),"/dataPrc/en_US/");
## Creating the main corpus to analyze:
#corpus.text <- Corpus(DirSource(directory=fileName.dir));
corpus.text <- Corpus(DirSource(directory=fileName.dir),
readerControl = list(reader=readPlain,
language="en",
load=TRUE)
);
## Inspecting the corpus loaded:
inspect(corpus.text)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 208361438
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15683765
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 162384825
I have loaded the three en_US documents (blogs, news and twitter) into the corpus.
The total corpus takes up about 573 MB in memory, which is too big for an agile preliminary analysis.
For this reason I will build a small sample corpus.
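A minimal sketch of how such an in-memory size can be checked (I am assuming object.size() here; the exact call used for the figure above is not shown):
# Check the approximate in-memory size of the corpus (sketch)
format(object.size(corpus.text), units = "Mb")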
## I create a new corpus with the first 5000 lines of every document
corpus.train<-corpus.text;
corpus.train[[1]]$content <- corpus.text[[1]]$content[1:5000];
corpus.train[[2]]$content <- corpus.text[[2]]$content[1:5000];
corpus.train[[3]]$content <- corpus.text[[3]]$content[1:5000];
inspect(corpus.train);
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1148203
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1016069
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 340779
The sampled corpus takes up about 3 MB in memory, much lighter than the original, so I will work with it.
## I delete the main corpus to free the memory it occupies
remove(corpus.text);
## And I run the garbage collector to reclaim memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 527183 28.2 3534975 188.8 3909685 208.8
## Vcells 3127580 23.9 61308064 467.8 72634276 554.2
# Remove special characters. Note: the unescaped '-' inside this character
# class acts as a range that also matches digits and capital letters, which
# explains truncated tokens such as "ew ork" later in this report:
specialCharacters<-"[“”'?¿!$\"\"-_@#·$~%€&¬“”‘’]";
corpus.train[[1]]$content <-
gsub(specialCharacters, "", corpus.train[[1]]$content);
corpus.train[[2]]$content <-
gsub(specialCharacters, "", corpus.train[[2]]$content);
corpus.train[[3]]$content <-
gsub(specialCharacters, "", corpus.train[[3]]$content);
# Transform all characters to lowercase
corpus.train <- tm_map(corpus.train, content_transformer(tolower));
# Remove all numbers
corpus.train <- tm_map(corpus.train, removeNumbers);
# Remove all punctuation characters
corpus.train <- tm_map(corpus.train, removePunctuation);
# Strip Whitespace
corpus.train <- tm_map(corpus.train, stripWhitespace);
# Remove English Stop Words
corpus.train <- tm_map(corpus.train, removeWords, stopwords("english"));
# Removes common word endings for English words
corpus.train <- tm_map(corpus.train, stemDocument, language = "english")
Our lite corpus is now ready to explore.
I build a term-document matrix from the training corpus to explore it and obtain the frequency of every term:
tdm.tf.1<-TermDocumentMatrix(corpus.train);
df.terms.freq<-as.data.frame(as.matrix(tdm.tf.1));
df.terms.freq["total"]<-rowSums(df.terms.freq);
summary(df.terms.freq);
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 1.000 Median : 1.000 Median : 0.000 Median : 1.000
## Mean : 3.773 Mean : 3.435 Mean : 1.185 Mean : 8.392
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 4.000
## Max. :674.000 Max. :1239.000 Max. :275.000 Max. :1440.000
head(df.terms.freq);
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## â 2 0 0 2
## âarbara 0 1 0 1
## âust 0 1 0 1
## âm 0 1 0 1
## âs 0 1 0 1
## âm 3 0 0 3
As expected, the lowest frequencies dominate (most terms appear only once or not at all in a given document), but the histogram is not very clear.
If I keep only the 90% of terms with the most repetitions, I expect to exclude these lowest-frequency terms.
# Obtain the 10% quantile and filter above it to keep the top 90% of terms:
q<- quantile(df.terms.freq$total, 0.1);
df.terms.freq<-df.terms.freq[df.terms.freq$total>q,];
summary(df.terms.freq)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 2.00
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 2.000 Median : 2.000 Median : 0.00 Median : 4.00
## Mean : 7.816 Mean : 7.138 Mean : 2.42 Mean : 17.37
## 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 2.00 3rd Qu.: 12.00
## Max. :674.000 Max. :1239.000 Max. :275.00 Max. :1440.00
And I show the histogram again; the graph only covers values up to roughly the third quartile:
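The plotting code for this histogram is not shown; a minimal sketch of how it could be produced, mirroring the ggplot2 histogram call used further below (the break points are an assumption):
# Sketch: histogram of the filtered term frequencies
ggplot(data = df.terms.freq, aes(x = total)) +
    geom_histogram(breaks = seq(2, 12, by = 1),
                   col = "black", fill = "blue", alpha = 0.2) +
    labs(title = "Distribution of word frequencies (top 90% of terms)")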
The most repeated terms in the corpora are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## said 168 1239 33 1440
## will 600 528 182 1310
## one 674 401 185 1260
## like 635 305 251 1191
## get 490 298 275 1063
## time 592 295 145 1032
## just 530 265 234 1029
## can 538 278 167 983
## year 323 554 78 955
## make 432 225 130 787
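The table above implies that df.terms.freq was sorted by total frequency in a step not shown; a minimal sketch of that ordering (an assumption on my part):
# Sketch: order the terms by total frequency and show the ten most frequent
df.terms.freq <- df.terms.freq[order(df.terms.freq$total, decreasing = TRUE), ];
head(df.terms.freq, 10)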
# N-gram tokenizers for the term-document matrices below.
# Note: these functions ignore their `text` argument and always tokenize the
# whole corpus.train, which is why every document column in the bigram and
# trigram matrices below shows identical counts.
Bigram <-
function(text, x=2) NGramTokenizer(corpus.train, Weka_control(min=2, max=2));
Trigram <-
function(text, x=3) NGramTokenizer(corpus.train, Weka_control(min=3, max=3));
tdm.tf.2 <-
TermDocumentMatrix(corpus.train, control=list(tokenize=Bigram));
tdm.tf.3 <-
TermDocumentMatrix(corpus.train, control=list(tokenize=Trigram));
df.terms.freq.2<-as.data.frame(as.matrix(tdm.tf.2));
df.terms.freq.2["total"]<-rowSums(df.terms.freq.2);
summary(df.terms.freq.2)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.154 Mean : 1.154 Mean : 1.154 Mean : 3.462
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :96.000 Max. :96.000 Max. :96.000 Max. :288.000
# Obtain the 10% quantile and filter above it to keep the top 90% of bigrams.
# Note: the filter below reuses the unigram threshold `q` instead of `q2`,
# so in practice no bigrams are removed and the summary below is unchanged.
q2<- quantile(df.terms.freq.2$total, 0.1);
df.terms.freq.2<-df.terms.freq.2[df.terms.freq.2$total>q,];
summary(df.terms.freq.2)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.154 Mean : 1.154 Mean : 1.154 Mean : 3.462
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :96.000 Max. :96.000 Max. :96.000 Max. :288.000
The distribution of bigram frequencies is concentrated around 3-4 repetitions.
The most repeated bigrams are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## ew ork 96 96 96 288
## last year 87 87 87 261
## year ago 71 71 71 213
## right now 68 68 68 204
## feel like 64 64 64 192
## ou can 62 62 62 186
## look like 61 61 61 183
## last night 58 58 58 174
## dont know 57 57 57 171
## first time 51 51 51 153
df.terms.freq.3<-as.data.frame(as.matrix(tdm.tf.3));
df.terms.freq.3["total"]<-rowSums(df.terms.freq.3);
summary(df.terms.freq.3)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.006 Mean : 1.006 Mean : 1.006 Mean : 3.019
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :13.000 Max. :13.000 Max. :13.000 Max. :39.000
# Obtain the 10% quantile and filter above it to keep the top 90% of trigrams.
# Note: as with the bigrams, the filter below reuses the unigram threshold `q`
# instead of `q3`, so in practice no trigrams are removed and the summary
# below is unchanged.
q3<- quantile(df.terms.freq.3$total, 0.1);
df.terms.freq.3<-df.terms.freq.3[df.terms.freq.3$total>q,];
summary(df.terms.freq.3)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.006 Mean : 1.006 Mean : 1.006 Mean : 3.019
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :13.000 Max. :13.000 Max. :13.000 Max. :39.000
The distribution of trigram frequencies is even more concentrated around 3-4 repetitions.
The most repeated trigrams are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## first time sinc 13 13 13 39
## = character 0 12 12 12 36
## ew ork ime 12 12 12 36
## ate ountain park 11 11 11 33
## ew ork iti 10 10 10 30
## resid arack bama 9 9 9 27
## lassic ate ountain 8 8 8 24
## appi ew ear 6 6 6 18
## cant wait see 6 6 6 18
## li kick â 6 6 6 18
Yes, some words are clearly more frequent than others; the most used words in the sample across all documents are:
df.terms.freq[1:10,]
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## said 168 1239 33 1440
## will 600 528 182 1310
## one 674 401 185 1260
## like 635 305 251 1191
## get 490 298 275 1063
## time 592 295 145 1032
## just 530 265 234 1029
## can 538 278 167 983
## year 323 554 78 955
## make 432 225 130 787
ggplot(data = df.terms.freq, aes(x = total)) +
    geom_histogram(breaks = seq(2, 12, by = 1),
                   col = "black",
                   fill = "blue",
                   alpha = 0.2) +
    labs(title = "Distribution of word frequencies")
The distribution of trigram frequencies is more concentrated than that of the bigrams; although the histograms look similar, the trigram distribution is shifted further to the left and its maximum value is lower than the bigram one.
Max bigram frequency: 288.
Max trigram frequency: 39.
Naturally, neither of them is directly comparable with the distribution of single-word frequencies.
df.tf<-as.data.frame(as.matrix(tdm.tf.1));
m<-dim(df.tf)[1];m;
## [1] 27626
df.tf["total"]<-rowSums(df.tf);
x<- quantile(df.tf$total, 0.5);
df.tf<-df.tf[df.tf$total>x,];
n<-dim(df.tf)[1];n
## [1] 12472
The number of words needed to cover 50% of the dictionary is 12472 out of 27626.
df.tf<-as.data.frame(as.matrix(tdm.tf.1));
m<-dim(df.tf)[1];m;
## [1] 27626
df.tf["total"]<-rowSums(df.tf);
x<- quantile(df.tf$total, 0.9);
df.tf<-df.tf[df.tf$total>x,];
n<-dim(df.tf)[1];n
## [1] 2645
The number of words needed to cover 90% of the dictionary is 2645 out of 27626.
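As a cross-check, here is a minimal sketch of an alternative way to answer the coverage question, counting how many of the most frequent words are needed before their cumulative frequency reaches 50% and 90% of all word instances in the sample (this approach is my assumption and differs from the quantile filter used above; tdm.tf.1 is the unigram term-document matrix built earlier):
# Sketch: coverage by cumulative frequency of the sorted terms
freqs <- sort(rowSums(as.matrix(tdm.tf.1)), decreasing = TRUE);
coverage <- cumsum(freqs) / sum(freqs);
n50 <- which(coverage >= 0.5)[1];   # words needed to cover 50% of instances
n90 <- which(coverage >= 0.9)[1];   # words needed to cover 90% of instances
c(n50 = n50, n90 = n90)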
I am not sure, but I think I could use external sources such as:
https://www.wordgamedictionary.com/english-word-list/download/english.txt
To get a complete English dictionary and compare it with the words in my corpus.
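A minimal sketch of that comparison, assuming the word list has been downloaded to a local file named english.txt (the file name and the use of setdiff() are my assumptions; note that stemmed terms will often fail to match a plain dictionary):
# Sketch: flag corpus terms missing from an English word list as candidate
# foreign or erroneous words
english.words <- tolower(readLines("english.txt"));
corpus.terms <- rownames(as.matrix(tdm.tf.1));
foreign.candidates <- setdiff(corpus.terms, english.words);
length(foreign.candidates) / length(corpus.terms)   # share of unmatched terms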
Improving the corpus processing:
Better handling of special characters.
Better detection of foreign words.
Detecting erroneous words and deleting them.
This way the word frequencies become more reliable and the quantiles improve.
Yes, it is very useful, as it greatly reduces the number of words and therefore the memory used.
I need to improve my processing functions, because they have damaged my corpus and my quantile data.
I need to better tune the number of samples used to build the model, because my hardware resources are limited.
I need to better tune the quantiles for the bigrams and trigrams to get the best results with less data.
I think the degree of predictability depends on actual usage and on how well we prepare the data, so I need to study better processing functions to obtain a better corpus.
I think using the n-gram statistics is very useful for suggesting the next word: bigrams when the user has typed only a single word and trigrams when the user has typed two words.
But I think I will need to restrict the options to reduce the memory cost, so when a matching n-gram is not found I will offer the most common word (highest frequency).
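A minimal sketch of that back-off idea, assuming hypothetical lookup tables bigram.freq and trigram.freq (data frames with columns prefix, nextword and count) and a vector top.words of the most frequent single words; none of these objects are built in the code above:
# Sketch: simple back-off next-word prediction.
# Use the trigram table when two words of context are available, fall back to
# the bigram table for one word, and finally to the most frequent word overall.
predictNext <- function(phrase, trigram.freq, bigram.freq, top.words) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2);
    if (length(words) >= 2) {
        hits <- trigram.freq[trigram.freq$prefix == paste(words, collapse = " "), ];
        if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)]);
    }
    if (length(words) >= 1) {
        hits <- bigram.freq[bigram.freq$prefix == tail(words, 1), ];
        if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)]);
    }
    top.words[1]   # default: the most common word
}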