The goal of this report is simply to show that I have become comfortable working with the data and that I am on track to create my prediction algorithm.
The first step in building a predictive model for text is understanding the distribution and relationship between the words, tokens, and phrases in the text.
The goal of this task is to understand the basic relationships I observe in the data and to prepare to build my first linguistic models.
For this reason I perform an exploratory analysis of the supplied dataset:
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
In this document I try to perform a thorough exploratory analysis of the data:
Understanding the distribution of words and the relationships between the words in the corpora.
Understanding the frequencies of words and word pairs, and building figures and tables to show how those frequencies vary across the data.
In this analysis I try to answer the following questions:
Are some words more frequent than others?
What is the distribution of word frequencies?
What are the frequencies of 2-grams and 3-grams in the dataset?
How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
How do you evaluate how many of the words come from foreign languages?
Can you think of a way to increase the coverage?
Can you identify words that may not be in the corpora, or use a smaller number of words in the dictionary to cover the same number of phrases?
In this first milestone document, I will report on the following topics:
My first reflections on what problems to deal with and which to ignore.
Plans for a machine learning algorithm to predict the next word given a phrase.
## Loading the packages needed for NLP processing.
# install.packages("NLP");
# install.packages("tm");
# install.packages("RWeka");
# install.packages("rJava");
# install.packages("SnowballC");
## Loading the required libraries
library(NLP);
library(tm);
library(rJava);
library(RWeka);
library(SnowballC);
library(ggplot2);
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Configuring path variables of raw data:
work.Directory <- "I:/proyectos/DataScience Capstone - SwiftKey";
setwd(work.Directory);
fileName.dir <- paste0(getwd(),"/dataRaw/en_US/");
fileName.us.blog <- paste0(getwd(),"/dataRaw/en_US/en_US.blogs.txt");
fileName.us.news <- paste0(getwd(),"/dataRaw/en_US/en_US.news.txt");
fileName.us.twitter <- paste0(getwd(),"/dataRaw/en_US/en_US.twitter.txt");
## Configuring path variables of processed data:
fileName.dirP <- paste0(getwd(),"/dataPrc/en_US/");
## Creating the main corpus to analyze:
#corpus.text <- Corpus(DirSource(directory=fileName.dir));
corpus.text <- Corpus(DirSource(directory=fileName.dir),
readerControl = list(reader=readPlain,
language="en",
load=TRUE)
);
## Inspecting the corpus loaded:
inspect(corpus.text)
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 208361438
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 15683765
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 162384825
I have loaded the three en_US documents (blogs, news and twitter) into the corpus.
The total corpus takes up about 573 MB in memory, which is too big for an agile preliminary analysis.
For this reason I will build a small sample corpus.
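A minimal sketch of how such an in-memory size can be checked (I am assuming object.size() here; the exact call used for the figure above is not shown):
# Check the approximate in-memory size of the corpus (sketch)
format(object.size(corpus.text), units = "Mb")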
## I create a new corpus with the first 5000 lines of every document
corpus.train<-corpus.text;
corpus.train[[1]]$content <- corpus.text[[1]]$content[1:5000];
corpus.train[[2]]$content <- corpus.text[[2]]$content[1:5000];
corpus.train[[3]]$content <- corpus.text[[3]]$content[1:5000];
inspect(corpus.train);
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1148203
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1016069
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 340779
The sampled corpus takes up about 3 MB in memory, much lighter than the original, so I will work with it.
## I delete the main corpus to free the memory it occupies
remove(corpus.text);
## And I run the garbage collector to reclaim memory
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 527183 28.2 3534975 188.8 3909685 208.8
## Vcells 3127580 23.9 61308064 467.8 72634276 554.2
# Remove special characters. Note: the unescaped '-' inside this character
# class acts as a range that also matches digits and capital letters, which
# explains truncated tokens such as "ew ork" later in this report:
specialCharacters<-"[“”'?¿!$\"\"-_@#·$~%€&¬“”‘’]";
corpus.train[[1]]$content <-
gsub(specialCharacters, "", corpus.train[[1]]$content);
corpus.train[[2]]$content <-
gsub(specialCharacters, "", corpus.train[[2]]$content);
corpus.train[[3]]$content <-
gsub(specialCharacters, "", corpus.train[[3]]$content);
# Transform all characters to lowercase
corpus.train <- tm_map(corpus.train, content_transformer(tolower));
# Remove all numbers
corpus.train <- tm_map(corpus.train, removeNumbers);
# Remove all punctuation characters
corpus.train <- tm_map(corpus.train, removePunctuation);
# Strip Whitespace
corpus.train <- tm_map(corpus.train, stripWhitespace);
# Remove English Stop Words
corpus.train <- tm_map(corpus.train, removeWords, stopwords("english"));
# Removes common word endings for English words
corpus.train <- tm_map(corpus.train, stemDocument, language = "english")
Our lite corpus is now ready to explore.
I build a term-document matrix from the training corpus to explore it and obtain the frequency of every term:
tdm.tf.1<-TermDocumentMatrix(corpus.train);
df.terms.freq<-as.data.frame(as.matrix(tdm.tf.1));
df.terms.freq["total"]<-rowSums(df.terms.freq);
summary(df.terms.freq);
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 1.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 1.000
## Median : 1.000 Median : 1.000 Median : 0.000 Median : 1.000
## Mean : 3.773 Mean : 3.435 Mean : 1.185 Mean : 8.392
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 4.000
## Max. :674.000 Max. :1239.000 Max. :275.000 Max. :1440.000
head(df.terms.freq);
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## â 2 0 0 2
## âarbara 0 1 0 1
## âust 0 1 0 1
## âm 0 1 0 1
## âs 0 1 0 1
## âm 3 0 0 3
As expected, the lowest frequencies dominate (most terms appear only once or not at all in a given document), but the histogram is not very clear.
If I keep only the 90% of terms with the most repetitions, I expect to exclude these lowest-frequency terms.
# Obtain the 10% quantile and filter above it to keep the top 90% of terms:
q<- quantile(df.terms.freq$total, 0.1);
df.terms.freq<-df.terms.freq[df.terms.freq$total>q,];
summary(df.terms.freq)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 2.00
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 0.00 1st Qu.: 2.00
## Median : 2.000 Median : 2.000 Median : 0.00 Median : 4.00
## Mean : 7.816 Mean : 7.138 Mean : 2.42 Mean : 17.37
## 3rd Qu.: 5.000 3rd Qu.: 5.000 3rd Qu.: 2.00 3rd Qu.: 12.00
## Max. :674.000 Max. :1239.000 Max. :275.00 Max. :1440.00
And I show the histogram again; the graph only covers values up to roughly the third quartile:
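The plotting code for this histogram is not shown; a minimal sketch of how it could be produced, mirroring the ggplot2 histogram call used further below (the break points are an assumption):
# Sketch: histogram of the filtered term frequencies
ggplot(data = df.terms.freq, aes(x = total)) +
    geom_histogram(breaks = seq(2, 12, by = 1),
                   col = "black", fill = "blue", alpha = 0.2) +
    labs(title = "Distribution of word frequencies (top 90% of terms)")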
The most repeated terms in the corpora are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## said 168 1239 33 1440
## will 600 528 182 1310
## one 674 401 185 1260
## like 635 305 251 1191
## get 490 298 275 1063
## time 592 295 145 1032
## just 530 265 234 1029
## can 538 278 167 983
## year 323 554 78 955
## make 432 225 130 787
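The table above implies that df.terms.freq was sorted by total frequency in a step not shown; a minimal sketch of that ordering (an assumption on my part):
# Sketch: order the terms by total frequency and show the ten most frequent
df.terms.freq <- df.terms.freq[order(df.terms.freq$total, decreasing = TRUE), ];
head(df.terms.freq, 10)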
# N-gram tokenizers for the term-document matrices below.
# Note: these functions ignore their `text` argument and always tokenize the
# whole corpus.train, which is why every document column in the bigram and
# trigram matrices below shows identical counts.
Bigram <-
function(text, x=2) NGramTokenizer(corpus.train, Weka_control(min=2, max=2));
Trigram <-
function(text, x=3) NGramTokenizer(corpus.train, Weka_control(min=3, max=3));
tdm.tf.2 <-
TermDocumentMatrix(corpus.train, control=list(tokenize=Bigram));
tdm.tf.3 <-
TermDocumentMatrix(corpus.train, control=list(tokenize=Trigram));
df.terms.freq.2<-as.data.frame(as.matrix(tdm.tf.2));
df.terms.freq.2["total"]<-rowSums(df.terms.freq.2);
summary(df.terms.freq.2)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.154 Mean : 1.154 Mean : 1.154 Mean : 3.462
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :96.000 Max. :96.000 Max. :96.000 Max. :288.000
# Obtain the 10% quantile and filter above it to keep the top 90% of bigrams.
# Note: the filter below reuses the unigram threshold `q` instead of `q2`,
# so in practice no bigrams are removed and the summary below is unchanged.
q2<- quantile(df.terms.freq.2$total, 0.1);
df.terms.freq.2<-df.terms.freq.2[df.terms.freq.2$total>q,];
summary(df.terms.freq.2)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.154 Mean : 1.154 Mean : 1.154 Mean : 3.462
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :96.000 Max. :96.000 Max. :96.000 Max. :288.000
The distribution of bigram frequencies is concentrated around 3-4 repetitions.
The most repeated bigrams are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## ew ork 96 96 96 288
## last year 87 87 87 261
## year ago 71 71 71 213
## right now 68 68 68 204
## feel like 64 64 64 192
## ou can 62 62 62 186
## look like 61 61 61 183
## last night 58 58 58 174
## dont know 57 57 57 171
## first time 51 51 51 153
df.terms.freq.3<-as.data.frame(as.matrix(tdm.tf.3));
df.terms.freq.3["total"]<-rowSums(df.terms.freq.3);
summary(df.terms.freq.3)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.006 Mean : 1.006 Mean : 1.006 Mean : 3.019
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :13.000 Max. :13.000 Max. :13.000 Max. :39.000
# Obtain the 10% quantile and filter above it to keep the top 90% of trigrams.
# Note: as with the bigrams, the filter below reuses the unigram threshold `q`
# instead of `q3`, so in practice no trigrams are removed and the summary
# below is unchanged.
q3<- quantile(df.terms.freq.3$total, 0.1);
df.terms.freq.3<-df.terms.freq.3[df.terms.freq.3$total>q,];
summary(df.terms.freq.3)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 3.000
## 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 1.000 1st Qu.: 3.000
## Median : 1.000 Median : 1.000 Median : 1.000 Median : 3.000
## Mean : 1.006 Mean : 1.006 Mean : 1.006 Mean : 3.019
## 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :13.000 Max. :13.000 Max. :13.000 Max. :39.000
The distribution of trigram frequencies is even more concentrated around 3-4 repetitions.
The most repeated trigrams are:
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## first time sinc 13 13 13 39
## = character 0 12 12 12 36
## ew ork ime 12 12 12 36
## ate ountain park 11 11 11 33
## ew ork iti 10 10 10 30
## resid arack bama 9 9 9 27
## lassic ate ountain 8 8 8 24
## appi ew ear 6 6 6 18
## cant wait see 6 6 6 18
## li kick â 6 6 6 18
Yes, some words are clearly more frequent than others; the most used words in the sample across all documents are:
df.terms.freq[1:10,]
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt total
## said 168 1239 33 1440
## will 600 528 182 1310
## one 674 401 185 1260
## like 635 305 251 1191
## get 490 298 275 1063
## time 592 295 145 1032
## just 530 265 234 1029
## can 538 278 167 983
## year 323 554 78 955
## make 432 225 130 787
ggplot(data = df.terms.freq, aes(x = total)) +
    geom_histogram(breaks = seq(2, 12, by = 1),
                   col = "black",
                   fill = "blue",
                   alpha = 0.2) +
    labs(title = "Distribution of word frequencies")
The distribution of trigram frequencies is more concentrated than that of the bigrams; although the histograms look similar, the trigram distribution is shifted further to the left and its maximum value is lower than the bigram one.
Max bigram frequency: 288.
Max trigram frequency: 39.
Naturally, neither of them is directly comparable with the distribution of single-word frequencies.
df.tf<-as.data.frame(as.matrix(tdm.tf.1));
m<-dim(df.tf)[1];m;
## [1] 27626
df.tf["total"]<-rowSums(df.tf);
x<- quantile(df.tf$total, 0.5);
df.tf<-df.tf[df.tf$total>x,];
n<-dim(df.tf)[1];n
## [1] 12472
The number of words needed to cover 50% of the dictionary is 12472 out of 27626.
df.tf<-as.data.frame(as.matrix(tdm.tf.1));
m<-dim(df.tf)[1];m;
## [1] 27626
df.tf["total"]<-rowSums(df.tf);
x<- quantile(df.tf$total, 0.9);
df.tf<-df.tf[df.tf$total>x,];
n<-dim(df.tf)[1];n
## [1] 2645
The number of words needed to cover 90% of the dictionary is 2645 out of 27626.
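As a cross-check, here is a minimal sketch of an alternative way to answer the coverage question, counting how many of the most frequent words are needed before their cumulative frequency reaches 50% and 90% of all word instances in the sample (this approach is my assumption and differs from the quantile filter used above; tdm.tf.1 is the unigram term-document matrix built earlier):
# Sketch: coverage by cumulative frequency of the sorted terms
freqs <- sort(rowSums(as.matrix(tdm.tf.1)), decreasing = TRUE);
coverage <- cumsum(freqs) / sum(freqs);
n50 <- which(coverage >= 0.5)[1];   # words needed to cover 50% of instances
n90 <- which(coverage >= 0.9)[1];   # words needed to cover 90% of instances
c(n50 = n50, n90 = n90)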
I am not sure, but I think I could use external sources such as:
https://www.wordgamedictionary.com/english-word-list/download/english.txt
To get a complete English dictionary and compare it with the words in my corpus.
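A minimal sketch of that comparison, assuming the word list has been downloaded to a local file named english.txt (the file name and the use of setdiff() are my assumptions; note that stemmed terms will often fail to match a plain dictionary):
# Sketch: flag corpus terms missing from an English word list as candidate
# foreign or erroneous words
english.words <- tolower(readLines("english.txt"));
corpus.terms <- rownames(as.matrix(tdm.tf.1));
foreign.candidates <- setdiff(corpus.terms, english.words);
length(foreign.candidates) / length(corpus.terms)   # share of unmatched terms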
Improving the corpus processing:
Better handling of special characters.
Better detection of foreign words.
Detecting erroneous words and deleting them.
This way the word frequencies become more reliable and the quantiles improve.
Yes, it is very useful, as it greatly reduces the number of words and therefore the memory used.
I need to improve my processing functions, because they have damaged my corpus and my quantile data.
I need to better tune the number of samples used to build the model, because my hardware resources are limited.
I need to better tune the quantiles for the bigrams and trigrams to get the best results with less data.
I think the degree of predictability depends on actual usage and on how well we prepare the data, so I need to study better processing functions to obtain a better corpus.
I think using the n-gram statistics is very useful for suggesting the next word: bigrams when the user has typed only a single word and trigrams when the user has typed two words.
But I think I will need to restrict the options to reduce the memory cost, so when a matching n-gram is not found I will offer the most common word (highest frequency).
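A minimal sketch of that back-off idea, assuming hypothetical lookup tables bigram.freq and trigram.freq (data frames with columns prefix, nextword and count) and a vector top.words of the most frequent single words; none of these objects are built in the code above:
# Sketch: simple back-off next-word prediction.
# Use the trigram table when two words of context are available, fall back to
# the bigram table for one word, and finally to the most frequent word overall.
predictNext <- function(phrase, trigram.freq, bigram.freq, top.words) {
    words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2);
    if (length(words) >= 2) {
        hits <- trigram.freq[trigram.freq$prefix == paste(words, collapse = " "), ];
        if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)]);
    }
    if (length(words) >= 1) {
        hits <- bigram.freq[bigram.freq$prefix == tail(words, 1), ];
        if (nrow(hits) > 0) return(hits$nextword[which.max(hits$count)]);
    }
    top.words[1]   # default: the most common word
}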