This is the first submission for the JH Data Science Specialization Capstone project on Coursera, and it consists of an exploratory data analysis of textual data. The final goal is to build a smart keyboard application in R/Shiny, able to propose guesses for the “best” next word the user will type. The prediction will be based on the data set that I am going to analyse now.
Here we are faced with text data from different sources (twitter, news and blog posts) and in different languages. Data is in English (en_US), German (de_DE), Finnish (fi_FI) and Russian (ru_RU).
My analysis concentrates on the English portion of the data set. I have a good knowledge of languages, coming from Italy (Italian), living in Belgium (French, Dutch), and having a basic knowledge of Spanish and German. The characteristics of these languages, take German for instance, are such that a good prediction algorithm should recognise and take into account the gender of nouns and the grammatical case of the parts of the sentence (mainly nominative, accusative and dative) in order to decline articles, adjectives and sometimes nouns (which is sometimes impossible to predict). Therefore, in my opinion, some knowledge of the language should be built into a prediction algorithm, and vocabulary reduction measures like stemming would most likely not work. This is probably also true for English and other languages.
The single zipped data file was downloaded from HC Corpora and unzipped to extract the text files.
# Loading libraries, in no particular order (many packages are available; it is sometimes hard to make choices)
library(tm)
library(dplyr)
library(stringi)
library(stringr)
library(qdap)
library(quanteda)
library(qdapRegex)
library(openNLP)
library(RWeka)
# library(htmlTable)
library(ggplot2)
library(gridExtra)
library(grid)
library(gtable)
library(knitr)
# Files (Commented out, I only did this once, then I load a saved file when I resume work)
# localFile <- "Coursera-SwiftKey.zip"
# dataFile <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download
# download.file(dataFile, localFile)
# Unzip localFile
# unzip(localFile)
The size of the downloaded data file is 548 MB, which uncompresses to about 1.4 GB. The first thing to say is that there is no way to run any sort of brute-force algorithm keeping all of this data in memory, at least on my PC. Some selection has to be made, because the full data set is too big to explore directly before settling on a strategy. This is already apparent from the loading time of the individual files.
**Notes**
In principle I do not want to exclude the other languages from this exploration, because they present interesting features: in normal German sentences the verb is always in second position, and detecting this specific feature could improve prediction accuracy considerably (to be seen). But I am afraid that for the moment I will have to. Finnish is also interesting, especially from a software localization point of view: the space you have to reserve for the Finnish translation in dialog boxes is almost unpredictable, because the language is very unrelated to the other European languages, for which you typically allow about 30% more room than the English string needs. I was a software localization coordinator for Microsoft for a couple of years, so I know this from experience.
# Reading the files one by one to compute basic statistics for a summary table
# Twitter en_US
twEnUSSize<-file.size("data/final/en_US/en_US.twitter.txt")
twEnUs<-readLines("data/final/en_US/en_US.twitter.txt")
statsTwEnUS<-stri_stats_general(twEnUs)
statsLTwEnUS<-stri_stats_latex(twEnUs)
# News en_US
newsEnUSSize<-file.size("data/final/en_US/en_US.news.txt")
newsEnUs<-readLines("data/final/en_US/en_US.news.txt")
statsNewsEnUS<-stri_stats_general(newsEnUs)
statsLNewsEnUS<-stri_stats_latex(newsEnUs)
# Blogs en_US
blogsEnUSSize<-file.size("data/final/en_US/en_US.blogs.txt")
blogsEnUs<-readLines("data/final/en_US/en_US.blogs.txt")
statsBlogsEnUS<-stri_stats_general(blogsEnUs)
statsLBlogsEnUS<-stri_stats_latex(blogsEnUs)
# Compose a table with the values just computed (MWPL = Mean Words Per Line)
twStatsNames<-c("Source", "Size", "Lines", "Words", "Chars", "MWPL")
twStats<-matrix(nrow=3, ncol=6)
twStats[1,] <- c("Twitter", twEnUSSize, statsTwEnUS["Lines"], statsLTwEnUS["Words"], statsTwEnUS["Chars"], statsLTwEnUS["Words"] / statsTwEnUS["Lines"])
twStats[2,] <- c("News", newsEnUSSize, statsNewsEnUS["Lines"], statsLNewsEnUS["Words"], statsNewsEnUS["Chars"], statsLNewsEnUS["Words"] / statsNewsEnUS["Lines"] )
twStats[3,] <- c("Blogs", blogsEnUSSize, statsBlogsEnUS["Lines"], statsLBlogsEnUS["Words"], statsBlogsEnUS["Chars"], statsLBlogsEnUS["Words"] / statsBlogsEnUS["Lines"])
twStats<-as.data.frame(twStats)
colnames(twStats)<-twStatsNames
# I am unsatisfied with the looks of the table here...
#htmlTable(twStats, caption="Basic statistics about the en_US files", align="l|r|r|r|r|r|")
# grid.table(twStats)
table <- tableGrob(twStats)
title <- textGrob("Text files summary statistics",gp=gpar(fontsize=20))
footnote <- textGrob("Note: Only for the English Language files", x=0, hjust=0, gp=gpar( fontface="italic"))
padding <- unit(0.5,"line")
table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
grid.newpage()
grid.draw(table)
As can be seen from the generated table, the document with the largest number of lines but the smallest mean number of words per line is the Twitter document. Tweets are much shorter than blog posts and news abstracts, being limited to 140 characters and in general to short phrases. They probably also contain a lot of web references, which in my opinion do not constitute valid predictors. In the same way, email addresses should be removed; it is not clear at this point whether replacing them with a space or removing the entire sentence is the better option. Blog articles are the longest on average, with about 41.78 words per line.
Another aspect to evaluate is the total number of (non-unique) words in each text, which makes the three files rather similar, all falling between 30.45 and 37.57 million words.
To make further analysis tractable on this amount of data, I will study a sample of the files.
The first decision to take is the size of the sample we are going to analyse. This is not a trivial task: I do not want to undersample, but my computing resources are limited. So, loosely based on the resources mentioned in the references, which discuss sample size in NLP tasks, I arbitrarily take a sample of 10,000 lines from each of the documents. This should be enough to draw some initial conclusions. Results may vary slightly across samples, so this choice may change and the sampling should be repeated. We can, however, use the samples drawn to estimate the coefficients of Heaps’ law and then use them to estimate the vocabulary size needed to cover 50% and 90% of the total text. According to one of the resources I have researched, Heaps’ law can be roughly approximated by sqrt(text size), which would give approximately 10,125 vocabulary words for the total text.
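As a back-of-the-envelope check of that square-root approximation, the short sketch below assumes a total text size of roughly 102.5 million words (consistent with the per-file counts in the summary table above); the total is my own rounded assumption, not an exact count.
# Rough vocabulary estimate via the sqrt approximation of Heaps' law
totalWords <- 102.5e6        # assumed total for the three en_US files combined
round(sqrt(totalWords))      # ~10,100 terms, in line with the figure quoted above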
# Load all English documents as a corpus
if(!exists("enCorpus", mode = "any"))
enCorpus <- Corpus(DirSource(file.path(".", "data","final","en_US")),
readerControl=list(reader=readPlain, language="en_US"))
# Save it - as is - for future use (load it when needed or when resuming)
if (!file.exists("encorpus")) save(enCorpus,file="encorpus")
# Execute this when restarting before sampling...
# load("encorpus")
# Sampling: See discussion above
sampleSize = 10000
# Seed: make it reproducible
set.seed(20161120)
# Documents are read in alphabetic order, verified (not relevant for the analysis but good for consistency and overall correctness)
blSampleInd <- sample(1:length(enCorpus[[1]][[1]]), sampleSize, replace=FALSE)
neSampleInd <- sample(1:length(enCorpus[[2]][[1]]), sampleSize, replace=FALSE)
twSampleInd <- sample(1:length(enCorpus[[3]][[1]]), sampleSize, replace=FALSE)
# Corpus Reduction
enCorpus[[1]][[1]] <- enCorpus[[1]][[1]][blSampleInd]
enCorpus[[2]][[1]] <- enCorpus[[2]][[1]][neSampleInd]
enCorpus[[3]][[1]] <- enCorpus[[3]][[1]][twSampleInd]
# Saving reduced Corpus
# Save it - as is - for future use (load it when needed or when resuming after sampling)
if (!file.exists("reducedencorpus")) save(enCorpus,file="reducedencorpus")
# load("reducedencorpus")
# Reduced corpus size...
memSize <- object.size(enCorpus)
memSize
## 6246944 bytes
There is no use for misspelled words, numbers, web or email addresses, or Twitter characters like @ and #. However, the order of removal is important: email addresses should be removed before the Twitter markers, otherwise stripping the @ first would leave addresses that the email pattern can no longer match.
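A toy illustration of this ordering point, using qdapRegex's rm_email on a made-up string: once the @ has been stripped, the email pattern no longer matches.
# Toy illustration of why removal order matters (made-up example string)
rm_email("contact me at jane.doe@example.com please")                   # address is removed
rm_email(gsub("@", " ", "contact me at jane.doe@example.com please"))   # address is missed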
# Package qdapRegex contains all the regular expressions needed to clean the text of web addresses, email addresses and Twitter character codes
enCorpus[[1]][[1]]<-rm_url(enCorpus[[1]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
enCorpus[[2]][[1]]<-rm_url(enCorpus[[2]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
enCorpus[[3]][[1]]<-rm_url(enCorpus[[3]][[1]], pattern=pastex("@rm_twitter_url", "@rm_url", "@rm_email"))
# Removing Dates
enCorpus[[1]][[1]]<-rm_date(enCorpus[[1]][[1]], pattern=pastex("@rm_date4"))
enCorpus[[2]][[1]]<-rm_date(enCorpus[[2]][[1]], pattern=pastex("@rm_date4"))
enCorpus[[3]][[1]]<-rm_date(enCorpus[[3]][[1]], pattern=pastex("@rm_date4"))
# Removing numbers and non-word characters (I have to do this separately per document, as calling it on the whole corpus brings down my R session)
enCorpus[[1]][[1]]<-rm_non_words(enCorpus[[1]][[1]])
enCorpus[[2]][[1]]<-rm_non_words(enCorpus[[2]][[1]])
enCorpus[[3]][[1]]<-rm_non_words(enCorpus[[3]][[1]])
# everything to lower case using tm
enCorpus <- tm_map(enCorpus, content_transformer(tolower))
Now I should have a clean corpus in my hands. Note that even for an empty input I should still be able to present a prediction; this is a fine-tuning feature for the final model and I will leave it for later. Notice that I did not eliminate misspelled words. In my opinion this should be done, although such words should not be very frequent (the same spelling error should not repeat very often). But I do not want to lose information at this stage by relying on specific dictionaries, so I do not attempt it.
In any case, by removing words with low frequency it is likely that many of the occasional misspellings and foreign words would also be eliminated, which would in fact strengthen the prediction algorithm. I am going to look in that direction.
I also consider that stemming and removing stopwords is not the way to go (this is my opinion), because they are part of the language(s) and a good predictor should suggest them when needed. In other words, a good prediction algorithm should be capable of proposing stopwords and fully inflected words; stemmed words are essentially misspellings from the point of view of natural language.
Also, I am not removing bad language at this moment. This will be done later in the model and in the application.
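As a note for later, here is a minimal sketch of what that profanity filter could look like, assuming tm's removeWords and a placeholder badWords vector (not an actual list I am using here).
# Sketch only: profanity filtering is deferred to the modelling stage.
# "badWords" is a hypothetical placeholder; a real list would be read from a file.
badWords <- c("badword1", "badword2")
censoredCorpus <- tm_map(enCorpus, removeWords, badWords)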
I am now creating a TDM (Term Document Matrix) from this corpus and then looking at the term frequencies from various angles. Note: in some places I settled for a different strategy to display my data; I leave the attempts that did not provide satisfactory output in the code as placeholders for future improvements.
tdm<-TermDocumentMatrix(enCorpus)
tdm
## <<TermDocumentMatrix (terms: 47309, documents: 3)>>
## Non-/sparse entries: 72496/69431
## Sparsity : 49%
## Maximal term length: 37
## Weighting : term frequency (tf)
# Having a look at Zipf's and Heap's plots...
Zipf_plot(x = tdm)
## (Intercept) x
## 13.207676 -1.256846
Heaps_plot(tdm)
## (Intercept) x
## 2.1524817 0.6402622
# Calculating the frequencies:
wordFreq<-rowSums(as.matrix(tdm))
freqord<-order(wordFreq, decreasing=TRUE)
# Looking at the most frequent words...
wordFreq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
wf <- data.frame(word=names(wordFreq), freq=wordFreq)
kable(head(wf,30), caption="Most frequent words", row.names = FALSE)
| word | freq |
|---|---|
| the | 44462 |
| and | 22457 |
| that | 9667 |
| for | 9280 |
| you | 6694 |
| with | 6310 |
| was | 5724 |
| this | 4674 |
| have | 4577 |
| but | 4349 |
| are | 4237 |
| not | 3578 |
| from | 3477 |
| they | 2997 |
| all | 2937 |
| said | 2916 |
| his | 2914 |
| will | 2832 |
| one | 2652 |
| about | 2568 |
| has | 2473 |
| out | 2438 |
| just | 2317 |
| when | 2313 |
| what | 2305 |
| who | 2268 |
| more | 2200 |
| your | 2174 |
| can | 2155 |
| like | 2133 |
kable(tail(wf,30), caption="Least frequent words", row.names = FALSE)
| word | freq |
|---|---|
| zuckerberg’s | 1 |
| zuckerburg | 1 |
| zuhlsdorf | 1 |
| zukor | 1 |
| zulia | 1 |
| zulu | 1 |
| zuni | 1 |
| zunilda | 1 |
| zunt | 1 |
| zürich | 1 |
| zuzana | 1 |
| zvs | 1 |
| zwane | 1 |
| zwerling | 1 |
| zydeco | 1 |
| zydrunas | 1 |
| zyl | 1 |
| zylstra | 1 |
| zynga | 1 |
| zz’s | 1 |
| αφα | 1 |
| помощи | 1 |
| рук | 1 |
| самих | 1 |
| утопающим | 1 |
| утопающих | 1 |
| 案ずるより産むが易し | 1 |
| 潘兆初法官 | 1 |
| 犀利人妻 | 1 |
| 红腐乳 | 1 |
# Table generation with GridExtra not satisfactory. Table gets truncated, could not find way around.
# table <- tableGrob(head(wf, 12), gp = gpar(fontsize = 8), rows = NULL )
# title <- textGrob("Most frequent words",gp=gpar(fontsize=14))
# footnote <- textGrob("Note: Stemming and stopwords not removed - on purpose", x=0, hjust=0, gp=gpar( fontface="italic"))
# padding <- unit(2,"line")
# table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
# table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
# table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
# grid.newpage()
# grid.draw(table)
# Looking at the least frequent words
# Table generation (I should make a function for this...)
# table <- tableGrob(tail(wf,12), theme =ttheme_default(gpar.coretext =gpar(fontsize=7), gpar.coltext=gpar(fontsize=8, fontface='bold'), gpar.rowtext=gpar(fontsize=7, fontface='bold') ), rows = NULL )
#
# title <- textGrob("Least frequent words",gp=gpar(fontsize=14))
# footnote <- textGrob("Note: Many spelling mistakes, foreign text, proper nouns etc...", x=0, hjust=0, gp=gpar( fontface="italic"))
# padding <- unit(2,"line")
# table <- gtable_add_rows(table, heights = grobHeight(title) + padding, pos = 0)
# table <- gtable_add_rows(table, heights = grobHeight(footnote)+ padding)
# table <- gtable_add_grob(table, list(title, footnote), t=c(1, nrow(table)), l=c(1,2), r=ncol(table))
# grid.newpage()
# grid.draw(table)
# These look ugly
# head(wordFreq, 30)
# wordFreq[head(freqord,30)]
# wordFreq[tail(freqord,30)]
# Plot of the words with frequency > 1500 in my sample
subset(wf, freq>1500) %>%
ggplot(aes(x=reorder(word, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# How much do the words with freq == 1 count in this structure? How many are they in total?
count(wf[which(wf$freq==1),])
## # A tibble: 1 × 1
## n
## <int>
## 1 23297
# And what percentage of our data is that?
count(wf[which(wf$freq==1),]) / nrow(wf)
## n
## 1 0.4924433
# And how much does "the rest" count?
sum(wf[which(wf$freq>1),]$freq)
## [1] 669763
# What is the dictionary coverage of the words with freq==1?
#sum(wf[which(wf$freq==1),]$freq)/(sum(wf[which(wf$freq==1),]$freq) + sum(wf[which(wf$freq>1),]$freq))
sum(wf[which(wf$freq==1),]$freq)/sum(wf$freq)
## [1] 0.03361469
# If I stretch this reasoning, I could argue that there are other words with little predictive value and define a threshold (e.g. 3)
count(wf[which(wf$freq<=3),])
## # A tibble: 1 × 1
## n
## <int>
## 1 33378
# And how much does "the rest" count?
sum(wf[which(wf$freq>3),]$freq)
## [1] 646226
# How much predictive capability would I lose? (just on unigram terms)
#sum(wf[which(wf$freq<=3),]$freq)/(sum(wf[which(wf$freq<=1),]$freq) + sum(wf[which(wf$freq>3),]$freq))
sum(wf[which(wf$freq<=3),]$freq)/sum(wf$freq)
## [1] 0.06757568
# Can I generalize?
The Zipf’s law fit is only reasonably good, while the parameters of the Heaps’ law fit look very good, so the latter could really help in establishing a good sample size for the final application.
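To illustrate, the sketch below plugs the coefficients reported by Heaps_plot (log V = 2.15 + 0.64 log N) into V = K * N^beta; the full-corpus size of roughly 102.5 million words is my own assumption based on the summary statistics above, not an exact count.
# Extrapolating vocabulary size with the fitted Heaps' law coefficients
heapsK    <- exp(2.1524817)   # intercept from Heaps_plot, back-transformed
heapsBeta <- 0.6402622        # slope from Heaps_plot
predictVocab <- function(nTokens) round(heapsK * nTokens ^ heapsBeta)
predictVocab(sum(wf$freq))    # the sample itself: close to the observed 47,309 terms
predictVocab(102.5e6)         # assumed full corpus: on the order of 1.2 million terms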
The above analysis, together with the references, tells me that terms occurring only once in the text amount to about half of the vocabulary while covering less than 4% of the text. They contribute to the sparsity of the matrix and provide little predictive value. We can probably live without them and still cover 96% of the text or more, while roughly halving the number of terms in the TDM.
Even if I dropped the terms that occur 3 times or fewer, I would still cover about 93% of the text. I have also tried other sample sizes, and the results seem to hold even better as the sample size increases. See also the literature in the references about dictionary size versus language coverage.
Let us look at the coverage we can obtain with what fraction of the initial TDM, supposing that we remove the terms with freq = 1, then freq <= 2, and so on up to freq <= 50 (an arbitrary upper bound). I am almost certain that I will not remove all of these from the final model.
for (frequency in 1:50) {
# How many words with this frequency or less
qwRem <- count(wf[which(wf$freq<=frequency),])
# And how much coverage do we lose?
lostCov<-sum(wf[which(wf$freq<=frequency),]$freq)/sum(wf$freq)
# And what proportion of our data is that?
pwRem<-count(wf[which(wf$freq<=frequency),]) / nrow(wf)
# And how much does "the rest" count?
remainCov<-sum(wf[which(wf$freq>frequency),]$freq)/sum(wf$freq)
if (!exists("remUnfreqTerms")) {
remUnfreqTerms<-data.frame(frequency, qwRem, lostCov, pwRem, remainCov, nrow(wf)-qwRem)
}
else {
remUnfreqTerms<-rbind(remUnfreqTerms, data.frame(frequency, qwRem, lostCov, pwRem, remainCov, nrow(wf)-qwRem))
}
}
colnames(remUnfreqTerms)<-c("FrequencyRemoved", "Removed", "LostCoverage", "ProportionWordsRemoved", "RemainingCoverage", "TermsRemaining")
# Plotting the removed freq vs coverage and the size of the remaining dictionary
ggplot(remUnfreqTerms, aes(x=FrequencyRemoved)) +
geom_line( colour="blue", aes(y=Removed) ) +
geom_line( colour="red", aes(y=TermsRemaining)) +
labs(title="Words removed (blue) and remaining (red) losing infrequent terms")
ggplot(remUnfreqTerms, aes(x=FrequencyRemoved)) +
geom_line( colour="blue", aes(y=RemainingCoverage) ) +
geom_line( colour="red", aes(y=LostCoverage)) +
labs(title="Coverage loss (red) and remaining (blue) removing infrequent terms")
In fact, these coverage plots are just another visualization of Heaps’ law (on a linear rather than a log scale).
The number of n-grams in a text is known in advance: it is a linear function of the total number of words, since a sentence of L words contains L - (n - 1) n-grams. Therefore, if we tokenize and keep the selected vocabulary to a reasonable size, the structures holding the n-grams should also stay reasonable in size. However, I expect the n-gram matrices to be considerably sparser than the unigram TDM.
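As a quick sanity check of the L - (n - 1) rule, the toy example below tokenizes a 7-word sentence with the same RWeka tokenizer used below for the real corpus.
# Toy check of the L - (n - 1) rule on a sentence of L = 7 words
toySentence <- "the quick brown fox jumps over fences"
length(NGramTokenizer(toySentence, Weka_control(min = 2, max = 2)))   # 6 bigrams
length(NGramTokenizer(toySentence, Weka_control(min = 3, max = 3)))   # 5 trigrams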
# Sets the default number of threads to use. Thanks, Stack Overflow! I spent days on this single problem!
options(mc.cores=1)
bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
# Finding bigrams
tdm2<- TermDocumentMatrix(enCorpus, control = list(tokenize = bigramTokenizer))
# Finding trigrams
tdm3<- TermDocumentMatrix(enCorpus, control = list(tokenize = trigramTokenizer))
# Calculating the frequencies of bi-grams
bigramFreq<-rowSums(as.matrix(tdm2))
# Looking at the bigrams frequencies...
bigramFreq <- sort(rowSums(as.matrix(tdm2)), decreasing=TRUE)
bif <- data.frame(bigram=names(bigramFreq), freq=bigramFreq)
# Looking at the most frequent bigrams...
kable(head(bif,30), caption="Most frequent bigrams", row.names = FALSE)
| bigram | freq |
|---|---|
| of the | 4197 |
| in the | 3759 |
| to the | 1965 |
| it s | 1911 |
| on the | 1681 |
| for the | 1678 |
| i m | 1447 |
| to be | 1433 |
| at the | 1322 |
| don t | 1279 |
| and the | 1230 |
| in a | 1074 |
| with the | 996 |
| is a | 918 |
| and i | 877 |
| for a | 840 |
| it is | 831 |
| it was | 829 |
| from the | 769 |
| of a | 766 |
| with a | 761 |
| i have | 751 |
| will be | 731 |
| i was | 727 |
| if you | 686 |
| one of | 680 |
| that s | 675 |
| is the | 665 |
| as a | 664 |
| that i | 622 |
# Looking at the least frequent bigrams...
kable(tail(bif,30), caption="Least frequent bigrams", row.names = FALSE)
| bigram | freq |
|---|---|
| zuma zapiro | 1 |
| zumba classes | 1 |
| zumba fitness | 1 |
| zumwalt east | 1 |
| zunilda junco | 1 |
| zuni tamaroa | 1 |
| zunt for | 1 |
| zürich today | 1 |
| zutic demands | 1 |
| zutic says | 1 |
| zuzana hejnova | 1 |
| zvs induction | 1 |
| zwane said | 1 |
| zwerling were | 1 |
| zydeco something | 1 |
| zydrunas ilgauskas | 1 |
| zyl rsa | 1 |
| zylstra ppg | 1 |
| zynga poker | 1 |
| zz s | 1 |
| zz top | 1 |
| αφα ice | 1 |
| дело помощи | 1 |
| дело рук | 1 |
| помощи утопающим | 1 |
| рук самих | 1 |
| самих утопающих | 1 |
| утопающим дело | 1 |
| 潘兆初法官 decided | 1 |
| 红腐乳 this | 1 |
# Object size
bigramSize<- object.size(bigramFreq)
bigramSize
## 20835024 bytes
# Calculating the frequencies of tri-grams
trigramFreq<-rowSums(as.matrix(tdm3))
# Looking at the tri-grams freqs...
trigramFreq <- sort(rowSums(as.matrix(tdm3)), decreasing=TRUE)
trif <- data.frame(trigram=names(trigramFreq), freq=trigramFreq)
# Looking at the most frequent trigrams...
kable(head(trif,30), caption="Most frequent trigrams", row.names = FALSE)
| trigram | freq |
|---|---|
| i don t | 455 |
| one of the | 337 |
| a lot of | 275 |
| it s a | 247 |
| i m not | 165 |
| i didn t | 162 |
| out of the | 154 |
| i can t | 152 |
| some of the | 146 |
| it s not | 144 |
| the end of | 144 |
| to be a | 143 |
| going to be | 134 |
| you don t | 134 |
| i ve been | 129 |
| the u s | 129 |
| part of the | 128 |
| it was a | 124 |
| don t know | 119 |
| i want to | 118 |
| as well as | 113 |
| be able to | 113 |
| thanks for the | 111 |
| i have to | 106 |
| don t have | 105 |
| t want to | 104 |
| this is a | 102 |
| the first time | 100 |
| the rest of | 100 |
| it s the | 98 |
# Looking at the least frequent trigrams...
kable(tail(trif,30), caption="Least frequent trigrams", row.names = FALSE)
| trigram | freq |
|---|---|
| zuma they create | 1 |
| zuma wants an | 1 |
| zuma zapiro s | 1 |
| zumba classes i | 1 |
| zumba fitness and | 1 |
| zumwalt east but | 1 |
| zumwalt north had | 1 |
| zumwalt west s | 1 |
| zunilda junco who | 1 |
| zuni tamaroa maritime | 1 |
| zunt for another | 1 |
| zürich today and | 1 |
| zutic demands move | 1 |
| zuzana hejnova cze | 1 |
| zva blog hop | 1 |
| zva blog starting | 1 |
| zvs induction heater | 1 |
| zwane said the | 1 |
| zwerling were elected | 1 |
| zydeco something on | 1 |
| zydrunas ilgauskas juwan | 1 |
| zyl rsa and | 1 |
| zz top was | 1 |
| дело помощи утопающим | 1 |
| дело рук самих | 1 |
| помощи утопающим дело | 1 |
| рук самих утопающих | 1 |
| утопающим дело рук | 1 |
| 潘兆初法官 decided and | 1 |
| 红腐乳 this recipe | 1 |
trigramSize<- object.size(trigramFreq)
trigramSize
## 41353584 bytes
Some of the bigrams and trigrams seem trivial, and yet they are wrong. Removing the punctuation also removed the apostrophe, so a trigram like “i don t”, although among the most frequent in my structure, would not be a well-spelled suggestion. This is true in many cases. I am not sure how to proceed here; possibly removal is a better option than offering a wrongly spelled prediction.
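One possible remedy, sketched below under the assumption that the non-word cleaning step is redone with gsub instead of rm_non_words, is to keep apostrophes that sit inside a word so that contractions like “don't” survive; whether the downstream tokenizers then keep “don't” as a single token would still need to be checked.
# Sketch: strip digits and punctuation but keep intra-word apostrophes,
# so contractions such as "don't" and "it's" are preserved.
keepApostrophes <- function(x) {
  x <- gsub("[^a-z' ]", " ", tolower(x))        # keep only letters, apostrophes, spaces
  x <- gsub("\\B'|'\\B", " ", x, perl = TRUE)   # drop apostrophes not inside a word
  gsub("\\s+", " ", trimws(x))                  # collapse repeated whitespace
}
keepApostrophes("I don't think it's over!")     # "i don't think it's over"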
The size of the structures also has to be kept in mind, at about 21 MB for the bigrams and 41 MB for the trigrams, so a 4-gram option may well be too much. Some reduction measure should be considered, as the bigrams and trigrams with frequency 1 are almost all useless for prediction (foreign words, proper nouns, misspellings, etc.).
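A minimal sketch of such a reduction, simply dropping the singleton n-grams from the frequency tables built above and comparing memory footprints (the threshold of 1 is just an example; droplevels is used because the n-gram column is a factor):
# Sketch: drop n-grams seen only once and compare object sizes
bifPruned  <- droplevels(bif[bif$freq > 1, ])
trifPruned <- droplevels(trif[trif$freq > 1, ])
object.size(bif);  object.size(bifPruned)
object.size(trif); object.size(trifPruned)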
# Plot of the bigrams with frequency > 1000 in my sample
subset(bif, freq>1000) %>%
ggplot(aes(x=reorder(bigram, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
# Plot of the trigrams with frequency > 100 in my sample
subset(trif, freq>100) %>%
ggplot(aes(x=reorder(trigram, -freq),y=freq, fill=freq)) +
geom_bar(stat="identity", position=position_dodge(), colour="black") +
theme(axis.text.x=element_text(angle=45, hjust=1))
At the moment, my orientation is as follows: