Overview

The goal of the project “Data Science Capstone” is to build an application that predicts the next word from a given word sequence (e.g. the two, three, or four preceding words).

For this purpose, the data source “Coursera-SwiftKey.zip” is made available within the project; it provides three resources (blogs, news, and Twitter) in several languages (US English, German, Russian, Finnish).

URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The remainder of the project considers only the English data sources.
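
Under the assumption that the archive is stored and extracted in a local data folder, the data could be obtained as in the following sketch (the folder name is an assumption; the URL is the one given above):

l_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data")) dir.create("data")
l_zip <- file.path("data", "Coursera-SwiftKey.zip")
if (!file.exists(l_zip)) download.file(l_url, destfile = l_zip, mode = "wb")
unzip(l_zip, exdir = "data")   # the archive contains a final/<locale>/ folder structure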

The aim of this milestone report is to describe the data, the data cleaning, an exploratory data analysis, and an outlook on the planned next-word prediction algorithm.

rm( list = ls( all = TRUE ))

library(dplyr)     # bind_rows, mutate, count, group_by
library(stringr)   # str_extract

# dataframe: combine the three samples into one data frame with columns src and txt
l_df_all <- bind_rows( l_df_blogs_sample, l_df_news_sample, l_df_twitters_sample )

# count words: str_extract() keeps the first word (lowercase letters and apostrophes) of each text line
l_freq <- l_df_all %>% 
          mutate(word = str_extract(txt, "[a-z']+")) %>%
          dplyr::count(src, word) %>%
          group_by(src)
ngrams <- function( df ) {
# input : df with a text column txt
# output: list of (top 50 1-grams, top 50 4-grams)
# requires the tidytext, tidyr, and dplyr packages
  
  l_df <- df
  
  # one grams
  system.time(
    l_df_1gram <- l_df %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
  )
  
  ## fourgrams
  system.time(
    l_df_4grams <- l_df %>%
      unnest_tokens( fourgram, txt, token = "ngrams", n = 4 ) %>%
      separate( fourgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
      dplyr::count(word1, word2, word3, word4, sort = TRUE)
  )
  
  return( list(head(l_df_1gram,50), head(l_df_4grams,50)) )
  
}  
tagPOS <-  function(x, ...) {
  # POS-tag a text string with the Apache OpenNLP maxent tagger (requires the NLP and openNLP packages)
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, word_token_annotator, a2)
  a3 <- NLP::annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  POSwords <- unlist(s[a3w])
  list(POStagged = POStagged, POSwords=POSwords, POStags = POStags)
}
get_pos <- function( x ) {
  
# split the raw text into sentences and derive POS-tag n-gram tables
  l_sentences <- qdap::sent_detect_nlp(x)
  
  l_df <- tibble(txt=l_sentences)
  l_gram_sen <- l_df %>%
    dplyr::count(txt, sort = TRUE)  

  l_tagged_str <-  tagPOS(l_gram_sen$txt)
   
  ## list(POStagged = POStagged, POStags = POStags)
  
  ## onegram for POStags
  l_df_tags <- tibble(txt=l_tagged_str$POStags)
  system.time(
    l_df_tag_1grams <- l_df_tags %>%
      unnest_tokens( "tag1", txt, token = "ngrams", to_lower = FALSE, n = 1 ) %>%
      dplyr::count(tag1, sort = TRUE)
  )
  
  ## bigrams
  l_df_tags <- tibble(txt=l_tagged_str$POStags)
  system.time(
    l_df_tag_2grams <- l_df_tags %>%
      unnest_tokens( bigram, txt, token = "ngrams", to_lower = FALSE, n = 2 ) %>%
      separate( bigram, c("tag1", "tag2"), sep = " ") %>%
      dplyr::count(tag1, tag2, sort = TRUE)
  )
    
  ## onegrams for POSwords (the tagged words themselves)
  l_df_words <- tibble(txt=l_tagged_str$POSwords)
  system.time(
    l_df_words_1grams <- l_df_words %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
   )   
  
  ## word/tag pairs: count each word together with its POS tag
  l_df_POSwordtags <- tibble(word1=l_tagged_str$POSwords, tag1=l_tagged_str$POStags)
  system.time(
    l_df_POStag_2grams <- l_df_POSwordtags %>%
      dplyr::count(word1, tag1, sort = TRUE)
   )     
  
## return  
  return(list(head(l_df_tag_1grams,25), 
              head(l_df_tag_2grams,25),
              head(l_df_words_1grams,25),
              head(l_df_POStag_2grams,25)
             )
        )
}
explore_pos <- function( filename ) {
  
##  read sample RDS file     
  l_filename <- paste0(getwd(), "/data/sample/", filename)
  l_filename <- stringi::stri_replace_last_fixed(str=l_filename, ".txt", "_tidy.rds")
  system.time(
    l_df <- readRDS(file=l_filename)  
  ) 
  l_inf <- file.info(l_filename)
  l_ds <- l_df$txt    
  
## return  
  return( get_pos( x=l_ds ) )
  
}

Summary

This section summarizes the work steps and results of the data provision, data cleaning, and exploratory data analysis. More detailed information, including tables and graphs, can be found in the Appendix.

Text Data Preparation

Since the data come from very different sources and were probably produced on different devices (e.g. Twitter on mobile phones), the data sets are treated separately. The files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt are read in, cleaned separately, and the exploratory data analysis is carried out separately for each of them; the line and character counts below can be determined as sketched after the list.

  • Dataset: en_US.blogs.txt with 899288 text lines and 208361438 characters.
  • Dataset: en_US.news.txt with 77259 text lines and 15683765 characters.
  • Dataset: en_US.twitter.txt with 2360148 lines and 162385035 characters.
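
The counts above could be determined as in the following sketch (the path assumes the archive was extracted into data/final/ as in the download sketch above):

l_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
for (l_f in l_files) {
  l_lines <- readLines(file.path("data/final/en_US", l_f), encoding = "UTF-8", skipNul = TRUE)
  cat(l_f, ":", length(l_lines), "text lines,", sum(nchar(l_lines)), "characters\n")
}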

Since the amount of data is very large, a 10 % sample of the text lines is drawn before data cleaning and exploratory data analysis.

This restriction means that the sample does not contain every word of the full corpus. The words and word combinations with the highest frequency of occurrence, however, are very likely to appear in the sample. As a first approximation, Zipf’s law can be used.
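
The 10 % sample could be drawn as in the following sketch (shown for the blogs file; the seed and the file path are assumptions):

set.seed(1234)
l_lines  <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
l_sample <- sample(l_lines, size = round(0.10 * length(l_lines)))
l_df_blogs_sample <- tibble::tibble(src = "blogs", txt = l_sample)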

The following boxplot shows the first differences between the datasets:

# boxplot of the text-line length (number of characters) per source (requires ggplot2)
ggplot(l_df_all, aes(x=factor(src), y=nchar(txt), fill=src)) + geom_boxplot() + theme(legend.position="none")

  • The number of text lines is different.
  • The maximum length and the median length of the entries (text lines) differ.
  • The median and the mean differ: the distributions are right-skewed, and shorter texts predominate.

Text Data Cleaning

The following steps are taken to obtain cleaned data (a code sketch follows the list):

  • remove all text lines recognized as foreign languages
  • remove all text lines containing swear words
  • replace contractions
  • replace abbreviations
  • remove brackets
  • replace ordinals
  • convert to lower case
  • replace special expressions
  • remove all text lines containing numbers
  • remove punctuation
  • remove all non-alphabetic characters
  • trim all whitespace
  • remove all text lines without content
  • remove all text lines with fewer than 3 characters
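
A minimal sketch of such a cleaning pipeline is given below. It assumes the stringr and cld2 packages; swear_words stands for a hypothetical list of swear words, and the replacement of contractions, abbreviations, and ordinals is only indicated as a comment.

library(stringr)

clean_lines <- function(txt, swear_words = character(0)) {
  # keep only lines recognized as English (language detection via the cld2 package)
  l_lang <- cld2::detect_language(txt)
  txt <- txt[!is.na(l_lang) & l_lang == "en"]
  # remove lines containing swear words (swear_words is a hypothetical word list)
  if (length(swear_words) > 0) {
    txt <- txt[!str_detect(tolower(txt), paste(swear_words, collapse = "|"))]
  }
  # contractions, abbreviations, and ordinals could be replaced here,
  # e.g. with functions from the textclean package (not shown)
  txt <- tolower(txt)                           # convert to lower case
  txt <- txt[!str_detect(txt, "[0-9]")]         # remove lines containing numbers
  txt <- str_replace_all(txt, "[^a-z' ]", " ")  # keep only alphabetic characters and apostrophes
  txt <- str_squish(txt)                        # trim and collapse whitespace
  txt[nchar(txt) >= 3]                          # drop empty and very short lines
}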

Percentage of text lines recognized as English:

  • File en_US.blogs.txt : 84.4 percent of the text lines are recognized as English.
  • File en_US.news.txt : 90.8 percent of the text lines are recognized as English.
  • File en_US.twitter.txt : 80.6 percent of the text lines are recognized as English.
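
These percentages could be computed, for example, with the language detection of the cld2 package (a sketch, shown for en_US.blogs.txt; the exact definition of the percentage may differ from the one used above):

l_lines <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
l_lang  <- cld2::detect_language(l_lines)
round(100 * sum(l_lang == "en", na.rm = TRUE) / length(l_lines), 1)   # share of lines recognized as English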

Text Data Exploratory Analysis

Histograms: count of words

To show the differences in the word frequencies, the following histograms are provided. The log transformation of the frequencies makes the differences somewhat clearer; otherwise the histograms would look very similar.

## Histograms: log-transformed word frequencies per source (requires ggplot2)
  l_total <- nrow(l_freq)   # total number of (source, word) combinations
  ggplot(l_freq, aes(log(n)/l_total, fill=src)) +
    geom_histogram(show.legend = FALSE, bins = 30) +
    facet_wrap(~src, ncol = 3, scales = "free_y")

Ngrams

For the cleaned data sets, a 1-gram table and a 4-gram table are created and output.

This shows how quickly the frequency of occurrence decreases from 1-grams to 4-grams, while the word combinations become more specific.

Since articles, prepositions, and conjunctions are used very frequently in sentences, they also appear among the top entries of the tables and charts.

l_out <- ngrams( df=l_df_blogs_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (2354249 words in total) are skewed:

  • 86 words cover 50 % of the corpus “blogs_tidy.txt”,
  • 3340 words cover 90 % of the corpus “blogs_tidy.txt”, and
  • 11591 words cover 100 % of the corpus “blogs_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.
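
The coverage figures above can be derived from a full (un-truncated) 1-gram frequency table sorted by descending count. The following sketch uses a hypothetical table l_df_1gram_full with columns word1 and n:

coverage <- function(df_1gram, p) {
  l_cum <- cumsum(df_1gram$n) / sum(df_1gram$n)   # cumulative share of word occurrences
  which(l_cum >= p)[1]                            # number of top words needed to reach coverage p
}
## coverage(l_df_1gram_full, 0.50)   # words covering 50 % of the corpus
## coverage(l_df_1gram_full, 0.90)   # words covering 90 % of the corpus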

knitr::kable(head(l_df_1ngram,25), caption="Onegram (blogs.txt)", format="html")
Onegram (blogs.txt)
word1 n
the 66859
and 42179
i 42149
to 40606
a 33510
of 30539
in 22676
is 19547
that 18545
it 17394
for 13943
you 12472
not 12369
was 12225
my 12208
have 11493
this 10608
with 9797
are 8679
on 8679
as 8370
but 8009
be 7903
we 7793
so 7153
knitr::kable(head(l_df_4ngram,25), caption="4-gram (blogs.txt)", format="html")
4-gram (blogs.txt)
word1 word2 word3 word4 n
i am going to 329
i would love to 229
if you would like 210
you would like to 210
is one of the 206
love to tell you 186
to tell you that 186
would love to tell 186
one of the most 163
the rest of the 159
have to admit that 158
on the other side 158
to admit that i 155
i have to admit 153
i am not a 152
and i do not 151
i do not think 146
i do not know 137
the other khador player 130
can not wait to 128
for those of you 125
tell you that i 124
the one thing that 124
i can not get 123
a big fan of 122
l_out <- ngrams( df=l_df_news_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (141219 words in total) are skewed:

  • 130 words cover 50 % of the corpus “news_tidy.txt”,
  • 2130 words cover 90 % of the corpus “news_tidy.txt”, and
  • 4662 words cover 100 % of the corpus “news_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.

knitr::kable(head(l_df_1ngram,25), caption="Onegram (news.txt)", format="html")
Onegram (news.txt)
word1 n
the 75563
to 33980
and 32303
a 30126
of 24216
in 18669
is 16932
that 16321
for 12442
said 12199
it 11863
on 10563
not 10047
he 9946
with 9245
was 8747
i 7466
they 7252
are 7108
have 6742
at 6086
but 6084
we 5731
as 5392
be 4832
knitr::kable(head(l_df_4ngram,25), caption="4-gram (news.txt)", format="html")
4-gram (news.txt)
word1 word2 word3 word4 n
said with a laugh 326
in a row in 256
chief executive officer of 228
that he was not 213
in the last year 205
i do not think 200
when it comes to 198
additional stolen property was 195
connected to our investigation 195
found connected to our 195
in their room additional 195
our investigation poulin said 195
property was found connected 195
room additional stolen property 195
stolen property was found 195
their room additional stolen 195
to our investigation poulin 195
was found connected to 195
about their gratitude for 193
also spoke about their 193
for the outpouring of 193
from the community and 193
gratitude for the outpouring 193
of support from the 193
outpouring of support from 193
l_out <- ngrams( df=l_df_twitters_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (2120216 words in total) are skewed:

  • 85 words cover 50 % of the corpus “twitters_tidy.txt”,
  • 2435 words cover 90 % of the corpus “twitters_tidy.txt”, and
  • 7826 words cover 100 % of the corpus “twitters_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.

knitr::kable(head(l_df_1ngram,25), caption="Onegram (twitter.txt)", format="html")
Onegram (twitter.txt)
word1 n
the 66451
i 58853
to 57808
a 41646
you 40772
is 39061
and 35807
it 28984
in 25006
for 24774
of 24179
not 23332
my 21228
that 18804
on 17993
me 15399
do 15022
have 14968
with 14602
are 14417
be 12292
your 12001
so 11807
this 11529
we 11133
knitr::kable(head(l_df_4ngram,25), caption="4-gram (twitter.txt)", format="html")
4-gram (twitter.txt)
word1 word2 word3 word4 n
thanks for the follow 894
what is your favorite 495
i am going to 463
can not wait to 398
i am not the 342
i do not think 337
i would like to 291
thanks for following me 285
i am pretty sure 277
i do not want 255
let us know if 242
i do not know 237
if you do not 230
i do not like 228
if i do not 223
do not have to 215
thanks for the mention 212
to do when he 204
what to do when 204
have a great day 203
i would love to 200
do not see it 199
the end of the 199
hope to see you 198
i cannot wait to 198
Part of Speech

A POS tagger is used to generate tagged text from raw text using a tag set.

## Example: 
## Rawtext       time     flies      like      an     arrow.
## Tagged Text   time[NN] flies[VBZ] like[IN] an[DT] arrow[NN].

## NN    Noun, singular or mass
## IN    Preposition or subordinating conjunction
## DT    Determiner
## VBZ   Verb, 3rd person singular present
## List of tags
## ----------------------------------------------
## CC    Coordinating conjunction
## CD    Cardinal number
## DT    Determiner
## EX    Existential there
## FW    Foreign word
## IN    Preposition or subordinating conjunction
## JJ    Adjective 
## JJR   Adjective, comparative
## JJS   Adjective, superlative
## LS    List item marker
## MD    Modal
## NN    Noun, singular or mass
## NNS   Noun, plural
## NNP   Proper noun, singular
## NNPS  Proper noun, plural
## PDT   Predeterminer
## POS   Possessive ending
## PRP   Personal pronoun
## PRP$  Possessive pronoun
## RB    Adverb
## RBR   Adverb, comparative
## RBS   Adverb, superlative
## RP    Particle
## SYM   Symbol
## TO    to
## UH    Interjection
## VB    Verb, base form
## VBD   Verb, past tense
## VBG   Verb, gerund or present participle
## VBN   Verb, past participle
## VBP   Verb, non-3rd person singular present
## VBZ   Verb, 3rd person singular present
## WDT   Wh-determiner
## WP    Wh-pronoun
## WP$   Possessive wh-pronoun
## WRB   Wh-adverb
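
The following usage sketch applies the tagPOS() function defined above to the example sentence; it assumes that the NLP and openNLP packages (with the English model data, e.g. openNLPmodels.en) are installed:

library(NLP)
library(openNLP)
l_tagged <- tagPOS("time flies like an arrow")
l_tagged$POStagged   # "time/NN flies/VBZ like/IN an/DT arrow/NN" (cf. the example above)
l_tagged$POStags     # "NN" "VBZ" "IN" "DT" "NN"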

In a Hidden Markov model, two types of information are useful:

  • relations between words and tags
  • relations between tags and tags
system.time(
  l_grams_list <- explore_pos( filename="en_US.blogs.txt" ) 
)  
## Registered S3 methods overwritten by 'qdap':
##   method               from
##   t.DocumentTermMatrix tm  
##   t.TermDocumentMatrix tm
##    user  system elapsed 
##  204.25    3.93  177.49
l_df_tag_1grams    <- l_grams_list[[1]]
l_df_tag_2grams    <- l_grams_list[[2]]
l_df_POStag_1grams <- l_grams_list[[3]]
l_df_POStag_2grams <- l_grams_list[[4]]

The following table shows the relations between tags and tags and their frequencies.

knitr::kable(head(l_df_tag_2grams,25), caption="2-grams for tags (en_US.blogs.txt)", format="html")
2-grams for tags (en_US.blogs.txt)
tag1 tag2 n
DT NN 2596
IN DT 2030
NN IN 1914
JJ NN 1392
IN PRP 1382
DT JJ 1143
TO VB 1057
PRP VBP 1036
PRP VBD 963
NN NN 875
NNP NNP 788
NN CC 759
NN PRP 755
IN NN 728
PRP NN 700
NNS IN 693
MD VB 611
JJ NNS 591
VB DT 535
PRP MD 510
VB PRP 489
IN NNP 464
RB IN 440
DT NNS 430
NN RB 425

The following graph illustrates the network of relationships based on the pairwise relationships of the 25 most common tags.

## graph of the tag-to-tag relations (requires igraph and ggraph)
  bigram_graph <- l_df_tag_2grams %>%
    graph_from_data_frame() 

  set.seed(2017)
  l_df_2grapf <-  ggraph(bigram_graph, layout = "fr") +
                  geom_edge_link() +
                  geom_node_point() +
                  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
  
  l_df_2grapf 

The following table shows the relations between words and tags and their frequencies.

knitr::kable(head(l_df_POStag_2grams,25), caption="2-grams for word and tag (en_US.blogs.txt)", format="html")
2-grams for word and tag (en_US.blogs.txt)
word1 tag1 n
the DT 2317
to TO 1547
and CC 1508
I PRP 1490
a DT 1242
of IN 1169
in IN 785
is VBZ 775
it PRP 563
for IN 506
not RB 445
. . 444
was VBD 434
you PRP 420
that IN 406
my PRP$ 395
with IN 379
are VBP 328
on IN 321
this DT 313
be VB 300
have VBP 290
The DT 268
will MD 244
as IN 239

Outlook on the Algorithm

A possible model for estimating next-word probabilities is a Hidden Markov Model.

A hidden Markov model will serve as the basis for estimating the next word using a tag sequence.

As described above, two types of information are used in the Hidden Markov model: the relations between words and tags (emission probabilities) and the relations between tags and tags (transition probabilities).

Approach

The goal is to find the best sequence of tags (t1, … ,tn) for the observed sequence of words (w1, … ,wn), i.e. to choose the tag sequence that is most probable given the observation sequence.

… t1(n) = argmax p( t1(n) | w1(n) ) over all tag sequences t1(n)

Using Bayes’ rule:

… t1(n) = argmax p( w1(n) | t1(n) ) * p( t1(n) )

Using the Markov assumption:

… emission probability:   p( w1(n) | t1(n) ) := product of p( w(i) | t(i) )
… transition probability: p( t1(n) )         := product of p( t(i) | t(i-1) )
… t1(n) = argmax product of ( p( w(i) | t(i) ) * p( t(i) | t(i-1) ) )

For prediction, first the most likely tag of the next word is estimated, and then the most likely word given that tag is chosen:

… w(next) = argmax p( w | t(estimated) ) over candidate words w

In the simplest case, the Hidden Markov model delivers a single predicted word.

Usually, however, several candidate words are returned, each with a corresponding probability. These should be displayed in the app ‘my_next_word’.
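
A minimal sketch of this two-step idea (not the final algorithm) is given below. It assumes the count tables produced above, l_df_tag_2grams (columns tag1, tag2, n) and l_df_POStag_2grams (columns word1, tag1, n); the helper predict_next_words() is hypothetical. It first picks the most frequent tag following the previous tag and then ranks the words observed with that tag.

library(dplyr)

predict_next_words <- function(prev_tag, df_tag_2grams, df_word_tag, k = 3) {
  # transition step: the most frequent tag that follows prev_tag
  l_next_tag <- df_tag_2grams %>%
    filter(tag1 == prev_tag) %>%
    arrange(desc(n)) %>%
    head(1) %>%
    pull(tag2)
  # emission step: the k most frequent words for that tag, with relative frequencies
  df_word_tag %>%
    filter(tag1 == l_next_tag) %>%
    mutate(prob = n / sum(n)) %>%
    arrange(desc(prob)) %>%
    head(k) %>%
    select(word1, prob)
}

## e.g. candidate words after a determiner (DT):
## predict_next_words("DT", l_df_tag_2grams, l_df_POStag_2grams, k = 3)

Smoothing and back-off for unseen word/tag combinations would still have to be added to such a sketch.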