Overview

The goal of the project “Data Science Capstone” is to build an application that predicts the next word from a given word sequence (e.g. the two, three, or four preceding words).

For this purpose, the data source “Coursera-SwiftKey.zip” is made available within the project; it provides three resources (blogs, news, and Twitter) in several languages (US English, German, Russian, Finnish).

URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The remainder of the project considers only the English data sources.
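
Under the assumption that the archive is stored and extracted in a local data folder, the data could be obtained as in the following sketch (the folder name is an assumption; the URL is the one given above):

l_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data")) dir.create("data")
l_zip <- file.path("data", "Coursera-SwiftKey.zip")
if (!file.exists(l_zip)) download.file(l_url, destfile = l_zip, mode = "wb")
unzip(l_zip, exdir = "data")   # the archive contains a final/<locale>/ folder structure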

The aim of this milestone report is to describe the data, the data cleaning, an exploratory data analysis, and an outlook on the planned next-word prediction algorithm.

rm( list = ls( all = TRUE ))

library(dplyr)     # bind_rows, mutate, count, group_by
library(stringr)   # str_extract

# dataframe: combine the three samples into one data frame with columns src and txt
l_df_all <- bind_rows( l_df_blogs_sample, l_df_news_sample, l_df_twitters_sample )

# count words: str_extract() keeps the first word (lowercase letters and apostrophes) of each text line
l_freq <- l_df_all %>% 
          mutate(word = str_extract(txt, "[a-z']+")) %>%
          dplyr::count(src, word) %>%
          group_by(src)
ngrams <- function( df ) {
# input : df with a text column txt
# output: list of (top 50 1-grams, top 50 4-grams)
# requires the tidytext, tidyr, and dplyr packages
  
  l_df <- df
  
  # one grams
  system.time(
    l_df_1gram <- l_df %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
  )
  
  ## fourgrams
  system.time(
    l_df_4grams <- l_df %>%
      unnest_tokens( fourgram, txt, token = "ngrams", n = 4 ) %>%
      separate( fourgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
      dplyr::count(word1, word2, word3, word4, sort = TRUE)
  )
  
  return( list(head(l_df_1gram,50), head(l_df_4grams,50)) )
  
}  
tagPOS <-  function(x, ...) {
  # POS-tag a text string with the Apache OpenNLP maxent tagger (requires the NLP and openNLP packages)
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, word_token_annotator, a2)
  a3 <- NLP::annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  POSwords <- unlist(s[a3w])
  list(POStagged = POStagged, POSwords=POSwords, POStags = POStags)
}
get_pos <- function( x ) {
  
# split the raw text into sentences and derive POS-tag n-gram tables
  l_sentences <- qdap::sent_detect_nlp(x)
  
  l_df <- tibble(txt=l_sentences)
  l_gram_sen <- l_df %>%
    dplyr::count(txt, sort = TRUE)  

  l_tagged_str <-  tagPOS(l_gram_sen$txt)
   
  ## list(POStagged = POStagged, POStags = POStags)
  
  ## onegram for POStags
  l_df_tags <- tibble(txt=l_tagged_str$POStags)
  system.time(
    l_df_tag_1grams <- l_df_tags %>%
      unnest_tokens( "tag1", txt, token = "ngrams", to_lower = FALSE, n = 1 ) %>%
      dplyr::count(tag1, sort = TRUE)
  )
  
  ## bigrams
  l_df_tags <- tibble(txt=l_tagged_str$POStags)
  system.time(
    l_df_tag_2grams <- l_df_tags %>%
      unnest_tokens( bigram, txt, token = "ngrams", to_lower = FALSE, n = 2 ) %>%
      separate( bigram, c("tag1", "tag2"), sep = " ") %>%
      dplyr::count(tag1, tag2, sort = TRUE)
  )
    
  ## onegrams for POSwords (the tagged words themselves)
  l_df_words <- tibble(txt=l_tagged_str$POSwords)
  system.time(
    l_df_words_1grams <- l_df_words %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
   )   
  
  ## word/tag pairs: count each word together with its POS tag
  l_df_POSwordtags <- tibble(word1=l_tagged_str$POSwords, tag1=l_tagged_str$POStags)
  system.time(
    l_df_POStag_2grams <- l_df_POSwordtags %>%
      dplyr::count(word1, tag1, sort = TRUE)
   )     
  
## return  
  return(list(head(l_df_tag_1grams,25), 
              head(l_df_tag_2grams,25),
              head(l_df_words_1grams,25),
              head(l_df_POStag_2grams,25)
             )
        )
}
explore_pos <- function( filename ) {
  
##  read sample RDS file     
  l_filename <- paste0(getwd(), "/data/sample/", filename)
  l_filename <- stringi::stri_replace_last_fixed(str=l_filename, ".txt", "_tidy.rds")
  system.time(
    l_df <- readRDS(file=l_filename)  
  ) 
  l_inf <- file.info(l_filename)
  l_ds <- l_df$txt    
  
## return  
  return( get_pos( x=l_ds ) )
  
}

Summary

This section summarizes the work steps and results of the data provision, data cleaning, and exploratory data analysis. More detailed information, including tables and graphs, can be found in the Appendix.

Text Data Preparation

Since the data come from very different sources and were probably produced on different devices (e.g. Twitter on mobile phones), the data sets are treated separately. The files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt are read in, cleaned separately, and the exploratory data analysis is carried out separately for each of them; the line and character counts below can be determined as sketched after the list.

  • Dataset: en_US.blogs.txt with 899288 text lines and 208361438 characters.
  • Dataset: en_US.news.txt with 77259 text lines and 15683765 characters.
  • Dataset: en_US.twitter.txt with 2360148 lines and 162385035 characters.
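
The counts above could be determined as in the following sketch (the path assumes the archive was extracted into data/final/ as in the download sketch above):

l_files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
for (l_f in l_files) {
  l_lines <- readLines(file.path("data/final/en_US", l_f), encoding = "UTF-8", skipNul = TRUE)
  cat(l_f, ":", length(l_lines), "text lines,", sum(nchar(l_lines)), "characters\n")
}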

Since the amount of data is very large, a 10 % sample of the text lines is drawn before data cleaning and exploratory data analysis.

This restriction means that the sample does not contain every word of the full corpus. The words and word combinations with the highest frequency of occurrence, however, are very likely to appear in the sample. As a first approximation, Zipf’s law can be used.
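
The 10 % sample could be drawn as in the following sketch (shown for the blogs file; the seed and the file path are assumptions):

set.seed(1234)
l_lines  <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
l_sample <- sample(l_lines, size = round(0.10 * length(l_lines)))
l_df_blogs_sample <- tibble::tibble(src = "blogs", txt = l_sample)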

The following boxplot shows the first differences between the datasets:

# boxplot of the text-line length (number of characters) per source (requires ggplot2)
ggplot(l_df_all, aes(x=factor(src), y=nchar(txt), fill=src)) + geom_boxplot() + theme(legend.position="none")

  • The number of text lines is different.
  • The maximum length and the median length of the entries (text lines) differ.
  • The median and the mean differ: the distributions are right-skewed, and shorter texts predominate.

Text Data Cleaning

The following steps are taken to obtain cleaned data (a code sketch follows the list):

  • remove all text lines recognized as foreign languages
  • remove all text lines containing swear words
  • replace contractions
  • replace abbreviations
  • remove brackets
  • replace ordinals
  • convert to lower case
  • replace special expressions
  • remove all text lines containing numbers
  • remove punctuation
  • remove all non-alphabetic characters
  • trim all whitespace
  • remove all text lines without content
  • remove all text lines with fewer than 3 characters
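
A minimal sketch of such a cleaning pipeline is given below. It assumes the stringr and cld2 packages; swear_words stands for a hypothetical list of swear words, and the replacement of contractions, abbreviations, and ordinals is only indicated as a comment.

library(stringr)

clean_lines <- function(txt, swear_words = character(0)) {
  # keep only lines recognized as English (language detection via the cld2 package)
  l_lang <- cld2::detect_language(txt)
  txt <- txt[!is.na(l_lang) & l_lang == "en"]
  # remove lines containing swear words (swear_words is a hypothetical word list)
  if (length(swear_words) > 0) {
    txt <- txt[!str_detect(tolower(txt), paste(swear_words, collapse = "|"))]
  }
  # contractions, abbreviations, and ordinals could be replaced here,
  # e.g. with functions from the textclean package (not shown)
  txt <- tolower(txt)                           # convert to lower case
  txt <- txt[!str_detect(txt, "[0-9]")]         # remove lines containing numbers
  txt <- str_replace_all(txt, "[^a-z' ]", " ")  # keep only alphabetic characters and apostrophes
  txt <- str_squish(txt)                        # trim and collapse whitespace
  txt[nchar(txt) >= 3]                          # drop empty and very short lines
}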

Percentage of text lines recognized as English:

  • File en_US.blogs.txt : 84.4 percent of the text lines are recognized as English.
  • File en_US.news.txt : 90.8 percent of the text lines are recognized as English.
  • File en_US.twitter.txt : 80.6 percent of the text lines are recognized as English.
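
These percentages could be computed, for example, with the language detection of the cld2 package (a sketch, shown for en_US.blogs.txt; the exact definition of the percentage may differ from the one used above):

l_lines <- readLines("data/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
l_lang  <- cld2::detect_language(l_lines)
round(100 * sum(l_lang == "en", na.rm = TRUE) / length(l_lines), 1)   # share of lines recognized as English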

Text Data Exploratory Analysis

Histograms: count of words

To show the differences in the word frequencies, the following histograms are provided. The log transformation of the frequencies makes the differences somewhat clearer; otherwise the histograms would look very similar.

## Histograms: log-transformed word frequencies per source (requires ggplot2)
  l_total <- nrow(l_freq)   # total number of (source, word) combinations
  ggplot(l_freq, aes(log(n)/l_total, fill=src)) +
    geom_histogram(show.legend = FALSE, bins = 30) +
    facet_wrap(~src, ncol = 3, scales = "free_y")

Ngrams

For the cleaned data sets, a 1-gram table and a 4-gram table are created and output.

This shows how quickly the frequency of occurrence decreases from 1-grams to 4-grams, while the word combinations become more specific.

Since articles, prepositions, and conjunctions are used very frequently in sentences, they also appear among the top entries of the tables and charts.

l_out <- ngrams( df=l_df_blogs_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (2354249 words in total) are skewed:

  • 86 words cover 50 % of the corpus “blogs_tidy.txt”,
  • 3340 words cover 90 % of the corpus “blogs_tidy.txt”, and
  • 11591 words cover 100 % of the corpus “blogs_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.
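
The coverage figures above can be derived from a full (un-truncated) 1-gram frequency table sorted by descending count. The following sketch uses a hypothetical table l_df_1gram_full with columns word1 and n:

coverage <- function(df_1gram, p) {
  l_cum <- cumsum(df_1gram$n) / sum(df_1gram$n)   # cumulative share of word occurrences
  which(l_cum >= p)[1]                            # number of top words needed to reach coverage p
}
## coverage(l_df_1gram_full, 0.50)   # words covering 50 % of the corpus
## coverage(l_df_1gram_full, 0.90)   # words covering 90 % of the corpus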

knitr::kable(head(l_df_1ngram,25), caption="Onegram (blogs.txt)", format="html")
Onegram (blogs.txt)
word1 n
the 66859
and 42179
i 42149
to 40606
a 33510
of 30539
in 22676
is 19547
that 18545
it 17394
for 13943
you 12472
not 12369
was 12225
my 12208
have 11493
this 10608
with 9797
are 8679
on 8679
as 8370
but 8009
be 7903
we 7793
so 7153
knitr::kable(head(l_df_4ngram,25), caption="4-gram (blogs.txt)", format="html")
4-gram (blogs.txt)
word1 word2 word3 word4 n
i am going to 329
i would love to 229
if you would like 210
you would like to 210
is one of the 206
love to tell you 186
to tell you that 186
would love to tell 186
one of the most 163
the rest of the 159
have to admit that 158
on the other side 158
to admit that i 155
i have to admit 153
i am not a 152
and i do not 151
i do not think 146
i do not know 137
the other khador player 130
can not wait to 128
for those of you 125
tell you that i 124
the one thing that 124
i can not get 123
a big fan of 122
l_out <- ngrams( df=l_df_news_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (141219 words in total) are skewed:

  • 130 words cover 50 % of the corpus “news_tidy.txt”,
  • 2130 words cover 90 % of the corpus “news_tidy.txt”, and
  • 4662 words cover 100 % of the corpus “news_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.

knitr::kable(head(l_df_1ngram,25), caption="Onegram (news.txt)", format="html")
Onegram (news.txt)
word1 n
the 75563
to 33980
and 32303
a 30126
of 24216
in 18669
is 16932
that 16321
for 12442
said 12199
it 11863
on 10563
not 10047
he 9946
with 9245
was 8747
i 7466
they 7252
are 7108
have 6742
at 6086
but 6084
we 5731
as 5392
be 4832
knitr::kable(head(l_df_4ngram,25), caption="4-gram (news.txt)", format="html")
4-gram (news.txt)
word1 word2 word3 word4 n
said with a laugh 326
in a row in 256
chief executive officer of 228
that he was not 213
in the last year 205
i do not think 200
when it comes to 198
additional stolen property was 195
connected to our investigation 195
found connected to our 195
in their room additional 195
our investigation poulin said 195
property was found connected 195
room additional stolen property 195
stolen property was found 195
their room additional stolen 195
to our investigation poulin 195
was found connected to 195
about their gratitude for 193
also spoke about their 193
for the outpouring of 193
from the community and 193
gratitude for the outpouring 193
of support from the 193
outpouring of support from 193
l_out <- ngrams( df=l_df_twitters_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]

The word frequencies in the sample (2120216 words in total) are skewed:

  • 85 words cover 50 % of the corpus “twitters_tidy.txt”,
  • 2435 words cover 90 % of the corpus “twitters_tidy.txt”, and
  • 7826 words cover 100 % of the corpus “twitters_tidy.txt”.

On the other hand, the sample does not include all the words of the total population. Many words that rarely appear in the population are not taken into account when the sample is drawn.

knitr::kable(head(l_df_1ngram,25), caption="Onegram (twitter.txt)", format="html")
Onegram (twitter.txt)
word1 n
the 66451
i 58853
to 57808
a 41646
you 40772
is 39061
and 35807
it 28984
in 25006
for 24774
of 24179
not 23332
my 21228
that 18804
on 17993
me 15399
do 15022
have 14968
with 14602
are 14417
be 12292
your 12001
so 11807
this 11529
we 11133
knitr::kable(head(l_df_4ngram,25), caption="4-gram (twitter.txt)", format="html")
4-gram (twitter.txt)
word1 word2 word3 word4 n
thanks for the follow 894
what is your favorite 495
i am going to 463
can not wait to 398
i am not the 342
i do not think 337
i would like to 291
thanks for following me 285
i am pretty sure 277
i do not want 255
let us know if 242
i do not know 237
if you do not 230
i do not like 228
if i do not 223
do not have to 215
thanks for the mention 212
to do when he 204
what to do when 204
have a great day 203
i would love to 200
do not see it 199
the end of the 199
hope to see you 198
i cannot wait to 198
Part of Speech

A POS tagger is used to generate tagged text from raw text using a tag set.

## Example: 
## Rawtext       time     flies      like      an     arrow.
## Tagged Text   time[NN] flies[VBZ] like[IN] an[DT] arrow[NN].

## NN    Noun, singular or mass
## IN    Preposition or subordinating conjunction
## DT    Determiner
## VBZ   Verb, 3rd person singular present
## List of tags
## ----------------------------------------------
## CC    Coordinating conjunction
## CD    Cardinal number
## DT    Determiner
## EX    Existential there
## FW    Foreign word
## IN    Preposition or subordinating conjunction
## JJ    Adjective 
## JJR   Adjective, comparative
## JJS   Adjective, superlative
## LS    List item marker
## MD    Modal
## NN    Noun, singular or mass
## NNS   Noun, plural
## NNP   Proper noun, singular
## NNPS  Proper noun, plural
## PDT   Predeterminer
## POS   Possessive ending
## PRP   Personal pronoun
## PRP$  Possessive pronoun
## RB    Adverb
## RBR   Adverb, comparative
## RBS   Adverb, superlative
## RP    Particle
## SYM   Symbol
## TO    to
## UH    Interjection
## VB    Verb, base form
## VBD   Verb, past tense
## VBG   Verb, gerund or present participle
## VBN   Verb, past participle
## VBP   Verb, non-3rd person singular present
## VBZ   Verb, 3rd person singular present
## WDT   Wh-determiner
## WP    Wh-pronoun
## WP$   Possessive wh-pronoun
## WRB   Wh-adverb
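
The following usage sketch applies the tagPOS() function defined above to the example sentence; it assumes that the NLP and openNLP packages (with the English model data, e.g. openNLPmodels.en) are installed:

library(NLP)
library(openNLP)
l_tagged <- tagPOS("time flies like an arrow")
l_tagged$POStagged   # "time/NN flies/VBZ like/IN an/DT arrow/NN" (cf. the example above)
l_tagged$POStags     # "NN" "VBZ" "IN" "DT" "NN"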

In a Hidden Markov model, two types of information are useful:

  • relations between words and tags
  • relations between tags and tags
system.time(
  l_grams_list <- explore_pos( filename="en_US.blogs.txt" ) 
)  
## Registered S3 methods overwritten by 'qdap':
##   method               from
##   t.DocumentTermMatrix tm  
##   t.TermDocumentMatrix tm
##    user  system elapsed 
##  204.25    3.93  177.49
l_df_tag_1grams    <- l_grams_list[[1]]
l_df_tag_2grams    <- l_grams_list[[2]]
l_df_POStag_1grams <- l_grams_list[[3]]
l_df_POStag_2grams <- l_grams_list[[4]]

The following table shows the relations between tags and tags and their frequencies.

knitr::kable(head(l_df_tag_2grams,25), caption="2-grams for tags (en_US.blogs.txt)", format="html")
2-grams for tags (en_US.blogs.txt)
tag1 tag2 n
DT NN 2596
IN DT 2030
NN IN 1914
JJ NN 1392
IN PRP 1382
DT JJ 1143
TO VB 1057
PRP VBP 1036
PRP VBD 963
NN NN 875
NNP NNP 788
NN CC 759
NN PRP 755
IN NN 728
PRP NN 700
NNS IN 693
MD VB 611
JJ NNS 591
VB DT 535
PRP MD 510
VB PRP 489
IN NNP 464
RB IN 440
DT NNS 430
NN RB 425

The following graph illustrates the network of relationships based on the pairwise relationships of the 25 most common tags.

## graph of the tag-to-tag relations (requires igraph and ggraph)
  bigram_graph <- l_df_tag_2grams %>%
    graph_from_data_frame() 

  set.seed(2017)
  l_df_2grapf <-  ggraph(bigram_graph, layout = "fr") +
                  geom_edge_link() +
                  geom_node_point() +
                  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
  
  l_df_2grapf 

The following table shows the relations between words and tags and their frequencies.

knitr::kable(head(l_df_POStag_2grams,25), caption="2-grams for word and tag (en_US.blogs.txt)", format="html")
2-grams for word and tag (en_US.blogs.txt)
word1 tag1 n
the DT 2317
to TO 1547
and CC 1508
I PRP 1490
a DT 1242
of IN 1169
in IN 785
is VBZ 775
it PRP 563
for IN 506
not RB 445
. . 444
was VBD 434
you PRP 420
that IN 406
my PRP$ 395
with IN 379
are VBP 328
on IN 321
this DT 313
be VB 300
have VBP 290
The DT 268
will MD 244
as IN 239

Outlook on the Algorithm

A possible model for estimating next-word probabilities is a Hidden Markov Model.

A hidden Markov model will serve as the basis for estimating the next word using a tag sequence.

As described above, two types of information are used in the Hidden Markov model: the relations between words and tags (emission probabilities) and the relations between tags and tags (transition probabilities).

Approach

The goal is to find the best sequence of tags (t1, … ,tn) for the observed sequence of words (w1, … ,wn), i.e. to choose the tag sequence that is most probable given the observation sequence.

… t1(n) = argmax p( t1(n) | w1(n) ) over all tag sequences t1(n)

Using Bayes’ rule:

… t1(n) = argmax p( w1(n) | t1(n) ) * p( t1(n) )

Using the Markov assumption:

… emission probability:   p( w1(n) | t1(n) ) := product of p( w(i) | t(i) )
… transition probability: p( t1(n) )         := product of p( t(i) | t(i-1) )
… t1(n) = argmax product of ( p( w(i) | t(i) ) * p( t(i) | t(i-1) ) )

For prediction, first the most likely tag of the next word is estimated, and then the most likely word given that tag is chosen:

… w(next) = argmax p( w | t(estimated) ) over candidate words w

In the simplest case, the Hidden Markov model delivers a single predicted word.

Usually, however, several candidate words are returned, each with a corresponding probability. These should be displayed in the app ‘my_next_word’.
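
A minimal sketch of this two-step idea (not the final algorithm) is given below. It assumes the count tables produced above, l_df_tag_2grams (columns tag1, tag2, n) and l_df_POStag_2grams (columns word1, tag1, n); the helper predict_next_words() is hypothetical. It first picks the most frequent tag following the previous tag and then ranks the words observed with that tag.

library(dplyr)

predict_next_words <- function(prev_tag, df_tag_2grams, df_word_tag, k = 3) {
  # transition step: the most frequent tag that follows prev_tag
  l_next_tag <- df_tag_2grams %>%
    filter(tag1 == prev_tag) %>%
    arrange(desc(n)) %>%
    head(1) %>%
    pull(tag2)
  # emission step: the k most frequent words for that tag, with relative frequencies
  df_word_tag %>%
    filter(tag1 == l_next_tag) %>%
    mutate(prob = n / sum(n)) %>%
    arrange(desc(prob)) %>%
    head(k) %>%
    select(word1, prob)
}

## e.g. candidate words after a determiner (DT):
## predict_next_words("DT", l_df_tag_2grams, l_df_POStag_2grams, k = 3)

Smoothing and back-off for unseen word/tag combinations would still have to be added to such a sketch.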