The goal of the "Data Science Capstone" project is to build an application that predicts the next word from an existing word sequence (e.g. two, three or four words).
For this purpose, the data source "Coursera-SwiftKey.zip" is provided within the framework of the project; it contains three kinds of text (blogs, news and Twitter) in several languages (US English, German, Russian, Finnish).
URL: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
All further considerations in this project are limited to the English data sources.
The aim of this milestone report is to describe the data acquisition, the data cleaning, an exploratory data analysis and an outlook on the planned algorithm for predicting the next word.
rm( list = ls( all = TRUE ))
# packages used throughout this report
library(dplyr)
library(stringr)
library(tidyr)
library(tidytext)
library(ggplot2)
# combine the three samples into one data frame (column src marks the source)
l_df_all <- bind_rows( l_df_blogs_sample, l_df_news_sample, l_df_twitters_sample )
# extract one word per text line and count occurrences per source
l_freq <- l_df_all %>%
  mutate(word = str_extract(txt, "[a-z']+")) %>%
  dplyr::count(src, word) %>%
  group_by(src)
ngrams <- function( df ) {
  # input : df with a text column 'txt'
  # output: list of (top 50 one-grams, top 50 4-grams)
  l_df <- df
  ## one-grams
  system.time(
    l_df_1gram <- l_df %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
  )
  ## four-grams
  system.time(
    l_df_4grams <- l_df %>%
      unnest_tokens( fourgram, txt, token = "ngrams", n = 4 ) %>%
      separate( fourgram, c("word1", "word2", "word3", "word4"), sep = " ") %>%
      dplyr::count(word1, word2, word3, word4, sort = TRUE)
  )
  return( list(head(l_df_1gram, 50), head(l_df_4grams, 50)) )
}
tagPOS <- function(x, ...) {
  # POS-tag a character vector x using the openNLP Maxent annotators (packages NLP, openNLP)
  s <- as.String(x)
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, word_token_annotator, a2)
  a3 <- NLP::annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  POSwords <- unlist(s[a3w])
  list(POStagged = POStagged, POSwords = POSwords, POStags = POStags)
}
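For illustration, a minimal usage sketch of tagPOS on the example sentence used later in this report (the exact output depends on the installed openNLP models):
library(NLP)
library(openNLP)     # provides the Maxent annotators used in tagPOS
l_tagged <- tagPOS("time flies like an arrow.")
l_tagged$POStagged   # e.g. "time/NN flies/VBZ like/IN an/DT arrow/NN ./."
l_tagged$POStags     # character vector with one tag per token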
get_pos <- function( x ) {
  # split the raw text into sentences, POS-tag them and
  # build frequency tables for tags, tag bigrams, words and word-tag pairs
  l_sentences <- qdap::sent_detect_nlp(x)
  l_df <- tibble(txt = l_sentences)
  l_gram_sen <- l_df %>%
    dplyr::count(txt, sort = TRUE)
  l_tagged_str <- tagPOS(l_gram_sen$txt)
  ## list(POStagged = POStagged, POSwords = POSwords, POStags = POStags)
  ## one-grams for POStags
  l_df_tags <- tibble(txt = l_tagged_str$POStags)
  system.time(
    l_df_tag_1grams <- l_df_tags %>%
      unnest_tokens( "tag1", txt, token = "ngrams", to_lower = FALSE, n = 1 ) %>%
      dplyr::count(tag1, sort = TRUE)
  )
  ## bigrams for POStags
  l_df_tags <- tibble(txt = l_tagged_str$POStags)
  system.time(
    l_df_tag_2grams <- l_df_tags %>%
      unnest_tokens( bigram, txt, token = "ngrams", to_lower = FALSE, n = 2 ) %>%
      separate( bigram, c("tag1", "tag2"), sep = " ") %>%
      dplyr::count(tag1, tag2, sort = TRUE)
  )
  ## one-grams for POSwords
  l_df_words <- tibble(txt = l_tagged_str$POSwords)
  system.time(
    l_df_words_1grams <- l_df_words %>%
      unnest_tokens( "word1", txt, token = "ngrams", n = 1 ) %>%
      dplyr::count(word1, sort = TRUE)
  )
  ## word-tag pairs
  l_df_POSwordtags <- tibble(word1 = l_tagged_str$POSwords, tag1 = l_tagged_str$POStags)
  system.time(
    l_df_POStag_2grams <- l_df_POSwordtags %>%
      dplyr::count(word1, tag1, sort = TRUE)
  )
  ## return the top 25 rows of each table
  return(list(head(l_df_tag_1grams, 25),
              head(l_df_tag_2grams, 25),
              head(l_df_words_1grams, 25),
              head(l_df_POStag_2grams, 25)))
}
explore_pos <- function( filename ) {
  ## read the tidy sample RDS file that corresponds to the given text file
  l_filename <- paste0(getwd(), "/data/sample/", filename)
  l_filename <- stringi::stri_replace_last_fixed(str = l_filename, ".txt", "_tidy.rds")
  system.time(
    l_df <- readRDS(file = l_filename)
  )
  l_inf <- file.info(l_filename)
  l_ds <- l_df$txt
  ## run the POS exploration on the sample text
  return( get_pos( x = l_ds ) )
}
This summary section sums up the work steps and results from data provision, data cleaning and exploratory data analysis. More detailed information, including tables and graphs, can be found in the sections of the Appendix.
Since the data come from very different sources and were probably produced on different devices (e.g. Twitter on mobile phones), the data sets are treated separately. The files en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt are read in, cleaned separately, and the exploratory data analysis is carried out separately for each of them.
Since the amount of data is very large, it is reduced by drawing a 10 % sample before data cleaning and exploratory data analysis.
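A minimal sketch of how such a 10 % line sample could be drawn from one of the raw files (the file path, the seed and the object names are assumptions):
library(readr)
library(dplyr)
set.seed(123)   # assumption: any fixed seed, for reproducibility
l_lines <- read_lines("data/final/en_US/en_US.blogs.txt")
l_idx <- sample(seq_along(l_lines), size = round(0.1 * length(l_lines)))
l_df_blogs_sample <- tibble(src = "blogs", txt = l_lines[l_idx])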
This restriction means that the sample does not contain all the words of the whole corpus; however, the words and word combinations with the highest frequency of occurrence are very likely to be found in the sample. As a first approximation one can appeal to Zipf's law.
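Zipf's law states that a word's frequency is roughly inversely proportional to its frequency rank, which is why the most frequent words and combinations are very likely to appear even in a 10 % sample:
$$ f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1 $$
where $f(r)$ is the frequency of the word with rank $r$ and $C$ is a corpus-dependent constant.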
The following overview shows the first differences between the data sets; the boxplot below compares the distribution of line lengths per source:
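A minimal sketch of how such an overview could be computed from the raw files (the file paths and the use of stringi are assumptions):
library(stringi)
l_files <- c(blogs   = "data/final/en_US/en_US.blogs.txt",
             news    = "data/final/en_US/en_US.news.txt",
             twitter = "data/final/en_US/en_US.twitter.txt")
l_overview <- lapply(names(l_files), function(l_src) {
  l_lines <- readLines(l_files[[l_src]], encoding = "UTF-8", skipNul = TRUE)
  data.frame(src       = l_src,
             size_mb   = round(file.info(l_files[[l_src]])$size / 2^20, 1),
             n_lines   = length(l_lines),
             n_words   = sum(stri_count_words(l_lines)),
             max_nchar = max(nchar(l_lines)))
})
do.call(rbind, l_overview)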
ggplot(l_df_all, aes(x=factor(src), y=nchar(txt), fill=src)) + geom_boxplot() + theme(legend.position="none")
The following steps are taken to obtain cleaned data:
Percentage of text lines in English language:
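A minimal sketch of the kind of cleaning and language filtering that could be applied; the concrete rules and the use of the cld3 package are assumptions, not the exact steps of this report:
library(dplyr)
library(stringr)
library(cld3)   # assumption: compact language detector used for the English share
clean_sample <- function(df) {
  df %>%
    mutate(txt = str_to_lower(txt),                                   # lower case
           txt = str_replace_all(txt, "http\\S+|www\\.\\S+", " "),    # remove URLs
           txt = str_replace_all(txt, "[0-9]+", " "),                 # remove numbers
           txt = str_replace_all(txt, "[^a-z' ]", " "),               # keep letters and apostrophes
           txt = str_squish(txt)) %>%
    filter(txt != "")
}
# share of sample lines detected as English (illustrative, value not taken from this report)
mean(cld3::detect_language(l_df_blogs_sample$txt) == "en", na.rm = TRUE)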
In order to show the differences in the word frequencies, the following histograms are shown. The log transformation of the frequencies serves to make the differences somewhat clearer; otherwise the histograms would look very similar.
## Histograms
l_total <- dim(l_freq)[1]
ggplot(l_freq, aes(log(n)/l_total, fill = src)) +
  geom_histogram(show.legend = FALSE, bins = 30) +
  facet_wrap(~src, ncol = 3, scales = "free_y")
For each cleaned data set a one-gram and a 4-gram table are created and output.
This makes clear how quickly the frequency of occurrence decreases from one-grams to 4-grams, while the word combinations become more specific.
Since articles, prepositions and conjunctions are used very often in sentences, they also dominate the top entries of the one-gram tables.
l_out <- ngrams( df=l_df_blogs_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]
The word frequencies in the sample (2354249 words in total) are highly skewed:
On the other hand, the sample does not include all the words of the total population; many words that occur rarely in the population are not captured when the sample is drawn.
knitr::kable(head(l_df_1ngram,25), caption="Onegram (blogs.txt)", format="html")
word1 | n |
---|---|
the | 66859 |
and | 42179 |
i | 42149 |
to | 40606 |
a | 33510 |
of | 30539 |
in | 22676 |
is | 19547 |
that | 18545 |
it | 17394 |
for | 13943 |
you | 12472 |
not | 12369 |
was | 12225 |
my | 12208 |
have | 11493 |
this | 10608 |
with | 9797 |
are | 8679 |
on | 8679 |
as | 8370 |
but | 8009 |
be | 7903 |
we | 7793 |
so | 7153 |
knitr::kable(head(l_df_4ngram,25), caption="4-gram (blogs.txt)", format="html")
word1 | word2 | word3 | word4 | n |
---|---|---|---|---|
i | am | going | to | 329 |
i | would | love | to | 229 |
if | you | would | like | 210 |
you | would | like | to | 210 |
is | one | of | the | 206 |
love | to | tell | you | 186 |
to | tell | you | that | 186 |
would | love | to | tell | 186 |
one | of | the | most | 163 |
the | rest | of | the | 159 |
have | to | admit | that | 158 |
on | the | other | side | 158 |
to | admit | that | i | 155 |
i | have | to | admit | 153 |
i | am | not | a | 152 |
and | i | do | not | 151 |
i | do | not | think | 146 |
i | do | not | know | 137 |
the | other | khador | player | 130 |
can | not | wait | to | 128 |
for | those | of | you | 125 |
tell | you | that | i | 124 |
the | one | thing | that | 124 |
i | can | not | get | 123 |
a | big | fan | of | 122 |
l_out <- ngrams( df=l_df_news_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]
The word frequencies in the sample (141219 words in total) are highly skewed:
On the other hand, the sample does not include all the words of the total population; many words that occur rarely in the population are not captured when the sample is drawn.
knitr::kable(head(l_df_1ngram,25), caption="Onegram (news.txt)", format="html")
word1 | n |
---|---|
the | 75563 |
to | 33980 |
and | 32303 |
a | 30126 |
of | 24216 |
in | 18669 |
is | 16932 |
that | 16321 |
for | 12442 |
said | 12199 |
it | 11863 |
on | 10563 |
not | 10047 |
he | 9946 |
with | 9245 |
was | 8747 |
i | 7466 |
they | 7252 |
are | 7108 |
have | 6742 |
at | 6086 |
but | 6084 |
we | 5731 |
as | 5392 |
be | 4832 |
knitr::kable(head(l_df_4ngram,25), caption="4-gram (news.txt)", format="html")
word1 | word2 | word3 | word4 | n |
---|---|---|---|---|
said | with | a | laugh | 326 |
in | a | row | in | 256 |
chief | executive | officer | of | 228 |
that | he | was | not | 213 |
in | the | last | year | 205 |
i | do | not | think | 200 |
when | it | comes | to | 198 |
additional | stolen | property | was | 195 |
connected | to | our | investigation | 195 |
found | connected | to | our | 195 |
in | their | room | additional | 195 |
our | investigation | poulin | said | 195 |
property | was | found | connected | 195 |
room | additional | stolen | property | 195 |
stolen | property | was | found | 195 |
their | room | additional | stolen | 195 |
to | our | investigation | poulin | 195 |
was | found | connected | to | 195 |
about | their | gratitude | for | 193 |
also | spoke | about | their | 193 |
for | the | outpouring | of | 193 |
from | the | community | and | 193 |
gratitude | for | the | outpouring | 193 |
of | support | from | the | 193 |
outpouring | of | support | from | 193 |
l_out <- ngrams( df=l_df_twitters_sample )
l_df_1ngram <- l_out[[1]]
l_df_4ngram <- l_out[[2]]
The word frequencies in the sample (2120216 words in total) are highly skewed:
On the other hand, the sample does not include all the words of the total population; many words that occur rarely in the population are not captured when the sample is drawn.
knitr::kable(head(l_df_1ngram,25), caption="Onegram (twitter.txt)", format="html")
word1 | n |
---|---|
the | 66451 |
i | 58853 |
to | 57808 |
a | 41646 |
you | 40772 |
is | 39061 |
and | 35807 |
it | 28984 |
in | 25006 |
for | 24774 |
of | 24179 |
not | 23332 |
my | 21228 |
that | 18804 |
on | 17993 |
me | 15399 |
do | 15022 |
have | 14968 |
with | 14602 |
are | 14417 |
be | 12292 |
your | 12001 |
so | 11807 |
this | 11529 |
we | 11133 |
knitr::kable(head(l_df_4ngram,25), caption="4-gram (twitter.txt)", format="html")
word1 | word2 | word3 | word4 | n |
---|---|---|---|---|
thanks | for | the | follow | 894 |
what | is | your | favorite | 495 |
i | am | going | to | 463 |
can | not | wait | to | 398 |
i | am | not | the | 342 |
i | do | not | think | 337 |
i | would | like | to | 291 |
thanks | for | following | me | 285 |
i | am | pretty | sure | 277 |
i | do | not | want | 255 |
let | us | know | if | 242 |
i | do | not | know | 237 |
if | you | do | not | 230 |
i | do | not | like | 228 |
if | i | do | not | 223 |
do | not | have | to | 215 |
thanks | for | the | mention | 212 |
to | do | when | he | 204 |
what | to | do | when | 204 |
have | a | great | day | 203 |
i | would | love | to | 200 |
do | not | see | it | 199 |
the | end | of | the | 199 |
hope | to | see | you | 198 |
i | cannot | wait | to | 198 |
A POS tagger is used to generate tagged text from raw text using a tag set.
## Example:
## Raw text:    time flies like an arrow.
## Tagged text: time[NN] flies[VBZ] like[IN] an[DT] arrow[NN].
## NN Noun, singular or mass
## IN Preposition or subordinating conjunction
## DT Determiner
## VBZ Verb, 3rd person singular present
## List of tags
## ----------------------------------------------
## CC Coordinating conjunction
## CD Cardinal number
## DT Determiner
## EX Existential there
## FW Foreign word
## IN Preposition or subordinating conjunction
## JJ Adjective
## JJR Adjective, comparative
## JJS Adjective, superlative
## LS List item marker
## MD Modal
## NN Noun, singular or mass
## NNS Noun, plural
## NNP Proper noun, singular
## NNPS Proper noun, plural
## PDT Predeterminer
## POS Possessive ending
## PRP Personal pronoun
## PRP$ Possessive pronoun
## RB Adverb
## RBR Adverb, comparative
## RBS Adverb, superlative
## RP Particle
## SYM Symbol
## TO to
## UH Interjection
## VB Verb, base form
## VBD Verb, past tense
## VBG Verb, gerund or present participle
## VBN Verb, past participle
## VBP Verb, non-3rd person singular present
## VBZ Verb, 3rd person singular present
## WDT Wh-determiner
## WP Wh-pronoun
## WP$ Possessive wh-pronoun
## WRB Wh-adverb
In a Hidden Markov Model, two types of information are useful: the transition probabilities between successive tags and the emission probabilities of words given a tag.
system.time(
l_grams_list <- explore_pos( filename="en_US.blogs.txt" )
)
## Registered S3 methods overwritten by 'qdap':
## method from
## t.DocumentTermMatrix tm
## t.TermDocumentMatrix tm
## user system elapsed
## 204.25 3.93 177.49
l_df_tag_1grams <- l_grams_list[[1]]
l_df_tag_2grams <- l_grams_list[[2]]
l_df_POStag_1grams <- l_grams_list[[3]]
l_df_POStag_2grams <- l_grams_list[[4]]
The following table shows the pairwise relations between tags and their frequencies.
knitr::kable(head(l_df_tag_2grams,25), caption="2-grams for tags (en_US.blogs.txt)", format="html")
tag1 | tag2 | n |
---|---|---|
DT | NN | 2596 |
IN | DT | 2030 |
NN | IN | 1914 |
JJ | NN | 1392 |
IN | PRP | 1382 |
DT | JJ | 1143 |
TO | VB | 1057 |
PRP | VBP | 1036 |
PRP | VBD | 963 |
NN | NN | 875 |
NNP | NNP | 788 |
NN | CC | 759 |
NN | PRP | 755 |
IN | NN | 728 |
PRP | NN | 700 |
NNS | IN | 693 |
MD | VB | 611 |
JJ | NNS | 591 |
VB | DT | 535 |
PRP | MD | 510 |
VB | PRP | 489 |
IN | NNP | 464 |
RB | IN | 440 |
DT | NNS | 430 |
NN | RB | 425 |
The following graph illustrates the network of relationships based on the 25 most frequent tag pairs.
## graph of tag-to-tag relations
library(igraph)
library(ggraph)
bigram_graph <- l_df_tag_2grams %>%
  ## head(n=50) %>%
  graph_from_data_frame()
set.seed(2017)
l_df_2graph <- ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)
l_df_2graph
The following table shows the relation between words and their tags, together with their frequencies.
knitr::kable(head(l_df_POStag_2grams,25), caption="2-grams for word and tag (en_US.blogs.txt)", format="html")
word1 | tag1 | n |
---|---|---|
the | DT | 2317 |
to | TO | 1547 |
and | CC | 1508 |
I | PRP | 1490 |
a | DT | 1242 |
of | IN | 1169 |
in | IN | 785 |
is | VBZ | 775 |
it | PRP | 563 |
for | IN | 506 |
not | RB | 445 |
. | . | 444 |
was | VBD | 434 |
you | PRP | 420 |
that | IN | 406 |
my | PRP$ | 395 |
with | IN | 379 |
are | VBP | 328 |
on | IN | 321 |
this | DT | 313 |
be | VB | 300 |
have | VBP | 290 |
The | DT | 268 |
will | MD | 244 |
as | IN | 239 |
A possible model for estimating the word probabilities is the Hidden Markov Model (HMM).
A Hidden Markov Model will serve as the basis for estimating the next word with the help of a tag sequence.
A POS tagger is used to generate tagged text from raw text using a tag set.
# Example:
# Raw text:    time flies like an arrow.
# Tagged text: time[NN] flies[VBZ] like[IN] an[DT] arrow[NN].
In a Hidden Markov Model, two types of information are useful: the transition probabilities between tags and the emission probabilities of words given a tag.
The decoding task is to find the best sequence of tags (t1, …, tn) for the observed sequence of words (w1, …, wn), i.e. to choose the tag sequence that is most probable given the observation sequence.
This is made tractable by the Markov assumption.
First the most likely tag of the next word is estimated, and then the most likely word given that tag is chosen.
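Under the Markov assumption, the joint probability of a tag sequence and the observed word sequence factorizes into tag-transition and word-emission terms, and the most probable tag sequence is
$$ \hat{t}_{1:n} = \arg\max_{t_1,\dots,t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}). $$
For prediction, the most likely next tag $\hat{t}_{n+1} = \arg\max_{t} P(t \mid t_n)$ is chosen first, and then the most likely word for that tag, $\hat{w}_{n+1} = \arg\max_{w} P(w \mid \hat{t}_{n+1})$.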
At best, the Hidden Markov Model delivers a single predicted word.
Usually, however, several candidate words are delivered, each with a corresponding probability; these should be displayed in the app ‘my_next_word’.
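Independently of the HMM, the 4-gram tables above already allow a simple frequency-based prediction. A minimal sketch of such a lookup (the helper name predict_next_word is hypothetical and no back-off to shorter n-grams is included):
library(dplyr)
# predict the next word from a 4-gram count table (columns word1..word4, n),
# given the last three words of the user input
predict_next_word <- function(df_4grams, w1, w2, w3, top = 3) {
  df_4grams %>%
    filter(word1 == w1, word2 == w2, word3 == w3) %>%
    mutate(prob = n / sum(n)) %>%   # relative frequency as probability
    arrange(desc(prob)) %>%
    select(word4, prob) %>%
    head(top)
}
# example call against the 4-gram table built above
predict_next_word(l_df_4ngram, "i", "would", "love", top = 3)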