knitr::opts_chunk$set(echo = TRUE)
options(tidyverse.quiet = TRUE)

Abstract

This is the exploratory analysis for the capstone project of the Coursera course “Data Science”. In this project I use data collected from Twitter, news sites and blogs to build a prediction algorithm that suggests the next word of an incomplete sentence, as part of a sentence completion helper.

The data set used to train this sentence completion helper consists of newspaper articles, blog entries and tweets. In this analysis we prepare the data for further use and check whether anything interesting or unusual shows up.

Setup

Please make sure you have downloaded and extracted the data set (Link) into the root directory of this RMarkdown document and installed the dependencies listed in the section “Load Libraries”. You can also place a custom bad-word list in the root directory; for this analysis this blacklist is used.
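If you prefer to fetch the files from R, a minimal sketch is shown below; the dataset_url placeholder stands in for the download link from the course materials and the target directory is assumed to match the paths used in the import step.

# Hypothetical download helper (dataset_url is a placeholder, not part of this analysis)
dataset_url <- "<dataset URL from the course>"
if (!dir.exists("../final/en_US")) {
  download.file(dataset_url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "..")
}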

Load Libraries

In this part we load the libraries needed for this analysis.

suppressPackageStartupMessages({
  library(tidyverse)
  library(tidytext)
  library(stringi)
  library(scales)
  library(hrbrthemes)
  library(knitr)
})
theme_set(theme_ipsum_rc())

Import Data

Now we import the data set into R. We load every source separately and then combine them into one big tibble.

if(!exists("en_US.full"))
{
  en_US.news <- readLines("../final/en_US/en_US.news.txt", 
                          encoding="UTF-8")
  en_US.news <- tibble(txt=en_US.news, Source="N")

  en_US.twitter <- readLines("../final/en_US/en_US.twitter.txt", 
                             encoding="UTF-8",
                             skipNul=T)
  en_US.twitter <- tibble(txt=en_US.twitter, Source="T")
  
  en_US.blogs <- readLines("../final/en_US/en_US.blogs.txt", 
                           encoding="UTF-8")
  en_US.blogs <- tibble(txt=en_US.blogs, Source="B")

  en_US.full <- en_US.news %>% 
                  bind_rows(en_US.twitter,
                            en_US.blogs) %>% 
                  mutate(Source = factor(Source, 
                                         c("N", "T", "B"), 
                                         c("News", "Twitter", "Blogs")) )
  # Clean Up
  en_US.twitter <- NULL
  en_US.blogs <- NULL
  en_US.news <- NULL
}

# Load the bad-word list, keep only entries longer than four characters
# and strip punctuation so the entries match the cleaned text below
badwords <- readLines("bad-words.txt")
badwords <- badwords[str_length(badwords)>4]
badwords <- gsub("[^\\w\\s]", "", 
                  badwords, 
                  perl=T)

Split

The data set is quite big. To speed up computations we work with only 20% of the complete data set. This should be more than enough to get a grasp of the data and to build a model that is reliable enough for this purpose.

set.seed(123)
inTrain <- sample(1:nrow(en_US.full),
                  nrow(en_US.full) * .2)
train <- en_US.full[inTrain,] 

# Clean up
en_US.full <- NULL
inTrain <- NULL

Cleanup Data

The data set contains a number of words and characters that should not end up in the final prediction algorithm: non-ASCII characters (for example ä, ö, ü) that are uncommon in English, swear words, numbers and links. These are removed in the steps below.

# Encode to ASCII and remove any other characters
train$txt <- iconv(train$txt, to="ASCII", sub="")

# Remove numbers
train$txt <- gsub("[0-9]+",
                  "", 
                  train$txt, 
                  perl=T)

# Remove URL 
train$txt <- gsub(r"((?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+)",
                  "",
                  train$txt, 
                  perl=T)

# Remove punctuation
train$txt <- gsub("[^\\w\\s]", "", 
                  train$txt, 
                  perl=T)

# Remove multiple whitespaces
train$txt <- gsub("\\s+", " ", 
                  train$txt, 
                  perl=T)

# Lowercase
train$txt <- str_to_lower(train$txt)

# Remove bad words (fixed-string matching, so matching substrings inside longer words are removed too)
train$txt <- stri_replace_all_fixed(train$txt, 
                                    badwords, 
                                    "", 
                                    vectorize_all=F)

train %>% 
  group_by(Source) %>% 
  slice(1:5) %>% 
  kable()
| txt | Source |
|:----|:-------|
| several lawmakers testified that they think child predators are likely to reoffend before the or year waiting | News |
| you see our predecessors understood that government could not and should not solve every problem they understood that there are instances when the gains in security from government action are not worth the added constraints on our freedom but they also understood that the danger of too much government is matched by the perils of too little that without the leavening hand of wise policy markets can monopolies can stifle competition and the vulnerable can be exploited and they knew that when any government measure no matter how carefully crafted or beneficial is subject to scorn when any efforts to help people in need are ed as un when facts and reason are thrown overboard and only timidity p for wisdom and we can no longer even engage in a civil conversation with each other over the things that truly matter that at that point we dont merely lose our capacity to solve big challenges we lose something essential about ourselves | News |
| in days leading up to wednesdays election griffin asked democrats to back garson | News |
| this years event falls on jan although lighter more familyfriendly events begin on jan | News |
| in dolan hall at the college of st elizabeth the concert had a long motley program at two and a half hours but its worthy artists were well received at best the orchestra gave ful energetic performances and the dancers were stunning | News |
| i love the playoffs no matter what sport youre watching its always awesome | Twitter |
| mudcats scooter is a hairy like skipper bivins on hillbillyhandfishin | Twitter |
| have presses releases or promos send em to simmonsradio at gmail dot com | Twitter |
| in the studio downloading plugin presets for the console | Twitter |
| glee is ing omg | Twitter |
| as just one example it parcels out control to the secretary of agriculture with respect to food resources food resource facilities livestock resources veterinary resources plant health resources and the domestic distribution of farm equipment and commercial fertilizer and thereafter to | Blogs |
| managed care imposes health oriented requirements which do not emphasize person centered or person directed supports | Blogs |
| besides noticing a bit of colon tightness remedied with extra fiber and chamomile tea not the cheeseburgers and root beer ive been employing as of late she also found traces of leg weakness back pain thyroid and bladder irritation which is interesting because i broke my leg as kid currently spend my working days bent over a was d several times for thyroiditis as a teen and cant even think about bubble baths or without contracting a bladder infection each of these afflictions imprinted veritable scar tissue on my irises and believe it or not julie could see it youve gotta admitthats pretty cool | Blogs |
| the vacuum seemed to work after just a few minutes of intense pulling and twisting and yanking my darling baby girl was finally freed | Blogs |
| she was up at the top of the eiffel tower when troops came in and evacuated it a while back the train station was evacuated while she was sitting waiting for a train to dijon looks like there was something in her row everyone scurried outside and now she is stuck at the charles degaulle airport | Blogs |

Data description

Overview

Let’s print out some basic statistics:

train %>% 
  mutate(
    `Sentence Length` = str_length(txt),
    `Number of Words` = stri_count_words(txt)
  ) %>% 
  group_by(Source) %>% 
  summarize(
    Entries = n(),
    `Mean Sentence Length` = mean(`Sentence Length`),
    `Mean Number of Words` = mean(`Number of Words`),
    .groups="drop"
  ) %>% 
  mutate_if(is.numeric, round, 2)
## # A tibble: 3 x 4
##   Source  Entries `Mean Sentence Length` `Mean Number of Words`
##   <fct>     <dbl>                  <dbl>                  <dbl>
## 1 News     202325                  188.                    32.8
## 2 Twitter  471859                   62.9                   12.2
## 3 Blogs    179751                  218.                    40.6

It is obvious that both the mean number of words and the mean sentence length (in characters) are much lower on Twitter than anywhere else, due to the limited number of characters allowed in a tweet. Somewhat surprisingly, blogs have a similar or even higher average word count and sentence length than news articles.

In my opinion it is beneficial to use Twitter as a data source precisely because of this short sentence length: it resembles the kind of short conversation that typically happens in WhatsApp and other messengers.

Word and N-Gram Frequency

Let’s count the most frequent words and the most frequent n-grams in this data set and visualize how many words are necessary to cover more than 90% of the words in the data set. An n-gram is a sequence of N consecutive words, e.g. “this is” is a 2-gram (bigram) and “this is it” is a 3-gram (trigram).
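As a quick illustration of the n-gram idea, the snippet below tokenizes a toy sentence (not part of the corpus) into bigrams and trigrams with the same tidytext call used later on.

# Toy example only: bigrams and trigrams of a single made-up sentence
example <- tibble(txt = "this is it and this is fine")
example %>% unnest_tokens(bigram,  txt, token = "ngrams", n = 2)
example %>% unnest_tokens(trigram, txt, token = "ngrams", n = 3)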

Preparation

We prepare the data for further analysis. To save computation time and to reuse the data later when creating the model, each result is saved in a separate file. Tokens contains the individual words of the corpus, while NGram2 and NGram3 contain the frequencies of two- and three-word sequences.

if(!file.exists("Tokens.RDS")) {
  Tokens <- train %>% 
              unnest_tokens(word, txt) 
  saveRDS(Tokens, "Tokens.RDS")
} else {
  Tokens <- readRDS("Tokens.RDS")
}

if(!file.exists("NGram2.RDS")) {
  NGram2 <-  train %>%
              unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>% 
              separate(bigram, c("word1", "word2"), sep = " ") %>% 
              count(word1, word2, sort = TRUE)
  saveRDS(NGram2, "NGram2.RDS")
} else {
  NGram2 <- readRDS("NGram2.RDS")
}

if(!file.exists("NGram3.RDS")) {
  NGram3 <-  train %>%
              unnest_tokens(trigram, txt, token = "ngrams", n = 3) %>% 
              separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% 
              count(word1, word2, word3, sort = TRUE)
  saveRDS(NGram3, "NGram3.RDS")
} else {
  NGram3 <- readRDS("NGram3.RDS")
}

Words & Lexical Length

We visualize the most frequent words and the lexicon size required to reach a certain coverage threshold.

Tokens <- Tokens %>% 
                  count(word, sort=T) %>% 
                  mutate(index = 1:n(),
                         n.percentage.overall = n/sum(n),
                         n.percentage = cumsum(n) / sum(n)) 
# Top-20 Words
Tokens %>%
    slice(1:20) %>% 
    mutate(word = reorder(word, n.percentage.overall)) %>%
    ggplot(aes(x=n.percentage.overall, 
               y=word)) +
      geom_col(fill="#039be5") +
      scale_x_continuous(labels=label_percent()) + 
      labs(title = "Word Frequency",
           subtitle = "The most frequent words in this data set") +
      xlab("% of all words") +
      ylab("Word") -> p.frequency

# Lexicon Length necessary
Tokens %>% 
    ggplot(aes(x=index, 
               y=n.percentage)) +
  
      geom_hline(yintercept = .5, color="red", lty=2) +
      geom_hline(yintercept = .9, color="#039be5", lty=2) +
      geom_label(label="50%", color="red", data=tibble(index=1,n.percentage=.5)) +
      geom_label(label="90%", color="#039be5", data=tibble(index=1,n.percentage=.9)) +
  
      geom_step() +
      scale_y_log10(labels = label_percent(1), breaks=breaks_log(), minor_breaks=NULL) +
      scale_x_log10(labels = label_number(1), breaks=breaks_log(), minor_breaks=NULL) +
      annotation_logticks() +
      labs(title="Lexical Length",
           subtitle = "How many words cover how much of the text?") +
      xlab("Words [log10]") +
      ylab("Coverage [log10]") -> p.lexiconLength

Tokens <- NULL

p.frequency

p.lexiconLength

The most common words are, not surprisingly, “filler words” of the English language (the, i, …) that do not point to any specific topic.

A coverage of 50% is reached after only about 1,000 words; to cover about 90% of the words in each sentence, roughly 10,000 words are needed. This means we can drastically reduce the number of words in our algorithm and still cover most cases, which lowers both the computational and the memory requirements.

The biggest gains in coverage occur at the very beginning of the curve, driven mainly by the filler words shown above. After that the incremental gain shrinks steadily: the higher the coverage already is, the harder it becomes to increase it further.
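To read these thresholds off the data instead of the plot, a small sketch is shown below. It assumes the cumulative Tokens table computed above is still in memory (i.e. it would have to run before Tokens is set to NULL); the function name coverage_lexicon is only illustrative.

# Sketch: smallest lexicon size that reaches a given coverage level
# (Tokens is sorted by frequency and n.percentage is the cumulative share)
coverage_lexicon <- function(tokens, threshold) {
  min(which(tokens$n.percentage >= threshold))
}
# coverage_lexicon(Tokens, .5)   # words needed for 50% coverage
# coverage_lexicon(Tokens, .9)   # words needed for 90% coverage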

N-Grams

We visualize the most frequent n-Grams:

# Percentages are computed over all n-grams (not just the top 20) so that
# they match the axis labels below; NA rows produced by texts shorter than
# the n-gram length are dropped.
NGram2.Count <- NGram2  %>% 
                  filter(!is.na(word1)) %>% 
                  arrange(desc(n)) %>% 
                  mutate(index = 1:n(),
                         n.percentage.overall = n/sum(n),
                         n.percentage = cumsum(n) / sum(n))

NGram3.Count <- NGram3 %>% 
                  filter(!is.na(word1)) %>% 
                  arrange(desc(n)) %>% 
                  mutate(index = 1:n(),
                         n.percentage.overall = n/sum(n),
                         n.percentage = cumsum(n) / sum(n))
# Top-20 Bigrams
NGram2.Count %>%
  slice(1:20) %>% 
  mutate(word = paste(word1, word2),
         word = reorder(word, n.percentage.overall)) %>%  
    ggplot(aes(x=n.percentage.overall, 
               y=word)) +
      geom_col(fill="#039be5") +
      scale_x_continuous(labels=label_percent()) + 
      labs(title = "Bigram Frequency",
           subtitle = "The most frequent bigrams in this data set") +
      xlab("% of all bigrams") +
      ylab("Bigram") -> p.frequency.bigram

# Top-20 Trigrams
NGram3.Count %>%
  slice(1:20) %>% 
  mutate(word = paste(word1, word2, word3),
         word = reorder(word, n.percentage.overall)) %>%  
    ggplot(aes(x=n.percentage.overall, 
               y=word)) +
      geom_col(fill="#039be5") +
      scale_x_continuous(labels=label_percent()) + 
      labs(title = "Trigram Frequency",
           subtitle = "The most frequent trigrams in this data set") +
      xlab("% of all trigrams") +
      ylab("Trigram") -> p.frequency.trigram

p.frequency.bigram

p.frequency.trigram

The 2-grams and 3-grams also do not point to any specific topic; they mostly consist of “filler words”. However, the frequency distribution is noticeably more uniform: no single n-gram stands out the way the word “the” did in the analysis of the individual words.

Foreign Language

Foreign language could be detected by identifying characters that are uncommon in English, or by using a whitelist of English words and filtering out every word that does not appear in that lexicon. Another approach would be to use foreign-language lexicons to identify such words directly.
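A minimal sketch of what the whitelist approach could look like is shown below; the file english-words.txt and the name Tokens_english are hypothetical, and this filter is not applied anywhere in this analysis.

# Hypothetical whitelist filter: keep only words found in an English word list
english_words <- tibble(word = readLines("english-words.txt"))
Tokens_english <- train %>% 
  unnest_tokens(word, txt) %>% 
  semi_join(english_words, by = "word")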

No filtering of this kind has been applied to this data set. The appearance of foreign words is not much of a problem in a more or less clean English text corpus: words from other languages either (i) are regularly used in English anyway (e.g. kindergarten) and should not be removed, or (ii) occur with such low frequency that they are very likely to be dropped when the dictionary is shrunk.

Model Creation

Tokens, bigrams and trigrams would be a good basis for an algorithm that predicts the next word in a sentence, although this approach has become less and less common. With the advance of neural networks in natural language processing (NLP) it is, in my opinion, more fashionable to use a technically more sophisticated approach and let a neural network complete sentences and handle related and/or more complex NLP tasks.

Such an architecture makes it easier to take a large context (the words preceding the word to guess) into account. Unseen words are handled by mapping every word that is not in our lexicon to a single fixed placeholder token. An LSTM (long short-term memory) network can deal with such placeholders out of the box, without any changes to the underlying algorithm.
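For comparison, a minimal sketch of the classical n-gram alternative: a trigram lookup with a bigram backoff built on the NGram3 and NGram2 tables from above. The function name predict_next_word is only illustrative and not the final model.

# Sketch: suggest the next word from the last two words typed, using the
# trigram counts and falling back to bigram counts when nothing matches
predict_next_word <- function(w1, w2, n_suggestions = 3) {
  suggestions <- NGram3 %>% 
    filter(word1 == w1, word2 == w2) %>% 
    arrange(desc(n)) %>% 
    slice(1:n_suggestions) %>% 
    pull(word3)
  
  if (length(suggestions) == 0) {
    suggestions <- NGram2 %>% 
      filter(word1 == w2) %>% 
      arrange(desc(n)) %>% 
      slice(1:n_suggestions) %>% 
      pull(word2)
  }
  suggestions
}
# Example: predict_next_word("one", "of")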

A Shiny app will be built to test the sentence completion algorithm.
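A rough sketch of what such a Shiny front end could look like, reusing the illustrative predict_next_word function from above; all names here are placeholders, not the final application.

# Hypothetical Shiny front end for interactive testing
library(shiny)

ui <- fluidPage(
  textInput("sentence", "Type a sentence:"),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    words <- str_split(str_to_lower(input$sentence), "\\s+")[[1]]
    if (length(words) < 2) return("")
    n <- length(words)
    paste(predict_next_word(words[n - 1], words[n]), collapse = ", ")
  })
}
# shinyApp(ui, server)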