1 Introduction

As many of us can attest, learning another language is tough. Picking up on nuances like slang, dates and times, and local expressions can often be the distinguishing factor between proficiency and fluency. This challenge is even more difficult for a computer.

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken.

Many speech and language applications, including text-to-speech synthesis (TTS) and automatic speech recognition (ASR), require text to be converted from written expressions into appropriate “spoken” forms. This process is known as text normalization; it converts, for example, 12:47 to “twelve forty-seven” and $3.16 to “three dollars, sixteen cents.”
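
To make the idea concrete, here is a minimal rule-based sketch in R (purely illustrative, not part of the system developed below), with a toy lookup that only covers the words needed for this one example:

library(stringr)

# toy lookup covering only the words needed for this example
toy_words <- c("12" = "twelve", "47" = "forty-seven")

normalize_time <- function(token) {
  m <- str_match(token, "^(\\d{1,2}):(\\d{2})$")
  if (is.na(m[1, 1])) return(token)   # not an HH:MM time: leave the token unchanged
  unname(str_c(toy_words[m[1, 2]], toy_words[m[1, 3]], sep = " "))
}

normalize_time("12:47")
## [1] "twelve forty-seven"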

In this project, we will use machine learning, linguistic sophistication, and native-speaker intuition to develop text normalization for English text-to-speech synthesis (TTS) and automatic speech recognition (ASR) systems, and to test the normalization grammar against the known rules.

The objective of this research paper is to create tools for the detection, normalization, and denormalization of non-standard words such as abbreviations, numbers, or currency expressions, as well as semiotic classes – text tokens and token sequences that represent particular, semantically constrained entities such as measurement phrases, addresses, or dates.

The dataset (a large corpus of text, about 350 megabytes in total) comes in the shape of two files: en_test.csv, the test set, which does not contain the normalized text, and en_train.csv, the training set, which does.

Applications of this work include text-to-speech synthesis, automatic speech recognition, and information extraction/retrieval.

1.1 Load libraries and helper functions

# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('ggforce') # visualisation
library('treemapify') # visualisation

# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation

# Text / NLP
library('tidytext') # text analysis
library('tm') # text analysis
library('SnowballC') # text analysis
library('topicmodels') # text analysis
library('wordcloud') # text visualisation

# Extra Vis
library('plotly')

# Extras
library('babynames')

We use the *multiplot* function, courtesy of [R Cookbooks](http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/), to create multi-panel plots.

# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

1.2 Load data

We use data.table’s fread function to speed up reading in the data. The training data file is rather extensive, with close to 10 million rows and 300 MB uncompressed size:

train <- as_tibble(fread('en_train.csv'))
test <- as_tibble(fread('en_test.csv'))

1.3 File structure and content

Let’s have an overview of the data sets using the summary and glimpse tools. First the training data:

summary(train)
##   sentence_id        token_id         class              before         
##  Min.   :     0   Min.   :  0.00   Length:9918441     Length:9918441    
##  1st Qu.:192526   1st Qu.:  3.00   Class :character   Class :character  
##  Median :379259   Median :  6.00   Mode  :character   Mode  :character  
##  Mean   :377857   Mean   :  7.52                                        
##  3rd Qu.:564189   3rd Qu.: 11.00                                        
##  Max.   :748065   Max.   :255.00                                        
##     after          
##  Length:9918441    
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

We have close to 750,000 sentences in our training corpus (sentence_id runs from 0 to 748,065), each sentence tagged with a unique identification number. The average sentence length is about 8 tokens, and the maximum is 255 tokens, which is unusually long for a well-formed sentence. In total, the training set contains 9,918,441 tokens, each labelled with its token class.

glimpse(train)
## Observations: 9,918,441
## Variables: 5
## $ sentence_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ token_id    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6,...
## $ class       <chr> "PLAIN", "PLAIN", "PLAIN", "PLAIN", "PLAIN", "PLAI...
## $ before      <chr> "Brillantaisia", "is", "a", "genus", "of", "plant"...
## $ after       <chr> "Brillantaisia", "is", "a", "genus", "of", "plant"...

We’ve got a sense of our variables, their class type, and the first few observations of each. We can see we’re working with 9,918,441 observations of 5 variables. To make things a bit more explicit since a couple of the variable names aren’t 100% illuminating, here’s what we’ve got to deal with:

Variable Name Description
sentence_id   Each sentence has a unique id number, running from 0 to 748,065
token_id      Each token within a sentence has a sequential id number, starting at 0 (here up to 255)
class         The token type (semiotic class) to which each token belongs
before        The raw text of the token
after         The normalized text of the token

Let’s move on to our test dataset:

summary(test)
##   sentence_id       token_id          before         
##  Min.   :    0   Min.   :  0.000   Length:1088564    
##  1st Qu.:17488   1st Qu.:  3.000   Class :character  
##  Median :35028   Median :  7.000   Mode  :character  
##  Mean   :35007   Mean   :  8.344                     
##  3rd Qu.:52522   3rd Qu.: 12.000                     
##  Max.   :69999   Max.   :248.000

Our test corpus contains sentence_id values from 0 to 69,999, each sentence tagged with a unique identification number. The average sentence length is about 8 tokens, and the maximum is 248 tokens, which is again unusually long for a well-formed sentence. In total, the test set contains 1,088,564 tokens; unlike the training set, no class labels are provided here.

glimpse(test)
## Observations: 1,088,564
## Variables: 3
## $ sentence_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ token_id    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ before      <chr> "Another", "religious", "family", "is", "of", "Haz...

We’ve got a sense of our variables, their class type, and the first few observations of each. We are working with 1,088,564 observations of 3 variables. Let’s take a deeper look at our test data:

Variable Name Description
sentence_id   Each sentence has a unique id number, running from 0 to 69,999
token_id      Each token within a sentence has a sequential id number, starting at 0 (here up to 248)
before        The raw text of the token

print(c(max(train$sentence_id), max(test$sentence_id)))
## [1] 748065  69999

In summary, sentence_id runs up to 748,065 in the training data and up to 69,999 in the test data; we learn the normalization from the training sentences and have to predict it for the test sentences.
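
As a quick, hedged sanity check (output not reproduced here), we can count the distinct sentence ids directly; if the ids are contiguous and start at 0, these counts should equal the respective maximum id plus one:

# number of distinct sentences in each set
n_distinct(train$sentence_id)
n_distinct(test$sentence_id)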

We combine the train and test data sets for comparison treatment:

combine <- bind_rows(train %>% mutate(set = "train"),test %>% mutate(set = "test")) %>%
  mutate(set = as.factor(set))

1.4 Missing values

There are no missing values in our data set:

sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0

1.5 Reformatting features

We decide to turn Class into a factor for exploration purposes:

train <- train %>%
  mutate(class = factor(class))

2 Example sentences and overview visualisations

2.1 A first look at normalised sentences

Before diving into data summaries, let’s begin by printing a few before and after sentences to get a feeling for the data. To make this task easier, we first define a short helper function to compare these sentences:

before_vs_after <- function(sent_id){

  bf <- train %>%
    filter(sentence_id == sent_id) %>%
    .$before %>%
    str_c(collapse = " ")
  
  af <- train %>%
    filter(sentence_id == sent_id) %>%
    .$after %>%
    str_c(collapse = " ")
  
  print(str_c("Before:", bf, sep = " ", collapse = " "))
  print(str_c("After :", af, sep = " ", collapse = " "))
}

Those are a few example sentences:

before_vs_after(11)
## [1] "Before: Retrieved April 10, 2013 ."
## [1] "After : Retrieved april tenth twenty thirteen ."
before_vs_after(99)
## [1] "Before: Retrieved 12 April 2015 ."
## [1] "After : Retrieved the twelfth of april twenty fifteen ."
before_vs_after(1234)
## [1] "Before: The PMO provides secretarial assistance to the Prime Minister ."
## [1] "After : The p m o provides secretarial assistance to the Prime Minister ."

We noticed a few things:

  • The first two examples are very similar, but the normalization distinguishes between “april tenth” and “the twelfth of april”, depending on how the date is written.

  • “2015” becomes “twenty fifteen” instead of “two thousand fifteen”.

  • “April” becomes “april”. Lower vs upper case should not be a problem, but it is noteworthy.

  • Acronyms like “PMO” are turned into their spoken form “p m o”, without telling us what p m o stands for.

2.2 Summary visualisations

Now let’s look at overview visualisations. First, we examine the different token classes and their frequency. Here is a visual summary:

train %>%
  group_by(class) %>%
  count() %>%
  ungroup() %>%
  mutate(class = reorder(class, n)) %>%
  ggplot(aes(class,n, fill = class)) +
  geom_col() +
  scale_y_log10() +
  labs(y = "Frequency") +
  coord_flip() +
  theme(legend.position = "none")
Fig. 1

We use a logarithmic scale for the frequency axis because the class counts are heavily skewed, spanning several orders of magnitude.

Insights:

  • The “PLAIN” class is by far the most frequent, followed by the punctuation class “PUNCT” and then “DATE”.

  • In total there are 16 classes, with “TIME”, “FRACTION”, and “ADDRESS” having the smallest numbers of occurrences (around or below 100 tokens each); the exact counts can be tabulated as shown below.
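
For reference, the exact counts behind Fig. 1 can be tabulated directly with dplyr (output not reproduced here):

# token counts per class, most frequent first
train %>%
  count(class, sort = TRUE)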

Next up is a histogram of the distribution of sentence lengths in the training data:

train %>%
  group_by(sentence_id) %>%
  summarise(sentence_len = max(token_id)) %>%
  ggplot(aes(sentence_len)) +
  geom_histogram(bins = 50, fill = "red") +
  scale_y_sqrt() +
  scale_x_log10() +
  labs(x = "Sentence length")
Fig. 2

The sentence lengths follow a roughly bell-shaped distribution on the logarithmic axis. Sentences are typically up to 15-20 tokens long, after which the frequency drops quickly. Very long sentences (> 100 tokens) exist but are relatively rare. Note again the logarithmic x-axis and square-root y-axis.

Below we compare the sentence length distributions for the training vs test data sets, this time with an overlapping density plot:

combine %>%
  group_by(sentence_id, set) %>%
  summarise(sentence_len = max(token_id)) %>%
  ggplot(aes(sentence_len, fill = set)) +
  geom_density(bw = 0.1, alpha = 0.5) +
  scale_x_log10() +
  labs(x = "Sentence length")
Fig. 3

We find that the training data contains a larger share of short sentences (< 10 tokens), while longer sentences make up a larger proportion of the test data set. Note the logarithmic x-axis.

Next, we will look at the token_ids of each class in their sentences:

train %>%
  ggplot(aes(reorder(class, token_id, FUN = median), token_id, col = class)) +
  geom_boxplot() +
  scale_y_log10() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Class", y = "Token ID")
Fig. 4

Insights:

  • The “TELEPHONE” class appears predominantly at token_ids of less than 10.

  • Above token_ids of 100 we find no occurrences of the “ELECTRONIC”, “ADDRESS”, “FRACTION”, “DIGIT”, “ORDINAL”, “TIME”, “MONEY”, and “MEASURE” classes. Of those, “FRACTION”, “MONEY”, and “MEASURE” barely appear above token_id == 20.

  • The classes “DECIMAL”, “MONEY”, “PUNCT”, and “MEASURE” are rarely found in the first token of a sentence.

We can take this analysis a step further by relating the token_id to the length of the sentence. Thereby, we will see at which relative position in a sentence a certain class is more likely to occur:

sen_len <- train %>%
  group_by(sentence_id) %>%
  summarise(sentence_len = max(token_id))

train %>%
  left_join(sen_len, by = "sentence_id") %>%
  mutate(token_rel = token_id/sentence_len) %>%
  ggplot(aes(reorder(class, token_rel, FUN = median), token_rel, col = class)) +
  geom_boxplot() +
  #scale_y_log10() +
  theme(legend.position = "none", axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Class", y = "Relative token ID")
Fig. 5

Findings:

  • As suggested above, “TELEPHONE” tokens are more likely to occur early in a sentence. A similar observation holds for “LETTERS” and “PLAIN”.

  • Unsurprisingly, the punctuation “PUNCT” class can be found more frequently towards the end of a sentence. Similarly, “MONEY” tokens occur relatively late.

  • In general, there is a certain trend among the classes, with medians ranging from about 0.4 to 0.8. However, interquartile ranges are wide and there is a large amount of overlap between the different classes.

3 The impact of the normalization

Now we will include the effects of the text normalization in our study by analysing the changes it introduced in the training data.

To begin, we define the new feature transformed to indicate those tokens that changed from before to after:

train <- train %>%
  mutate(transformed = (before != after))
train %>%
  group_by(transformed) %>%
  count() %>%
  mutate(freq = n/nrow(train))
## # A tibble: 2 x 3
## # Groups:   transformed [2]
##   transformed       n       freq
##         <lgl>   <int>      <dbl>
## 1       FALSE 9258648 0.93347815
## 2        TRUE  659793 0.06652185

In total, only about 7% of the tokens in the training data, or 659,793 tokens, were changed during text normalization.

This explains the high baseline accuracies we can achieve even without any adjustment of the test data input.
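
A hedged way to quantify this baseline on the training data is to score the trivial prediction after = before, which by construction reproduces the untransformed fraction shown above:

# accuracy of the trivial "copy the input" baseline
train %>%
  summarise(baseline_accuracy = mean(before == after))   # about 0.933, as per the table above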

By comparing the fraction of tokens that changed from before to after the text normalisation we can visualise which classes are most affected by this process:

train %>%
  ggplot(aes(class, fill = transformed)) +
  geom_bar(position = "fill") +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  labs(x = "Class", y = "Transformed fraction [%]")
Fig. 6

Insights:

train %>%
  group_by(class, transformed) %>%
  count() %>%
  spread(key = transformed, value = n) %>%
  mutate(`TRUE` = ifelse(is.na(`TRUE`),0,`TRUE`),
         `FALSE` = ifelse(is.na(`FALSE`),0,`FALSE`)) %>%
  mutate(frac = `FALSE`/(`TRUE`+`FALSE`)*100) %>%
  filter(frac>1e-5) %>%
  arrange(desc(frac)) %>%
  rename(unchanged = `FALSE`, changed = `TRUE`, unchanged_percentage = frac)
## # A tibble: 7 x 4
## # Groups:   class [7]
##        class unchanged changed unchanged_percentage
##       <fctr>     <dbl>   <dbl>                <dbl>
## 1      PUNCT   1880507       0         100.00000000
## 2      PLAIN   7317221   36472          99.50403151
## 3   VERBATIM     52271   25837          66.92144211
## 4    LETTERS      8426  144369           5.51457836
## 5 ELECTRONIC       198    4964           3.83572259
## 6    MEASURE        22   14761           0.14881959
## 7      MONEY         3    6125           0.04895561

This is a breakdown of the number and percentage of tokens per class that remained unchanged from before to after. As we can see, punctuation marks remained completely unchanged, while 36,472 (about 0.5%) of the “PLAIN” tokens changed.

4 Token classes and their text normalisation

In order to explore the meaning of these classes for our normalization task we modify our helper function to include the class name. Here we also remove punctuation:

before_vs_after_class <- function(sent_id){

  bf <- train %>%
    filter(sentence_id == sent_id & class != "PUNCT") %>%
    .$before %>%
    str_pad(30) %>%
    str_c(collapse = " ")
  
  af <- train %>%
    filter(sentence_id == sent_id & class != "PUNCT") %>%
    .$after %>%
    str_pad(30) %>%
    str_c(collapse = " ")
  
  
  cl <- train %>%
    filter(sentence_id == sent_id & class != "PUNCT") %>%
    .$class %>%
    str_pad(30) %>%
    str_c(collapse = " ")
  
  print(str_c("[Class]:", cl, sep = " ", collapse = " "))
  print(str_c("Before :", bf, sep = " ", collapse = " "))
  print(str_c("After  :", af, sep = " ", collapse = " "))
}

Using our example sentence from earlier, we get the following output structure, indicating the combination of a “PLAIN” and a “DATE” token:

before_vs_after_class(11)
## [1] "[Class]:                          PLAIN                           DATE"
## [1] "Before :                      Retrieved                 April 10, 2013"
## [1] "After  :                      Retrieved    april tenth twenty thirteen"

Next we will explore the different token classes within categories of similarity and provide a few examples for each.

4.1 “PLAIN” class: modifications

As we saw above, most “PLAIN” tokens remained unchanged. However, about 0.5% were transformed, which still amounts to about 36k tokens:

train %>%
  filter(class == "PLAIN") %>%
  group_by(transformed) %>%
  count() %>%
  spread(key = transformed, value = n) %>%
  mutate(frac = `FALSE`/(`TRUE`+`FALSE`)*100) %>%
  filter(!is.na(frac)) %>%
  arrange(desc(frac)) %>%
  rename(unchanged = `FALSE`, changed = `TRUE`, unchanged_percentage = frac)
## # A tibble: 1 x 3
##   unchanged changed unchanged_percentage
##       <int>   <int>                <dbl>
## 1   7317221   36472             99.50403

These are a few transformed examples:

set.seed(1234)
train %>%
  filter(transformed == TRUE & class == "PLAIN") %>%
  sample_n(10)
## # A tibble: 10 x 6
##    sentence_id token_id  class    before    after transformed
##          <int>    <int> <fctr>     <chr>    <chr>       <lgl>
##  1       88698        0  PLAIN        dr   doctor        TRUE
##  2      469912        8  PLAIN   colours   colors        TRUE
##  3      460114        2  PLAIN       vol   volume        TRUE
##  4      470842       11  PLAIN         -       to        TRUE
##  5      646512        8  PLAIN         -       to        TRUE
##  6      483660        7  PLAIN   theatre  theater        TRUE
##  7        7782        1  PLAIN         -       to        TRUE
##  8      179773        9  PLAIN        lb    pound        TRUE
##  9      503183       12  PLAIN neighbour neighbor        TRUE
## 10      391000       13  PLAIN        st    saint        TRUE

We see a few typical changes, such as “-” to “to” and the adjustment from British to American spelling.
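
To see which of these substitutions dominate, we can tabulate the most frequent before/after pairs among the transformed “PLAIN” tokens (output not shown here):

# most common PLAIN-class substitutions
train %>%
  filter(class == "PLAIN", transformed == TRUE) %>%
  count(before, after, sort = TRUE) %>%
  head(10)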

4.2 Punctuation class: “PUNCT”

It is important to briefly cross-check the “PUNCT” class. Intuitively, punctuation should not be affected by text normalization unless it is associated with other structures such as numbers or dates. Here are a few punctuation examples:

set.seed(1234)
train %>%
  filter(class == "PUNCT") %>%
  sample_n(10)
## # A tibble: 10 x 6
##    sentence_id token_id  class before  after transformed
##          <int>    <int> <fctr>  <chr>  <chr>       <lgl>
##  1       88108        5  PUNCT      '      '       FALSE
##  2      468923       11  PUNCT      .      .       FALSE
##  3      459100        4  PUNCT "\"\"" "\"\""       FALSE
##  4      469748        8  PUNCT      ,      ,       FALSE
##  5      645746       19  PUNCT      .      .       FALSE
##  6      482242        6  PUNCT      .      .       FALSE
##  7        7445       25  PUNCT      .      .       FALSE
##  8      177712       16  PUNCT      .      .       FALSE
##  9      501373       13  PUNCT      .      .       FALSE
## 10      388155       21  PUNCT      ,      ,       FALSE

As expected, these tokens are identical before and after:

train %>%
  filter(class == "PUNCT") %>%
  mutate(test = (before == after)) %>%
  group_by(test) %>%
  count()
## # A tibble: 1 x 2
## # Groups:   test [1]
##    test       n
##   <lgl>   <int>
## 1  TRUE 1880507

There are 1,880,507 punctuation marks in our training set.

4.3 Numerical classes

Numbers are by far the most diverse group (and an important area for normalization treatment). The numerical classes are: “DATE”, “CARDINAL”, “MEASURE”, “ORDINAL”, “DECIMAL”, “MONEY”, “DIGIT”, “TELEPHONE”, “TIME”, “FRACTION”, and “ADDRESS”. Here we will look at a few examples of each of them.

“DATE”

We have already seen how dates can be transformed differently depending on their formatting. These are some more examples:

before_vs_after_class(8)
## [1] "[Class]:                          PLAIN                           DATE"
## [1] "Before :                      Retrieved                   4 March 2014"
## [1] "After  :                      Retrieved the fourth of march twenty fourteen"
before_vs_after_class(12)
## [1] "[Class]:                          PLAIN                          PLAIN                           DATE"
## [1] "Before :                     Downloaded                             on                  7 August 2007"
## [1] "After  :                     Downloaded                             on the seventh of august two thousand seven"

“CARDINAL”

Not a big surprise here: these are cardinal numbers. Interestingly, it includes Roman numerals:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "CARDINAL") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id    class before                    after
##         <int>   <fctr>  <chr>                    <chr>
## 1       89600 CARDINAL     53              fifty three
## 2      471164 CARDINAL     36               thirty six
## 3      462639 CARDINAL    589 five hundred eighty nine
## 4      471926 CARDINAL     II                      two
## 5      642177 CARDINAL    305       three hundred five
before_vs_after(471926)
## [1] "Before: Listed as Grade II by English Heritage ."
## [1] "After : Listed as Grade two by English Heritage ."

“MEASURE”

These are mainly percentages and physical measurements like megawatts:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "MEASURE") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id   class   before                                      after
##         <int>  <fctr>    <chr>                                      <chr>
## 1       83465 MEASURE      45%                         forty five percent
## 2      457847 MEASURE    270MW              two hundred seventy megawatts
## 3      447041 MEASURE 93.3 km2 ninety three point three square kilometers
## 4      458852 MEASURE  1,500 m           one thousand five hundred meters
## 5      639771 MEASURE  13.0 mi                  thirteen point zero miles
before_vs_after(200476)
## [1] "Before: As of 2008 , the population was 50.3% male and 49.7% female ."
## [1] "After : As of two thousand eight , the population was fifty point three percent male and forty nine point seven percent female ."
before_vs_after_class(225531)
## [1] "[Class]:                          PLAIN                          PLAIN                       CARDINAL                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                        DECIMAL                          PLAIN                          PLAIN                          PLAIN                        MEASURE"
## [1] "Before :                          There                           were                            233                        housing                          units                             at                             an                        average                        density                             of                            6.8                            per                         square                           mile                       2.6/km²"
## [1] "After  :                          There                           were       two hundred thirty three                        housing                          units                             at                             an                        average                        density                             of                six point eight                            per                         square                           mile two point six per square kilometers"

“ORDINAL”

Ordinal numbers can also include Roman numerals, as in the example of Queen Elizabeth I:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "ORDINAL") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id   class before       after
##         <int>  <fctr>  <chr>       <chr>
## 1       92031 ORDINAL   45th forty fifth
## 2      487085 ORDINAL   19th  nineteenth
## 3      478452 ORDINAL      I   the first
## 4      487719 ORDINAL   16th   sixteenth
## 5      657344 ORDINAL   13th  thirteenth
before_vs_after(478452)
## [1] "Before: Maid of honour to Elizabeth I ( 1576 until 1583 ) ."
## [1] "After : Maid of honor to Elizabeth the first ( fifteen seventy six until fifteen eighty three ) ."

“DECIMAL”

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "DECIMAL") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id   class     before                              after
##         <int>  <fctr>      <chr>                              <chr>
## 1       86596 DECIMAL        .66                      point six six
## 2      454408 DECIMAL    10.1002              ten point one o o two
## 3      443910 DECIMAL      199.4 one hundred ninety nine point four
## 4      454830 DECIMAL        .37                  point three seven
## 5      636963 DECIMAL 60 billion                      sixty billion
before_vs_after(443910)
## [1] "Before: There were 331 housing units at an average density of 199.4 per square mile ( 77.0/km2 ) ."
## [1] "After : There were three hundred thirty one housing units at an average density of one hundred ninety nine point four per square mile ( seventy seven point zero per square kilometers ) ."

Also, in this particular example the token in parentheses is a “MEASURE”:

before_vs_after_class(443910)
## [1] "[Class]:                          PLAIN                          PLAIN                       CARDINAL                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                          PLAIN                        DECIMAL                          PLAIN                          PLAIN                          PLAIN                        MEASURE"
## [1] "Before :                          There                           were                            331                        housing                          units                             at                             an                        average                        density                             of                          199.4                            per                         square                           mile                       77.0/km2"
## [1] "After  :                          There                           were       three hundred thirty one                        housing                          units                             at                             an                        average                        density                             of one hundred ninety nine point four                            per                         square                           mile seventy seven point zero per square kilometers"

“MONEY”

Money tokens come with currency symbols like “$” or plain text names like “yuan”:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "MONEY") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id  class           before
##         <int> <fctr>            <chr>
## 1       90430  MONEY    $10.8 million
## 2      461097  MONEY            $985M
## 3      449366  MONEY       $5 billion
## 4      461569  MONEY              $20
## 5      636060  MONEY 1.8 million yuan
## # ... with 1 more variables: after <chr>
before_vs_after(90430)
## [1] "Before: It issued 369 citations at that time , assessing $10.8 million in penalties ."
## [1] "After : It issued three hundred sixty nine citations at that time , assessing ten point eight million dollars in penalties ."

“DIGIT”

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "DIGIT") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id  class before                     after
##         <int> <fctr>  <chr>                     <chr>
## 1       91368  DIGIT 202767 two o two seven six seven
## 2      480736  DIGIT      2                       two
## 3      474035  DIGIT      1                       one
## 4      481444  DIGIT   1996         one nine nine six
## 5      645718  DIGIT   2003             two o o three
before_vs_after(481444)
## [1] "Before: Mintz , Sidney W. 1996 a Tasting Food , Tasting Freedom : Excursions into Eating , Culture , and the Past ."
## [1] "After : Mintz , Sidney w one nine nine six a Tasting Food , Tasting Freedom : Excursions into Eating , Culture , and the Past ."

An interesting error: the normalization missed this one. The example looks like a citation of an article or book, which means that “nineteen ninety six” would be more appropriate than “one nine nine six”, since the number most likely refers to the year of publication.

“TELEPHONE”

set.seed(1234)
train %>%
  filter(class == "TELEPHONE") %>%
  select(-token_id, -class, -transformed) %>%
  sample_n(5)
## # A tibble: 5 x 3
##   sentence_id        before
##         <int>         <chr>
## 1       88957       3 18-49
## 2      476323    0192627929
## 3      467425 0-900652-85-3
## 4      477330   1 1693-1775
## 5      644524 985-433-695-6
## # ... with 1 more variables: after <chr>
before_vs_after(88957)
## [1] "Before: The show was viewed by an estimated 3.17 million Americans with a 1.1 / 3 18-49 rating / share ."
## [1] "After : The show was viewed by an estimated three point one seven million Americans with a one point one / three sil one eight sil four nine rating / share ."

That doesn’t look like a telephone number to me. Neither does this one:

before_vs_after(476323)
## [1] "Before: ISBN 0192627929 Chambers TJ , Revell PA , Fuller K , Athanasou  ."
## [1] "After : i s b n o one nine two six two seven nine two nine Chambers t j , Revell p a , Fuller K , Athanasou  ."

Quite a few of these entries seem to be in fact ISBN numbers. Everything that is formatted like integer digits plus dashes appears to be identified as a “TELEPHONE” token. This could be a tricky category.
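
As a rough, hedged check of that impression, we can measure what fraction of “TELEPHONE” tokens consist only of digits, spaces, and dashes (output not shown here):

train %>%
  filter(class == "TELEPHONE") %>%
  summarise(frac_digits_dashes = mean(str_detect(before, "^[0-9 -]+$")))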

“TIME”

The “TIME” formatting and normalisation are comparatively simple:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "TIME") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id  class     before              after
##         <int> <fctr>      <chr>              <chr>
## 1       98527   TIME    3:00 pm          three p m
## 2      471678   TIME    10:30pm     ten thirty p m
## 3      465185   TIME 11:00 a.m.         eleven a m
## 4      746945   TIME      10:30         ten thirty
## 5      635327   TIME    3:14 AM three fourteen a m
before_vs_after(471678)
## [1] "Before: The show shifted its timeslot from 10:30pm into 6:00pm in order to compete TV Patrol ."
## [1] "After : The show shifted its timeslot from ten thirty p m into six p m in order to compete t v Patrol ."

“FRACTION”

Here it becomes tricky again:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "FRACTION") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id    class   before                          after
##         <int>   <fctr>    <chr>                          <chr>
## 1       91464 FRACTION 2000/400   two thousand four hundredths
## 2      479012 FRACTION    172/6 one hundred seventy two sixths
## 3      470330 FRACTION      6/6                     six sixths
## 4      746922 FRACTION     16/8                sixteen eighths
## 5      650125 FRACTION   2/2013   two two thousand thirteenths

Many of these numbers are unlikely to be fractions and are therefore normalized in a rather clumsy way:

before_vs_after(470330)
## [1] "Before: \"\" WWE Superstars Results ( 6/6 ) : Ascension vs"
## [1] "After : \"\" w w e Superstars Results ( six sixths ) : Ascension versus"

But since we only need to reproduce this normalisation approach, it should actually make our job easier: the required transformation appears to be rather homogeneous for any two integers separated by a forward slash.
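
A minimal sketch of that homogeneous rule (cardinal numerator, plural ordinal denominator), using toy word lookups that only cover the numbers in the examples above:

# toy lookups, restricted to the examples shown above
toy_cardinal    <- c("6" = "six", "16" = "sixteen")
toy_denominator <- c("6" = "sixths", "8" = "eighths")

normalize_fraction <- function(token) {
  m <- str_match(token, "^(\\d+)/(\\d+)$")
  if (is.na(m[1, 1])) return(token)   # not of the form "integer/integer"
  unname(str_c(toy_cardinal[m[1, 2]], toy_denominator[m[1, 3]], sep = " "))
}

normalize_fraction("6/6")    # "six sixths"
normalize_fraction("16/8")   # "sixteen eighths"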

“ADDRESS”

The “ADDRESS” class appears to be assigned to alpha-numeric combinations:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "ADDRESS") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id   class before          after
##         <int>  <fctr>  <chr>          <chr>
## 1       88342 ADDRESS   A380 a three eighty
## 2      458052 ADDRESS     B1          b one
## 3      440683 ADDRESS     M2          m two
## 4      457174 ADDRESS    C1           c one
## 5      612855 ADDRESS     C0            c o
before_vs_after(88342)
## [1] "Before: \"\" Seat Map Singapore Airlines Airbus A380 \"\" ."
## [1] "After : \"\" Seat Map Singapore Airlines Airbus a three eighty \"\" ."
before_vs_after(440683)
## [1] "Before: Each of the four identical protein subunits is composed of two membrane spanning alpha helices ( M1 and M2 ) ."
## [1] "After : Each of the four identical protein subunits is composed of two membrane spanning alpha helices ( m one and m two ) ."

4.4 Acronyms and Initials: “LETTERS” class

The tokens in the class “LETTERS” appear to be normalised to a form which spells them out one by one:

set.seed(4321)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "LETTERS") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id   class before   after
##         <int>  <fctr>  <chr>   <chr>
## 1      254590 LETTERS     NS     n s
## 2      680878 LETTERS     D.       d
## 3      313518 LETTERS     UK     u k
## 4       35001 LETTERS   ISBN i s b n
## 5      571356 LETTERS  M. A.     m a
before_vs_after(571356)
## [1] "Before: Kose , M. A. ; Prasad , E. S. ; Terrones , M. E. ( 2006 ) ."
## [1] "After : Kose , m a ; Prasad , e s ; Terrones , m e ( two thousand six ) ."

Since these text elements are typically entirely upper case, it should be relatively simple to define a normalization. Given the relatively high frequency of the “LETTERS” class, I suggest we use a simple transformation at first before proceeding to a somewhat more advanced leaderboard (LB) baseline.
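
Here is a minimal sketch of such a simple transformation, assuming (as the examples above suggest) that we only need to lower-case the token, drop non-letter characters, and read out the remaining letters one by one:

spell_out_letters <- function(token) {
  token %>%
    str_to_lower() %>%
    str_replace_all("[^a-z]", "") %>%     # drop periods, spaces, digits
    str_split("", simplify = TRUE) %>%
    str_c(collapse = " ")
}

spell_out_letters("ISBN")    # "i s b n"
spell_out_letters("M. A.")   # "m a"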

4.5 Special symbols: “VERBATIM” class

“Verbatim” means using exactly the same words, word for word. Here we mostly have special symbols such as non-English characters. Interestingly, not all of them are copied verbatim from before to after:

set.seed(1234)
train %>%
  select(-token_id, -transformed) %>%
  filter(class == "VERBATIM") %>%
  sample_n(5)
## # A tibble: 5 x 4
##   sentence_id    class before after
##         <int>   <fctr>  <chr> <chr>
## 1       85530 VERBATIM     κ kappa
## 2      458616 VERBATIM      &   and
## 3      449455 VERBATIM     з    з
## 4      459603 VERBATIM    å®¶   å®¶
## 5      646953 VERBATIM     и    и

The exceptions are Greek letters and the ampersand “&”. It might be useful to search for other exceptions and apply simple transformations such as “&” to “and”.
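
A hedged sketch of such a lookup, covering only the exceptions we have actually seen above and copying everything else verbatim:

# minimal exception table for the VERBATIM class (extend as more exceptions are found)
verbatim_map <- c("&" = "and", "κ" = "kappa")

normalize_verbatim <- function(token) {
  ifelse(token %in% names(verbatim_map), verbatim_map[token], token)
}

normalize_verbatim(c("&", "κ", "з"))   # "and" "kappa" "з"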

4.6 Websites: “ELECTRONIC” class

This class includes websites that are normalized using single characters and a “dot”:

set.seed(4321)
train %>%
  filter(class == "ELECTRONIC") %>%
  select(-token_id, -class, -sentence_id, -transformed) %>%
  sample_n(5)
## # A tibble: 5 x 2
##                 before                                       after
##                  <chr>                                       <chr>
## 1              CNN.com                             c n n dot c o m
## 2             uefa.com                           u e f a dot c o m
## 3 MyLifeIsTwilight.com   m y l i f e i s t w i l i g h t dot c o m
## 4      2008Achieve.org t w o o o e i g h t a c h i e v e dot o r g
## 5             Goal.com                           g o a l dot c o m
before_vs_after(99083)
## [1] "Before: Climate Summary for Madison , Florida \"\" Weatherbase.com \"\" ."
## [1] "After : Climate Summary for Madison , Florida \"\" w e a t h e r b a s e dot c o m \"\" ."

This is another format that should be relatively easy to learn and implement.
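
A hedged sketch of that pattern (lower-case the token, read out each character, and speak the separating periods as “dot”), matching the simpler examples above; tokens containing digits, like “2008Achieve.org”, would clearly need extra handling:

spell_out_url <- function(token) {
  parts <- str_split(str_to_lower(token), "\\.")[[1]]      # e.g. "cnn" "com"
  spelled <- sapply(parts, function(p) str_c(str_split(p, "")[[1]], collapse = " "))
  str_c(spelled, collapse = " dot ")
}

spell_out_url("CNN.com")   # "c n n dot c o m"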

In summary: the different classes appear to follow different translation rules for the text normalization. Even though the class feature is not present in the test data set, it might be useful to train a model that identifies the class of each token and then applies the corresponding normalisation rules.

4.7 Most frequent transformations per class - wordclouds

Here we are using the wordcloud package to visualise tokens for each class that occur most frequently among the terms affected by the text normalization. We start with the overall frequency in the first tab and then sequentially plot the individual classes.

All classes

train %>%
  filter(transformed == TRUE) %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 100, rot.per=0, fixed.asp=FALSE))
Fig. 7

“PLAIN”

train %>%
  filter(transformed == TRUE, class == "PLAIN") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 8

“DATE”

train %>%
  filter(transformed == TRUE, class == "DATE") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 9

“CARDINAL”

train %>%
  filter(transformed == TRUE, class == "CARDINAL") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 40, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 10

“MEASURE”

train %>%
  filter(transformed == TRUE, class == "MEASURE") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 11

“ORDINAL”

train %>%
  filter(transformed == TRUE, class == "ORDINAL") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 12

“DECIMAL”

train %>%
  filter(transformed == TRUE, class == "DECIMAL") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 13

“MONEY”

train %>%
  filter(transformed == TRUE, class == "MONEY") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 14

“DIGIT”

train %>%
  filter(transformed == TRUE, class == "DIGIT") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 15

“TELEPHONE”

train %>%
  filter(transformed == TRUE, class == "TELEPHONE") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 16

“TIME”

train %>%
  filter(transformed == TRUE, class == "TIME") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 17

“FRACTION”

train %>%
  filter(transformed == TRUE, class == "FRACTION") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 18

“ADDRESS”

train %>%
  filter(transformed == TRUE, class == "ADDRESS") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 19

“LETTERS”

train %>%
  filter(transformed == TRUE, class == "LETTERS") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 20

“VERBATIM”

train %>%
  filter(transformed == TRUE, class == "VERBATIM") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 21

“ELECTRONIC”

train %>%
  filter(transformed == TRUE, class == "ELECTRONIC") %>%
  count(before) %>%
  with(wordcloud(before, n, max.words = 30, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 22

5 Context matters: a next-neighbour analysis

In this section we analyse whether the context of a normalized token provides any indication of its class. We do this by preparing a data frame in which every token is listed together with the tokens (and their classes) that come immediately before and after it in the text data. (Note that for the first/last token of a sentence, the previous/next token will belong to the previous/next sentence.)

We build this new data frame using the dplyr function lead, which shifts the contents of a vector by a given number of positions.

t3 <- train %>%
  select(class, before) %>%
  mutate(before2 = lead(train$before, 1),
         before3 = lead(train$before,2),
         class2 = lead(train$class, 1),
         class3 = lead(train$class, 2),
         transformed = c(train$transformed[-1], NA),
         after = c(train$after[-1], 0)) %>%
  filter(!is.na(before3)) %>%
  rename(class_prev = class, class_next = class3, class = class2,
         before_prev = before, before_next = before3, before = before2) %>%
  select(before_prev, before, before_next, class_prev,
         class, class_next, transformed, after)

5.1 Token class overview - previous vs next

We begin this analysis with a set of overview plots comparing the frequencies of the classes that occur immediately before and immediately after all tokens of a given class. Note the logarithmic x-axes. These plots show the absolute numbers of tokens per class, and the classes are colour-coded for easier comparison. Here is the corresponding helper function:

plot_t3_comp <- function(cname){
  p1 <- t3 %>%
    filter(transformed == TRUE & class == cname) %>%
    ggplot(aes(class_prev, fill = class_prev)) +
    labs(x = "Previous class") +
    geom_bar() +
    coord_flip() +
    scale_y_log10() +
    theme(legend.position = "none")

  p2 <- t3 %>%
    filter(transformed == TRUE & class == cname) %>%
    ggplot(aes(class_next, fill = class_next)) +
    labs(x = "Next class") +
    geom_bar() +
    coord_flip() +
    scale_y_log10() +
    theme(legend.position = "none")
  
  layout <- matrix(c(1,2),2,1,byrow=TRUE)
  multiplot(p1, p2, layout=layout)
}

“PLAIN”

plot_t3_comp("PLAIN")
Fig. 23

“DATE”

plot_t3_comp("DATE")
Fig. 24

“CARDINAL”

plot_t3_comp("CARDINAL")
Fig. 25

“MEASURE”

plot_t3_comp("MEASURE")
Fig. 26

“ORDINAL”

plot_t3_comp("ORDINAL")
Fig. 27

“DECIMAL”

plot_t3_comp("DECIMAL")
Fig. 28

“MONEY”

plot_t3_comp("MONEY")
Fig. 29

“DIGIT”

plot_t3_comp("DIGIT")
Fig. 30

“TELEPHONE”

plot_t3_comp("TELEPHONE")
Fig. 31

“TIME”

plot_t3_comp("TIME")
Fig. 32

“FRACTION”

plot_t3_comp("FRACTION")
Fig. 33

“ADDRESS”

plot_t3_comp("ADDRESS")
Fig. 34

“LETTERS”

plot_t3_comp("LETTERS")
Fig. 35

“VERBATIM”

plot_t3_comp("VERBATIM")
Fig. 36

“ELECTRONIC”

plot_t3_comp("ELECTRONIC")
Fig. 37

5.2 Treemap overviews

For an alternative comprehensive overview of the neighbour class statistics, here are two treemaps built using the treemapify package.

The treemaps summarise at a glance which neighbour combinations exist and are most frequent:

t3 %>%
  group_by(class, class_prev) %>%
  count() %>%
  ungroup() %>%
  mutate(n = log10(n+1)) %>%
  ggplot(aes(area = n, fill = class_prev, label = class_prev, subgroup = class)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
                             "black", fontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  theme(legend.position = "null") +
  ggtitle("Previous classes grouped by token class; log scale frequencies")
Fig. 38

t3 %>%
  group_by(class, class_next) %>%
  count() %>%
  ungroup() %>%
  mutate(n = log10(n+1)) %>%
  ggplot(aes(area = n, fill = class_next, label = class_next, subgroup = class)) +
  geom_treemap() +
  geom_treemap_subgroup_border() +
  geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
                             "black", fontface = "italic", min.size = 0) +
  geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
  theme(legend.position = "null") +
  ggtitle("Next classes grouped by token class; log scale frequencies")
Fig. 39

The first plot shows the frequency of previous token classes and the second treemap the frequency of next token classes (all labeled in white) for each target token class (labeled in a large black font and separated by group with grey borders). We again use a logarithmic frequency scaling to improve the visibility of the rare combinations. Group sizes decrease from the bottom left to the top right of the plot and each subgroup box. The colours of the white-labeled neighbour boxes are identical throughout the plot (e.g. “PUNCT” is always purple). Note that the subgroup boxes in the two plots have different sizes for identical classes because of the log(n+1) transformation.

5.3 Relative percentages

5.3.1 All tokens

Based on these raw numbers we can study the relative contributions of a certain class to the previous or next tokens of another class. To visualise these dependencies, we again determine the log10(n+1) frequency distributions for each class among the previous/next tokens depending on the class of the reference token. These are the numbers in the bar plots above. We then normalise the range of these numbers (i.e. the height of the bars) to the interval [0,1] for each class. This data wrangling is done in the following code block:

prev_stat <- t3 %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

next_stat <- t3 %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

t3_norm_prev <- t3 %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  left_join(prev_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)

t3_norm_next <- t3 %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  left_join(next_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)

The result is displayed in two tile plots for the class vs previous class and class vs next class, respectively. Here, each tile shows the relative frequency (by class) of a specific neighbour pairing. The colour coding assigns bluer colours to lower frequencies and redder colours to higher frequencies:

t3_norm_prev %>%
  ggplot(aes(class, class_prev, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
  ggtitle("Class vs previous class; relative log scale frequencies")
Fig. 40

t3_norm_next %>%
  ggplot(aes(class, class_next, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
  ggtitle("Class vs next class; relative log scale frequencies")
Fig. 41

We find:

  • Not all possible neighbouring combinations exist: only about 77% of the tiles in these plots are filled. Certain potential pairs such as “ORDINAL” and “FRACTION” or “TIME” and “MONEY” are never found next to each other.

  • “PUNCT” and “PLAIN”, the overall most frequent classes, also dominate the relative frequencies of previous/next classes for every token class. “PUNCT” is relatively weakly related to “MONEY” (previous) and “ORDINAL” (next).

  • “DIGIT” tokens are often preceded or followed by “LETTERS” tokens.

  • “TELEPHONE” is very likely to be preceded by “LETTERS”, as well.

  • “VERBATIM” tokens are very likely to be followed and preceded by other “VERBATIM” tokens. No other token class has such a strong correlation with itself. Some, such as “DIGIT”, “MONEY”, or “TELEPHONE”, are rather unlikely to be preceded by the same class.

  • “VERBATIM” is also very unlikely to be preceded by “ADDRESS” and “TIME” or followed by “ADDRESS”.

5.3.2 Transformed tokens only

Now we restrict this neighbour analysis to the transformed tokens only. Naturally, these will still have transformed or untransformed tokens in the previous and next positions. Having set up our “context” data frame with this option in mind, we only need to modify our code slightly to prepare the corresponding tile plots:

prev_stat <- t3 %>%
  filter(transformed == TRUE) %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

next_stat <- t3 %>%
  filter(transformed == TRUE) %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

t3_norm_prev <- t3 %>%
  filter(transformed == TRUE) %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  left_join(prev_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)

t3_norm_next <- t3 %>%
  filter(transformed == TRUE) %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  left_join(next_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)
t3_norm_prev %>%
  ggplot(aes(class, class_prev, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
  ggtitle("Class vs previous class; relative log scale; transformed")
Fig. 42

t3_norm_next %>%
  ggplot(aes(class, class_next, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
  ggtitle("Class vs next class; relative log scale; transformed")
Fig. 43

We find:

  • The class “PUNCT” is now gone from the “Token class” axis, since those symbols are never transformed.

  • Two combinations disappear: “VERBATIM” is not transformed if preceded or followed by “TIME” (relatively infrequent neighbours to start with).

  • The pair of “PLAIN” preceded by “CARDINAL” significantly increases in frequency (within the rarely transformed “PLAIN” class).

  • In general, the classes “PLAIN” and “VERBATIM” experience the most visible changes with respect to the total set of neighbouring tokens, since these are the classes with the highest percentage of untransformed tokens (after “PUNCT”, of course).

5.3.3 No cross-sentence pairs

The previous plots did not distinguish between neighbouring tokens that were placed at the end of one sentence and the beginning of another. Since the sentences in our data set are unrelated and in a random order, the end of one sentence should not influence the beginning of the next one. Here we take this into account by removing those pairs of tokens that bridge two sentences.

foo <- train %>%
  group_by(sentence_id) %>%
  summarise(sentence_len = max(token_id)) %>%
  ungroup()

bar <- train %>%
  left_join(foo, by = "sentence_id") %>%
  mutate(first_token = token_id == 0,
         last_token = token_id == sentence_len) %>%
  slice(c(-1,-nrow(train))) %>%
  select(first_token, last_token)

prev_stat <- t3 %>%
  bind_cols(bar) %>%
  filter(first_token == FALSE) %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

next_stat <- t3 %>%
  bind_cols(bar) %>%
  filter(last_token == FALSE) %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  group_by(class) %>%
  summarise(mean_n = mean(n),
            max_n = max(n),
            min_n = min(n))

t3_norm_prev <- t3 %>%
  bind_cols(bar) %>%
  filter(first_token == FALSE) %>%
  count(class, class_prev) %>%
  mutate(n = log10(n+1)) %>%
  left_join(prev_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)

t3_norm_next <- t3 %>%
  bind_cols(bar) %>%
  filter(last_token == FALSE) %>%
  count(class, class_next) %>%
  mutate(n = log10(n+1)) %>%
  left_join(next_stat, by = "class") %>%
  mutate(frac_norm = (n-min_n)/(max_n - min_n),
         test1 = max_n - min_n,
         test2 = n - min_n)
t3_norm_prev %>%
  ggplot(aes(class, class_prev, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
  ggtitle("Class vs previous class; relative log scale; no sentence bridging")
Fig. 44

t3_norm_next %>%
  ggplot(aes(class, class_next, fill = frac_norm)) +
  geom_tile() +
  theme(axis.text.x  = element_text(angle=45, hjust=1, vjust=0.9)) +
  scale_fill_distiller(palette = "Spectral") +
  labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
  ggtitle("Class vs next class; relative log scale; no sentence bridging")
Fig. 45

We find:

  • The differences from the original plots are only marginal, probably because of the relatively small contribution of cross-sentence neighbours to the overall sample: with about 750k sentences and close to 10 million tokens, only about 7.5% of token pairs bridge two sentences.

  • In the class vs next class plot the differences are exclusively seen in the “PUNCT” class. This is to be expected, since a full stop (“.”) belongs to the “PUNCT” category. Thus, we can confirm the expectation that (practically) all sentences end the way this one does.

Actually, let’s see if this is true. Here we select only the final tokens of each sentence and plot their class distribution. We also plot the frequency of the different tokens within the “PUNCT” class. Note the logarithmic axes:

bar <- train %>%
  left_join(foo, by = "sentence_id") %>%
  filter(token_id == sentence_len)

p1 <- bar %>%
  ggplot(aes(class, fill = transformed)) +
  geom_bar(position = "dodge") +
  scale_y_log10() +
  ggtitle("Final tokens of sentence")

p2 <- bar %>%
  filter(class == "PUNCT") %>%
  ggplot(aes(before, fill = before)) +
  geom_bar() +
  scale_y_log10() +
  theme(legend.position = "none", axis.text.x = element_text(face = "bold", size = 16)) +
  ggtitle("Final PUNCT tokens")

layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 42

We find:

  • The “PUNCT” class is indeed the most frequent by far, but four other classes are found at the end of a sentence, too: “PLAIN”, “LETTERS”, “DATE”, and “MONEY”. This was already hinted at in Fig. 5 above.

  • As we already know, no “PUNCT” tokens were transformed. All “DATE”, “LETTERS”, and “MONEY” tokens were transformed, which is not self-evident (see Fig. 5 and the numbers below it).

  • For the “PLAIN” class, about half of the tokens were transformed, which is a far larger fraction than for the overall “PLAIN” sample. This is an interesting result.

  • There are only two cases in which the last token belongs to the “MONEY” class:

bar %>% filter(class == "MONEY") %>% select(-transformed, -sentence_len)
## # A tibble: 2 x 5
##   sentence_id token_id  class before        after
##         <int>    <int> <fctr>  <chr>        <chr>
## 1       37375       12  MONEY   3rs. three rupees
## 2      590317        9  MONEY   3rs. three rupees

You would be absolutely right in thinking that these look suspicious. Let’s see what they really are:

before_vs_after(37375)
## [1] "Before: \"\" Parasite \"\" was used in the US TV Crime show Numb 3rs."
## [1] "After : \"\" Parasite \"\" was used in the u s t v Crime show Numb three rupees"
before_vs_after(590317)
## [1] "Before: John McCarthy ( 1 episode , 2004 ) Numb 3rs."
## [1] "After : John McCarthy ( one episode , two thousand four ) Numb three rupees"

And indeed we have two references to the moderately successful TV show “Numb3rs” (IMDB), whose characters were apparently engaging in the kind of mathematical detective work you’ll find done much more successfully in many Kaggle kernels ;-) . No rupees here, I’m afraid. Although why exactly there is a space between the two parts of that name in our data is not entirely clear to me.

5.4 Explore all neighbour relations

To round off this section, we will create a set of interactive 3D plots using the plotly package. Here, you are able to explore the parameter space of neighbouring classes. The grid is defined as previous class (x), next class (y), and token class (z) and the corresponding frequencies are indicated by the colour and size of the data points.

All tokens

t3 %>%
  group_by(class, class_prev, class_next) %>%
  count() %>%
  ungroup() %>%
  mutate(class = as.character(class),
         class_prev = as.character(class_prev),
         class_next = as.character(class_next)) %>%
  arrange(desc(n)) %>%
  plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
          text = ~paste('Class:', class,
                        '<br>Previous class:', class_prev,
                        '<br>Next class:', class_next,
                        '<br>Counts:', n)) %>%
  add_markers(size = ~log10(n)) %>%
  layout(title = "Class Neighbour frequencies",
         scene = list(xaxis = list(title = 'Previous Class'),
                     yaxis = list(title = 'Next Class'),
                     zaxis = list(title = 'Class')))

Fig. 43

Transformed tokens

t3 %>%
  filter(transformed == TRUE) %>%
  group_by(class, class_prev, class_next) %>%
  count() %>%
  ungroup() %>%
  mutate(class = as.character(class),
         class_prev = as.character(class_prev),
         class_next = as.character(class_next)) %>%
  arrange(desc(n)) %>%
  plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
          text = ~paste('Class:', class,
                        '<br>Previous class:', class_prev,
                        '<br>Next class:', class_next,
                        '<br>Counts:', n)) %>%
  add_markers(size = ~log10(n)) %>%
  layout(title = "Class Neighbour frequencies",
         scene = list(xaxis = list(title = 'Previous Class'),
                     yaxis = list(title = 'Next Class'),
                     zaxis = list(title = 'Class')))

Fig. 44

No cross-sentence pairs

foo <- train %>%
  group_by(sentence_id) %>%
  summarise(sentence_len = max(token_id)) %>%
  ungroup()

bar <- train %>%
  left_join(foo, by = "sentence_id") %>%
  mutate(first_token = token_id == 0,
         last_token = token_id == sentence_len) %>%
  slice(c(-1,-nrow(train))) %>%
  select(first_token, last_token)

t3 %>%
  bind_cols(bar) %>%
  filter(first_token == FALSE & last_token == FALSE) %>%
  group_by(class, class_prev, class_next) %>%
  count() %>%
  ungroup() %>%
  mutate(class = as.character(class),
         class_prev = as.character(class_prev),
         class_next = as.character(class_next)) %>%
  arrange(desc(n)) %>%
  plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
          text = ~paste('Class:', class,
                        '<br>Previous class:', class_prev,
                        '<br>Next class:', class_next,
                        '<br>Counts:', n)) %>%
  add_markers(size = ~log10(n)) %>%
  layout(title = "Class Neighbour frequencies",
         scene = list(xaxis = list(title = 'Previous Class'),
                     yaxis = list(title = 'Next Class'),
                     zaxis = list(title = 'Class')))

Fig. 45

5.5 Summary

Certain combinations of classes are more likely to be found next to one another than other combinations. Ultimately, this reflects the grammatical structure of the language.

By making use of these next-neighbour statistics we can estimate the probability that a token was classified correctly by iteratively cross-checking the other tokens in the same sentence. This adds a certain degree of context to a classification/normalisation attempt that is only considering the token itself.
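
To illustrate how these neighbour statistics could feed into such a cross-check, the sketch below turns the class pairs in t3 into an empirical table of conditional probabilities P(class | previous class). This is only one simple way of encoding the neighbour context, not the full iterative procedure:

# empirical conditional probabilities P(class | previous class), from t3
class_given_prev <- t3 %>%
  count(class_prev, class) %>%
  group_by(class_prev) %>%
  mutate(p = n / sum(n)) %>%
  ungroup()

# example lookup: which classes most often follow a "DATE" token?
class_given_prev %>%
  filter(class_prev == "DATE") %>%
  arrange(desc(p))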

6 Transformation statistics

This section will look at specific sentence parameters and how they affect the transformation statistics or behaviour.

6.1 Sentence length and classes

We begin by studying how the length (in tokens) of the sentences affects the statistics of classes and transformed tokens. For this we estimate the mean transformed fraction for each group of sentences with the same length, together with the corresponding uncertainties. In addition, we examine the class frequencies for the shortest sentences and the sentence length distributions for each class:

foo <- train %>%
  group_by(sentence_id, transformed) %>%
  count() %>%
  spread(transformed, n, fill = 0) %>%
  mutate(frac = `TRUE`/(`TRUE` + `FALSE`),
         sen_len = `TRUE` + `FALSE`)
  
bar <- foo %>%
  group_by(sen_len) %>%
  summarise(mean_frac = mean(frac),
            sd_frac = sd(frac),
            ct = n())

foobar <- foo %>%
  left_join(train, by = "sentence_id")

p1 <- bar %>%
  ggplot(aes(sen_len, mean_frac, size = ct)) +
  geom_errorbar(aes(ymin = mean_frac-sd_frac, ymax = mean_frac+sd_frac),
                width = 0., size = 0.7, color = "gray30") +
  geom_point(col = "red") +
  scale_x_log10() +
  labs(x = "Sentence length", y = "Average transformation fraction")

p2 <- foobar %>%
  filter(sen_len < 6) %>%
  group_by(class, sen_len) %>%
  count() %>%
  ungroup() %>%
  filter(n > 100) %>%
  ggplot(aes(sen_len, n, fill = class)) +
  geom_col(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Sentence length", y = "Proportion per class")

p3 <- foobar %>%
  ggplot(aes(sen_len)) +
  geom_density(bw = .1, size = 1.5, fill = "red") +
  scale_x_log10() +
  labs(x = "Sentence length") +
  facet_wrap(~ class, nrow = 2)
  
layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)
Fig. 46

p1 <- 1; p2 <- 1; p3 <- 1

We find:

  • For sentences with more than about 10 tokens, the proportion of transformed tokens first decreases somewhat up to a length of about 20 tokens and then increases slightly. However, none of these changes appear to be significant within their standard deviations. Here, the size of the red data points is proportional to the number of cases per group.

  • Interestingly, for very short sentences of only 2 or 3 tokens the mean fraction of transformed tokens is significantly higher: about 33% for 3 tokens and practically always 50% for 2 tokens.

  • Upon closer inspection of the bar plot we find that almost all 2-token sentences consist of a “DATE” class and a “PUNCT” class token. Note that our plot omits rare classes with fewer than 100 cases for better visibility. Below is the complete table for the 2-token sentences.

  • The 3-token sentences basically just add a “PLAIN” token to the mix, reducing the proportions of “DATE” and “PUNCT” tokens to 1/3 each. For longer sentences, the mix starts to become more heterogeneous. Interestingly, for longer sentences the “PUNCT” fraction does not continue to decline geometrically, indicating the presence of tokens other than the final full stop.

  • Among the individual classes we can see considerable differences in the shape of their sentence length distributions. For instance, “VERBATIM” has a far broader distribution than most, reaching notable numbers above 100 tokens. “PLAIN”, “MEASURE”, and “DECIMAL” have the sharpest peaks, while “DATE” has an interesting step structure in addition to the dominance in short sentences that we discussed above (here is the promised table for 2-token sentences):

foobar %>%
  filter(sen_len <= 2) %>%
  group_by(sen_len, class, transformed) %>%
  count()
## # A tibble: 5 x 4
## # Groups:   sen_len, class, transformed [5]
##   sen_len     class transformed     n
##     <dbl>    <fctr>       <lgl> <int>
## 1       2      DATE        TRUE 10127
## 2       2     PLAIN       FALSE     3
## 3       2     PLAIN        TRUE     1
## 4       2     PUNCT       FALSE 10133
## 5       2 TELEPHONE        TRUE    10

6.2 Challenges

Converting letters and abbreviations is still tricky for the computer. It ought to know that “pmo” means “Project Management Office” when it appears in the context of “industry safety”.
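
A toy sketch of what such context-dependent expansion could look like is given below. The lookup table, context keywords, and helper function are purely hypothetical and only serve to illustrate the idea:

# hypothetical context-dependent abbreviation lookup (illustration only)
abbrev_table <- tibble(
  abbrev = c("pmo", "pmo"),
  context_keyword = c("safety", "minister"),
  expansion = c("project management office", "prime minister's office")
)

expand_abbrev <- function(token, sentence) {
  # keep only table rows for this abbreviation whose context keyword
  # appears in the (lower-cased) sentence; fall back to the raw token
  hits <- abbrev_table %>%
    filter(abbrev == token,
           str_detect(str_to_lower(sentence), context_keyword))
  if (nrow(hits) > 0) hits$expansion[1] else token
}

expand_abbrev("pmo", "The pmo issued new industry safety guidelines.")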

The machine still needs a large amount of data to improve its reading, even if only by 1%. In this case study we used about 350 megabytes of text data to achieve a reasonably good result, even though processing it slowed down my machine.

6.3 Summary

We learnt that the average number of tokens in each sentence is about 8, with a maximum of 255.
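
Both figures can be re-derived directly from the train tibble defined earlier; a minimal sketch:

# sentence length summary recomputed from the training data
train %>%
  group_by(sentence_id) %>%
  summarise(sen_len = n()) %>%
  summarise(mean_len = mean(sen_len),
            max_len = max(sen_len))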

We also learnt that computers are good at translating numbers into text, but it is still tricky for them to combine that output correctly with the rest of a sentence.

For more machine learning use cases, visit us at Cartwheel Technologies.