As many of us can attest, learning another language is tough. Picking up on nuances like slang, dates and times, and local expressions can often be a distinguishing factor between proficiency and fluency. This challenge is even more difficult for a computer.
Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken.
Many speech and language applications, including text-to-speech synthesis (TTS) and automatic speech recognition (ASR), require text to be converted from written expressions into appropriate “spoken” forms. This process, known as text normalization, converts 12:47 to “twelve forty-seven” and $3.16 into “three dollars, sixteen cents.”
In this project, we use machine learning, linguistic knowledge, and native-speaker intuition to develop text normalization for text-to-speech synthesis (TTS) and automatic speech recognition (ASR) in the English language, and to test the grammar against the known rules.
The objective of this research paper is to create tools for the detection, normalization and denormalization of non-standard words such as abbreviations, numbers or currency expressions, and of semiotic classes, i.e. text tokens and token sequences that represent particular semantically constrained entities such as measurement phrases, addresses or dates.
The dataset (a large corpus of text, about 350 megabytes in total) comes as two files: en_test.csv, the test set, which does not contain the normalized text, and en_train.csv, the training set, which does.
Applications of this work include text-to-speech synthesis, automatic speech recognition, and information extraction/retrieval.
# general visualisation
library('ggplot2') # visualisation
library('scales') # visualisation
library('grid') # visualisation
library('gridExtra') # visualisation
library('RColorBrewer') # visualisation
library('corrplot') # visualisation
library('ggforce') # visualisation
library('treemapify') # visualisation
# general data manipulation
library('dplyr') # data manipulation
library('readr') # input/output
library('data.table') # data manipulation
library('tibble') # data wrangling
library('tidyr') # data wrangling
library('stringr') # string manipulation
library('forcats') # factor manipulation
# Text / NLP
library('tidytext') # text analysis
library('tm') # text analysis
library('SnowballC') # text analysis
library('topicmodels') # text analysis
library('wordcloud') # text visualisation
# Extra Vis
library('plotly')
# Extras
library('babynames')
#We use the *multiplot* function, courtesy of [R Cookbooks](http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/) to create multi-panel plots.
# Define multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
We use data.table’s fread function to speed up reading in the data. The training data file is rather extensive, with close to 10 million rows and 300 MB uncompressed size:
train <- as.tibble(fread('en_train.csv'))
test <- as.tibble(fread('en_test.csv'))
Let’s have an overview of the data sets using the summary and glimpse tools. First the training data:
summary(train)
## sentence_id token_id class before
## Min. : 0 Min. : 0.00 Length:9918441 Length:9918441
## 1st Qu.:192526 1st Qu.: 3.00 Class :character Class :character
## Median :379259 Median : 6.00 Mode :character Mode :character
## Mean :377857 Mean : 7.52
## 3rd Qu.:564189 3rd Qu.: 11.00
## Max. :748065 Max. :255.00
## after
## Length:9918441
## Class :character
## Mode :character
##
##
##
We have a total of 748,065 sentences in our training corpus, each tagged with a unique identification number. The average number of words or tokens per sentence is about 8, and the maximum is 255, which is unusually long for a well-formed sentence. In total, the training set contains 9,918,441 words or tokens, each labelled with its own token class.
glimpse(train)
## Observations: 9,918,441
## Variables: 5
## $ sentence_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ token_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6,...
## $ class <chr> "PLAIN", "PLAIN", "PLAIN", "PLAIN", "PLAIN", "PLAI...
## $ before <chr> "Brillantaisia", "is", "a", "genus", "of", "plant"...
## $ after <chr> "Brillantaisia", "is", "a", "genus", "of", "plant"...
We’ve got a sense of our variables, their class type, and the first few observations of each. We can see we’re working with 9,918,441 observations of 5 variables. To make things a bit more explicit since a couple of the variable names aren’t 100% illuminating, here’s what we’ve got to deal with:
Variable Name | Description |
---|---|
sentence_id | Each sentence has a unique sentence id number, from 0 to 748,065 |
token_id | Each token or word within a sentence has a unique token id number from 0 to 255 |
class | The class column gives the token type (semiotic class) to which each token belongs |
before | The before column contains the raw text |
after | The after column contains the normalized text |
Let’s move on to our test dataset:
summary(test)
## sentence_id token_id before
## Min. : 0 Min. : 0.000 Length:1088564
## 1st Qu.:17488 1st Qu.: 3.000 Class :character
## Median :35028 Median : 7.000 Mode :character
## Mean :35007 Mean : 8.344
## 3rd Qu.:52522 3rd Qu.: 12.000
## Max. :69999 Max. :248.000
We have a total of 69,999 sentences in our test corpus, each tagged with a unique identification number. The average number of words or tokens per sentence is about 8, and the maximum is 248, which again is unusually long for a well-formed sentence. In total, the test set contains 1,088,564 words or tokens; unlike the training set, no class labels are provided.
glimpse(test)
## Observations: 1,088,564
## Variables: 3
## $ sentence_id <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ token_id <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ before <chr> "Another", "religious", "family", "is", "of", "Haz...
We’ve got a sense of our variables, their class type, and the first few observations of each. We know we’re working with 1,088,564 observations of 3 variables. Let’s take a deeper look at our test data:
Variable Name | Description |
---|---|
sentence_id | Each sentence has a unique sentence id number, from 0 to 69,999 |
token_id | Each token or word within a sentence has a unique token id number from 0 to 248 |
before | The before column contains the raw text |
print(c(max(train$sentence_id), max(test$sentence_id)))
## [1] 748065 69999
In summary, we have 748,065 sentences in the training data, from which the computer can learn, and 69,999 sentences in the test data, for which we want to predict the normalized text.
We combine the train and test data sets for comparison treatment:
combine <- bind_rows(train %>% mutate(set = "train"),test %>% mutate(set = "test")) %>%
mutate(set = as.factor(set))
There are no missing values in our data set:
sum(is.na(train))
## [1] 0
sum(is.na(test))
## [1] 0
We decide to turn Class into a factor for exploration purposes:
train <- train %>%
mutate(class = factor(class))
Before diving into data summaries, let’s begin by printing a few before and after sentences to get a feeling for the data. To make this task easier, we first define a short helper function to compare these sentences:
before_vs_after <- function(sent_id){
bf <- train %>%
filter(sentence_id == sent_id) %>%
.$before %>%
str_c(collapse = " ")
af <- train %>%
filter(sentence_id == sent_id) %>%
.$after %>%
str_c(collapse = " ")
print(str_c("Before:", bf, sep = " ", collapse = " "))
print(str_c("After :", af, sep = " ", collapse = " "))
}
Here are a few example sentences:
before_vs_after(11)
## [1] "Before: Retrieved April 10, 2013 ."
## [1] "After : Retrieved april tenth twenty thirteen ."
before_vs_after(99)
## [1] "Before: Retrieved 12 April 2015 ."
## [1] "After : Retrieved the twelfth of april twenty fifteen ."
before_vs_after(1234)
## [1] "Before: The PMO provides secretarial assistance to the Prime Minister ."
## [1] "After : The p m o provides secretarial assistance to the Prime Minister ."
We notice a few things:
The first two examples are very similar, but the normalization distinguishes between “april tenth” and “the twelfth of april” depending on how the date is written.
“2015” becomes “twenty fifteen” instead of “two thousand fifteen”.
“April” becomes “april”. Lower vs upper case should not be a problem, but it is noteworthy.
Acronyms like “PMO” turn into their spoken form “p m o”, without giving us the meaning of PMO.
Now let’s look at overview visualisations. First, we examine the different token classes and their frequency. Here is a visual summary:
train %>%
group_by(class) %>%
count() %>%
ungroup() %>%
mutate(class = reorder(class, n)) %>%
ggplot(aes(class,n, fill = class)) +
geom_col() +
scale_y_log10() +
labs(y = "Frequency") +
coord_flip() +
theme(legend.position = "none")
Fig. 1
We use a logarithmic scale on the frequency axis, primarily because the counts are heavily skewed across classes.
Insights:
The “PLAIN” class is by far the most frequent, followed by the punctuation class “PUNCT” and then “DATE”.
In total there are 16 classes, with “TIME”, “FRACTION”, and “ADDRESS” having the least number of occurrences (around/below 100 tokens each).
Next up is the distribution of sentence lengths in the training data:
train %>%
group_by(sentence_id) %>%
summarise(sentence_len = max(token_id)) %>%
ggplot(aes(sentence_len)) +
geom_histogram(bins = 50, fill = "red") +
scale_y_sqrt() +
scale_x_log10() +
labs(x = "Sentence length")
Fig. 2
The sentence lengths follow an approximately bell-shaped distribution on the logarithmic axis. Sentences are typically up to 15-20 tokens long, after which the frequency drops quickly. Very long sentences (> 100 tokens) exist but are relatively rare. Note again the logarithmic x-axis and square-root y-axis.
Below we compare the sentence length distributions for the training vs test data sets, this time with an overlapping density plot:
combine %>%
group_by(sentence_id, set) %>%
summarise(sentence_len = max(token_id)) %>%
ggplot(aes(sentence_len, fill = set)) +
geom_density(bw = 0.1, alpha = 0.5) +
scale_x_log10() +
labs(x = "Sentence length")
Fig. 3
We find that the training data contains a larger share of shorter sentences (< 10 tokens), while the test data contains a larger proportion of longer sentences. Note the logarithmic x-axis.
Next, we will look at the token_ids of each class in their sentences:
train %>%
ggplot(aes(reorder(class, token_id, FUN = median), token_id, col = class)) +
geom_boxplot() +
scale_y_log10() +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(x = "Class", y = "Token ID")
Fig. 4
Insights:
The “TELEPHONE” class appears predominantly at token_ids of less than 10.
Above token_ids of 100 we find no occurrences of the “ELECTRONIC”, “ADDRESS”, “FRACTION”, “DIGIT”, “ORDINAL”, “TIME”, “MONEY”, and “MEASURE” classes. Of those, “FRACTION”, “MONEY”, and “MEASURE” barely appear above token_id == 20.
The classes “DECIMAL”, “MONEY”, “PUNCT”, and “MEASURE” are rarely found in the first token of a sentence.
We can take this analysis a step further by relating the token_id to the length of the sentence. Thereby, we will see at which relative position in a sentence a certain class is more likely to occur:
sen_len <- train %>%
group_by(sentence_id) %>%
summarise(sentence_len = max(token_id))
train %>%
left_join(sen_len, by = "sentence_id") %>%
mutate(token_rel = token_id/sentence_len) %>%
ggplot(aes(reorder(class, token_rel, FUN = median), token_rel, col = class)) +
geom_boxplot() +
#scale_y_log10() +
theme(legend.position = "none", axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(x = "Class", y = "Relative token ID")
Fig. 5
Findings:
As suggested above, “TELEPHONE” tokens are more likely to occur early in a sentence. A similar observation holds for “LETTERS” and “PLAIN”.
Unsurprisingly, the punctuation “PUNCT” class can be found more frequently towards the end of a sentence. Similarly, “MONEY” tokens occur relatively late.
In general, there is a certain trend among the classes, with medians ranging from about 0.4 to 0.8. However, interquartile ranges are wide and there is a large amount of overlap between the different classes.
Now we will include the effects of the text normalization in our study by analysing the changes it introduced in the training data.
To begin, we define the new feature transformed to indicate those tokens that changed from before to after:
train <- train %>%
mutate(transformed = (before != after))
train %>%
group_by(transformed) %>%
count() %>%
mutate(freq = n/nrow(train))
## # A tibble: 2 x 3
## # Groups: transformed [2]
## transformed n freq
## <lgl> <int> <dbl>
## 1 FALSE 9258648 0.93347815
## 2 TRUE 659793 0.06652185
In total, only about 7% of the tokens in the training data, or 659,793 tokens, were changed during the process of text normalization.
This explains the high baseline accuracies we can achieve even without any adjustment of the test data input.
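For reference, here is a minimal sketch of such a copy-through baseline. The id format (sentence_id and token_id joined by an underscore) and the output column names are assumptions about the required submission format, not something taken from the data files:
# Copy-through baseline sketch: predict after = before for every test token.
# The submission id format is an assumption.
baseline <- test %>%
  mutate(id = str_c(sentence_id, token_id, sep = "_"),
         after = before) %>%
  select(id, after)
write_csv(baseline, "baseline_submission.csv")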
By comparing the fraction of tokens that changed from before to after the text normalisation we can visualise which classes are most affected by this process:
train %>%
ggplot(aes(class, fill = transformed)) +
geom_bar(position = "fill") +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
labs(x = "Class", y = "Transformed fraction [%]")
Fig. 6
Insights:
Only a small fraction of “ELECTRONIC” and “LETTERS” tokens (a few percent each) remained unchanged from before to after.
A small but noticeable fraction of “PLAIN” class text elements changed.
The majority (about two thirds) of “VERBATIM” class tokens remained identical during the normalisation.
train %>%
group_by(class, transformed) %>%
count() %>%
spread(key = transformed, value = n) %>%
mutate(`TRUE` = ifelse(is.na(`TRUE`),0,`TRUE`),
`FALSE` = ifelse(is.na(`FALSE`),0,`FALSE`)) %>%
mutate(frac = `FALSE`/(`TRUE`+`FALSE`)*100) %>%
filter(frac>1e-5) %>%
arrange(desc(frac)) %>%
rename(unchanged = `FALSE`, changed = `TRUE`, unchanged_percentage = frac)
## # A tibble: 7 x 4
## # Groups: class [7]
## class unchanged changed unchanged_percentage
## <fctr> <dbl> <dbl> <dbl>
## 1 PUNCT 1880507 0 100.00000000
## 2 PLAIN 7317221 36472 99.50403151
## 3 VERBATIM 52271 25837 66.92144211
## 4 LETTERS 8426 144369 5.51457836
## 5 ELECTRONIC 198 4964 3.83572259
## 6 MEASURE 22 14761 0.14881959
## 7 MONEY 3 6125 0.04895561
This is a breakdown of the number and percentage of tokens per class that remained unchanged from before to after. As we can see, punctuation marks remained completely unchanged, while 36,472, or about 0.5%, of “PLAIN” tokens changed.
In order to explore the meaning of these classes for our normalization task we modify our helper function to include the class name. Here we also remove punctuation:
before_vs_after_class <- function(sent_id){
bf <- train %>%
filter(sentence_id == sent_id & class != "PUNCT") %>%
.$before %>%
str_pad(30) %>%
str_c(collapse = " ")
af <- train %>%
filter(sentence_id == sent_id & class != "PUNCT") %>%
.$after %>%
str_pad(30) %>%
str_c(collapse = " ")
cl <- train %>%
filter(sentence_id == sent_id & class != "PUNCT") %>%
.$class %>%
str_pad(30) %>%
str_c(collapse = " ")
print(str_c("[Class]:", cl, sep = " ", collapse = " "))
print(str_c("Before :", bf, sep = " ", collapse = " "))
print(str_c("After :", af, sep = " ", collapse = " "))
}
Using our example sentence from earlier, we get the following output structure, indicating the combination of a “PLAIN” and a “DATE” token:
before_vs_after_class(11)
## [1] "[Class]: PLAIN DATE"
## [1] "Before : Retrieved April 10, 2013"
## [1] "After : Retrieved april tenth twenty thirteen"
Next we will explore the different token classes within categories of similarity and provide a few examples for each.
As we saw above, most “PLAIN” tokens remained unchanged. However, about 0.5% were transformed, which still amounts to about 36k tokens:
train %>%
filter(class == "PLAIN") %>%
group_by(transformed) %>%
count() %>%
spread(key = transformed, value = n) %>%
mutate(frac = `FALSE`/(`TRUE`+`FALSE`)*100) %>%
filter(!is.na(frac)) %>%
arrange(desc(frac)) %>%
rename(unchanged = `FALSE`, changed = `TRUE`, unchanged_percentage = frac)
## # A tibble: 1 x 3
## unchanged changed unchanged_percentage
## <int> <int> <dbl>
## 1 7317221 36472 99.50403
These are a few transformed examples:
set.seed(1234)
train %>%
filter(transformed == TRUE & class == "PLAIN") %>%
sample_n(10)
## # A tibble: 10 x 6
## sentence_id token_id class before after transformed
## <int> <int> <fctr> <chr> <chr> <lgl>
## 1 88698 0 PLAIN dr doctor TRUE
## 2 469912 8 PLAIN colours colors TRUE
## 3 460114 2 PLAIN vol volume TRUE
## 4 470842 11 PLAIN - to TRUE
## 5 646512 8 PLAIN - to TRUE
## 6 483660 7 PLAIN theatre theater TRUE
## 7 7782 1 PLAIN - to TRUE
## 8 179773 9 PLAIN lb pound TRUE
## 9 503183 12 PLAIN neighbour neighbor TRUE
## 10 391000 13 PLAIN st saint TRUE
We see a few typical changes, such as “-” to “to” and the adjustment from British to American spelling.
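A minimal sketch of how a few of these frequent “PLAIN” substitutions could be handled with a hand-built lookup table; the table below only contains examples we have just seen and is illustrative, not exhaustive:
# Illustrative lookup for a handful of frequent PLAIN substitutions seen above.
plain_lookup <- c("dr" = "doctor", "st" = "saint", "vol" = "volume",
                  "lb" = "pound", "-" = "to",
                  "colours" = "colors", "theatre" = "theater")
normalize_plain <- function(token) {
  out <- unname(plain_lookup[token])
  ifelse(is.na(out), token, out)
}
normalize_plain(c("dr", "theatre", "genus"))
# expected: "doctor" "theater" "genus"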
It is important to briefly cross-check the “PUNCT” class. Intuitively, punctuation should not be affected by text normalization unless it is associated to other structures such as numbers or dates. Here are a few punctuation examples:
set.seed(1234)
train %>%
filter(class == "PUNCT") %>%
sample_n(10)
## # A tibble: 10 x 6
## sentence_id token_id class before after transformed
## <int> <int> <fctr> <chr> <chr> <lgl>
## 1 88108 5 PUNCT ' ' FALSE
## 2 468923 11 PUNCT . . FALSE
## 3 459100 4 PUNCT "\"\"" "\"\"" FALSE
## 4 469748 8 PUNCT , , FALSE
## 5 645746 19 PUNCT . . FALSE
## 6 482242 6 PUNCT . . FALSE
## 7 7445 25 PUNCT . . FALSE
## 8 177712 16 PUNCT . . FALSE
## 9 501373 13 PUNCT . . FALSE
## 10 388155 21 PUNCT , , FALSE
As expected, these tokens are identical before and after:
train %>%
filter(class == "PUNCT") %>%
mutate(test = (before == after)) %>%
group_by(test) %>%
count()
## # A tibble: 1 x 2
## # Groups: test [1]
## test n
## <lgl> <int>
## 1 TRUE 1880507
There are 1,880,507 punctuation marks in our training set.
Numbers are by far the most diverse group (and an important area for normalization treatment). The number-related classes are: “DATE”, “CARDINAL”, “MEASURE”, “ORDINAL”, “DECIMAL”, “MONEY”, “DIGIT”, “TELEPHONE”, “TIME”, “FRACTION”, and “ADDRESS”. Here we will look at a few examples of each of them.
We have already seen how dates can be transformed differently depending on their formatting. These are some more examples:
before_vs_after_class(8)
## [1] "[Class]: PLAIN DATE"
## [1] "Before : Retrieved 4 March 2014"
## [1] "After : Retrieved the fourth of march twenty fourteen"
before_vs_after_class(12)
## [1] "[Class]: PLAIN PLAIN DATE"
## [1] "Before : Downloaded on 7 August 2007"
## [1] "After : Downloaded on the seventh of august two thousand seven"
Not a big surprise here: these are cardinal numbers. Interestingly, the class also includes Roman numerals:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "CARDINAL") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 89600 CARDINAL 53 fifty three
## 2 471164 CARDINAL 36 thirty six
## 3 462639 CARDINAL 589 five hundred eighty nine
## 4 471926 CARDINAL II two
## 5 642177 CARDINAL 305 three hundred five
before_vs_after(471926)
## [1] "Before: Listed as Grade II by English Heritage ."
## [1] "After : Listed as Grade two by English Heritage ."
These are mainly percentages and physical measurements like megawatts:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "MEASURE") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 83465 MEASURE 45% forty five percent
## 2 457847 MEASURE 270MW two hundred seventy megawatts
## 3 447041 MEASURE 93.3 km2 ninety three point three square kilometers
## 4 458852 MEASURE 1,500 m one thousand five hundred meters
## 5 639771 MEASURE 13.0 mi thirteen point zero miles
before_vs_after(200476)
## [1] "Before: As of 2008 , the population was 50.3% male and 49.7% female ."
## [1] "After : As of two thousand eight , the population was fifty point three percent male and forty nine point seven percent female ."
before_vs_after_class(225531)
## [1] "[Class]: PLAIN PLAIN CARDINAL PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN DECIMAL PLAIN PLAIN PLAIN MEASURE"
## [1] "Before : There were 233 housing units at an average density of 6.8 per square mile 2.6/km²"
## [1] "After : There were two hundred thirty three housing units at an average density of six point eight per square mile two point six per square kilometers"
Ordinal numbers can also include Roman numerals, as in the example of Queen Elizabeth I:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "ORDINAL") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 92031 ORDINAL 45th forty fifth
## 2 487085 ORDINAL 19th nineteenth
## 3 478452 ORDINAL I the first
## 4 487719 ORDINAL 16th sixteenth
## 5 657344 ORDINAL 13th thirteenth
before_vs_after(478452)
## [1] "Before: Maid of honour to Elizabeth I ( 1576 until 1583 ) ."
## [1] "After : Maid of honor to Elizabeth the first ( fifteen seventy six until fifteen eighty three ) ."
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "DECIMAL") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 86596 DECIMAL .66 point six six
## 2 454408 DECIMAL 10.1002 ten point one o o two
## 3 443910 DECIMAL 199.4 one hundred ninety nine point four
## 4 454830 DECIMAL .37 point three seven
## 5 636963 DECIMAL 60 billion sixty billion
before_vs_after(443910)
## [1] "Before: There were 331 housing units at an average density of 199.4 per square mile ( 77.0/km2 ) ."
## [1] "After : There were three hundred thirty one housing units at an average density of one hundred ninety nine point four per square mile ( seventy seven point zero per square kilometers ) ."
Also, in this particular example the token in parentheses is a “MEASURE”:
before_vs_after_class(443910)
## [1] "[Class]: PLAIN PLAIN CARDINAL PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN DECIMAL PLAIN PLAIN PLAIN MEASURE"
## [1] "Before : There were 331 housing units at an average density of 199.4 per square mile 77.0/km2"
## [1] "After : There were three hundred thirty one housing units at an average density of one hundred ninety nine point four per square mile seventy seven point zero per square kilometers"
Money tokens come with currency symbols like “$” or plain text names like “yuan”:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "MONEY") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before
## <int> <fctr> <chr>
## 1 90430 MONEY $10.8 million
## 2 461097 MONEY $985M
## 3 449366 MONEY $5 billion
## 4 461569 MONEY $20
## 5 636060 MONEY 1.8 million yuan
## # ... with 1 more variables: after <chr>
before_vs_after(90430)
## [1] "Before: It issued 369 citations at that time , assessing $10.8 million in penalties ."
## [1] "After : It issued three hundred sixty nine citations at that time , assessing ten point eight million dollars in penalties ."
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "DIGIT") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 91368 DIGIT 202767 two o two seven six seven
## 2 480736 DIGIT 2 two
## 3 474035 DIGIT 1 one
## 4 481444 DIGIT 1996 one nine nine six
## 5 645718 DIGIT 2003 two o o three
before_vs_after(481444)
## [1] "Before: Mintz , Sidney W. 1996 a Tasting Food , Tasting Freedom : Excursions into Eating , Culture , and the Past ."
## [1] "After : Mintz , Sidney w one nine nine six a Tasting Food , Tasting Freedom : Excursions into Eating , Culture , and the Past ."
An interesting error in the normalization: this example looks like a citation of an article or book, which means that “nineteen ninety six” should arguably be correct instead of “one nine nine six”, since the number most likely refers to the year of publication.
set.seed(1234)
train %>%
filter(class == "TELEPHONE") %>%
select(-token_id, -class, -transformed) %>%
sample_n(5)
## # A tibble: 5 x 3
## sentence_id before
## <int> <chr>
## 1 88957 3 18-49
## 2 476323 0192627929
## 3 467425 0-900652-85-3
## 4 477330 1 1693-1775
## 5 644524 985-433-695-6
## # ... with 1 more variables: after <chr>
before_vs_after(88957)
## [1] "Before: The show was viewed by an estimated 3.17 million Americans with a 1.1 / 3 18-49 rating / share ."
## [1] "After : The show was viewed by an estimated three point one seven million Americans with a one point one / three sil one eight sil four nine rating / share ."
That doesn’t look like a telephone number to me. Neither does this one:
before_vs_after(476323)
## [1] "Before: ISBN 0192627929 Chambers TJ , Revell PA , Fuller K , Athanasou ."
## [1] "After : i s b n o one nine two six two seven nine two nine Chambers t j , Revell p a , Fuller K , Athanasou ."
Quite a few of these entries seem to be in fact ISBN numbers. Everything that is formatted like integer digits plus dashes appears to be identified as a “TELEPHONE” token. This could be a tricky category.
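A hedged sketch of how such ISBN-like tokens could be flagged for separate treatment; the pattern (groups of digits, or a final X check digit, joined by dashes or spaces) is an assumption based only on the examples above:
# Flag "digits joined by dashes or spaces" tokens inside the TELEPHONE class;
# many of these appear to be ISBNs rather than phone numbers.
isbn_like <- train %>%
  filter(class == "TELEPHONE") %>%
  mutate(looks_like_isbn = str_detect(before, "^[0-9]+([- ][0-9Xx]+)+$"))
isbn_like %>% count(looks_like_isbn)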
The “TIME” formatting and normalisation are comparatively simple:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "TIME") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 98527 TIME 3:00 pm three p m
## 2 471678 TIME 10:30pm ten thirty p m
## 3 465185 TIME 11:00 a.m. eleven a m
## 4 746945 TIME 10:30 ten thirty
## 5 635327 TIME 3:14 AM three fourteen a m
before_vs_after(471678)
## [1] "Before: The show shifted its timeslot from 10:30pm into 6:00pm in order to compete TV Patrol ."
## [1] "After : The show shifted its timeslot from ten thirty p m into six p m in order to compete t v Patrol ."
Here it becomes tricky again:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "FRACTION") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 91464 FRACTION 2000/400 two thousand four hundredths
## 2 479012 FRACTION 172/6 one hundred seventy two sixths
## 3 470330 FRACTION 6/6 six sixths
## 4 746922 FRACTION 16/8 sixteen eighths
## 5 650125 FRACTION 2/2013 two two thousand thirteenths
Many of these numbers are unlikely to be fractions and are therefore normalized in a rather clumsy way:
before_vs_after(470330)
## [1] "Before: \"\" WWE Superstars Results ( 6/6 ) : Ascension vs"
## [1] "After : \"\" w w e Superstars Results ( six sixths ) : Ascension versus"
But since we only need to reproduce this normalisation approach, it should actually make things easier for us, because the required treatment appears to be rather homogeneous for any two integers separated by a forward slash.
The “ADDRESS” class appears to be assigned to alpha-numeric combinations:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "ADDRESS") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 88342 ADDRESS A380 a three eighty
## 2 458052 ADDRESS B1 b one
## 3 440683 ADDRESS M2 m two
## 4 457174 ADDRESS C1 c one
## 5 612855 ADDRESS C0 c o
before_vs_after(88342)
## [1] "Before: \"\" Seat Map Singapore Airlines Airbus A380 \"\" ."
## [1] "After : \"\" Seat Map Singapore Airlines Airbus a three eighty \"\" ."
before_vs_after(440683)
## [1] "Before: Each of the four identical protein subunits is composed of two membrane spanning alpha helices ( M1 and M2 ) ."
## [1] "After : Each of the four identical protein subunits is composed of two membrane spanning alpha helices ( m one and m two ) ."
The tokens in the class “LETTERS” appear to be normalised to a form which spells them out one by one:
set.seed(4321)
train %>%
select(-token_id, -transformed) %>%
filter(class == "LETTERS") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 254590 LETTERS NS n s
## 2 680878 LETTERS D. d
## 3 313518 LETTERS UK u k
## 4 35001 LETTERS ISBN i s b n
## 5 571356 LETTERS M. A. m a
before_vs_after(571356)
## [1] "Before: Kose , M. A. ; Prasad , E. S. ; Terrones , M. E. ( 2006 ) ."
## [1] "After : Kose , m a ; Prasad , e s ; Terrones , m e ( two thousand six ) ."
Since these text elements are typically completely in upper case, it should be relatively simple to define a normalization. Given the relatively high frequency of the “LETTERS” class, I suggest we use a simple transformation at first, before proceeding to a slightly more advanced leaderboard (LB) baseline.
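A minimal sketch of such a spelling-out transformation, assuming we simply drop any non-letter characters, lower-case what remains, and separate the letters by spaces; this matches the examples above but will certainly not cover every “LETTERS” token:
# Spell out a LETTERS token character by character, e.g. "M. A." -> "m a".
spell_out_letters <- function(token) {
  letters_only <- str_to_lower(str_replace_all(token, "[^A-Za-z]", ""))
  sapply(str_split(letters_only, ""), str_c, collapse = " ")
}
spell_out_letters(c("NS", "M. A.", "ISBN"))
# expected: "n s" "m a" "i s b n"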
Verbatim means using exactly the same words, word for word. Here we find special symbols such as non-English characters. Interestingly, not all of these are copied verbatim from before to after:
set.seed(1234)
train %>%
select(-token_id, -transformed) %>%
filter(class == "VERBATIM") %>%
sample_n(5)
## # A tibble: 5 x 4
## sentence_id class before after
## <int> <fctr> <chr> <chr>
## 1 85530 VERBATIM κ kappa
## 2 458616 VERBATIM & and
## 3 449455 VERBATIM з з
## 4 459603 VERBATIM 家 家
## 5 646953 VERBATIM и и
The exceptions are Greek letters and the ampersand “&”. It might be useful to search for other exceptions and make use of simple transformations such as “&” to “and”.
The “ELECTRONIC” class includes websites, which are normalized as single characters plus a spoken “dot”:
set.seed(4321)
train %>%
filter(class == "ELECTRONIC") %>%
select(-token_id, -class, -sentence_id, -transformed) %>%
sample_n(5)
## # A tibble: 5 x 2
## before after
## <chr> <chr>
## 1 CNN.com c n n dot c o m
## 2 uefa.com u e f a dot c o m
## 3 MyLifeIsTwilight.com m y l i f e i s t w i l i g h t dot c o m
## 4 2008Achieve.org t w o o o e i g h t a c h i e v e dot o r g
## 5 Goal.com g o a l dot c o m
before_vs_after(99083)
## [1] "Before: Climate Summary for Madison , Florida \"\" Weatherbase.com \"\" ."
## [1] "After : Climate Summary for Madison , Florida \"\" w e a t h e r b a s e dot c o m \"\" ."
This is another format that should be relatively easy to learn and implement.
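A minimal sketch of that rule, assuming every character is spelled out in lower case and a literal “.” is read as “dot”; digits (as in “2008Achieve.org”) would need extra handling and are ignored here:
# Spell out an ELECTRONIC token character by character, reading "." as "dot".
spell_out_url <- function(token) {
  chars <- str_split(str_to_lower(token), "")[[1]]
  chars <- ifelse(chars == ".", "dot", chars)
  str_c(chars, collapse = " ")
}
spell_out_url("CNN.com")
# expected: "c n n dot c o m"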
In summary: The different classes appear to follow different translation rules for the text normalization. Even though the class feature is not present in the test data set it might be useful to train a model to identify the specific class for a token and then apply its normalisation rules.
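Building on that idea, here is a rough sketch of such a two-step normaliser: look up (or predict) a token’s class, then dispatch to a class-specific rule, falling back to copying the token unchanged. Only the illustrative helpers sketched earlier (normalize_plain, spell_out_letters, spell_out_url) are wired in; a real solution would need rules for the remaining classes:
# Two-step normalization sketch: route a token to a class-specific rule.
# Classes without a sketched rule (DATE, CARDINAL, MONEY, ...) fall back
# to an unchanged copy.
normalize_token <- function(token, class) {
  switch(as.character(class),
         "LETTERS"    = spell_out_letters(token),
         "ELECTRONIC" = spell_out_url(token),
         "PLAIN"      = normalize_plain(token),
         token)  # default: copy through
}
normalize_token("ISBN", "LETTERS")       # "i s b n"
normalize_token("CNN.com", "ELECTRONIC") # "c n n dot c o m"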
Here we use the wordcloud package to visualise, for each class, the tokens that occur most frequently among the terms affected by the text normalization. We start with the overall frequencies and then plot the individual classes in turn.
train %>%
filter(transformed == TRUE) %>%
count(before) %>%
with(wordcloud(before, n, max.words = 100, rot.per=0, fixed.asp=FALSE))
Fig. 7
train %>%
filter(transformed == TRUE, class == "PLAIN") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 8
train %>%
filter(transformed == TRUE, class == "DATE") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 9
train %>%
filter(transformed == TRUE, class == "CARDINAL") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 40, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 10
train %>%
filter(transformed == TRUE, class == "MEASURE") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 11
train %>%
filter(transformed == TRUE, class == "ORDINAL") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 12
train %>%
filter(transformed == TRUE, class == "DECIMAL") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 13
train %>%
filter(transformed == TRUE, class == "MONEY") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 14
train %>%
filter(transformed == TRUE, class == "DIGIT") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 15
train %>%
filter(transformed == TRUE, class == "TELEPHONE") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 16
train %>%
filter(transformed == TRUE, class == "TIME") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 17
train %>%
filter(transformed == TRUE, class == "FRACTION") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 18
train %>%
filter(transformed == TRUE, class == "ADDRESS") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 50, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 19
train %>%
filter(transformed == TRUE, class == "LETTERS") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 20
train %>%
filter(transformed == TRUE, class == "VERBATIM") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 30, scale=c(4,1), rot.per=0, fixed.asp=FALSE))
Fig. 21
train %>%
filter(transformed == TRUE, class == "ELECTRONIC") %>%
count(before) %>%
with(wordcloud(before, n, max.words = 30, scale=c(4,0.5), rot.per=0, fixed.asp=FALSE))
Fig. 22
In this section we turn to analysing whether the context of our normalized token can provide any indications towards its class. We do this by preparing a data frame in which every token is listed together with the tokens (and their classes) that come immediately before and after it in the text data. (Note that for the first/last tokens of a sentence the previous/next tokens will belong to the previous/next sentence.)
We build this new data frame using the dplyr tool lead, which shifts the contents of a vector by a certain number of indices.
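For illustration, a one-line example of what lead does:
# lead() shifts a vector forward, padding the end with NA.
lead(c("a", "b", "c"), 1)
## [1] "b" "c" NA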
t3 <- train %>%
select(class, before) %>%
mutate(before2 = lead(train$before, 1),
before3 = lead(train$before,2),
class2 = lead(train$class, 1),
class3 = lead(train$class, 2),
transformed = c(train$transformed[-1], NA),
after = c(train$after[-1], 0)) %>%
filter(!is.na(before3)) %>%
rename(class_prev = class, class_next = class3, class = class2,
before_prev = before, before_next = before3, before = before2) %>%
select(before_prev, before, before_next, class_prev,
class, class_next, transformed, after)
We begin this analysis with a series of overview plots comparing the frequencies of classes at the positions previous and next in line to all tokens of a given class. Note the logarithmic x-axes. These plots show the absolute numbers of tokens per class. The classes are colour-coded for easier comparison. Here is the corresponding helper function:
plot_t3_comp <- function(cname){
p1 <- t3 %>%
filter(transformed == TRUE & class == cname) %>%
ggplot(aes(class_prev, fill = class_prev)) +
labs(x = "Previous class") +
geom_bar() +
coord_flip() +
scale_y_log10() +
theme(legend.position = "none")
p2 <- t3 %>%
filter(transformed == TRUE & class == cname) %>%
ggplot(aes(class_next, fill = class_next)) +
labs(x = "Next class") +
geom_bar() +
coord_flip() +
scale_y_log10() +
theme(legend.position = "none")
layout <- matrix(c(1,2),2,1,byrow=TRUE)
multiplot(p1, p2, layout=layout)
}
plot_t3_comp("PLAIN")
Fig. 23
plot_t3_comp("DATE")
Fig. 24
plot_t3_comp("CARDINAL")
Fig. 25
plot_t3_comp("MEASURE")
Fig. 26
plot_t3_comp("ORDINAL")
Fig. 27
plot_t3_comp("DECIMAL")
Fig. 28
plot_t3_comp("MONEY")
Fig. 29
plot_t3_comp("DIGIT")
Fig. 30
plot_t3_comp("TELEPHONE")
Fig. 31
plot_t3_comp("TIME")
Fig. 32
plot_t3_comp("FRACTION")
Fig. 33
plot_t3_comp("ADDRESS")
Fig. 34
plot_t3_comp("LETTERS")
Fig. 35
plot_t3_comp("VERBATIM")
Fig. 36
plot_t3_comp("ELECTRONIC")
Fig. 37
For an alternative comprehensive overview of the neighbour class statistics, here are two treemaps built using the treemapify package.
The treemaps summarise at a glance which neighbour combinations exist and are most frequent:
t3 %>%
group_by(class, class_prev) %>%
count() %>%
ungroup() %>%
mutate(n = log10(n+1)) %>%
ggplot(aes(area = n, fill = class_prev, label = class_prev, subgroup = class)) +
geom_treemap() +
geom_treemap_subgroup_border() +
geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
"black", fontface = "italic", min.size = 0) +
geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
theme(legend.position = "null") +
ggtitle("Previous classes grouped by token class; log scale frequencies")
Fig. 38
t3 %>%
group_by(class, class_next) %>%
count() %>%
ungroup() %>%
mutate(n = log10(n+1)) %>%
ggplot(aes(area = n, fill = class_next, label = class_next, subgroup = class)) +
geom_treemap() +
geom_treemap_subgroup_border() +
geom_treemap_subgroup_text(place = "centre", grow = T, alpha = 0.5, colour =
"black", fontface = "italic", min.size = 0) +
geom_treemap_text(colour = "white", place = "topleft", reflow = T) +
theme(legend.position = "null") +
ggtitle("Next classes grouped by token class; log scale frequencies")
Fig. 39
The first plot shows the frequency of previous token classes and the second treemap the frequency of next token classes (all labeled in white) for each target token class (labeled in a large black font and separated by group with grey borders). We again use a logarithmic frequency scaling to improve the visibility of the rare combinations. Group sizes decrease from the bottom left to the top right of the plot and each subgroup box. The colours of the white-labeled neighbour boxes are identical throughout the plot (e.g. “PUNCT” is always purple). Note that the subgroup boxes in the two plots have different sizes for identical classes because of the log(n+1) transformation.
Based on these raw numbers we can study the relative contributions of a certain class to the previous or next tokens of another class. To visualise these dependencies, we again determine the log10(n+1) frequency distributions for each class among the previous/next tokens depending on the class of the reference token. These are the numbers in the bar plots above. We then normalise the range of these numbers (i.e. the height of the bars) to the interval [0,1] for each class. This data wrangling is done in the following code block:
prev_stat <- t3 %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
next_stat <- t3 %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
t3_norm_prev <- t3 %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
left_join(prev_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
t3_norm_next <- t3 %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
left_join(next_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
The result is displayed in two tile plots for the class vs previous class and class vs next class, respectively. Here, each tile shows the relative frequency (by class) of a specific neighbour pairing. The colour coding assigns bluer colours to lower frequencies and redder colours to higher frequencies:
t3_norm_prev %>%
ggplot(aes(class, class_prev, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
ggtitle("Class vs previous class; relative log scale frequencies")
Fig. 40
t3_norm_next %>%
ggplot(aes(class, class_next, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
ggtitle("Class vs next class; relative log scale frequencies")
Fig. 41
We find:
Not all possible neighbouring combinations exist: only about 77% of the tile plots are filled. Certain potential pairs such as “ORDINAL” and “FRACTION” or “TIME” and “MONEY” are never found next to each other.
“PUNCT” and “PLAIN”, the overall most frequent classes, also dominate the relative frequencies of previous/next classes for every token class. “PUNCT” is relatively weakly related to “MONEY” (previous) and “ORDINAL” (next).
“DIGIT” tokens are often preceded or followed by “LETTERS” tokens.
“TELEPHONE” is very likely to be preceded by “LETTERS”, as well.
“VERBATIM” tokens are very likely to be followed and preceded by other “VERBATIM” tokens. No other token class has such a strong correlation with itself. Some, such as “DIGIT”, “MONEY”, or “TELEPHONE” are rather unlikely to be preceded by the same class
“VERBATIM” is also very unlikely to be preceded by “ADDRESS” and “TIME” or followed by “ADDRESS”.
Now we restrict this neighbour analysis to the transformed tokens only. Naturally, these will still have transformed or untransformed tokens in the previous and next positions. Having set up our “context” data frame with this option in mind, we only need to modify our code slightly to prepare the corresponding tile plots:
prev_stat <- t3 %>%
filter(transformed == TRUE) %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
next_stat <- t3 %>%
filter(transformed == TRUE) %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
t3_norm_prev <- t3 %>%
filter(transformed == TRUE) %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
left_join(prev_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
t3_norm_next <- t3 %>%
filter(transformed == TRUE) %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
left_join(next_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
t3_norm_prev %>%
ggplot(aes(class, class_prev, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
ggtitle("Class vs previous class; relative log scale; transformed")
Fig. 42
t3_norm_next %>%
ggplot(aes(class, class_next, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
ggtitle("Class vs next class; relative log scale; transformed")
Fig. 43
We find:
The class “PUNCT” is now gone from the “Token class” axis, since those symbols are never transformed.
Two combinations disappear: “VERBATIM” is not transformed if preceded or followed by “TIME” (relatively infrequent neighbours to start with).
The pair of “PLAIN” preceded by “CARDINAL” significantly increases in frequency (within the rarely transformed “PLAIN” class).
In general, the classes “PLAIN” and “VERBATIM” experience the most visible changes with respect to the total set of neighbouring tokens, since these are the classes with the highest percentage of untransformed tokens (after “PUNCT”, of course).
The previous plots did not distinguish between neighbouring tokens that were placed at the end of one sentence and the beginning of another. Since the sentences in our data set are unrelated and in a random order, the end of one sentence should not influence the beginning of the next one. Here we take this into account by removing those pairs of tokens that bridge two sentences.
foo <- train %>%
group_by(sentence_id) %>%
summarise(sentence_len = max(token_id)) %>%
ungroup()
bar <- train %>%
left_join(foo, by = "sentence_id") %>%
mutate(first_token = token_id == 0,
last_token = token_id == sentence_len) %>%
slice(c(-1,-nrow(train))) %>%
select(first_token, last_token)
prev_stat <- t3 %>%
bind_cols(bar) %>%
filter(first_token == FALSE) %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
next_stat <- t3 %>%
bind_cols(bar) %>%
filter(last_token == FALSE) %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
group_by(class) %>%
summarise(mean_n = mean(n),
max_n = max(n),
min_n = min(n))
t3_norm_prev <- t3 %>%
bind_cols(bar) %>%
filter(first_token == FALSE) %>%
count(class, class_prev) %>%
mutate(n = log10(n+1)) %>%
left_join(prev_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
t3_norm_next <- t3 %>%
bind_cols(bar) %>%
filter(last_token == FALSE) %>%
count(class, class_next) %>%
mutate(n = log10(n+1)) %>%
left_join(next_stat, by = "class") %>%
mutate(frac_norm = (n-min_n)/(max_n - min_n),
test1 = max_n - min_n,
test2 = n - min_n)
t3_norm_prev %>%
ggplot(aes(class, class_prev, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Previous token class", fill = "Rel freq") +
ggtitle("Class vs previous class; relative log scale; no sentence bridging")
Fig. 44
t3_norm_next %>%
ggplot(aes(class, class_next, fill = frac_norm)) +
geom_tile() +
theme(axis.text.x = element_text(angle=45, hjust=1, vjust=0.9)) +
scale_fill_distiller(palette = "Spectral") +
labs(x = "Token class", y = "Next token class", fill = "Rel freq") +
ggtitle("Class vs next class; relative log scale; no sentence bridging")
Fig. 45
We find:
The differences to the original plot are only marginal; possibly because of the relatively small contribution of cross-sentence neighbours to the overall sample. With about 750k sentences and close to 10 million tokens there are about 7.5% of pairs bridging two sentences.
In the class vs next class plot the differences are exclusively seen in the “PUNCT” class. This is to be expected since a full stop (“.”) belongs to the “PUNCT” category. Thus, we can confirm the expectation that (practically) all sentences end like this one here.
Actually, let’s see if this is true. Here we select only the final tokens of each sentence and plot their class distribution. We also plot the frequency of the different tokens within the “PUNCT” class. Note the logarithmic axes:
bar <- train %>%
left_join(foo, by = "sentence_id") %>%
filter(token_id == sentence_len)
p1 <- bar %>%
ggplot(aes(class, fill = transformed)) +
geom_bar(position = "dodge") +
scale_y_log10() +
ggtitle("Final tokens of sentence")
p2 <- bar %>%
filter(class == "PUNCT") %>%
ggplot(aes(before, fill = before)) +
geom_bar() +
scale_y_log10() +
theme(legend.position = "none", axis.text.x = element_text(face = "bold", size = 16)) +
ggtitle("Final PUNCT tokens")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(p1, p2, layout=layout)
Fig. 46
We find:
The “PUNCT” class is indeed the most frequent by far, but four other classes are found at the end of a sentence, too: “PLAIN”, “LETTERS”, “DATE”, and “MONEY”. This could already be kind-of seen in Fig. 5 above.
As we already know, no “PUNCT” tokens were transformed. All “DATE”, “LETTERS”, and “MONEY” tokens were transformed, which is not self-evident (see Fig. 5 and the numbers below it).
For the “PLAIN” class, about half of the tokens were transformed, which is a far larger fraction than for the overall “PLAIN” sample. This is an interesting result.
There are only two cases in which the last token belongs to the “MONEY” class:
bar %>% filter(class == "MONEY") %>% select(-transformed, -sentence_len)
## # A tibble: 2 x 5
## sentence_id token_id class before after
## <int> <int> <fctr> <chr> <chr>
## 1 37375 12 MONEY 3rs. three rupees
## 2 590317 9 MONEY 3rs. three rupees
You would be absolutely right in thinking that these look suspicious. Let’s see what they really are:
before_vs_after(37375)
## [1] "Before: \"\" Parasite \"\" was used in the US TV Crime show Numb 3rs."
## [1] "After : \"\" Parasite \"\" was used in the u s t v Crime show Numb three rupees"
before_vs_after(590317)
## [1] "Before: John McCarthy ( 1 episode , 2004 ) Numb 3rs."
## [1] "After : John McCarthy ( one episode , two thousand four ) Numb three rupees"
And indeed we have two references to the moderately successful TV show “Numb3rs” (IMDB), whose characters were apparently engaging in the kind of mathematical detective work you’ll find much more successfully done in many Kaggle kernels ;-) . No rupees here, I’m afraid. Although why exactly our data has a space between the two parts of that name is not entirely clear to me.
To round off this section, we will create a set of interactive 3D plots using the plotly package. Here you can explore the parameter space of neighbouring classes. The grid is defined as previous class (x), next class (y), and token class (z), and the corresponding frequencies are indicated by the colour and size of the data points.
t3 %>%
group_by(class, class_prev, class_next) %>%
count() %>%
ungroup() %>%
mutate(class = as.character(class),
class_prev = as.character(class_prev),
class_next = as.character(class_next)) %>%
arrange(desc(n)) %>%
plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
text = ~paste('Class:', class,
'<br>Previous class:', class_prev,
'<br>Next class:', class_next,
'<br>Counts:', n)) %>%
add_markers(size = ~log10(n)) %>%
layout(title = "Class Neighbour frequencies",
scene = list(xaxis = list(title = 'Previous Class'),
yaxis = list(title = 'Next Class'),
zaxis = list(title = 'Class')))
Fig. 47
t3 %>%
filter(transformed == TRUE) %>%
group_by(class, class_prev, class_next) %>%
count() %>%
ungroup() %>%
mutate(class = as.character(class),
class_prev = as.character(class_prev),
class_next = as.character(class_next)) %>%
arrange(desc(n)) %>%
plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
text = ~paste('Class:', class,
'<br>Previous class:', class_prev,
'<br>Next class:', class_next,
'<br>Counts:', n)) %>%
add_markers(size = ~log10(n)) %>%
layout(title = "Class Neighbour frequencies",
scene = list(xaxis = list(title = 'Previous Class'),
yaxis = list(title = 'Next Class'),
zaxis = list(title = 'Class')))
Fig. 48
foo <- train %>%
group_by(sentence_id) %>%
summarise(sentence_len = max(token_id)) %>%
ungroup()
bar <- train %>%
left_join(foo, by = "sentence_id") %>%
mutate(first_token = token_id == 0,
last_token = token_id == sentence_len) %>%
slice(c(-1,-nrow(train))) %>%
select(first_token, last_token)
t3 %>%
bind_cols(bar) %>%
filter(first_token == FALSE & last_token == FALSE) %>%
group_by(class, class_prev, class_next) %>%
count() %>%
ungroup() %>%
mutate(class = as.character(class),
class_prev = as.character(class_prev),
class_next = as.character(class_next)) %>%
arrange(desc(n)) %>%
plot_ly(x = ~class_prev, y = ~class_next, z = ~class, color = ~log10(n),
text = ~paste('Class:', class,
'<br>Previous class:', class_prev,
'<br>Next class:', class_next,
'<br>Counts:', n)) %>%
add_markers(size = ~log10(n)) %>%
layout(title = "Class Neighbour frequencies",
scene = list(xaxis = list(title = 'Previous Class'),
yaxis = list(title = 'Next Class'),
zaxis = list(title = 'Class')))
Fig. 49
Certain combinations of classes are more likely to be found next to one another than other combinations. Ultimately, this reflects the grammar structure of the language.
By making use of these next-neighbour statistics we can estimate the probability that a token was classified correctly by iteratively cross-checking the other tokens in the same sentence. This adds a certain degree of context to a classification/normalisation attempt that is only considering the token itself.
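As a rough starting point, here is a sketch of how the neighbour counts in t3 could be turned into conditional class probabilities, for example P(class | previous class). This is purely illustrative and not used further below:
# Estimate P(class | previous class) from the observed neighbour counts.
class_given_prev <- t3 %>%
  count(class_prev, class) %>%
  group_by(class_prev) %>%
  mutate(prob = n / sum(n)) %>%
  ungroup()
# Example: the five classes most likely to follow a "DATE" token.
class_given_prev %>%
  filter(class_prev == "DATE") %>%
  arrange(desc(prob)) %>%
  head(5)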
This section will look at specific sentence parameters and how they affect the transformation statistics or behaviour.
We begin by studying how the length (in tokens) of the sentences affects the statistics of classes and transformed tokens. For this we estimate the mean transformed fraction for each group of sentences with the same length, together with the corresponding uncertainties. In addition, we examine the class frequencies for the shortest sentences and the sentence length distributions for each class:
foo <- train %>%
group_by(sentence_id, transformed) %>%
count() %>%
spread(transformed, n, fill = 0) %>%
mutate(frac = `TRUE`/(`TRUE` + `FALSE`),
sen_len = `TRUE` + `FALSE`)
bar <- foo %>%
group_by(sen_len) %>%
summarise(mean_frac = mean(frac),
sd_frac = sd(frac),
ct = n())
foobar <- foo %>%
left_join(train, by = "sentence_id")
p1 <- bar %>%
ggplot(aes(sen_len, mean_frac, size = ct)) +
geom_errorbar(aes(ymin = mean_frac-sd_frac, ymax = mean_frac+sd_frac),
width = 0., size = 0.7, color = "gray30") +
geom_point(col = "red") +
scale_x_log10() +
labs(x = "Sentence length", y = "Average transformation fraction")
p2 <- foobar %>%
filter(sen_len < 6) %>%
group_by(class, sen_len) %>%
count() %>%
ungroup() %>%
filter(n > 100) %>%
ggplot(aes(sen_len, n, fill = class)) +
geom_col(position = "fill") +
scale_fill_brewer(palette = "Set1") +
labs(x = "Sentence length", y = "Proportion per class")
p3 <- foobar %>%
ggplot(aes(sen_len)) +
geom_density(bw = .1, size = 1.5, fill = "red") +
scale_x_log10() +
labs(x = "Sentence length") +
facet_wrap(~ class, nrow = 2)
layout <- matrix(c(1,2,3,3),2,2,byrow=TRUE)
multiplot(p1, p2, p3, layout=layout)
Fig. 50
p1 <- 1; p2 <- 1; p3 <- 1
We find:
For sentences with more than about 10 tokens the proportion of transformed tokens first decreases somewhat up to about 20 and then increases slightly afterwards. However, none of the changes appear to be significant within their standard deviations. Here, the size of the red data points is proportional to the number of cases per group.
Interestingly, for very short sentences of only 2 or 3 tokens the mean fraction of transformed tokens is significantly higher: about 33% for 3 tokens and practically always 50% for 2 tokens.
Upon closer inspection in the bar plot we find that almost all 2-token sentences consist of a “DATE” class and a “PUNCT” class token. Note that our plot omits rare classes with less than 100 cases for better visibility. Below is the complete table for the 2-token sentences.
The 3-token sentences basically just add a “PLAIN” token to the mix, reducing the proportions of “DATE” and “PUNCT” tokens to 1/3 each. For longer sentences, the mix starts to become more heterogeneous. Interestingly, for longer sentences the “PUNCT” fraction does not continue to decline geometrically, indicating the presence of tokens other than the final full stop.
Among the individual classes we can see considerable differences in the shape of their sentence length distributions. For instance, “VERBATIM” has a far broader distribution than most, reaching notable numbers above 100 tokens. “PLAIN”, “MEASURE”, and “DECIMAL” have the sharpest peaks, while “DATE” has an interesting step structure in addition to the dominance in short sentences that we discussed above (here is the promised table for 2-token sentences):
foobar %>%
filter(sen_len <= 2) %>%
group_by(sen_len, class, transformed) %>%
count()
## # A tibble: 5 x 4
## # Groups: sen_len, class, transformed [5]
## sen_len class transformed n
## <dbl> <fctr> <lgl> <int>
## 1 2 DATE TRUE 10127
## 2 2 PLAIN FALSE 3
## 3 2 PLAIN TRUE 1
## 4 2 PUNCT FALSE 10133
## 5 2 TELEPHONE TRUE 10
Converting letters and abbreviations is still tricky for the computer. Ideally it would infer what an acronym such as PMO stands for from the surrounding context, rather than simply spelling it out letter by letter.
The machine still needs a large amount of data if it is going to improve its reading, even by 1%. In this case study we used about 350 megabytes of text data to achieve a fairly good result, even though processing it slowed down my machine.
We learnt that the average number of words or tokens per sentence is about 8, with a maximum of 255.
We also learnt that computers are good at translating numbers into text, but it is still tricky for them to integrate the result correctly into a sentence.
For more use cases of Machine Learning, visit us at Cartwheel Technologies.