Reading Books

Author

Jorge D. Lopez Saavedra

Text Analytics

Imagine you work in a collections-based call center. You have your prime agents, those who collect most frequently. And you have your agents in a growth phase, those who collect at a much lower frequency than their prime counterparts. For simplicity, we will just label them “sub-prime agents”.

Management would like to know what really distinguishes a prime agent from a sub-prime agent.

Currently, there are some quantitative metrics in place, such as:

  • Proportion of Customer Contact over Calls:

    \[ \frac{\text{Customer Contact}}{\text{Number of Calls}} \]

  • Proportion of Promises to Pay over Contact:

    \[ \frac{\text{Promises to Pay}}{\text{Customer Contact}} \]
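
For example (hypothetical numbers), an agent who makes 200 calls, reaches 90 customers, and secures 40 promises to pay would score:

    \[ \frac{90}{200} = 45\% \qquad \frac{40}{90} \approx 44\% \]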

And many others. However, these metrics still do not explain why prime agents tend to score better on them. In the end, it boils down to the art of using the right words at the right time. But what words are those?

You don’t know yet. So, you set out to analyze thousands of successful conversations from prime agents, and a thousand more from sub-prime agents.

  • What words are used most frequently?

  • Which words are used less frequently?

The result of your study would be an understanding of the subtle differences among agents, and a way to provide focused coaching to underperforming agents.

To replicate this exercise, we will use two books with very different plots: Moby Dick by Herman Melville and Romeo and Juliet by William Shakespeare.

Reading Books

We will be using Python for a few exercises first, and then move to R.

import os 
os.chdir(r'D:\MyDrive\10. MS in Data Science UofWisconsin\09. Data Visualization\Fourth Project')

with open('Moby Dick.txt','r', encoding='utf-8') as f:
  book_MobyDick = f.read()

with open('RomeAndJuliet.txt','r', encoding='utf-8') as f:
  book_RomeoAndJuliet = f.read()

Printing a few Lines of Moby Dick

Just exploring the string variable that was captured using Python.

# taking a small slice of the book and splitting it into lines
list_example_MD = book_MobyDick[10000:10210].split("\n")

for x in list_example_MD:
  if not x:   # empty line
    print("")
  else:
    print(x, '\n')
 the 

  former, one was of a most monstrous size.... This came towards us, 

  open-mouthed, raising the waves on all sides, and beating the sea 

  before him into a foam.” —_Tooke’s Lucian_. “_The True History_.” 

Vectorizing Words in Books

Now we are ready to use the CountVectorizer class from scikit-learn. We are basically going to get the number of distinct words present in each book.

import pandas as pd 
from sklearn.feature_extraction.text import CountVectorizer

How many different words does each book have?

cv_Moby = CountVectorizer()
sparseMat_Moby=cv_Moby.fit_transform([book_MobyDick])
words_MobyDick=cv_Moby.get_feature_names_out()
print("Number of Words in Moby Dick: ",len(book_MobyDick.split()))
Number of Words in Moby Dick:  215831
print("Number of Different Words in Moby Dick: ",len(words_MobyDick))
Number of Different Words in Moby Dick:  17597
cv_RJ = CountVectorizer()
sparseMat_RJ=cv_RJ.fit_transform([book_RomeoAndJuliet])
words_RJ=cv_RJ.get_feature_names_out()
print("Number of Words in Romeo and Juliet: ",len(book_RomeoAndJuliet.split()))
Number of Words in Romeo and Juliet:  28987
print("Number of Different Words in Romeo and Juliet: ",len(words_RJ))
Number of Different Words in Romeo and Juliet:  4028
Note

This tells us that Moby Dick is a much longer book, with far more words than Romeo and Juliet.
At this point, we are ready to clean these texts further by removing all Stop Words.
Stop Words are basically a list of common English words that appear in most English sentences.
The reason for removing them is that there is no inherent benefit in including every single word of the English language in our analysis; we are only interested in the words that make each piece of writing unique.

Removing Commonly Used Words in the English Language

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print("After taking out STOP WORDS, the number of unique words present in each book was reduced .")
After taking out STOP WORDS, the number of unique words present in each book was reduced .
cv_Moby = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
sparseMat_Moby=cv_Moby.fit_transform([book_MobyDick])
words_MobyDick=cv_Moby.get_feature_names_out()

print("Number of Different Words in Moby Dick: ",len(words_MobyDick))
Number of Different Words in Moby Dick:  17302
cv_RJ = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
sparseMat_RJ=cv_RJ.fit_transform([book_RomeoAndJuliet])
words_RJ=cv_RJ.get_feature_names_out()
print("Number of Different Words in Romeo and Juliet: ",len(words_RJ))
Number of Different Words in Romeo and Juliet:  3795

Preparing Books for Analysis

What we would like to do next is create a document-term matrix, with one row per book, where the column headers are the words used across both texts. From there, we can start noticing some differences between the books.

This is similar to looking at the universe of words used by the best- vs. worst-performing agents in a call center.

vect_both_books = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
vect_both_books.fit([book_MobyDick, book_RomeoAndJuliet])
CountVectorizer(stop_words=['this', 'something', 'ourselves', 'move', 'only',
                            'as', 'yourself', 'and', 'please', 'them',
                            'whereupon', 'whole', 'out', 'that', 'nine', 'too',
                            'thereafter', 'system', 'what', 'yours', 'these',
                            'found', 'beside', 'upon', 'afterwards', 'through',
                            'eleven', 'more', 'thru', 'could', ...])
bothBooks_transformed = vect_both_books.transform([book_MobyDick,book_RomeoAndJuliet])
bothBooks_array = bothBooks_transformed.toarray()
df_bothBooks=pd.DataFrame(bothBooks_array, columns= vect_both_books.get_feature_names_out())

Observing resulting object:

df_bothBooks.iloc[:,0:10]
   000  10  100  101  102  103  104  105  106  107
0   21   5    4    2    2    2    2    2    2    2
1    1   0    0    0    0    0    0    0    0    0

We notice that some of the words included are still not what we are looking for. We shouldn’t be considering words that start with a number (these are mostly chapter numbers included in the text file), or words that start with a symbol. So, we are going to clean this dataframe a bit more.

import re 
pattern = r"^\d|^_"  # pattern says: does the word start with a digit \d or an underscore _

true_false_listOfWordsToInclude = [re.match(pattern, x) is None for x in df_bothBooks.columns]
df_bothBooks1 = df_bothBooks.loc[:, true_false_listOfWordsToInclude].copy()

Using R for Further Analysis

Loading R Packages

library(reticulate)
suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(kableExtra))

Modifying the R Dataframe to Pivot Columns to Rows

df_books_r = py$df_bothBooks1

#adding a column to identify the book
df_books_r$book = c("Moby Dick","Romeo and Juliet")

#pivoting word columns into rows (long format)
df_books_r %>% relocate(book) %>% 
  tidyr::pivot_longer(
    cols=2:ncol(df_books_r),
    names_to = "word",
    values_to = "word_count"
  ) -> df_books_r_pivotLonger

#Ranking words by count within each book
df_books_r_pivotLonger %>% 
  arrange(book,desc(word_count))%>% 
  group_by(book)%>% 
  mutate(
    rn= row_number()
  ) %>% ungroup() ->df_books_r_pivotLonger

#Creating Top 5 Words Used in Each Book
mobyTop5 = df_books_r_pivotLonger %>% 
  filter(book=="Moby Dick",rn<6) %>% 
   select(word,word_count)
RJTop5 = df_books_r_pivotLonger %>% filter(book=="Romeo and Juliet",rn<6)%>% 
   select(word,word_count)

Viewing Top 5 - table

knitr::kables(
  list(
        knitr::kable(
            mobyTop5,
            caption = "Moby Dick",
            col.names = c("Word","Word Count") 
        ),
        knitr::kable(
            RJTop5,
            caption = "Romeo and Juliet",
            col.names = c("Word","Word Count")
        )
      )
)
Moby Dick
Word        Word Count
whale             1229
like               647
man                527
ship               519
ahab               512

Romeo and Juliet
Word        Word Count
romeo              320
thou               278
juliet             193
thy                170
capulet            163

Viewing Bag of Words Graph for Both Books

suppressPackageStartupMessages(library(wordcloud)) 
Warning: package 'wordcloud' was built under R version 4.3.2
freq_MD=df_books_r_pivotLonger %>% 
  filter(book=="Moby Dick") 

freq_RJ=df_books_r_pivotLonger %>% 
  filter(book=="Romeo and Juliet") 
set.seed(1234)
wordcloud(words = freq_MD$word, freq = freq_MD$word_count, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

set.seed(1234)
wordcloud(words = freq_RJ$word, freq = freq_RJ$word_count, min.freq = 1,
          max.words=100, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Important Notes:

The results above tell us there are clear differences between the two books. The word “whale” appears with unusually high frequency in Moby Dick. No surprise there ☺.
The words “romeo” and “juliet” likewise appear with unusually high frequency in Romeo and Juliet.

Relating to our Call Center example, these results would be similar to looking at both groups of call center agents.

Potentially, you would see that certain trigger verbs appear among the most frequently used words of high-performing agents, whereas non-performing agents would not have the same words among their top 10.

Example:

  • High Performing agent: I require payment in full today.

  • Low Performing agent: I would like to see if you could make a payment.

In this simple example, we would spot that the verbs used by non-performing agents are not the right choice of words to trigger a payment from a debtor.

There could also be some words that, due to the nature of the business, are used with high frequency by everyone. In that case, we would just need to add these words to our Stop Words list, like so:

new_stop_words = ENGLISH_STOP_WORDS.union(['new word 1','new word 2',
'etc...'])
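
The resulting set can then be passed to the vectorizer the same way as before; a minimal sketch (new_cv is just an illustrative name):

new_cv = CountVectorizer(stop_words=list(new_stop_words))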

Example words that could be used with unusually high frequency in our fictional call center:

  • debt

  • amount

  • pay

  • payment

  • etc…

We have established that the frequencies of the top words are extremely different. However, some words are naturally shared by both books. We’ll explore this next.

Before we jump back into word frequency, notice that “whale” and “whales” are both included among the top 100 words used in Moby Dick. Naturally, both words stem from “whale”. We really don’t want to count twice words that provide, in essence, the same kind of information.

We are going to use the Python nltk.stem library and apply the PorterStemmer to our R dataframe to create a stemmed version of each word.

from nltk.stem import PorterStemmer
porter = PorterStemmer()

py_df_books_longer=r.df_books_r_pivotLonger.copy()

py_df_books_longer['word stem']=py_df_books_longer.word.apply(porter.stem)

py_df_books_longer.loc[(py_df_books_longer.word=="whale") | (py_df_books_longer.word=="whales"),:]
                   book    word  word_count     rn word stem
0             Moby Dick   whale      1229.0      1     whale
19            Moby Dick  whales       271.0     20     whale
35419  Romeo and Juliet   whale         0.0  17558     whale
35430  Romeo and Juliet  whales         0.0  17569     whale

Now that we have done that, we are going to group again and count using the stem of each word.

df_books_r_pivotLonger=py$py_df_books_longer 

df_books_r_pivotLonger %>% 
  group_by(
      book,`word stem`
) %>% 
  summarise(
      totalWords = sum(word_count)
  ) %>% 
  rename('word_stem' = `word stem`) -> df_books_r_pivotLonger1
`summarise()` has grouped output by 'book'. You can override using the
`.groups` argument.
df_books_r_pivotLonger1 %>% arrange(book, desc(totalWords)) %>% head(5) %>% kableExtra::kable() %>% kable_styling("hover")
book        word_stem   totalWords
Moby Dick   whale             1633
Moby Dick   like               661
Moby Dick   ship               625
Moby Dick   ye                 547
Moby Dick   sea                542

Notice that “whale” is now counted about 400 more times, since its count now includes all derivations of the word “whale”.

Word Frequency Comparison

#Creating word freq per book
df_books_r_pivotLonger1 %>% 
  group_by(book) %>% 
  summarise(totalWords_book = sum(totalWords)) %>% 
  inner_join(df_books_r_pivotLonger1) %>% 
  mutate(word_freq= totalWords/totalWords_book)->df_books_r_pivotLonger1
Joining with `by = join_by(book)`

Filtering for top 300 words used

top_words = 300

# filtering for top 300 words in each book
df_books_r_pivotLonger1 %>% arrange(book,desc(totalWords)) %>% 
  group_by(book) %>% mutate(rn=row_number()) %>% ungroup() %>% 
  filter(rn<=top_words) -> df_top300Words

# pivot wider
df_top300Words %>% select(book,word_stem,word_freq) %>% 
    tidyr::pivot_wider(
        names_from = book, values_from = word_freq,values_fill = 0
    ) -> df_top300Words_wide

This is the graph rendered using the true frequencies. It emphasizes words whose usage is radically different between the books, such as: whale, romeo, juliet, etc.

library(ggiraph)
graph1=ggplot(df_top300Words_wide %>% mutate(
  theToolTip = paste(word_stem,"\n", "MD", round(`Moby Dick`,3)*100,"%","\n","RJ",round(`Romeo and Juliet`,3)*100,"%")
)) +
  geom_point_interactive(
    aes(x = `Moby Dick`, y = `Romeo and Juliet`,tooltip=theToolTip),alpha = 0.3, size = 2.5, color='steelblue'
  )+
  geom_text(aes(x = `Moby Dick`, y = `Romeo and Juliet`,label = word_stem), check_overlap = TRUE, vjust = "outward",
            hjust=0.5)+
  geom_abline(color = "red", slope=1, intercept=0)


girafe(ggobj=graph1)

This graph uses a log scale and jitter; the resulting effect provides a better view of words that are used regularly in both books, but with a lower frequency than the top words.

# plotting 
library(scales)

ggplot(df_top300Words_wide) +
  geom_jitter(
    aes(x = `Moby Dick`, y = `Romeo and Juliet`),alpha = 0.3, size = 2.5, width = 0.25, height = 0.25, color='steelblue'
  )+
  geom_text(aes(x = `Moby Dick`, y = `Romeo and Juliet`,label = word_stem), check_overlap = TRUE, vjust = "outward",
            hjust=0.5)+
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  geom_abline(color = "red", slope=1, intercept=0)

This tells us that words like dead, night, and heart are used in both books. However, these words are used at a much higher frequency in Romeo and Juliet than in Moby Dick.
On the other hand, words like world, head, thought, and said are also used in both books, but at a much higher frequency in Moby Dick than in Romeo and Juliet.

Going back to the call center example, I would expect a similar pattern to emerge between top- and low-performing agents: a distinct pattern in the frequency of words used during conversations. This could be an indication that the choice of words matters in the art of collecting.

Setting up Model

Why the model?

What would the model attempt to predict?

It will predict whether a paragraph is more similar to Moby Dick or to Romeo and Juliet.

How can this be applied in real life?

It will be used to predict if a conversation is more similar to a high performing agent or a low performing agent.

But what is the importance of the model if you have other KPIs to measure performance (amount of money collected, for example)?

Because collections is a cyclical business, and not all time periods promise the same amount of revenue. For example, collection companies do not tend to collect very well during spring break.

So, the real importance of the model would be:

  1. Are low-performing agents showing signs of improvement?
    1. I.e., do their conversations more closely resemble those of a high-performing agent?
  2. Are some high-performing agents performing at a less-than-optimal rate, even before month’s end numbers are released?
  3. Can coaching happen in a more timely manner by pinpointing the exact conversations that need improvement?

Setting up Testing Env

To set up a testing environment, we are going to subdivide both books into chunks of 10 sentences. Each chunk of 10 sentences becomes an observation to train our model.

import nltk 
# download the punkt tokenizer for sentence splitting
nltk.download('punkt')
True

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jlope\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

# splitting the book string into sentences using nltk.sent_tokenize
sentences_MD = nltk.sent_tokenize(book_MobyDick)

The Moby Dick text does not truly start until sentence #460, so we are starting at that point.

I’m also noticing that some “sentences” are just chapter headings: “CHAPTER 1.”, “CHAPTER 2.”, etc.

These will be removed.


pattern = r"^CHAPTER \d+\.$"

# initialize an empty list to store the filtered strings
sentences_MD_filtered = []

# loop through the list of strings
for string in sentences_MD[460:]: #starting from sentence 460
    # check if the string does not match the pattern using re.match
    if not re.match(pattern, string):
        # append the string to the filtered list
        sentences_MD_filtered.append(string)

# removing sentences after and Including the Epilogue
sentences_MD_filtered=sentences_MD_filtered[:8483]

Now that we have filtered, we are going to group these sentences into chunks of 10 sentences each.


chunks_MD = []

for i in range(0,len(sentences_MD_filtered),10):
  chunk = " ".join(sentences_MD_filtered[i:i+10])
  chunks_MD.append(chunk)

We are going to repeat a similar process for Romeo and Juliet. To save some space in this document, we are just going to do that process in a separate Python file and import it.

import RJ_cleaning_andGrouping as RJ_CG #running python file in background
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jlope\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!

chunks_RJ=RJ_CG.chunks_RJ #importing object from py file
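
For reference, here is a minimal sketch of what that helper file might look like, assuming it mirrors the Moby Dick steps above. The heading pattern and the absence of any start/end trimming are assumptions for illustration; the actual RJ_cleaning_andGrouping.py is not reproduced in this document.

# RJ_cleaning_andGrouping.py (illustrative sketch only)
import re
import nltk

nltk.download('punkt')

with open('RomeAndJuliet.txt', 'r', encoding='utf-8') as f:
  book_RomeoAndJuliet = f.read()

# split the play into sentences
sentences_RJ = nltk.sent_tokenize(book_RomeoAndJuliet)

# drop act/scene headings such as "ACT I." or "SCENE II." (assumed pattern;
# trimming of any front/back matter is omitted in this sketch)
pattern = r"^(ACT|SCENE)\s+[IVXLC]+\.?$"
sentences_RJ_filtered = [s for s in sentences_RJ if not re.match(pattern, s)]

# group the filtered sentences into chunks of 10, same as for Moby Dick
chunks_RJ = []
for i in range(0, len(sentences_RJ_filtered), 10):
  chunks_RJ.append(" ".join(sentences_RJ_filtered[i:i+10]))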

We are going to clean these chunks by removing stop words, and stemming. First, we are going to place them all in a dataframe.

df_chunks_books = pd.DataFrame({
  'book_name': ["MD"]*len(chunks_MD) + ["RJ"]*len(chunks_RJ),   # 849 MD chunks, 297 RJ chunks
  'chunk_number': list(range(len(chunks_MD))) + list(range(len(chunks_RJ))),
  'chunk': chunks_MD + chunks_RJ
})
#printing the first 5 chunks
df_chunks_books.head(5)
  book_name  chunk_number                                              chunk
0        MD             0  Loomings. Call me Ishmael. Some years ago—neve...
1        MD             1  Right and left, the streets take you waterward...
2        MD             2  What do they here? But look! here come more cr...
3        MD             3  Tell\nme, does the magnetic virtue of the need...
4        MD             4  What is the chief element he employs? There st...

Training Model using Logistic Regression

We are going to use 80% of the data to train our model.


#creating sample var
final_sample = df_chunks_books.copy()

At this point we are going to vectorize and stem.

Stemming


from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

def apply_porter_stemming(text):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Initialize the PorterStemmer
    porter = PorterStemmer()

    # Apply stemming to each word in each sentence
    stemmed_sentences = []
    for sentence in sentences:
        words = word_tokenize(sentence)
        stemmed_words = [porter.stem(word) for word in words]
        stemmed_sentence = ' '.join(stemmed_words)
        stemmed_sentences.append(stemmed_sentence)

    # Join the stemmed sentences back together
    stemmed_text = ' '.join(stemmed_sentences)

    return stemmed_text


final_sample['stemmed_chunk']=final_sample.chunk.apply(apply_porter_stemming)

Vectorizing Sample

vect_sample = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
vect_sample.fit(final_sample.stemmed_chunk)
CountVectorizer(stop_words=['this', 'something', 'ourselves', 'move', 'only',
                            'as', 'yourself', 'and', 'please', 'them',
                            'whereupon', 'whole', 'out', 'that', 'nine', 'too',
                            'thereafter', 'system', 'what', 'yours', 'these',
                            'found', 'beside', 'upon', 'afterwards', 'through',
                            'eleven', 'more', 'thru', 'could', ...])
sample_transformed = vect_sample.transform(final_sample.stemmed_chunk)
sample_array = sample_transformed.toarray()
df_vect_sample=pd.DataFrame(sample_array, columns= vect_sample.get_feature_names_out())


#taking out words that start with a number or _
pattern = r"^\d|^_"  # does the word start with a digit or an underscore?

true_false_listOfWordsToInclude = [re.match(pattern, x) is None for x in df_vect_sample.columns]
df_vect_sample1 = df_vect_sample.loc[:, true_false_listOfWordsToInclude].copy()

# cbinding dataframes
final_sample_withVectorizer=pd.concat([final_sample.reset_index(drop=True), df_vect_sample1.reset_index(drop=True)],axis=1)

final_sample_withVectorizer=final_sample_withVectorizer.drop('chunk',axis=1)

Training Model

We are going to train a simple logistic regression model to predict:

  • 1 = Romeo and Juliet

  • 0 = Moby Dick

import numpy as np
final_sample_withVectorizer['book_binom'] = np.where(final_sample_withVectorizer.book_name == "MD", 0, 1)
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Initializing the logistic regression model
logreg_model = LogisticRegression()

X = final_sample_withVectorizer.iloc[:,3:10482] # predictors = word counts (vectorized word columns)
y = final_sample_withVectorizer["book_binom"] # bernoulli var. 1 = RJ, 0=MD

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) #setting seed = 42

# Training the model on the training data
logreg_model.fit(X_train, y_train) 
LogisticRegression()

Testing Model


# Making predictions on the testing data
y_pred = logreg_model.predict(X_test)

# Evaluating model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

Printing Results of Model

print(accuracy)
1.0

This tells us that, on the held-out test set, our model predicts with 100% accuracy whether a chunk of 10 sentences comes from Moby Dick or from Romeo and Juliet.
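
The classification report computed above also holds the per-class precision and recall; it can be printed directly (output omitted here):

print(report)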

Even better, we can use our model to estimate the probability of a chunk belonging to one book or the other.

probability_prediction=logreg_model.predict_proba(X_test).round(4)

print(probability_prediction[0:5,:])
[[9.952e-01 4.800e-03]
 [9.985e-01 1.500e-03]
 [1.000e+00 0.000e+00]
 [9.996e-01 4.000e-04]
 [1.000e+00 0.000e+00]]

Let’s first observe a chunk that had a high probability of coming from Moby Dick (probability > 99%):

final_sample.loc[final_sample.chunk_number==218,"chunk"].tolist()[0]
'If\nany of the following whales, shall hereafter be caught and marked, then\nhe can readily be incorporated into this System, according to his\nFolio, Octavo, or Duodecimo magnitude:—The Bottle-Nose Whale; the Junk\nWhale; the Pudding-Headed Whale; the Cape Whale; the Leading Whale; the\nCannon Whale; the Scragg Whale; the Coppered Whale; the Elephant Whale;\nthe Iceberg Whale; the Quog Whale; the Blue Whale; etc. From Icelandic,\nDutch, and old English authorities, there might be quoted other lists\nof uncertain whales, blessed with all manner of uncouth names. But I\nomit them as altogether obsolete; and can hardly help suspecting them\nfor mere sounds, full of Leviathanism, but signifying nothing. Finally: It was stated at the outset, that this system would not be\nhere, and at once, perfected. You cannot but plainly see that I have\nkept my word. But I now leave my cetological System standing thus\nunfinished, even as the great Cathedral of Cologne was left, with the\ncrane still standing upon the top of the uncompleted tower. For small\nerections may be finished by their first architects; grand ones, true\nones, ever leave the copestone to posterity. God keep me from ever\ncompleting anything. This whole book is but a draught—nay, but the\ndraught of a draught. Oh, Time, Strength, Cash, and Patience!'

And let’s compare it to a chunk that was also labeled correctly, but whose probability dropped to 78.24%:

final_sample.loc[final_sample.chunk_number==277,"chunk"].tolist()[0]
'I\nstill rest me on thy mat, but the soft soil has slid! I saw thee woven\nin the wood, my mat! green the first day I brought ye thence; now worn\nand wilted quite. Ah me!—not thou nor I can bear the change! How then,\nif so be transplanted to yon sky? Hear I the roaring streams from\nPirohitee’s peak of spears, when they leap down the crags and drown the\nvillages?—The blast! the blast! Up, spine, and meet it! (_Leaps to his\nfeet_.) PORTUGUESE SAILOR.'

Let’s think about the differences. For one, the word “whale” is not mentioned in the second chunk, and it sounds a bit more emotional compared with the usual themes of the book.
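
One way to sanity-check that intuition is to look at which stemmed words the logistic regression weighs most heavily toward each book. This is a quick diagnostic sketch, assuming logreg_model and the predictor DataFrame X from above are still in scope; it is an extra illustration, not part of the original analysis.

import pandas as pd

# pair each predictor column with its fitted coefficient
coef_by_word = pd.Series(logreg_model.coef_[0], index=X.columns)

# positive coefficients push a chunk toward class 1 (Romeo and Juliet),
# negative coefficients toward class 0 (Moby Dick)
print("Words pulling toward Romeo and Juliet:")
print(coef_by_word.sort_values(ascending=False).head(10))

print("Words pulling toward Moby Dick:")
print(coef_by_word.sort_values().head(10))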

Summary

So, we have established that text analytics can be used to model conversations and provide prompt feedback to call center agents.

I can imagine coaching being made available to agents more readily, even before other KPIs reach managers.

There is power in the language we use.