This paper sets out to address three challenges posed by Closer, based on the corpus of complaints provided:
The first two challenges are set out here in this document, but due to time constraints the third was not attempted.
For the first challenge, Naive Bayes and SVM algorithms were used to classify complaints, though the desired accuracy levels were not achieved. This is most likely due to the reduced data set size, which in turn reduced accuracy. Principal Component Analysis could have been used to achieve better results; the Conclusion section explains why it was not attempted.
This paper will endeavour to address the challenges put forth in the Closer Challenge. These challenges are:
As part of the challenge, a corpus of customer complaints is provided, and this will be the basis for the analysis.
Before starting the analysis, a few initialisation steps must be done.
The packages required for this analysis are loaded (silently, to avoid loading logs):
#tm package required for text manipulation and generating the Document Term Matrix
require(tm)
#dplyr for manipulating data frames
require(dplyr)
#caret for partitioning data into test and train sets
require(caret)
#e1071 for the SVM and Naive Bayes algorithms used for supervised learning
require(e1071)
#topicmodels contains LDA algorithm for unsupervised learning
require(topicmodels)
#Used to extract data from LDA model
require(tidytext)
#Used to plot topics from LDA
require(ggrepel)
To ensure the analysis is repeatable and attains the same results, a seed is set manually:
set.seed(475)
The complaints corpus provided in CSV is loaded:
data<-read.csv("complains_data.csv")
Following data loading, the required columns are extracted, which are:
- Consumer.complaint.narrative, which contains the customer complaints provided by ACME and will form the basis for the corpus to be analysed.
- Issue, which contains the issue behind each complaint, as classified by ACME analysts, and will be treated as the response/target variable when conducting supervised learning.
#Extract the aforementioned columns and create a new data frame
corpus<-data.frame(complaint=data$Consumer.complaint.narrative, issue_tv=data$Issue)
#Coerce columns into characters
corpus$complaint<-as.character(corpus$complaint)
corpus$issue_tv<-as.character(corpus$issue_tv)
Now we have a data frame containing customer complaints and their corresponding issue.
Goal: Make sure the train and test data sets contain equal (on average) representation of each target variable.
Why: To avoid overfitting to one target class and classifying poorly on the others, we balance the data to have equal representation of each target variable.
Process: There are 161 issues. If we plot the number of complaints for each issue, we can see that the data is heavily skewed (imbalanced):
#Count complaints by issue
countbyissue<-aggregate(corpus$issue_tv, by=list(corpus$issue_tv), length)
#Give data frame meaningful names
colnames(countbyissue)<-c("issue","count")
#Order data in descending order
countbyissue<-countbyissue[order(-countbyissue$count),]
#Create a bar plot
barplot(countbyissue$count, main = "Count of Each Issue", col="blue")
Note: X-axis labels were omitted because they would not have been legible; the purpose of the plot is simply to show the distribution of counts.
Strategy for balancing: In order to balance the data, the following steps were taken (the rationale behind this strategy is explained afterwards):
1. Scan the Count by Issue table (arranged in descending order) until we find the point where the cumulative sum accounts for roughly 99% of complaints (a programmatic way to locate this cutoff is sketched after step 2):
sum(countbyissue$count[1:100])/sum(countbyissue$count)
## [1] 0.9888615
2. Read out the complaint count for that particular issue:
countbyissue$count[100]
## [1] 151
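The index of 100 used above was found by inspecting the table manually; as a small sketch, the same cutoff could also be located programmatically (note that, because the cumulative share at position 100 is just under 99%, a strict threshold of 0.99 would land a couple of issues later):
#Cumulative share of complaints covered by the top-N issues
cumshare<-cumsum(countbyissue$count)/sum(countbyissue$count)
#First position at which the cumulative share reaches 99%
which(cumshare >= 0.99)[1]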
3. Any issue with a count below 151 will now be grouped into an “other” category (a vectorised alternative to the loop is sketched below):
#Create a lookup table of issues that fall below the 100th issue
lookup<-as.character(countbyissue$issue[101:161])
#Loop through those issues and replace with "other" as issue
for(i in 1:nrow(corpus)){
  if(corpus$issue_tv[i] %in% lookup){
    corpus$issue_tv[i]<-"other"
  }
}
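As a side note, the same replacement can be achieved without an explicit loop, using vectorised subsetting (equivalent to the loop above):
#Vectorised equivalent of the loop above
corpus$issue_tv[corpus$issue_tv %in% lookup]<-"other"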
4. By doing step 3, we have ensured that the minimum number of complaints per issue is 151, so we can now sample 151 complaints for each issue:
#Sample 151 complaints from each issue
corpus <- corpus %>% group_by(issue_tv) %>% sample_n(151)
#Convert from dplyr tibble back to a data frame
corpus <- ungroup(corpus) %>% as.data.frame()
5. Plot the distribution of the new balanced complaints, to confirm we have a balanced data set:
#Count complaints by issue
countbyissue<-aggregate(corpus$issue_tv, by=list(corpus$issue_tv), length)
#Give data frame meaningful names
colnames(countbyissue)<-c("issue","count")
#Order data in descending order
countbyissue<-countbyissue[order(-countbyissue$count),]
#Create a bar plot
barplot(countbyissue$count, main = "Count of Each Issue(after balancing)", col="blue")
Now we have a balanced data set that we can use for training and testing purposes.
Rationale: The reasoning behind this strategy is as follows:
In order not to lose too many categories/issues, we opted to account for 99% of complaints, ensuring most scenarios were covered.
The second reason was processing time: a cutoff at 95% would have meant sampling 659 complaints per target variable, or roughly 46,000+ samples in total (a run that took about 1.5 hours, which proved impractical for experimentation). A cutoff at 99% of complaints resulted in 15,000+ samples and reduced processing time significantly (to roughly 10 minutes). Ideally, a larger training set would have been used to improve model accuracy; as discussed in later sections, the results reflect this shortcoming.
Goal: To prepare and clean the data for training purposes.
Why: The data set, the complaints narrative in this case, contains many words and characters (such as punctuation) that add no value to the analysis and may hamper it. The data is therefore cleaned to improve its quality, in the hope of producing better results, with a reduced data size as a desirable side effect.
Process: We will perform the following on the complaints narrative:
- Convert letters to lower case
- Remove numbers
- Remove punctuation
- Remove the strings “xxxx” and “xx”
- Remove \n and \t
- Strip extra white space
- Remove stopwords

Note: Word stemming was considered, but due to the loss of legibility and the minimal improvement in results, it was decided against.
#Convert letters to lower case
corpus$complaint<-tolower(corpus$complaint)
#Remove Numbers
corpus$complaint<-removeNumbers(corpus$complaint)
#Remove Punctuation
corpus$complaint<-removePunctuation(corpus$complaint)
#Create a function that will take a pattern and remove it
findremove<- function (pattern, object) { gsub(pattern, "", object)}
#Remove xxxx
corpus$complaint<-findremove("xxxx", corpus$complaint)
#Remove xx
corpus$complaint<-findremove("xx", corpus$complaint)
#Remove \n
corpus$complaint<-findremove("\n", corpus$complaint)
#Remove \t
corpus$complaint<-findremove("\t", corpus$complaint)
#Strip extra white spaces
corpus$complaint<-stripWhitespace(corpus$complaint)
#Convert complaints into a Corpus using tm package
complaintCorpus<-Corpus(VectorSource(corpus$complaint))
#Remove Stopwords
complaintCorpus<-tm_map(complaintCorpus, removeWords, stopwords())
#Stemming words
#corpus$complaint<-stemDocument(corpus$complaint)
Goal: Create features to be consumed by the machine learning algorithms, to be used in training.
Why: For the data to be used by an algorithm, we need to create a set of features that can be fed into it; in this case, a Document Term Matrix (DTM).
Process: The “tm” package will be used to create a DTM: each complaint is a “document” occupying a row, each “term” is a column, and the count of each term per document forms the values of the matrix:
#Create DTM using TM
dtm<-DocumentTermMatrix(complaintCorpus)
#View the DTM
inspect(dtm)
## <<DocumentTermMatrix (documents: 15251, terms: 28264)>>
## Non-/sparse entries: 953776/430100488
## Sparsity : 100%
## Maximal term length: 235
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs account bank called card credit loan payment received time told
## 12252 4 5 4 0 14 4 2 0 7 0
## 12258 0 0 2 0 4 12 2 1 4 6
## 12645 2 11 1 0 0 9 7 6 10 1
## 12685 10 0 0 1 4 15 24 5 6 0
## 15079 60 0 2 0 0 2 27 0 0 1
## 2386 28 20 9 2 1 0 3 1 5 18
## 2532 1 10 0 0 0 12 4 0 4 0
## 6023 0 5 7 2 6 5 9 0 4 25
## 625 3 1 4 0 4 7 0 2 5 14
## 9381 22 6 1 3 2 0 1 5 7 1
After creating our DTM, we can visualise the terms a bit better, which will aid us in further pre-processing. Consider the 100 most popular terms:
#Sum each term across all documents (i.e column sum)
cumulativeAllTerms<-colSums(as.matrix(dtm))
#Sort in descending order
cumulativeAllTerms<-cumulativeAllTerms[order(-cumulativeAllTerms)]
#Show top 100 terms
head(cumulativeAllTerms, 100)
## account credit card bank payment loan
## 22183 22089 14027 12007 10466 10312
## told called received time information money
## 9978 8081 7915 7176 6922 6912
## pay company back get will never
## 6842 6545 6519 6323 6281 6174
## can said call one balance report
## 6145 6032 5926 5721 5633 5629
## payments made sent paid now amount
## 5469 5241 5185 5167 5060 5036
## also interest phone due asked number
## 4982 4937 4936 4806 4665 4626
## debt days since letter still even
## 4506 4466 4403 4244 4067 4026
## check make service customer month email
## 3970 3810 3743 3587 3580 3457
## late just fee day mortgage help
## 3444 3375 3335 3315 3304 3244
## fees new years months contacted chase
## 3209 3190 3168 3166 3147 3072
## first date know another times charge
## 3065 3011 2904 2854 2848 2843
## charges home stated funds like issue
## 2835 2807 2794 2748 2732 2727
## business accounts closed without name went
## 2722 2703 2676 2625 2624 2565
## complaint want request charged contact need
## 2558 2551 2549 2545 2542 2538
## take please online fraud statement see
## 2524 2488 2441 2409 2375 2345
## able send address got loans work
## 2330 2315 2301 2295 2292 2275
## several transfer last however
## 2266 2264 2233 2227
It can be noted that there are further terms that add no value but are not included in the “tm” package stopword list, for example “can”, “just”, “however”, etc.
Based on the 100 most popular terms, additional terms deemed unnecessary were selected and removed, and a new DTM was created:
#Create my list of stopwords
otherstopwords<-c("told","called","back","can","will","get","said","never","also",
"even","just","know","another","like","want","went","please","take",
"however","going","see","got","several","able")
#Remove custom stopwords
complaintCorpus<-tm_map(complaintCorpus, removeWords, otherstopwords)
#Create a new DTM
dtm<-DocumentTermMatrix(complaintCorpus)
To enhance the quality of the data even further and reduce the data set size, sparse terms that appear fewer than 10 times were removed:
#Find terms that appear 10 times or more
freqterms<-findFreqTerms(dtm, lowfreq = 10)
#Limit DTM to contain terms that appear >= 10
dtm<-DocumentTermMatrix(complaintCorpus, list(dictionary=freqterms))
inspect(dtm)
## <<DocumentTermMatrix (documents: 15251, terms: 6095)>>
## Non-/sparse entries: 846897/92107948
## Sparsity : 99%
## Maximal term length: 21
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs account bank card credit information loan money payment received
## 12252 4 5 0 14 2 4 3 2 0
## 12258 0 0 0 4 0 12 1 2 1
## 12262 1 0 0 0 4 0 0 0 5
## 12645 2 11 0 0 2 9 8 7 6
## 12685 10 0 1 4 0 15 0 24 5
## 15079 60 0 0 0 8 2 0 27 0
## 2386 28 20 2 1 2 0 5 3 1
## 2532 1 10 0 0 0 12 0 4 0
## 6023 0 5 2 6 3 5 8 9 0
## 9381 22 6 3 2 24 0 12 1 5
## Terms
## Docs time
## 12252 7
## 12258 4
## 12262 2
## 12645 10
## 12685 6
## 15079 0
## 2386 5
## 2532 4
## 6023 4
## 9381 7
It can be noted that sparsity has dropped as a result, albeit marginally. Below is a plot of the 30 most common terms:
#Sum count of each term across all documents
cumulativeAllTerms<-colSums(as.matrix(dtm))
#Sort in descending order and take top 30 terms
Top30<-head(cumulativeAllTerms[order(-cumulativeAllTerms)], 30)
#Convert to data frame
Top30<-data.frame(term=names(Top30), count=Top30)
Top30<-Top30[order(-Top30$count),]
#Plot
barplot(rev(Top30$count), horiz = T, names.arg = Top30$term, las=2, col="blue", main="Most Frequent 30 Terms")
Goal: Split data into training (75%) and test (25%) sets.
Why: To avoid overfitting to the data at hand, the data is split into a training set (used to train the model) and a test set the model has not seen, against which it can then be validated.
Process: In brief:
- Convert the issue/target variable to a factor.
- Create a 75%/25% partition index based on the issue and use it to split the raw data into train and test sets.
- Subset the cleaned complaintCorpus using the same index and build a DTM for each subset, limited to the frequent-term dictionary.
- Convert term counts in both DTMs to binary (presence/absence) values.
- Bind the target variable to the train and test DTMs.
#Convert issue/target variable to factor, in order to conserve levels in case some categories don't appear in one of the sets (highly unlikely since the data is balanced)
corpus$issue_tv<-as.factor(corpus$issue_tv)
#Create an index with 75% split based on issue value in raw data
inTrain<-createDataPartition(corpus$issue_tv,p=0.75,list=FALSE)
#Subset raw data with index
train<-corpus[inTrain,]
#Subset raw data with NOT index
test<-corpus[-inTrain,]
#Subset cleaned corpus for training & test sets
corpustrain<-complaintCorpus[inTrain]
corpustest<-complaintCorpus[-inTrain]
#Create DTM based on subsetted cleaned corpus
dtmtrain<-DocumentTermMatrix(corpustrain, list(dictionary=freqterms))
dtmtest<-DocumentTermMatrix(corpustest, list(dictionary=freqterms))
#Function to convert non-zero values to 1
convert_counts <- function(x) {
x <- ifelse(x > 0, 1, 0)
}
#Convert non-zero values to 1 in train and test DTM
dtmtrain<- dtmtrain %>% apply(MARGIN=2,convert_counts)
dtmtest<- dtmtest %>% apply(MARGIN=2,convert_counts)
#Convert DTM to data frames
dtmtrain<-as.data.frame(dtmtrain)
dtmtest<-as.data.frame(dtmtest)
#Bind target variable to test and train DTMs
dtmtrain<-cbind(issue_tv=train$issue_tv,dtmtrain)
dtmtest<-cbind(issue_tv=test$issue_tv,dtmtest)
Goal: To train a model to classify complaints according to the target variable, “issue” in this case.
Process:
#Train a Naive Bayes model using the e1071 package
fit_NB<-naiveBayes(dtmtrain, dtmtrain$issue_tv)
#Predict with the Naive Bayes model on the test set
pred_NB<-predict(fit_NB, newdata= dtmtest)
#Create a confusion matrix for that model/prediction
conf_NB<-confusionMatrix(pred_NB,dtmtest$issue_tv)
This model has an accuracy of:
conf_NB$overall["Accuracy"]
## Accuracy
## 0.01204174
That is a very low accuracy, most likely due to the reduction in data set size. More on this in the Conclusion section.
#Train an SVM model using the e1071 package
fit_SVM<-svm(issue_tv ~ ., data = dtmtrain, scale=FALSE)
#Predict with the SVM model on the test set
pred_SVM<-predict(fit_SVM, newdata= dtmtest)
#Create a confusion matrix for that model/prediction
conf_SVM<-confusionMatrix(pred_SVM,dtmtest$issue_tv)
This model has an accuracy of:
conf_SVM$overall["Accuracy"]
## Accuracy
## 0.153064
At roughly 15% accuracy this is an improvement, but still very low. More on this in the Conclusion section.
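To see where the SVM struggles, the per-class statistics from the caret confusion matrix can be inspected; a small sketch (the exact column names, such as “Balanced Accuracy”, may vary with the caret version):
#Per-class statistics, one row per issue
byclass<-as.data.frame(conf_SVM$byClass)
#Issues the model handles best, ranked by balanced accuracy
head(byclass[order(-byclass$`Balanced Accuracy`), c("Sensitivity","Specificity","Balanced Accuracy")])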
Goal: To construct new issue categories (“second opinion on issues”) based on the complaints corpus.
Process: For this task, the LDA (Latent Dirichlet Allocation) algorithm will be used, since it’s an unsupervised learning algorithm and the aim is to propose a new categorisation for the complaints, ignoring the existing one. LDA will automatically detect themes (in this case, a new categorisation) based on terms that exist in the corpus.
After topics are created by LDA, each having a set of terms, each topic will be inspected manually, and based on the terms, a new categorisation will be proposed.
Before running the training algorithm, the method requires the number of expected topics to be defined. Ten topics are proposed for the following reasons:
#10 topics/categories
k<-10
#Run LDA algorithm
lda<-LDA(dtm, k=k, method = "Gibbs")
Following the training, below is a plot of the 10 topics with the 10 most frequent terms:
Note: the code for the above plot has been borrowed from this site
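Since the borrowed plotting code is not reproduced here, below is a minimal alternative sketch of how such a plot could be produced, using tidytext to extract the per-topic term probabilities and ggplot2 (assumed to be installed) for the plotting; this is not the original code referenced above:
#ggplot2 is assumed to be available for this sketch
require(ggplot2)
#Extract per-topic term probabilities (the "beta" matrix) from the fitted LDA model
topterms<-tidy(lda, matrix="beta") %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
#Plot the 10 most probable terms for each topic
ggplot(topterms, aes(x=reorder(term, beta), y=beta)) +
  geom_col(fill="blue") +
  coord_flip() +
  facet_wrap(~topic, scales="free_y") +
  labs(x="Term", y="Term probability (beta)", title="Top 10 Terms per Topic")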
As per the illustration above, some clear topics can be identified. Below is an attempt at proposing new categories (in order of appearance in the plot above):
| Topic | Category |
|---|---|
| Topic 1 | Credit card issues |
| Topic 10 | Information Not Sent |
| Topic 2 | Not clear |
| Topic 3 | Not clear |
| Topic 4 | Mortgage issues |
| Topic 5 | Late payments |
| Topic 6 | Credit score |
| Topic 7 | Account balance |
| Topic 8 | Not clear (though references to phone calls) |
| Topic 9 | Not clear (though references to time and people) |
With better industry understanding, the above suggested categorisation can be improved further.
This paper set out to address two challenges: classifying complaints by their issue using supervised learning, and proposing new issue categories using unsupervised learning.
With regard to the first challenge, it proved difficult and resulted in low accuracy. The most likely reason is the reduction of the data set, done mainly to keep processing times manageable, to roughly 4% of its original size, which likely contributed to the loss of accuracy. One way to address this would be Principal Component Analysis, which could reduce the data size while retaining most of the information; due to limited familiarity with PCA and time constraints, this was not implemented, though a rough sketch of the idea is given below.
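As an illustration only, below is a minimal, untested sketch of how PCA could have been applied to the binary DTM before training an SVM; the number of components kept (300) is an arbitrary choice and the variable names follow the ones used earlier:
#Fit PCA on the training DTM, excluding the target column (illustrative sketch only)
pca<-prcomp(as.matrix(dtmtrain[,-1]))
#Keep the first 300 principal components as features (arbitrary number chosen for illustration)
train_pca<-data.frame(issue_tv=dtmtrain$issue_tv, pca$x[,1:300])
#Project the test DTM onto the same components
test_pca<-data.frame(issue_tv=dtmtest$issue_tv, predict(pca, as.matrix(dtmtest[,-1]))[,1:300])
#Train and evaluate an SVM on the reduced feature set
fit_SVM_pca<-svm(issue_tv ~ ., data=train_pca, scale=FALSE)
pred_SVM_pca<-predict(fit_SVM_pca, newdata=test_pca)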
As for the second challenge, attempts were more successful: using LDA, some new categories/topics started to emerge. Given more time, the corpus used for the algorithm would have been enlarged to include more example complaints, and experimentation with the number of topics, together with analysis of the results, would have further improved their quality.
Below are some of the difficulties faced during this challenge (to keep it brief, only headlines are mentioned):
Overall, though, it was a very educational and fruitful challenge to take on.