Preamble

Not so long ago I got done with an online data science course and have been putting my newfound skills to the test ever since - a little over a month now. So I thought to myself, “Hmmmm, why not explore the kernel section of Kaggle, given my failure to ‘dominate’ competitions?” So here I am, and this time I told myself that I am going to learn to crawl before I walk. And hence the Loan dataset.

This dataset is a rather classic example of a classification problem. What’s a classification problem, you may ask? These are problems where the variable to be predicted takes discrete (categorical) values rather than continuous ones. What are discrete and continuous data points, you may ask? ‘Hmmm, this is getting annoying’… well, I implore you to check this link out. Now that we have the basics out of the way, let’s take a swing at the Loan dataset.

Data Teasing

Just like foreplay, ‘Data Teasing’ plays a crucial role in determining the result of any mining escapade. This is primarily realised in three steps: Reading, Cleaning and Learning.

Reading

Without a doubt, the hardest part of this analysis is Reading. Despite the arduous nature of this step, I am going to try and read my training data into two variables: ‘loan_data’ and ‘Training’.

loan_data <- read.csv("Loan payments data.csv",
                      sep = ",",
                      header = TRUE)

Training <- read.csv("Loan payments data.csv",
                     sep = ",",
                     header = TRUE)

Some of you must be going, “why is this ‘future Kaggle grandmaster’ reading the same data into two variables?” Well, since I will be teasing my dataset further, if at any point during the tease things were to go south, I’ll have what one would call ‘insurance’ B-) (It was with this level of data intellect that I sought to “conquer” the competitions… self-facepalm.)

Learning and Cleaning

Now that the hard part is out of the way, let’s take a quick peek at the dataset.


A scan of a sample of the dataset reveals that there are a total of 11 features per client - please refer to the table above for the full list. Though the table is quite informative, the ‘str’ function reveals a little more about the nature of the variables. (It’d be great if one of y’all data sleuths could show me a fancier way to project the output of the str function.)
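(The str output below presumably comes from a call as plain as the following.)

str(Training)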

## 'data.frame':    500 obs. of  11 variables:
##  $ Loan_ID       : Factor w/ 500 levels "xqd12160159",..: 472 474 5 6 7 470 8 9 471 10 ...
##  $ loan_status   : Factor w/ 3 levels "COLLECTION","COLLECTION_PAIDOFF",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Principal     : int  1000 1000 1000 1000 1000 300 1000 1000 1000 800 ...
##  $ terms         : int  30 30 30 15 30 7 30 30 30 15 ...
##  $ effective_date: Factor w/ 7 levels "9/10/2016","9/11/2016",..: 6 6 6 6 7 7 7 7 7 1 ...
##  $ due_date      : Factor w/ 25 levels "10/10/2016","10/11/2016",..: 7 7 7 19 8 14 8 8 8 21 ...
##  $ paid_off_time : Factor w/ 321 levels "","10/1/2016 10:18",..: 171 101 254 211 223 321 99 87 108 234 ...
##  $ past_due_days : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ age           : int  45 50 33 27 28 35 29 36 28 26 ...
##  $ education     : Factor w/ 4 levels "Bechalor","college",..: 3 1 1 2 2 4 2 2 2 2 ...
##  $ Gender        : Factor w/ 2 levels "female","male": 2 1 1 2 1 2 2 2 2 2 ...

Perfecto!! Now let’s analyse the variables individually and get to understand the true meaning of life… (this dude is definitely on something; illeism much?)

  • With over a month’s experience teasing data, my ‘data scientist’ senses are telling me that Loan_ID is not going to amount to much. So let’s rid our Training dataset of it.
Training$Loan_ID <- NULL


  • Next is the dependent variable loan_status, the value of which may or may not be influenced by the remaining nine features. Further along this analysis, I will try and project how the independent (predictor) variables interact with each other and with the dependent variable (so as to understand the true meaning of life…). To get an idea of its distribution, let’s call summary on it (the exact call is shown right after this bullet).
Loan Status Summary
                     Count
COLLECTION             100
COLLECTION_PAIDOFF     100
PAIDOFF                300
    • Interesting, the data shows 300 of the 500 customers have paid off their loans, while 100 each remain in the ‘collection’ and ‘paid off after collection’ categories. Let’s leave the dependent variable as it is and move on with the sleuthing.
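For reference, the count table above is just the factor summary; presumably something as simple as:

summary(Training$loan_status)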

  • Talking about the Principal… to be honest, there is nothing much to this. It is simply the actual sum borrowed. But do you know what could spice things up? Calling summary on it ;) (a quick note on the call follows this bullet).
Principal Summary
          Count
300           6
500           3
700           1
800         111
900           2
1000        377
    • The table shows that a majority of the clients have a principal value of 1000 or 800. It would be interesting to observe the relationship between Loan_status and Principal - coming soon (like in a few scrolls..).
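One small note: since Principal is stored as an integer, a plain summary() would give quantiles rather than the counts above. A count table like the one shown can be produced along these lines:

table(Training$Principal)          # or summary(as.factor(Training$Principal))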

  • Terms has its values distributed amongst three classes: weekly, bi-weekly and monthly. So, I am going to do the noble thing and convert this column to a factor and project a summary while I’m at it (the one-liner is shown right after this bullet).
Terms Summary
          Count
7            21
15          207
30          272
    • Data shows that a majority of the clients have taken loans with a bi-weekly or monthly payoff schedule. What!! This is unacceptable; 15 days ain’t two weeks, it’s two weeks and one day. I can’t take this kind of inaccuracy any more. That’s it, I am moving on to - hopefully - more accurate variables.
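The conversion and its summary boil down to a quick pair of lines (assuming the Training copy):

Training$terms <- as.factor(Training$terms)   # treat the payoff schedules as categories
summary(Training$terms)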

  • Dealing with dates and time is like dealing with tangled earphone wires; disentangling them can be quite the daunting task. The lubridate package is like tangle-free wiring for earphones; it makes life a whole lot less complicated.
library(lubridate)   # needed for mdy() and mdy_hm()

Training$effective_date <- mdy(Training$effective_date)
Training$due_date <- mdy(Training$due_date)
Training$paid_off_time <- mdy_hm(Training$paid_off_time)
    • Et voila!! Dealing with dates couldn’t have gotten any easier. Y’all are probably going, “what in god’s name is he talking about?” Believe me people, with the lubridate package all the work happens in the background B|.

  • Before we move on, I would like to confirm my suspicion about the relationship between terms and the duration of the loan (the difference between the effective date and the due date). A sketch of that computation follows this bullet.
Duration
          Count
6            18
13            3
14          190
29          259
44            8
59           22
    • It seems like the duration between the effective date and the due date is roughly a multiple of the terms variable. Not sure how helpful that’s going to be.. but yea, Kaggle grandmaster in the making.
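A minimal sketch of how that duration table could be computed, assuming the dates were already parsed with lubridate above:

duration <- as.numeric(Training$due_date - Training$effective_date)   # days between the two dates
table(duration)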

  • Last but not least, we tackle the matrimonial essentials - age, education and gender. I’m simply going to call summary on these variables (the call is shown after this bullet).
Summary of Age, Education and Gender
age              education                   Gender
Min.   :18.00    Bechalor            : 67    female: 77
1st Qu.:27.00    college             :220    male  :423
Median :30.00    High School or Below:209
Mean   :31.12    Master or Above     :  4
3rd Qu.:35.00
Max.   :51.00
    • Data seems to suggest that we are dealing with a sample population between the ages of 18 and 51, primarily with college or high-school education levels. Also, more than 80% of the clients are men.
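Presumably a single call along these lines produced the table above:

summary(Training[, c("age", "education", "Gender")])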

There you have it folks. So, now that I am finished with data teasing, let’s move on to the fun part, a coffee break B| - much deserved.

Data Exploration

It’s quite apparent from the dataset that the two variables past_due_days and paid_off_time can be used to make spot-on predictions about the loan status. To be honest, it took me some time before I figured this out. I had to spend a good amount of time exploring this dataset - double facepalm.

Since creating a prediction model and then predicting using the dataset would be rather simple - like stealing candy from a baby - let’s talk to the data, see what secrets she’s hiding, understand her, empathise with her… maybe I should stop; sounds like I have issues… maybe I don’t, only data can tell ;)

Long story short, let’s mine.

Starting with the variables in which I had placed too much faith :/

So this is how I am going to approach the mining: I’m going to ask myself a bunch of questions about the loan dataset, and I’m going to try and get them answered using modern plotting practices, a.k.a. ggplot.

Question 1 - Does the value of ‘Principal’ help determine whether or not the loan has been paid?

Seems like as long as the loan was below 800, there was a 100% payback rate. Well, data seems to suggest that there’s going to be a 0% payback rate for all those bullet loans I had taken from my dad… sorry dad :|
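Since the figure itself isn’t reproduced in this write-up, here is a rough sketch of the kind of plot I have in mind (my reconstruction, not necessarily the exact original call):

library(ggplot2)

# counts per principal value, split by loan status
ggplot(Training, aes(x = factor(Principal), fill = loan_status)) +
  geom_bar(position = "dodge")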

Question 1.1 - Does including ‘terms’ help separate out the values to better determine whether or not the loan has been paid?

The variable terms does create a visible separation of the loan status for one of its classes - the 7-day/weekly term - which goes to say that clients who have borrowed at weekly terms seem to have paid off their loans entirely. Too bad terms don’t extend beyond a month, else I would be sure to find a payback day for those humble donations from my dad… thanks dad :|

Question 1.1.1 - How about when age is taken into consideration?

Before I can move forward and introduce age, I would need to convert it into something more tangible - for ggplot of course. So I went ahead and created another feature, age_r. A person aged 16-25, 26-35, 36-45 or 46-55 was assigned to the age group 20, 30, 40 or 50 respectively within the age_r column.
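A minimal sketch of that binning (the exact breaks are my assumption):

Training$age_r <- cut(Training$age,
                      breaks = c(15, 25, 35, 45, 55),   # 16-25, 26-35, 36-45, 46-55
                      labels = c(20, 30, 40, 50))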

Too much info on one graph?? Let me break it down for y’all. As usual, the x-axis has been assigned the principal values and the y-axis the frequency of occurrence (counts). The label situated on top of each subgraph has two values, the upper one signifying the terms and the lower one age_r. I call these kinds of graphs the whopper extreme. They might not be the most aesthetic or compact figures, but they carry a stomach full.
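Something along these lines would produce that kind of faceted view (again my reconstruction; the original call isn’t shown here):

library(ggplot2)

# one panel per terms/age_r combination; terms sits on the upper strip, age_r below it
ggplot(Training, aes(x = factor(Principal), fill = loan_status)) +
  geom_bar() +
  facet_wrap(~ terms + age_r)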

As for the graph, adding age did not bring about the desired separation I was hoping to realise: regardless of age, if the terms were on a weekly basis and the principal was less than 800, there was a 100% payback.

Question 1.1.2 - How about education?

Moving on, I added education to the same graph while removing age_r. Let’s see if this answers the age-old question of what came first, the chicken or the egg - of course the egg came first.

Nada! This is definitely not how I had expected things to turn out. Again, regardless of education, if the terms were on a weekly basis and the principal was less than 800, there was a 100% payback.

Question 1.1.3 - Gender…. ?

Maybe, just maybe our answer lies within the gender status. Let’s give that a go:

Well, it took Thomas Edison 1000 tries before he was able to realise the light bulb. So, if my math is correct, I have only 997 tries to go… also, I’m simply not convinced that the variables age, Gender and education have given me everything they had to offer.

Question 2 - Does age on its own help with loan_status prediction?

So, let me launch another bout with them. This time, I will assign the x-axis the age of the clients and keep adding variables as I narrate to y’all a story of courage, valour and stupid perseverance.

Well, it looks like the “stupid perseverance” part of this story is beginning to take shape. The graph simply shows that people of all ages are likely to default.

Question 2.1 - Does taking education into consideration help?

Let’s hope this works…

I am not a very lucky grandmaster; education too has failed me.

Question 2.2 - How about gender?

Feels like I have asked this question before..

Hmmm, looks like at my 994th try I am calling it quits. Gender too has let me down.

Well, the whopper extremes weren’t able to project any discernible separation with the inclusion of age, gender or education. So, with a reliable level of certainty, I can conclude that the most significant separation was observed with the addition of the variables terms and principal. In other words, I have stupidly persevered.

Moving on to the variables which had me ‘self-double-facepalm’

Reading the names of the features of the loan dataset one should have gotten the idea that paid_off_time could go a long way in helping predict the loan status. But not me, I decided otherwise, and now I am back to where I started…

Question 3 - Does the variable ‘past_due_days’ help me pinpoint the defaulters?

Incroyable!! Look at all that separation, Einstein. Well, despite the profound separation brought about merely by the inclusion of the variable past_due_days, there still seems to be some work to be done in separating the collection and collection-paid-off categories.

Question 3.1 - Does tweaking the variable “paid_off_time” help me pinpoint the defaulters?

By tweaking, I mean [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), which would be essential for the last part of our puzzle. So, I am going to introduce another variable called paid_off, which simply assigns clients a ‘1’ if the loan has been paid and a ‘0’ otherwise.
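One plausible way to build that flag (my assumption being that clients with no recorded paid_off_time are the ones who never paid):

Training$paid_off <- ifelse(is.na(Training$paid_off_time), 0, 1)   # 1 = loan paid, 0 = defaulted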

Voila! There you have it, peeps. If a given client has a paid_off value of 1 and past_due_days of 0, he has paid off his loan on time. If past_due_days is any value greater than 0, he has still paid off his loan, but only after it went into collection. Alternatively, if the client’s paid_off value is 0, he has defaulted.

Just for kicks exploration

Enough of all that serious talk, let’s have some fun with the dataset.

Question 4 - Does age play a role in determining how much the borrowed principal amount will be?

Nope, no particular trend here; people of all ages borrowed at different principals.

Question 5 - How about gender, does it play a role in determining how much the borrowed principal amount will be?

Seems like women primarily borrowed loans at principal values greater than 800. But this could be simply due to the ratio of men to women clients in the data set. Possibly with a larger dataset, the pattern would disappear.

Question 6 - How about education, can the level of education help determine the principal amount?

Not really. Also, since we have an inadequate number of clients with an education level of ‘Master or Above’, we really can’t say much about that group.

Question 7 - What is the relationship between age and education level?

We possibly aren’t dealing with a highly educated sample.

Question 8 - Are women more educated?

On average the data suggests women are the more educated gender.

Question 9 - Does age group, gender or education have anything to do with time of payment?

## Warning: Removed 100 rows containing non-finite values (stat_bin).

It seems like quite a few of the clients aged between 25 and 44 have made payments in the early morning hours. Interesting, I would have imagined this from the younger bunch - I was awake at ungodly hours during my bachelor’s.

Modelling

And now for the moment that y’all have been waiting for… *chirping of crickets*… “MODELLING!!!”… *chirping of crickets*… This is embarrassing :| Looks like I am in this boat all by myself, et vous, Brutes (forgive my poor French). Anywho, as mentioned earlier, the loan status of a client can be predicted rather easily with the variables ‘past_due_days’ and ‘paid_off_time’. Before I go on with the obvious, let me try and realise my best possible prediction of loan status excluding the obvious.

Modelling excluding the obvious

For this modelling task, I thought to myself, why not model with one of the most basic classification algorithms, the decision tree - after all, this dataset is a rather classic example of a classification problem.
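The code behind the output below isn’t reproduced in this write-up, so here is a hedged sketch of the kind of rpart/caret workflow that could produce it (the 60/40 train/test split is inferred from the 200-row confusion matrix; the seed is my own):

library(rpart)
library(caret)

set.seed(123)                                    # seed is a guess
idx   <- createDataPartition(Training$loan_status, p = 0.6, list = FALSE)
train <- Training[idx, ]
test  <- Training[-idx, ]

fit  <- rpart(loan_status ~ Principal + terms + education + Gender + age_r,
              data = train, method = "class")
pred <- predict(fit, test, type = "class")
confusionMatrix(pred, test$loan_status)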

##                     Reference
## Prediction           COLLECTION COLLECTION_PAIDOFF PAIDOFF
##   COLLECTION                  0                  0       0
##   COLLECTION_PAIDOFF          0                  0       0
##   PAIDOFF                    40                 40     120
## Accuracy 
##      0.6

As you can see, the variables Principal, terms, education, Gender and age_r didn’t really help with predicting the loan status - the algorithm simply doesn’t have any useful features to build branches and leaves with. Thereby, by default, it predicts that clients have paid off their loans without having had to move into collection, as clients who have paid off are in the majority. Also, the absence of an rpart tree diagram can be attributed to this default prediction pattern.

Let’s give it another go with only Education:

##                     Reference
## Prediction           COLLECTION COLLECTION_PAIDOFF PAIDOFF
##   COLLECTION                  0                  0       0
##   COLLECTION_PAIDOFF          0                  0       0
##   PAIDOFF                    40                 40     120
## Accuracy 
##      0.6

Nothing changed. So, basically, with no useful variables to extract information from, the prediction by default is going to be made in favour of the majority loan status within the training set - ‘PAIDOFF’.

Modelling with the obvious

To show y’all the predictive power of just one relevant variable, I’ll first run the classification with only ‘past_due_days’. Also, I’ll throw in a fancy rtree plot (pun intended) to make things more visual.
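Again a hedged sketch, reusing the train/test split from the earlier sketch; rpart.plot is my choice of plotting package, and the NA handling is an assumption on my part:

# assumption: NAs in past_due_days (clients who never went past due) are replaced with 0
train$past_due_days[is.na(train$past_due_days)] <- 0
test$past_due_days[is.na(test$past_due_days)]   <- 0

library(rpart.plot)

fit_due  <- rpart(loan_status ~ past_due_days, data = train, method = "class")
pred_due <- predict(fit_due, test, type = "class")
confusionMatrix(pred_due, test$loan_status)
rpart.plot(fit_due)                              # the 'fancy rtree plot'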

##                     Reference
## Prediction           COLLECTION COLLECTION_PAIDOFF PAIDOFF
##   COLLECTION                 39                  3       0
##   COLLECTION_PAIDOFF          1                 37       0
##   PAIDOFF                     0                  0     120

Voila!! 98% accuracy! Of course, the past_due_days variable alone cannot get you to 100% accuracy. For that, I’ll be needing the feature-engineered paid_off variable.
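And a final sketch, adding the engineered paid_off flag on top of past_due_days (same assumptions as above):

fit_full  <- rpart(loan_status ~ past_due_days + paid_off,
                   data = train, method = "class")
pred_full <- predict(fit_full, test, type = "class")
confusionMatrix(pred_full, test$loan_status)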

##                     Reference
## Prediction           COLLECTION COLLECTION_PAIDOFF PAIDOFF
##   COLLECTION                 40                  0       0
##   COLLECTION_PAIDOFF          0                 40       0
##   PAIDOFF                     0                  0     120
## Accuracy 
##        1

And there you go, 100% accuracy.

There you have it, peeps. I hope y’all enjoyed my informal way of narration through this kernel. As you can see, I am a novice looking to learn and grow within the world of data science. ‘Learning to crawl before I walk’ is my attempt at recording my progress through this journey that I have embarked on. With enough support and positive feedback, I seek to take y’all with me on what I hope to be an immense learning experience. Until next time….