Not so long ago I finished an online data science course, and I have been putting my newfound skills to the test ever since - a little over a month now. So I thought to myself, "Hmmm, why not explore the kernels section of Kaggle, given my failure to 'dominate' competitions?" So here I am, and this time I told myself that I am going to learn to crawl before I walk. Hence the Loan dataset.
This dataset is a rather classic example of a classification problem. What's a classification problem, you may ask? It's a problem where the target we're predicting is a discrete label rather than a continuous value. What are discrete and continuous data, you may ask? 'Hmmm, this is getting annoying' - well, I implore you to check this link out. Now that we have the basics out of the way, let's take a swing at the Loan dataset.
Just like foreplay, 'Data Teasing' plays a crucial role in determining the result of any mining escapade. It is primarily realised in three steps: Reading, Cleaning and Learning.
Without a doubt, the hardest part of this analysis is Reading. Despite the arduous nature of this step, I am going to try and read my training data into two variables: 'loan_data' and 'Training'.
# read the raw CSV twice: 'Training' is the working copy, 'loan_data' the untouched backup
loan_data <- read.csv("Loan payments data.csv",
                      sep = ",",
                      header = TRUE)

Training <- read.csv("Loan payments data.csv",
                     sep = ",",
                     header = TRUE)
Some of you must be going, "Why is this 'future Kaggle grandmaster' reading the same data into two variables?" Well, since I will be teasing my dataset further, if at any point during the tease things were to go south, I'll have what one would call 'insurance' B-). (It was with this level of data intellect that I sought to "conquer" the competitions *self-facepalm*.)
Now that the hard part is out of the way, let’s take a quick peek at the dataset.
A scan of a sample of the dataset reveals that there are a total of 11 features per client, namely… please refer to the table above. Though the table is quite informative, the 'str' function reveals a little more about the nature of the variables. (It'd be great if one of y'all data sleuths could show me a fancier way to present the output of str.)
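For the record, the quick peek is nothing fancier than calling str() on the working copy:

str(Training)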
## 'data.frame': 500 obs. of 11 variables:
## $ Loan_ID : Factor w/ 500 levels "xqd12160159",..: 472 474 5 6 7 470 8 9 471 10 ...
## $ loan_status : Factor w/ 3 levels "COLLECTION","COLLECTION_PAIDOFF",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Principal : int 1000 1000 1000 1000 1000 300 1000 1000 1000 800 ...
## $ terms : int 30 30 30 15 30 7 30 30 30 15 ...
## $ effective_date: Factor w/ 7 levels "9/10/2016","9/11/2016",..: 6 6 6 6 7 7 7 7 7 1 ...
## $ due_date : Factor w/ 25 levels "10/10/2016","10/11/2016",..: 7 7 7 19 8 14 8 8 8 21 ...
## $ paid_off_time : Factor w/ 321 levels "","10/1/2016 10:18",..: 171 101 254 211 223 321 99 87 108 234 ...
## $ past_due_days : int NA NA NA NA NA NA NA NA NA NA ...
## $ age : int 45 50 33 27 28 35 29 36 28 26 ...
## $ education : Factor w/ 4 levels "Bechalor","college",..: 3 1 1 2 2 4 2 2 2 2 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 1 1 2 1 2 2 2 2 2 ...
Perfecto!! Now let's analyse the variables individually and get to understand the true meaning of life… (this dude is definitely on something; illeism much?)
# the loan ID is just an identifier, so drop it
Training$Loan_ID <- NULL
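The frequency tables that follow - loan_status, Principal and terms - can be reproduced with base R's table(); here's a minimal sketch, assuming knitr's kable() is what renders them as markdown:

library(knitr)

kable(table(Training$loan_status), col.names = c("loan_status", "Count"))
kable(table(Training$Principal), col.names = c("Principal", "Count"))
kable(table(Training$terms), col.names = c("terms", "Count"))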
| loan_status | Count |
|---|---|
| COLLECTION | 100 |
| COLLECTION_PAIDOFF | 100 |
| PAIDOFF | 300 |
| Principal | Count |
|---|---|
| 300 | 6 |
| 500 | 3 |
| 700 | 1 |
| 800 | 111 |
| 900 | 2 |
| 1000 | 377 |
| terms | Count |
|---|---|
| 7 | 21 |
| 15 | 207 |
| 30 | 272 |
# convert the date columns from factors to proper dates (needs the lubridate package)
library(lubridate)

Training$effective_date <- mdy(Training$effective_date)
Training$due_date       <- mdy(Training$due_date)
Training$paid_off_time  <- mdy_hm(Training$paid_off_time)
| Value | Count |
|---|---|
| 6 | 18 |
| 13 | 3 |
| 14 | 190 |
| 29 | 259 |
| 44 | 8 |
| 59 | 22 |
| age | education | Gender |
|---|---|---|
| Min.   :18.00 | Bechalor            : 67 | female: 77 |
| 1st Qu.:27.00 | college             :220 | male  :423 |
| Median :30.00 | High School or Below:209 | |
| Mean   :31.12 | Master or Above     :  4 | |
| 3rd Qu.:35.00 | | |
| Max.   :51.00 | | |
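And the summary above is plain old summary() over the remaining demographic columns (again assuming kable() for the layout):

kable(summary(Training[, c("age", "education", "Gender")]))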
There you have it folks. So, now that I am finished with data teasing, let’s move on to the fun part, a coffee break B| - much deserved.
It's quite apparent from the dataset that the two variables past_due_days and paid_off_time can be used to make spot-on predictions about the loan status. To be honest, it took me some time before I figured this out. I had to spend a good amount of time exploring this dataset *double facepalm*.
Since creating a prediction model and then predicting with those would be rather simple - like stealing candy from a baby - let's first talk to the data, see what secrets she's hiding, understand her, empathise with her… maybe I should stop, sounds like I have issues… maybe I don't, only the data can tell ;)
Long story short, let’s mine.
So this is how I am going to approach the mining: I'm going to ask myself a bunch of questions about the loan dataset, and I'm going to try to answer them using modern plotting practices, a.k.a. ggplot2.
Question 1 - Does the value of 'Principal' help determine whether or not the loan has been paid?
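Here is a minimal sketch of the kind of plot I am using to answer this, assuming ggplot2 (the bin width is my own choice):

library(ggplot2)

ggplot(Training, aes(x = Principal, fill = loan_status)) +
  geom_histogram(binwidth = 100, position = "dodge")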
Seems like as long as the loan was below 800, there was a 100% payback rate. Well, the data seems to suggest that there's going to be a 0% payback rate on all those bullet loans I took from my dad… sorry dad :|
Question 1.1 - Does including 'terms' help separate out the values and better determine whether or not the loan has been paid?
The variable terms does create a visible separation of the loan status for one of its classes - the 7-day/weekly term - which goes to say that clients who borrowed on weekly terms seem to have paid off their loans entirely. Too bad terms don't extend beyond a month, else I would be sure to find a payback day for those humble donations from my dad… thanks dad :|
Question 1.1.1 - How about when age is taken into consideration?
Before I can move forward and introduce age, I need to convert it into something more tangible - for ggplot, of course. So I went ahead and created another feature, age_r: a person aged 16-25, 26-35, 36-45 or 46-55 was assigned to the age group 20, 30, 40 or 50 respectively within the age_r column (see the sketch below).
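A minimal sketch of how age_r can be built with base R's cut() (the exact call I used may differ):

# bin age into 16-25, 26-35, 36-45, 46-55 and label the groups 20, 30, 40, 50
Training$age_r <- cut(Training$age,
                      breaks = c(15, 25, 35, 45, 55),
                      labels = c(20, 30, 40, 50))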
Too much info on one graph?? Let me break it down for y'all. As usual, the x-axis has been assigned the principal values and the y-axis the frequency of occurrence (counts). The label situated on top of each subgraph has two values, the upper one signifying the terms and the lower one age_r. I call these kinds of graphs the whopper extreme. They might not be the most aesthetic or compact figures, but they carry a stomach full.
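For the curious, a whopper extreme is nothing more exotic than a faceted histogram; something along these lines (my assumption of the exact call):

ggplot(Training, aes(x = Principal, fill = loan_status)) +
  geom_histogram(binwidth = 100, position = "dodge") +
  facet_wrap(~ terms + age_r)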
As for the graph, adding age did not bring about the separation I was hoping to realise. Regardless of age, if the terms were on a weekly basis and the principal was less than 800, there was a 100% payback.
Question 1.1.2 - How about education ?
Moving on, I added education to the same graph while removing age_r. Let's see if this answers the age-old question of what came first, the chicken or the egg - of course the egg came first.
Nada! This is definitely not how I had expected things to turn out. Again, regardless of education, if the terms were on a weekly basis and the principal was less than 800, there was a 100% payback.
Question 1.1.3 - Gender…. ?
Maybe, just maybe our answer lies within the gender status. Let’s give that a go:
Well, it took Thomas Edison 1000 tries before he was able to realise the light bulb. So, if my math is correct, I have only 997 tries to go… also, I'm simply not convinced that the variables age, Gender and education have given me everything they have to offer.
Question 2 - Does age on its own help with loan_status prediction?
So, let me launch another bout with them. This time, I will assign the x-axis the age of the clients and keep adding variables as I narrate to y’all a story of courage, valour and stupid perseverance.
Well, it looks like the “stupid perseverance” part of this story is beginning to take shape. The graph simply shows that people of all ages are likely to default.
Question 2.1 - Does taking education into consideration help?
Let’s hope this works…
I am not a very lucky grandmaster; education too has failed me.
Question 2.2 - How about gender?
Feels like I have asked this question before..
Hmmm, looks like at my 994th try I am calling it quits. Gender too has let me down.
Well, the whopper extremes weren't able to project any discernible separation from the inclusion of age, gender or education. So, with a reliable level of certainty, I can conclude that the most significant separation was observed with the variables terms and principal. In other words, I have stupidly persevered.
Just from reading the names of the features of the loan dataset, one should have gotten the idea that paid_off_time could go a long way in helping predict the loan status. But not me - I decided otherwise, and now I am back to where I started…
Question 3 - Does the variable ‘past_due_days’ help me pinpoint the defaulters?
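The plot here is simply the past_due_days distribution coloured by loan status (a sketch; the 300 clients who never went past due have no past_due_days value yet, so ggplot drops them with a warning):

ggplot(Training, aes(x = past_due_days, fill = loan_status)) +
  geom_histogram(binwidth = 5, position = "dodge")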
Incroyable!! Look at all that separation, Einstein. Well, despite the profound separation brought about by the inclusion of past_due_days alone, there still seems to be some work to be done in separating the COLLECTION and COLLECTION_PAIDOFF classes.
Question 3.1 - Does tweaking the variable 'paid_off_time' help me pinpoint the defaulters?
By tweaking, I mean [feature engineering](https://en.wikipedia.org/wiki/Feature_engineering), which will be essential for the last part of our puzzle (for the inquisitive souls - ). So, I am going to introduce another variable called paid_off, which simply assigns a client '1' if the loan has been paid and '0' otherwise, as sketched below.
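A minimal sketch of that feature engineering (my reading of it; treating a missing past_due_days as 0 is an assumption so that 'never went past due' reads as zero days):

# clients with no paid_off_time never paid, so paid_off = 0; everyone else gets 1
Training$paid_off <- ifelse(is.na(Training$paid_off_time), 0, 1)

# treat a missing past_due_days as 0 days past due
Training$past_due_days[is.na(Training$past_due_days)] <- 0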
Voila! There you have it, peeps. If a given client has a paid_off value of 1 and past_due_days of 0, he has paid off his loan on time. If past_due_days is any value greater than 0, he has still paid off his loan, but only after it went into collection. Alternatively, if the client's paid_off value is 0, he has defaulted.
Enough of all that serious talk, let’s have some fun with the dataset.
Question 4 - Does age play a role in determining how much the borrowed principal amount will be?
Nope, no particular trend here; people of all ages borrowed at different principals.
Question 5 - How about gender, does it play a role in determining how much the borrowed principal amount will be?
Seems like women primarily borrowed loans at principal values greater than 800. But this could simply be due to the ratio of men to women clients in the dataset. Possibly with a larger dataset, the pattern would disappear.
Question 6 - How about education, can the level of education help determine the principal amount?
Not really; also, since we have an inadequate number of clients with an education level of 'Master or Above', we really can't say much about that group.
Question 7 - What is the relationship between age and education level?
We possibly aren't dealing with a highly educated population - or the sample is simply too small to tell.
Question 8 - Are women more educated ?
On average the data suggests women are the more educated gender.
Question 9 - Does age group, gender or education have anything to do with time of payment?
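By 'time of payment' I mean the hour pulled out of paid_off_time with lubridate; a sketch of the kind of chunk that produces the figure (and the warning) below - the pay_hour name and the exact facetting are my own:

Training$pay_hour <- hour(Training$paid_off_time)

ggplot(Training, aes(x = pay_hour, fill = Gender)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ age_r)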
## Warning: Removed 100 rows containing non-finite values (stat_bin).
It seems like quite a few of the clients aged between 25 and 44 made payments in the early morning hours. Interesting - I would have imagined this from the younger bunch; I was the one awake at ungodly hours during my bachelor's.
And now for the moment that y'all have been waiting for… *chirping of crickets*… "MODELLING!!!"… *chirping of crickets*… This is embarrassing :| Looks like I am in this boat all by myself, et vous, brutes (forgive my poor French). Anywho, as mentioned earlier, the loan status of a client can be predicted rather easily with the variables 'past_due_days' and 'paid_off_time'. Before I go on with the obvious, let me try and realise my best possible prediction of loan status excluding the obvious.
For this modelling task, I thought to myself, why not model with one of the most basic classification algorithms, the decision tree - after all, this dataset is a rather classic example of a classification problem. (For more info: )
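Here is a sketch of the setup, assuming the rpart and caret packages and a 60/40 train/test split - the confusion matrices below cover 200 held-out clients, so that is my guess at the split; the seed and object names are mine:

library(rpart)
library(caret)

set.seed(123)
idx <- createDataPartition(Training$loan_status, p = 0.6, list = FALSE)
train_set <- Training[idx, ]
test_set <- Training[-idx, ]

# decision tree on everything except the two giveaway variables
fit <- rpart(loan_status ~ Principal + terms + education + Gender + age_r,
             data = train_set, method = "class")
pred <- predict(fit, test_set, type = "class")
confusionMatrix(pred, test_set$loan_status)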
## Reference
## Prediction COLLECTION COLLECTION_PAIDOFF PAIDOFF
## COLLECTION 0 0 0
## COLLECTION_PAIDOFF 0 0 0
## PAIDOFF 40 40 120
## Accuracy
## 0.6
As you can see, the variables Principal, terms, education, Gender and age_r didn't really help with predicting the loan status - the algorithm simply doesn't have any useful features to build branches and leaves with. So, by default, it predicts that clients have paid off their loans without ever moving into collection, as clients who have paid off are in the majority. The absence of an rpart tree diagram can also be attributed to this default prediction pattern.
Let’s give it another go with only Education:
## Reference
## Prediction COLLECTION COLLECTION_PAIDOFF PAIDOFF
## COLLECTION 0 0 0
## COLLECTION_PAIDOFF 0 0 0
## PAIDOFF 40 40 120
## Accuracy
## 0.6
Nothing changed. So, basically, with no useful variables to extract information from, the prediction is by default going to be made in favour of the majority loan status within the training set - 'PAIDOFF'.
To show y'all the predictive power of just one relevant variable, I'll first run the classification with only 'past_due_days'. Also, I'll throw in a fancy rtree plot (pun intended) to make things more visual.
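Something along these lines, with rpart.plot being my assumption for drawing the tree diagram:

library(rpart.plot)

fit_due <- rpart(loan_status ~ past_due_days, data = train_set, method = "class")
rpart.plot(fit_due)
confusionMatrix(predict(fit_due, test_set, type = "class"), test_set$loan_status)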
## Reference
## Prediction COLLECTION COLLECTION_PAIDOFF PAIDOFF
## COLLECTION 39 3 0
## COLLECTION_PAIDOFF 1 37 0
## PAIDOFF 0 0 120
Voila!! 98% accuracy! Of course, the past_due_days variable alone cannot get you to 100% accuracy. For this, I'll be needing the feature-engineered paid_off variable.
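The final model simply adds paid_off to the mix (same sketch, same split as before):

fit_final <- rpart(loan_status ~ past_due_days + paid_off,
                   data = train_set, method = "class")
confusionMatrix(predict(fit_final, test_set, type = "class"), test_set$loan_status)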
## Reference
## Prediction COLLECTION COLLECTION_PAIDOFF PAIDOFF
## COLLECTION 40 0 0
## COLLECTION_PAIDOFF 0 40 0
## PAIDOFF 0 0 120
## Accuracy
## 1
And there you go, 100% accuracy.
There you have it, peeps. I hope y'all enjoyed my informal way of narration through this kernel. As you can see, I am a novice looking to learn and grow within the world of data science. 'Learning how to crawl before I walk' is my attempt at recording my progress through the journey that I have embarked on. With enough support and positive feedback, I seek to take y'all with me on what I hope will be an immense learning experience. Until next time…