Introduction
This analysis uses loan data from https://www.kaggle.com/itssuru/loan-data, which includes variables such as FICO score, interest rate, installment amount, loan purpose, and several more.
It applies a variety of exploratory data analysis methods to understand how the FICO score relates to the other factors in the data set.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
df <- read.csv("loan_data.csv")
The mean of the FICO score can be calculated to see the average score in our data set.
mean(df$fico)
## [1] 710.8463
The standard deviation will be calculated to see how spread out the scores are.
sd(df$fico)
## [1] 37.97054
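As a quick follow-up check (not part of the original write-up), we can ask what share of borrowers fall within one standard deviation of the mean; under a roughly normal distribution this would be about 68%.
# Share of FICO scores within one standard deviation of the mean
mean(abs(df$fico - mean(df$fico)) <= sd(df$fico))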
The five-number summary (plus the mean) will be calculated below to quickly explore our data set.
summary(df$fico)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 612.0 682.0 707.0 710.8 737.0 827.0
Now the data will be given a graphical representation that organizes the data points into user-specified ranges. A histogram condenses the data so it can be interpreted visually with convenience.
hist(df$fico, breaks=10, main="Histogram of FICO score", xlab="FICO score", xlim=c(600,860), col="purple", probability = TRUE)
lines(density(df$fico), col = "Black")
Although it may seem primitive, the box plot below gives a good indication of how spread out the values are, without using as much space as the histogram.
boxplot(df$fico, main ="Box Plot Representing FICO scores", xlab = "Quartiles", ylab = "FICO scores")
The Q-Q plot below provides insight into the distribution of the FICO scores of more than 9,000 individuals at a glance, as illustrated below.
qqnorm(df$fico)
qqline(df$fico, col = "purple")
Yes, there are outliers that do not fall within the overall distribution, which might affect the accuracy slightly. The distribution is roughly symmetric over the range of about 650 to 800, but it also shows slight positive skewness.
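To make the outlier claim concrete, here is a minimal sketch (not in the original analysis) using the conventional 1.5 × IQR fences, the same rule boxplot() applies by default.
# Count FICO scores outside the 1.5 * IQR fences
q <- quantile(df$fico, c(0.25, 0.75))
fence <- 1.5 * IQR(df$fico)
sum(df$fico < q[1] - fence | df$fico > q[2] + fence)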
plot(df$fico, df$int.rate)
cor(df$fico, df$int.rate)
## [1] -0.7148208
The negative coefficient indicates a negative relationship between these two variables, meaning they move in opposite directions: the higher the credit score, the lower the interest rate.
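To put a number on that slope, a simple linear model can be fitted; this is an illustrative sketch (the name fit is mine, not from the original analysis).
# Simple linear regression of interest rate on FICO score
fit <- lm(int.rate ~ fico, data = df)
coef(fit)  # the fico coefficient should be negative, matching the correlation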
plot(df$int.rate, df$installment)
cor(df$int.rate, df$installment)
## [1] 0.2761402
Here, the positive coefficient indicates a positive relationship between the two variables: bigger installment amounts correspond to higher interest rates.
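An optional ggplot version of the same relationship, as a sketch: geom_smooth(method = "lm") overlays a least-squares trend line with a confidence band.
# Scatter plot of installment vs. interest rate with a linear trend line
ggplot(df, aes(x = int.rate, y = installment)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm")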
table(df$purpose)
##
## all_other credit_card debt_consolidation educational
## 2331 1262 3957 343
## home_improvement major_purchase small_business
## 629 437 619
The frequency table above shows the number of loans provided for various purposes: credit card, debt consolidation, educational, home improvement, major purchase, small business, and all others.
table(df$purpose)/length(df$purpose)
##
## all_other credit_card debt_consolidation educational
## 0.24337022 0.13176028 0.41313427 0.03581123
## home_improvement major_purchase small_business
## 0.06567133 0.04562539 0.06462727
The table above shows the relative frequency of each purpose; debt consolidation alone accounts for about 41% of all loans.
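The same relative frequencies can be computed with prop.table(), which is the more idiomatic base-R form.
# Relative frequency of each loan purpose
prop.table(table(df$purpose))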
two_way_table <- table(df$purpose, df$not.fully.paid)
two_way_table
##
## 0 1
## all_other 1944 387
## credit_card 1116 146
## debt_consolidation 3354 603
## educational 274 69
## home_improvement 522 107
## major_purchase 388 49
## small_business 447 172
Here, 1 means the loan was not fully paid, whereas 0 means it was fully paid. Most loans were given out for debt consolidation, which also had the largest number of unpaid loans. At the same time, it had the largest number of fully paid loans, making it the best revenue-generating source for the lenders. Very few educational loans were given out.
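Because the purposes differ greatly in volume, row proportions are more informative than raw counts; a sketch using prop.table(). From the counts above, for example, small_business has the highest unpaid share (172/619, roughly 28%).
# Share of fully paid (0) vs. not fully paid (1) loans within each purpose
prop.table(two_way_table, margin = 1)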
boxplot(df$fico ~ df$purpose, col = "orange", main = "Distribution of FICO score with loan purposes", xlab = "Purpose of loans", ylab = "FICO Score")
The side-by-side box plot above shows FICO scores on the y-axis and the purpose of the loan on the x-axis. It shows clearly that, on average, most loans were granted to borrowers with FICO scores between roughly 650 and 750.
barplot(df$int.rate, main = "Bar Chart", xlab ="Interest Rate")
The bar plot above gives a rough picture of the interest rates, a number of which reach above 0.15; note that the rates are stored as decimals, so 0.15 corresponds to 15%. I was fascinated to see how much importance the FICO score carries: it can significantly lower the interest rate. I realized that having a good FICO score allows people to live in a financially sound way.
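Since int.rate is continuous, a histogram is arguably a clearer choice than a bar for every raw value; a minimal alternative sketch:
# Histogram of interest rates (rates are stored as decimals, e.g. 0.15 = 15%)
hist(df$int.rate, main = "Histogram of Interest Rates",
     xlab = "Interest rate", col = "orange")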
Decision Trees
Decision tree analysis is the process of drawing a decision tree, a graphic representation of the alternative solutions available to solve a given problem, in order to determine the most effective course of action. Decision trees consist of nodes and branches: nodes represent a test on an attribute, and branches represent the potential alternative outcomes.
library(rpart)
library(rpart.plot)
# Regression tree for the numeric interest rate
tree <- rpart(int.rate ~ ., data = df)
# First classification attempt on the still-numeric rate (overwritten below)
tree1 <- rpart(int.rate ~ ., data = df, method = 'class')
library(dplyr)
# Recode int.rate as a factor; note that int.rate == TRUE is FALSE for every
# numeric rate, so every row receives the single auto-numbered label 'high,low2'
df <- df %>%
  mutate(int.rate = factor(int.rate == TRUE, levels = c(TRUE, FALSE),
                           labels = c('high,low')))
tree1 <- rpart(int.rate ~ ., data = df, method = 'class')
tree1
## n= 9578
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 9578 0 high,low2 (0.0000000 1.0000000) *
rpart.plot(tree)
The regression tree above shows that the FICO score, together with the installment amount, drives the predicted interest rate; the split at fico >= 710 is visible at a glance and helps us understand our data set. Reading the plot, 27% of people with a FICO score of 735 or higher had an installment amount below $324, while 31% with a FICO score between 680 and 710 had an installment of about $511 a month.
pred <- predict(tree1,df,type="class")
head(pred)
## [1] high,low2 high,low2 high,low2 high,low2 high,low2 high,low2
## Levels: high,low1 high,low2
We can see that every observation is predicted as high,low2. This is not a FICO effect: the recoding int.rate == TRUE is FALSE for every numeric rate, so the classification tree collapses to a single class.
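What the recoding above appears to intend is a genuine high/low split. A hypothetical fix (the names df2 and tree2 are mine, and the data must be re-read because int.rate has already been overwritten) would split at the median rate:
# Re-read the data and dichotomize int.rate at its median before refitting
df2 <- read.csv("loan_data.csv")
df2 <- df2 %>%
  mutate(int.rate = factor(int.rate > median(int.rate),
                           levels = c(TRUE, FALSE),
                           labels = c("high", "low")))
tree2 <- rpart(int.rate ~ ., data = df2, method = "class")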
predict(tree,df) %>%
head()
## 1 2 3 4 5 6
## 0.1198941 0.1278030 0.1278030 0.1071067 0.1451888 0.1071067
These are the regression tree's predicted interest rates for the first six observations, showing how the tree divides the data.
Confusion Table
confusion_table <- with(df, table(int.rate, pred))
confusion_table
## pred
## int.rate high,low1 high,low2
## high,low1 0 0
## high,low2 0 9578
The confusion table confirms that all 9578 observations in our dataset fall into the single class high,low2.
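For a non-degenerate model, overall accuracy can be read off the confusion table; a sketch (here it is trivially 1, since only one class is present):
# Overall accuracy: correct predictions on the diagonal over all predictions
sum(diag(confusion_table)) / sum(confusion_table)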
Now we will slice the data into training and test sets.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
inTrain <- createDataPartition(y = df$int.rate, p = .66, list = FALSE)
## Warning in createDataPartition(y = df$int.rate, p = 0.66, list = FALSE): Some
## classes have no records ( high,low1 ) and these will be ignored
df_train <- df %>% slice(inTrain)
df_test <- df %>% slice(-inTrain)
dim(df_train)
## [1] 6322 14
dim(df_test)
## [1] 3256 14
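With the split in place, the usual next step is to fit on the training slice and score the held-out slice. A sketch (the names tree_train and pred_test are mine; with the degenerate int.rate factor the resulting accuracy is not meaningful here, but the pattern is the standard one):
# Fit on the training set, predict on the test set, and compute accuracy
tree_train <- rpart(int.rate ~ ., data = df_train, method = "class")
pred_test <- predict(tree_train, df_test, type = "class")
mean(pred_test == df_test$int.rate)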
Here we observe how important each variable in the data set is for the decisions the tree makes.
imp <- varImp(tree)
head(imp)
## Overall
## credit.policy 0.08648839
## days.with.cr.line 0.03684838
## dti 0.11105553
## fico 1.33329254
## inq.last.6mths 0.20246152
## installment 0.94392436
imp %>% ggplot(aes(x = row.names(imp), weight = Overall)) +
geom_bar()
Here, the FICO score is the most important variable in our data set for making a prediction, followed by the installment amount, as illustrated in the bar plot above.
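A sketch of the same plot with the bars sorted by importance, which reads more easily (rownames_to_column() comes from tibble, already loaded with the tidyverse):
# Variable importance, sorted and flipped for readability
imp %>%
  rownames_to_column("variable") %>%
  ggplot(aes(x = reorder(variable, Overall), weight = Overall)) +
  geom_bar() +
  coord_flip() +
  labs(x = NULL, y = "Importance")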
tree1 <- rpart(purpose ~ int.rate + fico, data = df, method = "class")
tree1
## n= 9578
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 9578 5621 debt_consolidation (0.24 0.13 0.41 0.036 0.066 0.046 0.065)
## 2) fico< 749.5 7908 4391 debt_consolidation (0.23 0.13 0.44 0.036 0.054 0.039 0.06) *
## 3) fico>=749.5 1670 1165 all_other (0.3 0.12 0.26 0.034 0.12 0.077 0.085) *
The tree above shows that people with FICO scores below about 750 mostly had debt consolidation loans, whereas those at 750 or above mostly had other types of loans.
rpart.plot(tree1, extra = 2)
In the plot above, the single split is on the FICO score: below a score of about 750, debt consolidation is the most common purpose, while all_other dominates at 750 and above.
tail(df)
## credit.policy purpose int.rate installment log.annual.inc
## 9573 0 debt_consolidation high,low2 69.98 10.11047
## 9574 0 all_other high,low2 344.76 12.18075
## 9575 0 all_other high,low2 257.70 11.14186
## 9576 0 debt_consolidation high,low2 97.81 10.59663
## 9577 0 home_improvement high,low2 351.58 10.81978
## 9578 0 debt_consolidation high,low2 853.43 11.26446
## dti fico days.with.cr.line revol.bal revol.util inq.last.6mths
## 9573 7.02 662 8190.042 2999 39.5 6
## 9574 10.39 672 10474.000 215372 82.1 2
## 9575 0.21 722 4380.000 184 1.1 5
## 9576 13.09 687 3450.042 10036 82.9 8
## 9577 19.18 692 1800.000 0 3.2 5
## 9578 16.28 732 4740.000 37879 57.0 6
## delinq.2yrs pub.rec not.fully.paid
## 9573 0 0 1
## 9574 0 0 1
## 9575 0 0 1
## 9576 0 0 1
## 9577 0 0 1
## 9578 0 0 1
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
We loaded the libraries needed to pursue the association rule analysis.
transactions(df)
## Warning: Column(s) 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 not logical or
## factor. Applying default discretization (see '? discretizeDF').
## Warning in discretize(x = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : The calculated breaks are: 0, 1, 1, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 2L, : The calculated breaks are: 0, 0, 2, 33
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 13
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, : The calculated breaks are: 0, 0, 0, 5
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## transactions in sparse format with
## 9578 transactions (rows) and
## 36 items (columns)
The conversion produced 9578 transactions (rows) and 36 items (columns).
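Rather than relying on the default discretization, the binning can be controlled explicitly; a sketch using arules' discretizeDF() with three equal-frequency bins per numeric column (the name df_disc is mine):
# Discretize all numeric columns into three equal-frequency intervals
df_disc <- discretizeDF(df, default = list(method = "frequency", breaks = 3))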
colnames(df)[c(2,4,5,6,7)]
## [1] "purpose" "installment" "log.annual.inc" "dti"
## [5] "fico"
Some of the column names in our data set (df) are shown above.
df <- df %>% mutate(
  installment = (installment > 0),
  log.annual.inc = (log.annual.inc > 0)
)
as(df,"transactions")
## Warning: Column(s) 1, 2, 6, 7, 8, 9, 10, 11, 12, 13, 14 not logical or factor.
## Applying default discretization (see '? discretizeDF').
## Warning in discretize(x = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : The calculated breaks are: 0, 1, 1, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 2L, : The calculated breaks are: 0, 0, 2, 33
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 13
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, : The calculated breaks are: 0, 0, 0, 5
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## transactions in sparse format with
## 9578 transactions (rows) and
## 32 items (columns)
After recoding installment and log.annual.inc as logicals, the conversion yields 9578 transactions and 32 items.
trans <- transactions(df)
## Warning: Column(s) 1, 2, 6, 7, 8, 9, 10, 11, 12, 13, 14 not logical or factor.
## Applying default discretization (see '? discretizeDF').
## Warning in discretize(x = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : The calculated breaks are: 0, 1, 1, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 2L, : The calculated breaks are: 0, 0, 2, 33
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 13
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, : The calculated breaks are: 0, 0, 0, 5
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
## Warning in discretize(x = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, : The calculated breaks are: 0, 0, 0, 1
## Only unique breaks are used reducing the number of intervals. Look at ? discretize for details.
summary(trans)
## transactions as itemMatrix in sparse format with
## 9578 rows (elements/itemsets/transactions) and
## 32 columns (items) and a density of 0.4375
##
## most frequent items:
## credit.policy=[0,1] int.rate=high,low2 installment log.annual.inc
## 9578 9578 9578 9578
## delinq.2yrs=[0,13] (Other)
## 9578 86202
##
## element (itemset/transaction) length distribution:
## sizes
## 14
## 9578
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14 14 14 14 14 14
##
## includes extended item information - examples:
## labels variables levels
## 1 credit.policy=[0,1] credit.policy [0,1]
## 2 purpose=all_other purpose all_other
## 3 purpose=credit_card purpose credit_card
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
The summary above captures the essential structure of the transactions: the most frequent items and the itemset length distribution (every transaction contains exactly 14 items, so the minimum, median, and maximum all equal 14).
colnames(trans)
## [1] "credit.policy=[0,1]"
## [2] "purpose=all_other"
## [3] "purpose=credit_card"
## [4] "purpose=debt_consolidation"
## [5] "purpose=educational"
## [6] "purpose=home_improvement"
## [7] "purpose=major_purchase"
## [8] "purpose=small_business"
## [9] "int.rate=high,low1"
## [10] "int.rate=high,low2"
## [11] "installment"
## [12] "log.annual.inc"
## [13] "dti=[0,9.13)"
## [14] "dti=[9.13,16.1)"
## [15] "dti=[16.1,30]"
## [16] "fico=[612,687)"
## [17] "fico=[687,727)"
## [18] "fico=[727,827]"
## [19] "days.with.cr.line=[179,3.3e+03)"
## [20] "days.with.cr.line=[3.3e+03,5.16e+03)"
## [21] "days.with.cr.line=[5.16e+03,1.76e+04]"
## [22] "revol.bal=[0,4.74e+03)"
## [23] "revol.bal=[4.74e+03,1.42e+04)"
## [24] "revol.bal=[1.42e+04,1.21e+06]"
## [25] "revol.util=[0,30.8)"
## [26] "revol.util=[30.8,62.2)"
## [27] "revol.util=[62.2,119]"
## [28] "inq.last.6mths=[0,2)"
## [29] "inq.last.6mths=[2,33]"
## [30] "delinq.2yrs=[0,13]"
## [31] "pub.rec=[0,5]"
## [32] "not.fully.paid=[0,1]"
All of the item labels are listed above, giving a better sense of the categories we are dealing with without having to scroll through the original data set file.
inspect(trans[1:3])
## items transactionID
## [1] {credit.policy=[0,1],
## purpose=debt_consolidation,
## int.rate=high,low2,
## installment,
## log.annual.inc,
## dti=[16.1,30],
## fico=[727,827],
## days.with.cr.line=[5.16e+03,1.76e+04],
## revol.bal=[1.42e+04,1.21e+06],
## revol.util=[30.8,62.2),
## inq.last.6mths=[0,2),
## delinq.2yrs=[0,13],
## pub.rec=[0,5],
## not.fully.paid=[0,1]} 1
## [2] {credit.policy=[0,1],
## purpose=credit_card,
## int.rate=high,low2,
## installment,
## log.annual.inc,
## dti=[9.13,16.1),
## fico=[687,727),
## days.with.cr.line=[179,3.3e+03),
## revol.bal=[1.42e+04,1.21e+06],
## revol.util=[62.2,119],
## inq.last.6mths=[0,2),
## delinq.2yrs=[0,13],
## pub.rec=[0,5],
## not.fully.paid=[0,1]} 2
## [3] {credit.policy=[0,1],
## purpose=debt_consolidation,
## int.rate=high,low2,
## installment,
## log.annual.inc,
## dti=[9.13,16.1),
## fico=[612,687),
## days.with.cr.line=[3.3e+03,5.16e+03),
## revol.bal=[0,4.74e+03),
## revol.util=[0,30.8),
## inq.last.6mths=[0,2),
## delinq.2yrs=[0,13],
## pub.rec=[0,5],
## not.fully.paid=[0,1]} 3
Inspecting the first three transactions above shows exactly which items each one contains.
image(trans)
The image plots items (columns) against transactions (rows), giving a quick visual check of the sparse matrix.
itemFrequencyPlot(trans,topN = 20)
The plot above shows the relative frequency (support) of the 20 most frequent items in the data set.
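The same information is available numerically; a sketch listing the ten items with the highest support:
# Support of the ten most frequent items
head(sort(itemFrequency(trans), decreasing = TRUE), 10)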
vertical <- as(trans, "tidLists")
as(vertical, "matrix")[1:10, 1:5]
## 1 2 3 4 5
## credit.policy=[0,1] TRUE TRUE TRUE TRUE TRUE
## purpose=all_other FALSE FALSE FALSE FALSE FALSE
## purpose=credit_card FALSE TRUE FALSE FALSE TRUE
## purpose=debt_consolidation TRUE FALSE TRUE TRUE FALSE
## purpose=educational FALSE FALSE FALSE FALSE FALSE
## purpose=home_improvement FALSE FALSE FALSE FALSE FALSE
## purpose=major_purchase FALSE FALSE FALSE FALSE FALSE
## purpose=small_business FALSE FALSE FALSE FALSE FALSE
## int.rate=high,low1 FALSE FALSE FALSE FALSE FALSE
## int.rate=high,low2 TRUE TRUE TRUE TRUE TRUE
The matrix above is the vertical (transaction-ID list) representation: for the first ten items and the first five transactions, it shows which items occur in which transactions.
trans
## transactions in sparse format with
## 9578 transactions (rows) and
## 32 items (columns)
its <- apriori(trans, parameter=list(target = "frequent"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 957
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[31 item(s), 9578 transaction(s)] done [0.01s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans, parameter = list(target = "frequent")): Mining stopped
## (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.07s].
## sorting transactions ... done [0.00s].
## writing ... [18943 set(s)] done [0.00s].
## creating S4 object ... done [0.00s].
its
## set of 18943 itemsets
inspect(head(its, n = 10))
## items support transIdenticalToItemsets count
## [1] {purpose=credit_card} 0.1317603 0 1262
## [2] {purpose=all_other} 0.2433702 0 2331
## [3] {fico=[612,687)} 0.2817916 0 2699
## [4] {days.with.cr.line=[179,3.3e+03)} 0.3319065 0 3179
## [5] {revol.util=[0,30.8)} 0.3330549 0 3190
## [6] {dti=[0,9.13)} 0.3331593 0 3191
## [7] {dti=[9.13,16.1)} 0.3332637 0 3192
## [8] {revol.bal=[0,4.74e+03)} 0.3332637 0 3192
## [9] {revol.bal=[4.74e+03,1.42e+04)} 0.3333681 0 3193
## [10] {revol.bal=[1.42e+04,1.21e+06]} 0.3333681 0 3193
Above we see the support for the first ten frequent itemsets.
ggplot(tibble(`Itemset Size` = factor(size(its))), aes(`Itemset Size`)) + geom_bar()
The distribution of itemset sizes is roughly bell-shaped, making the typical itemset size easy to see.
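To see the strongest patterns directly, the itemsets can be sorted by support before inspection; a sketch:
# The five most frequent itemsets
inspect(head(sort(its, by = "support"), n = 5))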
rules <- apriori(trans, parameter = list(support = 0.05, confidence = 0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 478
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[31 item(s), 9578 transaction(s)] done [0.01s].
## sorting and recoding items ... [29 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(trans, parameter = list(support = 0.05, confidence = 0.9)):
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.15s].
## writing ... [190010 rule(s)] done [0.02s].
## creating S4 object ... done [0.04s].
inspect(head(rules))
## lhs rhs support confidence coverage lift count
## [1] {} => {credit.policy=[0,1]} 1 1 1 1 9578
## [2] {} => {not.fully.paid=[0,1]} 1 1 1 1 9578
## [3] {} => {pub.rec=[0,5]} 1 1 1 1 9578
## [4] {} => {delinq.2yrs=[0,13]} 1 1 1 1 9578
## [5] {} => {log.annual.inc} 1 1 1 1 9578
## [6] {} => {installment} 1 1 1 1 9578
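The first rules listed are trivial (they have empty left-hand sides). Sorting by lift surfaces the more interesting associations; a sketch:
# The five rules with the highest lift
inspect(head(sort(rules, by = "lift"), n = 5))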
plot(rules,jitter = 1)
The scatter plot above positions each rule by its support and confidence.
plot(rules, shading = "order")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Here the rules are plotted again, this time shaded by rule order (the number of items in a rule).
plot(head(rules, n = 100), method = "graph")
We conclude the project with a graph-based plot of the first 100 rules of our data set.