For this assessment, i will be using R language to go over the solution.
Reason for this is i don’t have a windows laptop at the moment (my old workstation crashed early last month), and considering this has to be done in two days - i decided to solve this assessment in R with my current workstation(mac).
Even though i know it was mentioned in the assessment that i can use any tool of my choice, i’d have loved to solve this with PowerBI (since that’s the preferred tool for the Ren team).
#load in app-previous datat
app_prev <- readxl::read_excel("/Users/kazeemolalekan/Downloads/app_previous_application(3).xlsx",1)
#checking for missing values distribution before starting task
aggr(app_prev, prop = F, numbers = T, cex.axis = 0.5)
#checking for missing values distribution
matrixplot(app_prev, sortby = 4, interactive = F)
#red shades are missing values
+ What is the mean number of received applications per contract type and channel type ?
#load in app-previous datat
app_prev <- readxl::read_excel("/Users/kazeemolalekan/Downloads/app_previous_application(3).xlsx",1)
#change character variable to factors
app_prev[sapply(app_prev, is.character)] <- lapply(app_prev[sapply(app_prev,is.character)], as.factor)
#change numeric variables to integers variable
app_prev[sapply(app_prev, is.double)] <- lapply(app_prev[sapply(app_prev,is.double)], as.integer)
Let’s start with application for contract type. Below is a breakdown of applications received based on contract type.
#calculate count and percentage share
app_per_contract <- app_prev %>%
group_by(NAME_CONTRACT_TYPE) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations))
app_per_share <- app_per_contract %>%
mutate(Percent_share = (Observations/dim(app_prev)[1])*100)
#adding the % symbol
app_per_share$Percent_share <- paste0(round(app_per_share$Percent_share,2),"%",sep = " ")
#table
kable(app_per_share, align = "ccc",format.args = list(big.mark = ","),
caption = "Contract-type table:") %>%
kable_material_dark()
NAME_CONTRACT_TYPE | Observations | Percent_share |
---|---|---|
Consumer loans | 27,257 | 46.54% |
Cash loans | 24,621 | 42.04% |
Revolving loans | 6,681 | 11.41% |
XNA | 9 | 0.02% |
Here we can see that consumer loans are the most frequent contract type. Also from total applications received:
We can view this co-horts breakdown using a bar chart.
Below we can see a breakdown of application received based on channel-type.
#calculate count and percentage share
app_per_channel <- app_prev %>%
group_by(CHANNEL_TYPE) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations)) %>%
mutate(Percent_share = (Observations/dim(app_prev)[1])*100)
#adding the % symbol
app_per_channel$Percent_share <- paste0(round(app_per_channel$Percent_share,2),"%",sep = " ")
#table
kable(app_per_channel, align = 'ccc',format.args = list(big.mark = ","),
caption = "Channel-type table:") %>%
kable_material_dark()
CHANNEL_TYPE | Observations | Percent_share |
---|---|---|
Credit and cash offices | 25,045 | 42.76% |
Country-wide | 18,459 | 31.52% |
Stone | 7,849 | 13.4% |
Regional / Local | 4,156 | 7.1% |
AP+ (Cash loan) | 1,930 | 3.3% |
Contact center | 800 | 1.37% |
Channel of corporate sales | 309 | 0.53% |
Car dealer | 20 | 0.03% |
Here, we can see that Credit and cash offices channel have the biggest share of total application received (with 25,045 applications falling under that band). Country-wide channel also make up a big chunk of applications received (18,459 applications). Car dealer channel are the least used channel for loan applications.
We can also view this breakdown using a barplot.
app_prev_channel <- app_prev %>%
group_by(CHANNEL_TYPE) %>%
summarise(freq_c = n()) %>%
arrange(desc(freq_c))
#rearrange factors
app_prev_channel$CHANNEL_TYPE <- factor(app_prev_channel$CHANNEL_TYPE,
levels = names(sort(table(app_prev$CHANNEL_TYPE), decreasing = F)))
ggplot(app_prev_channel, aes(CHANNEL_TYPE, freq_c)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(freq_c,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(), expand = c(0,0),
limits = c(0, 27000)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "Credit and cash offices are the most used channel type",
subtitle = "Applications received across all Channel-type ")
+ Generate the most common reason for loan application from highest to lowest ?
The most commonly filled reason for loan application is XAP and XNA and both looks like NA entries or “Not Available” entries. Asides these two entries, Repairs , Other and Urgent needs are the most common reason for loan application.
We can see the breakdown (from highest to lowest) in the table below.
#calculate count and percentage share
app_per_channel <- app_prev %>%
group_by(NAME_CASH_LOAN_PURPOSE) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations)) %>%
mutate(Percent_share = (Observations/dim(app_prev)[1])*100)
#adding the % symbol
app_per_channel$Percent_share <- paste0(round(app_per_channel$Percent_share,3),"%",sep = " ")
#table
kable(app_per_channel, align = 'ccc',format.args = list(big.mark = ","), caption = "Reasons for loan applications:") %>%
kable_material_dark()
NAME_CASH_LOAN_PURPOSE | Observations | Percent_share |
---|---|---|
XAP | 33,947 | 57.962% |
XNA | 22,245 | 37.981% |
Repairs | 783 | 1.337% |
Other | 507 | 0.866% |
Urgent needs | 301 | 0.514% |
Building a house or an annex | 105 | 0.179% |
Buying a used car | 96 | 0.164% |
Medicine | 86 | 0.147% |
Payments on other loans | 70 | 0.12% |
Everyday expenses | 64 | 0.109% |
Education | 58 | 0.099% |
Journey | 47 | 0.08% |
Buying a new car | 37 | 0.063% |
Furniture | 36 | 0.061% |
Buying a home | 34 | 0.058% |
Car repairs | 33 | 0.056% |
Purchase of electronic equipment | 31 | 0.053% |
Wedding / gift / holiday | 29 | 0.05% |
Buying a holiday home / land | 26 | 0.044% |
Business development | 14 | 0.024% |
Gasification / water supply | 11 | 0.019% |
Buying a garage | 4 | 0.007% |
Hobby | 3 | 0.005% |
Refusal to name the goal | 1 | 0.002% |
Once again, we can view the breakdown with a barplot.
app_prev_channel <- app_prev %>%
group_by(NAME_CASH_LOAN_PURPOSE) %>%
summarise(count_n= n())
#rearrange factors
app_prev_channel$NAME_CASH_LOAN_PURPOSE <- factor(app_prev_channel$NAME_CASH_LOAN_PURPOSE,
levels = names(sort(table(app_prev$NAME_CASH_LOAN_PURPOSE),
decreasing = F)))
ggplot(app_prev_channel, aes(NAME_CASH_LOAN_PURPOSE, count_n)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(count_n,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 39000)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "XAP and XNA are the most filled entries",
subtitle = "XAP and XNA are most likely a generated filled reasons")
One caveat here - Our barplot is hard to read because XAP and XNA make up a significant chunk of Cash loan purposes. I have tried to make it more readable by adding the number of observations for each reason in front of each bar but this is still not enough. What we can do is to exclude both of these entries (since they don’t look like an actual filled reasons) and build another bar plot.
This is what we did below
app_prev_channel %>%
filter(NAME_CASH_LOAN_PURPOSE != "XAP" & NAME_CASH_LOAN_PURPOSE != "XNA") %>%
ggplot(aes(NAME_CASH_LOAN_PURPOSE, count_n)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(count_n,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 820)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "XAP and XNA excluded from plot",
subtitle = "Repairs is the most common reason for loan application")
The bar-plot above passes the eye test, and here we can see that most reasons fall below the 200 mark (grid-lines). Which brings us to the final plot, we might be interested in eye-balling reasons that have less than 200 observations. This can be best viewed with a dot-chart.
Below is a dot-chart plot of reasons below the 200 grid-line mark
app_prev_channel %>%
filter(count_n < 200) %>%
ggplot(aes(NAME_CASH_LOAN_PURPOSE, count_n)) +
geom_point(color = "firebrick", size = 3) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 130)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "Loan reasons with less than 200 occurrences",
subtitle = "Some of our clients take loan application as a hobby ")
How many customers got a higher Credit amount than their Application amount, categorize by their contract Types (i:e type of loans) ?
First thing first, let’s see customers that got a higher credit amount than their application amount.
#isolate customers that got higher credit than they requested for
app_prev %>%
mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
group_by(GOT_HIGHER_AMOUNT) %>%
summarise(Observations = n()) -> GOT_HIGHER_AMOUNT
#table
kable(GOT_HIGHER_AMOUNT, align = 'ccc', format.args = list(big.mark = ","), caption = "How many customers got a higher credit amount ?") %>%
kable_material_dark()
GOT_HIGHER_AMOUNT | Observations |
---|---|
No | 35,614 |
Yes | 22,954 |
You should read this as 22,954 of our customers got a higher credit amount than their application amount. Now just because they get it, doesn’t mean the loan was approved. The reason is it wasn’t stated in the question if it should be for approved loans only so for that reason i’ll stick to the basics. However before we move on to the next question, we’ll find out how many of our loans with higher credit amount were approved.
We can view this distribution with a bar plot
ggplot(GOT_HIGHER_AMOUNT, aes(GOT_HIGHER_AMOUNT, Observations)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), vjust = -1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 38000)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.x = element_blank(),
axis.title.y = element_text(size = 13,family = "mono"),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(color = "grey"),
panel.grid.major.x = element_blank(),
plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) +
labs(title = "Customers have a high chance of getting more amount",
subtitle = "A lot of our customers got more than what they requested for")
Now we can categorize customers that got a higher credit amount than they requested for by contract types. We visualize this by a table below…..
app_prev %>%
mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
filter(GOT_HIGHER_AMOUNT == "Yes") %>%
group_by(NAME_CONTRACT_TYPE) %>%
summarise(Observations = n()) %>%
kable(align = 'ccc', format.args = list(big.mark = ",")) %>%
kable_material_dark()
NAME_CONTRACT_TYPE | Observations |
---|---|
Cash loans | 12,705 |
Consumer loans | 8,445 |
Revolving loans | 1,804 |
Let’s break this down (using the table above). from the 22,954 customers that got a higher credit amount: + 12,705 are cash loans; + 8,445 are consumer loans; + 1,804 are revolving loans
Going forward in this section, i will be using mosaic plot instead of barplots - this is because Categorical variables with less than three or two levels don’t do well with barplots (the bars are too big).
One way to visualize this breakdown is to use a mosaic plot. Mosaic plots are sometimes hard to read at first but once you figure out the logic behind mosaic plot, you will get to understand why they are a powerful plot to visualize categorical variables. They are very similar to tree-maps.
We start with a basic mosaic plot.
app_prev %>%
mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
filter(GOT_HIGHER_AMOUNT == "Yes") -> higher_credit
#change from factors to character
higher_credit$NAME_CONTRACT_TYPE <- as.character(higher_credit$NAME_CONTRACT_TYPE)
higher_credit$NAME_CONTRACT_STATUS <- as.character(higher_credit$NAME_CONTRACT_STATUS)
#mosaic plot
mosaic(~GOT_HIGHER_AMOUNT + NAME_CONTRACT_TYPE,
data = higher_credit)
Here we can see three rectangles - with different sizes. It passes the eye test and it’s not hard to figure out that Cash loans have a bigger share compared to consumer loans and revolving loans . This is line with our previous table.
Now a typical question might be, from our customers that got a higher credit amount, how many got approved ? . This is where it might get complicated with a mosaic plot but it’s also the best tool to visualize this kind of question
mosaic(~GOT_HIGHER_AMOUNT + NAME_CONTRACT_TYPE +
NAME_CONTRACT_STATUS,
data = higher_credit, shade = T)
How to read this plot: Most of our customers that got a higher credit amount actually got their loan approved (the rectangles on top). However customers with consumer loans that got a higher credit amount are more likely to get their loan approved than customers with cash loans or revolving loans.
Also the boxes in blue shading are classifications that occur more than expected and red shadings are classifications that occur less often than expected. For instance, customers that have cash loans and also got their loan approved occurs less often than expected under an independence model and customers with cash loans that got higher credit amount occur often than expected (These are statistics parlance).
We can end this section by checking the breakdown of customers that got a higher credit amount, their contract type and how many got approved, cancelled and refused.
higher_c_tab_2 <- xtabs(~ NAME_CONTRACT_TYPE + GOT_HIGHER_AMOUNT + NAME_CONTRACT_STATUS,
data = higher_credit)
addmargins(higher_c_tab_2,1)
## , , NAME_CONTRACT_STATUS = Approved
##
## GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE Yes
## Cash loans 9069
## Consumer loans 7741
## Revolving loans 1383
## Sum 18193
##
## , , NAME_CONTRACT_STATUS = Canceled
##
## GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE Yes
## Cash loans 55
## Consumer loans 1
## Revolving loans 0
## Sum 56
##
## , , NAME_CONTRACT_STATUS = Refused
##
## GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE Yes
## Cash loans 3581
## Consumer loans 703
## Revolving loans 421
## Sum 4705
A way to read this table is - From 22,954 customers that got a higher credit than they applied for:
What is the percentage of customers with bad default history in the bureau records
We load the bureau dataset
#load dataset bureau updated
bureau_up <- readxl::read_excel("/Users/kazeemolalekan/Downloads/bureau updated(3).xlsx",1)
#check number of rows and column
dim(bureau_up)
## [1] 582751 14
We have 582,751 rows and 14 columns
#change character cols to factors
bureau_up[sapply(bureau_up, is.character)] <- lapply(bureau_up[sapply(bureau_up,is.character)], as.factor)
#change numeric cols to integers
bureau_up[sapply(bureau_up, is.double)] <- lapply(bureau_up[sapply(bureau_up,is.double)], as.integer)
Now, let’s see a breakdown of our customers with bad credit history. Once again, we view this using a table
bureau_up_ord <- within(
bureau_up,CREDIT_ACTIVE <- factor(
CREDIT_ACTIVE,levels = names(sort(table(CREDIT_ACTIVE),decreasing = F))))
#derive the proportion
percentage_share <- bureau_up_ord %>%
group_by(CREDIT_ACTIVE) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations)) %>%
mutate(Percent_share = (Observations/dim(bureau_up_ord)[1])*100)
p.share <- percentage_share
p.share$Percent_share <- paste0(round(p.share$Percent_share,3),"%",sep = " ")
kable(p.share, align = 'ccc', format.args = list(big.mark = ",")) %>%
kable_material_dark()
CREDIT_ACTIVE | Observations | Percent_share |
---|---|---|
Closed | 366,513 | 62.894% |
Active | 214,040 | 36.729% |
Sold | 2,191 | 0.376% |
Bad debt | 7 | 0.001% |
I can infer from the plot above that customers with bad credit make up 0.0012% of all the customers in the bureau data record. That is way less than 1%.
As always we can visualize this with a barplots.
#round to 3 decimal places
percentage_share$Percent_share <- round(percentage_share$Percent_share,3)
#bar-plot
ggplot(percentage_share, aes(CREDIT_ACTIVE, Percent_share)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = paste0(Percent_share,"%"), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 70)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "Customers with bad default history make up less than 1% \n of the Bureau records")
Estimate average volume of loan applications per Name of their product Type
app_prev %>%
group_by(NAME_PRODUCT_TYPE) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations)) %>%
mutate(Percent_share = (Observations/dim(app_prev)[1])*100) -> Average_vol_product_type
#add the percentage sign
Average_vol_product_type$Percent_share <- paste0(round(Average_vol_product_type$Percent_share,0),"%",sep = " ")
#kable_table
kable(Average_vol_product_type, align = 'ccc', format.args = list(big.mark = ",")) %>%
kable_material_dark()
NAME_PRODUCT_TYPE | Observations | Percent_share |
---|---|---|
XNA | 38,308 | 65% |
x-sell | 15,003 | 26% |
walk-in | 5,257 | 9% |
From our table we can deduce that XNA are the most filled entry for product type (again XNA is most likely “Not available” but i won’t assume). We can also deduce that x-sell and walk-in make up less than 40% of the total share of filled entries for product type.
We can visualize this with a mosaic plot
app_prev$NAME_PRODUCT_TYPE <- factor(app_prev$NAME_PRODUCT_TYPE,
levels = names(sort(table(app_prev$NAME_PRODUCT_TYPE),decreasing = T)))
mosaic(~NAME_PRODUCT_TYPE, direction = "v", data = app_prev)
We can go further by doing a cross-classification of our product_type with contract type with the mosaic plot.
app_prev$NAME_CONTRACT_TYPE <- factor(app_prev$NAME_CONTRACT_TYPE, ordered = T,
levels = c("Revolving loans","Consumer loans","Cash loans","XNA"))
mosaic(~NAME_PRODUCT_TYPE + NAME_CONTRACT_TYPE, shade = T, direction = "v", data = app_prev)
Here, we can infer that most of x-sell product_type are cash-loans and revolving loans (and by most here, i mean almost all) - Same for walk-in product type, seems as if x-sell and walk-in are hardly consumer loans. Consumers loans are mostly XNA product_type.
We can confirm this with a two-way table:
p_type <- xtabs(~ NAME_CONTRACT_TYPE + NAME_PRODUCT_TYPE,
data = app_prev)
addmargins(p_type,1)
## NAME_PRODUCT_TYPE
## NAME_CONTRACT_TYPE XNA x-sell walk-in
## Revolving loans 1696 3165 1820
## Consumer loans 27257 0 0
## Cash loans 9346 11838 3437
## XNA 9 0 0
## Sum 38308 15003 5257
We can see that, we don’t have any consumer loans that are x-sell or
What is the ratio of performing loans to NPL and what are the evident reasons for higher NPLS
Spent hours trying to figure this out. I wasn’t able to.
Average number of payments per loan by installment amount band category (i.e: 0-50k, 51-100k, 101-200k, 201-300k, 301-400k, 401-500k, 500k above)
Let’s load in the installment dataset and then we check the number of rows and columns.
installment <- readxl::read_excel(
"/Users/kazeemolalekan/Downloads/installments_payments_updated(3).xlsx",1)
dim(installment)
## [1] 575137 8
Here, we can see that we have 575,137 rows and 9 columns.
#create a new column with band category
installment <- within(installment,{
Band <- NA
Band[AMT_INSTALMENT <= 50000] <- "0-50k"
Band[AMT_INSTALMENT > 50000 & AMT_INSTALMENT <= 100000] <- "51-100k"
Band[AMT_INSTALMENT > 100000 & AMT_INSTALMENT <= 200000] <- "101-200k"
Band[AMT_INSTALMENT > 200000 & AMT_INSTALMENT <= 300000] <- "201-300k"
Band[AMT_INSTALMENT > 300000 & AMT_INSTALMENT <= 400000] <- "301-400k"
Band[AMT_INSTALMENT > 400000 & AMT_INSTALMENT <= 500000] <- "401-500k"
Band[AMT_INSTALMENT > 500000] <- "500k above"}
)
#reorder factors for bands
installment$Band <- factor(installment$Band, levels = names(sort(table(installment$Band), decreasing = F)))
To get the average number of payments per loan by band, we can either use the basic table function or use the group_by function. I will do both (as a way to be sure our figures are accurate).
Starting with the table function, let’s view each band distribution and proportion.
band_category <- sort(xtabs(~Band, data = installment), decreasing = T)
addmargins(band_category)
## Band
## 0-50k 51-100k 101-200k 201-300k 500k above 301-400k 401-500k
## 549939 16767 4105 1552 1395 744 635
## Sum
## 575137
round(addmargins(prop.table(band_category)*100),2)
## Band
## 0-50k 51-100k 101-200k 201-300k 500k above 301-400k 401-500k
## 95.62 2.92 0.71 0.27 0.24 0.13 0.11
## Sum
## 100.00
In the table above we can see the number of observations for each band. We can also see 95.62% of our payment-per-loan fall into the 0-50k band (in short 549,939 observations are between the 0-50k band). 401-500k band is the least frequent band (635 of total observations fall into that co-hort - which is less than 1% of our dataset).
We can confirm these figures with the group_by function and the show our breakdown as usual using a table
installment %>%
group_by(Band) %>%
summarise(Observations = n()) %>%
arrange(desc(Observations)) %>%
mutate(Percent_share = (Observations / dim(installment)[1])*100) -> group_by_band
#add the percentage sign
group_by_band$Percent_share <- paste0(round(group_by_band$Percent_share,2),"%",sep = " ")
kable(group_by_band, align = 'ccc', format.args = list(big.mark = ","),
caption = "Payment Per loan by installment Cohort:") %>%
kable_material_dark()
Band | Observations | Percent_share |
---|---|---|
0-50k | 549,939 | 95.62% |
51-100k | 16,767 | 2.92% |
101-200k | 4,105 | 0.71% |
201-300k | 1,552 | 0.27% |
500k above | 1,395 | 0.24% |
301-400k | 744 | 0.13% |
401-500k | 635 | 0.11% |
Let’s end this with a barplot of our bands.
ggplot(group_by_band, aes(Band, Observations)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 590000)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "95% of payment per loan by installment are between 0-50k band")
Our plot is hard to read because the “0-50k” make up a huge proportion of our observations. What we might want to do (as we have done before is to exclude “0-50k” band from our plot).
#exclude 0-50k band from plot
group_by_band %>%
filter(Band != "0-50k") %>%
ggplot(aes(Band, Observations)) +
geom_bar(stat = "identity", fill = "pink", color = "black") +
geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold",
family = "mono"), hjust = -0.1) +
scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 18000)) +
theme_minimal() +
theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
axis.text.x = element_text(size = 11, color = "black", family = "mono"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major = element_line(color = "grey"),
plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
labs(title = "Excluding the 0-50k band")
I will make this short, since i have a deadline to complete this task.
One thing i was interested was the comparison between the groups in Contract status cohort. To be specific, i was interested in how Approved loans, Canceled loans, Refused loans and Unused loans differ in-terms of significance level. For this, i will be using the npar ( r package ) and oneway-test (this is already going into advanced statistical analysis but i will make it short)
Let’s run the function and go over the summary
library(npar)
sig.check <- oneway(AMT_APPLICATION ~ NAME_CONTRACT_STATUS, data = app_prev)
summary(sig.check)
## data: AMT_APPLICATION on NAME_CONTRACT_STATUS
##
## Omnibus Test
## Kruskal-Wallis chi-squared = 22406.4122, df = 3, p-value = < 2.2e-16
##
## Descriptive Statistics
## Canceled Unused offer Approved Refused
## n 10166 987.00 37317.00 10098.0
## median 0 57780.00 90000.00 155812.5
## mad 0 39029.44 79894.35 187712.0
##
## Multiple Comparisons (Wilcoxon Rank Sum Tests)
## Probability Adjustment = holm
## Group.1 Group.2 W p
## 1 Canceled Unused offer 184635.5 0.000000e+00 ***
## 2 Canceled Approved 10513123.0 0.000000e+00 ***
## 3 Canceled Refused 7773898.5 0.000000e+00 ***
## 4 Unused offer Approved 12675351.0 6.333843e-63 ***
## 5 Unused offer Refused 2882133.5 3.049048e-106 ***
## 6 Approved Refused 155149994.0 3.314641e-163 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
One way to know if there’s a significant difference between our contract_status is by eye-balling our p-value. Here we can see that the p-value is less than 0.05(or 5%) which is another way of saying there’s a significance difference between all the factors in contract status (of-course this is not a surprise, we expected this).
From the summary table, we also get a summary statistics. Here we can see that we have 10,166 observations with cancelled loans, and 987 loans are unused offer, 37,317 loans were approved.
Also, the median value of Approved loans is 90,000. Refused loans on the other hand have a median value of 155,812. Inshort refused loans on average are higher than approved loans (by more than 60,000).
Lastly, the smallest variability (mad) are in Unused offer cohort with mad value of 39,029 (Cancelled orders don’t count) - Refused offers have the biggest variability.
We can actually plot this stats with a box plot.
plot(sig.check, col = "pink", main =
"Contract Status Comparison", ylab = "Amount Applied for",
xlab = "Contract Status", family = "mono")
Once again, we can confirm that on on average, Refused loans are of higher amount than approved loans.
Moving on, let’s look at the days clients apply for loans.
#reorder days
app_prev$WEEKDAY_APPR_PROCESS_START <- factor(app_prev$WEEKDAY_APPR_PROCESS_START, order = T,
levels = c("MONDAY","TUESDAY","WEDNESDAY","THURSDAY","FRIDAY","SATURDAY","SUNDAY"))
#plot count
barplot(table(app_prev$WEEKDAY_APPR_PROCESS_START))
From the plot, we can deduce most clients applied for loans on wednesdays (although the difference with other days is not too much). Also very few of our clients applied for loan on sundays.
We can dive in deeper by checking which days are loans mostly approved and canceled or refused with a mosaic plot.
mosaic(~NAME_CONTRACT_STATUS + WEEKDAY_APPR_PROCESS_START,shade = T, data = app_prev)
Nothing much here but i did notice that, more application are cancelled on wednesdays than other days - For approved loans, it’s very tight but i think saturday takes the crown as the day most loans are approved.
We can also check on the time of the day our clients apply for loans the most.
barplot(table(app_prev$HOUR_APPR_PROCESS_START))
Most loan applications come in around 11AM and then starts to drop-off an hour later.
I will add two-box plots just to see the distribution of Contract type and amount application. For the box-plot below, i have logged the amount (y-axis). This is to make the plot more readable.
ggplot(app_prev, aes(NAME_CONTRACT_TYPE, log(AMT_APPLICATION))) +
geom_boxplot() + facet_wrap(~NAME_CONTRACT_STATUS)
Let’s see the distribution of AMT_APPLICATION(in short the amount our clients requested for) across contract types with a density plot.
app_prev %>%
filter(AMT_APPLICATION > 0) %>%
ggplot(aes(AMT_APPLICATION, fill = NAME_CONTRACT_TYPE)) +
geom_density() + facet_grid(NAME_CONTRACT_TYPE~.) +
theme(legend.position = "None",
strip.text.y = element_text(size = 14, angle = 0))
Here we can see that, most of the figures are skewed to the right. Also, Clients with Cash loans request for a higher amount of money than cients with consumer loans and revolving loans.
Let’s see the distribution AMT_APPLICATION(in short the amount our clients requested for) across channel types.
app_prev %>%
filter(AMT_APPLICATION > 0) %>%
ggplot(aes(AMT_APPLICATION, fill = CHANNEL_TYPE)) +
geom_density() +
#scale_x_continuous(breaks = c(0, 1000000, 2000000, 3000000, 4000000))
facet_grid(CHANNEL_TYPE~.) +
theme(legend.position = "None",
strip.text.y = element_text(size = 14, angle = 0))
Compared to other channels, Car dealer and credit and cash offices to some degree are skewed to the right (distribution goes into the millions territory).