BI Team Lead Assessment - Olalekan Kazeem

Introduction

For this assessment, i will be using R language to go over the solution.

Reason for this is i don’t have a windows laptop at the moment (my old workstation crashed early last month), and considering this has to be done in two days - i decided to solve this assessment in R with my current workstation(mac).

Even though i know it was mentioned in the assessment that i can use any tool of my choice, i’d have loved to solve this with PowerBI (since that’s the preferred tool for the Ren team).

Checking for Missing Values before commencing task.

#load in app-previous datat
app_prev <- readxl::read_excel("/Users/kazeemolalekan/Downloads/app_previous_application(3).xlsx",1)

#checking for missing values distribution before starting task
aggr(app_prev, prop = F, numbers = T, cex.axis = 0.5)

#checking for missing values distribution
matrixplot(app_prev, sortby = 4, interactive = F)

#red shades are missing values

Question: 1

+ What is the mean number of received applications per contract type and channel type ?

Solution

#load in app-previous datat
app_prev <- readxl::read_excel("/Users/kazeemolalekan/Downloads/app_previous_application(3).xlsx",1)

#change character variable to factors
app_prev[sapply(app_prev, is.character)] <- lapply(app_prev[sapply(app_prev,is.character)], as.factor)

#change numeric variables to integers variable 
app_prev[sapply(app_prev, is.double)] <- lapply(app_prev[sapply(app_prev,is.double)], as.integer)

Let’s start with application for contract type. Below is a breakdown of applications received based on contract type.

#calculate count and percentage share
app_per_contract <- app_prev %>%
  group_by(NAME_CONTRACT_TYPE) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations))


app_per_share <- app_per_contract %>% 
  mutate(Percent_share = (Observations/dim(app_prev)[1])*100)

#adding the % symbol  
app_per_share$Percent_share <-  paste0(round(app_per_share$Percent_share,2),"%",sep = " ")

#table
kable(app_per_share, align = "ccc",format.args = list(big.mark = ","),
      caption = "Contract-type table:") %>%
 kable_material_dark()

Contract-type table:
NAME_CONTRACT_TYPE	Observations	Percent_share
Consumer loans	27,257	46.54%
Cash loans	24,621	42.04%
Revolving loans	6,681	11.41%
XNA	9	0.02%

Here we can see that consumer loans are the most frequent contract type. Also from total applications received:

27,257 are consumer loans - another way of saying this is consumer loans make up 47% of total applications received;
24,621 are cash loans and they make up 42% of total applications received;
6, 681 applications received are revolving loans; and
9 applications are categorized under “XNA” cohort.

We can view this co-horts breakdown using a bar chart.

Below we can see a breakdown of application received based on channel-type.

#calculate count and percentage share
app_per_channel <- app_prev %>%
  group_by(CHANNEL_TYPE) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations)) %>%
  mutate(Percent_share = (Observations/dim(app_prev)[1])*100)

#adding the % symbol  
app_per_channel$Percent_share <-  paste0(round(app_per_channel$Percent_share,2),"%",sep = " ")

#table
kable(app_per_channel, align = 'ccc',format.args = list(big.mark = ","), 
      caption = "Channel-type table:") %>%
  kable_material_dark()

Channel-type table:
CHANNEL_TYPE	Observations	Percent_share
Credit and cash offices	25,045	42.76%
Country-wide	18,459	31.52%
Stone	7,849	13.4%
Regional / Local	4,156	7.1%
AP+ (Cash loan)	1,930	3.3%
Contact center	800	1.37%
Channel of corporate sales	309	0.53%
Car dealer	20	0.03%

Here, we can see that Credit and cash offices channel have the biggest share of total application received (with 25,045 applications falling under that band). Country-wide channel also make up a big chunk of applications received (18,459 applications). Car dealer channel are the least used channel for loan applications.

We can also view this breakdown using a barplot.

app_prev_channel <- app_prev %>%
  group_by(CHANNEL_TYPE) %>%
  summarise(freq_c = n()) %>%
  arrange(desc(freq_c))



#rearrange factors
app_prev_channel$CHANNEL_TYPE <-  factor(app_prev_channel$CHANNEL_TYPE,
                                         levels = names(sort(table(app_prev$CHANNEL_TYPE), decreasing = F)))


ggplot(app_prev_channel, aes(CHANNEL_TYPE, freq_c)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") + 
  geom_text(aes(label = format(freq_c,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(), expand = c(0,0),
                     limits = c(0, 27000)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "Credit and cash offices are the most used channel type",
       subtitle = "Applications received across all Channel-type ")

Question: 2

+ Generate the most common reason for loan application from highest to lowest ?

Solution

The most commonly filled reason for loan application is XAP and XNA and both looks like NA entries or “Not Available” entries. Asides these two entries, Repairs , Other and Urgent needs are the most common reason for loan application.

We can see the breakdown (from highest to lowest) in the table below.

#calculate count and percentage share
app_per_channel <- app_prev %>%
  group_by(NAME_CASH_LOAN_PURPOSE) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations)) %>%
  mutate(Percent_share = (Observations/dim(app_prev)[1])*100)

#adding the % symbol  
app_per_channel$Percent_share <-  paste0(round(app_per_channel$Percent_share,3),"%",sep = " ")

#table
kable(app_per_channel, align = 'ccc',format.args = list(big.mark = ","), caption = "Reasons for loan applications:") %>%
  kable_material_dark()

Reasons for loan applications:
NAME_CASH_LOAN_PURPOSE	Observations	Percent_share
XAP	33,947	57.962%
XNA	22,245	37.981%
Repairs	783	1.337%
Other	507	0.866%
Urgent needs	301	0.514%
Building a house or an annex	105	0.179%
Buying a used car	96	0.164%
Medicine	86	0.147%
Payments on other loans	70	0.12%
Everyday expenses	64	0.109%
Education	58	0.099%
Journey	47	0.08%
Buying a new car	37	0.063%
Furniture	36	0.061%
Buying a home	34	0.058%
Car repairs	33	0.056%
Purchase of electronic equipment	31	0.053%
Wedding / gift / holiday	29	0.05%
Buying a holiday home / land	26	0.044%
Business development	14	0.024%
Gasification / water supply	11	0.019%
Buying a garage	4	0.007%
Hobby	3	0.005%
Refusal to name the goal	1	0.002%

Once again, we can view the breakdown with a barplot.

app_prev_channel <- app_prev %>%
  group_by(NAME_CASH_LOAN_PURPOSE) %>%
  summarise(count_n= n())


#rearrange factors
app_prev_channel$NAME_CASH_LOAN_PURPOSE <-  factor(app_prev_channel$NAME_CASH_LOAN_PURPOSE,
                                         levels = names(sort(table(app_prev$NAME_CASH_LOAN_PURPOSE), 
                                                             decreasing = F)))


ggplot(app_prev_channel, aes(NAME_CASH_LOAN_PURPOSE, count_n)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") + 
  geom_text(aes(label = format(count_n,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 39000)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "XAP and XNA are the most filled entries",
       subtitle = "XAP and XNA are most likely a generated filled reasons")

One caveat here - Our barplot is hard to read because XAP and XNA make up a significant chunk of Cash loan purposes. I have tried to make it more readable by adding the number of observations for each reason in front of each bar but this is still not enough. What we can do is to exclude both of these entries (since they don’t look like an actual filled reasons) and build another bar plot.

This is what we did below

app_prev_channel %>%
  filter(NAME_CASH_LOAN_PURPOSE != "XAP" & NAME_CASH_LOAN_PURPOSE != "XNA") %>%
ggplot(aes(NAME_CASH_LOAN_PURPOSE, count_n)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") + 
  geom_text(aes(label = format(count_n,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 820)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "XAP and XNA excluded from plot",
       subtitle = "Repairs is the most common reason for loan application")

The bar-plot above passes the eye test, and here we can see that most reasons fall below the 200 mark (grid-lines). Which brings us to the final plot, we might be interested in eye-balling reasons that have less than 200 observations. This can be best viewed with a dot-chart.

Below is a dot-chart plot of reasons below the 200 grid-line mark

app_prev_channel %>%
  filter(count_n < 200) %>%
  ggplot(aes(NAME_CASH_LOAN_PURPOSE, count_n)) + 
  geom_point(color = "firebrick", size = 3) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 130)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "Loan reasons with less than 200 occurrences",
       subtitle = "Some of our clients take loan application as a hobby ")

Question: 3

How many customers got a higher Credit amount than their Application amount, categorize by their contract Types (i:e type of loans) ?

Solution

First thing first, let’s see customers that got a higher credit amount than their application amount.

#isolate customers that got higher credit than they requested for

app_prev %>%
  mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
  select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
  group_by(GOT_HIGHER_AMOUNT) %>%
  summarise(Observations = n()) -> GOT_HIGHER_AMOUNT

#table
kable(GOT_HIGHER_AMOUNT, align = 'ccc', format.args = list(big.mark = ","), caption = "How many customers got a higher credit amount ?") %>%
  kable_material_dark()

How many customers got a higher credit amount ?
GOT_HIGHER_AMOUNT	Observations
No	35,614
Yes	22,954

You should read this as 22,954 of our customers got a higher credit amount than their application amount. Now just because they get it, doesn’t mean the loan was approved. The reason is it wasn’t stated in the question if it should be for approved loans only so for that reason i’ll stick to the basics. However before we move on to the next question, we’ll find out how many of our loans with higher credit amount were approved.

We can view this distribution with a bar plot

ggplot(GOT_HIGHER_AMOUNT, aes(GOT_HIGHER_AMOUNT, Observations)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), vjust = -1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 38000)) +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.x = element_blank(),
        axis.title.y = element_text(size = 13,family = "mono"),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        panel.grid.major.x = element_blank(),
        plot.title = element_text(face = "bold", size = 20, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) +
  
  labs(title = "Customers have a high chance of getting more amount",
       subtitle = "A lot of our customers got more than what they requested for")

Now we can categorize customers that got a higher credit amount than they requested for by contract types. We visualize this by a table below…..

app_prev %>%
  mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
  select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
  filter(GOT_HIGHER_AMOUNT == "Yes") %>%
  group_by(NAME_CONTRACT_TYPE) %>%
  summarise(Observations = n()) %>%
kable(align = 'ccc', format.args = list(big.mark = ",")) %>%
  kable_material_dark()

NAME_CONTRACT_TYPE	Observations
Cash loans	12,705
Consumer loans	8,445
Revolving loans	1,804

Let’s break this down (using the table above). from the 22,954 customers that got a higher credit amount: + 12,705 are cash loans; + 8,445 are consumer loans; + 1,804 are revolving loans

Going forward in this section, i will be using mosaic plot instead of barplots - this is because Categorical variables with less than three or two levels don’t do well with barplots (the bars are too big).

One way to visualize this breakdown is to use a mosaic plot. Mosaic plots are sometimes hard to read at first but once you figure out the logic behind mosaic plot, you will get to understand why they are a powerful plot to visualize categorical variables. They are very similar to tree-maps.

We start with a basic mosaic plot.

app_prev %>%
  mutate(GOT_HIGHER_AMOUNT = ifelse(AMT_CREDIT - AMT_APPLICATION > 0,"Yes","No")) %>%
  select(NAME_CONTRACT_TYPE, AMT_APPLICATION, AMT_CREDIT, GOT_HIGHER_AMOUNT, NAME_CONTRACT_STATUS) %>%
  filter(GOT_HIGHER_AMOUNT == "Yes") -> higher_credit

#change from factors to character
higher_credit$NAME_CONTRACT_TYPE <- as.character(higher_credit$NAME_CONTRACT_TYPE)
higher_credit$NAME_CONTRACT_STATUS <- as.character(higher_credit$NAME_CONTRACT_STATUS)

#mosaic plot
mosaic(~GOT_HIGHER_AMOUNT + NAME_CONTRACT_TYPE, 
       data = higher_credit)

Here we can see three rectangles - with different sizes. It passes the eye test and it’s not hard to figure out that Cash loans have a bigger share compared to consumer loans and revolving loans . This is line with our previous table.

Now a typical question might be, from our customers that got a higher credit amount, how many got approved ? . This is where it might get complicated with a mosaic plot but it’s also the best tool to visualize this kind of question

mosaic(~GOT_HIGHER_AMOUNT + NAME_CONTRACT_TYPE + 
         NAME_CONTRACT_STATUS, 
       data = higher_credit, shade = T)

How to read this plot: Most of our customers that got a higher credit amount actually got their loan approved (the rectangles on top). However customers with consumer loans that got a higher credit amount are more likely to get their loan approved than customers with cash loans or revolving loans.

Also the boxes in blue shading are classifications that occur more than expected and red shadings are classifications that occur less often than expected. For instance, customers that have cash loans and also got their loan approved occurs less often than expected under an independence model and customers with cash loans that got higher credit amount occur often than expected (These are statistics parlance).

We can end this section by checking the breakdown of customers that got a higher credit amount, their contract type and how many got approved, cancelled and refused.

higher_c_tab_2 <- xtabs(~ NAME_CONTRACT_TYPE + GOT_HIGHER_AMOUNT + NAME_CONTRACT_STATUS, 
                        data = higher_credit)

addmargins(higher_c_tab_2,1)

## , , NAME_CONTRACT_STATUS = Approved
## 
##                   GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE   Yes
##    Cash loans       9069
##    Consumer loans   7741
##    Revolving loans  1383
##    Sum             18193
## 
## , , NAME_CONTRACT_STATUS = Canceled
## 
##                   GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE   Yes
##    Cash loans         55
##    Consumer loans      1
##    Revolving loans     0
##    Sum                56
## 
## , , NAME_CONTRACT_STATUS = Refused
## 
##                   GOT_HIGHER_AMOUNT
## NAME_CONTRACT_TYPE   Yes
##    Cash loans       3581
##    Consumer loans    703
##    Revolving loans   421
##    Sum              4705

A way to read this table is - From 22,954 customers that got a higher credit than they applied for:

18,193 were approved - 9,9069 are cash loans, 7,741 are consumer loans;
4,705 were refused - 3,581 are cash loans, 703 are consumer loans.
56 were cancelled.

Question: 4

What is the percentage of customers with bad default history in the bureau records

Solution

We load the bureau dataset

#load dataset bureau updated
bureau_up <- readxl::read_excel("/Users/kazeemolalekan/Downloads/bureau updated(3).xlsx",1)

#check number of rows and column
dim(bureau_up)

## [1] 582751     14

We have 582,751 rows and 14 columns

#change character cols to factors
bureau_up[sapply(bureau_up, is.character)] <- lapply(bureau_up[sapply(bureau_up,is.character)], as.factor)

#change numeric cols to integers
bureau_up[sapply(bureau_up, is.double)] <- lapply(bureau_up[sapply(bureau_up,is.double)], as.integer)

Now, let’s see a breakdown of our customers with bad credit history. Once again, we view this using a table

bureau_up_ord <- within(
  bureau_up,CREDIT_ACTIVE <- factor(
    CREDIT_ACTIVE,levels = names(sort(table(CREDIT_ACTIVE),decreasing = F))))

#derive the proportion
percentage_share <- bureau_up_ord %>% 
  group_by(CREDIT_ACTIVE) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations)) %>%
  mutate(Percent_share = (Observations/dim(bureau_up_ord)[1])*100)


p.share <- percentage_share

p.share$Percent_share <-  paste0(round(p.share$Percent_share,3),"%",sep = " ")


kable(p.share, align = 'ccc', format.args = list(big.mark = ",")) %>%
  kable_material_dark()

CREDIT_ACTIVE	Observations	Percent_share
Closed	366,513	62.894%
Active	214,040	36.729%
Sold	2,191	0.376%
Bad debt	7	0.001%

I can infer from the plot above that customers with bad credit make up 0.0012% of all the customers in the bureau data record. That is way less than 1%.

As always we can visualize this with a barplots.

#round to 3 decimal places
percentage_share$Percent_share <- round(percentage_share$Percent_share,3)


#bar-plot
ggplot(percentage_share, aes(CREDIT_ACTIVE, Percent_share)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  geom_text(aes(label = paste0(Percent_share,"%"), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0),limits = c(0, 70)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "Customers with bad default history make up less than 1% \n of the Bureau records")

Question: 5

Estimate average volume of loan applications per Name of their product Type

Solution

app_prev %>% 
  group_by(NAME_PRODUCT_TYPE) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations)) %>%
  mutate(Percent_share = (Observations/dim(app_prev)[1])*100) -> Average_vol_product_type


#add the percentage sign
Average_vol_product_type$Percent_share <-  paste0(round(Average_vol_product_type$Percent_share,0),"%",sep = " ")

#kable_table
kable(Average_vol_product_type, align = 'ccc', format.args = list(big.mark = ",")) %>%
  kable_material_dark()

NAME_PRODUCT_TYPE	Observations	Percent_share
XNA	38,308	65%
x-sell	15,003	26%
walk-in	5,257	9%

From our table we can deduce that XNA are the most filled entry for product type (again XNA is most likely “Not available” but i won’t assume). We can also deduce that x-sell and walk-in make up less than 40% of the total share of filled entries for product type.

We can visualize this with a mosaic plot

app_prev$NAME_PRODUCT_TYPE <- factor(app_prev$NAME_PRODUCT_TYPE, 
                                     levels = names(sort(table(app_prev$NAME_PRODUCT_TYPE),decreasing = T)))

mosaic(~NAME_PRODUCT_TYPE, direction = "v", data = app_prev)

We can go further by doing a cross-classification of our product_type with contract type with the mosaic plot.

app_prev$NAME_CONTRACT_TYPE <- factor(app_prev$NAME_CONTRACT_TYPE, ordered = T,
                                              levels = c("Revolving loans","Consumer loans","Cash loans","XNA"))


mosaic(~NAME_PRODUCT_TYPE + NAME_CONTRACT_TYPE, shade = T, direction = "v", data = app_prev)

Here, we can infer that most of x-sell product_type are cash-loans and revolving loans (and by most here, i mean almost all) - Same for walk-in product type, seems as if x-sell and walk-in are hardly consumer loans. Consumers loans are mostly XNA product_type.

We can confirm this with a two-way table:

p_type <- xtabs(~ NAME_CONTRACT_TYPE + NAME_PRODUCT_TYPE, 
                        data = app_prev)

addmargins(p_type,1)

##                   NAME_PRODUCT_TYPE
## NAME_CONTRACT_TYPE   XNA x-sell walk-in
##    Revolving loans  1696   3165    1820
##    Consumer loans  27257      0       0
##    Cash loans       9346  11838    3437
##    XNA                 9      0       0
##    Sum             38308  15003    5257

We can see that, we don’t have any consumer loans that are x-sell or product type. Infact cash loans make up a significant portion of both x-sell and walk-in product type, which is not surprising.

Question: 6

What is the ratio of performing loans to NPL and what are the evident reasons for higher NPLS

Spent hours trying to figure this out. I wasn’t able to.

Question: 7

Average number of payments per loan by installment amount band category (i.e: 0-50k, 51-100k, 101-200k, 201-300k, 301-400k, 401-500k, 500k above)

Solution

Let’s load in the installment dataset and then we check the number of rows and columns.

installment <- readxl::read_excel(
"/Users/kazeemolalekan/Downloads/installments_payments_updated(3).xlsx",1)

dim(installment)

## [1] 575137      8

Here, we can see that we have 575,137 rows and 9 columns.

#create a new column with band category
installment <- within(installment,{
                      Band <- NA
                      Band[AMT_INSTALMENT <= 50000] <- "0-50k"
                      Band[AMT_INSTALMENT > 50000 & AMT_INSTALMENT <= 100000] <- "51-100k"
                      Band[AMT_INSTALMENT > 100000 & AMT_INSTALMENT <= 200000] <- "101-200k"
                      Band[AMT_INSTALMENT > 200000 & AMT_INSTALMENT <= 300000] <- "201-300k"
                      Band[AMT_INSTALMENT > 300000 & AMT_INSTALMENT <= 400000] <- "301-400k"
                      Band[AMT_INSTALMENT > 400000 & AMT_INSTALMENT <= 500000] <- "401-500k"
                      Band[AMT_INSTALMENT > 500000] <- "500k above"}
                      )

#reorder factors for bands
installment$Band <- factor(installment$Band, levels = names(sort(table(installment$Band), decreasing = F)))

To get the average number of payments per loan by band, we can either use the basic table function or use the group_by function. I will do both (as a way to be sure our figures are accurate).

Starting with the table function, let’s view each band distribution and proportion.

band_category <- sort(xtabs(~Band, data = installment), decreasing = T)

addmargins(band_category)

## Band
##      0-50k    51-100k   101-200k   201-300k 500k above   301-400k   401-500k 
##     549939      16767       4105       1552       1395        744        635 
##        Sum 
##     575137

round(addmargins(prop.table(band_category)*100),2)

## Band
##      0-50k    51-100k   101-200k   201-300k 500k above   301-400k   401-500k 
##      95.62       2.92       0.71       0.27       0.24       0.13       0.11 
##        Sum 
##     100.00

In the table above we can see the number of observations for each band. We can also see 95.62% of our payment-per-loan fall into the 0-50k band (in short 549,939 observations are between the 0-50k band). 401-500k band is the least frequent band (635 of total observations fall into that co-hort - which is less than 1% of our dataset).

We can confirm these figures with the group_by function and the show our breakdown as usual using a table

installment %>%
  group_by(Band) %>%
  summarise(Observations = n()) %>%
  arrange(desc(Observations)) %>%
  mutate(Percent_share = (Observations / dim(installment)[1])*100) -> group_by_band


#add the percentage sign
group_by_band$Percent_share <-  paste0(round(group_by_band$Percent_share,2),"%",sep = " ")

kable(group_by_band, align = 'ccc', format.args = list(big.mark = ","),
      caption = "Payment Per loan by installment Cohort:") %>%
  kable_material_dark()

Payment Per loan by installment Cohort:
Band	Observations	Percent_share
0-50k	549,939	95.62%
51-100k	16,767	2.92%
101-200k	4,105	0.71%
201-300k	1,552	0.27%
500k above	1,395	0.24%
301-400k	744	0.13%
401-500k	635	0.11%

Let’s end this with a barplot of our bands.

ggplot(group_by_band, aes(Band, Observations)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 590000)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "95% of payment per loan by installment are between 0-50k band")

Our plot is hard to read because the “0-50k” make up a huge proportion of our observations. What we might want to do (as we have done before is to exclude “0-50k” band from our plot).

#exclude 0-50k band from plot
group_by_band %>%
  filter(Band != "0-50k") %>%
  ggplot(aes(Band, Observations)) + 
  geom_bar(stat = "identity", fill = "pink", color = "black") +
  geom_text(aes(label = format(Observations,big.mark = ",", scientific = F), fontface = "bold", 
                family = "mono"), hjust = -0.1) +
  scale_y_continuous(labels = scales::comma_format(),expand = c(0,0), limits = c(0, 18000)) +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = 11, color = "black", family = "mono"),
        axis.text.x = element_text(size = 11, color = "black", family = "mono"),
        axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.major = element_line(color = "grey"),
        plot.title = element_text(face = "bold", size = 15, hjust = 0.5, family = "mono"),
                       plot.subtitle = element_text(size = 15, hjust = .5, family = "mono")) + coord_flip() +
  labs(title = "Excluding the 0-50k band")

Additional Exploratory Analysis and Finding.

I will make this short, since i have a deadline to complete this task.

One thing i was interested was the comparison between the groups in Contract status cohort. To be specific, i was interested in how Approved loans, Canceled loans, Refused loans and Unused loans differ in-terms of significance level. For this, i will be using the npar ( r package ) and oneway-test (this is already going into advanced statistical analysis but i will make it short)

Let’s run the function and go over the summary

library(npar)

sig.check <- oneway(AMT_APPLICATION ~ NAME_CONTRACT_STATUS, data = app_prev)

summary(sig.check)

## data: AMT_APPLICATION on NAME_CONTRACT_STATUS 
## 
## Omnibus Test
## Kruskal-Wallis chi-squared = 22406.4122, df = 3, p-value = < 2.2e-16
## 
## Descriptive Statistics
##        Canceled Unused offer Approved  Refused
## n         10166       987.00 37317.00  10098.0
## median        0     57780.00 90000.00 155812.5
## mad           0     39029.44 79894.35 187712.0
## 
## Multiple Comparisons (Wilcoxon Rank Sum Tests)
## Probability Adjustment = holm
##        Group.1      Group.2           W             p    
## 1     Canceled Unused offer    184635.5  0.000000e+00 ***
## 2     Canceled     Approved  10513123.0  0.000000e+00 ***
## 3     Canceled      Refused   7773898.5  0.000000e+00 ***
## 4 Unused offer     Approved  12675351.0  6.333843e-63 ***
## 5 Unused offer      Refused   2882133.5 3.049048e-106 ***
## 6     Approved      Refused 155149994.0 3.314641e-163 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One way to know if there’s a significant difference between our contract_status is by eye-balling our p-value. Here we can see that the p-value is less than 0.05(or 5%) which is another way of saying there’s a significance difference between all the factors in contract status (of-course this is not a surprise, we expected this).

From the summary table, we also get a summary statistics. Here we can see that we have 10,166 observations with cancelled loans, and 987 loans are unused offer, 37,317 loans were approved.

Also, the median value of Approved loans is 90,000. Refused loans on the other hand have a median value of 155,812. Inshort refused loans on average are higher than approved loans (by more than 60,000).

Lastly, the smallest variability (mad) are in Unused offer cohort with mad value of 39,029 (Cancelled orders don’t count) - Refused offers have the biggest variability.

We can actually plot this stats with a box plot.

plot(sig.check, col = "pink", main = 
       "Contract Status Comparison", ylab = "Amount Applied for", 
     xlab = "Contract Status", family = "mono")

Once again, we can confirm that on on average, Refused loans are of higher amount than approved loans.

Moving on, let’s look at the days clients apply for loans.

#reorder days
app_prev$WEEKDAY_APPR_PROCESS_START <- factor(app_prev$WEEKDAY_APPR_PROCESS_START, order = T,
                                              levels = c("MONDAY","TUESDAY","WEDNESDAY","THURSDAY","FRIDAY","SATURDAY","SUNDAY"))

#plot count
barplot(table(app_prev$WEEKDAY_APPR_PROCESS_START))

From the plot, we can deduce most clients applied for loans on wednesdays (although the difference with other days is not too much). Also very few of our clients applied for loan on sundays.

We can dive in deeper by checking which days are loans mostly approved and canceled or refused with a mosaic plot.

mosaic(~NAME_CONTRACT_STATUS + WEEKDAY_APPR_PROCESS_START,shade = T, data = app_prev)

Nothing much here but i did notice that, more application are cancelled on wednesdays than other days - For approved loans, it’s very tight but i think saturday takes the crown as the day most loans are approved.

We can also check on the time of the day our clients apply for loans the most.

barplot(table(app_prev$HOUR_APPR_PROCESS_START))

Most loan applications come in around 11AM and then starts to drop-off an hour later.

I will add two-box plots just to see the distribution of Contract type and amount application. For the box-plot below, i have logged the amount (y-axis). This is to make the plot more readable.

ggplot(app_prev, aes(NAME_CONTRACT_TYPE, log(AMT_APPLICATION))) + 
  geom_boxplot() + facet_wrap(~NAME_CONTRACT_STATUS)

Let’s see the distribution of AMT_APPLICATION(in short the amount our clients requested for) across contract types with a density plot.

app_prev %>%
  filter(AMT_APPLICATION > 0) %>%
ggplot(aes(AMT_APPLICATION, fill = NAME_CONTRACT_TYPE)) + 
  geom_density() + facet_grid(NAME_CONTRACT_TYPE~.) +
  theme(legend.position = "None",
        strip.text.y = element_text(size = 14, angle = 0))

Here we can see that, most of the figures are skewed to the right. Also, Clients with Cash loans request for a higher amount of money than cients with consumer loans and revolving loans.

Let’s see the distribution AMT_APPLICATION(in short the amount our clients requested for) across channel types.

app_prev %>%
  filter(AMT_APPLICATION > 0) %>%
ggplot(aes(AMT_APPLICATION, fill = CHANNEL_TYPE)) + 
  geom_density() + 
  #scale_x_continuous(breaks = c(0, 1000000, 2000000, 3000000, 4000000))
  facet_grid(CHANNEL_TYPE~.) +
  theme(legend.position = "None",
        strip.text.y = element_text(size = 14, angle = 0))

Compared to other channels, Car dealer and credit and cash offices to some degree are skewed to the right (distribution goes into the millions territory).

BI Team Lead Assessment - Olalekan Kazeem

Lekan Ali

5/24/2021

Introduction

Checking for Missing Values before commencing task.

Question: 1

Solution

Question: 2

Solution

Question: 3

Solution

Question: 4

Solution

Question: 5

Solution

Question: 6

Question: 7

Solution

Additional Exploratory Analysis and Finding.