require(tidyverse)
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(ggthemes)
## Loading required package: ggthemes
require(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(DT)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(dplyr) # for group_by
library(gridExtra) # for grid.arrange
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

Introduction

The Loan data set contains 53 features over 1010 rows. Initial data exploration was carried out using Excel. The following are the classification of the features into data types:

  1. Categorical/Qualitative Data:
    • Loan_ID
    • Loan_Status
    • Property_Area
    • Grade_of_Loan
    • Home_Ownership
    • Verification_Status
    • Gender
    • Married
    • Dependents
    • Payment_Plan
    • Description
    • Purpose
    • Title
    • Address_State
    • Initial_List_Status
    • application_type
    • disbursement_method
    • debt_settlement_flag
    • hardship_flag
  2. Numerical/Quantitative Data (Continuous):
    • Loan_Amount
    • Funded_Amount
    • Funded_Amnt_Inv
    • Annual_Income
    • Int_rate_percentage
    • Installment
    • Dti
    • Revol_Bal
    • Revol_Util_percent
    • Total_Acc
    • Out_Prncp
    • Out_Prncp_Inv
    • Total_Payment
    • total_pymnt_inv
    • total_rec_prncp
    • total_rec_int
    • total_rec_late_fee
    • recoveries
    • collection_recovery_fee
    • last_pymnt_amnt
    • collections_12_mths_ex_med
    • policy_code
    • Months_Since_Last_Delinq
    • Months_Since_Last_Record
    • Open_Acc
    • Pub_Rec
    • Delinq_2yrs
    • Inq_Last_6mths
  3. Date/Time Data:
    • Issued_Date
    • Earliest_cr_line
    • last_pymnt_d
    • next_pymnt_d
    • last_credit_pull_d

To analyze the data and reveal inherent patterns and relationships for the different types of data (qualitative and quantitative), we will explore various visualization tools and techniques. The following are some appropriate visualization tools for each of the identified data type. We will make honest efforts to present an informative visualization as possible.

For Qualitative (Categorical) Data:

  1. Bar Charts: Use bar charts to visualize the distribution of categorical variables. This can help you understand the frequency or count of each category.

  2. Pie Charts: Pie charts can be used to show the composition of a whole dataset in terms of different categories. They are effective when you have a small number of categories.

  3. Stacked Bar Charts: Stacked bar charts are useful when you want to show the composition of a category over multiple subcategories or groups.

  4. Histograms: Although histograms are typically used for numerical data, you can create histograms for categorical data by grouping categories into bins and showing the frequency of data points in each bin.

  5. Heatmaps: Heatmaps are great for visualizing relationships between two categorical variables. They use color-coding to represent the strength of the relationship.

For Quantitative (Numerical) Data:

  1. Histograms: Histograms are commonly used to visualize the distribution of numerical data. They show the frequency or density of data points within specified bins.

  2. Box Plots (Box-and-Whisker Plots): Box plots provide a summary of the distribution of numerical data, including measures like median, quartiles, and potential outliers.

  3. Scatter Plots: Scatter plots are useful for visualizing relationships between two numerical variables. They can help identify correlations or patterns.

  4. Line Charts: Line charts are used to visualize trends over time or across ordered data points. They are useful for time-series analysis.

  5. Density Plots (Kernel Density Estimation): Density plots provide a smooth representation of the data distribution and can be especially useful when dealing with large datasets.

  6. Violin Plots: Violin plots combine elements of box plots and kernel density plots to show the distribution of numerical data along with summary statistics.

For Date/Time Data:

  1. Time Series Plots: Use time series plots to visualize changes in data over time. Line charts are commonly used for this purpose.

  2. Calendar Heatmaps: Calendar heatmaps are useful for visualizing patterns in date-related data over a calendar year. Each day is color-coded based on the data.

  3. Time Series Decomposition: Decompose time series data into its components (trend, seasonality, noise) using techniques like seasonal decomposition of time series (STL) and visualize these components separately.

  4. Lag Plots: For time series analysis, lag plots can help visualize the relationship between a data point and its lagged values.

  5. Event Plots: Event plots can be used to visualize specific events or occurrences over time.

Some exploratory questions to tackle

Based on the insight we have as regards the data types the following are some pending questions the aid our data exploration process:

We will attempt to answer some of these questions by performing an Exploratory Data Analysis in three phases viz:

  1. Univariate Analysis
  2. Bivariate Analysis
  3. Multivariate Analysis

Feature Selection & Engineering

In this stage, we will subset features based on their types and perform procession and EDA

mydata = read.csv("loan.csv", na = "", h = T)
attach(mydata)
glimpse(mydata)
## Rows: 1,010
## Columns: 51
## $ Loan_Status                <chr> "Fully Paid", "Fully Paid", "Fully Paid", "…
## $ Issued_Date                <chr> "12/01/2011", "12/01/2011", "12/01/2011", "…
## $ Loan_Amount                <int> 10000, 12000, 14000, 7500, 19200, 9000, 160…
## $ Funded_Amount              <int> 10000, 8175, 14000, 7500, 19200, 9000, 1600…
## $ Funded_Amnt_Inv            <dbl> 10000.000, 8175.000, 14000.000, 7500.000, 1…
## $ Annual_Income              <dbl> 40000, 75000, 48880, 48000, 40000, 110000, …
## $ Term                       <chr> " 36 months", " 60 months", " 36 months", "…
## $ Int_rate_percentage        <dbl> 0.14, 0.13, 0.11, 0.06, 0.10, 0.12, 0.17, 0…
## $ Installment                <dbl> 343.09, 188.07, 456.03, 228.27, 618.72, 297…
## $ Property_Area              <chr> "Rural", "Rural", "Semiurban", "Semiurban",…
## $ Grade_of_Loan              <chr> "C", "C", "B", "A", "B", "B", "D", "A", "C"…
## $ Employment_length          <chr> "4 years", "1 year", "3 years", "5 years", …
## $ Home_Ownership             <chr> "RENT", "RENT", "RENT", "RENT", "MORTGAGE",…
## $ Verification_Status        <chr> "Source Verified", "Not Verified", "Not Ver…
## $ Gender                     <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ Married                    <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "N…
## $ Dependents                 <int> 0, 2, 3, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ Payment_Plan               <chr> "n", "n", "n", "n", "n", "n", "n", "n", "n"…
## $ Description                <chr> NA, NA, NA, NA, "  Borrower added on 12/08/…
## $ Purpose                    <chr> "car", "other", "credit_card", "debt_consol…
## $ Title                      <chr> "Car Loan", "Unexpected Expenses", "Helping…
## $ Address_State              <chr> "NY", "CA", "IL", "MA", "WA", "FL", "CA", "…
## $ Dti                        <dbl> 18.69, 23.58, 21.95, 5.08, 11.37, 13.06, 11…
## $ Delinq_2yrs                <int> 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0…
## $ Earliest_cr_line           <chr> "7/10/23", "1/9/95", "2/8/23", "1/5/00", "1…
## $ Inq_Last_6mths             <int> 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0…
## $ Open_Acc                   <int> 6, 7, 7, 6, 8, 10, 8, 8, 3, 10, 13, 14, 8, …
## $ Pub_Rec                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Revol_Bal                  <int> 5689, 11209, 13790, 6626, 14061, 14810, 196…
## $ Revol_Util_percent         <dbl> 0.88, 0.58, 0.74, 0.31, 0.68, 0.65, 0.81, 0…
## $ Total_Acc                  <int> 7, 24, 15, 15, 15, 12, 13, 17, 15, 28, 45, …
## $ Initial_List_Status        <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f"…
## $ Out_Prncp                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Out_Prncp_Inv              <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Total_Payment              <dbl> 11156.108, 11283.637, 16416.880, 8214.165, …
## $ total_pymnt_inv            <dbl> 11156.11, 11283.64, 16416.88, 8214.17, 2147…
## $ total_rec_prncp            <dbl> 10000.00, 8175.00, 14000.00, 7500.00, 19200…
## $ total_rec_int              <dbl> 1156.11, 3108.64, 2416.88, 714.17, 2332.52,…
## $ total_rec_late_fee         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ recoveries                 <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ collection_recovery_fee    <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ last_pymnt_d               <chr> "12-Nov", "16-Dec", "14-Dec", "14-Oct", "13…
## $ last_pymnt_amnt            <dbl> 7731.76, 187.51, 470.86, 683.80, 10413.87, …
## $ next_pymnt_d               <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ last_credit_pull_d         <chr> "17-Nov", "18-Apr", "14-Dec", "17-Jan", "13…
## $ collections_12_mths_ex_med <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ policy_code                <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ application_type           <chr> "Individual", "Individual", "Individual", "…
## $ disbursement_method        <chr> "Cash", "Cash", "Cash", "Cash", "Cash", "Ca…
## $ debt_settlement_flag       <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"…
## $ hardship_flag              <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"…
dim(mydata)
## [1] 1010   51
colnames(mydata)
##  [1] "Loan_Status"                "Issued_Date"               
##  [3] "Loan_Amount"                "Funded_Amount"             
##  [5] "Funded_Amnt_Inv"            "Annual_Income"             
##  [7] "Term"                       "Int_rate_percentage"       
##  [9] "Installment"                "Property_Area"             
## [11] "Grade_of_Loan"              "Employment_length"         
## [13] "Home_Ownership"             "Verification_Status"       
## [15] "Gender"                     "Married"                   
## [17] "Dependents"                 "Payment_Plan"              
## [19] "Description"                "Purpose"                   
## [21] "Title"                      "Address_State"             
## [23] "Dti"                        "Delinq_2yrs"               
## [25] "Earliest_cr_line"           "Inq_Last_6mths"            
## [27] "Open_Acc"                   "Pub_Rec"                   
## [29] "Revol_Bal"                  "Revol_Util_percent"        
## [31] "Total_Acc"                  "Initial_List_Status"       
## [33] "Out_Prncp"                  "Out_Prncp_Inv"             
## [35] "Total_Payment"              "total_pymnt_inv"           
## [37] "total_rec_prncp"            "total_rec_int"             
## [39] "total_rec_late_fee"         "recoveries"                
## [41] "collection_recovery_fee"    "last_pymnt_d"              
## [43] "last_pymnt_amnt"            "next_pymnt_d"              
## [45] "last_credit_pull_d"         "collections_12_mths_ex_med"
## [47] "policy_code"                "application_type"          
## [49] "disbursement_method"        "debt_settlement_flag"      
## [51] "hardship_flag"

We will then check for missing values

sapply(mydata , function(x) sum(is.na(x)))
##                Loan_Status                Issued_Date 
##                          0                          0 
##                Loan_Amount              Funded_Amount 
##                          0                          0 
##            Funded_Amnt_Inv              Annual_Income 
##                          0                          0 
##                       Term        Int_rate_percentage 
##                          0                          0 
##                Installment              Property_Area 
##                          0                          0 
##              Grade_of_Loan          Employment_length 
##                          0                          0 
##             Home_Ownership        Verification_Status 
##                          0                          0 
##                     Gender                    Married 
##                          0                          0 
##                 Dependents               Payment_Plan 
##                         25                          0 
##                Description                    Purpose 
##                        440                          0 
##                      Title              Address_State 
##                          0                          0 
##                        Dti                Delinq_2yrs 
##                          0                          0 
##           Earliest_cr_line             Inq_Last_6mths 
##                          0                          0 
##                   Open_Acc                    Pub_Rec 
##                          0                          0 
##                  Revol_Bal         Revol_Util_percent 
##                          0                          0 
##                  Total_Acc        Initial_List_Status 
##                          0                          0 
##                  Out_Prncp              Out_Prncp_Inv 
##                          0                          0 
##              Total_Payment            total_pymnt_inv 
##                          0                          0 
##            total_rec_prncp              total_rec_int 
##                          0                          0 
##         total_rec_late_fee                 recoveries 
##                          0                          0 
##    collection_recovery_fee               last_pymnt_d 
##                          0                          0 
##            last_pymnt_amnt               next_pymnt_d 
##                          0                       1010 
##         last_credit_pull_d collections_12_mths_ex_med 
##                          0                          0 
##                policy_code           application_type 
##                          0                          0 
##        disbursement_method       debt_settlement_flag 
##                          0                          0 
##              hardship_flag 
##                          0

We will remove the rows with missing values

mydata = mydata %>%
        filter(!is.na(Dependents))
sapply(mydata , function(x) sum(is.na(x)))
##                Loan_Status                Issued_Date 
##                          0                          0 
##                Loan_Amount              Funded_Amount 
##                          0                          0 
##            Funded_Amnt_Inv              Annual_Income 
##                          0                          0 
##                       Term        Int_rate_percentage 
##                          0                          0 
##                Installment              Property_Area 
##                          0                          0 
##              Grade_of_Loan          Employment_length 
##                          0                          0 
##             Home_Ownership        Verification_Status 
##                          0                          0 
##                     Gender                    Married 
##                          0                          0 
##                 Dependents               Payment_Plan 
##                          0                          0 
##                Description                    Purpose 
##                        427                          0 
##                      Title              Address_State 
##                          0                          0 
##                        Dti                Delinq_2yrs 
##                          0                          0 
##           Earliest_cr_line             Inq_Last_6mths 
##                          0                          0 
##                   Open_Acc                    Pub_Rec 
##                          0                          0 
##                  Revol_Bal         Revol_Util_percent 
##                          0                          0 
##                  Total_Acc        Initial_List_Status 
##                          0                          0 
##                  Out_Prncp              Out_Prncp_Inv 
##                          0                          0 
##              Total_Payment            total_pymnt_inv 
##                          0                          0 
##            total_rec_prncp              total_rec_int 
##                          0                          0 
##         total_rec_late_fee                 recoveries 
##                          0                          0 
##    collection_recovery_fee               last_pymnt_d 
##                          0                          0 
##            last_pymnt_amnt               next_pymnt_d 
##                          0                        985 
##         last_credit_pull_d collections_12_mths_ex_med 
##                          0                          0 
##                policy_code           application_type 
##                          0                          0 
##        disbursement_method       debt_settlement_flag 
##                          0                          0 
##              hardship_flag 
##                          0
dim(mydata)
## [1] 985  51

So after performing cleaning we are left with 51 features over 985 rows. We will go further to subset quantitative and qualitative features to enhance visualization and engineering.

mydatacat = mydata %>%
        select(Loan_Status, Property_Area, Grade_of_Loan, Home_Ownership, Verification_Status, Gender, Married, Dependents, Payment_Plan, Description, Purpose,Title, Address_State, Initial_List_Status, application_type, disbursement_method, debt_settlement_flag, hardship_flag)

mydatacat

We will go ahead and perform same for the quantitative data

mydatanum = mydata %>%
        select(   Loan_Amount, Funded_Amount, Funded_Amnt_Inv, Annual_Income, Int_rate_percentage, Installment, Dti, Revol_Bal, Revol_Util_percent, Total_Acc, Out_Prncp, Out_Prncp_Inv, Total_Payment, total_pymnt_inv, total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_amnt, collections_12_mths_ex_med, policy_code, Open_Acc, Pub_Rec, Delinq_2yrs, Inq_Last_6mths)

mydatanum

Exploratory Data Analysis

Univariate Plots

  • Loan_Status:
require(ggplot2)
mydata %>%
        count(Loan_Status) %>%
        ggplot(aes(x = reorder(Loan_Status , desc(n)) , y = n , fill = n)) + 
        geom_col() + 
        coord_flip() + 
        labs(x = 'Loan_Status' , y = 'Count')

  • Property Area
ggplot(mydata, aes(x = Property_Area)) +
  geom_bar() +
  labs(title = "Property Area Distribution")

  • Loan Amount
ggplot(mydata, aes(x = Loan_Amount)) +
  geom_histogram(binwidth = 1000, fill = "light Green", color = "black") +
  labs(title = "Loan Amount Distribution")

  • Annual Income
ggplot(mydata, aes(y = Annual_Income)) +
  geom_boxplot() +
  labs(title = "Box Plot of AnnualIncome")

  • Total Payment
ggplot(mydata, aes(x = Total_Payment)) +
  geom_density(fill = "green") +
  labs(title = "Density Plot of Total_Payment")

Bivariate Plots

  • Loan Amount vs Annual Income
ggplot(mydata, aes(x = Loan_Amount, y = Annual_Income)) +
  geom_point() +
  labs(title = "Loan Amount vs. Annual Income")

ggplot(mydata, aes(x = Annual_Income, y = Loan_Amount, color = Loan_Status)) +
  geom_point() +
  labs(title = "Annual Income vs. Loan Amount by Loan Status")

  • Box plot for Loan Amount by Property Area
ggplot(mydata, aes(x = Property_Area, y = Loan_Amount, fill = Property_Area)) +
  geom_boxplot() +
  labs(title = "Loan Amount by Property Area")

  • Stacked bar chart for Loan Status by Property Area
ggplot(mydata, aes(x = Property_Area, fill = Loan_Status)) +
  geom_bar(position = "fill") +
  labs(title = "Loan Status Distribution by Property Area")

  • Visualise credit risk features
ggplot(mydata , aes(x = Grade_of_Loan , y = Int_rate_percentage , fill = Grade_of_Loan)) + 
        geom_boxplot() + 
        theme_igray() + 
        labs(y = 'Int_rate_percentage' , x = 'Grade_of_Loan')

  • impact the annual income of the borrower has on the other variables
ggplot(mydata , aes(x = Annual_Income , y = Loan_Amount , color = Int_rate_percentage)) +
        geom_point(alpha = 0.5 , size = 1.5) + 
        geom_smooth(se = F , color = 'darkred' , method = 'loess') +
        xlim(c(0 , 300000)) + 
        labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate percentage')

Conclusion

The above presented visualization is just as aspect of the data. This shows that the data is rich with information, hence I will encourage others to explore the area of modelling and other bivariate relationships that exist among the features.