require(tidyverse)
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(ggthemes)
## Loading required package: ggthemes
require(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
require(GGally)
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(DT)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr) # for group_by
library(gridExtra) # for grid.arrange
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
The Loan data set contains 53 features over 1010 rows. Initial data exploration was carried out using Excel. The following are the classification of the features into data types:
To analyze the data and reveal inherent patterns and relationships for the different types of data (qualitative and quantitative), we will explore various visualization tools and techniques. The following are some appropriate visualization tools for each of the identified data type. We will make honest efforts to present an informative visualization as possible.
For Qualitative (Categorical) Data:
Bar Charts: Use bar charts to visualize the distribution of categorical variables. This can help you understand the frequency or count of each category.
Pie Charts: Pie charts can be used to show the composition of a whole dataset in terms of different categories. They are effective when you have a small number of categories.
Stacked Bar Charts: Stacked bar charts are useful when you want to show the composition of a category over multiple subcategories or groups.
Histograms: Although histograms are typically used for numerical data, you can create histograms for categorical data by grouping categories into bins and showing the frequency of data points in each bin.
Heatmaps: Heatmaps are great for visualizing relationships between two categorical variables. They use color-coding to represent the strength of the relationship.
For Quantitative (Numerical) Data:
Histograms: Histograms are commonly used to visualize the distribution of numerical data. They show the frequency or density of data points within specified bins.
Box Plots (Box-and-Whisker Plots): Box plots provide a summary of the distribution of numerical data, including measures like median, quartiles, and potential outliers.
Scatter Plots: Scatter plots are useful for visualizing relationships between two numerical variables. They can help identify correlations or patterns.
Line Charts: Line charts are used to visualize trends over time or across ordered data points. They are useful for time-series analysis.
Density Plots (Kernel Density Estimation): Density plots provide a smooth representation of the data distribution and can be especially useful when dealing with large datasets.
Violin Plots: Violin plots combine elements of box plots and kernel density plots to show the distribution of numerical data along with summary statistics.
For Date/Time Data:
Time Series Plots: Use time series plots to visualize changes in data over time. Line charts are commonly used for this purpose.
Calendar Heatmaps: Calendar heatmaps are useful for visualizing patterns in date-related data over a calendar year. Each day is color-coded based on the data.
Time Series Decomposition: Decompose time series data into its components (trend, seasonality, noise) using techniques like seasonal decomposition of time series (STL) and visualize these components separately.
Lag Plots: For time series analysis, lag plots can help visualize the relationship between a data point and its lagged values.
Event Plots: Event plots can be used to visualize specific events or occurrences over time.
Some exploratory questions to tackle
Based on the insight we have as regards the data types the following are some pending questions the aid our data exploration process:
We will attempt to answer some of these questions by performing an Exploratory Data Analysis in three phases viz:
In this stage, we will subset features based on their types and perform procession and EDA
mydata = read.csv("loan.csv", na = "", h = T)
attach(mydata)
glimpse(mydata)
## Rows: 1,010
## Columns: 51
## $ Loan_Status <chr> "Fully Paid", "Fully Paid", "Fully Paid", "…
## $ Issued_Date <chr> "12/01/2011", "12/01/2011", "12/01/2011", "…
## $ Loan_Amount <int> 10000, 12000, 14000, 7500, 19200, 9000, 160…
## $ Funded_Amount <int> 10000, 8175, 14000, 7500, 19200, 9000, 1600…
## $ Funded_Amnt_Inv <dbl> 10000.000, 8175.000, 14000.000, 7500.000, 1…
## $ Annual_Income <dbl> 40000, 75000, 48880, 48000, 40000, 110000, …
## $ Term <chr> " 36 months", " 60 months", " 36 months", "…
## $ Int_rate_percentage <dbl> 0.14, 0.13, 0.11, 0.06, 0.10, 0.12, 0.17, 0…
## $ Installment <dbl> 343.09, 188.07, 456.03, 228.27, 618.72, 297…
## $ Property_Area <chr> "Rural", "Rural", "Semiurban", "Semiurban",…
## $ Grade_of_Loan <chr> "C", "C", "B", "A", "B", "B", "D", "A", "C"…
## $ Employment_length <chr> "4 years", "1 year", "3 years", "5 years", …
## $ Home_Ownership <chr> "RENT", "RENT", "RENT", "RENT", "MORTGAGE",…
## $ Verification_Status <chr> "Source Verified", "Not Verified", "Not Ver…
## $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Ma…
## $ Married <chr> "Yes", "Yes", "Yes", "Yes", "No", "Yes", "N…
## $ Dependents <int> 0, 2, 3, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ Payment_Plan <chr> "n", "n", "n", "n", "n", "n", "n", "n", "n"…
## $ Description <chr> NA, NA, NA, NA, " Borrower added on 12/08/…
## $ Purpose <chr> "car", "other", "credit_card", "debt_consol…
## $ Title <chr> "Car Loan", "Unexpected Expenses", "Helping…
## $ Address_State <chr> "NY", "CA", "IL", "MA", "WA", "FL", "CA", "…
## $ Dti <dbl> 18.69, 23.58, 21.95, 5.08, 11.37, 13.06, 11…
## $ Delinq_2yrs <int> 0, 1, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0…
## $ Earliest_cr_line <chr> "7/10/23", "1/9/95", "2/8/23", "1/5/00", "1…
## $ Inq_Last_6mths <int> 0, 0, 0, 0, 2, 0, 1, 2, 0, 0, 0, 0, 1, 1, 0…
## $ Open_Acc <int> 6, 7, 7, 6, 8, 10, 8, 8, 3, 10, 13, 14, 8, …
## $ Pub_Rec <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Revol_Bal <int> 5689, 11209, 13790, 6626, 14061, 14810, 196…
## $ Revol_Util_percent <dbl> 0.88, 0.58, 0.74, 0.31, 0.68, 0.65, 0.81, 0…
## $ Total_Acc <int> 7, 24, 15, 15, 15, 12, 13, 17, 15, 28, 45, …
## $ Initial_List_Status <chr> "f", "f", "f", "f", "f", "f", "f", "f", "f"…
## $ Out_Prncp <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Out_Prncp_Inv <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Total_Payment <dbl> 11156.108, 11283.637, 16416.880, 8214.165, …
## $ total_pymnt_inv <dbl> 11156.11, 11283.64, 16416.88, 8214.17, 2147…
## $ total_rec_prncp <dbl> 10000.00, 8175.00, 14000.00, 7500.00, 19200…
## $ total_rec_int <dbl> 1156.11, 3108.64, 2416.88, 714.17, 2332.52,…
## $ total_rec_late_fee <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ recoveries <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ collection_recovery_fee <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0…
## $ last_pymnt_d <chr> "12-Nov", "16-Dec", "14-Dec", "14-Oct", "13…
## $ last_pymnt_amnt <dbl> 7731.76, 187.51, 470.86, 683.80, 10413.87, …
## $ next_pymnt_d <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ last_credit_pull_d <chr> "17-Nov", "18-Apr", "14-Dec", "17-Jan", "13…
## $ collections_12_mths_ex_med <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ policy_code <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ application_type <chr> "Individual", "Individual", "Individual", "…
## $ disbursement_method <chr> "Cash", "Cash", "Cash", "Cash", "Cash", "Ca…
## $ debt_settlement_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"…
## $ hardship_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N"…
dim(mydata)
## [1] 1010 51
colnames(mydata)
## [1] "Loan_Status" "Issued_Date"
## [3] "Loan_Amount" "Funded_Amount"
## [5] "Funded_Amnt_Inv" "Annual_Income"
## [7] "Term" "Int_rate_percentage"
## [9] "Installment" "Property_Area"
## [11] "Grade_of_Loan" "Employment_length"
## [13] "Home_Ownership" "Verification_Status"
## [15] "Gender" "Married"
## [17] "Dependents" "Payment_Plan"
## [19] "Description" "Purpose"
## [21] "Title" "Address_State"
## [23] "Dti" "Delinq_2yrs"
## [25] "Earliest_cr_line" "Inq_Last_6mths"
## [27] "Open_Acc" "Pub_Rec"
## [29] "Revol_Bal" "Revol_Util_percent"
## [31] "Total_Acc" "Initial_List_Status"
## [33] "Out_Prncp" "Out_Prncp_Inv"
## [35] "Total_Payment" "total_pymnt_inv"
## [37] "total_rec_prncp" "total_rec_int"
## [39] "total_rec_late_fee" "recoveries"
## [41] "collection_recovery_fee" "last_pymnt_d"
## [43] "last_pymnt_amnt" "next_pymnt_d"
## [45] "last_credit_pull_d" "collections_12_mths_ex_med"
## [47] "policy_code" "application_type"
## [49] "disbursement_method" "debt_settlement_flag"
## [51] "hardship_flag"
We will then check for missing values
sapply(mydata , function(x) sum(is.na(x)))
## Loan_Status Issued_Date
## 0 0
## Loan_Amount Funded_Amount
## 0 0
## Funded_Amnt_Inv Annual_Income
## 0 0
## Term Int_rate_percentage
## 0 0
## Installment Property_Area
## 0 0
## Grade_of_Loan Employment_length
## 0 0
## Home_Ownership Verification_Status
## 0 0
## Gender Married
## 0 0
## Dependents Payment_Plan
## 25 0
## Description Purpose
## 440 0
## Title Address_State
## 0 0
## Dti Delinq_2yrs
## 0 0
## Earliest_cr_line Inq_Last_6mths
## 0 0
## Open_Acc Pub_Rec
## 0 0
## Revol_Bal Revol_Util_percent
## 0 0
## Total_Acc Initial_List_Status
## 0 0
## Out_Prncp Out_Prncp_Inv
## 0 0
## Total_Payment total_pymnt_inv
## 0 0
## total_rec_prncp total_rec_int
## 0 0
## total_rec_late_fee recoveries
## 0 0
## collection_recovery_fee last_pymnt_d
## 0 0
## last_pymnt_amnt next_pymnt_d
## 0 1010
## last_credit_pull_d collections_12_mths_ex_med
## 0 0
## policy_code application_type
## 0 0
## disbursement_method debt_settlement_flag
## 0 0
## hardship_flag
## 0
We will remove the rows with missing values
mydata = mydata %>%
filter(!is.na(Dependents))
sapply(mydata , function(x) sum(is.na(x)))
## Loan_Status Issued_Date
## 0 0
## Loan_Amount Funded_Amount
## 0 0
## Funded_Amnt_Inv Annual_Income
## 0 0
## Term Int_rate_percentage
## 0 0
## Installment Property_Area
## 0 0
## Grade_of_Loan Employment_length
## 0 0
## Home_Ownership Verification_Status
## 0 0
## Gender Married
## 0 0
## Dependents Payment_Plan
## 0 0
## Description Purpose
## 427 0
## Title Address_State
## 0 0
## Dti Delinq_2yrs
## 0 0
## Earliest_cr_line Inq_Last_6mths
## 0 0
## Open_Acc Pub_Rec
## 0 0
## Revol_Bal Revol_Util_percent
## 0 0
## Total_Acc Initial_List_Status
## 0 0
## Out_Prncp Out_Prncp_Inv
## 0 0
## Total_Payment total_pymnt_inv
## 0 0
## total_rec_prncp total_rec_int
## 0 0
## total_rec_late_fee recoveries
## 0 0
## collection_recovery_fee last_pymnt_d
## 0 0
## last_pymnt_amnt next_pymnt_d
## 0 985
## last_credit_pull_d collections_12_mths_ex_med
## 0 0
## policy_code application_type
## 0 0
## disbursement_method debt_settlement_flag
## 0 0
## hardship_flag
## 0
dim(mydata)
## [1] 985 51
So after performing cleaning we are left with 51 features over 985 rows. We will go further to subset quantitative and qualitative features to enhance visualization and engineering.
mydatacat = mydata %>%
select(Loan_Status, Property_Area, Grade_of_Loan, Home_Ownership, Verification_Status, Gender, Married, Dependents, Payment_Plan, Description, Purpose,Title, Address_State, Initial_List_Status, application_type, disbursement_method, debt_settlement_flag, hardship_flag)
mydatacat
We will go ahead and perform same for the quantitative data
mydatanum = mydata %>%
select( Loan_Amount, Funded_Amount, Funded_Amnt_Inv, Annual_Income, Int_rate_percentage, Installment, Dti, Revol_Bal, Revol_Util_percent, Total_Acc, Out_Prncp, Out_Prncp_Inv, Total_Payment, total_pymnt_inv, total_rec_prncp, total_rec_int, total_rec_late_fee, recoveries, collection_recovery_fee, last_pymnt_amnt, collections_12_mths_ex_med, policy_code, Open_Acc, Pub_Rec, Delinq_2yrs, Inq_Last_6mths)
mydatanum
require(ggplot2)
mydata %>%
count(Loan_Status) %>%
ggplot(aes(x = reorder(Loan_Status , desc(n)) , y = n , fill = n)) +
geom_col() +
coord_flip() +
labs(x = 'Loan_Status' , y = 'Count')
ggplot(mydata, aes(x = Property_Area)) +
geom_bar() +
labs(title = "Property Area Distribution")
ggplot(mydata, aes(x = Loan_Amount)) +
geom_histogram(binwidth = 1000, fill = "light Green", color = "black") +
labs(title = "Loan Amount Distribution")
ggplot(mydata, aes(y = Annual_Income)) +
geom_boxplot() +
labs(title = "Box Plot of AnnualIncome")
ggplot(mydata, aes(x = Total_Payment)) +
geom_density(fill = "green") +
labs(title = "Density Plot of Total_Payment")
ggplot(mydata, aes(x = Loan_Amount, y = Annual_Income)) +
geom_point() +
labs(title = "Loan Amount vs. Annual Income")
ggplot(mydata, aes(x = Annual_Income, y = Loan_Amount, color = Loan_Status)) +
geom_point() +
labs(title = "Annual Income vs. Loan Amount by Loan Status")
ggplot(mydata, aes(x = Property_Area, y = Loan_Amount, fill = Property_Area)) +
geom_boxplot() +
labs(title = "Loan Amount by Property Area")
ggplot(mydata, aes(x = Property_Area, fill = Loan_Status)) +
geom_bar(position = "fill") +
labs(title = "Loan Status Distribution by Property Area")
ggplot(mydata , aes(x = Grade_of_Loan , y = Int_rate_percentage , fill = Grade_of_Loan)) +
geom_boxplot() +
theme_igray() +
labs(y = 'Int_rate_percentage' , x = 'Grade_of_Loan')
ggplot(mydata , aes(x = Annual_Income , y = Loan_Amount , color = Int_rate_percentage)) +
geom_point(alpha = 0.5 , size = 1.5) +
geom_smooth(se = F , color = 'darkred' , method = 'loess') +
xlim(c(0 , 300000)) +
labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate percentage')
The above presented visualization is just as aspect of the data. This shows that the data is rich with information, hence I will encourage others to explore the area of modelling and other bivariate relationships that exist among the features.