# Load required libraries
library(factoextra)
library(dplyr)
library(ggplot2)
library(FactoMineR)
library(pdp)
library(reshape2)
When consumers don’t pay back their debts on time, it results in losses for the bank. As a result of this, every year, institutions sustain losses that are measured in crores, which not only has an effect on the overall rate of economic development in the country but also has a significant bearing on it. During this hackathon, we are trying to forecast whether or not a person will default on their loan by looking at a variety of factors, such as the quantity of money that was funded, the location of the loan, the balance, etc. In order to find a solution to this issue, MachineHack has developed a training dataset that contains 67,463 rows and 35 columns. In order to execute dimensionality reduction, we will make use of this dataset. In this particular instance, the dimensionality reduction technique will be applied to both the numerical and the qualitative variables. The dataset could be obtained from Kaggle and could be found at https://www.kaggle.com/datasets/hemanthsai7/loandefault?select=train.csv.
First of all the data is read and the top 6 rows are printed to see how exactly the data looks like.
data <- read.csv("train.csv")
head(data)
Next, we have checked the data for missing observations and we found out that we do not have any missing observations in the dataset.
colSums(is.na(data))
## ID Loan.Amount
## 0 0
## Funded.Amount Funded.Amount.Investor
## 0 0
## Term Batch.Enrolled
## 0 0
## Interest.Rate Grade
## 0 0
## Sub.Grade Employment.Duration
## 0 0
## Home.Ownership Verification.Status
## 0 0
## Payment.Plan Loan.Title
## 0 0
## Debit.to.Income Delinquency...two.years
## 0 0
## Inquires...six.months Open.Account
## 0 0
## Public.Record Revolving.Balance
## 0 0
## Revolving.Utilities Total.Accounts
## 0 0
## Initial.List.Status Total.Received.Interest
## 0 0
## Total.Received.Late.Fee Recoveries
## 0 0
## Collection.Recovery.Fee Collection.12.months.Medical
## 0 0
## Application.Type Last.week.Pay
## 0 0
## Accounts.Delinquent Total.Collection.Amount
## 0 0
## Total.Current.Balance Total.Revolving.Credit.Limit
## 0 0
## Loan.Status
## 0
The ID column is removed from the dataset and the summary of the model is printed.
data <- data[,-1]
summary(data)
## Loan.Amount Funded.Amount Funded.Amount.Investor Term
## Min. : 1014 Min. : 1014 Min. : 1115 Min. :36.00
## 1st Qu.:10012 1st Qu.: 9266 1st Qu.: 9832 1st Qu.:58.00
## Median :16073 Median :13042 Median :12794 Median :59.00
## Mean :16849 Mean :15771 Mean :14622 Mean :58.17
## 3rd Qu.:22106 3rd Qu.:21793 3rd Qu.:17808 3rd Qu.:59.00
## Max. :35000 Max. :34999 Max. :35000 Max. :59.00
## Batch.Enrolled Interest.Rate Grade Sub.Grade
## Length:67463 Min. : 5.320 Length:67463 Length:67463
## Class :character 1st Qu.: 9.297 Class :character Class :character
## Mode :character Median :11.378 Mode :character Mode :character
## Mean :11.846
## 3rd Qu.:14.194
## Max. :27.182
## Employment.Duration Home.Ownership Verification.Status Payment.Plan
## Length:67463 Min. : 14574 Length:67463 Length:67463
## Class :character 1st Qu.: 51690 Class :character Class :character
## Mode :character Median : 69336 Mode :character Mode :character
## Mean : 80542
## 3rd Qu.: 94623
## Max. :406562
## Loan.Title Debit.to.Income Delinquency...two.years
## Length:67463 Min. : 0.6753 Min. :0.0000
## Class :character 1st Qu.:16.7564 1st Qu.:0.0000
## Mode :character Median :22.6567 Median :0.0000
## Mean :23.2992 Mean :0.3271
## 3rd Qu.:30.0484 3rd Qu.:0.0000
## Max. :39.6299 Max. :8.0000
## Inquires...six.months Open.Account Public.Record Revolving.Balance
## Min. :0.0000 Min. : 2.00 Min. :0.00000 Min. : 0
## 1st Qu.:0.0000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.: 2557
## Median :0.0000 Median :13.00 Median :0.00000 Median : 5516
## Mean :0.1458 Mean :14.27 Mean :0.08144 Mean : 7699
## 3rd Qu.:0.0000 3rd Qu.:16.00 3rd Qu.:0.00000 3rd Qu.: 10184
## Max. :5.0000 Max. :37.00 Max. :4.00000 Max. :116933
## Revolving.Utilities Total.Accounts Initial.List.Status
## Min. : 0.00517 Min. : 4.00 Length:67463
## 1st Qu.: 38.65882 1st Qu.:13.00 Class :character
## Median : 54.08233 Median :18.00 Mode :character
## Mean : 52.88944 Mean :18.63
## 3rd Qu.: 69.17712 3rd Qu.:23.00
## Max. :100.88005 Max. :72.00
## Total.Received.Interest Total.Received.Late.Fee Recoveries
## Min. : 4.737 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 570.904 1st Qu.: 0.02111 1st Qu.: 1.630
## Median : 1330.843 Median : 0.04340 Median : 3.345
## Mean : 2068.993 Mean : 1.14397 Mean : 59.692
## 3rd Qu.: 2656.957 3rd Qu.: 0.07188 3rd Qu.: 5.454
## Max. :14301.368 Max. :42.61888 Max. :4354.467
## Collection.Recovery.Fee Collection.12.months.Medical Application.Type
## Min. : 0.00004 Min. :0.0000 Length:67463
## 1st Qu.: 0.47626 1st Qu.:0.0000 Class :character
## Median : 0.78014 Median :0.0000 Mode :character
## Mean : 1.12514 Mean :0.0213
## 3rd Qu.: 1.07057 3rd Qu.:0.0000
## Max. :166.83300 Max. :1.0000
## Last.week.Pay Accounts.Delinquent Total.Collection.Amount
## Min. : 0.00 Min. :0 Min. : 1.0
## 1st Qu.: 35.00 1st Qu.:0 1st Qu.: 24.0
## Median : 68.00 Median :0 Median : 36.0
## Mean : 71.16 Mean :0 Mean : 146.5
## 3rd Qu.:105.00 3rd Qu.:0 3rd Qu.: 46.0
## Max. :161.00 Max. :0 Max. :16421.0
## Total.Current.Balance Total.Revolving.Credit.Limit Loan.Status
## Min. : 617 Min. : 1000 Min. :0.00000
## 1st Qu.: 50379 1st Qu.: 8156 1st Qu.:0.00000
## Median : 118369 Median : 16733 Median :0.00000
## Mean : 159574 Mean : 23123 Mean :0.09251
## 3rd Qu.: 228375 3rd Qu.: 32147 3rd Qu.:0.00000
## Max. :1177412 Max. :201169 Max. :1.00000
A numerical and a qualitative representation of each variable is included in the dataset. I will separate them and then perform individual analyses for numerical variables (using principal components analysis; PCA) and qualitative variables so that the analysis is more accurate (using MCA). Moreover, the categorical and numerical data is separated into two different dataframes.
data_numerical <- select_if(data, is.numeric)
data_categorical <- select_if(data, is.character)
Finally, among the numerical columns, the target column called loan status is omitted.
data_numerical_features <- data_numerical[,c(1:24)]
The diagrams that are connected show that the majority of the participants are from grade C, while grade G participants make up the smallest percentage. The vast majority of individuals are mortgage holders, and only a minority own their homes outright. The vast majority of the individuals have their original list classification set to W and have been source verified. Because both the payment plan variable and the application type variable only have one possible value, they should both be eliminated from the dataset because they do not contribute any information that is helpful.
long_data <- melt(data_categorical, id.vars = NULL)
ggplot(data = long_data) +
geom_bar(aes(x = value)) +
theme(plot.title = element_text(hjust = 0.5,
size = 14)) +
facet_wrap(~ variable,
scales = "free", ncol = 3)
From the graphs attached above, we can see that the payment plan and the application type has a single value in its columns and they do not contribute in providing any useful information. Hence, these two columns are removed from the categorical dataset.
data_categorical <- data_categorical[,-which(names(data_categorical) %in%
c("Application.Type",
"Payment.Plan"))]
First of all the distinct value count is obtained for each of the numerical column and we can see that the Accounts.Delinquent column has only 1 unique value and it need to be removed from the dataset.
sapply(data_numerical_features, function(x) n_distinct(x))
## Loan.Amount Funded.Amount
## 27525 24548
## Funded.Amount.Investor Term
## 67441 3
## Interest.Rate Home.Ownership
## 67448 67454
## Debit.to.Income Delinquency...two.years
## 67454 9
## Inquires...six.months Open.Account
## 6 36
## Public.Record Revolving.Balance
## 5 20582
## Revolving.Utilities Total.Accounts
## 67458 69
## Total.Received.Interest Total.Received.Late.Fee
## 67451 67380
## Recoveries Collection.Recovery.Fee
## 67387 67313
## Collection.12.months.Medical Last.week.Pay
## 2 162
## Accounts.Delinquent Total.Collection.Amount
## 1 2193
## Total.Current.Balance Total.Revolving.Credit.Limit
## 60901 37708
Hence, its removed.
data_numerical_features <- data_numerical_features[,-which(names(data_numerical_features) %in% c("Accounts.Delinquent"))]
Next, we have applied PCA on numerical features dataset and the result summary is obtained. It can be seen from the proportion of variance explained that the PC1 explains 4.7% approximately and same goes for all other dimensions as well (more of less same as they all lie in 4-5%).
pca_numerical <- prcomp(data_numerical_features,
center = TRUE,
scale = TRUE)
summary(pca_numerical)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.04586 1.02547 1.02170 1.01993 1.0185 1.0106 1.00977
## Proportion of Variance 0.04756 0.04572 0.04539 0.04523 0.0451 0.0444 0.04433
## Cumulative Proportion 0.04756 0.09328 0.13866 0.18389 0.2290 0.2734 0.31773
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00799 1.00220 1.00096 0.99842 0.99761 0.99499 0.99441
## Proportion of Variance 0.04418 0.04367 0.04356 0.04334 0.04327 0.04304 0.04299
## Cumulative Proportion 0.36190 0.40557 0.44914 0.49248 0.53575 0.57879 0.62178
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.99378 0.99182 0.98849 0.98602 0.98481 0.98144 0.97911
## Proportion of Variance 0.04294 0.04277 0.04248 0.04227 0.04217 0.04188 0.04168
## Cumulative Proportion 0.66472 0.70749 0.74998 0.79225 0.83441 0.87629 0.91797
## PC22 PC23
## Standard deviation 0.97570 0.96675
## Proportion of Variance 0.04139 0.04063
## Cumulative Proportion 0.95937 1.00000
A scree plot is a graphical representation of the proportion of variation that can be attributed to each component. According to the findings, PC1 is responsible for explaining just over 4% of the total diversity. There must be twenty components for there to be an explanation for eighty percent of the variation.
fviz_eig(pca_numerical)
Let’s check variable contribution in every first 3 dimensions, which explain 66% of variance. We can plot all dimensions one by one. The highest contribution in PC1 is of Credit limit and reviving balance.
fviz_contrib(pca_numerical,
choice = "var",
axes = 1)
For PC2 the highest contribution is done by open account and loan amount.
fviz_contrib(pca_numerical,
choice = "var",
axes = 2)
For PC3, the highest contribution is done by last week pay and total accounts.
fviz_contrib(pca_numerical,
choice = "var",
axes = 3)
For categorical variable, we can see that the first 3 PCs could be used which explains 80% of the variation. The PC1 explains 70% of the variation while PC2 and PC3 explains 6% each which sums up to 81%.
mca_categorical <- MCA(data_categorical,
graph = FALSE)
fviz_screeplot(mca_categorical,
addlabels = TRUE)
We can check the contribution of each variable in each PC using the plot below. The highest contribution in PC1 is not verified and mortgage.
fviz_contrib(mca_categorical,
choice = "var",
axes = 1)
The findings do not appear to be acceptable in either analysis, despite the fact that the percentage of explained variation in two-dimensional space is not high enough to presume that all points have been allocated accurately. When compared to the variance described by numerical columns and principal component analysis (PCA), the variance explained by categorical variables was in a league of its own.