Dimensionality Reduction

Introduction
Data Loading & Processing
PCA for numerical variables
Dimensionality reduction for categorical variables
Conclusion

# Load required libraries
library(factoextra)
library(dplyr)
library(ggplot2)
library(FactoMineR)
library(pdp)
library(reshape2)

Introduction

When consumers don’t pay back their debts on time, it results in losses for the bank. As a result of this, every year, institutions sustain losses that are measured in crores, which not only has an effect on the overall rate of economic development in the country but also has a significant bearing on it. During this hackathon, we are trying to forecast whether or not a person will default on their loan by looking at a variety of factors, such as the quantity of money that was funded, the location of the loan, the balance, etc. In order to find a solution to this issue, MachineHack has developed a training dataset that contains 67,463 rows and 35 columns. In order to execute dimensionality reduction, we will make use of this dataset. In this particular instance, the dimensionality reduction technique will be applied to both the numerical and the qualitative variables. The dataset could be obtained from Kaggle and could be found at https://www.kaggle.com/datasets/hemanthsai7/loandefault?select=train.csv.

Data Loading & Processing

First of all the data is read and the top 6 rows are printed to see how exactly the data looks like.

data <- read.csv("train.csv")
head(data)

Next, we have checked the data for missing observations and we found out that we do not have any missing observations in the dataset.

colSums(is.na(data))

##                           ID                  Loan.Amount 
##                            0                            0 
##                Funded.Amount       Funded.Amount.Investor 
##                            0                            0 
##                         Term               Batch.Enrolled 
##                            0                            0 
##                Interest.Rate                        Grade 
##                            0                            0 
##                    Sub.Grade          Employment.Duration 
##                            0                            0 
##               Home.Ownership          Verification.Status 
##                            0                            0 
##                 Payment.Plan                   Loan.Title 
##                            0                            0 
##              Debit.to.Income      Delinquency...two.years 
##                            0                            0 
##        Inquires...six.months                 Open.Account 
##                            0                            0 
##                Public.Record            Revolving.Balance 
##                            0                            0 
##          Revolving.Utilities               Total.Accounts 
##                            0                            0 
##          Initial.List.Status      Total.Received.Interest 
##                            0                            0 
##      Total.Received.Late.Fee                   Recoveries 
##                            0                            0 
##      Collection.Recovery.Fee Collection.12.months.Medical 
##                            0                            0 
##             Application.Type                Last.week.Pay 
##                            0                            0 
##          Accounts.Delinquent      Total.Collection.Amount 
##                            0                            0 
##        Total.Current.Balance Total.Revolving.Credit.Limit 
##                            0                            0 
##                  Loan.Status 
##                            0

The ID column is removed from the dataset and the summary of the model is printed.

data <- data[,-1]
summary(data)

##   Loan.Amount    Funded.Amount   Funded.Amount.Investor      Term      
##  Min.   : 1014   Min.   : 1014   Min.   : 1115          Min.   :36.00  
##  1st Qu.:10012   1st Qu.: 9266   1st Qu.: 9832          1st Qu.:58.00  
##  Median :16073   Median :13042   Median :12794          Median :59.00  
##  Mean   :16849   Mean   :15771   Mean   :14622          Mean   :58.17  
##  3rd Qu.:22106   3rd Qu.:21793   3rd Qu.:17808          3rd Qu.:59.00  
##  Max.   :35000   Max.   :34999   Max.   :35000          Max.   :59.00  
##  Batch.Enrolled     Interest.Rate       Grade            Sub.Grade        
##  Length:67463       Min.   : 5.320   Length:67463       Length:67463      
##  Class :character   1st Qu.: 9.297   Class :character   Class :character  
##  Mode  :character   Median :11.378   Mode  :character   Mode  :character  
##                     Mean   :11.846                                        
##                     3rd Qu.:14.194                                        
##                     Max.   :27.182                                        
##  Employment.Duration Home.Ownership   Verification.Status Payment.Plan      
##  Length:67463        Min.   : 14574   Length:67463        Length:67463      
##  Class :character    1st Qu.: 51690   Class :character    Class :character  
##  Mode  :character    Median : 69336   Mode  :character    Mode  :character  
##                      Mean   : 80542                                         
##                      3rd Qu.: 94623                                         
##                      Max.   :406562                                         
##   Loan.Title        Debit.to.Income   Delinquency...two.years
##  Length:67463       Min.   : 0.6753   Min.   :0.0000         
##  Class :character   1st Qu.:16.7564   1st Qu.:0.0000         
##  Mode  :character   Median :22.6567   Median :0.0000         
##                     Mean   :23.2992   Mean   :0.3271         
##                     3rd Qu.:30.0484   3rd Qu.:0.0000         
##                     Max.   :39.6299   Max.   :8.0000         
##  Inquires...six.months  Open.Account   Public.Record     Revolving.Balance
##  Min.   :0.0000        Min.   : 2.00   Min.   :0.00000   Min.   :     0   
##  1st Qu.:0.0000        1st Qu.:10.00   1st Qu.:0.00000   1st Qu.:  2557   
##  Median :0.0000        Median :13.00   Median :0.00000   Median :  5516   
##  Mean   :0.1458        Mean   :14.27   Mean   :0.08144   Mean   :  7699   
##  3rd Qu.:0.0000        3rd Qu.:16.00   3rd Qu.:0.00000   3rd Qu.: 10184   
##  Max.   :5.0000        Max.   :37.00   Max.   :4.00000   Max.   :116933   
##  Revolving.Utilities Total.Accounts  Initial.List.Status
##  Min.   :  0.00517   Min.   : 4.00   Length:67463       
##  1st Qu.: 38.65882   1st Qu.:13.00   Class :character   
##  Median : 54.08233   Median :18.00   Mode  :character   
##  Mean   : 52.88944   Mean   :18.63                      
##  3rd Qu.: 69.17712   3rd Qu.:23.00                      
##  Max.   :100.88005   Max.   :72.00                      
##  Total.Received.Interest Total.Received.Late.Fee   Recoveries      
##  Min.   :    4.737       Min.   : 0.00000        Min.   :   0.000  
##  1st Qu.:  570.904       1st Qu.: 0.02111        1st Qu.:   1.630  
##  Median : 1330.843       Median : 0.04340        Median :   3.345  
##  Mean   : 2068.993       Mean   : 1.14397        Mean   :  59.692  
##  3rd Qu.: 2656.957       3rd Qu.: 0.07188        3rd Qu.:   5.454  
##  Max.   :14301.368       Max.   :42.61888        Max.   :4354.467  
##  Collection.Recovery.Fee Collection.12.months.Medical Application.Type  
##  Min.   :  0.00004       Min.   :0.0000               Length:67463      
##  1st Qu.:  0.47626       1st Qu.:0.0000               Class :character  
##  Median :  0.78014       Median :0.0000               Mode  :character  
##  Mean   :  1.12514       Mean   :0.0213                                 
##  3rd Qu.:  1.07057       3rd Qu.:0.0000                                 
##  Max.   :166.83300       Max.   :1.0000                                 
##  Last.week.Pay    Accounts.Delinquent Total.Collection.Amount
##  Min.   :  0.00   Min.   :0           Min.   :    1.0        
##  1st Qu.: 35.00   1st Qu.:0           1st Qu.:   24.0        
##  Median : 68.00   Median :0           Median :   36.0        
##  Mean   : 71.16   Mean   :0           Mean   :  146.5        
##  3rd Qu.:105.00   3rd Qu.:0           3rd Qu.:   46.0        
##  Max.   :161.00   Max.   :0           Max.   :16421.0        
##  Total.Current.Balance Total.Revolving.Credit.Limit  Loan.Status     
##  Min.   :    617       Min.   :  1000               Min.   :0.00000  
##  1st Qu.:  50379       1st Qu.:  8156               1st Qu.:0.00000  
##  Median : 118369       Median : 16733               Median :0.00000  
##  Mean   : 159574       Mean   : 23123               Mean   :0.09251  
##  3rd Qu.: 228375       3rd Qu.: 32147               3rd Qu.:0.00000  
##  Max.   :1177412       Max.   :201169               Max.   :1.00000

A numerical and a qualitative representation of each variable is included in the dataset. I will separate them and then perform individual analyses for numerical variables (using principal components analysis; PCA) and qualitative variables so that the analysis is more accurate (using MCA). Moreover, the categorical and numerical data is separated into two different dataframes.

data_numerical <- select_if(data, is.numeric)
data_categorical <- select_if(data, is.character)

Finally, among the numerical columns, the target column called loan status is omitted.

data_numerical_features <- data_numerical[,c(1:24)]

The diagrams that are connected show that the majority of the participants are from grade C, while grade G participants make up the smallest percentage. The vast majority of individuals are mortgage holders, and only a minority own their homes outright. The vast majority of the individuals have their original list classification set to W and have been source verified. Because both the payment plan variable and the application type variable only have one possible value, they should both be eliminated from the dataset because they do not contribute any information that is helpful.

long_data <- melt(data_categorical, id.vars = NULL)
ggplot(data = long_data) +
  geom_bar(aes(x = value)) + 
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 14)) +
  facet_wrap(~ variable, 
             scales = "free", ncol = 3)

From the graphs attached above, we can see that the payment plan and the application type has a single value in its columns and they do not contribute in providing any useful information. Hence, these two columns are removed from the categorical dataset.

data_categorical <- data_categorical[,-which(names(data_categorical) %in% 
                                               c("Application.Type", 
                                                 "Payment.Plan"))]

PCA for numerical variables

First of all the distinct value count is obtained for each of the numerical column and we can see that the Accounts.Delinquent column has only 1 unique value and it need to be removed from the dataset.

sapply(data_numerical_features, function(x) n_distinct(x))

##                  Loan.Amount                Funded.Amount 
##                        27525                        24548 
##       Funded.Amount.Investor                         Term 
##                        67441                            3 
##                Interest.Rate               Home.Ownership 
##                        67448                        67454 
##              Debit.to.Income      Delinquency...two.years 
##                        67454                            9 
##        Inquires...six.months                 Open.Account 
##                            6                           36 
##                Public.Record            Revolving.Balance 
##                            5                        20582 
##          Revolving.Utilities               Total.Accounts 
##                        67458                           69 
##      Total.Received.Interest      Total.Received.Late.Fee 
##                        67451                        67380 
##                   Recoveries      Collection.Recovery.Fee 
##                        67387                        67313 
## Collection.12.months.Medical                Last.week.Pay 
##                            2                          162 
##          Accounts.Delinquent      Total.Collection.Amount 
##                            1                         2193 
##        Total.Current.Balance Total.Revolving.Credit.Limit 
##                        60901                        37708

Hence, its removed.

data_numerical_features <- data_numerical_features[,-which(names(data_numerical_features) %in% c("Accounts.Delinquent"))]

Next, we have applied PCA on numerical features dataset and the result summary is obtained. It can be seen from the proportion of variance explained that the PC1 explains 4.7% approximately and same goes for all other dimensions as well (more of less same as they all lie in 4-5%).

pca_numerical <- prcomp(data_numerical_features, 
                        center = TRUE, 
                        scale = TRUE)
summary(pca_numerical)

## Importance of components:
##                            PC1     PC2     PC3     PC4    PC5    PC6     PC7
## Standard deviation     1.04586 1.02547 1.02170 1.01993 1.0185 1.0106 1.00977
## Proportion of Variance 0.04756 0.04572 0.04539 0.04523 0.0451 0.0444 0.04433
## Cumulative Proportion  0.04756 0.09328 0.13866 0.18389 0.2290 0.2734 0.31773
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.00799 1.00220 1.00096 0.99842 0.99761 0.99499 0.99441
## Proportion of Variance 0.04418 0.04367 0.04356 0.04334 0.04327 0.04304 0.04299
## Cumulative Proportion  0.36190 0.40557 0.44914 0.49248 0.53575 0.57879 0.62178
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.99378 0.99182 0.98849 0.98602 0.98481 0.98144 0.97911
## Proportion of Variance 0.04294 0.04277 0.04248 0.04227 0.04217 0.04188 0.04168
## Cumulative Proportion  0.66472 0.70749 0.74998 0.79225 0.83441 0.87629 0.91797
##                           PC22    PC23
## Standard deviation     0.97570 0.96675
## Proportion of Variance 0.04139 0.04063
## Cumulative Proportion  0.95937 1.00000

A scree plot is a graphical representation of the proportion of variation that can be attributed to each component. According to the findings, PC1 is responsible for explaining just over 4% of the total diversity. There must be twenty components for there to be an explanation for eighty percent of the variation.

fviz_eig(pca_numerical)

Let’s check variable contribution in every first 3 dimensions, which explain 66% of variance. We can plot all dimensions one by one. The highest contribution in PC1 is of Credit limit and reviving balance.

fviz_contrib(pca_numerical,
             choice = "var", 
             axes = 1)

For PC2 the highest contribution is done by open account and loan amount.

fviz_contrib(pca_numerical, 
             choice = "var", 
             axes = 2)

For PC3, the highest contribution is done by last week pay and total accounts.

fviz_contrib(pca_numerical,
             choice = "var", 
             axes = 3)

Dimensionality reduction for categorical variables

For categorical variable, we can see that the first 3 PCs could be used which explains 80% of the variation. The PC1 explains 70% of the variation while PC2 and PC3 explains 6% each which sums up to 81%.

mca_categorical <- MCA(data_categorical, 
                       graph = FALSE)
fviz_screeplot(mca_categorical, 
               addlabels = TRUE)

We can check the contribution of each variable in each PC using the plot below. The highest contribution in PC1 is not verified and mortgage.

fviz_contrib(mca_categorical, 
             choice = "var", 
             axes = 1)

Conclusion

The findings do not appear to be acceptable in either analysis, despite the fact that the percentage of explained variation in two-dimensional space is not high enough to presume that all points have been allocated accurately. When compared to the variance described by numerical columns and principal component analysis (PCA), the variance explained by categorical variables was in a league of its own.