Credit Scoring

Examples of variables commonly used for credit scoring include the applicant's income, loan duration (tenor), number of dependents, and history of payment delays.

Credit rating datasets generally contain many variables. Recording many variables makes it possible to build a model that classifies credit ratings accurately. However, a large number of variables also creates problems of its own: redundant variables, difficulty visualizing the data, difficulty explaining the model, and greater storage requirements. This is where the Principal Component Analysis statistical technique can play a role, as an unsupervised algorithm that reduces the number of variables (dimension reduction) to produce inputs for other algorithms without greatly reducing the quality of the rating prediction.

Most machine learning algorithms fall into two groups: supervised learning and unsupervised learning. Supervised learning aims to find a function that maps the attribute values of an instance to the value of its target attribute; examples include the C5.0 decision tree algorithm and linear regression analysis. Unsupervised learning aims to find patterns in the data without the help of a target attribute; one example is the k-means algorithm used for customer segmentation.

Principal Component Analysis

Principal Component Analysis (PCA) is one of the dimensionality reduction methods in machine learning. PCA selects new “variables” that are able to explain most of the variability of the data. If you are familiar with regression analysis, the concept of variability here is similar to the coefficient of determination R². PCA reduces the dimensions by forming new variables, called Principal Components, which are linear combinations of the old variables. For example, for a data set with 3 variables x1, x2 and x3, a principal component is a weighted average such as:

PC1 = 0.674x1 + 0.664x2 + 0.330x3

What is meant by linear in this context is linear in the variables. In the example PC1 = 0.674x1 + 0.664x2 + 0.330x3, the variables x1, x2 and x3 each appear to the power of 1. By contrast, in the expression 0.5x1² + 0.664√x2 + log2(0.330x3), the variables x1, x2 and x3 are not to the power of 1, so it is not a linear combination. The Principal Components are selected with the following conditions:

  1. The first Principal Component is the linear combination with the largest possible variance.

  2. Each subsequent Principal Component has the largest possible variance among the combinations that are uncorrelated with (orthogonal to) the previous ones.

The first picture below shows that the data variability is higher in the direction of the first Principal Component (blue line) than in the direction of the second Principal Component (red line). In the second picture, the data vary more in the direction of the X axis than in the direction of the Y axis, so the variability in the data can be approximated using only Principal Component 1, which lies parallel to the X axis.

A helpful analogy for principal components is a skewer. In the first image we place the first skewer (blue) so that it pierces as many of the scattered pieces of meat as possible. The pieces of meat not pierced by the first skewer are picked up by the second skewer (red). In the second picture most of the pieces of meat lie parallel to the X axis, so only one skewer (blue) is needed.

The basic concept of PCA is to approximate the data by finding the directions that provide the greatest variability and then placing the principal component “axes” so that the correlation between the new variables is minimal. The variances and Principal Components can be computed using the concepts of eigenvalues and eigenvectors from linear algebra. The eigenvalues indicate the contribution of each principal component to the data variance, and the eigenvectors give the coefficients of the Principal Components. The number of Principal Components to keep is chosen according to the desired constraints or by using a scree plot.

Assumptions that must be met so that the Principal Component Analysis technique works well:

  • The variables are numeric.

  • The relationships between variables are approximately linear, since PCA is based on correlations.

  • The variables are on comparable scales; if not, the data should be standardized first.

The following are the steps required to perform Principal Component Analysis:

  1. Prepare the data (standardization)

  2. Calculate the covariance matrix or correlation matrix

  3. Calculate the eigenvalues and eigenvectors of the correlation matrix

  4. Select the principal components

  5. Visualize the output

  6. Calculate the new scores

The principal component analysis technique works by choosing the principal components that provide the largest variance. This makes the method very sensitive to differences in variable values and variable scales. So that PCA does not “choose” the wrong variables, the input variables need to be standardized. A variable is standardized by subtracting the variable mean from each observation and dividing by the variable's standard deviation.
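
For illustration, here is a minimal sketch (with made-up values) showing that manual standardization matches R's scale() function:

x <- c(8.1, 10.6, 7.5, 14.8)   # hypothetical observations
(x - mean(x)) / sd(x)          # subtract the mean, divide by the standard deviation
as.vector(scale(x))            # scale() gives the same result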

The correlation coefficient shows the strength of the linear relationship between 2 variables. The relationships between all pairs of predictor variables in the data can be seen in the correlation matrix. A high correlation coefficient (close to 1 for a linear relationship, or close to -1 for a linear relationship in the opposite direction) is an indication of redundancy. This is where the PCA technique can be used to reduce dimensions, by selecting new uncorrelated “variables” that explain most of the data variability.

From the correlation matrix we obtain information about the strength of the linear relationships between variables. Our goal in performing PCA is to select the directions that provide the greatest variance while keeping the new variables uncorrelated. This is done by calculating the eigenvalues and eigenvectors of the correlation matrix. The principal component “new variables” are obtained from the eigenvectors and are linear combinations of the old variables; the coefficients indicate the weight of each old variable in each principal component.

How many principal components should be selected? There are several criteria that can be used to determine the number p:

We choose the first p Principal Components whose eigenvalues are greater than 1 (the Kaiser criterion), or as many as needed to reach a desired variability limit. Why eigenvalues greater than 1? A PC whose eigenvalue is greater than 1 explains more of the data variability than any single original (standardized) variable does. The contribution of a principal component to the variability of the data can be calculated with the formula

contribution of PCj = λj / (λ1 + λ2 + … + λp) = λj / p,

where λj is the j-th eigenvalue and p is the total number of variables (for a correlation matrix, the eigenvalues sum to p).
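
As a sketch, assuming eig holds the output of eigen() applied to a correlation matrix (as in the practice below), these contributions can be computed as:

eig$values / sum(eig$values)          # contribution of each PC; sum(eig$values) equals p
cumsum(eig$values) / sum(eig$values)  # cumulative contribution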

Visualization of PCA output can be done using a biplot. This plot displays the scores and loadings in the same graph using the 2 most important Principal Components. The horizontal axis shows the first PC and the vertical axis shows the second PC. Variables with a large contribution to PC1 are drawn close to the horizontal direction, and variables with a large contribution to PC2 are drawn close to the vertical direction. The old variable scores are transformed into new variable scores with the formula Z = XV, where X is the standardized data matrix and V is the matrix whose columns are the eigenvectors (in R: df %*% pr.out$rotation).

The following is the package and R function that will be used:

R provides prcomp() and princomp() functions to perform Principal Component Analysis.

prcomp(x, center = TRUE, scale. = FALSE)

Description: x is the numeric data (a matrix or data frame); center indicates whether the variables should be shifted to have mean zero; scale. indicates whether the variables should be scaled to have unit variance before the analysis.

A complete list of parameters can be found in Help by typing ?prcomp in the R console.

Practice

Standardization

The descriptive statistics output shows that the scale of the x3 variable differs from the other two variables. For PCA to work optimally, the data will first be standardized using the scale() function in R.

library(openxlsx)
df <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet="3varb")
head(df)
##     x1   x2    x3
## 1  8.1 10.6 217.7
## 2 10.6 13.0 173.3
## 3  7.5  9.5 648.5
## 4 14.8 15.8 578.3
## 5 11.0 13.3 -44.4
## 6  7.5  9.5 -53.7
#standardize variables (centering and scaling)
df <- scale(df, center = TRUE, scale = TRUE)
head(df, 3)
##             x1         x2         x3
## 1 -0.906102243 -0.7416508 -0.3747010
## 2  0.007307276  0.1550555 -0.4843659
## 3 -1.125320527 -1.1526412  0.6893456

Calculate Data Correlation Matrix

In this dataset the variables x1 and x2 have a strong linear correlation (r12 = 0.987), while the pairs x2 and x3 and x1 and x3 are not as strongly correlated (r13 = 0.299, r23 = 0.258). The scatter plot leads to a similar conclusion.

cormat <- cor(df)
cormat
##           x1        x2        x3
## x1 1.0000000 0.9869242 0.2994195
## x2 0.9869242 1.0000000 0.2584419
## x3 0.2994195 0.2584419 1.0000000

Calculate Eigenvalues and Eigenvectors

eig <- eigen(cormat)
eig
## eigen() decomposition
## $values
## [1] 2.12525679 0.86259104 0.01215218
## 
## $vectors
##           [,1]       [,2]        [,3]
## [1,] 0.6704698 -0.2112281  0.71123334
## [2,] 0.6640498 -0.2567210 -0.70223374
## [3,] 0.3309200  0.9431209 -0.03185768

Each eigenvalue has a corresponding eigenvector. In the output above, the eigenvector for the first eigenvalue is the first column of $vectors, and so on. In this example dataset with 3 variables, there are 3 eigenvalues and 3 eigenvectors.

The “new variable” principal components PC1, PC2, and PC3 are obtained from the eigenvectors and are linear combinations of the old variables x1, x2 and x3. The coefficients show the contribution of each old variable to each principal component.
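
To verify this, the PC scores and their variances can be computed directly from the eigenvectors (a sketch reusing the df and eig objects from above):

scores <- df %*% eig$vectors     # columns contain the PC1, PC2 and PC3 scores
round(apply(scores, 2, var), 5)  # the variance of each PC equals its eigenvalue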

Select the Number of Principal Components

round(eig$values/ncol(df),3) #contribution of PC1, PC2 and PC3 to data variability
## [1] 0.708 0.288 0.004
round(cumsum(eig$values/ncol(df)),3) #cumulative contribution of PC1, PC2 and PC3
## [1] 0.708 0.996 1.000
"By using the 80% limit, only 2 principal components are needed to explain 99% of the data variability, namely PC1 and PC2."
## [1] "By using the 80% limit, only 2 principal components are needed to explain 99% of the data variability, namely PC1 and PC2."
pr.out <- prcomp(df, scale. = TRUE, center = TRUE)
pr.out
## Standard deviations (1, .., p=3):
## [1] 1.4578260 0.9287578 0.1102369
## 
## Rotation (n x k) = (3 x 3):
##          PC1        PC2         PC3
## x1 0.6704698 -0.2112281 -0.71123334
## x2 0.6640498 -0.2567210  0.70223374
## x3 0.3309200  0.9431209  0.03185768
summary(pr.out)
## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     1.4578 0.9288 0.11024
## Proportion of Variance 0.7084 0.2875 0.00405
## Cumulative Proportion  0.7084 0.9960 1.00000
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pr.out, addlabels = TRUE)

screeplot(pr.out, type = "line")
abline(h = 1, lty = 3, col = "red")

A scree plot can be used as a tool to select the number of PCs: choose the number of components at the elbow, the point where the sharp decrease in variance levels off. In this example, the elbow gives 2 PCs. If we instead apply the Kaiser criterion (keep components with eigenvalue greater than 1, the dashed red line), only 1 PC is selected.

Visualization with Biplot

pr.out$rotation
##          PC1        PC2         PC3
## x1 0.6704698 -0.2112281 -0.71123334
## x2 0.6640498 -0.2567210  0.70223374
## x3 0.3309200  0.9431209  0.03185768
biplot(pr.out, scale = 0)

The rotation output shows the weight of the contribution of each variable to each Principal Component.

In the biplot above, the variables x1 and x2 have a large contribution to PC1 (horizontal direction) and the variable x3 has a large contribution to PC2 (vertical direction).

Calculate New Score

We have succeeded in reducing the predictor dimensions from 3 variables to 2 by using Principal Component Analysis. The old variable scores are transformed into new variable scores with the formula Z = XV, implemented below as df %*% pr.out$rotation:

head(df) #observation with 3 variables
##             x1         x2         x3
## 1 -0.906102243 -0.7416508 -0.3747010
## 2  0.007307276  0.1550555 -0.4843659
## 3 -1.125320527 -1.1526412  0.6893456
## 4  1.541835268  1.2012128  0.5159564
## 5  0.153452799  0.2671437 -1.0220701
## 6 -1.125320527 -1.1526412 -1.0450404
#the values of these observations in the new "coordinate system" with 2 PCs
df_new <- df %*% pr.out$rotation
df_new[1:6,1:2]
##           PC1         PC2
## 1 -1.22400335  0.02840327
## 2 -0.05242254 -0.49816513
## 3 -1.29178635  1.18374272
## 4  2.00215943 -0.14744625
## 5 -0.05794123 -1.06493057
## 6 -1.86572942 -0.45199295

Case Study

Case Study 1

We will use credit rating data to practice using the PCA method. We will reduce a data set with 4 numerical predictor variables to 2 Principal Components.

## 'data.frame'  : 900 obs. of 6 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income      : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor       : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents  : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue  : num 75.5 75.5 0 53 38 15 38 38 38 38 ...
## $ riskrating  : num 4 4 1 3 2 1 2 2 2 2 ...

The descriptions of the variables in this credit rating dataset are as follows:

  • contractcode is the contract number

  • income is income per year in millions of rupiah

  • tenor is the loan duration

  • dependents is the number of dependents

  • midoverdue is the average delay in loan payments in days

  • riskrating is the credit risk rating (1 to 5)

The response variable in this dataset is risk rating. The other variables are predictor variables. The variables that will be used for PCA analysis are predictor variables with numeric data types, namely:

## [1] "income" "tenor" "dependents" "midoverdue"
library(openxlsx)

csdat_raw <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet = "csdata")
head(csdat_raw)
##   contractcode income tenor dependents midoverdue riskrating
## 1   AGR-000001    295    48          5       75.5          4
## 2   AGR-000011    271    36          5       75.5          4
## 3   AGR-000030    159    12          0        0.0          1
## 4   AGR-000043    210    12          3       53.0          3
## 5   AGR-000049    165    36          0       38.0          2
## 6   AGR-000063    220    24          5       15.0          1
str(csdat_raw) #data structure
## 'data.frame':    900 obs. of  6 variables:
##  $ contractcode: chr  "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
##  $ income      : num  295 271 159 210 165 220 70 88 163 100 ...
##  $ tenor       : num  48 36 12 12 36 24 36 48 48 36 ...
##  $ dependents  : num  5 5 0 3 0 5 3 3 5 6 ...
##  $ midoverdue  : num  75.5 75.5 0 53 38 15 38 38 38 38 ...
##  $ riskrating  : num  4 4 1 3 2 1 2 2 2 2 ...
summary(csdat_raw) #descriptive statistics
##  contractcode           income          tenor         dependents   
##  Length:900         Min.   : 70.0   Min.   :12.00   Min.   :0.000  
##  Class :character   1st Qu.:121.0   1st Qu.:12.00   1st Qu.:1.000  
##  Mode  :character   Median :162.0   Median :24.00   Median :3.000  
##                     Mean   :163.3   Mean   :29.93   Mean   :2.932  
##                     3rd Qu.:199.0   3rd Qu.:48.00   3rd Qu.:5.000  
##                     Max.   :300.0   Max.   :48.00   Max.   :6.000  
##    midoverdue     riskrating   
##  Min.   : 0.0   Min.   :1.000  
##  1st Qu.:15.0   1st Qu.:1.000  
##  Median :53.0   Median :3.000  
##  Mean   :48.1   Mean   :2.681  
##  3rd Qu.:53.0   3rd Qu.:3.000  
##  Max.   :91.0   Max.   :5.000
"Create plot of Income on Dependents distribution"
## [1] "Create plot of Income on Dependents distribution"
library(ggplot2)
ggplot(csdat_raw, aes(as.factor(dependents), income)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Income by Dependents")

"Split dataset to training and testing dataset"
## [1] "Split dataset to training and testing dataset"
#Record the row indices for each risk rating
index1 <- which(csdat_raw$riskrating == 1)
index2 <- which(csdat_raw$riskrating == 2)

#Record the indices for the remaining risk ratings
index3 <- which(csdat_raw$riskrating == 3)
index4 <- which(csdat_raw$riskrating == 4)
index5 <- which(csdat_raw$riskrating == 5)

#80% of the data will be used as the training set
ntrain1 <- round(0.8 * length(index1))
ntrain2 <- round(0.8 * length(index2))
ntrain3 <- round(0.8 * length(index3))
ntrain4 <- round(0.8 * length(index4))
ntrain5 <- round(0.8 * length(index5))

#set the seed so that this sampling can be reproduced
set.seed(100)

#sample the data of each rating for the training set
train1_index <- sample(index1, ntrain1)
train2_index <- sample(index2, ntrain2)
train3_index <- sample(index3, ntrain3)
train4_index <- sample(index4, ntrain4)
train5_index <- sample(index5, ntrain5)

#keep the remaining data for the testing set
test1_index <- setdiff(index1, train1_index)
test2_index <- setdiff(index2, train2_index)
test3_index <- setdiff(index3, train3_index)
test4_index <- setdiff(index4, train4_index)
test5_index <- setdiff(index5, train5_index)

#Combine the samples for each risk rating into the training set
csdattrain <- do.call("rbind", list(csdat_raw[train1_index,],
csdat_raw[train2_index,], csdat_raw[train3_index,],
csdat_raw[train4_index,], csdat_raw[train5_index,]))
cstrain <- subset(csdattrain, select =
-c(contractcode,riskrating))

#Combine the samples for each risk rating into the testing set
csdattest <- do.call("rbind", list(csdat_raw[test1_index,],
csdat_raw[test2_index,], csdat_raw[test3_index,],
csdat_raw[test4_index,], csdat_raw[test5_index,]))
cstest <- subset(csdattest,
select = -c(contractcode,riskrating))

#Compute the correlations between variables
cor(cstrain)
##                 income       tenor dependents midoverdue
## income      1.00000000 -0.07256604  0.2427909  0.1250535
## tenor      -0.07256604  1.00000000  0.0334339  0.2333681
## dependents  0.24279088  0.03343390  1.0000000  0.7632659
## midoverdue  0.12505348  0.23336810  0.7632659  1.0000000
#Perform the PCA analysis with the prcomp() function and
#store the output in an object named pr.out
pr.out <- prcomp(cstrain, scale = TRUE, center = TRUE)

#Display the PCA output by calling the pr.out object
pr.out
## Standard deviations (1, .., p=4):
## [1] 1.3685609 1.0492944 0.9059637 0.4530475
## 
## Rotation (n x k) = (4 x 4):
##                   PC1        PC2        PC3         PC4
## income     -0.2665678  0.6235436 -0.7299486  0.08549929
## tenor      -0.1827165 -0.7592266 -0.6015443 -0.16832731
## dependents -0.6675878  0.1100498  0.2569754 -0.69005738
## midoverdue -0.6707330 -0.1505238  0.1981998  0.69869635
#Display the summary of the PCA output
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3     PC4
## Standard deviation     1.3686 1.0493 0.9060 0.45305
## Proportion of Variance 0.4682 0.2752 0.2052 0.05131
## Cumulative Proportion  0.4682 0.7435 0.9487 1.00000
#Draw the scree plot using the screeplot() function
screeplot(pr.out, type = "line", ylim = c(0,2))

#Add a horizontal line as a guide for the Kaiser criterion
abline(h = 1, lty = 3, col = "red")

#Draw the biplot using the biplot() function
biplot(pr.out, scale = 0) #draw first 2 principal components

#Other functions that can be used are autoplot from ggfortify or fviz_pca_biplot from factoextra.
library(ggfortify)
autoplot(pr.out, data = csdattrain, loadings = TRUE, loadings.label = TRUE, scale = 0)

library(factoextra)
fviz_pca_biplot(pr.out, label = "var", habillage=csdattrain$riskrating)

Descriptive Statistics

Before performing dimension reduction with PCA, we first explore the data using descriptive statistical techniques.

##    income          tenor       dependents    midoverdue
## Min.   : 70.0  Min.   :12.0  Min.   :0.000 Min.   :15.00
## 1st Qu.:120.0  1st Qu.:12.0  1st Qu.:1.000 1st Qu.:15.00
## Median :162.0  Median :24.0  Median :3.000 Median :53.00
## Mean   :162.4  Mean   :29.8  Mean   :2.929 Mean   :48.08
## 3rd Qu.:197.0  3rd Qu.:48.0  3rd Qu.:5.000 3rd Qu.:53.00
## Max.   :299.0  Max.   :48.0  Max.   :6.000 Max.   :91.00

From the boxplot of income distribution based on dependents, it can be concluded that there is no big difference in the median or spread of income for individuals with no dependents up to 4 dependents. The median and spread of income for the groups with 5 and 6 dependents are greater than for the previous 5 groups.

The correlation matrix provides information on the strength of the linear relationships between variables. The strongest relationship is between dependents and midoverdue, followed by that between income and dependents.

##              income        tenor   dependents  midoverdue
## income      1.00000000 -0.04649806 0.25180228  0.1599918
## tenor      -0.04649806  1.00000000 0.00526126  0.2100942
## dependents  0.25180228  0.00526126 1.00000000  0.7615983
## midoverdue  0.15999175  0.21009415 0.76159830  1.0000000

Split Dataset

The data used for training and testing need to be separated so that information from the testing set does not “leak” into the data used for training. 80% of the data will be used for training; the rest is allocated for testing.

The sampling technique used is stratified sampling with the risk rating as the stratum. Stratified sampling ensures that every risk rating is represented in the training dataset and prevents a PCA fitted on a training set that contains only certain risk ratings.

The proportion of each risk rating in the training set is mirrored in the testing set. PCA will be applied to the training dataset only. An equivalent, more compact version of the split code is sketched below.
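
For reference, the step-by-step split above can be written more compactly in base R (a sketch; the exact rows drawn may differ from the version above because the sampling order differs):

set.seed(100)
#split the row numbers by risk rating, then sample 80% within each stratum
train_index <- unlist(lapply(split(seq_len(nrow(csdat_raw)), csdat_raw$riskrating),
                             function(idx) sample(idx, round(0.8 * length(idx)))))
csdattrain <- csdat_raw[train_index, ]
csdattest  <- csdat_raw[-train_index, ]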

Apply Principal Component Analysis

## Standard deviations (1, .., p=4):
## [1] 1.3709060 1.0347230 0.9176641 0.4559141
## 
## Rotation (n x k) = (4 x 4):
##                PC1         PC2       PC3         PC4
## income     -0.3038356 -0.5337512 -0.7861956  0.06848411
## tenor      -0.1470943  0.8240099 -0.5179407 -0.17637573
## dependents -0.6649190 -0.1002035  0.2647878 -0.69117970
## midoverdue -0.6662807  0.1614826  0.2086176  0.69747555

From the rotation output, the Principal Component loadings are:

PC1 = (-0.304)income + (-0.147)tenor + (-0.665)dependents + (-0.666)midoverdue

PC2 = (-0.534)income + (0.824)tenor + (-0.1)dependents + (0.161)midoverdue

PC3 = (-0.786)income + (-0.518)tenor + (0.265)dependents + (0.209)midoverdue

PC4 = (0.068)income + (-0.176)tenor + (-0.691)dependents + (0.697)midoverdue

where each PC is a weighted average of the income through midoverdue variables. In PC1 the dependents and midoverdue variables are dominant, while in PC2 the dominant variables are income and tenor. PC3 and PC4 repeat the same pattern as PC1 and PC2.

Choose Number of Principal Components

## Importance of components:
##                          PC1   PC2     PC3   PC4
## Standard deviation     1.3709 1.0347 0.9177 0.45591
## Proportion of Variance 0.4698 0.2677 0.2105 0.05196
## Cumulative Proportion  0.4698 0.7375 0.9480 1.00000

Eigenvalues

## [1] 1.8793832 1.0706518 0.8421074 0.2078577

Applying the Kaiser criterion to the scree plot results in 2 PCs, PC1 and PC2, which together explain about 74% of the variability in the data. This is also consistent with the inspection of the rotation in the previous section, where PC3 and PC4 repeat the patterns of PC1 and PC2.

At this stage the dataset with 4 numerical variables of income, tenor, dependents and midoverdue has been successfully reduced to 2 “variables” PC1 and PC2.

Output Visualization

In the biplot, it can be seen that the dependents and midoverdue variables contribute the most in the direction of PC1, while the income and tenor variables contribute in the direction of PC2. PC1 can therefore be named “Expenses”, and PC2 can be named “Ability to Pay”.

The angle between two variable vectors shows the strength of the relationship between the two variables. The acute angle between midoverdue and dependents indicates a strong correlation, while the obtuse angle between income and tenor indicates a weak (slightly negative) correlation.

##              income      tenor     dependents  midoverdue
## income     1.00000000 -0.04649806  0.25180228  0.1599918
## tenor     -0.04649806  1.00000000  0.00526126  0.2100942
## dependents 0.25180228  0.00526126  1.00000000  0.7615983
## midoverdue 0.15999175  0.21009415  0.76159830  1.0000000

Case Study 2

We will use credit rating data to practice using the PCA method. We will reduce a data set with 8 numerical predictor variables to 2 Principal Components.

## 'data.frame'  : 900 obs. of 10 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income      : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor       : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents  : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue  : num 76 76 0 53 38 15 38 38 38 38 ...
## $ riskrating  : num 4 4 1 3 2 1 2 2 2 2 ...
## $ age         : num 55 53 35 45 36 45 21 24 35 26 ...
## $ empyear     : num 12 10 5 7 5 8 1 1 6 2 ...
## $ asset       : num 893 906 552 791 593 ...
## $ debt        : num 4.6984 4.0639 0.05 0.7214 0.0667 ...

The descriptions of the variables in this credit rating dataset are as follows:

  • contractcode is the contract number

  • income is annual income in millions of rupiah

  • tenor is the loan duration

  • dependents is the number of dependents of the applicant

  • midoverdue is the average delay in loan payments in days

  • riskrating is the credit risk rating (1 to 5)

  • age is the age of the applicant

  • empyear is the number of years the applicant has worked at their last job

  • asset is the total value of the applicant's assets

  • debt is the amount of the applicant's loans in millions of rupiah

The response variable in this dataset is the risk rating. The other variables are predictor variables. The variables that will be used for PCA analysis are the predictor variables with numeric data types, namely:

## [1] "income" "tenor" "dependents" "midoverdue" "age"
## [6] "empyear"  "asset" "debt"
library(openxlsx)

cslarge_raw <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet = "cslarge")
head(cslarge_raw)
##   contractcode income tenor dependents midoverdue riskrating age empyear
## 1   AGR-000001    295    48          5         76          4  55      12
## 2   AGR-000011    271    36          5         76          4  53      10
## 3   AGR-000030    159    12          0          0          1  35       5
## 4   AGR-000043    210    12          3         53          3  45       7
## 5   AGR-000049    165    36          0         38          2  36       5
## 6   AGR-000063    220    24          5         15          1  45       8
##      asset       debt
## 1 892.9266 4.69837074
## 2 905.8225 4.06385168
## 3 551.7261 0.05000000
## 4 791.1124 0.72138396
## 5 592.6501 0.06666667
## 6 778.0493 2.59791099
str(cslarge_raw) #data structure
## 'data.frame':    900 obs. of  10 variables:
##  $ contractcode: chr  "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
##  $ income      : num  295 271 159 210 165 220 70 88 163 100 ...
##  $ tenor       : num  48 36 12 12 36 24 36 48 48 36 ...
##  $ dependents  : num  5 5 0 3 0 5 3 3 5 6 ...
##  $ midoverdue  : num  76 76 0 53 38 15 38 38 38 38 ...
##  $ riskrating  : num  4 4 1 3 2 1 2 2 2 2 ...
##  $ age         : num  55 53 35 45 36 45 21 24 35 26 ...
##  $ empyear     : num  12 10 5 7 5 8 1 1 6 2 ...
##  $ asset       : num  893 906 552 791 593 ...
##  $ debt        : num  4.6984 4.0639 0.05 0.7214 0.0667 ...
summary(cslarge_raw) #descriptive statistics
##  contractcode           income          tenor         dependents   
##  Length:900         Min.   : 70.0   Min.   :12.00   Min.   :0.000  
##  Class :character   1st Qu.:121.0   1st Qu.:12.00   1st Qu.:1.000  
##  Mode  :character   Median :162.0   Median :24.00   Median :3.000  
##                     Mean   :163.3   Mean   :29.93   Mean   :2.932  
##                     3rd Qu.:199.0   3rd Qu.:48.00   3rd Qu.:5.000  
##                     Max.   :300.0   Max.   :48.00   Max.   :6.000  
##    midoverdue      riskrating         age           empyear      
##  Min.   : 0.00   Min.   :1.000   Min.   :20.00   Min.   : 0.000  
##  1st Qu.:15.00   1st Qu.:1.000   1st Qu.:29.00   1st Qu.: 3.000  
##  Median :53.00   Median :3.000   Median :36.00   Median : 5.000  
##  Mean   :48.16   Mean   :2.681   Mean   :35.88   Mean   : 5.153  
##  3rd Qu.:53.00   3rd Qu.:3.000   3rd Qu.:42.00   3rd Qu.: 7.000  
##  Max.   :91.00   Max.   :5.000   Max.   :61.00   Max.   :13.000  
##      asset             debt        
##  Min.   : 232.2   Min.   : 0.0500  
##  1st Qu.: 440.0   1st Qu.: 0.6469  
##  Median : 555.0   Median : 2.0253  
##  Mean   : 571.0   Mean   : 3.8091  
##  3rd Qu.: 687.8   3rd Qu.: 5.1902  
##  Max.   :1192.3   Max.   :23.5382
"Visualize distribution of income based on dependents"
## [1] "Visualize distribution of income based on dependents"
library(ggplot2)
ggplot(cslarge_raw, aes(as.factor(dependents), income)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Income by Dependents")

"Visualize distribution of debt based on dependents"
## [1] "Visualize distribution of debt based on dependents"
ggplot(cslarge_raw, aes(as.factor(dependents), debt)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Debt by Dependents")

"Split data into training set and testing set for each risk rating"
## [1] "Split data into training set and testing set for each risk rating"
#Record the row indices for each risk rating
index1 <- which(cslarge_raw$riskrating == 1)
index2 <- which(cslarge_raw$riskrating == 2)

#Record the indices for the remaining risk ratings
index3 <- which(cslarge_raw$riskrating == 3)
index4 <- which(cslarge_raw$riskrating == 4)
index5 <- which(cslarge_raw$riskrating == 5)

#80% of the data will be used as the training set.
ntrain1 <- round(0.8 * length(index1))
ntrain2 <- round(0.8 * length(index2))
ntrain3 <- round(0.8 * length(index3))
ntrain4 <- round(0.8 * length(index4))
ntrain5 <- round(0.8 * length(index5))

#set the seed so that this sampling can be reproduced
set.seed(100)

#sample the data of each rating for the training set
train1_index <- sample(index1, ntrain1)
train2_index <- sample(index2, ntrain2)
train3_index <- sample(index3, ntrain3)
train4_index <- sample(index4, ntrain4)
train5_index <- sample(index5, ntrain5)

#keep the remaining data for the testing set
test1_index <- setdiff(index1, train1_index)
test2_index <- setdiff(index2, train2_index)
test3_index <- setdiff(index3, train3_index)
test4_index <- setdiff(index4, train4_index)
test5_index <- setdiff(index5, train5_index)

#Combine the samples for each risk rating into the training set
cslarge_train <- do.call("rbind", list(cslarge_raw[train1_index,],
cslarge_raw[train2_index,], cslarge_raw[train3_index,],
cslarge_raw[train4_index,], cslarge_raw[train5_index,]))
cstrain <- subset(cslarge_train, select = -c(contractcode,riskrating))

#Combine the samples for each risk rating into the testing set
cslarge_test <- do.call("rbind", list(cslarge_raw[test1_index,],
cslarge_raw[test2_index,], cslarge_raw[test3_index,],
cslarge_raw[test4_index,], cslarge_raw[test5_index,]))
cstest <- subset(cslarge_test, select = -c(contractcode,riskrating))

#Compute the correlations between variables
cor(cstrain)
##                 income       tenor dependents midoverdue         age
## income      1.00000000 -0.07256604  0.2427909 0.12756439  0.98345799
## tenor      -0.07256604  1.00000000  0.0334339 0.23170789 -0.07433539
## dependents  0.24279088  0.03343390  1.0000000 0.76422579  0.23110488
## midoverdue  0.12756439  0.23170789  0.7642258 1.00000000  0.12134786
## age         0.98345799 -0.07433539  0.2311049 0.12134786  1.00000000
## empyear     0.97526554 -0.07152033  0.2499218 0.13267309  0.92632487
## asset       0.86275940 -0.07265391  0.1901048 0.09832691  0.93780421
## debt        0.16255937  0.08842241  0.6952961 0.70521333  0.15994524
##                empyear       asset       debt
## income      0.97526554  0.86275940 0.16255937
## tenor      -0.07152033 -0.07265391 0.08842241
## dependents  0.24992178  0.19010482 0.69529606
## midoverdue  0.13267309  0.09832691 0.70521333
## age         0.92632487  0.93780421 0.15994524
## empyear     1.00000000  0.74882989 0.16008510
## asset       0.74882989  1.00000000 0.13959529
## debt        0.16008510  0.13959529 1.00000000
#Draw the correlation matrix with ggcorrplot
library(ggcorrplot)
ggcorrplot(cor(cstrain))

#Perform the PCA analysis with the prcomp() function and
#store the output in an object named pr.out
pr.out <- prcomp(cstrain, scale = TRUE, center = TRUE)

#Display the PCA output by calling the pr.out object
pr.out
## Standard deviations (1, .., p=8):
## [1] 1.9886275 1.5061223 0.9869479 0.5697568 0.5168331 0.4513118 0.0807862
## [8] 0.0306914
## 
## Rotation (n x k) = (8 x 8):
##                    PC1        PC2         PC3         PC4         PC5
## income     -0.48037618  0.1732370 -0.04956696  0.05692772 -0.21996290
## tenor       0.02300648 -0.1921475 -0.96534791 -0.02622698 -0.04016120
## dependents -0.23802064 -0.5058409  0.19514188  0.46401430  0.04509053
## midoverdue -0.18278120 -0.5638239 -0.03401568  0.29673063  0.22083181
## age        -0.48194139  0.1807062 -0.05221971 -0.04070810  0.10019847
## empyear    -0.46111279  0.1571628 -0.04049775  0.17088762 -0.57056947
## asset      -0.44138937  0.1785155 -0.05058642 -0.22043915  0.71311191
## debt       -0.19730574 -0.5196857  0.13958206 -0.78310912 -0.23470343
##                     PC6          PC7           PC8
## income      0.053169773  0.560979083  6.063182e-01
## tenor      -0.168350384 -0.003290186 -9.006944e-05
## dependents -0.657287184  0.004360366 -1.701761e-03
## midoverdue  0.714643123  0.002246697  1.006382e-03
## age         0.002608534  0.386800275 -7.556671e-01
## empyear     0.111545448 -0.623731435 -6.927760e-02
## asset      -0.102418913 -0.382877667  2.377850e-01
## debt       -0.056134329 -0.004113309  6.814070e-04
#Display the summary of the PCA output
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9886 1.5061 0.9869 0.56976 0.51683 0.45131 0.08079
## Proportion of Variance 0.4943 0.2836 0.1218 0.04058 0.03339 0.02546 0.00082
## Cumulative Proportion  0.4943 0.7779 0.8996 0.94022 0.97361 0.99907 0.99988
##                            PC8
## Standard deviation     0.03069
## Proportion of Variance 0.00012
## Cumulative Proportion  1.00000
#Draw the scree plot using the screeplot() function
screeplot(pr.out, type = "line", ylim = c(0,2))

#Add a horizontal line as a guide for the Kaiser criterion
abline(h = 1, lty = 3, col = "red")

#Draw the biplot using the biplot() function
biplot(pr.out, scale = 0) #draw first 2 principal components

#Plot the Principal Components and risk rating using the
#autoplot() function from the ggfortify package.
library(ggfortify)
autoplot(pr.out, data = cslarge_train, colour = 'riskrating',
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 3, scale = 0)

#Plot the Principal Components and risk rating using the
#fviz_pca_ind() function from the factoextra package.
library(factoextra)
fviz_pca_ind(pr.out, label="none", habillage=cslarge_train$riskrating)

Descriptive Statistics

Before performing dimension reduction with PCA, we first explore the data using descriptive statistical techniques.

##    income          tenor       dependents    midoverdue
## Min.   : 70.0  Min.   :12.0  Min.   :0.000  Min.   :15.00
## 1st Qu.:120.0  1st Qu.:12.0  1st Qu.:1.000  1st Qu.:15.00
## Median :162.0  Median :24.0  Median :3.000  Median :53.00
## Mean   :162.4  Mean   :29.8  Mean   :2.929  Mean   :48.14
## 3rd Qu.:197.0  3rd Qu.:48.0  3rd Qu.:5.000  3rd Qu.:53.00
## Max.   :299.0  Max.   :48.0  Max.   :6.000  Max.   :91.00
##          age     empyear       asset               debt
## Min.   :20.00  Min.   : 0.000  Min.   : 232.2  Min.   : 0.0500
## 1st Qu.:29.00  1st Qu.: 3.000  1st Qu.: 444.3  1st Qu.: 0.6469
## Median :36.00  Median : 5.000  Median : 553.2  Median : 2.0416
## Mean   :35.74  Mean   : 5.108  Mean   : 568.7  Mean   : 3.7966
## 3rd Qu.:41.00  3rd Qu.: 7.000  3rd Qu.: 682.7  3rd Qu.: 5.1902
## Max.   :59.00  Max.   :13.000  Max.   :1100.3  Max.   :23.5382

From the boxplot of income based on dependents, it can be concluded that there is no big difference in the median or spread of income for individuals with no dependents up to 4 dependents. The median and spread of income for the groups with 5 and 6 dependents are greater than for the previous 5 groups. The same pattern can be observed in the boxplot of debt based on dependents: the median and spread of debt increase with the number of dependents. The correlation matrix provides information on the strength of the linear relationships between variables. The strongest relationships seen are between income and age and between income and empyear, followed by the relationships of dependents with midoverdue and with debt.

##              income      tenor    dependents midoverdue    age
## income     1.00000000 -0.04649806 0.25180228 0.1624426  0.9834652
## tenor     -0.04649806  1.00000000 0.00526126 0.2083705 -0.0466877
## dependents 0.25180228  0.00526126 1.00000000 0.7625827  0.2430167
## midoverdue 0.16244264  0.20837054 0.76258273 1.0000000  0.1551441
## age        0.98346517 -0.04668770 0.24301671 0.1551441  1.0000000
## empyear    0.97444733 -0.04796359 0.25394067 0.1645664  0.9251052
## asset      0.86337873 -0.04804028 0.20436953 0.1224551  0.9382740
## debt       0.15557952  0.05837108 0.69737720 0.7000607  0.1556199
##              empyear     asset       debt
## income     0.97444733  0.86337873 0.15557952
## tenor     -0.04796359 -0.04804028 0.05837108
## dependents 0.25394067  0.20436953 0.69737720
## midoverdue 0.16456638  0.12245511 0.70006065
## age        0.92510523  0.93827403 0.15561994
## empyear    1.00000000  0.74756395 0.15089249
## asset      0.74756395  1.00000000 0.13877713
## debt       0.15089249  0.13877713 1.00000000

Split Dataset

The data used for training and testing need to be separated so that information from the testing set does not “leak” into the data used for training. 80% of the data will be used for training; the rest is allocated for testing.

The sampling technique used is stratified sampling with the risk rating as the stratum. Stratified sampling ensures that every risk rating is represented in the training dataset and prevents a PCA fitted on a training set that contains only certain risk ratings.

The proportion of each risk rating in the training set is mirrored in the testing set. PCA will be applied to the training dataset only.

PCA with prcomp Function

## Standard deviations (1, .., p=8):
## [1] 1.99490861 1.48867836 0.99921838 0.57081733 0.51435251 0.45579721
## [7] 0.08168404 0.03040859
## 
## Rotation (n x k) = (8 x 8):
##                PC1         PC2         PC3         PC4         PC5
## income     -0.47706407 -0.1852419 -0.03775491  0.06523892 -0.22008106
## tenor       0.01081998  0.1511663 -0.97111786 -0.04289554 -0.02097040
## dependents -0.24696246  0.5063808  0.17722837  0.42331675  0.14590700
## midoverdue -0.20336992  0.5550113 -0.06515053  0.33185665  0.14679694
## age        -0.47873792 -0.1917264 -0.03983856 -0.04920586  0.09776723
## empyear    -0.45716825 -0.1703323 -0.03007422  0.19373502 -0.57710422
## asset      -0.43841511 -0.1890776 -0.03521731 -0.26268660  0.70389438
## debt       -0.19885270  0.5261678  0.12695805 -0.77176371 -0.26497429
##                PC6           PC7           PC8
## income      0.021707104 -0.5700871556 -0.5987549883
## tenor      -0.177964730  0.0031881769 -0.0006072532
## dependents -0.671312799 -0.0052066743  0.0020444245
## midoverdue  0.717392848  0.0048806574 -0.0020012051
## age         0.016433973 -0.3761733102  0.7607284732
## empyear     0.023860374  0.6214289709  0.0612683219
## asset      -0.007010088  0.3837436091 -0.2429438834
## debt       -0.040687232  0.0002399125 -0.0007503969

From the rotation output, the Principal Component loadings are:

PC1 = (-0.477)income + (0.011)tenor + (-0.247)dependents + (-0.203)midoverdue + (-0.479)age + (-0.457)empyear + (-0.438)asset + (-0.199)debt

PC2 = (-0.185)income + (0.151)tenor + (0.506)dependents + (0.555)midoverdue + (-0.192)age + (-0.170)empyear + (-0.189)asset + (0.526)debt

PC3 = (-0.038)income + (-0.971)tenor + (0.177)dependents + (-0.065)midoverdue + (-0.040)age + (-0.030)empyear + (-0.035)asset + (0.127)debt

PC4 = (0.065)income + (-0.043)tenor + (0.423)dependents + (0.332)midoverdue + (-0.049)age + (0.194)empyear + (-0.263)asset + (-0.772)debt

and so on up to PC8, where each PC is a weighted average of all 8 variables. In PC1 the income, age, empyear and asset variables make a large contribution. In PC2 the midoverdue, dependents and debt variables are dominant. In PC3 the most dominant variable is tenor.

Choose Number of Principal Components

## Importance of components:
##                          PC1   PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.9949 1.4887 0.9992 0.57082 0.51435 0.45580
## Proportion of Variance 0.4975 0.2770 0.1248 0.04073 0.03307 0.02597
## Cumulative Proportion  0.4975 0.7745 0.8993 0.94001 0.97308 0.99905
##                          PC7     PC8
## Standard deviation     0.08168 0.03041
## Proportion of Variance 0.00083 0.00012
## Cumulative Proportion  0.99988 1.00000

Eigenvalues

## [1] 3.9796603540 2.2161632727 0.9984373775 0.3258324245 0.2645585093
## [6] 0.2077510974 0.0066722822 0.0009246825

Applying the Kaiser criterion to the scree plot results in 2 PCs, PC1 and PC2, which together explain about 77% of the variability in the data. This is also consistent with the inspection of the rotation in the previous section; PC3, whose eigenvalue (0.998) falls just below the Kaiser cutoff, is dominated by tenor alone.

At this stage the dataset with 8 numerical variables has been successfully reduced to 2 “variables” PC1 and PC2.

## 'data.frame': 720 obs. of 8 variables:
## $ income    : num 225 157 118 104 198 189 133 149 175 230 ...
## $ tenor     : num 24 24 12 24 12 36 12 24 12 12 ...
## $ dependents: num 0 0 1 6 1 2 2 0 1 0 ...
## $ midoverdue: num 15 15 15 15 15 15 15 15 15 15 ...
## $ age       : num 47 34 29 28 41 38 32 36 37 45 ...
## $ empyear   : num 8 5 3 2 7 7 3 4 6 9 ...
## $ asset     : num 841 510 433 455 693 ...
## $ debt      : num 0.05 0.05 0.974 3.15 0.352 ...

Output Visualization

In the biplot, it can be seen that the income, age, empyear and asset variables contribute the most in the direction of PC1, while the dependents, midoverdue, and debt variables contribute the most in the direction of PC2. PC1 can be named “Payability” and PC2 can be named “Loads and Risks”.

The angle between two variable vectors shows the strength of the relationship between the two variables. The acute angle between midoverdue and dependents indicates a strong correlation, while the obtuse angle between income and tenor indicates a weak (slightly negative) correlation.

##              income      tenor    dependents midoverdue    age
## income     1.00000000 -0.04649806 0.25180228 0.1624426  0.9834652
## tenor     -0.04649806  1.00000000 0.00526126 0.2083705 -0.0466877
## dependents 0.25180228  0.00526126 1.00000000 0.7625827  0.2430167
## midoverdue 0.16244264  0.20837054 0.76258273 1.0000000  0.1551441
## age        0.98346517 -0.04668770 0.24301671 0.1551441  1.0000000
## empyear    0.97444733 -0.04796359 0.25394067 0.1645664  0.9251052
## asset      0.86337873 -0.04804028 0.20436953 0.1224551  0.9382740
## debt       0.15557952  0.05837108 0.69737720 0.7000607  0.1556199
##              empyear     asset       debt
## income     0.97444733  0.86337873 0.15557952
## tenor     -0.04796359 -0.04804028 0.05837108
## dependents 0.25394067  0.20436953 0.69737720
## midoverdue 0.16456638  0.12245511 0.70006065
## age        0.92510523  0.93827403 0.15561994
## empyear    1.00000000  0.74756395 0.15089249
## asset      0.74756395  1.00000000 0.13877713
## debt       0.15089249  0.13877713 1.00000000

Predict Test Set

The purpose of performing PCA on this data set is to reduce the dimensions so that the data can be used as input for other algorithms. R provides the predict() function to apply a fitted PCA to new data.

predobject <- predict(pcaobject, newdata = newdataset)

Note that the PCA object passed to predict() must be the one generated from the training data set, while newdata is the testing data set. The matrix of scores returned by predict() can be saved for use as input to other algorithms.
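
For example, reusing the pr.out and cstest objects from Case Study 2 (a sketch):

pred_test <- predict(pr.out, newdata = cstest)  # project the testing set onto the training PCs
head(pred_test[, 1:2])                          # keep only PC1 and PC2 as input for another model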

Linear Algebra

Matrix Addition and Multiplication

Two matrices can be added if they have the same dimensions, i.e. m × n.

Two matrices can be multiplied if the number of columns in the first matrix equals the number of rows in the second matrix, i.e. A(m×n)B(n×p) = C(m×p)
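
A minimal sketch of these dimension rules in R, with made-up matrices:

A <- matrix(1:6, nrow = 2, ncol = 3)  # a 2 x 3 matrix
B <- matrix(1:6, nrow = 3, ncol = 2)  # a 3 x 2 matrix
A + A      # addition requires identical dimensions (2 x 3)
A %*% B    # multiplication: (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix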

Linear Coordinates and Combinations

The motivation for discussing linear combinations comes from the need to express the data as a linear combination of principal components, where each principal component is itself a linear combination of the predictor variables. The coordinates of a vector can be written as a linear combination of the basis vectors.

In the original coordinate system, the coordinates of the example vector are c1 = 1 and c2 = 2. The coordinates of this vector change when the basis changes; in the new coordinate system, the coordinates of the same vector are c1 = 1 and c2 = 1.
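
A minimal sketch of this change of coordinates in R, with a hypothetical vector and basis:

v <- c(3, 2)                  # the vector in the standard basis
B <- cbind(c(1, 0), c(1, 1))  # columns of B are the new basis vectors
c_new <- solve(B, v)          # coordinates of v in the new basis
B %*% c_new                   # reconstructs v as a linear combination of the basis vectors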

Eigenvalues and Eigenvectors

The eigenvalues and eigenvectors are the pairs of a scalar λ and a nonzero vector v that satisfy the equation Av = λv.

The eigenvalues are calculated by finding the roots of the characteristic polynomial det(A − λI) = 0. The determinant of the matrix (A − λI) for a square matrix A(3×3) can be calculated by the Laplace cofactor expansion.

The eigenvectors are calculated by substituting each eigenvalue into the equation Av = λv and solving the resulting system of linear equations.

Each unique eigenvalue will produce an eigenvector.

Try

# Type the following command to read the help page about matrices
?matrix
## starting httpd help server ... done
# Create a 3 x 3 matrix and store it under the name A.
A <- matrix(c(1, 1, 0, 0, -2, 1, 0, 0, 3), nrow = 3, ncol = 3, byrow = TRUE)

# Write the command to display matrix A
A
##      [,1] [,2] [,3]
## [1,]    1    1    0
## [2,]    0   -2    1
## [3,]    0    0    3
# Write the R command to calculate the eigenvalues and eigenvectors
# and store the result in the variable ev
ev <- eigen(A)

# Write the command to inspect the structure of the eigen object
str(ev)
## List of 2
##  $ values : num [1:3] 3 -2 1
##  $ vectors: num [1:3, 1:3] 0.0976 0.1952 0.9759 -0.3162 0.9487 ...
##  - attr(*, "class")= chr "eigen"
# Write the command to view the output
ev
## eigen() decomposition
## $values
## [1]  3 -2  1
## 
## $vectors
##            [,1]       [,2] [,3]
## [1,] 0.09759001 -0.3162278    1
## [2,] 0.19518001  0.9486833    0
## [3,] 0.97590007  0.0000000    0
# Write the command to access the eigenvalues
ev$values
## [1]  3 -2  1
# Write the command to access the eigenvectors
ev$vectors
##            [,1]       [,2] [,3]
## [1,] 0.09759001 -0.3162278    1
## [2,] 0.19518001  0.9486833    0
## [3,] 0.97590007  0.0000000    0

Deficiency of Principal Component Analysis

In practice, Principal Components are given separate names based on a summary of their dominant variables. This requires domain knowledge, and there are times when the variables that load together do not fit the theory in the domain.

The way PCA works in reducing dimensions is to form Principal Components that provide the greatest data variability. The limitation of PCA is that the PC selection process is only carried out with predictor variables. Therefore PCA should not be used as a model but used as a data preprocessing technique to then be used as input for other methods. Another alternative is to use the Partial Least Squares method which involves the response variable in dimension reduction.