The following are examples of variables commonly used for credit scoring:
Applicant demographics: gender, age, education, profession, and work experience
Disposable income: income, loan amount, loan tenor, and number of dependents
Credit history: loan amount, overdue payments, and nominal payment per period
Assets: residence ownership status and other assets owned
Datasets for credit rating generally have many variables. The purpose of recording so many variables is to build a model that classifies credit ratings accurately. However, a large number of variables also brings its own problems: redundant variables, difficulty in visualization, difficulty in explaining the model, and the amount of storage required. This is where the Principal Component Analysis statistical technique can play a role, namely as an unsupervised algorithm that reduces the number of variables (dimension reduction) to be used as input for other algorithms without greatly reducing the quality of the rating prediction.
Machine learning algorithms are broadly divided into two groups: supervised learning and unsupervised learning. Supervised learning aims to find a function that maps the attribute values of an instance to the value of its target attribute. Some examples of supervised learning are the C5.0 decision tree algorithm and linear regression analysis. Unsupervised learning aims to find patterns in the data without the help of a target attribute. One example of unsupervised learning is the k-means algorithm used for customer segmentation.
Principal Component Analysis (PCA) is one of the dimension reduction methods in machine learning. PCA selects the “variables” that are able to explain most of the variability of the data. If you are familiar with regression analysis, the concept of variability is similar to the coefficient of determination R². PCA reduces the dimensions by forming new variables called Principal Components, which are linear combinations of the old variables. For example, suppose a data set has 3 variables: x1, x2 and x3. A principal component is then a weighted sum such as:
PC1 = 0.674x1 + 0.664x2 + 0.330x3
What is meant by linear in this context is linear in the variables. In the example PC1 = 0.674x1 + 0.664x2 + 0.330x3, the variables x1, x2 and x3 each appear to the power of 1. In the counterexample PC3 = 0.5x1² + 0.664√x2 + log2(0.330x3), the variables x1, x2 and x3 are not to the power of 1, so the expression is not linear. A Principal Component is selected with the following conditions:
each Principal Component captures the largest possible variance,
each Principal Component has no correlation with the other PCs,
and the vector containing the coefficients of each PC has length 1 (a quick numeric check follows this list).
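As a quick numeric check of the third condition, the PC1 coefficient vector from the example above has approximately unit length; the small deviation comes from rounding the published coefficients:
pc1_coef <- c(0.674, 0.664, 0.330)  # coefficients of PC1 from the example above
sqrt(sum(pc1_coef^2))               # ~1.002, i.e. length 1 up to rounding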
The first picture below shows that the data variability is higher in the direction of the first Principal Component (blue line) than in the direction of the second Principal Component (red line). In the second picture, the data vary more in the direction of the X axis than in the direction of the Y axis, so the variability in the data can be approximated using only Principal Component 1, which lies parallel to the X axis.
A principal component is analogous to a skewer. In the first image we place the first skewer (blue) so that it pierces as many of the scattered pieces of meat as possible. The meat not pierced by the first skewer is then pierced by the second skewer (red). In the second picture most of the pieces of meat lie parallel to the X axis, so only one skewer (blue) is needed.
The basic concept of PCA is to approximate the data by finding the directions that provide the greatest variability and placing the principal component “axes” so that the correlation between the resulting variables is minimal. The variance and Principal Component calculations can be done using the concepts of eigenvalues and eigenvectors from Linear Algebra. The eigenvalues indicate the contribution of each principal component to the data variance, and the eigenvectors give the coefficients of the Principal Components. The number of Principal Components to select is adjusted to the desired constraints or chosen using a scree plot.
Assumptions that must be met so that the Principal Component Analysis technique can run well:
variable type is numeric,
the variables that become the input for PCA are predictor variables,
the predictor variables have a linear relationship,
and all variables must be standardized (centered and scaled).
The following are the steps required to perform Principal Component Analysis (a compact R sketch mapping these steps to functions follows the list):
Prepare the data (data standardization)
Calculate the covariance matrix or correlation matrix
Calculate the eigenvalues and eigenvectors of the correlation matrix
Select the principal components
Visualize the output
Calculate the new scores
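As a rough sketch, the steps above map to R functions as follows, assuming df is a numeric data frame (each step is worked out in detail in the sections below):
df_std <- scale(df, center = TRUE, scale = TRUE)  # 1. standardize the data
cormat <- cor(df_std)                             # 2. correlation matrix
eig <- eigen(cormat)                              # 3. eigenvalues and eigenvectors
pr.out <- prcomp(df_std)                          #    (prcomp() wraps steps 2-3)
screeplot(pr.out, type = "line")                  # 4. select the number of PCs
biplot(pr.out, scale = 0)                         # 5. visualize the output
scores <- df_std %*% eig$vectors                  # 6. scores in the new coordinates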
The principal component analysis technique works by choosing the principal components that provide the largest variance. This makes the method very sensitive to differing variable values and variable scales. So that PCA does not “choose” the wrong variables, the input variables need to be standardized. Standardization is done by subtracting the variable mean from each observation and dividing by the variable's standard deviation.
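A minimal sketch of this standardization on a small, made-up numeric vector, showing that the manual formula matches R's scale() function:
x <- c(70, 121, 162, 199, 300)     # hypothetical values for illustration
z_manual <- (x - mean(x)) / sd(x)  # subtract the mean, divide by the sd
z_scale <- as.numeric(scale(x))    # scale() does the same thing
all.equal(z_manual, z_scale)       # TRUE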
The correlation coefficient shows the strength of the linear relationship between 2 variables. The relationships between all pairs of predictor variables in the data can be seen in the correlation matrix. A high correlation coefficient (close to 1 for a linear relationship or close to -1 for a linear relationship in the opposite direction) is an indication of redundancy. This is where the PCA technique can be used to reduce dimensions by selecting new, uncorrelated “variables” that can explain most of the data variability.
The correlation matrix provides information about the strength of the linear relationships between variables. Our goal in performing PCA is to select the directions that provide the greatest variance while minimizing the correlation between the new variables. This is done by calculating the eigenvalues and eigenvectors of the correlation matrix. The “new variable” principal components are obtained from the eigenvectors and are linear combinations of the old variables; the coefficients indicate the weight of each old variable in each principal component.
How many principal components should be selected? There are several criteria that can be used to determine k:
a variability limit, for example 80%;
the Kaiser criterion: select all PCs whose eigenvalues are greater than 1.
Choose k Principal Components
We will choose the first k Principal Components whose eigenvalues are greater than 1, up to a certain variability limit. Why choose eigenvalues greater than 1? A PC whose eigenvalue is more than 1 explains more of the data variability than any single original variable does. The contribution of a principal component to the variability of the data can be calculated with the formula
λj / (λ1 + λ2 + … + λp) = λj / p,
where λj is the j-th eigenvalue and p is the total number of variables.
Visualization of the PCA output can be done using a biplot. This plot displays the scores and loadings in the same graph, using the 2 most important Principal Components: the horizontal axis shows the first PC and the vertical axis shows the second PC. Variables that contribute heavily to PC1 are drawn in the horizontal direction, and variables that contribute heavily to PC2 are drawn in the vertical direction. The old variable scores are transformed into the new variable scores using the formula
Z = XV,
where X is the matrix of standardized data and V is the matrix whose columns are the eigenvectors (the loadings).
The following are the packages and R functions that will be used:
cor() from the stats library to calculate the correlation matrix
eigen() from the base library to calculate eigenvalues and eigenvectors
prcomp() from the stats library for Principal Component Analysis
princomp() from the stats library for Principal Component Analysis
screeplot() from the stats library to create a scree plot
biplot() from the stats library to create a biplot
autoplot() from the ggfortify package to create a biplot
fviz_pca_ind() from the factoextra package to plot the observations on the principal components
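The stats and base packages ship with R. The add-on packages may need a one-time installation before they can be loaded; a minimal setup sketch:
# install.packages(c("openxlsx", "ggfortify", "factoextra", "ggcorrplot"))
library(openxlsx)    # read.xlsx() for reading the example data
library(ggfortify)   # autoplot() for prcomp objects
library(factoextra)  # fviz_eig(), fviz_pca_biplot(), fviz_pca_ind()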
R provides prcomp() and princomp() functions to perform Principal Component Analysis.
prcomp(x, center = TRUE, scale. = FALSE)
Description
x is a numeric data frame or matrix.
R will automatically standardize the data when the center = TRUE and scale. = TRUE arguments are enabled.
The sdev output contains the square roots of the eigenvalues.
The center output contains the mean of each variable (the column means).
The scale output contains the standard deviation of each variable (the column standard deviations).
The sdev output of the principal components is the standard deviation used to calculate the contribution of each principal component to the variability of the data.
The rotation output displays the loadings of the Principal Components, which are used to calculate the data scores in the new coordinate system. The coefficients of the Principal Component vectors indicate the strength of the association between the variables and the Principal Component.
The x output displays the data scores in the “new coordinate system”.
A complete list of parameters can be found in Help by typing ?prcomp in the R console.
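For orientation, the output components described above can be accessed directly from a prcomp object, here called pr.out as in the example that follows:
pr.out$sdev      # standard deviations of the PCs (square roots of the eigenvalues)
pr.out$center    # column means used for centering
pr.out$scale     # column standard deviations used for scaling
pr.out$rotation  # loadings used to compute scores in the new coordinate system
head(pr.out$x)   # scores of the observations in the new coordinate system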
The descriptive statistics output shows that the scale of the x3 variable differs from that of the other two variables. For PCA to work optimally, this data will be standardized using the scale() function in R.
library(openxlsx)
df <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet="3varb")
head(df)
## x1 x2 x3
## 1 8.1 10.6 217.7
## 2 10.6 13.0 173.3
## 3 7.5 9.5 648.5
## 4 14.8 15.8 578.3
## 5 11.0 13.3 -44.4
## 6 7.5 9.5 -53.7
#standardize variables (centering and scaling)
df <- scale(df, center = TRUE, scale = TRUE)
head(df, 3)
## x1 x2 x3
## 1 -0.906102243 -0.7416508 -0.3747010
## 2 0.007307276 0.1550555 -0.4843659
## 3 -1.125320527 -1.1526412 0.6893456
In this dataset the variables x1 and x2 have a strong linear correlation (r12 = 0.987), while the pairs x2 and x3 and x1 and x3 are not as strongly correlated as x1 and x2 (r13 = 0.299, r23 = 0.258). The scatter plot leads to a similar conclusion.
cormat <- cor(df)
cormat
## x1 x2 x3
## x1 1.0000000 0.9869242 0.2994195
## x2 0.9869242 1.0000000 0.2584419
## x3 0.2994195 0.2584419 1.0000000
eig <- eigen(cormat)
eig
## eigen() decomposition
## $values
## [1] 2.12525679 0.86259104 0.01215218
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.6704698 -0.2112281 0.71123334
## [2,] 0.6640498 -0.2567210 -0.70223374
## [3,] 0.3309200 0.9431209 -0.03185768
Each eigenvalue has a paired eigenvector. In the output above, the eigenvector for the first eigenvalue is the first column of $vectors, and so on. In this example dataset with 3 variables, there are 3 eigenvalues and 3 eigenvectors.
The “new variable” principal components PC1, PC2 and PC3 are obtained from the eigenvectors and are linear combinations of the old variables x1, x2 and x3. Their coefficients show the contribution of each old variable to each principal component.
round(eig$values/ncol(df),3) #contribution of PC1, PC2 and PC3 to data variability
## [1] 0.708 0.288 0.004
round(cumsum(eig$values/ncol(df)),3) #cumulative contribution of PC1, PC2 and PC3
## [1] 0.708 0.996 1.000
"By using the 80% limit, only 2 principal components are needed to explain 99% of the data variability, namely PC1 and PC2."
## [1] "By using the 80% limit, only 2 principal components are needed to explain 99% of the data variability, namely PC1 and PC2."
pr.out <- prcomp(df, scale. = TRUE, center = TRUE)
pr.out
## Standard deviations (1, .., p=3):
## [1] 1.4578260 0.9287578 0.1102369
##
## Rotation (n x k) = (3 x 3):
## PC1 PC2 PC3
## x1 0.6704698 -0.2112281 -0.71123334
## x2 0.6640498 -0.2567210 0.70223374
## x3 0.3309200 0.9431209 0.03185768
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.4578 0.9288 0.11024
## Proportion of Variance 0.7084 0.2875 0.00405
## Cumulative Proportion 0.7084 0.9960 1.00000
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(pr.out, addlabels = TRUE)
screeplot(pr.out, type = "line")
abline(h = 1, lty = 3, col = "red")
A scree plot can be used as a tool to select the number of PCs: choose the number at the “elbow” of the scree plot, the point where a sharp decrease in variance is followed by a flattening of the curve. In this example, the number of PCs selected is 2. If we instead apply the Kaiser criterion to this scree plot, the number of PCs selected is 1.
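Both criteria can also be applied programmatically to the prcomp output; a small sketch, assuming pr.out as above:
prop_var <- pr.out$sdev^2 / sum(pr.out$sdev^2)  # proportion of variance per PC
which(cumsum(prop_var) >= 0.80)[1]              # 80% variability limit: 2 PCs here
sum(pr.out$sdev^2 > 1)                          # Kaiser criterion: 1 PC here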
pr.out$rotation
## PC1 PC2 PC3
## x1 0.6704698 -0.2112281 -0.71123334
## x2 0.6640498 -0.2567210 0.70223374
## x3 0.3309200 0.9431209 0.03185768
biplot(pr.out, scale = 0)
The output of rotation shows the weight of the contribution of each variable to each Principal Component.
In the biplot above, the variables x1 and x2 have a large contribution to PC1 (horizontal direction) and the variable x3 has a large contribution to PC2 (vertical direction).
We have succeeded in reducing the dimensions of the predictor variables from 3 variables to 2 by using Principal Component Analysis. The old variable scores are transformed into the new variable scores using the formula Z = XV, implemented below with the %*% matrix multiplication operator:
head(df) #observation with 3 variables
## x1 x2 x3
## 1 -0.906102243 -0.7416508 -0.3747010
## 2 0.007307276 0.1550555 -0.4843659
## 3 -1.125320527 -1.1526412 0.6893456
## 4 1.541835268 1.2012128 0.5159564
## 5 0.153452799 0.2671437 -1.0220701
## 6 -1.125320527 -1.1526412 -1.0450404
#the value of this observation in the new “coordinate system” with 2 PC
df_new <- df %*% pr.out$rotation
df_new[1:6,1:2]
## PC1 PC2
## 1 -1.22400335 0.02840327
## 2 -0.05242254 -0.49816513
## 3 -1.29178635 1.18374272
## 4 2.00215943 -0.14744625
## 5 -0.05794123 -1.06493057
## 6 -1.86572942 -0.45199295
We will use credit rating data to practice using the PCA method. We will reduce the data set with 4 numerical predictor variables into 2 Principal Components.
## 'data.frame' : 900 obs. of 6 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue : num 75.5 75.5 0 53 38 15 38 38 38 38 ...
## $ riskrating : num 4 4 1 3 2 1 2 2 2 2 ...
The descriptions of the variables in this credit rating dataset are as follows:
contractcode is the contract number
income is income per year in millions of rupiah
tenor is the loan duration
dependents is the number of dependents
midoverdue is the average delay in loan payments in days
riskrating is the credit risk rating
The response variable in this dataset is risk rating. The other variables are predictor variables. The variables that will be used for PCA analysis are predictor variables with numeric data types, namely:
## [1] "income" "tenor" "dependents" "midoverdue"
library(openxlsx)
csdat_raw <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet = "csdata")
head(csdat_raw)
## contractcode income tenor dependents midoverdue riskrating
## 1 AGR-000001 295 48 5 75.5 4
## 2 AGR-000011 271 36 5 75.5 4
## 3 AGR-000030 159 12 0 0.0 1
## 4 AGR-000043 210 12 3 53.0 3
## 5 AGR-000049 165 36 0 38.0 2
## 6 AGR-000063 220 24 5 15.0 1
str(csdat_raw) #data structure
## 'data.frame': 900 obs. of 6 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue : num 75.5 75.5 0 53 38 15 38 38 38 38 ...
## $ riskrating : num 4 4 1 3 2 1 2 2 2 2 ...
summary(csdat_raw) #descriptive statistics
## contractcode income tenor dependents
## Length:900 Min. : 70.0 Min. :12.00 Min. :0.000
## Class :character 1st Qu.:121.0 1st Qu.:12.00 1st Qu.:1.000
## Mode :character Median :162.0 Median :24.00 Median :3.000
## Mean :163.3 Mean :29.93 Mean :2.932
## 3rd Qu.:199.0 3rd Qu.:48.00 3rd Qu.:5.000
## Max. :300.0 Max. :48.00 Max. :6.000
## midoverdue riskrating
## Min. : 0.0 Min. :1.000
## 1st Qu.:15.0 1st Qu.:1.000
## Median :53.0 Median :3.000
## Mean :48.1 Mean :2.681
## 3rd Qu.:53.0 3rd Qu.:3.000
## Max. :91.0 Max. :5.000
"Create plot of Income on Dependents distribution"
## [1] "Create plot of Income on Dependents distribution"
library(ggplot2)
ggplot(csdat_raw, aes(as.factor(dependents), income)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Income by Dependents")
"Split dataset to training and testing dataset"
## [1] "Split dataset to training and testing dataset"
#Record the indexes/row numbers for each risk rating
index1 <- which(csdat_raw$riskrating == 1)
index2 <- which(csdat_raw$riskrating == 2)
#Record the indexes for the remaining risk ratings
index3 <- which(csdat_raw$riskrating == 3)
index4 <- which(csdat_raw$riskrating == 4)
index5 <- which(csdat_raw$riskrating == 5)
#80% of the data will be used as the training set
ntrain1 <- round(0.8 * length(index1))
ntrain2 <- round(0.8 * length(index2))
ntrain3 <- round(0.8 * length(index3))
ntrain4 <- round(0.8 * length(index4))
ntrain5 <- round(0.8 * length(index5))
#set the seed so this sampling can be reproduced
set.seed(100)
#sample the data of each rating for the training set
train1_index <- sample(index1, ntrain1)
train2_index <- sample(index2, ntrain2)
train3_index <- sample(index3, ntrain3)
train4_index <- sample(index4, ntrain4)
train5_index <- sample(index5, ntrain5)
#assign the remaining data to the testing set
test1_index <- setdiff(index1, train1_index)
test2_index <- setdiff(index2, train2_index)
test3_index <- setdiff(index3, train3_index)
test4_index <- setdiff(index4, train4_index)
test5_index <- setdiff(index5, train5_index)
#Combine the sampled rows of each risk rating into the training set
csdattrain <- do.call("rbind", list(csdat_raw[train1_index,],
csdat_raw[train2_index,], csdat_raw[train3_index,],
csdat_raw[train4_index,], csdat_raw[train5_index,]))
cstrain <- subset(csdattrain, select =
-c(contractcode,riskrating))
#Combine the sampled rows of each risk rating into the testing set
csdattest <- do.call("rbind", list(csdat_raw[test1_index,],
csdat_raw[test2_index,], csdat_raw[test3_index,],
csdat_raw[test4_index,], csdat_raw[test5_index,]))
cstest <- subset(csdattest,
select = -c(contractcode,riskrating))
#Calculate the correlations between variables
cor(cstrain)
## income tenor dependents midoverdue
## income 1.00000000 -0.07256604 0.2427909 0.1250535
## tenor -0.07256604 1.00000000 0.0334339 0.2333681
## dependents 0.24279088 0.03343390 1.0000000 0.7632659
## midoverdue 0.12505348 0.23336810 0.7632659 1.0000000
#Perform PCA with the prcomp() function and
#save the output into an object named pr.out
pr.out <- prcomp(cstrain, scale = TRUE, center = TRUE)
#Display the PCA output by calling the pr.out object
pr.out
## Standard deviations (1, .., p=4):
## [1] 1.3685609 1.0492944 0.9059637 0.4530475
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## income -0.2665678 0.6235436 -0.7299486 0.08549929
## tenor -0.1827165 -0.7592266 -0.6015443 -0.16832731
## dependents -0.6675878 0.1100498 0.2569754 -0.69005738
## midoverdue -0.6707330 -0.1505238 0.1981998 0.69869635
#Display the summary of the PCA output
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.3686 1.0493 0.9060 0.45305
## Proportion of Variance 0.4682 0.2752 0.2052 0.05131
## Cumulative Proportion 0.4682 0.7435 0.9487 1.00000
#Draw the scree plot using the screeplot() function
screeplot(pr.out, type = "line", ylim = c(0,2))
#Add a horizontal line as a guide for applying the Kaiser criterion
abline(h = 1, lty = 3, col = "red")
#Draw the biplot using the biplot() function
biplot(pr.out, scale = 0) #draw first 2 principal components
#Other functions that can be used are autoplot from ggfortify or fviz_pca_biplot from factoextra.
library(ggfortify)
autoplot(pr.out, data = csdattrain, loadings = TRUE, loadings.label = TRUE, scale = 0)
library(factoextra)
fviz_pca_biplot(pr.out, label = "var", habillage=csdattrain$riskrating)
Before performing dimension reduction with PCA, we first explore the data using descriptive statistical techniques.
## income tenor dependents midoverdue
## Min. : 70.0 Min. :12.0 Min. :0.000 Min. :15.00
## 1st Qu.:120.0 1st Qu.:12.0 1st Qu.:1.000 1st Qu.:15.00
## Median :162.0 Median :24.0 Median :3.000 Median :53.00
## Mean :162.4 Mean :29.8 Mean :2.929 Mean :48.08
## 3rd Qu.:197.0 3rd Qu.:48.0 3rd Qu.:5.000 3rd Qu.:53.00
## Max. :299.0 Max. :48.0 Max. :6.000 Max. :91.00
From the boxplot of income by dependents, it can be concluded that there is no big difference in the median or spread of income for individuals with no dependents up to 4 dependents. The median and spread of income for the groups with 5 and 6 dependents are greater than those of the previous 5 groups.
The correlation matrix provides information on the strength of the linear relationship between variables. The strongest relationship seen is the relationship between dependents and midoverdue, followed by the relationship between income and dependents.
## income tenor dependents midoverdue
## income 1.00000000 -0.04649806 0.25180228 0.1599918
## tenor -0.04649806 1.00000000 0.00526126 0.2100942
## dependents 0.25180228 0.00526126 1.00000000 0.7615983
## midoverdue 0.15999175 0.21009415 0.76159830 1.0000000
The datasets used for training and testing need to be separated so that information used for testing does not “leak” into the training data. 80% of the data will be used for training; the rest is allocated for testing.
The sampling technique used is stratified sampling with the risk rating as the stratum. Stratified sampling ensures that every risk rating is represented in the training dataset and prevents fitting the PCA on a training dataset that contains only certain risk ratings.
The proportion for each risk rating in the training set will be applied to the testing set. PCA will be applied to the training dataset.
Apply Principal Component Analysis
## Standard deviations (1, .., p=4):
## [1] 1.3709060 1.0347230 0.9176641 0.4559141
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## income -0.3038356 -0.5337512 -0.7861956 0.06848411
## tenor -0.1470943 0.8240099 -0.5179407 -0.17637573
## dependents -0.6649190 -0.1002035 0.2647878 -0.69117970
## midoverdue -0.6662807 0.1614826 0.2086176 0.69747555
From the rotation output, the Principal Component loadings are obtained:
PC1 = (-0.304)income + (-0.147)tenor + (-0.665)dependents + (-0.666)midoverdue
PC2 = (-0.534)income + (0.824)tenor + (-0.1)dependents + (0.161)midoverdue
PC3 = (-0.786)income + (-0.518)tenor + (0.265)dependents + (0.209)midoverdue
PC4 = (0.068)income + (-0.176)tenor + (-0.691)dependents + (0.697)midoverdue
Each PC is a weighted sum of the variables income through midoverdue. In PC1 the dependents and midoverdue variables are dominant, while in PC2 the dominant variables are income and tenor. PC3 and PC4 repeat the same patterns as PC1 and PC2.
Choose Number of Principal Components
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.3709 1.0347 0.9177 0.45591
## Proportion of Variance 0.4698 0.2677 0.2105 0.05196
## Cumulative Proportion 0.4698 0.7375 0.9480 1.00000
Eigenvalues
## [1] 1.8793832 1.0706518 0.8421074 0.2078577
Applying the Kaiser criterion to the scree plot selects 2 PCs, namely PC1 and PC2, which together explain 74% of the variability in the data. This is also consistent with the inspection of the rotation output in the previous section, where PC3 and PC4 repeat the patterns of PC1 and PC2.
At this stage the dataset with 4 numerical variables of income, tenor, dependents and midoverdue has been successfully reduced to 2 “variables” PC1 and PC2.
In the biplot, it can be seen that the dependents and midoverdue variables contribute the most to the direction of PC1, while the income and tenor variables contribute to the direction of PC2. PC1 can be named “Expenses”, and PC2 can be named “Ability to Pay”.
The angle between two variable vectors shows the strength of the relationship between the two variables: the acute angle between midoverdue and dependents indicates a strong correlation, while the obtuse angle between income and tenor indicates a weak correlation.
## income tenor dependents midoverdue
## income 1.00000000 -0.04649806 0.25180228 0.1599918
## tenor -0.04649806 1.00000000 0.00526126 0.2100942
## dependents 0.25180228 0.00526126 1.00000000 0.7615983
## midoverdue 0.15999175 0.21009415 0.76159830 1.0000000
We will use credit rating data to practice using the PCA method. We will reduce a data set with 8 numerical predictor variables to 2 Principal Components.
## 'data.frame' : 900 obs. of 10 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue : num 76 76 0 53 38 15 38 38 38 38 ...
## $ riskrating : num 4 4 1 3 2 1 2 2 2 2 ...
## $ age : num 55 53 35 45 36 45 21 24 35 26 ...
## $ empyear : num 12 10 5 7 5 8 1 1 6 2 ...
## $ asset : num 893 906 552 791 593 ...
## $ debt : num 4.6984 4.0639 0.05 0.7214 0.0667 ...
The descriptions of the variables in this credit rating dataset are as follows:
contractcode is the contract number
income is annual income in millions of rupiah
tenor is the loan duration
dependents is the number of dependents of the applicant
midoverdue is the average delay in loan payments in days
riskrating is the credit risk rating
age is the age of the applicant
empyear is the number of years the applicant has worked at their last job
asset is the total value of the applicant's assets
debt is the loan amount in millions of rupiah
The response variable in this dataset is riskrating. The other variables are predictor variables. The variables that will be used for the PCA analysis are the predictor variables with numeric data types, namely:
## [1] "income" "tenor" "dependents" "midoverdue" "age"
## [6] "empyear" "asset" "debt"
library(openxlsx)
cslarge_raw <- read.xlsx("C:/Users/Jacque de l'est/Documents/Datasets for Data Science/dqlab_pcadata.xlsx", sheet = "cslarge")
head(cslarge_raw)
## contractcode income tenor dependents midoverdue riskrating age empyear
## 1 AGR-000001 295 48 5 76 4 55 12
## 2 AGR-000011 271 36 5 76 4 53 10
## 3 AGR-000030 159 12 0 0 1 35 5
## 4 AGR-000043 210 12 3 53 3 45 7
## 5 AGR-000049 165 36 0 38 2 36 5
## 6 AGR-000063 220 24 5 15 1 45 8
## asset debt
## 1 892.9266 4.69837074
## 2 905.8225 4.06385168
## 3 551.7261 0.05000000
## 4 791.1124 0.72138396
## 5 592.6501 0.06666667
## 6 778.0493 2.59791099
str(cslarge_raw) #data structure
## 'data.frame': 900 obs. of 10 variables:
## $ contractcode: chr "AGR-000001" "AGR-000011" "AGR-000030" "AGR-000043" ...
## $ income : num 295 271 159 210 165 220 70 88 163 100 ...
## $ tenor : num 48 36 12 12 36 24 36 48 48 36 ...
## $ dependents : num 5 5 0 3 0 5 3 3 5 6 ...
## $ midoverdue : num 76 76 0 53 38 15 38 38 38 38 ...
## $ riskrating : num 4 4 1 3 2 1 2 2 2 2 ...
## $ age : num 55 53 35 45 36 45 21 24 35 26 ...
## $ empyear : num 12 10 5 7 5 8 1 1 6 2 ...
## $ asset : num 893 906 552 791 593 ...
## $ debt : num 4.6984 4.0639 0.05 0.7214 0.0667 ...
summary(cslarge_raw) #descriptive statistics
## contractcode income tenor dependents
## Length:900 Min. : 70.0 Min. :12.00 Min. :0.000
## Class :character 1st Qu.:121.0 1st Qu.:12.00 1st Qu.:1.000
## Mode :character Median :162.0 Median :24.00 Median :3.000
## Mean :163.3 Mean :29.93 Mean :2.932
## 3rd Qu.:199.0 3rd Qu.:48.00 3rd Qu.:5.000
## Max. :300.0 Max. :48.00 Max. :6.000
## midoverdue riskrating age empyear
## Min. : 0.00 Min. :1.000 Min. :20.00 Min. : 0.000
## 1st Qu.:15.00 1st Qu.:1.000 1st Qu.:29.00 1st Qu.: 3.000
## Median :53.00 Median :3.000 Median :36.00 Median : 5.000
## Mean :48.16 Mean :2.681 Mean :35.88 Mean : 5.153
## 3rd Qu.:53.00 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.: 7.000
## Max. :91.00 Max. :5.000 Max. :61.00 Max. :13.000
## asset debt
## Min. : 232.2 Min. : 0.0500
## 1st Qu.: 440.0 1st Qu.: 0.6469
## Median : 555.0 Median : 2.0253
## Mean : 571.0 Mean : 3.8091
## 3rd Qu.: 687.8 3rd Qu.: 5.1902
## Max. :1192.3 Max. :23.5382
"Visualize distribution of income based on dependents"
## [1] "Visualize distribution of income based on dependents"
library(ggplot2)
ggplot(cslarge_raw, aes(as.factor(dependents), income)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Income by Dependents")
"Visualize distribution of debt based on dependents"
## [1] "Visualize distribution of debt based on dependents"
ggplot(cslarge_raw, aes(as.factor(dependents), debt)) +
geom_boxplot() + xlab("Dependents") + ggtitle("Boxplot of Debt by Dependents")
"Split data into training set and testing set for each risk rating"
## [1] "Split data into training set and testing set for each risk rating"
#Record the indexes/row numbers for each risk rating
index1 <- which(cslarge_raw$riskrating == 1)
index2 <- which(cslarge_raw$riskrating == 2)
#Record the indexes for the remaining risk ratings
index3 <- which(cslarge_raw$riskrating == 3)
index4 <- which(cslarge_raw$riskrating == 4)
index5 <- which(cslarge_raw$riskrating == 5)
#80% of the data will be used as the training set
ntrain1 <- round(0.8 * length(index1))
ntrain2 <- round(0.8 * length(index2))
ntrain3 <- round(0.8 * length(index3))
ntrain4 <- round(0.8 * length(index4))
ntrain5 <- round(0.8 * length(index5))
#set the seed so this sampling can be reproduced
set.seed(100)
#sample the data of each rating for the training set
train1_index <- sample(index1, ntrain1)
train2_index <- sample(index2, ntrain2)
train3_index <- sample(index3, ntrain3)
train4_index <- sample(index4, ntrain4)
train5_index <- sample(index5, ntrain5)
#assign the remaining data to the testing set
test1_index <- setdiff(index1, train1_index)
test2_index <- setdiff(index2, train2_index)
test3_index <- setdiff(index3, train3_index)
test4_index <- setdiff(index4, train4_index)
test5_index <- setdiff(index5, train5_index)
#Combine the sampled rows of each risk rating into the training set
cslarge_train <- do.call("rbind", list(cslarge_raw[train1_index,],
cslarge_raw[train2_index,], cslarge_raw[train3_index,],
cslarge_raw[train4_index,], cslarge_raw[train5_index,]))
cstrain <- subset(cslarge_train, select = -c(contractcode,riskrating))
#Combine the sampled rows of each risk rating into the testing set
cslarge_test <- do.call("rbind", list(cslarge_raw[test1_index,],
cslarge_raw[test2_index,], cslarge_raw[test3_index,],
cslarge_raw[test4_index,], cslarge_raw[test5_index,]))
cstest <- subset(cslarge_test, select = -c(contractcode,riskrating))
#Calculate the correlations between variables
cor(cstrain)
## income tenor dependents midoverdue age
## income 1.00000000 -0.07256604 0.2427909 0.12756439 0.98345799
## tenor -0.07256604 1.00000000 0.0334339 0.23170789 -0.07433539
## dependents 0.24279088 0.03343390 1.0000000 0.76422579 0.23110488
## midoverdue 0.12756439 0.23170789 0.7642258 1.00000000 0.12134786
## age 0.98345799 -0.07433539 0.2311049 0.12134786 1.00000000
## empyear 0.97526554 -0.07152033 0.2499218 0.13267309 0.92632487
## asset 0.86275940 -0.07265391 0.1901048 0.09832691 0.93780421
## debt 0.16255937 0.08842241 0.6952961 0.70521333 0.15994524
## empyear asset debt
## income 0.97526554 0.86275940 0.16255937
## tenor -0.07152033 -0.07265391 0.08842241
## dependents 0.24992178 0.19010482 0.69529606
## midoverdue 0.13267309 0.09832691 0.70521333
## age 0.92632487 0.93780421 0.15994524
## empyear 1.00000000 0.74882989 0.16008510
## asset 0.74882989 1.00000000 0.13959529
## debt 0.16008510 0.13959529 1.00000000
#Draw the correlation matrix with ggcorrplot
library(ggcorrplot)
ggcorrplot(cor(cstrain))
#Perform PCA with the prcomp() function and
#save the output into an object named pr.out
pr.out <- prcomp(cstrain, scale = TRUE, center = TRUE)
#Display the PCA output by calling the pr.out object
pr.out
## Standard deviations (1, .., p=8):
## [1] 1.9886275 1.5061223 0.9869479 0.5697568 0.5168331 0.4513118 0.0807862
## [8] 0.0306914
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## income -0.48037618 0.1732370 -0.04956696 0.05692772 -0.21996290
## tenor 0.02300648 -0.1921475 -0.96534791 -0.02622698 -0.04016120
## dependents -0.23802064 -0.5058409 0.19514188 0.46401430 0.04509053
## midoverdue -0.18278120 -0.5638239 -0.03401568 0.29673063 0.22083181
## age -0.48194139 0.1807062 -0.05221971 -0.04070810 0.10019847
## empyear -0.46111279 0.1571628 -0.04049775 0.17088762 -0.57056947
## asset -0.44138937 0.1785155 -0.05058642 -0.22043915 0.71311191
## debt -0.19730574 -0.5196857 0.13958206 -0.78310912 -0.23470343
## PC6 PC7 PC8
## income 0.053169773 0.560979083 6.063182e-01
## tenor -0.168350384 -0.003290186 -9.006944e-05
## dependents -0.657287184 0.004360366 -1.701761e-03
## midoverdue 0.714643123 0.002246697 1.006382e-03
## age 0.002608534 0.386800275 -7.556671e-01
## empyear 0.111545448 -0.623731435 -6.927760e-02
## asset -0.102418913 -0.382877667 2.377850e-01
## debt -0.056134329 -0.004113309 6.814070e-04
#Display the summary of the PCA output
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9886 1.5061 0.9869 0.56976 0.51683 0.45131 0.08079
## Proportion of Variance 0.4943 0.2836 0.1218 0.04058 0.03339 0.02546 0.00082
## Cumulative Proportion 0.4943 0.7779 0.8996 0.94022 0.97361 0.99907 0.99988
## PC8
## Standard deviation 0.03069
## Proportion of Variance 0.00012
## Cumulative Proportion 1.00000
#Draw the scree plot using the screeplot() function
screeplot(pr.out, type = "line", ylim = c(0,2))
#Add a horizontal line as a guide for applying the Kaiser criterion
abline(h = 1, lty = 3, col = "red")
#Draw the biplot using the biplot() function
biplot(pr.out, scale = 0) #draw the first 2 principal components
#Draw the Principal Components and risk ratings using the
#autoplot() function from the ggfortify package.
library(ggfortify)
autoplot(pr.out, data = cslarge_train, colour = 'riskrating',
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 3, scale = 0)
#Draw the Principal Components and risk ratings using the
#fviz_pca_ind() function from the factoextra package.
library(factoextra)
fviz_pca_ind(pr.out, label="none", habillage=cslarge_train$riskrating)
Before performing dimension reduction with PCA, we first explore the data using descriptive statistical techniques.
## income tenor dependents midoverdue
## Min. : 70.0 Min. :12.0 Min. :0.000 Min. :15.00
## 1st Qu.:120.0 1st Qu.:12.0 1st Qu.:1.000 1st Qu.:15.00
## Median :162.0 Median :24.0 Median :3.000 Median :53.00
## Mean :162.4 Mean :29.8 Mean :2.929 Mean :48.14
## 3rd Qu.:197.0 3rd Qu.:48.0 3rd Qu.:5.000 3rd Qu.:53.00
## Max. :299.0 Max. :48.0 Max. :6.000 Max. :91.00
## age empyear asset debt
## Min. :20.00 Min. : 0.000 Min. : 232.2 Min. : 0.0500
## 1st Qu.:29.00 1st Qu.: 3.000 1st Qu.: 444.3 1st Qu.: 0.6469
## Median :36.00 Median : 5.000 Median : 553.2 Median : 2.0416
## Mean :35.74 Mean : 5.108 Mean : 568.7 Mean : 3.7966
## 3rd Qu.:41.00 3rd Qu.: 7.000 3rd Qu.: 682.7 3rd Qu.: 5.1902
## Max. :59.00 Max. :13.000 Max. :1100.3 Max. :23.5382
From the boxplot of income by dependents, it can be concluded that there is no big difference in the median or spread of income for individuals with no dependents up to 4 dependents. The median and spread of income for the groups with 5 and 6 dependents are greater than those of the previous 5 groups. The same pattern can be observed in the boxplot of debt by dependents: the median and spread of debt increase with the number of dependents. The correlation matrix provides information on the strength of the linear relationships between variables. The strongest relationships seen are income versus age and income versus empyear, followed by dependents versus midoverdue and dependents versus debt.
## income tenor dependents midoverdue age
## income 1.00000000 -0.04649806 0.25180228 0.1624426 0.9834652
## tenor -0.04649806 1.00000000 0.00526126 0.2083705 -0.0466877
## dependents 0.25180228 0.00526126 1.00000000 0.7625827 0.2430167
## midoverdue 0.16244264 0.20837054 0.76258273 1.0000000 0.1551441
## age 0.98346517 -0.04668770 0.24301671 0.1551441 1.0000000
## empyear 0.97444733 -0.04796359 0.25394067 0.1645664 0.9251052
## asset 0.86337873 -0.04804028 0.20436953 0.1224551 0.9382740
## debt 0.15557952 0.05837108 0.69737720 0.7000607 0.1556199
## empyear asset debt
## income 0.97444733 0.86337873 0.15557952
## tenor -0.04796359 -0.04804028 0.05837108
## dependents 0.25394067 0.20436953 0.69737720
## midoverdue 0.16456638 0.12245511 0.70006065
## age 0.92510523 0.93827403 0.15561994
## empyear 1.00000000 0.74756395 0.15089249
## asset 0.74756395 1.00000000 0.13877713
## debt 0.15089249 0.13877713 1.00000000
The datasets used for training and testing need to be separated so that information used for testing does not “leak” into the training data. 80% of the data will be used for training; the rest is allocated for testing.
The sampling technique used is stratified sampling with the risk rating as the stratum. Stratified sampling ensures that every risk rating is represented in the training dataset and prevents fitting the PCA on a training dataset that contains only certain risk ratings.
The proportion for each risk rating in the training set will be applied to the testing set. PCA will be applied to the training dataset.
## Standard deviations (1, .., p=8):
## [1] 1.99490861 1.48867836 0.99921838 0.57081733 0.51435251 0.45579721
## [7] 0.08168404 0.03040859
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## income -0.47706407 -0.1852419 -0.03775491 0.06523892 -0.22008106
## tenor 0.01081998 0.1511663 -0.97111786 -0.04289554 -0.02097040
## dependents -0.24696246 0.5063808 0.17722837 0.42331675 0.14590700
## midoverdue -0.20336992 0.5550113 -0.06515053 0.33185665 0.14679694
## age -0.47873792 -0.1917264 -0.03983856 -0.04920586 0.09776723
## empyear -0.45716825 -0.1703323 -0.03007422 0.19373502 -0.57710422
## asset -0.43841511 -0.1890776 -0.03521731 -0.26268660 0.70389438
## debt -0.19885270 0.5261678 0.12695805 -0.77176371 -0.26497429
## PC6 PC7 PC8
## income 0.021707104 -0.5700871556 -0.5987549883
## tenor -0.177964730 0.0031881769 -0.0006072532
## dependents -0.671312799 -0.0052066743 0.0020444245
## midoverdue 0.717392848 0.0048806574 -0.0020012051
## age 0.016433973 -0.3761733102 0.7607284732
## empyear 0.023860374 0.6214289709 0.0612683219
## asset -0.007010088 0.3837436091 -0.2429438834
## debt -0.040687232 0.0002399125 -0.0007503969
From the rotation output, the Principal Component loadings are obtained:
PC1 = (-0.477)income + (0.011)tenor + (-0.247)dependents + (-0.203)midoverdue + (-0.479)age + (-0.457)empyear + (-0.438)asset + (-0.199)debt
PC2 = (-0.185)income + (0.151)tenor + (0.506)dependents + (0.555)midoverdue + (-0.192)age + (-0.170)empyear + (-0.189)asset + (0.526)debt
PC3 = (-0.038)income + (-0.971)tenor + (0.177)dependents + (-0.065)midoverdue + (-0.040)age + (-0.030)empyear + (-0.035)asset + (0.127)debt
PC4 = (0.065)income + (-0.043)tenor + (0.423)dependents + (0.332)midoverdue + (-0.049)age + (0.194)empyear + (-0.263)asset + (-0.772)debt
and so on up to PC8. In PC1 it can be seen that the income, age, empyear and asset variables make a large contribution. In PC2, the midoverdue, dependents and debt variables are dominant. In PC3 the most dominant variable is tenor.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.9949 1.4887 0.9992 0.57082 0.51435 0.45580
## Proportion of Variance 0.4975 0.2770 0.1248 0.04073 0.03307 0.02597
## Cumulative Proportion 0.4975 0.7745 0.8993 0.94001 0.97308 0.99905
## PC7 PC8
## Standard deviation 0.08168 0.03041
## Proportion of Variance 0.00083 0.00012
## Cumulative Proportion 0.99988 1.00000
Eigenvalues
## [1] 3.9796603540 2.2161632727 0.9984373775 0.3258324245 0.2645585093
## [6] 0.2077510974 0.0066722822 0.0009246825
Applying the Kaiser criterion to the scree plot selects 2 PCs, namely PC1 and PC2, which together explain about 77% of the variability in the data; PC3, with an eigenvalue of 0.998, falls just below the cutoff. This is also consistent with the inspection of the rotation output in the previous section.
At this stage the dataset with 8 numerical variables has been successfully reduced to 2 “variables” PC1 and PC2.
## 'data.frame': 720 obs. of 8 variables:
## $ income : num 225 157 118 104 198 189 133 149 175 230 ...
## $ tenor : num 24 24 12 24 12 36 12 24 12 12 ...
## $ dependents: num 0 0 1 6 1 2 2 0 1 0 ...
## $ midoverdue: num 15 15 15 15 15 15 15 15 15 15 ...
## $ age : num 47 34 29 28 41 38 32 36 37 45 ...
## $ empyear : num 8 5 3 2 7 7 3 4 6 9 ...
## $ asset : num 841 510 433 455 693 ...
## $ debt : num 0.05 0.05 0.974 3.15 0.352 ...
Output Visualization
In the biplot, it can be seen that the income, age, empyear and asset variables contribute the most to the direction of PC1, while the dependents, midoverdue and debt variables contribute the most to the direction of PC2. PC1 can be named “Payability” and PC2 can be named “Loads and Risks”.
The angle between two variable vectors shows the strength of the relationship between the two variables: the acute angle between midoverdue and dependents indicates a strong correlation, while the obtuse angle between income and tenor indicates a weak correlation.
## income tenor dependents midoverdue age
## income 1.00000000 -0.04649806 0.25180228 0.1624426 0.9834652
## tenor -0.04649806 1.00000000 0.00526126 0.2083705 -0.0466877
## dependents 0.25180228 0.00526126 1.00000000 0.7625827 0.2430167
## midoverdue 0.16244264 0.20837054 0.76258273 1.0000000 0.1551441
## age 0.98346517 -0.04668770 0.24301671 0.1551441 1.0000000
## empyear 0.97444733 -0.04796359 0.25394067 0.1645664 0.9251052
## asset 0.86337873 -0.04804028 0.20436953 0.1224551 0.9382740
## debt 0.15557952 0.05837108 0.69737720 0.7000607 0.1556199
## empyear asset debt
## income 0.97444733 0.86337873 0.15557952
## tenor -0.04796359 -0.04804028 0.05837108
## dependents 0.25394067 0.20436953 0.69737720
## midoverdue 0.16456638 0.12245511 0.70006065
## age 0.92510523 0.93827403 0.15561994
## empyear 1.00000000 0.74756395 0.15089249
## asset 0.74756395 1.00000000 0.13877713
## debt 0.15089249 0.13877713 1.00000000
The purpose of performing PCA on this data set is to reduce its dimensions so that the data can be used as input for other algorithms. R provides the predict() function to do this:
predobject <- predict(pcaobject, newdata = newdataset)
Note that the pcaobject passed to predict() must be the PCA output generated from the training data set, while the data set passed as newdata is the testing data set. The resulting matrix of scores in predobject can be saved for use as input to other algorithms.
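A minimal sketch of this step for the credit data, assuming pr.out is the prcomp output fitted on cstrain and cstest holds the same predictor columns:
pr.test <- predict(pr.out, newdata = cstest)  # test-set scores on the training PCs
pcs_test <- pr.test[, 1:2]                    # keep only PC1 and PC2 as new inputs
head(pcs_test)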
Two matrices can be added if they have the same dimensions, i.e. both are m × n.
Two matrices can be multiplied if the number of columns in the first matrix is equal to the number of rows in the second matrix, i.e. A(m×n)B(n×p) = C(m×p).
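These dimension rules can be illustrated with two small matrices in R:
A <- matrix(1:6, nrow = 2, ncol = 3)  # a 2 x 3 matrix
B <- matrix(1:6, nrow = 3, ncol = 2)  # a 3 x 2 matrix
A + A                                 # addition: identical dimensions required
A %*% B                               # multiplication: columns of A (3) = rows of B (3)
dim(A %*% B)                          # 2 2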
The motivation for discussing the concept of a linear combination comes from the need to express the data as a linear combination of principal components, in which each principal component is itself a linear combination of the predictor variables. The coordinates of a vector can be written as a linear combination of the basis vectors.
In the original coordinate system, the coordinates of the example vector are c1 = 1 and c2 = 2. These coordinates change when the basis changes.
In the new coordinate system, the coordinates of the same vector are c1 = 1 and c2 = 1.
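Since the figures are not reproduced here, the following sketch uses a hypothetical new basis chosen to match the coordinates stated above. Solving Bc = v gives the coordinates c of a vector v in the basis B:
v <- c(1, 2)                      # the example vector
B_std <- diag(2)                  # standard basis: coordinates c1 = 1, c2 = 2
solve(B_std, v)
B_new <- cbind(c(1, 1), c(0, 1))  # hypothetical new basis vectors (1,1) and (0,1)
solve(B_new, v)                   # coordinates in the new basis: c1 = 1, c2 = 1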
The eigenvalues and eigenvectors are the pairs of λ and v that satisfy the equation
Av = λv.
The eigenvalues are calculated by finding the roots of the characteristic polynomial det(A − λI) = 0. The determinant of the matrix (A − λI) for a square matrix A(3×3) can be calculated by the Laplace cofactor expansion.
The eigenvectors are calculated by substituting each eigenvalue into the equation Av = λv and solving the resulting system of linear equations.
Each unique eigenvalue produces an eigenvector.
# Type the following command to read the help page about matrices
?matrix
## starting httpd help server ... done
# Create a 3 x 3 matrix and save it under the name A.
A <- matrix(c(1, 1, 0, 0, -2, 1, 0, 0, 3), nrow = 3, ncol = 3, byrow = TRUE)
# Write the command to display matrix A
A
## [,1] [,2] [,3]
## [1,] 1 1 0
## [2,] 0 -2 1
## [3,] 0 0 3
# Write the R command to calculate the eigenvalues and eigenvectors
# and save the result in the variable ev
ev <- eigen(A)
# Write the command to inspect the structure of the eigen object
str(ev)
## List of 2
## $ values : num [1:3] 3 -2 1
## $ vectors: num [1:3, 1:3] 0.0976 0.1952 0.9759 -0.3162 0.9487 ...
## - attr(*, "class")= chr "eigen"
# Write the command to view the output
ev
## eigen() decomposition
## $values
## [1] 3 -2 1
##
## $vectors
## [,1] [,2] [,3]
## [1,] 0.09759001 -0.3162278 1
## [2,] 0.19518001 0.9486833 0
## [3,] 0.97590007 0.0000000 0
# Write the command to access the eigenvalues
ev$values
## [1] 3 -2 1
# Write the command to access the eigenvectors
ev$vectors
## [,1] [,2] [,3]
## [1,] 0.09759001 -0.3162278 1
## [2,] 0.19518001 0.9486833 0
## [3,] 0.97590007 0.0000000 0
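The decomposition can be verified against the defining equation Av = λv using the objects created above:
# For each eigenpair, A %*% v should equal lambda * v (up to floating point error)
for (j in 1:3) {
  cat("lambda =", ev$values[j],
      " max|Av - lambda*v| =",
      max(abs(A %*% ev$vectors[, j] - ev$values[j] * ev$vectors[, j])), "\n")
}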
In practice, Principal Components are given their own names based on a summary of the dominant variables. This requires domain knowledge, and there are times when the variables that emerge do not match the theory of the domain.
PCA reduces dimensions by forming Principal Components that capture the greatest data variability. Its limitation is that the PC selection process uses only the predictor variables, without the response variable. Therefore PCA should not be used as a model in itself, but as a data preprocessing technique whose output becomes the input for other methods. An alternative is the Partial Least Squares method, which does involve the response variable in the dimension reduction.