Load libraries:
suppressWarnings(library(poLCA))
## Loading required package: scatterplot3d
## Loading required package: MASS
suppressWarnings(library(reshape2))
suppressWarnings(library(ggplot2))
Load the data from CSV file and keep only categorical variables of choice:
gcd <- read.csv('/Users/codethedral/Google Drive/MScA/MSCA_31008 Data Mining/HW 3 Due Jul 22nd/German.Credit.csv', header=TRUE, sep=',')
gcd<-data.frame(gcd)
gcd.cat<- gcd[c(2,13,18)]
I’ll pick the following three:
(str(gcd.cat))
## 'data.frame': 1000 obs. of 3 variables:
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Most.valuable.available.asset: int 2 1 1 1 2 1 1 1 3 4 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## NULL
Splitting data into training and test.
set.seed(221)
split <- sample(c(TRUE, FALSE),size = 1000, replace = TRUE, prob = c(.632, .368) )
train.gc <- gcd.cat[split,]
test.gc <- gcd.cat[!split,]
str(train.gc)
## 'data.frame': 637 obs. of 3 variables:
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Most.valuable.available.asset: int 2 1 1 1 2 1 1 1 3 4 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
Function to select best number of Classes based on AIC: the result was 3 classes.
f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_aic <- 100000
for(i in 2:6){
lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
if(lc$aic < min_aic){
min_aic <- lc$aic
LCA_best_model.aic<-lc
}
}
#LCA_best_model.aic
Function to select best number of Classes based on BIC: the results was 2 classes.
f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_bic <- 100000
for(i in 2:6){
lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
if(lc$bic < min_bic){
min_bic <- lc$bic
LCA_best_model.bic<-lc
}
}
#LCA_best_model.bic
For loop to run 2-6 LCA solutions, saving AIC/BIC.
f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
#accumulator for VAF and Cluster Size
gc.AIC <- data.frame()
gc.BIC <- data.frame()
#run LCA for all clusters 2 to 6
for(i in 2:6){
#Run LCA for each level of i, allowing up to 6 iterations
lc <- poLCA(f, train.gc,nclass=i, tol=.001, nrep=10, verbose=FALSE)
#Combine LCA results AIC and BIC, write to df
gc.AIC<- rbind(gc.AIC, cbind(i, (lc$aic)))
gc.BIC<-rbind(gc.BIC, cbind(i, (lc$bic)))
}
class.table<-cbind(gc.AIC,gc.BIC$V2)
names(class.table)<- c("Classes","AIC","BIC");class.table<-data.frame(class.table);(class.table)
## Classes AIC BIC
## 1 2 4588.700 4673.379
## 2 3 4572.307 4701.553
## 3 4 4573.295 4747.109
## 4 5 4583.699 4802.081
## 5 6 4596.341 4859.291
AIC/BIC plot, against number of Classes
ggplot(class.table, aes(Classes)) +
geom_line(aes(y = AIC, colour = "AIC")) +
geom_line(aes(y = BIC, colour = "BIC")) +
labs(y = "Metrics")
I will go with three classes for two reasons: (1) As Anil said in class, we should rely on the lowest AIC score, (2) after looking at the graph the elbow aligns for AIC/BIC at three classes.
f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3=poLCA(f,train.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)
results.3$probs
## $Account.Balance
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 0.003770783 0.2285224 0.07036863 0.69733816
## class 2: 0.587939004 0.3673691 0.03144087 0.01325106
## class 3: 0.346968125 0.2693796 0.05934595 0.32430632
##
## $Most.valuable.available.asset
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 1.943108e-01 0.1607807 0.4956826 0.1492259
## class 2: 4.138624e-15 0.2917745 0.2525816 0.4556439
## class 3: 4.296625e-01 0.2783873 0.2203751 0.0715751
##
## $Occupation
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 6.553095e-36 2.165601e-08 0.7447588 2.552412e-01
## class 2: 7.559911e-02 6.168413e-06 0.4428241 4.815706e-01
## class 3: 2.673608e-02 3.929432e-01 0.5803207 1.804066e-13
LCA.train.object<-results.3$probs
f<-with(test.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3.test=poLCA(f,test.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)
results.3.test$probs
## $Account.Balance
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 0.549286002 0.1730777 0.01704340 0.2605929
## class 2: 0.003990787 0.3960265 0.07837387 0.5216088
## class 3: 0.131668451 0.2514142 0.13138998 0.4855274
##
## $Most.valuable.available.asset
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 0.1748832570 0.29286557 0.3611443 0.1711068915
## class 2: 0.0006249038 0.08054358 0.6402172 0.2786143511
## class 3: 0.6271531826 0.22104849 0.1513653 0.0004330039
##
## $Occupation
## Pr(1) Pr(2) Pr(3) Pr(4)
## class 1: 5.329059e-31 0.1115828887 0.7957901 0.09262706
## class 2: 1.065780e-02 0.0001227996 0.6888663 0.30035308
## class 3: 3.304956e-02 0.4328422523 0.5005392 0.03356901
My 1st class with population share of 30.4% is dominated by people without a checking account that most valuable asset is a car or something of less value and they are all skilled with a quarter of them highly skilled.
My 2nd class with population share of 16.5% is dominated by skilled employees and unskilled residents, 43% of them own real state and only 7% of them don’t have an asset. Some of them don’t have a checking account and 50% of them who do have a negative balance or less than 200 DM’s in it.
My 3rd class with population share of 53.1%, the largest class is dominated by skilled and highly qualified employees, none of them own real state and 58% of them have their account on negative balance, else they don’t have more than 200 DM’s.
I can think of class 1 being young professionals who are starting their careers, perhaps in grad school. The 2nd class is the one with the most real state, I can think of young families, with some debt. The 3rd class could be recent grad students, with a lot of debt, no house and very little cash on hand.
In terms of stability, my proportions changed to a more even share per class. The margined probabilities remained very similar to the ones in the training set.
5. Comment on the similarity/differences between the clustering solutions you generated in Assignament 1 with the solutions you generated using LCA.
The one similarity that I will point out is that age groups were very defined in my clustering analysis, which here may refflect as skilled and highly skilled under occupation. Also Duration in current address may pair with having real state as your most valuable asset.
In terms of differences, I like how LCA sort of tells a story based on the different classes which allows us more interpretation. I think clustering is more empirical, which is great to measure risk in more detail.
Each method will have very specific uses, I’ve choosen these variables because I was curious to know how not having cash on had would relate to a very specific occupation. The good news is that almost all of our customers are employed, which helps reduce risk.