Load libraries:

suppressWarnings(library(poLCA))
## Loading required package: scatterplot3d
## Loading required package: MASS
suppressWarnings(library(reshape2))
suppressWarnings(library(ggplot2))

Load the data from CSV file and keep only categorical variables of choice:

gcd <- read.csv('/Users/codethedral/Google Drive/MScA/MSCA_31008 Data Mining/HW 3 Due Jul 22nd/German.Credit.csv', header=TRUE, sep=',')
gcd<-data.frame(gcd)
gcd.cat<- gcd[c(2,13,18)]

1. Perform latent class analysis of only the categorical variables for market segmentation using (function poLCA ).

I’ll pick the following three:

(str(gcd.cat))
## 'data.frame':    1000 obs. of  3 variables:
##  $ Account.Balance              : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Most.valuable.available.asset: int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Occupation                   : int  3 3 2 2 2 2 2 2 1 1 ...
## NULL

Splitting data into training and test.

set.seed(221)
split <- sample(c(TRUE, FALSE),size = 1000, replace = TRUE, prob = c(.632, .368) )

train.gc <- gcd.cat[split,]
test.gc <- gcd.cat[!split,]
str(train.gc)
## 'data.frame':    637 obs. of  3 variables:
##  $ Account.Balance              : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Most.valuable.available.asset: int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Occupation                   : int  3 3 2 2 2 2 2 2 1 1 ...

Running LCA

Function to select best number of Classes based on AIC: the result was 3 classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_aic <- 100000
for(i in 2:6){
              lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
              if(lc$aic < min_aic){
                                    min_aic <- lc$aic
                                    LCA_best_model.aic<-lc
                                  }
              }       
#LCA_best_model.aic

Function to select best number of Classes based on BIC: the results was 2 classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_bic <- 100000
for(i in 2:6){
              lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
              if(lc$bic < min_bic){
                                    min_bic <- lc$bic
                                    LCA_best_model.bic<-lc
                                    }
              }       
#LCA_best_model.bic

For loop to run 2-6 LCA solutions, saving AIC/BIC.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)

#accumulator for VAF and Cluster Size
gc.AIC <- data.frame()
gc.BIC <- data.frame()
#run LCA for all clusters 2 to 6
for(i in 2:6){
              #Run LCA for each level of i, allowing up to 6 iterations
              lc <- poLCA(f, train.gc,nclass=i, tol=.001, nrep=10, verbose=FALSE)
              #Combine LCA results AIC and BIC, write to df
              gc.AIC<- rbind(gc.AIC, cbind(i, (lc$aic)))
              gc.BIC<-rbind(gc.BIC, cbind(i, (lc$bic)))
              }

class.table<-cbind(gc.AIC,gc.BIC$V2)
names(class.table)<- c("Classes","AIC","BIC");class.table<-data.frame(class.table);(class.table)
##   Classes      AIC      BIC
## 1       2 4588.700 4673.379
## 2       3 4572.307 4701.553
## 3       4 4573.295 4747.109
## 4       5 4583.699 4802.081
## 5       6 4596.341 4859.291

AIC/BIC plot, against number of Classes

ggplot(class.table, aes(Classes)) + 
  geom_line(aes(y = AIC, colour = "AIC")) + 
  geom_line(aes(y = BIC, colour = "BIC")) + 
  labs(y = "Metrics")

2. Determine 2,3,..,K clas/cluster solutions. Remember to run from multiple random starts. Use AIC criterin and interpretation based on graphs to interpret LCA solutions.

I will go with three classes for two reasons: (1) As Anil said in class, we should rely on the lowest AIC score, (2) after looking at the graph the elbow aligns for AIC/BIC at three classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3=poLCA(f,train.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)

results.3$probs
## $Account.Balance
##                 Pr(1)     Pr(2)      Pr(3)      Pr(4)
## class 1:  0.003770783 0.2285224 0.07036863 0.69733816
## class 2:  0.587939004 0.3673691 0.03144087 0.01325106
## class 3:  0.346968125 0.2693796 0.05934595 0.32430632
## 
## $Most.valuable.available.asset
##                  Pr(1)     Pr(2)     Pr(3)     Pr(4)
## class 1:  1.943108e-01 0.1607807 0.4956826 0.1492259
## class 2:  4.138624e-15 0.2917745 0.2525816 0.4556439
## class 3:  4.296625e-01 0.2783873 0.2203751 0.0715751
## 
## $Occupation
##                  Pr(1)        Pr(2)     Pr(3)        Pr(4)
## class 1:  6.553095e-36 2.165601e-08 0.7447588 2.552412e-01
## class 2:  7.559911e-02 6.168413e-06 0.4428241 4.815706e-01
## class 3:  2.673608e-02 3.929432e-01 0.5803207 1.804066e-13

3. Perform Holdout validation of LCA.

LCA.train.object<-results.3$probs
f<-with(test.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3.test=poLCA(f,test.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)

results.3.test$probs
## $Account.Balance
##                 Pr(1)     Pr(2)      Pr(3)     Pr(4)
## class 1:  0.549286002 0.1730777 0.01704340 0.2605929
## class 2:  0.003990787 0.3960265 0.07837387 0.5216088
## class 3:  0.131668451 0.2514142 0.13138998 0.4855274
## 
## $Most.valuable.available.asset
##                  Pr(1)      Pr(2)     Pr(3)        Pr(4)
## class 1:  0.1748832570 0.29286557 0.3611443 0.1711068915
## class 2:  0.0006249038 0.08054358 0.6402172 0.2786143511
## class 3:  0.6271531826 0.22104849 0.1513653 0.0004330039
## 
## $Occupation
##                  Pr(1)        Pr(2)     Pr(3)      Pr(4)
## class 1:  5.329059e-31 0.1115828887 0.7957901 0.09262706
## class 2:  1.065780e-02 0.0001227996 0.6888663 0.30035308
## class 3:  3.304956e-02 0.4328422523 0.5005392 0.03356901

4. Provide implications / commentary on the goodness, interpretability, stability, and adequacy of solutions.

I can think of class 1 being young professionals who are starting their careers, perhaps in grad school. The 2nd class is the one with the most real state, I can think of young families, with some debt. The 3rd class could be recent grad students, with a lot of debt, no house and very little cash on hand.

In terms of stability, my proportions changed to a more even share per class. The margined probabilities remained very similar to the ones in the training set.

5. Comment on the similarity/differences between the clustering solutions you generated in Assignament 1 with the solutions you generated using LCA.

The one similarity that I will point out is that age groups were very defined in my clustering analysis, which here may refflect as skilled and highly skilled under occupation. Also Duration in current address may pair with having real state as your most valuable asset.

In terms of differences, I like how LCA sort of tells a story based on the different classes which allows us more interpretation. I think clustering is more empirical, which is great to measure risk in more detail.

Each method will have very specific uses, I’ve choosen these variables because I was curious to know how not having cash on had would relate to a very specific occupation. The good news is that almost all of our customers are employed, which helps reduce risk.