HW 3 DUE JUL 22ND

Load libraries:

suppressWarnings(library(poLCA))

## Loading required package: scatterplot3d

## Loading required package: MASS

suppressWarnings(library(reshape2))
suppressWarnings(library(ggplot2))

Load the data from CSV file and keep only categorical variables of choice:

gcd <- read.csv('/Users/codethedral/Google Drive/MScA/MSCA_31008 Data Mining/HW 3 Due Jul 22nd/German.Credit.csv', header=TRUE, sep=',')
gcd<-data.frame(gcd)
gcd.cat<- gcd[c(2,13,18)]

1. Perform latent class analysis of only the categorical variables for market segmentation using (function poLCA ).

I’ll pick the following three:

(str(gcd.cat))

## 'data.frame':    1000 obs. of  3 variables:
##  $ Account.Balance              : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Most.valuable.available.asset: int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Occupation                   : int  3 3 2 2 2 2 2 2 1 1 ...

## NULL

Splitting data into training and test.

set.seed(221)
split <- sample(c(TRUE, FALSE),size = 1000, replace = TRUE, prob = c(.632, .368) )

train.gc <- gcd.cat[split,]
test.gc <- gcd.cat[!split,]
str(train.gc)

## 'data.frame':    637 obs. of  3 variables:
##  $ Account.Balance              : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Most.valuable.available.asset: int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Occupation                   : int  3 3 2 2 2 2 2 2 1 1 ...

Running LCA

Function to select best number of Classes based on AIC: the result was 3 classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_aic <- 100000
for(i in 2:6){
              lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
              if(lc$aic < min_aic){
                                    min_aic <- lc$aic
                                    LCA_best_model.aic<-lc
                                  }
              }       
#LCA_best_model.aic

Function to select best number of Classes based on BIC: the results was 2 classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
max_II <- -100000
min_bic <- 100000
for(i in 2:6){
              lc <- poLCA(f, train.gc, nclass=i, tol=.001, nrep=10, verbose=FALSE)
              if(lc$bic < min_bic){
                                    min_bic <- lc$bic
                                    LCA_best_model.bic<-lc
                                    }
              }       
#LCA_best_model.bic

For loop to run 2-6 LCA solutions, saving AIC/BIC.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)

#accumulator for VAF and Cluster Size
gc.AIC <- data.frame()
gc.BIC <- data.frame()
#run LCA for all clusters 2 to 6
for(i in 2:6){
              #Run LCA for each level of i, allowing up to 6 iterations
              lc <- poLCA(f, train.gc,nclass=i, tol=.001, nrep=10, verbose=FALSE)
              #Combine LCA results AIC and BIC, write to df
              gc.AIC<- rbind(gc.AIC, cbind(i, (lc$aic)))
              gc.BIC<-rbind(gc.BIC, cbind(i, (lc$bic)))
              }

class.table<-cbind(gc.AIC,gc.BIC$V2)
names(class.table)<- c("Classes","AIC","BIC");class.table<-data.frame(class.table);(class.table)

##   Classes      AIC      BIC
## 1       2 4588.700 4673.379
## 2       3 4572.307 4701.553
## 3       4 4573.295 4747.109
## 4       5 4583.699 4802.081
## 5       6 4596.341 4859.291

AIC/BIC plot, against number of Classes

ggplot(class.table, aes(Classes)) + 
  geom_line(aes(y = AIC, colour = "AIC")) + 
  geom_line(aes(y = BIC, colour = "BIC")) + 
  labs(y = "Metrics")

2. Determine 2,3,..,K clas/cluster solutions. Remember to run from multiple random starts. Use AIC criterin and interpretation based on graphs to interpret LCA solutions.

I will go with three classes for two reasons: (1) As Anil said in class, we should rely on the lowest AIC score, (2) after looking at the graph the elbow aligns for AIC/BIC at three classes.

f<-with(train.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3=poLCA(f,train.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)

results.3$probs

## $Account.Balance
##                 Pr(1)     Pr(2)      Pr(3)      Pr(4)
## class 1:  0.003770783 0.2285224 0.07036863 0.69733816
## class 2:  0.587939004 0.3673691 0.03144087 0.01325106
## class 3:  0.346968125 0.2693796 0.05934595 0.32430632
## 
## $Most.valuable.available.asset
##                  Pr(1)     Pr(2)     Pr(3)     Pr(4)
## class 1:  1.943108e-01 0.1607807 0.4956826 0.1492259
## class 2:  4.138624e-15 0.2917745 0.2525816 0.4556439
## class 3:  4.296625e-01 0.2783873 0.2203751 0.0715751
## 
## $Occupation
##                  Pr(1)        Pr(2)     Pr(3)        Pr(4)
## class 1:  6.553095e-36 2.165601e-08 0.7447588 2.552412e-01
## class 2:  7.559911e-02 6.168413e-06 0.4428241 4.815706e-01
## class 3:  2.673608e-02 3.929432e-01 0.5803207 1.804066e-13

3. Perform Holdout validation of LCA.

LCA.train.object<-results.3$probs
f<-with(test.gc, cbind(Account.Balance,Most.valuable.available.asset,Occupation)~1)
results.3.test=poLCA(f,test.gc,nclass=3,nrep=10,tol=.001,verbose=FALSE, graphs=TRUE)

results.3.test$probs

## $Account.Balance
##                 Pr(1)     Pr(2)      Pr(3)     Pr(4)
## class 1:  0.549286002 0.1730777 0.01704340 0.2605929
## class 2:  0.003990787 0.3960265 0.07837387 0.5216088
## class 3:  0.131668451 0.2514142 0.13138998 0.4855274
## 
## $Most.valuable.available.asset
##                  Pr(1)      Pr(2)     Pr(3)        Pr(4)
## class 1:  0.1748832570 0.29286557 0.3611443 0.1711068915
## class 2:  0.0006249038 0.08054358 0.6402172 0.2786143511
## class 3:  0.6271531826 0.22104849 0.1513653 0.0004330039
## 
## $Occupation
##                  Pr(1)        Pr(2)     Pr(3)      Pr(4)
## class 1:  5.329059e-31 0.1115828887 0.7957901 0.09262706
## class 2:  1.065780e-02 0.0001227996 0.6888663 0.30035308
## class 3:  3.304956e-02 0.4328422523 0.5005392 0.03356901

4. Provide implications / commentary on the goodness, interpretability, stability, and adequacy of solutions.

My 1st class with population share of 30.4% is dominated by people without a checking account that most valuable asset is a car or something of less value and they are all skilled with a quarter of them highly skilled.
My 2nd class with population share of 16.5% is dominated by skilled employees and unskilled residents, 43% of them own real state and only 7% of them don’t have an asset. Some of them don’t have a checking account and 50% of them who do have a negative balance or less than 200 DM’s in it.
My 3rd class with population share of 53.1%, the largest class is dominated by skilled and highly qualified employees, none of them own real state and 58% of them have their account on negative balance, else they don’t have more than 200 DM’s.

I can think of class 1 being young professionals who are starting their careers, perhaps in grad school. The 2nd class is the one with the most real state, I can think of young families, with some debt. The 3rd class could be recent grad students, with a lot of debt, no house and very little cash on hand.

In terms of stability, my proportions changed to a more even share per class. The margined probabilities remained very similar to the ones in the training set.

5. Comment on the similarity/differences between the clustering solutions you generated in Assignament 1 with the solutions you generated using LCA.

The one similarity that I will point out is that age groups were very defined in my clustering analysis, which here may refflect as skilled and highly skilled under occupation. Also Duration in current address may pair with having real state as your most valuable asset.

In terms of differences, I like how LCA sort of tells a story based on the different classes which allows us more interpretation. I think clustering is more empirical, which is great to measure risk in more detail.

Each method will have very specific uses, I’ve choosen these variables because I was curious to know how not having cash on had would relate to a very specific occupation. The good news is that almost all of our customers are employed, which helps reduce risk.