Question 1

The vast majority of the dataset consists of binary variables, which limits our ability to compute meaningful means because they are responses to “Yes or No” questions. The variables that are not categorical in nature, and therefore lend themselves to this type of analysis, are Age, Duration, and Amount.
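
Since the mean is only meaningful for the numeric fields, a quick sketch of how they could be summarized (the column names Duration, Amount, and Age are taken from the output shown later; the read.csv call is repeated here so the snippet stands alone):

# Means and standard deviations of the three numeric variables only
df <- read.csv("German.Credit.csv")
sapply(df[, c("Duration", "Amount", "Age")], function(x) c(mean = mean(x), sd = sd(x)))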

# Read the German Credit data and keep a working copy
dataPath <- getwd()
df <- read.csv("German.Credit.csv")
GermanCredit <- df

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1234)
# 70/30 train/test split, stratified on Duration
intrain <- createDataPartition(y = GermanCredit$Duration, p = .7, list = FALSE)
training <- GermanCredit[intrain, ]
testing <- GermanCredit[-intrain, ]

nrow(training)
## [1] 701
nrow(testing)
## [1] 299

# Min-max scale each column to the [0, 1] range
scaledtrain <- as.data.frame(apply(X = training, MARGIN = 2, FUN = function(x) {(x - min(x)) / (max(x) - min(x))}))
scaledtest <- as.data.frame(apply(X = testing, MARGIN = 2, FUN = function(x) {(x - min(x)) / (max(x) - min(x))}))
head(cbind(training, scaledtrain))
##    Duration Amount Age   Duration     Amount        Age
## 1        18   1049  21 0.20588235 0.04396390 0.03571429
## 5        12   2171  38 0.11764706 0.10570045 0.33928571
## 6        10   2241  48 0.08823529 0.10955211 0.51785714
## 8         6   1361  40 0.02941176 0.06113129 0.37500000
## 10       24   3758  23 0.29411765 0.19302300 0.07142857
## 12       30   6187  24 0.38235294 0.32667547 0.08928571
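
A quick sanity check (just a sketch) that the min-max scaling mapped every column onto the [0, 1] range:

# Each scaled column should run from 0 to 1
sapply(scaledtrain, range)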

Question 2 These initial k-means runs are also used to derive the solutions in later steps.

k3 <- kmeans(df, centers = 3, nstart = 50)
k3
## K-means clustering with 3 clusters of sizes 216, 56, 728
## 
## Cluster means:
##   Duration    Amount      Age
## 1 30.14352  5742.269 36.16204
## 2 39.66071 11695.589 36.05357
## 3 16.71841  1890.062 35.31868
## 
## Clustering vector:
##    [1] 3 3 3 3 3 3 3 3 3 3 1 1 3 1 3 3 1 3 3 1 3 3 3 3 1 1 3 3 3 1 3 3 1 3
##   [35] 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 1 3 3 1 3 3 3 3 3 3 3 3 2 3 3 3 3
##   [69] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 1 3 3 3 3 1 3 1 1 3 1 3 1 3 3
##  [103] 1 1 3 3 3 3 3 3 1 3 3 3 1 1 3 3 3 3 1 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3
##  [137] 3 3 3 3 1 1 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 1 1 3 3 3 1 3 3 3 1 1 3 3
##  [171] 1 3 3 3 3 3 1 3 3 2 3 3 1 3 3 1 3 3 3 3 3 3 1 3 3 1 1 3 3 3 3 1 3 1
##  [205] 3 1 3 3 2 3 3 3 3 3 3 1 3 3 2 3 3 3 1 1 3 3 1 3 3 3 3 3 1 1 3 3 3 3
##  [239] 3 3 3 2 3 3 3 3 3 1 3 1 3 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1
##  [273] 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 3 3
##  [307] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3
##  [341] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1 3 1 3
##  [375] 3 3 3 1 3 1 3 3 2 3 1 3 3 1 3 3 3 3 3 3 1 3 1 3 3 1 3 3 3 3 1 1 1 3
##  [409] 3 1 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 2 3 3 3 1 3 3 3 3 3 3 3 3 3 3 1
##  [443] 2 3 3 3 3 3 1 3 3 3 1 1 1 1 3 3 3 3 1 1 3 1 1 3 3 1 3 3 3 3 1 3 1 3
##  [477] 3 3 3 1 1 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 1 3 3 1 1 3
##  [511] 3 3 1 1 3 3 1 3 1 1 3 2 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3
##  [545] 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 3 3 1 1 3 1 3 2 3 3 2 3 2 1 3 1 2 3 3
##  [579] 1 1 3 3 3 3 1 3 1 3 3 3 3 3 1 3 1 1 3 3 3 1 1 3 3 3 3 3 3 3 1 3 3 3
##  [613] 3 3 1 3 2 3 3 3 2 1 3 3 1 3 3 3 3 3 3 1 1 1 3 3 3 3 1 3 3 1 1 1 3 2
##  [647] 3 1 3 3 1 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [681] 3 3 1 3 3 3 3 1 3 3 2 3 3 3 3 2 3 3 1 3 1 1 1 3 3 3 3 3 1 1 3 3 1 3
##  [715] 3 3 1 1 3 1 1 3 3 2 1 3 3 3 2 2 2 1 3 3 1 3 3 3 3 1 3 3 3 1 1 3 3 1
##  [749] 3 3 3 3 3 1 2 3 3 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 2 3 3 1 3 3 1 1 3
##  [783] 2 3 3 3 3 3 1 1 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 2 1 3 3 1 3
##  [817] 3 3 3 3 2 1 3 3 3 3 1 1 3 1 3 1 1 1 1 2 1 1 1 3 3 3 3 2 2 3 3 2 3 2
##  [851] 3 3 3 3 3 1 3 3 1 3 1 1 3 3 1 3 3 2 3 3 3 1 3 2 3 1 1 1 3 3 3 3 1 3
##  [885] 3 1 3 1 3 2 2 2 3 3 3 2 3 3 1 3 1 3 3 1 3 3 3 3 3 1 1 3 1 3 1 3 3 3
##  [919] 1 1 3 3 3 1 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 1 2 3 3 2 3 1 3 1 3 3 1 3
##  [953] 3 2 3 3 3 3 1 3 1 3 2 3 3 2 1 3 3 3 1 1 3 2 3 2 2 2 3 3 3 3 1 3 1 3
##  [987] 2 3 3 3 3 2 1 1 1 3 3 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 427277030 284495713 566445196
##  (between_SS / total_SS =  83.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
k4 <- kmeans(df, centers = 4, nstart = 50)
k4
## K-means clustering with 4 clusters of sizes 42, 543, 130, 285
## 
## Cluster means:
##   Duration    Amount      Age
## 1 40.26190 12511.714 36.66667
## 2 15.06077  1469.337 35.61694
## 3 33.34615  7127.523 36.80000
## 4 23.50526  3583.607 34.65965
## 
## Clustering vector:
##    [1] 2 4 2 2 2 2 4 2 2 4 4 3 2 3 2 4 4 4 2 3 4 4 2 2 4 4 2 2 4 4 4 4 3 2
##   [35] 2 2 2 2 4 4 2 2 2 4 2 2 2 2 2 2 2 4 2 2 3 2 2 2 4 4 2 4 4 1 2 2 4 2
##   [69] 2 2 4 2 2 2 4 2 4 2 2 2 2 4 3 2 2 4 2 4 2 2 2 4 4 2 3 4 4 4 2 3 4 2
##  [103] 4 3 2 2 2 2 2 2 4 2 2 2 4 3 2 4 2 4 4 3 2 2 2 4 2 4 2 4 2 2 2 2 2 2
##  [137] 2 2 2 4 4 4 4 2 2 2 4 2 2 2 2 2 3 4 2 4 2 3 3 4 2 2 4 4 2 2 3 4 4 2
##  [171] 3 2 2 2 2 4 4 2 2 1 2 2 3 4 2 3 2 4 2 4 2 4 3 2 2 3 3 2 2 4 2 3 2 3
##  [205] 4 4 2 2 1 4 2 4 2 2 2 4 4 2 1 2 2 4 4 3 4 2 4 2 2 4 2 2 3 4 2 2 2 2
##  [239] 2 4 2 1 2 2 2 4 2 4 4 4 4 3 2 4 3 4 4 2 2 4 4 2 2 2 2 4 2 2 2 2 2 3
##  [273] 2 4 4 4 4 4 4 2 2 2 4 2 4 2 2 4 2 2 4 2 2 2 2 3 2 2 2 2 2 4 4 4 2 2
##  [307] 2 2 4 3 2 4 2 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2 4 3 2 2 2 2 2 2 2 2 4 2
##  [341] 4 4 4 4 2 2 2 2 2 2 2 2 3 4 2 1 4 2 2 4 2 2 2 2 2 4 2 2 2 4 3 4 3 2
##  [375] 2 4 2 3 2 3 4 4 1 2 4 2 2 3 2 4 4 2 2 2 3 2 3 2 2 3 4 2 2 4 4 3 4 2
##  [409] 2 3 2 4 2 2 2 4 2 3 2 2 2 2 2 2 2 2 1 2 2 4 4 2 2 2 2 2 2 2 2 4 2 4
##  [443] 1 2 2 2 2 2 4 2 2 2 3 3 4 3 4 2 2 2 3 3 4 4 3 2 2 4 2 2 2 2 3 2 4 2
##  [477] 4 4 4 4 4 4 4 3 2 2 4 4 2 4 2 4 4 4 2 2 2 4 2 2 2 2 2 2 3 2 2 3 4 2
##  [511] 2 2 4 4 4 4 3 2 3 3 4 1 2 2 2 3 2 4 2 2 2 2 2 2 2 2 4 2 2 4 2 4 2 2
##  [545] 2 2 2 2 2 2 2 2 2 2 2 4 3 2 3 2 4 4 3 2 4 2 1 2 2 1 2 3 3 4 3 1 4 2
##  [579] 4 3 2 2 4 2 3 4 4 4 4 4 2 2 3 2 4 3 2 2 2 4 3 2 2 2 4 2 2 2 4 4 2 4
##  [613] 2 2 3 2 1 2 4 2 3 3 2 4 3 2 4 2 2 4 2 3 4 4 4 4 2 2 4 4 2 3 3 3 2 1
##  [647] 2 3 2 4 4 2 2 2 3 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2
##  [681] 2 2 3 2 2 2 2 4 4 2 1 2 2 4 2 1 2 2 4 2 3 3 3 4 2 2 4 2 3 3 4 4 3 2
##  [715] 4 4 3 4 2 3 3 2 4 3 4 4 4 2 1 3 3 3 4 2 3 2 2 2 2 3 2 4 4 3 4 2 4 3
##  [749] 2 4 4 2 2 4 1 2 2 2 2 4 4 4 4 3 4 2 4 2 2 2 2 2 2 3 2 2 4 4 2 4 3 2
##  [783] 1 2 2 2 2 2 4 4 2 2 3 2 2 2 4 2 2 2 2 2 2 2 2 2 2 3 4 2 1 3 2 4 4 2
##  [817] 2 2 2 2 1 3 2 2 2 2 4 4 2 3 4 4 3 4 4 1 4 4 3 2 2 2 2 1 1 2 2 1 4 1
##  [851] 4 2 2 2 2 4 4 2 4 2 4 4 2 4 3 2 2 3 2 2 4 3 2 3 2 4 4 3 2 2 2 2 3 2
##  [885] 4 3 4 3 2 3 1 3 2 2 2 1 4 4 4 2 4 4 2 4 4 2 2 2 2 3 4 2 4 4 4 2 2 4
##  [919] 4 3 4 2 2 4 4 2 4 2 1 1 2 2 2 2 4 2 2 2 2 3 1 2 2 1 2 3 2 3 2 2 4 2
##  [953] 2 1 2 2 4 4 4 4 3 2 1 4 2 1 3 2 2 2 4 3 2 1 2 1 1 1 4 4 2 2 4 2 3 2
##  [987] 3 2 2 4 2 1 3 4 3 2 2 1 3 3
## 
## Within cluster sum of squares by cluster:
## [1] 171776080 162732739 159054981 156315889
##  (between_SS / total_SS =  91.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
k5 <- kmeans(df, centers = 5, nstart = 50)
k5
## K-means clustering with 5 clusters of sizes 116, 73, 488, 32, 291
## 
## Cluster means:
##   Duration    Amount      Age
## 1 29.75000  5637.569 36.56034
## 2 36.49315  8383.397 36.41096
## 3 14.32992  1366.691 35.67828
## 4 38.43750 13209.281 37.03125
## 5 22.56014  3146.598 34.52577
## 
## Clustering vector:
##    [1] 3 5 3 3 3 3 5 3 3 5 5 1 3 2 3 5 5 5 5 2 5 5 5 3 1 1 3 3 5 1 5 5 1 3
##   [35] 3 3 3 3 5 5 3 3 3 5 3 3 3 3 3 3 3 5 3 5 1 3 3 3 5 5 3 5 5 4 3 3 5 5
##   [69] 3 3 5 3 5 3 5 3 5 3 3 3 3 5 1 5 3 5 5 1 3 3 3 5 5 3 1 1 5 1 3 1 5 3
##  [103] 1 2 3 3 3 3 3 3 5 3 5 3 5 1 3 5 3 5 1 1 3 3 3 1 3 5 3 5 3 5 3 3 3 3
##  [137] 3 5 3 5 5 5 1 5 5 5 5 3 3 3 3 3 2 5 3 5 3 1 1 5 3 3 5 5 3 3 1 1 5 3
##  [171] 1 3 3 3 3 5 5 3 3 4 3 3 2 5 3 1 5 5 5 5 3 5 2 3 3 2 1 3 3 5 5 2 3 2
##  [205] 5 1 3 3 4 5 3 5 5 3 3 5 5 3 2 3 3 5 1 2 5 3 5 3 3 5 3 3 2 5 3 3 3 5
##  [239] 3 5 3 2 3 3 3 5 3 5 5 1 5 1 3 5 1 5 5 3 5 5 5 3 3 3 3 5 3 3 3 3 3 2
##  [273] 3 5 5 5 5 5 5 3 3 3 1 5 5 3 3 5 5 3 5 3 3 3 3 2 5 3 3 5 3 1 5 5 3 3
##  [307] 3 3 5 2 3 5 3 3 3 3 3 5 3 3 3 3 3 3 3 3 3 3 5 1 3 5 3 3 3 3 3 3 5 3
##  [341] 5 5 5 5 3 3 3 3 3 3 3 3 2 5 3 2 5 3 3 1 3 3 3 3 5 5 3 3 3 5 1 5 2 3
##  [375] 3 5 3 1 3 1 5 5 4 3 5 3 3 2 3 5 5 3 3 3 1 3 1 3 3 1 5 3 3 5 1 1 1 3
##  [409] 3 2 3 5 3 5 3 5 3 1 3 3 3 3 3 3 3 3 4 3 3 5 1 3 5 3 3 3 3 3 5 5 5 1
##  [443] 4 3 3 3 3 3 1 3 3 3 2 1 5 2 5 3 5 5 1 2 5 5 2 3 3 1 5 3 3 3 2 3 1 3
##  [477] 5 5 5 5 1 5 5 1 3 3 5 5 3 5 3 5 5 5 3 3 3 5 3 3 3 5 3 3 2 3 3 1 1 3
##  [511] 3 3 5 5 5 5 1 3 1 2 5 4 3 3 3 2 3 5 3 3 3 3 3 3 5 3 5 3 3 5 3 5 3 3
##  [545] 3 3 3 3 3 3 3 3 3 3 3 5 2 3 1 3 5 5 2 3 5 3 2 3 3 2 3 2 2 5 1 4 5 3
##  [579] 5 2 3 5 5 3 2 5 1 5 5 5 3 3 1 3 1 1 3 3 5 1 2 3 3 3 5 3 3 3 1 5 3 5
##  [613] 3 3 1 3 2 3 5 3 2 1 3 5 1 3 5 3 3 5 3 2 5 5 5 5 3 3 5 5 5 1 1 1 3 4
##  [647] 3 2 5 5 5 3 3 5 1 5 3 3 5 3 3 3 3 3 3 3 5 3 3 5 3 3 3 3 3 3 3 3 5 3
##  [681] 3 3 2 3 3 3 3 5 5 3 2 3 3 5 3 4 3 3 1 3 2 1 2 5 3 3 5 3 2 1 5 5 1 3
##  [715] 5 5 2 1 3 1 1 3 5 2 5 5 5 3 4 2 2 1 5 5 1 3 3 3 3 1 3 5 5 2 1 3 5 1
##  [749] 3 5 5 3 3 5 4 3 3 3 3 5 5 5 1 1 5 3 1 3 3 3 3 3 3 2 3 3 5 5 3 1 2 3
##  [783] 4 3 3 5 3 3 5 1 3 5 1 3 3 3 5 5 3 3 3 3 3 3 3 3 5 1 5 3 4 2 3 5 1 3
##  [817] 3 3 3 3 4 1 3 3 3 3 5 1 3 1 5 1 2 5 1 4 5 5 1 3 3 3 3 4 4 5 3 4 5 4
##  [851] 5 3 3 3 3 5 5 5 5 3 5 5 3 5 1 3 3 2 3 3 5 2 3 2 3 5 1 1 3 3 5 3 2 3
##  [885] 5 1 5 2 5 2 4 2 3 3 3 4 5 5 5 3 1 5 3 5 5 3 3 5 3 2 1 3 5 5 5 5 3 5
##  [919] 1 1 5 3 3 1 5 3 5 3 2 4 3 3 3 3 5 3 3 3 3 2 2 3 3 4 3 1 3 2 3 3 1 3
##  [953] 3 4 3 3 5 5 5 5 1 3 4 5 3 2 2 3 3 3 1 2 3 4 3 4 4 4 5 5 3 3 1 3 1 5
##  [987] 2 3 3 5 3 4 2 5 1 3 5 4 1 1
## 
## Within cluster sum of squares by cluster:
## [1]  71605221  77393875 111669199 105689153 101363806
##  (between_SS / total_SS =  94.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
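
Before moving on, a compact comparison of the three runs above on variance accounted for (VAF = between_SS / total_SS), using components kmeans() already returns:

# VAF for the 3-, 4-, and 5-cluster solutions fitted above
sapply(list(k3 = k3, k4 = k4, k5 = k5), function(m) m$betweenss / m$totss)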

Question 3

# Fit k-means for k = 2 through 10 on the training data and record the
# between-cluster and total sums of squares for each solution
clusters <- 2:10
between_SS <- c()
tot_SS <- c()
set.seed(1234)

for (num in clusters) {
  cluster.object <- kmeans(training, centers = num, nstart = 100)
  between_SS <- append(between_SS, cluster.object$betweenss)
  tot_SS <- append(tot_SS, cluster.object$totss)
}

VAF <- as.data.frame(cbind(clusters, between_SS, tot_SS))
VAF$VAF <- between_SS / tot_SS
VAF
##   clusters between_SS     tot_SS       VAF
## 1        2 3906308518 5679962401 0.6877349
## 2        3 4808799874 5679962401 0.8466253
## 3        4 5216475461 5679962401 0.9183996
## 4        5 5345306506 5679962401 0.9410813
## 5        6 5463785102 5679962401 0.9619404
## 6        7 5518813619 5679962401 0.9716285
## 7        8 5547318229 5679962401 0.9766470
## 8        9 5575547262 5679962401 0.9816169
## 9       10 5594043580 5679962401 0.9848733

The largest improvement in VAF comes from moving from 2 to 3 clusters, with diminishing returns for each additional cluster beyond 3.
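
The incremental gain from each additional cluster can be read off the table directly; a small sketch using the VAF data frame built above:

# VAF gained by moving to each additional cluster
data.frame(added_cluster = VAF$clusters[-1], VAF_gain = diff(VAF$VAF))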

Question 4

plot(2:10, VAF$VAF, type = "l", xlab = "Number of Clusters", ylab = "VAF", main = "Scree Plot")
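
To make the elbow easier to see, one could (as a sketch) mark the candidate cut points discussed in Question 6 on the plot produced above:

# Overlay the candidate elbow positions (3 and 4 clusters) on the scree plot
abline(v = c(3, 4), lty = 2, col = "grey60")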

Question 5

Question 6

6.1 Comparing the VAF of the three-cluster model (.84) to that of the four-cluster model (.91), the diminishing returns call into question the tradeoff of adding an additional cluster.
6.2 Because the VAF of the four-cluster model, as well as the cluster centers of the two solutions, are very similar, going with the smaller number of clusters is appropriate on the test data as well as the training data.
6.3 Since the four-cluster solution is more interpretable (e.g., patterns such as long durations with high amounts are clearer in the four-cluster model than in the five-cluster model) and the VAF tradeoff is small, the four-cluster model is appropriate. The small gain in VAF from a fifth cluster is not worth the resulting loss in interpretability.

## [1] 0.9183592
## [1] 0.9412428
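
The chunk that produced the two values above is not echoed, but they appear to be VAF figures for the 4- and 5-cluster solutions. As a sketch (not necessarily the method used here), a holdout VAF can be computed by assigning each test observation to its nearest training-fit center:

# Sketch: evaluate a fitted k-means solution on holdout data by assigning each
# test row to its nearest center and computing between_SS / total_SS there.
holdout_vaf <- function(fit, newdata) {
  X <- as.matrix(newdata)
  centers <- fit$centers
  # squared Euclidean distance from every observation to every center
  d2 <- sapply(seq_len(nrow(centers)),
               function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  nearest <- apply(d2, 1, which.min)
  within <- sum((X - centers[nearest, , drop = FALSE])^2)
  tot_ss <- sum(sweep(X, 2, colMeans(X))^2)
  (tot_ss - within) / tot_ss
}
set.seed(1234)
km4_train <- kmeans(training, centers = 4, nstart = 100)
holdout_vaf(km4_train, testing)   # and likewise for a 5-cluster fit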

Question 7

komeans4 <- komeans(scaledtrain, nclust = 4, lnorm = 2, nloops = 100, tolerance = .001, seed = 3)
komeans4
komeans4$VAF

komeans3 <- komeans(scaledtrain, nclust = 3, lnorm = 2, nloops = 100, tolerance = .001, seed = 3)
komeans3
komeans3$VAF

komeans5 <- komeans(scaledtrain, nclust = 5, lnorm = 2, nloops = 100, tolerance = .001, seed = 3)
komeans5
komeans5$VAF
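
For a side-by-side view, a sketch that puts the KO-means VAF values next to k-means VAF values computed on the same scaled training data, so the comparison is like-for-like (this assumes komeans()$VAF is reported on the same between_SS / total_SS scale):

# Compare k-means and KO-means VAF for 3, 4, and 5 clusters on scaledtrain
set.seed(1234)
km_fits <- lapply(3:5, function(k) kmeans(scaledtrain, centers = k, nstart = 100))
data.frame(clusters    = 3:5,
           kmeans_VAF  = sapply(km_fits, function(m) m$betweenss / m$totss),
           komeans_VAF = c(komeans3$VAF, komeans4$VAF, komeans5$VAF))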

Question 8 I would choose the k-means solution. The KO-means clusters show too much similarity across their solutions, which makes them significantly harder to interpret and suggests that clusters need to be combined. Moreover, there is little to no gain in VAF from switching to KO-means.
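
If the komeans object also exposes its group assignments, one way to see this overlap would be a cross-tabulation against k-means labels fitted on the same scaled data. The component name `Group` below is an assumption about the komeans implementation (only `VAF` is used above); substitute whatever the actual component is called.

# Hypothetical check: agreement between k-means clusters and KO-means groups
set.seed(1234)
km4_scaled <- kmeans(scaledtrain, centers = 4, nstart = 100)
# `Group` is a hypothetical component name for the KO-means assignments
table(kmeans = km4_scaled$cluster, komeans = komeans4$Group)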

Question 9 When comparing the k-means and KO-means models on the variables Age, Duration, and Amount at the same cluster sizes, there is little to no gain in VAF to justify the loss of interpretability in the KO-means model. The decision was made to use the four-cluster k-means model, as it is the most interpretable without a major loss in VAF.
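
To profile the chosen four-cluster solution on the three variables, a short sketch using the k4 object fitted in Question 2:

# Per-cluster means and cluster sizes for the chosen 4-cluster k-means model
aggregate(df, by = list(cluster = k4$cluster), FUN = mean)
table(k4$cluster)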

Question 10 Step 1: Recruitment. The first step would be to determine the requirements of the telephone recruitment: What questions are we allowed to ask? How long can we speak? Can we leave a voicemail? After that, I would review the data using models beyond KO-means and k-means to determine whether there are relationships with any other existing data sets. One example would be to check whether age has any relationship to the duration of a phone call, which could shape how we approach recruitment.
Step 2: This seems like a bad idea. It could create a situation in which people who are less well off financially are more prone to recruitment than people who have more money, and it could also skew the answers if people believe their compensation depends on their responses. Since our basic clusters are already defined, we could take a new recruit's age (as an example) and, using what we know about loan amounts for people of that age, place them in a cluster: use what we know to find a cluster for the new recruit.
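
A sketch of that last assignment step: place a new recruit into the nearest existing k-means cluster using the numeric fields we can collect up front (the recruit's values below are made up for illustration).

# Hypothetical new recruit (illustrative values only)
new_recruit <- c(Duration = 24, Amount = 3000, Age = 30)
# squared distance from the recruit to each of the four k-means centers
d2 <- rowSums(sweep(k4$centers, 2, new_recruit)^2)
which.min(d2)   # nearest cluster for the new recruit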