
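We use the `infert` fertility data from the previous analysis.
Objective: to determine a profile of the women recorded as cases (`case` = 1). In R's documentation for `infert`, `case` is the case-control status (1 = case, 0 = control) from a study of infertility, so a 1 marks an infertility case rather than a pregnancy.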

Get the Data

data("infert")
# Subset the data using square brackets.
infert[1:3, ] # view the first 3 rows and all columns

##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4

infert1 <- infert[, -c(7, 8)] # omit the last 2 columns (stratum, pooled.stratum)
infert1[1:3, ] # view the first 3 rows and all columns of infert1

##   education age parity induced case spontaneous
## 1    0-5yrs  26      6       1    1           2
## 2    0-5yrs  42      1       1    1           0
## 3    0-5yrs  39      6       2    1           0

sapply(infert1, class) #View the class of the variables in the dataset

##   education         age      parity     induced        case spontaneous 
##    "factor"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"

summary(infert1) #Summary statistics for dataset

##    education        age            parity         induced      
##  0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000  
##  6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000  
##  12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000  
##                Mean   :31.50   Mean   :2.093   Mean   :0.5726  
##                3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000  
##                Max.   :44.00   Max.   :6.000   Max.   :2.0000  
##       case         spontaneous    
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000  
##  Mean   :0.3347   Mean   :0.5766  
##  3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :2.0000

table(infert1$case) # table() counts how often each value of case occurs

## 
##   0   1 
## 165  83

# using the ifelse command to recode the education variable
infert1$education1 <- ifelse(infert1$education == "0-5yrs", 0, ifelse(infert1$education == "6-11yrs" , 1, 2))
infert1$education1

##   [1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [211] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [246] 2 2 2
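
The nested `ifelse()` recode above can also be written by taking the integer codes of the factor. This is a minimal sketch, assuming the levels of `education` are ordered `0-5yrs`, `6-11yrs`, `12+ yrs` (the order shown in the `summary()` output). The names `education_codes` and `nested` are illustrative and do not appear in the original analysis.

```r
data("infert")

# Assumption: factor levels are ordered 0-5yrs, 6-11yrs, 12+ yrs,
# so the integer codes 1, 2, 3 become 0, 1, 2 after subtracting 1.
levels(infert$education)

education_codes <- as.integer(infert$education) - 1L  # hypothetical name

# Compare with the nested ifelse() recode used above
nested <- ifelse(infert$education == "0-5yrs", 0,
                 ifelse(infert$education == "6-11yrs", 1, 2))
all(education_codes == nested)
```

If the level order ever differs from the assumption, the comparison on the last line returns `FALSE`, which makes the shortcut easy to check before relying on it.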

Doing cluster analysis and creating 2 best clusters from the data.

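Here the choice of 2 clusters is dictated by the profiling goal: `case` takes two values (0 and 1), so we ask whether two clusters reproduce that split.

# Cluster on columns 2, 3, 4, 6, 7: age, parity, induced, spontaneous, education1
infert2 <- kmeans(infert1[, c(2, 3, 4, 6, 7)], centers = 2, nstart = 25)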
infert2

## K-means clustering with 2 clusters of sizes 150, 98
## 
## Cluster means:
##        age   parity induced spontaneous education1
## 1 27.90000 1.960000    0.62   0.6066667   1.560000
## 2 37.02041 2.295918    0.50   0.5306122   1.204082
## 
## Clustering vector:
##   [1] 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2
##  [36] 2 2 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 2 1
##  [71] 1 1 1 2 1 1 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2
## [106] 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1
## [141] 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1
## [176] 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1
## [211] 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2
## [246] 2 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 1581.353 1023.204
##  (between_SS / total_SS =  65.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
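
The cluster means above differ mainly in age (27.9 versus 37.0), which suggests that the variable with the largest spread drives the distances. The sketch below is not part of the original analysis: it standardises the columns with `scale()` before clustering and fixes the random seed so that the cluster labels are repeatable. The seed value and the names `vars`, `vars_scaled`, and `km_scaled` are assumptions chosen for illustration.

```r
data("infert")

# Rebuild the numeric education code used in the analysis above
infert$education1 <- ifelse(infert$education == "0-5yrs", 0,
                            ifelse(infert$education == "6-11yrs", 1, 2))

vars <- infert[, c("age", "parity", "induced", "spontaneous", "education1")]

# Standardise each column to mean 0 and standard deviation 1
vars_scaled <- scale(vars)

set.seed(123)  # arbitrary seed, assumed here only for reproducibility
km_scaled <- kmeans(vars_scaled, centers = 2, nstart = 25)

km_scaled$size                          # cluster sizes
km_scaled$betweenss / km_scaled$totss   # share of total sum of squares between clusters
table(km_scaled$cluster, infert$case)   # cross-tabulation against case
```

The ratio `betweenss / totss` is the quantity that the printout above reports as `between_SS / total_SS`. Whether scaling changes the conclusion would have to be checked by running the sketch.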

Cross tabulation of the clustering vector.

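This table resembles a confusion matrix, although k-means cluster labels are arbitrary, so the rows are cluster memberships rather than predicted classes.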

confusionmatrix <- table(infert2$cluster, infert1$case)
confusionmatrix

##    
##       0   1
##   1 100  50
##   2  65  33

As the table above shows, cluster 1 contains more of the women with `case` = 1 (50 versus 33), but it also contains most of the women with `case` = 0 (100 versus 65). The share of `case` = 1 is nearly identical in the two clusters (50/150 ≈ 33% and 33/98 ≈ 34%). This indicates that, with this data, cluster analysis does not establish a distinct profile in this population.
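
The comparison is easier to read as row proportions. The sketch below re-enters the counts from the cross-tabulation above, so that it runs on its own, and uses `prop.table()` to get the share of each `case` value within each cluster. The names `tab` and `row_share` are illustrative.

```r
# Counts copied from the cross-tabulation above (filled column by column)
tab <- matrix(c(100, 65, 50, 33), nrow = 2,
              dimnames = list(cluster = c("1", "2"), case = c("0", "1")))

# Share of each case value within each cluster (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
round(row_share, 3)
```

Both clusters have about one third of their members with `case` = 1 (50/150 and 33/98), close to the overall share of 83/248, so cluster membership carries almost no information about `case`.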

Using Cluster analysis to Develop a Profile

Santam Chakraborty

27 December 2015
