We use the same fertility data (infert) as in the previous analysis.
Objective: to determine the profile of a woman who will become pregnant.
data("infert")
# Subset the data using square brackets.
infert[c(1:3), ] # view the first 3 rows and all columns
##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4
infert1 <- infert[ , -c(7, 8)] # omit the last 2 columns (stratum, pooled.stratum)
infert1[c(1:3), ] # view the first 3 rows and all columns of the infert1 dataset
##   education age parity induced case spontaneous
## 1    0-5yrs  26      6       1    1           2
## 2    0-5yrs  42      1       1    1           0
## 3    0-5yrs  39      6       2    1           0
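Dropping columns by position works here, but it breaks silently if the column order ever changes. As a minimal sketch, the same subset can be taken by column name. The object names drop_cols and infert1_by_name are illustrative only and do not appear in the original analysis.

```r
data("infert")

# Illustrative names (not from the original analysis):
# the two columns removed above, identified by name instead of position.
drop_cols <- c("stratum", "pooled.stratum")

# Keep every column whose name is NOT in drop_cols.
infert1_by_name <- infert[ , !(names(infert) %in% drop_cols)]

# Compare with the positional version used in the text.
identical(infert1_by_name, infert[ , -c(7, 8)])
```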
sapply(infert1, class) # view the class of each variable in the dataset
##   education         age      parity     induced        case spontaneous
##    "factor"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric"
summary(infert1) # summary statistics for the dataset
##    education        age            parity         induced
##  0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000
##  6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000
##  12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000
##                Mean   :31.50   Mean   :2.093   Mean   :0.5726
##                3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000
##                Max.   :44.00   Max.   :6.000   Max.   :2.0000
##       case         spontaneous
##  Min.   :0.0000   Min.   :0.0000
##  1st Qu.:0.0000   1st Qu.:0.0000
##  Median :0.0000   Median :0.0000
##  Mean   :0.3347   Mean   :0.5766
##  3rd Qu.:1.0000   3rd Qu.:1.0000
##  Max.   :1.0000   Max.   :2.0000
table(infert1$case) # table() counts how often each value of a variable occurs
##
##   0   1
## 165  83
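The counts can also be expressed as proportions. The short sketch below uses prop.table() on the same table. Because case is coded 0/1, the proportion of 1s should match the mean of case reported by summary() above. The name case_counts is illustrative.

```r
data("infert")

# Illustrative name: counts of each value of case.
case_counts <- table(infert$case)

# Convert counts to proportions of the whole sample.
prop.table(case_counts)

# For a 0/1 variable the mean equals the proportion of 1s.
mean(infert$case)
```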
# Recode the education variable as a number using nested ifelse() calls:
# "0-5yrs" -> 0, "6-11yrs" -> 1, anything else (here "12+ yrs") -> 2
infert1$education1 <- ifelse(infert1$education == "0-5yrs", 0,
                             ifelse(infert1$education == "6-11yrs", 1, 2))
infert1$education1
## [1] 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [71] 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [211] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [246] 2 2 2
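Because education is a factor whose levels are already in the order shown by summary() (0-5yrs, 6-11yrs, 12+ yrs), the same recoding can be sketched with the factor's integer codes. This is an alternative, not the method used above, and it assumes that level order. The names education_nested and education_alt are illustrative.

```r
data("infert")

# The nested ifelse() recoding from the text.
education_nested <- ifelse(infert$education == "0-5yrs", 0,
                           ifelse(infert$education == "6-11yrs", 1, 2))

# Alternative (assumes levels are ordered 0-5yrs, 6-11yrs, 12+ yrs):
# factor codes start at 1, so subtract 1 to get 0, 1, 2.
levels(infert$education)
education_alt <- as.integer(infert$education) - 1

# Check that both recodings agree for every row.
all(education_alt == education_nested)
```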
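We ask k-means for 2 clusters because the outcome we want to profile is binary: whether or not the woman became pregnant.
# columns 2, 3, 4, 6, 7: age, parity, induced, spontaneous, education1
infert2 <- kmeans(infert1[ , c(2, 3, 4, 6, 7)], centers = 2, nstart = 25)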
infert2
## K-means clustering with 2 clusters of sizes 150, 98
##
## Cluster means:
##        age   parity induced spontaneous education1
## 1 27.90000 1.960000    0.62   0.6066667   1.560000
## 2 37.02041 2.295918    0.50   0.5306122   1.204082
##
## Clustering vector:
## [1] 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2
## [36] 2 2 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 2 1
## [71] 1 1 1 2 1 1 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 2 2 2
## [106] 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1
## [141] 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1
## [176] 1 2 1 1 1 1 1 1 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 1
## [211] 2 1 1 1 2 1 2 1 1 1 2 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2
## [246] 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 1581.353 1023.204
## (between_SS / total_SS = 65.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
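Two points about this output may be worth checking. First, kmeans() starts from random centers, so the cluster numbering (1 versus 2) can differ between runs unless a seed is set. Second, the cluster means differ mainly in age, which has a much larger range than the other variables and so dominates the distances. The sketch below fixes a seed, reproduces the between_SS / total_SS ratio from the fitted object, and refits on standardized variables for comparison. The seed value, the use of scale(), and the names vars, fit_raw and fit_scaled are assumptions made for illustration, not part of the original analysis.

```r
data("infert")
infert1 <- infert[ , -c(7, 8)]
infert1$education1 <- as.integer(infert1$education) - 1  # 0, 1, 2 as in the text

# Illustrative name: the five clustering variables used above.
vars <- c("age", "parity", "induced", "spontaneous", "education1")

set.seed(123)  # arbitrary seed, only to make the run repeatable

# Same call as in the text, on the unscaled variables.
fit_raw <- kmeans(infert1[ , vars], centers = 2, nstart = 25)

# The percentage printed by kmeans is betweenss divided by totss.
fit_raw$betweenss / fit_raw$totss

# Assumption: standardize each variable so that age does not dominate.
fit_scaled <- kmeans(scale(infert1[ , vars]), centers = 2, nstart = 25)
fit_scaled$size
fit_scaled$betweenss / fit_scaled$totss
```

If the scaled fit produces clusters with a different composition, that would suggest the unscaled solution is essentially a split on age.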
Cross-tabulating cluster membership against the case variable gives a table that is also called a confusion matrix.
confusionmatrix <- table(infert2$cluster, infert1$case)
confusionmatrix
##
##       0   1
##   1 100  50
##   2  65  33
As the table shows, cluster 1 holds the majority of both outcome groups: 100 of the 165 women with case = 0 and 50 of the 83 women with case = 1. The share of women with case = 1 is almost identical in the two clusters (50/150, about 33%, versus 33/98, about 34%), so cluster membership does not separate the outcomes. With these data, cluster analysis therefore cannot establish a distinct profile in this population.
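The comparison above can be made explicit with row proportions. In the sketch below the counts are typed in from the table rather than recomputed, because cluster numbering can change between runs. The chi-squared test is an addition for illustration, not part of the original analysis, and the names row_props and chi are hypothetical.

```r
# Counts copied from the confusion matrix above
# (rows = cluster, columns = case); matrix() fills column by column.
confusionmatrix <- matrix(c(100, 65, 50, 33), nrow = 2,
                          dimnames = list(cluster = c("1", "2"),
                                          case = c("0", "1")))

# Proportion of each case value within each cluster (rows sum to 1).
row_props <- prop.table(confusionmatrix, margin = 1)
row_props

# Added for illustration: test of association between cluster and case.
chi <- chisq.test(confusionmatrix)
chi$p.value
```

A large p-value here would be consistent with the reading that cluster membership carries no information about case.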