Indonesia, with an estimated population of 260 million people, is the world’s fourth largest country. In the late 1960s, President Suharto implemented a ‘population policy’ that would help the nation reach its development goals. The so-called “Family Planning Program” (FP program) was successful: between 1976 and 2002, contraceptive use in the country doubled, and the nation’s average fertility rate dropped from 5.6 to 2.6 children per woman.
In order to reach the target population goals, the FP program focused on community education and the distribution of free contraceptives. The successes of the FP program ensured that the steady growth of Indonesia’s economy bettered the quality of life for everyone in the country, rather than funding and fueling overpopulation.
In recent years, however, the success of the FP program—heralded before the turn of the century as having achieved “one of the most impressive records in fertility reduction over the past two decades”—has begun to slow dramatically.
As Hari Fitri Putjuk notes in her piece on Indonesian family planning for the Bill & Melinda Gates Foundation: “This is largely due to a complex devolution process that shifted power over family planning programs from the national to the district level, leading to confusion around roles and responsibilities — and, at times, inaction.”
Putjuk also notes, however, that things are looking up for the FP program. Since the 2012 London Summit on Family Planning, the Indonesian government has launched a renewed effort to strengthen the FP program.
As the country looks to continue making strides in managing its population, it is vital that they first understand the behavior of the nation’s parents and potential parents. What types of people use each type of contraceptive method? How does education level affect contraceptive choices? Does religion in the majority-Muslim country play an important role in the family-planning process?
Finding answers to questions like these will shape family planning policy in Indonesia, determine how to best allocate funding of the program, and allow the country’s government to best improve the quality of life for everyone in their nation.
Following the creation and implementation of Indonesia’s Family Planning Program, the National Indonesia Contraceptive Prevalence Survey was taken in the country with the goal of gauging both the progress of the program and the status of contraceptive use among Indonesian women. The data I will be analyzing is a subset of the data collected in the 1987 survey.
The data comes from surveys taken by married Indonesian women who were not pregnant (or did not know they were pregnant) at the time of the survey. The report itself offers more insight into how the data was collected:
“DHS model questionnaires and manuals were modified to suit the needs of Indonesia and were translated into Bahasa Indonesian. Over 90 female interviewers were trained for 15 days in five training centers during September, 1987 and data collection took place from mid-September to the third week of December. Data from the questionnaires were entered on microcomputers at the CBS headquarters in Jakarta, using the ISSA program, which was specially designed for the DHS project.”
The specific subset of the survey data that I will be analyzing is available through the UCI Machine Learning Repository here.
I scraped the data from the UCI page using the read_delim function from the readr package:
library(readr)
URL <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/cmc/cmc.data'
cmc <- read_delim(url(URL), delim = ",", col_names = FALSE)
The main data frame of the project is called “cmc”, short for “Contraceptive Method Choice.”
Obviously, this data is outdated — it is almost three decades old, after all. However, the methods of analysis and machine learning models applied to this data can be applied to survey data collected today. One of the steps planned by the Indonesian government is to equip volunteers with electronic tablets; those tablets could facilitate the collection of similar data that could be analyzed in a similar way.
The raw data from the UCI repository doesn’t include column names or any information about the variables. The repository does, however, include a separate file with more information about the data.
The variables in the data are, according to the UCI description page:
| Variable | Type | Description |
|---|---|---|
| Wife’s Age | Numerical | Age of woman being surveyed |
| Wife’s Education | Categorical | Level of education reached by wife: 1 = low, 2, 3, 4 = high |
| Husband’s Education | Categorical | Level of education reached by husband: 1 = low, 2, 3, 4 = high |
| Number of Children Ever Born | Numerical | Total number of children wife has had in her lifetime |
| Wife’s Religion | Binary | Religion practiced by wife: 0 = Non-Islam, 1 = Islam |
| Wife Working | Binary | Is the woman currently working? 0 = Yes, 1 = No |
| Husband’s Occupation | Categorical | Code of husband’s current occupation: 1, 2, 3, 4 |
| Standard-of-Living Index | Categorical | Standard-of-Living Index of the woman: 1 = low, 2, 3, 4 = high |
| Media Exposure | Binary | Level of woman’s media exposure: 0 = Good, 1 = Not Good |
| Contraceptive Method Used | Class Attribute | 1 = No contraceptive use, 2 = Long-term contaceptive use, 3 = Short-term contraceptive use |
Taking a quick look at the scraped data reveals that we’ll need to clean up the data a bit before it becomes useful to us.
head(cmc)
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## 1 24 2 3 3 1 1 2 3 0 1
## 2 45 1 3 10 1 1 3 4 0 1
## 3 43 2 3 7 1 1 3 4 0 1
## 4 42 3 2 9 1 1 3 3 0 1
## 5 36 3 3 8 1 1 3 2 0 1
## 6 19 4 4 0 1 1 3 3 0 1
dim(cmc)
## [1] 1473 10
The first thing to take care of is labeling the columns:
#fix colnames to make df more usable
colnames(cmc) <- c("wife_age",
"wife_edu",
"husb_edu",
"num_child",
"islam",
"wife_working",
"husb_job",
"SOL_index",
"media_exposure",
"cmc")
As this data was pulled from the UCI Machine Learning Repository, it’s already cleaned up and mostly ready to use. There aren’t any missing values, so the only cleaning I did was checking and changing the classes of the variables, making the ‘binary’ variables booleans, and labeling some of the encoded variables with factor labels that better reflect what the original integer coding represents.
sapply(cmc[,1:10], FUN = function(x) {class(x)})
## wife_age wife_edu husb_edu num_child islam
## "integer" "integer" "integer" "integer" "integer"
## wife_working husb_job SOL_index media_exposure cmc
## "integer" "integer" "integer" "integer" "integer"
The education variables better serve us as labeled factors, not integers:
cmc$wife_edu <- factor(cmc$wife_edu, labels = c("low", "mid_low", "mid_high", "high"))
cmc$husb_edu <- factor(cmc$husb_edu, labels = c("low", "mid_low", "mid_high", "high"))
Next, let’s convert the binary variables to booleans:
cmc$islam <- ifelse(cmc$islam == 1, TRUE, FALSE)
cmc$wife_working <- ifelse(cmc$wife_working == 1, FALSE, TRUE)
cmc$media_exposure <- ifelse(cmc$media_exposure == 1, FALSE, TRUE)
husb_job is also categorical, but since I don’t have any more info on what each level means, I’ll keep the 1-4 labels, but store them as factor variable rather than an integer:
cmc$husb_job <- as.factor(cmc$husb_job)
I changed standard-of-living index to reflect bad/good quality of the categorical variable:
cmc$SOL_index <- factor(cmc$SOL_index, labels = c("low", "mid_low", "mid_high", "high"))
Finally, the most important variable in the data set — Contraceptive Method Choice. This is the variable my models will be classifying on, so I of course need it to be a categorical variable:
cmc$cmc <- factor(cmc$cmc, labels = c("no_use", "long_term", "short_term"))
Let’s check the classes of each variable again to double check the cleaning we did:
sapply(cmc[,1:10], FUN = function(x) {class(x)})
## wife_age wife_edu husb_edu num_child islam
## "integer" "factor" "factor" "integer" "logical"
## wife_working husb_job SOL_index media_exposure cmc
## "logical" "factor" "factor" "logical" "factor"
Looks good — we’re all set to start exploring this data.
Before I start building any models on this data, I want to first visualize it to understand what I’m working with and what engineered features might be worth looking into.
One of my first thoughts on the data was that the more educated the husband and wife, the higher the likelihood that they used either long- or short-term contraception.
To quickly check this suspicion, I plotted the counts of the wife education level/contraceptive choice pairs.
library(ggplot2)
ggplot(cmc, aes(x = wife_edu, y = cmc)) +
geom_count() +
scale_x_discrete(limits = c("low", "mid_low", "mid_high", "high"))
This certainly looks promising, especially the relationship between the ‘long-term’ choice and education level — but there’s a problem: this plot is only showing how often each of these pairings occurred in the data. We don’t know if our data has an equal number of people in each education level. To check, I plotted how frequently each education level occurred in the data set for both men and women:
suppressMessages(library(dplyr))
countsFemale <- cmc %>%group_by(wife_edu) %>% summarise(count = n())
countsMale <- cmc %>%group_by(husb_edu) %>% summarise(count = n())
countsMale$gender <- "male"
countsFemale$gender <- "female"
colnames(countsFemale) <- c("edu", "count", "gender")
colnames(countsMale) <- c("edu", "count", "gender")
countsComb <- rbind(countsFemale, countsMale)
ggplot(countsComb, aes(x = edu, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "dodge") +
scale_x_discrete(limits = c("low", "mid_low", "mid_high", "high")) +
theme_minimal() +
labs(x = "Education Level", y = "Frequency", title = "Distribution of Education Level in Collected Data")
It’s quite clear that there isn’t a uniform distribution across the levels of education of the Indonesians surveyed. Going forward, I’ll assume that none of the variables are uniformly distributed; I’ll work with percentages of each category when visualizing and exploring the data.
Here, I break down the data into subsets of wife_edu. For each education level, I want to see the makeup of the contraceptive choices — which contraceptive choice is most popular for each level of education?
perc <- cmc %>% group_by(wife_edu, cmc) %>% summarise(count = n())
highTotal <- perc %>% filter(wife_edu == "high") %>% summarise(total = sum(count))
lowTotal <- perc %>% filter(wife_edu == "low") %>% summarise(total = sum(count))
mid_lowTotal <- perc %>% filter(wife_edu == "mid_low") %>% summarise(total = sum(count))
mid_highTotal <- perc %>% filter(wife_edu == "mid_high") %>% summarise(total = sum(count))
perc$total <- NA
perc[which(perc$wife_edu == "high"), 4] <- highTotal[1,2]
perc[which(perc$wife_edu == "low"), 4] <- lowTotal[1,2]
perc[which(perc$wife_edu == "mid_high"), 4] <- mid_highTotal[1,2]
perc[which(perc$wife_edu == "mid_low"), 4] <- mid_lowTotal[1,2]
perc <- perc %>% mutate(percent = count/total)
ggplot(perc, aes(x = wife_edu, y = percent, fill = cmc)) +
geom_bar(stat = "identity", position = "stack") +
scale_x_discrete(limits = c("low", "mid_low", "mid_high", "high")) +
theme_minimal() +
labs(x = "Wife Education Level", y = "Percentage Breakdown", title = "Contraceptive Use by Wife's Education Level")
And the same plot, but for the husband’s education level:
Looking at these graphs, a few different relationships appear:
Of course, these are simply correlations, and we cannot be sure at this point that choice of contraception method is determined by education level. I suspect, however, that these relationships will appear in the models.
Another suspicion of mine was that religion would play a role in determining whether or not a couple uses contraception at all. Conservative Islamic scholars frown upon the use of contraception, though the Quran does not explicitly mention it. Indonesia has a large population of conservative Muslims, so intuitively, I expected the Muslim participants in the survey to be less likely to use any contraception than their non-Muslim counterparts.
pc_islam_use <- dim(cmc %>% filter(islam & cmc != 'no_use'))[1]/dim(cmc %>% filter(islam))[1]
pc_islam_not_use <- 1 - pc_islam_use
pc_not_islam_use <- (dim(cmc %>% filter(!islam & cmc != 'no_use'))[1]/dim(cmc %>% filter(!islam))[1])
pc_not_islam_not_use <- 1 - pc_not_islam_use
religion <- data.frame("Religion" = c("Islam", "Non-Islam"), "Use" = c(pc_islam_use, pc_not_islam_use, pc_islam_not_use, pc_not_islam_not_use) )
religion$Contraception_Use <- TRUE
religion[which(religion$Use < 0.5), 3] <- FALSE
ggplot(religion, aes(x = Religion, y = Use, fill = religion$Contraception_Use )) +
geom_bar(stat = "identity", position = "stack") +
guides(title = "Uses Contraception") +
labs(title = "Religious Influence on Contraceptive Use", y = "Usage Breakdown")
It appears that there is a difference in contraceptive use between Muslims and non-Muslims in Indonesia. Muslim women are slightly less likely to use any sort of contraception. This difference is smaller than I expected; it will be interesting to see how it appears in the models.
As I am working with a multi-class classification problem, I’m going to first use a decision tree model. I’ll need to break my data up into a test set and a training set:
index <- sample(1:dim(cmc)[1], floor(0.75 * dim(cmc)[1]))
train <- cmc[index,]
test <- cmc[-index,]
Now that I have partitioned my data, I can train the decision tree model using the training data set, and predict/test using the test data set:
library(rpart)
cmc.rpart <- rpart(cmc ~. , data = cmc, method = "class")
cmc.predict <- predict(cmc.rpart, test, type = "class")
results <- cmc.predict == test$cmc
(accuracy <- sum(results) / length(results))
## [1] 0.5582656
This decision tree usually gets between 50% and 60% accuracy. Not bad, but not great. Let’s take a look at what’s going on under the hood:
suppressMessages(library(rattle))
fancyRpartPlot(cmc.rpart)
It turns out that number of children born is actually the key split in our decision tree. The model essentially breaks the initial cohort of women into two sets: those who have had children in the past, and those who have not. Women that have not yet had a child are almost certain to be using no contraception at all.
After that initial split, the age of the wife plays a key role in the split between choices. Next, the wife’s education level, and another split on number of children.
The splits in the last decision tree make sense, but it resulted in a mediocre accuracy. Let’s see if we can use cross-validation to create a better model.
suppressMessages(library(caret))
train(cmc ~ ., data = cmc, method = "rpart", metric = "Accuracy")
## CART
##
## 1473 samples
## 9 predictor
## 3 classes: 'no_use', 'long_term', 'short_term'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1473, 1473, 1473, 1473, 1473, 1473, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.02310427 0.5372215 0.27881775
## 0.03791469 0.5055572 0.22135818
## 0.06931280 0.4559926 0.08706326
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02310427.
Looks like cross-validation didn’t help at all; in fact, the optimal accuracy using cross-validation was lower than the standard decision tree model.
Rpart isn’t getting great results, so I’m going to try something completely different — a neural network:
library(randomForest)
cmc.nnet <- train(cmc ~., data = cmc, method = "nnet")
cmc.nnet
## Neural Network
##
## 1473 samples
## 9 predictor
## 3 classes: 'no_use', 'long_term', 'short_term'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 1473, 1473, 1473, 1473, 1473, 1473, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 1 0e+00 0.4409899 0.06012862
## 1 1e-04 0.4378088 0.04551876
## 1 1e-01 0.4614377 0.11229170
## 3 0e+00 0.4618637 0.13223509
## 3 1e-04 0.4642792 0.14541486
## 3 1e-01 0.5354087 0.28354505
## 5 0e+00 0.4676611 0.16117609
## 5 1e-04 0.4807547 0.20052068
## 5 1e-01 0.5351501 0.28336522
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 3 and decay = 0.1.
The neural network model doesn’t perform any better; its best accuracy is still in the low 50% range.
As my data set presented a multi-class classification problem, the metric of success was accuracy. I was surprised to find that a simple decision tree model performed the best, averaging about 55% accuracy.
Going forward, I would like to revisit that model, and engineer new features in the data set to assist the machine learning process.
Machine learning models can be used for two purposes: understanding and predicting. In this case, understanding the underlying behavior of Indonesians making decisions about contraceptive methods is far more important than predicting which choice a given Indonesian woman would make.
The ultimate goal of the Family Planning Program is to educate women about contraceptive use so that they are empowered to make their own decisions about contraception and family planning. However, there are quite a few paths to get to that goal, and with limited time and resources, the Indonesian government needs to take the most effective route.
Using a decision tree model allows us to understand what the biggest factors are in the behavior patterns of these women. this knowledge should be utilized to shape the policy and funding allocation of the FP program. For example, looking at the right side of the decision tree visualization, it is clear that young women who are not well educated are extremely likely to be using no contraception or only short term contraceptive methods. The median age that an Indonesian woman has her first child is 22.8 years old. I would advise the Indonesian government to target this cohort of women with educational campaigns about long-term contraceptive methods in secondary schools and universities.
If the decisions about the FP program are data-driven, Indonesia will stand a much better chance of becoming once again a shining example of population management and effective family planning.