Classification Model for Pregnancies

Overview

This document is me helping a very dear friend who is looking to classify survey respondents into one of 7 categories based on their survey responses.

Simulating a Sample Data Set

Going to randomly generate responses for the 9 survey questions:
* Age
* Sex
* Are you financially prepared to have a child?
* How satisfied are you with your current career status?
* How might having a child affect your career status?
* How emotionally ready are you for a baby?
* How important is a partner in raising a child?
* Are you currently with a partner who is ready for a child?
* How soon would you like to have a baby?

I’ll generate 10,000 random results for each question, store them in a dataframe as if they are responses from survey respondents, and then split that dataset into a training set to build a random forests model upon and a testing set to see how accurate our model was.

# Set sample size
sample_size <- 10000

# set seed for reproducible RNG outcomes
set.seed(324)

# Random results generated
age <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.05, 0.15, 0.64, 0.15, 0.01))
sex <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.04, 0.95, 0.01))
financially_ready <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.2, 0.2, 0.1, 0.3, 0.2))
employed <- sample(0:1, size = sample_size, replace = TRUE, prob = c(0.95, 0.05))
job_loss_worry <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.05, 0.6, 0.35))
comfortable_baby_inside <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.3, 0.4, 0.17, 0.1, 0.03))
want_partner <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.23, 0.71, 0.01, 0.04, 0.01))
with_good_partner <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.59, 0.23, 0.18))
when_ready <- sample(0:5, size = sample_size, replace = TRUE, prob = c(0.11, 0.05, 0.28, 0.27, 0.17, 0.12))

# Create data frame of random results
random_pregnancy_sample <- data.frame(age, sex, financially_ready, employed, job_loss_worry, comfortable_baby_inside, want_partner, with_good_partner, when_ready)

Assigning Outcomes to Results

The following outcome is to be assigned to a respondent:

A = Artificial Womb
B = Discrete Surrogacy
C = Genetic Partner Matching
D = Budget Home Birth Kit
E = Egg Freezing
F = IVG + Surrogacy or AW
G = Default – Assisted Physician Birth

Unsupervised Model to Generate Outcomes

I’ll build an unsupervised k-means algorithm to cluster the responses into 7 categories. Then I’ll attach the category to each respondent in a variable called Outcome.

# Set seed
set.seed(324)

# generate 7 clusters
clusters <- kmeans(random_pregnancy_sample, centers = 7)

# assign cluster to respondent
random_pregnancy_sample$Outcome <- as.factor(clusters$cluster)

Plot of age, financially ready, and outcome

Here is a random pair of variables – age (x) and financially ready (y). The colors of the points are the outcomes they were classified as.

library(ggplot2)
ggplot(aes(x = age, y = financially_ready, colour = Outcome), data = random_pregnancy_sample) + geom_jitter()

The k-means algorithm clearly weighted financially ready as an important variable to classify on as many values of 0 or 1 were classified as Outcome 7.

Prediction Model

This model will train a random forests model on 70% of the original simulated dataset. Then we’ll see how accurately that model can predict the values in the test set.

# Load libraries
library(caret)
library(caTools)
library(e1071)

# Split into training and testing set
split <- sample(x = 1:sample_size, size = 0.7*sample_size, replace = FALSE)
trainingset <- random_pregnancy_sample[split,]
testingset <- random_pregnancy_sample[-split,]

# Train partition random forests model
mod <- train(Outcome ~ ., data = trainingset, method = "rf")

# Test the model predictions
test_predictions <- predict(mod, testingset)

# Confusion Matrix
confusionMatrix(test_predictions, testingset$Outcome)

With such an accurate confusion matrix, I have a lot of confidence that when we feed data into our training model, it will output very accurate results.