This document is me helping a very dear friend who is looking to classify survey respondents into one of 7 categories based on their survey responses.
Going to randomly generate responses for the 9 survey questions:
* Age
* Sex
* Are you financially prepared to have a child?
* How satisfied are you with your current career status?
* How might having a child affect your career status?
* How emotionally ready are you for a baby?
* How important is a partner in raising a child?
* Are you currently with a partner who is ready for a child?
* How soon would you like to have a baby?
I’ll generate 10,000 random results for each question, store them in a dataframe as if they are responses from survey respondents, and then split that dataset into a training set to build a random forests model upon and a testing set to see how accurate our model was.
# Set sample size
sample_size <- 10000
# set seed for reproducible RNG outcomes
set.seed(324)
# Random results generated
age <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.05, 0.15, 0.64, 0.15, 0.01))
sex <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.04, 0.95, 0.01))
financially_ready <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.2, 0.2, 0.1, 0.3, 0.2))
employed <- sample(0:1, size = sample_size, replace = TRUE, prob = c(0.95, 0.05))
job_loss_worry <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.05, 0.6, 0.35))
comfortable_baby_inside <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.3, 0.4, 0.17, 0.1, 0.03))
want_partner <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.23, 0.71, 0.01, 0.04, 0.01))
with_good_partner <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.59, 0.23, 0.18))
when_ready <- sample(0:5, size = sample_size, replace = TRUE, prob = c(0.11, 0.05, 0.28, 0.27, 0.17, 0.12))
# Create data frame of random results
random_pregnancy_sample <- data.frame(age, sex, financially_ready, employed, job_loss_worry, comfortable_baby_inside, want_partner, with_good_partner, when_ready)
The following outcome is to be assigned to a respondent:
#
I’ll build an unsupervised k-means algorithm to cluster the responses into 7 categories. Then I’ll attach the category to each respondent in a variable called Outcome.
# Set seed
set.seed(324)
# generate 7 clusters
clusters <- kmeans(random_pregnancy_sample, centers = 7)
# assign cluster to respondent
random_pregnancy_sample$Outcome <- as.factor(clusters$cluster)
Here is a random pair of variables – age (x) and financially ready (y). The colors of the points are the outcomes they were classified as.
library(ggplot2)
ggplot(aes(x = age, y = financially_ready, colour = Outcome), data = random_pregnancy_sample) + geom_jitter()
The k-means algorithm clearly weighted financially ready as an important variable to classify on as many values of 0 or 1 were classified as Outcome 7.
This model will train a random forests model on 70% of the original simulated dataset. Then we’ll see how accurately that model can predict the values in the test set.
# Load libraries
library(caret)
library(caTools)
library(e1071)
# Split into training and testing set
split <- sample(x = 1:sample_size, size = 0.7*sample_size, replace = FALSE)
trainingset <- random_pregnancy_sample[split,]
testingset <- random_pregnancy_sample[-split,]
# Train partition random forests model
mod <- train(Outcome ~ ., data = trainingset, method = "rf")
# Test the model predictions
test_predictions <- predict(mod, testingset)
# Confusion Matrix
confusionMatrix(test_predictions, testingset$Outcome)
With such an accurate confusion matrix, I have a lot of confidence that when we feed data into our training model, it will output very accurate results.