Overview

This document is me helping a very dear friend who is looking to classify survey respondents into one of 7 categories based on their survey responses.

Simulating a Sample Data Set

Going to randomly generate responses for the 9 survey questions:
* Age
* Sex
* Are you financially prepared to have a child?
* How satisfied are you with your current career status?
* How might having a child affect your career status?
* How emotionally ready are you for a baby?
* How important is a partner in raising a child?
* Are you currently with a partner who is ready for a child?
* How soon would you like to have a baby?

I’ll generate 10,000 random results for each question, store them in a dataframe as if they are responses from survey respondents. Later I’ll split that dataset into a training set to build a random forests model upon and a testing set to see how accurate our model was.

# Load library
library(tidyr)
library(dplyr)

# Set sample size
sample_size <- 10000

# set seed for reproducible RNG outcomes
set.seed(324)

# Random results generated
age <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.05, 0.15, 0.64, 0.15, 0.01))
sex <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.95, 0.05, 0.01))
financially_ready <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.2, 0.2, 0.1, 0.3, 0.2))
employed <- sample(0:1, size = sample_size, replace = TRUE, prob = c(0.95, 0.05))
job_loss_worry <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.05, 0.6, 0.35))
comfortable_baby_inside <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.3, 0.4, 0.17, 0.1, 0.03))
want_partner <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.23, 0.71, 0.01, 0.04, 0.01))
with_good_partner <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.59, 0.23, 0.18))
when_ready <- sample(0:5, size = sample_size, replace = TRUE, prob = c(0.11, 0.05, 0.28, 0.27, 0.17, 0.12))

# Create data frame of random results
sampleDF <- data.frame(age, sex, financially_ready, employed, job_loss_worry, comfortable_baby_inside, want_partner, with_good_partner, when_ready)
sampleDF <- tbl_df(sampleDF)

Assigning Outcomes to Results

The following outcomes are to be assigned to each respondent:

I’ll begin by assigning the default to all respondants, then overwrite assigned outcomes based on given criteria.

# Assign defaults to all
sampleDF$Outcome <- "G"

sampleDF[sampleDF$sex >= 1 & sampleDF$with_good_partner >= 1, "Outcome"] <- "A"
sampleDF[sampleDF$sex == 0 & sampleDF$age >= 3, "Outcome"] <- "A"

sampleDF[sampleDF$sex >= 1 & sampleDF$with_good_partner == 0, "Outcome"] <- "B"
sampleDF[sampleDF$sex == 0 & sampleDF$job_loss_worry == 0, "Outcome"] <- "B"

sampleDF[sampleDF$with_good_partner >=1 & sampleDF$want_partner <= 1 & sampleDF$financially_ready <= 1, "Outcome"] <- "C"

sampleDF[sampleDF$sex == 0 & sampleDF$age <= 2 & sampleDF$financially_ready >= 2, "Outcome"] <- "D"

sampleDF[sampleDF$sex == 0 & sampleDF$age <= 2 & sampleDF$when_ready >= 3, "Outcome"] <- "E"

sampleDF[sampleDF$want_partner >= 2, "Outcome"] <- "F"

# Results
sampleDF$Outcome <- as.factor(sampleDF$Outcome)
table(sampleDF$Outcome)
## 
##    A    B    C    D    E    F    G 
## 1223  456  865 1956 4121  608  771

Plot of age, financially ready, and outcome

Here is a random pair of variables – age (x) and financially ready (y). The colors of the points are the outcomes they were classified as.

library(ggplot2)
ggplot(aes(x = age, y = financially_ready, colour = Outcome), data = sampleDF) + geom_jitter()

I can generate more of these. You’ll see non-financially ready people often were outcome G and older people were outcome A.

Prediction Model

This model will train a random forests model on 70% of the original simulated dataset. Then we’ll see how accurately that model can predict the values in the test set.

# Load libraries
library(caret)
library(caTools)
library(e1071)
library(lattice)
# Split into training and testing set
split <- sample(x = 1:sample_size, size = 0.7*sample_size, replace = FALSE)
trainingset <- sampleDF[split,]
testingset <- sampleDF[-split,]

# Train partition random forests model
mod <- train(Outcome ~ ., data = trainingset, method = "rf") 
# Test the model predictions
test_predictions <- predict(mod, testingset)

# Confusion Matrix
confusionMatrix(test_predictions, testingset$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E    F    G
##          A  367    0    0    0    0    0    0
##          B    0  128    0    0    0    0    0
##          C    0    0  275    0    0    0    0
##          D    0    0    0  569    0    0    0
##          E    0    0    0    0 1247    0    0
##          F    0    0    0    0    0  168    0
##          G    0    0    0    0    0    0  246
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9988, 1)
##     No Information Rate : 0.4157     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity            1.0000  1.00000  1.00000   1.0000   1.0000    1.000
## Specificity            1.0000  1.00000  1.00000   1.0000   1.0000    1.000
## Pos Pred Value         1.0000  1.00000  1.00000   1.0000   1.0000    1.000
## Neg Pred Value         1.0000  1.00000  1.00000   1.0000   1.0000    1.000
## Prevalence             0.1223  0.04267  0.09167   0.1897   0.4157    0.056
## Detection Rate         0.1223  0.04267  0.09167   0.1897   0.4157    0.056
## Detection Prevalence   0.1223  0.04267  0.09167   0.1897   0.4157    0.056
## Balanced Accuracy      1.0000  1.00000  1.00000   1.0000   1.0000    1.000
##                      Class: G
## Sensitivity             1.000
## Specificity             1.000
## Pos Pred Value          1.000
## Neg Pred Value          1.000
## Prevalence              0.082
## Detection Rate          0.082
## Detection Prevalence    0.082
## Balanced Accuracy       1.000

Model has 100% accuracy, so will classify results accurately based on criteria given.