This document is me helping a very dear friend who is looking to classify survey respondents into one of 7 categories based on their survey responses.
Going to randomly generate responses for the 9 survey questions:
* Age
* Sex
* Are you financially prepared to have a child?
* How satisfied are you with your current career status?
* How might having a child affect your career status?
* How emotionally ready are you for a baby?
* How important is a partner in raising a child?
* Are you currently with a partner who is ready for a child?
* How soon would you like to have a baby?
I’ll generate 10,000 random results for each question, store them in a dataframe as if they are responses from survey respondents. Later I’ll split that dataset into a training set to build a random forests model upon and a testing set to see how accurate our model was.
# Load library
library(tidyr)
library(dplyr)
# Set sample size
sample_size <- 10000
# set seed for reproducible RNG outcomes
set.seed(324)
# Random results generated
age <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.05, 0.15, 0.64, 0.15, 0.01))
sex <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.95, 0.05, 0.01))
financially_ready <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.2, 0.2, 0.1, 0.3, 0.2))
employed <- sample(0:1, size = sample_size, replace = TRUE, prob = c(0.95, 0.05))
job_loss_worry <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.05, 0.6, 0.35))
comfortable_baby_inside <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.3, 0.4, 0.17, 0.1, 0.03))
want_partner <- sample(0:4, size = sample_size, replace = TRUE, prob = c(0.23, 0.71, 0.01, 0.04, 0.01))
with_good_partner <- sample(0:2, size = sample_size, replace = TRUE, prob = c(0.59, 0.23, 0.18))
when_ready <- sample(0:5, size = sample_size, replace = TRUE, prob = c(0.11, 0.05, 0.28, 0.27, 0.17, 0.12))
# Create data frame of random results
sampleDF <- data.frame(age, sex, financially_ready, employed, job_loss_worry, comfortable_baby_inside, want_partner, with_good_partner, when_ready)
sampleDF <- tbl_df(sampleDF)
The following outcomes are to be assigned to each respondent:
I’ll begin by assigning the default to all respondants, then overwrite assigned outcomes based on given criteria.
# Assign defaults to all
sampleDF$Outcome <- "G"
sampleDF[sampleDF$sex >= 1 & sampleDF$with_good_partner >= 1, "Outcome"] <- "A"
sampleDF[sampleDF$sex == 0 & sampleDF$age >= 3, "Outcome"] <- "A"
sampleDF[sampleDF$sex >= 1 & sampleDF$with_good_partner == 0, "Outcome"] <- "B"
sampleDF[sampleDF$sex == 0 & sampleDF$job_loss_worry == 0, "Outcome"] <- "B"
sampleDF[sampleDF$with_good_partner >=1 & sampleDF$want_partner <= 1 & sampleDF$financially_ready <= 1, "Outcome"] <- "C"
sampleDF[sampleDF$sex == 0 & sampleDF$age <= 2 & sampleDF$financially_ready >= 2, "Outcome"] <- "D"
sampleDF[sampleDF$sex == 0 & sampleDF$age <= 2 & sampleDF$when_ready >= 3, "Outcome"] <- "E"
sampleDF[sampleDF$want_partner >= 2, "Outcome"] <- "F"
# Results
sampleDF$Outcome <- as.factor(sampleDF$Outcome)
table(sampleDF$Outcome)
##
## A B C D E F G
## 1223 456 865 1956 4121 608 771
Here is a random pair of variables – age (x) and financially ready (y). The colors of the points are the outcomes they were classified as.
library(ggplot2)
ggplot(aes(x = age, y = financially_ready, colour = Outcome), data = sampleDF) + geom_jitter()
I can generate more of these. You’ll see non-financially ready people often were outcome G and older people were outcome A.
This model will train a random forests model on 70% of the original simulated dataset. Then we’ll see how accurately that model can predict the values in the test set.
# Load libraries
library(caret)
library(caTools)
library(e1071)
library(lattice)
# Split into training and testing set
split <- sample(x = 1:sample_size, size = 0.7*sample_size, replace = FALSE)
trainingset <- sampleDF[split,]
testingset <- sampleDF[-split,]
# Train partition random forests model
mod <- train(Outcome ~ ., data = trainingset, method = "rf")
# Test the model predictions
test_predictions <- predict(mod, testingset)
# Confusion Matrix
confusionMatrix(test_predictions, testingset$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E F G
## A 367 0 0 0 0 0 0
## B 0 128 0 0 0 0 0
## C 0 0 275 0 0 0 0
## D 0 0 0 569 0 0 0
## E 0 0 0 0 1247 0 0
## F 0 0 0 0 0 168 0
## G 0 0 0 0 0 0 246
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9988, 1)
## No Information Rate : 0.4157
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E Class: F
## Sensitivity 1.0000 1.00000 1.00000 1.0000 1.0000 1.000
## Specificity 1.0000 1.00000 1.00000 1.0000 1.0000 1.000
## Pos Pred Value 1.0000 1.00000 1.00000 1.0000 1.0000 1.000
## Neg Pred Value 1.0000 1.00000 1.00000 1.0000 1.0000 1.000
## Prevalence 0.1223 0.04267 0.09167 0.1897 0.4157 0.056
## Detection Rate 0.1223 0.04267 0.09167 0.1897 0.4157 0.056
## Detection Prevalence 0.1223 0.04267 0.09167 0.1897 0.4157 0.056
## Balanced Accuracy 1.0000 1.00000 1.00000 1.0000 1.0000 1.000
## Class: G
## Sensitivity 1.000
## Specificity 1.000
## Pos Pred Value 1.000
## Neg Pred Value 1.000
## Prevalence 0.082
## Detection Rate 0.082
## Detection Prevalence 0.082
## Balanced Accuracy 1.000
Model has 100% accuracy, so will classify results accurately based on criteria given.