The task of this competition is to analyze eight years of employment-related data, spanning 2005 to 2012.
The data is split into two files: a training set and a test set. We will also hold out 20% of the training data as a validation set.
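# Load the packages used below
library(readxl)
library(data.table)
library(randomForest)
library(nnet)
# Load the data and convert each sheet to a data.table
train <- setDT(read_xlsx("Training Dataset.xlsx"))
test <- setDT(read_xlsx("Test Dataset.xlsx"))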
# Remove the unwanted characters
names(train) <- gsub(" ", "", names(train))
names(test) <- gsub(" ", "", names(test))
names(train) <- gsub("&", "and", names(train))
names(test) <- gsub("&", "and", names(test))
# Lower-casing
setnames(train, names(train), tolower(names(train)))
setnames(test, names(test), tolower(names(test)))
# Factor columns
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")
train <- train[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
test <- test[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
# Create training and validation sets (80/20 split with no shared rows)
trainObs <- sample(nrow(train), .8 * nrow(train), replace = FALSE)
# The validation set is every row that was not drawn for training
valObs <- setdiff(seq_len(nrow(train)), trainObs)
traindt <- train[trainObs,]
valdt <- train[valObs,]
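When a validation set is carved out of the training data, the two index sets should not share rows; otherwise the validation error will look better than it really is. Below is a minimal sketch of a disjoint, reproducible 80/20 split on a toy table. The seed value and the names `toy`, `train_idx`, `val_idx`, `toy_train` and `toy_val` are hypothetical stand-ins, not part of the competition code.

```r
library(data.table)

set.seed(42)  # hypothetical seed, fixed only so the split is reproducible

# Hypothetical stand-in for the competition training data
toy <- data.table(id = 1:100, y = rnorm(100))

# Draw 80% of the row indices for training ...
train_idx <- sample(nrow(toy), floor(0.8 * nrow(toy)))
# ... and use every remaining row for validation
val_idx <- setdiff(seq_len(nrow(toy)), train_idx)

toy_train <- toy[train_idx]
toy_val <- toy[val_idx]

# Number of rows that appear in both sets (0 for a disjoint split)
print(length(intersect(toy_train$id, toy_val$id)))
```

Building the validation indices with `setdiff()` guarantees that the two sets partition the data, whatever the seed.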
Our final task is to make predictions on employment status for the unlabeled test set.
To do this, we will rely on one or more machine learning algorithms.
To decide which algorithm will perform best on the test set, we will implement several and choose the one that produces the most accurate predictions on the validation set.
# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  # Share of observations that fall off the diagonal (misclassified)
  error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
  return(error)
}
# Random Forest
# Model
rf_model <- randomForest(employmentstatus ~ ., data = traindt, importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = employmentstatus ~ ., data = traindt, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.38%
## Confusion matrix:
##                    Employed Not in labor force Unemployed class.error
## Employed              32370                460         11 0.014341829
## Not in labor force        5              15581        149 0.009787099
## Unemployed                2               2129        497 0.810882801
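The OOB error rate and the `class.error` column can both be recomputed from the confusion matrix itself. The sketch below retypes the counts printed above into a plain matrix (the names `oob`, `class_error` and `overall_error` are illustrative) and applies the same diagonal-based formula that `classification_error` uses.

```r
# OOB confusion matrix copied from the randomForest printout
# (rows = true class, columns = predicted class)
classes <- c("Employed", "Not in labor force", "Unemployed")
oob <- matrix(
  c(32370,   460,  11,
        5, 15581, 149,
        2,  2129, 497),
  nrow = 3, byrow = TRUE,
  dimnames = list(true = classes, pred = classes)
)

# Per-class error: share of each true class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall error: share of all observations off the diagonal
overall_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(overall_error, 4))
```

The overall figure comes out at about 5.4%, matching the printed OOB estimate, but the per-class view shows that roughly 81% of the Unemployed observations are misclassified, mostly as Not in labor force.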
# Predictions
rf_pred <- predict(rf_model, valdt)
rf_test_pred <- data.table(predict(rf_model, test))
# Results
rf_conf_mat <- table(true = valdt$employmentstatus, pred = rf_pred)
cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01124912
# Neural Network
# Run the neural network
nn <- nnet(employmentstatus ~ ., traindt, size = 15)
## # weights: 738
## initial value 117266.669794
## iter 10 value 35459.552036
## iter 20 value 19702.534478
## iter 30 value 18935.938957
## iter 40 value 17555.050482
## iter 50 value 14861.375264
## iter 60 value 12082.681110
## iter 70 value 11504.532740
## iter 80 value 11203.988741
## iter 90 value 10896.824191
## iter 100 value 10739.708299
## final value 10739.708299
## stopped after 100 iterations
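The log above ends at 100 iterations, which is the default value of `nnet`'s `maxit` argument, and the objective was still decreasing at that point. Below is a minimal sketch of raising that cap. It uses the built-in `iris` data rather than the competition data, and the seed, `size = 4` and `maxit = 500` are arbitrary illustrative choices, not tuned values.

```r
library(nnet)

# Same seed before each fit so both start from the same random weights
set.seed(1)
fit_short <- nnet(Species ~ ., data = iris, size = 4, maxit = 100, trace = FALSE)

set.seed(1)
fit_long <- nnet(Species ~ ., data = iris, size = 4, maxit = 500, trace = FALSE)

# Final value of the fitting criterion for each run
print(c(short = fit_short$value, long = fit_long$value))

# convergence is 1 if the iteration cap was reached, 0 otherwise
print(fit_long$convergence)
```

Whether a longer run helps on the employment data would have to be checked against the validation error, since a lower training objective does not guarantee better generalization.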
nn_pred <- data.table(predict(nn, valdt, type = "class"))
nn_conf_mat <- table(true = valdt$employmentstatus, pred = nn_pred$V1)
cat("Neural Network Classification Error, Validation:", classification_error(nn_conf_mat), "\n")
## Neural Network Classification Error, Validation: 0.06694789
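The selection rule described earlier, keeping the model with the lowest validation error, can be written down directly. This is a minimal sketch using the two error values printed above; the names `val_error` and `best` are illustrative.

```r
# Validation errors copied from the output above
val_error <- c(random_forest = 0.01124912, neural_network = 0.06694789)

# Name of the model with the smallest validation error
best <- names(which.min(val_error))
print(best)
```

With more candidate models, the same named vector can simply be extended, and `which.min()` still picks the winner.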
# Combine rf test predictions with original df
test$employmentstatus <- rf_test_pred[,1]
# Write the results of winning algorithm to csv
write.csv(test, "rf_predictions.csv")
Well, it appears that in this case the random forest beat the neural network on the validation set (about 1.1% error versus 6.7%). All that's left is to see how it stacks up in the challenge.