IAS Analytics Challenge 3.0

Introduction:

The task in this competition is to analyze eight years of employment-related data, from 2005 to 2012.

The data comes in two files: a training set and a test set. We will also hold out 20% of the training data as a validation set.

Load the Data:

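# Load the packages used below
library(readxl)
library(data.table)
library(randomForest)
library(nnet)

# Load the data and create a data.table
train <- setDT(read_xlsx("Training Dataset.xlsx"))
test <- setDT(read_xlsx("Test Dataset.xlsx"))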

# Remove the unwanted characters
names(train) <- gsub(" ", "", names(train))
names(test) <- gsub(" ", "", names(test))

names(train) <- gsub("&", "and", names(train))
names(test) <- gsub("&", "and", names(test))

# Lower-casing
setnames(train, names(train), tolower(names(train)))
setnames(test, names(test), tolower(names(test)))

# Factor columns 
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")

train <- train[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
test <- test[, (factors) := lapply(.SD, as.factor), .SDcols = factors]

# Create training and validation sets (80/20 split with no shared rows)
trainObs <- sample(nrow(train), floor(.8 * nrow(train)), replace = FALSE)
# Validation rows are the rows not drawn for training, so the two sets cannot overlap
valObs <- setdiff(seq_len(nrow(train)), trainObs)

traindt <- train[trainObs,]
valdt <- train[valObs,]
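
The name-cleaning steps above are written out twice, once per table. As a minimal sketch, they could be wrapped in a single helper. The function name `clean_names` and the sample column names below are hypothetical and chosen only for illustration; only base R is used.

```r
# Hypothetical helper: applies the same three cleaning steps to a vector of names.
clean_names <- function(x) {
  x <- gsub(" ", "", x, fixed = TRUE)     # drop spaces
  x <- gsub("&", "and", x, fixed = TRUE)  # spell out ampersands
  tolower(x)                              # lower-case everything
}

# Made-up column names, not taken from the competition files.
example_names <- c("Education Level", "Age Range", "Wages & Salary")
cleaned <- clean_names(example_names)
print(cleaned)
```

With data.table, the helper could then be applied in place with `setnames(train, names(train), clean_names(names(train)))`, and likewise for `test`.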

Predictions:

Our final task is to make predictions on employment status for the unlabeled test set.

To do this, we will rely on one or more machine learning algorithms.

To decide which algorithm will perform best on the test set, we will implement several and choose the one that produces the most accurate predictions on the validation set.
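
The selection rule described above can be sketched in a few lines: compute each model's error on the held-out rows and keep the model with the smallest one. The labels, predictions, and model names here are toy values invented for illustration, not output from the competition data.

```r
# Toy held-out labels and predictions from two hypothetical models.
truth <- factor(c("a", "b", "a", "b", "a", "b"))
preds <- list(
  model_a = factor(c("a", "b", "a", "b", "b", "b"), levels = levels(truth)),
  model_b = factor(c("b", "b", "a", "a", "b", "b"), levels = levels(truth))
)

# Share of held-out rows each model gets wrong.
val_errors <- sapply(preds, function(p) mean(p != truth))

# Name of the model with the lowest validation error.
best_model <- names(which.min(val_errors))

print(val_errors)
print(best_model)
```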

# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat = as.matrix(conf_mat)
  
  error = 1 - sum(diag(conf_mat)) / sum(conf_mat)
  
  return (error)
}
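
As a quick check of the function above, here is a small worked example on made-up labels. It is only a sketch: the five rows are invented, and the function is restated so the block runs on its own.

```r
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  1 - sum(diag(conf_mat)) / sum(conf_mat)
}

# Toy labels, made up for illustration.
classes <- c("Employed", "Unemployed")
truth <- factor(c("Employed", "Employed", "Unemployed", "Unemployed", "Unemployed"),
                levels = classes)
pred  <- factor(c("Employed", "Unemployed", "Unemployed", "Unemployed", "Employed"),
                levels = classes)

conf_mat <- table(true = truth, pred = pred)
toy_error <- classification_error(conf_mat)

print(conf_mat)
print(toy_error)  # 2 of the 5 rows are misclassified
```

Fixing the factor levels on both vectors keeps the table square even if a model never predicts one of the classes; otherwise the diagonal would no longer line up with the correct predictions.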

# Random Forest

# Model
rf_model <- randomForest(employmentstatus ~ ., data = traindt, importance = TRUE)

print(rf_model)
## 
## Call:
##  randomForest(formula = employmentstatus ~ ., data = traindt,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.38%
## Confusion matrix:
##                    Employed Not in labor force Unemployed class.error
## Employed              32370                460         11 0.014341829
## Not in labor force        5              15581        149 0.009787099
## Unemployed                2               2129        497 0.810882801
# Predictions
rf_pred <- predict(rf_model, valdt)

rf_test_pred <- data.table(predict(rf_model, test))

# Results
rf_conf_mat <- table(true = valdt$employmentstatus, pred = rf_pred)

cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")
## Random Forest Classification Error, Validation: 0.01124912
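
A single error rate can hide large differences between classes. As a sketch, the block below re-enters the out-of-bag confusion matrix printed above by hand and recomputes both the overall error and the error per true class. The variable names are illustrative; the counts are copied from the printed output.

```r
classes <- c("Employed", "Not in labor force", "Unemployed")

# Out-of-bag confusion matrix from the random forest printout (rows = true class).
oob_conf <- matrix(
  c(32370,   460,  11,
        5, 15581, 149,
        2,  2129, 497),
  nrow = 3, byrow = TRUE,
  dimnames = list(true = classes, pred = classes)
)

# Overall share of misclassified rows.
overall_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)

# Share of each true class that was misclassified.
per_class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)
names(per_class_error) <- classes

print(round(overall_error, 4))
print(round(per_class_error, 4))
```

The overall figure agrees with the 5.38% OOB estimate, while the per-class values reproduce the `class.error` column, including the roughly 81% error on the Unemployed class.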
# Neural Network

# Run the neural network
nn <- nnet(employmentstatus ~ ., traindt, size = 15)
## # weights:  738
## initial  value 117266.669794 
## iter  10 value 35459.552036
## iter  20 value 19702.534478
## iter  30 value 18935.938957
## iter  40 value 17555.050482
## iter  50 value 14861.375264
## iter  60 value 12082.681110
## iter  70 value 11504.532740
## iter  80 value 11203.988741
## iter  90 value 10896.824191
## iter 100 value 10739.708299
## final  value 10739.708299 
## stopped after 100 iterations
nn_pred <- data.table(predict(nn, valdt, type = "class"))

nn_conf_mat <- table(true = valdt$employmentstatus, pred = nn_pred$V1)

cat("Neural Network Classification Error, Validation:", classification_error(nn_conf_mat), "\n")
## Neural Network Classification Error, Validation: 0.06694789
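
The training log above ends with "stopped after 100 iterations", and 100 is the default `maxit` in `nnet`, so the optimizer may have hit its iteration cap rather than converged. The sketch below shows how one might raise `maxit` and check the `convergence` flag. It uses the built-in `iris` data instead of the competition data, and the `size`, `maxit`, and seed values are arbitrary choices for illustration.

```r
library(nnet)

set.seed(1)  # nnet starts from random weights, so fix the seed for repeatability
fit <- nnet(Species ~ ., data = iris, size = 3, maxit = 500, trace = FALSE)

# 0 means the optimizer stopped on its own; 1 means it reached the maxit cap.
print(fit$convergence)

# Class predictions, with factor levels fixed so the confusion matrix stays square.
pred <- predict(fit, iris, type = "class")
conf <- table(true = iris$Species,
              pred = factor(pred, levels = levels(iris$Species)))
print(conf)
```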
# Combine rf test predictions with original df
test[, employmentstatus := rf_test_pred[[1]]]

# Write the results of winning algorithm to csv
write.csv(test, "rf_predictions.csv")

Results:

In this case, the random forest beat the neural network on the validation set, with a classification error of about 1.1% versus about 6.7%. All that's left is to see how it stacks up in the challenge.