IAS Analytics Challenge 3.0

Introduction:

The task of this competition is to analyze eight years of employment-related data, spanning 2005 to 2012.

The data comes in two files: a training set and a test set. We will also hold out 20% of the training data as a validation set.

Load the Data:

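# Load the required packages
library(data.table)
library(readxl)
library(randomForest)
library(nnet)

# Load the data and create a data.table
train <- setDT(read_xlsx("Training Dataset.xlsx"))
test <- setDT(read_xlsx("Test Dataset.xlsx"))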

# Remove the unwanted characters
names(train) <- gsub(" ", "", names(train))
names(test) <- gsub(" ", "", names(test))

names(train) <- gsub("&", "and", names(train))
names(test) <- gsub("&", "and", names(test))

# Lower-casing
setnames(train, names(train), tolower(names(train)))
setnames(test, names(test), tolower(names(test)))

# Factor columns 
factors <- c("educationlevel", "agerange", "employmentstatus", "gender", "year")

train <- train[, (factors) := lapply(.SD, as.factor), .SDcols = factors]
test <- test[, (factors) := lapply(.SD, as.factor), .SDcols = factors]

# Create training and validation sets (80/20 split)
trainObs <- sample(nrow(train), floor(.8 * nrow(train)), replace = FALSE)
# The validation rows are the rows not drawn for training, so the two sets do not overlap
valObs <- setdiff(seq_len(nrow(train)), trainObs)

traindt <- train[trainObs,]
valdt <- train[valObs,]
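One way to make sure no row ends up in both the training and the validation set is to take the validation rows as whatever the training sample did not draw. The following is a minimal sketch on a made-up toy table, not the competition data; the `toy` table, its column names, and the seed are assumptions for illustration only.

```r
# Toy data standing in for the training table (hypothetical values)
set.seed(42)  # arbitrary seed so the split is reproducible
toy <- data.frame(id = 1:10, y = rep(c("a", "b"), 5))

# Draw 80% of the row indices for training
train_idx <- sample(nrow(toy), floor(0.8 * nrow(toy)), replace = FALSE)

# The validation indices are the rows that were not drawn for training
val_idx <- setdiff(seq_len(nrow(toy)), train_idx)

toy_train <- toy[train_idx, ]
toy_val <- toy[val_idx, ]

# Sanity checks: the index sets are disjoint and together cover every row
length(intersect(train_idx, val_idx)) == 0
nrow(toy_train) + nrow(toy_val) == nrow(toy)
```

Both checks at the end should evaluate to TRUE for any seed, since `setdiff` removes every training index from the full index range.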

Predictions:

Our final task is to make predictions on employment status for the unlabeled test set.

To do this, we will rely on one or more machine learning algorithms.

To decide which algorithm will perform best on the test set, we will implement several and choose the one that produces the most accurate predictions on the validation set.
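The selection rule itself is small enough to sketch directly. The snippet below assumes the validation errors have already been collected into a named vector; the name `val_errors` is hypothetical, and the two numbers are the validation errors printed further down in this report.

```r
# Validation errors keyed by model name (values taken from this report's output)
val_errors <- c(random_forest = 0.01124912, neural_network = 0.06694789)

# Pick the model with the smallest validation error
best_model <- names(which.min(val_errors))
best_model
```

Keeping the errors in one named vector makes it easy to add further candidate models later without changing the selection line.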

# Function to compute classification error
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)

  # Share of misclassified observations: 1 minus the fraction of counts on the diagonal
  error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)

  return(error)
}
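To see what the function computes, here is a small worked example on made-up labels (six hypothetical observations, not competition data). The function is repeated so the snippet runs on its own. Both factors are given the same levels so that the table is square and its diagonal lines up with the correct predictions.

```r
# Same definition as above, repeated so this snippet is self-contained
classification_error <- function(conf_mat) {
  conf_mat <- as.matrix(conf_mat)
  1 - sum(diag(conf_mat)) / sum(conf_mat)
}

# Made-up true and predicted labels for six observations
lv <- c("no", "yes")
truth <- factor(c("yes", "yes", "yes", "no", "no", "no"), levels = lv)
pred <- factor(c("yes", "yes", "no", "no", "no", "yes"), levels = lv)

# Rows are the true classes, columns are the predicted classes
toy_conf_mat <- table(true = truth, pred = pred)

# Four of the six predictions are correct, so the error is 2/6
toy_error <- classification_error(toy_conf_mat)
toy_error
```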

# Random Forest

# Model
rf_model <- randomForest(employmentstatus ~ ., data = traindt, importance = TRUE)

print(rf_model)

## 
## Call:
##  randomForest(formula = employmentstatus ~ ., data = traindt,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.38%
## Confusion matrix:
##                    Employed Not in labor force Unemployed class.error
## Employed              32370                460         11 0.014341829
## Not in labor force        5              15581        149 0.009787099
## Unemployed                2               2129        497 0.810882801
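The summary figures in this output can be reproduced from the counts alone, which helps in reading the table. The sketch below copies the nine counts from the confusion matrix above into a plain matrix (the names `oob`, `oob_error`, and `class_error` are hypothetical) and recomputes the overall error and the `class.error` column.

```r
# Counts copied from the OOB confusion matrix printed above
classes <- c("Employed", "Not in labor force", "Unemployed")
oob <- matrix(
  c(32370,   460,  11,
        5, 15581, 149,
        2,  2129, 497),
  nrow = 3, byrow = TRUE,
  dimnames = list(true = classes, pred = classes)
)

# Overall error: share of all counts that lie off the diagonal
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: share of each true class (row) that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

round(oob_error, 4)
round(class_error, 4)
```

The overall figure agrees with the 5.38% OOB estimate, but the per-class view shows that about 81% of the Unemployed rows are misclassified, with 2129 of 2628 predicted as Not in labor force. A low overall error can therefore hide weak performance on the smallest class.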

# Predictions
rf_pred <- predict(rf_model, valdt)

rf_test_pred <- data.table(predict(rf_model, test))

# Results
rf_conf_mat <- table(true = valdt$employmentstatus, pred = rf_pred)

cat("Random Forest Classification Error, Validation:", classification_error(rf_conf_mat), "\n")

## Random Forest Classification Error, Validation: 0.01124912

# Neural Network

# Run the neural network
nn <- nnet(employmentstatus ~ ., traindt, size = 15)

## # weights:  738
## initial  value 117266.669794 
## iter  10 value 35459.552036
## iter  20 value 19702.534478
## iter  30 value 18935.938957
## iter  40 value 17555.050482
## iter  50 value 14861.375264
## iter  60 value 12082.681110
## iter  70 value 11504.532740
## iter  80 value 11203.988741
## iter  90 value 10896.824191
## iter 100 value 10739.708299
## final  value 10739.708299 
## stopped after 100 iterations

nn_pred <- data.table(predict(nn, valdt, type = "class"))

nn_conf_mat <- table(true = valdt$employmentstatus, pred = nn_pred$V1)

cat("Neural Network Classification Error, Validation:", classification_error(nn_conf_mat), "\n")

## Neural Network Classification Error, Validation: 0.06694789

# Add the random forest test predictions to the test data as a column
test[, employmentstatus := rf_test_pred$V1]

# Write the results of winning algorithm to csv
write.csv(test, "rf_predictions.csv")

Results:

In this case, the random forest beat the neural network on the validation set, with a classification error of about 1.1% versus 6.7%. All that's left is to see how the algorithm stacks up in the challenge.

Predicting Employment

ZXS107020

3/5/2018
