Packages

library("tidyverse")
library("rpart")
library("rpart.plot")
library("randomForest")
library("readr")
library("dplyr")

Import the Data

fin_data <- readr::read_rds("Fin Capacity Profiling.rds")
fin_data

This is simulated data about the financial capacity and “consumption profile” of a family. Assume that the data describe Filipino families in 2015. The response variable is fin_capacity, a binary variable (good or bad); a good financial capacity is equivalent to having an above-average annual income. All other variables can be considered predictors.

Test and Training Data Set

~25% of the observations were allocated for testing and ~75% for training. So that fin_capacity would be well distributed in both data sets, an auxiliary variable, annual_income, was created to indicate whether the income is above or below average.

fin_data <- fin_data %>%
              mutate(annual_income = factor(1 * (fin_capacity == "GOOD"),
                                            levels = c(1, 0),
                                            labels = c("above average", "below average")))
fin_data
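
Since annual_income is derived directly from fin_capacity, a quick sanity check (a minimal sketch, not part of the original output) can confirm that the two variables line up exactly:

# Sanity check: every GOOD family should map to "above average" and vice versa
fin_data %>%
  count(fin_capacity, annual_income)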

Taking 25% of the data set as the test set, stratified by annual_income:

set.seed(201803075)
fin_test <- fin_data %>%
                group_by(annual_income) %>%
                sample_frac(size = 0.25, replace = FALSE)
fin_test

Taking the remaining 75% as the training set. The grouping and the auxiliary annual_income column are then dropped, since annual_income is derived from fin_capacity and would leak the response into the model:

fin_train <- fin_data %>%
                setdiff(fin_test)
fin_train 
fin_training <- fin_train  %>%
                  ungroup() %>%
                  select(-annual_income)
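
A minimal sketch to confirm that the stratified split kept roughly 25% of each class in the test set and 75% in the training set:

# Confirm the split sizes and class balance in both sets
bind_rows(test  = count(fin_test, annual_income),
          train = count(fin_train, annual_income),
          .id = "set")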

Building a Decision Tree

set.seed(201803075)
fin_tree <- rpart(formula = fin_capacity ~ ., 
                   data = fin_training, 
                   method = "class",
                   control = rpart.control(minsplit = 10, minbucket = 6, maxdepth = 30))
rpart.plot(fin_tree, cex = 0.70)

fin_tree$cptable %>%
  as_tibble()
plotcp(fin_tree)

Pruning the tree using the cp value chosen by the 1-SE (one standard error) rule:

pruned_tree <- prune(fin_tree, cp = 0.01461737)
rpart.plot(pruned_tree, cex = 0.8)
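
The cp above was read off the cptable by hand. As a minimal sketch (the threshold convention is the usual one for rpart; the variable names are ours), the same value can be derived programmatically:

# 1-SE rule: compute the threshold min(xerror) + xstd at the minimum,
# then take the simplest subtree whose xerror falls below that threshold
cpt <- as.data.frame(fin_tree$cptable)
best <- which.min(cpt$xerror)
threshold <- cpt$xerror[best] + cpt$xstd[best]
cp_1se <- cpt$CP[min(which(cpt$xerror <= threshold))]
prune(fin_tree, cp = cp_1se)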

Based on the decision tree, with great power comes a great electricity bill: the monthly electricity bill is one of the most defining factors in predicting the financial capacity of a family. When it comes to budgeting, the necessities, the electricity bill and the food allowance, are what get allotted first. The other theme is the surplus, the capacity of a family to buy things beyond their needs, and it is evident in the tree that expensive appliances such as refrigerators and computers appear as splits. Overall, even though the tree does not consider other external factors that may affect financial capacity, it shows enough variables, visible in the possessions a family has, to serve as a basis for judging whether a family has good financial capacity.

Building a Random Forest

mtry: 10 or 15 (roughly a third or half of the predictors); ntree: 100 or 500 (a low vs. a high number of trees).

set.seed(201803075)
rf_fin<-list()
for (i in c(10,15)) {
  for (j in c(100,500)) { 
    rf_fin[[paste("mtry =",i,",","ntree = ", j)]] <- 
      randomForest(formula = fin_capacity ~ ., data = fin_training, mtry = i, ntree = j)
  }
}
rf_fin
## $`mtry = 10 , ntree =  100`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 9.03%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4597 199  0.04149291
## BAD   339 824  0.29148753
## 
## $`mtry = 10 , ntree =  500`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 9.16%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4596 200  0.04170142
## BAD   346 817  0.29750645
## 
## $`mtry = 15 , ntree =  100`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 9.55%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4584 212   0.0442035
## BAD   357 806   0.3069647
## 
## $`mtry = 15 , ntree =  500`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 9.01%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4597 199  0.04149291
## BAD   338 825  0.29062769

The 4th model (mtry = 15, ntree = 500) has the lowest OOB error rate (9.01%), so it seems to be the best random forest.
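
Rather than reading the OOB error rates off the printed output, they can also be extracted programmatically; a minimal sketch (err.rate is randomForest's per-tree error matrix, with an OOB column and one row per tree):

# Final OOB error rate of each model (last row of the err.rate matrix)
sapply(rf_fin, function(m) m$err.rate[m$ntree, "OOB"])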

Getting the Importance of the Best Random Forest

importance(rf_fin[[4]]) %>%
  as.matrix() %>%
  as_tibble(rownames = "predictors") %>%
  arrange(desc(MeanDecreaseGini))

The table shows a large margin between the most important variable, the monthly electricity bill, and the next one, food_allowance. The electricity bill alone can go a long way toward indicating a good financial capacity: a high bill implies many appliances, which in turn implies the family did not have much of a problem purchasing them. Food allowance also deserves high consideration because it shows whether the family can meet its needs, whereas appliances and luxuries only get noticed once the necessities are covered. Family size plays an important role too, since it decides whether the budget is enough for everyone. Although the employed variable ranks low in the table, possibly because a binary indicator offers far fewer distinct split points than the continuous predictors, being employed is still very important in practice because it is where most of the budget comes from.
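
The same ranking can be visualized with randomForest's built-in plot; a minimal sketch:

# Dotchart of MeanDecreaseGini for the best model
varImpPlot(rf_fin[[4]], main = "Variable importance (mtry = 15, ntree = 500)")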

Calculating the F-score for the Decision Tree and Random Forest

Decision Tree

observed_tree <- fin_test$fin_capacity
pred_tree<- predict(pruned_tree, newdata = fin_test, type = "class")
cm_tree <- table(observed_tree, pred_tree)
TP_tree <- cm_tree[1,1] # True Positive: observed GOOD, predicted GOOD (row 1, column 1)
FP_tree <- cm_tree[2,1] # False Positive: predicted positive when the true value is negative
TN_tree <- cm_tree[2,2] # True Negative: observed BAD, predicted BAD
FN_tree <- cm_tree[1,2] # False Negative: predicted negative when the true value is positive
tree_overall_accuracy <- (TP_tree + TN_tree) / (TP_tree + FP_tree + TN_tree + FN_tree)
tree_recall           <- TP_tree / (TP_tree + FN_tree) # percent of positives that are correctly classified
tree_precision        <- TP_tree / (TP_tree + FP_tree) # percent of predicted positives that are correctly classified
tree_f_measure        <- 2*tree_recall*tree_precision / (tree_precision + tree_recall)

list("rf overall accuracy" = tree_overall_accuracy,
     "rf recall"           = tree_recall,
     "rf precision"        = tree_precision,
     "rf F measure"        = tree_f_measure)   
## $`rf overall accuracy`
## [1] 0.8965015
## 
## $`rf recall`
## [1] 0.9508982
## 
## $`rf precision`
## [1] 0.9237929
## 
## $`rf F measure`
## [1] 0.9371496
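
Since the same four metrics are computed again for the random forest below, a small helper could avoid the duplication; a minimal sketch, assuming the positive class (GOOD) sits in the first row and column of the confusion matrix:

# All four metrics from a 2x2 confusion matrix (positive class first)
classification_metrics <- function(cm) {
  TP <- cm[1, 1]; FN <- cm[1, 2]
  FP <- cm[2, 1]; TN <- cm[2, 2]
  recall    <- TP / (TP + FN)
  precision <- TP / (TP + FP)
  list(accuracy  = (TP + TN) / sum(cm),
       recall    = recall,
       precision = precision,
       f_measure = 2 * recall * precision / (recall + precision))
}
classification_metrics(cm_tree)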

Random Forest

fin_observed <- fin_test$fin_capacity
fin_pred<- predict(rf_fin[[4]], newdata = fin_test)
cm_rf <- table(fin_observed, fin_pred)
TP_rf <- cm_rf[1,1] # True Positive
FP_rf <- cm_rf[2,1] # False Positive
TN_rf <- cm_rf[2,2] # True Negative
FN_rf <- cm_rf[1,2] # False Negative
rf_overall_accuracy <- (TP_rf + TN_rf) / (TP_rf + FP_rf + TN_rf + FN_rf)
rf_recall           <- TP_rf / (TP_rf + FN_rf) # percent of positives that are correctly classified
rf_precision        <- TP_rf / (TP_rf + FP_rf) # percent of predicted positives that are correctly classified
rf_f_measure        <- 2*rf_recall*rf_precision / (rf_precision + rf_recall)

list("rf overall accuracy" = rf_overall_accuracy,
     "rf recall"           = rf_recall,
     "rf precision"        = rf_precision,
     "rf F measure"        = rf_f_measure)  
## $`rf overall accuracy`
## [1] 0.9110787
## 
## $`rf recall`
## [1] 0.9616766
## 
## $`rf precision`
## [1] 0.9310145
## 
## $`rf F measure`
## [1] 0.9460972
Based on the accuracy, recall, precision, and F-measure of the decision tree and the random forest, the random forest model is better and wins on every metric. This implies that, at least on this data, the random forest algorithm generalizes better and classifies financial capacity more correctly than a single pruned tree.
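
A minimal sketch to put the two models' metrics side by side, using the objects already computed above:

# Side-by-side comparison of the four metrics
tibble(metric = c("overall accuracy", "recall", "precision", "F measure"),
       tree   = c(tree_overall_accuracy, tree_recall, tree_precision, tree_f_measure),
       forest = c(rf_overall_accuracy, rf_recall, rf_precision, rf_f_measure))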