Packages

library("tidyverse")
library("rpart")
library("rpart.plot")
library("randomForest")
library("readr")
library("dplyr")

Import the Data

fin_data <- readr::read_rds("Fin Capacity Profiling.rds")
fin_data

This is simulated data about the financial capacity and “consumption profile” of a family. Assume that the data describe Filipino families in 2015. The response variable is fin_capacity, a binary variable (good or bad); a good financial capacity is equivalent to having an above-average annual income. All other variables can be considered predictors.

Test and Training Data Set

~25% of the observations were allocated for testing and ~75% for training. So that fin_capacity would be well distributed in both data sets, an auxiliary variable, annual_income, was created to indicate whether the income is above or below average.

fin_data <- fin_data %>%
              mutate(annual_income = factor(1 * (fin_capacity == "GOOD"),
                                            levels = c(1, 0),
                                            labels = c("above average", "below average")))
fin_data
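
Since annual_income is derived directly from fin_capacity, a quick sanity check (a minimal sketch, not part of the original output) can confirm that the two variables line up exactly:

# Sanity check: every GOOD family should map to "above average" and vice versa
fin_data %>%
  count(fin_capacity, annual_income)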

Taking 25% of the data set as the test set, stratified by annual_income:

set.seed(201803075)
fin_test <- fin_data %>%
                group_by(annual_income) %>%
                sample_frac(size = 0.25, replace = FALSE)
fin_test

Taking the remaining 75% as the training set. The grouping and the auxiliary annual_income column are then dropped, since annual_income is derived from fin_capacity and would leak the response into the model:

fin_train <- fin_data %>%
                setdiff(fin_test)
fin_train 
fin_training <- fin_train  %>%
                  ungroup() %>%
                  select(-annual_income)
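
A minimal sketch to confirm that the stratified split kept roughly 25% of each class in the test set and 75% in the training set:

# Confirm the split sizes and class balance in both sets
bind_rows(test  = count(fin_test, annual_income),
          train = count(fin_train, annual_income),
          .id = "set")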

Building a Decision Tree

set.seed(201803075)
fin_tree <- rpart(formula = fin_capacity ~ ., 
                   data = fin_training, 
                   method = "class",
                   control = rpart.control(minsplit = 10, minbucket = 6, maxdepth = 30))
rpart.plot(fin_tree, cex = 0.70)

fin_tree$cptable %>%
  as_tibble()
plotcp(fin_tree)

Pruning the tree using the cp value chosen by the 1-SE (one standard error) rule:

pruned_tree <- prune(fin_tree, cp = 0.01461737)
rpart.plot(pruned_tree, cex = 0.8)
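
The cp above was read off the cptable by hand. As a minimal sketch (the threshold convention is the usual one for rpart; the variable names are ours), the same value can be derived programmatically:

# 1-SE rule: compute the threshold min(xerror) + xstd at the minimum,
# then take the simplest subtree whose xerror falls below that threshold
cpt <- as.data.frame(fin_tree$cptable)
best <- which.min(cpt$xerror)
threshold <- cpt$xerror[best] + cpt$xstd[best]
cp_1se <- cpt$CP[min(which(cpt$xerror <= threshold))]
prune(fin_tree, cp = cp_1se)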

Based on the decision tree, with great power comes a great electricity bill: the monthly electricity bill is one of the most defining factors in predicting the financial capacity of a family. When it comes to budgeting, the necessities, the electricity bill and the food allowance, are what get allotted first. The other theme is the surplus, the capacity of a family to buy things beyond their needs, and it is evident in the tree that expensive appliances such as refrigerators and computers appear as splits. Overall, even though the tree does not consider other external factors that may affect financial capacity, it shows enough variables, visible in the possessions a family has, to serve as a basis for judging whether a family has good financial capacity.

Building a Random Forest

mtry: 10 or 15 (roughly a third or half of the predictors); ntree: 100 or 500 (a low vs. a high number of trees).

set.seed(201803075)
rf_fin<-list()
for (i in c(10,15)) {
  for (j in c(100,500)) { 
    rf_fin[[paste("mtry =",i,",","ntree = ", j)]] <- 
      randomForest(formula = fin_capacity ~ ., data = fin_training, mtry = i, ntree = j)
  }
}
rf_fin
## $`mtry = 10 , ntree =  100`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 9.03%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4597 199  0.04149291
## BAD   339 824  0.29148753
## 
## $`mtry = 10 , ntree =  500`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 9.16%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4596 200  0.04170142
## BAD   346 817  0.29750645
## 
## $`mtry = 15 , ntree =  100`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 9.55%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4584 212   0.0442035
## BAD   357 806   0.3069647
## 
## $`mtry = 15 , ntree =  500`
## 
## Call:
##  randomForest(formula = fin_capacity ~ ., data = fin_training,      mtry = i, ntree = j) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 9.01%
## Confusion matrix:
##      GOOD BAD class.error
## GOOD 4597 199  0.04149291
## BAD   338 825  0.29062769

The 4th model (mtry = 15, ntree = 500) has the lowest OOB error rate (9.01%), so it seems to be the best random forest.
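
Rather than reading the OOB error rates off the printed output, they can also be extracted programmatically; a minimal sketch (err.rate is randomForest's per-tree error matrix, with an OOB column and one row per tree):

# Final OOB error rate of each model (last row of the err.rate matrix)
sapply(rf_fin, function(m) m$err.rate[m$ntree, "OOB"])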

Getting the Importance of the Best Random Forest

importance(rf_fin[[4]]) %>%
  as.matrix() %>%
  as_tibble(rownames = "predictors") %>%
  arrange(desc(MeanDecreaseGini))

The table shows a large margin between the most important variable, the monthly electricity bill, and the next one, food_allowance. The electricity bill alone can go a long way toward indicating a good financial capacity: a high bill implies many appliances, which in turn implies the family did not have much of a problem purchasing them. Food allowance also deserves high consideration because it shows whether the family can meet its needs, whereas appliances and luxuries only get noticed once the necessities are covered. Family size plays an important role too, since it decides whether the budget is enough for everyone. Although the employed variable ranks low in the table, possibly because a binary indicator offers far fewer distinct split points than the continuous predictors, being employed is still very important in practice because it is where most of the budget comes from.
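
The same ranking can be visualized with randomForest's built-in plot; a minimal sketch:

# Dotchart of MeanDecreaseGini for the best model
varImpPlot(rf_fin[[4]], main = "Variable importance (mtry = 15, ntree = 500)")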

Calculating the F-score for the Decision Tree and Random Forest

Decision Tree

observed_tree <- fin_test$fin_capacity
pred_tree<- predict(pruned_tree, newdata = fin_test, type = "class")
cm_tree <- table(observed_tree, pred_tree)
TP_tree <- cm_tree[1,1] # True Positive: observed GOOD, predicted GOOD (row 1, column 1)
FP_tree <- cm_tree[2,1] # False Positive: predicted positive when the true value is negative
TN_tree <- cm_tree[2,2] # True Negative: observed BAD, predicted BAD
FN_tree <- cm_tree[1,2] # False Negative: predicted negative when the true value is positive
tree_overall_accuracy <- (TP_tree + TN_tree) / (TP_tree + FP_tree + TN_tree + FN_tree)
tree_recall           <- TP_tree / (TP_tree + FN_tree) # percent of positives that are correctly classified
tree_precision        <- TP_tree / (TP_tree + FP_tree) # percent of predicted positives that are correctly classified
tree_f_measure        <- 2*tree_recall*tree_precision / (tree_precision + tree_recall)

list("rf overall accuracy" = tree_overall_accuracy,
     "rf recall"           = tree_recall,
     "rf precision"        = tree_precision,
     "rf F measure"        = tree_f_measure)   
## $`rf overall accuracy`
## [1] 0.8965015
## 
## $`rf recall`
## [1] 0.9508982
## 
## $`rf precision`
## [1] 0.9237929
## 
## $`rf F measure`
## [1] 0.9371496
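
Since the same four metrics are computed again for the random forest below, a small helper could avoid the duplication; a minimal sketch, assuming the positive class (GOOD) sits in the first row and column of the confusion matrix:

# All four metrics from a 2x2 confusion matrix (positive class first)
classification_metrics <- function(cm) {
  TP <- cm[1, 1]; FN <- cm[1, 2]
  FP <- cm[2, 1]; TN <- cm[2, 2]
  recall    <- TP / (TP + FN)
  precision <- TP / (TP + FP)
  list(accuracy  = (TP + TN) / sum(cm),
       recall    = recall,
       precision = precision,
       f_measure = 2 * recall * precision / (recall + precision))
}
classification_metrics(cm_tree)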

Random Forest

fin_observed <- fin_test$fin_capacity
fin_pred<- predict(rf_fin[[4]], newdata = fin_test)
cm_rf <- table(fin_observed, fin_pred)
TP_rf <- cm_rf[1,1] # True Positive
FP_rf <- cm_rf[2,1] # False Positive
TN_rf <- cm_rf[2,2] # True Negative
FN_rf <- cm_rf[1,2] # False Negative
rf_overall_accuracy <- (TP_rf + TN_rf) / (TP_rf + FP_rf + TN_rf + FN_rf)
rf_recall           <- TP_rf / (TP_rf + FN_rf) # percent of positives that are correctly classified
rf_precision        <- TP_rf / (TP_rf + FP_rf) # percent of predicted positives that are correctly classified
rf_f_measure        <- 2*rf_recall*rf_precision / (rf_precision + rf_recall)

list("rf overall accuracy" = rf_overall_accuracy,
     "rf recall"           = rf_recall,
     "rf precision"        = rf_precision,
     "rf F measure"        = rf_f_measure)  
## $`rf overall accuracy`
## [1] 0.9110787
## 
## $`rf recall`
## [1] 0.9616766
## 
## $`rf precision`
## [1] 0.9310145
## 
## $`rf F measure`
## [1] 0.9460972
Based on the accuracy, recall, precision, and F-measure of the decision tree and the random forest, the random forest model is better and wins on every metric. This implies that, at least on this data, the random forest algorithm generalizes better and classifies financial capacity more correctly than a single pruned tree.
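
A minimal sketch to put the two models' metrics side by side, using the objects already computed above:

# Side-by-side comparison of the four metrics
tibble(metric = c("overall accuracy", "recall", "precision", "F measure"),
       tree   = c(tree_overall_accuracy, tree_recall, tree_precision, tree_f_measure),
       forest = c(rf_overall_accuracy, rf_recall, rf_precision, rf_f_measure))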