Project: Modeling with Classification Trees

Load and Transform Data

Here is a data dictionary:

ANSWERED: Indicator for whether the call was answered.
INCOME: Annual income of the customer.
FEMALE: Indicator for whether the customer is female.
AGE: Age of the customer.
JOB: Indicator for job type.
NUM_DEPENDENTS: Number of dependents.
RENT: Indicator for whether the customer rents.
OWN_RES: Indicator for whether the customer owns a residence.
NEW_CAR: Indicator for whether the customer owns a new car.
CHK_ACCT: Number of checking accounts.
SAV_ACCT: Number of savings accounts.
NUM_ACCTS: Total number of other accounts.
MOBILE: Indicator for whether the callback number is mobile.
PRODUCT: Type of product purchased (0 represents no purchase).

Here is code to clean and prepare the dataset for modeling.

## Parsed with column specification:
## cols(
##   answered = col_double(),
##   income = col_double(),
##   female = col_double(),
##   age = col_double(),
##   job = col_double(),
##   num_dependents = col_double(),
##   rent = col_double(),
##   own_res = col_double(),
##   new_car = col_double(),
##   chk_acct = col_double(),
##   sav_acct = col_double(),
##   num_accts = col_double(),
##   mobile = col_double(),
##   product = col_double()
## )
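
The cleaning code itself is not shown here, only the column specification it printed. As a rough sketch of what such a step might look like, the base R example below drops product, recodes the outcome as a no/yes factor, and converts indicator columns to factors. The toy data frame, the object names raw, d, and indicator_cols, and the assumption that answered is coded 1 for "yes" are all hypothetical and are not taken from the original code.

```r
# Toy stand-in for the raw data (hypothetical values; the real file has 14 numeric columns)
raw <- data.frame(
  answered = c(1, 0, 1, 1),
  income   = c(25000, 48000, 31000, 8000),
  female   = c(0, 1, 1, 0),
  mobile   = c(1, 0, 1, 1),
  product  = c(0, 2, 0, 1)
)

# Assumed cleaning steps: drop product, recode the outcome, turn indicators into factors
indicator_cols <- c("female", "mobile")  # with the full data, add job, rent, own_res, new_car
d <- raw[, setdiff(names(raw), "product")]
d$answered <- factor(ifelse(d$answered == 1, "yes", "no"), levels = c("no", "yes"))
d[indicator_cols] <- lapply(d[indicator_cols], factor)

str(d)
```

Setting the factor levels explicitly to no, then yes keeps the class order consistent with the (no, yes) probabilities that rpart prints.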

Questions

Q1.

The target variable for this modeling exercise is ANSWERED. In this dataset, what proportion of calls were answered?

In this case, answered calls are the majority class. A simple baseline model always predicts the majority class, which here is correct for a proportion of 0.543 of the calls. Our tree models should be more accurate than this majority class prediction, so we will use it as the benchmark for evaluating subsequent models.

## [1] 0.543

## [1] 0.457
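
The two proportions above can be reproduced without the data file. The sketch below rebuilds the outcome vector from the root node counts reported in the tree printout under Q2 (n = 5000, of which 2285 are "no") and then computes the class proportions and the majority class benchmark. The object names are illustrative only.

```r
# Rebuild the outcome from the root node counts: 5000 calls, 2285 of them not answered
answered <- factor(rep(c("no", "yes"), times = c(2285, 5000 - 2285)),
                   levels = c("no", "yes"))

prop_answered     <- mean(answered == "yes")
prop_not_answered <- mean(answered == "no")

# Accuracy of always predicting the majority class
benchmark_accuracy <- max(prop_answered, prop_not_answered)

prop_answered
prop_not_answered
benchmark_accuracy
```

With the real data frame, the same idea would be mean(d$answered == "yes"), assuming the outcome is stored as a no/yes factor.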

Q2.

Fit a tree model to the outcome using just one variable, income. We’ll call this the “income model.” What is the accuracy of the income model?

## n= 5000 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 5000 2285 yes (0.4570000 0.5430000)  
##    2) income>=39135 1385  495 no (0.6425993 0.3574007) *
##    3) income< 39135 3615 1395 yes (0.3858921 0.6141079)  
##      6) income< 36355 3490 1390 yes (0.3982808 0.6017192)  
##       12) income>=4295 3450 1390 yes (0.4028986 0.5971014)  
##         24) income< 9595 480  223 no (0.5354167 0.4645833)  
##           48) income>=7890 183   39 no (0.7868852 0.2131148) *
##           49) income< 7890 297  113 yes (0.3804714 0.6195286)  
##             98) income< 4455 25    0 no (1.0000000 0.0000000) *
##             99) income>=4455 272   88 yes (0.3235294 0.6764706) *
##         25) income>=9595 2970 1133 yes (0.3814815 0.6185185) *
##       13) income< 4295 40    0 yes (0.0000000 1.0000000) *
##      7) income>=36355 125    5 yes (0.0400000 0.9600000) *
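
The accuracy of this tree on the data it was fit to can be checked directly from the printout. Each terminal node (marked with *) reports its size n and its loss, the number of observations it misclassifies, so one minus the total loss divided by the total n gives the accuracy. The data frame below simply transcribes those seven terminal nodes; its name and layout are illustrative.

```r
# n and loss for each terminal node, copied from the printed tree
leaves <- data.frame(
  node = c(2, 48, 98, 99, 25, 13, 7),
  n    = c(1385, 183, 25, 272, 2970, 40, 125),
  loss = c(495, 39, 0, 88, 1133, 0, 5)
)

# The terminal nodes should account for every observation
stopifnot(sum(leaves$n) == 5000)

accuracy <- 1 - sum(leaves$loss) / sum(leaves$n)
accuracy  # 0.648
```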

##   1   2   3   4   5   6 
##  no yes yes yes  no  no 
## Levels: no yes

## [1] 0.648
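
The code behind these outputs is not shown. One plausible way to produce them, sketched below on simulated data, is to fit the tree with rpart(), predict class labels, and take the share of predictions that match the observed outcome. The toy data, the seed, and the response rule are hypothetical; only the rpart() and predict() calls reflect the general approach, and the actual project code may differ.

```r
library(rpart)

# Simulated stand-in for the real data (hypothetical)
set.seed(1)
n <- 400
toy <- data.frame(income = round(runif(n, 2000, 60000)))

# Assumed rule: "yes" with probability 0.8 below 39000 and 0.2 above
p_yes <- ifelse(toy$income < 39000, 0.8, 0.2)
toy$answered <- factor(ifelse(runif(n) < p_yes, "yes", "no"),
                       levels = c("no", "yes"))

# Classification tree with income as the only predictor
income_model <- rpart(answered ~ income, data = toy, method = "class")

# Predicted class for each training observation
pred <- predict(income_model, type = "class")
head(pred)

# In-sample accuracy: share of predictions that match the observed label
accuracy <- mean(pred == toy$answered)
accuracy
```

Because the predictions are made on the same rows used for fitting, this is in-sample accuracy, which tends to be optimistic compared with accuracy on new data.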

Q3.

The first split in this tree, on income >= 39135, is guaranteed by the tree algorithm to yield the greatest information gain (IG) of any possible split. What is the IG associated with that split?

The formula for IG takes the weighted average of the entropy in the two children (each child weighted by its share of the parent's observations) and subtracts it from the entropy in the parent. Here is what you need to calculate:

entropy(parent): entropy in the parent, prior to the split.
entropy(c1): entropy in the first child.
p(c1): the proportion of observations from the parent that wind up in the first child after the split.
entropy(c2): entropy in the second child.
p(c2): the proportion of observations from the parent that wind up in the second child after the split.

The formula is:

IG = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2)]

Recall that entropy for any group is defined as:

-p1 * log2(p1) - p2 * log2(p2),

where p1 is the proportion of the first label and p2 is the proportion of the second label.

The easiest way to do this calculation is to define as objects each of the elements in the formula.
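
A minimal sketch of that calculation follows, using the class proportions and node sizes printed for nodes 1, 2, and 3 of the income model. Entropy is computed as -sum(p * log2(p)), which for two classes is -p1 * log2(p1) - p2 * log2(p2). The helper name entropy and the object names are illustrative, and the helper assumes every proportion is strictly between 0 and 1.

```r
# Entropy in bits for a vector of class proportions (assumes no proportion is 0)
entropy <- function(p) -sum(p * log2(p))

# Class proportions (no, yes) read from the printed tree
parent <- c(0.4570000, 0.5430000)  # node 1, n = 5000
c1     <- c(0.6425993, 0.3574007)  # node 2, income >= 39135, n = 1385
c2     <- c(0.3858921, 0.6141079)  # node 3, income <  39135, n = 3615

# Share of the parent's observations that lands in each child
p_c1 <- 1385 / 5000
p_c2 <- 3615 / 5000

ig <- entropy(parent) - (p_c1 * entropy(c1) + p_c2 * entropy(c2))
ig
```

With these inputs the information gain comes out to roughly 0.039 bits: the parent is close to maximally mixed (entropy near 1), and the split reduces that only modestly.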

Q4.

Fit a tree model of the outcome using all the predictors (again, not including product, and having changed the indicators and the outcome into factors). We’ll call this the “tree model.”

Create and upload a visualization of the tree model using the rpart.plot.version1() function. (This function creates a simpler tree plot than rpart.plot(), which makes it easier to read.)
Identify the top 3 most important predictors in this model.
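
As a hedged illustration of these two tasks, the sketch below fits a tree on all predictors of a small simulated data set, reads the importance scores from the variable.importance component of the fitted rpart object, and draws the plot only if the rpart.plot package is installed. The toy data, the seed, and the signal built into it are hypothetical, so the ranking it produces says nothing about the real data.

```r
library(rpart)

# Simulated stand-in with three predictors (hypothetical)
set.seed(2)
n <- 600
toy <- data.frame(
  income = round(runif(n, 2000, 60000)),
  age    = sample(18:80, n, replace = TRUE),
  mobile = factor(sample(0:1, n, replace = TRUE))
)

# Assumed signal: mobile numbers and lower incomes answer more often
p_yes <- 0.15 + 0.5 * (toy$mobile == "1") + 0.3 * (toy$income < 39000)
toy$answered <- factor(ifelse(runif(n) < p_yes, "yes", "no"),
                       levels = c("no", "yes"))

# The formula answered ~ . uses every other column as a predictor
tree_model <- rpart(answered ~ ., data = toy, method = "class")

# Named vector of importance scores, sorted explicitly before taking the top 3
importance <- sort(tree_model$variable.importance, decreasing = TRUE)
top3 <- head(names(importance), 3)
top3

# Draw the simplified plot only when rpart.plot is available
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  rpart.plot::rpart.plot.version1(tree_model)
}
```

With the real data, product would need to be dropped from the data frame first, since the formula answered ~ . would otherwise include it.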

Q5.

Is the tree model better than the income model? Hint: calculate the accuracy of the tree model and compare it to the accuracy of the income model.

##   1   2   3   4   5   6 
## yes yes yes  no  no  no 
## Levels: no yes

## [1] 0.8108
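
To put the comparison in one place, the short sketch below collects the three accuracies reported in this document (the majority class benchmark, the income model, and the tree model) and computes each model's improvement over the benchmark. The vector names are illustrative.

```r
# Accuracies reported above
acc <- c(benchmark = 0.543, income_model = 0.648, tree_model = 0.8108)

# Improvement of each model over the majority class benchmark
gain_over_benchmark <- acc - acc[["benchmark"]]
round(gain_over_benchmark, 4)

# Name of the most accurate model
best <- names(which.max(acc))
best
```

On these numbers the tree model is the most accurate of the three. If its accuracy was computed on the same data used to fit it, as the income model's was, the gap on new data may be smaller, since a larger tree has more room to overfit.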

Mark Sandbothe
