Decision Trees - Part 2 - Assignment 5B
 
1.  Name three predictor variables associated with the Gini index.

This question should be associated with a dataset. (Is this question correct?)

2.  Fill-in-the-blank: 
Any implementation of a decision tree algorithm provides a collection of parameters __for tuning__ how the tree is built.

3.  Explain tuning parameters.

Adjust the parameters that control how the model is built, and find the values that give the best fit, in order to increase the prediction accuracy rate.

4.  What is the rpart( ) function?
The rpart() function from the Recursive Partitioning and Regression Trees (rpart) package fits the tree model.

5.  List four splitting functions.

(i) split="information" directs rpart to use the information gain measure.
(ii) split="gini" directs rpart to use the Gini index. The splitting functions are Gini Index, Information Gain, Entropy and Gain.
(iii) The minsplit= argument specifies the minimum number of observations that must exist at a node in the tree before it is considered for splitting.
(iv) The minbucket= argument is the minimum number of observations in any terminal or leaf node. The default value is about one-third of the default value of minsplit= .
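The impurity measures named above can be sketched in base R. This is only an illustration of the textbook formulas, not the code rpart runs internally; the helper names (class_props, gini_index, entropy, info_gain) and the toy node are made up for this example.

```r
# Class proportions at a node, computed from a vector of class labels
class_props <- function(labels) {
  as.numeric(table(labels)) / length(labels)
}

# Gini index: 1 - sum(p^2)
gini_index <- function(labels) {
  p <- class_props(labels)
  1 - sum(p^2)
}

# Entropy in bits: -sum(p * log2(p)), dropping classes with zero proportion
entropy <- function(labels) {
  p <- class_props(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting `labels` into two non-empty groups,
# where the logical vector `left` marks the observations sent to the left child
info_gain <- function(labels, left) {
  n <- length(labels)
  n_left <- sum(left)
  entropy(labels) -
    (n_left / n) * entropy(labels[left]) -
    ((n - n_left) / n) * entropy(labels[!left])
}

# Hypothetical node with three "yes" and three "no" observations
node <- c("yes", "yes", "yes", "no", "no", "no")
perfect_split <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

gini_index(node)                 # 0.5
entropy(node)                    # 1
info_gain(node, perfect_split)   # 1
```

A perfectly mixed two-class node has the highest impurity, and a split that separates the classes completely recovers all of it as gain.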


6.  Fill-in-the-blank:  
maxdepth, minbucket, minsplit, and maxcompete are called __tuning arguments__.

7.  Explain the following arguments:

control <- rpart.control(cp=0, minbucket=0)

model <- rpart(formula=form, data=data[train, vars], method="class", 
        parms=list(split="information"), control=control) 

data=data[train, vars] --> Training data 
method="class"         --> Classification problem
split="information"    --> rpart to use the information gain measure
control=control        --> Complexity-controlling arguments
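A runnable version of the call might look like the sketch below. The question does not say what form, data, train and vars contain, so the iris dataset, the 70% training split and the seed are all assumptions made for illustration. The sketch also uses minbucket = 1 (at least one observation per leaf) rather than the 0 shown in the question.

```r
library(rpart)

# Assumed stand-ins for the objects named in the question
data  <- iris
vars  <- names(data)        # columns to keep
form  <- Species ~ .        # response ~ all other columns
set.seed(42)                # arbitrary seed so the split is repeatable
train <- sample(nrow(data), round(0.7 * nrow(data)))  # training row indices

# cp = 0 removes the complexity penalty, so the tree grows as far as
# minsplit/minbucket allow
control <- rpart.control(cp = 0, minbucket = 1)

model <- rpart(formula = form, data = data[train, vars], method = "class",
               parms = list(split = "information"), control = control)

print(model)

# Predict the class of the held-out rows and compare with the actual class
pred <- predict(model, newdata = data[-train, vars], type = "class")
table(actual = data[-train, "Species"], predicted = pred)
```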

8.  When working with clustering, K-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting K to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have prior knowledge about the true groupings and you can apply this information to choosing the number of clusters.      TRUE or FALSE

TRUE

9.  Name four tree-building implementations.

Entropy, Gain, Information Gain, Gini Index

10. The default value of the minbucket=  argument is about one-third of the default value of minsplit=    TRUE or FALSE

TRUE
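The defaults can be checked directly with rpart.control(), which returns the control settings as a list. A short check, assuming the rpart package is installed:

```r
library(rpart)

# Control settings with every argument left at its default
defaults <- rpart.control()
defaults$minsplit    # 20
defaults$minbucket   # 7, i.e. round(20 / 3)

# When only minsplit is supplied, minbucket follows it
custom <- rpart.control(minsplit = 30)
custom$minbucket     # 10
```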

11. Fill-in-the-blank: 
In general, you will get a larger decision tree by ___allowing terminal (leaf) nodes to contain a smaller number of observations___.



12. Fill-in-the-blank: 
The __minbucket= argument__ is the minimum number of observations in any leaf node.


13. A node will be considered for splitting if it __has at least minsplit= observations and if, on splitting, each of its children has at least minbucket= observations.__

14. List four tuning parameters for the decision tree algorithm.

minbucket, minsplit, maxdepth and the complexity parameter (cp)

15. Fill-in-the-blank: 
The  ___complexity parameter___ is used to control the size of the decision tree, and to select an optimal tree size.


16. The larger the decision tree, the more likely it can overfit the training data. TRUE or FALSE

TRUE 

17. In order to avoid overfitting, we should do what?

We should build a smaller tree: increase minbucket, minsplit, or the complexity parameter (cp), or decrease maxdepth.
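The effect of these settings on tree size can be seen by fitting the same data twice. The sketch below uses the iris dataset and arbitrary control values, both chosen only for illustration; count_leaves is a hypothetical helper that counts terminal nodes in the fitted tree's frame.

```r
library(rpart)

# Hypothetical helper: number of terminal (leaf) nodes in a fitted rpart tree
count_leaves <- function(fit) {
  sum(fit$frame$var == "<leaf>")
}

# Loose settings: no complexity penalty and very small nodes allowed
loose <- rpart(Species ~ ., data = iris, method = "class",
               control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Strict settings: larger nodes, a complexity penalty and a shallow tree
strict <- rpart(Species ~ ., data = iris, method = "class",
                control = rpart.control(cp = 0.05, minsplit = 40,
                                        minbucket = 15, maxdepth = 2))

count_leaves(loose)    # the larger tree
count_leaves(strict)   # the smaller tree
```

The loose tree fits the training data more closely, which is exactly what makes it more prone to overfitting.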

18. In order to make a node split of a decision tree worthwhile, what could we do?


We could set the cp= argument. It governs the minimum “benefit” that must be gained at each split of the decision tree in order to make a split worthwhile.


19. In the code statement,
Model <- rpart(formula = form, …)
What does the argument, formula = form, tell the model to do?

It tells rpart which variable is the response (Y) and which variables are the covariates (X) to use when building the tree model.
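A formula object can be inspected in base R to see which variable is the response and which are the covariates. The variable names below come from the iris dataset and are only an assumed example of what form might hold.

```r
# Assumed example formula: response on the left of ~, covariates on the right
form <- Species ~ Sepal.Length + Petal.Width

# All variable names in the formula, response first
form_vars  <- all.vars(form)
response   <- form_vars[1]
covariates <- form_vars[-1]

response     # "Species"
covariates   # "Sepal.Length" "Petal.Width"
```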

20. Fill-in-the-blank: 
When a model is complex, it is likely to __overfit__ .