## Decision Trees - Part 2 - Assignment 5B
## 1. Name three predictor variables associated with the Gini index.
## • Numeric, nominal, least squares criterion
## 2. Fill-in-the-blank: Any implementation of a decision tree algorithm provides a collection of parameters ______________ how the tree is built.
## • for tuning
## 3. Explain tuning parameters.
## • Tuning is the process of adjusting a model's parameters to find the best fit and increase the model's accuracy. The first parameter to tune is max_depth, which sets how deep the tree can grow. The deeper the tree, the more splits it has and the more information it captures about the training data.
## 4. What is the rpart( ) function?
## • It is a function, from the R package of the same name, used to grow (fit) tree-based models.
## 5. List four splitting functions.
## • Gini Index, Information Gain, Entropy and Gain
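To make the splitting measures concrete, here is a minimal sketch in plain Python (not the rpart implementation) that computes the Gini index, entropy, and information gain for a toy node. The function names and the toy labels are assumptions made up for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical toy node: 4 "yes" and 4 "no" observations.
parent = ["yes"] * 4 + ["no"] * 4
# A hypothetical split that separates the two classes perfectly.
left, right = ["yes"] * 4, ["no"] * 4

print(gini(parent))                             # 0.5
print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # 1.0
```

A perfectly mixed two-class node has the highest impurity under both measures, and a split that yields pure children recovers all of the parent's entropy as gain.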
## 6. Fill-in-the-blank: Maxdepth, minbucket, minsplit, and maxcompete are called __________
## • Control parameters: they are used to stop training based on several thresholds
## 7. Explain the following arguments:
## data = data[train, vars]
## The data argument identifies the training dataset.
## method = "class"
## The method argument indicates the type of model to be built, which depends on the target variable; "class" indicates a classification model.
## parms =
## An optional list of parameters for the splitting function.
## control = control
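## A list of options (typically built with rpart.control()) that control how the tree is grown, such as minsplit, minbucket, cp, and maxdepth, as well as the use of surrogate splits.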
## split = "information"
## Directs rpart to use the information gain measure when choosing splits; it is passed inside parms.
## 8. When working with clustering, K-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting K to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data. Ideally, you will have prior knowledge about the true groupings and you can apply this information to choosing the number of clusters. TRUE or FALSE
## • TRUE. A very large K improves the homogeneity of the clusters but risks overfitting the data, so prior knowledge of the true groupings is the ideal guide for choosing K.
## 9. Name four tree-building implementations.
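## • CART (implemented in R by rpart), ID3, C4.5, and C5.0. Entropy, information gain, and the Gini index are splitting criteria rather than tree-building implementations.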
## 10. The default value of the minbucket= argument is one-third of the default value of minsplit= T/F
## • True. In rpart, minsplit defaults to 20 and minbucket defaults to round(minsplit/3), which is 7.
## 11. Fill-in-the-blank: In general, you will get a larger decision tree by ________________________.
## • reducing the minimum bucket size (minbucket)
## 12. Fill-in-the-blank: The ________________ is the minimum number of observations in any leaf node.
## • minbucket (its default is about one-third of the default value of minsplit)
## 13. A node will be considered for splitting if it has
## • at least minsplit observations (20 by default in rpart). The purpose of splitting is to divide the data into smaller, more homogeneous groups.
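The two size thresholds, minsplit and minbucket, can be sketched as a simple eligibility check. This is plain Python written for illustration, not rpart's code; the function names `default_minbucket` and `split_allowed` are hypothetical, and the only rpart facts assumed are its documented defaults (minsplit = 20, minbucket = round(minsplit/3)).

```python
def default_minbucket(minsplit=20):
    """Mirror rpart's documented default: minbucket = round(minsplit / 3)."""
    return round(minsplit / 3)


def split_allowed(n_node, n_left, n_right, minsplit=20, minbucket=None):
    """Return True if a proposed split respects both size thresholds."""
    if minbucket is None:
        minbucket = default_minbucket(minsplit)
    # The node must hold at least minsplit observations to be considered.
    if n_node < minsplit:
        return False
    # Each resulting leaf must keep at least minbucket observations.
    return min(n_left, n_right) >= minbucket


print(default_minbucket())       # 7
print(split_allowed(25, 18, 7))  # True
print(split_allowed(19, 10, 9))  # False: fewer than minsplit observations
print(split_allowed(25, 20, 5))  # False: one leaf falls below minbucket
```

Lowering either threshold lets more splits through, which is why smaller values tend to produce larger trees.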
## 14. List four tuning parameters for the decision tree algorithm.
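## • max_depth, min_samples_split, min_samples_leaf, and max_features (the scikit-learn names; the rpart counterparts of the first three are maxdepth, minsplit, and minbucket)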
## 15. Fill-in-the-blank: The ____ is used to control the size of the decision tree, and to select an optimal tree size.
## • Complexity parameter
## 16. The larger the decision tree, the more likely it can overfit the training data. TRUE or FALSE
## • True
## 17. In order to avoid overfitting, what should we do?
## • We can use a cross-validated relative error measure, as implemented in the rpart() function, to guard against overfitting.
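One common way to apply a cross-validated error is to pick the tree size where that error is lowest. The sketch below is plain Python and only imitates the shape of rpart's complexity table (cp, number of splits, cross-validated error); every number in it is made up for illustration.

```python
# Hypothetical rows in the shape of rpart's cptable: (cp, nsplit, xerror).
cp_table = [
    (0.200, 0, 1.00),
    (0.050, 1, 0.82),
    (0.020, 3, 0.74),
    (0.010, 6, 0.79),
]

# Choose the row whose cross-validated error (third field) is smallest.
best_cp, best_nsplit, best_xerror = min(cp_table, key=lambda row: row[2])

print(best_cp, best_nsplit, best_xerror)  # 0.02 3 0.74
```

In this made-up table the largest tree (6 splits) has a higher cross-validated error than the 3-split tree, which is the pattern that signals overfitting.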
## 18. In order to make a node split of a decision tree worthwhile, what could we do?
## • We can set the cp= argument: a split is attempted only if it decreases the overall lack of fit by at least a factor of cp (0.01 by default in rpart).
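The role of cp can be sketched as a threshold on relative improvement. This is a simplified reading in plain Python, not rpart's internal code: the function name `split_is_worthwhile` is hypothetical, the error counts are made up, and scaling the improvement by the root node's error is an assumption of this sketch.

```python
def split_is_worthwhile(root_error, error_before, error_after, cp=0.01):
    """Keep a split only if it cuts the error by at least cp, relative to the root."""
    relative_improvement = (error_before - error_after) / root_error
    return relative_improvement >= cp


# Hypothetical misclassification counts, chosen only for illustration.
root_error = 200

print(split_is_worthwhile(root_error, 120, 110))  # True: improvement is 0.05
print(split_is_worthwhile(root_error, 120, 119))  # False: improvement is 0.005
```

Raising cp demands a bigger payoff from each split and so yields a smaller tree; lowering it allows more marginal splits.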
## 19. In the code statement,
## Model <- rpart(formula = form, …)
## what does the argument formula = form tell the model to do?
## • It supplies the model formula, which names the target variable and the predictor variables from which the model is to be built.
## 20. Fill-in-the-blank: When a model is complex, it is likely to _______________ .
## • overfit: it keeps improving its fit to the training data while its performance on the validation data starts to get worse