Assignment 5B

## Decision Trees - Part 2 - Assignment 5B
 
## 1.   Name three predictor variables associated with the Gini index.
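##      •   Numeric, nominal (categorical), and ordinal. The least squares criterion is a splitting criterion for regression trees, not a type of predictor variable.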
## 2.   Fill-in-the-blank: Any implementation of a decision tree algorithm provides a collection of parameters ______________ how the tree is built.
##      •   tuning
## 3.   Explain tuning parameters.
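##      •   Tuning parameters are settings that control how the tree is built. Tuning is the process of adjusting them to find the best fit and improve the model's accuracy on new data. A common first parameter to tune is the maximum depth (max_depth in scikit-learn, maxdepth in rpart), which limits how deep the tree can grow. A deeper tree has more splits and captures more information about the training data, which also raises the risk of overfitting.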
## 4.   What is the rpart( ) function?
##      •   rpart() is the main function of the R package of the same name (recursive partitioning). It grows (fits) tree-based models for classification and regression.
## 5.   List four splitting functions.
##      •   Gini index, entropy, information gain, and gain ratio
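
The first three splitting measures can be sketched in a few lines of plain Python. This is only an illustration using the textbook formulas, not the code that rpart runs; the function names and the toy labels are made up for the example.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical toy labels, invented for illustration only.
parent = ["yes", "yes", "no", "no"]
print(gini(parent))      # 0.5 for a 50/50 node
print(entropy(parent))   # 1.0 bit for a 50/50 node
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0: a perfect split
```

A pure node scores 0 on both Gini and entropy, so a split that produces purer children yields a higher gain.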
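## 6.   Fill-in-the-blank:  Maxdepth, minbucket, minsplit, and maxcompete are called __________ 
##      •   Tuning (control) parameters, set through rpart.control(). maxdepth, minsplit, and minbucket act as thresholds that stop the tree from growing further.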

## 7.   Explain the following arguments:
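##    data = data[train, vars]
##         identifies the training dataset: the rows indexed by train and the columns listed in vars
##    method = "class"
##         indicates the type of model to build, which depends on the target variable; "class" requests a classification tree
##    parms =
##         an optional list of parameters for the splitting function
##    control = control
##         passes a list of options, typically created with rpart.control(), that control the details of the algorithm, such as minsplit, minbucket, cp, maxdepth, and the use of surrogate splits
##    split = "information"
##         supplied inside parms, as in parms = list(split = "information"); directs rpart to use the information gain measure instead of the default Gini index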

## 8.   When working with clustering, K-means is sensitive to the number of clusters; the choice requires a delicate balance. Setting K to be very large will improve the homogeneity of the clusters, and at the same time, it risks overfitting the data.  Ideally, you will have prior knowledge about the true groupings and you can apply this information to choosing the number of clusters.      TRUE or FALSE
##      •   TRUE. A very large K makes the clusters more homogeneous but risks overfitting the data, so prior knowledge of the true groupings is the best guide for choosing K.
## 9.   Name four tree-building implementations.
##      •   CART (implemented by rpart in R), ID3, C4.5/C5.0, and CHAID
## 10.  The default value of the minbucket= argument is one-third of the default value of minsplit= T/F
##      •   True

## 11.  Fill-in-the-blank: In general, you will get a larger decision tree by ________________________.
##      •   reducing the minimum bucket size (minbucket)
## 12.  Fill-in-the-blank: The ________________ is the minimum number of observations in any leaf node.
##      •   minbucket (its default is about one-third of minsplit)
## 13.  A node will be considered for splitting if it has 
##      •   at least minsplit observations (20 by default in rpart).
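
As a rough illustration of how the minsplit and minbucket thresholds from questions 10 to 13 interact, here is a minimal Python sketch. The helper names are hypothetical; the defaults assumed are the ones documented for rpart.control() (minsplit = 20, minbucket = round(minsplit / 3)).

```python
def default_minbucket(minsplit=20):
    """Default leaf-size threshold: one-third of minsplit, rounded."""
    return round(minsplit / 3)

def may_consider_split(n_obs, minsplit=20):
    """A node is a candidate for splitting only if it holds at least minsplit observations."""
    return n_obs >= minsplit

def split_respects_minbucket(n_left, n_right, minbucket):
    """A candidate split is acceptable only if both leaves hold at least minbucket observations."""
    return min(n_left, n_right) >= minbucket

minbucket = default_minbucket()
print(minbucket)                                    # 7 when minsplit is 20
print(may_consider_split(25))                       # True: 25 >= 20
print(may_consider_split(12))                       # False: 12 < 20
print(split_respects_minbucket(20, 5, minbucket))   # False: one leaf would have only 5 observations
```

Lowering either threshold lets more nodes qualify for splitting, which is why a smaller bucket size generally produces a larger tree.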
## 14.  List four tuning parameters for the decision tree algorithm.
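##      •   max_depth, min_samples_split, min_samples_leaf, and max_features (scikit-learn names; the rpart counterparts of the first three are maxdepth, minsplit, and minbucket)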
## 15.  Fill-in-the-blank: The  ____ is used to control the size of the decision tree, and to select an optimal tree size.
##      •   Complexity parameter
## 16.  The larger the decision tree, the more likely it can overfit the training data. TRUE or FALSE
##      •   True
## 17.  In order to avoid overfitting, we should do what?
##      •   Use a cross-validated relative error measure, as implemented in the rpart() function, to choose the tree size, for example by pruning back to the cp value with the lowest cross-validated error.
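
The idea of guarding against overfitting with a cross-validated error can be sketched as follows. The table mimics the kind of complexity table rpart reports, but every number in it is invented purely for illustration, and the variable names are hypothetical.

```python
# Each row: (cp, number of splits, cross-validated relative error).
# All values are made up for illustration; they are not real results.
cp_table = [
    (0.50, 0, 1.00),
    (0.10, 1, 0.60),
    (0.02, 3, 0.45),
    (0.01, 6, 0.52),
]

# Pick the row whose cross-validated error is lowest.
best_cp, best_splits, best_xerror = min(cp_table, key=lambda row: row[2])
print(best_cp, best_splits, best_xerror)   # 0.02 3 0.45
```

In this made-up table the largest tree (6 splits) has a higher cross-validated error than the 3-split tree, which is the pattern that signals overfitting.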
## 18.  In order to make a node split of a decision tree worthwhile, what could we do?
##     •    Use the cp= argument: a split is attempted only if it decreases the overall lack of fit by at least a factor of cp.
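
A simplified sketch of the cp rule, in Python: a split counts as worthwhile only if the error it removes is at least cp times the error of the root node. The function name and the error values are hypothetical, and this is a simplification of what rpart does internally; 0.01 is assumed here because it is the documented default for cp.

```python
def split_is_worthwhile(parent_error, children_error, root_error, cp=0.01):
    """Return True if the split removes at least cp * root_error of error."""
    improvement = parent_error - children_error
    return improvement >= cp * root_error

# Hypothetical error values, chosen only to show both outcomes.
print(split_is_worthwhile(40.0, 39.5, 100.0))   # False: improvement 0.5 is below the 1.0 threshold
print(split_is_worthwhile(40.0, 35.0, 100.0))   # True: improvement 5.0 clears the 1.0 threshold
```

Raising cp makes the threshold stricter and yields a smaller tree; lowering it allows more splits.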
## 19.  In the code statement,
##      Model <- rpart(formula = form, …)
##      what does the argument formula = form tell the model to do?
##      •   It identifies the model that is to be built: the formula names the target variable on the left of ~ and the input variables on the right.
## 20.  Fill-in-the-blank: When a model is complex, it is likely to _______________ .
##       •  overfit: it keeps fitting the training data better while its performance on the validation data starts to get worse

Paul Brown

2022-12-14