Classification and Regression Trees (CART)
Part 1
1. When working with CART we understand that a node has two components. What are those two components?
• Root node and leaf node
2. What is the root node?
• It is the highest node in the tree structure, and has no parent
3. What is the leaf node?
• It is the terminal node of a tree
4. What does a Gini score of 0 indicate?
• A Gini score of 0 indicates a perfect separation: each group created by the split contains rows of only one class.
5. Before you are able to calculate the gini_index, you must first do what?
• You must first create the two groups of data based on a splitting point.
6. In Listing 11.1, the following line of code is used: “n_instances = float(sum([len(group) for group in groups]))” Explain this line of code.
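• The line counts the total number of rows across all groups: the list comprehension collects the length of each group, sum( ) adds those lengths, and float( ) converts the total to a floating-point number. A minimal sketch of that behaviour, assuming groups is a list of groups and each group is a list of rows (the sample rows below are made up for illustration):

```python
# Hypothetical sample data: two groups of rows, where each row is
# [feature_value, class_label]. These values are illustrative only.
groups = [
    [[1.0, 0], [2.0, 0]],  # first group: 2 rows
    [[3.0, 1]],            # second group: 1 row
]

# Sizes of the individual groups.
group_sizes = [len(group) for group in groups]

# The line from the question: total row count across all groups, as a float.
n_instances = float(sum([len(group) for group in groups]))

print(group_sizes)
print(n_instances)
```

Converting to float presumably keeps later divisions (group size over total size) in floating point, which matters in Python 2, where dividing two integers truncates.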
7. Fill-in-the-blank: A split is comprised of an ___________ in the dataset and a ________ .
• attribute and value
8. Creating a split involves three parts. What are they?
• Calculating the Gini score
• Splitting a dataset
• Evaluating all splits
9. When splitting a dataset, the test_split ( ) function implements a procedure. Explain this procedure.
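• It splits a dataset into two lists of rows (a left group and a right group), given the index of an attribute and a split value for that attribute.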
10. Which two functions do we need to evaluate splits in the CART algorithm?
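• test_split ( ), which splits the dataset given the index of an attribute and a split value for that attribute
• gini_index ( ), which scores the groups produced by that split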
11. Which score do we use to evaluate the cost of the split?
• We use Gini index score to evaluate the cost of the split.
12. You received a gini score of 0.35. What does this tell you about your split?
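• A score of 0.35 means the split leaves the groups fairly mixed. It can be read as roughly a 35% chance of mislabeling a row if its label were drawn at random from the class distribution of its group. For a two-class problem, where scores range from 0 (perfect separation) to 0.5 (a 50/50 mix in each group), this is a weak split.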
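To make the range of Gini scores concrete, here is a minimal sketch of a weighted Gini calculation. It assumes each row stores its class label in the last position and that the function takes a list of groups plus the list of known class values; this is one common formulation, not necessarily the exact code of Listing 11.1.

```python
def gini_index(groups, classes):
    """Weighted Gini score for a split (assumes the label is row[-1])."""
    # Total number of rows across all groups.
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:
            continue  # skip empty groups to avoid dividing by zero
        labels = [row[-1] for row in group]
        # Sum of squared class proportions within this group.
        score = 0.0
        for class_val in classes:
            proportion = labels.count(class_val) / size
            score += proportion * proportion
        # Weight the group's impurity by its relative size.
        gini += (1.0 - score) * (size / n_instances)
    return gini


# Worst case for two classes: each group is a 50/50 mix.
worst = gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], [0, 1])
# Perfect separation: each group holds a single class.
best = gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], [0, 1])
print(worst, best)
```

A score such as 0.35 falls between these two extremes, nearer the worst case.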
13. How do we use the Gini index score with the CART classification algorithm?
• In the CART algorithm, the Gini index is used as a criterion to select the best split at each node of the decision tree.
14. In this chapter why are we calculating proportions?
• The chapter uses proportions (the fraction of rows in a group that belong to each class) to calculate impurity measures such as the Gini index.
15. What is the CART algorithm used for?
• It is used to create decision trees for both classification and regression tasks, by recursively partitioning the data into subsets based on the most informative features or attributes, and using impurity measures such as the Gini index to select the best split at each node.
16. Do we normally use the Gini index on regression problems?
• No, we do not normally use the Gini index on regression problems
17. List two cost functions in machine learning.
• Mean Squared Error (MSE)
• Gini index
18. Explain the greedy algorithm and how it is used with CART.
• A greedy algorithm makes the best available choice at each step without reconsidering earlier choices. The CART (Classification and Regression Trees) algorithm uses this approach to construct the tree, selecting the best split at each node based on a given cost function.
19. Explain the cost function in machine learning and how it is used.
• It provides a way to evaluate and improve the performance of a model. By minimizing the cost function during training, the model is able to make better predictions on new, unseen data.
20. What is node purity?
• It refers to how mixed the training data assigned to each node is.
21. What is the Gini index?
• It is the name of the cost function used to evaluate splits in the dataset.
22. What is, in fact, a Gini score?
• It gives an idea of how good a split is by how mixed the classes are in the two groups created by the split.
23. How do you split a dataset?
• By iterating over each row, checking whether the attribute value is below or above the split value, and assigning the row to the left or right group respectively.
24. How are we using the word “dictionary” in machine learning?
• The word "dictionary" is often used to refer to a data structure that maps keys to values. Dictionaries are commonly used to represent feature vectors or feature mappings.
25. When selecting the best split and using it as a new node for the tree what do we store?
• We store the splitting feature and its corresponding threshold value in the new node.