Always ask: how is this model going to be used?
When building models, conversations shouldn’t be solely technical.
Data equity & ethics: Just because we can build predictive models doesn’t mean we always should.
Supervised learning:
Target values known
Training data labeled with target values
Goal: find a way to map attributes to target values
Examples: classification & regression

Unsupervised learning:
Target values unknown
Training data unlabeled
Goal: discover information hidden in the data
May precede supervised learning
Example: clustering
Two attribute types: predictors and class
Finding a model to map a predictor set to a class
Train/test split: divide the data into training and testing sets
Decision tree learning: splitting on attributes and drilling down into the data to discover which attributes may predict the outcome. In this case, whether a passenger survived or not.
The order in which the drilling down happens makes an implicit assumption about which features are related to the problem.
What if the data used is not representative of the population?
Create a model that predicts as accurately as possible with the data available.
Find a split that optimizes some criterion.
Determine which feature and what threshold will be used for splitting.
Determine the best split.
When to stop splitting?
A different order for drilling down the same dataset in the previous image.
The variety of ways to drill down illustrates the importance of evaluating the model: how do we ensure that the order in which we drill down yields the most accurate and reliable model?
We want to find splits that separate the classes as cleanly as possible.
We want to find nodes that are as pure as possible.
Gini: a measure of node impurity
Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node
Maximum possible Gini (with two classes): 0.5
Whichever split gives a lower weighted Gini value is the better feature to split on
Entropy: a measure of uncertainty that reflects a node’s purity - how much disorder is there in the system?
Entropy = -sum(p_i * log2(p_i))
Compared to Gini, entropy imposes a harsher penalty on skewed (impure) nodes.
Entropy of the data - weighted entropy of the split = information gain (the analogous quantity for Gini is the Gini reduction)
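As a minimal sketch (these helper functions are illustrative, not from the original notes), both impurity measures can be computed directly in base R; the example uses the built-in iris dataset introduced below:

gini <- function(labels) {
  # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}
entropy <- function(labels) {
  # Entropy: -sum(p_i * log2(p_i)); classes with zero count are dropped
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}
split_gini <- function(labels, left) {
  # Weighted Gini of a split: average the two children, weighted by size
  n <- length(labels)
  sum(left) / n * gini(labels[left]) + sum(!left) / n * gini(labels[!left])
}
gini(iris$Species)                                  # three balanced classes: 1 - 3*(1/3)^2 = 0.667
split_gini(iris$Species, iris$Petal.Length < 2.45)  # 0.333 - this split lowers the Gini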
iris[1, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
# Gives first row of iris and all columns
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# Gives numbers 1 through 10
sample(1:10, 5)
## [1] 10 7 8 3 6
# Draws five random numbers from 1-10, without replacement
1:nrow(iris)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## [52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
## [69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
## [86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
## [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
# Vector of row indices for the iris dataset (150 rows)
set.seed(777) # Set the seed BEFORE sampling so you get the same train/test split every time you run the code
train.index <- sample(1:nrow(iris), 0.7*nrow(iris))
# Randomly select 70% of the row indices - these rows will form the training set
iris.train <- iris[train.index,] # Create the training dataset
iris.test <- iris[-train.index,] # Create the test dataset (the rows NOT in train.index)
library(rpart) # Load the rpart package (needed for rpart())
iris.tree <- rpart(Species ~ ., data = iris.train) # Fit a decision tree model on the training set
# Species = the target (class) we are predicting
# ~ . = use all other columns as predictors. Could also write 'Species ~ Sepal.Length + Petal.Length' to use only those two columns.
library(rpart.plot) # Load the rpart.plot package (needed for rpart.plot())
rpart.plot(iris.tree) # Plot the fitted tree
summary(iris.tree) # Summary of the decision tree
## Call:
## rpart(formula = Species ~ ., data = iris.train)
## n= 105
##
## CP nsplit rel error xerror xstd
## 1 0.47761194 0 1.00000000 1.1194030 0.06909105
## 2 0.44776119 1 0.52238806 0.6268657 0.07492478
## 3 0.01492537 2 0.07462687 0.1343284 0.04281416
## 4 0.01000000 3 0.05970149 0.1343284 0.04281416
##
## Variable importance
## Petal.Width Petal.Length Sepal.Length Sepal.Width
## 34 31 21 14
##
## Node number 1: 105 observations, complexity param=0.4776119
## predicted class=virginica expected loss=0.6380952 P(node) =1
## class counts: 32 35 38
## probabilities: 0.305 0.333 0.362
## left son=2 (32 obs) right son=3 (73 obs)
## Primary splits:
## Petal.Length < 2.45 to the left, improve=33.39022, (0 missing)
## Petal.Width < 0.8 to the left, improve=33.39022, (0 missing)
## Sepal.Length < 5.45 to the left, improve=23.00753, (0 missing)
## Sepal.Width < 3.25 to the right, improve=13.24142, (0 missing)
## Surrogate splits:
## Petal.Width < 0.8 to the left, agree=1.000, adj=1.00, (0 split)
## Sepal.Length < 5.45 to the left, agree=0.924, adj=0.75, (0 split)
## Sepal.Width < 3.25 to the right, agree=0.848, adj=0.50, (0 split)
##
## Node number 2: 32 observations
## predicted class=setosa expected loss=0 P(node) =0.3047619
## class counts: 32 0 0
## probabilities: 1.000 0.000 0.000
##
## Node number 3: 73 observations, complexity param=0.4477612
## predicted class=virginica expected loss=0.4794521 P(node) =0.6952381
## class counts: 0 35 38
## probabilities: 0.000 0.479 0.521
## left son=6 (40 obs) right son=7 (33 obs)
## Primary splits:
## Petal.Width < 1.75 to the left, improve=27.688360, (0 missing)
## Petal.Length < 4.75 to the left, improve=27.281340, (0 missing)
## Sepal.Length < 6.25 to the left, improve= 9.399317, (0 missing)
## Sepal.Width < 2.95 to the left, improve= 3.748962, (0 missing)
## Surrogate splits:
## Petal.Length < 4.75 to the left, agree=0.890, adj=0.758, (0 split)
## Sepal.Length < 6.15 to the left, agree=0.767, adj=0.485, (0 split)
## Sepal.Width < 2.95 to the left, agree=0.699, adj=0.333, (0 split)
##
## Node number 6: 40 observations, complexity param=0.01492537
## predicted class=versicolor expected loss=0.125 P(node) =0.3809524
## class counts: 0 35 5
## probabilities: 0.000 0.875 0.125
## left son=12 (33 obs) right son=13 (7 obs)
## Primary splits:
## Petal.Length < 4.85 to the left, improve=3.3820350, (0 missing)
## Petal.Width < 1.45 to the left, improve=1.4880950, (0 missing)
## Sepal.Length < 5.95 to the left, improve=0.4500000, (0 missing)
## Sepal.Width < 2.65 to the right, improve=0.3434066, (0 missing)
## Surrogate splits:
## Petal.Width < 1.55 to the left, agree=0.85, adj=0.143, (0 split)
##
## Node number 7: 33 observations
## predicted class=virginica expected loss=0 P(node) =0.3142857
## class counts: 0 0 33
## probabilities: 0.000 0.000 1.000
##
## Node number 12: 33 observations
## predicted class=versicolor expected loss=0.03030303 P(node) =0.3142857
## class counts: 0 32 1
## probabilities: 0.000 0.970 0.030
##
## Node number 13: 7 observations
## predicted class=virginica expected loss=0.4285714 P(node) =0.06666667
## class counts: 0 3 4
## probabilities: 0.000 0.429 0.571
# Understanding the printed tree (without plotting):
# Node 1 (the root)
# First number = node number
# Second number = n, the total observations in the node
# Third number = loss, the number of observations NOT in the predicted class
# Name = the predicted class for the node
# Parenthesized numbers = probabilities of setosa, versicolor, virginica
# To know exactly how many of each, multiply the total by the probabilities
# * = terminal node
# Node 2
# Petal.Length< 2.45 = rule: the petal length is less than 2.45
# First number after the rule = total n in node
# (32, 0, 0) = 32 setosa, 0 versicolor, 0 virginica
# Operationalizing the model in other languages
# Common constructs: 'if/else', 'and', 'greater than/less than'
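# As an illustrative sketch (not from the original notes), the tree fitted
# above can be rewritten as plain if/else rules, using the thresholds from
# the summary output:
classify_iris <- function(Petal.Length, Petal.Width) {
  if (Petal.Length < 2.45) {
    "setosa"
  } else if (Petal.Width < 1.75) {
    if (Petal.Length < 4.85) "versicolor" else "virginica"
  } else {
    "virginica"
  }
}
classify_iris(Petal.Length = 1.4, Petal.Width = 0.2) # "setosa"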
iris.predict <- predict(iris.tree, iris.test, type="class") # Predict class labels for the test set using the fitted tree
# Predicting the species of each flower from its features
iris.predict
## 4 7 9 11 13 15
## setosa setosa setosa setosa setosa setosa
## 18 20 22 28 29 34
## setosa setosa setosa setosa setosa setosa
## 36 40 43 45 46 49
## setosa setosa setosa setosa setosa setosa
## 53 55 57 58 61 67
## virginica versicolor versicolor versicolor versicolor versicolor
## 71 75 79 82 87 88
## virginica versicolor versicolor versicolor versicolor versicolor
## 93 96 98 104 110 111
## versicolor versicolor versicolor virginica virginica virginica
## 119 122 123 131 132 143
## virginica virginica virginica virginica virginica virginica
## 144 145 149
## virginica virginica virginica
## Levels: setosa versicolor virginica
iris.predictprob <- predict(iris.tree, iris.test, type="prob") # Get class probabilities instead of labels
iris.compare <- iris.test # Copy the test set into a comparison data frame
iris.compare$Predictions <- iris.predict # Add the predicted labels as a new column
iris.compare
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 4 4.6 3.1 1.5 0.2 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 53 6.9 3.1 4.9 1.5 versicolor
## 55 6.5 2.8 4.6 1.5 versicolor
## 57 6.3 3.3 4.7 1.6 versicolor
## 58 4.9 2.4 3.3 1.0 versicolor
## 61 5.0 2.0 3.5 1.0 versicolor
## 67 5.6 3.0 4.5 1.5 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## 75 6.4 2.9 4.3 1.3 versicolor
## 79 6.0 2.9 4.5 1.5 versicolor
## 82 5.5 2.4 3.7 1.0 versicolor
## 87 6.7 3.1 4.7 1.5 versicolor
## 88 6.3 2.3 4.4 1.3 versicolor
## 93 5.8 2.6 4.0 1.2 versicolor
## 96 5.7 3.0 4.2 1.2 versicolor
## 98 6.2 2.9 4.3 1.3 versicolor
## 104 6.3 2.9 5.6 1.8 virginica
## 110 7.2 3.6 6.1 2.5 virginica
## 111 6.5 3.2 5.1 2.0 virginica
## 119 7.7 2.6 6.9 2.3 virginica
## 122 5.6 2.8 4.9 2.0 virginica
## 123 7.7 2.8 6.7 2.0 virginica
## 131 7.4 2.8 6.1 1.9 virginica
## 132 7.9 3.8 6.4 2.0 virginica
## 143 5.8 2.7 5.1 1.9 virginica
## 144 6.8 3.2 5.9 2.3 virginica
## 145 6.7 3.3 5.7 2.5 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## Predictions
## 4 setosa
## 7 setosa
## 9 setosa
## 11 setosa
## 13 setosa
## 15 setosa
## 18 setosa
## 20 setosa
## 22 setosa
## 28 setosa
## 29 setosa
## 34 setosa
## 36 setosa
## 40 setosa
## 43 setosa
## 45 setosa
## 46 setosa
## 49 setosa
## 53 virginica
## 55 versicolor
## 57 versicolor
## 58 versicolor
## 61 versicolor
## 67 versicolor
## 71 virginica
## 75 versicolor
## 79 versicolor
## 82 versicolor
## 87 versicolor
## 88 versicolor
## 93 versicolor
## 96 versicolor
## 98 versicolor
## 104 virginica
## 110 virginica
## 111 virginica
## 119 virginica
## 122 virginica
## 123 virginica
## 131 virginica
## 132 virginica
## 143 virginica
## 144 virginica
## 145 virginica
## 149 virginica
iris.compare[iris.compare$Species!=iris.compare$Predictions, ] # Keep only the rows where the prediction disagrees with the actual species
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 53 6.9 3.1 4.9 1.5 versicolor
## 71 5.9 3.2 4.8 1.8 versicolor
## Predictions
## 53 virginica
## 71 virginica
# View the misclassified rows
iris.confusion <- table(iris.predict, iris.test$Species)
# Use the predicted species and actual species to build a confusion matrix
iris.confusion
##
## iris.predict setosa versicolor virginica
## setosa 18 0 0
## versicolor 0 13 0
## virginica 0 2 12
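# As a quick sanity check (not in the original notes), overall accuracy is
# the sum of the confusion matrix diagonal over the total test count:
sum(diag(iris.confusion)) / sum(iris.confusion) # (18 + 13 + 12) / 45 = 0.956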
tree.params <- rpart.control(minsplit = 20, minbucket = 7, maxdepth = 30, cp = 0.01)
# minsplit = min number of observations that must exist in a node in order for a split to be attempted
# minbucket = min number of observations in any terminal node
# If only one of minbucket/minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3
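# For example (a hypothetical illustration of that default):
rpart.control(minbucket = 5)$minsplit # returns 15, i.e. minbucket * 3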
iris.tree2 <- rpart(Species ~ ., data = iris.train,
control=tree.params, parms=list(split="gini"))
# Fit the decision model to the training set
# Use parameters from above and Gini index for splitting
iris.tree2
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 67 virginica (0.30476190 0.33333333 0.36190476)
## 2) Petal.Length< 2.45 32 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.45 73 35 virginica (0.00000000 0.47945205 0.52054795)
## 6) Petal.Width< 1.75 40 5 versicolor (0.00000000 0.87500000 0.12500000)
## 12) Petal.Length< 4.85 33 1 versicolor (0.00000000 0.96969697 0.03030303) *
## 13) Petal.Length>=4.85 7 3 virginica (0.00000000 0.42857143 0.57142857) *
## 7) Petal.Width>=1.75 33 0 virginica (0.00000000 0.00000000 1.00000000) *
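# A variant sketch (not in the original notes): setting
# parms = list(split = "information") grows the tree with entropy /
# information gain instead of the Gini index:
iris.tree.info <- rpart(Species ~ ., data = iris.train,
                        control = tree.params, parms = list(split = "information"))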
iris.treereg <- rpart(Petal.Length ~ ., data = iris.train, method="anova")
# Use method = "anova" to fit a regression tree:
# the target (Petal.Length) is numeric rather than a class
iris.treereg
## n= 105
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 105 303.037100 3.842857
## 2) Petal.Width< 0.8 32 1.080000 1.475000 *
## 3) Petal.Width>=0.8 73 43.893150 4.880822
## 6) Species=versicolor 35 7.586857 4.245714
## 12) Sepal.Length< 5.95 19 2.746316 3.957895 *
## 13) Sepal.Length>=5.95 16 1.397500 4.587500 *
## 7) Species=virginica 38 9.185526 5.465789
## 14) Sepal.Length< 7 31 3.840000 5.300000 *
## 15) Sepal.Length>=7 7 0.720000 6.200000 *
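# As a closing sketch (not from the original notes), predictions from a
# regression tree are numeric: each test row receives the mean Petal.Length
# (the yval above) of the terminal node it falls into:
iris.predreg <- predict(iris.treereg, iris.test)
head(iris.predreg)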