Load required R packages
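The examples in these notes rely on rpart for fitting decision trees and rpart.plot for drawing them; assuming those are the packages in question:

library(rpart)      # Recursive partitioning for decision trees
library(rpart.plot) # Plotting functions for rpart trees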

Building predictive models

Always ask: how is this model going to be used?
When building models, conversations shouldn’t be solely technical.
Data equity & ethics: Just because we can build predictive models doesn’t mean we always should.

Steps for building predictive models

  1. What is the unit of prediction?
  2. What is the level of granularity of what you’re trying to predict?
  3. How is the data appropriate for the unit of prediction?
  4. What problems is the model being built to address?
  5. What are the potential features that could be used to build the model?
  6. What are some potential unintended consequences of building the model?

Supervised learning in data science

Target values known
Training data labeled with target values
Goal: find a way to map attributes to target value
Classification & regression

Unsupervised learning in data science

Target values unknown
Training data unlabeled
Goal: Discover information hidden in the data
May precede supervised learning
Clustering

Steps of machine learning

  1. Identify the problem. What kind of problem is it? What are the values or categories? What are the groupings/rankings?
  2. Define success metrics.
  3. Data & feature engineering. Data we have/don’t have.
  4. Choose the learning algorithm. Decision trees, logistic regression, SVM.
  5. Cross-validation & parameter tuning.
  6. Evaluation.
  7. Experimentation.

Classification: the prediction of categories

Two attribute types: predictors and class
Finding a model to map a predictor set to a class

Methods of classification

Train/test model: split data into training and testing
Decision tree learning: splitting on attributes and drilling down into the data to discover which attributes may predict the outcome. In this case, whether a passenger survived or not.
The order in which the drilling down happens makes an implicit assumption about which features are related to the problem.
What if the data used is not representative of the population?

The solution: greedy algorithms

Create a model that will most accurately predict with the data available.
Find a split that optimizes some criteria.
Determine which feature and what thresholds will be used for splitting.
Determine the best split.
When to stop splitting?

A different order of drilling down the same dataset as in the previous image.
The variety of possible orderings illustrates the importance of evaluating the model. How do we ensure that the order in which we drill down is the most accurate and reliable?

Types of splitting

Measuring node impurity

We want to find features that are as separate as possible.
We want to find nodes that are as pure as possible.
Gini
Gini = 1 - sum of (p_i)^2 over all classes, where p_i is the proportion of class i in the node
Maximum possible Gini: 0.5 (a two-class node split 50/50)
Whichever split gives a lower (weighted) Gini value is the better feature to split on

Entropy: a measure of uncertainty that reflects a node’s purity. How much disorder is there in the system?
Entropy = -sum of p_i * log2(p_i); it punishes skew more harshly than Gini does.
Entropy of the data minus the entropy after the split = information gain (the analogous quantity for Gini is the Gini reduction). Both measures are sketched in code below.
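A minimal sketch of both impurity measures in R, assuming p is a vector of class proportions within a node (the function names gini and entropy are illustrative, not from any package):

gini <- function(p) 1 - sum(p^2) # Gini impurity: 1 minus the sum of squared class proportions
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0])) # Entropy in bits; zero proportions are skipped

gini(c(0.5, 0.5))    # 0.5 - the two-class maximum
gini(c(0.9, 0.1))    # 0.18 - a purer node scores lower
entropy(c(0.5, 0.5)) # 1 bit - maximum uncertainty for two classes
entropy(c(0.9, 0.1)) # ~0.47 bits - skew is punished more steeply than under Gini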

Building a decision tree in R

iris[1, ]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
# Gives first row of iris and all columns

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
# Gives numbers 1 through 10

sample(1:10, 5)
## [1] 10  7  8  3  6
# Draws five random numbers from 1 to 10, without replacement

1:nrow(iris)
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102
## [103] 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119
## [120] 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
## [137] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
# Vector of row indices for the iris dataset (1 through 150)

set.seed(777) # Set the seed before sampling so the train/test split is the same every time you run the code
train.index <- sample(1:nrow(iris), 0.7*nrow(iris))
# Randomly select 70% of the row indices - these rows will form the training set

iris.train <- iris[train.index,] # Create the training dataset

iris.test <- iris[-train.index,] # Create the test dataset
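A quick sanity check on the split sizes:

nrow(iris.train) # 105 rows (70% of 150)
nrow(iris.test)  # 45 rows (the remaining 30%)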

iris.tree <- rpart(Species ~ ., data = iris.train) # Fit a decision tree model on the iris dataset
# Species = the outcome (class) being predicted
# ~ . = use all other columns as predictors. Could also write 'Species ~ Sepal.Length + Petal.Length' to use only those two columns.

Plotting the decision tree

rpart.plot(iris.tree)

summary(iris.tree) # Summary of the decision tree
## Call:
## rpart(formula = Species ~ ., data = iris.train)
##   n= 105 
## 
##           CP nsplit  rel error    xerror       xstd
## 1 0.47761194      0 1.00000000 1.1194030 0.06909105
## 2 0.44776119      1 0.52238806 0.6268657 0.07492478
## 3 0.01492537      2 0.07462687 0.1343284 0.04281416
## 4 0.01000000      3 0.05970149 0.1343284 0.04281416
## 
## Variable importance
##  Petal.Width Petal.Length Sepal.Length  Sepal.Width 
##           34           31           21           14 
## 
## Node number 1: 105 observations,    complexity param=0.4776119
##   predicted class=virginica   expected loss=0.6380952  P(node) =1
##     class counts:    32    35    38
##    probabilities: 0.305 0.333 0.362 
##   left son=2 (32 obs) right son=3 (73 obs)
##   Primary splits:
##       Petal.Length < 2.45 to the left,  improve=33.39022, (0 missing)
##       Petal.Width  < 0.8  to the left,  improve=33.39022, (0 missing)
##       Sepal.Length < 5.45 to the left,  improve=23.00753, (0 missing)
##       Sepal.Width  < 3.25 to the right, improve=13.24142, (0 missing)
##   Surrogate splits:
##       Petal.Width  < 0.8  to the left,  agree=1.000, adj=1.00, (0 split)
##       Sepal.Length < 5.45 to the left,  agree=0.924, adj=0.75, (0 split)
##       Sepal.Width  < 3.25 to the right, agree=0.848, adj=0.50, (0 split)
## 
## Node number 2: 32 observations
##   predicted class=setosa      expected loss=0  P(node) =0.3047619
##     class counts:    32     0     0
##    probabilities: 1.000 0.000 0.000 
## 
## Node number 3: 73 observations,    complexity param=0.4477612
##   predicted class=virginica   expected loss=0.4794521  P(node) =0.6952381
##     class counts:     0    35    38
##    probabilities: 0.000 0.479 0.521 
##   left son=6 (40 obs) right son=7 (33 obs)
##   Primary splits:
##       Petal.Width  < 1.75 to the left,  improve=27.688360, (0 missing)
##       Petal.Length < 4.75 to the left,  improve=27.281340, (0 missing)
##       Sepal.Length < 6.25 to the left,  improve= 9.399317, (0 missing)
##       Sepal.Width  < 2.95 to the left,  improve= 3.748962, (0 missing)
##   Surrogate splits:
##       Petal.Length < 4.75 to the left,  agree=0.890, adj=0.758, (0 split)
##       Sepal.Length < 6.15 to the left,  agree=0.767, adj=0.485, (0 split)
##       Sepal.Width  < 2.95 to the left,  agree=0.699, adj=0.333, (0 split)
## 
## Node number 6: 40 observations,    complexity param=0.01492537
##   predicted class=versicolor  expected loss=0.125  P(node) =0.3809524
##     class counts:     0    35     5
##    probabilities: 0.000 0.875 0.125 
##   left son=12 (33 obs) right son=13 (7 obs)
##   Primary splits:
##       Petal.Length < 4.85 to the left,  improve=3.3820350, (0 missing)
##       Petal.Width  < 1.45 to the left,  improve=1.4880950, (0 missing)
##       Sepal.Length < 5.95 to the left,  improve=0.4500000, (0 missing)
##       Sepal.Width  < 2.65 to the right, improve=0.3434066, (0 missing)
##   Surrogate splits:
##       Petal.Width < 1.55 to the left,  agree=0.85, adj=0.143, (0 split)
## 
## Node number 7: 33 observations
##   predicted class=virginica   expected loss=0  P(node) =0.3142857
##     class counts:     0     0    33
##    probabilities: 0.000 0.000 1.000 
## 
## Node number 12: 33 observations
##   predicted class=versicolor  expected loss=0.03030303  P(node) =0.3142857
##     class counts:     0    32     1
##    probabilities: 0.000 0.970 0.030 
## 
## Node number 13: 7 observations
##   predicted class=virginica   expected loss=0.4285714  P(node) =0.06666667
##     class counts:     0     3     4
##    probabilities: 0.000 0.429 0.571
# Understanding the printed tree (without plotting):
# Each node line shows: node number, split rule, n, loss, yval, (yprob)
  # node number = the node's position in the tree
  # n = total observations in the node
  # loss = observations in the node that the predicted class gets wrong
  # yval = predicted class for the node
  # (yprob) = probabilities of setosa, versicolor, virginica in the node
    # To get exact counts, multiply n by the probabilities
  # * = terminal node
# Node 2 (from the printed tree below):
  # Petal.Length< 2.45 = rule: the petal length is less than 2.45
  # n = 32, loss = 0
  # (1.00 0.00 0.00) = all 32 are setosa, none versicolor or virginica


# Operationalizing the model in other languages
  # The fitted tree reduces to nested if/else rules built from 'and' plus greater-than/less-than comparisons
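As an illustration, here is a hedged hand-translation of the fitted tree into nested ifelse() calls, using the split values from the printed tree (classify_iris is an illustrative name, not part of rpart):

classify_iris <- function(Petal.Length, Petal.Width) {
  ifelse(Petal.Length < 2.45, "setosa",                       # Node 2: terminal, all setosa
    ifelse(Petal.Width < 1.75,                                # Node 6 vs. node 7
      ifelse(Petal.Length < 4.85, "versicolor", "virginica"), # Nodes 12 and 13
      "virginica"))                                           # Node 7: terminal, all virginica
}

classify_iris(1.4, 0.2) # "setosa"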

Predictive analysis in R

iris.predict <- predict(iris.tree, iris.test, type="class") # Predict class labels for the test data using the fitted tree
# Predicting the species of each flower from its features.

iris.predict
##          4          7          9         11         13         15 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         18         20         22         28         29         34 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         36         40         43         45         46         49 
##     setosa     setosa     setosa     setosa     setosa     setosa 
##         53         55         57         58         61         67 
##  virginica versicolor versicolor versicolor versicolor versicolor 
##         71         75         79         82         87         88 
##  virginica versicolor versicolor versicolor versicolor versicolor 
##         93         96         98        104        110        111 
## versicolor versicolor versicolor  virginica  virginica  virginica 
##        119        122        123        131        132        143 
##  virginica  virginica  virginica  virginica  virginica  virginica 
##        144        145        149 
##  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
iris.predictprob <- predict(iris.tree, iris.test, type="prob") # Get class probabilities instead of labels.
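The probability output is a matrix with one row per test observation and one column per species (exact values depend on your split):

head(iris.predictprob) # Each row sums to 1 across setosa, versicolor, virginica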

Comparing predictions against the original dataset

iris.compare <- iris.test # Copy the test set into a new data frame for comparison.

iris.compare$Predictions <- iris.predict # Add the predicted labels as a new column alongside the actual species

iris.compare
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 4            4.6         3.1          1.5         0.2     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 53           6.9         3.1          4.9         1.5 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 104          6.3         2.9          5.6         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 149          6.2         3.4          5.4         2.3  virginica
##     Predictions
## 4        setosa
## 7        setosa
## 9        setosa
## 11       setosa
## 13       setosa
## 15       setosa
## 18       setosa
## 20       setosa
## 22       setosa
## 28       setosa
## 29       setosa
## 34       setosa
## 36       setosa
## 40       setosa
## 43       setosa
## 45       setosa
## 46       setosa
## 49       setosa
## 53    virginica
## 55   versicolor
## 57   versicolor
## 58   versicolor
## 61   versicolor
## 67   versicolor
## 71    virginica
## 75   versicolor
## 79   versicolor
## 82   versicolor
## 87   versicolor
## 88   versicolor
## 93   versicolor
## 96   versicolor
## 98   versicolor
## 104   virginica
## 110   virginica
## 111   virginica
## 119   virginica
## 122   virginica
## 123   virginica
## 131   virginica
## 132   virginica
## 143   virginica
## 144   virginica
## 145   virginica
## 149   virginica
iris.compare[iris.compare$Species!=iris.compare$Predictions, ] # Filter with a logical vector: keep rows where the actual species differs from the prediction
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 53          6.9         3.1          4.9         1.5 versicolor
## 71          5.9         3.2          4.8         1.8 versicolor
##    Predictions
## 53   virginica
## 71   virginica
# View misclassified rows

iris.confusion <- table(iris.predict, iris.test$Species) 
# Use the predicted species and actual species to build a confusion matrix

iris.confusion
##             
## iris.predict setosa versicolor virginica
##   setosa         18          0         0
##   versicolor      0         13         0
##   virginica       0          2        12
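The diagonal of the confusion matrix holds the correct predictions, so overall accuracy follows directly:

sum(diag(iris.confusion)) / sum(iris.confusion) # 43 of 45 correct, roughly 0.956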

Setting control parameters of the decision tree

tree.params <- rpart.control(minsplit = 20, minbucket = 7, maxdepth = 30, cp = 0.01)
# minsplit = min number of observations that must exist in a node in order for a split to be attempted
# minbucket = min number of observations in any terminal node
  # If only one of minbucket/minsplit is specified, the code either sets minsplit to minbucket*3 or minbucket to minsplit/3
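# maxdepth = maximum depth of any node in the final tree (the root counts as depth 0)
# cp = complexity parameter: a split must improve the overall fit by a factor of cp to be attempted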


iris.tree2 <- rpart(Species ~ ., data = iris.train, 
                       control=tree.params, parms=list(split="gini"))
# Fit the decision tree model to the training set
# Use the control parameters from above and the Gini index as the splitting criterion

iris.tree2
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 105 67 virginica (0.30476190 0.33333333 0.36190476)  
##    2) Petal.Length< 2.45 32  0 setosa (1.00000000 0.00000000 0.00000000) *
##    3) Petal.Length>=2.45 73 35 virginica (0.00000000 0.47945205 0.52054795)  
##      6) Petal.Width< 1.75 40  5 versicolor (0.00000000 0.87500000 0.12500000)  
##       12) Petal.Length< 4.85 33  1 versicolor (0.00000000 0.96969697 0.03030303) *
##       13) Petal.Length>=4.85 7  3 virginica (0.00000000 0.42857143 0.57142857) *
##      7) Petal.Width>=1.75 33  0 virginica (0.00000000 0.00000000 1.00000000) *

Building a regression decision tree

iris.treereg <- rpart(Petal.Length ~ ., data = iris.train, method="anova")
# method = "anova" tells rpart to fit a regression tree
# (predicting a numeric outcome instead of a class)

iris.treereg
## n= 105 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 105 303.037100 3.842857  
##    2) Petal.Width< 0.8 32   1.080000 1.475000 *
##    3) Petal.Width>=0.8 73  43.893150 4.880822  
##      6) Species=versicolor 35   7.586857 4.245714  
##       12) Sepal.Length< 5.95 19   2.746316 3.957895 *
##       13) Sepal.Length>=5.95 16   1.397500 4.587500 *
##      7) Species=virginica 38   9.185526 5.465789  
##       14) Sepal.Length< 7 31   3.840000 5.300000 *
##       15) Sepal.Length>=7 7   0.720000 6.200000 *
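To evaluate the regression tree, one hedged sketch: predict Petal.Length for the test set and compute the root mean squared error (RMSE is a common metric, though the notes above stop at the fitted tree):

iris.predreg <- predict(iris.treereg, iris.test) # Numeric predictions of Petal.Length
sqrt(mean((iris.predreg - iris.test$Petal.Length)^2)) # Root mean squared error on the test set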