This case study is a typical classification problem which we will solve with the help of the randomForest package.

library(MASS)

There is a data set named ‘birthwt’ contained in the MASS package. We will make use of this built-in data set. Let’s store it in an R object.

b=birthwt
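
Before exploring further, base R's dim() gives a quick programmatic check of the size of the data (a small addition to the original walkthrough):

dim(b)   # number of rows (observations) and columns (variables)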

Let’s have a look at our data frame in the global environment. It has 189 observations (rows) and 10 variables (columns). In order to understand what this data set is all about we can use the help function.

help(birthwt)
## starting httpd help server ... done

This data set is about Risk Factors Associated with Low Infant Birth Weight. It contains the following columns:

low: indicator of birth weight less than 2.5 kg (this is our target variable).

age: mother’s age in years.

lwt: mother’s weight in pounds at last menstrual period.

race: mother’s race (1 = white, 2 = black, 3 = other).

smoke: smoking status during pregnancy.

ptl: number of previous premature labours.

ht: history of hypertension.

ui: presence of uterine irritability.

ftv: number of physician visits during the first trimester.

bwt: birth weight in grams.

We can also see that a few columns are numeric while others, such as race, smoke, ptl, ht, ui and ftv, are categorical. low is the target variable which we want to predict.
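
It is also worth checking how balanced the target variable is before modelling (a quick extra check, not part of the original walkthrough; the counts in the comment follow from the mean of low in the summary further below):

table(b$low)   # about 130 babies of normal weight (0) and 59 of low birth weight (1)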

Let us check the first 3 rows.

head(b,3)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557

We can also check the summary, as shown below, which gives statistics such as the minimum, maximum, 1st quartile, 3rd quartile, mean and median for each column.

summary(b)
##       low              age             lwt             race      
##  Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000  
##  Median :0.0000   Median :23.00   Median :121.0   Median :1.000  
##  Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847  
##  3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000  
##      smoke             ptl               ht                ui        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.3915   Mean   :0.1958   Mean   :0.06349   Mean   :0.1481  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :3.0000   Max.   :1.00000   Max.   :1.0000  
##       ftv              bwt      
##  Min.   :0.0000   Min.   : 709  
##  1st Qu.:0.0000   1st Qu.:2414  
##  Median :0.0000   Median :2977  
##  Mean   :0.7937   Mean   :2945  
##  3rd Qu.:1.0000   3rd Qu.:3487  
##  Max.   :6.0000   Max.   :4990

Let us find the number of unique values in each column of our data set. We will use the apply function.

apply(b,2,function(x) length(unique(x)))
##   low   age   lwt  race smoke   ptl    ht    ui   ftv   bwt 
##     2    24    75     3     2     4     2     2     6   131

It looks like the variables low, race, smoke, ptl, ht, ui & ftv are categorical in nature as they contain few unique values, whereas continuous variables such as age, lwt and bwt have many more unique values.

If we want to see how the age variable is distributed we can plot a histogram. From the histogram we can see that age ranges roughly from 14 to 45, and that most of the mothers are between 20 and 25 years old.

hist(b$age)
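
For a more readable plot we can pass a few optional cosmetic arguments to hist() (purely presentational; the data are the same):

hist(b$age,
     main = "Distribution of mother's age",
     xlab = "Age in years",
     col  = "lightblue")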

We need to convert our categorical variables into factors using the function as.factor. If we don’t, they will be treated as numeric data and randomForest will handle them differently (as ordered numbers rather than categories). A simple for loop converts them all together.

cols=c("low","race","smoke","ptl","ht","ui","ftv")
for (i in cols){
  b[,i]=as.factor(b[,i])}
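
The same conversion can also be written without an explicit loop; this is just an equivalent, slightly more compact alternative using lapply():

b[cols] = lapply(b[cols], as.factor)   # convert all listed columns to factors in one step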

Let us check whether they have been converted to factors. We can see that using the str function.

str(b)
## 'data.frame':    189 obs. of  10 variables:
##  $ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
##  $ ptl  : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
##  $ ftv  : Factor w/ 6 levels "0","1","2","3",..: 1 4 2 3 1 1 2 2 2 1 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

From the above it is clear that race now has 3 levels (1, 2, 3), smoke has 2 (0 & 1), ptl has 4 (0, 1, 2, 3) and so on.

Let’s now divide the data set into training & test sets. The caTools package has a built-in function sample.split. We will split so that the training data has 70% of the rows and the test data has the remaining 30%.

library(caTools)
ind=sample.split(Y=b$low,SplitRatio = 0.7)
train=b[ind,]
test=b[!ind,]
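
Note that sample.split() draws a random sample, so the exact split will differ between runs; calling set.seed() before it would make the split reproducible. Because the split is stratified on the variable passed as Y, the proportion of low-birth-weight cases should be similar in both sets, which we can verify with a quick optional check:

# set.seed(42)   # if placed before sample.split(), this would make the split reproducible
prop.table(table(train$low))   # share of 0s and 1s in the training data
prop.table(table(test$low))    # should be roughly the same in the test data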

In a decision tree classifier we make splits on the predictor variables in order to separate the classes of the target variable. The tree treats numerical & categorical data differently: for numerical variables it sorts the values and searches for the best cut point, while for categorical variables it finds the best combination of levels. A good decision tree has a low error rate, which means it is predicting correctly.
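
To make this concrete, fitting a single classification tree could look like the sketch below. It uses the rpart package, which is not part of this walkthrough, purely as an illustration:

library(rpart)
single_tree = rpart(low ~ ., data = train, method = "class")   # one classification tree
printcp(single_tree)   # complexity table: error of the tree at each number of splits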

Let’s move on to the randomForest parameters. mtry is the number of variables randomly selected as split candidates at each split. Say we consider 3 variables at a time and each tree is grown on a sample of 50 rows: then mtry is 3 and the sample size is 50. With mtry = 3, each tree is fitted on its own random sample of rows and, at every split, randomly considers 3 of the 9 predictor variables.

A lower mtry means we consider fewer variables at each split. Say we grow 1000 trees: for the first tree we draw a sample of rows and, at every split, pick the best of 3 randomly chosen variables; for the second tree we draw another sample and again choose among 3 random variables at each split, and so on 1000 times. We then predict on the test data: each tree gives its own prediction (0 or 1) and the forest takes the majority vote of all these opinions.

So a lower mtry means less correlation between the individual trees. If we set mtry equal to the number of predictor variables (9 in this case), then every one of the 1000 trees considers the same candidate variables at each split and the trees differ only through their bootstrap samples, so they add little diversity. We should therefore choose mtry wisely, and it should usually be less than the number of predictors; otherwise the trees will not learn anything new from each other.
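
As a rough illustration of this trade-off, we could compare the OOB error of a forest grown with a small mtry against one where mtry equals the number of predictors (this comparison is an addition, not part of the original analysis):

library(randomForest)
rf_small = randomForest(low ~ ., data = train, mtry = 2, ntree = 500)   # trees consider few variables, low correlation
rf_full  = randomForest(low ~ ., data = train, mtry = 9, ntree = 500)   # every split considers all 9 predictors (bagging)
rf_small$err.rate[500, "OOB"]   # OOB error with mtry = 2
rf_full$err.rate[500, "OOB"]    # OOB error with mtry = 9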

ntree is the number of trees to grow. nodesize is the minimum size of the terminal nodes: a larger value produces smaller trees, which means fewer splits are made in each individual decision tree.
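
These arguments are simply passed to randomForest(); for example, a larger and more conservative forest might be specified like this (illustrative values only):

model_big = randomForest(low ~ ., data = train,
                         mtry = 3, ntree = 500, nodesize = 5)   # 500 trees, at least 5 observations per terminal node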

Now let us fit the model using randomForest, growing 20 trees with mtry = 3.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
model_rf=randomForest(low~.,data=train,mtry=3,ntree=20)

Let us see what this model contains.

model_rf
## 
## Call:
##  randomForest(formula = low ~ ., data = train, mtry = 3, ntree = 20) 
##                Type of random forest: classification
##                      Number of trees: 20
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 0.76%
## Confusion matrix:
##    0  1 class.error
## 0 90  1  0.01098901
## 1  0 41  0.00000000

Our OOB (out-of-bag) error, i.e. the misclassification rate, is quite low at 0.76%, which is good. Each tree is built on a bootstrap sample that leaves out roughly one-third of the observations; those left-out (out-of-bag) observations are then used to estimate the misclassification rate of that tree, and the OOB error aggregates these estimates across all the trees.
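
The OOB error can also be tracked as trees are added: randomForest stores it per tree in the err.rate matrix, and the package's plot method draws it directly (an optional diagnostic, not part of the original output):

plot(model_rf)                        # OOB and per-class error rates versus number of trees
model_rf$err.rate[model_rf$ntree, ]   # error rates after the final (20th) tree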

Let’s check which of these 9 predictor variables are the important ones in the random forest. When splitting a tree we use the Gini index (or entropy) to decide on the split, and variable importance here is measured by the mean decrease in the Gini index.

importance(model_rf)
##       MeanDecreaseGini
## age          2.0422413
## lwt          3.3288957
## race         0.9620755
## smoke        0.5865722
## ptl          1.4794198
## ht           0.6230500
## ui           0.6216476
## ftv          2.2033156
## bwt         44.6141458
varImpPlot(model_rf)

The mean decrease in the Gini index for bwt is around 44.6, far higher than for any other variable. The higher the mean decrease in Gini for a variable, the more useful that variable is for prediction, so bwt (birth weight in grams) is by far the most important variable. This is not surprising, since the target low is simply an indicator of bwt being below 2.5 kg.

Now let’s make the predictions on the test data.

predictions_with_class=predict(model_rf,newdata = test,type="class")

We will get binary predictions either 1 or 0.

predictions_with_class
##  89  91  94  95  97  98  99 100 101 109 113 114 116 121 126 129 138 144 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## 151 155 162 166 167 169 182 183 188 190 192 193 200 201 205 207 211 213 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## 214 217 218   4  10  15  25  26  27  29  30  32  44  45  57  62  75  78 
##   0   0   0   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  81  82  83 
##   1   1   1 
## Levels: 0 1

Now let’s build a table of predicted vs. actual values (the confusion matrix).

t=table(Predictions=predictions_with_class,Actual=test$low)

To have a look at our table

t
##            Actual
## Predictions  0  1
##           0 39  0
##           1  0 18

We can see from the above that all 39 actual 0s are predicted as 0 and all 18 actual 1s are predicted as 1, so every test observation is classified correctly.
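
From the same table we can also read off sensitivity and specificity (an extra check; with this particular table both come out as 1):

sens = t["1", "1"] / sum(t[, "1"])   # correctly predicted 1s out of all actual 1s
spec = t["0", "0"] / sum(t[, "0"])   # correctly predicted 0s out of all actual 0s
c(sensitivity = sens, specificity = spec)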

We can check the accuracy metric as shown below

sum(diag(t))/sum(t)
## [1] 1

Our accuracy is 1, i.e. 100%, on the test data set.

Plotting ROC Curve & calculating AUC metric.

Next let us check the AUC metric. For that we first need to load the pROC package.

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
prediction_with_prob=predict(model_rf,newdata = test,type="prob")

The predicted class probabilities look like this:

prediction_with_prob
##        0    1
## 89  0.90 0.10
## 91  0.90 0.10
## 94  1.00 0.00
## 95  0.95 0.05
## 97  1.00 0.00
## 98  0.90 0.10
## 99  0.80 0.20
## 100 0.95 0.05
## 101 0.95 0.05
## 109 0.95 0.05
## 113 1.00 0.00
## 114 1.00 0.00
## 116 1.00 0.00
## 121 1.00 0.00
## 126 0.95 0.05
## 129 0.90 0.10
## 138 0.90 0.10
## 144 1.00 0.00
## 151 1.00 0.00
## 155 0.85 0.15
## 162 0.90 0.10
## 166 1.00 0.00
## 167 1.00 0.00
## 169 1.00 0.00
## 182 1.00 0.00
## 183 1.00 0.00
## 188 0.90 0.10
## 190 1.00 0.00
## 192 0.95 0.05
## 193 0.95 0.05
## 200 0.95 0.05
## 201 0.95 0.05
## 205 0.95 0.05
## 207 0.95 0.05
## 211 1.00 0.00
## 213 1.00 0.00
## 214 1.00 0.00
## 217 1.00 0.00
## 218 1.00 0.00
## 4   0.05 0.95
## 10  0.20 0.80
## 15  0.05 0.95
## 25  0.10 0.90
## 26  0.10 0.90
## 27  0.20 0.80
## 29  0.10 0.90
## 30  0.10 0.90
## 32  0.15 0.85
## 44  0.10 0.90
## 45  0.10 0.90
## 57  0.10 0.90
## 62  0.15 0.85
## 75  0.05 0.95
## 78  0.05 0.95
## 81  0.35 0.65
## 82  0.30 0.70
## 83  0.25 0.75
## attr(,"class")
## [1] "matrix" "votes"

For AUC

auc=auc(test$low,prediction_with_prob[,2])

Our AUC comes out to be 1, which is very good. It tells us that our model separates the two classes perfectly.

auc
## Area under the curve: 1

We can also plot the ROC curve like this

plot(roc(test$low,prediction_with_prob[,2]))
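
pROC's plot method can also print the AUC value on the curve via an optional argument:

plot(roc(test$low, prediction_with_prob[,2]), print.auc = TRUE)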

In order to find the best mtry we will use the function tuneRF from the randomForest package. Here ntreeTry = 200 trees are grown at each step, mtry is inflated or deflated by a factor of 1.2 (stepFactor) at each iteration, and the search continues only while the relative improvement in OOB error is at least 0.01 (improve); otherwise it stops.

bestmtry=tuneRF(x=train[,-1],y=train$low,ntreeTry = 200,stepFactor = 1.2,improve = 0.01,trace=TRUE,plot=TRUE)   # x holds only the predictors, so the response column low is dropped
## mtry = 3  OOB error = 0% 
## Searching left ...
## Searching right ...

So our best mtry is 3, where the OOB error is zero. Note, though, that with a starting mtry of 3 and a stepFactor of only 1.2, rounding brings both search directions back to 3, so no other mtry value is actually tried here (as the single-row result below shows).

bestmtry
##       mtry OOBError
## 3.OOB    3        0
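
Once the best mtry has been identified we could refit the forest with it; here it coincides with the mtry = 3 already used, so this final step is shown only for completeness:

best_m   = bestmtry[which.min(bestmtry[, "OOBError"]), "mtry"]   # mtry with the lowest OOB error
final_rf = randomForest(low ~ ., data = train, mtry = best_m, ntree = 200)
final_rf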