Abstract
Information on water points is available on Kenya’s open data portal. A water point can be functional, not functional or something in between (dodgy). A classifier was built that labels the functional status of water points. The accuracy of this classifier is 92 percent and its overall (macro-averaged) precision is 91 percent.
Kenya provides data on water points on its open data platform. The dataset considered here contains information on water points in the counties Kisumu, Busia, Kajiado and Kiambu. There are 4402 water points and 59 features (such as number of households served and consumption per day).
The goal is to predict the functional status of the water point. The working order (“functional status”) is summarized in the table below. The status categories have been simplified, because the original categories for working order are ambiguous. For example, a system that works counts as functional whether it is in use (“Functional ( in use)”) or not (“Functional ( not in use)”).
| functional_status_original | functional_status_new |
|---|---|
| Non- Functional | nonfunc |
| Functional ( in use) | func |
| Functional ( not in use) | func |
| Functional (but in bad state) | dodgy |
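The recoding in the table above can be expressed as a simple lookup. The following is a minimal sketch in Python (the original analysis appears to have been done in R); the names `STATUS_MAP` and `simplify_status` are hypothetical and chosen for illustration only.

```python
# Mapping from the original status labels to the simplified ones,
# transcribed from the table above (spacing kept as in the source data).
STATUS_MAP = {
    "Non- Functional": "nonfunc",
    "Functional ( in use)": "func",
    "Functional ( not in use)": "func",
    "Functional (but in bad state)": "dodgy",
}


def simplify_status(original: str) -> str:
    """Return the simplified label; raise KeyError for an unexpected value."""
    return STATUS_MAP[original.strip()]
```

Raising an error on an unknown label, rather than silently passing it through, makes new spellings in the source data visible early.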
The dataset required extensive preparation.
Most of the text fields do not seem to be value restricted. In other words, the fields contain multiple categories that mean the same thing but have different names, for example due to spelling errors.
Consider for example the following table for the variable “nature of water payment” against “functional status”:
## functional.status
## nature.of.water.payment dodgy func.use non.func
## ..missing.. 0 11 0
## 10 0 2 0
## 10 per month 0 26 0
## 10 per month per hh 0 5 1
## 10 pr month 0 2 0
## 10/- monthly/hh 0 2 0
## 1000 flat rate 0 2 0
## 10@month 0 42 1
## 10month 0 2 0
## 20 0 7 0
## 20 per month 6 98 8
## 20 per month per hh 0 5 1
## 20@month 0 25 0
## 30 per month 0 12 2
## 30 per month per hh 0 6 0
## 300 per month 0 0 4
## 50 per month per hh 0 1 5
## after system breakdown 2 148 24
## farming activities 0 4 0
## metered (ksh/m3) 0 209 15
## none 327 2006 277
## per animal (ksh) 0 99 2
## per jerrican 20l (ksh) 15 880 42
## sh 20 per month 0 6 0
## other 2 58 10
As can be seen in the table, there are about 15 different ways to express that water payment is monthly.
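One way to collapse these spellings is a regular expression that recognises an amount followed by some variant of "month". The sketch below is an assumption about how such a clean-up could look, not the preparation actually used; the pattern `MONTHLY` and the function `normalise_payment` are hypothetical names. Bare amounts such as `10` or `20` are left untouched because the table does not say what period they refer to.

```python
import re

# Hypothetical pattern: optional "sh" prefix, an amount, an optional "/-",
# an optional "per", "pr" or "@", then "month" or "monthly".
MONTHLY = re.compile(r"^(?:sh\s*)?(\d+)\s*(?:/-)?\s*(?:per|pr|@)?\s*month(?:ly)?\b")


def normalise_payment(raw: str):
    """Return ("monthly", amount) for monthly fees, else (cleaned text, None)."""
    text = raw.strip().lower()
    match = MONTHLY.match(text)
    if match:
        return ("monthly", int(match.group(1)))
    return (text, None)
```

Note that this collapses "per month" and "per month per hh" into the same category; whether that is appropriate depends on how the field is used later.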
It is possible to calculate the correlation between numerical variables. Likewise, it is possible to calculate the strength of association between categorical variables.
These association strengths were calculated and plotted below.
To illustrate, here is the table between one of the variable pairs:
## livestock.is.availability.sufficient.for
## households.is.availability.sufficient.for no yes ..missing..
## no 1636 40 0
## yes 1349 1310 0
## ..missing.. 0 0 67
As you can see, there is an association between these two variables. If there is enough water for households, there may or may not be enough water for livestock (1310 versus 1349 water points). But if there is insufficient water for households, there is almost never enough for livestock (40 of 1676 water points).
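The text does not say which association measure was used. As an illustration only, the sketch below computes Cramér's V (an assumption, one common choice for categorical pairs) and the row-wise shares for the table above, leaving out the `..missing..` row and column. The function names are hypothetical.

```python
import math

# Counts transcribed from the table above, without the "..missing.." cells.
# Rows: households sufficient (no, yes); columns: livestock sufficient (no, yes).
TABLE = [
    [1636, 40],
    [1349, 1310],
]


def cramers_v(table):
    """Cramér's V from a contingency table of counts (no empty rows or columns)."""
    n_rows, n_cols = len(table), len(table[0])
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(row[j] for row in table) for j in range(n_cols)]
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return math.sqrt(chi2 / (total * (min(n_rows, n_cols) - 1)))


def row_shares(table):
    """Share of each column within each row."""
    return [[cell / sum(row) for cell in row] for row in table]


if __name__ == "__main__":
    print(round(cramers_v(TABLE), 2))
    print(row_shares(TABLE))
```

For these counts V comes out at roughly 0.49, a moderate association; the row shares show where it comes from, since livestock water is sufficient at only about 2 percent of the points where household water is not.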
There are 58 features (e.g. consumption per household), but not all of them are important. The results of the feature selection are shown in the plot below [^Feature selection was done with the Boruta package. This package uses a randomForest classifier].
This plot can be interpreted as follows:
The functional status of the water points can now be classified.
The dataset was split into a training set (80%) and a test set. A randomForest classifier [^Recursive partitioning, Breiman, 2001] was run on the training set. The randomForest approach is to sample cases (rows) and variables (columns) simultaneously and repeatedly. This has the following advantages:
The importance plot is shown below. Note that the most important variables found by the trained classifier are not the same as those found by the feature selection engine.
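The train/test split and the row-and-column sampling described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code used for the analysis: the helper names and the seed are hypothetical, and the subset size of about the square root of the number of variables is the usual default for classification forests, not a setting reported in the text.

```python
import math
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it into a training and a test part."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def bootstrap_rows(n_rows, rng):
    """Row sampling: draw row indices with replacement, once per tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def sample_columns(columns, rng):
    """Column sampling: draw about sqrt(p) candidate variables without replacement."""
    k = max(1, int(math.sqrt(len(columns))))
    return rng.sample(columns, k)
```

In a random forest the row sample is drawn once per tree, while a fresh set of candidate variables is drawn at every split within that tree.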
The classifier was used on the test dataset (preds) and the results compared with the actual functional status (trues). This is shown in the table below.
## $confusion.matrix
## preds
## trues dodgy func.use non.func
## dodgy 69 37 0
## func.use 6 1039 9
## non.func 1 50 69
##
## $per.class
## dodgy func.use non.func
## precision 0.91 0.92 0.88
## recall 0.65 0.99 0.57
## f1 0.76 0.95 0.70
## prevalence 0.08 0.82 0.09
##
## $macro.metrics
## value
## macroPrecision 0.91
## macroRecall 0.74
## macroF1 0.80
## accuracy 0.92
## expAccuracy 0.73
## kappa 0.70
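The per-class and macro metrics in the output above follow directly from the confusion matrix. The sketch below recomputes them in plain Python; the function name `summarise` and the dictionary keys are hypothetical, and the code assumes every class is predicted at least once (otherwise the precision would involve a division by zero).

```python
LABELS = ["dodgy", "func.use", "non.func"]

# Rows are the actual classes (trues), columns the predictions (preds),
# transcribed from the confusion matrix above.
CONFUSION = [
    [69, 37, 0],
    [6, 1039, 9],
    [1, 50, 69],
]


def summarise(matrix):
    """Per-class and macro metrics from a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]  # actual cases per class
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # predicted per class
    precision = [matrix[k][k] / col_sums[k] for k in range(n)]
    recall = [matrix[k][k] / row_sums[k] for k in range(n)]
    f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    # Accuracy expected by chance, given the row and column totals.
    expected = sum(row_sums[k] * col_sums[k] for k in range(n)) / total ** 2
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "prevalence": [r / total for r in row_sums],
        "macro_precision": sum(precision) / n,
        "macro_recall": sum(recall) / n,
        "macro_f1": sum(f1) / n,
        "accuracy": accuracy,
        "expected_accuracy": expected,
        "kappa": (accuracy - expected) / (1 - expected),
    }


if __name__ == "__main__":
    for name, value in summarise(CONFUSION).items():
        print(name, value)
```

Kappa corrects the accuracy for the agreement expected by chance, which matters here because `func.use` makes up most of the cases.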
These results can be interpreted as follows.
Some of the accuracy of the randomForest approach can be sacrificed in order to get results that are easier to interpret.
One of the classifiers that can be used is JRip [^Repeated Incremental Pruning to Produce Error Reduction, Cohen, 1995].
## JRIP rules:
## ===========
##
## (quantitative.in.field.assessment = unsafe) and (specify.none.payment = yes) and (funded.by = community) => functional.status=dodgy (78.0/3.0)
## (quantitative.in.field.assessment = unsafe) and (county = busia) and (livestock.is.availability.sufficient.for = no) and (specify.none.payment = yes) => functional.status=dodgy (24.0/6.0)
## (color = coloured) and (wp.maintainance = unknown) and (year.constructed <= 1992) => functional.status=dodgy (86.0/33.0)
## (no.of.hh.served.day <= 0) and (wp.maintainance = unknown) => functional.status=non.func (55.0/2.0)
## (quantitative.in.field.assessment = not tested) and (households.is.availability.sufficient.for = no) and (county = busia) and (altitude >= 1195) => functional.status=non.func (58.0/10.0)
## (quantitative.in.field.assessment = not tested) and (households.is.availability.sufficient.for = no) and (county = busia) and (no.of.hh.served.day >= 88) => functional.status=non.func (15.0/3.0)
## (quantitative.in.field.assessment = not tested) and (water.consumption.per.day.in.dry.season >= 100) and (water.consumption.per.day.in.dry.season <= 120) and (year.constructed <= 1993) and (extraction = hand.manual) and (livestock.is.availability.sufficient.for = no) => functional.status=non.func (23.0/3.0)
## (no.of.hh.served.day <= 0) and (reliability = ..missing..) => functional.status=non.func (9.0/0.0)
## (no.of.hh.served.day <= 1) and (cost.recovery.mechanism = no) and (energy = electrical) and (year.constructed <= 2009) => functional.status=non.func (8.0/1.0)
## => functional.status=func.use (2766.0/209.0)
##
## Number of Rules : 10
As you can see, the classifier was able to generate 10 easy-to-understand rules for classifying the functional status of water points. The numbers in brackets after each rule stand for “number of cases covered” and “number of incorrectly classified cases” respectively. For example, => functional.status=dodgy (78.0/3.0) means that the rule predicts “dodgy” for functional status; it covers 78 water points, and incorrectly classifies 3 of them.
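A rule set of this kind is applied as an ordered list: the first rule whose conditions all hold decides the class, and the last rule without conditions is the default. The sketch below shows that mechanism in Python using only three of the ten rules above (the first, fourth and eighth), so its predictions will differ from the full rule set. The names `RULES`, `DEFAULT` and `classify` are hypothetical, and a water point is assumed to be a dictionary keyed by the variable names in the output.

```python
# Each entry is (condition, predicted class). Order matters: first match wins.
# Only a subset of the rules above is transcribed here, for illustration.
RULES = [
    (
        lambda wp: wp.get("quantitative.in.field.assessment") == "unsafe"
        and wp.get("specify.none.payment") == "yes"
        and wp.get("funded.by") == "community",
        "dodgy",
    ),
    (
        lambda wp: wp.get("no.of.hh.served.day", float("inf")) <= 0
        and wp.get("wp.maintainance") == "unknown",
        "non.func",
    ),
    (
        lambda wp: wp.get("no.of.hh.served.day", float("inf")) <= 0
        and wp.get("reliability") == "..missing..",
        "non.func",
    ),
]

# The final rule in the output has no conditions and acts as the default.
DEFAULT = "func.use"


def classify(water_point):
    """Return the class of the first matching rule, or the default."""
    for condition, label in RULES:
        if condition(water_point):
            return label
    return DEFAULT


def rule_error_rate(covered, errors):
    """Share of covered cases a rule gets wrong, e.g. (78.0/3.0) gives 3/78."""
    return errors / covered
```

For example, the first rule's `(78.0/3.0)` corresponds to an error rate of 3/78, or just under 4 percent of the cases it covers.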
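The classifier does not reach the accuracy of the randomForest classifier, as shown in the table below: its accuracy is 0.90 and its kappa 0.61, compared with 0.92 and 0.70 for randomForest. Note that this table covers 3122 water points, the same number as the rules above cover, whereas the randomForest table covers 1280.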
## $confusion.matrix
## preds
## trues dodgy func.use non.func
## dodgy 139 103 4
## func.use 51 2510 43
## non.func 7 111 154
##
## $per.class
## dodgy func.use non.func
## precision 0.71 0.92 0.77
## recall 0.57 0.96 0.57
## f1 0.63 0.94 0.65
## prevalence 0.08 0.83 0.09
##
## $macro.metrics
## value
## macroPrecision 0.80
## macroRecall 0.70
## macroF1 0.74
## accuracy 0.90
## expAccuracy 0.74
## kappa 0.61