1 Predict functional status of water points

1.1 Data source

Kenya provides data on water points on its open data platform. The dataset considered here contains information on water points in the counties Kisumu, Busia, Kajiado and Kiambu. There are 4402 water points and 59 features (such as number of households served, consumption per day, etc.).

1.2 Purpose of the analysis

The goal is to predict the functional status of the water point. The working order (“functional status”) is summarized in the table below. The status has been modified and simplified, as the defined categories for working order are ambiguous. For example, a system that works is treated as functional regardless of whether it is in use (“Functional ( in use)”) or not (“Functional ( not in use)”).

functional_status_original       functional_status_new
Non- Functional                  nonfunc
Functional ( in use)             func
Functional ( not in use)         func
Functional (but in bad state)    dodgy
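
The recoding in the table can be written as a simple lookup. The following is a minimal Python sketch, not the code used in the original analysis: the names `STATUS_MAP` and `simplify_status` are illustrative, and the spacing clean-up is an assumption made because the original labels are spaced irregularly.

```python
import re

# Original label -> simplified label, as listed in the table above.
# Keys are stored in a normalized form (lower case, regular spacing).
STATUS_MAP = {
    "non-functional": "nonfunc",
    "functional (in use)": "func",
    "functional (not in use)": "func",
    "functional (but in bad state)": "dodgy",
}


def _normalize(label: str) -> str:
    """Lower-case a label and repair irregular spacing (assumed clean-up)."""
    label = label.strip().lower()
    label = re.sub(r"\s+", " ", label)      # collapse runs of whitespace
    label = re.sub(r"\(\s+", "(", label)    # "( in use)" -> "(in use)"
    label = re.sub(r"\s*-\s*", "-", label)  # "non- functional" -> "non-functional"
    return label


def simplify_status(label: str) -> str:
    """Return the simplified status; raises KeyError for an unknown label."""
    return STATUS_MAP[_normalize(label)]
```

Raising an error on unknown labels, rather than silently passing them through, makes new spelling variants visible as soon as they appear.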

2 Methodology

2.1 Data cleaning

The dataset required extensive preparation.

Most of the text fields do not seem to be value restricted. In other words, the fields contain multiple categories that mean the same thing but have different names, for example due to spelling errors.

Consider for example the following table for the variable “nature of water payment” against “functional status”:

##                         functional.status
## nature.of.water.payment  dodgy func.use non.func
##   ..missing..                0       11        0
##   10                         0        2        0
##   10 per month               0       26        0
##   10 per month per hh        0        5        1
##   10 pr month                0        2        0
##   10/- monthly/hh            0        2        0
##   1000 flat rate             0        2        0
##   10@month                   0       42        1
##   10month                    0        2        0
##   20                         0        7        0
##   20 per month               6       98        8
##   20 per month per hh        0        5        1
##   20@month                   0       25        0
##   30 per month               0       12        2
##   30 per month per hh        0        6        0
##   300 per month              0        0        4
##   50 per month per hh        0        1        5
##   after system breakdown     2      148       24
##   farming activities         0        4        0
##   metered (ksh/m3)           0      209       15
##   none                     327     2006      277
##   per animal (ksh)           0       99        2
##   per jerrican 20l (ksh)    15      880       42
##   sh 20 per month            0        6        0
##   other                      2       58       10

As can be seen in the table, there are about 15 different ways to express that water payment is monthly.
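
One way to collapse such variants is a small normalization function. The sketch below is illustrative only and is not the cleaning code used for this report. The function name `normalize_payment`, the target label `monthly`, and in particular the decision to treat a bare amount such as `10` or `20` as a monthly payment are assumptions.

```python
import re

# An optional "sh" prefix, an amount, and some spelling of "month", e.g.
# "10 per month", "10 pr month", "10@month", "10month", "10/- monthly/hh".
MONTHLY = re.compile(
    r"^(?:sh\s*)?(\d+)\s*(?:/-)?\s*(?:per|pr|@)?\s*month", re.IGNORECASE
)
# A bare amount such as "10" or "20" (ASSUMED to mean a monthly payment).
BARE_AMOUNT = re.compile(r"^\d+$")


def normalize_payment(raw: str) -> str:
    """Map the spelling variants of a monthly payment to one category."""
    value = raw.strip().lower()
    if MONTHLY.match(value) or BARE_AMOUNT.match(value):
        return "monthly"
    return value  # leave all other categories untouched


variants = ["10 per month", "10 pr month", "10@month", "10month",
            "10/- monthly/hh", "sh 20 per month", "50 per month per hh"]
cleaned = {normalize_payment(v) for v in variants}
```

Categories such as `1000 flat rate` or `per jerrican 20l (ksh)` do not match either pattern and are returned unchanged.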

2.2 Relation between variables

It is possible to calculate the correlation between numerical variables. Likewise, it is possible to calculate the strength of association between categorical variables.

These association strengths were calculated and plotted below.

To illustrate, here is the table between one of the variable pairs:

##                                          livestock.is.availability.sufficient.for
## households.is.availability.sufficient.for   no  yes ..missing..
##                               no          1636   40           0
##                               yes         1349 1310           0
##                               ..missing..    0    0          67

As you can see, there is an association between these two variables. If there is enough water for households, there may or may not be enough water for livestock (1310 versus 1349 water points). But if there is insufficient water for households, there is almost never enough for livestock (40 out of 1676 water points).
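
The report does not state which association measure was used. As an illustration, the sketch below computes Cramér's V, one common choice for two categorical variables, from the counts in the table above. The choice of measure, the function name `cramers_v`, and the exclusion of the `..missing..` row and column are all assumptions.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Rows: households (no, yes); columns: livestock (no, yes).
# The "..missing.." row and column are left out (an assumption).
households_vs_livestock = [[1636, 40], [1349, 1310]]
v = cramers_v(households_vs_livestock)
```

For these counts V comes out at roughly 0.49, where 0 means no association and 1 means that one variable fully determines the other.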

2.3 Feature selection

There are 58 features (such as consumption per household), but not all of them are important. The results of the feature selection are shown in the plot below [^Feature selection was done with the Boruta package. This package uses a randomForest classifier].

This plot can be interpreted as follows:

  • The most important variables for classifying the functional status of water points are status, energy and extraction.
    • status = {improved, unimproved, missing}
    • energy = {manual, diesel, electrical, solar, water hydram wind, missing}
    • extraction = {handmanual, submersiblepump, gravity, ropeandbucket, surfacepump, hydrampump, nira, gutters, moneymaker, none, tank, tap, missing}
  • Variables that rank among the “shadow variables” at the bottom of the plot have no predictive value: fluoride content and SPA.

3 Classification

3.1 randomForest approach

The functional status of the water points can now be classified.

The dataset was split into a training set (80%) and a test set. A randomForest classifier [^Recursive partitioning, Breiman, 2001] was run on the training set. The randomForest approach is to sample cases (rows) and variables (columns) simultaneously and repeatedly, and to grow one decision tree on each sample. This has the following consequences:

  • The algorithm can show which variables were more important than others in the training process.
  • The method is known to produce very good results. The drawback is that it is hard to trace how exactly the procedure arrived at a given prediction.
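
The sampling of rows and columns can be sketched as follows. This is a simplified illustration and not the randomForest implementation: it draws one set of variables per tree, whereas Breiman's algorithm redraws the candidate variables at every split. The function name `draw_tree_sample` and the sizes used in the example are assumptions.

```python
import random


def draw_tree_sample(n_rows, n_cols, n_vars, rng):
    """Draw the rows and columns on which one tree would be grown."""
    # Bootstrap sample of cases: same size as the data, drawn with replacement.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Random subset of variables, drawn without replacement.
    cols = rng.sample(range(n_cols), n_vars)
    # Cases never drawn ("out-of-bag") can serve to estimate the error.
    out_of_bag = sorted(set(range(n_rows)) - set(rows))
    return rows, cols, out_of_bag


rng = random.Random(42)  # fixed seed so that the sketch is reproducible
# Illustrative sizes: 1000 cases, 58 variables, 7 variables per sample.
rows, cols, oob = draw_tree_sample(n_rows=1000, n_cols=58, n_vars=7, rng=rng)
```

Repeating this draw many times and growing one tree per draw gives the ensemble; the final class is decided by majority vote over the trees.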

The importance plot is shown below. Note that the most important variables found by the trained classifier are not the same as those found by the feature selection engine.

3.2 Classification results

The classifier was used on the test dataset (preds) and the results compared with the actual functional status (trues). This is shown in the table below.

## $confusion.matrix
##           preds
## trues      dodgy func.use non.func
##   dodgy       69       37        0
##   func.use     6     1039        9
##   non.func     1       50       69
## 
## $per.class
##            dodgy func.use non.func
## precision   0.91     0.92     0.88
## recall      0.65     0.99     0.57
## f1          0.76     0.95     0.70
## prevalence  0.08     0.82     0.09
## 
## $macro.metrics
##                value
## macroPrecision  0.91
## macroRecall     0.74
## macroF1         0.80
## accuracy        0.92
## expAccuracy     0.73
## kappa           0.70

These results can be interpreted as follows.

  • The first table tabulates the real values for functional status against the predicted ones.
    • Numbers on the diagonal are correctly classified. E.g. the classifier correctly predicted 1039 functional water points out of 1054 actual functional ones.
  • The second table contains information per class (functional, not functional, dodgy).
    • Prevalence is the fraction of cases: for example, 82 percent of the water points are “functional”.
    • Precision is the percentage of predictions for a class that are correct. For example, the classifier predicted that there are 1126 functional water points (see table 1, column “func.use”). It was right in 1039 of the cases. The precision is therefore 92 percent.
    • Recall is the percentage of true cases of a class that the classifier found. For example, there are 1054 functional water points (see table 1, second row). The classifier found 1039 of them, so recall is 99 percent.
  • The third table gives some overall statistics.
    • The classifier predicted the right class (functional, not functional, dodgy) in 92 percent of the cases (accuracy).
    • Averaged over the three classes, 74 percent of the actual cases were detected by the classifier (macro recall).
    • A classifier can be right by chance alone. For example, 82 percent of all cases are of class functional. You could therefore get an accuracy of 82 percent by simply predicting that all unseen instances fall in the class “functional”. The kappa statistic corrects for this effect by comparing the observed accuracy (0.92) with the accuracy expected by chance (expAccuracy, 0.73). A kappa of 0.70 is reasonably good, meaning that the performance of the classifier cannot be explained by chance alone.
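
The figures in the three tables can be recomputed from the confusion matrix alone. The sketch below does this in Python for the randomForest results. The function name `summarize` and the dictionary keys are illustrative, and it is assumed that expAccuracy is the chance agreement derived from the row and column totals, which is consistent with the reported value.

```python
LABELS = ["dodgy", "func.use", "non.func"]
# Rows are the true classes, columns the predicted classes (first table).
CONFUSION = [
    [69, 37, 0],
    [6, 1039, 9],
    [1, 50, 69],
]


def summarize(matrix):
    """Per-class and overall metrics from a square confusion matrix."""
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]        # actual cases per class
    col_totals = [sum(col) for col in zip(*matrix)]  # predicted cases per class
    hits = [matrix[i][i] for i in range(len(matrix))]
    precision = [h / c for h, c in zip(hits, col_totals)]
    recall = [h / r for h, r in zip(hits, row_totals)]
    accuracy = sum(hits) / n
    # Agreement expected if predictions were independent of the true class.
    exp_accuracy = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (accuracy - exp_accuracy) / (1 - exp_accuracy)
    return {
        "precision": precision,
        "recall": recall,
        "prevalence": [r / n for r in row_totals],
        "macro_recall": sum(recall) / len(recall),
        "accuracy": accuracy,
        "exp_accuracy": exp_accuracy,
        "kappa": kappa,
    }


stats = summarize(CONFUSION)
```

For the class `func.use` this gives a precision of 1039/1126 and a recall of 1039/1054, matching the 0.92 and 0.99 in the second table.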

3.3 Rule-based approach

Some of the accuracy of the randomForest approach can be sacrificed in order to get results that are easier to interpret.

One of the classifiers that can be used is JRip [^Repeated Incremental Pruning to Produce Error Reduction, Cohen, 1995].

## JRIP rules:
## ===========
## 
## (quantitative.in.field.assessment = unsafe) and (specify.none.payment = yes) and (funded.by = community) => functional.status=dodgy (78.0/3.0)
## (quantitative.in.field.assessment = unsafe) and (county = busia) and (livestock.is.availability.sufficient.for = no) and (specify.none.payment = yes) => functional.status=dodgy (24.0/6.0)
## (color = coloured) and (wp.maintainance = unknown) and (year.constructed <= 1992) => functional.status=dodgy (86.0/33.0)
## (no.of.hh.served.day <= 0) and (wp.maintainance = unknown) => functional.status=non.func (55.0/2.0)
## (quantitative.in.field.assessment = not tested) and (households.is.availability.sufficient.for = no) and (county = busia) and (altitude >= 1195) => functional.status=non.func (58.0/10.0)
## (quantitative.in.field.assessment = not tested) and (households.is.availability.sufficient.for = no) and (county = busia) and (no.of.hh.served.day >= 88) => functional.status=non.func (15.0/3.0)
## (quantitative.in.field.assessment = not tested) and (water.consumption.per.day.in.dry.season >= 100) and (water.consumption.per.day.in.dry.season <= 120) and (year.constructed <= 1993) and (extraction = hand.manual) and (livestock.is.availability.sufficient.for = no) => functional.status=non.func (23.0/3.0)
## (no.of.hh.served.day <= 0) and (reliability = ..missing..) => functional.status=non.func (9.0/0.0)
## (no.of.hh.served.day <= 1) and (cost.recovery.mechanism = no) and (energy = electrical) and (year.constructed <= 2009) => functional.status=non.func (8.0/1.0)
##  => functional.status=func.use (2766.0/209.0)
## 
## Number of Rules : 10

As you can see, the classifier was able to generate 10 easy-to-understand rules for classifying the functional status of water points. The rules are tried from top to bottom; the last rule has no condition and assigns “func.use” to every water point not covered by an earlier rule. The numbers in brackets after each rule stand for “number of cases covered” and “number of incorrectly classified cases” respectively. For example, => functional.status=dodgy (78.0/3.0) means that the rule predicts “dodgy” for functional status; it covers 78 water points, and incorrectly classifies 3 cases.
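
A rule list of this kind is applied from top to bottom: the first rule whose conditions all hold determines the class, and the final rule without conditions is the default. The sketch below illustrates this in Python with only three of the ten rules, so it is not a reimplementation of the trained classifier. Representing a water point as a dictionary, and the names `RULES`, `classify` and `rule_precision`, are assumptions.

```python
INF = float("inf")  # stands in for a missing numeric value

# (condition, predicted class) pairs in the order in which they are tried.
# Only three of the ten rules are reproduced here, for illustration.
RULES = [
    (lambda wp: wp.get("quantitative.in.field.assessment") == "unsafe"
        and wp.get("specify.none.payment") == "yes"
        and wp.get("funded.by") == "community", "dodgy"),
    (lambda wp: wp.get("no.of.hh.served.day", INF) <= 0
        and wp.get("wp.maintainance") == "unknown", "non.func"),
    (lambda wp: wp.get("no.of.hh.served.day", INF) <= 0
        and wp.get("reliability") == "..missing..", "non.func"),
]
DEFAULT = "func.use"  # the last rule in the list has no condition


def classify(water_point):
    """Return the class of the first matching rule, else the default."""
    for condition, label in RULES:
        if condition(water_point):
            return label
    return DEFAULT


def rule_precision(covered, errors):
    """Share of covered cases that a rule classifies correctly."""
    return (covered - errors) / covered


example = {"no.of.hh.served.day": 0, "wp.maintainance": "unknown"}
predicted = classify(example)         # matches the second rule above
first_rule = rule_precision(78, 3)    # 75 of 78 covered cases are correct
```

Because the order matters, a rule further down the list only sees the water points that no earlier rule has claimed.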

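The rule-based classifier does not reach the accuracy of the randomForest classifier (accuracy 0.90 versus 0.92, kappa 0.61 versus 0.70), as shown in the table below. Note that this table covers 3122 water points whereas the randomForest table covers 1280, so the two sets of figures are not computed on the same cases.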

## $confusion.matrix
##           preds
## trues      dodgy func.use non.func
##   dodgy      139      103        4
##   func.use    51     2510       43
##   non.func     7      111      154
## 
## $per.class
##            dodgy func.use non.func
## precision   0.71     0.92     0.77
## recall      0.57     0.96     0.57
## f1          0.63     0.94     0.65
## prevalence  0.08     0.83     0.09
## 
## $macro.metrics
##                value
## macroPrecision  0.80
## macroRecall     0.70
## macroF1         0.74
## accuracy        0.90
## expAccuracy     0.74
## kappa           0.61