Pump it Up: Data Mining the Water Table

Introduction

The aim of this “data for social good” project is to predict which water pumps in Tanzania are functional based on a set of characteristics. The data is provided by the Tanzanian Ministry of Water. If a good model is built by using these data maintenance can be improved, which would lead to better access to water (more information is available on the drivendata competition page).

Caption for the picture.

Caption for the picture.


Children using a water pump.

Let’s start by looking at the structure of the training set, and the first few rows.

str(df)
## 'data.frame':    59400 obs. of  41 variables:
##  $ id                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ status_group         : Factor w/ 3 levels "functional","functional needs repair",..: 3 1 1 1 3 1 3 1 3 3 ...
##  $ amount_tsh           : num  0 0 0 10 0 50 0 0 0 0 ...
##  $ date_recorded        : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 230 39 61 341 56 32 206 57 72 104 ...
##  $ funder               : Factor w/ 1898 levels "","0","A/co Germany",..: 1635 1502 840 438 213 1272 458 1831 1831 281 ...
##  $ gps_height           : int  0 1978 0 1639 0 28 0 0 0 0 ...
##  $ installer            : Factor w/ 2146 levels "","-","0","A.D.B",..: 1808 1692 1001 230 278 1462 572 565 289 376 ...
##  $ longitude            : num  33.1 34.8 36.1 37.1 36.2 ...
##  $ latitude             : num  -5.12 -9.4 -6.28 -3.19 -6.1 ...
##  $ wpt_name             : Factor w/ 37400 levels "24","A Kulwa",..: 29563 32437 1067 446 2598 20290 28857 4091 9066 35286 ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : Factor w/ 9 levels "Internal","Lake Nyasa",..: 4 7 9 6 9 9 1 7 9 5 ...
##  $ subvillage           : Factor w/ 19288 levels "","'A' Kati",..: 9076 8912 17876 18716 8625 13587 14808 13632 10151 10823 ...
##  $ region               : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 20 4 3 7 3 15 18 3 3 5 ...
##  $ region_code          : int  14 11 1 3 1 60 17 1 1 18 ...
##  $ district_code        : int  3 4 4 5 4 43 3 1 5 8 ...
##  $ lga                  : Factor w/ 125 levels "Arusha Rural",..: 125 92 12 17 12 70 105 77 15 13 ...
##  $ ward                 : Factor w/ 2092 levels "Aghondi","Akheri",..: 316 2049 1357 1091 1003 2061 2027 2079 392 1711 ...
##  $ population           : int  0 20 0 25 0 6922 0 0 0 0 ...
##  $ public_meeting       : Factor w/ 3 levels "","False","True": 1 3 3 3 3 3 3 3 3 3 ...
##  $ recorded_by          : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scheme_management    : Factor w/ 13 levels "","Company","None",..: 9 1 9 11 9 6 9 9 9 9 ...
##  $ scheme_name          : Factor w/ 2697 levels "","14 Kambarage",..: 1 1 1461 1091 1 1 1 1141 526 1326 ...
##  $ permit               : Factor w/ 3 levels "","False","True": 3 2 3 3 3 2 3 3 2 3 ...
##  $ construction_year    : int  0 2008 0 1999 0 0 0 0 0 0 ...
##  $ extraction_type      : Factor w/ 18 levels "afridev","cemo",..: 1 13 8 4 9 15 10 10 8 8 ...
##  $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 1 10 5 2 6 11 7 7 5 5 ...
##  $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 2 5 3 1 2 6 4 4 3 3 ...
##  $ management           : Factor w/ 12 levels "company","other",..: 8 8 8 10 8 5 8 8 5 8 ...
##  $ management_group     : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 5 1 5 5 1 5 ...
##  $ payment              : Factor w/ 7 levels "never pay","other",..: 7 1 5 5 7 5 1 1 5 1 ...
##  $ payment_type         : Factor w/ 7 levels "annually","monthly",..: 7 3 6 6 7 6 3 3 6 3 ...
##  $ water_quality        : Factor w/ 8 levels "coloured","fluoride",..: 4 7 7 7 7 7 7 4 7 7 ...
##  $ quality_group        : Factor w/ 6 levels "colored","fluoride",..: 4 3 3 3 3 3 3 4 3 3 ...
##  $ quantity             : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ quantity_group       : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ source               : Factor w/ 10 levels "dam","hand dtw",..: 8 8 4 9 8 4 8 9 4 4 ...
##  $ source_type          : Factor w/ 7 levels "borehole","dam",..: 6 6 1 7 6 1 6 7 1 1 ...
##  $ source_class         : Factor w/ 3 levels "groundwater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ waterpoint_type      : Factor w/ 7 levels "cattle trough",..: 5 5 3 2 5 3 7 6 3 3 ...
##  $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 4 4 2 2 4 2 6 5 2 2 ...

There is a relatively large number of variables (41) and most of them are factors. Now let’s look at the first few rows of the dataset.

head(df)
id status_group amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private basin subvillage region region_code district_code lga ward population public_meeting recorded_by scheme_management scheme_name permit construction_year extraction_type extraction_type_group extraction_type_class management management_group payment payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
0 non functional 0 2012-11-13 Tasaf 0 TASAF 33.12583 -5.118154 Mratibu 0 Lake Tanganyika Majengo Tabora 14 3 Uyui Igalula 0 GeoData Consultants Ltd VWC True 0 afridev afridev handpump vwc user-group unknown unknown milky milky enough enough shallow well shallow well groundwater hand pump hand pump
1 functional 0 2011-03-05 Shipo 1978 SHIPO 34.77072 -9.395641 none 0 Rufiji Magoda C Iringa 11 4 Njombe Uwemba 20 True GeoData Consultants Ltd False 2008 other - rope pump rope pump rope pump vwc user-group never pay never pay soft good enough enough shallow well shallow well groundwater hand pump hand pump
2 functional 0 2011-03-27 Lvia 0 LVIA 36.11506 -6.279268 Bombani 0 Wami / Ruvu Songambele Dodoma 1 4 Chamwino Msamalo 0 True GeoData Consultants Ltd VWC Mgun True 0 mono mono motorpump vwc user-group pay per bucket per bucket soft good insufficient insufficient machine dbh borehole groundwater communal standpipe multiple communal standpipe
3 functional 10 2013-06-03 Germany Republi 1639 CES 37.14743 -3.187555 Area 7 Namba 5 0 Pangani Urereni Kilimanjaro 3 5 Hai Masama Magharibi 25 True GeoData Consultants Ltd Water Board Losaa-Kia water supply True 1999 gravity gravity gravity water board user-group pay per bucket per bucket soft good enough enough spring spring groundwater communal standpipe communal standpipe
4 non functional 0 2011-03-22 Cmsr 0 CMSR 36.16489 -6.099290 Ezeleda 0 Wami / Ruvu Maata A Dodoma 1 4 Chamwino Majeleko 0 True GeoData Consultants Ltd VWC True 0 nira/tanira nira/tanira handpump vwc user-group unknown unknown soft good dry dry shallow well shallow well groundwater hand pump hand pump
5 functional 50 2011-02-26 Private 28 Private 39.28612 -6.972403 Kwa Namaj 0 Wami / Ruvu Mwandege Pwani 60 43 Mkuranga Vikindu 6922 True GeoData Consultants Ltd Private operator False 0 submersible submersible submersible private operator commercial pay per bucket per bucket soft good enough enough machine dbh borehole groundwater communal standpipe multiple communal standpipe

If you have a look at the construction_year variable you can already see that even in the first few rows there are a lot of zeroes. Here we should do some thinking about the best way to proceed1 We leave this as an ecercise for the motivated reader..

Data Cleaning

Let’s explore the missing values first.

Caption for the picture.

Caption for the picture.

We can see that there are actually very few rows with missing values, with the scheme_name having the largest amount. The following heatmap represents the presence/abscence relationships between the variables. Most values are not influenced by this, with the exception of the permit and funder pair.

Caption for the picture.

Caption for the picture.

This dendrogram goes one step further in showing possible relationships from the heatmap.

Caption for the picture.

Caption for the picture.

Map

An interesting way to visualise the distribution of pumps geographically is to create a map. This would also show us that there is something wrong with a few of the coordinates.

map.df <- df %>% select(latitude, longitude, status_group) %>% sample_frac(size = 0.01)

leaflet() %>% addTiles() %>% addMarkers(lat = map.df$latitude,
                                        lng = map.df$longitude,
                                        clusterOptions = markerClusterOptions())

Pay attention to the cluster west of Cameroun, in the eastern Atlantic. We can safely asume that there are no water pumps there, so we should clean the data a bit2 One more exercise..

EDA

First lets have a look at the distrubution of water pump statuses in our training dataset.

# Look at the number of pumps in each functional status group
table(df$status_group)
functional functional needs repair non functional
32259 4317 22824
# As proportions
prop.table(table(df$status_group))
functional functional needs repair non functional
0.5430808 0.0726768 0.3842424

There are three different statuses in the status_group column (our labels): functional, functional needs repair and non functional. A good next step would be to see this distrubition across the quantity_group variable. This might provide some insights on the prevalence of faulty pumps.

# Table of the quantity variable vs the status of the pumps
table(df$quantity, df$status_group)
/ functional functional needs repair non functional
dry 157 37 6052
enough 21648 2400 9138
insufficient 7916 1450 5763
seasonal 2325 416 1309
unknown 213 14 562
# As row-wise proportions, quantity vs status_group
prop.table(table(df$quantity, df$status_group), margin = 1)
/ functional functional needs repair non functional
dry 0.0251361 0.0059238 0.9689401
enough 0.6523233 0.0723197 0.2753571
insufficient 0.5232335 0.0958424 0.3809241
seasonal 0.5740741 0.1027160 0.3232099
unknown 0.2699620 0.0177440 0.7122940

We can already see some patterns here, lets visualise them next.

## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

From this plot we can deduce that most of the dry pumps are not functional, while the classes are more evenly distributed across the other variables.

## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

Most pumps with an unknown quality_group are non functional, so are the colored ones.

The most non functional pumps are of the other waterpoint_type.

This is a visualisation of why we want to exclude year = 0 from our visualisations and analysis.

One plausible hypothesis that we can test is that there is a prevalence of dysfunctional water pumps that are old. Let’s test this with a few histograms3 As an exercise we can try a statistical test on this..

And indeed it seems as if there are more non functional pumps that are older, mostly through the 80s.

Modeling

A random forest model is used for the classification. First we use the sample.split function from the handy caTools package to split our data into training and test datasets (25% for testing).

# train test split
sample = sample.split(df$status_group, SplitRatio = .75)
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)

Next we must convert our dataframes into a format suitable for the h2o package to analyse.

# convert dataframes into h2o objects
data.train.hex <- as.h2o(train)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
data.test.hex <- as.h2o(test)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
# select predictor and response
Y = "status_group"
X = c("longitude", "latitude", "extraction_type_group", "quality_group", "quantity", "waterpoint_type", "construction_year")

Here we also selected our predictor and response variables. Next we can finally create our model.

# Define the data for the model and display the results
data.rf <- h2o.randomForest(training_frame=data.train.hex, x=X, y=Y)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |=================================================================| 100%
# Model performance
h2o.confusionMatrix(data.rf, data.test.hex)
functional functional needs repair non functional Error Rate
functional 7361 95 609 0.0872908 704 / 8,065
functional needs repair 685 218 176 0.7979611 861 / 1,079
non functional 1541 64 4101 0.2812829 1,605 / 5,706
Totals 9587 377 4886 0.2134680 3,170 / 14,850

Conclusion

As you can see from the confusion matrix our random forest model has managed to classify a good portion of the values with an overall accuracy of around 0.80. There are ways we can improve the model (we have kept the feature engineering to a minimum for this example), but we can assume that this is a satisfying first result.

Note: some of this code is inspired by a course on Datacamp.