Introduction

The aim of this “data for social good” project is to predict which water pumps in Tanzania are functional based on a set of characteristics. The data is provided by the Tanzanian Ministry of Water. If a good model is built by using these data maintenance can be improved, which would lead to better access to water (more information is available on the drivendata competition page).

⊕

Caption for the picture.

Children using a water pump.

Let’s start by looking at the structure of the training set, and the first few rows.

str(df)

## 'data.frame':    59400 obs. of  41 variables:
##  $ id                   : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ status_group         : Factor w/ 3 levels "functional","functional needs repair",..: 3 1 1 1 3 1 3 1 3 3 ...
##  $ amount_tsh           : num  0 0 0 10 0 50 0 0 0 0 ...
##  $ date_recorded        : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 230 39 61 341 56 32 206 57 72 104 ...
##  $ funder               : Factor w/ 1898 levels "","0","A/co Germany",..: 1635 1502 840 438 213 1272 458 1831 1831 281 ...
##  $ gps_height           : int  0 1978 0 1639 0 28 0 0 0 0 ...
##  $ installer            : Factor w/ 2146 levels "","-","0","A.D.B",..: 1808 1692 1001 230 278 1462 572 565 289 376 ...
##  $ longitude            : num  33.1 34.8 36.1 37.1 36.2 ...
##  $ latitude             : num  -5.12 -9.4 -6.28 -3.19 -6.1 ...
##  $ wpt_name             : Factor w/ 37400 levels "24","A Kulwa",..: 29563 32437 1067 446 2598 20290 28857 4091 9066 35286 ...
##  $ num_private          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ basin                : Factor w/ 9 levels "Internal","Lake Nyasa",..: 4 7 9 6 9 9 1 7 9 5 ...
##  $ subvillage           : Factor w/ 19288 levels "","'A' Kati",..: 9076 8912 17876 18716 8625 13587 14808 13632 10151 10823 ...
##  $ region               : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 20 4 3 7 3 15 18 3 3 5 ...
##  $ region_code          : int  14 11 1 3 1 60 17 1 1 18 ...
##  $ district_code        : int  3 4 4 5 4 43 3 1 5 8 ...
##  $ lga                  : Factor w/ 125 levels "Arusha Rural",..: 125 92 12 17 12 70 105 77 15 13 ...
##  $ ward                 : Factor w/ 2092 levels "Aghondi","Akheri",..: 316 2049 1357 1091 1003 2061 2027 2079 392 1711 ...
##  $ population           : int  0 20 0 25 0 6922 0 0 0 0 ...
##  $ public_meeting       : Factor w/ 3 levels "","False","True": 1 3 3 3 3 3 3 3 3 3 ...
##  $ recorded_by          : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scheme_management    : Factor w/ 13 levels "","Company","None",..: 9 1 9 11 9 6 9 9 9 9 ...
##  $ scheme_name          : Factor w/ 2697 levels "","14 Kambarage",..: 1 1 1461 1091 1 1 1 1141 526 1326 ...
##  $ permit               : Factor w/ 3 levels "","False","True": 3 2 3 3 3 2 3 3 2 3 ...
##  $ construction_year    : int  0 2008 0 1999 0 0 0 0 0 0 ...
##  $ extraction_type      : Factor w/ 18 levels "afridev","cemo",..: 1 13 8 4 9 15 10 10 8 8 ...
##  $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 1 10 5 2 6 11 7 7 5 5 ...
##  $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 2 5 3 1 2 6 4 4 3 3 ...
##  $ management           : Factor w/ 12 levels "company","other",..: 8 8 8 10 8 5 8 8 5 8 ...
##  $ management_group     : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 5 1 5 5 1 5 ...
##  $ payment              : Factor w/ 7 levels "never pay","other",..: 7 1 5 5 7 5 1 1 5 1 ...
##  $ payment_type         : Factor w/ 7 levels "annually","monthly",..: 7 3 6 6 7 6 3 3 6 3 ...
##  $ water_quality        : Factor w/ 8 levels "coloured","fluoride",..: 4 7 7 7 7 7 7 4 7 7 ...
##  $ quality_group        : Factor w/ 6 levels "colored","fluoride",..: 4 3 3 3 3 3 3 4 3 3 ...
##  $ quantity             : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ quantity_group       : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
##  $ source               : Factor w/ 10 levels "dam","hand dtw",..: 8 8 4 9 8 4 8 9 4 4 ...
##  $ source_type          : Factor w/ 7 levels "borehole","dam",..: 6 6 1 7 6 1 6 7 1 1 ...
##  $ source_class         : Factor w/ 3 levels "groundwater",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ waterpoint_type      : Factor w/ 7 levels "cattle trough",..: 5 5 3 2 5 3 7 6 3 3 ...
##  $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 4 4 2 2 4 2 6 5 2 2 ...

There is a relatively large number of variables (41) and most of them are factors. Now let’s look at the first few rows of the dataset.

head(df)

id	status_group	amount_tsh	date_recorded	funder	gps_height	installer	longitude	latitude	wpt_name	basin	subvillage	region	region_code	district_code	lga	ward	population	public_meeting	recorded_by	scheme_management	scheme_name	permit	construction_year	extraction_type	extraction_type_group	extraction_type_class	management	management_group	payment	payment_type	water_quality	quality_group	quantity	quantity_group	source	source_type	source_class	waterpoint_type	waterpoint_type_group
0	non functional	0	2012-11-13	Tasaf	0	TASAF	33.12583	-5.118154	Mratibu	Lake Tanganyika	Majengo	Tabora	14	3	Uyui	Igalula	0		GeoData Consultants Ltd	VWC		True	0	afridev	afridev	handpump	vwc	user-group	unknown	unknown	milky	milky	enough	enough	shallow well	shallow well	groundwater	hand pump	hand pump
1	functional	0	2011-03-05	Shipo	1978	SHIPO	34.77072	-9.395641	none	Rufiji	Magoda C	Iringa	11	4	Njombe	Uwemba	20	True	GeoData Consultants Ltd			False	2008	other - rope pump	rope pump	rope pump	vwc	user-group	never pay	never pay	soft	good	enough	enough	shallow well	shallow well	groundwater	hand pump	hand pump
2	functional	0	2011-03-27	Lvia	0	LVIA	36.11506	-6.279268	Bombani	Wami / Ruvu	Songambele	Dodoma	1	4	Chamwino	Msamalo	0	True	GeoData Consultants Ltd	VWC	Mgun	True	0	mono	mono	motorpump	vwc	user-group	pay per bucket	per bucket	soft	good	insufficient	insufficient	machine dbh	borehole	groundwater	communal standpipe multiple	communal standpipe
3	functional	10	2013-06-03	Germany Republi	1639	CES	37.14743	-3.187555	Area 7 Namba 5	Pangani	Urereni	Kilimanjaro	3	5	Hai	Masama Magharibi	25	True	GeoData Consultants Ltd	Water Board	Losaa-Kia water supply	True	1999	gravity	gravity	gravity	water board	user-group	pay per bucket	per bucket	soft	good	enough	enough	spring	spring	groundwater	communal standpipe	communal standpipe
4	non functional	0	2011-03-22	Cmsr	0	CMSR	36.16489	-6.099290	Ezeleda	Wami / Ruvu	Maata A	Dodoma	1	4	Chamwino	Majeleko	0	True	GeoData Consultants Ltd	VWC		True	0	nira/tanira	nira/tanira	handpump	vwc	user-group	unknown	unknown	soft	good	dry	dry	shallow well	shallow well	groundwater	hand pump	hand pump
5	functional	50	2011-02-26	Private	28	Private	39.28612	-6.972403	Kwa Namaj	Wami / Ruvu	Mwandege	Pwani	60	43	Mkuranga	Vikindu	6922	True	GeoData Consultants Ltd	Private operator		False	0	submersible	submersible	submersible	private operator	commercial	pay per bucket	per bucket	soft	good	enough	enough	machine dbh	borehole	groundwater	communal standpipe multiple	communal standpipe

If you have a look at the construction_year variable you can already see that even in the first few rows there are a lot of zeroes. Here we should do some thinking about the best way to proceed11 We leave this as an ecercise for the motivated reader..

Data Cleaning

Let’s explore the missing values first.

Caption for the picture.

We can see that there are actually very few rows with missing values, with the scheme_name having the largest amount. The following heatmap represents the presence/abscence relationships between the variables. Most values are not influenced by this, with the exception of the permit and funder pair.

Caption for the picture.

This dendrogram goes one step further in showing possible relationships from the heatmap.

Caption for the picture.

Map

An interesting way to visualise the distribution of pumps geographically is to create a map. This would also show us that there is something wrong with a few of the coordinates.

map.df <- df %>% select(latitude, longitude, status_group) %>% sample_frac(size = 0.01)

leaflet() %>% addTiles() %>% addMarkers(lat = map.df$latitude,
                                        lng = map.df$longitude,
                                        clusterOptions = markerClusterOptions())

Pay attention to the cluster west of Cameroun, in the eastern Atlantic. We can safely asume that there are no water pumps there, so we should clean the data a bit22 One more exercise..

EDA

First lets have a look at the distrubution of water pump statuses in our training dataset.

# Look at the number of pumps in each functional status group
table(df$status_group)

functional	functional needs repair	non functional
32259	4317	22824

# As proportions
prop.table(table(df$status_group))

functional	functional needs repair	non functional
0.5430808	0.0726768	0.3842424

There are three different statuses in the status_group column (our labels): functional, functional needs repair and non functional. A good next step would be to see this distrubition across the quantity_group variable. This might provide some insights on the prevalence of faulty pumps.

# Table of the quantity variable vs the status of the pumps
table(df$quantity, df$status_group)

/	functional	functional needs repair	non functional
dry	157	37	6052
enough	21648	2400	9138
insufficient	7916	1450	5763
seasonal	2325	416	1309
unknown	213	14	562

# As row-wise proportions, quantity vs status_group
prop.table(table(df$quantity, df$status_group), margin = 1)

/	functional	functional needs repair	non functional
dry	0.0251361	0.0059238	0.9689401
enough	0.6523233	0.0723197	0.2753571
insufficient	0.5232335	0.0958424	0.3809241
seasonal	0.5740741	0.1027160	0.3232099
unknown	0.2699620	0.0177440	0.7122940

We can already see some patterns here, lets visualise them next.

## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

From this plot we can deduce that most of the dry pumps are not functional, while the classes are more evenly distributed across the other variables.

## Warning: plotly.js does not (yet) support horizontal legend items 
## You can track progress here: 
## https://github.com/plotly/plotly.js/issues/53

Most pumps with an unknown quality_group are non functional, so are the colored ones.

The most non functional pumps are of the other waterpoint_type.

⊕This is a visualisation of why we want to exclude year = 0 from our visualisations and analysis.

One plausible hypothesis that we can test is that there is a prevalence of dysfunctional water pumps that are old. Let’s test this with a few histograms33 As an exercise we can try a statistical test on this..

And indeed it seems as if there are more non functional pumps that are older, mostly through the 80s.

Modeling

A random forest model is used for the classification. First we use the sample.split function from the handy caTools package to split our data into training and test datasets (25% for testing).

# train test split
sample = sample.split(df$status_group, SplitRatio = .75)
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)

Next we must convert our dataframes into a format suitable for the h2o package to analyse.

# convert dataframes into h2o objects
data.train.hex <- as.h2o(train)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

data.test.hex <- as.h2o(test)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

# select predictor and response
Y = "status_group"
X = c("longitude", "latitude", "extraction_type_group", "quality_group", "quantity", "waterpoint_type", "construction_year")

Here we also selected our predictor and response variables. Next we can finally create our model.

# Define the data for the model and display the results
data.rf <- h2o.randomForest(training_frame=data.train.hex, x=X, y=Y)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |======                                                           |  10%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=================                                                |  26%
  |                                                                       
  |======================                                           |  34%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |========================================                         |  62%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=====================================================            |  82%
  |                                                                       
  |==========================================================       |  90%
  |                                                                       
  |=================================================================| 100%

# Model performance
h2o.confusionMatrix(data.rf, data.test.hex)

	functional	functional needs repair	non functional	Error	Rate
functional	7361	95	609	0.0872908	704 / 8,065
functional needs repair	685	218	176	0.7979611	861 / 1,079
non functional	1541	64	4101	0.2812829	1,605 / 5,706
Totals	9587	377	4886	0.2134680	3,170 / 14,850

Conclusion

As you can see from the confusion matrix our random forest model has managed to classify a good portion of the values with an overall accuracy of around 0.80. There are ways we can improve the model (we have kept the feature engineering to a minimum for this example), but we can assume that this is a satisfying first result.

Note: some of this code is inspired by a course on Datacamp.

Pump it Up: Data Mining the Water Table

Introduction

Data Cleaning

Map

EDA

Modeling

Conclusion