The aim of this “data for social good” project is to predict which water pumps in Tanzania are functional based on a set of characteristics. The data is provided by the Tanzanian Ministry of Water. If a good model is built by using these data maintenance can be improved, which would lead to better access to water (more information is available on the drivendata competition page).
Caption for the picture.
Children using a water pump.
Let’s start by looking at the structure of the training set, and the first few rows.
str(df)
## 'data.frame': 59400 obs. of 41 variables:
## $ id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ status_group : Factor w/ 3 levels "functional","functional needs repair",..: 3 1 1 1 3 1 3 1 3 3 ...
## $ amount_tsh : num 0 0 0 10 0 50 0 0 0 0 ...
## $ date_recorded : Factor w/ 356 levels "2002-10-14","2004-01-07",..: 230 39 61 341 56 32 206 57 72 104 ...
## $ funder : Factor w/ 1898 levels "","0","A/co Germany",..: 1635 1502 840 438 213 1272 458 1831 1831 281 ...
## $ gps_height : int 0 1978 0 1639 0 28 0 0 0 0 ...
## $ installer : Factor w/ 2146 levels "","-","0","A.D.B",..: 1808 1692 1001 230 278 1462 572 565 289 376 ...
## $ longitude : num 33.1 34.8 36.1 37.1 36.2 ...
## $ latitude : num -5.12 -9.4 -6.28 -3.19 -6.1 ...
## $ wpt_name : Factor w/ 37400 levels "24","A Kulwa",..: 29563 32437 1067 446 2598 20290 28857 4091 9066 35286 ...
## $ num_private : int 0 0 0 0 0 0 0 0 0 0 ...
## $ basin : Factor w/ 9 levels "Internal","Lake Nyasa",..: 4 7 9 6 9 9 1 7 9 5 ...
## $ subvillage : Factor w/ 19288 levels "","'A' Kati",..: 9076 8912 17876 18716 8625 13587 14808 13632 10151 10823 ...
## $ region : Factor w/ 21 levels "Arusha","Dar es Salaam",..: 20 4 3 7 3 15 18 3 3 5 ...
## $ region_code : int 14 11 1 3 1 60 17 1 1 18 ...
## $ district_code : int 3 4 4 5 4 43 3 1 5 8 ...
## $ lga : Factor w/ 125 levels "Arusha Rural",..: 125 92 12 17 12 70 105 77 15 13 ...
## $ ward : Factor w/ 2092 levels "Aghondi","Akheri",..: 316 2049 1357 1091 1003 2061 2027 2079 392 1711 ...
## $ population : int 0 20 0 25 0 6922 0 0 0 0 ...
## $ public_meeting : Factor w/ 3 levels "","False","True": 1 3 3 3 3 3 3 3 3 3 ...
## $ recorded_by : Factor w/ 1 level "GeoData Consultants Ltd": 1 1 1 1 1 1 1 1 1 1 ...
## $ scheme_management : Factor w/ 13 levels "","Company","None",..: 9 1 9 11 9 6 9 9 9 9 ...
## $ scheme_name : Factor w/ 2697 levels "","14 Kambarage",..: 1 1 1461 1091 1 1 1 1141 526 1326 ...
## $ permit : Factor w/ 3 levels "","False","True": 3 2 3 3 3 2 3 3 2 3 ...
## $ construction_year : int 0 2008 0 1999 0 0 0 0 0 0 ...
## $ extraction_type : Factor w/ 18 levels "afridev","cemo",..: 1 13 8 4 9 15 10 10 8 8 ...
## $ extraction_type_group: Factor w/ 13 levels "afridev","gravity",..: 1 10 5 2 6 11 7 7 5 5 ...
## $ extraction_type_class: Factor w/ 7 levels "gravity","handpump",..: 2 5 3 1 2 6 4 4 3 3 ...
## $ management : Factor w/ 12 levels "company","other",..: 8 8 8 10 8 5 8 8 5 8 ...
## $ management_group : Factor w/ 5 levels "commercial","other",..: 5 5 5 5 5 1 5 5 1 5 ...
## $ payment : Factor w/ 7 levels "never pay","other",..: 7 1 5 5 7 5 1 1 5 1 ...
## $ payment_type : Factor w/ 7 levels "annually","monthly",..: 7 3 6 6 7 6 3 3 6 3 ...
## $ water_quality : Factor w/ 8 levels "coloured","fluoride",..: 4 7 7 7 7 7 7 4 7 7 ...
## $ quality_group : Factor w/ 6 levels "colored","fluoride",..: 4 3 3 3 3 3 3 4 3 3 ...
## $ quantity : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
## $ quantity_group : Factor w/ 5 levels "dry","enough",..: 2 2 3 2 1 2 4 3 1 3 ...
## $ source : Factor w/ 10 levels "dam","hand dtw",..: 8 8 4 9 8 4 8 9 4 4 ...
## $ source_type : Factor w/ 7 levels "borehole","dam",..: 6 6 1 7 6 1 6 7 1 1 ...
## $ source_class : Factor w/ 3 levels "groundwater",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ waterpoint_type : Factor w/ 7 levels "cattle trough",..: 5 5 3 2 5 3 7 6 3 3 ...
## $ waterpoint_type_group: Factor w/ 6 levels "cattle trough",..: 4 4 2 2 4 2 6 5 2 2 ...
There is a relatively large number of variables (41) and most of them are factors. Now let’s look at the first few rows of the dataset.
head(df)
| id | status_group | amount_tsh | date_recorded | funder | gps_height | installer | longitude | latitude | wpt_name | num_private | basin | subvillage | region | region_code | district_code | lga | ward | population | public_meeting | recorded_by | scheme_management | scheme_name | permit | construction_year | extraction_type | extraction_type_group | extraction_type_class | management | management_group | payment | payment_type | water_quality | quality_group | quantity | quantity_group | source | source_type | source_class | waterpoint_type | waterpoint_type_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | non functional | 0 | 2012-11-13 | Tasaf | 0 | TASAF | 33.12583 | -5.118154 | Mratibu | 0 | Lake Tanganyika | Majengo | Tabora | 14 | 3 | Uyui | Igalula | 0 | GeoData Consultants Ltd | VWC | True | 0 | afridev | afridev | handpump | vwc | user-group | unknown | unknown | milky | milky | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump | ||
| 1 | functional | 0 | 2011-03-05 | Shipo | 1978 | SHIPO | 34.77072 | -9.395641 | none | 0 | Rufiji | Magoda C | Iringa | 11 | 4 | Njombe | Uwemba | 20 | True | GeoData Consultants Ltd | False | 2008 | other - rope pump | rope pump | rope pump | vwc | user-group | never pay | never pay | soft | good | enough | enough | shallow well | shallow well | groundwater | hand pump | hand pump | ||
| 2 | functional | 0 | 2011-03-27 | Lvia | 0 | LVIA | 36.11506 | -6.279268 | Bombani | 0 | Wami / Ruvu | Songambele | Dodoma | 1 | 4 | Chamwino | Msamalo | 0 | True | GeoData Consultants Ltd | VWC | Mgun | True | 0 | mono | mono | motorpump | vwc | user-group | pay per bucket | per bucket | soft | good | insufficient | insufficient | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
| 3 | functional | 10 | 2013-06-03 | Germany Republi | 1639 | CES | 37.14743 | -3.187555 | Area 7 Namba 5 | 0 | Pangani | Urereni | Kilimanjaro | 3 | 5 | Hai | Masama Magharibi | 25 | True | GeoData Consultants Ltd | Water Board | Losaa-Kia water supply | True | 1999 | gravity | gravity | gravity | water board | user-group | pay per bucket | per bucket | soft | good | enough | enough | spring | spring | groundwater | communal standpipe | communal standpipe |
| 4 | non functional | 0 | 2011-03-22 | Cmsr | 0 | CMSR | 36.16489 | -6.099290 | Ezeleda | 0 | Wami / Ruvu | Maata A | Dodoma | 1 | 4 | Chamwino | Majeleko | 0 | True | GeoData Consultants Ltd | VWC | True | 0 | nira/tanira | nira/tanira | handpump | vwc | user-group | unknown | unknown | soft | good | dry | dry | shallow well | shallow well | groundwater | hand pump | hand pump | |
| 5 | functional | 50 | 2011-02-26 | Private | 28 | Private | 39.28612 | -6.972403 | Kwa Namaj | 0 | Wami / Ruvu | Mwandege | Pwani | 60 | 43 | Mkuranga | Vikindu | 6922 | True | GeoData Consultants Ltd | Private operator | False | 0 | submersible | submersible | submersible | private operator | commercial | pay per bucket | per bucket | soft | good | enough | enough | machine dbh | borehole | groundwater | communal standpipe multiple | communal standpipe |
If you have a look at the construction_year variable you can already see that even in the first few rows there are a lot of zeroes. Here we should do some thinking about the best way to proceed1 We leave this as an ecercise for the motivated reader..
Let’s explore the missing values first.
Caption for the picture.
We can see that there are actually very few rows with missing values, with the scheme_name having the largest amount. The following heatmap represents the presence/abscence relationships between the variables. Most values are not influenced by this, with the exception of the permit and funder pair.
Caption for the picture.
This dendrogram goes one step further in showing possible relationships from the heatmap.
Caption for the picture.
An interesting way to visualise the distribution of pumps geographically is to create a map. This would also show us that there is something wrong with a few of the coordinates.
map.df <- df %>% select(latitude, longitude, status_group) %>% sample_frac(size = 0.01)
leaflet() %>% addTiles() %>% addMarkers(lat = map.df$latitude,
lng = map.df$longitude,
clusterOptions = markerClusterOptions())
Pay attention to the cluster west of Cameroun, in the eastern Atlantic. We can safely asume that there are no water pumps there, so we should clean the data a bit2 One more exercise..
First lets have a look at the distrubution of water pump statuses in our training dataset.
# Look at the number of pumps in each functional status group
table(df$status_group)
| functional | functional needs repair | non functional |
|---|---|---|
| 32259 | 4317 | 22824 |
# As proportions
prop.table(table(df$status_group))
| functional | functional needs repair | non functional |
|---|---|---|
| 0.5430808 | 0.0726768 | 0.3842424 |
There are three different statuses in the status_group column (our labels): functional, functional needs repair and non functional. A good next step would be to see this distrubition across the quantity_group variable. This might provide some insights on the prevalence of faulty pumps.
# Table of the quantity variable vs the status of the pumps
table(df$quantity, df$status_group)
| / | functional | functional needs repair | non functional |
|---|---|---|---|
| dry | 157 | 37 | 6052 |
| enough | 21648 | 2400 | 9138 |
| insufficient | 7916 | 1450 | 5763 |
| seasonal | 2325 | 416 | 1309 |
| unknown | 213 | 14 | 562 |
# As row-wise proportions, quantity vs status_group
prop.table(table(df$quantity, df$status_group), margin = 1)
| / | functional | functional needs repair | non functional |
|---|---|---|---|
| dry | 0.0251361 | 0.0059238 | 0.9689401 |
| enough | 0.6523233 | 0.0723197 | 0.2753571 |
| insufficient | 0.5232335 | 0.0958424 | 0.3809241 |
| seasonal | 0.5740741 | 0.1027160 | 0.3232099 |
| unknown | 0.2699620 | 0.0177440 | 0.7122940 |
We can already see some patterns here, lets visualise them next.
## Warning: plotly.js does not (yet) support horizontal legend items
## You can track progress here:
## https://github.com/plotly/plotly.js/issues/53
From this plot we can deduce that most of the dry pumps are not functional, while the classes are more evenly distributed across the other variables.
## Warning: plotly.js does not (yet) support horizontal legend items
## You can track progress here:
## https://github.com/plotly/plotly.js/issues/53
Most pumps with an unknown quality_group are non functional, so are the colored ones.
The most non functional pumps are of the other waterpoint_type.
This is a visualisation of why we want to exclude year = 0 from our visualisations and analysis.
One plausible hypothesis that we can test is that there is a prevalence of dysfunctional water pumps that are old. Let’s test this with a few histograms3 As an exercise we can try a statistical test on this..
And indeed it seems as if there are more non functional pumps that are older, mostly through the 80s.
A random forest model is used for the classification. First we use the sample.split function from the handy caTools package to split our data into training and test datasets (25% for testing).
# train test split
sample = sample.split(df$status_group, SplitRatio = .75)
train = subset(df, sample == TRUE)
test = subset(df, sample == FALSE)
Next we must convert our dataframes into a format suitable for the h2o package to analyse.
# convert dataframes into h2o objects
data.train.hex <- as.h2o(train)
##
|
| | 0%
|
|=================================================================| 100%
data.test.hex <- as.h2o(test)
##
|
| | 0%
|
|=================================================================| 100%
# select predictor and response
Y = "status_group"
X = c("longitude", "latitude", "extraction_type_group", "quality_group", "quantity", "waterpoint_type", "construction_year")
Here we also selected our predictor and response variables. Next we can finally create our model.
# Define the data for the model and display the results
data.rf <- h2o.randomForest(training_frame=data.train.hex, x=X, y=Y)
##
|
| | 0%
|
|=== | 4%
|
|====== | 10%
|
|========== | 16%
|
|================= | 26%
|
|====================== | 34%
|
|============================= | 44%
|
|================================== | 52%
|
|======================================== | 62%
|
|=============================================== | 72%
|
|===================================================== | 82%
|
|========================================================== | 90%
|
|=================================================================| 100%
# Model performance
h2o.confusionMatrix(data.rf, data.test.hex)
| functional | functional needs repair | non functional | Error | Rate | |
|---|---|---|---|---|---|
| functional | 7361 | 95 | 609 | 0.0872908 | 704 / 8,065 |
| functional needs repair | 685 | 218 | 176 | 0.7979611 | 861 / 1,079 |
| non functional | 1541 | 64 | 4101 | 0.2812829 | 1,605 / 5,706 |
| Totals | 9587 | 377 | 4886 | 0.2134680 | 3,170 / 14,850 |
As you can see from the confusion matrix our random forest model has managed to classify a good portion of the values with an overall accuracy of around 0.80. There are ways we can improve the model (we have kept the feature engineering to a minimum for this example), but we can assume that this is a satisfying first result.
Note: some of this code is inspired by a course on Datacamp.