Abstract

Hand-driven and gravity-fed water pumps are a key source of potable water throughout much of Africa and Asia, and the ongoing management and maintenance of such pumps is an ever-present challenge for many communities. This project explored the use of a variety of predictive/classification modeling techniques for purposes of predicting the functional status of water pumps located within the country of Tanzania, on the basis of data provided via a www.DrivenData.org data challenge competition. That data set was found to contain significant quantities of invalid and/or missing data values. As such, a wide variety of statistically valid imputation methods were applied to improve the quality of the data set. The imputation methods were implemented via a Python-based, automated, user-configurable data transformation pipeline that is designed to be both extensible and easily modified for use with other types of data. The transformed data was then used to train various predictive models in both R and Python. A random forest model was found to have the highest overall predictive accuracy, while a bootstrap aggregation model proved to be the most effective at identifying water pumps that are functional but in need of some repairs. Finally, an interactive visualization dashboard was constructed using R and Shiny to help the Ministry of Water more effectively manage and maintain water pumps throughout Tanzania, thereby potentially increasing the availability of potable water for the country’s population.

Key Words: Predictive Modeling, Classification Models, Water Pumps, Tanzania, Pump It Up

Introduction

The availability of clean, potable water is an important public health issue throughout less-developed parts of the world, including many parts of Africa and Asia. In fact, the water delivery infrastructure in some countries is still largely comprised of hand-driven and gravity-fed pumps. While such pumps are vital to the sustenance of many towns and villages, like any type of mechanical device they require ongoing management and periodic maintenance to ensure their proper functioning and to minimize the risk of non-potable water polluting a well.

One country that relies heavily on such pumps is Tanzania, in east Africa. According to Tanzania’s Ministry of Water, more than 74,000 such pumps can be found throughout the country. While the installation of these pumps is largely funded via contributions from charitable and other non-governmental organizations (NGOs), their ongoing maintenance is typically the responsibility of the local community within which they reside. Unfortunately, the cost of maintaining the pumps is often beyond the means of the local community, resulting in pumps becoming non-functional. Furthermore, local communities are often unaware of the need to perform the required maintenance due to the apparent lack of any significant problems with a pump up until the point at which it ultimately fails.

The purpose of this project was to develop, assess, and compare a variety of predictive / classification models, with the best-performing model being implemented as part of a complete end-to-end intelligent system that allows the user (e.g., the Tanzanian Ministry of Water) to proactively identify pumps that are either in need of repair or are in inoperable condition. The research question we are addressing can be stated as follows: Can the functional status of a Tanzanian water pump be accurately predicted on the basis of the attributes provided in the data set?

If so, Tanzania’s Ministry of Water can make use of such predictions to proactively address any identified water delivery system maintenance issues, thereby enhancing the ongoing availability of potable water to Tanzania’s citizens.

In 2016, the Ministry of Water teamed with Taarifa (www.taarifa.org) to sponsor a data challenge competition to assess whether it is possible to predict the functional status of a water pump based on a variety of quantitative and qualitative indicators. The data for this project was sourced from the DrivenData.org website that hosts the data challenge; specifically, we made use of the datasets accessible via the competition’s web page [1].

The data set is comprised of attributes that describe a total of 59,400 Tanzanian water pumps. Each water pump is represented by a total of 39 qualitative and quantitative attributes that describe such things as the type of pump, location, altitude, installation funding source, management method, year constructed, water source, and the quality of the water delivered by the pump. This data trove serves as the basis for the research discussed herein.

Literature Review

Our literature review was focused on identifying recently published commentary related to the Tanzanian Ministry of Water’s data challenge. While no refereed academic journal articles related to the data challenge could be located, a fair number of relevant informal web pages, GitHub repositories, and blog postings that discuss various aspects of the data challenge were identified. These web postings generally discuss the nature of the data challenge, describe rudimentary exploratory data analysis results and data prep / imputation methods, and assess the effectiveness of a single type of classification / predictive model relative to a “cleansed” version of the data that the author(s) derived via their exploratory data analysis (EDA) and data imputation efforts. A subset of the postings provide non-interactive map-based visualizations of the physical location and status of each pump.

Exploratory Data Analysis: Most of the authors whose commentary was reviewed performed rudimentary EDA using tools such as histograms, barplots, and data aggregation via categorical variable values. However, in several instances (e.g., [2, 3, 6]) the histograms or barplots used are rather ineffective for a variety of reasons, including the presentation of data of different magnitudes within a single graphic, a lack of sufficient granularity to allow the viewer to draw a valid conclusion, or a failure to exclude clearly invalid data values from a plot. In general, the authors focus the majority of their EDA efforts on discussing missing data values for various variables and describing the variety of values found within the categorical variables. Two authors [5, 7] comment on the apparent collinearity of several variables within the data set, while [2, 6] comment on the inconsistency of funder and installer names throughout the data set (e.g., “Danida” vs. “Danid”) as well as the coincidence of missing values for those two variables. Very little attention is given to the derivation of potential predictive inferences for each variable, with authors such as [5, 6, 7] providing minimal or no such insights, [8] discussing potential inferences for only 3 variables, and [4] exploring a total of 6 variables for possible predictive inferences.

Data Preparation: Many authors relied on the use of frequency based binning to reduce the dimensionality of the categorical variables, including [2, 5, 6, 7], and most opted to remove sparse variables from the data set prior to model building (e.g., [2, 3, 7]). Imputation for missing data values was either entirely lacking or relied on simplistic use of a variable’s mean or median [2, 4, 6] or the filling of missing values with zeroes [8]. [8] opted to consolidate all ‘funder’ values into 5 distinct new categories, while [2] attempted to bin ‘lga’ values according to whether or not they contained ‘urban’ and ‘rural’ as substrings. [2] suggests creating an ‘age’ variable based on ‘construction_year’ and ‘date_recorded’, while [4] creates a variable indicative of the time elapsed since the data were recorded. [5] employed external population data to augment the sparse ‘population’ variable and also made use of Principal Components Analysis in an attempt to reduce the dimensionality of the ‘funder’ variable.

Predictive / Classification Models: The mix of independent variables used for modeling varied significantly by author, with [3], for example, relying on only seven variables while [7] made use of twenty. Every author made use of some form of decision tree model, with random forests being the most widely applied [2, 3, 4, 5, 6, 7], followed by extra trees [4, 7], gradient boosted decision trees [7, 8], and basic decision trees [4, 7]. One author described the use of multinomial logistic regression [3]. Performance metrics varied widely, likely due to some authors reporting results relative to a model validation data set and others reporting their results relative to the data challenge’s dedicated testing data set. So while [5] claims 94% accuracy for their model, that result is likely applicable only to their training data. Others generally reported accuracies ranging between 80% and 83% for random forest and gradient boosted models [2, 3, 4, 7], while the sole multinomial regression model reported 70.38% accuracy.

Map Based Visualizations: Geomapping of pump statuses was used by [3, 5, 6, 8, 9]. These maps are all static in nature and do not provide the user with any means of exploring the predicted statuses of pumps contained within the dedicated data challenge testing data set.

Methodology

Our approach to addressing the problem at hand was comprised of six distinct components: Data Exploration, Data Preparation, Predictive / Classification Modeling, Predictive Modeling Pipelines, Model Selection, and Visual Presentation. Each is described below.

Experimentation and Results

Data Exploration

For each water pump we were provided with 39 attributes (excluding a unique record identifier) that could potentially be used as predictor variables. The response variable can assume one of three possible values:

  • functional: The pump is operational and not in need of maintenance
  • functional needs repair: The pump is operational but is in need of repair
  • non functional: The pump is inoperable

54.3% of all pumps in the data set were found to be functional, while 38.4% were non functional and 7.3% had a status of functional needs repair.

Each independent variable can be characterized as belonging to one of three classes: non-categorical numeric variables; categorical variables; and administrative variables. Variables classified as “Administrative” in nature appear to serve solely as data management attributes within the data set:

Administrative Variables Comments
id Unique ID for each data record
date_recorded Date of data collection by survey company
recorded_by Name of the data collection / survey company

While detailed analyses for each of the remaining independent variables can be found in the Data Exploration section of the Appendix, key findings from that work are described below.

Analysis of Missing Data Values

Our analysis found that every non-categorical numeric variable contains what appear to be significant quantities of invalid data values, represented by zero or near-zero values:

Numeric Variables Invalid Data Comments
amount_tsh 41,639 of 59,400 records = “0”
gps_height 20,438 of 59,400 records = “0”
population 21,381 of 59,400 records = “0”
longitude 1,812 zero values: likely invalid
latitude 1,819 values > -1: likely invalid
num_private 58,643 of 59,400 records = “0”

Eight categorical variables were also found to have missing data values (as indicated by either ‘NA’ values, zeroes, or blank character strings):

Categorical Var. Distinct Values Missing Comments
funder 1898 3635 3582 NA’s coinc. w installer
installer 2146 3655 3582 NA’s coinc. w funder
subvillage 19288 371
public_meeting 3 3334 valid values: TRUE/FALSE
scheme_management 13 3877
scheme_name 2697 28166
permit 3 3056 valid values: TRUE/FALSE
construction_year 55 20709 valid values: 1960 - 2013

Further analysis of the missing data yielded the following key data insights:

  • Missing data values are found within 31,587 of the 59,400 records within the data set (53.17%).

  • 69.2% of subvillages, 52.1% of wards, 32% of lga's, and 19% of regions have no valid amount_tsh values.

  • Four regions (Dodoma, Kagera, Mbeya, Tabora) were found to have no non-zero values for the following variables: amount_tsh, gps_height, construction_year, num_private, and population. These 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages. The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.

  • 3,582 of the missing funder and installer values coincide with each other

  • 1,812 of the missing gps_height values coincide with missing latitude and longitude values

  • 43 wards have no valid public_meeting values; 75 wards have no valid scheme_management values; 73 wards and 3 lga's have no valid permit values.

The widespread incidence of missing data throughout the data set was a strong indicator of the need for the development of statistically valid data imputation algorithms for many of the affected variables. A discussion of the various imputation strategies used can be found within the Data Preparation section of this document.

Insights Gained via Data Aggregation

Aggregation of data by various variables / values provided additional valuable insights, including:

  • The funder and installer variables contain many instances of what appear to be minor variations of a single valid data value, e.g., “Government of Tanzania” vs. “Ministry of Water” vs. “Water”.

  • There is no one-to-one relationship between the region and region_code variables (i.e., 21 distinct regions; 27 distinct region_codes).

  • There is no one-to-one relationship between district_code and region: district_code values are only unique within a given region, not across regions.

  • Aggregating by extraction_type reveals an apparent chronological progression in the deployment of different pump technologies. For example, we find the median construction_year value for swn 80 extraction types to be 1997 while that of the cemo extraction type is 2009.

  • Aggregating by basin reveals an apparent chronological geographic progression in the installation of pumps throughout Tanzania. For example, we find the Lake Rukwa basin has a median construction_year value of 1989 while the basin of Wami / Ruvu has a median value of 2003.

  • There is no single scheme_management value that is universally applicable to any given extraction_type or waterpoint_type or for any of the composites thereof (e.g., extraction_type_group, etc.), or to any ward, lga, region or basin.

  • Several variables contained within the data set are either duplicative or composites of other variables:

    1. extraction_type_group is a binned / composite version of extraction_type
    2. extraction_type_class is a binned / composite version of extraction_type_group
    3. payment_type is 100% duplicative of payment
    4. quality_group is a binned / composite version of water_quality
    5. quantity_group is 100% duplicative of quantity
    6. source_type is a binned / composite version of source
    7. waterpoint_type_group is a binned / composite version of waterpoint_type

Insights such as these serve as the basis of both the data imputation and model building efforts described herein.
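
These aggregations are straightforward to reproduce. For example, the chronological progression by extraction_type and the payment / payment_type duplication noted above can be confirmed as follows (a pandas sketch, assuming the training data has been loaded into a data frame named tanz):

import pandas as pd

# median construction_year per extraction_type, excluding invalid zero
# years; sorting exposes the apparent chronological progression
valid = tanz[tanz["construction_year"] > 0]
print(valid.groupby("extraction_type")["construction_year"]
           .median()
           .sort_values())

# confirm that payment_type merely duplicates payment
print((tanz["payment"] == tanz["payment_type"]).all())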

Initial Predictive Inferences

Derivation of barplots for five numeric variables (amount_tsh, gps_height, population, longitude, latitude) and for each categorical variable value relative to each of the three possible response variable values allowed us to gain insight into some of the predictive aspects of the data. The plots (which can be found in the Data Exploration section of the Appendix) show that most of the categorical variables do, in fact, contain significant predictive characteristics. For example, we find geographic variability with respect to the functional status of pumps via variables such as basin and region. Additional variability in the functional status of pumps can be found via variables such as scheme_management, construction_year, public_meeting, extraction_type, management, payment, source, quantity, water_quality, and waterpoint_type.

Plots for the numeric variables show that gps_height, longitude, and latitude also contain predictive characteristics, while population and amount_tsh appear likely to have less predictive value. Such inferences are helpful for purposes of model building since they suggest that many variables found within the data set are strong candidates for inclusion within a prospective predictive / classification model. Furthermore, they suggest that varying combinations of independent variables might be employed in an attempt to determine whether any particular combination yields a superior model.

Data Preparation

As discussed above, more than 53% of the records in the data set contain either missing or invalid data values. Our data exploration efforts allowed us to uncover relationships between various independent variables that could serve as the basis of statistically valid imputation algorithms for the missing values of the amount_tsh, gps_height, construction_year, latitude and longitude variables. A summary of the various imputation techniques used for this project is provided in the tables shown below. The first table provides a summary of numeric variable imputation methods as well as the Python classes defined to perform those imputations. The second table provides a summary of the imputations applied to character-based categorical variables. Defining potential imputation options as hyperparameters within our data preparation framework allows us to compare the effects of varying imputation methods via cross-validation [16] or via a holdout method [10]. Please see Appendix B for a detailed explanation of these imputation techniques and Appendix C for the Python source code used to implement the data transformation classes.
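
To illustrate the general idea, the sketch below exposes a single numeric imputation choice as a pipeline hyperparameter that cross-validation can then select between. The class and parameter names here are hypothetical stand-ins; the project’s actual transformation classes appear in Appendix C.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

class ZeroValueImputer(BaseEstimator, TransformerMixin):
    """Replace zero (i.e., invalid) numeric values using a configurable strategy."""
    def __init__(self, strategy="median"):
        self.strategy = strategy

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        nonzero = X[X != 0]
        self.fill_value_ = (np.median(nonzero) if self.strategy == "median"
                            else np.mean(nonzero))
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float).copy()
        X[X == 0] = self.fill_value_
        return X

pipe = Pipeline([("impute", ZeroValueImputer()),
                 ("model", RandomForestClassifier(n_estimators=100))])

# the imputation strategy itself becomes a tunable hyperparameter
grid = GridSearchCV(pipe, {"impute__strategy": ["median", "mean"]}, cv=5)
# grid.fit(X_train, y_train)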

Numeric Variable Imputation Methods

Categorical Variable Imputation Methods

We find three types of categorical variables within the data set:

  1. Variables with large numbers of possible categories such as installer (2146) and subvillage (19288);

  2. Variables with relatively small numbers of possible categories such as basin, region, and payment;

  3. Variables whose values appear to be the result of free-form text input such as scheme_name and funder.

Each of these variables is converted to a numeric categorical prior to model building. For variables with large numbers of categories, we make use of two techniques: frequency based binning and response based binning. During frequency based binning, the levels of a categorical variable are binned based on the percentage of data records represented by a particular level. For response based binning, the levels of an independent categorical variable are binned based on the percentage of data records represented by a particular level relative to the categorical level of a data record’s corresponding target variable. The choice of frequency based and/or response based binning is controlled by a hyperparameter.
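
A minimal sketch of frequency based binning in pandas might look as follows; the helper name and the number of bins are illustrative assumptions:

import pandas as pd

def freq_bin(series: pd.Series, n_bins: int = 5) -> pd.Series:
    """Map each level of a categorical variable to a numeric bin based on
    the share of data records that level represents."""
    level_freq = series.value_counts(normalize=True)  # level -> share
    # quantile-cut the per-level frequencies so that levels of similar
    # prevalence land in the same bin (duplicate edges are merged)
    level_bins = pd.qcut(level_freq, q=n_bins, labels=False,
                         duplicates="drop")
    return series.map(level_bins)

# hypothetical usage: tanz["ward_bin"] = freq_bin(tanz["ward"], n_bins=5)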

To handle categorical columns with free form text, we group the text based on the initial “n” characters of a data value, followed by the creation of dummy variables for each of those groups. Both the number of initial characters to consider and the number of groupings to be created are controlled by hyperparameters. These techniques allow us to reduce the number of levels found within categorical variables. To handle categorical variables with smaller numbers of levels, we create dummy variables to represent the presence/absence of each level within a data record. Please see Appendix B for a detailed explanation of these three techniques.
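
The prefix-grouping of free form text columns can be sketched in a similar fashion; the helper name, the “other” fallback group, and the top_k cutoff are illustrative assumptions:

import pandas as pd

def prefix_dummies(series: pd.Series, n_chars: int = 3,
                   top_k: int = 20) -> pd.DataFrame:
    """Group free form text values by their first n_chars characters and
    emit one dummy variable per retained prefix group."""
    prefixes = series.fillna("unknown").str.lower().str[:n_chars]
    keep = prefixes.value_counts().nlargest(top_k).index
    grouped = prefixes.where(prefixes.isin(keep), other="other")
    return pd.get_dummies(grouped, prefix=series.name)

# hypothetical usage against the 'funder' column:
# funder_dummies = prefix_dummies(tanz["funder"], n_chars=3, top_k=20)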

A number of additional classes support the transformation process; see Appendix B and Appendix C for the details and Python implementations of these classes.

Creating Dedicated Training & Evaluation Subsets

Creation of separate training and evaluation data sets was deemed necessary to allow us to test our classification models on a relatively unbiased set of data records. In preparation for model building we divided the training data into model training and evaluation subsets by extracting a random subset (via stratified sampling based on the target variable) of 20% of the items of the training data set to serve as an evaluation subset. Those same items were then removed from the training data set to create a “model training subset”. The model training subset was used for all the model building activities described herein, while the evaluation subset was used to gauge the effectiveness of each model.
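
This split can be expressed concisely with scikit-learn (a sketch, assuming the transformed feature matrix X and label vector y):

from sklearn.model_selection import train_test_split

# stratifying on the target preserves the functional / functional needs
# repair / non functional proportions within both subsets
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)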

Creating Data Transformation Pipelines

Data transformation pipelines play a significant part in cleansing and preparing the data for model building. The data transformation pipelines can also be applied to streaming data if so desired, allowing our framework to handle newly arriving data robustly. The data transformation modules run independently (wherever possible) to create the transformed data, which can then be readily supplied as input to predictive models. For this project we used Python to develop the data transformation modules and both R and Python for the predictive models. The source code for the data transformation pipelines and predictive models is provided in Appendix C and Appendix E. The following data flow diagram shows the complete data transformation process:

In the data flow diagram shown above, the green nodes represent the classes whose instances can be applied to any data set to perform the transformation implemented in the respective class. The orange nodes represent the classes which cannot be used on any data set other than the data set of this project. Please see Appendix C for the actual Python implementation of these classes, and Appendix B for a tabular summary.

In the diagram, processes 1a - 1h are instances of the DataFrameSelector class, which facilitates the selection of the required columns from the input data. Process 1a extracts the funder and installer columns, both of which contain free form text data, and creates a data frame (C1). Process 1b extracts the basin, region, district_code, public_meeting, scheme_management, permit, extraction_type_group, management, payment_type, water_quality, quantity_group, source, source_class, and waterpoint_type columns to create a second data frame (C2); the variables in C2 each have a relatively low number of levels (fewer than 20). Process 1c extracts the subvillage, lga, and ward columns to create a third data frame (C3); the variables in C3 have very large numbers of levels.

The C1, C2, and C3 data frames are then supplied as inputs to three instances (2a, 2b, and 2c) of the HandleCategoricalNulls class, which substitutes a desired value (unknown by default) in place of any null values and optionally converts text to lower case, producing the C4, C5, and C6 data frames. The C4 data frame is supplied as input to an instance of the FunderInstTransformer class (represented as 3), which extracts the initial 3 characters of the funder and installer values and creates another data frame (C7). Data frames C7 and C5 are passed as inputs to instances 4a and 4b of the CatMultiLabelTransformer class, which creates dummy variables for the input categorical variables; these instances produce the C8 and C9 data frames.

The C6 data frame is supplied as input to an instance of the ChooseCatPipelineType class (represented by 13), which selects the frequency based pipeline, the response based pipeline, or both, as controlled by a hyperparameter. Based on that choice, the data are supplied to instances of the FreqBasedCategoricalBinning class (represented by 5) and/or the RespBasedCategoricalBinning class (represented by 6); the dotted arrows between node 13 and nodes 5 and 6 indicate that choice. Processes 5 and 6 produce the C10 and C11 data frames.
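
For reference, a minimal column-selection transformer consistent with the DataFrameSelector described above can be written as follows (a sketch; the project’s full implementation appears in Appendix C):

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from a pandas DataFrame so that each
    pipeline branch sees only the variables it is meant to transform."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn from the data

    def transform(self, X):
        return X[self.attribute_names]

# e.g., process 1a selects the free form text columns:
# C1 = DataFrameSelector(["funder", "installer"]).fit_transform(tanz)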

Processes 1d, 1e, 1f, 1g, and 1h extract the amount_tsh, construction_year, latitude / longitude, gps_height, and population variables respectively, producing the N1, N2, N3, N4, and N5 data frames. N1, N2, and N3 are processed by the AmountTSHTransformer (node 12), YearTransformer (node 10), and LatitudeLongitudeProcess (node 11) instances respectively, producing the N6, N7, and N8 data frames. The N4 and N5 data frames are processed by the GPSHeightTransformer (node 8) and PopulationTransformer (node 9) instances, each combining with the N8 (imputed latitude and longitude) data frame, producing the N9 and N10 data frames. All of the numeric data frames (N6 - N10) are supplied as input to a StandardScaler instance (node 7), producing a single scaled numeric data frame. Finally, the C8, C9, C10, and C11 data frames and the scaled numeric data frame are combined via scikit-learn’s FeatureUnion class to form the output data set.
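
The assembly of the parallel branches can be sketched with scikit-learn’s Pipeline and FeatureUnion classes, reusing the DataFrameSelector sketch above (simplified to a single numeric branch; the categorical branches follow the same pattern):

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

# one branch per group of columns; FeatureUnion concatenates the branch
# outputs column-wise to form the final model-ready data set
numeric_pipeline = Pipeline([
    ("select", DataFrameSelector(["gps_height", "population"])),
    ("scale", StandardScaler()),
])

full_pipeline = FeatureUnion(transformer_list=[
    ("numeric", numeric_pipeline),
    # ("categorical", categorical_pipeline),  # binning / dummy branches
])

# transformed = full_pipeline.fit_transform(tanz)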

Predictive / Classification Modeling & Model Selection

Six distinct types of classification models were constructed for purposes of predicting the current status of a given water pump:

  1. Multinomial Logistic Regression (MLR)
  2. K-Nearest Neighbors (KNN)
  3. Support Vector Machines (SVM)
  4. Bootstrap Aggregation (Bagging)
  5. Random Forest
  6. C5.0 Boosted Classification Trees

Detailed analysis of the performance of each type of model can be found in Appendix D.
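
As a flavor of the model-fitting workflow, a random forest can be trained against the model training subset as follows (a scikit-learn sketch with illustrative hyperparameters; the project’s random forest was actually trained in R, as noted below):

from sklearn.ensemble import RandomForestClassifier

# illustrative hyperparameters only; see Appendix E for the actual code
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)
eval_preds = rf.predict(X_eval)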

The table below provides a summary of the performance of each type of model relative to the model validation data subset discussed in the Data Preparation section above. In the table, each type of model is measured via five distinct metrics:

  1. The overall accuracy of a model’s predictions for the validation subset (“Overall Accu.”);
  2. Balanced accuracy relative to the functional pump status (“Accu.(F)”);
  3. Balanced accuracy relative to the functional needs repair pump status (“Accu.(FNR)”);
  4. Balanced accuracy relative to the non-functional pump status (“Accu.(NF)”);
  5. The model’s Kappa score.

The table lists the models in descending rank order based on their overall accuracy scores.

Model Type Overall Accu. Accu.(F) Accu.(FNR) Accu.(NF) Kappa
Random Forest (R) 0.8113 0.8173 0.64064 0.8417 0.6418
C5.0 Boosted Trees 0.8040 0.8140 0.65627 0.8367 0.6318
Bootstrap Aggregation 0.7991 0.8138 0.67610 0.8378 0.6281
SVM 0.7650 0.7636 0.55706 0.7887 0.5391
Multinomial Log. Regr. 0.7508 0.7554 0.55316 0.7775 0.5156
KNN 0.7493 0.7664 0.64735 0.7842 0.5334
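
For reference, these metrics can be reproduced from a model’s predictions along the following lines (a sketch; balanced accuracy for a given class is taken here as the mean of its one-vs-rest sensitivity and specificity):

import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

overall = accuracy_score(y_eval, eval_preds)   # "Overall Accu."
kappa = cohen_kappa_score(y_eval, eval_preds)  # "Kappa"

def class_balanced_accuracy(y_true, y_pred, cls):
    """Mean of one-vs-rest sensitivity and specificity for class cls."""
    pos_true = np.asarray(y_true) == cls
    pos_pred = np.asarray(y_pred) == cls
    sensitivity = (pos_true & pos_pred).sum() / pos_true.sum()
    specificity = (~pos_true & ~pos_pred).sum() / (~pos_true).sum()
    return (sensitivity + specificity) / 2

acc_fnr = class_balanced_accuracy(y_eval, eval_preds,
                                  "functional needs repair")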

As shown in the table, the Random Forest model achieved the highest overall accuracy of the six types of classification models relative to the model validation data subset. However, three other types of models (KNN, Bootstrap Aggregation, and C5.0 Boosted Trees) were more proficient at accurately identifying water pumps having a status of functional needs repair. These results highlight a potential challenge in identifying a “best performing” classification model: the model with the highest overall accuracy may not be the most effective across all of a data set’s possible response variable values. For example, if Tanzania’s Ministry of Water is most interested in accurately identifying pumps that are working but in need of repair, the Bootstrap Aggregation model may be preferable to the Random Forest model despite the fact that the Bootstrap Aggregation model has a lower overall accuracy. As such, selection of a preferred model should be performed relative to the priorities and needs of the prospective model user.

From a computational efficiency perspective, the random forest, C5.0, MLR, and bootstrap aggregation algorithms could each be trained within approximately 30 minutes on a modest Windows-based PC with 6 GB of RAM and were capable of generating any required predictions within approximately 5 minutes. The KNN and SVM models each required far longer training times and were relatively slow at generating predictions. Therefore, long-term reliance on either the KNN or SVM models may prove to be impractical. The relative inefficiencies of those two types of models combined with their relatively lower overall predictive performance led us to remove them from consideration for the “best performing” model type.

Of the remaining four model types (MLR, bootstrap aggregation, C5.0 boosted trees, random forest), the MLR model was dropped from consideration as “best performing” due to its overall accuracy score falling nearly five full percentage points below that of the other three.

An attempt was also made to improve the overall performance of our modeling efforts by combining the output of the Random Forest, Bootstrap Aggregation and C5.0 Boosted Tree methods as an ensemble model. However, the results of that approach did not improve upon those achieved by the random forest model.

Therefore, we recommend the use of the Random Forest model if overall accuracy across each of the three possible water pump statuses is of most importance to the Ministry of Water. However, if correctly identifying the largest number of pumps that are functional but in need of repair is the top priority, the Bootstrap Aggregation model should be preferred.

Visual Presentation

The interactive visual presentation was created using the Shiny toolset within R and is comprised of five separate panels:

  • About provides a brief synopsis of the project and lists the individual members of the project team.

  • Full Writeup presents the user with a zoomable, downloadable PDF version of this document in its entirety, including the appendices.

  • Data Table allows the user to explore the data used for model testing, including any imputed data values, via an interactive, searchable data table. Results of each of the six predictive/classification models for each pump represented in the model testing data set are also included within the data table and may be searched / sorted interactively.

  • Plot makes use of interactive bar plots to visualize the output of each of the six predictive models relative to a user-selected independent/predictive variable. These visualizations provide the user with an intuitive method of identifying specific pump characteristics that are most indicative of a pump being either functional, functional needs repair or non functional.

  • Map Visualization allows the user to generate an interactive geoplot showing the location, population, and wellpoint type for each water pump represented within the data challenge’s test data set. The user can pan/zoom within the map. Hovering over a specific pump will display the wellpoint type while clicking on a pump will display what the predicted status of the pump was for each of the six types of predictive / classification models.

The visualization can be accessed via the following link:

Discussion and Conclusions

The purpose of this project was to attempt to predict the functional status of a water pump from one of three possible status values (functional, functional needs repair, non functional) based on a variety of qualitative and quantitative water pump attributes and geographical variables. Data related to 59,400 water pumps was used for training and validating six distinct types of classification models (MLR, KNN, Bootstrap Aggregation, SVM, Random Forest, and C5.0 Boosted Classification Trees), and the output of those models serves as the basis of an interactive geomap whereby the user can explore the predicted statuses of more than 14,000 additional pumps whose functional status was not available as part of the original data set.

Our modeling efforts indicated that the quantity, extraction_type, waterpoint_type, and the derived pump_age variables appeared to have the highest levels of predictive utility for the three types of models from which such assessments could be made (MLR, C5.0, random forest). The latitude, longitude, and gps_height variables appeared to be of little value to the MLR model but had very high predictive value for the random forest model and moderately high predictive value for the C5.0 model. Other variables found to have relatively high levels of predictive value for MLR, C5.0 and random forest included payment, source, region_code and district_code. By contrast, the subvillage variable and many of the binned ward and lga variables were found to have relatively low predictive strength across those three types of models, and the region variable was found to add no value if the region_code variable was already included within a model.

The overall accuracy of the six models when applied to the validation subset ranged from a low of approximately 75% (e.g., MLR, KNN) to a high of just over 81% (random forest), though the KNN, Bootstrap Aggregation, and C5.0 models all outperformed the random forest model for accurately identifying water pumps having a status of functional needs repair. Computationally, the Random Forest, Bootstrap Aggregation and C5.0 models each proved to be far more efficient and/or accurate than the MLR, KNN and SVM models. An ensemble model that combined the output of the Bootstrap Aggregation, Random Forest and C5.0 models did not outperform the results obtained via the Random Forest model.

The Random Forest model scored highest within the context of the DrivenData.org data challenge, achieving an overall accuracy of .8195 (just .009 less than the top score of .8285) and a ranking within the top seven percent of all submissions. Therefore, the Random Forest model is recommended for use by Tanzania’s Ministry of Water if overall accuracy across each of the three possible pump statuses is of most importance, while the Bootstrap Aggregation model should be preferred if identifying the largest number of pumps that are functional but in need of repair is the top priority.

Finally, the interactive visual dashboard developed during this project represents a distinct contribution relative to the available literature for this specific DrivenData.org data challenge. The dashboard allows a user to not only explore the predicted status of pumps for all six types of predictive/classification models via an interactive geomap but also provides the user with a way to explore those predictions relative to each of the independent variables that were used for model training. Such flexibility allows the user to understand the relationship between pump statuses and attributes such as extraction_type, water quality, how users pay for use of a pump, and various approaches to management of pumps, amongst others. Insights gained from such exploration can help the Ministry of Water more effectively manage and maintain pumps throughout Tanzania, thereby potentially increasing the availability of potable water for the country’s population.

Future Work: While the conclusions described herein apply only to the water pumps described in the aforementioned DrivenData.org data set, the extensible/reusable design of the data transformation pipeline allows much of what we’ve implemented to potentially be applied to other types of predictive/classification modeling efforts. Future work could include extending our visualization tool to incorporate graph theory techniques for purposes of identifying clusters of pumps that are in need of repair. Such functionality could provide the Ministry of Water with a way to improve the efficiency of its repair efforts, e.g., repair teams could be deployed to clusters of problematic pumps. Future work could also include the development and training of a neural network classification algorithm to determine whether such an approach would improve upon the results we’ve documented herein; a neural network approach proved computationally prohibitive given the computing resources available to us during this project.

References

  1. Tanzania Water Competition: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/

  2. Jdills: https://github.com/jdills26/Tanzania-water-table

  3. Domptail: https://github.com/domptail/Tanzania-Water-Wells-ongoing

  4. Mattiaf: https://github.com/drivendataorg/pump-it-up/blob/master/mattiaf/pumps.ipynb

  5. Agrawal et al.: http://www.contrib.andrew.cmu.edu/~kdagrawa/documents/waterpump.pdf

  6. Joomik: https://joomik.github.io/waterpumps/

  7. Kalluri: https://medium.com/towards-data-science/predicting-the-functional-status-of-pumps-in-tanzania-355c9269d0c2

  8. Cheng: http://blog.nycdatascience.com/student-works/linlin_cheng_proj_5/

  9. DataCamp: https://github.com/datacamp/courses-drivendata-water-pumps/blob/master/chapter1.md

  10. Wikipedia - Training/Test/Validation: https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets

  11. Tanzania map: https://en.wikipedia.org/wiki/Template:Location_map_Tanzania

  12. Tanzania sub-divisions: https://en.wikipedia.org/wiki/Subdivisions_of_Tanzania

  13. Decimal degrees: https://en.wikipedia.org/wiki/Decimal_degrees

  14. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani - An Introduction to Statistical Learning with Applications in R, Springer, 2013

  15. FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy

  16. Wikipedia - Cross Validation - https://en.wikipedia.org/wiki/Cross-validation_(statistics)

Appendix A: Data Exploration

NOTE: An online version of this Appendix can be found at the following Rpubs.com page:

Load Training Data

# load tanzania training data file
tanz <- read.csv("https://raw.githubusercontent.com/jtopor/DDO-TNZ/master/Tanz-Training-Data.csv", header = TRUE, stringsAsFactors = FALSE)

# count distinct items
length(unique(tanz$id))
## [1] 59400
# summary(tanz)

Load Training Labels

# load tanzania training data labels
tanz_trl <- read.csv("https://raw.githubusercontent.com/jtopor/DDO-TNZ/master/Tanz-Training-Labels.csv", header = TRUE, stringsAsFactors = FALSE)

tanz$label <- tanz_trl$status_group

Determine how many rows have complete cases and how many have NA’s

# count number of rows with no NA's = 53281
(c.cases <- sum(complete.cases(tanz)) )
## [1] 53281
# now calculate number of rows with NA's = 6119
(na.cases <- nrow(tanz) - c.cases)
## [1] 6119
na.cases / nrow(tanz)
## [1] 0.1030135
  • 53281 complete cases

  • 6119 rows with NA’s

  • 10.3% of rows have NA’s

Calculate Number of NA’s for Each Variable (Including Blank Strings)

# fill in all blank entries with NA
tanz[tanz == ""] <- NA

# count NA's in each column
# kable(colSums(is.na(tanz)), col.names = c("NA Count") )

# recount number of rows with no NA's = 27813
(c.cases <- sum(complete.cases(tanz)) )
## [1] 27813
# now calculate number of rows with NA's = 31587
(na.cases <- nrow(tanz) - c.cases)
## [1] 31587
na.cases / nrow(tanz)
## [1] 0.5317677
  • Only 27813 complete cases after filling in blank string values with NA

  • 31587 rows with NA’s

  • 53.17% of rows have NA’s

Count number of pumps for each pump’s status value

## # A tibble: 3 x 2
##                     label TotalWells
##                     <chr>      <int>
## 1              functional      32259
## 2          non functional      22824
## 3 functional needs repair       4317

The counts shown above provide a benchmark of sorts for purposes of evaluating each of the variables to be discussed below. They tell us that 54.3% of all pumps are functional, 38.4% are non-functional, and 7.3% are functional but in need of repair. We can use these metrics to partially assess each of the individual categorical variable values found within the data set; as we analyze each variable value, we can determine whether or not the percentage of pumps pertaining to that variable value either exceeds or falls short of these overall performance metrics. For example, those exceeding the 54.3% “functional” metric may share characteristics that poorer performing pumps may benefit from emulating / replicating.

Summary of Variables in Data Set

The data set is comprised of a total of 40 variables, six of which are numeric, 31 of which are categorical, and three of which are strictly for administrative/reference purposes. The following is a brief summary of each of the independent variables found within the data set.

Variable Definition
id Identification Variable (not a predictor)
date_recorded Date of data collection by survey company
recorded_by Name of the data collection / survey company
amount_tsh Metric indicating total static head for pump; should be > 0
gps_height Altitude of pump
population Human population in immediate vicinity of pump
longitude Longitudinal coordinate of pump
latitude Latitudinal coordinate of pump
num_private No definition is available for this variable
funder Name of organization that funded installation of pump
installer Name of organization that installed the pump
wpt_name Name assigned to given waterpoint
basin Name of geographic water basin where pump is located
subvillage Name of geographic subvillage where pump is located
region Name of geographic region in Tanzania where pump is located
region_code Numeric ID for ‘region’ variable
district_code Numeric ID of district within a region where pump is located
lga Tanzania-specific geographic indicator of where pump is located
ward Name of Tanzanian geographic ward where pump is located
public_meeting True/False indicator
scheme_management Type of the organization responsible for management of pump
scheme_name Name of organization responsible for management of pump
permit True/False indicating whether the pump has valid permit
construction_year The year the pump was installed
extraction_type Method of extraction used at a given pump site
extraction_type_group Aggregation of extraction_type categories
extraction_type_class Aggregation of extraction_type_group categories
management Name of method employed for management of a given pump
management_group Possibly an aggregation of management categories
payment Categorical indicator of payment method required of pump users
payment_type Appears to be a duplicate of payment categories
water_quality Categorical indicator of water quality produced by pump
quality_group Aggregation of water_quality categories
quantity Categorical indicator of water quantity produced by pump
quantity_group Aggregation of quantity categories
source Type of source of the water for a given pump
source_type Aggregation of source categories
source_class Aggregation of source_type categories
waterpoint_type The type of pump installed at a well site
waterpoint_type_group Aggregation of waterpoint_type categories

Each of these variables can be characterized as belonging to one of three classes of independent variables: non-categorical numeric variables; categorical variables; and administrative variables. The tables below summarize our analyses of missing and possibly invalid values for each variable.

Numeric Variables

Numeric Variables Comments
amount_tsh 41,639 of 59,400 records = “0”
gps_height 20,438 of 59,400 records = “0”
population 21,381 of 59,400 records = “0”
longitude 1,812 zero values: likely invalid
latitude 1,819 values > -1: likely invalid
num_private 58,643 of 59,400 records = “0”

Categorical variables

Categorical Variable Distinct Values NA’s Comments
funder 1898 3635 3582 NA’s coinc. w installer
installer 2146 3655 3582 NA’s coinc. w funder
wpt_name 37400 0 3563 vals = “none”
basin 9 0
subvillage 19288 371
region 21 0
region_code 27 0
district_code 20 0 Not unique across regions
lga 125 0
ward 2092 0
public_meeting 3 3334 binary + NA’s
scheme_management 13 3877
scheme_name 2697 28166
permit 3 3056 binary + NA’s
construction_year 55 20709
extraction_type 18 0
extraction_type_group 13 0 composite of extr_type
extraction_type_class 7 0 composite of extr_type_group
management 12 0
management_group 5 0
payment 7 0
payment_type 7 0 dupe of payment
water_quality 8 0
quality_group 6 0 composite of water_quality
quantity 5 0
quantity_group 5 0 dupe of quantity
source 10 0
source_type 7 0 composite of source
source_class 3 0 binary + “unknown”
waterpoint_type 7 0
waterpoint_type_group 6 0 composite of waterpoint_type

Four region names are found to have no non-zero values for the following variables:

  • amount_tsh
  • gps_height
  • construction_year
  • num_private
  • population

The four regions are:

  • Dodoma
  • Kagera
  • Mbeya
  • Tabora

These 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages.

The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.

The lack of non-zero values throughout the four indicated regions for the five variables listed above makes it highly unlikely that we will be able to effectively derive imputed values for the zero values of those five variables using the geographical indicators provided within the data set.

Administrative / Non-Predictive Variables

Administrative Variables Data Type
id int
date_recorded date
recorded_by char str

The administrative variables shown above have no predictive value and will be ignored for purposes of model building. The id variable serves as a unique identifier for each data record within the data set. The date_recorded variable indicates the date on which each record presented within the data was collected, while the recorded_by variable contains the name of the survey firm that collected the data.

The numeric and categorical variables listed above are explored in detail below.

Exploration of Numeric Variables

amount_tsh

The amount_tsh variable can be characterized as representing the “Total static head (amount water available to waterpoint)”. According to the Encyclopedia of Marine Technology, the definition of “total static head” is “the vertical height of a stationary column of liquid produced by a pump” [ref needed]. In other words, it is the height to which a pump can raise water above its source. Therefore, a pump with a total static head value of zero would not be particularly useful, since it would not be capable of producing water from a source located below its output point.

Summary statistics and a boxplot for the amount_tsh variable indicate a heavily right-skewed distribution:

summary(tanz$amount_tsh)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##      0.0      0.0      0.0    317.7     20.0 350000.0
boxplot(tanz$amount_tsh, col = "yellow")

In fact, analysis of the amount_tsh variable indicates that 41,639 of the 59,400 data records (70.3%) have a value of “0”:

length(tanz$amount_tsh[tanz$amount_tsh == 0])
## [1] 41639

Analysis of the amount_tsh = 0 values relative to various geographic variables provided within the data set reveals a significant amount of clustering of such values by subvillage, ward, lga, district_code, and region. The table below provides a summary of this clustering.

Geographic Var. Num of categories w No amount_tsh Value
subvillage 13357 / 19288 (69.2%)
ward 1090 / 2092 (52.1%)
lga 40 / 125 (32%)
district_code 2 / 20 (10%)
region 4 / 21 (19%)

As shown in the table, more than two-thirds of subvillages and one-half of wards represented within the data set have no non-zero amount_tsh values, as do 32% of lga’s, 10% of district_codes and 19% of regions. Such clustering might suggest a systemic data collection issue within those geographic areas, e.g., perhaps the pumps located within those areas were not actually visited by the surveying company that compiled the data. Alternatively, the total static head for many pumps may simply not be known if it was not recorded when the pump was originally installed.

In any event, the lack of non-zero data values across so many geographic indicators strongly suggests that imputing the zero values found within the amount_tsh variable via geographic indicators will not be possible. Furthermore, given the widespread lack of non-zero amount_tsh values found for these geographic areas, it seems highly likely that the zero values are simply invalid.

Analysis shows that 14,771 of the non-zero values fall within the range of (1 : 1000):

atsh <- tanz$amount_tsh[tanz$amount_tsh > 0 & tanz$amount_tsh <= 1000]
length(atsh)
## [1] 14771
tot_atsh <- hist(atsh, col = 'yellow', main = "Histogram of amount_tsh values (1:1000)")

This leaves slightly less than 3,000 records having amount_tsh values that exceed 1000. Such values should be investigated to determine whether they are in fact valid values for the variable. Of those 3000 records, 2,810 are found to lie within the range of (1000:10,000):

atsh_10k <- tanz$amount_tsh[tanz$amount_tsh > 1000 & tanz$amount_tsh <= 10000]
length(atsh_10k)
## [1] 2810

Barplots for amount_tsh values within the range of (1:10000) suggest that pumps with relatively larger values may be more likely to be functional than are pumps having smaller values. However, since the larger values might possibly be outliers and we lack valid amount_tsh values for more than 70% of the records within the data set, it seems highly unlikely that we can derive any valid predictive inferences from the plots.

Given the large percentage of missing values for the variable, other variables within the data set were examined to determine whether any of their attributes might serve as indicators or proxies for missing amount_tsh values. The extraction_type_group and waterpoint_type_group categorical variables are each at least somewhat indicative of the type of pump deployed at each waterpoint. Assuming similar types of pumps generally share somewhat similar physical characteristics and limitations, there may be a relationship we can infer between these two variables and the amount_tsh value each pump may reasonably have. The tables below summarize the median non-zero amount_tsh values for each category of the extraction_type_group and waterpoint_type_group variables.

# dplyr provides the group_by / summarise / arrange functions used below
library(dplyr)

# isolate non-zero amount_tsh records
t_atsh <- subset(tanz, amount_tsh !=0)

# get the median amount_tsh value for each extraction_type_class
etg_mtsh <- arrange(summarise(group_by(t_atsh, extraction_type_group), 
                     MedTsh = median(amount_tsh) ), desc(MedTsh) )

etg_mtsh
## # A tibble: 13 x 2
##    extraction_type_group MedTsh
##                    <chr>  <dbl>
##  1       other motorpump   7500
##  2        other handpump   1000
##  3               afridev    500
##  4               gravity    500
##  5         india mark ii    500
##  6           nira/tanira    500
##  7                 other    500
##  8             rope pump    500
##  9                swn 80    500
## 10                  mono     50
## 11           submersible     50
## 12          wind-powered     50
## 13        india mark iii     25
# check avg tsh for extraction_type_class
# mean(tanz$amount_tsh[which(tanz$extraction_type_class == 'gravity')])

# get the median amount_tsh value for each waterpoint_type_group
wptg_mtsh <- arrange(summarise(group_by(t_atsh, waterpoint_type_group), 
                     MedTsh = median(amount_tsh) ), desc(MedTsh) )

wptg_mtsh
## # A tibble: 5 x 2
##   waterpoint_type_group MedTsh
##                   <chr>  <dbl>
## 1       improved spring   2000
## 2             hand pump    500
## 3    communal standpipe    200
## 4                 other    200
## 5         cattle trough    100
# memory cleanup
rm(t_atsh)

Since each pump represented within the data set has known values for both of these variables, one or both of the median values shown in the tables above may allow us to impute a reasonable value for each instance of amount_tsh = 0 within the data set.
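
A pandas sketch of this group-median imputation idea follows (matching the Python transformation pipeline described in the Methodology, and assuming the training data is loaded into a pandas DataFrame named tanz; the grouping variable could equally be waterpoint_type_group):

# median non-zero amount_tsh for each extraction_type_group ...
nonzero = tanz[tanz["amount_tsh"] > 0]
group_medians = nonzero.groupby("extraction_type_group")["amount_tsh"].median()

# ... mapped onto the records whose amount_tsh value is zero
is_zero = tanz["amount_tsh"] == 0
tanz.loc[is_zero, "amount_tsh"] = (
    tanz.loc[is_zero, "extraction_type_group"].map(group_medians))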

gps_height

The gps_height variable represents the physical altitude of a pump. Summary statistics and plots for the gps_height variable indicate a heavily right-skewed distribution:

summary(tanz$gps_height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -90.0     0.0   369.0   668.3  1319.2  2770.0
boxplot(tanz$gps_height, col = "yellow")

hist(tanz$gps_height, col = "yellow")

Both the summary statistics and the boxplot show that a portion of the data records contain a negative gps_height value. An analysis of the variable finds 1,496 such records within the data set:

length(tanz$gps_height[tanz$gps_height < 0])
## [1] 1496

While it may be somewhat surprising to have a pump located below sea level, Africa is in fact home to many areas where the actual land altitude falls below sea level. As such, further investigation is required to determine whether or not these negative values are valid within the context of the data set.

The histogram shown above indicates that the majority of non-zero and non-negative gps_height values lie in the range of (1000:2000). The histogram also indicates the presence of a significant number of zero values for the variable. Further analysis finds a total of 20,438 data records containing a zero value for the gps_height variable.

length(tanz$gps_height[tanz$gps_height == 0])
## [1] 20438

1812 of these zero values occur when values for both latitude and longitude are apparently unknown:

nrow(subset(tanz, latitude > -1  & longitude == 0 & gps_height == 0) )
## [1] 1812

The fact that those zero values coincide with missing latitude and longitude coordinates strongly suggests that they are also representative of missing data.

Analysis of the presence of gps_height = 0 values relative to various geographic variables provided within the data set reveals a significant amount of clustering of such values by subvillage, ward, lga, district_code, and region. The table below provides a summary of this clustering.

Geographic Var. Num of categories w No gps_h Value
subvillage 7265 / 19288 (37.6%)
ward 776 / 2092 (37.1%)
lga 42 / 125 (33.6%)
district_code 2 / 20 (10%)
region 4 / 21 (19%)

As shown in the table, more than one third of subvillages, wards and lga’s represented within the data set have no non-zero gps_height values, as do 10% of district_codes and 19% of regions. Such clustering suggests a systemic data collection issue within those geographic areas, e.g., perhaps the pumps located within those areas were not actually visited by the surveying company that compiled the data. If in fact the physical pumps had been visited during data collection, use of a basic, inexpensive altimeter could have provided the required altitude measurement.

In any event, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.

Further analysis is required to determine whether the remaining zero values are in fact valid data values or if they merely represent the lack of an accurate altitude measurement for the associated pumps. However, given the widespread lack of accurate measurements found for these geographic areas it seems highly likely that the zero values are simply invalid.

Barplots for the non-zero gps_height values show that pumps located at relatively higher altitudes are more likely to be functional than are pumps found at lower altitudes. However, pumps located at altitudes higher than a gps_height value of approximately 2500 are also the most likely to be functional needs repair, followed by those lying within the approximate range of (1000:1600). We also find that pumps located at relatively low altitudes (i.e., between 0 and 800) are the most likely to be non-functional.

population

The population variable represents the human population in the area surrounding a pump. Summary statistics and plots for the population variable indicate a heavily right-skewed distribution:

summary(tanz$population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0    25.0   179.9   215.0 30500.0
boxplot(tanz$population, col = "yellow")

hist(tanz$population, col = "yellow")

Our analysis indicates the presence of 21,381 data records containing a zero value for the population variable.

length(tanz$population[tanz$population == 0])
## [1] 21381

While it may in fact be realistic to find a pump located in an unpopulated area, it seems highly unlikely that such a large number of pumps would reside in locations that are devoid of human inhabitants. As such, many of these zero values are likely to be invalid.

As with the amount_tsh and gps_height variables, we find a significant amount of clustering relative to the subvillage, ward, lga, district_code, and region geographic variables, as shown in the table below.

Geographic Variable    Categories with No Non-Zero population Value
subvillage             7291 / 19288 (37.8%)
ward                   779 / 2092 (37.2%)
lga                    42 / 125 (33.6%)
district_code          2 / 20 (10%)
region                 4 / 21 (19%)

The statistics are very similar, though not identical, to those derived for the amount_tsh and gps_height variables, and once again suggest the possibility of a systemic data collection issue on the part of the surveying firm that collected the data. Furthermore, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.

Analysis of the non-zero population figures reveals a highly right-skewed distribution, with a median value of 150 and a maximum value of 30,500.

summary(tanz$population[tanz$population > 0])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    40.0   150.0   281.1   324.0 30500.0
boxplot(tanz$population[tanz$population > 0], col = "yellow")

hist(tanz$population[tanz$population > 0], col = "yellow")

The summary statistics and plots suggest that the vast majority of the pumps are located in the vicinity of small villages of fewer than 500 inhabitants. This suggests that it may be feasible to impute the missing population values for predictive modeling purposes via use of the median. However, such an approach may have an adverse effect on the distribution of the variable.
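For illustration, a minimal sketch of such a median-based imputation, assuming the zero population values are treated as missing:

# impute zero population values using the median of the non-zero values (150);
# note this places ~21,000 records at a single value, which may distort the
# variable's distribution as noted above
pop_median <- median(tanz$population[tanz$population > 0])
tanz$population[tanz$population == 0] <- pop_median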

Barplots for the non-zero population values of less than 1000 show no obvious relationship between population and the likelihood of a pump being functional. However, pumps located in areas with relatively larger populations do appear to be somewhat more likely to have a status of functional needs repair than are pumps located in less densely populated areas. Finally, pumps located in relatively higher population areas appear to be somewhat less likely to be non functional than are pumps located in areas with relatively smaller populations.

Longitude & Latitude

Longitude and latitude coordinates are provided for each pump within the data set. Summary statistics and plots indicate that at least a portion of the coordinates provided may be invalid, as indicated by the zero and near-zero values found in both variables.

summary(tanz$longitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   33.09   34.91   34.08   37.18   40.35
boxplot(tanz$longitude, col = "yellow")

hist(tanz$longitude, col = 'yellow')

summary(tanz$latitude)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.649  -8.541  -5.022  -5.706  -3.326   0.000
boxplot(tanz$latitude, col = "yellow")

hist(tanz$latitude, col = 'yellow')

Further analysis finds 1812 zero values within the longitude variable and 1819 near-zero values (greater than -1) for the latitude variable.

# longitude zero values
length(tanz$longitude[tanz$longitude == 0])
## [1] 1812
# latitude near-zero values (> -1)
length(tanz$latitude[tanz$latitude > -1])
## [1] 1819

A cross referencing of the two variables shows that every instance of a zero longitude value does in fact correspond with an instance of a near-zero (> -1) latitude value:

nrow(subset(tanz, latitude > -1  & longitude == 0))
## [1] 1812

The fact that these values appear to be coincident strongly suggests that they are simply invalid data values. However, for each record within the data set we are provided with a variety of other geographical indicators, including ward, district_code, region, and basin. Given the relatively small number of invalid longitude and latitude values present within the data set, a reasonable approach to their imputation could be based on calculating the mean or median longitude and latitude values for a data record’s ward or region and using those values as the imputed values for the missing data.
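A minimal sketch of that approach using ward-level medians is shown below; wards containing no valid coordinates at all would require a fallback to a broader indicator such as region:

# median coordinates per ward, computed from valid records only
valid_coords <- subset(tanz, longitude != 0 & latitude < -1)
ward_coords <- summarise(group_by(valid_coords, ward),
                         med_long = median(longitude),
                         med_lat  = median(latitude))

# replace invalid coordinates with their ward's medians; match() yields NA
# for any ward lacking valid coordinates, which would need a region fallback
bad <- which(tanz$longitude == 0)
idx <- match(tanz$ward[bad], ward_coords$ward)
tanz$longitude[bad] <- ward_coords$med_long[idx]
tanz$latitude[bad]  <- ward_coords$med_lat[idx]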

Barplots for the non-zero longitude values indicate that pumps located between 34 and 38 degrees longitude are more likely to be functional than are pumps at other longitudes, while those located between 30 and 31 degrees longitude are much more likely to have a status of functional needs repair than are pumps located at other longitudes. Finally, we see that longitude can be somewhat indicative of how likely a pump is to be non functional. Therefore, the longitude variable appears to offer a fair amount of predictive value.

Barplots for latitude values of less than (-1) indicate that pumps at some latitudes are much more likely to be either functional or functional needs repair than are pumps at other latitudes. Therefore, the latitude variable appears to offer a fair amount of predictive value.

num_private

No clear explanation is provided regarding what the num_private variable is meant to represent. Summary statistics and plots show a heavily right-skewed distribution with zero values dominating the variable throughout the data set.

summary(tanz$num_private)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.4741    0.0000 1776.0000
boxplot(as.numeric(tanz$num_private), col = "yellow")

hist(as.numeric(tanz$num_private), col = 'yellow')

In fact, analysis reveals that only 757 of the data set’s 59,400 num_private values are non-zero:

length(tanz$num_private[which(tanz$num_private > 0)])
## [1] 757

Unfortunately, the lack of an explanation of the meaning of the variable means we have no way of knowing whether the large number of zero values represent legitimate data values or simply serve as an indicator of missing data.

The variable comprises a total of 65 distinct values. The number of pumps pertaining to the ten most frequently occurring of those values is summarized below.

# count distinct num_private = 65
length(unique(tanz$num_private))
## [1] 65
# calc number of wells per num_private value
c_npriv <- arrange(summarise(group_by(tanz, num_private), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_npriv, n = 10)
## # A tibble: 10 x 2
##    num_private TotalWells
##          <int>      <int>
##  1           0      58643
##  2           6         81
##  3           1         73
##  4           5         46
##  5           8         46
##  6          32         40
##  7          45         36
##  8          15         35
##  9          39         30
## 10          93         28

Exploration of Categorical Variables

funder

The funder variable represents the name of the organization that funded the installation of a given pump. A total of 1,898 distinct values for the variable are found within the data set, along with 3,635 missing values. The top 20 funders by pump count are shown in the table below.

# count distinct funder = 1898, 3635 NA
length(unique(tanz$funder))
## [1] 1898
# calc number of wells per funder
c_funders <- arrange(summarise(group_by(tanz, funder), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_funders, n = 20)
## # A tibble: 20 x 2
##                    funder TotalWells
##                     <chr>      <int>
##  1 Government Of Tanzania       9084
##  2                   <NA>       3635
##  3                 Danida       3114
##  4                 Hesawa       2202
##  5                  Rwssp       1374
##  6             World Bank       1349
##  7                   Kkkt       1287
##  8           World Vision       1246
##  9                 Unicef       1057
## 10                  Tasaf        877
## 11       District Council        843
## 12                    Dhv        829
## 13     Private Individual        826
## 14                   Dwsp        811
## 15                      0        777
## 16                  Norad        765
## 17        Germany Republi        610
## 18                   Tcrs        602
## 19      Ministry Of Water        590
## 20                  Water        583

As we can see in the small sample of funder values shown above, there appears to be a lack of consistency in the application of funder names throughout the data set. For example, in the table above we note the presence of “Government Of Tanzania”, “Ministry Of Water”, and “Water”. It is quite possible that each of these values refers to the Tanzanian government. We also note 777 records having a funder value of “0”, which likely represents yet another form of missing data. Such inconsistency may be unnecessarily increasing the number of distinct funder values while simultaneously diluting the accuracy of the data records. We will explore whether it is possible to address such issues as part of our Data Preparation efforts.
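As a simple illustration of the kind of consolidation that might be applied (the alias list below is an assumption made for demonstration purposes, not a vetted mapping):

# collapse apparent Tanzanian government aliases into a single funder value;
# the alias list here is illustrative only and would need to be validated
gov_aliases <- c("Government Of Tanzania", "Ministry Of Water", "Water")
tanz$funder[tanz$funder %in% gov_aliases] <- "Government Of Tanzania"

# the 777 records with funder = "0" appear to be placeholders for missing data
tanz$funder[tanz$funder == "0"] <- NA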

installer

The installer variable represents the name of the organization that installed a given pump. A total of 2,146 distinct values for the variable are found within the data set, along with 3,655 missing values. The top 20 installers by pump count are shown in the table below.

# count distinct installer = 2146, 3655 NA
length(unique(tanz$installer))
## [1] 2146
# calc number of wells per installer
c_installers <- arrange(summarise(group_by(tanz, installer), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_installers, n = 20)
## # A tibble: 20 x 2
##             installer TotalWells
##                 <chr>      <int>
##  1                DWE      17402
##  2               <NA>       3655
##  3         Government       1825
##  4                RWE       1206
##  5              Commu       1060
##  6             DANIDA       1050
##  7               KKKT        898
##  8             Hesawa        840
##  9                  0        777
## 10               TCRS        707
## 11 Central government        622
## 12                CES        610
## 13          Community        553
## 14              DANID        552
## 15   District Council        551
## 16             HESAWA        539
## 17                LGA        408
## 18       World vision        408
## 19             WEDECO        397
## 20              TASAF        396

As with the funder variable, there appears to be a lack of consistency in the application of installer names throughout the data set. For example, in the table above we note the presence of both “Government” and “Central government”, as well as both “Commu” and “Community”. It is quite possible that each of these pairs of values refers to a single installer. As with the funder variable, we will explore whether it is possible to address such issues as part of our Data Preparation efforts.

Funder & Installer: Overlap of Missing Values

A cursory review of the data set indicates that many instances of missing data within the funder variable are coincident with missing values for the installer variable. In fact, we find that 3,582 of the 3,635 instances of funder = NA occur when the installer variable is also unknown.

f_i <- subset(tanz, is.na(tanz$funder) & is.na(tanz$installer))
nrow(f_i)
## [1] 3582

The fact that nearly all of the missing funder and installer values are coincident may preclude any imputation for missing values for the two variables.

wpt_name

The wpt_name variable represents a name that has been assigned to a given waterpoint. A total of 37,400 distinct values for the variable are found within the data set, along with 3,563 instances where no defined name could be identified.

# count distinct wpt_name = 37400, 3563 = none (which is not the same as NA)
length(unique(tanz$wpt_name))
## [1] 37400
# calc number of wells per wpt_name
c_wptn <- arrange(summarise(group_by(tanz, wpt_name), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_wptn, n = 20)
## # A tibble: 20 x 2
##           wpt_name TotalWells
##              <chr>      <int>
##  1            none       3563
##  2         Shuleni       1748
##  3        Zahanati        830
##  4       Msikitini        535
##  5        Kanisani        323
##  6         Bombani        271
##  7          Sokoni        260
##  8         Ofisini        254
##  9          School        208
## 10 Shule Ya Msingi        199
## 11           Shule        152
## 12       Sekondari        146
## 13        Muungano        133
## 14        Mkombozi        111
## 15        Madukani        104
## 16        Hospital         94
## 17         Mbugani         94
## 18          Upendo         93
## 19  Kituo Cha Afya         90
## 20         Mkuyuni         88

Given the large number of possible wpt_name values as a proportion of the total records contained within the data set (37400 / 59400), the wpt_name variable likely offers limited, if any, value for predictive modeling purposes.

basin

The basin variable represents the name of the geographic water basin within which a given pump is located. The summary statistics shown below indicate a total of 9 distinct water basins within the data set, with each record having a valid value. The number of pumps found per basin is detailed in the table below.

# count distinct basin = 9
length(unique(tanz$basin))
## [1] 9
# calc number of wells per basin
c_basin <- arrange(summarise(group_by(tanz, basin), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_basin, n = 10)
## # A tibble: 9 x 2
##                     basin TotalWells
##                     <chr>      <int>
## 1           Lake Victoria      10248
## 2                 Pangani       8940
## 3                  Rufiji       7976
## 4                Internal       7785
## 5         Lake Tanganyika       6432
## 6             Wami / Ruvu       5987
## 7              Lake Nyasa       5085
## 8 Ruvuma / Southern Coast       4493
## 9              Lake Rukwa       2454

Plots for the basin variable show a wide variance in the proportions of functional pumps between basins: three (Lake Nyasa, Pangani, Rufiji) have more than 60% of their pumps functioning, while the Ruvuma / Southern Coast basin has only 37.2% of its pumps functioning. In fact, five of the nine basins fail to achieve the overall “functional” metric of 54.3%. The underlying reasons for these disparities should be explored to determine whether the underperforming basins can be improved via the application of pump management methods/practices being employed within the outperforming basins.
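The per-basin percentages cited above can be computed along the following lines, assuming the training labels have been merged into tanz as a status_group column:

# percentage of functional pumps per basin
basin_status <- summarise(group_by(tanz, basin),
                          PercFunctional = mean(status_group == "functional"))
arrange(basin_status, desc(PercFunctional))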

Subvillage

The subvillage variable represents the name of the geographic subvillage within which a given pump is located. Summary statistics indicate a total of 19,288 distinct subvillage values within the data set, with 371 missing values. The 20 subvillages with the largest number of pumps are shown below.

# count distinct subvillage = 19288, 371 NA's
length(unique(tanz$subvillage))
## [1] 19288
# calc number of wells per subvillage
c_subv <- arrange(summarise(group_by(tanz, subvillage), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_subv, n = 20)
## # A tibble: 20 x 2
##    subvillage TotalWells
##         <chr>      <int>
##  1   Madukani        508
##  2    Shuleni        506
##  3    Majengo        502
##  4       Kati        373
##  5       <NA>        371
##  6    Mtakuja        262
##  7     Sokoni        232
##  8          M        187
##  9   Muungano        172
## 10    Mbuyuni        164
## 11    Mlimani        152
## 12 Songambele        147
## 13   Miembeni        134
## 14  Msikitini        134
## 15          1        132
## 16    Kibaoni        114
## 17   Kanisani        111
## 18          I        109
## 19  Mapinduzi        109
## 20   Mjimwema        108

Given the large number of possible subvillage values as a proportion of the total records contained within the data set (19288 / 59400), the variable may offer limited value for predictive modeling purposes.

Region

The region variable represents the name of the geographic region in Tanzania within which a given pump is located. The summary statistics shown below indicate a total of 21 distinct regions within the data set, with each record having a valid value. The 20 regions having the largest number of pump installations are detailed in the table below.

# count distinct region = 21
length(unique(tanz$region))
## [1] 21
# calc number of wells per region
c_region <- arrange(summarise(group_by(tanz, region), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_region, n = 20)
## # A tibble: 20 x 2
##         region TotalWells
##          <chr>      <int>
##  1      Iringa       5294
##  2   Shinyanga       4982
##  3       Mbeya       4639
##  4 Kilimanjaro       4379
##  5    Morogoro       4006
##  6      Arusha       3350
##  7      Kagera       3316
##  8      Mwanza       3102
##  9      Kigoma       2816
## 10      Ruvuma       2640
## 11       Pwani       2635
## 12       Tanga       2547
## 13      Dodoma       2201
## 14     Singida       2093
## 15        Mara       1969
## 16      Tabora       1959
## 17       Rukwa       1808
## 18      Mtwara       1730
## 19     Manyara       1583
## 20       Lindi       1546

The functional pump disparities found within the various regions exceed those seen within the basin variable: one region (Iringa) has 78.2% of its pumps functional while the region of Lindi has only 29.8%, a nearly 50 point difference. In fact, of the 21 regions listed, only 9 exceed the overall functional metric of 54.3%. Furthermore, we find the region of Kigoma has 21.4% of its pumps marked as “functional needs repair”, a percentage that greatly exceeds that of all other regions.

Analysis of individual regions reveals that four specific regions have no non-zero values for any of the following variables:

  • amount_tsh
  • gps_height
  • construction_year
  • num_private
  • population

The four regions are:

  • Dodoma
  • Kagera
  • Mbeya
  • Tabora

# Dodoma
unique(tanz$population[tanz$region == 'Dodoma'])
unique(tanz$construction_year[tanz$region == 'Dodoma'])
unique(tanz$gps_height[tanz$region == 'Dodoma'])
unique(tanz$amount_tsh[tanz$region == 'Dodoma'])
unique(tanz$num_private[tanz$region == 'Dodoma'])

# Kagera
unique(tanz$population[tanz$region == 'Kagera'])
unique(tanz$construction_year[tanz$region == 'Kagera'])
unique(tanz$gps_height[tanz$region == 'Kagera'])
unique(tanz$amount_tsh[tanz$region == 'Kagera'])
unique(tanz$num_private[tanz$region == 'Kagera'])

# Mbeya
unique(tanz$population[tanz$region == 'Mbeya'])
unique(tanz$construction_year[tanz$region == 'Mbeya'])
unique(tanz$gps_height[tanz$region == 'Mbeya'])
unique(tanz$amount_tsh[tanz$region == 'Mbeya'])
unique(tanz$num_private[tanz$region == 'Mbeya'])

# Tabora
unique(tanz$population[tanz$region == 'Tabora'])
unique(tanz$construction_year[tanz$region == 'Tabora'])
unique(tanz$gps_height[tanz$region == 'Tabora'])
unique(tanz$amount_tsh[tanz$region == 'Tabora'])
unique(tanz$num_private[tanz$region == 'Tabora'])

As shown below, these 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages.

The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.

The lack of non-zero values throughout the four indicated regions for the five variables listed above makes it highly unlikely that we will be able to effectively derive imputed values for the zero values of those five variables using the geographical indicators provided within the data set.

# count num recs w/ these 4 regions = 12115 => 20.39% of all records
all_zdf <- subset(tanz, region %in% c('Tabora', 'Mbeya', 'Kagera', 'Dodoma'))

nrow(all_zdf) / nrow(tanz)
## [1] 0.2039562

length(unique(all_zdf$district_code)) # 10 unique district codes
## [1] 10
length(unique(all_zdf$ward)) # 514 unique wards
## [1] 514
length(unique(all_zdf$subvillage)) # 4644 unique subvillages
## [1] 4644
length(unique(all_zdf$lga)) # 27 unique lga's
## [1] 27

Region Code

The region_code variable provides an integer code for the Tanzanian region within which a given pump is located. The summary statistics shown below indicate a total of 27 distinct region codes within the data set, with each record having a valid value. The 20 region codes having the largest number of pump installations are summarized in the table below.

## [1] 27
## # A tibble: 20 x 2
##    region_code TotalWells
##          <int>      <int>
##  1          11       5300
##  2          17       5011
##  3          12       4639
##  4           3       4379
##  5           5       4040
##  6          18       3324
##  7          19       3047
##  8           2       3024
##  9          16       2816
## 10          10       2640
## 11           4       2513
## 12           1       2201
## 13          13       2093
## 14          14       1979
## 15          20       1969
## 16          15       1808
## 17           6       1609
## 18          21       1583
## 19          80       1238
## 20          60       1025

It is unclear why we find 27 region_code values when there are only 21 possible values for the region variable. It is possible that some subset of the region codes have either simply been entered incorrectly or were assigned arbitrary codes due to uncertainty over the correct region_code value to apply to a given record.

Plotting the status of pumps relative to each value of the region_code variable shows even greater disparities between geographic regions for functional pumps, from a low of 8.7% in Region 8 to nearly 97% in Region 24. Region 16 shows an unusually large 21.4% of its pumps having a status of “functional needs repair” while Region 40’s single pump (100% of the pumps in that region) is not functioning.

Construction Year

The construction_year variable indicates the year in which a given pump was installed. The summary statistics shown below indicate a total of 54 distinct non-zero construction year values ranging from 1960 through 2013, with 20,709 records lacking a valid value. The year values having the largest number of pump installations are summarized in the table below.

# get range of non-zero values for variable
summary(tanz$construction_year[which(tanz$construction_year > 0)])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1960    1987    2000    1997    2008    2013
# count distinct construction_year = 55 (including 0); 20,709 zero values
length(unique(tanz$construction_year))
## [1] 55
# calc number of wells per construction_year
c_const_y <- arrange(summarise(group_by(tanz, construction_year), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_const_y, n = 20)
## # A tibble: 20 x 2
##    construction_year TotalWells
##                <int>      <int>
##  1                 0      20709
##  2              2010       2645
##  3              2008       2613
##  4              2009       2533
##  5              2000       2091
##  6              2007       1587
##  7              2006       1471
##  8              2003       1286
##  9              2011       1256
## 10              2004       1123
## 11              2012       1084
## 12              2002       1075
## 13              1978       1037
## 14              1995       1014
## 15              2005       1011
## 16              1999        979
## 17              1998        966
## 18              1990        954
## 19              1985        945
## 20              1980        811

Plotting the status of pumps by construction year allows us to conclude that relatively newer pumps are generally more likely to be functional than are relatively older pumps. While this is rather unsurprising from an intuitive perspective, confirmation of such intuition can prove to be useful when crafting a predictive model.
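A sketch of one way to quantify that relationship, aggregating the non-zero construction years by decade (again assuming a merged status_group column):

# functional percentage by construction decade, non-zero years only
c_years <- subset(tanz, construction_year > 0)
c_years$decade <- (c_years$construction_year %/% 10) * 10
summarise(group_by(c_years, decade),
          PercFunctional = mean(status_group == "functional"))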

As with the amount_tsh, gps_height, and population variables, we find a significant amount of clustering relative to the subvillage, ward, lga, district_code, and region geographic variables, as shown in the table below.

Geographic Variable    Categories with No Valid construction_year Value
subvillage             7208 / 19288 (37.3%)
ward                   749 / 2092 (35.8%)
lga                    41 / 125 (32.8%)
district_code          4 / 20 (20%)
region                 4 / 21 (19%)

The statistics are very similar, though not identical, to those derived for the amount_tsh, gps_height and population variables, and once again suggest the possibility of a systemic data collection issue on the part of the surveying firm that collected the data. Furthermore, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.

District Code

The district_code variable provides an integer coding of the Tanzanian district within which a given pump is located. The summary statistics shown below indicate a total of 20 distinct district codes within the data set, with each record having a valid value. The number of pumps per district code is summarized in the table below.

# count distinct district_code = 20
length(unique(tanz$district_code))
## [1] 20
# calc number of wells per basin
c_distc <- arrange(summarise(group_by(tanz, district_code), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_distc, n = 20)
## # A tibble: 20 x 2
##    district_code TotalWells
##            <int>      <int>
##  1             1      12203
##  2             2      11173
##  3             3       9998
##  4             4       8999
##  5             5       4356
##  6             6       4074
##  7             7       3343
##  8             8       1043
##  9            30        995
## 10            33        874
## 11            53        745
## 12            43        505
## 13            13        391
## 14            23        293
## 15            63        195
## 16            62        109
## 17            60         63
## 18             0         23
## 19            80         12
## 20            67          6

The district_code variable is yet another geographic label, and as with the others we’ve analyzed (e.g., basin, region, etc.) we see a wide disparity in the percentage of functional pumps, ranging from a low of 17.4% for district 0 to a high of 83.3% in district 67. Only 9 of the 20 districts exceed the overall 54.3% functional pump metric. Furthermore, we see that six districts’ “non functional” pump percentages exceed 60%.

Further analysis reveals that values of the district_code variable are not unique across regions, as can be seen in the table below where district_code values are aggregated by region. In fact, we find a total of 132 region / district_code combinations within the data set.

# aggregate district_code values by region
c_reg_dist <- arrange(summarise(group_by(tanz, region, district_code), 
                     TotalWells = length(unique(id)) ) )

head(c_reg_dist, n = 20)
## Source: local data frame [20 x 3]
## Groups: region [4]
## 
## # A tibble: 20 x 3
##           region district_code TotalWells
##            <chr>         <int>      <int>
##  1        Arusha             1        189
##  2        Arusha             2       1206
##  3        Arusha             3        109
##  4        Arusha             5        201
##  5        Arusha             6        310
##  6        Arusha             7       1009
##  7        Arusha            30        326
##  8 Dar es Salaam             1         93
##  9 Dar es Salaam             2        497
## 10 Dar es Salaam             3        215
## 11        Dodoma             0         23
## 12        Dodoma             1        888
## 13        Dodoma             3        361
## 14        Dodoma             4        347
## 15        Dodoma             5        358
## 16        Dodoma             6        224
## 17        Iringa             1        728
## 18        Iringa             2        530
## 19        Iringa             3        650
## 20        Iringa             4       2473
# count number of distinct region / district_code pairs
nrow(c_reg_dist)
## [1] 132

This dependence on the region variable indicates that the district_code variable is unlikely to serve as an independently useful predictor of a pump’s status for predictive modeling purposes.

Public Meeting

The public_meeting variable is essentially a binary variable with 3,334 instances of the variable’s value being unknown. A summary of the variable’s values is shown below.

# count public_meeting = 3
length(unique(tanz$public_meeting))
## [1] 3
# calc number of wells per public_meeting value
c_pubmeet <- arrange(summarise(group_by(tanz, public_meeting), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_pubmeet, n = 10)
## # A tibble: 3 x 2
##   public_meeting TotalWells
##            <lgl>      <int>
## 1           TRUE      51011
## 2          FALSE       5055
## 3             NA       3334

The plots below show that, in general, pumps having a public_meeting value of TRUE are more likely to be functional than those having a value of FALSE.

In an attempt to determine whether the lack of a public_meeting value is related to the geographical location of a pump, we aggregate the data by ward and calculate the following:

  • the total number of pumps per ward

  • the number of pumps within a ward having a public_meeting value of TRUE

  • the percentage of pumps within a ward having a public_meeting value of TRUE

  • the number of pumps within a ward having an unknown public_meeting value

  • the percentage of pumps within a ward having an unknown public_meeting value

# subset for fast aggregation
ward_tmp <- tanz[, c("id", "ward", "public_meeting")]

# fill in NA's so data can be grouped properly
ward_tmp$public_meeting[is.na(ward_tmp$public_meeting)] <- "unknown"

# aggregate the data
ward_meet_true <- arrange(summarise(group_by(ward_tmp, ward), 
                               TotalWells = length(unique(id)),
                               MeetUnk = sum(public_meeting == "unknown"),
                     MeetTrue = sum(public_meeting == "TRUE") ), desc(TotalWells) )

# calculate percentage of TRUE and unknown meeting values
ward_meet_true$perc <- ward_meet_true$MeetTrue /  ward_meet_true$TotalWells
ward_meet_true$unk_perc <- ward_meet_true$MeetUnk /  ward_meet_true$TotalWells

head(ward_meet_true, n = 20)
## # A tibble: 20 x 6
##             ward TotalWells MeetUnk MeetTrue       perc    unk_perc
##            <chr>      <int>   <int>    <int>      <dbl>       <dbl>
##  1         Igosi        307       0      307 1.00000000 0.000000000
##  2      Imalinyi        252       0      252 1.00000000 0.000000000
##  3     Siha Kati        232       0      232 1.00000000 0.000000000
##  4        Mdandu        231       0      231 1.00000000 0.000000000
##  5       Nduruma        217       1      216 0.99539171 0.004608295
##  6       Kitunda        203     193       10 0.04926108 0.950738916
##  7       Mishamo        203       4      198 0.97536946 0.019704433
##  8        Msindo        201       0      200 0.99502488 0.000000000
##  9      Chalinze        196       0       66 0.33673469 0.000000000
## 10  Maji ya Chai        190       1      189 0.99473684 0.005263158
## 11         Usuka        187       0      187 1.00000000 0.000000000
## 12  Ngarenanyuki        172       8      164 0.95348837 0.046511628
## 13       Chanika        171     145       24 0.14035088 0.847953216
## 14       Vikindu        162       0      161 0.99382716 0.000000000
## 15       Mtwango        153       0      153 1.00000000 0.000000000
## 16        Matola        145       0      124 0.85517241 0.000000000
## 17 Zinga/Ikerege        141       1      127 0.90070922 0.007092199
## 18       Maramba        139       0      139 1.00000000 0.000000000
## 19  Wanging'ombe        139       0      139 1.00000000 0.000000000
## 20         Itete        137       0      137 1.00000000 0.000000000

Summary statistics for the percentage of unknown public_meeting values per ward indicate that one or more wards within the data set have no valid public_meeting values:

# shows some wards have 100% unknowns
summary(ward_meet_true$unk_perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0612  0.0000  1.0000

In fact, we find a total of 43 such wards within the data set, as shown below.

# find wards with 100% unknowns = 43 wards
nrow(subset(ward_meet_true, unk_perc == 1))
## [1] 43

This tells us that if we wish to impute public_meeting values for the data records lacking valid values, we should not do so solely on the basis of a record’s ward value, since a significant number of wards contain no valid public_meeting values from which a reasonable imputation can be derived.

We explore the lga variable using the same aggregation techniques described above for the ward variable:

# memory cleanup
rm(ward_tmp)

# subset for fast aggregation
lga_tmp <- tanz[, c("id", "lga", "public_meeting")]

# fill in NA's so data can be grouped properly
lga_tmp$public_meeting[is.na(lga_tmp$public_meeting)] <- "unknown"

# aggregate the data
lga_meet_true <- arrange(summarise(group_by(lga_tmp, lga), 
                               TotalWells = length(unique(id)),
                              MeetUnk = sum(public_meeting == "unknown"),
                     MeetTrue = sum(public_meeting == "TRUE") ), desc(TotalWells) )

# calculate percentage of TRUE and unknown meeting values
lga_meet_true$perc <- lga_meet_true$MeetTrue /  lga_meet_true$TotalWells
lga_meet_true$unk_perc <- lga_meet_true$MeetUnk /  lga_meet_true$TotalWells
head(lga_meet_true, n = 20)
## # A tibble: 20 x 6
##              lga TotalWells MeetUnk MeetTrue      perc    unk_perc
##            <chr>      <int>   <int>    <int>     <dbl>       <dbl>
##  1        Njombe       2503       0     2462 0.9836197 0.000000000
##  2  Arusha Rural       1252       4     1204 0.9616613 0.003194888
##  3   Moshi Rural       1251      17      868 0.6938449 0.013589129
##  4       Bariadi       1177     306      644 0.5471538 0.259983008
##  5        Rungwe       1106      17     1084 0.9801085 0.015370705
##  6        Kilosa       1094      24     1058 0.9670932 0.021937843
##  7        Kasulu       1047       2     1045 0.9980898 0.001910220
##  8         Mbozi       1034       5      628 0.6073501 0.004835590
##  9          Meru       1009      19      990 0.9811695 0.018830525
## 10      Bagamoyo        997       4      685 0.6870612 0.004012036
## 11 Singida Rural        995     115      880 0.8844221 0.115577889
## 12     Kilombero        959       3      950 0.9906152 0.003128259
## 13          Same        877       3      847 0.9657925 0.003420753
## 14       Kibondo        874      21      853 0.9759725 0.024027460
## 15         Kyela        859       0      857 0.9976717 0.000000000
## 16        Kahama        836     116      717 0.8576555 0.138755981
## 17  Kigoma Rural        824       6      735 0.8919903 0.007281553
## 18          Magu        824      55      756 0.9174757 0.066747573
## 19         Maswa        809       9      798 0.9864030 0.011124845
## 20       Karagwe        771       0      768 0.9961089 0.000000000

Summary statistics for the percentage of unknown public_meeting values per lga indicate that all lga’s within the data set have at least some valid public_meeting values:

# shows no lga's have 100% unknowns
summary(lga_meet_true$unk_perc)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.003128 0.054246 0.028169 0.925553

As such, if we wish to impute public_meeting values for the data records lacking valid values, use of a record’s ward and lga values together may allow us to infer whether or not the missing value should be TRUE or FALSE based on the percentage of pumps within the ward or lga sharing a TRUE value. For example, if the percentage of TRUE public_meeting values within a ward (or lga if the ward lacks any valid public_meeting values) exceeds 50%, it would seem reasonable to assign a value of TRUE to any missing public_meeting value pertaining to that ward (or lga). This approach is applied in the Data Preparation section of this document.
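A minimal sketch of that thresholding logic, reusing the ward_meet_true and lga_meet_true aggregates computed above:

# impute missing public_meeting values: use the ward's share of TRUE values,
# falling back to the lga's share when the ward has no valid values at all
na_idx <- which(is.na(tanz$public_meeting))
w <- match(tanz$ward[na_idx], ward_meet_true$ward)
l <- match(tanz$lga[na_idx],  lga_meet_true$lga)
true_share <- ifelse(ward_meet_true$unk_perc[w] < 1,
                     ward_meet_true$perc[w],
                     lga_meet_true$perc[l])
tanz$public_meeting[na_idx] <- true_share > 0.5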

lga

The lga variable provides another Tanzania-specific geographic identifier for the area within which a given pump is located. The summary statistics shown below indicate a total of 125 distinct lga values within the data set, with each record having a valid value. The 20 lga’s having the largest number of pump installations are summarized in the table below.

# count distinct lga = 125
length(unique(tanz$lga))
## [1] 125
# calc number of wells per lga
c_lga <- arrange(summarise(group_by(tanz, lga), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_lga, n = 20)
## # A tibble: 20 x 2
##              lga TotalWells
##            <chr>      <int>
##  1        Njombe       2503
##  2  Arusha Rural       1252
##  3   Moshi Rural       1251
##  4       Bariadi       1177
##  5        Rungwe       1106
##  6        Kilosa       1094
##  7        Kasulu       1047
##  8         Mbozi       1034
##  9          Meru       1009
## 10      Bagamoyo        997
## 11 Singida Rural        995
## 12     Kilombero        959
## 13          Same        877
## 14       Kibondo        874
## 15         Kyela        859
## 16        Kahama        836
## 17  Kigoma Rural        824
## 18          Magu        824
## 19         Maswa        809
## 20       Karagwe        771

Ward

The ward variable represents the name of the geographic ward within which a given pump is located. The summary statistics shown below indicate a total of 2092 distinct ward values within the data set, with each record having a valid value. The 20 wards having the largest number of pump installations are summarized in the table below.

# count distinct ward = 2092
length(unique(tanz$ward))
## [1] 2092
# calc number of wells per ward
c_ward <- arrange(summarise(group_by(tanz, ward), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_ward, n = 20)
## # A tibble: 20 x 2
##             ward TotalWells
##            <chr>      <int>
##  1         Igosi        307
##  2      Imalinyi        252
##  3     Siha Kati        232
##  4        Mdandu        231
##  5       Nduruma        217
##  6       Kitunda        203
##  7       Mishamo        203
##  8        Msindo        201
##  9      Chalinze        196
## 10  Maji ya Chai        190
## 11         Usuka        187
## 12  Ngarenanyuki        172
## 13       Chanika        171
## 14       Vikindu        162
## 15       Mtwango        153
## 16        Matola        145
## 17 Zinga/Ikerege        141
## 18       Maramba        139
## 19  Wanging'ombe        139
## 20         Itete        137

Scheme Management

The scheme_management variable represents the type of organization responsible for managing a given pump. The summary statistics shown below indicate a total of 13 distinct scheme management values within the data set, with 3,877 records having no valid value. The number of pump installations per type of scheme management is summarized in the table below.

# count distinct scheme_management = 13, 3877 NA's
length(unique(tanz$scheme_management))
## [1] 13
# calc number of wells per scheme_management
c_scheme_m <- arrange(summarise(group_by(tanz, scheme_management), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_scheme_m, n = 20)
## # A tibble: 13 x 2
##    scheme_management TotalWells
##                <chr>      <int>
##  1               VWC      36793
##  2               WUG       5206
##  3              <NA>       3877
##  4   Water authority       3153
##  5               WUA       2883
##  6       Water Board       2748
##  7        Parastatal       1680
##  8  Private operator       1063
##  9           Company       1061
## 10             Other        766
## 11               SWC         97
## 12             Trust         72
## 13              None          1

Plots for the scheme_management variable show that, excluding “None” (since it applies to only a single pump), while the Water Board scheme seems to yield the largest percentage of functional pumps, that scheme is used for a relatively small number (2,748) of the 59,400 pumps. By contrast, the VWC scheme is used for nearly 36,800 pumps. So while a majority of the schemes exceed the 54.3% overall functional metric, the vast majority of the pumps themselves do not benefit from those schemes.

Analysis of the scheme_management variable’s missing values reveals the following:

  1. There is no single management scheme that is universally applicable to any given extraction_type or waterpoint_type or for any of the composites thereof (e.g., extraction_type_group, etc.)

  2. Instances of missing scheme_management values are not correlated in any way with missing public_meeting or permit values, despite the fact that the missing value counts for each of those variables are somewhat similar.

  3. Aggregating by geographic variables also fails to produce even a single valid scheme_management value for many subvillages and wards. At the lga and region levels we can generally find at least several valid values per lga or region, though some lga’s contain only a single valid value.

  4. Aggregating at the ward level shows that 75 wards lack any valid scheme_management value, and those 75 wards cover 1,413 (36.4%) of the 3,877 missing scheme_management values.

  5. At the lga level it appears as though nearly every lga has at least one valid scheme_management value. However, in some lga’s the number of unknowns is an order of magnitude greater than the number of known values. For example, in the lga of Kahama there are 736 pumps having an unknown scheme_management value and only 100 pumps with known values that are spread across several distinct scheme_management categories.

One possible approach to dealing with the missing values would be to assign the most frequently occurring valid scheme_management value found within the corresponding ward or lga as the imputed value. However, that approach seems inappropriate since many wards and lga’s are dominated by unknown values: in such instances the most frequently occurring valid value will likely represent a relatively small percentage of the total number of data records for a ward or lga, and assigning that value to a missing scheme_management record would likely introduce far too much statistical bias within the data set. As such, it may be preferable to simply mark the missing values as “unknown” to prevent the introduction of such bias, as shown in the sketch below.
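This could be as simple as the following, given that scheme_management is stored as a character vector (per the tibble output above):

# flag missing scheme_management values rather than imputing a category
tanz$scheme_management[is.na(tanz$scheme_management)] <- "unknown"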

Scheme_name

The scheme_name variable represents the name of the party responsible for managing a given pump. The summary statistics shown below indicate a total of 2,697 distinct scheme name values within the data set, with 28,166 records having no valid value. The 20 scheme names having the largest number of pump installations are summarized in the table below.

# count distinct scheme_name = 2697, 28,166 NA's
length(unique(tanz$scheme_name))
## [1] 2697
# calc number of wells per scheme_name
c_scheme_n <- arrange(summarise(group_by(tanz, scheme_name), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_scheme_n, n = 20)
## # A tibble: 20 x 2
##                              scheme_name TotalWells
##                                    <chr>      <int>
##  1                                  <NA>      28166
##  2                                     K        682
##  3                                  None        644
##  4                              Borehole        546
##  5                         Chalinze wate        405
##  6                                     M        400
##  7                                DANIDA        379
##  8                            Government        320
##  9           Ngana water supplied scheme        270
## 10           wanging'ombe water supply s        261
## 11            wanging'ombe supply scheme        234
## 12                         Bagamoyo wate        229
## 13                                     I        229
## 14           Uroki-Bomang'ombe water sup        209
## 15                                     N        204
## 16 Kirua kahe gravity water supply trust        193
## 17             Machumba estate pipe line        185
## 18           Makwale water supplied sche        166
## 19                                Kijiji        161
## 20                                     S        154

The large percentage of records lacking a defined scheme name combined with the lack of a definitive list of valid scheme names makes it unlikely that imputation of the missing values will prove feasible. As such, the scheme_name variable appears unlikely to be useful for predictive modeling purposes.

Permit

The permit variable is a binary True/False indicator. Summary statistics indicate that 3,056 records within the data set have no valid permit value.

# count distinct permit = 3, 3,056 NA's
length(unique(tanz$permit))
## [1] 3
# calc number of wells per permit value
c_permit <- arrange(summarise(group_by(tanz, permit), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_permit, n = 10)
## # A tibble: 3 x 2
##   permit TotalWells
##    <lgl>      <int>
## 1   TRUE      38852
## 2  FALSE      17492
## 3     NA       3056

The plots below show that pumps indicated as having a permit are more likely to be functional than are those without one.

In an attempt to determine whether the lack of a permit value is related to the geographical location of a pump, we aggregate the data by ward and calculate the following:

  • the total number of pumps per ward

  • the number of pumps within a ward having a permit value of TRUE

  • the percentage of pumps within a ward having a permit value of TRUE

  • the number of pumps within a ward having an unknown permit value

  • the percentage of pumps within a ward having an unknown permit value

# subset for fast aggregation
ward_tmp <- tanz[, c("id", "ward", "permit")]

# fill in NA's so data can be grouped properly
ward_tmp$permit[is.na(ward_tmp$permit)] <- "unknown"

# aggregate the data
ward_perm_true <- arrange(summarise(group_by(ward_tmp, ward), 
                               TotalWells = length(unique(id)),
                               PermUnk = sum(permit == "unknown"),
                     PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )

# calculate percentage of TRUE and unknown permit values
ward_perm_true$perc <- ward_perm_true$PermTrue /  ward_perm_true$TotalWells
ward_perm_true$unk_perc <- ward_perm_true$PermUnk /  ward_perm_true$TotalWells

head(ward_perm_true, n = 20)
## # A tibble: 20 x 6
##             ward TotalWells PermUnk PermTrue       perc    unk_perc
##            <chr>      <int>   <int>    <int>      <dbl>       <dbl>
##  1         Igosi        307       0      222 0.72312704 0.000000000
##  2      Imalinyi        252       2      250 0.99206349 0.007936508
##  3     Siha Kati        232       0      232 1.00000000 0.000000000
##  4        Mdandu        231       0      231 1.00000000 0.000000000
##  5       Nduruma        217       0      217 1.00000000 0.000000000
##  6       Kitunda        203       0        8 0.03940887 0.000000000
##  7       Mishamo        203       0      203 1.00000000 0.000000000
##  8        Msindo        201       1      134 0.66666667 0.004975124
##  9      Chalinze        196       0      195 0.99489796 0.000000000
## 10  Maji ya Chai        190       0      127 0.66842105 0.000000000
## 11         Usuka        187       0      187 1.00000000 0.000000000
## 12  Ngarenanyuki        172       2      132 0.76744186 0.011627907
## 13       Chanika        171      28        4 0.02339181 0.163742690
## 14       Vikindu        162       0        0 0.00000000 0.000000000
## 15       Mtwango        153       0       71 0.46405229 0.000000000
## 16        Matola        145       0       41 0.28275862 0.000000000
## 17 Zinga/Ikerege        141       0      121 0.85815603 0.000000000
## 18       Maramba        139       0      139 1.00000000 0.000000000
## 19  Wanging'ombe        139       0      139 1.00000000 0.000000000
## 20         Itete        137      89       48 0.35036496 0.649635036

Summary statistics for the percentage of unknown permit values per ward indicate that one or more wards within the data set have no valid permit values:

# shows some wards have 100% unknowns
summary(ward_perm_true$unk_perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04415 0.00000 1.00000

In fact, we find a total of 73 such wards within the data set, as shown below.

# find wards with 100% unknowns = 73 wards
nrow(subset(ward_perm_true, unk_perc == 1))
## [1] 73

This tells us that if we wish to impute permit values for the data records lacking valid values, we should not do so solely via a record’s ward value, since a significant number of wards contain no valid permit values from which a reasonable imputation can be derived.

As an alternative, we explore the lga variable using the same aggregation techniques described above for the ward variable:

# memory cleanup
rm(ward_tmp)

# subset for fast aggregation
lga_tmp <- tanz[, c("id", "lga", "permit")]

# fill in NA's so data can be grouped properly
lga_tmp$permit[is.na(lga_tmp$permit)] <- "unknown"

# aggregate the data
lga_perm_true <- arrange(summarise(group_by(lga_tmp, lga), 
                               TotalWells = length(unique(id)),
                              PermUnk = sum(permit == "unknown"),
                     PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )

# calculate percentage of TRUE and unknown permit values
lga_perm_true$perc <- lga_perm_true$PermTrue /  lga_perm_true$TotalWells
lga_perm_true$unk_perc <- lga_perm_true$PermUnk /  lga_perm_true$TotalWells
head(lga_perm_true, n = 20)
## # A tibble: 20 x 6
##              lga TotalWells PermUnk PermTrue        perc    unk_perc
##            <chr>      <int>   <int>    <int>       <dbl>       <dbl>
##  1        Njombe       2503       3     1576 0.629644427 0.001198562
##  2  Arusha Rural       1252      53     1199 0.957667732 0.042332268
##  3   Moshi Rural       1251       8     1210 0.967226219 0.006394884
##  4       Bariadi       1177       0        2 0.001699235 0.000000000
##  5        Rungwe       1106    1106        0 0.000000000 1.000000000
##  6        Kilosa       1094       0     1073 0.980804388 0.000000000
##  7        Kasulu       1047       0      751 0.717287488 0.000000000
##  8         Mbozi       1034       0      143 0.138297872 0.000000000
##  9          Meru       1009       3      803 0.795837463 0.002973241
## 10      Bagamoyo        997       0      788 0.790371113 0.000000000
## 11 Singida Rural        995     967       28 0.028140704 0.971859296
## 12     Kilombero        959       0      959 1.000000000 0.000000000
## 13          Same        877     130      739 0.842645382 0.148232611
## 14       Kibondo        874       0       53 0.060640732 0.000000000
## 15         Kyela        859       0      859 1.000000000 0.000000000
## 16        Kahama        836       0      761 0.910287081 0.000000000
## 17  Kigoma Rural        824       0      781 0.947815534 0.000000000
## 18          Magu        824       0      765 0.928398058 0.000000000
## 19         Maswa        809       0        9 0.011124845 0.000000000
## 20       Karagwe        771       0      771 1.000000000 0.000000000

Summary statistics for the percentage of unknown permit values per lga indicate that one or more lga’s within the data set have no valid permit values:

# shows some lga's have 100% unknowns
summary(lga_perm_true$unk_perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.0417  0.0000  1.0000

In fact, we find a total of 3 such lga’s within the data set, as shown below.

# find lga's with 100% unknowns = 3 lga's
nrow(subset(lga_perm_true, unk_perc == 1))
## [1] 3

We now explore the region variable using the same aggregation techniques described above for the ward and lga variables:

# memory cleanup
rm(lga_tmp)

# subset for fast aggregation
reg_tmp <- tanz[, c("id", "region", "permit")]

# fill in NA's so data can be grouped properly
reg_tmp$permit[is.na(reg_tmp$permit)] <- "unknown"

# aggregate the data
reg_perm_true <- arrange(summarise(group_by(reg_tmp, region), 
                               TotalWells = length(unique(id)),
                              PermUnk = sum(permit == "unknown"),
                     PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )

# calculate percentage of TRUE and unknown permit values
reg_perm_true$perc <- reg_perm_true$PermTrue /  reg_perm_true$TotalWells
reg_perm_true$unk_perc <- reg_perm_true$PermUnk /  reg_perm_true$TotalWells
head(reg_perm_true, n = 20)
## # A tibble: 20 x 6
##         region TotalWells PermUnk PermTrue      perc     unk_perc
##          <chr>      <int>   <int>    <int>     <dbl>        <dbl>
##  1      Iringa       5294       3     3174 0.5995467 0.0005666793
##  2   Shinyanga       4982       0     2364 0.4745082 0.0000000000
##  3       Mbeya       4639    1106     2134 0.4600129 0.2384134512
##  4 Kilimanjaro       4379     138     3796 0.8668646 0.0315140443
##  5    Morogoro       4006       0     3967 0.9902646 0.0000000000
##  6      Arusha       3350     445     2650 0.7910448 0.1328358209
##  7      Kagera       3316       0     2521 0.7602533 0.0000000000
##  8      Mwanza       3102       0     2975 0.9590587 0.0000000000
##  9      Kigoma       2816       0     1656 0.5880682 0.0000000000
## 10      Ruvuma       2640       2     1798 0.6810606 0.0007575758
## 11       Pwani       2635       0     1530 0.5806452 0.0000000000
## 12       Tanga       2547       0     1378 0.5410287 0.0000000000
## 13      Dodoma       2201       0      989 0.4493412 0.0000000000
## 14     Singida       2093     967      805 0.3846154 0.4620162446
## 15        Mara       1969     367      998 0.5068563 0.1863890300
## 16      Tabora       1959       0     1325 0.6763655 0.0000000000
## 17       Rukwa       1808       0     1162 0.6426991 0.0000000000
## 18      Mtwara       1730       0     1311 0.7578035 0.0000000000
## 19     Manyara       1583       0     1506 0.9513582 0.0000000000
## 20       Lindi       1546       0      813 0.5258732 0.0000000000

Summary statistics for the percentage of unknown permit values per region indicate that all regions within the data set have at least some valid permit values:

# shows no regions have 100% unknowns
summary(reg_perm_true$unk_perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.05178 0.03151 0.46202

As such, if we wish to impute permit values for the data records lacking valid values, use of a record’s lga and region values may allow us to infer whether or not the missing value should be TRUE or FALSE based on the percentage of pumps within the lga or region sharing a TRUE value. For example, if the percentage of TRUE permit values within an lga exceeds 50%, it would seem reasonable to assign a value of TRUE to any missing permit value pertaining to that lga. However, if the specific lga lacks any valid permit values, we can instead make use of the percentage of TRUE permit values within the associated region and apply the same 50% thresholding approach described for use with the lga variable’s values. This approach is applied in the Data Preparation section of this document.
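A minimal sketch of that lga-then-region fallback, reusing the lga_perm_true and reg_perm_true aggregates computed above:

# impute missing permit values: use the lga's share of TRUE values, falling
# back to the region's share when the lga has no valid permit values at all
na_idx <- which(is.na(tanz$permit))
l <- match(tanz$lga[na_idx],    lga_perm_true$lga)
r <- match(tanz$region[na_idx], reg_perm_true$region)
true_share <- ifelse(lga_perm_true$unk_perc[l] < 1,
                     lga_perm_true$perc[l],
                     reg_perm_true$perc[r])
tanz$permit[na_idx] <- true_share > 0.5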

Extraction Type

The extraction_type variable represents the method of extraction used by a given pump. The summary statistics shown below indicate a total of 18 distinct extraction_type values within the data set, with each record having a valid value. The number of pump installations per extraction type is summarized in the table below.

# count distinct extraction_type = 18
length(unique(tanz$extraction_type))
## [1] 18
# calc number of wells per extraction_type
c_extract_t <- arrange(summarise(group_by(tanz, extraction_type), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_t, n = 10)
## # A tibble: 10 x 2
##      extraction_type TotalWells
##                <chr>      <int>
##  1           gravity      26780
##  2       nira/tanira       8154
##  3             other       6430
##  4       submersible       4764
##  5            swn 80       3670
##  6              mono       2865
##  7     india mark ii       2400
##  8           afridev       1770
##  9               ksb       1415
## 10 other - rope pump        451

Plots of the extraction_type variable show that seven of the eighteen extraction types have functional metrics that exceed the overall 54.3% benchmark. Of particular interest here is the gravity type, since nearly 27,000 of the 59,400 total pumps rely on that approach, a far higher percentage than any of the other extraction types.

Extraction Type Group

The extraction_type_group variable represents a composite of the methods of extraction indicated by the extraction_type variable. The summary statistics shown below indicate a total of 13 distinct extraction_type_group values within the data set, with each record having a valid value. The number of pump installations per extraction type group is summarized in the table below.

# count distinct extraction_type_group = 13
length(unique(tanz$extraction_type_group))
## [1] 13
# calc number of wells per extraction_type
c_extract_tg <- arrange(summarise(group_by(tanz, extraction_type_group), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_tg, n = 20)
## # A tibble: 13 x 2
##    extraction_type_group TotalWells
##                    <chr>      <int>
##  1               gravity      26780
##  2           nira/tanira       8154
##  3                 other       6430
##  4           submersible       6179
##  5                swn 80       3670
##  6                  mono       2865
##  7         india mark ii       2400
##  8               afridev       1770
##  9             rope pump        451
## 10        other handpump        364
## 11       other motorpump        122
## 12          wind-powered        117
## 13        india mark iii         98

The extraction_type_group indicator appears to represent a narrowing of the breadth of values available via the extraction_type variable into various higher-level overall categories. The gravity type group again accounts for the largest percentage of total pumps.

Extraction Type Class

The extraction_type_class variable represents a composite of the methods of extraction indicated by the extraction_type_group variable. The summary statistics shown below indicate a total of 7 distinct extraction_type_class values within the data set, with each record having a valid value. The number of pump installations per extraction type class is summarized in the table below.

# count distinct extraction_type_class = 7
length(unique(tanz$extraction_type_class))
## [1] 7
# calc number of wells per extraction_type_class
c_extract_t_c <- arrange(summarise(group_by(tanz, extraction_type_class), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_t_c, n = 10)
## # A tibble: 7 x 2
##   extraction_type_class TotalWells
##                   <chr>      <int>
## 1               gravity      26780
## 2              handpump      16456
## 3                 other       6430
## 4           submersible       6179
## 5             motorpump       2987
## 6             rope pump        451
## 7          wind-powered        117

The extraction_type_class indicator appears to represent a narrowing of the breadth of values available via the extraction_type_group variable into various higher-level overall categories. The gravity type class again accounts for the largest percentage of total pumps. However, the handpump class also represents a significant percentage of the total pump count, with nearly 16,400 pumps having been assigned to that class. Of the seven classes, three exceed the 54.3% overall functional benchmark.

Management

The management variable indicates the method employed for the management of a given pump. The summary statistics shown below indicate a total of 12 distinct management values within the data set, including 561 records labeled as “unknown”. The number of pump installations per management method is summarized in the table below.

# count distinct management = 12
length(unique(tanz$management))
## [1] 12
# calc number of wells per management
c_mgmt <- arrange(summarise(group_by(tanz, management), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_mgmt, n = 20)
## # A tibble: 12 x 2
##          management TotalWells
##               <chr>      <int>
##  1              vwc      40507
##  2              wug       6515
##  3      water board       2933
##  4              wua       2535
##  5 private operator       1971
##  6       parastatal       1768
##  7  water authority        904
##  8            other        844
##  9          company        685
## 10          unknown        561
## 11   other - school         99
## 12            trust         78

VWC dominates the management variable’s values, with more than 40,000 pumps having been assigned that value. As we can see in the plots below, while seven other management approaches have higher percentages of their pumps functional, collectively those seven approaches are applied to less than one third of the pumps.

Management Group

The management_group variable may represent a composite of the management method names indicated by the management variable. The summary statistics shown below indicate a total of 5 distinct management_group values within the data set, including 561 records labeled as “unknown”. The number of pump installations per management group is summarized in the table below.

# count distinct management_group = 5
length(unique(tanz$management_group))
## [1] 5
# calc number of wells per management_group
c_mgmt_grp <- arrange(summarise(group_by(tanz, management_group), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_mgmt_grp, n = 10)
## # A tibble: 5 x 2
##   management_group TotalWells
##              <chr>      <int>
## 1       user-group      52490
## 2       commercial       3638
## 3       parastatal       1768
## 4            other        943
## 5          unknown        561

The management_group variable’s most prevalent value is user-group, with 52,490 pumps having been assigned that value. As we can see in the plots, user-group yields a functional pump percentage close to the 54.3% overall functional metric.

Payment

The payment variable represents how the water is actually paid for by users of the pump, if at all. The summary statistics shown below indicate a total of 7 distinct payment values within the data set, including 8,157 records labeled as “unknown”. The number of pump installations per payment method is summarized in the table below.

# count distinct payment = 7
length(unique(tanz$payment))
## [1] 7
# calc number of wells per payment
c_paymt <- arrange(summarise(group_by(tanz, payment), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_paymt, n = 10)
## # A tibble: 7 x 2
##                 payment TotalWells
##                   <chr>      <int>
## 1             never pay      25348
## 2        pay per bucket       8985
## 3           pay monthly       8300
## 4               unknown       8157
## 5 pay when scheme fails       3914
## 6          pay annually       3642
## 7                 other       1054

As shown above, more than 25,000 pumps require no payment for their use, and, unsurprisingly, the graphics below indicate that those pumps are the least functional overall once unknown payment types are excluded. However, pumps that do not require payment may be located in remote areas where collecting payment is not feasible. Nevertheless, it appears reasonable to conclude that requiring users to pay for use of a pump makes the pump more likely to remain functional than allowing its use free of charge.

Payment Type

The payment_type variable duplicates the payment methods indicated by the payment variable, with the category names slightly shortened (e.g., payment’s “pay when scheme fails” becomes “on failure”, and “pay per bucket” becomes “per bucket”). The summary statistics shown below indicate a total of 7 distinct payment_type values within the data set, including 8,157 records labeled as “unknown”. The number of pump installations per payment type is summarized in the table below.

# count distinct payment_type = 7; on failure appears to replace "pay when scheme fails" from payment variable
length(unique(tanz$payment_type))
## [1] 7
# calc number of wells per payment_type
c_paymt_t <- arrange(summarise(group_by(tanz, payment_type), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_paymt_t, n = 10)
## # A tibble: 7 x 2
##   payment_type TotalWells
##          <chr>      <int>
## 1    never pay      25348
## 2   per bucket       8985
## 3      monthly       8300
## 4      unknown       8157
## 5   on failure       3914
## 6     annually       3642
## 7        other       1054

The payment_type variable’s values are not distinct from those of the payment variable. In fact, the names of the values are simply minor alterations of those provided via the payment variable. All of the plots shown below exhibit values identical to those of the payment variable once the minor value name alterations are considered. As such, the payment_type variable appears to be redundant and can therefore be ignored for purposes of model building.

Water Quality

The water_quality variable is an indicator of the quality of the water produced by a given pump. The summary statistics shown below indicate a total of 8 distinct water_quality values within the data set, including 1,876 records labeled as “unknown”. The number of pump installations per water_quality value is summarized in the table below.

# count distinct water_quality = 8
length(unique(tanz$water_quality))
## [1] 8
# calc number of wells per water_quality
c_waterq <- arrange(summarise(group_by(tanz, water_quality), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterq, n = 10)
## # A tibble: 8 x 2
##        water_quality TotalWells
##                <chr>      <int>
## 1               soft      50818
## 2              salty       4856
## 3            unknown       1876
## 4              milky        804
## 5           coloured        490
## 6    salty abandoned        339
## 7           fluoride        200
## 8 fluoride abandoned         17

The water_quality variable is dominated by the soft value, with 50,818 of the 59,400 pumps having that water_quality value. While the plots shown below indicate that the largest percentage of functional pumps belong to those of the fluoride category, there are only 200 such pumps throughout the country.

Quality Group

The quality_group variable represents a composite of the water quality indicators provided by the water_quality variable. The summary statistics shown below indicate a total of 6 distinct quality_group values within the data set, including 1,876 records labeled as “unknown”. The number of pump installations per quality_group value is summarized in the table below.

# count distinct quality_group = 6
length(unique(tanz$quality_group))
## [1] 6
# calc number of wells per quality_group
c_qualityg <- arrange(summarise(group_by(tanz, quality_group), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_qualityg, n = 10)
## # A tibble: 6 x 2
##   quality_group TotalWells
##           <chr>      <int>
## 1          good      50818
## 2         salty       5195
## 3       unknown       1876
## 4         milky        804
## 5       colored        490
## 6      fluoride        217

The quality_group variable appears to represent a narrowing of the breadth of values available via the water_quality variable. In fact, the number of pumps in the good category is identical to the count for the water_quality variable’s soft category, while quality_group’s fluoride and salty categories appear to have been formed by combining water_quality’s fluoride/fluoride abandoned and salty/salty abandoned categories, respectively. As such, this variable is likely duplicative/redundant and can be considered for removal for purposes of model building.

Quantity & Quantity_Group

The quantity and quantity_group variables are exact duplicates of one another, as shown in the summary statistics provided below. Both variables indicate the quantity of water available at a given pump. The summary statistics indicate a total of 5 distinct quantity and quantity_group values within the data set, including 789 records labeled as “unknown”. The number of pump installations per quantity value is summarized in the table below.

# count distinct quantity = 5
length(unique(tanz$quantity))
## [1] 5
# calc number of wells per quantity
c_quantity <- arrange(summarise(group_by(tanz, quantity), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_quantity, n = 10)
## # A tibble: 5 x 2
##       quantity TotalWells
##          <chr>      <int>
## 1       enough      33186
## 2 insufficient      15129
## 3          dry       6246
## 4     seasonal       4050
## 5      unknown        789
# -------------------------------
# count distinct quantity_group = 5; appears to be identical to "quantity" variable
length(unique(tanz$quantity_group))
## [1] 5
# calc number of wells per quantity_group
c_quantityg <- arrange(summarise(group_by(tanz, quantity_group), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_quantityg, n = 10)
## # A tibble: 5 x 2
##   quantity_group TotalWells
##            <chr>      <int>
## 1         enough      33186
## 2   insufficient      15129
## 3            dry       6246
## 4       seasonal       4050
## 5        unknown        789

The quantity and quantity_group variables have identical categories with identical numbers of pumps assigned to each. As such, the quantity_group variable is likely to be redundant/duplicative and can therefore be ignored for purposes of model building.

Of the five quantity categories, the dry category offers the smallest percentage of functional pumps, with only 2.5% of dry pumps being functional. By contrast, more than 65% of pumps categorized as having a quantity of enough are functional.

Source

The source variable indicates the source of the water for a given pump. The summary statistics shown below indicate a total of 10 distinct source values within the data set, including 66 records labeled as “unknown”. The number of pump installations per source value is summarized in the table below.

# count distinct source = 10
length(unique(tanz$source))
## [1] 10
# calc number of wells per source
c_source <- arrange(summarise(group_by(tanz, source), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source, n = 10)
## # A tibble: 10 x 2
##                  source TotalWells
##                   <chr>      <int>
##  1               spring      17021
##  2         shallow well      16824
##  3          machine dbh      11075
##  4                river       9612
##  5 rainwater harvesting       2295
##  6             hand dtw        874
##  7                 lake        765
##  8                  dam        656
##  9                other        212
## 10              unknown         66

As shown below, the spring category of the source variable offers the highest percentage of functional pumps and also represents the largest source category, with 17,021 pumps. By contrast, while the lake category performs poorly, it represents only 765 of the 59,400 pumps in the data set. Among the remaining categories, pumps with source = dam also perform poorly, with 57.8% not functioning.

Source Type

The source_type variable appears to be a composite of the source variable. The summary statistics shown below indicate a total of 7 distinct source_type values within the data set, with each record having a valid value. The number of pump installations per source_type is summarized in the table below.

# count distinct source_type = 7
length(unique(tanz$source_type))
## [1] 7
# calc number of wells per source_type
c_source_t <- arrange(summarise(group_by(tanz, source_type), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source_t, n = 10)
## # A tibble: 7 x 2
##            source_type TotalWells
##                  <chr>      <int>
## 1               spring      17021
## 2         shallow well      16824
## 3             borehole      11949
## 4           river/lake      10377
## 5 rainwater harvesting       2295
## 6                  dam        656
## 7                other        278

The source_type variable appears to combine some of the categories of the source variable. As such, this variable may be redundant / duplicative and might be ignored for purposes of model building.

As shown below, the spring category of the source_type variable offers the highest percentage of functional pumps and also represents the largest source_type category, with 17,021 pumps. By contrast, while the dam category performs poorly, it represents only 656 of the 59,400 pumps in the data set. Among the remaining categories, pumps with source_type = borehole also perform poorly, with 46.2% not functioning.

Source_Class

The source_class variable appears to be a composite of the source_type variable. The summary statistics shown below indicate a total of 3 distinct source_class values within the data set, including 278 records labeled as “unknown”. The number of pump installations per source_class is summarized in the table below.

# count distinct source_class = 3
length(unique(tanz$source_class))
## [1] 3
# calc number of wells per source_class
c_source_c <- arrange(summarise(group_by(tanz, source_class), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source_c, n = 10)
## # A tibble: 3 x 2
##   source_class TotalWells
##          <chr>      <int>
## 1  groundwater      45794
## 2      surface      13328
## 3      unknown        278

The source_class variable is effectively a binary indicator (groundwater vs. surface, plus a small number of unknowns). As shown below, both known categories appear equally likely to offer a functional pump; however, groundwater class pumps appear more likely than surface class pumps to be completely non-functional.

Waterpoint Type

The waterpoint_type variable indicates the type of pump installed at a given location. The summary statistics shown below indicate a total of 7 distinct waterpoint_type values within the data set, with each record having a valid value. The number of pump installations per waterpoint_type value is summarized in the table below.

# count distinct waterpoint_type = 7
length(unique(tanz$waterpoint_type))
## [1] 7
# calc number of wells per waterpoint_type
c_waterp_t <- arrange(summarise(group_by(tanz, waterpoint_type), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterp_t, n = 10)
## # A tibble: 7 x 2
##               waterpoint_type TotalWells
##                         <chr>      <int>
## 1          communal standpipe      28522
## 2                   hand pump      17488
## 3                       other       6380
## 4 communal standpipe multiple       6103
## 5             improved spring        784
## 6               cattle trough        116
## 7                         dam          7

As shown below, the waterpoint_type variable’s values indicate that pumps having a dam as their waterpoint are the most likely to be functional, while those of the communal standpipe multiple type appear to be the least likely. However, there are only 7 pumps having a waterpoint_type of dam. In fact, the most common waterpoint_type is communal standpipe, followed by hand pump. Both of those categories perform well relative to the overall functional benchmark of 54.3%.

Waterpoint Type Group

The waterpoint_type_group variable appears to be a composite of the waterpoint_type variable. The summary statistics shown below indicate a total of 6 distinct waterpoint_type_group values within the data set, with each record having a valid value. The number of pump installations per waterpoint_type_group is summarized in the table below.

# count distinct waterpoint_type_group = 6
length(unique(tanz$waterpoint_type_group))
## [1] 6
# calc number of wells per waterpoint_type_group
c_waterp_t_g <- arrange(summarise(group_by(tanz, waterpoint_type_group), 
                     TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterp_t_g, n = 10)
## # A tibble: 6 x 2
##   waterpoint_type_group TotalWells
##                   <chr>      <int>
## 1    communal standpipe      34625
## 2             hand pump      17488
## 3                 other       6380
## 4       improved spring        784
## 5         cattle trough        116
## 6                   dam          7

As shown above, the waterpoint_type_group appears to be a composite of the waterpoint_type variable, with the sole difference being the two standpipe groups have been summed together. As such, this variable is likely duplicative / redundant and might possibly be ignored for purposes of model building. Plots showing the status of pumps by waterpoint_type_group value are shown below.

Initial Predictive Inferences

The barplots shown above allow us to gain insight into some of the predictive aspects of the data. The plots show that most of the categorical variables do, in fact, contain significant predictive characteristics. For example, we find geographic variability with respect to the functional status of pumps via variables such as basin and region:

Additional variability in the functional status of pumps can be found via variables such as scheme_management, construction_year, public_meeting, extraction_type, management, payment, source, quantity, water_quality, and waterpoint_type:

Plots for the numeric variables show that gps_height, longitude, and latitude also contain predictive characteristics, while population and amount_tsh appear likely to have less predictive value.

This variability is indicative of the predictive value of the variables. For example, pumps located within the Lake Nyasa basin are far more likely to be functional than are pumps located within the Southern Coast basin (65.4% vs 37.2%), and relatively newer pumps are more likely to be functional than relatively older pumps (see plot for construction_year). Furthermore, pumps located in the Arusha region are more than twice as likely to be functional as pumps found within the region of Lindi (68.5% vs 29.8%), and pumps that require some form of payment for use are much more likely to be functional than pumps that require no payment (see plot for payment). Similar types of inferences can be postulated relative to the functional needs repair and non functional statuses.

Such inferences are helpful for purposes of model building since they suggest that many variables found within the data set are strong candidates for inclusion within a prospective predictive / classification model. They also suggest that varying combinations of independent variables might be tested to determine whether any particular combination yields a superior model.

Appendix B: Data Preparation

Imputations

Numeric variables:

We considered the numeric variables (‘amount_tsh’, ‘latitude’, ‘longitude’, ‘gps_height’, ‘population’, and ‘construction_year’) in our models; the imputation techniques used to handle missing values for these variables are explained below:

amount_tsh: This variable contains the Total Static Head (the difference in elevation between a fluid’s source and destination surfaces). The unit of measurement should be a distance metric such as feet or meters, but the unit used for this variable in the data set is unknown; we assume a uniform unit was applied throughout. We used 4 imputation strategies (mean, median, ignore, and custom) for this variable:

  • “mean” - use the mean value of the non-zero/non-null observations to impute 0s and null values of amount_tsh.

  • “median” - use the median value of the non-zero/non-null observations to impute 0s and null values of amount_tsh.

  • “ignore” - just substitute the null values of amount_tsh with 0.

  • “custom” - compute the median of amount_tsh grouped by the source_class, basin, and waterpoint_type_group variables; if no valid median exists for a record’s combination of those values, fall back to the median grouped by waterpoint_type_group alone. source_class represents the type of water source (groundwater or surface) and basin represents the water source itself, so these two variables should characterize the source surface, while waterpoint_type_group represents the type of pump (hand pump, communal standpipe, etc.). Hence the median value of amount_tsh grouped by these three variables could potentially represent a reliable value for imputing amount_tsh (a condensed sketch follows this list).
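
A condensed sketch of the “custom” strategy is shown below; the full, hyperparameter-driven implementation is the AmountTSHTransformer class listed in Appendix C, and df here is an assumed pandas data frame containing the relevant columns:

import pandas as pd

grp_cols = ['source_class', 'basin', 'waterpoint_type_group']
known = df[df['amount_tsh'] > 0]
med_full = known.groupby(grp_cols)['amount_tsh'].median()
med_wp = known.groupby('waterpoint_type_group')['amount_tsh'].median()

def impute_tsh(row):
    if row['amount_tsh'] > 0:
        return row['amount_tsh']
    key = (row['source_class'], row['basin'], row['waterpoint_type_group'])
    if key in med_full.index:                              # full three-level grouping
        return med_full[key]
    return med_wp.get(row['waterpoint_type_group'], 0.0)   # fallback grouping

df['amount_tsh'] = df.apply(impute_tsh, axis=1)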

latitude & longitude: These two variables represent the precise location of a pump and contain 1,812 unknown values (represented as 0). We treated these 1,812 values as unknown because the latitudes and longitudes of Tanzania range between [-13, 0.25] and [28, 42], respectively [11]. We used 3 different imputation techniques (“mean”, “median”, and “custom”). The “mean” and “median” methods use the mean and median of the non-zero/non-null values, respectively, for imputation. The “custom” technique uses the data from the “ward”, “lga”, and “region” variables. If a latitude value is missing, the custom technique uses the average of all non-zero/non-null latitudes within the “ward” where the pump is located. If all latitudes are unknown for the ward, we use the average of the non-zero/non-null latitudes at the “lga” level; if the “lga” also has only unknown latitudes, we use the average at the “region” level; and if the region likewise has only unknown or zero latitudes, we use the average of all non-zero latitudes at the country level. The same technique is applied to longitudes. According to [12], Tanzanian regions are subdivided into districts, districts into wards, and wards into subvillages; hence we started our search at the ward level. We used “lga” (Local Government Authority) as the next level, based on the assumption that an “lga” represents one or more wards. We could have started the search at the subvillage level instead of the ward level, but the ward and lga variables have no missing values, whereas the subvillage variable does. We chose region as the highest level since future data might lack ward/lga details, and the predictive model must handle that situation gracefully; it is very unlikely that region information would be missing, since Tanzania has only 22 regions or divisions [12]. If the region is also missing in future data, we use the country-level average. A sketch of this hierarchical fallback is shown below.
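
The sketch below shows the hierarchical fallback for latitude under the assumption that invalid coordinates are coded as 0; the same logic applies to longitude, and df is again an assumed pandas data frame:

import numpy as np

valid = df[df['latitude'] != 0]
ward_avg = valid.groupby('ward')['latitude'].mean()
lga_avg = valid.groupby('lga')['latitude'].mean()
region_avg = valid.groupby('region')['latitude'].mean()
country_avg = valid['latitude'].mean()

def impute_lat(row):
    if row['latitude'] != 0:
        return row['latitude']
    # walk up the geographic hierarchy until a known average is found
    for level, avg in (('ward', ward_avg), ('lga', lga_avg), ('region', region_avg)):
        val = avg.get(row[level], np.nan)
        if not np.isnan(val):
            return val
    return country_avg                                     # country-level fallback

df['latitude'] = df.apply(impute_lat, axis=1)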

gps_height: This variable represents the altitude of the pump’s location. For this variable we used the “mean”, “median”, “ignore”, and “custom” strategies. The “mean”, “median”, and “ignore” strategies are as described for the amount_tsh variable. The “custom” strategy uses the pump’s location (latitude and longitude), so those two variables must be successfully imputed before it is applied. The “custom” method is controlled by two additional hyperparameters: “initial_radius” and “increment_radius”. The “initial_radius” represents the initial search area; each degree of latitude (or longitude) represents approximately 110 km [13]. So if “initial_radius” is set to 0.1, the custom method considers all gps_heights within roughly an 11 km radius of the pump’s coordinates and imputes the median of the non-zero gps_heights found there. If no valid gps_heights are found within that radius, the search radius is increased by “increment_radius”; with “increment_radius” set to 0.2, each step widens the search by roughly 22 km. The search continues until the radius reaches 2 degrees (approximately 220 km). If no valid gps_height is found within that distance, the country’s median gps_height is used to impute the gps_height variable.

population: This variable is imputed using the same logic as described for gps_height.

construction_year: This variable represents the year in which the pump was installed. For this variable we again defined 4 imputation methods: “mean”, “median”, “ignore”, and “custom”. The first three are as described for the amount_tsh variable. The “custom” technique uses the “date_recorded” variable to calculate the approximate age of the pump: we compute the difference between date_recorded and construction_year (using non-zero/non-null data only) to obtain each pump’s age, and any age values that are negative or unknown (null) are replaced with the median of the valid computed ages. A sketch of this computation is shown below.
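
A minimal sketch of the age computation follows, assuming date_recorded parses as a date and unknown construction years are coded as 0; the pump_age column name is illustrative:

import pandas as pd

rec_year = pd.to_datetime(df['date_recorded']).dt.year
age = rec_year - df['construction_year']
# ages derived from a zero construction_year, or negative ages, are invalid
invalid = (df['construction_year'] == 0) | (age < 0)
age[invalid] = age[~invalid].median()        # substitute the median valid age
df['pump_age'] = age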

Categorical variables:

We have three types of categorical variables in our data set: 1. variables with many levels, such as subvillage (19,288 levels); 2. variables with a small number of levels (20 or fewer), such as district_code; and 3. variables such as installer (2,146 levels) and funder that contain free-form text with a very large number of levels.

Each of these categorical variables has to be quantified before we start building a predictive model. To tackle the variables with many levels, we made use of two binning techniques: frequency based and response-variable based. In the frequency-based technique, we bin the levels of a categorical variable based on the percentage of the data each level represents. For example, suppose a categorical variable has 3 levels A, B, and C representing 12%, 15%, and 73% of the data, respectively. If we use a 5% range as the bucket size, then levels A and B fall into the 11-15% bucket, while C falls into the 71-75% bucket.

The response-based binning takes the target variable into account (the target variable must be categorical). For example, if a categorical variable has three levels A, B, and C, and the target variable has only 2 levels, TRUE and FALSE, then we first compute the TRUE and FALSE percentage values for each of A, B, and C. If level A is 81% TRUE, B is 13% TRUE, and C is 84% TRUE, then we bin A and C together into one bin (80-85%) and B into the 10-15% bin.

The desired binning technique and the bin size are controlled by hyperparameters, which can be tuned/tested using techniques like cross validation. After binning the levels, we apply the same dummy-variable technique used to handle categorical variables with a small number of levels (usually 20 or fewer).

To handle categorical variables with a small number of levels, we create dummy variables to represent the presence/absence of each level. For example, if a categorical variable has three levels A, B, and C, then we create 3 dummy variables, dummy_A, dummy_B, and dummy_C, to represent the presence/absence of the A/B/C levels. To represent A, we can use [1,0,0] (format [dummy_A, dummy_B, dummy_C]). This technique is also known as one-hot encoding. Some statistical texts, such as [14], suggest creating n-1 dummy variables to represent n levels so as to avoid correlated variables in the data set. We instead create n dummy variables for n levels so that we can gracefully handle an unknown level in future incoming data. For instance, if our training data has just 3 levels A, B, and C, then we create 3 dummy variables; if, after deploying our model in production, the streaming data contains a different level such as D, then we can represent D by setting the dummy variable values to [0, 0, 0]. A toy example is shown below.
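
The toy example below demonstrates this behavior using scikit-learn’s LabelBinarizer, the same class used by the CatMultiLabelTransformer listed in Appendix C. Note that LabelBinarizer collapses exactly two classes into a single column, which is why that transformer appends a dummy class for binary variables; the snippet here is illustrative only:

from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
lb.fit(['A', 'B', 'C'])        # three levels -> three dummy columns
print(lb.transform(['A']))     # [[1 0 0]]
print(lb.transform(['D']))     # [[0 0 0]] -- unseen level maps to all zeros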

To handle the third type of categorical variable (free-form text), we evaluated 2 grouping techniques: one based on fuzzy logic [15] and one that considers only the first “n” characters of the text. The fuzzy-logic technique proved impractical because its runtime was prohibitively high. We therefore adopted the second method, grouping the data based on the initial characters of the text. Once the data is grouped, we create dummy variables to represent each group. A brief sketch of this grouping is shown below.
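
As a brief illustration, the snippet below (assuming a pandas data frame df with an installer column, and example hyperparameter values initial_chars=3 and groups=15) groups the free-form text by its leading characters; the installer_grp column name is hypothetical:

prefix = df['installer'].astype(str).str[:3]     # first 3 characters
top = prefix.value_counts().nlargest(14).index   # 14 most frequent prefixes
df['installer_grp'] = prefix.where(prefix.isin(top), other='other')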

To test all the options listed above for handling categorical variables, we created the following hyperparameters:

  • “substitute” - The value used to replace null values of a categorical variable.

  • “convert_to_small” - Controls whether all text is converted to lowercase.

  • “buckets” - Controls the desired number of bins to group the categorical data.

  • “method” - Controls which type of binning to apply. Can have “freq”, “resp” and “both” to apply frequency based, response based, and both frequency and response based techniques respectively.

  • “initial_chars” - Controls the initial number of characters to consider while processing the free form text.

  • “groups” - Controls how many groups to retain once the initial characters are extracted. For example, if we set groups=15, then we create 14 groups for the 14 most frequent sets of initial characters, and the 15th group, other, contains all remaining rows whose initial characters are not among the top 14.

In our project, we used “unknown” to replace all null values in all categorical columns, and we converted all categorical variable data to lowercase. An example of how these settings combine into a pipeline is shown below.
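
The example below illustrates how these settings might be combined into a single scikit-learn pipeline; the parameter values are examples only, and the class definitions appear in Appendix C:

from sklearn.pipeline import Pipeline

cat_pipeline = Pipeline([
    ('nulls', HandleCategoricalNulls(substitute='unknown',
                                     convert_all_to_small=True)),
    ('freq_bin', FreqBasedCategoricalBinning(buckets=20, apply=True)),
    ('dummies', CatMultiLabelTransformer(apply=True)),
])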

The categorical variables [funder, installer] were treated as free-form text with many levels; the variables [‘subvillage’, ‘lga’, ‘ward’] were treated as having many levels; and the variables [‘basin’, ‘region’, ‘district_code’, ‘public_meeting’, ‘scheme_management’, ‘permit’, ‘extraction_type_group’, ‘management’, ‘payment_type’, ‘water_quality’, ‘quantity_group’, ‘source’, ‘waterpoint_type’] were treated as categorical variables with a small number of levels (20 or fewer). The remaining categorical variables [‘scheme_name’, ‘extraction_type’, ‘extraction_type_class’, ‘management_group’, ‘quality_group’, ‘source_type’, ‘source_class’, ‘waterpoint_type_group’] were excluded from further processing since they duplicate data available via other variables.

Appendix C: Python Transformation Pipelines

All data transformations were implemented in Python 3.4, and the source code is listed below:

Required packages

#Import all the required packages:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, LabelBinarizer, Imputer, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
#from fuzzywuzzy import fuzz
#from fuzzywuzzy import process
from sklearn.externals import joblib
import math
import pickle

DataFrameSelector

###################################################################
#Every class defined will have a variable called column_names.
#This will help us to get the column_names of the transformation.
#It can be accessed like the way shown below:
#Example-1:
#freq_bin = FreqBasedCategoricalBinning()
#freq_bin.fit(test_df)
#freq_bin.transform(test_df)
#freq_bin.column_names will 
#contain the column names of the final transformation
#
#Example-2 (using pipeline):
#freq_pipeline = \
#Pipeline([('FreqBasedCategoricalBinning',FreqBasedCategoricalBinning())])
#freq_pipeline.fit(test_df)
#freq_pipeline.transform(test_df)
#freq_pipeline.named_steps['FreqBasedCategoricalBinning'].column_names 
#will contain the list of column names
#used in the transformation
#This will help us to select the desired columns of a data frame
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self,attribute_names):
        self.attribute_names = attribute_names
        
    def fit(self,X, y=None):
        self.column_names = self.attribute_names
        return self
    def transform(self,X):
        return X[self.attribute_names]

HandleCategoricalNulls

#This will help us to convert all text columns to small case and
#also help us to substitute a user defined text in the place of NA values        
class HandleCategoricalNulls(BaseEstimator, TransformerMixin):
    def __init__(self,substitute='unknown',convert_all_to_small=True):
        '''
           substitute: value used to replace null entries
           convert_all_to_small: if True, convert all text to lowercase
        '''
        self.substitute = substitute
        self.convert_all_to_small = convert_all_to_small
        self.column_names = []
    def get_alpha_numeric(self,x):
        temp_l = list()
        for i in list(x):
            temp = re.sub(r'[^a-zA-Z0-9]',' ', i)
            temp = re.sub(' +'," ",temp)
            temp_l.append(temp)
        return temp_l
    def fit(self,X, y=None):
        self.column_names = list(X.columns) 
        return self
    def transform(self,X):
        X=X.fillna(self.substitute)
        if self.convert_all_to_small:
            X = X.apply(lambda x: x.astype(str).str.lower())
            #Include only alpha-numeric chars
            #Remove any contiguous spaces    
            for i in X.columns:
                X[i] = self.get_alpha_numeric(X[i])
        return X        

ApplyFuzzyTextProcess

This class is not used in the pipeline because it is computationally intensive, but the definition is included so that it can be enhanced in future work.

#helps to control whether to group text data based on common grouping
#but do not use this, as it runs for a while
class ApplyFuzzyTextProcess( BaseEstimator, TransformerMixin):
    def __init__(self,map_dict,apply=True):
        from fuzzywuzzy import fuzz
        from fuzzywuzzy import process
        self.map_dict = map_dict
        self.apply = apply
        self.column_names = []
    def find_and_replace(self,l):
        temp_l = []
        for j in list(l):
            try:
                temp_l.append(self.map_dict[str(j).lower()])
            except:
                word,score = process.extract(j, self.map_dict.keys(), limit=1)[0]
                if score > 85:
                    temp_l.append(self.map_dict[word])
                else:
                    temp_l.append(j)
                continue    
                
        return temp_l       
    def fit(self,X, y=None):
        self.column_names = list(X.columns)  
        return self
        
    def transform(self,X):
        if self.apply:
            for i in X.columns:
                X[i] = self.find_and_replace(X[i])
        return X

FreqBasedCategoricalBinning

##This class will help us to bin the categorical variables with 
##many levels based on the frequency distribution
class FreqBasedCategoricalBinning( BaseEstimator, TransformerMixin):
    def __init__(self,buckets=20,apply=True):
        '''
        buckets - Desired number of classes. 
        you can get less than or equal to the desired buckets,
        if the data is heavily skewed or if we have missing ranges
        '''
        #Calculate the number of bins
        try:
            self.bin_size=np.floor(100.0/(buckets-1))
        except:
            self.bin_size=5.0
        self.freq_dict={}
        self.column_names = []
        self.apply = apply
    def get_freq(self,df):
        #Get the frequency of each level in col 
        for col in df.columns:
            total_count=df.groupby([col]).size().reset_index()
            total_count.columns=[col,col+'_freq_bin']
            total_count.index = list(total_count[col])
            total_count = total_count.drop([col],axis=1)        
            #Save the result to a dictionary
            self.freq_dict[col] = \
            np.ceil(total_count.iloc[:,[0]]/np.max(total_count.iloc[:,0])*100)
    
    def join(self,X,key,value):
        #add a column called 'sorted' to save the order.
        X['sorted'] = np.arange(len(X))
        X.index = list(X[key])
        X.drop([key],axis=1,inplace=True)
        X=X.join(value,how='left')
        #Left join may result in NaN, 
        #if we have unseen levels in the variable
        #We will pad such values with median
        temp_val = np.median(value.iloc[:,[0]])
        X[[value.columns[0]]] = X[[value.columns[0]]].fillna(temp_val)
        
        #Bin the data
        X[[value.columns[0]]] = np.ceil(X[[value.columns[0]]]/self.bin_size)
        
        #Convert the bins to int and later to str
        X[[value.columns[0]]] = X[[value.columns[0]]].astype('int') 
        #X[[value.columns[0]]] = 'class_'+X[[value.columns[0]]].astype('str')
        X[[value.columns[0]]] = X[[value.columns[0]]].astype('str')
        #Reset and drop the first (new) column resulted in reset_index()
        X=X.reset_index()
        X.drop(X.columns[0],axis=1,inplace=True)
        #Restore the order:
        X.sort_values(['sorted'],inplace=True)
        X.drop(['sorted'],axis=1,inplace=True)
        return X
        
    def fit(self,X, y=None):
      if self.apply:
        self.get_freq(X)
        #FreqBasedCategoricalBinning = []
        #Do not set the column names here, to make the logic simple.
        self.column_names = []
        return self      
      else:
        self.column_names = []
        return self      
       
    def transform(self,X,y=None):
      if self.apply:
            X = X.copy()
            for key, value in self.freq_dict.items():
                X = self.join(X,key,value)
            #Set the column names here
            self.column_names = list(X.columns)
            return X    
      else:
            self.column_names = list(X.columns)
            return X

RespBasedCategoricalBinning

##This class will help us to bin the categorical variables with 
##many levels based on the target classes.
class RespBasedCategoricalBinning( BaseEstimator, TransformerMixin):
    def __init__(self,buckets=20,apply=True):
        '''
        buckets - Desired number of classes. 
        you can get less than or equal to the desired buckets,
        if the data is heavily skewed or if we have missing ranges
        '''
        ##Determine the bin size
        try:
           self.bin_size=np.floor(100.0/(buckets-1))
        except:
           self.bin_size= 5.0
        #Declare a dictionary, which will save the details
        #in the fit() function
        self.freq_dict={}
        self.column_names = []
        self.apply = apply
        
    ##This function will be called by fit() function.    
    def get_probs(self,X,y):
        #Get the columns of the X data frame
        X_columns = list(X.columns)
        #Combine X and y into a single data frame
        df = X
        
        #_target_variable is the target column, which is a categorical column
        df['_target_variable'] = list(y)
        
        #For each column in the X data frame, perform the following:
        for col in X_columns:
        
            #Create a data frame with counts, group by the values of the column col
            total_count=df.groupby([col]).size().reset_index()
            
            #Assign the column names to the data frame with group counts
            total_count.columns=[col,'total']
            
            #Make the col as the index.
            #Pandas join is not behaving well if we are joining on a column.
            #So I am assigning the index using the to be joined col values
            total_count.index = list(total_count[col])
            
            #Drop the col as its values are in the index already.
            total_count = total_count.drop([col],axis=1)        
            
            #Now create another data frame, that contains the counts of 
            #values on col, group by target variable
            total_grp_count = df.groupby([col,'_target_variable']).size().reset_index()
            
            total_grp_count.columns=[col,'_target_variable','status_total']
            total_grp_count.index = total_grp_count[col]
            total_grp_count = total_grp_count.drop([col],axis=1)
            
            #Join the two data frames: total_count and total_grp_count
            #We have to use inner join, as we are certain that there will be NO mis-matches
            joined_df = total_count.join(total_grp_count,how='inner')
            joined_df = joined_df.reset_index()
            
            #Make sure that we have proper column names, after reset_index
            columns = list(joined_df.columns)
            columns[0] = col
            joined_df.columns = columns
            #print(joined_df)
            
            #Calculating the proportions
            joined_df['proportion'] = joined_df['status_total']/joined_df['total']
            
            #Pivot the table
            joined_df = pd.pivot_table(joined_df,values='proportion',\
                                       columns='_target_variable',index=col)
            #DO NOT reset the index on the pivot table.
            #As the pivoted table will have col values as the index, 
            #and this will be useful later in transform()
            
            #Fill the NaN values with 0
            joined_df = joined_df.fillna(0)
            for i in joined_df.columns:
                joined_df[i] = joined_df[i]/np.max(joined_df[i])
            #Rename the column names, by appending with col name. 
            #This will make sure that we do not have any duplicate columns            
            joined_df.columns = [col + '-'+str(i).replace(" ", "-") for i in joined_df.columns]
            #Save the col and pivoted table in a dictionary.
            #This dictionary will be used in transform() logic.
            self.freq_dict[col] = joined_df    
        #You have to drop the _target_variable, so that we revert back the changes to X
        df.drop(["_target_variable"],axis=1,inplace=True)
    #Define a function to help with the custom join            
    def join(self,X,key,value):
        '''
           X will be a data frame, on which we have to apply the transform()
           key will be a key in the self.freq_dict
           value will be the pivoted table corresponding to the key
        '''
        ##Create a column called 'sorted'  X. This will help
        ##us to restore the order of X later (or else there might be 
        ##change that the order of X might be disturbed)
        X['sorted'] = np.arange(len(X))
        #Make sure that X is indexed on the key column
        try:
             X.index = list(X[key])
             X.drop([key],axis=1,inplace=True)
             X=X.join(value,how='left')
        except:
             print("EXCEPTION/EROR: The input data frame does not have "+key+" column")
             print("Terminating the program")
         
        #Make sure that you fill the  NaN values with median
        #This is needed, to gracefully add unseen values in the categorical variables
        
        
        for i in value.columns:
            temp_val = np.median(value[i])
            X[i] = X[i].fillna(temp_val)
            X[i] = np.ceil(X[i]*100/self.bin_size)
            X[i] = X[i].astype('int') 
            X[i] = X[i].astype('str')
            
        X=X.reset_index()
        #Drop the first column, which is obtained by reset_index, as it is not needed
        X.drop(X.columns[0],axis=1,inplace=True)
        X.sort_values(['sorted'],inplace=True)
        X.drop(['sorted'],axis=1,inplace=True)
        return X
    #fit() will just build the self.freq_dict         
    #In the following func def, the y parameter is NOT optional
    def fit(self,X, y):
      if self.apply:
        self.get_probs(X,y)
        #To make the logic simple, we will set the column names in transform()
        self.column_names = []
        return self 
      else:
        self.column_names = []
        return self 
                 
    #transform() will iterate over the dictionary keys,
    #and build the transformation.
    #As we are using the self.freq_dict items,
    #even though the input data frame supplied to transform has 
    #extra values, we will not get any errors, and such columns will remain 
    #undisturbed. 
    def transform(self,X,y=None):
      if self.apply:
        X = X.copy()
        for key, value in self.freq_dict.items():
            X = self.join(X,key,value)
        self.column_names = list(X.columns) 
        return X    
      else:
        self.column_names = list(X.columns) 
        return X    

CatMultiLabelTransformer

#This class one-hot encodes each categorical column via a per-column LabelBinarizer
class CatMultiLabelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,apply=True): 
        self.column_names = [] 
        self.apply = apply
        self.binarizers={}
    
    def check_input_obj(self,X,location):
        ##Check if input object is a pandas df, else raise exception
        try:
            if not isinstance(X,pd.DataFrame):
               raise ValueError
        except:
            print("**EXCEPTION/ERROR**: In "+ location + \
                  " function of "+self.__name__+ ". Input must be a Pandas dataframe")
            exit(10)
        
    
    def fit(self, X, y=None):
      self.column_names = []
      if self.apply:         
        ##Check if input object is a pandas df, else raise exception
        self.check_input_obj(X,'fit()')
        ##Create an empty dict, 
        ##which will be updated with LabelBinarizer for each column        
        self.binarizers={}
        
        for col in X.columns:
            uniq_elements = list(set(X[col]))
            #print(uniq_elements)
            if len(uniq_elements) == 2:
               ##Add a dummy class
               #We have to name this class in a 
               #weird fashion,so that no data has this class
               uniq_elements.append('d#u/m*m-y+class_991-+xya')
            lb = LabelBinarizer()
            self.binarizers[col] = lb.fit(uniq_elements)
            #print(X)
            #self.column_names.append([str(col) + "_" + str(j) \ 
            #for j in list(lb.classes_) if j != 'd#u/m*m-y+class_991-+xya'])
            self.column_names = self.column_names + \
                 [str(col) + "_" + str(j) \
                   for j in list(lb.classes_) \
                     if j != 'd#u/m*m-y+class_991-+xya']
        #print("in transform")
        #print("len of self.binarizers",len(self.binarizers))
        #print("len of self.column_names",len(self.column_names))
        return self
      else:
         return self
    
    def transform(self, X, y=None):
         #print("in transform")
         #print("len of self.binarizers",len(self.binarizers))
         #print(self.apply)
         if self.apply:
            self.check_input_obj(X,'transform()')
            #X_transform = np.empty()
            temp_transformed_data = []
            transformed_column_names = []
            for key, value in self.binarizers.items():
                #print("key=",key)
                #print("value",value)
                #print("len of temp_transformed_data",len(temp_transformed_data))
                try:
                   temp_transformed_data.append(value.transform(X[key]))
                   transformed_column_names = \
                   transformed_column_names + \
                   [str(key) + "_" + str(j) \
                    for j in list(value.classes_)]
                except:
                   continue                
            return pd.DataFrame(np.concatenate(temp_transformed_data, axis=1),\
                                columns=transformed_column_names)[self.column_names]
            #transformed_column_names = self.column_names + [str(col) + "_" + \
            #str(j) for j in list(lb.classes_) if j != 'd#u/m*m-y+class_991-+xya']
            #transformed_X.columns = self.columns        
         else:
            self.column_names = list(X.columns)
            return X            

AmountTSHTransformer

#Here will be our strategy to handle amount_tsh:
#    1. Get the median values of 
#       amount_tsh group by 'source_class', 'basin', 
#       'waterpoint_type_group'
#    2. If amount_tsh is 0, fill the value using the 
#       median computed by grouping 'source_class', 
#       'basin', 'waterpoint_type_group' 
#    3. If there is no combination of 'source_class', 
#       'basin', 'waterpoint_type_group' found, 
#        fill using 'waterpoint_type_group'
#    4. Else, let it stay 0
class AmountTSHTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,method='custom'): 
        self.column_names = [] 
        self.method = method
    
    def check_input_obj(self,X,location):
        ##Check if input object is a pandas df, else raise exception
        try:
            if not isinstance(X,pd.DataFrame):
               raise ValueError
        except:
            print("**EXCEPTION/ERROR**: In "+ location + \
                  " function of "+self.__name__+ \
                  ". Input must be a Pandas dataframe")
            exit(10)
        
    def fit(self, X, y=None):
      self.column_names = ['amount_tsh']
      if self.method == 'custom':
            X['amount_tsh'] = X['amount_tsh'].astype(float)
            self.check_input_obj(X,"fit()") 
            #Make sure that you have all the required columns:
            if len(set(X.columns) - set(['amount_tsh','source_class', \
                                         'basin', 'waterpoint_type_group'])) == 0:
                
                #Get the required dictionaries...
                amount_tsh_df = X[X['amount_tsh'] != 0].groupby(['source_class', \
                                                                 'basin', \
                                                                 'waterpoint_type_group'])\
                                                                  ['amount_tsh'].median()
                self.amount_tsh_dict_all_level = dict(amount_tsh_df)
                amount_tsh_df = X[X['amount_tsh'] != 0].\
                                groupby(['waterpoint_type_group'])\
                                ['amount_tsh'].median()
                self.amount_tsh_dict_wp = dict(amount_tsh_df)
            else:
                raise ValueError("Check the supplied columns. Must supply 'source_class', \
                                 'basin', 'waterpoint_type_group', 'amount_tsh' only")
            self.column_names = ['amount_tsh']
            return self
      if self.method == 'median':
           
          X['amount_tsh'] = X['amount_tsh'].astype(float)
          self.median = np.median(list(X[X['amount_tsh'] != 0]['amount_tsh']))
          if math.isnan(self.median):
             self.median = 0
          self.column_names = ['amount_tsh']
          return self
      if self.method == 'mean':
          X['amount_tsh'] = X['amount_tsh'].astype(float)
          self.mean = np.mean(list(X[X['amount_tsh'] > 0]['amount_tsh']))
          if math.isnan(self.mean):
             self.mean = 0
          self.column_names = ['amount_tsh']
          return self
          
      if self.method == 'ignore':
          self.column_names = ['amount_tsh']
          return self
          
          
          

    def transform(self,X):
      if self.method == 'custom':
            X['amount_tsh'] = X['amount_tsh'].astype(float)
            self.check_input_obj(X,"transform()")
            transformed_amount_tsh = []
            for i, j, k, l in list(zip(X['amount_tsh'].\
                                       fillna(0),X['source_class'], \
                                       X['basin'], X['waterpoint_type_group'])):
                if i == 0:
                    try:
                        transformed_amount_tsh.append(self.amount_tsh_dict_all_level[(j,k,l)])

                    except:
                        try:
                            transformed_amount_tsh.append(self.amount_tsh_dict_wp[l])
                        except:
                                transformed_amount_tsh.append(i)
                                continue
                else:
                        transformed_amount_tsh.append(i)
            X['amount_tsh'] = transformed_amount_tsh
            return X[['amount_tsh']]
      if self.method == 'median':
          X['amount_tsh'] = X['amount_tsh'].astype(float)
          X['amount_tsh'] = X['amount_tsh'].fillna(0)
          amount_tsh = np.array(list(X['amount_tsh']))
          amount_tsh[amount_tsh == 0] = self.median
          X['amount_tsh']  = amount_tsh
          return X[['amount_tsh']]
      if self.method == 'mean':
          X['amount_tsh'] = X['amount_tsh'].astype(float)
          X['amount_tsh'] = X['amount_tsh'].fillna(0)
          amount_tsh = np.array(list(X['amount_tsh']))
          amount_tsh[amount_tsh == 0] = self.mean
          X['amount_tsh']  = amount_tsh
          return X[['amount_tsh']]
          
      if self.method == 'ignore':
          X['amount_tsh'] = X['amount_tsh'].astype(float)
          X['amount_tsh'] = X['amount_tsh'].fillna(0)          
          return X[['amount_tsh']]

GPSHeightTransformer

#Our strategy for handling gps_height:
# In fit(), save the input rows having gps_height > 0
# (along with latitude and longitude) as a data frame.
# In transform(), if gps_height == 0, start with a 0.1-degree
# search radius and check for any non-zero gps_height instances
# within it. If found, use their average; otherwise increase the
# radius by 0.3 degrees (0.1 degree corresponds to roughly 11 km).
# If nothing is found by the time the radius reaches 2 degrees,
# leave the value as-is.


class GPSHeightTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,init_radius=0.1,increment_radius=0.3,method = 'custom'): 
        self.column_names = [] 
        self.init_radius = init_radius
        self.increment_radius = increment_radius
        self.method = method

    
    def get_subset_records(self, latitude,longitude,df,radius):
        latitude_from = latitude - radius
        latitude_to = latitude + radius
        longitude_from = longitude - radius
        longitude_to = longitude + radius
        
        df_temp = df[(df['latitude'] >= latitude_from) & (df['latitude'] <= latitude_to) & \
                  (df['longitude'] >= longitude_from) & (df['longitude'] <= longitude_to)]
        return df_temp
       
    
    def check_input_obj(self,X,location):
        ##Check if input object is a pandas df, else report the error and exit
        if not isinstance(X,pd.DataFrame):
            print("**EXCEPTION/ERROR**: In " + \
                  location + " function of " + \
                  type(self).__name__ + \
                  ". Input must be a Pandas dataframe")
            exit(10)
        
    def fit(self, X, y=None):
      if self.method == 'custom':
        X['gps_height'] = X['gps_height'].astype(float)
        X['latitude'] = X['latitude'].astype(float)
        X['longitude'] = X['longitude'].astype(float)
        self.df = X[X['gps_height'] != 0]
        self.column_names = ['gps_height']
        return self
      if self.method == 'median':
        X['gps_height'] = X['gps_height'].astype(float)
        #X['gps_height'] = X['gps_height'].fillna(0)
        self.median = np.median(list(X[X['gps_height'] != 0]['gps_height']))
        
        if math.isnan(self.median):
            self.median = 0
        self.column_names = ['gps_height']
        return self
            
      if self.method == 'mean':
        X['gps_height'] = X['gps_height'].astype(float)
        #X['gps_height'] = X['gps_height'].fillna(0)
        self.mean = np.mean(list(X[X['gps_height'] != 0]['gps_height']))
        if math.isnan(self.mean):
            self.mean = 0
        self.column_names = ['gps_height']
        return self
        
      if self.method == 'ignore':
        self.column_names = ['gps_height']
        return self      
        
    def transform(self,X):
      if self.method == 'custom':
            X['gps_height'] = X['gps_height'].astype(float)
            X['latitude'] = X['latitude'].astype(float)
            X['longitude'] = X['longitude'].astype(float)
            
            gps_height_transformed = []
            for latitude, longitude, gps_height in \
                zip(X['latitude'],X['longitude'],X['gps_height']):
                radius = self.init_radius
                if gps_height == 0:
                    gps_height_temp = gps_height
                    while gps_height_temp == 0 and radius <= 2:
                        df_temp = self.get_subset_records\
                                  (latitude,longitude,self.df,radius)
                        
                        gps_height_temp = np.mean(df_temp[df_temp['gps_height']!=0]\
                                                  ['gps_height'])
                        if math.isnan(gps_height_temp):
                            gps_height_temp = 0 
                        radius = self.increment_radius + radius
                else:
                    gps_height_temp =gps_height
                gps_height_transformed.append(gps_height_temp)
            X['gps_height'] = gps_height_transformed
            self.column_names = ['gps_height']
            #self.column_names = list(X.columns)
            #return X[['latitude','longitude','gps_height']]
            return X[['gps_height']]
      if self.method == 'median':
        X['gps_height'] = X['gps_height'].astype(float)
        X['gps_height'] = X['gps_height'].fillna(0)
        gps_height = np.array(list(X['gps_height']))
        gps_height[gps_height == 0] = self.median
        self.column_names = ['gps_height']
        #self.column_names = list(X.columns)
        #return X[['latitude','longitude','gps_height']]
        X['gps_height'] = gps_height
        return X[['gps_height']]
        
      if self.method == 'mean':
        X['gps_height'] = X['gps_height'].astype(float)
        X['gps_height'] = X['gps_height'].fillna(0)
        gps_height = np.array(list(X['gps_height']))
        gps_height[gps_height == 0] = self.mean
        self.column_names = ['gps_height']
        #self.column_names = list(X.columns)
        #return X[['latitude','longitude','gps_height']]
        X['gps_height'] = gps_height
        return X[['gps_height']]
        
      if self.method == 'ignore':
        self.column_names = ['gps_height']
        X['gps_height'] = X['gps_height'].astype(float)
        X['gps_height'] = X['gps_height'].fillna(0)
        return X[['gps_height']]

PopulationTransformer

class PopulationTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,init_radius=0.1,increment_radius=0.3,method = 'custom'): 
        self.column_names = [] 
        self.init_radius = init_radius
        self.increment_radius = increment_radius
        self.method = method

    
    def get_subset_records(self, latitude,longitude,df,radius):
        latitude_from = latitude - radius
        latitude_to = latitude + radius
        longitude_from = longitude - radius
        longitude_to = longitude + radius
        
        df_temp = df[(df['latitude'] >= \
                      latitude_from) & (df['latitude'] <= latitude_to) & \
                  (df['longitude'] >= longitude_from) & \
                  (df['longitude'] <= longitude_to)]
        return df_temp
       
    
    def check_input_obj(self,X,location):
        ##Check if input object is a pandas df, else report the error and exit
        if not isinstance(X,pd.DataFrame):
            print("**EXCEPTION/ERROR**: In " + location + \
                  " function of " + type(self).__name__ + ". Input must be a Pandas dataframe")
            exit(10)
        
    def fit(self, X, y=None):
       if self.method == 'custom':
            X['latitude'] = X['latitude'].astype(float)
            X['longitude'] = X['longitude'].astype(float)
            X['population'] = X['population'].astype(float)
            self.df = X[X['population'] > 1]
            self.column_names = ['population']
            return self
       if self.method == 'median':
           X['population'] = X['population'].astype(float)
           self.median = np.median(list(X[X['population'] != 0]['population']))
           if math.isnan(self.median):
              self.median = 0
           self.column_names = ['population']
           return self

       if self.method == 'mean':
           X['population'] = X['population'].astype(float)
           self.mean = np.mean(list(X[X['population'] != 0]['population']))
           if math.isnan(self.mean):
              self.mean = 0
           self.column_names = ['population']   
           return self
       if self.method == 'ignore':
           self.column_names = ['population']   
           return self      
       
    def transform(self,X):
        self.column_names = ['population']
        if self.method == 'custom':      
            X['latitude'] = X['latitude'].astype(float)
            X['longitude'] = X['longitude'].astype(float)
            X['population'] = X['population'].astype(float)
        
            population_transformed = []
            for latitude, longitude, population in \
                zip(X['latitude'],X['longitude'],X['population']):
                radius = self.init_radius
                if population <= 1:
                    population_temp = population
                    while population_temp <= 1 and radius <= 2:
                        df_temp = self.get_subset_records\
                                  (latitude,longitude,self.df,radius)
                        
                        population_temp = np.mean(df_temp['population'])
                        if math.isnan(population_temp):
                            population_temp = population 
                        radius = self.increment_radius + radius
                else:
                    population_temp =population
                population_transformed.append(population_temp)
            X['population'] = population_transformed
            #self.column_names = ['population']
            #self.column_names = list(X.columns)
            self.column_names = ['population']
            return X[['population']]

        if self.method == 'median':      
                X['population'] = X['population'].astype(float)
                X['population'] = X['population'].fillna(0)
                population = np.array(list(X['population']))
                population[population == 0] = self.median
                self.column_names = ['population']
                #self.column_names = list(X.columns)
                #return X[['latitude','longitude','gps_height']]
                X['population'] = population
                return X[['population']]

        if self.method == 'mean':
                X['population'] = X['population'].astype(float)
                X['population'] = X['population'].fillna(0)
                population = np.array(list(X['population']))
                population[population == 0] = self.mean
                self.column_names = ['population']
                #self.column_names = list(X.columns)
                #return X[['latitude','longitude','gps_height']]
                X['population'] = population
                return X[['population']]
        
        if self.method == 'ignore':      
                X['population'] = X['population'].astype(float)
                X['population'] = X['population'].fillna(0)
                return X[['population']]        

YearTransformer

#We will create another variable: age, 
#based on the year of construction and recorded date
class YearTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,method = 'custom'): 
        self.column_names = [] 
        #self.init_radius = init_radius
        #self.increment_radius = increment_radius
        self.method = method
        
    def fit(self, X, y=None):
        X['construction_year'] = X['construction_year'].astype(float)
        if self.method == 'custom':
            year_recorded = X[X['construction_year'] > 0]\
                            ['date_recorded'].\
                            apply(lambda x: int(x.split("-")[0]))
            year_constructed = X[X['construction_year'] > 0]['construction_year']
            self.median_age = np.median(year_recorded - year_constructed)
            self.column_names = ['age']
            return self
        if self.method == 'median':
               X['construction_year'] = X['construction_year'].astype(float)
               self.median = \
                          np.median(list(X[X['construction_year'] != 0]['construction_year']))
               if math.isnan(self.median):
                  self.median = 0
               self.column_names = ['construction_year']
               return self
        if self.method == 'mean':
               X['construction_year'] = X['construction_year'].astype(float)
               self.mean = np.mean(list(X[X['construction_year'] != 0]['construction_year']))
               if math.isnan(self.mean):
                  self.mean = 0
               self.column_names = ['construction_year']
               return self

        if self.method == 'ignore':
               self.column_names = ['construction_year']
               return self
          
    def transform(self,X):
        if self.method == 'custom':
            year_recorded = list(X['date_recorded'].apply(lambda x: int(x.split("-")[0])))
            year_constructed = list(X['construction_year'])
            age = []
            for i,j in enumerate(year_constructed):
                if j == 0:
                   age.append(self.median_age)
                else:
                   temp_age = year_recorded[i] - year_constructed[i]
                   if temp_age < 0:
                      temp_age = self.median_age
                   age.append(temp_age)   
            X['age'] = age
            self.column_names = ['age']
            #self.column_names = X.columns
            return X[['age']]
        if self.method == 'median':      
                X['construction_year'] = X['construction_year'].astype(float)
                X['construction_year'] = X['construction_year'].fillna(0)
                construction_year = np.array(list(X['construction_year']))
                construction_year[construction_year == 0] = self.median
                self.column_names = ['construction_year']
                X['construction_year'] = construction_year
                return X[['construction_year']]

        if self.method == 'mean':
                X['construction_year'] = X['construction_year'].astype(float)
                X['construction_year'] = X['construction_year'].fillna(0)
                construction_year = np.array(list(X['construction_year']))
                construction_year[construction_year == 0] = self.mean
                self.column_names = ['construction_year']
                X['construction_year'] = construction_year
                return X[['construction_year']]
        
        if self.method == 'ignore':      
                X['construction_year'] = X['construction_year'].astype(float)
                X['construction_year'] = X['construction_year'].fillna(0)
                self.column_names = ['construction_year']
                return X[['construction_year']]        

LatitudeLongitudeProcess

##Sekhar: This can be written as a generic class...
##but leave it as it is as of now.
##I will look at this later...
class LatitudeLongitudeProcess( BaseEstimator, TransformerMixin):
    def __init__(self,strategy='median'):
        '''
           strategy = 'median' is the default.
           Other valid values are 'mean' and 'custom'.
        ''' 
        self.strategy = strategy
        self.median_longitude = 0
        self.custom_longitude = 0
        self.median_latitude = 0
        self.custom_latitude = 0
        self.avg_lat_ward_dict = {}
        self.avg_long_ward_dict = {}
        self.avg_lat_lga_dict = {}
        self.avg_long_lga_dict = {}
        self.avg_lat_region_dict = {}
        self.avg_long_region_dict = {}
        self.avg_lat_country_dict = {}
        self.avg_long_country_dict = {}
        self.column_names = [] 
        
    def get_level_means(self,X):
          if 'ward' in X.columns:   
                #Get average of lats and longs at the ward level
                #First delete rows that have unknown ward values:
                df = X[~((X['ward'].isnull()) | (X['ward'] == 'unknown'))]
                avg_lat_long_by_ward_df = df[df['longitude'] != 0]. \
                groupby(['ward'])['latitude','longitude'].mean().reset_index()
                if len(avg_lat_long_by_ward_df) > 0:
                    avg_lat_long_by_ward_df.columns=['ward','avg_latitude','avg_longitude']
                    self.avg_lat_ward_dict = dict(zip(list(avg_lat_long_by_ward_df['ward']),\
                                                      list(avg_lat_long_by_ward_df['avg_latitude'])))
                    self.avg_long_ward_dict = dict(zip(list(avg_lat_long_by_ward_df['ward']),\
                                                       list(avg_lat_long_by_ward_df['avg_longitude'])))
          if 'lga' in X.columns:        
                #Get average of lats and longs at the lga level
                #First delete rows that have unknown region values:
                df = X[~((X['lga'].isnull()) | (X['lga'] == 'unknown'))]
                avg_lat_long_by_lga_df = df[df['longitude'] != 0]. \
                groupby(['lga'])['latitude','longitude'].mean().reset_index()
                if len(avg_lat_long_by_lga_df) > 0:
                    avg_lat_long_by_lga_df.columns=['lga','avg_latitude','avg_longitude']
                    self.avg_lat_lga_dict = dict(zip(list(avg_lat_long_by_lga_df['lga']),
                                                     list(avg_lat_long_by_lga_df['avg_latitude'])))
                    self.avg_long_lga_dict = dict(zip(list(avg_lat_long_by_lga_df['lga']),
                                                      list(avg_lat_long_by_lga_df['avg_longitude'])))
                
          if 'region' in X.columns:                
                #Get average of lats and longs at the region level
                #First delete rows that have unknown region values:
                df = X[~((X['region'].isnull()) | (X['region'] == 'unknown'))]
                avg_lat_long_by_region_df = df[df['longitude'] != 0]. \
                groupby(['region'])['latitude','longitude'].mean().reset_index()
                if len(avg_lat_long_by_region_df) > 0:
                    avg_lat_long_by_region_df.columns=['region','avg_latitude','avg_longitude']
                    self.avg_lat_region_dict = dict(zip(list(avg_lat_long_by_region_df['region']),\
                                   list(avg_lat_long_by_region_df['avg_latitude'])))
                    self.avg_long_region_dict = dict(zip(list(avg_lat_long_by_region_df['region']),\
                                    list(avg_lat_long_by_region_df['avg_longitude'])))            
          
          #Get average of lats and longs at the country level
          avg_long = np.mean(X[X['longitude'] != 0]['longitude'])
          avg_lat = np.mean(X[X['latitude'] != 0]['latitude'])
          self.avg_lat_country_dict['country'] = avg_lat
          self.avg_long_country_dict['country'] = avg_long
        
    def fit(self,X, y=None):
        self.column_names = ['latitude','longitude']
        X['latitude'] = X['latitude'].astype(float)
        X['longitude'] = X['longitude'].astype(float)
        if self.strategy == 'custom':
           self.get_level_means(X)
           
        elif self.strategy == 'mean':
           #Impute using mean
           self.mean_longitude = np.mean(X[X['longitude'] != 0]['longitude'])
           self.mean_latitude = np.mean(X[X['latitude'] != 0]['latitude'])           
        elif self.strategy == 'median':
           #Impute using median
           self.median_longitude = np.median(X[X['longitude'] != 0]['longitude'])
           self.median_latitude = np.median(X[X['latitude'] != 0]['latitude'])           
        else:
            print("Invalid strategy supplied for LatitudeLongitudeProcess.")
            print("Valid values are 'mean', 'median' or 'custom'. Terminating the program")
            exit(10)
        return self
    def make_up_lat_long(self,X):
          #Handle the situation gracefully, if the incoming data does not have the required columns
          len_X = len(X)
          try:     
              latitude_list = list(X['latitude'].fillna(0))
          except:
                latitude_list = list(np.zeros(len_X))
          try:          
              longitude_list = list(X['longitude'].fillna(0))
          except:
                longitude_list = list(np.zeros(len_X))
          return latitude_list, longitude_list   
        
    def custom_transform(self, X):     
          len_X = len(X)
          
          #Declare lists to hold the transformed lat and long
          latitude_transformed = []
          longitude_transformed = []
          
          #Handle the situation gracefully, if the incoming data does not have any required columns
          latitude_list, longitude_list =  self.make_up_lat_long(X)
          
          try:  
              ward_list = list(X['ward'].fillna('unknown'))
          except:
              ward_list = ['unknown'] * len_X
              #continue
              
          try:    
              lga_list = list(X['lga'].fillna('unknown'))
          except:
              lga_list = ['unknown'] * len_X
              #continue
              
          try:    
              region_list = list(X['region'].fillna('unknown'))
          except:
              region_list = ['unknown'] * len_X
              #continue
              
          for (i, j, k, l, m) in zip(latitude_list,longitude_list, \
                                    ward_list,lga_list,region_list):
                if np.round(i) == 0 or np.round(j) == 0:
                    try:
                        latitude_transformed.append(self.avg_lat_ward_dict[k])
                        longitude_transformed.append(self.avg_long_ward_dict[k])
                    except:
                        try:
                            latitude_transformed.append(self.avg_lat_lga_dict[l])
                            longitude_transformed.append(self.avg_long_lga_dict[l])
                            continue
                        except:
                            try:
                                latitude_transformed.append(self.avg_lat_region_dict[m])
                                longitude_transformed.append(self.avg_long_region_dict[m])
                                continue
                            except:   
                                latitude_transformed.append(self.avg_lat_country_dict['country'])
                                longitude_transformed.append(self.avg_long_country_dict['country'])
                                continue
                else:
                    latitude_transformed.append(i)
                    longitude_transformed.append(j)     
          X['latitude'] = latitude_transformed
          X['longitude'] = longitude_transformed          
          return X
    def transform(self,X):
      X['latitude'] = X['latitude'].astype(float)
      X['longitude'] = X['longitude'].astype(float)
      self.column_names = ['latitude','longitude'] 
      if self.strategy == 'custom':
         X = self.custom_transform(X)
         return X[['latitude','longitude']]
      elif self.strategy == 'mean':
         latitude_list, longitude_list =  self.make_up_lat_long(X)
         latitude_list = np.array(latitude_list)
         longitude_list = np.array(longitude_list)
         latitude_list[latitude_list == 0] = self.mean_latitude
         longitude_list[longitude_list == 0] = self.mean_longitude
         X['latitude'] = latitude_list
         X['longitude'] = longitude_list
         return X[['latitude','longitude']]
      elif self.strategy == 'median':
         latitude_list, longitude_list =  self.make_up_lat_long(X)
         latitude_list = np.array(latitude_list)
         longitude_list = np.array(longitude_list)
         latitude_list[latitude_list == 0] = self.median_latitude
         longitude_list[longitude_list == 0] = self.median_longitude
         X['latitude'] = latitude_list
         X['longitude'] = longitude_list
         return X[['latitude','longitude']]
         #self.column_names = list(X.columns)
         #return X
      else:
         print("EXCEPTION: The supplied strategy",self.strategy," is incorrect")
         exit(10)

FunderInstTransformer

class FunderInstTransformer( BaseEstimator, TransformerMixin):
    def __init__(self,initial_chars=3,groups=15,apply=True):
        self.initial_chars = initial_chars
        self.groups = groups
        self.apply = apply
        self.group_dict = dict()
    def fetch_first_n_chars(self,l):
        temp_l = [str(j).lower()[0:self.initial_chars] for j in list(l)]
        return pd.Series(temp_l)
                
    def fit(self,X, y=None):   
      if self.apply:
        self.column_names = ['funder','installer'] 
        self.group_dict = dict()
        for i in X.columns:
                temp_series = self.fetch_first_n_chars(X[i]).value_counts()
                temp_series.sort_values(ascending=False,inplace=True)
                top_groups = list(temp_series[0:self.groups].index)
                self.group_dict[i] = top_groups
                 
        return self
      else:
        return self
        
    def transform(self,X):
      X = X.copy()
      if self.apply:
        for i in X.columns:
                temp_series = self.fetch_first_n_chars(X[i])
                temp_l = []
                for j in temp_series.values:
                    try:
                        #print(self.group_dict[i])
                        if j in self.group_dict[i]:
                            #print(j)
                            temp_l.append(j)  
                        else: 
                            temp_l.append('other')  
                    except:
                        continue
                X[i] = temp_l
        #X['funder_installer_same'] = X['funder']==X['installer']
        #self.column_names = ['funder','installer','funder_installer_same']
        #return X[['funder','installer','funder_installer_same']]         
        self.column_names = ['funder','installer']
        return X[['funder','installer']]         
      else:
        #self.column_names = ['funder','installer','funder_installer_same']
        #return X[['funder','installer','funder_installer_same']]
        self.column_names = ['funder','installer']
        return X[['funder','installer']]         
         
class Numpy2DFTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,columns): 
        self.column_names = columns
        
        
    def fit(self, X, y=None):
        return self

    def transform(self,X):

        return pd.DataFrame(X,columns=self.column_names)
    ##############
    ##IMPORTANT###
    ##############
    #https://stackoverflow.com/questions/41837261/data-not-persistent-in-scikit-learn-transformers
    #get_params() is very important to get the data persistent between
    #the CV evaluation. If NOT defined, then the init params are NOT set and results in 
    #CV (GridSearch) failure  
    def get_params(self, deep=False):
        return {'columns': self.column_names}

#Scale numeric data        
class ScaleData(BaseEstimator, TransformerMixin): 
     def __init__(self,std_scaler=True,apply=True): 
         self.std_scaler = std_scaler 
         self.apply = apply
 
     def fit(self,X, y=None): 
       if self.apply: 
         if self.std_scaler == True: 
            self.scaler = StandardScaler() 
         else: 
            self.scaler = MinMaxScaler() 
         self.scaler.fit(X) 
       return self        
 
     def transform(self,X): 
       if self.apply:
         return self.scaler.transform(X) 
       else:
         return X

ChooseCatPipelineType

class ChooseCatPipelineType(BaseEstimator, TransformerMixin): 
     def __init__(self,freq_pipeline,resp_pipeline,method='both'): 
         self.freq_pipeline = freq_pipeline
         self.resp_pipeline = resp_pipeline
         self.method = method
         self.column_names = []
     def fit(self,X, y=None): 
       if self.method == 'resp':
         self.resp_pipeline.fit(X,y) 
       if self.method == 'freq':
         self.freq_pipeline.fit(X,y) 
       if self.method == 'both':
          self.both_pipeline = FeatureUnion(transformer_list = [ \
                                    ("freq_pipeline",self.freq_pipeline), \
                                    ("resp_pipeline",self.resp_pipeline) \
                                                ])  
          self.both_pipeline.fit(X,y)                                                
       return self

     def transform(self,X): 
           if self.method == 'resp':
             self.column_names = self.resp_pipeline.named_steps['CatMultiLabelTransformer'].column_names
             return self.resp_pipeline.transform(X) 
           if self.method == 'freq':
             self.column_names = self.freq_pipeline.named_steps['CatMultiLabelTransformer'].column_names
             return self.freq_pipeline.transform(X) 
           if self.method == 'both':
              self.column_names = self.freq_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
                                  self.resp_pipeline.named_steps['CatMultiLabelTransformer'].column_names
              return self.both_pipeline.transform(X)                         

All the above code was included in a file called pipelinepack.py.

We used the following hyperparameter settings to generate the transformed data for training; they appear inline in the pipeline definitions below (e.g., the bucket counts for the binning transformers and the imputation method chosen for each numeric transformer). These settings were chosen because they yielded superior cross-validation accuracy when we ran the random forest model with various parameter combinations.

The code to generate the transformed training and validation data, based on the options described above, is given below:

#Import all the required packages:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, LabelBinarizer, Imputer, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from sklearn.externals import joblib
import math
import pickle
from pipelinepack import *
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


#Prepare data types dict
train_data_type = {
'id':int,
'amount_tsh': float,
'date_recorded': str,
'funder': str,
'gps_height': float,
'installer': str,
'longitude': float,
'latitude': float,
'wpt_name': str,
'num_private': float,
'basin': str,
'subvillage': str,
'region': str,
'region_code': str,
'district_code': str,
'lga': str,
'ward': str,
'population': float,
'public_meeting': str,
'recorded_by': str,
'scheme_management': str,
'scheme_name': str,
'permit': str,
'construction_year': int,
'extraction_type': str,
'extraction_type_group': str,
'extraction_type_class': str,
'management': str,
'management_group': str,
'payment': str,
'payment_type': str,
'water_quality': str,
'quality_group': str,
'quantity': str,
'quantity_group': str,
'source': str,
'source_type': str,
'source_class': str,
'waterpoint_type': str,
'waterpoint_type_group': str
}                      

train_label_type = {
'id':int,
'status_group': str
}

##Function to read the files
def read_files(file,dtype):
        df = pd.read_csv(file,dtype=dtype)
        return df


##Function to split the data into test and training
def get_test_train_data(training_values,training_labels):
        training_values.index = list(training_values['id'])
        training_labels.index = list(training_labels['id'])
        training_values.drop(["id"],axis=1,inplace=True)
        training_labels.drop(["id"],axis=1,inplace=True)
        training_data=training_values.join(training_labels,
                             how='inner',lsuffix='_values', 
                             rsuffix='labels')
        X = training_data.drop('status_group',axis=1)
        y = training_data['status_group']
        X_train, X_test, y_train, y_test = train_test_split\
                                           ( X, y, \
                                             test_size=0.2, \
                                             random_state=42, \
                                             stratify=y)
        return X_train, X_test, y_train, y_test


#Columns with many levels.
high_levels_cat_columns = ['subvillage','lga','ward'
#,'scheme_name'
]

#Columns with less number of levels.
low_levels_cat_columns = [
        'basin','region','district_code',
        'public_meeting',
        'scheme_management',
        'permit',
        #'extraction_type',
        'extraction_type_group',
        #'extraction_type_class',
        'management',
        #'management_group',
        'payment_type',
        'water_quality',
        #'quality_group',
        'quantity_group',
        'source',
        #'source_type','source_class',
        'waterpoint_type'
        #,'waterpoint_type_group'
        ]

#Columns which need fuzzy matching. These columns have many levels.
fuzzy_logic_columns = ['funder','installer']


#Pipelines definition

fuzz_pipeline = Pipeline([ \
                         ('selector',DataFrameSelector(fuzzy_logic_columns)), \
                         ('cat_nulls', HandleCategoricalNulls()), \
                         ('FunderInstTransformer', \
                          FunderInstTransformer(initial_chars = 3, \
                                                groups = 15,apply=True)), \
                         ('CatMultiLabelTransformer',\
                          CatMultiLabelTransformer(apply=True)) \
                         ])

cat_pipeline_high_level_freq_based = Pipeline([ \
                         ('selector',DataFrameSelector(high_levels_cat_columns)), \
                         ('cat_nulls', HandleCategoricalNulls()), \
                         ('FreqBasedCategoricalBinning', \
                          FreqBasedCategoricalBinning(buckets=20,apply=True)), \
                         ('CatMultiLabelTransformer',CatMultiLabelTransformer()) \
                         ])
                         
cat_pipeline_high_level_resp_based = Pipeline([ \
                         ('selector',DataFrameSelector(high_levels_cat_columns)), \
                         ('cat_nulls', HandleCategoricalNulls()), \
                         ('RespBasedCategoricalBinning', \
                          RespBasedCategoricalBinning(buckets=20,apply=True)), \
                         ('CatMultiLabelTransformer',CatMultiLabelTransformer()) \
                         ])

choose_high_level_cat_pipeline = Pipeline([ \
                                         ('ChooseCatPipelineType', ChooseCatPipelineType( \
                                         freq_pipeline = cat_pipeline_high_level_freq_based, \
                                         resp_pipeline = cat_pipeline_high_level_resp_based, \
                                         method = 'freq')) \
                                         ])
                                         
cat_pipeline_low_level = Pipeline([ \
                         ('selector',DataFrameSelector(low_levels_cat_columns)), \
                         ('cat_nulls', HandleCategoricalNulls()), \
                         ('CatMultiLabelTransformer',CatMultiLabelTransformer(apply=True)) \
                        ])

#Combine all categorical pipelines first:
full_categorical_pipeline = FeatureUnion(transformer_list = [ \
                                         ("fuzz_pipeline",fuzz_pipeline), \
                                         ("choose_high_level_cat_pipeline",\
                                          choose_high_level_cat_pipeline), \
                                         ("cat_pipeline_low_level", cat_pipeline_low_level) \
                                         ])       



#amount_tsh pipelines
amount_tsh_prep_pipeline = Pipeline([ \
                         ('selector',\
                          DataFrameSelector(['source_class', 'basin', \
                                             'waterpoint_type_group'])), \
                         ('cat_nulls', HandleCategoricalNulls()) \
                         ])

amount_tsh_selector = Pipeline([ \
                         ('selector',DataFrameSelector(['amount_tsh']))])


amount_tsh_transformer = Pipeline([ ('amount_tsh_prep',FeatureUnion(transformer_list = [ \
                                    ("amount_tsh_prep_pipeline",amount_tsh_prep_pipeline), \
                                    ("amount_tsh_selector",amount_tsh_selector)])) \
                                    ,("Numpy2DFTransformer", \
                                      Numpy2DFTransformer(['source_class', \
                                                           'basin', \
                                                           'waterpoint_type_group',\
                                                           'amount_tsh'])) \
                                    ,('AmountTSHTransformer', \
                                      AmountTSHTransformer()) \
                                ])

#age pipeline
age_pipeline = Pipeline([ \
                         ('selector',DataFrameSelector(['date_recorded',\
                                                        'construction_year'])), \
                         ('YearTransformer',YearTransformer())
                        ])




#Lat Long pipelines
lat_long_prep_pipeline = Pipeline([ \
                         ('selector',DataFrameSelector(['lga',\
                                                        'region',\
                                                        'ward'])), \
                         ('cat_nulls', HandleCategoricalNulls()) \
                         ])

lat_long_selector = Pipeline([ \
                         ('selector',DataFrameSelector(['longitude','latitude']))])

lat_long_transformer = Pipeline([ ('lat_long_prep',\
                                   FeatureUnion(transformer_list = [ \
                                    ("lat_long_prep_pipeline",\
                                     lat_long_prep_pipeline), \
                                    ("lat_long_selector",\
                                     lat_long_selector)])) \
                                    ,("Numpy2DFTransformer", \
                                      Numpy2DFTransformer(['lga','region',\
                                                           'ward','longitude',\
                                                           'latitude'])) \
                                    ,('LatitudeLongitudeProcess', \
                                      LatitudeLongitudeProcess(strategy="custom")) \
                                ])
#lat_long_transformer always return the 
#data in latitude, longitude order

#gps_height pipelines.
#This is dependent on the lat_long pipeline
gps_height_transformer = Pipeline([('gps_height_prep',\
                                    FeatureUnion(transformer_list=[('lat_long_transformer', \
                                                                   lat_long_transformer), \
                                   ('gps_selector',DataFrameSelector(['gps_height']))]))
                                   ,("Numpy2DFTransformer", \
                                      Numpy2DFTransformer(['latitude','longitude','gps_height']))
                                   ,('GPSHeightTransformer',GPSHeightTransformer(method='median'))
                                   #,('GPSHeightTransformer',GPSHeightTransformer(method='custom'))
                                  ])



#population pipelines.
#This is dependent on the lat_long pipeline

population_transformer = Pipeline([('population_prep',\
                                    FeatureUnion(transformer_list=[('lat_long_transformer', \
                                                                   lat_long_transformer), \
                                   ('population_selector',DataFrameSelector(['population']))]))
                                   ,("Numpy2DFTransformer", \
                                      Numpy2DFTransformer(['latitude','longitude','population']))
                                   ,('PopulationTransformer',PopulationTransformer(method = 'ignore'))
                                   #,('PopulationTransformer',PopulationTransformer(method = 'custom')) #NOT worth
                                   # ,('PopulationTransformer',PopulationTransformer(method = 'median')) #NOT Worth
                                  ])

full_numeric_transformations = Pipeline([('all_numeric_transformations', \
                                        FeatureUnion(transformer_list = \
                                                     [('amount_tsh_transformer',amount_tsh_transformer), \
                                        ('lat_long_transformer',lat_long_transformer), \
                                        ('gps_height_transformer',gps_height_transformer), \
                                        ('population_transformer',population_transformer), \
                                        ('age_pipeline',age_pipeline) \
                                       ])) \
                                       ,('scaler',ScaleData())
                                        ])

all_transformations = Pipeline([ \
                                ('all_transformations', \
                                    FeatureUnion(transformer_list = \
                                                 [('full_categorical_pipeline',\
                                                   full_categorical_pipeline), \
                                                  ('full_numeric_transformations',\
                                                   full_numeric_transformations)])) \
                                 ])

predict_pipeline = Pipeline([('all_transformations',all_transformations), \
                             ('rf',RandomForestClassifier(n_estimators=5000,n_jobs=-1))])


def prepare_data(split_test_train=True,save=False):
        training_data = read_files('./data/training_data.csv',dtype = train_data_type)
        training_labels = read_files('./data/training_labels.csv',dtype = train_label_type)
        if split_test_train:
            X_train, X_test, y_train, y_test = get_test_train_data(training_data,training_labels)
            #Save the data
            if save:
                X_test.to_csv('X_test.csv',index=False)
                y_test.to_csv('y_test.csv',index=False)
                X_train.to_csv('X_train.csv',index=False)
                y_train.to_csv('y_train.csv',index=False)
                
            #prepare combined data frames:            
            X_train['status_group'] = y_train
            X_test['status_group'] = y_test
            return X_train, X_test
        else:
            training_data['status_group'] = training_labels['status_group']
            return training_data, None

##Function to fit random forest with default parameters
def default_rf():
    train_data, test_data = prepare_data(split_test_train=True)
    X_train = train_data.drop(['status_group'],axis=1)
    y_train = train_data['status_group']
    X_test = test_data.drop(['status_group'],axis=1)
    y_test = test_data['status_group']
    #temp_df = all_transformations.fit_transform(X_train,y_train)
    predict_pipeline.fit(X_train,y_train)
    y_test_predict = predict_pipeline.predict(X_test)
    print("Test Accuracy score: ",accuracy_score(y_test,y_test_predict))
    all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
                  choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
                  cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
                  amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
                  lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
                  gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
                  population_transformer.named_steps['PopulationTransformer'].column_names + \
                  age_pipeline.named_steps['YearTransformer'].column_names

    feature_imp = list(zip(predict_pipeline.named_steps['rf'].feature_importances_,list(all_columns)))
    #print feature_imp
    feature_imp.sort(key = lambda X: X[0],reverse=True)
    feature_imp= pd.DataFrame(feature_imp)
    feature_imp.columns = ['score','variable']
    feature_imp['percentage'] = feature_imp['score']/np.max(feature_imp['score']) * 100
    print(feature_imp)
    feature_imp.to_csv('rf_variable_imp.csv',index=False)


##Function to fit gradient boosting with default parameters
def default_gb():
    predict_pipeline = Pipeline([('all_transformations',all_transformations), \
                             ('gb',GradientBoostingClassifier(n_estimators=300))])
    train_data, test_data = prepare_data(split_test_train=True)
    X_train = train_data.drop(['status_group'],axis=1)
    y_train = train_data['status_group']
    X_test = test_data.drop(['status_group'],axis=1)
    y_test = test_data['status_group']

    #temp_df = all_transformations.fit_transform(X_train,y_train)
    predict_pipeline.fit(X_train,y_train)
    y_test_predict = predict_pipeline.predict(X_test)
    print("Test Accuracy score: ",accuracy_score(y_test,y_test_predict))
    all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
                  choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
                  cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
                  amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
                  lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
                  gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
                  population_transformer.named_steps['PopulationTransformer'].column_names + \
                  age_pipeline.named_steps['YearTransformer'].column_names

    feature_imp = list(zip(predict_pipeline.named_steps['gb'].feature_importances_,list(all_columns)))
    #print feature_imp
    feature_imp.sort(key = lambda X: X[0],reverse=True)
    feature_imp= pd.DataFrame(feature_imp)
    feature_imp.columns = ['score','variable']
    feature_imp['percentage'] = feature_imp['score']/np.max(feature_imp['score'])
    print(feature_imp)


def transform_to_share():
        training_data = read_files('./data/training_data.csv',dtype = train_data_type)
        training_labels = read_files('./data/training_labels.csv',dtype = train_label_type)
        train_data_id = list(training_data['id'])
        train_labels_id = list(training_labels['id'])
        training_data.index = train_data_id
        training_labels.index = train_labels_id
        training_data.drop(["id"],axis=1,inplace=True)
        training_labels.drop(["id"],axis=1,inplace=True)
        training_transformed = all_transformations.fit_transform(training_data,training_labels)
        all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
                  choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
                  cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
                  amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
                  lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
                  gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
                  population_transformer.named_steps['PopulationTransformer'].column_names + \
                  age_pipeline.named_steps['YearTransformer'].column_names

        training_transformed = pd.DataFrame(training_transformed,columns = all_columns)

        training_transformed.index = train_data_id
        training_transformed = training_transformed.reset_index()
        training_transformed_cols = list(training_transformed.columns)
        training_transformed_cols[0] = "id"
        training_transformed.columns = training_transformed_cols
        training_transformed.to_csv('training_transformed.csv',index=False)
        submission_data = pd.read_csv('./data/test_data.csv')
        submission_data_id = list(submission_data['id'])
        submission_data = submission_data.drop(['id'],axis=1)
        submission_transformed = all_transformations.transform(submission_data)
        submission_transformed = pd.DataFrame(submission_transformed,columns = all_columns)
        submission_transformed.index = submission_data_id
        submission_transformed = submission_transformed.reset_index()
        submission_transformed.columns = training_transformed_cols
        submission_transformed.to_csv('submission_transformed.csv',index=False)



def main():
    #Train a rand forest model (5000 trees)
    default_rf()

    #Generate data for submission to driven data
    submission_rf()
    
    #Generate data to share with peers so that they can train in R language
    transform_to_share()
    
    #train using default parms (gradient boosting alg)
    default_gb()

The source code for the ensemble model based on Random Forest, Gradient Boosting, and C5.0 boosted trees is given below. Whenever both the Gradient Boosting and C5.0 models predict “functional needs repair”, the ensemble predicts “functional needs repair” (irrespective of the random forest model’s output); in all other cases it uses the random forest output.

import pandas as pd

#We created a data frame called all_models_out.csv, which has the following columns:
#'id','rf','c5','boosted_tr'

df = pd.read_csv('all_models_out.csv')

status_group = []
ids = []
for _, row in df.iterrows():
    id_current, rf, c5, boosted_tr = row
    #If both boosted-tree models agree on 'functional needs repair', take that
    if c5 == 'functional needs repair' and boosted_tr == 'functional needs repair':
        status_group.append('functional needs repair')
    else:
        status_group.append(rf)
    ids.append(id_current)
submission_df = pd.DataFrame(list(zip(ids, status_group)), columns=['id', 'status_group'])
submission_df.to_csv('ensemble_submission.csv', index=False)

Appendix D: Predictive / Classification Modeling

Predictive / Classification Modeling

Six distinct types of classification models were constructed for purposes of predicting the current status of a given water pump:

  1. Multinomial Logistic Regression (MLR)
  2. K-Nearest Neighbors (KNN)
  3. Support Vector Machines (SVM)
  4. Bootstrap Aggregation (Bagging)
  5. Random Forest
  6. C5.0 Boosted Classification Trees

A description of our attempt at quantifying the predictive value of each of the independent variables is provided below, followed by a brief discussion of each model’s structure and performance.

Ranking Variables According to Predictive Strength

As discussed in Data Exploration, many of the data set’s 30+ independent variables were found to have potential predictive characteristics. The strength of those characteristics was measured by constructing a separate multinomial logistic regression model for each independent variable relative to the data set’s response variable. The accuracy and Akaike Information Criterion (AIC) results of those models allowed us to rank our independent variables on the basis of their likely predictive “strength”, as summarized in the table below.

Tanzania MLR Results - Single Ind. Variable Models

Var Name                 Accuracy   AIC
quantity                 64.9%      65186
waterpoint_type          63.5%      68496
extraction_type          62.6%      68684
installer (binned)       58.1%      71687
pump_age                 57.9%      71466
reg_code + dst_code      57.1%      70004
region_code              57.1%      70455
region                   57.1%      70769
funder (binned)          56.9%      72151
water_quality            56.9%      72385
payment                  56.7%      71071
basin                    55.9%      72664
district_code            55.8%      72855
source                   55.2%      72155
management               54.7%      72546
lga (binned)             54.7%      72341
scheme_management        54.4%      72673
ward (binned)            54.3%      72220
public_meeting_new       54.47%     73798
permit_new               54.3%      73938
amount_tsh_new           54.3%      73700
lat + long               54.3%      73712
population_new           54.3%      73975
gps_height_new           54.3%      73587
subvillage (binned)      54.3%      73949

As indicated above, quantity, waterpoint_type and extraction_type proved to have the highest levels of independent predictive value for MLR modeling, while subvillage, gps_height, population, latitude and longitude were all deemed to be of little value for MLR modeling purposes. The derived pump_age variable was also found to have a relatively high level of predictive value.
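
For illustration, the single-variable ranking procedure can be sketched in Python as follows. This is a hedged sketch, not our exact build process: the variable list shown is a small subset, the id-based merge mirrors the earlier listings, and because sklearn does not report AIC, the sketch ranks by cross-validated accuracy only (the AIC values in the table came from the fitted MLR models themselves).

#Hedged sketch: rank variables by the accuracy of a single-variable
#multinomial logistic regression fit against the pump status labels.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('./data/training_data.csv')
labels = pd.read_csv('./data/training_labels.csv')
df = df.merge(labels, on='id')            #align features and labels by pump id
y = df['status_group']

scores = {}
for col in ['quantity', 'waterpoint_type', 'extraction_type', 'basin', 'payment']:
    X_col = pd.get_dummies(df[[col]].fillna('unknown'))   #one-hot encode the lone variable
    mlr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)
    scores[col] = cross_val_score(mlr, X_col, y, cv=5).mean()

for col, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("%-20s %.1f%%" % (col, acc * 100))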


Multinomial Logistic Regression

The ranking of variables by their MLR predictive accuracy served as the basis of a forward-selection MLR model build process whereby individual variables were added to the model in descending order of their predictive value until no further gains in either model accuracy or AIC score were possible. That process led us to conclude that three variables (gps_height, subvillage, region) should be excluded from an MLR model due to their inability to improve either the model’s accuracy or AIC score. Furthermore, the district_code variable added the most value when included in conjunction with the region_code variable. This is likely because district_code is not truly independent: its values are not unique across the possible region and region_code values. Results of the forward-selection process are shown in the table below, followed by a code sketch of the selection loop.

MLR Forward Selection Results

Variable(s) Added        Model Accuracy   AIC
quantity                 64.90%           65186
waterpoint_type          69.24%           61555
extraction_type          70.03%           59279
installer (binned)       70.72%           57765
pump_age                 71.38%           57031
reg_code + dst_code      72.42%           54713
funder (binned)          72.70%           54251
water_quality            72.88%           54120
payment                  73.86%           52968
basin                    74.10%           52765
source                   74.40%           52298
management               74.66%           51766
lga (binned)             74.77%           51500
scheme_management        74.96%           51431
ward (binned)            74.94%           51272
public_meeting           74.93%           51189
permit                   75.03%           51135
amount_tsh               75.02%           51128
latitude + longitude     75.12%           51107
population               75.15%           51064

Ineffective Variables: gps_height_new, subvillage (binned), region
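
A minimal sketch of the forward-selection loop follows, continuing from the ranking sketch above (df and y are as defined there; ranked_vars and the encode() helper are illustrative placeholders rather than our exact build process, which also tracked AIC):

#Hedged sketch of forward selection: add variables in descending order of
#single-variable accuracy and keep each only if cross-validated accuracy improves.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

ranked_vars = ['quantity', 'waterpoint_type', 'extraction_type']  #illustrative subset

def encode(data, cols):
    #one-hot encode the selected columns (placeholder for our full preprocessing)
    return pd.get_dummies(data[cols].fillna('unknown'))

selected, best_acc = [], 0.0
for var in ranked_vars:
    mlr = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=500)
    acc = cross_val_score(mlr, encode(df, selected + [var]), y, cv=5).mean()
    if acc > best_acc:                    #keep the variable only if it helps
        selected, best_acc = selected + [var], acc

print(selected, best_acc)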

Cross validation of the resulting 22-variable MLR model indicated an overall model accuracy of 75.08%. A confusion matrix and detailed summary statistics from the cross validation process are shown below. Please note that "Class 0" represents functional pump status, "Class 1" represents functional needs repair, and "Class 2" represents non functional.

MLR Confusion Matrix and Statistics

As shown in the summary above, the MLR model is most challenged when attempting to accurately identify pumps having a status of functional needs repair, as indicated by the Balanced Accuracy metric. This is unsurprising given that only 7.3% of the training data set’s records have that status.
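
The per-class Balanced Accuracy and kappa figures quoted throughout this appendix can be reproduced from a multiclass confusion matrix as sketched below (we used R's confusion matrix output; y_test and y_pred here are placeholders for the actual and predicted statuses):

#Sketch: per-class balanced accuracy = (sensitivity + specificity) / 2,
#computed from the confusion matrix, plus Cohen's kappa.
from sklearn.metrics import confusion_matrix, cohen_kappa_score

classes = ['functional', 'functional needs repair', 'non functional']
cm = confusion_matrix(y_test, y_pred, labels=classes)   #rows = actual, cols = predicted
for k, label in enumerate(classes):
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp
    fp = cm[:, k].sum() - tp
    tn = cm.sum() - tp - fn - fp
    balanced = (tp / (tp + fn) + tn / (tn + fp)) / 2
    print("%-25s balanced accuracy: %.4f" % (label, balanced))
print("kappa: %.4f" % cohen_kappa_score(y_test, y_pred))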


K Nearest Neighbors (KNN)

A KNN model was constructed using the same mix of variables identified via the MLR forward selection process. Cross validation of the model indicated an overall model accuracy of 74.93% and an optimal K value of k = 5. A confusion matrix and detailed summary statistics from the cross validation process are shown below.

KNN Confusion Matrix and Statistics

As shown in the summary above, the KNN model does a much better job of correctly identifying pumps having a status of functional needs repair (64.73% balanced accuracy) than does the MLR model (55.31% balanced accuracy). This suggests that the KNN model may be preferable to the MLR model despite the KNN model’s slightly lower overall accuracy score. In fact, the KNN model’s kappa score exceeds that of the MLR model, which further indicates that the KNN model is likely preferable.
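
A hedged sklearn analogue of the KNN model is sketched below (ours was built in R; X_train and y_train stand for the transformed training data, and the grid search over k mirrors the cross validation that selected k = 5):

#Sketch: KNN with cross-validated selection of k.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={'n_neighbors': [3, 5, 7, 9, 11]},
                          cv=5, scoring='accuracy')
knn_search.fit(X_train, y_train)
print(knn_search.best_params_, knn_search.best_score_)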


Bootstrap Aggregation with Classification Trees

A bootstrap aggregation classification model based on the same set of 22 independent variables identified via the MLR forward selection process outperformed both the KNN and MLR models, achieving an overall accuracy of 79.91%. A confusion matrix and detailed summary statistics from the cross validation process are shown below.

Bootstrap Aggregation Confusion Matrix and Statistics

The Balanced Accuracy metric indicates the bootstrap aggregation algorithm exceeded the performance of the KNN model across each of the three pump statuses and also achieved a much higher kappa value (0.6281) than either the MLR (0.5156) or KNN models (0.5334). As such, the bootstrap aggregation method appears to be preferable to both the KNN and MLR models.
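
A minimal sketch of a bootstrap aggregation model of this kind, using the ipred package; the package choice and the number of bootstrap replicates are assumptions for illustration.

library(ipred)

bag_fit <- bagging(status_group ~ ., data = train,
                   nbagg = 25,     # number of bootstrap replicates (assumed)
                   coob = TRUE)    # estimate error from out-of-bag samples
bag_fit$err                        # out-of-bag misclassification rate
bag_pred <- predict(bag_fit, newdata = test)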


Support Vector Machines (SVM)

An SVM model constructed from the same variables as the MLR model yielded an overall accuracy of 76.95%, exceeding that of both the MLR and KNN models but below that of the bootstrap aggregation model. A confusion matrix and detailed summary statistics from the cross validation process are shown below.

SVM Confusion Matrix and Statistics

The Balanced Accuracy metric indicates the SVM algorithm (55.57%) significantly underperforms both the KNN (64.73%) and bootstrap aggregation (67.61%) approaches for correctly identifying pumps having a status of functional needs repair. So while the SVM model achieves a slightly higher kappa value than the KNN model (0.5391 vs. 0.5334), its relatively weak performance on the functional needs repair status places its efficacy below that of both the bootstrap aggregation and KNN models.
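
A minimal sketch of an SVM fit of this kind using the e1071 package; the package choice, kernel, and cost setting are assumptions for illustration.

library(e1071)

svm_fit <- svm(status_group ~ ., data = train,
               kernel = "radial",  # assumed kernel
               cost = 1)           # default regularization cost
svm_pred <- predict(svm_fit, newdata = test)
mean(svm_pred == test$status_group)  # overall accuracy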


Random Forest

The random forest model was constructed in R using the same variables used for the MLR model and a total of 500 classification trees. While an initial attempt was made to add both the gps_height and subvillage variables to the model, their inclusion did not improve its performance. The resulting random forest model yielded an overall accuracy of 81.13%, which exceeds that of the other five model types constructed. A confusion matrix and detailed summary statistics from the cross validation process are shown below.

Random Forest Confusion Matrix and Statistics

The Balanced Accuracy metric indicates that despite having the highest overall accuracy, the random forest algorithm (64.06%) underperforms both the KNN (64.73%) and bootstrap aggregation (67.61%) approaches for correctly identifying pumps having a status of functional needs repair. However, the random forest model also achieved a higher kappa value than did each of the other models, including the bootstrap aggregation model (0.6281).
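
A minimal sketch of a random forest fit of this kind, using the randomForest package (the package choice is a common one and the seed value is arbitrary):

library(randomForest)

set.seed(698)                      # arbitrary seed for reproducibility
rf_fit <- randomForest(status_group ~ ., data = train,
                       ntree = 500,        # 500 trees, as noted above
                       importance = TRUE)  # collect variable importance scores
rf_fit$confusion                           # out-of-bag confusion matrix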

Variable Importance

R provides a mechanism for ranking the "importance" of each variable within a random forest model. The output of that mechanism indicated that both the ward and lga variables were of limited use within the random forest model, and experiments confirmed that removing both had little effect on the model's overall accuracy. As such, the dimensionality of the model could potentially be reduced by a total of 30 binned indicator variables with limited impact on its efficacy.
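
A minimal sketch of retrieving and ranking those importance scores via the randomForest package's importance() mechanism; the magnitudes in the table below are consistent with the mean decrease in Gini measure, though the exact measure used is an assumption.

library(randomForest)

imp <- importance(rf_fit, type = 2)   # type 2 = mean decrease in Gini
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]  # descending ranking
varImpPlot(rf_fit)                    # quick visual ranking of the same scores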

As with the MLR model, quantity, extraction_type, and waterpoint_type were found to have the highest levels of predictive value / "importance". Longitude and latitude also ranked highly, even though, by contrast, both had little predictive value for the MLR model. The derived pump_age variable was also deemed to be of high "importance" to the model.

Subsequent experiments (not detailed here) also showed that adding the gps_height variable did not improve the model’s performance, but did result in gps_height being deemed as having a high level of “importance” within the model. Results of the initial “importance” assessment are provided below.

Variable | Importance Score
quantity 2386.462153
longitude 1280.208366
latitude 1243.581667
extraction_type 1178.682671
pump_age 1116.002314
waterpoint_type 1093.158956
payment 816.229526
region_code 681.391666
population_new 661.159021
source 578.178014
district_code 555.321904
basin 427.311991
scheme_management 384.212795
management 372.834456
amount_tsh_new 345.614271
water_quality 299.133328
funder_other 153.621007
ward_freq_bin_3 149.327702
ward_freq_bin_2 142.580870
funder_gov 135.037070
installer_other 132.631561
installer_dwe 123.834254
permit_new 121.603216
ward_freq_bin_4 110.478197
public_meeting_new 108.145648
ward_freq_bin_1 102.555506
ward_freq_bin_5 89.906952
lga_freq_bin_4 80.632662
lga_freq_bin_3 78.191339
lga_freq_bin_6 74.982971
lga_freq_bin_5 71.640857
lga_freq_bin_10 61.760992
lga_freq_bin_2 61.629666
lga_freq_bin_7 60.363336
installer_gov 58.517909
ward_freq_bin_6 57.437098
installer_cen 57.360212
funder_wor 57.045224
installer_rwe 54.030158
lga_freq_bin_9 49.896391
ward_freq_bin_7 49.652331
ward_freq_bin_8 48.296252
funder_dan 39.977137
installer_unk 37.709373
lga_freq_bin_8 37.567692
funder_hes 36.508775
lga_freq_bin_20 35.405233
installer_dis 35.347900
funder_unk 34.678247
installer_com 34.384736
funder_tas 33.928214
funder_dws 31.883431
funder_pri 29.892122
funder_uni 25.143246
ward_freq_bin_9 24.503300
funder_wat 24.246632
installer_wor 23.716591
funder_rws 22.778945
installer_hes 20.582274
funder_dis 18.413152
installer_tcr 18.133932
installer_dan 17.600180
funder_kkk 17.576963
installer_fin 15.507702
lga_freq_bin_11 14.531972
installer_kkk 14.371995
funder_fin 14.235434
lga_freq_bin_1 14.198987
funder_dhv 13.691042
ward_freq_bin_10 10.414861
ward_freq_bin_14 9.585436
ward_freq_bin_13 9.147670
installer_0 7.232043
ward_freq_bin_12 6.411819
ward_freq_bin_16 5.501362
installer_ces 5.378000
ward_freq_bin_11 3.980340
ward_freq_bin_15 3.305827
ward_freq_bin_20 2.600001
ward_freq_bin_17 2.306922

C5.0 Boosted Classification Trees

The C5.0 boosted classification tree model adds the gps_height variable to those used for the MLR model. A total of 20 boosting trials were applied during model training, and the resulting model yielded an overall accuracy of 80.40%, which exceeds that of all of the other models except the random forest. A confusion matrix and detailed summary statistics for the C5.0 model are shown below.

C5.0 Confusion Matrix and Statistics

The Balanced Accuracy metric indicates that, despite having a slightly lower overall accuracy and kappa score than the random forest algorithm, the C5.0 model (65.63%) outperforms both the KNN (64.73%) and random forest (64.06%) approaches, though not the bootstrap aggregation approach, at correctly identifying pumps having a status of functional needs repair. The C5.0 model's kappa score also exceeds that of all other models except the random forest.
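
A minimal sketch of fitting such a C5.0 model with the C50 package, including retrieval of the per-variable usage scores discussed in the next subsection; the package choice is a common one and the data frame names are assumptions.

library(C50)

c5_fit <- C5.0(status_group ~ ., data = train,
               trials = 20)        # 20 boosting trials, as noted above
c5_pred <- predict(c5_fit, newdata = test)
C5imp(c5_fit, metric = "usage")    # per-variable importance (percent usage)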

Variable Importance

As with the random forest model, R provides a mechanism for ranking the "importance" of each variable within a C5.0 model. However, our experiments found that removing any of the variables used during model training decreased the performance of the C5.0 model. The quantity, extraction_type, and waterpoint_type variables were again deemed to have the highest levels of relevance to the efficacy of the model, as were longitude, one of the binned lga variables, and a subset of the binned ward variables. Results of the initial "importance" assessment are provided below.

Variable | Importance Score
quantity 100.00
extraction_type 100.00
lga_freq_bin_10 100.00
ward_freq_bin_16 100.00
ward_freq_bin_20 100.00
longitude 99.99
waterpoint_type 99.97
management 99.84
ward_freq_bin_11 99.28
pump_age 99.24
source 98.99
amount_tsh_new 98.90
region_code 98.62
water_quality 98.62
latitude 97.53
payment 97.24
district_code 96.90
scheme_management 95.25
funder_dws 92.77
gps_height_new 92.30
population_new 92.21
lga_freq_bin_20 92.04
installer_ces 89.43
ward_freq_bin_17 89.43
ward_freq_bin_8 88.22
installer_rwe 86.90
basin 86.71
ward_freq_bin_10 83.44
ward_freq_bin_7 82.98
funder_pri 82.54
installer_wor 81.31
installer_0 81.22
ward_freq_bin_12 81.20
funder_rws 80.87
installer_com 80.66
installer_cen 80.03
installer_gov 79.21
lga_freq_bin_1 78.87
public_meeting_new 76.77
funder_wor 75.98
funder_uni 75.71
installer_hes 75.25
ward_freq_bin_6 74.34
funder_hes 74.19
funder_dhv 71.43
lga_freq_bin_2 71.03
funder_dan 70.76
installer_fin 70.40
ward_freq_bin_1 69.92
ward_freq_bin_4 69.80
lga_freq_bin_11 69.56
funder_gov 68.41
ward_freq_bin_5 67.40
lga_freq_bin_3 67.36
funder_unk 67.09
funder_tas 65.18
ward_freq_bin_13 64.32
funder_wat 64.31
ward_freq_bin_14 63.98
lga_freq_bin_6 62.75
lga_freq_bin_8 62.66
ward_freq_bin_9 59.46
funder_dis 59.26
ward_freq_bin_2 58.27
installer_unk 57.53
ward_freq_bin_3 56.92
installer_tcr 56.65
installer_dwe 55.18
installer_dis 55.03
lga_freq_bin_7 53.77
lga_freq_bin_9 53.71
lga_freq_bin_4 53.34
lga_freq_bin_5 51.60
installer_dan 50.99
permit_new 50.13
funder_kkk 44.81
installer_kkk 39.89
installer_other 37.96
funder_other 36.33
funder_fin 31.87
ward_freq_bin_15 8.82

Appendix E: R Code for Predictive / Classification Models

The R code used for each of the six types of predictive / classification models is available on GitHub. Additional R code modules were used to apply a set of custom imputations to both the training and test data sets provided by the Tanzanian Ministry of Water. Links to the source code for each module are provided below.

Training Data Custom Transformations

Test Data Custom Transformations

Multinomial Logistic Regression R Code

K Nearest Neighbors

Support Vector Machines

Bootstrap Aggregation

Random Forest

C5.0 Boosted Trees

Appendix F: R Code for Visualization

##################################### DATA 698.02 Capstone Project ########################################################
#
# This is a Shiny web application for the Tanzanian water pump project hosted on DrivenData.org
#
# The purpose of this project is to develop, assess, and compare a variety of predictive / classification models 
#
# with the best-performing model being implemented as part of a complete end-to-end intelligent system.
#
# Find out more about building applications with Shiny here:
#
#    http://shiny.rstudio.com/
###################################################### Libraries #########################################################

library(ggthemes)    # for graphs
library(knitr)       # dynamic report generation in R using Literate Programming techniques.
library(tidyr)       # Tidy Data
library(readr)       # reading in data
library(dplyr)       # data manipulation
library(ggplot2)     # plotting of Graphs
library(DT)          # data tables
library(png)         # Read and write PNG images
library(markdown)    # document formatting
library(leaflet)     # Maps
library(RColorBrewer)# Color Palette
library(tidyverse)   # install core packages
library(plotly)      # Interactivity with Charts
library(shinythemes) # Shiny Theme
library(scales)      # for percentage scales

############################################### Load data Source #########################################################
tanz_Train <- tbl_df(read.csv("Tanz_Pump_Data.csv", header = TRUE, stringsAsFactors = FALSE))
tanz_Train$pump_age <- cut(tanz_Train$pump_age, seq(from = 1, to = 53, by = 3)) # bin pump_age
tanz_Train <- data.frame(tanz_Train) # convert to a plain data frame
tanz_Train <- na.exclude(tanz_Train) %>% # drop rows containing NA's
  filter(!is.na(latitude) & !is.na(longitude))

############################################## Define UI for application #################################################

ui <- navbarPage("Data 698 Tanzanian Water Pump Project", # Navigation bar title
                 tabPanel("About",
                          fluidPage(theme = shinytheme("cerulean"),
                                    titlePanel(title = h4("About Tanzanian Water Pump Project",align="center")),
                                    fluidRow(
                                      column(4,
                                             tags$img(src="pumping.jpg",width = 400,height = 287, style="text-align: center;")),#load Image
                                      column(6, 
                                            tags$style(".help-block{color: black;font-size: 16px;font-style: normal;font-family:Georgia}") ,
                                             helpText("The purpose of this project is to develop, assess, and compare a variety of predictive / classification models ",
                                                      "with the best-performing model being implemented as part of a complete end-to-end intelligent system",
                                                      "that allows the user (i.e., the Tanzanian Ministry of Water) to pro-actively identify wells that are either in need of repair or are in inoperable condition.",
                                                      "The data is provided by Taarifa and the Tanzanian Ministry of Water,more about the competition can be found",  tags$a(href="https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/","here"),
                                                      ".We will take various data points into consideration such as when was the pumps installed, and how it is managed ,Lattitude and Longitude etc. ",
                                                      "The outcome of understanding which water point pumps will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.",
                                                      "The data files ,methodology and code can be accessed from this:",
                                                      tags$a(href="https://github.com/jtopor/DDO-TNZ/tree/master/Model-Build%2BPred-RCode","GitHub Link!"),
                                                      br(),tags$b( "Team Members"),
                                                      tags$ul(tags$li("James Topor"),tags$li("Sekhar Mekala"),tags$li("Sonya Hong"),tags$li("Steve Dunn")))
                                             )))),               
                tabPanel("Introduction",
                         fluidPage(titlePanel(title = h4("Methodology,Analysis and Exploration",align="center")),
                                                  tags$iframe(style="height:500px;width:100%", scrolling="yes", src="Tanzanian_Introduction.pdf")
                                   )),#load rendered PDF file
                
  tabPanel("Data Table",
           fluidPage(theme = shinytheme("cerulean"),#shiny theme
   titlePanel(title = h4("Tanzanian Water Pump Data",align="center")),# Title
            sidebarLayout(#Side Bar Panel
                  sidebarPanel(helpText(tags$b("Data Table"),
                                        br(),
                                        "Data is from  test data set and Predictive Output by Status created from six different predictive models",
                                        br(),
                                        "the table can be searched ,sorted and filtered by columns",
                                        "or uncheck the options to reduce the number of columns displayed." ),
                               
                    checkboxGroupInput("show_vars", "Data Columns:",
                                      names(tanz_Train), selected = names(tanz_Train))),
# Create a new row for the table
mainPanel(
          fluidRow(
                    DT::dataTableOutput("table")
              )
           )
        )
    )
),

#########################Bar Plot output from the data set##################################

tabPanel("Plot",
         fluidPage(
           titlePanel(title = h4("Outcome of Predictive Models",align="center")),
           fluidRow(
             helpText("Select data element from dropdown"),
               selectInput("TGroup",
                         label ="Data Columns:",
                          choices=colnames(tanz_Train[,20:33]),selected = "waterpoint_type"
               ),
             helpText("Select predictive model from the list"),
column(width=2,
       radioButtons("PrediChoice", "Predictive Models:",
                    c("RF","SVM","KNN","MLR","C50","BAG")           
       )),
mainPanel(
  plotlyOutput("dplots",width="auto",height = "400")
      )
    )
  )
), 

################################### Map of tanzania with population circles ############################################

tabPanel("Map Visualization",
         fluidPage(#theme = shinytheme("cerulean",
           titlePanel(title = h4("The bubble size is based on the population  and the bubble color represents the water point type ",align="center")),
           column(3,
                  helpText("The output Map of Tanzania shows the results of the predictive models when the bubbles are clicked.",
                          
                           "The hover text shows the Water point types.",
                           
                           "The circle size is representative of the population around each well location.",
                           br(), br(),
                           tags$i("The circle Markers on the map may not display correctly if you are viewing this application on a device behind a firewall ")
                           )),
         
            column(4,
           leafletOutput("Tanzmap", width="850",height = "600"))
     )
   )
 
)

 ################################################ Define server logic ####################################################

  server <- function(input, output, session) {
  
  ############################################### PDF output ##############################################################
    
  output$Intro <- renderUI({}) # unused placeholder; the Introduction tab loads its PDF directly via tags$iframe in the UI
  
  ############################################## Table Output #############################################################
  
  tanz_Train2 <- tanz_Train[sample(nrow(tanz_Train)), ] # shuffle rows for display
      output$table <- DT::renderDataTable({
         DT::datatable(tanz_Train2[, input$show_vars, drop = FALSE])
  })
  
  #############################################Bar Plot Output###################################################### 
  
  Treact <- reactive({ # selected data column used for plotting
    tanz_Train[[input$TGroup]]
  })

  Treact2 <- reactive({ # selected predictive model's output column
    tanz_Train[[input$PrediChoice]]
  })
 
    output$dplots <- renderPlotly({ # faceted bar plot of predicted pump status
    # Treact() supplies the selected data column and Treact2() supplies the
    # selected model's predictions, so no separate switch over input$PrediChoice
    # is needed here
    
# # show Raw count of data grouped by Status
# #     p<-ggplot(tanz_Train, aes(x=Treact2(), y = ..count..)) +
# #       geom_bar(aes(fill = Treact()), position = "dodge")+
# #       xlab("Status Group")+labs(title =paste("Pump Status by",input$TGroup))+
# #       theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
# #       theme_tufte()+ scale_fill_discrete(name=input$TGroup)
# # p

p <- ggplot(tanz_Train,
            aes(x = Treact(),
                label2 = scales::percent(..prop..),
                group = as.character("x"))) +
  geom_bar(aes(y = ..prop..,
               fill = as.factor(..x..),
               text = paste("Count:", ..count..)),
           stat = "count") +
  geom_text(data = tanz_Train,
            aes(label = scales::percent(..prop..),
                y = ..prop..,
                group = as.character("x")),
            position = position_dodge(0.9),
            vjust = -1.5,
            cex = 3.5,
            stat = "count") +
  facet_grid(~Treact2()) +
  theme_tufte() +
  theme(axis.text.x = element_text(angle = 40)) +
  labs(y = "Percent of Status Group", fill = ~levels(input$TGroup)) +
  labs(title = paste("Pump Status by", input$TGroup)) +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.title.x = element_blank()) +
  theme(legend.position = "none") # remove legend

ggplotly(p, tooltip = c("text", "label2")) # return the interactive plot

  })
  
  #################################################### Map output ######################################################### 
#marker text of Predictive Model output
  details2 <- paste(sep = "<br/>",
                   tags$b("SVM:"),as.character(tanz_Train$SVM),
                   #"<br/>",
                   tags$b("RF:"),as.character(tanz_Train$RF),
                   #"<br/>",
                   tags$b("KNN:"),as.character(tanz_Train$KNN),
                   #"<br/>",
                   tags$b("MLR:"),as.character(tanz_Train$MLR),
                   #"<br/>",
                   tags$b("C50:"),as.character(tanz_Train$C50),
                   #"<br/>",
                   tags$b("BAG:"),as.character(tanz_Train$BAG),
                   #"<br/>",
                   tags$b("Population:"),as.character(tanz_Train$population))

  
    
   #color Palette
   color <- colorFactor( palette = "viridis", tanz_Train$waterpoint_type,reverse = TRUE, alpha = TRUE)
   
   #leaflet map output
  output$Tanzmap<-renderLeaflet({
    leaflet(tanz_Train) %>%
            addTiles() %>% # set the bounds of longitude and latitude
                  fitBounds(~min(tanz_Train$longitude),
                            ~min(tanz_Train$latitude),
                            ~max(tanz_Train$longitude),
                            ~max(tanz_Train$latitude))%>%
            setView(lng=34.888822000000005,lat=-6.369028,zoom=6)%>%
            addProviderTiles("CartoDB.Positron", 
                             options = providerTileOptions(noWrap = TRUE))%>%
      addCircleMarkers(data=tanz_Train,
                       lng=~tanz_Train$longitude, 
                       lat=~tanz_Train$latitude,
                       fillOpacity=0.7,
                       color=~color(waterpoint_type),
                       stroke=FALSE,
                       popup=details2,
                       label = as.character(tanz_Train$waterpoint_type),
                       radius=~sqrt(population)*.35)%>% #maintain circle size during zoom
      # Legend 
      addLegend("bottomleft", 
        pal=color, 
        values=~waterpoint_type, 
        opacity = 0.7,
        title="Water Point Type"
      )
  })


}
# Run the application 
shinyApp(ui = ui, server = server)