Hand-driven and gravity fed water pumps are a key source of potable water throughout much of Africa and Asia, and the ongoing management and maintenance of such pumps is an ever-present challenge for many communities. This project explored the use of a variety of predictive/classification modeling techniques for purposes of predicting the functional status of water pumps located within the country of Tanzania on the basis of data provided via a www.DrivenData.org data challenge competition. That data set was found to be riddled with significant quantities of invalid and/or missing data values. As such, a wide variety of statistically valid imputation methods were applied to improve the quality of the data set. The imputation methods were implemented via a Python-based, automated, user-configurable data transformation pipeline algorithm which is designed to be both extensible and easily modified for use with other types of data. The transformed data is then used to train various predictive models in both R and Python. A random forest model was found to have the highest overall predictive accuracy, while a bootstrap aggregation model proved to be the most effective at identifying water pumps that are functional but in need of some repairs. Finally, an interactive visualization dashboard was constructed using R and Shiny to help the Ministry of Water more effectively manage and maintain water pumps throughout Tanzania, thereby potentially increasing the availability of potable water for the country’s population.
The availability of clean, potable water is an important public health issue throughout less-developed parts of the world, including many parts of Africa and Asia. In fact, the water delivery infrastructure in some countries is still largely comprised of hand-driven and gravity-fed pumps. While such pumps are vital to the sustenance of many towns and villages, like any type of mechanical device they require ongoing management and periodic maintenance to ensure their proper functioning and to minimize the risk of non-potable water polluting a well.
One country that relies heavily on such pumps is the east African country of Tanzania. According to Tanzania’s Ministry of Water, more than 74,000 such pumps can be found throughout the country. While the installation of these pumps is largely funded via contributions from charitable and other non-governmental organizations (NGO’s), their ongoing maintenance is typically the responsibility of the local community within which they reside. Unfortunately, the cost of maintaining the pumps is often beyond the means of the local community, resulting in pumps becoming non-functional. Furthermore, local communities are often unaware of the need to perform the required maintenance due to the apparent lack of any significant problems with a pump up until the point it ultimately fails.
The purpose of this project was to develop, assess, and compare a variety of predictive / classification models, with the best-performing model being implemented as part of a complete end-to-end intelligent system that allows the user (e.g., the Tanzanian Ministry of Water) to pro-actively identify pumps that are either in need of repair or are in inoperable condition. The research question we are addressing can be stated as follows:
If so, Tanzania’s Ministry of Water can make use of such predictions to proactively address any identified water delivery system maintenance issues, thereby enhancing the ongoing availability of potable water to Tanzania’s citizens.
In 2016, the Ministry of Water teamed with Taarifa (www.taarifa.org) to sponsor a data challenge competition to assess whether it is possible to predict the functional status of a water pump based on a variety of quanitative and qualitative indicators. The data for this project was sourced from the DrivenData.org website that hosts the data challenge. Specifically, we made use of the datasets accessible via the following web link:
The data set is comprised of attributes that describe a total of 59,400 Tanzanian water pumps. Each water pump is represented by a total of 39 qualititative and quantitative attributes that describe such things as the type of pump, location, altitude, installation funding source, management method, year constructed, water source, and the quality of the water delivered by the pump. This data trove serves as the basis for the research discussed herein.
Our literature review was focused on identifying recently published commentary related to the Tanzanian Ministry of Water’s data challenge. While no refereed academic journal articles related to the data challenge could be located, a fair number of relevant informal online web pages, github postings, and/or blog postings that discuss various aspects of the data challenge were identified. These web postings generally discuss the nature of the data challenge, describe rudimentary exploratory data analysis results and data prep / imputation methods, and assess the effectiveness of a single type of classification / predictive model relative to a “cleansed” version of the data that the author(s) derived via their exploratory data analysis (EDA) and data imputation efforts. A subset of the postings provide non-interactive map-based visualizations of the physical location and status of each pump.
Exploratory Data Analysis: Most of the authors whose commentary was reviewed performed rudimentary EDA using tools such as histograms, barplots, and data aggregation via categorical variable values. However, in several instances (e.g., [2, 3, 6] ) the histograms or barplots used are rather ineffective for a variety of reasons, including the presentation of data of different magnitudes within a single graphic; the graphic lacking sufficient granularity to allow the viewer to draw a valid conclusion from it; or a failure to exclude clearly invalid data values from a plot. In general, the authors focus the majority of their EDA efforts on discussing missing data values for various variables and describing the variety of values found within the categorical variables. Two authors [5, 7] comment on the apparent collinearity of several variables within the data set, while [2, 6] comment on the inconsistency of funder
and installer
names throughout the data set (e.g., “Danida” vs. “Danid”) as well as the coincidence of missing values for those two variables. Very little attention is given to the derivation of potential predictive inferences for each variable, with authors such as [5, 6, 7] providing minimal or even no such insights, [8] discussing potential inferences for only 3 variables, and [4] exploring a total of 6 variables for possible predictive inferences.
Data Preparation: Many authors relied on the use of frequency based binning to reduce the dimensionality of the categorical variables, including [2, 5, 6, 7], and most opted to remove sparse variables from the data set prior to model building (e.g., [2, 3, 7]) . Imputation for missing data values was either entirely lacking or relied on simplistic use of a variable’s mean or median [2, 4, 6] or filling of missing values with zeroes [8]. [8] opted to consolidate all ‘funder’ values into 5 distinct new categories, while [2] attempted to bin ‘lga’ values according to whether or not they contained ‘urban’ and ‘rural’ as substrings. [2] suggests creating an ‘age’ variable based on ‘construction_year’ and ‘date_recorded’ while [4] creates a variable indicative of the time elapsed since the data were recorded. [5] employed external population data to augment the sparse ‘population’ variable and also made use of Principal Components Analysis in an attempt to reduce the dimensionality of the ‘funder’ variable.
Predictive / Classification Models: The mix of independent variables used for modeling varied significantly by author, with [3], for example, relying on only seven variables while [7] made use of twenty. Every author made use of some form of decision tree model, with random forests being the most widely applied [2, 3, 4, 5, 6, 7], followed by extra trees [4, 7], gradient boosted decision trees [7, 8] and basic decision trees [4, 7]. One author described the use of multinomial logistic regression [3]. Performance metrics varied widely, likely due to some authors reporting results relative to a model validation data set and others reporting their results relative to the data challenge’s dedicated testing data set. So while [5] claims 94% accuracy for their model, that result is likely applicable only to their training data: Others generally reported accuracies ranging between 80% - 83% for random forest and gradient boosted models [2, 3, 4, 7], while the sole multinomial regression model reported 70.38% accuracy.
Map Based Visualizations: Geomapping of pump statuses was used by [3, 5, 6, 8, 9]. These maps are all static in nature and do not provide the user with any means of exploring the predicted statuses of pumps contained within the dedicated data challenge testing data set.
Our approach to addressing the problem at hand was comprised of six distinct components: Data Exploration, Data Preparation, Predictive / Classification Modeling, Predictive Modeling Pipelines, Model Selection, and Visual Presentation. Each is described below.
Data Exploration: Investigation of the characteristics of each individual variable contained within the data set, including their data types, range of valid values, their distributions, and missing data values. Results of Data Exploration were used for the following: Development of any strategies required for imputing missing data values; Deciding whether any variables should be dropped from model building due to collinearity; Deciding whether any of the predictor variables may need to be mathematically transformed to better serve the model building process.
Data Preparation: Development of strategies for handling missing or invalid data values, binning of categorical variable values by frequency and their relationship to the response variable, and development of an approach to the separation of the master data set into dedicated modeling “Training” and “Evaluation” subsets. These strategies are designed and implemented using independent functional components to enable the testing and integration of a complete data transformation pipeline. This approach promotes the reusability of its components while enabling quick adaptation to changing requirements, easy inclusion/evaluation of customized hyper-parameters (i.e., parameters which are set before the commencement of the learning process. For example: The method of scaling numerical data can be implemented as a hyper parameter), and a streamlined process for the training/testing of models.
Predictive / Classification Modeling: Identification, development, and testing of task-appropriate predictive / classification models. Since we were tasked with classifying a pump as being in one of three possible conditions, various classification models were to be evaluated in an attempt to identify that which was most effective at making such classifications / predictions. The specific types of models developed were Random Forests, SVM (Support Vector Machines), C5.0 Boosted Trees, Bootstrap Aggregation, K Nearest Neighbors (KNN), and Multinomial Logistic Regression. Each model was iteratively refined in an attempt to maximize its performance and reduce the likelihood of overfitting of the training data.
Automation / Integration of Data Prep Pipelines & Predictive Models: The pipelines developed during Data Preparation were extended to incorporate the model development process. Specifically, the best performing model is integrated with the data transformation pipeline to allow for the application of both in a seamless manner to a new/incoming/streaming data. This combined framework can potentially be extended to automate the data preparation, predictive modeling, and autonomous retraining steps necessary for the retraining of a preferred model as its performance degrades over time due to the incorporation of new data from a real-time data stream.
Model Selection: Predictive/classification models were evaluated on the basis of performance metrics including Kappa scores, accuracy, classification error rates, specificity, sensitivity, and balanced accuracy. Results of each type of model were then compared against each other to judge their effectiveness at accurately classifying / predicting the three respective response variable values.
Visual Presentation: The visual presentation is provided via an interactive graphical “dashboard” application that incorporates geomapping of the locations of the pumps represented within the data challenge’s test data set, the output of the six types of predictive/classification models, as well as providing the user with the ability to view the predicted status of each pump relative to each of the relevant independent variables used for model training.
For each water pump we were provided with 39 attributes (excluding a unique record identifier) that could potentially be used as predictor variables. The response variable can assume one of three possible values:
functional
: The pump is operational and not in need of maintenancefunctional needs repair
: The pump is operational but is in need of repairnon functional
: The pump is inoperable54.3% of all pumps in the data set were found to be functional
, while 38.4% were non functional
and 7.3% had a status of functional needs repair
.
Each independent variable can be characterized as belonging to one of three classes: non- categorical numeric variables; categorical variables; and administrative variables. Variables classified as “Administrative” in nature appear to serve solely as data management attributes within the data set:
Administrative Variables | Comments |
---|---|
id | Unique ID for each data record |
date_recorded | Date of data collection by survey company |
recorded_by | Name of the data collection / survey company |
While detailed analyses for each of the remaining independent variables can be found in the Data Exploration section of the Appendix, key findings from that work are described below.
Our analysis found that every non-categorical numeric variable contains what appear to be significant quantities of invalid data values represented by zeroes:
Numeric Variables | Invalid Data Comments |
---|---|
amount_tsh | 41,639 of 59,400 records = “0” |
gps_height | 20,438 of 59,400 records = “0” |
population | 21,381 of 59.400 records = “0” |
longitude | 1,812 zero values: likely invalid |
latitude | 1,819 values < -1: likely invalid |
num_private | 58,643 of 59.400 records = “0” |
Eight categorical variables were also found to have missing data values (as indicated by either ‘NA’ values, zeroes, or blank character strings):
Categorical Var. | Distinct Values | Missing | Comments |
---|---|---|---|
funder | 1898 | 3635 | 3582 NA’s coinc. w installer |
installer | 2146 | 3655 | 3582 NA’s coinc. w funder |
subvillage | 19288 | 371 | |
public_meeting | 3 | 3334 | valid values: TRUE/FALSE |
scheme_management | 13 | 3877 | |
scheme_name | 2697 | 28166 | |
permit | 3 | 3056 | valid values: TRUE/FALSE |
construction_year | 55 | 20709 | valid values: 1960 - 2013 |
Further analysis of the missing data yielded the following key data insights:
Missing data values are found within 31,587 of the 59,400 records within the data set (53.17%).
69.2% of subvillages
, 52.1% of wards
, 32% of lga's
, and 19% of regions
have no valid amount_tsh
values.
Four regions
(Dodoma, Kagera, Mbeya, Tabora) were found to have no non-zero values for the following variables: amount_tsh
, gps_height
, construction_year
, num_private
, and population
. These 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages. The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.
3,582 of the missing funder
and installer
values coincide with each other
1,812 of the missing gps_height
values coincide with missing latitude
and longitude
values
43 wards
have no valid public_meeting
values; 75 wards
have no valid scheme_management
values; 73 wards
and 3 lga's
have no valid permit
values.
The widespread incidence of missing data throughout the data set was a strong indicator of the need for the development of statistically valid data imputation algorithms for many of the affected variables. A discussion of the various imputation strategies used can be found within the Data Preparation section of this document.
Aggregation of data by various variables / values provided additional valuable insights, including:
The funder
and installer
variables contain many instances of what appear to be minor variations of a single valid data value, e.g., “Government of Tanzania” vs. “Ministry of Water” vs. “Water”.
There is no one-to-one relationship between the region
and region_code
variables (i.e., 21 distinct regions
; 27 distinct region_codes
).
There is no one-to-one relationship between district_code
and region
: district_code
values are only unique within a given region, not across regions.
Aggregating by extraction_type
reveals an apparent chronological progression in the deployment of different pump technologies. For example, we find the median construction_year
value for swn 80 extraction types to be 1997 while that of the cemo extraction type is 2009.
Aggregating by basin
reveals an apparent chronological geographic progression in the installation of pumps throughout Tanzania. For example, we find the Lake Rukwa basin has a median construction_year
value of 1989 while the basin of Wami / Ruvu has a median value of 2003.
There is no single scheme_management
value that is universally applicable to any given extraction_type
or waterpoint_type
or for any of the composites thereof (e.g., extraction_type_group
, etc.), or to any ward
, lga
, region
or basin
.
Several variables contained within the data set are either duplicative or composites of other variables:
extraction_type_group
is a binned / composite version of extraction_type
extraction_type_class
is a binned / composite version of extraction_type_group
payment_type
is 100% duplicative of payment
quality_group
is a binned / composite version of water_quality
quantity_group
is 100% duplicative of quantity
source_type
is a binned / composite version of source
waterpoint_type_class
is a binned / composite version of waterpoint_type
Insights such as these serve as the basis of both the data imputation and model building efforts described herein.
Derivation of barplots for five numeric variables (amount_tsh
, gps_height
, population
, longitude
, latitude
) and each categorical variable value relative to each of the three possible response variable values allowed us to gain insight into some of the the predictive aspects of the data. The plots (which can be found in the Data Exploration section of the Appendix) show that most of the categorical variables do, in fact, contain significant predictive characteristics. For example, we find geographic variability with respect to the functional
status of pumps via variables such as basin
and region
. Additional variability in the functional
status of pumps can be found via variables such as scheme_management
, construction_year
, public_meeting
, extraction_type
, management
, payment
, source
, quantity
, water_quality
, and waterpoint_type
:
Plots for the numeric variables show that gps_height
, longitude
, and latitude
also contain predictive characteristics, while population
and amount_tsh
appear likely to have less predictive value. Such inferences are helpful for purposes of model building since they suggest that many variables found within the data set are strong candidates for inclusion within a prospective predictive / classification model. Furthermore, it suggests that use of varying combinations of independent variables might be employed in an attempt to determine whether any particular combination thereof yields a superior model.
As discussed above, more than 53% of the records in the data set contain either missing or invalid data values. Our data exploration efforts allowed us to uncover relationships between various independent variables that could serve as the basis of statistically valid imputation algorithms for the missing values of the amount_tsh
, gps_height
, construction_year
, latitude
and longitude
variables. A summary of the various imputation techniques used for this project are provided in the tables shown below. The first table provides a summary of numeric variable imputation methods as well as the python classes defined to perform those imputations. The second table provides a summary of the imputations applied to character-based categorical variables. Defining potential imputation options as hyper parameters within our data preparation framework allows us compare the effects of varying imputation methods via cross-validation [16] or via a holdout method [10]. Please see Appendix B for a detailed explanation of these imputation techniques and Appendix C for the python source code used to implement the data transformation classes.
Numeric Variable Imputation Methods
Categorical Variable Imputation Methods
We find three types of categorical variables within the data set:
Variables with large numbers of possible categories such as installer
(2146) and subvillage
(19288);
Variables with relatively small numbers of possible categories such as basin
, region
, and payment
;
Variables whose values appear to be the result of free-form text input such as scheme_name
and funder
.
Each of these variables are converted to numeric categoricals prior to model building. For variables with large numbers of categories, we make use of 2 imputation techniques: frequency based binning and response variable based binning. During frequency based binning, the levels of a categorical variable are binned based on the percentage of data records represented by a particular level. For response based binning the levels of an independent categorical variable are binned based on the percentage of data records represented by a particular level relative to the categorical level of a data record’s corresponding target variable. The choice of frequency based and/or responsed based binning is controlled by a hyper parameter.
To handle categorical columns with free form text, we group the text based on the initial “n” characters of a data value, followed by the creation of dummy variables for each of those groups. Both the number of initial characters to consider and the number of groupings to be created are controlled by hyper parameters. These techniques allow us to reduce the number of levels found within categorical variables. To handle categorical variables with lesser numbers of levels, we create dummy variables to represent the presence/absence of each level within a data record. Please see Appendix B for a detailed explanation of these 3 techniques.
The following additional classes will support the transformation process. See Appendix B and Appendix C for the details and Python implementations of these classes.
Creation of separate training and evaluation data sets was deemed necessary to allow us to test our classification models on a relatively unbiased set of data records. In preparation for model building we divided the training data into model training and evaluation subsets by extracting a random subset (stratified sampling based on the target variable) of 20% of the items of the training data set to serve as an evaluation subset. Those same items were then removed from the training data set to create a “model training subset”. The “model training subset” for all the model building activities described herein while the evaluation subset was used to gauge the effectiveness of each model.
Data transformation pipelines play a significant part in cleansing and preparing the for model building. The data transformation pipelines can also be applied to streaming data if so desired. As such, it is possible for our framework to handle new data as robustly as possible. The data transformation modules will run independently (wherever possible) to create the transformed data, which can then be readily supplied as input to predictive models. For this project we have used Python to develop the data transformation modules and both R and Python for predictive models. The source code for the data transformation pipelines and predictive models is provided in Appendix C and Appendix E. The following data flow diagram shows the complete data transformation process:
In the data flow diagram shown above, the green nodes represent the classes whose instances can be applied to any data set to perform the transformation implemented in the respective class. The orange nodes represent the classes which cannot be used on any data set, other than the data set of this project. Please see Appendix C for the actual python implementation of these classes, and Appendix B for a tabular summary.
In the diagram, processes (1a - 1h) are instances of the DataFrameSelector class, which facilitates the selection of the required columns from the input data. Process 1a extracts the funder
and installer
columns from the data set, both of which contain free form text data, and creates a data frame (C1). Process 1b extracts the columns basin
, region
, district_code
, public_meeting
, scheme_management
, permit
, extraction_type_group
, management
, payment_type
, water_quality
, quantity_group
, source
,source_class
, waterpoint_type
, and creates another data frame (C2). The variables in this second data frame have a relatively low number of levels (less than 20). The 1c process will extract the subvillage
, lga
, and ward
columns to create a third data frame (C3). The variables in C3 have a very large numbers of levels. The C1, C2, and C3 data frames are then supplied as inputs to 3 instances of the class HandleCategoricalNulls. This class will substitute a desired value ( unknown by default) in place of any null values and will optionally convert text to lower case. Each of the instances (2a, 2b, 2c) of the HandleCategoricalNulls class produces the C4, C5, C6 data frames. The C4 data frame is supplied as input to the instance of class FunderInstTransformer. This instance (represented as 3) will extract the initial 3 characters of the funder
and installer
variables and creates another data frame (C7). Data frames C7 and C5 are passed as inputs to the instances (4a and 4b respectively) of the class CatMultiLabelTransformer. This class will help us to create dummy variables for the input categorical variables. The instances 4a and 4b (of class CatMultiLabelTransformer) produce the C8 and C9 data frames. The C6 data frame will be supplied as input to the instance of the class ChooseCatPipelineType (represented by 13). This class will help us to choose frequency pipeline or response based pipeline or both (controlled by a hyper parameter). Based on the choice, the input data set will be supplied to the instances of the class FreqBasedCategoricalBinning (represented by 5), and RespBasedCategoricalBinning (represented by 6). The dotted arrows between 13 and 6 items indicate that we have a choice to choose either frequency based binning or response based binning or both. The 5 (instance of FreqBasedCategoricalBinning) and 6 (instance of RespBasedCategoricalBinning) processes will produce the C10 and C11 data frames.
The processes 1d, 1e, 1f, 1g, and 1h will extract amount_tsh
, construction_year
, latitude
longitude
, gps_height
, and population
respectively, producing the N1, N2, N3, N4 and N5 data frames. N1, N2 and N3 will be processed by AmountTSHTransformer (node 12), YearTransformer (node 10), and LatitudeLongitudeProcess (node 11) instances respectively, producing N6, N7 and N8 data sets. The N4 and N5 data frames are processed by GPSHeightTransformer (node 8) and PopulationTransformer (node 9), by combining with N8 (imputed latitude and longitude) data frame, producing N9 and N10 data frames. All the numeric data frame N6-N10 are supplied as input to StandardScaler instance (Node 7) producing N10 data frame. The C8, C9, C10, C11, and N10 data frames are combined by FeatureUnion class of python to form the output data set.
Six distinct types of classification models were constructed for purposes of predicting the current status of a given water pump:
Detailed analysis of the performance of each type of model can be found in Appendix D.
The table below provides a summary of the performance of each type of model relative to the model validation data subset discussed in the Data Preparation section above. In the table, each type of model is measured via five distinct metrics:
functional
pump status (“Accu.(F)”);functional needs repair
pump status (“Accu.(FNR)”);non-functional
pump status (“Accu.(NF)”);The table lists the models in descending rank order based on their overall accuracy scores.
Model Type | Overall Accu. | Accu.(F) | Accu.(FNR) | Accu.(NF) | Kappa |
---|---|---|---|---|---|
Random Forest (R) | 0.8113 | 0.8173 | 0.64064 | 0.8417 | 0.6418 |
C5.0 Boosted Trees | 0.8040 | 0.8140 | 0.65627 | 0.8367 | 0.6318 |
Bootstrap Aggregation | 0.7991 | 0.8138 | 0.67610 | 0.8378 | 0.6281 |
SVM | 0.7650 | 0.7636 | 0.55706 | 0.7887 | 0.5391 |
Multinomial Log. Regr. | 0.7508 | 0.7554 | 0.55316 | 0.7775 | 0.5156 |
KNN | 0.7493 | 0.7664 | 0.64735 | 0.7842 | 0.5334 |
As shown in the table, the Random Forest model achieved the highest overall accuracy of the six types of classification models relative to the model validation data subset. However, three other types of models ( KNN, Bootstrap Aggregation, and C5.0 Boosted Trees) were more proficient at accurately identifying water pumps having a status of functional needs repair
. These results highlight a potential challenge in identifying a “best performing” classification model: The model with the highest overall accuracy may not be the most effective across all of a data set’s possible response variable values. For example, if Tanzania’s Ministry of Water is most interested in accurately identifying pumps that are working but in need of repair, the Bootstrap Aggregation model may be preferable to the Random Forest model despite the fact that the Bootstrap Aggregation model has a lower overall accuracy. As such, selection of a preferred model should be performed relative to the priorities and needs of the prospective model user.
From a computational efficiency perspective, the random forest, C5.0, MLR, and bootstrap aggregation algorithms could each be trained within approximately 30 minutes on a modest 6GB Windows-based PC and were capable of generating any required predictions within approximately 5 minutes. The KNN and SVM models each required far longer periods of training time and required relatively long periods of time to make any required predictions. Therefore, long-term reliance on either the KNN or SVM models may prove to be impractical. The relative inefficiencies of those two types of models combined with their relatively lower overall predictive performance led us to remove them from consideration for the “best performing” model type.
Of the remaining four model types (MLR, bootstrap aggregation, C5.0 boosted trees, random forest), the MLR model was dropped from consideration as “best performing” due to its overall accuracy score falling nearly five full percentage points below that of the other three.
An attempt was also made to improve the overall performance of our modeling efforts by combining the output of the Random Forest, Bootstrap Aggregation and C5.0 Boosted Tree methods as an ensemble model. However, the results of that approach did not improve upon those achieved by the random forest model.
Therefore, we recommend the use of the Random Forest model if overall accuracy across each of the three possible water pump statuses is of most importance to the Ministry of Water. However, if correctly identifying the largest number of pumps that are functional but in need of repair is the top priority, the Bootstrap Aggregration model should be prefered.
The interactive visual presentation was created using the Shiny toolset within R and is comprised of five separate panels:
About provides a brief synopsis of the project and lists the individual members of the project team.
Full Writeup presents the user with a zoomable, downloadable PDF version of this document in its entirety, including the appendices.
Data Table allows the user to explore the data used for model testing, including any imputed data values, via an interactive, searchable data table. Results of each of the six predictive/classification models for each pump represented in the model testing data set are also included within the data table and may be searched / sorted interactively.
Plot makes use of interactive bar plots to visualize the output of each of the six predictive models relative to a user-selected independent/predictive variable. These visualizations provide the user with an intuitive method of identifying specific pump characteristics that are most indicative of a pump being either functional
, functional needs repair
or non functional
.
Map Visualization allows the user to generate an interactive geoplot showing the location, population, and wellpoint type for each water pump represented within the data challenge’s test data set. The user can pan/zoom within the map. Hovering over a specific pump will display the wellpoint type while clicking on a pump will display what the predicted status of the pump was for each of the six types of predictive / classification models.
The visualization can be accessed via the following link:
The purpose of this project was to attempt to predict the functional status of a water pump from one of three possible status values (functional
, functional needs repair
, non functional
) based on a variety of qualitative and quantitative water pump attributes and geographical variables. Data related to 59,400 water pumps was used for training and validating six distinct types classification models (MLR, KNN, Bootstrap Aggregation, SVM, Random Forest, and C5.0 Boosted Classification Trees), and the output of those models serves as the basis of an interactive geomap whereby the user can explore the predicted statuses of more than 14,000 additional pumps whose functional status was not available as part of the original data set.
Our modeling efforts indicated that the quantity
, extraction_type
, waterpoint_type
, and the derived pump_age
variables appeared to have the highest levels of predictive ulitity for the three types of models from which such assessments could be made (MLR, C5.0, random forest). The latitude
, longitude
, and gps_height
variables appeared to be of little value to the MLR model but had very high predictive value for the random forest and moderately high predictive value for the C5.0 and MLR models. Other variables found to have relatively high levels of predictive value for MLR, C5.0 and random forest included payment
, source
, region_code
and district_code
. By contrast, the subvillage
variable and many of the binned ward
and lga
variables were found to have relatively low predictive strength across those three types of models, and the region
variable was found to add no value if the region_code
variable was already included within a model.
The overall accuracy of the six models when applied to the validation subset ranged from a low of approximately 75% (e.g., MLR, KNN) to a high of just over 81% (random forest), though the KNN, Bootstrap Aggregation, and C5.0 models all outperformed the random forest model for accurately identifying water pumps having a status of functional needs repair
. Computationally, the Random Forest, Bootstrap Aggregation and C5.0 models each proved to be far more efficient and/or accurate than the MLR, KNN and SVM models. An ensemble model that combined the output of the Bootstrap Aggregation, Random Forest and C5.0 models did not outperform the results obtained via the Random Forest model.
The Random Forest model scored highest within the context of the DrivenData.org data challenge, achieving an overall accuracy of .8195 (just .009 less than the top score of .8285) and a ranking within the top seven percent of all submissions. Therefore, the Random Forest model is recommended for use by Tanzania’s Ministry of Water if overall accuracy across each of the three possible pump statuses is of most importance, while the Bootstratp Aggregation model should be preferred if identifying the largest number of pumps that are functional but in need of repair is the top priority.
Finally, the interactive visual dashboard developed during this project represents a distinct contribution relative to the available literature for this specific DrivenData.org data challenge. The dashboard allows a user to not only explore the predicted status of pumps for all six types of predictive/classification models via an interactive geomap but also provides the user with a way to explore those predictions relative to each of the independent variables that were used for model training. Such flexibility allows the user to understand the relationship between pump statuses and attributes such as extraction_type
, water quality, how users pay for use of a pump, and various approaches to management of pumps, amongst others. Insights gained from such exploration can help the Ministry of Water more effectively manage and maintain pumps throughout Tanzania, thereby potentially increasing the availability of potable water for the country’s population.
Future Work: While the conclusions described herein apply only to the water pumps described in the aforementioned DriventData.org data set, the extensible/reusable design of the data transformation pipeline allows much of what we’ve implemented to potentially be applied to other types of predictive/classification modeling efforts. Future work could include extending our visualization tool to incorporate graph theory techniques for purposes of identifying clusters of pumps that are in need of repair. Such functionality could provide the Ministry of Water with a way to improve the efficiency of its repair efforts, e.g., repair teams could be deployed to clusters of problematic pumps, etc. Future work could also include the development and training of a neural network classification algorithm to determine whether such an approach would improve upon the results we’ve documented herein: Use of a neural network approach proved to be computationally prohibitive relative to the computing resources available to us during this project.
Tanzania Water Competition: https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/
Domptail: https://github.com/domptail/Tanzania-Water-Wells-ongoing
Mattiaf: https://github.com/drivendataorg/pump-it-up/blob/master/mattiaf/pumps.ipynb
Agrawal et. al: http://www.contrib.andrew.cmu.edu/~kdagrawa/documents/waterpump.pdf
Cheng: http://blog.nycdatascience.com/student-works/linlin_cheng_proj_5/
DataCamp: https://github.com/datacamp/courses-drivendata-water-pumps/blob/master/chapter1.md
Wikipedia - Training/Test/Validation: https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets
Tanzania map: https://en.wikipedia.org/wiki/Template:Location_map_Tanzania
Tanzania sub-divisions: https://en.wikipedia.org/wiki/Subdivisions_of_Tanzania
Decimal degrees: https://en.wikipedia.org/wiki/Decimal_degrees
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani - An Introduction to Statistical Learning with Application in R, Sprienger 2013
FuzzyWuzzy: https://github.com/seatgeek/fuzzywuzzy
Wikipedia - Cross Validation - https://en.wikipedia.org/wiki/Cross-validation_(statistics)
NOTE: An online version of this Appendix can be found at the following Rpubs.com page:
Load Training Data
# load tanzania training data file
tanz <- read.csv("https://raw.githubusercontent.com/jtopor/DDO-TNZ/master/Tanz-Training-Data.csv", header = TRUE, stringsAsFactors = FALSE)
# count distinct items
length(unique(tanz$id))
## [1] 59400
# summary(tanz)
Load Training Labels
# load tanzania training data labels
tanz_trl <- read.csv("https://raw.githubusercontent.com/jtopor/DDO-TNZ/master/Tanz-Training-Labels.csv", header = TRUE, stringsAsFactors = FALSE)
tanz$label <- tanz_trl$status_group
# count number of rows with no NA's = 27813
(c.cases <- sum(complete.cases(tanz)) )
## [1] 53281
# now calculate number of rows with NA's = 6119
(na.cases <- nrow(tanz) - c.cases)
## [1] 6119
na.cases / nrow(tanz)
## [1] 0.1030135
53281 complete cases
6119 rows with NA’s
10.3% of rows have NA’s
# fill in all blank entries with NA
tanz[,][tanz[,] == ""] <- NA
# count NA's in each column
# kable(colSums(is.na(tanz)), col.names = c("NA Count") )
# recount number of rows with no NA's = 27813
(c.cases <- sum(complete.cases(tanz)) )
## [1] 27813
# now calculate number of rows with NA's = 31587
(na.cases <- nrow(tanz) - c.cases)
## [1] 31587
na.cases / nrow(tanz)
## [1] 0.5317677
Only 27813 complete cases after filling in blank string values with NA
31587 rows with NA’s
53.17% of rows have NA’s
## # A tibble: 3 x 2
## label TotalWells
## <chr> <int>
## 1 functional 32259
## 2 non functional 22824
## 3 functional needs repair 4317
The plot above provides a benchmark of sorts for purposes of evaluating each of the variables to be discussed below. The plot tells us that 54.3% of all pumps are functional, 38.4% are non-functional, and 7.3% are functional but in need of repair. We can use these metrics to partially assess each of the individual categorical variable values found within the data set; as we analyze each variable value, we can determine whether or not the percentage of pumps pertaining to that variable value either exceeds or falls short of the overall performance metrics plotted above. For example, those exceeding the 54.3% “functional” metric may share characteristics that poorer performing pumps may benefit from emulating / replicating.
The data set is comprised of a total of 40 variables, six of which are numeric, 31 of which are categorical, and three of which are strictly for administrative/reference purposes. The following is a brief summary of each of the independent variables found within the data set.
Variable | Definition |
---|---|
id |
Identification Variable (not a predictor) |
date_recorded |
Date of data collection by survey company |
recorded_by |
Name of the data collection / survey company |
amount_tsh |
Metric indicating total static head for pump; should be > 0 |
gps_height |
Altitude of pump |
population |
Human population in immediate vicinty of pump |
longitude |
Longitudinal coordinate of pump |
latitude |
Latitudinal coordinate of pump |
num_private |
No definition is available for this variable |
funder |
Name of organization that funded installation of pump |
installer |
Name of organization that installed the pump |
wpt_name |
Name assigned to given waterpoint |
basin |
Name of geographic water basin where pump is located |
subvillage |
Name of geographic subvillage where pump is located |
region |
Name of geographic region in Tanzania where pump is located |
region_code |
Numeric ID for ‘region’ variable |
district_code |
Numeric ID of district within a region where pump is located |
lga |
Tanzania-specific geographic indicator of where pump is located |
ward |
Name of Tanzanian geographic ward where pump is located |
public_meeting |
True/False indicator |
scheme_management |
Type of the organization responsible for management of pump |
scheme_name |
Name of organization responsible for management of pump |
permit |
True/False indicating whether the pump has valid permit |
construction_year |
The year the pump was installed |
extraction_type |
Method of extraction used at a given pump site |
extraction_type_group |
Aggregation of extraction_type categories |
extraction_type_class |
Aggregation of extraction_type_group categories |
management |
Name of method employed for management of a given pump |
management_group |
Possibly an aggregation of management categories |
payment |
Categorical indicator of payment method required of pump users |
payment_type |
Appears to be a duplicate of payment categories |
water_quality |
Categorical indicator of water quality produced by pump |
quality_group |
Aggregation of water_quality categories |
quantity |
Categorical indicator of water quantity produced by pump |
quantity_group |
Aggregation of quantity categories |
source |
Type of source of the water for a given pump |
source_type |
Aggregation of source categories |
source_type_class |
Aggregation of source_type categories |
waterpoint_type |
The type of pump installed at a well site |
waterpoint_type_group |
Aggregation of waterpoint_type categories |
Each of these variables can be characterized as belonging to one of three classes of independent variables: non-categorical numeric variables; categorical variables; and administrative variables. The tables below summarize our analyses of missing and possibly invalid values for each variable.
Numeric Variables | Comments |
---|---|
amount_tsh | 41,639 of 59,400 records = “0” |
gps_height | 20,438 of 59,400 records = “0” |
population | 21,381 of 59.400 records = “0” |
longitude | 1,812 zero values: likely invalid |
latitude | 1,819 values < -1: likely invalid |
num_private | 58,643 of 59.400 records = “0” |
Categorical Variable | Distinct Values | NA’s | Comments |
---|---|---|---|
funder | 1898 | 3635 | 3582 NA’s coinc. w installer |
installer | 2146 | 3655 | 3582 NA’s coinc. w funder |
wpt_name | 37400 | 0 | 3563 vals = “none” |
basin | 9 | 0 | |
subvillage | 19288 | 371 | |
region | 21 | 0 | |
region_code | 27 | 0 | |
district_code | 20 | 0 | Not unique across regions |
lga | 125 | 0 | |
ward | 2092 | 0 | |
public_meeting | 3 | 3334 | binary + NA’s |
scheme_management | 13 | 3877 | |
scheme_name | 2697 | 28166 | |
permit | 3 | 3056 | binary + NA’s |
construction_year | 55 | 20709 | |
extraction_type | 18 | 0 | |
extraction_type_group | 13 | 0 | composite of extr_type |
extraction_type_class | 7 | 0 | composite of extr_type_class |
management | 12 | 0 | |
management_group | 5 | 0 | |
payment | 7 | 0 | |
payment_type | 7 | 0 | dupe of payment |
water_quality | 8 | 0 | |
quality_group | 6 | 0 | composite of water_quality |
quantity | 5 | 0 | |
quantity_group | 5 | 0 | dupe of quantity |
source | 10 | 0 | |
source_type | 7 | 0 | composite of source |
source_class | 3 | 0 | binary + “unknown” |
waterpoint_type | 7 | 0 | |
waterpoint_type_group | 6 | 0 | composite of waterpoint_type |
Four region names are found to have no non-zero values for the following variables:
The four regions are:
These 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages.
The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.
The lack of non-zero values throughout the four indicated regions for the five variables listed above makes it highly unlikely that we will be able to effectively derive imputed values for the zero values of those five variables using the geographical indicators provided within the data set.
Administrative Variables | Data Type |
---|---|
id | int |
date_recorded | date |
recorded_by | char str |
The administrative variables shown above have no predictive value and will be ignored for purposes of model building. The id variable serves as a unique identifier for each data record within the data set. The date_recorded variable indicates the date on which the each record presented within the data was collected, while the recorded_by variable contains the name of the survey firm that collected the data.
The numeric and categorical variables listed above are explored in detail below.
The amount_tsh variable can be characterized as representing the “Total static head (amount water available to waterpoint)”. According to the Encyclopedia of Marine Technology, the definition of “total static head” is “the vertical height of a stationary column of liquid produced by a pump” [ref needed]. In other words, it is the height at which a pump can raise water above its source. Therefore, a pump with a total static head value of zero would not be particularly useful since it would not be capable of producing water from a source located below its output point.
Summary statistics and a boxplot for the amount_tsh variable indicate a heavily right-skewed distribution:
summary(tanz$amount_tsh)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 317.7 20.0 350000.0
boxplot(tanz$amount_tsh, col = "yellow")
In fact, analysis of the amount_tsh variable indicates that 41,639 of the 59,400 data records (70.3%) have a value of “0”:
length(tanz$amount_tsh[tanz$amount_tsh == 0])
## [1] 41639
Analysis of the amount_tsh = 0 values relative to various geographic variables provided within the data set reveals a significant amount of clustering of such values by subvillage, ward, lga, district_code, and region. The table below provides a summary of this clustering.
Geographic Var. | Num of categories w No amount_tsh Value |
---|---|
subvillage | 13357 / 19288 (69.2%) |
ward | 1090 / 2092 (52.1%) |
lga | 40 / 125 (32%) |
district_code | 2 / 20 (10%) |
region | 4 / 21 (19%) |
As shown in the table, more than two-thirds of subvillages and one-half wards represented within the data set have no non-zero amount_tsh values, as do 32% of lga’s, 10% of district_codes and 19% of regions. Such clustering might suggest a systemic data collection issue within those geographic areas, e.g., perhaps the pumps located within those areas were not actually visited by the surveying company that compiled the data. Alternatively, the total static head for many pumps may simply not be known if it was not recorded when the pump was originally installed.
In any event, the lack of non-zero data values across so many geographic indicators strongly suggests that imputing the zero values found within the amount_tsh variable via geographic indicators will not be possible. Furthermore, given the widespread lack non-zero amount_tsh values found for these geographic areas it seems highly likely that the zero values are simply invalid.
Analysis shows that 14,771 of the non-zero values fall within the range of (1 : 1000):
atsh <- tanz$amount_tsh[tanz$amount_tsh > 0 & tanz$amount_tsh <= 1000]
length(atsh)
## [1] 14771
tot_atsh <- hist(atsh, col = 'yellow', main = "Histogram of amount_tsh values (1:1000)")
This leaves slightly less than 3,000 records having amount_tsh values that exceed 1000. Such values should be investigated to determine whether they are in fact valid values for the variable. Of those 3000 records, 2,810 are found to lie within the range of (1000:10,000):
atsh_10k <- tanz$amount_tsh[tanz$amount_tsh > 1000 & tanz$amount_tsh <= 10000]
length(atsh_10k)
## [1] 2810
Barplots for amount_tsh
values within the range of (1:10000) suggest that pumps with relatively larger values may be more likely to be functional
than are pumps having smaller values. However, since the larger values might possibly be outliers and we lack valid amount_tsh
values for more than 70% of the records within the data set, it seems highly unlikely that we can derive any valid predictive inferences from the plots.
Given the large percentage of missing values for the variable, other variables within the data set were examined to determine whether any of their attributes might serve as indicators or proxies for missing amount_tsh
values. The extraction_type_group and waterpoint_type_group categorical variables are each at least somewhat indicative of the type of pump deployed at each waterpoint. Assuming similar types of pumps generally share somewhat similar physical characteristics and limitations, there may be a relationship we can infer between these two variables and the amount_tsh value each pump may reasonably have. The tables below summarize the median non-zero amount_tsh values for each category of the extraction_type_group and waterpoint_type_group variables.
# isolate non-zero amount_tsh records
t_atsh <- subset(tanz, amount_tsh !=0)
# get the median amount_tsh value for each extraction_type_class
etg_mtsh <- arrange(summarise(group_by(t_atsh, extraction_type_group),
MedTsh = median(amount_tsh) ), desc(MedTsh) )
etg_mtsh
## # A tibble: 13 x 2
## extraction_type_group MedTsh
## <chr> <dbl>
## 1 other motorpump 7500
## 2 other handpump 1000
## 3 afridev 500
## 4 gravity 500
## 5 india mark ii 500
## 6 nira/tanira 500
## 7 other 500
## 8 rope pump 500
## 9 swn 80 500
## 10 mono 50
## 11 submersible 50
## 12 wind-powered 50
## 13 india mark iii 25
# check avg tsh for extraction_type_class
# mean(tanz$amount_tsh[which(tanz$extraction_type_class == 'gravity')])
# get the median amount_tsh value for each waterpoint_type_group
wptg_mtsh <- arrange(summarise(group_by(t_atsh, waterpoint_type_group),
MedTsh = median(amount_tsh) ), desc(MedTsh) )
wptg_mtsh
## # A tibble: 5 x 2
## waterpoint_type_group MedTsh
## <chr> <dbl>
## 1 improved spring 2000
## 2 hand pump 500
## 3 communal standpipe 200
## 4 other 200
## 5 cattle trough 100
# memory cleanup
rm(t_atsh)
Since each pump represented within the data set has known values for both of these variables, one or both of the median values shown in the tables above may allow us to impute a reasonable value for each instance of amount_tsh = 0 within the data set.
The gps_height variable represents the physical altitude of a pump. Summary statistics and plots for the gps_height variable indicate a heavily right-skewed distribution:
summary(tanz$gps_height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -90.0 0.0 369.0 668.3 1319.2 2770.0
boxplot(tanz$gps_height, col = "yellow")
hist(tanz$gps_height, col = "yellow")
Both the summary statistics and the boxplot show that a portion of the data records contain a negative gps_height value. An analysis of the variable finds 1,496 such records within the data set:
length(tanz$gps_height[tanz$gps_height < 0])
## [1] 1496
While it may be somewhat surprising to have a pump located below sea level, Africa is in fact home to many areas where the actual land altitude falls below sea level. As such, further investigation is required to determine whether or not these negative values are valid within the context of the data set.
The histogram shown above indicates that the majority of non-zero and non-negative gps_height
values lie in the range of (1000:2000). The histogram also indicates the presence of a significant number of zero values for the variable. Further analysis finds a total of 20,438 data records containing a zero value for the gps_height variable.
length(tanz$gps_height[tanz$gps_height == 0])
## [1] 20438
1812 of these zero values occur when values for both latitude and longitude are apparently unknown:
nrow(subset(tanz, latitude > -1 & longitude == 0 & gps_height == 0) )
## [1] 1812
The fact that those zero values coincide with missing latitude and longitude coordinates stongly suggests that they are also representative of missing data.
Analysis of the presence of gps_height = 0 values relative to various geographic variables provided within the data set reveals a significant amount of clustering of such values by subvillage, ward, lga, district_code, and region. The table below provides a summary of this clustering.
Geographic Var. | Num of categories w No gps_h Value |
---|---|
subvillage | 7265 / 19288 (37.6%) |
ward | 776 / 2092 (37.1%) |
lga | 42 / 125 (33.6%) |
district_code | 2 / 20 (10%) |
region | 4 / 21 (19%) |
As shown in the table, more than one third of subvillages, wards and lga’s represented within the data set have no non-zero gps_height values, as do 10% of district_codes and 19% of regions. Such clustering suggests a systemic data collection issue within those geographic areas, e.g., perhaps the pumps located within those areas were not actually visited by the surveying company that compiled the data. If in fact the physical pumps had been visited during data collection, use of a basic, inexpensive altimeter could have provided the required altitude measurement.
In any event, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.
Further analysis is required to determine whether the remaining zero values are in fact valid data values or if they merely represent the lack of an accurate altitude measurement for the associated pumps. However, given the widespread lack of accurate measurements found for these geographic areas it seems highly likely that the zero values are simply invalid.
Barplots for the non-zero gps_height
values show that pumps located at relatively higher altitudes are more likely to be functional
than are pumps found at lower altitudes. However, pumps located at altitudes higher than agps_height
value of approximately 2500 are also the most likely to be functional needs repair
, followed by those lying within the approximate range of (1000:1600). We also find that pumps located at relatively low altitudes (i.e., between 0 and 800) are the most likely to be non-functional.
The population variable represents the human population in the area surrounding a pump. Summary statistics and plots for the population variable indicate a heavily right-skewed distribution:
summary(tanz$population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 25.0 179.9 215.0 30500.0
boxplot(tanz$population, col = "yellow")
hist(tanz$population, col = "yellow")
Our analysis indicates the presence of 21,381 data records containing a zero value for the population variable.
length(tanz$population[tanz$population == 0])
## [1] 21381
While it may in fact be realistic to find a pump located in an unpopulated area, it seems highly unlikely that such a large number of pumps would reside in locations that are devoid of human inhabitants. As such, many of these zero values are likely to be invalid.
As with the amount_tsh and gps_height variables, we find a significant amount of clustering relative to the subvillage, ward, lga, district_code, and region geographic variables, as shown in the table below.
Geographic Var. | Num of Categories w No popu Value |
---|---|
subvillage | 7291 / 19288 (37.8%) |
ward | 779 / 2092 (37.2%) |
lga | 42 / 125 (33.6%) |
district_code | 2 / 20 (20%) |
region | 4 / 21 (19%) |
The statistics are very similar, though not identical, to those derived for the amount_tsh and gps_height variables, and once again suggest the possibility of a systemic data collection issue on the part of the surveying firm that collected the data. Furthermore, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.
Analysis of the non-zero population figures reveals a highly right-skewed distribution, with a median value of 150 and a maximum value of 30,500.
summary(tanz$population[tanz$population > 0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 40.0 150.0 281.1 324.0 30500.0
boxplot(tanz$population[tanz$population > 0], col = "yellow")
hist(tanz$population[tanz$population > 0], col = "yellow")
The summary statistics and plots suggest that the vast majority of the pumps are located in the vicinity of small villages comprised of less than 500 inhabitants. This suggests that it may be feasible to impute the missing population values for predictive modeling purposes via use of the median. However, such an approach may have an adverse effect on the distribution of the variable.
Barplots for the non-zero population
values of less than 1000 show no obvious relationship between population and the likelihood of a pump being functional
. However, pumps located in areas with relatively larger populations do appear to be somewhat more likely to have a status of functional needs repair
than are pumps located in less densely populated areas. Finally, pumps located in relatively higher population areas appear to be somewhat less likely to be non functional
than are pumps located in areas with relatively smaller populations.
Longitude and latitude coordinates are provided for each pump within the data set. Summary statistics and plots indicate that at least a portion of the coordinates provided may be invalid, as indicated by the zero and near-zero values found in the both variables.
summary(tanz$longitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 33.09 34.91 34.08 37.18 40.35
boxplot(tanz$longitude, col = "yellow")
hist(tanz$longitude, col = 'yellow')
summary(tanz$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -11.649 -8.541 -5.022 -5.706 -3.326 0.000
boxplot(tanz$latitude, col = "yellow")
hist(tanz$latitude, col = 'yellow')
Further analysis finds 1812 zero values within the longitude variable and 1819 values less than -1 for the latitude variable.
# longitude zero values
length(tanz$longitude[tanz$longitude == 0])
## [1] 1812
# latitude zero values
length(tanz$latitude[tanz$latitude > -1])
## [1] 1819
A cross referencing of the two variables shows that every instance of a zero longitude value does in fact correspond with an instance of a (< -1) latitude value:
nrow(subset(tanz, latitude > -1 & longitude == 0))
## [1] 1812
The fact that these values appear to be coincident strongly suggests that they are simply invalid data values. However, for each record within the data set we are provided with a variety of other geographical indicators, including ward, district_code, region, and basin. Given the relatively small number of invalid longitude and latitude values present within the data set, a reasonable approach to their imputation could be based on calculating the mean or median longitude and latitude values for a data record’s ward or region and using those values as the imputed values for the missing data.
Barplots for the non-zero longitude
values indicate that pumps located between 34 and 38 degrees longitude are more likley to be functional
than are pumps at other longitudes, while those located between 30 and 31 degrees longitude are much more likely to have a status of functional needs repair
than are pumps located at other longitudes. Finally, we see that longitude
can be somewhat indicative of how likely a pump is to be non functional
. Therefore, the longitude
variable appears to offer a fair amount of predictive value.
Barplots for latitude
values of less than (-1) indicate that pumps at some latitudes are much more likely to be either functional
or functional needs repair
than are pumps at other latitudes. Therefore, the latitude
variable appears to offer a fair amount of predictive value.
No clear explanation is provided regarding what the num_private variable is meant to represent. Summary statistics and plots show a heavily right-skewed distribution with zero values dominating the variable throughout the data set.
summary(tanz$num_private)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4741 0.0000 1776.0000
boxplot(as.numeric(tanz$num_private), col = "yellow")
hist(as.numeric(tanz$num_private), col = 'yellow')
In fact, analysis reveals that only 757 of the data set’s 59,400 num_private values are non-zero:
length(tanz$num_private[which(tanz$num_private > 0)])
## [1] 757
Unfortunately, the lack of an explanation of the meaning of the variable means we have no way of knowing whether the large number of zero values represent legitimate data values or simply serve as an indicator of missing data.
The variable is comprised of a total of 65 distinct values. The number of pumps pertaining to the ten most frequently occurring of those values is summarized below.
# count distinct num_private = 65
length(unique(tanz$num_private))
## [1] 65
# calc number of wells per wpt_name
c_npriv <- arrange(summarise(group_by(tanz, num_private),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_npriv, n = 10)
## # A tibble: 10 x 2
## num_private TotalWells
## <int> <int>
## 1 0 58643
## 2 6 81
## 3 1 73
## 4 5 46
## 5 8 46
## 6 32 40
## 7 45 36
## 8 15 35
## 9 39 30
## 10 93 28
The funder variable represents the name of the organization that funded the installation of a given pump. A total of 1,898 distinct values for the variable are found within the data set, along with 3,635 missing values. The top 20 funders by pump count are shown in the table below.
# count distinct funder = 1898, 3635 NA
length(unique(tanz$funder))
## [1] 1898
# calc number of wells per funder
c_funders <- arrange(summarise(group_by(tanz, funder),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_funders, n = 20)
## # A tibble: 20 x 2
## funder TotalWells
## <chr> <int>
## 1 Government Of Tanzania 9084
## 2 <NA> 3635
## 3 Danida 3114
## 4 Hesawa 2202
## 5 Rwssp 1374
## 6 World Bank 1349
## 7 Kkkt 1287
## 8 World Vision 1246
## 9 Unicef 1057
## 10 Tasaf 877
## 11 District Council 843
## 12 Dhv 829
## 13 Private Individual 826
## 14 Dwsp 811
## 15 0 777
## 16 Norad 765
## 17 Germany Republi 610
## 18 Tcrs 602
## 19 Ministry Of Water 590
## 20 Water 583
As we can see in the small sample of funder values shown above, there appears to be a lack of consistency in the application of funder names throughout the data set. For example, in the table above we note the presence of “Government of Tanzania”, “Ministry of Water”, and “Water”. It is quite possible that each of these values refer to the Tanzanian government. Such inconsistency may be unnecessarily increasing the number of distinct funder values while simultaneously dilluting the accuracy of the data records. We will explore whether it is possibile to address such issues as part of our Data Preparation efforts.
The installer variable represents the name of the organization that installed a given pump. A total of 2,146 distinct values for the variable are found within the data set, along with 3,655 missing values. The top 20 installers by pump count are shown in the table below.
# count distinct installer = 2146, 3655 NA
length(unique(tanz$installer))
## [1] 2146
# calc number of wells per installer
c_installers <- arrange(summarise(group_by(tanz, installer),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_installers, n = 20)
## # A tibble: 20 x 2
## installer TotalWells
## <chr> <int>
## 1 DWE 17402
## 2 <NA> 3655
## 3 Government 1825
## 4 RWE 1206
## 5 Commu 1060
## 6 DANIDA 1050
## 7 KKKT 898
## 8 Hesawa 840
## 9 0 777
## 10 TCRS 707
## 11 Central government 622
## 12 CES 610
## 13 Community 553
## 14 DANID 552
## 15 District Council 551
## 16 HESAWA 539
## 17 LGA 408
## 18 World vision 408
## 19 WEDECO 397
## 20 TASAF 396
As with the funder variable, there appears to be a lack of consistency in the application of installer names throughout the data set. For example, in the table above we note the presence of “”Government of Tanzania“,”Ministry of Water“, and”Water“”Government" and “Central Government”, as well as both “Commu” and “Community”. It is quite possible that each of these pairs values refer to a single installer. As with the funder variable, We will explore whether it is possibile to address such issues as part of our Data Preparation efforts.
A cursory review of the data set indicates that many instances of missing data within the funder variable are coincident with missing values for the installer variable. In fact, we find that 3,582 of the 3,635 instances of funder = NA occur when the installer variable is also unknown.
f_i <- subset(tanz, is.na(tanz$funder) & is.na(tanz$installer))
nrow(f_i)
## [1] 3582
The fact that nearly all of the missing funder and installer values are coincident may preclude any imputation for missing values for the two variables.
The wpt_name variable represents a name that has been assigned to given waterpoint. A total of 37,400 distinct values for the variable are found within the data set, along with 3,563 instances where no defined name could be identified.
# count distinct wpt_name = 37400, 3563 = none (which is not the same as NA)
length(unique(tanz$wpt_name))
## [1] 37400
# calc number of wells per wpt_name
c_wptn <- arrange(summarise(group_by(tanz, wpt_name),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_wptn, n = 20)
## # A tibble: 20 x 2
## wpt_name TotalWells
## <chr> <int>
## 1 none 3563
## 2 Shuleni 1748
## 3 Zahanati 830
## 4 Msikitini 535
## 5 Kanisani 323
## 6 Bombani 271
## 7 Sokoni 260
## 8 Ofisini 254
## 9 School 208
## 10 Shule Ya Msingi 199
## 11 Shule 152
## 12 Sekondari 146
## 13 Muungano 133
## 14 Mkombozi 111
## 15 Madukani 104
## 16 Hospital 94
## 17 Mbugani 94
## 18 Upendo 93
## 19 Kituo Cha Afya 90
## 20 Mkuyuni 88
Given the large number of possible wpt_name values as a proportion of the total records contained within the data set (37400 / 59400), the wpt_name variable likely offers limited, if any, value for predictive modeling purposes.
The basin variable represents the name of the geographic water basin within which a given pump is located. The summary statistics shown below indicate a total of 9 distinct water basins within the data set, with each record having a valid value. The number of pumps found per basin is detailed in the table below.
# count distinct basin = 9
length(unique(tanz$basin))
## [1] 9
# calc number of wells per basin
c_basin <- arrange(summarise(group_by(tanz, basin),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_basin, n = 10)
## # A tibble: 9 x 2
## basin TotalWells
## <chr> <int>
## 1 Lake Victoria 10248
## 2 Pangani 8940
## 3 Rufiji 7976
## 4 Internal 7785
## 5 Lake Tanganyika 6432
## 6 Wami / Ruvu 5987
## 7 Lake Nyasa 5085
## 8 Ruvuma / Southern Coast 4493
## 9 Lake Rukwa 2454
Plots for the basin variable show a wide variance in the proportions of functional pumps between basins: Three (Lake Nyasa, Pangani, Rufiji) have more than 60% of their pumps functioning while the Southern Coast basin has only 37.2% of its pumps functioning. In fact, five of the nine basins fail to achieve the overall “functional” metric of 54.3%. The underlying reasons for these disparities should be explored to determine whether or not the underperforming basins can be improved via the application of pump management methods/practices being employed within the outperforming basins.
The subvillage variable represents the name of the geographic subvillage within which a given pump is located. Summary statistics indicate a total of 19,288 distinct subvillage values within the data set, with 371 missing values. The 20 subvillages with the largest number of pumps are shown below.
# count distinct subvillage = 19288, 371 NA's
length(unique(tanz$subvillage))
## [1] 19288
# calc number of wells per subvillage
c_subv <- arrange(summarise(group_by(tanz, subvillage),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_subv, n = 20)
## # A tibble: 20 x 2
## subvillage TotalWells
## <chr> <int>
## 1 Madukani 508
## 2 Shuleni 506
## 3 Majengo 502
## 4 Kati 373
## 5 <NA> 371
## 6 Mtakuja 262
## 7 Sokoni 232
## 8 M 187
## 9 Muungano 172
## 10 Mbuyuni 164
## 11 Mlimani 152
## 12 Songambele 147
## 13 Miembeni 134
## 14 Msikitini 134
## 15 1 132
## 16 Kibaoni 114
## 17 Kanisani 111
## 18 I 109
## 19 Mapinduzi 109
## 20 Mjimwema 108
Given the large number of possible subvillage values as a proportion of the total records contained within the data set (19288 / 59400), the variable may offer limited value for predictive modeling purposes.
The region variable represents the name of the geographic region in Tanzania within which a given pump is located. The summary statistics shown below indicate a total of 21 distinct regions within the data set, with each record having a valid value. The 20 regions having the largest number of pump installations are detailed in the table below.
# count distinct region = 21
length(unique(tanz$region))
## [1] 21
# calc number of wells per region
c_region <- arrange(summarise(group_by(tanz, region),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_region, n = 20)
## # A tibble: 20 x 2
## region TotalWells
## <chr> <int>
## 1 Iringa 5294
## 2 Shinyanga 4982
## 3 Mbeya 4639
## 4 Kilimanjaro 4379
## 5 Morogoro 4006
## 6 Arusha 3350
## 7 Kagera 3316
## 8 Mwanza 3102
## 9 Kigoma 2816
## 10 Ruvuma 2640
## 11 Pwani 2635
## 12 Tanga 2547
## 13 Dodoma 2201
## 14 Singida 2093
## 15 Mara 1969
## 16 Tabora 1959
## 17 Rukwa 1808
## 18 Mtwara 1730
## 19 Manyara 1583
## 20 Lindi 1546
The functional pump disparities found within the various regions exceed those seen within the basin variable: we see one region (Iringa) having more than 78.2% of its pumps functional while the region of Lindi has only 29.8% functional, a nearly 50 point difference. In fact, of the 21 regions listed, only 9 exceed the overall functional metric of 54.3%. Furthermore, we find the region of Kigoma has 21.4% of its pumps as “functional needs repair”, a percentage that greatly exceeds all other regions.
Analysis of individual regions reveals that the four specific regions are found to have no non-zero values for the following variables::
The four regions are:
# Dodoma
unique(tanz$population[tanz$region == 'Dodoma'])
unique(tanz$construction_year[tanz$region == 'Dodoma'])
unique(tanz$gps_height[tanz$region == 'Dodoma'])
unique(tanz$amount_tsh[tanz$region == 'Dodoma'])
unique(tanz$num_private[tanz$region == 'Dodoma'])
# Kagera
unique(tanz$population[tanz$region == 'Kagera'])
unique(tanz$construction_year[tanz$region == 'Kagera'])
unique(tanz$gps_height[tanz$region == 'Kagera'])
unique(tanz$amount_tsh[tanz$region == 'Kagera'])
unique(tanz$num_private[tanz$region == 'Kagera'])
# Mbeya
unique(tanz$population[tanz$region == 'Mbeya'])
unique(tanz$construction_year[tanz$region == 'Mbeya'])
unique(tanz$gps_height[tanz$region == 'Mbeya'])
unique(tanz$amount_tsh[tanz$region == 'Mbeya'])
unique(tanz$num_private[tanz$region == 'Mbeya'])
# Tabora
unique(tanz$population[tanz$region == 'Tabora'])
unique(tanz$construction_year[tanz$region == 'Tabora'])
unique(tanz$gps_height[tanz$region == 'Tabora'])
unique(tanz$amount_tsh[tanz$region == 'Tabora'])
unique(tanz$num_private[tanz$region == 'Tabora'])
As shown below, these 4 regions comprise 12,115 of the 59,400 records in the data set (20.39%), including 27 of the unique lga’s, 514 of the unique wards and 4644 of the unique subvillages.
The 12,115 records covered by these regions represent approximately 60% of the zero values found within the gps_height (12,115 / 20,438), population (12,115 / 21,381), and construction_year (12,115 / 20,709) variables.
The lack of non-zero values throughout the four indicated regions for the five variables listed above makes it highly unlikely that we will be able to effectively derive imputed values for the zero values of those five variables using the geographical indicators provided within the data set.
# count num recs w/ these 4 regions = 12115 => 20.39% of all records
all_z <- nrow(subset(tanz, region == 'Tabora' | region == 'Mbeya' | region == 'Kagera' |
region == 'Dodoma'))
all_z/nrow(tanz)
## [1] 0.2039562
# count all records in these regions
all_zdf <- subset(tanz, region == 'Tabora' | region == 'Mbeya' | region == 'Kagera' |
region == 'Dodoma')
length(unique(all_zdf$district_code)) # 10 unique district codes
## [1] 10
length(unique(all_zdf$ward)) # 514 unique wards
## [1] 514
length(unique(all_zdf$subvillage)) # 4644 unique subvillages
## [1] 4644
length(unique(all_zdf$lga)) # 27 unique lga's
## [1] 27
The region_code variable provides an integer code for the Tanzanian region within which a given pump is located. The summary statistics shown below indicate a total of 27 distinct region codes within the data set, with each record having a valid value. The 20 region codes having the largest number of pump installations are summarized in the table below.
## [1] 27
## # A tibble: 20 x 2
## region_code TotalWells
## <int> <int>
## 1 11 5300
## 2 17 5011
## 3 12 4639
## 4 3 4379
## 5 5 4040
## 6 18 3324
## 7 19 3047
## 8 2 3024
## 9 16 2816
## 10 10 2640
## 11 4 2513
## 12 1 2201
## 13 13 2093
## 14 14 1979
## 15 20 1969
## 16 15 1808
## 17 6 1609
## 18 21 1583
## 19 80 1238
## 20 60 1025
It is unclear why we find 27 region_code values while there are only 20 possible values for the region variable. It is possibile that some subset of the region codes have either simply been entered incorrectly or were deliberately entered incorrectly due to uncertainty over the correct region_code value to apply to a given record.
Plotting the status of pumps relative to each value of the region_code variable shows even greater disparities between geographic regions for functional pumps, from a low of 8.7% in Region 8 to nearly 97% in Region 24. Region 16 shows an unusually large 21.4% of its pumps having a status of “functional needs repair” while Region 40’s single pump (100% of the pumps in that region) is not functioning.
The construction_year variable indicates the year in which a given pump was installed. The summary statistics shown below indicate a total of 54 distinct non-zero construction year values ranging from 1960 through 2013, with 20,709 records lacking a valid value. The year values having the largest number of pump installations are summarized in the table below.
# get range of non-zero values for variable
summary(tanz$construction_year[which(tanz$construction_year > 0)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1960 1987 2000 1997 2008 2013
# count distinct construction_year = 55, 20,709 NA's
length(unique(tanz$construction_year))
## [1] 55
# calc number of wells per construction_year
c_const_y <- arrange(summarise(group_by(tanz, construction_year),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_const_y, n = 20)
## # A tibble: 20 x 2
## construction_year TotalWells
## <int> <int>
## 1 0 20709
## 2 2010 2645
## 3 2008 2613
## 4 2009 2533
## 5 2000 2091
## 6 2007 1587
## 7 2006 1471
## 8 2003 1286
## 9 2011 1256
## 10 2004 1123
## 11 2012 1084
## 12 2002 1075
## 13 1978 1037
## 14 1995 1014
## 15 2005 1011
## 16 1999 979
## 17 1998 966
## 18 1990 954
## 19 1985 945
## 20 1980 811
Plotting the status of pumps by construction year allows us to conclude that relatively newer pumps are generally more likely to be functional than are relatively older pumps. While this is rather unsurprising from an intuitive perspective, confirmation of such intuition can prove to be useful when crafting a predictive model.
As with the amount_tsh, gps_height, and population variables, we find a significant amount of clustering relative to the subvillage, ward, lga, district_code, and region geographic variables, as shown in the table below.
Geographic Var. | Num of Recs w No c_yr Value |
---|---|
subvillage | 7208 / 19288 (37.3%) |
ward | 749 / 2092 (35.8%) |
lga | 41 / 125 (32.8%) |
district_code | 4 / 20 (20%) |
region | 4 / 21 (19%) |
The statistics are very similar, though not identical, to those derived for the amount_tsh, gps_height and population variables, and once again suggest the possibility of a systemic data collection issue on the part of the surveying firm that collected the data. Furthermore, the lack of accurate data values across so many geographic indicators strongly suggests that imputing the missing data values via geographic indicators will not be possible.
The district_code variable provides an integer coding of the Tanzanian district within which a given pump is located. The summary statistics shown below indicate a total of 20 distinct district codes within the data set, with each record having a valid value. The number of pumps per district code is summarized in the table below.
# count distinct district_code = 20
length(unique(tanz$district_code))
## [1] 20
# calc number of wells per basin
c_distc <- arrange(summarise(group_by(tanz, district_code),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_distc, n = 20)
## # A tibble: 20 x 2
## district_code TotalWells
## <int> <int>
## 1 1 12203
## 2 2 11173
## 3 3 9998
## 4 4 8999
## 5 5 4356
## 6 6 4074
## 7 7 3343
## 8 8 1043
## 9 30 995
## 10 33 874
## 11 53 745
## 12 43 505
## 13 13 391
## 14 23 293
## 15 63 195
## 16 62 109
## 17 60 63
## 18 0 23
## 19 80 12
## 20 67 6
The district_code variable is yet another geographic label, and as with the others we’ve analyzed (e.g., basin, region, etc.) we see a wide disparity in the percentage of functional pumps, ranging from a low of 17.4% for district 0 to a hight of 83.3% in district 67. Only 9 of the 20 districts exceed the overall 54.3% functional pump metric. Furthermore we see that six districts’ “non functional” pump percentages exceed 60%.
Further analysis reveals that values of the district_code variable are not unique across regions, as can be seen in the table below where district_code values are aggregated by region. In fact, we find a total of 132 region / district_code combinations within the data set
# aggregate district_code values by region
c_reg_dist <- arrange(summarise(group_by(tanz, region, district_code),
TotalWells = length(unique(id)) ) )
head(c_reg_dist, n = 20)
## Source: local data frame [20 x 3]
## Groups: region [4]
##
## # A tibble: 20 x 3
## region district_code TotalWells
## <chr> <int> <int>
## 1 Arusha 1 189
## 2 Arusha 2 1206
## 3 Arusha 3 109
## 4 Arusha 5 201
## 5 Arusha 6 310
## 6 Arusha 7 1009
## 7 Arusha 30 326
## 8 Dar es Salaam 1 93
## 9 Dar es Salaam 2 497
## 10 Dar es Salaam 3 215
## 11 Dodoma 0 23
## 12 Dodoma 1 888
## 13 Dodoma 3 361
## 14 Dodoma 4 347
## 15 Dodoma 5 358
## 16 Dodoma 6 224
## 17 Iringa 1 728
## 18 Iringa 2 530
## 19 Iringa 3 650
## 20 Iringa 4 2473
# count number of distinct region / district_code pairs
nrow(c_reg_dist)
## [1] 132
This dependence on the region variable indicates that the district_code variable itself is unlikely to serve as a valid predictor of a pump’s status. As such, the variable appears unlikely to be independently useful for predictive modeling purposes.
The public_meeting variable is essentially a binary variable with 3,334 instances of the variable’s value being unknown. A summary of the variable’s values is shown below.
# count public_meeting = 3
length(unique(tanz$public_meeting))
## [1] 3
# calc number of wells per ward
c_pubmeet <- arrange(summarise(group_by(tanz, public_meeting),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_pubmeet, n = 10)
## # A tibble: 3 x 2
## public_meeting TotalWells
## <lgl> <int>
## 1 TRUE 51011
## 2 FALSE 5055
## 3 NA 3334
The plots below show that, in general, pumps having a public meeting value of TRUE are more likely to be functional than those that have not.
In an attempt to determine whether the lack of a public_meeting value is related to the geographical location of a pump, we aggregate the data by ward and calculate the following:
the total number of pumps per ward
the number of pumps within a ward having a public_meeting value of TRUE
the percentage of pumps within a ward having a public_meeting value of TRUE
the number of pumps within a ward having an unknown public_meeting value
the percentage of pumps within a ward having an uknown public_meeting
# c_ward contains the total number of wells by ward
ward_tmp <- tanz[, c("id", "ward", "public_meeting")]
# fill in NA's so data can be grouped properly
ward_tmp$public_meeting[is.na(ward_tmp$public_meeting)] <- "unknown"
# aggregate the data
ward_meet_true <- arrange(summarise(group_by(ward_tmp, ward),
TotalWells = length(unique(id)),
MeetUnk = sum(public_meeting == "unknown"),
MeetTrue = sum(public_meeting == "TRUE") ), desc(TotalWells) )
# calculate percentage of TRUE and unknown meeting values
ward_meet_true$perc <- ward_meet_true$MeetTrue / ward_meet_true$TotalWells
ward_meet_true$unk_perc <- ward_meet_true$MeetUnk / ward_meet_true$TotalWells
head(ward_meet_true, n = 20)
## # A tibble: 20 x 6
## ward TotalWells MeetUnk MeetTrue perc unk_perc
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Igosi 307 0 307 1.00000000 0.000000000
## 2 Imalinyi 252 0 252 1.00000000 0.000000000
## 3 Siha Kati 232 0 232 1.00000000 0.000000000
## 4 Mdandu 231 0 231 1.00000000 0.000000000
## 5 Nduruma 217 1 216 0.99539171 0.004608295
## 6 Kitunda 203 193 10 0.04926108 0.950738916
## 7 Mishamo 203 4 198 0.97536946 0.019704433
## 8 Msindo 201 0 200 0.99502488 0.000000000
## 9 Chalinze 196 0 66 0.33673469 0.000000000
## 10 Maji ya Chai 190 1 189 0.99473684 0.005263158
## 11 Usuka 187 0 187 1.00000000 0.000000000
## 12 Ngarenanyuki 172 8 164 0.95348837 0.046511628
## 13 Chanika 171 145 24 0.14035088 0.847953216
## 14 Vikindu 162 0 161 0.99382716 0.000000000
## 15 Mtwango 153 0 153 1.00000000 0.000000000
## 16 Matola 145 0 124 0.85517241 0.000000000
## 17 Zinga/Ikerege 141 1 127 0.90070922 0.007092199
## 18 Maramba 139 0 139 1.00000000 0.000000000
## 19 Wanging'ombe 139 0 139 1.00000000 0.000000000
## 20 Itete 137 0 137 1.00000000 0.000000000
Summary statistics for the percentage of unknown public_meeting values per ward indicate that one or more ward’s within the data set have no valid public_meeting values:
# shows some wards have 100% unknowns
summary(ward_meet_true$unk_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0612 0.0000 1.0000
In fact, we find a total of 43 such wards within the data set, as shown below.
# find wards with 100% unknowns = 43 wards
nrow(subset(ward_meet_true, unk_perc == 1))
## [1] 43
What this tells us is that if we wish to impute public_meeting values for the data records lacking valid values, we should not attempt to do so solely on the basis of a record’s ward value since a significant number of wards contain no valid public_meeting values from which a reasonable imputation for the public_meeting indicator can be derived.
We explore the lga variable using the same aggregation techniques described above for the ward variable,
# memory cleanup
rm(ward_tmp)
# subset for fast aggregation
lga_tmp <- tanz[, c("id", "lga", "public_meeting")]
# fill in NA's so data can be grouped properly
lga_tmp$public_meeting[is.na(lga_tmp$public_meeting)] <- "unknown"
# aggregate the data
lga_meet_true <- arrange(summarise(group_by(lga_tmp, lga),
TotalWells = length(unique(id)),
MeetUnk = sum(public_meeting == "unknown"),
MeetTrue = sum(public_meeting == "TRUE") ), desc(TotalWells) )
# calculate percentage of TRUE and unknown meeting values
lga_meet_true$perc <- lga_meet_true$MeetTrue / lga_meet_true$TotalWells
lga_meet_true$unk_perc <- lga_meet_true$MeetUnk / lga_meet_true$TotalWells
head(lga_meet_true, n = 20)
## # A tibble: 20 x 6
## lga TotalWells MeetUnk MeetTrue perc unk_perc
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Njombe 2503 0 2462 0.9836197 0.000000000
## 2 Arusha Rural 1252 4 1204 0.9616613 0.003194888
## 3 Moshi Rural 1251 17 868 0.6938449 0.013589129
## 4 Bariadi 1177 306 644 0.5471538 0.259983008
## 5 Rungwe 1106 17 1084 0.9801085 0.015370705
## 6 Kilosa 1094 24 1058 0.9670932 0.021937843
## 7 Kasulu 1047 2 1045 0.9980898 0.001910220
## 8 Mbozi 1034 5 628 0.6073501 0.004835590
## 9 Meru 1009 19 990 0.9811695 0.018830525
## 10 Bagamoyo 997 4 685 0.6870612 0.004012036
## 11 Singida Rural 995 115 880 0.8844221 0.115577889
## 12 Kilombero 959 3 950 0.9906152 0.003128259
## 13 Same 877 3 847 0.9657925 0.003420753
## 14 Kibondo 874 21 853 0.9759725 0.024027460
## 15 Kyela 859 0 857 0.9976717 0.000000000
## 16 Kahama 836 116 717 0.8576555 0.138755981
## 17 Kigoma Rural 824 6 735 0.8919903 0.007281553
## 18 Magu 824 55 756 0.9174757 0.066747573
## 19 Maswa 809 9 798 0.9864030 0.011124845
## 20 Karagwe 771 0 768 0.9961089 0.000000000
Summary statistics for the percentage of unknown public_meeting values per lga indicate that all lga’s within the data set have at least some valid public_meeting values:
# shows no lga's have 100% unknowns
summary(lga_meet_true$unk_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.003128 0.054246 0.028169 0.925553
As such, if we wish to impute public_meeting values for the data records lacking valid values, use of a record’s ward and lga values together may allow us to infer whether or not the missing value should be TRUE or FALSE based on the percentage of pumps within the ward or lga sharing a TRUE value. For example, if the percentage of TRUE public_meeting values within a ward (or lga if the ward lacks any valid public_meeting values) exceeds 50%, it would seem reasonable to assign a value of TRUE to any missing public_meeting value pertaining to that ward (or lga). This approach is applied in the Data Preparation section of this document.
The lga variable provides another Tanzania-specific geographic identifier within which a given pump is located. The summary statistics shown below indicate a total of 125 lga values within the data set, with each record having a valid value. The 20 lga’s having the largest number of pump installations are summarized in the table below.
# count distinct lga = 125
length(unique(tanz$lga))
## [1] 125
# calc number of wells per lga
c_lga <- arrange(summarise(group_by(tanz, lga),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_lga, n = 20)
## # A tibble: 20 x 2
## lga TotalWells
## <chr> <int>
## 1 Njombe 2503
## 2 Arusha Rural 1252
## 3 Moshi Rural 1251
## 4 Bariadi 1177
## 5 Rungwe 1106
## 6 Kilosa 1094
## 7 Kasulu 1047
## 8 Mbozi 1034
## 9 Meru 1009
## 10 Bagamoyo 997
## 11 Singida Rural 995
## 12 Kilombero 959
## 13 Same 877
## 14 Kibondo 874
## 15 Kyela 859
## 16 Kahama 836
## 17 Kigoma Rural 824
## 18 Magu 824
## 19 Maswa 809
## 20 Karagwe 771
The ward variable represents the name of the geographic ward within which a given pump is located. The summary statistics shown below indicate a total of 2092 distinct ward values within the data set, with each record having a valid value. The 20 wards having the largest number of pump installations are summarized in the table below.
# count distinct ward = 2092
length(unique(tanz$ward))
## [1] 2092
# calc number of wells per ward
c_ward <- arrange(summarise(group_by(tanz, ward),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_ward, n = 20)
## # A tibble: 20 x 2
## ward TotalWells
## <chr> <int>
## 1 Igosi 307
## 2 Imalinyi 252
## 3 Siha Kati 232
## 4 Mdandu 231
## 5 Nduruma 217
## 6 Kitunda 203
## 7 Mishamo 203
## 8 Msindo 201
## 9 Chalinze 196
## 10 Maji ya Chai 190
## 11 Usuka 187
## 12 Ngarenanyuki 172
## 13 Chanika 171
## 14 Vikindu 162
## 15 Mtwango 153
## 16 Matola 145
## 17 Zinga/Ikerege 141
## 18 Maramba 139
## 19 Wanging'ombe 139
## 20 Itete 137
The scheme_management variable represents the type of organization responsible for managing a given pump. The summary statistics shown below indicate a total of 13 distinct scheme management values within the data set, with 3,877 records having no valid value. The number of pump installations per type of scheme management is summarized in the table below.
# count distinct scheme_management = 13, 3877 NA's
length(unique(tanz$scheme_management))
## [1] 13
# calc number of wells per scheme_management
c_scheme_m <- arrange(summarise(group_by(tanz, scheme_management),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_scheme_m, n = 20)
## # A tibble: 13 x 2
## scheme_management TotalWells
## <chr> <int>
## 1 VWC 36793
## 2 WUG 5206
## 3 <NA> 3877
## 4 Water authority 3153
## 5 WUA 2883
## 6 Water Board 2748
## 7 Parastatal 1680
## 8 Private operator 1063
## 9 Company 1061
## 10 Other 766
## 11 SWC 97
## 12 Trust 72
## 13 None 1
Plots for the scheme_management variable show that (excluding “None” since it applies to only a single pump) that while the Water Board scheme seems to yield the largest percentage of functional pumps, that scheme is used for a relatively small number (2,748) of the 59,400 pumps. By contrast, the VWC scheme is used for nearly 36,800 pumps. So while a majority of the schemes exceed the 54.3% overall functional metric, the vast majority of the pumps themselves do not benefit from those schemes.
Analysis of the scheme_management variable’s missing values reveals the following:
There is no single management scheme that is universally applicable to any given extraction_type or waterpoint_type or for any of the composites thereof (e.g., extraction_type_group, etc.)
Instances of missing scheme_management values are not correlated in any way with missing public_meeting or permit values, despite the fact that the missing value counts for each of those variables are somewhat similar.
Aggregating by geographic variables also fails to produce even a single valid scheme_management value for many subvillages and wards. At the lga and region levels we can generally find at least several valid values per lga or region, though some lga’s contain only a single valid value.
Aggregating at the ward level shows that 75 wards lack any valid scheme_management value, and those 75 wards cover 1,413 (36.4%) of the 3,877 missing scheme_management values.
At the lga level it appears as though nearly every lga has at least one valid scheme_management value. However, in some lga’s the number of unknowns is an order of magnitude greater than the number of known values. For example, in the lga of Kahama there are 736 pumps having an unknown scheme_management value and only 100 pumps with known values that are spread across several distinct scheme_management categories.
One possible approach to dealing with the missing values would be to assign the most frequently occurring valid scheme_management value found with the corresponding ward or lga as the imputed value. However, that approach seems inappropriate since many wards and lga’s are dominated by unknown values: in such instances the most frequently occurring valid value will likely represent a relatively small percentage of the total number of data records for a ward or lga, and assigning that value to a missing scheme_management record would likely introduce far too much inappropriate statistical bias within the data set. As such, it may be preferable to simply mark the missing values as “unknown” to prevent the introduction of such bias.
The scheme_name variable represents the name of the party responsible for managing a given pump. The summary statistics shown below indicate a total of 2,697 distinct scheme name values within the data set, with 28,166 records having no valid value. The 20 scheme names having the largest number of pump installations are summarized in the table below.
# count distinct scheme_name = 2697, 28,166 NA's
length(unique(tanz$scheme_name))
## [1] 2697
# calc number of wells per scheme_name
c_scheme_n <- arrange(summarise(group_by(tanz, scheme_name),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_scheme_n, n = 20)
## # A tibble: 20 x 2
## scheme_name TotalWells
## <chr> <int>
## 1 <NA> 28166
## 2 K 682
## 3 None 644
## 4 Borehole 546
## 5 Chalinze wate 405
## 6 M 400
## 7 DANIDA 379
## 8 Government 320
## 9 Ngana water supplied scheme 270
## 10 wanging'ombe water supply s 261
## 11 wanging'ombe supply scheme 234
## 12 Bagamoyo wate 229
## 13 I 229
## 14 Uroki-Bomang'ombe water sup 209
## 15 N 204
## 16 Kirua kahe gravity water supply trust 193
## 17 Machumba estate pipe line 185
## 18 Makwale water supplied sche 166
## 19 Kijiji 161
## 20 S 154
The large percentage of records lacking a defined scheme name combined with the lack of a definitive list of valid scheme names makes it unlikely that imputation of the missing values will prove feasible. As such, the scheme_name variable appears unlikely to be useful for predictive modeling purposes.
The permit variable is a binary True/False indicator. Summary statistics indicate that 28,166 records within the data set have no valid permit value.
# count distinct permit = 3, 28,166 NA's
length(unique(tanz$permit))
## [1] 3
# calc number of wells per scheme_name
c_permit <- arrange(summarise(group_by(tanz, permit),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_permit, n = 10)
## # A tibble: 3 x 2
## permit TotalWells
## <lgl> <int>
## 1 TRUE 38852
## 2 FALSE 17492
## 3 NA 3056
The plots below show that pumps indicated as having a permit are more likely to be functional than are those that did not require a permit.
In an attempt to determine whether the lack of a permit value is related to the geographical location of a pump, we aggregate the data by ward and calculate the following:
the total number of pumps per ward
the number of pumps within a ward having a permit value of TRUE
the percentage of pumps within a ward having a permit value of TRUE
the number of pumps within a ward having an unknown permit value
the percentage of pumps within a ward having an uknown permit
# c_ward contains the total number of wells by ward
ward_tmp <- tanz[, c("id", "ward", "permit")]
# fill in NA's so data can be grouped properly
ward_tmp$permit[is.na(ward_tmp$permit)] <- "unknown"
# aggregate the data
ward_perm_true <- arrange(summarise(group_by(ward_tmp, ward),
TotalWells = length(unique(id)),
PermUnk = sum(permit == "unknown"),
PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )
# calculate percentage of TRUE and unknown meeting values
ward_perm_true$perc <- ward_perm_true$PermTrue / ward_perm_true$TotalWells
ward_perm_true$unk_perc <- ward_perm_true$PermUnk / ward_perm_true$TotalWells
head(ward_perm_true, n = 20)
## # A tibble: 20 x 6
## ward TotalWells PermUnk PermTrue perc unk_perc
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Igosi 307 0 222 0.72312704 0.000000000
## 2 Imalinyi 252 2 250 0.99206349 0.007936508
## 3 Siha Kati 232 0 232 1.00000000 0.000000000
## 4 Mdandu 231 0 231 1.00000000 0.000000000
## 5 Nduruma 217 0 217 1.00000000 0.000000000
## 6 Kitunda 203 0 8 0.03940887 0.000000000
## 7 Mishamo 203 0 203 1.00000000 0.000000000
## 8 Msindo 201 1 134 0.66666667 0.004975124
## 9 Chalinze 196 0 195 0.99489796 0.000000000
## 10 Maji ya Chai 190 0 127 0.66842105 0.000000000
## 11 Usuka 187 0 187 1.00000000 0.000000000
## 12 Ngarenanyuki 172 2 132 0.76744186 0.011627907
## 13 Chanika 171 28 4 0.02339181 0.163742690
## 14 Vikindu 162 0 0 0.00000000 0.000000000
## 15 Mtwango 153 0 71 0.46405229 0.000000000
## 16 Matola 145 0 41 0.28275862 0.000000000
## 17 Zinga/Ikerege 141 0 121 0.85815603 0.000000000
## 18 Maramba 139 0 139 1.00000000 0.000000000
## 19 Wanging'ombe 139 0 139 1.00000000 0.000000000
## 20 Itete 137 89 48 0.35036496 0.649635036
Summary statistics for the percentage of unknown permit values per ward indicate that one or more ward’s within the data set have no valid permit values:
# shows some wards have 100% unknowns
summary(ward_perm_true$unk_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04415 0.00000 1.00000
In fact, we find a total of 73 such wards within the data set, as shown below.
# find wards with 100% unknowns = 73 wards
nrow(subset(ward_perm_true, unk_perc == 1))
## [1] 73
What this tells us is that if we wish to impute permit values for the data records lacking valid values, we should not attempt to do so solely via a record’s ward value since a significant number of wards contain no valid permit values from which a reasonable imputation for the permit indicator can be derived.
As an alternative, we explore the lga variable using the same aggregation techniques described above for the ward variable,
# memory cleanup
rm(ward_tmp)
# subset for fast aggregation
lga_tmp <- tanz[, c("id", "lga", "permit")]
# fill in NA's so data can be grouped properly
lga_tmp$permit[is.na(lga_tmp$permit)] <- "unknown"
# aggregate the data
lga_perm_true <- arrange(summarise(group_by(lga_tmp, lga),
TotalWells = length(unique(id)),
PermUnk = sum(permit == "unknown"),
PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )
# calculate percentage of TRUE and unknown meeting values
lga_perm_true$perc <- lga_perm_true$PermTrue / lga_perm_true$TotalWells
lga_perm_true$unk_perc <- lga_perm_true$PermUnk / lga_perm_true$TotalWells
head(lga_perm_true, n = 20)
## # A tibble: 20 x 6
## lga TotalWells PermUnk PermTrue perc unk_perc
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Njombe 2503 3 1576 0.629644427 0.001198562
## 2 Arusha Rural 1252 53 1199 0.957667732 0.042332268
## 3 Moshi Rural 1251 8 1210 0.967226219 0.006394884
## 4 Bariadi 1177 0 2 0.001699235 0.000000000
## 5 Rungwe 1106 1106 0 0.000000000 1.000000000
## 6 Kilosa 1094 0 1073 0.980804388 0.000000000
## 7 Kasulu 1047 0 751 0.717287488 0.000000000
## 8 Mbozi 1034 0 143 0.138297872 0.000000000
## 9 Meru 1009 3 803 0.795837463 0.002973241
## 10 Bagamoyo 997 0 788 0.790371113 0.000000000
## 11 Singida Rural 995 967 28 0.028140704 0.971859296
## 12 Kilombero 959 0 959 1.000000000 0.000000000
## 13 Same 877 130 739 0.842645382 0.148232611
## 14 Kibondo 874 0 53 0.060640732 0.000000000
## 15 Kyela 859 0 859 1.000000000 0.000000000
## 16 Kahama 836 0 761 0.910287081 0.000000000
## 17 Kigoma Rural 824 0 781 0.947815534 0.000000000
## 18 Magu 824 0 765 0.928398058 0.000000000
## 19 Maswa 809 0 9 0.011124845 0.000000000
## 20 Karagwe 771 0 771 1.000000000 0.000000000
Summary statistics for the percentage of unknown permit values per lga indicate that one or more lga’s within the data set have no valid permit values:
# shows no lga's have 100% unknowns
summary(lga_perm_true$unk_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0417 0.0000 1.0000
In fact, we find a total of 3 such lga’s within the data set, as shown below.
# find wards with 100% unknowns = 3 lga's
nrow(subset(lga_perm_true, unk_perc == 1))
## [1] 3
We now explore the region variable using the same aggregation techniques described above for the ward and lga variables,
# memory cleanup
rm(lga_tmp)
# subset for fast aggregation
reg_tmp <- tanz[, c("id", "region", "permit")]
# fill in NA's so data can be grouped properly
reg_tmp$permit[is.na(reg_tmp$permit)] <- "unknown"
# aggregate the data
reg_perm_true <- arrange(summarise(group_by(reg_tmp, region),
TotalWells = length(unique(id)),
PermUnk = sum(permit == "unknown"),
PermTrue = sum(permit == "TRUE") ), desc(TotalWells) )
# calculate percentage of TRUE and unknown meeting values
reg_perm_true$perc <- reg_perm_true$PermTrue / reg_perm_true$TotalWells
reg_perm_true$unk_perc <- reg_perm_true$PermUnk / reg_perm_true$TotalWells
head(reg_perm_true, n = 20)
## # A tibble: 20 x 6
## region TotalWells PermUnk PermTrue perc unk_perc
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 Iringa 5294 3 3174 0.5995467 0.0005666793
## 2 Shinyanga 4982 0 2364 0.4745082 0.0000000000
## 3 Mbeya 4639 1106 2134 0.4600129 0.2384134512
## 4 Kilimanjaro 4379 138 3796 0.8668646 0.0315140443
## 5 Morogoro 4006 0 3967 0.9902646 0.0000000000
## 6 Arusha 3350 445 2650 0.7910448 0.1328358209
## 7 Kagera 3316 0 2521 0.7602533 0.0000000000
## 8 Mwanza 3102 0 2975 0.9590587 0.0000000000
## 9 Kigoma 2816 0 1656 0.5880682 0.0000000000
## 10 Ruvuma 2640 2 1798 0.6810606 0.0007575758
## 11 Pwani 2635 0 1530 0.5806452 0.0000000000
## 12 Tanga 2547 0 1378 0.5410287 0.0000000000
## 13 Dodoma 2201 0 989 0.4493412 0.0000000000
## 14 Singida 2093 967 805 0.3846154 0.4620162446
## 15 Mara 1969 367 998 0.5068563 0.1863890300
## 16 Tabora 1959 0 1325 0.6763655 0.0000000000
## 17 Rukwa 1808 0 1162 0.6426991 0.0000000000
## 18 Mtwara 1730 0 1311 0.7578035 0.0000000000
## 19 Manyara 1583 0 1506 0.9513582 0.0000000000
## 20 Lindi 1546 0 813 0.5258732 0.0000000000
Summary statistics for the percentage of unknown permit values per region indicate that all region’s within the data set have at least some valid permit values:
# shows no lga's have 100% unknowns
summary(reg_perm_true$unk_perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05178 0.03151 0.46202
As such, if we wish to impute permit values for the data records lacking valid values, use of a record’s lga and region values may allow us to infer whether or not the missing value should be TRUE or FALSE based on the percentage of pumps within the lga or region sharing a TRUE value. For example, if the percentage of TRUE permit values within an lga exceeds 50%, it would seem reasonable to assign a value of TRUE to any missing permit value pertaining to that lga. However, if the specific lga lacks any valid permit values, we can instead make use of the percentage of TRUE permit values within the associated region and apply the same 50% thresholding approach described for use with the lga variable’s values. This approach is applied in the Data Preparation section of this document.
The extraction_type variable represents the method of extraction used by a given pump. The summary statistics shown below indicate a total of 18 distinct extraction_type values within the data set, with each record having a valid value. The number of pump installations per extraction type is summarized in the table below.
# count distinct extraction_type = 18
length(unique(tanz$extraction_type))
## [1] 18
# calc number of wells per extraction_type
c_extract_t <- arrange(summarise(group_by(tanz, extraction_type),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_t, n = 10)
## # A tibble: 10 x 2
## extraction_type TotalWells
## <chr> <int>
## 1 gravity 26780
## 2 nira/tanira 8154
## 3 other 6430
## 4 submersible 4764
## 5 swn 80 3670
## 6 mono 2865
## 7 india mark ii 2400
## 8 afridev 1770
## 9 ksb 1415
## 10 other - rope pump 451
Plots of the extraction_type variable show that seven of the seventeen extraction types have functional metrics that exceed the overall 54.3% benchmark. Of particular interest here is the gravity type since nearly 27,000 of the 59,400 total pumps rely on that approach, a far higher percentage than any of the other extraction types.
The extraction_type_group variable represents a composite of the methods of extraction indicated by the extraction_type variable. The summary statistics shown below indicate a total of 13 distinct extraction_type_group values within the data set, with each record having a valid value. The number of pump installations per extraction type group is summarized in the table below.
# count distinct extraction_type_group = 13
length(unique(tanz$extraction_type_group))
## [1] 13
# calc number of wells per extraction_type
c_extract_tg <- arrange(summarise(group_by(tanz, extraction_type_group),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_tg, n = 20)
## # A tibble: 13 x 2
## extraction_type_group TotalWells
## <chr> <int>
## 1 gravity 26780
## 2 nira/tanira 8154
## 3 other 6430
## 4 submersible 6179
## 5 swn 80 3670
## 6 mono 2865
## 7 india mark ii 2400
## 8 afridev 1770
## 9 rope pump 451
## 10 other handpump 364
## 11 other motorpump 122
## 12 wind-powered 117
## 13 india mark iii 98
The extraction_type_group indicator appears to represent a narrowing of the breadth of values available via the extraction_type variable into various higher-level overall categories. The gravity type group again accounts for the largest percentage of total pumps.
The extraction_type_class variable represents a composite of the methods of extraction indicated by the extraction_type_group variable. The summary statistics shown below indicate a total of 7 distinct extraction_type_class values within the data set, with each record having a valid value. The number of pump installations per extraction type class is summarized in the table below.
# count distinct extraction_type_class = 7
length(unique(tanz$extraction_type_class))
## [1] 7
# calc number of wells per extraction_type_class
c_extract_t_c <- arrange(summarise(group_by(tanz, extraction_type_class),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_extract_t_c, n = 10)
## # A tibble: 7 x 2
## extraction_type_class TotalWells
## <chr> <int>
## 1 gravity 26780
## 2 handpump 16456
## 3 other 6430
## 4 submersible 6179
## 5 motorpump 2987
## 6 rope pump 451
## 7 wind-powered 117
The extraction_type_class indicator appears to represent a narrowing of the breadth of values available via the extraction_type_group variable into various higher-level overall categories. The gravity type class again accounts for the largest percentage of total pumps. However, the handpump class also represents a significant percentage of the total pump count, with nearly 16,400 pumps having been assigned to that class. Of the seven classes, three exceed the 54.3% overall functional benchmark.
The management variable represents the name of the method employed for management of a given pump. The summary statistics shown below indicate a total of 12 distinct management values within the data set, with each record having a valid value. The number of pump installations per management method is summarized in the table below.
# count distinct management = 12
length(unique(tanz$management))
## [1] 12
# calc number of wells per management
c_mgmt <- arrange(summarise(group_by(tanz, management),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_mgmt, n = 20)
## # A tibble: 12 x 2
## management TotalWells
## <chr> <int>
## 1 vwc 40507
## 2 wug 6515
## 3 water board 2933
## 4 wua 2535
## 5 private operator 1971
## 6 parastatal 1768
## 7 water authority 904
## 8 other 844
## 9 company 685
## 10 unknown 561
## 11 other - school 99
## 12 trust 78
VWC dominates the management variable’s values, with more than 40,000 pumps having been assigned that value. As we can see in the plots below, while seven other management approaches have higher percentages of their pumps functional, collectively those seven approaches are applied to less than one third of the pumps.
The management_group variable may represent a composite of the management method names indicated by the management variable. The summary statistics shown below indicate a total of 5 distinct management_group values within the data set, including 561 records labeled as “unknown”. The number of pump installations per management group is summarized in the table below.
# count distinct management_group = 5
length(unique(tanz$management_group))
## [1] 5
# calc number of wells per management_group
c_mgmt_grp <- arrange(summarise(group_by(tanz, management_group),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_mgmt_grp, n = 10)
## # A tibble: 5 x 2
## management_group TotalWells
## <chr> <int>
## 1 user-group 52490
## 2 commercial 3638
## 3 parastatal 1768
## 4 other 943
## 5 unknown 561
The management_group variable’s most relevant value is user-group, with 52,490 pumps having been assigned that value. As we can see in the plots, user-group yields a functional pump percentage close to that of the 54.3% overall functional metric.
The payment variable represents how the water is actually paid for by users of the pump, if at all. The summary statistics shown below indicate a total of 12 distinct payment values within the data set, with each record having a valid value. The number of pump installations per payment method is summarized in the table below.
# count distinct payment = 7
length(unique(tanz$payment))
## [1] 7
# calc number of wells per payment
c_paymt <- arrange(summarise(group_by(tanz, payment),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_paymt, n = 10)
## # A tibble: 7 x 2
## payment TotalWells
## <chr> <int>
## 1 never pay 25348
## 2 pay per bucket 8985
## 3 pay monthly 8300
## 4 unknown 8157
## 5 pay when scheme fails 3914
## 6 pay annually 3642
## 7 other 1054
As shown above, more than 25,000 pumps require no payment for their use, and, unsurprisingly, as shown in the graphics below, those pumps appear to be the least functional overall if unknown payment types are excluded. However, pumps that do not require payment may be located in remote areas where collection of payment is not feasible. Nevertheless, it appears reasonable to conclude that requiring users to pay for use of a pump is more likely to result in a pump remaining functional than will allowing use of a pump free of charge.
The payment_type variable represents a duplicate of the payment methods indicated by the payment variable, with the sole difference being that payment’s “pay when scheme fails” category has been replaced by “on failure”. The summary statistics shown below indicate a total of 7 distinct payment_type values within the data set, with each record having a valid value. The number of pump installations per payment type is summarized in the table below.
# count distinct payment_type = 7; on failure appears to replace "pay when scheme fails" from payment variable
length(unique(tanz$payment_type))
## [1] 7
# calc number of wells per payment_type
c_paymt_t <- arrange(summarise(group_by(tanz, payment_type),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_paymt_t, n = 10)
## # A tibble: 7 x 2
## payment_type TotalWells
## <chr> <int>
## 1 never pay 25348
## 2 per bucket 8985
## 3 monthly 8300
## 4 unknown 8157
## 5 on failure 3914
## 6 annually 3642
## 7 other 1054
The payment_type variable’s values are not distinct from those of the payment variable. In fact, the names of the values are simply minor alterations of those provided via the payment variable. All of the plots shown below exhibit values identical to those of the payment variable once the minor value name alterations are considered. As such, the payment_type variable appears to be redundant and can therefore be ignored for purposes of model building.
The water_quality variable is an indicator of the quality of the water produced by a given pump. The summary statistics shown below indicate a total of 8 distinct water_quality values within the data set, with each record having a valid value. The number of pump installations per water_quality value is summarized in the table below.
# count distinct water_quality = 8
length(unique(tanz$water_quality))
## [1] 8
# calc number of wells per water_quality
c_waterq <- arrange(summarise(group_by(tanz, water_quality),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterq, n = 10)
## # A tibble: 8 x 2
## water_quality TotalWells
## <chr> <int>
## 1 soft 50818
## 2 salty 4856
## 3 unknown 1876
## 4 milky 804
## 5 coloured 490
## 6 salty abandoned 339
## 7 fluoride 200
## 8 fluoride abandoned 17
The water_quality variable is dominated by the soft value, with 50,818 of the 59,400 pumps having that water_quality value. While the plots shown below indicate that the largest percentage of functional pumps belong to those of the fluoride category, there are only 200 such pumps throughout the country.
The quality_group variable represents a composite of the water quality indicators provided by the water_quality variable. The summary statistics shown below indicate a total of 6 distinct quality_group values within the data set, with each record having a valid value. The number of pump installations per extraction type class is summarized in the table below.
# count distinct quality_group = 6
length(unique(tanz$quality_group))
## [1] 6
# calc number of wells per quality_group
c_qualityg <- arrange(summarise(group_by(tanz, quality_group),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_qualityg, n = 10)
## # A tibble: 6 x 2
## quality_group TotalWells
## <chr> <int>
## 1 good 50818
## 2 salty 5195
## 3 unknown 1876
## 4 milky 804
## 5 colored 490
## 6 fluoride 217
The quality_group variable appears to represent a narrowing of the breadth of values available via the water_quality variable. In fact, the number of pumps in the good category is identical to those of the water_quality variable’s soft category, while the fluoride and salty categories have been created by summing the fluoride, fluoride_abandoned, and salty and salty_abandoned categories belonging to the water_quality variable. As such, this variable may be duplicative/redundant and therefore may likely be considered for removal for purposes of model building.
The quantity and quantity_group variables are exact duplicates of one another, as shown in the summary statistics provided below. Both variables are used to indicate the quantity of water available at a given pump. The summary statistics indicate a total of 5 distinct quantity and quantity_group values within the data set, with each record having a valid value. The number of pump installations per quantity value is summarized in the table below.
# count distinct quantity = 5
length(unique(tanz$quantity))
## [1] 5
# calc number of wells per quantity
c_quantity <- arrange(summarise(group_by(tanz, quantity),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_quantity, n = 10)
## # A tibble: 5 x 2
## quantity TotalWells
## <chr> <int>
## 1 enough 33186
## 2 insufficient 15129
## 3 dry 6246
## 4 seasonal 4050
## 5 unknown 789
# -------------------------------
# count distinct quantity_group = 5; appears to be identical to "quantity" variable
length(unique(tanz$quantity_group))
## [1] 5
# calc number of wells per quantity_group
c_quantityg <- arrange(summarise(group_by(tanz, quantity_group),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_quantityg, n = 10)
## # A tibble: 5 x 2
## quantity_group TotalWells
## <chr> <int>
## 1 enough 33186
## 2 insufficient 15129
## 3 dry 6246
## 4 seasonal 4050
## 5 unknown 789
The quantity and quantity_group variables have identical categories with identical numbers of pumps assigned to each. As such, the quantity_group variable is likely to be redundant/duplicative and can therefore be ignored for purposes of model building.
Of the five quantity categories, the dry category offers the smallest percentage of functional pumps with only 2.5% of such pumps being labeled as such. By contrast, more than 65% of pumps categorized as having a quantity of enough are functional.
The source variable indicates the source of the water for a given pump. The summary statistics shown below indicate a total of 10 distinct source values within the data set, with each record having a valid value. The number of pump installations per source value is summarized in the table below.
# count distinct source = 10
length(unique(tanz$source))
## [1] 10
# calc number of wells per source
c_source <- arrange(summarise(group_by(tanz, source),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source, n = 10)
## # A tibble: 10 x 2
## source TotalWells
## <chr> <int>
## 1 spring 17021
## 2 shallow well 16824
## 3 machine dbh 11075
## 4 river 9612
## 5 rainwater harvesting 2295
## 6 hand dtw 874
## 7 lake 765
## 8 dam 656
## 9 other 212
## 10 unknown 66
As shown below, the spring category of the source variable offers the highest percentage of functional pumps and also represents the largest source category with 17,021 pumps. By contrast, while the lake category performs poorly, it represents only 765 of the 59,400 pumps represented in the data set. Of the other categories represented, 57.8% of source = dam are also not functioning.
The source_type variable appears to be a composite of the source variable. The summary statistics shown below indicate a total of 7 distinct source_type values within the data set, with each record having a valid value. The number of pump installations per source_type is summarized in the table below.
# count distinct source_type = 7
length(unique(tanz$source_type))
## [1] 7
# calc number of wells per source_type
c_source_t <- arrange(summarise(group_by(tanz, source_type),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source_t, n = 10)
## # A tibble: 7 x 2
## source_type TotalWells
## <chr> <int>
## 1 spring 17021
## 2 shallow well 16824
## 3 borehole 11949
## 4 river/lake 10377
## 5 rainwater harvesting 2295
## 6 dam 656
## 7 other 278
The source_type variable appears to combine some of the categories from source variable. As such, this variable may be redundant / duplicative and might possibly be ignored for purposes of model building.
As shown below, the spring category of the source_type variable offers the highest percentage of functional pumps and also represents the largest source_type category with 17,021 pumps. By contrast, while the dam category performs poorly, it represents only 656 of the 59,400 pumps represented in the data set. Of the other categories represented, 46.2% of source_type = borehole are also not functioning.
The source_class variable appears to be a composite of the source_type variable. The summary statistics shown below indicate a total of 3 distinct source_class values within the data set, with each record having a valid value. The number of pump installations per source_class is summarized in the table below.
# count distinct source_class = 3
length(unique(tanz$source_class))
## [1] 3
# calc number of wells per source_class
c_source_c <- arrange(summarise(group_by(tanz, source_class),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_source_c, n = 10)
## # A tibble: 3 x 2
## source_class TotalWells
## <chr> <int>
## 1 groundwater 45794
## 2 surface 13328
## 3 unknown 278
The source_class variable appears to be a binary indicator, with, as shown below, both known categories appearing to be equally likely to offer a functional pump. However, groundwater class pumps appear to be more likely than surface class pumps to be completely non-functional.
The waterpoint_type variable indicates the type of pump installed at a given location. The summary statistics shown below indicate a total of 7 distinct waterpoint_type values within the data set, with each record having a valid value. The number of pump installations per waterpoint_type value is summarized in the table below.
# count distinct waterpoint_type = 7
length(unique(tanz$waterpoint_type))
## [1] 7
# calc number of wells per waterpoint_type
c_waterp_t <- arrange(summarise(group_by(tanz, waterpoint_type),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterp_t, n = 10)
## # A tibble: 7 x 2
## waterpoint_type TotalWells
## <chr> <int>
## 1 communal standpipe 28522
## 2 hand pump 17488
## 3 other 6380
## 4 communal standpipe multiple 6103
## 5 improved spring 784
## 6 cattle trough 116
## 7 dam 7
As shown below, the waterpoint_type variable’s values indicate that pumps having a dam as their waterpoint are the most likely to be functional while communal standpipe multiple appear to be the least likely to be functional. However, there are only 7 pumps having a waterpoint_type of dam. In fact, the most common waterpoint_type is communal standpipe followed by hand pump. Both of those categories appear to perform well relative to the overall functional benchmark of 54.3%.
The waterpoint_type_group variable appears to be a composite of the waterpoint_type variable. The summary statistics shown below indicate a total of 6 distinct waterpoint_type_group values within the data set, with each record having a valid value. The number of pump installations per waterpoint_type_group is summarized in the table below.
# count distinct waterpoint_type_group = 6
length(unique(tanz$waterpoint_type_group))
## [1] 6
# calc number of wells per waterpoint_type_group
c_waterp_t_g <- arrange(summarise(group_by(tanz, waterpoint_type_group),
TotalWells = length(unique(id)) ), desc(TotalWells) )
head(c_waterp_t_g, n = 10)
## # A tibble: 6 x 2
## waterpoint_type_group TotalWells
## <chr> <int>
## 1 communal standpipe 34625
## 2 hand pump 17488
## 3 other 6380
## 4 improved spring 784
## 5 cattle trough 116
## 6 dam 7
As shown above, the waterpoint_type_group appears to be a composite of the waterpoint_type variable, with the sole difference being the two standpipe groups have been summed together. As such, this variable is likely duplicative / redundant and might possibly be ignored for purposes of model building. Plots showing the status of pumps by waterpoint_type_group value are shown below.
The barplots shown above allow us to gain insight into some of the the predictive aspects of the data. The plots show that most of the categorical variables do, in fact, contain significant predictive characteristics. For example, we find geographic variability with respect to the functional
status of pumps via variables such as basin
and region
:
Additional variability in the functional
status of pumps can be found via variables such as scheme_management
, construction_year
, public_meeting
, extraction_type
, management
, payment
, source
, quantity
, water_quality
, and waterpoint_type
:
Plots for the numeric variables show that gps_height
, longitude
, and latitude
also contain predictive characteristics, while population
and amount_tsh
appear likely to have less predictive value.
This variability is indicative of the predictive value of the variables. For example, pumps located within the Lake Nyasa basin are far more likely to be functional
than are pumps located within the Southern Coast basin (65.4% vs 37.2%), and relatively newer pumps are more likely to be functional than are relatively older pumps (see plot for construction_year
). Furthermore, pumps located in the Arusha region are more than twice as likely to be functional than are pumps found within the region of Lindi (68.5% vs 29.8%), and pumps that require some form of payment for use are much more likely to be functional than are pumps that require no payment (see plot for payment
). Similar types of inferences can be postulated relative to the functional needs repair
and non functional
statuses.
Such inferences are helpful for purposes of model building since they suggest that many variables found within the data set are strong candidates for inclusion within a prospective predictive / classification model. Furthermore, it suggests that use of varying combinations of independent variables might be employed in an attempt to determine whether any particular combination thereof yields a superior model.
We considered the numeric variables (‘amount_tsh’, ‘latitude’, ‘longitude’, ‘gps_height’, ‘population’ and ‘construction_year’) in our models, and the used imputation techniques to handle missing values of these variables are explained below:
amount_tsh: This variable contains Total Static Head (difference in elevation of fluid’s source and destination surfaces). The unit of measurement will be a distance metric such as feet, meters etc., but the unit of measurement used for this variable in the data set is unknown. We are assuming that a uniform metric was used for this variable’s data in the data set. We used 4 imputation strategies (mean, median, ignore and custom) for this variable..
“mean” - use the mean value of the non-zero/non-null observations to impute 0s and null values of amount_tsh.
“median” - use the median value of the non-zero/non-null observations to impute 0s and null values of amount_tsh.
“ignore” - just substitute the null values of amount_tsh with 0
“custom” - Get the median of amount_tsh by grouping source_class, basin, waterpoint_type_group variables. If the source_class is null, then use the median value based on the grouping of waterpoint_type_group values. source_class represents the type of water source (ground water or surface) and basin represents the water source itself. So these two variables should represent the source surface, and the waterpoint_type_group represents the type of the pump (hand pump, communal standpipe etc). Hence the median value of amount_tsh grouped by source_class, basin, waterpoint_type_group variables could potentially represent a reliable value to impute amount_tsh values.
latitude & longitude: These 2 variables represent the precise location of the pump. These 2 variables have 1812 unknown values (represented as 0). We decided these 1812 values as unknown values, as the latitudes and longitudes of Tanzania range between [-13, 0.25] and [28, 42] respectively [11]. We used 3 different imputation techniques (“mean”, “median”, and “custom”). The “mean” and “median” methods use the mean and median values of non-zero and non-null values respectively, for imputation. The “custom” imputation technique uses the data from “ward”, “lga” and “region” variables. If a latitude data is missing, then the custom imputation technique will use the average of all the non-zero/non-null latitudes of the “ward” (where the pump is located). If all the latitudes are unknown for the ward, then we will use the average of non-zero/non-null latitudes at the “lga” level, and if “lga” has all unknown latitudes, then we will use average of latitudes at the “region” level. If region also has all unknown or zero latitudes, then we will use the average of all non-zero latitudes at the country level. The same imputation technique is applicable to longitudes also. According to [12], in Tanzania, regions are subdivided into districts, districts are subdivided into wards, and wards into subvillages. Hence we started our search at ward level. But we used “lga” (Local Government Authority) as the next level, based on the assumption that “lga” represent one or more wards. We could start the search at subvillage level, instead of ward, but the ward and lga variables have 0 missing values, whereas the subvillage variable has some missing values. We ended up using region as the highest level, since the future data might have missing ward/lga details, and we have to gracefully handle the situation without the failure of predictive model. It is very unlikely that region information would be missing in future data, since Tanzania has only 22 regions or divisions[12]. If the region is also missing in future data, then we will use the average at the country level.
gps_height: This variable represents the altitude of the pump’s location. For this variable we used “mean”, “median”, “ignore” and “custom” strategies. The “mean”, “median” and “ignore” strategies are similar to the strategies explained for amount_tsh variable. The “custom” strategy uses the location details (based on latitude and longitude data). But before applying the “custom” impute strategy, the latitude and longitude variables must be successfully imputed. The “custom” impute method is controlled by two additional hyper parameters: “initial_radius” and “increment_radius”. The “initial_radius” represent the initial area of search. Each degree in latitude (or longitude) represent 110 KM [13]. So if “initial_radius” is initialized to 0.1, then the custom method will consider all the gps_heights within the 11 KM radius of the latitude and longitude, and impute with the median value of non-zero gps_heights within 11 KM radius. If there are no valid gps_heights found within 11 KM radius, then the search area is increased by “increment_radius”. So if “increment_radius” is set at 0.2, then the search area is increased by 22 km radius. This search continues until the search radius reaches to 2 (representing 220 KM radius). If no gps_height is found within 220 KM radius, then the country’s median gps_height will be used to impute the gps_height variable.
population: This variable is imputed using the same logic as described for gps_height.
construction_year: This variable represents the year when the pump was installed. For this variable also we defined 4 types of imputation methods - “mean”, “median”, “ignore”, and “custom”. The “mean”, “median” and “ignore” are similar to the methods described for amount_tsh variable. The “custom” imputation technique uses the “date_recorded” variable, to calculate the approximate age of the pump. We first get the difference between the date_recorded and construction year (using non_zero and non-null data only). In the second step, we get the difference between date_recorded and construction_year, to find the age. If there are any age values which are less than 0 or unknown (Nulls), then we substitute those values with the median of the age computed.
We have three types of categorical variables in our data set - 1. Variables with many levels such as installer (2146 levels), subvillage (19288 levels). 2. Variables with less number of levels (20 or less) like district. 3. Some variables like installer and funder have free form text data, with so many levels.
Each of these categorical variables have to be quantified before we start building predictive model. To tackle the variables with many levels, we made use of 2 imputation techniques frequency based and response variable based. In frequency based technique, we bin the levels of categorical variable based on the percentage of the data the levels represent. For example, if a categorical variable has 3 levels A, B, C, and each of these levels represent 12%, 15% and 83%, then we will group the variables into several buckets (each bucket representing a specific percentage range). If we use 5% range as the bucket size, then the variables A, B belong to 11-15% bucket, while C belongs to 81-85% range.
The response based binning takes the target variable into account (the target variable must be a categorical variable).For example if a categorical variable has three levels A, B, C and the target variable has only 2 levels TRUE and FALSE, then we first compute the TRUE and FALSE percentage values for each level A, B, C. If A level has 81% TRUE, B has 13% TRUE and C has 84% TRUE, then we bin A and C into one bin (80-85%) and C into bin 10-15%.
The desired binning technique, and the bin size are controlled by hyper parameters, which can be tuned/tested using techniques like Cross Validation. After binning the levels, we will apply the dummy variable technique used to handle the categorical variables with less number of levels (usually 20 or less).
To handle categorical variables with less number of levels, we create dummy variables to represent presence/absence of each level. For example, if we have three levels A, B, C in a categorical variable, then we will have 3 dummy variables dummy_A, dummy_B, and dummy_C to represent the presence/absence of A/B/C levels. To represent A, we can use [1,0,0] (format [dummy_A, dummy_B, and dummy_C]). This technique is also known as one-hot encoding. In some statistical books like [14], it was suggested to create n-1 dummy variables to represent n levels in a categorical variable, so that we avoid correlated variables in the data set. But we suggest to create n dummy variables for n levels, so that we can gracefully handle the situation to process an unknown level in the future incoming data. For instance if our training data has just 3 levels A, B, C, then we create 3 dummy variables. But after deploying our model in production, if the streaming data has a different level like D, then we can represent D by making the dummy variables values as [0, 0, 0].
To handle the third type of categorical variables (free form text), we used 2 techniques to group the data - based on Fuzzy logic [15] and considering only the first “n” characters of the text. The Fuzzy logic technique did not give better results, as the runtime of to process the data was significantly high. So we considered the second method to consider only the initial characters of the text, and group the data based on the initial characters. Once the data is grouped, we will create dummy variables to represent each group.
To test all the options listed above for handling categorical varables, we created the following hyper parapeters:
“substitute” - A desired value to substitute Null values of categorical variable.
“convert_to_small” - Helps to convert the text to small case letters.
“buckets” - Controls the desired number of bins to group the categorical data.
“method” - Controls which type of binning to apply. Can have “freq”, “resp” and “both” to apply frequency based, response based, and both frequency and response based techniques respectively.
“initial_chars” - Controls the initial number of characters to consider while processing the free form text.
“groups” - Once the initial characters are extracted, how many groups do we need? For example, if we set groups=15, then we will create 14 different groups for the top 14 groups represented by the initial characters, and the 15th group will be other, which contains all the remaining rows whose initial characters are not in the top 14 groups.
In our project, we used “unknown” to replace all the null values in all categorical columns. We also converted all the categorical variables data to lower case.
The categorical variables [funder, installer] were considered to have free form text with many levels, the variables [‘subvillage’,‘lga’,‘ward’] were considered to have many levels, and the variables [‘basin’,‘region’,‘district_code’, ‘public_meeting’, ‘scheme_management’, ‘permit’, ‘extraction_type_group’, ‘management’, ‘payment_type’, ‘water_quality’, ‘quantity_group’, ‘source’, ‘waterpoint_type’] were considered as categorical variables with less number of levels (20 or less). The remaining categorical variables [‘scheme_name’, ‘extraction_type’, ‘extraction_type_class’, ‘management_group’, ‘quality_group’, ‘source_type’,‘source_class’,‘waterpoint_type_group’] were excluded from further processing since they represent the duplicate data.
B1. https://www.wartsila.com/encyclopedia/term/total-static-head
All data transformations were implemented in python 3.4 and the source code is listed below:
Required packages
#Import all the required packages:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, LabelBinarizer, Imputer, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
#from fuzzywuzzy import fuzz
#from fuzzywuzzy import process
from sklearn.externals import joblib
import math
import pickle
DataFrameSelector
###################################################################
#Every class defined will have a variable called column_names.
#This will help us to get the column_names of the transformation.
#It can be accessed like the way shown below:
#Example-1:
#freq_bin = FreqBasedCategoricalBinning()
#freq_bin.fit(test_df)
#freq_bin.transform(test_df)
#freq_bin.column_names will
#contain the column names of the final transformation
#
#Example-2 (using pipeline):
#freq_pipeline = \
#Pipeline([('FreqBasedCategoricalBinning',FreqBasedCategoricalBinning())])
#freq_pipeline.fit(test_df)
#freq_pipeline.transform(test_df)
#freq_pipeline.named_steps['FreqBasedCategoricalBinning'].column_names
#will contain the list of column names
#used in the transformation
#This will help us to select the desired columns of a data frame
#This will help us to select the desired columns of a data frame
class DataFrameSelector(BaseEstimator, TransformerMixin):
def __init__(self,attribute_names):
self.attribute_names = attribute_names
def fit(self,X, y=None):
self.column_names = self.attribute_names
return self
def transform(self,X):
return X[self.attribute_names]
HandleCategoricalNulls
#This will help us to convert all text columns to small case and
#also help us to substitute a user defined text in the place of NA values
class HandleCategoricalNulls(BaseEstimator, TransformerMixin):
def __init__(self,substitute='unknown',convert_all_to_small=True):
'''
cols must be a list
'''
self.substitute = substitute
self.convert_all_to_small = convert_all_to_small
self.column_names = []
def get_alpha_numeric(self,x):
temp_l = list()
for i in list(x):
temp = re.sub(r'[^a-zA-Z0-9]',' ', i)
temp = re.sub(' +'," ",temp)
temp_l.append(temp)
return temp_l
def fit(self,X, y=None):
HandleCategorical_columns = list(X.columns)
self.column_names = list(X.columns)
return self
def transform(self,X):
X=X.fillna(self.substitute)
if self.convert_all_to_small:
X = X.apply(lambda x: x.astype(str).str.lower())
#Include only alpha-numeric chars
#Remove any contiguous spaces
for i in X.columns:
X[i] = self.get_alpha_numeric(X[i])
return X
ApplyFuzzyTextProcess This class will not be used as it is computationally intensive. But the definition is included so that we can enhance it in future work.
#helps to contol whether to group text data bsed on common grouping
#but do not use this, as it runs for a while
class ApplyFuzzyTextProcess( BaseEstimator, TransformerMixin):
def __init__(self,map_dict,apply=True):
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
self.map_dict = map_dict
self.apply = apply
self.column_names = []
def find_and_replace(self,l):
temp_l = []
for j in list(l):
try:
temp_l.append(self.map_dict[str(j).lower()])
except:
word,score = process.extract(j, self.map_dict.keys(), limit=1)[0]
if score > 85:
temp_l.append(self.map_dict[word])
else:
temp_l.append(j)
continue
return temp_l
def fit(self,X, y=None):
self.column_names = list(X.columns)
return self
def transform(self,X):
if self.apply:
for i in X.columns:
X[i] = self.find_and_replace(X[i])
return X
FreqBasedCategoricalBinning
##This class will help us to bin the categorical variables with
##many levels based on the frequency distribution
class FreqBasedCategoricalBinning( BaseEstimator, TransformerMixin):
def __init__(self,buckets=20,apply=True):
'''
buckets - Desired number of classes.
you can get less than or equal to the desired buckets,
if the data is heavily skewed or if we have missing ranges
'''
#Calculate the number of bins
try:
self.bin_size=np.floor(100.0/(buckets-1))
except:
self.bin_size=5.0
self.freq_dict={}
self.column_names = []
self.apply = apply
def get_freq(self,df):
#Get the frequency of each level in col
for col in df.columns:
total_count=df.groupby([col]).size().reset_index()
total_count.columns=[col,col+'_freq_bin']
total_count.index = list(total_count[col])
total_count = total_count.drop([col],axis=1)
#Save the result to a dictionary
self.freq_dict[col] = \
np.ceil(total_count.iloc[:,[0]]/np.max(total_count.iloc[:,0])*100)
def join(self,X,key,value):
#add a column called 'sorted' to save the order.
X['sorted'] = np.arange(len(X))
X.index = list(X[key])
X.drop([key],axis=1,inplace=True)
X=X.join(value,how='left')
#Left join may result in NaN,
#if we have unseen levels in the variable
#We will pad such values with median
temp_val = np.median(value.iloc[:,[0]])
X[[value.columns[0]]] = X[[value.columns[0]]].fillna(temp_val)
#Bin the data
X[[value.columns[0]]] = np.ceil(X[[value.columns[0]]]/self.bin_size)
#Convert the bins to int and later to str
X[[value.columns[0]]] = X[[value.columns[0]]].astype('int')
#X[[value.columns[0]]] = 'class_'+X[[value.columns[0]]].astype('str')
X[[value.columns[0]]] = X[[value.columns[0]]].astype('str')
#Reset and drop the first (new) column resulted in reset_index()
X=X.reset_index()
X.drop(X.columns[0],axis=1,inplace=True)
#Restore the order:
X.sort_values(['sorted'],inplace=True)
X.drop(['sorted'],axis=1,inplace=True)
return X
def fit(self,X, y=None):
if self.apply:
self.get_freq(X)
#FreqBasedCategoricalBinning = []
#Do not set the column names here, to make the logic simple.
self.column_names = []
return self
else:
self.column_names = []
return self
def transform(self,X,y=None):
if self.apply:
X = X.copy()
for key, value in self.freq_dict.items():
X = self.join(X,key,value)
global FreqBasedCategoricalBinning_cols
#Set the column names here
self.column_names = list(X.columns)
return X
else:
self.column_names = list(X.columns)
return X
RespBasedCategoricalBinning
##This class will help us to bin the categorical variables with
##many levels based on the target classes.
class RespBasedCategoricalBinning( BaseEstimator, TransformerMixin):
def __init__(self,buckets=20,apply=True):
'''
buckets - Desired number of classes.
you can get less than or equal to the desired buckets,
if the data is heavily skewed or if we have missing ranges
'''
##Determine the bin size
try:
self.bin_size=np.floor(100.0/(buckets-1))
except:
self.bin_size= 5.0
#Declare a dictionary, which will save the details
#in the fit() function
self.freq_dict={}
self.column_names = []
self.apply = apply
##This function will be called by fit() function.
def get_probs(self,X,y):
#Get the columns of the X data frame
X_columns = list(X.columns)
#Combine X and y into a single data frame
df = X
#_target_variable is the target column, which is a categorical column
df['_target_variable'] = list(y)
#For each column in the X data frame, perform the following:
for col in X_columns:
#Create a data frame with counts, group by the values of the column col
total_count=df.groupby([col]).size().reset_index()
#Assign the column names to the data frame with group counts
total_count.columns=[col,'total']
#Make the col as the index.
#Pandas join is not behaving well if we are joining on a column.
#So I am assigning the index using the to be joined col values
total_count.index = list(total_count[col])
#Drop the col as its values are in the index already.
total_count = total_count.drop([col],axis=1)
#Now create another data frame, that contains the counts of
#values on col, group by target variable
total_grp_count = df.groupby([col,'_target_variable']).size().reset_index()
total_grp_count.columns=[col,'_target_variable','status_total']
total_grp_count.index = total_grp_count[col]
total_grp_count = total_grp_count.drop([col],axis=1)
#Join the two data frames: total_count and total_grp_count
#We have to use inner join, as we are certain that there will be NO mis-matches
joined_df = total_count.join(total_grp_count,how='inner')
joined_df = joined_df.reset_index()
#Make sure that we have proper column names, after reset_index
columns = list(joined_df.columns)
columns[0] = col
joined_df.columns = columns
#print(joined_df)
#Calculating the proportions
joined_df['proportion'] = joined_df['status_total']/joined_df['total']
#Pivot the table
joined_df = pd.pivot_table(joined_df,values='proportion',\
columns='_target_variable',index=col)
#DO NOT reset the index on the pivot table.
#As the pivoted table will have col values as the index,
#and this will be useful later in transform()
#Fill the NaN values with 0
joined_df = joined_df.fillna(0)
for i in joined_df.columns:
joined_df[i] = joined_df[i]/np.max(joined_df[i])
#Rename the column names, by appending with col name.
#This will make sure that we do not have any duplicate columns
joined_df.columns = [col + '-'+str(i).replace(" ", "-") for i in joined_df.columns]
#Save the col and pivoted table in a dictionary.
#This dictionary will be used in transform() logic.
self.freq_dict[col] = joined_df
#You have to drop the _target_variable, so that we revert back the changes to X
df.drop(["_target_variable"],axis=1,inplace=True)
#Define a function to help with the custom join
def join(self,X,key,value):
'''
X will be a data frame, on which we have to apply the transform()
key will be a key in the self.freq_dict
value will be the pivoted table corresponding to the key
'''
##Create a column called 'sorted' X. This will help
##us to restore the order of X later (or else there might be
##change that the order of X might be disturbed)
X['sorted'] = np.arange(len(X))
#Make sure that X is indexed on the key column
try:
X.index = list(X[key])
X.drop([key],axis=1,inplace=True)
X=X.join(value,how='left')
except:
print("EXCEPTION/EROR: The input data frame does not have "+key+" column")
print("Terminating the program")
#Make sure that you fill the NaN values with median
#This is needed, to gracefully add unseen values in the categorical variables
for i in value.columns:
temp_val = np.median(value[i])
X[i] = X[i].fillna(temp_val)
X[i] = np.ceil(X[i]*100/self.bin_size)
X[i] = X[i].astype('int')
X[i] = X[i].astype('str')
X=X.reset_index()
#Drop the first column, which is obtained by reset_index, as it is not needed
X.drop(X.columns[0],axis=1,inplace=True)
X.sort_values(['sorted'],inplace=True)
X.drop(['sorted'],axis=1,inplace=True)
return X
#fit() will just build the self.freq_dict
#In the following func def, the y parameter is NOT optional
def fit(self,X, y):
if self.apply:
self.get_probs(X,y)
#To make the logic simple, we will set the column names in transform()
self.column_names = []
return self
else:
self.column_names = []
return self
#transform() will iterate over the dictionary keys,
#and build the transformation.
#As we are using the self.freq_dict items,
#even though the input data frame supplied to transform has
#extra values, we will not get any errors, and such columns will remain
#undisturbed.
def transform(self,X,y=None):
if self.apply:
X = X.copy()
for key, value in self.freq_dict.items():
X = self.join(X,key,value)
self.column_names = list(X.columns)
return X
else:
self.column_names = list(X.columns)
return X
CatMultiLabelTransformer
#
class CatMultiLabelTransformer(BaseEstimator, TransformerMixin):
def __init__(self,apply=True):
self.column_names = []
self.apply = apply
self.binarizers={}
def check_input_obj(self,X,location):
##Check if input object is a pandas df, else raise exception
try:
if not isinstance(X,pd.DataFrame):
raise ValueError
except:
print("**EXCEPTION/ERROR**: In "+ location + \
" function of "+self.__name__+ ". Input must be a Pandas dataframe")
exit(10)
def fit(self, X, y=None):
self.column_names = []
if self.apply:
##Check if input object is a pandas df, else raise exception
self.check_input_obj(X,'fit()')
##Create an empty dict,
##which will be updated with LabelBinarizer for each column
self.binarizers={}
for col in X.columns:
uniq_elements = list(set(X[col]))
#print(uniq_elements)
if len(uniq_elements) == 2:
##Add a dummy class
#We have to name this class in a
#weird fashion,so that no data has this class
uniq_elements.append('d#u/m*m-y+class_991-+xya')
lb = LabelBinarizer()
self.binarizers[col] = lb.fit(uniq_elements)
#print(X)
#self.column_names.append([str(col) + "_" + str(j) \
#for j in list(lb.classes_) if j != 'd#u/m*m-y+class_991-+xya'])
self.column_names = self.column_names + \
[str(col) + "_" + str(j) \
for j in list(lb.classes_) \
if j != 'd#u/m*m-y+class_991-+xya']
#print("in transform")
#print("len of self.binarizers",len(self.binarizers))
#print("len of self.column_names",len(self.column_names))
return self
else:
return self
def transform(self, X, y=None):
#print("in transform")
#print("len of self.binarizers",len(self.binarizers))
#print(self.apply)
if self.apply:
self.check_input_obj(X,'transform()')
#X_transform = np.empty()
temp_transformed_data = []
transformed_column_names = []
for key, value in self.binarizers.items():
#print("key=",key)
#print("value",value)
#print("len of temp_transformed_data",len(temp_transformed_data))
try:
temp_transformed_data.append(value.transform(X[key]))
transformed_column_names = \
transformed_column_names + \
[str(key) + "_" + str(j) \
for j in list(value.classes_)]
except:
continue
return pd.DataFrame(np.concatenate(temp_transformed_data, axis=1),\
columns=transformed_column_names)[self.column_names]
#transformed_column_names = self.column_names + [str(col) + "_" + \
#str(j) for j in list(lb.classes_) if j != 'd#u/m*m-y+class_991-+xya']
#transformed_X.columns = self.columns
else:
self.column_names = list(X.columns)
return X
AmountTSHTransformer
#Here will be our strategy to handle amount_tsh:
# 1. Get the median values of
# amount_tsh group by 'source_class', 'basin',
# 'waterpoint_type_group'
# 2. If amount_tsh is 0, fill the value using the
# median computed by grouping 'source_class',
# 'basin', 'waterpoint_type_group'
# 3. If there is no combination of 'source_class',
# 'basin', 'waterpoint_type_group' found,
# fill using 'waterpoint_type_group'
# 4. Else, let it stay 0
class AmountTSHTransformer(BaseEstimator, TransformerMixin):
def __init__(self,method='custom'):
self.column_names = []
self.method = method
def check_input_obj(self,X,location):
##Check if input object is a pandas df, else raise exception
try:
if not isinstance(X,pd.DataFrame):
raise ValueError
except:
print("**EXCEPTION/ERROR**: In "+ location + \
" function of "+self.__name__+ \
". Input must be a Pandas dataframe")
exit(10)
def fit(self, X, y=None):
self.column_names = ['amount_tsh']
if self.method == 'custom':
X['amount_tsh'] = X['amount_tsh'].astype(float)
self.check_input_obj(X,"fit()")
#Make sure that you have all the required columns:
if len(set(X.columns) - set(['amount_tsh','source_class', \
'basin', 'waterpoint_type_group'])) == 0:
#Get the required dictionaries...
amount_tsh_df = X[X['amount_tsh'] != 0].groupby(['source_class', \
'basin', \
'waterpoint_type_group'])\
['amount_tsh'].median()
self.amount_tsh_dict_all_level = dict(amount_tsh_df)
amount_tsh_df = X[X['amount_tsh'] != 0].\
groupby(['waterpoint_type_group'])\
['amount_tsh'].median()
self.amount_tsh_dict_wp = dict(amount_tsh_df)
else:
raise ValueError("Check the supplied columns. Must supply 'source_class', \
'basin', 'waterpoint_type_group', 'amount_tsh' only")
exit(10)
self.column_names = ['amount_tsh']
return self
if self.method == 'median':
X['amount_tsh'] = X['amount_tsh'].astype(float)
self.median = np.median(list(X[X['amount_tsh'] != 0]['amount_tsh']))
if math.isnan(self.median):
self.median = 0
self.column_names = ['amount_tsh']
return self
if self.method == 'mean':
X['amount_tsh'] = X['amount_tsh'].astype(float)
self.mean = np.mean(list(X[X['amount_tsh'] > 0]['amount_tsh']))
if math.isnan(self.mean):
self.mean = 0
self.column_names = ['amount_tsh']
return self
if self.method == 'ignore':
self.column_names = ['amount_tsh']
return self
def transform(self,X):
if self.method == 'custom':
X['amount_tsh'] = X['amount_tsh'].astype(float)
self.check_input_obj(X,"transform()")
transformed_amount_tsh = []
for i, j, k, l in list(zip(X['amount_tsh'].\
fillna(0),X['source_class'], \
X['basin'], X['waterpoint_type_group'])):
if i == 0:
try:
transformed_amount_tsh.append(self.amount_tsh_dict_all_level[(j,k,l)])
except:
try:
transformed_amount_tsh.append(self.amount_tsh_dict_wp[l])
except:
transformed_amount_tsh.append(i)
continue
else:
transformed_amount_tsh.append(i)
X['amount_tsh'] = transformed_amount_tsh
return X[['amount_tsh']]
if self.method == 'median':
X['amount_tsh'] = X['amount_tsh'].astype(float)
X['amount_tsh'] = X['amount_tsh'].fillna(0)
amount_tsh = np.array(list(X['amount_tsh']))
amount_tsh[amount_tsh == 0] = self.median
X['amount_tsh'] = amount_tsh
return X[['amount_tsh']]
if self.method == 'mean':
X['amount_tsh'] = X['amount_tsh'].astype(float)
X['amount_tsh'] = X['amount_tsh'].fillna(0)
amount_tsh = np.array(list(X['amount_tsh']))
amount_tsh[amount_tsh == 0] = self.mean
X['amount_tsh'] = amount_tsh
return X[['amount_tsh']]
if self.method == 'ignore':
X['amount_tsh'] = X['amount_tsh'].astype(float)
X['amount_tsh'] = X['amount_tsh'].fillna(0)
return X[['amount_tsh']]
GPSHeightTransformer
#Here will be our strategy to handle gps_height:
# In fit() just save the input data as a data frame \
# with gps_height, lat, long, and gps_height > 0
# In transform(), if gps_height == 0, then
# start at 0.1 radius, and check if there are any non-zero gps_instances.
# If yes, get the average, else, increment search radius
# by 0.3 (0.1 increase corresponds to 11km approximately)
# If nothing is found within an increment of 2, then just ignore.
class GPSHeightTransformer(BaseEstimator, TransformerMixin):
def __init__(self,init_radius=0.1,increment_radius=0.3,method = 'custom'):
self.column_names = []
self.init_radius = init_radius
self.increment_radius = increment_radius
self.method = method
def get_subset_records(self, latitude,longitude,df,radius):
latitude_from = latitude - radius
latitude_to = latitude + radius
longitude_from = longitude - radius
longitude_to = longitude + radius
df_temp = df[(df['latitude'] >= latitude_from) & (df['latitude'] <= latitude_to) & \
(df['longitude'] >= longitude_from) & (df['longitude'] <= longitude_to)]
return df_temp
def check_input_obj(self,X,location):
##Check if input object is a pandas df, else raise exception
try:
if not isinstance(X,pd.DataFrame):
raise ValueError
except:
print("**EXCEPTION/ERROR**: In "+ \
location + " function of "+\
self.__name__+ \
". Input must be a Pandas dataframe")
exit(10)
def fit(self, X, y=None):
if self.method == 'custom':
X['gps_height'] = X['gps_height'].astype(float)
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
self.df = X[X['gps_height'] != 0]
self.column_names = ['gps_height']
return self
if self.method == 'median':
X['gps_height'] = X['gps_height'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.median = np.median(list(X[X['gps_height'] != 0]['gps_height']))
if math.isnan(self.median):
self.median = 0
self.column_names = ['gps_height']
return self
if self.method == 'mean':
X['gps_height'] = X['gps_height'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.mean = np.mean(list(X[X['gps_height'] != 0]['gps_height']))
if math.isnan(self.mean):
self.mean = 0
self.column_names = ['gps_height']
return self
if self.method == 'ignore':
self.column_names = ['gps_height']
return self
def transform(self,X):
if self.method == 'custom':
X['gps_height'] = X['gps_height'].astype(float)
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
gps_height_transformed = []
for latitude, longitude, gps_height in \
zip(X['latitude'],X['longitude'],X['gps_height']):
radius = self.init_radius
if gps_height == 0:
gps_height_temp = gps_height
while gps_height_temp == 0 and radius <= 2:
df_temp = self.get_subset_records\
(latitude,longitude,self.df,radius)
gps_height_temp = np.mean(df_temp[df_temp['gps_height']!=0]\
['gps_height'])
if math.isnan(gps_height_temp):
gps_height_temp = 0
radius = self.increment_radius + radius
else:
gps_height_temp =gps_height
gps_height_transformed.append(gps_height_temp)
X['gps_height'] = gps_height_transformed
self.column_names = ['gps_height']
#self.column_names = list(X.columns)
#return X[['latitude','longitude','gps_height']]
return X[['gps_height']]
if self.method == 'median':
X['gps_height'] = X['gps_height'].astype(float)
X['gps_height'] = X['gps_height'].fillna(0)
gps_height = np.array(list(X['gps_height']))
gps_height[gps_height == 0] = self.median
self.column_names = ['gps_height']
#self.column_names = list(X.columns)
#return X[['latitude','longitude','gps_height']]
X['gps_height'] = gps_height
return X[['gps_height']]
if self.method == 'mean':
X['gps_height'] = X['gps_height'].astype(float)
X['gps_height'] = X['gps_height'].fillna(0)
gps_height = np.array(list(X['gps_height']))
gps_height[gps_height == 0] = self.mean
self.column_names = ['gps_height']
#self.column_names = list(X.columns)
#return X[['latitude','longitude','gps_height']]
X['gps_height'] = gps_height
return X[['gps_height']]
if self.method == 'ignore':
self.column_names = ['gps_height']
X['gps_height'] = X['gps_height'].astype(float)
X['gps_height'] = X['gps_height'].fillna(0)
return X[['gps_height']]
PopulationTransformer
class PopulationTransformer(BaseEstimator, TransformerMixin):
def __init__(self,init_radius=0.1,increment_radius=0.3,method = 'custom'):
self.column_names = []
self.init_radius = init_radius
self.increment_radius = increment_radius
self.method = method
def get_subset_records(self, latitude,longitude,df,radius):
latitude_from = latitude - radius
latitude_to = latitude + radius
longitude_from = longitude - radius
longitude_to = longitude + radius
df_temp = df[(df['latitude'] >= \
latitude_from) & (df['latitude'] <= latitude_to) & \
(df['longitude'] >= longitude_from) & \
(df['longitude'] <= longitude_to)]
return df_temp
def check_input_obj(self,X,location):
##Check if input object is a pandas df, else raise exception
try:
if not isinstance(X,pd.DataFrame):
raise ValueError
except:
print("**EXCEPTION/ERROR**: In "+ location + \
" function of "+self.__name__+ ". Input must be a Pandas dataframe")
exit(10)
def fit(self, X, y=None):
if self.method == 'custom':
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
X['population'] = X['population'].astype(float)
self.df = X[X['population'] > 1]
self.column_names = ['population']
return self
if self.method == 'median':
X['population'] = X['population'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.median = np.median(list(X[X['population'] != 0]['population']))
if math.isnan(self.median):
self.median = 0
self.column_names = ['population']
return self
if self.method == 'mean':
X['population'] = X['population'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.mean = np.mean(list(X[X['population'] != 0]['population']))
if math.isnan(self.mean):
self.mean = 0
self.column_names = ['population']
return self
if self.method == 'ignore':
self.column_names = ['population']
return self
def transform(self,X):
self.column_names = ['population']
if self.method == 'custom':
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
X['population'] = X['population'].astype(float)
population_transformed = []
for latitude, longitude, population in \
zip(X['latitude'],X['longitude'],X['population']):
radius = self.init_radius
if population <= 1:
population_temp = population
while population_temp <= 1 and radius <= 2:
df_temp = self.get_subset_records\
(latitude,longitude,self.df,radius)
population_temp = np.mean(df_temp['population'])
if math.isnan(population_temp):
population_temp = population
radius = self.increment_radius + radius
else:
population_temp =population
population_transformed.append(population_temp)
X['population'] = population_transformed
#self.column_names = ['population']
#self.column_names = list(X.columns)
self.column_names = ['population']
return X[['population']]
if self.method == 'median':
X['population'] = X['population'].astype(float)
X['population'] = X['population'].fillna(0)
population = np.array(list(X['population']))
population[population == 0] = self.median
self.column_names = ['population']
#self.column_names = list(X.columns)
#return X[['latitude','longitude','gps_height']]
X['population'] = population
return X[['population']]
if self.method == 'mean':
X['population'] = X['population'].astype(float)
X['population'] = X['population'].fillna(0)
population = np.array(list(X['population']))
population[population == 0] = self.mean
self.column_names = ['population']
#self.column_names = list(X.columns)
#return X[['latitude','longitude','gps_height']]
X['population'] = population
return X[['population']]
if self.method == 'ignore':
X['population'] = X['population'].astype(float)
X['population'] = X['population'].fillna(0)
return X[['population']]
YearTransformer
#We will create another variable: age,
#based on the year of construction and recorded date
class YearTransformer(BaseEstimator, TransformerMixin):
def __init__(self,method = 'custom'):
self.column_names = []
#self.init_radius = init_radius
#self.increment_radius = increment_radius
self.method = method
pass ##Nothing else to do
def fit(self, X, y=None):
X['construction_year'] = X['construction_year'].astype(float)
if self.method == 'custom':
year_recorded = X[X['construction_year'] > 0]\
['date_recorded'].\
apply(lambda x: int(x.split("-")[0]))
year_constructed = X[X['construction_year'] > 0]['construction_year']
self.median_age = np.median(year_recorded - year_constructed)
self.column_names = ['age']
return self
if self.method == 'median':
X['construction_year'] = X['construction_year'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.median = \
np.median(list(X[X['construction_year'] != 0]['construction_year']))
if math.isnan(self.median):
self.median = 0
self.column_names = ['construction_year']
return self
if self.method == 'mean':
X['construction_year'] = X['construction_year'].astype(float)
#X['gps_height'] = X['gps_height'].fillna(0)
self.mean = np.mean(list(X[X['construction_year'] != 0]['construction_year']))
if math.isnan(self.mean):
self.mean = 0
self.column_names = ['construction_year']
return self
if self.method == 'ignore':
self.column_names = ['construction_year']
return self
def transform(self,X):
if self.method == 'custom':
year_recorded = list(X['date_recorded'].apply(lambda x: int(x.split("-")[0])))
year_constructed = list(X['construction_year'])
age = []
for i,j in enumerate(year_constructed):
if j == 0:
age.append(self.median_age)
else:
temp_age = year_recorded[i] - year_constructed[i]
if temp_age < 0:
temp_age = self.median_age
age.append(temp_age)
X['age'] = age
self.column_names = ['age']
#self.column_names = X.columns
return X[['age']]
if self.method == 'median':
X['construction_year'] = X['construction_year'].astype(float)
X['construction_year'] = X['construction_year'].fillna(0)
construction_year = np.array(list(X['construction_year']))
construction_year[construction_year == 0] = self.median
self.column_names = ['construction_year']
X['construction_year'] = construction_year
return X[['construction_year']]
if self.method == 'mean':
X['construction_year'] = X['construction_year'].astype(float)
X['construction_year'] = X['construction_year'].fillna(0)
construction_year = np.array(list(X['construction_year']))
construction_year[construction_year == 0] = self.mean
self.column_names = ['construction_year']
X['construction_year'] = construction_year
return X[['construction_year']]
if self.method == 'ignore':
X['construction_year'] = X['construction_year'].astype(float)
X['construction_year'] = X['construction_year'].fillna(0)
self.column_names = ['construction_year']
return X[['construction_year']]
LatitudeLongitudeProcess
##Sekhar: This can be written as a generic class...
##but leave it as it is as of now.
##I will look at this later...
class LatitudeLongitudeProcess( BaseEstimator, TransformerMixin):
def __init__(self,strategy='median'):
'''
type = 'median' is the default.
other values of type can be 'custom'
'''
self.strategy = strategy
self.median_longitude = 0
self.custom_longitude = 0
self.median_latitude = 0
self.custom_latitude = 0
self.avg_lat_ward_dict = {}
self.avg_long_ward_dict = {}
self.avg_lat_lga_dict = {}
self.avg_long_lga_dict = {}
self.avg_lat_region_dict = {}
self.avg_long_region_dict = {}
self.avg_lat_country_dict = {}
self.avg_long_country_dict = {}
self.column_names = []
def get_level_means(self,X):
if 'ward' in X.columns:
#Get average of lats and longs at the ward level
#First delete rows that have unknown ward values:
df = X[~((X['ward'].isnull()) | (X['ward'] == 'unknown'))]
avg_lat_long_by_ward_df = df[df['longitude'] != 0]. \
groupby(['ward'])['latitude','longitude'].mean().reset_index()
if len(avg_lat_long_by_ward_df) > 0:
avg_lat_long_by_ward_df.columns=['ward','avg_latitude','avg_longitude']
self.avg_lat_ward_dict = dict(zip(list(avg_lat_long_by_ward_df['ward']),\
list(avg_lat_long_by_ward_df['avg_latitude'])))
self.avg_long_ward_dict = dict(zip(list(avg_lat_long_by_ward_df['ward']),\
list(avg_lat_long_by_ward_df['avg_longitude'])))
if 'lga' in X.columns:
#Get average of lats and longs at the lga level
#First delete rows that have unknown region values:
df = X[~((X['lga'].isnull()) | (X['lga'] == 'unknown'))]
avg_lat_long_by_lga_df = df[df['longitude'] != 0]. \
groupby(['lga'])['latitude','longitude'].mean().reset_index()
if len(avg_lat_long_by_lga_df) > 0:
avg_lat_long_by_lga_df.columns=['lga','avg_latitude','avg_longitude']
self.avg_lat_lga_dict = dict(zip(list(avg_lat_long_by_lga_df['lga']),
list(avg_lat_long_by_ward_df['avg_latitude'])))
self.avg_long_lga_dict = dict(zip(list(avg_lat_long_by_lga_df['lga']),
list(avg_lat_long_by_ward_df['avg_longitude'])))
if 'region' in X.columns:
#Get average of lats and longs at the region level
#First delete rows that have unknown region values:
df = X[~((X['region'].isnull()) | (X['region'] == 'unknown'))]
avg_lat_long_by_region_df = df[df['longitude'] != 0]. \
groupby(['region'])['latitude','longitude'].mean().reset_index()
if len(avg_lat_long_by_region_df) > 0:
avg_lat_long_by_region_df.columns=['region','avg_latitude','avg_longitude']
self.avg_lat_region_dict = dict(zip(list(avg_lat_long_by_region_df['region']),\
list(avg_lat_long_by_region_df['avg_latitude'])))
self.avg_long_region_dict = dict(zip(list(avg_lat_long_by_region_df['region']),\
list(avg_lat_long_by_region_df['avg_longitude'])))
#Get average of lats and longs at the country level
avg_long = np.mean(X[X['longitude'] != 0]['longitude'])
avg_lat = np.mean(X[X['latitude'] != 0]['latitude'])
self.avg_lat_country_dict['country'] = avg_lat
self.avg_long_country_dict['country'] = avg_long
def fit(self,X, y=None):
self.column_names = ['latitude','longitude']
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
if self.strategy == 'custom':
self.get_level_means(X)
elif self.strategy == 'mean':
#Impute using mean
self.mean_longitude = np.mean(X[X['longitude'] != 0]['longitude'])
self.mean_latitude = np.mean(X[X['latitude'] != 0]['latitude'])
#X.longitude = [i for i in X.longitude if np.abs(i) <= 0 self.mean_longitude else i]
#X.latitude = [i for i in X.latitude if np.abs(i) <= 0 self.mean_latitude else i]
elif self.strategy == 'median':
#Impute using median
self.median_longitude = np.median(X[X['longitude'] != 0]['longitude'])
self.median_latitude = np.median(X[X['latitude'] != 0]['latitude'])
#X.longitude = [i for i in X.longitude if np.abs(i) <= 0 self.median_longitude else i]
#X.latitude = [i for i in X.latitude if np.abs(i) <= 0 self.median_latitude else i]
else:
print("Invalid strategy supplied for LatitudeLongitudeProcess.")
print("Valid values are 'mean', 'median' or 'custom'. Terminating the program")
exit(10)
return self
def make_up_lat_long(self,X):
#Handle the situation gracefully, if the incoming data does not have any required columns
try:
latitude_list = list(X['latitude'].fillna(0))
except:
latitude_list = list(np.zeros(len_X))
#continue
try:
longitude_list = list(X['longitude'].fillna(0))
except:
longitude_list = list(np.zeros(len_X))
#continue
return latitude_list, longitude_list
def custom_transform(self, X):
len_X = len(X)
#Declare lists to hold the transformed lat and long
latitude_transformed = []
longitude_transformed = []
#Handle the situation gracefully, if the incoming data does not have any required columns
latitude_list, longitude_list = self.make_up_lat_long(X)
try:
ward_list = list(X['ward'].fillna('unknown'))
except:
ward_list = ['unknown'] * len_X
#continue
try:
lga_list = list(X['lga'].fillna('unknown'))
except:
lga_list = ['unknown'] * len_X
#continue
try:
region_list = list(X['region'].fillna('unknown'))
except:
region_list = ['unknown'] * len_X
#continue
for (i, j, k, l, m) in zip(latitude_list,longitude_list, \
ward_list,lga_list,region_list):
if np.round(i) == 0 or np.round(j) == 0:
try:
latitude_transformed.append(self.avg_lat_ward_dict[k])
longitude_transformed.append(self.avg_long_ward_dict[k])
except:
try:
latitude_transformed.append(self.avg_lat_lga_dict[l])
longitude_transformed.append(self.avg_long_lga_dict[l])
continue
except:
try:
latitude_transformed.append(avg_lat_region_dict[m])
longitude_transformed.append(avg_long_region_dict[m])
continue
except:
latitude_transformed.append(self.avg_lat_country_dict['country'])
longitude_transformed.append(self.avg_long_country_dict['country'])
continue
else:
latitude_transformed.append(i)
longitude_transformed.append(j)
X['latitude'] = latitude_transformed
X['longitude'] = longitude_transformed
return X
def transform(self,X):
X['latitude'] = X['latitude'].astype(float)
X['longitude'] = X['longitude'].astype(float)
self.column_names = ['latitude','longitude']
if self.strategy == 'custom':
X = self.custom_transform(X)
return X[['latitude','longitude']]
elif self.strategy == 'mean':
latitude_list, longitude_list = self.make_up_lat_long(X)
latitude_list = np.array(latitude_list)
longitude_list = np.array(longitude_list)
latitude_list[latitude_list == 0] = self.mean_latitude
longitude_list[longitude_list == 0] = self.mean_longitude
X['latitude'] = latitude_list
X['longitude'] = longitude_list
return X[['latitude','longitude']]
elif self.strategy == 'median':
latitude_list, longitude_list = self.make_up_lat_long(X)
latitude_list = np.array(latitude_list)
longitude_list = np.array(longitude_list)
latitude_list[latitude_list == 0] = self.median_latitude
longitude_list[longitude_list == 0] = self.median_longitude
X['latitude'] = latitude_list
X['longitude'] = longitude_list
return X[['latitude','longitude']]
#self.column_names = list(X.columns)
#return X
else:
print("EXCEPTION: The supplied strategy",self.strategy," is incorrect")
exit(10)
FunderInstTransformer
class FunderInstTransformer( BaseEstimator, TransformerMixin):
def __init__(self,initial_chars=3,groups=15,apply=True):
self.initial_chars = initial_chars
self.groups = groups
self.apply = apply
self.group_dict = dict()
def fetch_first_n_chars(self,l):
temp_l = [str(j).lower()[0:self.initial_chars] for j in list(l)]
return pd.Series(temp_l)
def fit(self,X, y=None):
if self.apply:
self.column_names = ['funder','installer']
self.group_dict = dict()
for i in X.columns:
temp_series = self.fetch_first_n_chars(X[i]).value_counts()
temp_series.sort_values(ascending=False,inplace=True)
top_groups = list(temp_series[0:self.groups].index)
self.group_dict[i] = top_groups
return self
else:
return self
def transform(self,X):
X = X.copy()
if self.apply:
for i in X.columns:
temp_series = self.fetch_first_n_chars(X[i])
temp_l = []
for j in temp_series.values:
try:
#print(self.group_dict[i])
if j in self.group_dict[i]:
#print(j)
temp_l.append(j)
else:
temp_l.append('other')
except:
continue
X[i] = temp_l
#X['funder_installer_same'] = X['funder']==X['installer']
#self.column_names = ['funder','installer','funder_installer_same']
#return X[['funder','installer','funder_installer_same']]
self.column_names = ['funder','installer']
return X[['funder','installer']]
else:
#self.column_names = ['funder','installer','funder_installer_same']
#return X[['funder','installer','funder_installer_same']]
self.column_names = ['funder','installer']
return X[['funder','installer']]
class Numpy2DFTransformer(BaseEstimator, TransformerMixin):
def __init__(self,columns):
self.column_names = columns
def fit(self, X, y=None):
return self
def transform(self,X):
return pd.DataFrame(X,columns=self.column_names)
##############
##IMPORTANT###
##############
#https://stackoverflow.com/questions/41837261/data-not-persistent-in-scikit-learn-transformers
#get_params() is very important to get the data persistent between
#the CV evaluation. If NOT defined, then the init params are NOT set and results in
#CV (GridSearch) failure
def get_params(self, deep=False):
return {'columns': self.column_names}
#Scale numeric data
class ScaleData(BaseEstimator, TransformerMixin):
def __init__(self,std_scaler=True,apply=True):
self.std_scaler = std_scaler
self.apply = apply
def fit(self,X, y=None):
if self.apply:
if self.std_scaler == True:
self.scaler = StandardScaler()
else:
self.scaler = MinMaxScaler()
return self.scaler.fit(X)
else:
return X
def transform(self,X):
if self.apply:
return self.scaler.transform(X)
else:
return X
ChooseCatPipelineType
class ChooseCatPipelineType(BaseEstimator, TransformerMixin):
def __init__(self,freq_pipeline,resp_pipeline,method='both'):
self.freq_pipeline = freq_pipeline
self.resp_pipeline = resp_pipeline
self.method = method
self.column_names = []
def fit(self,X, y=None):
if self.method == 'resp':
self.resp_pipeline.fit(X,y)
if self.method == 'freq':
self.freq_pipeline.fit(X,y)
if self.method == 'both':
self.both_pipeline = FeatureUnion(transformer_list = [ \
("freq_pipeline",self.freq_pipeline), \
("resp_pipeline",self.resp_pipeline) \
])
self.both_pipeline.fit(X,y)
return self
def transform(self,X):
if self.method == 'resp':
self.column_names = self.resp_pipeline.named_steps['CatMultiLabelTransformer'].column_names
return self.resp_pipeline.transform(X)
if self.method == 'freq':
self.column_names = self.freq_pipeline.named_steps['CatMultiLabelTransformer'].column_names
return self.freq_pipeline.transform(X)
if self.method == 'both':
self.column_names = self.freq_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
self.resp_pipeline.named_steps['CatMultiLabelTransformer'].column_names
return self.both_pipeline.transform(X)
All the above code was included in a file called pipelinepack.py.
We have used the following hyper parameters to generate the transformed data for training. These parameters were chosen as they have given a superior cross validation accuracy when we ran Random Forest model with various parameter combinations.
The code to generate the transformed training and validation data, based on the above listed options is given below:
#Import all the required packages:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, LabelBinarizer, Imputer, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
from sklearn.externals import joblib
import math
import pickle
from pipelinepack import *
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
#Prepare data types dict
train_data_type = {
'id':int,
'amount_tsh': float,
'date_recorded': str,
'funder': str,
'gps_height': float,
'installer': str,
'longitude': float,
'latitude': float,
'wpt_name': str,
'num_private': float,
'basin': str,
'subvillage': str,
'region': str,
'region_code': str,
'district_code': str,
'lga': str,
'ward': str,
'population': float,
'public_meeting': str,
'recorded_by': str,
'scheme_management': str,
'scheme_name': str,
'permit': str,
'construction_year': int,
'extraction_type': str,
'extraction_type_group': str,
'extraction_type_class': str,
'management': str,
'management_group': str,
'payment': str,
'payment_type': str,
'water_quality': str,
'quality_group': str,
'quantity': str,
'quantity_group': str,
'source': str,
'source_type': str,
'source_class': str,
'waterpoint_type': str,
'waterpoint_type_group': str
}
train_label_type = {
'id':int,
'status_group': str
}
##Function to read the files
def read_files(file,dtype):
df = pd.read_csv(file,dtype=dtype)
return df
##Function to split the data into test and training
def get_test_train_data(training_values,training_labels):
training_values.index = list(training_values['id'])
training_labels.index = list(training_labels['id'])
training_values.drop(["id"],axis=1,inplace=True)
training_labels.drop(["id"],axis=1,inplace=True)
training_data=training_values.join(training_labels,
how='inner',lsuffix='_values',
rsuffix='labels')
X = training_data.drop('status_group',axis=1)
y = training_data['status_group']
X_train, X_test, y_train, y_test = train_test_split\
( X, y, \
test_size=0.2, \
random_state=42, \
stratify=y)
return X_train, X_test, y_train, y_test
#Columns with many levels.
high_levels_cat_columns = ['subvillage','lga','ward'
#,'scheme_name'
]
#Columns with less number of levels.
low_levels_cat_columns = [
'basin','region','district_code',
'public_meeting',
'scheme_management',
'permit',
#'extraction_type',
'extraction_type_group',
#'extraction_type_class',
'management',
#'management_group',
'payment_type',
'water_quality',
#'quality_group',
'quantity_group',
'source',
#'source_type','source_class',
'waterpoint_type'
#,'waterpoint_type_group'
]
#Columns which need fuzzy matching. These columns have many levels.
fuzzy_logic_columns = ['funder','installer']
#Pipelines definition
fuzz_pipeline = Pipeline([ \
('selector',DataFrameSelector(fuzzy_logic_columns)), \
('cat_nulls', HandleCategoricalNulls()), \
('FunderInstTransformer', \
FunderInstTransformer(initial_chars = 3, \
groups = 15,apply=True)), \
('CatMultiLabelTransformer',\
CatMultiLabelTransformer(apply=True)) \
])
cat_pipeline_high_level_freq_based = Pipeline([ \
('selector',DataFrameSelector(high_levels_cat_columns)), \
('cat_nulls', HandleCategoricalNulls()), \
('FreqBasedCategoricalBinning', \
FreqBasedCategoricalBinning(buckets=20,apply=True)), \
('CatMultiLabelTransformer',CatMultiLabelTransformer()) \
])
cat_pipeline_high_level_resp_based = Pipeline([ \
('selector',DataFrameSelector(high_levels_cat_columns)), \
('cat_nulls', HandleCategoricalNulls()), \
('RespBasedCategoricalBinning', \
RespBasedCategoricalBinning(buckets=20,apply=True)), \
('CatMultiLabelTransformer',CatMultiLabelTransformer()) \
])
choose_high_level_cat_pipeline = Pipeline([ \
('ChooseCatPipelineType', ChooseCatPipelineType( \
freq_pipeline = cat_pipeline_high_level_freq_based, \
resp_pipeline = cat_pipeline_high_level_resp_based, \
method = 'freq')) \
])
cat_pipeline_low_level = Pipeline([ \
('selector',DataFrameSelector(low_levels_cat_columns)), \
('cat_nulls', HandleCategoricalNulls()), \
('CatMultiLabelTransformer',CatMultiLabelTransformer(apply=True)) \
])
#Combine all categorical pipelines first:
full_categorical_pipeline = FeatureUnion(transformer_list = [ \
("fuzz_pipeline",fuzz_pipeline), \
("choose_high_level_cat_pipeline",\
choose_high_level_cat_pipeline), \
("cat_pipeline_low_level", cat_pipeline_low_level) \
])
#amount_tsh pipelines
amount_tsh_prep_pipeline = Pipeline([ \
('selector',\
DataFrameSelector(['source_class', 'basin', \
'waterpoint_type_group'])), \
('cat_nulls', HandleCategoricalNulls()) \
])
amount_tsh_selector = Pipeline([ \
('selector',DataFrameSelector(['amount_tsh']))])
amount_tsh_transformer = Pipeline([ ('amount_tsh_prep',FeatureUnion(transformer_list = [ \
("amount_tsh_prep_pipeline",amount_tsh_prep_pipeline), \
("amount_tsh_selector",amount_tsh_selector)])) \
,("Numpy2DFTransformer", \
Numpy2DFTransformer(['source_class', \
'basin', \
'waterpoint_type_group',\
'amount_tsh'])) \
,('AmountTSHTransformer', \
AmountTSHTransformer()) \
])
#age pipeline
age_pipeline = Pipeline([ \
('selector',DataFrameSelector(['date_recorded',\
'construction_year'])), \
('YearTransformer',YearTransformer())
])
#Lat Long pipelines
lat_long_prep_pipeline = Pipeline([ \
('selector',DataFrameSelector(['lga',\
'region',\
'ward'])), \
('cat_nulls', HandleCategoricalNulls()) \
])
lat_long_selector = Pipeline([ \
('selector',DataFrameSelector(['longitude','latitude']))])
lat_long_transformer = Pipeline([ ('lat_long_prep',\
FeatureUnion(transformer_list = [ \
("lat_long_prep_pipeline",\
lat_long_prep_pipeline), \
("lat_long_selector",\
lat_long_selector)])) \
,("Numpy2DFTransformer", \
Numpy2DFTransformer(['lga','region',\
'ward','longitude',\
'latitude'])) \
,('LatitudeLongitudeProcess', \
LatitudeLongitudeProcess(strategy="custom")) \
])
#lat_long_transformer always return the
#data in latitude, longitude order
#gps_height pipelines.
#This is dependent on the lat_long pipeline
gps_height_transformer = Pipeline([('gps_height_prep',\
FeatureUnion(transformer_list=[('lat_long_transformer', \
lat_long_transformer), \
('gps_selector',DataFrameSelector(['gps_height']))]))
,("Numpy2DFTransformer", \
Numpy2DFTransformer(['latitude','longitude','gps_height']))
,('GPSHeightTransformer',GPSHeightTransformer(method='median'))
#,('GPSHeightTransformer',GPSHeightTransformer(method='custom'))
])
#population pipelines.
#This is dependent on the lat_long pipeline
population_transformer = Pipeline([('population_prep',\
FeatureUnion(transformer_list=[('lat_long_transformer', \
lat_long_transformer), \
('population_selector',DataFrameSelector(['population']))]))
,("Numpy2DFTransformer", \
Numpy2DFTransformer(['latitude','longitude','population']))
,('PopulationTransformer',PopulationTransformer(method = 'ignore'))
#,('PopulationTransformer',PopulationTransformer(method = 'custom')) #NOT worth
# ,('PopulationTransformer',PopulationTransformer(method = 'median')) #NOT Worth
])
full_numeric_transformations = Pipeline([('all_numeric_transformations', \
FeatureUnion(transformer_list = \
[('amount_tsh_transformer',amount_tsh_transformer), \
('lat_long_transformer',lat_long_transformer), \
('gps_height_transformer',gps_height_transformer), \
('population_transformer',population_transformer), \
('age_pipeline',age_pipeline) \
])) \
,('scaler',ScaleData())
])
all_transformations = Pipeline([ \
('all_transformations', \
FeatureUnion(transformer_list = \
[('full_categorical_pipeline',\
full_categorical_pipeline), \
('full_numeric_transformations',\
full_numeric_transformations)])) \
])
predict_pipeline = Pipeline([('all_transformations',all_transformations), \
('rf',RandomForestClassifier(n_estimators=5000,n_jobs=-1))])
def get_test_train_data(training_values,training_labels):
training_values.index = list(training_values['id'])
training_labels.index = list(training_labels['id'])
training_values.drop(["id"],axis=1,inplace=True)
training_labels.drop(["id"],axis=1,inplace=True)
training_data=training_values.join(training_labels,
how='inner',lsuffix='_values',
rsuffix='labels')
X = training_data.drop('status_group',axis=1)
y = training_data['status_group']
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42, stratify=y)
return X_train, X_test, y_train, y_test
def prepare_data(split_test_train=True,save=False):
training_data = read_files('./data/training_data.csv',dtype = train_data_type)
training_labels = read_files('./data/training_labels.csv',dtype = train_label_type)
if split_test_train:
X_train, X_test, y_train, y_test = get_test_train_data(training_data,training_labels)
#Save the data
if save:
X_test.to_csv('X_test.csv',index=False)
y_test.to_csv('y_test.csv',index=False)
X_train.to_csv('X_train.csv',index=False)
y_train.to_csv('y_train.csv',index=False)
#prepare combined data frames:
X_train['status_group'] = y_train
X_test['status_group'] = y_test
return X_train, X_test
else:
training_data['status_group'] = training_labels['status_group']
return training_data, None
##Function to fit random forest with default parameters
def default_rf():
train_data, test_data = prepare_data(split_test_train=True)
X_train = train_data.drop(['status_group'],axis=1)
y_train = train_data['status_group']
X_test = test_data.drop(['status_group'],axis=1)
y_test = test_data['status_group']
#temp_df = all_transformations.fit_transform(X_train,y_train)
predict_pipeline.fit(X_train,y_train)
y_test_predict = predict_pipeline.predict(X_test)
print("Test Accuracy score: ",accuracy_score(y_test,y_test_predict))
all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
population_transformer.named_steps['PopulationTransformer'].column_names + \
age_pipeline.named_steps['YearTransformer'].column_names
feature_imp = list(zip(predict_pipeline.named_steps['rf'].feature_importances_,list(all_columns)))
#print feature_imp
feature_imp.sort(key = lambda X: X[0],reverse=True)
feature_imp= pd.DataFrame(feature_imp)
feature_imp.columns = ['score','variable']
feature_imp['percentage'] = feature_imp['score']/np.max(feature_imp['score']) * 100
print feature_imp
feature_imp.to_csv('rf_variable_imp.csv',index=False)
##Function to fit gradient boosting with default parameters
def default_gb():
predict_pipeline = Pipeline([('all_transformations',all_transformations), \
('gb',GradientBoostingClassifier(n_estimators=300))])
train_data, test_data = prepare_data(split_test_train=True)
X_train = train_data.drop(['status_group'],axis=1)
y_train = train_data['status_group']
X_test = test_data.drop(['status_group'],axis=1)
y_test = test_data['status_group']
#temp_df = all_transformations.fit_transform(X_train,y_train)
predict_pipeline.fit(X_train,y_train)
y_test_predict = predict_pipeline.predict(X_test)
print("Test Accuracy score: ",accuracy_score(y_test,y_test_predict))
all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
population_transformer.named_steps['PopulationTransformer'].column_names + \
age_pipeline.named_steps['YearTransformer'].column_names
feature_imp = list(zip(predict_pipeline.named_steps['gb'].feature_importances_,list(all_columns)))
#print feature_imp
feature_imp.sort(key = lambda X: X[0],reverse=True)
feature_imp= pd.DataFrame(feature_imp)
feature_imp.columns = ['score','variable']
feature_imp['percentage'] = feature_imp['score']/np.max(feature_imp['score'])
print feature_imp
def transform_to_share():
training_data = read_files('./data/training_data.csv',dtype = train_data_type)
training_labels = read_files('./data/training_labels.csv',dtype = train_label_type)
train_data_id = list(training_data['id'])
train_labels_id = list(training_labels['id'])
training_data.index = train_data_id
training_labels.index = train_labels_id
training_data.drop(["id"],axis=1,inplace=True)
training_labels.drop(["id"],axis=1,inplace=True)
training_transformed = all_transformations.fit_transform(training_data,training_labels)
all_columns = fuzz_pipeline.named_steps['CatMultiLabelTransformer'].column_names + \
choose_high_level_cat_pipeline.named_steps['ChooseCatPipelineType'].column_names + \
cat_pipeline_low_level.named_steps['CatMultiLabelTransformer'].column_names + \
amount_tsh_transformer.named_steps['AmountTSHTransformer'].column_names + \
lat_long_transformer.named_steps['LatitudeLongitudeProcess'].column_names + \
gps_height_transformer.named_steps['GPSHeightTransformer'].column_names + \
population_transformer.named_steps['PopulationTransformer'].column_names + \
age_pipeline.named_steps['YearTransformer'].column_names
training_transformed = pd.DataFrame(training_transformed,columns = all_columns)
training_transformed.index = train_data_id
training_transformed = training_transformed.reset_index()
training_transformed_cols = list(training_transformed.columns)
training_transformed_cols[0] = "id"
training_transformed.columns = training_transformed_cols
training_transformed.to_csv('training_transformed.csv',index=False)
submission_data = pd.read_csv('./data/test_data.csv')
submission_data_id = list(submission_data['id'])
submission_data = submission_data.drop(['id'],axis=1)
submission_transformed = all_transformations.transform(submission_data)
submission_transformed = pd.DataFrame(submission_transformed,columns = all_columns)
submission_transformed.index = submission_data_id
submission_transformed = submission_transformed.reset_index()
submission_transformed.columns = training_transformed_cols
submission_transformed.to_csv('submission_transformed.csv',index=False)
def main():
#Train a rand forest model (5000 trees)
default_rf()
#Generate data for submission to driven data
submission_rf()
#Generate data to share with peers so that they can train in R language
transform_to_share()
#train using default parms (gradient boosting alg)
default_gb()
The source code for ensemble model based on Random Forest, Gradient Boosting and C5.0 Boosted trees is given below. Whenever Gradient Boosting and C5.0 Boosted trees predict “functional needs repair”, then we will predict “functional needs repair” (irrespective of the random forest model’s output). In all other cases we will use the random forest output.
import pandas as pd
#We created a data frame called all_models_out.csv, which has the following columns:
#'id','rf','c5','boosted_tr'
df = pd.DataFrame('all_models_out.csv')
status_group=[]
id = []
for i in df.index:
id_current,rf,c5,boosted_tr = df.iloc[i]
if c5 == 'functional needs repair' and boosted_tr == 'functional needs repair':
status_group.append('functional needs repair')
else:
status_group.append(rf)
id.append(id_current)
submission_df = pd.DataFrame(list(zip(id,status_group)),columns=['id','status_group'])
submission_df.to_csv('ensemble_submission.csv',index=False)
Six distinct types of classification models were constructed for purposes of predicting the current status of a given water pump:
A description of our attempt at quantifying the predictive value of each of the independent variables is provided below followed by a brief discussion of each model’s structure and performance.
As discussed in Data Exploration, many of the data set’s 30+ independent variables were found to have potential predictive characteristics. The strength of those characteristics was measured by constructing a separate multinomial logistic regression model for each independent variable relative to the data set’s response variable. The accuracy and Alkaline Information Criterion (AIC) results of those models allowed us to rank our independent variables on the basis of their likely predictive “strength”, as summarized in the table below.
Tanzania MLR Results - Single Ind. Variable Models
Var Name | Accuracy | AIC |
---|---|---|
quantity | 64.9% | 65186 |
waterpoint_type | 63.5% | 68496 |
extraction_type | 62.6% | 68684 |
installer (binned) | 58.1% | 71687 |
pump_age | 57.9% | 71466 |
reg_code + dst_code | 57.1% | 70004 |
region_code | 57.1% | 70455 |
region | 57.1% | 70769 |
funder(binned) | 56.9% | 72151 |
water_quality | 56.9% | 72385 |
payment | 56.7% | 71071 |
basin | 55.9% | 72664 |
district_code | 55.8% | 72855 |
source | 55.2% | 72155 |
management | 54.7% | 72546 |
lga (binned) | 54.7% | 72341 |
scheme_management | 54.4% | 72673 |
ward (binned) | 54.3% | 72220 |
public_meeting_new | 54.47% | 73798 |
permit_new | 54.3% | 73938 |
amount_tsh_new | 54.3% | 73700 |
lat + long | 54.3% | 73712 |
population_new | 54.3% | 73975 |
gps_height_new | 54.3% | 73587 |
subvillage (binned) | 54.3% | 73949 |
As indicated above, quantity
, waterpoint_type
and extraction_type
proved to have the highest levels of independent predictive value for MLR modeling, while subvillage
, gps_height
, population
, latititude
and longitude
were all deemed to be of little value for MLR modeling purposes. The derived pump_age
variable was also found to have a relatively high level of predictive value.
The ranking of variables relative to their MLR predicitve accuracy served as the basis of a forward selection MLR model build process whereby individual variables were added to the model in descending order of their predictive value until no further gains in either model accuracy or AIC score were possible. That process led us to conclude that three variables (gps_height
, subvillage
, region
) should be excluded from an MLR model due to their inability to improve either the model’s accuracy or AIC score. Furthermore, it was found that the district_code
variable added most value when added to the MLR model in conjunction with the inclusion of the region_code
variable: This is likely due to the fact that the district_code
variable is not truly independent since its values are not unique across each of the possible region
and region_code
values. Results of the forward selection MLR model building process are shown below.
MLR Forward Selection Results
Variable(s) Added | Model Accuracy | AIC |
---|---|---|
quantity | 64.90 % | 65186 |
waterpoint_type | 69.24 % | 61555 |
extraction_type | 70.03 % | 59279 |
installer (binned) | 70.72 % | 57765 |
pump_age | 71.38 % | 57031 |
reg_code + dst_code | 72.42 % | 54713 |
funder (binned) | 72.70 % | 54251 |
water_quality | 72.88 % | 54120 |
payment | 73.86 % | 52968 |
basin | 74.10 % | 52765 |
source | 74.40 % | 52298 |
management | 74.66 % | 51766 |
lga (binned) | 74.77 % | 51500 |
scheme_management | 74.96 % | 51431 |
ward (binned) | 74.94 % | 51272 |
public_meeting | 74.93 % | 51189 |
permit | 75.03 % | 51135 |
amount_tsh | 75.02 % | 51128 |
latitude + longitude | 75.12 % | 51107 |
population | 75.15 % | 51064 |
Ineffective Variables |
---|
gps_height_new |
subvillage (binned) |
region |
Cross validation of the resulting 22-variable MLR model indicated an overall model accuracy of 75.08%. A confusion matrix and detailed summary statistics from the cross validation process are shown below. Please note that " Class 0 " represents functional
pump status; " Class 1 " represents functional needs repair
and " Class 2 " represents non functional
.
MLR Confusion Matrix and Statistics
As shown in the summary above, the MLR model is most challenged when attempting to accurately identify pumps having a status of functional needs repair
as indicated by the Balanced Accuracy metric. This is unsurprising given the fact that only 7.3% of the training data set’s records have that status.
A KNN model was constructed using the same mix of variables identified via the MLR forward selection process. Cross validation of the model indicated an overall model accuracy of 74.93% and an optimal K value of k = 5. A confusion matrix and detailed summary statistics from the cross validation process are shown below.
KNN Confusion Matrix and Statistics
As shown in the summary above, the KNN model does a much better job of correctly identifying pumps having a status of functional needs repair
(64.73% balanced accuracy) than does the MLR model (55.31% balanced accuracy). This suggests that the KNN model may be preferable to the MLR model despites the KNN’s slightly lower overall accuracy score. In fact, the KNN model’s kappa score exceeds that of the MLR model, which is also indicative of the KNN model likely being preferable to the MLR model.
A bootstrap aggregation classification model based on the same set of 22 independent variables identified via the MLR forward selection process outperformed both the KNN and MLR models, achieving an overall accuracy of 79.91%. A confusion matrix and detailed summary statistics from the cross validation process are shown below.
Bootstrap Aggregation Confusion Matrix and Statistics
The Balanced Accuracy metric indicates the bootstrap aggregation algorithm exceeded the performance of the KNN model across each of the three pump statuses and also achieved a much higher kappa value (0.6281) than either the MLR (0.5156) or KNN models (0.5334). As such, the bootstrap aggregation method appears to be preferable to both the KNN and MLR models.
An SVM model constructed from the same variables as the MLR model yielded an overall accuracy of 76.95%, exceeding that of both the MLR and KNN models but below that of the bootstrap aggregation model. A confusion matrix and detailed summary statistics from the cross validation process are shown below.
SVM Confusion Matrix and Statistics
The Balanced Accuracy metric indicates the SVM algorithm (55.57%) significanctly underperforms both the KNN (64.73%) and bootstrap aggregation (67.61%) approaches for correctly identifying pumps having a status of functional needs repair
. So while the SVM model achieves a slightly higher kappa value than the KNN model (0.5391 vs. 0.5334), the SVM model’s relatively weak performance relative to the functional needs repair
status rates its efficacy below both that of both the bootstrap aggregation and KNN models.
The random forest model was constructed in R using the same variables used for the MLR model and a total of 500 classification trees. While an initial attempt was made to add both the gps_height
and subvillage
variables to the model, their inclusion did not improve its performance. The subsequent random forest model yielded an overall accuracy of 81.13%, which exceeds that of all of the other five types of models that were constructed. A confusion matrix and detailed summary statistics from the cross validation process are shown below.
Random Forest Confusion Matrix and Statistics
The Balanced Accuracy metric indicates that despite having the highest overall accuracy, the random forest algorithm (64.06%) underperforms both the KNN (64.73%) and bootstrap aggregation (67.61%) approaches for correctly identifying pumps having a status of functional needs repair
. However, the random forest model also achieved a higher kappa value than did each of the other models, including the bootstrap aggregation model (0.6281).
Variable Importance
R provides a mechanism for ranking the “importance” of each variable within a random forest model. The output of that mechanism indicated that both the ward
and lga
variables had limited use within the random forest model. Experiments indicated that removal of both variables had little effect on the model’s overall accuracy. As such, the dimensionality of the model could potentially be reduced by a total of 30 binning variables with limited impact on the efficacy of the model.
As with the MLR model, quantity
, extraction_type
, and waterpoint_type
were found to have the highest levels of predictive value / “importance”, as were longitude
and latitude
, both of which, by contrast, had little predictive value for the MLR model. The derived pump_age
variable was also deemed to be of high “importance” to the model.
Subsequent experiments (not detailed here) also showed that adding the gps_height
variable did not improve the model’s performance, but did result in gps_height
being deemed as having a high level of “importance” within the model. Results of the initial “importance” assessment are provided below.
Variable | Importance Score |
---|---|
quantity | 2386.462153 |
longitude | 1280.208366 |
latitude | 1243.581667 |
extraction_type | 1178.682671 |
pump_age | 1116.002314 |
waterpoint_type | 1093.158956 |
payment | 816.229526 |
region_code | 681.391666 |
population_new | 661.159021 |
source | 578.178014 |
district_code | 555.321904 |
basin | 427.311991 |
scheme_management | 384.212795 |
management | 372.834456 |
amount_tsh_new | 345.614271 |
water_quality | 299.133328 |
funder_other | 153.621007 |
ward_freq_bin_3 | 149.327702 |
ward_freq_bin_2 | 142.580870 |
funder_gov | 135.037070 |
installer_other | 132.631561 |
installer_dwe | 123.834254 |
permit_new | 121.603216 |
ward_freq_bin_4 | 110.478197 |
public_meeting_new | 108.145648 |
ward_freq_bin_1 | 102.555506 |
ward_freq_bin_5 | 89.906952 |
lga_freq_bin_4 | 80.632662 |
lga_freq_bin_3 | 78.191339 |
lga_freq_bin_6 | 74.982971 |
lga_freq_bin_5 | 71.640857 |
lga_freq_bin_10 | 61.760992 |
lga_freq_bin_2 | 61.629666 |
lga_freq_bin_7 | 60.363336 |
installer_gov | 58.517909 |
ward_freq_bin_6 | 57.437098 |
installer_cen | 57.360212 |
funder_wor | 57.045224 |
installer_rwe | 54.030158 |
lga_freq_bin_9 | 49.896391 |
ward_freq_bin_7 | 49.652331 |
ward_freq_bin_8 | 48.296252 |
funder_dan | 39.977137 |
installer_unk | 37.709373 |
lga_freq_bin_8 | 37.567692 |
funder_hes | 36.508775 |
lga_freq_bin_20 | 35.405233 |
installer_dis | 35.347900 |
funder_unk | 34.678247 |
installer_com | 34.384736 |
funder_tas | 33.928214 |
funder_dws | 31.883431 |
funder_pri | 29.892122 |
funder_uni | 25.143246 |
ward_freq_bin_9 | 24.503300 |
funder_wat | 24.246632 |
installer_wor | 23.716591 |
funder_rws | 22.778945 |
installer_hes | 20.582274 |
funder_dis | 18.413152 |
installer_tcr | 18.133932 |
installer_dan | 17.600180 |
funder_kkk | 17.576963 |
installer_fin | 15.507702 |
lga_freq_bin_11 | 14.531972 |
installer_kkk | 14.371995 |
funder_fin | 14.235434 |
lga_freq_bin_1 | 14.198987 |
funder_dhv | 13.691042 |
ward_freq_bin_10 | 10.414861 |
ward_freq_bin_14 | 9.585436 |
ward_freq_bin_13 | 9.147670 |
installer_0 | 7.232043 |
ward_freq_bin_12 | 6.411819 |
ward_freq_bin_16 | 5.501362 |
installer_ces | 5.378000 |
ward_freq_bin_11 | 3.980340 |
ward_freq_bin_15 | 3.305827 |
ward_freq_bin_20 | 2.600001 |
ward_freq_bin_17 | 2.306922 |
The C5.0 boosted classification tree model adds the gps_height
variable to those used for the MLR model. A total of 20 boosting trials were applied during model training, and the resulting model yielded an overall accuracy of 80.40%, which exceeds that of all of the other models except the random forest. A confusion matrix and detailed summary statistics for the C5.0 model are shown below.
C5.0 Confusion Matrix and Statistics
The Balanced Accuracy metric indicates that despite having slightly lower overall accuracy and kappa score than the random forest algorithm, the C5.0 model (65.627%) outperforms both the KNN (64.73%) and random forest (64.064%) approaches, though not the bootstrap aggregation approach, for correctly identifying pumps having a status of functional needs repair
. The C5.0 model’s kappa score also exceeds that of all other models except that of the random forest.
Variable Importance
As with the random forest model, R provides a mechanism for ranking the “importance” of each variable within a C5.0 model. However, our experiments found that removing any of the variables used during model training decreased the performance of the C5.0 model. The quantity
, extraction_type
and waterpoint_type
variables were again deemed to have the highest levels of relevance to the efficacy of the model, as were longitude
, one of the binned lga
variables and a subset of the binned ward
variables. Results of the initial “importance” assessment are provided below.
Variable | Importance Score |
---|---|
quantity | 100.00 |
extraction_type | 100.00 |
lga_freq_bin_10 | 100.00 |
ward_freq_bin_16 | 100.00 |
ward_freq_bin_20 | 100.00 |
longitude | 99.99 |
waterpoint_type | 99.97 |
management | 99.84 |
ward_freq_bin_11 | 99.28 |
pump_age | 99.24 |
source | 98.99 |
amount_tsh_new | 98.90 |
region_code | 98.62 |
water_quality | 98.62 |
latitude | 97.53 |
payment | 97.24 |
district_code | 96.90 |
scheme_management | 95.25 |
funder_dws | 92.77 |
gps_height_new | 92.30 |
population_new | 92.21 |
lga_freq_bin_20 | 92.04 |
installer_ces | 89.43 |
ward_freq_bin_17 | 89.43 |
ward_freq_bin_8 | 88.22 |
installer_rwe | 86.90 |
basin | 86.71 |
ward_freq_bin_10 | 83.44 |
ward_freq_bin_7 | 82.98 |
funder_pri | 82.54 |
installer_wor | 81.31 |
installer_0 | 81.22 |
ward_freq_bin_12 | 81.20 |
funder_rws | 80.87 |
installer_com | 80.66 |
installer_cen | 80.03 |
installer_gov | 79.21 |
lga_freq_bin_1 | 78.87 |
public_meeting_new | 76.77 |
funder_wor | 75.98 |
funder_uni | 75.71 |
installer_hes | 75.25 |
ward_freq_bin_6 | 74.34 |
funder_hes | 74.19 |
funder_dhv | 71.43 |
lga_freq_bin_2 | 71.03 |
funder_dan | 70.76 |
installer_fin | 70.40 |
ward_freq_bin_1 | 69.92 |
ward_freq_bin_4 | 69.80 |
lga_freq_bin_11 | 69.56 |
funder_gov | 68.41 |
ward_freq_bin_5 | 67.40 |
lga_freq_bin_3 | 67.36 |
funder_unk | 67.09 |
funder_tas | 65.18 |
ward_freq_bin_13 | 64.32 |
funder_wat | 64.31 |
ward_freq_bin_14 | 63.98 |
lga_freq_bin_6 | 62.75 |
lga_freq_bin_8 | 62.66 |
ward_freq_bin_9 | 59.46 |
funder_dis | 59.26 |
ward_freq_bin_2 | 58.27 |
installer_unk | 57.53 |
ward_freq_bin_3 | 56.92 |
installer_tcr | 56.65 |
installer_dwe | 55.18 |
installer_dis | 55.03 |
lga_freq_bin_7 | 53.77 |
lga_freq_bin_9 | 53.71 |
lga_freq_bin_4 | 53.34 |
lga_freq_bin_5 | 51.60 |
installer_dan | 50.99 |
permit_new | 50.13 |
funder_kkk | 44.81 |
installer_kkk | 39.89 |
installer_other | 37.96 |
funder_other | 36.33 |
funder_fin | 31.87 |
ward_freq_bin_15 | 8.82 |
The R code used for each of the six types of predictive / classification models is available on Github. Additional R code modules were used to apply a set of custom imputations to both the training and test data set provided by the Tanzanian Ministry of Water. Links to the source code for each module are provided below.
Training Data Custom Transformations
Test Data Custom Transformations
Multinomial Logistic Regression R Code
K Nearest Neighbors
Support Vector Machines
Bootstrap Aggregation
Random Forest
C5.0 Boosted Trees
##################################### DATA 698.02 Capstone Project ########################################################
#
# This is a Shiny web application of Tanzanian WaterPump project on driven data.org
#
# The purpose of this project is to develop, asses, and compare a variety of predictive / classification models
#
# with the best-performing model being implemented as part of a complete end-to-end intelligent system.
#
# Find out more about building applications with Shiny here:
#
# http://shiny.rstudio.com/
###################################################### Libraries #########################################################
library(ggthemes) # for graphs
library(knitr) # dynamic report generation in R using Literate Programming techniques.
library(tidyr) # Tidy Data
library(readr) # reading in data
library(dplyr) # data manupilation
library(ggplot2) # plotting of Graphs
library(DT) # data tables
library(png) # Read and write PNG images
library(markdown) # document formatting
library(leaflet) # Maps
library(RColorBrewer)# Color Palette
library(tidyverse) # install core packages
library(plotly) # Interactivity with Charts
library(shinythemes) # Shiny Theme
library(scales) # for percentage scales
############################################### Load data Source #########################################################
tanz_Train<-tbl_df(read.csv("Tanz_Pump_Data.csv", header = TRUE, stringsAsFactors = FALSE))
tanz_Train$pump_age<-cut(tanz_Train$pump_age, seq(from = 1, to = 53, by = 3)) #bin pump_age
tanz_Train=data.frame(tanz_Train)#create data frame
tanz_Train<-na.exclude (tanz_Train) %>% #exclude NA's
filter(!is.na(latitude) & !is.na(longitude))
############################################## Define UI for application #################################################
ui <- navbarPage("Data 698 Tanzanian Water Pump Project", # Navigation bar title
tabPanel("About",
fluidPage(theme = shinytheme("cerulean"),
titlePanel(title = h4("About Tanzanian Water Pump Project",align="center")),
fluidRow(
column(4,
tags$img(src="pumping.jpg",width = 400,height = 287, style="text-align: center;")),#load Image
column(6,
tags$style(".help-block{color: black;font-size: 16px;font-style: normal;font-family:Georgia}") ,
helpText("The purpose of this project is to develop, assess, and compare a variety of predictive / classification models ",
"with the best-performing model being implemented as part of a complete end-to-end intelligent system",
"that allows the user (i.e., the Tanzanian Ministry of Water) to pro-actively identify wells that are either in need of repair or are in inoperable condition.",
"The data is provided by Taarifa and the Tanzanian Ministry of Water,more about the competition can be found", tags$a(href="https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/page/25/","here"),
".We will take various data points into consideration such as when was the pumps installed, and how it is managed ,Lattitude and Longitude etc. ",
"The outcome of understanding which water point pumps will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania.",
"The data files ,methodology and code can be accessed from this:",
tags$a(href="https://github.com/jtopor/DDO-TNZ/tree/master/Model-Build%2BPred-RCode","GitHub Link!"),
br(),tags$b( "Team Members"),
tags$ul(tags$li("James Topor"),tags$li("Sekhar Mekala"),tags$li("Sonya Hong"),tags$li("Steve Dunn")))
)))),
tabPanel("Introduction",
fluidPage(titlePanel(title = h4("Methodology,Analysis and Exploration",align="center")),
tags$iframe(style="height:500px;width:100%;scrolling=yes",src="Tanzanian_Introduction.pdf")
)),#load rendered PDF file
tabPanel("Data Table",
fluidPage(theme = shinytheme("cerulean"),#shiny theme
titlePanel(title = h4("Tanzanian Water Pump Data",align="center")),# Title
sidebarLayout(#Side Bar Panel
sidebarPanel(helpText(tags$b("Data Table"),
br(),
"Data is from test data set and Predictive Output by Status created from six different predictive models",
br(),
"the table can be searched ,sorted and filtered by columns",
"or uncheck the options to reduce the number of columns displayed." ),
checkboxGroupInput("show_vars", "Data Columns:",
names(tanz_Train), selected = names(tanz_Train))),
# Create a new row for the table
mainPanel(
fluidRow(
DT::dataTableOutput("table")
)
)
)
)
),
#########################Bar Plot output from the data set##################################
tabPanel("Plot",
fluidPage(
titlePanel(title = h4("Outcome of Predictive Models",align="center")),
fluidRow(
helpText("Select data element from dropdown"),
selectInput("TGroup",
label ="Data Columns:",
choices=colnames(tanz_Train[,20:33]),selected = "waterpoint_type"
),
helpText("Select predictive model from the list"),
column(width=2,
radioButtons("PrediChoice", "Predictive Models:",
c("RF","SVM","KNN","MLR","C50","BAG")
)),
mainPanel(
plotlyOutput("dplots",width="auto",height = "400")
)
)
)
),
################################### Map of tanzania with population circles ############################################
tabPanel("Map Visualization",
fluidPage(#theme = shinytheme("cerulean",
titlePanel(title = h4("The bubble size is based on the population and the bubble color represents the water point type ",align="center")),
column(3,
helpText("The output Map of Tanzania shows the results of the predictive models when the bubbles are clicked.",
"The hover text shows the Water point types.",
"The circle size is representative of the population around each well location.",
br(), br(),
tags$i("The circle Markers on the map may not display correctly if you are viewing this application on a device behind a firewall ")
)),
column(4,
leafletOutput("Tanzmap", width="850",height = "600"))
)
)
)
################################################ Define server logic ####################################################
server <- function(input, output, session) {
############################################### PDF output ##############################################################
output$Intro<-renderUI({})
############################################## Table Output #############################################################
tanz_Train2 = tanz_Train[sample(nrow(tanz_Train)), ]
output$table <- DT::renderDataTable({
DT::datatable(tanz_Train2[, input$show_vars, drop = FALSE])
})
#############################################Bar Plot Output######################################################
Treact<-reactive({#Reactive function to update the plots based on user input
data <- tanz_Train[[input$TGroup]]
})
Treact2<-reactive({
data2 <- tanz_Train[[input$PrediChoice]]
})
output$dplots <-renderPlotly({#Predictive Model Options
dist <- switch(input$PrediChoice,
RF = tanz_Train$RF,
SVM = tanz_Train$SVM,
KNN = tanz_Train$KNN,
MLR = tanz_Train$MLR,
C50=tanz_Train$C50,
BAG=tanz_Train$BAG)
# # show Raw count of data grouped by Status
# # p<-ggplot(tanz_Train, aes(x=Treact2(), y = ..count..)) +
# # geom_bar(aes(fill = Treact()), position = "dodge")+
# # xlab("Status Group")+labs(title =paste("Pump Status by",input$TGroup))+
# # theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
# # theme_tufte()+ scale_fill_discrete(name=input$TGroup)
# # p
p<- ggplot( tanz_Train, aes(x= Treact(),label2 = scales::percent(..prop..)
, group=as.character("x"))) +geom_bar(aes(y = ..prop..,fill=as.factor(..x..)
,text = paste("Count:",..count..)
),stat= "count"
)+geom_text(data=tanz_Train,
aes(
label = scales::percent(..prop..)
,y= ..prop..
,group=as.character("x")
)
,position=position_dodge(0.9)
, vjust =-1.5
,cex=3.5
, stat= "count") + facet_grid(~Treact2()) + theme_tufte() + theme(axis.text.x = element_text(angle = 40)) + labs(y = "Percent of Status Group",fill=~levels(input$TGroup)) + labs(title =paste("Pump Status by",input$TGroup)) + scale_y_continuous(labels = scales::percent)+theme(axis.title.x = element_blank()) + theme(legend.position="none")#remove legend
P<-ggplotly(p,tooltip=c("text","label2"))
})
#################################################### Map output #########################################################
#marker text of Predictive Model output
details2 <- paste(sep = "<br/>",
tags$b("SVM:"),as.character(tanz_Train$SVM),
#"<br/>",
tags$b("RF:"),as.character(tanz_Train$RF),
#"<br/>",
tags$b("KNN:"),as.character(tanz_Train$KNN),
#"<br/>",
tags$b("MLR:"),as.character(tanz_Train$MLR),
#"<br/>",
tags$b("C50:"),as.character(tanz_Train$C50),
#"<br/>",
tags$b("BAG:"),as.character(tanz_Train$BAG),
#"<br/>",
tags$b("Population:"),as.character(tanz_Train$population))
#color Palette
color <- colorFactor( palette = "viridis", tanz_Train$waterpoint_type,reverse = TRUE, alpha = TRUE)
#leaflet map output
output$Tanzmap<-renderLeaflet({
leaflet(tanz_Train) %>%
addTiles() %>% #setting the bounds of longitude and lattitude
fitBounds(~min(tanz_Train$longitude),
~min(tanz_Train$latitude),
~max(tanz_Train$longitude),
~max(tanz_Train$latitude))%>%
setView(lng=34.888822000000005,lat=-6.369028,zoom=6)%>%
addProviderTiles("CartoDB.Positron",
options = providerTileOptions(noWrap = TRUE))%>%
addCircleMarkers(data=tanz_Train,
lng=~tanz_Train$longitude,
lat=~tanz_Train$latitude,
fillOpacity=0.7,
color=~color(waterpoint_type),
stroke=FALSE,
popup=details2,
label = as.character(tanz_Train$waterpoint_type),
radius=~sqrt(population)*.35)%>% #maintain circle size during zoom
# Legend
addLegend("bottomleft",
pal=color,
values=~waterpoint_type,
opacity = 0.7,
title="Water Point Type"
)
})
}
# Run the application
shinyApp(ui = ui, server = server)