Course Details

Specialisation	Data Analysis and Interpretation
Course	Data Analysis and Interpretation Capstone
Education Institution	Wesleyan University
Publisher	Coursera
Assignment	Methods

Title

Factors associated with operating condition of waterpoints in Tanzania

Methods

Sample

The data was obtained from Taarifa and the Tanzanian Ministry of Water via DrivenData's competition “Pump it Up: Data Mining the Water Table”.

The data is already split into training and testing subsets, with 59,400 observations (80%) in the training subset and 14,850 observations (20%) in the testing subset.
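
The split proportions quoted above can be checked with a few lines of Python. This is only an arithmetic sketch using the two counts given in the text; the variable names are illustrative.

```python
# Counts reported in the text above.
n_train = 59_400
n_test = 14_850

total = n_train + n_test
train_share = n_train / total
test_share = n_test / total

print(f"total observations: {total}")
print(f"training share: {train_share:.0%}, testing share: {test_share:.0%}")
```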

There is a total of forty variables plus the label for each of the training observations.

The observations contain details of waterpoints available to the population of Tanzania. The records have dates from 2002-10-14 to 2013-12-03.

Measures

The operating condition of a waterpoint is a categorical response variable that takes one of three values: “functional”, “functional, needs repair” or “non functional”.

The predictors can be grouped as follows:

Location: GPS coordinates and altitude as well as village, region and district
Quantity: Amount of water available
Quality: Quality of the water provided
Administrative: Funder, installer, management and payment details
Characteristics: Year of construction, extraction type, source and waterpoint type

Code book

The summary of the variables provided with the dataset is as follows:

Name	Type	Description
id	Discrete	waterpoint identification
amount_tsh	Continuous	Total static head (amount of water available to waterpoint)
date_recorded	Discrete	The date the row was entered
funder	Categorical	Who funded the well
gps_height	Continuous	Altitude of the well
installer	Categorical	Organization that installed the well
longitude	Continuous	GPS coordinate
latitude	Continuous	GPS coordinate
wpt_name	Categorical	Name of the waterpoint if there is one
num_private
basin	Categorical	Geographic water basin
subvillage	Categorical	Geographic location
region	Categorical	Geographic location
region_code	Categorical	Geographic location (coded)
district_code	Categorical	Geographic location (coded)
lga	Categorical	Geographic location
ward	Categorical	Geographic location
population	Discrete	Population around the well
public_meeting	Categorical	True/False
recorded_by	Categorical	Group entering this row of data
scheme_management	Categorical	Who operates the waterpoint
scheme_name	Categorical	Who operates the waterpoint
permit	Categorical	If the waterpoint is permitted
construction_year	Discrete	Year the waterpoint was constructed
extraction_type	Categorical	The kind of extraction the waterpoint uses
extraction_type_group	Categorical	The kind of extraction the waterpoint uses
extraction_type_class	Categorical	The kind of extraction the waterpoint uses
management	Categorical	How the waterpoint is managed
management_group	Categorical	How the waterpoint is managed
payment	Categorical	What the water costs
payment_type	Categorical	What the water costs
water_quality	Categorical	The quality of the water
quality_group	Categorical	The quality of the water
quantity	Categorical	The quantity of water
quantity_group	Categorical	The quantity of water
source	Categorical	The source of the water
source_type	Categorical	The source of the water
source_class	Categorical	The source of the water
waterpoint_type	Categorical	The kind of waterpoint
waterpoint_type_group	Categorical	The kind of waterpoint

Analysis

The distributions for the predictors and the operating condition of a waterpoint were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables.
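
A minimal sketch of these descriptive summaries, using only the Python standard library, could look as follows. The rows are made-up values for illustration and are not taken from the dataset; the label column name `status_group` and the helper names `frequency_table` and `describe` are assumptions, not part of the original analysis.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical rows for illustration only; NOT values from the dataset.
# "status_group" is an assumed name for the label column.
rows = [
    {"status_group": "functional", "gps_height": 1200.0},
    {"status_group": "functional", "gps_height": 800.0},
    {"status_group": "non functional", "gps_height": 400.0},
    {"status_group": "functional, needs repair", "gps_height": 0.0},
]


def frequency_table(rows, column):
    """Return {value: (count, proportion)} for a categorical column."""
    counts = Counter(row[column] for row in rows)
    n = sum(counts.values())
    return {value: (count, count / n) for value, count in counts.most_common()}


def describe(rows, column):
    """Return mean, sample standard deviation, minimum and maximum."""
    values = [row[column] for row in rows]
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }


for value, (count, share) in frequency_table(rows, "status_group").items():
    print(f"{value}: {count} ({share:.0%})")
print(describe(rows, "gps_height"))
```

The same two helpers can be applied to any categorical or quantitative variable in the code book once the data has been loaded into a list of dictionaries, for example with `csv.DictReader`.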

Scatter plots and box plots were also examined, and Pearson correlation and analysis of variance (ANOVA) were used to test bivariate associations between individual predictors and the operating condition of a waterpoint (the response variable).
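
The two bivariate statistics can be sketched as below, assuming a one-way ANOVA in which the operating condition is the grouping factor and a quantitative predictor is the outcome. The functions `pearson_r` and `one_way_anova_f` are illustrative helpers written with the standard library, and the numbers are made up, not drawn from the dataset. The sketch computes only the test statistics; p-values would normally be obtained from a statistics package such as SciPy (`scipy.stats.pearsonr`, `scipy.stats.f_oneway`).

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def one_way_anova_f(groups):
    """One-way ANOVA F statistic.

    groups maps each category label to the list of quantitative values
    observed in that category. Returns (F, df_between, df_within).
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (v - mean(values)) ** 2 for values in groups.values() for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values of one quantitative predictor, grouped by
# operating condition; for illustration only.
groups = {
    "functional": [1.0, 2.0, 3.0],
    "functional, needs repair": [2.0, 3.0, 4.0],
    "non functional": [5.0, 6.0, 7.0],
}
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")

# Hypothetical paired values of two quantitative variables.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(f"r = {r:.2f}")
```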

Angelo Klin

11 April 2016