Course Details

Specialisation: Data Analysis and Interpretation
Course: Data Analysis and Interpretation Capstone
Education Institution: Wesleyan University
Publisher: Coursera
Assignment: Methods

Title

Factors associated with operating condition of waterpoints in Tanzania

Methods

Sample

The dataset was obtained from Taarifa and the Tanzanian Ministry of Water via DrivenData’s competition “Pump it Up: Data Mining the Water Table”.

The dataset is already split into training and testing subsets, with 59,400 observations (80%) in the training subset and 14,850 observations (20%) in the testing subset.
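As a quick sanity check on the reported split, the proportions can be recomputed from the two subset sizes given above. This is only an illustrative sketch; the variable names are made up for the example.

```python
# Subset sizes as reported in the text.
n_train = 59_400
n_test = 14_850
n_total = n_train + n_test

# Share of all observations that falls in each subset.
train_share = n_train / n_total
test_share = n_test / n_total

print(f"total={n_total}, train={train_share:.0%}, test={test_share:.0%}")
```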

Each training observation has forty variables plus the label.

The observations contain details of waterpoints available to the population of Tanzania. The records have dates from 2002-10-14 to 2013-12-03.

Measures

The operating condition of a waterpoint is the categorical response variable. It takes one of three values: “functional”, “functional, needs repair” or “non functional”.
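A three-level response like this is often mapped to integer codes before analysis. The sketch below shows one possible encoding in Python. The constant and function names are hypothetical, and the ordering of the codes is arbitrary rather than something the dataset prescribes.

```python
# The three operating-condition labels quoted in the text.
CONDITIONS = ("functional", "functional, needs repair", "non functional")

# Map each label to an integer code (the ordering here is arbitrary).
CODE_OF = {label: code for code, label in enumerate(CONDITIONS)}


def encode_condition(label: str) -> int:
    """Return the integer code for a label, rejecting anything unexpected."""
    try:
        return CODE_OF[label.strip()]
    except KeyError:
        raise ValueError(f"unknown operating condition: {label!r}") from None


# Hypothetical labels, for illustration only.
sample_labels = ["functional", "non functional", "functional, needs repair"]
print([encode_condition(s) for s in sample_labels])
```

Rejecting unknown labels explicitly makes misspelled or missing categories visible early instead of letting them pass silently into the frequency tables.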

Predictors included can be grouped as:

  1. Location: GPS coordinates and altitude as well as village, region and district
  2. Quantity: Amount of water available
  3. Quality: Quality of the water provided
  4. Administrative: Funder, installer, management and payment details
  5. Characteristics: Year of construction, extraction type, source and waterpoint type

Code book

The summary of the variables provided with the dataset is as follows:

Name                   Type         Description
id                     Discrete     Waterpoint identification
amount_tsh             Continuous   Total static head (amount of water available to waterpoint)
date_recorded          Discrete     The date the row was entered
funder                 Categorical  Who funded the well
gps_height             Continuous   Altitude of the well
installer              Categorical  Organization that installed the well
longitude              Continuous   GPS coordinate
latitude               Continuous   GPS coordinate
wpt_name               Categorical  Name of the waterpoint if there is one
num_private
basin                  Categorical  Geographic water basin
subvillage             Categorical  Geographic location
region                 Categorical  Geographic location
region_code            Categorical  Geographic location (coded)
district_code          Categorical  Geographic location (coded)
lga                    Categorical  Geographic location
ward                   Categorical  Geographic location
population             Discrete     Population around the well
public_meeting         Categorical  True/False
recorded_by            Categorical  Group entering this row of data
scheme_management      Categorical  Who operates the waterpoint
scheme_name            Categorical  Who operates the waterpoint
permit                 Categorical  If the waterpoint is permitted
construction_year      Discrete     Year the waterpoint was constructed
extraction_type        Categorical  The kind of extraction the waterpoint uses
extraction_type_group  Categorical  The kind of extraction the waterpoint uses
extraction_type_class  Categorical  The kind of extraction the waterpoint uses
management             Categorical  How the waterpoint is managed
management_group       Categorical  How the waterpoint is managed
payment                Categorical  What the water costs
payment_type           Categorical  What the water costs
water_quality          Categorical  The quality of the water
quality_group          Categorical  The quality of the water
quantity               Categorical  The quantity of water
quantity_group         Categorical  The quantity of water
source                 Categorical  The source of the water
source_type            Categorical  The source of the water
source_class           Categorical  The source of the water
waterpoint_type        Categorical  The kind of waterpoint
waterpoint_type_group  Categorical  The kind of waterpoint

Analysis

The distributions for the predictors and the operating condition of a waterpoint were evaluated by examining frequency tables for categorical variables and calculating the mean, standard deviation and minimum and maximum values for quantitative variables.
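The descriptive step above can be sketched with the Python standard library alone. The rows below are invented for illustration and are not taken from the dataset. The label column name `status_group` is an assumption, since the text does not name it, and the helper functions are hypothetical.

```python
from collections import Counter
import statistics

# Hypothetical rows for illustration only; not real observations.
rows = [
    {"status_group": "functional", "gps_height": 1390.0},
    {"status_group": "functional", "gps_height": 1399.0},
    {"status_group": "non functional", "gps_height": 263.0},
    {"status_group": "functional, needs repair", "gps_height": 686.0},
    {"status_group": "non functional", "gps_height": 0.0},
]


def frequency_table(values):
    """Return {category: (count, proportion)}, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.most_common()}


def summarize(values):
    """Mean, sample standard deviation, minimum and maximum of a sequence."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # needs at least two values
        "min": min(values),
        "max": max(values),
    }


# Frequency table for a categorical variable.
for category, (count, share) in frequency_table(
    row["status_group"] for row in rows
).items():
    print(f"{category:<26} {count:>3} {share:6.1%}")

# Summary statistics for a quantitative variable.
print(summarize([row["gps_height"] for row in rows]))
```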

Scatter plots and box plots were also examined, and Pearson correlation and analysis of variance (ANOVA) were used to test bivariate associations between individual predictors and the response variable (the operating condition of a waterpoint).
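To make the two bivariate tests concrete, the sketch below computes a Pearson correlation coefficient and a one-way ANOVA F statistic by hand. Several things here are assumptions rather than details from the analysis: the altitude values are invented, the helper names are hypothetical, and pairing Pearson correlation with a 0/1 indicator for "functional" is only one possible way to relate a quantitative predictor to a categorical response. The sketch stops at the test statistics and does not compute p-values, which libraries such as SciPy (`scipy.stats.pearsonr`, `scipy.stats.f_oneway`) can provide.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within) for {name: values}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical altitudes (gps_height) grouped by operating condition.
heights_by_condition = {
    "functional": [1390.0, 1399.0, 686.0, 1200.0],
    "functional, needs repair": [263.0, 900.0, 1100.0],
    "non functional": [0.0, 20.0, 150.0, 300.0],
}

# ANOVA: does mean altitude differ across the three conditions?
f_stat, df_b, df_w = anova_f(heights_by_condition)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")

# Pearson: altitude against a 0/1 indicator for "functional" (an assumption).
heights, is_functional = [], []
for condition, values in heights_by_condition.items():
    for value in values:
        heights.append(value)
        is_functional.append(1.0 if condition == "functional" else 0.0)
print(f"r = {pearson_r(heights, is_functional):.2f}")
```

A large F statistic relative to its degrees of freedom suggests that the group means differ, and the sign of r shows the direction of the linear association.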