Specialisation | Data Analysis and Interpretation |
Course | Data Analysis and Interpretation Capstone |
Education Institution | Wesleyan University |
Publisher | Coursera |
Assignment | Methods |
Factors associated with operating condition of waterpoints in Tanzania
The data was obtained from Taarifa and the Tanzanian Ministry of Water via DrivenData’s competition “Pump it Up: Data Mining the Water Table”.
The data is already split into training and testing subsets, with 59,400 observations (80%) in the training subset and 14,850 observations (20%) in the testing subset.
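The quoted proportions follow directly from the two subset sizes. A minimal check, with variable names chosen here purely for illustration:

```python
# Subset sizes as reported above.
n_train = 59_400
n_test = 14_850
n_total = n_train + n_test

# Share of all observations that falls in each subset.
train_share = n_train / n_total
test_share = n_test / n_total

print(f"total: {n_total}, train: {train_share:.0%}, test: {test_share:.0%}")
# prints "total: 74250, train: 80%, test: 20%"
```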
There are forty variables in total, plus the label for each of the training observations.
The observations contain details of waterpoints available to the population of Tanzania. The records have dates from 2002-10-14 to 2013-12-03.
The response variable is the operating condition of a waterpoint, a categorical variable that takes one of three values: “functional”, “functional, needs repair” or “non functional”.
Predictors included can be grouped as:
The summary of the variables provided with the dataset is as follows:
Name | Type | Description |
---|---|---|
id | Discrete | waterpoint identification |
amount_tsh | Continuous | Total static head (amount water available to waterpoint) |
date_recorded | Discrete | The date the row was entered |
funder | Categorical | Who funded the well |
gps_height | Continuous | Altitude of the well |
installer | Categorical | Organization that installed the well |
longitude | Continuous | GPS coordinate |
latitude | Continuous | GPS coordinate |
wpt_name | Categorical | Name of the waterpoint if there is one |
num_private | | |
basin | Categorical | Geographic water basin |
subvillage | Categorical | Geographic location |
region | Categorical | Geographic location |
region_code | Categorical | Geographic location (coded) |
district_code | Categorical | Geographic location (coded) |
lga | Categorical | Geographic location |
ward | Categorical | Geographic location |
population | Discrete | Population around the well |
public_meeting | Categorical | True/False |
recorded_by | Categorical | Group entering this row of data |
scheme_management | Categorical | Who operates the waterpoint |
scheme_name | Categorical | Who operates the waterpoint |
permit | Categorical | If the waterpoint is permitted |
construction_year | Discrete | Year the waterpoint was constructed |
extraction_type | Categorical | The kind of extraction the waterpoint uses |
extraction_type_group | Categorical | The kind of extraction the waterpoint uses |
extraction_type_class | Categorical | The kind of extraction the waterpoint uses |
management | Categorical | How the waterpoint is managed |
management_group | Categorical | How the waterpoint is managed |
payment | Categorical | What the water costs |
payment_type | Categorical | What the water costs |
water_quality | Categorical | The quality of the water |
quality_group | Categorical | The quality of the water |
quantity | Categorical | The quantity of water |
quantity_group | Categorical | The quantity of water |
source | Categorical | The source of the water |
source_type | Categorical | The source of the water |
source_class | Categorical | The source of the water |
waterpoint_type | Categorical | The kind of waterpoint |
waterpoint_type_group | Categorical | The kind of waterpoint |
The distributions of the predictors and of the operating condition of a waterpoint were evaluated by examining frequency tables for categorical variables and by calculating the mean, standard deviation, minimum and maximum for quantitative variables.
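The descriptive step can be sketched with the Python standard library alone. This is only an illustration of the kind of summary described, not the code used for the analysis: the rows are made-up toy values, the label column name `status_group` is an assumption, and `frequency_table` and `describe` are hypothetical helpers.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical toy rows for illustration only; NOT taken from the real dataset.
# "status_group" is an assumed name for the label column.
rows = [
    {"status_group": "functional", "quantity": "enough", "gps_height": 1390.0},
    {"status_group": "non functional", "quantity": "dry", "gps_height": 0.0},
    {"status_group": "functional", "quantity": "enough", "gps_height": 686.0},
    {"status_group": "functional, needs repair", "quantity": "insufficient", "gps_height": 263.0},
]


def frequency_table(rows, column):
    """Return {category: (count, proportion)} for one categorical column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: (n, n / total) for value, n in counts.most_common()}


def describe(rows, column):
    """Return mean, sample standard deviation, minimum and maximum of a numeric column."""
    values = [row[column] for row in rows]
    return {
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Frequency table for a categorical variable.
for value, (n, share) in frequency_table(rows, "status_group").items():
    print(f"{value}: {n} ({share:.0%})")

# Summary statistics for a quantitative variable.
print(describe(rows, "gps_height"))
```

On the real data the same two helpers would be applied to each categorical and each quantitative column in turn.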
Scatter plots and box plots were also examined, and Pearson correlation and analysis of variance (ANOVA) were used to test bivariate associations between individual predictors and the operating condition of a waterpoint (the response variable).
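The two test statistics can be written out from their definitions. The sketch below computes the Pearson correlation coefficient for two quantitative variables and the one-way ANOVA F statistic for a quantitative predictor grouped by the three response categories. The function names and all numbers are hypothetical toy values, and the text does not state how the response was coded for each test, so the grouping shown is an assumption.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical toy values of one quantitative predictor, grouped by response category.
functional = [1.0, 2.0, 3.0]
needs_repair = [2.0, 3.0, 4.0]
non_functional = [5.0, 6.0, 7.0]
f_stat = one_way_anova_f([functional, needs_repair, non_functional])

# Hypothetical toy values of two quantitative predictors.
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

print(f"F = {f_stat:.2f}, r = {r:.2f}")
```

The sketch stops at the statistics themselves; obtaining p-values requires the F and t distributions, for which a statistics package such as SciPy (`scipy.stats.f_oneway`, `scipy.stats.pearsonr`) is the usual choice.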