1. Preparing clean data from the cleaning log
There are plenty of functions that can make clean data from a cleaning log. This tutorial focuses on the butteR package, so first you have to install it.
2. Install butteR package
You can find it on GitHub, or you can copy and paste the following code into your console.
devtools::install_github("zackarno/butteR")
3. Introduction to check_cleaning_log() and implement_cleaning_log()
3.1 check_cleaning_log()
Once you have installed the butteR package, you will have access to a function called check_cleaning_log(). Its main purpose is to check whether your cleaning log has any issues that need to be solved before you prepare the clean data. It mainly checks whether all the UUIDs and questions reported in the cleaning log exist in the raw data set. If everything is all right, it shows the message "no issues in cleaning log found". If there is at least one issue, it returns a data frame containing all the cleaning log issues and shows the message "cleaning log has issues, see output table".
The meaning of each argument of check_cleaning_log() function is given below:
df: Raw data (data.frame)
df_uuid: Column in raw data with uuid
cl: Cleaning log (data.frame)
cl_change_type_col: Column in cleaning log which specifies change type to be made
cl_change_col: Column in cleaning log which specifies data set column to change
cl_uuid: Column in cleaning log with UUID
cl_new_val: Column in cleaning log which specifies the new, correct value
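To make the idea behind the check concrete, here is a minimal base R sketch of the same kind of test on toy data. It is not butteR's implementation: check_log_sketch() is a hypothetical helper, all data values are invented, and the label "question_does_not_exist" is an assumption made for illustration (only "uuid_does_not_exist" appears in this tutorial's output).

```r
# Toy raw data and cleaning log; all values are invented for illustration.
raw <- data.frame(
  X_uuid = c("id-1", "id-2", "id-3"),
  age    = c(25, 310, 41),
  stringsAsFactors = FALSE
)
cl <- data.frame(
  uuid        = c("id-2", "id-9", "id-3"),
  change_type = c("change_response", "remove_survey", "change_response"),
  question    = c("age", NA, "agee"),
  new_value   = c("31", NA, "40"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, NOT part of butteR: flags cleaning log rows whose
# UUID or question cannot be found in the raw data.
check_log_sketch <- function(df, df_uuid, cl, cl_change_col, cl_uuid) {
  uuid_missing <- !(cl[[cl_uuid]] %in% df[[df_uuid]])
  # A question can only be missing when the log row names a question at all
  question_missing <- !is.na(cl[[cl_change_col]]) &
    !(cl[[cl_change_col]] %in% names(df))
  problem <- rep(NA_character_, nrow(cl))
  problem[question_missing] <- "question_does_not_exist"  # assumed label
  problem[uuid_missing] <- "uuid_does_not_exist"
  # Put the problem label in the first column and keep only flagged rows
  issues <- data.frame(cl_prob = problem, cl, stringsAsFactors = FALSE)
  issues[!is.na(problem), , drop = FALSE]
}

issues <- check_log_sketch(raw, "X_uuid", cl, "question", "uuid")
print(issues)
```

In this toy log, the row with uuid "id-9" is flagged because that UUID is not in the raw data, and the row with question "agee" is flagged because no such column exists.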
3.2 implement_cleaning_log()
Once your cleaning log has no issues, the next step is to prepare the clean data from it. butteR offers implement_cleaning_log() to do this. Its arguments are exactly the same as those of check_cleaning_log(), so they are not described again here. The function returns the clean data for the given cleaning log and raw data.
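As a rough illustration of what applying a cleaning log means, the base R sketch below changes one value and removes one survey on toy data. It is not butteR's code: apply_log_sketch() is a hypothetical helper, the data are invented, and the change type "change_response" is an assumption ("remove_survey" is the only change type shown in this tutorial).

```r
# Toy raw data and cleaning log; all values are invented for illustration.
raw <- data.frame(
  X_uuid = c("id-1", "id-2", "id-3"),
  age    = c(25, 310, 41),
  stringsAsFactors = FALSE
)
cl <- data.frame(
  uuid        = c("id-2", "id-3"),
  change_type = c("change_response", "remove_survey"),
  question    = c("age", NA),
  new_value   = c("31", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper, NOT part of butteR. It takes the same arguments as the
# butteR functions described above so the mapping is easy to follow.
apply_log_sketch <- function(df, df_uuid, cl, cl_change_type_col,
                             cl_change_col, cl_uuid, cl_new_val) {
  for (i in seq_len(nrow(cl))) {
    # "change_response" is an assumed change type name
    if (cl[[cl_change_type_col]][i] != "change_response") next
    row <- which(df[[df_uuid]] == cl[[cl_uuid]][i])
    col <- cl[[cl_change_col]][i]
    new <- cl[[cl_new_val]][i]
    # Log values are read as text; convert when the target column is numeric
    if (is.numeric(df[[col]])) new <- as.numeric(new)
    df[row, col] <- new
  }
  # Drop every survey marked for removal
  remove_ids <- cl[[cl_uuid]][cl[[cl_change_type_col]] == "remove_survey"]
  df[!(df[[df_uuid]] %in% remove_ids), , drop = FALSE]
}

clean <- apply_log_sketch(raw, "X_uuid", cl, "change_type",
                          "question", "uuid", "new_value")
print(clean)
```

Here the age for "id-2" becomes 31 and the survey "id-3" is dropped, leaving two rows.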
4. Things to be considered
To run these functions, your cleaning log should follow a specific format, which is almost the same as the one that HQ shared. You can find a cleaning log template here.
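A quick way to confirm that a cleaning log has the columns you plan to pass to the functions is sketched below. The column names are the ones used in the example calls later in this tutorial; the toy log itself is invented, and your template may use different names.

```r
# Columns passed to the butteR functions in this tutorial's example calls
required_cols <- c("uuid", "change_type", "question", "new_value")

# Toy cleaning log; values are invented for illustration.
cl <- data.frame(
  uuid          = "id-2",
  change_type   = "change_response",
  question      = "age",
  Current_value = "310",
  new_value     = "31",
  stringsAsFactors = FALSE
)

# Stop early with a readable message if any expected column is absent
missing_cols <- setdiff(required_cols, names(cl))
if (length(missing_cols) > 0) {
  stop("Cleaning log is missing column(s): ",
       paste(missing_cols, collapse = ", "))
}
```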
Example:: preparing clean data from cleaning log
Here you will find two cleaning logs (1. with issues, 2. without issues) and a raw dataset for an Education sector assessment. We will prepare the clean data from the cleaning log in two steps.
Check cleaning log: To prepare the clean data, we first need to check whether the cleaning log is perfectly aligned with the data. This stage catches unintentional mistakes such as copying errors or an incomplete cleaning log. We will use the check_cleaning_log() function to check the cleaning log.
Preparing clean data: In this stage, we will prepare the clean data from the cleaning log. We will use implement_cleaning_log() to do this.
The following chunks show how check_cleaning_log() and implement_cleaning_log() work. Although the example shows every step, it is recommended that you try it yourself.
Read Raw data and Cleaning log
raw_data <- read.csv("data/data_cleaning/hh.csv") #Raw data
cleaning_log_with_issue <- read.csv("data/data_cleaning/Cleaning_log_with_issue.csv") #Initial cleaning log
Once you run check_cleaning_log() with the problematic cleaning log, you will get a note saying that the cleaning log has issues. You will also get a data frame listing all the issues found in the cleaning log (the issues are written in the first column of the data frame).
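Before running the check, it can help to confirm how the UUID column is named once the data is in R. The calls in this tutorial use df_uuid = "X_uuid". One likely reason, assuming the CSV header is `_uuid` (an assumption, since the raw file is not shown here), is that read.csv() by default rewrites names that are not syntactically valid in R. The toy CSV text below is invented for illustration.

```r
# Toy CSV content with a header that starts with an underscore (assumed)
csv_text <- "_uuid,age\nid-1,25\nid-2,31"

# Default behaviour: check.names = TRUE makes names syntactically valid
toy_default <- read.csv(text = csv_text)
names(toy_default)    # "X_uuid" "age"

# Keeping the original header requires check.names = FALSE
toy_asis <- read.csv(text = csv_text, check.names = FALSE)
names(toy_asis)       # "_uuid" "age"
```

Whichever form you use, the value passed as df_uuid has to match the column name as it appears in names(raw_data).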
Identifying the cleaning log issues::check_cleaning_log()
butteR::check_cleaning_log(df = raw_data, df_uuid = "X_uuid", cl = cleaning_log_with_issue, cl_change_type_col = "change_type",
                           cl_change_col = "question", cl_uuid = "uuid", cl_new_val = "new_value")
Output: check_cleaning_log()
[1] "cleaning log has issues, see output table"
| cl_prob | uuid | change_type | dataset_loop | question | Current_value | new_value | Issue | Enumerator.comments |
|---|---|---|---|---|---|---|---|---|
| uuid_does_not_exist | a52776c5-ddbc-44a5-b596-5e92e1f9956 | remove_survey | HH | NA | NA | Pilot |
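A "uuid_does_not_exist" problem is often a copying error. The UUID in the output above appears to have only 11 characters in its last group, whereas the standard UUID layout is 8-4-4-4-12 hexadecimal characters. The sketch below, which assumes your UUIDs follow that standard layout, flags malformed values; the second UUID is invented for comparison.

```r
# Standard UUID layout: 8-4-4-4-12 hexadecimal characters
uuid_pattern <- paste0(
  "^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-",
  "[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

log_uuids <- c(
  "a52776c5-ddbc-44a5-b596-5e92e1f9956",   # from the output table above
  "0f8fad5b-d9cb-469f-a165-70867728950e"   # invented, well-formed example
)

# Keep only the values that do not match the expected layout
malformed <- log_uuids[!grepl(uuid_pattern, log_uuids)]
print(malformed)
```

A well-formed UUID can still be absent from the raw data, so this only catches truncated or mistyped values.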
Adjusting cleaning log issues and rechecking the cleaning log
Once you have fixed all the issues, you will receive the following note when you re-run check_cleaning_log().
cleaning_log_without_issue <- read.csv("data/data_cleaning/Cleaning_log_without_issue.csv") #updated cleaning log
butteR::check_cleaning_log(df = raw_data, df_uuid = "X_uuid", cl = cleaning_log_without_issue, cl_change_type_col = "change_type",
                           cl_change_col = "question", cl_uuid = "uuid", cl_new_val = "new_value")
[1] "no issues in cleaning log found"
Preparing clean data from cleaning log::implement_cleaning_log()
When you have a cleaning log with no issues, the next step is to prepare the clean data. The following function from the butteR package makes the clean data from the cleaning log.
clean_data <- butteR::implement_cleaning_log(df = raw_data, df_uuid = "X_uuid", cl = cleaning_log_without_issue, cl_change_type_col = "change_type",
                                             cl_change_col = "question", cl_uuid = "uuid", cl_new_val = "new_value")
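After producing the clean data, a quick comparison against the raw data can confirm that the expected surveys were removed and the expected values changed. The sketch below uses small invented data frames as stand-ins for raw_data and clean_data; the column names are assumptions.

```r
# Invented stand-ins for raw_data and clean_data
raw <- data.frame(
  X_uuid = c("id-1", "id-2", "id-3"),
  age    = c(25, 310, 41),
  stringsAsFactors = FALSE
)
clean <- data.frame(
  X_uuid = c("id-1", "id-2"),
  age    = c(25, 31),
  stringsAsFactors = FALSE
)

# Surveys present in the raw data but absent from the clean data
removed_ids <- setdiff(raw$X_uuid, clean$X_uuid)

# For surveys kept in both, align rows by UUID and compare one column
common       <- intersect(raw$X_uuid, clean$X_uuid)
raw_common   <- raw[match(common, raw$X_uuid), ]
clean_common <- clean[match(common, clean$X_uuid), ]
changed_ids  <- common[raw_common$age != clean_common$age]

print(removed_ids)
print(changed_ids)
```

Comparing these two vectors with the UUIDs listed in the cleaning log is a simple sanity check that every logged change was applied.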