Data pre-processing and manipulation (Data cleaning)
Reading the data set generated using REDCap and merging it
sysuse demo, clear
codebook
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
-------------------------------------------------------------------------------
record_id Record ID
-------------------------------------------------------------------------------
Type: Numeric (byte)
Range: [1,31] Units: 1
Unique values: 24 Missing .: 0/24
Mean: 15.2083
Std. dev.: 9.25005
Percentiles: 10% 25% 50% 75% 90%
3 7.5 15 23 27
-------------------------------------------------------------------------------
redcap_event_name Event Name
-------------------------------------------------------------------------------
Type: String (str24)
Unique values: 1 Missing "": 0/24
Tabulation: Freq. Value
24 "demographic_inform_arm_1"
-------------------------------------------------------------------------------
redcap_survey_identifier Survey Identifier
-------------------------------------------------------------------------------
Type: Numeric (byte)
Range: [.,.] Units: .
Unique values: 0 Missing .: 24/24
Tabulation: Freq. Value
24 .
-------------------------------------------------------------------------------
demographic_informat_v_0 Survey Timestamp
-------------------------------------------------------------------------------
Type: String (str19)
Unique values: 23 Missing "": 0/24
Examples: "2025-02-08 21:02:10"
"2025-02-09 20:40:32"
"2025-02-10 16:56:02"
"2025-02-11 14:32:22"
Warning: Variable has embedded blanks.
-------------------------------------------------------------------------------
dob Date of birth
-------------------------------------------------------------------------------
Type: Numeric daily date (float)
Range: [10894,23875] Units: 1
Or equivalently: [29oct1989,14may2025] Units: days
Unique values: 24 Missing .: 0/24
Mean: 15287 = 08nov2001(+ 1 hour)
Std. dev.: 3164.95
Percentiles: 10% 25% 50% 75% 90%
12072 13174.5 15629 15883.5 18295
19jan1993 26jan1996 16oct2002 27jun2003 02feb2010
-------------------------------------------------------------------------------
consent_date Consent date.
-------------------------------------------------------------------------------
Type: Numeric daily date (float)
Range: [20005,23785] Units: 1
Or equivalently: [09oct2014,13feb2025] Units: days
Unique values: 8 Missing .: 0/24
Tabulation: Freq. Value
1 20005 09oct2014
1 21663 24apr2019
1 23779 07feb2025
5 23780 08feb2025
4 23781 09feb2025
2 23782 10feb2025
5 23783 11feb2025
5 23785 13feb2025
-------------------------------------------------------------------------------
age Age
-------------------------------------------------------------------------------
Type: Numeric (byte)
Range: [0,35] Units: 1
Unique values: 14 Missing .: 0/24
Mean: 23
Std. dev.: 7.72348
Percentiles: 10% 25% 50% 75% 90%
15 21 22 28.5 32
-------------------------------------------------------------------------------
edu_level What is your highest level of education?
-------------------------------------------------------------------------------
Type: Numeric (byte)
Label: edu_level_
Range: [1,3] Units: 1
Unique values: 3 Missing .: 0/24
Tabulation: Freq. Numeric Label
1 1 Primary school
5 2 High school
18 3 Tertiary education
-------------------------------------------------------------------------------
employ_status What is your current employment status?
-------------------------------------------------------------------------------
Type: Numeric (byte)
Label: employ_status_
Range: [1,5] Units: 1
Unique values: 5 Missing .: 0/24
Tabulation: Freq. Numeric Label
5 1 Full-time
3 2 Part-time
7 3 Unemployed
3 4 Self-employed
6 5 Student
-------------------------------------------------------------------------------
monthly_income What is your current household income (monthly, in ZAR)?
-------------------------------------------------------------------------------
Type: Numeric (byte)
Label: monthly_income_
Range: [1,5] Units: 1
Unique values: 5 Missing .: 0/24
Tabulation: Freq. Numeric Label
10 1 Less than R3,500
1 2 R3, 500-R7,000
4 3 R7, 001-R15,000
4 4 R15, 001-R30,000
5 5 More than R30,000
-------------------------------------------------------------------------------
preg_complications Did you experience any pregnancy
complications? (e.g., gestational
diabetes, pre
-------------------------------------------------------------------------------
Type: Numeric (byte)
Label: preg_complications_
Range: [0,1] Units: 1
Unique values: 2 Missing .: 2/24
Tabulation: Freq. Numeric Label
21 0 No
1 1 Yes
2 .
-------------------------------------------------------------------------------
yes_specify If yes, please specify.
-------------------------------------------------------------------------------
Type: String (str16)
Unique values: 1 Missing "": 23/24
Tabulation: Freq. Value
23 ""
1 "Placenta Previa "
Warning: Variable has embedded and trailing blanks.
-------------------------------------------------------------------------------
demographic_informat_v_1 Complete?
-------------------------------------------------------------------------------
Type: Numeric (byte)
Label: demographic_informat_v_1_
Range: [2,2] Units: 1
Unique values: 1 Missing .: 0/24
Tabulation: Freq. Numeric Label
24 2 Complete
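The codebook above reports a latest date of birth (14may2025) that falls after the latest consent date (13feb2025), and a minimum age of 0, so some records may contain data-entry errors. The sketch below shows one way such records could be flagged. It runs on made-up rows, not the study data; the helper names dob_text, consent_text, age_check and flag are hypothetical, and the 365.25-day year is an approximation.

```stata
* Made-up rows for illustration only (not taken from the study data)
clear
input byte record_id str10 dob_text str10 consent_text byte age
1 "1996-01-26" "2025-02-08" 29
2 "2025-05-14" "2025-02-13" 0
3 "2002-10-16" "2014-10-09" 22
end

* Convert the text dates to Stata daily dates
generate dob = date(dob_text, "YMD")
generate consent_date = date(consent_text, "YMD")
format dob consent_date %td

* Approximate age in completed years at consent (365.25-day years)
generate age_check = floor((consent_date - dob) / 365.25)

* Flag births recorded after consent, or a stated age that disagrees with the dates
generate byte flag = (dob > consent_date) | (age != age_check)
list record_id dob consent_date age age_check flag, noobs
```

Flagged records would still need to be checked against the source forms before any value is changed or dropped.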
. use "demo.dta"
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
. merge 1:1 record_id using "baseline.dta"
Result Number of obs
-----------------------------------------
Not matched 3
from master 3 (_merge==1)
from using 0 (_merge==2)
Matched 21 (_merge==3)
-----------------------------------------
. drop if _merge==1
(3 observations deleted)
. save "Merged.dta", replace
file Merged.dta saved
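The merge above keeps only matched records by dropping _merge==1 afterwards. A minimal sketch of an equivalent, more compact pattern follows. It uses toy stand-ins for demo.dta and baseline.dta, and the tempfile names and values are hypothetical. isid confirms that record_id uniquely identifies rows before a 1:1 merge, and keep(match) retains only matched observations in a single step.

```stata
* Toy stand-in for demo.dta (hypothetical values)
clear
input byte record_id byte age
1 22
2 30
3 25
end
tempfile demo_toy
save `demo_toy'

* Toy stand-in for baseline.dta (hypothetical values)
clear
input byte record_id float last_weight
1 4.2
3 5.1
end
tempfile baseline_toy
save `baseline_toy'

use `demo_toy', clear
* Stops with an error if record_id does not uniquely identify observations
isid record_id
* Keep matched records only and do not create the _merge variable
merge 1:1 record_id using `baseline_toy', keep(match) nogenerate
count
```

Because the sketch relies on local macros for the temporary files, it should be run as a whole from a do-file rather than line by line.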
Describing the second data-set to be merged with the data-set above
use Follow_up
describe
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1528.csv)
Contains data from Follow_up.dta
Observations: 25 HonsMSc20257_DATA_NOHDRS_2025-0
2-14_1528.csv
Variables: 18 14 Feb 2025 17:19
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
record_id byte %8.0g Record ID
redcap_event_~e str22 %22s Event Name
redcap_survey~r byte %8.0g Survey Identifier
followup_time~p str19 %19s Survey Timestamp
baby_current_~t float %9.0g What is your babys current weight
(kg)?
baby_current_~h float %9.0g What is your babys current length
(cm)?
baby_bmi2 float %9.0g Babys BMI
recent_growth~s byte %8.0g recent_growth_issues_
Has your baby been diagnosed with
growth-related issues?
yes_issues byte %8.0g If yes, please specify.
healthcare_vi~s byte %17.0g healthcare_visits_
How many times has your baby
visited a healthcare facility
for a check-up since
feed_3month byte %39.0g feed_3month_
How is your baby currently fed?
feed_perday_3~h byte %9.0g feed_perday_3month_
If breastfeeding, how many times
per day does your baby feed?
bottles_perda~h byte %11.0g bottles_perday_3month_
If formula feeding, how many
bottles per day does your baby
consume?
complementary~h byte %8.0g complementary_food_3month_
Have you introduced any
complementary foods (e.g.,
porridge, purees)?
latest_illnes~h byte %47.0g latest_illness_3month_
Has your baby experienced any of
the following since the last
survey? (Check all
hospitalizati~h byte %8.0g hospitalization_3month_
Has your baby been hospitalized
since the last survey?
maternal_meals byte %8.0g maternal_meals_
How many meals do you eat per
day?
followup_comp~e byte %10.0g followup_complete_
Complete?
-------------------------------------------------------------------------------
Sorted by:
Removing duplicates for smooth merging
use Follow_up, clear
duplicates drop record_id, force
save Follow_upD.dta, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1528.csv)
Duplicates in terms of record_id
(12 observations deleted)
file Follow_upD.dta saved
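With the force option, duplicates drop keeps the first observation for each record_id in the current sort order, which may not be the submission that should be kept. As a hedged alternative, the sketch below inspects the duplicates first and then keeps the most recent submission per record. It runs on made-up rows; the names survey_ts, ts and weight are hypothetical and are not the variable names in Follow_up.dta.

```stata
* Made-up rows for illustration only
clear
input byte record_id str19 survey_ts float weight
1 "2025-02-08 21:02:10" 5.0
1 "2025-02-10 16:56:02" 5.4
2 "2025-02-09 20:40:32" 6.1
end

* Report how many surplus copies exist per record_id
duplicates report record_id

* Convert the text timestamp to a Stata datetime (stored as double)
generate double ts = clock(survey_ts, "YMD hms")
format ts %tc

* Within each record_id, sort by time and keep the latest row
bysort record_id (ts): keep if _n == _N

* Confirm record_id is now unique, as a 1:1 merge requires
isid record_id
```

Whether the first or the latest submission is the correct one to keep depends on the study protocol, so this is a design choice rather than a rule.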
The entire data set has now been cleaned and merged, and it is ready for subsequent analysis
use Merged.dta, clear
merge 1:1 record_id using "Follow_upD", nogenerate
save "Merged1.dta", replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
Result Number of obs
-----------------------------------------
Not matched 8
from master 8
from using 0
Matched 13
-----------------------------------------
file Merged1.dta saved
Changing the variable baby_bmi from a string variable to a numeric variable
use merged_data_clean, clear
describe
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
Contains data from merged_data_clean.dta
Observations: 21 HonsMSc20257_DATA_NOHDRS_2025-0
2-14_1523.csv
Variables: 45 24 Feb 2025 01:37
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
record_id byte %8.0g Record ID
redcap_event_~e str24 %24s Event Name
redcap_survey~r byte %8.0g Survey Identifier
demographic_i~0 str19 %19s Survey Timestamp
dob float %dM_d,_CY Date of birth
consent_date float %dM_d,_CY Consent date.
age byte %8.0g Age
edu_level byte %18.0g edu_level_
What is your highest level of
education?
employ_status byte %13.0g employ_status_
What is your current employment
status?
monthly_income byte %17.0g monthly_income_
What is your current household
income (monthly, in ZAR)?
preg_complica~s byte %8.0g preg_complications_
Did you experience any pregnancy
complications? (e.g.,
gestational diabetes, pre
yes_specify str16 %16s If yes, please specify.
demographic_i~1 byte %10.0g demographic_informat_v_1_
Complete?
baseline_info~p str19 %19s Survey Timestamp
feed_baseline byte %18.0g feed_baseline_
How is your baby currently fed?
feed_per_day byte %9.0g feed_per_day_
How often does your baby feed per
day?
solid_foods byte %8.0g solid_foods_
Have you introduced any solid
foods to your baby?
nutrition_cou~l byte %8.0g nutrition_counsel_
Do you have access to nutritional
counseling?
weigh_facility byte %8.0g weigh_facility_
Has your baby been weighed at a
healthcare facility since
birth?
last_weight float %9.0g What was your babys last recorded
weight?
last_length float %9.0g What was your babys last recorded
length (in cm)?
baby_bmi float %9.0g Babys BMI.
condition_aff~h byte %8.0g condition_affecting_growth_
Does your baby have any diagnosed
medical conditions affecting
growth?
yes_select_op~s byte %21.0g yes_select_options_
If yes, please select the
following options.
other_specify str9 %9s If other, please specify.
maternal_diet byte %121.0g maternal_diet_
What is your daily diet like?
clean_water byte %8.0g clean_water_
Do you have access to clean
drinking water?
mental_health~s byte %8.0g mental_health_concerns_
Have you experienced any mental
health concerns since giving
birth?
baseline_info~e byte %10.0g baseline_information_complete_
Complete?
_merge byte %23.0g _merge Matching result from merge
followup_time~p str19 %19s Survey Timestamp
baby_current_~t float %9.0g What is your babys current weight
(kg)?
baby_current_~h float %9.0g What is your babys current length
(cm)?
baby_bmi2 float %9.0g Babys BMI
recent_growth~s byte %8.0g recent_growth_issues_
Has your baby been diagnosed with
growth-related issues?
yes_issues byte %8.0g If yes, please specify.
healthcare_vi~s byte %17.0g healthcare_visits_
How many times has your baby
visited a healthcare facility
for a check-up since
feed_3month byte %39.0g feed_3month_
How is your baby currently fed?
feed_perday_3~h byte %9.0g feed_perday_3month_
If breastfeeding, how many times
per day does your baby feed?
bottles_perda~h byte %11.0g bottles_perday_3month_
If formula feeding, how many
bottles per day does your baby
consume?
complementary~h byte %8.0g complementary_food_3month_
Have you introduced any
complementary foods (e.g.,
porridge, purees)?
latest_illnes~h byte %47.0g latest_illness_3month_
Has your baby experienced any of
the following since the last
survey? (Check all
hospitalizati~h byte %8.0g hospitalization_3month_
Has your baby been hospitalized
since the last survey?
maternal_meals byte %8.0g maternal_meals_
How many meals do you eat per
day?
followup_comp~e byte %10.0g followup_complete_
Complete?
-------------------------------------------------------------------------------
Sorted by:
use merged_data_clean, clear
destring baby_bmi, replace
destring age, replace
dtable age preg_complications
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
baby_bmi already numeric; no replace
age already numeric; no replace
-----------------------------------------------------------------------------------------------
Summary
-----------------------------------------------------------------------------------------------
N 21
Age 22.857 (7.663)
Did you experience any pregnancy complications? (e.g., gestational diabetes, pre 0.053 (0.229)
-----------------------------------------------------------------------------------------------
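The log above shows that baby_bmi and age were already numeric, so destring made no change. For reference, the sketch below shows how destring behaves on a variable that really is stored as text. The rows are made up and the names bmi_text and bmi_num are hypothetical.

```stata
* Made-up rows for illustration only
clear
input str5 bmi_text
"14.2"
"15.8"
"n/a"
end

* Without force, destring leaves a variable untouched if it contains
* nonnumeric characters. With force, such entries become missing.
* generate() keeps the original string so the conversion can be audited.
destring bmi_text, generate(bmi_num) force

* Verify the new variable is numeric and count the values lost
confirm numeric variable bmi_num
count if missing(bmi_num)
```

Using generate() rather than replace keeps the original text alongside the numeric copy until the conversion has been checked.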
Creating a logistic regression results table and saving it in the working directory
asdoc logistic preg_complications age
(File Myfile.doc already exists, option append was assumed)
Logistic regression Number of obs = 22
LR chi2(1) = 0.86
Prob > chi2 = 0.3532
Log likelihood = -3.6370114 Pseudo R2 = 0.1059
------------------------------------------------------------------------------
preg_compl~s | Odds ratio Std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
age | 1.180138 .2443618 0.80 0.424 .7864683 1.77086
_cons | .0006154 .0037186 -1.22 0.221 4.42e-09 85.67737
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Click to Open File: Myfile.doc
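As a reading aid for the table above: logistic reports odds ratios, which are exponentiated coefficients, and the odds-ratio standard error is the coefficient standard error multiplied by the odds ratio. Under those conventions, the age row can be reproduced by hand from the reported figures (rounded to four decimals, so the last digit may differ slightly):

```latex
\begin{align*}
\hat\beta_{\text{age}} &= \ln(1.180138) \approx 0.1656 \\
\operatorname{SE}(\hat\beta_{\text{age}}) &= \frac{0.2443618}{1.180138} \approx 0.2071 \\
z &= \frac{0.1656}{0.2071} \approx 0.80 \\
95\%\ \text{CI for the odds ratio} &= \exp(0.1656 \pm 1.96 \times 0.2071) \approx (0.786,\ 1.771)
\end{align*}
```

The interval contains 1, which agrees with the reported p-value of 0.424 for age.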
Data visualization in Quarto using Stata commands
use merged_data_clean, clear
scatter last_length last_weight, xtitle("Weight (kg)") ytitle("Length (cm)") title("Length vs. weight")
graph export "scatter.png", as(png) replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
file scatter.png saved as PNG format
use merged_data_clean, clear
scatter last_length last_weight, xtitle("Weight (kg)") ytitle("Length (cm)") title("Length vs. weight")
graph save graph1.gph, replace
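The two chunks above draw the same scatter plot twice, once to export a PNG and once to save a .gph file. A minimal sketch of doing both from a single drawing follows. It uses Stata's built-in auto data set instead of the study data, and the file names scatter_sketch.png and scatter_sketch.gph are hypothetical. PNG export may not be available in every Stata flavor or platform.

```stata
* Built-in example data, used here only so the sketch is self-contained
sysuse auto, clear

* Draw the graph once; it stays in memory as the current graph
scatter mpg weight, xtitle("Weight (lbs.)") ytitle("Mileage (mpg)") title("Mileage vs. weight")

* Export the current graph as an image for the report
graph export "scatter_sketch.png", as(png) replace

* Save the same graph in Stata's own format for later editing
graph save "scatter_sketch.gph", replace
```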