Group 7 STATA Assignment

Group 7 members

  1. Vusumuzi Mabasa

  2. Thandeka Mchunu

  3. Nontobeko Mnisi

  4. Prosperity Hadebe

  5. Shantel Mphogo

  6. Phindile Stowe

Connecting STATA to R in Quarto

options(scipen = 999)
library(Statamarkdown)
Warning: package 'Statamarkdown' was built under R version 4.4.2
Stata found at C:/Program Files/Stata18/StataSE-64.exe
The 'stata' engine is ready to use.
stataexe <- "C:/Program Files/Stata18/StataSE-64.exe"

knitr::opts_chunk$set(engine.path = list(stata=stataexe))

Executing STATA commands in Quarto/Rmarkdown

Setting the working directory

cd "C:\Users\VUSI\Downloads\Group 7 Biostats Assignment"
C:\Users\VUSI\Downloads\Group 7 Biostats Assignment

Checking if we are working in the correct working directory

pwd
C:\Users\VUSI\Downloads\Group 7 Biostats Assignment

Data pre-processing and manipulation (Data cleaning)

Reading the data-set exported from REDCap and merging it with the baseline data


use "demo.dta", clear

codebook
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


-------------------------------------------------------------------------------
record_id                                                             Record ID
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [1,31]                        Units: 1
         Unique values: 24                        Missing .: 0/24

                  Mean: 15.2083
             Std. dev.: 9.25005

           Percentiles:     10%       25%       50%       75%       90%
                              3       7.5        15        23        27

-------------------------------------------------------------------------------
redcap_event_name                                                    Event Name
-------------------------------------------------------------------------------

                  Type: String (str24)

         Unique values: 1                         Missing "": 0/24

            Tabulation: Freq.  Value
                           24  "demographic_inform_arm_1"

-------------------------------------------------------------------------------
redcap_survey_identifier                                      Survey Identifier
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [.,.]                         Units: .
         Unique values: 0                         Missing .: 24/24

            Tabulation: Freq.  Value
                           24  .

-------------------------------------------------------------------------------
demographic_informat_v_0                                       Survey Timestamp
-------------------------------------------------------------------------------

                  Type: String (str19)

         Unique values: 23                        Missing "": 0/24

              Examples: "2025-02-08 21:02:10"
                        "2025-02-09 20:40:32"
                        "2025-02-10 16:56:02"
                        "2025-02-11 14:32:22"

               Warning: Variable has embedded blanks.

-------------------------------------------------------------------------------
dob                                                               Date of birth
-------------------------------------------------------------------------------

                  Type: Numeric daily date (float)

                 Range: [10894,23875]                 Units: 1
       Or equivalently: [29oct1989,14may2025]         Units: days
         Unique values: 24                        Missing .: 0/24

                  Mean:   15287 = 08nov2001(+ 1 hour)
             Std. dev.: 3164.95
           Percentiles:       10%        25%        50%        75%        90%
                            12072    13174.5      15629    15883.5      18295
                        19jan1993  26jan1996  16oct2002  27jun2003  02feb2010

-------------------------------------------------------------------------------
consent_date                                                      Consent date.
-------------------------------------------------------------------------------

                  Type: Numeric daily date (float)

                 Range: [20005,23785]                 Units: 1
       Or equivalently: [09oct2014,13feb2025]         Units: days
         Unique values: 8                         Missing .: 0/24

            Tabulation: Freq.  Value
                            1  20005  09oct2014
                            1  21663  24apr2019
                            1  23779  07feb2025
                            5  23780  08feb2025
                            4  23781  09feb2025
                            2  23782  10feb2025
                            5  23783  11feb2025
                            5  23785  13feb2025

-------------------------------------------------------------------------------
age                                                                         Age
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [0,35]                        Units: 1
         Unique values: 14                        Missing .: 0/24

                  Mean:      23
             Std. dev.: 7.72348

           Percentiles:     10%       25%       50%       75%       90%
                             15        21        22      28.5        32

-------------------------------------------------------------------------------
edu_level                              What is your highest level of education?
-------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: edu_level_

                 Range: [1,3]                         Units: 1
         Unique values: 3                         Missing .: 0/24

            Tabulation: Freq.   Numeric  Label
                            1         1  Primary school
                            5         2  High school
                           18         3  Tertiary education

-------------------------------------------------------------------------------
employ_status                           What is your current employment status?
-------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: employ_status_

                 Range: [1,5]                         Units: 1
         Unique values: 5                         Missing .: 0/24

            Tabulation: Freq.   Numeric  Label
                            5         1  Full-time
                            3         2  Part-time
                            7         3  Unemployed
                            3         4  Self-employed
                            6         5  Student

-------------------------------------------------------------------------------
monthly_income         What is your current household income (monthly, in ZAR)?
-------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: monthly_income_

                 Range: [1,5]                         Units: 1
         Unique values: 5                         Missing .: 0/24

            Tabulation: Freq.   Numeric  Label
                           10         1  Less than R3,500
                            1         2  R3, 500-R7,000
                            4         3  R7, 001-R15,000
                            4         4  R15, 001-R30,000
                            5         5  More than R30,000

-------------------------------------------------------------------------------
preg_complications                     Did you experience any pregnancy
                                       complications? (e.g., gestational
                                       diabetes, pre
-------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: preg_complications_

                 Range: [0,1]                         Units: 1
         Unique values: 2                         Missing .: 2/24

            Tabulation: Freq.   Numeric  Label
                           21         0  No
                            1         1  Yes
                            2         .  

-------------------------------------------------------------------------------
yes_specify                                             If yes, please specify.
-------------------------------------------------------------------------------

                  Type: String (str16)

         Unique values: 1                         Missing "": 23/24

            Tabulation: Freq.  Value
                           23  ""
                            1  "Placenta Previa "

               Warning: Variable has embedded and trailing blanks.

-------------------------------------------------------------------------------
demographic_informat_v_1                                              Complete?
-------------------------------------------------------------------------------

                  Type: Numeric (byte)
                 Label: demographic_informat_v_1_

                 Range: [2,2]                         Units: 1
         Unique values: 1                         Missing .: 0/24

            Tabulation: Freq.   Numeric  Label
                           24         2  Complete
. use "demo.dta"
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

. merge 1:1  record_id  using "baseline.dta"

    Result                      Number of obs
    -----------------------------------------
    Not matched                             3
        from master                         3  (_merge==1)
        from using                          0  (_merge==2)

    Matched                                21  (_merge==3)
    -----------------------------------------

. drop if _merge==1
(3 observations deleted)

. save "Merged.dta", replace
file Merged.dta saved
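
Before a 1:1 merge it can help to confirm that record_id uniquely identifies the rows of the master file. The block below is a minimal sketch, assuming the same demo.dta and baseline.dta files as above. It uses keep(match) as an alternative to dropping _merge==1 by hand, and the output name Merged_check.dta is hypothetical.

```stata
* Sketch only: same input files as the merge above
use "demo.dta", clear

* isid stops with an error if record_id is missing or repeated
isid record_id

* keep(match) retains only records found in both files,
* so no separate "drop if _merge==1" step is needed
merge 1:1 record_id using "baseline.dta", keep(match)

* Merged_check.dta is a hypothetical name, chosen so Merged.dta is not overwritten
save "Merged_check.dta", replace
```

If isid fails, the duplicates need to be resolved first, otherwise merge 1:1 itself will stop with an error.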


Describing the second data-set to be merged with the data-set above

use Follow_up

describe
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1528.csv)


Contains data from Follow_up.dta
 Observations:            25                  HonsMSc20257_DATA_NOHDRS_2025-0
                                                2-14_1528.csv
    Variables:            18                  14 Feb 2025 17:19
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
record_id       byte    %8.0g                 Record ID
redcap_event_~e str22   %22s                  Event Name
redcap_survey~r byte    %8.0g                 Survey Identifier
followup_time~p str19   %19s                  Survey Timestamp
baby_current_~t float   %9.0g                 What is your babys current weight
                                                (kg)?
baby_current_~h float   %9.0g                 What is your babys current length
                                                (cm)?
baby_bmi2       float   %9.0g                 Babys BMI
recent_growth~s byte    %8.0g      recent_growth_issues_
                                              Has your baby been diagnosed with
                                                growth-related issues?
yes_issues      byte    %8.0g                 If yes, please specify.
healthcare_vi~s byte    %17.0g     healthcare_visits_
                                              How many times has your baby
                                                visited a healthcare facility
                                                for a check-up since
feed_3month     byte    %39.0g     feed_3month_
                                              How is your baby currently fed?
feed_perday_3~h byte    %9.0g      feed_perday_3month_
                                              If breastfeeding, how many times
                                                per day does your baby feed?
bottles_perda~h byte    %11.0g     bottles_perday_3month_
                                              If formula feeding, how many
                                                bottles per day does your baby
                                                consume?
complementary~h byte    %8.0g      complementary_food_3month_
                                              Have you introduced any
                                                complementary foods (e.g.,
                                                porridge, purees)?
latest_illnes~h byte    %47.0g     latest_illness_3month_
                                              Has your baby experienced any of
                                                the following since the last
                                                survey? (Check all
hospitalizati~h byte    %8.0g      hospitalization_3month_
                                              Has your baby been hospitalized
                                                since the last survey?
maternal_meals  byte    %8.0g      maternal_meals_
                                              How many meals do you eat per
                                                day?
followup_comp~e byte    %10.0g     followup_complete_
                                              Complete?
-------------------------------------------------------------------------------
Sorted by: 

Removing duplicates for smooth merging

use Follow_up, clear
duplicates drop record_id, force

save Follow_upD.dta, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1528.csv)


Duplicates in terms of record_id

(12 observations deleted)

file Follow_upD.dta saved
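
Because the force option discards rows without showing them, it may be worth inspecting the duplicates first. The sketch below assumes the same Follow_up data-set; dup_n is a hypothetical helper variable, and baby_bmi2 is listed only as an example of a column to compare.

```stata
use Follow_up, clear

* Summarise how many copies of each record_id exist
duplicates report record_id

* dup_n (hypothetical name) = number of extra copies of each record_id
duplicates tag record_id, generate(dup_n)

* Show the repeated records side by side before deciding which to keep
sort record_id
list record_id baby_bmi2 if dup_n > 0, sepby(record_id)
```

Note that duplicates drop record_id, force keeps the first row of each record_id even when the repeated rows carry different values, so a check like this shows whether any information is being lost.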

Merging the de-duplicated follow-up data into the merged data-set, so that the complete data-set is ready for subsequent analysis


use Merged.dta, clear

merge 1:1 record_id using "Follow_upD", nogenerate

save "Merged1.dta", replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

    Result                      Number of obs
    -----------------------------------------
    Not matched                             8
        from master                         8  
        from using                          0  

    Matched                                13  
    -----------------------------------------

file Merged1.dta saved

Summarizing the data


use "Merged1", clear

misstable summarize
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
  redcap_sur~r |        21                   0  |      0          .           .
  preg_compl~s |         2                  19  |      2          0           1
  weigh_faci~y |         1                  20  |      1          1           1
   last_weight |         4                  17  |     17        2.9        66.6
   last_length |         5                  16  |     13         10          96
      baby_bmi |         4                  17  |     17   9.608708         290
  yes_select~s |        20                   1  |      1          1           1
  maternal_d~t |         1                  20  |      4          1           4
   clean_water |         1                  20  |      2          0           1
  mental_hea~s |         1                  20  |      2          0           1
  baby_curre~t |         8                  13  |     12        3.5          70
  baby_curre~h |         8                  13  |     11      25.99          98
     baby_bmi2 |         8                  13  |     13     6.6482    458.7848
  recent_gro~s |         8                  13  |      1          0           0
    yes_issues |        21                   0  |      0          .           .
  healthcare~s |         8                  13  |      4          1           4
   feed_3month |         8                  13  |      3          2           4
  feed_perda~h |        20                   1  |      1          2           2
  bottles_pe~h |        20                   1  |      1          2           2
  complement~h |         8                  13  |      2          0           1
  latest_ill~h |         8                  13  |      4          3           6
  hospitaliz~h |         8                  13  |      2          0           1
  maternal_m~s |         8                  13  |      2          2           3
  followup_c~e |         8                  13  |      1          2           2
  -----------------------------------------------------------------------------
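
The summary above shows BMI maxima of 290 (baby_bmi) and about 459 (baby_bmi2), which suggests data-entry problems in weight or length. The sketch below recomputes BMI as weight divided by length in metres squared and lists suspicious records. It assumes last_weight is recorded in kg (the variable label does not state a unit), bmi_check is a hypothetical variable name, and the cut-off of 40 is an arbitrary illustrative threshold rather than a clinical rule.

```stata
use "Merged1", clear

* Recompute BMI from the recorded values
* (assumption: last_weight in kg; last_length is in cm per its label)
generate bmi_check = last_weight / (last_length / 100)^2

* List records where either the stored or the recomputed BMI looks implausible;
* the threshold of 40 is illustrative only
list record_id last_weight last_length baby_bmi bmi_check ///
    if (baby_bmi > 40 & !missing(baby_bmi)) | ///
       (bmi_check > 40 & !missing(bmi_check))
```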

Deliberately corrupting the data by re-creating duplicates and introducing errors


use "Merged1", clear
expand 2 if _n <= 10 


* record_id is numeric (byte), so the string edit below is kept only as a comment
* replace record_id = record_id + " " if mod(_n, 5) == 0


replace redcap_event_name = "wrong_event" if mod(_n, 7) == 0

save merged_data_with_errors.dta, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

(10 observations created)

(4 real changes made)
(4 real changes made)

file merged_data_with_errors.dta saved

Cleaning the errors

use merged_data_with_errors, clear

duplicates drop record_id redcap_event_name, force


replace redcap_event_name = "demographic_inform_arm_1" if redcap_event_name == "wrong_event"


replace redcap_event_name = trim(redcap_event_name) 

save merged_data_clean.dta, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


Duplicates in terms of record_id redcap_event_name

(10 observations deleted)

(3 real changes made)

(0 real changes made)

file merged_data_clean.dta saved
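
A few quick checks can confirm that the cleaning steps worked. This is a minimal sketch that assumes merged_data_clean.dta was saved as above; it only reads the file and does not change it.

```stata
use merged_data_clean, clear

* Each record_id should now appear exactly once
isid record_id

* No row should still carry the injected "wrong_event" value
count if redcap_event_name == "wrong_event"
assert r(N) == 0

* Inspect which event names remain after cleaning
tabulate redcap_event_name, missing
```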

Using the clean data-set, we now format the date of birth as a STATA daily date (%td).

use merged_data_clean, clear

gen dob_str = string(dob, "%td")


gen dob_date = date(dob_str, "DMY")


format dob_date %td
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)
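
Since dob and consent_date are already stored as numeric daily dates, applying a display format directly may be enough. The sketch below also cross-checks the dates against the recorded age, because the codebook shows dates of birth as late as 14may2025 and ages of 0. The variable age_check is hypothetical, and dividing by 365.25 gives only an approximate age in completed years.

```stata
use merged_data_clean, clear

* A display format is sufficient for variables that are already daily dates
format dob consent_date %td

* Flag records whose date of birth is not before the consent date
list record_id dob consent_date age ///
    if dob >= consent_date & !missing(dob, consent_date)

* Approximate age in completed years at consent (365.25 days per year)
generate age_check = floor((consent_date - dob) / 365.25)
list record_id age age_check if age != age_check
```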

Checking the storage type of baby_bmi and converting it from string to numeric if needed

use merged_data_clean, clear
describe
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


Contains data from merged_data_clean.dta
 Observations:            21                  HonsMSc20257_DATA_NOHDRS_2025-0
                                                2-14_1523.csv
    Variables:            45                  24 Feb 2025 01:37
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
record_id       byte    %8.0g                 Record ID
redcap_event_~e str24   %24s                  Event Name
redcap_survey~r byte    %8.0g                 Survey Identifier
demographic_i~0 str19   %19s                  Survey Timestamp
dob             float   %dM_d,_CY             Date of birth
consent_date    float   %dM_d,_CY             Consent date.
age             byte    %8.0g                 Age
edu_level       byte    %18.0g     edu_level_
                                              What is your highest level of
                                                education?
employ_status   byte    %13.0g     employ_status_
                                              What is your current employment
                                                status?
monthly_income  byte    %17.0g     monthly_income_
                                              What is your current household
                                                income (monthly, in ZAR)?
preg_complica~s byte    %8.0g      preg_complications_
                                              Did you experience any pregnancy
                                                complications? (e.g.,
                                                gestational diabetes, pre
yes_specify     str16   %16s                  If yes, please specify.
demographic_i~1 byte    %10.0g     demographic_informat_v_1_
                                              Complete?
baseline_info~p str19   %19s                  Survey Timestamp
feed_baseline   byte    %18.0g     feed_baseline_
                                              How is your baby currently fed?
feed_per_day    byte    %9.0g      feed_per_day_
                                              How often does your baby feed per
                                                day?
solid_foods     byte    %8.0g      solid_foods_
                                              Have you introduced any solid
                                                foods to your baby?
nutrition_cou~l byte    %8.0g      nutrition_counsel_
                                              Do you have access to nutritional
                                                counseling?
weigh_facility  byte    %8.0g      weigh_facility_
                                              Has your baby been weighed at a
                                                healthcare facility since
                                                birth?
last_weight     float   %9.0g                 What was your babys last recorded
                                                weight?
last_length     float   %9.0g                 What was your babys last recorded
                                                length (in cm)?
baby_bmi        float   %9.0g                 Babys BMI.
condition_aff~h byte    %8.0g      condition_affecting_growth_
                                              Does your baby have any diagnosed
                                                medical conditions affecting
                                                growth?
yes_select_op~s byte    %21.0g     yes_select_options_
                                              If yes, please select the
                                                following options.
other_specify   str9    %9s                   If other, please specify.
maternal_diet   byte    %121.0g    maternal_diet_
                                              What is your daily diet like?
clean_water     byte    %8.0g      clean_water_
                                              Do you have access to clean
                                                drinking water?
mental_health~s byte    %8.0g      mental_health_concerns_
                                              Have you experienced any mental
                                                health concerns since giving
                                                birth?
baseline_info~e byte    %10.0g     baseline_information_complete_
                                              Complete?
_merge          byte    %23.0g     _merge     Matching result from merge
followup_time~p str19   %19s                  Survey Timestamp
baby_current_~t float   %9.0g                 What is your babys current weight
                                                (kg)?
baby_current_~h float   %9.0g                 What is your babys current length
                                                (cm)?
baby_bmi2       float   %9.0g                 Babys BMI
recent_growth~s byte    %8.0g      recent_growth_issues_
                                              Has your baby been diagnosed with
                                                growth-related issues?
yes_issues      byte    %8.0g                 If yes, please specify.
healthcare_vi~s byte    %17.0g     healthcare_visits_
                                              How many times has your baby
                                                visited a healthcare facility
                                                for a check-up since
feed_3month     byte    %39.0g     feed_3month_
                                              How is your baby currently fed?
feed_perday_3~h byte    %9.0g      feed_perday_3month_
                                              If breastfeeding, how many times
                                                per day does your baby feed?
bottles_perda~h byte    %11.0g     bottles_perday_3month_
                                              If formula feeding, how many
                                                bottles per day does your baby
                                                consume?
complementary~h byte    %8.0g      complementary_food_3month_
                                              Have you introduced any
                                                complementary foods (e.g.,
                                                porridge, purees)?
latest_illnes~h byte    %47.0g     latest_illness_3month_
                                              Has your baby experienced any of
                                                the following since the last
                                                survey? (Check all
hospitalizati~h byte    %8.0g      hospitalization_3month_
                                              Has your baby been hospitalized
                                                since the last survey?
maternal_meals  byte    %8.0g      maternal_meals_
                                              How many meals do you eat per
                                                day?
followup_comp~e byte    %10.0g     followup_complete_
                                              Complete?
-------------------------------------------------------------------------------
Sorted by: 

use merged_data_clean, clear
destring baby_bmi, replace
destring age, replace

dtable age preg_complications
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

baby_bmi already numeric; no replace

age already numeric; no replace


-----------------------------------------------------------------------------------------------
                                                                                     Summary   
-----------------------------------------------------------------------------------------------
N                                                                                            21
Age                                                                              22.857 (7.663)
Did you experience any pregnancy complications? (e.g., gestational diabetes, pre  0.053 (0.229)
-----------------------------------------------------------------------------------------------
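
In the table above preg_complications is summarised as a mean (0.053), which is hard to read for a Yes/No variable. A possible alternative, sketched below, is to mark categorical variables with the i. prefix so that dtable reports frequencies and percentages instead. Including edu_level is only an illustration.

```stata
use merged_data_clean, clear

* age is summarised as mean (SD); the i. variables as counts and percentages
dtable age i.preg_complications i.edu_level
```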

Creating a logistic regression table and saving it in the working directory

asdoc logistic preg_complications age 
(File Myfile.doc already exists, option append was assumed)

Logistic regression                                     Number of obs =     22
                                                        LR chi2(1)    =   0.86
                                                        Prob > chi2   = 0.3532
Log likelihood = -3.6370114                             Pseudo R2     = 0.1059

------------------------------------------------------------------------------
preg_compl~s | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   1.180138   .2443618     0.80   0.424     .7864683     1.77086
       _cons |   .0006154   .0037186    -1.22   0.221     4.42e-09    85.67737
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
Click to Open File:  Myfile.doc
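
The chunk above does not load a data-set explicitly, which may be why the reported number of observations (22) differs from the 21 rows described earlier. A minimal sketch of the same model with an explicit load is given below. It uses the built-in logistic command without asdoc, and the tabulate step is added only to show how many events the model has to work with.

```stata
use merged_data_clean, clear

* Check the outcome distribution first; the codebook shows a single "Yes"
tabulate preg_complications, missing

* Same model as above, fitted on the explicitly loaded clean data-set
logistic preg_complications age
```

With so few events the odds ratio and its confidence interval should be read with caution.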

Data Visualization in Quarto using STATA commands


use merged_data_clean, clear
scatter last_length last_weight, xtitle("Weight (kg)") ytitle("Length (cm)") title("Length vs. Weight")

graph export "scatter.png", as(png) replace

(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)

file scatter.png saved as PNG format

use merged_data_clean, clear
scatter last_length last_weight, xtitle("Weight (kg)") ytitle("Length (cm)") title("Length vs. Weight")

graph save graph1.gph, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file graph1.gph saved
plot <- knitr::include_graphics("scatter.png")

plot

use merged_data_clean, clear
graph box last_weight, over(feed_baseline) ytitle("Weight (kg)") title("Box-Plot: Weight by Feeding type")
graph export "boxplot.png", as(png) replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file boxplot.png saved as PNG format
use merged_data_clean, clear
graph box last_weight, over(feed_baseline)  ytitle("Weight (kg)") title("Weight by Feeding type")
graph save graph2.gph, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file graph2.gph saved
plot2 <- knitr::include_graphics("boxplot.png")

plot2

use merged_data_clean, clear
graph bar (mean) last_weight, over(feed_baseline) title("Mean weight by feeding type")
graph export bargraph.png, as(png) replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file bargraph.png saved as PNG format

use merged_data_clean, clear
graph bar (mean) last_weight, over(feed_baseline) title("Mean weight by feeding type")
graph save graph3.gph, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file graph3.gph saved
plot3 <- knitr::include_graphics("bargraph.png")
plot3

use merged_data_clean, clear
graph pie, over(edu_level) title("Educational level")
graph export "piechart.png", as(png) replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file piechart.png saved as PNG format

use merged_data_clean, clear
graph pie, over(edu_level) title("Educational level")
graph save graph4.gph, replace
(HonsMSc20257_DATA_NOHDRS_2025-02-14_1523.csv)


file graph4.gph saved
plot4 <- knitr::include_graphics("piechart.png")
plot4

graph combine graph1.gph graph2.gph ///
              graph3.gph graph4.gph, ///
              title("Combined") cols(2)
              
graph export Combined_GraphsG7.png, as(png) replace
file Combined_GraphsG7.png saved as PNG format
knitr::include_graphics("Combined_GraphsG7.png")
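
When all four graphs are drawn in a single chunk, they can be kept in memory with the name() option instead of being saved as .gph files first. The sketch below assumes one Stata chunk and the same merged_data_clean data-set; g1 to g4 are hypothetical graph names.

```stata
use merged_data_clean, clear

* Draw each graph once and keep it in memory under a name
scatter last_length last_weight, xtitle("Weight (kg)") ytitle("Length (cm)") ///
    title("Length vs. Weight") name(g1, replace)
graph box last_weight, over(feed_baseline) ytitle("Weight (kg)") ///
    title("Weight by feeding type") name(g2, replace)
graph bar (mean) last_weight, over(feed_baseline) ///
    title("Mean weight by feeding type") name(g3, replace)
graph pie, over(edu_level) title("Educational level") name(g4, replace)

* Combine the named graphs into a 2 x 2 panel and export it
graph combine g1 g2 g3 g4, cols(2) title("Combined")
graph export "Combined_GraphsG7.png", as(png) replace
```

This avoids redrawing each graph twice (once for the PNG and once for the .gph file), at the cost of putting all plotting commands in one chunk.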