Reproducible Stata analysis or report

Authors

Bongani Ncube

University Of the Witwatersrand (Biostatistician)

Published

January 30, 2025

Introduction

This report details reproducible research using Stata , it was compiled using Quarto and Rstudio.

Data Overview

Let’s begin by opening a Stata system dataset named auto and examining its structure:

sysuse auto
summarize

(1978 automobile data)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        make |          0
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
-------------+---------------------------------------------------------
       trunk |         74    13.75676    4.277404          5         23
      weight |         74    3019.459    777.1936       1760       4840
      length |         74    187.9324    22.26634        142        233
        turn |         74    39.64865    4.399354         31         51
displacement |         74    197.2973    91.83722         79        425
-------------+---------------------------------------------------------
  gear_ratio |         74    3.014865    .4562871       2.19       3.89
     foreign |         74    .2972973    .4601885          0          1

Let’s type describe to verify that our data imported successfully.

describe

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Contains data from C:\PROGRA~1\Stata18\ado\base/a/auto.dta
 Observations:            74                  1978 automobile data
    Variables:            12                  13 Apr 2022 17:45
                                              (_dta has notes)
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
make            str18   %-18s                 Make and model
price           int     %8.0gc                Price
mpg             int     %8.0g                 Mileage (mpg)
rep78           int     %8.0g                 Repair record 1978
headroom        float   %6.1f                 Headroom (in.)
trunk           int     %8.0g                 Trunk space (cu. ft.)
weight          int     %8.0gc                Weight (lbs.)
length          int     %8.0g                 Length (in.)
turn            int     %8.0g                 Turn circle (ft.)
displacement    int     %8.0g                 Displacement (cu. in.)
gear_ratio      float   %6.2f                 Gear ratio
foreign         byte    %8.0g      origin     Car origin
-------------------------------------------------------------------------------
Sorted by: foreign

Manually inputing data

input str12 name
  Ringo         
  John          
  Paul          
  George        
end

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

             name
  1.   Ringo         
  2.   John          
  3.   Paul          
  4.   George        
  5. end

The first line tells Stata that we are going to input data for a string variable called name. The number 12 tells input that we want the string variable to allow up to 12 characters for each observation. The next four lines are the raw data, which include the names Ringo, John, Paul, and George. The word end in the last line tells Stata that we are finished adding data.

Importing Comma delimited files (CSV)

we can use import delimited to import the data from CS1policies.csv

a simple dataset that i use for teaching CS1 exams

import delimited "CS1policies.csv", clear

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(encoding automatically selected: ISO-8859-1)
(4 vars, 1,000 obs)

Summarizing data

summarize

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          v1 |      1,000       500.5    288.8194          1       1000
         age |      1,000      44.602    8.459501         30         60
    duration |      1,000     66.7856    31.45082         12      119.9
     claimed |      1,000        .222    .4157991          0          1

summarize ,detail

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                             v1
-------------------------------------------------------------
      Percentiles      Smallest
 1%         10.5              1
 5%         50.5              2
10%        100.5              3       Obs               1,000
25%        250.5              4       Sum of wgt.       1,000

50%        500.5                      Mean              500.5
                        Largest       Std. dev.      288.8194
75%        750.5            997
90%        900.5            998       Variance       83416.67
95%        950.5            999       Skewness              0
99%        990.5           1000       Kurtosis       1.799998

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           30             30
 5%           31             30
10%           33             30       Obs               1,000
25%           37             30       Sum of wgt.       1,000

50%         44.5                      Mean             44.602
                        Largest       Std. dev.      8.459501
75%           52             60
90%           56             60       Variance       71.56316
95%           58             60       Skewness       .0442793
99%           60             60       Kurtosis       1.844854

                          duration
-------------------------------------------------------------
      Percentiles      Smallest
 1%         12.8             12
 5%        16.95           12.1
10%         23.4           12.1       Obs               1,000
25%         39.9           12.1       Sum of wgt.       1,000

50%         67.1                      Mean            66.7856
                        Largest       Std. dev.      31.45082
75%        93.95          119.8
90%        111.5          119.9       Variance        989.154
95%       115.45          119.9       Skewness      -.0036626
99%       118.95          119.9       Kurtosis       1.813458

                           claimed
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               1,000
25%            0              0       Sum of wgt.       1,000

50%            0                      Mean               .222
                        Largest       Std. dev.      .4157991
75%            0              1
90%            1              1       Variance       .1728889
95%            1              1       Skewness       1.337853
99%            1              1       Kurtosis       2.789852

We can type describe to view the contents of the data in memory.

describe

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Contains data
 Observations:         1,000                  
    Variables:             4                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
v1              int     %8.0g                 
age             byte    %8.0g                 
duration        float   %9.0g                 
claimed         byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

Codebook

Codebook gives detailed information for certain variables.

codebook age

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-------------------------------------------------------------------------------
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [30,60]                       Units: 1
         Unique values: 31                        Missing .: 0/1,000

                  Mean: 44.602
             Std. dev.: 8.4595

           Percentiles:    10%       25%       50%       75%       90%
                            33        37      44.5        52        56

Conditional workbook

codebook age if claimed == 1

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-------------------------------------------------------------------------------
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [30,60]                       Units: 1
         Unique values: 31                        Missing .: 0/222

                  Mean: 45.9324
             Std. dev.: 8.55908

           Percentiles:     10%       25%       50%       75%       90%
                             33        39        46        54        57

Compact codebook

this command provides a compact summary statistics of the variables in the dataset

codebook , compact

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Variable    Obs Unique     Mean  Min    Max  Label
-------------------------------------------------------------------------------
v1         1000   1000    500.5    1   1000  
age        1000     31   44.602   30     60  
duration   1000    641  66.7856   12  119.9  
claimed    1000      2     .222    0      1  
-------------------------------------------------------------------------------

Tables and frequency summaries

Next we can tabulate the variable claimed

tabulate claimed

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

    claimed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        778       77.80       77.80
          1 |        222       22.20      100.00
------------+-----------------------------------
      Total |      1,000      100.00

Or

table claimed

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



--------------------
        |  Frequency
--------+-----------
claimed |           
  0     |        778
  1     |        222
  Total |      1,000
--------------------

Or

tabstat claimed

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

    Variable |      Mean
-------------+----------
     claimed |      .222
------------------------

tab1 is used when we have multiple variables to be tabulised

to create a table containing mean values for various numerical variables with respect to a categorical variable we use table

table claimed ,statistic(mean duration age)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



------------------------------
        |  duration        age
--------+---------------------
claimed |                     
  0     |  72.20386   44.22237
  1     |   47.7973   45.93243
  Total |   66.7856     44.602
------------------------------

summarize age duration

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,000      44.602    8.459501         30         60
    duration |      1,000     66.7856    31.45082         12      119.9

/*Lists the variables in the dataset */
ds

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


v1        age       duration  claimed

/* First 10 observations */
list in 1/10

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

     +-------------------------------+
     | v1   age   duration   claimed |
     |-------------------------------|
  1. |  1    55       24.3         1 |
  2. |  2    45       79.2         0 |
  3. |  3    33       77.8         0 |
  4. |  4    41       79.3         0 |
  5. |  5    53        105         0 |
     |-------------------------------|
  6. |  6    42       81.2         0 |
  7. |  7    44         13         1 |
  8. |  8    33       37.1         0 |
  9. |  9    59       83.9         0 |
 10. | 10    60       67.5         0 |
     +-------------------------------+

/* First 10 observations */
list in 1/10

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

     +-------------------------------+
     | v1   age   duration   claimed |
     |-------------------------------|
  1. |  1    55       24.3         1 |
  2. |  2    45       79.2         0 |
  3. |  3    33       77.8         0 |
  4. |  4    41       79.3         0 |
  5. |  5    53        105         0 |
     |-------------------------------|
  6. |  6    42       81.2         0 |
  7. |  7    44         13         1 |
  8. |  8    33       37.1         0 |
  9. |  9    59       83.9         0 |
 10. | 10    60       67.5         0 |
     +-------------------------------+

/* Last 10 observations */
list in -1/10

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


observation numbers out of range
r(198);

r(198);

/* Show first 10 observations of the */
/*  first three variables of our data */
list age-duration in 1/10

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

     +----------------+
     | age   duration |
     |----------------|
  1. |  55       24.3 |
  2. |  45       79.2 |
  3. |  33       77.8 |
  4. |  41       79.3 |
  5. |  53        105 |
     |----------------|
  6. |  42       81.2 |
  7. |  44         13 |
  8. |  33       37.1 |
  9. |  59       83.9 |
 10. |  60       67.5 |
     +----------------+

/*View data */
browse

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


request ignored because of batch mode

Generating new variables

gen logAge = log(age)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

su

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          v1 |      1,000       500.5    288.8194          1       1000
         age |      1,000      44.602    8.459501         30         60
    duration |      1,000     66.7856    31.45082         12      119.9
     claimed |      1,000        .222    .4157991          0          1

misstable summ

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(variables nonmissing or string)

misstable patterns

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(no missing values)

misstable tree

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(no missing values)

misstable nested

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(no missing values)

Graphics

histogram age , frequency normal

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(bin=29, start=30, width=1.0344828)

age

hist claimed , discrete percent

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(start=0, width=1)

claimed

sysuse auto, clear

list make price rep78 if foreign == 1

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(1978 automobile data)

     +---------------------------------+
     | make              price   rep78 |
     |---------------------------------|
 53. | Audi 5000         9,690       5 |
 54. | Audi Fox          6,295       3 |
 55. | BMW 320i          9,735       4 |
 56. | Datsun 200        6,229       4 |
 57. | Datsun 210        4,589       5 |
     |---------------------------------|
 58. | Datsun 510        5,079       4 |
 59. | Datsun 810        8,129       4 |
 60. | Fiat Strada       4,296       3 |
 61. | Honda Accord      5,799       5 |
 62. | Honda Civic       4,499       4 |
     |---------------------------------|
 63. | Mazda GLC         3,995       4 |
 64. | Peugeot 604      12,990       . |
 65. | Renault Le Car    3,895       3 |
 66. | Subaru            3,798       5 |
 67. | Toyota Celica     5,899       5 |
     |---------------------------------|
 68. | Toyota Corolla    3,748       5 |
 69. | Toyota Corona     5,719       5 |
 70. | VW Dasher         7,140       4 |
 71. | VW Diesel         5,397       5 |
 72. | VW Rabbit         4,697       4 |
     |---------------------------------|
 73. | VW Scirocco       6,850       4 |
 74. | Volvo 260        11,995       5 |
     +---------------------------------+

Day 3 : Msc In Epidemiology

Research Protocol Development 1

Descriptive statistics

Written: Bongani Ncube

Date: 29 January 2025

use "bus.dta" , clear

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

Describing data

codebook

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-------------------------------------------------------------------------------
job                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [1,2]                         Units: 1
         Unique values: 2                         Missing .: 0/125

            Tabulation: Freq.  Value
                           59  1
                           66  2

-------------------------------------------------------------------------------
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [21,59]                       Units: 1
         Unique values: 36                        Missing .: 0/125

                  Mean:   38.12
             Std. dev.: 10.1247

           Percentiles:     10%       25%       50%       75%       90%
                             27        30        37        47        53

-------------------------------------------------------------------------------
ht                                                                  (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1.52,1.91]                   Units: .01
         Unique values: 14                        Missing .: 0/125

                  Mean: 1.64216
             Std. dev.:  .06203

           Percentiles:     10%       25%       50%       75%       90%
                           1.57       1.6      1.63      1.68       1.7

-------------------------------------------------------------------------------
wt                                                                  (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [45.9,97.3]                   Units: .1
         Unique values: 66                        Missing .: 0/125

                  Mean:  65.58
             Std. dev.: 10.009

           Percentiles:    10%       25%       50%       75%       90%
                          52.3      57.3      65.9      72.7      78.2

-------------------------------------------------------------------------------
triglyc                                                             (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [63,484]                      Units: 1
         Unique values: 99                        Missing .: 0/125

                  Mean: 197.648
             Std. dev.: 84.7462

           Percentiles:     10%       25%       50%       75%       90%
                             98       123       195       255       316

-------------------------------------------------------------------------------
sbp                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [100,200]                     Units: 1
         Unique values: 15                        Missing .: 0/125

                  Mean: 128.424
             Std. dev.: 18.4929

           Percentiles:     10%       25%       50%       75%       90%
                            110       120       120       136       150

browse

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


request ignored because of batch mode

describe

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Contains data from bus.dta
 Observations:           125                  
    Variables:             6                  26 Jan 2019 12:06
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
job             byte    %9.0g                 
age             byte    %9.0g                 
ht              float   %9.0g                 
wt              float   %9.0g                 
triglyc         int     %9.0g                 
sbp             int     %9.0g                 
-------------------------------------------------------------------------------
Sorted by:

Label define

label define joblabel 1 "Driver" 2 "Conductor"

label values job joblabel

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

tab job

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

        job |      Freq.     Percent        Cum.
------------+-----------------------------------
     Driver |         59       47.20       47.20
  Conductor |         66       52.80      100.00
------------+-----------------------------------
      Total |        125      100.00

hist age , normal
quietly graph export hist2.svg, replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(bin=11, start=21, width=3.4545455)

claimed

Normality test

\(H_O\) : Data is normally distributed
\(H_1\) : Data is not normally distributed

swilk age

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         age |        125    0.95915      4.069     3.151    0.00081

The p-value is 0.00081, which is less than 0.05. This means we reject the null hypothesis, suggesting that age is not normally distributed.

Automated tables

ssc install table1, replace
ssc install summtab, replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


host not found
http://fmwww.bc.edu/repec/bocode/t/ either
  1)  is not a valid URL, or
  2)  could not be contacted, or
  3)  is not a Stata download site (has no stata.toc file).
r(631);

r(631);

Generating new variable

gen hypertension = .
replace hypertension =1 if sbp >= 140
replace hypertension =0 if sbp < 140

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(125 missing values generated)

(31 real changes made)

(94 real changes made)

What is the code doing?

Generate the variable hypertension and initialize it with missing values:
Replace hypertension with 1 if the systolic blood pressure (sbp) is greater than or equal to 140:
Replace hypertension with 0 if the systolic blood pressure (sbp) is less than 140:

describe hypertension

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
hypertension    float   %9.0g

tab hypertension

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



hypertensio |
          n |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         94       75.20       75.20
          1 |         31       24.80      100.00
------------+-----------------------------------
      Total |        125      100.00

Label labels for `hypertension`

label define hplabel 1 "Yes" 0 "No"

label values hypertension hplabel

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

tab hypertension

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



hypertensio |
          n |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |         94       75.20       75.20
        Yes |         31       24.80      100.00
------------+-----------------------------------
      Total |        125      100.00

Summarise a continuous variable across a categorical variable

bysort job: sum age , d

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-------------------------------------------------------------------------------
-> job = Driver

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           25             25
 5%           27             27
10%           29             27       Obs                  59
25%           33             28       Sum of wgt.          59

50%           41                      Mean           41.40678
                        Largest       Std. dev.      9.399411
75%           48             58
90%           57             58       Variance       88.34892
95%           58             59       Skewness       .1813159
99%           59             59       Kurtosis       2.056059

-------------------------------------------------------------------------------
-> job = Conductor

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           21             21
 5%           22             22
10%           25             22       Obs                  66
25%           29             22       Sum of wgt.          66

50%           31                      Mean           35.18182
                        Largest       Std. dev.      9.907121
75%           41             54
90%           51             54       Variance       98.15105
95%           54             57       Skewness       .7353005
99%           59             59       Kurtosis       2.477688

use table1 ado file

table1 , vars(age contn \ job cat \ht contn) format(%9.1f) saving("$Tables/BusTable1.xls",replace)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


  +------------------------------------------+
  | Factor           Level       Value       |
  |------------------------------------------|
  | N                            125         |
  |------------------------------------------|
  | age, mean (SD)               38.1 (10.1) |
  |------------------------------------------|
  | job              Driver      59 (47.2%)  |
  |                  Conductor   66 (52.8%)  |
  |------------------------------------------|
  | ht, mean (SD)                1.6 (0.1)   |
  +------------------------------------------+
file /BusTable1.xls could not be saved
r(603);

r(603);

Using sumtab to generate table 1

summtab , by(hypertension) catvars(job) 
contvars(age ht wt) word wordname( 
table1_bus) median medfmt(0) total title(
"Table 1: Summary statistics by hypertension status") replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


Must specify either Word or Excel output (or both)
r(198);

r(198);

Data Management

`Example: Stepping Stones`

Stepping Stones is a participatory HIV prevention programme that aims to improve sexual health through building more gender-equitable relationships Cluster Randomized Trial was conducted among young rural men and women in the Eastern Cape Province in South Africa to assess impact of Stepping Stones on HIV and HSV 2 incidence and sexual practices .The 70 study clusters comprised 64 villages and six townships Clusters grouped into seven strata, one stratum comprised the townships and six were villages grouped according to proximity to particular roads

Within each stratum, equal numbers of clusters were allocated to the two study arms Intervention arm in which participants were given the 13 Stepping Stones sessions over a period of three months Control arm in which participants were given a single 3 hour session on HIV prevention

In each cluster recruited about 20 men and 20 women .Those eligible were aged 16 – 24, resident in village where they were at school, and mature enough to understand the study and the consent process - most were recruited from schools In this study unit of randomisation was a cluster of 20 men or a cluster of 20 women Primary outcomes were HIV-incidence and HSV-2 incidence over the study period of approximately two years.

Task

We want to join the two data sets to see which women have become HIV-infected (incident cases or sero-conversions), and to see whether there is any consistent pattern in the experience of IPV Women in the study each have a unique study identification number (idnum) .We can join the data sets together using the merge command as shown below.

Merge datasets

two datasets are joined side by side using “merge 1:1”
in the datasets to come ,idnum must uniquely identify observations in each dataset

Snapshot of baseline data

idnum	visitnox	intdatex	hivx	ipvnewx
1001	1	2003-03-19	0	0
1002	1	2003-03-19	0	0
1006	1	2003-03-19	0	0
1008	1	2003-03-19	0	0
1011	1	2003-03-19	1	0
1012	1	2003-03-19	1	1
1015	1	2003-03-19	0	1
1016	1	2003-03-19	0	0
1017	1	2003-03-19	1	0
1018	1	2003-03-19	0	1

Snapshot of followup data

idnum	intdate	hiv	ipvnew
1001	2004-03-24	0	0
1002	2004-03-24	0	0
1006	2004-03-25	0	0
1008	2004-03-24	0	1
1011	2004-05-12	1	1
1012	2004-04-15	NA	0
1015	2004-03-24	0	0
1016	2004-04-15	0	1
1017	2004-04-15	NA	0
1018	2004-04-15	0	1

the unique column is idnum

use "Datasets/stonwombas.dta", clear
use "Datasets/stonwomfol.dta", clear

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(Stepping Stones women baseline)

(Stepping Stones women 12 months)

merge the datasets

We need to start with introducing the first data file to Stata through the command, use
we then specify the required dataset to join to using using
if you want to specify the name of the ‘merge’ variable, the generate option comes in handy. After specifying the master file through.

use "Datasets/stonwombas.dta", clear
merge 1:1 idnum using "Datasets/stonwomfol.dta" , gen(stonmerge)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(Stepping Stones women baseline)

    Result                      Number of obs
    -----------------------------------------
    Not matched                           306
        from master                       306  (stonmerge==1)
        from using                          0  (stonmerge==2)

    Matched                             1,109  (stonmerge==3)
    -----------------------------------------

The summary above shows that 1,109 individuals had their data merged, whereas 306 were not merged because they did not match. 306 were not merged from the master file while 0 were not merged from the using file.

in our case it implies that – in this case 1,109 women had observations in both data sets, while 306 only had observations in the baseline data set and not in the follow-up data set – this is because these women were lost to follow-up.

tab stonmerge

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        Master only (1) |        306       21.63       21.63
            Matched (3) |      1,109       78.37      100.00
------------------------+-----------------------------------
                  Total |      1,415      100.00

We can now use data from both of the datasets e.g. we can see how many women HIV sero-converted during the twelve month follow-up period

tab hivx hiv , missing

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

       HIV |               hiv
serostatus |         0          1          . |     Total
-----------+---------------------------------+----------
         0 |       900         65        291 |     1,256 
         1 |         1        104         54 |       159 
-----------+---------------------------------+----------
     Total |       901        169        345 |     1,415

Of 1256 women who tested HIV-negative at baseline, 65 sero-converted (i.e. became HIV-infected) while 900 remained HIV-negative (and remainder did not have a follow-up result) .Note one woman tested HIV-positive at baseline but HIV-negative at follow-up We can identify this woman as participant number 1870 Would need to go to original forms and fieldworkers to understand what happened with this participant (possible for example that a friend “replaced” her at the follow-up visit)

Can compare the proportion of women experiencing IPV (Intimate Partner Violence) at follow-up, according to whether or not they had experienced IPV at baseline

tab ipvnewx ipvnew , row

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |        ipvnew
   ipvnewx |         0          1 |     Total
-----------+----------------------+----------
         0 |       617        137 |       754 
           |     81.83      18.17 |    100.00 
-----------+----------------------+----------
         1 |       201        154 |       355 
           |     56.62      43.38 |    100.00 
-----------+----------------------+----------
     Total |       818        291 |     1,109 
           |     73.76      26.24 |    100.00

Amongst women who had not experienced IPV at baseline, 18.2% experienced IPV at follow-up, while amongst women who had experienced IPV at baseline, 43.4% experienced IPV at follow-up

Finally using the visit dates, we can look at the distribution of follow-up days between the two visits - which should be roughly 365 days since follow-up was at 12 months

gen followdays = intdate - intdatex

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(308 missing values generated)

summa followdays , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                         followdays
-------------------------------------------------------------
      Percentiles      Smallest
 1%          287           -169
 5%          344              4
10%          357            103       Obs               1,107
25%          370            208       Sum of wgt.       1,107

50%          386                      Mean           409.2882
                        Largest       Std. dev.      70.97736
75%          433            717
90%          510            736       Variance       5037.786
95%          547            738       Skewness       .5813372
99%          618            745       Kurtosis       10.23827

Median follow-up time was 386 days (slightly larger than the expected 365 days) and mean was 409 days - due to some participants only being traced after about 2 years Two strange values – one with negative follow-up days (meaning that follow-up visit was recorded as having taken place before baseline visit) and one with only 4 days of follow-up between visits – we can identify the participants (idnum 1423 and idnum 1666) but would need to look at original fieldwork records in order to resolve query

list if followdays <100 , noobs

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



  +--------------------------------------------------------------------------+
  | idnum | visitnox |  intdatex | hivx | ipvnewx |   intdate | hiv | ipvnew |
  |  1423 |        1 | 30apr2004 |    0 |       0 | 04may2004 |   1 |      0 |
  |--------------------------------------------------------------------------|
  |               stonmerge              |             follow~s              |
  |             Matched (3)              |                    4              |
  +--------------------------------------------------------------------------+

  +--------------------------------------------------------------------------+
  | idnum | visitnox |  intdatex | hivx | ipvnewx |   intdate | hiv | ipvnew |
  |  1666 |        1 | 26may2005 |    0 |       0 | 08dec2004 |   0 |      1 |
  |--------------------------------------------------------------------------|
  |               stonmerge              |             follow~s              |
  |             Matched (3)              |                 -169              |
  +--------------------------------------------------------------------------+

Many to one merging

In above example joined two data sets using a one-to-one merge, since each data set had only one observation per participant

In some cases there will be many observations per participant in one data set and only one observation per participant in the other data set

Ex: Longitudinal studies where all of follow-up observations are put into the same data set (which thus has many observations per participant) while the other data set contains baseline and design information (and thus one observation per participant)

Example: COSTOP randomized controlled trial carried out to investigate whether it is safe for HIV-infected patients stabilized on ART (on ART for at least six months, on CTX prophylaxis and with a CD4 count above 250 cells/µl) to stop taking CTX prophylaxis Total of 2180 patients individually randomized to either continue taking CTX or to take an equivalent placebo (i.e. to stop CTX prophylaxis).

One secondary objective was to compare neutrophil counts over time between the two treatment arms, since CTX has some haematological toxicity Data on neutrophil counts given in costop_neutrophil, while baseline data given in costop_base

use "Datasets/costop_base.dta" , clear

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

sex	ageyrs	site	whostbas	cdstrat	idnum
1	46	1	2	1	1001
2	41	1	2	1	1002
2	47	1	2	1	1003
1	42	1	3	1	1004
2	38	1	3	1	1005
2	51	1	2	1	1006
2	48	1	4	1	1007
1	48	1	2	1	1008
2	35	1	2	1	1009
2	23	1	2	1	1010

desc

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Contains data from Datasets/costop_base.dta
 Observations:         2,180                  
    Variables:             6                  23 Jan 2023 12:34
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
sex             byte    %8.0g      sexlab     Gender
ageyrs          byte    %9.0g                 
site            byte    %8.0g      sitelab    study Site
whostbas        byte    %8.0g                 Baseline WHO stage
cdstrat         byte    %8.0g      cdlab      CD4 stratum at baseline
idnum           float   %9.0g                 
-------------------------------------------------------------------------------
Sorted by: idnum

tab1 sex site cdstrat

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-> tabulation of sex  

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |        569       26.10       26.10
     Female |      1,611       73.90      100.00
------------+-----------------------------------
      Total |      2,180      100.00

-> tabulation of site  

 study Site |      Freq.     Percent        Cum.
------------+-----------------------------------
    Entebbe |      1,002       45.96       45.96
     Masaka |      1,178       54.04      100.00
------------+-----------------------------------
      Total |      2,180      100.00

-> tabulation of cdstrat  

CD4 stratum |
at baseline |      Freq.     Percent        Cum.
------------+-----------------------------------
    251-499 |      1,142       52.39       52.39
       500+ |      1,038       47.61      100.00
------------+-----------------------------------
      Total |      2,180      100.00

use "Datasets/costop_neutrophil.dta" , clear

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

ne_abs	idnum	months
1.55	1001	2.661191
1.26	1001	5.519507
0.94	1001	8.279261
1.22	1001	10.940452
1.83	1001	13.798768
1.21	1001	16.558521
1.20	1001	19.318275
1.34	1001	22.078030
1.36	1001	24.837782
1.37	1001	27.597536

Data set costop_neutrophil has 23,093 observations, since participants could have a number of hematology tests during the trial (including neutrophil count)

Can now merge neutrophil data to baseline data to get associated characteristics (e.g. sex, age, study site, CD4 stratum, WHO stage) corresponding to the neutrophil counts

Many observations in neutrophil data (from a single participant) will be merged to a single observation in the baseline data – so known as many-to-one or m:1 merging

Also sometimes called “table lookup” since data for a given participant are “looked up” in the baseline table


use "Datasets/costop_neutrophil.dta" , clear
merge m:1 idnum using "Datasets/costop_base.dta" , gen(cosnm)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



(variable idnum was int, now float to accommodate using data's values)

    Result                      Number of obs
    -----------------------------------------
    Not matched                            23
        from master                         0  (cosnm==1)
        from using                         23  (cosnm==2)

    Matched                            23,181  (cosnm==3)
    -----------------------------------------

tab1 sex site cdstrat

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-> tabulation of sex  

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |      5,988       25.81       25.81
     Female |     17,216       74.19      100.00
------------+-----------------------------------
      Total |     23,204      100.00

-> tabulation of site  

 study Site |      Freq.     Percent        Cum.
------------+-----------------------------------
    Entebbe |     10,782       46.47       46.47
     Masaka |     12,422       53.53      100.00
------------+-----------------------------------
      Total |     23,204      100.00

-> tabulation of cdstrat  

CD4 stratum |
at baseline |      Freq.     Percent        Cum.
------------+-----------------------------------
    251-499 |     12,114       52.21       52.21
       500+ |     11,090       47.79      100.00
------------+-----------------------------------
      Total |     23,204      100.00

There were 24 observations from baseline table that did not have matches in neutrophil data – these correspond to participants who dropped out early without having any hematology tests

Also among 23,117 hematology tests carried out 10,776 were in Entebbe and 12,341 in Masaka We can look at a boxplot to compare the neutrophil counts between Entebbe and Masaka

graph box ne_abs , over(site)
quietly graph export box.svg, replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

claimed

There appear to be a number of outliers among the neutrophil counts, which could be laboratory errors (or missing decimal places in the results), so we can look at the counts with a cut-off of 10 (in practice we would investigate this together with the laboratory)

graph box ne_abs if ne_abs<10, over(site)
quietly graph export box2.svg, replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

claimed

With outliers removed seems little difference in distn of neutrophil counts between sites - examine this further by looking at summary statistics by site

tabstat ne_abs , stat(n mean sd q) by(site)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



Summary for variables: ne_abs
Group variable: site (study Site)

   site |         N      Mean        SD       p25       p50       p75
--------+------------------------------------------------------------
Entebbe |     10765  1.860236  1.191051      1.17      1.61      2.21
 Masaka |     12328  1.770006  1.230289      1.13      1.56      2.14
--------+------------------------------------------------------------
  Total |     23093  1.812067  1.212965      1.15      1.59      2.17
---------------------------------------------------------------------

save "costop_ndm.dta" , replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


file costop_ndm.dta saved

This confirms that on average neutrophil counts are slightly higher in Entebbe than in Masaka .Note that we save the merged data set as “costop_ndm.dta” to use in the next three sections

In neutrophil data set, for each participant we might want to select 1st neutrophil count (to get estimate of this at baseline or enrolment) and also select last neutrophil count (to get estimate at end of the trial) We will see how to do this below:

sort idnum months
by idnum: keep if _n==1

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



(21,024 observations deleted)

tab1 sex site cdstrat if ne_abs<.

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



-> tabulation of sex if ne_abs<. 

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |        555       25.81       25.81
     Female |      1,595       74.19      100.00
------------+-----------------------------------
      Total |      2,150      100.00

-> tabulation of site if ne_abs<. 

 study Site |      Freq.     Percent        Cum.
------------+-----------------------------------
    Entebbe |        991       46.09       46.09
     Masaka |      1,159       53.91      100.00
------------+-----------------------------------
      Total |      2,150      100.00

-> tabulation of cdstrat if ne_abs<. 

CD4 stratum |
at baseline |      Freq.     Percent        Cum.
------------+-----------------------------------
    251-499 |      1,128       52.47       52.47
       500+ |      1,022       47.53      100.00
------------+-----------------------------------
      Total |      2,150      100.00

summa ne_abs , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                       Neutrophils Abs
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .47            .03
 5%          .73            .06
10%          .87             .1       Obs               2,150
25%         1.17            .23       Sum of wgt.       2,150

50%         1.61                      Mean            1.81113
                        Largest       Std. dev.      1.076618
75%         2.18           7.38
90%         2.92           7.92       Variance       1.159106
95%         3.49           10.1       Skewness        5.44071
99%          5.8          23.67       Kurtosis       86.54071

summa month , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                           months
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.6899384      -6.505134
 5%    -.4928131       -3.74538
10%    -.4599589       -3.74538       Obs               2,157
25%     2.726899      -3.055442       Sum of wgt.       2,157

50%     2.759754                      Mean           2.316625
                        Largest       Std. dev.      1.361104
75%     2.759754       9.297741
90%     2.858316       9.626284       Variance       1.852604
95%     3.022587       11.03901       Skewness      -.5982735
99%     5.552361       11.03901       Kurtosis       7.534941

Since we are looking within idnum and have sorted by month within idnum, _n measures number of observation within each participant so _n=1 denotes the first (earliest) hematology test, _n=2 the second test and so on Only 2,156 participants have at least one neutrophil result Note that for over 10% this was found before enrolment i.e. during the screening phase of the trial – as shown by negative values for month Can now see how to find the last neutrophil count

use "costop_ndm.dta" , clear
sort idnum months
by idnum: keep if _n==_N

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...




(21,024 observations deleted)

summa ne_abs , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                       Neutrophils Abs
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .54            .26
 5%          .77             .3
10%          .91            .33       Obs               2,149
25%         1.18            .34       Sum of wgt.       2,149

50%         1.63                      Mean           1.843406
                        Largest       Std. dev.      1.715537
75%         2.19          10.13
90%         2.86          10.21       Variance       2.943067
95%         3.46             48       Skewness       19.74392
99%         4.99             50       Kurtosis       534.6258

summa month , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                           months
-------------------------------------------------------------
      Percentiles      Smallest
 1%     5.716632      -.4599589
 5%     13.76591       1.051335
10%     16.55852       2.759754       Obs               2,157
25%     23.49076       2.759754       Sum of wgt.       2,157

50%     30.35729                      Mean           28.24435
                        Largest       Std. dev.      7.616737
75%      33.1499       39.45791
90%      35.8768       39.55647       Variance       58.01468
95%     36.00821       39.75359       Skewness      -1.010526
99%     38.63655       39.75359       Kurtosis       3.356113

save "costop_nelast.dta" , replace

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


file costop_nelast.dta saved

summa month , det

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...

                           months
-------------------------------------------------------------
      Percentiles      Smallest
 1%     5.716632      -.4599589
 5%     13.76591       1.051335
10%     16.55852       2.759754       Obs               2,157
25%     23.49076       2.759754       Sum of wgt.       2,157

50%     30.35729                      Mean           28.24435
                        Largest       Std. dev.      7.616737
75%      33.1499       39.45791
90%      35.8768       39.55647       Variance       58.01468
95%     36.00821       39.75359       Skewness      -1.010526
99%     38.63655       39.75359       Kurtosis       3.356113

Last observation within each idnum (participant) is labelled _N Note that median and mean neutrophil count are very similar at beginning and end of the trial Now have two data sets – one containing first neutrophil count and other containing last neutrophil count Could merge these data sets (1:1) and hence find for each participant how much neutrophil count has changed over course of the trial

Long and wide format data

The neutrophil data set is an example of a “long” data set since we have a separate row of data for each visit An alternative to this would be a “wide” data set in which we have one row of data for each participant and within this row the first neutrophil count is recorded as ne_abs1, the second as ne_abs2, the third as ne_abs_3 and so on We see how to do this below:

use "costop_ndm.dta" , clear
drop if ne_abs==.
sort idnum months
by idnum: gen visitnum = _n
tab visitnum

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



(111 observations deleted)




   visitnum |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,156        9.34        9.34
          2 |      2,146        9.29       18.63
          3 |      2,130        9.22       27.85
          4 |      2,106        9.12       36.97
          5 |      2,066        8.95       45.92
          6 |      1,979        8.57       54.49
          7 |      1,880        8.14       62.63
          8 |      1,794        7.77       70.40
          9 |      1,635        7.08       77.48
         10 |      1,486        6.43       83.91
         11 |      1,283        5.56       89.47
         12 |        947        4.10       93.57
         13 |        643        2.78       96.35
         14 |        382        1.65       98.01
         15 |        191        0.83       98.84
         16 |        106        0.46       99.29
         17 |         60        0.26       99.55
         18 |         38        0.16       99.72
         19 |         23        0.10       99.82
         20 |         16        0.07       99.89
         21 |         10        0.04       99.93
         22 |          8        0.03       99.97
         23 |          1        0.00       99.97
         24 |          1        0.00       99.97
         25 |          1        0.00       99.98
         26 |          1        0.00       99.98
         27 |          1        0.00       99.99
         28 |          1        0.00       99.99
         29 |          1        0.00      100.00
         30 |          1        0.00      100.00
------------+-----------------------------------
      Total |     23,093      100.00

Variable “visitnum” measures number of each visit within each participant Note: if a person had (say) visitnum 5, then they must have also had visits 1,2,3 and 4 That is why the frequency decreases with each visit number – so 8 participants had 22 visits and one participant had 30 visits. Now see how to make a wide data set In this wide data set there will be variables for ne_abs1 up to ne_abs30 and month1 up to month30 (although ne_abs30 and month30 will be missing for all except one participant)

use costop_ndm , clear

drop if ne_abs==.

sort idnum months

by idnum: gen visitnum = _n

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



(111 observations deleted)

tab visitnum

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



   visitnum |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      2,156        9.34        9.34
          2 |      2,146        9.29       18.63
          3 |      2,130        9.22       27.85
          4 |      2,106        9.12       36.97
          5 |      2,066        8.95       45.92
          6 |      1,979        8.57       54.49
          7 |      1,880        8.14       62.63
          8 |      1,794        7.77       70.40
          9 |      1,635        7.08       77.48
         10 |      1,486        6.43       83.91
         11 |      1,283        5.56       89.47
         12 |        947        4.10       93.57
         13 |        643        2.78       96.35
         14 |        382        1.65       98.01
         15 |        191        0.83       98.84
         16 |        106        0.46       99.29
         17 |         60        0.26       99.55
         18 |         38        0.16       99.72
         19 |         23        0.10       99.82
         20 |         16        0.07       99.89
         21 |         10        0.04       99.93
         22 |          8        0.03       99.97
         23 |          1        0.00       99.97
         24 |          1        0.00       99.97
         25 |          1        0.00       99.98
         26 |          1        0.00       99.98
         27 |          1        0.00       99.99
         28 |          1        0.00       99.99
         29 |          1        0.00      100.00
         30 |          1        0.00      100.00
------------+-----------------------------------
      Total |     23,093      100.00

reshape wide ne_abs month , i(idnum) j(visitnum)

list in 1/5

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
>  29 30)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations           23,093   ->   2,156       
Number of variables                  10   ->   67          
j variable (30 values)         visitnum   ->   (dropped)
xij variables:
                                 ne_abs   ->   ne_abs1 ne_abs2 ... ne_abs30
                                 months   ->   months1 months2 ... months30
-----------------------------------------------------------------------------

     +------------------------------------------------------------------------+
  1. | idnum | ne_abs1 |  months1 | ne_abs2 |  months2 | ne_abs3  |  months3  |
     |  1001 |    1.55 | 2.661191 |    1.26 | 5.519507 |     .94  | 8.279261  |
     |------------------------------------------------------------------------|
     | ne_abs4 |  months4 | ne_abs5 |  months5 | ne_abs6 |  months6 | ne_abs7 |
     |    1.22 | 10.94045 |    1.83 | 13.79877 |    1.21 | 16.55852 |     1.2 |
     |------------------------------------------------------------------------|
     |  months7  | ne_abs8  |  months8  |  ne_abs9  |   months9  |  ne_abs10  |
     | 19.31828  |    1.34  | 22.07803  |     1.36  |  24.83778  |      1.37  |
     |-----------+-----------------------------------------------+------------|
     | months10  | ne_abs11  | months11  | ne_abs12  | months12  |  ne_abs13  |
     | 27.59754  |      1.6  | 30.39014  |     1.47  | 35.90965  |      2.29  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months13  | ne_abs14  | months14  | ne_abs15  | months15  |  ne_abs16  |
     | 38.66941  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months16  | ne_abs17  | months17  | ne_abs18  | months18  |  ne_abs19  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months19  | ne_abs20  | months20  | ne_abs21  | months21  |  ne_abs22  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months22  | ne_abs23  | months23  | ne_abs24  | months24  |  ne_abs25  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months25  | ne_abs26  | months26  | ne_abs27  | months27  |  ne_abs28  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |------------------------------------------------------------------------|
     | months28 | ne_abs29 | months29 | ne_abs30 | months30 |    sex | ageyrs |
     |        . |        . |        . |        . |        . |   Male |     46 |
     |------------------------------------------------------------------------|
     |       site     |    whostbas     |    cdstrat     |          cosnm     |
     |    Entebbe     |           2     |    251-499     |    Matched (3)     |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  2. | idnum | ne_abs1 |  months1 | ne_abs2 |  months2 | ne_abs3  |  months3  |
     |  1002 |     .86 | 2.694045 |     .84 | 5.453799 |    1.72  | 7.753593  |
     |------------------------------------------------------------------------|
     | ne_abs4 |  months4 | ne_abs5 |  months5 | ne_abs6 |  months6 | ne_abs7 |
     |    1.09 | 8.279261 |     .91 | 10.80903 |     .99 | 13.79877 |    1.04 |
     |------------------------------------------------------------------------|
     |  months7  | ne_abs8  |  months8  |  ne_abs9  |   months9  |  ne_abs10  |
     | 16.55852  |     .53  | 19.31828  |     1.43  |  22.07803  |      1.07  |
     |-----------+-----------------------------------------------+------------|
     | months10  | ne_abs11  | months11  | ne_abs12  | months12  |  ne_abs13  |
     | 24.83778  |      .89  | 27.59754  |     1.07  | 30.32443  |       .55  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months13  | ne_abs14  | months14  | ne_abs15  | months15  |  ne_abs16  |
     | 33.08419  |     1.29  | 35.84394  |      .54  | 38.63655  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months16  | ne_abs17  | months17  | ne_abs18  | months18  |  ne_abs19  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months19  | ne_abs20  | months20  | ne_abs21  | months21  |  ne_abs22  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months22  | ne_abs23  | months23  | ne_abs24  | months24  |  ne_abs25  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months25  | ne_abs26  | months26  | ne_abs27  | months27  |  ne_abs28  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |------------------------------------------------------------------------|
     | months28 | ne_abs29 | months29 | ne_abs30 | months30 |    sex | ageyrs |
     |        . |        . |        . |        . |        . | Female |     41 |
     |------------------------------------------------------------------------|
     |       site     |    whostbas     |    cdstrat     |          cosnm     |
     |    Entebbe     |           2     |    251-499     |    Matched (3)     |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  3. | idnum | ne_abs1 |  months1 | ne_abs2 |  months2 | ne_abs3  |  months3  |
     |  1003 |     .96 | 5.552361 |     2.3 | 8.279261 |    1.31  | 10.80903  |
     |------------------------------------------------------------------------|
     | ne_abs4 |  months4 | ne_abs5 |  months5 | ne_abs6 |  months6 | ne_abs7 |
     |    1.38 | 13.79877 |    1.86 | 16.55852 |    4.46 | 19.31828 |    1.45 |
     |------------------------------------------------------------------------|
     |  months7  | ne_abs8  |  months8  |  ne_abs9  |   months9  |  ne_abs10  |
     | 22.07803  |    1.24  | 24.83778  |     1.52  |  27.56468  |      1.28  |
     |-----------+-----------------------------------------------+------------|
     | months10  | ne_abs11  | months11  | ne_abs12  | months12  |  ne_abs13  |
     | 30.39014  |     1.75  | 35.90965  |     1.76  | 38.20945  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months13  | ne_abs14  | months14  | ne_abs15  | months15  |  ne_abs16  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months16  | ne_abs17  | months17  | ne_abs18  | months18  |  ne_abs19  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months19  | ne_abs20  | months20  | ne_abs21  | months21  |  ne_abs22  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months22  | ne_abs23  | months23  | ne_abs24  | months24  |  ne_abs25  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months25  | ne_abs26  | months26  | ne_abs27  | months27  |  ne_abs28  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |------------------------------------------------------------------------|
     | months28 | ne_abs29 | months29 | ne_abs30 | months30 |    sex | ageyrs |
     |        . |        . |        . |        . |        . | Female |     47 |
     |------------------------------------------------------------------------|
     |       site     |    whostbas     |    cdstrat     |          cosnm     |
     |    Entebbe     |           2     |    251-499     |    Matched (3)     |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  4. | idnum | ne_abs1 |  months1 | ne_abs2 |  months2 | ne_abs3  |  months3  |
     |  1005 |    1.74 | 2.759754 |    1.34 | 5.519507 |    1.62  | 8.279261  |
     |------------------------------------------------------------------------|
     | ne_abs4 |  months4 | ne_abs5 |  months5 | ne_abs6 |  months6 | ne_abs7 |
     |     .65 | 8.673512 |    1.59 | 10.77618 |     1.4 | 13.79877 |     .98 |
     |------------------------------------------------------------------------|
     |  months7  | ne_abs8  |  months8  |  ne_abs9  |   months9  |  ne_abs10  |
     | 16.55852  |    1.81  | 19.31828  |     1.59  |  22.07803  |      2.03  |
     |-----------+-----------------------------------------------+------------|
     | months10  | ne_abs11  | months11  | ne_abs12  | months12  |  ne_abs13  |
     | 24.83778  |     1.48  | 27.63039  |     1.45  | 30.39014  |      1.72  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months13  | ne_abs14  | months14  | ne_abs15  | months15  |  ne_abs16  |
     |  33.1499  |     2.28  |  35.8768  |     1.56  |  38.6037  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months16  | ne_abs17  | months17  | ne_abs18  | months18  |  ne_abs19  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months19  | ne_abs20  | months20  | ne_abs21  | months21  |  ne_abs22  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months22  | ne_abs23  | months23  | ne_abs24  | months24  |  ne_abs25  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months25  | ne_abs26  | months26  | ne_abs27  | months27  |  ne_abs28  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |------------------------------------------------------------------------|
     | months28 | ne_abs29 | months29 | ne_abs30 | months30 |    sex | ageyrs |
     |        . |        . |        . |        . |        . | Female |     38 |
     |------------------------------------------------------------------------|
     |       site     |    whostbas     |    cdstrat     |          cosnm     |
     |    Entebbe     |           3     |    251-499     |    Matched (3)     |
     +------------------------------------------------------------------------+

     +------------------------------------------------------------------------+
  5. | idnum | ne_abs1 |  months1 | ne_abs2 |  months2 | ne_abs3  |  months3  |
     |  1006 |    1.96 | 2.759754 |    1.11 | 5.486653 |     .36  | 8.279261  |
     |------------------------------------------------------------------------|
     | ne_abs4 |  months4 | ne_abs5 |  months5 | ne_abs6 |  months6 | ne_abs7 |
     |       . |        . |       . |        . |       . |        . |       . |
     |------------------------------------------------------------------------|
     |  months7  | ne_abs8  |  months8  |  ne_abs9  |   months9  |  ne_abs10  |
     |        .  |       .  |        .  |        .  |         .  |         .  |
     |-----------+-----------------------------------------------+------------|
     | months10  | ne_abs11  | months11  | ne_abs12  | months12  |  ne_abs13  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months13  | ne_abs14  | months14  | ne_abs15  | months15  |  ne_abs16  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months16  | ne_abs17  | months17  | ne_abs18  | months18  |  ne_abs19  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months19  | ne_abs20  | months20  | ne_abs21  | months21  |  ne_abs22  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months22  | ne_abs23  | months23  | ne_abs24  | months24  |  ne_abs25  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |-----------+-----------+-----------+-----------+-----------+------------|
     | months25  | ne_abs26  | months26  | ne_abs27  | months27  |  ne_abs28  |
     |        .  |        .  |        .  |        .  |        .  |         .  |
     |------------------------------------------------------------------------|
     | months28 | ne_abs29 | months29 | ne_abs30 | months30 |    sex | ageyrs |
     |        . |        . |        . |        . |        . | Female |     51 |
     |------------------------------------------------------------------------|
     |       site     |    whostbas     |    cdstrat     |          cosnm     |
     |    Entebbe     |           2     |    251-499     |    Matched (3)     |
     +------------------------------------------------------------------------+

First participant is a 46 year old male from Entebbe with 13 neutrophil results, while second participant is a 41 year old female from Entebbe with 15 neutrophil results

Note that for certain applications a wide data set is preferable, while for others a long data set is preferable We can convert a wide data set to a long data set, provided that the variables to be reshaped end in a digit denoting the serial number (so here ne_abs1, ne_abs2 etc)

reshape long ne_abs months , i(idnum) j(visitnum)

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...


(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
>  29 30)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations            2,156   ->   64,680      
Number of variables                  67   ->   10          
j variable (30 values)                    ->   visitnum
xij variables:
           ne_abs1 ne_abs2 ... ne_abs30   ->   ne_abs
           months1 months2 ... months30   ->   months
-----------------------------------------------------------------------------

collapse to summarise data

The long data set above is an example of a clustered data structure, with repeated measures of neutrophil counts clustered within participants We often want to summarize the data at the cluster level – here to get participant level summaries (number of visits and mean neutrophil count) This can be achieved using the collapse command as shown below:

use costop_ndm , clear
drop if ne_abs==.

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...



(111 observations deleted)

gen dummy = 1
collapse (sum) dummy (mean) ne_abs , by(idnum)
tab dummy

Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi

> le.do ...





(sum) dummy |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         10        0.46        0.46
          2 |         16        0.74        1.21
          3 |         24        1.11        2.32
          4 |         40        1.86        4.17
          5 |         87        4.04        8.21
          6 |         99        4.59       12.80
          7 |         86        3.99       16.79
          8 |        159        7.37       24.17
          9 |        149        6.91       31.08
         10 |        203        9.42       40.49
         11 |        336       15.58       56.08
         12 |        304       14.10       70.18
         13 |        261       12.11       82.28
         14 |        191        8.86       91.14
         15 |         85        3.94       95.08
         16 |         46        2.13       97.22
         17 |         22        1.02       98.24
         18 |         15        0.70       98.93
         19 |          7        0.32       99.26
         20 |          6        0.28       99.54
         21 |          2        0.09       99.63
         22 |          7        0.32       99.95
         30 |          1        0.05      100.00
------------+-----------------------------------
      Total |      2,156      100.00

Introduction

Data Overview

Manually inputing data

Importing Comma delimited files (CSV)

Summarizing data

Codebook

Conditional workbook

Compact codebook

Tables and frequency summaries

Generating new variables

Graphics

Day 3 : Msc In Epidemiology

Research Protocol Development 1

Descriptive statistics

Written: Bongani Ncube

Date: 29 January 2025

Describing data

Label define

Normality test

Automated tables

Generating new variable

Label labels for hypertension

Summarise a continuous variable across a categorical variable

use table1 ado file

Using sumtab to generate table 1

Data Management

Example: Stepping Stones

Task

Merge datasets

Snapshot of baseline data

Snapshot of followup data

merge the datasets

Many to one merging

Long and wide format data

collapse to summarise data

Label labels for `hypertension`

`Example: Stepping Stones`