Let’s type describe to verify that our data imported successfully.
describe
Running C:\Users\kkapu\Downloads\Intro-to-LTA-masterh\Intro-to-LTA-master\profi
> le.do ...
Contains data from C:\PROGRA~1\Stata18\ado\base/a/auto.dta
Observations: 74 1978 automobile data
Variables: 12 13 Apr 2022 17:45
(_dta has notes)
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
make str18 %-18s Make and model
price int %8.0gc Price
mpg int %8.0g Mileage (mpg)
rep78 int %8.0g Repair record 1978
headroom float %6.1f Headroom (in.)
trunk int %8.0g Trunk space (cu. ft.)
weight int %8.0gc Weight (lbs.)
length int %8.0g Length (in.)
turn int %8.0g Turn circle (ft.)
displacement int %8.0g Displacement (cu. in.)
gear_ratio float %6.2f Gear ratio
foreign byte %8.0g origin Car origin
-------------------------------------------------------------------------------
Sorted by: foreign
Manually inputting data
input str12 name
Ringo
John
Paul
George
end
name
1. Ringo
2. John
3. Paul
4. George
5. end
The first line tells Stata that we are going to input data for a string variable called name. The number 12 tells input that we want the string variable to allow up to 12 characters for each observation. The next four lines are the raw data, which include the names Ringo, John, Paul, and George. The word end in the last line tells Stata that we are finished adding data.
Importing comma-delimited files (CSV)
We can use import delimited to import the data from CS1policies.csv, a simple dataset that I use for teaching the CS1 exams.
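A minimal sketch of the import step, assuming CS1policies.csv sits in the current working directory (the location is an assumption; adjust the path to wherever the file is stored):

```stata
* Assumption: CS1policies.csv is in the current working directory
* clear discards whatever data are currently in memory
import delimited "CS1policies.csv", clear
```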
We can type describe to view the contents of the data in memory.
describe
Contains data
Observations: 1,000
Variables: 4
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
v1 int %8.0g
age byte %8.0g
duration float %9.0g
claimed byte %8.0g
-------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
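Codebook
The codebook command gives detailed information about the variables you list (or about every variable if none is listed), including the storage type, range, number of unique and missing values, and summary statistics or a frequency table.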
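As an illustration, here is a small sketch using the auto dataset that ships with Stata, so it does not depend on any file from this course. The choice of variables is arbitrary:

```stata
* Load the example dataset shipped with Stata
sysuse auto, clear

* Detailed description of one continuous and two categorical variables
codebook price rep78 foreign
```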
describe
Contains data from bus.dta
Observations: 125
Variables: 6 26 Jan 2019 12:06
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
job byte %9.0g
age byte %9.0g
ht float %9.0g
wt float %9.0g
triglyc int %9.0g
sbp int %9.0g
-------------------------------------------------------------------------------
Sorted by:
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
age | 125 0.95915 4.069 3.151 0.00081
The p-value is 0.00081, which is less than 0.05. We therefore reject the null hypothesis that age is normally distributed at the 5% significance level.
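The output above has the layout produced by Stata's swilk command. A minimal sketch of the call, assuming bus.dta is in the current working directory, might look like this:

```stata
* Assumption: bus.dta (125 observations, including age) is in the working directory
use bus, clear

* Shapiro-Wilk test of normality for age
swilk age

* swilk leaves the p-value in r(p)
display "Shapiro-Wilk p-value: " %7.5f r(p)
```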
Generating a new variable
gen hypertension = .
replace hypertension = 1 if sbp >= 140
replace hypertension = 0 if sbp < 140
(125 missing values generated)
(31 real changes made)
(94 real changes made)
What is the code doing?
gen hypertension = . generates the variable hypertension and initializes it with missing values.
replace hypertension = 1 if sbp >= 140 sets hypertension to 1 if the systolic blood pressure (sbp) is greater than or equal to 140.
replace hypertension = 0 if sbp < 140 sets hypertension to 0 if the systolic blood pressure (sbp) is less than 140.
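One caveat: Stata treats a missing value as larger than any number, so the condition sbp >= 140 is also true when sbp is missing. The sketch below uses a few made-up sbp values (not the bus data) to show a one-line alternative that leaves hypertension missing whenever sbp is missing:

```stata
* Toy data (hypothetical values), including one missing sbp
clear
input sbp
120
140
155
.
end

* (sbp >= 140) evaluates to 1 or 0; the if qualifier keeps missing sbp as missing
gen byte hypertension = (sbp >= 140) if !missing(sbp)
tab hypertension, missing
```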
describe hypertension
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
hypertension float %9.0g
+------------------------------------------+
| Factor Level Value |
|------------------------------------------|
| N 125 |
|------------------------------------------|
| age, mean (SD) 38.1 (10.1) |
|------------------------------------------|
| job Driver 59 (47.2%) |
| Conductor 66 (52.8%) |
|------------------------------------------|
| ht, mean (SD) 1.6 (0.1) |
+------------------------------------------+
Using summtab to generate Table 1
summtab, by(hypertension) catvars(job) contvars(age ht wt) word wordname(table1_bus) median medfmt(0) totaltitle("Table 1: Summary statistics by hypertension status") replace
Data Management
Example: Stepping Stones
Stepping Stones is a participatory HIV prevention programme that aims to improve sexual health through building more gender-equitable relationships. A cluster randomized trial was conducted among young rural men and women in the Eastern Cape Province of South Africa to assess the impact of Stepping Stones on HIV and HSV-2 incidence and on sexual practices. The 70 study clusters comprised 64 villages and six townships. Clusters were grouped into seven strata: one stratum comprised the townships, and the other six were villages grouped according to proximity to particular roads.
Within each stratum, equal numbers of clusters were allocated to the two study arms: an intervention arm, in which participants were given the 13 Stepping Stones sessions over a period of three months, and a control arm, in which participants were given a single 3-hour session on HIV prevention.
In each cluster, about 20 men and 20 women were recruited. Those eligible were aged 16–24, resident in the village where they were at school, and mature enough to understand the study and the consent process; most were recruited from schools. In this study the unit of randomisation was a cluster of 20 men or a cluster of 20 women. The primary outcomes were HIV incidence and HSV-2 incidence over the study period of approximately two years.
Task
We want to join the baseline and follow-up data sets to see which women have become HIV-infected (incident cases or sero-conversions), and to see whether there is any consistent pattern in the experience of IPV. Women in the study each have a unique study identification number (idnum). We can join the data sets together using the merge command as shown below.
Merge datasets
Two datasets are joined side by side using “merge 1:1”.
In the datasets used here, idnum must uniquely identify the observations in each dataset.
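The sketch below shows what such a one-to-one merge might look like on toy data. The values are made up and the temporary file stands in for the real follow-up data set. Only the names idnum, hivx, hiv and stonmerge are taken from the output in this section:

```stata
* Toy follow-up data (hypothetical values), saved to a temporary file
clear
input idnum hiv
1 0
2 1
end
tempfile followup
save `followup'

* Toy baseline data: participant 3 has no follow-up record
clear
input idnum hivx
1 0
2 0
3 0
end

* Check that idnum uniquely identifies observations before a 1:1 merge
isid idnum

* generate() names the variable that records the match result
merge 1:1 idnum using `followup', generate(stonmerge)
tab stonmerge
```

Here the data in memory are the master file and the file named after using is the using file, which is how the "from master" and "from using" rows in the merge report should be read.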
(Stepping Stones women baseline)
    Result                      Number of obs
    -----------------------------------------
    Not matched                           306
        from master                       306  (stonmerge==1)
        from using                          0  (stonmerge==2)

    Matched                             1,109  (stonmerge==3)
    -----------------------------------------
The summary above shows that 1,109 individuals had their data merged, whereas 306 did not match: all 306 came from the master file and none from the using file.
In our case, 1,109 women had observations in both data sets, while 306 had observations only in the baseline data set and not in the follow-up data set, because these women were lost to follow-up.
tab stonmerge
Matching result from |
merge | Freq. Percent Cum.
------------------------+-----------------------------------
Master only (1) | 306 21.63 21.63
Matched (3) | 1,109 78.37 100.00
------------------------+-----------------------------------
Total | 1,415 100.00
We can now use data from both datasets. For example, we can see how many women HIV sero-converted during the twelve-month follow-up period.
tab hivx hiv , missing
HIV | hiv
serostatus | 0 1 . | Total
-----------+---------------------------------+----------
0 | 900 65 291 | 1,256
1 | 1 104 54 | 159
-----------+---------------------------------+----------
Total | 901 169 345 | 1,415
Of the 1,256 women who tested HIV-negative at baseline, 65 sero-converted (i.e. became HIV-infected), while 900 remained HIV-negative and the remaining 291 did not have a follow-up result. Note that one woman tested HIV-positive at baseline but HIV-negative at follow-up. We can identify this woman as participant number 1870. We would need to go back to the original forms and fieldworkers to understand what happened with this participant (it is possible, for example, that a friend “replaced” her at the follow-up visit).
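One way to find such a participant is to list the idnum of anyone who is positive at baseline and negative at follow-up. The sketch below uses toy data with made-up idnum values. It follows the tabulation above in taking hivx as the baseline result and hiv as the follow-up result:

```stata
* Toy merged data (hypothetical values)
clear
input idnum hivx hiv
101 0 0
102 0 1
103 1 1
104 1 0
105 0 .
end

* Positive at baseline but negative at follow-up: a result that needs checking
list idnum hivx hiv if hivx == 1 & hiv == 0
```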
We can compare the proportion of women experiencing IPV (intimate partner violence) at follow-up according to whether or not they had experienced IPV at baseline.
Amongst women who had not experienced IPV at baseline, 18.2% experienced IPV at follow-up, while amongst women who had experienced IPV at baseline, 43.4% experienced IPV at follow-up.
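A comparison of this kind can be obtained from a two-way tabulation with row percentages. The sketch below uses toy data, and the variable names ipv_base and ipv_fu are hypothetical because the real names are not shown in this section:

```stata
* Toy data (hypothetical values and variable names)
clear
input ipv_base ipv_fu
0 0
0 0
0 1
1 1
1 0
1 1
end

* The row option gives the percentage with IPV at follow-up within each baseline group
tabulate ipv_base ipv_fu, row
```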
Finally, using the visit dates, we can look at the distribution of follow-up days between the two visits, which should be roughly 365 days since follow-up was at 12 months.
Median follow-up time was 386 days (slightly longer than the expected 365 days) and the mean was 409 days, because some participants were only traced after about 2 years. There are two strange values: one with negative follow-up days (meaning that the follow-up visit was recorded as having taken place before the baseline visit) and one with only 4 days of follow-up between visits. We can identify the participants (idnum 1423 and idnum 1666), but we would need to look at the original fieldwork records to resolve the query.
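Since Stata stores dates as a number of days, follow-up time is just the difference between the two visit dates. The sketch below uses toy data. The variable names visit0, visit1 and fudays and the 30-day screening threshold are assumptions made for illustration:

```stata
* Toy data (hypothetical dates), entered as strings
clear
input idnum str10 d0 str10 d1
1 "2004-03-01" "2005-03-15"
2 "2004-04-10" "2004-04-14"
3 "2004-05-20" "2004-05-01"
end

* Convert the strings to Stata daily dates
gen visit0 = date(d0, "YMD")
gen visit1 = date(d1, "YMD")
format visit0 visit1 %td

* Daily dates count days, so the difference is follow-up time in days
gen fudays = visit1 - visit0
summarize fudays, detail

* Flag implausibly short or negative follow-up for checking against field records
list idnum visit0 visit1 fudays if fudays < 30
```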
In the example above we joined two data sets using a one-to-one merge, since each data set had only one observation per participant.
In some cases there will be many observations per participant in one data set and only one observation per participant in the other data set.
An example is a longitudinal study in which all follow-up observations are put into the same data set (which thus has many observations per participant), while the other data set contains baseline and design information (and thus one observation per participant).
Example: COSTOP was a randomized controlled trial carried out to investigate whether it is safe for HIV-infected patients stabilized on ART (on ART for at least six months, on CTX prophylaxis and with a CD4 count above 250 cells/µl) to stop taking CTX prophylaxis. A total of 2,180 patients were individually randomized either to continue taking CTX or to take an equivalent placebo (i.e. to stop CTX prophylaxis).
One secondary objective was to compare neutrophil counts over time between the two treatment arms, since CTX has some haematological toxicity. Data on neutrophil counts are given in costop_neutrophil, while baseline data are given in costop_base.
Contains data from Datasets/costop_base.dta
Observations: 2,180
Variables: 6 23 Jan 2023 12:34
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
sex byte %8.0g sexlab Gender
ageyrs byte %9.0g
site byte %8.0g sitelab study Site
whostbas byte %8.0g Baseline WHO stage
cdstrat byte %8.0g cdlab CD4 stratum at baseline
idnum float %9.0g
-------------------------------------------------------------------------------
Sorted by: idnum
tab1 sex site cdstrat
-> tabulation of sex
Gender | Freq. Percent Cum.
------------+-----------------------------------
Male | 569 26.10 26.10
Female | 1,611 73.90 100.00
------------+-----------------------------------
Total | 2,180 100.00
-> tabulation of site
study Site | Freq. Percent Cum.
------------+-----------------------------------
Entebbe | 1,002 45.96 45.96
Masaka | 1,178 54.04 100.00
------------+-----------------------------------
Total | 2,180 100.00
-> tabulation of cdstrat
CD4 stratum |
at baseline | Freq. Percent Cum.
------------+-----------------------------------
251-499 | 1,142 52.39 52.39
500+ | 1,038 47.61 100.00
------------+-----------------------------------
Total | 2,180 100.00
The data set costop_neutrophil has 23,093 observations, since participants could have a number of hematology tests (including a neutrophil count) during the trial.
We can now merge the neutrophil data to the baseline data to get the associated characteristics (e.g. sex, age, study site, CD4 stratum, WHO stage) corresponding to the neutrophil counts.
Many observations in the neutrophil data (from a single participant) will be merged to a single observation in the baseline data, so this is known as many-to-one or m:1 merging.
It is also sometimes called “table lookup”, since data for a given participant are “looked up” in the baseline table.
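A many-to-one merge might be sketched as follows on toy data. The values are made up and a temporary file stands in for costop_base. The names idnum, month, ne_abs, sex, site and cosnm follow the output in this section:

```stata
* Toy baseline table (one row per participant), saved to a temporary file
clear
input idnum sex site
1 1 1
2 2 2
3 2 1
end
tempfile base
save `base'

* Toy neutrophil data in memory (many rows per participant); participant 3 has no tests
clear
input idnum month ne_abs
1 0 1.6
1 3 1.8
2 0 2.1
2 3 1.9
2 6 2.0
end

* m:1 = many rows in the master file matched to one row in the using file
merge m:1 idnum using `base', generate(cosnm)
tab cosnm
```

With the long file as master, a baseline participant with no tests shows up as an unmatched observation from the using file.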
(variable idnum was int, now float to accommodate using data's values)
    Result                      Number of obs
    -----------------------------------------
    Not matched                            23
        from master                         0  (cosnm==1)
        from using                         23  (cosnm==2)

    Matched                            23,181  (cosnm==3)
    -----------------------------------------
tab1 sex site cdstrat
-> tabulation of sex
Gender | Freq. Percent Cum.
------------+-----------------------------------
Male | 5,988 25.81 25.81
Female | 17,216 74.19 100.00
------------+-----------------------------------
Total | 23,204 100.00
-> tabulation of site
study Site | Freq. Percent Cum.
------------+-----------------------------------
Entebbe | 10,782 46.47 46.47
Masaka | 12,422 53.53 100.00
------------+-----------------------------------
Total | 23,204 100.00
-> tabulation of cdstrat
CD4 stratum |
at baseline | Freq. Percent Cum.
------------+-----------------------------------
251-499 | 12,114 52.21 52.21
500+ | 11,090 47.79 100.00
------------+-----------------------------------
Total | 23,204 100.00
There were 23 observations from the baseline table that did not have matches in the neutrophil data (cosnm==2 in the merge result above); these correspond to participants who dropped out early without having any hematology tests.
Also, among the 23,117 hematology tests carried out, 10,776 were in Entebbe and 12,341 in Masaka. We can look at a boxplot to compare the neutrophil counts between Entebbe and Masaka.
There appear to be a number of outliers among the neutrophil counts, which could be laboratory errors (or missing decimal places in the results), so we can look at the counts with a cut-off of 10 (in practice we would investigate this together with the laboratory).
graph box ne_abs if ne_abs<10, over(site)
quietly graph export box2.svg, replace
With the outliers removed, there seems to be little difference in the distribution of neutrophil counts between sites. We can examine this further by looking at summary statistics by site.
tabstat ne_abs, stat(n mean sd q) by(site)
Summary for variables: ne_abs
Group variable: site (study Site)
    site |         N      Mean        SD       p25       p50       p75
---------+------------------------------------------------------------
 Entebbe |     10765  1.860236  1.191051      1.17      1.61      2.21
  Masaka |     12328  1.770006  1.230289      1.13      1.56      2.14
---------+------------------------------------------------------------
   Total |     23093  1.812067  1.212965      1.15      1.59      2.17
----------------------------------------------------------------------
This confirms that, on average, neutrophil counts are slightly higher in Entebbe than in Masaka. Note that we save the merged data set as “costop_ndm.dta” to use in the next three sections.
In the neutrophil data set, for each participant we might want to select the first neutrophil count (to get an estimate at baseline or enrolment) and also the last neutrophil count (to get an estimate at the end of the trial). We will see how to do this below:
Since we are looking within idnum and have sorted by month within idnum, _n gives the number of the observation within each participant, so _n==1 denotes the first (earliest) hematology test, _n==2 the second test, and so on. Only 2,156 participants have at least one neutrophil result. Note that for over 10% of them the first result was obtained before enrolment, i.e. during the screening phase of the trial, as shown by negative values for month. We can now see how to find the last neutrophil count.
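Selecting the first record per participant with _n might be sketched like this on toy data (made-up values, with a negative month standing for a screening-phase test):

```stata
* Toy long data (hypothetical values)
clear
input idnum month ne_abs
1 -1 1.5
1 3 1.7
1 6 1.4
2 0 2.2
2 12 2.0
end

* Sort by month within idnum, then keep the earliest record per participant
bysort idnum (month): keep if _n == 1
list
```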
The last observation within each idnum (participant) is numbered _N. Note that the median and mean neutrophil counts are very similar at the beginning and end of the trial. We now have two data sets, one containing the first neutrophil count and the other containing the last neutrophil count. We could merge these data sets (1:1) and hence find, for each participant, how much the neutrophil count has changed over the course of the trial.
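The following sketch puts these steps together on toy data: it keeps the first count, keeps the last count, and merges the two 1:1 to compute the change. The names ne_first, ne_last and ne_change are hypothetical, and temporary files stand in for saved data sets:

```stata
* Toy long data (hypothetical values)
clear
input idnum month ne_abs
1 -1 1.5
1 3 1.7
1 6 1.4
2 0 2.2
2 12 2.0
end
tempfile long
save `long'

* First count per participant (_n == 1)
bysort idnum (month): keep if _n == 1
rename ne_abs ne_first
keep idnum ne_first
tempfile first
save `first'

* Last count per participant (_n == _N)
use `long', clear
bysort idnum (month): keep if _n == _N
rename ne_abs ne_last
keep idnum ne_last

* One row per participant in each file, so a 1:1 merge applies
merge 1:1 idnum using `first', nogenerate
gen ne_change = ne_last - ne_first
list
```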
Long and wide format data
The neutrophil data set is an example of a “long” data set, since we have a separate row of data for each visit. An alternative would be a “wide” data set, in which we have one row of data for each participant and within this row the first neutrophil count is recorded as ne_abs1, the second as ne_abs2, the third as ne_abs3, and so on. We see how to do this below:
The variable “visitnum” gives the number of each visit within each participant. Note that if a person had (say) visitnum 5, then they must also have had visits 1, 2, 3 and 4. That is why the frequency decreases with each visit number, so 8 participants had 22 visits and one participant had 30 visits. We now see how to make a wide data set. In this wide data set there will be variables ne_abs1 up to ne_abs30 and month1 up to month30 (although ne_abs30 and month30 will be missing for all except one participant).
use costop_ndm, clear
drop if ne_abs==.
sort idnum month
by idnum: gen visitnum = _n
The first participant is a 46-year-old male from Entebbe with 13 neutrophil results, while the second participant is a 41-year-old female from Entebbe with 15 neutrophil results.
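The conversion from long to wide is done with reshape wide. A minimal sketch on toy data (made-up values, using the variable names from this section):

```stata
* Toy long data (hypothetical values)
clear
input idnum month ne_abs
1 0 1.5
1 3 1.7
2 0 2.2
2 3 2.0
2 6 1.9
end

* Number the visits within each participant
bysort idnum (month): gen visitnum = _n

* One row per idnum; visitnum becomes the suffix of ne_abs and month
reshape wide ne_abs month, i(idnum) j(visitnum)
list
```

Participant 1 has only two visits here, so ne_abs3 and month3 are missing for that row.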
Note that for certain applications a wide data set is preferable, while for others a long data set is preferable. We can convert a wide data set to a long data set, provided that the variables to be reshaped end in a digit denoting the serial number (so here ne_abs1, ne_abs2, etc.).
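Going the other way uses reshape long, which reads the numeric suffix into a new variable. A minimal sketch on toy wide data (made-up values):

```stata
* Toy wide data (hypothetical values): one row per participant
clear
input idnum ne_abs1 ne_abs2 ne_abs3
1 1.5 1.7 .
2 2.2 2.0 1.9
end

* The stub ne_abs plus its numeric suffix becomes ne_abs and visitnum
reshape long ne_abs, i(idnum) j(visitnum)

* Rows for visits that never happened carry a missing count; drop them
drop if missing(ne_abs)
list
```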
The long data set above is an example of a clustered data structure, with repeated measures of neutrophil counts clustered within participants. We often want to summarize the data at the cluster level, here to get participant-level summaries (number of visits and mean neutrophil count). This can be achieved using the collapse command as shown below:
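A minimal sketch of such a collapse on toy data (made-up values; the names nvisits and mean_ne are hypothetical):

```stata
* Toy long data (hypothetical values)
clear
input idnum month ne_abs
1 0 1.5
1 3 1.7
2 0 2.2
2 3 2.0
2 6 1.9
end

* One row per participant: number of non-missing counts and their mean
collapse (count) nvisits = ne_abs (mean) mean_ne = ne_abs, by(idnum)
list
```

Note that collapse replaces the data in memory, so the long data set should be saved first if it is still needed.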