Contains data from hersdata.dta
Observations: 2,763
Variables: 37 11 May 2018 14:57
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
HT byte %15.0g HT random assignment to hormone
therapy
age byte %9.0g age in years
raceth byte %16.0g raceth race/ethnicity
nonwhite byte %9.0g noyes nonwhite race/ethnicity
smoking byte %9.0g noyes current smoker
drinkany byte %9.0g noyes any current alcohol consumption
exercise byte %9.0g noyes exercise at least 3 times per
week
physact byte %20.0g physact comparative physical activity
globrat byte %9.0g globrat self-reported health
poorfair byte %9.0g noyes poor/fair self-reported health
medcond byte %9.0g other serious conditions by
self-report
htnmeds byte %9.0g noyes anti-hypertensive use
statins byte %9.0g noyes statin use
diabetes byte %9.0g noyes diabetes
dmpills byte %9.0g noyes oral DM medication by self-report
insulin byte %9.0g noyes insulin use by self-report
weight float %9.0g weight (kg)
BMI float %9.0g BMI (kg/m^2)
waist float %9.0g waist (cm)
WHR float %9.0g waist/hip ratio
glucose int %9.0g fasting glucose (mg/dl)
weight1 float %9.0g year 1 weight (kg)
BMI1 float %9.0g year 1 BMI (kg/m^2)
waist1 float %9.0g year 1 waist (cm)
WHR1 float %9.0g year 1 waist/hip ratio
glucose1 int %9.0g year 1 fasting glucose (mg/dl)
tchol int %9.0g total cholesterol (mg/dl)
LDL float %9.0g LDL cholesterol (mg/dl)
HDL int %9.0g HDL cholesterol (mg/dl)
TG int %9.0g triglycerides (mg/dl)
tchol1 int %9.0g year 1 total cholesterol (mg/dl)
LDL1 float %9.0g year 1 LDL cholesterol (mg/dl)
HDL1 int %9.0g year 1 HDL cholesterol (mg/dl)
TG1 int %9.0g year 1 triglycerides (mg/dl)
SBP int %9.0g systolic blood pressure
DBP int %9.0g diastolic blood pressure
age10 float %9.0g age (per 10 years)
-------------------------------------------------------------------------------
Sorted by:
Question 1
a. Normal distribution
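A normal (Gaussian) distribution is a continuous probability distribution characterized by its symmetric, bell-shaped curve. For mean \(\mu\) and standard deviation \(\sigma\), it is defined by its probability density function (PDF):
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]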
use hersdata
histogram SBP if HT == 0, normal color(pink) title("SBP Distribution (Placebo)") name(hist_placebo, replace)
histogram SBP if HT == 1, normal color(pink) title("SBP Distribution (Hormone Therapy)") name(hist_ht, replace)
graph combine hist_placebo hist_ht, title("SBP Distribution by Treatment Group")
graph export "norm.png", as(png) replace
(bin=31, start=83, width=4.5483871)
(bin=31, start=87, width=3.5483871)
file norm.png saved as PNG format
knitr::include_graphics("norm.png")
use hersdata
qnorm SBP if HT == 0, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of SBP (Placebo)") name(qq_sbp_placebo, replace)
qnorm SBP if HT == 1, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of SBP (Hormone Therapy)") name(qq_sbp_ht, replace)
graph combine qq_sbp_placebo qq_sbp_ht, title("Normal Q-Q Plot of SBP by Treatment Group")
graph export "QQ.png", as(png) replace
file QQ.png saved as PNG format
knitr::include_graphics("QQ.png")
use hersdata
bysort HT: swilk SBP
-> HT = placebo
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
SBP | 1,383 0.98776 10.373 5.868 0.00000
-------------------------------------------------------------------------------
-> HT = hormone therapy
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
SBP | 1,380 0.99136 7.311 4.990 0.00000
use hersdata
ksmirnov SBP, by(HT)
Two-sample Kolmogorov–Smirnov test for equality of distribution functions
Smaller group D p-value
---------------------------------------
placebo 0.0147 0.743
hormone therapy -0.0114 0.836
Combined K-S 0.0147 0.998
Note: Ties exist in combined dataset;
there are 110 unique values out of 2763 observations.
b. Bayes’ theorem
Bayes’ Theorem is a principle in probability theory that describes the relationship between conditional probabilities. It allows the update of beliefs about the likelihood of an event based on new evidence. Mathematically, it is expressed as:
\[
P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}
\]
use hersdata
tab diabetes smoking
| current smoker
diabetes | no yes | Total
-----------+----------------------+----------
no | 1,733 299 | 2,032
yes | 670 61 | 731
-----------+----------------------+----------
Total | 2,403 360 | 2,763
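As a worked illustration, the counts in the table above can be plugged into Bayes' theorem to get the probability of diabetes given current smoking. This is a minimal Python sketch and not part of the original Stata workflow; the variable names are assumptions chosen for readability.

```python
# Counts read from the diabetes-by-smoking table above.
n_total = 2763
n_diabetes = 731
n_smoker = 360
n_both = 61  # diabetic and current smoker

p_diabetes = n_diabetes / n_total              # P(A)
p_smoker = n_smoker / n_total                  # P(B)
p_smoker_given_diabetes = n_both / n_diabetes  # P(B | A)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_diabetes_given_smoker = p_smoker_given_diabetes * p_diabetes / p_smoker

print(round(p_diabetes_given_smoker, 4))  # 0.1694
```

The result equals the direct column proportion 61/360, which is a useful consistency check on the formula.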
c. Binomial distribution
The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. Its probability mass function (PMF) is:
\[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}
\]
use hersdata
set obs 2763
gen trials = rbinomial(30, 0.26)
histogram trials, normal discrete width(1) percent color(pink) ///
    xlabel(0(2)20) ///
    title("Histogram of Binomial Distribution (n=30, p=0.26)") ///
    ytitle("Probability") xtitle("Number of Successes")
graph export "binorm.png", as(png) replace
Number of observations (_N) was 2,763, now 2,763.
(start=1, width=1)
file binorm.png saved as PNG format
knitr::include_graphics("binorm.png")
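The simulated histogram can be compared against exact probabilities from the PMF. The following Python sketch (an illustration outside the Stata workflow; `binom_pmf` is a hypothetical helper name) evaluates the PMF for n = 30 and p = 0.26, the same parameters used in the simulation.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 30, 0.26
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))  # should equal n * p
mode = max(range(n + 1), key=lambda k: pmf[k])  # most likely number of successes

print(f"sum of PMF = {sum(pmf):.6f}")
print(f"mean = {mean:.2f}, mode = {mode}")
```

The mean n·p = 7.8 is where the simulated histogram should be centred.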
d. Sample Size or Power
Sample size is the number of observations included in a study; it largely determines the precision and reliability of statistical estimates. Statistical power is the probability of correctly rejecting the null hypothesis when it is false, that is, of detecting an effect if one exists. It depends on sample size, effect size, significance level, and variability. Power is calculated as \(1 - \beta\), where \(\beta\) is the probability of a Type II error (failing to reject a false null hypothesis).
use hersdata
bysort HT: summarize SBP
-> HT = placebo
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
SBP | 1,383 135.1229 19.36075 83 224
-------------------------------------------------------------------------------
-> HT = hormone therapy
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
SBP | 1,380 135.0159 18.69506 87 197
use hersdata
sum SBP
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
SBP | 2,763 135.0695 19.02781 83 224
Estimated power for one-sample comparison of mean
to hypothesized value
Test H0: m = 135.1, where m is the mean in the population
Assumptions:
alpha = 0.0500 (two-sided)
alternative m = 135.016
sd = 19.0278
sample size n = 2763
Estimated power:
power = 0.0601
power twomeans 135.1229 135.0159, sd(19.029) n1(1383) n2(1380) alpha(0.05)
Estimated power for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1
Study parameters:
alpha = 0.0500
N = 2,763
N1 = 1,383
N2 = 1,380
N2/N1 = 0.9978
delta = -0.1070
m1 = 135.1229
m2 = 135.0159
sd = 19.0290
Estimated power:
power = 0.0525
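The two-sample power above can be approximated by hand with a normal approximation. The sketch below is a Python cross-check under that assumption, not a reproduction of Stata's exact method; with samples this large the two should agree closely.

```python
from math import sqrt
from statistics import NormalDist

# Inputs taken from the power twomeans call above.
m1, m2 = 135.1229, 135.0159
sd, n1, n2, alpha = 19.029, 1383, 1380, 0.05

std_normal = NormalDist()
se = sd * sqrt(1 / n1 + 1 / n2)        # standard error of the mean difference
shift = abs(m2 - m1) / se              # standardized true difference
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Two-sided power: probability the test statistic falls in either rejection tail.
power = std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)

print(f"approximate power = {power:.4f}")
```

A power this close to alpha reflects how small the observed difference in means (about 0.1 mmHg) is relative to the standard deviation.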
e. Pearson’s Chi-Square
Pearson’s Chi-Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares observed frequencies from sample data to expected frequencies under the assumption of independence.
\[
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}
\]
use hersdata
tabulate diabetes HT, chi2 expected
| Key |
|--------------------|
| frequency |
| expected frequency |
+--------------------+
| random assignment to
| hormone therapy
diabetes | placebo hormone t | Total
-----------+----------------------+----------
no | 1,031 1,001 | 2,032
| 1,017.1 1,014.9 | 2,032.0
-----------+----------------------+----------
yes | 352 379 | 731
| 365.9 365.1 | 731.0
-----------+----------------------+----------
Total | 1,383 1,380 | 2,763
| 1,383.0 1,380.0 | 2,763.0
Pearson chi2(1) = 1.4369 Pr = 0.231
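The expected frequencies and the test statistic in this output can be recomputed from the observed counts alone. The Python sketch below is an illustrative check; the closed-form p-value via `erfc` is valid only for 1 degree of freedom, and the variable names are assumptions.

```python
from math import erfc, sqrt

# Observed counts: rows = diabetes (no, yes), columns = HT (placebo, hormone therapy).
observed = [[1031, 1001],
            [352, 379]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = erfc(sqrt(chi2 / 2))

print(f"chi2(1) = {chi2:.4f}, p = {p_value:.3f}")
```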
f. One way ANOVA test
The One-Way Analysis of Variance (ANOVA) is a statistical test used to determine whether there are significant differences between the means of three or more independent groups. It compares the variability between group means to the variability within the groups. One-Way ANOVA assumes the dependent variable is continuous and normally distributed, and it requires homogeneity of variances across groups.
\[
F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}}
\]
use hersdata
oneway SBP raceth, tabulate
race/ethnic | Summary of systolic blood pressure
ity | Mean Std. dev. Freq.
------------+------------------------------------
White | 134.78376 18.831686 2,451
African A | 138.23394 19.992518 218
Other | 135.18085 21.259767 94
------------+------------------------------------
Total | 135.06949 19.027807 2,763
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 2384.26992 2 1192.13496 3.30 0.0371
Within groups 997618.388 2760 361.455938
------------------------------------------------------------------------
Total 1000002.66 2762 362.057443
Bartlett's equal-variances test: chi2(2) = 4.0727 Prob>chi2 = 0.131
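The F statistic in the ANOVA table can be rebuilt from the group summary statistics (frequency, mean, standard deviation) printed above it. This Python sketch is an illustrative check only; group labels are copied as they appear in the output.

```python
# (label, n, mean, sd) for each raceth group, from the summary table above.
groups = [
    ("White", 2451, 134.78376, 18.831686),
    ("African A", 218, 138.23394, 19.992518),
    ("Other", 94, 135.18085, 21.259767),
]

n_total = sum(n for _, n, _, _ in groups)
grand_mean = sum(n * mean for _, n, mean, _ in groups) / n_total

# Between-group SS: group size times squared distance of group mean from grand mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for _, n, mean, _ in groups)
# Within-group SS: (n - 1) * variance, summed over groups.
ss_within = sum((n - 1) * sd**2 for _, n, _, sd in groups)

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.1f}")
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```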
g. Confounding and effect modification
Confounding occurs when a third variable (confounder) distorts the observed relationship between an exposure and an outcome. This happens because the confounder is associated with both the exposure and the outcome, creating a false impression of causality. On the other hand, effect modification occurs when the effect of the exposure on the outcome varies depending on the level of a third variable.
Confounding represented mathematically:
\[
\text{Adjusted association:} \quad Y = \beta_0 + \beta_1 X + \beta_2 Z
\]
current smoker | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .8432676 .6971237 1.01993 122.2905 (exact)
yes | .9733333 .4851674 1.874814 10 (exact)
-----------------+-------------------------------------------------
Crude | .8809565 .7348165 1.056045 (exact)
M–H combined | .8530995 .7138128 1.019565
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 0.19 Pr>chi2 = 0.6665
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 3.06
Pr>chi2 = 0.0804
h. Difference between the Wilcoxon signed-rank test and the Wilcoxon rank-sum test
The Wilcoxon Signed-Rank Test and the Wilcoxon Rank-Sum Test are two non-parametric statistical methods designed for analyzing data under different conditions. The Wilcoxon Signed-Rank Test is used when comparing paired or dependent samples, assessing whether their medians significantly differ. This test calculates the differences between paired observations, ranks the absolute differences, and compares the sums of ranks of positive and negative differences. For example, it is ideal for evaluating pre- and post-treatment effects within the same group. Conversely, the Wilcoxon Rank-Sum Test, also known as the Mann-Whitney U Test, is applied to compare two independent samples to determine if their distributions or medians differ significantly. It ranks all observations across the two groups and then evaluates the difference in rank sums between them.
In terms of hypothesis testing, the difference can be expressed as follows:
Wilcoxon Signed-Rank Test:
\[
\begin{aligned}
H_0 &: \text{The median difference between paired samples is zero.} \\
H_a &: \text{The median difference between paired samples is not zero.} \\
\text{Test statistic: } W &= \sum \text{ranks of positive differences}
\end{aligned}
\]
Wilcoxon Rank-Sum Test:
\[
\begin{aligned}
H_0 &: \text{The medians of the two independent groups are equal.} \\
H_a &: \text{The medians of the two independent groups are not equal.} \\
\text{Rank sums: } \quad U_1 &= n_1 n_2 + \frac{n_1 (n_1+1)}{2} - R_1, \quad U_2 = n_1 n_2 - U_1
\end{aligned}
\]
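The two statistics defined above can be contrasted on small toy data. In the Python sketch below every number is hypothetical and invented purely for illustration (none comes from the study data), and the helper names are assumptions. The signed-rank statistic ranks within-pair differences, while the rank-sum statistic ranks the pooled observations of two independent groups.

```python
def average_ranks(values):
    """Rank values from smallest to largest, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def signed_rank_w(before, after):
    """Sum of ranks of the positive paired differences (zeros dropped)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    return sum(r for d, r in zip(diffs, ranks) if d > 0)


def rank_sum_u(group1, group2):
    """U1 and U2 from the rank sum R1 of group1 in the pooled sample."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1 = sum(ranks[:n1])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return u1, n1 * n2 - u1


# Hypothetical paired measurements on the same five subjects.
before = [140, 132, 128, 150, 145]
after = [135, 134, 120, 151, 139]
w_plus = signed_rank_w(before, after)

# Hypothetical measurements on two independent groups.
u1, u2 = rank_sum_u([12, 15, 11], [14, 18, 20, 16])

print(w_plus, u1, u2)
```

The signed-rank function needs equal-length, matched inputs; the rank-sum function accepts groups of different sizes, which mirrors the paired versus independent distinction.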
use Vusi2
qnorm LDL if htnmeds == 0, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of LDL (no)") name(qq_ldl_htnmeds_no, replace)
qnorm LDL if htnmeds == 1, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of LDL (yes)") name(qq_ldl_htnmeds_yes, replace)
graph combine qq_ldl_htnmeds_no qq_ldl_htnmeds_yes, title("Normal Q-Q Plot of LDL by use of hypertension medication")
graph export "QQ1.png", as(png) replace
file QQ1.png saved as PNG format
knitr::include_graphics("QQ1.png")
use Vusi2
ksmirnov LDL, by(htnmeds)
Two-sample Kolmogorov–Smirnov test for equality of distribution functions
Smaller group D p-value
---------------------------------------
no 0.0644 0.675
yes -0.1061 0.345
Combined K-S 0.1061 0.662
Note: Ties exist in combined dataset;
there are 240 unique values out of 297 observations.
use Vusi2
bysort htnmeds: swilk LDL
-> htnmeds = no
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
LDL | 59 0.98850 0.617 -1.041 0.85115
-------------------------------------------------------------------------------
-> htnmeds = yes
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
LDL | 238 0.98283 2.982 2.536 0.00561
\[
t = \frac{144.9831 - 142.3269}{35.5170\sqrt{\frac{1}{59} + \frac{1}{238}}}
\]
\[
t = 0.5142
\]
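The arithmetic in the t statistic above can be verified in a few lines. This Python sketch simply evaluates the pooled two-sample formula with the group means, pooled standard deviation, and group sizes quoted in the equation; the variable names are assumptions.

```python
from math import sqrt

mean_no, mean_yes = 144.9831, 142.3269  # group means of LDL
pooled_sd = 35.5170                     # pooled standard deviation
n_no, n_yes = 59, 238                   # group sizes

se = pooled_sd * sqrt(1 / n_no + 1 / n_yes)  # standard error of the difference
t_stat = (mean_no - mean_yes) / se
df = n_no + n_yes - 2                        # degrees of freedom for the pooled test

print(f"t({df}) = {t_stat:.4f}")
```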
e. Conclusions
With t = 0.5142, there is no statistically significant evidence of a difference in mean LDL cholesterol between individuals who use antihypertensive medications and those who do not.
f. Repeat Question 03 using the Mann–Whitney U test
Research Question and variables
(i) Is there a difference in LDL cholesterol between individuals who use antihypertensive medications and those who do not?
\[
H_0: \text{The distributions (medians) of LDL levels are the same for individuals not taking hypertension medications ("no") and those taking them ("yes").}
\]
\[
H_1: \text{The distributions (medians) of LDL levels differ between the "no" and "yes" groups.}
\]
Stata output
use Vusi2
ranksum LDL, by(htnmeds)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test
htnmeds | Obs Rank sum Expected
-------------+---------------------------------
no | 59 9151.5 8791
yes | 238 35101.5 35462
-------------+---------------------------------
Combined | 297 44253 44253
Unadjusted variance 348709.67
Adjustment for ties -5.35
----------
Adjusted variance 348704.32
H0: LDL(htnmeds==no) = LDL(htnmeds==yes)
z = 0.610
Prob > |z| = 0.5415
Note: Exact p-value is not computed by default for sample sizes > 200.
Use option exact to compute it.
Decision or Conclusion: There is no significant evidence of a difference in LDL distributions between the “no” and “yes” groups based on hypertension medication use.
Manual computation of the Mann Whitney U-test
Calculating U_1 and U_2 to find the U statistic
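One way to carry out this manual computation is to start from the rank sum and adjusted variance reported in the ranksum output. The Python sketch below is illustrative; it uses the same U_1 and U_2 definitions given earlier for the rank-sum test and a normal approximation for the p-value.

```python
from math import sqrt
from statistics import NormalDist

# Values read from the ranksum output above.
n1, n2 = 59, 238             # htnmeds = no, htnmeds = yes
r1 = 9151.5                  # rank sum of the "no" group
adjusted_variance = 348704.32

# U statistics.
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 - u1

# Expected rank sum and unadjusted variance under H0.
expected_r1 = n1 * (n1 + n2 + 1) / 2
unadjusted_variance = n1 * n2 * (n1 + n2 + 1) / 12

# Normal approximation using the tie-adjusted variance.
z = (r1 - expected_r1) / sqrt(adjusted_variance)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"U1 = {u1}, U2 = {u2}")
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The expected rank sum (8791) and unadjusted variance (348709.67) computed here match the Stata output, which confirms the inputs were read correctly.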
statin use | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .6737968 .3061665 1.429289 9.689119 (exact)
yes | .6602871 .233172 1.772848 5.859813 (exact)
-----------------+-------------------------------------------------
Crude | .670669 .3653767 1.207849 (exact)
M–H combined | .6687055 .3825609 1.168878
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 0.00 Pr>chi2 = 0.9725
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 1.99
Pr>chi2 = 0.1578
nonwhite adjusted model
use Vusi2
cc diabetes exercise, by(nonwhite)
nonwhite race/et | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .8305867 .4256263 1.588842 11.44906 (exact)
yes | .3125 .042405 1.935143 2.742857 (exact)
-----------------+-------------------------------------------------
Crude | .670669 .3653767 1.207849 (exact)
M–H combined | .7304566 .4136963 1.289755
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 1.26 Pr>chi2 = 0.2620
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 1.19
Pr>chi2 = 0.2762
Poorfair adjusted model
use Vusi2
cc diabetes exercise, by(poorfair)
poor/fair self-r | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | 1.088542 .5208722 2.249411 8.074766 (exact)
yes | .3310345 .0729658 1.206992 5.178571 (exact)
-----------------+-------------------------------------------------
Crude | .6733897 .36661 1.213655 (exact)
M–H combined | .7925555 .4461784 1.407832
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 2.86 Pr>chi2 = 0.0907
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 0.63
Pr>chi2 = 0.4271
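(d) Conclusion: None of statin use, nonwhite race/ethnicity, or poor/fair self-reported health significantly confounded or modified the relationship between exercise and diabetes in this cohort. The homogeneity tests were non-significant in all three stratified analyses (p = 0.9725, 0.2620, and 0.0907), and every Mantel–Haenszel combined odds ratio has a 95% confidence interval that includes 1. Overall, we conclude that there is no evidence of a statistically significant association between exercise and diabetes in this sample after accounting for these variables.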
(e) Manual computation of the M-H (BONUS MARKS)
use Vusi2
tabulate statins diabetes
| diabetes
statin use | no yes | Total
-----------+----------------------+----------
no | 145 48 | 193
yes | 79 28 | 107
-----------+----------------------+----------
Total | 224 76 | 300
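The Mantel–Haenszel combined odds ratio is a weighted average of the stratum-specific odds ratios, so it can be recomputed from the odds ratios and M–H weights printed in the three cc outputs. The Python sketch below is an illustrative check that works from those reported values rather than from the stratum 2×2 tables, which are not shown here; `mh_combined` is a hypothetical helper name.

```python
def mh_combined(strata):
    """Weighted average of stratum odds ratios using their M-H weights."""
    total_weight = sum(weight for _, weight in strata)
    return sum(odds_ratio * weight for odds_ratio, weight in strata) / total_weight


# (odds ratio, M-H weight) for each stratum, copied from the cc outputs.
by_statins = [(0.6737968, 9.689119), (0.6602871, 5.859813)]
by_nonwhite = [(0.8305867, 11.44906), (0.3125, 2.742857)]
by_poorfair = [(1.088542, 8.074766), (0.3310345, 5.178571)]

for label, strata in [
    ("statins", by_statins),
    ("nonwhite", by_nonwhite),
    ("poorfair", by_poorfair),
]:
    print(f"{label}: M-H combined OR = {mh_combined(strata):.4f}")
```

Each result should agree with the "M–H combined" row of the corresponding output to the precision of the printed inputs.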
Does the level of physical activity (measured by the physact variable) affect glucose levels in women?
Variables:
Independent Variable (Factor): physact (categorical, with levels: “much more active,” “somewhat more active,” “about as active,” “somewhat less active,” “much less active”).
Dependent Variable: glucose (continuous, measured in mg/dL).
(b) Test the relevant assumptions of One-Way ANOVA
Hypothesis for the assumptions:
Normality assumptions
\[
H_0: \text{Glucose levels are normally distributed within each physact group.}
\]
\[
H_1: \text{Glucose levels are not normally distributed within each physact group.}
\]
Variance test
\[
H_0 : \text{Variances of glucose are equal across physact groups.}
\]
\[
H_1 : \text{Variances of glucose are not equal across physact groups.}
\]
Normality checks:
use Vusi2
histogram glucose, by(physact) normal
graph export "anova.png", as(png) replace
file anova.png saved as PNG format
knitr::include_graphics("anova.png")
use Vusi2
by physact, sort: swilk glucose
-> physact = much less active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
glucose | 26 0.81915 5.171 3.367 0.00038
-------------------------------------------------------------------------------
-> physact = somewhat less active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
glucose | 60 0.74440 13.894 5.672 0.00000
-------------------------------------------------------------------------------
-> physact = about as active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
glucose | 94 0.67782 25.265 7.140 0.00000
-------------------------------------------------------------------------------
-> physact = somewhat more active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
glucose | 85 0.75744 17.501 6.293 0.00000
-------------------------------------------------------------------------------
-> physact = much more active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
glucose | 35 0.69422 10.914 4.989 0.00000
Homogeneity of variance
use Vusi2
robvar glucose, by(physact)
comparative |
physical | Summary of fasting glucose (mg/dl)
activity | Mean Std. dev. Freq.
------------+------------------------------------
much less | 128.76923 44.139151 26
somewhat | 116.18333 43.064985 60
about as | 111.02128 35.953967 94
somewhat | 102.28235 21.839028 85
much more | 102.4 23.322799 35
------------+------------------------------------
Total | 110.11 34.483158 300
W0 = 5.5219210 df(4, 295) Pr > F = 0.00026723
W50 = 2.9584207 df(4, 295) Pr > F = 0.02022499
W10 = 3.7034420 df(4, 295) Pr > F = 0.00585241
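Neither ANOVA assumption is met: the Shapiro–Wilk tests reject normality in every physact group (all p < 0.001), and Levene's test rejects equal variances (W0 = 5.52, p = 0.0003).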
(c)
use Vusi2
oneway glucose physact, tabulate
comparative |
physical | Summary of fasting glucose (mg/dl)
activity | Mean Std. dev. Freq.
------------+------------------------------------
much less | 128.76923 44.139151 26
somewhat | 116.18333 43.064985 60
about as | 111.02128 35.953967 94
somewhat | 102.28235 21.839028 85
much more | 102.4 23.322799 35
------------+------------------------------------
Total | 110.11 34.483158 300
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 18632.1903 4 4658.04758 4.08 0.0031
Within groups 336905.18 295 1142.05146
------------------------------------------------------------------------
Total 355537.37 299 1189.08819
Bartlett's equal-variances test: chi2(4) = 44.6917 Prob>chi2 = 0.000
\[
F = \frac{4658.04758}{1142.05146} \approx 4.08
\]
(d) Because the assumptions are not fully met, I can consider a data transformation or the non-parametric alternative, the Kruskal–Wallis test.
(e) Hypotheses based on the alternative non-parametric test
\[
H_0: \text{Median glucose levels are equal across physact groups.}
\]
\[
H_1: \text{At least one physact group has a median glucose level that differs from the others.}
\]
use Vusi2
kwallis glucose, by(physact)
Kruskal–Wallis equality-of-populations rank test
+---------------------------------------+
| physact | Obs | Rank sum |
|----------------------+-----+----------|
| much less active | 26 | 5214.00 |
| somewhat less active | 60 | 9796.00 |
| about as active | 94 | 14096.00 |
| somewhat more active | 85 | 11446.00 |
| much more active | 35 | 4598.00 |
+---------------------------------------+
chi2(4) = 14.491
Prob = 0.0059
chi2(4) with ties = 14.504
Prob = 0.0058
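The unadjusted Kruskal–Wallis statistic can be recomputed from the rank sums in the table using H = 12 / (N(N+1)) · Σ R_i² / n_i − 3(N+1). The Python sketch below is an illustrative check; it ignores the tie correction, and the closed-form p-value holds only for 4 degrees of freedom.

```python
from math import exp

# (observations, rank sum) for each physact group, from the kwallis output.
groups = [
    (26, 5214.0),    # much less active
    (60, 9796.0),    # somewhat less active
    (94, 14096.0),   # about as active
    (85, 11446.0),   # somewhat more active
    (35, 4598.0),    # much more active
]

n_total = sum(n for n, _ in groups)

# Kruskal-Wallis H without tie adjustment.
h_stat = (
    12 / (n_total * (n_total + 1)) * sum(r**2 / n for n, r in groups)
    - 3 * (n_total + 1)
)

# Upper-tail probability of a chi-square variable with 4 degrees of freedom.
p_value = exp(-h_stat / 2) * (1 + h_stat / 2)

print(f"chi2(4) = {h_stat:.3f}, p = {p_value:.4f}")
```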
Decision: There is significant evidence of differences in median glucose levels across the physical activity groups.