Biostatistics1 Assignment

Author

Vusumuzi Mabasa ORCID

options(scipen = 999)
library(Statamarkdown)
Stata found at C:/Program Files/Stata18/StataSE-64.exe
The 'stata' engine is ready to use.
stataexe <- "C:/Program Files/Stata18/StataSE-64.exe"

knitr::opts_chunk$set(engine.path = list(stata=stataexe))
cd "C:\Users\VUSI\Downloads\Biostatistics101"
C:\Users\VUSI\Downloads\Biostatistics101
pwd
C:\Users\VUSI\Downloads\Biostatistics101
use hersdata

d
Contains data from hersdata.dta
 Observations:         2,763                  
    Variables:            37                  11 May 2018 14:57
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
HT              byte    %15.0g     HT         random assignment to hormone
                                                therapy
age             byte    %9.0g                 age in years
raceth          byte    %16.0g     raceth     race/ethnicity
nonwhite        byte    %9.0g      noyes      nonwhite race/ethnicity
smoking         byte    %9.0g      noyes      current smoker
drinkany        byte    %9.0g      noyes      any current alcohol consumption
exercise        byte    %9.0g      noyes      exercise at least 3 times per
                                                week
physact         byte    %20.0g     physact    comparative physical activity
globrat         byte    %9.0g      globrat    self-reported health
poorfair        byte    %9.0g      noyes      poor/fair self-reported health
medcond         byte    %9.0g                 other serious conditions by
                                                self-report
htnmeds         byte    %9.0g      noyes      anti-hypertensive use
statins         byte    %9.0g      noyes      statin use
diabetes        byte    %9.0g      noyes      diabetes
dmpills         byte    %9.0g      noyes      oral DM medication by self-report
insulin         byte    %9.0g      noyes      insulin use by self-report
weight          float   %9.0g                 weight (kg)
BMI             float   %9.0g                 BMI (kg/m^2)
waist           float   %9.0g                 waist (cm)
WHR             float   %9.0g                 waist/hip ratio
glucose         int     %9.0g                 fasting glucose (mg/dl)
weight1         float   %9.0g                 year 1 weight (kg)
BMI1            float   %9.0g                 year 1 BMI (kg/m^2)
waist1          float   %9.0g                 year 1 waist (cm)
WHR1            float   %9.0g                 year 1 waist/hip ratio
glucose1        int     %9.0g                 year 1 fasting glucose (mg/dl)
tchol           int     %9.0g                 total cholesterol (mg/dl)
LDL             float   %9.0g                 LDL cholesterol (mg/dl)
HDL             int     %9.0g                 HDL cholesterol (mg/dl)
TG              int     %9.0g                 triglycerides (mg/dl)
tchol1          int     %9.0g                 year 1 total cholesterol (mg/dl)
LDL1            float   %9.0g                 year 1 LDL cholesterol (mg/dl)
HDL1            int     %9.0g                 year 1 HDL cholesterol (mg/dl)
TG1             int     %9.0g                 year 1 triglycerides (mg/dl)
SBP             int     %9.0g                 systolic blood pressure
DBP             int     %9.0g                 diastolic blood pressure
age10           float   %9.0g                 age (per 10 years)
-------------------------------------------------------------------------------
Sorted by: 

Question 1

(a)Normal distribution

A normal distribution (gaussian distribution), is a continuous probability distribution characterized by its symmetrical, bell-shaped curve. Mathematically, it is defined by its probability density function (PDF):

\[ f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)2}{2\sigma2}\right) \]

use hersdata

histogram SBP if HT == 0, normal color(pink) title("SBP Distribution (Placebo)") name(hist_placebo, replace)
histogram SBP if HT == 1, normal color(pink) title("SBP Distribution (Hormone Therapy)") name(hist_ht, replace)
graph combine hist_placebo hist_ht, title("SBP Distribution by Treatment Group")

graph export "norm.png", as(png) replace
(bin=31, start=83, width=4.5483871)

(bin=31, start=87, width=3.5483871)


file norm.png saved as PNG format
knitr::include_graphics("norm.png")

use hersdata
qnorm SBP if HT == 0, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of SBP (Placebo)") name(qq_sbp_placebo, replace)

qnorm SBP if HT == 1, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of SBP (Hormone Therapy)") name(qq_sbp_ht, replace)

graph combine qq_sbp_placebo qq_sbp_ht, title("Normal Q-Q Plot of SBP by Treatment Group")
graph export "QQ.png", as(png) replace
file QQ.png saved as PNG format
knitr::include_graphics("QQ.png")

use hersdata 


bysort HT: swilk SBP
-> HT = placebo

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         SBP |      1,383    0.98776     10.373     5.868    0.00000

-------------------------------------------------------------------------------
-> HT = hormone therapy

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         SBP |      1,380    0.99136      7.311     4.990    0.00000
use hersdata

ksmirnov SBP, by(HT)
Two-sample Kolmogorov–Smirnov test for equality of distribution functions

Smaller group             D     p-value  
---------------------------------------
placebo              0.0147       0.743
hormone therapy     -0.0114       0.836
Combined K-S         0.0147       0.998

Note: Ties exist in combined dataset;
      there are 110 unique values out of 2763 observations.

b. Bayes theorem

Bayes’ Theorem is a principle in probability theory that describes the relationship between conditional probabilities. It allows the update of beliefs about the likelihood of an event based on new evidence. Mathematically, it is expressed as:

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

use hersdata

tab diabetes smoking
           |    current smoker
  diabetes |        no        yes |     Total
-----------+----------------------+----------
        no |     1,733        299 |     2,032 
       yes |       670         61 |       731 
-----------+----------------------+----------
     Total |     2,403        360 |     2,763 

use hersdata

scalar no_smoker = 2403 / 2763  
scalar smoker = 360 / 2763


scalar p_D_given_no_smoker = 670 / 2403  
scalar p_D_given_smoker = 61 / 360  


scalar p_D = (p_D_given_no_smoker * no_smoker) + (p_D_given_smoker * smoker)

scalar p_smoker_G_D = (p_D_given_smoker * smoker) / p_D

display  p_smoker_G_D
.08344733

c. Binomial

The binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. It’s mathematical probability mass function (PMF) is as follows:

\[ P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k} \]

use hersdata

set obs 2763  // 

gen trials = rbinomial(30, 0.26)  // 

histogram trials, normal discrete width(1) percent color(pink) ///
    xlabel(0(2)20) ///
    title("Histogram of Binomial Distribution (n=30, p=0.26)") ///
    ytitle("Probability") xtitle("Number of Successes")


graph export "binorm.png", as(png) replace
Number of observations (_N) was 2,763, now 2,763.


(start=1, width=1)

file binorm.png saved as PNG format
knitr::include_graphics("binorm.png")

d. Sample Size or Power

Sample size refers to the number of observations or data points included in a study or experiment, playing a crucial role in determining the precision and reliability of statistical estimates. On the other hand, statistical power is the probability of correctly rejecting the null hypothesis when it is false, thus detecting an effect if one exists. It is influenced by factors such as sample size, effect size, significance level, and variability. Power is calculated as 1−β1 - \beta, where β\beta represents the likelihood of making a Type II error (failing to reject a false null hypothesis).


use hersdata
bysort HT: summarize SBP
-> HT = placebo

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         SBP |      1,383    135.1229    19.36075         83        224

-------------------------------------------------------------------------------
-> HT = hormone therapy

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         SBP |      1,380    135.0159    18.69506         87        197
use hersdata
sum SBP
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         SBP |      2,763    135.0695    19.02781         83        224

sampsi 135.1229 135.0159, sd(19.02781) n(2763) onesample
Estimated power for one-sample comparison of mean
  to hypothesized value

Test H0: m =  135.1, where m is the mean in the population

Assumptions:

         alpha =   0.0500  (two-sided)
 alternative m =  135.016
            sd =  19.0278
 sample size n =     2763

Estimated power:

         power =   0.0601
power twomeans 135.1229 135.0159, sd(19.029) n1(1383) n2(1380) alpha(0.05)
Estimated power for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1  versus  Ha: m2 != m1

Study parameters:

        alpha =    0.0500
            N =     2,763
           N1 =     1,383
           N2 =     1,380
        N2/N1 =    0.9978
        delta =   -0.1070
           m1 =  135.1229
           m2 =  135.0159
           sd =   19.0290

Estimated power:

        power =    0.0525

e. Pearson’s Chi-Square

Pearson’s Chi-Square test is a statistical method used to determine whether there is a significant association between categorical variables. It compares observed frequencies from sample data to expected frequencies under the assumption of independence.

\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

use hersdata
tabulate diabetes HT, chi2 expected
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           | random assignment to
           |    hormone therapy
  diabetes |   placebo  hormone t |     Total
-----------+----------------------+----------
        no |     1,031      1,001 |     2,032 
           |   1,017.1    1,014.9 |   2,032.0 
-----------+----------------------+----------
       yes |       352        379 |       731 
           |     365.9      365.1 |     731.0 
-----------+----------------------+----------
     Total |     1,383      1,380 |     2,763 
           |   1,383.0    1,380.0 |   2,763.0 

          Pearson chi2(1) =   1.4369   Pr = 0.231

f. One way ANOVA test

The One-Way Analysis of Variance (ANOVA) is a statistical test used to determine whether there are significant differences between the means of three or more independent groups. It compares the variability between group means to the variability within the groups. One-Way ANOVA assumes the dependent variable is continuous and normally distributed, and it requires homogeneity of variances across groups.

\[ F = \frac{\text{MS}{\text{between}}}{\text{MS}{\text{within}}} \]


use hersdata
oneway SBP raceth, tabulate
race/ethnic | Summary of systolic blood pressure
        ity |        Mean   Std. dev.       Freq.
------------+------------------------------------
      White |   134.78376   18.831686       2,451
  African A |   138.23394   19.992518         218
      Other |   135.18085   21.259767          94
------------+------------------------------------
      Total |   135.06949   19.027807       2,763

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      2384.26992      2   1192.13496      3.30     0.0371
 Within groups      997618.388   2760   361.455938
------------------------------------------------------------------------
    Total           1000002.66   2762   362.057443

Bartlett's equal-variances test: chi2(2) =   4.0727    Prob>chi2 = 0.131

g. Confounding and effect modification

Confounding occurs when a third variable (confounder) distorts the observed relationship between an exposure and an outcome. This happens because the confounder is associated with both the exposure and the outcome, creating a false impression of causality. On the other hand, effect modification occurs when the effect of the exposure on the outcome varies depending on the level of a third variable.

Confounding represented mathematically:

\[ \text{Adjusted association:} \quad Y = \beta_0 + \beta_1 X + \beta_2 Z \]

Mathematical representation of an effect modifier

\[ Y = \beta_0 + \beta_1X + \beta_2Z + \beta_3(X \cdot Z) \]

use hersdata
cc diabetes statins, by(smoking)
  current smoker | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .8432676      .6971237    1.01993     122.2905 (exact)
             yes |   .9733333      .4851674   1.874814           10 (exact)
-----------------+-------------------------------------------------
           Crude |   .8809565      .7348165   1.056045              (exact)
    M–H combined |   .8530995      .7138128   1.019565              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     0.19  Pr>chi2 = 0.6665

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      3.06
                                                Pr>chi2 =    0.0804

h. Difference between Wilcoxon signed rank test and Wilcoxon ranked sum

The Wilcoxon Signed-Rank Test and the Wilcoxon Rank-Sum Test are two non-parametric statistical methods designed for analyzing data under different conditions. The Wilcoxon Signed-Rank Test is used when comparing paired or dependent samples, assessing whether their medians significantly differ. This test calculates the differences between paired observations, ranks the absolute differences, and compares the sums of ranks of positive and negative differences. For example, it is ideal for evaluating pre- and post-treatment effects within the same group. Conversely, the Wilcoxon Rank-Sum Test, also known as the Mann-Whitney U Test, is applied to compare two independent samples to determine if their distributions or medians differ significantly. It ranks all observations across the two groups and then evaluates the difference in rank sums between them.

in terms of hypothesis testing, their difference can be expressed as follows:

Wilcoxon Signed-Rank Test:

\[ \begin{align}H_0 &: \text{The median difference between paired samples is zero.} \\H_a &: \text{The median difference between paired samples is not zero.} \\\text{Test Statistic: } W &= \sum \text{Ranks of positive differences.}\end{align} \]

Wilcoxon Rank-Sum Test:

\[ \begin{align}H_0 &: \text{The medians of the two independent groups are equal.} \\H_a &: \text{The medians of the two independent groups are not equal.} \\\text{Rank Sums: } \quad U_1 &= n_1 n_2 + \frac{n_1 (n_1+1)}{2} - R_1, \quad U_2 = n_1 n_2 - U_1\end{align} \]

Question 2

Link of the article the table was sourced from https://jamanetwork.com/journals/jama/fullarticle/187879


use hersdata

dtable, by(HT, tests testnotes nototal) ///
    sample(, statistic(frequency proportion) place(seplabels)) ///
    continuous(age SBP BMI waist WHR glucose weight LDL HDL TG tchol DBP, statistics(mean sd) test(regress)) ///
    factor(exercise raceth diabetes insulin dmpills smoking physact drinkany statins htnmeds globrat poorfair medcond, statistics(fvfrequency fvproportion) test(pearson)) ///
    nformat(%9.1f mean sd) ///
    sformat("(%s)" fvproportion sd proportion) ///
    nformat(%9.2f proportion fvproportion) ///
    title("Table 1 - Baseline Characteristics of HERS Participants (n=2763) by Treatment Group") ///
    note("HERS indicates Heart and Estrogen/progestin Replacement Study; CHD, coronary heart disease; LDL, low-density lipoprotein; and HDL, high-density lipoprotein. P-values are for difference between treatment groups by t test or χ²") ///
    export("hers_table1_sd.docx", replace)
note: using test regress across levels of HT for age, SBP, BMI, waist, WHR,
      glucose, weight, LDL, HDL, TG, tchol, and DBP.
note: using test pearson across levels of HT for exercise, raceth, diabetes,
      insulin, dmpills, smoking, physact, drinkany, statins, htnmeds,
      globrat, poorfair, and medcond.

Table 1 - Baseline Characteristics of HERS Participants (n=2763) by Treatment Group
----------------------------------------------------------------------------
                                        random assignment to hormone therapy
                                           placebo     hormone therapy  Test
                                         1,383 (0.50)   1,380 (0.50)        
----------------------------------------------------------------------------
age in years                               66.8 (6.7)       66.5 (6.6) 0.331
systolic blood pressure                  135.1 (19.4)     135.0 (18.7) 0.883
BMI (kg/m^2)                               28.5 (5.5)       28.6 (5.5) 0.596
waist (cm)                                91.5 (13.3)      92.0 (13.8) 0.333
waist/hip ratio                             0.9 (0.1)        0.9 (0.1) 0.556
fasting glucose (mg/dl)                  112.4 (36.8)     111.9 (36.9) 0.671
weight (kg)                               72.5 (14.8)      73.0 (14.7) 0.369
LDL cholesterol (mg/dl)                  144.9 (37.3)     145.2 (38.3) 0.803
HDL cholesterol (mg/dl)                   50.5 (13.2)      50.0 (13.3) 0.387
triglycerides (mg/dl)                    167.6 (63.7)     164.7 (63.4) 0.233
total cholesterol (mg/dl)                228.9 (40.4)     228.2 (41.6) 0.652
diastolic blood pressure                   73.1 (9.7)       73.2 (9.7) 0.898
exercise at least 3 times per week                                          
  no                                       853 (0.62)       842 (0.61) 0.720
  yes                                      530 (0.38)       538 (0.39)      
race/ethnicity                                                              
  White                                  1,239 (0.90)     1,212 (0.88) 0.324
  African American                         102 (0.07)       116 (0.08)      
  Other                                     42 (0.03)        52 (0.04)      
diabetes                                                                    
  no                                     1,031 (0.75)     1,001 (0.73) 0.231
  yes                                      352 (0.25)       379 (0.27)      
insulin use by self-report                                                  
  no                                     1,258 (0.91)     1,232 (0.89) 0.137
  yes                                      125 (0.09)       148 (0.11)      
oral DM medication by self-report                                           
  no                                     1,246 (0.90)     1,251 (0.91) 0.619
  yes                                      137 (0.10)       129 (0.09)      
current smoker                                                              
  no                                     1,201 (0.87)     1,202 (0.87) 0.838
  yes                                      182 (0.13)       178 (0.13)      
comparative physical activity                                               
  much less active                         106 (0.08)        91 (0.07) 0.320
  somewhat less active                     269 (0.19)       234 (0.17)      
  about as active                          451 (0.33)       468 (0.34)      
  somewhat more active                     407 (0.29)       431 (0.31)      
  much more active                         150 (0.11)       156 (0.11)      
any current alcohol consumption                                             
  no                                       834 (0.60)       846 (0.61) 0.623
  yes                                      547 (0.40)       534 (0.39)      
statin use                                                                  
  no                                       865 (0.63)       894 (0.65) 0.221
  yes                                      518 (0.37)       486 (0.35)      
anti-hypertensive use                                                       
  no                                       240 (0.17)       253 (0.18) 0.501
  yes                                    1,143 (0.83)     1,127 (0.82)      
self-reported health                                                        
  poor                                      31 (0.02)        29 (0.02) 0.987
  fair                                     303 (0.22)       302 (0.22)      
  good                                     658 (0.48)       650 (0.47)      
  very good                                336 (0.24)       338 (0.25)      
  excellent                                 54 (0.04)        59 (0.04)      
poor/fair self-reported health                                              
  no                                     1,048 (0.76)     1,047 (0.76) 0.928
  yes                                      334 (0.24)       331 (0.24)      
other serious conditions by self-report                                     
  0                                        859 (0.62)       876 (0.63) 0.457
  1                                        524 (0.38)       504 (0.37)      
----------------------------------------------------------------------------
HERS indicates Heart and Estrogen/progestin Replacement Study; CHD, coronary heart disease; LDL, low-density lipoprotein; and HDL, high-density lipoprotein. P-values are for difference between
treatment groups by t test or χ²
(collection DTable exported to file hers_table1_sd.docx)

Question 3

use hersdata

set seed 2998143


sample 300, count

save Vusi2, replace
(2,463 observations deleted)

file Vusi2.dta saved
use Vusi2

d
Contains data from Vusi2.dta
 Observations:           300                  
    Variables:            37                  13 Apr 2025 10:33
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
HT              byte    %15.0g     HT         random assignment to hormone
                                                therapy
age             byte    %9.0g                 age in years
raceth          byte    %16.0g     raceth     race/ethnicity
nonwhite        byte    %9.0g      noyes      nonwhite race/ethnicity
smoking         byte    %9.0g      noyes      current smoker
drinkany        byte    %9.0g      noyes      any current alcohol consumption
exercise        byte    %9.0g      noyes      exercise at least 3 times per
                                                week
physact         byte    %20.0g     physact    comparative physical activity
globrat         byte    %9.0g      globrat    self-reported health
poorfair        byte    %9.0g      noyes      poor/fair self-reported health
medcond         byte    %9.0g                 other serious conditions by
                                                self-report
htnmeds         byte    %9.0g      noyes      anti-hypertensive use
statins         byte    %9.0g      noyes      statin use
diabetes        byte    %9.0g      noyes      diabetes
dmpills         byte    %9.0g      noyes      oral DM medication by self-report
insulin         byte    %9.0g      noyes      insulin use by self-report
weight          float   %9.0g                 weight (kg)
BMI             float   %9.0g                 BMI (kg/m^2)
waist           float   %9.0g                 waist (cm)
WHR             float   %9.0g                 waist/hip ratio
glucose         int     %9.0g                 fasting glucose (mg/dl)
weight1         float   %9.0g                 year 1 weight (kg)
BMI1            float   %9.0g                 year 1 BMI (kg/m^2)
waist1          float   %9.0g                 year 1 waist (cm)
WHR1            float   %9.0g                 year 1 waist/hip ratio
glucose1        int     %9.0g                 year 1 fasting glucose (mg/dl)
tchol           int     %9.0g                 total cholesterol (mg/dl)
LDL             float   %9.0g                 LDL cholesterol (mg/dl)
HDL             int     %9.0g                 HDL cholesterol (mg/dl)
TG              int     %9.0g                 triglycerides (mg/dl)
tchol1          int     %9.0g                 year 1 total cholesterol (mg/dl)
LDL1            float   %9.0g                 year 1 LDL cholesterol (mg/dl)
HDL1            int     %9.0g                 year 1 HDL cholesterol (mg/dl)
TG1             int     %9.0g                 year 1 triglycerides (mg/dl)
SBP             int     %9.0g                 systolic blood pressure
DBP             int     %9.0g                 diastolic blood pressure
age10           float   %9.0g                 age (per 10 years)
-------------------------------------------------------------------------------
Sorted by: 

a. Research question and variables

(i) Is there a difference in LDL cholesterol between individuals who use hypertensive medications and those who do not?

(ii) Independent variable - LDL cholesterol; Dependent variable - htnmeds (Hypertensive medication)

b. Normality test

use Vusi2
qnorm LDL if htnmeds == 0, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of ldl (no)") name(qq_ldl_htnmeds_no, replace)

qnorm SBP if htnmeds == 1, mcolor(pink) lcolor(pink) title("Normal Q-Q Plot of SBP (yes)") name(qq_ldl_htnmeds_yes, replace)

graph combine qq_ldl_htnmeds_no qq_ldl_htnmeds_yes, title("Normal Q-Q Plot of ldl by use of hypertension meds Group")
graph export "QQ1.png", as(png) replace
file QQ1.png saved as PNG format
knitr::include_graphics("QQ1.png")

use Vusi2

ksmirnov LDL, by(htnmeds)
Two-sample Kolmogorov–Smirnov test for equality of distribution functions

Smaller group             D     p-value  
---------------------------------------
no                   0.0644       0.675
yes                 -0.1061       0.345
Combined K-S         0.1061       0.662

Note: Ties exist in combined dataset;
      there are 240 unique values out of 297 observations.
use Vusi2

bysort htnmeds: swilk LDL
-> htnmeds = no

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         LDL |         59    0.98850      0.617    -1.041    0.85115

-------------------------------------------------------------------------------
-> htnmeds = yes

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         LDL |        238    0.98283      2.982     2.536    0.00561

c. Student t-test of independence

use Vusi2

ttest LDL, by(htnmeds)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      no |      59    144.9831    5.074052    38.97453    134.8262    155.1399
     yes |     238    142.3269    2.247645    34.67499     137.899    146.7548
---------+--------------------------------------------------------------------
Combined |     297    142.8545    2.060914    35.51715    138.7986    146.9104
---------+--------------------------------------------------------------------
    diff |             2.65616     5.17181                -7.52216    12.83448
------------------------------------------------------------------------------
    diff = mean(no) - mean(yes)                                   t =   0.5136
H0: diff = 0                                     Degrees of freedom =      295

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6960         Pr(|T| > |t|) = 0.6079          Pr(T > t) = 0.3040

d. Manual computation of the test statistics

t test formula with pooled standard deviation

\[ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Pooled variances

\[ s_p^2 = \frac{(n_1 -1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \]

\[ s_p^2 = \frac{(60 - 1) \times 1161.2092 + (239 - 1) \times 1410.1362}{60 + 239 -2} \]

\[ s_p^2 = 1261.46 \]

Pooled standard deviation

\[ s_p = \sqrt{1261.46} \]

\[ s_p = 35.5170 \]

The t statistic

\[ t = \frac{144.9831 - 142.3269}{35.5170\sqrt{\frac{1}{59} + \frac{1}{238}}} \]

\[ t = 0.5142 \]

e. Conclusions

There is no difference in the mean of LDL cholesterol between individuals who use hypertensive medications and those who do not.

f. Repeat Question 03 using MannWhitney U-test

Research Question and variables
(i) Is there a difference in LDL cholesterol between individuals who use hypertensive medications and those who do not?

(ii) Independent variable - LDL cholesterol; Dependent variable - htnmeds (Hypertensive medication)

Hypotheses:

\[ H_0: \text{The distributions (medians) of LDL levels are the same for individuals not taking hypertension medications ("no") and those taking them ("yes").} \]

\[ H_1: \text{The distributions (medians) of LDL levels differ between the "no" and "yes" groups.} \]

STATA output

use Vusi2

ranksum LDL, by(htnmeds)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test

     htnmeds |      Obs    Rank sum    Expected
-------------+---------------------------------
          no |       59      9151.5        8791
         yes |      238     35101.5       35462
-------------+---------------------------------
    Combined |      297       44253       44253

Unadjusted variance   348709.67
Adjustment for ties       -5.35
                     ----------
Adjusted variance     348704.32

H0: LDL(htnmeds==no) = LDL(htnmeds==yes)
         z =  0.610
Prob > |z| = 0.5415

Note: Exact p-value is not computed by default for sample sizes > 200.
      Use option exact to compute it.

Decision or Conclusion: There is no significant evidence of a difference in LDL distributions between the “no” and “yes” groups based on hypertension medication use.

Manual computation of the Mann Whitney U-test

$$Calculating U_1 and U_2 to find the U statistics$$

\[ U_1 = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1\]\[U_1 = 59 \cdot 238 + \frac{59 \cdot 60}{2} - 9151.5\]\[U_1 = 14042 + 1770 - 9151.5 = 6660.5 \]

\[ U_2 = n_1 n_2 + \frac{n_2 (n_2 + 1)}{2} - R_2\]\[U_2 = 59 \cdot 238 + \frac{238 \cdot 239}{2} - 35101.5\]\[U_2 = 14042 + 28441 - 35101.5 = 7381.5 \]

\[ U = \min(U_1, U_2) = \min(6660.5, 7381.5) = 6660.5 \]

\[ \mu_U = \frac{n_1 n_2}{2} = \frac{59 \cdot 238}{2} = 7021\]\[\sigma_U = \text{adjusted variance} = \sqrt{348704.32} \approx 590.511 \]

The Z-value

\[ Z = \frac{U - \mu_U}{\sigma_U} = \frac{6660.5 - 7021}{590.511} = \frac{-360.5}{590.511} \approx -0.610 \]

Question 4

(a)

Hypotheses

\[ H_0: \text{There is no association between exercise and diabetes.} \]

\[ H_1: \text{There is an association between exercise and diabetes.} \]

Decision or Conclusion: There is no association between exercise and diabetes

use Vusi2
tabulate diabetes exercise, chi2 expected
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |  exercise at least 3
           |    times per week
  diabetes |        no        yes |     Total
-----------+----------------------+----------
        no |       136         88 |       224 
           |     141.1       82.9 |     224.0 
-----------+----------------------+----------
       yes |        53         23 |        76 
           |      47.9       28.1 |      76.0 
-----------+----------------------+----------
     Total |       189        111 |       300 
           |     189.0      111.0 |     300.0 

          Pearson chi2(1) =   1.9818   Pr = 0.159

(b) Manual computation of the Chi-squared test

Formula

\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

The chi squared test

\[ \chi^2 = \frac{(1031 - 1017.1)^2}{1017.1} + \frac{(1001 - 1014.9)^2}{1014.9} + \frac{(352 - 365.9)^2}{365.9} + \frac{(379 - 365.1)^2}{365.1} \]

\[ \chi^2 = 1.4369 \]

c)

  1. The crude model - no adjustments

use Vusi2
cc diabetes exercise
                                                         Proportion
                 |   Exposed   Unexposed  |      Total      exposed
-----------------+------------------------+------------------------
           Cases |        23          53  |         76       0.3026
        Controls |        88         136  |        224       0.3929
-----------------+------------------------+------------------------
           Total |       111         189  |        300       0.3700
                 |                        |
                 |      Point estimate    |    [95% conf. interval]
                 |------------------------+------------------------
      Odds ratio |          .670669       |    .3653767    1.207849 (exact)
 Prev. frac. ex. |          .329331       |   -.2078487    .6346233 (exact)
 Prev. frac. pop |         .1293801       |
                 +-------------------------------------------------
                               chi2(1) =     1.98  Pr>chi2 = 0.1592
  1. Statins adjusted model
use Vusi2
cc diabetes exercise, by(statins)
      statin use | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .6737968      .3061665   1.429289     9.689119 (exact)
             yes |   .6602871       .233172   1.772848     5.859813 (exact)
-----------------+-------------------------------------------------
           Crude |    .670669      .3653767   1.207849              (exact)
    M–H combined |   .6687055      .3825609   1.168878              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     0.00  Pr>chi2 = 0.9725

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      1.99
                                                Pr>chi2 =    0.1578
  1. nonwhite adjusted model
use Vusi2
cc diabetes exercise, by(nonwhite)
nonwhite race/et | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .8305867      .4256263   1.588842     11.44906 (exact)
             yes |      .3125       .042405   1.935143     2.742857 (exact)
-----------------+-------------------------------------------------
           Crude |    .670669      .3653767   1.207849              (exact)
    M–H combined |   .7304566      .4136963   1.289755              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     1.26  Pr>chi2 = 0.2620

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      1.19
                                                Pr>chi2 =    0.2762
  1. Poorfair adjusted model

use Vusi2
cc diabetes exercise, by(poorfair)
poor/fair self-r | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   1.088542      .5208722   2.249411     8.074766 (exact)
             yes |   .3310345      .0729658   1.206992     5.178571 (exact)
-----------------+-------------------------------------------------
           Crude |   .6733897        .36661   1.213655              (exact)
    M–H combined |   .7925555      .4461784   1.407832              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     2.86  Pr>chi2 = 0.0907

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      0.63
                                                Pr>chi2 =    0.4271

(d) - Conclusion: Neither statin use, nonwhite, and poorfair significantly confounded or modified the relationship between exercise and diabetes in this cohort. Overall, we conclude that there is no evidence of a statistically significant association between exercise and diabetes in this sample after accounting for the aforementioned variables.

(e) Manual computation of the M-H (BONUS MARKS)

use Vusi2

tabulate statins diabetes 
           |       diabetes
statin use |        no        yes |     Total
-----------+----------------------+----------
        no |       145         48 |       193 
       yes |        79         28 |       107 
-----------+----------------------+----------
     Total |       224         76 |       300 

The formula for manual computation of the M-H

\[ OR_{\text{MH}} = \frac{n_1 b_1 c_1 + n_2 b_2 c_2}{n_1 a_1 d_1 + n_2 a_2 d_2} \]

Substituting values from the table

\[ OR_{\text{MH}}= \frac{193 \cdot 145 \cdot 28 + 107 \cdot 79 \cdot 48}{193 \cdot 48 \cdot 79 + 107 \cdot 28 \cdot 145} \]

Separating in terms of stratum (e.g, Stratum 1 (Statin = No, n1=193) and Stratum 2 (Statin = Yes, n2=107)

\[ OR_{\text{MH}} = \frac{3792}{193} + \frac{4060}{107} \quad \text{and} \quad \frac{4060}{193} + \frac{3792}{107} \]

\[ OR_{\text{MH}} = 19.65 + 37.94 \quad \text{and} \quad 21.03 + 35.43 \]

\[ OR_{\text{MH}} = \frac{57.59}{56.46} \approx 1.02 \]

Question 5

(a) Research Question:

  1. Does the level of physical activity (measured by the physact variable) affect glucose levels in women?
  2. Variables:
  • Independent Variable (Factor): physact (categorical, with levels: “much more active,” “somewhat more active,” “about as active,” “somewhat less active,” “much less active”).

  • Dependent Variable: glucose (continuous, measured in mg/dL).

(b) Test the relevant assumptions of One-Way ANOVA

Hypothesis for the assumptions:

Normality assumptions

\[ H_0: \text{Glucose levels are normally distributed within each physact group.} \]

\[ H_1: \text{Glucose levels are not normally distributed within each physact group.} \]

Variance test

\[ H_0 : \text{Variances of glucose are equal across physact groups.} \]

\[ H_1 : \text{Variances of glucose are not equal across physact groups.} \]

Normality test to test the assumptions:

use Vusi2

histogram glucose, by(physact) normal

graph export "anova.png", as(png) replace
file anova.png saved as PNG format
knitr::include_graphics("anova.png")

use Vusi2
by physact, sort: swilk glucose
-> physact = much less active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     glucose |         26    0.81915      5.171     3.367    0.00038

-------------------------------------------------------------------------------
-> physact = somewhat less active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     glucose |         60    0.74440     13.894     5.672    0.00000

-------------------------------------------------------------------------------
-> physact = about as active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     glucose |         94    0.67782     25.265     7.140    0.00000

-------------------------------------------------------------------------------
-> physact = somewhat more active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     glucose |         85    0.75744     17.501     6.293    0.00000

-------------------------------------------------------------------------------
-> physact = much more active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     glucose |         35    0.69422     10.914     4.989    0.00000

Homogeneity of variance

use Vusi2

robvar glucose, by(physact)
comparative |
   physical | Summary of fasting glucose (mg/dl) 
   activity |        Mean   Std. dev.       Freq.
------------+------------------------------------
  much less |   128.76923   44.139151          26
  somewhat  |   116.18333   43.064985          60
  about as  |   111.02128   35.953967          94
  somewhat  |   102.28235   21.839028          85
  much more |       102.4   23.322799          35
------------+------------------------------------
      Total |      110.11   34.483158         300

W0  =  5.5219210   df(4, 295)     Pr > F = 0.00026723

W50 =  2.9584207   df(4, 295)     Pr > F = 0.02022499

W10 =  3.7034420   df(4, 295)     Pr > F = 0.00585241

The assumptions for the anova are not met.

(c)

use Vusi2

oneway glucose physact, tabulate
comparative |
   physical | Summary of fasting glucose (mg/dl) 
   activity |        Mean   Std. dev.       Freq.
------------+------------------------------------
  much less |   128.76923   44.139151          26
  somewhat  |   116.18333   43.064985          60
  about as  |   111.02128   35.953967          94
  somewhat  |   102.28235   21.839028          85
  much more |       102.4   23.322799          35
------------+------------------------------------
      Total |      110.11   34.483158         300

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      18632.1903      4   4658.04758      4.08     0.0031
 Within groups       336905.18    295   1142.05146
------------------------------------------------------------------------
    Total            355537.37    299   1189.08819

Bartlett's equal-variances test: chi2(4) =  44.6917    Prob>chi2 = 0.000

Manual computation of Anova

\[ F = \frac{MSB}{MSW} \]

Sum of squares between (SSB)

\[ SSB = \sum n_i (\bar{y}_i - \bar{y})^2 \]

\[ n_1 = 26, \, \bar{y}_1 = 128.76923, \, \bar{y} = 110.11\]\[\bar{y}_1 - \bar{y} = 128.76923 - 110.11 = 18.65923\]\[(\bar{y}_1 - \bar{y})^2 = (18.65923)^2 \approx 348.167863\]\[SSB_1 = n_1 (\bar{y}_1 - \bar{y})^2 = 26 \cdot 348.167863 \approx 9052.364438 \]

\[ n_2 = 60, \, \bar{y}_2 = 116.18333, \, \bar{y} = 110.11\]\[\bar{y}_2 - \bar{y} = 116.18333 - 110.11 = 6.07333\]\[(\bar{y}_2 - \bar{y})^2 = (6.07333)^2 \approx 36.885138\]\[SSB_2 = n_2 (\bar{y}_2 - \bar{y})^2 = 60 \cdot 36.885138 \approx 2213.10828 \]

\[ n_3 = 94, \, \bar{y}_3 = 111.02128, \, \bar{y} = 110.11\]\[\bar{y}_3 - \bar{y} = 111.02128 - 110.11 = 0.91128\]\[(\bar{y}_3 - \bar{y})^2 = (0.91128)^2 \approx 0.830431\]\[SSB_3 = n_3 (\bar{y}_3 - \bar{y})^2 = 94 \cdot 0.830431 \approx 78.060514 \]

\[ n_4 = 85, \, \bar{y}_4 = 102.28235, \, \bar{y} = 110.11\]\[\bar{y}_4 - \bar{y} = 102.28235 - 110.11 = -7.82765\]\[(\bar{y}_4 - \bar{y})^2 = (-7.82765)^2 \approx 61.272504\]\[SSB_4 = n_4 (\bar{y}_4 - \bar{y})^2 = 85 \cdot 61.272504 \approx 5208.16284 \]

\[ n_5 = 35, \, \bar{y}_5 = 102.4, \, \bar{y} = 110.11\]\[\bar{y}_5 - \bar{y} = 102.4 - 110.11 = -7.71\]\[(\bar{y}_5 - \bar{y})^2 = (-7.71)^2 \approx 59.4441\]\[SSB_5 = n_5 (\bar{y}_5 - \bar{y})^2 = 35 \cdot 59.4441 \approx 2080.5435 \]

\[ \sum SSB = SSB_1 + SSB_2 + SSB_3 + SSB_4 + SSB_5\]\[SSB = 9052.364438 + 2213.10828 + 78.060514 + 5208.16284 + 2080.5435\]\[SSB \approx 18632.239572 \]

Sum of squares Within (SSW)

\[ SSW = \sum (n_i - 1)s_i^2 \]

\[ n_1 = 26, \, s_1 = 44.139151, \, s_1^2 = (44.139151)^2 \approx 1948.265697\]\[SSW_1 = (n_1 - 1)s_1^2 = (26 - 1) \cdot 1948.265697 = 25 \cdot 1948.265697 \approx 48706.642425 \]

\[ n_2 = 60, \, s_2 = 43.064985, \, s_2^2 = (43.064985)^2 \approx 1854.590989\]\[SSW_2 = (n_2 - 1)s_2^2 = (60 - 1) \cdot 1854.590989 = 59 \cdot 1854.590989 \approx 109420.868351 \]

\[ n_3 = 94, \, s_3 = 35.953967, \, s_3^2 = (35.953967)^2 \approx 1292.688802\]\[SSW_3 = (n_3 - 1)s_3^2 = (94 - 1) \cdot 1292.688802 = 93 \cdot 1292.688802 \approx 120220.058586 \]

\[ n_4 = 85, \, s_4 = 21.839028, \, s_4^2 = (21.839028)^2 \approx 476.942143\]\[SSW_4 = (n_4 - 1)s_4^2 = (85 - 1) \cdot 476.942143 = 84 \cdot 476.942143 \approx 40063.140012 \]

\[ n_5 = 35, \, s_5 = 23.322799, \, s_5^2 = (23.322799)^2 \approx 543.952331\]\[SSW_5 = (n_5 - 1)s_5^2 = (35 - 1) \cdot 543.952331 = 34 \cdot 543.952331 \approx 18494.379254 \]

\[ \sum SSW = SSW_1 + SSW_2 + SSW_3 + SSW_4 + SSW_5\]\[SSW = 48706.642425 + 109420.868351 + 120220.058586 + 40063.140012 + 18494.379254\]\[SSW \approx 336905.088628 \]

MSW

\[ MSW = \frac{SSW}{df_W} = \frac{336905.088628}{295} \approx 1142.051114 \]

MSB

\[ MSB = \frac{SSB}{df_B} = \frac{18632.239572}{4} \approx 4658.059893 \]

The F statistic

\[ F = \frac{4658.0475}{1142.05146} \approx 4.08 \]

(d) If the assumptions are not fully met, I can consider data transformation or using the alternative non parametric test (Kruskal Wallis).

(e) Hypotheses based on the alternative non-parametric test

\[ H_0: \text{Median glucose levels are equal across physact groups.} \]

\[ H_1: \text{At least one group glucose level median differs across physact groups.} \]

use Vusi2

kwallis glucose, by(physact)
Kruskal–Wallis equality-of-populations rank test

  +---------------------------------------+
  |              physact | Obs | Rank sum |
  |----------------------+-----+----------|
  |     much less active |  26 |  5214.00 |
  | somewhat less active |  60 |  9796.00 |
  |      about as active |  94 | 14096.00 |
  | somewhat more active |  85 | 11446.00 |
  |     much more active |  35 |  4598.00 |
  +---------------------------------------+

  chi2(4) = 14.491
     Prob = 0.0059

  chi2(4) with ties = 14.504
               Prob = 0.0058

Decision: There is significant evidence of differences in median glucose levels across the physical activity groups.