COMH7062A BIOSTATISTICS I ASSIGNMENT

Author
Affiliation

Bongani Ncube(3002164)

University Of the Witwatersrand (School of Public Health)

Published

April 14, 2025

Keywords

T-test, Confounding, Analysis Of Variance, Probability Distributions, Non Parametric tests

Note

The HERS study dataset has been provided to you to help you answer the following questions in relation to concepts that you learnt in Biostatistics I. The participants were are either on hormone therapy or placebo and the data were collected at baseline [e.g weight] and on follow-up [those variables ending with 1, e.g weight1] Below are the variable names, value labels and variable labels as they will be used in the questions below.

About the dataset
HT age raceth nonwhite smoking drinkany exercise physact globrat poorfair medcond htnmeds statins diabetes dmpills insulin weight BMI waist WHR glucose weight1 BMI1 waist1 WHR1 glucose1 tchol LDL HDL TG tchol1 LDL1 HDL1 TG1 SBP DBP age10
0 70 2 1 0 0 0 5 3 0 0 1 1 0 0 0 73.8 23.69 96.0 0.932 84 73.6 23.63 93.0 0.912 94 189 122.4 52 73 201 137.6 48 77 138 78 7.0
0 62 2 1 0 0 0 1 3 0 1 1 0 0 0 0 70.9 28.62 93.0 0.964 111 73.4 28.89 95.0 0.964 78 307 241.6 44 107 216 150.6 48 87 118 70 6.2
1 69 1 0 0 0 0 3 3 0 0 1 0 1 0 0 102.0 42.51 110.2 0.782 114 96.1 40.73 103.0 0.774 98 254 166.2 57 154 254 156.0 66 160 134 78 6.9
0 64 1 0 1 1 0 1 3 0 1 1 0 0 0 0 64.4 24.39 87.0 0.877 94 58.6 22.52 77.0 0.802 93 204 116.2 56 159 207 122.6 57 137 152 72 6.4
0 65 1 0 0 0 0 2 3 0 0 0 0 0 0 0 57.9 21.90 77.0 0.794 101 58.9 22.28 76.5 0.757 92 214 150.6 42 107 235 172.2 35 139 175 95 6.5
1 68 2 1 0 1 0 3 3 0 0 0 0 0 0 0 60.9 29.05 96.0 1.000 116 57.7 27.52 86.0 0.910 115 212 137.8 52 111 202 126.6 53 112 174 98 6.8

Data Description

use hersdata

describe 
Contains data from hersdata.dta
 Observations:         2,763                  
    Variables:            37                  11 May 2018 14:57
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
HT              byte    %15.0g     HT         random assignment to hormone
                                                therapy
age             byte    %9.0g                 age in years
raceth          byte    %16.0g     raceth     race/ethnicity
nonwhite        byte    %9.0g      noyes      nonwhite race/ethnicity
smoking         byte    %9.0g      noyes      current smoker
drinkany        byte    %9.0g      noyes      any current alcohol consumption
exercise        byte    %9.0g      noyes      exercise at least 3 times per
                                                week
physact         byte    %20.0g     physact    comparative physical activity
globrat         byte    %9.0g      globrat    self-reported health
poorfair        byte    %9.0g      noyes      poor/fair self-reported health
medcond         byte    %9.0g                 other serious conditions by
                                                self-report
htnmeds         byte    %9.0g      noyes      anti-hypertensive use
statins         byte    %9.0g      noyes      statin use
diabetes        byte    %9.0g      noyes      diabetes
dmpills         byte    %9.0g      noyes      oral DM medication by self-report
insulin         byte    %9.0g      noyes      insulin use by self-report
weight          float   %9.0g                 weight (kg)
BMI             float   %9.0g                 BMI (kg/m^2)
waist           float   %9.0g                 waist (cm)
WHR             float   %9.0g                 waist/hip ratio
glucose         int     %9.0g                 fasting glucose (mg/dl)
weight1         float   %9.0g                 year 1 weight (kg)
BMI1            float   %9.0g                 year 1 BMI (kg/m^2)
waist1          float   %9.0g                 year 1 waist (cm)
WHR1            float   %9.0g                 year 1 waist/hip ratio
glucose1        int     %9.0g                 year 1 fasting glucose (mg/dl)
tchol           int     %9.0g                 total cholesterol (mg/dl)
LDL             float   %9.0g                 LDL cholesterol (mg/dl)
HDL             int     %9.0g                 HDL cholesterol (mg/dl)
TG              int     %9.0g                 triglycerides (mg/dl)
tchol1          int     %9.0g                 year 1 total cholesterol (mg/dl)
LDL1            float   %9.0g                 year 1 LDL cholesterol (mg/dl)
HDL1            int     %9.0g                 year 1 HDL cholesterol (mg/dl)
TG1             int     %9.0g                 year 1 triglycerides (mg/dl)
SBP             int     %9.0g                 systolic blood pressure
DBP             int     %9.0g                 diastolic blood pressure
age10           float   %9.0g                 age (per 10 years)
-------------------------------------------------------------------------------
Sorted by: 
Question 1:

Explain all of the following and use the given HERS data to give an illustrative example using output from performing relevant stata commands.

a) Normal

Random variable \(Y\) is distributed \(Y \sim N(\mu, \sigma^2)\) if

\[\begin{equation} f(y) = \frac{e^{-(y-\mu)^2/ (2 \sigma^2)}}{\sqrt{2\pi\sigma^2}} \quad \textrm{for} \quad -\infty < y < \infty. \end{equation}\]

As the parameter names suggest, \(E(Y) = \mu\) and \(Var(Y) = \sigma^2\). The distribution \(\textrm{N}(0,1)\) is often referred to as the standard normal distribution

  • Normal distributions have the following properties:
  1. symmetric about the mean , median and mode hence \(Mean=mode=median\)
  2. it is bell shaped normal curve

using the hersdata we can illustrate if the above properties are satisfied for the variable weight.

use hersdata
hist weight , norm
quietly graph export normal.png
(bin=34, start=37.5, width=2.7794118)

file normal.png already exists
r(602);

r(602);

Comments

  • the data is symmetric and bell shaped
use hersdata
sum weight,det
                         weight (kg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         46.9           37.5
 5%         51.8           37.6
10%         55.7           40.8       Obs               2,761
25%         62.2           40.9       Sum of wgt.       2,761

50%           71                      Mean           72.73484
                        Largest       Std. dev.      14.71397
75%         81.4            127
90%         92.9            127       Variance        216.501
95%          100            129       Skewness       .7042269
99%          116            132       Kurtosis        3.49991
  • here we also note that , the mean and the median are almost equal
use hersdata

qnorm weight

graph export qq.png
file qq.png already exists
r(602);

r(602);

  • the points almost lie on the straight line hence confirming normality assumption.

b) Bayes Theorem

  • A theorem about conditional probabilities and is given by \(\Pr(B \mid A) = \displaystyle{\frac{ \Pr(A \mid B) \; \Pr(B)}{\Pr(A)}}\)

  • Schematically if \(A = \theta\) and \(B = \text{y}\) the bayes theorem brings the fundamental theorem in bayesian statistics.

\[ \color{dodgerblue}{\boxed{\underbrace{P(\theta \mid y)}_{\text{Posterior}} = \frac{\overbrace{P(y \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(y)}_{\text{Normalizing Constant}}}}} \]

Context of HERS Data

Suppose we are interested in calculating the probability of being diabetic if a person smokes tobacco for a long time, \(P(diabetes|Smoker)\). With reference to the HERS data we see that:

use hersdata
tab diabetes smoking ,cell
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

           |    current smoker
  diabetes |        no        yes |     Total
-----------+----------------------+----------
        no |     1,733        299 |     2,032 
           |     62.72      10.82 |     73.54 
-----------+----------------------+----------
       yes |       670         61 |       731 
           |     24.25       2.21 |     26.46 
-----------+----------------------+----------
     Total |     2,403        360 |     2,763 
           |     86.97      13.03 |    100.00 

Let

  • \(Diabetes=D\) and
  • \(Smoking = S\)

From the table above we can find the following

\[P(D) =P(D~\cap~S) + P(D~\cap~S^c) = 0.2646\]

13% of the population are chronic smokers, \[P(Smoking) = P(S~\cap~D)+P(S~\cap~D^c)=0.1303\] Also, we can find the following, \[P(Smoker|Diabetes) = \frac{P(S\cap D)}{P(D)}=\frac{0.0221}{0.2646}=0.084\].

display 0.0221/0.2646
.0835223

Using the Bayes’ theorem we have:

\[ \textrm{P(Diabetes|Smoker) = }\frac{\textrm{P(Smoker|Diabetes)· P(Diabetes)}}{\textrm{P(Smoker)}}\] \[ \textrm{P(Diabetes|Smoker)} = \frac{0.084 \times 0.2646}{0.1303}= \frac{0222264}{0.1303}= .17057866\]

display "P(Diabetes|Smoker)=" (0.084*0.2646)/0.1303
P(Diabetes|Smoker)=.17057866

Statement on Bayes’ Theorem

A partition of a space \(\Omega\) is a collection of disjoint sets such that \(\displaystyle \bigcup_{i=1}^{\infty} A_i = \Omega\).

Partition of a Probability Space.

Let \(A_1\), \(A_2\), \(\ldots , A_k\) be a partition of \(\Omega\) such that \(P(A_i)>0\) for each \(i\). If \(B\) is any event with \(P(B)>0\), then for each \(i=1, \ldots k\), we have

\[P(A_i \mid B) = \frac{P(A_i \cap B) }{P(B)} = \frac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^k P(B \mid A_j)P(A_j)}.\]

c) Binomial distribution

The binomial probability distribution can be used for modeling the number of times a particular event occurs (successes) in a sequence of n repeated and independent Bernoulli trials, each with probability p of success.

The binomial setting
  1. There is a fixed number of n repeated Bernoulli trials.
  2. The n trials are all independent. That is, knowing the result of one trial does not change the probability we assign to other trials.
  3. Both probability of success, p, and probability of failure, 1-p, are constant throughtout the trials.

With reference to the Hers dataset Let X be a random variable that indicates the number of successes (diabetic patients) in n-independent Bernoulli trials. If random variable X satisfies the binomial setting, it follows the binomial distribution with parameters n and p, denoted as \(X ∼ Binomial(n, p)\), where n is the Bernoulli trial parameter (a positive integer) and p the Bernoulli probability parameter (\({0\leq p\leq 1}\)).

  • The probability mass function (pmf) of X is given by:

\[ P(X=x) = {{n}\choose{x}} \cdot p^x \cdot (1-p)^{n-x} \tag{1}\]

where x = 0, 1, … , n and \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\)

Note that: \(n! = 1\cdot 2 \cdot 3\cdot \ldots \cdot (n-2)\cdot (n-1)\cdot n\)

 

  • The cumulative distribution function (cdf) of X is given by:

\[F(x) = P(X \le x)= {\begin{cases}0,&for\ x <0\\\sum_{k=0}^{x}{\left( \begin{array}{c} n \\ k \end{array} \right) p^{k}(1 - p)^{n-k}},&for\ 0\leq x < n\\1,&for\ x \geq n \end{cases}} \tag{2}\]

 

The mean of random variable, X, with Binomial(n, p) distribution is:

\[μ = np\] the variance is:

\[σ^2 = np(1-p)\]

and the standard deviation:

\[σ = \sqrt{np(1-p)}\]

Example : HERS DATA

use hersdata
tab diabetes 
   diabetes |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |      2,032       73.54       73.54
        yes |        731       26.46      100.00
------------+-----------------------------------
      Total |      2,763      100.00

Let the random variable X be the number of patients diagnosed with diabetes and from the table above we note that 26.46% are diabetic(successful) (p = 0.2646). If the results of 10 patients are randomly sampled, and X follows a Binomial distribution \(X ∼ Binomial(10, 0.2646)\), The main characteristics of this distribution can be shown below.

  • So, the pmf for this distribution is:

\[ P(X=x) = {{10}\choose{x}} \cdot 0.2646^x \cdot (1-0.2646)^{10-x} \tag{3}\]

The pmf of Binomial(10, 0.2646) distribution specifies the probability of 0 through 10 successful selection of diabetic patients. Hence we have:

X 0 1 2 3 8 9 10
P(X) 0 0.046 0.166 0.269 0.00058 0.000047 0.000002

d) Sample size or Power

Power is the probability of detecting an effect if it exists. Higher sample size increases power. Suppose we want to only use a smaller sample of the HERS Data to draw conclusions about the actual population: i.e If we want to draw a sample that we can use inorder to compare mean weight between the placebo and hormone therapy group. if we want to achieve a power of 80% at 5% level of significance Assuming expected means of \(75 ~vs ~78, SD = 10\). then our sample sizes can be can be calculated in stata as follows:

power twomeans 75 78, sd(10) alpha(0.05) power(0.8)
Performing iteration ...

Estimated sample sizes for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1  versus  Ha: m2 != m1

Study parameters:

        alpha =    0.0500
        power =    0.8000
        delta =    3.0000
           m1 =   75.0000
           m2 =   78.0000
           sd =   10.0000

Estimated sample sizes:

            N =       352
  N per group =       176

Our desired sample sizes for each group is 176

e) Pearson Chi-Square

The chi-square independence test tests whether observed joint frequency counts \(O_{ij}\) differ from expected frequency counts \(E_{ij}\) under the independence model (the model of independent explanatory variables, \(\pi_{ij} = \pi_{i+} \pi_{+j}\). \(H_0\) is \(O_{ij} = E_{ij}\).

The Pearson test of independence statistic is:

\[\chi^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where \(O_{ij}\) is the observed count, and \(E_{ij}\) is the product of the row and column marginal probabilities. for instance if we wanted to test the hypothesis that:

  • \(H_0:\) There is no association between being diabetic and being a smoker
  • \(H_1:\) There is association between being diabetic and being a smoker

we would run the following code and also reject \(H_0\) if our p-value was less than \(5\%\)

use hersdata
tab diabetes smoking ,chi expected cchi2
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| chi2 contribution  |
+--------------------+

           |    current smoker
  diabetes |        no        yes |     Total
-----------+----------------------+----------
        no |     1,733        299 |     2,032 
           |   1,767.2      264.8 |   2,032.0 
           |       0.7        4.4 |       5.1 
-----------+----------------------+----------
       yes |       670         61 |       731 
           |     635.8       95.2 |     731.0 
           |       1.8       12.3 |      14.2 
-----------+----------------------+----------
     Total |     2,403        360 |     2,763 
           |   2,403.0      360.0 |   2,763.0 
           |       2.5       16.7 |      19.2 

          Pearson chi2(1) =  19.2496   Pr = 0.000

here \(\chi^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}=19.2496\) The p-value is very low, so we reject the null hypothesis of independence. This demonstrates that a relationship exists between smoking and being diabetic.

f) One-way ANOVA test

One way Anova is an extension of the two sample independent t.test where the goal is now to compare the means between 3 or more categories for a given continuous outcome variable.

Treatment

1 2 a
\(Y_{11}\) \(Y_{21}\) \(Y_{a1}\)
\(Y_{12}\)
Sample Mean \(\bar{Y_{1.}}\) \(\bar{Y_{2.}}\) \(\bar{Y_{a.}}\)
Sample SD \(s_1\) \(s_2\) \(s_a\)

where \(\bar{Y_{i.}}=\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\)

\(s_i^2=\frac{1}{n_i-1}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y_i})^2\) The total sample size is \(N=\sum_{i=1}^{a}n_i\) And the grand mean is \(\bar{Y_{..}}=\frac{1}{N}\sum_{i}\sum_{j}Y_{ij}\) in this case our Null and alternative hypothesis can be shown as below:

\[ H_0 : \mu_1 = \mu_2 = ... = \mu_a \]

\[ H_1 : \mu_1 \neq \mu_2 \neq ... \neq \mu_a \\ \]

The total variability of the \(Y_{ij}\) observation can be measured as the deviation of \(Y_{ij}\) around the overall mean \(\bar{Y_{..}}\): \(Y_{ij} - \bar{Y_{..}}\)

This can be rewritten as: \[ \begin{split} Y_{ij} - \bar{Y_{..}}&=Y_{ij} - \bar{Y_{..}} + \bar{Y_{i.}} - \bar{Y_{i.}} \\ &= (\bar{Y_{i.}}-\bar{Y_{..}})+(Y_{ij}-\bar{Y_{i.}}) \end{split} \] where

  • the first term is the between treatment differences (i.e., the deviation of the treatment mean from the overall mean)
  • the second term is within treatment differences (i.e., the deviation of the observation around its treatment mean)


\[ \begin{split} \sum_{i}\sum_{j}(Y_{ij} - \bar{Y_{..}})^2 &= \sum_{i}n_i(\bar{Y_{i.}}-\bar{Y_{..}})^2+\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2 \\ SSTO &= SSTR + SSE \\ total~SS &= treatment~SS + error~SS \\ (N-1)~d.f. &= (a-1)~d.f. + (N - a) ~ d.f. \end{split} \]

Accordingly, \(MSTR= \frac{SST}{a-1}\) and \(MSR=\frac{SSE}{N-a}\)

ANOVA Table

Source of Variation SS df MS
Between Treatments \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2\) a-1 SSTR/(a-1)
Error (within treatments) \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2\) N-a SSE/(N-a)
Total (corrected) \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{..}})^2\) N-1
HERS Data example

Suppose we want to compare mean BMI for people in different self reported health conditions. We run the following command:

use hersdata
 */obtain additional information about the production data by wheat varieties
oneway BMI globrat,tabulate
self-report |       Summary of BMI (kg/m^2)
  ed health |        Mean   Std. dev.       Freq.
------------+------------------------------------
       poor |   29.734667   6.7250888          60
       fair |   29.406047   6.3661577         602
       good |   28.664506   5.4050013       1,307
  very good |   27.959614   4.8108748         673
  excellent |   26.315752   4.0410658         113
------------+------------------------------------
      Total |   28.581318   5.5194835       2,755

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1338.45482      4   334.613705     11.15     0.0000
 Within groups      82561.3228   2750   30.0222992
------------------------------------------------------------------------
    Total           83899.7776   2754   30.4646978

Bartlett's equal-variances test: chi2(4) =  73.9963    Prob>chi2 = 0.000

Conclusion

\(P(>F)=0.000\) hence we reject \(H_0\) and conclude that the mean BMI for these groups are significantly different.

g) Confounding

  • Confounder: A variable that distorts the association between exposure and outcome.

  • Effect modifier: A variable that changes the strength/direction of association.

graph TD
    A[Exposure: Exercise] --> C[Outcome: Diabetes]
    B[Confounder: statins] --> A
    B --> C

Running a confounding or Effect modification test

use ncube
cc exercise diabetes, by(statins)
      statin use | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .4275362      .1423493    1.15752     7.374046 (exact)
             yes |   .3641457      .0909963   1.273914     5.173913 (exact)
-----------------+-------------------------------------------------
           Crude |   .4047619       .179071   .8688314              (exact)
    M–H combined |   .4013983      .1943076   .8292038              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     0.05  Pr>chi2 = 0.8319

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      6.25
                                                Pr>chi2 =    0.0124

Conclusion

Statin use is a cofounder for the relationship between excercise and diabetes.

h) Wilcoxon Signed Rank vs Rank Sum Test

  • Signed Rank Test: Paired (before/after) comparisons.

for instance we can compare if there is a difference in median weight from start of study up to year 1

  • \((H_0)\): The median of the difference between the two groups is zero i.e no difference between weight and weight1.

  • \((H_1)\): There is a difference (nonzero median of differences).

use hersdata
signrank weight = weight1
Wilcoxon signed-rank test

        Sign |      Obs   Sum ranks    Expected
-------------+---------------------------------
    Positive |     1495   2111425.5   1703595.5
    Negative |     1042   1295765.5   1703595.5
        Zero |       74        2775        2775
-------------+---------------------------------
         All |     2611     3409966     3409966

Unadjusted variance   1.484e+09
Adjustment for ties  -94232.625
Adjustment for zeros  -34456.25
                     ----------
Adjusted variance     1.484e+09

H0: weight = weight1
         z = 10.587
Prob > |z| = 0.0000

Conclusion

p-value = 0.0000 → Strong evidence against the null hypothesis.

  • Rank Sum Test: Unpaired, two-group comparisons.

for example we can compare the median BMI for participants who excercised and those who did not.

use hersdata
ranksum BMI, by(exercise)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test

    exercise |      Obs    Rank sum    Expected
-------------+---------------------------------
          no |     1692   2512882.5     2334114
         yes |     1066   1291778.5     1470547
-------------+---------------------------------
    Combined |     2758     3804661     3804661

Unadjusted variance   4.147e+08
Adjustment for ties  -449.26906
                     ----------
Adjusted variance     4.147e+08

H0: BMI(exercise==no) = BMI(exercise==yes)
         z =  8.779
Prob > |z| = 0.0000

p-value = 0.0000 → Strong evidence against the null hypothesis.

Question 2 : D table

use hersdata
dtable, by(statins, tests testnotes) continuous(age BMI SBP DBP tchol LDL HDL TG) factor(raceth nonwhite smoking drinkany exercise physact diabetes globrat poorfair) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace)
note: using test regress across levels of statins for age, BMI, SBP, DBP,
      tchol, LDL, HDL, and TG.
note: using test pearson across levels of statins for raceth, nonwhite,
      smoking, drinkany, exercise, physact, diabetes, globrat, and poorfair.

Table1
--------------------------------------------------------------------------------------------
                                                           statin use                       
                                          no               yes             Total       Test 
--------------------------------------------------------------------------------------------
N                                     1,759 (63.7%)    1,004 (36.3%)   2,763 (100.0%)       
age in years                         66.507 (6.905)   66.896 (6.182)   66.649 (6.653)  0.139
BMI (kg/m^2)                         28.627 (5.626)   28.496 (5.324)   28.579 (5.518)  0.550
systolic blood pressure            135.323 (18.878) 134.625 (19.288) 135.069 (19.028)  0.354
diastolic blood pressure             73.628 (9.658)   72.317 (9.735)   73.152 (9.705) <0.001
total cholesterol (mg/dl)          233.767 (40.701) 219.470 (39.979) 228.580 (41.014) <0.001
LDL cholesterol (mg/dl)            151.136 (37.231) 134.322 (36.409) 145.039 (37.803) <0.001
HDL cholesterol (mg/dl)             50.068 (13.622)  50.602 (12.468)  50.262 (13.216)  0.308
triglycerides (mg/dl)              162.863 (63.742) 171.921 (62.718) 166.149 (63.511) <0.001
race/ethnicity                                                                              
  White                               1,542 (87.7%)      909 (90.5%)    2,451 (88.7%)  0.012
  African American                       159 (9.0%)        59 (5.9%)       218 (7.9%)       
  Other                                   58 (3.3%)        36 (3.6%)        94 (3.4%)       
nonwhite race/ethnicity                                                                     
  no                                  1,542 (87.7%)      909 (90.5%)    2,451 (88.7%)  0.022
  yes                                   217 (12.3%)        95 (9.5%)      312 (11.3%)       
current smoker                                                                              
  no                                  1,495 (85.0%)      908 (90.4%)    2,403 (87.0%) <0.001
  yes                                   264 (15.0%)        96 (9.6%)      360 (13.0%)       
any current alcohol consumption                                                             
  no                                  1,102 (62.6%)      578 (57.7%)    1,680 (60.8%)  0.010
  yes                                   657 (37.4%)      424 (42.3%)    1,081 (39.2%)       
exercise at least 3 times per week                                                          
  no                                  1,102 (62.6%)      593 (59.1%)    1,695 (61.3%)  0.063
  yes                                   657 (37.4%)      411 (40.9%)    1,068 (38.7%)       
comparative physical activity                                                               
  much less active                       125 (7.1%)        72 (7.2%)       197 (7.1%)  0.407
  somewhat less active                  317 (18.0%)      186 (18.5%)      503 (18.2%)       
  about as active                       601 (34.2%)      318 (31.7%)      919 (33.3%)       
  somewhat more active                  535 (30.4%)      303 (30.2%)      838 (30.3%)       
  much more active                      181 (10.3%)      125 (12.5%)      306 (11.1%)       
diabetes                                                                                    
  no                                  1,278 (72.7%)      754 (75.1%)    2,032 (73.5%)  0.161
  yes                                   481 (27.3%)      250 (24.9%)      731 (26.5%)       
self-reported health                                                                        
  poor                                    44 (2.5%)        16 (1.6%)        60 (2.2%)  0.259
  fair                                  401 (22.8%)      204 (20.4%)      605 (21.9%)       
  good                                  824 (46.9%)      484 (48.3%)    1,308 (47.4%)       
  very good                             418 (23.8%)      256 (25.5%)      674 (24.4%)       
  excellent                               71 (4.0%)        42 (4.2%)       113 (4.1%)       
poor/fair self-reported health                                                              
  no                                  1,313 (74.7%)      782 (78.0%)    2,095 (75.9%)  0.047
  yes                                   445 (25.3%)      220 (22.0%)      665 (24.1%)       
--------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)

https://www.ahajournals.org/doi/pdf/10.1161/01.cir.0000019406.74017.b2
Citation of the table used.

Herrington, David M., et al. “Statin Therapy, Cardiovascular Events, and Total Mortality in the Heart and Estrogen/Progestin Replacement Study (HERS).” Circulation, vol. 105, no. 25, 2002, pp. 2962–2967. https://doi.org/10.1161/01.CIR.0000019406.74017.B2

Data sampling

  • for these exercises I used a sample size of 200
use Hersdata

set seed 3002164

sample 200 , count

save ncube , replace
(2,563 observations deleted)

file ncube.dta saved
Question 3

Formulate a question that you would use the independent Student’s t-test using the given HERS data. Below is a guide to assist you to answer this question.

  1. State the research question and highlight the two variables you will use to answer this question?
  • Solution: Is there is any significant difference in the mean Body Mass Index of participants who smoke and those who do not smoke? . To answer this I used the following variables
  1. smoking
  2. BMI
  1. Test the relevant assumptions of the Student’s t-test, clearly outline the hypotheses and make decisions on whether these have been met or not?

Checking for assumptions

Equality of Variance assumption

Equality of Variances assumptions

\[H_0: Variances~ of~ BMI ~across~ groups ~are ~equal\] \[H_1: Variances~ of ~BMI ~across~ groups ~is ~not ~equal\]

use ncube, clear

sdtest BMI , by(smoking)
Variance ratio test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      no |     170    29.03318    .4483305    5.845514    28.14813    29.91823
     yes |      30    25.63967    1.038795    5.689716    23.51509    27.76424
---------+--------------------------------------------------------------------
Combined |     200    28.52415    .4195955    5.933977    27.69673    29.35157
------------------------------------------------------------------------------
    ratio = sd(no) / sd(yes)                                      f =   1.0555
H0: ratio = 1                                    Degrees of freedom =  169, 29

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.5482         2*Pr(F > f) = 0.9037           Pr(F > f) = 0.4518

Comment

Based on the following reasons:

  • the ratio F=1.056 shows no significant difference in the variance of the denominator and that of the numerator.
  • The p-value for the two-sided test is 0.9037, which is much greater than 0.05.

Therefore, we fail to reject the null hypothesis — there is no evidence that the standard deviations are different between the groups.

This suggests the variability (spread) of the outcome (likely BMI) is similar in both groups.

Normality test

  • Normality assumption

\[H_0: BMI ~across~ groups ~is ~normally~ distributed\] \[H_1: BMI ~across~ groups ~is ~not ~normally~ distributed\] We fail to reject \(H_0\) if the p-value is above 0.05

use ncube

*/Shapiro–Wilk W test

bysort smoking: swilk BMI
-> smoking = no

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |        170    0.96032      5.142     3.737    0.00009

-------------------------------------------------------------------------------
-> smoking = yes

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         30    0.86586      4.264     2.998    0.00136

Comment

Based on the results from the above test , we notice that BMI is not normally distributed across the two categories because the p-values for both groups is less than 5% hence we fail to reject the null hypothesis.

  1. Based on your decision from 3b), perform the independent Student’s ttest (regardless of results from b) using Stata. Clearly state your hypotheses, display the Stata results output and the final decision from doing the test.

Define the hypothesis

  • \(H_0\) : \(\mu_{No} = \mu_{Yes}\)

  • \(H_1\) : \(\mu_{No}\ne \mu_{Yes}\)

  • the means are given in the output tables

Define significance level

  • determine at 5% level
use ncube, clear

ttest BMI , by(smoking)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      no |     170    29.03318    .4483305    5.845514    28.14813    29.91823
     yes |      30    25.63967    1.038795    5.689716    23.51509    27.76424
---------+--------------------------------------------------------------------
Combined |     200    28.52415    .4195955    5.933977    27.69673    29.35157
---------+--------------------------------------------------------------------
    diff |             3.39351    1.153117                1.119543    5.667477
------------------------------------------------------------------------------
    diff = mean(no) - mean(yes)                                   t =   2.9429
H0: diff = 0                                     Degrees of freedom =      198

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9982         Pr(|T| > |t|) = 0.0036          Pr(T > t) = 0.0018

Conclusion

since \(Pr(|T| > |t|) = 0.0036\) which is below 5% we reject the null hypothesis and conclude that there is evidence to suggest that the mean BMI is significantly different from each group(smokers and non smokers).

  1. The test done in 3c) has a test statistic t-value, show how this was computed manually. Use the given summary statistics on the Stata output in the computation of the test statistics.

Manually calculating

for equal variances we use the formula in the table above:

\[t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\] where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 -2}\]

where

\(\overline{x_{1}}\) and \(\overline{x_{2}}\) are the means of the response variable y for group 1 and 2, respectively,
\(s_{1}^2\) and \(s_{2}^2\) are the variances of the response variable y for group 1 and 2, respectively,
\(n_{1}\) and \(n_{2}\) are the sample sizes of groups 1 and 2, respectively.

\[t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\] where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 +(n_2-1)s^2_2}{n_1 + n_2 -2}\]

\[s_p^2 = \frac{(170 -1)5.845514^2 + (30-1)5.689716^2}{170 +30 -2}\]

display "spooled=" ((170 -1)*5.845514^2 + (30-1)*5.689716^2)/(170 + 30 -2)
spooled=33.906813

\[t = \frac{(29.03318-25.63967)}{\sqrt{33.906813(\frac{1}{170}+\frac{1}{30})}}\]

display "T=" (29.03318-25.63967)/sqrt(33.906813*(1/170 +1/30))
T=2.9429016
  1. What conclusion are you deriving from the question you set out to answer?

There is significant mean difference in BMI between smokers and non-smokers

  1. Now consider redoing parts a) , c) , d and e) by considering the non-parametric alternative. Take note you have to compute the non-parametric Z-value.
  • \(H_0\): Equal medians of BMI between the two groups of smokers and non-smokers
  • \(H_1\): Unequal medians of BMI between the two groups of smokers and non-smokers
use ncube
ranksum BMI ,by(smoking)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test

     smoking |      Obs    Rank sum    Expected
-------------+---------------------------------
          no |      170     18027.5       17085
         yes |       30      2072.5        3015
-------------+---------------------------------
    Combined |      200       20100       20100

Unadjusted variance    85425.00
Adjustment for ties       -1.22
                     ----------
Adjusted variance      85423.78

H0: BMI(smoking==no) = BMI(smoking==yes)
         z =  3.225
Prob > |z| = 0.0013
Exact prob = 0.0011

Conclusion

  • \(Prob > |z| = 0.0013\) implying that we reject \(H_0\) and conclude that the median BMI for these two groups are significantly different.

Manually calculating Z

\[Z_w= \frac{W-\mu_w}{\sigma_w}\]

\[\mu_w=\frac{n_s(n_s+n_L+1)}{2}=30(30+170+1)/2\]

display "mu_w =" 30*(30+170+1)/2
mu_w =3015

\[\sigma_w =\sqrt{\frac{n_sn_L(n_s+n_L+1)}{12}}=\sqrt{\frac{30*170(30+170+1)}{12}}\]

display "sig_w=" sqrt(30*170*(30+170+1)/12)
sig_w=292.27555

\[Z_w=| \frac{2072.5 - 3015}{292.27555}|=3.2246967\]

display "z=" (2072.5 - 3015)/292.27555
z=-3.2246967
Question 4

The figure below gives the visual aid on trying to get answers on determining whether each of the variables in the third variable box was either a confounder, effect modifier or neither of these on the relationship between exercise and diabetes?

  1. State the hypotheses, use stata to do the relevant statistical tests after performing the relevant assumptions testing to answer the question “Is there an association between the exposure exercise and outcome diabetes?”
use ncube

tab exercise diabetes , chi expected exact cchi2
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| chi2 contribution  |
+--------------------+

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        85         40 |       125 
           |      92.5       32.5 |     125.0 
           |       0.6        1.7 |       2.3 
-----------+----------------------+----------
       yes |        63         12 |        75 
           |      55.5       19.5 |      75.0 
           |       1.0        2.9 |       3.9 
-----------+----------------------+----------
     Total |       148         52 |       200 
           |     148.0       52.0 |     200.0 
           |       1.6        4.6 |       6.2 

          Pearson chi2(1) =   6.2370   Pr = 0.013
           Fisher's exact =                 0.013
   1-sided Fisher's exact =                 0.009

Define the hypothesis

  • \(H_0\) : There is no association between diabetes and exercise
  • \(H_1\) : There is an association between diabetes and exercise

Conclusion

since \(p=0.013<0.05\) , we reject \(H_0\) and conclude that at 5% level of significance , there is evidence that diabetes and exercise are significantly associated.

  1. Work out the test statistic on 4a) manually, showing all your steps?

The test statistic is calculated as follows \[\chi^2=\sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

We reject the null hypothesis at 5% level of significance if \[\chi_2>\chi^{2}_{0.05}(r-1)(c-1)\]

Calculating expected frequencies

\[ E_{ij}=\frac{O_i\times O_{ij}}{n} \]

\[ E_{11}=\frac{125\times 148}{200}=92.5 \]

display "E11=" (125*148)/200
E11=92.5

\[ E_{12}=\frac{125\times 52}{200}=32.5 \]

display "E12=" (125*52)/200
E12=32.5

\[ E_{21}=\frac{75\times 148}{200}=55.5 \]

display "E21=" (75*148)/200
E21=55.5

\[ E_{22}=\frac{75\times 52}{200}=19.5 \]

display "E22=" (75*52)/200
E22=19.5

calculating \(\chi^2\)

\[\chi^2=\frac{(85-92.5)^2}{92.5}+\frac{(40-32.5)^2}{32.5}\\ +\frac{(63-55.5)^2}{55.5}+\frac{(52-19.5)^2}{19.5}\]

display "chi2 =" (((85-92.5)^2)/92.5)+(((40-32.5)^2)/32.5) +(((63-55.5)^2)/55.5)+(((12-19.5)^2)/19.5)
chi2 =6.2370062

\[\chi^2=6.2370062\]

The degrees of freedom is

display "dof=" (2 - 1) * (2 - 1)
dof=1

\[\chi^2_{0.05}(2-1)(2-1)=\chi^2_{0.05}(1)=3.84\]

  • conclusion

Since \(\chi^{2}=6.2370062<3.84\) ,we reject \(H_0\) and conclude that there is association between diabetes and exercise.

Confounding

statins

use ncube
cc exercise diabetes, by(statins)
      statin use | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .4275362      .1423493    1.15752     7.374046 (exact)
             yes |   .3641457      .0909963   1.273914     5.173913 (exact)
-----------------+-------------------------------------------------
           Crude |   .4047619       .179071   .8688314              (exact)
    M–H combined |   .4013983      .1943076   .8292038              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     0.05  Pr>chi2 = 0.8319

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      6.25
                                                Pr>chi2 =    0.0124

Comment

statin use appears to be a confounder and not an effect modifier because of the following reasons:

  • the table above shows that the ratios 0.427 and 0.364 are not significantly different from each other
  • \(\chi^2(1) = 0.05\), p = 0.8319 → This test checks for effect modification (aka interaction) and a non-significant p-value (> 0.05) means no significant difference between the stratum-specific odds ratios (statin vs. no statin groups) hence no effect modification.

Manually computing M-H adjusted odds ratio

use ncube
bysort statins: tab exercise diabetes
-> statins = no

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        59         23 |        82 
       yes |        42          7 |        49 
-----------+----------------------+----------
     Total |       101         30 |       131 

-------------------------------------------------------------------------------
-> statins = yes

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        26         17 |        43 
       yes |        21          5 |        26 
-----------+----------------------+----------
     Total |        47         22 |        69 

\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{26*5}{69}+\frac{59*7}{131}}{(17*21)/69+(23*42)/131}=.40139833\]

display "OR_ADJUST=" ((26*5)/69+(59*7)/131)/((17*21)/69+(23*42)/131)
OR_ADJUST=.40139833

nonwhite

use ncube
cc exercise diabetes, by(nonwhite)
nonwhite race/et | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .3958333      .1610491   .9112598     10.78652 (exact)
             yes |        .75       .050961   8.546008     1.090909 (exact)
-----------------+-------------------------------------------------
           Crude |   .4047619       .179071   .8688314              (exact)
    M–H combined |   .4283626      .2064359    .888869              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     0.33  Pr>chi2 = 0.5644

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      5.30
                                                Pr>chi2 =    0.0214

Comment

  • \(\chi^2(1) = 0.33, p = 0.5644\)
  • This p-value is not significant implying that there is no statistical evidence of effect modification (i.e., the odds ratios across race/ethnicity strata are not significantly different) hence the variable is not an effect modifier.

Manually computing M-H adjusted odds ratio

use ncube
bysort nonwhite: tab exercise diabetes
-> nonwhite = no

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        76         32 |       108 
       yes |        60         10 |        70 
-----------+----------------------+----------
     Total |       136         42 |       178 

-------------------------------------------------------------------------------
-> nonwhite = yes

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |         9          8 |        17 
       yes |         3          2 |         5 
-----------+----------------------+----------
     Total |        12         10 |        22 

\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{9*2}{22}+\frac{76*10}{178}}{(8*3)/22+(32*60)/178}=.42836257\]

display "OR_ADJUST=" ((9*2)/22+(76*10)/178)/((8*3)/22+(32*60)/178)
OR_ADJUST=.42836257

poorfair

use ncube
cc exercise diabetes, by(poorfair)
poor/fair self-r | Odds ratio     [95% conf. interval]   M–H weight
-----------------+-------------------------------------------------
              no |   .5414274      .2175484    1.29283     8.292517 (exact)
             yes |   .1470588      .0032038   1.244775     3.207547 (exact)
-----------------+-------------------------------------------------
           Crude |   .4047619       .179071   .8688314              (exact)
    M–H combined |   .4314319      .2058862   .9040596              
-------------------------------------------------------------------
Test of homogeneity (M–H)      chi2(1) =     1.27  Pr>chi2 = 0.2598

                 Test that combined odds ratio = 1:
                                Mantel–Haenszel chi2(1) =      5.05
                                                Pr>chi2 =    0.0246

Comment

  • \(\chi^2(1) = 1.27, p = 0.2598\) not statistically significant hence no evidence of effect modification
  • This tells us the odds ratios across strata are not significantly different.

Manually computing M-H adjusted odds ratio

use ncube
bysort poorfair: tab exercise diabetes
-> poorfair = no

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        60         23 |        83 
       yes |        53         11 |        64 
-----------+----------------------+----------
     Total |       113         34 |       147 

-------------------------------------------------------------------------------
-> poorfair = yes

  exercise |
at least 3 |
 times per |       diabetes
      week |        no        yes |     Total
-----------+----------------------+----------
        no |        25         17 |        42 
       yes |        10          1 |        11 
-----------+----------------------+----------
     Total |        35         18 |        53 

\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{25*1}{53}+\frac{60*11}{147}}{(17*10)/53+(53*23)/147}=.43143186\]

display "OR_ADJUST=" ((25*1)/53+(60*11)/147)/((17*10)/53+(53*23)/147)
OR_ADJUST=.43143186
Question 5

Formulate a question that you would use oneway analyses of variance test (ANOVA) using the given HERS data. Below is a guide to assist you to answer this question

  1. State the research question and highlight the two variables you will use to answer this question?
  • Are the mean BMI across people with different responses on comparative physical activity significantly different?
  • to answer this question I used two variables namely
    1. Comparative physical activity (physact)
    2. Body mass Index (BMI)
  1. Test the relevant assumptions of the oneway ANOVA test, clearly outline the hypotheses and make decisions on whether these have been met or not?
  • Normality assumption

\[H_0: BMI ~across~ groups ~is ~normally~ distributed\] \[H_1: BMI ~across~ groups ~is ~not ~normally~ distributed\] We fail to reject \(H_0\) if the p-value is above 0.05

use ncube

*/Shapiro–Wilk W test

bysort physact: swilk BMI
-> physact = much less active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         15    0.93253      1.308     0.531    0.29761

-------------------------------------------------------------------------------
-> physact = somewhat less active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         41    0.96054      1.590     0.977    0.16425

-------------------------------------------------------------------------------
-> physact = about as active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         67    0.95711      2.548     2.029    0.02123

-------------------------------------------------------------------------------
-> physact = somewhat more active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         56    0.97306      1.386     0.700    0.24182

-------------------------------------------------------------------------------
-> physact = much more active

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         BMI |         21    0.89416      2.594     1.927    0.02700

Conclusion

The data for the following groups was found to be non-normal:

  1. -> physact = much more active
  2. -> physact = about as active

Equality of Variances assumptions

\[H_0: Variances~ of~ BMI ~across~ groups ~are ~equal\] \[H_1: Variances~ of ~BMI ~across~ groups ~is ~not ~equal\]

use ncube
*Levenes and Brown-Forsythe tests*

graph box BMI, over(physact) ylabel(50(10)80, labsize(small))  title("BMI count by Physical Activity")

quietly graph export box.svg, replace

robvar BMI, by(physact)
comparative |
   physical |       Summary of BMI (kg/m^2)
   activity |        Mean   Std. dev.       Freq.
------------+------------------------------------
  much less |   30.578667   9.2333355          15
  somewhat  |   30.389268   6.4486589          41
  about as  |   28.506418   5.6301614          67
  somewhat  |   27.406429   4.6921745          56
  much more |   26.452381   4.9048524          21
------------+------------------------------------
      Total |    28.52415   5.9339772         200

W0  =  5.6070266   df(4, 195)     Pr > F = 0.00027148

W50 =  4.3101376   df(4, 195)     Pr > F = 0.00230662

W10 =  5.4014192   df(4, 195)     Pr > F = 0.00038104

Stata Graph - Graph 50 60 70 80 BMI (kg/m^2) much less active somewhat less active about as active somewhat more active much more active BMI count by Physical Activity

use ncube
*/if the equal variance
ssc install robnova
robnova BMI physact
checking robnova consistency and verifying not already installed...
installing into C:\Users\r1953\ado\plus\...
installation complete.


Outcome variable was BMI and predictor variable was physact

Sum of Squares Model = 366.0596
Sum of Squares Residual = 6641.1453
Sum of Squares Total = 7007.2049
R-squared = 0.052240

---------------------------------------------------------
            Test |     F        df1       df2        p   
-----------------+---------------------------------------
Brown-Forsythe's |  2.1484       4      57.3707   0.0864 
        Fisher's |  2.6871       4     195.0000   0.0326 
         Welch's |  2.4583       4      58.0620   0.0554 
---------------------------------------------------------
Total number of observations used was 200.

Conclusion

Bartlett’s equal-variances test: \(\chi^2(4) = 14.6343 , Prob>chi2 = 0.006\) suggests that we reject the null hypothesis and conclude that there is significant evidence that at least one group has a different variance. So, the assumption of homogeneity of variance is violated.

Fitting the One way analysis of Variance

use ncube
*/////fit oneway anova/////

 */obtain additional information about the production data by wheat varieties
oneway BMI physact,tabulate
comparative |
   physical |       Summary of BMI (kg/m^2)
   activity |        Mean   Std. dev.       Freq.
------------+------------------------------------
  much less |   30.578667   9.2333355          15
  somewhat  |   30.389268   6.4486589          41
  about as  |   28.506418   5.6301614          67
  somewhat  |   27.406429   4.6921745          56
  much more |   26.452381   4.9048524          21
------------+------------------------------------
      Total |    28.52415   5.9339772         200

                        Analysis of variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      366.059574      4   91.5148934      2.69     0.0326
 Within groups      6641.14534    195   34.0571556
------------------------------------------------------------------------
    Total           7007.20492    199    35.212085

Bartlett's equal-variances test: chi2(4) =  14.6343    Prob>chi2 = 0.006

Conclusion

Since \(Prob>F =0.0326 < 0.05\) we reject the null hypothesis and conclude that there is evidence to suggest that the mean BMI for participants in each group significantly differs from the others.

  1. The test done in 5b) has an F test statistic, show how this can be computed manually. Use the given summary statistics on the Stata output in the computation of the test statistics after using the option tabulate on your oneway test.

Manual calculation

\[MS_{Total} = \frac{SSE}{N-1}=\frac{\sum (X_{ij}-\overline{X}_j)}{N-1}=\sigma^2\] \[MS_{Total} = \frac{\sum (X_{ij}-\overline{X}_j)}{N-1}=\frac{7007.20492 }{200-1}=5.9339772^2=35.212085\] \[SS_{Between} =\sum_{k=1}^K n_k(\bar{x}_k -\bar{x}_{grand})^2\]

\[SS_{Between} = 15( 30.578667 -28.52415)^2+ 41( 30.389268 -28.52415)^2+67( 28.506418 -28.52415)^2+ \\\ 56( 27.406429 -28.52415)^2+21( 26.452381 -28.52415)^2\]

display "SSB=" 15*( 30.578667 -28.52415)^2+ 41*( 30.389268 -28.52415)^2+67*( 28.506418 -28.52415)^2+56*( 27.406429 -28.52415)^2+21*( 26.452381 -28.52415)^2
SSB=366.05951

\[MS_{between} = \frac{SS_{between}}{K-1}=\frac{366.05951}{5-1}=91.514877\]

display "MSB=" 366.05951/4
MSB=91.514877

\[SS_{within} =\sum^{K}_{k=1}\sum^{n_k}_{i=1}(x_{i,k} -\bar{x}_k)\] \[SS_{within} =(15-1)*9.23333556^2+(41-1)*6.4486589^2+(67-1)*5.6301614^2+\\\ (56-1)*4.6921745^2+(21-1)*4.9048524^2\]

display "SSW=" (15-1)*9.23333556^2+(41-1)*6.4486589^2+(67-1)*5.6301614^2+(56-1)*4.6921745^2+(21-1)*4.9048524^2
SSW=6641.1453

\[MS_{within} = \frac{SS_{within}}{N-K}=\frac{6641.1453}{200-5}=34.057155\]

display "MSW=" 6641.1453/195
MSW=34.057155

\[F = \frac{MS_{Between}}{MS_{within}}=91.514877/34.057155=2.6870969\]

display "F=" 91.514877/34.057155
F=2.6870969

ANOVA Table

Source of Variation SS df MS
Between Treatments \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2\) a-1 SSTR/(a-1)
Error (within treatments) \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2\) N-a SSE/(N-a)
Total (corrected) \[\sum(Y_{ij} - \bar{Y}_{..})^2\] N-1
Source of Variation SS df MS
Between Treatments \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2=366.05951\) 5-1 366.05951/(4)
Error (within treatments) \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2=6641.1453\) 200-5 6641.1453/(195)
Total (corrected) \[\sum(Y_{ij} - \bar{Y}_{..})^2=7007.20492\] 200-1
  1. If the assumptions are not fully met in 5b) what can you consider doing?

If the assumptions were not met I would consider using an non-parametric test such as the Kruskal Wallis test.

  1. Based on your decision from 5d), clearly state your hypotheses, run the alternative test in stata, display the Stata results output and the final decision and conclusion from doing the test to show exactly where the differences (if any) were on?
  • \[H_0: \theta_1 =\theta_2=\theta_3(medians)\]
  • \[H_1: \theta_1 \neq\theta_2\neq \theta_3 (medians)\]
use ncube
kwallis BMI ,by(physact)
Kruskal–Wallis equality-of-populations rank test

  +---------------------------------------+
  |              physact | Obs | Rank sum |
  |----------------------+-----+----------|
  |     much less active |  15 |  1644.00 |
  | somewhat less active |  41 |  4821.00 |
  |      about as active |  67 |  6773.50 |
  | somewhat more active |  56 |  5194.00 |
  |     much more active |  21 |  1667.50 |
  +---------------------------------------+

  chi2(4) =  7.744
     Prob = 0.1014

  chi2(4) with ties =  7.744
               Prob = 0.1014
Note

Since \(p = 0.1014 > 0.05\), we fail to reject the null hypothesis. That means: There’s no statistically significant difference in the distribution (or medians) of the BMI across the five physical activity levels.