HT | age | raceth | nonwhite | smoking | drinkany | exercise | physact | globrat | poorfair | medcond | htnmeds | statins | diabetes | dmpills | insulin | weight | BMI | waist | WHR | glucose | weight1 | BMI1 | waist1 | WHR1 | glucose1 | tchol | LDL | HDL | TG | tchol1 | LDL1 | HDL1 | TG1 | SBP | DBP | age10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 70 | 2 | 1 | 0 | 0 | 0 | 5 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 73.8 | 23.69 | 96.0 | 0.932 | 84 | 73.6 | 23.63 | 93.0 | 0.912 | 94 | 189 | 122.4 | 52 | 73 | 201 | 137.6 | 48 | 77 | 138 | 78 | 7.0 |
0 | 62 | 2 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 70.9 | 28.62 | 93.0 | 0.964 | 111 | 73.4 | 28.89 | 95.0 | 0.964 | 78 | 307 | 241.6 | 44 | 107 | 216 | 150.6 | 48 | 87 | 118 | 70 | 6.2 |
1 | 69 | 1 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 102.0 | 42.51 | 110.2 | 0.782 | 114 | 96.1 | 40.73 | 103.0 | 0.774 | 98 | 254 | 166.2 | 57 | 154 | 254 | 156.0 | 66 | 160 | 134 | 78 | 6.9 |
0 | 64 | 1 | 0 | 1 | 1 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 64.4 | 24.39 | 87.0 | 0.877 | 94 | 58.6 | 22.52 | 77.0 | 0.802 | 93 | 204 | 116.2 | 56 | 159 | 207 | 122.6 | 57 | 137 | 152 | 72 | 6.4 |
0 | 65 | 1 | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57.9 | 21.90 | 77.0 | 0.794 | 101 | 58.9 | 22.28 | 76.5 | 0.757 | 92 | 214 | 150.6 | 42 | 107 | 235 | 172.2 | 35 | 139 | 175 | 95 | 6.5 |
1 | 68 | 2 | 1 | 0 | 1 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60.9 | 29.05 | 96.0 | 1.000 | 116 | 57.7 | 27.52 | 86.0 | 0.910 | 115 | 212 | 137.8 | 52 | 111 | 202 | 126.6 | 53 | 112 | 174 | 98 | 6.8 |
The HERS study dataset has been provided to you to help you answer the following questions in relation to concepts that you learnt in Biostatistics I. The participants were are either on hormone therapy or placebo and the data were collected at baseline [e.g weight] and on follow-up [those variables ending with 1, e.g weight1] Below are the variable names, value labels and variable labels as they will be used in the questions below.
Data Description
use hersdata
describe
Contains data from hersdata.dta
Observations: 2,763
Variables: 37 11 May 2018 14:57
-------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
HT byte %15.0g HT random assignment to hormone
therapy
age byte %9.0g age in years
raceth byte %16.0g raceth race/ethnicity
nonwhite byte %9.0g noyes nonwhite race/ethnicity
smoking byte %9.0g noyes current smoker
drinkany byte %9.0g noyes any current alcohol consumption
exercise byte %9.0g noyes exercise at least 3 times per
week
physact byte %20.0g physact comparative physical activity
globrat byte %9.0g globrat self-reported health
poorfair byte %9.0g noyes poor/fair self-reported health
medcond byte %9.0g other serious conditions by
self-report
htnmeds byte %9.0g noyes anti-hypertensive use
statins byte %9.0g noyes statin use
diabetes byte %9.0g noyes diabetes
dmpills byte %9.0g noyes oral DM medication by self-report
insulin byte %9.0g noyes insulin use by self-report
weight float %9.0g weight (kg)
BMI float %9.0g BMI (kg/m^2)
waist float %9.0g waist (cm)
WHR float %9.0g waist/hip ratio
glucose int %9.0g fasting glucose (mg/dl)
weight1 float %9.0g year 1 weight (kg)
BMI1 float %9.0g year 1 BMI (kg/m^2)
waist1 float %9.0g year 1 waist (cm)
WHR1 float %9.0g year 1 waist/hip ratio
glucose1 int %9.0g year 1 fasting glucose (mg/dl)
tchol int %9.0g total cholesterol (mg/dl)
LDL float %9.0g LDL cholesterol (mg/dl)
HDL int %9.0g HDL cholesterol (mg/dl)
TG int %9.0g triglycerides (mg/dl)
tchol1 int %9.0g year 1 total cholesterol (mg/dl)
LDL1 float %9.0g year 1 LDL cholesterol (mg/dl)
HDL1 int %9.0g year 1 HDL cholesterol (mg/dl)
TG1 int %9.0g year 1 triglycerides (mg/dl)
SBP int %9.0g systolic blood pressure
DBP int %9.0g diastolic blood pressure
age10 float %9.0g age (per 10 years)
-------------------------------------------------------------------------------
Sorted by:
Explain all of the following and use the given HERS data to give an illustrative example using output from performing relevant stata commands.
a) Normal
Random variable \(Y\) is distributed \(Y \sim N(\mu, \sigma^2)\) if
\[\begin{equation} f(y) = \frac{e^{-(y-\mu)^2/ (2 \sigma^2)}}{\sqrt{2\pi\sigma^2}} \quad \textrm{for} \quad -\infty < y < \infty. \end{equation}\]As the parameter names suggest, \(E(Y) = \mu\) and \(Var(Y) = \sigma^2\). The distribution \(\textrm{N}(0,1)\) is often referred to as the standard normal distribution
- Normal distributions have the following properties:
- symmetric about the mean , median and mode hence \(Mean=mode=median\)
- it is bell shaped normal curve
using the hersdata we can illustrate if the above properties are satisfied for the variable weight
.
use hersdata
hist weight , norm
quietly graph export normal.png
(bin=34, start=37.5, width=2.7794118)
file normal.png already exists
r(602);
r(602);
Comments
- the data is symmetric and bell shaped
use hersdata
sum weight,det
weight (kg)
-------------------------------------------------------------
Percentiles Smallest
1% 46.9 37.5
5% 51.8 37.6
10% 55.7 40.8 Obs 2,761
25% 62.2 40.9 Sum of wgt. 2,761
50% 71 Mean 72.73484
Largest Std. dev. 14.71397
75% 81.4 127
90% 92.9 127 Variance 216.501
95% 100 129 Skewness .7042269
99% 116 132 Kurtosis 3.49991
- here we also note that , the mean and the median are almost equal
use hersdata
qnorm weight
graph export qq.png
file qq.png already exists
r(602);
r(602);
- the points almost lie on the straight line hence confirming normality assumption.
b) Bayes Theorem
A theorem about conditional probabilities and is given by \(\Pr(B \mid A) = \displaystyle{\frac{ \Pr(A \mid B) \; \Pr(B)}{\Pr(A)}}\)
Schematically if \(A = \theta\) and \(B = \text{y}\) the bayes theorem brings the fundamental theorem in bayesian statistics.
\[ \color{dodgerblue}{\boxed{\underbrace{P(\theta \mid y)}_{\text{Posterior}} = \frac{\overbrace{P(y \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(y)}_{\text{Normalizing Constant}}}}} \]
Context of HERS Data
Suppose we are interested in calculating the probability of being diabetic if a person smokes tobacco for a long time, \(P(diabetes|Smoker)\). With reference to the HERS
data we see that:
use hersdata
tab diabetes smoking ,cell
| Key |
|-----------------|
| frequency |
| cell percentage |
+-----------------+
| current smoker
diabetes | no yes | Total
-----------+----------------------+----------
no | 1,733 299 | 2,032
| 62.72 10.82 | 73.54
-----------+----------------------+----------
yes | 670 61 | 731
| 24.25 2.21 | 26.46
-----------+----------------------+----------
Total | 2,403 360 | 2,763
| 86.97 13.03 | 100.00
Let
- \(Diabetes=D\) and
- \(Smoking = S\)
From the table above we can find the following
\[P(D) =P(D~\cap~S) + P(D~\cap~S^c) = 0.2646\]
13% of the population are chronic smokers, \[P(Smoking) = P(S~\cap~D)+P(S~\cap~D^c)=0.1303\] Also, we can find the following, \[P(Smoker|Diabetes) = \frac{P(S\cap D)}{P(D)}=\frac{0.0221}{0.2646}=0.084\].
display 0.0221/0.2646
.0835223
Using the Bayes’ theorem we have:
\[ \textrm{P(Diabetes|Smoker) = }\frac{\textrm{P(Smoker|Diabetes)· P(Diabetes)}}{\textrm{P(Smoker)}}\] \[ \textrm{P(Diabetes|Smoker)} = \frac{0.084 \times 0.2646}{0.1303}= \frac{0222264}{0.1303}= .17057866\]
display "P(Diabetes|Smoker)=" (0.084*0.2646)/0.1303
P(Diabetes|Smoker)=.17057866
Statement on Bayes’ Theorem
A partition of a space \(\Omega\) is a collection of disjoint sets such that \(\displaystyle \bigcup_{i=1}^{\infty} A_i = \Omega\).
Let \(A_1\), \(A_2\), \(\ldots , A_k\) be a partition of \(\Omega\) such that \(P(A_i)>0\) for each \(i\). If \(B\) is any event with \(P(B)>0\), then for each \(i=1, \ldots k\), we have
\[P(A_i \mid B) = \frac{P(A_i \cap B) }{P(B)} = \frac{P(B \mid A_i) P(A_i)}{\sum_{j=1}^k P(B \mid A_j)P(A_j)}.\]
c) Binomial distribution
The binomial probability distribution can be used for modeling the number of times a particular event occurs (successes) in a sequence of n repeated and independent Bernoulli trials, each with probability p of success.
- There is a fixed number of n repeated Bernoulli trials.
- The n trials are all independent. That is, knowing the result of one trial does not change the probability we assign to other trials.
- Both probability of success, p, and probability of failure, 1-p, are constant throughtout the trials.
With reference to the Hers dataset
Let X be a random variable that indicates the number of successes (diabetic patients
) in n-independent Bernoulli trials. If random variable X satisfies the binomial setting, it follows the binomial distribution with parameters n and p, denoted as \(X ∼ Binomial(n, p)\), where n is the Bernoulli trial parameter (a positive integer) and p the Bernoulli probability parameter (\({0\leq p\leq 1}\)).
- The probability mass function (pmf) of X is given by:
\[ P(X=x) = {{n}\choose{x}} \cdot p^x \cdot (1-p)^{n-x} \tag{1}\]
where x = 0, 1, … , n and \(\binom{n}{x} = \frac{n!}{x!(n-x)!}\)
Note that: \(n! = 1\cdot 2 \cdot 3\cdot \ldots \cdot (n-2)\cdot (n-1)\cdot n\)
- The cumulative distribution function (cdf) of X is given by:
\[F(x) = P(X \le x)= {\begin{cases}0,&for\ x <0\\\sum_{k=0}^{x}{\left( \begin{array}{c} n \\ k \end{array} \right) p^{k}(1 - p)^{n-k}},&for\ 0\leq x < n\\1,&for\ x \geq n \end{cases}} \tag{2}\]
The mean of random variable, X, with Binomial(n, p) distribution is:
\[μ = np\] the variance is:
\[σ^2 = np(1-p)\]
and the standard deviation:
\[σ = \sqrt{np(1-p)}\]
Example : HERS DATA
use hersdata
tab diabetes
diabetes | Freq. Percent Cum.
------------+-----------------------------------
no | 2,032 73.54 73.54
yes | 731 26.46 100.00
------------+-----------------------------------
Total | 2,763 100.00
Let the random variable X be the number of patients diagnosed with diabetes and from the table above we note that 26.46% are diabetic(successful
) (p = 0.2646). If the results of 10 patients are randomly sampled, and X follows a Binomial distribution \(X ∼ Binomial(10, 0.2646)\), The main characteristics of this distribution can be shown below.
- So, the pmf for this distribution is:
\[ P(X=x) = {{10}\choose{x}} \cdot 0.2646^x \cdot (1-0.2646)^{10-x} \tag{3}\]
The pmf of Binomial(10, 0.2646) distribution specifies the probability of 0 through 10 successful selection of diabetic patients. Hence we have:
X | 0 | 1 | 2 | 3 | … | 8 | 9 | 10 |
P(X) | 0 | 0.046 | 0.166 | 0.269 | … | 0.00058 | 0.000047 | 0.000002 |
d) Sample size or Power
Power is the probability of detecting an effect if it exists. Higher sample size increases power. Suppose we want to only use a smaller sample of the HERS Data
to draw conclusions about the actual population: i.e If we want to draw a sample that we can use inorder to compare mean weight between the placebo
and hormone therapy
group. if we want to achieve a power of 80% at 5% level of significance Assuming expected means of \(75 ~vs ~78, SD = 10\). then our sample sizes can be can be calculated in stata as follows:
sd(10) alpha(0.05) power(0.8) power twomeans 75 78,
Performing iteration ...
Estimated sample sizes for a two-sample means test
t test assuming sd1 = sd2 = sd
H0: m2 = m1 versus Ha: m2 != m1
Study parameters:
alpha = 0.0500
power = 0.8000
delta = 3.0000
m1 = 75.0000
m2 = 78.0000
sd = 10.0000
Estimated sample sizes:
N = 352
N per group = 176
Our desired sample sizes for each group is 176
e) Pearson Chi-Square
The chi-square independence test tests whether observed joint frequency counts \(O_{ij}\) differ from expected frequency counts \(E_{ij}\) under the independence model (the model of independent explanatory variables, \(\pi_{ij} = \pi_{i+} \pi_{+j}\). \(H_0\) is \(O_{ij} = E_{ij}\).
The Pearson test of independence statistic is:
\[\chi^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
where \(O_{ij}\) is the observed count, and \(E_{ij}\) is the product of the row and column marginal probabilities. for instance if we wanted to test the hypothesis that:
- \(H_0:\) There is no association between being diabetic and being a smoker
- \(H_1:\) There is association between being diabetic and being a smoker
we would run the following code and also reject \(H_0\) if our p-value was less than \(5\%\)
use hersdata
tab diabetes smoking ,chi expected cchi2
| Key |
|--------------------|
| frequency |
| expected frequency |
| chi2 contribution |
+--------------------+
| current smoker
diabetes | no yes | Total
-----------+----------------------+----------
no | 1,733 299 | 2,032
| 1,767.2 264.8 | 2,032.0
| 0.7 4.4 | 5.1
-----------+----------------------+----------
yes | 670 61 | 731
| 635.8 95.2 | 731.0
| 1.8 12.3 | 14.2
-----------+----------------------+----------
Total | 2,403 360 | 2,763
| 2,403.0 360.0 | 2,763.0
| 2.5 16.7 | 19.2
Pearson chi2(1) = 19.2496 Pr = 0.000
here \(\chi^2 = \sum_i\sum_j \frac{(O_{ij} - E_{ij})^2}{E_{ij}}=19.2496\) The p-value is very low, so we reject the null hypothesis of independence. This demonstrates that a relationship exists between smoking and being diabetic.
f) One-way ANOVA test
One way Anova is an extension of the two sample independent t.test where the goal is now to compare the means between 3 or more categories for a given continuous outcome variable.
Treatment
1 | 2 | … | a | |
---|---|---|---|---|
\(Y_{11}\) | \(Y_{21}\) | … | \(Y_{a1}\) | |
\(Y_{12}\) | … | … | … | |
… | … | … | … | |
Sample Mean | \(\bar{Y_{1.}}\) | \(\bar{Y_{2.}}\) | … | \(\bar{Y_{a.}}\) |
Sample SD | \(s_1\) | \(s_2\) | … | \(s_a\) |
where \(\bar{Y_{i.}}=\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\)
\(s_i^2=\frac{1}{n_i-1}\sum_{j=1}^{n_i}(Y_{ij}-\bar{Y_i})^2\) The total sample size is \(N=\sum_{i=1}^{a}n_i\) And the grand mean is \(\bar{Y_{..}}=\frac{1}{N}\sum_{i}\sum_{j}Y_{ij}\) in this case our Null and alternative hypothesis can be shown as below:
\[ H_0 : \mu_1 = \mu_2 = ... = \mu_a \]
\[ H_1 : \mu_1 \neq \mu_2 \neq ... \neq \mu_a \\ \]
The total variability of the \(Y_{ij}\) observation can be measured as the deviation of \(Y_{ij}\) around the overall mean \(\bar{Y_{..}}\): \(Y_{ij} - \bar{Y_{..}}\)
This can be rewritten as: \[ \begin{split} Y_{ij} - \bar{Y_{..}}&=Y_{ij} - \bar{Y_{..}} + \bar{Y_{i.}} - \bar{Y_{i.}} \\ &= (\bar{Y_{i.}}-\bar{Y_{..}})+(Y_{ij}-\bar{Y_{i.}}) \end{split} \] where
- the first term is the between treatment differences (i.e., the deviation of the treatment mean from the overall mean)
- the second term is within treatment differences (i.e., the deviation of the observation around its treatment mean)
\[ \begin{split} \sum_{i}\sum_{j}(Y_{ij} - \bar{Y_{..}})^2 &= \sum_{i}n_i(\bar{Y_{i.}}-\bar{Y_{..}})^2+\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2 \\ SSTO &= SSTR + SSE \\ total~SS &= treatment~SS + error~SS \\ (N-1)~d.f. &= (a-1)~d.f. + (N - a) ~ d.f. \end{split} \]
Accordingly, \(MSTR= \frac{SST}{a-1}\) and \(MSR=\frac{SSE}{N-a}\)
ANOVA Table
Source of Variation | SS | df | MS |
---|---|---|---|
Between Treatments | \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2\) | a-1 | SSTR/(a-1) |
Error (within treatments) | \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2\) | N-a | SSE/(N-a) |
Total (corrected) | \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{..}})^2\) | N-1 |
Suppose we want to compare mean BMI for people in different self reported health conditions. We run the following command:
use hersdata
data by wheat varieties
*/obtain additional information about the production oneway BMI globrat,tabulate
self-report | Summary of BMI (kg/m^2)
ed health | Mean Std. dev. Freq.
------------+------------------------------------
poor | 29.734667 6.7250888 60
fair | 29.406047 6.3661577 602
good | 28.664506 5.4050013 1,307
very good | 27.959614 4.8108748 673
excellent | 26.315752 4.0410658 113
------------+------------------------------------
Total | 28.581318 5.5194835 2,755
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 1338.45482 4 334.613705 11.15 0.0000
Within groups 82561.3228 2750 30.0222992
------------------------------------------------------------------------
Total 83899.7776 2754 30.4646978
Bartlett's equal-variances test: chi2(4) = 73.9963 Prob>chi2 = 0.000
Conclusion
\(P(>F)=0.000\) hence we reject \(H_0\) and conclude that the mean BMI for these groups are significantly different.
g) Confounding
Confounder: A variable that distorts the association between exposure and outcome.
Effect modifier: A variable that changes the strength/direction of association.
graph TD A[Exposure: Exercise] --> C[Outcome: Diabetes] B[Confounder: statins] --> A B --> C
Running a confounding or Effect modification test
use ncube
cc exercise diabetes, by(statins)
statin use | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .4275362 .1423493 1.15752 7.374046 (exact)
yes | .3641457 .0909963 1.273914 5.173913 (exact)
-----------------+-------------------------------------------------
Crude | .4047619 .179071 .8688314 (exact)
M–H combined | .4013983 .1943076 .8292038
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 0.05 Pr>chi2 = 0.8319
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 6.25
Pr>chi2 = 0.0124
Conclusion
Statin use is a cofounder for the relationship between excercise and diabetes.
h) Wilcoxon Signed Rank vs Rank Sum Test
- Signed Rank Test: Paired (before/after) comparisons.
for instance we can compare if there is a difference in median weight from start of study up to year 1
\((H_0)\): The median of the difference between the two groups is zero i.e no difference between weight and weight1.
\((H_1)\): There is a difference (nonzero median of differences).
use hersdata
signrank weight = weight1
Wilcoxon signed-rank test
Sign | Obs Sum ranks Expected
-------------+---------------------------------
Positive | 1495 2111425.5 1703595.5
Negative | 1042 1295765.5 1703595.5
Zero | 74 2775 2775
-------------+---------------------------------
All | 2611 3409966 3409966
Unadjusted variance 1.484e+09
Adjustment for ties -94232.625
Adjustment for zeros -34456.25
----------
Adjusted variance 1.484e+09
H0: weight = weight1
z = 10.587
Prob > |z| = 0.0000
Conclusion
p-value = 0.0000 → Strong evidence against the null hypothesis.
- Rank Sum Test: Unpaired, two-group comparisons.
for example we can compare the median BMI for participants who excercised and those who did not.
use hersdata
ranksum BMI, by(exercise)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test
exercise | Obs Rank sum Expected
-------------+---------------------------------
no | 1692 2512882.5 2334114
yes | 1066 1291778.5 1470547
-------------+---------------------------------
Combined | 2758 3804661 3804661
Unadjusted variance 4.147e+08
Adjustment for ties -449.26906
----------
Adjusted variance 4.147e+08
H0: BMI(exercise==no) = BMI(exercise==yes)
z = 8.779
Prob > |z| = 0.0000
p-value = 0.0000 → Strong evidence against the null hypothesis.
Question 2 : D table
use hersdata
by(statins, tests testnotes) continuous(age BMI SBP DBP tchol LDL HDL TG) factor(raceth nonwhite smoking drinkany exercise physact diabetes globrat poorfair) title(Table1) titlestyles( font(, bold) ) export("Statin_table", as(docx) replace) dtable,
note: using test regress across levels of statins for age, BMI, SBP, DBP,
tchol, LDL, HDL, and TG.
note: using test pearson across levels of statins for raceth, nonwhite,
smoking, drinkany, exercise, physact, diabetes, globrat, and poorfair.
Table1
--------------------------------------------------------------------------------------------
statin use
no yes Total Test
--------------------------------------------------------------------------------------------
N 1,759 (63.7%) 1,004 (36.3%) 2,763 (100.0%)
age in years 66.507 (6.905) 66.896 (6.182) 66.649 (6.653) 0.139
BMI (kg/m^2) 28.627 (5.626) 28.496 (5.324) 28.579 (5.518) 0.550
systolic blood pressure 135.323 (18.878) 134.625 (19.288) 135.069 (19.028) 0.354
diastolic blood pressure 73.628 (9.658) 72.317 (9.735) 73.152 (9.705) <0.001
total cholesterol (mg/dl) 233.767 (40.701) 219.470 (39.979) 228.580 (41.014) <0.001
LDL cholesterol (mg/dl) 151.136 (37.231) 134.322 (36.409) 145.039 (37.803) <0.001
HDL cholesterol (mg/dl) 50.068 (13.622) 50.602 (12.468) 50.262 (13.216) 0.308
triglycerides (mg/dl) 162.863 (63.742) 171.921 (62.718) 166.149 (63.511) <0.001
race/ethnicity
White 1,542 (87.7%) 909 (90.5%) 2,451 (88.7%) 0.012
African American 159 (9.0%) 59 (5.9%) 218 (7.9%)
Other 58 (3.3%) 36 (3.6%) 94 (3.4%)
nonwhite race/ethnicity
no 1,542 (87.7%) 909 (90.5%) 2,451 (88.7%) 0.022
yes 217 (12.3%) 95 (9.5%) 312 (11.3%)
current smoker
no 1,495 (85.0%) 908 (90.4%) 2,403 (87.0%) <0.001
yes 264 (15.0%) 96 (9.6%) 360 (13.0%)
any current alcohol consumption
no 1,102 (62.6%) 578 (57.7%) 1,680 (60.8%) 0.010
yes 657 (37.4%) 424 (42.3%) 1,081 (39.2%)
exercise at least 3 times per week
no 1,102 (62.6%) 593 (59.1%) 1,695 (61.3%) 0.063
yes 657 (37.4%) 411 (40.9%) 1,068 (38.7%)
comparative physical activity
much less active 125 (7.1%) 72 (7.2%) 197 (7.1%) 0.407
somewhat less active 317 (18.0%) 186 (18.5%) 503 (18.2%)
about as active 601 (34.2%) 318 (31.7%) 919 (33.3%)
somewhat more active 535 (30.4%) 303 (30.2%) 838 (30.3%)
much more active 181 (10.3%) 125 (12.5%) 306 (11.1%)
diabetes
no 1,278 (72.7%) 754 (75.1%) 2,032 (73.5%) 0.161
yes 481 (27.3%) 250 (24.9%) 731 (26.5%)
self-reported health
poor 44 (2.5%) 16 (1.6%) 60 (2.2%) 0.259
fair 401 (22.8%) 204 (20.4%) 605 (21.9%)
good 824 (46.9%) 484 (48.3%) 1,308 (47.4%)
very good 418 (23.8%) 256 (25.5%) 674 (24.4%)
excellent 71 (4.0%) 42 (4.2%) 113 (4.1%)
poor/fair self-reported health
no 1,313 (74.7%) 782 (78.0%) 2,095 (75.9%) 0.047
yes 445 (25.3%) 220 (22.0%) 665 (24.1%)
--------------------------------------------------------------------------------------------
(collection DTable exported to file Statin_table.docx)
Herrington, David M., et al. “Statin Therapy, Cardiovascular Events, and Total Mortality in the Heart and Estrogen/Progestin Replacement Study (HERS).” Circulation, vol. 105, no. 25, 2002, pp. 2962–2967. https://doi.org/10.1161/01.CIR.0000019406.74017.B2
Data sampling
- for these exercises I used a sample size of
200
use Hersdata
set seed 3002164
sample 200 , count
save ncube , replace
(2,563 observations deleted)
file ncube.dta saved
Formulate a question that you would use the independent Student’s t-test using the given HERS data. Below is a guide to assist you to answer this question.
- State the research question and highlight the two variables you will use to answer this question?
Solution:
Is there is any significant difference in the mean Body Mass Index of participants who smoke and those who do not smoke? . To answer this I used the following variables
- smoking
- BMI
- Test the relevant assumptions of the Student’s t-test, clearly outline the hypotheses and make decisions on whether these have been met or not?
Checking for assumptions
Equality of Variance assumption
Equality of Variances assumptions
\[H_0: Variances~ of~ BMI ~across~ groups ~are ~equal\] \[H_1: Variances~ of ~BMI ~across~ groups ~is ~not ~equal\]
use ncube, clear
sdtest BMI , by(smoking)
Variance ratio test
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
no | 170 29.03318 .4483305 5.845514 28.14813 29.91823
yes | 30 25.63967 1.038795 5.689716 23.51509 27.76424
---------+--------------------------------------------------------------------
Combined | 200 28.52415 .4195955 5.933977 27.69673 29.35157
------------------------------------------------------------------------------
ratio = sd(no) / sd(yes) f = 1.0555
H0: ratio = 1 Degrees of freedom = 169, 29
Ha: ratio < 1 Ha: ratio != 1 Ha: ratio > 1
Pr(F < f) = 0.5482 2*Pr(F > f) = 0.9037 Pr(F > f) = 0.4518
Comment
Based on the following reasons:
- the ratio
F=1.056
shows no significant difference in the variance of the denominator and that of the numerator. - The p-value for the two-sided test is 0.9037, which is much greater than 0.05.
Therefore, we fail to reject the null hypothesis — there is no evidence that the standard deviations are different between the groups.
This suggests the variability (spread) of the outcome (likely BMI) is similar in both groups.
Normality test
- Normality assumption
\[H_0: BMI ~across~ groups ~is ~normally~ distributed\] \[H_1: BMI ~across~ groups ~is ~not ~normally~ distributed\] We fail to reject \(H_0\) if the p-value is above 0.05
use ncube
test
*/Shapiro–Wilk W
bysort smoking: swilk BMI
-> smoking = no
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 170 0.96032 5.142 3.737 0.00009
-------------------------------------------------------------------------------
-> smoking = yes
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 30 0.86586 4.264 2.998 0.00136
Comment
Based on the results from the above test , we notice that BMI is not normally distributed across the two categories because the p-values
for both groups is less than 5% hence we fail to reject the null hypothesis.
- Based on your decision from 3b), perform the independent Student’s ttest (regardless of results from b) using Stata. Clearly state your hypotheses, display the Stata results output and the final decision from doing the test.
Define the hypothesis
\(H_0\) : \(\mu_{No} = \mu_{Yes}\)
\(H_1\) : \(\mu_{No}\ne \mu_{Yes}\)
the means are given in the output tables
Define significance level
- determine at 5% level
use ncube, clear
ttest BMI , by(smoking)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
no | 170 29.03318 .4483305 5.845514 28.14813 29.91823
yes | 30 25.63967 1.038795 5.689716 23.51509 27.76424
---------+--------------------------------------------------------------------
Combined | 200 28.52415 .4195955 5.933977 27.69673 29.35157
---------+--------------------------------------------------------------------
diff | 3.39351 1.153117 1.119543 5.667477
------------------------------------------------------------------------------
diff = mean(no) - mean(yes) t = 2.9429
H0: diff = 0 Degrees of freedom = 198
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.9982 Pr(|T| > |t|) = 0.0036 Pr(T > t) = 0.0018
Conclusion
since \(Pr(|T| > |t|) = 0.0036\) which is below 5% we reject the null hypothesis and conclude that there is evidence to suggest that the mean BMI is significantly different from each group(smokers and non smokers).
- The test done in 3c) has a test statistic t-value, show how this was computed manually. Use the given summary statistics on the Stata output in the computation of the test statistics.
Manually calculating
for equal variances we use the formula in the table above:
\[t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\] where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 + (n_2-1)s^2_2}{n_1 + n_2 -2}\]
where
\(\overline{x_{1}}\) and \(\overline{x_{2}}\) are the means of the response variable y for group 1 and 2, respectively,
\(s_{1}^2\) and \(s_{2}^2\) are the variances of the response variable y for group 1 and 2, respectively,
\(n_{1}\) and \(n_{2}\) are the sample sizes of groups 1 and 2, respectively.
\[t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\] where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 +(n_2-1)s^2_2}{n_1 + n_2 -2}\]
\[s_p^2 = \frac{(170 -1)5.845514^2 + (30-1)5.689716^2}{170 +30 -2}\]
display "spooled=" ((170 -1)*5.845514^2 + (30-1)*5.689716^2)/(170 + 30 -2)
spooled=33.906813
\[t = \frac{(29.03318-25.63967)}{\sqrt{33.906813(\frac{1}{170}+\frac{1}{30})}}\]
display "T=" (29.03318-25.63967)/sqrt(33.906813*(1/170 +1/30))
T=2.9429016
- What conclusion are you deriving from the question you set out to answer?
There is significant mean difference in BMI between smokers and non-smokers
- Now consider redoing parts a) , c) , d and e) by considering the non-parametric alternative. Take note you have to compute the non-parametric Z-value.
- \(H_0\): Equal medians of BMI between the two groups of smokers and non-smokers
- \(H_1\): Unequal medians of BMI between the two groups of smokers and non-smokers
use ncube
ranksum BMI ,by(smoking)
Two-sample Wilcoxon rank-sum (Mann–Whitney) test
smoking | Obs Rank sum Expected
-------------+---------------------------------
no | 170 18027.5 17085
yes | 30 2072.5 3015
-------------+---------------------------------
Combined | 200 20100 20100
Unadjusted variance 85425.00
Adjustment for ties -1.22
----------
Adjusted variance 85423.78
H0: BMI(smoking==no) = BMI(smoking==yes)
z = 3.225
Prob > |z| = 0.0013
Exact prob = 0.0011
Conclusion
- \(Prob > |z| = 0.0013\) implying that we reject \(H_0\) and conclude that the median BMI for these two groups are significantly different.
Manually calculating Z
\[Z_w= \frac{W-\mu_w}{\sigma_w}\]
\[\mu_w=\frac{n_s(n_s+n_L+1)}{2}=30(30+170+1)/2\]
display "mu_w =" 30*(30+170+1)/2
mu_w =3015
\[\sigma_w =\sqrt{\frac{n_sn_L(n_s+n_L+1)}{12}}=\sqrt{\frac{30*170(30+170+1)}{12}}\]
display "sig_w=" sqrt(30*170*(30+170+1)/12)
sig_w=292.27555
\[Z_w=| \frac{2072.5 - 3015}{292.27555}|=3.2246967\]
display "z=" (2072.5 - 3015)/292.27555
z=-3.2246967
The figure below gives the visual aid on trying to get answers on determining whether each of the variables in the third variable box was either a confounder, effect modifier or neither of these on the relationship between exercise and diabetes?
- State the hypotheses, use stata to do the relevant statistical tests after performing the relevant assumptions testing to answer the question “Is there an association between the exposure exercise and outcome diabetes?”
use ncube
tab exercise diabetes , chi expected exact cchi2
| Key |
|--------------------|
| frequency |
| expected frequency |
| chi2 contribution |
+--------------------+
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 85 40 | 125
| 92.5 32.5 | 125.0
| 0.6 1.7 | 2.3
-----------+----------------------+----------
yes | 63 12 | 75
| 55.5 19.5 | 75.0
| 1.0 2.9 | 3.9
-----------+----------------------+----------
Total | 148 52 | 200
| 148.0 52.0 | 200.0
| 1.6 4.6 | 6.2
Pearson chi2(1) = 6.2370 Pr = 0.013
Fisher's exact = 0.013
1-sided Fisher's exact = 0.009
Define the hypothesis
- \(H_0\) : There is no association between diabetes and exercise
- \(H_1\) : There is an association between diabetes and exercise
Conclusion
since \(p=0.013<0.05\) , we reject \(H_0\) and conclude that at 5% level of significance , there is evidence that diabetes and exercise are significantly associated.
- Work out the test statistic on 4a) manually, showing all your steps?
The test statistic is calculated as follows \[\chi^2=\sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]
We reject the null hypothesis at 5% level of significance if \[\chi_2>\chi^{2}_{0.05}(r-1)(c-1)\]
Calculating expected frequencies
\[ E_{ij}=\frac{O_i\times O_{ij}}{n} \]
\[ E_{11}=\frac{125\times 148}{200}=92.5 \]
display "E11=" (125*148)/200
E11=92.5
\[ E_{12}=\frac{125\times 52}{200}=32.5 \]
display "E12=" (125*52)/200
E12=32.5
\[ E_{21}=\frac{75\times 148}{200}=55.5 \]
display "E21=" (75*148)/200
E21=55.5
\[ E_{22}=\frac{75\times 52}{200}=19.5 \]
display "E22=" (75*52)/200
E22=19.5
calculating \(\chi^2\)
\[\chi^2=\frac{(85-92.5)^2}{92.5}+\frac{(40-32.5)^2}{32.5}\\ +\frac{(63-55.5)^2}{55.5}+\frac{(52-19.5)^2}{19.5}\]
display "chi2 =" (((85-92.5)^2)/92.5)+(((40-32.5)^2)/32.5) +(((63-55.5)^2)/55.5)+(((12-19.5)^2)/19.5)
chi2 =6.2370062
\[\chi^2=6.2370062\]
The degrees of freedom is
display "dof=" (2 - 1) * (2 - 1)
dof=1
\[\chi^2_{0.05}(2-1)(2-1)=\chi^2_{0.05}(1)=3.84\]
- conclusion
Since \(\chi^{2}=6.2370062<3.84\) ,we reject \(H_0\) and conclude that there is association between diabetes and exercise.
Confounding
statins
use ncube
cc exercise diabetes, by(statins)
statin use | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .4275362 .1423493 1.15752 7.374046 (exact)
yes | .3641457 .0909963 1.273914 5.173913 (exact)
-----------------+-------------------------------------------------
Crude | .4047619 .179071 .8688314 (exact)
M–H combined | .4013983 .1943076 .8292038
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 0.05 Pr>chi2 = 0.8319
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 6.25
Pr>chi2 = 0.0124
Comment
statin use appears to be a confounder and not an effect modifier because of the following reasons:
- the table above shows that the ratios
0.427
and0.364
are not significantly different from each other - \(\chi^2(1) = 0.05\), p = 0.8319 → This test checks for effect modification (aka interaction) and a non-significant p-value (> 0.05) means no significant difference between the stratum-specific odds ratios (statin vs. no statin groups) hence no effect modification.
Manually computing M-H adjusted odds ratio
use ncube
bysort statins: tab exercise diabetes
-> statins = no
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 59 23 | 82
yes | 42 7 | 49
-----------+----------------------+----------
Total | 101 30 | 131
-------------------------------------------------------------------------------
-> statins = yes
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 26 17 | 43
yes | 21 5 | 26
-----------+----------------------+----------
Total | 47 22 | 69
\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{26*5}{69}+\frac{59*7}{131}}{(17*21)/69+(23*42)/131}=.40139833\]
display "OR_ADJUST=" ((26*5)/69+(59*7)/131)/((17*21)/69+(23*42)/131)
OR_ADJUST=.40139833
nonwhite
use ncube
cc exercise diabetes, by(nonwhite)
nonwhite race/et | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .3958333 .1610491 .9112598 10.78652 (exact)
yes | .75 .050961 8.546008 1.090909 (exact)
-----------------+-------------------------------------------------
Crude | .4047619 .179071 .8688314 (exact)
M–H combined | .4283626 .2064359 .888869
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 0.33 Pr>chi2 = 0.5644
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 5.30
Pr>chi2 = 0.0214
Comment
- \(\chi^2(1) = 0.33, p = 0.5644\)
- This p-value is not significant implying that there is no statistical evidence of effect modification (i.e., the odds ratios across race/ethnicity strata are not significantly different) hence the variable is not an effect modifier.
Manually computing M-H adjusted odds ratio
use ncube
bysort nonwhite: tab exercise diabetes
-> nonwhite = no
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 76 32 | 108
yes | 60 10 | 70
-----------+----------------------+----------
Total | 136 42 | 178
-------------------------------------------------------------------------------
-> nonwhite = yes
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 9 8 | 17
yes | 3 2 | 5
-----------+----------------------+----------
Total | 12 10 | 22
\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{9*2}{22}+\frac{76*10}{178}}{(8*3)/22+(32*60)/178}=.42836257\]
display "OR_ADJUST=" ((9*2)/22+(76*10)/178)/((8*3)/22+(32*60)/178)
OR_ADJUST=.42836257
poorfair
use ncube
cc exercise diabetes, by(poorfair)
poor/fair self-r | Odds ratio [95% conf. interval] M–H weight
-----------------+-------------------------------------------------
no | .5414274 .2175484 1.29283 8.292517 (exact)
yes | .1470588 .0032038 1.244775 3.207547 (exact)
-----------------+-------------------------------------------------
Crude | .4047619 .179071 .8688314 (exact)
M–H combined | .4314319 .2058862 .9040596
-------------------------------------------------------------------
Test of homogeneity (M–H) chi2(1) = 1.27 Pr>chi2 = 0.2598
Test that combined odds ratio = 1:
Mantel–Haenszel chi2(1) = 5.05
Pr>chi2 = 0.0246
Comment
- \(\chi^2(1) = 1.27, p = 0.2598\) not statistically significant hence no evidence of effect modification
- This tells us the odds ratios across strata are not significantly different.
Manually computing M-H adjusted odds ratio
use ncube
bysort poorfair: tab exercise diabetes
-> poorfair = no
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 60 23 | 83
yes | 53 11 | 64
-----------+----------------------+----------
Total | 113 34 | 147
-------------------------------------------------------------------------------
-> poorfair = yes
exercise |
at least 3 |
times per | diabetes
week | no yes | Total
-----------+----------------------+----------
no | 25 17 | 42
yes | 10 1 | 11
-----------+----------------------+----------
Total | 35 18 | 53
\[\LARGE OR_{adjusted} = \frac{\sum{\frac{a_{i}d_{i}}{n_{i}}}}{\sum{\frac{c_{i}b_{i}}{n_{i}}}}\] \[\LARGE OR_{adjusted}=\frac{\frac{25*1}{53}+\frac{60*11}{147}}{(17*10)/53+(53*23)/147}=.43143186\]
display "OR_ADJUST=" ((25*1)/53+(60*11)/147)/((17*10)/53+(53*23)/147)
OR_ADJUST=.43143186
Formulate a question that you would use oneway analyses of variance test (ANOVA) using the given HERS data. Below is a guide to assist you to answer this question
- State the research question and highlight the two variables you will use to answer this question?
- Are the mean BMI across people with different responses on comparative physical activity significantly different?
- to answer this question I used two variables namely
- Comparative physical activity (physact)
- Body mass Index (BMI)
- Test the relevant assumptions of the oneway ANOVA test, clearly outline the hypotheses and make decisions on whether these have been met or not?
- Normality assumption
\[H_0: BMI ~across~ groups ~is ~normally~ distributed\] \[H_1: BMI ~across~ groups ~is ~not ~normally~ distributed\] We fail to reject \(H_0\) if the p-value is above 0.05
use ncube
test
*/Shapiro–Wilk W
bysort physact: swilk BMI
-> physact = much less active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 15 0.93253 1.308 0.531 0.29761
-------------------------------------------------------------------------------
-> physact = somewhat less active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 41 0.96054 1.590 0.977 0.16425
-------------------------------------------------------------------------------
-> physact = about as active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 67 0.95711 2.548 2.029 0.02123
-------------------------------------------------------------------------------
-> physact = somewhat more active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 56 0.97306 1.386 0.700 0.24182
-------------------------------------------------------------------------------
-> physact = much more active
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
BMI | 21 0.89416 2.594 1.927 0.02700
Conclusion
The data for the following groups was found to be non-normal:
- -> physact = much more active
- -> physact = about as active
Equality of Variances assumptions
\[H_0: Variances~ of~ BMI ~across~ groups ~are ~equal\] \[H_1: Variances~ of ~BMI ~across~ groups ~is ~not ~equal\]
use ncube
*Levenes and Brown-Forsythe tests*
graph box BMI, over(physact) ylabel(50(10)80, labsize(small)) title("BMI count by Physical Activity")
quietly graph export box.svg, replace
robvar BMI, by(physact)
comparative |
physical | Summary of BMI (kg/m^2)
activity | Mean Std. dev. Freq.
------------+------------------------------------
much less | 30.578667 9.2333355 15
somewhat | 30.389268 6.4486589 41
about as | 28.506418 5.6301614 67
somewhat | 27.406429 4.6921745 56
much more | 26.452381 4.9048524 21
------------+------------------------------------
Total | 28.52415 5.9339772 200
W0 = 5.6070266 df(4, 195) Pr > F = 0.00027148
W50 = 4.3101376 df(4, 195) Pr > F = 0.00230662
W10 = 5.4014192 df(4, 195) Pr > F = 0.00038104
use ncube
if the equal variance
*/ssc install robnova
robnova BMI physact
checking robnova consistency and verifying not already installed...
installing into C:\Users\r1953\ado\plus\...
installation complete.
Outcome variable was BMI and predictor variable was physact
Sum of Squares Model = 366.0596
Sum of Squares Residual = 6641.1453
Sum of Squares Total = 7007.2049
R-squared = 0.052240
---------------------------------------------------------
Test | F df1 df2 p
-----------------+---------------------------------------
Brown-Forsythe's | 2.1484 4 57.3707 0.0864
Fisher's | 2.6871 4 195.0000 0.0326
Welch's | 2.4583 4 58.0620 0.0554
---------------------------------------------------------
Total number of observations used was 200.
Conclusion
Bartlett’s equal-variances test: \(\chi^2(4) = 14.6343 , Prob>chi2 = 0.006\) suggests that we reject the null hypothesis and conclude that there is significant evidence that at least one group has a different variance. So, the assumption of homogeneity of variance is violated.
Fitting the One way analysis of Variance
use ncube
/////fit oneway anova/////
*
data by wheat varieties
*/obtain additional information about the production oneway BMI physact,tabulate
comparative |
physical | Summary of BMI (kg/m^2)
activity | Mean Std. dev. Freq.
------------+------------------------------------
much less | 30.578667 9.2333355 15
somewhat | 30.389268 6.4486589 41
about as | 28.506418 5.6301614 67
somewhat | 27.406429 4.6921745 56
much more | 26.452381 4.9048524 21
------------+------------------------------------
Total | 28.52415 5.9339772 200
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 366.059574 4 91.5148934 2.69 0.0326
Within groups 6641.14534 195 34.0571556
------------------------------------------------------------------------
Total 7007.20492 199 35.212085
Bartlett's equal-variances test: chi2(4) = 14.6343 Prob>chi2 = 0.006
Conclusion
Since \(Prob>F =0.0326 < 0.05\) we reject the null hypothesis and conclude that there is evidence to suggest that the mean BMI for participants in each group significantly differs from the others.
- The test done in 5b) has an F test statistic, show how this can be computed manually. Use the given summary statistics on the Stata output in the computation of the test statistics after using the option tabulate on your oneway test.
Manual calculation
\[MS_{Total} = \frac{SSE}{N-1}=\frac{\sum (X_{ij}-\overline{X}_j)}{N-1}=\sigma^2\] \[MS_{Total} = \frac{\sum (X_{ij}-\overline{X}_j)}{N-1}=\frac{7007.20492 }{200-1}=5.9339772^2=35.212085\] \[SS_{Between} =\sum_{k=1}^K n_k(\bar{x}_k -\bar{x}_{grand})^2\]
\[SS_{Between} = 15( 30.578667 -28.52415)^2+ 41( 30.389268 -28.52415)^2+67( 28.506418 -28.52415)^2+ \\\ 56( 27.406429 -28.52415)^2+21( 26.452381 -28.52415)^2\]
display "SSB=" 15*( 30.578667 -28.52415)^2+ 41*( 30.389268 -28.52415)^2+67*( 28.506418 -28.52415)^2+56*( 27.406429 -28.52415)^2+21*( 26.452381 -28.52415)^2
SSB=366.05951
\[MS_{between} = \frac{SS_{between}}{K-1}=\frac{366.05951}{5-1}=91.514877\]
display "MSB=" 366.05951/4
MSB=91.514877
\[SS_{within} =\sum^{K}_{k=1}\sum^{n_k}_{i=1}(x_{i,k} -\bar{x}_k)\] \[SS_{within} =(15-1)*9.23333556^2+(41-1)*6.4486589^2+(67-1)*5.6301614^2+\\\ (56-1)*4.6921745^2+(21-1)*4.9048524^2\]
display "SSW=" (15-1)*9.23333556^2+(41-1)*6.4486589^2+(67-1)*5.6301614^2+(56-1)*4.6921745^2+(21-1)*4.9048524^2
SSW=6641.1453
\[MS_{within} = \frac{SS_{within}}{N-K}=\frac{6641.1453}{200-5}=34.057155\]
display "MSW=" 6641.1453/195
MSW=34.057155
\[F = \frac{MS_{Between}}{MS_{within}}=91.514877/34.057155=2.6870969\]
display "F=" 91.514877/34.057155
F=2.6870969
ANOVA Table
Source of Variation | SS | df | MS |
---|---|---|---|
Between Treatments | \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2\) | a-1 | SSTR/(a-1) |
Error (within treatments) | \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2\) | N-a | SSE/(N-a) |
Total (corrected) | \[\sum(Y_{ij} - \bar{Y}_{..})^2\] | N-1 |
Source of Variation | SS | df | MS |
---|---|---|---|
Between Treatments | \(\sum_{i}n_i (\bar{Y_{i.}}-\bar{Y_{..}})^2=366.05951\) | 5-1 | 366.05951/(4) |
Error (within treatments) | \(\sum_{i}\sum_{j}(Y_{ij}-\bar{Y_{i.}})^2=6641.1453\) | 200-5 | 6641.1453/(195) |
Total (corrected) | \[\sum(Y_{ij} - \bar{Y}_{..})^2=7007.20492\] | 200-1 |
- If the assumptions are not fully met in 5b) what can you consider doing?
If the assumptions were not met I would consider using an non-parametric test such as the Kruskal Wallis test.
- Based on your decision from 5d), clearly state your hypotheses, run the alternative test in stata, display the Stata results output and the final decision and conclusion from doing the test to show exactly where the differences (if any) were on?
- \[H_0: \theta_1 =\theta_2=\theta_3(medians)\]
- \[H_1: \theta_1 \neq\theta_2\neq \theta_3 (medians)\]
use ncube
kwallis BMI ,by(physact)
Kruskal–Wallis equality-of-populations rank test
+---------------------------------------+
| physact | Obs | Rank sum |
|----------------------+-----+----------|
| much less active | 15 | 1644.00 |
| somewhat less active | 41 | 4821.00 |
| about as active | 67 | 6773.50 |
| somewhat more active | 56 | 5194.00 |
| much more active | 21 | 1667.50 |
+---------------------------------------+
chi2(4) = 7.744
Prob = 0.1014
chi2(4) with ties = 7.744
Prob = 0.1014
Since \(p = 0.1014 > 0.05\), we fail to reject the null hypothesis. That means: There’s no statistically significant difference in the distribution (or medians) of the BMI across the five physical activity levels.