Statistical Inference (Infering Means)

Tutorial in Stata

Bongani Ncube

Data Snapshot

malaria dataset

scrno todaydate ethnic othtribe dob ageyrs agemths arrival arrivaloth paying farepaid whensick vomited novomit fever headache weakness dizziness abdopain appetloss nausea diarrhoea cough rash othersymp othersymspec weight temp bloodfilm filtertaken pcv falcipasex falcipasexfields falcipgamet falcipgametfields resprate heartrate liverpalp spleenpalp spleenhack enrolled id headache1 agei falcsex lnfalsx ethnicn payn vomitn fevern headn weakn dizzn abdpn appn nausn diarrn coughn rashn osymn enrolledn age sex agecat malaria pcvnew
168 2003-09-30 Mandinka 2000-07-16 NA NA Taxi Yes 4 2 No NA No No No No No No No No No No No 13.0 37.7 Yes Yes 22 112 1 NA NA NA NA NA No NA 3 112 4.718499 Mandinka Yes No No No No No No No No No No No No Not enrolled 3 female 4 0 22
157 2003-09-29 NA NA NA Taxi No 4 2 Yes 2 No No No No No No No No No No No NA 39.0 Yes Yes 31 520 1 NA NA NA NA NA No NA NA 520 6.253829 No Yes No No No No No No No No No No No Not enrolled NA male NA 0 31
152 2003-09-29 Mandinka 2002-09-24 NA NA Walking No NA 1 Yes 1 No No No No No No No No No No No 8.5 40.6 Yes Yes 34 142 1 NA NA 36 140 NA No NA 1 142 4.955827 Mandinka No Yes No No No No No No No No No No No Not enrolled 1 male 2 0 34
48 2003-09-18 Mandinka NA 6 NA Taxi Yes 4 3 No NA No No No No No No No No Yes No No 20.0 39.2 Yes Yes 39 240 1 0 50 24 120 NA Yes 212 NA 240 5.480639 Mandinka Yes No No No No No No No No No Yes No No Enrolled 6 female 6 0 39
226 2003-10-03 Mandinka 2002-08-10 NA NA Walking No NA 1 No NA No No No No No No No No No No No 7.0 38.8 Yes Yes 29 124 1 NA NA 36 142 NA No NA 1 124 4.820282 Mandinka No No No No No No No No No No No No No Not enrolled 1 male 2 0 29
146 2003-09-29 Other sereh NA 1 3 No NA 1 Yes 2 No No No No No No No No No No No 8.0 37.2 Yes Yes 33 64 30 NA NA 30 152 NA No NA 1 64 4.158883 Other No Yes No No No No No No No No No No No Not enrolled 1 male 2 0 33

1

[1] 1

Data Summary

use malaria_treatful
generate pcvnew2 = pcv
replace pcvnew2 = . if pcvnew < 10 | pcvnew > 50
label variable pcv "Packed Cell Volume %, or Haematocrit"
summarize pcvnew
summarize pcv
(3 real changes made, 3 to missing)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      pcvnew |        667     29.8081    5.368051         14         45

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         pcv |        670    29.81493    5.970912          2         87

Data Summary (Before removing outliers)

use malaria_treatful
histogram pcv, normal
quietly graph export histoga.svg, replace
(bin=25, start=2, width=3.4)
Stata Graph - Graph 0 .02 .04 .06 .08 Density 0 20 40 60 80 pcv

Data Summary (After removing outliers)

use malaria_treatful
histogram pcvnew, normal
quietly graph export histogb.svg, replace
(bin=25, start=14, width=1.24)
Stata Graph - Graph 0 .02 .04 .06 .08 .1 Density 10 20 30 40 50 pcvnew

Summary stats

\[ \bar{X} \pm (t_{\alpha/2;n-1}) \frac{s}{\sqrt{n}} \] and \(\frac{s}{\sqrt{n}}\) is the standard error of \(\bar{y}\)

use malaria_treatful

mean pcvnew
Mean estimation                            Number of obs = 667

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
      pcvnew |    29.8081   .2078518      29.39997    30.21622
--------------------------------------------------------------

T.test (Mean = 35)

use malaria_treatful, clear

ttest pcvnew ==35
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
  pcvnew |     667     29.8081    .2078518    5.368051    29.39997    30.21622
------------------------------------------------------------------------------
    mean = mean(pcvnew)                                           t = -24.9789
H0: mean = 35                                    Degrees of freedom =      666

    Ha: mean < 35               Ha: mean != 35                 Ha: mean > 35
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Manual working

\(H_0 : \mu=35\) \(H_1 : \mu \neq 35\)

\[T=\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\]

\[T=\frac{29.8081-35}{\frac{5.368051}{\sqrt{667}}}\]

(29.8081-35)/(5.368051/sqrt(667))
[1] -24.97886

\[T= -24.97886\]

since it is a two tailed test we divide percentage by 2

\[t_{667-1}(\frac{5\%}{2})=t_{666}(0.025)=2.365\]

Conclusion

  • We conclude that the mean is statistically significantly different from 35.

Comparing 2 independent means

Testing hypothesis (Infering Mean)

For Difference of Means (\(\mu_1-\mu_2\)), Independent Samples

\(100(1-\alpha)%\) Confidence Interval Hypothesis Testing
Test Statistic
When \(\sigma^2\) is known \(\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\) \(z= \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\)
When \(\sigma^2\) is unknown, Variances Assumed EQUAL \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}\) \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\) Pooled Variance: \(s_p^2 = \frac{(n_1 -1)s^2_1 - (n_2-1)s^2_2}{n_1 + n_2 -2}\)
Degrees of Freedom: \(\gamma = n_1 + n_2 -2\)
When \(\sigma^2\) is unknown, Variances Assumed UNEQUAL \(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2})}\) \(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2})}}\) Degrees of Freedom: \(\gamma = \frac{(\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2})^2}{\frac{(\frac{s_1^2}{n_1})^2}{n_1-1}+\frac{(\frac{s_2^2}{n_2})^2}{n_2-1}}\)

Gender

use malaria_treatful, clear

ttest pcvnew ,by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
  female |     316    29.93354    .3039424    5.402999    29.33553    30.53156
    male |     347    29.71758    .2877092    5.359429     29.1517    30.28346
---------+--------------------------------------------------------------------
Combined |     663    29.82051    .2088353    5.377254    29.41045    30.23057
---------+--------------------------------------------------------------------
    diff |            .2159651    .4183597                -.605509    1.037439
------------------------------------------------------------------------------
    diff = mean(female) - mean(male)                              t =   0.5162
H0: diff = 0                                     Degrees of freedom =      661

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6971         Pr(|T| > |t|) = 0.6059          Pr(T > t) = 0.3029

Manually calculating

Define the hypothesis

  • \(H_0\) : \(\mu_{male} = \mu_{female}\)

  • the means are given in the output tables

Define significance level

  • determine at 5% level

\[t = \frac{(\bar{X}_1-\bar{X}_2)+(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\]

where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 - (n_2-1)s^2_2}{n_1 + n_2 -2}\]

\[s_p^2 = \frac{(316 -1)5.402999^2 + (347-1)5.359429^2}{316 + 347 -2}\]

((316 -1)*5.402999^2  + (347-1)*5.359429^2)/(316 + 347 -2)
[1] 28.94694

\[t = \frac{(29.93354-29.71758)}{\sqrt{28.94694(\frac{1}{316}+\frac{1}{347})}}\]

(29.93354-29.71758)/(sqrt(28.94694*(1/316+1/347)))
[1] 0.5162065

by Fever

use malaria_treatful, clear

ttest pcvnew ,by(fevern)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      No |     490    29.75714    .2443306    5.408488    29.27708    30.23721
     Yes |     177    29.94915     .395895    5.267041    29.16784    30.73046
---------+--------------------------------------------------------------------
Combined |     667     29.8081    .2078518    5.368051    29.39997    30.21622
---------+--------------------------------------------------------------------
    diff |           -.1920097    .4710499               -1.116934    .7329145
------------------------------------------------------------------------------
    diff = mean(No) - mean(Yes)                                   t =  -0.4076
H0: diff = 0                                     Degrees of freedom =      665

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3418         Pr(|T| > |t|) = 0.6837          Pr(T > t) = 0.6582

Malaria

use malaria_treatful, clear

ttest pcvnew ,by(malaria)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |     613    29.96248    .2171008    5.375163    29.53613    30.38883
       1 |      54    28.05556    .6813974    5.007227    26.68885    29.42227
---------+--------------------------------------------------------------------
Combined |     667     29.8081    .2078518    5.368051    29.39997    30.21622
---------+--------------------------------------------------------------------
    diff |            1.906924    .7589744                .4166492    3.397199
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.5125
H0: diff = 0                                     Degrees of freedom =      665

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9939         Pr(|T| > |t|) = 0.0122          Pr(T > t) = 0.0061

Assumptions

use misorhrm
desc
Contains data from misorhrm.dta
 Observations:           281                  
    Variables:            31                  12 Feb 2019 09:52
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
trialno         int     %8.0g                 
treat           byte    %9.0g                 
date1           int     %d                    
age             byte    %8.0g                 
race            byte    %8.0g                 
parity          byte    %8.0g                 
gravid          byte    %8.0g                 
prevtop         byte    %8.0g                 
gawks           byte    %8.0g                 
gadays          byte    %8.0g                 
date2           int     %d                    
bleedpre        byte    %8.0g                 
poppre          byte    %8.0g                 
painpre         byte    %8.0g                 
nauspre         byte    %8.0g                 
nonepre         byte    %8.0g                 
othpre          byte    %8.0g                 
iableed         byte    %8.0g                 
iaprocon        byte    %8.0g                 
hegasize        byte    %9.0g                 
evacmin         byte    %8.0g                 
evacsec         byte    %8.0g                 
topease         byte    %8.0g                 
topimpos        byte    %8.0g                 
operata         byte    %8.0g                 
operano         byte    %8.0g                 
topain          byte    %8.0g                 
topasev         byte    %8.0g                 
evactim         int     %9.0g                 
gestage         byte    %9.0g                 
complic         byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: age

(a) T test

use misorhrm
ttest hegasize, by(treat)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     135    7.614815     .182115    2.115985    7.254623    7.975006
       2 |     141    6.007092    .1508156    1.790836    5.708922    6.305263
---------+--------------------------------------------------------------------
Combined |     276    6.793478    .1271574    2.112498    6.543153    7.043804
---------+--------------------------------------------------------------------
    diff |            1.607723    .2356042                1.143898    2.071547
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   6.8238
H0: diff = 0                                     Degrees of freedom =      274

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

(b) Equality of Variances

use misorhrm
sdtest hegasize, by(treat)
Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     135    7.614815     .182115    2.115985    7.254623    7.975006
       2 |     141    6.007092    .1508156    1.790836    5.708922    6.305263
---------+--------------------------------------------------------------------
Combined |     276    6.793478    .1271574    2.112498    6.543153    7.043804
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   1.3961
H0: ratio = 1                                    Degrees of freedom = 134, 140

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.9743         2*Pr(F > f) = 0.0514           Pr(F > f) = 0.0257

Normality (Histogram)

use misorhrm
histogram hegasize, by(treat) normal
quietly graph export histog3.svg, replace

Normality (QQ plot)

use misorhrm
qnorm hegasize if treat == 1 

quietly graph export qq1.svg, replace
qnorm hegasize if treat == 2
quietly graph export qq2.svg, replace

Stata Graph - Graph 0 5 10 15 hegasize 2 4 6 8 10 12 Inverse normal

Stata Graph - Graph 0 5 10 hegasize 2 4 6 8 10 Inverse normal

Shapiro wilk test

use misorhrm
bysort treat: swilk hegasize
-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
    hegasize |        135    0.94494      5.855     3.985    0.00003

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
    hegasize |        141    0.86640     14.744     6.080    0.00000

(a) T test (evactim)

use misorhrm
ttest evactim, by(treat)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127    218.3465    10.59285    119.3754    197.3835    239.3094
       2 |     126    320.5238    24.51093    275.1345    272.0136     369.034
---------+--------------------------------------------------------------------
Combined |     253    269.2332    13.67244    217.4735    242.3064      296.16
---------+--------------------------------------------------------------------
    diff |           -102.1774    26.62966               -154.6234    -49.7313
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -3.8370
H0: diff = 0                                     Degrees of freedom =      251

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0001         Pr(|T| > |t|) = 0.0002          Pr(T > t) = 0.9999

(b) Equality of Variances

use misorhrm
sdtest evactim, by(treat)
Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127    218.3465    10.59285    119.3754    197.3835    239.3094
       2 |     126    320.5238    24.51093    275.1345    272.0136     369.034
---------+--------------------------------------------------------------------
Combined |     253    269.2332    13.67244    217.4735    242.3064      296.16
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   0.1883
H0: ratio = 1                                    Degrees of freedom = 126, 125

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.0000         2*Pr(F < f) = 0.0000           Pr(F > f) = 1.0000

Normality (Histogram)

use misorhrm
histogram evactim, by(treat) normal
quietly graph export histog3.svg, replace
Stata Graph - Graph 0 .001 .002 .003 .004 0 500 1000 1500 2000 0 500 1000 1500 2000 1 2 Density normal evactim Density evactim Graphs by treat

Normality (QQ plot)

use misorhrm
qnorm evactim if treat == 1 

quietly graph export qq3.svg, replace
qnorm evactim if treat == 2
quietly graph export qq4.svg, replace

Stata Graph - Graph -200 0 200 400 600 evactim -200 0 200 400 600 Inverse normal

Stata Graph - Graph -500 0 500 1000 1500 2000 evactim -500 0 500 1000 Inverse normal

Shapiro wilk test

use misorhrm
bysort treat: swilk evactim
-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     evactim |        127    0.88795     11.313     5.452    0.00000

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     evactim |        126    0.59853     40.261     8.301    0.00000

Tranformation

qladder

use misorhrm
qladder evactim
quietly graph export qlad.png, replace

gladder

use misorhrm
gladder evactim
quietly graph export glad.png, replace

Results

ladder

use misorhrm
ladder evactim
Transformation         Formula              chi2(2)  Prob > chi2
----------------------------------------------------------------
Cubic                  evactim^3             329.53        0.000
Square                 evactim^2             291.95        0.000
Identity               evactim               190.23        0.000
Square root            sqrt(evactim)          99.27        0.000
Log                    log(evactim)           15.66        0.000
1/(Square root)        1/sqrt(evactim)        31.15        0.000
Inverse                1/evactim             106.31        0.000
1/Square               1/(evactim^2)         237.63        0.000
1/Cubic                1/(evactim^3)         312.91        0.000

(a) T test 3 (log evactim)

use misorhrm
gen logtime = log(evactim)
ttest logtime, by(treat)
(28 missing values generated)


Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127     5.24888    .0471544    .5314029    5.155563    5.342197
       2 |     126    5.577801    .0500822    .5621715    5.478683     5.67692
---------+--------------------------------------------------------------------
Combined |     253    5.412691    .0358473    .5701868    5.342092    5.483289
---------+--------------------------------------------------------------------
    diff |           -.3289213    .0687725               -.4643659   -.1934766
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -4.7827
H0: diff = 0                                     Degrees of freedom =      251

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

(b) Equality of Variances

use misorhrm
gen logtime = log(evactim)

sdtest logtime, by(treat)
(28 missing values generated)


Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127     5.24888    .0471544    .5314029    5.155563    5.342197
       2 |     126    5.577801    .0500822    .5621715    5.478683     5.67692
---------+--------------------------------------------------------------------
Combined |     253    5.412691    .0358473    .5701868    5.342092    5.483289
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   0.8935
H0: ratio = 1                                    Degrees of freedom = 126, 125

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.2645         2*Pr(F < f) = 0.5291           Pr(F > f) = 0.7355

Normality (Histogram)

use misorhrm
gen logtime = log(evactim)

histogram logtime, by(treat) normal
quietly graph export histog4.svg, replace
(28 missing values generated)
Stata Graph - Graph 0 .5 1 4 5 6 7 8 4 5 6 7 8 1 2 Density normal logtime Density logtime Graphs by treat

Normality (QQ plot)

use misorhrm
gen logtime = log(evactim)
qnorm logtime if treat == 1 
quietly graph export qq5.svg, replace
qnorm logtime if treat == 2
quietly graph export qq6.svg, replace
(28 missing values generated)

Stata Graph - Graph 3 4 5 6 7 logtime 4 4.5 5 5.5 6 6.5 Inverse normal

Stata Graph - Graph 4 5 6 7 8 logtime 4 5 6 7 Inverse normal

Shapiro wilk test

use misorhrm
gen logtime = log(evactim)
bysort treat: swilk logtime
(28 missing values generated)


-------------------------------------------------------------------------------
-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     logtime |        127    0.98840      1.171     0.354    0.36165

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     logtime |        126    0.94425      5.591     3.866    0.00006