Statistical Inference (Infering Means)

Tutorial in Stata

Bongani Ncube

Data Snapshot

malaria dataset

scrno	todaydate	ethnic	othtribe	dob	ageyrs	agemths	arrival	paying	farepaid	whensick	vomited	novomit	fever	headache	weakness	dizziness	abdopain	appetloss	nausea	diarrhoea	cough	rash	othersymp	weight	temp	bloodfilm	filtertaken	pcv	falcipasex	falcipasexfields	falcipgamet	falcipgametfields	resprate	heartrate	spleenhack	enrolled	id	agei	falcsex	lnfalsx	ethnicn	payn	vomitn	fevern	headn	weakn	dizzn	abdpn	appn	nausn	diarrn	coughn	rashn	osymn	enrolledn	age	sex	agecat	pcvnew
168	2003-09-30	Mandinka		2000-07-16	NA	NA	Taxi	Yes	4	2	No	NA	No	No	No	No	No	No	No	No	No	No	No	13.0	37.7	Yes	Yes	22	112	1	NA	NA	NA	NA	NA	No	NA	3	112	4.718499	Mandinka	Yes	No	No	No	No	No	No	No	No	No	No	No	No	Not enrolled	3	female	4	22
157	2003-09-29			NA	NA	NA	Taxi	No	4	2	Yes	2	No	No	No	No	No	No	No	No	No	No	No	NA	39.0	Yes	Yes	31	520	1	NA	NA	NA	NA	NA	No	NA	NA	520	6.253829		No	Yes	No	No	No	No	No	No	No	No	No	No	No	Not enrolled	NA	male	NA	31
152	2003-09-29	Mandinka		2002-09-24	NA	NA	Walking	No	NA	1	Yes	1	No	No	No	No	No	No	No	No	No	No	No	8.5	40.6	Yes	Yes	34	142	1	NA	NA	36	140	NA	No	NA	1	142	4.955827	Mandinka	No	Yes	No	No	No	No	No	No	No	No	No	No	No	Not enrolled	1	male	2	34
48	2003-09-18	Mandinka		NA	6	NA	Taxi	Yes	4	3	No	NA	No	No	No	No	No	No	No	No	Yes	No	No	20.0	39.2	Yes	Yes	39	240	1	0	50	24	120	NA	Yes	212	NA	240	5.480639	Mandinka	Yes	No	No	No	No	No	No	No	No	No	Yes	No	No	Enrolled	6	female	6	39
226	2003-10-03	Mandinka		2002-08-10	NA	NA	Walking	No	NA	1	No	NA	No	No	No	No	No	No	No	No	No	No	No	7.0	38.8	Yes	Yes	29	124	1	NA	NA	36	142	NA	No	NA	1	124	4.820282	Mandinka	No	No	No	No	No	No	No	No	No	No	No	No	No	Not enrolled	1	male	2	29
146	2003-09-29	Other	sereh	NA	1	3		No	NA	1	Yes	2	No	No	No	No	No	No	No	No	No	No	No	8.0	37.2	Yes	Yes	33	64	30	NA	NA	30	152	NA	No	NA	1	64	4.158883	Other	No	Yes	No	No	No	No	No	No	No	No	No	No	No	Not enrolled	1	male	2	33

[1] 1

Data Summary

use malaria_treatful
generate pcvnew2 = pcv
replace pcvnew2 = . if pcvnew < 10 | pcvnew > 50
label variable pcv "Packed Cell Volume %, or Haematocrit"
summarize pcvnew
summarize pcv

(3 real changes made, 3 to missing)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
      pcvnew |        667     29.8081    5.368051         14         45

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         pcv |        670    29.81493    5.970912          2         87

Data Summary (Before removing outliers)

use malaria_treatful
histogram pcv, normal
quietly graph export histoga.svg, replace

(bin=25, start=2, width=3.4)

Data Summary (After removing outliers)

use malaria_treatful
histogram pcvnew, normal
quietly graph export histogb.svg, replace

(bin=25, start=14, width=1.24)

Summary stats

\[ \bar{X} \pm (t_{\alpha/2;n-1}) \frac{s}{\sqrt{n}} \] and \(\frac{s}{\sqrt{n}}\) is the standard error of \(\bar{y}\)

use malaria_treatful

mean pcvnew

Mean estimation                            Number of obs = 667

--------------------------------------------------------------
             |       Mean   Std. err.     [95% conf. interval]
-------------+------------------------------------------------
      pcvnew |    29.8081   .2078518      29.39997    30.21622
--------------------------------------------------------------

T.test (Mean = 35)

use malaria_treatful, clear

ttest pcvnew ==35

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
  pcvnew |     667     29.8081    .2078518    5.368051    29.39997    30.21622
------------------------------------------------------------------------------
    mean = mean(pcvnew)                                           t = -24.9789
H0: mean = 35                                    Degrees of freedom =      666

    Ha: mean < 35               Ha: mean != 35                 Ha: mean > 35
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

Manual working

\(H_0 : \mu=35\) \(H_1 : \mu \neq 35\)

\[T=\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\]

\[T=\frac{29.8081-35}{\frac{5.368051}{\sqrt{667}}}\]

(29.8081-35)/(5.368051/sqrt(667))

[1] -24.97886

\[T= -24.97886\]

since it is a two tailed test we divide percentage by 2

\[t_{667-1}(\frac{5\%}{2})=t_{666}(0.025)=2.365\]

Conclusion

We conclude that the mean is statistically significantly different from 35.

Comparing 2 independent means

Testing hypothesis (Infering Mean)

For Difference of Means (\(\mu_1-\mu_2\)), Independent Samples

	\(100(1-\alpha)%\) Confidence Interval	Hypothesis Testing Test Statistic
When \(\sigma^2\) is known	\(\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\)	\(z= \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}}\)
When \(\sigma^2\) is unknown, Variances Assumed EQUAL	\(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}\)	\(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\)	Pooled Variance: \(s_p^2 = \frac{(n_1 -1)s^2_1 - (n_2-1)s^2_2}{n_1 + n_2 -2}\) Degrees of Freedom: \(\gamma = n_1 + n_2 -2\)
When \(\sigma^2\) is unknown, Variances Assumed UNEQUAL	\(\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2})}\)	\(t = \frac{(\bar{X}_1-\bar{X}_2)-(\mu_1-\mu_2)_0}{\sqrt{(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2})}}\)	Degrees of Freedom: \(\gamma = \frac{(\frac{s_1^2}{n_1}+\frac{s^2_2}{n_2})^2}{\frac{(\frac{s_1^2}{n_1})^2}{n_1-1}+\frac{(\frac{s_2^2}{n_2})^2}{n_2-1}}\)

Gender

use malaria_treatful, clear

ttest pcvnew ,by(sex)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
  female |     316    29.93354    .3039424    5.402999    29.33553    30.53156
    male |     347    29.71758    .2877092    5.359429     29.1517    30.28346
---------+--------------------------------------------------------------------
Combined |     663    29.82051    .2088353    5.377254    29.41045    30.23057
---------+--------------------------------------------------------------------
    diff |            .2159651    .4183597                -.605509    1.037439
------------------------------------------------------------------------------
    diff = mean(female) - mean(male)                              t =   0.5162
H0: diff = 0                                     Degrees of freedom =      661

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.6971         Pr(|T| > |t|) = 0.6059          Pr(T > t) = 0.3029

Manually calculating

Define the hypothesis

\(H_0\) : \(\mu_{male} = \mu_{female}\)
the means are given in the output tables

Define significance level

determine at 5% level

\[t = \frac{(\bar{X}_1-\bar{X}_2)+(\mu_1-\mu_2)_0}{\sqrt{s^2_p(\frac{1}{n_1}+\frac{1}{n_2})}}\]

where the Pooled Variance: \[s_p^2 = \frac{(n_1 -1)s^2_1 - (n_2-1)s^2_2}{n_1 + n_2 -2}\]

\[s_p^2 = \frac{(316 -1)5.402999^2 + (347-1)5.359429^2}{316 + 347 -2}\]

((316 -1)*5.402999^2  + (347-1)*5.359429^2)/(316 + 347 -2)

[1] 28.94694

\[t = \frac{(29.93354-29.71758)}{\sqrt{28.94694(\frac{1}{316}+\frac{1}{347})}}\]

(29.93354-29.71758)/(sqrt(28.94694*(1/316+1/347)))

[1] 0.5162065

by Fever

use malaria_treatful, clear

ttest pcvnew ,by(fevern)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
      No |     490    29.75714    .2443306    5.408488    29.27708    30.23721
     Yes |     177    29.94915     .395895    5.267041    29.16784    30.73046
---------+--------------------------------------------------------------------
Combined |     667     29.8081    .2078518    5.368051    29.39997    30.21622
---------+--------------------------------------------------------------------
    diff |           -.1920097    .4710499               -1.116934    .7329145
------------------------------------------------------------------------------
    diff = mean(No) - mean(Yes)                                   t =  -0.4076
H0: diff = 0                                     Degrees of freedom =      665

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3418         Pr(|T| > |t|) = 0.6837          Pr(T > t) = 0.6582

Malaria

use malaria_treatful, clear

ttest pcvnew ,by(malaria)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |     613    29.96248    .2171008    5.375163    29.53613    30.38883
       1 |      54    28.05556    .6813974    5.007227    26.68885    29.42227
---------+--------------------------------------------------------------------
Combined |     667     29.8081    .2078518    5.368051    29.39997    30.21622
---------+--------------------------------------------------------------------
    diff |            1.906924    .7589744                .4166492    3.397199
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   2.5125
H0: diff = 0                                     Degrees of freedom =      665

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9939         Pr(|T| > |t|) = 0.0122          Pr(T > t) = 0.0061

Assumptions

use misorhrm
desc

Contains data from misorhrm.dta
 Observations:           281                  
    Variables:            31                  12 Feb 2019 09:52
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
trialno         int     %8.0g                 
treat           byte    %9.0g                 
date1           int     %d                    
age             byte    %8.0g                 
race            byte    %8.0g                 
parity          byte    %8.0g                 
gravid          byte    %8.0g                 
prevtop         byte    %8.0g                 
gawks           byte    %8.0g                 
gadays          byte    %8.0g                 
date2           int     %d                    
bleedpre        byte    %8.0g                 
poppre          byte    %8.0g                 
painpre         byte    %8.0g                 
nauspre         byte    %8.0g                 
nonepre         byte    %8.0g                 
othpre          byte    %8.0g                 
iableed         byte    %8.0g                 
iaprocon        byte    %8.0g                 
hegasize        byte    %9.0g                 
evacmin         byte    %8.0g                 
evacsec         byte    %8.0g                 
topease         byte    %8.0g                 
topimpos        byte    %8.0g                 
operata         byte    %8.0g                 
operano         byte    %8.0g                 
topain          byte    %8.0g                 
topasev         byte    %8.0g                 
evactim         int     %9.0g                 
gestage         byte    %9.0g                 
complic         byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: age

(a) T test

use misorhrm
ttest hegasize, by(treat)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     135    7.614815     .182115    2.115985    7.254623    7.975006
       2 |     141    6.007092    .1508156    1.790836    5.708922    6.305263
---------+--------------------------------------------------------------------
Combined |     276    6.793478    .1271574    2.112498    6.543153    7.043804
---------+--------------------------------------------------------------------
    diff |            1.607723    .2356042                1.143898    2.071547
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   6.8238
H0: diff = 0                                     Degrees of freedom =      274

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

(b) Equality of Variances

use misorhrm
sdtest hegasize, by(treat)

Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     135    7.614815     .182115    2.115985    7.254623    7.975006
       2 |     141    6.007092    .1508156    1.790836    5.708922    6.305263
---------+--------------------------------------------------------------------
Combined |     276    6.793478    .1271574    2.112498    6.543153    7.043804
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   1.3961
H0: ratio = 1                                    Degrees of freedom = 134, 140

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.9743         2*Pr(F > f) = 0.0514           Pr(F > f) = 0.0257

Normality (Histogram)

use misorhrm
histogram hegasize, by(treat) normal
quietly graph export histog3.svg, replace

Normality (QQ plot)

use misorhrm
qnorm hegasize if treat == 1 

quietly graph export qq1.svg, replace
qnorm hegasize if treat == 2
quietly graph export qq2.svg, replace

Shapiro wilk test

use misorhrm
bysort treat: swilk hegasize

-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
    hegasize |        135    0.94494      5.855     3.985    0.00003

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
    hegasize |        141    0.86640     14.744     6.080    0.00000

(a) T test (evactim)

use misorhrm
ttest evactim, by(treat)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127    218.3465    10.59285    119.3754    197.3835    239.3094
       2 |     126    320.5238    24.51093    275.1345    272.0136     369.034
---------+--------------------------------------------------------------------
Combined |     253    269.2332    13.67244    217.4735    242.3064      296.16
---------+--------------------------------------------------------------------
    diff |           -102.1774    26.62966               -154.6234    -49.7313
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -3.8370
H0: diff = 0                                     Degrees of freedom =      251

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0001         Pr(|T| > |t|) = 0.0002          Pr(T > t) = 0.9999

(b) Equality of Variances

use misorhrm
sdtest evactim, by(treat)

Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127    218.3465    10.59285    119.3754    197.3835    239.3094
       2 |     126    320.5238    24.51093    275.1345    272.0136     369.034
---------+--------------------------------------------------------------------
Combined |     253    269.2332    13.67244    217.4735    242.3064      296.16
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   0.1883
H0: ratio = 1                                    Degrees of freedom = 126, 125

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.0000         2*Pr(F < f) = 0.0000           Pr(F > f) = 1.0000

Normality (Histogram)

use misorhrm
histogram evactim, by(treat) normal
quietly graph export histog3.svg, replace

Normality (QQ plot)

use misorhrm
qnorm evactim if treat == 1 

quietly graph export qq3.svg, replace
qnorm evactim if treat == 2
quietly graph export qq4.svg, replace

Shapiro wilk test

use misorhrm
bysort treat: swilk evactim

-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     evactim |        127    0.88795     11.313     5.452    0.00000

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     evactim |        126    0.59853     40.261     8.301    0.00000

Tranformation

qladder

use misorhrm
qladder evactim
quietly graph export qlad.png, replace

gladder

use misorhrm
gladder evactim
quietly graph export glad.png, replace

Results

ladder

use misorhrm
ladder evactim

Transformation         Formula              chi2(2)  Prob > chi2
----------------------------------------------------------------
Cubic                  evactim^3             329.53        0.000
Square                 evactim^2             291.95        0.000
Identity               evactim               190.23        0.000
Square root            sqrt(evactim)          99.27        0.000
Log                    log(evactim)           15.66        0.000
1/(Square root)        1/sqrt(evactim)        31.15        0.000
Inverse                1/evactim             106.31        0.000
1/Square               1/(evactim^2)         237.63        0.000
1/Cubic                1/(evactim^3)         312.91        0.000

(a) T test 3 (log evactim)

use misorhrm
gen logtime = log(evactim)
ttest logtime, by(treat)

(28 missing values generated)


Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127     5.24888    .0471544    .5314029    5.155563    5.342197
       2 |     126    5.577801    .0500822    .5621715    5.478683     5.67692
---------+--------------------------------------------------------------------
Combined |     253    5.412691    .0358473    .5701868    5.342092    5.483289
---------+--------------------------------------------------------------------
    diff |           -.3289213    .0687725               -.4643659   -.1934766
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =  -4.7827
H0: diff = 0                                     Degrees of freedom =      251

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

(b) Equality of Variances

use misorhrm
gen logtime = log(evactim)

sdtest logtime, by(treat)

(28 missing values generated)


Variance ratio test
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       1 |     127     5.24888    .0471544    .5314029    5.155563    5.342197
       2 |     126    5.577801    .0500822    .5621715    5.478683     5.67692
---------+--------------------------------------------------------------------
Combined |     253    5.412691    .0358473    .5701868    5.342092    5.483289
------------------------------------------------------------------------------
    ratio = sd(1) / sd(2)                                         f =   0.8935
H0: ratio = 1                                    Degrees of freedom = 126, 125

    Ha: ratio < 1               Ha: ratio != 1                 Ha: ratio > 1
  Pr(F < f) = 0.2645         2*Pr(F < f) = 0.5291           Pr(F > f) = 0.7355

Normality (Histogram)

use misorhrm
gen logtime = log(evactim)

histogram logtime, by(treat) normal
quietly graph export histog4.svg, replace

(28 missing values generated)

Normality (QQ plot)

use misorhrm
gen logtime = log(evactim)
qnorm logtime if treat == 1 
quietly graph export qq5.svg, replace
qnorm logtime if treat == 2
quietly graph export qq6.svg, replace

(28 missing values generated)

Shapiro wilk test

use misorhrm
gen logtime = log(evactim)
bysort treat: swilk logtime

(28 missing values generated)


-------------------------------------------------------------------------------
-> treat = 1

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     logtime |        127    0.98840      1.171     0.354    0.36165

-------------------------------------------------------------------------------
-> treat = 2

                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
     logtime |        126    0.94425      5.591     3.866    0.00006