Compare between two independent simple random samples

Use the data described in Example 4.10.

setwd("D:\\cycu\\111-1\\sampling\\R note\\ch4")
data.raw<-read.csv("Maine mercury data.csv")
head(data.raw)
##                NAME HG.PPM N  ELV   SA  Z LT ST  DA   RF  FR DAM
## 1           ALLEN.P  1.080 3  425   83 27  3  1   2 0.60 2.8   1
## 2       ALLIGATOR.P  0.025 2 1494   47 26  2  0   1 0.69 0.8   1
## 3  ANASAGUNTICOOK.L  0.570 5  402  568 54  2  1  15 0.56 1.1   0
## 4 BALCH&STUMP.PONDS  0.770 5  557  704 44  2  1  14 0.58 2.7   0
## 5      BASKAHEGAN.L  0.790 5  417 6944 22  2  0 123 0.57 2.0   1
## 6     BAUNEAG.BEG.L  0.750 4  205  200 29  2  1  18 0.51 9.6   0
dim(data.raw)
## [1] 120  12

Suppose there are 120 lakes in Maine (N=120). We select a simple random sample of size n=35.

N=120
n=35

set.seed(123)
data<-data.raw[sample(1:N, n),]
data
##                     NAME HG.PPM N  ELV   SA  Z LT ST  DA   RF   FR DAM
## 31               CROSS.L   0.39 5  578 2515 46  3  0 164 0.50  3.3   1
## 79   PASSAGASSAWAUKEAG.L   0.55 5  304  118 40  3  1   3 0.55  1.9   1
## 51               HICKS.P   0.90 5  683   93 18  3  0  10 0.59 18.9   0
## 14       BRANCH.L(SOUTH)   0.58 5  227 2035 28  2  0  12 0.51  0.6   0
## 67            LOVEWELL.P   0.45 5  357 1120 45  2  1   9 0.06  0.1   1
## 42         FISHER.P(BIG)   0.36 5 1150   60 11  2  0   1 0.56  2.9   1
## 50                 HAY.L   0.24 2  653  588 34  3  0   6 0.53  0.8   1
## 43              FLYING.P   0.35 4  345  360 80  3  1  15 0.51  1.7   0
## 101           SENNEBEC.P   0.41 3   87  532 57  3  1 106 0.60 14.2   0
## 117             WEBBER.P   0.18 4  118 1201 41  2  1  28 0.51  1.6   0
## 25            CHANDLER.L   0.25 5  824  401 19  2  0   5 0.52  1.1   1
## 90        RANGE.P(LOWER)   1.25 3  306  290 41  2  1  14 0.51  3.7   0
## 91       ROACH.P(SECOND)   0.22 5 1271  970 46  3  0  25 0.66  2.1   0
## 69          MEDDYBEMPS.L   0.32 5  170 6765 38  2  0  45 0.62  0.6   0
## 108             SYMMES.P   0.18 5  499   36 30  2  1   1 0.61  3.3   0
## 57               JERRY.P   0.62 5  717  272 13  3  0   4 0.56  2.7   1
## 92    ROBERTS&WADLEY.PDS   0.52 5  271  203 22  3  0   9 0.58 10.1   0
## 9            BEN.ANNIS.P   0.18 5  122   25  9  2  0  10 0.51 58.8   1
## 93               ROCKY.P   0.68 5  312  153 14  2  0   2 0.57  1.8   1
## 99    SANDY.RIVER.P(MID)   0.37 5 1700   70 58  3  1   4 0.56  3.8   1
## 72           NEQUASSET.P   0.37 3   17  392 63  3  1  21 0.58  2.3   0
## 26               CHASE.L   0.43 5  819  403 31  3  1  47 0.58  9.5   1
## 7               BEAVER.P   0.27 5  397  128  8  3  0   2 0.61  7.9   1
## 115             VARNUM.P   0.16 5  756  331 75  1  1   4 0.61  0.5   0
## 103        SHIN.P(LOWER)   0.47 5  778  638 25  3  0  23 0.56  4.2   1
## 83           PINE.P(BIG)   0.67 1 1097  164 33  3  1   5 0.51  3.3   1
## 36                DUCK.L   0.22 5  519 1222 88  1  1   6   NA   NA   1
## 78               OTTER.P   0.13 3 1633   14 18  2  0   0 0.71  1.8   1
## 81               PEASE.P   0.36 5  377  109 19  2  0   2 0.58  2.2   1
## 113            UMBAGOG.L   0.29 5 1245 7850 48  2  0 600 0.56  9.2   0
## 76     OSSIPEE.L(LITTLE)   0.77 3  311  564 74  2  1   6 0.61  0.8   0
## 15        BRANCH.P(EAST)   0.57 5  910   45  9  2  0   2   NA   NA   1
## 32      CRYSTAL(BEALS).P   0.41 5  328   47 39  3  1   1 0.54  1.1   1
## 98  SANDY.RIVER.P(LOWER)   0.10 3 1690   17 21  2  1   4 0.56 64.1   1
## 96               ROUND.P   0.57 1   34  250 34  2  1 116 0.61 43.7   1

HG.PPM: mercury content; LT: lake type

Out target variables are \((\mu_1,\mu_2, \mu_3)\), the mean value of the mercury content in the type 1, 2, and 3 lake.

The box plot of the simple random sample:

boxplot(HG.PPM~LT, data=data, xlab="Lake type", ylab="Hg(ppm)", main="Mercury content by lake type")

Specify a complex survey design.

library(survey) # load the package 
w <- rep(N/n,n)
fpc<-rep(N,n)
s.design <- svydesign(ids = ~1 ,weights = w ,fpc=fpc, data=data) #Simple random sampling (consider fpc)

Use the commend, svyby(formula, by, design, Fun), to compute survey statistics on subsets.
formula: the target variable
by: the variable which define the subsets
design: the survey design we specified
Fun: type of statistics we want to compute

If by=~LT and Fun=svymean, then the output will be the following table:

LT \(\bar{y}\) se
1 1 \(\bar{y}_1\) \(\sqrt{\hat{V}(\bar{y}_1)}\)
2 2 \(\bar{y}_2\) \(\sqrt{\hat{V}(\bar{y}_2)}\)
3 3 \(\bar{y}_3\) \(\sqrt{\hat{V}(\bar{y}_3)}\)
r.group<-svyby(~HG.PPM, by=~LT,s.design,FUN=svymean)
r.group
##   LT    HG.PPM         se
## 1  1 0.1900000 0.01811422
## 2  2 0.4247059 0.05884486
## 3  3 0.4493750 0.03597969

The outputs show us \(\hat{\mu_i}(=\bar{y}_i)\) and the estimate of the standard error of \(\hat{\mu_i}\) for each subsets.

Lake type (\(i\)) \(\bar{y}_i\) \(se_i=\sqrt{\hat{V}(\bar{y}_i)}\)
1 1 0.1900000 0.01811422
2 2 0.4247059 0.05884486
3 3 0.4493750 0.03597969

Let d be the difference in mean mercury levels for type 1 and type 2. \[ \hat{d}=\bar{y}_1-\bar{y}_2,\; se(\hat{d})=\sqrt{\hat{V}(\hat{y}_1)+\hat{V}(\hat{y}_2)} \]

(d.hat<-r.group[1,2]-r.group[2,2])
## [1] -0.2347059
(se.d.hat<-sqrt(r.group[1,3]^2+r.group[2,3]^2))
## [1] 0.06156982

Therefore, \(\hat{d}=-0.2347\), \(se(\hat{d})=0.0616\).

c(d.hat-2*se.d.hat,d.hat+2*se.d.hat)
## [1] -0.3578455 -0.1115662

The bound on the error of the estimation is \(2 \cdot se(\hat{d})\). The resulting interval estimate, (-0.3578,-0.1116), implies that the true difference in mean mercury content for these two types of lakes (\(d\)) could be anywhere between -0.3578 and -0.1116. Since the result interval does not covers zero, which implies that there is significant evidence that the mean mercury content for the type 1 lakes is smaller than the one for the type 2 lakes. The type 2 lakes could be anywhere between 0.1116 and 0.3578 ppm greater than type 1 lakes.