Use the data described in Example 4.10.
setwd("D:\\cycu\\111-1\\sampling\\R note\\ch4")
data.raw<-read.csv("Maine mercury data.csv")
head(data.raw)
## NAME HG.PPM N ELV SA Z LT ST DA RF FR DAM
## 1 ALLEN.P 1.080 3 425 83 27 3 1 2 0.60 2.8 1
## 2 ALLIGATOR.P 0.025 2 1494 47 26 2 0 1 0.69 0.8 1
## 3 ANASAGUNTICOOK.L 0.570 5 402 568 54 2 1 15 0.56 1.1 0
## 4 BALCH&STUMP.PONDS 0.770 5 557 704 44 2 1 14 0.58 2.7 0
## 5 BASKAHEGAN.L 0.790 5 417 6944 22 2 0 123 0.57 2.0 1
## 6 BAUNEAG.BEG.L 0.750 4 205 200 29 2 1 18 0.51 9.6 0
dim(data.raw)
## [1] 120 12
Suppose there are 120 lakes in Maine (N=120). We select a simple random sample of size n=35.
N=120
n=35
set.seed(123)
data<-data.raw[sample(1:N, n),]
data
## NAME HG.PPM N ELV SA Z LT ST DA RF FR DAM
## 31 CROSS.L 0.39 5 578 2515 46 3 0 164 0.50 3.3 1
## 79 PASSAGASSAWAUKEAG.L 0.55 5 304 118 40 3 1 3 0.55 1.9 1
## 51 HICKS.P 0.90 5 683 93 18 3 0 10 0.59 18.9 0
## 14 BRANCH.L(SOUTH) 0.58 5 227 2035 28 2 0 12 0.51 0.6 0
## 67 LOVEWELL.P 0.45 5 357 1120 45 2 1 9 0.06 0.1 1
## 42 FISHER.P(BIG) 0.36 5 1150 60 11 2 0 1 0.56 2.9 1
## 50 HAY.L 0.24 2 653 588 34 3 0 6 0.53 0.8 1
## 43 FLYING.P 0.35 4 345 360 80 3 1 15 0.51 1.7 0
## 101 SENNEBEC.P 0.41 3 87 532 57 3 1 106 0.60 14.2 0
## 117 WEBBER.P 0.18 4 118 1201 41 2 1 28 0.51 1.6 0
## 25 CHANDLER.L 0.25 5 824 401 19 2 0 5 0.52 1.1 1
## 90 RANGE.P(LOWER) 1.25 3 306 290 41 2 1 14 0.51 3.7 0
## 91 ROACH.P(SECOND) 0.22 5 1271 970 46 3 0 25 0.66 2.1 0
## 69 MEDDYBEMPS.L 0.32 5 170 6765 38 2 0 45 0.62 0.6 0
## 108 SYMMES.P 0.18 5 499 36 30 2 1 1 0.61 3.3 0
## 57 JERRY.P 0.62 5 717 272 13 3 0 4 0.56 2.7 1
## 92 ROBERTS&WADLEY.PDS 0.52 5 271 203 22 3 0 9 0.58 10.1 0
## 9 BEN.ANNIS.P 0.18 5 122 25 9 2 0 10 0.51 58.8 1
## 93 ROCKY.P 0.68 5 312 153 14 2 0 2 0.57 1.8 1
## 99 SANDY.RIVER.P(MID) 0.37 5 1700 70 58 3 1 4 0.56 3.8 1
## 72 NEQUASSET.P 0.37 3 17 392 63 3 1 21 0.58 2.3 0
## 26 CHASE.L 0.43 5 819 403 31 3 1 47 0.58 9.5 1
## 7 BEAVER.P 0.27 5 397 128 8 3 0 2 0.61 7.9 1
## 115 VARNUM.P 0.16 5 756 331 75 1 1 4 0.61 0.5 0
## 103 SHIN.P(LOWER) 0.47 5 778 638 25 3 0 23 0.56 4.2 1
## 83 PINE.P(BIG) 0.67 1 1097 164 33 3 1 5 0.51 3.3 1
## 36 DUCK.L 0.22 5 519 1222 88 1 1 6 NA NA 1
## 78 OTTER.P 0.13 3 1633 14 18 2 0 0 0.71 1.8 1
## 81 PEASE.P 0.36 5 377 109 19 2 0 2 0.58 2.2 1
## 113 UMBAGOG.L 0.29 5 1245 7850 48 2 0 600 0.56 9.2 0
## 76 OSSIPEE.L(LITTLE) 0.77 3 311 564 74 2 1 6 0.61 0.8 0
## 15 BRANCH.P(EAST) 0.57 5 910 45 9 2 0 2 NA NA 1
## 32 CRYSTAL(BEALS).P 0.41 5 328 47 39 3 1 1 0.54 1.1 1
## 98 SANDY.RIVER.P(LOWER) 0.10 3 1690 17 21 2 1 4 0.56 64.1 1
## 96 ROUND.P 0.57 1 34 250 34 2 1 116 0.61 43.7 1
HG.PPM: mercury content; LT: lake type
Out target variables are \((\mu_1,\mu_2, \mu_3)\), the mean value of the mercury content in the type 1, 2, and 3 lake.
The box plot of the simple random sample:
boxplot(HG.PPM~LT, data=data, xlab="Lake type", ylab="Hg(ppm)", main="Mercury content by lake type")
Specify a complex survey design.
library(survey) # load the package
w <- rep(N/n,n)
fpc<-rep(N,n)
s.design <- svydesign(ids = ~1 ,weights = w ,fpc=fpc, data=data) #Simple random sampling (consider fpc)
Use the commend, svyby(formula, by, design, Fun),
to compute survey statistics on subsets.
formula: the target variable
by: the variable which define the subsets
design: the survey design we specified
Fun: type of statistics we want to compute
If by=~LT and Fun=svymean, then the output will be the following table:
| LT | \(\bar{y}\) | se | |
|---|---|---|---|
| 1 | 1 | \(\bar{y}_1\) | \(\sqrt{\hat{V}(\bar{y}_1)}\) |
| 2 | 2 | \(\bar{y}_2\) | \(\sqrt{\hat{V}(\bar{y}_2)}\) |
| 3 | 3 | \(\bar{y}_3\) | \(\sqrt{\hat{V}(\bar{y}_3)}\) |
r.group<-svyby(~HG.PPM, by=~LT,s.design,FUN=svymean)
r.group
## LT HG.PPM se
## 1 1 0.1900000 0.01811422
## 2 2 0.4247059 0.05884486
## 3 3 0.4493750 0.03597969
The outputs show us \(\hat{\mu_i}(=\bar{y}_i)\) and the estimate of the standard error of \(\hat{\mu_i}\) for each subsets.
| Lake type (\(i\)) | \(\bar{y}_i\) | \(se_i=\sqrt{\hat{V}(\bar{y}_i)}\) | |
|---|---|---|---|
| 1 | 1 | 0.1900000 | 0.01811422 |
| 2 | 2 | 0.4247059 | 0.05884486 |
| 3 | 3 | 0.4493750 | 0.03597969 |
Let d be the difference in mean mercury levels for type 1 and type 2. \[ \hat{d}=\bar{y}_1-\bar{y}_2,\; se(\hat{d})=\sqrt{\hat{V}(\hat{y}_1)+\hat{V}(\hat{y}_2)} \]
(d.hat<-r.group[1,2]-r.group[2,2])
## [1] -0.2347059
(se.d.hat<-sqrt(r.group[1,3]^2+r.group[2,3]^2))
## [1] 0.06156982
Therefore, \(\hat{d}=-0.2347\), \(se(\hat{d})=0.0616\).
c(d.hat-2*se.d.hat,d.hat+2*se.d.hat)
## [1] -0.3578455 -0.1115662
The bound on the error of the estimation is \(2 \cdot se(\hat{d})\). The resulting interval estimate, (-0.3578,-0.1116), implies that the true difference in mean mercury content for these two types of lakes (\(d\)) could be anywhere between -0.3578 and -0.1116. Since the result interval does not covers zero, which implies that there is significant evidence that the mean mercury content for the type 1 lakes is smaller than the one for the type 2 lakes. The type 2 lakes could be anywhere between 0.1116 and 0.3578 ppm greater than type 1 lakes.