Simple random sampling by using R package \(survey\)

\(survey\) is a well-known R package for survey sampling.

Install the \(survey\) package, and then load the package in R.

#install.packages("survey") # after running this commend, you will have "survey" installed in your R

library(survey) # load the package 

Example (excercise 4.40 in the Textbook)

“4.40: An auditor detects that a certain firm is regularly overstating the dollar amounts of inventories because of delays in recording withdrawals. The auditor wants to estimate the total overstated amount on 1000 listed items by obtaining exact inventory amounts on a random sample of 15 items and comparing these exact figures with the recorded amounts.Estimate the total overstated amount on the 1000 types of items and place a bound on the error of estimation.”

\(N=1000\), \(n=15\), parameter \(\tau = \sum_{i=1}^n u_i\), where \(u_i\) the overstated amount on the \(i\)th items.

setwd("D:\\course\\111-1\\Survey Sampling\\Rnote\\Rpub\\sampling")
data.40<-read.csv("EXERCISE4_40.csv")
data.40
##    Item Audited Recorded Overstat
## 1     1     175      210       35
## 2     2     295      305       10
## 3     3      68       91       23
## 4     4      74       82        8
## 5     5     128      140       12
## 6     6     241      250        9
## 7     7     362      384       22
## 8     8      72       80        8
## 9     9      59       82       23
## 10   10     112      140       28
## 11   11     118      124        6
## 12   12     210      230       20
## 13   13     240      260       20
## 14   14     123      147       24
## 15   15      96      108       12

Step 1. specify a survey design, \(svydesign()\)

\(ids\): \(ids=\)~\(1\) means selecting samples from whole population (no stratification)

\(fpc\): finite population correction. \(fpc=\)~\(x\) says that input variable \(x\) contain the population size. \(x\) is a vector of length n. Use \(~\) when \(x\) is in the supplied data set.

\(data\): data frame to look up. Must include the variable of interest.

N<-1000
n<-15
N.v<-rep(N, n) # the population size for the n samples
design1 <- svydesign(ids = ~1, fpc = N.v, data=data.40) # finite population
w = rep(N/n,n)
design2 <- svydesign(ids = ~1, weight =w, data=data.40) # when N >> n

Step 2. summary statistics for sample surveys

\(svymean(x, design)\): estimators for the population mean

\(svytotal(x, design)\): estimators for the population total

\(conf(object, level = 0.95)\): the confidence interval

\(SE(object)\): extracts standard errors from an object

r.total.1<-svytotal(~Overstat, design1)
r.total.1
##          total     SE
## Overstat 17333 2222.8

As a result, \(\hat{\tau}=17333, \hat{V}(\hat{\tau})=(2222.8)^2\). Check the results by using the formula in textbook: \[\hat{\tau}=N\bar{y},\] \[\hat{V}(\hat{\tau})=N^2(1-n/N)(s^2/n)\]

mean(data.40$Overstat)*N #tau hat
## [1] 17333.33
sd.total<-sd(data.40$Overstat)
N^2*(1-n/N)*(sd.total^2/n) #V hat
## [1] 4940635
SE(r.total.1)^2
##          Overstat
## Overstat  4940635

The bound on the error of \(\hat{\tau}\):

2*SE(r.total.1)
##          Overstat
## Overstat 4445.508

Therefore, the estimator of the total overstated amount on the 1000 types of items is \(\hat{\tau} = \$ 17333\). The bound on the error of \(\hat{\tau}\) is \(\$\) 4445.5.

Since 1000 >> 15, we can also ignore the fpc.

r.total.2<-svytotal(~Overstat, design2)
r.total.2
##          total     SE
## Overstat 17333 2239.6
2*SE(r.total.2)
##          Overstat
## Overstat 4479.229