See: https://rpubs.com/staszkiewicz/838893

I recommend these good overviews of Markdown: (https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html) and (https://bookdown.org/yihui/rmarkdown/).

Quick sampling and random numbers in R

Source: http://www.cookbook-r.com/Numbers/Generating_random_numbers/

Uniformly distributed (flat) random numbers are generated with runif(). By default the range is from 0 to 1.

runif(1)
## [1] 0.8932984

# Get a vector of 4 numbers
runif(4)
## [1] 0.4025864 0.7411688 0.6191650 0.7496963

# Get a vector of 3 numbers from 0 to 100
runif(3, min=0, max=100)
## [1] 88.93362 96.57331 88.07445

# Get 3 integers from 0 to 100
# Use max=101 because it will never actually equal 101
floor(runif(3, min=0, max=101))
## [1] 68 42 97

# sample() does much the same (though 1:100 here excludes 0)
sample(1:100, 3, replace=TRUE)
## [1] 17  4 27

# To generate integers WITHOUT replacement:
sample(1:100, 3, replace=FALSE)
## [1] 19 51 10

To generate numbers from a normal distribution, use rnorm(). By default the mean is 0 and the standard deviation is 1.

rnorm(4)
## [1] -1.2015072  0.2932762  1.2409667  1.4804722

# Use a different mean and standard deviation
rnorm(4, mean=50, sd=10)
## [1] 37.66100 54.79465 44.79252 50.30622

# To check that the distribution looks right, make a histogram of the numbers
x <- rnorm(400, mean=50, sd=10)
hist(x)
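Each run of these functions returns different numbers. If you need reproducible results (e.g. to document a sampling run), base R's set.seed() fixes the generator state; a minimal sketch:

set.seed(123)  # fix the random number generator state
runif(2)
set.seed(123)  # resetting the same seed...
runif(2)       # ...reproduces exactly the same two numbers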

Introduction

Let us assume that we are examining the statement (population) of expenditure declared by an entity in a given year.

The system audit carried out by the auditor has resulted in a moderate level of assurance. Therefore, a confidence level of 80% seems appropriate for a cost (assurance) audit.

The table below shows the main characteristics of the population:

Population size (number of operations): 3,852 (our N)
Book value (total expenditure in the reference period): €46,501,186

Let’s enter these parameters as global variables

N<-3852
Pop<-46501186

So \(N\) = 3,852 and the population book value is €46,501,186.

An initial sample of 20 operations gave a preliminary estimate of the standard deviation of errors of EUR 518.

Let's check whether our loaded data give the same results as in the table.

Let’s read in the data from the first sample

Operation number: 98 120 542 554 587 1156 1325 1453 1840 1904 2028 2338 2428 2735 3054 3196 3276 3321 3366 3666

Sample book value: 13054 10758 8714 8645 9297 7908 6717 16535 15718 13175 6486 13072 8753 17507 8875 6568 6478 12448 17894 13558

Sample audit value: 13054 10758 8264 8645 9297 7908 6717 16535 15718 13175 6486 13072 8753 17507 8875 6568 6478 12448 15598 13558

Let us denote the variables as NrOp, PBV and PAV, and read these data in using the c() command.

NrOp<- c(98,120,542,554,587,1156,1325,1453,1840,1904,2028,2338,2428,2735,3054,3196,3276,3321,3366,3666)
PBV<-c(13054,10758,8714,8645,9297,7908,6717,16535,15718,13175,6486,13072,8753,17507,8875,6568,6478,12448,17894,13558)
PAV<-c(13054,10758,8264,8645,9297,7908,6717,16535,15718,13175,6486,13072,8753,17507,8875,6568,6478,12448,15598,13558)

First, let us calculate the error values (book value less audit value):

(PER<-PBV-PAV)
##  [1]    0    0  450    0    0    0    0    0    0    0    0    0    0    0    0
## [16]    0    0    0 2296    0
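As a quick illustrative check with the vectors defined above, logical indexing shows which operations carry the errors:

NrOp[PER != 0]  # operations with errors: 542 and 3366
PER[PER != 0]   # their error amounts: 450 and 2296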

Let us make an Excel-like table:

Pro<-data.frame(NrOp,PBV,PAV,PER)
colnames(Pro)<-c("Operation", "Book Value","Correct Value", "Error")

Let's check the sums of the Book Value, Correct Value and Error columns, the standard deviation of errors (SD Error), and the ratio of the sample error to the sample book value (SER). To build the totals we use the apply() function.

#sum(PBV)
#sum(PAV)
#sum(PER)
#sd(PER)

apply(Pro,2, sum)
##     Operation    Book Value Correct Value         Error 
##         38987        222160        219414          2746
Total<-apply(Pro,2, sum)
Total[5]<-sd(PER)
Total[6]<-sum(PER)/sum(PBV)
names(Total)<- c("Operation", "Book Value","Correct Value", "Error","SD Error","SER")
round(Total,0)
##     Operation    Book Value Correct Value         Error      SD Error 
##         38987        222160        219414          2746           518 
##           SER 
##             0
#install.packages("scales")  # install and load the scales package
library("scales")

percent(Total[6], accuracy = 0.01)
##     SER 
## "1.24%"

Formula used to calculate the sample size

The initial sample gives us two pieces of information: the sample error rate and the standard deviation of the sample errors. These we treat as parameters of the general population. The sample size for the audit can then be calculated using the formula

\[ \begin{equation} n= \left(\frac{N \cdot z \cdot \sigma_{e}}{TE-AE}\right)^2 \tag{1} \label{eq} \end{equation} \]

Note that \(N\) is the number of all objects in the population, \(z\) is the quantile of the normal distribution for the confidence level that follows from the preliminary assessment of the detection risk, obtained as the product of the inherent risk and the risk of the control system, and \(\sigma_{e}\) is the standard deviation of errors. In the denominator we have the difference between the tolerable error (\(TE\), the materiality fraction) and the expected error (\(AE\)).

To be more specific: \(z\) is 1.282 (the coefficient corresponding to an 80% confidence level for the normal distribution, from the tables), \(\sigma_{e}\) is the EUR 518 standard deviation of errors estimated from the initial sample, and \(TE\), the tolerable error, is 2% (the maximum materiality level determined in advance at planning) of the book value, i.e. 2% × €46,501,186 = €930,024. The initial sample gave a sampling error rate of 1.24%. Furthermore, based on the experience of the previous year and the conclusions from the report on the management and control systems, the audit authority expects an error rate of no more than 1.24%. Therefore \(AE\), the expected error, is 1.24% of the total expenditure, i.e. 1.24% × €46,501,186 = €576,615:

\[ n = \left(\frac{3,852 \cdot 1.282 \cdot 518}{930,024-576,615}\right)^2 \approx 53\]

The previous initial sample of 20 is used as part of the main sample.

Let us check this calculation in R

n<-(N*qnorm(0.9)*sd(PER)/(0.02*Pop - Total[6]*Pop))^2
names(n)<-c("sample size")
round(n,0)
## sample size 
##          52

The difference between the manually calculated value and the machine-calculated value results from rounding of the inputs.
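A short sketch makes the rounding visible: by hand we used z = 1.282, a standard deviation of 518 and AE = €576,615 (from the rounded 1.24% rate), while R used the unrounded values (n_manual, AE_rounded and AE_exact are just labels for this check):

n_manual <- (3852*1.282*518/(930024 - 576615))^2
round(n_manual, 1)  # about 52.4, rounded up to 53 by hand
c(AE_rounded = 0.0124*Pop, AE_exact = unname(Total[6]*Pop))
# 576,615 vs roughly 574,776: the source of the small discrepancy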

Confidence interval and normal distribution

Before we move on to calculating the results, let’s consider the parameter \(z\) =1.28 in formula (1).

Let us see graphically what the normal distribution looks like

x<-seq(-4,4, length=200)
# now compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw the plot as a line

The entire area under the normal curve equals one:

\[\int_{-\infty}^{\infty} g(x)\, dx = 1 \]

Given that we are dealing with the standardized normal distribution N(0,1), the area between -3 and 3 standard deviations from the mean (in our case the standard deviation = 1 and the mean = 0) covers 99.7% of the area under the curve, namely:

x<-seq(-4,4, length=200)
# now compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw the plot as a line
x<-seq(-3,3, length=100)
y<-dnorm(x,mean=0,sd=1)
polygon(c(-3,x,3),c(0,y,0),col="grey")
text(0,0.1,"99.7%")
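Incidentally, the claim that the whole area equals one can be checked numerically with base R's integrate():

integrate(dnorm, -Inf, Inf)  # 1 with a small absolute numerical error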

We interpret this to mean that anything between -3 and 3 is very likely (99.7%) to be in our population; if an object has a value less than -3 or greater than 3, the chance that it is in the population is very small, about 0.003. Moreover, this chance splits into two tails of about 0.0015 (0.15%) each:

x<-seq(-4,4, length=200)
# now compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw the plot as a line
x<-seq(-3,3, length=100)
y<-dnorm(x,mean=0,sd=1)
polygon(c(-3,x,3),c(0,y,0),col="grey")
text(0,0.1,"99.7%")
x1<-seq(-4,-3,length=50)
y1<-dnorm(x1,0,1)
polygon(c(-4,x1,-3),c(0,y1,0),col="red")
x2<-seq(3,4,length=50)
y2<-dnorm(x2,0,1)
polygon(c(3,x2,4),c(0,y2,0),col="red")
arrows(-3.1,0.3, -3,0, length=0.15)
text(-3, 0.32, "tails of 0.15% each")
arrows(-3.1,0.3,3,0, length=0.15)
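These tail areas can be verified numerically with pnorm(), the cumulative distribution function of the normal distribution:

pnorm(3) - pnorm(-3)  # 0.9973: the area between -3 and 3
pnorm(-3)             # 0.00135: each tail, about 0.135% before rounding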

Please note that the R function which, for a given cumulative area under the normal density, returns the corresponding value on the x-axis is qnorm(p), where p is a value or a vector. The probability p is measured from minus infinity up to the desired x.

x=seq(-3,3,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l")
x=seq(-3, 1.2833,length=100)
y=dnorm(x,mean=0,sd=1)
polygon(c(-3,x,1.2833),c(0,y,0),col="blue")
text(0,0.1,"0.90")
arrows(1.5,0.06,1.2,0,length=.15)
text(1.58,0.09,"1.28")

So let's check the value of qnorm(0.9): 1.2815516. Why?

So if we want a confidence interval of 80%, we pass a quantile of 90%. Why? Because a confidence interval at the 80% level cuts off 10% symmetrically on the right and on the left. Hence, the interval from qnorm(0.10) to qnorm(0.90) is a segment with 80% probability, starting at -1.2815516 and ending at 1.2815516.
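A quick numerical check with qnorm() and its inverse pnorm():

qnorm(0.9)                             # 1.281552: 90% of the area lies to its left
qnorm(0.1)                             # -1.281552: the symmetric lower quantile
pnorm(qnorm(0.9)) - pnorm(qnorm(0.1))  # 0.8: the central 80% of the distribution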

x=seq(-3,3,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l")
x1=seq(-1.28, 1.2833,length=500)
y=dnorm(x1,mean=0,sd=1)
polygon(c(-1.28,x1,1.2833),c(0,y,0),col="blue")
text(0,0.1,"0.80")
arrows(1.5,0.06,1.2,0,length=.15)
text(1.58,0.09,"1.28")

#arrows(-1.5,0.06,-1.2,0,length=.15)
#text(-1.58,0.09,"-1.28")

Recall that we need information from the inherent risk and control analysis to assess the confidence level we should work at in order to keep the audit risk at 5%.

Low control assurance: consider the desirable and acceptable audit risk of 5%. If the inherent risk (100%) and the control risk (50%) are high, we are dealing with a high-risk entity whose internal control procedures are not adequate to manage the risk, so the auditor should aim for a very low detection (omission) risk of 10%. To achieve a low detection risk, the amount of substantive testing, and therefore the sample size, must be large.

\[DR = \frac{AR}{IR \times CR} = \frac{0.05}{1 \times 0.5} = 0.1\]

Which translates to a 90% confidence level (1 - 0.1).

High degree of control assurance: in another context where the inherent risk is high (100%) but adequate controls exist, the control risk can be estimated at 12.5%. To keep the audit risk at 5%, the detection risk may then be 40%, meaning that the auditor can accept more risk by reducing the sample size. This ultimately means a less detailed and less costly audit.

\[DR = \frac{AR}{IR \times CR} = \frac{0.05}{1 \times 0.125} = 0.4\]

Which translates into a 60% confidence level (1 - 0.4).
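Both computations can be wrapped in a small helper. Note that detection_risk() below is a hypothetical convenience function written for this note, not part of any package:

detection_risk <- function(AR, IR, CR) AR/(IR*CR)  # the audit risk model
detection_risk(0.05, 1, 0.5)    # 0.1 -> work at the 90% confidence level
detection_risk(0.05, 1, 0.125)  # 0.4 -> work at the 60% confidence level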

Note that both examples achieve the same 5% audit risk across different control environments.

To summarize: the parameter \(z\) reflects the expert's estimate of the confidence level necessary to control the audit risk at the 5% level, while the precision, i.e. the difference between \(TE\) and \(AE\), is in fact the risk appetite.

Let's return to sample selection. Our initial sample was 20 items; after estimating the sample size from the parameters of the initial sample, we obtained a final sample size of 52 operations for inference at the 80% confidence level.

Therefore, the auditor has to randomly select only 32 more operations. Below we load the full sample with the additional 32 cost items (52 in total) and the results of the valuation test:

Loading the actual sample data


Here we enter three variables: the observation index (e.g. invoice number), the book value (BV) recorded in the accounts, and the audit value (AV). The values could be loaded from files; here we enter them with the c() vector command. The values are:

Index = 74,98,120,153,223,246,542,554,587,915,1014,1114,1156,1325, 1403,1453,1577,1621,1624,1631,1649,1650,1678,1687,1712,1729,1730,1744, 1744,1767,1774,1796,1796,1817,1821,1828,1850,1879,1888,1898,1909,1926, 1948,1949,1958,1963,1967,1980,1990,3749,3770, 4001.

Book value BV = 9093,13054,10758,16194,11662,16331,8714,8645,9297,7999,11906, 15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278, 28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827, 22714.00027,15556.03318,20436.71763,16999.19228,1111,1231,151,16336.08554, 17021.65133,29527.18354,888,1200,771,1200,150,29017.91656,1765,12110, 200,331,24383.76591,10171, 50,22

and Audit value AV that is: 9093,13054,10758,16194,11662,16331,8264,8645,9297,7999,11906,15505,7908,6717, 9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552, 18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027, 15556.03318,17556.71763,16100,2059,330,151,16336.08554,17021.65133,29000, 700,750,771,1200,300,27217.91656,1765,10110,1200,331,24383.76591,10171, 50, 22.

Index<- c(74,98,120,153,223,246,542,554,587,915,1014,1114,1156,1325,1403,1453,1577,1621,1624,1631,1649,1650,1678,1687,1712,1729,1730,1744,1744,1767,1774,1796,1796,1817,1821,1828,1850,1879,1888,1898,1909,1926,1948,1949,1958,1963,1967,1980,1990,3749, 3770,4001)

BV<-c(9093,13054,10758,16194,11662,16331,8714,8645,9297,7999,11906,15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027,15556.03318,20436.71763,16999.19228,1111,1231,151,16336.08554,17021.65133,29527.18354,888,1200,771,1200,150,29017.91656,1765,12110,200,331,24383.76591,10171,50,22)

AV<-c(9093,13054,10758,16194,11662,16331,8264,8645,9297,7999,11906,15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027,15556.03318,17556.71763,16100,2059,330,151,16336.08554,17021.65133,29000,700,750,771,1200,300,27217.91656,1765,10110,1200,331,24383.76591,10171,50,22)

Let us make a data frame:

Val<- data.frame(Index,BV,AV)

We complete our frame with the error value, Val$ER <- BV - AV; let's check its standard deviation (603.33) and its sum (7997.38).

And let’s view this data frame using the View() function:

Val$ER<-(BV-AV)
View(Val)
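As a quick check of the values quoted above:

round(sd(Val$ER), 2)   # 603.33: standard deviation of errors
round(sum(Val$ER), 2)  # 7997.38: total error in the sample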

Now that we have the entire sample selected, we will calculate the point error in the sample, project the point error onto the population, determine the confidence interval of the population error, and compare it to our tolerable error to infer whether or not the population contains a material error. In view of the above:

round(apply(Val,2,sum),2)
##     Index        BV        AV        ER 
##  81977.00 661652.41 653655.03   7997.38

The total book value of the 52 sampled operations is €661,652. The total amount of error in the sample is €7,997. This amount, divided by the sample size of 52, is the mean error per operation, i.e. 153.8. When using per-unit mean estimation, the projection of the error onto the population is calculated by multiplying this average error by the population size \(N\) (in this example €592,421). This number represents the projected error at the level of the entire population of costs (expenses) under study. Let us call it \(EE_{1}\); its value is:

EEP<-round((sum(Val$ER)/length(Val$ER))*N,0)
EEP
## [1] 592421
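Equivalently, since the mean is the sum divided by the length, mean() gives the same projection:

round(mean(Val$ER)*N, 0)  # 592421, as above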

Please note that if the sample had been chosen differently, the error values would have been different. As an example, let us draw another subsample of 20 items from our final sample and see what the target sample size would then be:

Pr2<-Val[sample(nrow(Val),20),]
Pr2
##    Index       BV       AV        ER
## 41  1909   771.00   771.00    0.0000
## 18  1621 16095.00 16095.00    0.0000
## 32  1796 16999.19 16100.00  899.1923
## 11  1014 11906.00 11906.00    0.0000
## 38  1879 29527.18 29000.00  527.1835
## 29  1744 22714.00 22714.00    0.0000
## 51  3770    50.00    50.00    0.0000
## 12  1114 15505.00 15505.00    0.0000
## 6    246 16331.00 16331.00    0.0000
## 5    223 11662.00 11662.00    0.0000
## 4    153 16194.00 16194.00    0.0000
## 31  1774 20436.72 17556.72 2880.0000
## 10   915  7999.00  7999.00    0.0000
## 1     74  9093.00  9093.00    0.0000
## 9    587  9297.00  9297.00    0.0000
## 52  4001    22.00    22.00    0.0000
## 22  1650 19947.07 19947.07    0.0000
## 49  1990 24383.77 24383.77    0.0000
## 8    554  8645.00  8645.00    0.0000
## 17  1577 17723.00 17723.00    0.0000
round(apply(Pr2,2,sum),2)
##     Index        BV        AV        ER 
##  28591.00 275300.93 270994.56   4306.38
# compare with the original sample
round(apply(Pro,2,sum),2)
##     Operation    Book Value Correct Value         Error 
##         38987        222160        219414          2746
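The two standard deviations of errors can be checked directly (the value for Pr2 depends on the particular random draw above):

sd(Pro$Error)               # 517.9459 in the first sample
sd(Pr2$ER)                  # about 667.03 for the subsample drawn here
sd(Pro$Error) - sd(Pr2$ER)  # about -149.08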

The standard deviation of errors in the first sample was 517.9458517 and in the second 667.0289825, so their difference is -149.0831308. After substituting the new values into equation (1) we obtain a new number of elements needed for the sample, namely:

n2<-(N*qnorm(0.9)*sd(Pr2$ER)/(0.02*Pop - sum(Pr2$ER)/sum(Pr2$BV)*Pop))^2
names(n2)<-c("second sample size")
round(n2,0)
## second sample size 
##                264

In practice this means that if we drew a different sample we would get a different point estimate of the errors. The question is what the largest error value would be. And here again we return to our probabilistic thinking: this estimate will lie at the border of the specified confidence interval, i.e. to our expected value we must add the right-hand (intuitively, the positive) end of the confidence interval. Consequently, to our mean (point estimate) we add as many standard errors as needed to reach the right end of the interval (in our situation, for 80% coverage, our \(z\) equals 1.28). Hence

\[SE_{1}= N \cdot z \cdot \frac{s_{e}}{\sqrt{n}}\]

but note our \(s_{e}\) is not the same as the sample standard deviation, but an estimator of the population deviation based on the sample data and is calculated as follows:

\[s_{e}= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (E_{i} - \bar{E})^2 }\]

where \(E_{i}\) is the error for an individual expenditure and \(\bar{E}\) is the average error in our sample. The calculated value of \(s_{e}\):

SE_pop<- sd(Pro$Error)
names(SE_pop)<-"Estimated deviation of errors in the general population"
SE_pop
## Estimated deviation of errors in the general population 
##                                                 517.9459

so the precision (the half-width of the confidence interval) is

SE1<- N*qnorm(.9)*SE_pop/sqrt(n)
names(SE1)<-"Precision"
SE1
## Precision 
##  355247.6

and the upper limit of our error with 80% certainty will then be the sum \[EE_{1} + SE_{1}\] hence about €947,669.

Finally, the auditor compares the materiality threshold of 2% of the total book value of the program (2% × €46,501,186 = €930,024) with the projected error and concludes that the projected error is lower than the maximum tolerable error, but the upper limit of error is higher than the maximum tolerable error. The auditor may conclude that there is insufficient evidence that the population is not materially misstated. Hence, further testing needs to be done (e.g. by increasing the sample).
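This final comparison can be sketched in R with the objects computed above (TE is just our label for the materiality threshold):

TE <- 0.02*Pop  # 930,024 EUR: the tolerable error
EEP < TE        # TRUE: the projected error (592,421) is below materiality
EEP + SE1 < TE  # FALSE: the upper error limit (~947,669) exceeds materiality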

Further reading

The difference between random and haphazard sampling.