See: https://rpubs.com/staszkiewicz/EN_Sampling
A good overview of Markdown (https://bookdown.org/yihui/bookdown/markdown-extensions-by-bookdown.html) and of R Markdown (https://bookdown.org/yihui/rmarkdown/) is available; I do recommend both.
Let us define a vector of 2, 4, 3, 5, 6, e.g. the weights of apples in \(dkg\).
w<- c(2,4,3, 5, 6)
w
## [1] 2 4 3 5 6
Let us calculate the average of \(w\):
\[ \bar w = \frac{2+4+3+5+6}{5} = 4, \]
i.e. `mean(w)`.
mean(w)
## [1] 4
Subtract the mean from each observation (apple):
# w_sr = the mean of w, roz = the difference w - mean
table <- data.frame(w = w, w_sr = 4, roz = w - 4)
table
## w w_sr roz
## 1 2 4 -2
## 2 4 4 0
## 3 3 4 -1
## 4 5 4 1
## 5 6 4 2
Let’s sum up the results
rbind(table, c("Sum", colSums(table[,2:3])))
## w w_sr roz
## 1 2 4 -2
## 2 4 4 0
## 3 3 4 -1
## 4 5 4 1
## 5 6 4 2
## 6 Sum 20 0
The mean is the central value: on average it deviates equally from the objects in the population. But because the signs of the deviations cancel each other out, we cannot compute the average deviation from the mean directly; hence we use either absolute values or even powers (2, 4, 6, etc.). Note that once we square anything we lose the direct interpretation. Let us compute the squared differences and their mean.
table = data.frame(Apples=w, Average = 4, Difference= w-4, Difference_2= (w-4)^2)
table1 = rbind(table, c("Sum", colSums(table[,2:4])))
table2 = rbind(table1, c('average', colSums(table[,2:4])/5))
table2
## Apples Average Difference Difference_2
## 1 2 4 -2 4
## 2 4 4 0 0
## 3 3 4 -1 1
## 4 5 4 1 1
## 5 6 4 2 4
## 6 Sum 20 0 10
## 7 average 4 0 2
The average apple weighs 4 \(dkg\), but deviates quadratically by 2 \(dkg^2\) on average; let us call this quantity the variance. By itself this tells us little, because we are comparing apples with squared apples. To bring the quadratic measure back to the original units we take the square root, \(\sqrt{2}\), i.e.
sqrt(2)
## [1] 1.414214
Now we can interpret the result: the average apple weighs 4 \(dkg\), and the apples in our basket differ from that average by 1.41 \(dkg\) on average. Our 1.41 is the standard deviation, i.e. the square root of the variance.
\[ \sqrt{V(w)} = \sigma(w) \] and the equivalent expression
\[ \sigma(w)^2 = V(w). \]
Let us do a direct calculation of the mean, variance and standard deviation of the vector \(w\):
#vector
w
## [1] 2 4 3 5 6
#mean
mean(w)
## [1] 4
#variance
var(w)
## [1] 2.5
#standard deviation
sd(w)
## [1] 1.581139
Why do we get different results in R than in our manual calculation? By default, R computes the unbiased estimator of the sample variance, dividing by n-1 rather than n. So let us repeat our calculation with the new denominator:
table = data.frame(apples=w, mean = 4, dif= w-4, dif2= (w-4)^2)
table1 = rbind(table, c("Sum", colSums(table[,2:4])))
table2 = rbind(table1, c('simple average', colSums(table[,2:4])/5))
table3 = rbind(table2, c('unbiased average', colSums(table[,2:4])/4))
table3
## apples mean dif dif2
## 1 2 4 -2 4
## 2 4 4 0 0
## 3 3 4 -1 1
## 4 5 4 1 1
## 5 6 4 2 4
## 6 Sum 20 0 10
## 7 simple average 4 0 2
## 8 unbiased average 5 0 2.5
And then take the square root of 2.5:
sqrt(2.5)
## [1] 1.581139
We got the same results as the functions built into R.
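As a minimal cross-check of the two denominators (a sketch using the vector w defined above):
n <- length(w)                   # 5 observations
sum((w - mean(w))^2) / n         # 2   -> dividing by n, as in our table
sum((w - mean(w))^2) / (n - 1)   # 2.5 -> dividing by n-1, which is what var(w) returns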
We will say more about the difference between the general population and the sample later.
What should we use in the calculations, variance, standard deviation, absolute deviation or higher moments? Let’s skip the issues of absolute deviation and higher moments, as they are quite impractical, and consider the difference between variance and standard deviation.
Let us construct a second vector, this time of large apples: \(d\) = 4, 8, 6, 10, 12 \(dkg\).
d<- w*2
d
## [1] 4 8 6 10 12
Please note that our new vector is \(d = 2*w\); more generally \(d = A*w\), where \(A\) is a constant (some number). So let us calculate the mean, variance and standard deviation:
dan<-c(mean(w),var(w),sd(w),mean(d),var(d),sd(d))
matrix<-matrix(dan, ncol=2)
rownames(matrix)<-c("mean", "variance", "deviation")
colnames(matrix)<-c("w", "d")
round(matrix,2)
## w d
## mean 4.00 8.00
## variance 2.50 10.00
## deviation 1.58 3.16
Please note the relationships
\[ d = 2*w, \quad \overline{d} = 2*\overline{w}, \quad V(d) = 2^2 V(w). \] More generally, denoting the constant as \(A\): \[ \bar d = A * \bar w, \qquad V(d) = V(Aw) = A^2 V(w). \] Note, however, that for the standard deviation
\[ \sigma(d) = |A| * \sigma(w) \neq A^2 \sigma(w). \] The variance has better properties than the standard deviation for transformations and calculations, so in practice we work with the variance, and only when we want to interpret the results do we convert it back to the standard deviation.
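A quick numerical check of these scaling rules (the constant A below is picked arbitrarily for illustration):
A <- 3
all.equal(var(A * w), A^2 * var(w))    # TRUE: the variance scales with A^2
all.equal(sd(A * w), abs(A) * sd(w))   # TRUE: the standard deviation scales with |A|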
Let’s see what happens if we create a third variable that is the sum of the vectors \(w\) and \(d\). Let’s call this variable \(s\) and again calculate the mean, variance and standard deviation.
s<- w+d
s
## [1] 6 12 9 15 18
dan<-c(mean(w),var(w),sd(w),mean(d),var(d),sd(d), mean(s), var(s), sd(s))
matrix<-matrix(dan, ncol=3)
rownames(matrix)<-c("mean", "variance", "deviation")
colnames(matrix)<-c("w", "d", "s")
round(matrix,1)
## w d s
## mean 4.0 8.0 12.0
## variance 2.5 10.0 22.5
## deviation 1.6 3.2 4.7
Note that the variance of \(s\) is not simply the sum of the variances \(V(w) + V(d)\), because that would be 12.5; the extra 10 is twice the covariance between \(w\) and \(d\):
\[ V(s) = V(w) + V(d) + 2\,cov(w,d). \]
Let us check how much the covariance between \(w\) and \(d\) is:
cov(w,d)
## [1] 5
\[V(s) = 2.5 + 10 +2*5\]
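We can confirm this identity directly in R, using the vectors defined above:
var(w) + var(d) + 2 * cov(w, d)   # 2.5 + 10 + 2*5 = 22.5
var(s)                            # 22.5, the same value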
\[ cov(w,d) = \frac{1}{n-1} \sum_{i=1}^{n} (w_{i}-\bar w)(d_{i}-\bar d) \] The above is the formula for the sample covariance.
What if the variables had a low covariance? Let us draw two variables from the uniform distribution, X <- runif(5) and Y <- runif(5), and check their covariance.
set.seed(100)
X<- runif(5)
Y<- runif(5)
cov(X,Y)
## [1] -0.02741495
Let us check var(X+Y), var(X) and var(Y):
var(X+Y)
## [1] 0.03818299
var(X)
## [1] 0.03730107
var(Y)
## [1] 0.05571182
cov(X,Y)
## [1] -0.02741495
Note: if cov(X,Y) = 0 (or close to it) we treat the variables as independent (strictly speaking, uncorrelated).
When \[ cov(X,Y) = 0 \] then \[ V(X+Y) = V(X)+V(Y), \] but still \[ \sigma(X+Y) \neq \sigma(X) + \sigma(Y). \] Therefore we use variances for transformations, because the variance is additive. The assumption of no covariance is the assumption of independence of the variables (a substantive assumption, because computationally we will almost always obtain some residual nonzero covariance), hence the many tests for independence of variables.
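A quick check of this additivity, and of the fact that standard deviations do not add, using the X and Y generated above:
var(X) + var(Y) + 2 * cov(X, Y)   # always exactly equal to var(X + Y)
var(X) + var(Y)                   # close to var(X + Y) only because cov(X, Y) is near zero
sd(X) + sd(Y)                     # not equal to sd(X + Y)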
Let’s move on to random number generation and sampling.
Source: http://www.cookbook-r.com/Numbers/Generating_random_numbers/
Uniformly distributed (flat) random numbers: runif(). By default the range is from 0 to 1.
runif(1)
## [1] 0.6249965
# Get a vector of 4 numbers
runif(4)
## [1] 0.8821655 0.2803538 0.3984879 0.7625511
# Get a vector of 3 numbers from 0 to 100
runif(3, min=0, max=100)
## [1] 66.90217 20.46122 35.75249
# Get 3 integers from 0 to 100
# Use max=101 because it will never actually equal 101
floor(runif(3, min=0, max=101))
## [1] 36 69 54
# This will do the same thing
sample(1:100, 3, replace=TRUE)
## [1] 82 61 12
# To generate integers WITHOUT replacement:
sample(1:100, 3, replace=FALSE)
## [1] 99 51 72
To generate numbers from a normal distribution, use rnorm(). By default the mean is 0 and the standard deviation is 1.
rnorm(4)
## [1] 0.12337950 -0.02931671 -0.38885425 0.51085626
# Use a different mean and standard deviation
rnorm(4, mean=50, sd=10)
## [1] 40.86186 73.10297 45.61910 57.64061
# To check that the distribution looks right, make a histogram of the numbers
x <- rnorm(400, mean=50, sd=10)
hist(x)
Let us assume that we are examining the statement (population) of expenditure declared by an entity in a given year.
The system audit carried out by the auditor has resulted in a moderate level of assurance. Therefore, a confidence level of 80% seems appropriate for a cost (assurance) audit.
The main characteristics of the population are as follows: population size (number of operations) 3,852 (our \(N\)); book value (total expenditure in the reference period) €46,501,186.
Let’s enter these parameters as global variables
N<-3852
Pop<-46501186
So \(N\) = 3852 and the population book value is €46,501,186.
An initial sample of 20 operations gave a preliminary estimate of the standard deviation of errors of EUR 518.
Let’s check whether our loaded data give the same result as in the table.
Operation number: 98 120 542 554 587 1156 1325 1453 1840 1904 2028 2338 2428 2735 3054 3196 3276 3321 3366 3666
Sample book value: 13054 10758 8714 8645 9297 7908 6717 16535 15718 13175 6486 13072 8753 17507 8875 6568 6478 12448 17894 13558
Sample audit value: 13054 10758 8264 8645 9297 7908 6717 16535 15718 13175 6486 13072 8753 17507 8875 6568 6478 12448 15598 13558
Let us denote the variables as NrOp, PBV and PAV, and read these data in using the c() command.
NrOp<- c(98,120,542,554,587,1156,1325,1453,1840,1904,2028,2338,2428,2735,3054,3196,3276,3321,3366,3666)
PBV<-c(13054,10758,8714,8645,9297,7908,6717,16535,15718,13175,6486,13072,8753,17507,8875,6568,6478,12448,17894,13558)
PAV<-c(13054,10758,8264,8645,9297,7908,6717,16535,15718,13175,6486,13072,8753,17507,8875,6568,6478,12448,15598,13558)
First, let us calculate the values of the errors (book value less audit value):
(PER<-PBV-PAV)
## [1] 0 0 450 0 0 0 0 0 0 0 0 0 0 0 0
## [16] 0 0 0 2296 0
Let us make an Excel-like table:
Pro<-data.frame(NrOp,PBV,PAV,PER)
colnames(Pro)<-c("Operation", "Book Value","Correct Value", "Error")
Let’s check the sum of the book value (Book Value), the sum of the correct value (Correct Value), the sum of the errors (Error), the standard deviation of the errors (SD Error), and the ratio of the errors in the sample to the book value of the sample (SER). To build the totals we use the apply() function.
#sum(PBV)
#sum(PAV)
#sum(PER)
#sd(PER)
apply(Pro,2, sum)
## Operation Book Value Correct Value Error
## 38987 222160 219414 2746
Total<-apply(Pro,2, sum)
Total[5]<-sd(PER)
Total[6]<-sum(PER)/sum(PBV)
names(Total)<- c("Operation", "Book Value","Correct Value", "Error","SD Error","SER")
round(Total,0)
## Operation Book Value Correct Value Error SD Error
## 38987 222160 219414 2746 518
## SER
## 0
#install.packages("scales") # Install and load scales
library("scales")
percent(Total[6], accuracy = 0.01)
## SER
## "1.24%"
The initial sample gives us two pieces of information: the sample error rate and the standard deviation of the errors in the sample. These we treat as estimates of the parameters of the general population. The sample size for the audit can be calculated using the formula
\[ \begin{equation} n= (\frac{N*z*\delta_{e}}{TE-AE})^2 \tag{1} \label{eq} \end{equation} \]
Note that \(N\) is the number of all objects in the population, \(z\) is the quantile of the normal distribution for the confidence level resulting from the preliminary assessment of the detection risk (obtained as the product of the inherent risk and the risk of the control system), and \(\delta_{e}\) is the estimated standard deviation of errors. In the denominator we have the difference between the tolerable error (\(TE\), the materiality fraction) and the expected error (\(AE\)).
To be more specific: \(z\) is 1.282 (the coefficient corresponding to an 80% confidence level for the normal distribution, from the tables), the standard deviation of errors is €518, and \(TE\), the tolerable error, is 2% (the maximum materiality level determined in advance at planning) of the book value, i.e. 2% x €46,501,186 = €930,024. The initial sample gives a sample error rate of 1.24%. Furthermore, based on the experience of the previous year and the conclusions from the report on the management and control systems, the audit authority expects an error rate of no more than 1.24%. Therefore \(AE\), the expected error, is 1.24% of the total expenditure, i.e. 1.24% x €46,501,186 = €576,615:
\[ n = (\frac{3,852*1.282*518}{930,024-576,615})^2 \approx 53\] The previous initial sample of 20 is used as part of the main sample.
Let us check this calculation in R
n<-(N*qnorm(0.9)*sd(PER)/(0.02*Pop - Total[6]*Pop))^2
names(n)<-c("sample size")
round(n,0)
## sample size
## 52
The difference between the value calculated manually and the value calculated by the machine results from rounding.
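As a reusable sketch, formula (1) can be wrapped in a small function; the function and argument names below are illustrative, not part of any package:
# sample size according to formula (1); conf is the two-sided confidence level
sample_size <- function(N, sd_e, TE, AE, conf = 0.80) {
  z <- qnorm(1 - (1 - conf) / 2)   # e.g. qnorm(0.9) = 1.28 for an 80% confidence level
  (N * z * sd_e / (TE - AE))^2
}
sample_size(N = 3852, sd_e = 518, TE = 0.02 * Pop, AE = 0.0124 * Pop)   # about 52, as above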
Before we move on to calculating the results, let’s consider the parameter \(z\) =1.28 in formula (1).
Let us see graphically what the normal distribution looks like
x<-seq(-4,4, length=200)
# now we compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw a line plot
The entire area under the normal curve equals one:
\[ \int g(x)\,dx = 1 \] Given that we are dealing with the standardized normal distribution N(0,1), the area between -3 and 3 standard deviations from the mean (in our case the standard deviation is 1 and the mean is 0) covers about 99.7% of the area under the curve, namely:
x<-seq(-4,4, length=200)
# now we compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw a line plot
x<-seq(-3,3, length=100)
y<-dnorm(x,mean=0,sd=1)
polygon(c(-3,x,3),c(0,y,0),col="grey")
text(0,0.1,"99.7%")
We interpret this to mean that anything between -3 and 3 is very likely (99.7%) to belong to our population; if an object has a value less than -3 or greater than 3, the chance that it comes from this population is very small, about 0.003. This chance splits into two tails of about 0.0015 (0.15%) each.
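A quick numerical check of these probabilities with pnorm():
pnorm(3) - pnorm(-3)         # ~0.9973, the probability mass between -3 and 3
1 - (pnorm(3) - pnorm(-3))   # ~0.0027 left in the two tails together
1 - pnorm(3)                 # ~0.00135 in each single tail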
x<-seq(-4,4, length=200)
# now we compute the y values of the normal density
y<-dnorm(x,0,1)
plot(x,y, type ="l") # draw a line plot
x<-seq(-3,3, length=100)
y<-dnorm(x,mean=0,sd=1)
polygon(c(-3,x,3),c(0,y,0),col="grey")
text(0,0.1,"99.7%")
x1<-seq(-4,-3,length=50)
y1<-dnorm(x1,0,1)
polygon(c(-4,x1,-3),c(0,y1,0),col="red")
x2<-seq(3,4,length=50)
y2<-dnorm(x2,0,1)
polygon(c(3,x2,4),c(0,y2,0),col="red")
arrows(-3.1,0.3, -3,0, length=0.15)
text(-3, 0.32, "tails of about 0.15% each")
arrows(3.1,0.3, 3,0, length=0.15)   # point at the right-hand tail as well
Please note that the R function which, for a given probability (area under the density of the normal distribution), returns the corresponding value on the x-axis is qnorm(p), where p is a value or a vector. The probability p is measured from minus infinity up to the desired x.
x=seq(-3,3,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l")
x=seq(-3, 1.2833,length=100)
y=dnorm(x,mean=0,sd=1)
polygon(c(-3,x,1.2833),c(0,y,0),col="blue")
text(0,0.1,"0.90")
arrows(1.5,0.06,1.2,0,length=.15)
text(1.58,0.09,"1.28")
So let’s check how much qnorm(0.9) is: 1.2815516. Why?
So if we want a confidence interval of 80%, we supply a quantile of 90%. Why? Because a confidence interval at the 80% level cuts off 10% symmetrically on the right and on the left. Hence, if we take the segment from qnorm(.10) to qnorm(.90), we are left with a segment containing 80% of the probability, starting at -1.2815516 and ending at 1.2815516.
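This symmetry is easy to verify directly:
qnorm(0.90)                               # 1.281552
qnorm(0.10)                               # -1.281552
pnorm(qnorm(0.90)) - pnorm(qnorm(0.10))   # 0.8, i.e. 80% of the mass lies between the two quantiles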
x=seq(-3,3,length=200)
y=dnorm(x,mean=0,sd=1)
plot(x,y,type="l")
x1=seq(-1.28, 1.2833,length=500)
y=dnorm(x1,mean=0,sd=1)
polygon(c(-1.28,x1,1.2833),c(0,y,0),col="blue")
text(0,0.1,"0.80")
arrows(1.5,0.06,1.2,0,length=.15)
text(1.58,0.09,"1.28")
#arrows(-1.5,0.06,-1.2,0,length=.15)
#text(-2.8,0.47,-2.6,0.44,"-1.28")
Hence, we needed the information from the inherent risk and control analysis to assess the confidence level at which we should work in order to keep the audit risk at 5%.
Low Control Assurance: considering the desirable and acceptable audit risk of 5%, if the inherent risk (100%) and the control risk (50%) are high, we are dealing with a high-risk entity whose internal control procedures are not adequate to manage the risk, and the auditor should aim for a very low detection (omission) risk of 10%. To achieve a low detection risk, the amount of substantive testing, and therefore the sample size, must be large.
\[ DR = \frac{AR}{IR \times CR} = \frac{0.05}{1 \times 0.5} = 0.1 \]
Which translates to a 90% confidence interval [1-0.1].
High degree of control confidence: in another context, where the inherent risk is high (100%) but adequate controls exist, the control risk can be estimated at 12.5%. To keep the audit risk at the 5% level, the detection risk may then be 40%, meaning that the auditor can take on more risk and reduce the sample size. This will ultimately mean a less detailed and less costly audit.
\[ DR = \frac{AR}{IR \times CR} = \frac{0.05}{1 \times 0.125} = 0.4 \]
Which translates into a 60% confidence interval (1-0.4)
Note that both examples result in the same achieved audit risk of 5%, despite the different control environments.
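A minimal sketch of the audit risk model in R (the function name is illustrative):
# detection risk implied by the audit risk model DR = AR / (IR * CR)
detection_risk <- function(AR, IR, CR) AR / (IR * CR)
detection_risk(AR = 0.05, IR = 1, CR = 0.5)     # 0.1 -> work at a 90% confidence level
detection_risk(AR = 0.05, IR = 1, CR = 0.125)   # 0.4 -> work at a 60% confidence level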
To summarize: the parameter \(z\) reflects the expert’s assessment of the confidence level necessary to keep the audit risk at the 5% level, while the precision, i.e. the difference between \(TE\) and \(AE\), is in effect the risk appetite.
Let’s return to sample selection. Our initial sample was 20 items; after estimating the sample size based on the parameters of that initial sample, we obtained a final sample size of 52 operations for inference at the 80% confidence level.
Therefore, the auditor has to randomly select only 32 more operations. In the following, we load the full sample with an additional 32 cost items (52 in total) and the results of the valuation test:
Here we enter three variables: the observation index (e.g. invoice number), the book value (BV) recorded in the accounts, and the audit value (AV). The values could be loaded from files; here we enter them with the c() vector command, and these are the values:
Index = 74,98,120,153,223,246,542,554,587,915,1014,1114,1156,1325, 1403,1453,1577,1621,1624,1631,1649,1650,1678,1687,1712,1729,1730,1744, 1744,1767,1774,1796,1796,1817,1821,1828,1850,1879,1888,1898,1909,1926, 1948,1949,1958,1963,1967,1980,1990,3749,3770, 4001.
Book value BV = 9093,13054,10758,16194,11662,16331,8714,8645,9297,7999,11906, 15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278, 28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827, 22714.00027,15556.03318,20436.71763,16999.19228,1111,1231,151,16336.08554, 17021.65133,29527.18354,888,1200,771,1200,150,29017.91656,1765,12110, 200,331,24383.76591,10171, 50,22
and Audit value AV that is: 9093,13054,10758,16194,11662,16331,8264,8645,9297,7999,11906,15505,7908,6717, 9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552, 18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027, 15556.03318,17556.71763,16100,2059,330,151,16336.08554,17021.65133,29000, 700,750,771,1200,300,27217.91656,1765,10110,1200,331,24383.76591,10171, 50, 22.
Index<- c(74,98,120,153,223,246,542,554,587,915,1014,1114,1156,1325,1403,1453,1577,1621,1624,1631,1649,1650,1678,1687,1712,1729,1730,1744,1744,1767,1774,1796,1796,1817,1821,1828,1850,1879,1888,1898,1909,1926,1948,1949,1958,1963,1967,1980,1990,3749, 3770,4001)
BV<-c(9093,13054,10758,16194,11662,16331,8714,8645,9297,7999,11906,15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027,15556.03318,20436.71763,16999.19228,1111,1231,151,16336.08554,17021.65133,29527.18354,888,1200,771,1200,150,29017.91656,1765,12110,200,331,24383.76591,10171,50,22)
AV<-c(9093,13054,10758,16194,11662,16331,8264,8645,9297,7999,11906,15505,7908,6717,9730,16535,17723,16095,15171,17475.74724,26183.25954,19947.07278,28549.02552,18385.60803,20137.78032,30799.52241,23149.03838,24644.80827,22714.00027,15556.03318,17556.71763,16100,2059,330,151,16336.08554,17021.65133,29000,700,750,771,1200,300,27217.91656,1765,10110,1200,331,24383.76591,10171,50,22)
Let us make a dataframe
Val<- data.frame(Index,BV,AV)
Let us complete our frame with the error value, Val$ER <- (BV - AV), and check: the standard deviation of the errors is 603.33 and their sum is 7997.38.
And let’s view this data frame using the View() function:
Val$ER<-(BV-AV)
View(Val)
Now that we have the entire sample selected, we will compute the point error in the sample, project the point error onto the population, determine the confidence interval of the population error, and compare it to our tolerable error to infer whether or not the population contains a material error. In view of the above:
round(apply(Val,2,sum),2)
## Index BV AV ER
## 81977.00 661652.41 653655.03 7997.38
The total book value of the 52 sampled operations is €661,652. The total amount of error in the sample is €7,997. This amount, divided by the sample size of 52, gives the mean error per operation in the sample, i.e. 153.8. When using mean-per-unit estimation, the projection of the error onto the population is calculated by multiplying this average error by the population size \(N\), in this example €592,421. This number represents the projected error at the level of the entire population of costs (expenses) under study. Let us call it \(EE_{1}\); its value is:
EEP<-round((sum(Val$ER)/length(Val$ER))*N,0)
EEP
## [1] 592421
Please note that if we had chosen the sample differently, the error values would have been different. As an example, let us draw from our final sample another subsample of 20 items and see what the target sample size would then be.
Pr2<-Val[sample(nrow(Val),20),]
Pr2
## Index BV AV ER
## 33 1796 1111.00 2059.00 -948.0000
## 11 1014 11906.00 11906.00 0.0000
## 42 1926 1200.00 1200.00 0.0000
## 30 1767 15556.03 15556.03 0.0000
## 6 246 16331.00 16331.00 0.0000
## 23 1678 28549.03 28549.03 0.0000
## 17 1577 17723.00 17723.00 0.0000
## 9 587 9297.00 9297.00 0.0000
## 5 223 11662.00 11662.00 0.0000
## 22 1650 19947.07 19947.07 0.0000
## 12 1114 15505.00 15505.00 0.0000
## 25 1712 20137.78 20137.78 0.0000
## 13 1156 7908.00 7908.00 0.0000
## 10 915 7999.00 7999.00 0.0000
## 28 1744 24644.81 24644.81 0.0000
## 29 1744 22714.00 22714.00 0.0000
## 32 1796 16999.19 16100.00 899.1923
## 48 1980 331.00 331.00 0.0000
## 20 1631 17475.75 17475.75 0.0000
## 36 1828 16336.09 16336.09 0.0000
round(apply(Pr2,2,sum),2)
## Index BV AV ER
## 28084.00 283332.75 283381.55 -48.81
#let us compare with the original sample
round(apply(Pro,2,sum),2)
## Operation Book Value Correct Value Error
## 38987 222160 219414 2746
The standard deviation of errors in the first sample was 517.9458517 and in the second sample 299.7482845, so their difference is 218.1975672. After substituting the new values into equation (1) we obtain a new number of elements needed in the sample, namely:
n2<-(N*qnorm(0.9)*sd(Pr2$ER)/(0.02*Pop - sum(Pr2$ER)/sum(Pr2$BV)*Pop))^2
names(n2)<-c("size of the second sample")
round(n2,0)
## size of the second sample
## 2
In practice this means that if we drew a different sample we would get a different point estimate of the errors. The question is what the largest error value could be. Here again we return to probabilistic thinking. This estimate will lie at the border of the specified confidence interval, i.e. to our expected value we have to add the right-hand (intuitively: positive) end of the confidence interval. As a consequence, to our mean (point estimate) we add as many standard errors as are needed to reach the right end of the interval (in our situation, for 80% coverage, our \(z\) equals 1.28). Consequently \[SE_{1}= N*z*\frac{s_{e}}{\sqrt{n}}\]
but note that our \(s_{e}\) is not the population standard deviation itself, but an estimator of the population deviation of errors based on the sample data, calculated as follows:
\[ s_{e}= \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (E_{i} - \bar{E})^2 } \]
where \(E_{i}\) is the error for an individual expenditure and \(\bar{E}\) is the average error in our sample. The calculated value of \(s_{e}\):
SE_pop<- sd(Pro$Error)
names(SE_pop)<-"Estimated deviation of errors in the general population"
SE_pop
## Estimated deviation of errors in the general population
## 517.9459
so the precision (the half-width of the confidence interval for the population error) is
SE1<- N*qnorm(.9)*SE_pop/sqrt(n)
names(SE1)<-"Precision"
SE1
## Precision
## 355247.6
and then the upper limit of our error, with 80% certainty, will be the sum \(EE_{1} + SE_{1}\), hence €947,669.
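A one-line check using the objects computed above (EEP and SE1):
EEP + SE1   # projected error plus precision, about 947,669 EUR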
Finally, the auditor compares the materiality threshold of 2% of the total book value of the programme (2% x €46,501,186 = €930,024) with the projected error, and concludes that the projected error is lower than the maximum tolerable error, but the upper limit of error is higher than the maximum tolerable error. The auditor may conclude that there is insufficient evidence that the population is not materially misstated. Hence, further testing needs to be done (e.g., by increasing the sample).
The difference between random and haphazard sampling