Date: 05/06/2022

To: The Univesity of Wyoming

From: Md Ashraful Islam Bhuiya

Subject: Estimation of the required number of testings to estimate the prevalence of the asymptomatic disease in the campus population as a whole

We analyzed the previous data to learn about the prevalence of the asymptomatic disease on the University of Wyoming (UW) campus and used a probability model to understand the effects of the number of tests on the level of uncertainty of the prevalence. Surveillance testing informs us if an asymptomatic person is positive or negative for the disease. As the outcome from testing is either one of two possibilities (positive or negative), and beta distribution ((the probability of positive cases)) tells us the probability of prevalence given the data, we analyzed the data based on the beta distribution model. In this study, we used a probability model to estimate how the number of tests relates to the uncertainty of the prevalence of asymptomatic cases. We found that ~ 150 to ~250 testings per week give an estimate of the prevalence with a fairly low level of uncertainty in the current situation.

1. Estimated prevalence throughout the semester including the estimated level of uncertainty.

covid <- read.csv("uw_covidsurv_2022_updated.csv") # Load the dataset
covid <- covid[-10, ]
mean.prevalence <- (covid$NumberPositive + 1) / ( covid$NumberPositive + 1 + covid$NumberTested - covid$NumberPositive + 1 ) # Calculate mean prevalence using the equation for beta distribution.

plot(mean.prevalence,      # plot mean prevalence.
     yaxt="n",             # remove the tick labels of the Y axis.
     type="o", 
     xlab="week", 
     ylim = c(0, 0.16), 
     ylab = "Mean prevalence with uncertainty (95% ETPI)", 
     lwd=2, 
     main="Graph 1: Estimated prevalence with estimated level of uncertainty")
axis(2, at = seq(0.0, 0.18, by=0.01), las=1) # add specified tick labels to the Y axis.
abline(h=0.05, col="magenta", lwd=1.5)       # add horizontal line in the graph.
text(10, 0.05, "Prevalence= 5%", pos=3)      # add text at the specific location in the graph.
 
segments(1:nrow(covid), qbeta(0.975, covid$NumberPositive + 1, covid$NumberTested - covid$NumberPositive +1), 1:nrow(covid), qbeta(0.025, covid$NumberPositive + 1, covid$NumberTested - covid$NumberPositive +1), col = "blue", lwd=2) # add line segments for the estimated uncertainty of the prevalence for each time point. Uncertainty is estimated using qbeta() function for 95% equal tail probability interval.  

In graph 1, we observe the mean prevalence (indicated by black circles) of the disease each week throughout the semester. Blue lines indicate the range of uncertainty of the given prevalence within the 95% equal tail probability interval (ETPI) (from here onwards, I will refer to this as the level of uncertainty) at each time point.The magenta line indicates 5% prevalence. The maximum prevalence was found at week 3 when the estimated prevalence was 12.4%, and from the estimated level of uncertainty, we are 95% confident that the uncertainty for this prevalence is only ~0.05 (~0.10 - ~0.15). For the estimated prevalence of other weeks, the level of uncertainty is ~ 0.05 or less. For this low level of uncertainty, we are confident enough about the prevalence in each week as shown in the above graph. We can clearly distinguish the prevalence from one week to another. We see that the highest prevalence was at week 3 and the lowest was at week 7. The prevalence increased a bit at week 14 compared to week 7 (~1% to ~4%). In subsequent analyses, we will use a probability model (beta distribution model) to understand how the prevalence and sample size (number of total tests) affects the level of uncertainty.

2. Uncertainty regarding the numbner of positive of cases in the campus population.

data_250 <- data.frame(tested= 250,  # create a dataframe for 250 tests.
                      positive = 0:250)
plot(data_250$positive, (qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1)), # plotting uncertainty vs prevalence. 
     yaxt="n",
     type = 'l', 
     lwd=2, 
     col="orange", 
     xlab = "Number of positive cases out of 250 tests", ylab = "Uncertainty (95% ETPI)", 
     main = "Graph 2: Uncertainty vs positive cases")
axis(2, at = seq(0.0, 0.13, by=0.01), las=1 ) # add defined Y axis

points(3, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), col='blue', pch=16, cex=1.5)                        # add points to the specific location of the graph. 
text(3, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1))+0.004, "3 '+ve' cases", cex=0.8, pos= 4, col= "black") # add texts at the specific location of the graph.
points(247, (qbeta(0.975, 247+1, 250-247+1)-qbeta(0.025, 247+1, 250-247+1)), col='blue', pch=16, cex=1.5)
text(247, (qbeta(0.975, 247+1, 250-247+1)-qbeta(0.025, 247+1, 250-247+1))+0.004, "247 '+ve' cases", cex=0.8, pos= 2, col= "black")
abline(h=(qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), col="purple", lwd=1.5)                               # add horizontal line based on estimated uncertainty.
text(125, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), "Uncertainty ~ 0.03", cex=0.8, pos= 1, col= "black")

points(75, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), col='brown', pch=16, cex=1.5)
text(75, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1))+0.004, "75 '+ve' cases", cex=0.8, pos= 2, col= "black")
points(175, (qbeta(0.975, 175+1, 250-175+1)-qbeta(0.025, 175+1, 250-175+1)), col='brown', pch=16, cex=1.5)
text(175, (qbeta(0.975, 175+1, 250-175+1)-qbeta(0.025, 175+1, 250-175+1))+0.004, "175 '+ve' cases", cex=0.8, pos= 4, col= "black")
abline(h=(qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), col="magenta", lwd=1.5) 
text(125, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), "Uncertainty > 0.11", cex=0.8, pos= 1, col= "black")

Graph 2 shows how the level of uncertainty changes according to different prevalence (different number of positive tests out of certain number of tests). Here, uncertainty levels are plotted (orange line) against 0% to 100% prevalence (0 to 250 positive cases out of 250 tests). Blue circles indicate the level of uncertainty for 3 and 247 positive tests. Magenta circles indicate the level of uncertainty for 75 and 175 positive cases. Purple line indicates the level of uncertainty ~ 0.03 and magenta line indicates the level of uncertainty >0.11. We find that with the increasing number of positive cases, the level of uncertainty is also increased. If half of the total samples tested are positive, the level of uncertainty is the highest at that point. However, if more than half of the samples tested are positive, the level of uncertainty again starts to decrease. After that mid point, the level of uncertainty continue to decrease as more samples are found positive. As examples, for 3 and 247 positive cases (near the left and right ends of the graph) out of 250 tests, the level of uncertainty is only ~0.03 (shown with purple line), whereas, for 75 and 175 positive cases (around the middle point of the graph), the level of uncertainty is >0.11 (shown with magenta line).

3. Uncertaintly is also affected by the sample size to some extent.

# create different data frames of different number of tests. 

data_1_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.01*(seq(100, 2000, by=5)))
                      
data_3_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.03*(seq(100, 2000, by=5)))

data_5_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.05*(seq(100, 2000, by=5)))

data_25_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.25*(seq(100, 2000, by=5)))

data_50_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.5*(seq(100, 2000, by=5)))

data_100_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 1*(seq(100, 2000, by=5)))
                
# Plot data from above data frames for uncertainty vs sample size (number of tests).  

plot(seq(100, 2000, by=5), (qbeta(0.975, data_1_percent$positive + 1, data_1_percent$tested - data_1_percent$positive +1)- qbeta(0.025, data_1_percent$positive + 1, data_1_percent$tested - data_1_percent$positive +1)), 
     xaxt="n",
     yaxt= "n",
     type = 'l', 
     lwd=2, 
     ylim = c(0, 0.2), 
     col="green", 
     xlab = "Number of tests (sample size)", 
     ylab = "Uncertainty (95% ETPI)",
     main = "Graph 3: Uncertainty vs sample sizes and prevalence")
axis(1, at = seq(100, 2000, by=100), las=2 )
axis(2, at = seq(0.0, 0.2, by=0.01), las=1 )
legend(1500,0.2,
  title = "Prevalence",
  legend = c("0.5", "0.25", "0.05", "0.03", "0.01"),
  lty = 1,
  lwd = 2,
  col=c("brown", "red", "orange", "blue", "green"),
  cex= 0.8)

# add line in the graph for uncertainty vs number of tests for different prevalence. The following lines were added in the similar way. 

lines(data_3_percent$tested, (qbeta(0.975, data_3_percent$positive + 1, data_3_percent$tested - data_3_percent$positive +1)- qbeta(0.025, data_3_percent$positive + 1, data_3_percent$tested - data_3_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="blue") 

lines(data_5_percent$tested, (qbeta(0.975, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)- qbeta(0.025, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="orange")

lines(data_25_percent$tested, (qbeta(0.975, data_25_percent$positive + 1, data_25_percent$tested - data_25_percent$positive +1)- qbeta(0.025, data_25_percent$positive + 1, data_25_percent$tested - data_25_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="red")

lines(data_50_percent$tested, (qbeta(0.975, data_50_percent$positive + 1, data_50_percent$tested - data_50_percent$positive +1)- qbeta(0.025, data_50_percent$positive + 1, data_50_percent$tested - data_50_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="brown")

# add line segments and texts for a certain number of tests. 

segments(300, 0.01, 300, 0.06, col="purple", lwd = 2)
segments(700,0.05, 700, 0.10, col="magenta", lwd = 2)
text(360, 0.06, "Tests = 300", srt=90, pos=3, cex=0.55)
text(755, 0.09, "Tests = 700", srt=90, pos=3, cex=0.55)

Graph 3 shows how uncertainty changes according to different number of tests and different prevalence. Legend in the top right side of the graph indicates specific colors of lines corresponding to specific prevalence. Purple line segment indicates sample size = 300 and magenta line segment indicates sample size= 700. We observe that the level of uncertainty is clearly decreased as the sample size increases for a certain prevalence. However, after some point, the changes in uncertainty become very subtle as the sample size increases further.For prevalence of 1% to 5%, the changes in uncertainty become subtle between the number of tests ~ 250 to 350. So, we can suggest that if the prevalence is low (0.01 - 0.05), number of tests from 250 to 350 is enough to estimate the prevalence within overall population with confidence (with very low level of uncertainty). Same for the sample size 650 to 750 for the prevalence between 25% to 50%. As we can see from graph 1, most of the time of the semester, especially in the recent time, the prevalence is between 0% to <5%. Therefore, the current level of sampling of the UW population (~3%, 250-500 people) should be sufficient enough to track changes in disease prevalence and inform policy.

4. An estimation of an optimum sample size given the level of prevalence.

# Create different data frames of different sample sizes. 

data_100 <- data.frame(tested= 100,
                      positive = 0:100)

data_250 <- data.frame(tested= 250,
                      positive = 0:250)

data_350 <- data.frame(tested= 350,
                      positive = 0:350)

data_500 <- data.frame(tested= 500,
                      positive = 0:500)

data_1000 <- data.frame(tested= 1000,
                      positive = 0:1000)

# Plot data from the above data frame for Uncertainty vs Prevalence for sample size=250.  
                    
plot((0:250)/250, (qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1)), 
     xaxt="n",
     yaxt= "n",
     type = 'l', 
     col="Blue",
     lwd=2,
     ylim = c(0.0, 0.2),
     xlab = "Prevalence", 
     ylab = "Uncertainty (95% ETPI)", 
     main = "Graph 4: Sample size affects uncertainty as a function of prevalences")

#specify the tick labels of X and Y axis manually.
axis(1, at = seq(0, 1, by=0.1), las=2 )
axis(2, at = seq(0, 0.2, by=0.01), las=1 )

text(250/500, max((qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1))), "Samples tested=250", cex = 0.7, pos = 1, col = "blue") # Add text at the specific location in the graph.
abline(v=0.124, col="purple", lwd=1)           # add a vertical line.
text(0.124, 0.2, "Prevalence= 12.4%", col = "purple", cex=0.7, pos= 2, srt= 90, lwd=1.5)
abline(v=0.05, col= "black")
text(0.04, 0.15, "Prevalence= 5%", col = "black", cex=0.7, pos= 2, srt= 90, lwd=1.5)

# Add lines to the graph for Uncertainty vs Prevalence for sample sizes 100, 350, 500, 1000.

lines((0:100)/100, (qbeta(0.975, data_100$positive + 1, data_100$tested - data_100$positive +1)- qbeta(0.025, data_100$positive + 1, data_100$tested - data_100$positive +1)), type = 'l', col="magenta", lwd=2)
text(250/500, max((qbeta(0.975, data_100$positive + 1, data_100$tested - data_100$positive +1)- qbeta(0.025, data_100$positive + 1, data_100$tested - data_100$positive +1))), "Samples tested=100", cex = 0.7, pos = 1, col = "magenta")

lines((0:350)/350, (qbeta(0.975, data_350$positive + 1, data_350$tested - data_350$positive +1)- qbeta(0.025, data_350$positive + 1, data_350$tested - data_350$positive +1)), type = 'l', col="orange", lwd=2)
text(250/500, max((qbeta(0.975, data_350$positive + 1, data_350$tested - data_350$positive +1)- qbeta(0.025, data_350$positive + 1, data_350$tested - data_350$positive +1))), "Samples tested=350", cex = 0.7, pos = 1, col = "orange")

lines((0:500)/500, (qbeta(0.975, data_500$positive + 1, data_500$tested - data_500$positive +1)- qbeta(0.025, data_500$positive + 1, data_500$tested - data_500$positive +1)), type = 'l', col="red", lwd=2)
text(250/500, max((qbeta(0.975, data_500$positive + 1, data_500$tested - data_500$positive +1)- qbeta(0.025, data_500$positive + 1, data_500$tested - data_500$positive +1))), "Samples tested=500", cex = 0.7, pos = 1, col = "red")

lines((0:1000)/1000, (qbeta(0.975, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)- qbeta(0.025, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)), type = 'l', col="brown", lwd=2)
text(250/500, max((qbeta(0.975, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)- qbeta(0.025, data_1000$positive + 1, data_1000$tested - data_1000$positive +1))), "Samples tested=1000", cex = 0.7, pos = 1, col = "brown")

Graph 4 shows how the sample size affects the level of uncertainty when estimated by accounting all prevalence. Here brown, red, orange, blue, and magenta lines indicate uncertainty for sample sizes 1000, 500, 350, 250, and 100 respectively. The purple vertical line indicates the prevalence of 12.4% and black vertical line indicates the prevalence of 5%. Overall, as the sample size is increased, the uncertainty is decreased. If we consider the original prevalence throughout the semester, the highest prevalence was 12.4%. At this point the highest level of uncertainty is ~0.13 for the sample size 100 and the lowest is ~0.04 for sample size 1000 as indicated in the graph. So, with more 900 individual being tested, the uncertainty is reduced by only 0.09 when the prevalence is 12.4%. The current prevalence is much lower (only <5%), which itself reduces the uncertainty significantly as indicated in graph 2, you will find it also in graph 4. Considering the above phenomena that 900 more test reduces the uncertainty only 0.09 at the time of highest (12.4%) prevalence and current situation of lower prevalence (~ only 5%), I would not advise to sample more individuals if the prevalence is not significantly higher than now. I would suggest to tests even less samples (tests = ~150 - 250) than now (tests=250-350) given the current prevalence. However, based on the above graph, if the prevalence become ~15% (which never happened during the semester, so highly unlikely) , I will suggest to increase the sample size up-to 300 - 350 (~ 100 - 150 more samples) to keep the level of uncertainty very low (<0.1). If the prevalence is higher than this, the campus may be closed to keep the spreading of disease under control.

5. Final recommendation

From the discussion above based on our probability model, we understand how prevalence and sample size may affect the level of uncertainty and their co-relationship. As a recap, for very low and high prevalence, the level of uncertainty is very low compared to around the middle value (50%) of the prevalence (graph 2 and 4) for any given number of sample size. On the other hand, after a certain number of sample size, the change in uncertainty become very low even for the mid-prevalence (prevalence =50%, when uncertainty is the highest for a given sample size)(graph 3 and 4). Which means regardless of how many more samples you tests further, the level of uncertainty won’t change noticeably. However, the scale (numbers of individuals to be tested each week) depends on how precisely you want to estimate the prevalence, Or in other words, how much uncertainty you want to tolerate for a given prevalence. Lets see more precisely in the following table how scale are related to precision (how low the uncertainty is).

library(kableExtra) # load library kableExtra for kable() function.

# create data frame as above.
data_5_percent <- data.frame(tested= seq(50, 1000, by=100),
                      positive = 0.05*(seq(50, 1000, by=100)))

data_5_percent$uncertainty <- (qbeta(0.975, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)- qbeta(0.025, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1))

data_15_percent <- data.frame(tested= seq(50, 1000, by=100),
                      positive = 0.15*(seq(50, 1000, by=100)))
data_15_percent$uncertainty <- (qbeta(0.975, data_15_percent$positive + 1, data_15_percent$tested - data_15_percent$positive +1)- qbeta(0.025, data_15_percent$positive + 1, data_15_percent$tested - data_15_percent$positive +1))

data_compare <- data.frame(Tests= data_15_percent$tested,   
                           Uncertainty_lowPreval. = round(data_5_percent$uncertainty, digits = 2),
                           Uncertainty_highPreval. = round(data_15_percent$uncertainty, digits = 2)
                           )

tbl1<- kable(data_compare, caption = "Table 1: An example of precision (how low the uncertainty is) vs sample size(Tests). lowPreval.= 5% prevalence, highPreval. = 15% prevalence") # convert the data frame into a customizable table. 
column_spec(tbl1, 1:3, width = "2in") # customize the table with a caption and defined width of columns. 
Table 1: An example of precision (how low the uncertainty is) vs sample size(Tests). lowPreval.= 5% prevalence, highPreval. = 15% prevalence
Tests Uncertainty_lowPreval. Uncertainty_highPreval.
50 0.13 0.20
150 0.07 0.11
250 0.06 0.09
350 0.05 0.07
450 0.04 0.07
550 0.04 0.06
650 0.03 0.05
750 0.03 0.05
850 0.03 0.05
950 0.03 0.05

In general, as we see from the table 1 (also from graph 4), higher the sample size, more precise the estimation of the prevalence would be, however, after a certain number of tests (for example, 650 tests, in table 1) the change in the level of uncertainty does not noticeable. Therefore, even to estimate the prevalence very precisely you don’t need to perform unlimited number of tests for any given prevalence. As you see in the above table, you can calculate exact scales based on prevalence and expected precision level and then decide at which scale you want to perform surveillance tests.

In my opinion, for the current low prevalence (< 5%, see graph 1) in the UW campus, testing from ~150 to ~ 250 people each week even less than that to some extent is sufficient to estimate the prevalence with uncertainty only ~0.07 or less (table1, and graph 3). Final decision on scale depends on the expectation of the UW testing program as discussed earlier.

---
output:
  html_notebook:
    code_folding: hide
  html_document:
    df_print: paged
---

#### Date: 05/06/2022

#### To:    The Univesity of Wyoming

#### From:  Md Ashraful Islam Bhuiya

### _Subject: ***Estimation of the required number of testings to estimate the prevalence of the asymptomatic disease in the campus population as a whole*** _

We analyzed the previous data to learn about the prevalence of the asymptomatic disease on the University of Wyoming (UW) campus and used a probability model to understand the effects of the number of tests on the level of uncertainty of the prevalence. Surveillance testing informs us if an asymptomatic person is positive or negative for the disease. As the outcome from testing is either one of two possibilities (positive or negative), and beta distribution ((the probability of positive cases)) tells us the probability of prevalence given the data, we analyzed the data based on the beta distribution model. In this study, we used a probability model to estimate how the number of tests relates to the uncertainty of the prevalence of asymptomatic cases. We found that ~ 150 to ~250 testings per week give an estimate of the prevalence with a fairly low level of uncertainty in the current situation. 

### 1. Estimated prevalence throughout the semester including the estimated level of uncertainty.
```{r setup, include=FALSE }
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_knit$set(root.dir = "G:/My Drive/Courses/Spring2022/Computational_biology/Project-4_Surveillance_testing_of_asymptomatic_disease")
getwd()
```

```{r, fig.height=5}
covid <- read.csv("uw_covidsurv_2022_updated.csv") # Load the dataset
covid <- covid[-10, ]
mean.prevalence <- (covid$NumberPositive + 1) / ( covid$NumberPositive + 1 + covid$NumberTested - covid$NumberPositive + 1 ) # Calculate mean prevalence using the equation for beta distribution.

plot(mean.prevalence,      # plot mean prevalence.
     yaxt="n",             # remove the tick labels of the Y axis.
     type="o", 
     xlab="week", 
     ylim = c(0, 0.16), 
     ylab = "Mean prevalence with uncertainty (95% ETPI)", 
     lwd=2, 
     main="Graph 1: Estimated prevalence with estimated level of uncertainty")
axis(2, at = seq(0.0, 0.18, by=0.01), las=1) # add specified tick labels to the Y axis.
abline(h=0.05, col="magenta", lwd=1.5)       # add horizontal line in the graph.
text(10, 0.05, "Prevalence= 5%", pos=3)      # add text at the specific location in the graph.
 
segments(1:nrow(covid), qbeta(0.975, covid$NumberPositive + 1, covid$NumberTested - covid$NumberPositive +1), 1:nrow(covid), qbeta(0.025, covid$NumberPositive + 1, covid$NumberTested - covid$NumberPositive +1), col = "blue", lwd=2) # add line segments for the estimated uncertainty of the prevalence for each time point. Uncertainty is estimated using qbeta() function for 95% equal tail probability interval.  
```
In graph 1, we observe the mean prevalence (indicated by black circles) of the disease each week throughout the semester. Blue lines indicate the range of uncertainty of the given prevalence within the 95% equal tail probability interval (ETPI) (from here onwards, I will refer to this as the level of uncertainty) at each time point.The magenta line indicates 5% prevalence. The maximum prevalence was found at week 3 when the estimated prevalence was 12.4%, and from the estimated level of uncertainty, we are 95% confident that the uncertainty for this prevalence is only ~0.05 (~0.10 - ~0.15). For the estimated prevalence of other weeks, the level of uncertainty is ~ 0.05 or less. For this low level of uncertainty, we are confident enough about the prevalence in each week as shown in the above graph. We can clearly distinguish the prevalence from one week to another. We see that the highest prevalence was at week 3 and the lowest was at week 7. The prevalence increased a bit at week 14 compared to week 7 (~1% to ~4%). In subsequent analyses, we will use a probability model (beta distribution model) to understand how the prevalence and sample size (number of total tests) affects the level of uncertainty.  


### 2. Uncertainty regarding the numbner of positive of cases in the campus population.
```{r, fig.height=5}
data_250 <- data.frame(tested= 250,  # create a dataframe for 250 tests.
                      positive = 0:250)
plot(data_250$positive, (qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1)), # plotting uncertainty vs prevalence. 
     yaxt="n",
     type = 'l', 
     lwd=2, 
     col="orange", 
     xlab = "Number of positive cases out of 250 tests", ylab = "Uncertainty (95% ETPI)", 
     main = "Graph 2: Uncertainty vs positive cases")
axis(2, at = seq(0.0, 0.13, by=0.01), las=1 ) # add defined Y axis

points(3, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), col='blue', pch=16, cex=1.5)                        # add points to the specific location of the graph. 
text(3, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1))+0.004, "3 '+ve' cases", cex=0.8, pos= 4, col= "black") # add texts at the specific location of the graph.
points(247, (qbeta(0.975, 247+1, 250-247+1)-qbeta(0.025, 247+1, 250-247+1)), col='blue', pch=16, cex=1.5)
text(247, (qbeta(0.975, 247+1, 250-247+1)-qbeta(0.025, 247+1, 250-247+1))+0.004, "247 '+ve' cases", cex=0.8, pos= 2, col= "black")
abline(h=(qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), col="purple", lwd=1.5)                               # add horizontal line based on estimated uncertainty.
text(125, (qbeta(0.975, 3+1, 250-3+1)-qbeta(0.025, 3+1, 250-3+1)), "Uncertainty ~ 0.03", cex=0.8, pos= 1, col= "black")

points(75, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), col='brown', pch=16, cex=1.5)
text(75, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1))+0.004, "75 '+ve' cases", cex=0.8, pos= 2, col= "black")
points(175, (qbeta(0.975, 175+1, 250-175+1)-qbeta(0.025, 175+1, 250-175+1)), col='brown', pch=16, cex=1.5)
text(175, (qbeta(0.975, 175+1, 250-175+1)-qbeta(0.025, 175+1, 250-175+1))+0.004, "175 '+ve' cases", cex=0.8, pos= 4, col= "black")
abline(h=(qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), col="magenta", lwd=1.5) 
text(125, (qbeta(0.975, 75+1, 250-75+1)-qbeta(0.025, 75+1, 250-75+1)), "Uncertainty > 0.11", cex=0.8, pos= 1, col= "black")
```
Graph 2 shows how the level of uncertainty changes according to different prevalence (different number of positive tests out of certain number of tests). Here, uncertainty levels are plotted (orange line) against 0% to 100% prevalence (0 to 250 positive cases out of 250 tests). Blue circles indicate the level of uncertainty for 3 and 247 positive tests. Magenta circles indicate the level of uncertainty for 75 and 175 positive cases. Purple line indicates the level of uncertainty ~ 0.03 and magenta line indicates the level of uncertainty >0.11. We find that with the increasing number of positive cases, the level of uncertainty is also increased. If half of the total samples tested are positive, the level of uncertainty is the highest at that point. However, if more than half of the samples tested are positive, the level of uncertainty again starts to decrease. After that mid point, the level of uncertainty continue to decrease as more samples are found positive. As examples, for 3 and 247 positive cases (near the left and right ends of the graph) out of 250 tests, the level of uncertainty is only ~0.03 (shown with purple line), whereas, for 75 and 175 positive cases (around the middle point of the graph), the level of uncertainty is >0.11 (shown with magenta line).  

### 3. Uncertaintly is also affected by the sample size to some extent. 
```{r, fig.height=5}
# create different data frames of different number of tests. 

data_1_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.01*(seq(100, 2000, by=5)))
                      
data_3_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.03*(seq(100, 2000, by=5)))

data_5_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.05*(seq(100, 2000, by=5)))

data_25_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.25*(seq(100, 2000, by=5)))

data_50_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 0.5*(seq(100, 2000, by=5)))

data_100_percent <- data.frame(tested= seq(100, 2000, by=5),
                      positive = 1*(seq(100, 2000, by=5)))
                
# Plot data from above data frames for uncertainty vs sample size (number of tests).  

plot(seq(100, 2000, by=5), (qbeta(0.975, data_1_percent$positive + 1, data_1_percent$tested - data_1_percent$positive +1)- qbeta(0.025, data_1_percent$positive + 1, data_1_percent$tested - data_1_percent$positive +1)), 
     xaxt="n",
     yaxt= "n",
     type = 'l', 
     lwd=2, 
     ylim = c(0, 0.2), 
     col="green", 
     xlab = "Number of tests (sample size)", 
     ylab = "Uncertainty (95% ETPI)",
     main = "Graph 3: Uncertainty vs sample sizes and prevalence")
axis(1, at = seq(100, 2000, by=100), las=2 )
axis(2, at = seq(0.0, 0.2, by=0.01), las=1 )
legend(1500,0.2,
  title = "Prevalence",
  legend = c("0.5", "0.25", "0.05", "0.03", "0.01"),
  lty = 1,
  lwd = 2,
  col=c("brown", "red", "orange", "blue", "green"),
  cex= 0.8)

# add line in the graph for uncertainty vs number of tests for different prevalence. The following lines were added in the similar way. 

lines(data_3_percent$tested, (qbeta(0.975, data_3_percent$positive + 1, data_3_percent$tested - data_3_percent$positive +1)- qbeta(0.025, data_3_percent$positive + 1, data_3_percent$tested - data_3_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="blue") 

lines(data_5_percent$tested, (qbeta(0.975, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)- qbeta(0.025, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="orange")

lines(data_25_percent$tested, (qbeta(0.975, data_25_percent$positive + 1, data_25_percent$tested - data_25_percent$positive +1)- qbeta(0.025, data_25_percent$positive + 1, data_25_percent$tested - data_25_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="red")

lines(data_50_percent$tested, (qbeta(0.975, data_50_percent$positive + 1, data_50_percent$tested - data_50_percent$positive +1)- qbeta(0.025, data_50_percent$positive + 1, data_50_percent$tested - data_50_percent$positive +1)), type = 'l', lwd=2, ylim = c(0, 0.2), col="brown")

# add line segments and texts for a certain number of tests. 

segments(300, 0.01, 300, 0.06, col="purple", lwd = 2)
segments(700,0.05, 700, 0.10, col="magenta", lwd = 2)
text(360, 0.06, "Tests = 300", srt=90, pos=3, cex=0.55)
text(755, 0.09, "Tests = 700", srt=90, pos=3, cex=0.55)
```
Graph 3 shows how uncertainty changes according to different number of tests and different prevalence. Legend in the top right side of the graph indicates specific colors of lines corresponding to specific prevalence. Purple line segment indicates sample size = 300 and magenta line segment indicates sample size= 700. We observe that the level of uncertainty is clearly decreased as the sample size increases for a certain prevalence. However, after some point, the changes in uncertainty become very subtle as the sample size increases further.For prevalence of 1% to 5%, the changes in uncertainty become subtle between the number of tests ~ 250 to 350. So, we can suggest that if the prevalence is low (0.01 - 0.05), number of tests from 250 to 350 is enough to estimate the prevalence within overall population with confidence (with very low level of uncertainty). Same for the sample size 650 to 750 for the prevalence between 25% to 50%. As we can see from graph 1, most of the time of the semester, especially in the recent time, the prevalence is between 0% to <5%. Therefore, the current level of sampling of the UW population (~3%, 250-500 people) should be sufficient enough to track changes in disease prevalence and inform policy.    

### 4. An estimation of an optimum sample size given the level of prevalence. 
```{r, fig.height=5}
# Create different data frames of different sample sizes. 

data_100 <- data.frame(tested= 100,
                      positive = 0:100)

data_250 <- data.frame(tested= 250,
                      positive = 0:250)

data_350 <- data.frame(tested= 350,
                      positive = 0:350)

data_500 <- data.frame(tested= 500,
                      positive = 0:500)

data_1000 <- data.frame(tested= 1000,
                      positive = 0:1000)

# Plot data from the above data frame for Uncertainty vs Prevalence for sample size=250.  
                    
plot((0:250)/250, (qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1)), 
     xaxt="n",
     yaxt= "n",
     type = 'l', 
     col="Blue",
     lwd=2,
     ylim = c(0.0, 0.2),
     xlab = "Prevalence", 
     ylab = "Uncertainty (95% ETPI)", 
     main = "Graph 4: Sample size affects uncertainty as a function of prevalences")

#specify the tick labels of X and Y axis manually.
axis(1, at = seq(0, 1, by=0.1), las=2 )
axis(2, at = seq(0, 0.2, by=0.01), las=1 )

text(250/500, max((qbeta(0.975, data_250$positive + 1, data_250$tested - data_250$positive +1)- qbeta(0.025, data_250$positive + 1, data_250$tested - data_250$positive +1))), "Samples tested=250", cex = 0.7, pos = 1, col = "blue") # Add text at the specific location in the graph.
abline(v=0.124, col="purple", lwd=1)           # add a vertical line.
text(0.124, 0.2, "Prevalence= 12.4%", col = "purple", cex=0.7, pos= 2, srt= 90, lwd=1.5)
abline(v=0.05, col= "black")
text(0.04, 0.15, "Prevalence= 5%", col = "black", cex=0.7, pos= 2, srt= 90, lwd=1.5)

# Add lines to the graph for Uncertainty vs Prevalence for sample sizes 100, 350, 500, 1000.

lines((0:100)/100, (qbeta(0.975, data_100$positive + 1, data_100$tested - data_100$positive +1)- qbeta(0.025, data_100$positive + 1, data_100$tested - data_100$positive +1)), type = 'l', col="magenta", lwd=2)
text(250/500, max((qbeta(0.975, data_100$positive + 1, data_100$tested - data_100$positive +1)- qbeta(0.025, data_100$positive + 1, data_100$tested - data_100$positive +1))), "Samples tested=100", cex = 0.7, pos = 1, col = "magenta")

lines((0:350)/350, (qbeta(0.975, data_350$positive + 1, data_350$tested - data_350$positive +1)- qbeta(0.025, data_350$positive + 1, data_350$tested - data_350$positive +1)), type = 'l', col="orange", lwd=2)
text(250/500, max((qbeta(0.975, data_350$positive + 1, data_350$tested - data_350$positive +1)- qbeta(0.025, data_350$positive + 1, data_350$tested - data_350$positive +1))), "Samples tested=350", cex = 0.7, pos = 1, col = "orange")

lines((0:500)/500, (qbeta(0.975, data_500$positive + 1, data_500$tested - data_500$positive +1)- qbeta(0.025, data_500$positive + 1, data_500$tested - data_500$positive +1)), type = 'l', col="red", lwd=2)
text(250/500, max((qbeta(0.975, data_500$positive + 1, data_500$tested - data_500$positive +1)- qbeta(0.025, data_500$positive + 1, data_500$tested - data_500$positive +1))), "Samples tested=500", cex = 0.7, pos = 1, col = "red")

lines((0:1000)/1000, (qbeta(0.975, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)- qbeta(0.025, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)), type = 'l', col="brown", lwd=2)
text(250/500, max((qbeta(0.975, data_1000$positive + 1, data_1000$tested - data_1000$positive +1)- qbeta(0.025, data_1000$positive + 1, data_1000$tested - data_1000$positive +1))), "Samples tested=1000", cex = 0.7, pos = 1, col = "brown")
```
Graph 4 shows how the sample size affects the level of uncertainty when estimated by accounting all prevalence. Here brown, red, orange, blue, and magenta lines indicate uncertainty for sample sizes 1000, 500, 350, 250, and 100 respectively. The purple vertical line indicates the prevalence of 12.4% and black vertical line indicates the prevalence of 5%. Overall, as the sample size is increased, the uncertainty is decreased. If we consider the original prevalence throughout the semester, the highest prevalence was 12.4%. At this point the highest level of uncertainty is ~0.13 for the sample size 100 and the lowest is ~0.04 for sample size 1000 as indicated in the graph. So, with more 900 individual being tested, the uncertainty is reduced by only 0.09 when the prevalence is 12.4%. The current prevalence is much lower (only <5%), which itself reduces the uncertainty significantly as indicated in graph 2, you will find it also in graph 4. Considering the above phenomena that 900 more test reduces the uncertainty only 0.09 at the time of highest (12.4%) prevalence and current situation of lower prevalence (~ only 5%), I would not advise to sample more individuals if the prevalence is not significantly higher than now. I would suggest to tests even less samples (tests = ~150 - 250) than now (tests=250-350) given the current prevalence. However, based on the above graph, if the prevalence become ~15% (which never happened during the semester, so highly unlikely) , I will suggest to increase the sample size up-to 300 - 350 (~ 100 - 150 more samples) to keep the level of uncertainty very low (<0.1). If the prevalence is higher than this, the campus may be closed to keep the spreading of disease under control.

### 5. Final recommendation
From the discussion above based on our probability model, we understand how prevalence and sample size may affect the level of uncertainty and their co-relationship. As a recap, for very low and high prevalence, the level of uncertainty is very low compared to around the middle value (50%) of the prevalence (graph 2 and 4) for any given number of sample size. On the other hand, after a certain number of sample size, the change in uncertainty become very low even for the mid-prevalence (prevalence =50%, when uncertainty is the highest for a given sample size)(graph 3 and 4). Which means regardless of how many more samples you tests further, the level of uncertainty won't change noticeably.  However, the scale (numbers of individuals to be tested each week) depends on how precisely you want to estimate the prevalence, Or in other words, how much uncertainty you want to tolerate for a given prevalence. Lets see more precisely in the following table how scale are related to precision (how low the uncertainty is).
```{r}
library(kableExtra) # load library kableExtra for kable() function.

# create data frame as above.
data_5_percent <- data.frame(tested= seq(50, 1000, by=100),
                      positive = 0.05*(seq(50, 1000, by=100)))

data_5_percent$uncertainty <- (qbeta(0.975, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1)- qbeta(0.025, data_5_percent$positive + 1, data_5_percent$tested - data_5_percent$positive +1))

data_15_percent <- data.frame(tested= seq(50, 1000, by=100),
                      positive = 0.15*(seq(50, 1000, by=100)))
data_15_percent$uncertainty <- (qbeta(0.975, data_15_percent$positive + 1, data_15_percent$tested - data_15_percent$positive +1)- qbeta(0.025, data_15_percent$positive + 1, data_15_percent$tested - data_15_percent$positive +1))

data_compare <- data.frame(Tests= data_15_percent$tested,   
                           Uncertainty_lowPreval. = round(data_5_percent$uncertainty, digits = 2),
                           Uncertainty_highPreval. = round(data_15_percent$uncertainty, digits = 2)
                           )

tbl1<- kable(data_compare, caption = "Table 1: An example of precision (how low the uncertainty is) vs sample size(Tests). lowPreval.= 5% prevalence, highPreval. = 15% prevalence") # convert the data frame into a customizable table. 
column_spec(tbl1, 1:3, width = "2in") # customize the table with a caption and defined width of columns. 
```

In general, as we see from the table 1 (also from graph 4), higher the sample size, more precise the estimation of the prevalence would be, however, after a certain number of tests (for example, 650 tests, in table 1) the change in the level of uncertainty does not noticeable. Therefore, even to estimate the prevalence very precisely you don't need to perform unlimited number of tests for any given prevalence. As you see in the above table, you can calculate exact scales based on prevalence and expected precision level and then decide at which scale you want to perform surveillance tests.  

In my opinion, for the current low prevalence (< 5%, see graph 1) in the UW campus, testing from ~150 to ~ 250 people each week even less than that to some extent is sufficient to estimate the prevalence with uncertainty only ~0.07 or less (table1, and graph 3). Final decision on scale depends on the expectation of the UW testing program as discussed earlier.   


































