Over the last year, we have used a variety of metrics to contextualize the pandemic and answer questions like “How bad is COVID right now?” and “Are things getting better or worse?” Perhaps the most closely watched figure is the number of new cases in a region. We will use data of this sort as the basis for our model - specifically, two tables from the New York City Department of Health: “Case-Rate-by-Modzcta” and “Cases-by-Day”. Both are publicly available and regularly updated on GitHub.
Each of these tables is limited in some way.
Case-Rate-by-Modzcta:
Cases-by-Day:
Shared Considerations:
To build an effective model, we require a table containing the percentage of individuals in a given area who received a positive test result on a given day, starting from the beginning of the pandemic. This requires merging the two tables, with a few concessions. We can retain higher fidelity along only one axis: either the ZIP Code level data from the Case-Rate-by-Modzcta table, or the daily data summarized by borough from the Cases-by-Day file. We will opt for the ZIP Code, by-week, summarization.
Turning to the logistics of transforming our data: for dates between 3/1/2020 and 8/1/2020, for which we have no Case-Rate-by-Modzcta entries, we will use population data from the US Census to convert case counts to case rates, and assign each ZIP Code in a given borough the borough’s average rate.
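Concretely, the conversion expresses weekly counts as cases per 100,000 residents. A minimal sketch with hypothetical figures (the actual population values are imported from the Census below):
# Hypothetical example of the count-to-rate conversion: a borough of
# 1,400,000 residents reporting 350 cases in a week
case_count <- 350
borough_pop <- 1400000
# 25 cases per 100,000 residents
case_rate <- case_count/borough_pop * 1e+05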
I elected to run this transformation on all data from the start of the pandemic through the week of 8/8. The week of 8/8 is redundant, as it is the first available week in our Case-Rate-by-Modzcta file; however, it can serve as a control check to ensure the transformation is accurate. We find that the adjustment from the by-day data is fairly close to the by-week summary, with an “All-City” rate of 20.11 vs 19.02. This variation is likely due to this note in the ReadMe file on the DOH GitHub: “Note that sum of counts in this file may not match values in citywide tables because of records with missing geographic information”. For our purposes, this transformation is acceptable.
# Load Packages
library(lubridate)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(ggthemes)
library(ggridges)
library(gridExtra)
library(cowplot)
library(formatR)
# Import Data
# Caserate by Week
# x <-
# 'https://raw.githubusercontent.com/ChristopherBloome/coronavirus-data/master/trends/caserate-by-modzcta.csv'
x <- "C:\\Users\\bloom\\Documents\\Thesis\\caserate.csv"
Caserate_by_week_Modzcta <- read.csv(x)
# Caserate by Day
x <- "Casecount_by_day_borough"
# x <-
# 'https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/cases-by-day.csv'
Casecount_by_day_borough <- read.csv(x)
# Build DF for weeks prior to first entry in
# 'Case-Rate-by-Modzcta'
x <- seq(from = mdy("3/7/2020"), to = mdy("8/8/2020"), by = 7)
df = data.frame(matrix(vector(), length(x), 184), stringsAsFactors = F)
names(df) <- names(Caserate_by_week_Modzcta)
df$week_ending <- x
# Create df, a, with columns for each borough and city, that
# are summed for each week.
Casecount_by_day_borough$date_of_interest <- mdy(Casecount_by_day_borough$date_of_interest)
a <- Casecount_by_day_borough %>% filter(date_of_interest < mdy("3/8/2020")) %>%
summarise(CASERATE_CITY = sum(CASE_COUNT), CASERATE_BX = sum(BX_CASE_COUNT),
CASERATE_BK = sum(BK_CASE_COUNT), CASERATE_MN = sum(MN_CASE_COUNT),
CASERATE_QN = sum(QN_CASE_COUNT), CASERATE_SI = sum(SI_CASE_COUNT))
for (i in 2:nrow(df)) {
x <- Casecount_by_day_borough %>% filter(date_of_interest <=
df$week_ending[i], date_of_interest > df$week_ending[i -
1]) %>% summarise(CASERATE_CITY = sum(CASE_COUNT), CASERATE_BX = sum(BX_CASE_COUNT),
CASERATE_BK = sum(BK_CASE_COUNT), CASERATE_MN = sum(MN_CASE_COUNT),
CASERATE_QN = sum(QN_CASE_COUNT), CASERATE_SI = sum(SI_CASE_COUNT))
a <- rbind(a, x)
}
# Import Population figures from Census to convert Case Count
# to Case Rate
x <- "CensusData"
# x <-
# 'https://raw.githubusercontent.com/ChristopherBloome/Thesis/main/QuickFacts%20Mar-07-2021.csv'
CensusData <- read.csv(x, stringsAsFactors = FALSE)
CensusData <- CensusData %>% select(New.York.city..New.York,
Bronx.County..Bronx.Borough...New.York, Kings.County..Brooklyn.Borough...New.York,
New.York.County..Manhattan.Borough...New.York, Queens.County..Queens.Borough...New.York,
Richmond.County..Staten.Island.Borough...New.York)
CensusData <- CensusData[1, ]
names(CensusData) <- c("Total_Pop", "BX_Pop", "BK_POP", "MN_POP",
"QN_POP", "SI_POP")
# Use Census data to change Count to Rate, add to DF
df$CASERATE_CITY <- a$CASERATE_CITY/as.numeric(str_replace_all(CensusData$Total_Pop[1],
",", "")) * 1e+05
df$CASERATE_BX <- a$CASERATE_BX/as.numeric(str_replace_all(CensusData$BX_Pop[1],
",", "")) * 1e+05
df$CASERATE_BK <- a$CASERATE_BK/as.numeric(str_replace_all(CensusData$BK_POP[1],
",", "")) * 1e+05
df$CASERATE_SI <- a$CASERATE_SI/as.numeric(str_replace_all(CensusData$SI_POP[1],
",", "")) * 1e+05
df$CASERATE_QN <- a$CASERATE_QN/as.numeric(str_replace_all(CensusData$QN_POP[1],
",", "")) * 1e+05
df$CASERATE_MN <- a$CASERATE_MN/as.numeric(str_replace_all(CensusData$MN_POP[1],
",", "")) * 1e+05
# Import File mapping ZIP Codes to Borough
x <- "Zip_Modzcta"
# x <-
# 'https://raw.githubusercontent.com/ChristopherBloome/coronavirus-data/master/Geography-resources/ZCTA-to-MODZCTA.csv'
Zip_Modzcta <- read.csv(x)
x <- "Zip_Borough"
# x <-
# 'https://raw.githubusercontent.com/ChristopherBloome/Thesis/main/ZipToBorough.csv'
Zip_Borough <- read.csv(x)
names(Zip_Borough) <- c("X", "Borough", "Zip_Code")
Zip_Borough <- subset(Zip_Borough, select = -X)
Zip_Modzcta <- inner_join(Zip_Modzcta, Zip_Borough, by = c(MODZCTA = "Zip_Code"))
Zip_Modzcta <- distinct(subset(Zip_Modzcta, select = -ZCTA))
Zip_Modzcta$MODZCTA <- paste("CASERATE", Zip_Modzcta$MODZCTA,
sep = "_")
# Add string to make Borough match column names
BoroughList <- unique(Zip_Modzcta$Borough)
for (i in 1:length(BoroughList)) {
x <- Zip_Modzcta %>% filter(Borough == BoroughList[i])
# Uses Borough Total for Zip Totals
for (j in 1:nrow(x)) {
y <- paste("CASERATE", BoroughList[i], sep = "_")
z <- x[j, 1]
df[[z]] <- df[[y]]
}
}
# Check accuracy of pre-Aug adjustment, merge into final
# file 'CaseRate_DF'
CaseRate_DF <- Caserate_by_week_Modzcta
CaseRate_DF$week_ending <- mdy(CaseRate_DF$week_ending)
t1 <- df %>% filter(week_ending == mdy("8/8/2020")) %>% select(week_ending,
CASERATE_CITY, CASERATE_BX, CASERATE_BK, CASERATE_MN, CASERATE_QN,
CASERATE_SI)
t2 <- CaseRate_DF %>% filter(week_ending == mdy("8/8/2020")) %>%
select(week_ending, CASERATE_CITY, CASERATE_BX, CASERATE_BK,
CASERATE_MN, CASERATE_QN, CASERATE_SI)
CaseRate_DF <- CaseRate_DF %>% filter(week_ending != mdy("8/8/2020"))
CaseRate_DF <- rbind(df, CaseRate_DF)
print("Aprox 8/8 Rates From Transformation")
## [1] "Aprox 8/8 Rates From Transformation"
print(t1[, 2:6])
## CASERATE_CITY CASERATE_BX CASERATE_BK CASERATE_MN CASERATE_QN
## 1 20.11559 27.28798 19.88357 17.92834 18.10229
print("Actual 8/8 Rates")
## [1] "Actual 8/8 Rates"
print(t2[, 2:6])
## CASERATE_CITY CASERATE_BX CASERATE_BK CASERATE_MN CASERATE_QN
## 1 19.02 26.51 18.75 15.9 17.44
The NYC Department of Health considers anyone who tested positive for COVID via a Rapid Antigen test a “probable case” until that diagnosis is confirmed with a PCR/Molecular test. For the purposes of identifying and removing duplicates, we note that an individual whose rapid test is later confirmed is counted only once in our tables: they are removed from the probable count and attributed to the date of their PCR/Molecular result.
Again, our probable case data is only available at the borough level. That noted, we can study the ratio of probable to total cases and apply our findings to the more granular data set:
# Create DF with probable cases
a <- Casecount_by_day_borough$CASE_COUNT
b <- Casecount_by_day_borough$PROBABLE_CASE_COUNT + Casecount_by_day_borough$CASE_COUNT
Prob_DF <- data.frame(a, b)
# Plot Total Cases over Time
p1 <- ggplot(Prob_DF, aes(x = seq(1, nrow(Prob_DF), by = 1),
y = a)) + geom_line() + labs(title = "NYC New Confirmed COVID Case / Day",
x = "Day of Pandemic", y = "Case Count") + theme_few()
# Above plot is noisy - take a rolling 7-day window (sums; the
# ratio formed below is equivalent to using 7-day averages)
Prob_DF$a1 <- NA_real_
Prob_DF$b1 <- NA_real_
for (i in 7:nrow(Prob_DF)) {
x <- i - 6
y <- i
Prob_DF$a1[i] <- sum(Prob_DF$a[x:y])
Prob_DF$b1[i] <- sum(Prob_DF$b[x:y])
}
# Plot Confirmed vs Total Cases over Time
p2 <- ggplot(Prob_DF) + geom_line(aes(x = seq(1, nrow(Prob_DF),
by = 1), y = as.numeric(a1), color = "Confirmed")) + geom_line(aes(x = seq(1,
nrow(Prob_DF), by = 1), y = as.numeric(b1), color = "Confirmed + Probable")) +
labs(title = "NYC New Confirmed COVID Case / 7 Day Avg",
x = "Day of Pandemic", y = "Case Count") + guides(color = guide_legend(title = "")) +
theme_few() + scale_color_tableau("Classic Blue-Red 6")
Prob_DF$Prob_Rate <- as.numeric(Prob_DF$b1)/as.numeric(Prob_DF$a1)
# Plot ratio of total (Confirmed + Probable) to Confirmed cases
p3 <- ggplot(Prob_DF) + geom_line(aes(x = seq(1, nrow(Prob_DF),
by = 1), y = as.numeric(Prob_Rate))) + labs(title = "NYC Rate of Probable COVID Cases / Confirmed Cases ",
x = "Day of Pandemic", y = "(Confirmed + Probable) \n / Confirmed") +
theme_few()
grid.arrange(p1, p2, p3, nrow = 3)
Above we note two observations: probable cases only begin to appear in volume partway through the pandemic, and the ratio of probable to confirmed cases varies considerably over time. The second observation implies that we cannot use a flat rate to find probable cases from confirmed cases; instead, we need to consider at what point in the pandemic we are measuring confirmed cases, then use our probable-case-by-day trend to approximate total cases.
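To make this concrete, a minimal sketch with hypothetical numbers; the actual week-by-week ratios are derived below and applied in the CR_Week_Clean function:
# Hypothetical: two weeks of confirmed rates (per 100,000), each with
# its own (confirmed + probable)/confirmed ratio. No flat rate fits
# both weeks - each is scaled by its own multiplier.
confirmed <- c(20, 35)
ratio <- c(1.02, 1.25)
total <- confirmed * ratio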
As we have data by borough, let’s see if there is considerable variation by borough within the city:
# Build DF with Prob and Total cases by Borough
a <- Casecount_by_day_borough$BK_PROBABLE_CASE_COUNT
b <- Casecount_by_day_borough$BK_PROBABLE_CASE_COUNT + Casecount_by_day_borough$BK_CASE_COUNT
c <- Casecount_by_day_borough$MN_PROBABLE_CASE_COUNT
d <- Casecount_by_day_borough$MN_PROBABLE_CASE_COUNT + Casecount_by_day_borough$MN_CASE_COUNT
e <- Casecount_by_day_borough$QN_PROBABLE_CASE_COUNT
f <- Casecount_by_day_borough$QN_PROBABLE_CASE_COUNT + Casecount_by_day_borough$QN_CASE_COUNT
g <- Casecount_by_day_borough$BX_PROBABLE_CASE_COUNT
h <- Casecount_by_day_borough$BX_PROBABLE_CASE_COUNT + Casecount_by_day_borough$BX_CASE_COUNT
i <- Casecount_by_day_borough$SI_PROBABLE_CASE_COUNT
j <- Casecount_by_day_borough$SI_PROBABLE_CASE_COUNT + Casecount_by_day_borough$SI_CASE_COUNT
k <- Casecount_by_day_borough$PROBABLE_CASE_COUNT
l <- Casecount_by_day_borough$PROBABLE_CASE_COUNT + Casecount_by_day_borough$CASE_COUNT
borough_Prob_DF <- data.frame(a, b, c, d, e, f, g, h, i, j, k,
l)
# Take a rolling 7-day window (sums; the ratios below are unaffected)
borough_Prob_DF1 <- borough_Prob_DF
for (i in 7:nrow(borough_Prob_DF)) {
x <- i - 6
y <- i
borough_Prob_DF1[i, ] <- t(colSums(borough_Prob_DF[x:y, ]))
}
borough_Prob_DF1 <- borough_Prob_DF1[7:nrow(borough_Prob_DF1),
]
names(borough_Prob_DF1) <- c("BK_Prob", "BK_All", "MN_Prob",
"MN_All", "QN_Prob", "QN_All", "BX_Prob", "BX_All", "SI_Prob",
"SI_All", "CITY_Prob", "CITY_All")
# Represent Probable cases as percentage of total
borough_Prob_DF2 <- data.frame(matrix(NA, nrow = nrow(borough_Prob_DF1),
ncol = 0))
borough_Prob_DF2$BKProbAvg <- borough_Prob_DF1$BK_Prob/borough_Prob_DF1$BK_All
borough_Prob_DF2$MNProbAvg <- borough_Prob_DF1$MN_Prob/borough_Prob_DF1$MN_All
borough_Prob_DF2$QNProbAvg <- borough_Prob_DF1$QN_Prob/borough_Prob_DF1$QN_All
borough_Prob_DF2$BXProbAvg <- borough_Prob_DF1$BX_Prob/borough_Prob_DF1$BX_All
borough_Prob_DF2$SIProbAvg <- borough_Prob_DF1$SI_Prob/borough_Prob_DF1$SI_All
borough_Prob_DF2$CITYProbAvg <- borough_Prob_DF1$CITY_Prob/borough_Prob_DF1$CITY_All
# Plot by Borough
ggplot(borough_Prob_DF2) + geom_line(aes(x = seq(1, nrow(borough_Prob_DF2),
by = 1), y = as.numeric(BKProbAvg), color = "BK")) + geom_line(aes(x = seq(1,
nrow(borough_Prob_DF2), by = 1), y = as.numeric(MNProbAvg),
color = "MN")) + geom_line(aes(x = seq(1, nrow(borough_Prob_DF2),
by = 1), y = as.numeric(QNProbAvg), color = "QN")) + geom_line(aes(x = seq(1,
nrow(borough_Prob_DF2), by = 1), y = as.numeric(BXProbAvg),
color = "BX")) + geom_line(aes(x = seq(1, nrow(borough_Prob_DF2),
by = 1), y = as.numeric(SIProbAvg), color = "SI")) + geom_line(aes(x = seq(1,
nrow(borough_Prob_DF2), by = 1), y = as.numeric(CITYProbAvg),
color = "CITY")) + labs(title = "Probable Cases / Total Cases",
x = "Day of Pandemic", y = "Case Percent") + guides(color = guide_legend(title = "Borough")) +
theme_few() + scale_color_tableau("Classic Blue-Red 6")
Above we find that while there is some variation by borough, in general the city moves as one in this regard. We will use the city total to convert confirmed cases to total (confirmed + probable) cases.
# Pull in dates
Prob_DF$Date <- Casecount_by_day_borough$date_of_interest
Prob_DF <- Prob_DF %>% select(Date, Prob_Rate)
Week_List <- CaseRate_DF$week_ending
Week_List <- as.data.frame(Week_List)
# Keep only the rolling 7-day value on each week-ending date
Prob_Rate_DF <- inner_join(Week_List, Prob_DF, by = c(Week_List = "Date"))
In our simulations, we will be using both Confirmed and Probable cases.
DataType <- "CP"
In addition to documented cases, we also need to consider cases which are not reported. The Centers for Disease Control and Prevention suggest that only 1 in 4.2 symptomatic COVID-19 infections were reported nationwide (“Estimated Disease Burden,” 2021). While this may seem high, it is conservative compared to the work of some researchers. Liu (2020) leveraged SIR modeling to find that “there might exist 19.5 cases undiagnosed while one infection reported in US counties averagely” (p. 6). It is worth noting that this 19.5 figure applies to all cases, not just symptomatic cases as with the figure from the CDC. Additionally, Liu found that the unreported infection rate in New York state was significantly below the national average, at 7.2 (p. 13).
Considering these sources, we will run our model with the assumption that 1 in every 4 cases is reported. As we are using both probable and confirmed cases, as well as asymptomatic cases, this will likely place us between the CDC estimate and the New York average from the work of Liu (2020).
# The convention of this variable is a little unintuitive:
# Total_Rate = Reported_Rate * (100 + Unreported_Rate)/100
Unreported_Rate <- 300
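As a quick check of this convention: with Unreported_Rate at 300, each reported case implies three unreported ones, i.e., 1 in 4 total cases is reported. The reported rate below is hypothetical:
# Hypothetical reported weekly rate of 5 per 100,000 becomes a total
# rate of 20 per 100,000
Reported_Rate <- 5
Reported_Rate * (100 + Unreported_Rate)/100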
Now that we have a handle on our data, we can write a function to produce a tidy table which can later be used to find the percentage of the population in a given area that will receive a positive test result on a given day. We will pass the rate of unreported cases, as well as the type of cases (Probable + Confirmed, or Confirmed only), as variables in a function titled CR_Week_Clean, found in our appendix. With this in place, we can set our model parameters and nest the resulting data frame in a function that pulls the rate for a given region.
CR_Week_Clean <- function(DataTypeX, Unreported_RateX) {
    # Function takes the following arguments: DataTypeX ('CP' for
    # confirmed + probable cases; anything else yields confirmed only).
    # Unreported_RateX: (Unreported / Reported Cases)*100. Please note
    # that 0 = all cases reported; 100 = unreported cases equal half
    # of total cases.
    # Function returns a tidy table of the percentage of the population
    # to test positive for COVID during a given week.
if (DataTypeX == "CP") {
Rate <- as.data.frame(t(t(CaseRate_DF[2:184]) * as.numeric(Prob_Rate_DF$Prob_Rate)))
} else {
Rate <- CaseRate_DF[2:184]
}
Rate <- Rate * ((Unreported_RateX + 100)/100)
Rate$week_ending <- CaseRate_DF$week_ending
x <- Rate %>% pivot_longer(!week_ending, names_to = "Region",
values_to = "Rate")
    # Convert weekly rate per 100,000 to a daily per-person probability
    x$Rate <- x$Rate/1e+05/7
x
}
# Generate DF
CR_Week_Clean_DF <- CR_Week_Clean(DataType, Unreported_Rate)
GetRate <- function(REGIONX, DATEX) {
xx <- CR_Week_Clean_DF %>% filter(week_ending == DATEX, Region ==
paste("CASERATE_", REGIONX, sep = "")) %>% select(Rate)
return(as.numeric(xx))
}
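As an illustration, GetRate returns the daily per-person probability of a positive result in a region; the date passed must be one of the Saturday week_ending values present in the table. The lookup below is hypothetical:
# Daily positive-test probability in the Bronx for the week ending
# 10/3/2020 (a Saturday)
GetRate("BX", mdy("10/3/2020"))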
The data above indicate the percent of the population in a given area that received a positive test result on a given day. To use this to estimate what percentage of the population is contagious but would not be expected to be aware of that status, we need to learn more about the broader community’s behavior around testing.
There are a few landmark days in the progression of a COVID Case. As a guide, we will be considering the following:
# Symptom Lag = gap between becoming contagious and
# presenting symptoms. Hard Coded, not to be adjusted
Symptom_Lag = 2
(Symptom Onset to Sample Collection Date) Using data from the NYC Department of Health, we can find how many people in a given area received positive test results on a given day. The DOH also publishes figures on testing processing time, which allow us to approximate the testing date for each of these individuals. As our model is concerned with the total number of days an individual is contagious in the workplace, we need figures on the median delay from symptom onset to testing date.
While this information is not available from the NYC DOH, it is provided in a national patient-level data set by the Centers for Disease Control and Prevention. This data set, provided by the CDC Case Surveillance Task Force and titled COVID-19 Case Surveillance Public Use Data, has several elements related to all COVID-19 cases shared with the CDC, including, most notably, symptom onset date and test-sample collection date.
CDCData <- read.csv("C:\\Users\\bloom\\Downloads\\CDCCOVID.csv")
CDCData2 <- subset(CDCData, onset_dt != "" & pos_spec_dt != "" &
cdc_report_dt != "")
CDCData2$diff <- ""
CDCData2$diff <- as.numeric(difftime(CDCData2$pos_spec_dt, CDCData2$onset_dt,
units = "days"))
CDCData2$diff2 <- as.numeric(difftime(CDCData2$cdc_report_dt,
CDCData2$onset_dt, units = "days"))
CDCData3 <- subset(CDCData2, diff >= 0 & diff < 100)
ggplot(CDCData3, aes(x = diff)) + geom_histogram(binwidth = 1) +
xlim(-1, 25) + theme_few() + theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank(), plot.title = element_text(size = 8)) +
labs(title = "Test Day minus \n First Day of Symptoms", x = "Lag",
y = "Count")
A rough analysis shows that, after filtering out patients whose symptom onset date was an outlier or a null value indicating they were asymptomatic, many individuals were tested on their first day of symptoms. This is problematic: human error in reporting may have led to asymptomatic individuals being counted among the symptomatic population, with the date of testing entered in place of the symptom onset date. With all individuals tested on their first day of symptoms included, our median delay from symptom onset to test date was two days; excluding this population, the median delay increased to three days. Further review of the paperwork used to report this data to the CDC showed that this question sits in a section of the form clearly intended for symptomatic individuals, with wording instructing that it be ignored for asymptomatic individuals. This gives us more confidence in the reported data and in using a 2-day delay as our median value.
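As a sketch, the two medians described above can be computed directly from the filtered table built earlier:
# Median symptom-onset-to-test delay: two days when same-day tests
# are included, three days when they are excluded
median(CDCData3$diff)
median(CDCData3$diff[CDCData3$diff > 0])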
# Test Lag = gap between becoming symptomatic and getting
# tested. This is hard coded, as it represents the community at
# large; the office population's lag is set separately via Pop_Lag
Test_Lag = 2
(Test Date to Report Date) Finally, we need to account for the delay between sample collection and reported result. This measure, in days, is available from the NYC DOH on a weekly basis. This is convenient as it is the same source and format as our case rate table.
x <- "LagTime"
# x <-
# 'https://raw.githubusercontent.com/ChristopherBloome/coronavirus-data/master/trends/testing-turnaround.csv'
LagTime <- read.csv(x)
LagTime$week_end <- mdy(LagTime$week_end)
# FN is the detection threshold: a uniform draw below FN counts as a
# true positive; draws above it are false negatives (10% of tests)
FN <- 0.9
AsymptomaticRate <- 0.16
# Translating for ease of use in model: asymptomatic cases per 100
# symptomatic cases (0.16 becomes 19)
AsymptomaticRate <- 100 * round((1/(1 - AsymptomaticRate)) - 1, 2)
# Maximum number of days an individual is counted as contagious
Max_Contagious <- 10
If testing is to be made mandatory, this variable will indicate how many total calendar days are to pass between testing intervals.
TestInterval <- 14
(Symptom Onset to (Symptomatic) Sample Collection Date)
Pop_Lag <- 3
# 1 = Test after work. Must work on 'test day'
Test_After_Work <- 0
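Putting these parameters together, a hypothetical timeline for a single community case under the defaults above:
# An individual becomes contagious on day 0, presents symptoms on day
# 2 (Symptom_Lag), and is tested two days later (Test_Lag). The
# result is then returned after that week's median lag from LagTime.
day_contagious <- 0
day_symptomatic <- day_contagious + Symptom_Lag
day_tested <- day_symptomatic + Test_Lag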
As mentioned above, we have testing data at the ZIP Code, borough, and “all NYC” levels. We will pass the count of employees in each region as a variable in our model. Logistically, this appears as a vector, with each employee represented by one instance of their region.
Employee_Demo <- c(rep("BX", 5), rep("BK", 5), rep("QN", 5),
rep("MN", 5), rep("SI", 5))
Finally, we need to specify the period we want to analyze.
StartDate <- mdy("8/1/2020")
EndDate <- mdy("3/1/2021")
With all of the above defined, we can write a function to simulate an office and return data on the contagiousness of its workers.
We will start with a function that calculates this for one office worker. This function, titled MonteCarlo, can be found in the appendix.
# Build DF our function will leverage. Only need to create it
# once, hence it is outside the function.
# Create Vector with all relevant dates
DateL = seq(from = StartDate, to = EndDate, by = 1)
# As we are reviewing a period, we will calculate the odds of FIRST
# becoming contagious on each day. Create vector with the date we
# would expect someone to be tested.
test_date <- DateL + Symptom_Lag + Test_Lag
# Create vector containing the date results are returned. To do this,
# we need the lag table, which is summarized by week. The line below
# snaps each test date forward to the Saturday ending its week.
test_date_report_date <- 7 - wday(test_date) + test_date
# Vector with Lag days per test date.
lag_days_fx <- function(listx) {
LagTime %>% filter(week_end == listx) %>% select(lag_median) %>%
as.numeric()
}
lag_days <- unlist(lapply(test_date_report_date, lag_days_fx))
# Vector with the date results are returned (this is how the result
# DB is sorted)
Result_DateL <- test_date + lag_days
Result_DateL <- 7 - wday(Result_DateL) + Result_DateL
# Build DF.
Model_DF <- data.frame(DateL, Result_DateL)
names(Model_DF) <- c("Date", "Result_Date")
# Remove dates for which results are not avail.
Model_DF <- Model_DF %>% filter(!is.na(Result_Date))
# Write function which takes Region, Start and End dates as
# explicit arguments, plus other parameters defined previously,
# runs a Monte Carlo simulation, and returns a DF containing
# various COVID-related outcomes for the entire duration.
MonteCarlo <- function(RegionX, StartDateX, EndDateX) {
# Create/clear column containing probability of first
# becoming contagious on given day.
Model_DF$Prob <- ""
# Run GetRate function once per row(day). Function supplies
# probability.
Prob <- unlist(lapply(Model_DF$Result_Date, GetRate, REGIONX = RegionX))
Model_DF$Prob <- Prob
# Rand Number
Model_DF$Rand <- runif(nrow(Model_DF), 0, 1)
# if Rand < Prob, COVID Contagious.
Model_DF$Result <- Model_DF$Rand < as.numeric(Model_DF$Prob)
    # Filter out rows for which no data is available. Should be
    # redundant.
Model_DF <- Model_DF %>% filter(!is.na(Prob))
    # First test date is always a Monday. Ensures consistency.
TestDay <- 1
while (wday(Model_DF$Date[TestDay]) != 2) {
TestDay = TestDay + 1
}
    # Populate Test column: indicates whether the day is a regularly
    # scheduled test day
Model_DF$Test <- FALSE
while (TestDay < nrow(Model_DF)) {
Model_DF$Test[TestDay + Test_After_Work] <- TRUE
TestDay <- TestDay + TestInterval
}
# Adds needed Columns with default values
Model_DF$FalseNeg <- NA
Model_DF$Notes <- NA
Model_DF$ContagiousDay <- 0
Model_DF$SX <- NA
    # Write function to simulate a test and determine if it is accurate
    # or a false negative. Takes row number (of DF) and data related
    # to test type, returns DF with test result and notes columns
    # updated. Should overwrite DF.
Test_Func <- function(RowNum, notex) {
Model_DF$FalseNeg[RowNum] <- runif(1, 0, 1)
# If not False Negative
if (Model_DF$FalseNeg[RowNum] < FN) {
Model_DF$ContagiousDay[RowNum] = 0
# If IS False Negative
} else {
Model_DF$Notes[RowNum] = paste("False Negative -",
notex)
}
return(Model_DF)
}
    # Write function to check if symptomatic. Takes the asymptomatic
    # rate as input. Returns 'SX' or 'NSX'.
Sx_Func <- function(AsymptomaticRate) {
x <- runif(1, 0, 100 + AsymptomaticRate)
if (x < 100) {
"SX"
} else {
"NSX"
}
}
    # If there are no positive results, we can skip the remainder of the function.
if ({
x <- Model_DF %>% filter(Result == TRUE) %>% nrow()
x == 0
}) {
Model_DF$ContagiousDay <- 0
} else {
# If not, simulation continues:
for (i in 1:nrow(Model_DF)) {
            # Start by setting ContagiousDay = days since first becoming
            # contagious. Later assessed and revised if needed, e.g. in the
            # case of a pre-work positive test result.
            # First day:
if (i == 1) {
if (Model_DF$Result[1] == TRUE) {
Model_DF$ContagiousDay[1] <- 1
Model_DF$SX[i] <- Sx_Func(AsymptomaticRate)
} else {
Model_DF$ContagiousDay[1] <- 0
}
# If already contagious
} else if (Model_DF$ContagiousDay[i - 1] > 0) {
Model_DF$ContagiousDay[i] <- Model_DF$ContagiousDay[i -
1] + 1
Model_DF$SX[i] <- Model_DF$SX[i - 1]
# If Newly contagious
} else if (Model_DF$Result[i] == TRUE) {
Model_DF$ContagiousDay[i] <- 1
Model_DF$SX[i] <- Sx_Func(AsymptomaticRate)
# If not contagious
} else {
Model_DF$ContagiousDay[i] <- 0
}
# Scheduled test - SX irrelevant
if (Model_DF$Test[i] == TRUE & Model_DF$ContagiousDay[i] >
0) {
Model_DF$Notes[i] <- "Scheduled Test"
Model_DF <- Test_Func(i, "Scheduled Test")
# SX Test
} else if (Model_DF$ContagiousDay[i] == (Symptom_Lag +
Pop_Lag) & Model_DF$SX[i] == "SX") {
Model_DF$Notes[i] <- "SX Test"
Model_DF <- Test_Func(i, "SX Test")
}
# If contagious more than maximum contagious days, stop
# count.
if (Model_DF$ContagiousDay[i] > Max_Contagious) {
Model_DF$ContagiousDay[i] = 0
Model_DF$Notes[i] = "Undetected Positive"
}
}
# If simulation ends with someone still contagious, note.
if (Model_DF$ContagiousDay[nrow(Model_DF)] > 0) {
Model_DF$Notes[nrow(Model_DF)] <- "Currently Contagious - Undetected"
}
}
# Remove all irrelevant data. Keep any contagious days or
# days with notes.
return(filter(Model_DF, ContagiousDay > 0 | !is.na(Notes) |
!is.na(SX)))
}
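As a minimal usage sketch, we can simulate a single (hypothetical) Queens-based worker over the study window; the returned frame keeps only contagious or otherwise noteworthy days:
# Simulate one worker in Queens and inspect the first rows
OneWorker <- MonteCarlo("QN", StartDate, EndDate)
head(OneWorker)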
With a function to simulate one person during the relevant period, we can now write a function to run this for each person in the office. It takes our Employee_Demo vector as an argument, with each person’s region listed once.
Titled MonteMacro, the code can be found in the appendix.
MonteMacro <- function(RegionL, StartDateX, EndDateX, SimX) {
    # Takes RegionL (a list of regions) and period start and end
    # dates, runs the MonteCarlo function for each and merges values.
    # Takes SimX as input; this is for running this macro several
    # times.
# Generate Empty DF
x <- data.frame(matrix(NA, nrow = 0, ncol = 10), stringsAsFactors = FALSE)
# names(x) <-
# c('Date','Result_Date','Prob','Rand','Result','Test','FalseNeg','Notes','ContagiousDay
    # SX') Create a list of DFs from the MonteCarlo function. Generating
    # a list of DFs and rbind-ing via Reduce is ~30% faster than
    # rbind-ing after each. This will likely be run several
    # times.
xtest <- lapply(RegionL, MonteCarlo, StartDateX = StartDateX,
EndDateX = EndDateX)
# print(xtest) Merge DFs into one
rows <- Reduce(rbind, xtest)
x <- rbind(x, rows)
    # From the merged DF, titled x, we need two sets of info.
    # Y summary: tracks cumulative contagious days in the period.
    # Used in plotting.
    # Z summary: reports on testing data: scheduled vs symptomatic
    # tests, any false negatives or undetected cases, and the number
    # of days each individual was contagious.
    # Create Y summary: calculates cumulative contagious days
    # across all office workers.
y <- x %>% filter(ContagiousDay > 0) %>% count(Date) %>%
mutate(cumsum = cumsum(n))
# Add first and last day.
y <- rbind(y, list(StartDateX, 0, 0), list(EndDateX, 0, max(y$cumsum)))
y$SimNum <- SimX
    # Create Z summary. a is a vector of row numbers we want to keep
    # from x.
a <- NA
# keep if last day contagious in a row, or notes re: testing.
for (i in 1:(nrow(x) - 1)) {
if (x$ContagiousDay[i] > x$ContagiousDay[i + 1] | !is.na(x$Notes[i])) {
a <- c(a, i)
}
}
if (x$ContagiousDay[nrow(x)] > x$ContagiousDay[nrow(x) -
1] | !is.na(x$Notes[nrow(x)])) {
a <- c(a, nrow(x))
}
# remove leading NA
a <- a[-1]
# pull rows in a from x
z <- x[a, ]
    # ContagiousDay column drops to zero on the day of a test. We
    # want the total number of days contagious before the test. Below
    # we pull the value from the previous row if it drops to zero.
    # This is to report on testing stats.
# e is vector of rows to remove.
e = NA
for (i in nrow(z):2) {
if (z$Date[i] - 1 == z$Date[(i - 1)] & z$ContagiousDay[i] ==
0 & is.na(z$Notes[(i - 1)])) {
e <- c(e, i - 1)
z$ContagiousDay[i] <- z$ContagiousDay[(i - 1)]
}
}
e <- e[-1]
# remove e from z DF. Select relevant columns
z <- z[-e, ] %>% select(Date, Notes, ContagiousDay, SX)
names(z) <- c("Date", "Type_Of_Test", "Quantity_Contagious_Days",
"SX")
z$SimNum <- SimX
row.names(z) <- NULL
# Join y and z as x
x <- full_join(y, z)
return(x)
}
With this in place, we can nest this function to run repeatedly and compare office contagious metrics under different conditions and/or office policies.
Titled MonteSim, this too can be found in the appendix.
MonteSim_New <- function(RegionL, StartDateX, EndDateX, NumTimes) {
    # Takes RegionL (a list of regions) and period start and end
    # dates, runs the MonteMacro function for each and merges values.
    # Takes NumTimes as input; this is the quantity of times we want
    # to simulate under identical conditions.
# Generate Empty DF
MonteSimDFX <- data.frame(matrix(NA, nrow = 0, ncol = 12),
stringsAsFactors = FALSE)
# Run MonteMacro, increasing SimX input with each simulation.
    # Generates one DF with all results.
for (i in 1:NumTimes) {
x <- MonteMacro(RegionL, StartDateX, EndDateX, i)
# print(x)
MonteSimDFX <- rbind(MonteSimDFX, x)
}
return(MonteSimDFX)
}
Finally, we will write a function to clean up and contextualize our summary statistics. The output is a high-level “executive summary” of how the office fared, on average, under the specified conditions. This function is titled SummaryStats.
SummaryStats <- function(SimDF, NumTimes) {
    # Takes a DF and the quantity of simulations as input; returns a
    # DF with summary statistics.
    # The most straightforward approach is to calculate aggregate
    # statistics, and then divide by the quantity of simulations to
    # find the average.
    # As the first entry is itself the quantity of simulations, we
    # initially list it as NumTimes squared so the division returns it.
SumDF <- data.frame(c("Quantity of simulations", "Total Contagious Days",
"Contagious Work Days (5-day week)", "Positive Sx Test Cnt",
"Positive Scheduled Test Cnt", "Undetected Positive Cases",
"False Negative Tests", "Avg Contagious Work Days per Test (inc Undetected)",
"Symptomatic Cases", "Asymptomatic Cases"), c(NumTimes^2,
sum(SimDF$Quantity_Contagious_Days, na.rm = TRUE), sum(SimDF$Quantity_Contagious_Days,
na.rm = TRUE) * 5/7, nrow(filter(SimDF, Type_Of_Test ==
"SX Test")), nrow(filter(SimDF, Type_Of_Test == "Scheduled Test")),
nrow(filter(SimDF, Type_Of_Test == "Undetected Positive")),
nrow(filter(SimDF, Type_Of_Test == "False Negative")),
sum(SimDF$Quantity_Contagious_Days, na.rm = TRUE) * 5/7/(nrow(filter(SimDF,
Type_Of_Test == "SX Test")) + nrow(filter(SimDF,
Type_Of_Test == "Scheduled Test"))), nrow(filter(SimDF,
SX == "SX")), nrow(filter(SimDF, SX == "NSX"))))
names(SumDF) <- c("a", "b")
SumDF$b <- round(SumDF$b/NumTimes, digits = 2)
names(SumDF) <- NULL
row.names(SumDF) <- NULL
return(SumDF)
}
With our functions written above, we can see how various approaches to workforce management can be expected to impact COVID cases in the workplace. For the purposes of illustration, we will consider three different approaches:
Low Trust: I think we have all worked in, or can imagine, a “low trust” workplace. This is a business which might require a “Doctor’s Note” when taking a day off due to illness. For our purposes, we will say that this office has mandatory testing once a week, and that its employees will not seek any additional testing if symptoms happen to develop off-cycle.
High Trust: This is meant to embody an emerging trend in workplace culture. With increased amenities (ping pong tables, catered food) and privileges such as unlimited PTO comes trust on behalf of the employer in its workforce. This workplace has no mandatory testing, but gives its workforce the ability to take a day off for testing if they feel even remotely symptomatic.
Medium Trust: A blend of the above. This workforce has mandatory testing every other week, and this mandated testing results in a slight delay in symptomatic testing, to 4 days (over the national average of 2).
# Standardize Variables
Test_After_Work <- 1
StartDate <- mdy("8/1/2020")
# Low Trust Office
TestInterval <- 7
Pop_Lag <- 10
LowTrust_T <- MonteSim_New(Employee_Demo, StartDate, EndDate,
30)
# 'Medium Trust' Office
TestInterval <- 14
Pop_Lag <- 4
MedTrust_T <- MonteSim_New(Employee_Demo, StartDate, EndDate,
30)
# 'High Trust' Office
TestInterval <- 30
Pop_Lag <- 1
HighTrust_T <- MonteSim_New(Employee_Demo, StartDate, EndDate,
30)
# Add Sim Type
LowTrust_T$SimType <- "Low"
MedTrust_T$SimType <- "Med"
HighTrust_T$SimType <- "High"
# Join, adjust
Office_Type_DF <- rbind(LowTrust_T, MedTrust_T, HighTrust_T)
Office_Type_DF$SimType <- factor(Office_Type_DF$SimType, levels = c("Low",
"Med", "High"))
p1 <- Office_Type_DF %>% filter(!is.na(cumsum)) %>% ggplot(aes(x = Date,
y = cumsum)) + geom_line(aes(color = as.factor(SimNum))) +
geom_smooth(method = "lm") + xlim(StartDate, EndDate) + ylim(0,
NA) + theme_few() + theme(legend.position = "none", axis.text.x = element_text(angle = 90,
vjust = 0.5, hjust = 1)) + labs(title = "Contagious Days \n as function of trust level.",
x = "Date", y = "Cumulative Contagious Days") + facet_wrap(~SimType)
Summary_DF_Office <- Office_Type_DF %>% group_by(SimNum, SimType) %>%
summarise(max(cumsum, na.rm = TRUE))
names(Summary_DF_Office) <- c("SumNum", "SimType", "Total_Cont_Days")
p2 <- Summary_DF_Office %>% ggplot(aes(x = Total_Cont_Days, y = as.factor(SimType),
fill = factor(stat(quantile)))) + stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE, quantiles = 4, quantile_lines = TRUE) +
scale_fill_tableau("Classic Cyclic") + theme_few() + labs(title = "Total Contagious Days \n 25 Person Office. Many Simulations",
y = "Office Trust Level", x = "Total Contagious Days") +
theme(legend.position = "none")
grid.arrange(p1, p2, ncol = 1)
Well, this was a bit of a surprise. I firmly believed that trust would be correlated with a decrease in COVID risk - that we would find decreasing the lag between when a person becomes contagious and when they are tested to have a considerable impact on our response variable, while decreasing the interval between periodic tests would have only a marginal effect. Our initial findings do not appear to support this!
It is worth noting that this is a simple simulation; in a real-world situation, we would expect some asymptomatic testing in a high trust workplace. However, the increased rate and increased variation seem to indicate that mandated asymptomatic testing may be of value. Specifically, this initial result suggests that with increased scheduled testing comes a smaller standard deviation in the quantity of contagious days, and a lower mean count. We also see that the data tend to be bimodal across all trust environments, with a second peak on the higher end of the range, well beyond the mean.
Let’s take a deeper dive and isolate these variables. We will proceed by running our model under several different conditions.
# Run Simulation
# Set Variables
StartDate <- mdy("1/1/2021")
Test_After_Work <- 1
# One Week Testing Intervals
TestInterval <- 7
# Pop Lag 1
Pop_Lag <- 1
One_One <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 2
Pop_Lag <- 2
One_Two <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 3
Pop_Lag <- 3
One_Three <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 5
Pop_Lag <- 5
One_Five <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 7
Pop_Lag <- 7
One_Seven <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Two Week Testing Intervals
TestInterval <- 14
# Pop Lag 1
Pop_Lag <- 1
Two_One <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 2
Pop_Lag <- 2
Two_Two <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 3
Pop_Lag <- 3
Two_Three <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 5
Pop_Lag <- 5
Two_Five <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 7
Pop_Lag <- 7
Two_Seven <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Three Week Testing Intervals
TestInterval <- 21
# Pop Lag 1
Pop_Lag <- 1
Three_One <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 2
Pop_Lag <- 2
Three_Two <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 3
Pop_Lag <- 3
Three_Three <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 5
Pop_Lag <- 5
Three_Five <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 7
Pop_Lag <- 7
Three_Seven <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Four Week Testing Intervals
TestInterval <- 28
# Pop Lag 1
Pop_Lag <- 1
Four_One <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 2
Pop_Lag <- 2
Four_Two <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 3
Pop_Lag <- 3
Four_Three <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 5
Pop_Lag <- 5
Four_Five <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Pop Lag 7
Pop_Lag <- 7
Four_Seven <- MonteSim_New(Employee_Demo, StartDate, EndDate, 30)
# Clean Data
# Add Sim Types
One_One$SimType <- "One_One"
One_Two$SimType <- "One_Two"
One_Three$SimType <- "One_Three"
One_Five$SimType <- "One_Five"
One_Seven$SimType <- "One_Seven"
Two_One$SimType <- "Two_One"
Two_Two$SimType <- "Two_Two"
Two_Three$SimType <- "Two_Three"
Two_Five$SimType <- "Two_Five"
Two_Seven$SimType <- "Two_Seven"
Three_One$SimType <- "Three_One"
Three_Two$SimType <- "Three_Two"
Three_Three$SimType <- "Three_Three"
Three_Five$SimType <- "Three_Five"
Three_Seven$SimType <- "Three_Seven"
Four_One$SimType <- "Four_One"
Four_Two$SimType <- "Four_Two"
Four_Three$SimType <- "Four_Three"
Four_Five$SimType <- "Four_Five"
Four_Seven$SimType <- "Four_Seven"
# Join
Grid_Sim_DF2Final <- rbind(One_One, One_Two, One_Three, One_Five, One_Seven, Two_One, Two_Two, Two_Three, Two_Five, Two_Seven, Three_One, Three_Two, Three_Three, Three_Five, Three_Seven, Four_One, Four_Two, Four_Three, Four_Five, Four_Seven)
# Factor
Grid_Sim_DF2Final$SimType <- factor(Grid_Sim_DF2Final$SimType, levels = c(
"One_One",
"One_Two",
"One_Three",
"One_Five",
"One_Seven",
"Two_One",
"Two_Two",
"Two_Three",
"Two_Five",
"Two_Seven",
"Three_One",
"Three_Two",
"Three_Three",
"Three_Five",
"Three_Seven",
"Four_One",
"Four_Two",
"Four_Three",
"Four_Five",
"Four_Seven"))
Grid_Sim_DF2Final %>% filter(!is.na(cumsum)) %>% ggplot(aes(x = Date,
y = cumsum)) + geom_line(aes(color = as.factor(SimNum))) +
geom_smooth(method = "lm") + xlim(StartDate, EndDate) + ylim(0,
NA) + theme_few() + theme(legend.position = "none", strip.background = element_blank(),
strip.text.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.x = element_blank()) + labs(title = "", x = "",
y = "Cumulative Contagious Days") + facet_wrap(~SimType)
SummaryStats(Grid_Sim_DF2Final, 600)
##
## 1 Quantity of simulations 600.00
## 2 Total Contagious Days 25.37
## 3 Contagious Work Days (5-day week) 18.12
## 4 Positive Sx Test Cnt 2.58
## 5 Positive Scheduled Test Cnt 2.32
## 6 Undetected Positive Cases 0.47
## 7 False Negative Tests 0.00
## 8 Avg Contagious Work Days per Test (inc Undetected) 0.01
## 9 Symptomatic Cases 5.14
## 10 Asymptomatic Cases 0.89
The above exhibit contains 20 charts, each representing the results of 30 instances of our model under different conditions.
Each row represents the office modeled with mandatory testing every x weeks, for x = 1, 2, 3, 4. Row 1 contains mandatory testing each week, Row 2 every second week, and so forth.
Similarly, each column indicates a different number of days elapsed from when employees become symptomatic until they are tested, for values of 1, 2, 3, 5, 7. Column 1 indicates the model was run with employees tested on their first day of symptoms, Column 2 after 2 days, and so on.
There is not anything too surprising here: we see our total case count increase as we increase the weeks between mandatory tests or the delay from symptom onset to symptomatic testing. Let’s summarize this a little differently to see if one measure leads to greater risk mitigation than the other.
Summary_DF <- Grid_Sim_DF2Final %>% group_by(SimNum, SimType) %>%
summarise(max(cumsum, na.rm = TRUE))
names(Summary_DF) <- c("SumNum", "SimType", "Total_Cont_Days")
HelperDF <- data.frame(unique(Grid_Sim_DF2Final$SimType), c(1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4), c(1,
2, 3, 5, 7, 1, 2, 3, 5, 7, 1, 2, 3, 5, 7, 1, 2, 3, 5, 7))
names(HelperDF) <- c("SimType", "Week_Test", "SX_Lag")
Summary_DF <- inner_join(Summary_DF, HelperDF)
p1 <- Summary_DF %>% ggplot(aes(x = Total_Cont_Days, y = as.factor(Week_Test),
fill = factor(stat(quantile)))) + stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE, quantiles = 4, quantile_lines = TRUE) +
scale_fill_tableau("Classic Cyclic") + theme_few() + labs(title = "Total Contagious Days \n 25 Person Office. Many Simulations",
y = "Wks Between \n Scheduled Tests", x = "Total Contagious Days") +
theme(legend.position = "none")
p2 <- Summary_DF %>% ggplot(aes(x = Total_Cont_Days, y = as.factor(SX_Lag),
fill = factor(stat(quantile)))) + stat_density_ridges(geom = "density_ridges_gradient",
calc_ecdf = TRUE, quantiles = 4, quantile_lines = TRUE) +
scale_fill_tableau("Classic Cyclic") + theme_few() + labs(title = "Total Contagious Days \n 25 Person Office. Many Simulations",
y = "Days from Symptom Presentation \n to Testing", x = "Total Contagious Days") +
theme(legend.position = "none")
grid.arrange(p1, p2, ncol = 1)
Grouping by mandatory testing intervals shows a significant increase in contagious days going from the first to the second week; meanwhile, grouping by lag time in days shows fairly linear increases. We must remember that our x axis is not linear through the sample, as we model for a lag of 1, 2, 3, 5 and 7 days.
Things become a little clearer when we show both variables on the same plot:
# Summary_DF %>% ggplot(aes(SX_Lag, Total_Cont_Days, group =
# SX_Lag, fill = as.factor(SX_Lag))) + geom_boxplot() +
# labs(title='Total Contagious Days \n Jan through March
# 2020. 25 Person Office', x ='Days from Symptom Presentation
# to Testing', y = 'Total Contagious Days') + theme_few() +
# scale_fill_tableau('Classic Cyclic') +
# theme(legend.position = 'none')
p1 <- Summary_DF %>% ggplot(aes(x = SX_Lag, y = Total_Cont_Days,
color = as.factor(Week_Test))) + geom_jitter() + geom_smooth(aes(color = as.factor(Week_Test)),
method = "lm") + guides(color = guide_legend(title = "Wks Between \n Scheduled Tests")) +
labs(title = "Total Contagious Days \n Jan through March 2020. 25 Person Office",
x = "Days from Symptom Presentation to Testing", y = "Total Contagious Days") +
theme_few() + scale_color_tableau("Classic Blue-Red 6")
p2 <- Summary_DF %>% ggplot(aes(x = Week_Test, y = Total_Cont_Days,
color = as.factor(SX_Lag))) + geom_jitter() + geom_smooth(aes(color = as.factor(SX_Lag)),
method = "lm") + guides(color = guide_legend(title = "Days from Symptom \n Presentation to Testing")) +
labs(title = "Total Contagious Days \n Jan through March 2020. 25 Person Office",
x = "Wks Between \n Scheduled Tests", y = "Total Contagious Days") +
theme_few() + scale_color_tableau("Classic Cyclic")
# Summary_DF %>% ggplot(aes(Week_Test, Total_Cont_Days, group
# = Week_Test, fill = as.factor(Week_Test))) + geom_boxplot()
# + theme_few() + labs(title='Total Contagious Days \n Jan
# through March 2020. 25 Person Office', x ='Wks Between \n
# Scheduled Tests', y = 'Total Contagious Days') +
# theme_few() + scale_fill_tableau('Classic Blue-Red 6') +
# theme(legend.position = 'none')
grid.arrange(p1, p2, ncol = 1)
Plotting the weeks between scheduled tests on the x axis and grouping by our testing lag shows fairly linear increases in contagious days as the interval between scheduled tests grows. Meanwhile, when this convention is reversed, we find that testing each week yields increasingly more favorable results over other scheduled testing cadences as the testing lag increases.
This is particularly clear in the split box-plot below:
Summary_DF %>% ggplot(aes(as.factor(SX_Lag), Total_Cont_Days,
fill = as.factor(Week_Test))) + geom_boxplot() + theme_few() +
scale_fill_tableau("Classic Blue-Red 6") + guides(fill = guide_legend(title = "Wks Between \n Scheduled Tests")) +
labs(title = "Total Contagious Days \n Jan through March 2020. 25 Person Office",
x = "Days from Symptom Presentation to Testing", y = "Total Contagious Days")