Summary
I am working with the NY Times COVID-19 data and noticed, when looking at California counties, that there was often a long period between the first case and the next case. I’m using the data to predict COVID-19 hotspots, and that long latent period was throwing off my model. For instance, when I predicted one day ahead using all days for Orange County, California, which had 30+ days between its first case and sustained growth, the model produced a number significantly lower than the day before, even though the trend was rising. When those early days before sustained growth are removed, the forecasts look like they should. Our tendency is to want to use all of the data because it is real and has happened, but if you are building a forecasting model those early days are of little value. The full dataset is still there if you need it.
Methodology
While watching a news broadcast I saw a COVID-19 forecast chart that explicitly stated the model started after three consecutive days of new cases. This made sense to me, but I had no idea how to implement it, since there are over 2,500 counties with at least one case. This article shows you how to trim those leading days and provides a few bonus features.
The Goal
Remove, for each county, all rows that occur before the first day of the first run of three consecutive days with new cases.
What does that even mean?
First, let’s load the data. The NY Times data gives daily total cases & deaths per county. Since no columns are provided for new cases or deaths, they have to be created; I use the lag() function for this. I also make two binary variables to indicate whether that day had new cases/deaths or not.
library(tidyverse)

us <- read_csv("Data/us-counties-2020-04-08.csv") %>%
  mutate(id = paste0(county, "_", state)) %>%
  arrange(id, date) %>%
  rename(ttl_cases = cases,
         ttl_deaths = deaths) %>%
  group_by(id) %>%
  mutate(new_cases = ttl_cases - lag(ttl_cases),
         new_cases = if_else(is.na(new_cases), ttl_cases, new_cases),
         new_deaths = ttl_deaths - lag(ttl_deaths),
         new_deaths = if_else(is.na(new_deaths), ttl_deaths, new_deaths),
         new_case_today = if_else(new_cases >= 1, 1, 0),
         new_death_today = if_else(new_deaths >= 1, 1, 0)) %>%
  ungroup() %>%
  select(id, everything())

glimpse(us)

## Observations: 43,285
## Variables: 11
## $ id <chr> "Abbeville_South Carolina", "Abbeville_South Carolina…
## $ date <date> 2020-03-19, 2020-03-20, 2020-03-21, 2020-03-22, 2020…
## $ county <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "…
## $ state <chr> "South Carolina", "South Carolina", "South Carolina",…
## $ fips <chr> "45001", "45001", "45001", "45001", "45001", "45001",…
## $ ttl_cases <dbl> 1, 1, 1, 1, 1, 1, 3, 4, 4, 4, 4, 3, 4, 4, 6, 6, 6, 6,…
## $ ttl_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ new_cases <dbl> 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, -1, 1, 0, 2, 0, 0, 0…
## $ new_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ new_case_today <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
## $ new_death_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
If you look at the new_case_today row above, you see that there was a single case and then several days until a new case. Under our 3-consecutive-day rule, Abbeville, South Carolina would be dropped from the final dataset, since it never shows that pattern in its history.
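To see that gap for yourself, you can pull the flag directly:

us %>%
  filter(id == "Abbeville_South Carolina") %>%
  pull(new_case_today)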
Case study
Riverside, California provides a concise example of the pattern we’re looking for and how to fix it. Following along the new_case_today row, the first 3-day run doesn’t occur until the 9th day, so our goal is to drop the first 8 days.
us %>%
  filter(id == "Riverside_California") %>%
  glimpse()

## Observations: 33
## Variables: 11
## $ id <chr> "Riverside_California", "Riverside_California", "Rive…
## $ date <date> 2020-03-07, 2020-03-08, 2020-03-09, 2020-03-10, 2020…
## $ county <chr> "Riverside", "Riverside", "Riverside", "Riverside", "…
## $ state <chr> "California", "California", "California", "California…
## $ fips <chr> "06065", "06065", "06065", "06065", "06065", "06065",…
## $ ttl_cases <dbl> 1, 1, 4, 4, 4, 6, 10, 10, 12, 14, 16, 16, 22, 28, 30,…
## $ ttl_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 3, 4, 5, 6, 6, 6,…
## $ new_cases <dbl> 1, 0, 3, 0, 0, 2, 4, 0, 2, 2, 2, 0, 6, 6, 2, 15, 3, 1…
## $ new_deaths <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1, 0, 0,…
## $ new_case_today <dbl> 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ new_death_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,…
Build a function that removes leading days
The key is the base R rle() function. Run-length encoding collapses a vector into runs of repeated values, which makes it fairly easy to find where a streak starts and to slice the data there. The guts of this function are adapted from this timely article on RLE.
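To see what rle() gives us, here is a minimal sketch on a toy version of the new_case_today flag (the toy vector is my own, chosen so the first 3-day streak starts at day 6):

# Toy flag: a single case, a gap, a one-day case, then a 3-day streak
flag <- c(1, 0, 0, 1, 0, 1, 1, 1, 0)
runs <- rle(flag)
runs

## Run Length Encoding
##   lengths: int [1:6] 1 2 1 1 3 1
##   values : num [1:6] 1 0 1 0 1 0

# Cumulative end index of each run; the 3-day streak ends at row 8,
# so it starts at row 8 - 3 + 1 = 6
cumsum(runs$lengths)

## [1] 1 3 4 5 8 9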
trim_lead_days <- function(my_id) {
  # Filter to a single county & sort by date (uses the `us` dataframe
  # from the calling environment)
  df <- us %>%
    filter(id == my_id) %>%
    arrange(date)

  # We want to identify the first occurrence of three days in a row
  # with new cases. rle() records the sequences of 1s & 0s as runs.
  runs <- rle(df$new_case_today)

  # Calculate the cumulative row index where each run ends, and from
  # that the row index where each run starts
  sequences <- tibble(lengths = runs$lengths,
                      values = runs$values) %>%
    mutate(end_index = cumsum(lengths),
           start_index = end_index - lengths + 1)

  # Get the row index of the first day of the first run of new cases
  # lasting 3 or more days
  start_index <- sequences %>%
    filter(values == 1, lengths >= 3) %>%
    slice(1) %>%
    pull(start_index)

  # If there is no 3-day sequence return zero rows,
  # else return only the rows from the day the 3-day rule is met
  if (is_empty(start_index)) {
    final_df <- df[0, ]
  } else {
    final_df <- df %>%
      slice(start_index:nrow(df))
  }

  return(final_df)
}

When we run the function on Riverside County we see that it correctly starts on March 15. Note that if a county never has three consecutive days of cases then all records for that county are excluded.
my_id <- "Riverside_California"
final_df <- trim_lead_days(my_id)
glimpse(final_df)

## Observations: 25
## Variables: 11
## $ id <chr> "Riverside_California", "Riverside_California", "Rive…
## $ date <date> 2020-03-15, 2020-03-16, 2020-03-17, 2020-03-18, 2020…
## $ county <chr> "Riverside", "Riverside", "Riverside", "Riverside", "…
## $ state <chr> "California", "California", "California", "California…
## $ fips <chr> "06065", "06065", "06065", "06065", "06065", "06065",…
## $ ttl_cases <dbl> 12, 14, 16, 16, 22, 28, 30, 45, 48, 59, 107, 107, 185…
## $ ttl_deaths <dbl> 0, 2, 3, 3, 3, 4, 5, 6, 6, 6, 8, 8, 8, 8, 8, 9, 13, 1…
## $ new_cases <dbl> 2, 2, 2, 0, 6, 6, 2, 15, 3, 11, 48, 0, 78, 0, 10, 96,…
## $ new_deaths <dbl> 0, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 1, 4, 0,…
## $ new_case_today <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,…
## $ new_death_today <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,…
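As a quick sanity check (assuming, per the Abbeville example earlier, that the county never logs three consecutive days of new cases), the function should return zero rows:

# Expect zero rows for a county with no 3-day streak of new cases
nrow(trim_lead_days("Abbeville_South Carolina"))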
How do we do this for all counties?
There are over 2,500 counties reporting at least one case, and the function above works for one county at a time. How can we run it on all of them? The purrr package gives us a way to feed in every county and get back a single dataframe of the results. Every time I use this pattern, made famous in Hadley Wickham’s The Joy of Functional Programming (for Data Science), I am amazed by how powerful and elegant it is.
This is all the code that is needed to process the entire dataset:
final_df <- map_dfr(unique(us$id), ~ trim_lead_days(.x))
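As a side note, here is a sketch of an equivalent single-pass variant using dplyr::group_modify(), so each county is only touched once instead of being re-filtered out of us. This is my own alternative under the same trimming rule, not part of the original post:

# Hypothetical alternative: apply the rle() trimming logic per group.
# group_modify() hands each county's rows (already sorted by date) to
# the function and stitches the results back together.
final_df2 <- us %>%
  group_by(id) %>%
  group_modify(~ {
    runs <- rle(.x$new_case_today)
    ends <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1
    first_streak <- which(runs$values == 1 & runs$lengths >= 3)[1]
    if (is.na(first_streak)) .x[0, ] else .x[starts[first_streak]:nrow(.x), ]
  }) %>%
  ungroup()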
What is the difference in row counts and number of counties?

trimmed_summary <- tibble(type = "trimmed",
                          rows = nrow(final_df),
                          counties = length(unique(final_df$id)))

source_summary <- tibble(type = "source",
                         rows = nrow(us),
                         counties = length(unique(us$id)))

bind_rows(trimmed_summary, source_summary)

## # A tibble: 2 x 3
## type rows counties
## <chr> <int> <int>
## 1 trimmed 13077 1301
## 2 source 43285 2599
There is a significant difference in the number of rows and counties between the source and trimmed datasets, so use the trimmed version with care.
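If you want to see exactly which counties were dropped, here is a small sketch using the objects already built (dropped is just an illustrative name):

# Counties present in the source data but absent from the trimmed data;
# per the summary above this should be 2599 - 1301 = 1298 counties
dropped <- us %>%
  distinct(id) %>%
  anti_join(distinct(final_df, id), by = "id")

nrow(dropped)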
Bonus code
This will probably be the only article I write on this topic, so I wanted to include some extra code that I think is important for anyone just starting to work with this data.
Add day number since first case
The dataset gives you dates and that’s it. Often you’ll want to compare counties at day 10 or day 20, so they need to be indexed to their start dates. This lets us peg every county’s daily data to the number of days since its first case/death. It’s very useful for plotting, since you’ll be able to clearly see which counties did better than others.
Remember that day one in the source data could be very different from day one in the trimmed data.
us2 <- us %>%
  arrange(id, date) %>%
  mutate(incrementer = 1) %>%
  group_by(id) %>%
  # Running count of days per county since its first reported case
  mutate(day_num = cumsum(incrementer)) %>%
  ungroup() %>%
  mutate(id = factor(id),
         id_day_num = paste0(id, "_", as.character(day_num)),
         type = "Actual") %>%
  arrange(id, day_num)

set.seed(317)
# Sample 8 distinct counties (sampling unique ids avoids drawing the
# same county twice from the row-level data)
county_sample <- sample(unique(as.character(us2$id)), 8, replace = FALSE)

p_data <- us2 %>%
  filter(id %in% county_sample)

library(ggthemes)
ggplot(p_data, aes(day_num, ttl_cases, color = id)) +
  geom_line() +
  ggtitle("Set to same start date: Total cases") +
  theme_solarized() +
  scale_colour_solarized()

Add county population data
The number of cases has much less context without the number of people who reside in the county. NYC can have 10,000 cases, but that is still a tiny fraction of the total number of people who live there. Smaller counties may have higher proportions of their citizens infected but are overshadowed by the large case numbers generated in the urban centers.
We’ll use an external dataset for county populations that uses projections from the 2010 census and can be acquired here. This data has a state and a county FIPS code that have to be concatenated before joining to the main dataframe.
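One caveat: the concatenated key only matches the NY Times codes if STATE and COUNTY keep their leading zeros. If read_csv happens to guess numeric types for those columns, a defensive sketch is to zero-pad them explicitly (pops is just an illustrative name; str_pad() comes from stringr, which loads with the tidyverse):

# Force the codes to zero-padded character so "6" becomes "06" and
# "65" becomes "065" before building the 5-digit fips key
pops <- read_csv("Data/co-est2019-alldata.csv") %>%
  select(STATE, COUNTY, STNAME, CTYNAME, POPESTIMATE2019) %>%
  mutate(fips = paste0(str_pad(STATE, 2, pad = "0"),
                       str_pad(COUNTY, 3, pad = "0")))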
us_w_pop <- read_csv("Data/co-est2019-alldata.csv") %>%
  select(STATE, COUNTY, STNAME, CTYNAME, POPESTIMATE2019) %>%
  rename(population = POPESTIMATE2019) %>%
  # Concatenate the state & county codes into a 5-digit fips key
  mutate(fips = paste0(STATE, COUNTY)) %>%
  right_join(us2, by = "fips") %>%
  mutate(pct_infected = ttl_cases / population,
         pct_dead = ttl_deaths / population)

p_data <- us_w_pop %>%
  filter(id %in% county_sample)

ggplot(p_data, aes(day_num, pct_infected * 100, color = id)) +
  geom_line() +
  ggtitle("Percent infected") +
  theme_solarized() +
  scale_colour_solarized()

None of these eight counties has exceeded 0.15% of its population infected.
What counties have the highest percent infected?
us_w_pop %>%
  group_by(id) %>%
  # Keep only the most recent day per county
  filter(date == max(date)) %>%
  ungroup() %>%
  select(id, ttl_cases, population, pct_infected) %>%
  arrange(-pct_infected) %>%
  top_n(10)

## # A tibble: 10 x 4
## id ttl_cases population pct_infected
## <fct> <dbl> <dbl> <dbl>
## 1 Rockland_New York 6413 325789 0.0197
## 2 Blaine_Idaho 428 23021 0.0186
## 3 Westchester_New York 15887 967506 0.0164
## 4 Nassau_New York 18548 1356924 0.0137
## 5 Orleans_Louisiana 5070 390144 0.0130
## 6 Randolph_Georgia 87 6778 0.0128
## 7 Dougherty_Georgia 1001 87956 0.0114
## 8 Terrell_Georgia 95 8531 0.0111
## 9 Suffolk_New York 15844 1476601 0.0107
## 10 St. John the Baptist_Louisiana 432 42837 0.0101
Note there are several low-population counties in this group, and as of April 8, 2020 no county has exceeded 2% of its residents infected.
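If the tiny-denominator counties are a concern, one option (an illustrative variant, not part of the original analysis) is to put a population floor on the ranking:

# Same ranking, restricted to counties with at least 100,000 residents
us_w_pop %>%
  group_by(id) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  filter(population >= 100000) %>%
  select(id, ttl_cases, population, pct_infected) %>%
  arrange(-pct_infected) %>%
  top_n(10)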
What happened to NYC?
I wanted to point this out: it looks like NYC doesn’t have a FIPS code that matches the population data. The NY Times reports the five boroughs as a single “New York City” entity without a county FIPS code, so the join returns NA for its population. If you know a cleaner way to resolve this, please reach out.
us_w_pop %>%
  group_by(id) %>%
  filter(date == max(date)) %>%
  ungroup() %>%
  select(id, ttl_cases, population, pct_infected) %>%
  arrange(-ttl_cases) %>%
  slice(1:10)

## # A tibble: 10 x 4
## id ttl_cases population pct_infected
## <fct> <dbl> <dbl> <dbl>
## 1 New York City_New York 81803 NA NA
## 2 Nassau_New York 18548 1356924 0.0137
## 3 Westchester_New York 15887 967506 0.0164
## 4 Suffolk_New York 15844 1476601 0.0107
## 5 Cook_Illinois 10520 5150233 0.00204
## 6 Wayne_Michigan 9626 1749343 0.00550
## 7 Bergen_New Jersey 7874 932202 0.00845
## 8 Los Angeles_California 7530 10039107 0.000750
## 9 Rockland_New York 6413 325789 0.0197
## 10 Essex_New Jersey 5598 798975 0.00701
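One possible workaround, a hedged sketch of my own rather than part of the original post: since the NY Times rolls the five boroughs into a single “New York City” row with no FIPS code, you can sum the five borough populations from the census file and patch them in by id instead of by fips.

# Sum the 2019 population estimates for the five boroughs
nyc_pop <- read_csv("Data/co-est2019-alldata.csv") %>%
  filter(STNAME == "New York",
         CTYNAME %in% c("Bronx County", "Kings County", "New York County",
                        "Queens County", "Richmond County")) %>%
  summarise(population = sum(POPESTIMATE2019)) %>%
  pull(population)

# Patch the NYC rows and recompute the per-capita columns
us_w_pop <- us_w_pop %>%
  mutate(population = if_else(id == "New York City_New York",
                              nyc_pop, population),
         pct_infected = ttl_cases / population,
         pct_dead = ttl_deaths / population)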
End notes
In conclusion, you now have the code and data sources to jump right into exploratory data analysis of the spread of COVID-19 in the US. If you are making forecasts at the county level, the trim_lead_days() function will remove the ragged opening data that many counties have, which causes models to lose accuracy.
Acknowledgement
I want to thank Richard Careaga (aka technocrat) of the RStudio community for pointing me to his article on the topic: “What happened to the streak? rle: run length encoding.”