Summary

I am working with the NY Times COVID-19 data and noticed when looking at California counties that there was a long period between the first case and the next case. I’m using it to predict COVID-19 hotspots and that long latent period was messing up my model. For instance, if I were to predict one day ahead using all days for Orange county, California which had 30+ days between first case and sustained growth the model produce a number that was significanty lower than the day before, even though the trend was rising. When those early days before sustained growth occurs are removed the forecasts look like they should. Our tendency is to want to use all of the data because it is real and has happened but if you are building a forcasting model those early days of little value. The full dataset is there if you need it.

Methodology

While watching a news broadcast I saw a COVID-19 forecast chart that explicity stated their model started after three consecutive days of new cases. This made sense to me but I had no idea how to implement it since there are over 2,500 counties with at least one case? This article shows you how to trim those leading days and provides a few bonus features.

The Goal

Remove all rows per county that occur before the first day of the first 3-day run of daily cases.

What does that even mean?

First let’s load the data. The NY Times data gives daily total cases & deaths per county. Since no columns are provided for new cases or deaths they have to be created. I use the lag() function for this. I also make two binary variables to indicate whether that day had new cases/deaths or not.

library(tidyverse)
us <- read_csv("Data/us-counties-2020-04-08.csv") %>% 
  mutate(id = paste0(county, "_", state)) %>% 
  arrange(id, date) %>% 
  rename(ttl_cases = cases, 
         ttl_deaths = deaths) %>% 
  group_by(id) %>% 
  mutate(new_cases = ttl_cases - lag(ttl_cases),
         new_cases = if_else(is.na(new_cases), ttl_cases, new_cases),
         new_deaths = ttl_deaths - lag(ttl_deaths),
         new_deaths = if_else(is.na(new_deaths), ttl_deaths, new_deaths),
         new_case_today = if_else(new_cases >= 1, 1, 0), 
         new_death_today = if_else(new_deaths >= 1, 1, 0)) %>% 
  ungroup() %>% 
  select(id, everything())
glimpse(us)
## Observations: 43,285
## Variables: 11
## $ id              <chr> "Abbeville_South Carolina", "Abbeville_South Carolina…
## $ date            <date> 2020-03-19, 2020-03-20, 2020-03-21, 2020-03-22, 2020…
## $ county          <chr> "Abbeville", "Abbeville", "Abbeville", "Abbeville", "…
## $ state           <chr> "South Carolina", "South Carolina", "South Carolina",…
## $ fips            <chr> "45001", "45001", "45001", "45001", "45001", "45001",…
## $ ttl_cases       <dbl> 1, 1, 1, 1, 1, 1, 3, 4, 4, 4, 4, 3, 4, 4, 6, 6, 6, 6,…
## $ ttl_deaths      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ new_cases       <dbl> 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, -1, 1, 0, 2, 0, 0, 0…
## $ new_deaths      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ new_case_today  <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,…
## $ new_death_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

If you look at the new_case_today row above you see that there was a single case and then several days until a new case. Under our 3 consecutive day rule Abbeville, South Carolina would be dropped from the final dataset since it doesn’t have that pattern in its history.

Case study

Riverside, California provides a concise example to see what pattern we’re looking for and how to fix it. Following along the new_case_today row, the 3-day pattern doesn’t occur until the 9th day. Our goal is to drop the first 8 days.

us %>% 
  filter(id == "Riverside_California") %>% 
  glimpse()
## Observations: 33
## Variables: 11
## $ id              <chr> "Riverside_California", "Riverside_California", "Rive…
## $ date            <date> 2020-03-07, 2020-03-08, 2020-03-09, 2020-03-10, 2020…
## $ county          <chr> "Riverside", "Riverside", "Riverside", "Riverside", "…
## $ state           <chr> "California", "California", "California", "California…
## $ fips            <chr> "06065", "06065", "06065", "06065", "06065", "06065",…
## $ ttl_cases       <dbl> 1, 1, 4, 4, 4, 6, 10, 10, 12, 14, 16, 16, 22, 28, 30,…
## $ ttl_deaths      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 3, 3, 3, 4, 5, 6, 6, 6,…
## $ new_cases       <dbl> 1, 0, 3, 0, 0, 2, 4, 0, 2, 2, 2, 0, 6, 6, 2, 15, 3, 1…
## $ new_deaths      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 1, 1, 0, 0,…
## $ new_case_today  <dbl> 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ new_death_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,…

Build function that removes leading days

The key is the base R rle() function. Run-Length Encoding looks for a series of 0s & 1s and makes it fairly easy to slice the data where you need to. The guts of this function are copy & pasted from this timely article on RLE.

trim_lead_days <- function(my_id) {
  # Filter & sort
  df <- us %>%
    filter(id == my_id) %>% 
    arrange(date)
  
  # We want to identify the first occurance of three days in a row with new cases
  # Record sequences of 1s & 0s
  runs <- rle(df$new_case_today)
  runs <- tibble(runs$lengths, runs$values)
  colnames(runs) <- c("lengths", "values")
  #runs
  
  #Caluculate cumulative index numbers when sequence starts
  sequences <- tibble(lengths = runs$lengths, 
                      values = runs$values) %>% 
    mutate(indices = cumsum(runs$lengths))
  #sequences
  
  # Get index for first occurance of new cases 3 or more days in a row
  start_index <- sequences %>%
    filter(values == 1, lengths >= 3) %>% 
    slice(1) %>% 
    pull(indices)
  
  # If no 3-day sequence then return empty row
  # else return only the rows after 3-day rule met 
  if(is_empty(start_index)) {
    final_df <- df[0, ]
  } else {
    final_df <- df %>% 
      slice((start_index - 2):nrow(df))
  }
  
  return(final_df)
}

When we run the function on Riverside county we see that it correctly starts on March 15. Note that if a county never has three consecutive days of cases then all records for that county are excluded.

my_id <- "Riverside_California"
final_df <- trim_lead_days(my_id)
glimpse(final_df)
## Observations: 25
## Variables: 11
## $ id              <chr> "Riverside_California", "Riverside_California", "Rive…
## $ date            <date> 2020-03-15, 2020-03-16, 2020-03-17, 2020-03-18, 2020…
## $ county          <chr> "Riverside", "Riverside", "Riverside", "Riverside", "…
## $ state           <chr> "California", "California", "California", "California…
## $ fips            <chr> "06065", "06065", "06065", "06065", "06065", "06065",…
## $ ttl_cases       <dbl> 12, 14, 16, 16, 22, 28, 30, 45, 48, 59, 107, 107, 185…
## $ ttl_deaths      <dbl> 0, 2, 3, 3, 3, 4, 5, 6, 6, 6, 8, 8, 8, 8, 8, 9, 13, 1…
## $ new_cases       <dbl> 2, 2, 2, 0, 6, 6, 2, 15, 3, 11, 48, 0, 78, 0, 10, 96,…
## $ new_deaths      <dbl> 0, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 0, 0, 0, 0, 1, 4, 0,…
## $ new_case_today  <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0,…
## $ new_death_today <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,…

How do we do this for all counties?

There are over 2,500 counties reporting at least one case. The above function works for one county. How can we get this to run on all counties. The purrr package gives us a way to feed in all of the counties and get back a single dataframe of the results. Every time I use this pattern made famous in Hadley Wickham’s The Joy of Functional Programming (for Data Science) I amazed by how powerful and elegant it is.

This is all the code that is needed to process the entire dataset

final_df <-  map_dfr(unique(us$id), ~ trim_lead_days(.x))

What is the difference in row counts and number of counties?

trimmed_summary <- tibble(type = "trimmed", 
                          rows = nrow(final_df), 
                         counties = length(unique(final_df$id)))
source_summary <- tibble(type = "source", 
                          rows = nrow(us), 
                         counties = length(unique(us$id)))
bind_rows(trimmed_summary, source_summary)
## # A tibble: 2 x 3
##   type     rows counties
##   <chr>   <int>    <int>
## 1 trimmed 13077     1301
## 2 source  43285     2599

There is a significant difference in the number of rows and counties so use with care.

Bonus code

This will probably be the only paper I’ll write on this topic so I wanted to include some extra code that I think is important for someone just starting to work with this data.

Add day number since first case

The dataset gives you dates and that’s it. Many times you’ll want to compare counties in day 10 or 20 so they need to be indexed to their start dates. This will allow us to peg every county’s daily data to how many days from the first case/death. It’s very useful for plotting since you’ll be able to clearly see which counties did better than others.
Remember that day one in the source data could be very different than day one in the trimmed data.

us2 <- us %>% 
  arrange(id, date) %>%
  mutate(incrementer = 1) %>% 
  group_by(id) %>% 
  mutate(day_num = cumsum(incrementer)) %>% 
  ungroup() %>%
  mutate(id = factor(id), 
         id_day_num = paste0(id, "_", as.character(day_num)), 
         type = "Actual") %>% 
    arrange(id, day_num)
set.seed(317)
county_sample <- sample(as.character(us2$id), 8, replace = FALSE)
p_data <- us2 %>% 
  filter(id %in% county_sample) 

library(ggthemes)
ggplot(p_data, aes(day_num, ttl_cases, color = id)) + 
  geom_line() +
  ggtitle("Set to same start date: Total cases") +
  theme_solarized() +
  scale_colour_solarized()

Add county population data

The number of cases has much less context without the number of people that reside in the county. NYC can have 10,000 cases but it is still a tiny fraction of the total number of people that reside there. Smaller counties may have higher proportions of their citizens infected but are overshadowed by the large case numbers generated in the urban centers.

We’ll use an external dataset for county populations that uses projections from the 2010 census and can be acquired here. This data has a state and county fips code that has to be concatenated before joining to the main dataframe.

us_w_pop <- read_csv("Data/co-est2019-alldata.csv") %>%
  select(STATE, COUNTY, STNAME, CTYNAME, POPESTIMATE2019) %>%
  rename(population = POPESTIMATE2019) %>% 
  mutate(fips = paste0(STATE, COUNTY)) %>% 
  right_join(us2, by = "fips") %>% 
  mutate(pct_infected = ttl_cases / population, 
         pct_dead = ttl_deaths / population)
p_data <- us_w_pop %>% 
  filter(id %in% county_sample)

ggplot(p_data, aes(day_num, pct_infected*100, color = id)) + 
  geom_line() +
  ggtitle("Percent infected") +
  theme_solarized() +
  scale_colour_solarized()

Of these eight counties none of them have exceeded 0.15% of their population infected.

What counties have the highest percent infected?

us_w_pop %>% 
  group_by(id) %>% 
  filter(date == max(date)) %>% 
  ungroup() %>% 
  select(id, ttl_cases, population, pct_infected) %>% 
  arrange(-pct_infected) %>% 
  top_n(10)
## # A tibble: 10 x 4
##    id                             ttl_cases population pct_infected
##    <fct>                              <dbl>      <dbl>        <dbl>
##  1 Rockland_New York                   6413     325789       0.0197
##  2 Blaine_Idaho                         428      23021       0.0186
##  3 Westchester_New York               15887     967506       0.0164
##  4 Nassau_New York                    18548    1356924       0.0137
##  5 Orleans_Louisiana                   5070     390144       0.0130
##  6 Randolph_Georgia                      87       6778       0.0128
##  7 Dougherty_Georgia                   1001      87956       0.0114
##  8 Terrell_Georgia                       95       8531       0.0111
##  9 Suffolk_New York                   15844    1476601       0.0107
## 10 St. John the Baptist_Louisiana       432      42837       0.0101

Note there are several low population counties in this group and as of April 8, 2020 no county has exceeded 2% of their residents infected.

What happened to NYC?

I don’t know why this is occuring but wanted to point it out. It looks like NYC doesn’t have a FIPS code that matches the population data. If you know how to resolve please reach out.

us_w_pop %>% 
  group_by(id) %>% 
  filter(date == max(date)) %>% 
  ungroup() %>% 
  select(id, ttl_cases, population, pct_infected) %>% 
  arrange(-ttl_cases) %>% 
  slice(1:10)
## # A tibble: 10 x 4
##    id                     ttl_cases population pct_infected
##    <fct>                      <dbl>      <dbl>        <dbl>
##  1 New York City_New York     81803         NA    NA       
##  2 Nassau_New York            18548    1356924     0.0137  
##  3 Westchester_New York       15887     967506     0.0164  
##  4 Suffolk_New York           15844    1476601     0.0107  
##  5 Cook_Illinois              10520    5150233     0.00204 
##  6 Wayne_Michigan              9626    1749343     0.00550 
##  7 Bergen_New Jersey           7874     932202     0.00845 
##  8 Los Angeles_California      7530   10039107     0.000750
##  9 Rockland_New York           6413     325789     0.0197  
## 10 Essex_New Jersey            5598     798975     0.00701

End notes

In conclusion, you now have the code and data sources to jump right into exploratory data analysis of the spread of COVID-19 in the US. If you are making forecasts at the county-level then the trim_lead_days() function will remove the ragged opening data that many counties have that causes models to lose accuracy.

Acknowledgement

I want to thank Richard Careaga (aka technocrat) of the RStudio community for pointing me to his paper on the topic. What happened to the streak? rle: run length encoding

END