I recently wrote an R Notebook showing how to grab public data yourself to see that the COVID-19 deaths are not likely being hyped by the media. It focused on New York City because it has been the hardest hit; however, I thought I’d take those same scripts and show Florida, since that is where I live right now.

Source Data

The Center for Disease Control (CDC) and the National Center for Health Statistics (NCHS) provide their data publicly:

For each of these data sets, I did the following:

  1. Went the the link provided above;
  2. Read the dataset description, including what each of the fields are;
  3. Clicked Export then CSV and saved the file on my local hard drive.

You will need to install the statistical packaged R to do this yourself, and you’ll need to install the dplyr and the ggplot2 packages in R. Here’s an example showing the (filtered) first few lines of the weekly death counts for this year:

library(dplyr)

# Build a Date Converter from a string formatted "mm/dd/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )

# Read the data provided by the CDC/NHCS
WeeklyDeathsThisYear <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2019-2020.csv', 
                                 colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                 header=T) # Read the first line and use it as a header

# Remove the last couple of weeks from the measure because of the lagging measure.
# This is discussed below in the notebook.
maxWeekIDX.2020 = max(filter(WeeklyDeathsThisYear, MMWR.Year==2020)$MMWR.Week)  - 2 

# Filter so we have just the New York rows and the year, week, and overal death counts
FLDeathsThisYear <- subset(filter(WeeklyDeathsThisYear,
                                  Jurisdiction.of.Occurrence == 'Florida',  # Just FL
                                  MMWR.Week <= maxWeekIDX.2020,             # Just first n-2 weeks
                                  MMWR.Year == 2020),                       # Just 2020
                              # vvv Just the following fields vvv
                           select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All.Cause))

# Show the top few rows of this subset
head(FLDeathsThisYear)

I chose to plot the individual points, as well as a line plot through those points. Note, this code below assumes you have run the code above first (so that you have the FLDeathsThisYear data frame).

library(ggplot2)

myPlot <- ggplot(FLDeathsThisYear, aes(x=Week.Ending.Date, y=All.Cause)) +
             geom_point(size=2, color="firebrick") +
             geom_line(size=1, color="firebrick") +
             xlab("Week") +
             ylab("Total Reported Deaths in FL (each Week)") +
             theme(text=element_text(family="Times", size=16))
print(myPlot)

Comparing Overall Deaths

Let’s compare overall death counts (from all causes) across several years. I’ll need to merge some datasets:

library(dplyr)

# Build a Date Converter from a string formatted "dd/mm/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%d/%m/%Y") )

# Read the data provided by the CDC/NHCS for 2019-2020, then read the 2014-2018 data
WeeklyDeaths.2019.2020 <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2019-2020.csv', 
                                   colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                   header=T) # Read the first line and use it as a header
WeeklyDeaths.2014.2018 <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2014-2018.csv', 
                                   colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                   header=T) # Read the first line and use it as a header

# Filter so we have just the New York rows and the year, week, and overal death counts
FLDeaths.A <- subset(filter(WeeklyDeaths.2019.2020,
                            Jurisdiction.of.Occurrence == 'Florida'),  # Just Florida
                     select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All.Cause))

# Put this is a mergable form because the column names differ a bit between the datasets
FLDeaths.AA <- data.frame(Year= FLDeaths.A$MMWR.Year,
                          Week = FLDeaths.A$Week.Ending.Date,
                          WeekIDX = FLDeaths.A$MMWR.Week,
                          TotalDeaths = FLDeaths.A$All.Cause)

# Filter so we have just the New York rows and the year, week, and overal death counts
FLDeaths.B <- subset(filter(WeeklyDeaths.2014.2018,
                            Jurisdiction.of.Occurrence == 'Florida'),  # Just Florida
                     select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All..Cause))

# Put this is a mergable form because the column names differ a bit between the datasets
FLDeaths.BB <- data.frame(Year= FLDeaths.B$MMWR.Year,
                          Week = FLDeaths.B$Week.Ending.Date,
                          WeekIDX = FLDeaths.B$MMWR.Week,
                          TotalDeaths = FLDeaths.B$All..Cause)


# Combine the two datasets so we have all years from 2014 through 2020
FLDeaths <- mutate(rbind(FLDeaths.AA, FLDeaths.BB),
                   ThisYear=(Year==2020))   # We'll use this field later to highlight 2020

That looks like a lot, but the whole point is to get that last dataset, FLDeaths, which contains the total deaths (from all causes) for various weeks across the year for all years from 2014 to 2020. As before, you have to run the code above before the next chunk of code will work.

library(ggplot2)

myPlot <- ggplot(FLDeaths, aes(x=WeekIDX, y=TotalDeaths, group=Year, color=ThisYear)) +
             geom_line(size=1) +
             scale_color_manual(values=c("darkgray", "firebrick"), labels=c("Other Years", "2020"), name="") +
             xlab("Week") +
             ylab("Total Reported Deaths in FL (each Week)") +
              theme(text=element_text(family="Times", size=16))
print(myPlot)

Overall death count is higher than normal since the COVID outbreak (2018 had an unusually high death count in January for some reason). Of course, this is not as bad as New York City. The steep drop at the end is a result of the lagging measure – not all deaths reports have been sent to the CDC for recent weeks.

Okay, not so bad, then! We’re no New York City, thankfully!

Still, We’re Not Likely Overcounting COVID Deaths

Let’s first look at adjusted cumulative death counts. That is, let’s total all deaths that occurred up until this point in the year, then subtract off the average count for that year. That is, how do this year’s deaths differ from the expected number of deaths? I am going to lop off the most recent two weeks from this data for two reasons:

  1. Recent data is very inaccurate since the lagging measure means not all deaths have been reported to the CDC.
  2. The COVID dataset only goes up through early September, 2020 so removing two weeks from the overall death totals is a closer apples-to-apples on the date range.
# First get only the week indexes for any year that can be compared to 2020
#  Subtract a couple weeks because of the lagging indicator and to be consistent with the COVID dates
maxWeekIDX <- max(filter(FLDeaths, Year==2020)$WeekIDX) - 2
FLDeaths.abridged <- filter(FLDeaths, 
                             WeekIDX <= maxWeekIDX,  # All weeks up to the latest in 2020
                             Year != 2020)           # All years *other than* 2020

# Now let's find the average death counts for all years *other* than 2020 across each week
AggDeathCounts <- summarize(group_by(FLDeaths.abridged, WeekIDX), AvgDeathCount.pre2020 = mean(TotalDeaths))

# Now let's accumulate them:
totalPre2020Deaths <- sum(AggDeathCounts$AvgDeathCount.pre2020)

# Now we'll accumulate deaths for 2020:
FLDeaths.2020 <- filter(FLDeaths, Year==2020)
total2020Deaths <- sum(FLDeaths.2020$TotalDeaths)

# Here's the difference:
cat('How many more deaths than typical so far (on average) through end of Mid-July?  ', total2020Deaths - totalPre2020Deaths, '\n')
## How many more deaths than typical so far (on average) through end of Mid-July?   30636.33

Now we can lookup what the COVID-19 attributed death counts the CDC has for us for Florida and compare for ourselves. We’ll download one more dataset:

library(dplyr)

# Build a Date Converter from a string formatted "dd/mm/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )

# Read the data provided by the CDC/NHCS
CovidAttributedDeaths <- read.csv('Provisional_COVID-19_Death_Counts_by_Week_Ending_Date_and_State.csv', 
                                  colClasses=c("Date.as.of"="rpwDateConvert"),  # Convert to a date format
                                  header=T) # Read the first line and use it as a header

# Grab the counts for FL
FLCovidDeaths <- filter(CovidAttributedDeaths, State == "Florida")

# What's the biggest number they have!
totalFLDeaths = sum(na.omit(FLCovidDeaths$COVID.19.Deaths))
lastWeekDate <- max(FLCovidDeaths$Start.week)
cat('How many deaths has the CDC attributed to COVID through end of ',
    lastWeekDate, '?   ', totalFLDeaths, '\n')
## How many deaths has the CDC attributed to COVID through end of  09/12/2020 ?    12237

Note that I am counting what the CDC codes as U07.1 deaths – deaths where COVID-19 is the direct cause, not people being hit by a bus and happen to test positive. Go read the CDC source material – they are clear about the coding and attribution.

Regardless, the COVID-19 attributed death count is signficantly lower than the overall number of deaths above the expected count we have experienced this year. To be fair, there is reason to be unsatisfied with this analysis (e.g., lopping off the last two weeks affects the expected calculuations in complicated ways), so be careful what you conclude at this point; however, based on the data there is absolutely no reason to believe we are overcounting COVID-19 deaths in Florida – at least based on data between January 1 and the date this notebook was published. The only way you can buy that is if you believe doctors are systemically falsifying death records (not the cause, rather the fact) across the state. If that’s your view, you’ll need to provide evidence.