There are a lot of people in the US who believe that the mortality rate of COVID-19 is far lower than reported. It is understandable; we are suffering significant economic damage as a result of the policies we’ve put in place to mitigate the health risks. Unemployment rates are skyrocketing, businesses are going under, and the entire US economy is basically shut down. Regardless of which policies we think are correct, we need to first agree on the scope of the problem. Making decisions out of fear is not a good idea.

Indeed, the media has a tendency to try to frighten and anger people because frightened, angry people stay tuned to the media, and the media is selling eyeballs to advertisers. Let’s call this the media hype bias argument.

Moreover, there are natural and reasonable concerns about how COVID-19 deaths are counted: Counting any person who happens to die while testing positive for COVID-19 will by necessity involve counting people who did not die from COVID-19 or its complications. I call this the misattribution argument.

Additionally, we are still not testing very many people, and our testing is focused on people who are already very, very sick. That means that the conditional ratio of COVID-19 deaths while also testing positive may be artificially inflated. I call this the test-sample bias argument.

All of these arguments are reasonable points of concern or skepticism. As a skeptic myself (in all things), I am not one to eschew the skepticism of others! But being skeptical is just the first step. To find fault in a position and then immediately conclude the opposite of that position is also not rational. Skepticism is an invitation to look closer.

So let’s take a closer look.

Eliminate the Media: Go Directly to the Source

Any of us can easily get past the media hype problem by going directly to the source of the data. The Centers for Disease Control and Prevention (CDC) and the National Center for Health Statistics (NCHS) provide their data publicly. You can get the data yourself!

For each of these data sets, I did the following:

  1. Went to the link provided above;
  2. Read the dataset description, including what each of the fields is;
  3. Clicked Export then CSV and saved the file on my local hard drive.

I’m going to give you all of my R code, so everything I do is completely transparent. I kept the names of the files the CDC had so that my code below should work for you. New York City is one of the hardest hit areas, so I’m going to focus just on New York City, and let’s just look at the total death counts and not worry about causes right now. For now, let’s just look at 2020 deaths.
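
Before running anything, it may help to confirm that the exported files actually landed in your working directory. Here is a minimal sketch, assuming the three file names used in the code in this post; the helper `missingFiles` is a name I made up for illustration, not something from the CDC or from any package.

```r
# File names as exported from the CDC site (the same names used in the code in this post)
expectedFiles <- c(
  "Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2019-2020.csv",
  "Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2014-2018.csv",
  "Provisional_Death_Counts_for_Coronavirus_Disease__COVID-19_.csv"
)

# Hypothetical helper: returns the names of the files that are NOT present
# in the given directory (defaults to the current working directory)
missingFiles <- function(files, dir = ".") {
  files[!file.exists(file.path(dir, files))]
}

stillNeeded <- missingFiles(expectedFiles)
if (length(stillNeeded) == 0) {
  cat("All data files found.\n")
} else {
  cat("Still need to download:\n")
  cat(paste(" -", stillNeeded), sep = "\n")
}
```

If a file is reported as missing, check that you saved it under the name the CDC export gave it, or adjust the names in `expectedFiles`.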

You will need to install the statistical package R to do this yourself, and you’ll need to install the dplyr and ggplot2 packages in R. Here’s an example showing the (filtered) first few lines of the weekly death counts for this year:

library(dplyr)

# Build a Date Converter from a string formatted "mm/dd/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )

# Read the data provided by the CDC/NCHS
WeeklyDeathsThisYear <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2019-2020.csv', 
                                 colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                 header=T) # Read the first line and use it as a header

# Filter so we have just the New York City rows and the year, week, and overall death counts
NYCDeathsThisYear <- subset(filter(WeeklyDeathsThisYear,
                                  Jurisdiction.of.Occurrence == 'New York City',  # Just NYC
                                  MMWR.Year == 2020),                             # Just 2020
                              # vvv Just the following fields vvv
                           select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All.Cause))

# Show the top few rows of this subset
head(NYCDeathsThisYear)

Hey, we can easily plot this! I like to use ggplot2 because it is easy to build nice-looking plots with it. I chose to plot the individual points, as well as a line plot through those points (you could make an argument that line plots are not good because it is a discrete time series, but let’s not quibble). But if you take this code and download the data, you can plot it however you like. Note that the code below assumes you have run the code above first (so that you have the NYCDeathsThisYear data frame).

library(ggplot2)

myPlot <- ggplot(NYCDeathsThisYear, aes(x=Week.Ending.Date, y=All.Cause)) +
             geom_point(size=2, color="firebrick") +
             geom_line(size=1, color="firebrick") +
             xlab("Week") +
             ylab("Total Reported Deaths in NYC (each Week)") +
             theme(text=element_text(family="Times", size=16))
print(myPlot)

So you don’t need to trust the media! You can get the public data yourself, and the code I provide above is entirely open and transparent to you. Unless you believe that the CDC and the NCHS are providing systemically faulty data about overall death counts (i.e., they are fibbing by a large margin), there’s really no concern that you are being deliberately misled. If your claim is that the total death counts are being manipulated in a systemic and large-scale way, then you’ll need to provide evidence for that claim since it is a pretty extreme one.

It’s Definitely Not Misattribution

Now we know how to get data and how to deal with it, let’s compare overall death counts (from all causes) across several years. I’ll need to merge some datasets – there’s a better way to do what I am about to do, but I think this way will be more understandable. Take a look:

library(dplyr)

# Build a Date Converter from a string formatted "mm/dd/YYYY" to a proper date datatype in R
# (the CDC files use month/day/year, so the format must be %m/%d/%Y, as in the first chunk)
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )

# Read the data provided by the CDC/NCHS for 2019-2020, then read the 2014-2018 data
WeeklyDeaths.2019.2020 <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2019-2020.csv', 
                                   colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                   header=T) # Read the first line and use it as a header
WeeklyDeaths.2014.2018 <- read.csv('Weekly_Counts_of_Deaths_by_State_and_Select_Causes__2014-2018.csv', 
                                   colClasses=c("Week.Ending.Date"="rpwDateConvert"),  # Convert to a date format
                                   header=T) # Read the first line and use it as a header

# Filter so we have just the New York City rows and the year, week, and overall death counts
NYCDeaths.A <- subset(filter(WeeklyDeaths.2019.2020,
                            Jurisdiction.of.Occurrence == 'New York City'),  # Just New York City
                     select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All.Cause))

# Put this in a mergeable form because the column names differ a bit between the datasets
NYCDeaths.AA <- data.frame(Year= NYCDeaths.A$MMWR.Year,
                          Week = NYCDeaths.A$Week.Ending.Date,
                          WeekIDX = NYCDeaths.A$MMWR.Week,
                          TotalDeaths = NYCDeaths.A$All.Cause)

# Filter so we have just the New York City rows and the year, week, and overall death counts
NYCDeaths.B <- subset(filter(WeeklyDeaths.2014.2018,
                            Jurisdiction.of.Occurrence == 'New York City'),  # Just New York City
                     select = c(Jurisdiction.of.Occurrence,MMWR.Year,MMWR.Week,Week.Ending.Date,All..Cause))

# Put this in a mergeable form because the column names differ a bit between the datasets
NYCDeaths.BB <- data.frame(Year= NYCDeaths.B$MMWR.Year,
                          Week = NYCDeaths.B$Week.Ending.Date,
                          WeekIDX = NYCDeaths.B$MMWR.Week,
                          TotalDeaths = NYCDeaths.B$All..Cause)


# Combine the two datasets so we have all years from 2014 through 2020
NYCDeaths <- mutate(rbind(NYCDeaths.AA, NYCDeaths.BB),
                   ThisYear=(Year==2020))   # We'll use this field later to highlight 2020

That looks like a lot, but the whole point is to get that last dataset, NYCDeaths, which contains the total deaths (from all causes) for each week of the year for all years from 2014 to 2020. If you don’t believe me, download the data and step through the code yourself; nothing is hidden from you. As before, you have to run the code above before the next chunk of code will work.

library(ggplot2)

myPlot <- ggplot(NYCDeaths, aes(x=WeekIDX, y=TotalDeaths, group=Year, color=ThisYear)) +
             geom_line(size=1) +
             scale_color_manual(values=c("darkgray", "firebrick"), labels=c("Other Years", "2020"), name="") +
             xlab("Week") +
             ylab("Total Reported Deaths in NYC (each Week)") +
             theme(text=element_text(family="Times", size=16))
print(myPlot)

This simply cannot be misattribution. We aren’t looking at attribution at all. The overall death count in New York City is many times what is typical for this time of year. Don’t trust me; get the data yourself and look at it.

We Are Not Overcounting COVID Deaths

Counting deaths is inherently problematic. The concern that someone who is hit by a bus and happens to test positive for COVID-19 will be counted as a “COVID death” is valid and reasonable. The idea that our tests are biased toward those who are already sick is also valid and reasonable. So it is natural to wonder whether we are counting these deaths correctly or not. Are we inflating the concern because of counting errors?

I think the best place to start is with adjusted cumulative death counts. That is, let’s total all deaths that occurred up until this point in the year, then subtract the average count for the same weeks in previous years. This will give us a sense of how many more people have died this year in New York City than is typical.

# First get only the week indexes for any year that can be compared to 2020
maxWeekIDX <- max(filter(NYCDeaths, Year==2020)$WeekIDX)
NYCDeaths.abridged <- filter(NYCDeaths, 
                             WeekIDX <= maxWeekIDX,  # All weeks up to the latest in 2020
                             Year != 2020)           # All years *other than* 2020

# Now let's find the average death counts for all years *other* than 2020 across each week
AggDeathCounts <- summarize(group_by(NYCDeaths.abridged, WeekIDX), AvgDeathCount.pre2020 = mean(TotalDeaths))

# Now let's accumulate them:
totalPre2020Deaths <- sum(AggDeathCounts$AvgDeathCount.pre2020)

# Now we'll accumulate deaths for 2020:
NYCDeaths.2020 <- filter(NYCDeaths, Year==2020)
total2020Deaths <- sum(NYCDeaths.2020$TotalDeaths)

# Here's the difference:
cat('How many more deaths than typical so far (on average)?  ', total2020Deaths - totalPre2020Deaths, '\n')
## How many more deaths than typical so far (on average)?   18419.67
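
To make the logic of that calculation easier to check by hand, here is the same idea on a tiny made-up table, written in base R so it runs without dplyr. The numbers are invented purely for illustration, and `excessDeaths` is a hypothetical helper name, not part of the analysis above.

```r
# Made-up weekly death counts: two earlier years plus a partial "current" year
toy <- data.frame(
  Year        = c(2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020),
  WeekIDX     = c(1, 2, 3, 1, 2, 3, 1, 2),
  TotalDeaths = c(100, 110, 120, 120, 130, 140, 150, 400)
)

# Hypothetical helper: cumulative deaths in targetYear minus the sum of the
# per-week averages of all other years, restricted to the weeks that
# targetYear has reported so far
excessDeaths <- function(df, targetYear) {
  current   <- df[df$Year == targetYear, ]
  maxWeek   <- max(current$WeekIDX)
  baseline  <- df[df$Year != targetYear & df$WeekIDX <= maxWeek, ]
  weeklyAvg <- tapply(baseline$TotalDeaths, baseline$WeekIDX, mean)
  sum(current$TotalDeaths) - sum(weeklyAvg)
}

cat("Toy excess deaths:", excessDeaths(toy, 2020), "\n")
## Toy excess deaths: 320
```

Here the baseline averages are 110 for week 1 and 120 for week 2 (week 3 is dropped because the toy 2020 has no week 3 yet), so the baseline total is 230 and the excess is 550 - 230 = 320.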

Now we can look up the COVID-19 attributed death counts the CDC has for New York City and compare for ourselves!

library(dplyr)

# Build a Date Converter from a string formatted "mm/dd/YYYY" to a proper date datatype in R
setAs("character", "rpwDateConvert", function(from) as.Date(from, format="%m/%d/%Y") )

# Read the data provided by the CDC/NCHS
CovidAttributedDeaths <- read.csv('Provisional_Death_Counts_for_Coronavirus_Disease__COVID-19_.csv', 
                                  colClasses=c("Date.as.of"="rpwDateConvert"),  # Convert to a date format
                                  header=T) # Read the first line and use it as a header

# Grab the counts for NYC
NYCCovidDeaths <- filter(CovidAttributedDeaths, State == "New York City")

# What's the biggest number they have?
totalNYCDeaths = max(NYCCovidDeaths$All.COVID.19.Deaths..U07.1.)
cat('How many deaths has the CDC attributed to COVID?  ', totalNYCDeaths, '\n')
## How many deaths has the CDC attributed to COVID?   10978

Note that I am counting what the CDC codes as U07.1 deaths – deaths where COVID-19 is the direct cause, not people who are hit by a bus and happen to test positive.

It’s clear from this that we are undercounting COVID deaths, not overcounting (at least for New York City): the 10,978 deaths attributed to COVID-19 account for only about 60% of the roughly 18,420 excess deaths. At the very least, there is no evidence to support the hypothesis that there is a systemic bias to overinflate the death counts.
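
The comparison comes down to simple arithmetic on the two totals printed above. Here is a small sketch that spells it out; the variable names are mine, and how you phrase "by how much" depends on which total you use as the denominator.

```r
# The two totals printed by the code chunks above
excessDeaths2020 <- 18419.67   # deaths above the average of the other years, so far
covidAttributed  <- 10978      # deaths the CDC coded as U07.1 for New York City

# Excess deaths that are not accounted for by the COVID-attributed count
unattributed <- excessDeaths2020 - covidAttributed

# Shares relative to the excess-death total
shareAttributed   <- covidAttributed / excessDeaths2020
shareUnattributed <- unattributed / excessDeaths2020

# The same gap, expressed relative to the attributed count instead
relativeToAttributed <- unattributed / covidAttributed

cat(sprintf("Excess deaths not attributed to COVID-19: %.2f\n", unattributed))
cat(sprintf("Share of excess deaths attributed:        %.1f%%\n", 100 * shareAttributed))
cat(sprintf("Share of excess deaths unattributed:      %.1f%%\n", 100 * shareUnattributed))
cat(sprintf("Gap relative to the attributed count:     %.1f%%\n", 100 * relativeToAttributed))
```

Either way you slice it, the attributed count falls well short of the excess, which is the opposite of what systematic overcounting would look like.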

Trust But Verify

It’s perfectly reasonable to be cautious about the motivations of information sources like the public media! It’s perfectly valid to recognize flaws in counting procedures, such as the tendency to count deaths as “COVID related” whether or not the person died of COVID. But the data is publicly available, and analyzing it is easy to do (I did it, including this workbook writeup, in a couple of hours). Using your natural and reasonable skepticism to deny things that are easily verifiable is not rational.

Are there issues with the analysis I did above? Sure. Perhaps we should be indexing by population count, for example, since population grows (though not sufficiently to explain the death counts, of course). So download the data for yourself and look at it. Putting trust in well-known news sources is not folly – blindly trusting anyone certainly is. But all this is open and verifiable. Check for yourself.
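
If you want to try the population adjustment mentioned above, one simple approach is to convert counts into deaths per 100,000 residents before comparing years. The sketch below uses toy numbers only (they are NOT real NYC deaths or population figures), and `deathsPer100k` is a hypothetical helper name; you would need to supply real yearly population estimates yourself.

```r
# Toy numbers only: NOT real NYC deaths or population
toy <- data.frame(
  Year       = c(2019, 2020),
  Deaths     = c(10, 12),
  Population = c(1000, 1100)
)

# Hypothetical helper: deaths per 100,000 residents
deathsPer100k <- function(deaths, population) {
  100000 * deaths / population
}

toy$RatePer100k <- deathsPer100k(toy$Deaths, toy$Population)
print(toy)
```

In this toy case the raw count rises by 20% while the population-adjusted rate rises by about 9%, which shows the kind of correction that indexing by population makes.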

This is Happening

The epidemiologists who are studying COVID-19 have indicated repeatedly in reproducible scientific publications that COVID-19 is a particularly virulent and deadly pathogen. The WHO and the CDC have indicated that this is so and have provided data to support this. The overall death rates are much higher than normal. Put simply: This is happening.