The Bar Chart Race-Version 2

As stated in my previous post on the bar chart race graphic I made, I wanted to make a few different versions to address some of the criticisms of the original. This is version #2, or #1 of the alternatives.

Except I’m tweaking it a bit. I want to show all the top global causes of death, not just summarize them as one bar. It won’t make for quite as eye-popping a visualization, but it will be more complete. I will also adjust for seasonality of global deaths, since some comparative causes, like influenza, definitely have a higher rate of death earlier in the year.

Summarizing the Causes

For the causes, I will use Our World in Data because they have the most easily accessible dataset, which they’ve already summarized from the Global Burden of Disease study, my original source.

I clicked on the link to download the image on the right. From there it prompts you to click on a link to download a CSV file with all of the data used in the graph. That saves in my Download file as a CSV file called annual-number-of-deaths-by-cause.csv

#prior to this load libraries; dplyr,tidyr,matrixStats,lubridate
df<-read.csv("~/Downloads/annual-number-of-deaths-by-cause.csv",stringsAsFactors = F)

#limit only to World, not individual countries
df<-df%>%filter(Entity=="World")
#select only year and causes of death
df<-df[c(3,5:37)]
#change column headers to something simpler
colnames(df)<-c("Year","Meningitis","Lower_resp_infx","GI_infx_dz","Malnutrition","Terrorism","CVD","Dementia","Renal","Resp","Hepatic","GI","Hepatitis","Cancer","Parkinsons","Fire","Malaria","Drowning","Homicide","HIV_AIDS","Drugs","TB","Traffic","Maternal","Neonatal","Alcohol","Nat_disast","Diarrheal","Exposure","Nutritional","Suicide","Conflict","Diabetes","Poisonings")
#malnutrition and nutritional are similar causes; add them and get rid of one
df$Malnutrition<-df$Malnutrition+df$Nutritional
df$Nutritional<-NULL

How did I handle influenza? It had to be calculated separately. For that I used a paper from the Lancet published by the Global Burden of Disease study group, in which they aim to estimate the global mortality burden specifically from influenza. In it they give an estimate of 145,000 global deaths from influenza in 2017, with a 95% uncertainty interval of 99,000-200,000. As that is an uneven uncertainty interval, for the sake of simulation I’ll say the estimate is 145,000 and give the higher band of uncertainty of 55,000, which would give a standard deviation of 55000/1.96=28061.22.

df$Influenza<-rnorm(28,mean=145000,sd=28061.22)

We also need to normalize the data to 2020 population. For that I used Worldometer’s population estimates. I find web tables easier to manipulate in Google Sheets using the importhtml function. So that is what I did. From there I calculated a population multiplier for each year from 1990 to 2017 compared to 2020.

df$multiplier<-c(1.463,1.44,1.418,1.397,1.376,1.357,1.338,1.32,1.302,1.285,1.269,1.253,1.237,1.222,1.206,1.192,1.177,1.162,1.148,1.134,1.12,1.107,1.094,1.081,1.068,1.056,1.044,1.033)
#convert all columns except influenza to 2020 population
for (i in 2:33){
  df[,i]<-df[,i]*df$multiplier
}
#influenza is all in 2017 numbers, so only need to apply 2017 multiplier to it
df$Influenza<-df$Influenza*1.033
#now need to remove influenza figures from lower respiratory infectious disease figures, since influenza was included in those numbers in the original study
df$Lower_resp_infx<-df$Lower_resp_infx-df$Influenza

In a previous version of this, I calculated the average and standard deviation for each disease for 1990-2017. However, I noticed on further inspection that many of these diseases have declined consistently in prevalence over time. Using an average of almost 30 years led to many of the diseases having much higher figures than they had in say, the last five years of data. So, for a more fair comparison, I decided to run a trend analysis on the data to project forward to 2020 what the mortality is most likely to be.

#first need to impute years with missing data; there aren't many, but it makes the data more complete
tracking<-data.frame(disease=colnames(df[2:34]),mean=NA,sd=NA)
for (i in 2:34){
  data<-df[c(1,i)]
  data<-na.omit(data)
  y<-data[1:nrow(data),2]
  x<-data[1:nrow(data),1]
  model<-lm(y~x)
  newdata<-data.frame(x=2020)
  fc<-forecast(model,newdata)
  tracking[i-1,2]<-fc$mean
  tracking[i-1,3]<-(fc$upper[[2]]-fc$mean)/1.96
}

Seasonality of Mortality

This was a bit more difficult to determine. For it I used the UN Data database of deaths by month of death. It was difficult though because the deaths were listed individually for every country in the world and each one had different numbers of years and months per year where data was available. So I had to pull all of the data and then narrow it down just to countries for which full year data was available. I found a group of countries for which multiple years of data were available with all 12 months available in each year.

Given China’s inconsistent data reporting, it is not included in the database. But beyond that problem, this should given a reasonable idea of the seasonality of mortality globally, especially as it includes countries at multiple different latitudes (some respiratory diseases are thought to have more of a seasonal component in colder climes).

From that I was able to calculate the average seasonal factor for each month. They are as follows:

Month Seasonal Multiplier
January 1.166
February 1.022
March 1.084
April 0.969
May 0.979
June 0.924
July 0.971
August 0.962
September 0.915
October 0.986
November 0.973
December 1.048

Simulation of Daily Deaths

Now to simulate daily deaths for 2020. So that the bar chart race can be easily updated, I’ll create a dataframe for the entire year so that all I have to do later is add in the new dates and the updated COVID deaths figures.

#this will be a slightly sloppy way of doing this but oh well
sim<-df[1:34]
names(sim)[1]<-"Date"
sim<-rbind(sim,sim,sim,sim,sim,sim,sim,sim,sim,sim,sim,sim,sim,sim)
sim<-sim[-c(367:392),]
sim$Date<-seq(ymd("2020-01-01"),ymd("2020-12-31"),by="day")
for (i in 2:34){
  sim[,i]<-rnorm(366,mean=tracking[i-1,2],sd=tracking[i-1,3])
}
#that gives you annuals; the averages are over multiple years in which some are leap years, some are not, so divide the annual totals by 365.25 to get dailies
sim[2:34]<-sim[2:34]/365.25
#anything that is below 0 we will just set to 0; I think the only one where this happens is terrorism because annual deaths are so variable for it
for (i in 2:34){
  sim[,i]<-ifelse(sim[,i]<0,0,sim[,i])
}
#and, finally, to apply the seasonality figures; and yes, I am aware that not all causes of death will have seasonality associated with them; this is in a way stacking the deck against COVID in the comparison to give a "fairer" comparison a la criticism received to date
season<-data.frame(month=1:12,adjustment=c(1.166,1.022,1.084,0.969,0.979,.924,.971,.962,.915,.986,.973,1.048))
sim$month<-month(sim$Date)
sim$seas_adj<-season$adjustment[match(sim$month,season$month)]
for (i in 2:34){
  sim[i]<-sim[i]*sim$seas_adj
}
#finally, to calculate cumulative figures for each column and round them to make it neater
for (i in 2:34){
  sim[i]<-cumsum(sim[i])
  sim[i]<-round(sim[i],0)
}
#remove month and seas_adj
sim$month<-NULL
sim$seas_adj<-NULL
#save as csv using whatever directory you wish
#write.csv(sim,"~/Desktop/allcod.csv")

Conclusion

Voila! Now you would pull the COVID causes of death as outlined in the previous post and put them into Flourish as shown. From there you get the new bar chart race. Enjoy!