The Mythbusters were the hosts of the Australian-American science entertainment television program with the same name that aired between 2003 and 2016. They used elements of the scientific method to test the validity of myths. They conducted the experiment and at the end they decided, after their analysis on the experiment of course, the outcome for the myth:
We are going to conduct our analysis in the same exact way !
We start by tweaking some settings for better visualization of the plots
We load the libraries needed for this analysis
library(GGally)
library(tidyverse)
library(sf)
library(cartogram)
library(broom)
library(tweenr)
library(maptools)
library(viridis)
library(reticulate)
library(FactoMineR)
library(factoextra)
library(GGally)
library(d3r)
library(ggTimeSeries)
require(gridExtra)
library(streamgraph)
library(treemapify)
library(treemap)
library(d3treeR)
library(lubridate)
library(shiny)
library(htmlwidgets)
library(hrbrthemes)
library(ellipse)
library(corrplot)
library(RColorBrewer)
library(ggExtra)
To perform this analysis we load a dataset originally about deaths divided by age groups and sex for COVID-19 , influenza and pneumoniain the U.S.A.
We are going to use only the data regarding COVID-19.
State : The American State in which the entry was recorded (United States implies the sum of all the entries in all the States)
Sex : The binary sex considered in the data collection for each person
Age Group: The age bin in which each person is assigned to
COVID-19 deaths: the number of deceased by COVID-19
american_age_dataset <- read.csv("~/Desktop/progetto adv/covid_influenza.CSV", header=FALSE)
as.data.frame(american_age_dataset)
# the dataset has all variables set as characters and the variable labels are entries , let's fix this :
american_age_dataset <- american_age_dataset %>%
rename(
End_week = V3,
State = V4,
Sex = V5,
Age_group = V6,
COVID_19_deaths = V7
) %>% select(c(State,Sex,Age_group,COVID_19_deaths ))
# now we remove the 'fake' variable labels so that we only have the real entries in the dataframe :
american_age_dataset <- american_age_dataset %>%
slice(2:n())
# let's take only male or female sexes and clear up the age groups
american_age_dataset <- american_age_dataset %>%
filter(Sex == 'Male' | Sex == 'Female') %>%
filter(Age_group == '0-17 years' | Age_group == '0-17 years' | Age_group == '18-29 years' | Age_group == '30-49 years' | Age_group == '50-64 years' | Age_group == '65-74 years' | Age_group == '75-84 years' | Age_group == '85 years and over' | Age_group == 'All Ages' )
# let's only take the the total numbers in the whole USA
american_age_dataset <- american_age_dataset %>%
filter(State == 'United States') %>% group_by(Age_group) %>%
pivot_wider(-Sex,names_from=Sex,values_from = COVID_19_deaths)
american_age_dataset <- american_age_dataset %>% mutate(Male = as.double(Male) , Female = as.double(Female))
ggplot(american_age_dataset) +
geom_segment( aes(x=Age_group, xend=Age_group, y=Male, yend=Female), color="grey") +
geom_point( aes(x=Age_group, y=Male), color= 'steelblue', size=3 ) +
geom_point( aes(x=Age_group, y=Female), color= 'pink', size=3 ) +
theme_ipsum()+
coord_flip()+
theme(
legend.position = "none",
panel.border = element_blank(),
) +
xlab("Age Group") +
ylab("COVID-19 Deaths")+
ggtitle('Male and female COVID-19 deaths by age group')
Key takeaways:
For this third myth we’re going to analyze of course the italian situation .
We start by uploading two dataframes:
the first holds population information ,that’s the number of inhabitants for each italian region. This will come in handy later when we will try to eliminate the bias of the number of inhabitants to answer some questions.
the second holds the shape files for the italian regions that we’re going to use later for the spatial visualizations in R .
Italy_reg <- st_read("shp/reg2011_g.shp") # uploading the shape file for the italian regions
pop_19 <- readxl::read_excel("regioni_pop_19.xlsx", # italian population data
col_types = c("numeric", "text", "numeric",
"numeric", "numeric"))
Italy_reg %>% merge(pop_19,by="COD_REG") -> choro_df # joining the two datagrames with the region code as the key
choro_df <- choro_df %>% select(-NOME_REG) %>% relocate(Reg_name,.after=COD_REG) # remove one of the two columns with the region names and relocating the Reg_name variable
Now we upload the italian COVID-19 dataset.
In this dataset we work with these variables:
data: the date of the collected row data
daily_new_cases: new COVID-19 cases observed in a day
daily_tests: COVID-19 daily tests done in a day
daily_recovered: recovered COVID-19 cases in a day
daily_deads: deceased COVID-19 cases in a day
cum_hospitalized: hospitalized COVID-19 cases since the beginning of the series
daily_actual_positives: total number of active COVID-19 cases in a day
cum_recovered: recovered COVID-19 cases since the beginning of the series
cum_cases: total number of COVID-19 cases observed since the beginning of the series
cum_dead: total number of deaceased COVID-19 cases observed since the beginning of the series
reg: name of the italian region
Cod_r: code of the italian region
We’re going to collapse Bolzano and Trento togheter in the Trentino Alto Adige region first , as the dataset still keeps this division. Then we’re going to summarize the cases and the tests and use them to create a new variable that holds the percentage of people that resulted positive to tests : pos_perc.
DATA_ITA <- readxl::read_excel("DATA_ITA.xlsx", col_types = c("date",
"numeric", "numeric", "numeric", "numeric",
"numeric", "numeric", "numeric", "numeric",
"numeric", "text", "numeric"))
DATA_ITA %>%
mutate(Month=lubridate::month(data)) %>% # let's create a column with months
filter((Month==3)|(Month==10)) %>%
mutate(COD_REG=if_else(Cod_r>20,as.numeric(4),Cod_r),
Name_REG=if_else(Cod_r>20,"Trentino AA",reg)) %>% # collapsing Bolzano and Trento province into TRENTINO AA region
select(COD_REG, Name_REG,Month, daily_new_cases,daily_tests) %>%
group_by(COD_REG,Month) %>%
summarize(new_cases=sum(daily_new_cases), test=sum(daily_tests)) %>% # cumulative cases and tests values
mutate(pos_perc=new_cases/test*100) %>% # creating a new variable that holds the percentage of people that resulted positive to tests
mutate(NMonth=if_else(Month==3,"March","October")) %>%
pivot_wider(-Month,names_from=NMonth,values_from = c(new_cases,test,pos_perc))-> Both_Month_per_cases
choro_df_2 <- choro_df %>% merge(Both_Month_per_cases,by.x = "COD_REG", by.y="COD_REG")
Now we are ready to show the choropleth maps , for non-italian people less familiar with the geography of our country.
# plotting choropleths for March and October
choropleth_perc_march <- ggplot(choro_df_2)+
geom_sf(aes(fill=pos_perc_March))+
scale_fill_viridis(limits = c(0, 40),
direction = -1,
option = "D") +
labs(title = "Positive percentages on tests, MARCH") +
theme_bw() + labs(fill = "Percentage ")
choropleth_perc_october <- ggplot(choro_df_2)+
geom_sf(aes(fill=pos_perc_October))+
scale_fill_viridis(limits = c(0, 40),
direction = -1,
option="D")+
labs(title = "Positive percentages on tests, OCTOBER") +
theme_bw() + labs(fill = "Percentage ")
grid.arrange(choropleth_perc_march, choropleth_perc_october , ncol=2)
While for people more familiar with the geography of the region, a cartogram could answer more question at a glimpse
#cartograms
it_perc_cart_M <- cartogram(choro_df_2, "pos_perc_March")
it_perc_cart_O <- cartogram(choro_df_2, "pos_perc_October")
cartogram_perc_march <- ggplot(it_perc_cart_M)+
geom_sf(aes(fill=pos_perc_March))+
scale_fill_viridis(limits = c(0, 40),
direction = -1,
option = "D") +
labs(title = "Positive percentages on tests, MARCH")+
theme_bw() + labs(fill = "Percentage ")
cartogram_perc_october <- ggplot(choro_df)+geom_sf()+
geom_sf(data=it_perc_cart_O,aes(fill=pos_perc_October))+
scale_fill_viridis(option = "D",
limits = c(0, 40),
direction = -1
)+
labs(title = "Positive percentages on tests, OCTOBER")+
theme_bw() + labs(fill = "Percentage ")
# we arrange the plots next to each other for better comparison possibilities
grid.arrange(cartogram_perc_march, cartogram_perc_october , ncol=2)
Looking at the visualizations , it seems like the myth is confirmed : the graph shows that the situation in March is worse than the situation in October. The percentage of people resulting positive to the tests is considerably higher.
But let’s stop and think for a second.
There’s a consideration to do before we draw conclusions: in March the available tests were less , so in Italy only people which were almost surely infected were tested. On the one hand the visualization only tells us to confirm the myth, on the other hand , external considerations tell us that the myth is busted.
Also consider that the data is as of October , from which on the second wave spread further .
We’re going to use an heatmap of the deceased through time in each region ( using relativized data to make more precise considerations without the bias of the number of inhabitants).
Let’s visualize it:
pop_19 <- rename(pop_19, Name_REG = 'Reg_name') # we need to rename to use it as a key when joining the two dataframes
heatmap_df_pop <- DATA_ITA %>%
mutate(COD_REG=if_else(Cod_r>20,as.numeric(4),Cod_r),
Name_REG=if_else(Cod_r>20,"Trentino AA",reg))
#now we join the population data with our treemap database
pop_19 <- pop_19 %>% select(Population,Name_REG)
heatmap_df_pop<- left_join(heatmap_df_pop,pop_19, by = "Name_REG")
# now we can create the new cases and deaths variables reliant on each 100000 people and we prepare the data for the treemap
heatmap_df_pop <- heatmap_df_pop %>%
mutate(cum_dead_pop = cum_dead/Population*100000) %>%
mutate(data = as.Date(data) )%>%
mutate( Month = month(data , label = TRUE)) %>%
select(Name_REG,cum_dead_pop,Month) %>%
filter(!(Name_REG == 'Friuli Venezia Giulia'))
as.matrix(heatmap_df_pop)
rotatedAxisElementText = function(angle,position='x'){
angle = angle[1];
position = position[1]
positions = list(x=0,y=90,top=180,right=270)
if(!position %in% names(positions))
stop(sprintf("'position' must be one of [%s]",paste(names(positions),collapse=", ")),call.=FALSE)
if(!is.numeric(angle))
stop("'angle' must be numeric",call.=FALSE)
rads = (angle - positions[[ position ]])*pi/180
hjust = 0.5*(1 - sin(rads))
vjust = 0.5*(1 + cos(rads))
element_text(angle=angle,vjust=vjust,hjust=hjust)
}
heatmap <- ggplot(heatmap_df_pop, aes(x=Name_REG, y=Month)) + geom_tile(aes(fill=cum_dead_pop)) + ggtitle('Italian regions Deaths per 100k inhabitants by month') + theme(axis.text.x = rotatedAxisElementText(90,'x')) + xlab('Region') + labs(fill = "Deaths per 100k ") + coord_flip()
heatmap
From this map it’s clear that the situation went downhill from April on , and some regions suffered more than others , for example :
Lombardia
Emilia-Romagna
Valle d’Aosta
Key takeaways:
Northern regions of Italy suffered more than southern ones
The situation in the second wave is slightly ‘worse’ than the first one
So, taking the results of both the visualizations togheter , and taking into account the fact that the ‘second wave’ has still room to spread , the answer is :
To go deeper in this matter, we can visualize the spread of COVID-19 in the world, using a choropleth ,considering cases per million inhabitants.
We’ll do this with in an interactive way to see the situation through time.
We’re using Python for this , so we will need to install the Python packages needed through reticulate.
#dependecies for the python plot:
knitr::opts_chunk$set(python.reticulate=FALSE) # to enable knitting
py_install("pandas") # these commands use the reticulate package
py_install("plotly")
py_install("pycountry")
use_python('/Applications/Python 3.8') # this overrides the default reticulate python in order to use a more recent verion that we may have installed on our local machine
The dataset we’re going to be using for this is visualization is available at this link: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv
This dataset is going to be used again in this report.
In this dataset we work with these variables ( for this analysis ):
location: geographical location
date : date in time at which the number of COVID-19 cases were recorded
total_cases_per_million : total confirmed cases of COVID-19 per 1,000,000 people
Now we can start to work with the Python script. We will use these packages:
Pycountry : to retrieve the country codes for each place , starting from the countries we have from the initial dataset , performing a “fuzzy” search . ( more info in the pycountry documentation here : https://pypi.org/project/pycountry)
Pandas : to manage data
Plotly express : to plot our data and output our interactive plot
We recall the plot generated in Python and saved in an html file with Shiny
shiny::includeHTML("/Users/someone/Desktop/progetto adv/progetto-adv_files/py_plot.html") # we can finally display the plot using Shiny
Key takeaways :
Initially the situation is pretty even in the world
Around mid June we start to see Countries that detach themselves from the even situation , in particular: U.S.A , Brazil , Peru and later also Russia and Saudi Arabia.
From late October we can clearly see that the european situation gets worse.
At the end of the data collection , the U.S.A. are the contry with the highest cases , followed by Brazil and Europe.
The results of this analysis , and in particular the european situation , are in line with the analysis conducted on the italian situation before.
It seems like the second wave is worse than the second one , and of course has still more room to get even worse.
We do this using the github data used before .
github_data <- read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
github_data %>%
filter(date == '2021-01-09') %>%
filter(!(location == 'International')) %>%
filter(!(location == 'World')) %>%
treemap(github_data,
index=c("continent", "location"),
vSize="total_cases",
type="index",
# Main
title="",
palette="Dark2",
# Borders:
border.col=c("black", "grey", "grey"),
border.lwds=c(1,0.5,0.1),
# Labels
fontsize.labels=c(0.7, 0.4, 0.3),
fontcolor.labels=c("white", "white", "black"),
fontface.labels=1,
bg.labels=c("transparent"),
align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")),
overlap.labels=0.5
) -> world_treemap
inter_treemap <- d3tree2( world_treemap , rootname = "Treemap of deceased in the world" )
htmlwidgets::saveWidget((inter_treemap), "inter_treemap.html") # save to html
shiny::includeHTML("/Users/someone/Desktop/progetto adv/inter_treemap.html") # show the html file
Considering the whole continents , the situation is pretty even
Africa is the only location that have significantly less deaths than the other ones.
When we go and see the underlying hierarchy we can see that the United States have the highest raw number of deceased
For this analysis we use the same data as used before for the Python choropleth and the density ridges.
It can be found at this link : https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
In this dataset we work with these variables:
continent: continent of the geographical location
total_cases_per_million: total confirmed cases of COVID-19 per 1,000,000 people
total_deaths_per_million: total deaths attributed to COVID-19 per 1,000,000 people
median_age: median age of the population, UN projection for 2020
extreme_poverty: share of the population living in extreme poverty, most recent year available since 2010
icu_patients_per_million: number of COVID-19 patients in intensive care units (ICUs) on a given day per 1,000,000 people
hosp_patients_per_million: number of COVID-19 patients in hospital on a given day per 1,000,000 people
population_density: number of people divided by land area, measured in square kilometers, most recent year available
To conduct this analysis we are going to try and collapse all the variables that hold values about ‘how bad the situation is’ into two principal components using the Principal component Analysis.
We start by plotting the correlation plot of the variables we are considering in the analysis.
a positive correlation between median age,population density and total deaths per million
a positive correlation between extreme poverty and total cases per million
github_data <- read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
PCA_df <- github_data %>% select(continent,total_cases_per_million,total_deaths_per_million,median_age,extreme_poverty,icu_patients_per_million,hosp_patients_per_million,population_density)
# Build a Pannel of 100 colors with Rcolor Brewer
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)
# Correlation matrix
corrmatrix <- cor(PCA_df%>% select(!continent) %>% na.omit() )
# Corrplot ordered with hclust
corrpl <- corrplot(corrmatrix ,type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
there’s a little bit of positive correlation between median age and total deaths per million
there’s quite a lot of positive correlation between population density and total deaths per million
PCA_res <- PCA_df%>%
FactoMineR::PCA(,ncp = 3,quali.sup = 1,graph=FALSE)
fviz_pca_var(PCA_res, col.var = "cos2", # most represented variables ( quality on the factor map)
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
As we can see the PCA doesn’t even account for the 60% of the variability , so it isn’t very suitable to go on with this route.
But anyway let’s try and understand what the graphs are telling us.
Key takeaways from the graph of variables (colored by their quality of representation):
Mainly the total cases and deaths per million make up for Dim1
The rest of the variables partially account for Dim2
fviz_pca_ind(PCA_res,
geom.ind = "point", # show points only (nbut not "text")
col.ind = PCA_df$continent, # color by groups
palette = c("#f9d5e5","#eeac99","#e06377","#c83349","#5b9aa0", "#b8a9c9", "#622569"),
addEllipses = TRUE, # Concentration ellipses
legend.title = "Groups"
)
Key takeaways from the graph of variables:
Europe has the highest number of hospital patients/icu patients per million ( but we need to be aware that this might me a result of the high number of NAs in the dataset outside of Europe)
South and North America develope more on the Dim1 as we expected
Africa has a high rate of extreme poverty but at the same time not so high cases and deaths
Asia has a condition similar to Africa , but with not as high extreme poverty
So, in conclusion , the situation seems like is kinda bad also in Europe , but is U.S.A. still the worse overall? Clearly we have the hospital data problem , that’s the reason that forces us to give the final answer:
Clearly we go into this analysis aware that the data available is the death rate from cardiovascular disease in 2017 , not the real number of people with cardiovascular conditions as of today , but clearly that number is impossible to compute with perfect precision , so we go into this analysis with this estimator.
To answer this question we use two variables from the github dataset used previously , only now the latest version, available here : https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv
github_data_latest <- read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv')
github_data_latest %>% ggplot(aes(x=total_cases_per_million,y=cardiovasc_death_rate),)+
geom_point() +
geom_point(color="#69b3a2", alpha=0.8) +
ggtitle("Are high cardiovascular death rates correlated to high number of deceased by COVID-19?") +
theme_ipsum() +
theme(
plot.title = element_text(size=11)
) +
ylab('Cardiovascular death rate') +
xlab('Total decesead per million inhabitants') -> scatter_plot
ggMarginal(scatter_plot, type="density", color="purple")
At first glance it looks like , not only the correlation we expected is not there , but is reversed!
But let’s try to compute the correlations formally :
We will use three kinds of correlation:
Pearson : \(r = \frac{\sum{(x-m_x)(y-m_y)}}{\sqrt{\sum{(x-m_x)^2}\sum{(y-m_y)^2}}}\)
Spearman : \(rho = \frac{\sum(x' - m_{x'})(y'_i - m_{y'})}{\sqrt{\sum(x' - m_{x'})^2 \sum(y' - m_{y'})^2}}\)
pearson <-cor(github_data_latest$total_cases_per_million, github_data_latest$cardiovasc_death_rate, method="pearson",use = "complete.obs")
spearman <- cor(github_data_latest$total_cases_per_million, github_data_latest$cardiovasc_death_rate, method="spearman",use = "complete.obs")
The results are:
Pearson : -0.267
Spearman : -0.332
So it looks like what we saw graphically is true also in the mathematical computations.
Even if it is very very slight , there is negative correlation between the two variables,as opposite as we expected.
So the final answer for this myth is :
To analyze this myth we use a paper written by Mike K.P.So, Agnes Tiwari, Amanda M.Y.Chu, Jenny T.Y.Tsang and Jacky N.L.Chan.
The paper is published at this link : https://www.sciencedirect.com/science/article/pii/S1201971220303179 for anyone to see.
Let’s start uploading the visualizations.
The first one is about the spread in China only and shows the connectedness that ia computed with the correlation of changes in the numbers of confirmed cases between two geographical areas . If the correlation is >0.5, the two areas are connected in a network.
The bar chart holds the number of COVID-19 cases through time, that is itself divided through nine periods of time to allow a better organization of the analysis.
The very interesting part that answers to the myth already can be seen in the period 0 / visualization B. It is shown how , while on the one hand from the bar chart the number of confirmed cases was low and had just begun to rise in China , on the other hand , the network visualization could already show a preoccupying scenario.
In the periods of time next to the first one we just talked about in China , the number of cases continued to rise , that’s when people started to worry.
But the truth - and the visualization shows it clearly - was that while the raw number of cases was increasing , at the same time there was a decrease in connectedness , and so there was an early sign of improvement in the epidemic , to be brought back most probably to all the measures taken to fight the spread , like the lockdowns and the general reduced social interactivity of people.
Going on in time , there was a slight decrease in confirmed cases from period 2-3 already - apart from the spike at the end of period 2 - and then it became more marked as time went on , but again , through the network visualization , this was very clear as early as from period 1!
From period 4 already we can see an enormous increase in connectedness while the raw confirmed cases data doesn’t quite show the same idea , for sure not with the same strength.
From period 5 the connectedness between European and American countries continued to rise , until period 7 , while from period 7 to 8 there is a sign of a decrease in connectedness.
In general this myth analysis doesn’t want to be an admonition to the more traditional methods of analysis , or a statement that we “could have seen everything before and we could have dodged a global pandemic” , not at all.
This wants to be an eye-opener to the power of analysis and visualization methods using networks.
Regarding the myth , even though it was clear from the first consideration about China at the beginning of the analysis , we finally can state that this sixth myth was:
To answer to this myth (partly, as the sample size is not statistically significant considering all the italian population , and to have a statistically significant sample we would need to perform data scraping - we will consider a sample of 20 comments randomly sampled from Facebook comments under posts - by newspapers - about the new COVID-19 vaccine.
Links to the posts used for the sample: - https://www.facebook.com/MinisteroSalute/posts/1723197741191375 - https://www.facebook.com/MinisteroSalute/posts/1723170321194117 - https://www.facebook.com/AgenziaANSA/posts/10157868599381220
For this visualization I used Voyant (https://voyant-tools.org/). I used its option to remove italian stop-words and also used a better color palette and font.
Clearly the first word that appears on the plot is the word ‘vaccino’ that means vaccine. It was to be expected and doesn’t help with our research of an answer to the myth. Astrazeneca is another word that is useless to our analysis as it is only the name of a vaccine.
It starts to get useful as we see positive words like :
sicuro (‘safe’)
speranza (‘hope’)
affidabile (‘reliable’)
while the words that may have a more negative meaning just like :
politica (‘politics’)
soldi (‘money’)
are smaller and thus less frequent.
So it seems like from this visualization , the odds are in favour of the confirmed , but , as we have said before , the sample is not statistically significant and to have a reliable result to this answer we should seek other ways of analysis rather than relying on a single visualization.
Because of these reasons the answer to this last myth can only be:
With this last visualization ,the analysis is complete.
Author: Antonio Taurisano