Who were the Mythbusters?



Mythbusters


The Mythbusters were the hosts of the Australian-American science entertainment television program with the same name that aired between 2003 and 2016. They used elements of the scientific method to test the validity of myths. They conducted the experiment and at the end they decided, after their analysis on the experiment of course, the outcome for the myth:



Outcomes


We are going to conduct our analysis in the same exact way !


We start by tweaking some settings for better visualization of the plots

We load the libraries needed for this analysis

library(GGally)
library(tidyverse)
library(sf)
library(cartogram)   
library(broom)
library(tweenr)  
library(maptools)  
library(viridis) 
library(reticulate) 
library(FactoMineR) 
library(factoextra)
library(GGally)
library(d3r)
library(ggTimeSeries)
require(gridExtra) 
library(streamgraph)
library(treemapify)
library(treemap)
library(d3treeR) 
library(lubridate)
library(shiny)
library(htmlwidgets)
library(hrbrthemes)
library(ellipse)
library(corrplot)
library(RColorBrewer)
library(ggExtra)

FIRST MYTH : Coronavirus is as fatal as influenza ( taken the USA sample )

Before starting this analysis , we need to make a fundamental consideration:

As the Rochester Regional Health website states (https://www.rochesterregional.org/flu), the peak months for the flu are between December and March , with a particular spike in February , while in the other months the number of cases are negligible.

So that is a key difference in the comparison of the flu with COVID-19 that of course doesn’t follow this cyclical pattern.

A good estimate for the flu numbers in the USA , through 2020 is (at most):

56 million illnesses

62.000 deaths

This means that for the flu there is a fatality rate of : 0,11 %

source :https://www.cdc.gov/flu/about/burden/preliminary-in-season-estimates.htm

Regarding COVID-19 in the USA, through 2020 , there have been:

22 million illnesses

369.000 deaths

This means that for COVID-19 there is a fatality rate of : 1,67 %

let’s visualize this comparison:

fatality_rates <- read.csv("~/Desktop/progetto adv/fatality_rates.csv", sep=";") %>% select(disease , fatality_rate)

ggplot(fatality_rates, aes(x= disease, y= fatality_rate,)) +
  geom_bar(stat = 'identity', width = 0.5, color = 'white' ,fill = 'steelblue') +
  xlab(('Disease')) +
  ylab('Fatality rate (%)')+
  ggtitle('Influenza and COVID-19 fatalities comparison bar chart')+
  theme_bw()

Key takeaways:

  • COVID-19 has a fatality rate 1,5% higher than Influenza

So , in the end , we can say that the first myth is:

BUSTED


SECOND MYTH : Sex makes a difference in the number of deaths by age group ( taken the USA sample )

To perform this analysis we load a dataset originally about deaths divided by age groups and sex for COVID-19 , influenza and pneumoniain the U.S.A.

We are going to use only the data regarding COVID-19.

In this dataset we work with these variables:

State : The American State in which the entry was recorded (United States implies the sum of all the entries in all the States)

Sex : The binary sex considered in the data collection for each person

Age Group: The age bin in which each person is assigned to

COVID-19 deaths: the number of deceased by COVID-19

american_age_dataset <- read.csv("~/Desktop/progetto adv/covid_influenza.CSV", header=FALSE)
as.data.frame(american_age_dataset)


# the dataset has all variables set as characters and the variable labels are entries , let's fix this :

american_age_dataset <- american_age_dataset %>% 
                          rename(
                                 End_week = V3,
                                 State = V4,
                                 Sex = V5,
                                 Age_group = V6,
                                 COVID_19_deaths = V7
                                 ) %>%  select(c(State,Sex,Age_group,COVID_19_deaths ))

# now we remove the 'fake' variable labels so that we only have the real entries in the dataframe :
american_age_dataset <- american_age_dataset %>% 
                          slice(2:n())

# let's take only male or female sexes and clear up the age groups

american_age_dataset <- american_age_dataset %>% 
  filter(Sex == 'Male' | Sex == 'Female') %>% 
    filter(Age_group == '0-17 years' | Age_group == '0-17 years' | Age_group == '18-29 years' | Age_group == '30-49 years' | Age_group == '50-64 years'  | Age_group == '65-74 years'  | Age_group == '75-84 years' | Age_group == '85 years and over' | Age_group == 'All Ages' )

# let's only take the the total numbers in the whole USA

american_age_dataset <- american_age_dataset %>% 
  filter(State == 'United States') %>%  group_by(Age_group) %>% 
  pivot_wider(-Sex,names_from=Sex,values_from = COVID_19_deaths)

american_age_dataset <- american_age_dataset %>%  mutate(Male = as.double(Male) , Female = as.double(Female))

We visualize the situation with a Cleveland dotplot

ggplot(american_age_dataset) +
  geom_segment( aes(x=Age_group, xend=Age_group, y=Male, yend=Female), color="grey") +
  geom_point( aes(x=Age_group, y=Male), color= 'steelblue', size=3 ) +
  geom_point( aes(x=Age_group, y=Female), color= 'pink', size=3 ) +
  theme_ipsum()+
  coord_flip()+
  theme(
    legend.position = "none",
    panel.border = element_blank(),
  ) +
  xlab("Age Group") +
  ylab("COVID-19 Deaths")+
  ggtitle('Male and female COVID-19 deaths by age group')

Key takeaways:

  • In each age group except the 85+ bracket , more men die.
  • Also in the most general situation with all ages , there’s quite a difference between the male and female sex.
  • It is clearly noticeable how as age grows , the number of deceased grow

So , in the end , we can say that the second myth is:

CONFIRMED


THIRD MYTH: The italian first wave was worse than the second one

For this third myth we’re going to analyze of course the italian situation .

We start by uploading two dataframes:

the first holds population information ,that’s the number of inhabitants for each italian region. This will come in handy later when we will try to eliminate the bias of the number of inhabitants to answer some questions.

the second holds the shape files for the italian regions that we’re going to use later for the spatial visualizations in R .

Italy_reg <- st_read("shp/reg2011_g.shp") # uploading the shape file for the italian regions

pop_19 <- readxl::read_excel("regioni_pop_19.xlsx",  # italian population data
     col_types = c("numeric", "text", "numeric", 
         "numeric", "numeric"))
Italy_reg %>% merge(pop_19,by="COD_REG") -> choro_df # joining the two datagrames with the region code as the key

choro_df <- choro_df %>% select(-NOME_REG) %>%  relocate(Reg_name,.after=COD_REG)  # remove one of the two columns with the region names and relocating the Reg_name variable

Now we upload the italian COVID-19 dataset.

In this dataset we work with these variables:

data: the date of the collected row data

daily_new_cases: new COVID-19 cases observed in a day

daily_tests: COVID-19 daily tests done in a day

daily_recovered: recovered COVID-19 cases in a day

daily_deads: deceased COVID-19 cases in a day

cum_hospitalized: hospitalized COVID-19 cases since the beginning of the series

daily_actual_positives: total number of active COVID-19 cases in a day

cum_recovered: recovered COVID-19 cases since the beginning of the series

cum_cases: total number of COVID-19 cases observed since the beginning of the series

cum_dead: total number of deaceased COVID-19 cases observed since the beginning of the series

reg: name of the italian region

Cod_r: code of the italian region

We’re going to collapse Bolzano and Trento togheter in the Trentino Alto Adige region first , as the dataset still keeps this division. Then we’re going to summarize the cases and the tests and use them to create a new variable that holds the percentage of people that resulted positive to tests : pos_perc.

DATA_ITA <- readxl::read_excel("DATA_ITA.xlsx", col_types = c("date", 
    "numeric", "numeric", "numeric", "numeric", 
    "numeric", "numeric", "numeric", "numeric", 
    "numeric", "text", "numeric"))

DATA_ITA %>% 
  mutate(Month=lubridate::month(data)) %>% # let's create a column with months
  filter((Month==3)|(Month==10)) %>% 
  mutate(COD_REG=if_else(Cod_r>20,as.numeric(4),Cod_r),
         Name_REG=if_else(Cod_r>20,"Trentino AA",reg)) %>% # collapsing Bolzano and Trento province into TRENTINO AA region
  select(COD_REG, Name_REG,Month, daily_new_cases,daily_tests) %>%  
group_by(COD_REG,Month) %>%                  
  summarize(new_cases=sum(daily_new_cases),  test=sum(daily_tests)) %>%  # cumulative cases and tests values
mutate(pos_perc=new_cases/test*100) %>% # creating a new variable that holds the percentage of people that resulted positive to tests 
mutate(NMonth=if_else(Month==3,"March","October")) %>% 
pivot_wider(-Month,names_from=NMonth,values_from = c(new_cases,test,pos_perc))-> Both_Month_per_cases

choro_df_2 <- choro_df %>% merge(Both_Month_per_cases,by.x = "COD_REG", by.y="COD_REG")

Now we are ready to show the choropleth maps , for non-italian people less familiar with the geography of our country.

# plotting choropleths for March and October

choropleth_perc_march <- ggplot(choro_df_2)+
  geom_sf(aes(fill=pos_perc_March))+
  scale_fill_viridis(limits = c(0, 40),
                     direction = -1,
                     option = "D") +
  labs(title = "Positive percentages on tests, MARCH") +
    theme_bw() + labs(fill = "Percentage ")


choropleth_perc_october <- ggplot(choro_df_2)+
  geom_sf(aes(fill=pos_perc_October))+
  scale_fill_viridis(limits = c(0, 40),
                     direction = -1,
                     option="D")+
  labs(title = "Positive percentages on tests, OCTOBER") +
  theme_bw()  + labs(fill = "Percentage ")

grid.arrange(choropleth_perc_march, choropleth_perc_october , ncol=2)

While for people more familiar with the geography of the region, a cartogram could answer more question at a glimpse

#cartograms

it_perc_cart_M <- cartogram(choro_df_2, "pos_perc_March")
it_perc_cart_O <- cartogram(choro_df_2, "pos_perc_October")

cartogram_perc_march <-  ggplot(it_perc_cart_M)+
  geom_sf(aes(fill=pos_perc_March))+
  scale_fill_viridis(limits = c(0, 40),
                     direction = -1,
                     option = "D") +
  labs(title = "Positive percentages on tests, MARCH")+
  theme_bw() + labs(fill = "Percentage ")

cartogram_perc_october <- ggplot(choro_df)+geom_sf()+
  geom_sf(data=it_perc_cart_O,aes(fill=pos_perc_October))+
  scale_fill_viridis(option = "D",
                    limits = c(0, 40),
                     direction = -1
                     )+
  labs(title = "Positive percentages on tests, OCTOBER")+
  theme_bw() + labs(fill = "Percentage ")

# we arrange the plots next to each other for better comparison possibilities
grid.arrange(cartogram_perc_march, cartogram_perc_october , ncol=2)

Looking at the visualizations , it seems like the myth is confirmed : the graph shows that the situation in March is worse than the situation in October. The percentage of people resulting positive to the tests is considerably higher.

But let’s stop and think for a second.

There’s a consideration to do before we draw conclusions: in March the available tests were less , so in Italy only people which were almost surely infected were tested. On the one hand the visualization only tells us to confirm the myth, on the other hand , external considerations tell us that the myth is busted.

Also consider that the data is as of October , from which on the second wave spread further .

Let’s try to go further with another visualization that may be helpful to answer whether or not the second wave was ‘worse’ : an heatmap.

We’re going to use an heatmap of the deceased through time in each region ( using relativized data to make more precise considerations without the bias of the number of inhabitants).

Let’s visualize it:

pop_19 <- rename(pop_19, Name_REG = 'Reg_name') # we need to rename to use it as a key when joining the two dataframes

heatmap_df_pop <- DATA_ITA %>% 
  mutate(COD_REG=if_else(Cod_r>20,as.numeric(4),Cod_r),
         Name_REG=if_else(Cod_r>20,"Trentino AA",reg))

#now we join the population data with our treemap database
pop_19 <- pop_19 %>%  select(Population,Name_REG)

heatmap_df_pop<- left_join(heatmap_df_pop,pop_19, by = "Name_REG")

# now we can create the new cases and deaths variables reliant on each 100000 people and we prepare the data for the treemap

heatmap_df_pop <- heatmap_df_pop %>% 
  mutate(cum_dead_pop = cum_dead/Population*100000) %>% 
       mutate(data = as.Date(data) )%>% 
      mutate( Month = month(data , label = TRUE)) %>% 
  select(Name_REG,cum_dead_pop,Month) %>% 
  filter(!(Name_REG == 'Friuli Venezia Giulia'))

as.matrix(heatmap_df_pop)

rotatedAxisElementText = function(angle,position='x'){
  angle     = angle[1]; 
  position  = position[1]
  positions = list(x=0,y=90,top=180,right=270)
  if(!position %in% names(positions))
    stop(sprintf("'position' must be one of [%s]",paste(names(positions),collapse=", ")),call.=FALSE)
  if(!is.numeric(angle))
    stop("'angle' must be numeric",call.=FALSE)
  rads  = (angle - positions[[ position ]])*pi/180
  hjust = 0.5*(1 - sin(rads))
  vjust = 0.5*(1 + cos(rads))
  element_text(angle=angle,vjust=vjust,hjust=hjust)
}

heatmap <- ggplot(heatmap_df_pop, aes(x=Name_REG, y=Month)) + geom_tile(aes(fill=cum_dead_pop)) + ggtitle('Italian regions Deaths per 100k inhabitants by month') + theme(axis.text.x = rotatedAxisElementText(90,'x')) + xlab('Region') + labs(fill = "Deaths per 100k ") + coord_flip()

heatmap

From this map it’s clear that the situation went downhill from April on , and some regions suffered more than others , for example :

Key takeaways:

So, taking the results of both the visualizations togheter , and taking into account the fact that the ‘second wave’ has still room to spread , the answer is :

BUSTED


Let’s visualize the virus spread in the world with an interactive choropleth

To go deeper in this matter, we can visualize the spread of COVID-19 in the world, using a choropleth ,considering cases per million inhabitants.

We’ll do this with in an interactive way to see the situation through time.

We’re using Python for this , so we will need to install the Python packages needed through reticulate.

#dependecies for the python plot: 

knitr::opts_chunk$set(python.reticulate=FALSE) # to enable knitting 
py_install("pandas") # these commands use the reticulate package
py_install("plotly")
py_install("pycountry")

use_python('/Applications/Python 3.8')  # this overrides the default reticulate python in order to use a more recent verion that we may have installed on our local machine

The dataset we’re going to be using for this is visualization is available at this link: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv

This dataset is going to be used again in this report.

In this dataset we work with these variables ( for this analysis ):

location: geographical location

date : date in time at which the number of COVID-19 cases were recorded

total_cases_per_million : total confirmed cases of COVID-19 per 1,000,000 people

Now we can start to work with the Python script. We will use these packages:

Pycountry : to retrieve the country codes for each place , starting from the countries we have from the initial dataset , performing a “fuzzy” search . ( more info in the pycountry documentation here : https://pypi.org/project/pycountry)

Pandas : to manage data

Plotly express : to plot our data and output our interactive plot

We recall the plot generated in Python and saved in an html file with Shiny

shiny::includeHTML("/Users/someone/Desktop/progetto adv/progetto-adv_files/py_plot.html") # we can finally display the plot using Shiny

Key takeaways :

The results of this analysis , and in particular the european situation , are in line with the analysis conducted on the italian situation before.

It seems like the second wave is worse than the second one , and of course has still more room to get even worse.


What if we wanted to visualize which areas in the world have the highest raw number of deceased?

We do this using the github data used before .

github_data <- read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')

github_data %>% 
  filter(date == '2021-01-09') %>% 
  filter(!(location == 'International')) %>% 
  filter(!(location == 'World')) %>% 
    treemap(github_data,
            
           
            index=c("continent", "location"),
            vSize="total_cases",
            type="index",
            
            # Main
            title="",
            palette="Dark2",

            # Borders:
            border.col=c("black", "grey", "grey"),             
            border.lwds=c(1,0.5,0.1),                         
        
            # Labels
            fontsize.labels=c(0.7, 0.4, 0.3),
            fontcolor.labels=c("white", "white", "black"),
            fontface.labels=1,            
            bg.labels=c("transparent"),              
            align.labels=list( c("center", "center"), c("left", "top"), c("right", "bottom")),                                  
            overlap.labels=0.5
    ) -> world_treemap
inter_treemap <- d3tree2( world_treemap ,  rootname = "Treemap of deceased in the world" )

htmlwidgets::saveWidget((inter_treemap), "inter_treemap.html") # save to html
 
 shiny::includeHTML("/Users/someone/Desktop/progetto adv/inter_treemap.html") # show the html file 

d3tree2

Key takeaways:


FOURTH MYTH: the U.S.A is the most hit country by COVID-19 as of right now considering the overall picture (not only the simple number of confirmed cases!)

For this analysis we use the same data as used before for the Python choropleth and the density ridges.

It can be found at this link : https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv

In this dataset we work with these variables:

continent: continent of the geographical location

total_cases_per_million: total confirmed cases of COVID-19 per 1,000,000 people

total_deaths_per_million: total deaths attributed to COVID-19 per 1,000,000 people

median_age: median age of the population, UN projection for 2020

extreme_poverty: share of the population living in extreme poverty, most recent year available since 2010

icu_patients_per_million: number of COVID-19 patients in intensive care units (ICUs) on a given day per 1,000,000 people

hosp_patients_per_million: number of COVID-19 patients in hospital on a given day per 1,000,000 people

population_density: number of people divided by land area, measured in square kilometers, most recent year available

To conduct this analysis we are going to try and collapse all the variables that hold values about ‘how bad the situation is’ into two principal components using the Principal component Analysis.

We start by plotting the correlation plot of the variables we are considering in the analysis.

We go into this correlation plot analysis expecting things like:

  • a positive correlation between median age,population density and total deaths per million

  • a positive correlation between extreme poverty and total cases per million

github_data <- read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')
PCA_df <- github_data  %>%  select(continent,total_cases_per_million,total_deaths_per_million,median_age,extreme_poverty,icu_patients_per_million,hosp_patients_per_million,population_density)  

# Build a Pannel of 100 colors with Rcolor Brewer

my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)

# Correlation matrix

corrmatrix <- cor(PCA_df%>% select(!continent) %>%  na.omit() ) 
 
# Corrplot ordered with hclust

corrpl <- corrplot(corrmatrix ,type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)




From this correlogram we understand that some of the things we expected are not true , for example :

  • there’s no real correlation between extreme poverty and total cases per million , actually , there ’s a little bit of negative correlation , the opposite as we expected.

But some things are as expected, for example:

  • there’s a little bit of positive correlation between median age and total deaths per million

  • there’s quite a lot of positive correlation between population density and total deaths per million

Now we can go along with the PCA.

Let’s visualize the graph of variables:

PCA_res <- PCA_df%>% 
  FactoMineR::PCA(,ncp = 3,quali.sup = 1,graph=FALSE)

fviz_pca_var(PCA_res, col.var = "cos2",   # most represented variables ( quality on the factor map)
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE # Avoid text overlapping
             )

As we can see the PCA doesn’t even account for the 60% of the variability , so it isn’t very suitable to go on with this route.

But anyway let’s try and understand what the graphs are telling us.

Key takeaways from the graph of variables (colored by their quality of representation):

  • Mainly the total cases and deaths per million make up for Dim1

  • The rest of the variables partially account for Dim2

Now let’s go on with the graph of variables:

fviz_pca_ind(PCA_res,
             geom.ind = "point", # show points only (nbut not "text")
             col.ind = PCA_df$continent, # color by groups
             palette = c("#f9d5e5","#eeac99","#e06377","#c83349","#5b9aa0", "#b8a9c9", "#622569"),
             addEllipses = TRUE, # Concentration ellipses
             legend.title = "Groups"
             )

Key takeaways from the graph of variables:

  • Europe has the highest number of hospital patients/icu patients per million ( but we need to be aware that this might me a result of the high number of NAs in the dataset outside of Europe)

  • South and North America develope more on the Dim1 as we expected

  • Africa has a high rate of extreme poverty but at the same time not so high cases and deaths

  • Asia has a condition similar to Africa , but with not as high extreme poverty

So, in conclusion , the situation seems like is kinda bad also in Europe , but is U.S.A. still the worse overall? Clearly we have the hospital data problem , that’s the reason that forces us to give the final answer:

PLAUSIBLE


FIFTH MYTH : There’s correlation between high death rate from cardiovascular disease in the population and the number of deaceased by COVID-19

Clearly we go into this analysis aware that the data available is the death rate from cardiovascular disease in 2017 , not the real number of people with cardiovascular conditions as of today , but clearly that number is impossible to compute with perfect precision , so we go into this analysis with this estimator.

To answer this question we use two variables from the github dataset used previously , only now the latest version, available here : https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv

github_data_latest <- read.csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/latest/owid-covid-latest.csv')

github_data_latest %>% ggplot(aes(x=total_cases_per_million,y=cardiovasc_death_rate),)+
  geom_point() +
  geom_point(color="#69b3a2", alpha=0.8) +
    ggtitle("Are high cardiovascular death rates correlated to high number of deceased by COVID-19?") +
    theme_ipsum() +
    theme(
      plot.title = element_text(size=11)
    ) +
    ylab('Cardiovascular death rate') +
    xlab('Total decesead per million inhabitants') -> scatter_plot

ggMarginal(scatter_plot, type="density", color="purple")

At first glance it looks like , not only the correlation we expected is not there , but is reversed!

But let’s try to compute the correlations formally :

We will use three kinds of correlation:

pearson <-cor(github_data_latest$total_cases_per_million, github_data_latest$cardiovasc_death_rate, method="pearson",use = "complete.obs")

spearman <- cor(github_data_latest$total_cases_per_million, github_data_latest$cardiovasc_death_rate, method="spearman",use = "complete.obs")

The results are:

So it looks like what we saw graphically is true also in the mathematical computations.

Even if it is very very slight , there is negative correlation between the two variables,as opposite as we expected.

So the final answer for this myth is :

BUSTED


SIXTH MYTH: It was impossible to foresee pandemic risks in COVID-19 from the very early stages

To analyze this myth we use a paper written by Mike K.P.So, Agnes Tiwari, Amanda M.Y.Chu, Jenny T.Y.Tsang and Jacky N.L.Chan.

The paper is published at this link : https://www.sciencedirect.com/science/article/pii/S1201971220303179 for anyone to see.

Let’s start uploading the visualizations.

The first one is about the spread in China only and shows the connectedness that ia computed with the correlation of changes in the numbers of confirmed cases between two geographical areas . If the correlation is >0.5, the two areas are connected in a network.

The bar chart holds the number of COVID-19 cases through time, that is itself divided through nine periods of time to allow a better organization of the analysis.



China_Network_Analysis


Let’s see the China scenario first:

The very interesting part that answers to the myth already can be seen in the period 0 / visualization B. It is shown how , while on the one hand from the bar chart the number of confirmed cases was low and had just begun to rise in China , on the other hand , the network visualization could already show a preoccupying scenario.

In the periods of time next to the first one we just talked about in China , the number of cases continued to rise , that’s when people started to worry.

But the truth - and the visualization shows it clearly - was that while the raw number of cases was increasing , at the same time there was a decrease in connectedness , and so there was an early sign of improvement in the epidemic , to be brought back most probably to all the measures taken to fight the spread , like the lockdowns and the general reduced social interactivity of people.

Going on in time , there was a slight decrease in confirmed cases from period 2-3 already - apart from the spike at the end of period 2 - and then it became more marked as time went on , but again , through the network visualization , this was very clear as early as from period 1!



China Network Analysis


Now we take a look at the visualization that regards the rest of the world ( excluding China ):

  • From period 4 already we can see an enormous increase in connectedness while the raw confirmed cases data doesn’t quite show the same idea , for sure not with the same strength.

  • From period 5 the connectedness between European and American countries continued to rise , until period 7 , while from period 7 to 8 there is a sign of a decrease in connectedness.

In general this myth analysis doesn’t want to be an admonition to the more traditional methods of analysis , or a statement that we “could have seen everything before and we could have dodged a global pandemic” , not at all.

This wants to be an eye-opener to the power of analysis and visualization methods using networks.

Regarding the myth , even though it was clear from the first consideration about China at the beginning of the analysis , we finally can state that this sixth myth was:

BUSTED


SEVENTH MYTH: There’s more people who want to get vaccinated rather than anti-vaxxers in Italy.

To answer to this myth (partly, as the sample size is not statistically significant considering all the italian population , and to have a statistically significant sample we would need to perform data scraping - we will consider a sample of 20 comments randomly sampled from Facebook comments under posts - by newspapers - about the new COVID-19 vaccine.

Links to the posts used for the sample: - https://www.facebook.com/MinisteroSalute/posts/1723197741191375 - https://www.facebook.com/MinisteroSalute/posts/1723170321194117 - https://www.facebook.com/AgenziaANSA/posts/10157868599381220



Word Cloud




For this visualization I used Voyant (https://voyant-tools.org/). I used its option to remove italian stop-words and also used a better color palette and font.

Clearly the first word that appears on the plot is the word ‘vaccino’ that means vaccine. It was to be expected and doesn’t help with our research of an answer to the myth. Astrazeneca is another word that is useless to our analysis as it is only the name of a vaccine.

It starts to get useful as we see positive words like :

sicuro (‘safe’)

speranza (‘hope’)

affidabile (‘reliable’)

while the words that may have a more negative meaning just like :

politica (‘politics’)

soldi (‘money’)

are smaller and thus less frequent.

So it seems like from this visualization , the odds are in favour of the confirmed , but , as we have said before , the sample is not statistically significant and to have a reliable result to this answer we should seek other ways of analysis rather than relying on a single visualization.

Because of these reasons the answer to this last myth can only be:

PLAUSIBLE


With this last visualization ,the analysis is complete.

Author: Antonio Taurisano