Machine Learning - Part 1 of Covid-19 Analysis

All the visualizations and analysis in this report are as of February 27, 2021, 2 pm AST. Note that the covid-19 data changes over time.

1. Introduction

In 2019, the Covid-19 pandemic put our lives and behaviours in check. There is so much to learn from it and millions of questions to be asked. I decided to investigate covid-19-related demographic, geographical and socio-economic parameters. Covid-19 data is a longitudinal dataset and changes over time.

As highlighted by Mohammed (2016), significant challenges are encountered during the development of a longitudinal dataset: loss of information due to missing or incomplete data, deterioration of data over time, lack of data standardization (such as geographical nomenclature and spellings) and data quality issues (such as missing data or incorrect data types).

This is the 2nd machine learning project requirement for the HarvardX Professional Certificate Data Science Program.

github: https://github.com/silpai/Machine-Learning---Covid-19

2. Methodology and Analysis

There were 3 datasets used:

• Worldtilegrid – contains each country's region, subregion, and Cartesian x,y coordinates. Note that some countries are not listed in this dataset, and small islands are also missing.

Source: https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv

• Covid-19 dataset with vaccination information – contains covid-19 tracking data collected from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU); vaccinations against COVID-19, collected by the Our World in Data team (https://ourworldindata.org/) from official reports and counted as single doses of the vaccine; and demographic and socio-economic data collected from the United Nations, the World Bank and other governmental agencies. The dataset is longitudinal (changes over time) and is updated according to each country's time zone.

Source: https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv

• WorldBank estimated population – contains the projected total population for 2021 and the population per age range (age 0-14, age 15-64 and age 65 and up). Since covid-19 vaccines have been approved for those older than age 16, I will be using the range age 15 and up in my analysis. One of my curiosities was to know the percentage of the population vaccinated considering the approved age, not the entire population. Note that these are estimates, and they are not reduced to account for deaths from covid-19 or any other cause.

Source: http://databank.worldbank.org/data/download/Population-Estimates_CSV.zip

Steps:

a) Load and Clean data
   1. Load datasets
   2. Combine the worldtilegrid with covid-19
   3. Transform Data
      i. Worldbank data comes in a long format, so I had to pivot the variables wider
      ii. Combine the 3 datasets
      iii. Create new categories
      iv. Exclude the rows with aggregated values by region
b) Perform Exploratory Analysis (EDA)
c) Develop Machine Learning Models

Modeling was used to create the predictions. Models were built on the training data and applied to the test data:
- Linear Models: correlation matrix, Linear regression, naive Bayes Classifier and decision tree

2.1 Data Transformation

The Worldbank dataset required the pivot wider transformation. The three datasets were joined. Rows with aggregated values by region were eliminated, leaving only countries as observations. New groupings were created for the analysis.

This report was initially developed on January 11th 2021, when few countries had initiated the vaccination rollout and the related tracking system did not differentiate the 1st dose from the 2nd dose, so the total_vaccinations variable was used. Since then, the people_vaccinated (at least the 1st dose) and people_fully_vaccinated variables have become available.

total_vaccinations - Total number of COVID-19 vaccination doses administered

people_vaccinated - Total number of people who received at least one vaccine dose

people_fully_vaccinated - Total number of people who received all doses prescribed by the vaccination protocol

##### a) Data Load and Cleaning

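#1. Load datasets
# packages used by the code in this report
library(dplyr)       # %>%, select, mutate, joins
library(tidyr)       # spread
library(stringr)     # str_replace_all
library(data.table)  # fread, setDT
library(scales)      # percent
library(ggplot2)     # plots
library(treemap)     # treemap

# dataset: worldtilegrid x y 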
worldtilegrid <- read.csv(
  "https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")%>%
  select(name, alpha.3,region,sub.region,x,y)


# dataset: Covid-19 dataset with vaccination information 
#Source: https://ourworldindata.org/covid-vaccinations
covid_data<-read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")%>%
select(iso_code,location,date,total_cases,new_cases, total_deaths,new_deaths,total_vaccinations, people_vaccinated, people_fully_vaccinated, gdp_per_capita,
human_development_index, population_density, median_age,life_expectancy)


# dataset: worldbank estimated population
#Source:https://datacatalog.worldbank.org/dataset/population-estimates-and-projections
worldb <- tempfile()
download.file("http://databank.worldbank.org/data/download/Population-Estimates_CSV.zip", worldb)
worldbank_csv <- fread(text = gsub(",", "\t", readLines(unzip(worldb, "Population-EstimatesData.csv")))) 
names(worldbank_csv) <- as.character(worldbank_csv[1,])   # use 1st row as header
worldbank_csv=worldbank_csv[-c(1),]                       # eliminate the 1st row that was a header
names(worldbank_csv)<-str_replace_all(names(worldbank_csv), c(" " = "_" )) # replace space by _
#str(worldbank_csv)

##2.Combine covid data & tile grid x y
covid_grid<-full_join(x=covid_data, y=worldtilegrid, by=c("iso_code" ="alpha.3"))  
#head(covid_grid)

# 3.    Data Transformation

# a) Pivot wider worldbank 
worldbank <- worldbank_csv %>% 
  select("Country_Code","Country_Name","Indicator_Code","2021") %>%   # select indicator population for 2021
  filter(Indicator_Code %in% c("SP.POP.TOTL","SP.POP.1564.TO","SP.POP.65UP.TO")) %>%  # exclude SP.POP.0014.TO (population age of 14 or younger)
  spread(key = "Indicator_Code",
         value = "2021") %>%                                          # pivot wider
  rename(iso_code=Country_Code, Est_Pop_2021=SP.POP.TOTL, Pop15_64=SP.POP.1564.TO, Pop65UP=SP.POP.65UP.TO) %>%
  mutate(across(c(Est_Pop_2021, Pop15_64, Pop65UP), as.numeric),      # make sure the counts are numeric before adding them
         Pop15Over=Pop15_64+Pop65UP)                                  # combine population age 15 to 64 + population 65up
#head(worldbank)


#b) Combine covid data & worldbank estimated population age 15 and older data & tile grid x y data
# list the iso codes of the rows related to aggregated values (regions, income groups, world)
# NOTE: "CAF" is the ISO code of the Central African Republic (a country, not an aggregate); it is kept here as in the original list
iso_code_aggregated <- data.frame(iso_code = c("ARB", "CAF", "CEB", "CSS", "EAP", "EAR","EAS","ECA","ECS","EUU",
                          "FCS","HPC","INX","LAC", "LCN","LDC","LIC","LMC","LMY",         
                          "LTE","MEA","MIC","MNA","NAC","OED","OSS","OWID_KOS","OWID_WRL", "PRE","PSS",
                          "PST","SAS","SSA","SSF","SST","TEA","TEC", "TLA", 
                          "TMN", "TSA", "TSS", "UMC", "WLD", "OWID_EUR","OWID_NAM","OWID_ASI", "OWID_EUN","OWID_SAM","OWID_AFR"))
#iso_code_aggregated

#c) Create new categories 
Pop_Vaccine_tile<-full_join(x=worldbank, y=covid_grid, by="iso_code") %>% 
group_by(iso_code)%>%
mutate(
Percent_Pop15UP= percent(Pop15Over/Est_Pop_2021),
vaccination_administrated=as.numeric(total_vaccinations),
# the conditions are tested in order, so only the upper bound of each bin is needed;
# a value of 0 now falls in the lowest bin instead of falling through to the last label
vaccination_categ= ifelse(is.na(vaccination_administrated), "Data unavailable", 
            ifelse(vaccination_administrated <10000,"Single doses < 10k",
             ifelse(vaccination_administrated <500000,"10k >= Single doses < 500k",
              ifelse(vaccination_administrated <5000000,"500k >= Single doses < 5M",
               ifelse(vaccination_administrated <50000000, "5M >= Single doses < 50M", "Single doses >= 50M"))))),
vaccination_status = ifelse(is.na(vaccination_administrated),"Vaccination did not start",
              ifelse(vaccination_administrated >0,"Vaccination started", "Vaccination did not start")),
Pop_percent_vaccinated_15over=vaccination_administrated/Pop15Over,
percent_vaccinated_15over=percent(vaccination_administrated/Pop15Over),
vaccination_percent_categ= ifelse(is.na(Pop_percent_vaccinated_15over),"Data unavailable", 
        ifelse(Pop_percent_vaccinated_15over <0.1,"Total single doses < 10%",
         ifelse(Pop_percent_vaccinated_15over <0.5,"10% >= Total single doses < 50%",
          ifelse(Pop_percent_vaccinated_15over <0.7,"50% >= Total single doses < 70%", "Potential herd immunity")))),
Pop_percent_fully_vaccinated_15over= people_fully_vaccinated/Pop15Over,
percent_2nddose_15over=percent(people_fully_vaccinated/Pop15Over),
categ_2nddose_15over= ifelse(is.na(Pop_percent_fully_vaccinated_15over),"Data unavailable", 
        ifelse(Pop_percent_fully_vaccinated_15over <0.1,"Fully vaccinated < 10%",
         ifelse(Pop_percent_fully_vaccinated_15over <0.5,"10% >= Fully vaccinated < 50%",
          ifelse(Pop_percent_fully_vaccinated_15over <0.7,"50% >= Fully vaccinated < 70%", "Potential herd immunity")))),
Pop_percent_1st_vaccinated_15over= people_vaccinated/Pop15Over,
percent_1stdose_15over=percent(people_vaccinated/Pop15Over),
categ_1stdose_15over= ifelse(is.na(Pop_percent_1st_vaccinated_15over),"Data unavailable", 
        ifelse(Pop_percent_1st_vaccinated_15over <0.1,"1st dose < 10%",
         ifelse(Pop_percent_1st_vaccinated_15over <0.5,"10% >= 1st dose < 50%",
          ifelse(Pop_percent_1st_vaccinated_15over <0.7,"50% >= 1st dose < 70%", "Potential herd immunity")))),
GDP_category=ifelse(is.na(gdp_per_capita),"Data unavailable",
       ifelse(gdp_per_capita<1000,"GDP < 1k", 
        ifelse(gdp_per_capita >=1000 & gdp_per_capita <5000,"1k >= GDP < 5k",
         ifelse(gdp_per_capita >=5000 & gdp_per_capita <15000,"5k >= GDP < 15k",
          ifelse(gdp_per_capita >=15000 & gdp_per_capita <50000,"15k >= GDP < 50k",
           ifelse(gdp_per_capita >=50000 & gdp_per_capita <90000,"50k >= GDP < 90k","GDP >= 90K")))))),
Pop_density_categ= ifelse(is.na(population_density), "Data unavailable",
         ifelse(population_density >=0 & population_density <25,"Low (0 >= ppl/Km2 < 25)",                                      
          ifelse(population_density >=25 & population_density <50,"Medium (25 >= ppl/Km2 < 50)",
           ifelse(population_density >=50 & population_density <100,"High (50 >= ppl/Km2 < 100)",
            ifelse(population_density >=100 & population_density <400,"Very High (100 >= ppl/Km2 <400)","Extreme High (ppl/Km2 >400)"))))),
Life_expectancy_categ= ifelse(is.na(life_expectancy), "Data unavailable",
            ifelse(life_expectancy >=0 & life_expectancy <50,"Life expectancy < 50",                                      
             ifelse(life_expectancy >=50 & life_expectancy <60,"50 >= Life expectancy < 60",
              ifelse(life_expectancy >=60 & life_expectancy <70,"60 >= Life expectancy < 70",
               ifelse(life_expectancy >=70 & life_expectancy <80,"70 >= Life expectancy < 80","Life expectancy >= 80"))))),
Total_cases_categ= ifelse(is.na(total_cases), "Data unavailable",
         ifelse(total_cases >=0 & total_cases <50000,"Total cases < 50k",                                      
          ifelse(total_cases >=50000 & total_cases <200000,"50k >= Total cases < 200k",
           ifelse(total_cases >=200000 & total_cases <500000,"200k >= Total cases < 500k",
            ifelse(total_cases >=500000 & total_cases <1000000,"500k >= Total cases < 1M","Total cases >= 1M"))))),
Total_deaths_categ= ifelse(is.na(total_deaths), "Data unavailable",
          ifelse(total_deaths >=0 & total_deaths <100,"Total deaths < 100",                                      
           ifelse(total_deaths >=100 & total_deaths <1000,"100 >= Total deaths < 1k",
            ifelse(total_deaths >=1000 & total_deaths <10000,"1k >= Total deaths < 10k",
             ifelse(total_deaths >=10000 & total_deaths <100000,"10k >= Total deaths< 100k","Total deaths >= 100k"))))),
vaccinationstatus_factor = factor(ifelse(vaccination_status == "Vaccination did not start", 0,1)),
Pop_density_factor= factor(ifelse(Pop_density_categ == "Data unavailable",0,
            ifelse(Pop_density_categ =="Low (0 >= ppl/Km2 < 25)",1,                                      
             ifelse(Pop_density_categ == "Medium (25 >= ppl/Km2 < 50)",2,
              ifelse(Pop_density_categ =="High (50 >= ppl/Km2 < 100)",3,
               ifelse(Pop_density_categ =="Very High (100 >= ppl/Km2 <400)",4,5))))))
         ) %>%
# exclude the International and World rows, Kosovo (OWID_KOS) and rows without a World Bank country name
# (rows where location, iso_code or Country_Name is NA are dropped as well)
filter(location != "International" & location != "World" & iso_code != "OWID_KOS" & Country_Name != " ")      

# d) exclude the rows with aggregated values by region (anti-join on iso_code)
setDT(Pop_Vaccine_tile)
setDT(iso_code_aggregated)
Pop_Vaccine_tile <- Pop_Vaccine_tile[!iso_code_aggregated, on = "iso_code"]
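
The categories above are built with nested ifelse() calls. As a possible alternative, the sketch below bins GDP per capita with base R's cut(), reusing the report's labels. The vector gdp and the names gdp_breaks and gdp_labels are assumptions made for this illustration (three of the values are copied from the tables in this report); it is not the code used to produce the results.

```r
# Toy GDP-per-capita values (three copied from the tables in this report, plus an NA)
gdp <- c(54225.446, 6426.674, 116935.60, NA)

# Bin boundaries; right = FALSE makes every bin include its lower bound
# and exclude its upper bound, i.e. [lower, upper)
gdp_breaks <- c(-Inf, 1000, 5000, 15000, 50000, 90000, Inf)
gdp_labels <- c("GDP < 1k", "1k >= GDP < 5k", "5k >= GDP < 15k",
                "15k >= GDP < 50k", "50k >= GDP < 90k", "GDP >= 90K")

gdp_category <- as.character(cut(gdp, breaks = gdp_breaks,
                                 labels = gdp_labels, right = FALSE))

# cut() returns NA for missing inputs; map those to the report's label
gdp_category[is.na(gdp_category)] <- "Data unavailable"
print(gdp_category)
```

Keeping the boundaries and the labels in two vectors makes it easier to check that each label matches its bin.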

2.2 Exploratory Data Analysis (EDA) & Visualizations

Covid-19 cases

The trends: the Worldwide Monthly Distribution of the Covid-19 New Cases indicated some stability in the number of new cases during the northern hemisphere 2020 summer months (June, July, August). New cases then started to increase again after September 2020, and December 2020 ended with almost 20M new cases worldwide.

The first 7 days of 2021 were comparable with the total new cases found during the full month of April 2020. Total new cases for January 2021 were the same as for December 2020.

Thankfully, the graphic already shows a decrease in the number of new covid-19 cases in the world. February 2021 seems to have a number of new cases similar to that found at the end of the northern hemisphere summer of 2020.

Pop_Vaccine_tile %>% 
  ggplot(aes(x=lubridate::month(date, label = TRUE, abbr = TRUE), 
             y=new_cases,
             group = factor(lubridate::year(date)),
             color = factor(lubridate::year(date)))) + 
  geom_bar(stat="identity", width=0.3) +
  #theme_classic() +
  labs(title = "Worldwide Monthly Distribution of the Covid-19 New Cases",
       x= "Date", y= "Monthly new cases") +
  theme_bw() + theme(axis.text.x =element_text(size = rel(0.75),angle = 90),
                     axis.text.y = element_text(size = 10),
                     legend.position = "none") +
 # scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + # format scientific notation
  facet_wrap(~ lubridate::year(date))                                       # see data by year
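
The bars in this plot stack the daily new_cases values within each month. A minimal sketch of the same monthly totals as a table, using base R only, is given below. The data frame daily and its numbers are hypothetical and are not taken from the OWID file.

```r
# Toy daily records (hypothetical numbers)
daily <- data.frame(
  date      = as.Date(c("2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02")),
  new_cases = c(100, 150, 200, NA)
)

# Label each row with its year and month, then sum the new cases per label;
# with the formula interface, rows with a missing new_cases are dropped first
daily$year_month <- format(daily$date, "%Y-%m")
monthly <- aggregate(new_cases ~ year_month, data = daily, FUN = sum)
print(monthly)
```

A table like this can help when checking the monthly figures quoted in the text against the plot.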

Since covid-19 is longitudinal data, the most recent records for total cases, total deaths and vaccination were extracted for the analysis.

Note that these results may differ from the actual totals published in each country, due to the lag in updating the covid-19 package. Also, different time zones must be considered.

#most recent cases
recent_totalcases<- Pop_Vaccine_tile %>% group_by(iso_code) %>%       # most recent data for total cases
  slice( which.max( total_cases) ) %>% as.data.frame

Table of the Top 15 Countries with Total covid-19 Cases vs. GDP per Capita indicated:

5k >= GDP < 15k: represents 20% of the nations

15k >= GDP < 50k: majority (73%)

50k >= GDP < 90k: represented by United States (7%), which is the Top 1 in the list.

location total_cases total_deaths gdp_per_capita GDP_category
United States 28486394 510458 54225.446 50k >= GDP < 90k
India 11079979 156938 6426.674 5k >= GDP < 15k
Brazil 10455630 252835 14103.452 5k >= GDP < 15k
Russia 4175757 83900 24765.954 15k >= GDP < 50k
United Kingdom 4175315 122648 39753.244 15k >= GDP < 50k
France 3746707 85738 38605.671 15k >= GDP < 50k
Spain 3188553 69142 34272.360 15k >= GDP < 50k
Italy 2888923 97227 35220.084 15k >= GDP < 50k
Turkey 2683971 28432 25129.341 15k >= GDP < 50k
Germany 2436506 69939 45229.245 15k >= GDP < 50k
Colombia 2244792 59518 13254.949 5k >= GDP < 15k
Argentina 2098728 51887 18933.907 15k >= GDP < 50k
Mexico 2076882 184474 17336.469 15k >= GDP < 50k
Poland 1684788 43353 27216.445 15k >= GDP < 50k
Iran 1615184 59899 19082.620 15k >= GDP < 50k
## 
## 15k >= GDP < 50k 50k >= GDP < 90k  5k >= GDP < 15k 
##               11                1                3
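
The shares quoted before the table (20%, 73% and 7%) follow from the category counts printed above. A short sketch of that arithmetic, with the counts typed in by hand:

```r
# Category counts copied from the output above
counts <- c("15k >= GDP < 50k" = 11, "50k >= GDP < 90k" = 1, "5k >= GDP < 15k" = 3)

# Share of each category among the 15 countries, rounded to whole percentages
shares <- round(100 * counts / sum(counts))
print(shares)
```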

Covid-19 deaths

In 2020 there were 3 significant peaks of new deaths: one in April, during the northern hemisphere spring, and 2 others at the start of winter (November and December), with December 2020 being the month with the highest number of new deaths (about 350K worldwide).

January 2021 surpassed December 2020 by almost 60k.

February also displays a significant reduction in new deaths. The amount is similar to November 2020 (about 300k).

It has already been 1 year since we first faced covid-19; all nations recognize that there are more cases than those published, such as asymptomatic people who were not tested and continue to transmit the virus. Bendix (2020) suggests that the actual number of covid-19 cases in the USA could be anywhere from 5 to 20 times the numbers published, which might also be a reality for many other countries.

Table of the Top 15 Countries with Total covid-19 Deaths vs. GDP per Capita has similar GDP categorization to the Top 15 countries with Total Cases:

5k >= GDP < 15k: represents 33% of the nations

15k >= GDP < 50k: represented by the majority (60%)

50k >= GDP < 90k: represented by United States (7%) which is also the Top 1

location total_cases total_deaths gdp_per_capita GDP_category
United States 28486394 510458 54225.446 50k >= GDP < 90k
Brazil 10455630 252835 14103.452 5k >= GDP < 15k
Mexico 2076882 184474 17336.469 15k >= GDP < 50k
India 11079979 156938 6426.674 5k >= GDP < 15k
United Kingdom 4175315 122648 39753.244 15k >= GDP < 50k
Italy 2888923 97227 35220.084 15k >= GDP < 50k
France 3746707 85738 38605.671 15k >= GDP < 50k
Russia 4175757 83900 24765.954 15k >= GDP < 50k
Germany 2436506 69939 45229.245 15k >= GDP < 50k
Spain 3188553 69142 34272.360 15k >= GDP < 50k
Iran 1615184 59899 19082.620 15k >= GDP < 50k
Colombia 2244792 59518 13254.949 5k >= GDP < 15k
Argentina 2098728 51887 18933.907 15k >= GDP < 50k
South Africa 1510778 49784 12294.876 5k >= GDP < 15k
Peru 1316363 46094 12236.706 5k >= GDP < 15k
## 
## 15k >= GDP < 50k 50k >= GDP < 90k  5k >= GDP < 15k 
##                9                1                5

Vaccination

On December 2nd 2020, the Pfizer/BioNTech covid-19 vaccine became the 1st in the world to receive emergency approval, in the UK; days after, Bahrain, Canada, Mexico and the USA followed suit.

The vaccine rollout started slowly but has already reached at least 102 countries:

  • Europe (36): 28% of the countries are located in Northern Europe and another 28% in Southern Europe

  • Asia (26): 42% are in Western Asia

  • Americas (20): 50% are in South America

  • Africa (8): Northern Africa and Eastern Africa each represent 38%

  • Oceania (2)

  • NAs (10) are Andorra, Bermuda, Cayman Islands, Faroe Islands, Gibraltar, Isle of Man, Liechtenstein, Macao, Monaco, Turks and Caicos Islands

region sub.region count
Africa Eastern Africa 3
Africa Northern Africa 3
Africa Southern Africa 1
Africa Western Africa 1
Americas Caribbean 3
Americas Central America 4
Americas Northern America 3
Americas South America 10
Asia Central Asia 1
Asia Eastern Asia 3
Asia South-Eastern Asia 4
Asia Southern Asia 7
Asia Western Asia 11
Europe Eastern Europe 9
Europe Northern Europe 10
Europe Southern Europe 10
Europe Western Europe 7
Oceania Australia and New Zealand 2
NA NA 10
# Creating the treemap
treemap(n_countries_subregion,
        index=c("region","sub.region"),
        vSize=c("count"),
        vColor=c("count"),
        type="value",
        range=c(0,12),
        #palette=brewer.pal(n=8,"RdYlGn"),
        algorithm="pivotSize",
        sortID="-size",
        palette="RdYlBu",
        title="Counts of countries with covid-19 vaccination rollout by sub regions",
        title.legend = "Counts of countries",
        fontsize.labels=c(0.1,12),                # size of labels. Give the size per level of aggregation: size for group, size for subgroup, sub-subgroups...
        fontcolor.labels=c("white","black"),    # Color of labels
        fontface.labels=c(2,1),                  # Font of labels: 1,2,3,4 for normal, bold, italic, bold-italic...
        #bg.labels=c("transparent"),              # Background color of labels
        align.labels=list(
          c("center", "center"), 
          c("center", "center")
        ),                                   # Where to place labels in the rectangle?
        overlap.labels=0.5,                      # tolerance (0 to 1) of the overlap between labels: 0 hides lower-level labels that overlap higher-level ones, 1 always prints them
        inflate.labels=FALSE,                    # If TRUE, labels are bigger when rectangle is bigger.
        border.col=c("black","white"),             # Color of borders of groups, of subgroups, of subsubgroups ....
        border.lwds=c(5,2)                       # Width of borders
        )
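
The treemap call reads n_countries_subregion, which is not built in the code shown in this report. One way a table of that shape (region, sub.region, count) could be produced is sketched below with base R. The data frame vaccinating, its five countries and the name n_countries_demo are assumptions made for this illustration, not the original code or data.

```r
# Hypothetical input: one row per country that started the vaccination rollout
vaccinating <- data.frame(
  location   = c("Chile", "Brazil", "Israel", "Turkey", "Norway"),
  region     = c("Americas", "Americas", "Asia", "Asia", "Europe"),
  sub.region = c("South America", "South America", "Western Asia",
                 "Western Asia", "Northern Europe")
)

# Each country counts once; sum the ones within every region / sub-region pair
vaccinating$count <- 1
n_countries_demo <- aggregate(count ~ region + sub.region, data = vaccinating, FUN = sum)
n_countries_demo <- n_countries_demo[order(n_countries_demo$region,
                                           n_countries_demo$sub.region), ]
print(n_countries_demo)
```

The resulting columns match the index and vSize arguments used in the treemap call above.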

For some countries, such as China, India and the United Arab Emirates, there is no data for only the 1st dose administrated or for people fully vaccinated; because of this, when the Top 15 countries are filtered, the tables are inconsistent.

Table of the Top 15 Countries which administrated at least one dose of covid-19 Vaccine vs. GDP per Capita has a heterogeneous distribution:

1k >= GDP < 5k: Bangladesh representing 7%

5k >= GDP < 15k: India, Brazil and Morocco representing 20%

15k >= GDP < 50k: represents 60%

50k >= GDP < 90k: United States and United Arab Emirates representing 13%

location total_cases total_deaths people_vaccinated gdp_per_capita GDP_category
United States 28486394 510458 47184199 54225.446 50k >= GDP < 90k
United Kingdom 4166727 122303 19177555 39753.244 15k >= GDP < 50k
India 11079979 156938 11552857 6426.674 5k >= GDP < 15k
Turkey 2683971 28432 6745147 25129.341 15k >= GDP < 50k
Brazil 10455630 252835 6346769 14103.452 5k >= GDP < 15k
Israel 770780 5697 4663028 33132.320 15k >= GDP < 50k
Germany 2436506 69939 3881490 45229.245 15k >= GDP < 50k
United Arab Emirates 375535 1145 3480415 67293.483 50k >= GDP < 90k
Morocco 482994 8608 3327858 7485.013 5k >= GDP < 15k
Chile 816929 20400 3289086 22767.037 15k >= GDP < 50k
Bangladesh 544954 8384 2850940 3523.984 1k >= GDP < 5k
France 3746475 85734 2808490 38605.671 15k >= GDP < 50k
Italy 2888923 97227 2696588 35220.084 15k >= GDP < 50k
Spain 3180212 68813 2361852 34272.360 15k >= GDP < 50k
Russia 3968228 76873 2200000 24765.954 15k >= GDP < 50k
## 
## 15k >= GDP < 50k   1k >= GDP < 5k 50k >= GDP < 90k  5k >= GDP < 15k 
##                9                1                2                3

Table of the Top 15 Countries fully vaccinated against covid-19 vs. GDP per Capita indicated:

5k >= GDP < 15k: India, Brazil and Indonesia representing 20%

15k >= GDP < 50k: represents 67%

50k >= GDP < 90k: United States and United Arab Emirates representing 13%

location total_cases total_deaths people_vaccinated people_fully_vaccinated gdp_per_capita GDP_category
United States 28486394 510458 47184199 22613359 54225.446 50k >= GDP < 90k
Israel 770780 5697 4663028 3294759 33132.320 15k >= GDP < 50k
India 11079979 156938 11552857 2204083 6426.674 5k >= GDP < 15k
United Arab Emirates 375535 1145 3480415 2187849 67293.483 50k >= GDP < 90k
Germany 2436506 69939 3881490 2029047 45229.245 15k >= GDP < 50k
Brazil 10455630 252835 6346769 1755018 14103.452 5k >= GDP < 15k
Russia 3968228 76873 2200000 1700000 24765.954 15k >= GDP < 50k
Turkey 2683971 28432 6745147 1553658 25129.341 15k >= GDP < 50k
France 3746475 85734 2808490 1490083 38605.671 15k >= GDP < 50k
Italy 2888923 97227 2696588 1377987 35220.084 15k >= GDP < 50k
Spain 3180212 68813 2361852 1243783 34272.360 15k >= GDP < 50k
Poland 1684788 43353 2101754 1168058 27216.445 15k >= GDP < 50k
Indonesia 1322866 35786 1583581 865870 11188.744 5k >= GDP < 15k
United Kingdom 4166727 122303 19177555 736037 39753.244 15k >= GDP < 50k
Romania 795732 20233 890068 615965 23313.199 15k >= GDP < 50k
## 
## 15k >= GDP < 50k 50k >= GDP < 90k  5k >= GDP < 15k 
##               10                2                3

Table of the Top 15 Countries Total Vaccines Administrated against covid-19 vs. GDP per Capita indicated:

5k >= GDP < 15k: India, Brazil and Morocco representing 20%

15k >= GDP < 50k: represents 67%

50k >= GDP < 90k: United States and United Arab Emirates representing 13%

location total_cases total_deaths people_vaccinated people_fully_vaccinated total_vaccinations gdp_per_capita GDP_category
United States 28486394 510458 47184199 22613359 70454064 54225.446 50k >= GDP < 90k
China 100475 4824 NA NA 40520000 15308.712 15k >= GDP < 50k
United Kingdom 4166727 122303 19177555 736037 19913592 39753.244 15k >= GDP < 50k
India 11079979 156938 11552857 2204083 13756940 6426.674 5k >= GDP < 15k
Turkey 2683971 28432 6745147 1553658 8298805 25129.341 15k >= GDP < 50k
Brazil 10455630 252835 6346769 1755018 8101787 14103.452 5k >= GDP < 15k
Israel 770780 5697 4663028 3294759 7957787 33132.320 15k >= GDP < 50k
United Arab Emirates 385160 1198 NA NA 5933299 67293.483 50k >= GDP < 90k
Germany 2436506 69939 3881490 2029047 5910537 45229.245 15k >= GDP < 50k
France 3746475 85734 2808490 1490083 4298573 38605.671 15k >= GDP < 50k
Italy 2888923 97227 2696588 1377987 4074575 35220.084 15k >= GDP < 50k
Russia 3968228 76873 2200000 1700000 3900000 24765.954 15k >= GDP < 50k
Spain 3180212 68813 2361852 1243783 3605635 34272.360 15k >= GDP < 50k
Morocco 482994 8608 3327858 96437 3424295 7485.013 5k >= GDP < 15k
Chile 816929 20400 3289086 55941 3345027 22767.037 15k >= GDP < 50k
## 
## 15k >= GDP < 50k 50k >= GDP < 90k  5k >= GDP < 15k 
##               10                2                3

Table of the Top 15 GDP per Capita vs. Total single doses demonstrated:

15k >= GDP < 50k: Cayman Islands represents 7% and administrated about 21k total single doses.

50k >= GDP < 90k: represents the majority (73%). All countries of this category started the vaccination rollout except Brunei, San Marino and Hong Kong. In terms of single doses administrated relative to the population age 15 and up, the United Arab Emirates is leading with 50% >= Single doses < 70% (about 6 million doses), followed by the United States with 10% >= Single doses < 50% (about 70 million doses).

GDP >= 90K: Qatar, Macao and Luxembourg represent 20%. For each nation, the total single doses administrated correspond to less than 10% of the population age 15 and up. Notably, total cases were below 200K and total deaths below 1k.

location gdp_per_capita GDP_category total_cases total_deaths people_vaccinated people_fully_vaccinated total_vaccinations Pop15Over Total_vaccination_Age15UP People_vaccinated_Age15UP People_Fully_vaccinated_Age15UP
Qatar 116935.60 GDP >= 90K 162737 257 NA NA 140000 2530000 Total single doses < 10% Data unavailable Data unavailable
Macao 104861.85 GDP >= 90K NA NA NA NA 2000 562000 Total single doses < 10% Data unavailable Data unavailable
Luxembourg 94277.96 GDP >= 90K 55110 637 26089 9982 36071 537000 Total single doses < 10% 1st dose < 10% Fully vaccinated < 10%
Singapore 85535.38 50k >= GDP < 90k 59913 29 250000 110000 360000 5078000 Total single doses < 10% 1st dose < 10% Fully vaccinated < 10%
Brunei 71809.25 50k >= GDP < 90k 185 3 NA NA NA 344000 NA NA NA
Ireland 67335.29 50k >= GDP < 90k 218251 4300 238841 134439 373280 4001000 Total single doses < 10% 1st dose < 10% Fully vaccinated < 10%
United Arab Emirates 67293.48 50k >= GDP < 90k 385160 1198 NA NA 5933299 8503000 50% >= Total single doses < 70% Data unavailable Data unavailable
Kuwait 65530.54 50k >= GDP < 90k 189046 1072 137000 38000 175000 3413000 Total single doses < 10% 1st dose < 10% Fully vaccinated < 10%
Norway 64800.06 50k >= GDP < 90k 70564 622 318722 149622 468344 4505000 10% >= Total single doses < 50% 1st dose < 10% Fully vaccinated < 10%
Switzerland 57410.17 50k >= GDP < 90k 554932 9961 527979 220812 748791 7384000 10% >= Total single doses < 50% 1st dose < 10% Fully vaccinated < 10%
San Marino 56861.47 50k >= GDP < 90k 3671 73 NA NA NA NA NA NA NA
Hong Kong 56054.92 50k >= GDP < 90k NA NA NA NA NA 6621000 NA NA NA
United States 54225.45 50k >= GDP < 90k 28486394 510458 47184199 22613359 70454064 271447000 10% >= Total single doses < 50% 10% >= 1st dose < 50% Fully vaccinated < 10%
Bermuda 50669.32 50k >= GDP < 90k NA NA 12304 4769 17073 NA Data unavailable Data unavailable Data unavailable
Cayman Islands 49903.03 15k >= GDP < 50k NA NA NA NA 21106 NA Data unavailable Data unavailable Data unavailable
## 
## 15k >= GDP < 50k 50k >= GDP < 90k       GDP >= 90K 
##                1               11                3
## 
## 10% >= Total single doses < 50% 50% >= Total single doses < 70% 
##                               3                               1 
##                Data unavailable        Total single doses < 10% 
##                               2                               6

Geographical Distribution of the Countries with vaccination rollout vs. Total single doses

The tile grid map of the Countries with vaccination rollout vs. Total Single doses shows that most countries are located in the sub-regions of North & South America, Europe and Southern & Western Asia. The countries with the highest number of doses administrated:

North America: United States - USA (Single doses >= 50M)

Northern & Western Europe: United Kingdom - GBR (5M >= Single doses < 50M)

Western Asia: Israel - ISR (5M >= Single doses < 50M)

Eastern Asia: China - CHN (5M >= Single doses < 50M)

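# tile map of the countries by the counts of single doses administrated  
# summary of the doses; names chosen to avoid reusing the names of base functions,
# and na.rm = TRUE so that countries without data do not turn the result into NA
min_doses    <- min(vaccine_tile$total_vaccinations.y, na.rm = TRUE)
max_doses    <- max(vaccine_tile$total_vaccinations.y, na.rm = TRUE)
median_doses <- median(vaccine_tile$total_vaccinations.y, na.rm = TRUE)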


ggplot(vaccine_tile, aes(xmin = x.x, ymin = y.x, xmax = x.x + 1, ymax = y.x + 1, fill = total_vaccinations.y)) +
  geom_rect(color = "black") +
  mytheme +
  theme(plot.caption = element_text(size = 4),
    legend.position = "bottom", 
    legend.title = element_text(size = 6),
    legend.text = element_text(size = 6),
    legend.key.size = unit(0.5,"line"))+
  geom_text(aes(x = x.x, y = y.x, label = iso_code), color = "black", alpha = 0.5, nudge_x = 0.5, nudge_y = -0.5, size = 1.7) + 
  scale_y_reverse() + 
   coord_equal()+
  labs(caption="Countries such as Liechtenstein, Monaco, West Bank and Gaza, San Marino and small islands are not represented", color="Single Doses")+
 scale_fill_gradientn(colours = rcartocolor::carto_pal(name = "TealRose", n = 7), name = "",na.value = NA) +
   guides(fill = guide_colourbar(barheight = 0.3, barwidth = 20, direction = "horizontal", ticks = FALSE)) # https://www.katiejolly.io/blog/2019-08-28/nyt-urban-heat
## Warning: Removed 13 rows containing missing values (geom_rect).
## Warning: Removed 13 rows containing missing values (geom_text).

Considering the population of age 15 and up vs. total single doses of the covid-19 vaccine, the data shows that some countries, such as Israel and Seychelles (both under 15k >= GDP < 50k), may soon reach potential herd immunity by vaccinating more than 70% of the population of age 15 and up. Note: considering only the 1st dose administrated, each of these nations falls under 50% >= 1st dose < 70%; considering the fully vaccinated category, each is under 10% >= Fully vaccinated < 50%.

location GDP_category total_cases total_deaths people_vaccinated people_fully_vaccinated total_vaccinations Pop15Over Total_vaccination_Age15UP People_vaccinated_Age15UP People_Fully_vaccinated_Age15UP
United Arab Emirates 50k >= GDP < 90k 385160 1198 NA NA 5933299 8503000 50% >= Total single doses < 70% Data unavailable Data unavailable
Israel 15k >= GDP < 50k 770780 5697 4663028 3294759 7957787 6751000 Potential herd immunity 50% >= 1st dose < 70% 10% >= Fully vaccinated < 50%
Seychelles 15k >= GDP < 50k 2592 11 51577 23519 75096 75300 Potential herd immunity 50% >= 1st dose < 70% 10% >= Fully vaccinated < 50%

In terms of the percent of the population of age 15 and up being vaccinated, Israel - ISR and Seychelles - SYC are leading with Total single doses > 70% (potential herd immunity), followed by the United Arab Emirates - ARE (50% >= Total single doses < 70%).
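
To make the difference between the three categories concrete, the sketch below recomputes the shares for Israel and Seychelles from the figures in the table above. The column names share_doses, share_first and share_full are chosen here for illustration. The share based on total doses counts doses rather than people, so it can be much higher than the share of people with a 1st dose.

```r
# Figures copied from the table above
rollout <- data.frame(
  location                = c("Israel", "Seychelles"),
  total_vaccinations      = c(7957787, 75096),
  people_vaccinated       = c(4663028, 51577),
  people_fully_vaccinated = c(3294759, 23519),
  Pop15Over               = c(6751000, 75300)
)

# Percentages relative to the estimated population of age 15 and up
rollout$share_doses <- round(100 * rollout$total_vaccinations / rollout$Pop15Over, 1)
rollout$share_first <- round(100 * rollout$people_vaccinated / rollout$Pop15Over, 1)
rollout$share_full  <- round(100 * rollout$people_fully_vaccinated / rollout$Pop15Over, 1)

print(rollout[, c("location", "share_doses", "share_first", "share_full")])
```

For Israel the dose-based share is above 100%, while the 1st-dose share stays just under 70%, which is why the people-based variables give a more cautious picture.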

3. Conclusion

This project used longitudinal data (covid-19), which changes over time, so it was challenging to validate the data. Also, I encountered days on which the covid-19 package had an issue with its server. In other instances, data was missing or was not adequately updated.

When the time zone changed from one day to another, I frequently noticed that the covid-19 cases and deaths were updated, but there was a lag in updating the vaccination information (or vice versa). If I tried to retrieve a subset with the most recent date, the data was not complete. My solution to this temporal data issue was to create dataframes with the most recent data for each variable published (total cases, total deaths, and vaccination information) and join these subsets, selecting just the variables that I would plot and that were updated.
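
A minimal sketch of this idea in base R is shown below, assuming that the most recent record means the latest date with a non-missing value. The data frame toy, its values and the helper latest_value() are hypothetical and only illustrate the approach; the report itself selects rows with which.max() on the cumulative totals.

```r
# Toy longitudinal data: cases are updated on 2021-02-27, vaccinations lag one day
toy <- data.frame(
  iso_code           = c("AAA", "AAA", "BBB", "BBB"),
  date               = as.Date(c("2021-02-26", "2021-02-27", "2021-02-26", "2021-02-27")),
  total_cases        = c(100, 120, 50, 55),
  total_vaccinations = c(10, NA, 5, NA)
)

# Hypothetical helper: latest non-missing value of one variable for each country
latest_value <- function(data, variable) {
  known <- data[!is.na(data[[variable]]), ]
  known <- known[order(known$iso_code, known$date), ]
  # after sorting, the last row of each country holds its most recent value
  last_rows <- known[!duplicated(known$iso_code, fromLast = TRUE), ]
  last_rows[, c("iso_code", variable)]
}

# One subset per variable, joined back together by country
latest <- merge(latest_value(toy, "total_cases"),
                latest_value(toy, "total_vaccinations"),
                by = "iso_code", all = TRUE)
print(latest)
```

Each country ends up with one row that combines the newest available value of every variable, even when those values come from different dates.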

I also noticed a few changes in the covid-19 data structure itself (e.g., new columns were added), probably to accommodate reality, primarily related to the vaccination information. Some countries track vaccinations administrated by people who received the 1st dose and people fully vaccinated. Still, others track only the total single doses, making some analyses inconsistent.

There are many other geographical-socio-economic questions that I would be curious to know, and this is just the beginning of my insights.

4. References

Bendix (2020). To know the real number of coronavirus cases in the US, China, or Italy, researchers say multiply by 10. Accessed: Apr 19, 2020, 12:50 PM. https://www.businessinsider.com/real-number-of-coronavirus-cases-underreported-us-china-italy-2020-4

Hale T, Phillips T, Petherick A, Kira B, Angrist N, Aymar K, et al. Risk of Openness Index: when do government responses need to be increased or maintained? [Internet]. Version 2.0. Oxford:Blavatnik School of Government; 2020 [cited 2020 Oct 21] https://www.publichealthontario.ca/-/media/documents/ncov/research/2020/10/research-hale-risk-of-openness-index.pdf?la=en based on https://www.bsg.ox.ac.uk/sites/default/files/2020-10/10-2020-Risk-of-Openness-Index-BSG-Research-Note.pdf

Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8

HDI https://ec.europa.eu/environment/beyond_gdp/download/factsheets/bgdp-ve-hdi.pdf

HDI wikipedia 2019 https://en.wikipedia.org/wiki/Human_Development_Index

https://www.maartenlambrechts.com/2017/10/22/tutorial-a-worldtilegrid-with-ggplot2.html, https://github.com/ishaberry/Covid19Canada, https://github.com/kaerosen/tilemaps

Naïve Bayes Classifier - https://uc-r.github.io/naive_bayes

Mohammed, R.A. Longitudinal Data Integration for a Tracking System for Health Professionals. Master's Thesis. University of New Brunswick, 2016.