Machine Learning - Part 1 of Covid-19 Analysis
All the visualizations and analysis in this report are as of February 27, 2021, 2 pm AST. Note that the covid-19 data changes over time.
1. Introduction
In 2019, the Covid-19 pandemic put our lives and behaviours in check. There is so much to learn and millions of questions to be asked. I decided to investigate covid-19-related demographic, geographical and socio-economic parameters. Covid-19 data is a longitudinal dataset and changes over time.
As highlighted by Mohammed (2016), significant challenges are encountered during the development of a longitudinal dataset: loss of information due to missing or incomplete data, deterioration of data over time, lack of data standardization (such as geographical nomenclature and spellings) and data quality problems (such as missing data or incorrect data types).
This is the 2nd machine learning project requirement for the HarvardX Professional Certificate Data Science Program.
github: https://github.com/silpai/Machine-Learning---Covid-19
2. Methodology and Analysis
Three datasets were used:
•Worldtilegrid – contains each country's region, subregion, and Cartesian x,y coordinates. Note that some countries are not listed in this dataset, and small islands are also missing.
•Covid-19 dataset with vaccination information – contains covid-19 tracking data collected from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU); vaccinations against COVID-19 collected by the Our World in Data (https://ourworldindata.org/) team from official reports (this data counts single doses of the vaccine); and demographic and socio-economic data collected from the United Nations, the World Bank and other governmental agencies. The dataset is longitudinal (changes over time), and it is updated according to the countries' time zones.
Source: https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv
•WorldBank estimated population – contains the projected total population for 2021 and the population per age range (age 0-14, age 15-64 and age 65 and up). Since covid-19 vaccines have been approved for those older than age 16, I use the age range 15 and up in my analysis. One of my curiosities was to know the percentage of the population vaccinated relative to the approved age group, not the entire population. Note that these are estimates and are not reduced to account for deaths from covid-19 or any other cause.
Source: http://databank.worldbank.org/data/download/Population-Estimates_CSV.zip
Steps:
a) Load and Clean data
1. Load datasets
2. Combine the worldtilegrid with covid-19
3. Transform Data
i. Worldbank data comes in a long format, so I had to pivot the variables wider
ii. Combine the 3 datasets
iii. Create new categories
iv. Exclude the rows with aggregated values by region
b) Perform Exploratory Analysis (EDA)
c) Develop Machine Learning Models
Modeling was used to create the predictions. Models were built on the training data and applied to the test data:
- Linear Models: correlation matrix, linear regression, naive Bayes classifier and decision tree
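The transformation steps under a) can be sketched on a toy table. This is a minimal illustration in base R, not the code used in the report: the country codes `AAA` and `BBB` and all population values are made up, and base `reshape()` stands in for the pivot wider step so that the sketch runs without extra packages.

```r
# Toy long-format table: one row per country and indicator (values are made up).
pop_long <- data.frame(
  Country_Code   = rep(c("AAA", "BBB", "WLD"), each = 3),
  Indicator_Code = rep(c("SP.POP.TOTL", "SP.POP.1564.TO", "SP.POP.65UP.TO"), times = 3),
  value          = c(1000, 600, 150, 500, 320, 40, 1500, 920, 190),
  stringsAsFactors = FALSE
)

# Pivot wider: one row per country, one column per indicator.
pop_wide <- reshape(pop_long,
                    idvar = "Country_Code",
                    timevar = "Indicator_Code",
                    direction = "wide")
names(pop_wide) <- sub("^value\\.", "", names(pop_wide))
rownames(pop_wide) <- NULL

# Population aged 15 and over = ages 15-64 plus ages 65 and up.
pop_wide$Pop15Over <- pop_wide$SP.POP.1564.TO + pop_wide$SP.POP.65UP.TO

# Exclude aggregate rows; "WLD" stands in for the full list of aggregate codes.
aggregated_codes <- c("WLD")
pop_countries <- pop_wide[!pop_wide$Country_Code %in% aggregated_codes, ]
print(pop_countries)
```

Filtering with `%in%` plays the same role as an anti-join on the code column: only rows whose code is not in the aggregate list remain.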
2.1 Data Transformation
The Worldbank dataset required a pivot wider transformation. The three datasets were joined. Rows with aggregated values by region were eliminated, leaving only countries as observations. New groupings were created for the analysis.
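As an illustration of how such groupings can be built, the sketch below bins GDP per capita with base R `cut()`. It is an alternative formulation under stated assumptions, not the report's implementation: the helper name `categorise_gdp` is hypothetical, the break points and labels are copied from the report's GDP categories, and the four GDP values are taken from the tables in this report (the `Unknown` row is made up to show the missing-value case).

```r
# Break points and labels copied from the report's GDP categories.
gdp_breaks <- c(0, 1000, 5000, 15000, 50000, 90000, Inf)
gdp_labels <- c("GDP < 1k", "1k >= GDP < 5k", "5k >= GDP < 15k",
                "15k >= GDP < 50k", "50k >= GDP < 90k", "GDP >= 90K")

# Hypothetical helper: left-closed intervals, missing values labelled explicitly.
categorise_gdp <- function(gdp_per_capita) {
  categ <- as.character(cut(gdp_per_capita, breaks = gdp_breaks,
                            labels = gdp_labels, right = FALSE))
  categ[is.na(gdp_per_capita)] <- "Data unavailable"
  categ
}

gdp <- data.frame(
  location       = c("Bangladesh", "India", "United States", "Qatar", "Unknown"),
  gdp_per_capita = c(3523.984, 6426.674, 54225.446, 116935.60, NA),
  stringsAsFactors = FALSE
)
gdp$GDP_category <- categorise_gdp(gdp$gdp_per_capita)
print(gdp)
```

Keeping the breaks and labels in two vectors means a new band only needs one extra entry in each, instead of another nested condition.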
This report was initially developed on January 11th 2021, when few countries had initiated the vaccination rollout and the related tracking system did not differentiate the 1st dose from the 2nd dose, so the total_vaccinations variable was used. Since then, the people_vaccinated (1st dose) and people_fully_vaccinated variables have become available.
total_vaccinations - Total number of COVID-19 vaccination doses administered
people_vaccinated - Total number of people who received at least one vaccine dose
people_fully_vaccinated - Total number of people who received all doses prescribed by the vaccination protocol
##### a) Data Load and Cleaning
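# Packages used by the code in this report
library(dplyr)       # select, filter, mutate, joins, %>%
library(tidyr)       # spread
library(data.table)  # fread, setDT
library(stringr)     # str_replace_all
library(scales)      # percent (assumed to be scales::percent)
library(ggplot2)     # plots
library(treemap)     # treemap
#1. Load datasets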
# dataset: worldtilegrid x y
worldtilegrid <- read.csv(
"https://gist.githubusercontent.com/maartenzam/787498bbc07ae06b637447dbd430ea0a/raw/9a9dafafb44d8990f85243a9c7ca349acd3a0d07/worldtilegrid.csv")%>%
select(name, alpha.3,region,sub.region,x,y)
# dataset: Covid-19 dataset with vaccination information
#Source: https://ourworldindata.org/covid-vaccinations
covid_data<-read.csv("https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv")%>%
select(iso_code,location,date,total_cases,new_cases, total_deaths,new_deaths,total_vaccinations, people_vaccinated, people_fully_vaccinated, gdp_per_capita,
human_development_index, population_density, median_age,life_expectancy)
# dataset: worldbank estimated population
#Source:https://datacatalog.worldbank.org/dataset/population-estimates-and-projections
worldb <- tempfile()
download.file("http://databank.worldbank.org/data/download/Population-Estimates_CSV.zip", worldb)
worldbank_csv <- fread(text = gsub(",", "\t", readLines(unzip(worldb, "Population-EstimatesData.csv"))))
names(worldbank_csv) <- as.character(worldbank_csv[1,]) # use 1st row as header
worldbank_csv=worldbank_csv[-c(1),] # eliminate the 1st row that was a header
names(worldbank_csv)<-str_replace_all(names(worldbank_csv), c(" " = "_" )) # replace space by _
#str(worldbank_csv)
##2.Combine covid data & tile grid x y
covid_grid<-full_join(x=covid_data, y=worldtilegrid, by=c("iso_code" ="alpha.3"))
#head(covid_grid)
# 3. Data Transformation
# a) Pivot wider worldbank
worldbank <- worldbank_csv %>%
select("Country_Code","Country_Name","Indicator_Code","2021") %>% # select indicator population for 2021
filter(Indicator_Code %in% c("SP.POP.TOTL","SP.POP.1564.TO","SP.POP.65UP.TO")) %>% # exclude SP.POP.0014.TO (population age of 14 or younger)
spread(key = "Indicator_Code",
value = "2021") %>% #pivot wider
rename(iso_code=Country_Code,Est_Pop_2021= SP.POP.TOTL, Pop15_64=SP.POP.1564.TO, Pop65UP=SP.POP.65UP.TO ) %>%
mutate(Pop15Over=Pop15_64+Pop65UP) # combine population age 15 to 64 + population 65up
#head(worldbank)
#b) Combine covid data & worldbank estimated population age 15 and older data & tile grid x y data
# exclude row related to aggregated values
# Codes of aggregated rows (regions, income groups, world), named iso_code directly.
# Note: "CAF" is also the ISO alpha-3 code of the Central African Republic.
iso_code_aggregated <- data.frame(iso_code = c("ARB", "CAF", "CEB", "CSS", "EAP", "EAR","EAS","ECA","ECS","EUU",
"FCS","HPC","INX","LAC", "LCN","LDC","LIC","LMC","LMY",
"LTE","MEA","MIC","MNA","NAC","OED","OSS","OWID_KOS","OWID_WRL", "PRE","PSS",
"PST","SAS","SSA","SSF","SST","TEA","TEC", "TLA",
"TMN", "TSA", "TSS", "UMC", "WLD", "OWID_EUR","OWID_NAM","OWID_ASI", "OWID_EUN","OWID_SAM","OWID_AFR"))
#iso_code_aggregated
#c) Create new categories
Pop_Vaccine_tile<-full_join(x=worldbank, y=covid_grid, by="iso_code") %>%
group_by(iso_code)%>%
mutate(
Percent_Pop15UP= percent(Pop15Over/Est_Pop_2021),
vaccination_administrated=as.numeric(total_vaccinations),
vaccination_categ= ifelse(is.na(vaccination_administrated), "Data unavailable",
ifelse(vaccination_administrated >0 & vaccination_administrated <10000,"Single doses < 10k",
ifelse(vaccination_administrated >=10000 & vaccination_administrated <500000,"10k >= Single doses < 500k",
ifelse(vaccination_administrated >=500000 & vaccination_administrated <5000000,"500k >= Single doses < 5M",
ifelse(vaccination_administrated >=5000000 & vaccination_administrated <50000000, "5M >= Single doses < 50M", "Single doses >= 50M"))))),
vaccination_status = ifelse(is.na(vaccination_administrated),"Vaccination did not start",
ifelse(vaccination_administrated >0,"Vaccination started", "Vaccination did not start")),
Pop_percent_vaccinated_15over=vaccination_administrated/Pop15Over,
percent_vaccinated_15over=percent(vaccination_administrated/Pop15Over),
vaccination_percent_categ= ifelse(is.na(Pop_percent_vaccinated_15over),"Data unavailable",
ifelse(Pop_percent_vaccinated_15over >0 & Pop_percent_vaccinated_15over <0.1,"Total single doses < 10%",
ifelse(Pop_percent_vaccinated_15over >=0.1 & Pop_percent_vaccinated_15over <0.5,"10% >= Total single doses < 50%",
ifelse(Pop_percent_vaccinated_15over >=0.5 & Pop_percent_vaccinated_15over <0.7,"50% >= Total single doses < 70%", "Potential herd immunity")))),
Pop_percent_fully_vaccinated_15over= people_fully_vaccinated/Pop15Over,
percent_2nddose_15over=percent(people_fully_vaccinated/Pop15Over),
categ_2nddose_15over= ifelse(is.na(Pop_percent_fully_vaccinated_15over),"Data unavailable",
ifelse(Pop_percent_fully_vaccinated_15over >0 & Pop_percent_fully_vaccinated_15over <0.1,"Fully vaccinated < 10%",
ifelse(Pop_percent_fully_vaccinated_15over >=0.1 & Pop_percent_fully_vaccinated_15over <0.5,"10% >= Fully vaccinated < 50%",
ifelse(Pop_percent_fully_vaccinated_15over >=0.5 & Pop_percent_fully_vaccinated_15over <0.7,"50% >= Fully vaccinated < 70%", "Potential herd immunity")))),
Pop_percent_1st_vaccinated_15over= people_vaccinated/Pop15Over,
percent_1stdose_15over=percent(people_vaccinated/Pop15Over),
categ_1stdose_15over= ifelse(is.na(Pop_percent_1st_vaccinated_15over),"Data unavailable",
ifelse(Pop_percent_1st_vaccinated_15over >0 & Pop_percent_1st_vaccinated_15over <0.1," 1st dose < 10%",
ifelse(Pop_percent_1st_vaccinated_15over >=0.1 & Pop_percent_1st_vaccinated_15over <0.5,"10% >= 1st dose < 50%",
ifelse(Pop_percent_1st_vaccinated_15over>=0.5 & Pop_percent_1st_vaccinated_15over <0.7,"50% >= 1st dose < 70%", "Potential herd immunity")))),
GDP_category=ifelse(is.na(gdp_per_capita),"Data unavailable",
ifelse(gdp_per_capita<1000,"GDP < 1k",
ifelse(gdp_per_capita >=1000 & gdp_per_capita <5000,"1k >= GDP < 5k",
ifelse(gdp_per_capita >=5000 & gdp_per_capita <15000,"5k >= GDP < 15k",
ifelse(gdp_per_capita >=15000 & gdp_per_capita <50000,"15k >= GDP < 50k",
ifelse(gdp_per_capita >=50000 & gdp_per_capita <90000,"50k >= GDP < 90k","GDP >= 90K")))))),
Pop_density_categ= ifelse(is.na(population_density), "Data unavailable",
ifelse(population_density >=0 & population_density <25,"Low (0 >= ppl/Km2 < 25)",
ifelse(population_density >=25 & population_density <50,"Medium (25 >= ppl/Km2 < 50)",
ifelse(population_density >=50 & population_density <100,"High (50 >= ppl/Km2 < 100)",
ifelse(population_density >=100 & population_density <400,"Very High (100 >= ppl/Km2 <400)","Extreme High (ppl/Km2 >400)"))))),
Life_expectancy_categ= ifelse(is.na(life_expectancy), "Data unavailable",
ifelse(life_expectancy >=0 & life_expectancy <50,"Life expectancy < 50",
ifelse(life_expectancy >=50 & life_expectancy <60,"50 >= Life expectancy < 60",
ifelse(life_expectancy >=60 & life_expectancy <70,"60 >= Life expectancy < 70",
ifelse(life_expectancy >=70 & life_expectancy <80,"70 >= Life expectancy < 80","Life expectancy >= 80"))))),
Total_cases_categ= ifelse(is.na(total_cases), "Data unavailable",
ifelse(total_cases >=0 & total_cases <50000,"Total cases < 50k",
ifelse(total_cases >=50000 & total_cases <200000,"50k >= Total cases < 200k",
ifelse(total_cases >=200000 & total_cases <500000,"200k >= Total cases < 500k",
ifelse(total_cases >=500000 & total_cases <1000000,"500k >= Total cases < 1M","Total cases >= 1M"))))),
Total_deaths_categ= ifelse(is.na(total_deaths), "Data unavailable",
ifelse(total_deaths >=0 & total_deaths <100,"Total deaths < 100",
ifelse(total_deaths >=100 & total_deaths <1000,"100 >= Total deaths < 1k",
ifelse(total_deaths >=1000 & total_deaths <10000,"1k >= Total deaths < 10k",
ifelse(total_deaths >=10000 & total_deaths <100000,"10k >= Total deaths< 100k","Total deaths >= 100k"))))),
vaccinationstatus_factor = factor(ifelse(vaccination_status == "Vaccination did not start", 0,1)),
Pop_density_factor= factor(ifelse(Pop_density_categ == "Data unavailable",0,
ifelse(Pop_density_categ =="Low (0 >= ppl/Km2 < 25)",1,
ifelse(Pop_density_categ == "Medium (25 >= ppl/Km2 < 50)",2,
ifelse(Pop_density_categ =="High (50 >= ppl/Km2 < 100)",3,
ifelse(Pop_density_categ =="Very High (100 >= ppl/Km2 <400)",4,5))))))
) %>%
# exclude rows with iso_code NULL
filter (location != c("International" ) & location != c("World") & iso_code != "OWID_KOS" & Country_Name != " ")
# d) Exclude the rows with aggregated values by region
setDT(Pop_Vaccine_tile)
setDT(iso_code_aggregated)
Pop_Vaccine_tile= Pop_Vaccine_tile[!iso_code_aggregated, on=c(iso_code = "iso_code")]
2.2 Exploratory Data Analysis (EDA) & Visualizations
Covid-19 cases
The chart Worldwide Monthly Distribution of the Covid-19 New Cases indicates some stability in the number of new cases during the 2020 northern hemisphere summer months (June, July, August). New cases then started to increase again after September 2020, and December 2020 ended with almost 20M new cases worldwide.
New cases in the first 7 days of 2021 were comparable to the total for the full month of April 2020. Total new cases for January 2021 were the same as for December 2020.
Thankfully, the graphic already shows a decrease in the number of new covid-19 cases in the world. February 2021 seems to have a similar number of new cases to the end of the northern hemisphere summer of 2020.
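The monthly totals quoted above come from summing daily new cases per calendar month. A minimal base R sketch of that aggregation follows; the daily figures are made up for illustration, and the column names mirror the `date` and `new_cases` variables of the covid-19 dataset.

```r
# Made-up daily figures spanning a month boundary; NA marks a day with no report.
daily <- data.frame(
  date      = as.Date(c("2020-12-30", "2020-12-31", "2021-01-01", "2021-01-02", "2021-01-03")),
  new_cases = c(100, 120, 90, NA, 110)
)

# Label each day with its year and month, then sum within each label.
daily$month <- format(daily$date, "%Y-%m")
monthly <- aggregate(new_cases ~ month, data = daily, FUN = sum)
print(monthly)
```

With the formula interface, `aggregate()` drops rows whose `new_cases` is missing before summing, so a missing day lowers the monthly total rather than turning it into `NA`.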
Pop_Vaccine_tile %>%
ggplot(aes(x=lubridate::month(date, label = TRUE, abbr = TRUE),
y=new_cases,
group = factor(lubridate::year(date)),
color = factor(lubridate::year(date)))) +
geom_bar(stat="identity", width=0.3) +
#theme_classic() +
labs(title = "Worldwide Monthly Distribution of the Covid-19 New Cases",
x= "Date", y= "Monthly new cases") +
theme_bw() + theme(axis.text.x =element_text(size = rel(0.75),angle = 90),
axis.text.y = element_text(size = 10),
legend.position = "none") +
# scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) + # format scientific notation
facet_wrap(~ lubridate::year(date)) # see data by year
Since covid-19 is longitudinal data, the most recent records for total cases, total deaths and vaccination were extracted for analysis.
Note that these results may differ from the actual totals published in each country, due to the lag in the covid-19 package being updated. Also, different time zones must be considered.
#most recent cases
recent_totalcases<- Pop_Vaccine_tile %>% group_by(iso_code) %>% # most recent data for total cases
slice( which.max( total_cases) ) %>% as.data.frame
Table of the Top 15 Countries with Total covid-19 Cases vs. GDP per Capita indicated:
5k >= GDP < 15k: represents 20% of the nations
15k >= GDP < 50k: majority (73%)
50k >= GDP < 90k: represented by United States (7%), which is the Top 1 in the list.
| location | total_cases | total_deaths | gdp_per_capita | GDP_category |
|---|---|---|---|---|
| United States | 28486394 | 510458 | 54225.446 | 50k >= GDP < 90k |
| India | 11079979 | 156938 | 6426.674 | 5k >= GDP < 15k |
| Brazil | 10455630 | 252835 | 14103.452 | 5k >= GDP < 15k |
| Russia | 4175757 | 83900 | 24765.954 | 15k >= GDP < 50k |
| United Kingdom | 4175315 | 122648 | 39753.244 | 15k >= GDP < 50k |
| France | 3746707 | 85738 | 38605.671 | 15k >= GDP < 50k |
| Spain | 3188553 | 69142 | 34272.360 | 15k >= GDP < 50k |
| Italy | 2888923 | 97227 | 35220.084 | 15k >= GDP < 50k |
| Turkey | 2683971 | 28432 | 25129.341 | 15k >= GDP < 50k |
| Germany | 2436506 | 69939 | 45229.245 | 15k >= GDP < 50k |
| Colombia | 2244792 | 59518 | 13254.949 | 5k >= GDP < 15k |
| Argentina | 2098728 | 51887 | 18933.907 | 15k >= GDP < 50k |
| Mexico | 2076882 | 184474 | 17336.469 | 15k >= GDP < 50k |
| Poland | 1684788 | 43353 | 27216.445 | 15k >= GDP < 50k |
| Iran | 1615184 | 59899 | 19082.620 | 15k >= GDP < 50k |
##
## 15k >= GDP < 50k 50k >= GDP < 90k 5k >= GDP < 15k
## 11 1 3
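The percentages quoted for each GDP band follow directly from the counts above. As a small worked check in base R, the vector below lists the `GDP_category` column of the Top 15 table in row order and converts the counts into rounded shares.

```r
# GDP_category column of the Top 15 total-cases table, in row order.
gdp_category <- c("50k >= GDP < 90k",
                  "5k >= GDP < 15k", "5k >= GDP < 15k",
                  rep("15k >= GDP < 50k", 7),
                  "5k >= GDP < 15k",
                  rep("15k >= GDP < 50k", 4))

# Count countries per band and express each count as a share of the 15 rows.
counts <- table(gdp_category)
shares <- round(100 * as.vector(counts) / sum(counts))
names(shares) <- names(counts)
print(shares)
```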
Covid-19 deaths
In 2020 there were 3 significant peaks of new deaths: one in April, during the northern hemisphere spring, and 2 others at the start of winter (November and December), with December 2020 being the month with the highest number of new deaths (about 350K worldwide).
January 2021 surpassed December 2020 by almost 60k.
February 2021 also displays a significant reduction in new deaths. The amount is similar to November 2020 (about 300k).
It has already been 1 year since we first faced covid-19, and all nations recognize that there are more cases than those published, for example asymptomatic people who were not tested and continue to transmit the virus. Bendix (2020) suggests that the actual number of covid-19 cases in the USA could be anywhere from 5 to 20 times the published numbers, which might also be the reality for many other countries.
Table of the Top 15 Countries with Total covid-19 Deaths vs. GDP per Capita has similar GDP categorization to the Top 15 countries with Total Cases:
5k >= GDP < 15k: represents 33% of the nations
15k >= GDP < 50k: represented by the majority (60%)
50k >= GDP < 90k: represented by United States (7%) which is also the Top 1
| location | total_cases | total_deaths | gdp_per_capita | GDP_category |
|---|---|---|---|---|
| United States | 28486394 | 510458 | 54225.446 | 50k >= GDP < 90k |
| Brazil | 10455630 | 252835 | 14103.452 | 5k >= GDP < 15k |
| Mexico | 2076882 | 184474 | 17336.469 | 15k >= GDP < 50k |
| India | 11079979 | 156938 | 6426.674 | 5k >= GDP < 15k |
| United Kingdom | 4175315 | 122648 | 39753.244 | 15k >= GDP < 50k |
| Italy | 2888923 | 97227 | 35220.084 | 15k >= GDP < 50k |
| France | 3746707 | 85738 | 38605.671 | 15k >= GDP < 50k |
| Russia | 4175757 | 83900 | 24765.954 | 15k >= GDP < 50k |
| Germany | 2436506 | 69939 | 45229.245 | 15k >= GDP < 50k |
| Spain | 3188553 | 69142 | 34272.360 | 15k >= GDP < 50k |
| Iran | 1615184 | 59899 | 19082.620 | 15k >= GDP < 50k |
| Colombia | 2244792 | 59518 | 13254.949 | 5k >= GDP < 15k |
| Argentina | 2098728 | 51887 | 18933.907 | 15k >= GDP < 50k |
| South Africa | 1510778 | 49784 | 12294.876 | 5k >= GDP < 15k |
| Peru | 1316363 | 46094 | 12236.706 | 5k >= GDP < 15k |
##
## 15k >= GDP < 50k 50k >= GDP < 90k 5k >= GDP < 15k
## 9 1 5
Vaccination
On December 2nd 2020, the Pfizer/BioNTech covid-19 vaccine became the 1st in the world to receive emergency approval, in the UK; days later, Bahrain, Canada, Mexico and the USA followed suit.
The vaccine rollout started slowly but has already reached at least 102 countries:
Europe (36): 28% of the countries are located in Northern Europe and another 28% in Southern Europe
Asia (26): 42% are in Western Asia
Americas (20): 50% are in South America
Africa (8): Northern Africa and Eastern Africa each represent 38%
Oceania (2)
NAs (10) are Andorra, Bermuda, Cayman Islands, Faroe Islands, Gibraltar, Isle of Man, Liechtenstein, Macao, Monaco, Turks and Caicos Islands
| region | sub.region | count |
|---|---|---|
| Africa | Eastern Africa | 3 |
| Africa | Northern Africa | 3 |
| Africa | Southern Africa | 1 |
| Africa | Western Africa | 1 |
| Americas | Caribbean | 3 |
| Americas | Central America | 4 |
| Americas | Northern America | 3 |
| Americas | South America | 10 |
| Asia | Central Asia | 1 |
| Asia | Eastern Asia | 3 |
| Asia | South-Eastern Asia | 4 |
| Asia | Southern Asia | 7 |
| Asia | Western Asia | 11 |
| Europe | Eastern Europe | 9 |
| Europe | Northern Europe | 10 |
| Europe | Southern Europe | 10 |
| Europe | Western Europe | 7 |
| Oceania | Australia and New Zealand | 2 |
| NA | NA | 10 |
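The regional shares quoted above can be reproduced from this table. The base R sketch below re-enters the counts by hand (the data frame name `rollout_counts` is an assumption for illustration, and the 10 territories without a region are left out of the shares) and computes each sub-region's share of its region.

```r
# Counts copied from the table above; the NA row (10 territories) is excluded.
rollout_counts <- data.frame(
  region = c(rep("Africa", 4), rep("Americas", 4), rep("Asia", 5),
             rep("Europe", 4), "Oceania"),
  sub.region = c("Eastern Africa", "Northern Africa", "Southern Africa", "Western Africa",
                 "Caribbean", "Central America", "Northern America", "South America",
                 "Central Asia", "Eastern Asia", "South-Eastern Asia", "Southern Asia",
                 "Western Asia",
                 "Eastern Europe", "Northern Europe", "Southern Europe", "Western Europe",
                 "Australia and New Zealand"),
  count = c(3, 3, 1, 1, 3, 4, 3, 10, 1, 3, 4, 7, 11, 9, 10, 10, 7, 2),
  stringsAsFactors = FALSE
)

# Total per region, repeated on every row of that region.
rollout_counts$region_total <- ave(rollout_counts$count, rollout_counts$region, FUN = sum)

# Share of each sub-region within its region, in percent.
rollout_counts$share_pct <- round(100 * rollout_counts$count / rollout_counts$region_total)
print(rollout_counts)
```

Adding the 10 territories without a region to the 92 countries in this data frame gives the 102 countries mentioned above.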
# Creating the treemap
treemap(n_countries_subregion,
index=c("region","sub.region"),
vSize=c("count"),
vColor=c("count"),
type="value",
range=c(0,12),
#palette=brewer.pal(n=8,"RdYlGn"),
algorithm="pivotSize",
sortID="-size",
palette="RdYlBu",
title="Counts of countries with covid-19 vaccination rollout by sub regions",
title.legend = "Counts of countries",
fontsize.labels=c(0.1,12), # size of labels. Give the size per level of aggregation: size for group, size for subgroup, sub-subgroups...
fontcolor.labels=c("white","black"), # Color of labels
fontface.labels=c(2,1), # Font of labels: 1,2,3,4 for normal, bold, italic, bold-italic...
#bg.labels=c("transparent"), # Background color of labels
align.labels=list(
c("center", "center"),
c("center", "center")
), # Where to place labels in the rectangle?
overlap.labels=0.5, # number between 0 and 1 that determines the tolerance of the overlap between labels. 0 means that labels of lower levels are not printed if higher level labels overlap, 1 means that labels are always printed. In-between values, for instance the default value .5, means that lower level labels are printed if other labels do not overlap with more than .5 times their area size.
inflate.labels=F, # If true, labels are bigger when rectangle is bigger.
border.col=c("black","white"), # Color of borders of groups, of subgroups, of subsubgroups ....
border.lwds=c(5,2) # Width of colors
)
For some countries, such as China, India and the United Arab Emirates, there is no data for 1st doses administered or people fully vaccinated; because of this, when the Top 15 countries are filtered, the tables are inconsistent.
Table of the Top 15 Countries which administered at least one dose of covid-19 vaccine vs. GDP per Capita has a heterogeneous distribution:
1k >= GDP < 5k: Bangladesh representing 7%
5k >= GDP < 15k: India, Brazil and Morocco representing 20%
15k >= GDP < 50k: represents 60%
50k >= GDP < 90k: United States and United Arab Emirates representing 13%
| location | total_cases | total_deaths | people_vaccinated | gdp_per_capita | GDP_category |
|---|---|---|---|---|---|
| United States | 28486394 | 510458 | 47184199 | 54225.446 | 50k >= GDP < 90k |
| United Kingdom | 4166727 | 122303 | 19177555 | 39753.244 | 15k >= GDP < 50k |
| India | 11079979 | 156938 | 11552857 | 6426.674 | 5k >= GDP < 15k |
| Turkey | 2683971 | 28432 | 6745147 | 25129.341 | 15k >= GDP < 50k |
| Brazil | 10455630 | 252835 | 6346769 | 14103.452 | 5k >= GDP < 15k |
| Israel | 770780 | 5697 | 4663028 | 33132.320 | 15k >= GDP < 50k |
| Germany | 2436506 | 69939 | 3881490 | 45229.245 | 15k >= GDP < 50k |
| United Arab Emirates | 375535 | 1145 | 3480415 | 67293.483 | 50k >= GDP < 90k |
| Morocco | 482994 | 8608 | 3327858 | 7485.013 | 5k >= GDP < 15k |
| Chile | 816929 | 20400 | 3289086 | 22767.037 | 15k >= GDP < 50k |
| Bangladesh | 544954 | 8384 | 2850940 | 3523.984 | 1k >= GDP < 5k |
| France | 3746475 | 85734 | 2808490 | 38605.671 | 15k >= GDP < 50k |
| Italy | 2888923 | 97227 | 2696588 | 35220.084 | 15k >= GDP < 50k |
| Spain | 3180212 | 68813 | 2361852 | 34272.360 | 15k >= GDP < 50k |
| Russia | 3968228 | 76873 | 2200000 | 24765.954 | 15k >= GDP < 50k |
##
## 15k >= GDP < 50k 1k >= GDP < 5k 50k >= GDP < 90k 5k >= GDP < 15k
## 9 1 2 3
Table of the Top 15 Countries fully vaccinated against covid-19 vs. GDP per Capita indicated:
5k >= GDP < 15k: India, Brazil and Indonesia representing 20%
15k >= GDP < 50k: represents 67%
50k >= GDP < 90k: United States and United Arab Emirates representing 13%
| location | total_cases | total_deaths | people_vaccinated | people_fully_vaccinated | gdp_per_capita | GDP_category |
|---|---|---|---|---|---|---|
| United States | 28486394 | 510458 | 47184199 | 22613359 | 54225.446 | 50k >= GDP < 90k |
| Israel | 770780 | 5697 | 4663028 | 3294759 | 33132.320 | 15k >= GDP < 50k |
| India | 11079979 | 156938 | 11552857 | 2204083 | 6426.674 | 5k >= GDP < 15k |
| United Arab Emirates | 375535 | 1145 | 3480415 | 2187849 | 67293.483 | 50k >= GDP < 90k |
| Germany | 2436506 | 69939 | 3881490 | 2029047 | 45229.245 | 15k >= GDP < 50k |
| Brazil | 10455630 | 252835 | 6346769 | 1755018 | 14103.452 | 5k >= GDP < 15k |
| Russia | 3968228 | 76873 | 2200000 | 1700000 | 24765.954 | 15k >= GDP < 50k |
| Turkey | 2683971 | 28432 | 6745147 | 1553658 | 25129.341 | 15k >= GDP < 50k |
| France | 3746475 | 85734 | 2808490 | 1490083 | 38605.671 | 15k >= GDP < 50k |
| Italy | 2888923 | 97227 | 2696588 | 1377987 | 35220.084 | 15k >= GDP < 50k |
| Spain | 3180212 | 68813 | 2361852 | 1243783 | 34272.360 | 15k >= GDP < 50k |
| Poland | 1684788 | 43353 | 2101754 | 1168058 | 27216.445 | 15k >= GDP < 50k |
| Indonesia | 1322866 | 35786 | 1583581 | 865870 | 11188.744 | 5k >= GDP < 15k |
| United Kingdom | 4166727 | 122303 | 19177555 | 736037 | 39753.244 | 15k >= GDP < 50k |
| Romania | 795732 | 20233 | 890068 | 615965 | 23313.199 | 15k >= GDP < 50k |
##
## 15k >= GDP < 50k 50k >= GDP < 90k 5k >= GDP < 15k
## 10 2 3
Table of the Top 15 Countries by Total Vaccines Administered against covid-19 vs. GDP per Capita indicated:
5k >= GDP < 15k: India, Brazil and Morocco representing 20%
15k >= GDP < 50k: represents 67%
50k >= GDP < 90k: United States and United Arab Emirates representing 13%
| location | total_cases | total_deaths | people_vaccinated | people_fully_vaccinated | total_vaccinations | gdp_per_capita | GDP_category |
|---|---|---|---|---|---|---|---|
| United States | 28486394 | 510458 | 47184199 | 22613359 | 70454064 | 54225.446 | 50k >= GDP < 90k |
| China | 100475 | 4824 | NA | NA | 40520000 | 15308.712 | 15k >= GDP < 50k |
| United Kingdom | 4166727 | 122303 | 19177555 | 736037 | 19913592 | 39753.244 | 15k >= GDP < 50k |
| India | 11079979 | 156938 | 11552857 | 2204083 | 13756940 | 6426.674 | 5k >= GDP < 15k |
| Turkey | 2683971 | 28432 | 6745147 | 1553658 | 8298805 | 25129.341 | 15k >= GDP < 50k |
| Brazil | 10455630 | 252835 | 6346769 | 1755018 | 8101787 | 14103.452 | 5k >= GDP < 15k |
| Israel | 770780 | 5697 | 4663028 | 3294759 | 7957787 | 33132.320 | 15k >= GDP < 50k |
| United Arab Emirates | 385160 | 1198 | NA | NA | 5933299 | 67293.483 | 50k >= GDP < 90k |
| Germany | 2436506 | 69939 | 3881490 | 2029047 | 5910537 | 45229.245 | 15k >= GDP < 50k |
| France | 3746475 | 85734 | 2808490 | 1490083 | 4298573 | 38605.671 | 15k >= GDP < 50k |
| Italy | 2888923 | 97227 | 2696588 | 1377987 | 4074575 | 35220.084 | 15k >= GDP < 50k |
| Russia | 3968228 | 76873 | 2200000 | 1700000 | 3900000 | 24765.954 | 15k >= GDP < 50k |
| Spain | 3180212 | 68813 | 2361852 | 1243783 | 3605635 | 34272.360 | 15k >= GDP < 50k |
| Morocco | 482994 | 8608 | 3327858 | 96437 | 3424295 | 7485.013 | 5k >= GDP < 15k |
| Chile | 816929 | 20400 | 3289086 | 55941 | 3345027 | 22767.037 | 15k >= GDP < 50k |
##
## 15k >= GDP < 50k 50k >= GDP < 90k 5k >= GDP < 15k
## 10 2 3
Table of the Top 15 GDP per Capita vs. Total single doses demonstrated:
15k >= GDP < 50k: Cayman Islands represents 7% and administered about 21k total single doses.
50k >= GDP < 90k: represents the majority (73%). All countries in this category have started the vaccination rollout except Brunei, San Marino and Hong Kong. In terms of single doses administered relative to the population aged 15 and up, the United Arab Emirates leads with 50% >= Total single doses < 70% (about 6 million doses), followed by the United States with 10% >= Total single doses < 50% (about 70 million doses).
GDP >= 90K: Qatar, Macao and Luxembourg represent 20%. For each nation, total single doses administered amount to less than 10% of the population aged 15 and up. Notably, total cases were below 200K and total deaths below 1k.
| location | gdp_per_capita | GDP_category | total_cases | total_deaths | people_vaccinated | people_fully_vaccinated | total_vaccinations | Pop15Over | Total_vaccination_Age15UP | People_vaccinated_Age15UP | People_Fully_vaccinated_Age15UP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qatar | 116935.60 | GDP >= 90K | 162737 | 257 | NA | NA | 140000 | 2530000 | Total single doses < 10% | Data unavailable | Data unavailable |
| Macao | 104861.85 | GDP >= 90K | NA | NA | NA | NA | 2000 | 562000 | Total single doses < 10% | Data unavailable | Data unavailable |
| Luxembourg | 94277.96 | GDP >= 90K | 55110 | 637 | 26089 | 9982 | 36071 | 537000 | Total single doses < 10% | 1st dose < 10% | Fully vaccinated < 10% |
| Singapore | 85535.38 | 50k >= GDP < 90k | 59913 | 29 | 250000 | 110000 | 360000 | 5078000 | Total single doses < 10% | 1st dose < 10% | Fully vaccinated < 10% |
| Brunei | 71809.25 | 50k >= GDP < 90k | 185 | 3 | NA | NA | NA | 344000 | NA | NA | NA |
| Ireland | 67335.29 | 50k >= GDP < 90k | 218251 | 4300 | 238841 | 134439 | 373280 | 4001000 | Total single doses < 10% | 1st dose < 10% | Fully vaccinated < 10% |
| United Arab Emirates | 67293.48 | 50k >= GDP < 90k | 385160 | 1198 | NA | NA | 5933299 | 8503000 | 50% >= Total single doses < 70% | Data unavailable | Data unavailable |
| Kuwait | 65530.54 | 50k >= GDP < 90k | 189046 | 1072 | 137000 | 38000 | 175000 | 3413000 | Total single doses < 10% | 1st dose < 10% | Fully vaccinated < 10% |
| Norway | 64800.06 | 50k >= GDP < 90k | 70564 | 622 | 318722 | 149622 | 468344 | 4505000 | 10% >= Total single doses < 50% | 1st dose < 10% | Fully vaccinated < 10% |
| Switzerland | 57410.17 | 50k >= GDP < 90k | 554932 | 9961 | 527979 | 220812 | 748791 | 7384000 | 10% >= Total single doses < 50% | 1st dose < 10% | Fully vaccinated < 10% |
| San Marino | 56861.47 | 50k >= GDP < 90k | 3671 | 73 | NA | NA | NA | NA | NA | NA | NA |
| Hong Kong | 56054.92 | 50k >= GDP < 90k | NA | NA | NA | NA | NA | 6621000 | NA | NA | NA |
| United States | 54225.45 | 50k >= GDP < 90k | 28486394 | 510458 | 47184199 | 22613359 | 70454064 | 271447000 | 10% >= Total single doses < 50% | 10% >= 1st dose < 50% | Fully vaccinated < 10% |
| Bermuda | 50669.32 | 50k >= GDP < 90k | NA | NA | 12304 | 4769 | 17073 | NA | Data unavailable | Data unavailable | Data unavailable |
| Cayman Islands | 49903.03 | 15k >= GDP < 50k | NA | NA | NA | NA | 21106 | NA | Data unavailable | Data unavailable | Data unavailable |
##
## 15k >= GDP < 50k 50k >= GDP < 90k GDP >= 90K
## 1 11 3
##
## 10% >= Total single doses < 50% 50% >= Total single doses < 70%
## 3 1
## Data unavailable Total single doses < 10%
## 2 6
Geographical Distribution of the Countries with vaccination rollout vs. Total single doses
The tile grid map of the Countries with vaccination rollout vs. Total single doses shows that most countries are located in the sub-regions of North & South America, Europe and Southern & Western Asia. The countries with the highest number of doses administered:
North America: United States - USA (Single doses >= 50M)
Northern & Western Europe: United Kingdom - GBR (5M >= Single doses < 50M)
Western Asia: Israel - ISR (5M >= Single doses < 50M)
Eastern Asia: China - CHN (5M >= Single doses < 50M)
# tile map of the countries by the counts of single doses administrated
min<- min(vaccine_tile$total_vaccinations.y)
max<- max(vaccine_tile$total_vaccinations.y)
median<- median(vaccine_tile$total_vaccinations.y)
ggplot(vaccine_tile, aes(xmin = x.x, ymin = y.x, xmax = x.x + 1, ymax = y.x + 1, fill = total_vaccinations.y)) +
geom_rect(color = "black") +
mytheme +
theme(plot.caption = element_text(size = 4),
legend.position = "bottom",
legend.title = element_text(size = 6),
legend.text = element_text(size = 6),
legend.key.size = unit(0.5,"line"))+
geom_text(aes(x = x.x, y = y.x, label = iso_code), color = "black", alpha = 0.5, nudge_x = 0.5, nudge_y = -0.5, size = 1.7) +
scale_y_reverse() +
coord_equal()+
labs(caption="Countries such as Liechtenstein, Monaco, West Bank and Gaza, San Marino and small islands are not represented", color="Single Doses")+
scale_fill_gradientn(colours = rcartocolor::carto_pal(name = "TealRose", n = 7), name = "",na.value = NA) +
guides(fill = guide_colourbar(barheight = 0.3, barwidth = 20, direction = "horizontal", ticks = FALSE)) # https://www.katiejolly.io/blog/2019-08-28/nyt-urban-heat
## Warning: Removed 13 rows containing missing values (geom_rect).
## Warning: Removed 13 rows containing missing values (geom_text).
Considering the population aged 15 and up vs. total single doses of the covid-19 vaccine, the data shows that some countries, such as Israel and Seychelles (both under 15k >= GDP < 50k), may soon reach potential herd immunity by vaccinating more than 70% of the population aged 15 and up. Note: considering only the 1st dose administered, each nation falls under 50% >= 1st dose < 70%; considering the fully vaccinated category, each nation is under 10% >= Fully vaccinated < 50%.
| location | GDP_category | total_cases | total_deaths | people_vaccinated | people_fully_vaccinated | total_vaccinations | Pop15Over | Total_vaccination_Age15UP | People_vaccinated_Age15UP | People_Fully_vaccinated_Age15UP |
|---|---|---|---|---|---|---|---|---|---|---|
| United Arab Emirates | 50k >= GDP < 90k | 385160 | 1198 | NA | NA | 5933299 | 8503000 | 50% >= Total single doses < 70% | Data unavailable | Data unavailable |
| Israel | 15k >= GDP < 50k | 770780 | 5697 | 4663028 | 3294759 | 7957787 | 6751000 | Potential herd immunity | 50% >= 1st dose < 70% | 10% >= Fully vaccinated < 50% |
| Seychelles | 15k >= GDP < 50k | 2592 | 11 | 51577 | 23519 | 75096 | 75300 | Potential herd immunity | 50% >= 1st dose < 70% | 10% >= Fully vaccinated < 50% |
In terms of the percentage of the population aged 15 and up that has been vaccinated, Israel - ISR and Seychelles - SYC are leading with Total single doses > 70% (potential herd immunity), followed by the United Arab Emirates - ARE (50% >= Total single doses < 70%).
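The banding behind these three rows can be checked with a few lines of base R. The sketch below takes `total_vaccinations` and `Pop15Over` from the table above and assigns each ratio to a band with `findInterval()`; this is an alternative formulation for illustration, with the thresholds and labels copied from the report's categories.

```r
# Values copied from the table above.
doses <- data.frame(
  location           = c("United Arab Emirates", "Israel", "Seychelles"),
  total_vaccinations = c(5933299, 7957787, 75096),
  Pop15Over          = c(8503000, 6751000, 75300),
  stringsAsFactors = FALSE
)

# Doses administered per person aged 15 and up.
doses$ratio <- doses$total_vaccinations / doses$Pop15Over

# Band labels in increasing order; thresholds sit at 10%, 50% and 70%.
band_labels <- c("Total single doses < 10%",
                 "10% >= Total single doses < 50%",
                 "50% >= Total single doses < 70%",
                 "Potential herd immunity")
doses$band <- band_labels[findInterval(doses$ratio, c(0.1, 0.5, 0.7)) + 1]
print(doses[, c("location", "ratio", "band")])
```

Because the ratio counts doses rather than people, it can exceed 1, as it does for Israel; the 1st dose and fully vaccinated columns give the per-person view.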
3. Conclusion
This project used longitudinal data (covid-19), which changes over time, so it was challenging to validate the data. I also encountered days on which the covid-19 package had an issue with its server. In other instances, data was missing or was not adequately updated.
When the time zone changed from one day to another, I frequently noticed that the covid-19 cases and deaths were updated but there was a lag in updating the vaccination information (or vice versa). When I tried to retrieve a subset with the most recent date, the data was not complete. My solution to this temporal data issue was to create dataframes with the most recent data for each variable published (total cases, total deaths, and vaccination information) and join these subsets, selecting just the variables that I would plot and that were updated.
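A minimal base R sketch of this idea follows. It is a variation for illustration, not the code used in the report: the helper `latest_value` is hypothetical, the country codes and figures are made up, and "most recent" is taken to mean the latest date on which the variable is not missing.

```r
# Made-up snapshots: cases and vaccinations are not updated on the same days.
snapshots <- data.frame(
  iso_code = rep(c("AAA", "BBB"), each = 3),
  date     = rep(as.Date(c("2021-02-25", "2021-02-26", "2021-02-27")), times = 2),
  total_cases        = c(100, 110, 120, 50, 55, NA),
  total_vaccinations = c(10, 12, NA, NA, 7, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: latest non-missing value of one variable per country.
latest_value <- function(df, variable) {
  known <- df[!is.na(df[[variable]]), c("iso_code", "date", variable)]
  known <- known[order(known$iso_code, known$date), ]
  latest <- known[!duplicated(known$iso_code, fromLast = TRUE), ]
  names(latest)[names(latest) == "date"] <- paste0(variable, "_date")
  latest
}

# One subset per variable, then join the subsets by country.
latest_cases <- latest_value(snapshots, "total_cases")
latest_vacc  <- latest_value(snapshots, "total_vaccinations")
combined <- merge(latest_cases, latest_vacc, by = "iso_code", all = TRUE)
print(combined)
```

Keeping a separate date column per variable makes it visible when the joined values come from different days.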
I also noticed a few changes in the covid-19 data structure itself (e.g., new columns were added), probably to accommodate the reality, primarily related to the vaccination information. Some countries track vaccinations administered by people who received the 1st dose and people fully vaccinated. Others track only the total single doses, making some analyses inconsistent.
There are many other geographical and socio-economic questions that I am curious about, and this is just the beginning of my insights.
4. Reference
Bendix. To know the real number of coronavirus cases in the US, China, or Italy, researchers say multiply by 10. Accessed: Apr 19, 2020, 12:50 PM. https://www.businessinsider.com/real-number-of-coronavirus-cases-underreported-us-china-italy-2020-4
Hale T, Phillips T, Petherick A, Kira B, Angrist N, Aymar K, et al. Risk of Openness Index: when do government responses need to be increased or maintained? [Internet]. Version 2.0. Oxford:Blavatnik School of Government; 2020 [cited 2020 Oct 21] https://www.publichealthontario.ca/-/media/documents/ncov/research/2020/10/research-hale-risk-of-openness-index.pdf?la=en based on https://www.bsg.ox.ac.uk/sites/default/files/2020-10/10-2020-Risk-of-Openness-Index-BSG-Research-Note.pdf
Hasell, J., Mathieu, E., Beltekian, D. et al. A cross-country database of COVID-19 testing. Sci Data 7, 345 (2020). https://doi.org/10.1038/s41597-020-00688-8
HDI https://ec.europa.eu/environment/beyond_gdp/download/factsheets/bgdp-ve-hdi.pdf
HDI wikipedia 2019 https://en.wikipedia.org/wiki/Human_Development_Index
Tile grid maps: https://www.maartenlambrechts.com/2017/10/22/tutorial-a-worldtilegrid-with-ggplot2.html, https://github.com/ishaberry/Covid19Canada, https://github.com/kaerosen/tilemaps
Naïve Bayes Classifier - https://uc-r.github.io/naive_bayes
Mohammed, R.A. Longitudinal Data Integration for a Tracking System for Health Professionals. Masters Thesis. UNIVERSITY OF NEW BRUNSWICK, 2016