Appendix 1: Explore Entities with No Vaccine Numbers to Date
At least two possible factors may contribute to the complete absence of vaccine data in these entities:
Necessity: Compared to entities with fewer cases, entities with more cases are more likely to be motivated to roll out vaccines and to report the progress closely.
Ability: An entity needs enough resources to administer a vaccine rollout and record the relevant data.
The Necessity Hypothesis: Vaccine and Total Cases
We have identified entities that have no vaccine numbers at all to date. Let’s look at their case numbers to check the necessity hypothesis. Because entities differ in how frequently vaccine data is updated, comparisons for each of the last five days are shown to give a more comprehensive picture.
library(ggpubr)
all %>%
#Keep the five days before the most recent date for comparison
filter(Date>=(max(Date)-5),
Date<=(max(Date)-1)) %>%
mutate(VaccineData=if_else(Entity %in% NoVaccine$Entity, "None","Some")) %>%
ggplot(aes(x=VaccineData,y=total_cases_per_million,color=VaccineData))+
geom_boxplot(outlier.alpha=0)+
geom_jitter(alpha=0.5,
width=0.5,
height=0.1)+
facet_wrap(~Date,nrow=2)+
scale_color_brewer("Vaccine Data",
#,labels=c("Available","Missing")
palette="Dark2")+
stat_compare_means(method = "t.test", vjust=1,
label = "p.signif"
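#custom labels failed earlier, most likely because the cutpoints were not strictly increasing (0.05 was followed by 0.01 rather than 0.1); corrected version kept for reference:
#symnum.args = list(cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("p<0.0001", "p<0.001", "p<0.01", "p<0.05","p<0.1", "ns"))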
)+
theme_minimal()+
scale_y_continuous(labels=label_number_si(accuracy=1))+
labs(title="Total Cases Per Million & Missing Vaccine Data",
subtitle=Subtitle,
x=" ",
y=" ",
caption="*Showing significance level of Welch's two sample t-test")+
theme(axis.text.x=element_blank(),
plot.subtitle=element_text(hjust=1,color="grey50"))

ggsave("Graphs/Cases in entities with no vaccine data.PNG",height=5.06,width=9,dpi=300,limitsize=FALSE)
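The significance labels in the plot above come from a Welch two-sample t-test run separately for each date. As a minimal sketch of what happens for one date, the base R code below runs the same kind of test on made-up numbers and maps the p-value to a label with symnum. The data frame toy and its values are assumptions for illustration only, and the cutpoints and symbols shown are the conventional star labels, not necessarily the ones a given ggpubr version uses internally.

```r
# Made-up numbers for illustration only; not taken from the project data
set.seed(1)
toy <- data.frame(
  VaccineData = rep(c("None", "Some"), times = c(15, 60)),
  cases = c(rnorm(15, mean = 2000, sd = 1500),
            rnorm(60, mean = 30000, sd = 20000))
)

# t.test() defaults to var.equal = FALSE, which is Welch's test
welch <- t.test(cases ~ VaccineData, data = toy)

# Map the p-value to a significance label.
# The cutpoints must be strictly increasing and run from 0 to 1,
# with one symbol per interval.
signif_label <- symnum(welch$p.value, corr = FALSE, na = FALSE,
                       cutpoints = c(0, 0.0001, 0.001, 0.01, 0.05, 1),
                       symbols = c("****", "***", "**", "*", "ns"))

print(welch$method)
print(signif_label)
```

Because Welch's test does not assume equal variances, it is a reasonable default here, where the group with no vaccine data is much smaller and less spread out than the other group.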
Consistent with the above discussion, entities with no vaccine data have far fewer confirmed cases per capita. This result, however, may need to be viewed in light of these entities’ ability to run tests and identify any cases, which brings us to the next section.
The Ability Hypothesis: Vaccines and GDP
Also consistent with the above discussion, entities with no vaccine data have lower GDP per capita. It is possible that lack of resources may also contribute to fewer tests, hence the lower case numbers.
#GDP and Missing on Vaccine
all %>%
#Keep the five days before the most recent date for comparison
filter(Date>=(max(Date)-5),
Date<=(max(Date)-1)) %>%
mutate(VaccineData=if_else(Entity %in% NoVaccine$Entity, "None","Some")) %>%
ggplot(aes(x=VaccineData,y=GDP_per_capita,color=VaccineData))+
geom_boxplot(outlier.alpha=0)+
geom_jitter(alpha=0.5,
width=0.5,
height=0.1)+
facet_wrap(~Date,nrow=2)+
scale_color_brewer("Vaccine Data",
palette="Dark2")+
stat_compare_means(method = "t.test",vjust=1,
#show significance level
label = "p.signif")+
theme_minimal()+
scale_y_continuous(labels=label_number_si(accuracy=1))+
labs(title="GDP Per Capita & Missing Vaccine Data",
subtitle=Subtitle,
x=" ",
y=" ",
caption="*Showing significance level of Welch's two sample t-test")+
theme(axis.text.x=element_blank(),
plot.subtitle=element_text(hjust=1,color="grey50"))

ggsave("Graphs/GDP of entities with no vaccine data.PNG",height=5.06,width=9,dpi=300,limitsize=FALSE)
Location of the Entities
Finally, we can also locate these entities on the map. One may even infer the actual Covid-19 situations in these entities by looking at their neighbors. (Tempting as it is to continue with this line of analysis, I’d better leave it here as the assignment will be due in 2 days and I have only finished its section 2 about preparing the data :P)
library(rworldmap)
Vac_Drop_Review <- all %>%
mutate(VaccineData=if_else(Entity %in% NoVaccine$Entity, "None","Some")) %>%
#choose key variables for comparison
select(VaccineData,Entity,iso_code)
joinData <- joinCountryData2Map( Vac_Drop_Review,
joinCode = "ISO3",
nameJoinColumn = "iso_code")
## 85351 codes from your data successfully matched countries in the map
## 977 codes from your data failed to match with a country code in the map
## 54 codes from the map weren't represented in your data
#highlight entities dropped
mapParams <- mapCountryData( joinData
, mapTitle="Locations of Entities with None versus Some Vaccine Data"
, nameColumnToPlot="VaccineData"
, addLegend=TRUE
, missingCountryCol="white"
, oceanCol="lightblue")

Appendix 2: Discussion of the Imputation Methods
As discussed above, the imputation method used in this project is na_interpolation from the imputeTS package. The following graph shows that although the gaps between known vaccine numbers are filled in smoothly, the imputed values drop dramatically in the few days after the last known value. This is understandable because na_interpolation is an imputation method rather than a forecasting method. Therefore, missing values in this analysis were imputed only between the first and last day of known vaccine numbers for each entity.
Vac_impute <-Vac %>%
select(Date,Entity,VaccinationP100) %>%
group_by(Entity) %>%
na_interpolation(option="linear") %>%
left_join(Vac,by=c("Date","Entity")) %>%
rename(Imputed=VaccinationP100.x
, Original=VaccinationP100.y)
p1=Vac_impute %>%
pivot_longer(col=c("Imputed","Original"),names_to="ValueType",values_to="Vaccination") %>%
filter(Entity=="Luxembourg") %>%
ggplot(aes(x=Date,y=Vaccination,color=ValueType))+
geom_line()+
labs(subtitle="Luxembourg: Many small gaps",
x="",
y="")
p2=Vac_impute %>%
pivot_longer(col=c("Imputed","Original"),names_to="ValueType",values_to="Vaccination") %>%
filter(Entity=="Albania") %>%
ggplot(aes(x=Date,y=Vaccination,color=ValueType))+
geom_line()+
labs(subtitle="Albania: Large and small gaps",
x="",
y="")
gridExtra::grid.arrange(p1,p2
,nrow=2
#textGrob() and gpar() come from the grid package
,top = grid::textGrob("Interpolation imputation fills in the gaps well but is not suitable for forecasting",
gp = grid::gpar(fontsize = 12, font = 2)))
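The restriction described above, filling gaps only between the first and last known value of each entity, can be sketched in base R with approx. This is an illustrative stand-in, not the project's actual code: the helper interpolate_inside and the toy data frame are hypothetical names, and the project itself used na_interpolation from imputeTS.

```r
# Hypothetical helper: linear interpolation limited to the span of known values.
# approx(rule = 1) returns NA outside the range of the known points, so leading
# and trailing NAs are left untouched rather than extrapolated.
interpolate_inside <- function(x) {
  known <- which(!is.na(x))
  if (length(known) < 2) return(x)  # nothing to interpolate between
  approx(x = known, y = x[known], xout = seq_along(x), rule = 1)$y
}

# Made-up series for two entities, already sorted by date within each entity
toy <- data.frame(
  Entity = rep(c("A", "B"), each = 6),
  VaccinationP100 = c(NA, 1, NA, NA, 4, NA,
                      2, NA, 6, NA, NA, NA)
)

# ave() applies the helper separately within each entity
toy$Imputed <- ave(toy$VaccinationP100, toy$Entity, FUN = interpolate_inside)
print(toy)
```

In this sketch the gaps inside each entity's reporting window are filled along a straight line, while the days before the first report and after the last report stay missing.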

Also of interest is a comparison between different imputation methods. Below is a comparison between na_ma and na_interpolation. Results from the latter can be much smoother and are likely closer to how vaccine numbers increase gradually in an entity.
##The performance of imputation algorithms differs depending on the data available. To demonstrate the "dip" in na_ma compared with na_interpolation, choose a subset of the data where it shows.
#imputation with ma (moving average)
Vac_impute_ma <-Vac %>%
filter(Date<="2021-05-08") %>%
select(Date,Entity,VaccinationP100) %>%
group_by(Entity) %>%
na_ma(k=7,weighting="linear") %>%
left_join(Vac,by=c("Date","Entity")) %>%
rename(Imputed=VaccinationP100.x
, Original=VaccinationP100.y)
#Choose an entity for demonstration
#Graph 1: Performance of ma imputation
p3=Vac_impute_ma %>%
pivot_longer(col=c("Imputed","Original"),names_to="ValueType",values_to="Vaccination") %>%
filter(Entity=="Luxembourg") %>%
ggplot(aes(x=Date,y=Vaccination,color=ValueType))+
geom_line()+
labs(subtitle="na_ma (weighting=linear)",
x="",
y="")
#Graph 2: Performance of interpolation imputation
p4=Vac_impute %>%
filter(Date<="2021-05-08") %>%
pivot_longer(col=c("Imputed","Original"),names_to="ValueType",values_to="Vaccination") %>%
filter(Entity=="Luxembourg") %>%
ggplot(aes(x=Date,y=Vaccination,color=ValueType))+
geom_line()+
labs(subtitle="na_interpolation (option=linear)",
x="",
y="")
gridExtra::grid.arrange(p3,p4
,nrow=2
#textGrob() and gpar() come from the grid package
,top = grid::textGrob("Comparing moving average (ma) and interpolation imputation",
gp = grid::gpar(fontsize = 12, font = 2))
)
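To see why a moving average can produce a "dip" on a steadily increasing series, here is a minimal base R sketch on a made-up series. The function impute_ma_simple is a hypothetical, simplified unweighted moving average and is not the algorithm behind na_ma, which weights neighbours and adjusts its window; the sketch only illustrates the general effect.

```r
# Made-up cumulative series with a three-day gap
x <- c(1, 2, 3, NA, NA, NA, 7, 8, 9)

# Hypothetical simplified moving average: each NA is replaced by the mean of
# the known values within k positions on either side.
# (Returns NaN if no known value falls inside the window.)
impute_ma_simple <- function(x, k = 2) {
  out <- x
  for (i in which(is.na(x))) {
    window <- x[max(1, i - k):min(length(x), i + k)]
    out[i] <- mean(window, na.rm = TRUE)
  }
  out
}

# Linear interpolation between the known points, for comparison
known <- which(!is.na(x))
linear <- approx(x = known, y = x[known], xout = seq_along(x))$y

ma <- impute_ma_simple(x, k = 2)
print(rbind(ma = ma, linear = linear))
```

The first imputed value in the moving average result is an average of earlier observations only, so it falls below the last known value, whereas the linear version rises steadily across the gap.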

The above graph is based on data for Luxembourg. The following table lets you check and compare the performance of these two imputation methods for all entities of interest.
CheckImputation<-Vac_impute %>%
select(Entity,Date,Original,InterpolationImputed=Imputed) %>%
#join on explicit keys so the match does not depend on the Original column
left_join(select(Vac_impute_ma,Entity,Date,MaImputed=Imputed),by=c("Entity","Date"))
#datatable() comes from the DT package
DT::datatable(CheckImputation)