Summary
I decided to use 5 clusters.
Cluster 1 comprises the countries with the most total cases and deaths per capita. Despite this, it is by far the group that has had the best vaccination policy. Some of the countries that belong to this group are Italy, the United States, Sweden, and Chile.
Cluster 2 is quite diverse in its composition. There are countries with low mortality and cases, such as New Zealand and Laos, and others with many cases and deaths, such as Bolivia and Andorra. What is particular about this group is that the vaccination and testing policy has not been very good.
Cluster 3 contains countries with many cases and total deaths, although not as many as cluster 1. However, it is the group that as of September 2021 has the most cases and deaths per capita. When it comes to testing it has been by far the worst of all clusters. It is also the group with the most stringent mobility restrictions. It contains countries such as Peru, Mexico, Brazil and the Philippines.
Cluster 4 consists of countries that had very few COVID cases and low mortality. It is composed mainly of African countries, although there are also countries like Haiti, Afghanistan and Syria. This is also the group with the lowest vaccination rate.
Cluster 5 includes the countries that have been the least affected by COVID. Most of the countries of Oceania are in this group: Vanuatu, Palau, Solomon Islands, Micronesia, etc. It also contains other small countries and territories such as Bhutan, the Vatican and Hong Kong. Furthermore, it is the group with the highest vaccination rate during the month of September 2021.
From the Principal Component Analysis (PCA) I found that PC1 explains 48.29% of the variation in available data, while PC2 explains 16.32%.
The countries that have a high score in PC1 are those that have had a lot of infections, deaths, hospitalizations and a good vaccination policy. But at the same time they have also had very few tests per case. These countries include: Israel, the United States, the United Kingdom, and Serbia.
The countries that have a high score in PC2 are those that have had few infections and deaths, and a good vaccination policy. They have also had a very low positive rate and many tests per case. Examples of these countries are Hong Kong, Iceland, Palau, Bhutan, and Samoa.
Intro
Some time ago, I came across the COVID-19 data from ourworldindata.org. I decided to download it and do some analysis. My goal is to analyze the current (September 2021) situation of COVID in the world. To that end, I am going to use clustering to sort the countries into different groups. The data can be downloaded from here: https://ourworldindata.org/coronavirus. For the analysis I will use the data between September 1 and September 27, 2021. Additionally, I will use Principal Component Analysis (PCA) to analyze the data.
Variables
Here are the variables that I’ll use in this analysis. The original database has more variables, which are documented here.
| Variable | Description |
| --- | --- |
| total_cases_per_million | Total confirmed cases of COVID-19 per 1,000,000 people |
| new_cases_per_million | New confirmed cases of COVID-19 per 1,000,000 people |
| total_deaths_per_million | Total deaths attributed to COVID-19 per 1,000,000 people |
| new_deaths_per_million | New deaths attributed to COVID-19 per 1,000,000 people |
| icu_patients_per_million | Number of COVID-19 patients in intensive care units (ICUs) on a given day per 1,000,000 people |
| weekly_hosp_admissions_per_million | Number of COVID-19 patients newly admitted to hospitals in a given week per 1,000,000 people |
| stringency_index | Government Response Stringency Index: composite measure based on 9 response indicators including school closures, workplace closures, and travel bans, rescaled to a value from 0 to 100 (100 = strictest response) |
| reproduction_rate | Real-time estimate of the effective reproduction rate (R) of COVID-19. See https://github.com/crondonm/TrackingR/tree/main/Estimates-Database |
| new_tests_per_thousand | New tests for COVID-19 per 1,000 people |
| positive_rate | The share of COVID-19 tests that are positive, given as a rolling 7-day average (this is the inverse of tests_per_case) |
| tests_per_case | Tests conducted per new confirmed case of COVID-19, given as a rolling 7-day average (this is the inverse of positive_rate) |
| total_vaccinations_per_hundred | Total number of COVID-19 vaccination doses administered per 100 people in the total population |
| people_vaccinated_per_hundred | Total number of people who received at least one vaccine dose per 100 people in the total population |
| people_fully_vaccinated_per_hundred | Total number of people who received all doses prescribed by the vaccination protocol per 100 people in the total population |
| total_boosters_per_hundred | Total number of COVID-19 vaccination booster doses administered per 100 people in the total population |
| new_vaccinations_per_million | New COVID-19 vaccination doses administered per 1,000,000 people in the total population |
From the original dataset I decided to keep all the variables that are normalized by each country's population. I did not keep the variables related to excess mortality, because they had many missing values or were not very up to date.
Variable Treatment
For the variables new_cases_per_million, new_deaths_per_million, new_tests_per_thousand and new_vaccinations_per_million I summed all the values reported in September 2021. For stringency_index, tests_per_case, icu_patients_per_million, weekly_hosp_admissions_per_million and reproduction_rate I calculated the mean of the values for September 2021. For the rest of the variables, the maximum value reported in the data was used.
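These aggregation rules can be sketched in Python with pandas. This is only an illustration under assumptions: the tiny data frame is made up, the column names `location` and `date` are assumed to match the downloaded file, and taking the maximum inside the September window (rather than over the whole file) is my simplification.

```python
import pandas as pd

# Made-up mini-sample in long format: one row per country and day.
raw = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2021-08-31", "2021-09-01", "2021-09-02"] * 2),
    "new_cases_per_million": [5.0, 10.0, 20.0, 1.0, 2.0, 3.0],
    "stringency_index": [40.0, 50.0, 60.0, 30.0, 30.0, 40.0],
    "total_cases_per_million": [1000.0, 1010.0, 1030.0, 200.0, 202.0, 205.0],
})

# Keep only the analysis window (September 1 to September 27, 2021).
in_window = raw["date"].between(pd.Timestamp("2021-09-01"), pd.Timestamp("2021-09-27"))
window = raw[in_window]

# One rule per variable: sum for daily flows, mean for levels,
# max for cumulative totals.
rules = {
    "new_cases_per_million": "sum",
    "stringency_index": "mean",
    "total_cases_per_million": "max",
}
per_country = window.groupby("location").agg(rules)
print(per_country)
```

The result has one row per country, which is the shape needed for the clustering later on.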
The data contained many missing values; therefore, I decided to impute the variables that had fewer than 10 missing values with the median. For the rest of the variables, the imputation was done in two steps. First, I used a random forest with 100 trees to identify which variables were related to each other. Then, I used those variables to impute the missing data with the k-nearest neighbors (knn) method, with k = 10. Finally, a logarithmic transformation was applied to the data and then the variables were scaled.
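A rough Python equivalent of this pipeline, using scikit-learn, might look like the sketch below. It is not the exact procedure described above: the random forest step that selects related variables is skipped (all columns are used to find neighbors), `log1p` is assumed as the logarithmic transformation so that zeros stay finite, and the function name `preprocess` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler


def preprocess(df, median_threshold=10, k=10):
    """Impute, log-transform and standardize a numeric country-by-variable table."""
    out = df.copy()
    n_missing = out.isna().sum()

    # Columns with only a few gaps (fewer than `median_threshold`) get the median.
    few_gaps = list(n_missing[(n_missing > 0) & (n_missing < median_threshold)].index)
    if few_gaps:
        out[few_gaps] = SimpleImputer(strategy="median").fit_transform(out[few_gaps])

    # Remaining gaps are filled from the k most similar rows.
    # Assumption: every column is used to measure similarity.
    filled = KNNImputer(n_neighbors=k).fit_transform(out)

    # Assumption: log(1 + x), so that zeros (e.g. booster doses) stay finite.
    logged = np.log1p(filled)

    # Standardize every column to mean 0 and standard deviation 1.
    scaled = StandardScaler().fit_transform(logged)
    return pd.DataFrame(scaled, index=df.index, columns=df.columns)


# Made-up example: 30 "countries", 3 non-negative variables, with gaps.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.gamma(2.0, 50.0, size=(30, 3)), columns=["a", "b", "c"])
demo.loc[[0, 1], "a"] = np.nan               # 2 gaps: median imputation
demo.loc[list(range(5, 17)), "b"] = np.nan   # 12 gaps: KNN imputation
clean = preprocess(demo)
print(clean.describe().round(2))
```

Scaling after the log transform matters here because distance-based methods such as clustering and PCA are sensitive to the units of each variable.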
Exploratory Analysis
I’m going to start by plotting a correlation matrix of the variables.

If one analyzes this correlation matrix, the number of total cases is related to the number of total deaths, hospitalizations, ICU patients and, to a lesser extent, to the number of tests and people vaccinated. On the other hand, the number of deaths per capita is also connected to the number of hospitalizations and ICU patients. To a lesser extent it is linked to the positivity rate and is negatively associated with the number of tests per case.
The number of ICU patients is obviously related to the number of hospital admissions. However, the coefficient is also high for the total tests per capita and for the variables linked to vaccination, which is intriguing, since one would expect that countries with more vaccinations would have fewer ICU patients. Finally, hospitalizations are only connected to the number of total tests.
Another thing that can be noticed is that the tests per case have a negative relationship with the positive rate. All variables associated with vaccination are related to each other. Lastly, I find it interesting that the stringency index does not have any kind of relationship with any other variable. Let’s now look at the distribution of the variables with a boxplot.

What this plot shows is that many variables have outliers. Another interesting thing is the distribution of the variable total_boosters_per_hundred. Most of the values are 0 since there are very few countries that are using booster doses.
Analysis
I used two methods to decide the number of clusters. Let’s begin with the silhouette analysis.

Looking at the results it seems that the optimal number of clusters is 2. However, I think that separating the world in two is perhaps not a good idea. Now we are going to do an elbow analysis.

I don’t see a major “elbow” but I’d say that k = 5 looks pretty reasonable.
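For readers who want to reproduce this kind of comparison, here is a minimal Python sketch with scikit-learn. It assumes k-means as the clustering algorithm (the algorithm is not named above), and it runs on synthetic blobs rather than on the COVID data; the helper `scan_k` is a hypothetical name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def scan_k(X, k_values=range(2, 9), seed=0):
    """Return {k: (inertia, mean silhouette)} for each candidate number of clusters."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the within-cluster sum of squares used in the elbow plot.
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


# Synthetic stand-in for the scaled country table: three well-separated groups.
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
results = scan_k(X)
for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:10.1f}  silhouette={sil:.3f}")

# The silhouette criterion picks the k with the highest mean silhouette.
best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

On real data the two criteria can disagree, as they do here, so the final choice of k remains a judgment call.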
Using these 2 axes (cases and total deaths per million) it seems that the clusters are not clearly separated. However, it seems that clusters 1 and 3 belong to countries more affected by COVID than clusters 4 and 5. A clearer way to visualize the clusters is by using Principal Component Analysis:
In this plot the clusters are clearly separated. PC1 explains 48.29% of the variation in our data, while PC2 explains 16.32%. To better understand what each dimension means, let’s look at a graph that shows the most important variables for each one.

The countries that have a high PC1 score are those that have had many cases, deaths, hospitalizations and, at the same time, an effective vaccination policy. But, in addition, they have also had very few tests per case. These countries include: Israel, United States, United Kingdom, and Serbia.

The countries that have a high score in PC2 are those that have had the fewest cases and deaths. They have also had a very low positive rate, which may be associated with the fact that these countries have many tests per case. Finally, like the countries that have high PC1 scores, they have also had a good vaccination policy. Examples of these countries are Hong Kong, Iceland, Palau, Bhutan, and Samoa.
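The quantities discussed in this section (share of variance per component, variable contributions, and country scores) can be obtained along these lines in Python with scikit-learn. This is a sketch on made-up data with hypothetical column names, not the code behind the plots above; `pca_summary` is a name I introduce for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_summary(X, n_components=2):
    """Fit a PCA and return explained variance ratios, loadings and scores."""
    pca = PCA(n_components=n_components).fit(X)
    names = [f"PC{i + 1}" for i in range(n_components)]
    # Loadings: weight of each original variable in each component.
    loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=names)
    # Scores: position of each row (country) on each component.
    scores = pd.DataFrame(pca.transform(X), index=X.index, columns=names)
    return pca.explained_variance_ratio_, loadings, scores


# Made-up data: two strongly related variables and one unrelated variable.
rng = np.random.default_rng(1)
base = rng.normal(size=100)
demo = pd.DataFrame({
    "cases": base + rng.normal(scale=0.1, size=100),
    "deaths": base + rng.normal(scale=0.1, size=100),
    "stringency": rng.normal(size=100),
})
ratio, loadings, scores = pca_summary(demo)
print("explained variance ratio:", ratio.round(4))
print(loadings.round(2))
print(scores["PC1"].sort_values(ascending=False).head())
```

The sign of a principal component is arbitrary, so "high score" only has meaning once the loadings have been inspected.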
Clusters Characterization
Let’s look at the variable values per cluster to better understand them.
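One simple way to build such a per-cluster profile in Python is a grouped median plus a ranking across clusters. The numbers and cluster labels below are invented for illustration, and using the median (rather than the mean) is my assumption.

```python
import pandas as pd

# Invented values for six hypothetical countries.
features = pd.DataFrame(
    {
        "total_deaths_per_million": [2000.0, 1800.0, 50.0, 70.0, 900.0, 1100.0],
        "people_fully_vaccinated_per_hundred": [60.0, 70.0, 5.0, 3.0, 30.0, 40.0],
    },
    index=["c1", "c2", "c3", "c4", "c5", "c6"],
)
labels = pd.Series([1, 1, 4, 4, 3, 3], index=features.index, name="cluster")

# Typical value of every variable inside each cluster.
profile = features.groupby(labels).median()

# Rank clusters per variable: 1 = highest median.
ranks = profile.rank(ascending=False, method="min").astype(int)
print(profile)
print(ranks)
```

Statements such as "the group with the second-fewest tests" correspond to reading one column of the rank table.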
Cluster 1 comprises the countries that have had the most total cases and deaths per capita throughout the pandemic. During September 2021, they had the highest number of ICU patients and tests per population. Furthermore, they are by far the group with the best vaccination policy: almost all the countries where booster doses have been administered are in this cluster. However, the same is not true of new vaccinations in September 2021, where the only group that does worse is cluster 4. One reason may be that the countries of cluster 1, having already vaccinated much of their population, have less room to increase the number of vaccines per inhabitant. Some of the countries in this group are Italy, the United States, Sweden and Chile.
Cluster 2 is quite diverse in its composition. There are countries with low mortality and cases, such as New Zealand and Laos, and others with many cases and deaths, such as Bolivia and Andorra. It is the group with the second-fewest tests per case, total tests per thousand inhabitants and vaccinations per inhabitant. Furthermore, during September 2021 it was the group with the second most stringent mobility restrictions.
Cluster 3 has countries with many cases and total deaths, although not as many as those of cluster 1. However, it is the group that currently (September 2021) has the most cases and deaths per capita, which also makes it the cluster with the highest reproduction rate and hospitalizations per inhabitant. In addition, it has been by far the worst of all in tests per case. Finally, it is the group with the most stringent mobility restrictions. It contains countries such as Peru, Mexico, Brazil and the Philippines.
Cluster 4 consists of countries that had very few infections and low mortality. Most of the African countries are in this cluster, although there are also countries like Haiti, Afghanistan and Syria. Even if they are not the cluster with the fewest cases and deaths, they are the group with the fewest ICU patients and weekly hospital admissions per million during the month of September 2021. They are also the ones with the fewest tests per thousand and vaccinated people per population. This may be due in large part to the low level of development of these countries. In any case, we must be aware that these countries may not have the necessary infrastructure to obtain reliable numbers. Even so, last year some people were surprised by the low mortality of COVID in Africa, as can be seen in this article: https://www.bbc.com/news/world-africa-54418613
As can be seen in the graph below, the mortality of the virus in Africa has been lower than in the world throughout the pandemic. When we make the comparison with the European Union, the differences are gigantic, which surprises us even more due to the differences in development that exist in these areas.
Cluster 5 consists of the countries that have been the least affected by COVID. Most of the countries of Oceania are here: Vanuatu, Palau, Solomon Islands, Micronesia, etc. It also contains other small countries and territories such as Bhutan, the Vatican and Hong Kong. Two members of this group stand out because they are very different from the rest: China and Taiwan. This is not the group with the fewest hospitalizations and ICU patients, since cluster 4 has marginally fewer. Furthermore, during the month of September 2021 it was the cluster with the most vaccinated people per million inhabitants. Finally, it is the group with the second least stringent mobility restrictions, after cluster 4.
Final Thoughts
It must be kept in mind that I chose 5 clusters in a totally arbitrary way. It could have been more or fewer. Therefore, there is no reason to think that the 5 clusters are something definitive that cannot be changed or questioned. Having said this, three things surprised me during this analysis: the small impact that COVID has had in Africa, the few cases and deaths in China, and how badly many high-income countries were affected, the only exceptions being Australia, New Zealand and South Korea. On the first point, the possible explanations are varied, including the median age of the population and the low population density in Africa; some hypotheses are explored in this link.
About the cases and deaths in China, there are also many theories. One is that, since the virus appeared so early in that country, it had not yet been identified or named and there was no way of testing for it, so there is a good chance that the true cases and deaths are much higher. Another explanation is that the political response was very good in China, especially the strict lockdowns. About the last point, I think it is interesting to analyze why certain developed countries had fewer cases and deaths. Is it due to a public policy or a demographic factor?
Clustering is generally used for customer segmentation, targeted marketing or recommendation systems. I think that using it to find patterns in a group of territorial units can be very important for public policy, since it can help to plan different types of interventions and policies depending on the characteristics that each cluster has.
