52414 - Lab 1: Basic Data Wrangling and Plotting, Distributions

Analysis of the World Covid-19 Dataset - Solutions:

1. Loading data and basic processing:

(a) Loading the csv and casting the date column to the date type

We noticed immediately that loading the data-set from the url takes some time, therefore we saved it locally in our repository and saved running time. We accepted the trade-off of having the data less updated, but by doing so, we also managed to understand the results of our code better, since the output changed only because of what we wrote and due to the daily updates in the data.

covid19 = read.csv("owid-covid-data.csv") 

# Modifying class of date column 
covid19$date = as.Date(covid19$date)
class(covid19$date)

## [1] "Date"

(b) Listing five top dates

After examining the data set, we understood that there are values in the locations column that refer to categorical location, such as High income countries. We subset only these categories into a new data set, and simply selected the top 5 dates in terms of highest values of new_cases.

high_income_covid19 = filter(covid19, location == "High income") # temp df

#top 5 new_cases
high_income_covid19 %>%
  select(date, new_cases) %>%
  arrange(desc(new_cases)) %>% 
  top_n(5)

## Selecting by new_cases

##         date new_cases
## 1 2022-01-19   2830533
## 2 2022-01-18   2737804
## 3 2022-01-10   2589365
## 4 2022-01-21   2558959
## 5 2022-01-26   2532201

#top 5 new_deaths
data.table(high_income_covid19 %>%
  select(date, new_deaths) %>%
  arrange(desc(new_deaths)) %>% 
  top_n(5))

## Selecting by new_deaths

##          date new_deaths
## 1: 2021-01-20      11047
## 2: 2021-01-26      10279
## 3: 2021-01-12      10170
## 4: 2021-01-27      10079
## 5: 2021-01-08       9967

#top 5 new_vaccinations
high_income_covid19 %>%
  select(date, new_vaccinations) %>%
  arrange(desc(new_vaccinations)) %>% 
  top_n(5)

## Selecting by new_vaccinations

##         date new_vaccinations
## 1 2021-08-04         11980707
## 2 2021-12-17          9748326
## 3 2021-06-10          9538010
## 4 2021-06-11          9504763
## 5 2021-06-09          9452211

The tables show that the top 5 dates with the largest amount of new_cases recorded were all in January 2022. The 5 dates with the highest new_deaths were recorded in January 2021. Regarding the five dates with the most vaccinations, there is a dispersal in the second half of 2021.

2. Comparing low vs. high income countries:

(a) Writing a plot function for high/low income countries

We developed a function that receives a data frame, and filters within it the information on the categories of high / low income according to the input column while excluding NA values and non-numeric values. Inside the function, we used ggplot to display a plot as required.

plot_income <- function(df, col_name, log=FALSE){
  income_locations <- c("High income","Low income") #specifying the wanted locations
  df %>%
    select(location, date, all_of(col_name)) %>% # selecting wanted columns and filtering them
    filter(location %in%  income_locations) %>%
    ggplot(aes_string(x="date",y=all_of(col_name), color="location"),
           width = 40, height = 20, units = "cm")  +
    labs(x="Date", # using string functions to "prettify" the labels
         y=str_to_title(str_replace_all(as.character(all_of(col_name)), '_', ' ')),
         title = paste(str_to_title(str_replace_all(as.character(all_of(col_name)), '_', ' ')),
                       "over Time Per Income Category"),
         color='Income Category') + 
    geom_line()
}

(b) Plotting with the function with different columns

(i) plotting new_cases_per_million vs plotting logged new_cases_smoothed_per_million

# first plot - new_cases_per_million
plot_income(covid19, "new_cases_per_million")

# second plot - logged_new_cases_smoothed_per_million
covid19$logged_new_cases_smoothed_per_million <- log(covid19$new_cases_smoothed_per_million) 
plot_income(covid19, "logged_new_cases_smoothed_per_million")

We presented the information in the plot as asked and immediately noticed that the logged & smoothed version of the data we plotted is much more convenient to interpret. Smoothing and using log down-sizes the scale of the data without changing the trends With this sort of plotting it is much simpler to visualize the trends of the column we inspect.

We learn from the second graph that the general trends of disease spread were relatively identical between the two income-types of countries.

(ii) plotting logged_new_deathes_smoothed and logged_new_vaccinations_smoothed_per_million

We executed the same function using the log function for a plot the new_deaths_smoothed and new_vaccinations_smoothed_per_million columns.

# third plot - logged smoothed_new_deathes
covid19$logged_new_deathes_smoothed <- log(covid19$new_deaths_smoothed)
plot_income(covid19, "logged_new_deathes_smoothed")

# fourth plot - logged new_vaccinations_smoothed_per_million
covid19$logged_new_vaccinations_smoothed_per_million <- log(covid19$new_vaccinations_smoothed_per_million)
plot_income(covid19, "logged_new_vaccinations_smoothed_per_million")

It can be seen in the third graph referring to deaths that the trends are relatively identical in 2020 and 2022 so far, but reversed in 2021. The fourth plot showing the extent of vaccinations presents how High income countries vaccinated the populations earlier and in a sharp and rapid process, compared to Low income countries, which started vaccinating later and at a lower pace.

3. Storing and analyzing current (cumulative) values:

(a) Defining the currnet data frame

applies fun to all columns specified in .SDcols while grouping by the columns specified in by.

In order to create the current table with most updated data for each country we used the data.table and functions that it has. We create a function for the columns that takeing the last element is not NA by location. After using that function, every last row of each location has the most updated data for that specific location, at this point we group the locations by the last date (which is the last row for each country). Afterwards we make sure to select only the needed columns.

current = data.table(covid19[,c("location", "continent", "total_cases_per_million", "total_deaths_per_million", "total_vaccinations_per_hundred","people_fully_vaccinated_per_hundred", "total_boosters_per_hundred", "excess_mortality_cumulative_per_million")])
current = current[, lapply(.SD, function(x) last(x[!is.na(x)])), by= location]

(b) Analyzing total_deaths_per_million

(i) Histogram of the data

current %>% 
  ggplot(aes(total_deaths_per_million)) + 
  geom_histogram(binwidth = 30, color = "dark blue") +
  xlab("Total Deaths Per Million") +
  ylab("Count")+
  ggtitle("Histogram Of Total Deaths Per Million")

The histogram does not look close to the normal distribution, it has a right tail and the center of the values is in the left side. Many data-points are on the far left, few are on the far right.

(ii) Skewness and kurtosis of the data

ske = skewness(current$total_deaths_per_million, na.rm = TRUE)
kur = kurtosis(current$total_deaths_per_million, na.rm = TRUE)

As we learned that distributions can exhibit positive or negative skewness to varying degrees and it’s formula is:

\[skewness: \tilde{\mu}_3 \equiv E [(\frac{(x-\mu)}{\sigma})^3]\] The histogram has a Positive skewness = 1.3099515 and this indicates a long right tail. We learned about the kurtosis as well, distributions can exhibit positive or negative kurtosis in various values and it’s formulated as:

\[\tilde{\mu}_4 \equiv E [(\frac{(x-\mu)}{\sigma})^4]\]

\[kurtosis: \kappa \equiv \tilde{\mu}_4 - 3\]

The histogram has a positive kurtosis = 1.6398022 and it indicates a long tail (compared to the normal distribution).

(c) Plotting linear model of total_deaths_per_million vs total_cases_per_million

reg = lm(total_deaths_per_million ~ total_cases_per_million, data = current)

ggplot(data = current, mapping = aes(x=total_cases_per_million , y= total_deaths_per_million), col = "blue") +
  geom_point() +
  geom_abline(intercept = reg$coefficients[1], slope = reg$coefficients[2], col = "blue") + 
  xlab("Total Cases Per Million") +
  ylab("Total Deaths Per Million") +
  ggtitle("Total Deaths Vs Total Cases Per Million")

The slope tells us how much we can expect Y to change as X increases, and can be used to estimate an average rate of change. The greater the magnitude of the slope, the steeper the line and the greater the rate of change. The slope of the regression is 0.0041153 and it means that there is a rate of 0.0041153 in change of total deaths as total cases change. The intercept is 581.912673.

4. Vaccinations in different continents:

(a) Analyzing total number of vaccinations per hundred in all the countries by continent:

(i) Box-plot of the data

First we made sure to select only locations with continents from current data frame. Then we box-plotted the data:

continents = c("Africa", "Asia", "Europe", "North America", "Oceania", "South America")
curr = subset(current, continent %in% continents) # Dropping out the lines without continent.

curr %>% 
  ggplot(aes(x=continent, y=total_vaccinations_per_hundred, fill=continent)) +
  labs(x="Continents",
       y= "Total Vaccinations Per Hundred",
       fill="Continents",
       title = "Boxplot of Total Vaccinations per hunderd for Each Continent")+
  geom_boxplot()

We got a box-plot of all the continent’s colored differently. For each continent we can see the spread of the results, the values range between 25th and 75th quartiles, the median, and whether are any outlines. We see that in most continents vaccination ratios spread at the range 100-200, while Africa stands down with lower range of vaccination ratio.

(ii) Finding 2 outlier countries in the data

We used simple selection with filters to output the outliers which we saw in the boxplot above.

# First outlier
max_location = curr$location[curr$total_vaccinations_per_hundred == max(curr$total_vaccinations_per_hundred)]
max_vaccination = curr$total_vaccinations_per_hundred[curr$location == max_location]

# Second outlier
North_America = curr %>%
  select(location, continent, total_vaccinations_per_hundred) %>%
  filter(continent == "North America") %>%
  arrange(desc(total_vaccinations_per_hundred)) %>% 
  head(1)
max_north_america = as.character(North_America[1])
max_vaccination_na = as.numeric(North_America[3])

Looking at the results of the box-plot, we can notice 4 outliers in Africa, one in Europe, and one in North America. The maximum total_vaccinations_per_hundred in the data is the country Gibraltar” from Europe and it’s value is 353.04. Another outline is the maximum total_vaccinations_per_hundred in North America, which is Cuba with 313.99 vaccinations per hundred.

(b) Defining and plotting the Booster Ratio

# Defining a new column - booster_ratio
covid19$booster_ratio = covid19$total_boosters/covid19$people_fully_vaccinated

# Plot of `booster_ratio` as a function of time
covid19 %>%   
  filter(location %in% continents) %>%
  ggplot(aes(x=date, y=booster_ratio, color=location)) + 
    geom_line(size=1.2,  alpha=0.5) + 
    xlim(as.Date(c('1/1/2021','13/4/2022'),format="%d/%m/%Y"))+
    labs(x="Date",
       y= "Booster Ratio",
       color="Continents") +
    ggtitle("Booster Ratio Per Continent Over Time")

We can notice that the third vaccine (the booster) started for the first time around January 2021 in South America and Europe, and later on in July 2021 in Asia and then in North America, South America, Europe, Africa and Oceania. There is an anomaly in the booster ratio of South America in January 2021 which is in contrast of what we expected.

5. Displaying monthly data:

(a) Monthly and yearly boxplot of new_cases

We filterd out the locations without continents and then selected the desierd values using group_by function. Then we ploted with ggplot the boxplot.

#Assigning a new df for Grouping the data to months,years and locations
months_df <- covid19 %>% 
  filter(continent != "") %>% # avoiding calculating the continents twice and other non-countries
  select(location, date, new_cases_per_million,new_deaths,new_vaccinations) %>%
  mutate(year_month = format(as.Date(date), "%Y-%m")) %>% # Grouping by month and year
  group_by(location,year_month) %>%
  summarise(monthly_new_cases_per_mill = sum(new_cases_per_million, na.rm = TRUE),
            monthly_new_deaths = sum(new_deaths, na.rm = TRUE),
            monthly_new_vaccinations = sum(new_vaccinations, na.rm = TRUE), .groups = "keep")

# first plot - Monthly New Cases Per Million
months_df %>% 
  ggplot(aes(x=year_month, 
             y=monthly_new_cases_per_mill, 
             fill=year_month,)) + 
   labs(x="Year & Month",
       y= "Monthly New Cases Per Million",
       color="Year & Month") +
  theme(axis.text.x=element_text(angle = 90, hjust = 0))+
  geom_boxplot(show.legend = FALSE, )  +
  ggtitle("Wrold Monthly New Cases Per Million Distributions Over Time")

As seen in the plot above, the ratio of monthly new cases per million in the world grows with time.

(b) Plotting two more disticutions and concluding them

Here we applied the same code from (a)

#second plot - new_deaths:
months_df %>% 
  ggplot(aes(x=year_month, 
             y=monthly_new_deaths, 
             fill=year_month,)) + 
   labs(x="Year & Month",
       y= "Monthly New Deaths",
       color="Year & Month") +
  theme(axis.text.x=element_text(angle = 90, hjust = 0))+
  geom_boxplot(show.legend = FALSE, )  +
  ggtitle("Wrold Monthly New Deaths Distributions Over Time")

#third plot - new_vaccinations:
months_df %>% 
    ggplot(aes(x=year_month, 
             y=monthly_new_vaccinations, 
             fill=year_month,)) + 
   labs(x="Year & Month",
       y= "Monthly New Vaccinations",
       color="Year & Month") +
  theme(axis.text.x=element_text(angle = 90, hjust = 0))+
  geom_boxplot(show.legend = FALSE, )  +
  ggtitle("Wrold Monthly New Vaccinations Distributions Over Time")

We see a significant variance in all plots, which is quite expected since the world is diversified. From the first plot, it can be seen that in recent months there has been a significant increase in the new cases, as remembered the varient of omicron appeared in these months. From the second plot, it can be seen that there is actually a greater concentration in the middle of the period, i.e. at the beginning of 2021, of deaths. From the third plot, it can be seen that the vaccines began in the middle of the period, i.e at the beginning of 2021, and as we remember in these months, vaccinations also began in Israel.

6. R - disease reproduction number:

(i) Adding an R_cases column to the covid19 data frame

First we created a temporary data frame with colunmns location, date and new_cases_smoothed, then we used the lag function to create a new column of the new_cases_smoothed 7 days before the date value, and then we calculated the R_case using a function we built and apply.

# Defining temporary table with specific columns
temp_df <- covid19 %>% 
  select(location, date, 
         new_cases_smoothed)
n <- length(covid19$location)

# Adding a column of new_cases_smoothed_7_days_ago using lag function 
temp_df <- temp_df %>%
  mutate(location_2 = lag(location, n=7),
         new_cases_smoothed_7_days_ago = lag(new_cases_smoothed, n=7),
         R_cases = numeric(length=n))

temp_df$new_cases_smoothed = as.numeric(temp_df$new_cases_smoothed)
temp_df$new_cases_smoothed_7_days_ago = as.numeric(temp_df$new_cases_smoothed_7_days_ago)

# Defining a function that assigns an R_cases value to a cell (using apply later)
caclculate_R_cases <- function(df_row){
 if(is.na(df_row["new_cases_smoothed"]) | is.na(df_row["new_cases_smoothed_7_days_ago"])){
   df_row["R_cases"] <- NA
 }else if(as.numeric(df_row["new_cases_smoothed_7_days_ago"]) == 0){
   df_row["R_cases"] <- NA
 }else if(df_row["location"]!=df_row["location_2"]){
   df_row["R_cases"] <- NA
 }else{
  df_row["R_cases"] <- as.numeric(df_row["new_cases_smoothed"]) / as.numeric(df_row["new_cases_smoothed_7_days_ago"])
 }
}

# Applying on a vector of R_cases assigning it as column in covid19 data frame
R_cases <- apply(X=temp_df,MARGIN = 1,FUN = caclculate_R_cases)
covid19$R_cases = R_cases
temp_df$R_cases = R_cases #we will use this later

(ii) Plotting the R_cases values for Israel, UK, and US

We made a simple plot using ggplot.

locations_vec <- c("Israel", "United Kingdom", "United States")

# plot with outlires included (no filtering)
covid19 %>% select(location, R_cases,date) %>%
  filter(location %in% locations_vec) %>%
  ggplot(aes(x = date, y = R_cases, color = location)) +
  geom_point() +
  geom_line(alpha=0.5)

The plot is not clear because of some outlines in US’s data. We decided to plot it again filtering the outline as seen below:

# plot with outlines excluded
covid19 %>% select(location, R_cases,date) %>%
  filter(location %in% locations_vec) %>%
  ggplot(aes(x = date, y = R_cases, color = location)) + ylim(0,10) +
  geom_point(size=1.5, alpha = 0.6) +
  geom_line(alpha=0.5)

(iii) listing in a table the number of days of spread

We filterd out and counted with dyplr the requested data.

spreading_days <- temp_df %>% 
  filter(location %in% locations_vec) %>%
  filter(R_cases > 1) %>% 
  count(location,
        sort = TRUE,
        name = "num of disease spreading days")
spreading_days

##         location num of disease spreading days
## 1 United Kingdom                           443
## 2         Israel                           400
## 3  United States                           395

United kingdom had the most days out of the three.

7. Displaying data on the world map:

We used the requested functions for displaying the map-plots as follows:

(i) Defining relevant objects

sPDF <- joinCountryData2Map(current, joinCode = "NAME"
                              , nameJoinColumn = "location")

## 218 codes from your data successfully matched countries in the map
## 24 codes from your data failed to match with a country code in the map
## 25 codes from the map weren't represented in your data

par(mai=c(0,0,0.2,0),xaxs="i",yaxs="i")

(ii) Mapping the world by total_deaths_per_million

mapCountryData(sPDF, nameColumnToPlot="total_deaths_per_million",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Total Deaths Per Million", aspect = "variable", colourPalette = "magma")

# Presenting in a table
current %>% 
  select(location, total_deaths_per_million) %>%
  arrange(desc(total_deaths_per_million)) %>% 
  head(3)

##                  location total_deaths_per_million
## 1:                   Peru                 6370.855
## 2:               Bulgaria                 5327.220
## 3: Bosnia and Herzegovina                 4823.716

We got 3 countries Peru, Bulgaria and Bosnia and Herzegovina in red color and the rest of the countries show values that are relatively intermediate or low.

(iii) Mapping the world by total_vaccinations_per_hundred

mapCountryData(sPDF, nameColumnToPlot="total_vaccinations_per_hundred",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Total Vaccinations Per Hundred map", aspect = "variable",colourPalette = "terrain")

# Presenting in a table
current %>% 
  select(location, total_vaccinations_per_hundred) %>%
  arrange(desc(total_vaccinations_per_hundred)) %>% 
  head(3)

##     location total_vaccinations_per_hundred
## 1: Gibraltar                         353.04
## 2:      Cuba                         313.99
## 3:     Chile                         266.58

It can be seen from the map that the number of vaccinated in Africa is very low, Russia and India the number of vaccinated is lower than the median while most of America is above the median.

(iv) Mapping the world by excess_mortality_cumulative_per_million

mapCountryData(sPDF, nameColumnToPlot="excess_mortality_cumulative_per_million",catMethod = "fixedWidth", missingCountryCol = "white", mapTitle = "Excess Mortality Cumulative Per Million", aspect = "variable",colourPalette = "magma")

# Presenting in a table
current %>% 
  select(location, excess_mortality_cumulative_per_million) %>%
  arrange(desc(excess_mortality_cumulative_per_million)) %>% 
  head(3)

##    location excess_mortality_cumulative_per_million
## 1: Bulgaria                                9573.960
## 2:   Serbia                                8158.629
## 3:   Russia                                7975.083

It can be seen that compared to previous years, in Russia and Eastern-European countries, the excess rate is very large (it can be amused that this is due to under-reporting in previous years but there may be other reasons). In contrast, in countries like Australia and Greenland the excess is negative.

8. Cross correlations and delay from diagnosis to death:

(a) Creating a cross_correlation function

cross_corr <- function(df, country, column_x, column_y){
  tempdf = df %>%
    filter(location == country) %>%
    select(date, all_of(column_x), all_of(column_y)) #creating a temporary data frame
  
  X = tempdf[all_of(column_x)] 
  Y = tempdf[all_of(column_y)]
  result = vector("numeric" , 121 ) #creating a new vector of the result
  names(result) = c(-60:60) # define the names of the values
  delta_t= -60 # temporary value for the loop
  for (i in 1:121){
    if (delta_t<0){ #Calculating correlation with a time difference as in the definition
      result[i] = cor(Y[(1+abs(delta_t)):nrow(X),1], 
                    X[1:(nrow(Y)-abs(delta_t)),1],
                    method = "pearson",
                    use = "pairwise.complete.obs")
    }else {
      result[i] = cor(X[(1+abs(delta_t)):nrow(Y),1],
                      Y[1:(nrow(X)-abs(delta_t)),1], 
                      method = "pearson", 
                      use = "pairwise.complete.obs")
    }
    if(delta_t < 60){ #the range of the time difference is -60:60
      delta_t= delta_t+1
    }else{
      return(result)
    }
  }
}

(b) Plotting the output of our function with world new cases and deaths (smoothed)

#Defining a vector of the data and plotting it
world_cc <- cross_corr(covid19,"World","new_cases_smoothed","new_deaths_smoothed")
plot(x=names(world_cc), y=world_cc, type = "h", main = "Cross-Correlation of World New Cases and Deaths (Smoothed)", xlab = "Time Difference", ylab = "Correlation Level")

# Finding the maximal value and it's time difference (presented in the text)
max_corr = max(world_cc)
max_time_diff = names(world_cc[world_cc==max_corr])

The results show that the cross correlation is maximized around the time delay of -13

9. Death risk after Covid-19 infection:

(a) Calculating and plotting death_rate per continents and entire world

For the presenting the entire world, we chose to present the values of the location World, after we made sure that it does represents the data of the entire world and read it in the OWID Changelog.

covid19$death_rate = covid19$total_deaths / covid19$total_cases
covid19["continent"][covid19["location"] == "World"] <- "World"

covid19 %>% 
  filter(continent != "") %>%
  group_by(continent, date) %>%
  summarize(avg = mean(death_rate, na.rm = TRUE), .groups = 'keep') %>%
  group_by(continent) %>% 
  filter(date >= '2021-01-01') %>% 
  ggplot(aes(x=date, y=avg, color=continent)) +
  geom_line(size=0.5) +
  ggtitle("Average Death Rate Over Time") +
  labs(x="Date",y = "Average Death Rate",color = "Continents")

The plot shows significant decrease Death Rates since the end of year 2021. We suppose that it is correlated with type of variants that were most common at a several time. Many people were infected with the Omicron variant which spread at the begging of 2021, but it was considers least fatal and that is why the death rate decreased.

(b) Calculating and plotting total_vaccinations_per_hundred per continents and entire world

We did the same as in (a).

covid19 %>%
  filter(continent != "") %>%
  group_by(continent, date) %>% 
  summarize(avg = mean(total_vaccinations_per_hundred, na.rm = TRUE), .groups = 'keep') %>%
  filter(date >= '2021-01-01') %>% 
  ggplot(aes(x=date, y=avg, color=continent)) + 
  geom_line(size=0.5) + 
  ggtitle("Average Total Vaccinations Per Hundred Over Time") +
  labs(x="Date",y = "Average Total Vaccinations Per Hundred",color = "Continents")

We do see that the plots suggest a negative correlation between vaccination and death rate,

10. Excess mortality:

(a) Scatter plot of the current date total_deaths_per_million vs. the excess_mortality_cumulative_per_million

We added a diff column to current data frame and filled it the the absalute value of the difference between the excess and reported mortality rates. afterwards we plotted the data and used geom_text_repel to mark the countries that hada difference bigger then 2000 of the rates.

current$diff = abs(current$excess_mortality_cumulative_per_million-current$total_deaths_per_million) # Defining the difference between the columns (with abs)

a = ifelse(current$diff >= 2000, "dark blue", "black")

ggplot(current, aes(x=excess_mortality_cumulative_per_million, y=total_deaths_per_million)) +
  geom_point(col = a) + 
  geom_text_repel(size = 3, data = current %>% 
                    filter(diff >= 2000 | diff <= -2000),
    aes(label=location), col = "dark blue") + # Marking on the graph in color all the countries with diff >= 2000
  geom_abline(slope = 1, intercept = 0, col = "blue") +
  geom_abline(slope = 1, intercept = 2000, col = "red") +
  geom_abline(slope = 1, intercept = -2000, col = "green") +
  ggtitle("Total Death ~ Excess Mortality Cumulative (Per Million)")+
  xlab("Excess Mortality Cumulative (Per Million)") + 
  ylab("Total Deaths (Per Million)")

We can see that there are quite a few countires that has significant diff values.

(b) Plotting 3 excising countries

From the named countries in the previous plot chose 3 random countries and made sure they had over 50 records and their plot is blow.

country = c("South Africa", "Bulgaria", "Lithuania")

covid19 %>% 
  filter(location %in% country) %>% 
  select(location, date, excess_mortality_cumulative_per_million,total_deaths_per_million) %>% 
  group_by(location) %>% 
  ggplot(aes(x=date, y=total_deaths_per_million, colour=location)) + 
  geom_line(size=0.5) + 
  geom_point(size=0.5, aes(y=excess_mortality_cumulative_per_million)) +
  labs(y = "Total Deaths Per Million (line), 
       Excess Mortality Cumulative Per Million (dots)", x = "Date", colour = "Location")

The plot shows that the difference between reported deaths and excess rates grows in time. We see that until July 2020 the difference wasn’t significant. Until that time we see the most excess mortality not explained by Covid-19.