Visualization of COVID data

This R Markdown document examines COVID-19 cases and deaths in different countries using graphs and interactive plots.

Source of data

The covid-19 data is downloaded from from “Our World In Data” website.

The data is manipulated using a set of wrapper functions based on the “data.table” package. These wrapper functions conveniently allow non-standard evaluation of column and variable names in similar way as those functions in the popular ‘r dplyr’ but offer fast computing speed similar to ‘r data.table’.

These wrapper functions are published as ‘r wr.data.table’ on “github”

View a summary of data

Here are some features of the dataset:

The earliest record is on 2020-01-01.

The latest record is on 2022-05-09.

NA’s in total_cases

We will use the total_cases column for data visualisation. Therefore we will have closer look in the data under the ‘r total_cases’ column.

There are 6866 entries that contain NA under total_cases column.

The code below examine the source data to find any locations that contained no records in total_cases on any dates.

locations_with_no_total_cases_record <- 
  raw |>
  sel_cols(location, total_cases) |>
  def_cols(is.na ~ is.na(total_cases)) |>
    set_group(location, is.na) |>
    condense(no_of_na_record = .N) |>
    spread(is.na, no_of_na_record) |>
    def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
    rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))

head(locations_with_no_total_cases_record)
##       location data_count NA_count no_record
## 1: Afghanistan        806       NA     FALSE
## 2:      Africa        816        1     FALSE
## 3:     Albania        792       13     FALSE
## 4:     Algeria        805       NA     FALSE
## 5:     Andorra        799       NA     FALSE
## 6:      Angola        781       NA     FALSE

No locations are to be omitted. All locations have some record in total_cases.

locations_with_no_total_deaths_record <- 
  raw |>
  sel_cols(location, total_deaths) |> 
  def_cols(is.na ~ ifelse(is.na(total_deaths), T, F)) |>
  set_group(location, is.na) |>
  condense(no_of_na_record = .N) |>
  spread(is.na, no_of_na_record) |>
  def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
  rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))

head(locations_with_no_total_deaths_record)
##       location data_count NA_count no_record
## 1: Afghanistan        778       28     FALSE
## 2:      Africa        793       24     FALSE
## 3:     Albania        790       15     FALSE
## 4:     Algeria        789       16     FALSE
## 5:     Andorra        779       20     FALSE
## 6:      Angola        772        9     FALSE

No locations are to be omitted. All locations have some record in total_deaths.

Total COVID-19 Cases Worldwide

The following graph shows the cumulative number of COVID-19 cases and deaths since 2020-01-01 in different continents.

The script below show the use of wr.data.table package in filtering and summarizing the raw data.

total_cases_continents <- 
  raw |>
  filter_rows(continent != "") |>
  sel_cols(date, continent, location, total_cases, total_deaths)

total_cases_continents <- total_cases_continents |>
  def_cols(total_cases ~ replace(total_cases, is.na(total_cases),0)) |>
  def_cols(total_deaths ~ replace(total_deaths, is.na(total_deaths), 0)) |>
  set_group(continent, date) |>
  condense(total_cases = sum(total_cases), total_deaths = sum(total_deaths))

The script below was used to tidy the text in the global COVID timeline.

The dates and the corresponding events were extract and stored as a dataframe.

dates <- global_timeline |> str_extract_all(pattern = "\\w{3,9}\\s[:digit:]{1,2},\\s[:digit:]{4}(?=\\s-\\s)") |> unlist()

events <- c()
for (i in 1:length(dates)){
  pattern <- paste0("(?<=",dates[i],"\\s-\\s).+(?=<br />)")
  events <- c(events, str_extract(global_timeline, pattern))
  rm(pattern)
}

#length(events) == length(dates) # check number of events == check number of events

# store tidied text in csv file
global_event_timeline <- data.frame("date"=mdy(dates), "events"=events)
readr::write_csv(global_event_timeline, "global_covid_event_timeline.csv")
rm(global_timeline,dates,events)

The following interactive Plotly graph shows the key events and accumulative number of COVID-19 cases worldwide since 2020-01-01

Key events were obtained from “CNN’s COVID-19 facts and timeline”

The following script was used to extract the dates and NZ events during the pandemic.

# extract event dates
dates <- nz_timeline |> str_extract_all(pattern = "(?<!\\son\\s)[:digit:]{1,2}\\s\\w{3,9}\\s[:digit:]{4}(?=\n)") |> unlist()

# extract events by date
nz_timeline <- nz_timeline |> str_remove_all(pattern = "\n") # remove line breaks
events <- c()
for (i in 1:length(dates)) {
  if (i == length(dates)) {
    pattern = paste0("(?<=", dates[i],").+")
  } else {
    pattern = paste0("(?<=",dates[i],").+","(?=",dates[i+1],")")
  }
  events <- c(events, str_extract(nz_timeline, pattern))
  rm(pattern)
}
events <- lapply(events, str_trim) |> unlist() # trim spaces at start and end of string
events <- lapply(events, function(x){gsub("\\.",".<br />", x)}) |> unlist()

# length(events) == length(dates) # check: num of events extracted == number of dates extracted


# construct data.table for events
nz_event_timeline <- data.frame("date"=dmy(dates), "events"=events)
readr::write_csv(nz_event_timeline, "nz_covid_event_timeline.csv")

rm(nz_timeline, dates, events)

Here are the key events in New Zealand’s COVID-19 response plotted with total cases accumulated in New Zealand.

Top Countries With The Most COVID-19 Cases Recorded

The following script was used to summarize and extract information about the countries that account for highest 80% COVID-19 cases.

top_80 <- 
  raw |>
  filter_rows(continent != "") |>
  sel_cols(date, continent, location, total_cases, total_deaths, population_density)

top_80$total_deaths <- replace(top_80$total_deaths, is.na(top_80$total_deaths),0)
top_80$total_cases <- replace(top_80$total_cases, is.na(top_80$total_cases),0)

top_80 <- 
  top_80 |>
  set_group(continent,location) |>
  condense(total_cases = sum(total_cases), 
           total_deaths = sum(total_deaths), 
           population_density = mean(population_density)) |>
  # filter_rows(!is.na(total_cases)) |>
  descend_rows(total_cases)

top_80_count = 1
cumulative_total_cases = 0
total_cases_to_date = sum(top_80$total_cases)
while ((cumulative_total_cases/total_cases_to_date) <= 0.8) {
  cumulative_total_cases <- sum(top_80$total_cases[1:top_80_count])
  top_80_count <- top_80_count + 1
}

top_80 <- top_80[1:top_80_count,]

top_80 <- 
  top_80 |> 
  def_cols(death_rate_pct ~ round(total_deaths / total_cases, digits=3) * 100) |>
  descend_rows(death_rate_pct)

These countries are listed below in descending order of death rate. From the data Peru and Mexico has had the highest death rate over the period studied.

continent location total_cases total_deaths population_density death_rate_pct
South America Peru 1252827004 104996637 25.129 8.4
North America Mexico 1858756703 140976676 66.444 7.6
Asia Indonesia 1736870388 52128658 145.725 3.0
Africa South Africa 1391726309 39983246 46.754 2.9
South America Brazil 10691300560 280053586 25.040 2.6
South America Colombia 2307152015 57696362 44.223 2.5
Europe Russia 4467046120 105503439 8.823 2.4
Asia Iran 2342521047 56989667 49.831 2.4
Europe Poland 1749005476 40053410 124.027 2.3
Europe Ukraine 1514386134 34859741 77.390 2.3
Europe Italy 3289787034 73998887 205.859 2.2
South America Chile 963038016 19604803 24.282 2.0
South America Argentina 2684604299 51592250 16.177 1.9
Europe Spain 3027291834 50081599 93.105 1.7
Asia Philippines 1090891843 18175288 351.873 1.7
North America Canada 962911738 16614813 4.037 1.7
Europe Belgium 911153941 15761395 375.564 1.7
North America United States 24832438504 390911437 35.608 1.6
Europe United Kingdom 4868483948 79223814 272.898 1.6
Europe Czechia 1073151524 15708044 137.176 1.5
Europe Germany 3572273153 49855674 237.016 1.4
Asia India 15672119463 204297650 450.419 1.3
Europe France 5221160359 65344650 122.578 1.3
Asia Malaysia 907334233 8663202 96.254 1.0
Asia Turkey 3850946112 32248696 104.914 0.8
Asia Japan 1031797442 8287894 347.778 0.8
Europe Netherlands 1505610530 10935197 508.544 0.7
Asia South Korea 1047311072 2525281 527.967 0.2

These countries are plotted in the interactive Plotly map below.

Bubble markers proportional to the number of total cases (yellow) and total deaths (red) are added.

Peru has a 8.4% death rate (proportion of number of deaths to number of cases), which is the highest out of these countries.

The second highest country is Mexico at 7.6% death rate.

The same data is represented as bar charts below in ascending order of total cases or total deaths.

The USA is highest in total cases and deaths.

Russia has a relatively low number of death and cases, but considering the population density, the rate of total cases and deaths are relatively high.

India has a high number of cases and deaths but the rate of total cases and deaths are relatively low when population density is taken into account.

The following time series summarizes the number of daily new cases reported in different continents during the pandemic.

According to the data, it appears that Europe had the greatest surge around Jan 2022 with the USA being second and Asia being third around the same time. However, there can be a number of factors involved that would affect the data, including the system used for reporting and recording new cases, people’s proactiveness in reporting, testing capacity, etc.

The following plot shows that NZ cases surged at around the same time in Jan 2022 as the other countries/continents.

General knowledge tells us that Jan 2022 was around the time when the Delta variant was at its tail end and the Omicron variant became prevalent.

It is common sense to expect the number of new tests recorded would be correlated to the numbe of new cases.

The analysis using linear regression below shows a very small p-value, indicating number of new tests is a signficant factor in relation to the number of new case.

## Analysis of Variance Table
## 
## Response: new_cases_smoothed_per_million
##                         Df    Sum Sq  Mean Sq F value    Pr(>F)    
## new_tests_per_thousand   1  61150521 61150521  196.11 < 2.2e-16 ***
## Residuals              375 116933823   311824                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1