Source of data

The covid-19 data is downloaded from from “Our World In Data” website.

The data is manipulated using a set of wrapper functions based on the “data.table” package. These wrapper functions conveniently allow non-standard evaluation of column and variable names in similar way as those functions in the popular ‘r dplyr’ but offer fast computing speed similar to ‘r data.table’.

These wrapper functions are published as ‘r wr.data.table’ on “github”

View a summary of data

Here are some features of the dataset:

The earliest record is on 2020-01-01.

The latest record is on 2022-05-09.

NA’s in total_cases

We will use the total_cases column for data visualisation. Therefore we will have closer look in the data under the ‘r total_cases’ column.

There are 6866 entries that contain NA under total_cases column.

The code below examine the source data to find any locations that contained no records in total_cases on any dates.

locations_with_no_total_cases_record <- 
  raw |>
  sel_cols(location, total_cases) |>
  def_cols(is.na ~ is.na(total_cases)) |>
    set_group(location, is.na) |>
    condense(no_of_na_record = .N) |>
    spread(is.na, no_of_na_record) |>
    def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
    rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))

head(locations_with_no_total_cases_record)

##       location data_count NA_count no_record
## 1: Afghanistan        806       NA     FALSE
## 2:      Africa        816        1     FALSE
## 3:     Albania        792       13     FALSE
## 4:     Algeria        805       NA     FALSE
## 5:     Andorra        799       NA     FALSE
## 6:      Angola        781       NA     FALSE

No locations are to be omitted. All locations have some record in total_cases.

locations_with_no_total_deaths_record <- 
  raw |>
  sel_cols(location, total_deaths) |> 
  def_cols(is.na ~ ifelse(is.na(total_deaths), T, F)) |>
  set_group(location, is.na) |>
  condense(no_of_na_record = .N) |>
  spread(is.na, no_of_na_record) |>
  def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
  rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))

head(locations_with_no_total_deaths_record)

##       location data_count NA_count no_record
## 1: Afghanistan        778       28     FALSE
## 2:      Africa        793       24     FALSE
## 3:     Albania        790       15     FALSE
## 4:     Algeria        789       16     FALSE
## 5:     Andorra        779       20     FALSE
## 6:      Angola        772        9     FALSE

No locations are to be omitted. All locations have some record in total_deaths.

Total COVID-19 Cases Worldwide

The following graph shows the cumulative number of COVID-19 cases and deaths since 2020-01-01 in different continents.

The script below show the use of wr.data.table package in filtering and summarizing the raw data.

total_cases_continents <- 
  raw |>
  filter_rows(continent != "") |>
  sel_cols(date, continent, location, total_cases, total_deaths)

total_cases_continents <- total_cases_continents |>
  def_cols(total_cases ~ replace(total_cases, is.na(total_cases),0)) |>
  def_cols(total_deaths ~ replace(total_deaths, is.na(total_deaths), 0)) |>
  set_group(continent, date) |>
  condense(total_cases = sum(total_cases), total_deaths = sum(total_deaths))

The script below was used to tidy the text in the global COVID timeline.

The dates and the corresponding events were extract and stored as a dataframe.

dates <- global_timeline |> str_extract_all(pattern = "\\w{3,9}\\s[:digit:]{1,2},\\s[:digit:]{4}(?=\\s-\\s)") |> unlist()

events <- c()
for (i in 1:length(dates)){
  pattern <- paste0("(?<=",dates[i],"\\s-\\s).+(?=<br />)")
  events <- c(events, str_extract(global_timeline, pattern))
  rm(pattern)
}

#length(events) == length(dates) # check number of events == check number of events

# store tidied text in csv file
global_event_timeline <- data.frame("date"=mdy(dates), "events"=events)
readr::write_csv(global_event_timeline, "global_covid_event_timeline.csv")
rm(global_timeline,dates,events)

The following interactive Plotly graph shows the key events and accumulative number of COVID-19 cases worldwide since 2020-01-01

Key events were obtained from “CNN’s COVID-19 facts and timeline”

The following script was used to extract the dates and NZ events during the pandemic.

# extract event dates
dates <- nz_timeline |> str_extract_all(pattern = "(?<!\\son\\s)[:digit:]{1,2}\\s\\w{3,9}\\s[:digit:]{4}(?=\n)") |> unlist()

# extract events by date
nz_timeline <- nz_timeline |> str_remove_all(pattern = "\n") # remove line breaks
events <- c()
for (i in 1:length(dates)) {
  if (i == length(dates)) {
    pattern = paste0("(?<=", dates[i],").+")
  } else {
    pattern = paste0("(?<=",dates[i],").+","(?=",dates[i+1],")")
  }
  events <- c(events, str_extract(nz_timeline, pattern))
  rm(pattern)
}
events <- lapply(events, str_trim) |> unlist() # trim spaces at start and end of string
events <- lapply(events, function(x){gsub("\\.",".<br />", x)}) |> unlist()

# length(events) == length(dates) # check: num of events extracted == number of dates extracted


# construct data.table for events
nz_event_timeline <- data.frame("date"=dmy(dates), "events"=events)
readr::write_csv(nz_event_timeline, "nz_covid_event_timeline.csv")

rm(nz_timeline, dates, events)

Here are the key events in New Zealand’s COVID-19 response plotted with total cases accumulated in New Zealand.

Top Countries With The Most COVID-19 Cases Recorded

The following script was used to summarize and extract information about the countries that account for highest 80% COVID-19 cases.

top_80 <- 
  raw |>
  filter_rows(continent != "") |>
  sel_cols(date, continent, location, total_cases, total_deaths, population_density)

top_80$total_deaths <- replace(top_80$total_deaths, is.na(top_80$total_deaths),0)
top_80$total_cases <- replace(top_80$total_cases, is.na(top_80$total_cases),0)

top_80 <- 
  top_80 |>
  set_group(continent,location) |>
  condense(total_cases = sum(total_cases), 
           total_deaths = sum(total_deaths), 
           population_density = mean(population_density)) |>
  # filter_rows(!is.na(total_cases)) |>
  descend_rows(total_cases)

top_80_count = 1
cumulative_total_cases = 0
total_cases_to_date = sum(top_80$total_cases)
while ((cumulative_total_cases/total_cases_to_date) <= 0.8) {
  cumulative_total_cases <- sum(top_80$total_cases[1:top_80_count])
  top_80_count <- top_80_count + 1
}

top_80 <- top_80[1:top_80_count,]

top_80 <- 
  top_80 |> 
  def_cols(death_rate_pct ~ round(total_deaths / total_cases, digits=3) * 100) |>
  descend_rows(death_rate_pct)

These countries are listed below in descending order of death rate. From the data Peru and Mexico has had the highest death rate over the period studied.

continent	location	total_cases	total_deaths	population_density	death_rate_pct
South America	Peru	1252827004	104996637	25.129	8.4
North America	Mexico	1858756703	140976676	66.444	7.6
Asia	Indonesia	1736870388	52128658	145.725	3.0
Africa	South Africa	1391726309	39983246	46.754	2.9
South America	Brazil	10691300560	280053586	25.040	2.6
South America	Colombia	2307152015	57696362	44.223	2.5
Europe	Russia	4467046120	105503439	8.823	2.4
Asia	Iran	2342521047	56989667	49.831	2.4
Europe	Poland	1749005476	40053410	124.027	2.3
Europe	Ukraine	1514386134	34859741	77.390	2.3
Europe	Italy	3289787034	73998887	205.859	2.2
South America	Chile	963038016	19604803	24.282	2.0
South America	Argentina	2684604299	51592250	16.177	1.9
Europe	Spain	3027291834	50081599	93.105	1.7
Asia	Philippines	1090891843	18175288	351.873	1.7
North America	Canada	962911738	16614813	4.037	1.7
Europe	Belgium	911153941	15761395	375.564	1.7
North America	United States	24832438504	390911437	35.608	1.6
Europe	United Kingdom	4868483948	79223814	272.898	1.6
Europe	Czechia	1073151524	15708044	137.176	1.5
Europe	Germany	3572273153	49855674	237.016	1.4
Asia	India	15672119463	204297650	450.419	1.3
Europe	France	5221160359	65344650	122.578	1.3
Asia	Malaysia	907334233	8663202	96.254	1.0
Asia	Turkey	3850946112	32248696	104.914	0.8
Asia	Japan	1031797442	8287894	347.778	0.8
Europe	Netherlands	1505610530	10935197	508.544	0.7
Asia	South Korea	1047311072	2525281	527.967	0.2

These countries are plotted in the interactive Plotly map below.

Bubble markers proportional to the number of total cases (yellow) and total deaths (red) are added.

Peru has a 8.4% death rate (proportion of number of deaths to number of cases), which is the highest out of these countries.

The second highest country is Mexico at 7.6% death rate.

The same data is represented as bar charts below in ascending order of total cases or total deaths.

The USA is highest in total cases and deaths.

Russia has a relatively low number of death and cases, but considering the population density, the rate of total cases and deaths are relatively high.

India has a high number of cases and deaths but the rate of total cases and deaths are relatively low when population density is taken into account.

The following time series summarizes the number of daily new cases reported in different continents during the pandemic.

According to the data, it appears that Europe had the greatest surge around Jan 2022 with the USA being second and Asia being third around the same time. However, there can be a number of factors involved that would affect the data, including the system used for reporting and recording new cases, people’s proactiveness in reporting, testing capacity, etc.

The following plot shows that NZ cases surged at around the same time in Jan 2022 as the other countries/continents.

General knowledge tells us that Jan 2022 was around the time when the Delta variant was at its tail end and the Omicron variant became prevalent.

It is common sense to expect the number of new tests recorded would be correlated to the numbe of new cases.

The analysis using linear regression below shows a very small p-value, indicating number of new tests is a signficant factor in relation to the number of new case.

## Analysis of Variance Table
## 
## Response: new_cases_smoothed_per_million
##                         Df    Sum Sq  Mean Sq F value    Pr(>F)    
## new_tests_per_thousand   1  61150521 61150521  196.11 < 2.2e-16 ***
## Residuals              375 116933823   311824                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OWID Data - COVID-19

kleu046

5/3/2022

Visualization of COVID data

Source of data

View a summary of data

NA’s in total_cases

Total COVID-19 Cases Worldwide

Top Countries With The Most COVID-19 Cases Recorded