This R Markdown document examines COVID-19 cases and deaths in different countries using graphs and interactive plots.
The covid-19 data is downloaded from from “Our World In Data” website.
The data is manipulated using a set of wrapper functions based on the “data.table” package. These wrapper functions conveniently allow non-standard evaluation of column and variable names in similar way as those functions in the popular ‘r dplyr’ but offer fast computing speed similar to ‘r data.table’.
These wrapper functions are published as ‘r wr.data.table’ on “github”
Here are some features of the dataset:
The earliest record is on 2020-01-01.
The latest record is on 2022-05-09.
We will use the total_cases column for data visualisation. Therefore we will have closer look in the data under the ‘r total_cases’ column.
There are 6866 entries that contain NA under total_cases column.
The code below examine the source data to find any locations that contained no records in total_cases on any dates.
locations_with_no_total_cases_record <-
raw |>
sel_cols(location, total_cases) |>
def_cols(is.na ~ is.na(total_cases)) |>
set_group(location, is.na) |>
condense(no_of_na_record = .N) |>
spread(is.na, no_of_na_record) |>
def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))
head(locations_with_no_total_cases_record)
## location data_count NA_count no_record
## 1: Afghanistan 806 NA FALSE
## 2: Africa 816 1 FALSE
## 3: Albania 792 13 FALSE
## 4: Algeria 805 NA FALSE
## 5: Andorra 799 NA FALSE
## 6: Angola 781 NA FALSE
No locations are to be omitted. All locations have some record in total_cases.
locations_with_no_total_deaths_record <-
raw |>
sel_cols(location, total_deaths) |>
def_cols(is.na ~ ifelse(is.na(total_deaths), T, F)) |>
set_group(location, is.na) |>
condense(no_of_na_record = .N) |>
spread(is.na, no_of_na_record) |>
def_cols(no_record ~ ifelse(is.na(`FALSE`) & is.na(`TRUE`), T, F)) |>
rn_cols(old=c("FALSE","TRUE"),new=c("data_count", "NA_count"))
head(locations_with_no_total_deaths_record)
## location data_count NA_count no_record
## 1: Afghanistan 778 28 FALSE
## 2: Africa 793 24 FALSE
## 3: Albania 790 15 FALSE
## 4: Algeria 789 16 FALSE
## 5: Andorra 779 20 FALSE
## 6: Angola 772 9 FALSE
No locations are to be omitted. All locations have some record in total_deaths.
The following graph shows the cumulative number of COVID-19 cases and deaths since 2020-01-01 in different continents.
The script below show the use of wr.data.table package in filtering and summarizing the raw data.
total_cases_continents <-
raw |>
filter_rows(continent != "") |>
sel_cols(date, continent, location, total_cases, total_deaths)
total_cases_continents <- total_cases_continents |>
def_cols(total_cases ~ replace(total_cases, is.na(total_cases),0)) |>
def_cols(total_deaths ~ replace(total_deaths, is.na(total_deaths), 0)) |>
set_group(continent, date) |>
condense(total_cases = sum(total_cases), total_deaths = sum(total_deaths))
The script below was used to tidy the text in the global COVID timeline.
The dates and the corresponding events were extract and stored as a dataframe.
dates <- global_timeline |> str_extract_all(pattern = "\\w{3,9}\\s[:digit:]{1,2},\\s[:digit:]{4}(?=\\s-\\s)") |> unlist()
events <- c()
for (i in 1:length(dates)){
pattern <- paste0("(?<=",dates[i],"\\s-\\s).+(?=<br />)")
events <- c(events, str_extract(global_timeline, pattern))
rm(pattern)
}
#length(events) == length(dates) # check number of events == check number of events
# store tidied text in csv file
global_event_timeline <- data.frame("date"=mdy(dates), "events"=events)
readr::write_csv(global_event_timeline, "global_covid_event_timeline.csv")
rm(global_timeline,dates,events)
The following interactive Plotly graph shows the key events and accumulative number of COVID-19 cases worldwide since 2020-01-01
Key events were obtained from “CNN’s COVID-19 facts and timeline”
The following script was used to extract the dates and NZ events during the pandemic.
# extract event dates
dates <- nz_timeline |> str_extract_all(pattern = "(?<!\\son\\s)[:digit:]{1,2}\\s\\w{3,9}\\s[:digit:]{4}(?=\n)") |> unlist()
# extract events by date
nz_timeline <- nz_timeline |> str_remove_all(pattern = "\n") # remove line breaks
events <- c()
for (i in 1:length(dates)) {
if (i == length(dates)) {
pattern = paste0("(?<=", dates[i],").+")
} else {
pattern = paste0("(?<=",dates[i],").+","(?=",dates[i+1],")")
}
events <- c(events, str_extract(nz_timeline, pattern))
rm(pattern)
}
events <- lapply(events, str_trim) |> unlist() # trim spaces at start and end of string
events <- lapply(events, function(x){gsub("\\.",".<br />", x)}) |> unlist()
# length(events) == length(dates) # check: num of events extracted == number of dates extracted
# construct data.table for events
nz_event_timeline <- data.frame("date"=dmy(dates), "events"=events)
readr::write_csv(nz_event_timeline, "nz_covid_event_timeline.csv")
rm(nz_timeline, dates, events)
Here are the key events in New Zealand’s COVID-19 response plotted with total cases accumulated in New Zealand.
The following script was used to summarize and extract information about the countries that account for highest 80% COVID-19 cases.
top_80 <-
raw |>
filter_rows(continent != "") |>
sel_cols(date, continent, location, total_cases, total_deaths, population_density)
top_80$total_deaths <- replace(top_80$total_deaths, is.na(top_80$total_deaths),0)
top_80$total_cases <- replace(top_80$total_cases, is.na(top_80$total_cases),0)
top_80 <-
top_80 |>
set_group(continent,location) |>
condense(total_cases = sum(total_cases),
total_deaths = sum(total_deaths),
population_density = mean(population_density)) |>
# filter_rows(!is.na(total_cases)) |>
descend_rows(total_cases)
top_80_count = 1
cumulative_total_cases = 0
total_cases_to_date = sum(top_80$total_cases)
while ((cumulative_total_cases/total_cases_to_date) <= 0.8) {
cumulative_total_cases <- sum(top_80$total_cases[1:top_80_count])
top_80_count <- top_80_count + 1
}
top_80 <- top_80[1:top_80_count,]
top_80 <-
top_80 |>
def_cols(death_rate_pct ~ round(total_deaths / total_cases, digits=3) * 100) |>
descend_rows(death_rate_pct)
These countries are listed below in descending order of death rate. From the data Peru and Mexico has had the highest death rate over the period studied.
| continent | location | total_cases | total_deaths | population_density | death_rate_pct |
|---|---|---|---|---|---|
| South America | Peru | 1252827004 | 104996637 | 25.129 | 8.4 |
| North America | Mexico | 1858756703 | 140976676 | 66.444 | 7.6 |
| Asia | Indonesia | 1736870388 | 52128658 | 145.725 | 3.0 |
| Africa | South Africa | 1391726309 | 39983246 | 46.754 | 2.9 |
| South America | Brazil | 10691300560 | 280053586 | 25.040 | 2.6 |
| South America | Colombia | 2307152015 | 57696362 | 44.223 | 2.5 |
| Europe | Russia | 4467046120 | 105503439 | 8.823 | 2.4 |
| Asia | Iran | 2342521047 | 56989667 | 49.831 | 2.4 |
| Europe | Poland | 1749005476 | 40053410 | 124.027 | 2.3 |
| Europe | Ukraine | 1514386134 | 34859741 | 77.390 | 2.3 |
| Europe | Italy | 3289787034 | 73998887 | 205.859 | 2.2 |
| South America | Chile | 963038016 | 19604803 | 24.282 | 2.0 |
| South America | Argentina | 2684604299 | 51592250 | 16.177 | 1.9 |
| Europe | Spain | 3027291834 | 50081599 | 93.105 | 1.7 |
| Asia | Philippines | 1090891843 | 18175288 | 351.873 | 1.7 |
| North America | Canada | 962911738 | 16614813 | 4.037 | 1.7 |
| Europe | Belgium | 911153941 | 15761395 | 375.564 | 1.7 |
| North America | United States | 24832438504 | 390911437 | 35.608 | 1.6 |
| Europe | United Kingdom | 4868483948 | 79223814 | 272.898 | 1.6 |
| Europe | Czechia | 1073151524 | 15708044 | 137.176 | 1.5 |
| Europe | Germany | 3572273153 | 49855674 | 237.016 | 1.4 |
| Asia | India | 15672119463 | 204297650 | 450.419 | 1.3 |
| Europe | France | 5221160359 | 65344650 | 122.578 | 1.3 |
| Asia | Malaysia | 907334233 | 8663202 | 96.254 | 1.0 |
| Asia | Turkey | 3850946112 | 32248696 | 104.914 | 0.8 |
| Asia | Japan | 1031797442 | 8287894 | 347.778 | 0.8 |
| Europe | Netherlands | 1505610530 | 10935197 | 508.544 | 0.7 |
| Asia | South Korea | 1047311072 | 2525281 | 527.967 | 0.2 |
These countries are plotted in the interactive Plotly map below.
Bubble markers proportional to the number of total cases (yellow) and total deaths (red) are added.
Peru has a 8.4% death rate (proportion of number of deaths to number of cases), which is the highest out of these countries.
The second highest country is Mexico at 7.6% death rate.
The same data is represented as bar charts below in ascending order of total cases or total deaths.
The USA is highest in total cases and deaths.
Russia has a relatively low number of death and cases, but considering the population density, the rate of total cases and deaths are relatively high.
India has a high number of cases and deaths but the rate of total cases and deaths are relatively low when population density is taken into account.
The following time series summarizes the number of daily new cases reported in different continents during the pandemic.
According to the data, it appears that Europe had the greatest surge around Jan 2022 with the USA being second and Asia being third around the same time. However, there can be a number of factors involved that would affect the data, including the system used for reporting and recording new cases, people’s proactiveness in reporting, testing capacity, etc.
The following plot shows that NZ cases surged at around the same time in Jan 2022 as the other countries/continents.
General knowledge tells us that Jan 2022 was around the time when the Delta variant was at its tail end and the Omicron variant became prevalent.
It is common sense to expect the number of new tests recorded would be correlated to the numbe of new cases.
The analysis using linear regression below shows a very small p-value, indicating number of new tests is a signficant factor in relation to the number of new case.
## Analysis of Variance Table
##
## Response: new_cases_smoothed_per_million
## Df Sum Sq Mean Sq F value Pr(>F)
## new_tests_per_thousand 1 61150521 61150521 196.11 < 2.2e-16 ***
## Residuals 375 116933823 311824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1