INTRODUCTION

Numerous analyses of COVID-19 have been published on the internet and in research papers. Data from the pandemic is also very useful for creating educational material. The Johns Hopkins University (JHU) data repository contains large open data sets on the pandemic.

In this notebook, I showcase the use of this data resource. The aims are as follows:

  1. Use the JHU data as teaching material for the R language
  2. Use the JHU data as teaching material for data analysis
  3. Compare data between countries (South Africa, Germany, United Kingdom)
  4. Look ahead at what may happen in South Africa in early 2021

LIBRARIES

The following libraries are imported for use in this notebook.

library(readr) # Importing the data from the internet
library(plotly) # Creating data visualization
library(DT) # Create tables

DATA

The data is hosted in a JHU GitHub repository and is updated regularly. It is stored as a comma-separated values (CSV) file, which the readr::read_csv() function can import directly from the repository.

confirmedraw <- readr::read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

The data set is in wide format. Each new day is added as a column. The rows consist of countries and regions. The names() function shows all the column headers.

names(confirmedraw)
##   [1] "Province/State" "Country/Region" "Lat"            "Long"          
##   [5] "1/22/20"        "1/23/20"        "1/24/20"        "1/25/20"       
##   [9] "1/26/20"        "1/27/20"        "1/28/20"        "1/29/20"       
##  [13] "1/30/20"        "1/31/20"        "2/1/20"         "2/2/20"        
##  [17] "2/3/20"         "2/4/20"         "2/5/20"         "2/6/20"        
##  [21] "2/7/20"         "2/8/20"         "2/9/20"         "2/10/20"       
##  [25] "2/11/20"        "2/12/20"        "2/13/20"        "2/14/20"       
##  [29] "2/15/20"        "2/16/20"        "2/17/20"        "2/18/20"       
##  [33] "2/19/20"        "2/20/20"        "2/21/20"        "2/22/20"       
##  [37] "2/23/20"        "2/24/20"        "2/25/20"        "2/26/20"       
##  [41] "2/27/20"        "2/28/20"        "2/29/20"        "3/1/20"        
##  [45] "3/2/20"         "3/3/20"         "3/4/20"         "3/5/20"        
##  [49] "3/6/20"         "3/7/20"         "3/8/20"         "3/9/20"        
##  [53] "3/10/20"        "3/11/20"        "3/12/20"        "3/13/20"       
##  [57] "3/14/20"        "3/15/20"        "3/16/20"        "3/17/20"       
##  [61] "3/18/20"        "3/19/20"        "3/20/20"        "3/21/20"       
##  [65] "3/22/20"        "3/23/20"        "3/24/20"        "3/25/20"       
##  [69] "3/26/20"        "3/27/20"        "3/28/20"        "3/29/20"       
##  [73] "3/30/20"        "3/31/20"        "4/1/20"         "4/2/20"        
##  [77] "4/3/20"         "4/4/20"         "4/5/20"         "4/6/20"        
##  [81] "4/7/20"         "4/8/20"         "4/9/20"         "4/10/20"       
##  [85] "4/11/20"        "4/12/20"        "4/13/20"        "4/14/20"       
##  [89] "4/15/20"        "4/16/20"        "4/17/20"        "4/18/20"       
##  [93] "4/19/20"        "4/20/20"        "4/21/20"        "4/22/20"       
##  [97] "4/23/20"        "4/24/20"        "4/25/20"        "4/26/20"       
## [101] "4/27/20"        "4/28/20"        "4/29/20"        "4/30/20"       
## [105] "5/1/20"         "5/2/20"         "5/3/20"         "5/4/20"        
## [109] "5/5/20"         "5/6/20"         "5/7/20"         "5/8/20"        
## [113] "5/9/20"         "5/10/20"        "5/11/20"        "5/12/20"       
## [117] "5/13/20"        "5/14/20"        "5/15/20"        "5/16/20"       
## [121] "5/17/20"        "5/18/20"        "5/19/20"        "5/20/20"       
## [125] "5/21/20"        "5/22/20"        "5/23/20"        "5/24/20"       
## [129] "5/25/20"        "5/26/20"        "5/27/20"        "5/28/20"       
## [133] "5/29/20"        "5/30/20"        "5/31/20"        "6/1/20"        
## [137] "6/2/20"         "6/3/20"         "6/4/20"         "6/5/20"        
## [141] "6/6/20"         "6/7/20"         "6/8/20"         "6/9/20"        
## [145] "6/10/20"        "6/11/20"        "6/12/20"        "6/13/20"       
## [149] "6/14/20"        "6/15/20"        "6/16/20"        "6/17/20"       
## [153] "6/18/20"        "6/19/20"        "6/20/20"        "6/21/20"       
## [157] "6/22/20"        "6/23/20"        "6/24/20"        "6/25/20"       
## [161] "6/26/20"        "6/27/20"        "6/28/20"        "6/29/20"       
## [165] "6/30/20"        "7/1/20"         "7/2/20"         "7/3/20"        
## [169] "7/4/20"         "7/5/20"         "7/6/20"         "7/7/20"        
## [173] "7/8/20"         "7/9/20"         "7/10/20"        "7/11/20"       
## [177] "7/12/20"        "7/13/20"        "7/14/20"        "7/15/20"       
## [181] "7/16/20"        "7/17/20"        "7/18/20"        "7/19/20"       
## [185] "7/20/20"        "7/21/20"        "7/22/20"        "7/23/20"       
## [189] "7/24/20"        "7/25/20"        "7/26/20"        "7/27/20"       
## [193] "7/28/20"        "7/29/20"        "7/30/20"        "7/31/20"       
## [197] "8/1/20"         "8/2/20"         "8/3/20"         "8/4/20"        
## [201] "8/5/20"         "8/6/20"         "8/7/20"         "8/8/20"        
## [205] "8/9/20"         "8/10/20"        "8/11/20"        "8/12/20"       
## [209] "8/13/20"        "8/14/20"        "8/15/20"        "8/16/20"       
## [213] "8/17/20"        "8/18/20"        "8/19/20"        "8/20/20"       
## [217] "8/21/20"        "8/22/20"        "8/23/20"        "8/24/20"       
## [221] "8/25/20"        "8/26/20"        "8/27/20"        "8/28/20"       
## [225] "8/29/20"        "8/30/20"        "8/31/20"        "9/1/20"        
## [229] "9/2/20"         "9/3/20"         "9/4/20"         "9/5/20"        
## [233] "9/6/20"         "9/7/20"         "9/8/20"         "9/9/20"        
## [237] "9/10/20"        "9/11/20"        "9/12/20"        "9/13/20"       
## [241] "9/14/20"        "9/15/20"        "9/16/20"        "9/17/20"       
## [245] "9/18/20"        "9/19/20"        "9/20/20"        "9/21/20"       
## [249] "9/22/20"        "9/23/20"        "9/24/20"        "9/25/20"       
## [253] "9/26/20"        "9/27/20"        "9/28/20"        "9/29/20"       
## [257] "9/30/20"        "10/1/20"        "10/2/20"        "10/3/20"       
## [261] "10/4/20"        "10/5/20"        "10/6/20"        "10/7/20"       
## [265] "10/8/20"        "10/9/20"        "10/10/20"       "10/11/20"      
## [269] "10/12/20"       "10/13/20"       "10/14/20"       "10/15/20"      
## [273] "10/16/20"       "10/17/20"       "10/18/20"       "10/19/20"      
## [277] "10/20/20"       "10/21/20"       "10/22/20"       "10/23/20"      
## [281] "10/24/20"       "10/25/20"       "10/26/20"       "10/27/20"      
## [285] "10/28/20"       "10/29/20"       "10/30/20"       "10/31/20"      
## [289] "11/1/20"        "11/2/20"        "11/3/20"        "11/4/20"       
## [293] "11/5/20"        "11/6/20"        "11/7/20"        "11/8/20"       
## [297] "11/9/20"        "11/10/20"       "11/11/20"       "11/12/20"      
## [301] "11/13/20"       "11/14/20"       "11/15/20"       "11/16/20"      
## [305] "11/17/20"       "11/18/20"       "11/19/20"       "11/20/20"      
## [309] "11/21/20"       "11/22/20"       "11/23/20"       "11/24/20"      
## [313] "11/25/20"       "11/26/20"       "11/27/20"       "11/28/20"      
## [317] "11/29/20"       "11/30/20"       "12/1/20"        "12/2/20"       
## [321] "12/3/20"        "12/4/20"        "12/5/20"        "12/6/20"       
## [325] "12/7/20"        "12/8/20"        "12/9/20"        "12/10/20"      
## [329] "12/11/20"       "12/12/20"       "12/13/20"       "12/14/20"      
## [333] "12/15/20"       "12/16/20"       "12/17/20"       "12/18/20"      
## [337] "12/19/20"       "12/20/20"       "12/21/20"       "12/22/20"      
## [341] "12/23/20"       "12/24/20"       "12/25/20"       "12/26/20"      
## [345] "12/27/20"       "12/28/20"

We note that the first four columns are “Province/State”, “Country/Region”, “Lat”, and “Long”. The rest are dates starting on 22 January 2020.

As this data set is updated daily, the number of columns changes daily. In the code chunk below, we count the columns in the file imported above. The ncol() function returns the number of columns. We store this value in a computer variable named last.day.number. Because the value is computed each time the code is run, we do not need to look up the last date manually on subsequent days.

last.day.number <- ncol(confirmedraw)

The date column headers are stored as strings. In the code cell below, we extract the last column header (the last date in the current data set) and save it as a date object using the as.Date() function. This date is stored in a computer variable called last.date.

last.date <- as.Date(names(confirmedraw)[ncol(confirmedraw)], format = "%m/%d/%y")
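
As a minimal sketch of the header parsing described above, the chunk below builds a small stand-in for the JHU file and converts its last column header to a date. The data frame toy, its values, and the names last.col, last.header, and toy.last.date are made up for illustration and are not part of the JHU data.

```r
# Toy stand-in for the JHU file: four metadata columns, then one column per day.
# All values are invented for illustration only.
toy <- data.frame(
  "Province/State" = c(NA, NA),
  "Country/Region" = c("South Africa", "Germany"),
  "Lat" = c(-30.6, 51.2),
  "Long" = c(22.9, 10.5),
  "12/27/20" = c(100, 200),
  "12/28/20" = c(110, 230),
  check.names = FALSE  # keep the headers exactly as written
)

last.col <- ncol(toy)                # position of the most recent date column
last.header <- names(toy)[last.col]  # still a string at this point
toy.last.date <- as.Date(last.header, format = "%m/%d/%y")

print(toy.last.date)
```

The format string "%m/%d/%y" reads the header as month, day, and two-digit year, which matches the headers shown in the names() output.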

EXTRACTING DATA FROM SOUTH AFRICA

We can search for South Africa in the Country/Region column and extract only the row of data pertaining to this country.

sa <- confirmedraw[confirmedraw$`Country/Region` == "South Africa", ]

Since the data is in wide format, we can extract the actual case values as a vector object. The as.numeric() function is used below. We index the first (and only) row and then column \(5\) to the last column (stored as last.day.number). We also create a sequence of dates from 22 January 2020 to the last date in the data set stored as last.date. We name the case number vector sa.cases and the date sequence dates. The latter is created using the seq() function.

sa.cases <- as.numeric(sa[1, 5:last.day.number])
dates <- seq(as.Date("2020/01/22"), last.date, "days")

Now we store the two vectors as columns in a new data frame object named df. The data frame is in long form.

df <- data.frame(Day = dates, RSACases = sa.cases)
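
The wide-to-long step can also be sketched on a toy data set. In this variant, the dates are read from the column headers rather than generated with seq(), which is one possible way to guard against a mismatch between the two vectors. The data frame toy and the names day.columns, cases, days, and toy.long are hypothetical and used only for this illustration.

```r
# One-row toy data set in the same wide layout (invented values).
toy <- data.frame(
  "Province/State" = NA,
  "Country/Region" = "South Africa",
  "Lat" = -30.6,
  "Long" = 22.9,
  "1/22/20" = 0,
  "1/23/20" = 0,
  "1/24/20" = 2,
  check.names = FALSE
)

day.columns <- 5:ncol(toy)  # columns 1 to 4 hold metadata

# Case counts for the first (and only) row, as a plain numeric vector.
cases <- as.numeric(unlist(toy[1, day.columns]))

# Dates parsed from the headers of the same columns.
days <- as.Date(names(toy)[day.columns], format = "%m/%d/%y")

# Long form: one row per day.
toy.long <- data.frame(Day = days, Cases = cases)
print(toy.long)
```

Both vectors come from the same set of columns, so they always have the same length.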

The datatable function from the DT package can be used to view the new data frame object.

DT::datatable(df)

VISUALIZING SOUTH AFRICAN DATA

The number of cases in our data frame object is a cumulative (running) total. Below, we use the plotly library to plot this total against the date.

plot_ly(x = df$Day, y = df$RSACases, mode = "markers", type = "scatter", name = "RSA") %>% 
  layout(title = "Total cases in South Africa",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

We notice that the second wave is well under way. We can create a new column to show the daily number of new cases. This is achieved by subtracting the previous day’s running total from the current day’s running total, using the lag() function from the dplyr library. The first day has no previous day, so its value is missing (NA).

df$RSADaily <- df$RSACases - dplyr::lag(df$RSACases, n = 1)
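
To make the subtraction concrete, here is a small worked example on a made-up running total. It uses the base R diff() function instead of dplyr::lag(), as an equivalent alternative for this purpose; the vectors cumulative and daily are hypothetical.

```r
# Made-up running total over five days.
cumulative <- c(1, 3, 7, 7, 12)

# diff() returns the day-to-day differences; the first day has no
# previous day, so we pad the result with NA to keep the length the same.
daily <- c(NA, diff(cumulative))

print(daily)  # NA 2 4 0 5
```

Adding the daily values back to the first running total recovers the last running total, which is a quick way to check the calculation.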

Below, we visualize the new cases per day for South Africa. The vertical line segments indicate the dates on which the various levels of lockdown were instituted.

plot_ly(x = df$Day, y = df$RSADaily, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_segments(x = as.Date("2020-03-26"), xend = as.Date("2020-03-26"), y = 0, yend = 14000, name = "Level 5") %>% 
  add_segments(x = as.Date("2020-05-01"), xend = as.Date("2020-05-01"), y = 0, yend = 14000, name = "Level 4") %>% 
  add_segments(x = as.Date("2020-06-01"), xend = as.Date("2020-06-01"), y = 0, yend = 14000, name = "Level 3") %>% 
  add_segments(x = as.Date("2020-08-18"), xend = as.Date("2020-08-18"), y = 0, yend = 14000, name = "Level 2") %>% # Level 2 took effect on 18 August 2020
  layout(title = "Daily cases in South Africa",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

ADDING GERMANY AND THE UNITED KINGDOM FOR COMPARISON

Countries in Europe are ahead of South Africa in the COVID-19 timeline. Two European countries with population sizes broadly comparable to that of South Africa are Germany and the United Kingdom. In the code chunk below, we go through the same steps as before to add data for Germany to our data frame object.

germany <- confirmedraw[confirmedraw$`Country/Region` == "Germany", ]
germany.cases <- as.numeric(germany[1, 5:last.day.number])
df$GermanyCases <- germany.cases
df$GermanyDaily <- df$GermanyCases - dplyr::lag(df$GermanyCases, n = 1)
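
Since the same steps are repeated for each country, they could be wrapped in a small helper function. The sketch below is one possible way to do so, shown on invented data; the function country_cases(), its argument names, and the data frame toy are hypothetical and not part of the notebook above. The helper assumes that the country has exactly one row in the data.

```r
# Hypothetical helper: total and daily cases for one country.
# Assumes the day columns start at column 5 and that the country
# has exactly one row in the data.
country_cases <- function(data, country, first.day.col = 5) {
  row <- data[data[["Country/Region"]] == country, ]
  stopifnot(nrow(row) == 1)
  total <- as.numeric(unlist(row[1, first.day.col:ncol(data)]))
  data.frame(Total = total, Daily = c(NA, diff(total)))
}

# Toy data in the same wide layout (invented values).
toy <- data.frame(
  "Province/State" = c(NA, NA),
  "Country/Region" = c("South Africa", "Germany"),
  "Lat" = c(-30.6, 51.2),
  "Long" = c(22.9, 10.5),
  "1/22/20" = c(0, 0),
  "1/23/20" = c(0, 1),
  "1/24/20" = c(2, 5),
  check.names = FALSE
)

germany.toy <- country_cases(toy, "Germany")
print(germany.toy)
```

The stopifnot() call makes the helper fail loudly for countries that span several rows, such as those reported by province or territory.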

We can now visualize the difference between South Africa and Germany with respect to the total number of cases.

plot_ly(x = df$Day, y = df$RSACases, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyCases, mode = "markers", type = "scatter", name = "Germany") %>% 
  layout(title = "Total cases in South Africa and Germany",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

We notice that there are more cases in Germany than in South Africa. We also note that South Africa lags behind in the timeline. However, South Africa had a much higher total number of cases after the first wave.

Below, we also view the difference in the daily number of cases.

plot_ly(x = df$Day, y = df$RSADaily, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyDaily, mode = "markers", type = "scatter", name = "Germany") %>% 
  layout(title = "Daily cases in South Africa and Germany",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

The comparison might not be fair, as the two populations differ by about \(25000000\) people. The actual population sizes of these countries are not known exactly, especially in the case of South Africa. Below, we store the official 2018 population sizes, expressed in units of \(100000\) people.

sa.population <- 57780000 / 100000
germany.population <- 83020000 / 100000

We can now add new columns to our data frame object and express the total number of cases and the daily cases per \(100000\) people.

df$RSACasesPC <- df$RSACases / sa.population
df$RSADailyPC <- df$RSADaily / sa.population
df$GermanyCasesPC <- df$GermanyCases / germany.population
df$GermanyDailyPC <- df$GermanyDaily / germany.population
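
The arithmetic behind the normalization can be illustrated with a short worked example. The function per_100k() and the case count of \(1000000\) are hypothetical and chosen only to show the calculation; the population sizes are the ones used above.

```r
# Hypothetical helper: cases per 100000 people.
per_100k <- function(cases, population) {
  cases / (population / 100000)
}

sa.pop <- 57780000       # South Africa, as used above
germany.pop <- 83020000  # Germany, as used above

# The same made-up case count gives a different rate in each country.
illustrative.cases <- 1000000
sa.rate <- per_100k(illustrative.cases, sa.pop)            # about 1730.7
germany.rate <- per_100k(illustrative.cases, germany.pop)  # about 1204.5

print(round(c(RSA = sa.rate, Germany = germany.rate), 1))
```

The same number of cases corresponds to a higher rate in the country with the smaller population, which is why the normalized plots can look different from the plots of raw counts.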

We can visualize this normalized data for the total number of cases and the daily cases.

plot_ly(x = df$Day, y = df$RSACasesPC, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyCasesPC, mode = "markers", type = "scatter", name = "Germany") %>% 
  layout(title = "Total cases in South Africa and Germany per 100K people",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))
plot_ly(x = df$Day, y = df$RSADailyPC, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyDailyPC, mode = "markers", type = "scatter", name = "Germany") %>% 
  layout(title = "Daily cases in South Africa and Germany per 100K people",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

We go through this exercise again and add the United Kingdom (UK). At the time of writing, there are \(11\) rows containing data for the UK. The first ten are for its territories and the last is for the UK proper.

uk <- confirmedraw[confirmedraw$`Country/Region` == "United Kingdom", ][11, ]
uk.cases <- as.numeric(uk[1, 5:last.day.number])
df$UKCases <- uk.cases
df$UKDaily <- df$UKCases - dplyr::lag(df$UKCases, n = 1)
uk.population <- 66650000 / 100000
df$UKCasesPC <- df$UKCases / uk.population
df$UKDailyPC <- df$UKDaily / uk.population
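
Selecting the UK row by its position relies on the number and order of rows staying the same in future updates. A possibly more robust alternative is sketched below on invented data. It assumes that the country-level row has an empty Province/State entry, which is imported as NA; this assumption should be checked against the actual file. The data frame toy and the names is.uk, is.country.level, and uk.toy are hypothetical.

```r
# Toy data: two territories and the country-level row (invented values).
toy <- data.frame(
  "Province/State" = c("Bermuda", "Gibraltar", NA),
  "Country/Region" = rep("United Kingdom", 3),
  "Lat" = c(32.3, 36.1, 55.4),
  "Long" = c(-64.8, -5.4, -3.4),
  "12/27/20" = c(10, 20, 3000),
  "12/28/20" = c(11, 22, 3100),
  check.names = FALSE
)

is.uk <- toy[["Country/Region"]] == "United Kingdom"
is.country.level <- is.na(toy[["Province/State"]])  # assumed marker of the UK proper

uk.toy <- toy[is.uk & is.country.level, ]
stopifnot(nrow(uk.toy) == 1)  # fail loudly if the assumption does not hold

print(uk.toy)
```

If territories are added or reordered in the file, this selection still returns the single country-level row, or stops with an error when the assumption fails.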

Finally, we visualize the total cases and the daily cases for all three countries.

plot_ly(x = df$Day, y = df$RSACasesPC, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyCasesPC, mode = "markers", type = "scatter", name = "Germany") %>% 
  add_trace(x = df$Day, y = df$UKCasesPC, mode = "markers", type = "scatter", name = "UK") %>% 
  layout(title = "Total cases in RSA, Germany, and the UK per 100K people",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))
plot_ly(x = df$Day, y = df$RSADailyPC, mode = "markers", type = "scatter", name = "RSA") %>% 
  add_trace(x = df$Day, y = df$GermanyDailyPC, mode = "markers", type = "scatter", name = "Germany") %>% 
  add_trace(x = df$Day, y = df$UKDailyPC, mode = "markers", type = "scatter", name = "UK") %>% 
  layout(title = "Daily cases in RSA, Germany, and the UK per 100K people",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Number of cases"))

CONCLUSION

South Africa lags behind Germany and the United Kingdom in the timeline of COVID-19. Cases in South Africa were much higher after the first wave. It may be that the case load will be very high in the first part of 2021.

Although a currently circulating strain of SARS-CoV-2 is considered to be more infective, there might be confounding factors. There is great concern about human activities and interactions, especially since the progressive lifting of restrictions. The festive season may worsen upcoming case numbers.

Seroprevalence studies in South Africa are showing a much higher level of infection than the confirmed case numbers report. Vaccines will take the better part of 2021 to reach large parts of South Africa.