Disclaimer

After working on this notebook for a long time, I sadly realised the data source have unexplained artefacts (see final section). Due to this, I decided to make another notebook with a new data source (which is where you may come from). If you are learning data science or R, you may still find this notebook helpful. It illustrates very well the painful (but extremely important) process of data cleaning that all data scientist should know intimately.

Libraries

We will use 3 libraries for data cleaning & wrangling: lubridate(for dates), dplyr and tidyr(for wrangling).

Raw data

Our data comes from Kaggle’s Novel-Coronavirus dataset. This dataset includes several files you can see below.

tree ../RawData
## ../RawData
## ├── COVID19_line_list_data.csv
## ├── COVID19_open_line_list.csv
## ├── covid_19_data.csv
## ├── time_series_covid_19_confirmed.csv
## ├── time_series_covid_19_confirmed_US.csv
## ├── time_series_covid_19_deaths.csv
## ├── time_series_covid_19_deaths_US.csv
## └── time_series_covid_19_recovered.csv
## 
## 0 directories, 8 files

Out of this files I only ended up using 4 files: covid_19_data.csv, time_series_covid_19_{confirmed, deaths, recovered}.csv. Let’s quickly load these files and inspect what they contain:

covid_data = read.csv("../RawData/covid_19_data.csv")
deaths_time = read.csv("../RawData/time_series_covid_19_deaths.csv")
confirmed_time = read.csv("../RawData/time_series_covid_19_confirmed.csv")
recovered_time = read.csv("../RawData/time_series_covid_19_recovered.csv")

As we see below covid_19_data.csv contains 116805 rows and 8 columns. Each row includes the number of confirmed, recovered and deaths recorded in a corresponding province (and respective country) at a given date.

head(covid_data, 10)
##    SNo ObservationDate Province.State Country.Region     Last.Update Confirmed
## 1    1      01/22/2020          Anhui Mainland China 1/22/2020 17:00         1
## 2    2      01/22/2020        Beijing Mainland China 1/22/2020 17:00        14
## 3    3      01/22/2020      Chongqing Mainland China 1/22/2020 17:00         6
## 4    4      01/22/2020         Fujian Mainland China 1/22/2020 17:00         1
## 5    5      01/22/2020          Gansu Mainland China 1/22/2020 17:00         0
## 6    6      01/22/2020      Guangdong Mainland China 1/22/2020 17:00        26
## 7    7      01/22/2020        Guangxi Mainland China 1/22/2020 17:00         2
## 8    8      01/22/2020        Guizhou Mainland China 1/22/2020 17:00         1
## 9    9      01/22/2020         Hainan Mainland China 1/22/2020 17:00         4
## 10  10      01/22/2020          Hebei Mainland China 1/22/2020 17:00         1
##    Deaths Recovered
## 1       0         0
## 2       0         0
## 3       0         0
## 4       0         0
## 5       0         0
## 6       0         0
## 7       0         0
## 8       0         0
## 9       0         0
## 10      0         0
tail(covid_data, 10)
##           SNo ObservationDate     Province.State Country.Region
## 116796 116796      09/23/2020              Yukon         Canada
## 116797 116797      09/23/2020             Yunnan Mainland China
## 116798 116798      09/23/2020   Zabaykalsky Krai         Russia
## 116799 116799      09/23/2020          Zacatecas         Mexico
## 116800 116800      09/23/2020 Zakarpattia Oblast        Ukraine
## 116801 116801      09/23/2020  Zaporizhia Oblast        Ukraine
## 116802 116802      09/23/2020            Zeeland    Netherlands
## 116803 116803      09/23/2020           Zhejiang Mainland China
## 116804 116804      09/23/2020    Zhytomyr Oblast        Ukraine
## 116805 116805      09/23/2020       Zuid-Holland    Netherlands
##                Last.Update Confirmed Deaths Recovered
## 116796 2020-09-24 04:23:38        15      0        15
## 116797 2020-09-24 04:23:38       209      2       196
## 116798 2020-09-24 04:23:38      5727     66      4703
## 116799 2020-09-24 04:23:38      6921    661      5711
## 116800 2020-09-24 04:23:38      9343    294      4295
## 116801 2020-09-24 04:23:38      3149     49      1158
## 116802 2020-09-24 04:23:38      1270     72         0
## 116803 2020-09-24 04:23:38      1282      1      1272
## 116804 2020-09-24 04:23:38      5191     92      2853
## 116805 2020-09-24 04:23:38     29513   1372         0

The time series files are a bit different. The deaths and confirmed tables contain 266 whereas the recovered table contains 253. These tables contain 4 columns that never change (the first 4 - province, country, latitude and longitude). Each of the remaining columns corresponds to the number of deaths, confirmed or recovered cases for every recorded day (one column == one day). This to me is weird.

head(confirmed_time[,1:10])
##   Province.State      Country.Region       Lat      Long X1.22.20 X1.23.20
## 1                        Afghanistan  33.93911  67.70995        0        0
## 2                            Albania  41.15330  20.16830        0        0
## 3                            Algeria  28.03390   1.65960        0        0
## 4                            Andorra  42.50630   1.52180        0        0
## 5                             Angola -11.20270  17.87390        0        0
## 6                Antigua and Barbuda  17.06080 -61.79640        0        0
##   X1.24.20 X1.25.20 X1.26.20 X1.27.20
## 1        0        0        0        0
## 2        0        0        0        0
## 3        0        0        0        0
## 4        0        0        0        0
## 5        0        0        0        0
## 6        0        0        0        0

The covid_data table does not contain the continent each country belongs to. We will be using a utility table which contains a list of all countries in the World and their corresponding continents.

Note

In this notebook I am sort of cheating. Data cleaning and wrangling is about putting your data in a format that serves as input for a model or plotting function. When I made this project I first investigated which plots I wanted to make and which input formats were necessary for such plots. In the interest of coherence, I’ll flash-forward and tell you we’ll be making the following plots:

We will be making 3 tables for all these charts: the covid_data table, the time_series table and the clean_time_series table.

Up to date table

We will be adding several extra columns to the table (on top of the ones described earlier). First we will make a column with the number of Active cases (= Confirmed-Deaths-Recovered)

## Calculate active cases
covid_data$Active = covid_data$Confirmed - covid_data$Deaths-covid_data$Recovered

## Change format of date from character to Date
covid_data$ObservationDate = mdy(covid_data$ObservationDate)

## Read utils table
countryToContinent = read.csv("../UtilsData/Countries-Continents.csv")

## Add countries that were missing or not properly formatted
extra_columns = cbind(Continent = c(rep("Asia",5),
                                    rep("Europe",2),
                                    "Africa"),
                      Country = c("Hong Kong", "Mainland China", "Macau", "Taiwan","South Korea",
                                  "UK", "Czech Republic",
                                  "Burkina Faso"))
countryToContinent = rbind(countryToContinent, extra_columns)

## Add continent column to covid_data
covid_data = merge(covid_data, countryToContinent,
                   by.x = "Country.Region",
                   by.y = "Country",
                   all.x = TRUE,
                   sort = FALSE)

Some regions do not come with a corresponding country (e.g. Gaza Strip, Russia). To simplify things we will simply label these as having Other/Missing continent. After this we will save the dataset as an .RDS file. Find below the list of those country/regions with missing continent information:

## List of country/regions with missing continent info
unique(covid_data %>% filter(is.na(Continent)) %>% select(Country.Region))
##                      Country.Region
## 1                            Russia
## 3                West Bank and Gaza
## 4                    Western Sahara
## 8                             Burma
## 10                       Cabo Verde
## 11                  North Macedonia
## 13                           Kosovo
## 16              Congo (Brazzaville)
## 17                 Congo (Kinshasa)
## 24                 Diamond Princess
## 50                         Holy See
## 63                      Timor-Leste
## 80                    Faroe Islands
## 86                         Eswatini
## 128                      MS Zaandam
## 146                     The Bahamas
## 300                          Others
## 431                     Gambia, The
## 695                    Bahamas, The
## 786                       Palestine
## 853                Saint Barthelemy
## 894                       Gibraltar
## 918                      Guadeloupe
## 928                         Reunion
## 936                      Martinique
## 1089                  French Guiana
## 1132                     St. Martin
## 1406                Channel Islands
## 1655 occupied Palestinian territory
## 1975                 Cayman Islands
## 2306                         Jersey
## 2332                       Guernsey
## 2664                     Azerbaijan
## 2685                  North Ireland
## 2863                        Mayotte
## 3108                           Guam
## 3136                      Greenland
## 3149                     The Gambia
## 3156                    Puerto Rico
## 3732          Republic of the Congo
## 4067            Republic of Ireland
## 4675                ('St. Martin',)
## 5280                          Aruba
## 5605                        Curacao
## replace NA continent by Other/Missing label
covid_data$Continent = ifelse(is.na(covid_data$Continent),
                              "Other/Missing",
                              covid_data$Continent)

We will call the stored file uptodate.rds (I know, name could be better)

saveRDS(covid_data, '../ProcessedData/uptodate.rds')

Time-series data - by Country

The data in the time_series tables sometimes is split across regions within a country (such as states in the US, or regions in Australia). Because we are making a global dashboard, we are going to extract the values by country only. To to do this, we’ve written a short function called TimeByCountry():

TimeByCountry = function(data) {
  
  ## sum values by country
  over.country = data %>% 
    select(-c(Lat, Long, Province.State)) %>%
    group_by(Country.Region) %>% summarise_all(sum) %>% ungroup()
  
  return(over.country)
}

In addition to this we, will also merge this data with the country2continent table.

## At this point data is by country (no continent data) and contains latitude and longitude info
deaths_time = TimeByCountry(deaths_time)
confirmed_time = TimeByCountry(confirmed_time)
recovered_time = TimeByCountry(recovered_time)

## Merge with continent data
deaths_time = merge(countryToContinent,deaths_time,  by.y = "Country.Region", by.x = "Country")
confirmed_time = merge(countryToContinent, confirmed_time, by.y = "Country.Region", by.x = "Country")
recovered_time = merge( countryToContinent,recovered_time, by.y = "Country.Region", by.x = "Country")

At this point each row corresponds to a country’s information, if we remove the Country and Continent columns (the 1st two) we can hence plot any row to see how COVID has progressed in such country (This is not yet a tidy dataframe)

## index 55 is France
plot(t(deaths_time[55,-c(1,2)]),
     ylab = paste('COVID Deaths in:',deaths_time[55,1]),
     xlab = 'Days from January 22nd')

## index 24 is Brazil
plot(t(confirmed_time[24,-c(1,2)]),
     ylab = paste('COVID Deaths in:',deaths_time[24,1]),
     xlab = 'Days from January 22nd')

Events per day

Now we will calculate the number of events (i.e. new cases, recoveries, deaths) per day. All data available is cumulative hence we will simply calculate the number of cases for any specific day as the number of cases recorded that given day minus the number of cases from the day before. We will use the function lag()(see help(lag)) to simply shift the number of events from “today”. Again we will make a handy function for this:

perDay = function(cumulative_table){
  today = cumulative_table %>% select(-c(Country, Continent))
  
  ## tranpose table so that each column corresponds to a country
  ## and each row corresponds to a day
  today = t(today)
  
  ## shift the whole table down by a row (one day) with lag function
  yesterday = lag(today)
  
  ## we will fill the first row, which was filled with NAs, with 0s
  yesterday[1,] = 0
  
  per_day = today - yesterday
  
  ## transpose again
  per_day = t(per_day)
  
  ## combine back with original table
  output = cumulative_table %>% select(Country, Continent)
  output = cbind(output, per_day)
  return(output)
}

To help you understand this function this is what today and yesterday would look like (each column would correspond to a country):

Today table Yesterday table

Now, we use the above function to calculate the number of deaths, confirmed cases and recoveries per day, per country.

deaths_per_day = perDay(deaths_time)
confirmed_per_day = perDay(confirmed_time)
recovered_per_day = perDay(recovered_time)

Next, we are going to re-structure our data frame so that each rows corresponds to the number of events in any single day for any given country. This is called tidy data and will make it much easier for us to work with ggplot2 functions. The main function we are using to do this is the pivot_longer() function.

deaths_per_day = deaths_per_day %>%
  pivot_longer(-c(Country,Continent),
               names_to = "Date",
               values_to = "Amount") %>%
  mutate(Metric = "Deaths")

confirmed_per_day = confirmed_per_day %>%
  pivot_longer(-c(Country,Continent),
               names_to = "Date",
               values_to = "Amount") %>%
  mutate(Metric = "Confirmed")

recovered_per_day = recovered_per_day %>% 
  pivot_longer(-c(Country,Continent),
               names_to = "Date",
               values_to = "Amount") %>% 
  mutate(Metric = "Recovered")

## combine all in one dataset
all_per_day = rbind(deaths_per_day, confirmed_per_day, recovered_per_day)

As you can see below the date format is not very pretty (nor standard):

all_per_day$Date[1:10]
##  [1] "X1.22.20" "X1.23.20" "X1.24.20" "X1.25.20" "X1.26.20" "X1.27.20"
##  [7] "X1.28.20" "X1.29.20" "X1.30.20" "X1.31.20"

We will change this by using as.Date() which allow us to pass the input format with the argument format:

all_per_day$Date = as.Date(sapply(strsplit(all_per_day$Date, "X"), "[[",2), format = "%m.%d.%y")

Note: This bit (sapply(strsplit(all_per_day$Date, "X"), "[[",2)) strips each date strings after the ‘X’ characters and then picks the 2nd element of each string (i.e. MM.DD.YY)

As I said before, this modification allows us now to easily work with ggplot and dplyr:

library(ggplot2)
library(plotly)
ggplotly(
all_per_day %>% 
  filter(Country == 'Spain') %>% 
  ggplot(aes(x = Date, y = Amount, color = Metric))+
  geom_line()
)

WTF

After reviewing this beautiful notebook I have noticed these data artefacts (like in the plot above) where suddenly the daily number of cases or even deaths becomes NEGATIVE! COVID-19 is giving birth! I went straight to the John Hopkins University source on GitHub and many people have noticed the same problem:

At this point, I had two choices: either bang my head against the wall and google explanations for these artefacts (e.g. update on validity of stats? testing?), which I’ve already done) or look for other sources. I could also over-engineer a solution to this problem and do smart (actually super dumb) data cleaning, further distorting the data and making arbitrary decisions. From my view, these issues are pretty serious and should be addressed by the sources. At this point I cannot take the Novel Corona-Virus dataset or the JHU repo data as rigorous enough and instead I’ve decided to switch my data source to Our World in Data, which does not seem to have these problems.