After working on this notebook for a long time, I sadly realised the data source has unexplained artefacts (see the final section). Because of this, I decided to make another notebook with a new data source (which may be where you came from). If you are learning data science or R, you may still find this notebook helpful: it illustrates very well the painful (but extremely important) process of data cleaning that all data scientists should know intimately.
We will use 3 libraries for data cleaning & wrangling: lubridate (for dates), and dplyr and tidyr (for wrangling).
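Assuming these packages are already installed, we load them up front:
library(lubridate)  ## mdy() for parsing dates
library(dplyr)      ## filtering, grouping, summarising
library(tidyr)      ## pivot_longer() for reshaping into tidy format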
Our data comes from Kaggle’s Novel-Coronavirus dataset. This dataset includes several files, which you can see below.
tree ../RawData
## ../RawData
## ├── COVID19_line_list_data.csv
## ├── COVID19_open_line_list.csv
## ├── covid_19_data.csv
## ├── time_series_covid_19_confirmed.csv
## ├── time_series_covid_19_confirmed_US.csv
## ├── time_series_covid_19_deaths.csv
## ├── time_series_covid_19_deaths_US.csv
## └── time_series_covid_19_recovered.csv
##
## 0 directories, 8 files
Of these files I ended up using only 4: covid_19_data.csv and time_series_covid_19_{confirmed, deaths, recovered}.csv. Let’s quickly load them and inspect what they contain:
covid_data = read.csv("../RawData/covid_19_data.csv")
deaths_time = read.csv("../RawData/time_series_covid_19_deaths.csv")
confirmed_time = read.csv("../RawData/time_series_covid_19_confirmed.csv")
recovered_time = read.csv("../RawData/time_series_covid_19_recovered.csv")
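A quick shape check (always worth running after read.csv()) confirms the row and column counts quoted below:
## Sanity check: dimensions of the snapshot table and row counts of the time series
dim(covid_data)
sapply(list(confirmed = confirmed_time,
            deaths = deaths_time,
            recovered = recovered_time), nrow)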
As we see below, covid_19_data.csv contains 116805 rows and 8 columns. Each row records the number of confirmed cases, recoveries and deaths in a given province (and its country) on a given date.
head(covid_data, 10)
## SNo ObservationDate Province.State Country.Region Last.Update Confirmed
## 1 1 01/22/2020 Anhui Mainland China 1/22/2020 17:00 1
## 2 2 01/22/2020 Beijing Mainland China 1/22/2020 17:00 14
## 3 3 01/22/2020 Chongqing Mainland China 1/22/2020 17:00 6
## 4 4 01/22/2020 Fujian Mainland China 1/22/2020 17:00 1
## 5 5 01/22/2020 Gansu Mainland China 1/22/2020 17:00 0
## 6 6 01/22/2020 Guangdong Mainland China 1/22/2020 17:00 26
## 7 7 01/22/2020 Guangxi Mainland China 1/22/2020 17:00 2
## 8 8 01/22/2020 Guizhou Mainland China 1/22/2020 17:00 1
## 9 9 01/22/2020 Hainan Mainland China 1/22/2020 17:00 4
## 10 10 01/22/2020 Hebei Mainland China 1/22/2020 17:00 1
## Deaths Recovered
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## 7 0 0
## 8 0 0
## 9 0 0
## 10 0 0
tail(covid_data, 10)
## SNo ObservationDate Province.State Country.Region
## 116796 116796 09/23/2020 Yukon Canada
## 116797 116797 09/23/2020 Yunnan Mainland China
## 116798 116798 09/23/2020 Zabaykalsky Krai Russia
## 116799 116799 09/23/2020 Zacatecas Mexico
## 116800 116800 09/23/2020 Zakarpattia Oblast Ukraine
## 116801 116801 09/23/2020 Zaporizhia Oblast Ukraine
## 116802 116802 09/23/2020 Zeeland Netherlands
## 116803 116803 09/23/2020 Zhejiang Mainland China
## 116804 116804 09/23/2020 Zhytomyr Oblast Ukraine
## 116805 116805 09/23/2020 Zuid-Holland Netherlands
## Last.Update Confirmed Deaths Recovered
## 116796 2020-09-24 04:23:38 15 0 15
## 116797 2020-09-24 04:23:38 209 2 196
## 116798 2020-09-24 04:23:38 5727 66 4703
## 116799 2020-09-24 04:23:38 6921 661 5711
## 116800 2020-09-24 04:23:38 9343 294 4295
## 116801 2020-09-24 04:23:38 3149 49 1158
## 116802 2020-09-24 04:23:38 1270 72 0
## 116803 2020-09-24 04:23:38 1282 1 1272
## 116804 2020-09-24 04:23:38 5191 92 2853
## 116805 2020-09-24 04:23:38 29513 1372 0
The time series files are a bit different. The deaths and confirmed tables contain 266 rows, whereas the recovered table contains 253. These tables have 4 columns that never change (the first 4: province, country, latitude and longitude). Each of the remaining columns holds the number of deaths, confirmed or recovered cases for one recorded day (one column == one day). This wide format is, to me, weird.
head(confirmed_time[,1:10])
## Province.State Country.Region Lat Long X1.22.20 X1.23.20
## 1 Afghanistan 33.93911 67.70995 0 0
## 2 Albania 41.15330 20.16830 0 0
## 3 Algeria 28.03390 1.65960 0 0
## 4 Andorra 42.50630 1.52180 0 0
## 5 Angola -11.20270 17.87390 0 0
## 6 Antigua and Barbuda 17.06080 -61.79640 0 0
## X1.24.20 X1.25.20 X1.26.20 X1.27.20
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
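The row-count mismatch (266 vs 253) presumably arises because some countries are split into provinces in one table but not in another. A quick sketch to list the country/province combinations present in the confirmed table but missing from the recovered one:
## Combinations in confirmed_time that have no match in recovered_time
setdiff(paste(confirmed_time$Country.Region, confirmed_time$Province.State),
        paste(recovered_time$Country.Region, recovered_time$Province.State))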
The covid_data table does not contain the continent each country belongs to, so we will use a utility table listing all countries in the world and their corresponding continents.
In this notebook I am sort of cheating. Data cleaning and wrangling are about putting your data into a format that serves as input for a model or plotting function. When I made this project I first investigated which plots I wanted to make and which input formats those plots required. In the interest of coherence, I’ll flash-forward and tell you we’ll be making the following plots:
Line charts: we will be using line charts to display the cumulative number of cases per region and the daily number of cases.
Animated bar charts
Map charts/Choropleths: I actually used leaflet here; plotly’s map charts are a bit underdeveloped.
We will be making 3 tables for all these charts: the covid_data table, the time_series table and the clean_time_series table.
We will be adding several extra columns to the table (on top of the ones described earlier). First we will make a column with the number of Active cases (= Confirmed - Deaths - Recovered):
## Calculate active cases
covid_data$Active = covid_data$Confirmed - covid_data$Deaths - covid_data$Recovered
## Change format of date from character to Date
covid_data$ObservationDate = mdy(covid_data$ObservationDate)
## Read utils table
countryToContinent = read.csv("../UtilsData/Countries-Continents.csv")
## Add countries that were missing or not properly formatted
extra_columns = cbind(Continent = c(rep("Asia",5),
rep("Europe",2),
"Africa"),
Country = c("Hong Kong", "Mainland China", "Macau", "Taiwan","South Korea",
"UK", "Czech Republic",
"Burkina Faso"))
countryToContinent = rbind(countryToContinent, extra_columns)
## Add continent column to covid_data
covid_data = merge(covid_data, countryToContinent,
by.x = "Country.Region",
by.y = "Country",
all.x = TRUE,
sort = FALSE)
Some countries/regions come out of the merge without a corresponding continent (e.g. West Bank and Gaza, Russia). To simplify things we will simply label these as having an Other/Missing continent. After this we will save the dataset as an .RDS file. Find below the list of those countries/regions with missing continent information:
## List of country/regions with missing continent info
unique(covid_data %>% filter(is.na(Continent)) %>% select(Country.Region))
## Country.Region
## 1 Russia
## 3 West Bank and Gaza
## 4 Western Sahara
## 8 Burma
## 10 Cabo Verde
## 11 North Macedonia
## 13 Kosovo
## 16 Congo (Brazzaville)
## 17 Congo (Kinshasa)
## 24 Diamond Princess
## 50 Holy See
## 63 Timor-Leste
## 80 Faroe Islands
## 86 Eswatini
## 128 MS Zaandam
## 146 The Bahamas
## 300 Others
## 431 Gambia, The
## 695 Bahamas, The
## 786 Palestine
## 853 Saint Barthelemy
## 894 Gibraltar
## 918 Guadeloupe
## 928 Reunion
## 936 Martinique
## 1089 French Guiana
## 1132 St. Martin
## 1406 Channel Islands
## 1655 occupied Palestinian territory
## 1975 Cayman Islands
## 2306 Jersey
## 2332 Guernsey
## 2664 Azerbaijan
## 2685 North Ireland
## 2863 Mayotte
## 3108 Guam
## 3136 Greenland
## 3149 The Gambia
## 3156 Puerto Rico
## 3732 Republic of the Congo
## 4067 Republic of Ireland
## 4675 ('St. Martin',)
## 5280 Aruba
## 5605 Curacao
## replace NA continent by Other/Missing label
covid_data$Continent = ifelse(is.na(covid_data$Continent),
"Other/Missing",
covid_data$Continent)
We will call the stored file uptodate.rds (I know, the name could be better):
saveRDS(covid_data, '../ProcessedData/uptodate.rds')
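Since .RDS files preserve R object types (including the Date column we just created), the cleaned table can be restored in a later session without repeating any of the steps above:
## Restore the processed table later with
covid_data = readRDS('../ProcessedData/uptodate.rds')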
The data in the time_series tables is sometimes split across regions within a country (such as states in the US, or regions in Australia). Because we are making a global dashboard, we are going to aggregate the values by country. To do this, we’ve written a short function called TimeByCountry():
TimeByCountry = function(data) {
## sum values by country
over.country = data %>%
select(-c(Lat, Long, Province.State)) %>%
group_by(Country.Region) %>% summarise_all(sum) %>% ungroup()
return(over.country)
}
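As an aside, summarise_all() is superseded in recent dplyr releases; an equivalent definition using across() (dplyr >= 1.0) would be:
TimeByCountry = function(data) {
  ## sum values by country, using across() instead of summarise_all()
  over.country = data %>%
    select(-c(Lat, Long, Province.State)) %>%
    group_by(Country.Region) %>%
    summarise(across(everything(), sum)) %>%
    ungroup()
  return(over.country)
}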
In addition to this, we will also merge the aggregated data with the countryToContinent table.
## Aggregate each table by country (this drops Lat, Long and Province.State)
deaths_time = TimeByCountry(deaths_time)
confirmed_time = TimeByCountry(confirmed_time)
recovered_time = TimeByCountry(recovered_time)
## Merge with continent data
deaths_time = merge(countryToContinent, deaths_time, by.y = "Country.Region", by.x = "Country")
confirmed_time = merge(countryToContinent, confirmed_time, by.y = "Country.Region", by.x = "Country")
recovered_time = merge(countryToContinent, recovered_time, by.y = "Country.Region", by.x = "Country")
At this point each row corresponds to one country’s information. If we drop the Country and Continent columns (the first two), we can plot any row to see how COVID has progressed in that country (this is not yet a tidy data frame):
## index 55 is France
plot(t(deaths_time[55,-c(1,2)]),
ylab = paste('COVID Deaths in:',deaths_time[55,1]),
xlab = 'Days from January 22nd')
## index 24 is Brazil
plot(t(confirmed_time[24,-c(1,2)]),
ylab = paste('COVID confirmed cases in:', confirmed_time[24,1]),
xlab = 'Days from January 22nd')
Now we will calculate the number of events (i.e. new cases, recoveries, deaths) per day. All the available data is cumulative, hence the number of events on any specific day is simply the count recorded that day minus the count from the day before. We will use the function lag() (see help(lag)) to shift the counts by one day. Again we will write a handy function for this:
perDay = function(cumulative_table){
today = cumulative_table %>% select(-c(Country, Continent))
## transpose table so that each column corresponds to a country
## and each row corresponds to a day
today = t(today)
## shift the whole table down by a row (one day) with lag function
yesterday = lag(today)
## we will fill the first row, which was filled with NAs, with 0s
yesterday[1,] = 0
per_day = today - yesterday
## transpose again
per_day = t(per_day)
## combine back with original table
output = cumulative_table %>% select(Country, Continent)
output = cbind(output, per_day)
return(output)
}
To help you understand this function, here is what the today and yesterday tables would look like for a single hypothetical country, using made-up cumulative counts (in the real tables, each column corresponds to a country and each row to a day):

| Day | Today table (cumulative) | Yesterday table (lagged) |
|---|---|---|
| Day 1 | 1 | 0 |
| Day 2 | 3 | 1 |
| Day 3 | 6 | 3 |
| Day 4 | 10 | 6 |
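The same logic as a runnable sketch on a plain vector (the counts are made up):
## Made-up cumulative counts for one country over four days
cumulative = c(1, 3, 6, 10)
previous = dplyr::lag(cumulative)  ## NA  1  3  6
previous[1] = 0                    ## the day before records began counts as 0
cumulative - previous              ## 1  2  3  4  -> new events per day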
Now, we use the above function to calculate the number of deaths, confirmed cases and recoveries per day, per country.
deaths_per_day = perDay(deaths_time)
confirmed_per_day = perDay(confirmed_time)
recovered_per_day = perDay(recovered_time)
Next, we are going to restructure our data frame so that each row corresponds to the number of events on a single day in a given country. This is called tidy data and will make it much easier to work with ggplot2 functions. The main function we use for this is pivot_longer().
deaths_per_day = deaths_per_day %>%
pivot_longer(-c(Country,Continent),
names_to = "Date",
values_to = "Amount") %>%
mutate(Metric = "Deaths")
confirmed_per_day = confirmed_per_day %>%
pivot_longer(-c(Country,Continent),
names_to = "Date",
values_to = "Amount") %>%
mutate(Metric = "Confirmed")
recovered_per_day = recovered_per_day %>%
pivot_longer(-c(Country,Continent),
names_to = "Date",
values_to = "Amount") %>%
mutate(Metric = "Recovered")
## combine all in one dataset
all_per_day = rbind(deaths_per_day, confirmed_per_day, recovered_per_day)
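As an aside, the three blocks above differ only in the Metric label; a small helper (the name tidyMetric is mine) would remove the repetition if used in place of them:
## Reshape one per-day table to tidy format and tag it with a metric name
tidyMetric = function(per_day_table, metric) {
  per_day_table %>%
    pivot_longer(-c(Country, Continent),
                 names_to = "Date",
                 values_to = "Amount") %>%
    mutate(Metric = metric)
}
all_per_day = rbind(tidyMetric(deaths_per_day, "Deaths"),
                    tidyMetric(confirmed_per_day, "Confirmed"),
                    tidyMetric(recovered_per_day, "Recovered"))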
As you can see below, the date format is not very pretty (nor standard):
all_per_day$Date[1:10]
## [1] "X1.22.20" "X1.23.20" "X1.24.20" "X1.25.20" "X1.26.20" "X1.27.20"
## [7] "X1.28.20" "X1.29.20" "X1.30.20" "X1.31.20"
We will change this with as.Date(), which allows us to specify the input format via the format argument:
all_per_day$Date = as.Date(sapply(strsplit(all_per_day$Date, "X"), "[[",2), format = "%m.%d.%y")
Note: the sapply(strsplit(all_per_day$Date, "X"), "[[", 2) bit splits each date string at the ‘X’ character and keeps the second piece (i.e. the MM.DD.YY part).
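Incidentally, since as.Date() matches literal characters in the format string, the strsplit() step can be skipped; this one-liner is equivalent:
## The literal "X" is consumed by the format string directly
all_per_day$Date = as.Date(all_per_day$Date, format = "X%m.%d.%y")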
As I said before, this tidy format lets us work easily with ggplot2 and dplyr:
library(ggplot2)
library(plotly)
ggplotly(
all_per_day %>%
filter(Country == 'Spain') %>%
ggplot(aes(x = Date, y = Amount, color = Metric))+
geom_line()
)
After reviewing this beautiful notebook I noticed data artefacts (like in the plot above) where suddenly the daily number of cases, or even deaths, becomes NEGATIVE! COVID-19 is giving birth! I went straight to the Johns Hopkins University source on GitHub, where many people have noticed the same problem.
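To see how widespread these negative corrections are, we can simply filter the tidy table (a quick sketch):
## Rows where a cumulative series decreased from one day to the next
all_per_day %>%
  filter(Amount < 0) %>%
  arrange(Amount) %>%
  head(10)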
At this point, I had two choices: either bang my head against the wall and google explanations for these artefacts (e.g. updates to the validity of stats? changes in testing?), which I had already done, or look for other sources. I could also have over-engineered a solution and done some smart (actually super dumb) data cleaning, further distorting the data with arbitrary decisions. In my view, these issues are serious and should be addressed at the source. At this point I cannot take the Novel Corona-Virus dataset or the JHU repo data as rigorous enough, so I’ve decided to switch my data source to Our World in Data, which does not seem to have these problems.