Eric A. Suess
2/16/2021
Today we will introduce and discuss the COVID19 Data Hub, an R package that provides access to current numbers related to COVID19.
The COVID19 Data Hub tries to provide access to a curated collection of data from as many countries around the world as possible. It is an open-source package that encourages user suggestions and contributions.
> install.packages("COVID19")
It is one of the “covid” packages currently available on CRAN; the search below reports 20 matches and lists the top 10.
## - "covid" --------------------------------------- 20 packages in 0.01 seconds -
## # package version by @ title
## 1 100 covid19jp 0.1.0 Koji Higuchi 1M Japanese Covid-19 Da...
## 2 100 covid19france 0.1.0 Amanda Dobbyn 10M Cases of COVID-19 in...
## 3 92 covid19us 0.1.7 Amanda Dobbyn 5M Cases of COVID-19 in...
## 4 92 covid19br 0.1.1 Fabio Demarqui 3M Brazilian COVID-19 P...
## 5 92 covidregionaldata 0.8.2 Sam Abbott 2M Subnational Data for...
## 6 89 covid19swiss 0.1.0 Rami Krispin 5M COVID-19 Cases in Sw...
## 7 86 covidprobability 0.1.0 Eric Brown 6d Estimate the Unit-Wi...
## 8 86 oxcovid19 0.1.2 Ernest Guevarra 4M An R API to the Oxfo...
## 9 86 COVID19 2.3.2 Emanuele Guidotti 1M R Interface to COVID...
## 10 78 coronavirus 0.3.21 Rami Krispin 17d The 2019 Novel Coron...
I am a Professor at CSU East Bay in Statistics and Biostatistics, jointly appointed in Engineering. I have taught classes in Economics, Marketing, and Analytics for the College of Business. I am a former Chair, 5+ years out, after serving 3 terms, so 9 years (or 14).
I am the Chief Statistician at machineVantage, an AI and ML neuroscience marketing start-up company located in Berkeley, CA; Chennai and Bangalore, India; and London, England. I am a <= 10 hour per week employee, and I apply ML and AI algorithms for clients.
Now I am starting to work on the COVID19 Data Hub with Emanuele Guidotti and David Ardia. Emanuele is located in Switzerland and David is located in Montreal.
Well, at the start of the Covid lockdown I decided not to say No to any project that came my way. I am now working on many interesting projects. This is the one that is likely to influence my teaching the most in terms of technical skills.
Joe asked and I said Yes.
I am hoping this effort is beneficial to:
1. The developers of the package.
2. The R community.
3. The R Consortium Covid19 Working Group.
4. My CSU East Bay colleagues, Ayona Chatterjee and Eric Fox.
5. My current students who are working on Covid19 data projects.
6. Me. Hopefully I can develop more "developer" skills that I can pass on to my students.
The COVID19 Data Hub is an R package that pulls data from a curated collection of data sources and is updated hourly. The data is downloaded and merged into one file once an hour and can be accessed through one function in R (or using other frontends).
> library(COVID19)
> x_USA <- covid19("USA")
> x_USA
The data is downloaded from many data sources by code running on a GCP server in the cloud. The data from the various sources is processed to populate three levels of data. At the end of each day a vintage dataset is made available.
The levels:
There are so many different sources of COVID19 data. Every country, every state and every city has its own data. There are many different government websites, many universities, and many companies.
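The objects used later in this post (x_BRA, x_USA_BRA, x_three, x_USA_state and x_USA_county) are not defined in the text. Below is a minimal sketch of how they could be loaded. It assumes the `level` argument of covid19() (1 = country, 2 = state, 3 = city or county) and ISO codes for the countries. The exact calls used for this post are not shown, so treat these as a guess. The calls need an internet connection, and the level 3 download can be large.

```r
library(COVID19)

# level = 1: country-level data (the default)
x_USA <- covid19("USA", level = 1, verbose = FALSE)
x_BRA <- covid19("BRA", verbose = FALSE)

# Several countries can be requested at once with a character vector
x_USA_BRA <- covid19(c("USA", "BRA"), verbose = FALSE)
x_three   <- covid19(c("EST", "LTU", "LVA"), verbose = FALSE)

# level = 2: state-level data, level = 3: city/county-level data
x_USA_state  <- covid19("USA", level = 2, verbose = FALSE)
x_USA_county <- covid19("USA", level = 3, verbose = FALSE)
```

Setting verbose = FALSE hides the list of data sources that is otherwise printed with each download.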
It is going to be an ongoing challenge to maintain all of the connections to the original sources. It is already the case that some of the original sources will be ending their efforts soon.
Below are some examples of possible uses of the data. I am currently teaching a Time Series course using the fpp3 book and a graduate Statistical Learning class using the mdsr2e book, so the examples that follow use the R packages from these books.
There is also an excellent tutorial posted on Medium’s Towards Data Science, COVID-19 Data Acquisition in R, that gives further details on how to extend the dataset in real time.
Load the country level data for the United States.
## Warning in id(x$country, iso = "ISO", ds = "jhucsse_git", level = 1): missing
## id: Micronesia
Time plot of the cumulative deaths.
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## Plot variable not specified, automatically selected `.vars = deaths`
## Warning: Removed 38 row(s) containing missing values (geom_path).
Using the lag() function we can determine daily counts.
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  tail(10)
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## # A tsibble: 10 x 4 [1D]
## # Groups: id [1]
## id date deaths daily_deaths
## <chr> <date> <dbl> <dbl>
## 1 USA 2021-02-06 466890 2546
## 2 USA 2021-02-07 468204 1314
## 3 USA 2021-02-08 469786 1582
## 4 USA 2021-02-09 472818 3032
## 5 USA 2021-02-10 476100 3282
## 6 USA 2021-02-11 479257 3157
## 7 USA 2021-02-12 482142 2885
## 8 USA 2021-02-13 484301 2159
## 9 USA 2021-02-14 485384 1083
## 10 USA 2021-02-15 486325 941
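To make the lag() step concrete, here is a small self-contained sketch that rebuilds the calculation on the ten cumulative values printed above. The names toy and previous are only for illustration. Because the toy table starts on 2021-02-06, its first daily count is NA (there is no earlier row to subtract), whereas the full series above has a value for that day.

```r
library(dplyr)

# Cumulative deaths copied from the ten rows printed above
toy <- tibble(
  date   = as.Date("2021-02-06") + 0:9,
  deaths = c(466890, 468204, 469786, 472818, 476100,
             479257, 482142, 484301, 485384, 486325)
)

# lag() shifts the column down by one row, so each day is
# compared with the cumulative total of the day before
toy <- toy %>%
  mutate(previous     = lag(deaths),
         daily_deaths = deaths - previous)

toy
```

From the second row on, the daily_deaths column matches the values in the output above.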
Plotting the daily counts reveals a weekly seasonal pattern in the time series.
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  autoplot(daily_deaths) +
  labs(title = "USA Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## Warning: Removed 39 row(s) containing missing values (geom_path).
Looking at the last 6 months.
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  tail(180) %>%
  autoplot(daily_deaths) +
  labs(title = "USA Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
Trying a multiplicative Classical Decomposition Model to see the Trend and Seasonal components in the time series.
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  tail(180) %>%
  model(classical_decomposition(daily_deaths, type = "multiplicative")) %>%
  components() %>%
  autoplot() +
  labs(title = "Classical multiplicative decomposition of USA Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## Warning: Removed 3 row(s) containing missing values (geom_path).
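For readers who want to see what a multiplicative classical decomposition estimates, here is a small sketch that uses base R's decompose() in place of classical_decomposition(). The series is synthetic (a made-up weekly pattern multiplied by a linear trend), not COVID19 data, and the names weekly_pattern, y and dec are only for illustration. The model is y = trend × seasonal × remainder, so the seasonal component is a set of factors around 1 rather than counts.

```r
# Made-up weekly pattern (averages to 1) and a linear trend: synthetic data only
weekly_pattern <- c(0.6, 0.8, 1.2, 1.3, 1.2, 1.1, 0.8)
n_weeks <- 12
trend <- seq(1000, 3000, length.out = 7 * n_weeks)

# frequency = 7 tells R that one seasonal cycle is seven daily observations
y <- ts(trend * rep(weekly_pattern, times = n_weeks), frequency = 7)

dec <- decompose(y, type = "multiplicative")

# Seven seasonal factors, scaled so that they average to 1
round(dec$figure, 2)

# Trend is a centered 7-day moving average, so the first and last 3 days are NA
head(dec$trend, 10)
```

The missing values at both ends of the moving average are why the decomposition plot above drops a few rows with a warning.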
Computing some features of the time series.
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  tail(180) %>%
  select(date, daily_deaths) %>%
  features(daily_deaths, feat_stl)
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## Adding missing grouping variables: `id`
## # A tibble: 1 x 9
## trend_strength seasonal_streng~ seasonal_peak_w~ seasonal_trough~ spikiness
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.954 0.873 0 4 576880.
## # ... with 4 more variables: linearity <dbl>, curvature <dbl>,
## # stl_e_acf1 <dbl>, stl_e_acf10 <dbl>
Autocorrelation plot. (See Allison Horst’s new series on the ACF, posted on Twitter by @allison_horst yesterday.)
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  ACF(daily_deaths) %>%
  autoplot() +
  labs(title = "USA Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
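As a rough illustration of how a weekly pattern shows up in an autocorrelation plot, the sketch below uses base R's acf() on a synthetic series that repeats the same made-up seven-day pattern. It is not the ACF() call from the fpp3 packages used above, and the numbers are not COVID19 data.

```r
# A made-up seven-day pattern repeated for 12 weeks: synthetic data only
weekly_pattern <- c(0.6, 0.8, 1.2, 1.3, 1.2, 1.1, 0.8)
y <- 1000 * rep(weekly_pattern, times = 12)

# plot = FALSE returns the autocorrelations instead of drawing them
acf_y <- acf(y, lag.max = 14, plot = FALSE)

# The first element is lag 0, so label the values with their lags
r <- drop(acf_y$acf)
names(r) <- 0:14
round(r, 2)
```

The autocorrelations at lags 7 and 14 are large and positive, while those in between (for example lag 3) are negative, which is the signature of a weekly seasonal pattern.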
PACF
x_USA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  PACF(daily_deaths) %>%
  autoplot() +
  labs(title = "USA Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
Note: The time series is not stationary, so we need to take another difference.
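A small sketch of what taking another difference means, using base R's diff() on the last ten cumulative USA values printed earlier in this post. The daily counts are already the first difference of the cumulative series, so differencing them again gives the second difference. In the fpp3 workflow the same step could be done with difference() from tsibble inside mutate(); using diff() here is my substitution to keep the example free of package dependencies.

```r
# Cumulative USA deaths for 2021-02-06 to 2021-02-15, copied from the output earlier in the post
deaths <- c(466890, 468204, 469786, 472818, 476100,
            479257, 482142, 484301, 485384, 486325)

# First difference: daily deaths (one value fewer than the input)
daily <- diff(deaths)

# Second difference: day-to-day change in daily deaths
second <- diff(deaths, differences = 2)

daily
second
```

Ten observations are far too few to judge stationarity. The sketch only shows the mechanics of the transformation.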
Brazil
## # A tibble: 10 x 36
## # Groups: id [1]
## id date vaccines tests confirmed recovered deaths hosp vent icu
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 BRA 2021-02-06 3401383 NA 9447165 8428992 230034 NA NA NA
## 2 BRA 2021-02-07 3553681 NA 9524640 8467982 231534 NA NA NA
## 3 BRA 2021-02-08 3605538 NA 9524640 8478818 231534 NA NA NA
## 4 BRA 2021-02-09 3820207 NA 9599565 8577207 233520 NA NA NA
## 5 BRA 2021-02-10 4120332 NA 9659167 8616282 234850 NA NA NA
## 6 BRA 2021-02-11 4406835 NA 9713909 8637050 236201 NA NA NA
## 7 BRA 2021-02-12 4696136 NA 9765455 8691664 237489 NA NA NA
## 8 BRA 2021-02-13 5125206 NA 9809754 8740445 238532 NA NA NA
## 9 BRA 2021-02-14 5236943 NA 9834513 8765048 239245 NA NA NA
## 10 BRA 2021-02-15 5293979 NA 9866710 8821887 239773 NA NA NA
## # ... with 26 more variables: population <dbl>, school_closing <int>,
## # workplace_closing <int>, cancel_events <int>,
## # gatherings_restrictions <int>, transport_closing <int>,
## # stay_home_restrictions <int>, internal_movement_restrictions <int>,
## # international_movement_restrictions <int>, information_campaigns <int>,
## # testing_policy <int>, contact_tracing <int>, stringency_index <dbl>,
## # iso_alpha_3 <chr>, iso_alpha_2 <chr>, iso_numeric <int>, currency <chr>,
## # administrative_area_level <chr>, administrative_area_level_1 <chr>,
## # administrative_area_level_2 <chr>, administrative_area_level_3 <chr>,
## # latitude <dbl>, longitude <dbl>, key <lgl>, key_apple_mobility <chr>,
## # key_google_mobility <chr>
x_BRA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble() %>%
  autoplot(daily_deaths) +
  labs(title = "Brazil Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## Using `date` as index variable.
## Warning: Removed 56 row(s) containing missing values (geom_path).
## # A tibble: 10 x 36
## # Groups: id [1]
## id date vaccines tests confirmed recovered deaths hosp vent
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 USA 2021-02-06 39037964 3.11e8 26917787 NA 466890 84233 NA
## 2 USA 2021-02-07 41210937 3.11e8 27007368 NA 468204 81439 NA
## 3 USA 2021-02-08 42417617 3.12e8 27097095 NA 469786 80055 NA
## 4 USA 2021-02-09 43206190 3.12e8 27192455 NA 472818 79179 NA
## 5 USA 2021-02-10 44769970 NA 27287159 NA 476100 76979 NA
## 6 USA 2021-02-11 46390270 NA 27392512 NA 479257 74225 NA
## 7 USA 2021-02-12 48410558 NA 27492023 NA 482142 NA NA
## 8 USA 2021-02-13 50641884 NA 27575344 NA 484301 NA NA
## 9 USA 2021-02-14 52884356 NA 27640282 NA 485384 NA NA
## 10 USA 2021-02-15 NA NA 27694165 NA 486325 NA NA
## # ... with 27 more variables: icu <dbl>, population <dbl>,
## # school_closing <int>, workplace_closing <int>, cancel_events <int>,
## # gatherings_restrictions <int>, transport_closing <int>,
## # stay_home_restrictions <int>, internal_movement_restrictions <int>,
## # international_movement_restrictions <int>, information_campaigns <int>,
## # testing_policy <int>, contact_tracing <int>, stringency_index <dbl>,
## # iso_alpha_3 <chr>, iso_alpha_2 <chr>, iso_numeric <int>, currency <chr>,
## # administrative_area_level <chr>, administrative_area_level_1 <chr>,
## # administrative_area_level_2 <chr>, administrative_area_level_3 <chr>,
## # latitude <dbl>, longitude <dbl>, key <lgl>, key_apple_mobility <chr>,
## # key_google_mobility <chr>
x_USA_BRA %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble(key = id, index = date) %>%
  autoplot(daily_deaths) +
  labs(title = "USA and Brazil Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## `mutate_if()` ignored the following grouping variables:
## Column `id`
## Warning: Removed 95 row(s) containing missing values (geom_path).
Estonia, Lithuania, and Latvia
## # A tibble: 10 x 36
## # Groups: id [1]
## id date vaccines tests confirmed recovered deaths hosp vent icu
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LVA 2021-02-07 32250 1.24e6 71800 59897 1339 NA NA NA
## 2 LVA 2021-02-08 32661 1.25e6 72088 60117 1347 NA NA NA
## 3 LVA 2021-02-09 32845 1.26e6 72869 60798 1363 NA NA NA
## 4 LVA 2021-02-10 33452 1.27e6 73859 61889 1395 NA NA NA
## 5 LVA 2021-02-11 35098 1.28e6 74701 62844 1416 NA NA NA
## 6 LVA 2021-02-12 36644 1.30e6 75509 62844 1431 NA NA NA
## 7 LVA 2021-02-13 37043 1.31e6 76282 64528 1443 NA NA NA
## 8 LVA 2021-02-14 37063 1.31e6 76706 65046 1451 NA NA NA
## 9 LVA 2021-02-15 NA 1.31e6 76984 65450 1468 NA NA NA
## 10 LVA 2021-02-16 NA 1.33e6 77697 NA 1486 NA NA NA
## # ... with 26 more variables: population <dbl>, school_closing <int>,
## # workplace_closing <int>, cancel_events <int>,
## # gatherings_restrictions <int>, transport_closing <int>,
## # stay_home_restrictions <int>, internal_movement_restrictions <int>,
## # international_movement_restrictions <int>, information_campaigns <int>,
## # testing_policy <int>, contact_tracing <int>, stringency_index <dbl>,
## # iso_alpha_3 <chr>, iso_alpha_2 <chr>, iso_numeric <int>, currency <chr>,
## # administrative_area_level <chr>, administrative_area_level_1 <chr>,
## # administrative_area_level_2 <chr>, administrative_area_level_3 <chr>,
## # latitude <dbl>, longitude <dbl>, key <lgl>, key_apple_mobility <chr>,
## # key_google_mobility <chr>
x_three %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble(key = id, index = date) %>%
  autoplot(daily_deaths) +
  labs(title = "Covid19 Daily Deaths")
## Adding missing grouping variables: `id`
## `mutate_if()` ignored the following grouping variables:
## Column `id`
## Warning: Removed 166 row(s) containing missing values (geom_path).
Summarize the data weekly.
x_three %>% select(date, deaths) %>%
  mutate(daily_deaths = deaths - lag(deaths)) %>%
  as_tsibble(key = id, index = date) %>%
  # Currently only supports daily data
  index_by(date) %>%
  summarise(weekly_deaths = sum(daily_deaths)) %>%
  # Compute weekly aggregates
  fabletools:::aggregate_index("1 week", weekly_deaths = sum(weekly_deaths)) %>%
  autoplot(weekly_deaths) +
  labs(title = "Covid19 Weekly Deaths")
## Adding missing grouping variables: `id`
## Warning: Removed 23 row(s) containing missing values (geom_path).
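The pipeline above calls fabletools:::aggregate_index() with the triple-colon operator, which reaches into a function the package does not export, so it may change without notice. A possible alternative, sketched here on made-up toy data rather than the Data Hub data, uses the exported tsibble functions group_by_key(), index_by() and yearweek(). The ids AAA and BBB and the counts are invented for the example.

```r
library(dplyr)
library(tsibble)

# Toy daily counts for two made-up ids over 14 days starting on a Monday (not real data)
toy <- tibble(
  id           = rep(c("AAA", "BBB"), each = 14),
  date         = rep(as.Date("2021-02-01") + 0:13, times = 2),
  daily_deaths = c(rep(1, 14), rep(2, 14))
) %>%
  as_tsibble(key = id, index = date)

# Group by key so each id is aggregated separately,
# then re-index by ISO week and sum the daily counts within each week
weekly <- toy %>%
  group_by_key() %>%
  index_by(week = ~ yearweek(.)) %>%
  summarise(weekly_deaths = sum(daily_deaths))

weekly
```

Unlike the pipeline above, which first sums over the countries for each date, this version keeps one weekly series per id. Dropping group_by_key() would give a single combined series instead.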
We can do a data availability study.
Estonia, Lithuania, and Latvia
## [1] TRUE
## [1] 7924
## [1] 0.1874882
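The code behind the three numbers above is not shown. They are consistent with asking whether any value is missing, how many values are missing, and what proportion is missing, but that is a guess. A minimal base R sketch of such a check, on a toy data frame built from three columns of the Latvia rows printed earlier in this post:

```r
# Three columns of the last ten Latvia rows printed earlier in the post
toy <- data.frame(
  date      = as.Date("2021-02-07") + 0:9,
  vaccines  = c(32250, 32661, 32845, 33452, 35098, 36644, 37043, 37063, NA, NA),
  recovered = c(59897, 60117, 60798, 61889, 62844, 62844, 64528, 65046, 65450, NA),
  deaths    = c(1339, 1347, 1363, 1395, 1416, 1431, 1443, 1451, 1468, 1486)
)

any_missing  <- anyNA(toy)           # is anything missing at all?
n_missing    <- sum(is.na(toy))      # number of missing cells
prop_missing <- mean(is.na(toy))     # share of all cells that are missing
per_column   <- colSums(is.na(toy))  # missing cells in each column

any_missing
n_missing
prop_missing
per_column
```

The per-column counts show where the gaps are. In this toy table they sit in the most recent rows of vaccines and recovered, which are reported with a delay.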
Visualize the missing values.
## Warning in id(x$state, iso = iso[[1]], ds = "jhucsse_git", level = level):
## missing id: Nunavut, Repatriated Travellers
## Warning in id(x$state, iso = iso[[1]], ds = "jhucsse_git", level = level):
## missing id: Wallis and Futuna
##
## Hale Thomas, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz
## Kira (2020). Oxford COVID-19 Government Response Tracker, Blavatnik
## School of Government.
##
## The COVID Tracking Project (2020), https://covidtracking.com
##
## Johns Hopkins Center for Systems Science and Engineering (2020),
## https://github.com
##
## Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
## Source Software 5(51):2376, doi: 10.21105/joss.02376.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
##
## To hide the data sources use 'verbose = FALSE'.
x_USA_state %>% select(date, administrative_area_level_2, deaths) %>%
  filter(date == "2021-02-15") %>%
  filter(administrative_area_level_2 %in% c("California", "Oregon", "Washington")) %>%
  ggplot(aes(x = administrative_area_level_2, y = deaths)) +
  geom_bar(stat="identity")
## Adding missing grouping variables: `id`
## Warning in id(y$fips, iso = "USA", ds = "nytimes_git", level = level): missing
## id: 2997, 2998
##
## World Bank Open Data (2018), https://data.worldbank.org
##
## Hale Thomas, Sam Webster, Anna Petherick, Toby Phillips, and Beatriz
## Kira (2020). Oxford COVID-19 Government Response Tracker, Blavatnik
## School of Government.
##
## Johns Hopkins Center for Systems Science and Engineering (2020),
## https://github.com
##
## The New York Times (2020), https://github.com
##
## Guidotti, E., Ardia, D., (2020), "COVID-19 Data Hub", Journal of Open
## Source Software 5(51):2376, doi: 10.21105/joss.02376.
##
## To see these entries in BibTeX format, use 'print(<citation>,
## bibtex=TRUE)', 'toBibtex(.)', or set
## 'options(citation.bibtex.max=999)'.
##
## To hide the data sources use 'verbose = FALSE'.
x_USA_county %>% select(date, administrative_area_level_2, administrative_area_level_3, deaths, vaccines) %>%
  filter(date == "2021-02-15") %>%
  filter(administrative_area_level_2 %in% c("California")) %>%
  filter(administrative_area_level_3 %in% c("Alameda", "Contra Costa", "Santa Clara")) %>%
  ggplot(aes(x = administrative_area_level_3, y = deaths)) +
  geom_bar(stat="identity")
## Adding missing grouping variables: `id`