This document is a walkthrough on how to import data from Johns Hopkins University’s COVID tracking project and put it into R for processing. We’ll clean it up, do some calculations and filtering, and then chart it as part of this tutorial.
This document is called a “R Markdown” document. It combines your code, your narrative and your results all in one, self-contained document that can be shared with varying amounts of detail shown.
This first section sets up our R session. You’ll almost always do this, because there is very little you’ll do without a package called the tidyverse. It’s a collection of other packages that play together seamlessly, and make R much easier to use. Without it, R is just. too. hard.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(janitor)
library(skimr)
library(reactable)
options (scipen = 999)
Johns Hopkins makes available weekly summaries of COVID from countries around the world. The one we’ll use for this tutorial is for Oct. 31, 2021, and we’ll use this version of it: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/10-31-2021.csv
(When you want to get data from Github, you usually have to use the link to the “RAW” data, not the first one you see. )
The repo has information on each column and the sources in its readme.md file
Here’s what the data contains:
province_state, country_region, lat and long_ , which is where JHU puts the dots on their online map.The column names are in the first row of the data.
Here’s how you get the data into R: * Insert a “code chunk” and give it a name. * use the read_csv function to pour the file into a dataset in our R environment.
jhu_covid <- read_csv ( "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/10-31-2021.csv"
)
## Rows: 4006 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Admin2, Province_State, Country_Region, Combined_Key
## dbl (9): FIPS, Lat, Long_, Confirmed, Deaths, Recovered, Active, Incident_R...
## dttm (1): Last_Update
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
What it’s done is say, “Here’s what I found!! Four columns that are words, nine columns that are numbers, and one column that’s a date/time. I found 4006 rows!”
Now, that doesn’t make much sense, does it? Why are there so many? There aren’t 4006 countries in the world! Let’s take a quick glance at the data:
glimpse(jhu_covid)
## Rows: 4,006
## Columns: 14
## $ FIPS <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Admin2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Province_State <chr> NA, NA, NA, NA, NA, NA, NA, NA, "Australian Capita…
## $ Country_Region <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "A…
## $ Last_Update <dttm> 2021-11-01 04:22:01, 2021-11-01 04:22:01, 2021-11…
## $ Lat <dbl> 33.93911, 41.15330, 28.03390, 42.50630, -11.20270,…
## $ Long_ <dbl> 67.70995, 20.16830, 1.65960, 1.52180, 17.87390, -6…
## $ Confirmed <dbl> 156250, 185300, 206452, 15516, 64433, 4058, 528880…
## $ Deaths <dbl> 7280, 2924, 5920, 130, 1710, 102, 115950, 6328, 13…
## $ Recovered <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Active <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Combined_Key <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "A…
## $ Incident_Rate <dbl> 401.37852, 6438.94642, 470.80294, 20081.53757, 196…
## $ Case_Fatality_Ratio <dbl> 4.6592000, 1.5779817, 2.8674946, 0.8378448, 2.6539…
The reason is that there is a row for every province or state that the university could get, not just the countries.
We can also see here that there are a lot of things that say, NA . We’ll want to keep an eye on that, since they can corrupt everything you want to do with your data.
To get a more thorough view of what you have, we’ll use another package, called skimr to skim through the dataset:
skimr::skim(jhu_covid)
| Name | jhu_covid |
| Number of rows | 4006 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 9 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Admin2 | 734 | 0.82 | 3 | 41 | 0 | 1927 | 0 |
| Province_State | 174 | 0.96 | 3 | 44 | 0 | 594 | 0 |
| Country_Region | 0 | 1.00 | 2 | 32 | 0 | 196 | 0 |
| Combined_Key | 0 | 1.00 | 4 | 60 | 0 | 4006 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| FIPS | 738 | 0.82 | 32405.94 | 18056.38 | 60.00 | 19048.50 | 30068.00 | 47041.50 | 99999.00 | ▆▇▆▁▁ |
| Lat | 90 | 0.98 | 35.80 | 13.25 | -52.37 | 33.20 | 37.90 | 42.18 | 71.71 | ▁▁▁▇▁ |
| Long_ | 90 | 0.98 | -71.35 | 54.90 | -178.12 | -96.60 | -86.77 | -77.39 | 178.06 | ▁▇▁▁▁ |
| Confirmed | 0 | 1.00 | 61585.93 | 344002.19 | 0.00 | 1723.25 | 4895.00 | 20163.00 | 8032958.00 | ▇▁▁▁▁ |
| Deaths | 0 | 1.00 | 1248.10 | 7025.05 | 0.00 | 27.00 | 81.00 | 289.00 | 152002.00 | ▇▁▁▁▁ |
| Recovered | 4005 | 0.00 | 0.00 | NA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Active | 4005 | 0.00 | 0.00 | NA | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| Incident_Rate | 91 | 0.98 | 13041.57 | 5360.51 | 0.00 | 10334.37 | 14024.82 | 16531.42 | 54240.40 | ▃▇▁▁▁ |
| Case_Fatality_Ratio | 43 | 0.99 | 2.78 | 46.09 | 0.00 | 1.13 | 1.61 | 2.24 | 2796.55 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| Last_Update | 0 | 1 | 2020-08-04 02:27:56 | 2021-11-01 04:22:01 | 2021-11-01 04:22:01 | 8 |
What if I want to see only the data for countries as a total? Let’s see what we find:
jhu_covid %>%
filter ( is.na( Province_State))
## # A tibble: 174 × 14
## FIPS Admin2 Province_State Country_Region Last_Update Lat Long_
## <dbl> <chr> <chr> <chr> <dttm> <dbl> <dbl>
## 1 NA <NA> <NA> Afghanistan 2021-11-01 04:22:01 33.9 67.7
## 2 NA <NA> <NA> Albania 2021-11-01 04:22:01 41.2 20.2
## 3 NA <NA> <NA> Algeria 2021-11-01 04:22:01 28.0 1.66
## 4 NA <NA> <NA> Andorra 2021-11-01 04:22:01 42.5 1.52
## 5 NA <NA> <NA> Angola 2021-11-01 04:22:01 -11.2 17.9
## 6 NA <NA> <NA> Antigua and Bar… 2021-11-01 04:22:01 17.1 -61.8
## 7 NA <NA> <NA> Argentina 2021-11-01 04:22:01 -38.4 -63.6
## 8 NA <NA> <NA> Armenia 2021-11-01 04:22:01 40.1 45.0
## 9 NA <NA> <NA> Austria 2021-11-01 04:22:01 47.5 14.6
## 10 NA <NA> <NA> Azerbaijan 2021-11-01 04:22:01 40.1 47.6
## # … with 164 more rows, and 7 more variables: Confirmed <dbl>, Deaths <dbl>,
## # Recovered <dbl>, Active <dbl>, Combined_Key <chr>, Incident_Rate <dbl>,
## # Case_Fatality_Ratio <dbl>
Unfortunately for us, any country that has data for the state doesn’t have data for the country as a whole. And the US is listing EVERY COUNTY! No wonder there are so many.
So we want to reverse engineer this a little. To get to the country totals we need to:
We’ll use the basic building blocks of the Tidyverse, the basic verbs:
select means to pick out columns filter means to pick out rows arrange means to sort the rows mutate means to create a new column from old information.
When we go from one verb to the next, we’ll use a pipe (%>% ) to connect them:
rladies
jhu_covid_counts <-
jhu_covid %>%
mutate ( rate_not_per100k = Incident_Rate / 100000 ,
est_population = Confirmed / rate_not_per100k)
Now I want to just keep the columns that can add up, fix some NA’s, and lowercase all of the column names so I don’t have to remember what’s what.
Let’s go ahead and add them up by country. Introducing a new verb:
group_by and summarise, which combines rows by calculating a statistic. We’ll do it longhand, even though there are fancy ways to reduce the typing involved.
jhu_covid_count2 <-
jhu_covid_counts %>%
group_by ( Country_Region) %>%
summarise ( confirmed_total = sum ( Confirmed, na.rm=T),
deaths_total = sum (Deaths, na.rm=T) ,
est_population = sum ( est_population , na.rm = T),
approx_lat = median ( Lat, na.rm=T),
approx_long = median (Long_, na.rm=T))
Want to see a chart, with the 196 countries listed by the number of COVID cases they have? I’ll use another package to make it look a little nicer, and to create a sortable, searchable table. That package is called reactable
jhu_covid_count2 %>%
reactable (
searchable=TRUE,
filterable=FALSE,
#compact=TRUE,
theme=reactableTheme(color="dark gray",
style=list(fontSize="85%")
) ,
defaultColDef = colDef ( format = colFormat(separators=TRUE, digits=0), minWidth=75)
)
If we wanted to find out what percentage of the total worldwide cases each country had, we can do that with our data the same way:
jhu_covid_count2 %>%
mutate ( pct_of_cases = confirmed_total / sum ( confirmed_total) * 100 ,
pct_of_pop = est_population / sum (est_population) * 100 ) %>%
select (Country_Region, pct_of_cases, pct_of_pop) %>%
arrange ( desc (pct_of_cases))
## # A tibble: 196 × 3
## Country_Region pct_of_cases pct_of_pop
## <chr> <dbl> <dbl>
## 1 US 18.6 4.32
## 2 India 13.9 17.8
## 3 Brazil 8.84 2.73
## 4 United Kingdom 3.69 0.871
## 5 Russia 3.40 1.91
## 6 Turkey 3.26 1.10
## 7 France 2.95 0.886
## 8 Iran 2.40 1.09
## 9 Argentina 2.14 0.588
## 10 Spain 2.03 0.610
## # … with 186 more rows
Now let’s get the rate per 100,000 residents:
jhu_rates <-
jhu_covid_count2 %>%
mutate ( rate_per_100k = confirmed_total / est_population * 100000,
death_rate = deaths_total / confirmed_total * 100)
Here’s one way to chart the death rates from COVID for the 20 largest countries in the dataset:
jhu_rates %>%
slice_max ( n=20, est_population) %>%
ggplot ( aes ( x=reorder(Country_Region,death_rate), y=death_rate, ) ) +
geom_col (fill="blue") +
ylim ( 0, 8) +
coord_flip() +
theme_minimal() +
labs( y = "Pct of cases ending in death", x="")
Now we’ll use a package that some of you may be a little familiar with - it’s an R version of the leaflet() javascript library:
library(leaflet)
leaflet ( jhu_rates) %>%
addTiles () %>%
setView ( lng= 0, lat=0, zoom = 1) %>%
addCircles ( lng=~approx_long, lat=~approx_lat, weight=1,
radius = sqrt(jhu_rates$rate_per_100k)*4000,
popup= paste(jhu_rates$Country_Region, round(jhu_rates$rate_per_100k, 0), sep=": ")
)