The global Covid-19 pandemic has affected everyone in 2020. Through this unprecedented event, the general public has an opportunity like never before to understand the disease’s spread and impact through open data. Numerous online dashboards show cases, the rate of new infections, hospitalizations, and deaths.
Popular dashboards include:
While these dashboards are informative, data-minded people like me also want to get their hands on the raw data to better understand them.
This code-through will use the Wisconsin Department of Health Services (DHS) data API to demonstrate how to load and manipulate state-level COVID-19 data.
Specifically, you’ll learn how to access Wisconsin Covid-19 data through the DHS API, clean the data to prepare it for use, and plot simple graphs.
The Wisconsin DHS website provides a number of interactive visualizations of the Covid-19 data they aggregate from hospitals, medical examiners, and local public health departments.
They offer datasets organized at the aggregate state level, by county, or by census tract. For this demonstration we will use the state-level dataset.
The first step to using DHS data is to pull in the desired data using DHS’s API and the jsonlite package, and save it as a dataset named “Covid.”
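# Request state-level rows only; the encoded filter decodes to: where=GEO = 'State'
url <- "https://opendata.arcgis.com/datasets/b913e9591eae4912b33dc5b4e88646c5_10.geojson?where=GEO%20%3D%20'State'"
# Download and parse the GeoJSON once; flatten = TRUE turns nested fields into columns such as properties.DATE
Covid <- jsonlite::fromJSON( txt = url, simplifyDataFrame = TRUE, flatten = TRUE )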
head( Covid )
The dataset in its initial form has the “features” dataframe inside of one of only three columns. Running code “Covid <- Covid$features” forms a dataframe with 105 variables. The number of observations will change daily, because there is one row for each day data has been reported starting on March 15, 2020. As of this writing on October 8th, there are 208 observations.
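To see why the “features” element has to be pulled out, it can help to parse a tiny GeoJSON-like string by hand. The sketch below is only an illustration: the two rows, the POS_NEW values, and the “name” field are made up rather than real DHS output, and it assumes the jsonlite package is installed.

```r
# A made-up, two-day sample shaped like a GeoJSON FeatureCollection
sample_json <- '{
  "type": "FeatureCollection",
  "name": "sample",
  "features": [
    { "type": "Feature", "properties": { "DATE": "2020/03/15 19:00:00+00", "POS_NEW": 0 } },
    { "type": "Feature", "properties": { "DATE": "2020/03/16 19:00:00+00", "POS_NEW": 14 } }
  ]
}'

# Same parsing options as the real request, applied to the sample string
parsed <- jsonlite::fromJSON( txt = sample_json, simplifyDataFrame = TRUE, flatten = TRUE )

names( parsed )                     # top-level elements, including "features"
sample_features <- parsed$features  # the data frame with one row per day
names( sample_features )            # nested fields are prefixed with "properties."
nrow( sample_features )             # number of reported days in the sample
```

The real response has the same overall shape, which is why the data of interest is reached through the “features” element.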
The very last column shows the date, but it is stored as a character string rather than a date or numeric value. Each value has the form YYYY/MM/DD 19:00:00+00; the time is always 19:00 UTC because the daily data update occurs at 2:00 pm Central time. To make this a usable date field, one line of code extracts only the YYYY/MM/DD values, and another stores this as a new variable, “NewDate,” formatted as a date in the dataframe.
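Those two steps can be tried on a couple of sample strings before touching the full dataset. This is a minimal sketch in base R, not the code used in this walkthrough: it swaps in substr() for the string extraction, and the two timestamps are invented to match the format described above.

```r
# Invented timestamps in the format described above
raw_dates <- c( "2020/03/15 19:00:00+00", "2020/10/08 19:00:00+00" )

# Step 1: keep only the first 10 characters, YYYY/MM/DD
date_text <- substr( raw_dates, start = 1, stop = 10 )

# Step 2: convert the text to Date, stating the format explicitly
new_dates <- as.Date( date_text, format = "%Y/%m/%d" )

class( new_dates )   # "Date"
diff( new_dates )    # days between the two sample dates
```

Once the values are real dates, they can be subtracted, sorted, and placed on a plot axis.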
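# Keep only the feature table, one row per reporting day
Covid <- Covid$features
# Extract the first 10 characters (YYYY/MM/DD) of the timestamp string
Covid$NewDate <- stringr::str_extract( Covid$properties.DATE, "^.{10}" )
# Convert the extracted text to a Date
Covid$NewDate <- as.Date( Covid$NewDate, format = "%Y/%m/%d" )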
head( Covid )
This dataset includes 106 variables, many with confusing names like “IP_Y_70_79.” To understand the variables, and other important details, we must refer to the COVID-19 Public Use Data Definitions provided by Wisconsin DHS. This explains specifics of what tests and cases are counted and the definitions of variables. In this case, “IP_Y_70_79” means “Cumulative number of people who had confirmed cases of COVID-19 and were hospitalized for COVID-19, ages 70–79 years.”
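With this many columns, searching the names by pattern is often quicker than scrolling. Here is a small sketch, assuming the hospitalized-by-age columns all share the prefix seen in “IP_Y_70_79.” Only POS_NEW and IP_Y_70_79 appear in this walkthrough; the other two names are guesses made for illustration.

```r
# Stand-in for names( Covid ); two of the names are assumed, as marked
sample_names <- c( "properties.POS_NEW",
                   "properties.IP_Y_60_69",   # assumed name
                   "properties.IP_Y_70_79",
                   "properties.IP_Y_80_89",   # assumed name
                   "NewDate" )

# Return every name that starts with the hospitalized-by-age prefix
hosp_by_age <- grep( "^properties\\.IP_Y_", sample_names, value = TRUE )
hosp_by_age
```

On the real data, replace sample_names with names( Covid ) and check each match against the DHS data definitions.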
Now you can plot any variable against date to see how it has changed since the beginning of the Covid-19 pandemic. First, let’s look at the number of new positive cases by date.
You can see here that Wisconsin has had a marked increase in people with positive Covid-19 cases since September. The day this was written, October 8, 2020, had the first report of over 3,000 people with new positive tests in Wisconsin.
ggplot2 code reference: Chang, W. Cookbook for R.
# Load ggplot2 for the plots that follow
library( ggplot2 )
ggplot( data = Covid,
        aes( x = NewDate,
             y = properties.POS_NEW
        ) ) +
  geom_line( color = "red",
             size = 1 ) +
  ggtitle( "People with New Positive Tests by Date" ) +
  xlab( "Date" ) +
  ylab( "People with New Positive Tests" )
The number of people who tested positive for Covid-19 doesn’t give the whole picture unless we also know how many people were tested overall. Instead, let’s look at the percentage of newly tested people who were positive over time. To do this, we’ll first calculate the percent of people with positive tests, and then graph that number.
Here we see that in addition to the number of new positives increasing, the percentage of newly tested people who were positive increased during September and in early October remains higher than it had been March through August. The increase in positive tests is not merely due to increased testing.
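The percentage itself is simple arithmetic: new positives divided by new people tested, times 100. The sketch below runs it on made-up numbers. It also adds one assumption that is not part of this walkthrough's code, namely that a day with zero people tested should give NA rather than a division-by-zero result.

```r
# Made-up daily counts; the second day has no tests reported
pos_new  <- c( 5, 0, 120 )
test_new <- c( 100, 0, 1500 )

# Percent positive, with NA where no one was tested
pct_new <- ifelse( test_new > 0, ( pos_new / test_new ) * 100, NA )

round( pct_new, 1 )
```

ggplot2 leaves a gap in the line where a value is NA, so days without tests would show up as breaks rather than spikes.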
# Percent of newly tested people who were positive
Covid$PCT_NEW <- ( ( Covid$properties.POS_NEW / Covid$properties.TEST_NEW ) * 100 )
# Inside aes(), refer to the column by name rather than with Covid$
ggplot( data = Covid,
        aes( x = NewDate,
             y = PCT_NEW
        ) ) +
  geom_line( color = "blue",
             size = 1 ) +
  ggtitle( "Percent of New Tests that were Positive by Date" ) +
  xlab( "Date" ) +
  ylab( "% of New Tests" )
People with new positive tests aren’t our only concern, so next let’s plot the cumulative number of positive people, hospitalizations due to Covid-19, and deaths.
ggplot( ) +
  # Orange line: cumulative positive cases
  geom_line( data = Covid,
             aes( x = NewDate,
                  y = properties.POSITIVE ),
             color = "orange" ) +
  # Blue line: cumulative hospitalizations
  geom_line( data = Covid,
             aes( x = NewDate,
                  y = properties.HOSP_YES ),
             color = "blue" ) +
  # Green line: cumulative deaths
  geom_line( data = Covid,
             aes( x = NewDate,
                  y = properties.DEATHS ),
             color = "green" ) +
  ggtitle( "Positive Cases, Hospitalizations, and Deaths by Date" ) +
  xlab( "Date" ) +
  ylab( "Number" )
There is much more you could do with the Wisconsin DHS Covid-19 dataset, following the code examples above.
You could compare rates of positive tests and deaths between men and women. Men are thought to seek health care less often than women; does the data suggest that is the case here?
It has been reported widely that older age is a risk factor for worse Covid-19 outcomes. Positive tests, hospitalizations, intensive care hospitalizations, and deaths are all broken down into 10-year age ranges, so you could plot these to look for differences between age groups.
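One way to set up such an age-group comparison is to reshape the age columns into a “long” layout, with one row per date and age group, so that a single plotting call can draw one line per group. The following is a rough sketch on made-up numbers, not real DHS counts. Apart from “IP_Y_70_79,” the column names are assumptions, and the plot is drawn only if ggplot2 is installed.

```r
# Made-up cumulative hospitalization counts by age group
sample_covid <- data.frame(
  NewDate = as.Date( c( "2020-10-06", "2020-10-07", "2020-10-08" ) ),
  properties.IP_Y_60_69 = c( 10, 12, 15 ),   # assumed column name
  properties.IP_Y_70_79 = c( 20, 21, 25 ),
  properties.IP_Y_80_89 = c( 18, 22, 23 )    # assumed column name
)

age_cols <- c( "properties.IP_Y_60_69",
               "properties.IP_Y_70_79",
               "properties.IP_Y_80_89" )

# stack() puts the selected columns into one "values" column, with an
# "ind" column recording which original column each value came from
long <- stack( sample_covid[ age_cols ] )
names( long ) <- c( "hospitalized", "age_group" )

# Repeat the dates once per age column so they line up with the stacked values
long$NewDate <- rep( sample_covid$NewDate, times = length( age_cols ) )

# Draw one line per age group if ggplot2 is available
if ( requireNamespace( "ggplot2", quietly = TRUE ) ) {
  p <- ggplot2::ggplot( data = long,
                        ggplot2::aes( x = NewDate,
                                      y = hospitalized,
                                      color = age_group ) ) +
    ggplot2::geom_line() +
    ggplot2::xlab( "Date" ) +
    ggplot2::ylab( "Cumulative hospitalizations" )
  print( p )
}
```

Mapping color to the age group also produces a legend automatically, which the fixed-color plots above do not have.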
Similarly, the data identifies positive tests and deaths by race and ethnicity. It would be interesting to see how different racial groups have been affected by Covid-19, especially compared to the population of Wisconsin, which is predominantly white.
Learn more about Covid-19 data with the following:
This code-through references and cites the following source: