Here I use the data available at https://github.com/nytimes/covid-19-data to highlight a few aspects of the Covid-19 pandemic in Nassau County, New York, and I show the R code that I used.
I begin by making the tidyverse package available for use:
library(tidyverse)
Next, I download the data provided by The New York Times for all US counties and prepare a smaller data frame for Nassau County, NY, by filtering on its FIPS code, 36059, which the code below reads from params$fips:
corona.us.counties_2020 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2020.csv")
corona.us.counties_2021 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2021.csv")
corona.us.counties_2022 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties-2022.csv")
corona.us.counties <- rbind(corona.us.counties_2020, corona.us.counties_2021, corona.us.counties_2022)
mycounty.mystate <- filter(corona.us.counties, fips == params$fips)
head(mycounty.mystate)
## # A tibble: 6 × 6
## date county state fips cases deaths
## <date> <chr> <chr> <chr> <dbl> <dbl>
## 1 2020-03-05 Nassau New York 36059 1 0
## 2 2020-03-06 Nassau New York 36059 4 0
## 3 2020-03-07 Nassau New York 36059 4 0
## 4 2020-03-08 Nassau New York 36059 5 0
## 5 2020-03-09 Nassau New York 36059 17 0
## 6 2020-03-10 Nassau New York 36059 19 0
The crucial variables are cases and deaths, both of which are cumulative counts. Note that the data are arranged chronologically and begin on Thursday, March 5, 2020, the day the first case was recorded in Nassau County, New York.
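Because both variables are cumulative, two quick sanity checks are possible: whether the rows really are in date order, and whether a count ever falls from one day to the next (which would point to a revision of the data). The following is a minimal sketch on a toy data frame; the deaths values are invented for illustration, and the names toy, is.chronological and revisions are my own.

```r
library(dplyr)

# Toy data standing in for one county's rows; the deaths are invented.
toy <- tibble(
  date   = as.Date("2020-03-05") + 0:5,
  cases  = c(1, 4, 4, 5, 17, 19),
  deaths = c(0, 0, 0, 1, 3, 2)  # the last value is a deliberate downward revision
)

# TRUE when no date is earlier than the one before it.
is.chronological <- all(diff(as.numeric(toy$date)) >= 0)

# Rows where a cumulative count is lower than on the previous day.
# The first row has no previous day, so its comparison is NA and filter() drops it.
revisions <- toy %>%
  arrange(date) %>%
  filter(cases < lag(cases) | deaths < lag(deaths))

is.chronological
revisions
```

Applied to the real data, the same filter would list any days on which the reported totals were revised downward.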
As an example of the use of the xts and dygraphs packages, I present an interactive graph of the cumulative number of Covid-19 cases. The xts package is widely used to work with time series data. The dygraphs package creates, inter alia, interactive graphs when it is fed an xts data object.
library(xts)
library(dygraphs)
mycounty.mystate.coredata <- mycounty.mystate %>%
  select(cases, deaths)
# read_csv() has already parsed the date column as a Date, so no format string is needed
mycounty.mystate.index <- as.Date(mycounty.mystate$date)
mycounty.mystate.xts <- xts(mycounty.mystate.coredata, order.by = mycounty.mystate.index)
dygraph(mycounty.mystate.xts$cases, main = "Cumulative Cases", width = 500, height = 300) %>%
  dyRangeSelector() %>%
  dyHighlight(highlightCircleSize = 5,
              highlightSeriesBackgroundAlpha = 0.2,
              hideOnMouseOut = FALSE)
# To show all the variables in the `xts` object, delete `$cases`.
Note, again, that this graph is interactive! If you glide your cursor over the graph, you should see an ever-changing label giving the cumulative cases of Covid-19 for the relevant day. You should also be able to drag the sliders on the graph’s horizontal axis to choose the beginning and end of the chart’s time period.
The next graph shows the same data as the one above, but using a logarithmic scale (with one unit of height along the vertical scale representing a doubling of the plotted variable). Moreover, this is a static – that is, non-interactive – graph.
For my static graphs, I use the ggplot2 package, which is part of the tidyverse package that I have already made ready for use.
ggplot(data = mycounty.mystate) +
  geom_point(mapping = aes(x = date, y = cases), color = "blue") +
  scale_y_continuous(trans = 'log2') +
  labs(x = "Date", y = "Cumulative Cases", title = "The Spread of the Virus", subtitle = "Logarithmic Scale")
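To see why a base-2 logarithmic scale turns each doubling into one unit of height, here is a small worked example. The counts are invented and simply double at every step.

```r
# Invented cumulative counts that double at every step.
counts <- c(100, 200, 400, 800, 1600)

# Position of each count on a base-2 logarithmic axis.
log2.counts <- log2(counts)

# Vertical distance between consecutive points: one unit per doubling
# (up to floating-point rounding).
steps <- diff(log2.counts)
steps
```

On such a scale, a pandemic that doubles at a steady pace plots as a straight line, and a flattening curve means that doublings are taking longer.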
The next graph begins as non-interactive, but becomes interactive thanks to the ggplotly() function of the plotly package.
p <- ggplot(data = mycounty.mystate) +
  geom_line(mapping = aes(x = date, y = deaths)) +
  labs(x = "Date", y = "Cumulative Deaths", title = "The Toll", subtitle = "Linear Scale")
#install.packages("plotly")
library(plotly)
ggplotly(p)
And on a logarithmic scale:
ggplot(data = mycounty.mystate) +
  geom_point(mapping = aes(x = date, y = deaths), color = "blue") +
  scale_y_continuous(trans = 'log2') +
  labs(x = "Date", y = "Cumulative Deaths", title = "The Toll", subtitle = "Logarithmic Scale")
Having graphed the data for Covid-19 cases and deaths, it is not too much of a detour to look at the case fatality rate (CFR), which is deaths as a percent of cases:
ggplot(data = mycounty.mystate) +
  geom_line(mapping = aes(x = date, y = 100*(deaths/cases))) +
  labs(x = "Date", y = "Deaths as a percent of Cases", title = "Case Fatality Rate")
Note that this rate depends heavily on the number of tests being done and on the criteria used to determine who gets tested. Moreover, this case fatality rate is cumulative deaths as a percent of cumulative cases. As time passes and the pandemic matures, day-to-day changes in these cumulative numbers become relatively inconsequential. Consequently, the CFR, being a ratio of slow-changing numbers, is itself slow to change.
The one exception to this was August 6, 2020, when the cumulative number of deaths actually fell by 512, probably because of some reassessment of the data.
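A little arithmetic shows how slowly a ratio of cumulative totals moves. The numbers below are invented round figures, not Nassau County's actual totals.

```r
# Invented cumulative totals late in a pandemic.
cases.before  <- 400000
deaths.before <- 3000

# One further (invented) day of reports.
new.cases  <- 500
new.deaths <- 2

# CFR, in percent, before and after adding that day.
cfr.before <- 100 * deaths.before / cases.before
cfr.after  <- 100 * (deaths.before + new.deaths) / (cases.before + new.cases)

cfr.before
cfr.after
cfr.after - cfr.before  # change in percentage points
```

With totals this large, a single ordinary day shifts the CFR by well under a thousandth of a percentage point, so only a large one-off revision can move it visibly.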
The increase in the cumulative totals from one date to the next gives the increment for the second of the two dates. The seven-day averages of the daily increases are also calculated.
mycounty.mystate <- mycounty.mystate %>%
  arrange(date) %>% # This is not strictly necessary
  mutate(increase.in.cases = cases - lag(cases),
         increase.in.deaths = deaths - lag(deaths),
         increase.in.cases.7days = (cases - lag(cases, 7))/7,
         increase.in.deaths.7days = (deaths - lag(deaths, 7))/7)
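The seven-day figures above are computed as a seven-day difference in the cumulative total divided by seven. That is the same as averaging the seven most recent daily increases, because the daily increases sum to the seven-day difference. The sketch below checks this on invented numbers; toy, increase, avg.by.difference and direct.mean are names chosen for illustration.

```r
library(dplyr)

# Invented daily increases, accumulated into a cumulative series.
toy <- tibble(
  date  = as.Date("2021-01-01") + 0:9,
  cases = cumsum(c(10, 12, 9, 15, 20, 18, 14, 16, 22, 19))
)

toy <- toy %>%
  arrange(date) %>%
  mutate(increase          = cases - lag(cases),
         avg.by.difference = (cases - lag(cases, 7)) / 7)

# Mean of the seven most recent daily increases, computed directly.
direct.mean <- mean(tail(toy$increase, 7))

direct.mean
tail(toy$avg.by.difference, 1)
```

Both values agree for the last row. The first seven rows of avg.by.difference are NA, since no observation exists seven days earlier.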
Now the daily tallies of new cases and deaths can be graphed, with the seven-day averages overlaid in blue:
ggplot(data = mycounty.mystate) +
  geom_line(mapping = aes(x = date, y = increase.in.cases)) +
  geom_line(mapping = aes(x = date, y = increase.in.cases.7days), color = "blue", linetype = 1, size = 1.5) +
  labs(x = NULL, y = NULL, title = "The Daily Increase in Cases and its Seven-Day Average")
ggplot(data = mycounty.mystate) +
  geom_line(mapping = aes(x = date, y = increase.in.deaths)) +
  geom_line(mapping = aes(x = date, y = increase.in.deaths.7days), color = "blue", linetype = 1, size = 1.5) +
  labs(x = NULL, y = NULL, title = "The Daily Increase in Deaths and its Seven-Day Average") +
  ylim(0, NA)
mycounty.mystate %>% select(date, increase.in.cases) %>% arrange(increase.in.cases) %>% na.omit() %>% tail()
## # A tibble: 6 × 2
## date increase.in.cases
## <date> <dbl>
## 1 2022-01-09 6668
## 2 2021-12-30 6861
## 3 2022-01-06 6983
## 4 2021-12-31 7346
## 5 2022-01-01 7716
## 6 2021-12-26 8121
mycounty.mystate %>% select(date, increase.in.deaths) %>% arrange(increase.in.deaths) %>% na.omit() %>% tail()
## # A tibble: 6 × 2
## date increase.in.deaths
## <date> <dbl>
## 1 2020-04-14 108
## 2 2020-04-10 112
## 3 2020-04-06 139
## 4 2020-04-19 221
## 5 2020-04-04 258
## 6 2022-11-11 557
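The two listings above sort in ascending order and keep the tail, so the largest increase appears last. A possible alternative, sketched here on invented numbers, is dplyr::slice_max(), which returns the top rows directly with the largest first. The data frame toy and the result top3 are illustrative names.

```r
library(dplyr)

# Invented daily increases; the first is NA, as it is after lag().
toy <- tibble(
  date              = as.Date("2022-01-01") + 0:5,
  increase.in.cases = c(NA, 50, 80, 20, 95, 60)
)

# Drop the missing value, then keep the three largest increases.
top3 <- toy %>%
  filter(!is.na(increase.in.cases)) %>%
  slice_max(increase.in.cases, n = 3)

top3
```

Note that slice_max() keeps ties by default, so it can return more than n rows when several days share the same value.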
mycounty.mystate %>%
  select(date, increase.in.cases, increase.in.deaths) %>%
  tail(n = 28) %>%
  knitr::kable(caption = paste("The Covid-19 Pandemic During the Last Four Weeks:", params$county, "County,", params$state_short)) %>%
  kableExtra::kable_styling(full_width = FALSE)
| date | increase.in.cases | increase.in.deaths |
|---|---|---|
| 2022-11-13 | 309 | 0 |
| 2022-11-14 | 230 | 0 |
| 2022-11-15 | 313 | 0 |
| 2022-11-16 | 389 | 0 |
| 2022-11-17 | 392 | 0 |
| 2022-11-18 | 395 | 0 |
| 2022-11-19 | 406 | 0 |
| 2022-11-20 | 323 | 0 |
| 2022-11-21 | 310 | 0 |
| 2022-11-22 | 299 | 0 |
| 2022-11-23 | 353 | 0 |
| 2022-11-24 | 577 | 0 |
| 2022-11-25 | 386 | 0 |
| 2022-11-26 | 252 | 0 |
| 2022-11-27 | 297 | 0 |
| 2022-11-28 | 307 | 0 |
| 2022-11-29 | 360 | 0 |
| 2022-11-30 | 857 | 0 |
| 2022-12-01 | 829 | 0 |
| 2022-12-02 | 666 | 0 |
| 2022-12-03 | 639 | 0 |
| 2022-12-04 | 408 | 0 |
| 2022-12-05 | 403 | 0 |
| 2022-12-06 | 527 | 0 |
| 2022-12-07 | 606 | 0 |
| 2022-12-08 | 644 | 36 |
| 2022-12-09 | 626 | 0 |
| 2022-12-10 | 488 | 0 |
ggplot(data = tail(mycounty.mystate, 28)) +
  geom_line(mapping = aes(x = date, y = increase.in.cases)) +
  expand_limits(y = 0)
ggplot(data = tail(mycounty.mystate, 28), mapping = aes(x = date, y = increase.in.deaths)) +
  geom_col() +
  scale_y_continuous(breaks = 0:5)
Needless to say, the code here can be used to present a similar profile for any other US county, by inserting the appropriate FIPS code for the county in the first of my code chunks.
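One way to make the county switch explicit is to wrap the filtering step in a small function. This is only a sketch: county.rows is a hypothetical helper that does not appear in my code above, and the toy rows are invented. The filtering code above reads the county from params$fips, so in a parameterized R Markdown report the same switch could presumably be made by changing that parameter.

```r
library(dplyr)

# Hypothetical helper (not part of the original code):
# keep one county's rows, in date order.
county.rows <- function(data, fips.code) {
  data %>%
    filter(fips == fips.code) %>%
    arrange(date)
}

# Invented rows in the same layout as the New York Times county files.
toy <- tibble(
  date   = as.Date(c("2020-03-06", "2020-03-05", "2020-03-05")),
  county = c("Nassau", "Nassau", "Suffolk"),
  state  = "New York",
  fips   = c("36059", "36059", "36103"),
  cases  = c(4, 1, 1),
  deaths = 0
)

nassau <- county.rows(toy, "36059")
nassau
```

The FIPS code is passed as a string because the fips column is read in as character.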
This essay is meant to help me remember the R commands I used in it. I am an amateur “data scientist” and I work on simple projects on occasion. As a result of the long gaps between my “projects”, I tend to forget what I learn.
- readr::read_csv() to import CSV data
- dplyr::filter() to extract a subset of rows
- dplyr::select() to extract a subset of columns
- as.Date() to convert a string into a date
- xts package to convert a data frame into a time series object
- dygraphs package to create an interactive graph of a time series variable
- ggplot2 package to graph a time series
- plotly package to create an interactive graph
- dplyr::mutate() to compute a new variable that measures day-to-day increases in a time series variable, and the weekly averages thereof
- knitr::kable() to make a nice-looking table from a data frame