The US Centers for Disease Control and Prevention (CDC) publishes a range of data sets and monitors diseases nationwide. One of its monitoring programs is the National Notifiable Diseases Surveillance System (NNDSS), which tracks about 120 diseases ranging from Zika to E. coli and rabies. The dataset we will consider is the NNDSS report on infrequently reported notifiable diseases, that is, diseases for which fewer than 1,000 cases are reported annually. Source: https://wwwn.cdc.gov/nndss/document/NNDSS-Fact-Sheet-508.pdf
In this exercise, we will download, parse and tidy the data file for use in data analysis.
The data file is provided at the NNDSS data portal (CDC page for Infrequently Reported Notifiable Diseases): https://data.cdc.gov/NNDSS/NNDSS-Table-I-infrequently-reported-notifiable-dis/5fyu-rtk3
Tidy data requires clearly defining an observation, then ensuring that each column is a variable and each row is an observation.
We download the data file from the CDC and copy it to GitHub, which freezes its content so that this analysis is reproducible.
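library(readr)   # read_csv, write_csv
library(stringr) # str_replace_all
library(dplyr)   # filter, arrange, select, summarise and the %>% pipe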
urlFile = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/NNDSS_-_Table_I._infrequently_reported_notifiable_diseases.csv"
csv_data <- read_csv(urlFile)
knitr::kable(csv_data[c(1:4), ])
| Disease | MMWR year | MMWR week | Current week | Current week, flag | Cum 2014 | Cum 2014, flag | 5-year weekly average | 5-year weekly average, flag | Total cases reported 2013 | Total cases reported 2013, flag | Total cases reported 2012 | Total cases reported 2012, flag | Total cases reported 2011 | Total cases reported 2011, flag | Total cases reported 2010 | Total cases reported 2010, flag | Total cases reported 2009 | Total cases reported 2009, flag | States reporting cases during current week (No.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Anthrax | 2014 | 1 | NA | - | NA | - | NA | - | NA | - | NA | - | 1 | NA | NA | - | 1 | NA | NA |
| Arboviral diseases, California serogroup virus disease§,¶ | 2014 | 1 | NA | - | NA | - | NA | - | 81 | NA | 81 | NA | 137 | NA | 75 | NA | 55 | NA | NA |
| Arboviral diseases, Eastern equine encephalitis virus disease§,¶ | 2014 | 1 | NA | - | NA | - | 0 | NA | 6 | NA | 15 | NA | 4 | NA | 10 | NA | 4 | NA | NA |
| Arboviral diseases, Powassan virus disease§,¶ | 2014 | 1 | NA | - | NA | - | 0 | NA | 12 | NA | 7 | NA | 16 | NA | 8 | NA | 6 | NA | NA |
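Let us first remove the non-ASCII characters from the raw data.
The Disease column contains footnote marks, namely the section sign § (0xA7) and the paragraph mark ¶ (0xB6), that should be cleansed from the raw data. Using str_replace_all from the stringr package, we are able to remove these non-ASCII characters. The pattern also covers byte 0x86, which corresponds to the dagger † in the Windows-1252 encoding.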
Disease_Ascii_only = str_replace_all( csv_data$Disease, "(\xB6|\xA7|\x86)", "")
csv_data$Disease = Disease_Ascii_only
knitr::kable(csv_data[c(10:13), ])
| Disease | MMWR year | MMWR week | Current week | Current week, flag | Cum 2014 | Cum 2014, flag | 5-year weekly average | 5-year weekly average, flag | Total cases reported 2013 | Total cases reported 2013, flag | Total cases reported 2012 | Total cases reported 2012, flag | Total cases reported 2011 | Total cases reported 2011, flag | Total cases reported 2010 | Total cases reported 2010, flag | Total cases reported 2009 | Total cases reported 2009, flag | States reporting cases during current week (No.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Botulism, other (wound and unspecified) | 2014 | 1 | 1 | NA | 1 | NA | 1 | NA | 11 | NA | 18 | NA | 32 | NA | 25 | NA | 25 | NA | CA (1 ) |
| Brucellosis | 2014 | 1 | NA | - | NA | - | 2 | NA | 93 | NA | 114 | NA | 79 | NA | 115 | NA | 115 | NA | NA |
| Chancroid | 2014 | 1 | NA | - | NA | - | 0 | NA | 12 | NA | 15 | NA | 8 | NA | 24 | NA | 28 | NA | NA |
| Cholera | 2014 | 1 | NA | - | NA | - | 1 | NA | 2 | NA | 17 | NA | 40 | NA | 13 | NA | 10 | NA | NA |
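The byte escapes used above can behave differently depending on the locale and file encoding, so it may help to see the same clean-up written with Unicode code points. The following is a minimal sketch on toy strings, not the author's code: `disease_raw` and `disease_clean` are hypothetical names, and the extra step that strips the comma left behind by the "§,¶" footnote pair is an optional assumption about what a reader might want.

```r
library(stringr)

# Toy disease names carrying the footnote marks seen in the raw file.
disease_raw <- c(
  "Anthrax",
  "Arboviral diseases, Powassan virus disease\u00A7,\u00B6",
  "Botulism, other (wound and unspecified)"
)

# Remove the section sign, paragraph mark and dagger by Unicode code point.
disease_clean <- str_replace_all(disease_raw, "[\u00A7\u00B6\u2020]", "")

# The footnote marks were comma separated, so a stray comma can remain at
# the end of a name; strip it together with any trailing whitespace.
disease_clean <- str_replace(disease_clean, "[,\\s]+$", "")

print(disease_clean)
```

Commas inside a name, such as the one in "Botulism, other", are untouched because the second pattern is anchored to the end of the string.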
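Next, the data file is wide, but it really captures two types of information: weekly figures for the current year (the current-week count, the cumulative 2014 count, and the 5-year weekly average) and annual totals for prior years (total cases reported in 2009 through 2013).
The two types of information differ in composition and should therefore be regarded as distinct observations.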
Thus, we expect to separate the initial data table into two related but distinct data sets. Let us use only the historical data for year 2013.
Note that the current year is defined as 2014, and weekly case counts are provided for each week of 2014, from 1 to 53.
We rename the columns to remove typos, extra spaces, and year-specific dependencies, which make the data untidy.
data2013 = csv_data[c(1,2,3,10)] # Keep the disease, MMWR year, MMWR week, Total Cases Reported 2013
colnames(data2013) = c("DISEASE", "MMWR_YEAR", "MMWR_WEEK", "TOTAL_CASES")
data2014 = csv_data[c(1,2,3,4,6,8)] # Keep the disease, MMWR year, MMWR week, Current week, Cum 2014, 5 year weekly average
colnames(data2014) = c("DISEASE", "MMWR_YEAR", "MMWR_WEEK", "CURRENT_WEEK", "YTD_CASES", "AVERAGE_5Y")
Here are the top 4 rows of each table.
knitr::kable(data2013[c(1:4),])
| DISEASE | MMWR_YEAR | MMWR_WEEK | TOTAL_CASES |
|---|---|---|---|
| Anthrax | 2014 | 1 | NA |
| Arboviral diseases, California serogroup virus disease, | 2014 | 1 | 81 |
| Arboviral diseases, Eastern equine encephalitis virus disease, | 2014 | 1 | 6 |
| Arboviral diseases, Powassan virus disease, | 2014 | 1 | 12 |
knitr::kable(data2014[c(1:4),])
| DISEASE | MMWR_YEAR | MMWR_WEEK | CURRENT_WEEK | YTD_CASES | AVERAGE_5Y |
|---|---|---|---|---|---|
| Anthrax | 2014 | 1 | NA | NA | NA |
| Arboviral diseases, California serogroup virus disease, | 2014 | 1 | NA | NA | NA |
| Arboviral diseases, Eastern equine encephalitis virus disease, | 2014 | 1 | NA | NA | 0 |
| Arboviral diseases, Powassan virus disease, | 2014 | 1 | NA | NA | 0 |
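The historical table above keeps only the 2013 totals. If all of the historical year columns were wanted, one possible approach is to gather them into a long table with one row per disease and year. This is a sketch on toy rows (values taken from the Brucellosis and Cholera preview rows), not part of the original analysis; it assumes the tidyr package is available, and `wide`, `history_long` and `REPORT_YEAR` are hypothetical names. In the real file the annual totals appear to repeat on every weekly row, so a `distinct()` step would probably be needed as well.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy rows that mimic the wide layout of the raw file, including a flag column.
wide <- tribble(
  ~Disease,      ~`Total cases reported 2013`, ~`Total cases reported 2013, flag`, ~`Total cases reported 2012`,
  "Brucellosis", 93,                           NA,                                 114,
  "Cholera",     2,                            NA,                                 17
)

history_long <- wide %>%
  select(-ends_with(", flag")) %>%                      # drop the flag columns
  pivot_longer(
    cols = matches("^Total cases reported \\d{4}$"),    # only the yearly totals
    names_to = "REPORT_YEAR",
    names_prefix = "Total cases reported ",             # keep just the year
    names_transform = list(REPORT_YEAR = as.integer),
    values_to = "TOTAL_CASES"
  ) %>%
  rename(DISEASE = Disease)

print(history_long)
```

With the year stored as a variable, the column names no longer depend on any specific year, which is the same goal the renaming step pursues.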
To keep the data in a tidy format, we export the two observation tables to separate files.
write_csv(data2013, "Tidy_NNDSS_Infrequent_2013_2014.csv" )
write_csv(data2014, "Tidy_NNDSS_Infrequent_2014_2014.csv" )
We will illustrate three exploratory data analyses with the enhanced 2014 data set.
data2014 %>% filter( MMWR_WEEK== 53) %>% arrange( desc( YTD_CASES ) ) %>%
select( DISEASE, YTD_CASES ) %>% top_n(10, YTD_CASES)
## Warning: package 'bindrcpp' was built under R version 3.4.4
The results above show that listeriosis and measles lead the pack.
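The query above relies on `top_n()`, which newer dplyr releases (1.0.0 and later) mark as superseded in favour of `slice_max()` and `slice_min()`. As a hedged illustration of the same ranking idea, the sketch below uses `slice_max()` on made-up rows; `toy2014`, `top2` and the disease names and counts are hypothetical and are not taken from the CDC file.

```r
library(dplyr)
library(tibble)

# Hypothetical year-end rows; the counts are invented for illustration.
toy2014 <- tribble(
  ~DISEASE,    ~MMWR_WEEK, ~YTD_CASES,
  "Disease A", 53,         120,
  "Disease B", 53,         45,
  "Disease C", 53,         300,
  "Disease D", 52,         999
)

top2 <- toy2014 %>%
  filter(MMWR_WEEK == 53) %>%        # keep only the final reporting week
  select(DISEASE, YTD_CASES) %>%
  slice_max(YTD_CASES, n = 2)        # two largest counts, sorted descending

print(top2)
```

Because `slice_max()` returns its rows already ordered from largest to smallest, the separate `arrange(desc(...))` step is not needed in this variant.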
data2014 %>% filter( MMWR_WEEK== 53 , YTD_CASES < 20) %>%
arrange( YTD_CASES) %>%
select( DISEASE, YTD_CASES ) %>%
top_n(-10, YTD_CASES)
From the above table, we see that even diseases often thought of as eradicated still appear in the 2014 counts: there was 1 case of human rabies, and plague had 5 cases.
data2014 %>% filter( MMWR_WEEK == 53 , YTD_CASES > 0 ) %>%
summarise( cases_per_year = mean( YTD_CASES, na.rm = TRUE ))
Thus, among the 55 diseases with at least one reported case, the average number of cases per disease in 2014 is ~112.
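The summary above reports only the mean, so the number of diseases behind it has to be obtained separately. One way to get both figures in a single step is to add `n()` to the same `summarise()` call. The sketch below shows this on invented rows; `toy2014`, `summary2014`, `n_diseases` and `cases_per_disease` are hypothetical names, and the counts are not CDC data.

```r
library(dplyr)
library(tibble)

# Hypothetical year-end rows; the counts are invented for illustration.
toy2014 <- tribble(
  ~DISEASE,    ~MMWR_WEEK, ~YTD_CASES,
  "Disease A", 53,         10,
  "Disease B", 53,         0,
  "Disease C", 53,         50,
  "Disease D", 53,         NA,
  "Disease E", 52,         7
)

summary2014 <- toy2014 %>%
  filter(MMWR_WEEK == 53, YTD_CASES > 0) %>%   # rows with NA counts are dropped too
  summarise(
    n_diseases = n(),                          # how many diseases enter the mean
    cases_per_disease = mean(YTD_CASES)        # average year-to-date count
  )

print(summary2014)
```

Note that `filter()` drops rows where the condition evaluates to `NA`, so diseases with a missing count are excluded before the mean is taken.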
In this scenario, the data was already mostly tidy. Except for the clean-up of column headers and non-ASCII characters, the data is structured in a usable way. What the data does not tell us is the social importance of each disease. Clearly, a disease with high mortality and rapid growth in cases requires urgent resources and preventative measures. In this regard, frequency data is insufficient.
There are 5 files associated with this component of the project. They are uploaded to GitHub or RPubs.