library('data.table')
library('ggplot2')
library('stringr')
library('kableExtra')
setwd('//cdc.gov/project/ATS_GIS_Store4/Projects/prj06135_Shigella_SVI/Data/FoodNet_NARMS')
dat<-read.csv('FN_NARMS_PulseNet_Census.csv', stringsAsFactors = F)
setDT(dat)
AgeRange is derived from the Age variable. There are 1299 records with missing age range. However, there are additional age variables: PatientAgeYears, PatientAgeMonths, PatientAgeDays, and PatientAgeFromPulsenet so perhaps we can fill in some missing values.
t1<-table(rowSums(is.na(dat[, c('PatientAgeYears', 'PatientAgeMonths', 'PatientAgeDays', 'PatientAgeFromPulsenet')]))<4, is.na(dat$Age))
There are 1154 records that have a NULL Age value, but have at least one non-missing values for PatientAge* or PatientAgeFromPulesNet.
Before we fill-in missing data, let’s double-check to make sure that the original AgeRange was calculated correctly.
dat<-dat[order(PatientAgeYears),]
print(dat[AgeUnit=='Months', c('AgeRange', 'Age', 'AgeUnit', 'PatientAgeYears')] , nrow=25)
## AgeRange Age AgeUnit PatientAgeYears
## 1: 0-4 3 Months 0
## 2: 18-30 19 Months 1
## 3: 05-17 17 Months 1
## 4: 18-30 24 Months 2
## 5: 18-30 18 Months 2
## ---
## 97: 05-17 12 Months NA
## 98: 18-30 20 Months NA
## 99: 18-30 20 Months NA
## 100: 18-30 23 Months NA
## 101: 05-17 12 Months NA
It looks like the age unit was neglected when calculating AgeRange. We will need to re-cut that variable.
It also appears that Age was calculated in years based on elapsed dates for a lot of records, those are missing an age unit. What I think happened is that AgeYrs, which was a FoodNet numeric double field calculated on elapsed time between DOB and DtSpec date, was combined with Age from NARMS which was a numeric integer that needs AgeUnit to be correctly interpreted. We will need to create an AgeClean field to fix this error. We will also use the PatientAgeFromPulseNet, PatientAgeYears, and PatientAgeMonths variables to fill in missing values.
dat$AgeUnit[is.na(dat$AgeUnit) & !is.na(dat$Age)]<-'Years'
dat$PNAge<-ifelse(grepl("0-", dat$PatientAgeFromPulsenet)==T, 0, as.numeric(dat$PatientAgeFromPulsenet))
## Warning in ifelse(grepl("0-", dat$PatientAgeFromPulsenet) == T, 0,
## as.numeric(dat$PatientAgeFromPulsenet)): NAs introduced by coercion
dat$AgeClean<-floor(
ifelse(dat$AgeUnit=='Years' & !is.na(dat$Age), dat$Age,
ifelse(dat$AgeUnit=='Months' & !is.na(dat$Age), dat$Age/12,
ifelse(is.na(dat$Age) & !is.na(dat$PNAge), dat$PNAge,
ifelse(is.na(dat$Age) & is.na(dat$PNAge) & !is.na(dat$PatientAgeYears), dat$PatientAgeYears,
ifelse(is.na(dat$Age) & is.na(dat$PNAge) & is.na(dat$PatientAgeYears), dat$PatientAgeMonths/12, NA))))))
Ok, let’s take a look at the new AgeClean variable and see how it
compares to the original. We should see shift in cases original
classified in the 5-24 ages to the <2 age ranges.There are now only 145 records with a missing AgeRangeNew value.
There are a few cases with unknown Sex that have Gender information. We will fill-in those missing values.
dat$Sex<-ifelse(dat$Sex=='U' & !is.na(dat$Gender), dat$Gender, dat$Sex)
This code condenses the values in FinalSpeciesName into a smaller number of categories. This variable is included in the NARMS dataset. However, there are 31831records missing FinalSpeciesName values. There is another field (serotypesummary) that has species values that can be used to fill in missing values. The FoodNet Codebook states that the serotypesummary field is a CDC-cleaned version of SeroSite. Per the discussion with Naeemah Logan (5/27), the NARMS FinalSpeciesName field should take precedence.
dat$SpeciesClean<-ifelse(dat$FinalGenusName !="Shigella" |
dat$FinalSpeciesName %in% c(NA, 'unknown', 'other'),
tolower(word(dat$serotypesummary, 1)), dat$FinalSpeciesName)
#Add not serotyped and not speciated to unknown
dat$SpeciesClean[dat$SpeciesClean %in% c(NA, 'not')]<-'unknown'
table(dat$FinalSpeciesName, dat$SpeciesClean, useNA='ifany')
##
## boydii dysenteriae flexneri sonnei unknown
## alginolyticus 0 0 0 1 0
## boydii 11 0 0 0 0
## coli 0 0 36 49 2
## dysenteriae 0 4 0 0 0
## enterica 0 0 4 10 10
## flexneri 0 0 359 0 0
## jejuni 0 0 3 1 5
## mimicus 0 0 0 1 0
## monocytogenes 0 0 0 1 0
## other 0 0 1 1 0
## sonnei 0 0 0 1529 0
## unknown 0 0 2 4 0
## <NA> 208 63 5132 20580 5848
The State field is equivalent to FoodNet Site.
addmargins(table(dat$State, useNA ='ifany' ))
##
## CA CO CT GA MD MN NM NY OR TN Sum
## 4503 1575 1016 10781 2419 3194 1698 1265 1232 6182 33865
For lab approved isolates not associated with an outbreak (Surveillance ==1 & Exclude ==0 & LabApproval ==1), we will categorize resistance. Note, the field SurveillanceReportable is a flag variable for cases with Surveillance ==1 & Exclude ==0 & LabApproval ==1, so I’m going to use that for brevity.
First, we will update the subpop1 variable according to the instructions Layne provided. This flag should be 1 for cases that have a SurveillanceReportable value of 1 AND have a non-missing or unknown ResistancePattern.
dat$SubPopNew<-dat$SurveillanceReportable==1 & !is.na(dat$ResistancePattern)
dat$subpop1<-NULL
There are 923 cases that are had a primary purpose of surveillance, were lab approved, were not excluded and have antibiotic resistance information. These isolates are part of the NARMS surveillance scheme and are included in official NARMS reports.
Resistance is grouped into the following variables (not mutually exclusive):
| Resistance Type | Definition | New Field Name |
|---|---|---|
| Ampicillin-R | AMP_Concl==‘R’ | AmpR |
| Azithromycin-R | Azm_Concl==‘R’ | AzmR |
| Ciprofloxacin-R | CIP_Concl %in% c(‘I’, ‘R’) | CipR |
| Trimethoprim-Sulfamethoxazole-R | COT_Concl==‘R’ | CotR |
| Ceftriaxone-R | AXO_Concl==‘R’ | AxoR |
| Multidrug-R | AMP_Concl & AZM_Concl & CIP_Concl & COT_Concl =‘R’ | MDR |
| Extensive Drug-R | MDR==T & AXO_Concl==‘R’ | XDR |
Resistance does not include intermediate results, with the exception of the Ciprofloxacin-R category.
#Individual categories of antibiotic resistance
dat[SubPopNew ==1, c("AmpR", 'AzmR', "CipDSC", "CipR", "CotR", 'AxoR') :=
list(AMP_Concl=='R',
AZM_Concl=='R',
CIP_Concl %in% c('I', 'R'),
CIP_Concl=='R',
COT_Concl=='R',
AXO_Concl=='R')]
#Number of antibiotic resistance categories, multi-drug and x-drug resistant
abxlist<-paste0(c('AMP', 'AZM', 'CIP', 'COT', 'AXO'), "_Concl")
dat[SubPopNew==1, AbxResSum :=rowSums(.SD=='R', na.rm=T), .SDcols=abxlist]
#Multi-drug resistance
dat[SubPopNew==1, MDR := rowSums(.SD=='R', na.rm=T)==4, .SDcols=abxlist[1:4]]
#X-drug resistance
dat[SubPopNew==1, XDR := MDR==T & AXO_Concl=='R']
#No drug resistance
dat[SubPopNew==1, NoAbxRes := AbxResSum == 0]
Of those cases, there are a total of 54 cases that were resistant to three or more antibiotics, 9 cases of multidrug resistance, and 1 cases of extensive drug resistance.
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
| Resistance | N | Min | Mean | Median | Max |
|---|---|---|---|---|---|
| Ampicillin | 409 | 0 | 21.98044 | 12 | 83 |
| Azithromycin | 57 | 1 | 39.22807 | 36 | 66 |
| Ciprofloxacin-DSC | 37 | 1 | 32.70270 | 34 | 76 |
| Ciprofloxacin-R | 29 | 2 | 33.65517 | 34 | 76 |
| Cotrimoxazole | 371 | 0 | 26.26954 | 24 | 80 |
| Ceftriaxone | 8 | 1 | 36.50000 | 41 | 55 |
| Multi-Drug | 9 | 31 | 41.11111 | 38 | 57 |
| XDR | 1 | 45 | 45.00000 | 45 | 45 |
| None | 291 | 0 | 17.43299 | 8 | 103 |
dat$Fever[is.na(dat$Fever)]<-'UNKNOWN'
Fever include patients with a value of “YES” for Fever. There are 11015 cases with fever.
First, we will verify that dates of admission or discharge to a hospital are within the surveillance period. In other words, we will make sure that these are all plausible dates.
dates<-c('DtAdmit', 'DtDisch', 'DtAdmit2', 'DtDisch2')
dat[, (dates) := lapply(.SD, as.Date, format="%m/%d/%Y"), .SDcols = c(dates)]
dat[, lapply(.SD, min, na.rm=T), .SDcols = c(dates)]
dat[, lapply(.SD, max, na.rm=T), .SDcols = c(dates)]
Looks good, so we will define hospitalization as any record with a value of YES for Hospital or HospTrans or a non-missing admit or discharge date (initial or second admit/discharge).
dat[, HospClean := Hospital=='YES'|HospTrans=='YES'|rowSums(is.na(.SD), na.rm=T)<4, .SDcols=c(dates)]
There are 7060 records that were hospitalized.
Bloody diarrhea includes records with a value of “YES” for BloodyDiarr.
dat$BloodyDiarr[is.na(dat$BloodyDiarr)]<-'UNKNOWN'
There are 6487 cases with bloody diarrhea.
There are three fields for specimen source (SpecimenSource, SourceSite, and SpecSrce) which have a “Blood” category. The SpecSrce field is the CDC-derived field and is the most complete.
dat$Bacteremia<-dat$SpecimenSource=='Blood'
dat$Bacteremia[is.na(dat$Bacteremia) & dat$SpecSrce=='BLOOD']<-T
dat$Bacteremia[is.na(dat$Bacteremia) & dat$SourceSite=='Blood']<-T
dat$Bacteremia[is.na(dat$Bacteremia)]<-F
There are 213 cases with bacteremia.
dat$Death<-dat$Outcome=='DEAD'
There are 29 cases that died.
dat$Severe<-dat$Fever=='YES'| dat$HospClean==T | dat$BloodyDiarr=='YES'| dat$Bacteremia==T | dat$Death==T
There are 16660 cases were classified as severe.
Due to the size of the merged dataset, we will restrict columns to only those that will be used for analysis.
columns<-openxlsx::read.xlsx('//cdc.gov/project/ATS_GIS_Store12/Shigella_SVI/Data Dictionaries/dataset_columns.xlsx')
keep<-columns$Name[columns$Remove==F]
final<-dat[, ..keep]
options(scipen=500)
write.csv(dat, "//cdc.gov/project/ATS_GIS_Store4/Projects/prj06135_Shigella_SVI/Data/FoodNet_NARMS/analytic_file_V6.csv", row.names=F)