Clean variables for demographic tables

Age Range

AgeRange is derived from the Age variable. There are 1299 records with missing age range. However, there are additional age variables: PatientAgeYears, PatientAgeMonths, PatientAgeDays, and PatientAgeFromPulsenet so perhaps we can fill in some missing values.

t1<-table(rowSums(is.na(dat[, c('PatientAgeYears', 'PatientAgeMonths', 'PatientAgeDays', 'PatientAgeFromPulsenet')]))<4, is.na(dat$Age))

There are 1154 records that have a NULL Age value, but have at least one non-missing values for PatientAge* or PatientAgeFromPulesNet.

Before we fill-in missing data, let’s double-check to make sure that the original AgeRange was calculated correctly.

dat<-dat[order(PatientAgeYears),]
print(dat[AgeUnit=='Months', c('AgeRange', 'Age', 'AgeUnit', 'PatientAgeYears')] , nrow=25)

##      AgeRange Age AgeUnit PatientAgeYears
##   1:      0-4   3  Months               0
##   2:    18-30  19  Months               1
##   3:    05-17  17  Months               1
##   4:    18-30  24  Months               2
##   5:    18-30  18  Months               2
##  ---                                     
##  97:    05-17  12  Months              NA
##  98:    18-30  20  Months              NA
##  99:    18-30  20  Months              NA
## 100:    18-30  23  Months              NA
## 101:    05-17  12  Months              NA

It looks like the age unit was neglected when calculating AgeRange. We will need to re-cut that variable.

It also appears that Age was calculated in years based on elapsed dates for a lot of records, those are missing an age unit. What I think happened is that AgeYrs, which was a FoodNet numeric double field calculated on elapsed time between DOB and DtSpec date, was combined with Age from NARMS which was a numeric integer that needs AgeUnit to be correctly interpreted. We will need to create an AgeClean field to fix this error. We will also use the PatientAgeFromPulseNet, PatientAgeYears, and PatientAgeMonths variables to fill in missing values.

dat$AgeUnit[is.na(dat$AgeUnit) & !is.na(dat$Age)]<-'Years'
dat$PNAge<-ifelse(grepl("0-", dat$PatientAgeFromPulsenet)==T, 0, as.numeric(dat$PatientAgeFromPulsenet))

## Warning in ifelse(grepl("0-", dat$PatientAgeFromPulsenet) == T, 0,
## as.numeric(dat$PatientAgeFromPulsenet)): NAs introduced by coercion

dat$AgeClean<-floor(
               ifelse(dat$AgeUnit=='Years'   & !is.na(dat$Age), dat$Age,
                ifelse(dat$AgeUnit=='Months' & !is.na(dat$Age), dat$Age/12,
                 ifelse(is.na(dat$Age) & !is.na(dat$PNAge), dat$PNAge,
                   ifelse(is.na(dat$Age) & is.na(dat$PNAge) & !is.na(dat$PatientAgeYears), dat$PatientAgeYears, 
                     ifelse(is.na(dat$Age) & is.na(dat$PNAge) & is.na(dat$PatientAgeYears), dat$PatientAgeMonths/12, NA))))))

Ok, let’s take a look at the new AgeClean variable and see how it compares to the original. We should see shift in cases original classified in the 5-24 ages to the <2 age ranges.

There are now only 145 records with a missing AgeRangeNew value.

Sex

There are a few cases with unknown Sex that have Gender information. We will fill-in those missing values.

dat$Sex<-ifelse(dat$Sex=='U' & !is.na(dat$Gender), dat$Gender, dat$Sex)

Group Serotypes

This code condenses the values in FinalSpeciesName into a smaller number of categories. This variable is included in the NARMS dataset. However, there are 31831records missing FinalSpeciesName values. There is another field (serotypesummary) that has species values that can be used to fill in missing values. The FoodNet Codebook states that the serotypesummary field is a CDC-cleaned version of SeroSite. Per the discussion with Naeemah Logan (5/27), the NARMS FinalSpeciesName field should take precedence.

dat$SpeciesClean<-ifelse(dat$FinalGenusName !="Shigella" | 
                         dat$FinalSpeciesName %in% c(NA, 'unknown', 'other'), 
                         tolower(word(dat$serotypesummary, 1)), dat$FinalSpeciesName)

#Add not serotyped and not speciated to unknown
dat$SpeciesClean[dat$SpeciesClean %in% c(NA, 'not')]<-'unknown' 

table(dat$FinalSpeciesName, dat$SpeciesClean, useNA='ifany')

##                
##                 boydii dysenteriae flexneri sonnei unknown
##   alginolyticus      0           0        0      1       0
##   boydii            11           0        0      0       0
##   coli               0           0       36     49       2
##   dysenteriae        0           4        0      0       0
##   enterica           0           0        4     10      10
##   flexneri           0           0      359      0       0
##   jejuni             0           0        3      1       5
##   mimicus            0           0        0      1       0
##   monocytogenes      0           0        0      1       0
##   other              0           0        1      1       0
##   sonnei             0           0        0   1529       0
##   unknown            0           0        2      4       0
##   <NA>             208          63     5132  20580    5848

FoodNet Site

The State field is equivalent to FoodNet Site.

addmargins(table(dat$State, useNA ='ifany' ))

## 
##    CA    CO    CT    GA    MD    MN    NM    NY    OR    TN   Sum 
##  4503  1575  1016 10781  2419  3194  1698  1265  1232  6182 33865

Resistance Profile

For lab approved isolates not associated with an outbreak (Surveillance ==1 & Exclude ==0 & LabApproval ==1), we will categorize resistance. Note, the field SurveillanceReportable is a flag variable for cases with Surveillance ==1 & Exclude ==0 & LabApproval ==1, so I’m going to use that for brevity.

First, we will update the subpop1 variable according to the instructions Layne provided. This flag should be 1 for cases that have a SurveillanceReportable value of 1 AND have a non-missing or unknown ResistancePattern.

dat$SubPopNew<-dat$SurveillanceReportable==1 & !is.na(dat$ResistancePattern)
dat$subpop1<-NULL

There are 923 cases that are had a primary purpose of surveillance, were lab approved, were not excluded and have antibiotic resistance information. These isolates are part of the NARMS surveillance scheme and are included in official NARMS reports.

Resistance is grouped into the following variables (not mutually exclusive):

Resistance Type	Definition	New Field Name
Ampicillin-R	AMP_Concl==‘R’	AmpR
Azithromycin-R	Azm_Concl==‘R’	AzmR
Ciprofloxacin-R	CIP_Concl %in% c(‘I’, ‘R’)	CipR
Trimethoprim-Sulfamethoxazole-R	COT_Concl==‘R’	CotR
Ceftriaxone-R	AXO_Concl==‘R’	AxoR
Multidrug-R	AMP_Concl & AZM_Concl & CIP_Concl & COT_Concl =‘R’	MDR
Extensive Drug-R	MDR==T & AXO_Concl==‘R’	XDR

Resistance does not include intermediate results, with the exception of the Ciprofloxacin-R category.

#Individual categories of antibiotic resistance
dat[SubPopNew ==1, c("AmpR", 'AzmR', "CipDSC", "CipR", "CotR",  'AxoR') := 
      list(AMP_Concl=='R', 
           AZM_Concl=='R',
           CIP_Concl %in% c('I', 'R'),
           CIP_Concl=='R',
           COT_Concl=='R', 
           AXO_Concl=='R')]

#Number of antibiotic resistance categories, multi-drug and x-drug resistant
abxlist<-paste0(c('AMP', 'AZM', 'CIP', 'COT', 'AXO'), "_Concl")
dat[SubPopNew==1, AbxResSum :=rowSums(.SD=='R', na.rm=T), .SDcols=abxlist]

#Multi-drug resistance
dat[SubPopNew==1, MDR := rowSums(.SD=='R', na.rm=T)==4, .SDcols=abxlist[1:4]]

#X-drug resistance
dat[SubPopNew==1, XDR := MDR==T & AXO_Concl=='R']

#No drug resistance
dat[SubPopNew==1, NoAbxRes := AbxResSum == 0]

Of those cases, there are a total of 54 cases that were resistant to three or more antibiotics, 9 cases of multidrug resistance, and 1 cases of extensive drug resistance.

Age density by antibiotic resistance category

## Warning: Groups with fewer than two data points have been dropped.

## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

Resistance	N	Min	Mean	Median	Max
Ampicillin	409	0	21.98044	12	83
Azithromycin	57	1	39.22807	36	66
Ciprofloxacin-DSC	37	1	32.70270	34	76
Ciprofloxacin-R	29	2	33.65517	34	76
Cotrimoxazole	371	0	26.26954	24	80
Ceftriaxone	8	1	36.50000	41	55
Multi-Drug	9	31	41.11111	38	57
XDR	1	45	45.00000	45	45
None	291	0	17.43299	8	103

Calculate measures of severity

Fever

dat$Fever[is.na(dat$Fever)]<-'UNKNOWN'

Fever include patients with a value of “YES” for Fever. There are 11015 cases with fever.

Hospitalization

First, we will verify that dates of admission or discharge to a hospital are within the surveillance period. In other words, we will make sure that these are all plausible dates.

dates<-c('DtAdmit', 'DtDisch', 'DtAdmit2', 'DtDisch2')
dat[, (dates) := lapply(.SD, as.Date, format="%m/%d/%Y"), .SDcols = c(dates)]
dat[, lapply(.SD, min, na.rm=T), .SDcols = c(dates)]

dat[, lapply(.SD, max, na.rm=T), .SDcols = c(dates)]

Looks good, so we will define hospitalization as any record with a value of YES for Hospital or HospTrans or a non-missing admit or discharge date (initial or second admit/discharge).

dat[, HospClean := Hospital=='YES'|HospTrans=='YES'|rowSums(is.na(.SD), na.rm=T)<4, .SDcols=c(dates)]

There are 7060 records that were hospitalized.

Bloody diarrhea

Bloody diarrhea includes records with a value of “YES” for BloodyDiarr.

dat$BloodyDiarr[is.na(dat$BloodyDiarr)]<-'UNKNOWN'

There are 6487 cases with bloody diarrhea.

Bacteremia

There are three fields for specimen source (SpecimenSource, SourceSite, and SpecSrce) which have a “Blood” category. The SpecSrce field is the CDC-derived field and is the most complete.

dat$Bacteremia<-dat$SpecimenSource=='Blood'
dat$Bacteremia[is.na(dat$Bacteremia) & dat$SpecSrce=='BLOOD']<-T
dat$Bacteremia[is.na(dat$Bacteremia) & dat$SourceSite=='Blood']<-T
dat$Bacteremia[is.na(dat$Bacteremia)]<-F

There are 213 cases with bacteremia.

Death

dat$Death<-dat$Outcome=='DEAD'

There are 29 cases that died.

Overall Severity (1+ Severity Classifications)

dat$Severe<-dat$Fever=='YES'| dat$HospClean==T | dat$BloodyDiarr=='YES'| dat$Bacteremia==T | dat$Death==T

There are 16660 cases were classified as severe.

Cleaning Final, Merged Dataset for Analysis