# for reading Excel file

Read the Data from Excel File

naloxone.2019 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2019")

naloxone.2020 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2020")

naloxone.2021 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2021")

Let’s look at the dimensionality of the data sets.

dim(naloxone.2019)
## [1] 118  25
dim(naloxone.2020)
## [1] 133  25
dim(naloxone.2021)
## [1] 82 16

And let’s look at a sample of the data. naloxone.2019.

glimpse(naloxone.2019)
## Rows: 118
## Columns: 25
## $ `Transport Agency`           <chr> NA, "BUTTE COUNTY EMS", "BUTTE COUNTY EMS…
## $ `Incident Date`              <dttm> NA, 2019-01-02, 2019-01-04, 2019-01-05, …
## $ `Incident #`                 <dbl> NA, 303, 483, 696, 1328, 1839, 1841, 1845…
## $ `Primary Impression`         <chr> NA, "ALOC", "ALOC", "Overdose", "ALOC", "…
## $ `Transport Destination`      <chr> NA, "Enloe Medical Center", "Enloe Medica…
## $ `Incident Address`           <chr> NA, "2730 White Avenue", "15 CARRIAGE LN"…
## $ `Incident City`              <chr> NA, "Chico", "Chico", "Oroville", "Orovil…
## $ Zip                          <dbl> NA, 95926, 95926, 95966, 95966, 95926, 95…
## $ State                        <chr> NA, "CA", "CA", "CA", "CA", "CA", "CA", "…
## $ County                       <chr> NA, "Butte", "Butte", "Butte", "Butte", "…
## $ `Prescription Intentional`   <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...12                        <chr> "BLS FD Administration", NA, NA, NA, NA, …
## $ ...13                        <chr> "LE Administration", NA, NA, NA, NA, NA, …
## $ ...14                        <chr> "Female Age", NA, NA, NA, NA, NA, NA, NA,…
## $ ...15                        <chr> "Male Age", NA, NA, NA, NA, NA, NA, NA, N…
## $ `Prescription Unintentional` <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...17                        <chr> "BLS FD Administration", NA, NA, NA, NA, …
## $ ...18                        <chr> "LE Administration", NA, NA, NA, NA, NA, …
## $ ...19                        <chr> "Female Age", NA, "74", "60", NA, NA, NA,…
## $ ...20                        <chr> "Male Age", NA, NA, NA, "66", NA, NA, NA,…
## $ `Illicit/Street`             <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...22                        <chr> "BLS FD Administration", "Chico FD", NA, …
## $ ...23                        <chr> "LE Administration", NA, NA, NA, NA, "Chi…
## $ ...24                        <chr> "Female Age", "45", NA, NA, NA, NA, "20",…
## $ ...25                        <chr> "Male Age", NA, NA, NA, NA, "25", NA, "21…

After loading the data and viewing the resulting data frames, there are two things that stand out. 1) The column headers are sometimes split into double rows (main header - which is a category - and a subheader which names the data element). R interprets this second row as “data” - so we’ll need to fix it. 2) 2021 data is a different number of columns - we’ll need to understand why.

Fixing the Column Names

Columns 1..10 are ok. Then columns 11..15, 16..20, and 21..25 are in 3 groups of 5 columns, split across two rows.

  • Upper row, first column is the category, next 4 are junk. There are 3 categories, we’ll call them “PI” (Prescription, Intentional), “PU” (Prescription, Unintentional)“, and”IS" (Illicit/Street).
  • Lower row, first 3 columns relate to naloxone-administrator (types = bystander, BLS FD, LE), next two provide age (Female, Male).

So, the problems are not just that there are two headers but that the actual data value is split across multiple columns (3 for Administrator, 2 for Age).

First, let’s just fix the column names. We’ll create names in groups of 5: the first part will be the category, and the second part in the item. Here’s what we get.

old.names <- names(naloxone.2019)
cols.5 <- c("Bystander", "BLS FD", "LE", "F.Age", "M.Age")

new.names <- c(old.names[1:10], paste("PI.", cols.5, sep=""), paste("PU.", cols.5, sep=""), paste("IS.", cols.5, sep=""))

# remove the first data-row of the data set because it is really the second-row for the column names. 

naloxone.2019 <- naloxone.2019[-1,]
naloxone.2020 <- naloxone.2020[-1,]

names(naloxone.2019) <- new.names
names(naloxone.2020) <- new.names

new.names[-(1:10)]
##  [1] "PI.Bystander" "PI.BLS FD"    "PI.LE"        "PI.F.Age"     "PI.M.Age"    
##  [6] "PU.Bystander" "PU.BLS FD"    "PU.LE"        "PU.F.Age"     "PU.M.Age"    
## [11] "IS.Bystander" "IS.BLS FD"    "IS.LE"        "IS.F.Age"     "IS.M.Age"

Fixing Column Data

Next problem to fix is the data in these columns. It looks like there are 15 data elements, columns 11..25. In fact, there are really only 4 data elements: Overdose.Type (3 data values = PI, PU, IS), Administrator (3 data values), Age (numeric), Gender (2 values). So, we need to convert this into a data set which has these 4 columns, and with the appropriate values.

Currently, for each row, the columns (representing administrator and age) are populated only in the category (group-of-5) that it belongs to. So, here’s the logic we want to implement.

  • Pick a group of 5 columns. If it is all missing, ignore and move on to the next group.
  • If not-Null, then see whether M.Age has a value or F.Age, and assign the value to a new column “Age” and the corresponding gender to a new column “Gender”.

Data ERRORS

Above is how it should be. But it is possible that data are entered erroneously. We will do some basic checks. First, of the 6 columns that contain Age, ideally exactly one should have a numeric value, others should all be null. If more than one, that’s clearly an error. If none, it just means the responder could not identify age.

Let’s write a function that takes the 6 columns, checks for above rule, and returns … dropped it for now, need to fix later.

age <- function(cols6) {# give it the 6 columns
  as.numeric(cols6[!is.na(cols6)]) # to convert it to number
} 

age.check <- naloxone.2019 %>% mutate(Age.Check = apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age))

lapply(age.check, function(x) {x$Age.Check <- as.numeric(x$Age.Check);x}) 

So, yes, there was a problem in rows 52-53: there were two incidents but the age data got placed both times in the same row (52), and the other one was blank (53). We’ll fix this by hand.

naloxone.2019[53, "IS.M.Age"] <- naloxone.2019[52, "IS.M.Age"]
naloxone.2019[52, "IS.M.Age"] <- NA

Category

Now to identify which category each row belongs to (based on the presence or absence of data) we first convert all missing values to NA.

naloxone.2019 <- naloxone.2019 %>% mutate_all(~replace(., . == 0, NA))

naloxone.2020 <- naloxone.2020 %>% mutate_all(~replace(., . == 0, NA))

First let’s write a function to compute the category, given the 15 columns of data. The function below looks at the groups of 5 columns, and assigns the corresponding category if ANY one of those columns is not empty. Then, run the column on the entire data set.

category <- function(cols) {# give it the 15 columns
  ifelse(any(!is.na(cols[1:5])), "PI", # if non-null
         ifelse(any(!is.na(cols[6:10])), "PU", 
                 ifelse(any(!is.na(cols[11:15])), "IS", "na")))
  } 

# Now run the function on each row of the data set 
naloxone.2019 <- naloxone.2019 %>% mutate("Category" =  apply(naloxone.2019[,11:25],1,category) ) 

naloxone.2020 <- naloxone.2020 %>% mutate("Category" =  apply(naloxone.2020[,11:25],1,category) ) 

Age and Gender

Now we’ll look through the 6 columns (3 catgories, and F.Age and M.Age in each). If the (non-NA) age is in one of the F columns we’ll list gender as Female and capture the value as age. Same for M.

# # gender function takes the 6 columns, and returns Female/Male/na. call it like this: 
# gender(naloxone.2019[1,("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )])

gender <- function(cols6) {# give it the 6 columns = c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )
  colsF = cols6[1:3]
  colsM = cols6[4:6]
  ifelse(any(!is.na(colsF)), "Female", # if non-null
         ifelse(any(!is.na(colsM)), "Male", "na"))
} 

# if more than one age column has a value, that's a problem!
age <- function(cols6) {# give it the 6 columns
  as.numeric(cols6[!is.na(cols6)]) # to convert it to number
} 


# Now run the function on each row of the data set 
naloxone.2019 <- naloxone.2019 %>% mutate("Gender" =  apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,gender), "Age" =  apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age)) # %>% mutate("Age" = as.numeric("Age"))

naloxone.2020 <- naloxone.2020 %>% mutate("Gender" =  apply(naloxone.2020[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,gender), "Age" =  apply(naloxone.2020[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age))  

Administrator

Karina and Meghna: do you want to try writing this one?

naloxone.2019 <- naloxone.2019 %>% mutate(Admin = "")
naloxone.2020 <- naloxone.2020 %>% mutate(Admin = "")

Final Data Set

Next we will remove the columns not needed any longer, 11:25. And peek at the resulting data set.

naloxone.2019.final <- naloxone.2019[,c(1:10, 26:29)]
naloxone.2020.final <- naloxone.2020[,c(1:10, 26:29)]

glimpse(naloxone.2019.final)
## Rows: 117
## Columns: 14
## $ `Transport Agency`      <chr> "BUTTE COUNTY EMS", "BUTTE COUNTY EMS", "BUTTE…
## $ `Incident Date`         <dttm> 2019-01-02, 2019-01-04, 2019-01-05, 2019-01-0…
## $ `Incident #`            <dbl> 303, 483, 696, 1328, 1839, 1841, 1845, 1846, 1…
## $ `Primary Impression`    <chr> "ALOC", "ALOC", "Overdose", "ALOC", "Cardiac A…
## $ `Transport Destination` <chr> "Enloe Medical Center", "Enloe Medical Center"…
## $ `Incident Address`      <chr> "2730 White Avenue", "15 CARRIAGE LN", "2539 V…
## $ `Incident City`         <chr> "Chico", "Chico", "Oroville", "Oroville", "Chi…
## $ Zip                     <dbl> 95926, 95926, 95966, 95966, 95926, 95926, 9592…
## $ State                   <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA"…
## $ County                  <chr> "Butte", "Butte", "Butte", "Butte", "Butte", "…
## $ Category                <chr> "IS", "PU", "PU", "PU", "IS", "IS", "IS", "IS"…
## $ Gender                  <chr> "Female", "Female", "Female", "Male", "Male", …
## $ Age                     <dbl> 45, 74, 60, 66, 25, 20, 21, 25, 25, 24, 23, 69…
## $ Admin                   <chr> "", "", "", "", "", "", "", "", "", "", "", ""…

New Data Set

Finally let’s save the new data set so we can use it for analysis. Before we do that, we will check or confirm data types for certain columns (date and numeric ones).

naloxone.2019.final <- naloxone.2019.final %>% mutate(
#  "Incident Date" = as.Date("Incident Date"), 
         Zip = as.integer(Zip),
         Age = as.numeric(Age))

naloxone.2020.final <- naloxone.2020.final %>% mutate(
#  "Incident Date" = as.Date("Incident Date"), 
         Zip = as.integer(Zip),
         Age = as.numeric(Age))
naloxone.2019.final <- naloxone.2019.final %>% 
  convert(dte("Incident Date"),
           int(Zip, "Incident #"), 
           num(Age)) 

We can merge the 2019 and 2020 data sets without even adding a new Year column, because we already have an incident date column.

naloxone.data <- union_all(naloxone.2019.final, naloxone.2020.final) 

write.csv(naloxone.data, file="/cloud/project/Data-Setup/naloxone.data.csv")

Extending the Data

We’ll make some final changes to the data set. For this I’ll use the output file created by Karina (which includes fixing the Administrator column), then make some data type changes, and finally add day/week/month columns.

naloxone.data <- read.csv(file="/cloud/project/Data-Setup/Naloxone_final.csv")  

Data Type Coercion

naloxone.data <- naloxone.data  %>% 
#  select(-c("X")) %>% # Remove the extra column "X"
  rename(Incident.Id = "Incident..") %>%
  mutate(across(c(Zip), factor)) %>% # convert Zip to a factor
  mutate(Incident.Date = ymd_hms(as.character(Incident.Date))) # from Lubridate

Day, Week, Month

The data set lists each incident by Incident.Date. It is useful to add additional columns which list the date (e.g., 07), day-of-week (e.g., Mon as the wday value, or 2 as wday.no), week (e.g., 13) and month (e.g., April) and year (e.g., 2019).

naloxone.data <- naloxone.data  %>% 
  mutate(day = day(Incident.Date), wday.no = wday(Incident.Date), wday = wday(Incident.Date, label=TRUE), week = week(Incident.Date), month = month(Incident.Date, label=TRUE), year = year(Incident.Date)) %>% mutate(day.type = ifelse(wday.no %in% 3:5, "Weekday", "Weekend"))

kable(head(naloxone.data))
Transport.Agency Incident.Date Incident.Id Primary.Impression Transport.Destination Incident.Address Incident.City Zip State County Category Age Gender Admin day wday.no wday week month year day.type
BUTTE COUNTY EMS 2019-01-02 303 ALOC Enloe Medical Center 2730 White Avenue Chico 95926 CA Butte IS 45 Female Chico FD 2 4 Wed 1 Jan 2019 Weekday
BUTTE COUNTY EMS 2019-01-04 483 ALOC Enloe Medical Center 15 CARRIAGE LN Chico 95926 CA Butte PU 74 Female 4 6 Fri 1 Jan 2019 Weekend
BUTTE COUNTY EMS 2019-01-05 696 Overdose OROVILLE HOSPITAL 2539 V-6 RD Oroville 95966 CA Butte PU 60 Female 5 7 Sat 1 Jan 2019 Weekend
BUTTE COUNTY EMS 2019-01-09 1328 ALOC OROVILLE HOSPITAL 3050 EL NOBLE AVE Oroville 95966 CA Butte PU 66 Male 9 4 Wed 2 Jan 2019 Weekday
BUTTE COUNTY EMS 2019-01-12 1839 Cardiac Arrest Enloe Medical Center 1166 SANTANA CT Chico 95926 CA Butte IS 25 Male Chico PD 12 7 Sat 2 Jan 2019 Weekend
BUTTE COUNTY EMS 2019-01-12 1841 Overdose Enloe Medical Center 1166 SANTANA CT Chico 95926 CA Butte IS 20 Female Chico PD 12 7 Sat 2 Jan 2019 Weekend
write.csv(naloxone.data, file="/cloud/project/Data-Setup/Naloxone_final_formatted.csv")