# for reading Excel file

Read the Data from Excel File

naloxone.2019 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2019")

naloxone.2020 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2020")

naloxone.2021 <- read_excel("/cloud/project/Data-Setup/Butte County EMS Naloxone 2019-2021.xlsx", sheet = "2021")

Let’s look at the dimensionality of the data sets.

dim(naloxone.2019)

## [1] 118  25

dim(naloxone.2020)

## [1] 133  25

dim(naloxone.2021)

## [1] 82 16

And let’s look at a sample of the data. naloxone.2019.

glimpse(naloxone.2019)

## Rows: 118
## Columns: 25
## $ `Transport Agency`           <chr> NA, "BUTTE COUNTY EMS", "BUTTE COUNTY EMS…
## $ `Incident Date`              <dttm> NA, 2019-01-02, 2019-01-04, 2019-01-05, …
## $ `Incident #`                 <dbl> NA, 303, 483, 696, 1328, 1839, 1841, 1845…
## $ `Primary Impression`         <chr> NA, "ALOC", "ALOC", "Overdose", "ALOC", "…
## $ `Transport Destination`      <chr> NA, "Enloe Medical Center", "Enloe Medica…
## $ `Incident Address`           <chr> NA, "2730 White Avenue", "15 CARRIAGE LN"…
## $ `Incident City`              <chr> NA, "Chico", "Chico", "Oroville", "Orovil…
## $ Zip                          <dbl> NA, 95926, 95926, 95966, 95966, 95926, 95…
## $ State                        <chr> NA, "CA", "CA", "CA", "CA", "CA", "CA", "…
## $ County                       <chr> NA, "Butte", "Butte", "Butte", "Butte", "…
## $ `Prescription Intentional`   <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...12                        <chr> "BLS FD Administration", NA, NA, NA, NA, …
## $ ...13                        <chr> "LE Administration", NA, NA, NA, NA, NA, …
## $ ...14                        <chr> "Female Age", NA, NA, NA, NA, NA, NA, NA,…
## $ ...15                        <chr> "Male Age", NA, NA, NA, NA, NA, NA, NA, N…
## $ `Prescription Unintentional` <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...17                        <chr> "BLS FD Administration", NA, NA, NA, NA, …
## $ ...18                        <chr> "LE Administration", NA, NA, NA, NA, NA, …
## $ ...19                        <chr> "Female Age", NA, "74", "60", NA, NA, NA,…
## $ ...20                        <chr> "Male Age", NA, NA, NA, "66", NA, NA, NA,…
## $ `Illicit/Street`             <chr> "Bystander", NA, NA, NA, NA, NA, NA, NA, …
## $ ...22                        <chr> "BLS FD Administration", "Chico FD", NA, …
## $ ...23                        <chr> "LE Administration", NA, NA, NA, NA, "Chi…
## $ ...24                        <chr> "Female Age", "45", NA, NA, NA, NA, "20",…
## $ ...25                        <chr> "Male Age", NA, NA, NA, NA, "25", NA, "21…

After loading the data and viewing the resulting data frames, there are two things that stand out. 1) The column headers are sometimes split into double rows (main header - which is a category - and a subheader which names the data element). R interprets this second row as “data” - so we’ll need to fix it. 2) 2021 data is a different number of columns - we’ll need to understand why.

Fixing the Column Names

Columns 1..10 are ok. Then columns 11..15, 16..20, and 21..25 are in 3 groups of 5 columns, split across two rows.

Upper row, first column is the category, next 4 are junk. There are 3 categories, we’ll call them “PI” (Prescription, Intentional), “PU” (Prescription, Unintentional)“, and”IS" (Illicit/Street).
Lower row, first 3 columns relate to naloxone-administrator (types = bystander, BLS FD, LE), next two provide age (Female, Male).

So, the problems are not just that there are two headers but that the actual data value is split across multiple columns (3 for Administrator, 2 for Age).

First, let’s just fix the column names. We’ll create names in groups of 5: the first part will be the category, and the second part in the item. Here’s what we get.

old.names <- names(naloxone.2019)
cols.5 <- c("Bystander", "BLS FD", "LE", "F.Age", "M.Age")

new.names <- c(old.names[1:10], paste("PI.", cols.5, sep=""), paste("PU.", cols.5, sep=""), paste("IS.", cols.5, sep=""))

# remove the first data-row of the data set because it is really the second-row for the column names. 

naloxone.2019 <- naloxone.2019[-1,]
naloxone.2020 <- naloxone.2020[-1,]

names(naloxone.2019) <- new.names
names(naloxone.2020) <- new.names

new.names[-(1:10)]

##  [1] "PI.Bystander" "PI.BLS FD"    "PI.LE"        "PI.F.Age"     "PI.M.Age"    
##  [6] "PU.Bystander" "PU.BLS FD"    "PU.LE"        "PU.F.Age"     "PU.M.Age"    
## [11] "IS.Bystander" "IS.BLS FD"    "IS.LE"        "IS.F.Age"     "IS.M.Age"

Fixing Column Data

Next problem to fix is the data in these columns. It looks like there are 15 data elements, columns 11..25. In fact, there are really only 4 data elements: Overdose.Type (3 data values = PI, PU, IS), Administrator (3 data values), Age (numeric), Gender (2 values). So, we need to convert this into a data set which has these 4 columns, and with the appropriate values.

Currently, for each row, the columns (representing administrator and age) are populated only in the category (group-of-5) that it belongs to. So, here’s the logic we want to implement.

Pick a group of 5 columns. If it is all missing, ignore and move on to the next group.
If not-Null, then see whether M.Age has a value or F.Age, and assign the value to a new column “Age” and the corresponding gender to a new column “Gender”.

Data ERRORS

Above is how it should be. But it is possible that data are entered erroneously. We will do some basic checks. First, of the 6 columns that contain Age, ideally exactly one should have a numeric value, others should all be null. If more than one, that’s clearly an error. If none, it just means the responder could not identify age.

Let’s write a function that takes the 6 columns, checks for above rule, and returns … dropped it for now, need to fix later.

age <- function(cols6) {# give it the 6 columns
  as.numeric(cols6[!is.na(cols6)]) # to convert it to number
} 

age.check <- naloxone.2019 %>% mutate(Age.Check = apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age))

lapply(age.check, function(x) {x$Age.Check <- as.numeric(x$Age.Check);x})

So, yes, there was a problem in rows 52-53: there were two incidents but the age data got placed both times in the same row (52), and the other one was blank (53). We’ll fix this by hand.

naloxone.2019[53, "IS.M.Age"] <- naloxone.2019[52, "IS.M.Age"]
naloxone.2019[52, "IS.M.Age"] <- NA

Age and Gender

Now we’ll look through the 6 columns (3 catgories, and F.Age and M.Age in each). If the (non-NA) age is in one of the F columns we’ll list gender as Female and capture the value as age. Same for M.

# # gender function takes the 6 columns, and returns Female/Male/na. call it like this: 
# gender(naloxone.2019[1,("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )])

gender <- function(cols6) {# give it the 6 columns = c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )
  colsF = cols6[1:3]
  colsM = cols6[4:6]
  ifelse(any(!is.na(colsF)), "Female", # if non-null
         ifelse(any(!is.na(colsM)), "Male", "na"))
} 

# if more than one age column has a value, that's a problem!
age <- function(cols6) {# give it the 6 columns
  as.numeric(cols6[!is.na(cols6)]) # to convert it to number
} 


# Now run the function on each row of the data set 
naloxone.2019 <- naloxone.2019 %>% mutate("Gender" =  apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,gender), "Age" =  apply(naloxone.2019[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age)) # %>% mutate("Age" = as.numeric("Age"))

naloxone.2020 <- naloxone.2020 %>% mutate("Gender" =  apply(naloxone.2020[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,gender), "Age" =  apply(naloxone.2020[,c("PI.F.Age", "PU.F.Age", "IS.F.Age", "PI.M.Age", "PU.M.Age", "IS.M.Age" )],1,age))

Administrator

Karina and Meghna: do you want to try writing this one?

naloxone.2019 <- naloxone.2019 %>% mutate(Admin = "")
naloxone.2020 <- naloxone.2020 %>% mutate(Admin = "")

Final Data Set

Next we will remove the columns not needed any longer, 11:25. And peek at the resulting data set.

naloxone.2019.final <- naloxone.2019[,c(1:10, 26:29)]
naloxone.2020.final <- naloxone.2020[,c(1:10, 26:29)]

glimpse(naloxone.2019.final)

## Rows: 117
## Columns: 14
## $ `Transport Agency`      <chr> "BUTTE COUNTY EMS", "BUTTE COUNTY EMS", "BUTTE…
## $ `Incident Date`         <dttm> 2019-01-02, 2019-01-04, 2019-01-05, 2019-01-0…
## $ `Incident #`            <dbl> 303, 483, 696, 1328, 1839, 1841, 1845, 1846, 1…
## $ `Primary Impression`    <chr> "ALOC", "ALOC", "Overdose", "ALOC", "Cardiac A…
## $ `Transport Destination` <chr> "Enloe Medical Center", "Enloe Medical Center"…
## $ `Incident Address`      <chr> "2730 White Avenue", "15 CARRIAGE LN", "2539 V…
## $ `Incident City`         <chr> "Chico", "Chico", "Oroville", "Oroville", "Chi…
## $ Zip                     <dbl> 95926, 95926, 95966, 95966, 95926, 95926, 9592…
## $ State                   <chr> "CA", "CA", "CA", "CA", "CA", "CA", "CA", "CA"…
## $ County                  <chr> "Butte", "Butte", "Butte", "Butte", "Butte", "…
## $ Category                <chr> "IS", "PU", "PU", "PU", "IS", "IS", "IS", "IS"…
## $ Gender                  <chr> "Female", "Female", "Female", "Male", "Male", …
## $ Age                     <dbl> 45, 74, 60, 66, 25, 20, 21, 25, 25, 24, 23, 69…
## $ Admin                   <chr> "", "", "", "", "", "", "", "", "", "", "", ""…

New Data Set

Finally let’s save the new data set so we can use it for analysis. Before we do that, we will check or confirm data types for certain columns (date and numeric ones).

naloxone.2019.final <- naloxone.2019.final %>% mutate(
#  "Incident Date" = as.Date("Incident Date"), 
         Zip = as.integer(Zip),
         Age = as.numeric(Age))

naloxone.2020.final <- naloxone.2020.final %>% mutate(
#  "Incident Date" = as.Date("Incident Date"), 
         Zip = as.integer(Zip),
         Age = as.numeric(Age))

naloxone.2019.final <- naloxone.2019.final %>% 
  convert(dte("Incident Date"),
           int(Zip, "Incident #"), 
           num(Age))

We can merge the 2019 and 2020 data sets without even adding a new Year column, because we already have an incident date column.

naloxone.data <- union_all(naloxone.2019.final, naloxone.2020.final) 

write.csv(naloxone.data, file="/cloud/project/Data-Setup/naloxone.data.csv")

Extending the Data

We’ll make some final changes to the data set. For this I’ll use the output file created by Karina (which includes fixing the Administrator column), then make some data type changes, and finally add day/week/month columns.

naloxone.data <- read.csv(file="/cloud/project/Data-Setup/Naloxone_final.csv")

Data Type Coercion

naloxone.data <- naloxone.data  %>% 
#  select(-c("X")) %>% # Remove the extra column "X"
  rename(Incident.Id = "Incident..") %>%
  mutate(across(c(Zip), factor)) %>% # convert Zip to a factor
  mutate(Incident.Date = ymd_hms(as.character(Incident.Date))) # from Lubridate

Day, Week, Month

The data set lists each incident by Incident.Date. It is useful to add additional columns which list the date (e.g., 07), day-of-week (e.g., Mon as the wday value, or 2 as wday.no), week (e.g., 13) and month (e.g., April) and year (e.g., 2019).

naloxone.data <- naloxone.data  %>% 
  mutate(day = day(Incident.Date), wday.no = wday(Incident.Date), wday = wday(Incident.Date, label=TRUE), week = week(Incident.Date), month = month(Incident.Date, label=TRUE), year = year(Incident.Date)) %>% mutate(day.type = ifelse(wday.no %in% 3:5, "Weekday", "Weekend"))

kable(head(naloxone.data))

Transport.Agency	Incident.Date	Incident.Id	Primary.Impression	Transport.Destination	Incident.Address	Incident.City	Zip	State	County	Category	Age	Gender	Admin	day	wday.no	wday	week	month	year	day.type
BUTTE COUNTY EMS	2019-01-02	303	ALOC	Enloe Medical Center	2730 White Avenue	Chico	95926	CA	Butte	IS	45	Female	Chico FD	2	4	Wed	1	Jan	2019	Weekday
BUTTE COUNTY EMS	2019-01-04	483	ALOC	Enloe Medical Center	15 CARRIAGE LN	Chico	95926	CA	Butte	PU	74	Female		4	6	Fri	1	Jan	2019	Weekend
BUTTE COUNTY EMS	2019-01-05	696	Overdose	OROVILLE HOSPITAL	2539 V-6 RD	Oroville	95966	CA	Butte	PU	60	Female		5	7	Sat	1	Jan	2019	Weekend
BUTTE COUNTY EMS	2019-01-09	1328	ALOC	OROVILLE HOSPITAL	3050 EL NOBLE AVE	Oroville	95966	CA	Butte	PU	66	Male		9	4	Wed	2	Jan	2019	Weekday
BUTTE COUNTY EMS	2019-01-12	1839	Cardiac Arrest	Enloe Medical Center	1166 SANTANA CT	Chico	95926	CA	Butte	IS	25	Male	Chico PD	12	7	Sat	2	Jan	2019	Weekend
BUTTE COUNTY EMS	2019-01-12	1841	Overdose	Enloe Medical Center	1166 SANTANA CT	Chico	95926	CA	Butte	IS	20	Female	Chico PD	12	7	Sat	2	Jan	2019	Weekend

write.csv(naloxone.data, file="/cloud/project/Data-Setup/Naloxone_final_formatted.csv")

Naloxone-Data-Setup

HKB

5/14/2021