Loading Dallas OpenData datasets

Dallas OpenData distributes animal shelter records in separate datasets for each fiscal year from 2015 through 2017 (2017 gets daily updates in 2017):

FY 2015 Dallas Animal Shelter. This dataset reports activities between October 01, 2014 and September 30, 2015.
FY 2016 Dallas Animal Shelter Data. This dataset reports activities between October 01, 2015 and September 30, 2016.
FY 2017 Dallas Animal Shelter Data. The start date is October 1, 2016.

Access to Dalla OpenData Datasets

Dallas OpenData is powered by Socrata and its Open Data Network that supports Socrata Open Data API (SODA API). Package RSocrata supports SODA API and variety of data sources including public Dallas OpenData without neither access credentials nor API token:

data15.source = read.socrata(url = "https://www.dallasopendata.com/resource/8pn8-24ku.csv")
data16.source = read.socrata(url = "https://www.dallasopendata.com/resource/4qfv-27du.csv")
data17.source = read.socrata(url = "https://www.dallasopendata.com/resource/8849-mzxh.csv")

Preprocessing Data Files

File format slightly changed after 2016 so we process 2015 and 2016 first and then 2017 into two separate data frames:

# 2015 and 2016
data15n16 = bind_rows(data15.source, data16.source) %>%
  rename(months=month) %>%
  filter(!is.na(outcome_date) | intake_total != 1) %>%
  select(-kennel_number, -kennel_status, -tag_type, -activity_number, 
         -source_id, -staff_id, -hold_request, 
         -receipt_number, -impound_number, -service_request_number,
         -intake_total) %>%
  separate(intake_time, c("Intake.Fake.Date","Intake.Real.Time"), sep=10, remove=FALSE) %>%
  separate(outcome_time, c("Outcome.Fake.Date","Outcome.Real.Time"), sep=10, remove=FALSE) %>%
  mutate(intake_ts = ymd_hms(paste(intake_date, Intake.Real.Time), tz="America/Chicago"),
         outcome_ts = ymd_hms(paste(outcome_date, Outcome.Real.Time), tz="America/Chicago")) %>%
  select(-Intake.Real.Time, -Intake.Fake.Date, -Outcome.Real.Time, -Outcome.Fake.Date,
         -intake_time, -intake_date, -outcome_time, -outcome_date)

## Warning: All formats failed to parse. No formats found.

## Warning: All formats failed to parse. No formats found.

# format in 2017 is slightly different
data17 = filter(data17.source, !is.na(outcome_date)) %>%
  select(-kennel_number, -kennel_status, -tag_type, -activity_number, 
         -source_id, -staff_id, -hold_request, 
         -receipt_number, -impound_number, -service_request_number) %>%
  separate(intake_time, c("Intake.Real.Time", "Intake.Time.Seconds"), sep=8, remove=FALSE) %>%
  separate(outcome_time, c("Outcome.Real.Time", "Outcome.Time.Seconds"), sep=8, remove=FALSE) %>%
  mutate(intake_ts = ymd_hms(paste(intake_date, Intake.Real.Time), tz="America/Chicago"),
         outcome_ts = ymd_hms(paste(outcome_date, Outcome.Real.Time), tz="America/Chicago")) %>%
  select(-Intake.Real.Time, -Intake.Time.Seconds, -Outcome.Real.Time, -Outcome.Time.Seconds,
         -intake_time, -intake_date, -outcome_time, -outcome_date)

## Warning: All formats failed to parse. No formats found.

## Warning: All formats failed to parse. No formats found.

For each the following actions took place:

records with empty outcome date were removed (these are not censored records but rather animals that were just admitted and are still held in shelter);
records with total intake count other than 1 were removed (it is not clear what they represent);
irrelevant attributes were removed;
original data contains a pair of fields for each timestamp attribute: date and time. Thus, each such pair is transofrmed into single attribute (times for admission (intake) and outcome) using function separate and mutate;
removed temporary attributes created in previous step with function separate.

Combining and Compacting Data

Binding 2015 - 2017 FY Data Together

Next we combine three years of data (2 data frames) into single data frame (cat and dog animals only):

# combine all years into single dataset
model.data.all = bind_rows(data15n16, data17) %>%
  mutate(
    animal = factor(animal_type),
    outcome = factor(outcome_type),
    breed = factor(animal_breed),
    tract = factor(census_tract),
    district = factor(council_district),
    type = factor(intake_type),
    subtype = factor(intake_subtype),
    condition_in = factor(intake_condition),
    condition_out = factor(outcome_condition),
    chip = factor(chip_status),
    origin = factor(animal_origin))

What are the Animals?

Which animal types and how many are treated by Dallas shelters?

Retaining Dog and Cat Animal Records Only

Combining records of stay for the same animals

Some records occur for the same animal in sequence and in short time intervals. In such case We assume such records correspond to single shelter stay so they need additional processing. The rule of thumb is if 2 records for the same animal (animal_id) are within 30 days from each other then they are assigned phases that represent the same shelter stay:

# find and assign phases to consecutive records for the same stay:
model.data.dogscats = group_by(model.data.dogscats, animal_id) %>%
  arrange(intake_ts) %>%
  mutate(n=n(),
         phase = row_number(),
         lag_time = intake_ts - lag(outcome_ts)) %>%
  ungroup() %>%
  arrange(animal_id, phase) %>%
  mutate(phase_group = ifelse(lag_time <= 30, lag(phase), phase),
         phase_group = ifelse(lag_time <= 30, lag(phase_group), phase),
         phase_group = ifelse(is.na(phase_group), 1, phase_group),
         groupid = paste(animal, animal_id, phase_group, sep="-"))

The code above is crude attempt at sequential pattern processing like SQL function npath in Teradata Aster.

Having identified records for the same stay we can compact them into single records:

# combine mutlple records for the same animal stay into single record
model.data.dogscats.compact = model.data.dogscats %>%
  group_by(groupid) %>%
  summarise(
    intake_ts = min(intake_ts),
    outcome_ts = max(outcome_ts),
    interval = interval(min(intake_ts), max(outcome_ts)),
    duration = as.duration(interval),
    hours = as.numeric(duration)/(60.*60.),
    animal=first(animal),
    outcome = last(outcome),
    breed = first(breed),
    tract = first(tract),
    district = first(district),
    type = first(type),
    subtype = first(subtype),
    condition_in = first(condition_in),
    condition_out = last(condition_out),
    chip = first(chip),
    origin = first(origin),
    phase = min(phase)) %>%
  mutate(monthly_in = floor_date(intake_ts, 'month'),
         monthly_out = floor_date(outcome_ts, 'month'))

Exploratory Analysis

Intake Types

Outcomes

Transitions between Intake types and Outcomes

Sankey diagram will show how dogs addmitted to shelters transition from intake types to outcomes with arrows of variable width corresponding to number of records. Essentially, this diagram shows probabilities: absolute probabilities of dog admitted by intake types and discharged by outcomes, and conditional probabilities of dog discharged with outcomes given its admission with an intake type:

Cats

Dogs

Correlations

Next, we would like to correlate intake types with outcomes. For that we devise monthly trends for both and correlate resulting time series - the total of 12 combinations of 3 intake types and 4 outcommes:

Trend Matrix

Now correlation coefficients get replaced with actual times series used to compute them leaving matrix structure intact:

Time Spent in Shelters

Time Spent Dogs vs. Cats

Time Distributions by Admission and Outcome

Histograms

Density Curves

Kernel density estimate is a smoothered version of the histogram above.

Sankeys Based on Average Time Spent at Shelters

Let’s revisit Sankey diagrams to show average time of stay depending on admission and outcome. In the Sankeys below transition arrow widths correspond to average length of stay: the wider the longer and the narrower the shorter:

Cats

Dogs

Survival Analysis

We enhance the data with new attributes first:

event: survival outcome is 1 if the animal died or 0 if it survived its stay at shelter;
unhealthy_in: if animals was unhealthy or not at admission;
contagious_in: if animal was contagious or not at admission;
treatable_in: if animal was in untreatable, rehabiltable, or manageable condition at admission.

survival.data = model.data.dogscats.compact %>%  
  filter(hours < 2500) %>%
  mutate(event = ifelse(outcome %in% c('DIED','EUTHANIZED','MISSING'), 1, 0),
         unhealthy_in = ifelse(condition_in ==  'HEALTHY', 'NO', 'YES'),
         contagious_in = ifelse(is.na(str_match(condition_in, ' CONTAGIOUS$')[,1]), 'NO','YES'),
         treatable_in = ifelse(is.na(str_match(condition_in, '^TREATABLE ')[,1]), 'UNTREATABLE',
                               ifelse(is.na(str_match(condition_in, 'MANAGEABLE')[,1]),
                                      'REHABILITABLE', 'MANAGEABLE')))

survival.data.dogs = survival.data %>%  
  filter(
    animal == 'Dogs',
    outcome %in% c('ADOPTION','DIED','EUTHANIZED','FOSTER','FOUND REPORT',
                  'LOST REPORT','RETURNED TO OWNER','TRANSFER'),
    !type %in% c('FOSTER','FOUND REPORT','LOST REPORT','TRANSFER'),
    !subtype %in% c('EUTHANASIA REQUESTED','- DEAD ON ARRIVAL',
                    'QUARANTINE - DEAD ON ARRIVAL','KEEP SAFE - DEAD ON ARRIVAL',
                    'CRUELT - DEAD ON ARRIVAL','DANGER - DEAD ON ARRIVAL'),
    origin %in% c('FIELD','OVER THE COUNTER','SWEEP','NIGHT DROP'))

Survival Probabilities

Chances of dying in shelter by various categories available at intake time:

animal_counts = survival.data %>% 
  group_by(animal) %>% 
  summarise(cnt = n(), survived = sum(1-event), dead = sum(event)) %>% 
  mutate(rank = row_number(desc(cnt)),
         odds = dead/survived,
         notsurv_chance = dead/cnt) %>% 
  mutate(name = paste("Animal:", toupper(animal)))

intake_type_counts = survival.data.dogs %>%
  group_by(type) %>%
  summarise(cnt = n(), survived = sum(1-event), dead = sum(event)) %>% 
  mutate(rank = row_number(desc(cnt)),
         odds = dead/survived,
         notsurv_chance = dead/cnt) %>% 
  mutate(name = paste("Intake Type:", type))

origin_counts = survival.data.dogs %>%
  filter(origin %in% c("FIELD","OVER THE COUNTER","SWEEP")) %>%
  group_by(origin) %>%
  summarise(cnt = n(), survived = sum(1-event), dead = sum(event)) %>% 
  mutate(rank = row_number(desc(cnt)),
         odds = dead/survived,
         notsurv_chance = dead/cnt) %>% 
  mutate(name = paste("Origin:", origin))

condition_counts = survival.data.dogs %>%
  group_by(condition_in) %>%
  summarise(cnt = n(), survived = sum(1-event), dead = sum(event)) %>% 
  mutate(rank = row_number(desc(cnt)),
         odds = dead/survived,
         notsurv_chance = dead/cnt) %>% 
  mutate(name = paste("Condition:", condition_in))

data = rbind(animal_counts[-1], intake_type_counts[-1], origin_counts[-1],
             condition_counts[-1])
N = dim(data)[[1]]
data$name = factor(data$name, levels = data$name[order(data$notsurv_chance)])
data$text_size = (10 + data$notsurv_chance/max(data$notsurv_chance))
data$text_pos = data$notsurv_chance/3 

ggplot(data) +
  geom_bar(aes(name, notsurv_chance, fill=name), stat = 'identity', alpha=0.7) +
  geom_text(aes(name, text_pos, label=name, size=text_size), angle=0, family='Palatino', fontface='bold', hjust=0) +
  scale_size(range=c(3,5)) +
  scale_fill_manual(values = rev(colorRampPalette(wsj_pal(palette = "colors6")(4))(N))) +
  labs(title=paste("Chances of Dying in Shelter by Intake Categories"),
       subtitle="Dallas Animal Services FY 2015-2018. Only for dogs unless cats specified",
       caption=caption, x=NULL, y="Expected Probability") +
  coord_flip() +
  theme_minimal(base_size = 14, base_family = 'serif') +
  theme(legend.position = "none",
        # axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major.y = element_blank(),
        strip.text = element_text(size = 14, face = 'bold'),
        plot.caption = element_text(size = 10, hjust=1))

How many breeds are registered?

breed_counts = survival.data.dogs %>% 
  group_by(breed) %>% 
  summarise(cnt = n(), survived = sum(1-event), dead = sum(event)) %>% 
  mutate(rank = row_number(desc(cnt)),
         odds = dead/survived,
         notsurv_chance = dead/cnt) %>% 
  arrange(desc(rank))

data = breed_counts[breed_counts$cnt >= 100,]
N = dim(data)[[1]]
data$breed = factor(data$breed, levels = data$breed[order(-data$rank)])
data$text_size = (10 + data$cnt/max(data$cnt))
data$text_pos = ifelse(data$cnt>5000, data$cnt/2, data$cnt + 100)

ggplot(data) +
  geom_bar(aes(breed, cnt, fill=breed), stat = 'identity', alpha=0.7) +
  geom_text(aes(breed, text_pos, label=breed, size=text_size), angle=0, family='Palatino', fontface='bold', hjust=0) +
  scale_size(range=c(2,4)) +
  scale_fill_manual(values = rev(colorRampPalette(wsj_pal(palette = "colors6")(6))(N))) +
  labs(title=paste("Dog Breeds Admitted to Shelters"),
       subtitle="Dallas Animal Services FY 2015-2017. Only breeds with 100+ admissions", 
       caption=caption, x=NULL, y="Admissions") +
  coord_flip() +
  # theme_tufte(base_size = 14, base_family = 'serif', ticks=FALSE) +
  theme_minimal(base_size = 14, base_family = 'serif') +
  theme(legend.position = "none",
        # axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major.y = element_blank(),
        strip.text = element_text(size = 14, face = 'bold'),
        plot.caption = element_text(size = 10, hjust=1))

Unfortunate for Pit bull terrier group they have the worst chances of survival baring top breed of Chow chow. Survival chances also appear to correlate with dog sizes but for the lack of this information we can’t substantiate it with data.

data = breed_counts[breed_counts$cnt >= 100,]
N = dim(data)[[1]]
data$breed = factor(data$breed, levels = data$breed[order(data$notsurv_chance)])
data$text_size = (10 + data$notsurv_chance/max(data$notsurv_chance))
data$text_pos = data$notsurv_chance/2 # ifelse(data$cnt>5000, data$cnt/2, data$cnt + 100)

ggplot(data) +
  geom_bar(aes(breed, notsurv_chance, fill=breed), stat = 'identity', alpha=0.7) +
  geom_text(aes(breed, text_pos, label=breed, size=text_size), angle=0, family='Palatino', fontface='bold', hjust=0) +
  scale_size(range=c(2,4)) +
  scale_fill_manual(values = rev(colorRampPalette(wsj_pal(palette = "colors6")(4))(N))) +
  labs(title=paste("Chances of Dying in Shelter by Dog Breed"),
       subtitle="Dallas Animal Services FY 2015-2018. Only breeds with 100+ admissions",
       caption=caption, x=NULL, y="Expected Probability") +
  coord_flip() +
  # theme_tufte(base_size = 14, base_family = 'serif', ticks=FALSE) +
  theme_minimal(base_size = 14, base_family = 'serif') +
  theme(legend.position = "none",
        # axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major.y = element_blank(),
        strip.text = element_text(size = 14, face = 'bold'),
        plot.caption = element_text(size = 10, hjust=1))

The Kaplan-Meier survival curves

Survival Curves by Animal Types - Dogs vs. Cats

km.animal = survfit(Surv(hours, event) ~ animal, data=survival.data)

p1 = createSurvivalCurve(km.animal, "Animal", nchar("animal")+2, 
                         scale_color = scale_color_solarized(name="Animal"))
p2 = createSurvivalHistogram(survival.data, "animal", caption=caption,
                             scale_fill = scale_fill_solarized(name="Animal"))
grid.arrange(p1, p2, heights=c(3.5/5, 1.5/5), ncol=1)

First 2 Weeks

Let’s zoom in into first 2 weeks to see how critical those first days at shelter are:

p1 + xlim(0,15) + 
  labs(title="Cats and Dogs: First 2 Weeks", 
       caption=caption) +
  theme(legend.position = "bottom",
        plot.caption = element_text(size = 8, hjust = 0))

## Warning: Removed 9756 rows containing missing values (geom_path).

Again, for consistency and plausability of analysis will focus on survival of dogs from here on.

Survival Curves by Intake Types

km.type = survfit(Surv(hours, event) ~ type, data=survival.data.dogs)
p1 = createSurvivalCurve(km.type, "Dog Intake Type", nchar("type")+2,
                    title="Dogs by Shelter Intake Type",
                    legend.position = "none")

p2 = createSurvivalHistogram(survival.data.dogs, "type", 
                             subtitle="Accepted to Dallas Shelters Totals by Intake Type and Outcome",
                             legend.position="right", caption=caption)

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival Curves by Dog Origin in Shelter

data = survival.data.dogs[survival.data.dogs$origin %in% c("FIELD","OVER THE COUNTER","SWEEP"),]
km.origin = survfit(Surv(hours, event) ~ origin, data=data)
p1 = createSurvivalCurve(km.origin, "Dog Origin in Shelter", nchar("origin")+2,
                    title="Dogs by Origin In Shelter", legend.position = "none")

p2 = createSurvivalHistogram(data, "origin", 
                             subtitle="Accepted to Dallas Shelters Totals by Origin and Outcome",
                             legend.position="right", caption = caption, legend.title = "Origin")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival of Healthy vs. Unhealthy Dogs

km.healthy = survfit(Surv(hours, event) ~ unhealthy_in, data=survival.data.dogs)
p1 = createSurvivalCurve(km.healthy, "Dog Health In", nchar("unhealthy_in")+2,
                    title="Dogs by Unhealthy Condition In", legend.position = "none")

p2 = createSurvivalHistogram(survival.data.dogs, "unhealthy_in", 
                             subtitle="Accepted to Dallas Shelters Totals by Unhealthy Condition In and Outcome",
                             legend.position="right", caption = caption, legend.title = "Unhealthy")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival by Condition Entering Shelter

We select only conditions represented by at least 1000 dogs.

data = survival.data.dogs %>% 
  group_by(condition_in) %>% 
  filter(n() >= 1000) %>% 
  droplevels()

km.condition = survfit(Surv(hours, event) ~ condition_in, data=data)
p1 = createSurvivalCurve(km.condition, "Dog Condition In", nchar("condition_in")+2,
                    title="Dogs by Shelter Condition In", legend.position = "none")

p2 = createSurvivalHistogram(data, "condition_in", legend.position = "right",
                             subtitle="Accepted to Dallas Shelters Totals by Condition In and Outcome",
                             caption = caption, legend.title = "Condition")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Original attribute condition_in carries too much information for meaningful analysis. It can be broken down into several attributes:

Unhealthy (Yes/No)
Contagious (Yes/No) (makes sense only in case of unhealthy dogs)
Treatable (Untreatable/Manageable/Rehabilitable) (makes sense only in case of unhealthy dog)

Survival of Unhealthy Dogs by Contagious Condition In

survival.data.dogs.unhealthy = survival.data.dogs %>%
  filter(unhealthy_in == 'YES')

km.contagious = survfit(Surv(hours, event) ~ contagious_in, data=survival.data.dogs.unhealthy)
p1 = createSurvivalCurve(km.contagious, "Unhealthy Dogs Contagious Condition In", nchar("contagious_in")+2,
                    title="Unhealthy Dogs by Contagious Condition In", legend.position = "none")

p2 = createSurvivalHistogram(survival.data.dogs.unhealthy, "contagious_in", legend.position = "right",
                             subtitle="Accepted to Dallas Shelters Unhealthy Dog Totals by Contagious Condition In and Outcome",
                             caption = caption, legend.title = "Contagious")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival of Unhealthy Dogs by Treatable Condition In

km.contagious = survfit(Surv(hours, event) ~ treatable_in, data=survival.data.dogs.unhealthy)
p1 = createSurvivalCurve(km.contagious, "Unhealthy Dogs Contagious Condition In", nchar("treatable_in")+2,
                    title="Unhealthy Dogs by Treatable Condition In", legend.position = "none")

p2 = createSurvivalHistogram(survival.data.dogs.unhealthy, "treatable_in", legend.position = "right",
                             subtitle="Accepted to Dallas Shelters Unhealthy Dog Totals by Treatable Condition In and Outcome",
                             caption = caption, legend.title = "Treatable")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival Curves by Chip Scan

km.scan = survfit(Surv(hours, event) ~ chip, data=survival.data.dogs)
p1 = createSurvivalCurve(km.scan, "Chip Scan Result", nchar("chip")+2,
                         title="Dogs by Chip Scan Result", legend.position = "none")

p2 = createSurvivalHistogram(survival.data.dogs, "chip", legend.position = "right",
                             subtitle = "Accepted to Dallas Shelters Totals by Chip Scan Result and Outcome",
                             caption = caption, legend.title = "Chip Scan Result")

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival Curves by Breed

We include top 5 breeds that represent about 63% of all dogs accepted to shelters.

N = 5
data = survival.data.dogs %>%
  filter(breed %in% breed_counts[breed_counts$rank <= N,]$breed)
# data$breed = factor(data$breed, levels = breed_counts$breed[order(-breed_counts$rank)])
data$breed_str = as.character(data$breed)

km.breed = survfit(Surv(hours, event) ~ breed_str, data=data)
p1 = createSurvivalCurve(km.breed, "Breed", nchar("chip")+2,
                         title=paste("Top",N,"Dog Breeds by Admission"), legend.position = "none",
                         scale_color = scale_color_ptol(name="Breed"))

p2 = createSurvivalHistogram(data, "breed_str", legend.position = "right",
                             subtitle = "Accepted to Dallas Shelters Totals by Breed and Outcome",
                             caption = caption, legend.title = "breed",
                             scale_fill = scale_fill_ptol(name="Breed", guide = guide_legend(reverse = TRUE)),
                             caption_hjust = 1)

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

# Survival Curves for Pit Bull Group

Pit Bull Terrier group includes Pit Bull, American Staff, Am Pit Bull Ter, and Staffordshire breeds:

data = survival.data.dogs %>%
  filter(breed %in% c('PIT BULL','AMERICAN STAFF','AM PIT BULL TER','STAFFORDSHIRE'))
data$breed_str = as.character(data$breed)

km.breed = survfit(Surv(hours, event) ~ breed_str, data=data)
p1 = createSurvivalCurve(km.breed, "Breed", nchar("chip")+2,
                         title="Pit Bull Terrier Group", legend.position = "none",
                         scale_color = scale_color_ptol())

p2 = createSurvivalHistogram(data, "breed_str", legend.position = "right",
                             subtitle = "Accepted to Dallas Shelters Totals by Breed and Outcome",
                             caption = caption, legend.title = "Pit Bulls",
                             scale_fill = scale_fill_ptol(name="Breed"))

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Survival Curves for Pit Bulls vs. All Other Dogs

data = survival.data.dogs %>%
  filter(breed %in% breed_counts[breed_counts$rank <=8 & breed_counts$rank > 4,]$breed)

km.breed = survfit(Surv(hours, event) ~ breed, data=data)
p1 = createSurvivalCurve(km.breed, "Breed", nchar("chip")+2,
                         title="Dogs by Breed", legend.position = "none",
                         scale_color = scale_color_ptol(name="Breed"))

p2 = createSurvivalHistogram(data, "breed", legend.position = "right",
                             subtitle = "Accepted to Dallas Shelters Totals by Breed and Outcome",
                             caption = caption, legend.title = "breed",
                             scale_fill = scale_fill_ptol(name="Breed"))

grid.arrange(p1, p2, heights=c(3.0/5, 2.0/5), ncol=1)

Cox PH Models

coxph.surv.mod = coxph(Surv(hours, event) ~ type + chip + origin + unhealthy_in, 
                       data = survival.data.dogs,
                       control = coxph.control(iter.max=100))
summary(coxph.surv.mod)

## Call:
## coxph(formula = Surv(hours, event) ~ type + chip + origin + unhealthy_in, 
##     data = survival.data.dogs, control = coxph.control(iter.max = 100))
## 
##   n= 64036, number of events= 21427 
## 
##                            coef exp(coef) se(coef)      z Pr(>|z|)    
## typeFOSTER                   NA        NA  0.00000     NA       NA    
## typeFOUND REPORT             NA        NA  0.00000     NA       NA    
## typeLOST REPORT              NA        NA  0.00000     NA       NA    
## typeOWNER SURRENDER     1.12513   3.08061  0.03418 32.915  < 2e-16 ***
## typeSTRAY               0.53881   1.71397  0.03188 16.903  < 2e-16 ***
## typeTRANSFER                 NA        NA  0.00000     NA       NA    
## typeWILDLIFE                 NA        NA  0.00000     NA       NA    
## chipSCAN NO CHIP        0.53859   1.71359  0.02288 23.544  < 2e-16 ***
## chipUNABLE TO SCAN      0.99733   2.71104  0.04332 23.023  < 2e-16 ***
## originFIELD             0.54156   1.71869  0.04005 13.524  < 2e-16 ***
## originNIGHT DROP        2.54351  12.72431  0.57879  4.395 1.11e-05 ***
## originOVER THE COUNTER  0.24140   1.27304  0.04090  5.903 3.58e-09 ***
## originSWEEP                  NA        NA  0.00000     NA       NA    
## unhealthy_inYES         2.38401  10.84830  0.07373 32.333  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                        exp(coef) exp(-coef) lower .95 upper .95
## typeFOSTER                    NA         NA        NA        NA
## typeFOUND REPORT              NA         NA        NA        NA
## typeLOST REPORT               NA         NA        NA        NA
## typeOWNER SURRENDER        3.081    0.32461     2.881     3.294
## typeSTRAY                  1.714    0.58344     1.610     1.824
## typeTRANSFER                  NA         NA        NA        NA
## typeWILDLIFE                  NA         NA        NA        NA
## chipSCAN NO CHIP           1.714    0.58357     1.638     1.792
## chipUNABLE TO SCAN         2.711    0.36886     2.490     2.951
## originFIELD                1.719    0.58184     1.589     1.859
## originNIGHT DROP          12.724    0.07859     4.092    39.564
## originOVER THE COUNTER     1.273    0.78552     1.175     1.379
## originSWEEP                   NA         NA        NA        NA
## unhealthy_inYES           10.848    0.09218     9.389    12.535
## 
## Concordance= 0.659  (se = 0.002 )
## Rsquare= 0.075   (max possible= 0.999 )
## Likelihood ratio test= 4997  on 8 df,   p=0
## Wald test            = 3445  on 8 df,   p=0
## Score (logrank) test = 4043  on 8 df,   p=0

coxph.surv.unhealthy.mod = coxph(Surv(hours, event) ~ type + chip + origin + contagious_in + treatable_in, 
                       data = survival.data.dogs.unhealthy,
                       control = coxph.control(iter.max=100))

summary(coxph.surv.unhealthy.mod)

## Call:
## coxph(formula = Surv(hours, event) ~ type + chip + origin + contagious_in + 
##     treatable_in, data = survival.data.dogs.unhealthy, control = coxph.control(iter.max = 100))
## 
##   n= 59000, number of events= 21241 
## 
##                               coef exp(coef) se(coef)       z Pr(>|z|)    
## typeFOSTER                      NA        NA  0.00000      NA       NA    
## typeFOUND REPORT                NA        NA  0.00000      NA       NA    
## typeLOST REPORT                 NA        NA  0.00000      NA       NA    
## typeOWNER SURRENDER        1.72327   5.60283  0.03492  49.350  < 2e-16 ***
## typeSTRAY                  1.15527   3.17488  0.03303  34.971  < 2e-16 ***
## typeTRANSFER                    NA        NA  0.00000      NA       NA    
## typeWILDLIFE                    NA        NA  0.00000      NA       NA    
## chipSCAN NO CHIP           0.51930   1.68084  0.02307  22.506  < 2e-16 ***
## chipUNABLE TO SCAN         0.52831   1.69606  0.04379  12.065  < 2e-16 ***
## originFIELD                0.42278   1.52620  0.04032  10.484  < 2e-16 ***
## originNIGHT DROP           2.81869  16.75487  0.57877   4.870 1.12e-06 ***
## originOVER THE COUNTER     0.17763   1.19438  0.04094   4.339 1.43e-05 ***
## originSWEEP                     NA        NA  0.00000      NA       NA    
## contagious_inYES           0.68940   1.99252  0.02757  25.010  < 2e-16 ***
## treatable_inREHABILITABLE -0.48311   0.61686  0.01810 -26.690  < 2e-16 ***
## treatable_inUNTREATABLE    1.00604   2.73474  0.02144  46.918  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                           exp(coef) exp(-coef) lower .95 upper .95
## typeFOSTER                       NA         NA        NA        NA
## typeFOUND REPORT                 NA         NA        NA        NA
## typeLOST REPORT                  NA         NA        NA        NA
## typeOWNER SURRENDER          5.6028    0.17848    5.2322    5.9997
## typeSTRAY                    3.1749    0.31497    2.9758    3.3872
## typeTRANSFER                     NA         NA        NA        NA
## typeWILDLIFE                     NA         NA        NA        NA
## chipSCAN NO CHIP             1.6808    0.59494    1.6065    1.7586
## chipUNABLE TO SCAN           1.6961    0.58960    1.5566    1.8481
## originFIELD                  1.5262    0.65522    1.4102    1.6517
## originNIGHT DROP            16.7549    0.05968    5.3888   52.0942
## originOVER THE COUNTER       1.1944    0.83725    1.1023    1.2942
## originSWEEP                      NA         NA        NA        NA
## contagious_inYES             1.9925    0.50188    1.8877    2.1031
## treatable_inREHABILITABLE    0.6169    1.62110    0.5954    0.6391
## treatable_inUNTREATABLE      2.7347    0.36567    2.6222    2.8521
## 
## Concordance= 0.713  (se = 0.002 )
## Rsquare= 0.151   (max possible= 0.999 )
## Likelihood ratio test= 9686  on 10 df,   p=0
## Wald test            = 11614  on 10 df,   p=0
## Score (logrank) test = 12622  on 10 df,   p=0

Dallas Animal Shelter Data - Exploratory and Survival Analysis

Gregory Kanevsky

March 29, 2018

Loading Dallas OpenData datasets

Access to Dalla OpenData Datasets

Preprocessing Data Files

Combining and Compacting Data

Binding 2015 - 2017 FY Data Together

What are the Animals?

Retaining Dog and Cat Animal Records Only

Combining records of stay for the same animals

Exploratory Analysis

Intake Types

Outcomes

Transitions between Intake types and Outcomes

Cats

Dogs

Correlations

Trend Matrix

Time Spent in Shelters

Time Spent Dogs vs. Cats

Time Distributions by Admission and Outcome

Histograms

Density Curves

Sankeys Based on Average Time Spent at Shelters

Cats

Dogs

Survival Analysis

Survival Probabilities

The Kaplan-Meier survival curves

Survival Curves by Animal Types - Dogs vs. Cats

First 2 Weeks

Survival Curves by Intake Types

Survival Curves by Dog Origin in Shelter

Survival of Healthy vs. Unhealthy Dogs

Survival by Condition Entering Shelter

Survival of Unhealthy Dogs by Contagious Condition In

Survival of Unhealthy Dogs by Treatable Condition In

Survival Curves by Chip Scan

Survival Curves by Breed

Survival Curves for Pit Bulls vs. All Other Dogs

Cox PH Models