Introduction

Data received on February 10th, we have :

Unclear from title :

I will first explore each one of these datasets. Identify

  1. the reported variables,
  2. the number of observations,
  3. the ability to match and merge the datasets,
  4. and the missing values (check if lost in process of reading).

Importing and consolidating data

library(readxl)
arrest1 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest 2000 to 2005RD.xlsx", col_types = "text")
arrest2 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest 2006 thu 2018RD.xlsx", col_types = "text")
arrest2 <- arrest2 %>% select(-"..26")
arrests <- rbind(arrest1, arrest2)
arrestcharge1 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest Charges 2000 thru 2008.xlsx", col_types = "text")
arrestcharge2 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest Charges 2009 thru 2018RD.xlsx", col_types = "text")
charges <- rbind(arrestcharge1, arrestcharge2)



#Now onto the calls data 
calls <- NULL
for(i in call_files){
  pattern <- paste0("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls/", i)
  call1 <- read_excel(pattern, col_types = "text")
  calls <- rbind(calls, call1)
}

write.csv(arrests, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Arrests.csv")
write.csv(charges, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Charges.csv")
write.csv(calls, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls.csv")

#What are these two unknown datasets? 
offenses <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of OffensesRD.xlsx", col_types = "text")

write.csv(offenses, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Offence.csv")

persons <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Persons June 1 2014 to Dec 31 2018RD.xlsx", col_types = "text")
write.csv(persons, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Persons.csv")

#The chunk of data above required a total of ~5 hours for execution. 

Understanding data structure

Calls data

The data consists of 13,058,961 observations. Each call is identified with a master incident number. Below are questions related to exploration of this dataset.

  • Q: Does each observation correspond to a unique call?
  • A: No. There are 12,429,854 unique identifiers.
calls <- read.csv("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls.csv") #Reading the file took 1h40min to complete 
n_distinct(calls$Master_Incident_Number)
calls2 <- calls %>% group_by(Master_Incident_Number) %>% mutate(id = row_number()) %>% mutate(total_obs = sum(id)) %>% filter(total_obs > 1) # I intended to use this to analyze the redundant incident numbers ... however, this line of code took 2h to run, even though it's a minor edit. I stopped trying to edit datasets after this. 
  • Q: What information do we have in the dataset?
  • A: In addition to call identifiers, we also have name (if agreed to), address, longitudinal data, date and time of call, type of incident (from gun shot to minor disturbance), incident level of priority, responding agency/division/batallion, time until unit assigned.

Note that we have no sociodemographic information in this dataset.

colnames(calls)
unique(persons$Call_Disposition)
unique(persons$Incident_Type)

Arrest data

colnames(arrests) 
View(arrests)

1,078,938 observations spanning 2000-2018.

  • Arrest information : date, time, location, arresting officers ID
  • Arrestee information : name, age at offense time (no dob, but can be deduced), race, and ethnicity, sex, home address, and arrest weapon.

#Charges data

1,454,170 observations spanning 2000-2018.

  • Variables of interest : The only relevant variable here is the charge description (a lot of the observations are not ones we can translate tho eg. 99FARMERSBRA/3/$1254.32 , 06 DSO 1 @ $500.00). Unsure how much that represents in the final data. The rest are only identifying variables that would be used for matching.

Person data

This data only covers June 2014 - Dec 2018. It containts 676,671 observations. It seems this data was not collected prior to 2014.

This dataset gives information on individuals involved with the reported incident. The “involvement” variable indicates the level of involvement : victim, witness, suspect, etc.

  • Q: What personal characteristics do we have?
  • A: Name, date of birth, sex, address, age at time of incident.

  • Q: What socio-demographic characteristics do we have?
  • A: Race, ethnicity.

  • Q: Does this dataset include information on complainants that can be linked to call/arrest data?
  • A: Hard to answer definitively. Problem is complainant can be victim, witeness, etc. One of the involvement types is : “reporting person”, but it is unclear if this type supercedes victim/witness types. ie. if the reporter is also a victim, are they classified as victim or as reporting person ? We could use names to elucidate this, but the name variable in the calls log is sparcely populated.

colnames(person) 
View(person)

Offense data

A dataset of offenses that includes the offense type, the location (address and type of premise), and the time of reporting. This data contains 537,276 observations from 2002 - 2019.

Moving forward :

I can write code for merging, data visualization, etc., but cannot run any of that analysis on my local computer. I recommend that Ben gets access to Talapas, while I write a script that we can later on send to be run. The script will perform : 1. merger of dataset, 2. merger report (NAs, matching, etc.), 3. and visualization of main variables of interest (I can also write code to answer any more specific question you might have).