Data received on February 10th, we have :
Arrest files 2000-2018
Arrest charges
Calls log
Unclear from title :
Copy of offenses
Copy of persons
I will first explore each one of these datasets. Identify
library(readxl)
arrest1 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest 2000 to 2005RD.xlsx", col_types = "text")
arrest2 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest 2006 thu 2018RD.xlsx", col_types = "text")
arrest2 <- arrest2 %>% select(-"..26")
arrests <- rbind(arrest1, arrest2)
arrestcharge1 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest Charges 2000 thru 2008.xlsx", col_types = "text")
arrestcharge2 <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Arrest Charges 2009 thru 2018RD.xlsx", col_types = "text")
charges <- rbind(arrestcharge1, arrestcharge2)
#Now onto the calls data
calls <- NULL
for(i in call_files){
pattern <- paste0("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls/", i)
call1 <- read_excel(pattern, col_types = "text")
calls <- rbind(calls, call1)
}
write.csv(arrests, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Arrests.csv")
write.csv(charges, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Charges.csv")
write.csv(calls, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls.csv")
#What are these two unknown datasets?
offenses <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of OffensesRD.xlsx", col_types = "text")
write.csv(offenses, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Offence.csv")
persons <- read_excel("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Copy of Persons June 1 2014 to Dec 31 2018RD.xlsx", col_types = "text")
write.csv(persons, "/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Persons.csv")
#The chunk of data above required a total of ~5 hours for execution.
The data consists of 13,058,961 observations. Each call is identified with a master incident number. Below are questions related to exploration of this dataset.
calls <- read.csv("/Users/yab/Dropbox/BBQ Betty Project/Dallas/Dallas Data Full/Calls.csv") #Reading the file took 1h40min to complete
n_distinct(calls$Master_Incident_Number)
calls2 <- calls %>% group_by(Master_Incident_Number) %>% mutate(id = row_number()) %>% mutate(total_obs = sum(id)) %>% filter(total_obs > 1) # I intended to use this to analyze the redundant incident numbers ... however, this line of code took 2h to run, even though it's a minor edit. I stopped trying to edit datasets after this.
Note that we have no sociodemographic information in this dataset.
colnames(calls)
unique(persons$Call_Disposition)
unique(persons$Incident_Type)
colnames(arrests)
View(arrests)
1,078,938 observations spanning 2000-2018.
#Charges data
1,454,170 observations spanning 2000-2018.
This data only covers June 2014 - Dec 2018. It containts 676,671 observations. It seems this data was not collected prior to 2014.
This dataset gives information on individuals involved with the reported incident. The “involvement” variable indicates the level of involvement : victim, witness, suspect, etc.
A: Name, date of birth, sex, address, age at time of incident.
A: Race, ethnicity.
A: Hard to answer definitively. Problem is complainant can be victim, witeness, etc. One of the involvement types is : “reporting person”, but it is unclear if this type supercedes victim/witness types. ie. if the reporter is also a victim, are they classified as victim or as reporting person ? We could use names to elucidate this, but the name variable in the calls log is sparcely populated.
colnames(person)
View(person)
A dataset of offenses that includes the offense type, the location (address and type of premise), and the time of reporting. This data contains 537,276 observations from 2002 - 2019.
–
I can write code for merging, data visualization, etc., but cannot run any of that analysis on my local computer. I recommend that Ben gets access to Talapas, while I write a script that we can later on send to be run. The script will perform : 1. merger of dataset, 2. merger report (NAs, matching, etc.), 3. and visualization of main variables of interest (I can also write code to answer any more specific question you might have).