In this report, we conduct an exploratory data analysis of a data set from the NYC Data Transparency Initiative. The initiative maintains a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Our objective is to identify patterns in the data that may be indicative of large-scale trends.
The following figure shows the number of complaints by year:
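library(readxl)   # provides read_excel()
library(ggplot2)  # provides ggplot() and the geoms used below
df <- read_excel("/Users/Himanshu/Desktop/ccrb_datatransparencyinitiative.xlsx")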
# Bar chart of complaint counts per year; geom_bar() counts rows by default
ggplot(data = df, aes(x = `Received Year`, fill = `Received Year`)) +
  geom_bar() +
  labs(title = "Complaints by Year", x = "Received Year", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Received Year")
The following figure shows full investigations by year:
# Reuses df loaded above; bars are stacked by full-investigation status
ggplot(data = df, aes(x = `Received Year`, fill = `Is Full Investigation`)) +
  geom_bar() +
  labs(title = "Full Investigation by Year", x = "Year", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Full Investigation")
The following figure shows the number of complaints by borough:
# Reuses df loaded above; one bar per borough
ggplot(data = df, aes(x = `Borough of Occurrence`, fill = `Borough of Occurrence`)) +
  geom_bar() +
  labs(title = "Complaints by Borough", x = "Borough", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Borough")
The following figure shows incident location by borough:
# Reuses df loaded above; bars are stacked by incident location
ggplot(data = df, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar() +
  labs(title = "Incident Location by Borough", x = "Borough", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Incident Location")
The following figure shows encounter outcomes by borough:
# Reuses df loaded above; bars are stacked by encounter outcome
ggplot(data = df, aes(x = `Borough of Occurrence`, fill = `Encounter Outcome`)) +
  geom_bar() +
  labs(title = "Encounter Outcome by Borough", x = "Borough", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Encounter Outcome")
The following figure shows full investigations by borough:
# Reuses df loaded above; bars are stacked by full-investigation status
ggplot(data = df, aes(x = `Borough of Occurrence`, fill = `Is Full Investigation`)) +
  geom_bar() +
  labs(title = "Full Investigation in Borough", x = "Borough", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Full Investigation")
The following figure shows the number of complaints in each borough by the place where the complaint was filed:
# Reuses df loaded above; bars are stacked by the place where the complaint was filed
ggplot(data = df, aes(x = `Borough of Occurrence`, fill = `Complaint Filed Place`)) +
  geom_bar() +
  labs(title = "Complaint Filed Place by Borough", x = "Borough", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Place of Complaint")
The following figure shows the number of complaints by mode of communication:
# Reuses df loaded above; one bar per filing mode
ggplot(data = df, aes(x = `Complaint Filed Mode`, fill = `Complaint Filed Mode`)) +
  geom_bar() +
  labs(title = "Complaints by Mode", x = "Mode", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Complaint Filed Mode")
The following figure shows the number of complaints by allegation type:
# Reuses df loaded above; one bar per FADO allegation type
ggplot(data = df, aes(x = `Allegation FADO Type`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Allegation Type", x = "Allegation Type", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Allegation Type")
The following figure shows the number of complaints with and without video evidence per year:
# Reuses df loaded above; bars are stacked by whether video evidence exists
ggplot(data = df, aes(x = `Received Year`, fill = `Complaint Has Video Evidence`)) +
  geom_bar() +
  labs(title = "Video Evidence by Year", x = "Received Year", y = "Number of Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Video Evidence")
Exploratory data analysis (EDA) is a useful tool for identifying and presenting relationships between variables. From our analysis, we draw five observations. First, the highest number of complaints was received around 2006-2008, a period that overlaps with the financial crisis. Second, the share of complaints that receive a full investigation has stayed close to 50%, which means that roughly half of all complaints are not fully investigated; this could be examined and improved in the coming years. Third, the largest number of complaints comes from Brooklyn, followed by the Bronx and Manhattan. Fourth, encounter outcomes look similar in every borough: arrests, no arrests, and summonses appear in roughly the same proportions across boroughs. Lastly, the most common mode of communication has been the telephone (phone call and telephone service), which is intuitive because calling is the fastest way to communicate in an emergency; the analysis lets us support that intuition with historical data.
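The observations about proportions (the roughly 50% full-investigation share, and the similar mix of encounter outcomes across boroughs) are read off stacked count charts, where shares are hard to judge by eye. A minimal sketch of how they could be checked numerically follows. The helper `share_by_group()` and the `toy` data frame are hypothetical and introduced only for illustration; the toy rows are made up and are not CCRB records. Only the column names are taken from the report.

```r
# Hypothetical helper: share of each `value_col` category within every
# `group_col` level. Each row of the result sums to 1.
share_by_group <- function(data, group_col, value_col) {
  counts <- table(data[[group_col]], data[[value_col]])
  prop.table(counts, margin = 1)
}

# Made-up toy rows for illustration only (NOT taken from the CCRB data).
toy <- data.frame(
  `Received Year` = c(2006, 2006, 2007, 2007),
  `Is Full Investigation` = c(TRUE, FALSE, TRUE, TRUE),
  check.names = FALSE  # keep the column names with spaces as they are
)

shares <- share_by_group(toy, "Received Year", "Is Full Investigation")
print(round(shares, 2))
```

With the real data, the same call on `df`, for example `share_by_group(df, "Borough of Occurrence", "Encounter Outcome")`, would give per-borough outcome shares that can be compared directly instead of estimated from bar heights.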