The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment applies those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
data<-read.csv("/Users/jain/Desktop/EDAData.csv")
summary(data)
## DateStamp UniqueComplaintId Close.Year Received.Year
## 11/29/2016:204397 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :34794 Median :2010 Median :2009
## Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :69492 Max. :2016 Max. :2016
##
## Borough.of.Occurrence Is.Full.Investigation
## Bronx :49442 Mode :logical
## Brooklyn :72215 FALSE:107084
## Manhattan :42104 TRUE :97313
## Outside NYC : 170
## Queens :30883
## Staten Island: 9100
## NA's : 483
## Complaint.Has.Video.Evidence Complaint.Filed.Mode
## Mode :logical Call Processing System: 42447
## FALSE:195530 E-mail : 799
## TRUE :8867 Fax : 356
## In-person : 9586
## Mail : 3424
## On-line website : 14197
## Phone :133588
## Complaint.Filed.Place Complaint.Contains.Stop...Frisk.Allegations
## CCRB :130877 Mode :logical
## IAB : 69214 FALSE:119856
## Precinct : 3548 TRUE :84541
## Other City agency: 295
## Mayor's Office : 157
## Other : 110
## (Other) : 196
## Incident.Location Incident.Year Encounter.Outcome
## Street/highway :123274 Min. :1999 Arrest :89139
## Apartment/house : 34720 1st Qu.:2007 No Arrest or Summons:82964
## Residential building: 12421 Median :2009 Other/NA : 1050
## Police building : 8968 Mean :2010 Summons :31244
## Subway station/train: 6077 3rd Qu.:2012
## (Other) : 15581 Max. :2016
## NA's : 3356
## Reason.For.Initial.Contact
## PD suspected C/V of violation/crime - street:60107
## Other :39030
## PD suspected C/V of violation/crime - bldg :16067
## PD suspected C/V of violation/crime - auto :12953
## Moving violation : 8843
## (Other) :66542
## NA's : 855
## Allegation.FADO.Type
## Abuse of Authority:102173
## Discourtesy : 34452
## Force : 61761
## Offensive Language: 6008
## NA's : 3
##
##
## Allegation.Description
## Physical force :44116
## Word :31704
## Stop :12944
## Search (of person) :12250
## Refusal to provide name/shield number:10359
## (Other) :93021
## NA's : 3
This graph gives an idea of whether the number of complaints received has increased, decreased, or stayed consistent over the years.
install.packages("ggplot2",repos = "http://cran.us.r-project.org")
##
## The downloaded binary packages are in
## /var/folders/_f/hk0n1w157s5bfmt2ykvxt1yh0000gn/T//Rtmp9EVb7U/downloaded_packages
library(ggplot2)
# Count complaints per year; a line has no fill, so only x is mapped
ggplot(data, aes(x=Received.Year)) +
geom_line(stat = "count") +
labs(title="Number of Complaints received by Year", x="Received Year", y="Number of Complaints")
This graph shows the boroughs with the highest number of complaints, which can help determine the focus areas.
graph2 <- table(data$Borough.of.Occurrence)
pie(graph2, radius = 0.5, col = c("green", "red", "violet","cornsilk", "cyan", "yellow","pink"))
This graph shows whether video evidence was available for most complaints, broken down by the year the complaint was received.
# geom_bar() counts rows per year directly and avoids the geom_histogram(stat = "count") warning
ggplot(data, aes(x=Received.Year, fill=Complaint.Has.Video.Evidence)) +
geom_bar() +
labs(title="Availability of Video Evidence", x="Received Year", y="Number of Complaints")
This graph shows the number of complaints for each allegation type.
# Refer to columns by name inside aes(); data$ is not needed there
ggplot(data, aes(x = Allegation.FADO.Type, fill = Allegation.FADO.Type)) +
geom_bar() +
labs(title="Complaints by Type of Allegation", x="Allegation Type", y="Number of Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name="Allegation Type")
This graph shows the number of complaints by the mode in which they were filed.
# Refer to columns by name inside aes(); data$ is not needed there
ggplot(data, aes(x = Complaint.Filed.Mode, fill = Complaint.Filed.Mode)) +
geom_bar() +
labs(title="Mode of Complaint", x="Mode", y="Number of Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name="Mode")
This graph shows the years in which the most incidents occurred.
hist(data$Incident.Year, main="Histogram for Incident Year", xlab="Incident Year", border="black", breaks = 20, col="red")
This graph shows, for each borough, whether complaints received a full investigation.
# labs() argument names are case-sensitive: the axis label must be y, not Y
ggplot(data, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation)) + geom_bar() + labs(title = "Full Investigation by Borough", x = "Borough", y = "Count") + theme_minimal()
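Because the boroughs differ greatly in size, raw counts can hide differences in the share of complaints that were fully investigated. The following is a minimal sketch of how those shares could be computed with base R. It uses a small made-up data frame (`toy`, hypothetical values) that only borrows the column names of the CCRB file; it is not the real data.

```r
# Hypothetical toy data; only the column names match the CCRB file
toy <- data.frame(
  Borough.of.Occurrence = c("Bronx", "Bronx", "Brooklyn", "Brooklyn", "Brooklyn", "Queens"),
  Is.Full.Investigation = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Cross-tabulate borough against investigation status
counts <- table(toy$Borough.of.Occurrence, toy$Is.Full.Investigation)

# Convert counts to row-wise shares, so each borough sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Replacing `toy` with `data` should give the per-borough shares for the full data set, assuming the columns are read in as shown in the summary above.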
This graph shows the comparison between when the case was received versus when it was closed.
# group = Close.Year draws one box per closing year (Close.Year is continuous)
boxp <- ggplot(data, aes(x=Close.Year, y=Received.Year, group=Close.Year))
boxp + geom_jitter(width = 0.3, alpha = .2) + geom_boxplot(alpha = .25)
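The comparison of received and closed years can also be summarised as a single number per case: the number of years a case stayed open. The sketch below shows one way to derive it, using a small made-up data frame (`toy`, hypothetical values) and a new column name, `Years.To.Close`, which is an assumption introduced here and does not exist in the CCRB file.

```r
# Hypothetical toy data; only the column names match the CCRB file
toy <- data.frame(
  Received.Year = c(2005, 2006, 2006, 2010, 2012),
  Close.Year    = c(2006, 2006, 2008, 2010, 2013)
)

# Years between receiving and closing a case (new, illustrative column)
toy$Years.To.Close <- toy$Close.Year - toy$Received.Year

# How many cases took 0, 1, 2, ... years to close
lag_counts <- table(toy$Years.To.Close)
lag_counts
```

Since the file only records years, this measure is coarse: a case received in December and closed in January counts as one year.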
This graph shows where incidents took place within each borough, which can help pinpoint the trouble areas.
# geom_bar() counts rows per borough and avoids the geom_histogram(stat = "count") warning
ggplot(data, aes(x=Borough.of.Occurrence, fill=Incident.Location)) +
geom_bar() +
labs(title="Complaints by the location of the Incident", x="Borough", y="Number of Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name="Incident Location")
This graph shows the outcome of the encounter in each Borough.
# The fill legend shows Encounter.Outcome, so it is named accordingly
ggplot(data, aes(x=Borough.of.Occurrence, fill=Encounter.Outcome)) +
geom_bar() +
labs(title="Complaints by Outcome of the Encounter", x="Borough", y="Number of Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name="Encounter Outcome")
From the first visualisation, we can see that the number of complaints was high during the 2005-2010 time span. The department needs to analyze the reasons. Part of the explanation might be linked to the housing-market recession, which left many people jobless and may have led some to turn to crime, although this data set cannot confirm that. From the next visualisation we can observe that Brooklyn is the borough with the most complaints. Visualisation 3 shows that for the majority of complaints no video evidence was available.
Most complaints concern abuse of authority, followed by force. This gives the impression that complainants most often allege that officers overstepped their authority. The most popular way of making a complaint is by phone, which comes as no surprise given how easy it is to call in a complaint. Visualisations 1 and 6 suggest a correlation between the years with the most incidents and the years with the most complaints received.
In most boroughs, roughly half of the complaints received a full investigation. There were a lot of outliers when comparing the year a case was received with the year it was closed. Street/highway was the most common incident location, especially in Brooklyn. In each borough, an encounter was about as likely to end in an arrest as in no arrest or summons. Overall, this analysis helps the department identify problem areas and the kinds of complaints concentrated in each borough, so that it can take measures to improve public safety and security.
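The link between incident years and received years is read off two separate plots above; it could also be checked numerically. A minimal sketch, assuming per-year counts are compared with a Pearson correlation, is shown below. The data frame `toy` and its values are made up for illustration and only reuse the column names of the CCRB file.

```r
# Hypothetical toy data; only the column names match the CCRB file
toy <- data.frame(
  Incident.Year = c(2006, 2006, 2007, 2007, 2007, 2008),
  Received.Year = c(2006, 2007, 2007, 2007, 2008, 2008)
)

# Count rows per year on a shared set of years so the vectors line up
years <- 2006:2008
incidents  <- as.vector(table(factor(toy$Incident.Year, levels = years)))
complaints <- as.vector(table(factor(toy$Received.Year, levels = years)))

# Pearson correlation between the two yearly count series
year_cor <- cor(incidents, complaints)
year_cor
```

A value near 1 on the real data would support the visual impression, although a high correlation is expected here simply because most complaints are filed soon after the incident.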