The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week's lecture we discussed a number of visualization approaches to exploring a data set; this assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what is in the data and what is interesting in it.
For this week we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall under the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
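# Packages used below: readxl, ggplot2, RColorBrewer, sqldf
library(readxl)
library(ggplot2)
library(RColorBrewer)
library(sqldf)
# Read the allegation-level sheet (adjust the path to your copy of the workbook)
ccrb_datatransparencyinitiative <- read_excel("C:/Users/nprak/Desktop/Harrisburg Courses/ANLY_512/ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")
View(ccrb_datatransparencyinitiative)  # interactive preview
Problemset_4 <- ccrb_datatransparencyinitiative
# Replace spaces in column names with underscores
names(Problemset_4) <- gsub(" ", "_", names(Problemset_4))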
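To make the renaming step concrete, here is a minimal sketch on a made-up data frame. The data frame `toy` and its values are illustrative assumptions, not taken from the spreadsheet. It shows how `gsub` replaces every space in the column names with an underscore, so that columns can later be referenced with `$` without backticks.

```r
# Toy data frame with spaces in the column names (hypothetical values).
toy <- data.frame(
  "Incident Year" = c(2006, 2007),
  "Close Year"    = c(2007, 2008),
  check.names = FALSE  # keep the spaces so gsub has something to replace
)

# Replace every space with an underscore, as is done for Problemset_4.
names(toy) <- gsub(" ", "_", names(toy))

print(names(toy))
```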
# Scatter plot of incident year against close year with a fitted linear trend
ggplot(Problemset_4, aes(x = Incident_Year, y = Close_Year)) +
  geom_point(shape = 14, color = "purple") +
  geom_smooth(method = lm, se = FALSE, color = "red") +
  labs(title = "Relationship between Incident Year and Case Closed Year", x = "Incident Year", y = "Case Closed Year")
The histogram below shows that the largest number of incidents was reported in 2007.
hist(Problemset_4$Incident_Year, main="Histogram for Incident Year", xlab="Incident Year", border="red", breaks = 15, col="blue")
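Because a histogram's bins need not line up with single calendar years, a per-year tally can serve as a cross-check on which year has the most rows. The following is a sketch on made-up years: `toy_years` is a hypothetical stand-in for `Problemset_4$Incident_Year`, and its counts are not results from the CCRB data.

```r
# Hypothetical incident years standing in for Problemset_4$Incident_Year.
toy_years <- c(2005, 2006, 2006, 2007, 2007, 2007, 2008)

# Count rows per year; table() names each count by its year.
year_counts <- table(toy_years)

# The year with the largest count.
peak_year <- names(year_counts)[which.max(year_counts)]

print(year_counts)
print(peak_year)
```

Applied to the real column, the same two calls would give the exact count behind each year rather than a binned approximation.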
In recent years there has been a relative increase in complaints with video evidence.
Legend_color <- brewer.pal(8, "Spectral")
Viz_4 <- table(Problemset_4$Complaint_Has_Video_Evidence, Problemset_4$Incident_Year)
barplot((Viz_4),main="Complaints filed with Video evidence each incident year", xlab="Incident Year", ylab="Number of Complaints",horiz = FALSE, col=c(Legend_color), legend = rownames(Viz_4))
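A stacked bar chart of raw counts mixes two effects: how many complaints there were in a year and what share of them had video. To look at the relative increase directly, one option is to convert each year's column to proportions with `prop.table`. The sketch below uses a made-up data frame `toy`; the assumption that the video column holds TRUE/FALSE values is mine and should be checked against the real data.

```r
# Hypothetical rows: one per complaint, with a video-evidence flag.
toy <- data.frame(
  Incident_Year = c(2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015),
  Complaint_Has_Video_Evidence = c(FALSE, FALSE, FALSE, TRUE,
                                   FALSE, FALSE, TRUE, TRUE)
)

# Same cross-tabulation as Viz_4: rows are the flag, columns are years.
counts <- table(toy$Complaint_Has_Video_Evidence, toy$Incident_Year)

# margin = 2 divides each column by its total, so every year sums to 1.
shares <- prop.table(counts, margin = 2)
print(round(shares, 2))

# Stacked bars now all have height 1 and show the share with video.
barplot(shares, main = "Share of complaints with video evidence (toy data)",
        xlab = "Incident Year", ylab = "Proportion of Complaints",
        legend.text = rownames(shares))
```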
Abuse of Authority appears to be the most prominent allegation (FADO) type in the visualization.
Viz_5 <- table(Problemset_4$Allegation_FADO_Type, Problemset_4$Incident_Year)
# horiz = TRUE draws counts along the x-axis, so the axis labels are swapped to match
barplot(Viz_5, main = "Complaints distributed over allegations each incident year", xlab = "Number of Complaints", ylab = "Incident Year", horiz = TRUE, col = c("coral4", "coral3", "coral2", "coral1", "coral"), legend = rownames(Viz_5))
Complaints filed by telephone remain the most common filing mode in the NYC area.
Viz_6 <- table(Problemset_4$Complaint_Filed_Mode, Problemset_4$Incident_Year)
barplot(Viz_6, main = "Complaint filing mode by incident year", xlab = "Incident Year", ylab = "Number of Complaints", horiz = FALSE, col = Legend_color, legend = rownames(Viz_6))
Over time, we can see a decrease in the number of incidents from 2010 onwards.
# Keep one row per complaint so each complaint is counted once
viz_7 <- unique(Problemset_4[c("UniqueComplaintId", "Incident_Year", "Borough_of_Occurrence")])
viz_7 <- data.frame(viz_7)
# Refer to columns by bare name inside aes() rather than with viz_7$...
ggplot(viz_7, aes(x = Incident_Year, fill = Borough_of_Occurrence)) + geom_bar() + labs(title = "Incidents over time by borough in NYC", x = "Incident Year", y = "Count of Incidents") + theme(legend.title = element_blank())
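The `unique` step matters because the sheet read in above is named `Complaints_Allegations`, which suggests that one complaint can span several allegation rows. The sketch below, on made-up IDs, shows how counting rows and counting de-duplicated complaints can disagree; `toy` and its values are hypothetical.

```r
# Hypothetical allegation-level rows: complaint 1 has two allegations,
# complaint 3 has three.
toy <- data.frame(
  UniqueComplaintId     = c(1, 1, 2, 3, 3, 3),
  Incident_Year         = c(2010, 2010, 2010, 2011, 2011, 2011),
  Borough_of_Occurrence = c("Bronx", "Bronx", "Queens",
                            "Brooklyn", "Brooklyn", "Brooklyn")
)

# One row per distinct (complaint, year, borough) combination.
toy_complaints <- unique(
  toy[c("UniqueComplaintId", "Incident_Year", "Borough_of_Occurrence")]
)

rows_per_year       <- table(toy$Incident_Year)             # allegation rows
complaints_per_year <- table(toy_complaints$Incident_Year)  # complaints

print(rows_per_year)
print(complaints_per_year)
```

In the toy data both years have three allegation rows, but 2010 has two complaints and 2011 only one, so the choice of unit changes the shape of the trend.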
Arrests in the Bronx are declining roughly in proportion to incidents.
# Keep only rows where the incident occurred in the Bronx
viz_8 <- sqldf("select * from Problemset_4 where Borough_of_Occurrence = 'Bronx'")
viz_8 <- data.frame(viz_8)
# Refer to columns by bare name inside aes() rather than with viz_8$...
ggplot(viz_8, aes(x = Incident_Year, fill = Encounter_Outcome)) + geom_bar() + labs(title = "Criminal outcome in Bronx", x = "Incident Year", y = "Count of Incidents") + theme(legend.title = element_blank())
Based on the visualization below, parks appear to be the most common location for incidents in the Bronx, which may be useful for cautioning the public in parks.
# Bronx rows (viz_8) stacked by incident location; columns referenced by bare name in aes()
VIZ_9 <- ggplot(viz_8, aes(x = Incident_Year, fill = Incident_Location)) + geom_bar() + labs(title = "Criminal Activities by location in Bronx", x = "Incident Year", y = "Count of Incidents") + theme(legend.title = element_blank())
VIZ_9
The pie chart below shows which boroughs account for the most gun-related allegations: 38% are concentrated in Brooklyn, followed by the Bronx at 23%.
# Rows whose allegation description mentions a gun
viz_10 <- sqldf("select * from Problemset_4 where Allegation_description like '%Gun%'")
# Number of such rows per borough
Viz_10_count <- sqldf("select count(*) as No_of_Incidents, Borough_of_Occurrence from viz_10 group by Borough_of_Occurrence")
# Percentage share per borough, rounded to whole numbers
pct <- round(Viz_10_count$No_of_Incidents / sum(Viz_10_count$No_of_Incidents) * 100)
# Labels of the form "Brooklyn 38%"
lbls <- paste0(Viz_10_count$Borough_of_Occurrence, " ", pct, "%")
pie(Viz_10_count$No_of_Incidents, labels = lbls, col = Legend_color, main = "Pie Chart of Borough with Gun involved Crimes", cex = .5)
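The same percentages can be computed without SQL. The sketch below is a base R alternative, not the method used above: it filters with `grepl`, counts with `table`, and converts to shares with `prop.table`. The data frame `toy`, its description strings, and the resulting numbers are made up for illustration.

```r
# Hypothetical allegation rows (descriptions are invented examples).
toy <- data.frame(
  Borough_of_Occurrence = c("Brooklyn", "Brooklyn", "Bronx", "Queens",
                            "Brooklyn", "Bronx", "Manhattan", "Queens"),
  Allegation_Description = c("Gun Pointed", "Gun Drawn", "Gun Pointed", "Word",
                             "Gun fired", "Frisk", "Gun Drawn", "Gun Pointed")
)

# Keep rows whose description mentions "gun", ignoring case.
gun_rows <- toy[grepl("gun", toy$Allegation_Description, ignore.case = TRUE), ]

# Count per borough, then convert to rounded percentages.
counts <- table(gun_rows$Borough_of_Occurrence)
pct <- round(100 * prop.table(counts))

# Labels such as "Brooklyn 50%" for use in pie() or a bar chart.
lbls <- paste0(names(pct), " ", pct, "%")
print(lbls)
```

Rounded percentages need not add up to exactly 100 (here they sum to 101), which is worth keeping in mind when quoting figures from pie-chart labels.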