The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).
For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.
This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.
For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:
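library(readxl)
library(ggplot2)
# Read the workbook; read_excel() reads the first sheet unless `sheet` is given
ccrb_datatransparencyinitiative <- read_excel("ccrb_datatransparencyinitiative.xlsx")
# Inspect the data in an interactive viewer
View(ccrb_datatransparencyinitiative)
data <- ccrb_datatransparencyinitiative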
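Because the workbook has two tabs, it can help to select the data tab by name rather than rely on tab order. The following is a minimal sketch, assuming the file sits in the working directory under the name used above and that the tab is named "Complaints_Allegations" as the assignment states. The object name `complaints` is hypothetical, and the block does nothing if readxl or the file is missing.

```r
# Assumed file name, taken from the read_excel() call above
path <- "ccrb_datatransparencyinitiative.xlsx"

# Hypothetical object name; stays NULL if the file or readxl is unavailable
complaints <- NULL

if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  # List the tabs first so the data tab can be chosen by name
  print(readxl::excel_sheets(path))

  # Read only the tab that holds the complaint records
  complaints <- readxl::read_excel(path, sheet = "Complaints_Allegations")

  # Quick structural checks before plotting: column types and missing values
  str(complaints)
  print(colSums(is.na(complaints)))
}
```

Checking the column types and missing-value counts up front makes it easier to interpret empty bars or dropped rows in the plots that follow.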
#Visualization-1 #Number of complaints received each year, by incident location
ggplot(data,aes(x = ReceivedYear, fill = IncidentLocation)) +
geom_bar() +
labs(title = "Complaints received by year and location", x = "Year", y = "Number of complaints") +
scale_fill_discrete(name="Incident Location")
The plot above shows the number of complaints received each year from 2000 to 2016, broken down by incident location. There are negligible cases from 2000 to 2004, and complaints are most numerous in 2007. Most incidents took place on a street/highway, followed by apartments/houses.
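Claims read off a stacked bar chart, such as the peak year or the most common location, can be confirmed with a plain count table. The sketch below uses a small made-up data frame (`toy`) with the same column names as the real data, so the numbers are illustrative only and are not results from the CCRB file.

```r
# Hypothetical toy rows standing in for the real data; values are made up
toy <- data.frame(
  ReceivedYear = c(2006, 2006, 2007, 2007, 2007, 2008),
  IncidentLocation = c("Street/highway", "Apartment/house", "Street/highway",
                       "Street/highway", "Apartment/house", "Street/highway")
)

# Number of complaints per year
year_counts <- table(toy$ReceivedYear)

# Year with the largest number of complaints
peak_year <- names(year_counts)[which.max(year_counts)]

# Locations ordered from most to least frequent
location_counts <- sort(table(toy$IncidentLocation), decreasing = TRUE)

print(year_counts)
print(peak_year)
print(location_counts)
```

Replacing `toy` with the loaded data frame would give the exact counts behind the bars.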
#Visualization-2
ggplot(data,aes(x = `ReceivedYear`, fill = `AllegationFADOType`)) +
geom_bar(position = "stack") +
labs(title = "Complaints received by year and type", x = "Year")
The plot above shows the number of complaints received each year from 2000 to 2016, broken down by allegation (FADO) type. There are negligible cases from 2000 to 2004; complaints are most numerous in 2007, followed by a slow decline. ‘Abuse of Authority’ accounts for the most complaints, followed by ‘Force’ and ‘Discourtesy’.
#Visualization-3
#Mode by which the complaints were received
ggplot(data,aes(x= `ReceivedYear`, fill = `ComplaintFiledMode`))+
geom_bar(stat="count")+
xlab('Year')+
ylab('Number of complaints')+
ggtitle('Mode of complaints by year')+
coord_flip()+
scale_fill_discrete(name="Complaint Filed Mode")
The plot above shows the mode by which complaints were received. Most complaints were received by phone, followed by the call processing system.
#Visualization-4
ggplot(data,aes(x = BoroughofOccurrence, fill = AllegationFADOType)) +
geom_bar() +theme_minimal() + theme(legend.position = "bottom") +
labs(title = "Frequency of Complaints by Borough and Allegation FADO Type")+
scale_fill_discrete(name="Allegation type")
The plot above shows the number of complaints by borough and allegation type. Brooklyn has the most complaints, and ‘Abuse of Authority’ is the most common allegation type.
#Visualization - 5
mosaicplot(~ IsFullInvestigation+ComplaintHasVideoEvidence , data, color = TRUE,main='Effect of video evidence on full investigation',xlab='Full investigation',ylab='Video evidence')
The plot above shows whether video evidence has an effect on full investigation. Very few complaints have video evidence, yet many complaints without video evidence are still fully investigated.
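A mosaic plot shows joint counts. Whether video evidence is associated with full investigation is often easier to judge from row proportions, meaning the share of fully investigated complaints within each video-evidence group. The sketch below uses made-up logical values in a hypothetical `toy` data frame; the real columns may be stored differently (for example as text), so treat this as an illustration of the technique rather than a result.

```r
# Hypothetical toy rows; values are made up for illustration
toy <- data.frame(
  IsFullInvestigation       = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  ComplaintHasVideoEvidence = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate: rows are video evidence, columns are full investigation
counts <- table(video = toy$ComplaintHasVideoEvidence,
                full  = toy$IsFullInvestigation)

# Convert to proportions within each row, so each row sums to 1
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```

Comparing the two rows of `shares` gives the difference in full-investigation rates between complaints with and without video evidence.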
#Visualization - 6
ggplot(data,aes(EncounterOutcome,AllegationFADOType,fill=EncounterOutcome,colour=EncounterOutcome))+
geom_count()+
labs(y='Type of Allegation',x='Encounter Outcome',title = 'Allegation vs encounter outcome')
The plot above shows the encounter outcome for each allegation type. ‘Abuse of Authority’ has the most complaints with no arrest. ‘Force’ has the most arrests.
#Visualization - 7
ggplot(data, aes(x = ComplaintFiledPlace, y = IncidenYear)) +
geom_boxplot()+
xlab("Complaint Filed Place") +
ylab("Incident Year") +
coord_flip()+
ggtitle('Incident year vs. place where complaint was filed')
The plot above shows where each complaint was filed in relation to the year in which the incident took place. After 2016, most of the complaints are filed at another city agency.
#Visualization-8
ggplot(data, aes(`BoroughofOccurrence`,fill=`IsFullInvestigation`)) + geom_bar() + labs(title="Number of Complaints by Borough Location and status of investigation", x="Borough of Incident Location", y="Number of Complaints")
The plot above shows, for each borough, the number of complaints that were fully investigated and the number that were not.
#Visualization-9
library(ggthemes)
ggplot(data,aes(CloseYear,fill=IsFullInvestigation))+
geom_density(alpha=0.2,size=0.6)+
xlab("Year")+
ylab("Density of closed cases")+
ggtitle("Closed cases by full investigation status")+
theme_economist() +
scale_fill_discrete(name="Full Investigation")
The plot above shows the distribution of closed cases over time by full investigation status. Most of the closed cases are fully investigated, especially after 2013.
#Visualization - 10
ggplot(data, aes(`ReceivedYear`, `CloseYear`)) + geom_point() + geom_smooth() + labs(title="Relationship between Complaint Received Year and Close Year", x='Received year', y='Close year')
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#Visualization -11
ggplot(data,aes(x = AllegationFADOType, fill = ComplaintContainsStopFriskAllegations)) +
geom_bar() +theme_minimal() + theme(legend.position = "bottom") +
labs(title = "Complaints by Stop&Frisk Allegations")+
scale_fill_discrete(name="Stop&Frisk Allegations")
The plot above shows, for each FADO allegation type, how many complaints contain Stop and Frisk allegations.
The scatter plot in Visualization 10 shows the relationship between the year a complaint was received and the year it was closed. Before 2005, complaints were closed within 1 or 2 years of being received; after 2005 the gap widened to as much as 4 years. After 2014, complaints are again closed quickly, within 1 year.
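The gap between the received year and the close year can also be computed directly and averaged per received year, which makes the trend described above easier to check than reading it off overlapping points. This is a minimal sketch on a made-up `toy` data frame; the derived column `YearsToClose` is a hypothetical name and is not part of the original data.

```r
# Hypothetical toy rows; values are made up for illustration
toy <- data.frame(
  ReceivedYear = c(2004, 2004, 2008, 2008, 2015, 2015),
  CloseYear    = c(2005, 2006, 2010, 2012, 2015, 2016)
)

# Years elapsed between receiving and closing each complaint
toy$YearsToClose <- toy$CloseYear - toy$ReceivedYear

# Mean time to close for each received year
gap_by_year <- aggregate(YearsToClose ~ ReceivedYear, data = toy, FUN = mean)

print(gap_by_year)
```

Plotting `YearsToClose` against `ReceivedYear` as a line would show the same trend without the overplotting that affects a scatter plot of two year columns.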
Your final document should include at minimum 10 visualizations. Each should include a brief statement of why you made the graphic.
A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.
Answer - Using EDA, we were able to uncover relationships between many variables and surface the information needed for further analysis, without any prior calculation on the data. We could easily check for outliers, trends, and correlations.