library(readxl)
library(ggplot2)
data = read_excel("ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
summary(data)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
head(data)
## # A tibble: 6 x 16
## DateStamp UniqueComplaintId `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 11 2006 2005
## 2 2016-11-29 18 2006 2004
## 3 2016-11-29 18 2006 2004
## 4 2016-11-29 18 2006 2004
## 5 2016-11-29 18 2006 2004
## 6 2016-11-29 18 2006 2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## # Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## # Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## # Description` <chr>
p1 <- prop.table(table(data$`Borough of Occurrence`))
p1
##
## Bronx Brooklyn Manhattan NA Outside NYC
## 0.2418920043 0.3533075339 0.2059912817 0.0023630484 0.0008317148
## Queens Staten Island
## 0.1510932157 0.0445212014
pie(p1, labels = round(p1 * 100))
pie(p1,labels=round(p1*100/sum(p1),1), col = rainbow(length(p1)), main = "Borough of Occurrence")
legend("bottomleft",c("Bronx", "Brooklyn", "Manhattan","NA","Outside NYC","Queens","Staten Island"),fill = rainbow(length(p1)),cex = 0.6)
ggplot(data,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar()
ggplot(data,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() + coord_polar(theta="y") + scale_x_discrete()
ggplot(data,aes(`Borough of Occurrence`, fill = `Allegation FADO Type`)) +
geom_bar() +
scale_fill_discrete(name = "Allegation FADO Type")
data1 <- as.data.frame(table(data$`Borough of Occurrence`,data$`Allegation FADO Type`))
ggplot(data1, aes(x = Var1, y = Freq)) +
geom_line((aes(group = 1)))+
xlab("Total complaints") +
ylab("Borough of Occurrence")
data2 <- as.data.frame(table(data$`Borough of Occurrence`,data$`Incident Year`))
data2$Incident_Year <- as.numeric(as.character(data2$Var2))
ggplot(data2, aes(x = data2$Incident_Year, y = Freq))+
geom_line(aes(group = data2$Var1,color = data2$Var1))
data3 <- as.data.frame(table(data$`Allegation FADO Type`,data$`Incident Year`))
data3$Incident_Year <- as.numeric(as.character(data3$Var2))
ggplot(data3, aes(x = Incident_Year, y = Freq))+
geom_line(aes(group = data3$Var1,color = data3$Var1))
plot(data$"Close Year", type = "l",lwd=2, col="blue", xlab = "Number of Complaints", ylab="Close Year")
data4 <- table(data$"Allegation FADO Type", data$"Close Year")
barplot((data4),main="Complaints distributed over allegations each close year", xlab="Close Year", ylab="Complaints",horiz = T, col=c(1:5), legend = rownames(data4))
Exploratory data analysis(EDA) is an approach for summarizing and visualizing the important characteristics of a data set. EDA is mainly focusing on exploring data and on understanding the data’s underlying structure and variables, which is able to give us a clear understanding of the characters of data.
It is always a good idea to explore a data set with multiple exploratory techniques, especially when they have internal relationships. No one is able to fully understand data by only one way. With help of EDA, we are able to use various ways to understand each column or row of data. By deep understanding the details of data, we will be more interested in analysis and will have more chances to find out the hidden clues/relationships of data, which also means that EDA gives us confidences on handling data.