Analysis

library(readxl)
library(ggplot2)
data = read_excel("ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
summary(data)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 
head(data)
## # A tibble: 6 x 16
##    DateStamp UniqueComplaintId `Close Year` `Received Year`
##       <dttm>             <dbl>        <dbl>           <dbl>
## 1 2016-11-29                11         2006            2005
## 2 2016-11-29                18         2006            2004
## 3 2016-11-29                18         2006            2004
## 4 2016-11-29                18         2006            2004
## 5 2016-11-29                18         2006            2004
## 6 2016-11-29                18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## #   Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## #   Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## #   Description` <chr>

1. I want to identify the percentage of “Borough of Occurrence”

p1 <- prop.table(table(data$`Borough of Occurrence`))
p1
## 
##         Bronx      Brooklyn     Manhattan            NA   Outside NYC 
##  0.2418920043  0.3533075339  0.2059912817  0.0023630484  0.0008317148 
##        Queens Staten Island 
##  0.1510932157  0.0445212014

2. I want to use pie chart to show me the percentage.

pie(p1, labels = round(p1 * 100))

3. The pie chart above is too simple, so I need some more details.

pie(p1,labels=round(p1*100/sum(p1),1), col = rainbow(length(p1)), main = "Borough of Occurrence")
legend("bottomleft",c("Bronx", "Brooklyn", "Manhattan","NA","Outside NYC","Queens","Staten Island"),fill = rainbow(length(p1)),cex = 0.6)

4. Now I am very clear about the the percentage of “Borough of Occurrence”. Then I would like to use simple bar chart to see the distribution of complaints occurrence by areas.

ggplot(data,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() 

5. From bar chart above, it is very easy to identify that Brooklyn area has the most complaints occurrences. I want to know if pie chart can also archive this.

ggplot(data,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() + coord_polar(theta="y") + scale_x_discrete()

6. Through the comparison, I personally prefer traditional bar chart. So I will try one more time to use bar chart to see the percentage of Allegation FADO Type in Borough of Occurrence.

ggplot(data,aes(`Borough of Occurrence`, fill = `Allegation FADO Type`)) + 
  geom_bar() + 
  scale_fill_discrete(name = "Allegation FADO Type")

7. From another bar chart above, we can see that the Abuse of Authority is the largest Allegation FADO Type. Then, I want to use boxplot to do more analysis on Allegation FADO Type.

qplot(data$`Allegation FADO Type`, data$`Incident Year`,data=data, geom = "boxplot")+ 
  coord_flip()+
  theme(panel.background = element_rect(fill = "white"))

8. Now combine “Borough of Occurrence” with “Allegation FADO Type”.

data1 <- as.data.frame(table(data$`Borough of Occurrence`,data$`Allegation FADO Type`))
ggplot(data1, aes(x = Var1, y = Freq)) +
 geom_line((aes(group = 1)))+
 xlab("Total complaints") +
 ylab("Borough of Occurrence")

9. We can see that with the combination of “Borough of Occurrence” and “Allegation FADO Type”, Brooklyn still has the largest complaints. Now I want to see the relationship between year and complaints.

data2 <- as.data.frame(table(data$`Borough of Occurrence`,data$`Incident Year`))
data2$Incident_Year <- as.numeric(as.character(data2$Var2))
ggplot(data2, aes(x = data2$Incident_Year, y = Freq))+ 
  geom_line(aes(group = data2$Var1,color = data2$Var1))

10. As per above, I also want to see the relationship between year and Allegation FADO Type.

data3 <- as.data.frame(table(data$`Allegation FADO Type`,data$`Incident Year`))
data3$Incident_Year <- as.numeric(as.character(data3$Var2))
ggplot(data3, aes(x = Incident_Year, y = Freq))+ 
  geom_line(aes(group = data3$Var1,color = data3$Var1))

11. I want to use simple line graph to see the relationship between complaints and close year.

plot(data$"Close Year", type = "l",lwd=2, col="blue", xlab = "Number of Complaints", ylab="Close Year")

12. For further analysis, I want to see the distribution of complaints over close year and Allegation FADO Type.

data4 <- table(data$"Allegation FADO Type", data$"Close Year")
barplot((data4),main="Complaints distributed over allegations each close year", xlab="Close Year", ylab="Complaints",horiz = T, col=c(1:5), legend = rownames(data4))

Summary

Exploratory data analysis(EDA) is an approach for summarizing and visualizing the important characteristics of a data set. EDA is mainly focusing on exploring data and on understanding the data’s underlying structure and variables, which is able to give us a clear understanding of the characters of data.
It is always a good idea to explore a data set with multiple exploratory techniques, especially when they have internal relationships. No one is able to fully understand data by only one way. With help of EDA, we are able to use various ways to understand each column or row of data. By deep understanding the details of data, we will be more interested in analysis and will have more chances to find out the hidden clues/relationships of data, which also means that EDA gives us confidences on handling data.