Import Data

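library(readxl); library(dplyr)      # read_excel(); %>%, group_by(), summarize(), distinct(), mutate()
library(ggplot2); library(ggthemes)  # ggplot() and geoms; theme_economist()
df <- read_excel("T:\\Daisy Chen\\HU\\ANL 512 51\\ccrb_datatransparencyinitiative.xlsx", sheet = 2)
df <- data.frame(df)  # data.frame() makes column names syntactic (spaces become dots)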

Visualization 1: The Number of Complaints Filed in Different Modes

Method: Bar Chart

First, I want to see which channels are most commonly used to file complaints. A bar chart suits this variable because it has only a few categories. As the chart shows, most people file complaints by phone; the second most frequent mode is the Call Processing System. Next, we look at which NYC borough generated the most complaints.

ggplot(data=df, aes(x=Complaint.Filed.Mode, fill=Complaint.Filed.Mode)) + 
  geom_bar() +   # geom_bar() counts rows per category; geom_histogram() is meant for continuous data
  labs(title="Complaints by Mode", x="Mode", y="Number of Complaints") +
  theme(legend.position = "bottom") +
  scale_fill_discrete(name="Mode")

Visualization 2: Distribution of Complaints in Different Locations

Method: Pie Chart

Next, I want to find out which NYC borough had the most complaints. A pie chart was used to show each borough's share of the total. From the plot, Brooklyn, Manhattan and the Bronx had the most complaints.

pie(table(df$Borough.of.Occurrence))
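
Since the goal is to read off each borough's percentage, it may help to compute the shares explicitly and print them on the slice labels. The following is only a minimal sketch: the `borough` vector is a hypothetical stand-in for `df$Borough.of.Occurrence`, and its counts are invented for illustration.

```r
# Hypothetical stand-in for df$Borough.of.Occurrence; the counts are made up.
borough <- c(rep("Brooklyn", 4), rep("Manhattan", 3), rep("Bronx", 2), "Queens")

counts <- table(borough)                      # complaints per borough
pct    <- round(100 * prop.table(counts), 1)  # share of the total, in percent

# Order slices from largest to smallest and put the share on each label.
ord    <- order(counts, decreasing = TRUE)
labels <- paste0(names(counts)[ord], " (", pct[ord], "%)")

pie(counts[ord], labels = labels, main = "Complaints by Borough")
```

With the real data, replacing `borough` by `df$Borough.of.Occurrence` should give the same kind of labeled chart.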

Visualization 3: Detailed Incident Location

Method: Bar Chart

To look at incident locations in more detail, a stacked bar chart was created. From the plot, street/highway is the most common incident location, followed by residential buildings. Usually, citizens were able to report complaints by phone.

ggplot(df, aes(x = Borough.of.Occurrence, fill = Incident.Location)) + 
  geom_bar(stat = 'count') + 
  labs(title = "Incident Location", x = "Location", y = "Number of Complaints") +  # lowercase y: labs() does not recognize Y
  theme_bw()

Visualization 4: Complaints Outcomes

Method: Pie Chart

To understand the distribution of Encounter Outcome, a pie chart was created to show the share of each outcome. For most complaints, the outcome is an arrest.

EO <- table(df$Encounter.Outcome)
pie(EO)

Visualization 5: Relation between Encounter Outcome and Full investigation or not

Method: Bar Chart

With a full investigation, arrests and summonses together account for over 50% of cases. Without a full investigation, over half of the cases are No Arrest or Summons.

ggplot(data=df, aes(x = Is.Full.Investigation, fill = Encounter.Outcome)) + 
  geom_bar(stat = 'count') + 
  labs(title = "Outcome and Investigation", x = "Full investigation or not", y = "Count") +  # lowercase y: labs() does not recognize Y
  theme_classic()
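
The "over 50%" statements above are shares within each investigation group, which are hard to read off a stacked count chart. One way to check them is a row-wise proportion table. This is a sketch on a hypothetical `toy` data frame with invented values; only the column names are taken from the report.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  Encounter.Outcome = c("Arrest", "Arrest", "Summons", "No Arrest or Summons",
                        "Arrest", "No Arrest or Summons",
                        "No Arrest or Summons", "No Arrest or Summons")
)

# Cross-tabulate, then convert counts to proportions within each row,
# i.e. within each investigation group (margin = 1).
tab    <- table(toy$Is.Full.Investigation, toy$Encounter.Outcome)
shares <- prop.table(tab, margin = 1)

print(round(shares, 2))
```

The same comparison can be drawn by passing `position = "fill"` to `geom_bar()`, which rescales every bar to the same height so that the segments show proportions rather than counts.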

Visualization 6: Relation between “Complaints has video evidence” and “Full investigation or not”

Method: Bar Chart

Since full investigation is important, we wanted to find out whether video evidence affects whether a complaint is fully investigated. From the plot, nearly all complaints with video evidence were fully investigated.

ggplot(data=df, aes(x = Complaint.Has.Video.Evidence, fill = Is.Full.Investigation)) + 
  geom_bar(stat = 'count') + 
  labs(title = "Video Evidence and Investigation", x = "Complaint has video evidence", y = "Count") +  # lowercase y: labs() does not recognize Y
  theme_classic()

Visualization 7: Summary of Investigation and Location

Method: Bar Chart

Next, we check whether there is any relationship between full investigation and location. According to the plot, no borough appears to be favored or disfavored when it comes to full investigation.

ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation )) + 
  geom_bar(width = 0.5, alpha = 0.5, stat = 'count') + 
  labs(title = 'Geographic Location of Complaints and Investigation Status', x = 'Location') +
  scale_fill_discrete(name = 'Full Investigation or Not') +  # single fill scale; a second one would override this legend title
  theme(legend.position = "bottom")

Visualization 8: Complaints by Received Year

Method: Series Plot

I would like to learn more about how efficiently complaints have been handled over the years. Since year is a continuous variable, a series plot was created. The most complaints were received between 2005 and 2010. After 2009, the number of complaints received gradually decreased.

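# One row per year with the number of distinct complaints received.
# summarize() already returns only the grouping column and num_case.
df.by.receiveyear <- df %>%
  group_by(Received.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId))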

ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) + 
  geom_line(alpha = 0.5) + 
  ggtitle('Number of Complaints by Received Year') + 
  xlab('Received Year') + 
  ylab('Number of Cases') + 
  theme_economist()
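
The yearly totals use `n_distinct(UniqueComplaintId)` rather than a plain row count, which suggests that one complaint can appear on more than one row. Under that assumption, counting rows would overstate the number of complaints. The sketch below shows the difference in base R on a hypothetical `toy` data frame with invented IDs and years.

```r
# Hypothetical toy data: complaint 1 appears on two rows, complaint 3 on three.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3, 3, 3, 4),
  Received.Year     = c(2006, 2006, 2006, 2007, 2007, 2007, 2007)
)

# Rows per year: counts every row, including repeated complaints.
rows_per_year <- table(toy$Received.Year)

# Distinct complaints per year: the base R equivalent of n_distinct() by group.
cases_per_year <- tapply(toy$UniqueComplaintId, toy$Received.Year,
                         function(ids) length(unique(ids)))

print(rows_per_year)
print(cases_per_year)
```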

Visualization 9: Complaints by Closed Year

Method: Series Plot

Next, we look at the year in which complaints were closed. From the plot, the number of complaints by close year shows a general downward trend, although it has fluctuated in recent years.

# One row per year with the number of distinct complaints closed.
df.by.closeyear <- df %>%
  group_by(Close.Year) %>%
  summarize(num_case = n_distinct(UniqueComplaintId))

ggplot(data = df.by.closeyear, aes(x = Close.Year, y = num_case)) + 
  geom_line(alpha = 0.5) + 
  ggtitle('Number of Complaints by Close Year') + 
  xlab('Close Year') + 
  ylab('Number of Cases') + 
  theme_economist()

Visualization 10: Length of Time to Process Complaints

Method: Bar Chart

To see how long complaints take to process, a bar chart of processing time was created. Processing times were fairly short: the majority of cases were closed within a year or in 1-2 years.

# Keep one row per complaint, then compute the processing time in calendar years.
df.dif <- df %>%
  distinct(UniqueComplaintId, .keep_all = TRUE) %>%
  mutate(time_length = Close.Year - Received.Year)
ggplot(data=df.dif, aes(x = time_length)) + 
  geom_bar(width = 0.5, alpha = 0.5, stat = 'count') + 
  labs(title = "Time Length for Complaints to Be Processed", x = "Time Length (Years)", y = "Count")
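
To put a number on "closed within a year", the cumulative share of complaints by processing time can be computed directly. The sketch below uses a hypothetical `toy` data frame with invented years, one row per complaint. Note that a difference of 0 only means the complaint was received and closed in the same calendar year, not necessarily within twelve months.

```r
# Hypothetical complaints, one row each; the years are invented for illustration.
toy <- data.frame(
  UniqueComplaintId = 1:8,
  Received.Year = c(2006, 2006, 2007, 2007, 2008, 2008, 2009, 2009),
  Close.Year    = c(2006, 2007, 2007, 2008, 2008, 2010, 2009, 2010)
)

# Processing time in calendar years, as in the report.
toy$time_length <- toy$Close.Year - toy$Received.Year

# Share of complaints per processing time, then the running total.
counts    <- table(toy$time_length)
cum_share <- cumsum(prop.table(counts))

print(round(cum_share, 3))
```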

Summary

According to the exploratory analysis above, the number of complaints peaked in 2006-2007. In more recent years, complaints have been declining. Half of the complaints were resolved within a year.

Brooklyn has the highest number of complaints, followed by the Bronx and Manhattan. Further, the most common incident locations are streets/highways and apartments/houses. This information could help target efforts to reduce such incidents, since it indicates quite precisely where they occur.

Further, the availability of video evidence for complaints should be increased in order to bring the number of complaints down, since complaints with video evidence are nearly always fully investigated.

Exploratory data analysis is very useful for investigating unfamiliar data. We don't need to change anything in the original data set, and we can check the relationship between any variables as needed. The results are clear and easy to understand. There are many chart types to choose from, and choosing the right one matters. Exploratory data analysis is a method that helps us understand the data set directly. It is not just about graphics, but also about data collection and data cleaning. Single-variable analysis is not hard; however, I think finding the relationships between variables is more important, and this is what EDA does.