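library(readxl); library(dplyr)      # read_excel(); %>%, group_by(), summarize()
library(ggplot2); library(ggthemes)  # plotting; theme_economist()
df <- read_excel("T:\\Daisy Chen\\HU\\ANL 512 51\\ccrb_datatransparencyinitiative.xlsx", sheet = 2)
df <- data.frame(df)  # plain data.frame; any spaces in column names become dots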
First, I want to see which channels complaints are most often received through. A bar chart is used because the variable has only a few categories. As the bar chart shows, most people file their complaints by phone. The second most frequent method is the Call Processing System.
ggplot(data=df, aes(x=Complaint.Filed.Mode, fill=Complaint.Filed.Mode)) +
geom_bar() +
labs(title="Complaints by Mode", x="Mode", y="Number of Complaints") +
theme(legend.position = "bottom") +
scale_fill_discrete(name="Mode")
Next, I would like to find out which NYC borough accounts for the most complaints. A pie chart is used to show each borough's share. From the plot, Brooklyn, Manhattan and the Bronx had the most complaints.
pie(table(df$Borough.of.Occurrence))
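Slice sizes in a pie chart are hard to compare by eye, so it can help to print the shares as numbers too. The following is a minimal sketch on a hypothetical toy vector that stands in for `df$Borough.of.Occurrence`; the values are made up for illustration and are not taken from the CCRB data.

```r
# Hypothetical toy values standing in for df$Borough.of.Occurrence.
borough <- c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
             "Manhattan", "Manhattan", "Manhattan",
             "Bronx", "Bronx", "Queens")

counts <- table(borough)

# Share of each borough in percent, largest first.
shares <- sort(round(100 * prop.table(counts), 1), decreasing = TRUE)
print(shares)
```

On the real data, the same two lines applied to `table(df$Borough.of.Occurrence)` would give the ranking that the pie chart only suggests.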
To see where incidents took place in more detail, a stacked bar chart was created. From the plot, street/highway was the most common incident location, followed by residential buildings. Citizens would usually be able to report these complaints by phone.
ggplot(df, aes(x = Borough.of.Occurrence, fill = Incident.Location)) +
geom_bar(stat = 'count') +
labs(title = "Incident Location", x = "Location", y = "Number of Complaints") +
theme_bw()
To understand the types of Encounter Outcome, a pie chart was created to show the percentage of each outcome. For most complaints, the outcome is an arrest.
EO <- table(df$Encounter.Outcome)
pie(EO)
With a full investigation, arrests and summonses together account for over 50% of cases. Without a full investigation, over half of the cases are No Arrest or Summons.
ggplot(data=df, aes(x = Is.Full.Investigation, fill = Encounter.Outcome)) +
geom_bar(stat = 'count') +
labs(title = "Outcome and Investigation", x = "Full investigation or not", y = "Count") +
theme_classic()
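Statements such as "over 50%" refer to shares within each investigation group, which a stacked count chart shows only indirectly. A minimal sketch of how those within-group shares could be computed is given here, using a hypothetical toy data frame. The column names mirror `df`, but the values are invented, and storing `Is.Full.Investigation` as a logical is an assumption.

```r
# Hypothetical toy data; values are invented for illustration only.
toy <- data.frame(
  Is.Full.Investigation = c(TRUE, TRUE, TRUE, TRUE,
                            FALSE, FALSE, FALSE, FALSE),
  Encounter.Outcome = c("Arrest", "Arrest", "Summons", "No Arrest or Summons",
                        "Arrest", "No Arrest or Summons",
                        "No Arrest or Summons", "No Arrest or Summons")
)

# Rows: investigation status; columns: outcome.
tab <- table(toy$Is.Full.Investigation, toy$Encounter.Outcome)

# margin = 1 makes each row sum to 1, i.e. shares within each group.
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 2))
```

In the plot itself, `geom_bar(position = "fill")` would draw the same within-group shares instead of raw counts.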
Since a full investigation is important, we wanted to find out whether video evidence affects whether a complaint is fully investigated. From the plot, nearly all complaints that had video evidence were fully investigated.
ggplot(data=df, aes(x = Complaint.Has.Video.Evidence, fill = Is.Full.Investigation)) +
geom_bar(stat = 'count') +
labs(title = "Video Evidence and Investigation", x = "Complaint has video evidence", y = "Count") +
theme_classic()
Then we would like to see whether there is any relationship between investigation status and location. According to the plot, the share of fully investigated complaints looks similar across boroughs, so there is no obvious preference or bias.
ggplot(data = df, aes(x = Borough.of.Occurrence, fill = Is.Full.Investigation )) +
geom_bar(width = 0.5, alpha = 0.5, stat = 'count') +
labs(title = 'Figure 6: Geography Location for Complaint and Investigation Situation', x = 'Location') +
scale_fill_discrete(name = 'Full Investigation or Not') +
theme(legend.position = "bottom")
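Reading "no preference or bias" off a stacked bar chart is a visual judgment. One way to back it up, sketched here as an assumption and not as part of the original analysis, is a chi-squared test of independence on the borough-by-investigation table. The counts in this example are hypothetical; on the real data the table would come from `table(df$Borough.of.Occurrence, df$Is.Full.Investigation)`.

```r
# Hypothetical counts, invented for illustration only.
counts <- matrix(
  c(60,  40,
    118, 82,
    92,  58),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Borough = c("Bronx", "Brooklyn", "Manhattan"),
    Full.Investigation = c("Yes", "No")
  )
)

# Pearson chi-squared test of independence between the two factors.
res <- chisq.test(counts)
print(res$statistic)
print(res$p.value)
```

A large p-value means the data give no evidence that the investigation rate differs by borough; it does not prove that the rates are identical.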
Next, I would like to learn how the handling of complaints has changed over the years. Because year is a continuous variable, a line plot was created. The most complaints in NYC were received between 2005 and 2010. After 2009, the number of complaints received gradually decreased.
df.by.receiveyear <- df %>%
group_by(Received.Year) %>%
summarize(num_case = n_distinct(UniqueComplaintId)) %>%
select(Received.Year, num_case)
ggplot(data = df.by.receiveyear, aes(x = Received.Year, y = num_case)) +
geom_line(alpha = 0.5) +
ggtitle('Number of Complaints by Received Year') +
xlab('Received Year') +
ylab('Number of Cases') +
theme_economist()
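The summary above counts distinct `UniqueComplaintId` values instead of rows. The sketch below shows why that choice matters, assuming (this is not stated in the report) that one complaint can occupy several rows, for example one row per allegation. The toy data are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: complaint 1 appears on two rows.
toy <- data.frame(
  UniqueComplaintId = c(1, 1, 2, 3),
  Received.Year     = c(2006, 2006, 2006, 2007)
)

by_year <- toy %>%
  group_by(Received.Year) %>%
  summarize(
    num_rows = n(),                           # counts every row
    num_case = n_distinct(UniqueComplaintId)  # counts each complaint once
  )

print(by_year)
```

If the assumption holds, the earlier bar and pie charts, which count rows, would weight complaints with many rows more heavily than the yearly series does.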
Then we would like to look at the close year. From the plot, the number of complaints by close year shows a general downward trend, although it has fluctuated in recent years.
df.by.closeyear <- df %>%
group_by(Close.Year) %>%
summarize(num_case = n_distinct(UniqueComplaintId)) %>%
select(Close.Year, num_case)
ggplot(data = df.by.closeyear, aes(x = Close.Year, y = num_case)) +
geom_line(alpha = 0.5) +
ggtitle('Figure 2: Number of Complaints by Close Year') +
xlab('Close Year') +
ylab('Number of Cases') +
theme_economist()
To find the time needed to process complaints, a bar chart was created. We found that the processing time was not very long: the majority of cases were closed within a year or in 1-2 years.
df.dif <- df %>%
distinct(UniqueComplaintId, .keep_all = TRUE) %>%
mutate(time_length = Close.Year - Received.Year)
ggplot(data=df.dif, aes(x = time_length)) +
geom_bar(width = 0.5, alpha = 0.5, stat = 'count') +
labs(title = "Time Length for Complaints to Be Processed", x = "Time Length (Years)", y = "Count")
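To put a number on how quickly complaints are closed, the cumulative share by time length can be tabulated. This is a minimal sketch on a hypothetical vector standing in for `df.dif$time_length`. Note that `Close.Year - Received.Year` is a difference of calendar years, so a complaint received in December and closed in January already counts as 1.

```r
# Hypothetical toy values standing in for df.dif$time_length (in years).
time_length <- c(0, 0, 0, 1, 1, 1, 1, 2, 2, 3)

# Share of complaints at each time length, then the running total.
share <- prop.table(table(time_length))
cum_share <- cumsum(share)
print(round(cum_share, 2))
```

The entry named `1` then reads as the share of complaints closed in the year of receipt or the year after.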
Per the EDA above, the highest number of complaints occurred in 2006-2007. In more recent years, complaints have been on the decline. Half of the complaints were resolved within a year.
Brooklyn has the highest number of complaints, followed by the Bronx and Manhattan. Further, the most common incident locations are streets/highways and apartments/houses. This information can be used to help reduce crime levels, since it shows quite precisely where incidents occur.
Further, more complaints should be supported by video evidence in order to bring the number of complaints down, since video evidence tends to lead to a full investigation.
Exploratory data analysis is very useful for investigating unfamiliar data. We don't need to change anything in the original data set, and we can check the relationship between any variables as needed. The outcome is quite clear and easy to understand. There are many chart types to choose from, and choosing the right one is important. Exploratory data analysis is a method that helps us understand the data set directly. It is not just about the graphics, but also about data collection and data cleaning. Single-variable analysis is not hard; however, I think finding the relationships between variables is more important, and this is what EDA does.