Questions

For this assignment you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings.

Your final document should include at least 10 visualizations. Each should include a brief statement of why you made the graphic.

A final section should summarize what you learned from your EDA. Your grade will be based on the quality of your graphics and the sophistication of your findings.

library(readxl)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggthemes)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

hw4 <- read_excel("C:/Users/Olivia/Desktop/HU/Data Visualization/ccrb_datatransparencyinitiative.xlsx", 
                  sheet = "Complaints_Allegations")

1. Number of complaints by borough and year

I wanted to see the complaint counts by borough across years. Brooklyn has the most complaints across the years, followed by Manhattan. The dataset appears to be missing many records before 2006, so the remaining analyses use only incidents from 2006 through 2016.

deduped.data <- unique( hw4[ , 1:5 ] )

deduped.data %>% 
  ggplot(aes(x = `Received Year`, fill = `Borough of Occurrence`)) + 
  geom_bar(position = "stack") + theme_economist() +
  labs(title = "Complaints counts by Borough of Occurrence across years")
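One way to check the coverage observation above is to tabulate complaints per received year before plotting. The sketch below uses a small made-up data frame (`toy`) in place of `deduped.data`; only the column names come from the dataset, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for deduped.data; the values are invented for illustration.
toy <- data.frame(
  `Received Year` = c(2004, 2006, 2006, 2007, 2007, 2007),
  `Borough of Occurrence` = c("Brooklyn", "Brooklyn", "Manhattan",
                              "Brooklyn", "Queens", "Bronx"),
  check.names = FALSE
)

# One row per year with the number of complaints received in that year
per_year <- toy %>% count(`Received Year`, name = "complaints")
print(per_year)
```

Years with unusually low counts in such a table are candidates for exclusion, which is the reasoning behind the year filter used in the later sections.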

2. Percentage of Incident location by Borough of Occurrence

Street/highway accounts for the highest percentage of incidents in every borough except Outside NYC. Outside NYC has a high percentage of incidents that happen at an apartment or house.

# Deduplicate on the first 15 columns and drop rows whose borough is recorded as "NA"
uniinccident <- unique(hw4[, 1:15])
validlocation <- uniinccident[!grepl("NA", uniinccident$`Borough of Occurrence`), ]
# Keep incidents from 2006 through 2016
after2005 <- validlocation[validlocation$`Incident Year` >= "2006" & validlocation$`Incident Year` <= "2016", ]
ggplot(after2005, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "fill") +
  labs(title = "Percentage of Incident Location by Borough", x = "Borough of Occurrence", y = "Percentage of Incidents") +
  scale_fill_discrete(name = "Incident Location") +
  theme_economist() +
  # Set the legend position after the complete theme so it is not overridden
  theme(legend.position = "bottom")
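The shares that a `position = "fill"` bar chart displays can also be computed as a table, which makes it easier to quote exact percentages. This is a minimal sketch on a made-up data frame (`toy`); the column names follow the dataset, but the rows are hypothetical.

```r
library(dplyr)

# Hypothetical rows standing in for after2005
toy <- data.frame(
  `Borough of Occurrence` = c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn",
                              "Queens", "Queens"),
  `Incident Location` = c("Street/highway", "Street/highway", "Street/highway",
                          "Apartment/house", "Street/highway", "Apartment/house"),
  check.names = FALSE
)

# Count each borough/location pair, then divide by the borough total
shares <- toy %>%
  count(`Borough of Occurrence`, `Incident Location`) %>%
  group_by(`Borough of Occurrence`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
print(shares)
```

Within each borough the `share` column sums to 1, matching the full-height bars in the chart.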

3. Incident location by Borough of Occurrence

This chart shows the actual number of incidents at each location by borough, which makes it easier to compare incident counts across boroughs.

ggplot(after2005, aes(x = `Borough of Occurrence`, fill = `Incident Location`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Incident Location by Borough")

4. Percentage of Allegation FADO Type by Borough of Occurrence

Abuse of Authority and Force account for the highest percentages across all boroughs.

ggplot(after2005, aes(x = `Borough of Occurrence`, fill = `Allegation FADO Type`)) +
  geom_bar(position = "fill") +
  labs(title = "Percentage of Allegation FADO Type by Borough of Occurrence", x = "Borough of Occurrence", y = "Percentage of Incidents") +
  scale_fill_discrete(name = "") +
  theme_economist() +
  # Set the legend position after the complete theme so it is not overridden
  theme(legend.position = "bottom")

5. Allegation FADO Type by Borough

Most complaint cases are related to Abuse of Authority, followed by Force.

ggplot(after2005, aes(x = `Allegation FADO Type`, fill = `Borough of Occurrence`)) +
  geom_bar(position = "stack") + theme_economist() +
  labs(title = "Allegation FADO Type by Borough")

6. Complaint Filed Mode across years

Overall complaints are decreasing. Complaints filed through the call processing system show the steepest decline among all filing modes.

ggplot(after2005, aes(x = `Incident Year`, fill = `Complaint Filed Mode`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Complaint Filed Mode across years")

7. Most Frequent Incident Types

Physical disability and word are the most frequent incident types.

ggplot(hw4, aes(x = `Allegation Description`)) +
  geom_bar(width = 0.5, position = position_dodge(width = 0.5)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "The Most Frequent Incident Types")

8. Density graph of complaints with or without video evidence

The chart shows that most complaints do not have video evidence.

# The x aesthetic is the video-evidence flag treated as a factor, as in the original layer mapping
ggplot(after2005, aes(x = factor(`Complaint Has Video Evidence`), colour = `Complaint Has Video Evidence`)) +
  geom_density(alpha = 1) +
  theme_economist() +
  labs(title = "Density graph of complaints with or without video evidence", x = "Complaint Has Video Evidence", y = "Density")

9. Allegation FADO Type by Encounter Outcome

When the allegation type is Force, a higher percentage of people get arrested than for other allegation types.

ggplot(after2005, aes(x = `Encounter Outcome`, fill = `Allegation FADO Type`)) +
  geom_bar(position = "stack") +
  theme_economist() +
  labs(title = "Allegation FADO Type by Encounter Outcome")

10. Investigation Status by Allegation FADO Type

Force makes up a larger percentage of allegations when the case does not get a full investigation.

# Allegations from cases that received a full investigation
fullinvestigate <- after2005[after2005$`Is Full Investigation` == "TRUE", ]
pie <- ggplot(data = fullinvestigate) + 
  geom_bar(mapping = aes(x = factor(1), fill = `Allegation FADO Type`), width = 1) + 
  coord_polar(theta = "y") + labs(title = "Full Investigation by Allegation FADO Type")
pie

# Allegations from cases that did not receive a full investigation
tinvestigate <- after2005[after2005$`Is Full Investigation` == "FALSE", ]
pie <- ggplot(data = tinvestigate) + 
  geom_bar(mapping = aes(x = factor(1), fill = `Allegation FADO Type`), width = 1) + 
  coord_polar(theta = "y") + labs(title = "Not Full Investigation by Allegation FADO Type")
pie
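Because slice sizes in two separate pies are hard to compare by eye, the same comparison can be made numerically with row-wise proportions. The sketch below is only an illustration on a made-up data frame (`toy`); the column names follow the dataset, and the values are hypothetical.

```r
# Hypothetical rows standing in for after2005
toy <- data.frame(
  `Is Full Investigation` = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  `Allegation FADO Type` = c("Abuse of Authority", "Abuse of Authority", "Force",
                             "Discourtesy", "Force", "Force",
                             "Abuse of Authority", "Discourtesy"),
  check.names = FALSE
)

# Cross-tabulate investigation status against allegation type
counts <- table(toy$`Is Full Investigation`, toy$`Allegation FADO Type`)

# margin = 1 divides each row by its total, giving shares within each status
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` corresponds to one pie, so the Force column can be read off directly for both groups.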

Summary

The goal of this assignment is to practice exploratory data analysis. When I encounter a dataset that I am not familiar with, my first step is to understand its main characteristics through data visualization. What I learned from this project is that the best practice is to start with basic charts such as scatter plots, histograms, pie charts, and density charts to find interesting patterns in the data.

I tried plotting different graphics with the ggplot2 package. I realized that some graphics look cool but make it hard to discover patterns in the data. I think this also has to do with the characteristics of our dataset. The dataset we used doesn't have numerical variables, so we need to find graphic types that are useful for analyzing categorical data.

Then, it is important to clean the data. The data appear to be incomplete before 2006, which may affect the results if those incomplete records are included. While working on the exploratory data analysis, I also noticed that several rows share the same Unique Complaint Id. We need to preprocess the data carefully. Otherwise, we may count the same case multiple times and distort the final analysis.
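The duplicate-ID issue can be checked by comparing the row count with the number of distinct complaint IDs. This is a minimal sketch on a made-up data frame (`toy`); `Unique Complaint Id` and `Allegation FADO Type` are column names from the dataset, while the values are hypothetical.

```r
library(dplyr)

# Hypothetical rows: one complaint can carry several allegations
toy <- data.frame(
  `Unique Complaint Id` = c(1, 1, 2, 3, 3, 3),
  `Allegation FADO Type` = c("Force", "Abuse of Authority", "Force",
                             "Discourtesy", "Force", "Force"),
  check.names = FALSE
)

n_rows <- nrow(toy)                                    # allegation-level rows
n_complaints <- n_distinct(toy$`Unique Complaint Id`)  # complaint-level count

# Keep the first row for each complaint when counting complaints rather than allegations
one_per_complaint <- toy %>%
  distinct(`Unique Complaint Id`, .keep_all = TRUE)
print(one_per_complaint)
```

Keeping only the first row per complaint discards that complaint's other allegations, so this reduction fits complaint-level counts but not allegation-level charts.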

Everything above is the first step toward becoming familiar with our data. From here we can make the charts fancier with different combinations and customizations for further analysis.

ANLY 512 - Problem Set 4

Exploratory Data Analysis

YUN-CHIA LO

07/07/2018