Exploratory Data Analysis

Data Collection

We begin this exercise by downloading the data from the source and loading it with a library that reads Excel files. We then answer a series of exploratory questions using different visualizations.

library(ggplot2)
library(readxl)
library(tidyverse)

## -- Attaching packages -------------------------------------------------------------------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.0     v dplyr   0.8.3
## v readr   1.3.1     v stringr 1.4.0
## v tibble  2.1.3     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
setwd("C:/Users/shoukhan/Documents/Harrisburg University/Anly-512-Data-Visualization/Summer Course")
dataset <- read_xlsx("ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")

Data Visualization

We begin by looking at how the data is distributed: how many complaints were registered in each jurisdiction, and which jurisdiction has the highest percentage?

# Drop rows whose borough is recorded as the literal string "NA"
d1 <- dataset[!(dataset$`Borough of Occurrence` == "NA"), ]
# Count complaints per borough
crime.freq <- table(d1$`Borough of Occurrence`)
# Build slice labels of the form " Borough - 25 %"
label <- names(crime.freq)
label <- paste("", label, "-")
percentage <- round(crime.freq / sum(crime.freq) * 100)
lableWithPercent <- paste(label, percentage)
lableWithPercent <- paste(lableWithPercent, "%")
pie(crime.freq, lableWithPercent, main = "Crime distribution by jurisdiction")

Based on the pie chart, the top three places are Brooklyn, Bronx and Manhattan. It also shows that relatively few complaints come from outside New York City.
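
The borough filter above compares against the string "NA". As a minimal sketch of why that distinction matters, the toy data frame here (invented values, hypothetical column name `borough`) mixes a truly missing value with the literal string "NA". It assumes nothing about how the real spreadsheet encodes missing boroughs.

```r
# Hypothetical toy data; the values are invented for illustration only.
toy <- data.frame(
  borough = c("Brooklyn", "Bronx", NA, "NA", "Brooklyn", "Manhattan"),
  stringsAsFactors = FALSE
)

# Comparing against the string "NA" yields NA (not FALSE) for truly missing
# values, so logical subsetting keeps those rows as all-NA rows.
by_string <- toy[!(toy$borough == "NA"), , drop = FALSE]

# Testing both conditions removes real NA values and the literal string "NA".
keep <- !is.na(toy$borough) & toy$borough != "NA"
cleaned <- toy[keep, , drop = FALSE]

# Percentage share of each borough among the cleaned rows.
share <- round(prop.table(table(cleaned$borough)) * 100)
print(share)
```

If the real column contains only the string form, the original filter and the two-condition filter give the same rows.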

How many complaints were received each year?

# Keep complaints received in 2006 or later
dataByYear <- subset(dataset, `Received Year` >= 2006)
# geom_bar() counts rows per year, replacing geom_histogram(stat = "count")
ggplot(dataByYear, aes(x = `Received Year`, fill = `Allegation FADO Type`)) +
  geom_bar() +
  labs(title = "Distribution of Complaints by type and year", x = "Year", y = "Complaints") +
  theme(legend.position = "right") +
  scale_fill_discrete(name = "Allegation Type") + coord_flip()

This visualization suggests that complaints of each type increase year over year, while the most common causes remain the same.

Increase of video evidence in complaints year over year

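# Count, per received year, the rows that have a value in the video-evidence field
d <- aggregate(`Complaint Has Video Evidence` ~ `Received Year`, FUN = length, data = dataset)
# Keep only years with at least 10,000 such rows
d <- d[!(d$`Complaint Has Video Evidence` < 10000), ]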

ggplot(d, aes(`Received Year`,`Complaint Has Video Evidence`))+geom_line()

The chart above gives a clear picture of video evidence in complaints increasing year over year and then, for some reason, declining.
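
One point worth checking is what `FUN = length` counts. The sketch here uses an invented toy data frame with hypothetical column names `year` and `has_video`, and assumes the flag is stored as a logical value, which the source does not confirm. Under that assumption, `length` counts every row with a recorded flag, while `sum` counts only the rows where the flag is TRUE.

```r
# Hypothetical toy data; column names and values are invented.
toy <- data.frame(
  year      = c(2015, 2015, 2015, 2016, 2016),
  has_video = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

# FUN = length counts every row per year, regardless of the flag's value.
rows_per_year <- aggregate(has_video ~ year, data = toy, FUN = length)

# FUN = sum counts only the rows where the flag is TRUE.
video_per_year <- aggregate(has_video ~ year, data = toy, FUN = sum)

print(rows_per_year)
print(video_per_year)
```

If the real column is text rather than logical, the same idea applies after converting it to a logical vector first.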

Box plot of received year by borough

ggplot(d1, aes(x = `Borough of Occurrence`, y = `Received Year`)) +
  labs(title = "Comparison of Received Year vs Borough", x = "Borough of Occurrence", y = "Received Year") + geom_boxplot() + coord_flip()

The box plot shows the median and spread of received years for each borough and makes outliers easy to spot.
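
A box plot marks the median rather than the mean, and by default flags points that lie more than 1.5 times the interquartile range beyond the hinges. As a small illustration on invented values (not taken from the complaint data), base R's `boxplot.stats()` returns the same five numbers and the flagged points.

```r
# Hypothetical values, invented for illustration.
years <- c(2006, 2007, 2007, 2008, 2008, 2008, 2009, 2009, 2010, 1995)

stats <- boxplot.stats(years)

# Lower whisker, lower hinge, median, upper hinge, upper whisker.
print(stats$stats)

# Points beyond 1.5 * IQR from the hinges, drawn as outliers.
print(stats$out)

# The single low value pulls the mean below the median.
print(c(mean = mean(years), median = median(years)))
```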

Density Visualization by Borough of Occurrence

ggplot(data=d1, aes(x=`Incident Year`, colour = `Borough of Occurrence`)) +
  geom_density(aes(factor(`Borough of Occurrence`)), alpha = 1) +
  theme_classic() +
  labs(title= "Borough of Occurrence - Density", x="Borough", y="Density")

Density plots are useful for capturing the distribution of data. Here, the plot also helps interpret how incidents are distributed over the years by borough.
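
If the goal is to compare how incident years are distributed within each borough, one option is to keep the year on the x-axis and use the borough only for colour. This is a sketch of that alternative mapping on invented toy data with hypothetical column names `incident_year` and `borough`; it is not the plot produced by the code above.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  incident_year = c(2006, 2007, 2007, 2008, 2009, 2008, 2009, 2009, 2010, 2011),
  borough       = rep(c("Brooklyn", "Bronx"), each = 5)
)

# One density curve per borough, with the year as the continuous x variable.
p <- ggplot(toy, aes(x = incident_year, colour = borough)) +
  geom_density() +
  theme_classic() +
  labs(title = "Incident year density by borough (toy data)",
       x = "Incident year", y = "Density", colour = "Borough")
p
```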

Density Visualization by Allegation

d2 <- dataset[!(dataset$`Allegation FADO Type`=="NA"),]

ggplot(data=d2, aes(x=`Incident Year`, colour = `Allegation FADO Type`)) +
  geom_density(aes(factor(`Allegation FADO Type`)), alpha = 1) +
  theme_classic() +
  labs(title= "Type of Allegation - Density", x="Allegation", y="Density")

This visualization helps interpret the distribution of complaints by allegation type.

This visualization identifies the average time required to investigate a complaint.

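# One row per complaint, with the years it was received, occurred, and closed
d3 <- unique(dataset[c("UniqueComplaintId", "Received Year", "Incident Year", "Close Year")])
# Investigation length in whole years
d3$Length <- d3$`Close Year` - d3$`Received Year`

# Mean investigation length for each received year
d4 <- aggregate(d3[, 5], list(d3$`Received Year`), mean)
names(d4) <- c("Received Year", "Length")
ggplot(data = d4, aes(x = `Received Year`, y = Length)) +
  labs(title = "Average Timeline of Investigation", y = "Average length (years)") +
  geom_line()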

The line chart shows the average investigation time, in years, for complaints received in each year.
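
The `aggregate()` call above passes `mean` without `na.rm`, so any year that contains a complaint with a missing close year would average to NA. Whether the real data has such rows is an assumption here. The sketch uses invented values and hypothetical column names to show the difference between the data-frame interface and the formula interface, which drops incomplete rows by default.

```r
# Hypothetical toy data; the last complaint has no close year yet.
toy <- data.frame(
  received_year = c(2010, 2010, 2011, 2011, 2011),
  close_year    = c(2010, 2012, 2011, 2012, NA)
)
toy$length <- toy$close_year - toy$received_year

# Data-frame interface: the NA propagates into the 2011 mean.
raw <- aggregate(toy["length"], list(received_year = toy$received_year), mean)

# Formula interface: rows with NA are omitted before averaging.
avg <- aggregate(length ~ received_year, data = toy, FUN = mean)

print(raw)
print(avg)
```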

Visualization to interpret the outcome of incidents

res.freq <- table(dataset$`Encounter Outcome`)
label <- names(res.freq)
label <- paste("",label,"-")
percentage <- round(res.freq/sum(res.freq)*100)
lableWithPercent <- paste(label, percentage)
lableWithPercent <- paste(lableWithPercent, "%")
pie(res.freq, lableWithPercent, col = rainbow(length(label)) ,main = "Outcome of Incidents")

The pie chart represents the outcomes of the registered complaints.

Mode of Communication over the years for filing complaints

d4 <- dataset %>%
  group_by(`Complaint Filed Mode`) %>%
  count(`Received Year`)

ggplot(d4, aes(fill=`Complaint Filed Mode`, y=n, x=`Received Year`)) + 
    geom_bar(position="stack", stat="identity") + labs(title= "Communication Preference over years", x="Year", y="Count")

This visualization shows the most common modes of communication over the years. Interestingly, despite the growth of online channels, most complaints are still registered by phone.
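
Stacked counts make it hard to compare years with different totals. As a minimal sketch, the share of each filing mode within a year can be computed with `prop.table()`. The data frame, its column names, and the mode labels here are invented for illustration and are not taken from the complaint file.

```r
# Hypothetical toy data; years and mode labels are invented.
toy <- data.frame(
  year = c(2010, 2010, 2010, 2011, 2011, 2011, 2011),
  mode = c("Phone", "Phone", "Online", "Phone", "Online", "Online", "Phone")
)

# Cross-tabulate complaints by year (rows) and filing mode (columns).
counts <- table(toy$year, toy$mode)

# margin = 1 makes each row (year) sum to 100 percent.
shares <- round(prop.table(counts, margin = 1) * 100, 1)
print(shares)
```

In ggplot2, a comparable view can be drawn by using `position = "fill"` in place of `position = "stack"`.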

Frequent location of incidents and outcome

d5 <- dataset %>%
  group_by(`Incident Location`, `Encounter Outcome`) %>%
  count(`Incident Year`)

ggplot(d5, aes(fill=`Encounter Outcome`, y=n, x=`Incident Location`)) + 
    geom_bar(position="stack", stat="identity") + labs(title= "Frequent location of incidents and outcome", x="Incident Location", y="Count") + coord_flip()

The stacked bar plot shows the most common incident locations and the outcome that can be expected in most cases at each location.

ANLY 512 - Problem Set 4