exploring tha dataset, dimensions & variables
removing NA values
library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
data1<-read_excel("C:/Users/aziz/Documents/Harrisburg/Anly 512/ccrb_datatransparencyinitiative.xlsx")
## Warning in strptime(x, format, tz = tz): unable to identify current timezone 'U':
## please set environment variable 'TZ'
data<- na.omit(data1)
head(data)
## # A tibble: 6 x 16
## DateStamp UniqueComplaintId `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 11 2006 2005
## 2 2016-11-29 18 2006 2004
## 3 2016-11-29 18 2006 2004
## 4 2016-11-29 18 2006 2004
## 5 2016-11-29 18 2006 2004
## 6 2016-11-29 18 2006 2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## # Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## # Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## # Description` <chr>
summary(data)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Attaching packages
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 204397 obs. of 16 variables:
## $ DateStamp : POSIXct, format: "2016-11-29" "2016-11-29" ...
## $ UniqueComplaintId : num 11 18 18 18 18 18 18 18 18 18 ...
## $ Close Year : num 2006 2006 2006 2006 2006 ...
## $ Received Year : num 2005 2004 2004 2004 2004 ...
## $ Borough of Occurrence : chr "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
## $ Is Full Investigation : logi FALSE TRUE TRUE TRUE TRUE TRUE ...
## $ Complaint Has Video Evidence : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Complaint Filed Mode : chr "On-line website" "Phone" "Phone" "Phone" ...
## $ Complaint Filed Place : chr "CCRB" "CCRB" "CCRB" "CCRB" ...
## $ Complaint Contains Stop & Frisk Allegations: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ Incident Location : chr "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
## $ Incident Year : num 2005 2004 2004 2004 2004 ...
## $ Encounter Outcome : chr "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
## $ Reason For Initial Contact : chr "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
## $ Allegation FADO Type : chr "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
## $ Allegation Description : chr "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...
Including Plots
hist(data$'Received Year', main="Complaints received by year", xlab="Received Year", border="black", breaks = 20, col="grey")

This first graph displays the number of complaints received by year, we can see since 2007, the number of complaints is constant decrease
hist(data$'Close Year', main="Complaints closed by year", xlab="close Year", border="white", breaks = 20, col="light blue")

This graph represents the number of complaints closed by year, we can see that it is not following the same magnitude as the number of complaints received by year.
ggplot(data, aes(x=data$'Incident Year', fill= data$'Complaint Has Video Evidence')) + geom_histogram(stat = "count") + labs (title = "Incident with Evidence/Without Evidence by Year", x="Incident Year", y="Number") + theme (legend.position = "right") + scale_fill_discrete(name = "with Evidence")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph represents the number of incidents by year. It also displays the proportion of those incidents that actually have a video evidence and those that does not. Now that we have this, i want to check the proportion of incident with evidence that were solved.
ggplot(data, aes(x=data$'Close Year', fill= data$'Complaint Has Video Evidence')) + geom_histogram(stat = "count") + labs (title = "Case Closed with evidence Vs without evidence by Year", x="Close Year", y="Number") + theme (legend.position = "right") + scale_fill_discrete(name = "Has Video Evidence")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

I assumed that most solved incidents will have a video evidence, this graph shows that my assumption is wrong.
ggplot(data, aes(x=data$'Received Year', fill= data$'Borough of Occurrence')) + geom_histogram(stat = "count") + labs (title = "Number of Incident by Year by Location", x="Received Year", y="Number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Location")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph displays the frequency of incidents by year and by location.
ggplot(data, aes(x=data$'Allegation FADO Type', fill= data$'Allegation FADO Type')) + geom_histogram(stat = "count") + labs (title = "Type of Incident", x="Type", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph shows the type of incidents and their frequency overall in all the years represented in the dataset.
ggplot(data, aes(x=data$'Received Year', fill= data$'Encounter Outcome')) + geom_histogram(stat = "count") + labs (title = "Number of Incident by Year by outcome", x="Received Year", y="Number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Outcome")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graphs shows the number of incident by year by outcome.
ggplot(data, aes(x=data$'Complaint Filed Mode', fill= data$'Complaint Filed Mode')) + geom_histogram(stat = "count") + labs (title = "Complaint Filled Mode", x="Mode", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Mode")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph diplays the different means used to fill a complaint.
ggplot(data, aes(x=data$'Incident Location', fill= data$'Incident Location')) + geom_histogram(stat = "count") + labs (title = "Incident Location", x="Location", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Location")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph displays number of incident by Location.
ggplot(data, aes(x=data$'Received Year', y= data$'Close Year')) + geom_point() + geom_smooth(method = lm) + labs (title = "Relationship between Received Year and Closed Year", x="Received Year", y="Close Year")

This graph examine the relationship between the date they receive a complaint and the date the complaint is closed. From this graph we can deduct that it takes them anywhere from less than a year to sometimes 5 year to close a case.
Summary:
Given the size of that dataset, i find that data visualization was very helpful. In fact, if i had to go through the excel file with more than 200K rows, it would take me a considerable amount of time to explore this dataset. However, using data Visualization i got all kind of usefull insight about the data. Each of the above graphs, explore one dimension of the dataset and give a clear picture about the content of that dataset. It is also an accurate precise description of the content of that dataset. Moreover, i could also check some assumption that i made weither they were right or not. In this case i was wrong.
In case, we needed to run an analysis and construct a model, that first step is primordial in understanding, linking and making relationships out of the data. Using EDA simply save us time and help us understand the content better.