exploring tha dataset, dimensions & variables

removing NA values

library(ggplot2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
data1<-read_excel("C:/Users/aziz/Documents/Harrisburg/Anly 512/ccrb_datatransparencyinitiative.xlsx")
## Warning in strptime(x, format, tz = tz): unable to identify current timezone 'U':
## please set environment variable 'TZ'
data<- na.omit(data1)
head(data)
## # A tibble: 6 x 16
##    DateStamp UniqueComplaintId `Close Year` `Received Year`
##       <dttm>             <dbl>        <dbl>           <dbl>
## 1 2016-11-29                11         2006            2005
## 2 2016-11-29                18         2006            2004
## 3 2016-11-29                18         2006            2004
## 4 2016-11-29                18         2006            2004
## 5 2016-11-29                18         2006            2004
## 6 2016-11-29                18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## #   Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## #   Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## #   Description` <chr>
summary(data)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 

Attaching packages

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

Including Plots

hist(data$'Received Year', main="Complaints received by year", xlab="Received Year", border="black", breaks = 20, col="grey")

This first graph displays the number of complaints received by year, we can see since 2007, the number of complaints is constant decrease

hist(data$'Close Year', main="Complaints closed by year", xlab="close Year", border="white", breaks = 20, col="light blue")

This graph represents the number of complaints closed by year, we can see that it is not following the same magnitude as the number of complaints received by year.

ggplot(data, aes(x=data$'Incident Year', fill= data$'Complaint Has Video Evidence')) + geom_histogram(stat = "count") + labs (title = "Incident with Evidence/Without Evidence by Year", x="Incident Year", y="Number") + theme (legend.position = "right") + scale_fill_discrete(name = "with Evidence")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph represents the number of incidents by year. It also displays the proportion of those incidents that actually have a video evidence and those that does not. Now that we have this, i want to check the proportion of incident with evidence that were solved.

ggplot(data, aes(x=data$'Close Year', fill= data$'Complaint Has Video Evidence')) + geom_histogram(stat = "count") + labs (title = "Case Closed with evidence Vs without evidence by Year", x="Close Year", y="Number") + theme (legend.position = "right") + scale_fill_discrete(name = "Has Video Evidence")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

I assumed that most solved incidents will have a video evidence, this graph shows that my assumption is wrong.

ggplot(data, aes(x=data$'Received Year', fill= data$'Borough of Occurrence')) + geom_histogram(stat = "count") + labs (title = "Number of Incident by Year by Location", x="Received Year", y="Number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Location")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph displays the frequency of incidents by year and by location.

ggplot(data, aes(x=data$'Allegation FADO Type', fill= data$'Allegation FADO Type')) + geom_histogram(stat = "count") + labs (title = "Type of Incident", x="Type", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Type")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph shows the type of incidents and their frequency overall in all the years represented in the dataset.

ggplot(data, aes(x=data$'Received Year', fill= data$'Encounter Outcome')) + geom_histogram(stat = "count") + labs (title = "Number of Incident by Year by outcome", x="Received Year", y="Number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Outcome")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graphs shows the number of incident by year by outcome.

ggplot(data, aes(x=data$'Complaint Filed Mode', fill= data$'Complaint Filed Mode')) + geom_histogram(stat = "count") + labs (title = "Complaint Filled Mode", x="Mode", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Mode")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph diplays the different means used to fill a complaint.

ggplot(data, aes(x=data$'Incident Location', fill= data$'Incident Location')) + geom_histogram(stat = "count") + labs (title = "Incident Location", x="Location", y="number") + theme (legend.position = "right")+ scale_fill_discrete(name = "Location")
## Warning: Ignoring unknown parameters: binwidth, bins, pad

This graph displays number of incident by Location.

ggplot(data, aes(x=data$'Received Year', y= data$'Close Year')) + geom_point() + geom_smooth(method = lm) + labs (title = "Relationship between Received Year and Closed Year", x="Received Year", y="Close Year") 

This graph examine the relationship between the date they receive a complaint and the date the complaint is closed. From this graph we can deduct that it takes them anywhere from less than a year to sometimes 5 year to close a case.

Summary:

Given the size of that dataset, i find that data visualization was very helpful. In fact, if i had to go through the excel file with more than 200K rows, it would take me a considerable amount of time to explore this dataset. However, using data Visualization i got all kind of usefull insight about the data. Each of the above graphs, explore one dimension of the dataset and give a clear picture about the content of that dataset. It is also an accurate precise description of the content of that dataset. Moreover, i could also check some assumption that i made weither they were right or not. In this case i was wrong.

In case, we needed to run an analysis and construct a model, that first step is primordial in understanding, linking and making relationships out of the data. Using EDA simply save us time and help us understand the content better.