ANLY 512 - Problem Set 4

Objectives

The objective of this assignment is to conduct an exploratory data analysis of a data set that you are not familiar with. In this week’s lecture, we discussed a number of visualization approaches for exploring a data set. This assignment will apply those tools and techniques. An important distinction between class examples and applied data science work is the iterative and repetitive nature of exploring a data set. It takes time to understand what the data is and what is interesting about it (patterns).

For this week, we will be exploring data from the NYC Data Transparency Initiative. They maintain a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB), an independent municipal agency. Your objective is to identify interesting patterns and trends within the data that may be indicative of large-scale trends.

This link will allow you to download the data set in .xlsx format. The data file has two tabs: one with metadata, and the “Complaints_Allegations” tab with the actual data.

Deliverable and Grades

For this assignment, you should submit a link to a knitr-rendered HTML document that shows your exploratory data analysis. Organize your analysis using section headings:

rm(list = ls())

library(readxl)

## Warning: package 'readxl' was built under R version 3.6.3

eda_data = read_xlsx("C:\\Users\\Administrator\\Desktop\\ccrb_datatransparencyinitiative.xlsx", sheet = "Complaints_Allegations")

dim(eda_data)

## [1] 204397     16

str(eda_data)

## Classes 'tbl_df', 'tbl' and 'data.frame':    204397 obs. of  16 variables:
##  $ DateStamp                                  : POSIXct, format: "2016-11-29" "2016-11-29" ...
##  $ UniqueComplaintId                          : num  11 18 18 18 18 18 18 18 18 18 ...
##  $ Close Year                                 : num  2006 2006 2006 2006 2006 ...
##  $ Received Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Borough of Occurrence                      : chr  "Manhattan" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ Is Full Investigation                      : logi  FALSE TRUE TRUE TRUE TRUE TRUE ...
##  $ Complaint Has Video Evidence               : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Complaint Filed Mode                       : chr  "On-line website" "Phone" "Phone" "Phone" ...
##  $ Complaint Filed Place                      : chr  "CCRB" "CCRB" "CCRB" "CCRB" ...
##  $ Complaint Contains Stop & Frisk Allegations: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ Incident Location                          : chr  "Street/highway" "Street/highway" "Street/highway" "Street/highway" ...
##  $ Incident Year                              : num  2005 2004 2004 2004 2004 ...
##  $ Encounter Outcome                          : chr  "No Arrest or Summons" "Arrest" "Arrest" "Arrest" ...
##  $ Reason For Initial Contact                 : chr  "Other" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" "PD suspected C/V of violation/crime - street" ...
##  $ Allegation FADO Type                       : chr  "Abuse of Authority" "Abuse of Authority" "Discourtesy" "Discourtesy" ...
##  $ Allegation Description                     : chr  "Threat of arrest" "Refusal to obtain medical treatment" "Word" "Word" ...

names(eda_data)

##  [1] "DateStamp"                                  
##  [2] "UniqueComplaintId"                          
##  [3] "Close Year"                                 
##  [4] "Received Year"                              
##  [5] "Borough of Occurrence"                      
##  [6] "Is Full Investigation"                      
##  [7] "Complaint Has Video Evidence"               
##  [8] "Complaint Filed Mode"                       
##  [9] "Complaint Filed Place"                      
## [10] "Complaint Contains Stop & Frisk Allegations"
## [11] "Incident Location"                          
## [12] "Incident Year"                              
## [13] "Encounter Outcome"                          
## [14] "Reason For Initial Contact"                 
## [15] "Allegation FADO Type"                       
## [16] "Allegation Description"

summary(eda_data)

##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location  Incident Year 
##  Mode :logical                               Length:204397      Min.   :1999  
##  FALSE:119856                                Class :character   1st Qu.:2007  
##  TRUE :84541                                 Mode  :character   Median :2009  
##                                                                 Mean   :2010  
##                                                                 3rd Qu.:2012  
##                                                                 Max.   :2016  
##  Encounter Outcome  Reason For Initial Contact Allegation FADO Type
##  Length:204397      Length:204397              Length:204397       
##  Class :character   Class :character           Class :character    
##  Mode  :character   Mode  :character           Mode  :character    
##                                                                    
##                                                                    
##                                                                    
##  Allegation Description
##  Length:204397         
##  Class :character      
##  Mode  :character      
##                        
##                        
##

Viz:1 - The following bar chart shows the distribution of incident occurrence across boroughs. From the graph we can see that most incidents come from Brooklyn and the fewest from outside NYC.

borough = table(eda_data$`Borough of Occurrence`)
lbls = names(borough)
barplot(borough, 
        xlab = "Borough of Occurrence", 
        ylab = "Number", 
        main = "Borough of Occurrence in CCRB Report", 
        horiz = FALSE,
        legend.text = TRUE,
        cex.axis = 1.0,
        cex.names = 1.0,
        col=rainbow(length(lbls)))
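
Each row of the sheet is an allegation rather than a complaint (UniqueComplaintId repeats in the str() output above), so the bar heights count allegations. The following is a minimal sketch, on made-up toy data with a hypothetical `Borough` column, of how distinct complaints per borough could be counted instead; it is an illustration, not part of the original analysis.

```r
# Toy data (made up for illustration); one row per allegation, as in the sheet.
toy <- data.frame(
  UniqueComplaintId = c(11, 18, 18, 18, 25, 25, 31),
  Borough = c("Manhattan", "Brooklyn", "Brooklyn", "Brooklyn",
              "Bronx", "Bronx", "Manhattan"),
  stringsAsFactors = FALSE
)

# Counting rows counts allegations.
allegations_per_borough <- table(toy$Borough)

# Keep the first row of each complaint ID, then count: this counts complaints.
one_per_complaint <- toy[!duplicated(toy$UniqueComplaintId), ]
complaints_per_borough <- table(one_per_complaint$Borough)

print(allegations_per_borough)
print(complaints_per_borough)
```

On the real data, the same idea would use the `Borough of Occurrence` column, assuming every allegation within a complaint shares one borough.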

Viz:2 - The following graph shows the percentage distribution of the modes in which complaints were filed. The pie chart indicates that most complaints were filed by phone, followed by the Call Processing System.

complaint_mode = table(eda_data$`Complaint Filed Mode`)
lbls <- names(complaint_mode)
lbls

## [1] "Call Processing System" "E-mail"                 "Fax"                   
## [4] "In-person"              "Mail"                   "On-line website"       
## [7] "Phone"

pct <- round(complaint_mode/sum(complaint_mode)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
pie(complaint_mode,labels = lbls, col=rainbow(length(lbls)),main="Complaints Filed Mode in CCRB Report")
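
Rounding to whole percentages can make small categories appear as 0%. As a hedged alternative, the sketch below builds one-decimal labels with prop.table() and sprintf(); the counts are made up for illustration and are not the CCRB figures.

```r
# Made-up counts for illustration only (not the CCRB figures).
counts <- c(Phone = 120, `Call Processing System` = 60, `In-person` = 15, Fax = 5)

# prop.table() converts counts to proportions that sum to 1.
pct <- round(100 * prop.table(counts), 1)

# sprintf() builds one label per category: name, then percentage to one decimal.
lbls <- sprintf("%s %.1f%%", names(counts), pct)
print(lbls)
```

The resulting `lbls` vector can be passed to the `labels` argument of pie() in the same way as above.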

Viz:3 - The following scatter plot shows the relationship between the year a complaint was received and the year it was closed, along with a regression line.

library(ggplot2)
cleanup = theme(panel.grid.major = element_blank(),
                panel.grid.minor = element_blank(),
                panel.background = element_blank(),
                axis.line.x = element_line(color = 'black'),
                axis.line.y = element_line(color = 'black'),
                legend.key = element_rect(fill = 'white'),
                text = element_text(size = 15))

# Refer to columns by name inside aes(); ggplot looks them up in eda_data
scatter = ggplot(eda_data, aes(`Received Year`, `Close Year`))

scatter + geom_point() +
          geom_smooth(method = 'lm', color = 'blue') +
          xlab('Complaints Received Year') + 
          ylab('Complaints Closed Year') + 
          ggtitle('Complaints Receiving and Closing Year in CCRB Report') +
          cleanup
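
Because both axes hold whole years, many points in the scatter plot sit on top of each other. One complementary view, sketched here on made-up toy data with hypothetical column names, is to tabulate the lag in years between receipt and closure.

```r
# Toy data (made up for illustration); column names are hypothetical.
toy <- data.frame(
  received_year = c(2004, 2004, 2005, 2005, 2005, 2006),
  close_year    = c(2006, 2006, 2006, 2006, 2007, 2006)
)

# Lag in whole years between receiving and closing each record.
toy$lag_years <- toy$close_year - toy$received_year

# Frequency of each lag value.
lag_counts <- table(toy$lag_years)
print(lag_counts)
```

On the real data, the same subtraction would use the `Close Year` and `Received Year` columns.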

Viz:4 - The graph below is a stacked bar chart showing the distribution of complaints across boroughs, broken down by complaint mode. We can see that most complaints came from Brooklyn, and that the largest share were filed by phone, followed by the Call Processing System.

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.6.3

# Refer to columns by name inside aes(); ggplot looks them up in eda_data
stack_plot = ggplot(eda_data, aes(`Borough of Occurrence`, fill = `Complaint Filed Mode`)) 
stack_plot + geom_bar() +
        scale_fill_discrete(name = "Complaint Filed Mode") +
        xlab('Borough of Occurrence') +
        theme_solarized()

Viz:5 - From the graph below, we can see that the histogram is left-skewed, with most of the data concentrated at the higher end. The frequency distribution indicates that the most incidents occurred between 2005 and 2010, with very few incidents around 2000 and a steady decrease from 2010 to 2015.

incident_hist = ggplot(eda_data, aes(`Incident Year`))
incident_hist + geom_histogram(binwidth = 1.0, color = "green") + xlab("Incident Year") + 
ylab("Frequency") + ggtitle('Histogram of Incident Year in CCRB Report')
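
The direction of skew read off a histogram can be checked numerically. The sketch below defines a small helper, `sample_skewness` (a hypothetical name, using the moment-based formula), and applies it to made-up year vectors; a negative value indicates a longer left tail and a positive value a longer right tail.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# Hypothetical helper written for this illustration.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m3 / m2^1.5
}

# Made-up year vectors for illustration only.
years_long_left  <- c(2001, 2008, 2009, 2009, 2010, 2010, 2010)
years_long_right <- c(2005, 2005, 2005, 2006, 2006, 2007, 2014)

print(sample_skewness(years_long_left))   # negative: long left tail
print(sample_skewness(years_long_right))  # positive: long right tail
```

Applied to the real data, the input would be the `Incident Year` column of `eda_data`.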

Viz:6 - The visualization below shows the percentage distribution of encounter outcomes and its density plot by borough of occurrence.

library(grid)
library(gridExtra)

## Warning: package 'gridExtra' was built under R version 3.6.3

library(Rmisc)

## Warning: package 'Rmisc' was built under R version 3.6.3

## Loading required package: lattice

## Loading required package: plyr

encounter = table(eda_data$`Encounter Outcome`)
lbls <- names(encounter)
lbls

## [1] "Arrest"               "No Arrest or Summons" "Other/NA"            
## [4] "Summons"

pct <- round(encounter/sum(encounter)*100)
lbls <- paste(lbls, pct) # add percents to labels
lbls <- paste(lbls,"%",sep="") # add % to labels
# pie() draws directly and returns NULL, so its result is not assigned
pie(encounter,labels = lbls, col=rainbow(length(lbls)),main="Encounter Outcome in CCRB Report")

density_plot = ggplot(eda_data, aes(x = `Encounter Outcome`, fill = `Borough of Occurrence`, color = `Borough of Occurrence`)) 
density_plot + geom_density(alpha=0.4) +
               xlab("Encounter Outcome") +
               ggtitle("Encounter Outcome per Borough in CCRB Report") +
               theme_economist()

Viz:7 - The bar chart below shows the distribution of Allegation FADO Type. From the graph it is clear that Abuse of Authority occurs more often than any other type.

allegation = table(eda_data$`Allegation FADO Type`)
lbls = names(allegation)
barplot(allegation, 
        xlab = "Allegation FADO Type", 
        ylab = "Number", 
        main = "Allegation FADO Type in CCRB Report", 
        horiz = FALSE,
        legend.text = TRUE,
        cex.axis = 1.0,
        cex.names = 1.0,
        col=rainbow(length(lbls)))

Viz:8 - The graph below shows the relationship between Incident Year and Complaint Filed Place. We can also identify outliers from this graph.

box_plot = ggplot(eda_data, aes(x = `Complaint Filed Place`, y = `Incident Year`)) + geom_boxplot()
box_plot + xlab("Complaint Filed Place") +
           ylab("Incident Year") +
           coord_flip() +
           theme_wsj()

Viz:9 - In the graphs below we can see the distribution of complaints by the year they were received and by their corresponding closing year.

library(Rmisc)

received_hist = ggplot(eda_data, aes(`Received Year`))
plot1 = received_hist + geom_histogram(binwidth = 1.0, color = "blue") + xlab("Received Year") + 
ylab("Frequency") + ggtitle('Histogram of Received Year') + theme_economist()

closed_hist = ggplot(eda_data, aes(`Close Year`))
plot2 = closed_hist + geom_histogram(binwidth = 1.0, color = "blue") + xlab("Closed Year") + 
ylab("Frequency") + ggtitle('Histogram of Closed Year') +theme_economist()

multiplot(plot1, plot2, cols = 2)

Viz:10 - The graph below shows the relationship between Incident Year and Incident Location, along with outliers.

box_plot = ggplot(eda_data, aes(x = `Incident Location`, y = `Incident Year`)) + geom_boxplot(notch = FALSE, aes(fill = `Incident Location`))
box_plot + xlab("Incident Location") +
           ylab("Incident Year") +
           coord_flip()

We were given data from the NYC Data Transparency Initiative, which maintains a database of complaints that fall within the jurisdiction of the Civilian Complaint Review Board (CCRB). The “Complaints_Allegations” sheet of this Excel file contains data on all CCRB-jurisdiction complaints closed in or after 2006. It has 204,397 observations of 16 variables describing how, when, and what kind of complaints were filed, but we cannot draw conclusions just by looking at the raw data. This is where exploratory data analysis comes in handy. We produced multiple visualizations to understand the data, and from them we can make inferences such as:

1. Most of the incidents occurred in Brooklyn (Borough of Occurrence).
2. Phone is the most common mode of filing complaints.
3. Incident occurrence peaks between 2005 and 2010.
4. Arrest is the most common encounter outcome.

So, with exploratory data analysis, inferences can be drawn that are backed by summary statistics and visualizations. It helps establish a foundation for further, more complex analysis.

ANLY 512 - Problem Set 4

Exploratory Data Analysis

Kaminee Shimpi

2020-06-11
