Preparation

library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
library(ggplot2)
getwd()
## [1] "C:/Users/Jason/Desktop/Harrisburg University/Analytics_512-51-2017/9_Visualization of Categorical Data"
setwd("C:/Users/Jason/Desktop/Harrisburg University/Analytics_512-51-2017")
ccrb <- read_excel("ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
summary(ccrb)
##    DateStamp          UniqueComplaintId   Close Year   Received Year 
##  Min.   :2016-11-29   Min.   :    1     Min.   :2006   Min.   :1999  
##  1st Qu.:2016-11-29   1st Qu.:17356     1st Qu.:2008   1st Qu.:2007  
##  Median :2016-11-29   Median :34794     Median :2010   Median :2009  
##  Mean   :2016-11-29   Mean   :34778     Mean   :2010   Mean   :2010  
##  3rd Qu.:2016-11-29   3rd Qu.:52204     3rd Qu.:2013   3rd Qu.:2012  
##  Max.   :2016-11-29   Max.   :69492     Max.   :2016   Max.   :2016  
##  Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
##  Length:204397         Mode :logical         Mode :logical               
##  Class :character      FALSE:107084          FALSE:195530                
##  Mode  :character      TRUE :97313           TRUE :8867                  
##                                                                          
##                                                                          
##                                                                          
##  Complaint Filed Mode Complaint Filed Place
##  Length:204397        Length:204397        
##  Class :character     Class :character     
##  Mode  :character     Mode  :character     
##                                            
##                                            
##                                            
##  Complaint Contains Stop & Frisk Allegations Incident Location 
##  Mode :logical                               Length:204397     
##  FALSE:119856                                Class :character  
##  TRUE :84541                                 Mode  :character  
##                                                                
##                                                                
##                                                                
##  Incident Year  Encounter Outcome  Reason For Initial Contact
##  Min.   :1999   Length:204397      Length:204397             
##  1st Qu.:2007   Class :character   Class :character          
##  Median :2009   Mode  :character   Mode  :character          
##  Mean   :2010                                                
##  3rd Qu.:2012                                                
##  Max.   :2016                                                
##  Allegation FADO Type Allegation Description
##  Length:204397        Length:204397         
##  Class :character     Class :character      
##  Mode  :character     Mode  :character      
##                                             
##                                             
## 
head(ccrb)
## # A tibble: 6 x 16
##    DateStamp UniqueComplaintId `Close Year` `Received Year`
##       <dttm>             <dbl>        <dbl>           <dbl>
## 1 2016-11-29                11         2006            2005
## 2 2016-11-29                18         2006            2004
## 3 2016-11-29                18         2006            2004
## 4 2016-11-29                18         2006            2004
## 5 2016-11-29                18         2006            2004
## 6 2016-11-29                18         2006            2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## #   Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## #   Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## #   Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## #   Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## #   Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## #   Description` <chr>

I want to know the amount of complaints happened in each area of NYC.

attach(ccrb)
sapply(ccrb,class)
## $DateStamp
## [1] "POSIXct" "POSIXt" 
## 
## $UniqueComplaintId
## [1] "numeric"
## 
## $`Close Year`
## [1] "numeric"
## 
## $`Received Year`
## [1] "numeric"
## 
## $`Borough of Occurrence`
## [1] "character"
## 
## $`Is Full Investigation`
## [1] "logical"
## 
## $`Complaint Has Video Evidence`
## [1] "logical"
## 
## $`Complaint Filed Mode`
## [1] "character"
## 
## $`Complaint Filed Place`
## [1] "character"
## 
## $`Complaint Contains Stop & Frisk Allegations`
## [1] "logical"
## 
## $`Incident Location`
## [1] "character"
## 
## $`Incident Year`
## [1] "numeric"
## 
## $`Encounter Outcome`
## [1] "character"
## 
## $`Reason For Initial Contact`
## [1] "character"
## 
## $`Allegation FADO Type`
## [1] "character"
## 
## $`Allegation Description`
## [1] "character"

Below gives bar chart to show each area has the amount of complaints occuence

ggplot(ccrb,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() 

Brooklyn has the most complaints occuence.Only the bar chart, it not clean enough to show how each area are listed, I want to try some pie chart

ggplot(ccrb,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() + coord_polar(theta="y") + scale_x_discrete()

Even though it’s a pie chart, it is not clear enough to show the percentage of each area. It will mislead me think Queens has the largest amount of complaints since it is the large cycle.

I want to creat a pie chart with the percentage.

x <- prop.table(table(ccrb$`Borough of Occurrence`))
x
## 
##         Bronx      Brooklyn     Manhattan            NA   Outside NYC 
##  0.2418920043  0.3533075339  0.2059912817  0.0023630484  0.0008317148 
##        Queens Staten Island 
##  0.1510932157  0.0445212014
pie(x,labels=round(x*100))

This pie chart is too simple that no legend can indicated which area it is and what does the number means

x <- prop.table(table(ccrb$`Borough of Occurrence`))
x
## 
##         Bronx      Brooklyn     Manhattan            NA   Outside NYC 
##  0.2418920043  0.3533075339  0.2059912817  0.0023630484  0.0008317148 
##        Queens Staten Island 
##  0.1510932157  0.0445212014
pie(x,labels=round(x*100/sum(x),1), col = rainbow(length(x)), main = "Borough of Occurrence")
legend("bottomleft",c("Bronx", "Brooklyn", "Manhattan","NA","Outside NYC","Queens","Staten Island"),fill = rainbow(length(x)),cex = 0.4)

Through using a pie chart, it can help me understand the percentage of each area forthe borough of Occurrence.Brooklyn has the largest amount of complaints

It’s also important to see the percentage of Allegation FADO Type in each Borough of Occurrence

ggplot(ccrb,aes(`Borough of Occurrence`, fill = `Allegation FADO Type`)) + 
  geom_bar() + 
  scale_fill_discrete(name = "Allegation FADO Type")

The Abuse of Authority is most FADO type of complaints around all the area

In a boxplot, the box contains most of complaint, and we can also see if there is any outliner

qplot(`Received Year`, `Close Year`, data = ccrb, geom = "boxplot", fill = `Received Year`)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

boxplot(`Incident Year`, data = ccrb)

The previous boxplots only explain all the complaints as a whole group. I would like to see more details in each Allegation FADO Type

qplot(ccrb$`Allegation FADO Type`, ccrb$`Incident Year`,data=ccrb, geom = "boxplot")+ 
  coord_flip()+
  theme(panel.background = element_rect(fill = "white"))

I want to see the relation betwwen Borough of Occurrenceand ccrb$Allegation FADO Type

table(ccrb$`Borough of Occurrence`,ccrb$`Allegation FADO Type`)
##                
##                 Abuse of Authority Discourtesy Force    NA
##   Bronx                      25102        8277 14763     1
##   Brooklyn                   36934       11433 21806     1
##   Manhattan                  19150        7829 13705     1
##   NA                           187         108   164     0
##   Outside NYC                  114          24    28     0
##   Queens                     15914        5235  8771     0
##   Staten Island               4772        1546  2524     0
##                
##                 Offensive Language
##   Bronx                       1299
##   Brooklyn                    2041
##   Manhattan                   1419
##   NA                            24
##   Outside NYC                    4
##   Queens                       963
##   Staten Island                258
data1 <- as.data.frame(table(ccrb$`Borough of Occurrence`,ccrb$`Allegation FADO Type`))
str(data1)
## 'data.frame':    35 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 5 levels "Abuse of Authority",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  25102 36934 19150 187 114 15914 4772 8277 11433 7829 ...
ggplot(data1, aes(x = Var1, y = Freq)) +
 geom_line((aes(group = 1)))+
 xlab("Total complaints") +
 ylab("Borough of Occurrence")

Brooklyn has the largest amount of complaints. The aes argument just groups all of our data into one line, since we want one line in our plot.

I want to see the relation betwwen Borough of Occurrence` in each year

table(ccrb$`Borough of Occurrence`,ccrb$`Incident Year`)
##                
##                 1999 2000 2002 2003 2004 2005 2006 2007 2008 2009 2010
##   Bronx            0    0   11   11  208 2736 5190 6076 5584 5452 4731
##   Brooklyn         6    0    0    9  367 4868 8925 8887 7749 7470 6300
##   Manhattan        0    0    4    0  258 2600 5153 5100 4645 4605 3355
##   NA               0    1    0    0    0   12   42   54   48   37   41
##   Outside NYC      0    0    0    0    0    8   12    7   12   24   10
##   Queens          13    0    1    7  173 1917 3605 3431 3009 2652 2565
##   Staten Island    0    0    0    0   25  383  904  837  930  993  758
##                
##                 2011 2012 2013 2014 2015 2016
##   Bronx         4450 3735 3211 3004 3020 2023
##   Brooklyn      5820 5546 5108 4482 4184 2494
##   Manhattan     3307 3104 3011 2560 2572 1830
##   NA              43   43   28   44   41   49
##   Outside NYC      2   12   23   40   14    6
##   Queens        2403 2458 2642 2405 2250 1352
##   Staten Island  651  808  824 1041  546  400
data2 <- as.data.frame(table(ccrb$`Borough of Occurrence`,ccrb$`Incident Year`))
str(data2)
## 'data.frame':    119 obs. of  3 variables:
##  $ Var1: Factor w/ 7 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 7 1 2 3 ...
##  $ Var2: Factor w/ 17 levels "1999","2000",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ Freq: int  0 6 0 0 0 13 0 0 0 0 ...

Convert the hour to numerical variable

data2$Incident_Year <- as.numeric(as.character(data2$Var2))

ggplot(data2, aes(x = Incident_Year, y = Freq))+ 
 geom_line(aes(group = data2$Var1))

As a result, we can see the complaints happened in each year group by Borough of Occurrence

ggplot(data2, aes(x = Incident_Year, y = Freq))+ 
  geom_line(aes(group = data2$Var1,color = data2$Var1), size = 2)

Now, the line chart can indicated the complaints happened in each year by Borough of Occurrence clearly.For all the area, 2005 is the peak year of complaints.Brooklyn has the largest amount of complaints

I also would to see the relation betwwen Allegation FADO Type in each year

table(ccrb$`Allegation FADO Type`,ccrb$`Incident Year`)
##                     
##                       1999  2000  2002  2003  2004  2005  2006  2007  2008
##   Abuse of Authority     4     0     3     7   455  6237 12055 12639 10932
##   Discourtesy            2     0     0     1   135  1924  3731  3754  3722
##   Force                 13     1    13    19   413  4053  7405  7341  6695
##   NA                     0     0     0     0     0     0     0     0     0
##   Offensive Language     0     0     0     0    28   310   640   658   628
##                     
##                       2009  2010  2011  2012  2013  2014  2015  2016
##   Abuse of Authority 10457  8655  8074  7708  7110  6597  6631  4609
##   Discourtesy         3794  3297  3207  2876  2557  2323  1962  1167
##   Force               6384  5298  4865  4617  4714  4172  3626  2132
##   NA                     0     0     0     0     0     1     2     0
##   Offensive Language   598   510   530   505   466   483   406   246
data3 <- as.data.frame(table(ccrb$`Allegation FADO Type`,ccrb$`Incident Year`))
str(data3)
## 'data.frame':    85 obs. of  3 variables:
##  $ Var1: Factor w/ 5 levels "Abuse of Authority",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ Var2: Factor w/ 17 levels "1999","2000",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ Freq: int  4 2 13 0 0 0 0 1 0 0 ...
data3$Incident_Year <- as.numeric(as.character(data3$Var2))

ggplot(data3, aes(x = Incident_Year, y = Freq))+ 
  geom_line(aes(group = data3$Var1))

As a result, we can see the complaints happened in each year group by Allegation FADO Type

ggplot(data3, aes(x = Incident_Year, y = Freq))+ 
  geom_line(aes(group = data3$Var1,color = data3$Var1), size = 2)

Now, the line chart can indicated the complaints happened in each year by Allegation FADO Type clearly. For alll the area, 2005 is the peak year of complaints. Abuse of Authority has the largest amount of complaints

I want to creat a heater map.

table(ccrb$`Received Year`,ccrb$`Close Year`)
##       
##         2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016
##   1999     0     0    13     0     0     0     0     0     0     0     0
##   2000     6     0     0     0     0     0     0     0     0     0     0
##   2002    15     0     0     0     0     0     0     0     0     0     0
##   2003    22     3     0     0     0     0     0     0     0     0     0
##   2004   858     6     0     0     0     0     0     0     0     0     0
##   2005 10839  1085     2     8     1     0     0     0     0     0     0
##   2006 10376 12189  1242    48     9     0     0     0     0     0     0
##   2007     0 11500 10674  2194    11     5     0     0     0     0     0
##   2008     0     0  9499 11527  1143    18     4     0     0     0     0
##   2009     0     0     0 10948  9862   550     5     0     0     0     0
##   2010     0     0     0     0  9043  8331   414    16     0     0    13
##   2011     0     0     0     0     0  8286  6256  1891    10     4     7
##   2012     0     0     0     0     0     0  4742 10079   848    11     3
##   2013     0     0     0     0     0     0     0  7060  7315   412    12
##   2014     0     0     0     0     0     0     0     0  6961  6635    62
##   2015     0     0     0     0     0     0     0     0     0  8828  3852
##   2016     0     0     0     0     0     0     0     0     0     0  8644
data4 <- as.data.frame(table(ccrb$`Received Year`,ccrb$`Close Year`))
str(data4)
## 'data.frame':    187 obs. of  3 variables:
##  $ Var1: Factor w/ 17 levels "1999","2000",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Var2: Factor w/ 11 levels "2006","2007",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Freq: int  0 6 15 22 858 10839 10376 0 0 0 ...
data4
##     Var1 Var2  Freq
## 1   1999 2006     0
## 2   2000 2006     6
## 3   2002 2006    15
## 4   2003 2006    22
## 5   2004 2006   858
## 6   2005 2006 10839
## 7   2006 2006 10376
## 8   2007 2006     0
## 9   2008 2006     0
## 10  2009 2006     0
## 11  2010 2006     0
## 12  2011 2006     0
## 13  2012 2006     0
## 14  2013 2006     0
## 15  2014 2006     0
## 16  2015 2006     0
## 17  2016 2006     0
## 18  1999 2007     0
## 19  2000 2007     0
## 20  2002 2007     0
## 21  2003 2007     3
## 22  2004 2007     6
## 23  2005 2007  1085
## 24  2006 2007 12189
## 25  2007 2007 11500
## 26  2008 2007     0
## 27  2009 2007     0
## 28  2010 2007     0
## 29  2011 2007     0
## 30  2012 2007     0
## 31  2013 2007     0
## 32  2014 2007     0
## 33  2015 2007     0
## 34  2016 2007     0
## 35  1999 2008    13
## 36  2000 2008     0
## 37  2002 2008     0
## 38  2003 2008     0
## 39  2004 2008     0
## 40  2005 2008     2
## 41  2006 2008  1242
## 42  2007 2008 10674
## 43  2008 2008  9499
## 44  2009 2008     0
## 45  2010 2008     0
## 46  2011 2008     0
## 47  2012 2008     0
## 48  2013 2008     0
## 49  2014 2008     0
## 50  2015 2008     0
## 51  2016 2008     0
## 52  1999 2009     0
## 53  2000 2009     0
## 54  2002 2009     0
## 55  2003 2009     0
## 56  2004 2009     0
## 57  2005 2009     8
## 58  2006 2009    48
## 59  2007 2009  2194
## 60  2008 2009 11527
## 61  2009 2009 10948
## 62  2010 2009     0
## 63  2011 2009     0
## 64  2012 2009     0
## 65  2013 2009     0
## 66  2014 2009     0
## 67  2015 2009     0
## 68  2016 2009     0
## 69  1999 2010     0
## 70  2000 2010     0
## 71  2002 2010     0
## 72  2003 2010     0
## 73  2004 2010     0
## 74  2005 2010     1
## 75  2006 2010     9
## 76  2007 2010    11
## 77  2008 2010  1143
## 78  2009 2010  9862
## 79  2010 2010  9043
## 80  2011 2010     0
## 81  2012 2010     0
## 82  2013 2010     0
## 83  2014 2010     0
## 84  2015 2010     0
## 85  2016 2010     0
## 86  1999 2011     0
## 87  2000 2011     0
## 88  2002 2011     0
## 89  2003 2011     0
## 90  2004 2011     0
## 91  2005 2011     0
## 92  2006 2011     0
## 93  2007 2011     5
## 94  2008 2011    18
## 95  2009 2011   550
## 96  2010 2011  8331
## 97  2011 2011  8286
## 98  2012 2011     0
## 99  2013 2011     0
## 100 2014 2011     0
## 101 2015 2011     0
## 102 2016 2011     0
## 103 1999 2012     0
## 104 2000 2012     0
## 105 2002 2012     0
## 106 2003 2012     0
## 107 2004 2012     0
## 108 2005 2012     0
## 109 2006 2012     0
## 110 2007 2012     0
## 111 2008 2012     4
## 112 2009 2012     5
## 113 2010 2012   414
## 114 2011 2012  6256
## 115 2012 2012  4742
## 116 2013 2012     0
## 117 2014 2012     0
## 118 2015 2012     0
## 119 2016 2012     0
## 120 1999 2013     0
## 121 2000 2013     0
## 122 2002 2013     0
## 123 2003 2013     0
## 124 2004 2013     0
## 125 2005 2013     0
## 126 2006 2013     0
## 127 2007 2013     0
## 128 2008 2013     0
## 129 2009 2013     0
## 130 2010 2013    16
## 131 2011 2013  1891
## 132 2012 2013 10079
## 133 2013 2013  7060
## 134 2014 2013     0
## 135 2015 2013     0
## 136 2016 2013     0
## 137 1999 2014     0
## 138 2000 2014     0
## 139 2002 2014     0
## 140 2003 2014     0
## 141 2004 2014     0
## 142 2005 2014     0
## 143 2006 2014     0
## 144 2007 2014     0
## 145 2008 2014     0
## 146 2009 2014     0
## 147 2010 2014     0
## 148 2011 2014    10
## 149 2012 2014   848
## 150 2013 2014  7315
## 151 2014 2014  6961
## 152 2015 2014     0
## 153 2016 2014     0
## 154 1999 2015     0
## 155 2000 2015     0
## 156 2002 2015     0
## 157 2003 2015     0
## 158 2004 2015     0
## 159 2005 2015     0
## 160 2006 2015     0
## 161 2007 2015     0
## 162 2008 2015     0
## 163 2009 2015     0
## 164 2010 2015     0
## 165 2011 2015     4
## 166 2012 2015    11
## 167 2013 2015   412
## 168 2014 2015  6635
## 169 2015 2015  8828
## 170 2016 2015     0
## 171 1999 2016     0
## 172 2000 2016     0
## 173 2002 2016     0
## 174 2003 2016     0
## 175 2004 2016     0
## 176 2005 2016     0
## 177 2006 2016     0
## 178 2007 2016     0
## 179 2008 2016     0
## 180 2009 2016     0
## 181 2010 2016    13
## 182 2011 2016     7
## 183 2012 2016     3
## 184 2013 2016    12
## 185 2014 2016    62
## 186 2015 2016  3852
## 187 2016 2016  8644
ggplot(data4, aes(x = data4$Var1, y = data4$Var2, fill = data4$Freq)) + 
  geom_tile()+
  labs(x = "Received Year", y = "Close Year", title = "# of complaints in NYC  by year")

The heatmap is an intuitive result. In 2016 we have largest amount of complaints which has already been proved by previous chart.

Exploratory data analysis is an approach for summarizing and visualizing the important characteristics of the data set. Exploratory data analysis focuses on exploring data to understand the data’s underlying structure and variables, to develop intuition about the data set, to consider how that data set came into existence, and to decide how it can be investigated with more formal statistical methods.

It is a extremely clear method to explore a data set with multiple exploratory techniques, especially when the dataset can be done together for comparison. Once I deeply learn the data set, I may need to revisit one or more data munging tasks in order to refine or transform the data even further. The goal of exploratory data analysis is to obtain confidence in the data to a point where to engage a machine learning algorithm.