Preparation
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.2
library(ggplot2)
getwd()
## [1] "C:/Users/Jason/Desktop/Harrisburg University/Analytics_512-51-2017/9_Visualization of Categorical Data"
setwd("C:/Users/Jason/Desktop/Harrisburg University/Analytics_512-51-2017")
ccrb <- read_excel("ccrb_datatransparencyinitiative.xlsx",sheet = "Complaints_Allegations")
summary(ccrb)
## DateStamp UniqueComplaintId Close Year Received Year
## Min. :2016-11-29 Min. : 1 Min. :2006 Min. :1999
## 1st Qu.:2016-11-29 1st Qu.:17356 1st Qu.:2008 1st Qu.:2007
## Median :2016-11-29 Median :34794 Median :2010 Median :2009
## Mean :2016-11-29 Mean :34778 Mean :2010 Mean :2010
## 3rd Qu.:2016-11-29 3rd Qu.:52204 3rd Qu.:2013 3rd Qu.:2012
## Max. :2016-11-29 Max. :69492 Max. :2016 Max. :2016
## Borough of Occurrence Is Full Investigation Complaint Has Video Evidence
## Length:204397 Mode :logical Mode :logical
## Class :character FALSE:107084 FALSE:195530
## Mode :character TRUE :97313 TRUE :8867
##
##
##
## Complaint Filed Mode Complaint Filed Place
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
## Complaint Contains Stop & Frisk Allegations Incident Location
## Mode :logical Length:204397
## FALSE:119856 Class :character
## TRUE :84541 Mode :character
##
##
##
## Incident Year Encounter Outcome Reason For Initial Contact
## Min. :1999 Length:204397 Length:204397
## 1st Qu.:2007 Class :character Class :character
## Median :2009 Mode :character Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2016
## Allegation FADO Type Allegation Description
## Length:204397 Length:204397
## Class :character Class :character
## Mode :character Mode :character
##
##
##
head(ccrb)
## # A tibble: 6 x 16
## DateStamp UniqueComplaintId `Close Year` `Received Year`
## <dttm> <dbl> <dbl> <dbl>
## 1 2016-11-29 11 2006 2005
## 2 2016-11-29 18 2006 2004
## 3 2016-11-29 18 2006 2004
## 4 2016-11-29 18 2006 2004
## 5 2016-11-29 18 2006 2004
## 6 2016-11-29 18 2006 2004
## # ... with 12 more variables: `Borough of Occurrence` <chr>, `Is Full
## # Investigation` <lgl>, `Complaint Has Video Evidence` <lgl>, `Complaint
## # Filed Mode` <chr>, `Complaint Filed Place` <chr>, `Complaint Contains
## # Stop & Frisk Allegations` <lgl>, `Incident Location` <chr>, `Incident
## # Year` <dbl>, `Encounter Outcome` <chr>, `Reason For Initial
## # Contact` <chr>, `Allegation FADO Type` <chr>, `Allegation
## # Description` <chr>
I want to know the amount of complaints happened in each area of NYC.
attach(ccrb)
sapply(ccrb,class)
## $DateStamp
## [1] "POSIXct" "POSIXt"
##
## $UniqueComplaintId
## [1] "numeric"
##
## $`Close Year`
## [1] "numeric"
##
## $`Received Year`
## [1] "numeric"
##
## $`Borough of Occurrence`
## [1] "character"
##
## $`Is Full Investigation`
## [1] "logical"
##
## $`Complaint Has Video Evidence`
## [1] "logical"
##
## $`Complaint Filed Mode`
## [1] "character"
##
## $`Complaint Filed Place`
## [1] "character"
##
## $`Complaint Contains Stop & Frisk Allegations`
## [1] "logical"
##
## $`Incident Location`
## [1] "character"
##
## $`Incident Year`
## [1] "numeric"
##
## $`Encounter Outcome`
## [1] "character"
##
## $`Reason For Initial Contact`
## [1] "character"
##
## $`Allegation FADO Type`
## [1] "character"
##
## $`Allegation Description`
## [1] "character"
Below gives bar chart to show each area has the amount of complaints occuence
ggplot(ccrb,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar()
Brooklyn has the most complaints occuence.Only the bar chart, it not clean enough to show how each area are listed, I want to try some pie chart
ggplot(ccrb,aes(x=factor(`Borough of Occurrence`), fill = `Borough of Occurrence`)) + geom_bar() + coord_polar(theta="y") + scale_x_discrete()
Even though it’s a pie chart, it is not clear enough to show the percentage of each area. It will mislead me think Queens has the largest amount of complaints since it is the large cycle.
I want to creat a pie chart with the percentage.
x <- prop.table(table(ccrb$`Borough of Occurrence`))
x
##
## Bronx Brooklyn Manhattan NA Outside NYC
## 0.2418920043 0.3533075339 0.2059912817 0.0023630484 0.0008317148
## Queens Staten Island
## 0.1510932157 0.0445212014
pie(x,labels=round(x*100))
This pie chart is too simple that no legend can indicated which area it is and what does the number means
x <- prop.table(table(ccrb$`Borough of Occurrence`))
x
##
## Bronx Brooklyn Manhattan NA Outside NYC
## 0.2418920043 0.3533075339 0.2059912817 0.0023630484 0.0008317148
## Queens Staten Island
## 0.1510932157 0.0445212014
pie(x,labels=round(x*100/sum(x),1), col = rainbow(length(x)), main = "Borough of Occurrence")
legend("bottomleft",c("Bronx", "Brooklyn", "Manhattan","NA","Outside NYC","Queens","Staten Island"),fill = rainbow(length(x)),cex = 0.4)
Through using a pie chart, it can help me understand the percentage of each area forthe borough of Occurrence.Brooklyn has the largest amount of complaints
It’s also important to see the percentage of Allegation FADO Type in each Borough of Occurrence
ggplot(ccrb,aes(`Borough of Occurrence`, fill = `Allegation FADO Type`)) +
geom_bar() +
scale_fill_discrete(name = "Allegation FADO Type")
The Abuse of Authority is most FADO type of complaints around all the area
In a boxplot, the box contains most of complaint, and we can also see if there is any outliner
qplot(`Received Year`, `Close Year`, data = ccrb, geom = "boxplot", fill = `Received Year`)
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
boxplot(`Incident Year`, data = ccrb)
The previous boxplots only explain all the complaints as a whole group. I would like to see more details in each Allegation FADO Type
qplot(ccrb$`Allegation FADO Type`, ccrb$`Incident Year`,data=ccrb, geom = "boxplot")+
coord_flip()+
theme(panel.background = element_rect(fill = "white"))
I want to see the relation betwwen Borough of Occurrenceand ccrb$
Allegation FADO Type
table(ccrb$`Borough of Occurrence`,ccrb$`Allegation FADO Type`)
##
## Abuse of Authority Discourtesy Force NA
## Bronx 25102 8277 14763 1
## Brooklyn 36934 11433 21806 1
## Manhattan 19150 7829 13705 1
## NA 187 108 164 0
## Outside NYC 114 24 28 0
## Queens 15914 5235 8771 0
## Staten Island 4772 1546 2524 0
##
## Offensive Language
## Bronx 1299
## Brooklyn 2041
## Manhattan 1419
## NA 24
## Outside NYC 4
## Queens 963
## Staten Island 258
data1 <- as.data.frame(table(ccrb$`Borough of Occurrence`,ccrb$`Allegation FADO Type`))
str(data1)
## 'data.frame': 35 obs. of 3 variables:
## $ Var1: Factor w/ 7 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Var2: Factor w/ 5 levels "Abuse of Authority",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Freq: int 25102 36934 19150 187 114 15914 4772 8277 11433 7829 ...
ggplot(data1, aes(x = Var1, y = Freq)) +
geom_line((aes(group = 1)))+
xlab("Total complaints") +
ylab("Borough of Occurrence")
Brooklyn has the largest amount of complaints. The aes argument just groups all of our data into one line, since we want one line in our plot.
I want to see the relation betwwen Borough of Occurrence` in each year
table(ccrb$`Borough of Occurrence`,ccrb$`Incident Year`)
##
## 1999 2000 2002 2003 2004 2005 2006 2007 2008 2009 2010
## Bronx 0 0 11 11 208 2736 5190 6076 5584 5452 4731
## Brooklyn 6 0 0 9 367 4868 8925 8887 7749 7470 6300
## Manhattan 0 0 4 0 258 2600 5153 5100 4645 4605 3355
## NA 0 1 0 0 0 12 42 54 48 37 41
## Outside NYC 0 0 0 0 0 8 12 7 12 24 10
## Queens 13 0 1 7 173 1917 3605 3431 3009 2652 2565
## Staten Island 0 0 0 0 25 383 904 837 930 993 758
##
## 2011 2012 2013 2014 2015 2016
## Bronx 4450 3735 3211 3004 3020 2023
## Brooklyn 5820 5546 5108 4482 4184 2494
## Manhattan 3307 3104 3011 2560 2572 1830
## NA 43 43 28 44 41 49
## Outside NYC 2 12 23 40 14 6
## Queens 2403 2458 2642 2405 2250 1352
## Staten Island 651 808 824 1041 546 400
data2 <- as.data.frame(table(ccrb$`Borough of Occurrence`,ccrb$`Incident Year`))
str(data2)
## 'data.frame': 119 obs. of 3 variables:
## $ Var1: Factor w/ 7 levels "Bronx","Brooklyn",..: 1 2 3 4 5 6 7 1 2 3 ...
## $ Var2: Factor w/ 17 levels "1999","2000",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ Freq: int 0 6 0 0 0 13 0 0 0 0 ...
Convert the hour to numerical variable
data2$Incident_Year <- as.numeric(as.character(data2$Var2))
ggplot(data2, aes(x = Incident_Year, y = Freq))+
geom_line(aes(group = data2$Var1))
As a result, we can see the complaints happened in each year group by Borough of Occurrence
ggplot(data2, aes(x = Incident_Year, y = Freq))+
geom_line(aes(group = data2$Var1,color = data2$Var1), size = 2)
Now, the line chart can indicated the complaints happened in each year by Borough of Occurrence clearly.For all the area, 2005 is the peak year of complaints.Brooklyn has the largest amount of complaints
I also would to see the relation betwwen Allegation FADO Type in each year
table(ccrb$`Allegation FADO Type`,ccrb$`Incident Year`)
##
## 1999 2000 2002 2003 2004 2005 2006 2007 2008
## Abuse of Authority 4 0 3 7 455 6237 12055 12639 10932
## Discourtesy 2 0 0 1 135 1924 3731 3754 3722
## Force 13 1 13 19 413 4053 7405 7341 6695
## NA 0 0 0 0 0 0 0 0 0
## Offensive Language 0 0 0 0 28 310 640 658 628
##
## 2009 2010 2011 2012 2013 2014 2015 2016
## Abuse of Authority 10457 8655 8074 7708 7110 6597 6631 4609
## Discourtesy 3794 3297 3207 2876 2557 2323 1962 1167
## Force 6384 5298 4865 4617 4714 4172 3626 2132
## NA 0 0 0 0 0 1 2 0
## Offensive Language 598 510 530 505 466 483 406 246
data3 <- as.data.frame(table(ccrb$`Allegation FADO Type`,ccrb$`Incident Year`))
str(data3)
## 'data.frame': 85 obs. of 3 variables:
## $ Var1: Factor w/ 5 levels "Abuse of Authority",..: 1 2 3 4 5 1 2 3 4 5 ...
## $ Var2: Factor w/ 17 levels "1999","2000",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ Freq: int 4 2 13 0 0 0 0 1 0 0 ...
data3$Incident_Year <- as.numeric(as.character(data3$Var2))
ggplot(data3, aes(x = Incident_Year, y = Freq))+
geom_line(aes(group = data3$Var1))
As a result, we can see the complaints happened in each year group by Allegation FADO Type
ggplot(data3, aes(x = Incident_Year, y = Freq))+
geom_line(aes(group = data3$Var1,color = data3$Var1), size = 2)
Now, the line chart can indicated the complaints happened in each year by Allegation FADO Type clearly. For alll the area, 2005 is the peak year of complaints. Abuse of Authority has the largest amount of complaints
I want to creat a heater map.
table(ccrb$`Received Year`,ccrb$`Close Year`)
##
## 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 1999 0 0 13 0 0 0 0 0 0 0 0
## 2000 6 0 0 0 0 0 0 0 0 0 0
## 2002 15 0 0 0 0 0 0 0 0 0 0
## 2003 22 3 0 0 0 0 0 0 0 0 0
## 2004 858 6 0 0 0 0 0 0 0 0 0
## 2005 10839 1085 2 8 1 0 0 0 0 0 0
## 2006 10376 12189 1242 48 9 0 0 0 0 0 0
## 2007 0 11500 10674 2194 11 5 0 0 0 0 0
## 2008 0 0 9499 11527 1143 18 4 0 0 0 0
## 2009 0 0 0 10948 9862 550 5 0 0 0 0
## 2010 0 0 0 0 9043 8331 414 16 0 0 13
## 2011 0 0 0 0 0 8286 6256 1891 10 4 7
## 2012 0 0 0 0 0 0 4742 10079 848 11 3
## 2013 0 0 0 0 0 0 0 7060 7315 412 12
## 2014 0 0 0 0 0 0 0 0 6961 6635 62
## 2015 0 0 0 0 0 0 0 0 0 8828 3852
## 2016 0 0 0 0 0 0 0 0 0 0 8644
data4 <- as.data.frame(table(ccrb$`Received Year`,ccrb$`Close Year`))
str(data4)
## 'data.frame': 187 obs. of 3 variables:
## $ Var1: Factor w/ 17 levels "1999","2000",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Var2: Factor w/ 11 levels "2006","2007",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Freq: int 0 6 15 22 858 10839 10376 0 0 0 ...
data4
## Var1 Var2 Freq
## 1 1999 2006 0
## 2 2000 2006 6
## 3 2002 2006 15
## 4 2003 2006 22
## 5 2004 2006 858
## 6 2005 2006 10839
## 7 2006 2006 10376
## 8 2007 2006 0
## 9 2008 2006 0
## 10 2009 2006 0
## 11 2010 2006 0
## 12 2011 2006 0
## 13 2012 2006 0
## 14 2013 2006 0
## 15 2014 2006 0
## 16 2015 2006 0
## 17 2016 2006 0
## 18 1999 2007 0
## 19 2000 2007 0
## 20 2002 2007 0
## 21 2003 2007 3
## 22 2004 2007 6
## 23 2005 2007 1085
## 24 2006 2007 12189
## 25 2007 2007 11500
## 26 2008 2007 0
## 27 2009 2007 0
## 28 2010 2007 0
## 29 2011 2007 0
## 30 2012 2007 0
## 31 2013 2007 0
## 32 2014 2007 0
## 33 2015 2007 0
## 34 2016 2007 0
## 35 1999 2008 13
## 36 2000 2008 0
## 37 2002 2008 0
## 38 2003 2008 0
## 39 2004 2008 0
## 40 2005 2008 2
## 41 2006 2008 1242
## 42 2007 2008 10674
## 43 2008 2008 9499
## 44 2009 2008 0
## 45 2010 2008 0
## 46 2011 2008 0
## 47 2012 2008 0
## 48 2013 2008 0
## 49 2014 2008 0
## 50 2015 2008 0
## 51 2016 2008 0
## 52 1999 2009 0
## 53 2000 2009 0
## 54 2002 2009 0
## 55 2003 2009 0
## 56 2004 2009 0
## 57 2005 2009 8
## 58 2006 2009 48
## 59 2007 2009 2194
## 60 2008 2009 11527
## 61 2009 2009 10948
## 62 2010 2009 0
## 63 2011 2009 0
## 64 2012 2009 0
## 65 2013 2009 0
## 66 2014 2009 0
## 67 2015 2009 0
## 68 2016 2009 0
## 69 1999 2010 0
## 70 2000 2010 0
## 71 2002 2010 0
## 72 2003 2010 0
## 73 2004 2010 0
## 74 2005 2010 1
## 75 2006 2010 9
## 76 2007 2010 11
## 77 2008 2010 1143
## 78 2009 2010 9862
## 79 2010 2010 9043
## 80 2011 2010 0
## 81 2012 2010 0
## 82 2013 2010 0
## 83 2014 2010 0
## 84 2015 2010 0
## 85 2016 2010 0
## 86 1999 2011 0
## 87 2000 2011 0
## 88 2002 2011 0
## 89 2003 2011 0
## 90 2004 2011 0
## 91 2005 2011 0
## 92 2006 2011 0
## 93 2007 2011 5
## 94 2008 2011 18
## 95 2009 2011 550
## 96 2010 2011 8331
## 97 2011 2011 8286
## 98 2012 2011 0
## 99 2013 2011 0
## 100 2014 2011 0
## 101 2015 2011 0
## 102 2016 2011 0
## 103 1999 2012 0
## 104 2000 2012 0
## 105 2002 2012 0
## 106 2003 2012 0
## 107 2004 2012 0
## 108 2005 2012 0
## 109 2006 2012 0
## 110 2007 2012 0
## 111 2008 2012 4
## 112 2009 2012 5
## 113 2010 2012 414
## 114 2011 2012 6256
## 115 2012 2012 4742
## 116 2013 2012 0
## 117 2014 2012 0
## 118 2015 2012 0
## 119 2016 2012 0
## 120 1999 2013 0
## 121 2000 2013 0
## 122 2002 2013 0
## 123 2003 2013 0
## 124 2004 2013 0
## 125 2005 2013 0
## 126 2006 2013 0
## 127 2007 2013 0
## 128 2008 2013 0
## 129 2009 2013 0
## 130 2010 2013 16
## 131 2011 2013 1891
## 132 2012 2013 10079
## 133 2013 2013 7060
## 134 2014 2013 0
## 135 2015 2013 0
## 136 2016 2013 0
## 137 1999 2014 0
## 138 2000 2014 0
## 139 2002 2014 0
## 140 2003 2014 0
## 141 2004 2014 0
## 142 2005 2014 0
## 143 2006 2014 0
## 144 2007 2014 0
## 145 2008 2014 0
## 146 2009 2014 0
## 147 2010 2014 0
## 148 2011 2014 10
## 149 2012 2014 848
## 150 2013 2014 7315
## 151 2014 2014 6961
## 152 2015 2014 0
## 153 2016 2014 0
## 154 1999 2015 0
## 155 2000 2015 0
## 156 2002 2015 0
## 157 2003 2015 0
## 158 2004 2015 0
## 159 2005 2015 0
## 160 2006 2015 0
## 161 2007 2015 0
## 162 2008 2015 0
## 163 2009 2015 0
## 164 2010 2015 0
## 165 2011 2015 4
## 166 2012 2015 11
## 167 2013 2015 412
## 168 2014 2015 6635
## 169 2015 2015 8828
## 170 2016 2015 0
## 171 1999 2016 0
## 172 2000 2016 0
## 173 2002 2016 0
## 174 2003 2016 0
## 175 2004 2016 0
## 176 2005 2016 0
## 177 2006 2016 0
## 178 2007 2016 0
## 179 2008 2016 0
## 180 2009 2016 0
## 181 2010 2016 13
## 182 2011 2016 7
## 183 2012 2016 3
## 184 2013 2016 12
## 185 2014 2016 62
## 186 2015 2016 3852
## 187 2016 2016 8644
ggplot(data4, aes(x = data4$Var1, y = data4$Var2, fill = data4$Freq)) +
geom_tile()+
labs(x = "Received Year", y = "Close Year", title = "# of complaints in NYC by year")
The heatmap is an intuitive result. In 2016 we have largest amount of complaints which has already been proved by previous chart.
Exploratory data analysis is an approach for summarizing and visualizing the important characteristics of the data set. Exploratory data analysis focuses on exploring data to understand the data’s underlying structure and variables, to develop intuition about the data set, to consider how that data set came into existence, and to decide how it can be investigated with more formal statistical methods.
It is a extremely clear method to explore a data set with multiple exploratory techniques, especially when the dataset can be done together for comparison. Once I deeply learn the data set, I may need to revisit one or more data munging tasks in order to refine or transform the data even further. The goal of exploratory data analysis is to obtain confidence in the data to a point where to engage a machine learning algorithm.