This is a brief Exploratory Data Analysis of Chicago Crime incident reports from 2012 to early 2017.
The goal of this exploration is to answer questions such as:
Follow up questions such as:
To answer our questions, we are going to wrangle and wrestle with our data. We are going to transform and visualise our data so we can check if there are any patterns. The dataset is 0.342 GB and composed of more than 1M observations or reported incidents.
To look for patterns, we are going to make an interactive time series plot using the highcharter package from R.
As usual, we are going to make bar plots and this time, heat maps to visualise the frequency of the crimes. To add flavor to our plots, we are going to use the viridis package for our colors.
To make multiple charts in one plot, I experimented with the new R package patchwork. It worked well, and I’m going to use it as an alternative to grid.arrange or a multiplot function.
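# Packages used throughout this analysis
library(tidyverse)    # readr, dplyr, ggplot2
library(lubridate)    # day(), month(), year(), wday()
library(xts)          # time series objects
library(highcharter)  # interactive charts
library(viridis)      # colour scales
library(patchwork)    # combining ggplots with `+`
library(scales)       # comma labels for axes
crime <- read_csv("~/Desktop/For release datasets/Chicago_Crimes_2012_to_2017.csv/Chicago_Crimes_2012_to_2017.csv")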
Remove missing cases and duplicated rows.
crime <- crime[complete.cases(crime), ]
# Assign the result; without the assignment the de-duplicated data is discarded
crime <- crime %>% distinct(`Case Number`, .keep_all = TRUE)
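As a quick sanity check on the cleaning step, it can help to count how many rows each filter removes. The sketch below uses base R only and a small made-up data frame; the `toy` object and its values are hypothetical and are not taken from the Chicago data.

```r
# Hypothetical toy data: one row has a missing value, one case number repeats
toy <- data.frame(
  case_number = c("A1", "A2", "A2", "A3", "A4"),
  district    = c(1, 2, 2, NA, 5),
  stringsAsFactors = FALSE
)

# Step 1: keep only rows with no missing values
complete <- toy[complete.cases(toy), ]

# Step 2: keep the first row for each case number
deduped <- complete[!duplicated(complete$case_number), ]

# Report how many rows each step dropped
dropped_missing    <- nrow(toy) - nrow(complete)
dropped_duplicates <- nrow(complete) - nrow(deduped)
cat("dropped for missing values:", dropped_missing, "\n")
cat("dropped as duplicates:", dropped_duplicates, "\n")
```

Running the same two counts on the real data would show how much of the dataset each cleaning rule actually affects.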
Temporal Analysis
First, we need to transform the date variable so we can extract its components and check for trends across weekdays, months, and days of the month.
# Drop the partial 2017 records so that only complete years remain
crime <- crime[crime$Year!='2017',]
# Working with Date and Time
#crime$Date <- as.POSIXct(strptime(crime$Date, format = "%m/%d/%Y %H:%M:%S %p"))
# Parse the timestamp once and reuse it for every derived field
parsed <- as.POSIXlt(crime$Date, format="%m/%d/%Y %I:%M:%S %p")
crime$Day <- factor(day(parsed))  # day of the month (1-31)
crime$Month <- factor(month(parsed, label = TRUE))
crime$Year <- factor(year(parsed))
crime$Weekday <- factor(wday(parsed, label = TRUE))
#crime$hour <- hours(crime$Date)
#crime$hour <- as.factor(crime$hour)
#crime$hour <- factor(crime$hour, levels = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","23","24"))
crime$Date <- as.Date(crime$Date, "%m/%d/%Y %I:%M:%S %p")
#crime$day <- weekdays(crime$Date, abbreviate = TRUE)
#crime$month <- months(crime$Date, abbreviate = TRUE)
#crime$day <- factor(crime$day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
#crime$month <- factor(crime$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
sum(is.na(crime))
## [1] 0
length(unique(crime$`Primary Type`))
## [1] 33
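The timestamps in this dataset use a 12-hour clock with an AM/PM marker, and in R's `strptime` conventions `%p` is meant to be combined with `%I` rather than `%H`. The sketch below uses base R and two made-up timestamps to show how the hour of the day and the day of the month come out of the same parsed value. It assumes a locale in which `%p` recognises AM/PM, and `tz = "UTC"` is an assumption made only to keep the example reproducible.

```r
# Two hypothetical timestamps in the same layout as the Date column
stamps <- c("07/04/2015 01:15:00 PM", "07/15/2015 12:05:00 AM")

# %I is the 12-hour clock hour, %p is the AM/PM marker
parsed <- as.POSIXlt(stamps, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

hour_of_day  <- parsed$hour  # 0-23
day_of_month <- parsed$mday  # 1-31

print(data.frame(stamps, hour_of_day, day_of_month))
```

Keeping the two fields under distinct names makes it harder to confuse a day-of-month axis with an hour-of-day axis later on.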
We are going to rename some levels of the Primary Type variable. Several crime types overlap: “CRIM SEXUAL ASSAULT”, “PROSTITUTION” and “SEX OFFENSE”, for example, can all be grouped as SEX offenses.
crime$Crimes <- as.character(crime$`Primary Type`)
crime$Crimes <- ifelse(crime$Crimes %in% c("CRIM SEXUAL ASSAULT", "PROSTITUTION", "SEX OFFENSE"), 'SEX', crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes %in% c("MOTOR VEHICLE THEFT"), "MVT", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes %in% c("GAMBLING", "INTERFERENCE WITH PUBLIC OFFICER", "LIQUOR LAW VIOLATION", "NON-CRIMINAL (SUBJECT SPECIFIED)", "PUBLIC INDECENCY", "STALKING", "NON-CRIMINAL", "NON - CRIMINAL", "PUBLIC PEACE VIOLATION", "CONCEALED CARRY LICENSE VIOLATION", " NON - CRIMINAL "), "NONVIO", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes =="CRIMINAL DAMAGE", "DAMAGE", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes =="CRIMINAL TRESPASS", "TRESPASS", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes %in% c("NARCOTICS", "OTHER NARCOTIC VIOLATION"), "DRUG", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes =="DECEPTIVE PRACTICE", "FRAUD", crime$Crimes)
crime$Crimes <- ifelse(crime$Crimes =="OTHER OFFENSE", "OTHER", crime$Crimes)
crime$Crimes <-ifelse(crime$Crimes %in% c("KIDNAPPING", "WEAPONS VIOLATION", "OFFENSE INVOLVING CHILDREN"), "VIO", crime$Crimes)
#table(crime$Crimes)
We are also going to rename most levels of Location Description, since many of its values again overlap. This is a tedious step because there are many descriptions to change, so feel free to skip ahead.
crime$LocDesc <- as.character(crime$`Location Description`)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("AIRCRAFT" , "AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA", "AIRPORT BUILDING NON-TERMINAL - SECURE AREA", "AIRPORT EXTERIOR -NON-SECURE AREA", "AIRPORT EXTERIOR - SECURE AREA", "AIRPORT PARKING LOT", "AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA", "AIRPORT TERMINAL MEZZANINE - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - SECURE AREA", "AIRPORT TRANSPORTATTION SYSTEM (ATS)","AIRPORT VENDING ESTABLISHMENT", "AIRPORT/AIRCRAFT", "AIRPORT EXTERIOR - NON-SECURE AREA", "AIRPORT TRANSPORTATION SYSTEM (ATS)"), "AVIATION" , crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("ATM (AUTOMATIC TELLER MACHINE)", "BANK", "CURRENCY EXCHANGE","CREDIT UNION", "CURRENCY EXCHANGE", "SAVINGS AND LOAN" ), "BANK", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c('""', "ABANDONED BUILDING", "ALLEY", "ANIMAL HOSPITAL", "APARTMENT", "APPLIANCE STORE", "ATHLETIC CLUB", "BAR OR TAVERN", "BARBER SHOP/BEAUTY SALON", "BARBERSHOP", "BASEMENT", "BOWLING ALLEY", "CHA APARTMENT", "CHA APARTMENT", "CHA PARKING LOT", "CHA HALLWAY/STAIRWELL/ELEVATOR", "CHA PARKING LOT/GROUNDS", "WAREHOUSE"), "BUILDINGS" , crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("CEMETARY", "CHURCH PROPERTY", "CHURCH/SYNAGOGUE/PLACE OF WORSHIP"), "CHURCHES", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c( "PUBLIC HIGH SCHOOL","COLLEGE/UNIVERSITY GROUNDS","COLLEGE/UNIVERSITY RESIDENCE HALL", "SCHOOL YARD", "SCHOOL, PRIVATE, BUILDING", "SCHOOL, PRIVATE, GROUNDS", "SCHOOL, PUBLIC, BUILDING", "SCHOOL, PUBLIC, GROUNDS", "LIBRARY" ), "SCHOOL", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("CTA \"\"L\"\" PLATFORM","CTA \"\"L\"\" TRAIN","CTA BUS", "CTA BUS STOP", "CTA GARAGE / OTHER PROPERTY", "CTA PLATFORM", "CTA STATION", "CTA TRACKS - RIGHT OF WAY", "CTA TRAIN","OTHER RAILROAD PROP / TRAIN DEPOT", "OTHER COMMERCIAL TRANSPORTATION" , "RAILROAD PROPERTY"), "PUBTRANS", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("TAXICAB","TRUCK", "VEHICLE - DELIVERY TRUCK","VEHICLE - OTHER RIDE SERVICE","VEHICLE NON-COMMERCIAL", "VEHICLE-COMMERCIAL","GAS STATION DRIVE/PROP.", "TAXI CAB", "CAR WASH", "AUTO", "DELIVERY TRUCK", "GARAGE/AUTO REPAIR", "PARKING LOT" ), "VEHICLE", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("RESIDENCE", "RESIDENCE PORCH/HALLWAY","RESIDENCE-GARAGE", "RESIDENTIAL YARD (FRONT/BACK)","GARAGE", "HOUSE", "DRIVEWAY - RESIDENTIAL", "YARD" , "DRIVEWAY", "PORCH", "STAIRWELL"), "RESIDENCE", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("FACTORY/MANUFACTURING BUILDING","FEDERAL BUILDING","FIRE STATION", "GOVERNMENT BUILDING","POLICE FACILITY/VEH PARKING LOT", "GOVERNMENT BUILDING/PROPERTY" ,"JAIL / LOCK-UP FACILITY", "PARK PROPERTY"), "GOVBUILD", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("MOVIE HOUSE/THEATER", "SPORTS ARENA/STADIUM", "POOL ROOM" ), "RECREATION", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("NURSING HOME", "NURSING HOME/RETIREMENT HOME","PARKING LOT/GARAGE(NON.RESID.)", "DAY CARE CENTER", "HOSPITAL BUILDING/GROUNDS", "HOSPITAL", "MEDICAL/DENTAL OFFICE" ), "HOSPITAL", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("COMMERCIAL / BUSINESS OFFICE","CLEANING STORE", "CONVENIENCE STORE", "DEPARTMENT STORE", "DRUG STORE", "GROCERY FOOD STORE", "SMALL RETAIL STORE","TAVERN/LIQUOR STORE" ,"PAWN SHOP", "LIQUOR STORE" , "RETAIL STORE"), "STORES", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("BOAT/WATERCRAFT","LAGOON", "LAKEFRONT/WATERFRONT/RIVERBANK", "POOLROOM" ), "WATER", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("HOTEL", "HOTEL/MOTEL","MOTEL" ), "HOTEL", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("EXPRESSWAY EMBANKMENT", "HIGHWAY/EXPRESSWAY", "GAS STATION", "SIDEWALK", "CONSTRUCTION SITE", "BRIDGE"), "HIGHWAY", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("CLEANERS/LAUNDROMAT", "COIN OPERATED MACHINE","LAUNDRY ROOM", "NEWSSTAND" ), "VENDING", crime$LocDesc)
crime$LocDesc <- ifelse(crime$LocDesc %in% c("VACANT LOT", "VACANT LOT/LAND" , "STREET", "DRIVEWAY", "GANGWAY", "VESTIBULE"), "STREET", crime$LocDesc)
#crime$Arrest <- ifelse(as.character(crime$Arrest) == "Y", 1, 0)
crime$Crimes <- as.factor(crime$Crimes)
crime$LocDesc <- as.factor(crime$LocDesc)
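The long chain of `ifelse()` calls above can also be expressed as a lookup table, which keeps every mapping in one place. The following is a minimal base R sketch of that alternative, not the approach used in this post. The `mapping` vector holds only a handful of the groupings, and `recode_location()` and `raw` are hypothetical names introduced for the example.

```r
# Named vector: names are the raw descriptions, values are the grouped labels
mapping <- c(
  "HOTEL" = "HOTEL", "HOTEL/MOTEL" = "HOTEL", "MOTEL" = "HOTEL",
  "BANK" = "BANK", "CREDIT UNION" = "BANK", "SAVINGS AND LOAN" = "BANK"
)

# Replace values found in the mapping; leave everything else unchanged
recode_location <- function(x, mapping) {
  matched <- x %in% names(mapping)
  out <- x
  out[matched] <- unname(mapping[x[matched]])
  out
}

raw <- c("MOTEL", "CREDIT UNION", "STREET", "HOTEL/MOTEL")
recoded <- recode_location(raw, mapping)
print(recoded)
```

With this layout, adding or correcting a grouping means editing one entry in `mapping` rather than hunting through several `ifelse()` lines.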
For our first time series plot, we are going to use the xts package to create our time series object. Creating xts objects and highcharter time series plots are a match made in heaven!
by_Date <- (crime) %>% group_by(Date) %>% summarise(Total = n())
tseries <- xts(by_Date$Total, order.by=as.POSIXct(by_Date$Date))
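## Creating time series of arrests made
# as.logical() works whether Arrest was read as logical or as "True"/"False" text
Arrests_by_Date <- crime %>% filter(as.logical(Arrest)) %>% group_by(Date) %>% summarise(Total = n())
# Index the arrests by their own dates rather than by by_Date$Date
arrests_tseries <- xts(Arrests_by_Date$Total, order.by=as.POSIXct(Arrests_by_Date$Date))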
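When two daily series are plotted together, each one needs its own date index, and days with no matching rows simply drop out of a grouped count. One way to make such gaps explicit is to join the counts onto a complete calendar and fill the missing days with zero. The sketch below uses base R and made-up numbers; `all_days` and `arrests` are hypothetical.

```r
# A complete calendar for the period of interest
all_days <- data.frame(
  Date = seq(as.Date("2016-01-01"), as.Date("2016-01-05"), by = "day")
)

# Hypothetical daily arrest counts with some days missing
arrests <- data.frame(
  Date  = as.Date(c("2016-01-01", "2016-01-03")),
  Total = c(4L, 2L)
)

# Left join onto the calendar, then turn the gaps into explicit zeros
aligned <- merge(all_days, arrests, by = "Date", all.x = TRUE)
aligned$Total[is.na(aligned$Total)] <- 0L

print(aligned)
```

After this step the values and their dates always have the same length, so a mismatch between counts and index cannot occur.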
Plotting Temporal data of our Crime analysis
hchart(tseries, name = "Crimes") %>%
hc_add_series(arrests_tseries, name = "Arrests") %>%
hc_add_theme(hc_theme_538()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Trend of Chicago Crimes and Arrests") %>%
hc_legend(enabled = TRUE)
From the time series plot above, we can see that far more crimes were committed than arrests were made. However, total crimes have been steadily going down each year. Playing around with the interactive plot, we can see that crimes usually peak in July each year. Another observation is that there are fewer crimes in February. A change of heart, perhaps, by the criminals?
Next, we create our arrests time series plot.
hchart(arrests_tseries, name = "Arrests") %>%
hc_add_theme(hc_theme_538()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Trend of Arrests made in Chicago (2012-2016)")
Plotting the arrests separately shows a decreasing trend each year. Playing with the interactive plot, we can see that the peak in arrests falls in a different month each year. For example, in 2012 the most arrests were made in July, while in the following year the peak was in March.
Overall, we can see a trend where the counts start low in January and rise steadily until they peak in a certain month. Afterwards, they drop slowly as the months progress towards the end of the year.
Next, we want to know the top crimes in Chicago from 2012 to 2016.
tot <- crime %>%
group_by(Crimes) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>%
ungroup() %>%
mutate(PercTot = Total/sum(Total),
rd = round(PercTot*100, digits=1)) %>%
head(10)
e1 <-tot %>% ggplot(aes(reorder(Crimes, -rd), rd, fill = Crimes)) + geom_bar(stat = "identity", show.legend = FALSE) + theme_bw() + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))+ geom_text(aes(label=round(PercTot*100,digits=1)), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5)+ labs(title="Top Crimes In Chicago From 2012-2016", x="", y="", subtitle="Percentages of each type of crime") + scale_fill_viridis(discrete = TRUE)
Theft was the most common offense (22.6%). Battery cases came in at 18.1%, followed by damage (10.8%). Drug cases account for 9.3%, followed at a distance by assault cases at 6.3%.
To continue, we want to know the hot locations where most crimes are reported.
loc <- crime %>%
group_by(LocDesc) %>%
summarise(Total = n()) %>%
arrange(desc(Total)) %>%
ungroup() %>%
mutate(PercTot = Total/sum(Total),
rd = round(PercTot*100, digits=1)) %>%
head(10)
e2 <-loc %>% ggplot(aes(reorder(LocDesc, -rd), rd, fill = LocDesc)) + geom_bar(stat="identity", show.legend = FALSE) + theme_bw() + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))+ geom_text(aes(label=round(PercTot*100,digits=1)), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5)+ labs(title="Crime Locations In Chicago From 2012-2016", x="", y="", subtitle="Percentage of crimes at each hot location") + scale_fill_viridis(discrete = TRUE)
e1 + e2
We can see that most crimes were committed on the streets (23%), followed by residences (22%), and then buildings (17%) and highways (12%).
table(crime$Year, crime$LocDesc)
##
## AVIATION BANK BUILDINGS CHURCHES CLUB CTA "L" PLATFORM
## 2012 1304 2529 55842 862 0 0
## 2013 1511 2073 51298 672 1 1
## 2014 1528 1806 45548 557 0 0
## 2015 1365 1687 44817 543 0 0
## 2016 1041 1787 41694 542 0 0
##
## CTA "L" TRAIN ELEVATOR FOREST PRESERVE GOVBUILD HALLWAY HIGHWAY
## 2012 0 0 11 4947 5 44536
## 2013 0 0 16 4602 8 40727
## 2014 0 0 8 4012 4 33569
## 2015 0 0 13 3708 3 30835
## 2016 1 1 6 3296 5 25757
##
## HOSPITAL HOTEL OFFICE OTHER PUBTRANS RECREATION RESIDENCE
## 2012 11757 1293 0 11560 8176 493 73385
## 2013 10454 1272 0 11152 7786 457 65539
## 2014 9180 1213 0 10447 5686 371 57172
## 2015 9476 1249 1 10365 4403 473 55670
## 2016 9943 1187 1 9950 4578 479 56395
##
## RESTAURANT SCHOOL STORES STREET TAVERN VEHICLE VENDING WATER
## 2012 5179 9287 18892 77776 0 6423 27 113
## 2013 4807 9464 17707 68429 1 6144 43 103
## 2014 4836 7685 16519 63665 2 5398 30 93
## 2015 4963 6401 16611 61544 2 5369 20 92
## 2016 5411 5659 17515 60200 2 5135 39 108
a11<-crime %>%
group_by(Weekday) %>%
dplyr::summarize(Total = n()) %>%
ggplot(aes(Weekday, Total, fill = Weekday)) + geom_bar( stat = "identity", show.legend = FALSE) + geom_text(aes(label=Total), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5) + labs(title="What Day Most Crimes Are Committed", x="", y="", subtitle="Crimes occur about equally on all days") + scale_y_continuous(labels = comma) + theme(strip.text = element_text(size=11)) + theme_bw() + scale_fill_viridis(discrete = TRUE)
temp2 <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Weekday), FUN= length)
names(temp2) <- c("crime", "day", "count")
#temp2
a12 <- temp2 %>% filter(crime !="HUMAN TRAFFICKING") %>%
ggplot(aes(x= crime, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By weekdays", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by DAY and CRIMES") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
Using the patchwork package, I simply “+” the two plots together, and the result is good. With only a little code to type, I can now easily put multiple plots in one window.
a11 + a12
For the weekday trend, crime occurs on every day of the week, though we can see a slight peak on Fridays. From Monday the count slowly climbs, reaches its peak on Friday, and goes down on Saturday and Sunday.
In the heat map, we can see that theft and battery are the most frequent crimes throughout the week, with theft peaking on Fridays and battery on Saturdays and Sundays. Drug, damage, motor vehicle theft and fraud cases follow a similar pattern, though damage peaks on Saturday and Sunday while drug cases build from Tuesday to a peak on Friday. Assault, burglary, other and robbery crimes occur less often, and their frequency is almost evenly distributed throughout the week. The same is true of non-violent (NONVIO), VIO, trespass and sex crimes, which were recorded the least.
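The heat map is drawn from a long table of counts with one row per crime and weekday combination. As an alternative to `aggregate()`, base R's `table()` builds the same kind of counts and also keeps combinations that never occur, with a count of zero, so no tile is silently missing. A minimal sketch on made-up data follows; the `toy` data frame is hypothetical.

```r
# Hypothetical incidents: crime type and weekday for four reports
toy <- data.frame(
  Crimes  = c("THEFT", "THEFT", "BATTERY", "THEFT"),
  Weekday = factor(c("Fri", "Fri", "Sat", "Mon"),
                   levels = c("Mon", "Fri", "Sat"))
)

# Cross-tabulate, then reshape to one row per (crime, day) pair
counts <- as.data.frame(
  table(crime = toy$Crimes, day = toy$Weekday),
  responseName = "count"
)

print(counts)
```

The resulting `crime`, `day` and `count` columns have the same shape as the data passed to `geom_tile()` above.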
a13<-crime %>%
group_by(Month) %>%
dplyr::summarize(Total = n()) %>%
ggplot(aes(Month, Total, fill = Month)) + geom_bar( stat = "identity", show.legend = FALSE) + geom_text(aes(label=Total), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.0) + labs(title="What Month Most Crimes Are Committed", y="", x="") + scale_y_continuous(labels = comma) + theme(strip.text = element_text(size=11)) + theme_bw() + scale_fill_viridis(discrete = TRUE)
temp1 <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Month), FUN= length)
names(temp1) <- c("crime", "month", "count")
#temp1
a14 <-temp1 %>% filter(crime!="HUMAN TRAFFICKING", crime!="NON - CRIMINAL") %>%
ggplot(aes(x= crime, y= factor(month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By months", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crimes Frequencies by Month and CRIME") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
a13 + a14
We can see an upward trend starting from May until it peaks in July. Then it gradually goes down until December and spikes again in January. Strangely, fewer crimes are committed in February. Can we say that culprits also have a “change of heart” during the love month of February? Very interesting indeed, although part of the gap simply reflects that February is the shortest month.
We can see that theft is common all year round, rising from May to a peak in July. The same can be said of battery cases. Damage and drug crimes follow a similar pattern across the months, and most crime types reach their highest numbers in July.
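Monthly totals depend on the length of each month, so one way to compare months more fairly is to divide each total by the number of days in that month. The helper below, `incidents_per_day()`, is a hypothetical name introduced for this sketch. It uses base R and is checked on synthetic dates with exactly one incident per day, not on the Chicago data.

```r
# Incidents per calendar day for each year-month present in `dates`
incidents_per_day <- function(dates) {
  ym     <- format(dates, "%Y-%m")
  counts <- table(ym)
  first  <- as.Date(paste0(names(counts), "-01"))
  # Adding 31 days to the 1st always lands in the following month
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))
  days <- as.integer(next_first - first)
  setNames(as.vector(counts) / days, names(counts))
}

# Synthetic input: one incident on every day of 2015
dates <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "day")

raw_counts <- table(format(dates, "%m"))  # February has the smallest total
rates      <- incidents_per_day(dates)    # identical rate in every month

print(raw_counts)
print(rates)
```

On this synthetic input the raw February total is the lowest even though the daily rate never changes, which is the effect to rule out before reading too much into a February dip.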
Heat Maps
fa <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Month), FUN= length)
names(fa) <- c("location", "month", "count")
#temp
fa %>% filter(location!="CLUB", location!="ELEVATOR", location!="HALLWAY", location!="OFFICE", location!="TAVERN", location!="", location!="CTA \"L\" PLATFORM", location!="CTA \"L\" TRAIN") %>%
ggplot(aes(x= location, y= factor(month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by MONTH and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
Looking at our heat map, we can see that STREET and RESIDENCE related crimes occur mostly from May until August, peaking in July. The same is true of the BUILDINGS and HIGHWAY locations.
fr <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Weekday), FUN= length)
names(fr) <- c("location", "day", "count")
#temp2
a18 <-fr %>% filter(location !="CLUB", location!="ELEVATOR", location!="OFFICE", location!="TAVERN", location!="" , location!="CTA \"L\" PLATFORM", location!="CTA \"L\" TRAIN") %>%
ggplot(aes(x= location, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By weekdays", expand = c(0,-2)) +
theme_bw() + ggtitle("Crime Frequencies by each DAY and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11)) + scale_fill_viridis()
arrest <- crime %>%
group_by(Year , Arrest) %>%
summarise(countArr = n())
a19 <-arrest %>%
ggplot(aes(Year, countArr, fill = Arrest)) + geom_bar(stat="identity", position = "dodge") + scale_fill_viridis(discrete = TRUE) + theme_bw() + scale_y_continuous(labels = comma) + geom_text(aes(label=countArr), vjust=-0.25, color="red",position = position_dodge(0.9), size=3.0) + theme(legend.position = "bottom") + labs(title="Total Crime Arrests from 2012-2016", x="",y="")
a18 + a19
Looking at our heat map first, we can see that crimes located at STREET, RESIDENCE, BUILDINGS and HIGHWAY have a higher occurrence on Friday, Saturday and Sunday. It looks like the same trend that we saw in our earlier plots.
For the total crime arrests, “TRUE” indicates cases where arrests were made. We can see that “FALSE” outnumbers “TRUE”, telling us that most of the crimes did not result in an arrest. The trend is the same from 2012 to 2016, with the “no arrest” cases getting higher each year.
And so we ask: for which crimes were arrests actually made? We will answer this later as we continue our exploration.
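Raw arrest counts depend on how many incidents were reported in a year, so the share of incidents that led to an arrest is often easier to compare across years. A minimal base R sketch on made-up data follows. The `toy` data frame and its values are hypothetical, and `as.logical()` is used so that the calculation works whether the arrest flag is stored as logical or as text such as "True".

```r
# Hypothetical incidents with an arrest flag stored as text
toy <- data.frame(
  Year   = c(2012, 2012, 2012, 2012, 2013, 2013),
  Arrest = c("True", "False", "False", "False", "False", "True"),
  stringsAsFactors = FALSE
)

# The mean of a logical vector is the proportion of TRUE values
arrest_rate <- tapply(as.logical(toy$Arrest), toy$Year, mean)

print(arrest_rate)
```

Applied to the real data, the same calculation would show whether the arrest share is rising or falling independently of the overall drop in reported crime.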
fp <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Crimes), FUN= length)
names(fp) <- c("location", "crimes", "count")
#temp2
fp %>% filter(location!="", location!="CLUB", location!="ELEVATOR", location!="HALLWAY", location!="OFFICE", location!="OFFICE", location!="TAVERN", location!="WATER", location!="VENDING", location!="FOREST PRESERVE",crimes!="NON - CRIMINAL", crimes!="OBSCENITY", crimes!="HUMAN TRAFFICKING", crimes!="HOMICIDE", crimes!="INTIMIDATION") %>%
ggplot(aes(x= location, y= factor(crimes))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By crimes", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by CRIME and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
Interestingly, we can see that theft crimes are committed frequently at residences, on streets and in vehicles. Drug, damage and battery crimes happened on streets. Damage, burglary, battery and assault also happened in residences. On highways, drug, battery, robbery, theft and assault frequently took place. Lastly, theft and battery crimes happened frequently in buildings, together with some drug, damage and burglary crimes.
ar <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Day), FUN= length)
# crime$Day holds the day of the month (1-31), so it is labelled as such
names(ar) <- c("location", "day", "count")
#temp2
p1 <-ar %>% filter(location!="", location!="CLUB", location!="ELEVATOR", location!="HALLWAY", location!="OFFICE", location!="TAVERN", location!="WATER", location!="VENDING", location!="FOREST PRESERVE", location!="CTA \"L\" PLATFORM", location!="CTA \"L\" TRAIN") %>%
ggplot(aes(x= location, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By day of month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by Day of Month and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
ai <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Day), FUN= length)
names(ai) <- c("crimes", "day", "count")
#temp2
p2 <- ai %>% filter(crimes!="NON - CRIMINAL", crimes!="HUMAN TRAFFICKING") %>%
ggplot(aes(x= crimes, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By day of month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by Day of Month and CRIMES") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
p1 + p2
Looking at the heat map by day and location, keep in mind that the Day variable holds the day of the month (1 to 31), not the hour of the day. STREET related crimes are frequent on every day of the month, as are RESIDENCE crimes, though both locations peak on the 1st. The same pattern holds for BUILDINGS and HIGHWAY, with BUILDINGS again peaking on the 1st.
Looking at the heat map by day and crime type, we can see that both THEFT and BATTERY occur throughout the month, with the 1st as the peak. For theft, we can also see a second peak on the 15th. The same goes for DAMAGE, DRUG, ASSAULT and OTHER crimes.
art <- crime %>%
group_by(Year, Crimes, Arrest) %>%
summarise(tot = n()) %>%
filter(tot >=1000) %>%
ungroup()
art %>% ggplot(aes(Year, tot, fill = Crimes)) + geom_bar(stat="identity") + facet_wrap(~Arrest) + scale_fill_viridis(discrete = TRUE) + theme_bw() + theme(legend.position = "bottom") + scale_y_continuous(labels = comma) + labs(title="Crime Arrests from 2012-2016", x="", y="")
“FALSE” indicates that no arrest was made, which is the case for most crimes. The majority of these “FALSE” cases were THEFT cases. Unfortunately, many perpetrators involved in BATTERY and DAMAGE cases were not arrested either. Looking at the height of the bars, the “TRUE” panel differs from the “FALSE” panel by almost half.
Surprisingly, for our “TRUE” indicator, we can see that more arrests were made in DRUG cases than in any other category, followed by BATTERY and THEFT cases.
Calling our table function, we can further see the numbers behind our plot above.
table(crime$Crimes)
##
## ARSON ASSAULT BATTERY BURGLARY
## 2175 89508 258941 81668
## DAMAGE DRUG FRAUD HOMICIDE
## 152812 131207 67609 2560
## HUMAN TRAFFICKING INTIMIDATION MVT NON - CRIMINAL
## 20 643 59856 38
## NONVIO OBSCENITY OTHER ROBBERY
## 24293 169 85361 56092
## SEX THEFT TRESPASS VIO
## 18356 321950 36429 28648
Next, I will make a time series plot for our top 4 crimes and check for any patterns throughout each year.
thefts <- crime[crime$Crimes=="THEFT",]
## Creating timeseries
thefts_by_Date <- (thefts) %>% group_by(Date) %>% summarise(Total = n())
thefts_tseries <- xts(thefts_by_Date$Total, order.by=as.POSIXct(thefts_by_Date$Date))
battery <- crime[crime$Crimes=="BATTERY",]
## Creating timeseries
battery_by_Date <- (battery) %>% group_by(Date) %>% summarise(Total = n())
battery_tseries <- xts(battery_by_Date$Total, order.by=as.POSIXct(battery_by_Date$Date))
criminals <- crime[crime$Crimes=="DAMAGE",]
## Creating timeseries
criminals_by_Date <- (criminals) %>% group_by(Date) %>% summarise(Total = n())
criminals_tseries <- xts(criminals_by_Date$Total, order.by=as.POSIXct(criminals_by_Date$Date))
narcotics <- crime[crime$Crimes=="DRUG",]
## Creating timeseries
narcotics_by_Date <- (narcotics) %>% group_by(Date) %>% summarise(Total = n())
narcotics_tseries <- xts(narcotics_by_Date$Total, order.by=as.POSIXct(narcotics_by_Date$Date))
hchart(thefts_tseries, name = "Thefts") %>%
hc_add_series(battery_tseries, name = "Battery") %>%
hc_add_series(criminals_tseries, name = "Damage") %>%
hc_add_series(narcotics_tseries, name = "Drugs") %>%
hc_add_theme(hc_theme_538()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Crimes in Thefts/Battery/Damage/Drugs") %>%
hc_legend(enabled = TRUE)
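The four blocks above repeat the same filter-and-count steps for each crime type, and a small helper can express that pattern once. The function below, `daily_counts()`, is a hypothetical name introduced for this sketch. It uses base R on a made-up data frame, and its output could then be passed to `xts()` in the same way as the data frames above.

```r
# Count rows per day for the rows where `column` equals `value`
daily_counts <- function(data, column, value) {
  keep   <- which(data[[column]] == value)  # which() also drops NA matches
  counts <- table(data$Date[keep])
  data.frame(Date = as.Date(names(counts)), Total = as.vector(counts))
}

# Hypothetical incidents
toy <- data.frame(
  Date   = as.Date(c("2016-01-02", "2016-01-01", "2016-01-02",
                     "2016-01-01", "2016-01-02")),
  Crimes = c("THEFT", "THEFT", "THEFT", "BATTERY", "DRUG"),
  stringsAsFactors = FALSE
)

theft_daily <- daily_counts(toy, "Crimes", "THEFT")
print(theft_daily)
```

The same helper would serve the location series as well, for example `daily_counts(crime, "LocDesc", "STREET")`, assuming the `crime` data frame built earlier.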
Finally, I want to make a time series plot for our top 4 crime locations and check again for some patterns.
streets <- crime[crime$LocDesc=="STREET",]
## Creating timeseries
streets_by_Date <- (streets) %>% group_by(Date) %>% summarise(Total = n())
streets_tseries <- xts(streets_by_Date$Total, order.by=as.POSIXct(streets_by_Date$Date))
residence <- crime[crime$LocDesc=="RESIDENCE",]
## Creating timeseries
residence_by_Date <- na.omit(residence) %>% group_by(Date) %>% summarise(Total = n())
residence_tseries <- xts(residence_by_Date$Total, order.by=as.POSIXct(residence_by_Date$Date))
apartment <- crime[crime$LocDesc=="BUILDINGS",]
## Creating timeseries
apartment_by_Date <- (apartment) %>% group_by(Date) %>% summarise(Total = n())
apartment_tseries <- xts(apartment_by_Date$Total, order.by=as.POSIXct(apartment_by_Date$Date))
highway <- crime[crime$LocDesc=="HIGHWAY",]
## Creating timeseries
highway_by_Date <- (highway) %>% group_by(Date) %>% summarise(Total = n())
highway_tseries <- xts(highway_by_Date$Total, order.by=as.POSIXct(highway_by_Date$Date))
hchart(streets_tseries, name = "Streets") %>%
hc_add_series(residence_tseries, name = "Residence") %>%
hc_add_series(apartment_tseries, name = "Buildings") %>%
hc_add_series(highway_tseries, name = "Highways") %>%
hc_add_theme(hc_theme_538()) %>%
hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
hc_title(text = "Crimes in Streets/Residence/Buildings/Highways") %>%
hc_legend(enabled = TRUE)
To summarise our findings:
Total crimes have been steadily going down each year. Crimes usually peak in July each year, and there are fewer crimes in February.
Overall, we find a seasonality in crime each year, where the counts start low in January and rise steadily until they peak in a certain month. Afterwards, they drop slowly as the months progress towards the end of the year.
Theft was the most common offense (22.6%). Battery cases came in at 18.1%, followed by damage (10.8%). Drug cases account for 9.3%, followed at a distance by assault cases at 6.3%.
Most crimes were committed on the streets (23%), followed by residences (22%), and then buildings (17%) and highways (12%).
We find that theft and battery are the most frequent crimes throughout the week, with theft peaking on Fridays and battery on Saturdays and Sundays. Drug, damage, motor vehicle theft and fraud cases follow a similar pattern, though damage peaks on Saturday and Sunday while drug cases build from Tuesday to a peak on Friday. Assault, burglary, other and robbery crimes occur less often, and their frequency is almost evenly distributed throughout the week. The same is true of non-violent (NONVIO), VIO, trespass and sex crimes, which were recorded the least.
For the total crime arrests, “TRUE” indicates cases where arrests were made. We can see that “FALSE” outnumbers “TRUE”, telling us that most of the crimes did not result in an arrest. The trend is the same from 2012 to 2016, with the “no arrest” cases getting higher each year.
We find that theft crimes are committed frequently at residences, on streets and in vehicles. Drug, damage and battery crimes happened on streets. Damage, burglary, battery and assault also happened in residences. On highways, drug, battery, robbery, theft and assault frequently took place. Lastly, theft and battery crimes happened frequently in buildings, together with some drug, damage and burglary crimes.
“FALSE” indicates that no arrest was made, which is the case for most crimes. The majority of these “FALSE” cases were THEFT cases. Unfortunately, many perpetrators involved in BATTERY and DAMAGE cases were not arrested either. We find a difference of almost half compared with “TRUE”, the cases which resulted in arrests.
We find that more arrests were made in DRUG cases than in any other category, followed by BATTERY and THEFT cases.
We find that STREET related crimes are frequent on every day of the month, as are RESIDENCE crimes, though both locations peak on the 1st (the Day variable used for these heat maps is the day of the month, not the hour of the day). The same pattern holds for BUILDINGS and HIGHWAY, with BUILDINGS again peaking on the 1st. We also find that both THEFT and BATTERY occur throughout the month, with the 1st as the peak, and that theft shows a second peak on the 15th. The same goes for DAMAGE, DRUG, ASSAULT and OTHER crimes.
The dataset can be found here. https://www.kaggle.com/currie32/crimes-in-chicago/data