Introduction

This is a brief Exploratory Data Analysis of Chicago Crime incident reports from 2012 to early 2017.

Our goal for this exploration is to answer questions such as:

  1. Are total crimes going up or down each year?
  2. Is there a pattern across the months in which crimes are more likely?
  3. On which days of the week do most crimes occur?
  4. What are the “hot” locations where we see a high number of recorded crime reports?
  5. Generally, what are the most common crimes in Chicago from 2012 to 2016?

And follow-up questions such as:

  1. For the reported crimes, were arrests made or not?
  2. Have more arrests been made over the years?
  3. If so, for what kinds of crimes were more arrests made?

To answer our questions, we are going to wrangle and wrestle with our data. We are going to transform and visualise our data so we can check if there are any patterns. The dataset is 0.342 GB and composed of more than 1M observations or reported incidents.

To look for patterns, we are going to make an interactive time series plot using the highcharter package from R.

As usual, we are going to make bar plots and this time, heat maps to visualise the frequency of the crimes. To add flavor to our plots, we are going to use the viridis package for our colors.

To make multiple charts in one plot, I experimented with the recently released R package patchwork. It worked well, and I’m going to use it as an alternative to the grid.arrange or multiplot functions.
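The post does not show its setup chunk. Below is a minimal sketch of the packages the later code appears to rely on, inferred from the functions it calls. The list is an assumption on my part, not the original setup.

```r
# Assumed setup, inferred from the functions used in this post
library(readr)       # read_csv()
library(dplyr)       # %>%, group_by(), summarise(), distinct()
library(lubridate)   # day(), month(), year(), wday()
library(ggplot2)     # bar plots and heat maps
library(viridis)     # scale_fill_viridis()
library(scales)      # comma labels on the axes
library(xts)         # time series objects
library(highcharter) # hchart() interactive plots
library(patchwork)   # combining plots with `+`
```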

crime <- read_csv("~/Desktop/For release datasets/Chicago_Crimes_2012_to_2017.csv/Chicago_Crimes_2012_to_2017.csv")

Remove missing cases and duplicated rows.

crime <- crime[complete.cases(crime), ]

# Assign the result back, otherwise the de-duplicated data is only printed
crime <- crime %>% distinct(`Case Number`, .keep_all = TRUE)
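Keep in mind that distinct() returns a new data frame rather than modifying its input, so its result has to be assigned for the de-duplication to take effect. Here is a small sketch on made-up data (the `toy` data frame is hypothetical), showing both cleaning steps:

```r
library(dplyr)

# Toy data (hypothetical): one duplicated case number and one row with a missing value
toy <- data.frame(
  `Case Number` = c("A1", "A1", "B2", "C3"),
  Arrest = c(TRUE, TRUE, FALSE, NA),
  check.names = FALSE
)

toy_clean <- toy[complete.cases(toy), ]          # drops the C3 row (NA in Arrest)
toy_clean <- toy_clean %>%
  distinct(`Case Number`, .keep_all = TRUE)      # keeps only the first A1 row

nrow(toy_clean)
```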

Temporal Analysis

First, we transform the date variable so we can extract its parts and check for trends by weekday, by month and within a 24-hour time span.

# Drop the partial 2017 records so that every year is a full year
crime <- crime[crime$Year!='2017',]

# Working with Date and Time
#crime$Date <- as.POSIXct(strptime(crime$Date, format = "%m/%d/%Y %H:%M:%S %p"))

crime$Day <- factor(day(as.POSIXlt(crime$Date, format="%m/%d/%Y %I:%M:%S %p")))
crime$Month <- factor(month(as.POSIXlt(crime$Date, format="%m/%d/%Y %I:%M:%S %p"), label = TRUE))
crime$Year <- factor(year(as.POSIXlt(crime$Date, format="%m/%d/%Y %I:%M:%S %p")))
crime$Weekday <- factor(wday(as.POSIXlt(crime$Date, format="%m/%d/%Y %I:%M:%S %p"), label = TRUE))

#crime$hour <- hours(crime$Date)
#crime$hour <- as.factor(crime$hour)
#crime$hour <- factor(crime$hour, levels = c("1","2","3","4","5","6","7","8","9","10","11","12","13","14","15","16","17","18","19","20","21","23","24"))

crime$Date <- as.Date(crime$Date, "%m/%d/%Y %I:%M:%S %p")

#crime$day <- weekdays(crime$Date, abbreviate = TRUE)
#crime$month <- months(crime$Date, abbreviate = TRUE)
#crime$day <- factor(crime$day, levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
#crime$month <- factor(crime$month, levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
sum(is.na(crime))
## [1] 0
length(unique(crime$`Primary Type`))
## [1] 33
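The code above parses the same text column four times. As a sketch of an alternative, the timestamp can be parsed once and every part derived from the parsed value. The `raw` vector below is made up, and the example assumes an English locale so that the AM/PM marker is recognised.

```r
library(lubridate)

# Hypothetical timestamps in the same "%m/%d/%Y %I:%M:%S %p" layout as the Date column
raw <- c("07/04/2016 11:30:00 PM", "02/14/2013 01:05:00 AM")

# Parse once, then derive every part from the parsed value
ts <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

parts <- data.frame(
  Day     = day(ts),                    # day of the month, 1-31
  Month   = month(ts, label = TRUE),
  Year    = year(ts),
  Weekday = wday(ts, label = TRUE)
)
parts
```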

We are going to change some level names in our Primary Type variable. Several crime types overlap, such as “CRIM SEXUAL ASSAULT”, “PROSTITUTION” and “SEX OFFENSE”, which are all sex offenses, so we group them under shorter labels.

crime$Crimes <- as.character(crime$`Primary Type`)

crime$Crimes <- ifelse(crime$Crimes %in% c("CRIM SEXUAL ASSAULT", "PROSTITUTION", "SEX OFFENSE"), 'SEX', crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes %in% c("MOTOR VEHICLE THEFT"), "MVT", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes %in% c("GAMBLING", "INTERFERENCE WITH PUBLIC OFFICER", "LIQUOR LAW VIOLATION", "NON-CRIMINAL (SUBJECT SPECIFIED)", "PUBLIC INDECENCY", "STALKING", "NON-CRIMINAL", "NON -  CRIMINAL", "PUBLIC PEACE VIOLATION", "CONCEALED CARRY LICENSE VIOLATION", " NON - CRIMINAL "), "NONVIO", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes =="CRIMINAL DAMAGE", "DAMAGE", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes =="CRIMINAL TRESPASS", "TRESPASS", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes %in% c("NARCOTICS", "OTHER NARCOTIC VIOLATION", "OTHER NARCOTIC VIOLATION"), "DRUG", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes =="DECEPTIVE PRACTICE", "FRAUD", crime$Crimes)

crime$Crimes <- ifelse(crime$Crimes %in% c("OTHER OFFENSE", "OTHER OFFENSE"), "OTHER", crime$Crimes)

crime$Crimes <-ifelse(crime$Crimes %in% c("KIDNAPPING", "WEAPONS VIOLATION", "OFFENSE INVOLVING CHILDREN"), "VIO", crime$Crimes)

#table(crime$Crimes)
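The chain of ifelse() calls above can also be written as a single lookup table. This is only a sketch: `recode_map` repeats just a few of the mappings, and `recode_crimes` is a hypothetical helper name, not something from the original code.

```r
# Lookup table: names are raw Primary Type values, values are the shorter labels
# (only a few of the mappings from the post are repeated here)
recode_map <- c(
  "CRIM SEXUAL ASSAULT" = "SEX",
  "PROSTITUTION"        = "SEX",
  "SEX OFFENSE"         = "SEX",
  "MOTOR VEHICLE THEFT" = "MVT",
  "NARCOTICS"           = "DRUG",
  "DECEPTIVE PRACTICE"  = "FRAUD"
)

recode_crimes <- function(x, map) {
  x <- trimws(as.character(x))      # guard against stray spaces in the raw labels
  hit <- x %in% names(map)
  x[hit] <- unname(map[x[hit]])     # replace mapped values, leave the rest untouched
  x
}

recode_crimes(c("PROSTITUTION", "THEFT", " NARCOTICS "), recode_map)
```

Keeping the mapping in one place makes it easier to spot duplicated or misspelled labels.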

We are going to change most of the level names in our Location Description variable, since many of the values are again near-duplicates of each other. This is a tedious step because there are many descriptions to change, so feel free to skip this part.

crime$LocDesc <- as.character(crime$`Location Description`)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("AIRCRAFT" , "AIRPORT BUILDING NON-TERMINAL - NON-SECURE AREA", "AIRPORT BUILDING NON-TERMINAL - SECURE AREA", "AIRPORT EXTERIOR -NON-SECURE AREA", "AIRPORT EXTERIOR - SECURE AREA", "AIRPORT PARKING LOT", "AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL LOWER LEVEL - SECURE AREA", "AIRPORT TERMINAL MEZZANINE - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - NON-SECURE AREA", "AIRPORT TERMINAL UPPER LEVEL - SECURE AREA", "AIRPORT TRANSPORTATTION SYSTEM (ATS)","AIRPORT VENDING ESTABLISHMENT", "AIRPORT/AIRCRAFT", "AIRPORT EXTERIOR - NON-SECURE AREA", "AIRPORT TRANSPORTATION SYSTEM (ATS)"), "AVIATION" , crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("ATM (AUTOMATIC TELLER MACHINE)", "BANK", "CURRENCY EXCHANGE","CREDIT UNION", "CURRENCY EXCHANGE", "SAVINGS AND LOAN" ), "BANK", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c('""', "ABANDONED BUILDING", "ALLEY", "ANIMAL HOSPITAL", "APARTMENT", "APPLIANCE STORE", "ATHLETIC CLUB", "BAR OR TAVERN", "BARBER SHOP/BEAUTY SALON", "BARBERSHOP", "BASEMENT", "BOWLING ALLEY", "CHA APARTMENT", "CHA APARTMENT", "CHA PARKING LOT", "CHA HALLWAY/STAIRWELL/ELEVATOR", "CHA PARKING LOT/GROUNDS", "WAREHOUSE"), "BUILDINGS" , crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("CEMETARY", "CHURCH PROPERTY", "CHURCH/SYNAGOGUE/PLACE OF WORSHIP"), "CHURCHES", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c( "PUBLIC HIGH SCHOOL","COLLEGE/UNIVERSITY GROUNDS","COLLEGE/UNIVERSITY RESIDENCE HALL", "SCHOOL YARD", "SCHOOL, PRIVATE, BUILDING", "SCHOOL, PRIVATE, GROUNDS", "SCHOOL, PUBLIC, BUILDING", "SCHOOL, PUBLIC, GROUNDS", "LIBRARY" ), "SCHOOL", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("CTA \"\"L\"\" PLATFORM","CTA \"\"L\"\" TRAIN","CTA BUS", "CTA BUS STOP", "CTA GARAGE / OTHER PROPERTY", "CTA PLATFORM", "CTA STATION", "CTA TRACKS - RIGHT OF WAY", "CTA TRAIN","OTHER RAILROAD PROP / TRAIN DEPOT", "OTHER COMMERCIAL TRANSPORTATION" , "RAILROAD PROPERTY"), "PUBTRANS", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("TAXICAB","TRUCK", "VEHICLE - DELIVERY TRUCK","VEHICLE - OTHER RIDE SERVICE","VEHICLE NON-COMMERCIAL", "VEHICLE-COMMERCIAL","GAS STATION DRIVE/PROP.", "TAXI CAB", "CAR WASH", "AUTO", "DELIVERY TRUCK", "GARAGE/AUTO REPAIR", "PARKING LOT" ), "VEHICLE", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("RESIDENCE", "RESIDENCE PORCH/HALLWAY","RESIDENCE-GARAGE", "RESIDENTIAL YARD (FRONT/BACK)","GARAGE", "HOUSE", "DRIVEWAY - RESIDENTIAL", "YARD" , "DRIVEWAY", "PORCH", "STAIRWELL"), "RESIDENCE", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("FACTORY/MANUFACTURING BUILDING","FEDERAL BUILDING","FIRE STATION", "GOVERNMENT BUILDING","POLICE FACILITY/VEH PARKING LOT", "GOVERNMENT BUILDING/PROPERTY" ,"JAIL / LOCK-UP FACILITY", "PARK PROPERTY"), "GOVBUILD", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("MOVIE HOUSE/THEATER", "SPORTS ARENA/STADIUM", "POOL ROOM" ), "RECREATION", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("NURSING HOME", "NURSING HOME/RETIREMENT HOME","PARKING LOT/GARAGE(NON.RESID.)", "DAY CARE CENTER", "HOSPITAL BUILDING/GROUNDS", "HOSPITAL", "MEDICAL/DENTAL OFFICE" ), "HOSPITAL", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("COMMERCIAL / BUSINESS OFFICE","CLEANING STORE", "CONVENIENCE STORE", "DEPARTMENT STORE", "DRUG STORE", "GROCERY FOOD STORE",  "SMALL RETAIL STORE","TAVERN/LIQUOR STORE" ,"PAWN SHOP", "LIQUOR STORE" , "RETAIL STORE"), "STORES", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("BOAT/WATERCRAFT","LAGOON", "LAKEFRONT/WATERFRONT/RIVERBANK", "POOLROOM" ), "WATER", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("HOTEL", "HOTEL/MOTEL","MOTEL"  ), "HOTEL", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("EXPRESSWAY EMBANKMENT", "HIGHWAY/EXPRESSWAY", "GAS STATION", "SIDEWALK", "CONSTRUCTION SITE", "BRIDGE"), "HIGHWAY", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("CLEANERS/LAUNDROMAT", "COIN OPERATED MACHINE","LAUNDRY ROOM", "NEWSSTAND" ), "VENDING", crime$LocDesc)

crime$LocDesc <- ifelse(crime$LocDesc %in% c("VACANT LOT", "VACANT LOT/LAND" , "STREET", "DRIVEWAY", "GANGWAY", "VESTIBULE"), "STREET", crime$LocDesc)
#crime$Arrest <- ifelse(as.character(crime$Arrest) == "Y", 1, 0)
crime$Crimes <- as.factor(crime$Crimes)
crime$LocDesc <- as.factor(crime$LocDesc)
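For groups whose raw values share a common prefix, matching on the prefix is a shorter alternative to listing every variant. The sketch below assumes that every value starting with "AIRPORT" or "CTA " belongs to the corresponding group; `group_by_prefix` is a hypothetical helper, and this is a different technique from the explicit lists used above.

```r
# Hypothetical helper: group by prefix instead of listing every airport/CTA variant.
# Assumption: every value starting with "AIRPORT" or "CTA " belongs to that group.
group_by_prefix <- function(x) {
  x <- trimws(as.character(x))
  x[startsWith(x, "AIRPORT") | x == "AIRCRAFT"] <- "AVIATION"
  x[startsWith(x, "CTA ")] <- "PUBTRANS"
  x
}

group_by_prefix(c("AIRPORT PARKING LOT", "CTA BUS STOP", "STREET", "AIRCRAFT"))
```

A prefix match would also pick up quoting variants of the CTA "L" labels, so it is worth checking the result with table() before relying on it.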

For our first time series plot, we take advantage of the xts package to create our time series object. xts objects and highcharter time series plots are a match made in heaven!

by_Date <- (crime) %>% group_by(Date) %>% summarise(Total = n())
tseries <- xts(by_Date$Total, order.by=as.POSIXct(by_Date$Date))

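## Creating a time series of arrests made
## (Arrest may be read as logical or as text, so compare on its upper-cased text form)
Arrests_by_Date <- crime[toupper(as.character(crime$Arrest)) == "TRUE", ] %>% group_by(Date) %>% summarise(Total = n())
arrests_tseries <- xts(Arrests_by_Date$Total, order.by=as.POSIXct(Arrests_by_Date$Date))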
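An xts series pairs each value with a date, so the values and the `order.by` dates should come from the same summary table. Here is a minimal sketch on made-up data (the `toy` data frame is hypothetical), which also shows how merge() lines up two series that do not cover the same dates:

```r
library(dplyr)
library(xts)

# Toy incident log (hypothetical): four reports over three days, two with arrests
toy <- data.frame(
  Date   = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-03")),
  Arrest = c(TRUE, FALSE, FALSE, TRUE)
)

all_by_date    <- toy %>% count(Date, name = "Total")
arrest_by_date <- toy %>% filter(Arrest) %>% count(Date, name = "Total")

# Each series is indexed by the dates of its own summary table
all_ts    <- xts(all_by_date$Total,    order.by = all_by_date$Date)
arrest_ts <- xts(arrest_by_date$Total, order.by = arrest_by_date$Date)

both <- merge(all_ts, arrest_ts)          # outer join on the date index
colnames(both) <- c("Crimes", "Arrests")
both
```

Days without any arrest show up as NA in the Arrests column rather than shifting the later values onto the wrong dates.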

Plotting Temporal data of our Crime analysis

hchart(tseries, name = "Crimes") %>% 
  hc_add_series(arrests_tseries, name = "Arrests") %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
  hc_title(text = "Trend of Chicago Crimes and Arrests") %>%
  hc_legend(enabled = TRUE)

From our time series plot above, we can see that far more crimes were reported than arrests were made. However, total crimes have been steadily going down each year. Playing around with the interactive plot, we can see that crimes usually peak in July each year. Another observation is that there are fewer crimes in February. A change of heart by the criminals, perhaps?

Next, we create our arrests time series plot.

hchart(arrests_tseries, name = "Arrests") %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>% hc_title(text = "Trend of Arrests made in Chicago (2012-2016)") 

Plotting the arrests time series separately shows a decreasing trend each year. Playing with the interactive plot, we can see that the maximum number of arrests falls in a different month each year. For example, in 2012 there were more arrests in July, while the next year the peak was in March.

Overall, we can see a pattern where the count starts low in January and steadily rises until it peaks in a certain month. Afterwards, it drops slowly as the months progress towards the end of the year.
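Daily counts are noisy, so a seasonal pattern is often easier to read from monthly totals. As a sketch, xts can roll a daily series up by calendar month with apply.monthly(); the `daily` series below is made up, with exactly one incident per day.

```r
library(xts)

# Toy daily series (hypothetical): one incident per day for January and February 2016
days  <- seq(as.Date("2016-01-01"), as.Date("2016-02-29"), by = "day")
daily <- xts(rep(1, length(days)), order.by = days)

# Sum the daily counts within each calendar month
monthly <- apply.monthly(daily, sum)
monthly
```

Even with an identical daily rate, the February total is lower simply because the month is shorter, which is worth remembering when comparing monthly totals.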

Next, we want to know the top crimes in Chicago from 2012 to 2016.

tot <- crime %>%
  group_by(Crimes) %>%
  summarise(Total = n()) %>% 
  arrange(desc(Total)) %>%
  ungroup() %>% 
  mutate(PercTot = Total/sum(Total),
         rd = round(PercTot*100, digits=1)) %>% 
  head(10)

e1 <-tot %>% ggplot(aes(reorder(Crimes, -rd), rd, fill = Crimes)) + geom_bar(stat = "identity", show.legend = FALSE)  + theme_bw() + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))+ geom_text(aes(label=round(PercTot*100,digits=1)), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5)+ labs(title="Top Crimes In Chicago From 2012-2016", x="", y="", subtitle="Percentages of each type of crime") + scale_fill_viridis(discrete = TRUE) 

Theft was the most common offense (22.6% of incidents). Battery cases came in at 18.1%, followed by damage (10.8%). We can also see drug cases at 9.3%, followed at a distance by assault cases at 6.3%.
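Note that these shares are computed over all incidents before the top 10 rows are taken, so the plotted percentages need not add up to 100. Here is a small sketch of the same calculation on made-up labels (the `toy` data frame is hypothetical):

```r
library(dplyr)

# Toy crime labels (hypothetical), standing in for crime$Crimes
toy <- data.frame(
  Crimes = c("THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY", "DRUG", "DAMAGE", "DAMAGE")
)

shares <- toy %>%
  count(Crimes, name = "Total") %>%
  mutate(rd = round(Total / sum(Total) * 100, digits = 1)) %>%  # share of all incidents
  arrange(desc(Total))

shares
```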

To continue, we want to know the hot locations where most crimes are reported.

loc <- crime %>%
  group_by(LocDesc) %>%
  summarise(Total = n()) %>% 
  arrange(desc(Total)) %>%
  ungroup() %>% 
  mutate(PercTot = Total/sum(Total),
         rd = round(PercTot*100, digits=1)) %>% 
  head(10)

e2 <-loc %>% ggplot(aes(reorder(LocDesc, -rd), rd, fill = LocDesc)) + geom_bar(stat="identity", show.legend = FALSE)  + theme_bw() + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))+ geom_text(aes(label=round(PercTot*100,digits=1)), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5)+ labs(title="Crime Locations In Chicago From 2012-2016", x="", y="", subtitle="Percentage of crimes at each hot location") + scale_fill_viridis(discrete = TRUE)
e1 + e2

We can see that most crimes were committed on the streets (23%), followed by residences (22%), and then buildings (17%) and highways (12%).

table(crime$Year, crime$LocDesc)
##       
##        AVIATION  BANK BUILDINGS CHURCHES  CLUB CTA "L" PLATFORM
##   2012     1304  2529     55842      862     0                0
##   2013     1511  2073     51298      672     1                1
##   2014     1528  1806     45548      557     0                0
##   2015     1365  1687     44817      543     0                0
##   2016     1041  1787     41694      542     0                0
##       
##        CTA "L" TRAIN ELEVATOR FOREST PRESERVE GOVBUILD HALLWAY HIGHWAY
##   2012             0        0              11     4947       5   44536
##   2013             0        0              16     4602       8   40727
##   2014             0        0               8     4012       4   33569
##   2015             0        0              13     3708       3   30835
##   2016             1        1               6     3296       5   25757
##       
##        HOSPITAL HOTEL OFFICE OTHER PUBTRANS RECREATION RESIDENCE
##   2012    11757  1293      0 11560     8176        493     73385
##   2013    10454  1272      0 11152     7786        457     65539
##   2014     9180  1213      0 10447     5686        371     57172
##   2015     9476  1249      1 10365     4403        473     55670
##   2016     9943  1187      1  9950     4578        479     56395
##       
##        RESTAURANT SCHOOL STORES STREET TAVERN VEHICLE VENDING WATER
##   2012       5179   9287  18892  77776      0    6423      27   113
##   2013       4807   9464  17707  68429      1    6144      43   103
##   2014       4836   7685  16519  63665      2    5398      30    93
##   2015       4963   6401  16611  61544      2    5369      20    92
##   2016       5411   5659  17515  60200      2    5135      39   108
a11<-crime %>%
group_by(Weekday) %>% 
dplyr::summarize(Total = n()) %>%

ggplot(aes(Weekday, Total, fill = Weekday)) + geom_bar( stat = "identity", show.legend = FALSE) + geom_text(aes(label=Total), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.5) + labs(title="What Day Most Crimes Are Committed", x="", y="", subtitle="Crimes occur about equally on all days") + scale_y_continuous(labels = comma) + theme(strip.text = element_text(size=11)) + theme_bw() + scale_fill_viridis(discrete = TRUE)

temp2 <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Weekday), FUN= length)

names(temp2) <- c("crime", "day", "count")
#temp2
a12 <- temp2 %>% filter(crime !="HUMAN TRAFFICKING") %>%
ggplot(aes(x= crime, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By weekdays", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by DAY and CRIMES") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))

Making use of the patchwork package, I just “+” the two plots together, and the result is good. With only a little code to type, I can now easily put multiple plots together in one window.

a11 +  a12
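Besides `+`, patchwork has a few other layout operators. Here is a short sketch on the built-in mtcars data; the plots `p_bar` and `p_count` are stand-ins I made up for illustration, not the plots from this post.

```r
library(ggplot2)
library(patchwork)

# Two small plots on a built-in dataset, standing in for a bar plot and a heat map
p_bar   <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()
p_count <- ggplot(mtcars, aes(factor(cyl), factor(gear))) + geom_count()

side_by_side <- p_bar + p_count                                   # one row, two panels
stacked      <- p_bar / p_count                                   # one column, two panels
wide_right   <- p_bar + p_count + plot_layout(widths = c(1, 2))   # right panel twice as wide
```

Giving the heat map more width than the bar plot can help when its x-axis has many rotated labels.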

For our weekday trend, generally all days of the week are vulnerable to crime, though we can see a slight peak on Fridays. From Monday the count slowly climbs, reaches its peak on Friday and goes down on Saturday and Sunday.

For our heat map, we can see that theft and battery crimes occurred frequently throughout the week, with theft peaking on Fridays and battery cases on Saturday and Sunday. Drug, damage, motor-vehicle theft and fraud show a similar pattern, though damage peaks on Saturday and Sunday while drug cases rise from Tuesday to a peak on Friday. We can also see fewer occurrences of assault, burglary, other and robbery crimes, with their frequency almost evenly distributed throughout the week. The same even spread holds for non-violent (NONVIO), VIO, trespass and sex crimes, which recorded the fewest cases.

a13<-crime %>%
group_by(Month) %>% 
dplyr::summarize(Total = n()) %>%  

ggplot(aes(Month, Total, fill = Month)) + geom_bar( stat = "identity", show.legend = FALSE) + geom_text(aes(label=Total), vjust=-0.25, color="black",position = position_dodge(0.9), size=2.0) + labs(title="What Month Most Crimes Are Committed", y="", x="") + scale_y_continuous(labels = comma) + theme(strip.text = element_text(size=11)) + theme_bw() + scale_fill_viridis(discrete = TRUE)

temp1 <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Month), FUN= length)

names(temp1) <- c("crime", "month", "count")
#temp1

a14 <-temp1 %>% filter(crime!="HUMAN TRAFFICKING", crime!="NON - CRIMINAL") %>%
ggplot(aes(x= crime, y= factor(month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By months", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crimes Frequencies by Month and CRIME") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
a13 + a14

We can see an upward trend starting from May until it peaks in July. Then it gradually goes down until December and spikes again in January. Strangely, fewer crimes are committed in February. Can we say that culprits also have a “change of heart” during the love month of February? Very interesting indeed, though February is also the shortest month, which accounts for part of the dip.

We can see that, generally, theft crimes occur all year round, rising from May to a peak in July. The same can be said of battery cases. Damage and drug crimes follow the same pattern across the months, and most crimes reach their peak count in July.

Heat Maps

fa <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Month), FUN= length)

names(fa) <- c("location", "month", "count")
#temp
fa %>% filter(location!="CLUB", location!="ELEVATOR", location!="HALLWAY", location!="OFFICE", location!="TAVERN", location!="", location!="CTA \"L\" PLATFORM", location!="CTA \"L\" TRAIN") %>%
ggplot(aes(x= location, y= factor(month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by MONTH and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11)) 

Looking at our heat map, we can see that STREET and RESIDENCE crimes occur mostly from May to August, with a peak in July. The same holds for the BUILDINGS and HIGHWAY locations.

fr <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Weekday), FUN= length)

names(fr) <- c("location", "day", "count")
#temp2

a18 <-fr %>% filter(location !="CLUB", location!="ELEVATOR", location!="OFFICE", location!="TAVERN", location!="" , location!="CTA \"L\" PLATFORM", location!="CTA \"L\" TRAIN") %>%

ggplot(aes(x= location, y= factor(day))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By weekdays", expand = c(0,-2)) +
theme_bw() + ggtitle("Crime Frequencies by each DAY and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11)) + scale_fill_viridis()

arrest <- crime %>%
  group_by(Year , Arrest) %>% 
  summarise(countArr = n()) 

a19 <-arrest %>%
  ggplot(aes(Year, countArr, fill = Arrest)) + geom_bar(stat="identity", position = "dodge") + scale_fill_viridis(discrete = TRUE) + theme_bw() + scale_y_continuous(labels = comma) + geom_text(aes(label=countArr), vjust=-0.25, color="red",position = position_dodge(0.9), size=3.0) + theme(legend.position = "bottom") + labs(title="Total Crime Arrests from 2012-2016", x="",y="")
a18 + a19

Looking at our heat map first, we can see that crimes located at STREET, RESIDENCE, BUILDINGS and HIGHWAY have a higher occurrence on Friday, Saturday and Sunday. It looks like the same trend that we saw in our earlier plots.

For the total crime arrests, “TRUE” indicates cases where arrests were made. We can see that “FALSE” outnumbers “TRUE”, which tells us that most crimes resulted in no arrest. The pattern is the same from 2012 to 2016, with the “no arrest” group getting higher each year.

And so we ask: among the “yes” arrests, for which crimes were the arrests made? We will answer this later as we continue our exploration.
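Raw counts can fall simply because total crimes fall, so a within-year share is a useful complement to the bar plot. Here is a minimal sketch on made-up data; the `toy` data frame and the `share` column are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical), standing in for the Year and Arrest columns
toy <- data.frame(
  Year   = c(2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013),
  Arrest = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

arrest_share <- toy %>%
  count(Year, Arrest, name = "countArr") %>%
  group_by(Year) %>%
  mutate(share = countArr / sum(countArr)) %>%   # share within each year
  ungroup()

arrest_share
```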

fp <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Crimes), FUN= length)

names(fp) <- c("location", "crimes", "count")
#temp2

fp %>% filter(location!="", location!="CLUB", location!="ELEVATOR", location!="HALLWAY", location!="OFFICE", location!="OFFICE", location!="TAVERN", location!="WATER", location!="VENDING", location!="FOREST PRESERVE",crimes!="NON - CRIMINAL", crimes!="OBSCENITY", crimes!="HUMAN TRAFFICKING", crimes!="HOMICIDE", crimes!="INTIMIDATION") %>%
ggplot(aes(x= location, y= factor(crimes))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By crimes", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by CRIME and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))

Interestingly, we can see that theft crimes are committed frequently at residences, on streets and in motor vehicles. Drug, damage and battery crimes happened on streets. Damage, burglary, battery and assault also happened in residences. On highways, drug, battery, robbery, theft and assault frequently took place. Lastly, theft and battery crimes happened frequently in buildings, together with some drug, damage and burglary crimes.

ar <- aggregate(crime$LocDesc, by= list(crime$LocDesc,
crime$Day), FUN= length)

names(ar) <- c("location", "day_of_month", "count")
#temp2

# crime$Day holds the day of the month (1-31), so the axis is labelled accordingly
p1 <-ar %>% filter(!location %in% c("", "CLUB", "ELEVATOR", "HALLWAY", "OFFICE", "TAVERN", "WATER", "VENDING", "FOREST PRESERVE", "CTA \"L\" PLATFORM", "CTA \"L\" TRAIN")) %>%
ggplot(aes(x= location, y= factor(day_of_month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By day of month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by Day of Month and LOCATION") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
ai <- aggregate(crime$Crimes, by= list(crime$Crimes,
crime$Day), FUN= length)

names(ai) <- c("crimes", "day_of_month", "count")
#temp2

# crime$Day holds the day of the month (1-31), so the axis is labelled accordingly
p2 <- ai %>% filter(crimes!="NON - CRIMINAL", crimes!="HUMAN TRAFFICKING") %>%
ggplot(aes(x= crimes, y= factor(day_of_month))) +
geom_tile(aes(fill = count)) + scale_x_discrete("", expand = c(0,0)) +
scale_y_discrete("By day of month", expand = c(0,-2)) +
scale_fill_viridis() +
theme_bw() + ggtitle("Crime Frequencies by Day of Month and CRIMES") +
theme(panel.grid.major = element_line(colour = NA), panel.grid.minor = element_line
(colour = NA)) + theme(axis.text.x = element_text(angle=60, hjust=1),strip.text = element_text(size=11))
p1 + p2

Both heat maps are built on crime$Day, which holds the day of the month (1 to 31) rather than the hour of the day, so they should be read by day of the month. In the location heat map, STREET crimes are high on every day of the month, and so are RESIDENCE crimes, with the 1st as the peak for both. BUILDINGS and HIGHWAY follow the same pattern, again with the 1st as the peak for buildings.

In the crime-type heat map, both THEFT and BATTERY occur throughout the month with the 1st as their peak, and theft shows a second peak on the 15th. The same holds for DAMAGE, DRUG, ASSAULT and OTHER crimes.
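An hour-of-day heat map needs an hour variable, and that has to be taken from the raw timestamp text before crime$Date is converted to a Date, since the conversion drops the time. Here is a minimal sketch on made-up timestamps; `Hour` is a hypothetical column name, and the example assumes an English locale for the AM/PM marker.

```r
library(lubridate)

# Hypothetical raw timestamps in the Date column's original text layout
raw <- c("07/04/2016 11:30:00 PM", "07/05/2016 01:05:00 AM", "07/06/2016 01:45:00 AM")

ts   <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
Hour <- factor(hour(ts), levels = 0:23)   # keep all 24 levels, even the empty ones

table(Hour)[c("1", "23")]
```

A column built this way could then replace crime$Day in the aggregate() calls to get counts per hour.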

art <- crime %>%
  group_by(Year, Crimes, Arrest) %>%
  summarise(tot = n()) %>% 
  filter(tot >=1000) %>%
  ungroup()  

 art %>% ggplot(aes(Year, tot, fill = Crimes)) + geom_bar(stat="identity") + facet_wrap(~Arrest) + scale_fill_viridis(discrete = TRUE) + theme_bw() + theme(legend.position = "bottom") + scale_y_continuous(labels = comma) + labs(title="Crime Arrests from 2012-2016", x="", y="")

“FALSE” indicates cases where no arrest was made, which is the outcome for most crimes. The majority of these “FALSE” cases were THEFT cases. Unfortunately, many perpetrators involved in BATTERY and DAMAGE cases were not arrested either. Looking at the height of the bars, the “TRUE” bars differ from the “FALSE” bars by almost half.

Surprisingly, for “TRUE”, we can see that DRUG cases led to more arrests than any other crime type, followed by BATTERY and THEFT cases.

Calling the table function, we can see the total counts per crime type behind our plot above.

table(crime$Crimes)
## 
##             ARSON           ASSAULT           BATTERY          BURGLARY 
##              2175             89508            258941             81668 
##            DAMAGE              DRUG             FRAUD          HOMICIDE 
##            152812            131207             67609              2560 
## HUMAN TRAFFICKING      INTIMIDATION               MVT    NON - CRIMINAL 
##                20               643             59856                38 
##            NONVIO         OBSCENITY             OTHER           ROBBERY 
##             24293               169             85361             56092 
##               SEX             THEFT          TRESPASS               VIO 
##             18356            321950             36429             28648

Next, I will make a time series plot for our top 4 crimes and check for any patterns throughout each year.

thefts <- crime[crime$Crimes=="THEFT",] 
## Creating timeseries
thefts_by_Date <- (thefts) %>% group_by(Date) %>% summarise(Total = n())
thefts_tseries <- xts(thefts_by_Date$Total, order.by=as.POSIXct(thefts_by_Date$Date))

battery <- crime[crime$Crimes=="BATTERY",] 
## Creating timeseries
battery_by_Date <- (battery) %>% group_by(Date) %>% summarise(Total = n())
battery_tseries <- xts(battery_by_Date$Total, order.by=as.POSIXct(battery_by_Date$Date))

criminals <- crime[crime$Crimes=="DAMAGE",]
## Creating timeseries
criminals_by_Date <- (criminals) %>% group_by(Date) %>% summarise(Total = n())
criminals_tseries <- xts(criminals_by_Date$Total, order.by=as.POSIXct(criminals_by_Date$Date))

narcotics <- crime[crime$Crimes=="DRUG",] 
## Creating timeseries
narcotics_by_Date <- (narcotics) %>% group_by(Date) %>% summarise(Total = n())
narcotics_tseries <- xts(narcotics_by_Date$Total, order.by=as.POSIXct(narcotics_by_Date$Date))


hchart(thefts_tseries, name = "Thefts") %>% 
  hc_add_series(battery_tseries, name = "Battery") %>% 
  hc_add_series(criminals_tseries, name = "Damage") %>%
  hc_add_series(narcotics_tseries, name = "Drugs") %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
  hc_title(text = "Crimes in Thefts/Battery/Damage/Drugs") %>%
  hc_legend(enabled = TRUE)

Finally, I want to make a time series plot for our top 4 crime locations and check again for some patterns.

streets <- crime[crime$LocDesc=="STREET",]
## Creating timeseries
streets_by_Date <- (streets) %>% group_by(Date) %>% summarise(Total = n())
streets_tseries <- xts(streets_by_Date$Total, order.by=as.POSIXct(streets_by_Date$Date))

residence <- crime[crime$LocDesc=="RESIDENCE",]
## Creating timeseries
residence_by_Date <- na.omit(residence) %>% group_by(Date) %>% summarise(Total = n())
residence_tseries <- xts(residence_by_Date$Total, order.by=as.POSIXct(residence_by_Date$Date))

apartment <- crime[crime$LocDesc=="BUILDINGS",]
## Creating timeseries
apartment_by_Date <- (apartment) %>% group_by(Date) %>% summarise(Total = n())
apartment_tseries <- xts(apartment_by_Date$Total, order.by=as.POSIXct(apartment_by_Date$Date))

highway <- crime[crime$LocDesc=="HIGHWAY",] 
## Creating timeseries
highway_by_Date <- (highway) %>% group_by(Date) %>% summarise(Total = n())
highway_tseries <- xts(highway_by_Date$Total, order.by=as.POSIXct(highway_by_Date$Date))

hchart(streets_tseries, name = "Streets") %>% 
  hc_add_series(residence_tseries, name = "Residence") %>% 
  hc_add_series(apartment_tseries, name = "Buildings") %>%
  hc_add_series(highway_tseries, name = "Highways") %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_credits(enabled = TRUE, text = "Sources: City of Chicago Administration and the Chicago Police Department", style = list(fontSize = "12px")) %>%
  hc_title(text = "Crimes in Streets/Residence/Buildings/Highways") %>%
  hc_legend(enabled = TRUE)
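The filter, count and xts steps are repeated for every crime type and location above. As a sketch, they could be wrapped in one small function; `daily_series` is a hypothetical helper and the `toy` data is made up, reusing only the column names from this post.

```r
library(dplyr)
library(xts)

# Hypothetical helper: daily incident counts for the rows where `column` equals `value`
daily_series <- function(data, column, value) {
  by_date <- data[data[[column]] == value, ] %>%
    group_by(Date) %>%
    summarise(Total = n())
  xts(by_date$Total, order.by = as.POSIXct(by_date$Date))
}

# Toy data (hypothetical) with the same column names as the post
toy <- data.frame(
  Date   = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02")),
  Crimes = c("THEFT", "THEFT", "THEFT", "DRUG")
)

thefts_toy <- daily_series(toy, "Crimes", "THEFT")
thefts_toy
```

With the real data, a call such as daily_series(crime, "LocDesc", "STREET") would stand in for the three lines used per series above.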

To summarise our findings:

  1. Total crimes have been steadily going down each year. Crimes usually peak in July each year, and there are fewer crimes in February.

  2. Overall, we find a seasonality in crime within each year, where the count starts low in January and steadily rises until it peaks in a certain month. Afterwards, it drops slowly as the months progress towards the end of the year.

  3. Theft was the most common offense (22.6% of incidents). Battery cases came in at 18.1%, followed by damage (10.8%). Drug cases account for 9.3%, followed at a distance by assault cases at 6.3%.

  4. Most crimes were committed on the streets (23%), followed by residences (22%), and then buildings (17%) and highways (12%).

  5. Theft and battery crimes occurred frequently throughout the week, with theft peaking on Fridays and battery cases on Saturday and Sunday. Drug, damage, motor-vehicle theft and fraud show a similar pattern, though damage peaks on Saturday and Sunday while drug cases rise from Tuesday to a peak on Friday. Assault, burglary, other and robbery crimes occur less often and are almost evenly distributed throughout the week. The same even spread holds for non-violent (NONVIO), VIO, trespass and sex crimes, which recorded the fewest cases.

  6. For the total crime arrests, “TRUE” indicates cases where arrests were made. “FALSE” outnumbers “TRUE”, which tells us that most crimes resulted in no arrest. The pattern is the same from 2012 to 2016, with the “no arrest” group getting higher each year.

  7. Theft crimes are committed frequently at residences, on streets and in motor vehicles. Drug, damage and battery crimes happened on streets. Damage, burglary, battery and assault also happened in residences. On highways, drug, battery, robbery, theft and assault frequently took place. Lastly, theft and battery crimes happened frequently in buildings, together with some drug, damage and burglary crimes.

  8. “FALSE” indicates cases where no arrest was made, which is the outcome for most crimes. The majority of these “FALSE” cases were THEFT cases. Unfortunately, many perpetrators involved in BATTERY and DAMAGE cases were not arrested either. The “TRUE” cases, which resulted in arrests, differ from the “FALSE” cases by almost half.

  9. DRUG cases led to more arrests than any other crime type, followed by BATTERY and THEFT cases.

  10. The last two heat maps are built on crime$Day, the day of the month (1 to 31), rather than the hour of the day. Read that way, STREET crimes are high on every day of the month, and so are RESIDENCE crimes, with the 1st as the peak for both. BUILDINGS and HIGHWAY follow the same pattern, again with the 1st as the peak for buildings. Both THEFT and BATTERY also occur throughout the month with the 1st as their peak, and theft shows a second peak on the 15th. The same holds for DAMAGE, DRUG, ASSAULT and OTHER crimes.

The dataset can be found here: https://www.kaggle.com/currie32/crimes-in-chicago/data